survey

Open access

A Survey on Malware Detection with Graph Representation Learning

Authors:

Tristan Bilot,

Nour El Madhoun,

Khaldoun Al Agha,

Anis ZouaouiAuthors Info & Claims

ACM Computing Surveys, Volume 56, Issue 11

Article No.: 278, Pages 1 - 36

https://doi.org/10.1145/3664649

Published: 29 June 2024 Publication History

PDF eReader

Abstract

Malware detection has become a major concern due to the increasing number and complexity of malware. Traditional detection methods based on signatures and heuristics are used for malware detection, but unfortunately, they suffer from poor generalization to unknown attacks and can be easily circumvented using obfuscation techniques. In recent years, Machine Learning (ML) and notably Deep Learning (DL) achieved impressive results in malware detection by learning useful representations from data and have become a solution preferred over traditional methods. Recently, the application of Graph Representation Learning (GRL) techniques on graph-structured data has demonstrated impressive capabilities in malware detection. This success benefits notably from the robust structure of graphs, which are challenging for attackers to alter, and their intrinsic explainability capabilities. In this survey, we provide an in-depth literature review to summarize and unify existing works under the common approaches and architectures. We notably demonstrate that Graph Neural Networks (GNNs) reach competitive results in learning robust embeddings from malware represented as expressive graph structures such as Function Call Graphs (FCGs) and Control Flow Graphs (CFGs). This study also discusses the robustness of GRL-based methods to adversarial attacks, contrasts their effectiveness with other ML/DL approaches, and outlines future research for practical deployment.

1 Introduction

Malware, short for malicious software, is a generic term for unwanted programs designed to harm or exploit computer systems [1]. The detection of widespread malware such as ransomware, worms, Trojan horses or spyware, has become a major concern since their increase in both number and complexity [2]. Indeed, malware programs can appear in different forms and may be hidden under other trusted programs available on the most used platforms such as Android or Windows. Unaware users are frequently fooled by authors of malware and important efforts have been spent to prevent these threats. Traditional detection techniques mainly rely on signatures and heuristics, where malware is detected by comparing it to existing malware or known malicious patterns. However, those methods are known to suffer from poor generalization to unknown attacks or variants and can be easily circumvented using obfuscation techniques [3]. Other behavior-based methods tend to perform better by further analyzing the malware and evaluating its intended actions before executing it. However, such techniques appear to be very time-consuming [4]. Over the last decade, Machine Learning (ML) and notably Deep Learning (DL) have sparked a sea of change in a variety of fields, including cybersecurity, by allowing the model to learn from data and adapt to new patterns. This ability to adapt makes these methods well-suited to a number of tasks, including malware detection, as shown by the growing number of papers that apply ML to this problem [5].

Despite the progress made with these learning-based methods, malware detection remains a challenging task, as malware authors continue to make their techniques evolve, with the aim to evade detection. In an attempt to outperform current ML and DL methods that learn from traditional Euclidean data, Graph Representation Learning (GRL) has emerged as a promising alternative to capture complex patterns in malware programs represented as graphs. Indeed, a growing number of fields are benefiting from these graph-based learning methods and obtaining state-of-the-art results [6], as graph structures offer even more semantic information by encoding spatial relations and connectivity between entities.

Current studies on malware detection using machine learning are mainly based on the review of traditional ML and DL techniques applied to structured data. However, more and more recent papers tend to use GRL in their approaches, and to the best of our knowledge, there is no literature review that specifically focuses on these techniques applied to malware detection.

This survey is a first attempt to shape the research area of malware detection with GRL, by providing a comprehensive review of current approaches. Specifically, we present in this paper the following contributions:

—

An overview of common representations that are used to model malware as graphs as well as techniques to extract these graph structures from raw malware data.

—

A comprehensive summary of the state-of-the-art papers, grouped according to the most common types of graphs, namely: Control Flow Graph (CFG), Function Call Graph (FCG), Program Dependence Graph (PDG), system call graph and system entity graph. We also propose a general architecture under which a majority of works can be abstractly summarized.

—

A review of the adversarial attacks that are used against GNN-based malware detection techniques, along with a discussion on the challenges that may be encountered as well as future research directions and conclusions. In particular, we show that the works presented in this paper are very recent and that many promising directions remain unexplored.

The paper is organized as follows. In Section 2, we introduce related works and further explain the contributions of our paper. In Section 3, we provide background knowledge on graphs and present the fundamentals of GRL and Graph Neural Networks (GNNs). Section 4 discusses the techniques used to extract graph-structured data from malware as well as the general architecture used for their detection with representation learning techniques. Sections 5 and 6 review the state-of-the-art papers for the detection of Android and Windows malware, respectively. Section 7 discusses the robustness of GNN-based detection systems against adversarial attacks. In Section 8, we offer a detailed discussion on applying GRL-based techniques to real-world malware detection, including a comparison of their strengths and weaknesses against traditional ML and DL methods. The last Section 9 concludes this paper.

2 Related Works

In existing literature, several studies have been published that aim to review malware detection using standard ML and DL techniques. The authors of the paper [3] have conducted a comprehensive review on malware detection. They first present the problem of malware detection, as well as the various challenges that can be encountered and the techniques used to overcome them. They also review a significant number of papers based on traditional methods such as signatures, behaviors and heuristics, but also cover some ML-based methods. The paper [7] proposes to review the deep learning models employed in Android malware detection, focusing on the analysis of the strengths and weaknesses of these models. The literature is comprehensively summarized by providing useful information about each research work, including the analysis method, features, models used and their performance, and input datasets. The proposed survey in the article [8] covers a wide variety of deep neural models used for Android malware detection and mentions few graph-based methods using control flow graphs [9] and App-API graphs [10, 11]. Authors in [5] surveyed the traditional ML techniques employed in Android malware detection and explain the commonly employed ML tasks such as data acquisition, data preprocessing, and feature selection. In the paper [12], a large category of deep learning methods using static, dynamic and hybrid analysis is reviewed. Important information is provided regarding the input features that can be extracted from APKs, as well as the most commonly used datasets for both benignware and malicious Android software. The survey [13] analyzes traditional ML methods in a general approach for malware detection based on executable files. Representation learning methods applied to cybersecurity are reviewed in the study [14], with few mentions to malware detection. More recently, the paper [15] also reviewed DL methods applied to the detection of mobile malware, Windows malware, IoT malware, Advanced Persistent Threats (APTs) and Ransomware.

Regarding GRL and GNN-based methods, the work [16] surveys GNN techniques employed for malware analysis with a focus on the prediction explainability. Other surveys review the applications of GNNs [6, 17, 18, 19] but none of them mention malware detection. Indeed, after extensive research and to the best of our knowledge, the literature on malware detection using ML and DL techniques is widely covered and documented but it is still missing a review dedicated to GRL methods. Our paper focuses on analyzing recent research studies based on such methods for malware detection, starting from the extraction of graph-structured data using reverse engineering tools, to the classification of malware based on graph embeddings. Our goal is to provide the necessary knowledge to researchers interested in the application of ML to graph-structured malware, and to contribute to the advancement of this field.

3 Background

In this section, we introduce the fundamentals about graphs along with the GRL techniques leveraged to learn from these structures. We first discuss the properties of graphs and then explain differences between traditional Deep Learning and GRL, along with the types of GNNs that are frequently employed.

3.1 Graph Structures

Graphs are useful data structures to model the interactions between the entities of a complex system. They possess a great expressiveness and can represent any connected systems using only two abstract objects, which are nodes and edges.

Graph. A graph can be denoted as \(G=(V, E)\) where \(V={v_1, \ldots , v_N}\) is a set of \(N=|V|\) nodes (i.e., entities) and \(E={e_1, \ldots , e_M}\) is a set of \(M=|E|\) edges, namely the relations between entities. Edges in the graph can either be directed (e.g., a process a forks another process b), or undirected (e.g., a bi-directional communication flow between two clients). By default, such graphs only represent a topology by incorporating the relations between different objects and do not store any local information.

Attributed Graph. Attributed graphs attach additional features to the elements of the graph, leading to a more detailed representation. A node-attributed graph assumes function \(F_n: V \longrightarrow \mathbb {R}^{d_n}\) to map each node to a feature vector of \(d_n\) elements. Similarly, an edge-attributed graph assumes function \(F_e: E \longrightarrow \mathbb {R}^{d_e}\) to map every edge to a vector of \(d_e\) features. Node and edge features can be conveniently described in a matrix format, where X usually represents the node feature matrix and \(X_e\) is the edge feature matrix. Furthermore, the structure of the graph is mostly designated by an adjacency matrix A.

Heterogeneous Graph. In many cases, the relations between graph objects become more complex, involving multiple types of modalities. These representations can be modeled with heterogeneous graphs, by introducing two mapping functions \(\phi _v:V \rightarrow T_v\) and \(\phi _e:E \rightarrow T_e\) that respectively map to a node type in \(T_v\) and an edge type in \(T_e\) .

Although other graph structures exist, current state-of-the-art graph-based malware detection methods are mostly based on these representations.

3.2 Learning on Graphs

Since nodes in a graph are inherently connected, they are not considered independent and uniformly distributed. For these reasons, traditional ML models cannot be directly applied on graphs, which suggests that specific techniques are required to deal with these interconnected structures.

3.2.1 Representation Learning.

A malware detection model must first go through a training procedure where it learns parameters based on a large number of training samples, in order to approximate a relationship function between the input feature space and the output binary label. Representation learning aims at learning an intermediate function f formulated as \(f: X \rightarrow \mathbb {R}^d\) , which maps the input feature space X to an embedding space \(\mathbb {R}^d\) that retains essential information from raw input features. Embedding representations can then be leveraged in downstream tasks such as learning word relationships [20], learning the representation of objects in images [21] or learning translation of language [22]. In malware detection, representation learning aims at creating embeddings from input data such as program code. The embeddings are then converted into a distribution that either indicates a probability to be a malware (binary classification) or to belong to a determined malware category or malware family (multi-class classification).

3.2.2 Graph Representation Learning.

Standard representation learning techniques are not suited to deal with data generated from non-Euclidean domain space such as graphs. For instance, regular Convolutional Neural Networks (CNNs) [23] and Recurrent Neural Networks (RNNs) [24] are unable to perform traditional convolutions or recurrent operations on graphs as the notion of Euclidean distance cannot be applied. Graph Representation Learning (GRL) [25], on the other hand, is a specific area of ML that aims to learn embedding representations from graph-structured data. This involves learning embeddings from nodes, edges, or graphs in a way that ensures that objects with similarities in feature space have similar representations in embedding space. The proximity between learned representations can then be leveraged in different downstream tasks. In the field of cybersecurity, tasks such as node classification, edge classification and graph classification are frequently used. Node classification aims at finding a label for a specific object in the graph such as detecting a malicious file in a system entities graph [26], whereas edge classification is applied to assign a label to a relation or event, such as detecting a malicious authentication request [27, 28]. On the other hand, graph classification maps the whole graph to a label. This task is largely used in malware detection in cases where the goal is to predict the label of a binary represented as a graph [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]. It is also possible to work at the sub-graph level to detect areas in the graph that are responsible for the prediction done by a predictive model [58].

In literature, the first methods for GRL based on graph embedding are mostly relying on random walks, where the co-occurrence of nodes is preserved. DeepWalk [59] was the first method to leverage the Skip-gram model [20] to compute embeddings from nodes that co-occur in random walks. It learns node embeddings by optimizing a neighborhood preserving objective, using random walks and word embedding techniques. First, n random walks are generated by randomly traversing the graph n times. Each walk is composed of k nodes, where k is a hyperparameter representing the length of a random walk. Then each node tries to reconstruct neighboring nodes from its random walk using the Skip-gram model.

To fully learn the embeddings, node2vec [60] integrates a second-order biased random walk that captures local and global structures using Breadth First Search (BFS) and Depth First Search (DFS) algorithms. Other methods such as LINE [61] have also achieved great performance in learning embeddings from graphs. However, most of these techniques do not share parameters between nodes [25], meaning that the model size grows linearly with the size of the graph. Moreover, these methods are highly dependent on the values of hyperparameters and tend to favor proximity information over structural information [62]. Another disadvantage of using these walk-based techniques is that they are generally transductive, meaning that a single graph is taken as input and that no inference is possible on unseen nodes or edges. Contrarily, inductive models take as input multiple graphs and can generalize to unseen examples.

3.2.3 Graph Neural Networks.

Recent GRL approaches tend to be inspired from the Graph Neural Network (GNN) model [63][64], which is the origin of the first application of deep neural networks to graph-structured data. Although DL on graphs has been democratized fairly recently, the first GNN [63] dates back to 2005 and is originally inspired from RNNs. In recent years, the popularity of DL has led to the emergence of new methods involving spectral and spatial convolution methods applied to graph structures, making it possible to take advantage of both the expressive structure of graphs and the power of representation learning. Spectral GNNs such as ChebNet [65] exploit the Laplacian matrix eigen decomposition in Fourier space to analyze the underlying structure of the graph. On the other hand, spatial GNNs such as Graph Convolutional Network (GCN) [66], GraphSAGE [67], Deep Graph Convolutional Neural Network (DGCNN) [68], and Graph Attention Network (GAT) [69] work directly on the adjacency matrix and capture the local neighborhood of the nodes in the graph domain, which avoids the time-consuming switch in spectral domain.

Graph Convolutional Network (GCN). GCN captures both feature and local substructure information by propagating the information along the neighboring nodes within the graph. The propagation rule of GCN to compute node embeddings is described as:

\begin{equation} {\bf H}^{(l+1)} = \sigma \left(\tilde{D}^{-\frac{1}{2}} \tilde{A}\tilde{D}^{\frac{-1}{2}} {\bf H}^{(l)} {\bf W}^{(l)} \right), \end{equation}

(1)

where \({\bf H}^{(l)}\) denotes the node embeddings at layer l and \({\bf H}^{(0)} = X\) , with X the initial node features matrix. \(\tilde{A}\) is the adjacency matrix representation of the graph with self-loops such that \(\tilde{A}=A+I\) , with I the identity matrix of same shape as the adjacency matrix A. \(\tilde{D}\) is the degree matrix of \(\tilde{A}\) , whereas \({\bf W}\) represents a trainable weight matrix and \(\sigma\) is the sigmoid non-linear activation function. The GCN model has inspired the development of numerous other GNN architectures but remains one of the most widely used models in GRL-based malware detection for its capabilities in learning insightful embeddings in transductive tasks [32, 34, 40, 42, 52, 53, 55, 70, 70, 71, 72].

Deep Graph Convolutional Neural Network (DGCNN). This model also leverages convolutions on graphs but is specifically designed for graph classification tasks. It computes node embeddings using message passing, following a propagation rule similar to that of GCN, described as follows:

\begin{equation} {\bf H}^{(l+1)} = f \left(\tilde{D}^{-1} \tilde{A} {\bf H}^{(l)} {\bf W}^{(l)} \right), \end{equation}

(2)

where f denotes an arbitrary non-linear activation function. Here, the generated node embeddings are sorted using a dedicated SortPooling layer, before feeding them into traditional 1D convolutional and dense layers. The DGCNN model has already been successfully utilized in malware detection due to its graph classification capabilities, where the overall knowledge of the graph is employed to predict the presence of malware [46, 51, 73].

GraphSAGE. GraphSAGE provides an inductive solution that can scale to large graphs by sampling the neighbors before message-passing. In a first time, each node’s neighborhood is uniformly sampled to keep a maximum of s neighbors. The features from these sampled neighbors are then aggregated such that:

\begin{equation} h^{(l+1)}_{\mathcal {N}_s(u)} = \text{AGGREGATE}_{(l+1)} \left(\left\lbrace h_v^{(l)}, \forall v \in \mathcal {N}_s\left(u\right) \right\rbrace \right), \end{equation}

(3)

where \(\mathcal {N}_s(u)\) is the sampled neighorhood of node u with size s, and \(h_v^{(l)}\) is the embedding of node v at layer l. \(\text{AGGREGATE}_{(l+1)}\) denotes a differentiable aggregation function such as the mean, max or sum operation. The aggregated information \(h^{(l+1)}_{\mathcal {N}_s(u)}\) of a node u is then leveraged in the following propagation rule, to compute node embeddings:

\begin{equation} h_u^{(l+1)} = \sigma \left({\bf W}^{(l+1)} \left[h_u^{(l)}, h^{(l+1)}_{\mathcal {N}_s(u)}\right] \right), \end{equation}

(4)

where \({\bf W}^{(l+1)}\) denotes a trainable weight matrix and \(\left[,\right]\) represents the concatenation operation. The powerful scalability capabilities of GraphSAGE make this model especially useful for large graphs. The graphs encountered in malware detection are generally of small or medium size. Consequently, the sampling strategy of GraphSAGE is not frequently employed in the papers reviewed. However, the propagation rule of GraphSAGE has been found to be successful in the malware detection literature, even without the use of neighbor sampling [36, 40, 43].

Graph Attention Network (GAT). The GAT model leverages the attention mechanism [74] to learn an importance weight for each neighboring node. The GAT layer aims to compute an attention score \(e_{uv}\) for each connected pair of nodes u and v such that:

\begin{align*} e_{uv}=\operatorname{LeakyReLU}\left(\mathbf {a}^T\left[\mathbf {W} h_u, \mathbf {W} h_v\right]\right), \end{align*}

where \([,]\) is the concatenation operation, \(\mathbf {a}\) and \({\bf W}\) denote a trainable attention vector and a weight matrix, respectively. The features of nodes u and v are respectively represented here by \(h_u\) and \(h_v\) . A softmax activation is then applied along the attention scores of all neighbors to obtain normalized probability scores:

\begin{align*} \alpha _{uv}=\operatorname{softmax}\left(e_{uv}\right)=\frac{\exp \left(e_{uv}\right)}{\sum _{k \in \mathcal {N}_u} \exp \left(e_{uk}\right)}, \end{align*}

where \(\mathcal {N}_u\) is the neighborhood of node u. The overall propagation rule of GAT is derived from the attention coefficients computed to learn the importance of nodes relative to their neighbors. For reasons of training stability, multi-head attention may be utilized to learn K distinct trainable matrices concurrently. Each head learns a unique representation of the node, and all representations are then concatenated into a single node embedding:

\begin{align*} h_u^{\prime }=\Vert _{k=1}^K \sigma \left(\sum _{v \in \mathcal {N}(u)} \alpha _{uv}^k {\bf W}^k h_v\right), \end{align*}

where \(||\) represents the concatenation operation and K is the number of attention heads. The attention mechanism used in GAT allows the model to learn complex relationships within the graph. This capability makes the model valuable in malware detection, where identifying complex malicious patterns is essential [38, 48, 54, 57, 75].

Variants of these models have achieved state-of-the-art results in various domains, such as recommender systems [76], traffic forecasting [77], and drug discovery [78]. However, GNN-based methods are still underutilized in cybersecurity compared to other fields where research is predominantly focused in this direction.

4 Malware Detection with Graphs

In this section, we describe the problem of malware detection with graphs, along with ways to represent malware as expressive graph structures. We also propose a general methodology to leverage GRL in downstream malware detection tasks.

4.1 Problem Definition

Malware detection refers to the process of identifying and neutralizing malware, designed to infiltrate, damage, or disrupt computer systems without the consent of users. In the field of ML, malware detection first consists in extracting features from an input binary file, which are then leveraged by a downstream learning algorithm for classification. Malware detection with GRL follows the same idea, with the only difference that a supplementary step is introduced after the feature extraction. This step consists in preprocessing the input features by transforming them into a graph structure that will then be fed to the model.

To provide an intuitive understanding of malware detection methodologies, particularly those leveraging GRL, we introduce a detailed example complemented by a visual illustration in Figure 1.

Fig. 1.

Imagine a scenario where a cybersecurity analyst is tasked with investigating a suspicious file. Initially, the extraction of meaningful information from the file is required. Common approaches typically leverage reverse engineering to perform static and dynamic analysis as a way to extract features from the file. This file, when executed in a dynamic analysis scenario, exhibits a sequence of operations such as file accesses, network communications, and registry modifications. Each of these operations can be represented as a node within a graph, while the interactions between them are the edges. Alternatively, parsing the decompiled code during static analysis can be a way to build a graph where nodes are functions and edges represent function calls.

The analyst uses GRL techniques to process this graph. A learning method such as a GNN or a Skip-gram model with random walks processes the node and edge embeddings, learning the patterns that are indicative of either benign or malicious behavior. The model takes into account the rich structure of the graph and learns from the topology and the attributes of nodes and edges. For example, a common benign behavior might be a file creating a temporary backup before an update. This creates a predictable and harmless pattern within the graph. In contrast, a malware file might attempt to replicate itself or communicate with an external server for command and control purposes, leading to a different pattern, potentially spread across various parts of the graph.

In the remainder of this section, we provide a comprehensive description of the steps comprising GRL-based malware detection.

4.2 Modeling Malware as Graphs

In real-world scenarios, malware programs are usually compiled binaries that may be obfuscated to hide their malicious payload. Furthermore, a same malware could be written in multiple languages or using different hardware platforms. Therefore, we think that an optimal representation of a binary program should fulfill these conditions:

—

Preserve the semantic of the program: the actions resulting from the execution of the app should be captured by the data representation, in order to understand benign and malicious behaviors.

—

Be robust to obfuscation techniques: the representation should capture the fundamental semantic of the program even if its code is obfuscated.

—

Be language- and platform-agnostic: the representation should be abstract enough to transcend the programming language and the platform on which the program is written.

Building a data representation that respects all previous conditions remains a challenging task due to the constant evolution of techniques employed by attackers. In the next section, we describe the analysis methods and graph structures commonly used for the representation of malware.

4.2.1 Analysis Methods for Feature Extraction.

Multiple analysis techniques are commonly employed to extract meaningful features from computer programs in an attempt to obtain an efficient representation [2]. Static analysis aims to analyze software without executing it, whereas dynamic analysis actually executes it to capture different levels of information. Both approaches have their own strengths and weaknesses. Static analysis is relatively cheap to perform and provides a comprehensive view of the program by considering all branches present in the code. However, it may not detect issues that only occur at runtime, such as memory leaks and race conditions. On the other hand, dynamic analysis can further analyze the behavior of the program by running it with different inputs and capturing the generated events at runtime. This technique is also more robust to code obfuscation, compared to static analysis. However, dynamic analysis is very resource-intensive to execute and may not be able to analyze the entire program, providing a less comprehensive view that could ignore malicious behaviors [79]. In an attempt to benefit from both techniques, hybrid analysis is a solution that tries to combine the advantages of both static and dynamic analysis, while minimizing their weaknesses.

4.2.2 Common Graph Structures for Malware Detection.

The various features extracted using the aforementioned analysis methods are frequently used to represent program semantics in the form of graphs. This practice has gained more and more interest due to the faculty of graphs to represent systems in a robust and intuitive way [79, 80]. The various types of graphs used for malware detection are presented below, with the main tools used to create these graphs listed in Table 1.

Table 1.

Representation	Tools
CFG	Androguard [85], radare2 [86], IDA Pro [87], Ghidra [88]
FCG	Androguard, Apktool [89], graph4apk [90], WALA [91], Angr [92], radare2, IDA Pro, cuckoo [83], Ghidra
PDG	Androguard, Ghidra
Syscalls	strace [93], SystemTap [94], ltrace [94]
System entities	cuckoo, Any.Run [95]

Table 1. Common Tools Employed for Data Extraction and Graph Construction with Static or Dynamic Analysis

Green refers to tools specifically designed for Android APKs.

Control Flow Graph (CFG). Control flow graphs model all possible paths during the execution of a program, in an intra-procedural way. The nodes represent basic blocks, namely a sequence of instructions (e.g., assembly instructions) without any jumps. The jumps are characterized by the directed edges between the basic blocks, that represent the control flow of the program. When built from low-level assembly languages, CFGs have the faculty to be language-agnostic as they model the logic of a program without requiring language-specific instructions [46]. Although other representations such as hexadecimal also possess this characteristic, CFGs provide a more intuitive way to model programs as graphs [46, 58].

Function Call Graph (FCG). Function call graphs are a type of CFG that provide an inter-procedural view of the program, where nodes are functions and edges represent function calls from one function to another. Although FCGs offer a global view of function calls executed by the program, they generally lack the intra-procedural information that CFGs provide. To address this, some approaches can be employed by jointly using FCGs and CFGs, where embeddings from CFGs are integrated into the nodes of the FCGs, to capture both intra-procedural and inter-procedural semantic [29, 30, 48]. In the case of Android malware analysis, a prevalent approach is to statically extract the API call sequences from the application and represent them using a FCG [51, 53, 54, 81].

Program Dependence Graph (PDG). Program dependence graphs model both data and control dependencies in code, where nodes are instructions or statements and edges represent the data values and control conditions that must be fulfilled to execute the node’s operation [82]. Scarcely used by current GRL approaches, this graph structure may be promising to capture different conditional flows in the program [44].

System Call Graph. The system calls generated by the execution of a program can be captured via dynamic analysis and the communication with the system can be modeled with a graph, where nodes represent system calls and edges are the interactions [45] between those calls. This representation offers a low-level view of system interactions that can also benefit the detection of malware.

System Entity Graph. During its execution, a program interacts with system entities such as processes, files, registry keys or network sockets. These interactions can be captured using sandbox tools like cuckoo [83] for a deep analysis of the program’s behaviors. Similarly as provenance graphs employed in host-based intrusion, these entities can be modeled as nodes in a graph, and the edges represent the operations between them [26, 55].

In this survey, we focus on the analysis of malware detection methods with GRL for Android and Windows platforms, since the vast majority of current state-of-the-art works are solely based on these platforms. For each of these platforms, we divided the state-of-the-art into sections based on the input graph structure. The literature review is summarized in Figure 2.

Fig. 2.

4.3 Methodology of Malware Detection with Graph Representation Learning

In literature, a majority of contributions rely on a similar sequence of operations to predict malware from source code with GRL. In this section, we propose a general architecture, shown in Figure 3, to summarize the process of malware detection from graph-represented source code on Android and Windows platform.

Fig. 3.

The first step involves extracting code from the binary, which is usually disassembled to assembly language or decompiled to higher-level language. In the case of malware detection with dynamic analysis, this step assumes dynamic input features such as a stream of API calls or system entity interactions (see Section 6.1). Subsequently, a graph builder is employed to transform the code into a graph-structured representation that preserves the program’s semantics, as detailed in Section 4.2.2. Typically, these first two steps are performed using reverse engineering tools listed in Table 7. Optionally, the graph can be preprocessed and attributed with hand-crafted features, located on nodes or edges. Then, GRL techniques, such as GNNs, leverage the semantics of code to learn node embeddings, which capture the relationships and the role of internal instructions, functions, or API calls, depending on the input graph. These embeddings are commonly generated using well-known GNN variants, which have been discussed in Section 3.2.3. Other techniques employ word embedding techniques inspired from Natural Language Processing (NLP) to learn the meaning of opcode or API functions, enabling integration of the resulting embeddings into a global graph structure for GNNs to learn the structural properties. The majority of studies consider malware detection as a graph classification task, whereby node embeddings are transformed with a global pooling operation (or readout) to create a single fixed-size graph embedding vector that encapsulates all information of the graph. The final vector can then be classified using traditional ML or DL methods.

Table 2.

Data	Analysis	Graph type	Classification	Learning	Models	Year	Paper
CFG+FCG+Flows	Hybrid	Attributed	Graph	Supervised	Word2vec, structure2vec	2021	Hybroid [29]
CFG+FCG+Flows	Hybrid	Attributed	Graph	Supervised	Bi-LSTM, word2vec, structure2vec	2021	hybrid-Falcon [30]
CFG	Hybrid	Attributed	Graph, Subgraph	Supervised	Inferential SIR_GN, RF	2021	Fairbanks et al. [31]
FCG	Static	Attributed	Graph	Supervised	GCN	2018	CG-GCN [32]
					GNN	2021	CGDroid [33]
					GCN, CBOW	2021	Cai et al. [34]
					GNN, Skip-gram	2021	Xu et al. [35]
					GraphSAGE	2021	Vinayaka et al. [36]
					CGMM	2021	Errica et al. [37]
					GAT, node2vec	2021	Catal et al. [38]
					GNN, Bi-LSTM, TF-IDF	2022	DeepCatra [39]
					GCN, GraphSAGE, GIN	2022	Lo et al. [40]
					VGAE, word2vec	2022	Gunduz et al. [41]
					GCN	2022	Lu et al. [42]
					GraphSAGE, VGAE	2022	Yumlembam et al. [43]
App-API FCG	Static	Heterogeneous	Node	Supervised	Multi-kernel model, Meta-path	2017	HinDroid [11]
					GCN, Skip-gram	2021	GDroid [70]
					Custom HAN, Meta-path	2021	Hawk [84]
PDG+FCG	Static	Attributed	Graph	Supervised	structure2vec, word2vec, SIF	2021	Android-COCO [44]
Syscall Graph	Dynamic	Attributed	Graph	Supervised	GCN	2020	John et al. [45]

Table 2. Summary of Android-based Malware Detection Approaches Leveraging Graph Representation Learning

Data represents the data type taken as input by the models; Analysis refers to the analysis method that is leveraged to extract features (e.g., extracting a CFG from smali code is static, capturing network traffic from a running app is dynamic, whereas leveraging both results in a hybrid analysis); Graph type designates one of the graphs introduced in Section 3.1, here we characterize a graph as attributed if a node or an edge is attributed either with hand-crafted features, raw features (e.g., raw instructions, function names) or embeddings (e.g., word embedding of a function), whereas a heterogeneous graph deals with multiple types and possibly different attributes; Classification designates the final object to classify (i.e., the classification task); Learning is the learning method used to train the models, whereas Models refer to the models on which the paper is inspired; Paper and Year identify the work and its publication year.

Table 3.

Paper	Datasets	Performance
Hybroid [29]	CICAndMal2017	97% F1
hybrid-Falcon [30]	CICAndMal2017,AndroZoo	97.09% F1
Fairbanks et al. [31]	VirusTotal	92.7% F1
CG-GCN [32]	Drebin+Apkpure+HKUST	99.68% F1
CGDroid [33]	Drebin+AndroZoo	~99% F1
Cai et al. [34]	AndroZoo+VirusShare	99.65% F1
Xu et al. [35]	Drebin+AMD+PRAGuard+AndroZoo	99.6% acc
Vinayaka et al. [36]	CICMalDroid2020,AndroZoo	92.23% F1
Errica et al. [37]	AMD,Google Play Store	97.2% macro F1
Catal et al. [38]	CICMalDroid2020+ISCX-AndroidBot-2015	94.1% F1
DeepCatra [39]	Drebin+DroidAnalytics+VirusShare +CICInvesAndMal2019+AndroZoo	95.83% F1
Lo et al. [40]	MalNet-Tiny,Drebin	94%, 97% F1
Gunduz et al. [41]	ISCX-AndroidBot-2015+CICMalDroid2020	93.4% F1
Lu et al. [42]	VirusShare+AndroZoo+Google Play Store	~63%-95% F1
Yumlembam et al. [43]	Drebin,CICMalDroid2020	98.33%, 98.68% acc
HinDroid [11]	Private	98.84% F1
GDroid [70]	AMD,Google Play Store	98.99% acc
Hawk [84]	CICAndMal2017+VirusShare+AndroZoo +Google Play Store	>96% F1
Android-COCO [44]	Drebin+AMD+AndroZoo	99.88% F1
John et al. [45]	Drebin+AMD+Malgenome	92.3% acc

Table 3. Datasets Employed in Android Malware Detection Studies

For each Paper, we provide the list of datasets along with the performance of the model. Datasets separated by a “+” refer to a global dataset built from the assembling of each mentioned dataset, whereas the use of a “,” means that the authors have conducted experiments on separated datasets. If multiple comma-separated datasets are present in Datasets, and only one metric is assigned in Performance, then this metric refers to the performance of the first dataset. Otherwise, each dataset is assigned to a performance metric. If the number of metrics in Performance is greater than the number of datasets, then multiple variants of the models are proposed and we suggest to refer to the original paper for further information. In this paper, Performance metrics are defined such that the accuracy denoted as “acc”= \((TP+TN)/(TP+TN+FP+FN)\) , where TP, TN, FP, FN refer to true positive, true negative, false positive and false negative, respectively. “F1”= \(2\times\) Precision \(\times\) Recall \(/(\) Precision \(+\) Recall \()\) where Precision \(=T P /(T P+F P)\) and Recall \(=T P /(T P+\) \(F N)\) . “AUC” refers to the Area Under the Receiver Operating Characteristic Curve.

Table 4.

Dataset	Type	Number of Elements	Number of Classes
CICAndMal2017	APK file	10,854	5
CICMalDroid	APK file	17,341	5
AndroZoo	APK file	>21M	/
Drebin	APK file	5,560	179
MalNet	FCG from APK	1,262,024	696

Table 4. Overview of Malware Detection Datasets

Table 5.

Data	Analysis	Graph type	Classification	Learning	Models	Year	Paper
CFG	Static	Attributed	Graph	Supervised	DGCNN	2019	MAGIC [46]
CFG	Hybrid	Attributed	Graph	Supervised	GNN, word2vec	2021	HawkEye [47]
CFG+FCG	Static	Attributed	Graph	Supervised	GAT, Random Walk, BERT	2021	Wang et al. [48]
CFG+FCG	Static	Attributed	Graph	Supervised	GraphSAGE	2022	MalGraph [49]
FCG	Static	Attributed	Graph	Supervised	node2vec, SDA	2019	DLGraph [50]
	Dynamic	Attributed	Graph	Supervised	DGCNN	2019	Oliveira et al. [51]
					GCN	2021	SDGNet [52]
					GCN, Markov chain	2021	Li et al. [53]
					GAT, GIN, Word2vec	2022	DMalNet [54]
Entities	Dynamic	Attributed	Graph	Supervised	GCN	2019	MeQDFG [55]
	Dynamic	Heterogeneous	Graph	Semi-supervised	Heterogeneous GNN, Meta-path, Attention, Siamese Network	2019	MatchGNet [56]
	Dynamic	Heterogeneous, Attributed	Node	Supervised	GraphSAGE, Meta-path	2021	MalSage [26]
	Dynamic	Heterogeneous	Graph	Self-supervised	GAT, Meta-path, Contrastive learning	2022	FewM-HGCL [57]

Table 5. Summary of Windows-based Malware Detection Approaches Leveraging Graph Representation Learning

Table 6.

Paper	Datasets	Performance
MAGIC [46]	MMCC	99.25% acc
HawkEye [47]	VirusShare+AndroZoo	96.82%, 93.39%, 99.6% acc
Wang et al. [48]	VirusShare+VirusTotal	90.88%, 72.44% F1
MalGraph [49]	VirusShare+VirusTotal	99.97% acc
DLGraph [50]	MMCC+VirusShare+KafanBBS	>99% acc
Oliveira et al. [51]	VirusShare	~99.4% F1
SDGNet [52]	Alibaba Dataset	97.3% acc
Li et al. [53]	VirusShare+VirusTotal	98.32% acc
DMalNet [54]	VirusShare+VirusTotal	98.43% acc
MeQDFG [55]	VirusShare	86.22% acc
MatchGNet [56]	Private	96.53% acc
MalSage [26]	VirusTotal	91.56% acc
FewM-HGCL [57]	VirusShare,ACT- KingKong,Ember, API Call Sequences,BIG 2015	85.73%-98.65% acc

Table 6. Datasets Employed in Windows Malware Detection Studies

Table 7.

Data	Access	Techniques	Datasets	Year	Paper
APK FCG	White-box, gray-box	N-strongest nodes, gradient-based	AndroZoo, Drebin	2020	MANIS [151]
	White-box	RL, heuristic optimization	AndroZoo, MalScan	2021	HRAT [152]
	Gray-box	VGAE	CICMaldroid, Drebin	2022	VGAE-MalGAN [43]
PE CFG	Black-box	RL, NOPs insertion	VXHeavens, VirusShare	2022	[153]
PE FCG	Black-box	Adversarial code injection	VirusShare	2024	[73]

Table 7. Summary of GRL-based Adversarial Attack Works

In this table, Data refers to the type of graph used as input; Visibility denotes the visibility that the attacker has on the system; Techniques summarizes the technical methods leveraged in the paper.

5 Graph-based Android Malware Detection

In this section, we present graph-based malware detection for Android platform, starting from the global methodology to build graphs from disassembled Android applications, to the review of existing works that leverage GRL for the detection of malware.

5.1 Android-based Graph Structures

Android applications are packed into APK files containing the source code, resources, manifest file and assets. After unzipping an APK, numerous features can be extracted to be used in downstream ML tasks. The manifest file provides the big picture of an app and contains its meta-information such as the required permissions to run it, hardware features and components. Resources such as images, videos or audio files are also available for further analysis. However, these data are inherently flat and do not provide enough structured information to build a graph. This is why most approaches leverage the actual source code to represent the logic of the app as a graph. As an APK is a production-ready app package, the code has already been compiled and assembled into a Dalvik bytecode format (.dex file). In practice, this bytecode can be disassembled into higher-level human-readable code (.smali files). Based on both code representations, control flow graphs (CFGs), function call graphs (FCGs), program dependence graphs (PDGs) and APIs can be extracted in a static way. This static extraction process is demonstrated in Figure 4. Concerning dynamic analysis, API call and system call sequences can be recorded when running the app in a sandboxed environment.

Fig. 4.

5.2 Android-based Approaches

In this section, we review state-of-the-art Android-based papers, classified by types of graph, also summarized in Table 2.

5.2.1 CFG Approaches for Android Malware Detection.

CFGs offer a remarkable abstract representation of programs to detect malware. Hybroid [29] leverages this representation by extracting basic blocks from APKs. Three types of embeddings are then constructed from the code, to capture different semantics, namely opcode embedding, basic block embedding and CFG embedding, where each representation is associated to a level of abstraction. The semantic of opcodes (Dalvik instructions) and basic blocks (sequence of instructions) is computed using the NLP-based model word2vec with Skip-gram. Precisely, Skip-gram learns embedding vectors for each basic block’s raw instructions, by using an opcode to predict its surrounding opcodes. Indeed, the operands are not leveraged here as they are affected by the usage of Dalvik VM. For the basic block embeddings, they compute the weighted mean of the inner instructions’ opcode. These basic block embeddings then become the node embeddings from the point of view of a FCG and structure2vec [96] creates a final graph embedding vector for graph classification. In parallel, the network traffic generated by the app is also captured with dynamic analysis using Argus [97]. The packets are transformed into flows to summarize the communication between the Android device and the destination IP addresses, using various statistics. After a feature selection step, important features are combined with the FCG embeddings for downstream classification with gradient boosting on the CICAndMal2017 dataset, where the model demonstrates a F1-score of 97% and outperforms other methods such as DREBIN [98] and SVM [99].

On the other hand, hybrid-Falcon [30] transforms network flow data into 2D images on which a bi-directional LSTM captures flow representations from pixels. On the same dataset, the F1-score is further improved to 97.09%.

The paper [31] provides a solution to locate MITRE ATT&CK Tactics Techniques and Procedures (TTPs) detected from a subgraph of a CFG. Node representations are extracted with Inferential SIR-GN [100] and the prediction is done using a random forest. To identify the subgraph responsible for the prediction, the authors rely on SHAP [101] to attribute for each input feature a value that indicates its relevance for the final output. TTPs are successfully detected with a F1-score of 92.7%.

5.2.2 FCG Approaches for Android Malware Detection.

The semantic information captured by FCGs in programs makes this data structure a predominant choice in graph-based malware detection. For instance, a FCG is constructed from Smali code in work [32], where each node is attributed with function attributes such as the method type (system API, third-party API, etc.) and the requested permissions (required permissions for the execution of the function). The graph embeddings are then trained in a supervised way using the GCN propagation rule, presented in Section 3.2.3. The Drebin dataset is used for final evaluation with a F1-score of 99.68%.

In CGdroid [33], multiple node features are extracted from the disassembled methods in order to build an FCG that captures the semantic of functions. Indeed, each node is mapped to a vector of hand-designed features such as the number of string constants, the number of call and jump instructions, the associated permissions, etc. A GNN computes graph embeddings and a MLP is used for downstream graph classification on Drebin and AndroZoo [102] datasets, where a baseline is outperformed by 8% in F1-score.

Word embedding techniques are employed in [34] to consider functions similarly as words and learn the meaning of functions. The embeddings are then assigned as attributes to each corresponding function node in an FCG, and a GCN is used as graph learning method. The proposed method achieved 99.65% F1-score with random forest classifier on a private dataset.

Similarly, authors in [35] leverage word embedding to transform Android opcodes from text to vectors using Skip-gram. In the same way as [34], the embeddings are used as nodes in a FCG and this graph is fed into a GNN to compute a fixed-size graph embeddings vector. A 2-layer MLP is used as last layer for graph classification, with a 99.6% average accuracy.

In the reference [36], FCGs are extracted from APKs using Androguard and each node stores attributes related to the structural meaning of the node in the graph (e.g., node degree) or features extracted from the actual disassembled functions (e.g., method attributes, method opcodes’ summary). Using these previous features, GCN, GraphSAGE, GAT and TAGCN [103] are benchmarked together, with a better performance achieved with GraphSAGE. First, each node i uniformly selects a fixed-size set of neighbors, denoted \(\mathcal {N}(i)\) . Neighbors are then aggregated using a mean aggregation function such as:

\begin{equation} \mathbf {h}_{\mathcal {N}(i)}^{(l+1)}=\operatorname{aggregate}\left(\left\lbrace \mathbf {h}_j^{(l)}, \forall j \in \mathcal {N}(i)\right\rbrace \right) \end{equation}

(5)

where \(\mathbf {h}_{\mathcal {N}(i)}^{(l+1)}\) is the embedding of node i at layer \(l+1\) and \(h_j^{(l)}\) denotes the embedding of a neighbor node j at layer l. The embedding of i at previous layer is then concatenated with the aggregated representation and then learned by a neural network.

\begin{equation} \mathbf {h}_i^{(l+1)}=\sigma \left(\mathbf {W}^{(l)} \cdot \left[\mathbf {h}_i^{(l)}, \mathbf {h}_{\mathcal {N}(i)}^{(l+1)}\right]\right) \end{equation}

(6)

where \([,]\) represents the concatenation operation. The embedding is finally normalized:

\begin{equation} \mathbf {h}_i^{(l+1)}=\frac{\mathbf {h}_i^{(l+1)}}{||\mathbf {h}_i^{(l+1)}||_2} \end{equation}

(7)

Malware apps from CICMalDroid2020 are used for evaluation, with a best F1-score of 92.23%.

The paper [37] focuses on obfuscated malware detection using call graphs attributed with nodes’ out-degree, applying the Contextual Graph Markov Model (CGMM) [104] to learn embeddings for classification with a feed-forward network, achieving a 97.2% macro F1-score. The authors in [38] explain that the use of node2vec as feature extraction method is justified by a 3% increase in F1-score compared to traditional graph centrality features. Using the attention aggregation from GAT as a final step, the proposed solution reaches 94.1% in F1-score.

DeepCatra [39] identifies critical Android APIs using Term Frequency-Inverse Document Frequency (TF-IDF) from call traces, extracted from popular repositories such as CVE [105] and Exploit-DB [106]. It generates call graphs with Wala [91], incorporating knowledge of critical APIs, and trains a custom GNN and a bi-directional LSTM to capture graph topology and temporal features, respectively. The merged output vectors from both models undergo binary classification, achieving a 95.83% F1-score, surpassing various baselines in performance.

In [40], an enhanced FCG employs graph centrality metrics (PageRank, in/out degree, node betweenness) as node attributes. Comparing GCN, GraphSAGE, and GIN models, all incorporating jumping knowledge [107] to counteract over-smoothing, GraphSAGE shows superior performance on multi-class classification tasks, with notable F1-scores on Malnet-Tiny and Drebin datasets.

Authors in [41] compute node embeddings from API FCGs using word2vec and apply a Variational Graph Auto-Encoder (VGAE) [108] for a compact representation, achieving a 93.4% F-measure in downstream classification tasks. [42] introduces a robust detection method against obfuscation using a GCN with subgraphs and a denoising approach, tested on a dataset comprising VirusShare and AndroZoo samples.

An innovative approach is proposed in [43], where each Android app is viewed as a local graph with APIs as nodes, linking APIs co-existing in the same code block. A global graph captures inter-app connections via co-occurring APIs. Their approach combines GraphSAGE embeddings with app manifest features (permissions and intents) using a concatenation operation before classification by a downstream supervised classifier like a CNN. In order to test the robustness of GNN-based malware detection models, the authors also provide a generative model inspired from VGAE, which can generate adversarial API graphs to fool the predictive model.

HinDroid [10, 11] models interactions between Android apps and API calls in an App-API graph as a Heterogeneous Information Network (HIN) [109], using meta-paths to extract semantic relationships. These relationships are leveraged by a multi-kernel SVM for app node classification, resulting in a 98.84% F1-score. Similarly, GDroid [70] constructs an App-API graph from API co-occurrence, and encodes APIs with a Skip-gram model, while applying a GCN for node classification.

Hawk [84] constructs a large heterogeneous App-API graph from over 180k APKs, capturing various relationships (App-Permission, App-Class, App-Interface) and using meta-paths and meta-graphs for semantic extraction. Custom heterogeneous GAT models for in-sample and out-sample nodes demonstrate superior malware detection capabilities against several baselines.

5.2.3 PDG Approaches for Android Malware Detection.

Program Dependence Graphs have been widely used in optimization tasks due to their faculty to model data and control flow from programs [82]. For similar reasons, this structure is also employed in malware detection but remains little used with GRL.

Android-COCO [44] leverages the native code of dynamic libraries (.so files) along with the Android bytecode (.dex files) to construct a PDG for each app. Structure2vec computes the graph embeddings, that are then passed into a MLP for graph classification. For a more accurate prediction, an FCG is created, on which graph embeddings are also computed (similar to Hybroid [29]). The predictions of both graphs are finally combined using an ensemble algorithm and a 99.88% F1-score is reached on samples from Drebin, AMD [110] and AndroZoo datasets.

5.2.4 System Call Approaches for Android Malware Detection.

System calls provide a low-level view of system interactions, able to model attacks patterns. The authors in [45] rely on dynamic analysis to record the system calls generated by the activity of a running APK to detect malware behaviors. Each node in the graph is one among 26 selected system calls and is summarized by four centrality indicators as node features: Katz, Betweennes, Closeness and PageRank centralities. Edges represent interactions between those system calls while the app is running. A GCN and a pooling layer compute graph embeddings and a fully-connected layer along with a softmax activation are used for graph classification. Their implementation achieves 92.3% accuracy and similar true positive rate as SVM but significantly outperforms all other methods regarding the false positive rate. As for the PDG, system call graphs are still scarcely used in current GRL literature.

5.3 Android Malware Datasets

In this section, we present Android datasets employed for graph-based malware detection tasks. A summary of the datasets used in previous studies is available in Table 3, and some of their statistics are summarized in Table 4. Based on the current information provided from the respective websites of AMD, Malgenome and PRAGuard datasets, the release of these datasets has stopped for maintenance reasons and are not further described in this paper.

CICAndMal2017 [111]. An Android malware dataset developed by the Canadian Institute of Cybersecurity (CIC). It comprises 10,854 APK files published between 2015 and 2017 on Google Play Store. The dataset consists of 6,500 benign apps and 4,354 malware divided into Benign, Adware, Ransomware, SMS and Riskware classes. For each scenario, network packets are also collected and transformed into flows using CICFlowMeter [112]. This tool generates, for each flow, 80 features based on statistics from the packets contained within the flow.

CICMalDroid [113]. This dataset was also made public by the Canadian Institute for Cybersecurity. It is composed of 17,341 APK samples collected during one year in 2018. Malware examples are divided into five classes: Benign, Adware, Banking, SMS and Riskware. Along with the APK files that can be used for classification tasks, three kinds of features are also provided for each sample: statically extracted features (e.g., intents, permissions and services), dynamically observed behaviors (e.g., system calls, binder calls, composite behaviors) and network traffic in pcap format. Features are available from CSV files, ranging from 139 to 50,621 files depending on the APK.

AndroZoo [102]. A collection of Android apps provided by the University of Luxembourg. In 2023, the dataset contains more than 22M APKs, mostly including benign apps from Google Play Store but also malware from VirusShare. Many other APK stores are fetched to continually update the collection. For each APK file, nine features are collected such as the sha256 hash, the app compilation date or the size of the .dex file. AndroZoo is often used in combination with other APK malware datasets to obtain a balanced number of benign samples.

Drebin [98]. Made available by the MobileSandbox project, this dataset also provides malware Android apps. A total of 5,560 APKs divided into 179 malware families were collected between August 2010 and October 2012. Considering the important variety of classes, most multi-class classification papers use the top-k classes from the dataset by sorting malware based on the number of samples per class. Otherwise, all examples can be used for binary classification. Each APK is summarized by 10 features such as permissions, intents and providers. This dataset does not contain any benign example so an additional dataset such as AndroZoo should be used to complete the dataset with benignware examples.

MalNet [114]. A large dataset containing FCGs extracted from AndroZoo APK files. According to the original paper from 2021, it was at this time the largest database for GRL with 1,262,024 graphs, averaging over 15k nodes and 35k edges per graph, across 696 families and divided among 47 types. GNNs have been applied to this dataset in the original paper [114], where baselines such as GCN, GIN or Feather are benchmarked together. Moreover, FCGs have demonstrated promising results when combined with representation learning techniques, in trying to overcome the polymorphic nature of malware [50, 115]. For smaller experiments, MalNet-Tiny is a subset of MalNet composed of 5,000 graphs of at most 5k nodes and balanced in five types.

6 Graph-based Windows Malware Detection

The increasing level of sophistication of malware on Windows platform along with the widespread usage of this operating system worldwide has become a major concern to preserve the safety of many users. After the success of GRL in many classification tasks, its application to Windows-based malware detection has become obvious. As for Android malware detection, in this section, we present a global methodology to model Windows binaries as graphs and we perform a literature review of current approaches involving graph learning algorithms.

6.1 Windows-based Program Graph Structures

In Windows, executable files are encapsulated following the portable executable (PE) file format. File extensions such as .exe, .dll and .sys are all PEs with different roles. Dynamic-link libraries (DLL) and system (SYS) files are both libraries of functions that are loaded into memory and used by other programs. The former is intended for a general function sharing purpose, whereas the latter is intended for a more specific use related to device drivers and hardware configurations by the system [116]. The EXE file is the one that is actually executed and that communicates with function libraries. Similarly as for Android APKs, PE files can be analyzed statically to extract source code employed in downstream graph structures such as CFGs, FCGs and PDGs. PEs are compiled code, meaning that the binary has to be first disassembled or decompiled to obtain an appropriate human-readable representation like assembly. A number of works also leverage dynamic analysis to detect Windows malware, where the executable binaries are run into a sandbox to monitor system entity interactions, system calls, network flows and live API calls. This process is described in Figure 5.

Fig. 5.

6.2 Windows-based Approaches

This section reviews the approaches employed in the Windows-based papers presented in Table 5.

6.2.1 CFG Approaches for Windows Malware Detection.

In MAGIC [46], assembly code is extracted from PE files and converted into CFG, where nodes are basic blocks composed of multiple assembly instructions, and edges represent the program flow along these basic blocks, as explained in Section 4.2.2. Instead of using a standard GCN that was initially made for node classification, the authors prefer to leverage a Deep Graph Convolutional Neural Network (DGCNN) [68], which is especially designed for graph classification. The proposed DGCNN leverages adaptive max pooling and replaces the original Conv1D layer with a custom layer that considers graph embedding idea. The training procedure of this model minimizes the mean negative logarithmic loss in an end-to-end manner. The authors trained the model for malware classification on two private CFG-based datasets: MSKCFG (inspired by [117]) and YANCFG (inspired by [118]). These datasets were not made publicly available, but the procedure to generate the CFGs is provided in the paper. The model was also evaluated on the Microsoft Malware Classification Challenge dataset [117] and reaches 99.25% accuracy.

A cross-platform approach is proposed in HawkEye [47] to extract both static and dynamic CFGs from binaries (i.e., Windows, Linux and Android platforms). Embeddings of instructions are computed using word2vec whereas the final graph embedding is calculated using a custom GNN that leverages the word embeddings as nodes. Malware samples were collected from VirusShare and AndroZoo, and benign examples for Windows and Linux platforms were collected from libraries. Using this cross-platform method, HawkEye reaches an accuracy of 96.82% on Linux, 93.39% on Windows, whereas 99.6% accuracy is obtained on Android.

A similar CFG based on an assembly is employed in the paper [48]. First, the semantic of functions is computed using random walk and the BERT [119] language model. These embeddings are then assigned to the nodes of a FCG that represents a global view of the program. The importance between function nodes is then calculated with a GAT model, whose goal is to capture the complex patterns of malware using the attention mechanism described in Section 3.2.3. By leveraging the function-level and program-level embeddings with attention, the overall model achieved 90.88% and 72.44% F1-score on two private datasets.

In MalGraph [49], both CFG and FCG are also leveraged together. Intra-procedural relations are captured with GraphSAGE from the CFG and the embeddings are attributed to nodes in a FCG whose embeddings are also computed with GraphSAGE to capture inter-procedural relations. Max-pooling transforms node embeddings into graph embeddings for downstream PE malware graph classification. All samples were collected from VirusShare and VirusTotal and IDA Pro was utilized for disassembling.

Although previous works are by definition detection methods, they provide poor insights on the actual patterns and areas in the graph that led to the final prediction. CFGExplainer [58] is an explainability framework specially designed to explain the predictions done by GNNs on malware classification tasks based on CFGs. This method identifies subgraphs in the CFG, that contribute to the final prediction of a given GNN. Concerning this particular task, CFGExplainer outperforms other explainability frameworks like GNNExplainer [120], SubgraphX [121] and PGExplainer [122].

6.2.2 FCG Approaches for Windows Malware Detection.

DLGraph [50] leverages static analysis to extract API calls and FCG from binaries. The model relies on a FCG that represents the interactions between functions from disassembled PE files, along with a vector of extracted Windows API calls. The node embeddings of the FCG are calculated using node2vec and are fed into a stacked denoising auto encoder (SDA) [21] to create a graph embedding vector. Similarly, an SDA takes as input the API vector and the two resulting vectors are then concatenated and passed into a softmax regression layer for classification, where an accuracy greater than 99% is achieved.

In the paper [51], a behavioral graph is constructed from API call sequences monitored during the execution of PEs in a sandbox environment. A DGCNN is then used to compute embeddings for graph classification. Based on their experiments, the authors released a public dataset made of 42,797 malware and 1,079 benignware API call sequences [123]. On the original imbalanced dataset, the proposed model achieves an F1-score of more than 99%.

SDGNet [52] similarly captures API calls along with attributes using dynamic analysis. Weighted graph normalization methods are utilized to transform the adjacency matrix into three symmetrical matrices that describe interactions of node information. A GCN-based model computes node embeddings for these matrices and all representations are merged into a final graph embedding that is leveraged for classification. A total of 8,909 labeled samples were collected from the Alibaba Cloud Malware Detection Base on Behavior dataset [124] and the final model achieves 97.3% accuracy.

The reference [53] uses a GCN on a directed cyclic graph that was pre-processed with Markov chain. The nodes represent API calls, whereas the edges \((u,v)\) are weighted according to the number of calls from u to v. Malware samples used for evaluation are also collected dynamically using a sandbox environment in order to create a private dataset, on which a 98.32% accuracy is reached.

DMalNet [54] also leverages dynamic analysis to build FCGs from API calls captured during the execution of PE binaries. Here, both API names and API arguments are considered. The embeddings of these attributes are learned with a custom GIN model, whereas more complex structural interactions between APIs are learned with attention using a GAT. After computing embeddings for both semantics, important information is captured with a gPool layer [125, 126] for feature selection. More precisely, this pooling operation attributes to each node a projection score and selects top-k nodes based on these scores. The gPool output of the GIN model is taken as input by the GAT to capture the interactions between API calls. A final accuracy of 98.43% is obtained by leveraging a MLP for classification on a private dataset.

6.2.3 Entity Graph Approaches for Windows Malware Detection.

The dynamic nature of communications between system entities provide valuable information to detect malicious behaviors. In the paper [55], authors monitor such interactions using dynamic analysis. Directed multi-edge graphs are built from interactions between four system entities: processes, files, registry keys and network sockets. Edges represent data transmission between entities such as system calls. Concretely, a node is represented by a categorical value between 0 and 3, and an edge stores a feature vector containing the size of the transmitted data and the time when the action occurred. Representation learning is done using a GCN and an attention-based pooling function is used to transform node embeddings into fixed-size graph embedding vector that is classified by a feed-forward network. Experiments have been conducted on a private dataset with samples from VirusShare and the proposed solution achieved 86.22% accuracy.

In MatchGNet [56], malware detection is considered as the detection of a malicious process that behaves differently from benign processes. The authors first designed an invariant graph modeling technique to capture interactions in a heterogeneous graph that represents relations among system entities such as processes, files or sockets. A GNN-based encoder with attention learns the representations and a Siamese Network [127] learns the similarity between known benign programs and new incoming programs. During inference, the similarity distance between these two programs results in a score that is utilized for final classification. This model can thus be trained using only benign examples. The final evaluation is performed on a real enterprise dataset composed of 300 million events recorded on Windows and Linux hosts.

The paper [26] represents malware behaviors as a weighted heterogeneous graph, where nodes are either an executable file (PE), a file, a file suffix or a module, and edges represent different (weighted) actions between entities. A custom model based on GraphSAGE and meta-paths is implemented to deal with the heterogeneity of the graph. This model achieved 91.56% accuracy on a private dataset.

FewM-HGCL [57] introduces a self-supervised method based on contrastive learning for few-shot malware variants detection. The authors construct a heterogeneous graph with five types of entities: process, API, file, signature, and network. API nodes are not only identified by their name but are also characterized by their category. API attributes are irregular by default and feature hashing [128] is used to transform them into compact fixed-length vectors. The idea behind contrastive learning is to use the natural co-occurrence associations in data as a substitute for ground truth labeled information. To perform contrastive learning, negative samples and positive samples are generated using different data augmentation techniques. Three distinct GAT models are then trained to respectively learn graph embeddings on the original graph, the positive graph and the negative graph. A discriminator aims to capture the similarity between the original graph and the positive graph along with the dissimilarity between the original graph and the negative graph. All self-trained embeddings are finally merged in a readout layer for downstream graph classification with an accuracy ranging from 85.73% to 98.65% on multiple datasets for malware variants detection.

6.3 Windows Malware Datasets

On Windows platform, a majority of works rely on private datasets constructed from public malware samples downloaded from VirusShare and VirusTotal. Using these data, the comparison between papers is ineffective. Other works evaluate their experiments on the Microsoft Malware Classification Challenge, which makes the performance comparison between papers possible. Datasets used in previous studies are reviewed in Table 6.

Microsoft Malware Classification Challenge (MMCC) [117]. This dataset contains more than 20,000 malware samples that fall into nine families, namely Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY and Gatak. For each binary, the dataset provides two data representations: the bytecode and the disassembly code (disassembled with IDA Pro). The assembly code can then be used to build attributed CFGs [46] or FCGs [50].

VirusShare [129], VirusTotal [130]. It is common to download PE malware and Android malware from these two websites. Some works rely on dynamic analysis to run the downloaded malware into a cuckoo sandbox [53, 54, 55], whereas others build static CFGs and FCGs [48, 50].

7 Adversarial Attacks

Despite the capabilities of machine learning in classification tasks, these techniques are not immune to adversarial attacks, which aim to disturb the predictions of the model by introducing adversarial examples, crafted from small perturbations in the input. In this section, we introduce background knowledge on adversarial attacks as well as existing approaches to countering traditional and graph-based malware detection.

7.1 Background

Overview of Adversarial Attacks. Adversarial attacks pose a fundamental challenge in the field of ML due to the inherent vulnerability of these models to manipulation. The beginning of adversarial research can be traced back to the early work [131], which first demonstrated that neural networks could be easily fooled by imperceptible perturbations to input data, thereby misclassifying images with high confidence. This work opened a new avenue of research into the security and robustness of ML models. The theoretical foundations of adversarial attacks delve into the inherent limitations of these models, particularly their inability to generalize well to slightly modified inputs that have not been encountered during training. The core idea of adversarial machine learning is to find and use the differences between how humans and machines understand data. By doing this, it’s possible to uncover or increase weaknesses in the models.

The essence of adversarial ML is to find and use the differences between how humans and machines understand data. Doing this enables to uncover or increase weaknesses in the models [132]. Furthermore, the practical implications of adversarial attacks span across numerous domains, from Computer Vision to NLP, and even to more structured data forms like graphs. In each of these areas, attackers have developed sophisticated techniques to manipulate input data in a manner that alters model predictions without detection. For instance, in autonomous vehicles, slight alterations to road signs can lead to misinterpretation by the driving algorithms, posing serious safety risks [133]. To address these challenges, a vast body of literature has emerged, focusing on both attack strategies and defense mechanisms. Among the notable contributions are [132], which introduced the concept of “adversarial examples”, and [134], which proposed a method for generating them through gradient-based optimizations. Since then, adversarial attacks have seen great success notably in the Computer Vision domain [135], where the goal is to craft adversarial images that the model will misclassify by predicting a wrong label.

Definitions. Formally, for a given classification model f denoted as \(f: x \rightarrow y\) that predicts a label \(y \in \mathbb {Y}\) given the features \(x \in \mathbb {X}\) of an input example \(z \in \mathbb {Z}\) , we denote two categories of adversarial attacks [136, 137]. A feature-space attack aims to craft adversarial features \(x^{\prime } \in \mathbb {X}\) (e.g., a modified FCG or CFG) such that the distance between x and \(x^{\prime }\) in feature-space is minimized, and such that the model f predicts a label \(y^{\prime } \in \mathbb {Y}\) different from y. However, a problem-space attack works directly on the real-world input z instead of the features x. The goal then becomes to minimize the cost between z and an adversarial example \(z^{\prime }\) (e.g., modified source code), such that f predicts another label \(y^{\prime }\) . We can further classify adversarial attacks by the prior knowledge acquired by the attacker. White-box attacks assume that the attacker has full knowledge of the target model f, namely he knows about the architecture, the parameters, and so on. In contrast, black-box attacks refer to scenarios where only the output prediction is known by the attacker, making these attacks more difficult to succeed but also more likely to be faced in real-world applications. Other methods called gray-box attacks, live at the intersection between black- and white-box methods, where the attacker has knowledge of some prior knowledge that shall be defined depending on the use case.

Defense Against Adversarial Attacks. In response to the threat posed by adversarial attacks, the ML community has developed several defensive techniques aimed at enhancing the robustness of models. Adversarial training [134] is among the most effective defenses, where models are trained on a mixture of clean and adversarially perturbed inputs, making them more resilient to similar manipulations. Another notable technique is defensive distillation [138], which involves training a model to predict the soft outputs (probabilities) of a previously trained model on the same task. This process can reduce the sensitivity of the model to small perturbations, making it harder for attackers to craft effective adversarial examples. Feature squeezing [139] reduces the color depth of images or applies spatial smoothing, thereby limiting the ability to introduce fine-grained perturbations that go unnoticed by human observers but mislead the model. Despite these efforts, many defenses have been circumvented. This underscores the need for a continuous improvement of defensive methods.

7.2 Adversarial Attacks and Malware Detection

In the case of malware, adversarial attacks aim to craft new malware examples that preserve maliciousness while misleading the classification of the model. Formally, given an input malware z such as a PE or an APK, we want to find either a modified version of its compiled code \(z^{\prime }\) , or a modified version of its graph representation \(x^{\prime }\) that will not be detected by the model. This process is also described in Figure 6, where an attacker evades the detection of a GRL-based model by altering the structure of an input malware. However, the aforementioned requirements imply multiple constraints that are difficult to fulfill in the case of malware adversarial attacks. Indeed, as explained by Ling et al. [136], adversarial attacks have been successfully applied to image classification as it is easy to retrieve a corresponding image \(z^{\prime }\) from an adversarial feature \(x^{\prime }\) because an image can be simply represented as a 2D-array of pixels. In other words, a differentiable and bi-injective inverse feature mapping function \(\phi ^{-1}\) can be approximated to map features from the feature space to an image in problem space. However, retrieving the original malware code \(z^{\prime }\) from a feature representation \(x^{\prime }\) (i.e., finding a similar inverse function) is much more challenging as the reconstructed input needs to fulfill multiple conditions to remain executable [140]. Notably, the generated adversarial example needs to respect a specific format such as PE or APK, but also needs to preserve the malicious payload while still being executable without error. Furthermore, in a black-box scenario, the attacker does not know beforehand the feature representation taken as input by the model, which further complicates the adversarial process.

Fig. 6.

Despite these complicated requirements, researchers found adversarial attacks that can be employed to detect malware. To evade raw bytes-based malware detection models, works [141] and [142] append an adversarial sequence of bytes to the malware. Other works prefer to modify regions in the PE header [143] or extend the DOS header [137]. However, these techniques are ineffective for higher-level representations such as those based on API calls. For this purpose, some works insert additional API calls in feature space to add noise in the representation and evade the detection systems [144, 145]. In other works, RL is leveraged to manipulate the original malware in order to evade detection while maintaining a correct format and semantic [146, 147].

7.3 Adversarial Attacks on Graph-based Malware Detection

Adversarial attacks are inherently dependent on the data representation taken as input by the model. When working with GNNs, attackers thus need to consider the graph representation of the data, leading to adversarial attacks specifically designed for graph-based detection systems. Literature presents numerous papers that apply such attacks to GNN classifiers by either modifying node and edge features, or by directly manipulating the graph structure with actions such as removing or adding nodes and edges [148, 149, 150]. In the case of malware, removing random nodes or edges from graph structures such as FCG or CFG is not appropriate as it would not preserve the functionality of the program. The adversarial attack should also be efficient on graph classification tasks, as a large majority of works leverage this task for malware detection. However, we review several works, presented in Table 7, that have successfully bypassed the detection of GRL-based models by altering the structure of graphs.

Two such adversarial approaches specifically designed against call graph-based malware detection are proposed by Xu et al. in MANIS [151]. The first method aims to pick the n-strongest nodes from the graph, which are the nodes that have the most influence over their neighbor nodes. They are then inserted in the input graph until evasion has succeeded. The second method relies on the direction of the gradient to guide the insertion of new nodes. The advantage of these methods is that they produce a valid binary that preserves the given format (e.g., PE or APK). On the Drebin dataset, 72.2% misclassification rate is achieved with the n-strongest nodes method, whereas the gradient-based proposition reaches 33.4% misclassification rate under the white-box setting. Similar results are also obtained in gray-box setting.

In the paper [152], authors propose HRAT, a structural attack for APK-based FCGs that aims to address the inverse mapping problem [140], that consists in retrieving a valid malware in problem-space from the modified malware in feature-space (see Section 7.2). The proposed method works in white-box setting, and leverages RL along with heuristic optimization to perform graph modifications such as inserting and deleting nodes, or adding edges and rewiring. The performance of this solution has been evaluated on 30k APKs from AndroZoo with over 90% attack success rate in feature space and up to 100% attack success rate in problem space.

An adversarial attack for GNN-based APK malware detection has been introduced in VGAE-MalGAN [43] to measure the robustness of the proposed detection model. The proposed attack is based on a VGAE model, aiming to effectively add nodes and edges to a FCG in order to fool the GNN classifier, in a gray-box setting where the attacker has knowledge that the detector is a GNN-based model and knows the features used. This adversarial approach has been applied to the original GNN model to further improve the robustness of the detection system.

An adversarial method based on RL is presented in [153] to evade GNN-based malware detection from CFGs. A deep RL agent is trained to insert semantic NOPs (no-operations) in CFG basic blocks extracted from PE malware. This technique has the faculty to preserve the semantic and format of the original file, while evading detection in black-box setting with nearly 100% attack success rate.

In [73], the Mal2GCN model is introduced, a model that improves resilience against adversarial attacks on malware detection. Leveraging FCGs from API calls and a DGCNN model, Mal2GCN targets malware’s functional behaviors over binary data to counter evasion techniques such as byte appending. However, this method is vulnerable to adversarial source code changes. Addressing this, Mal2GCN employs a non-negative training method, inspired by [154], to mitigate code alteration effects used in malware disguise. It remains accurate across 2,000 adversarial samples.

7.4 Real-world Implications for Attackers

Adversarial attacks against malware detection models should be understood as attacks on the entirety of security systems, not merely isolated models. Attackers targeting these systems face a landscape far more complex than simply manipulating inputs to evade a single detection model.

In practice, attackers encounter systems that integrate various forms of threat detection and response mechanisms. For example, an effective adversarial attack must take into account not only how to evade detection by an ML model but also how to circumvent other layers of defense. This may involve evading heuristic-based detection, bypassing rate limiting, and staying undetected by anomaly detection systems that look for unusual patterns of activity indicative of an attack.

The work [155] underscores the discrepancy between theoretical adversarial ML models and their application in real cybersecurity environments. It points out that attackers often lack complete knowledge about the cybersecurity detection systems they are targeting. Rather, they might exploit certain vulnerabilities without the need to fully understand or modify the underlying ML models directly. This perspective is important for developing more realistic defensive strategies that take into account the practical constraints and capabilities of attackers.

The common assumption in adversarial ML research is that attackers have unrestricted access to the target models (i.e., in a white-box setting) or can freely interact with the systems to generate adversarial examples [155]. The focus is instead recommended to shift towards adversarial techniques that are feasible under realistic conditions, such as limited knowledge about the system and indirect access through legitimate channels (i.e., gray-box or black-box scenarios).

Furthermore, there is a disparity between academic research on adversarial ML and the tactics used by actual attackers [156]. For instance, attackers often employ simpler, more straightforward methods to circumvent ML-driven systems rather than using sophisticated adversarial examples. This observation demonstrates the need for security practitioners to prioritize defenses against more common and practical attack methods, rather than concentrating exclusively on defending against complex adversarial attacks that are less likely to occur.

Moreover, it has been analyzed that while modern organizations are aware of these issues, they do not regard this threat as a top priority because there are no defensive mechanisms that are truly effective in real-world environments [157]. Indeed, given sufficient resources, attackers can achieve their objectives [158], underscoring the challenge in securing systems against cyberattacks. More broadly, as outlined by [156], the underlying economics significantly influence practical cybersecurity efforts for both attackers and defenders. This economic perspective suggests that decisions in cybersecurity are primarily driven by cost-benefit analysis.

8 Discussion

8.1 Application in Real-World Environments

Although GRL techniques show promising capabilities in assisting security analysts with the detection of complex malware, their application in real-world environments remains a challenging task that should be carefully considered beforehand. Indeed, most of the papers surveyed here report near-perfect detection performance, as evidenced by the high detection metrics presented in Tables 3 and 6. However, these studies often only briefly mention the practical application of their proposed techniques in real-world scenarios and encounter multiple sustainability challenges that may necessitate further exploration.

8.1.1 Matching Real-World Requirements.

Scalability. Ensuring the scalability of detection systems for real-world applications is a critical yet often underexplored requirement in existing literature. For example, the new malware detection model implemented in Gmail [159] is tasked with analyzing over 300 billion documents weekly [160], a demand that naturally leads to a tradeoff between processing speed and detection accuracy due to the need to handle such a vast volume of data. To preserve the scalability of the model over time, adopting incremental learning models is pivotal. These models can update themselves with new data in real-time or near-real-time, eliminating the necessity for complete retraining [161]. This strategy significantly lowers the computational burden and allows for rapid adaptation to emerging malware signatures or behaviors. Furthermore, when suitable, GNN-based methods might benefit from techniques such as neighbor sampling, which controls the size of neighborhoods in the graph, and mini-batch training, enabling the parallel processing of multiple graphs on GPUs.

Concept Drift and Biases. The phenomenon of concept drift, marked by evolving data distributions over time, necessitates ongoing model updates to maintain detection performance [162]. This issue is especially prominent in malware detection, where similarities within malware families introduce a positive bias in k-fold cross-validation, leading to an overestimated effectiveness of classifiers [163]. The studies in [163] also exposed spatial and temporal biases in the AndroZoo malware samples, widely used in current research, suggesting strategies for their mitigation. To address concept drift and maintain the relevance of models over time, the implementation of systems for dynamic model updating and retraining is essential. This includes continuous learning mechanisms for rapid adaptation to new data. Moreover, [164] noted the tendency of researchers to develop learning-based methods without fully grasping the true distribution of the input space, relying instead on datasets designed to mimic this distribution. Despite the inevitable presence of some bias, identifying and tackling the specific biases in the used datasets has become essential.

Varying Input Formats. The majority of techniques discussed in this survey have shown success in detecting malware on particular platforms, such as Android and Windows. Furthermore, the data sources used to construct input graphs are often specific to a platform; for example, an FCG derived from an APK file shares no common attributes with an FCG from a PE file. This significantly limits the transferability of methods trained on one platform to others. Nonetheless, there are many practical scenarios where malware detection needs to be agnostic of the platform or file format. To overcome this limitation, existing methods require certain modifications. Datasets must include a wider variety of examples, and alternative graph representations like CFGs could be employed to represent the code in a more generic format applicable across different file formats [46, 58].

Interpretability. One of the limitations of employing deep models is their lack of interpretability, as they operate as “black boxes.” However, understanding the reason behind the decision of a predictive model is essential, especially in cybersecurity. In this domain, analysts need to comprehend the security-related decisions made by algorithms. While some techniques exist to shed light on the predictions made by deep architectures like GNNs [120, 165], there are relatively few studies that specifically use these methods to enhance the interpretability of malware predictions with GNNs [16]. Therefore, further research in this area could significantly benefit malware detection.

8.1.2 Reliance on Noisy Datasets.

Erroneous Labels. A significant hurdle in deploying ML for malware detection is the dependency on datasets that frequently contain noisy and inaccurately labeled data. Collecting a vast corpus of files for malware detection is feasible using public repositories, yet relying on unverified data poses significant risks due to the potential inclusion of malicious applications. The challenge lies in verifying the ground truth, as automated web services like VirusTotal may produce conflicting results, which necessitate expensive expert validation [166]. A potential solution is the adoption of collective wisdom techniques, which aggregate insights from various antimalware engines, and efforts to clean noisy labels in the dataset [167]. For practical use in ML-driven malware detection, a dataset needs to be large enough to accurately represent the environment and threats, have a balanced ratio of benign to malicious samples to prevent excessive false positives, include accurately labeled data, and be regularly updated for relevance [168].

Inconsistent Datasets. While numerous studies demonstrate the high detection rate of GRL techniques for malware detection, the evaluation of these models often relies on distinct examples. Popular datasets are commonly augmented with additional samples from public sources such as the Google Play Store, AndroZoo, VirusShare, and VirusTotal. This practice makes the comparison between models inefficient, as the inconsistency in training and testing samples across studies leads to different detection results. Moreover, such variability undermines the reproducibility of results in real-world settings, given the inability to use identical input files as those reported in the literature. Tackling this issue requires creating and adopting comprehensive, diverse, and open benchmark datasets. Such datasets would standardize evaluations of GRL models in malware detection and enhance both comparability and reproducibility. To stay current and effective, these resources must be regularly updated to reflect the evolving nature of malware.

8.1.3 Vulnerability to Adversarial Attacks.

Increasing Frequency and Sophistication of Adversarial Attacks. The vulnerability of ML models to adversarial attacks poses a significant challenge. Such attacks can occur during both the training phase, known as poisoning, and the inference phase, known as evasion. They exploit the learned decision boundaries of the model to misclassify malicious samples as benign [155]. In 2020, Gartner predicted that by 2022, 30% of cyberattacks would involve tactics such as training-data poisoning, model theft, or the creation of adversarial examples [169]. However, the robustness of GRL-based malware detection models against adversarial attacks remains an area that is still under-explored. One possible approach to enhance the resilience of detection models is to include adversarial examples in the training process, a method known as adversarial training. Other strategies, such as those presented in Section 7, could also complicate the efforts of attackers to manipulate the model with adversarial inputs. These mechanisms aim to reduce the attack surface accessible to attackers or hide model details that could be exploited.

Indifference to the Security of ML Models. Research into the practical application of adversarial ML reveals a significant gap between academic interest and industry implementation. Surveys and studies conducted from 2020 to 2022 highlight a concerning trend: many practitioners and ML developers are either unaware of or indifferent to the security of their ML models [156]; a trend also confirmed by the rare mentions of adversarial attack robustness in the works reviewed in this study. Initial surveys found a lack of response and concern primarily focused on poisoning attacks. Subsequent investigations revealed that a significant portion of ML models, especially those deployed in mobile apps, lack any form of protection. Furthermore, many developers admit to a lack of understanding on how to secure ML systems against adversarial threats, with a general attitude of questioning the necessity of countermeasures. Despite warnings and predictions about the increasing role of adversarial tactics in cyberattacks, the industry’s position on prioritizing the security of ML models remains largely unchanged. This indicates a widespread underestimation of the risks associated with adversarial attacks.

8.2 Comparison with Other DL & ML Techniques

In the context of malware detection, GRL-based methods have unique advantages and drawbacks compared to other DL and ML models.

8.2.1 Strengths of GRL Techniques.

The primary advantage of GRL methods lies in their ability to capture and model the intricate relationships and dependencies present in graph-structured data. Unlike traditional DL methods such as CNNs that excel in grid-like data structures (e.g., images), or RNNs that are designed for sequential data, GNNs are powerful at handling data represented in graphs. This is particularly beneficial for malware detection, where the relationships between system entities (e.g., files, processes, network connections) can be naturally modeled as a graph. For instance, the propagation patterns of malware within a network can be effectively captured using GNNs, providing insights that are not readily accessible through other ML techniques [6]. The intrinsic structure of graphs also enhances the robustness of graph-based detection systems against adversarial attacks. This is because altering attack patterns to evade detection, while maintaining the malicious payload, becomes more challenging. For detecting specific attack schemes, such as those detailed in Common Vulnerabilities and Exposures (CVEs), graphs can represent these patterns robustly, making evasion difficult for attackers [43]. Moreover, GNNs and GRL techniques efficiently create low-dimensional embeddings for graph entities (nodes, edges, or subgraphs). Consequently, these embeddings aid in classifying entities, predicting links, and detecting malware by identifying complex patterns. This notably enhances the detection of advanced malware with evasion or polymorphic behaviors [25, 79, 80].

Finally, leveraging these representations alongside the original graph structure enables a visual and intuitive graph-based examination of predictions. These predictions play a crucial role in providing explicability, unlike other ML and DL techniques that may only yield a single prediction or an overall embedding for the entire malware. In summary, it provides detailed embeddings of the individual entities within the graph and enables a deeper analysis of the attacks and false positives.

Weaknesses of GRL Techniques. Despite their advantages, GNNs and GRL techniques are not devoid of challenges. One of the primary limitations is scalability. Processing large-scale graphs, as often encountered in cybersecurity applications, poses significant computational and memory demands. This is due to the necessity to aggregate and process information from potentially large neighborhoods for each node in the graph, which can quickly become infeasible for graphs with millions of nodes and edges [170]. Another critical issue specific to GNNs is the phenomenon of over-smoothing, a condition where multiple layers of a GNN model cause the node features to converge to similar values, making them indistinguishable. This over-generalization can lead to a loss of critical information, reducing the ability of the model to differentiate between benignware and malware [171]. Furthermore, the performance of GNNs and GRL techniques is heavily dependent on the quality and structure of the input graph. In the context of malware detection, the choice of the input graph representation (i.e., FCG, CFG, etc.) has a significant impact on the overall detection capabilities, as some graphs may capture important behaviors that others could miss. The dynamic nature of cyber threats, which continuously evolve to evade detection, further exacerbates this challenge. This notably requires the graphs to be constantly analyzed to ensure they accurately capture these evolving patterns.

In summary, while GNNs and GRL techniques offer promising avenues for improving malware detection through their ability to capture important relations, they also introduce specific challenges related to scalability, model depth sensitivity, and dependency on the quality of graph construction. Tackling these challenges requires continuous research and innovation to unlock the full potential of GRL techniques in malware detection.

9 Conclusion

In this paper, we provide an in-depth review of GRL techniques applied to the detection of Android and Windows malware. We first introduced fundamental knowledge to understand graph-based learning methods along with the graph structures commonly employed in malware detection. We reviewed and classified state-of-the-art works in a comprehensive way and provide descriptions and insights on the datasets that can be leveraged to represent malware as graphs. We notably found that most existing techniques can be represented under a same architecture based on graph classification, which is presented and used as reference in this survey. We also discovered that many recent works prefer leveraging GNNs in combination with word embedding techniques to learn the semantic of disassembled code along with the structural patterns of existing malware. This survey also shows that effective adversarial attacks can be used by attackers in an attempt to fool graph-based detection systems. The analysis of recent papers demonstrates the promising future of GRL methods applied to malware detection, and as a result, we have provided future research directions based on the current challenges that can be addressed.

References

[1]

Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda. 2006. Dynamic analysis of malicious code. Journal in Computer Virology, Springer (2006).

Abstract

1 Introduction

2 Related Works

3 Background

3.1 Graph Structures

3.2 Learning on Graphs

3.2.1 Representation Learning.

3.2.2 Graph Representation Learning.

3.2.3 Graph Neural Networks.

4 Malware Detection with Graphs

4.1 Problem Definition

4.2 Modeling Malware as Graphs

4.2.1 Analysis Methods for Feature Extraction.

4.2.2 Common Graph Structures for Malware Detection.

4.3 Methodology of Malware Detection with Graph Representation Learning

5 Graph-based Android Malware Detection

5.1 Android-based Graph Structures

5.2 Android-based Approaches

5.2.1 CFG Approaches for Android Malware Detection.

5.2.2 FCG Approaches for Android Malware Detection.

5.2.3 PDG Approaches for Android Malware Detection.

5.2.4 System Call Approaches for Android Malware Detection.

5.3 Android Malware Datasets

6 Graph-based Windows Malware Detection

6.1 Windows-based Program Graph Structures

6.2 Windows-based Approaches

6.2.1 CFG Approaches for Windows Malware Detection.

6.2.2 FCG Approaches for Windows Malware Detection.

6.2.3 Entity Graph Approaches for Windows Malware Detection.

6.3 Windows Malware Datasets

7 Adversarial Attacks

7.1 Background

7.2 Adversarial Attacks and Malware Detection

7.3 Adversarial Attacks on Graph-based Malware Detection

7.4 Real-world Implications for Attackers

8 Discussion

8.1 Application in Real-World Environments

8.1.1 Matching Real-World Requirements.

8.1.2 Reliance on Noisy Datasets.

8.1.3 Vulnerability to Adversarial Attacks.

8.2 Comparison with Other DL & ML Techniques

8.2.1 Strengths of GRL Techniques.

9 Conclusion

References

Cited By

Index Terms

Recommendations

A comprehensive survey on deep learning based malware detection techniques

Malware detection using adaptive data compression

Machine Learning based Malware Detection with the 2019 KISA Data Challenge Dataset

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations