5.2.1 CFG Approaches for Android Malware Detection.
CFGs offer a remarkable abstract representation of programs to detect malware. Hybroid [
29] leverages this representation by extracting basic blocks from APKs. Three types of embeddings are then constructed from the code, to capture different semantics, namely opcode embedding, basic block embedding and CFG embedding, where each representation is associated to a level of abstraction. The semantic of opcodes (Dalvik instructions) and basic blocks (sequence of instructions) is computed using the NLP-based model word2vec with Skip-gram. Precisely, Skip-gram learns embedding vectors for each basic block’s raw instructions, by using an opcode to predict its surrounding opcodes. Indeed, the operands are not leveraged here as they are affected by the usage of Dalvik VM. For the basic block embeddings, they compute the weighted mean of the inner instructions’ opcode. These basic block embeddings then become the node embeddings from the point of view of a FCG and structure2vec [
96] creates a final graph embedding vector for graph classification. In parallel, the network traffic generated by the app is also captured with dynamic analysis using Argus [
97]. The packets are transformed into flows to summarize the communication between the Android device and the destination IP addresses, using various statistics. After a feature selection step, important features are combined with the FCG embeddings for downstream classification with gradient boosting on the CICAndMal2017 dataset, where the model demonstrates a F1-score of 97% and outperforms other methods such as DREBIN [
98] and SVM [
99].
On the other hand, hybrid-Falcon [
30] transforms network flow data into 2D images on which a bi-directional LSTM captures flow representations from pixels. On the same dataset, the F1-score is further improved to 97.09%.
The paper [
31] provides a solution to locate MITRE ATT&CK
Tactics Techniques and Procedures (TTPs) detected from a subgraph of a CFG. Node representations are extracted with Inferential SIR-GN [
100] and the prediction is done using a random forest. To identify the subgraph responsible for the prediction, the authors rely on SHAP [
101] to attribute for each input feature a value that indicates its relevance for the final output. TTPs are successfully detected with a F1-score of 92.7%.
5.2.2 FCG Approaches for Android Malware Detection.
The semantic information captured by FCGs in programs makes this data structure a predominant choice in graph-based malware detection. For instance, a FCG is constructed from Smali code in work [
32], where each node is attributed with function attributes such as the method type (system API, third-party API, etc.) and the requested permissions (required permissions for the execution of the function). The graph embeddings are then trained in a supervised way using the GCN propagation rule, presented in Section
3.2.3. The Drebin dataset is used for final evaluation with a F1-score of 99.68%.
In CGdroid [
33], multiple node features are extracted from the disassembled methods in order to build an FCG that captures the semantic of functions. Indeed, each node is mapped to a vector of hand-designed features such as the number of string constants, the number of call and jump instructions, the associated permissions, etc. A GNN computes graph embeddings and a MLP is used for downstream graph classification on Drebin and AndroZoo [
102] datasets, where a baseline is outperformed by 8% in F1-score.
Word embedding techniques are employed in [
34] to consider functions similarly as words and learn the meaning of functions. The embeddings are then assigned as attributes to each corresponding function node in an FCG, and a GCN is used as graph learning method. The proposed method achieved 99.65% F1-score with random forest classifier on a private dataset.
Similarly, authors in [
35] leverage word embedding to transform Android opcodes from text to vectors using Skip-gram. In the same way as [
34], the embeddings are used as nodes in a FCG and this graph is fed into a GNN to compute a fixed-size graph embeddings vector. A 2-layer MLP is used as last layer for graph classification, with a 99.6% average accuracy.
In the reference [
36], FCGs are extracted from APKs using Androguard and each node stores attributes related to the structural meaning of the node in the graph (e.g., node degree) or features extracted from the actual disassembled functions (e.g., method attributes, method opcodes’ summary). Using these previous features, GCN, GraphSAGE, GAT and TAGCN [
103] are benchmarked together, with a better performance achieved with GraphSAGE. First, each node
i uniformly selects a fixed-size set of neighbors, denoted
\(\mathcal {N}(i)\) . Neighbors are then aggregated using a mean aggregation function such as:
where
\(\mathbf {h}_{\mathcal {N}(i)}^{(l+1)}\) is the embedding of node
i at layer
\(l+1\) and
\(h_j^{(l)}\) denotes the embedding of a neighbor node
j at layer
l. The embedding of
i at previous layer is then concatenated with the aggregated representation and then learned by a neural network.
where
\([,]\) represents the concatenation operation. The embedding is finally normalized:
Malware apps from CICMalDroid2020 are used for evaluation, with a best F1-score of 92.23%.
The paper [
37] focuses on obfuscated malware detection using call graphs attributed with nodes’ out-degree, applying the
Contextual Graph Markov Model (CGMM) [
104] to learn embeddings for classification with a feed-forward network, achieving a 97.2% macro F1-score. The authors in [
38] explain that the use of node2vec as feature extraction method is justified by a 3% increase in F1-score compared to traditional graph centrality features. Using the attention aggregation from GAT as a final step, the proposed solution reaches 94.1% in F1-score.
DeepCatra [
39] identifies critical Android APIs using
Term Frequency-Inverse Document Frequency (TF-IDF) from call traces, extracted from popular repositories such as CVE [
105] and Exploit-DB [
106]. It generates call graphs with Wala [
91], incorporating knowledge of critical APIs, and trains a custom GNN and a bi-directional LSTM to capture graph topology and temporal features, respectively. The merged output vectors from both models undergo binary classification, achieving a 95.83% F1-score, surpassing various baselines in performance.
In [
40], an enhanced FCG employs graph centrality metrics (PageRank, in/out degree, node betweenness) as node attributes. Comparing GCN, GraphSAGE, and GIN models, all incorporating jumping knowledge [
107] to counteract over-smoothing, GraphSAGE shows superior performance on multi-class classification tasks, with notable F1-scores on Malnet-Tiny and Drebin datasets.
Authors in [
41] compute node embeddings from API FCGs using word2vec and apply a
Variational Graph Auto-Encoder (VGAE) [
108] for a compact representation, achieving a 93.4% F-measure in downstream classification tasks. [
42] introduces a robust detection method against obfuscation using a GCN with subgraphs and a denoising approach, tested on a dataset comprising VirusShare and AndroZoo samples.
An innovative approach is proposed in [
43], where each Android app is viewed as a local graph with APIs as nodes, linking APIs co-existing in the same code block. A global graph captures inter-app connections via co-occurring APIs. Their approach combines GraphSAGE embeddings with app manifest features (permissions and intents) using a concatenation operation before classification by a downstream supervised classifier like a CNN. In order to test the robustness of GNN-based malware detection models, the authors also provide a generative model inspired from VGAE, which can generate adversarial API graphs to fool the predictive model.
HinDroid [
10,
11] models interactions between Android apps and API calls in an App-API graph as a
Heterogeneous Information Network (HIN) [
109], using meta-paths to extract semantic relationships. These relationships are leveraged by a multi-kernel SVM for app node classification, resulting in a 98.84% F1-score. Similarly, GDroid [
70] constructs an App-API graph from API co-occurrence, and encodes APIs with a Skip-gram model, while applying a GCN for node classification.
Hawk [
84] constructs a large heterogeneous App-API graph from over 180k APKs, capturing various relationships (App-Permission, App-Class, App-Interface) and using meta-paths and meta-graphs for semantic extraction. Custom heterogeneous GAT models for in-sample and out-sample nodes demonstrate superior malware detection capabilities against several baselines.