On the Effectiveness of Machine Learning-based Call-Graph Pruning: An Empirical Study

Uploaded by AMir120 · 22 slides · Jun 06, 2024

About This Presentation

I presented our paper, "On the Effectiveness of Machine Learning-based Call-Graph Pruning: An Empirical Study", at the 21st International Conference on Mining Software Repositories (MSR'24) in Lisbon, Portugal.


Slide Content

On the Effectiveness of Machine Learning-based
Call-Graph Pruning: An Empirical Study
Amir M. Mir, Mehdi Keshani, Sebastian Proksch
Software Engineering Research Group, TU Delft, Netherlands

Introduction
2

Call Graphs
Source: Walunj, V., Gharibi, G., Ho, D. H., & Lee, Y. (2019, December). GraphEvo: Characterizing and understanding software evolution using call graphs. In 2019 IEEE International Conference on Big Data (Big Data) (pp. 4799-4807). IEEE.
3

Call Graphs
4

Over-approximation in Call Graphs
5

Motivation
-Static call graphs are over-approximated
  -Leads to imprecision (superfluous edges)
  -False warnings in downstream analyses
-Reducing the size of call graphs
  -Saves storage and memory
  -Makes downstream analyses more scalable
6

Call Graph Pruning


7

Related Work
Limitations
-Previous work: CGPruner [1] and AutoPruner [2]
-Trained and evaluated on a dataset without real-world programs
-Highly imbalanced dataset (actual invocations are rare)
-After CG pruning, recall drops substantially, by more than 25%
-Lack of comparison with context-sensitive CG techniques (k-CFA)
Source:
[1] Utture, A., Liu, S., Kalhauge, C. G., & Palsberg, J. (2022, May). Striking a balance: pruning false-positives from static call graphs. In Proceedings of the 44th
International Conference on Software Engineering (pp. 2043-2055).
[2] Le-Cong, T., Kang, H. J., Nguyen, T. G., Haryono, S. A., Lo, D., Le, X. B. D., & Huynh, Q. T. (2022, November). Autopruner: transformer-based call graph
pruning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp.
520-532).
8

Contribution
-A new benchmark dataset, NYXCorpus, containing real-world programs
-An empirical study on the effectiveness of ML-based call graph pruning, which
studies current issues, proposes solutions, and evaluates their effects.
-Adapting existing ML models to support weighted training and customizable
pruning through confidence levels
9

Methodology
10

Dataset Creation
NJR-1
+Has dynamic CGs (ground truth)
-Dependencies missing
-Not representative

XCorpus
+Compilable projects w/ unit tests
+Contains real-world programs
-No static/dynamic CGs provided

YCorpus
+Projects w/ high test coverage (>80%)
+Real-world programs
-No static/dynamic CGs provided

NYXCorpus
11

Call Graph Generation
12
Pipeline: NYXCorpus → static CG generation (WALA) and dynamic CG generation (ground truth) → filtering and edge sampling

Research Methodology
Extracting Code Features
●Structural
○No. of nodes/edges in call graph
○No. of edges in/out of callee and caller
●Semantic
○Source and target method body
●Signature
●Combined
○Both structural and semantic
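The structural features above can be sketched as follows. This is an illustrative example, not the paper's implementation: the function name `structural_features` and the adjacency-dict representation of the call graph are assumptions.

```python
# Hypothetical sketch: computing the structural features listed above for one
# candidate edge (caller -> callee). The call graph is a dict mapping each
# caller node to its list of callee nodes.

def structural_features(cg, caller, callee):
    """Count nodes/edges in the CG and the in/out degrees of caller and callee."""
    nodes = set(cg) | {t for targets in cg.values() for t in targets}
    num_edges = sum(len(targets) for targets in cg.values())
    out_deg = {n: len(cg.get(n, [])) for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for targets in cg.values():
        for t in targets:
            in_deg[t] += 1
    return {
        "num_nodes": len(nodes),
        "num_edges": num_edges,
        "caller_out": out_deg[caller], "caller_in": in_deg[caller],
        "callee_out": out_deg[callee], "callee_in": in_deg[callee],
    }

cg = {"main": ["parse", "run"], "run": ["parse"]}
print(structural_features(cg, "main", "parse"))
```

These per-edge counts, together with the semantic features, form the input to the pruning classifier.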

13

Research Methodology
Semantic Feature

[CLS]⟨caller’s source code⟩[SEP]⟨callee’s source code⟩[EOS]
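Assembling that input sequence is straightforward string concatenation; a minimal sketch, assuming CodeBERT-style special tokens as literal markers (a real tokenizer would map them to special token IDs):

```python
# Illustrative sketch of building the semantic-feature input shown above:
# caller and callee method bodies joined with [CLS]/[SEP]/[EOS] markers.

def build_input(caller_src: str, callee_src: str) -> str:
    return f"[CLS]{caller_src}[SEP]{callee_src}[EOS]"

seq = build_input("void main(){run();}", "void run(){}")
print(seq)
```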

14

Evaluation
15

Evaluation Metrics

●F2 score, which emphasizes recall over precision

16

General performance of ML-based CG Pruners
-F2 scores are lower than WALA's
17

Conservative CG Pruning
●Incorporating weights into the learning process
○Weight of 0.99 on retaining edges
○Big improvement in recall (30%)
○12% higher F2 scores
●Confidence levels give control over the recall/precision trade-off when
pruning
○An F2 score of 76% at the 0.95 confidence level
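Confidence-level pruning can be sketched as below. This is a simplified illustration, not the paper's code: it assumes the classifier outputs P(edge is a true call) per static-CG edge, and an edge is removed only when the model is at least `confidence` sure the edge is spurious.

```python
# Conservative pruning sketch: drop a static-CG edge only when the model's
# confidence that it is a false edge reaches the chosen level.

def prune(edge_scores, confidence=0.95):
    """edge_scores maps (caller, callee) -> P(edge is a true call).
    Returns the set of edges to keep."""
    return {edge for edge, p_true in edge_scores.items()
            if (1.0 - p_true) < confidence}

scores = {("a", "b"): 0.90, ("a", "c"): 0.02, ("b", "c"): 0.40}
print(sorted(prune(scores, confidence=0.95)))  # ("a", "c") is pruned
```

Raising the confidence level keeps more edges (higher recall, lower precision); lowering it prunes more aggressively.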
18

The effect of CG pruning on vulnerability propagation
19

The effect of CG pruning on vulnerability propagation
-With "paranoid" (most conservative) pruning:
-1.4% loss in reachability
-22% faster analysis time
-~30% smaller CGs
20

Discussion
-Ground truth with assumptions based on high test coverage
-Better code features/representation
-A proper benchmark for different downstream analyses
21

Summary
Feel free to reach out!
@amir_mir93
[email protected]
mir-am
22