FASTEN was presented in the Devroom on Dependency Management at FOSDEM 2021. Presentation Abstract: The goal of the EU project FASTEN is being able to perform a more sophisticated analysis of security-vulnerability propagation, licensing compliance, and dependency risk profiles (among others) by relying on the call-level dependency network of the whole software ecosystem. We outline the purpose and structure of the project, and present some preliminary results.
Software Ecosystems as Networks - Advances on the FASTEN project, Paolo Boldi, FOSDEM 2021
7. Sharing through software libraries
❖ Internet made the dream of collaborative development a
reality, by means of libraries that are made available:
❖ on repositories (SourceForge, GitHub, BitBucket, …)
❖ or forges (Maven, PyPI, CPAN, …)
9. Industrial revolution
at the harbour of software development
❖ All trades, arts, and handiworks have gained by
division of labour, namely, when, instead of one
man doing everything, each confines himself to a
certain kind of work distinct from others in the
treatment it requires, so as to be able to perform it
with greater facility and in the greatest
perfection. Where the different kinds of work are
not distinguished and divided, where everyone is
a jack-of-all-trades, there manufactures remain
still in the greatest barbarism.
Immanuel Kant
Groundwork for the Metaphysics of Morals (1785)
13. Dependency graphs
❖ Library+versions and their
dependencies form (complex,
huge) dependency networks
❖ Version constraints make these
networks more complicated
than simple graphs
❖ Package manager will finally
determine which version is
chosen for each library
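To make the role of version constraints concrete, here is a minimal sketch of how a package manager might pick one version per library. The constraint syntax (half-open ranges), the function names and the data are all assumptions for illustration; real package managers use richer rules.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, ...]
# A constraint is a half-open interval [low, high) over versions
# (an assumed, simplified syntax, not the one of any real package manager).
Constraint = Tuple[Version, Version]


def parse(version: str) -> Version:
    """Turn '2.5' into (2, 5) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_version(available: List[str], constraints: List[Constraint]) -> Optional[str]:
    """Return the highest available version satisfying every constraint, or None."""
    candidates = [
        v for v in available
        if all(low <= parse(v) < high for low, high in constraints)
    ]
    return max(candidates, key=parse) if candidates else None


# Hypothetical data: Lib D is required by two dependents with different ranges.
available_d = ["2.9", "3.0", "3.1", "4.0"]
constraints_d = [
    (parse("2.0"), parse("4.0")),  # one dependent accepts [2.0, 4.0)
    (parse("3.0"), parse("3.1")),  # another accepts [3.0, 3.1)
]
chosen = pick_version(available_d, constraints_d)
```

The dependency network therefore only becomes a plain graph once every constraint has been resolved to a single version.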
27. Ecosystems
❖ Ecosystems grow at mind-boggling speed
❖ JavaScript projects have an average of 80 transitive dependencies
(Zimmermann et al., 2019)
❖ 50% of dependencies change within 6 months (Hejderup et al.,
2019)
❖ And deteriorate almost as rapidly
❖ Existence of package bottlenecks (the removal of one single
package can bring down almost 40% of the system)
❖ Rich get richer: few maintainers dominate most packages
30. Epidemics in dependency graphs
Lib A, vers 1.0
Lib B, vers 2.5
Lib C, vers 1.5
Lib D, vers 3.0
A vulnerability alert is issued about Lib D, vers 3.0
All libraries in this
graph are infected!
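A package-level analysis of this kind can be sketched as a reverse reachability search. The edges below are an assumption (the arrows exist only in the slide figure), and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical edges "X depends on Y"; the slide text does not spell out the arrows.
depends_on: Dict[str, List[str]] = {
    "A:1.0": ["B:2.5", "C:1.5"],
    "B:2.5": ["D:3.0"],
    "C:1.5": ["D:3.0"],
    "D:3.0": [],
}


def affected_by(vulnerable: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return the vulnerable package plus everything that transitively depends on it."""
    # Invert the edges so we can walk from a package to its dependents.
    dependents: Dict[str, List[str]] = {pkg: [] for pkg in deps}
    for pkg, targets in deps.items():
        for target in targets:
            dependents.setdefault(target, []).append(pkg)

    seen = {vulnerable}
    queue = deque([vulnerable])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


infected = affected_by("D:3.0", depends_on)
```

With these assumed edges every package is flagged, which is exactly the coarse result the slide describes.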
37. Isn’t this kind of tool enough?
❖ In theory. But in practice:
❖ Developers don’t update
❖ → Vulnerabilities proliferate
❖ Why?
❖ Our tools are not sharp enough for what we want
38. Examples of what people want
❖ Update
    ❖ Developers: Does this outdated dependency really break my code?
    ❖ Maintainers: How do I update without breaking too many of my important clients?
❖ Violations
    ❖ Developers: Am I violating anyone’s copyright?
    ❖ Maintainers: How do I spot instances of my code being distributed without permission?
43. Epidemics in dependency graphs
A.f0
A.f2
A.f3
B.f1
B.f2
B.f3
C.f1
C.f2
D.f1
D.f2
D.f3
A vulnerability alert is issued about Lib D, vers 3.0, function f3
Much more informative!
45. Epidemics in dependency graphs
A.f0
A.f2
A.f3
B.f1
B.f2
B.f3
C.f1
C.f2
D.f1
D.f2
D.f3
A vulnerability alert is issued about Lib D, vers 3.0, function f3
Avoid the cry wolf effect!
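The same search run at call level shows why fewer alerts fire. This is a sketch under assumptions: the function names come from the slide, but the call edges are invented, since the real ones are only in the figure.

```python
from typing import Dict, List, Set

# Hypothetical call edges "caller -> callees" between the functions named on the slide.
calls: Dict[str, List[str]] = {
    "A.f0": ["A.f2", "B.f1"],
    "A.f2": ["B.f2"],
    "A.f3": ["C.f2"],
    "B.f1": ["D.f1"],
    "B.f2": ["B.f3"],
    "B.f3": ["D.f2"],
    "C.f1": ["D.f1"],
    "C.f2": ["D.f3"],
    "D.f1": [],
    "D.f2": [],
    "D.f3": [],
}


def callers_reaching(target: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every function from which `target` is reachable (target included)."""
    reverse: Dict[str, List[str]] = {}
    for caller, callees in graph.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)

    reached = {target}
    stack = [target]
    while stack:
        node = stack.pop()
        for caller in reverse.get(node, []):
            if caller not in reached:
                reached.add(caller)
                stack.append(caller)
    return reached


exposed_functions = callers_reaching("D.f3", calls)
exposed_libraries = {name.split(".")[0] for name in exposed_functions}
```

Under these assumed edges Lib B uses Lib D but never reaches `D.f3`, so it would not be alerted, whereas a package-level analysis would flag it.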
50. Examples
❖ Fully precise change impact analysis: “How many libraries
are affected if I remove/modify a certain method/interface?”
❖ Fully precise license compliance: “Is my library compliant
with the licenses of the libraries that I depend on (directly or
indirectly)? (e.g., am I linking any GPL code?)”
❖ Fully precise risk profiling: “Does this vulnerability affect my
code?”
❖ Centrality analysis: “Which methods/functions are most central
within a given ecosystem? Are there bottlenecks? Critical points?”
62. The FASTEN toolchain
[Architecture diagram. Labels: project information, security alerts and repositories (each with a "publish" arrow); data stream; FASTEN server; call-graph construction; storage layer; analysis layer; REST API; Web UI; continuous integration server; developer.]
67. Universal function identifiers
How to uniquely reference a function in a global namespace?
fasten:// (scheme)
/mvn (forge)
/org.slf4j.slf4j-api (artifact)
/1.2.3 (version)
/org.slf4j.helpers (namespace)
/BasicMarkerFactory.getDetachedMarker (function)
(%2Fjava.lang%2FString) (argument(s))
%2Forg.slf4j%2FMarker (return type)
Generic format + Java, Python, C
Status: Done
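As an illustration of how the labelled components could be held together, here is a small sketch. The `FunctionId` class and its method are hypothetical and not part of FASTEN; only the component names and the example values are taken from the slide. It does not attempt to define how the parts are concatenated into one URI string.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import unquote


@dataclass(frozen=True)
class FunctionId:
    """The components labelled on the slide; the class itself is hypothetical."""
    forge: str
    artifact: str
    version: str
    namespace: str
    function: str
    arguments: Tuple[str, ...]  # percent-encoded type references
    return_type: str            # percent-encoded type reference

    def decoded_signature(self) -> str:
        """Human-readable signature with %2F decoded back to '/'."""
        args = ", ".join(unquote(a) for a in self.arguments)
        return f"{self.function}({args}) -> {unquote(self.return_type)}"


marker = FunctionId(
    forge="mvn",
    artifact="org.slf4j.slf4j-api",
    version="1.2.3",
    namespace="org.slf4j.helpers",
    function="BasicMarkerFactory.getDetachedMarker",
    arguments=("%2Fjava.lang%2FString",),
    return_type="%2Forg.slf4j%2FMarker",
)
```

Because the record is frozen it is hashable, so two identifiers are equal only if every component, including the version, matches.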
71. Language-dependent call graph generation
❖ Java: Based on tools from the OPAL project (stg-tud/opal)
❖ Python: New static analysis tool: PyCG (Submitted ICSE 2020)
❖ C: CScout for static call graphs; gprof, callgrind for dynamic calls
Status: Done
72. Current CG results
Language / Ecosystem | Total packages | Results: packages | Results: nodes | Results: edges | Success rate
C / Debian Buster | 7,380 (757 analyzed) * | 531 | 491,721 | 579,253 | 70%
Java / Maven | 2.7M artifacts | 2.4M | ~5B+ | ~56B+ | 89.13%
Python / PyPI | ~740K | ~520K | ~211M | ~310M | 70%
* Technical issues prohibited us from downloading the rest of the packages.
Status: In progress
78. Call graph stitching
How to scale call graph processing to 10^6 package versions?
❖ Idea: Decouple package resolution from call graph generation
❖ Build and store call graphs per package version, incl.:
    ❖ unresolved calls
    ❖ class hierarchies (Java, Python)
❖ Call graph stitching: Resolve unresolved calls given a dependency tree
Status: In progress
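The stitching idea can be sketched as follows. This is a simplified illustration, not FASTEN's implementation: the storage layout, names and data are assumptions, and class hierarchies (needed for virtual dispatch in Java and Python) are left out.

```python
from typing import Dict, Set, Tuple

# Hypothetical per-version call graphs, stored once per package version.
# "internal" holds resolved edges; "unresolved" holds calls into other packages,
# recorded as (caller, package, function) because the target version is not yet known.
stored = {
    ("A", "1.0"): {
        "functions": {"f0"},
        "internal": [],
        "unresolved": [("f0", "D", "f3")],
    },
    ("D", "3.0"): {
        "functions": {"f1", "f3"},
        "internal": [("f3", "f1")],
        "unresolved": [],
    },
}


def stitch(resolution: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Merge stored graphs for one resolved dependency set into a single edge set."""
    edges: Set[Tuple[str, str]] = set()
    for package, version in resolution.items():
        graph = stored[(package, version)]
        prefix = f"{package}:{version}"
        for caller, callee in graph["internal"]:
            edges.add((f"{prefix}.{caller}", f"{prefix}.{callee}"))
        for caller, target_pkg, target_fn in graph["unresolved"]:
            target_version = resolution.get(target_pkg)
            if target_version is None:
                continue  # dependency absent from this resolution: leave unresolved
            if target_fn in stored[(target_pkg, target_version)]["functions"]:
                edges.add((f"{prefix}.{caller}",
                           f"{target_pkg}:{target_version}.{target_fn}"))
    return edges


stitched = stitch({"A": "1.0", "D": "3.0"})
```

The per-version graphs are built once; only the cheap `stitch` step has to be repeated when the package manager resolves a different set of versions.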
80. Examples of queries:
largest packages (# of functions)
select p.package_name, pv.version, count(*)
from package_versions pv
join packages p on pv.package_id = p.id
join modules m on m.package_version_id = pv.id
join callables c on c.module_id = m.id
group by p.package_name, pv.version
order by count(*) desc
limit 10;
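To see what this query returns, it can be run against a toy database. In the sketch below SQLite stands in for the project's metadata database, the schema contains only the columns the query touches, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schema: only the columns the query touches.
    CREATE TABLE packages (id INTEGER PRIMARY KEY, package_name TEXT);
    CREATE TABLE package_versions (id INTEGER PRIMARY KEY, package_id INTEGER, version TEXT);
    CREATE TABLE modules (id INTEGER PRIMARY KEY, package_version_id INTEGER);
    CREATE TABLE callables (id INTEGER PRIMARY KEY, module_id INTEGER);

    -- Made-up rows: lib-a 1.0 has 3 callables, lib-b 2.5 has 1.
    INSERT INTO packages VALUES (1, 'lib-a'), (2, 'lib-b');
    INSERT INTO package_versions VALUES (10, 1, '1.0'), (20, 2, '2.5');
    INSERT INTO modules VALUES (100, 10), (101, 10), (200, 20);
    INSERT INTO callables VALUES (1, 100), (2, 100), (3, 101), (4, 200);
""")

# The query from the slide, unchanged.
LARGEST_PACKAGES = """
    select p.package_name, pv.version, count(*)
    from package_versions pv
    join packages p on pv.package_id = p.id
    join modules m on m.package_version_id = pv.id
    join callables c on c.module_id = m.id
    group by p.package_name, pv.version
    order by count(*) desc
    limit 10;
"""

rows = conn.execute(LARGEST_PACKAGES).fetchall()
```

Each result row is a package name, a version and the number of callables found across all modules of that version.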
81. Examples of queries:
Packages depending on vulnerable package
SELECT package_version_id, p.package_name, pv.version
FROM dependencies d
JOIN package_versions pv ON pv.id = d.package_version_id
JOIN packages p ON p.id = pv.package_id
WHERE d.dependency_id =
(SELECT id
FROM packages
WHERE package_name = 'com.google.guava:guava')
AND '20.0' = ANY(d.version_range);
86. Graph analytics
(results shown refer to Java CGs)
❖ Graph stored using WebGraph (UMIL)
❖ For 1.1M graphs (2.3B nodes, 18B edges):
    ❖ 3.6 bits per edge, plus global ID storage for each node (9.0 bits per edge overall)
❖ DB size: 38GB → we can fit the whole of Maven in RAM
Status: In progress
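A quick back-of-the-envelope check of these figures, using only the edge count and bits-per-edge values from the slide. The uncompressed baseline of two 64-bit integers per edge is an assumption added for comparison.

```python
def size_gb(edges: float, bits_per_edge: float) -> float:
    """Storage in gigabytes (10^9 bytes) for a given number of edges."""
    return edges * bits_per_edge / 8 / 1e9


EDGES = 18e9  # from the slide

compressed_edges_gb = size_gb(EDGES, 3.6)  # about 8.1
with_node_ids_gb = size_gb(EDGES, 9.0)     # 20.25
# Assumed baseline, not from the slide: two 64-bit integers per edge.
naive_gb = size_gb(EDGES, 128)             # 288.0
```

The reported 38GB is larger than the 20.25 GB computed here; the slide does not say what else the database contains.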
88. Vulnerability Plugin
❖ Gathering vulnerability information (at package and callable level)
❖ A normalized Vulnerability Object definition is injected in the metadata database
❖ Normalization is needed to smooth out the different sources of information
❖ The plugin continuously pulls updates for new information and keeps storing the results
Status: In progress
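What normalization might look like can be sketched with two invented feed formats mapped onto one record type. Nothing here reflects the plugin's actual Vulnerability Object; every field name and both input layouts are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vulnerability:
    """Hypothetical normalized record; field names are illustrative only."""
    vuln_id: str
    package: str
    affected_versions: List[str]
    callables: List[str] = field(default_factory=list)  # empty if only package-level data


def from_source_a(raw: Dict[str, Any]) -> Vulnerability:
    """Made-up feed A: flat keys, versions as a comma-separated string."""
    return Vulnerability(
        vuln_id=raw["id"],
        package=raw["pkg"],
        affected_versions=[v.strip() for v in raw["versions"].split(",")],
    )


def from_source_b(raw: Dict[str, Any]) -> Vulnerability:
    """Made-up feed B: nested keys, with callable-level detail."""
    return Vulnerability(
        vuln_id=raw["advisory"]["identifier"],
        package=raw["affected"]["name"],
        affected_versions=list(raw["affected"]["versions"]),
        callables=list(raw["affected"].get("functions", [])),
    )


record_a = from_source_a({"id": "VULN-1", "pkg": "lib-d", "versions": "3.0, 3.1"})
record_b = from_source_b({
    "advisory": {"identifier": "VULN-1"},
    "affected": {"name": "lib-d", "versions": ["3.0", "3.1"], "functions": ["D.f3"]},
})
```

Once both feeds produce the same record shape, package-level and callable-level information can be stored and queried uniformly.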
89. Analysis plug-ins
RAPID: Risk Analysis and Propagation Inspection for Security and Maintainability risks
❖ On the server side (to enrich the metadata DB):
    ❖ Plugin for code maintainability analysis: V1 deployed, processed 126K Maven coordinates to date
    ❖ Plugin for security vulnerability propagation
❖ On the client side:
    ❖ A user application to model and present risks
Status: In progress
90. License and Compliance analysis
❖ QMSTR Plugin consists of 3 steps:
1. Use the CG generator to gather information about all the generated artifacts that will be distributed together with the source code
2. Execution of static analysis tools that augment the build graph with license and compliance metadata
3. Generation of a report with package's relevant license and authorship metadata that is finally distributed
Status: In progress
92. REST API
❖ Implementation of endpoints to expose canned queries from the metadata database
❖ In development:
    ❖ Full DB entity support
    ❖ Custom extension points
Status: In progress
93. Use cases
❖ Endocode
    ❖ Endocode developed a license-compliance solution, called Quartermaster
    ❖ They are integrating FASTEN to improve the precision of their compliance offering
❖ SIG
    ❖ Integration of FASTEN in BetterCodeHub, their GitHub-connected code quality monitoring product
❖ XWiki
    ❖ Risk validation in the dependencies at Maven build time
    ❖ Risk validation in the installed extensions of an XWiki instance
    ❖ Filter out available compatible extensions for an XWiki instance
    ❖ Discoverability of XWiki components in available extensions
Status: In progress
95. The future
Timeline: End 2020, Q1 2021, Q2 2021, Q3 2021, Q4 2021, Q1 2022 FASTEN 2?
Milestones (in slide order):
❖ REST API, first full version of knowledge base, CG enrichment, build graph integration, first public announcement
❖ Impact analysis, integration with MVN / PyPI; first external user
❖ Industrial use cases integrated; first external adoption
❖ Licensing and security fully integrated; Data-driven API evolution
❖ Project finished; external integrations