Software Ecosystems as Networks
Advances on the FASTEN project
Paolo Boldi
Università degli Studi di Milano
Italy
The FASTEN Project
❖ Fine-Grained Analysis of SofTware Ecosystems as Networks
❖ Part of the EU H2020-ICT-2018-2020 Program
❖ Consortium


Why FASTEN?
Sharing through software libraries
❖ The Internet made the dream of collaborative development a reality, by means of libraries that are made available:
❖ on repositories (SourceForge, GitHub, BitBucket, …)
❖ or forges (Maven, PyPI, CPAN, …)
Industrial revolution at the harbour of software development
❖ All trades, arts, and handiworks have gained by division of labour, namely, when, instead of one man doing everything, each confines himself to a certain kind of work distinct from others in the treatment it requires, so as to be able to perform it with greater facility and in the greatest perfection. Where the different kinds of work are not distinguished and divided, where everyone is a jack-of-all-trades, there manufactures remain still in the greatest barbarism.
Immanuel Kant, Groundwork for the Metaphysics of Morals (1785)
Dependency graphs
❖ Library+versions and their dependencies form (complex, huge) dependency networks
❖ Version constraints make these networks more complicated than simple graphs
❖ The package manager ultimately determines which version is chosen for each library
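As a rough illustration of the last bullet, the sketch below picks, for one library, the highest available version that satisfies every constraint placed on it. The version list, the inclusive-interval form of the constraints, and the names `parse` and `resolve` are assumptions made for this example; real package managers apply richer and ecosystem-specific rules.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, ...]
# Hypothetical constraint form: an inclusive (low, high) interval.
Constraint = Tuple[Version, Version]


def parse(version: str) -> Version:
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(available: Iterable[str],
            constraints: Iterable[Constraint]) -> Optional[str]:
    """Return the highest available version inside every interval, or None."""
    constraints = list(constraints)
    candidates = [
        v for v in available
        if all(low <= parse(v) <= high for low, high in constraints)
    ]
    return max(candidates, key=parse) if candidates else None


available_versions = ["1.0", "1.4", "2.0", "2.5", "3.0"]
# Two dependents constrain the same library differently (made-up ranges).
from_lib_a = (parse("1.0"), parse("2.5"))
from_lib_b = (parse("1.4"), parse("3.0"))
chosen = resolve(available_versions, [from_lib_a, from_lib_b])
print(chosen)
```

The point of the sketch is that the version actually used is only known after resolution, which is why a dependency network is more than a fixed graph of library versions.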
The dependency heaven
❖ Relying on an ecosystem of easy-to-use, well-written libraries made the dream of code reuse a reality
The dependency hell
❖ A bug, security breach, or legal issue concerning one single piece…
❖ …can make the whole tower fall!
Recent dependency nightmares
❖ The left-pad incident (2016): millions of websites affected
❖ The Equifax breach (2017): cost $4B
Ecosystems
❖ Ecosystems grow at mind-boggling speed
❖ JavaScript projects have an average of 80 transitive dependencies (Zimmermann et al., 2019)
❖ 50% of dependencies change within a 6-month period (Hejderup et al., 2019)
❖ And deteriorate almost as rapidly
❖ Existence of package bottlenecks (the removal of one single package can bring down almost 40% of the system)
❖ Rich get richer: a few maintainers dominate most packages
Epidemics in dependency graphs
[Figure: dependency graph of Lib A, vers 1.0; Lib B, vers 2.5; Lib C, vers 1.5; Lib D, vers 3.0]
A vulnerability alert is issued about Lib D, vers 3.0
All libraries in this graph are infected!
GitHub security alerts
But is this enough?
Isn’t this kind of tool enough?
❖ In theory. But in practice:
❖ Developers don’t update
❖ → Vulnerabilities proliferate
❖ Why?
❖ Our tools are not sharp enough for what we want
Examples of what people want
Update
❖ Developers: Does this outdated dependency really break my code?
❖ Maintainers: How do I update without breaking too many of my important clients?
Violations
❖ Developers: Am I violating anyone’s copyright?
❖ Maintainers: How do I spot instances of my code being distributed without permission?
Epidemics in dependency graphs
[Figure: the same dependency graph at function level, with nodes A.f0, A.f2, A.f3, B.f1, B.f2, B.f3, C.f1, C.f2, D.f1, D.f2, D.f3]
A vulnerability alert is issued about Lib D, vers 3.0, function f3
Much more informative!
Avoid the cry wolf effect!
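The function-level idea can be sketched as a backward reachability search over a call graph. The call edges below are hypothetical (the arrows of the slide's figure are not recoverable from the text), and `affected_by` is an illustrative name, not part of FASTEN.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical call edges (caller -> callees), made up for illustration.
CALLS: Dict[str, List[str]] = {
    "A.f0": ["B.f1"],
    "A.f2": ["B.f2"],
    "A.f3": ["C.f1"],
    "B.f1": ["D.f1"],
    "B.f2": ["D.f3"],
    "B.f3": [],
    "C.f1": ["C.f2"],
    "C.f2": ["D.f2"],
    "D.f1": [],
    "D.f2": [],
    "D.f3": [],
}


def affected_by(vulnerable: str, calls: Dict[str, List[str]]) -> Set[str]:
    """Return every function that can reach `vulnerable` through calls."""
    # Invert the edges so we can walk from callee back to callers.
    callers: Dict[str, List[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    seen: Set[str] = set()
    queue = deque([vulnerable])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


affected_functions = affected_by("D.f3", CALLS)
affected_libraries = {name.split(".")[0] for name in affected_functions}
print(sorted(affected_functions))
print(sorted(affected_libraries))
```

With these made-up edges, Lib C depends on Lib D but never reaches D.f3, so a function-level alert would leave it alone, whereas a package-level alert would flag it.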
Examples
❖ Fully precise change impact analysis: “How many libraries are affected if I remove/modify a certain method/interface?”
❖ Fully precise license compliance: “Is my library compliant with the licenses of the libraries that I depend on (directly or indirectly)? (e.g., am I linking any GPL code?)”
❖ Fully precise risk profiling: “Does this vulnerability affect my code?”
❖ Centrality analysis: “Which methods/functions are most central within a given ecosystem? Are there bottlenecks? Critical points?”
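To make the license-compliance question concrete, here is a minimal package-level sketch that walks direct and indirect dependencies and reports those carrying a GPL license. The package names, the dependency edges, and the license labels are all invented for illustration; a function-level variant would run the same traversal over call edges instead of dependency edges.

```python
from typing import Dict, List, Set

# Hypothetical dependency edges and license labels, for illustration only.
DEPENDS_ON: Dict[str, List[str]] = {
    "my-lib": ["parser", "logger"],
    "parser": ["text-utils"],
    "logger": [],
    "text-utils": [],
}
LICENSE: Dict[str, str] = {
    "my-lib": "Apache-2.0",
    "parser": "MIT",
    "logger": "BSD-3-Clause",
    "text-utils": "GPL-3.0-only",
}


def transitive_dependencies(root: str,
                            depends_on: Dict[str, List[str]]) -> Set[str]:
    """Collect everything `root` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(root, []))
    while stack:
        package = stack.pop()
        if package not in seen:
            seen.add(package)
            stack.extend(depends_on.get(package, []))
    return seen


def gpl_dependencies(root: str,
                     depends_on: Dict[str, List[str]],
                     licenses: Dict[str, str]) -> List[str]:
    """List transitive dependencies whose license label starts with 'GPL'."""
    return sorted(
        package
        for package in transitive_dependencies(root, depends_on)
        if licenses.get(package, "").startswith("GPL")
    )


gpl_hits = gpl_dependencies("my-lib", DEPENDS_ON, LICENSE)
print(gpl_hits)
```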
The FASTEN toolchain
[Diagram components: Project information, Security alerts, Repositories (each with a "publish" arrow); Data stream; FASTEN server (Call-graph construction, Storage layer, Analysis layer, REST API); Web UI; Continuous integration server; Developer]
Preliminary results
Server-side highlights
Dataflow example: CG generation
Status: Done
Universal function identifiers
How to uniquely reference a function in a global namespace?
scheme: fasten://
forge: /mvn
artifact: /org.slf4j.slf4j-api
version: /1.2.3
namespace: /org.slf4j.helpers
function: /BasicMarkerFactory.getDetachedMarker
argument(s): (%2Fjava.lang%2FString)
return type: %2Forg.slf4j%2FMarker
Generic format + Java, Python, C
Status: Done
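The argument and return-type parts of the identifier use percent-encoding, where `%2F` stands for `/`. The sketch below only decodes the components exactly as they appear on the slide; the dictionary layout and variable names are assumptions for illustration, and it does not attempt to reproduce FASTEN's actual identifier parser.

```python
from urllib.parse import unquote

# Components copied from the slide, keyed by the slide's own labels.
COMPONENTS = {
    "scheme": "fasten://",
    "forge": "/mvn",
    "artifact": "/org.slf4j.slf4j-api",
    "version": "/1.2.3",
    "namespace": "/org.slf4j.helpers",
    "function": "/BasicMarkerFactory.getDetachedMarker",
    "argument(s)": "(%2Fjava.lang%2FString)",
    "return type": "%2Forg.slf4j%2FMarker",
}

# Percent-decoding turns %2F back into '/', revealing the nested type names.
decoded = {label: unquote(value) for label, value in COMPONENTS.items()}

for label, value in decoded.items():
    print(f"{label:12} {value}")
```

Encoding the slashes inside type names presumably keeps them from being confused with the separators between the outer components.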
Call graph transport
{
  "product": "foo",
  "forge": "mvn",
  "depset": [
    [
      { "product": "a", "forge": "mvn", "constraints": ["[1.2..1.5]", "[2.3..]"] },
      { "product": "b", "forge": "mvn", "constraints": ["[2.0.1]"] }
    ]
  ],
  "version": "3.10.0.7",
  "cha": {
    "/name.space/A": {
      "methods": {
        "0": "/name.space/A.A()%2Fjava.lang%2FVoidType",
        "1": "/name.space/A.g(%2Fjava.lang%2FString)%2Fjava.lang%2FInteger"
      },
      "superInterfaces": [ "/java.lang/Serializable" ],
      "sourceFile": "filename.java",
      "superClasses": [ "/java.lang/Object" ]
    }
  },
  "graph": {
    "internalCalls": [
      [ 0, 1 ]
    ],
    "externalCalls": [
      [ "1", "///their.package/TheirClass.method()Response", { "invokeinterface": "1" } ]
    ]
  },
  "timestamp": 123
}
Generic format + Java, Python, C
Status: Done
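A small reader for a record of this shape might look as follows. It assumes that the keys under "methods" are the numeric IDs used as call sources in "internalCalls" and "externalCalls"; that reading is inferred from the example above rather than from a format specification, and `method_table` is an illustrative name.

```python
import json
from urllib.parse import unquote

# A trimmed copy of the record shown on the slide.
RECORD = json.loads("""
{
  "product": "foo",
  "version": "3.10.0.7",
  "cha": {
    "/name.space/A": {
      "methods": {
        "0": "/name.space/A.A()%2Fjava.lang%2FVoidType",
        "1": "/name.space/A.g(%2Fjava.lang%2FString)%2Fjava.lang%2FInteger"
      }
    }
  },
  "graph": {
    "internalCalls": [[0, 1]],
    "externalCalls": [
      ["1", "///their.package/TheirClass.method()Response", {"invokeinterface": "1"}]
    ]
  }
}
""")


def method_table(record: dict) -> dict:
    """Map each numeric method ID to its percent-decoded identifier."""
    table = {}
    for type_info in record["cha"].values():
        for key, uri in type_info["methods"].items():
            table[int(key)] = unquote(uri)
    return table


methods = method_table(RECORD)
# IDs appear both as numbers and as strings in the example, so normalize.
internal = [(methods[int(src)], methods[int(dst)])
            for src, dst in RECORD["graph"]["internalCalls"]]
external = [(methods[int(src)], target)
            for src, target, _metadata in RECORD["graph"]["externalCalls"]]

print(internal)
print(external)
```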
Language-dependent call graph generation
❖ Java: Based on tools from the OPAL project (stg-tud/opal)
❖ Python: New static analysis tool: PyCG (Submitted ICSE 2020)
❖ C: CScout for static call graphs; gprof, callgrind for dynamic calls
Status: Done
Current CG results
                     |                        | Results
Language / Ecosystem | Total Packages         | Packages | Nodes   | Edges   | Success Rate
C / Debian Buster    | 7,380 (757 analyzed) * | 531      | 491,721 | 579,253 | 70%
Java / Maven         | 2.7M artifacts         | 2.4M     | ~5B+    | ~56B+   | 89.13%
Python / PyPI        | ~740K                  | ~520K    | ~211M   | ~310M   | 70%
* Technical issues prevented us from downloading the rest of the packages.
Status: In progress
Call graph stitching
How to scale call graph processing to 10^6 package versions?
❖ Idea: Decouple package resolution from call graph generation
❖ Build and store call graphs per package version, incl.:
❖ unresolved calls
❖ class hierarchies (Java, Python)
❖ Call graph stitching: Resolve unresolved calls given a dependency tree
Status: In progress
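The decoupling described above can be sketched as follows: each package version stores its own call graph once, with calls into other packages left unresolved, and a later step binds those calls to whichever versions a given resolution selects. The data layout, package names, and the `stitch` function are hypothetical and ignore class hierarchies and dynamic dispatch.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Hypothetical stored graphs, one per package version. Unresolved calls
# name the target package and function but not the target version.
PACKAGE_GRAPHS = {
    ("app", "1.0"): {
        "internal": [("app.main", "app.helper")],
        "unresolved": [("app.helper", "lib", "lib.parse")],
    },
    ("lib", "2.0"): {
        "internal": [("lib.parse", "lib.tokenize")],
        "unresolved": [],
    },
    ("lib", "3.0"): {"internal": [], "unresolved": []},
}


def stitch(resolution: Dict[str, str], graphs: dict) -> List[Edge]:
    """Merge stored graphs for the resolved versions into one edge list."""
    edges: List[Edge] = []
    for package, version in resolution.items():
        graph = graphs[(package, version)]
        prefix = f"{package}@{version}:"
        for caller, callee in graph["internal"]:
            edges.append((prefix + caller, prefix + callee))
        for caller, target_package, callee in graph["unresolved"]:
            target_version = resolution.get(target_package)
            if target_version is None:
                continue  # target not in this resolution: stays unresolved
            edges.append((prefix + caller,
                          f"{target_package}@{target_version}:{callee}"))
    return edges


stitched = stitch({"app": "1.0", "lib": "2.0"}, PACKAGE_GRAPHS)
for edge in stitched:
    print(edge)
```

Changing the resolution to `{"app": "1.0", "lib": "3.0"}` reuses the stored graphs unchanged and only rebinds the cross-package edge, which is the scaling argument made on the slide.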
The database schema
Status: Done
Examples of queries: largest packages (# of functions)
select p.package_name, pv.version, count(*)
from package_versions pv
join packages p on pv.package_id = p.id
join modules m on m.package_version_id = pv.id
join callables c on c.module_id = m.id
group by p.package_name, pv.version
order by count(*) desc
limit 10;
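To try the query above without the real metadata database, the sketch below runs it against a toy in-memory SQLite schema. Only the table and column names that appear in the query are taken from the slide; the column types, the sample rows, and the choice of SQLite are assumptions, and the actual FASTEN schema is certainly richer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema: just the columns the query touches (types are assumed).
conn.executescript("""
    CREATE TABLE packages (id INTEGER PRIMARY KEY, package_name TEXT);
    CREATE TABLE package_versions (
        id INTEGER PRIMARY KEY, package_id INTEGER, version TEXT);
    CREATE TABLE modules (id INTEGER PRIMARY KEY, package_version_id INTEGER);
    CREATE TABLE callables (id INTEGER PRIMARY KEY, module_id INTEGER);

    INSERT INTO packages VALUES (1, 'foo'), (2, 'bar');
    INSERT INTO package_versions VALUES (10, 1, '1.0'), (20, 2, '2.1');
    INSERT INTO modules VALUES (100, 10), (200, 20), (201, 20);
    INSERT INTO callables VALUES (1, 100), (2, 200), (3, 200), (4, 201);
""")

# The query from the slide, unchanged apart from whitespace.
rows = conn.execute("""
    select p.package_name, pv.version, count(*)
    from package_versions pv
    join packages p on pv.package_id = p.id
    join modules m on m.package_version_id = pv.id
    join callables c on c.module_id = m.id
    group by p.package_name, pv.version
    order by count(*) desc
    limit 10
""").fetchall()

for package_name, version, callable_count in rows:
    print(package_name, version, callable_count)
conn.close()
```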
Examples of queries: packages depending on vulnerable package
SELECT package_version_id, p.package_name, pv.version
FROM dependencies d
JOIN package_versions pv ON pv.id = d.package_version_id
JOIN packages p ON p.id = pv.package_id
WHERE d.dependency_id =
      (SELECT id
       FROM packages
       WHERE package_name = 'com.google.guava:guava')
  AND '20.0' = ANY(d.version_range);
Graph analytics (results shown refer to Java CGs)
❖ Graph stored using WebGraph (UMIL)
❖ For 1.1M graphs (2.3B nodes, 18B edges):
❖ 3.6 bits per edge, plus global ID storage for each node (9.0 bits per edge overall)
❖ DB size: 38GB → we can fit the whole of Maven in RAM
Status: In progress
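A back-of-the-envelope reading of the per-edge figures, assuming decimal gigabytes and taking the rounded count of 18B edges at face value. The slide does not itemize what else contributes to the 38GB database size, so this only converts the two per-edge rates into totals.

```python
EDGES = 18e9            # rounded edge count from the slide
BITS_STRUCTURE = 3.6    # bits per edge for the graph structure
BITS_OVERALL = 9.0      # bits per edge including global node IDs


def gigabytes(bits_per_edge: float, edges: float) -> float:
    """Convert a per-edge bit cost into decimal gigabytes (1 GB = 1e9 bytes)."""
    return bits_per_edge * edges / 8 / 1e9


structure_gb = gigabytes(BITS_STRUCTURE, EDGES)
overall_gb = gigabytes(BITS_OVERALL, EDGES)
print(f"graph structure only: {structure_gb:.2f} GB")
print(f"with global node IDs: {overall_gb:.2f} GB")
```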
Graph storage
Status: In progress
Vulnerability Plugin
❖ Gathering vulnerability information (at package and callable level)
❖ A normalized Vulnerability Object definition is injected in the metadata database
❖ Normalization is needed to smooth out the different sources of information
❖ The plugin continuously pulls updates for new information and keeps storing the results
Status: In progress
Analysis plug-ins
RAPID: Risk Analysis and Propagation Inspection for Security and Maintainability risks
❖ On the server side (to enrich the metadata DB):
❖ Plugin for code maintainability analysis: V1 deployed, processed 126K Maven coordinates to date
❖ Plugin for security vulnerability propagation
❖ On the client side:
❖ A user application to model and present risks
Status: In progress
License and Compliance analysis
❖ QMSTR Plugin consists of 3 steps:
1. Use the CG generator to gather information about all the generated artifacts that will be distributed together with the source code
2. Execution of static analysis tools that augment the build graph with license and compliance metadata
3. Generation of a report with package's relevant license and authorship metadata that is finally distributed
Status: In progress
Client-side highlights
REST API
❖ Implementation of endpoints to expose canned queries from the metadata database
❖ In development:
❖ Full DB entity support
❖ Custom extension points
Status: In progress
Use cases
❖ Endocode
❖ Endocode developed a license-compliance solution, called Quartermaster
❖ They are integrating FASTEN to improve the precision of their compliance offering
❖ SIG
❖ Integration of FASTEN in BetterCodeHub, their GitHub-connected code quality monitoring product
❖ XWiki
❖ Risk validation in the dependencies at Maven build time
❖ Risk validation in the installed extensions of an XWiki instance
❖ Filter out available compatible extensions for an XWiki instance
❖ Discoverability of XWiki components in available extensions
Status: In progress
Future timeline
The future
End 2020: REST API, first full version of knowledge base, CG enrichment, build graph integration, first public announcement
Q1 2021: Impact analysis, integration with MVN / PyPI; first external user
Q2 2021: Industrial use cases integrated; first external adoption
Q3 2021: Licensing and security fully integrated; Data-driven API evolution
Q4 2021: Project finished; external integrations
Q1 2022: FASTEN 2?
Network analysis will be the next step for the future of software development
Questions?
Paolo Boldi
Università degli Studi di Milano
Italy
paolo.boldi@unimi.it

Software Ecosystems as Networks - Advances on the FASTEN project, Paolo Boldi, FOSDEM 2021

  • 1. Software Ecosystems as Networks Advances on the FASTEN project Paolo Bold i Università degli Studi di Milan o Italy
  • 2. The FASTEN Project ❖ Fine-Grained Analysis of SofTware Ecosystems as Network s ❖ Part of the EU H2020-ICT-2018-2020 Progra m ❖ Consortium 

  • 5. Sharing through software libraries ❖ Internet made the dream of collaborative development a reality, by means of libraries that are made available:
  • 6. Sharing through software libraries ❖ Internet made the dream of collaborative development a reality, by means of libraries that are made available: ❖ on repositories (SourceForge, GitHub, BitBucket, …)
  • 7. Sharing through software libraries ❖ Internet made the dream of collaborative development a reality, by means of libraries that are made available: ❖ on repositories (SourceForge, GitHub, BitBucket, …) ❖ or forges (Maven, PyPi, CPAN, …)
  • 8. Industrial revolution at the harbour of software development
  • 9. Industrial revolution at the harbour of software development ❖ All trades, arts, and handiworks have gained by division of labour, namely, when, instead of one man doing everything, each con fi nes himself to a certain kind of work distinct from others in the treatment it requires, so as to be able to perform it with greater facility and in the greatest perfection. Where the different kinds of work are not distinguished and divided, where everyone is a jack-of-all-trades, there manufactures remain still in the greatest barbarism. Immanuel Kan t Groundwork for the Metaphysic s of Morals (1785)
  • 11. Dependency graphs ❖ Library+versions and their dependencies form (complex, huge) dependency networks
  • 12. Dependency graphs ❖ Library+versions and their dependencies form (complex, huge) dependency networks ❖ Version constraints make these networks more complicated than simple graphs
  • 13. Dependency graphs ❖ Library+versions and their dependencies form (complex, huge) dependency networks ❖ Version constraints make these networks more complicated than simple graphs ❖ Package manager will fi nally determine which version is chosen for each library
  • 15. The dependency heaven ❖ Relying on an ecosystem of easy-to- use well written libraries made the dream of code reuse a reality
  • 17. The dependency hell ❖ A bug or security breach or legal issue concerning one single piece … ❖ …can make the whole tower fall!
  • 19. Recent dependency nightmares ❖ The leftpad incident (2016): millions of websites affected
  • 20. Recent dependency nightmares ❖ The leftpad incident (2016): millions of websites affected ❖ The Equifax breach (2017): costed 4B$
  • 22. Ecosystems ❖ Ecosystems grow at mind boggling speed
  • 23. Ecosystems ❖ Ecosystems grow at mind boggling speed ❖ JavaScript projects have an average of 80 (Zimmerman et al., 2019) transitive dependencies
  • 24. Ecosystems ❖ Ecosystems grow at mind boggling speed ❖ JavaScript projects have an average of 80 (Zimmerman et al., 2019) transitive dependencies ❖ 50% of dependencies change in a 6-month time (Hejderup et al., 2019)
  • 25. Ecosystems ❖ Ecosystems grow at mind boggling speed ❖ JavaScript projects have an average of 80 (Zimmerman et al., 2019) transitive dependencies ❖ 50% of dependencies change in a 6-month time (Hejderup et al., 2019) ❖ And deteriorate almost as rapidly
  • 26. Ecosystems ❖ Ecosystems grow at mind boggling speed ❖ JavaScript projects have an average of 80 (Zimmerman et al., 2019) transitive dependencies ❖ 50% of dependencies change in a 6-month time (Hejderup et al., 2019) ❖ And deteriorate almost as rapidly ❖ Existence of package bottlenecks (the removal on one single package can bring down almost 40% of the system)
  • 27. Ecosystems ❖ Ecosystems grow at mind boggling speed ❖ JavaScript projects have an average of 80 (Zimmerman et al., 2019) transitive dependencies ❖ 50% of dependencies change in a 6-month time (Hejderup et al., 2019) ❖ And deteriorate almost as rapidly ❖ Existence of package bottlenecks (the removal on one single package can bring down almost 40% of the system) ❖ Rich get richer: few maintainers dominate most packages
  • 28. Epidemics in dependency graphs Lib A, vers 1.0 Lib B, vers 2.5 Lib C, vers 1.5 Lib D, vers 3.0
  • 29. Epidemics in dependency graphs Lib A, vers 1.0 Lib B, vers 2.5 Lib C, vers 1.5 Lib D, vers 3.0 A vulnerability aler t is issue d about Lib D, vers 3.0
  • 30. Epidemics in dependency graphs Lib A, vers 1.0 Lib B, vers 2.5 Lib C, vers 1.5 Lib D, vers 3.0 A vulnerability aler t is issue d about Lib D, vers 3.0 All libraries in this graph are infected!
  • 31. GitHub security alerts But is this enough?
  • 32. Isn’t this kind of tool enough?
  • 33. Isn’t this kind of tool enough? ❖ In theory. But in practice:
  • 34. Isn’t this kind of tool enough? ❖ In theory. But in practice: ❖ Developers don’t update
  • 35. Isn’t this kind of tool enough? ❖ In theory. But in practice: ❖ Developers don’t update ❖ → Vulnerabilities proliferate
  • 36. Isn’t this kind of tool enough? ❖ In theory. But in practice: ❖ Developers don’t update ❖ → Vulnerabilities proliferate ❖ Why?
  • 37. Isn’t this kind of tool enough? ❖ In theory. But in practice: ❖ Developers don’t update ❖ → Vulnerabilities proliferate ❖ Why? ❖ Our tools are not sharp enough for what we want
  • 38. Examples of what people want Developers Maintainers Update Does this outdated dependency really break my code? How do I update without breaking too many of my important clients? Violations Am I violating anyone’s copyright? How do I spot instances of my code being distributed without permission?
  • 39. Epidemics in dependency graphs Lib A, vers 1.0 Lib B, vers 2.5 Lib C, vers 1.5 Lib D, vers 3.0
  • 40. Epidemics in dependency graphs A.f0 A.f2 A.f3 B.f1 B.f2 B.f3 C.f1 C.f2 D.f1 D.f2 D.f3
  • 41. Epidemics in dependency graphs A.f0 A.f2 A.f3 B.f1 B.f2 B.f3 C.f1 C.f2 D.f1 D.f2 D.f3 A vulnerability aler t is issue d about Lib D, vers 3.0 , function f3
  • 42. Epidemics in dependency graphs A.f0 A.f2 A.f3 B.f1 B.f2 B.f3 C.f1 C.f2 D.f1 D.f2 D.f3 A vulnerability aler t is issue d about Lib D, vers 3.0 , function f3
  • 43. Epidemics in dependency graphs A.f0 A.f2 A.f3 B.f1 B.f2 B.f3 C.f1 C.f2 D.f1 D.f2 D.f3 A vulnerability aler t is issue d about Lib D, vers 3.0 , function f3 Much more informative!
  • 44. Epidemics in dependency graphs A.f0 A.f2 A.f3 B.f1 B.f2 B.f3 C.f1 C.f2 D.f1 D.f2 D.f3 A vulnerability aler t is issue d about Lib D, vers 3.0 , function f3
  • 45. Epidemics in dependency graphs A.f0 A.f2 A.f3 B.f1 B.f2 B.f3 C.f1 C.f2 D.f1 D.f2 D.f3 A vulnerability aler t is issue d about Lib D, vers 3.0 , function f3 Avoid the cry wolf effect!
  • 47. Examples ❖ Fully precise change impact analysis: “How many libraries are affected if I remove/modify a certain method/interface?”
  • 48. Examples ❖ Fully precise change impact analysis: “How many libraries are affected if I remove/modify a certain method/interface?” ❖ Fully precise license compliance: “Is my library compliant with the licenses of the libraries that I depend from (directly or indirectly)? (e.g., am I linking any GPL code?)”
  • 49. Examples ❖ Fully precise change impact analysis: “How many libraries are affected if I remove/modify a certain method/interface?” ❖ Fully precise license compliance: “Is my library compliant with the licenses of the libraries that I depend from (directly or indirectly)? (e.g., am I linking any GPL code?)” ❖ Fully precise risk pro fi ling: “Does this vulnerability affect my code?”
  • 50. Examples ❖ Fully precise change impact analysis: “How many libraries are affected if I remove/modify a certain method/interface?” ❖ Fully precise license compliance: “Is my library compliant with the licenses of the libraries that I depend from (directly or indirectly)? (e.g., am I linking any GPL code?)” ❖ Fully precise risk pro fi ling: “Does this vulnerability affect my code?” ❖ Centrality analysis: “What methods/functions are more central within a given ecosystem? are there bottlenecks? critical points?”
  • 52. The FASTEN toolchain Project information Securit y alerts Repositories
  • 53. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream publish publish
  • 54. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server publish publish
  • 55. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction publish publish
  • 56. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer publish publish
  • 57. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer Analysis layer publish publish
  • 58. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer Analysis layer REST Api publish publish
  • 59. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer Analysis layer REST Api Web UI publish publish
  • 60. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer Analysis layer REST Api Web UI publish publish Continuous integration server
  • 61. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer Analysis layer REST Api Web UI publish publish Continuous integration server
  • 62. The FASTEN toolchain Project information Securit y alerts Repositories publish Data stream FASTE N server Call-graph construction Storage layer Analysis layer REST Api Web UI publish publish Continuous integration server Developer
  • 65. Dataflow example: CG generation D o n e
  • 66. Universal function identifiers How to uniquely reference a function in a global namespace? fasten:// /mvn /org.slf4j.slf4j-api /1.2.3 /org.slf4j.helpers /BasicMarkerFactory.getDetachedMarker (%2Fjava.lang%2FString) %2Forg.slf4j%2FMarker scheme forge artifact version namespace function argument(s) return type D o n e
  • 67. Universal function identifiers How to uniquely reference a function in a global namespace? fasten:// /mvn /org.slf4j.slf4j-api /1.2.3 /org.slf4j.helpers /BasicMarkerFactory.getDetachedMarker (%2Fjava.lang%2FString) %2Forg.slf4j%2FMarker scheme forge artifact version namespace function argument(s) return type Generic format + Java Python C D o n e
  • 68. Call graph transport { "product": "foo", "forge": "mvn", "depset": [ [ { "product": "a", "forge": "mvn", "constraints": ["[1.2..1.5]", "[2.3..]"] }, { "product": "b", "forge": "mvn", "constraints": ["[2.0.1]"] } ] ], "version": "3.10.0.7", "cha": { "/name.space/A": { "methods": { "0": "/name.space/A.A()%2Fjava.lang%2FVoidType", "1": "/name.space/A.g(%2Fjava.lang%2FString)%2Fjava.lang%2FInteger" }, "superInterfaces": [ "/java.lang/Serializable" ], "sourceFile": "filename.java", "superClasses": [ "/java.lang/Object" ] } }, "graph": { "internalCalls": [ [ 0, 1 ] ], "externalCalls": [ [ "1", "///their.package/TheirClass.method()Response", { "invokeinterface": "1" } ] ] }, "timestamp": 123 } D o n e
  • 69. Call graph transport { "product": "foo", "forge": "mvn", "depset": [ [ { "product": "a", "forge": "mvn", "constraints": ["[1.2..1.5]", "[2.3..]"] }, { "product": "b", "forge": "mvn", "constraints": ["[2.0.1]"] } ] ], "version": "3.10.0.7", "cha": { "/name.space/A": { "methods": { "0": "/name.space/A.A()%2Fjava.lang%2FVoidType", "1": "/name.space/A.g(%2Fjava.lang%2FString)%2Fjava.lang%2FInteger" }, "superInterfaces": [ "/java.lang/Serializable" ], "sourceFile": "filename.java", "superClasses": [ "/java.lang/Object" ] } }, "graph": { "internalCalls": [ [ 0, 1 ] ], "externalCalls": [ [ "1", "///their.package/TheirClass.method()Response", { "invokeinterface": "1" } ] ] }, "timestamp": 123 } Generic format + Java Python C D o n e
  • 71. Language-dependent call graph generation ❖ Java: Based on tools from the OPAL project (stg-tud/opal ) ❖ Python: New static analysis tool: PyCG (Submitted ICSE 2020) ❖ C: CScout for static call graphs; gprof, callgrind for dynamic calls D o n e
  • 72. Current CG results Language / Ecosystem Total Packages Results Packages Nodes Edges Success Rate C / Debian Buster 7.380 (757 analyzed) * 531 491.721 579.253 70% Java / Maven 2.7M artifacts 2.4M ~5B+ ~56B+ 89.13% Python / PyPI ~740 K ~520K ~211M ~310M 70% I n p r o g r e s s * Technical issues prohibited us from downloading the rest of the packages.
  • 78. Call graph stitching (status: In progress)
    How to scale call graph processing to 10^6 package versions?
    ❖ Idea: decouple package resolution from call graph generation
    ❖ Build and store call graphs per package version, including:
      ❖ unresolved calls
      ❖ class hierarchies (Java, Python)
    ❖ Call graph stitching: resolve the unresolved calls given a dependency tree
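The stitching idea can be sketched as follows. This is an illustrative simplification, not the project's implementation: it assumes each stored call graph exposes its unresolved calls as (caller, callee signature) pairs, that the dependency tree has already been resolved to a list of package versions, and it ignores the class hierarchies that virtual dispatch would require. The function `stitch` and all package and method names are hypothetical.

```python
def stitch(unresolved_calls, dependency_tree, definitions):
    """Resolve stored unresolved calls against an already resolved dependency tree.

    unresolved_calls: list of (caller, signature) pairs stored with a package version
    dependency_tree:  list of resolved package versions, in resolution order
    definitions:      {package version: set of signatures it defines}
    """
    resolved, still_unresolved = [], []
    for caller, signature in unresolved_calls:
        # Pick the first resolved dependency that defines the callee.
        provider = next(
            (dep for dep in dependency_tree if signature in definitions.get(dep, ())),
            None,
        )
        if provider is None:
            still_unresolved.append((caller, signature))
        else:
            resolved.append((caller, (provider, signature)))
    return resolved, still_unresolved


# Hypothetical data, for illustration only.
definitions = {
    "a:1.4": {"/pkg.a/Util.parse(String)"},
    "b:2.0.1": {"/pkg.b/Client.send()"},
}
unresolved = [
    ("/name.space/A.g(String)", "/pkg.b/Client.send()"),
    ("/name.space/A.g(String)", "/pkg.c/Missing.call()"),
]
resolved, missing = stitch(unresolved, ["a:1.4", "b:2.0.1"], definitions)
```

Because the per-version call graphs are built once and stored, only this cheap lookup step has to be repeated when the same package version appears in a different dependency tree.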
  • 80. Examples of queries: largest packages (# of functions)
    select p.package_name, pv.version, count(*)
    from package_versions pv
      join packages p on pv.package_id = p.id
      join modules m on m.package_version_id = pv.id
      join callables c on c.module_id = m.id
    group by p.package_name, pv.version
    order by count(*) desc
    limit 10;
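To try the query above without access to the metadata database, one option is an in-memory SQLite database. The schema below is a guess that contains only the columns the query touches, and the rows are invented; the real FASTEN schema is certainly richer.

```python
import sqlite3

# Hypothetical minimal schema: only the columns the slide's query references.
SCHEMA = """
create table packages (id integer primary key, package_name text);
create table package_versions (id integer primary key, package_id integer, version text);
create table modules (id integer primary key, package_version_id integer);
create table callables (id integer primary key, module_id integer);
"""

# The query from the slide, unchanged.
QUERY = """
select p.package_name, pv.version, count(*)
from package_versions pv
  join packages p on pv.package_id = p.id
  join modules m on m.package_version_id = pv.id
  join callables c on c.module_id = m.id
group by p.package_name, pv.version
order by count(*) desc
limit 10;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented rows: "big" 1.0 has two modules with three callables in total,
# "small" 2.0 has one module with one callable.
conn.executemany("insert into packages values (?, ?)", [(1, "big"), (2, "small")])
conn.executemany(
    "insert into package_versions values (?, ?, ?)", [(1, 1, "1.0"), (2, 2, "2.0")]
)
conn.executemany("insert into modules values (?, ?)", [(1, 1), (2, 1), (3, 2)])
conn.executemany("insert into callables values (?, ?)", [(1, 1), (2, 1), (3, 2), (4, 3)])

rows = conn.execute(QUERY).fetchall()
```

Each row of `rows` is a (package name, version, number of callables) tuple, largest package version first.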
  • 81. Examples of queries: packages depending on a vulnerable package
    SELECT package_version_id, p.package_name, pv.version
    FROM dependencies d
      JOIN package_versions pv ON pv.id = d.package_version_id
      JOIN packages p ON p.id = pv.package_id
    WHERE d.dependency_id = (SELECT id FROM packages WHERE package_name = 'com.google.guava:guava')
      AND '20.0' = ANY(d.version_range);
  • 86. Graph analytics (results shown refer to Java CGs; status: In progress)
    ❖ Graph stored using WebGraph (UMIL)
    ❖ For 1.1M graphs (2.3B nodes, 18B edges):
      ❖ 3.6 bits per edge, plus global ID storage for each node (9.0 bits per edge overall)
      ❖ DB size: 38GB → we can fit the whole of Maven in RAM
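As a back-of-the-envelope check of the per-edge figures, the sketch below converts bits per edge into total size for 18B edges. It assumes decimal gigabytes (10^9 bytes). The slide does not say what else the 38GB database contains, so the sketch makes no attempt to reproduce that figure.

```python
EDGES = 18e9  # 18B edges, from the slide


def gigabytes(bits_per_edge, edges=EDGES):
    """Total size in decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return bits_per_edge * edges / 8 / 1e9


structure_gb = gigabytes(3.6)  # graph structure only: about 8.1 GB
overall_gb = gigabytes(9.0)    # including global node IDs: about 20.25 GB
```

Both values are below the stated 38GB database size, which is consistent with the claim that the data fits in the RAM of a single large server.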
  • 88. Vulnerability Plugin (status: In progress)
    ❖ Gathering vulnerability information (at package and callable level)
    ❖ A normalized Vulnerability Object definition is injected in the metadata database
    ❖ Normalization is needed to smooth out the different sources of information
    ❖ The plugin continuously pulls updates for new information and keeps storing the results
  • 89. Analysis plug-ins (status: In progress)
    RAPID: Risk Analysis and Propagation Inspection for Security and Maintainability risks
    ❖ On the server side (to enrich the metadata DB):
      ❖ Plugin for code maintainability analysis: V1 deployed, processed 126K Maven coordinates to date
      ❖ Plugin for security vulnerability propagation
    ❖ On the client side:
      ❖ A user application to model and present risks
  • 90. License and Compliance analysis (status: In progress)
    ❖ The QMSTR plugin consists of 3 steps:
      1. Use the CG generator to gather information about all the generated artifacts that will be distributed together with the source code
      2. Run static analysis tools that augment the build graph with license and compliance metadata
      3. Generate a report with the package's relevant license and authorship metadata, which is finally distributed
  • 92. REST API (status: In progress)
    ❖ Implementation of endpoints to expose canned queries from the metadata database
    ❖ In development:
      ❖ Full DB entity support
      ❖ Custom extension points
  • 93. Use cases (status: In progress)
    ❖ Endocode
      ❖ Endocode developed a license-compliance solution, called Quartermaster
      ❖ They are integrating FASTEN to improve the precision of their compliance offering
    ❖ SIG
      ❖ Integration of FASTEN in BetterCodeHub, their GitHub-connected code quality monitoring product
    ❖ XWiki
      ❖ Risk validation in the dependencies at Maven build time
      ❖ Risk validation in the installed extensions of an XWiki instance
      ❖ Filter out available compatible extensions for an XWiki instance
      ❖ Discoverability of XWiki components in available extensions
  • 95. The future
    End 2020, Q1 2021, Q2 2021, Q3 2021
    ❖ REST API, first full version of knowledge base, CG enrichment, build graph integration, first public announcement
    ❖ Impact analysis, integration with MVN / PyPI; first external user
    Q4 2021, Q1 2022, FASTEN 2?
    ❖ Industrial use cases integrated; first external adoption
    ❖ Licensing and security fully integrated; data-driven API evolution
    ❖ Project finished; external integrations
  • 96. Network analysis will be the next step for the future of software development
  • 98. Questions? Paolo Boldi, Università degli Studi di Milano, Italy. paolo.boldi@unimi.it