Trusty is a free-to-use web app that provides data and scoring on the supply chain risk for open source packages.
We've developed a way to help determine provenance (or proof of origin) for open source software packages that we believe can serve as a viable alternative when sigstore provenance is not available. Called “historical provenance,” it involves looking back at historical Git releases and tags in a source repo, and mapping those to published package versions.
Join Stacklok CTO Luke Hinds and Staff Data Scientist Nigel Brown for a live demo and discussion about historical provenance on January 16, 2024 at 9 AM PT!
The demo and discussion will be streamed here on YouTube Live.
As humans, we aren’t allowed to do sensitive things—like opening a bank account, or accessing our healthcare information—without proving our identity, to ensure it’s not compromised. And yet, when it comes to software, we’ve largely accepted the practice of injecting third-party code into our projects without being able to prove that the code is authentic, and that it will do what it says it will do.
Malicious actors are taking full advantage of this. Sonatype noted that the number of malicious packages tripled in the past year, to over 245K. “Masquerading” or “typosquatting” are common attacks: malicious actors copy the metadata from a popular package and use it for their malicious package, with a slightly different name. By design, this makes it very hard for developers to tell the difference.
How can we help guard against these types of attacks? One key way is by establishing proof of origin and build provenance for open source packages.
The open source project sigstore was founded by Stacklok CTO Luke Hinds to give developers an easier way to digitally sign and verify software artifacts. When you use sigstore to sign an artifact, your signature is stored in a public ledger that can’t be tampered with. This practice establishes a cryptographic link from the package back to its source code, acting like a digital ID.
But as of today, only a fraction of packages have been signed using sigstore. While we need to continue to make it easier for developers to use sigstore to digitally sign and verify OSS packages—and this will be a key goal for Minder—widespread adoption of sigstore by developer communities will take time. From a software security perspective, we can’t afford to wait.
At Stacklok, we’ve developed a way to help determine provenance (or proof of origin) for software packages that we believe can serve as a viable alternative when sigstore provenance is not available. We’re calling it “historical provenance,” because it involves looking back at historical Git releases and tags in a source repo, and mapping those to published package versions.
It’s important to note that historical provenance does not replace the value of using sigstore and SLSA to establish cryptographically strong links, or of using a setup similar to Go Modules. Notably, sigstore can verify the connection between a specific package version and its source repo, while historical provenance can only link the overall package repository to the source repo. But in the absence of a cryptographic link, historical provenance can still provide strong linkage between a package and its source code, giving developers a better signal as to whether a package is what it says it is.
When developers release code, it is very common practice, though not mandated, to tag the source code.
Tags in source code serve as reference points, marking a specific state of the code that corresponds to a given release (via a commit). When issues are found in a production environment, tags allow developers to quickly check out the exact code that was running to reproduce and troubleshoot the issue.
Tags provide a clear history of the project's progression. Critically, they also act as a form of documentation that indicates when certain features were introduced or when bugs were fixed.
Tags also carry a second advantage in Git. Each commit is identified by a hash of its contents, which cover the source tree, commit message, author, and date, as well as the hash of the parent commit(s). This chaining means that every commit hash depends on the entire history of the repository up to that point. If any part of a commit's data were to change, its hash would change, and so would the hash of every subsequent commit. This makes the history tamper-evident, and even more so when combined with a signature.
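To make the chaining idea concrete, here is a minimal sketch in Python. It is a simplification and not Git's actual object format: the field names, the `commit_hash` and `chain_hashes` helpers, and the toy history are all made up for illustration. The only point is that each hash is computed over the commit's data plus its parent's hash.

```python
import hashlib
from typing import Dict, List


def commit_hash(commit: Dict[str, str], parent_hash: str) -> str:
    """Hash a simplified commit record together with its parent's hash."""
    payload = "\n".join(
        [commit["tree"], commit["message"], commit["author"], commit["date"], parent_hash]
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def chain_hashes(commits: List[Dict[str, str]]) -> List[str]:
    """Return one hash per commit, each depending on all earlier commits."""
    hashes: List[str] = []
    parent_hash = ""  # the first commit has no parent
    for commit in commits:
        parent_hash = commit_hash(commit, parent_hash)
        hashes.append(parent_hash)
    return hashes


# Toy history, invented for this example.
history = [
    {"tree": "print('v1')", "message": "initial release", "author": "dev", "date": "2023-01-10"},
    {"tree": "print('v2')", "message": "add feature", "author": "dev", "date": "2023-03-02"},
    {"tree": "print('v3')", "message": "fix bug", "author": "dev", "date": "2023-06-01"},
]
original = chain_hashes(history)

# Rewrite the middle commit and recompute the chain.
tampered_history = [dict(c) for c in history]
tampered_history[1]["tree"] = "print('malicious')"
tampered = chain_hashes(tampered_history)

# The commit before the change keeps its hash; the changed commit and
# everything after it get new hashes.
changed = [a != b for a, b in zip(original, tampered)]
print(changed)  # [False, True, True]
```

A tag points at one of these hashes, so a tag that still resolves to the same hash is pointing at the same history.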
Package registries such as crates.io, npm, and PyPI also record and publish a timestamp for each version that is released.
For packages in the crates, npm, and PyPI ecosystems, we compared the timestamps of git tags in the source code to the published timestamp of versions listed in the package manager’s repo. We found a strong correlation between the two.
For example, when we look at the number of releases vs. the number of tags in the repository and compare the times at which they happened, we see they are similar. If we also do a fuzzy match on the strings themselves, we get an even stronger correlation. If the repo and the package share even a small number of versions, we can reasonably assume the package came from the repo.
This comparison is very hard to fake, especially for a longer-lived package. To do so would involve going back in time and making fraudulent releases at the same time as valid tags. As long as we trust the tag producer (e.g., GitHub) and the packaging infrastructure owners (e.g., pypi.org, npmjs.com, crates.io), we can trust these mappings. (If GitHub or a packaging provider is hacked, we are all in trouble!)
While visual inspection is possible for matching the tag -> publish timestamps, there are millions of open source packages in these ecosystems. We needed to automate this, so that we could display this data in Trusty for developers to use.
Here’s how we approach that. We start with two streams of timestamps with slightly different version strings, and we want to consider how similar they are:
Create a list of tags and package versions.
Take each tag in the repo and look for a ‘core’ of the form #.#.# using a regex.
Look for that core in the package versions.
Match up the dates and count them.
Report the count as ‘common’ matching tags.
Note that a low-percentage overlap can mean there are multiple packages in one repo, or it can mean that tags are used for things other than releases.
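The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that runs in Trusty: the function names, the one-day matching window, and the sample tags and releases are all hypothetical, and we assume that "matching up the dates" means a tag and a release with the same core fall within that window of each other.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Optional

# A version 'core' of the form #.#.#, e.g. "1.2.3" inside "v1.2.3-rc1".
CORE_RE = re.compile(r"\d+\.\d+\.\d+")


def version_core(name: str) -> Optional[str]:
    """Extract the #.#.# core from a tag or version string, if there is one."""
    match = CORE_RE.search(name)
    return match.group(0) if match else None


def count_common(
    tags: Dict[str, datetime],
    releases: Dict[str, datetime],
    window: timedelta = timedelta(days=1),  # assumed tolerance, not from the article
) -> int:
    """Count version cores that appear in both streams at roughly the same time."""
    release_times: Dict[str, datetime] = {}
    for name, published in releases.items():
        core = version_core(name)
        if core is not None:
            release_times.setdefault(core, published)

    matched = set()
    for name, tagged in tags.items():
        core = version_core(name)
        if core in release_times and abs(release_times[core] - tagged) <= window:
            matched.add(core)
    return len(matched)


# Made-up sample data: Git tags from a repo and versions from a registry.
tags = {
    "v1.0.0": datetime(2023, 1, 10, 12, 0),
    "v1.1.0": datetime(2023, 3, 2, 9, 0),
    "docs-update": datetime(2023, 3, 5, 0, 0),  # a tag that is not a release
    "v2.0.0": datetime(2023, 6, 1, 8, 0),
}
releases = {
    "1.0.0": datetime(2023, 1, 10, 13, 0),
    "1.1.0": datetime(2023, 3, 2, 10, 0),
    "2.0.0": datetime(2023, 7, 15, 0, 0),  # published weeks after the tag
}

common = count_common(tags, releases)
print(common)  # 2
```

Here the `docs-update` tag has no core and is ignored, and `2.0.0` shares a core but not a date, so only two versions count as common.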
The initial set of results is promising. There is a clear diagonal relationship between the packages and the repositories they claim. This means that when we compare a package with its own repo, we get a high score; when we compare to another repo, we get a low (0) score.
There are some anomalies to consider. Notably, we can’t compare repos that contain no tags. So these score 0 on the leading diagonal.
We intend to use historical provenance data to catch possible “starjacking” and “typosquatting” attempts—cases in which a bad actor is using copied metadata and slightly misspelled package names to get developers to install malicious code. Here’s how we can do this.
Example 1: Identifying which packages are most likely to be associated with a specific repo
In this case, we have more than one package that claims to come from the same source repo, and we want to prove which packages actually do come from that repo. From the graphs above, we can see this in action. The diagonal line gives a strong signal. We can create a confusion matrix to see if we can select the best match from all the others in the test set:
This is a perfect test, within this sample set.
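As a rough sketch of that selection step, assume we already have a count of common versions for every package and repo pair. The `best_repo` helper, the package and repo names, and the scores below are invented for illustration and are not our measured results.

```python
from typing import Dict, Optional


def best_repo(scores: Dict[str, int]) -> Optional[str]:
    """Return the repo with the highest score, or None if nothing overlaps."""
    repo, score = max(scores.items(), key=lambda item: item[1])
    return repo if score > 0 else None


# Hypothetical scores: rows are packages, columns are candidate repos.
scores = {
    "pkg-a": {"repo-a": 12, "repo-b": 0, "repo-c": 0},
    "pkg-b": {"repo-a": 0, "repo-b": 7, "repo-c": 0},
    "pkg-c": {"repo-a": 0, "repo-b": 0, "repo-c": 0},  # repo-c has no tags
}
claimed = {"pkg-a": "repo-a", "pkg-b": "repo-b", "pkg-c": "repo-c"}

outcome = {"correct": 0, "wrong": 0, "undecided": 0}
for package, row in scores.items():
    choice = best_repo(row)
    if choice is None:
        outcome["undecided"] += 1
    elif choice == claimed[package]:
        outcome["correct"] += 1
    else:
        outcome["wrong"] += 1

print(outcome)  # {'correct': 2, 'wrong': 0, 'undecided': 1}
```

Returning `None` when every score is zero keeps repos without tags out of the "wrong" bucket, since there is nothing to compare in that case.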
Example 2: Identifying whether a given package is from a specific repo
To figure out whether a package matches its claimed repository, we need to have some kind of cutoff. What score is good enough? This can be seen in the image below.
The blue section represents the “correct” packages. We expect these to have some overlap, which they mostly do. The orange ones are mismatched pairs, which we expect to have no overlap. This is true apart from one case, which does legitimately share the same repo. So, in this case it looks like any overlap is enough for discrimination.
We can use it to create a confusion matrix.
This is a very good test.
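A cutoff-based check like this could be sketched as below. The threshold of 1 reflects the observation that any overlap appears to be enough; the `confusion_counts` function and the sample pairs are hypothetical and only show how the four cells of the matrix are tallied.

```python
from collections import Counter
from typing import Iterable, Tuple


def confusion_counts(pairs: Iterable[Tuple[int, bool]], threshold: int = 1) -> Counter:
    """Tally a confusion matrix from (common_count, is_claimed_repo) pairs."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for common, same_origin in pairs:
        predicted = common >= threshold  # "any overlap" counts as a match
        if predicted and same_origin:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif same_origin:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up pairs: (number of common versions, package really belongs to repo).
pairs = [
    (12, True),
    (7, True),
    (0, True),   # correct pair, but the repo has no tags
    (3, False),  # a "mismatched" pair that still overlaps, e.g. a shared repo
    (0, False),
    (0, False),
]

counts = confusion_counts(pairs)
print(dict(counts))  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}
```

Raising `threshold` trades false positives for false negatives, which is the cutoff question this example is about.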
The samples above are mostly from Python (11 from crates, 12 from npm, 40 from PyPI), but we found the same holds for the Rust and npm packages we examined. There are occasional edge cases where things go wrong, but all things considered, this is a very useful approach.
With historical provenance, we can often prove that a package comes from its claimed source repository with a high degree of accuracy by comparing the history of its releases. Again, historical provenance is not a replacement for cryptographic provenance (we consider sigstore to be the “chef's kiss” solution), but it is a very useful tool in understanding the true source of origin of a package and knowing whether it is what it says it is.
For Stacklok, this method of provenance will help us better identify malicious packages and provide stronger indicators to developer communities, because it gives us more insight and observability over a package’s metadata claims. For that reason, we’ve integrated historical provenance into Trusty, our free-to-use service for vetting the safety and trustworthiness of open source packages. You can check out historical provenance in action now by heading to www.trustypkg.dev.
We’d love to hear your feedback on our approach with historical provenance. Join our Discord channel to chat with us and share your thoughts.
Nigel Brown
Staff Data Scientist