1. Bad Snakes:
Understanding and Improving Python Package Index
Malware Scanning
D.L. Vu1,2 Zachary Newman1 John Speed Meyers1
1Chainguard, USA, 2FPT University, Vietnam
5/18/2023
Bad Snakes: Understanding and Improving Python Package
Index Malware Scanning
1
lyvd@fe.edu.vn zjn@chainguard.dev jsmeyers@chainguard.dev
2. Problem Statement
● An increasing amount of malware is appearing on open source package repositories,
specifically PyPI.
● Academic and commercial tools that can detect malicious open source software
packages start to sound like a magic wand that could make these problems
disappear.
● But are these tools really the cure for these malicious packages?
3. Our study
• We spoke to administrators of and contributors to PyPI, the main repository for
Python packages, along with an academic researcher who works on this problem.
• We conducted an empirical study of malware detection tools to see how they
measured up to the requirements of real package repositories.
4. Takeaways
• These tools aren’t suitable to run on open source software repositories
automatically, in large part because they’re too noisy.
• External researchers can (and do) run their own tools in their own environments
and send reports to get malware removed.
• This often works out better for everybody involved.
• There are promising directions for improving these scanners, and other, even
more promising techniques for improving software repository security that
administrators are working toward right now.
5. Interviews
• We checked in with members of the PyPI community and supply-chain security
researchers to see what it would take to deploy malware detection techniques on
package repositories.
• PyPI deployed an experimental “malware checks” system in 2020, so our interviewees
(an administrator of PyPI, and one developer of the malware check system) have direct
experience with running malware detection for a real repository. However, these checks
aren’t used anymore.
• We sought to find out why not, and what it would take to deploy such a system again.
6. False positive rates matter more than false
negative rates
• Many researchers build systems designed to catch all or most malware: after all,
we don’t want to let bad packages through.
• They accept a low false-positive rate as the price to pay to catch bad actors.
• However, given the number of legitimate packages published, even seemingly low
rates (like 5%) require administrators to manually inspect thousands of
packages each week.
• An automated tool needs an “effectively zero” false positive rate.
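The arithmetic behind this point can be sketched in a few lines. The weekly upload volume below is a hypothetical round number for illustration, not a measured PyPI statistic:

```python
# Back-of-the-envelope sketch of why "low" false-positive rates are
# unworkable at repository scale. Upload volume is a hypothetical figure.

def weekly_review_load(uploads_per_week: int, false_positive_rate: float) -> int:
    """Expected number of benign packages flagged for manual inspection."""
    return round(uploads_per_week * false_positive_rate)

# Even a "low" 5% false-positive rate on 50,000 weekly uploads means
# thousands of benign packages for administrators to triage by hand.
print(weekly_review_load(50_000, 0.05))   # 2500
print(weekly_review_load(50_000, 0.001))  # 50: still dozens per week at 0.1%
```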
7. Repository administrators must balance
multiple security priorities
• PyPI and similar repositories must weigh automated malware detection against
software signing and multi-factor authentication.
• Most malware packages affect few or no actual users, so PyPI administrators have
decided to use their finite resources to focus on higher-impact projects.
8. Just because PyPI isn’t running these checks
doesn’t mean that others aren’t.
• Security researchers develop and operate Python malware detection systems
using their own time and computing resources, providing reports to PyPI when
they detect malicious packages.
• PyPI maintainers benefit from high-quality, low-noise reports on malware, and the
security researchers benefit from positive coverage of their company, products,
and services.
9. Benchmarking Different Malware Detection
Approaches
• To understand whether existing systems were appropriate for this setting, we ran
experiments comparing different Python malware detection approaches.
• These systems include static analysis tools that analyze source code, dynamic analysis
tools that observe running software, and metadata analysis tools that look at things like
package names.
• We found three Python malware detection tools which met our criteria:
Bandit4Mal, OSSGadget OSS Detect Backdoor, and PyPI Malware Checks.
• We used a benchmark dataset including 168 malware packages (courtesy of the
Backstabber's Knife Collection and MALOSS datasets), 1,430 popular packages, and
986 randomly-selected packages.
10. How do malware
detection approaches
perform?
• We scanned these packages with each chosen tool,
recording all alerts produced by the setup.py files (which can
run malicious code at package installation time) as well as
the entire package (for malicious code that executes at
runtime).
• We consider an alert for a malicious package a true positive
and an alert for a benign package a false positive.
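The labeling rule above can be sketched as a small scoring loop. The alert counts are mocked; the real tools (Bandit4Mal, OSSGadget, PyPI Malware Checks) are external programs whose output would feed this tally:

```python
# Minimal sketch of the evaluation logic: an alert on a known-malicious
# package counts as a true positive, an alert on a benign package as a
# false positive. Scanner output is mocked with per-package alert counts.

def score(alert_counts: dict, malicious: set) -> dict:
    """Tally TP/FP/FN/TN from per-package alert counts."""
    tally = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for pkg, alerts in alert_counts.items():
        flagged = alerts > 0               # "any alert" decision rule
        if pkg in malicious:
            tally["tp" if flagged else "fn"] += 1
        else:
            tally["fp" if flagged else "tn"] += 1
    return tally

counts = {"evil-pkg": 4, "requests": 2, "numpy": 0}  # hypothetical results
print(score(counts, malicious={"evil-pkg"}))
# {'tp': 1, 'fp': 1, 'fn': 0, 'tn': 1}
```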
11. Scanners catch the
majority of malicious
packages.
• All three of these tools had true positive rates above 50% when scanning only
setup.py files.
• When including all Python files, the tools detected over 85% of malicious packages.
12. False positive rates are
high (sometimes higher
than true positive rates)
• The measured tools have false positive rates between 15% and 97%.
• The false positive rate increases when checking all files rather than just
setup.py files, sometimes exceeding the true positive rate for malicious
packages.
• This suggests that many rules used by these tools are designed to
catch behavior that is suspicious in setup.py files, but normal in
package code.
13. When it rains, it pours: packages with one alert
often have many more.
• The tools can fire multiple alerts per package, and they did.
• Scanning the setup.py files of benign packages, we find that all tools have a median of 3
or fewer alerts.
• When scanning all Python files, the number of alerts increases to between 10 and 85.
• The noisiest benign package had 145,799 alerts.
14. Making alerting stricter results in missing
a lot of malware
• Rather than flagging a package as possibly malicious if it has any alerts, we tried
requiring a threshold number of alerts.
• We found that with a higher threshold, the tools report very few (or even no)
malicious packages even before the false positive rates became manageable.
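The threshold experiment can be sketched as a sweep over a minimum alert count. The alert counts here are toy data, but the pattern mirrors the finding: raising the threshold discards true positives before the false positive rate drops to a workable level:

```python
# Sketch of the threshold experiment: instead of flagging on any alert,
# require at least `threshold` alerts. Alert counts are made-up toy data.

def rates(counts: dict, malicious: set, threshold: int):
    """Return (true_positive_rate, false_positive_rate) at a given threshold."""
    tp = sum(1 for p, n in counts.items() if p in malicious and n >= threshold)
    fp = sum(1 for p, n in counts.items() if p not in malicious and n >= threshold)
    n_mal = sum(1 for p in counts if p in malicious)
    n_ben = len(counts) - n_mal
    return tp / n_mal, fp / n_ben

counts = {"mal-a": 2, "mal-b": 1, "ben-a": 6, "ben-b": 40, "ben-c": 0, "ben-d": 1}
bad = {"mal-a", "mal-b"}
for k in (1, 3, 10):
    tpr, fpr = rates(counts, bad, k)
    print(k, tpr, fpr)
# At threshold 3 the true positive rate is already 0.0 while half the
# benign packages still trip the alarm.
```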
15. Some rules are better than others
• Some rules are better than others. One rule checks for networking code in
unexpected places; these checks were a good indicator of a malicious package.
• Other rules, which looked for metaprogramming or running external processes,
were less effective at distinguishing malicious from benign code.
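A rule in this spirit can be sketched with Python's `ast` module: flag imports of networking modules inside a setup.py, where network activity is unusual. This is a simplified illustrative stand-in, not one of PyPI's actual malware-check rules:

```python
# Illustrative "networking code in unexpected places" rule: flag imports
# of networking modules in setup.py source. Simplified stand-in only;
# real rules also match calls, string obfuscation, etc.
import ast

NETWORK_MODULES = {"socket", "urllib", "requests", "http"}

def networking_imports(source: str) -> list:
    """Return networking modules imported by the given Python source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names
                     if a.name.split(".")[0] in NETWORK_MODULES]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in NETWORK_MODULES:
                hits.append(node.module)
    return hits

setup_py = "from setuptools import setup\nimport socket\nsetup(name='pkg')\n"
print(networking_imports(setup_py))  # ['socket']
```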
16. The tools ran reasonably fast
• Tested on a laptop, processing a typical package took well under 10 seconds.
• This is too slow to run synchronously before a package upload finishes, but is
quite reasonable for passively analyzing a repository.
17. Potential directions for better scanning
• Prioritize higher-impact packages: typosquatters, shrinkwrapped clones, and popular
packages.
• Consider dynamic scanning techniques, running code in a sandbox.
• Make sure tools are easy to interpret. “6 alerts” is hard to evaluate; “makes network calls
to these domains,” less so.
• Most importantly, don’t expect volunteer repository administrators to maintain and run
tools for you; instead, form a relationship and plan to work together for the long haul.
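The first direction, prioritizing typosquats of popular packages, can be sketched with an edit-distance check against a popularity list. The list below is a tiny illustrative sample; a real deployment would use actual download statistics:

```python
# Sketch of typosquat prioritization: compare a new package name against
# popular package names and flag near-misses. POPULAR is illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

POPULAR = ["requests", "numpy", "urllib3"]

def possible_typosquats(name: str, max_distance: int = 1) -> list:
    """Popular packages within max_distance edits of name (exact matches excluded)."""
    return [p for p in POPULAR if 0 < levenshtein(name, p) <= max_distance]

print(possible_typosquats("request"))  # ['requests']
print(possible_typosquats("numpy"))   # [] -- the legitimate package itself
```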
18. Conclusions
• The primary lesson from our interviews and experiments is to listen to maintainers.
• Researchers should engage with maintainers, who can outline requirements for practical
systems, and who have endless ideas worth exploring.
• We remain optimistic about open-source software security. Organizations like the
OpenSSF do listen to maintainers while providing resources for academics, maintainers,
and companies to collaborate.
• As long as we listen to what the community has to say, open-source security will steadily
improve.
Editor's Notes
Hello everyone, my name is Ly Vu. Today I am going to talk about our work titled Bad Snakes: Understanding and Improving Python Package Index Malware Scanning. This is joint work with Zack Newman and John Speed Meyers, and has been supported by Chainguard.
Let me first explain what motivates our study. There are tools out there that can detect malicious open-source software packages, backed by decades of academic research as well as commercial development. But are these tools really the cure for these malicious packages? Could they be adopted in package repositories?
To find out what package repositories require of a malware detection tool, and the current status of such tools, we conducted a two-fold study. First, we checked in with administrators of PyPI, the main and largest repository for third-party Python packages. We then performed experiments comparing different malware detection techniques to see how they measured up to the requirements of real package repositories.
We distill several key insights from our study:
Current malware detection tools do not seem suitable to run automatically on open source package repositories, in large part because they are too noisy in terms of false alerts.
External researchers, such as those from academia, can and do run their own tools in their own environments and send incident reports to repository maintainers to get malware removed.
Both external researchers and PyPI benefit by working together
There are promising directions for improving the malware scanners, and other, even more promising techniques for improving software repository security that administrators and other researchers are working toward right now.
We interviewed PyPI administrators and an academic researcher to see what it would take to deploy malware detection techniques on open-source package repositories such as PyPI. PyPI, the Python Package Index, deployed a so-called "malware checks" system in 2020, and two of our interviewees have direct experience developing this system. However, unfortunately, these checks aren't used anymore.
We sought to find out why not, and what it would take to deploy such a system again.
Particularly, we ask the following questions:
What is the origin story of the current PyPI malware checks?
What has been your experience, if any, with the current PyPI malware checks?
What are the current plans, if any, for improving the PyPI malware checks?
How do you judge the performance of a PyPI malware check system?
How would you judge a set of proposed improvements to the PyPI malware check system?
Many researchers, when designing a detection system, aim to catch all or most malware. They tend to accept a low false-positive rate as the price to pay to catch bad actors.
However, PyPI receives many legitimate packages every day; even seemingly low false positive rates would require administrators to manually inspect thousands of packages each week.
Hence, an automated tool needs an "effectively zero" false positive rate to be considered for integration into a package repository's security pipeline.
The second insight from our interviews: PyPI and similar package repositories must weigh automated malware detection against other security mechanisms such as software signing and multi-factor authentication.
On the other hand, most malware packages affect few or no actual users (for example, only downloads from mirrors or bots). PyPI administrators have therefore decided to use their limited resources to focus on higher-impact projects.
External researchers, such as those from academia, develop and operate Python malware detection tools using their own time and computing resources; they can then report malicious packages to PyPI.
PyPI maintainers therefore benefit from high-quality, low-noise reports on malware. On the other side, security researchers benefit from positive coverage of their company, products, and services.
To understand whether a specific system was appropriate for a repository's security pipeline, we ran experiments comparing different Python malware detection approaches.
These systems include static code analysis tools, dynamic analysis tools that observe the runtime behavior of a package, and metadata analysis tools that look at things like package names or download counts.
We collected malware samples from the two biggest datasets, the Backstabber’s Knife Collection and MALOSS. We also collected top popular and randomly selected packages from PyPI.
This table presents the tools we surveyed. We focused on behavior-based tools, as they can provide much more precise analysis than metadata-based tools. To be included, a tool had to have its source code available and publish its detection rules.
In the end, we found three detection tools: Bandit4Mal, a custom version of Bandit designed to catch malicious code; OSSGadget, a tool developed by Microsoft to scan not only Python code but also other languages such as JavaScript; and PyPI Malware Checks, the default checks developed by PyPI.
This diagram represents our experiments with the malware detection tools. Given the list of packages collected from PyPI and the two malware datasets, we run the selected tools on the chosen package artifacts.
We record the alerts generated on the whole package and on its setup.py file, as that is often the file injected with malicious code.
An alert is considered a true positive if a malicious package is classified as malicious; it is a false positive when a benign, legitimate package is classified as malicious.
Here is what we found. The scanners catch the majority of malicious packages, which is good news. In particular, all three of the selected tools had true positive rates above 50% when considering only setup.py files.
When including all Python files, the tools detected over 85% of malicious packages.
However, we observed that the tools suffered from relatively high false positive rates, especially when checking all Python files.
The false positive rates range between 15% and 97% across the tools. When scanning all files, the chance of falsely classifying benign code as malicious was much higher.
This suggests that many rules in these tools are designed to catch behavior that is suspicious in setup.py files, but normal in package code.
We observed that packages with one alert often have many more: when it rains, it pours.
Scanning the setup.py files (the installation files of Python packages), we find that all tools have a median of 3 or fewer alerts.
Scanning all Python files increased the number of alerts to between 10 and 85.
In our experiment, the noisiest benign package had 145,799 alerts.
Rule-based tools require setting a proper threshold to balance the false positive and true positive rates. In our experiment, rather than flagging a package as possibly malicious if it has any alerts, we tried requiring a threshold number of alerts.
We found that with a higher threshold, the tools report very few (or even no) malicious packages even before the false positive rates become manageable.
We examined the rules in the PyPI malware checks and observed that not all rules are equal in detecting malicious code.
Some rules are better than others. For example, checking the presence of an outgoing network connection or command execution could be a good indicator of a malicious package.
Other rules, which looked for metaprogramming or running external processes, were less effective at separating malicious from benign code.
Improving the rules is therefore important for increasing the precision of the tools.
We observed that the tools run reasonably fast. Tested on a laptop with 8 GB of RAM and a 4-core Intel CPU, processing a typical package took well under 10 seconds.
This may be too slow to run before a package upload finishes, but is quite reasonable for passively analyzing a repository. For example, one could run the tools to scan the entire repository.
There are potential directions for better malware scanning that we learned from the interviews and our experiments.
First, prioritizing higher-impact packages such as typosquatting packages and popular packages. These are common malware attack vectors and are heavily used for malware infection.
Second, investing in dynamic scanning techniques, which run code in a sandbox, as they can provide more precise analysis and reflect the true behavior of malware.
Third, making sure that the output of scanners is easy to interpret, for example by giving an insightful description for each alert. "6 alerts" is hard to evaluate, but "makes network calls to these domains" is much easier to act on.
And last but not least, don't expect volunteer repository administrators to maintain and run tools for you; instead, form a relationship and plan to work together for the long haul.
It now comes to the end of my presentation. Let me summarize our research work in this final slide.
The main message from our interviews and experiments is to listen to maintainers, who are responsible for managing security issues in package repositories.
Maintainers can outline requirements for practical systems and have endless ideas for improving them. External researchers should actively engage with maintainers of package repositories to evaluate their solutions and get feedback.
Still we remain optimistic about open-source software security. Organizations like OpenSSF do listen to maintainers while providing resources for academics, maintainers, and companies to collaborate.
As long as we listen to what the community has to say, we believe open-source security will improve steadily.