1. Bad Snakes:
Understanding and Improving Python Package Index
Malware Scanning
D.L. Vu1,2 Zachary Newman1 John Speed Meyers1
1Chainguard, USA, 2FPT University, Vietnam
5/18/2023
Bad Snakes: Understanding and Improving Python Package
Index Malware Scanning
1
lyvd@fe.edu.vn zjn@chainguard.dev jsmeyers@chainguard.dev
2. Problem Statement
● An increasing amount of malware is appearing on open source package repositories,
specifically PyPI.
● Academic and commercial tools that can detect malicious open source software
packages start to sound like a magic wand that could make these problems
disappear.
● But are these tools really the cure for these malicious packages?
3. Our study
• We spoke to administrators of and contributors to PyPI, the main repository for
Python packages, along with an academic researcher who works on this problem.
• We conducted an empirical study of malware detection tools to see how they
measured up to the requirements of real package repositories.
4. Takeaways
• These tools aren’t suitable to run on open source software repositories
automatically, in large part because they’re too noisy.
• External researchers can (and do) run their own tools in their own environments
and send reports to get malware removed.
• This often works out better for everybody involved.
• There are promising directions for improving these scanners, and other, even
more promising techniques for improving software repository security that
administrators are working toward right now.
5. Interviews
• We checked in with members of the PyPI community and supply-chain security
researchers to see what it would take to deploy malware detection techniques on
package repositories.
• PyPI deployed an experimental “malware checks” system in 2020, so our interviewees
(an administrator of PyPI, and one developer of the malware check system) have direct
experience with running malware detection for a real repository. However, these checks
aren’t used anymore.
• We sought to find out why not, and what it would take to deploy such a system again.
6. False positive rates matter more than false
negative rates
• Many researchers build systems designed to catch all or most malware: after all,
we don’t want to let bad packages through.
• They accept a low false-positive rate as the price to pay to catch bad actors.
• However, given the number of legitimate packages published, even seemingly low
rates (like 5%) require administrators to manually inspect thousands of
packages each week.
• An automated tool needs an “effectively zero” false positive rate.
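The arithmetic behind this point can be sketched in a few lines. The weekly upload volume below is a hypothetical round number for illustration, not a measured PyPI statistic:

```python
# Back-of-the-envelope sketch of why "low" false-positive rates are
# unworkable at repository scale. Upload volume is a hypothetical figure.

def weekly_review_load(uploads_per_week: int, false_positive_rate: float) -> int:
    """Expected number of benign packages flagged for manual inspection."""
    return round(uploads_per_week * false_positive_rate)

# Even a "low" 5% false-positive rate on 50,000 weekly uploads means
# thousands of benign packages for administrators to triage by hand.
print(weekly_review_load(50_000, 0.05))   # 2500
print(weekly_review_load(50_000, 0.001))  # 50: still dozens per week at 0.1%
```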
7. Repository administrators must balance
multiple security priorities
• PyPI and similar repositories must weigh automated malware detection against
software signing and multi-factor authentication.
• Most malware packages affect few or no actual users, so PyPI administrators have
decided to use their finite resources to focus on higher-impact projects.
8. Just because PyPI isn’t running these checks
doesn’t mean that others aren’t.
• Security researchers develop and operate Python malware detection systems
using their own time and computing resources, providing reports to PyPI when
they detect malicious packages.
• PyPI maintainers benefit from high-quality, low-noise reports on malware, and the
security researchers benefit from positive coverage of their company, products,
and services.
9. Benchmarking Different Malware Detection
Approaches
• To understand whether existing systems were appropriate for this setting, we ran
experiments comparing different Python malware detection approaches.
• These systems include static analysis tools that analyze source code, dynamic analysis
tools that observe running software, and metadata analysis tools that look at things like
package names.
• We found three Python malware detection tools which met our criteria:
Bandit4Mal, OSSGadget OSS Detect Backdoor, and PyPI Malware Checks.
• We used a benchmark dataset including 168 malware packages (courtesy of the
Backstabber's Knife Collection and MALOSS datasets), 1,430 popular packages, and
986 randomly-selected packages.
10. How do malware
detection approaches
perform?
• We scanned these packages with each chosen tool,
recording all alerts produced by the setup.py files (which can
run malicious code at package installation time) as well as
the entire package (for malicious code that executes at
runtime).
• We consider an alert for a malicious package a true positive
and an alert for a benign package a false positive.
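The labeling rule above can be sketched as a small scoring loop. The alert counts are mocked; the real tools (Bandit4Mal, OSSGadget, PyPI Malware Checks) are external programs whose output would feed this tally:

```python
# Minimal sketch of the evaluation logic: an alert on a known-malicious
# package counts as a true positive, an alert on a benign package as a
# false positive. Scanner output is mocked with per-package alert counts.

def score(alert_counts: dict, malicious: set) -> dict:
    """Tally TP/FP/FN/TN from per-package alert counts."""
    tally = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for pkg, alerts in alert_counts.items():
        flagged = alerts > 0               # "any alert" decision rule
        if pkg in malicious:
            tally["tp" if flagged else "fn"] += 1
        else:
            tally["fp" if flagged else "tn"] += 1
    return tally

counts = {"evil-pkg": 4, "requests": 2, "numpy": 0}  # hypothetical results
print(score(counts, malicious={"evil-pkg"}))
# {'tp': 1, 'fp': 1, 'fn': 0, 'tn': 1}
```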
11. Scanners catch the
majority of malicious
packages.
• All three of these tools had true positive rates above 50% when scanning only
setup.py files.
• When including all Python files, the tools detected over 85% of malicious packages.
12. False positive rates are
high (sometimes higher
than true positive rates)
• The measured tools have false positive rates between 15% and 97%.
• The false positive rate increases when checking all files rather than just
setup.py files, sometimes exceeding the true positive rate for malicious
packages.
• This suggests that many rules used by these tools are designed to
catch behavior that is suspicious in setup.py files, but normal in
package code.
13. When it rains, it pours: packages with one alert
often have many more.
• The tools can fire multiple alerts per package, and they did.
• Scanning the setup.py files of benign packages, we find that all tools have a median of 3
or fewer alerts.
• When scanning all Python files, the number of alerts increases to between 10 and 85.
• The noisiest benign package had 145,799 alerts.
14. Making alerting stricter results in missing
a lot of malware
• Rather than flagging a package as possibly malicious if it has any alerts, we tried
requiring a threshold number of alerts.
• We found that with a higher threshold, the tools report very few (or even no)
malicious packages even before the false positive rates became manageable.
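The threshold experiment can be sketched as a sweep over a minimum alert count. The alert counts here are toy data, but the pattern mirrors the finding: raising the threshold discards true positives before the false positive rate drops to a workable level:

```python
# Sketch of the threshold experiment: instead of flagging on any alert,
# require at least `threshold` alerts. Alert counts are made-up toy data.

def rates(counts: dict, malicious: set, threshold: int):
    """Return (true_positive_rate, false_positive_rate) at a given threshold."""
    tp = sum(1 for p, n in counts.items() if p in malicious and n >= threshold)
    fp = sum(1 for p, n in counts.items() if p not in malicious and n >= threshold)
    n_mal = sum(1 for p in counts if p in malicious)
    n_ben = len(counts) - n_mal
    return tp / n_mal, fp / n_ben

counts = {"mal-a": 2, "mal-b": 1, "ben-a": 6, "ben-b": 40, "ben-c": 0, "ben-d": 1}
bad = {"mal-a", "mal-b"}
for k in (1, 3, 10):
    tpr, fpr = rates(counts, bad, k)
    print(k, tpr, fpr)
# At threshold 3 the true positive rate is already 0.0 while half the
# benign packages still trip the alarm.
```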
15. Some rules are better than others
• Some rules are better than others. One rule checks for networking code in
unexpected places; these checks were a good indicator of a malicious package.
• Other rules, which looked for metaprogramming or running external processes,
were less effective at distinguishing malicious from benign code.
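A rule in this spirit can be sketched with Python's `ast` module: flag imports of networking modules inside a setup.py, where network activity is unusual. This is a simplified illustrative stand-in, not one of PyPI's actual malware-check rules:

```python
# Illustrative "networking code in unexpected places" rule: flag imports
# of networking modules in setup.py source. Simplified stand-in only;
# real rules also match calls, string obfuscation, etc.
import ast

NETWORK_MODULES = {"socket", "urllib", "requests", "http"}

def networking_imports(source: str) -> list:
    """Return networking modules imported by the given Python source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names
                     if a.name.split(".")[0] in NETWORK_MODULES]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in NETWORK_MODULES:
                hits.append(node.module)
    return hits

setup_py = "from setuptools import setup\nimport socket\nsetup(name='pkg')\n"
print(networking_imports(setup_py))  # ['socket']
```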
16. The tools ran reasonably fast
• Tested on a laptop, processing a typical package took well under 10 seconds.
• This is too slow to run synchronously before a package upload finishes, but is
quite reasonable for passively analyzing a repository.
17. Potential directions for better scanning
• Prioritize higher-impact packages: typosquatters, shrinkwrapped clones, and popular
packages.
• Consider dynamic scanning techniques, running code in a sandbox.
• Make sure tools are easy to interpret. “6 alerts” is hard to evaluate; “makes network calls
to these domains,” less so.
• Most importantly, don’t expect volunteer repository administrators to maintain and run
tools for you; instead, form a relationship and plan to work together for the long haul.
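The first direction, prioritizing typosquats of popular packages, can be sketched with an edit-distance check against a popularity list. The list below is a tiny illustrative sample; a real deployment would use actual download statistics:

```python
# Sketch of typosquat prioritization: compare a new package name against
# popular package names and flag near-misses. POPULAR is illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

POPULAR = ["requests", "numpy", "urllib3"]

def possible_typosquats(name: str, max_distance: int = 1) -> list:
    """Popular packages within max_distance edits of name (exact matches excluded)."""
    return [p for p in POPULAR if 0 < levenshtein(name, p) <= max_distance]

print(possible_typosquats("request"))  # ['requests']
print(possible_typosquats("numpy"))   # [] -- the legitimate package itself
```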
18. Conclusions
• The primary lesson from our interviews and experiments is to listen to maintainers.
• Researchers should engage with maintainers, who can outline requirements for practical
systems, and who have endless ideas worth exploring.
• We remain optimistic about open-source software security. Organizations like the
OpenSSF do listen to maintainers while providing resources for academics, maintainers,
and companies to collaborate.
• As long as we listen to what the community has to say, open-source security will steadily
improve.
Editor's Notes
Hello everyone, my name is Ly Vu. Today I am going to talk about our work titled Bad Snakes: Understanding and Improving Python Package Index Malware Scanning. This is joint work with Zack Newman and John Speed Meyers, and has been supported by Chainguard.
Let me first explain what motivates our study. There are tools out there that can detect malicious open-source software packages, backed by decades of academic research as well as commercial development. But are these tools really the cure for these malicious packages? Could they be adopted in package repositories?
To find out what package repositories require of a malware detection tool, and the current status of such tools, we conducted a two-fold study. First, we checked in with administrators of PyPI, the main and largest repository for third-party Python packages. We then performed experiments comparing different malware detection techniques to see how they measured up to the requirements of real package repositories.
We distill several key insights from our study:
Current malware detection tools do not seem suitable to run automatically on open source package repositories, in large part because they are too noisy in terms of false alerts.
External researchers, such as those from academia, can and do run their own tools in their own environments and send incident reports to repository maintainers to get malware removed.
Both external researchers and PyPI benefit by working together
There are promising directions for improving the malware scanners, and other, even more promising techniques for improving software repository security that administrators and other researchers are working toward right now.
We interviewed PyPI administrators and an academic researcher to see what it would take to deploy malware detection techniques on open-source package repositories such as PyPI. PyPI, the Python Package Index, deployed a so-called "malware checks" system in 2020, and two of our interviewees have direct experience developing this system. However, unfortunately, these checks aren't used anymore.
We sought to find out why not, and what it would take to deploy such a system again.
Particularly, we ask the following questions:
What is the origin story of the current PyPI malware checks?
What has been your experience, if any, with the current PyPI malware checks?
What are the current plans, if any, for improving the PyPI malware checks?
How do you judge the performance of a PyPI malware check system?
How would you judge a set of proposed improvements to the PyPI malware check system?
Many researchers, when designing a detection system, aim to catch all or most malware. They tend to accept a low false-positive rate as the price to pay to catch bad actors.
However, PyPI receives many legitimate packages every day; even seemingly low false positive rates would require administrators to manually inspect thousands of packages each week.
Hence, an automated tool needs an "effectively zero" false positive rate to be considered for integration into a package repository's security pipeline.
The second insight from our interviews: PyPI and similar package repositories must weigh automated malware detection against other security mechanisms such as software signing and multi-factor authentication.
On the other hand, most malware packages affect few or no actual users (for example, only downloads from mirrors or bots). PyPI administrators have therefore decided to use their limited resources to focus on higher-impact projects.
External researchers, such as those from academia, develop and operate Python malware detection tools using their own time and computing resources; they can then report malicious packages to PyPI.
PyPI maintainers therefore benefit from high-quality, low-noise reports on malware. On the other side, security researchers benefit from positive coverage of their company, products, and services.
To understand whether a specific system was appropriate for a repository's security pipeline, we ran experiments comparing different Python malware detection approaches.
These systems include static code analysis tools, dynamic analysis tools that observe the runtime behavior of a package, and metadata analysis tools that look at things like package names or download counts.
We collected malware samples from the two biggest datasets, the Backstabber’s Knife Collection and MALOSS. We also collected top popular and randomly selected packages from PyPI.
This table presents the tools we surveyed. We focused on behavior-based tools, as they can provide much more precise analysis than metadata-based tools. To be included, a tool had to have its source code available and publish its detection rules.
In the end, we found three detection tools: Bandit4Mal, a custom version of Bandit designed to catch malicious code; OSSGadget, a tool developed by Microsoft to scan not only Python code but also other languages such as JavaScript; and PyPI Malware Checks, the default checks developed by PyPI.
This diagram represents our experiments with the malware detection tools. Given the list of packages collected from PyPI and the two malware datasets, we run the selected tools on the chosen package artifacts.
We record the alerts generated on the whole package and on its setup.py file, as that is often the file injected with malicious code.
An alert is considered a true positive if a malicious package is classified as malicious; it is a false positive when a benign, legitimate package is classified as malicious.
Here is what we found. The scanners catch the majority of malicious packages, which is good news. In particular, all three of the selected tools had true positive rates above 50% when considering only setup.py files.
When including all Python files, the tools detected over 85% of malicious packages.
However, we observed that the tools suffered from relatively high false positive rates, especially when checking all Python files.
The false positive rates range between 15% and 97% across the tools. When scanning all files, the chance of falsely classifying benign code as malicious was much higher.
This suggests that many rules in these tools are designed to catch behavior that is suspicious in setup.py files, but normal in package code.
We observed that packages with one alert often have many more: when it rains, it pours.
Scanning the setup.py files (the installation files of Python packages), we find that all tools have a median of 3 or fewer alerts.
Scanning all Python files increased the number of alerts to between 10 and 85.
In our experiment, the noisiest benign package had 145,799 alerts.
Rule-based tools require setting a proper threshold to balance the false positive and true positive rates. In our experiment, rather than flagging a package as possibly malicious if it has any alerts, we tried requiring a threshold number of alerts.
We found that with a higher threshold, the tools report very few (or even no) malicious packages even before the false positive rates become manageable.
We examined the rules in the PyPI malware checks and observed that not all rules are equal in detecting malicious code.
Some rules are better than others. For example, checking the presence of an outgoing network connection or command execution could be a good indicator of a malicious package.
Other rules, which looked for metaprogramming or running external processes, were less effective at separating malicious from benign code.
Improving the rules is therefore important for increasing the precision of the tools.
We observed that the tools run reasonably fast. Tested on a laptop with 8 GB of RAM and a 4-core Intel CPU, processing a typical package took well under 10 seconds.
This may be too slow to run before a package upload finishes, but is quite reasonable for passively analyzing a repository. For example, one could run the tools to scan the entire repository.
There are potential directions for better malware scanning that we learned from the interviews and our experiments.
First, prioritizing higher-impact packages such as typosquatting packages and popular packages. These are common malware attack vectors and are heavily used for malware infection.
Second, investing in dynamic scanning techniques, which run code in a sandbox, as they can provide more precise analysis and reflect the true behavior of malware.
Third, making sure that the output of scanners is easy to interpret, for example by giving an insightful description for each alert. "6 alerts" is hard to evaluate, but "makes network calls to these domains" is much easier to act on.
And last but not least, don't expect volunteer repository administrators to maintain and run tools for you; instead, form a relationship and plan to work together for the long haul.
It now comes to the end of my presentation. Let me summarize our research work in this final slide.
The main message from our interviews and experiments is to listen to maintainers, who are responsible for managing security issues in package repositories.
Maintainers can outline requirements for practical systems and have endless ideas for improving them. External researchers should actively engage with maintainers of package repositories to evaluate their solutions and get feedback.
Still we remain optimistic about open-source software security. Organizations like OpenSSF do listen to maintainers while providing resources for academics, maintainers, and companies to collaborate.
As long as we listen to what the community has to say, we believe open-source security will improve steadily.