Estimating security risk through repository mining
Tamas K Lengyel, PhD
2
#whoami
• Sr Security Researcher @ Intel
• Maintainer of Xen, LibVMI, DRAKVUF, KF/x
• Working on fuzzing, pen testing, secure architecture review
Linux Security Summit Europe ‘23
3
Agenda
1. Motivation
2. Problem statement & hypothesis
3. Experiment design & tools
4. Results
5. Threats to validity
6. Discussion
7. Summary
Linux Security Summit Europe ‘23
4
Motivation
Why estimate security risk?
Linux Security Summit Europe ‘23
• Complexity is increasing
• Software supply chain is opaque, even if you have an SBOM
• Manual review doesn’t scale
• OpenSSF Scorecard - Security health metrics for Open Source
xkcd: Dependency
5
What’s measured by the scorecard?
Linux Security Summit Europe ‘23
https://github.com/ossf/scorecard#scorecard-checks
6
Problem statement
Does it work?
Linux Security Summit Europe ‘23
• Can a project with a high score really be considered low risk?
• There is no evidence pro or contra
• How can we approach this problem scientifically?
1. Wait and see: do we get a correlation between low scores and high incident rates?
✗ -ENOTIME
2. Check against CVEs: are there more CVEs reported for projects with low scores?
✗ Not many projects use CVEs, and relying on the ones that do would introduce selection bias
7
Hypothesis
If we can’t measure it directly, we’ll find a proxy
The scorecard checks are applicable to other software quality metrics as well: a well-maintained project with CI, code reviews, fuzzing, etc. should have fewer bugs!
We can measure bug density with static analysis tools!
Bugs don’t equal security risk – but the scorecard should correlate with both
Linux Security Summit Europe ‘23
8
Experiment design
1. Find the most popular C & C++ projects on GitHub
2. Run the Scorecard (if not already installed)
3. Run static analysis to find bugs
4. Perform linear regression analysis
We need bugs – lots of bugs
Linux Security Summit Europe ‘23
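As an illustration of step 1, here is a minimal sketch of how the repository list could be gathered from the public GitHub search API. The query string, star threshold, and pagination are assumptions that mirror the 400-star cutoff mentioned in the SRS limitations; this is not the SRS implementation itself.

```python
# Hypothetical sketch: list popular C repositories via the GitHub search API.
import requests

def search_repos(language: str, min_stars: int = 400, pages: int = 2):
    """Yield (full_name, stars) for the most-starred repositories of a language."""
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": f"language:{language} stars:>={min_stars}",
                "sort": "stars",
                "order": "desc",
                "per_page": 100,
                "page": page,
            },
            headers={"Accept": "application/vnd.github+json"},
            timeout=30,
        )
        resp.raise_for_status()
        for repo in resp.json()["items"]:
            yield repo["full_name"], repo["stargazers_count"]

if __name__ == "__main__":
    for name, stars in search_repos("c"):
        print(stars, name)
```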
9
Scaling Repo Scanner (SRS)
• Deployed via GitHub Actions
• Perform search & analysis on thousands of repositories
• Collect metrics & metadata, publish results on GitHub Pages
• Extendable framework to add new scans
• OSSF Scorecard
• clang scan-build + Z3 verifier: detects the most common bug types with a supposedly low false-positive rate
• clang-tidy cognitive complexity analysis
• Lines of Code + GitHub repository metadata collection
• Open-source: https://github.com/intel/srs
Scalable static analysis of GitHub projects
Linux Security Summit Europe ‘23
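For the scan-build and clang-tidy steps, a rough sketch of how a single repository could be scanned. The wrapper functions and paths are hypothetical, not the SRS code; the scan-build and clang-tidy invocations use standard, documented flags.

```python
# Hypothetical per-repo scan driver; the real SRS pipeline runs in GitHub Actions.
import subprocess
from pathlib import Path

def run_scan_build(build_dir: Path, results_dir: Path) -> int:
    """Run the clang static analyzer over a make-based build, storing HTML reports."""
    return subprocess.run(
        ["scan-build", "-o", str(results_dir), "make", "-j4"],
        cwd=build_dir,
    ).returncode

def cognitive_complexity_report(source_file: Path) -> str:
    """Flag cognitively complex functions using the clang-tidy readability check."""
    result = subprocess.run(
        ["clang-tidy",
         "-checks=-*,readability-function-cognitive-complexity",
         str(source_file), "--"],
        capture_output=True, text=True,
    )
    return result.stdout
```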
10
Scaling Repo Scanner (SRS)
• Only builds 400+ star C & C++ GitHub repositories on Debian
• Dependencies must be greppable & apt-gettable
• Should build with a modern clang compiler (15)
• Build time limit is 6 hours enforced by GitHub
• Autotools, meson & cmake build systems only
• GitHub API rate limit is a bottleneck, 5000/hour max
• Disk space is limited, builds can fail if it generates too many files
Limitations
Linux Security Summit Europe ‘23
11
Scaling Repo Scanner (SRS)
• Scans run monthly
• Summary & scan details for every repo
• Download & run your own analysis
• Fork & add your own scans
Results posted to https://intel.github.io/srs
Linux Security Summit Europe ‘23
12
Results: OSSF scorecard as predictor for bugs
Date | # of repos | Bug change estimate per score increase | P-value
8/23 | 2097 | -6.937 | 2.02*10^-3 (**)
7/23 | 2071 | -7.572 | 7.02*10^-4 (***)
6/23 | 2035 | -9.127 | 7.95*10^-5 (***)
5/23 | 2003 | -8.937 | 1.43*10^-4 (***)
4/23 | 2318 | -6.549 | 3.45*10^-3 (**)
• Each point increase in the OSSF scorecard correlates with a reduction in the number of bugs found
• Statistically significant results!
Ship it! o/
Linux Security Summit Europe ‘23
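A minimal sketch of the regression behind this table, assuming a per-repo CSV export with hypothetical column names scorecard and bugs (one row per repository, one file per monthly scan):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("srs-2023-08.csv")        # hypothetical export of one monthly scan
X = sm.add_constant(df["scorecard"])       # intercept + aggregate OSSF score
model = sm.OLS(df["bugs"], X).fit()        # scan-build bug count per repo

print(model.params["scorecard"])           # "bug change estimate per score increase"
print(model.pvalues["scorecard"])          # the reported p-value
print(model.rsquared_adj)                  # adjusted R²
```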
13
Results: OSSF scorecard as predictor for bugs
Bug count: mean = 38.09, stdev = 104.36
Linux Security Summit Europe ‘23
14
Results: OSSF scorecard as predictor for bugs
Adjusted R²: how well the model explains variance in the current data
Predicted R²: how well the model will predict new data
Practically no connection between the two (scorecard score and bug count)! (A sketch for computing predicted R² follows after the table.)
Linux Security Summit Europe ‘23
Date | # of repos | Bug change estimate per score increase | P-value | Adjusted R² | Predicted R²
8/23 | 2097 | -6.937 | 2.02*10^-3 (**) | 0.0040 | 0.0024
7/23 | 2071 | -7.572 | 7.02*10^-4 (***) | 0.0050 | 0.0033
6/23 | 2035 | -9.127 | 7.95*10^-5 (***) | 0.0071 | 0.0055
5/23 | 2003 | -8.937 | 1.43*10^-4 (***) | 0.0067 | 0.0051
4/23 | 2318 | -6.549 | 3.45*10^-3 (**) | 0.0032 | 0.0015
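Predicted R² is not reported by statsmodels out of the box; one common way to compute it is from the PRESS statistic (leave-one-out residuals derived from the hat matrix). A sketch, assuming a fitted OLS result like the one in the earlier regression sketch:

```python
import numpy as np

def predicted_r2(fitted_ols) -> float:
    """Predicted R² via PRESS: 1 - sum((e_i / (1 - h_ii))^2) / SS_total."""
    hat = fitted_ols.get_influence().hat_matrix_diag   # leverage of each observation
    press = np.sum((fitted_ols.resid / (1.0 - hat)) ** 2)
    y = fitted_ols.model.endog
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total
```

With an adjusted R² around 0.004 and a predicted R² around 0.002, the scorecard explains well under 1% of the variance in bug counts, statistically significant or not.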
15
Results: OSSF subscores as predictor for bugs
Check | Estimate | P-value | Adjusted R² | Predicted R²
Maintained | 2.047 | 3.05*10^-5 (***) | 7.792*10^-3 | 6.27*10^-3
Has-CI | 0.269 | 0.929 | -4.74*10^-4 | -1.96*10^-3
Code review | -5.613 | 0.172 | 4.15*10^-4 | -9.87*10^-4
• Perhaps the aggregate score doesn’t correlate, but subscores fare better?
• Nope, some checks actually estimate an increase in bugs for each score increase
• Other checks aren’t statistically significant (and don’t explain or predict the data anyway)
Linux Security Summit Europe ‘23
16
Results: OSSF scores as predictor for other bugs
Static analysis engine | Bug change estimate per score increase | P-value | Adjusted R² | Predicted R²
Facebook Infer | -7.78 | 0.297 | 4.19*10^-5 | -1.4*10^-3
BinAbsInspector | 2.636 | 0.929 | -4*10^-4 | -1.3*10^-3
Linux Security Summit Europe ‘23
Perhaps we need a different static analysis engine?
17
Can we find anything that predicts the bugs?
Looking at metadata
Linux Security Summit Europe ‘23
Check | Estimate | P-value | Adjusted R² | Predicted R²
Lines of code | 8.09*10^-5 | 3.35*10^-12 (***) | 0.0224 | 0.0187
Lines of comments | 2.084*10^-5 | 7.21*10^-14 (***) | 0.0259 | 0.0194
Size (KB) | 1.735*10^-5 | 8.17*10^-3 (**) | 0.0028 | 0.0007
Stars | 4.2*10^-3 | 0.133 | 0.0006 | -0.0008
Watchers | 0.07 | 0.071 | 0.0010 | -0.0008
Forks | 0.012 | 0.122 | 0.0006 | -0.0008
Issues | 0.046 | 1.96*10^-3 (**) | 0.0040 | 0.0028
18
Can we find anything that predicts the bugs?
# of functions & cognitive complexity
Linux Security Summit Europe ‘23
Check | Estimate | P-value | Adjusted R² | Predicted R²
Number of functions | 0.014 | 2*10^-16 (***) | 0.2782 | 0.2704
Number of cognitively complex functions | 0.097 | 2*10^-16 (***) | 0.2789 | 0.2734
% of functions cognitively complex | 1.191 | 8.3*10^-7 (***) | 0.011 | 0.009
19
Multiple linear regression (MLR)
# of functions & cognitive complexity
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + % of functions cogn. cmplx | 0.0145 / 0.8775 | 2*10^-16 (***) / 2.01*10^-5 (***) | 0.2841 | 0.2763
Number of functions + number of cogn. cmplx functions | 0.0077 / 0.0521 | 3.9*10^-13 (***) / 1.52*10^-13 (***) | 0.2964 | 0.2867
20
Multiple linear regression (MLR)
Linux Security Summit Europe ‘23
# of functions & cognitive complexity
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | 0.0083 / 0.0475 / 0.5164 | 2.96*10^-14 (***) / 7.29*10^-11 (***) / 1.43*10^-2 (*) | 0.2981 | 0.2883
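A sketch of the three-predictor MLR model on this slide, using the statsmodels formula API; the column names (n_functions, n_complex, pct_complex) are hypothetical stand-ins for the SRS metrics:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("srs-2023-08.csv")        # hypothetical per-repo export
mlr = smf.ols("bugs ~ n_functions + n_complex + pct_complex", data=df).fit()
print(mlr.summary())                       # per-predictor estimates, p-values, adjusted R²
```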
21
Looking at other static analysis results
Facebook Infer results with MLR model
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | -0.0059 / 0.366 / 2.991 | 0.356 / 3.46*10^-10 (***) / 5.93*10^-4 (***) | 0.11 | 0.088
22
Looking at other static analysis results
BinAbsInspector results with MLR model
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | 0.034 / -0.138 / 9.917 | 0.352 / 0.552 / 0.102 | 5.28*10^-4 | -2.28*10^-3
23
# of functions & scan-build bugs
24
# of complex functions & scan-build bugs
25
% of functions complex & scan-build bugs
26
Cook’s Distance: 41 outliers detected with MLR
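A sketch of how such an outlier screen could be done with Cook's distance from the fitted MLR model; the 4/n cut-off is a common rule of thumb and an assumption here, since the talk does not state which threshold was used:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("srs-2023-08.csv")                     # hypothetical per-repo export
formula = "bugs ~ n_functions + n_complex + pct_complex"
mlr = smf.ols(formula, data=df).fit()

cooks_d, _ = mlr.get_influence().cooks_distance         # one distance per repository
outliers = df.index[cooks_d > 4.0 / len(df)]            # assumed 4/n rule of thumb
print(len(outliers), "outliers")

refit = smf.ols(formula, data=df.drop(outliers)).fit()  # redo without the outliers
```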
27
Redo without outliers
MLR model vs OSSF Scorecard
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | 0.0075 / 0.0864 / 0.2737 | 10^-9 (***) / 2*10^-16 (***) / 2.95*10^-2 (*) | 0.3774 | 0.3729
OSSF Scorecard | -5.123 | 1.62*10^-4 (***) | 0.0062 | 0.0046
28
Are complex functions more buggy?
• 3.34% of cognitively complex functions had at least 1 bug
• Total of 12,410 out of 371,556 functions
• 0.97% of non-complex functions had at least 1 bug
• Total of 26,972 out of 2,754,311 functions
• 44.7% of bugs were found in cognitively complex functions
• Total of 35,687 bugs
• 55.3% of bugs were found in non-complex functions
• Total of 44,134 bugs
Only 11.8% of functions were cognitively complex, yet they had almost half of all the bugs!
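The headline percentages follow directly from the raw counts on this slide; a quick check:

```python
complex_total, complex_buggy = 371_556, 12_410
plain_total, plain_buggy = 2_754_311, 26_972
bugs_in_complex, bugs_in_plain = 35_687, 44_134

print(complex_buggy / complex_total)                        # ≈ 0.033 → 3.34% of complex functions have a bug
print(plain_buggy / plain_total)                            # ≈ 0.010 → about 1% of non-complex functions have a bug
print(bugs_in_complex / (bugs_in_complex + bugs_in_plain))  # ≈ 0.447 → 44.7% of all bugs
print(complex_total / (complex_total + plain_total))        # ≈ 0.12  → roughly 12% of functions are complex
```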
29
Threats to validity
• Bugs in the static analysis tools resulting in false positives
• Bugs in our data-collection & analysis scripts
• We only built ~2k C & C++ repos
• We only built the ones that compile on Debian
• Only repos with 400+ stars
• OSSF scorecard often runs out of GitHub API requests, so we had to disable some OSSF checks
• Linear regression modeling might not be the right analysis
What may have affected our analysis & results
Linux Security Summit Europe ‘23
30
Discussion
• What if all the bugs were false positives?
• If static analysis detects a ton of false-positive bugs in a project, is it more or less risky?
• If a project is confusing to read for humans & confuses static analysis tools, is it more or less risky?
• If the MLR model predicts a high bug count, but Scorecard predicts low risk, which one should we trust?
• Why would Scorecard predict “security risk” correctly but bugs incorrectly?
• It seems “complexity” correlates with bugs, so why would “risk” be different?
Linux Security Summit Europe ‘23
31
Discussion
• Clang scan-build is free & ideally should find 0 bugs in your project
• Add it as a simple GitHub PR gate
• Don’t merge code that adds bugs
• https://github.com/intel/srs/tree/scan-build-action-v1
Linux Security Summit Europe ‘23
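A minimal sketch of such a gate (not the intel/srs action itself): scan-build's documented --status-bugs flag makes it exit non-zero when the analyzer reports any bug, which is enough to fail a CI job.

```python
# Hypothetical CI entry point: fail the pipeline if scan-build reports any bug.
import subprocess
import sys

result = subprocess.run(
    ["scan-build", "--status-bugs", "-o", "scan-results", "make", "-j4"]
)
sys.exit(result.returncode)   # a non-zero exit blocks the merge
```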
32
Summary
• We need automated tools to quickly spot problematic Open Source projects
• OSSF Scorecard is a fantastic but very hard undertaking
• Claims should be backed by data & statistics
• If you think we made an error, let’s hear it & publish your data! ☺
Tools & scripts: https://github.com/intel/srs
Data: https://intel.github.io/srs
Security risk estimation is hard
Linux Security Summit Europe ‘23
33
Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly
available updates. See backup for configuration details. No product or component can be absolutely secure.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability,
fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of
dealing, or usage in trade.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other
names and brands may be claimed as the property of others.