Estimating security risk through repository mining
Tamas K Lengyel, PhD
2
#whoami
• Sr Security Researcher @ Intel
• Maintainer of Xen, LibVMI, DRAKVUF, KF/x
• Working on fuzzing, pen testing, secure architecture review
Linux Security Summit Europe ‘23
3
Agenda
1. Motivation
2. Problem statement & hypothesis
3. Experiment design & tools
4. Results
5. Threats to validity
6. Discussion
7. Summary
Linux Security Summit Europe ‘23
4
Motivation
Why estimate security risk?
Linux Security Summit Europe ‘23
• Complexity is increasing
• Software supply chain is opaque, even if you have an SBOM
• Manual review doesn’t scale
• OpenSSF Scorecard - Security health metrics for Open Source
xkcd: Dependency
5
What’s measured by the scorecard?
Linux Security Summit Europe ‘23
https://github.com/ossf/scorecard#scorecard-checks
6
Problem statement
Does it work?
Linux Security Summit Europe ‘23
• Can a project with a high score really be considered low risk?
• There is no evidence pro or contra
• How can we approach this problem scientifically?
1. Wait and see: do we get a correlation between low scores and high incident rates?
✗ -ENOTIME
2. Check against CVEs: are there more CVEs reported for projects with low scores?
✗ Not many projects use CVEs, and relying on the ones that do would introduce selection bias
7
Hypothesis
If we can’t measure it directly, we’ll find a proxy
The scorecard checks are applicable to other software quality metrics as well: a well-maintained project with CI, code reviews, fuzzing, etc. should have fewer bugs!
We can measure bug density with static analysis tools!
Bugs don’t equal security risk – but the scorecard should correlate with both
Linux Security Summit Europe ‘23
8
Experiment design
1. Find the most popular C & C++ projects on GitHub
2. Run the Scorecard (if not already installed)
3. Run static analysis to find bugs
4. Perform linear regression analysis
We need bugs – lots of bugs
Linux Security Summit Europe ‘23
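As an illustration of step 1, here is a minimal sketch of how the repository list could be gathered from the public GitHub search API. The query string, star threshold, and pagination are assumptions that mirror the 400-star cutoff mentioned in the SRS limitations; this is not the SRS implementation itself.

```python
# Hypothetical sketch: list popular C repositories via the GitHub search API.
import requests

def search_repos(language: str, min_stars: int = 400, pages: int = 2):
    """Yield (full_name, stars) for the most-starred repositories of a language."""
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": f"language:{language} stars:>={min_stars}",
                "sort": "stars",
                "order": "desc",
                "per_page": 100,
                "page": page,
            },
            headers={"Accept": "application/vnd.github+json"},
            timeout=30,
        )
        resp.raise_for_status()
        for repo in resp.json()["items"]:
            yield repo["full_name"], repo["stargazers_count"]

if __name__ == "__main__":
    for name, stars in search_repos("c"):
        print(stars, name)
```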
9
Scaling Repo Scanner (SRS)
• Deployed via GitHub Actions
• Perform search & analysis on thousands of repositories
• Collect metrics & metadata, publish results on GitHub Pages
• Extendable framework to add new scans
• OSSF Scorecard
• clang scan-build + Z3 verifier: detects the most common bug types with a supposedly low false-positive rate
• clang-tidy cognitive complexity analysis
• Lines of Code + GitHub repository metadata collection
• Open-source: https://github.com/intel/srs
Scalable static analysis of GitHub projects
Linux Security Summit Europe ‘23
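For the scan-build and clang-tidy steps, a rough sketch of how a single repository could be scanned. The wrapper functions and paths are hypothetical, not the SRS code; the scan-build and clang-tidy invocations use standard, documented flags.

```python
# Hypothetical per-repo scan driver; the real SRS pipeline runs in GitHub Actions.
import subprocess
from pathlib import Path

def run_scan_build(build_dir: Path, results_dir: Path) -> int:
    """Run the clang static analyzer over a make-based build, storing HTML reports."""
    return subprocess.run(
        ["scan-build", "-o", str(results_dir), "make", "-j4"],
        cwd=build_dir,
    ).returncode

def cognitive_complexity_report(source_file: Path) -> str:
    """Flag cognitively complex functions using the clang-tidy readability check."""
    result = subprocess.run(
        ["clang-tidy",
         "-checks=-*,readability-function-cognitive-complexity",
         str(source_file), "--"],
        capture_output=True, text=True,
    )
    return result.stdout
```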
10
Scaling Repo Scanner (SRS)
• Only builds 400+ star C & C++ GitHub repositories on Debian
• Dependencies must be greppable & apt-gettable
• Should build with a modern clang compiler (15)
• Build time limit is 6 hours enforced by GitHub
• Autotools, meson & cmake build systems only
• GitHub API rate limit is a bottleneck, 5000/hour max
• Disk space is limited, builds can fail if it generates too many files
Limitations
Linux Security Summit Europe ‘23
11
Scaling Repo Scanner (SRS)
• Scans run monthly
• Summary & scan details for every repo
• Download & run your own analysis
• Fork & add your own scans
Results posted to https://intel.github.io/srs
Linux Security Summit Europe ‘23
12
Results: OSSF scorecard as predictor for bugs
Date | # of repos | Bug change estimate per score increase | P-value
8/23 | 2097 | -6.937 | 2.02*10^-3 (**)
7/23 | 2071 | -7.572 | 7.02*10^-4 (***)
6/23 | 2035 | -9.127 | 7.95*10^-5 (***)
5/23 | 2003 | -8.937 | 1.43*10^-4 (***)
4/23 | 2318 | -6.549 | 3.45*10^-3 (**)
• Each point increase in the OSSF scorecard correlates with a reduction in the number of bugs found
• Statistically significant results!
Ship it! o/
Linux Security Summit Europe ‘23
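A minimal sketch of the regression behind this table, assuming a per-repo CSV export with hypothetical column names scorecard and bugs (one row per repository, one file per monthly scan):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("srs-2023-08.csv")        # hypothetical export of one monthly scan
X = sm.add_constant(df["scorecard"])       # intercept + aggregate OSSF score
model = sm.OLS(df["bugs"], X).fit()        # scan-build bug count per repo

print(model.params["scorecard"])           # "bug change estimate per score increase"
print(model.pvalues["scorecard"])          # the reported p-value
print(model.rsquared_adj)                  # adjusted R²
```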
13
Results: OSSF scorecard as predictor for bugs
Bug count: mean = 38.09, stdev = 104.36
Linux Security Summit Europe ‘23
14
Results: OSSF scorecard as predictor for bugs
Adjusted R²: how well the model explains variance in the current data
Predicted R²: how well the model will predict new data
Practically no connection between the two (scorecard score and bug count)! (A sketch for computing predicted R² follows after the table.)
Linux Security Summit Europe ‘23
Date | # of repos | Bug change estimate per score increase | P-value | Adjusted R² | Predicted R²
8/23 | 2097 | -6.937 | 2.02*10^-3 (**) | 0.0040 | 0.0024
7/23 | 2071 | -7.572 | 7.02*10^-4 (***) | 0.0050 | 0.0033
6/23 | 2035 | -9.127 | 7.95*10^-5 (***) | 0.0071 | 0.0055
5/23 | 2003 | -8.937 | 1.43*10^-4 (***) | 0.0067 | 0.0051
4/23 | 2318 | -6.549 | 3.45*10^-3 (**) | 0.0032 | 0.0015
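Predicted R² is not reported by statsmodels out of the box; one common way to compute it is from the PRESS statistic (leave-one-out residuals derived from the hat matrix). A sketch, assuming a fitted OLS result like the one in the earlier regression sketch:

```python
import numpy as np

def predicted_r2(fitted_ols) -> float:
    """Predicted R² via PRESS: 1 - sum((e_i / (1 - h_ii))^2) / SS_total."""
    hat = fitted_ols.get_influence().hat_matrix_diag   # leverage of each observation
    press = np.sum((fitted_ols.resid / (1.0 - hat)) ** 2)
    y = fitted_ols.model.endog
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total
```

With an adjusted R² around 0.004 and a predicted R² around 0.002, the scorecard explains well under 1% of the variance in bug counts, statistically significant or not.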
15
Results: OSSF subscores as predictor for bugs
Check | Estimate | P-value | Adjusted R² | Predicted R²
Maintained | 2.047 | 3.05*10^-5 (***) | 7.792*10^-3 | 6.27*10^-3
Has-CI | 0.269 | 0.929 | -4.74*10^-4 | -1.96*10^-3
Code review | -5.613 | 0.172 | 4.15*10^-4 | -9.87*10^-4
• Perhaps the aggregate score doesn’t correlate, but subscores fare better?
• Nope, some checks actually estimate an increase in bugs for each score increase
• Other checks aren’t statistically significant (and don’t explain or predict the data anyway)
Linux Security Summit Europe ‘23
16
Results: OSSF scores as predictor for other bugs
Static analysis engine | Bug change estimate per score increase | P-value | Adjusted R² | Predicted R²
Facebook Infer | -7.78 | 0.297 | 4.19*10^-5 | -1.4*10^-3
BinAbsInspector | 2.636 | 0.929 | -4*10^-4 | -1.3*10^-3
Linux Security Summit Europe ‘23
Perhaps we need a different static analysis engine?
17
Can we find anything that predicts the bugs?
Looking at metadata
Linux Security Summit Europe ‘23
Check | Estimate | P-value | Adjusted R² | Predicted R²
Lines of code | 8.09*10^-5 | 3.35*10^-12 (***) | 0.0224 | 0.0187
Lines of comments | 2.084*10^-5 | 7.21*10^-14 (***) | 0.0259 | 0.0194
Size (KB) | 1.735*10^-5 | 8.17*10^-3 (**) | 0.0028 | 0.0007
Stars | 4.2*10^-3 | 0.133 | 0.0006 | -0.0008
Watchers | 0.07 | 0.071 | 0.0010 | -0.0008
Forks | 0.012 | 0.122 | 0.0006 | -0.0008
Issues | 0.046 | 1.96*10^-3 (**) | 0.0040 | 0.0028
18
Can we find anything that predicts the bugs?
# of functions & cognitive complexity
Linux Security Summit Europe ‘23
Check | Estimate | P-value | Adjusted R² | Predicted R²
Number of functions | 0.014 | 2*10^-16 (***) | 0.2782 | 0.2704
Number of cognitively complex functions | 0.097 | 2*10^-16 (***) | 0.2789 | 0.2734
% of functions cognitively complex | 1.191 | 8.3*10^-7 (***) | 0.011 | 0.009
19
Multiple linear regression (MLR)
# of functions & cognitive complexity
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + % of functions cogn. cmplx | 0.0145 / 0.8775 | 2*10^-16 (***) / 2.01*10^-5 (***) | 0.2841 | 0.2763
Number of functions + number of cogn. cmplx functions | 0.0077 / 0.0521 | 3.9*10^-13 (***) / 1.52*10^-13 (***) | 0.2964 | 0.2867
20
Multiple linear regression (MLR)
Linux Security Summit Europe ‘23
# of functions & cognitive complexity
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | 0.0083 / 0.0475 / 0.5164 | 2.96*10^-14 (***) / 7.29*10^-11 (***) / 1.43*10^-2 (*) | 0.2981 | 0.2883
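A sketch of the three-predictor MLR model on this slide, using the statsmodels formula API; the column names (n_functions, n_complex, pct_complex) are hypothetical stand-ins for the SRS metrics:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("srs-2023-08.csv")        # hypothetical per-repo export
mlr = smf.ols("bugs ~ n_functions + n_complex + pct_complex", data=df).fit()
print(mlr.summary())                       # per-predictor estimates, p-values, adjusted R²
```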
21
Looking at other static analysis results
Facebook Infer results with MLR model
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | -0.0059 / 0.366 / 2.991 | 0.356 / 3.46*10^-10 (***) / 5.93*10^-4 (***) | 0.11 | 0.088
22
Looking at other static analysis results
BinAbsInspector results with MLR model
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | 0.034 / -0.138 / 9.917 | 0.352 / 0.552 / 0.102 | 5.28*10^-4 | -2.28*10^-3
23
# of functions & scan-build bugs
24
# of complex functions & scan-build bugs
25
% of functions complex & scan-build bugs
26
Cook’s Distance: 41 outliers detected with MLR
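A sketch of how such an outlier screen could be done with Cook's distance from the fitted MLR model; the 4/n cut-off is a common rule of thumb and an assumption here, since the talk does not state which threshold was used:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("srs-2023-08.csv")                     # hypothetical per-repo export
formula = "bugs ~ n_functions + n_complex + pct_complex"
mlr = smf.ols(formula, data=df).fit()

cooks_d, _ = mlr.get_influence().cooks_distance         # one distance per repository
outliers = df.index[cooks_d > 4.0 / len(df)]            # assumed 4/n rule of thumb
print(len(outliers), "outliers")

refit = smf.ols(formula, data=df.drop(outliers)).fit()  # redo without the outliers
```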
27
Redo without outliers
MLR model vs OSSF Scorecard
Linux Security Summit Europe ‘23
Model (predictors) | Estimates | P-values | Adjusted R² | Predicted R²
Number of functions + number of cogn. cmplx functions + % of functions cogn. cmplx | 0.0075 / 0.0864 / 0.2737 | 10^-9 (***) / 2*10^-16 (***) / 2.95*10^-2 (*) | 0.3774 | 0.3729
OSSF Scorecard | -5.123 | 1.62*10^-4 (***) | 0.0062 | 0.0046
28
Are complex functions more buggy?
• 3.34% of cognitively complex functions had at least 1 bug
• Total of 12,410 out of 371,556 functions
• 0.97% of non-complex functions had at least 1 bug
• Total of 26,972 out of 2,754,311 functions
• 44.7% of bugs were found in cognitively complex functions
• Total of 35,687 bugs
• 55.3% of bugs were found in non-complex functions
• Total of 44,134 bugs
Only 11.8% of functions were cognitively complex, yet they had almost half of all the bugs!
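The headline percentages follow directly from the raw counts on this slide; a quick check:

```python
complex_total, complex_buggy = 371_556, 12_410
plain_total, plain_buggy = 2_754_311, 26_972
bugs_in_complex, bugs_in_plain = 35_687, 44_134

print(complex_buggy / complex_total)                        # ≈ 0.033 → 3.34% of complex functions have a bug
print(plain_buggy / plain_total)                            # ≈ 0.010 → about 1% of non-complex functions have a bug
print(bugs_in_complex / (bugs_in_complex + bugs_in_plain))  # ≈ 0.447 → 44.7% of all bugs
print(complex_total / (complex_total + plain_total))        # ≈ 0.12  → roughly 12% of functions are complex
```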
29
Threats to validity
• Bugs in the static analysis tools resulting in false positives
• Bugs in our data-collection & analysis scripts
• We only built ~2k C & C++ repos
• We only built the ones that compile on Debian
• Only repos with 400+ stars
• OSSF scorecard often runs out of GitHub API requests, so we had to disable some OSSF checks
• Linear regression modeling might not be the right analysis
What may have affected our analysis & results
Linux Security Summit Europe ‘23
30
Discussion
• What if all the bugs were false positives?
• If static analysis detects a ton of false-positive bugs in a project, is it more or less risky?
• If a project is confusing to read for humans & confuses static analysis tools, is it more or less risky?
• If the MLR model predicts a high bug count, but Scorecard predicts low risk, which one should we trust?
• Why would Scorecard predict “security risk” correctly but bugs incorrectly?
• It seems “complexity” correlates with bugs, so why would “risk” be different?
Linux Security Summit Europe ‘23
31
Discussion
• Clang scan-build is free & ideally should find 0 bugs in your project
• Add it as a simple GitHub PR gate
• Don’t merge code that adds bugs
• https://github.com/intel/srs/tree/scan-build-action-v1
Linux Security Summit Europe ‘23
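A minimal sketch of such a gate (not the intel/srs action itself): scan-build's documented --status-bugs flag makes it exit non-zero when the analyzer reports any bug, which is enough to fail a CI job.

```python
# Hypothetical CI entry point: fail the pipeline if scan-build reports any bug.
import subprocess
import sys

result = subprocess.run(
    ["scan-build", "--status-bugs", "-o", "scan-results", "make", "-j4"]
)
sys.exit(result.returncode)   # a non-zero exit blocks the merge
```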
32
Summary
• We need automated tools to quickly spot problematic Open Source projects
• OSSF Scorecard is a fantastic but very hard undertaking
• Claims should be backed by data & statistics
• If you think we made an error, let’s hear it & publish your data! ☺
Tools & scripts: https://github.com/intel/srs
Data: https://intel.github.io/srs
Security risk estimation is hard
Linux Security Summit Europe ‘23
33
Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly
available updates. See backup for configuration details. No product or component can be absolutely secure.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability,
fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of
dealing, or usage in trade.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other
names and brands may be claimed as the property of others.