Managing Software Dependencies and the Supply Chain
Wrangling Software Engineering Projects
MIT EM.S20
Andrew Lamb
April 6, 2022
Goal
Give both a commercial and an open-source perspective on the benefits, costs,
and risks of taking on dependencies.
About me
MIT Course VI-2 ‘02, MEng ‘03
17 years of professional development 🤔
15 in commercial enterprise software (startups at various stages)
● Oracle, DataPower/IBM, Vertica/HP, Nutonian, DataRobot
Last 2 years in open source commercial software development
● InfluxData, contributor to influxdb_iox
● Maintainer of arrow-rs, arrow-datafusion, and sqlparser-rs projects
● PMC member of Apache Arrow
Software “Supply Chain”?
[Diagram labels: Code Contributors; Project Management (e.g. PRs); CI / CD system; Software Distribution (e.g. Docker Hub, App Store); AWS Marketplace; Apple Pay; User (😊)]
Software Supply Chain Complexity
2005: Andrew’s First Startup (DataPower)
● C/C++, < 5 dependencies (OpenSSL)
● Single binary, distributed to customers, on CD or via FTP
2022: Andrew’s Current Startup (InfluxDB)
● IOx has …. 606 dependencies (Rust alone)
● Distributed as a Docker image on GCR
Dependencies?
● Software Engineering 101 (6.001 / 6.037)
● “Don’t Reinvent the Wheel”: Use a pre-existing library of code
● The number and quality of pre-existing libraries have grown massively
● Example:
○ 2004: DataPower had a custom written HTTP/S implementation, url parser,
and more!
○ 2022: Most languages have a library to do it (requests for python, node,
reqwest in Rust, etc)
(Dramatically) Lowers Cost of Building Software
● Low Barrier to Entry: Someone else designed the API, implemented
and (hopefully) tested it
○ E.g. you can get a cross-platform, secure web server up and running almost instantly
● Maintenance: You benefit from bugs fixed by others
● Debuggability: Source code is available, you can often even step
through it
Managing Dependencies: Licensing
● Software Patent licensing is still a (huge) thing
○ IBM makes $1Bn a year on software licensing
● You need to ensure you have the legal right to use the software.
● Good news: Most organizations have figured out licensing and keep a known-good “approved” set of licenses.
○ You are fine as long as you stick to the known-good ones
● Example “Auto Approve” (permissive): MIT, BSD, Apache 2
● Example “Special Dispensation”: MongoDB server side license
● Example “Do not use”: GPL / LGPL
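The three buckets above can be sketched as a small policy lookup. This is only an illustration: the `POLICY` table, the `Verdict` names, and the choice of SPDX-style identifiers as keys are assumptions, not any organization's actual approved list.

```python
from __future__ import annotations

from enum import Enum


class Verdict(Enum):
    AUTO_APPROVE = "auto approve"
    SPECIAL_DISPENSATION = "special dispensation"
    DO_NOT_USE = "do not use"
    UNKNOWN = "needs review"


# Hypothetical policy table mirroring the slide's examples.
# Keys are SPDX-style license identifiers (an assumption of this sketch).
POLICY = {
    "MIT": Verdict.AUTO_APPROVE,
    "BSD-3-Clause": Verdict.AUTO_APPROVE,
    "Apache-2.0": Verdict.AUTO_APPROVE,
    "SSPL-1.0": Verdict.SPECIAL_DISPENSATION,  # MongoDB server side license
    "GPL-3.0-only": Verdict.DO_NOT_USE,
    "LGPL-3.0-only": Verdict.DO_NOT_USE,
}


def classify(license_id: str) -> Verdict:
    """Look up a license; anything not in the table needs a human review."""
    return POLICY.get(license_id, Verdict.UNKNOWN)


def audit(dependencies: dict[str, str]) -> dict[Verdict, list[str]]:
    """Group dependency names (sorted) by the verdict for their license."""
    report: dict[Verdict, list[str]] = {verdict: [] for verdict in Verdict}
    for name, license_id in sorted(dependencies.items()):
        report[classify(license_id)].append(name)
    return report
```

Calling `audit({"left_pad": "MIT", "some_db": "SSPL-1.0"})` with these made-up package names would put the first in the auto-approve bucket and the second in special dispensation.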
Managing Dependencies: Quality
Quality of many Open Source dependencies is outstanding
● Crowdsourcing means more investment into bug reporting and fixing
● In theory you can look at the code to assess the quality
● You have many options to choose from
Managing Dependencies: Quality
● Amount of time spent on reviewing / assessing open source is minimal (both commercially and in open source) – think reviewing 606 packages
● No one to cry to: Maintainers have limited time to respond to your issue
● Open source maintainers typically stretched (very) thin
● Parable: “broke my old version, sorry”: dtolnay/quote/#204
Managing Dependencies: Security
● Somewhat terrifying to read the “Backstabber's Knife Collection” paper (arXiv:2005.09535)
● Open source maintainers do not have loads of time
○ Open source is fundamentally based on trust but verify (in the maintainers + community)
○ Possible to abuse that trust and insert malicious code
● Surface Area: dependencies of dependencies
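To make the "dependencies of dependencies" point concrete, here is a minimal sketch that walks a dependency graph and collects everything reachable from one package. The graph and its package names are invented for illustration; a real project would read this from its package manager's metadata.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        pkg = queue.popleft()
        if pkg in seen or pkg == root:
            continue  # already visited, or a cycle back to the root
        seen.add(pkg)
        queue.extend(graph.get(pkg, ()))
    return seen


# Hypothetical graph: package name -> list of direct dependencies.
GRAPH = {
    "my_app": ["http_client", "json_parser"],
    "http_client": ["url_parser", "tls"],
    "tls": ["crypto"],
    "json_parser": [],
}
```

In this toy graph `my_app` declares two direct dependencies, but the set returned by `transitive_dependencies(GRAPH, "my_app")` has five entries, and each of them is code you are trusting.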
Managing Dependencies: Build times / package bloat
● Dependencies add build time to compiled languages (C/C++, Rust)
● Add significant bloat to binary / distribution size (MBs!)
○ Parable: The (Python) dependency stack at one startup was a > 1.5 GB package.
● “DLL Hell”: Version matching dependencies (of dependencies)
Managing Dependencies: Keeping up to date
● Dependencies get upgraded with unpredictable regularity
● Upgrades bring things you want/need, like security fixes, along with features you probably don’t
Challenges
● Open source projects invest relatively less time on maintaining past releases.
○ p.s. Microsoft Windows: programs written 20+ years ago still run fine
● ⇒ bump dependencies a lot (daily)
● “Semantic versioning” - helps auto update dependencies 🤗
○ Maintainers sometimes release incompatible changes anyway and break builds 😖
○ Can get different binaries depending on *when* you run your build 😱
○ “Backstabber's Knife Collection” 😓
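A rough sketch of why semantic versioning both helps automatic updates and makes builds depend on when they run. It implements a caret-style compatibility rule, similar in spirit to the default in Cargo and npm, for plain `MAJOR.MINOR.PATCH` strings only. Pre-release tags and build metadata are deliberately not handled, and the function names are made up for this example.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def caret_compatible(required: str, candidate: str) -> bool:
    """True if candidate is >= required and keeps the leftmost non-zero part."""
    req, cand = parse(required), parse(candidate)
    if cand < req:
        return False
    if req[0] != 0:
        return cand[0] == req[0]
    if req[1] != 0:
        return cand[0] == 0 and cand[1] == req[1]
    return cand == req  # 0.0.x: every release is treated as breaking


def newest_compatible(required: str, available: list[str]) -> str | None:
    """Pick the newest published version that satisfies the requirement."""
    matches = [v for v in available if caret_compatible(required, v)]
    return max(matches, key=parse) if matches else None
```

With a requirement of `1.2.3`, a build run when only `1.2.3` is published resolves to `1.2.3`, while the same build run after `1.4.0` is published resolves to `1.4.0`. Nothing in your own source changed, yet the binary did.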
Managing Dependencies: Packaging
Packaging: Gathering your code and dependencies into an executable “package”
that a user can run on their system
As the number of dependencies grows, so do the challenges in packaging / DLL Hell
● Language Runtime
● Your direct dependencies (e.g. http library)
● Indirect dependencies (e.g url parser)
● System dependencies (libssl, libqt, etc)
How to Manage
Think Twice about Adding New Dependencies
“A little copying is better than a little
dependency.”
- Rob Pike via https://go-proverbs.github.io/
E.g. One data structure from a library of data structures
Anti-example: http clients / crypto library
Best Practice: CI/CD (test, test, and test some more)
[Diagram: Source Code (in git) → Propose change via Pull Request → CI: Run Tests on change branch → approve + merge to main branch → CI: Run Tests (on main branch) → Build “Artifacts” → CD: release / deploy]
Likely more tests here (annotated at two points in the diagram)
CI == Continuous Integration
CD == Continuous Deployment
Best Practice: Package Manager
❏ Use package manager built into your ecosystem:
❏ Java: Maven
❏ Python: pip
❏ Node.js: npm
❏ Ruby: RubyGems
❏ Rust: Cargo
❏ …
❏ C/C++: CMake (not quite a package manager, but closer than Makefiles)
❏ Use the “freeze”, “shrinkwrap”, or “version lock” feature to control updates
❏ Ensure you use widely used packages (wisdom of crowds)
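As one way to enforce the "freeze" idea in CI, the sketch below flags lines of a pip-style requirements file that are not pinned to an exact version. It is intentionally simplified and an assumption-laden illustration: the `PINNED` pattern ignores extras, environment markers, hashes, and URL requirements, and `unpinned_requirements` is a made-up helper, not part of pip.

```python
from __future__ import annotations

import re

# Simplified pattern for "name==1.2.3" style pins (an assumption of this sketch).
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==[0-9][A-Za-z0-9.]*$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to one exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose
```

A CI step could fail the build whenever this list is non-empty, so that version changes only arrive through a deliberate update to the lock file.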
Managing Dependencies: Best Practices
❏ Invest heavily in automated testing
❏ Especially end to end tests, and key features that rely on behavior of dependencies
❏ Invest in keeping dependencies up to date
❏ Update direct dependencies (tools like Dependabot can help)
❏ Help debug and fix the libraries you depend on
❏ Submit patches back upstream
❏ May need to fork / apply a fix while you wait for the maintainer to release a new version
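One way to test "key features that rely on behavior of dependencies" is to write small tests that pin down the exact behavior you rely on, so an upgrade that changes it fails loudly in CI. The sketch below uses Python's standard-library `urllib.parse` as a stand-in for a third-party URL parser. The test class, the helper name, and the example URL are all invented for illustration.

```python
import unittest
from urllib.parse import urlsplit


class UrlParserBehavior(unittest.TestCase):
    """Pins the parsing behavior our (hypothetical) application relies on."""

    def test_splits_host_port_path_and_query(self):
        parts = urlsplit("https://example.com:8443/api/v1?limit=10")
        self.assertEqual(parts.scheme, "https")
        self.assertEqual(parts.hostname, "example.com")
        self.assertEqual(parts.port, 8443)
        self.assertEqual(parts.path, "/api/v1")
        self.assertEqual(parts.query, "limit=10")

    def test_hostname_is_lowercased(self):
        self.assertEqual(urlsplit("HTTP://Example.COM/").hostname, "example.com")


def run_behavior_tests() -> bool:
    """Run the suite programmatically and report whether everything passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UrlParserBehavior)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Tests like these are cheap to write when a dependency is first adopted and make automated dependency bumps much less nerve-racking.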
Managing Dependencies: Packaging
Technology to the rescue (enabler)
● Static Linking
● yum + .rpm ; apt + .deb
● FX; Electron (for Java; nodejs / desktop apps)
● Containerization (docker, et al)
● VMs (“Virtual Appliances”)
Thank you
Questions?
Readings (tentative):
https://ieeexplore-ieee-org.libproxy.mit.edu/stamp/stamp.jsp?tp=&arnumber=242525 – software maturity
https://www.oreilly.com/library/view/understanding-open-source/0596005814/ch06.html – reasonably thorough overview of software licensing
https://arxiv.org/pdf/2005.09535.pdf – supply-chain attacks
https://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm.html – specific example of how easy/common broad supply-chain breaks are today
[optional] https://blogs.sap.com/2020/06/26/attacks-on-open-source-supply-chains-how-hackers-poison-the-well/
[optional] https://www.gnu.org/licenses/license-compatibility.en.html
[optional] https://www.tandfonline.com/doi/pdf/10.1080/14783360500235819?needAccess=true – software maturity
