Patterns and Anti-Patterns for Memorializing Data Science Project Artifacts
1. Patterns and Anti-patterns for Memorializing Data Science Project Artifacts
Derrick Higgins
Senior Director, Enterprise Data Science
Sonjia Waxmonsky
Senior Data Scientist, Enterprise Data Science
We are the fourth-largest private insurer in the US, and the largest customer-owned insurer.
We are a not-for-profit insurer. We view health care financing, access, and delivery with a long-term perspective that promotes the entire health care system, not just the company's position.
We insure one in twelve American adults—16 million members.
We have been recognized by Forbes magazine as one of the top workplaces for women, for diversity, and for LGBTQ equality.
4. Code fragmentation
“Hey—can you tell me if something changed with the way last_visit_date is calculated?”
“Sean made some updates, but nothing that should change your data stream.”
“It seems … different from last week. Can you share what he did?”
“I can ask him to email you a snapshot of the code.”
“How are you showing the model scores for each day in the dashboard?”
“The average.”
“Average of all scores for a day?”
“Hmm; don’t remember.”
“Or average by members who visit on a day?”
“Can I get back to you tomorrow?”
5. Team fragmentation
Problems
• Inefficient and error-prone manual interfaces between teams
• Lack of reproducibility
• Challenges for quality assurance and governance
Conway’s law: Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
Data Team ⟶ ETL
Infrastructure Team ⟶ Resource Specification
Data Science Team ⟶ Modeling
Front-End Team ⟶ UI
6. Defragmentation
Goals
• Transparency (across teams)
• Common/linked versioning
• Interoperability
Obstacles
• Differences in tooling
• Differences in technical sophistication
• Manual processes for which there is no code
• Politics…
7. Strategies
Single shared repository
• The ultimate in transparency and common versioning
• Depends on shared processes and standards for contributions
• Difficult to achieve in large, matrixed organizations
Linking of versions across project repos
• Allows for transparency and reproducibility
• Looser coupling between related development efforts
• Still has certain technical prerequisites (versioning)
Documentation and cross-linking
• If nothing else, at least document dependencies and consumers as well as possible
• Links to where code lives, even if not versioned
• Document stakeholders
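One lightweight way to implement version linking across project repos is to record the commit SHA of each upstream repo whenever a downstream artifact is produced. The sketch below is a hypothetical illustration, not a standard tool: the function names and the `upstream_versions.json` manifest file are our own inventions, and it assumes `git` is installed and each path is a local checkout.

```python
import json
import subprocess


def current_commit(repo_path):
    """Return the HEAD commit SHA of a local git checkout (assumes git is installed)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_manifest(upstream_repos, resolve=current_commit):
    """Map each upstream project name to the commit it was at when we ran."""
    return {name: resolve(path) for name, path in upstream_repos.items()}


def write_manifest(upstream_repos, out_path="upstream_versions.json"):
    """Snapshot upstream versions to a manifest checked in next to the artifact."""
    manifest = build_manifest(upstream_repos)
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest
```

Committing the manifest alongside a model release gives a reproducible record of which ETL and infrastructure versions it depends on, without forcing all teams into one repo.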
10. File folders
My household management project
Barnard & Fein, 1958. Organization and Retrieval of Records Generated in a Large-Scale Engineering Project
11. Problem: Lack of versioning
• Most file systems do not support versioning
• When they do (e.g., SharePoint, S3 buckets), they are not set up to track metadata related to changes
12. Problem: Catastrophic failure
• If the code only lives on your laptop, it lives and dies with your laptop
• Even a disk in a “secure” location can become corrupt, fall victim to malware, or catch fire
13. Problem: Obstacles to collaboration
What about a shared drive?
• Exacerbates other problems
– More users ⟶ greater risk of accidental deletion or corruption
– Versioning by convention breaks down with larger development teams
• Doesn’t solve the versioning problem
• Assumes uninterrupted connectivity
Project is not discoverable
• Collaborators only get insight into code state when the maintainer sends a copy
• (And they need to know to ask)
Project does not support a workflow for parallel changes made by multiple contributors
• Editing a copy of the code creates irreconcilable forks
14. Where to store project files
File store(s) must be versioned
File store(s) must be resilient and allow for recovery
File store(s) must be transparent and allow for independent and asynchronous contributions by multiple collaborators
16. The zeal of the converted
• Git and GitHub provide versioning, resilience, transparency, and a mechanism for collaborative development
• But they are not meant for everything!
• Git (and similar distributed version control systems) can be abused for purposes they were not intended for – especially by data scientists
TensorFlow: 500 MB; 10 minutes to clone
Other repo: 2.1 GB; 40 minutes to clone
17. What belongs in GitHub?
😀 Diff-able, human-readable, dynamic, small:
• ETL code
• Program code
• Configuration files
• Experimental scripts
• Documentation
😐 Borderline:
• Notebooks
😬 Binary, static, large:
• Trained models
• Input data
• Compiled executables
• Intermediate files
• Bundled resources (images)
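The 😬 column above can be kept out of a repo with a `.gitignore`. This is a starting-point sketch, not a universal policy: the directory names (`data/`, `intermediate/`) are placeholders for your own project layout, and some teams do choose to version small images or notebooks deliberately.

```gitignore
# Trained models (binary, large, regenerable from code + data)
*.pkl
*.h5
*.joblib

# Input data and intermediate files (belong in a data store, not the repo)
data/
intermediate/

# Compiled executables
*.exe
*.so

# Notebook checkpoints (the notebooks themselves are a judgment call)
.ipynb_checkpoints/
```

The rule of thumb from the table: if a file is not diff-able and human-readable, ask whether it belongs in blob storage or an artifact store instead.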
18. Problems
Obstacles to collaboration
▪ Interacting with remote repository slows to a crawl; syncs become a bottleneck in workflow
▪ Meaningless diffs of large / binary / non-editable files obscure consequential changes in
codebase
Obstacles to production deployment
▪ Heterogeneous file types in repository may not all be suitable for a production environment
▪ Could contain sensitive information or simply introduce bloat
Challenges with integration
▪ Intermediate data files stored in repository can make it difficult to ensure consistency with
upstream data sources
19. Better solutions
Nice tooling incorporated in the Databricks platform!
Other options for data
• Git LFS
• Versioned buckets in S3 and other blob storage services
Other options for trained models
• Data science platforms…
• Blob storage linkage
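One simple scheme for blob storage linkage is to make model keys content-addressed. The sketch below is our own illustration (the function name, key layout, and `models/` prefix are all assumptions, not a platform API): hashing the serialized model means an unchanged model maps to the same key, while any retraining produces a new, distinct version.

```python
import hashlib


def model_artifact_key(model_bytes, model_name, prefix="models"):
    """Build a versioned blob-storage key (e.g., an S3 object key) for a trained model.

    The key embeds a short SHA-256 digest of the serialized model, so the
    same bytes always resolve to the same key and changed bytes never collide.
    """
    digest = hashlib.sha256(model_bytes).hexdigest()[:12]
    return f"{prefix}/{model_name}/{digest}/model.bin"
```

A training job would then serialize the model (e.g., with `pickle.dumps`), compute the key, upload the bytes to that key, and record the key in the project repo so code and artifact versions stay linked.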
30. Parameterize your code
• Centralize parameters into configuration options and consolidate them in one place
• Anticipate and allow for future configuration changes
Mechanisms:
• YAML and other config file formats
• Command-line options
• MLflow Tracking / Runs
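The bullets above can be combined into one layered scheme: defaults in code, overridden by a config file, overridden by command-line flags. This is a minimal standard-library sketch (JSON stands in for YAML to avoid a PyYAML dependency, and the parameter names are hypothetical); in practice you might also log the resolved parameters with MLflow Tracking.

```python
import argparse
import json

# Defaults live in one place, not scattered through the code
DEFAULTS = {"learning_rate": 0.01, "n_estimators": 100}


def load_params(argv=None):
    """Resolve parameters with precedence: CLI flags > config file > defaults."""
    parser = argparse.ArgumentParser(description="hypothetical training entry point")
    parser.add_argument("--config", help="path to a JSON config file")
    parser.add_argument("--learning-rate", type=float)
    parser.add_argument("--n-estimators", type=int)
    args = parser.parse_args(argv)

    params = dict(DEFAULTS)
    if args.config:
        with open(args.config) as f:
            params.update(json.load(f))
    # Only apply CLI values the user actually supplied
    overrides = {"learning_rate": args.learning_rate,
                 "n_estimators": args.n_estimators}
    params.update({k: v for k, v in overrides.items() if v is not None})
    return params
```

Because every run resolves its configuration through one function, the effective parameters can be recorded alongside the run, which keeps experiments reproducible.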