dotscience @lmarsden
@getdotscience
Inextricably Linked:
Reproducibility and Productivity
in Data Science and AI
Mark Coleman
mark@dotscience.com
@mrmrcoleman
@getdotscience
Who am/was I?
⇢ VP Marketing - dotscience
⇢ Serverless early fan
⇢ CNCF Marketing chairperson
⇢ Docker early fan
⇢ CI/CD consultant
⇢ Devops early fan
⇢ C++ programmer
⇢ C embedded systems
Now
Then
Let's compare
Data Science/ML/AI
and
Software Dev/DevOps
Not long ago, software dev was a bit of a mess
⇢ Work split across silos
+ Development
+ Testing
+ Operations
⇢ Caused huge amounts of pain
90s Software Development
⇢ Without version control, life is hard
+ You email zip files of source code
⇢ Two people change the same files?
+ Your work gets clobbered
90s Testing
⇢ "Works on my machine"
⇢ Email, USB stick, or shared drive → separate
testing team
⇢ High latency between breakage & knowing
+ Lost valuable context by time to fix
+ A slow & frustrating cycle
90s Operations
⇢ Throw release candidates over the wall to Ops
⇢ They drop a WAR file onto a Tomcat server
⇢ Dev & test failed to account for non-functional requirements (NFRs)
+ Ops can't fix it
⇢ Monitoring is sketchy, users find bugs
+ SSH into the production box
+ Process skipped during outage, introduces more bugs
⇢ Everyone is sad
How did we ship anything with all this mess?
⇢ Slowly!
⇢ Release cycles are weeks or months
⇢ Bad tooling & process?
+ Choose SPEED or SAFETY but not both
⇢ Most companies were forced to choose
SAFETY
What does this have to do with reproducibility?
⇢ Software is iterative
⇢ Try something → figure out what happened → learn → try
something else
⇢ How do we figure out what happened?
+ Reproduce all the variables & see what changed
⇢ When bad tooling stops us reproducing an environment,
development grinds to a halt
Things got a lot better in 20 years!
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
Destructive vs Constructive Collaboration
⇢ Destructive = making copies
+ No source of truth
+ Divergence occurs instantly
⇢ Constructive = single source of truth
+ Multiple branches, try different ideas
+ Diff & merge enables reconciliation
⇢ Version control enables constructive collaboration
Ubiquitous Version Control
⇢ Sane people use version control
⇢ Developers collaborate effectively
⇢ Testing teams can too
⇢ Even Ops uses version control now – GitOps!
Continuous Integration
⇢ Version control enables CI
⇢ CI enables fast feedback
+ React to failures when we can still remember
what we changed (minutes not weeks)
⇢ Platform for tested versioned artifacts
+ Deploy into CD pipeline
Continuous Delivery & Observability
⇢ A single traceable way to get a tested change in
development to production
⇢ DevOps = ops can collaborate in same way that
dev & test teams do with CI
⇢ Application level observability & monitoring
allows deep dive into root causes
What has all this achieved?
⇢ Version control enabled reproducibility &
collaboration
⇢ This unlocks Continuous Integration &
Continuous Delivery
⇢ Add some Observability & Monitoring...
⇢ You get both SPEED and SAFETY!
How is AI doing in 2018?
⇢ Been talking to dozens of data science & AI
teams
⇢ Data science & AI seems to be where software
development was in the late 90s :'(
“In retrospect if we had been able to save the versions or
have gone back in time to see how he got his learning
rates it would have avoided a lot of questions from the
auditors.”
“Two of the data scientists who worked on that particular
model have left and gone to other companies. You want to
be able to see what they did and how they did it and not
that it's gone when they're gone.”
“One model failed for 3 months and we lost an
immeasurable amount of money!”
“After the last audit I was surprised by how many problems
in the audit we could have solved by keeping PAPER LOGS.
But if we ask our data scientists to do this they will leave!”
“We keep our data scientist teams small and in the same
room so they can track their summary statistics by talking
to each other and remembering.”
Destructive collaboration is commonplace
⇢ Shared drives for training data
⇢ Notebooks emailed or slacked between team
members
⇢ Scant manual documentation
⇢ Data wrangles go unrecorded
Testing of models is rare
⇢ Automated testing of models is rare
⇢ CI systems uncommon
⇢ "Testing" is more often done manually by an
individual in an untracked Jupyter environment
Deployment is manual
⇢ Models often “thrown over the wall”
⇢ Left in production to rot until somebody notices
⇢ No real monitoring, especially challenging with
retraining & model drift
⇢ Haven't seen much continuous delivery
How do we ship anything with all this mess?
⇢ Inappropriate tooling makes us choose between
SPEED and SAFETY
⇢ Therefore
+ AI/ML projects being shipped slowly with meticulous docs
+ AI/ML projects being shipped unsafely
+ not tracked, not auditable
+ no single source of truth for what made it into prod & how
+ siloed in people's heads...
How do we get AI, ML, etc. out of the 90s?
Version control is fundamental & enabling in the lifecycle
[Diagram: lifecycle loop linking Development, Version control, Continuous Integration, Continuous Delivery, and Observability & Monitoring]
Mission: go round the loop faster!
How do we version control?
⇢ Versioned data, environment, code: notebooks +
parameters
⇢ Metrics tracking: parameters ↔ summary statistics (e.g.
accuracy, business metrics)
⇢ Diff & merge for notebooks, data, code
⇢ Forks, pull requests & comment tracking
⇢ Enables:
+ Creativity & collaboration
+ Audit & reporting
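One way to picture the tracking described above is an append-only log in which every run stores its parameters next to the summary statistics, plus fingerprints of the code and data that produced them. The Python sketch below is only an illustration: `fingerprint`, `record_run`, `best_run`, and the record layout are hypothetical names, not the API of any tool mentioned in this talk.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a short content hash that identifies one exact version."""
    return hashlib.sha256(payload).hexdigest()[:12]


def record_run(log, code: bytes, data: bytes, params: dict, metrics: dict) -> dict:
    """Append one run to the log, linking parameters to summary statistics."""
    run = {
        "run_id": len(log) + 1,
        "code_version": fingerprint(code),
        "data_version": fingerprint(data),
        "params": dict(params),
        "metrics": dict(metrics),
    }
    log.append(run)
    return run


def best_run(log, metric: str) -> dict:
    """Find the run with the highest value of a metric (audit & reporting)."""
    return max(log, key=lambda run: run["metrics"][metric])


# Hypothetical example: two runs on the same code and data, different parameters.
log = []
record_run(log, b"train.py v1", b"rows-2018-01", {"learning_rate": 0.1}, {"accuracy": 0.81})
record_run(log, b"train.py v1", b"rows-2018-01", {"learning_rate": 0.01}, {"accuracy": 0.86})
print(json.dumps(best_run(log, "accuracy")["params"]))
```

Because each record carries the code and data fingerprints, the question "how did he get his learning rates?" becomes a lookup rather than a memory exercise.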
How do we continuously integrate?
⇢ What do automated tests look like for models?
+ Not always binary like software – probabilistic
+ Pick some inputs / outputs & put triggers on them
+ If it goes > N stddev, fail tests
+ Also test NFR & unit/integration tests on code
⇢ When issues are reported with a model, convert issues to
tests
+ This way, CI provides "guide rails" for faster & more
confident development
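The "N stddev" trigger above can be sketched as a small check: pick fixed inputs, keep the outputs that earlier accepted models produced for them, and fail when a new model's output moves too far from that history. This is a minimal Python sketch under those assumptions; `within_n_stddev`, `check_model`, and the sample history are hypothetical.

```python
import statistics


def within_n_stddev(value: float, history: list, n: float = 3.0) -> bool:
    """Return True if value lies within n standard deviations of history's mean."""
    mean = statistics.mean(history)
    stddev = statistics.stdev(history)
    return abs(value - mean) <= n * stddev


def check_model(predict, fixed_inputs, history_by_input, n: float = 3.0) -> list:
    """Return the fixed inputs whose prediction moved more than n stddev."""
    failures = []
    for x in fixed_inputs:
        if not within_n_stddev(predict(x), history_by_input[x], n):
            failures.append(x)
    return failures


# Hypothetical outputs recorded from earlier accepted model versions.
history_by_input = {
    1.0: [2.0, 2.1, 1.9, 2.0],
    2.0: [4.0, 4.1, 3.9, 4.0],
}


def good_model(x):
    return 2.0 * x


def drifted_model(x):
    return 2.0 * x + 1.0


print(check_model(good_model, [1.0, 2.0], history_by_input))
print(check_model(drifted_model, [1.0, 2.0], history_by_input))
```

An empty list means the test passes. When an issue is reported against a model, the offending input can be added to `fixed_inputs` so the same problem is caught automatically next time.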
How do we continuously deliver?
⇢ Triggers: when code changes or data changes
⇢ Automatically run code and model tests
⇢ If tests pass, automatically deploy to production
+ Champion Challenger, A/B
+ Minimize time between breakage & knowing
+ Minimize MTTR not MTBF, fast rollback
⇢ From decisions made in production, be able to track back
perfectly
+ See lineage of model development right down to
individual parameter tweaks - who/what/when/why
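The delivery steps above can be sketched as a promotion rule plus a history that makes rollback a single step. This is a rough Python illustration, assuming accuracy is the only comparison metric; `choose_champion`, `Deployment`, and the model records are hypothetical and not taken from any specific product.

```python
def choose_champion(champion: dict, challenger: dict, tests_passed: bool,
                    min_gain: float = 0.0) -> dict:
    """Promote the challenger only if tests pass and it beats the champion."""
    if not tests_passed:
        return champion
    if challenger["accuracy"] > champion["accuracy"] + min_gain:
        return challenger
    return champion


class Deployment:
    """Keeps the history of promoted models so rollback is one step."""

    def __init__(self, initial: dict):
        self.history = [initial]

    @property
    def live(self) -> dict:
        return self.history[-1]

    def deliver(self, challenger: dict, tests_passed: bool) -> dict:
        """Run the promotion rule and record the winner if it changed."""
        winner = choose_champion(self.live, challenger, tests_passed)
        if winner is not self.live:
            self.history.append(winner)
        return self.live

    def rollback(self) -> dict:
        """Return to the previous model (fast recovery, i.e. low MTTR)."""
        if len(self.history) > 1:
            self.history.pop()
        return self.live


deployment = Deployment({"name": "v1", "accuracy": 0.80})
print(deployment.deliver({"name": "v2", "accuracy": 0.85}, tests_passed=True)["name"])
print(deployment.deliver({"name": "v3", "accuracy": 0.99}, tests_passed=False)["name"])
print(deployment.rollback()["name"])
```

The design choice here is to optimise for recovery time: a bad promotion is undone by popping the history, not by rebuilding the previous model.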
How do we solve observability?
⇢ Once model is in production, track model health with
same metrics used in development
+ Single source of truth for dev/prod metrics
+ See model drift
+ If model health < X, page a human
⇢ Automatic retraining can happen periodically when new
data is available
⇢ CI & CD gives us confidence to ship quickly
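As a sketch of "if model health < X, page a human", the check below measures production health with the same accuracy metric used in development and reports the gap as drift. It assumes ground-truth outcomes eventually arrive for production predictions; `rolling_health`, `check_model_health`, and the sample numbers are hypothetical.

```python
def rolling_health(outcomes, window):
    """Share of correct predictions among the most recent `window` outcomes."""
    recent = outcomes[-window:]
    if not recent:
        raise ValueError("no outcomes recorded yet")
    return sum(recent) / len(recent)


def check_model_health(outcomes, dev_accuracy, threshold, window=50):
    """Compare production health with the development metric and decide on paging."""
    health = rolling_health(outcomes, window)
    return {
        "health": health,
        "drift": dev_accuracy - health,   # gap between dev and prod metric
        "page_human": health < threshold,  # the "X" from the slide
    }


# Hypothetical production outcomes: 1 = prediction was right, 0 = wrong.
healthy = [1] * 45 + [0] * 5
drifted = [1] * 30 + [0] * 20

print(check_model_health(healthy, dev_accuracy=0.9, threshold=0.8))
print(check_model_health(drifted, dev_accuracy=0.9, threshold=0.8))
```

The same report could also serve as a trigger for automatic retraining, though that step is left out of this sketch.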
So that's the big vision… where do we start?
[Diagram: Development, Version control, Continuous Integration, Continuous Delivery, Observability & Monitoring]
So you want to do reproducible data
science/AI/ML?
⇢ Environment
⇢ Code + Notebooks, including parameters
⇢ Versioned Data
How?
Pinning down environment
⇢ In the DevOps world, Docker has been a big hit.
⇢ Docker helps you pin down the execution
environment that your model training (or other
data work) is happening in.
⇢ What is Docker?
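Docker pins the environment by packaging the whole filesystem an application needs into an image. The Python sketch below is not Docker and captures far less; it only illustrates the narrower idea of recording what a training run executed in, so that two runs can be compared. `capture_environment` and `environment_id` are hypothetical helper names.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Describe the interpreter, OS, and installed package versions."""
    packages = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:
            packages.add(f"{name}=={dist.version}")
    return {
        "python": platform.python_version(),
        "implementation": sys.implementation.name,
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": sorted(packages),
    }


def environment_id(env: dict) -> str:
    """Hash the description so two runs can be compared at a glance."""
    canonical = json.dumps(env, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


env = capture_environment()
print(env["python"], env["os"], len(env["packages"]), "packages")
print("environment id:", environment_id(env))
```

If two runs report different ids, something in the environment changed between them, which is exactly the kind of variable that otherwise goes unrecorded.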
Pinning down code & notebooks
⇢ Developers have been version controlling their
code for a while now.
⇢ Git is the default choice
+ Lets you track versions of your code and collaborate
with others by commit, clone, push, pull…
Challenges with git in data science
⇢ In data science, it's not natural to commit every
time you change anything, e.g. while tuning
parameters
⇢ But you generate important results while you're
iterating
⇢ Git doesn't cope with large files, and data scientists
often mingle code & data
⇢ Diffing and merging Jupyter notebooks is not easy
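Part of why notebooks diff badly is that a notebook file is JSON that mixes source with cell outputs and execution counters, so re-running a cell changes the file even when the code did not change. One common workaround, sketched here in Python under the assumption of the version 4 notebook layout, is to strip outputs before committing. `strip_outputs` and the sample notebook are hypothetical.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook with outputs and execution counts removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical two-cell notebook in the nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["learning_rate = 0.01\n"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["0.86\n"]}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Tuning notes\n"]},
    ],
}

cleaned = strip_outputs(notebook)
print(json.dumps(cleaned["cells"][0], sort_keys=True))
```

Note the trade-off: stripping makes diffs readable but throws away the results generated while iterating, so those results need to be tracked somewhere else.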
Proposal: a new version control & collaboration
system for AI
⇢ Use Dotmesh with ZFS
+ "Git for data"
+ Handles large data atomically &
efficiently
+ Deal with terabyte workspaces
⇢ Track metrics/stats & params
⇢ Track lineage & provenance
⇢ Next:
+ Diff & merge notebooks
+ Enable pull requests
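To make "track lineage & provenance" concrete: if every saved version records its parent along with who changed what and why, the full history of a production model can be recovered by walking the parent links. The Python sketch below illustrates only that idea; it does not use or describe the Dotmesh API, and `lineage` and the version records are hypothetical.

```python
def lineage(versions: dict, version_id: str) -> list:
    """Walk parent links from a version back to the root, newest first."""
    chain = []
    current = version_id
    while current is not None:
        entry = versions[current]
        chain.append({"id": current, **entry})
        current = entry["parent"]
    return chain


# Hypothetical version records: each one points at the version it came from.
versions = {
    "a1": {"parent": None, "who": "alice", "why": "baseline",
           "params": {"learning_rate": 0.1}},
    "b2": {"parent": "a1", "who": "bob", "why": "lower learning rate",
           "params": {"learning_rate": 0.01}},
    "c3": {"parent": "b2", "who": "alice", "why": "retrain on new data",
           "params": {"learning_rate": 0.01}},
}

for step in lineage(versions, "c3"):
    print(step["id"], step["who"], step["why"], step["params"])
```

With records like these, the who/what/when/why of a model survives even after the people who trained it have moved on.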
I need your help 🙏
mark@dotscience.com
@mrmrcoleman
@getdotmesh
dotscience.com/try
Questions?

More Related Content

What's hot

Lean Product Management User-Centered App Design
Lean Product Management User-Centered App DesignLean Product Management User-Centered App Design
Lean Product Management User-Centered App DesignVMware Tanzu
 
What I learned from 5 years of sciencing the crap out of DevOps
What I learned from 5 years of sciencing the crap out of DevOpsWhat I learned from 5 years of sciencing the crap out of DevOps
What I learned from 5 years of sciencing the crap out of DevOpsDevOpsDays DFW
 
2014 State Of DevOps Findings! Velocity Conference
2014 State Of DevOps Findings! Velocity Conference2014 State Of DevOps Findings! Velocity Conference
2014 State Of DevOps Findings! Velocity ConferenceGene Kim
 
Continuous Integration for Citizens
Continuous Integration for CitizensContinuous Integration for Citizens
Continuous Integration for CitizensMikhail Zyatin
 
DevOps Roadtrip - Denver
DevOps Roadtrip - DenverDevOps Roadtrip - Denver
DevOps Roadtrip - DenverVictorOps
 
Continuous Integration for Citizens
Continuous Integration for CitizensContinuous Integration for Citizens
Continuous Integration for CitizensMikhail Zyatin
 
DevOps Kaizen: Practical Steps to Start & Sustain a Transformation
DevOps Kaizen: Practical Steps to Start & Sustain a TransformationDevOps Kaizen: Practical Steps to Start & Sustain a Transformation
DevOps Kaizen: Practical Steps to Start & Sustain a Transformationdev2ops
 
How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013Puppet
 
Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs
Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology OrgsWhy Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs
Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology OrgsGene Kim
 
DOES 2016 Sciencing the Crap Out of DevOps
DOES 2016 Sciencing the Crap Out of DevOpsDOES 2016 Sciencing the Crap Out of DevOps
DOES 2016 Sciencing the Crap Out of DevOpsNicole Forsgren
 
The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...
The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...
The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...Nicole Forsgren
 
Vmware2021 why even devop nicolefv
Vmware2021 why even devop nicolefvVmware2021 why even devop nicolefv
Vmware2021 why even devop nicolefvNicole Forsgren
 
DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...
DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...
DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...Gene Kim
 
DeKnowledge - Try us
DeKnowledge - Try usDeKnowledge - Try us
DeKnowledge - Try usBob Pinto
 
Top 10 Things Admins Can Learn from Developers (without learning to code)
Top 10 Things Admins Can Learn from Developers (without learning to code)Top 10 Things Admins Can Learn from Developers (without learning to code)
Top 10 Things Admins Can Learn from Developers (without learning to code)YeurDreamin'
 
JUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disasterJUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disasterBert Jan Schrijver
 
Successful writing at work copyright 2017 cengage learn
Successful writing at work copyright 2017 cengage learnSuccessful writing at work copyright 2017 cengage learn
Successful writing at work copyright 2017 cengage learnssusere73ce3
 
Soaring in the Clouds - Don't be dragged down by ITIL bloat!
Soaring in the Clouds - Don't be dragged down by ITIL bloat! Soaring in the Clouds - Don't be dragged down by ITIL bloat!
Soaring in the Clouds - Don't be dragged down by ITIL bloat! Navvia
 

What's hot (20)

Lean Product Management User-Centered App Design
Lean Product Management User-Centered App DesignLean Product Management User-Centered App Design
Lean Product Management User-Centered App Design
 
What I learned from 5 years of sciencing the crap out of DevOps
What I learned from 5 years of sciencing the crap out of DevOpsWhat I learned from 5 years of sciencing the crap out of DevOps
What I learned from 5 years of sciencing the crap out of DevOps
 
ROOTS2011 Continuous Delivery
ROOTS2011 Continuous DeliveryROOTS2011 Continuous Delivery
ROOTS2011 Continuous Delivery
 
2014 State Of DevOps Findings! Velocity Conference
2014 State Of DevOps Findings! Velocity Conference2014 State Of DevOps Findings! Velocity Conference
2014 State Of DevOps Findings! Velocity Conference
 
Continuous Integration for Citizens
Continuous Integration for CitizensContinuous Integration for Citizens
Continuous Integration for Citizens
 
DevOps Roadtrip - Denver
DevOps Roadtrip - DenverDevOps Roadtrip - Denver
DevOps Roadtrip - Denver
 
Continuous Integration for Citizens
Continuous Integration for CitizensContinuous Integration for Citizens
Continuous Integration for Citizens
 
DevOps Kaizen: Practical Steps to Start & Sustain a Transformation
DevOps Kaizen: Practical Steps to Start & Sustain a TransformationDevOps Kaizen: Practical Steps to Start & Sustain a Transformation
DevOps Kaizen: Practical Steps to Start & Sustain a Transformation
 
How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013How Do We Better Sell DevOps? - PuppetConf 2013
How Do We Better Sell DevOps? - PuppetConf 2013
 
Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs
Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology OrgsWhy Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs
Why Everyone Needs DevOps Now: 15 Year Study Of High Performing Technology Orgs
 
DOES 2016 Sciencing the Crap Out of DevOps
DOES 2016 Sciencing the Crap Out of DevOpsDOES 2016 Sciencing the Crap Out of DevOps
DOES 2016 Sciencing the Crap Out of DevOps
 
The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...
The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...
The Data Behind DevOps: What Does it Take to be a High Performer? Jenkins Wor...
 
Vmware2021 why even devop nicolefv
Vmware2021 why even devop nicolefvVmware2021 why even devop nicolefv
Vmware2021 why even devop nicolefv
 
DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...
DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...
DOES SFO 2016 - Sam Guckenheimer & Ed Blankenship "Moving to One Engineering ...
 
DeKnowledge - Try us
DeKnowledge - Try usDeKnowledge - Try us
DeKnowledge - Try us
 
Software as Craft
Software as CraftSoftware as Craft
Software as Craft
 
Top 10 Things Admins Can Learn from Developers (without learning to code)
Top 10 Things Admins Can Learn from Developers (without learning to code)Top 10 Things Admins Can Learn from Developers (without learning to code)
Top 10 Things Admins Can Learn from Developers (without learning to code)
 
JUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disasterJUG Bonn June 2021 - The DevOps disaster
JUG Bonn June 2021 - The DevOps disaster
 
Successful writing at work copyright 2017 cengage learn
Successful writing at work copyright 2017 cengage learnSuccessful writing at work copyright 2017 cengage learn
Successful writing at work copyright 2017 cengage learn
 
Soaring in the Clouds - Don't be dragged down by ITIL bloat!
Soaring in the Clouds - Don't be dragged down by ITIL bloat! Soaring in the Clouds - Don't be dragged down by ITIL bloat!
Soaring in the Clouds - Don't be dragged down by ITIL bloat!
 

Similar to Inextricably linked reproducibility and productivity in data science and ai mark's sf m-lon_code meetup talk (2) (1)

Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)Bill Scott
 
How to explain DevOps to your mom
How to explain DevOps to your momHow to explain DevOps to your mom
How to explain DevOps to your momAndreas Grabner
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Aron Ahmadia
 
stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...
stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...
stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...NETWAYS
 
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Codemotion
 
ASAS 2015 - Benito de Miranda
ASAS 2015 - Benito de MirandaASAS 2015 - Benito de Miranda
ASAS 2015 - Benito de MirandaAvisi B.V.
 
Continuous Intelligence Workshop
Continuous Intelligence WorkshopContinuous Intelligence Workshop
Continuous Intelligence WorkshopDavid Tan
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsWeaveworks
 
Innovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsInnovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsHal Rottenberg
 
Dev talks2021 ionut rusu
Dev talks2021   ionut rusuDev talks2021   ionut rusu
Dev talks2021 ionut rusuIonutRemus
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0Joakim Lindbom
 
Building and Scaling High Performing Technology Organizations by Jez Humble a...
Building and Scaling High Performing Technology Organizations by Jez Humble a...Building and Scaling High Performing Technology Organizations by Jez Humble a...
Building and Scaling High Performing Technology Organizations by Jez Humble a...Agile India
 
Design + Devops: What We've Learned from Our Developer Friends
Design + Devops: What We've Learned from Our Developer FriendsDesign + Devops: What We've Learned from Our Developer Friends
Design + Devops: What We've Learned from Our Developer FriendsUXPA International
 
From dev to ops and beyond - getting it done
From dev to ops and beyond - getting it doneFrom dev to ops and beyond - getting it done
From dev to ops and beyond - getting it doneEdorian
 
Dances with unicorns
Dances with unicornsDances with unicorns
Dances with unicornsEspritAgile
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasMerce Crosas
 
DevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the WorldDevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the WorldDynatrace
 
Big guns for small guys (reloaded)
Big guns for small guys (reloaded)Big guns for small guys (reloaded)
Big guns for small guys (reloaded)Jorge López-Lago
 
Data Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps FundamentalsData Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps FundamentalsAnant Corporation
 

Similar to Inextricably linked reproducibility and productivity in data science and ai mark's sf m-lon_code meetup talk (2) (1) (20)

Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)Real World Lessons Using Lean UX (Workshop)
Real World Lessons Using Lean UX (Workshop)
 
Stress free development
Stress free developmentStress free development
Stress free development
 
How to explain DevOps to your mom
How to explain DevOps to your momHow to explain DevOps to your mom
How to explain DevOps to your mom
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013
 
stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...
stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...
stackconf 2023 | Better Living by Changing Less – IncrativeOps by Michael Cot...
 
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
 
ASAS 2015 - Benito de Miranda
ASAS 2015 - Benito de MirandaASAS 2015 - Benito de Miranda
ASAS 2015 - Benito de Miranda
 
Continuous Intelligence Workshop
Continuous Intelligence WorkshopContinuous Intelligence Workshop
Continuous Intelligence Workshop
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
 
Innovate Better Through Machine data Analytics
Innovate Better Through Machine data AnalyticsInnovate Better Through Machine data Analytics
Innovate Better Through Machine data Analytics
 
Dev talks2021 ionut rusu
Dev talks2021   ionut rusuDev talks2021   ionut rusu
Dev talks2021 ionut rusu
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
 
Building and Scaling High Performing Technology Organizations by Jez Humble a...
Building and Scaling High Performing Technology Organizations by Jez Humble a...Building and Scaling High Performing Technology Organizations by Jez Humble a...
Building and Scaling High Performing Technology Organizations by Jez Humble a...
 
Design + Devops: What We've Learned from Our Developer Friends
Design + Devops: What We've Learned from Our Developer FriendsDesign + Devops: What We've Learned from Our Developer Friends
Design + Devops: What We've Learned from Our Developer Friends
 
From dev to ops and beyond - getting it done
From dev to ops and beyond - getting it doneFrom dev to ops and beyond - getting it done
From dev to ops and beyond - getting it done
 
Dances with unicorns
Dances with unicornsDances with unicorns
Dances with unicorns
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosas
 
DevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the WorldDevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the World
 
Big guns for small guys (reloaded)
Big guns for small guys (reloaded)Big guns for small guys (reloaded)
Big guns for small guys (reloaded)
 
Data Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps FundamentalsData Engineer's Lunch #68: DevOps Fundamentals
Data Engineer's Lunch #68: DevOps Fundamentals
 

More from source{d}

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored MLsource{d}
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticssource{d}
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!source{d}
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriessource{d}
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of codesource{d}
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookoutsource{d}
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetupsource{d}
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as datasource{d}
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack source{d}
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d}
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Codesource{d}
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performancesource{d}
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source codesource{d}
 

More from source{d} (15)

Overton, Apple Flavored ML
Overton, Apple Flavored MLOverton, Apple Flavored ML
Overton, Apple Flavored ML
 
Unlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analyticsUnlocking Engineering Observability with advanced IT analytics
Unlocking Engineering Observability with advanced IT analytics
 
What's new in the latest source{d} releases!
What's new in the latest source{d} releases!What's new in the latest source{d} releases!
What's new in the latest source{d} releases!
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Gitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositoriesGitbase, SQL interface to Git repositories
Gitbase, SQL interface to Git repositories
 
Deduplication on large amounts of code
Deduplication on large amounts of codeDeduplication on large amounts of code
Deduplication on large amounts of code
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookout
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as data
 
Introduction to the source{d} Stack
Introduction to the source{d} Stack Introduction to the source{d} Stack
Introduction to the source{d} Stack
 
source{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQLsource{d} Engine: Exploring git repos with SQL
source{d} Engine: Exploring git repos with SQL
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Machine learning on Go Code
Machine learning on Go CodeMachine learning on Go Code
Machine learning on Go Code
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
Machine learning on source code
Machine learning on source codeMachine learning on source code
Machine learning on source code
 

Inextricably Linked: Reproducibility and Productivity in Data Science and AI (Mark's SF MLonCode meetup talk)

  • 1. dotscience @lmarsden @getdotscience Inextricably Linked: Reproducibility and Productivity in Data Science and AI Mark Coleman mark@dotscience.com @mrmrcoleman @getdotscience
  • 3. Who am/was I? (ordered from Now to Then) ⇢ VP Marketing - dotscience ⇢ Serverless early fan ⇢ CNCF Marketing chairperson ⇢ Docker early fan ⇢ CI/CD consultant ⇢ Devops early fan ⇢ C++ programmer ⇢ C embedded systems
  • 4. Let's compare Data Science/ML/AI and Software Dev/DevOps
  • 5. Not long ago, software dev was a bit of a mess ⇢ Work split across silos + Development + Testing + Operations ⇢ Caused huge amounts of pain
  • 6. 90s Software Development ⇢ Without version control, life is hard + You email zip files of source code ⇢ Two people change the same files? + Your work gets clobbered
  • 7. 90s Testing ⇢ "Works on my machine" ⇢ Email, USB stick, or shared drive → separate testing team ⇢ High latency between breakage & knowing + Lost valuable context by time to fix + A slow & frustrating cycle
  • 8. 90s Operations ⇢ Throw release candidates over the wall to Ops ⇢ They drop a WAR file onto a Tomcat server ⇢ Dev & test failed to account for NFR + Ops can't fix it ⇢ Monitoring is sketchy, users find bugs + SSH into the production box + Process skipped during outage, introduces more bugs ⇢ Everyone is sad
  • 9. How did we ship anything with all this mess? ⇢ Slowly! ⇢ Release cycles are weeks or months ⇢ Bad tooling & process? + Choose SPEED or SAFETY but not both ⇢ Most companies were forced to choose SAFETY
  • 10. What’s this have to do with reproducibility? ⇢ Software is iterative ⇢ Try something → figure out what happened → learn → try something else ⇢ How do we figure out what happened? + Reproduce all the variables & see what changed ⇢ When bad tooling stops us reproducing an environment development grinds to a halt
  • 13. Destructive vs Constructive Collaboration ⇢ Destructive = making copies + No source of truth + Divergence occurs instantly ⇢ Constructive = single source of truth + Multiple branches, try different ideas + Diff & merge enables reconciliation ⇢ Version control enables constructive collaboration
  • 14. Ubiquitous Version Control ⇢ Sane people use version control ⇢ Developers collaborate effectively ⇢ Testing teams can too ⇢ Even Ops uses version control now – GitOps!
  • 15. Continuous Integration ⇢ Version control enables CI ⇢ CI enables fast feedback + React to failures when we can still remember what we changed (minutes not weeks) ⇢ Platform for tested versioned artifacts + Deploy into CD pipeline
  • 16. Continuous Delivery & Observability ⇢ A single traceable way to get a tested change in development to production ⇢ DevOps = ops can collaborate in same way that dev & test teams do with CI ⇢ Application level observability & monitoring allows deep dive into root causes
  • 17. What has all this achieved? ⇢ Version control enabled reproducibility & collaboration ⇢ This unlocks Continuous Integration & Continuous Delivery ⇢ Add some Observability & Monitoring... ⇢ You get both SPEED and SAFETY!
  • 18. How is AI doing in 2018? ⇢ Been talking to dozens of data science & AI teams ⇢ Data science & AI seems to be where software development was in the late 90s :'(
  • 19. “In retrospect if we had been able to save the versions or have gone back in time to see how he got his learning rates it would have avoided a lot of questions from the auditors.”
  • 20. “Two of the data scientists who worked on that particular model have left and gone to other companies. You want to be able to see what they did and how they did it and not that it's gone when they're gone.”
  • 21. “One model failed for 3 months and we lost an immeasurable amount of money!”
  • 22. “After the last audit I was surprised by how many problems in the audit we could have solved by keeping PAPER LOGS. But if we ask our data scientists to do this they will leave!”
  • 23. “We keep our data scientist teams small and in the same room so they can track their summary statistics by talking to each other and remembering”
  • 24. Destructive collaboration is commonplace ⇢ Shared drives for training data ⇢ Notebooks emailed or slacked between team members ⇢ Scant manual documentation ⇢ Data wrangles go unrecorded
  • 25. Testing of models is rare ⇢ Automated testing of models is rare ⇢ CI systems uncommon ⇢ "Testing" is more often done manually by an individual in an untracked Jupyter environment
  • 26. Deployment is manual ⇢ Models often “thrown over the wall” ⇢ Left in production to rot until somebody notices ⇢ No real monitoring, especially challenging with retraining & model drift ⇢ Haven't seen much continuous delivery
  • 27. How do we ship anything with all this mess? ⇢ Inappropriate tooling makes us choose between SPEED and SAFETY ⇢ Therefore + AI/ML projects being shipped slowly with meticulous docs + AI/ML projects being shipped unsafely + not tracked, not auditable + no single source of truth for what made it into prod & how + siloed in peoples' heads...
  • 28. How do we get AI & ML & etc out of the 90s?
  • 29. Version control is fundamental & enabling in the lifecycle (diagram labels: Development, Version control, Continuous Integration, Continuous Delivery, Observability & Monitoring)
  • 30. Version control is fundamental & enabling in the lifecycle (diagram labels: Development, Version control, Continuous Integration, Continuous Delivery, Observability & Monitoring) ⇢ Mission: go round the loop faster!
  • 31. How do we version control? ⇢ Versioned data, environment, code: notebooks + parameters ⇢ Metrics tracking: parameters ↔ summary statistics (e.g. accuracy, business metrics) ⇢ Diff & merge for notebooks, data, code ⇢ Forks, pull requests & comment tracking ⇢ Enables: + Creativity & collaboration + Audit & reporting
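The "metrics tracking: parameters ↔ summary statistics" idea on slide 31 can be sketched as a small run log. This is a minimal illustration, not the tooling the talk proposes: the function names (`record_run`, `diff_params`), the record fields, and the use of a SHA-256 hash as a stand-in for a data version are all assumptions made for the example.

```python
import hashlib

def data_fingerprint(raw_bytes):
    """Hash the training data so each run records exactly which data it saw."""
    return hashlib.sha256(raw_bytes).hexdigest()

def record_run(log, params, metrics, data_bytes, code_version):
    """Append one run to the log: code version, data hash, parameters, metrics."""
    entry = {
        "code_version": code_version,  # hypothetical: e.g. a commit id
        "data_sha256": data_fingerprint(data_bytes),
        "params": dict(params),
        "metrics": dict(metrics),
    }
    log.append(entry)
    return entry

def diff_params(run_a, run_b):
    """Return {name: (value_in_a, value_in_b)} for parameters that differ."""
    keys = set(run_a["params"]) | set(run_b["params"])
    return {
        k: (run_a["params"].get(k), run_b["params"].get(k))
        for k in keys
        if run_a["params"].get(k) != run_b["params"].get(k)
    }
```

With a log like this, the auditor's question "how did he get his learning rates?" becomes a lookup: compare two entries with `diff_params` and read the metrics recorded next to each set of parameters.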
  • 32. How do we continuously integrate? ⇢ What do automated tests look like for models? + Not always binary like software – probabilistic + Pick some inputs / outputs & put triggers on them + If it goes > N stddev, fail tests + Also test NFR & unit/integration tests on code ⇢ When issues are reported with a model, convert issues to tests + This way, CI provides "guide rails" for faster & more confident development
  • 33. How do we continuously deliver? ⇢ Triggers: when code changes or data changes ⇢ Automatically run code and model tests ⇢ If tests pass, automatically deploy to production + Champion Challenger, A/B + Minimize time between breakage & knowing + Minimize MTTR not MTBF, fast rollback ⇢ From decisions made in production, be able to track back perfectly + See lineage of model development right down to individual parameter tweakings - who/what/when/why
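The champion/challenger gate on slide 33 can be illustrated with a small decision function. This is only a sketch: the single-score comparison, the `min_gain` margin, and the name `choose_deployment` are assumptions for the example, and a real pipeline would compare models on live or held-out traffic.

```python
def choose_deployment(champion_score, challenger_score, tests_passed, min_gain=0.0):
    """Decide which model serves production.

    The challenger is promoted only if the automated tests passed and it
    beats the current champion by more than `min_gain`; otherwise the
    champion stays in place, which doubles as the fast-rollback path.
    """
    if not tests_passed:
        return "champion"
    if challenger_score > champion_score + min_gain:
        return "challenger"
    return "champion"
```

Keeping the champion as the default outcome means a failed test or a marginal result never changes what is in production.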
  • 34. How do we solve observability? ⇢ Once model is in production, track model health with same metrics used in development + Single source of truth for dev/prod metrics + See model drift + If model health < X, page a human ⇢ Automatic retraining can happen periodically when new data is available ⇢ CI & CD gives us confidence to ship quickly
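The "if model health < X, page a human" rule on slide 34 could be sketched as below. The averaging of recent scores as the health measure, the threshold value, and the `page` callback are all assumptions for illustration; in practice `page` would hand off to whatever alerting system the team already uses.

```python
def check_model_health(recent_scores, threshold, page):
    """Average recent production scores; call `page` when health drops below threshold.

    `page` is any callable that accepts a message string (hypothetical hook
    into an alerting system). Returns True when the model is healthy.
    """
    health = sum(recent_scores) / len(recent_scores)
    if health < threshold:
        page(f"model health {health:.3f} below threshold {threshold:.3f}")
        return False
    return True
```

Using the same metric in production that was tracked during development is what makes the threshold meaningful: the alert fires relative to a number the team has already seen while training.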
  • 35. So that's the big vision… where do we start? (diagram labels: Development, Version control, Continuous Integration, Continuous Delivery, Observability & Monitoring)
  • 36. So you want to do reproducible data science/AI/ML? ⇢ Environment
  • 37. So you want to do reproducible data science/AI/ML? ⇢ Environment ⇢ Code + Notebooks (including parameters)
  • 38. So you want to do reproducible data science/AI/ML? ⇢ Environment ⇢ Code + Notebooks (including parameters) ⇢ Versioned Data
  • 39. How?
  • 40. Pinning down environment ⇢ In the DevOps world, Docker has been a big hit. ⇢ Docker helps you pin down the execution environment that your model training (or other data work) is happening in. ⇢ What is Docker?
  • 41. Pinning down code & notebooks ⇢ Developers have been version controlling their code for a while now. ⇢ Git is the default choice
  • 42. Challenges with git in data science ⇢ Git lets you track versions of your code and collaborate with others by commit, clone, push, pull… ⇢ Problems: + In data science, it's not natural to commit every time you change anything, e.g. while tuning parameters + But you generate important results while you're iterating + Git doesn't cope with large files, and data scientists often mingle code & data + Diffing and merging Jupyter notebooks is not easy
  • 43. Proposal: a new version control & collaboration system for AI ⇢ Use Dotmesh with ZFS + "Git for data" + Handles large data atomically & efficiently + Deal with terabyte workspaces ⇢ Track metrics/stats & params ⇢ Track lineage & provenance ⇢ Next: + Diff & merge notebooks + Enable pull requests
  • 44. I need your help 🙏 mark@dotscience.com @mrmrcoleman @getdotmesh dotscience.com/try Questions?