The recent advances in machine learning and artificial intelligence are amazing! Yet, in order to have real value within a company, data scientists must be able to get their models off of their laptops and deployed within a company’s data pipelines and infrastructure. In this session, I'll demonstrate how one-off experiments can be transformed into scalable ML pipelines with minimal effort.
7. To Reach Its Full Potential, Machine Learning Needs
1. Data to have the same production practices as code
2. Empowered developers, not restricted
3. Organization-wide confidence
8. Obstacles that prevent Effective Data Science
Data Divergence: Data sets change constantly. Teams can’t make decisions from their data if they don’t know what version was used.
Tooling Constraints: Infra often restricts the tooling options available to data scientists.
Not Reproducible: Data teams can’t reproduce results because they can’t track every version of data and code throughout the system.
Pachyderm.com
9. For data science to be successful, outputs need to be reproducible
Version control for Data: Manage data with the same production practices as code
Containerized data pipelines: Developers need to be empowered with choice, not restricted
Data Lineage: Be able to instantly reconstruct any past output/decision
10. General Fusion uses Pachyderm to
Power Commercial Fusion
Research
“The true tipping point in our decision to use
Pachyderm was its version control features for
managing our data.”
- Jonathan Fraser
Engineer at General Fusion
General Fusion collects large sets of complex data from thousands of
sensors. Managing, scaling, and processing that data is a challenge.
Criteria
1. A data science platform that could scale and adapt with their growth.
2. Augment existing experimental and analysis workflows.
3. Seamless collaboration with external scientific partners.
Business Outcome
1. Data versioning - Pachyderm enables data science teams to develop
reproducible and distributed data workflows without interfering with
each other's analysis.
2. Data provenance - Every data transformation is tracked, allowing any
result to be 100 percent reproducible and verifiable
11. Pachyderm provides reproducibility through
Data Versioning
Identify and revert “bad” data changes
Version model binaries and parameters
along with the data used to train them
Reproduce specific processes using
historical state(s) of data
[Figure: a series of commits (a5bcc61...1812, 7afad96...680e, b85ea63...e4d4, 7585b4e...0cc5, af4cf48...8840) shown alongside the files person.png, stopsign.png, road.png, boat.png, bike.png]
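The idea of data versioning on this slide can be illustrated with a small, self-contained sketch. This is a toy in-memory model, not Pachyderm's implementation: the `ToyRepo` class, its methods, and the file contents are all hypothetical. It only shows how immutable, content-addressed commits make it possible to spot a "bad" change and read the data as it was before that change.

```python
import hashlib
import json


class ToyRepo:
    """A toy, in-memory versioned data repository (illustrative only)."""

    def __init__(self):
        # commit ID -> {"parent": commit ID or None, "files": {path: content hash}}
        self.commits = {}
        self.head = None

    def commit(self, changes):
        """Record a new snapshot. `changes` maps path -> bytes, or None to delete."""
        files = dict(self.commits[self.head]["files"]) if self.head else {}
        for path, data in changes.items():
            if data is None:
                files.pop(path, None)
            else:
                files[path] = hashlib.sha256(data).hexdigest()
        # The commit ID is derived from the parent and the full file listing.
        payload = json.dumps({"parent": self.head, "files": files}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()
        self.commits[commit_id] = {"parent": self.head, "files": files}
        self.head = commit_id
        return commit_id

    def files_at(self, commit_id):
        """Return the {path: content hash} snapshot as of a commit."""
        return dict(self.commits[commit_id]["files"])


repo = ToyRepo()
c1 = repo.commit({"person.png": b"pixels-v1"})
c2 = repo.commit({"stopsign.png": b"pixels-v1"})
c3 = repo.commit({"person.png": b"corrupted"})  # a "bad" data change

# The state before the bad change is still addressable by its commit ID.
good_state = repo.files_at(c2)
print(sorted(good_state))
print(good_state["person.png"] == repo.files_at(c1)["person.png"])
```

Because earlier commits are never modified, a training run can record the commit ID it read from, and the same inputs can be retrieved later.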
12. Pachyderm provides workflow management through
Containerized Analyses
Use any languages and frameworks in
pipelines
Port your workflows to any
infrastructure
Easily transition from local dev to production
deploy
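As a rough illustration of a containerized analysis, the sketch below builds a pipeline description as a Python dictionary and prints it as JSON. The pipeline name, repo name, container image, and script path are hypothetical placeholders. The field layout (`pipeline`, `transform`, `input`) follows the general shape of Pachyderm's JSON pipeline specification as I understand it, but key names have changed between releases, so treat it as an assumption and check the specification for the version in use.

```python
import json

# Hypothetical pipeline: every name, image, and path here is a placeholder.
pipeline_spec = {
    "pipeline": {"name": "train-model"},
    "transform": {
        # Any container image works, which is what makes pipelines language-agnostic.
        "image": "example/trainer:latest",
        # Input repos are mounted under /pfs/<repo>; results are written to /pfs/out.
        "cmd": ["python3", "/code/train.py", "/pfs/training-data", "/pfs/out"],
    },
    "input": {
        # Assumed key names; older releases used a different input key.
        "pfs": {"repo": "training-data", "glob": "/"},
    },
}

print(json.dumps(pipeline_spec, indent=2))
```

The point of the design is that the analysis code lives entirely inside the container, so the same image can run on a laptop or in a production cluster without changes to the code itself.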
13. Pachyderm provides workflow management through
Data Pipelines
Use any languages and
frameworks in pipelines
Port your workflows to any
infrastructure
Easily transition from local
dev to production deploy
[Diagram labels: ETL Pipeline, ML Pipeline, CI/CD, Application, Pachyderm]
15. Pachyderm provides audit trails via
Data Provenance
Track every version of data and code
that produced a result
Maintain compliance and reproducibility
Manage relationship between historical
data states
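One way to picture data provenance is as a set of records, each stating which code version and which inputs produced an output. The sketch below is a toy model under that assumption, not Pachyderm's internal representation: the artifact names, version tags, and the `trace` helper are all hypothetical.

```python
# Each record says: this output was produced by this code version from these inputs.
# All names and version tags are hypothetical.
provenance = {
    "model@m1": {"code": "train.py@v3", "inputs": ["features@f1"]},
    "features@f1": {"code": "featurize.py@v7", "inputs": ["raw@r1", "labels@l1"]},
}


def trace(artifact, records):
    """Walk upstream from `artifact`, returning (output, code, inputs) per derived step."""
    lineage = []
    stack = [artifact]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen or current not in records:
            continue  # source data has no record of its own
        seen.add(current)
        record = records[current]
        lineage.append((current, record["code"], tuple(record["inputs"])))
        stack.extend(record["inputs"])
    return lineage


for output, code, inputs in trace("model@m1", provenance):
    print(f"{output} <- {code} applied to {', '.join(inputs)}")
```

With records like these, an auditor can start from any result and list every version of data and code that contributed to it.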
17. Data Provenance In Action
Being able to pinpoint exactly what data is
being used is hard enough for most
companies. Tack on the requirement of having
to edit/remove a specific piece of data without
disruption, and that seems next to impossible.
General Data Protection
Regulation
18. GDPR Example - Before
What happens when user “Jared” opts out?
The “Black Box Problem”: a time-consuming manual process
● File a ticket
● Entire audit of pipeline
● Removal of Jared’s data
● Models need to be re-trained and tested
● Audit to ensure Jared is not included going forward
● Etc.
[Diagram labels: Users Database, Model Training, Model Deployed]
19. GDPR Example - With Pachyderm
What happens when user “Jared” opts out?
[Diagram labels: Users Database, Model Training, Model Deployed]
Jian Yang, commit: 9fa0a4...74f
Gaven Belson, commit: 8593ef...4d7
Jared Dunn, commit: 60fae8...7d0
“pachctl delete-file jared.info”
Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption.
20. GDPR Example - With Pachyderm
What happens when user “Jared” opts out?
[Diagram labels: Users Database, Model Training, Model Deployed]
Jian Yang, commit: 9fa0a4...74f
Gaven Belson, commit: 8593ef...4d7
Jared Dunn, commit: 60fae8...7d0
Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption.
GDPR Request Met
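The opt-out example above depends on knowing which downstream results were derived from one user's file. A minimal sketch of that lookup follows, assuming lineage is available as simple (input, output) pairs. The edge list, artifact names, and the `affected_by` helper are hypothetical and do not represent Pachyderm's API.

```python
from collections import defaultdict

# Hypothetical lineage edges: (input, output derived from it).
edges = [
    ("users/jared.info", "training-set"),
    ("users/jian.info", "training-set"),
    ("training-set", "model"),
    ("model", "deployed-model"),
    ("users/jian.info", "billing-report"),
]


def affected_by(removed, edges):
    """Return every artifact downstream of `removed`, in breadth-first order."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    affected, queue = [], [removed]
    while queue:
        node = queue.pop(0)
        for child in children[node]:
            if child not in affected:
                affected.append(child)
                queue.append(child)
    return affected


# Everything that must be recomputed once Jared's file is removed.
print(affected_by("users/jared.info", edges))
```

In this toy lineage, only the training set, the model, and the deployed model depend on Jared's file, so the billing report is left untouched. That is the property that replaces the manual audit of the entire pipeline.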
21. Pachyderm in 60-seconds
Pachyderm lets you deploy and manage multi-stage, language-agnostic data
pipelines while maintaining complete reproducibility and provenance.
Pachyderm.com
github.com/pachyderm