SlideShare a Scribd company logo
1 of 22
Download to read offline
PachydermReproducible and Compliant
Data Science
Nick Harvey - Lead Developer Advocate
Pachyderm Inc.
Nick@pachyderm.com
@nicksharvey
As Data Scientists...
We
Pachyderm.com
As Data Scientists...
We
“Big” Data
Pachyderm.com
As Data Scientists...
We
Pachyderm.com
Production
ML/AI Model
Training
ML/AI Model
Inference or
Prediction
Pachyderm.com
Production
ML/AI Model
Training
Data Input
Transforms
Data
Ingestion
Data
Cleaning
Feature
Engineering
Model
Selection,
Parameter
Search
Feature
Transforms
Production
Model
Testing
Model
Export &
Optimization
ML/AI Model
Inference or
Prediction
Post
Processing
Pachyderm.com
To Reach Its Full Potential
Machine Learning Needs1.
Data to have the same
production practices as
code
2.
Empowered developers
not restricted
3.
Organization wide
confidence
Data Divergence
Data sets change constantly. Teams can’t make decisions from their
data if they don’t know what version was used.
Tooling Constraints
Infra often restricts the tooling options available to data scientists.
Not Reproducible
Data teams can’t reproduce results because they can’t track every
version of data and code throughout the system.
Obstacles that prevent
Effective Data Science
Pachyderm.com
For data science to be successful
outputs need to be reproducible
Manage data with the
same production
practices as code
Developers need to be
empowered with choice,
not restricted
Version control for Data
Containerized data pipelines
Be able to instantly
reconstruct any past
output/decision
Data Lineage
General Fusion uses Pachyderm to
Power Commercial Fusion
Research
“The true tipping point in our decision to use
Pachyderm was its version control features for
managing our data.”
- Jonathan Fraser
Engineer at General Fusion
General Fusion has collects large sets of complex data from thousands of
sensors. Managing, scaling, and processing that data is a challenge.
Criteria
1. A data science platform that could scale and adapt with their growth.
2. Augment existing experimental and analysis workflows.
3. Seamless collaboration with external scientific partners.
Business Outcome
1. Data versioning - Pachyderm enables data science teams to develop
reproducible and distributed data workflows without interfering with
each other's analysis.
2. Data provenance - Every data transformation is tracked, allowing any
result to be 100 percent reproducible and verifiable
Pachyderm provides reproducibility through
Data Versioning
Identify and revert “bad” data changes
Version model binaries and parameters
along with the data used to train them
Reproduce specific processes using
historical state(s) of data
Commit ID: a5bcc61...1812
Commit ID: 7afad96...680e
Commit ID: b85ea63...e4d4
Commit ID: 7585b4e...0cc5
Commit ID: af4cf48...8840
person.png
stopsign.png
road.png
boat.png
bike.png
Pachyderm.com
Pachyderm provides workflow management through
Containerized Analyses
Use any languages and frameworks in
pipelines
Port your workflows to any
infrastructure
Easily transition from local dev to production
deploy
Pachyderm.com
Pachyderm provides workflow management through
Data Pipelines
Use any languages and
frameworks in pipelines
Port your workflows to any
infrastructure
Easily transition from local
dev to production deploy
ETL Pipeline ML pipeline CI/CD Application
Pachyderm
Pachyderm.com
Versioned
Training
Data
Pre-Processing Model Export
Versioned
Pre-Processed
Data
Training Versioned
Model
Coming Soon
github.com/kubeflow/examples
Pachyderm provides audit trails via
Data Provenance
Track every version of data and code
that produced a result
Maintain compliance and reproducibility
Manage relationship between historical
data states
Pachyderm.com
Pachyderm
Stack Diagram
Pachyderm.com
Data Provenance In Action
Being able to pinpoint exactly what data is
being used is hard enough for most
companies. Tack on the requirement of having
to edit/remove a specific piece of data without
disruption, and that sees next to impossible.
General Data Protection
Regulation
Pachyderm.com
GDPR Example - Before
● File a ticket
● Entire audit of pipeline
● Removal of Jared’s data
● Models need to be
re-trained and tested.
● Audit to ensure Jared it
not part of the future
● Etc.
Time consuming
manual process
Model
Training
Users
Database
Model
Deployed
User “Jared”
Opts out
?
What happens when
“Black Box Problem”
Pachyderm.com
GDPR Example - With Pachyderm
Model
Training
Users
Database
Model
Deployed
User “Jared”
Opts out
What happens when
Jian Yang
commit: 9fa0a4...74f
Gaven Belson
commit: 8593ef...4d7
Jared Dunn
commit: 60fae8...7d0
“Pachctl delete-file jared.info”
Pachyderm maintains a complete audit, enabling you to
add/edit/remove data with just one command and zero disruption.
Pachyderm.com
GDPR Example - With Pachyderm
Model
Training
Users
Database
Model
Deployed
User “Jared”
Opts out
What happens when
Jian Yang
commit: 9fa0a4...74f
Gaven Belson
commit: 8593ef...4d7
Jared Dunn
commit: 60fae8...7d0
Pachyderm maintains a complete audit, enabling you to
add/edit/remove data with just one command and zero disruption.
GDPR Request
Met
Pachyderm.com
Pachyderm in 60-seconds
Pachyderm lets you deploy and manage multi-stage, language-agnostic data
pipelines while maintaining complete reproducibility and provenance.
Pachyderm.com
github.com/pachyderm
Thank you

More Related Content

What's hot

Death to project documentation with eXtreme Programming
Death to project documentation with eXtreme ProgrammingDeath to project documentation with eXtreme Programming
Death to project documentation with eXtreme ProgrammingAlex Fernandez
 
Data Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program AnalysisData Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program AnalysisWork-Bench
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
 
Using dataset versioning in data science
Using dataset versioning in data scienceUsing dataset versioning in data science
Using dataset versioning in data scienceVenkata Pingali
 
Applying Java 8 Idioms to Existing Code
Applying Java 8 Idioms to Existing CodeApplying Java 8 Idioms to Existing Code
Applying Java 8 Idioms to Existing CodeC4Media
 
Julia + R for Data Science
Julia + R for Data ScienceJulia + R for Data Science
Julia + R for Data ScienceWork-Bench
 
Web Applications of the Future with TypeScript and GraphQL
Web Applications of the Future with TypeScript and GraphQLWeb Applications of the Future with TypeScript and GraphQL
Web Applications of the Future with TypeScript and GraphQLRoy Derks
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Software Frameworks for Music Information Retrieval
Software Frameworks for Music Information RetrievalSoftware Frameworks for Music Information Retrieval
Software Frameworks for Music Information RetrievalXavier Amatriain
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
Functional Programming - Worth the Effort
Functional Programming - Worth the EffortFunctional Programming - Worth the Effort
Functional Programming - Worth the EffortBoldRadius Solutions
 
Finding Defects in C#: Coverity vs. FxCop
Finding Defects in C#: Coverity vs. FxCopFinding Defects in C#: Coverity vs. FxCop
Finding Defects in C#: Coverity vs. FxCopCoverity
 
How to NLProc from .NET
How to NLProc from .NETHow to NLProc from .NET
How to NLProc from .NETSergey Tihon
 
C# 4.0 and .NET 4.0
C# 4.0 and .NET 4.0C# 4.0 and .NET 4.0
C# 4.0 and .NET 4.0Buu Nguyen
 
Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017MLconf
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science PlatformQAware GmbH
 

What's hot (20)

Death to project documentation with eXtreme Programming
Death to project documentation with eXtreme ProgrammingDeath to project documentation with eXtreme Programming
Death to project documentation with eXtreme Programming
 
Data Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program AnalysisData Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program Analysis
 
Scaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflowScaling Ride-Hailing with Machine Learning on MLflow
Scaling Ride-Hailing with Machine Learning on MLflow
 
Using dataset versioning in data science
Using dataset versioning in data scienceUsing dataset versioning in data science
Using dataset versioning in data science
 
Applying Java 8 Idioms to Existing Code
Applying Java 8 Idioms to Existing CodeApplying Java 8 Idioms to Existing Code
Applying Java 8 Idioms to Existing Code
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
 
Cqrs
CqrsCqrs
Cqrs
 
Julia + R for Data Science
Julia + R for Data ScienceJulia + R for Data Science
Julia + R for Data Science
 
Web Applications of the Future with TypeScript and GraphQL
Web Applications of the Future with TypeScript and GraphQLWeb Applications of the Future with TypeScript and GraphQL
Web Applications of the Future with TypeScript and GraphQL
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Software Frameworks for Music Information Retrieval
Software Frameworks for Music Information RetrievalSoftware Frameworks for Music Information Retrieval
Software Frameworks for Music Information Retrieval
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Functional Programming - Worth the Effort
Functional Programming - Worth the EffortFunctional Programming - Worth the Effort
Functional Programming - Worth the Effort
 
Finding Defects in C#: Coverity vs. FxCop
Finding Defects in C#: Coverity vs. FxCopFinding Defects in C#: Coverity vs. FxCop
Finding Defects in C#: Coverity vs. FxCop
 
How to NLProc from .NET
How to NLProc from .NETHow to NLProc from .NET
How to NLProc from .NET
 
C# 4.0 and .NET 4.0
C# 4.0 and .NET 4.0C# 4.0 and .NET 4.0
C# 4.0 and .NET 4.0
 
Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017Scott Clark, CEO, SigOpt, at The AI Conference 2017
Scott Clark, CEO, SigOpt, at The AI Conference 2017
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
 
Code Reviews in Python - PyZh
Code Reviews in Python - PyZhCode Reviews in Python - PyZh
Code Reviews in Python - PyZh
 

Similar to Pachyderm Provides Reproducible and Compliant Data Science Workflows

Scalable and Repeatable Machine Learning pipelines: A key requirement for you...
Scalable and Repeatable Machine Learning pipelines: A key requirement for you...Scalable and Repeatable Machine Learning pipelines: A key requirement for you...
Scalable and Repeatable Machine Learning pipelines: A key requirement for you...All Things Open
 
Securing your Machine Learning models
Securing your Machine Learning modelsSecuring your Machine Learning models
Securing your Machine Learning modelsPhilipBasford
 
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Neotys_Partner
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowJan Kirenz
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...DataWorks Summit
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-useltonrodriguez11
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowJan Kirenz
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Deeplearning and dev ops azure
Deeplearning and dev ops azureDeeplearning and dev ops azure
Deeplearning and dev ops azureVishwas N
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Accelerate ML Deployment with H2O Driverless AI on AWS
Accelerate ML Deployment with H2O Driverless AI on AWSAccelerate ML Deployment with H2O Driverless AI on AWS
Accelerate ML Deployment with H2O Driverless AI on AWSSri Ambati
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...Robert Grossman
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenGoDataDriven
 
SpiraPlan Overview Presentation (2019)
SpiraPlan Overview Presentation (2019)SpiraPlan Overview Presentation (2019)
SpiraPlan Overview Presentation (2019)Inflectra
 
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
MLOps Virtual Event | Building Machine Learning Platforms for the Full LifecycleMLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
MLOps Virtual Event | Building Machine Learning Platforms for the Full LifecycleDatabricks
 

Similar to Pachyderm Provides Reproducible and Compliant Data Science Workflows (20)

Scalable and Repeatable Machine Learning pipelines: A key requirement for you...
Scalable and Repeatable Machine Learning pipelines: A key requirement for you...Scalable and Repeatable Machine Learning pipelines: A key requirement for you...
Scalable and Repeatable Machine Learning pipelines: A key requirement for you...
 
Securing your Machine Learning models
Securing your Machine Learning modelsSecuring your Machine Learning models
Securing your Machine Learning models
 
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
DEVOPS AND MACHINE LEARNING
DEVOPS AND MACHINE LEARNINGDEVOPS AND MACHINE LEARNING
DEVOPS AND MACHINE LEARNING
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-us
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Deeplearning and dev ops azure
Deeplearning and dev ops azureDeeplearning and dev ops azure
Deeplearning and dev ops azure
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Accelerate ML Deployment with H2O Driverless AI on AWS
Accelerate ML Deployment with H2O Driverless AI on AWSAccelerate ML Deployment with H2O Driverless AI on AWS
Accelerate ML Deployment with H2O Driverless AI on AWS
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDriven
 
SpiraPlan Overview Presentation (2019)
SpiraPlan Overview Presentation (2019)SpiraPlan Overview Presentation (2019)
SpiraPlan Overview Presentation (2019)
 
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
MLOps Virtual Event | Building Machine Learning Platforms for the Full LifecycleMLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 
Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...PyData
 
Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...PyData
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...PyData
 
LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...
LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...
LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...
 
Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...Towards automating machine learning: benchmarking tools for hyperparameter tu...
Towards automating machine learning: benchmarking tools for hyperparameter tu...
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...
 
LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...
LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...
LightFields.jl: Fast 3D image reconstruction for VR applications - Hector And...
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Pachyderm Provides Reproducible and Compliant Data Science Workflows

  • 1. PachydermReproducible and Compliant Data Science Nick Harvey - Lead Developer Advocate Pachyderm Inc. Nick@pachyderm.com @nicksharvey
  • 7. To Reach Its Full Potential Machine Learning Needs1. Data to have the same production practices as code 2. Empowered developers not restricted 3. Organization wide confidence
  • 8. Data Divergence Data sets change constantly. Teams can’t make decisions from their data if they don’t know what version was used. Tooling Constraints Infra often restricts the tooling options available to data scientists. Not Reproducible Data teams can’t reproduce results because they can’t track every version of data and code throughout the system. Obstacles that prevent Effective Data Science Pachyderm.com
  • 9. For data science to be successful outputs need to be reproducible Manage data with the same production practices as code Developers need to be empowered with choice, not restricted Version control for Data Containerized data pipelines Be able to instantly reconstruct any past output/decision Data Lineage
  • 10. General Fusion uses Pachyderm to Power Commercial Fusion Research “The true tipping point in our decision to use Pachyderm was its version control features for managing our data.” - Jonathan Fraser Engineer at General Fusion General Fusion has collects large sets of complex data from thousands of sensors. Managing, scaling, and processing that data is a challenge. Criteria 1. A data science platform that could scale and adapt with their growth. 2. Augment existing experimental and analysis workflows. 3. Seamless collaboration with external scientific partners. Business Outcome 1. Data versioning - Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other's analysis. 2. Data provenance - Every data transformation is tracked, allowing any result to be 100 percent reproducible and verifiable
  • 11. Pachyderm provides reproducibility through Data Versioning Identify and revert “bad” data changes Version model binaries and parameters along with the data used to train them Reproduce specific processes using historical state(s) of data Commit ID: a5bcc61...1812 Commit ID: 7afad96...680e Commit ID: b85ea63...e4d4 Commit ID: 7585b4e...0cc5 Commit ID: af4cf48...8840 person.png stopsign.png road.png boat.png bike.png Pachyderm.com
  • 12. Pachyderm provides workflow management through Containerized Analyses Use any languages and frameworks in pipelines Port your workflows to any infrastructure Easily transition from local dev to production deploy Pachyderm.com
  • 13. Pachyderm provides workflow management through Data Pipelines Use any languages and frameworks in pipelines Port your workflows to any infrastructure Easily transition from local dev to production deploy ETL Pipeline ML pipeline CI/CD Application Pachyderm Pachyderm.com
  • 14. Versioned Training Data Pre-Processing Model Export Versioned Pre-Processed Data Training Versioned Model Coming Soon github.com/kubeflow/examples
  • 15. Pachyderm provides audit trails via Data Provenance Track every version of data and code that produced a result Maintain compliance and reproducibility Manage relationship between historical data states Pachyderm.com
  • 17. Data Provenance In Action Being able to pinpoint exactly what data is being used is hard enough for most companies. Tack on the requirement of having to edit/remove a specific piece of data without disruption, and that sees next to impossible. General Data Protection Regulation Pachyderm.com
  • 18. GDPR Example - Before ● File a ticket ● Entire audit of pipeline ● Removal of Jared’s data ● Models need to be re-trained and tested. ● Audit to ensure Jared it not part of the future ● Etc. Time consuming manual process Model Training Users Database Model Deployed User “Jared” Opts out ? What happens when “Black Box Problem” Pachyderm.com
  • 19. GDPR Example - With Pachyderm Model Training Users Database Model Deployed User “Jared” Opts out What happens when Jian Yang commit: 9fa0a4...74f Gaven Belson commit: 8593ef...4d7 Jared Dunn commit: 60fae8...7d0 “Pachctl delete-file jared.info” Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption. Pachyderm.com
  • 20. GDPR Example - With Pachyderm Model Training Users Database Model Deployed User “Jared” Opts out What happens when Jian Yang commit: 9fa0a4...74f Gaven Belson commit: 8593ef...4d7 Jared Dunn commit: 60fae8...7d0 Pachyderm maintains a complete audit, enabling you to add/edit/remove data with just one command and zero disruption. GDPR Request Met Pachyderm.com
  • 21. Pachyderm in 60-seconds Pachyderm lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance. Pachyderm.com github.com/pachyderm