
Reproducible data science: review of Pachyderm, Data Version Control and GIT LFS tools



The advances in machine learning are great, yet to deliver real value within a company, data scientists must be able to go from a research project to a reproducible process. A common problem is that the code is intrinsically linked to the data it was developed against. Hence it is critically important to track, trace and validate the input data used to train and test the algorithm. This talk is a review of several tools for data versioning and processing.
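
As a minimal sketch of what tracking and validating input data can mean in practice, the snippet below records a SHA-256 checksum for a data file and later checks that the file still matches it. The names `fingerprint`, `validate` and `manifest`, and the file `train.csv`, are illustrative assumptions and not part of any tool reviewed in the talk.

```python
import hashlib
import os
import tempfile


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(path, manifest):
    """Return True if the file's current checksum matches the recorded one."""
    return manifest.get(os.path.basename(path)) == fingerprint(path)


# Hypothetical example: record the checksum of a training file, then modify it.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "train.csv")
with open(data_path, "w") as handle:
    handle.write("x,y\n1,2\n")

manifest = {os.path.basename(data_path): fingerprint(data_path)}
unchanged_ok = validate(data_path, manifest)  # file is as recorded

with open(data_path, "a") as handle:
    handle.write("3,4\n")
changed_ok = validate(data_path, manifest)  # file no longer matches
```

Committing such a manifest next to the code is one simple way to tie a result to the exact input it was produced from.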

Published in: Software


  1. 1. Reproducible data science Lightning review
  2. 2. Intro • Josh Levy-Kramer • Data Science Lead at Outra • 5 years in Data Science • Main focus has been social media, marketing and retail
  3. 3. Outline 1. Reproducibility crisis 2. Possible solutions 3. Comparison and roundup
  4. 4. Reproducibility crisis • We are in the dark ages when it comes to tracking changes and building models • ML lacks the abstractions that software developers have built • Creates problems for yourself, your team and public projects • Partly because large files can't be committed to Git
  5. 5. Data Scientist Manifesto • Reproducibility – the ability to reconstruct any previous state of your data analysis (data and execution) • Provenance – the ability to track any result and link it to the input and code used • Collaboration – the ability to easily collaborate with team members • Environment agnostic – the ability to deploy a process to different environments without much hindrance Adapted from
  6. 6. Data – Code – Environment → Output
  7. 7. Code – Environment
     • Code: Git
     • Python environment: pip requirements.txt
         numpy==1.13.3
         pandas==0.21.1
         scikit-learn==0.19.1
         psutil==5.4.0
         pyyaml==3.1
     • OS environment: Dockerfile
         FROM continuumio/miniconda3:4.3.27
         COPY requirements.txt .
         RUN pip install -r requirements.txt
         COPY
         ENTRYPOINT
     • Hardware environment: AWS CloudFormation template
         AWSTemplateFormatVersion: "2010-09-09"
         Resources:
           WebInstance:
             Type: AWS::EC2::Instance
             Properties:
               InstanceType: r4.8xlarge
               ImageId: ami-80861296
               KeyName: my-key
               SecurityGroupIds:
                 - sg-abc01234
               SubnetId: subnet-acb01234
  8. 8. Data? Data Output
  9. 9. Emerging solutions Data Version Control
  10. 10. Git LFS • Git extension • Allows you to commit large files into Git • Uses custom protocol and store • No concept of pipelining
  11. 11. What's similar • Version controls data and pipelines, similar to what Git does with code • Two main abstractions: data and pipelines
  12. 12. Version control all • Version controls data and pipelines, similar to what Git does with code • Input image → ML model (alpha=0.7) → Output v1 • Change input (alpha=0.7) → Output v2 • Change model (alpha=0.1) → Output v3
  13. 13. Pachyderm – workflow • Data repo: pachctl putfile • Pipeline: pachctl create-pipeline • Output
  14. 14. Pachyderm • 👍 I like: • Interlinked data-pipeline-output version control • Automatic output generation • Parallelisation and distribution • Semi-mature project – started 2014 • 👎 I dislike: • Not environment agnostic • Bloated tool: • Not generic – highly integrated with Kubernetes and S3 • Installation is complicated • Not portable • Not integrated with Git
  15. 15. dvc • “Git extension for data scientists – manage your code and data together” • Same Git workflow with extra commands • Data repo: dvc add • Pipeline: dvc run -d input.csv -o output.csv (alpha=0.1) • Output
  16. 16. Workflow
  17. 17. dvc • 👍 I like: • Integration with Git • Interlinked data-pipeline-output version control • Easy to install • Environment agnostic • 👎 I dislike: • Double the actions required compared to just Git – easy to get lost in the workflow • Terrible name • Immature project – started 2017
  18. 18. Round up • No solution is quite there yet • Data Version Control (dvc) is the best contender 🤔
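
The "version control all" idea from slide 12 (a new output version whenever the input or the model parameter changes) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not how Pachyderm or dvc are implemented: `version_id`, `OutputStore` and `toy_model` are hypothetical names, and the store lives in memory only.

```python
import hashlib
import json


def version_id(input_bytes, params):
    """Derive a short, deterministic version id from input data and parameters."""
    digest = hashlib.sha256()
    digest.update(hashlib.sha256(input_bytes).digest())
    # sort_keys makes the id independent of dictionary ordering
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


class OutputStore:
    """Toy in-memory store mapping version ids to outputs and their provenance."""

    def __init__(self):
        self.records = {}

    def run(self, model, input_bytes, params):
        key = version_id(input_bytes, params)
        if key not in self.records:  # only recompute when something changed
            self.records[key] = {
                "params": dict(params),
                "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
                "output": model(input_bytes, **params),
            }
        return key


def toy_model(input_bytes, alpha):
    # Stand-in for a real model: scales the input size by alpha
    return alpha * len(input_bytes)


store = OutputStore()
v1 = store.run(toy_model, b"image-a", {"alpha": 0.7})
v2 = store.run(toy_model, b"image-b", {"alpha": 0.7})  # change input
v3 = store.run(toy_model, b"image-b", {"alpha": 0.1})  # change model parameter
v1_again = store.run(toy_model, b"image-a", {"alpha": 0.7})  # nothing changed
```

Each record keeps the input checksum and parameters next to the output, which is the provenance link the manifesto on slide 5 asks for. Versioning the model code itself is left out of this sketch.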