Version Control For Data Science @ PyCon DE & PyData Berlin // October 11th 2019

Data is the key differentiator between a Machine Learning project and a traditional software project: even if everything else stays stable, changing the data your models are trained upon makes a huge difference.

The best tools for tracking changes are the version control systems used in software development, such as Git, Mercurial, and Subversion. They keep track of what was changed in a file, when and by whom, and synchronize changes to a central server so that multiple contributors can manage changes to the same set of files. But these traditional tools aren't quite sufficient for Machine Learning, because you also need to track the datasets and the resulting models alongside the code itself.

So versioning in Data Science projects can be pretty painful. There are six things you usually want to keep track of:

code
data
configurations
resulting models
performance metrics
environments / dependencies
Running a Data Science project is an iterative process, and you usually don't want to commit every time you tweak a parameter or a performance metric changes. Instead, you'll run a variety of experiments and commit once you're satisfied.

This usually means that during the experimentation process you might lose track of the experiments you ran (e.g. changes to the data or to the dependencies). When you share your results with your colleagues, they won't have any idea of what you've already tried and will most likely end up redoing a bunch of work; after a couple of weeks, you could end up doing the same yourself.

In this talk I will share some best practices to help you better version your ML projects, and I will show some existing tools such as DVC, nbdime and ReviewNB (to version Jupyter Notebooks).

This talk is aimed at PyData beginners; specific Machine Learning expertise is not required, although some knowledge of Git and the Data Science ecosystem will help in following the talk.

  1. ALESSIA MARCOLINI. VERSION CONTROL FOR DATA SCIENCE. amarcolini@fbk.eu @viperale
  2. Git Basics: Git is a free and open source distributed version control system.
  3. Git Basics: Create a new repository. Create a new directory, open it and run git init to create a new git repository. Checkout a repository: create a working copy of a local repository by running git clone /path/to/repository
  4. Git Basics: Workflow. The local repository consists of three "trees" maintained by git: the working directory (your actual files), the Index (the staging area), and HEAD (which points to the last commit you made).
  5. Git Basics: Add, Commit & Push. You can propose changes using git add <filename>. To commit these changes, use git commit -m "Commit message". To send your changes to your remote repository, execute git push origin master
  6. Git Basics: Branching. Branches are used to develop features isolated from each other. Create a new branch named "feature_x" and switch to it using git checkout -b feature_x, then switch back to master with git checkout master
  7. Git Basics: Update & Merge. To update your local repository to the newest commit, execute git pull. To merge another branch into your active branch, use git merge <branch>. To view the changes you made relative to the index, use git diff [<path>…]
  8. A Jupyter notebook is nothing more than a JSON file!
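     For example, here is a minimal sketch of inspecting a notebook with the standard json module (it assumes a file named notebook_1.ipynb in the working directory; the top-level keys are the standard nbformat ones, while the exact cell contents depend on the notebook and are purely illustrative):

     >>> import json
     >>> with open('notebook_1.ipynb') as f:          # a notebook is plain JSON on disk
     ...     nb = json.load(f)
     >>> list(nb.keys())
     ['cells', 'metadata', 'nbformat', 'nbformat_minor']
     >>> nb['cells'][0]['cell_type']
     'code'
     >>> sorted(nb['cells'][0].keys())
     ['cell_type', 'execution_count', 'metadata', 'outputs', 'source']

     Because source, outputs and execution_count all live in the same JSON structure, even a trivial change to a cell can produce a large, noisy line-based diff, which is the problem the next two slides illustrate.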
  9. Example diffs: reordering import statements; changed environment name.
  10. Example diffs: deleted cell; a cell moved earlier; the same code shown as deleted and re-added.
  11. nbdime to the rescue. nbdime provides "content-aware" diffing and merging of Jupyter notebooks: it understands the structure of notebook documents.
  12. Install with pip install nbdime. Diff notebooks in your terminal with nbdiff: nbdiff notebook_1.ipynb notebook_2.ipynb. Get a rich web-based rendering of the diff with nbdiff-web: nbdiff-web notebook_1.ipynb notebook_2.ipynb
  13. Git integration: nbdime config-git --enable --global
  14. What about the data? data.csv (original dataset), preprocessed_data.csv (dataset rescaled to [0, 1]), preprocessed_data_clean.csv (incorrect data removed), preprocessed_data_clean_1.csv (outliers removed)
  15. Hangar: version control for tensor data. https://github.com/tensorwerk/hangar-py (Diagram: a Dataset composed of Arraysets, e.g. image, filename and label, holding aligned samples such as [[0,1,0,1], … ,[0,1,1,1]], "image1.png", 0.)
  16. Arbitrary Backend Selection: each Arrayset is stored in a backend optimised for data of that particular shape / type / layout. (Same Dataset / Arrayset diagram as above.)
  17. Working with data
      >>> from hangar import Repository
      >>> import numpy as np
      >>> repo = Repository(path='path/to/repository')
      >>> repo.init(user_name='Alessia Marcolini', user_email='amarcolini@fbk.eu', remove_old=True)
      >>> co = repo.checkout(write=True)
      >>> train_images = np.load('mnist_training_images.npy')
      >>> co.arraysets.init_arrayset(name='mnist_training_images', prototype=train_images[0])
      >>> train_aset = co.arraysets['mnist_training_images']
      >>> train_aset['0'] = train_images[0]
      >>> train_aset.add(data=train_images[1], name='1')
      >>> train_aset[51] = train_images[51]
      >>> co.commit('Add mnist dataset')
      >>> co.close()
  18. Branching & Merging
      >>> dummy = np.arange(10, dtype=np.uint16)
      >>> aset = co.arraysets.init_arrayset(name='dummy_arrayset', prototype=dummy)
      >>> aset['0'] = dummy
      >>> initialCommitHash = co.commit('single sample added to a dummy arrayset')
      >>> co.close()
      >>> branch_1 = repo.create_branch(name='testbranch')
      >>> co = repo.checkout(write=True, branch='testbranch')
      >>> co.arraysets['dummy_arrayset']['0']
      array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)
      >>> arr = np.arange(10, dtype=np.uint16)
      >>> arr += 1
      >>> co['dummy_arrayset', '1'] = arr
      >>> co.commit('second sample')
      >>> co = repo.checkout(write=True, branch='master')
      >>> co.merge(message='message for commit (not used for FF merge)', dev_branch='testbranch')
  19. Working with remotes
      $ hangar server
      >>> repo.remote.add('origin', 'localhost:50051')
      >>> repo.remote.push('origin', 'master')
      >>> cloneRepo = Repository('path/to/repository')
      >>> cloneRepo.clone('Alessia Marcolini', 'amarcolini@fbk.eu', 'localhost:50051', remove_old=True)
      >>> cloneRepo.remote.fetch_data('origin', 'master', 1024)
  20. Machine Learning dataloaders
      TensorFlow:
      >>> from hangar import Repository
      >>> from hangar import make_tf_dataset
      >>> import tensorflow as tf
      >>> tf.compat.v1.enable_eager_execution()
      >>> repo = Repository('.')
      >>> co = repo.checkout()
      >>> data = co.arraysets['mnist_data']
      >>> target = co.arraysets['mnist_target']
      >>> dataset = make_tf_dataset([data, target])
      >>> dataset = dataset.batch(512)
      >>> for b_data, b_target in dataset:
      ...     print(b_data.shape, b_target.shape)
      PyTorch:
      >>> from hangar import Repository
      >>> from hangar import make_torch_dataset
      >>> from torch.utils.data import DataLoader
      >>> repo = Repository('.')
      >>> co = repo.checkout()
      >>> aset = co.arraysets['dummy_aset']
      >>> dataset = make_torch_dataset(aset, index_range=slice(1, 100))
      >>> loader = DataLoader(dataset, batch_size=16)
      >>> for batch in loader:
      ...     train_model(batch)
  21. The whole story: data.csv, preprocessed_data.csv, model, preprocessed_data_clean.csv, model_1, preprocessed_data_clean_1.csv, model_final, model_final_v2 … (adapted from a graphic by @faviovaz)
  22. How to keep track of changes? How to link code, data, model, metrics? Are you able to ensure replicability?
  23. DVC (https://github.com/iterative/dvc) is designed to be agnostic of frameworks and languages, and runs on top of Git repositories.
  24. Link to video: https://www.youtube.com/watch?v=4h6I9_xeYA4
  25. To initialise the DVC repo, run
      $ dvc init
      Then choose a data remote:
      • Local
      • AWS S3
      • Google Cloud Storage
      • Azure Blob Storage
      • SSH
      • HDFS
      • HTTP
      Then run
      $ dvc remote add -d myremote s3://YOUR_BUCKET_NAME/data
  26. Add data to DVC
      $ dvc add data/data.csv          # creates a data/data.csv.dvc file and adds data/data.csv to .gitignore
      $ git add data/data.csv.dvc .gitignore
      $ git commit -m "add data"
      $ dvc push                       # upload data to the remote
  27. Retrieve data
      $ rm -f data/data.csv
      $ dvc pull
      or
      $ dvc pull data/data.csv.dvc
  28. Connect code and data
      $ dvc run -f preprocess_data.dvc -d src/prep.py -d data/data.csv -o data/preprocessed_data.csv \
          python src/prep.py data/data.csv preprocessed_data.csv
      $ git add preprocess_data.dvc .gitignore
      $ git commit -m "add preprocessing stage"
      $ dvc push
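     The slides don't show src/prep.py itself; a minimal, hypothetical sketch of such a preprocessing script (the min-max rescaling to [0, 1] matches slide 14, but the column handling and the use of pandas are assumptions, not from the talk) might look like:

     # src/prep.py (illustrative sketch, not from the talk)
     import sys
     import pandas as pd

     def main(input_csv, output_csv):
         df = pd.read_csv(input_csv)
         # Min-max rescale every numeric column to the [0, 1] range.
         numeric = df.select_dtypes('number').columns
         df[numeric] = (df[numeric] - df[numeric].min()) / (df[numeric].max() - df[numeric].min())
         df.to_csv(output_csv, index=False)

     if __name__ == '__main__':
         main(sys.argv[1], sys.argv[2])

     DVC records checksums of the dependencies (src/prep.py, data/data.csv) and of the output in preprocess_data.dvc, so a later dvc repro knows whether this stage needs to be re-run.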
  29. Pipelines
      $ dvc run -f train.dvc -d src/train.py -d data/preprocessed_data.csv -o results/model.pkl \
          python src/train.py data/preprocessed_data.csv model.pkl
      $ git add train.dvc .gitignore
      $ git commit -m "add training stage"
      $ dvc push
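     Likewise, src/train.py is not shown in the slides; a hypothetical sketch (the model choice, the "last column is the target" convention and the pickle format are all assumptions) could be:

     # src/train.py (illustrative sketch, not from the talk)
     import sys
     import pickle
     import pandas as pd
     from sklearn.linear_model import LogisticRegression

     def main(input_csv, model_path):
         df = pd.read_csv(input_csv)
         # Assume the last column is the target and everything else is a feature.
         X, y = df.iloc[:, :-1], df.iloc[:, -1]
         model = LogisticRegression(max_iter=1000).fit(X, y)
         with open(model_path, 'wb') as f:
             pickle.dump(model, f)

     if __name__ == '__main__':
         main(sys.argv[1], sys.argv[2])

     Because the training stage depends on both the script and the preprocessed data, changing either one and running dvc repro will re-train the model.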
  30. Reproduce
      $ dvc repro train.dvc
  31. Tag and go
      $ git tag -a "baseline-experiment" -m "Baseline experiment"
      $ git checkout baseline-experiment
      $ dvc checkout
  32. Thank you! amarcolini@fbk.eu @viperale
