SlideShare a Scribd company logo
1 of 30
Download to read offline
Putting data science to work –
a case study
Alex Breeze & Martin Tynan
Octo Telematics
5 November 2018
5 November 2018 2
Putting data science to work
5 November 2018 3
Setting the scene
5 November 2018 4
Data science workflow picture from: https://ismayc.github.io/poRtland-bootcamp17/
5 November 2018 5
Data science team?
Differences
Collaboration Review
Documentation Reproducibility
Results
Reliability of solution
Sustainably adding value
Regulatory approval
5 November 2018 6
Opportunities Risks
Cutting edge
Rapid improvements
Is it trustworthy?
Rapidly outdated
Free / open licence
Leverage existing solution
Use at own risk
Learning curve
New tools to embed good
practice
New concepts
What’s new? Open source!
balanceCapture Mitigate
5 November 2018 7
This presentation
1. Version control using Git
2. Managing dependencies in
R and Python
Caveat:
▪ This is not (and will never be) the final version of these slides!
▪ Links to resources at end.
Our experiences
5 November 2018 8
Version control using git
5 November 2018 9
Version Chaos
Why?
▪ Experimenting
▪ Feedback
▪ Reproducibility
How do you choose?
▪ File name
▪ Date modified
▪ Look at each file
▪ Documentation
5 November 2018 10
Version Chaos
Why?
▪ Experimenting
▪ Feedback
▪ Reproducibility
How do you choose?
▪ File name
▪ Date modified
▪ Look at each file
▪ Documentation
5 November 2018 11
Git
Project
beginning
Tree of commits
Snapshot of project &
comment
5 November 2018 12
Git
Project
beginning
Tree of commits
Snapshot of project &
comment
5 November 2018 13
Git in a Team
Need to share the complete project code folders
You are in control of when the syncing occurs
A
B
C
Origin
Push
Pull
5 November 2018 14
‘Feature Branch’ Workflow
Feature
branch
Point of
review
Approved
master
branch
5 November 2018 15
master
Elastic nets
xgboost
Data prep
Approved
Point of
review
Example
5 November 2018 16
master
Elastic nets
xgboost
Data prep
Example
Merging is where you combine two commit trees into one unified
history.
This can lead to conflicts!
5 November 2018 17
Considerations
▪ Tracks projects
▪ Built-in
▪ GUI & Command line
▪ Merge conflicts
▪ Script languages
▪ Extremely flexible
▪ Location of share
▪ Packrat and Conda envs
mergetools
Easy setup
RStudio
£0
PyCharm
Learning
Differencing
Branching Best practice
Security Online repos
Compatible
5 November 2018 18
Managing
dependencies in R and
Python
5 November 2018 19
Packages
Code
Run
Results
Author
Packages
Share
Code
Run
Colleague
5 November 2018 20
Packages
Code
Run
Results
Author
Packages
Share
Code
Run
Packages
Track
Specification Specification
Restore
Colleague
5 November 2018 21
Project A Code
Run
Team member
Project B Code
Run
Project C Code
Run
Results
Internet
Personal Packages
Download
5 November 2018 22
Project A Code
Run
Team member
Packages
Project B Code
Run
Packages
Project C Code
Run
Results
Packages
Internet
Personal Packages
5 November 2018 23
R + packrat approach
Code
Run
Results
Packages
Track
Specification
Restore
Scan
Specification + source files Isolate
5 November 2018 24
Our findings: packrat
▪ Well documented
▪ Integrates with RStudio
▪ Long installation time ⇒
Use the R GUI (not RStudio)
▪ Implement from the start of project
▪ Doesn’t recognise all dependencies
▪ Only tracks packages, not R itself
Investigate first
Add dummy script
5 November 2018 25
Isolate
Python + conda approach
Code
Run
Results
Packages
Track
Specification
Restore
Packages + others
5 November 2018 26
Not managing dependencies is not an option
5 November 2018 27
Final thoughts
5 November 2018 28
Data Science
Learn
Share
Experiment
5 November 2018 29
Further resources
Git
Main website https://git-scm.com/
Pro-git book (free) https://git-scm.com/book/en/v2
Coursera course (free
without certificate)
https://www.coursera.org/learn/version-control-with-git
Git in Practice book (free) https://github.com/GitInPractice/GitInPractice#readme
DataCamp course: Intro to
Git for Data Science (free)
https://www.datacamp.com/courses/introduction-to-git-
for-data-science
packrat
Main website https://rstudio.github.io/packrat/
Using packrat with RStudio https://www.rstudio.com/resources/webinars/managing-
package-dependencies-in-r-with-packrat/
conda
Main website https://conda.io/docs/index.html
Useful blog post “Conda:
Myths and Misconceptions”
https://jakevdp.github.io/blog/2016/08/25/conda-myths-
and-misconceptions/
Links checked at time of making this presentation in October 2018
5 November 2018 30
The views expressed in this [publication/presentation] are those of invited contributors and not necessarily those of the IFoA. The
IFoA do not endorse any of the views stated, nor any claims or representations made in this [publication/presentation] and accept
no responsibility or liability to any person for loss or damage suffered as a consequence of their placing reliance upon any view,
claim or representation made in this [publication/presentation].
The information and expressions of opinion contained in this publication are not intended to be a comprehensive study, nor to
provide actuarial advice or advice of any nature and should not be treated as a substitute for specific advice concerning individual
situations. On no account may any part of this [publication/presentation] be reproduced without the written permission of the IFoA
[or authors, in the case of non-IFoA research].
Questions Comments

More Related Content

What's hot

Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014
Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014
Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014
msohn
 
WebRTC State of the Market, Dec 2014
WebRTC State of the Market, Dec 2014WebRTC State of the Market, Dec 2014
WebRTC State of the Market, Dec 2014
Tsahi Levent-levi
 

What's hot (19)

Cloud-native Gerrit Code Review
Cloud-native Gerrit Code ReviewCloud-native Gerrit Code Review
Cloud-native Gerrit Code Review
 
This week in Neo4j - 21st October 2017
This week in Neo4j - 21st October 2017This week in Neo4j - 21st October 2017
This week in Neo4j - 21st October 2017
 
本を書こう(雑に)
本を書こう(雑に)本を書こう(雑に)
本を書こう(雑に)
 
Gerrit Code Review migrations step-by-step
Gerrit Code Review migrations step-by-stepGerrit Code Review migrations step-by-step
Gerrit Code Review migrations step-by-step
 
Jenkins plugin for Gerrit Code Review pipelines
Jenkins plugin for Gerrit Code Review pipelinesJenkins plugin for Gerrit Code Review pipelines
Jenkins plugin for Gerrit Code Review pipelines
 
Gerrit Code Review multi-site
Gerrit Code Review multi-siteGerrit Code Review multi-site
Gerrit Code Review multi-site
 
Real User Monitoring con Navigation Timing API, Tag Manager e Google Analytics
Real User Monitoring con Navigation Timing API, Tag Manager e Google AnalyticsReal User Monitoring con Navigation Timing API, Tag Manager e Google Analytics
Real User Monitoring con Navigation Timing API, Tag Manager e Google Analytics
 
Reactive Programming Meetup - NodeJs on K8s
Reactive Programming Meetup - NodeJs on K8sReactive Programming Meetup - NodeJs on K8s
Reactive Programming Meetup - NodeJs on K8s
 
IBM Agile Engineering Summit 18: How to Integrate IBM Rational® Team Concert ...
IBM Agile Engineering Summit 18: How to Integrate IBM Rational® Team Concert ...IBM Agile Engineering Summit 18: How to Integrate IBM Rational® Team Concert ...
IBM Agile Engineering Summit 18: How to Integrate IBM Rational® Team Concert ...
 
Python and big data : a good match?
Python and big data : a good match?Python and big data : a good match?
Python and big data : a good match?
 
Gerrit User Summit 2019 Keynote
Gerrit User Summit 2019 KeynoteGerrit User Summit 2019 Keynote
Gerrit User Summit 2019 Keynote
 
Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014
Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014
Code Matters - Eclipse Hackers Git Guide - EclipseCon France 2014
 
Building and Deploying containerized Python Apps in the Cloud
Building and Deploying containerized Python Apps in the CloudBuilding and Deploying containerized Python Apps in the Cloud
Building and Deploying containerized Python Apps in the Cloud
 
This Week in Neo4j - 6th October 2018
This Week in Neo4j - 6th October 2018This Week in Neo4j - 6th October 2018
This Week in Neo4j - 6th October 2018
 
Practical use of IFC
Practical use of IFCPractical use of IFC
Practical use of IFC
 
EOY Project Pitch - Elaine Yeung
EOY Project Pitch - Elaine YeungEOY Project Pitch - Elaine Yeung
EOY Project Pitch - Elaine Yeung
 
[2020 git lab commit] continuous infrastructure
[2020 git lab commit] continuous infrastructure[2020 git lab commit] continuous infrastructure
[2020 git lab commit] continuous infrastructure
 
WebRTC State of the Market, Dec 2014
WebRTC State of the Market, Dec 2014WebRTC State of the Market, Dec 2014
WebRTC State of the Market, Dec 2014
 
GraphQL vs Traditional Rest API [GeeCon Prague 2018]
GraphQL vs Traditional Rest API [GeeCon Prague 2018]GraphQL vs Traditional Rest API [GeeCon Prague 2018]
GraphQL vs Traditional Rest API [GeeCon Prague 2018]
 

Similar to Putting data science to work

O'Leary - Using GitHub for Enterprise and Open Source Documentation
O'Leary - Using GitHub for Enterprise and Open Source DocumentationO'Leary - Using GitHub for Enterprise and Open Source Documentation
O'Leary - Using GitHub for Enterprise and Open Source Documentation
LavaCon
 
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
NETWAYS
 
Full Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaFull Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and Grafana
Jazz Yao-Tsung Wang
 

Similar to Putting data science to work (20)

Workflows using Git GitHub | Edureka
Workflows using Git GitHub | EdurekaWorkflows using Git GitHub | Edureka
Workflows using Git GitHub | Edureka
 
Rejekts 24 EU No GitOps Pain, No Platform Gain
Rejekts 24 EU No GitOps Pain, No Platform GainRejekts 24 EU No GitOps Pain, No Platform Gain
Rejekts 24 EU No GitOps Pain, No Platform Gain
 
O'Leary - Using GitHub for Enterprise and Open Source Documentation
O'Leary - Using GitHub for Enterprise and Open Source DocumentationO'Leary - Using GitHub for Enterprise and Open Source Documentation
O'Leary - Using GitHub for Enterprise and Open Source Documentation
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
Open source projects with python
Open source projects with pythonOpen source projects with python
Open source projects with python
 
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with Backstage
 
GitBucket: Open source self-hosting Git server built by Scala
GitBucket: Open source self-hosting Git server built by ScalaGitBucket: Open source self-hosting Git server built by Scala
GitBucket: Open source self-hosting Git server built by Scala
 
Go at Skroutz
Go at SkroutzGo at Skroutz
Go at Skroutz
 
20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge20181019 code.talks graph_analytics_k_patenge
20181019 code.talks graph_analytics_k_patenge
 
Documentation the Community Way - TxLF 2018
Documentation the Community Way - TxLF 2018Documentation the Community Way - TxLF 2018
Documentation the Community Way - TxLF 2018
 
ACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformaticsACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformatics
 
Making advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMaking advanced analytics accessible to more companies
Making advanced analytics accessible to more companies
 
[scala.by] Launching new application fast
[scala.by] Launching new application fast[scala.by] Launching new application fast
[scala.by] Launching new application fast
 
Primers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code ReviewPrimers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code Review
 
Full Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaFull Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and Grafana
 
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
 
There is something about serverless
There is something about serverlessThere is something about serverless
There is something about serverless
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
 

Recently uploaded

Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted KitAbortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh +966572737505 get cytotec
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
jk0tkvfv
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
LuisMiguelPaz5
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 

Recently uploaded (20)

Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted KitAbortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 

Putting data science to work

  • 1. Putting data science to work – a case study Alex Breeze & Martin Tynan Octo Telematics 5 November 2018
  • 2. 5 November 2018 2 Putting data science to work
  • 3. 5 November 2018 3 Setting the scene
  • 4. 5 November 2018 4 Data science workflow picture from: https://ismayc.github.io/poRtland-bootcamp17/
  • 5. 5 November 2018 5 Data science team? Differences Collaboration Review Documentation Reproducibility Results Reliability of solution Sustainably adding value Regulatory approval
  • 6. 5 November 2018 6 Opportunities Risks Cutting edge Rapid improvements Is it trustworthy? Rapidly outdated Free / open licence Leverage existing solution Use at own risk Learning curve New tools to embed good practice New concepts What’s new? Open source! balanceCapture Mitigate
  • 7. 5 November 2018 7 This presentation 1. Version control using Git 2. Managing dependencies in R and Python Caveat: ▪ This is not (and will never be) the final version of these slides! ▪ Links to resources at end. Our experiences
  • 8. 5 November 2018 8 Version control using git
  • 9. 5 November 2018 9 Version Chaos Why? ▪ Experimenting ▪ Feedback ▪ Reproducibility How do you choose? ▪ File name ▪ Date modified ▪ Look at each file ▪ Documentation
  • 10. 5 November 2018 10 Version Chaos Why? ▪ Experimenting ▪ Feedback ▪ Reproducibility How do you choose? ▪ File name ▪ Date modified ▪ Look at each file ▪ Documentation
  • 11. 5 November 2018 11 Git Project beginning Tree of commits Snapshot of project & comment
  • 12. 5 November 2018 12 Git Project beginning Tree of commits Snapshot of project & comment
  • 13. 5 November 2018 13 Git in a Team Need to share the complete project code folders You are in control of when the syncing occurs A B C Origin Push Pull
  • 14. 5 November 2018 14 ‘Feature Branch’ Workflow Feature branch Point of review Approved master branch
  • 15. 5 November 2018 15 master Elastic nets xgboost Data prep Approved Point of review Example
  • 16. 5 November 2018 16 master Elastic nets xgboost Data prep Example Merging is where you combine two commit trees into one unified history. This can lead to conflicts!
  • 17. 5 November 2018 17 Considerations ▪ Tracks projects ▪ Built-in ▪ GUI & Command line ▪ Merge conflicts ▪ Script languages ▪ Extremely flexible ▪ Location of share ▪ Packrat and Conda envs mergetools Easy setup RStudio £0 PyCharm Learning Differencing Branching Best practice Security Online repos Compatible
  • 18. 5 November 2018 18 Managing dependencies in R and Python
  • 19. 5 November 2018 19 Packages Code Run Results Author Packages Share Code Run Colleague
  • 20. 5 November 2018 20 Packages Code Run Results Author Packages Share Code Run Packages Track Specification Specification Restore Colleague
  • 21. 5 November 2018 21 Project A Code Run Team member Project B Code Run Project C Code Run Results Internet Personal Packages Download
  • 22. 5 November 2018 22 Project A Code Run Team member Packages Project B Code Run Packages Project C Code Run Results Packages Internet Personal Packages
  • 23. 5 November 2018 23 R + packrat approach Code Run Results Packages Track Specification Restore Scan Specification + source files Isolate
  • 24. 5 November 2018 24 Our findings: packrat ▪ Well documented ▪ Integrates with RStudio ▪ Long installation time ⇒ Use the R GUI (not RStudio) ▪ Implement from the start of project ▪ Doesn’t recognise all dependencies ▪ Only tracks packages, not R itself Investigate first Add dummy script
  • 25. 5 November 2018 25 Isolate Python + conda approach Code Run Results Packages Track Specification Restore Packages + others
  • 26. 5 November 2018 26 Not managing dependencies is not an option
  • 27. 5 November 2018 27 Final thoughts
  • 28. 5 November 2018 28 Data Science Learn Share Experiment
  • 29. 5 November 2018 29 Further resources Git Main website https://git-scm.com/ Pro-git book (free) https://git-scm.com/book/en/v2 Coursera course (free without certificate) https://www.coursera.org/learn/version-control-with-git Git in Practice book (free) https://github.com/GitInPractice/GitInPractice#readme DataCamp course: Intro to Git for Data Science (free) https://www.datacamp.com/courses/introduction-to-git- for-data-science packrat Main website https://rstudio.github.io/packrat/ Using packrat with RStudio https://www.rstudio.com/resources/webinars/managing- package-dependencies-in-r-with-packrat/ conda Main website https://conda.io/docs/index.html Useful blog post “Conda: Myths and Misconceptions” https://jakevdp.github.io/blog/2016/08/25/conda-myths- and-misconceptions/ Links checked at time of making this presentation in October 2018
  • 30. 5 November 2018 30 The views expressed in this [publication/presentation] are those of invited contributors and not necessarily those of the IFoA. The IFoA do not endorse any of the views stated, nor any claims or representations made in this [publication/presentation] and accept no responsibility or liability to any person for loss or damage suffered as a consequence of their placing reliance upon any view, claim or representation made in this [publication/presentation]. The information and expressions of opinion contained in this publication are not intended to be a comprehensive study, nor to provide actuarial advice or advice of any nature and should not be treated as a substitute for specific advice concerning individual situations. On no account may any part of this [publication/presentation] be reproduced without the written permission of the IFoA [or authors, in the case of non-IFoA research]. Questions Comments