Using Dataset Versioning in Data Science
Dr. Venkata Pingali
Founder, Scribble Data
pingali@scribbledata.io
https://github.com/pingali
Agenda
1. Why dataset versioning
2. Revised process using data versioning
3. Tool summary and demo
4. Roadmap
5. Feedback
a. Overall direction
b. dgit features
c. Suggestions
d. Actionables/next steps if any
About Me
Dr. Venkata Pingali
Founder, Scribble Data
Former VP Analytics, FourthLion
Founder, eLuminos Energy Analytics
IIT (B), PhD (USC)
http://linkedin.com/in/pingali
Scribble Data
Reduce Cost and Complexity of Data Science through Automation
Great Day!
Only the Beginning
To Manager: Ready to process CC Marriott's numbers on scanned invoices!
(or some high-risk activity based on this)
Then some questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumptions, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Impact assessment)
a. Model, dataset, and question revisions
b. Performance in deployment
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (DoE, Synthetic data)
a. What-if scenarios, field experiments
Conceptual Flow
[Diagram: Biz, Analytics Team, and Data Engg, connected by the flow labels Qtns, Context; Data Req; Datasets; Model Results; Story Telling]
Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Questions not thought through
Continuous revisions
Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/ad hoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
Underlying Issue: Messy Analytics Process
[Diagram: Biz, Analytics Team, and Data Engg, connected by the flow labels Qtns, Context; Data Req; Datasets; Modeling; Story telling. Annotations: floating data, ad hoc, iterative, laborious, fast paced]
Desired State
1. Trusted
a. Every model should be auditable to the last record and step
b. Every model should be reproducible with zero human intervention
c. All models should be evaluated independently for quality
d. No data should change without leaving an audit trail
e. All applications (presentation, configuration, etc.) should be hyperlinked
2. Scalable
a. All models should be searchable and usable easily
b. All data and model components should be reusable
c. Process should enable observation of data science process
3. Robust
a. Process should cope with younger, inexperienced staff
b. Process should cope with staff churn
Similar to https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091
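One way to picture the "no data should change without leaving an audit trail" goal is an append-only log in which every record carries the hash of the record before it, so a silent edit to history becomes detectable. The sketch below is only an illustration under that assumption. The function names and record fields (actor, action, dataset_digest) are hypothetical and are not taken from the talk or from any tool it mentions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body):
    """Hash a record body in a stable way (sorted keys, UTF-8)."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log, actor, action, dataset_digest):
    """Append a record that is chained to the hash of the previous record."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    body = {
        "actor": actor,
        "action": action,
        "dataset_digest": dataset_digest,
        "prev_hash": prev_hash,
    }
    log.append({**body, "entry_hash": _digest(body)})
    return log


def verify_log(log):
    """Return True only if every record still matches its hash and its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "entry_hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Calling append_entry once per dataset change and running verify_log on the server side would flag any record that was altered after the fact.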
Core Process with Dataset Versioning
[Diagram: Biz, Analytics Team, and Data Engg, with Server Side CI and dataset versions v1 to v6. Step labels: Context, Questions; Dataset Rules; Evaluation Rules; Dependencies; Materialize; Materialized dataset; Model Pipeline; Pipeline Execution; Quality Check; Interpretation; Slide Content; URN]
Dataset as mutable object with memory
No emails/Google Docs
Continuous validation by third party (server)
Separate model development and evaluation
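The "Dataset Rules" checked by the server could be as simple as named predicates applied to every row of a materialized dataset. The following is a minimal sketch under that assumption. The validate function, the RULES list, and the invoice column names are hypothetical and do not describe how the talk's tooling expresses rules.

```python
def validate(rows, rules):
    """Return (rule_name, failing_row_indexes) for every rule that fails."""
    failures = []
    for name, check in rules:
        failing = [index for index, row in enumerate(rows) if not check(row)]
        if failing:
            failures.append((name, failing))
    return failures


# Hypothetical rules for an invoice dataset; the column names are made up.
RULES = [
    ("invoice_id is present", lambda row: bool(row.get("invoice_id"))),
    ("amount is non-negative", lambda row: float(row["amount"]) >= 0),
]
```

An empty result means the dataset version passes. A non-empty result names each broken rule and the offending rows, which is the kind of feedback a third-party (server) check can attach to a dataset version.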
Extended Process
[Diagram. Tools: R, Python, Jupyter, Matlab, SQL. Repos: Input Repo, Output Repo (shown again as Input Git Repo, Output Git Repo). Other labels: Data CI; Laptop/EC2; S3/Github/Gitlab]
Validation & Quality Checking
Discovery & Deployment
Orchestration
Indexing & Searching
Graphing & Data Understanding
EDA
Precompute
Impact management & Change propagation
Dependency Tracking
Server Side Asynchronous Execution
Automatic Reproduction
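Dependency tracking and change propagation can be pictured as a graph walk: given which artifact is built from which, find everything downstream of a changed dataset. This is a sketch of that idea only. The downstream function and the artifact names in the usage note are hypothetical.

```python
from collections import deque


def downstream(dependencies, changed):
    """List every artifact that directly or indirectly depends on `changed`.

    `dependencies` maps each artifact to the artifacts it is built from.
    """
    # Invert the mapping: parent -> artifacts built from it.
    dependents = {}
    for artifact, parents in dependencies.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(artifact)

    affected = []
    seen = {changed}
    queue = deque([changed])
    while queue:  # breadth-first walk over dependents
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

For example, with a cleaned dataset built from a raw one, and a model and a report built from the cleaned one, a change to the raw dataset marks all three for re-execution.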
dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
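"Automatic scanning of working directory for changes" can be approximated by comparing content hashes between two snapshots of a directory. The sketch below illustrates that general technique and is not dgit's implementation or API. The snapshot and diff_snapshots names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under `root` (by relative path) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_snapshots(old, new):
    """Classify paths as added, modified, or removed between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
        "removed": sorted(set(old) - set(new)),
    }
```

Hashing whole files keeps the sketch short. A real tool would likely stream large files in chunks and skip version-control directories.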
Dgit Structure
[Diagram: dgit CLI and dgitcore API. Component labels: Repo Mgr, Backend, Metadata, Validator, Transformer, Instrumentation. Implementation labels: Git, S3, Basic, Regression, Content, Platform, Anonymizer]
Roadmap to Reduce Cost and Complexity
● Standardize processes around versioned data
○ April 2016 - git for data (open source)
● Simplify data access
○ May 2016 - EasyQuery (SaaS product)
● Increase security of data science services
○ July 2016 - Ethereum integration (SaaS product)
Upvote if you like this talk...
https://fifthelephant.talkfunnel.com/2016
Thank you!
Missing Process Infrastructure for Data
Area                   Code                    Data
Versioning, Bugs       Git, Github             DVCS? Instabase?
Discovery              Github, Stackshare      CKAN, Dat
Security               OWASP, ISO 27K          GDPR, HIPAA
Packaging              Pypi                    Dataprotocols?
Collaboration          Slack, Stackoverflow    ?
Documentation          RTD                     Dataprotocols?
Testing & Validation   Travis/Jenkins          ?
Deployment             Migrations              ?
...
