> Reproducibility with unstructured data in 3 steps
Dmitry Petrov
DVC.org
|00|
Co-Founder & CEO > Iterative.AI > San Francisco, USA
ex-Data Scientist > Microsoft (BingAds) > Seattle, USA
ex-Head of Lab > St. Petersburg Electrotechnical University > Russia
|HELLO|
Dmitry Petrov
PhD in Computer Science
Twitter: @FullStackML
Creator of
DVC.org project
> Data Analyst → Structured
> Data Scientist → Semi-structured
> ML engineer → Unstructured
- NLP - text files
- Computer Vision - images
- Multiple files and/or formats
|Unstructured data|
Reproducibility is about storing
and sharing the mapping:
Code + Data → Model
|Unstructured data reproducibility|
Use central storage for all
data artifacts:
- Datafiles
- Models
|1st - CENTRAL STORAGE|
s3://mybucket/semsegm-proj/
|2nd - DECOUPLE DATA FROM CODE|
Use dataset metafiles. Do not read
data from code directly.
Individual file → Snapshot
# files-meta.yaml
file1: s3://mybucket/semsegm-proj/raw/file1-ver2020-10-01
file2: s3://mybucket/semsegm-proj/raw/file1-ver2020-07-26
...
model.pkl: s3://mybucket/semsegm-proj/model-5-1.pkl
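A minimal sketch of how code might look up data locations through the metafile instead of hardcoding them. The function names `parse_metafile` and `resolve` are hypothetical, and the parser only handles the flat `name: location` lines shown on the slide; a real project would more likely use a YAML library.

```python
def parse_metafile(text: str) -> dict[str, str]:
    """Parse flat `name: location` lines; skip blanks, comments, and '...' elisions."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "...":
            continue
        # Split on the first colon only, so the "s3://" in the value stays intact.
        name, sep, location = line.partition(":")
        if not sep:
            continue
        mapping[name.strip()] = location.strip()
    return mapping


def resolve(mapping: dict[str, str], name: str) -> str:
    """Return the storage location for a logical file name (hypothetical helper)."""
    try:
        return mapping[name]
    except KeyError:
        raise KeyError(f"{name!r} is not listed in the metafile") from None


# Sample content copied from the slide's files-meta.yaml.
META = """\
# files-meta.yaml
file1: s3://mybucket/semsegm-proj/raw/file1-ver2020-10-01
model.pkl: s3://mybucket/semsegm-proj/model-5-1.pkl
"""

if __name__ == "__main__":
    files = parse_metafile(META)
    # Training code asks for "file1"; the snapshot it points to lives in the metafile.
    print(resolve(files, "file1"))
```

With this layout, switching to another data version means editing the metafile, not the training code.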
Version metrics files in Git
$ cat metrics.json
{
  "AUC": 0.8367073,
  "TP": 8614,
  "Process": {
    "Threshold": 0.92
…
$ git diff release-sept
…
- "AUC":0.7906391,
+ "AUC":0.8367073,
|3rd - BE METRICS DRIVEN|
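A small sketch of the comparison that versioned metrics files make possible: load two versions of the metrics and report the change per metric. The helpers `flatten` and `metrics_diff` are hypothetical names; the AUC values are the two shown in the `git diff` output above.

```python
import json


def flatten(metrics: dict, prefix: str = "") -> dict:
    """Flatten nested metrics into dotted names, keeping numeric leaves only."""
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat


def metrics_diff(old: dict, new: dict) -> dict:
    """Return {metric: (new_value, change)} for metrics present in both versions."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        name: (new_flat[name], round(new_flat[name] - old_flat[name], 7))
        for name in new_flat
        if name in old_flat
    }


if __name__ == "__main__":
    # In practice these would be read from metrics.json at two Git revisions.
    old = json.loads('{"AUC": 0.7906391}')
    new = json.loads('{"AUC": 0.8367073}')
    for name, (value, change) in metrics_diff(old, new).items():
        print(f"{name}\t{value}\t{change:+}")
```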
II. DVC - Data Version Control
|1st - CENTRAL STORAGE|
DVC introduces the data remote
$ dvc remote add -d myremote s3://mybucket/semsegm-proj/
$ dvc push
|2nd - DECOUPLE DATA FROM CODE|
$ dvc add data.tsv
$ cat data.tsv.dvc
outs:
- md5: fadc70dff966edd21b3dd2b0c2755189
path: data.tsv
size: 593310482
$ dvc push data.tsv
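To illustrate the idea behind the `.dvc` metafile, the sketch below computes the same kind of fields (`md5`, `path`, `size`) for a file using only the Python standard library. It is not DVC's implementation, and DVC's own hashing may differ in details; `file_md5` and `describe` are hypothetical helpers.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path: str) -> dict:
    """Build a small record resembling the fields of the metafile shown above."""
    return {
        "md5": file_md5(path),
        "path": os.path.basename(path),
        "size": os.path.getsize(path),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = os.path.join(workdir, "data.tsv")
        with open(sample, "wb") as handle:
            handle.write(b"id\tlabel\n1\tcat\n")
        print(describe(sample))
```

Because the record depends only on file content, the small metafile can be committed to Git while the data itself lives in central storage.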
$ dvc metrics diff release-sept
Path          Metric  Value      Change
metrics.json  AUC     0.8367073  0.0460682
metrics.json  TP      8291       374
$ dvc plots diff release-sept
|3rd - BE METRICS DRIVEN|
> Questions
Email dmitry@iterative.ai
Web http://dvc.org
> Actions
Follow @FullStackML
|THANK YOU|
