ON THE CO-EVOLUTION OF ML PIPELINES AND SOURCE CODE - EMPIRICAL STUDY OF DVC PROJECTS
- SANER 2021 -
March 11th, 2021
Amine Barrak
Ellis E. Eghan
Polytechnique Montreal
Bram Adams
Queen’s University
WHAT ARE ML PIPELINES IN ML PROJECTS?
ML PROJECTS REQUIRE ALIGNING COMPATIBLE VERSIONS OF DATA, MODELS AND CODE ARTIFACTS
[Figure: Data Timeline, Models Timeline and Source code Timeline, each showing successive versions (V1, V2, ...)]
EXAMPLES OF ML PIPELINE TRACEABILITY TOOLS
[Figure: MODELS, DATA, CODE and a "?"]
OVERVIEW OF THE DVC TOOL
[Figure: MODELS, DATA and CODE, with MD5 checksum labels]
RESEARCH QUESTIONS
• RQ1: How common is the usage of DVC in GitHub projects?
• RQ2: How much coupling exists between software artifacts and DVC artifacts?
• RQ3: How does the complexity of the DVC ML pipeline evolve over time?
RQ1: HOW COMMON IS THE USAGE OF DVC IN GITHUB PROJECTS?
The DVC usage
Advanced search for DVC projects on GitHub
391 projects
• How are current ML versioning tools (like DVC) used in open-source projects?
• How much do projects rely on these tools?
DESPITE THE YOUNG PRACTICE, MORE THAN HALF OF DVC FILES ARE FREQUENTLY CHANGED
[Plot: #projects adopting DVC a given period after DVC’s launch]
[Plot: #projects using DVC for a given period]
[Plot: #projects using the different DVC remote storage options]
RQ2: HOW MUCH COUPLING EXISTS BETWEEN SOFTWARE ARTIFACTS AND DVC ARTIFACTS?
What is the required maintenance effort to add or change data/model files and the ML pipeline specification in a project during regular development?
[Figure: study workflow]
391 projects
Filter most active projects: 25 projects
Filter projects with > 10 Pull Requests: 10 projects
File classification (Perceval, Manual)
Source code artifacts [source, test, data, gitignore, others]
DVC artifacts [data, pipeline, utilities]
Classified Commits/PR
DVC commits coupling analysis
DVC Pull Request coupling analysis
IMPORTANT COUPLING BETWEEN DVC-PIPELINE AND SOURCE CODE ARTIFACTS AT PR-LEVEL
[Figure: PR-level coupling between DVC pipeline and software artifacts]
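A minimal sketch of how PR-level coupling could be expressed as an association rule, assuming each pull request has already been reduced to the set of artifact categories it touches. The function name, the category labels and the example PRs are hypothetical and are not taken from the study; the sketch only shows the standard support and confidence of a rule such as "source changed -> DVC pipeline changed".

```python
def coupling(prs, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    prs: list of sets; each set holds the artifact categories changed by one PR.
    support    = share of all PRs that change both categories
    confidence = share of PRs changing `antecedent` that also change `consequent`
    """
    total = len(prs)
    with_antecedent = [pr for pr in prs if antecedent in pr]
    with_both = [pr for pr in with_antecedent if consequent in pr]
    support = len(with_both) / total if total else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical PRs, invented for illustration only (not data from the paper).
example_prs = [
    {"source", "dvc_pipeline"},
    {"source"},
    {"source", "test"},
    {"source", "test", "dvc_pipeline"},
]

support, confidence = coupling(example_prs, "source", "dvc_pipeline")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A confidence of 0.25 for this rule would correspond to "one out of four PRs changing source code also changes DVC pipeline files"; a statistical test such as chi-square would still be needed to check that the association is not due to chance.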
RQ3: HOW DOES THE COMPLEXITY OF THE DVC ML PIPELINE EVOLVE OVER TIME?
McCabe Complexity (pipeline graph): Nodes, Edges
Halstead Complexity (file’s verbosity): Operators, Operands
25 most active projects
Pipelines Reproduction
Complexity Pipeline Graph/Textual Evolution
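As a rough illustration of the graph-based metric, the sketch below applies the standard cyclomatic complexity formula M = E - N + 2P (edges, nodes, connected components) to a pipeline graph. The slide only states that nodes and edges are used, so the exact variant applied in the study is an assumption here, and the stage names are hypothetical.

```python
def mccabe_complexity(edges, nodes, components=1):
    """Standard cyclomatic complexity M = E - N + 2P of a directed graph.

    edges: set of (source, target) pairs
    nodes: set of node names
    components: number of connected components (P), assumed to be 1 by default
    """
    return len(edges) - len(nodes) + 2 * components


# Hypothetical pipeline graph: stages are nodes, dependencies are edges.
nodes = {"prepare", "featurize", "train", "evaluate"}
edges = {
    ("prepare", "featurize"),
    ("featurize", "train"),
    ("train", "evaluate"),
    ("prepare", "evaluate"),  # evaluate also reads the prepared data
}

print(mccabe_complexity(edges, nodes))
```

A purely linear pipeline (N nodes, N - 1 edges) yields 1 under this formula; each extra dependency edge between existing stages adds 1.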
COMPLEXITY: NO COMMON EVOLUTION TREND
Halstead Complexity
McCabe Complexity
• 6/25 projects have an increasing trend
• 5/25 projects have a high fluctuation
• 7/25 projects have an increasing trend
• 7/25 projects have a high fluctuation
WHAT NOW?
• Implications to ML application developers
  • Use the remote storage functionality properly (e.g., Amazon S3)
  • Use one DVC stage per similar data folder (e.g., images folder)
  • Split the pipeline into subcomponents to facilitate maintenance and reduce the pipeline complexity
• Implications to ML versioning tool developers/companies
  • Consider notebook cell granularity (an ML pipeline in a notebook file is not taken into consideration)
• Implications to researchers
  • Lack of techniques or tools that can assist developers in identifying code changes that require DVC maintenance and vice versa (e.g., fixing a bug in the pipeline)
SUMMARY

Editor's Notes

  1. Hi everybody, thank you for attending my presentation. My name is Amine Barrak, and I am currently a PhD candidate at Polytechnique Montreal. Today I will present my paper entitled "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects". I would like to thank my co-authors for their hard work.
  2. A typical ML pipeline was introduced by Microsoft team members, who showed a series of steps chained together to form the essential stages of the machine learning workflow. These stages cover data-oriented and model-oriented artifacts, from data collection and data cleaning through to model evaluation and model deployment. Together, these stages constitute an ML pipeline.
  3. Let’s start with an example of an ML project. We consider that an ML project can have three kinds of artifacts (code, data, models). These artifacts evolve separately and may have different versions. A snapshot of the project may contain different versions of these artifacts that are compatible with each other. We need a way to keep track of the different versions of these artifacts, and a way to revert to a previous snapshot of the project that includes compatible versions of these artifacts.
  4. So the question here is: how can we keep track of these artifacts together? We know that Git can store and version code, but it does not handle storing large files or models at scale. MLflow, Pachyderm and DVC are tools that make an ML pipeline traceable. We chose DVC specifically because it is popular, lightweight and open source, supports non-ML projects, and works as a layer on top of Git.
  5. As we said, Git can handle source code artifacts, but cannot handle large models and datasets at scale. This is where DVC comes in: it works as a layer on top of Git, creating pointers to large dataset files and ML models using MD5 checksums. This is an example of a DVC file. It looks like a source code file, and it contains the checksums that link each stage to its dependencies. DVC stores the evolution of the pipeline, from data collection to the prediction model. Our study is designed to understand the co-evolution of the three artifacts (source code, data, models).
  6. We designed these three RQs to study the co-evolution of DVC with source code artifacts.
  7. Our first RQ asks how common the usage of DVC is in GitHub projects. To answer it, we ran an advanced search on GitHub for DVC projects, to find out the two questions shown on the slide (Q1, Q2).
  8. The first plot shows how quickly projects adopt DVC after creating their repository: adoption is quick after repository creation (350 projects). The second plot shows how long projects use DVC: 25% of the projects used DVC for more than a week. In the next plot, we compute the proportion of changed DVC files chronologically over each project: 50% or more of the DVC files in a project are changed at least once every one-tenth of the project’s lifetime, which means that DVC files are changed frequently during a project's lifetime. The last plot shows the remote storage used by DVC projects. 127 of the projects have no trace of their DVC remote in their GitHub repository, either because they are toy projects that used DVC for less than a day, or because they hide their config files for security purposes. The three most used data storage locations are Amazon S3 (84 projects), the local cache on a private machine (78), and Google Cloud Storage (35).
  9. To answer this RQ, we filtered the most active projects to compute the coupling at the commit level; for the pull request level, we kept only projects having at least 10 PRs. We then classified all the files in these projects into DVC artifacts (data, pipeline, utilities) and source code artifacts (source, test, data, gitignore, others). Using association rules, we computed the coupling, and we used the chi-square test to validate it. The reason for computing the coupling is to analyse the required …. Studies have shown that software artifacts such as build and environment files, when coupled with traditional software artifacts such as source code and test files, introduce overhead, because developers have to maintain the source code and tests together with these artifacts. Shane 2011 studied the coupling between builds and source code.
  10. At this level, we studied the coupling between DVC artifacts and source code artifacts at the commit and pull request levels. We show here only the coupling between the DVC pipeline and source code artifacts at the pull request level, for the 10 projects having at least 10 PRs. The remaining coupling results, with the details of the statistical tests, can be found in the paper. One out of four PRs changing source code requires changes to DVC pipeline files. One out of two PRs changing tests requires changes to DVC pipeline files. In 80% of the cases where a PR changes a data file, it also requires a change to DVC pipeline files.
  11. Previous studies have shown that source code and build code evolve in terms of complexity and size. Moreover, since we found a high change frequency in DVC files (RQ1) and a high coupling between DVC and source code artifacts (RQ2), we aim to explore in the third RQ how the complexity of the ML pipeline evolves over time. As we can see, this is an example of a DVC stage, and of how the pipeline is shown in graph form with its different dependencies. We use two metrics to compute the complexity. The first metric is the McCabe complexity, which estimates the structural complexity of the pipeline graph; we compute it from the number of edges and nodes in the pipeline. The second metric is the Halstead complexity, which reveals the file’s verbosity and the effort needed to understand the textual form of DVC pipeline files. The operators are the top-level commands, and the operands are the parameters passed to the operators. To do so, we reproduced the pipelines of the 25 most active projects and computed the two complexity metrics.
  12. We computed the Halstead and McCabe complexity for each project and did not find a common evolution trend. We classified the complexity trends into four categories (increasing, fluctuating, major impact and sudden drop). These are examples of the complexity evolutions that we found.
  13. As a reminder: first, we showed how DVC is a layer on top of Git and how it makes pointers to the data and models. Second, we showed that DVC files have a high change frequency (RQ1), despite DVC being a young practice. Third, we showed that DVC has a high coupling with source code artifacts at the pull request level (RQ2). And finally, we showed the evolution trends of the ML pipeline's complexity.
  14. We computed the Halstead and McCabe complexity for each of the 25 most active projects over their lifetime, to track the evolution of their pipelines. The median complexities are shown in this plot; they do not show a correlation.
  15. As shown previously, the ML pipeline includes three connected artifacts that are essential in an ML project. These artifacts are interdependent, yet they evolve separately. The question here is how we can keep track of these artifacts, especially when there is a need to roll back the system (debugging, testing a new model, etc.). Once the artifacts have evolved, it becomes expensive to roll back.
  16. DVC (Data Version Control) acts as a layer over Git which produces versioned pointers to the files instead of the files themselves. The files themselves are stored in a local cache, and this cache can be synchronized with a remote storage. Stages are run using dvc run [command] with options, among which we use: -d for dependency (specify an input file); -o for output (specify an output file ignored by Git and tracked by DVC); -M for metric (specify an output file tracked by Git); -f for file (specify the name of the DVC file).
  17. A typical ML pipeline is a series of steps chained together in the ML cycle that often involves obtaining the data, processing the data, training/testing various ML algorithms, and finally obtaining some output (in the form of a prediction, etc.). We chose DVC because it is popular, supports non-ML projects, is lightweight, and is open source.
  18. Halstead: the Halstead metric focuses on the file’s verbosity (operands and operators). In the case of DVC, the operators are the top-level commands/configuration of a DVC file, such as cmd, deps and outs. The operands are the parameters passed to the operators (e.g., path, wdir, repo, etc.). McCabe: the McCabe complexity metric is primarily concerned with the number of decision points in the generated pipeline graph.
  19. In the third RQ, we aim to analyse the evolution of the ML pipeline from two perspectives (verbosity and the pipeline graph), since a DVC pipeline can be shown in two forms. McCabe: the McCabe complexity metric is primarily concerned with the number of decision points in the generated pipeline graph. Halstead: the Halstead metric focuses on the file’s verbosity (operands and operators). In the case of DVC, the operators are the top-level commands/configuration of a DVC file, such as cmd, deps and outs. The operands are the parameters passed to the operators (e.g., path, wdir, repo, etc.).
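
The Halstead description in the notes above can be sketched as follows, under stated assumptions. The Halstead formulas (vocabulary, length, volume, difficulty, effort) are the standard ones; the tokenization rule, which treats top-level keys of a stage as operators and the keys nested under them as operands, is a hypothetical simplification and not necessarily the rule used in the study. The example stage and its file names are invented.

```python
import math


def halstead_measures(operators, operands):
    """Standard Halstead measures from lists of operator and operand tokens."""
    n1, n2 = len(set(operators)), len(set(operands))  # distinct tokens
    N1, N2 = len(operators), len(operands)            # total tokens
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": difficulty * volume,
    }


def tokens_from_stage(stage):
    """Hypothetical tokenization of one stage description (an assumption):
    top-level keys count as operators, keys nested under them as operands."""
    operators, operands = [], []
    for key, value in stage.items():
        operators.append(key)
        if isinstance(value, list):
            for entry in value:
                if isinstance(entry, dict):
                    operands.extend(entry.keys())
    return operators, operands


# Hypothetical stage, loosely shaped like the content of a DVC stage file.
stage = {
    "cmd": "python train.py",
    "deps": [
        {"path": "data.csv", "md5": "<checksum>"},
        {"path": "train.py", "md5": "<checksum>"},
    ],
    "outs": [{"path": "model.pkl", "md5": "<checksum>"}],
}

operators, operands = tokens_from_stage(stage)
measures = halstead_measures(operators, operands)
print(measures)
```

Tracking these measures for each revision of a pipeline file would give one textual-complexity time series per project, which is the kind of evolution the third RQ looks at.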