On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects
1. ON THE CO-EVOLUTION OF ML PIPELINES AND SOURCE CODE - EMPIRICAL STUDY OF DVC PROJECTS
- SANER 2021 -
March 11th, 2021
Amine Barrak
Ellis E. Eghan
Polytechnique Montreal
Bram Adams
Queen’s University
SANER 2021
3. ML PROJECTS REQUIRE ALIGNING COMPATIBLE VERSIONS OF DATA, MODELS AND CODE ARTIFACTS
[Figure: three separate timelines for Data (V1, V2, V3), Models (V1, V2), and Source code (V1, V2, V3), each evolving independently]
4. EXAMPLES OF ML PIPELINE TRACEABILITY TOOLS
[Figure: which tool can trace MODELS, DATA, and CODE together?]
5. OVERVIEW OF THE DVC TOOL
[Figure: DVC overview; CODE is versioned with Git, while DATA and MODELS are referenced through MD5 checksum pointers]
6. RESEARCH QUESTIONS
RQ1: How common is the usage of DVC in GitHub projects?
RQ2: How much coupling exists between software artifacts and DVC artifacts?
RQ3: How does the complexity of the DVC ML pipeline evolve over time?
7. RQ1: HOW COMMON IS THE USAGE OF DVC IN GITHUB PROJECTS?
DVC usage: advanced search for DVC projects on GitHub → 391 projects
• How are current ML versioning tools (like DVC) used in open-source projects?
• How much do projects rely on these tools?
8. RQ1: DESPITE THE YOUNG PRACTICE, MORE THAN HALF OF DVC FILES ARE FREQUENTLY CHANGED
[Plots: # projects adopting DVC a given period after DVC's launch; # projects using DVC for a given period; # projects using the different DVC remote storage options]
9. RQ2: HOW MUCH COUPLING EXISTS BETWEEN SOFTWARE ARTIFACTS AND DVC ARTIFACTS?
Study design: 391 projects → filter most active projects → 25 projects; filter projects with > 10 pull requests → 10 projects
File classification (Perceval, manual): source code artifacts [source, test, data, gitignore, others]; DVC artifacts [data, pipeline, utilities]
Classified commits/PRs → DVC commit coupling analysis; DVC pull request coupling analysis
Motivation: what is the required maintenance effort to add or change data/model files and the ML pipeline specification in a project during regular development?
10. RQ2: IMPORTANT COUPLING BETWEEN DVC PIPELINE AND SOURCE CODE ARTIFACTS AT PR LEVEL
[Plot: PR-level coupling between DVC pipeline and software artifacts]
11. RQ3: HOW DOES THE COMPLEXITY OF THE DVC ML PIPELINE EVOLVE OVER TIME?
Metrics: McCabe complexity (pipeline graph: nodes, edges); Halstead complexity (file verbosity: operators, operands)
Study design: 25 most active projects → pipeline reproduction → evolution of the pipeline's graph/textual complexity
12. RQ3: COMPLEXITY: NO COMMON EVOLUTION TREND
Halstead complexity: 6/25 projects have an increasing trend; 5/25 projects have a high fluctuation
McCabe complexity: 7/25 projects have an increasing trend; 7/25 projects have a high fluctuation
13. WHAT NOW?
Implications for ML application developers
Properly use the remote storage functionality (e.g., Amazon S3)
Use one DVC stage per similar data folder (e.g., an images folder)
Split the pipeline into subcomponents to facilitate maintenance and reduce pipeline complexity
Implications for ML versioning tool developers/companies
Consider notebook-cell granularity (an ML pipeline inside a notebook file is currently not taken into consideration)
Implications for researchers
There is a lack of techniques or tools that can assist developers in identifying code changes that require DVC maintenance, and vice versa (e.g., fixing a bug in the pipeline).
Hi everybody, thank you for attending my presentation. My name is Amine Barrak; I am currently a PhD candidate at Polytechnique Montreal.
Today I will present my paper entitled "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects".
I would like to thank my co-authors for their hard work.
A typical ML pipeline was described by Microsoft team members as a series of steps chained together to form the essential stages of the machine learning workflow. These stages involve data- and model-oriented artifacts, starting from data collection and data cleaning up to model evaluation and model deployment. Together, these stages constitute an ML pipeline.
Let's start with an example of an ML project.
We consider that an ML project can have three kinds of artifacts (code, data, models).
We can see that these artifacts evolve separately and might have different versions.
Also, a snapshot of the project may contain different versions of these artifacts that are compatible and work together.
We need a way to keep track of the different versions of these artifacts, and a way to revert to a previous snapshot of the project with a compatible set of versions.
So the question here is: how can we keep track of these artifacts together?
We know that Git can store and version code, but it does not handle storing large, scalable files or models.
MLflow, Pachyderm, and DVC are tools that make an ML pipeline traceable.
We chose DVC specifically because it is lightweight and works as a layer on top of GitHub.
# Popular, supports non-ML projects, lightweight, open source
As we said, GitHub can handle source code artifacts, but it cannot handle large models and datasets.
This is where DVC comes in: it works as a layer on top of GitHub by creating pointers to large dataset files and ML models using MD5 checksums.
This is an example of a DVC file: it looks like a source code file and contains the checksums that link each stage to its dependencies.
DVC stores the pipeline's evolution, from data collection to the prediction model.
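To make this concrete, here is a minimal sketch of what such a stage file might contain and how it could be inspected with Python; the stage content, paths, and checksums are hypothetical, not taken from the studied projects.

```python
# Minimal sketch: parse a hypothetical .dvc stage file and list its MD5 pointers.
# Requires PyYAML; the file content below is illustrative only.
import yaml

stage_text = """
cmd: python src/train.py
deps:
- md5: 3f2c1a9e0b4d5f6a7c8e9d0b1a2c3d4e
  path: data/train.csv
- md5: 9b8a7c6d5e4f3a2b1c0d9e8f7a6b5c4d
  path: src/train.py
outs:
- md5: 0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d
  path: models/model.pkl
"""

stage = yaml.safe_load(stage_text)
for dep in stage["deps"]:
    print("dependency:", dep["path"], "->", dep["md5"])
for out in stage["outs"]:
    print("output:", out["path"], "->", out["md5"])
```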
Our study is designed to understand the co-evolution process between the three artifacts (source code, data, models).
We designed these three RQs to study the co-evolution of DVC artifacts with source code artifacts.
Our first RQ looks at how common the usage of DVC is in GitHub projects.
To do so, we ran an advanced search on GitHub looking for DVC projects, in order to answer the two sub-questions above.
The first plot shows how quickly projects adopt DVC after creating their repository:
there is quick adoption of DVC after the repository is created (350 projects).
The second plot shows how long DVC is used in projects:
25% of the projects applied DVC for more than a week.
In the next plot, we compute the proportion of changed DVC files chronologically over the projects' history:
50% or more of the DVC files in a project are changed at least once every one-tenth of the project's lifetime,
which means that DVC files are frequently changed over a project's lifetime.
The last plot shows the remote storage used by DVC projects:
127 of the projects have no trace of their DVC remote in their GitHub repository, due to toy projects used for less than a day and some projects hiding their config files for security purposes.
The three most used data storage locations are Amazon S3 (84 projects), the local cache on a private machine (78), and Google Cloud Storage (35).
To answer this RQ, we filtered for the most active projects to compute the coupling at the commit level; for the pull request level, we kept only projects with at least 10 PRs. We then classified all the files in these projects into DVC artifact files (data, pipeline, utilities) and source code artifacts (source, test, data, gitignore, others).
Using association rules, we computed the coupling, and we used a chi-square test to validate the analysed coupling.
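As a rough illustration of this kind of analysis (my own sketch, with made-up counts rather than the study's data), the co-change confidence and chi-square test for one artifact pair could be computed as follows:

```python
# Sketch: association-rule confidence and chi-square test for co-changes in PRs.
# The 2x2 contingency table below is hypothetical, not from the studied projects.
from scipy.stats import chi2_contingency

#                pipeline changed   pipeline unchanged
contingency = [[12,                 36],   # PRs that change source code
               [ 3,                 80]]   # PRs that do not change source code

# Confidence of the rule "source code change => pipeline change"
confidence = contingency[0][0] / sum(contingency[0])

chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"confidence = {confidence:.2f}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
```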
The reason for computing the coupling is to analyse the required maintenance effort.
Studies have shown that changes to software artifacts such as build and environment files that are coupled with traditional software artifacts (source code, test files) introduce overhead, as developers have to maintain the source code and tests together with these artifacts.
Shane McIntosh (2011), for example, studied the coupling between build files and source code.
At this level, we studied the coupling at the commit and pull request levels between DVC artifacts and source code artifacts.
Here we show only the coupling between the DVC pipeline and source code artifacts at the pull request level, on the 10 projects with at least 10 PRs. The rest of the coupling results, with the statistical test details, can be found in the paper.
One out of four PRs changing source code requires changes to DVC pipeline files.
One out of two PRs changing tests requires changes to DVC pipeline files.
In 80% of cases, a change to a data file requires a change to DVC pipeline files within the same PR.
Previous studies have shown that source code and build code co-evolve in terms of complexity and size. Moreover, we found a high change frequency of DVC files (RQ1) and high coupling between DVC and source code artifacts (RQ2).
We therefore aim to explore, in the third RQ, how the complexity of the ML pipeline evolves over time.
As we can see, this is an example of a DVC stage and how the pipeline is shown in graph form with its different dependencies.
We use two metrics to compute complexity (a small sketch of both computations follows below).
The first metric is McCabe complexity, to estimate the structural complexity of the pipeline graph;
we compute it from the number of edges and nodes in the pipeline.
The second metric is Halstead complexity, to capture the file's verbosity and the effort needed to understand the textual form of the DVC pipeline files;
the operators are the top-level commands,
and the operands are the parameters passed to the operators.
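The following minimal sketch uses one common formulation of both metrics (the classic McCabe and Halstead definitions, applied here to the pipeline graph and stage files); the example numbers are hypothetical rather than taken from the studied pipelines.

```python
import math

def mccabe_complexity(num_edges: int, num_nodes: int, num_components: int = 1) -> int:
    # Classic cyclomatic complexity M = E - N + 2P, applied to the pipeline dependency graph.
    return num_edges - num_nodes + 2 * num_components

def halstead_metrics(n1: int, n2: int, N1: int, N2: int) -> dict:
    # n1/n2: distinct operators/operands; N1/N2: total operator/operand occurrences.
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    return {"volume": volume, "difficulty": difficulty, "effort": difficulty * volume}

# Toy pipeline graph with 6 stages (nodes) and 7 dependency edges.
print(mccabe_complexity(num_edges=7, num_nodes=6))
# Toy .dvc file with 3 distinct operators used 5 times and 4 distinct operands used 6 times.
print(halstead_metrics(n1=3, n2=4, N1=5, N2=6))
```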
To do so, we reproduced the pipelines of the 25 most active projects and computed the two complexity metrics.
We computed the Halstead and McCabe complexity for each project, and we did not find a common evolution trend.
We classified the complexity trends into increasing, fluctuating, major impact, and sudden drop.
These are examples of the complexity evolutions that we found.
As a reminder: first, we showed how DVC is a layer on top of GitHub and how it creates pointers to the data and models.
Second, we showed that DVC files have a high change frequency (RQ1) despite the young practice.
Third, we showed that DVC has high coupling with the source code artifacts at the pull request level (RQ2).
Finally, we examined the complexity trend evolution of the ML pipeline:
we computed the Halstead and McCabe complexity for each of the 25 most active projects over their lifetime to track pipeline evolution.
The median complexities are shown in this plot, and they do not show a correlation.
As shown previously, the ML pipeline includes three connected, essential artifacts in an ML project. These artifacts are interdependent yet evolve separately.
The question is: how can we keep track of these artifacts, especially when there is a need to roll back the system (debugging, testing a new model, etc.)? The artifacts have already evolved, and it becomes expensive to roll back.
DVC (Data Version Control) acts as a layer over Git that produces versioned pointers to the files instead of the files themselves.
These files are ultimately stored in a local cache, and this cache can be synchronized with a remote storage.
Stages are run using dvc run [command] with options, among which we use (an illustrative example command is sketched below):
-d for dependency: specify an input file
-o for output: specify an output file ignored by Git and tracked by DVC
-M for metric: specify an output file tracked by Git
-f for file: specify the name of the DVC file
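For illustration, creating such a stage could look like the following sketch (invoked from Python for consistency with the other snippets); all file names are hypothetical placeholders, and the flags are the ones listed above.

```python
# Sketch: create a DVC stage with dependencies, an output, a metric file,
# and an explicit .dvc file name. All paths are hypothetical placeholders.
import subprocess

subprocess.run([
    "dvc", "run",
    "-d", "data/train.csv",      # input dependency
    "-d", "src/train.py",        # the training script is also a dependency
    "-o", "models/model.pkl",    # output ignored by Git, tracked by DVC
    "-M", "metrics/scores.json", # metric file tracked by Git
    "-f", "train.dvc",           # name of the generated .dvc file
    "python", "src/train.py",    # the stage command
], check=True)
```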
A typical ML pipeline is a series of steps chained together in the ML cycle that often involves obtaining the data, processing the data, training/testing various ML algorithms, and finally obtaining some output (in the form of a prediction, etc.).
Halstead:
- The Halstead metric focuses on the file's verbosity (operands and operators).
The operators, in the case of DVC, would be the top-level commands/configuration keys of a DVC file, such as cmd, deps, outs.
The operands are the parameters passed to the operators (e.g., path, wdir, repo, etc.).
McCabe:
- The McCabe complexity metric is primarily concerned with the number of decision points in the generated pipeline graph.
(A rough sketch of counting operators and operands in a stage file follows below.)
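The counting itself could be done roughly as follows; this is my own simplification of the operator/operand extraction rather than the paper's exact rules, and the stage content is hypothetical.

```python
# Sketch: count distinct and total Halstead operators/operands in one parsed DVC stage.
from collections import Counter

stage = {  # hypothetical parsed stage content
    "cmd": "python src/train.py",
    "deps": [{"path": "data/train.csv"}, {"path": "src/train.py"}],
    "outs": [{"path": "models/model.pkl"}],
}

operators = Counter()  # keys such as cmd, deps, outs, path, wdir, repo
operands = Counter()   # parameters passed to those keys (paths, command strings, ...)

for key, value in stage.items():
    operators[key] += 1
    if isinstance(value, list):
        for entry in value:
            for sub_key, sub_value in entry.items():
                operators[sub_key] += 1
                operands[str(sub_value)] += 1
    else:
        operands[str(value)] += 1

n1, n2 = len(operators), len(operands)                     # distinct operators / operands
N1, N2 = sum(operators.values()), sum(operands.values())   # total occurrences
print(n1, n2, N1, N2)
```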
In the third RQ, we aim to analyse the evolution of the ML pipeline from two perspectives (verbosity and the pipeline graph), since a DVC pipeline can be viewed in these two forms.