On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects
1. ON THE CO-EVOLUTION OF ML PIPELINES AND SOURCE CODE - EMPIRICAL STUDY OF DVC PROJECTS
- SANER 2021 -
March 11th, 2021
Amine Barrak
Ellis E. Eghan
Polytechnique Montreal
Bram Adams
Queen’s University
SANER 2021
3. ML PROJECTS REQUIRE ALIGNING COMPATIBLE VERSIONS OF DATA, MODELS AND CODE ARTIFACTS
[Figure: three separate timelines for Data (V1, V2, V3), Models (V1, V2), and Source code (V1, V2, V3), each evolving independently]
4. EXAMPLES OF ML PIPELINE TRACEABILITY TOOLS
[Figure: which tool can trace MODELS, DATA, and CODE together?]
5. OVERVIEW OF THE DVC TOOL
[Figure: DVC overview; CODE is versioned with Git, while DATA and MODELS are referenced through MD5 checksum pointers]
6. RESEARCH QUESTIONS
RQ1: How common is the usage of DVC in GitHub projects?
RQ2: How much coupling exists between software artifacts and DVC artifacts?
RQ3: How does the complexity of the DVC ML pipeline evolve over time?
7. RQ1: HOW COMMON IS THE USAGE OF DVC IN GITHUB PROJECTS?
DVC usage: advanced search for DVC projects on GitHub → 391 projects
• How are current ML versioning tools (like DVC) used in open-source projects?
• How much do projects rely on these tools?
8. RQ1: DESPITE THE YOUNG PRACTICE, MORE THAN HALF OF DVC FILES ARE FREQUENTLY CHANGED
[Plots: # projects adopting DVC a given period after DVC's launch; # projects using DVC for a given period; # projects using the different DVC remote storage options]
9. RQ2: HOW MUCH COUPLING EXISTS BETWEEN SOFTWARE ARTIFACTS AND DVC ARTIFACTS?
Study design: 391 projects → filter most active projects → 25 projects; filter projects with > 10 pull requests → 10 projects
File classification (Perceval, manual): source code artifacts [source, test, data, gitignore, others]; DVC artifacts [data, pipeline, utilities]
Classified commits/PRs → DVC commit coupling analysis; DVC pull request coupling analysis
Motivation: what is the required maintenance effort to add or change data/model files and the ML pipeline specification in a project during regular development?
10. RQ2: IMPORTANT COUPLING BETWEEN DVC PIPELINE AND SOURCE CODE ARTIFACTS AT PR LEVEL
[Plot: PR-level coupling between DVC pipeline and software artifacts]
11. RQ3: HOW DOES THE COMPLEXITY OF THE DVC ML PIPELINE EVOLVE OVER TIME?
Metrics: McCabe complexity (pipeline graph: nodes, edges); Halstead complexity (file verbosity: operators, operands)
Study design: 25 most active projects → pipeline reproduction → evolution of the pipeline's graph/textual complexity
12. RQ3: COMPLEXITY: NO COMMON EVOLUTION TREND
Halstead complexity: 6/25 projects have an increasing trend; 5/25 projects have a high fluctuation
McCabe complexity: 7/25 projects have an increasing trend; 7/25 projects have a high fluctuation
13. WHAT NOW?
Implications for ML application developers
Properly use the remote storage functionality (e.g., Amazon S3)
Use one DVC stage per similar data folder (e.g., an images folder)
Split the pipeline into subcomponents to facilitate maintenance and reduce pipeline complexity
Implications for ML versioning tool developers/companies
Consider notebook-cell granularity (an ML pipeline inside a notebook file is currently not taken into consideration)
Implications for researchers
There is a lack of techniques or tools that can assist developers in identifying code changes that require DVC maintenance, and vice versa (e.g., fixing a bug in the pipeline).
Hi everybody, thank you for attending my presentation. My name is Amine Barrak; I am currently a PhD candidate at Polytechnique Montreal.
Today I will present my paper entitled "On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects".
I would like to thank my co-authors for their hard work.
A typical ML pipeline was described by Microsoft team members as a series of steps chained together to form the essential stages of the machine learning workflow. These stages involve data- and model-oriented artifacts, starting from data collection and data cleaning up to model evaluation and model deployment. Together, these stages constitute an ML pipeline.
Let's start with an example of an ML project.
We consider that an ML project can have three kinds of artifacts (code, data, models).
We can see that these artifacts evolve separately and might have different versions.
Also, a snapshot of the project may contain different versions of these artifacts that are compatible and work together.
We need a way to keep track of the different versions of these artifacts, and a way to revert to a previous snapshot of the project with a compatible set of versions.
So the question here is: how can we keep track of these artifacts together?
We know that Git can store and version code, but it does not handle storing large, scalable files or models.
MLflow, Pachyderm, and DVC are tools that make an ML pipeline traceable.
We chose DVC specifically because it is lightweight and works as a layer on top of GitHub.
# Popular, supports non-ML projects, lightweight, open source
As we said, GitHub can handle source code artifacts, but it cannot handle large models and datasets.
This is where DVC comes in: it works as a layer on top of GitHub by creating pointers to large dataset files and ML models using MD5 checksums.
This is an example of a DVC file: it looks like a source code file and contains the checksums that link each stage to its dependencies.
DVC stores the pipeline's evolution, from data collection to the prediction model.
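To make this concrete, here is a minimal sketch of what such a stage file might contain and how it could be inspected with Python; the stage content, paths, and checksums are hypothetical, not taken from the studied projects.

```python
# Minimal sketch: parse a hypothetical .dvc stage file and list its MD5 pointers.
# Requires PyYAML; the file content below is illustrative only.
import yaml

stage_text = """
cmd: python src/train.py
deps:
- md5: 3f2c1a9e0b4d5f6a7c8e9d0b1a2c3d4e
  path: data/train.csv
- md5: 9b8a7c6d5e4f3a2b1c0d9e8f7a6b5c4d
  path: src/train.py
outs:
- md5: 0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d
  path: models/model.pkl
"""

stage = yaml.safe_load(stage_text)
for dep in stage["deps"]:
    print("dependency:", dep["path"], "->", dep["md5"])
for out in stage["outs"]:
    print("output:", out["path"], "->", out["md5"])
```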
Our study is designed to understand the co-evolution process between the three artifacts (source code, data, models).
We designed these three RQs to study the co-evolution of DVC artifacts with source code artifacts.
Our first RQ looks at how common the usage of DVC is in GitHub projects.
To do so, we ran an advanced search on GitHub looking for DVC projects, in order to answer the two sub-questions above.
The first plot shows how quickly projects adopt DVC after creating their repository:
there is quick adoption of DVC after the repository is created (350 projects).
The second plot shows how long DVC is used in projects:
25% of the projects applied DVC for more than a week.
In the next plot, we compute the proportion of changed DVC files chronologically over the projects' history:
50% or more of the DVC files in a project are changed at least once every one-tenth of the project's lifetime,
which means that DVC files are frequently changed over a project's lifetime.
The last plot shows the remote storage used by DVC projects:
127 of the projects have no trace of their DVC remote in their GitHub repository, due to toy projects used for less than a day and some projects hiding their config files for security purposes.
The three most used data storage locations are Amazon S3 (84 projects), the local cache on a private machine (78), and Google Cloud Storage (35).
To answer this RQ, we filtered for the most active projects to compute the coupling at the commit level; for the pull request level, we kept only projects with at least 10 PRs. We then classified all the files in these projects into DVC artifact files (data, pipeline, utilities) and source code artifacts (source, test, data, gitignore, others).
Using association rules, we computed the coupling, and we used a chi-square test to validate the analysed coupling.
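As a rough illustration of this kind of analysis (my own sketch, with made-up counts rather than the study's data), the co-change confidence and chi-square test for one artifact pair could be computed as follows:

```python
# Sketch: association-rule confidence and chi-square test for co-changes in PRs.
# The 2x2 contingency table below is hypothetical, not from the studied projects.
from scipy.stats import chi2_contingency

#                pipeline changed   pipeline unchanged
contingency = [[12,                 36],   # PRs that change source code
               [ 3,                 80]]   # PRs that do not change source code

# Confidence of the rule "source code change => pipeline change"
confidence = contingency[0][0] / sum(contingency[0])

chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"confidence = {confidence:.2f}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
```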
The reason for computing the coupling is to analyse the required maintenance effort.
Studies have shown that changes to software artifacts such as build and environment files that are coupled with traditional software artifacts (source code, test files) introduce overhead, as developers have to maintain the source code and tests together with these artifacts.
Shane McIntosh (2011), for example, studied the coupling between build files and source code.
At this level, we studied the coupling at the commit and pull request levels between DVC artifacts and source code artifacts.
Here we show only the coupling between the DVC pipeline and source code artifacts at the pull request level, on the 10 projects with at least 10 PRs. The rest of the coupling results, with the statistical test details, can be found in the paper.
One out of four PRs changing source code requires changes to DVC pipeline files.
One out of two PRs changing tests requires changes to DVC pipeline files.
In 80% of cases, a change to a data file requires a change to DVC pipeline files within the same PR.
Previous studies have shown that source code and build code co-evolve in terms of complexity and size. Moreover, we found a high change frequency of DVC files (RQ1) and high coupling between DVC and source code artifacts (RQ2).
We therefore aim to explore, in the third RQ, how the complexity of the ML pipeline evolves over time.
As we can see, this is an example of a DVC stage and how the pipeline is shown in graph form with its different dependencies.
We use two metrics to compute complexity (a small sketch of both computations follows below).
The first metric is McCabe complexity, to estimate the structural complexity of the pipeline graph;
we compute it from the number of edges and nodes in the pipeline.
The second metric is Halstead complexity, to capture the file's verbosity and the effort needed to understand the textual form of the DVC pipeline files;
the operators are the top-level commands,
and the operands are the parameters passed to the operators.
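The following minimal sketch uses one common formulation of both metrics (the classic McCabe and Halstead definitions, applied here to the pipeline graph and stage files); the example numbers are hypothetical rather than taken from the studied pipelines.

```python
import math

def mccabe_complexity(num_edges: int, num_nodes: int, num_components: int = 1) -> int:
    # Classic cyclomatic complexity M = E - N + 2P, applied to the pipeline dependency graph.
    return num_edges - num_nodes + 2 * num_components

def halstead_metrics(n1: int, n2: int, N1: int, N2: int) -> dict:
    # n1/n2: distinct operators/operands; N1/N2: total operator/operand occurrences.
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    return {"volume": volume, "difficulty": difficulty, "effort": difficulty * volume}

# Toy pipeline graph with 6 stages (nodes) and 7 dependency edges.
print(mccabe_complexity(num_edges=7, num_nodes=6))
# Toy .dvc file with 3 distinct operators used 5 times and 4 distinct operands used 6 times.
print(halstead_metrics(n1=3, n2=4, N1=5, N2=6))
```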
To do so, we reproduced the pipelines of the 25 most active projects and computed the two complexity metrics.
We computed the Halstead and McCabe complexity for each project, and we did not find a common evolution trend.
We classified the complexity trends into increasing, fluctuating, major impact, and sudden drop.
These are examples of the complexity evolutions that we found.
As a reminder: first, we showed how DVC is a layer on top of GitHub and how it creates pointers to the data and models.
Second, we showed that DVC files have a high change frequency (RQ1) despite the young practice.
Third, we showed that DVC has high coupling with the source code artifacts at the pull request level (RQ2).
Finally, we examined the complexity trend evolution of the ML pipeline:
we computed the Halstead and McCabe complexity for each of the 25 most active projects over their lifetime to track pipeline evolution.
The median complexities are shown in this plot, and they do not show a correlation.
As shown previously, the ML pipeline includes three connected, essential artifacts in an ML project. These artifacts are interdependent yet evolve separately.
The question is: how can we keep track of these artifacts, especially when there is a need to roll back the system (debugging, testing a new model, etc.)? The artifacts have already evolved, and it becomes expensive to roll back.
DVC (Data Version Control) acts as a layer over Git that produces versioned pointers to the files instead of the files themselves.
These files are ultimately stored in a local cache, and this cache can be synchronized with a remote storage.
Stages are run using dvc run [command] with options, among which we use (an illustrative example command is sketched below):
-d for dependency: specify an input file
-o for output: specify an output file ignored by Git and tracked by DVC
-M for metric: specify an output file tracked by Git
-f for file: specify the name of the DVC file
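For illustration, creating such a stage could look like the following sketch (invoked from Python for consistency with the other snippets); all file names are hypothetical placeholders, and the flags are the ones listed above.

```python
# Sketch: create a DVC stage with dependencies, an output, a metric file,
# and an explicit .dvc file name. All paths are hypothetical placeholders.
import subprocess

subprocess.run([
    "dvc", "run",
    "-d", "data/train.csv",      # input dependency
    "-d", "src/train.py",        # the training script is also a dependency
    "-o", "models/model.pkl",    # output ignored by Git, tracked by DVC
    "-M", "metrics/scores.json", # metric file tracked by Git
    "-f", "train.dvc",           # name of the generated .dvc file
    "python", "src/train.py",    # the stage command
], check=True)
```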
A typical ML pipeline is a series of steps chained together in the ML cycle that often involves obtaining the data, processing the data, training/testing various ML algorithms, and finally obtaining some output (in the form of a prediction, etc.).
Halstead:
- The Halstead metric focuses on the file's verbosity (operands and operators).
The operators, in the case of DVC, would be the top-level commands/configuration keys of a DVC file, such as cmd, deps, outs.
The operands are the parameters passed to the operators (e.g., path, wdir, repo, etc.).
McCabe:
- The McCabe complexity metric is primarily concerned with the number of decision points in the generated pipeline graph.
(A rough sketch of counting operators and operands in a stage file follows below.)
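The counting itself could be done roughly as follows; this is my own simplification of the operator/operand extraction rather than the paper's exact rules, and the stage content is hypothetical.

```python
# Sketch: count distinct and total Halstead operators/operands in one parsed DVC stage.
from collections import Counter

stage = {  # hypothetical parsed stage content
    "cmd": "python src/train.py",
    "deps": [{"path": "data/train.csv"}, {"path": "src/train.py"}],
    "outs": [{"path": "models/model.pkl"}],
}

operators = Counter()  # keys such as cmd, deps, outs, path, wdir, repo
operands = Counter()   # parameters passed to those keys (paths, command strings, ...)

for key, value in stage.items():
    operators[key] += 1
    if isinstance(value, list):
        for entry in value:
            for sub_key, sub_value in entry.items():
                operators[sub_key] += 1
                operands[str(sub_value)] += 1
    else:
        operands[str(value)] += 1

n1, n2 = len(operators), len(operands)                     # distinct operators / operands
N1, N2 = sum(operators.values()), sum(operands.values())   # total occurrences
print(n1, n2, N1, N2)
```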
In the third RQ, we aim to analyse the evolution of the ML pipeline from two perspectives (verbosity and the pipeline graph), since a DVC pipeline can be viewed in these two forms.