DVC is an open-source tool for versioning datasets, artifacts, and models in Machine Learning projects.
It provides an intuitive Git-like interface that lets you:
1. track dataset version updates
2. build reproducible and shareable machine learning pipelines (e.g. model training)
3. compare model performance scores
4. integrate your data and model versioning with Git
5. deploy the desired version of your trained models
What is DVC?
● Simple command line, Git-like experience.
○ Does not require installing and maintaining any databases.
○ Does not depend on any proprietary online services.
● Management and versioning of datasets and ML models.
○ Data is saved in S3, Google Cloud, Azure, SSH server, HDFS, or even local HDD RAID.
● Makes projects reproducible and shareable; answers questions on how a model was built.
● Helps manage experiments with Git tags/branches and metrics tracking.
“DVC aims to replace spreadsheet and document sharing tools (such as Excel or Google Docs) which are being used frequently as both knowledge repositories and team ledgers. DVC also replaces both ad-hoc scripts to track, move, and deploy different model versions; as well as ad-hoc data file suffixes and prefixes.”
● dvc and git
○ git: versions code and small files
○ dvc: versions data, intermediate results, and models
○ dvc uses git, without storing file content in the repo
● versioning and storing large files
○ dvc saves info about the data in special .dvc files
○ .dvc files can then be versioned using git
○ actual storage happens in remote storage
○ dvc supports many remote storage types
● dvc main features
○ data versioning
○ data access
○ data pipelines
Install
● Install as a Python package.
$ pip install dvc
● Depending on the remote storage you will use, you may want to install specific dependencies.
$ pip install dvc[s3] # support for Amazon S3
$ pip install dvc[ssh] # support for SSH
$ pip install dvc[all] # support for all remote storage types
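A quick sanity check of the installation (not part of the original slides, assuming dvc is now on your PATH):
$ dvc version # prints the DVC version plus platform info and supported remote types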
Initialization
● We must work inside a Git repository. If it does not exist yet, we create and initialize one.
$ mkdir ml_project && cd ml_project
$ git init
● Initializing a DVC project creates, and automatically git adds, a few important files.
$ dvc init
$ git status -s
A .dvc/.gitignore
A .dvc/config
A .dvc/plots/confusion.json
A .dvc/plots/default.json
A .dvc/plots/scatter.json
A .dvc/plots/smooth.json
A .dvcignore
$ git commit -m "Initialize dvc project"
● What these files are for:
○ .dvc/.gitignore: tells Git not to track .dvc/cache and .dvc/tmp
○ .dvc/config: TOML file with configurations for dvc remote storage (name, url), dvc cache (reflink/copy/hardlink, location, ...), ...
○ .dvc/plots/*.json: plot templates (visualize & compare metrics)
○ .dvcignore: tells dvc what not to track (empty for now)
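.dvcignore accepts .gitignore-style patterns. As an illustration only (these entries are hypothetical, not part of the tutorial), you could hide scratch files from DVC like this:
$ echo "*.tmp" >> .dvcignore
$ echo "scratch/" >> .dvcignore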
Getting some data
● Let’s download some data to train and validate a “cat VS dog” CNN classifier.
We use dvc get, which is like wget to download data/models from a remote dvc repo.
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip
$ unzip data.zip && rm -f data.zip
inflating: data/train/cats/cat.001.jpg
...
● This folder contains 43 MB of JPG images organized in a hierarchical fashion.
data
├── train
│   ├── dogs # 500 pictures
│   └── cats # 500 pictures
└── validation
    ├── dogs # 400 pictures
    └── cats # 400 pictures
Start versioning data
● Tracking data with DVC is very similar to tracking code with git.
$ dvc add data/
100% Add|██████████|1/1 [00:30, 30.51s/file]
To track the changes with git, run:
git add .gitignore data.dvc
$ git add .gitignore data.dvc
$ git commit -m "Add first version of data/"
$ git tag -a "v1.0" -m "data v1.0, 1000 images"
● The updated .gitignore tells git not to track the data/ directory.
● data.dvc is DVC-generated and contains the hash used to track data/ → human readable, can be versioned with git!
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  path: data
● Quite a few things happened when calling dvc add :
○ The hash of the content of data/ was computed and added to a new data.dvc file
○ DVC updates .gitignore to tell Git not to track the content of data/
○ The physical content of data/ —i.e. the jpg images— has been moved to a cache
(by default the cache is located in .dvc/cache/ but using a remote cache is possible!)
○ The files were linked back to the workspace so that it looks like nothing happened
(the user can configure the link type to use: hard link, soft link, reflink, copy)
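To make the cache layout concrete, here is a sketch based on DVC 2.x behavior, not shown on the slides (newer DVC versions organize the cache slightly differently): the md5 above maps to a path inside .dvc/cache, with the first two hex characters used as a directory name.
$ ls .dvc/cache/b8/
f4d5a78e55e88906d5f4aeaf43802e.dir # JSON list of {md5, relpath} entries, one per image in data/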
Make changes to tracked data (add)
● Let’s download some more data for our “cat VS dog” dataset.
Running dvc diff will confirm that dvc is aware the data has changed!
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
$ unzip new-labels.zip && rm -f new-labels.zip
inflating: data/train/cats/cat.501.jpg
...
$ dvc diff
Modified:
data/
● To track the changes in our data with dvc, we follow the same procedure as before.
$ dvc add data/
$ git diff data.dvc
outs:
-- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
+- md5: 21060888834f7220846d1c6f6c04e649.dir
path: data
$ git commit -am "New version of data/ with more training images"
$ git tag -a "v2.0" -m "data v2.0, 2000 images"
Switch between versions
● To switch version, first run git checkout.
This affects data.dvc but not the workspace files in data/ !
$ git checkout v1.0
$ dvc diff
Modified:
data/
● To fix this mismatch we simply call dvc checkout.
This reads the cache and updates the data in the workspace based on the current *.dvc files.
$ dvc checkout
M data/
$ dvc status
Data and pipelines are up to date.
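As a convenience not covered on the slides, DVC can also install Git hooks so the second step happens automatically; a minimal sketch:
$ dvc install # adds post-checkout, pre-commit, and pre-push Git hooks
$ git checkout v2.0 # dvc checkout now runs automatically after the Git checkout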
Configure remote storage
● A remote storage is for dvc what GitHub is for git:
○ push and pull files from your workspace to the remote
○ easy sharing between developers
○ safe backup should you ever make a terrible mistake à la rm -rf *
● Many remote storages are supported (Google Drive, Amazon S3, Google Cloud, SSH, HDFS, HTTP, …)
But (as with Git) nothing prevents us from using a “local remote”!
$ mkdir -p ~/tmp/dvc_storage
$ dvc remote add --default loc_remote ~/tmp/dvc_storage
Setting 'loc_remote' as a default remote.
$ git add .dvc/config
$ git commit -m "Configure remote storage loc_remote"
● The DVC-generated .dvc/config now contains the remote storage configuration:
[core]
remote = loc_remote
['remote "loc_remote"']
url = /root/tmp/dvc_storage
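Cloud remotes are configured the same way; a hypothetical S3 example (the bucket name is made up, and it requires the dvc[s3] extra):
$ dvc remote add --default s3_remote s3://my-bucket/dvc-storage
$ dvc remote modify s3_remote region eu-west-1 # optional remote-specific settings
$ git add .dvc/config && git commit -m "Configure S3 remote storage"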
Storing, sharing, retrieving from storage
● Running dvc push basically uploads the content of the cache to the remote storage.
This is pretty much like git push.
$ dvc push
1800 files pushed
● Now, even if all the data is deleted from our workspace and cache, we can download it with dvc pull.
This is pretty much like git pull.
$ rm -rf .dvc/cache data
$ dvc pull # update .dvc/cache with contents from remote
1800 files fetched
$ dvc checkout # update workspace, linking data from .dvc/cache
A data/
16. Francesco Casalegno – DVC
$ dvc list https://github.com/iterative/dataset-registry
README.md
get-started/
tutorial/
...
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
● When working outside of a DVC project —e.g. in automated ML model deployment— use dvg get
● First, we can explore the content of a DVC repo hosted on a Git server.
Access data from storage
16
● Note. For all these commands we can specify a git revision (sha, branch, or tag) with --rev <commit>.
● When working inside of another DVC project, we want to keep the connection between the projects.
In this way, others can know where the data comes from and whether new versions are available.
dvc import is like dvc get + dvc add, but the resulting data.dvc also includes a ref to the source repo!
$ dvc import https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
$ git add new-labels.zip.dvc .gitignore
$ git commit -m "Import data from source"
17. Francesco Casalegno – DVC
● We can build a DVC project dedicated only to tracking and versioning datasets and models.
The repository would have all the metadata and history of changes in the different datasets.
● This is a data registry: a middleware between ML projects and cloud storage.
This approach brings quite a few advantages.
○ Reusability — reproduce and organize feature stores with a simple dvc get / import
○ Optimization — track data shared by multiple projects centralized in a single location
○ Data as code — leverage Git workflow such as commits, branching, pull requests, CI/CD …
○ Persistence — a DVC registry-controlled remote storage improves data security
● Versioning large data files for data science is great, but it is not all DVC can do:
DVC data pipelines capture how data is filtered, transformed, and used to train models!
Data Registries
17
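● As a sketch of how a downstream ML project could consume such a registry (the registry URL and dataset path are hypothetical):
$ dvc import https://github.com/our-org/data-registry datasets/images # hypothetical registry repo
$ git add images.dvc .gitignore && git commit -m "Import images dataset from the data registry"
$ dvc update images.dvc # later: pull in a newer version published in the registry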
19. Francesco Casalegno – DVC
● With dvc add we can track large files—this includes files such as trained models, embeddings, etc.
However, we also want to track how such files were generated for reproducibility and better tracking!
Motivation
The following is an example of a typical ML pipeline. Its structure is a DAG (directed acyclic graph).
19
[Figure: pipeline DAG. Each box shows the stage name, its parameters (◾), and its script.]
data.xml → Prepare (train-test split; ◾ seed, ◾ split; prepare.py) → train.tsv, test.tsv
→ Featurize (TF-IDF embed; ◾ max_feats, ◾ n_grams; featurize.py) → train.pkl, test.pkl
→ Train (RandomForest; ◾ seed, ◾ n_estimators; train.py) → model.pkl
→ Evaluate (PR curve, AUC; evaluate.py) → scores.json, prc.json
20. Francesco Casalegno – DVC
Tracking ML Pipelines
20
● Option A: run pipeline stages, then track output artifacts with dvc add
$ python src/prepare.py data/data.xml
$ dvc add data/prepared/train.tsv data/prepared/test.tsv
● Option B: run pipeline stages and track them, together with all their dependencies, with dvc run
$ dvc run -n prepare \
          -p prepare.seed \
          -p prepare.split \
          -d src/prepare.py \
          -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
-n stage name
-p parameters — read from params.yaml
-d dependencies (including the script!)
-o outputs to track
params.yaml (the -p parameters are read from here):
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
...
→ Advantages of Option B
1. outputs are automatically tracked (i.e. saved in .dvc/cache)
2. pipeline stages with parameter names are saved in dvc.yaml
3. deps, params, outs are all hashed and tracked in dvc.lock
4. like a Makefile, we can reproduce with dvc repro prepare; stages are re-run only if their deps changed!
21. Francesco Casalegno – DVC
$ tree
.
├── data
│ ├── data.xml
│ └── data.xml.dvc
├── params.yaml
└── src
├── evaluate.py
├── featurization.py
├── prepare.py
├── requirements.txt
└── train.py
$ mkdir nlp_project && cd nlp_project
$ git init && dvc init && git commit -m "Init dvc repo"
● Let’s create a DVC repo for an NLP project.
Example
21
● Then we download some data + some code to prepare the data and train/evaluate a model
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
          -o data/data.xml
$ dvc add data/data.xml
$ git add data/.gitignore data/data.xml.dvc && git commit -m "Add data, first version"
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip && rm -f code.zip
params.yaml (YAML file with params for all the stages):
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
train:
  seed: 20170428
  n_estimators: 50
(the scripts under src/ are the pipeline steps)
22. Francesco Casalegno – DVC
$ dvc run -n prepare \
          -p prepare.seed \
          -p prepare.split \
          -d src/prepare.py \
          -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
$ git add data/.gitignore dvc.yaml dvc.lock
● Let’s run the prepare stage.
Example
22
dvc.yaml (describes data pipelines, similar to how Makefiles work for building software):
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
dvc.lock (matches the dvc.yaml file; created and updated by DVC commands like dvc run).
It describes the latest pipeline state in order to:
1. track intermediate and final artifacts (like a .dvc file)
2. allow DVC to detect when stage definitions or dependencies changed, triggering a re-run.
prepare:
  cmd: python src/prepare.py data/data.xml
  deps:
  - path: data/data.xml
    md5: a304afb96060aad90176268345e10355
  - path: src/prepare.py
    md5: 285af85d794bb57e5d09ace7209f3519
  params:
    params.yaml:
      prepare.seed: 20170428
      prepare.split: 0.2
  outs:
  - path: data/prepared
    md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
● Note: dependencies and artifacts are automatically tracked, no need to dvc add them!
23. Francesco Casalegno – DVC
$ dvc run -n featurize \
          -p featurize.max_features \
          -p featurize.ngrams \
          -d src/featurization.py \
          -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features
$ git add data/.gitignore dvc.yaml dvc.lock
● Then we run the featurize and train stages in the same way.
Example
23
$ dvc run -n train \
          -p train.seed \
          -p train.n_estimators \
          -d src/train.py \
          -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl
$ git add data/.gitignore dvc.yaml dvc.lock
24. Francesco Casalegno – DVC
$ dvc run -n evaluate \
          -d src/evaluate.py \
          -d model.pkl \
          -d data/features \
          --metrics-no-cache scores.json \
          --plots-no-cache prc.json \
          python src/evaluate.py model.pkl data/features scores.json prc.json
$ git add dvc.yaml dvc.lock
● And finally we run the evaluation stage.
Example
24
--metrics-no-cache scores.json declares an output metrics file.
This is a special kind of output (-o): it must be JSON and can be used to make
comparisons across experiments in tabular form.
E.g. here it contains the AUC score.
The -no-cache variant prevents DVC from storing the file in the cache.
--plots-no-cache prc.json declares an output plots file.
This is a special kind of output (-o): it must be JSON and can be used to make
comparisons across experiments in plot form.
E.g. here it contains the data for the precision-recall curve plot.
The -no-cache variant prevents DVC from storing the file in the cache.
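● Besides the diff commands shown next, these special outputs can also be inspected directly; a minimal sketch:
$ dvc metrics show # print the values stored in scores.json for the current workspace
$ dvc plots show prc.json -x recall -y precision # render the precision-recall curve to an HTML file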
26. Francesco Casalegno – DVC
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.
● dvc repro regenerates data pipeline results by restoring the DAG defined by the stages listed in dvc.yaml.
It compares file hashes with dvc.lock to re-run stages only when needed. This is like make in software builds.
Reproducing Pipelines
26
● Case 1: nothing changed, re-running pipeline stages is skipped.
● Case 2: a dependency changed, pipeline stages are re-run if needed.
$ sed -i -e "s@max_features: 500@max_features: 1500@g" params.yaml
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize' with command:
python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train' with command:
python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
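● After a re-run, the updated lock file and the new outputs are versioned the usual way; a minimal sketch:
$ git add dvc.lock params.yaml
$ git commit -m "Re-run pipeline with max_features: 1500"
$ dvc push # upload the new cache content to the remote storage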
27. Francesco Casalegno – DVC
$ dvc plots diff -x recall -y precision
file:///Users/dvc/example-get-started/plots.html
$ dvc params diff
Path Param Old New
params.yaml featurize.max_features 500 1500
● dvc params diff rev_1 rev_2 shows how parameters differ between two git revisions/tags.
Without arguments, it shows how they differ between the workspace and the last commit.
Comparing experiments
27
$ dvc metrics diff
Path Metric Value Change
scores.json auc 0.61314 0.07139
● dvc metrics diff rev_1 rev_2 does the same for metrics.
● dvc plots diff rev_1 rev_2 does the same for plots.
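● These diffs also work across revisions; a minimal sketch comparing two hypothetical tags:
$ dvc metrics diff exp-baseline exp-bigger-features
$ dvc params diff exp-baseline exp-bigger-features
$ dvc plots diff exp-baseline exp-bigger-features -x recall -y precision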
29. Francesco Casalegno – DVC
● Disk space optimization
Avoid having 1 cache per user!
● Use DVC as usual
- Each dvc add or dvc run moves
data to the shared external cache!
- Each dvc checkout links required
data to the workspace!
Shared Development Server
29
● See here for implementation details,
but basically it’s not too difficult:
$ mkdir -p path_shared_cache/
$ mv .dvc/cache/* path_shared_cache/
$ dvc cache dir path_shared_cache/
$ dvc config cache.shared group
$ git commit -m "config shared cache"
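● Since the link type is configurable (hard link, soft link, reflink, copy), a shared cache is usually combined with links instead of copies; a minimal sketch of one possible setting:
$ dvc config cache.type symlink # avoid duplicating large files in every user's workspace
$ dvc checkout --relink # re-link existing workspace files with the new setting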
31. Francesco Casalegno – DVC
Conclusions
31
● DVC is a version control system for large ML data and artifacts.
● DVC integrates with Git through *.dvc and dvc.lock files, to version files and pipelines, respectively.
● DVC repos can work as data registries, i.e. a middleware between cloud storage and ML projects.
● To track raw ML data files, use dvc add—e.g. for input datasets.
● To track intermediate or final results of an ML pipeline, use dvc run—e.g. for model weights or processed datasets.
● Consider using a shared development server with a unified, shared external cache.