Title: MLOps Workshop: The Full ML Lifecycle - How to Use ML in Production
Speakers: Spyros Cavadias (https://www.linkedin.com/in/spyros-cavadias/), Konstantinos Pittas (https://www.linkedin.com/in/konstantinos-pittas-83310270/), Thanos Gkinakos (https://www.linkedin.com/in/thanos-gkinakos-03582a128/)
Date: Saturday, December 17, 2022
Event: https://www.meetup.com/athens-big-data/events/289927468/
1. DATA DRIVEN VALUE CREATION
DATA SCIENCE & ANALYTICS | DATA MANAGEMENT | VISUALIZATION & DATA EXPERIENCE
D ONE Solutions AG, Sihlfeldstrasse 58, 8003 Zürich, d-one.ai
2. How to Use Machine Learning in Production (MLOps)
Steffen Terhaar, Tim Rohner, Bernhard Venneman, Spyros Cavadias & Roman Moser
CD4ML Working Group @ D ONE
3. Workshop Case Study
A Zurich-based, data-driven company that provides an analytics platform for wind turbine parks.
Goal: build and productionize a model to predict turbine malfunctions.
6. ML model development vs MLOps
[Diagram: a model takes an input image and predicts "cat" or "dog"]
Developing a machine learning model is challenging and fun … however, it is only one (small) piece of the puzzle.
7. Machine Learning Pipeline
■ Scoping: define the project
■ Data: define data and establish a baseline; label and organize data
■ Modelling: select and train the model; perform error analysis
■ Deployment: deploy in production; monitor & maintain the system
Feedback loops run back through the pipeline: error analysis may send you back to improve feature engineering, while concept drift (e.g., new/different wind turbines) and data drift (e.g., seasonality) send you from deployment back to the data and modelling stages.
Going to production means your workflow:
■ Needs to scale
■ Needs governance
8. For this case study we use tools from the open source stack
We want to provide a solution that:
■ Works on-premise
■ Works in the cloud (independent of the cloud provider)
■ Gives you complete control
10. A Selection of an Open Source Stack
■ Exploration: Jupyter
■ Data: DVC
■ Modelling: Python, scikit-learn
■ Tracking: MLflow
■ Orchestration: Airflow
■ Deployment: Docker
(Slide legend: "Can Do" vs "Should Do".)
13. What DVC offers
■ Git-compatible
■ Storage agnostic
■ Reproducible
■ Low-friction branching
■ Metric tracking
■ ML pipeline framework
■ Language- & framework-agnostic
■ Works with HDFS, Hive & Apache Spark
■ Tracks failures
DVC features:
■ Data and model versioning
■ Remote data storage
■ Data pipelines
■ Experiment and metrics tracking
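DVC's pipeline and metrics features can be driven entirely from the command line. A minimal sketch, assuming a hypothetical project with a training script src/train.py that reads data/train.csv, writes models/model.pkl and dumps its scores to metrics.json (none of these names come from the slides):
# Register a pipeline stage; DVC records it in dvc.yaml
$ dvc stage add -n train -d src/train.py -d data/train.csv -o models/model.pkl -M metrics.json python src/train.py
# Re-run the pipeline; only stages whose inputs changed are executed
$ dvc repro
# Show the tracked metrics, or compare them against another Git revision
$ dvc metrics show
$ dvc metrics diff HEAD~1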
14. How DVC data versioning works
Basic idea: keep versions not only of your code, but also of your data, pipeline and models.
(image source: dvc.org)
15. Confidential
How DVC data versioning works
image source: dvc.org 15
Problem with Git and data
Git is not suitable for tracking
large files. Thus, tracking entire
datasets and large models is not
feasible with Git.
Solution
DVC!
[Diagram: code is versioned on a Git server, while the large data files are pushed to cloud storage]
16. Confidential
Advantages over Git LFS
image source: dvc.org
DVC was developed with Data Science in mind
■ No special server infrastructure needed
■ Avoids GitHub repository/file size limit of 2-5GB
■ Works with any cloud storage (also on-prem)
■ Amazon S3
■ Microsoft Azure Blob
■ Google Drive
■ Google Cloud Storage
■ Aliyun OSS
■ SSH
■ HDFS
■ Web HDFS
■ HTTP
■ WebDAV
■ Local remote
17. Confidential
Installing DVC
A Python package
■ Basic:
$ pip install dvc
or
$ conda install -c conda-forge dvc
■ Additional backends:
■ dvc[all]: all available backends
■ [s3], [azure], [gdrive], [gs], [oss], [ssh]: Individual backends
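For example, to install DVC together with the S3 backend (the quotes keep the shell from expanding the brackets):
$ pip install "dvc[s3]"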
18. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url], $ dvc push
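Putting the steps together on a concrete file (the filename, remote name, and bucket below are illustrative, not the workshop's actual paths):
$ git init && dvc init
$ git commit -m "Initialize DVC"
$ dvc add data/wind_turbines.csv
$ git add data/.gitignore data/wind_turbines.csv.dvc
$ git commit -m "Initial dataset commit"
$ dvc remote add -d storage s3://my-bucket/dvcstore
$ dvc push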
19. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url]
Initializes a git repository and creates
● .git/
Directory used for git tracking
20. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url]
Initializes a DVC project with the following files
● .dvcignore
Used to exclude files from DVC tracking
(similar to .gitignore)
● .dvc/.gitignore
Used to exclude DVC-internal files from git
tracking
● .dvc/config
Configuration file which may be edited
manually or with $ dvc config
● .dvc/plots/
Directory for the plotting functionality of DVC
21. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url]
Commits the (automatically staged) DVC files to the
git repository
22. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url]
Adds data files or directories for DVC tracking (analogous to the git add command). The command creates a new file:
● [filename].dvc
Acts as a placeholder for the data; this small file is what gets versioned with git
23. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url]
Stages the newly created files and commits them
to the git repository.
The .gitignore was updated when running
dvc add to exclude that data file from git
versioning (because we are versioning the data.dvc
file, not the data file itself).
24. Confidential
Setting up repo for data versioning
1. Initialize git repository
$ git init
2. Initialize DVC
$ dvc init
3. Commit the created files to the git repo
$ git commit -m "Initialize DVC"
4. Add the data file for DVC tracking
$ dvc add [filename]
5. Commit the newly created files to the git repo
$ git add .gitignore [filename].dvc
$ git commit -m "Initial dataset commit"
6. Optional: Set up remote storage and push data to it
$ dvc remote add [name] [url], $ dvc push
Adding remote storage backend.
Backend-specific instructions can be found here:
dvc.org/doc/command-reference/remote/add
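Once a remote is set up, collaborators can fetch the data with dvc pull, and you can move between data versions by checking out an older .dvc file and letting DVC sync the workspace. A sketch (the tag name and filename are illustrative):
$ dvc pull
$ git checkout v1.0 -- data/wind_turbines.csv.dvc
$ dvc checkout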
25. Hands-on Exercise: DVC
Go to http://localhost:8888/ to access the remote instance of Jupyter Lab, then navigate to /notebooks/ and open the dvc-exercise.ipynb notebook.
27. ● Experiment tracking is the process of saving all experiment-related information that you care about, for every experiment you run.
● Experiment tracking is different from ML model management: it focuses on the iterative model development phase, when you try many things to get your model performance to the level you need.
MLOps: why tracking?
[Diagram: in the MLOps workflow, experiment tracking covers data ingestion, data tracking, data transformation, model training, model evaluation, and model architecture, while model management covers model versioning, model deployment, and production prediction.]
28. Machine Learning Platforms
■ Big Data Companies:
■ Deal with this using proprietary solutions that standardize the data prep/training/deployment loop, so as long as you stay in the ecosystem, all is good!
■ Tied to the company's infrastructure
■ Out of luck if you want to migrate somewhere else
■ May be limited to certain algorithms or frameworks
■ An open source solution is needed
29. Confidential
Tracking tools (open source)
Tool | Description | Creator | Language | GitHub ⭐
MLflow | focuses on the entire ML lifecycle; works with any ML library | Databricks | Python | 11.5k
DVC | focuses on data versioning | Iterative | Python | 9.5k
Kubeflow | use with Kubernetes but define tasks in Python; difficult setup (preferably with GCP) | Google | Python | 11.3k
ClearML | complicated installation; unstable | Allegro.AI | Python | 3.1k
Guild AI | easy to start with; immature, lightweight, basic interface | open-source community | Python | 678
30. Confidential
Machine Learning Platforms
■ Big Data Companies:
■ Deal with this using proprietary solutions that standardize the data prep/training/deployment loop, so as long as you stay in the ecosystem, all is good!
■ Tied to the company's infrastructure
■ Out of luck if you want to migrate somewhere else
■ May be limited to certain algorithms or frameworks
■ An open source solution is needed
31. Confidential
Why MLflow
■ Open source, not tied to a particular platform/company
■ Runs the same way everywhere (locally or in the cloud)
■ Useful from 1 developer to 100+ developers
■ Design philosophy:
1. API-first
2. Integration with popular libraries
3. A high-level "model" abstraction that can be deployed everywhere
4. An open interface that enables contributions from the community
5. Modular design (distinct components can be used separately)
35. Confidential
MLflow Tracking
What do we track?
■ Parameters: inputs to our code (mlflow.log_param())
■ Metrics: numeric values to assess our models (mlflow.log_metric())
■ Tags/Notes: info about the run (mlflow.set_tag())
■ Artifacts: files, data, and models produced (mlflow.log_artifact(), mlflow.log_artifacts())
■ Source: what code ran
■ Version: which version of the code ran (e.g., the GitHub commit)
■ Run: the particular code execution (id) captured by MLflow (mlflow.start_run())
■ Experiment: the set of runs (mlflow.create_experiment(), mlflow.set_experiment())
More on: https://www.mlflow.org/docs/latest/tracking.html
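A minimal sketch of how these calls fit together (the experiment name, parameter, metric, tag, and artifact path are illustrative, not the workshop's actual code):

import mlflow

mlflow.set_experiment("turbine-malfunction")       # hypothetical experiment name
with mlflow.start_run():                           # the run captures source & version
    mlflow.log_param("n_estimators", 100)          # an input to our code
    mlflow.log_metric("f1_score", 0.87)            # a value to assess the model
    mlflow.set_tag("model_type", "random_forest")  # info about the run
    mlflow.log_artifact("confusion_matrix.png")    # a file produced by the run (must exist on disk)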
36. Confidential
MLflow UI
■ The MLflow UI is used to compare the models that you have produced. You can:
■ connect to a remote MLflow tracking server, whose address you set with mlflow.set_tracking_uri(), or
■ connect to a local MLflow tracking server instance by running the CLI command mlflow ui in the same working directory as the one that contains the mlruns folder
(URI: Uniform Resource Identifier; UI: User Interface)
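For example (the server address is illustrative):

import mlflow
mlflow.set_tracking_uri("http://localhost:5000")  # point the client at a tracking server

or, for a purely local setup, from the directory containing mlruns:
$ mlflow ui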
37. Confidential
MLflow Model
■ Standard format for packaging machine learning models in MLflow
■ Defines a convention that lets you save a model in different “flavors” that can be
understood by different downstream tools
# Directory written by mlflow.sklearn.save_model(model, "my_model")
my_model/
├── MLmodel
├── model.pkl
├── conda.yaml
└── requirements.txt

# In the MLmodel file
time_created: 2021-10-25T17:28:53.35
flavors:
  sklearn:
    sklearn_version: 0.24.1
    pickled_model: model.pkl
  python_function:
    loader_module: mlflow.sklearn
38. Confidential
python_function model flavor
■ python_function is the default model interface for MLflow Python models
■ It:
1. allows any Python model to be productionized in a variety of environments
2. has a generic filesystem model format
3. is self-contained (includes all the information necessary to load and use a model)

# LOG
from mlflow import pyfunc
# log model as an artifact
mlflow.pyfunc.log_model(param)
# or save model to storage
mlflow.pyfunc.save_model(param)

# LOAD
from mlflow import pyfunc
# load the model
pyfunc_model = pyfunc.load_model(param)
# predict with the model
y_pred = pyfunc_model.predict(X_test)
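A self-contained sketch with a scikit-learn model (the toy dataset and model are illustrative, not the workshop's actual code):

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# train a toy model
X, y = make_classification(n_samples=100, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# log it under the sklearn flavor
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")

# load it back through the generic python_function interface
pyfunc_model = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
y_pred = pyfunc_model.predict(X)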
39. Confidential
MLflow Model Registry
■ It is a centralized model store, set of APIs, and UI, to collaboratively manage the full
lifecycle of an MLflow Model.
■ Provides model lineage (which MLflow experiment and run produced the model),
model versioning, stage transitions (for example from staging to production), and
annotations.
■ You register a model through:
1. the API workflow
2. the UI workflow

# register a model via the API
res = mlflow.register_model(my_model_uri, "my_model")
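Once registered, a model version can be promoted through stages via the client API (a sketch; the model name and version are illustrative):

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(name="my_model", version=1, stage="Production")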
41. Hands-on Exercise: Tracking
Go to http://localhost:8888/ to access the remote instance of Jupyter Lab, then navigate to /cd4ml-workshop/notebooks/ and open the mlflow-exercise.ipynb notebook.
43. How do we bring everything together?
■ Up to this point we have built the core components of our ML training pipeline
■ Now we want to combine these components to run the pipeline end-to-end
■ This could be done with a simple bash/Python script that runs all the components sequentially…
■ However, we would like to:
■ Have a scheduler that is aware of how the components interact and knows when to run each component as the pipeline grows in complexity
■ Be able to restart the pipeline from a certain component after a crash
■ Visually inspect and monitor our pipeline
[Diagram: Data ingestion → Data tracking → Data validation & transformation → Model training → Model validation → Model registry]
44. Meet pipeline orchestration
■ To combine the components and add the desired functionalities, we need a pipeline orchestrator, which:
■ Helps to automate and execute workflows
■ Schedules the different components and coordinates dependencies among them
■ Abstracts away the details of the computing environment and provides a UI to monitor and interact with pipeline runs
[Diagram: Data ingestion → Data tracking → Data validation & transformation → Model training → Model validation → Model registry]
45. Orchestration tools
Tool | Description | First release | Creator | Language | GitHub ⭐
Luigi | easy to get started; low complexity | 2015 | Spotify | Python | 15.5k
Apache Airflow | includes UI; feature-rich, mature, complex | 2016 | Airbnb | Python | 25.1k
Kubeflow | use with Kubernetes but define tasks in Python; difficult setup (preferably with GCP) | 2018 | Google | Python | 11.3k
Argo Workflows | use with Kubernetes (alternative to Kubeflow); define tasks with YAML | 2017 | open-source community | YAML | 10.6k
Prefect | easy to get started for Python programmers; immature; open-core | 2019 | Prefect | Python | 8.5k
→ We chose Airflow due to its maturity, popularity (support of the open source community), flexibility, and easy-to-use UI
46. Airflow concepts and components
Airflow Webserver
- Hosts Airflow UI (Flask App)
- Monitor runs, investigate logs, etc.
DAG (ML Pipeline)
- Built with Operators
- Triggered:
- automatically by Airflow Scheduler, or
- manually from Airflow UI
Operators (Tasks)
- Defined in Python, Bash, SQL, etc.
- Run on Executor
Metadata database (SQLAlchemy)
- Tracks how components interact
- All processes read from / write to it
Airflow Scheduler
- Multithreaded Python process
- Schedules workflows and tasks
Executor (specified in airflow.cfg)
- local executors: run tasks inside scheduler
- remote executors: run tasks remotely
(e.g. Kubernetes)
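The executor is selected in airflow.cfg; an illustrative excerpt:

[core]
executor = LocalExecutor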
47. How to set up Airflow
■ With Docker: convenient to manage dependencies & isolate Airflow components (documentation)
■ With Docker Compose:
○ Easy to get started (documentation), but requires Docker Compose expertise for a customized & production-ready deployment
■ With the Helm chart package manager (documentation):
○ Docker-based deployment on Kubernetes, supported by the Airflow community
■ Locally in a virtual environment (documentation); see the example commands below:
○ Set the AIRFLOW_HOME environment variable
○ Install Airflow from PyPI
○ Initialize the metadata database: airflow db init (airflow initdb on Airflow 1.x)
○ Start the Airflow scheduler
○ Start the Airflow webserver
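A minimal sketch of the virtual-environment route, assuming Airflow 2.x (the home path and port are illustrative):
$ export AIRFLOW_HOME=~/airflow
$ pip install apache-airflow
$ airflow db init
$ airflow scheduler
$ airflow webserver --port 8080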
48. How to create an Airflow DAG
■ Create a Python file for the Airflow DAG, including:
1. Define an Airflow DAG object with configuration settings
2. Define Airflow operators for the individual tasks. Airflow
offers pre-defined operators such as:
- Python operators: calling a Python function
- Bash operators: executing a bash command
- SQL operators, MySQL operators, etc.
3. Set instructions on how to connect the various Airflow operators
■ Save the DAG Python file in $AIRFLOW_HOME/dags (a minimal sketch of such a file follows)
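A minimal sketch of a DAG file, assuming Airflow 2.x (the DAG id, schedule, and task contents are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def train_model():
    # placeholder for the real training step
    print("training the model...")

with DAG(
    dag_id="ml_training_pipeline",   # 1. DAG object with configuration settings
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 2. operators for the individual tasks
    ingest = BashOperator(task_id="ingest_data", bash_command="echo 'ingesting data'")
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    # 3. instructions on how the operators connect
    ingest >> train  # run ingest_data before train_model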
52. Interact with the Airflow UI
■ The best way to get familiar with the UI is to experiment yourself…
53. Interact with the Airflow UI
■ The best way to get familiar with the UI is to experiment yourself…
Trigger a DAG run
54. Interact with the Airflow UI
■ The best way to get familiar with the UI is to experiment yourself…
Monitor the pipeline workflow and past runs
55. Interact with the Airflow UI
■ The best way to get familiar with the UI is to experiment yourself…
Inspect the DAG graph and its operators
56. Interact with the Airflow UI
■ The best way to get familiar with the UI is to experiment yourself… Monitor running times
57. Interact with the Airflow UI
Select an individual component to
- view the logs
- mark as failed/success
- clear the run (and trigger automatic
rerun of all downstream components)
58. Now it is your turn!
■ In the next exercise you will run the complete ML training pipeline in Airflow
■ For reference, here is a summary of what the individual components do:
- Data ingestion: load and aggregate the raw data and save it in the project
- Data tracking: track the data with DVC (optional exercise)
- Data validation & transformation: split the raw data into a train and a test split; check that column names, data types, and optionality are as expected; prepare the data for the ML model
- Model training: train the model on the transformed training data
- Model validation: validate the model on the test data and compare its performance with the current production model
- Model registry: register the model and tag it as production (models are tracked & registered with MLflow)
63. Confidential
Deployment with Docker and Flask
■ score.py: contains the actual scoring (inference) logic and two functions, init() and run(input)
■ app.py: launches the API in a Flask web application
■ Dockerfile: all the configuration for building the Docker image
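A minimal sketch of how app.py could look, assuming score.py exposes init() and run(input) as described (the route and port are illustrative):

from flask import Flask, jsonify, request

import score

app = Flask(__name__)
score.init()  # load the model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()        # features sent by the client
    return jsonify(score.run(payload))  # scoring logic lives in score.py

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)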
66. Recap
■ We presented you with a real-world case of how to bring ML to production
■ You learned how to:
■ Track your data
■ Track your models and experiments
■ Orchestrate your ML pipeline
■ Deploy
■ We presented just one specific MLOps approach; there are many different tools and practices that achieve the same goal
■ We hope we provided you with some valuable practices and inspiration for putting your own ML project into production