A demonstration of using the featuretools package to generate features and aggregates from raw relational data, and of using MLflow to track the entire model-building and hyperparameter-optimization process.
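A minimal, stdlib-only sketch of the kind of parent-child aggregates that featuretools automates with Deep Feature Synthesis; the table and column names here are illustrative, not taken from the demo:

```python
from statistics import mean

# Toy relational data: a parent table (customers) and a child table (orders).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 75.0},
    {"order_id": 12, "customer_id": 2, "amount": 40.0},
]

def aggregate_features(parents, children, key, value, aggs):
    """For each parent row, aggregate the child rows that reference it."""
    features = []
    for p in parents:
        vals = [c[value] for c in children if c[key] == p[key]]
        row = dict(p)
        for name, fn in aggs.items():
            row[f"{name}(orders.{value})"] = fn(vals) if vals else 0
        features.append(row)
    return features

feature_matrix = aggregate_features(
    customers, orders, key="customer_id", value="amount",
    aggs={"COUNT": len, "SUM": sum, "MEAN": mean},
)
```

featuretools generalizes this pattern across an entire entity set, stacking aggregation and transformation primitives to arbitrary depth.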
Productionizing ML Models Using MLflow Model Serving (Databricks)
Productionizing ML models requires ensuring model integrity: efficiently replicating runtime environments across servers and keeping track of how each of our models was created. This helps us trace the root cause of changes and issues over time as we acquire new data and update our models, and gives us greater accountability over our models and the results they generate.
MLflow Model Serving delivers cost-effective, one-click deployment of models for real-time inference. Model versions deployed to Model Serving can also be conveniently managed with the MLflow Model Registry. We will cover three topics: deployment, consumption, and monitoring. For deployment, we will demo deploying different model versions and validating the deployment. For consumption, we will demo connecting Power BI and generating a prediction report using an ML model deployed in MLflow Serving. Lastly, we will wrap up with managing MLflow Serving, including access rights and monitoring capabilities.
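For the consumption step, a scoring request to a served model can be sketched as follows. This assumes the JSON `dataframe_split` input format accepted by recent MLflow scoring servers; the endpoint URL, token, and column names are placeholders, not details from the talk:

```python
import json
from urllib import request

# Example feature row; column names are illustrative.
payload = {
    "dataframe_split": {
        "columns": ["age", "income"],
        "data": [[42, 55000.0]],
    }
}
body = json.dumps(payload).encode("utf-8")

req = request.Request(
    url="https://<databricks-instance>/model/my-model/1/invocations",  # placeholder
    data=body,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <personal-access-token>",  # placeholder
    },
    method="POST",
)
# Actually sending the request needs a live serving endpoint:
# with request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())
```

Power BI can consume the same endpoint through a web-request data source or a Python script step, which is how the prediction report in the demo would be fed.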
Fifth Elephant 2017 Data Pipeline Workshop (Ketan Khairnar)
This document outlines the phases and topics to be covered in a workshop on data pipelines and instrumentation for an Airbnb clone application called yourbnb. The workshop will cover basic instrumentation of host metrics and app services, implementing audit trails and deployment history, using different data stores for telemetry, events, logs and master data, designing data pipelines, and setting up dashboards to monitor key performance indicators. Hands-on exercises will demonstrate adding metrics, implementing an event sourcing architecture, and setting up data visualization with Grafana.
Version Control in AI/Machine Learning by Datmo (Nicholas Walsh)
The talk starts by outlining the history of conventional version control before explaining QoDs (Quantitative Oriented Developers) and the unique problems their ML systems pose from an operations perspective (MLOps). The only status quo solutions are proprietary in-house pipelines (exclusive to companies like Uber, Google, and Facebook) and manual tracking with fragile "glue" code for everyone else.
Datmo works to solve this issue by empowering QoDs in two ways: making MLOps manageable and simple (rather than completely abstracted away), and reducing the amount of glue code to ensure more robust pipelines.
How to Empower a Platform With a Data Pipeline at Scale (Deepak Sood)
StashFin provides personal loans to individuals in India through a web and mobile platform. They have originated over 620,000 loans since being founded in 2016. To scale their platform, StashFin moved from a monolithic architecture to a microservices architecture using AWS services. This included using S3 for storage, EKS for Kubernetes, and AWS Glue and Athena for analytics. They also designed a data pipeline on AWS to handle a large increase in loan applications. The pipeline uses Redis for caching, S3 as the data lake, and Athena for querying large amounts of data stored in S3. This has allowed for faster decisioning, higher reliability, and cost and performance benefits compared to managing their own infrastructure.
Apache Liminal (Incubating): Orchestrate the Machine Learning Pipeline (Databricks)
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way. The platform provides the abstractions and declarative capabilities for data extraction & feature engineering followed by model training and serving; using standard tools and libraries (e.g. Airflow, K8S, Spark, scikit-learn, etc.).
The document discusses moving from data science to MLOps. It defines MLOps as extending DevOps methodology to include machine learning, data science, and data engineering assets. Key concepts of MLOps include iterative development, automation, continuous integration and delivery, versioning, testing, reproducibility, monitoring, source control, and model/feature stores. MLOps helps address challenges of moving models to production like the deployment gap by establishing best practices and tools for testing, deploying, managing, and monitoring models.
MLOps and Data Quality: Deploying Reliable ML Models in Production (Provectus)
Looking to build a robust machine learning infrastructure to streamline MLOps? Learn from Provectus experts how to ensure the success of your MLOps initiative by implementing Data QA components in your ML infrastructure.
For most organizations, the development of multiple machine learning models, their deployment and maintenance in production are relatively new tasks. Join Provectus as we explain how to build an end-to-end infrastructure for machine learning, with a focus on data quality and metadata management, to standardize and streamline machine learning life cycle management (MLOps).
Agenda
- Data Quality and why it matters
- Challenges and solutions of Data Testing
- Challenges and solutions of Model Testing
- MLOps pipelines and why they matter
- How to expand validation pipelines for Data Quality
This document discusses MLOps at OLX, including:
- The main areas of data science work at OLX like search, recommendations, fraud detection, and content moderation.
- How OLX uses teams structured by both feature areas and roles to collaborate on projects.
- A maturity model for MLOps with levels from no MLOps to fully automated processes.
- How OLX has improved from siloed work to cross-functional teams and adding more automation to model creation, release, and application integration over time.
As the commercial world accelerates investment into AI and machine learning, one theme continually appears. Models are being built, but they are not being used. Teams of data scientists around the world are training versatile models, but due to managerial, logistical and infrastructural problems, these models are not making it to production.
To watch the full presentation with visual and audio click here: https://info.cnvrg.io/ml-models-to-production
In this webinar, Solutions Architect Aaron Schneider will diagnose the problem and identify the symptoms. He’ll explain how problems with reproducibility, scalability and collaboration can increase the gap between research and production. The webinar will examine best practices for building a machine learning pipeline that enables quick iteration, deployment and CI/CD to ensure that your company is deploying and maintaining the best services for your customers and clients.
Key takeaways:
- The common issues that block deployment and increase time to production
- How different stakeholders can resolve key issues
- How to accelerate from research to production
- Tools that can make productionizing models easy
- Leveraging Kubernetes and container-based architecture for faster deployment
Watch the full presentation here: https://info.cnvrg.io/ml-models-to-production
“Houston, we have a model...” Introduction to MLOps (Rui Quintino)
The document introduces MLOps (Machine Learning Operations) and the need to operationalize machine learning models beyond just model deployment. It discusses challenges like data and model drift, retraining models, software dependencies, monitoring models in production, and the need for automation, testing, and reproducibility across the full machine learning lifecycle from data to deployment. An example MLOps workflow is shown using GitHub and Azure ML to enable experiment tracking, automation, and continuous integration and delivery of models.
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus (Manasi Vartak)
These are slides from Manasi Vartak's Strata Talk in March 2020 on Robust MLOps with Open-Source.
* Introduction to talk
* What is MLOps?
* Building an MLOps Pipeline
* Real-world Simulations
* Let’s fix the pipeline
* Wrap-up
Tech Leaders' Guide to Effective Building of Machine Learning Products (Gianmario Spacagna)
This document provides guidance for machine learning product managers and technical leaders on building effective ML products. It discusses introducing ML in enterprises, defining product specifications, planning under uncertainty, and building balanced ML teams. It also covers the ML product lifecycle, including tracking experiments, centralized data storage, automated testing, continuous integration, and serverless architectures. Serverless computing can help simplify deployments, improve scalability, and reduce costs.
How to choose the correct framework and define your manifesto for technology practices around the machine learning journey.
With Kubernetes the clear successor in this space, Seldon Core and Kubeflow are the winners in this segment.
Once a model is deployed, you have a responsibility to ensure its reliability and performance in production. That means that in addition to system monitoring, you should be checking and monitoring its ML health and vitals such as accuracy, bias, and variance as new data comes in. In this online workshop we’ll discuss how to build a system to monitor your machine learning model in production on Kubernetes. You’ll learn to keep track of different models and their performance over time, and how to set up custom alerts for your models. We’ll discuss which vitals to monitor and how to measure performance. Join CTO of cnvrg.io, Leah Kolben, in this hands-on workshop on critical practices for monitoring your machine learning models in production. Using the power of Kubernetes, we’ll build a complete system for model tracking that ensures high-performing models in production.
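As a framework-agnostic illustration of the vitals-tracking idea (not the workshop's actual code), a rolling-accuracy monitor with a threshold alert might look like this; in practice the alert value would be exported to a system like Prometheus or Grafana:

```python
from collections import deque

class ModelMonitor:
    """Track a rolling window of prediction outcomes and flag accuracy drops."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.threshold = threshold

    def record(self, prediction, label):
        self.outcomes.append(1 if prediction == label else 0)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def check_alert(self):
        """Return True when rolling accuracy falls below the threshold."""
        acc = self.accuracy
        return acc is not None and acc < self.threshold

monitor = ModelMonitor(window=4, threshold=0.75)
for pred, label in [(1, 1), (0, 0), (1, 0), (1, 0)]:
    monitor.record(pred, label)
# Rolling accuracy is now 2/4 = 0.5, below the 0.75 threshold.
```

The same shape extends to other vitals (bias, variance, latency): keep a window per metric and alert on the windowed statistic rather than on single predictions.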
Watch the full presentation with video and audio here: https://info.cnvrg.io/monitor-machine-learning-model-workshop
What you’ll learn:
- Why we monitor models in production
- The critical vitals to track and monitor performance
- How to set up automated alerts
- How to set up Kubernetes for monitoring
- Use tools like Grafana and Kibana to monitor and visualize your system and ML health
To watch the full live presentation click here: https://info.cnvrg.io/monitor-machine-learning-model-workshop
Using MLOps to Bring ML to Production/The Promise of MLOps (Weaveworks)
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and with how to orchestrate between them? While DevOps and GitOps have gained huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Streamlining your machine learning pipeline is critical for enterprise data science to deliver better business results. Accelerating the process from data to processing to training to deployment and back again will help you get better-performing models, faster. Watch the full presentation with audio and video here: https://info.cnvrg.io/build-machine-learning-pipelines
This presentation will offer solutions to the common challenges data scientists and data engineers face when building a machine learning pipeline.
We will dissect each part of the pipeline and offer strategies on how to design your machine learning pipelines for a more efficient, integrated and automated process. We’ll tackle ways to connect all your data sourcing in one unified location, how to create modular ML components for easy reproducibility, and how to automate MLOps for quick training of models and hyperparameter optimization. We'll streamline frequent deployment of models leveraging the power of Kubernetes. And lastly, you’ll learn to design a monitoring toolkit with Grafana and Kibana for easy CI/CD. Join Solutions Architect Aaron Schneider as he builds an end-to-end machine learning pipeline and explains how to optimize each part for a more efficient workflow.
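The idea of modular, reusable ML components can be sketched as a chain of independent steps that share a context; the step names and the trivial stand-in "model" below are illustrative, not the webinar's code:

```python
class Pipeline:
    """Chain reusable steps; each step takes the shared context dict and returns it."""

    def __init__(self, *steps):
        self.steps = steps

    def run(self, context=None):
        context = dict(context or {})
        for step in self.steps:
            context = step(context)
        return context

# Each step is an independent, swappable component.
def load_data(ctx):
    ctx["rows"] = [(x, 2 * x) for x in range(10)]  # stand-in for unified data sourcing
    return ctx

def train(ctx):
    xs, ys = zip(*ctx["rows"])
    ctx["model"] = {"slope": sum(ys) / sum(xs)}  # trivial stand-in for training
    return ctx

def evaluate(ctx):
    m = ctx["model"]
    ctx["mae"] = sum(abs(m["slope"] * x - y) for x, y in ctx["rows"]) / len(ctx["rows"])
    return ctx

result = Pipeline(load_data, train, evaluate).run()
```

Because each step only depends on the context dict, any component can be rerun, replaced, or tested in isolation, which is what makes the pipeline reproducible and easy to automate.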
Key webinar takeaways:
- Set up an efficient machine learning pipeline
- Learn key MLOps solutions streamlining science and engineering
- Create reusable ML components
- Build a suite of monitoring and visualization tools
- Instantly train and deploy ML models with Kubernetes
- Use CI/CD to design an auto-adaptive machine learning pipeline
Watch the full presentation here: https://info.cnvrg.io/build-machine-learning-pipelines
Reproducible AI Using PyTorch and MLflow (Databricks)
Model reproducibility is becoming the next frontier for successful AI model building and deployment in both research and production scenarios. In this talk we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability, and speed up collaboration for AI projects.
This document discusses MLOps, which is applying DevOps practices and principles to machine learning to enable continuous delivery of ML models. It explains that ML models need continuous improvement through retraining but data scientists currently lack tools for quick iteration, versioning, and deployment. MLOps addresses this by providing ML pipelines, model management, monitoring, and retraining in a reusable workflow similar to how software is developed. Implementing even a basic CI/CD pipeline for ML can help iterate models more quickly than having no pipeline at all. The document encourages building responsible AI through practices like ensuring model performance and addressing bias.
Dependency Inversion Using Ports and Adapters (Mahfuzul Haque)
This document discusses the Ports and Adapters architecture for decoupling an application's core business logic from external dependencies and allowing different services to be plugged in. The aims are to decouple the core logic, allow different services to be plugged in and removed easily, and make the application framework agnostic. An example order processing application is used to show how it evolves from being tightly coupled to external dependencies to following the Ports and Adapters pattern using interfaces, ports, and adapters to isolate the core logic. Code examples are provided in a GitHub repository linked in the document.
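The evolution the document describes can be sketched in Python; the `PaymentPort` interface and adapter names are hypothetical stand-ins for the example order-processing application, not the code from the linked repository:

```python
from abc import ABC, abstractmethod

# Port: the interface the core business logic depends on.
class PaymentPort(ABC):
    @abstractmethod
    def charge(self, amount: float) -> bool: ...

# Core logic depends only on the port, never on a concrete external service.
class OrderProcessor:
    def __init__(self, payments: PaymentPort):
        self.payments = payments

    def place_order(self, amount: float) -> str:
        return "confirmed" if self.payments.charge(amount) else "declined"

# Adapter: plugs a concrete (here, in-memory fake) payment service into the port.
class FakePaymentAdapter(PaymentPort):
    def __init__(self, balance: float):
        self.balance = balance

    def charge(self, amount: float) -> bool:
        if amount <= self.balance:
            self.balance -= amount
            return True
        return False

processor = OrderProcessor(FakePaymentAdapter(balance=100.0))
```

Swapping `FakePaymentAdapter` for a real gateway adapter requires no change to `OrderProcessor`, which is the framework-agnostic decoupling the pattern aims for.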
CI/CD (Continuous Integration/Continuous Deployment) has long been a successful process for most software applications. The same can be done with machine learning applications, offering automated continuous training and continuous deployment of machine learning models. Using CI/CD for machine learning applications creates a truly end-to-end pipeline that closes the feedback loop at every step of the way and maintains high-performing ML models. It can also bridge science and engineering tasks, causing less friction from data, to modeling, to production and back again. Join CEO of cnvrg.io Yochay Ettun as he walks you through how to create a CI/CD pipeline for machine learning and set up continuous deployment in just one click. With a depth of knowledge in all the latest research, Yochay will share with you today's top methods for applying CI/CD to machine learning.
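One building block of such a loop is a retraining trigger. A drift- and data-volume-based trigger might be sketched as follows; the thresholds and parameter names are illustrative assumptions, not the webinar's implementation:

```python
def should_retrain(baseline_accuracy, recent_accuracy, new_rows, *,
                   max_drop=0.05, min_new_rows=1000):
    """Trigger retraining when accuracy drifts or enough new data accumulates."""
    drifted = (baseline_accuracy - recent_accuracy) > max_drop
    enough_data = new_rows >= min_new_rows
    return drifted or enough_data

# Example trigger decisions:
should_retrain(0.90, 0.82, new_rows=200)    # accuracy dropped by 0.08 -> True
should_retrain(0.90, 0.89, new_rows=5000)   # enough new data -> True
should_retrain(0.90, 0.89, new_rows=200)    # neither condition -> False
```

In a real pipeline this predicate would run on a schedule or on data-arrival events, and a `True` result would kick off the training stage of the CI/CD pipeline.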
Webinar takeaways:
Configure and execute continuous training and continuous deployment for ML
Define dependencies and triggers
Automatically connect data pipeline, machine learning pipeline and deployment pipelines
Integrate model bias detection or fairness and accuracy validations
Build monitoring infrastructure to close the data feedback loop
Collect live data for improved model performance
Watch all our webinars at https://cnvrg.io/webinars-and-workshops/
Given at the MLOps Summit 2020 - I cover the origins of MLOps in 2018, how MLOps has evolved from 2018 to 2020, and what I expect for the future of MLOps.
NLP Text Recommendation System Journey to Automated Training (Databricks)
The document discusses the goal of building an NLP text recommender system that provides customer service agents with relevant answers to customer questions; the approach taken, including features for an ML ranking model and an architecture for serving recommendations and training models; and the system's evolution over multiple versions to support multi-tenancy, dynamic training, and rollbacks.
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers, such as tracking experiments, reproducibility, deployment tooling, and model versioning. Get ready to get your hands dirty with a quick ML project using MLflow, released to production, to understand the MLOps lifecycle.
Richard Coffey (x18140785) - Research in Computing CA2
The document discusses applying DevOps practices to machine learning algorithms through MLOps. It defines MLOps as combining DevOps practices with machine learning to improve the reliability and deployment of ML models. The document outlines using Microsoft's Azure ML tools to develop a custom ML application and deploy it using MLOps pipelines, then surveying ML professionals on the value of MLOps. It proposes tracking project progression through a Gantt chart.
Managing the Complete Machine Learning Lifecycle with MLflow (Databricks)
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.
To solve these challenges, Databricks last year unveiled MLflow, an open source project that aims at simplifying the entire ML lifecycle. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
In the past year, the MLflow community has grown quickly: over 120 contributors from over 40 companies have contributed code to the project, and over 200 companies are using MLflow.
In this tutorial, we will show you how using MLflow can help you:
Keep track of experiment runs and results across frameworks.
Execute projects remotely onto a Databricks cluster, and quickly reproduce your runs.
Quickly productionize models using Databricks production jobs, Docker containers, Azure ML, or Amazon SageMaker.
We will demo the building blocks of MLflow as well as the most recent additions since the 1.0 release.
What you will learn:
Understand the three main components of open source MLflow (MLflow Tracking, MLflow Projects, MLflow Models) and how each help address challenges of the ML lifecycle.
How to use MLflow Tracking to record and query experiments: code, data, config, and results.
How to use MLflow Projects packaging format to reproduce runs on any platform.
How to use MLflow Models general format to send models to diverse deployment tools.
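As a toy illustration of what MLflow Tracking records per run (a stdlib-only stand-in, not MLflow's actual API), a minimal in-memory tracker could look like this:

```python
import time
import uuid

class RunTracker:
    """Toy stand-in for MLflow Tracking: record params and metrics per run."""

    def __init__(self):
        self.runs = []

    def start_run(self):
        run = {"run_id": uuid.uuid4().hex, "start_time": time.time(),
               "params": {}, "metrics": {}}
        self.runs.append(run)
        return run

    def log_param(self, run, key, value):
        run["params"][key] = value

    def log_metric(self, run, key, value):
        run["metrics"][key] = value

    def search_runs(self, metric, minimum):
        """Query runs, e.g. all runs whose metric meets a threshold."""
        return [r for r in self.runs if r["metrics"].get(metric, 0) >= minimum]

# Record two hyperparameter trials, then query for the best.
tracker = RunTracker()
for lr, acc in [(0.1, 0.82), (0.01, 0.91)]:
    run = tracker.start_run()
    tracker.log_param(run, "learning_rate", lr)
    tracker.log_metric(run, "accuracy", acc)

best = tracker.search_runs("accuracy", minimum=0.9)
```

MLflow Tracking provides the same record-and-query shape (runs, params, metrics, plus artifacts and code version) behind a persistent store and a comparison UI.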
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
Pre-Register for a Databricks Standard Trial
Basic knowledge of Python programming language
Basic understanding of Machine Learning Concepts
MLflow: Platform for the Complete Machine Learning Lifecycle (Databricks)
Description
Data Science and ML development bring many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work.
MLflow addresses some of these challenges during an ML model development cycle.
Abstract
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
With a short demo of a complete ML model life-cycle example, you will walk away with:
- MLflow concepts and abstractions for models, experiments, and projects
- How to get started with MLflow
- Using the tracking Python APIs during model training
- Using the MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics
As the commercial world accelerates investment into AI and Machine Learning one theme continually appears. Models are being built, but they are not being used. Teams of Data Scientists around the world are training versatile models but due to managerial, logistical and infrastructural problems, these models are not making it to production.
To watch the full presentation with visual and audio click here: https://info.cnvrg.io/ml-models-to-production
In this webinar, Solutions Architect Aaron Schneider will diagnose the problem and identify the symptoms. He’ll explain how reproducibility, scalability and collaboration can increase the gap between research and production. The webinar will examine best practices for building a machine learning pipeline that enables quick iteration, deployment and CI/CD to ensure that your company is deploying and maintaining the best services for you customers and clients.
Key takeaways:
- The common issues that block deployment and increase time to production
- How different stakeholders can resolve key issues
- How to accelerate from research to production
Tools that can make productionizing models easy
- Leveraging Kubernetes and container-based architecture for faster deployment
Watch the full presentation here: https://info.cnvrg.io/ml-models-to-production
“Houston, we have a model...” Introduction to MLOpsRui Quintino
The document introduces MLOps (Machine Learning Operations) and the need to operationalize machine learning models beyond just model deployment. It discusses challenges like data and model drift, retraining models, software dependencies, monitoring models in production, and the need for automation, testing, and reproducibility across the full machine learning lifecycle from data to deployment. An example MLOps workflow is shown using GitHub and Azure ML to enable experiment tracking, automation, and continuous integration and delivery of models.
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusManasi Vartak
These are slides from Manasi Vartak's Strata Talk in March 2020 on Robust MLOps with Open-Source.
* Introduction to talk
* What is MLOps?
* Building an MLOps Pipeline
* Real-world Simulations
* Let’s fix the pipeline
* Wrap-up
Tech leaders guide to effective building of machine learning productsGianmario Spacagna
This document provides guidance for machine learning product managers and technical leaders on building effective ML products. It discusses introducing ML in enterprises, defining product specifications, planning under uncertainty, and building balanced ML teams. It also covers the ML product lifecycle, including tracking experiments, centralized data storage, automated testing, continuous integration, and serverless architectures. Serverless computing can help simplify deployments, improve scalability, and reduce costs.
How to choose correct framework and define your manifesto for technology practices around Machine Learning Journey.
Kubernetes being successor in this space, Seldom Core and Kubeflow is truly winner in this Segment.
Once a model is deployed, you have a responsibility to ensure its reliability and performance in production. That means that in addition to system monitoring, you should be checking and monitoring its ML health and vitals such as accuracy, bias, and variance as new data comes in. In this online workshop we’ll discuss how to build a system to monitor your machine learning model in production on Kubernetes. You’ll learn to keep track of different models and their model performance over time, and how to set up custom alerts for your models. We’ll discuss what types of variants to monitor, and how to measure its performance. Join CTO of cnvrg.io, Leah Kolben in this hands-on workshop on critical practices for monitoring your machine learning models in production. Using the power of Kubernetes, we’ll build a complete system for model tracking that ensures high performing models in production.
Watch the full presentation with video and audio here: https://info.cnvrg.io/monitor-machine-learning-model-workshop
What you’ll learn:
- Why we monitor models in production
- The critical vitals to track and monitor performance
- How to set up automated alerts
- How to set up Kubernetes for monitoring
- Use tools like Grafana and Kibana to monitor and visualize your system and ML health
Using MLOps to Bring ML to Production/The Promise of MLOpsWeaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and how to orchestrate between them? While DevOps and GitOps have made huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Streamlining your machine learning pipeline is critical for enterprise data science to deliver better business results. Accelerating the process from data, to processing to training to deployment and back again will help you get better performing models, faster. Watch the full presentation with audio and video here: https://info.cnvrg.io/build-machine-learning-pipelines
This presentation will offer solutions to the common challenges data scientists and data engineers face when building a machine learning pipeline.
We will dissect each part of the pipeline and offer strategies on how to design your machine learning pipelines for a more efficient, integrated and automated process. We’ll tackle ways to connect all your data sourcing in one unified location, how to create modular ML components for easy reproducibility, and how to automate MLOps for quick training of models and hyperparameter optimization. You’ll streamline frequent deployment of models leveraging the power of Kubernetes, and lastly, you’ll learn to design a monitoring toolkit with Grafana and Kibana for easy CI/CD. Join Solutions Architect Aaron Schneider as he builds an end-to-end machine learning pipeline and explains how to optimize each part for a more efficient workflow.
Key webinar takeaways:
- Set up an efficient machine learning pipeline
- Learn key MLOps solutions streamlining science and engineering
- Create reusable ML components
- Build a suite of monitoring and visualization tools
- Instantly train and deploy ML models with Kubernetes
- Use CI/CD to design an auto-adaptive machine learning pipeline
Reproducible AI Using PyTorch and MLflowDatabricks
Model reproducibility is becoming the next frontier for successfully building and deploying AI models, in both research and production scenarios. In this talk we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability, and speed up collaboration on AI projects.
This document discusses MLOps, which is applying DevOps practices and principles to machine learning to enable continuous delivery of ML models. It explains that ML models need continuous improvement through retraining but data scientists currently lack tools for quick iteration, versioning, and deployment. MLOps addresses this by providing ML pipelines, model management, monitoring, and retraining in a reusable workflow similar to how software is developed. Implementing even a basic CI/CD pipeline for ML can help iterate models more quickly than having no pipeline at all. The document encourages building responsible AI through practices like ensuring model performance and addressing bias.
Dependency inversion using ports and adaptersMahfuzul Haque
This document discusses the Ports and Adapters architecture for decoupling an application's core business logic from external dependencies and allowing different services to be plugged in. The aims are to decouple the core logic, allow different services to be plugged in and removed easily, and make the application framework agnostic. An example order processing application is used to show how it evolves from being tightly coupled to external dependencies to following the Ports and Adapters pattern using interfaces, ports, and adapters to isolate the core logic. Code examples are provided in a GitHub repository linked in the document.
CI/CD (Continuous Integration/ Continuous Deployment) has long been a successful process for most software applications. The same can be done with Machine Learning applications, offering an automated and continuous training and continuous deployment of machine learning models. Using CI/CD for machine learning applications creates a truly end-to-end pipeline that closes the feedback loop at every step of the way, and maintains high performing ML models. It can also bridge science and engineering tasks, causing less friction from data, to modeling, to production and back again. Join CEO of cnvrg.io Yochay Ettun as he brings you through how to create a CI/CD pipeline for machine learning, and set up continuous deployment in just one click. With a depth of knowledge in all the latest research, Yochay will share with you today's top methods for applying CI/CD to machine learning.
Webinar takeaways:
Configure and execute continuous training and continuous deployment for ML
Define dependencies and triggers
Automatically connect data pipeline, machine learning pipeline and deployment pipelines
Integrate model bias detection or fairness and accuracy validations
Build monitoring infrastructure to close the data feedback loop
Collect live data for improved model performance
Watch all our webinars at https://cnvrg.io/webinars-and-workshops/
Given at the MLOps Summit 2020 - I cover the origins of MLOps in 2018, how MLOps has evolved from 2018 to 2020, and what I expect for the future of MLOps
NLP Text Recommendation System Journey to Automated TrainingDatabricks
The document discusses the goal of building an NLP text recommender system to provide customer service agents with relevant answers to customer questions, the approach taken including developing features for an ML ranking model and architecture for recommendations serving, model training, and system evolution over multiple versions to support multi-tenancy, dynamic training, and rollbacks.
MLflow is an MLOps tool that enables data scientists to quickly productionize their Machine Learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers such as tracking experiments, reproducibility, deployment tooling and model versioning. Get ready to get your hands dirty with a quick ML project using MLflow, releasing it to production to understand the MLOps lifecycle.
Richard Coffey (x18140785) - Research in Computing CA2Richard Coffey
The document discusses applying DevOps practices to machine learning algorithms through MLOps. It defines MLOps as combining DevOps practices with machine learning to improve the reliability and deployment of ML models. The document outlines using Microsoft's Azure ML tools to develop a custom ML application and deploy it using MLOps pipelines, then surveying ML professionals on the value of MLOps. It proposes tracking project progression through a Gantt chart.
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.
To solve for these challenges, Databricks unveiled last year MLflow, an open source project that aims at simplifying the entire ML lifecycle. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
In the past year, the MLflow community has grown quickly: over 120 contributors from over 40 companies have contributed code to the project, and over 200 companies are using MLflow.
In this tutorial, we will show you how using MLflow can help you:
Keep track of experiment runs and results across frameworks.
Execute projects remotely on a Databricks cluster, and quickly reproduce your runs.
Quickly productionize models using Databricks production jobs, Docker containers, Azure ML, or Amazon SageMaker.
We will demo the building blocks of MLflow as well as the most recent additions since the 1.0 release.
What you will learn:
Understand the three main components of open source MLflow (MLflow Tracking, MLflow Projects, MLflow Models) and how each helps address challenges of the ML lifecycle.
How to use MLflow Tracking to record and query experiments: code, data, config, and results.
How to use MLflow Projects packaging format to reproduce runs on any platform.
How to use MLflow Models general format to send models to diverse deployment tools.
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
Pre-Register for a Databricks Standard Trial
Basic knowledge of Python programming language
Basic understanding of Machine Learning Concepts
MLFlow: Platform for Complete Machine Learning Lifecycle Databricks
Description
Data Science and ML development bring many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work.
MLflow addresses some of these challenges during an ML model development cycle.
Abstract
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
Through a short demo of a complete ML model life-cycle example, you will walk away with: MLflow concepts and abstractions for models, experiments, and projects; how to get started with MLflow; using the tracking Python APIs during model training; and using the MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics.
mlflow: Accelerating the End-to-End ML lifecycleDatabricks
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
DevBCN Vertex AI - Pipelines for your MLOps workflowsMárton Kodok
In recent years, one of the biggest trends in applications development has been the rise of Machine Learning solutions, tools, and managed platforms. Vertex AI is a managed unified ML platform for all your AI workloads. On the MLOps side, Vertex AI Pipelines solutions let you adopt experiment pipelining beyond the classic build, train, eval, and deploy a model. It is engineered for data scientists and data engineers, and it’s a tremendous help for those teams who don’t have DevOps or sysadmin engineers, as infrastructure management overhead has been almost completely eliminated. Based on practical examples we will demonstrate how Vertex AI Pipelines scores high in terms of developer experience, how it fits custom ML needs, and how to analyze results. It’s a toolset for a fully-fledged machine learning workflow, a sequence of steps in the model development and deployment cycle, such as data preparation/validation, model training, hyperparameter tuning, model validation, and model deployment. Vertex AI comes with all classic resources plus an ML metadata store, a fully managed feature store, and a fully managed pipelines runner. Vertex AI Pipelines is a managed serverless toolkit, which means you don't have to fiddle with infrastructure or back-end resources to run workflows.
"Managing the Complete Machine Learning Lifecycle with MLflow"Databricks
Machine Learning development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this session, we introduce MLflow, a new open-source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...Databricks
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this session, we introduce MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size. In this deep-dive session, through a complete ML model life-cycle example, you will walk away with:
MLflow concepts and abstractions for models, experiments, and projects
How to get started with MLFlow
Understand aspects of MLflow APIs
Using tracking APIs during model training
Using MLflow UI to visually compare and contrast experimental runs with different tuning parameters and evaluate metrics
Package, save, and deploy an MLflow model
Serve it using MLflow REST API
What’s next and how to contribute
Modern machine learning systems may be very complex and may fall into many pitfalls. It's very easy to unintentionally introduce technical debt into such a complex structure. One approach that solves some of these anti-patterns is a feature store. A feature store is the missing piece filling the gap between raw data and machine learning models. Not only will it help you handle technical debt, but even more importantly it speeds up the time to develop new models.
This document discusses challenges and solutions for machine learning at scale. It begins by describing how machine learning is used in enterprises for business monitoring, optimization, and data monetization. It then covers the machine learning lifecycle from identifying business questions to model deployment. Key topics discussed include modeling approaches, model evolution, standardization, governance, serving models at scale using systems like TensorFlow Serving and Flink, working with data lakes, using notebooks for development, and machine learning with Apache Spark/MLlib.
Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
What MLflow is; what problem it solves in the machine learning lifecycle and how it solves it; how it is used with Databricks; and CI/CD pipelines with Databricks.
MLOps pipelines using MLFlow - From training to productionFabian Hadiji
This talk was given at the Cologne AI and Machine Learning Meetup on April 13, 2023 (https://www.meetup.com/de-DE/cologne-ai-and-machine-learning-meetup/events/291513393/) by Dr. Andreas Weiden, Co-Lead Cloud / Data Engineering at skillbyte: MLOps pipelines using MLFlow - From training to production
In this talk we explore the world of MLOps pipelines and how MLFlow can be used to facilitate workflows for getting your machine learning models from training to production. We will briefly delve into the tracking aspects of MLFlow and how to store experiments and runs. Next, we will move on to an actual use case that involves managing artefacts generated by multiple training pipelines running on a daily schedule. These artefacts are used in prediction services but also in managed vector search engines such as ElasticSearch and Google VertexAI. A simple microservice that polls the MLFlow registry is used to update both REST-APIs running in Kubernetes and to ingest the models into the vector search services. Finally, we will compare different alternatives that were considered.
The ODAHU project is focused on creating services, extensions for third party systems and tools which help to accelerate building enterprise level systems with automated AI/ML models life cycle.
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
Do you know The Cloud Girl? She makes the cloud come alive with pictures and storytelling.
The Cloud Girl, Priyanka Vergadia, Chief Content Officer @Google, joins us to tell us about Scaleable Data Analytics in Google Cloud.
Maybe, with her explanation, we'll finally understand it!
Priyanka is a technical storyteller and content creator who has created over 300 videos, articles, podcasts, courses and tutorials which help developers learn Google Cloud fundamentals, solve their business challenges and pass certifications! Checkout her content on Google Cloud Tech Youtube channel.
Priyanka enjoys drawing and painting which she tries to bring to her advocacy.
Check out her website The Cloud Girl: https://thecloudgirl.dev/ and her new book: https://www.amazon.com/Visualizing-Google-Cloud-Illustrated-References/dp/1119816327
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudMárton Kodok
The document discusses Vertex AI, Google Cloud's unified machine learning platform. It provides an overview of Vertex AI's key capabilities including gathering and labeling datasets at scale, building and training models using AutoML or custom training, deploying models with endpoints, managing models with confidence through explainability and monitoring tools, using pipelines to orchestrate the entire ML workflow, and adapting to changes in data. The conclusion emphasizes that Vertex AI offers an end-to-end platform for all stages of ML development and productionization with tools to make ML more approachable and pipelines that can solve complex tasks.
Vertex AI: Pipelines for your MLOps workflowsMárton Kodok
The document discusses Vertex AI pipelines for MLOps workflows. It begins with an introduction of the speaker and their background. It then discusses what MLOps is, defining three levels of automation maturity. Vertex AI is introduced as Google Cloud's managed ML platform. Pipelines are described as orchestrating the entire ML workflow through components. Custom components and conditionals allow flexibility. Pipelines improve reproducibility and sharing. Changes can trigger pipelines through services like Cloud Build, Eventarc, and Cloud Scheduler to continuously adapt models to new data.
Using MLflow for the Machine Learning project lifecycleParis Data Engineers !
MLflow is an open source project for managing the lifecycle of machine learning projects (from experimentation to deployment), so they integrate better into the ecosystem that surrounds them.
During this presentation we will show the different components of MLflow and give a demonstration of its use both in the context of a Databricks platform and in a local IDE.
databricks ml flow demonstration using automatic features engineering
1.
2. Overview of a typical machine learning model workflow
Fact #1: Doing machine learning IS complex
3. Fact #2: The hardest part of AI is actually not the AI code...
4.
5. Machine learning projects: main concerns
1- Open source ML ecosystem is crowded: for each phase of the ML process, there is a myriad of tools to choose from;
2- Tracking: it is difficult to track by hand which parameters, code, and data went into each experiment to produce a model, especially when working in teams;
3- Reproducibility: without detailed tracking, teams often have trouble getting the same code to work / achieving the same results.
6. Introducing MLflow
- First release in June 2018
- Latest version v1.5, released 19 Dec 2019
7. MLflow addresses machine learning challenges through its 3 main components
What is MLflow? “MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. It currently offers three components:
1. Tracking  2. Projects  3. Models
9. ML tracking API
Single API + UI to track, for each experiment:
▸ Parameters
▸ Metrics
▸ Artefacts (training datasets, …)
Can be used in a standalone script / from a notebook
11. ML projects
- ML projects define a standard packaging format to manage data science code.
- A project can be a simple directory / git repo with code to run.
- The running environment requirements are defined in a simple YAML file.
(MLflow projects: sample YAML project)
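For illustration, a minimal MLproject file for such a directory could look like this (the project name, entry-point script and parameter are hypothetical):

```yaml
name: fraud_detection          # hypothetical project name

conda_env: conda.yaml          # running environment requirements

entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 100}
    command: "python train.py --n-estimators {n_estimators}"
```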
12. ML Models
- MLflow Models is a convention for packaging machine learning models in multiple formats called “flavors”. MLflow offers a variety of tools to help you deploy different flavors of models.
- Each MLflow Model is saved as a directory containing arbitrary files and an MLmodel descriptor file that lists the flavors it can be used in.
(Example of a scikit-learn model)
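For reference, the MLmodel descriptor inside such a directory lists the flavors roughly as follows (the paths and library versions shown are illustrative):

```yaml
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.7.6
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.22.1
```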
13. Model serving commands
mlflow models serve: deploys the model as a local REST API server.
mlflow models build-docker: packages a REST API endpoint serving the model as a Docker image.
mlflow models predict: uses the model to generate a prediction for a local CSV or JSON file.
15. E-commerce fraud detection
We have some JSON profiles representing fictional customers from an e-commerce company (courtesy of RAVELIN: https://github.com/unravelin/code-test-data-science).
The profiles contain information about the customers, their orders, their transactions, what payment methods they used and whether the customer is fraudulent or not.
Our task:
● Transform the JSON profiles into feature vectors:
a. automated feature engineering using the featuretools package
● Construct a model to predict if a customer is fraudulent based on their profile:
a. modeling phase using Python + scikit-learn
b. track experiment results using Databricks MLflow
16. Model building & tracking process
Transform input data (1/ transactions / orders, 2/ labels: fraudulent true/false): a Python script decodes each user profile JSON array into relational pandas dataframes.
Extract features (count orders, min/max/avg transaction amount...): data aggregation can be done using SQL / Spark SQL / pandas dataframes; in our case we will use the featuretools package to automate this phase.
Store the analytical dataset: a Parquet file with customerID | features X… | label.
Baseline classifier: train a random forest model with default parameters (base AUC).
Tuned classifier: use GridSearchCV to tune the best parameters based on cross-validation results (optimized AUC).
MLflow tracking goes here.
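The baseline-vs-tuned step can be sketched with scikit-learn; the synthetic dataset and the small parameter grid below are stand-ins for the real analytical dataset and search space (the MLflow logging calls are indicated as comments):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for the customerID | features | label analytical dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: random forest with default parameters.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
base_auc = roc_auc_score(y_te, base.predict_proba(X_te)[:, 1])

# Tuned: GridSearchCV picks parameters via cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc", cv=3,
).fit(X_tr, y_tr)
opt_auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
# mlflow.log_metric("base_auc", base_auc)
# mlflow.log_metric("opt_auc", opt_auc)

print(round(base_auc, 3), round(opt_auc, 3))
```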
17. Appendix #1:
Deep feature synthesis, used in the featuretools Python package to generate aggregates / apply transformations on relational data
(source: featurelabs.com)
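As a hand-rolled illustration of what deep feature synthesis automates, the count / min / max / avg aggregates mentioned earlier can be computed with a pandas groupby (the tiny table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical transactions table (one row per transaction).
transactions = pd.DataFrame({
    "customerID": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 40.0, 5.0, 5.0, 20.0],
})

# Per-customer aggregate features, one row per customer.
features = transactions.groupby("customerID")["amount"].agg(
    ["count", "min", "max", "mean"]
)
print(features.loc["b", "mean"])  # mean of 5, 5, 20 -> 10.0
```

Deep feature synthesis generates this kind of aggregate automatically across all related tables, stacking them to arbitrary depth.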
18. Appendix #2: How do random forests work?
Main parameters to tune:
- Max_depth: maximum depth of each tree
- Nb_estimators: # of trees
- Min_rows: minimum number of observations for a leaf
- Col_sample: column sample per tree
- Sample_rate: row sampling rate, default 0.63333
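The parameter names above resemble H2O's random forest; since this demo uses scikit-learn, a rough equivalent parameterization could look like the following (the mapping is approximate, not exact):

```python
from sklearn.ensemble import RandomForestClassifier

# Approximate scikit-learn counterparts of the parameters listed above.
clf = RandomForestClassifier(
    n_estimators=100,     # Nb_estimators: number of trees
    max_depth=10,         # Max_depth: maximum depth of each tree
    min_samples_leaf=5,   # Min_rows: min observations in a leaf
    max_features="sqrt",  # Col_sample: columns sampled (per split here)
    max_samples=0.63333,  # Sample_rate: fraction of rows per tree
    bootstrap=True,       # max_samples requires bootstrap sampling
    random_state=0,
)
print(clf.get_params()["n_estimators"])
```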
19. Complete code on GitHub:
https://github.com/mmejdoubi/mlflow_fraud_ecom/blob/master/ravelin_fraud_RF_mlflow_v1.ipynb