Application performance monitoring (APM) has become a cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications built using machine learning, traditional APM quickly becomes insufficient to identify and remedy the production issues encountered in these modern applications.
As a lead software engineer at New Relic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found that the architectural principles and design choices underlying APM were not a good fit for this brand-new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
Jeeves Grows Up: An AI Chatbot for Performance and Quality (Databricks)
Sarah: CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.
We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:
Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.
This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.
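The kind of answer Jeeves gives above boils down to comparing a run against a historical percentile baseline. A much-simplified sketch of that check (illustrative only, not the actual Jeeves implementation):

```python
def percentile(values, pct):
    """Nearest-rank percentile of a list of historical run durations."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[k]

def check_run(history_hours, current_hours, pct=99):
    """Flag a run whose duration exceeds the historical percentile baseline."""
    baseline = percentile(history_hours, pct)
    if current_hours > baseline:
        slowdown = (current_hours - baseline) / baseline * 100
        return f"running {slowdown:.0f}% slower than the p{pct} baseline of {baseline} hours"
    return "within normal range"

history = [3.1, 3.4, 2.9, 3.6, 5.2, 3.3]   # hypothetical past run times
print(check_run(history, 8.0))  # running 54% slower than the p99 baseline of 5.2 hours
```

A production system like the one described would also need root-cause signals (data skew, failed validations) layered on top of this kind of baseline check.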
Importance of ML Reproducibility & Applications with MLflow (Databricks)
With data as a valuable currency and the architecture of reliable, scalable data lakes and lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on both point-in-time snapshots of the data and a governing mechanism to regulate, track, and make best use of associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions as well as posing solutions built on open source technologies – namely Delta Lake for data versioning and MLflow for efficiency and governance.
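The reproducibility argument above comes down to one record-keeping discipline: a training run is only reproducible if the exact data snapshot version and parameters are captured together. A minimal illustration of that idea (a hypothetical helper, not the Delta Lake or MLflow API):

```python
import hashlib
import json

def run_fingerprint(data_version, params):
    """Deterministic ID tying a training run to its exact data snapshot
    and hyperparameters -- rerunning with the same inputs reproduces it."""
    payload = json.dumps({"data_version": data_version, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

fp1 = run_fingerprint(data_version=42, params={"lr": 0.1, "depth": 6})
fp2 = run_fingerprint(data_version=42, params={"depth": 6, "lr": 0.1})
fp3 = run_fingerprint(data_version=43, params={"lr": 0.1, "depth": 6})
print(fp1 == fp2, fp1 == fp3)  # True False: same inputs match; new data does not
```

In practice Delta Lake's table versions play the role of `data_version` here, and MLflow records the parameters and metrics per run.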
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
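The core idea behind lightweight profiling is to reduce each batch of data to a small set of statistics that can be merged and compared later. A pure-Python sketch of the concept (not the whylogs API):

```python
def profile(column):
    """Collapse a column of values into a small statistical profile."""
    nums = [v for v in column if v is not None]
    return {"count": len(column),
            "nulls": column.count(None),
            "min": min(nums) if nums else None,
            "max": max(nums) if nums else None,
            "sum": sum(nums)}

def merge(a, b):
    """Profiles merge associatively, so they can be built per partition
    (e.g. per Spark task) and combined at the driver."""
    mins = [m for m in (a["min"], b["min"]) if m is not None]
    maxs = [m for m in (a["max"], b["max"]) if m is not None]
    return {"count": a["count"] + b["count"],
            "nulls": a["nulls"] + b["nulls"],
            "min": min(mins) if mins else None,
            "max": max(maxs) if maxs else None,
            "sum": a["sum"] + b["sum"]}

p = merge(profile([1.0, 2.0, None]), profile([5.0, 3.0]))
print(p["count"], p["nulls"], p["min"], p["max"])  # 5 1 1.0 5.0
```

The mergeability property is what lets this style of profiling scale with Spark: each partition is summarized independently and the summaries are combined, rather than shipping raw data anywhere.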
The Critical Missing Component in the Production ML Stack (Databricks)
The day an ML application is deployed to production and begins facing the real world is both the best and the worst day in the life of the model builder. The joy of seeing accurate predictions is quickly overshadowed by a myriad of operational challenges. Debugging, troubleshooting, and monitoring take over the majority of the day, leaving little time for model building. In DevOps, software operations have been elevated to an art: sophisticated tools enable engineers to quickly identify and resolve issues, continuously improving software stability and robustness. In the ML world, operations are still largely a manual process involving Jupyter notebooks and shell scripts. One of the cornerstones of the DevOps toolchain is logging: traces and metrics are built on top of logs, enabling monitoring and feedback loops. What does logging look like in an ML system?
In this talk we will demonstrate how to enable data logging for an AI application using MLflow in a matter of minutes. We will discuss how something so simple enables testing, monitoring and debugging in an AI application that handles TBs of data and runs in real-time. Attendees will leave the talk equipped with tools and best practices to supercharge MLOps in their team.
FlorenceAI: Reinventing Data Science at Humana (Databricks)
Humana strives to help the communities we serve and our individual members achieve their best health – no small task in the past year! We had the opportunity to rethink our existing operations and reimagine what a collaborative ML platform for hundreds of data scientists might look like. The primary goal of our ML platform, named FlorenceAI, is to automate and accelerate the delivery lifecycle of data science solutions at scale. In this presentation, we will walk through an end-to-end example of how to build a model at scale on FlorenceAI and deploy it to production. Tools highlighted include Azure Databricks, MLflow, AppInsights, and Azure Data Factory.
We will employ slides, notebooks and code snippets covering problem framing and design, initial feature selection, model design and experimentation, and a framework of centralized production code to streamline implementation. Hundreds of data scientists now use our feature store, which has tens of thousands of features refreshed on daily and monthly cadences across several years of historical data. We already have dozens of models in production and provide fresh daily insights for our Enterprise Clinical Operating Model. Each day, billions of rows of data are generated to give us timely information.
We already have examples of teams operating orders of magnitude faster and at a scale not within reach using fixed on-premise resources. Given rapid adoption from a dozen pilot users to over 100 MAU in the first 5 months, we will also share some anecdotes about key early wins created by the platform. We want FlorenceAI to enable Humana’s data scientists to focus their efforts where they add the most value so we can continue to deliver high-quality solutions that remain fresh, relevant and fair in an ever-changing world.
Model Monitoring at Scale with Apache Spark and Verta (Databricks)
For any organization whose core product or business depends on ML models (think Slack search, Twitter feed ranking, or Tesla Autopilot), ensuring that production ML models are performing with high efficacy is crucial. In fact, according to the McKinsey report on model risk, defective models have led to revenue losses of hundreds of millions of dollars in the financial sector alone. However, in spite of the significant harms of defective models, tools to detect and remedy model performance issues for production ML models are missing.
Based on our experience building ML debugging and robustness tools at MIT CSAIL and managing large-scale model inference services at Twitter, Nvidia, and now at Verta, we developed a generalized model monitoring framework that can monitor a wide variety of ML models, work unchanged in batch and real-time inference scenarios, and scale to millions of inference requests. In this talk, we focus on how this framework applies to monitoring ML inference workflows built on top of Apache Spark and Databricks. We describe how we can supplement the massively scalable data processing capabilities of these platforms with statistical processors to support the monitoring and debugging of ML models.
Learn how ML Monitoring is fundamentally different from application performance monitoring or data monitoring. Understand what model monitoring must achieve for batch and real-time model serving use cases. Then dig in with us as we focus on the batch prediction use case for model scoring and demonstrate how we can leverage the core Apache Spark engine to easily monitor model performance and identify errors in serving pipelines.
Scaling AutoML-Driven Anomaly Detection With Luminaire (Databricks)
Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major problems before they become pervasive. This is a difficult machine learning and systems problem because temporal patterns are complex, ever changing, and often very noisy, traditionally requiring significant manual configuration and model maintenance.
At Zillow, we have built an orchestration framework around Luminaire, our open-source python library for hands-off time-series Anomaly Detection. Luminaire provides a suite of models and built-in AutoML capabilities which we process with Spark for distributed training and scoring of thousands of metrics. In this talk, we will cover the architecture of this framework and performance of the Luminaire package across detection and prediction accuracy as well as runtime efficiency.
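As a much-simplified illustration of what time-series anomaly detection involves (this is not the Luminaire API), a rolling z-score check flags points that deviate sharply from recent history:

```python
import statistics

def flag_anomalies(series, window=5, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from the
    mean of the preceding `window` points (simplified rolling z-score)."""
    anomalies = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu = statistics.mean(recent)
        sigma = statistics.stdev(recent) or 1e-9  # guard a flat window
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

metric = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]
print(flag_anomalies(metric))  # [7] -- the spike at index 7
```

The hard parts that a library like Luminaire automates sit on top of this basic idea: choosing models per metric, handling trend and seasonality, and retraining as patterns shift, all without manual configuration.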
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
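Evaluating segmentation accuracy, one of the workflow stages above, typically comes down to per-class intersection-over-union between predicted and ground-truth masks. A minimal sketch:

```python
def iou(pred, truth, cls):
    """Intersection-over-union for one class over flattened per-pixel labels."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else 1.0  # class absent in both: perfect

pred  = [0, 1, 1, 1, 0, 2]   # predicted per-pixel class labels
truth = [0, 1, 1, 0, 0, 2]   # ground-truth labels
print(round(iou(pred, truth, cls=1), 3))  # 0.667: 2 shared pixels / 3 in union
```

At scale, the per-image counts would be computed in a distributed job (e.g. with Spark) and summed per class before the final division, since the intersection and union counts are additive across images.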
AI Modernization at AT&T and the Application to Fraud with Databricks (Databricks)
AT&T has been involved in AI from the beginning, with many firsts: first to coin the term “AI,” inventors of R, foundational work on convolutional neural networks, and more. We have applied AI to hundreds of solutions. Today we are modernizing these AI solutions in the cloud with the help of Databricks and a variety of in-house developments. This talk will highlight our AI modernization effort along with its application to fraud, one of our biggest benefitting applications.
When it comes to large-scale data processing and machine learning, Apache Spark is no doubt one of the top battle-tested frameworks out there for handling batch or streaming workloads. Its ease of use, built-in machine learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams without a Spark cluster that is already pre-provisioned and provided as a managed service in the cloud. While that is a very attractive choice to get going, in the long run it can be a very expensive option if it’s not well managed.
As an alternative to this approach, our team has been exploring and working a lot with running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us, and as a result, it improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to make it easy for us to run a production deployment of our machine learning workloads and pipelines on Kubernetes, which seamlessly allows us to port our implementation from a local Kubernetes setup on a laptop during development to either an on-prem or cloud Kubernetes environment.
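The portability the abstract describes comes down to a handful of `spark-submit` settings from Spark's Kubernetes support; a representative sketch with placeholder values (image name, namespace, and paths are hypothetical, not the team's actual configuration):

```shell
# Submit a containerized PySpark job to any Kubernetes cluster.
# Only the API server URL and image registry change between
# a local, on-prem, or cloud environment.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name ml-pipeline \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.namespace=ml-workloads \
  --conf spark.kubernetes.container.image=<registry>/spark-ml:latest \
  local:///opt/app/pipeline.py   # path inside the Docker image
```

Because the application code and dependencies live in the Docker image, retargeting a deployment means changing only the `--master` URL and a few `spark.kubernetes.*` settings.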
NLP Text Recommendation System Journey to Automated Training (Databricks)
This talk will cover how we built and productionized automated machine learning pipelines at Salesforce, starting with heuristics and moving to automated retraining, using technologies including but not limited to Scala, Python, Apache Spark, Docker, and SageMaker for training and serving. We will walk through the generally applicable data prep, feature engineering, training, evaluation/comparison, and continuous model training steps, including data feedback loops, in containerized environments with SageMaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn mental models for building end-to-end production ML pipelines and GA-ready products.
Tensors Are All You Need: Faster Inference with Hummingbird (Databricks)
The ever-increasing interest in deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs). Traditional machine learning (ML) models such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential acceleration that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve 1000x speedup in inference on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speedup the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.
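The core trick, sketched here in plain Python rather than PyTorch and much simplified relative to Hummingbird's actual compiler, is to replace data-dependent tree traversal with fixed operations: evaluate every node's threshold comparison at once, then combine the decision bits arithmetically to select a leaf.

```python
# A complete depth-2 decision tree: one (feature, threshold) per internal node.
# Hypothetical toy model, not produced by scikit-learn or Hummingbird.
NODE_FEATURE   = [0, 1, 1]          # root, left child, right child
NODE_THRESHOLD = [5.0, 2.0, 7.0]
LEAF_VALUE     = ["A", "B", "C", "D"]

def predict(x):
    # Step 1: evaluate ALL node comparisons at once, with no branching on
    # data -- this is the part that maps naturally onto tensor ops / GPUs.
    go_right = [int(x[f] > t) for f, t in zip(NODE_FEATURE, NODE_THRESHOLD)]
    # Step 2: combine the decision bits into a leaf index with fixed
    # arithmetic instead of walking the tree node by node.
    d0, d1, d2 = go_right
    leaf = 2 * d0 + (d2 if d0 else d1)
    return LEAF_VALUE[leaf]

print(predict([4.0, 1.0]))  # x0<=5 then x1<=2 -> leaf "A"
print(predict([9.0, 8.0]))  # x0>5  then x1>7  -> leaf "D"
```

Hummingbird generalizes this by expressing step 1 and step 2 as matrix operations over whole batches of inputs, which is why the converted model runs on the same accelerators as a neural network.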
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and its main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
Detecting Anomalous Behavior with Surveillance Analytics (Databricks)
Surveillance feeds were essentially monitored manually until recent years. Video analytics as a technology has made great strides; it leverages video surveillance networks to derive searchable, actionable, and quantifiable intelligence from live or recorded video content.
Driven by artificial intelligence and deep learning, video intelligence solutions detect and extract objects in a video. These solutions identify target objects based on trained Deep Neural Networks and then classify each object to enable intelligent video analysis, including search & filtering, alerting, data aggregation and visualization.
In our session, we will:
Discuss the current state of surveillance and popular Python libraries used in video analytics
Elucidate the various approaches deployed, using a range of pre-trained models from MobileNet SSD to the state-of-the-art YOLO model
Describe the many pre-processing techniques we have used, such as the generation of a time-averaged frame, erosion, dilation, and many others
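Of the preprocessing steps listed above, erosion is easy to illustrate on a binary mask; in practice you would use an optimized library routine such as OpenCV's `cv2.erode`, but a toy pure-Python version shows the operation:

```python
def erode(img):
    """Binary erosion with a 3x3 structuring element: a pixel survives only
    if its entire 3x3 neighborhood is foreground. Shrinks objects and
    removes speckle noise; dilation is the dual operation."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

frame = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
print(erode(frame))  # only the centre pixel of the 3x3 blob survives
```

Applied to a foreground mask from a time-averaged background frame, an erosion-then-dilation pass (morphological opening) suppresses single-pixel noise before object detection runs.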
With the basics covered, it’s LIGHTS! CAMERA! ACTION! Let us show you how this works. We will present a live demo that explains the performance-compute trade-offs between the use of different models and techniques, and their limitations.
What you can expect to take away from our session:
Gain a deeper understanding of advanced Video Analytics techniques
Understand how to utilize pre-trained models for video analytics solutions
Learn more about the hardware requirements, limitations and challenges posed while devising a video analytics solution
Benefit from the lessons learnt upon deployment in a real-life scenario
The future direction and possibilities of the solution we have developed
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... (Rodney Joyce)
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Feature drift monitoring as a service for machine learning models at scaleNoriaki Tatsumi
In this talk, you’ll learn about techniques used to build a feature drift detection as a service capability for your enterprise and beyond. Feature drift monitoring is a way to check volatility of machine learning model inputs. It can trigger investigations for potential model degradation as well as explain why models have shifted.
ML-Ops: From Proof-of-Concept to Production ApplicationHunter Carlisle
Successfully deploying a working machine learning prototype to a production application is a challenging task, frought with difficulties not experienced in traditional software deployments.
In this talk, you will learn techniques to successfully deploy ML applications in a scalable, maintainable, and automated way.
Advanced Model Comparison and Automated Deployment Using MLDatabricks
Here at T-Mobile when a new account is opened, there are fraud checks that occur both pre- and post-activation. Fraud that is missed has a tendency of falling into first payment default, looking like a delinquent new account. The objective of this project was to investigate newly created accounts headed towards delinquency to find additional fraud.
For the longevity of this project we wanted to implement it as an end to end automated solution for building and productionizing models that included multiple modeling techniques and hyper parameter tuning.
We wanted to utilize MLflow for model comparison, graduation to production, and parallel hyper parameter tuning using Hyperopt. To achieve this goal, we created multiple machine learning notebooks where a variety of models could be tuned with their specific parameters. These models were saved into a training MLflow experiment, after which the best performing model for each model notebook was saved to a model comparison MLflow experiment.
In the second experiment the newly built models would be compared with each other as well as the models currently and previously in production. After the best performing model was identified it was then saved to the MLflow Model Registry to be graduated to production.
We were able to execute the multiple notebook solution above as part of an Azure Data Factory pipeline to be regularly scheduled, making the model building and selection a completely hand off implementation.
Every data science project has its nuances; the key is to leverage available tools in a customized approach that fit your needs. We are hoping to provide the audience with a view into our advanced and custom approach of utilizing the MLflow infrastructure and leveraging these tools through automation.
Unified MLOps: Feature Stores & Model DeploymentDatabricks
If you’ve brought two or more ML models into production, you know the struggle that comes from managing multiple data sets, feature engineering pipelines, and models. This talk will propose a whole new approach to MLOps that allows you to successfully scale your models, without increasing latency, by merging a database, a feature store, and machine learning.
Splice Machine is a hybrid (HTAP) database built upon HBase and Spark. The database powers a one of a kind single-engine feature store, as well as the deployment of ML models as tables inside the database. A simple JDBC connection means Splice Machine can be used with any model ops environment, such as Databricks.
The HBase side allows us to serve features to deployed ML models, and generate ML predictions, in milliseconds. Our unique Spark engine allows us to generate complex training sets, as well as ML predictions on petabytes of data.
In this talk, Monte will discuss how his experience running the AI lab at NASA, and as CEO of Red Pepper, Blue Martini Software and Rocket Fuel, led him to create Splice Machine. Jack will give a quick demonstration of how it all works.
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Databricks
Machine Learning is everywhere, but translating a data scientist’s model into an operational environment is challenging for many reasons. Models may need to be distributed to remote applications to generate predictions, or in the case of re-training, existing models may need to be updated or replaced. To monitor and diagnose such configurations requires tracking many variables (such as performance counters, models, ML algorithm specific statistics and more).
In this talk we will demonstrate how we have attacked this problem for a specific use case, edge based anomaly detection. We will show how Spark can be deployed in two types of environments (on edge nodes where the ML predictions can detect anomalies in real time, and on a cloud based cluster where new model coefficients can be computed on a larger collection of available data). To make this solution practically deployable, we have developed mechanisms to automatically update the edge prediction pipelines with new models, regularly retrain at the cloud instance, and gather metrics from all pipelines to monitor, diagnose and detect issues with the entire workflow. Using SparkML and Spark Accumulators, we have developed an ML pipeline framework capable of automating such deployments and a distributed application monitoring framework to aid in live monitoring.
The talk will describe the problems of operationalizing ML in an Edge context, our approaches to solving them and what we have learned, and include a live demo of our approach using anomaly detection ML algorithms in SparkML and others (clustering etc.) and live data feeds. All datasets and outputs will be made publicly available.
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobilDatabricks
Equipment maintenance log of the global fleet is traditionally maintained using legacy infrastructure and data models, which limit the ability to extract insights at scale. However, to impact the bottom line, it is critical to ingest and enrich global fleet data to generate data driven guidance for operations. The impact of such insights is projected to be millions of dollars per annum.
To this end, we leverage Databricks to perform machine learning at scale, including ingesting (structured and unstructured data) from legacy systems, and then sifting through millions of nonlinearly growing records to extract insights using NLP. The insights enable outlier identification, capacity planning, prioritization of cost reduction opportunities, and the discovery process for cross-functional teams.
In this Strata 2018 presentation, Ted Malaska and Mark Grover discuss how to make the most of big data at speed.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/72396
Machine Learning in Production
The era of big data generation is upon us. Devices ranging from sensors to robots and sophisticated applications are generating increasing amounts of rich data (time series, text, images, sound, video, etc.). For such data to benefit a business’s bottom line, insights must be extracted, a process that increasingly requires machine learning (ML) and deep learning (DL) approaches deployed in production applications use cases.
Production ML is complicated by several challenges, including the need for two very distinct skill sets (operations and data science) to collaborate, the inherent complexity and uniqueness of ML itself, when compared to other apps, and the varied array of analytic engines that need to be combined for a practical deployment, often across physically distributed infrastructure. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.
Dashboards are useless. Open YouTube if you want to watch something. What benefits could automation of streaming KPI metrics bring to your business, and what pitfalls and concerns are to be expected? From Time Series analysis approach to building distributed streaming data pipeline.
#ATAGTR2021 Presentation : "Use of AI and ML in Performance Testing" by Adolf...Agile Testing Alliance
Interactive Session on "Use of AI and ML in Performance Testing" by Adolf Patel Performance Test Architect Cognizant at #ATAGTR2021.
#ATAGTR2021 was the 6th Edition of Global Testing Retreat.
The video recording of the session is now available on the following link: https://www.youtube.com/watch?v=ajyPSmmswpM
To know more about #ATAGTR2021, please visit:https://gtr.agiletestingalliance.org/
AI Modernization at AT&T and the Application to Fraud with Databricks (Databricks)
AT&T has been involved in AI from the beginning, with many firsts: “first to coin the term AI”, “inventors of R”, “foundational work on Conv. Neural Nets”, etc., and we have applied AI to hundreds of solutions. Today we are modernizing these AI solutions in the cloud with the help of Databricks and a variety of in-house developments. This talk will highlight our AI modernization effort along with its application to fraud, one of our biggest benefitting applications.
When it comes to large-scale data processing and machine learning, Apache Spark is no doubt one of the top battle-tested frameworks for handling batch or streaming workloads. Its ease of use, built-in machine learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams without leveraging a Spark cluster that is pre-provisioned and provided as a managed service in the cloud. While that is a very attractive way to get going, in the long run it can be a very expensive option if not well managed.
As an alternative to this approach, our team has been exploring and working a lot with running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us, and as a result, it improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to run a production deployment of our machine learning workloads and pipelines on Kubernetes, which lets us seamlessly port our implementation from a local Kubernetes setup on a laptop during development to either an on-prem or cloud Kubernetes environment.
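Assuming a Spark build with Kubernetes support, a deployment like the one described might be launched with a `spark-submit` invocation along these lines. The API server URL, namespace, image name, and script path below are placeholders, not the team's actual configuration:

```shell
# Hypothetical Spark-on-Kubernetes submission; replace the placeholders
# with your own cluster endpoint, registry image, and application path.
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --name ml-pipeline \
  --conf spark.kubernetes.namespace=ml-workloads \
  --conf spark.kubernetes.container.image=my-registry/spark-ml:3.2.1 \
  --conf spark.executor.instances=4 \
  local:///opt/app/train_pipeline.py
```

Pointing `--master` at a different cluster's API server is, in principle, the only change needed to move the same containerized workload between a laptop, an on-prem cluster, and a cloud Kubernetes environment.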
NLP Text Recommendation System Journey to Automated Training (Databricks)
This talk will cover how we built and productionized automated machine learning pipelines at Salesforce, starting with heuristics and moving to automated retraining, using technologies including but not limited to Scala, Python, Apache Spark, Docker, and SageMaker for training and serving. We will walk through the generally applicable data prep, feature engineering, training, evaluation/comparison, and continuous model training steps, including data feedback loops in containerized environments with SageMaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn mental models for building end-to-end production ML pipelines and GA-ready products.
Tensors Are All You Need: Faster Inference with Hummingbird (Databricks)
The ever-increasing interest in deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g., GPUs). Traditional machine learning (ML) models such as linear regressions and decision trees in scikit-learn cannot currently run on GPUs, missing out on the potential acceleration that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve up to 1000x speedups in inference on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speed up inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models and try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different from training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
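The tree-to-tensor conversion mentioned in the outline can be pictured with a hand-built toy example. The sketch below illustrates the general idea (evaluating a decision tree purely with matrix operations, so it can run on a GPU) using a hypothetical two-node tree; it is not Hummingbird's actual code or API:

```python
import numpy as np

# Toy tree: node n0 tests x[0] < 0.5, node n1 tests x[1] < 0.3;
# leaves return 10, 20, 30. All matrices below encode that tree.
A = np.array([[1.0, 0.0],   # feature-selection matrix: column j one-hot picks
              [0.0, 1.0]])  # the feature tested at internal node j
B = np.array([0.5, 0.3])    # threshold for each internal node
# Path matrix: entry (j, k) is +1 if leaf k sits in node j's left subtree,
# -1 if in its right subtree, 0 if node j is not on the path to leaf k.
C = np.array([[ 1.0, -1.0, -1.0],
              [ 0.0,  1.0, -1.0]])
D = np.array([1.0, 1.0, 0.0])          # per-leaf count of required "go left" hits
leaf_values = np.array([10.0, 20.0, 30.0])

def tree_predict(X):
    """Evaluate the toy tree for a whole batch X using only matrix ops."""
    T = ((X @ A) < B).astype(float)        # 1 where a node test says "go left"
    leaf_mask = ((T @ C) == D).astype(float)  # exactly one leaf matches per row
    return leaf_mask @ leaf_values

X = np.array([[0.2, 0.9],   # left at n0              -> leaf 0 -> 10
              [0.9, 0.1],   # right at n0, left at n1 -> leaf 1 -> 20
              [0.9, 0.9]])  # right at both nodes     -> leaf 2 -> 30
print(tree_predict(X))  # -> [10. 20. 30.]
```

Because every step is a dense matrix operation over the whole batch, the same computation maps directly onto tensor runtimes and hardware accelerators.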
Detecting Anomalous Behavior with Surveillance Analytics (Databricks)
Surveillance feeds were essentially monitored manually until recent years. Video analytics as a technology has made great strides; it leverages video surveillance networks to derive searchable, actionable, and quantifiable intelligence from live or recorded video content.
Driven by artificial intelligence and deep learning, video intelligence solutions detect and extract objects in a video. These solutions identify target objects based on trained Deep Neural Networks and then classify each object to enable intelligent video analysis, including search & filtering, alerting, data aggregation and visualization.
In our session, we will:
Discuss the current state of surveillance and popular Python libraries used in video analytics
Explain the various approaches we deployed, using a range of pre-trained models from MobileNet SSD to the state-of-the-art YOLO model
Describe the many pre-processing techniques we have used, such as the generation of a time-averaged frame, erosion, dilation, and many others
With the basics covered, it’s LIGHTS! CAMERA! ACTION… Let us show you how this works. We will present a live demo that explains the performance-computing trade-offs between different models and techniques, and their limitations.
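As a rough illustration of two of the pre-processing steps mentioned above (a time-averaged background frame and erosion), here is a NumPy-only sketch. A real pipeline would use OpenCV's `accumulateWeighted` and `erode`; the array sizes and values below are purely illustrative:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Time-averaged background frame: an exponential moving average."""
    return (1 - alpha) * bg + alpha * frame

def erode(mask, k=3):
    """Toy k x k binary erosion (min filter) on a 0/1 integer mask."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    out = np.ones_like(mask)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

bg = np.zeros((6, 6))
frame = np.zeros((6, 6))
frame[2:5, 2:5] = 1.0   # a 3x3 "moving object"
frame[0, 0] = 1.0       # an isolated noisy pixel
# Motion mask: pixels far from the time-averaged background, then eroded
# so that single-pixel noise disappears before object detection runs.
mask = (np.abs(frame - bg) > 0.5).astype(int)
print(mask.sum(), erode(mask).sum())  # -> 10 1  (noise and edges eroded away)
bg = update_background(bg, frame)     # background slowly absorbs the new frame
```

Erosion trades a little object detail for robustness: only pixels whose whole neighborhood is foreground survive, which is why it pairs naturally with a dilation step afterwards.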
What you can expect to take away from our session:
Gain a deeper understanding of advanced Video Analytics techniques
Understand how to utilize pre-trained models for video analytics solutions
Learn more about the hardware requirements, limitations and challenges posed while devising a video analytics solution
Benefit from the lessons learnt upon deployment in a real-life scenario
The future direction and possibilities of the solution we have developed
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... (Rodney Joyce)
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python, and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before, check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Feature drift monitoring as a service for machine learning models at scale (Noriaki Tatsumi)
In this talk, you’ll learn about techniques used to build a feature-drift-detection-as-a-service capability for your enterprise and beyond. Feature drift monitoring is a way to check the volatility of machine learning model inputs. It can trigger investigations into potential model degradation as well as explain why models have shifted.
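One common way to implement such a check (one heuristic among many; the talk does not prescribe a specific statistic) is the Population Stability Index over binned feature values, comparing a training-time baseline against live model inputs:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between baseline and live feature samples.

    Rule-of-thumb thresholds often quoted in practice:
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 likely drift.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor each bucket to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
print(psi(baseline, rng.normal(0.0, 1.0, 10_000)) < 0.1)   # same distribution
print(psi(baseline, rng.normal(1.0, 1.0, 10_000)) > 0.25)  # shifted input: drift
```

A drift-monitoring service would compute a statistic like this per feature per time window and open an investigation when the score crosses a threshold.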
ML-Ops: From Proof-of-Concept to Production Application (Hunter Carlisle)
Successfully deploying a working machine learning prototype to a production application is a challenging task, fraught with difficulties not experienced in traditional software deployments.
In this talk, you will learn techniques to successfully deploy ML applications in a scalable, maintainable, and automated way.
Advanced Model Comparison and Automated Deployment Using ML (Databricks)
Here at T-Mobile, when a new account is opened, fraud checks occur both pre- and post-activation. Fraud that is missed tends to fall into first payment default, looking like a delinquent new account. The objective of this project was to investigate newly created accounts headed toward delinquency to find additional fraud.
For the longevity of this project, we wanted to implement it as an end-to-end automated solution for building and productionizing models that included multiple modeling techniques and hyperparameter tuning.
We wanted to utilize MLflow for model comparison, graduation to production, and parallel hyperparameter tuning using Hyperopt. To achieve this goal, we created multiple machine learning notebooks where a variety of models could be tuned with their specific parameters. These models were saved into a training MLflow experiment, after which the best-performing model from each notebook was saved to a model comparison MLflow experiment.
In the second experiment, the newly built models were compared with each other as well as with the models currently and previously in production. After the best-performing model was identified, it was saved to the MLflow Model Registry to be graduated to production.
We were able to execute the multiple-notebook solution above as part of a regularly scheduled Azure Data Factory pipeline, making model building and selection a completely hands-off implementation.
Every data science project has its nuances; the key is to leverage available tools in a customized approach that fits your needs. We hope to give the audience a view into our advanced, custom approach to utilizing the MLflow infrastructure and leveraging these tools through automation.
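The comparison-and-graduation step can be pictured with a plain-Python sketch. The run names and metric below are hypothetical, and a real system would call the MLflow tracking and Model Registry APIs rather than operate on dicts:

```python
# Candidate runs from each model notebook, compared against the models
# currently and previously in production (illustrative values only).
candidates = [
    {"name": "xgboost_v7", "auc": 0.861, "stage": "candidate"},
    {"name": "logreg_v3",  "auc": 0.842, "stage": "candidate"},
    {"name": "rf_v5",      "auc": 0.855, "stage": "production"},  # current prod
    {"name": "rf_v4",      "auc": 0.851, "stage": "archived"},    # previous prod
]

def promote_best(runs, metric="auc"):
    """Pick the highest-scoring run; promote it only if it isn't already prod."""
    best = max(runs, key=lambda r: r[metric])
    if best["stage"] != "production":
        best = dict(best, stage="production")  # registry call would go here
    return best

print(promote_best(candidates)["name"])  # -> xgboost_v7
```

Including the current and previous production models in the comparison is the important design point: a new candidate only graduates when it actually beats what is already deployed.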
Unified MLOps: Feature Stores & Model Deployment (Databricks)
If you’ve brought two or more ML models into production, you know the struggle that comes from managing multiple data sets, feature engineering pipelines, and models. This talk will propose a whole new approach to MLOps that allows you to successfully scale your models, without increasing latency, by merging a database, a feature store, and machine learning.
Splice Machine is a hybrid (HTAP) database built upon HBase and Spark. The database powers a one-of-a-kind single-engine feature store, as well as the deployment of ML models as tables inside the database. A simple JDBC connection means Splice Machine can be used with any model ops environment, such as Databricks.
The HBase side allows us to serve features to deployed ML models, and generate ML predictions, in milliseconds. Our unique Spark engine allows us to generate complex training sets, as well as ML predictions on petabytes of data.
In this talk, Monte will discuss how his experience running the AI lab at NASA, and as CEO of Red Pepper, Blue Martini Software and Rocket Fuel, led him to create Splice Machine. Jack will give a quick demonstration of how it all works.
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ... (Databricks)
Machine learning is everywhere, but translating a data scientist’s model into an operational environment is challenging for many reasons. Models may need to be distributed to remote applications to generate predictions, or, in the case of retraining, existing models may need to be updated or replaced. Monitoring and diagnosing such configurations requires tracking many variables (such as performance counters, models, ML-algorithm-specific statistics, and more).
In this talk, we will demonstrate how we have attacked this problem for a specific use case: edge-based anomaly detection. We will show how Spark can be deployed in two types of environments (on edge nodes, where ML predictions can detect anomalies in real time, and on a cloud-based cluster, where new model coefficients can be computed on a larger collection of available data). To make this solution practically deployable, we have developed mechanisms to automatically update the edge prediction pipelines with new models, regularly retrain at the cloud instance, and gather metrics from all pipelines to monitor, diagnose, and detect issues with the entire workflow. Using SparkML and Spark Accumulators, we have developed an ML pipeline framework capable of automating such deployments and a distributed application monitoring framework to aid in live monitoring.
The talk will describe the problems of operationalizing ML in an edge context, our approaches to solving them, and what we have learned, and will include a live demo of our approach using anomaly detection ML algorithms in SparkML and others (clustering, etc.) on live data feeds. All datasets and outputs will be made publicly available.
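As a minimal stand-in for the edge-side scoring step (the talk's actual SparkML models, such as clustering, are richer), a rolling z-score detector captures the shape of the problem: score each incoming point against statistics learned from recent history, with the statistics periodically refreshed from the cloud-side retraining job:

```python
import numpy as np

def anomalies(series, window=20, threshold=3.0):
    """Flag points whose rolling z-score against recent history exceeds threshold."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        z = (series[i] - hist.mean()) / (hist.std() + 1e-9)
        flags.append(abs(z) > threshold)
    return np.array(flags)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 200)   # a synthetic edge-sensor feed
data[150] = 12.0                   # inject an obvious spike
flags = anomalies(data)
print(bool(flags[150 - 20]))       # the injected spike is flagged -> True
```

In the architecture described, the `window`/`threshold` parameters play the role of the model coefficients: computed centrally on a large data collection, then pushed out to the edge prediction pipelines.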
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil (Databricks)
The equipment maintenance log of the global fleet is traditionally maintained using legacy infrastructure and data models, which limits the ability to extract insights at scale. However, to impact the bottom line, it is critical to ingest and enrich global fleet data to generate data-driven guidance for operations. The impact of such insights is projected to be millions of dollars per annum.
To this end, we leverage Databricks to perform machine learning at scale, ingesting structured and unstructured data from legacy systems and then sifting through millions of nonlinearly growing records to extract insights using NLP. The insights enable outlier identification, capacity planning, prioritization of cost reduction opportunities, and the discovery process for cross-functional teams.
In this Strata 2018 presentation, Ted Malaska and Mark Grover discuss how to make the most of big data at speed.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/72396
Machine Learning in Production
The era of big data generation is upon us. Devices ranging from sensors to robots and sophisticated applications are generating increasing amounts of rich data (time series, text, images, sound, video, etc.). For such data to benefit a business’s bottom line, insights must be extracted, a process that increasingly requires machine learning (ML) and deep learning (DL) approaches deployed in production use cases.
Production ML is complicated by several challenges, including the need for two very distinct skill sets (operations and data science) to collaborate, the inherent complexity and uniqueness of ML itself, when compared to other apps, and the varied array of analytic engines that need to be combined for a practical deployment, often across physically distributed infrastructure. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.
Dashboards are useless. Open YouTube if you want to watch something. What benefits could automating streaming KPI metrics bring to your business, and what pitfalls and concerns are to be expected? From a time-series analysis approach to building a distributed streaming data pipeline.
#ATAGTR2021 Presentation: "Use of AI and ML in Performance Testing" by Adolf... (Agile Testing Alliance)
Interactive session on "Use of AI and ML in Performance Testing" by Adolf Patel, Performance Test Architect at Cognizant, at #ATAGTR2021.
#ATAGTR2021 was the 6th Edition of Global Testing Retreat.
The video recording of the session is now available at the following link: https://www.youtube.com/watch?v=ajyPSmmswpM
To know more about #ATAGTR2021, please visit: https://gtr.agiletestingalliance.org/
Managing the Machine Learning Lifecycle with MLflow (Databricks)
ML development brings many new complexities beyond the traditional software development lifecycle. MLflow is an open-source project from Databricks that aims to solve some of these challenges, such as experiment tracking, reproducibility, model packaging, deployment, and governance, in order to manage and accelerate the lifecycle of your ML projects.
This presentation shows how data science can bring manifold benefits to those in the retail broking practice. Employing machine learning techniques and text analytics, you not only gain a competitive edge but also earn customers' satisfaction and loyalty.
Success comes from enabling your workforce to make better decisions and execute appropriate actions. We deliver value to your hospital or clinic by helping you reduce the time, resources, effort, and cost of operating your laboratory system.
Our Laboratory Information System is built on the award-winning, world-class Sage 300 ERP architecture. The lab system integrates with any HL7-compliant hospital information system. The LIS is CAP-compliant (College of American Pathologists), and most of the hospitals where it is implemented are JCI (Joint Commission International) accredited.
This presentation is intended to give the viewer a working knowledge of the practical applications of SAS in banking analytics. Specifically, Enterprise Guide and Enterprise Miner are discussed in detail.
TRI, the risk-based monitoring company, holds a number of industry "firsts". TRI is the first company entirely dedicated to RBM and quality oversight. They are the creators of the world's first purpose-built RBM platform, OPRA, and the first company to offer a true, holistic RBM solution, providing not only the technology but also the knowledge and services required for any organization wishing to successfully implement and adopt a risk-based approach in their clinical trials. TRI - Where's the Risk?
The Automation Firehose: Be Strategic & Tactical With Your Mobile & Web Testing (Perfecto by Perforce)
The widespread adoption of test automation has created many challenges — for everything from development lifecycle integration to scripting strategy.
One pitfall of automation is that teams often rush to automate everything they can. This is the automation firehose.
However, just because a scenario CAN be automated does not mean it SHOULD be automated. For scenarios that should be automated, teams must adopt implementation plans to ensure tests are reliable and deliver value.
Join this webinar led by Perfecto’s Chief Evangelist, Eran Kinsbruner, along with Thomas Haver, Manager of Automation & Delivery. In this session, the audience will:
-Understand which test scenarios to automate.
-Learn how to maximize the benefits of automation.
-Receive a checklist to determine automation feasibility and ROI.
Similar to Why APM Is Not the Same As ML Monitoring
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
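As a rough illustration of the self-service expectations described above, a rule like "this column is mostly non-null" can be expressed as a small, platform-neutral check. The names below are illustrative, not Zillow's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Expectation:
    # A named data-quality rule supplied by a data producer.
    name: str
    check: Callable[[List[Dict[str, Any]]], bool]

def validate(rows: List[Dict[str, Any]], expectations: List[Expectation]) -> Dict[str, bool]:
    """Run every expectation against the rows and report pass/fail per rule."""
    return {e.name: e.check(rows) for e in expectations}

def max_null_rate(column: str, threshold: float) -> Callable:
    # Fails when more than `threshold` of the values in `column` are missing.
    def check(rows):
        nulls = sum(1 for r in rows if r.get(column) is None)
        return nulls / len(rows) <= threshold
    return check

rows = [{"zip": "27701"}, {"zip": None}, {"zip": "02139"}, {"zip": "94105"}]
results = validate(rows, [Expectation("zip_mostly_present", max_null_rate("zip", 0.3))])
print(results)  # {'zip_mostly_present': True} -- 1 null in 4 rows is a 25% rate
```

Flagging data at the earliest stage then amounts to running such checks inside the pipeline before downstream consumers read the dataset.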
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration, and it also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable deep learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into the format the deep learning algorithm needs for training or inference, and then send the data into a deep learning algorithm. Using stage level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to deep learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training with the TensorFlow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes only a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing in Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstration of analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
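The shared "fit"/"transform" contract the talk builds on can be shown with a dependency-free sketch. This is the scikit-learn-style interface, not Ray's actual API:

```python
class Scale:
    """Scale values to [0, 1] based on the range seen during fit."""
    def fit(self, xs):
        self.lo, self.hi = min(xs), max(xs)
        return self
    def transform(self, xs):
        span = self.hi - self.lo or 1.0
        return [(x - self.lo) / span for x in xs]

class Shift:
    """Shift values by a constant; fit is a no-op."""
    def __init__(self, by):
        self.by = by
    def fit(self, xs):
        return self
    def transform(self, xs):
        return [x + self.by for x in xs]

class Pipeline:
    """Chain stages that each expose fit/transform."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, xs):
        for stage in self.stages:
            xs = stage.fit(xs).transform(xs)
        return xs

pipe = Pipeline([Scale(), Shift(10.0)])
print(pipe.fit_transform([0.0, 5.0, 10.0]))  # [10.0, 10.5, 11.0]
```

Because each stage only depends on the output of the previous one, a scheduler like Ray can dispatch the per-stage work as independent tasks, which is the parallelism opportunity the talk exploits.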
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” operating over change data.
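The talk defines sawtooth windows precisely; as a rough stdlib sketch of the general idea (the window's left edge snaps down to coarse hop boundaries while the right edge tracks the query time, so the effective length oscillates in a sawtooth between `window` and `window + hop`), consider:

```python
def sawtooth_window_bounds(t, window, hop):
    """Left edge snaps down to a hop boundary so the window covers at least
    `window` time units ending at `t`; effective length oscillates between
    `window` and `window + hop` as `t` advances (hence "sawtooth")."""
    left = ((t - window) // hop) * hop
    return left, t

def aggregate(events, t, window, hop):
    # events: list of (timestamp, value); sum the values inside the window.
    left, right = sawtooth_window_bounds(t, window, hop)
    return sum(v for ts, v in events if left < ts <= right)

events = [(3, 1), (5, 2), (7, 3), (10, 4)]
print(aggregate(events, 10, window=5, hop=2))  # window is (4, 10], so 2+3+4 = 9
```

Snapping the left edge to hop boundaries means the partial aggregates per hop can be precomputed and reused across queries, which is how such designs keep reads cheap while writes stay tunable.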
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
- Why?
  - Custom queries on top of a table; we load the data once and query N times
- Why not Structured Streaming
- Working solution using Redis
Niche 2: Distributed Counters
- Problems with Spark Accumulators
- Utilize Redis hashes as distributed counters
- Precautions for retries and speculative execution
- Pipelining to improve performance
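The "precautions for retries and speculative execution" above come down to making increments idempotent per task attempt. A minimal sketch, with a plain dict standing in for a Redis hash (a real version would use HSETNX per task id and sum the hash fields):

```python
class IdempotentCounter:
    """Counter where each task's contribution is recorded exactly once,
    so Spark task retries and speculative duplicates don't double-count."""
    def __init__(self):
        self.contributions = {}  # task_id -> value; stand-in for a Redis hash

    def add(self, task_id, value):
        # setnx-style semantics: only the first write for a task id wins.
        self.contributions.setdefault(task_id, value)

    def total(self):
        return sum(self.contributions.values())

c = IdempotentCounter()
c.add("stage1-task7", 42)
c.add("stage1-task7", 42)  # speculative duplicate of the same task: ignored
c.add("stage1-task8", 8)
print(c.total())  # 50
```

Keying contributions by task identity (rather than blindly incrementing) is what makes the counter safe under Spark's at-least-once task execution.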
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
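Optimization (ii), turning a decision tree into a SQL expression, can be illustrated with a toy translator (not Raven's actual implementation):

```python
def tree_to_sql(node):
    """Translate a tiny decision tree into a nested SQL CASE expression.
    Leaves are {'value': v}; internal nodes are
    {'feature': f, 'threshold': t, 'left': ..., 'right': ...}."""
    if "value" in node:
        return str(node["value"])
    return (f"CASE WHEN {node['feature']} <= {node['threshold']} "
            f"THEN {tree_to_sql(node['left'])} "
            f"ELSE {tree_to_sql(node['right'])} END")

tree = {
    "feature": "age", "threshold": 30,
    "left": {"value": 0},
    "right": {"feature": "income", "threshold": 50000,
              "left": {"value": 0}, "right": {"value": 1}},
}
print(tree_to_sql(tree))
# CASE WHEN age <= 30 THEN 0 ELSE CASE WHEN income <= 50000 THEN 0 ELSE 1 END END
```

Once the model is a SQL expression, the relational optimizer can push predicates through it, prune unreachable branches, and run the whole prediction query on one engine — the kind of cross-boundary optimization Raven performs.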
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Machine Learning CI/CD for Email Attack Detection (Databricks)
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of data. In order to move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack — including joined datasets for hydration, feature extraction code, and detection logic — and to develop and train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue (Databricks)
Hyperparameter tuning is critical in model development, and its general form — parameter tuning against an objective function — is also widely used in industry. Meanwhile, Apache Spark can handle massive parallelism, and Apache Spark ML is a solid machine learning solution.
But we have not seen a general and intuitive distributed parameter tuning solution based on Apache Spark. Why?
Not every tuning problem involves Apache Spark ML models. How can Apache Spark handle general models?
Not every tuning problem is a parallelizable grid or random search. Bayesian optimization is sequential; how can Apache Spark help in this case?
Not every tuning problem is single-epoch; deep learning is not. How do we fit algorithms such as Hyperband and ASHA into Apache Spark?
Not every tuning problem is a machine learning problem; for example, simulation plus tuning is also common. How do we generalize?
In this talk, we are going to show how using Fugue-Tune and Apache Spark together can eliminate these pain points.
Fugue-Tune, like Fugue, is a “super framework”: an abstraction layer unifying existing solutions such as Hyperopt and Optuna.
It first models the general tuning problem, independent of machine learning.
It is designed for both small- and large-scale problems, and it can always fully parallelize the distributable part of a tuning problem.
It works for both classical and deep learning models. With Fugue, running Hyperband and ASHA becomes possible on Apache Spark.
In the demo, you will see how to do any type of tuning in a consistent, intuitive, scalable and minimal way. And you will see a live demo of the amazing performance.
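The fully parallelizable part of tuning (independent trials over an objective) is just a distributed map. Here is a stdlib sketch with threads standing in for Spark executors; the objective and search space are made up for illustration:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def objective(params):
    # Toy objective to minimize: squared distance from the optimum (3, -1).
    return (params["x"] - 3) ** 2 + (params["y"] + 1) ** 2

def random_candidates(n_trials, seed=0):
    rng = random.Random(seed)
    return [{"x": rng.uniform(-10, 10), "y": rng.uniform(-10, 10)}
            for _ in range(n_trials)]

candidates = random_candidates(200)
with ThreadPoolExecutor() as pool:
    # Each trial is independent, so evaluation is embarrassingly parallel.
    scores = list(pool.map(objective, candidates))
best_score, best_params = min(zip(scores, candidates), key=lambda sc: sc[0])
print(best_score, best_params)
```

Sequential methods like Bayesian optimization break this trial independence, which is exactly the gap between Spark's map-style parallelism and general tuning that the talk addresses.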
Improving Apache Spark for Dynamic Allocation and Spot Instances (Databricks)
This presentation will explore the new work in Spark 3.1 adding the concept of graceful decommissioning and how we can use this to improve Spark’s performance in both dynamic allocation and spot/preemptable instances. Together we’ll explore how Spark’s dynamic allocation has evolved over time, and why the different changes have been needed. We’ll also look at the multi-company collaboration that resulted in being able to deliver this feature and I’ll end with encouraging pointers on how to get more involved in Spark’s development.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
1. ML Monitoring is not APM
Cory A. Johannsen
Product Engineer, Verta Inc.
www.verta.ai
2. Agenda
▴ What is APM?
▴ What is ML monitoring?
▴ How ML monitoring and APM differ
▴ The unique needs of ML monitoring
▴ A very cool solution to model monitoring from Verta
3. About
https://www.verta.ai/product
- End-to-end MLOps platform for ML model delivery, operations and management
- Kubernetes-based operations stack for ML
- 23 years as a software engineer
- Embedded systems, enterprise software, SaaS
- 6 years in APM working at scale
10. Ensuring model results are consistently of high quality
▴ Know when models are failing
▴ Quickly find the root cause
▴ Close the loop by fast recovery
*We refer to latency, throughput, etc. collectively as model service health
11. Know when a model fails
▴ Without ground truth, model failures are challenging to detect
▴ Need to monitor complex statistical summaries
▴ Distributions, anomalies, missing values, quantiles, etc.
▴ Often model-specific
Quickly find the root cause
▴ A model is one part of an inference pipeline
▴ Need a global view of the pipeline jungle to see where the root issue may be
Close the loop
▴ Intelligent detection and alerting to pre-emptively identify issues and trigger remediations
▴ Execute re-trains, fallback models, and human intervention
12. How APM and ML monitoring align
▴ Error rate, throughput, latency
○ You need to know your production systems are operational
▴ Visualization
○ You need to see change over time
▴ Alerting
○ You need to know when something has gone wrong (and only when something has gone wrong)
13. What do you care about in ML Monitoring?
▴ Distribution
○ Training versus test
○ Iteration over iteration
○ Live prediction
▴ Drift
○ Change in distribution over time
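One common way to quantify the drift this slide refers to is the population stability index (PSI) between a reference distribution (e.g. training scores) and live predictions; a minimal stdlib sketch:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two samples, histogrammed over
    [lo, hi]. Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    shift, > 0.25 significant drift."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(xs), 1e-6) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]                    # uniform reference scores
live = [min(i / 100 + 0.3, 0.99) for i in range(100)]    # live scores shifted upward
print(round(psi(train, live), 3))  # well above the 0.25 drift threshold
```

Computed per summary over time, a statistic like this is exactly the kind of "complex statistical summary" the earlier slides say ML monitoring must track instead of raw error rates.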
14. How APM and ML monitoring differ
▴ Error rate, throughput, latency
○ Necessary, no longer sufficient
▴ Not all work is production work
○ ML monitoring happens from the beginning of the pipeline
▴ APM can tell you what is wrong
○ ML monitoring is about understanding why
15. What makes ML monitoring unique
▴ Quantitative analysis of model performance
○ Information you can use
▴ Controlled comparison of distributions
○ Repeatable
○ Reliable
○ Consistent
▴ Alerting on meaningful deviation
○ Actionable
○ Timely
○ Accurate
16. Only you know the shape of your data
▴ Every model and pipeline is different and specialized
○ You built them, you understand them
▴ You know what metrics and distributions are valuable
○ This is your model, you know the data and processes that created it
▴ You know the expected distributions
○ You can determine whether the behavior is correct
17. Only you know how to measure change
▴ Compare to reference set
○ Training, test, golden data set
▴ Compare to a baseline
○ Calculate a baseline from your data or production systems
▴ Compare to other
○ Use a comparison that makes sense in your domain
18. Only you know when a change matters
▴ You know your model and tolerances
▴ You know when a deviation is significant (or not!)
▴ You know when these conditions need to change
19. Verta understands model monitoring
▴ Designed for your workflows
▴ Easy integration to capture your monitoring data
▴ Visualize and understand your metrics, distributions, and drift
▴ Get alerted when you should - not otherwise
21. Concepts
▴ Monitored Entity: A reference name (e.g. model or pipeline) that you want to monitor
▴ Profiler: A function that computes statistics about your data
▴ Summary: A collection of statistics about your data (output of a profiler)
○ Samples: instances of a summary, i.e., statistics
○ Labels: key-values attached to summary samples, used for rich filtering and aggregation
▴ Alerter: Triggered periodically, it can talk with the Verta API to fetch information about summaries and identify if they look wrong
22. How does it work?
1. Define monitored entity: the entity to be monitored (e.g., model, data, pipeline)
2. Define summaries to monitor for the entity
3. Run profilers (manually or automatically) to produce summary samples
4. View samples, define alerts
5. Get alerted (e.g. via Slack)
6. Close the loop!
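The steps above can be sketched end-to-end (function and field names here are illustrative, not Verta's client API):

```python
import statistics

def profiler(values):
    """Step 3: a profiler turns raw data into a summary sample."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "missing_rate": 1 - len(present) / len(values),
    }

def alerter(sample, baseline, tol=0.2):
    """Steps 4-5: flag a sample whose mean drifts more than `tol`
    (relative) from the baseline, or that has new missing values."""
    alerts = []
    if abs(sample["mean"] - baseline["mean"]) > tol * abs(baseline["mean"]):
        alerts.append("mean drift")
    if sample["missing_rate"] > baseline["missing_rate"]:
        alerts.append("missing values")
    return alerts

baseline = profiler([10, 11, 9, 10, 10])   # step 2: summary from reference data
live = profiler([14, 15, None, 13, 16])    # step 3: summary from live traffic
print(alerter(live, baseline))  # ['mean drift', 'missing values']
```

Closing the loop (step 6) means wiring those alerts to a remediation: retrain, roll back, or escalate to a human.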
23. How does it work?
[Architecture diagram: data/model pipelines and models (live and batch) write to a prediction log; ground truth and statistical summaries feed a time-series DB; remediation paths are retrain, rollback, and human-in-the-loop.]
24. Summary
▴ Performance monitoring is no longer sufficient for the needs of modern ML systems
○ Model monitoring starts at the beginning of the pipeline and continues through production
○ Batch and live can be addressed in the same framework
▴ Knowing something is wrong is not enough, you need to know why
▴ Timely actionable alerting is mandatory
▴ Building these tools in-house is difficult, error-prone, and expensive
▴ Spark is a fantastic tool to enable model monitoring
25. Monitor Your Models with Verta
▴ Visit monitoring.verta.ai today and see it in action
▴ Join our community
▴ Get more out of your models
▴ Get more out of your alerts