Re-imagine Data Monitoring with whylogs and Spark

Databricks
Re-imagine Data Monitoring with whylogs and Apache Spark
Andy Dang
Co-Founder & Lead Engineer, WhyLabs
Outline
ML Data Challenges
How traditional data analysis techniques fail ML data pipelines
Lightweight Profiling for Big ML Data
Profiling techniques for detecting data quality problems
The Open Source whylogs Library
Building the standard for data logging
ML Lifecycle
Source: Google Cloud AI
Issues encountered in production (small sample)...
...or it simply doesn’t work, and nobody knows why...
● Experiment/production environment mismatch
● Wrong model version deployed
● Underprovisioned hardware
● Inappropriate hardware
● Latency/SLA issues
● Data permissions misconfigured
● Untracked changes broke prod
● Traffic sent to the wrong model
● Computational instability
● Customers gaming the model (adversarial attacks)
● PII data exposed
● Expected accuracy doesn’t materialize
● Pre-processing mismatch in experiments vs. production
● Retrained on faulty data
● Accuracy improves on one segment, regresses in others
● Outliers predicted incorrectly
● Bias identified
● Correlation with protected features
● Overfitting on training/test
● Surge in missing values
● Surge in duplicates
● Poor performance on new categories
● Poor performance on new customer segments
● Poor performance on outliers
● Data quality issues affect accuracy
● Production data doesn’t match test/training
● Accuracy is decaying over time
● Data drift in inputs
● Concept drift in outputs
● Extreme predictions for out-of-distribution data
● Model not generalizing on new data / new segments
● Major customer behavior shift
Issues encountered in production (small sample)...
issues caused by data
[Slide repeats the list above with the data-related issues highlighted; the highlighting is not preserved in this transcript.]
Data monitoring starts with logging
● Data Logs, i.e. data profiling
● Model Metadata
● Pipeline Metadata
Data profiling refers to the analysis of information [...] in order to clarify the structure, content, relationships, and derivation rules of the data [Wikipedia]
Data logs: sampling vs. profiling
Sampling
Pros
● Easy to build
● Little upfront design
● Log & raw data analysis identical
Cons
● I/O & storage
● Noisy
● Requires statistical analysis
● Rare events & outliers
● Min/max, unique values, etc.
Profiling
Pros
● Scalable & lightweight
● Flexible & configurable
● Rare events and outlier-dependent metrics
● Directly interpretable results
Cons
● Data dependent output format
● No existing widespread solutions
● Mathematical & engineering challenges
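To make the "rare events & outliers" and "min/max" points concrete, here is a minimal sketch (not from the talk) that compares a 1% random sample with a single-pass running profile over synthetic data. The class name `RunningProfile`, the synthetic values, and the 1% rate are all illustrative assumptions. The profile sees every record at constant memory, so it always records the outlier; the sample only sees it if that one record happens to be drawn.

```python
import random


class RunningProfile:
    """Hypothetical single-pass summary: constant memory, sees every record."""

    def __init__(self):
        self.count = 0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def track(self, x):
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)


rng = random.Random(42)
values = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
values[12_345] = 1_000.0  # inject a single rare outlier

# Profiling: one pass over all records.
profile = RunningProfile()
for v in values:
    profile.track(v)

# Sampling: keep 1% of the records and analyze only those.
sample = rng.sample(values, k=len(values) // 100)

print("true max:   ", max(values))
print("profile max:", profile.maximum)
print("sample max: ", max(sample))
```

The sample maximum can never exceed the profile maximum, and it matches only when the outlier is among the sampled records.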
Data logs: must be accurate
[Figure] Median: errors in the estimate of the median for sampling vs. profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
Data logs: must be scalable
Dataset        Size     # of entries   # of features   Memory consumption   Output size
Lending Club   1.6GB    2.2M           151             14MB                 7.4MB
NYC Tickets    1.9GB    10.8           43              14MB                 2.3MB
Pain pills     75GB     178M           42              15MB                 2MB
Logging ML data at scale
Four key paradigms:
● Approximations rather than exact results
● Lightweight
● Additive
● Batch and streaming support
profile: collection of lightweight metrics that provide these properties
Lightweight
Old Approach: process → process → process → process → Data Warehouse/Data Lake → Processing Engine
New Approach: process + profiling at every step → Profile Store → Analysis
Only feasible if:
● Profiling is fast
● Profiling is not memory intensive
Additive
Without profiles: dataset 1, dataset 2, dataset 3 → sort (shuffle) → reduce step → Median
With profiles: dataset 1 → profile 1, dataset 2 → profile 2, dataset 3 → profile 3 → add(profile1, 2, 3) → Estimated Median
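The additive property can be sketched with a toy profile. The example below is an assumption-laden illustration, not whylogs code: it uses a fixed-bin histogram as a simple stand-in for a real quantile sketch, and the names `HistogramProfile`, `track`, `merge`, and `estimated_median` are hypothetical. Each dataset is profiled on its own, the profiles are added together, and the median is estimated from the merged profile without sorting or moving the raw data.

```python
from bisect import bisect_right


class HistogramProfile:
    """Fixed-bin histogram: a toy stand-in for a mergeable quantile sketch."""

    def __init__(self, edges):
        self.edges = list(edges)  # ascending bin boundaries
        self.counts = [0] * (len(self.edges) - 1)

    def track(self, x):
        # Locate the bin; clamp out-of-range values into the first/last bin.
        i = bisect_right(self.edges, x) - 1
        i = max(0, min(i, len(self.counts) - 1))
        self.counts[i] += 1

    def merge(self, other):
        # Adding two profiles is just adding their bin counts.
        if self.edges != other.edges:
            raise ValueError("profiles must share the same bin edges")
        merged = HistogramProfile(self.edges)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def estimated_median(self):
        total = sum(self.counts)
        if total == 0:
            raise ValueError("empty profile")
        target = total / 2
        running = 0
        for i, c in enumerate(self.counts):
            if c and running + c >= target:
                # Interpolate linearly inside the bin that crosses 50%.
                frac = (target - running) / c
                return self.edges[i] + frac * (self.edges[i + 1] - self.edges[i])
            running += c


def profile_of(values, edges):
    p = HistogramProfile(edges)
    for v in values:
        p.track(float(v))
    return p


edges = list(range(0, 101))  # 100 bins of width 1
datasets = [range(0, 30), range(30, 70), range(70, 100)]

# Profile each dataset independently, then add the profiles.
profiles = [profile_of(ds, edges) for ds in datasets]
merged = profiles[0].merge(profiles[1]).merge(profiles[2])

# Profiling everything in one pass gives the same counts.
single_pass = profile_of(range(0, 100), edges)
print(merged.counts == single_pass.counts)
print(merged.estimated_median())
```

The estimate is only as precise as the bin width, which is the trade-off the "approximations rather than exact results" paradigm accepts.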
Batch and streaming support
Batch (Spark/Hive Query Engine): partition 1, partition 2, partition 3, ..., partition n are each profiled independently. No shuffle!
Streaming: day 0, day 1, day 2, day 3, ... are each profiled → sum(profiles)
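Both cases reduce to the same operation: profile each chunk where it lives, then sum the small profiles. The sketch below shows this in plain Python under stated assumptions. `NumericProfile` and `sum_profiles` are hypothetical names, the lists stand in for Spark partitions and daily batches, and a real profile would carry sketches rather than just count, sum, min, and max.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class NumericProfile:
    """Hypothetical mergeable summary of one numeric column."""

    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @staticmethod
    def of(values):
        values = list(values)
        if not values:
            return NumericProfile()
        return NumericProfile(len(values), float(sum(values)), min(values), max(values))

    def merge(self, other):
        return NumericProfile(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def sum_profiles(profiles):
    # The empty profile is the identity element, so empty chunks are harmless.
    return reduce(NumericProfile.merge, profiles, NumericProfile())


# Batch: each partition is profiled locally; only the profiles are combined.
partitions = [[3, 1, 4], [1, 5, 9, 2], [6], []]
batch_profile = sum_profiles(NumericProfile.of(p) for p in partitions)

# Streaming: one profile per day, rolled up on demand.
daily_data = {"day 0": [3, 1, 4], "day 1": [1, 5, 9, 2], "day 2": [6]}
daily_profiles = {day: NumericProfile.of(v) for day, v in daily_data.items()}
rollup = sum_profiles(daily_profiles.values())

print(batch_profile)
print(rollup.mean)
```

Because merging is associative and order-independent here, the per-partition and per-day rollups agree with a profile computed over all the records at once.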
Approximate Statistics
● Using Stochastic Streaming Algorithms
○ Model the problem as a stochastic process
○ Apache DataSketches is the open source implementation
● Statistics that we focus on at the moment:
○ Histograms
○ Frequent items
○ Cardinality
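As a flavor of what such streaming algorithms look like, here is a minimal sketch of the classic Misra-Gries frequent-items summary. It is shown only to illustrate the idea of bounded-memory approximate statistics; it is not the Apache DataSketches implementation, and the class name `FrequentItems` and the parameter `k` are assumptions made for this example. With `k` counters over a stream of `n` items, each estimate undercounts the true frequency by at most `n / (k + 1)`.

```python
import random


class FrequentItems:
    """Misra-Gries summary: keeps at most k counters regardless of stream size."""

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, item):
        # Lower bound on the true count; 0 if the item is not tracked.
        return self.counters.get(item, 0)


# Synthetic stream: two heavy hitters plus 200 one-off values.
stream = ["a"] * 500 + ["b"] * 300 + [f"rare-{i}" for i in range(200)]
random.Random(0).shuffle(stream)

summary = FrequentItems(k=10)
for item in stream:
    summary.update(item)

max_error = len(stream) / (summary.k + 1)
print("estimate for 'a':", summary.estimate("a"), "(true count 500)")
print("estimate for 'b':", summary.estimate("b"), "(true count 300)")
print("max undercount:  ", max_error)
```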
whylogs: The Data Logging Library
● Multi-language support: Python + Java
● Supports both data engineering and data science workflows
● Extensibility: image support. Text, video, audio & embeddings support to come
● Growing list of integrations
whylogs: Python
● A few lines of code to start logging
● Integrates with popular data science libraries
● Out-of-the-box visualization utilities
whylogs in Apache Spark
Data Lake (col1, col2, …, coln) → partition 1, partition 2, ..., partition k → one profile per partition → merge(profile1, profile2, …, profilek) → global profile
Each profile holds: Schema, Metadata, Sketches, Metrics
Simple Spark API
PySpark support
Catch distribution drift in a few lines of code
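One way to picture drift detection on top of profiles is to compare the histogram stored in a baseline profile with the histogram from a newer profile. The sketch below is an illustration only and does not use the whylogs API: the function names, the example bin counts, and the `THRESHOLD` value are hypothetical, and Hellinger distance is just one of several distances that could be used.

```python
import math


def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p_counts, q_counts):
    """Hellinger distance between two histograms over the same bins (0 = identical, 1 = disjoint)."""
    p, q = normalize(p_counts), normalize(q_counts)
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


# Hypothetical bin counts taken from three profiles of the same feature.
baseline = [50, 30, 15, 5]    # training data
same_shape = [100, 60, 30, 10]  # more data, same distribution
shifted = [5, 15, 30, 50]     # distribution has moved

THRESHOLD = 0.2  # hypothetical alerting threshold

for name, counts in [("same_shape", same_shape), ("shifted", shifted)]:
    distance = hellinger(baseline, counts)
    status = "DRIFT" if distance > THRESHOLD else "ok"
    print(f"{name}: distance={distance:.3f} {status}")
```

Only the small histograms are compared, so the check stays cheap no matter how large the underlying datasets are.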
Scalable monitoring at input feature granularity
Monitoring layer for ML applications
bit.ly/whylogs
andy@whylabs.ai
@andy_dng
Help build the open standard for data logging!
Thank you!