Quby, an Amsterdam-based technology company, offers solutions that empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI-powered products that are used by hundreds of thousands of users on a daily basis. Delta Lake ensures the quality of incoming records through schema enforcement and evolution. But it is the data engineer’s role to check that the expected data is ingested into the Delta Lake at the right time, with the expected metrics, so that downstream processes can do their jobs. Re-training models and serving them on the fly can also go wrong unless the right monitoring infrastructure is in place.
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Quality Check on Delta Lake
1. Monitoring Half a Million ML Models, IoT Streaming Data and Automated Quality Check on Delta Lake
Aemro Amare, Resident Solution Architect at Databricks
https://www.linkedin.com/in/aemro
Shekh Morshed Akther, Senior Data Engineer, Quby
https://www.linkedin.com/in/shekh-akther
2. Households use 44% of all natural gas
and 27% of all electricity in the EU
Source: Eurostat
How can we prevent energy waste?
4. Personalized services by applying ML to IoT and customer data
IoT & customer data → Quby Platform → Personalised services
▪ Smart thermostat control
▪ Home monitoring
▪ Advice & Insights
6. Quby’s Data Journey With Databricks
2016
• On premises, Cloudera & Apache Storm
• < 1 terabyte total data
• Continuous failures
2017
• Moved to the cloud with Databricks
• More data and algorithms in production
• Stable batch processing
2018
• Spark Streaming & optimization
• Data team united
• Algorithm patents
2019
• More data sources, more services
• Algorithms as a Service
• < ½ million models run daily
2020
• Stronger CI/CD & monitoring
• Petabytes of data in Delta Lake
• Spark 3.0
• Automated model and algorithm deployment
7. Agenda
First presenter: Aemro Amare
- Explains how Quby monitors data collection & ingestion, and model performance
Second presenter: Shekh Morshed Akther
- Walks through sample code and implementations
8. How we monitor our data collection & ingestion, and model performance
Aemro Amare
9. Quby’s Big Data Eco System
s3 → Batch & Streaming Ingestion → Bronze data → Data Curation → Silver data → Machine Learning → Gold data → Insight & Trigger Services
11. Quby’s Data Lake
▪ Multi Tenancy
▪ Currently 4 Tenants
▪ Logical data isolation
▪ Prod and Acceptance env.
▪ Infrastructure as Code
▪ GDPR Compliant
▪ Automated Deployment
13. So what is missing in this architecture?
Monitoring and alerting at every stage
14. Monitoring is Another Big Chunk of Work
The infrastructure needed for running ML systems in production is vast and complex; we’ll focus on the monitoring part.
Source: Google’s 2015 paper, “Hidden Technical Debt in Machine Learning Systems”
15. What can go wrong on the bronze layer
During data loading:
▪ Streaming jobs may have unexpected lags
▪ Delayed delivery of data
▪ IoT devices might be disconnected
▪ Unexpected data format and schema
▪ Missing values in timeseries data
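The last two failure modes (disconnected devices and missing timeseries values) can be detected by scanning for gaps between consecutive readings. A minimal, self-contained sketch — not Quby’s actual code; in production this would run as a Spark job over the bronze Delta table, and the 10-second reading interval is illustrative:

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive readings are further
    apart than tolerance * expected_interval."""
    gaps = []
    ordered = sorted(timestamps)
    for prev, cur in zip(ordered, ordered[1:]):
        if (cur - prev) > expected_interval * tolerance:
            gaps.append((prev, cur))
    return gaps

# Example: 10-second device readings with one dropout of ~50 seconds
readings = [
    datetime(2020, 6, 1, 0, 0, 0),
    datetime(2020, 6, 1, 0, 0, 10),
    datetime(2020, 6, 1, 0, 1, 0),   # device was disconnected in between
    datetime(2020, 6, 1, 0, 1, 10),
]
gaps = find_gaps(readings, timedelta(seconds=10))
```

Any non-empty `gaps` result for a device would then feed the alerting path described later in the deck.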
16. What can go wrong on the silver layer
During data curation and ML runs:
▪ Slow Spark jobs
▪ Job failures
▪ Model drift
▪ Runs on incomplete data
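The “runs on incomplete data” risk is typically handled with a completeness guard that blocks the ML run until enough records have landed. A minimal sketch, assuming a hypothetical `min_ratio` threshold and invented device counts:

```python
def is_complete(actual_count, expected_count, min_ratio=0.95):
    """Guard an ML run: proceed only when at least min_ratio of the
    expected records for the period have landed in the silver table."""
    if expected_count <= 0:
        return False
    return actual_count / expected_count >= min_ratio

# Example: expect one reading per device per 10 seconds over one hour
expected = 1200 * 360          # 1200 devices * 360 intervals per hour
ingested = 410_000             # rows actually present in the silver table
ready = is_complete(ingested, expected)
```

In a scheduled notebook, `ready` would gate the model-training step rather than letting it silently run on partial data.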
17. What can go wrong on the gold layer
During final aggregation and result delivery:
▪ Daily result delivery might miss SLAs
▪ Customers might miss results
▪ Jobs might fail
▪ Customers might receive wrong results
▪ Network connection issues across services
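An SLA check on the gold layer can compare each customer’s delivery timestamp against a daily deadline, catching both late and missing results. A sketch with an invented 08:00 deadline and hypothetical customer ids:

```python
from datetime import datetime, time

def missed_sla(delivery_timestamps, expected_customers, deadline=time(8, 0)):
    """Return customer ids whose daily result was late or never delivered.

    delivery_timestamps: {customer_id: datetime of delivery, or None}
    """
    late = []
    for cid in expected_customers:
        ts = delivery_timestamps.get(cid)
        if ts is None or ts.time() > deadline:
            late.append(cid)
    return late

deliveries = {
    "c1": datetime(2020, 6, 1, 7, 30),   # on time
    "c2": datetime(2020, 6, 1, 9, 15),   # delivered, but after the SLA
    "c3": None,                          # never delivered
}
missed = missed_sla(deliveries, ["c1", "c2", "c3"])
```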
18. How Did Quby Build Monitoring and Alerting?
19. How Quby Built Monitoring and Alerting
s3 → Batch & Streaming Ingestion → Bronze data → Data Curation → Silver data → Machine Learning → Gold data → Insight & Trigger Services
Monitoring jobs run against each layer (bronze, silver and gold) and feed into:
• Dashboards
• Alerting
• Slack integration
• Email integration
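Slack integration for alerts is commonly done through Slack’s incoming webhooks. A minimal sketch — the webhook URL is a placeholder and the message format is illustrative, not Quby’s actual payload:

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def build_alert(check_name, value, threshold):
    """Format a monitoring alert as a Slack incoming-webhook payload."""
    return {
        "text": (f":warning: *{check_name}* breached its threshold: "
                 f"observed {value}, allowed {threshold}")
    }

def send_alert(payload):
    """POST the payload to the configured Slack incoming webhook."""
    req = request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

payload = build_alert("streaming_lag_seconds", 340, 120)
# send_alert(payload)  # uncomment with a real webhook URL
```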
20. Databricks Dashboards
• Monitoring jobs run periodically
• Dashboards are displayed on big screens; we refresh them using a Chrome plugin
• We set thresholds to trigger alerts
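The “thresholds to trigger alerts” step amounts to comparing each dashboard metric against a configured limit. A small sketch with invented metric names and limits:

```python
def evaluate_thresholds(metrics, thresholds):
    """Compare dashboard metrics against their alert thresholds and
    return the names of metrics that should trigger an alert."""
    breached = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            breached.append(name)
    return breached

# Hypothetical metrics produced by the periodic monitoring jobs
metrics = {"ingestion_lag_min": 45, "failed_jobs": 0, "missing_devices": 320}
thresholds = {"ingestion_lag_min": 30, "failed_jobs": 0, "missing_devices": 500}
breached = evaluate_thresholds(metrics, thresholds)
```

Each name in `breached` would then be routed to the Slack or email alerting path.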
21. Sample Code and Implementations: How We Built Monitoring and Alerting
Shekh Morshed Akther
25. Example: Automated Model Performance Alerting
▪ Use the MLflow API to get the experiment by name/id
▪ Get the latest 2 runs and compare the metric values
▪ If the latest run’s value deviates by more than a threshold (e.g. 50), send an alert notification in Slack
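The steps above can be sketched with the MLflow API (`mlflow.get_experiment_by_name` and `mlflow.search_runs` are real MLflow calls; the helper names and metric are illustrative, and the threshold of 50 follows the slide — this is not Quby’s actual notebook):

```python
def deviates(previous, latest, threshold=50):
    """True when the latest run's metric moved away from the previous
    run's value by more than the allowed threshold."""
    return abs(latest - previous) > threshold

def latest_two_metric_values(experiment_name, metric):
    """Fetch a metric from the two most recent runs of an experiment.
    Requires the mlflow package; imported lazily so the comparison
    logic above stays dependency-free."""
    import mlflow

    experiment = mlflow.get_experiment_by_name(experiment_name)
    runs = mlflow.search_runs(              # returns a pandas DataFrame
        experiment_ids=[experiment.experiment_id],
        order_by=["start_time DESC"],
        max_results=2,
    )
    latest, previous = runs[f"metrics.{metric}"].tolist()
    return previous, latest

# Usage sketch, assuming an experiment and metric with these names exist:
# previous, latest = latest_two_metric_values("daily_training", "rmse")
# if deviates(previous, latest):
#     ...send a Slack notification...
```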
28. Sample project
The sample notebook relevant to this presentation can be found at:
https://github.com/quby-io/databricks-workflow
This repository is an example of how to use Databricks for setting
up a multi-environment data processing pipeline. If you are part of
a Data Engineering or Data Science team, and you want to start a
project in Databricks, you can use this repository as a jump start.
29. Why are Databricks notebooks good for monitoring?
▪ Scalability
▪ Easy for the data team
▪ No running overhead
▪ Easy to change when business logic changes
▪ Easy to call MLflow APIs
30. Key Takeaways
▪ The biggest part of building a monitoring system is knowing what could go wrong
▪ Adding unnecessary metrics to your dashboard adds more confusion
▪ Building a monitoring dashboard should be on the same cycle as product development
▪ Databricks can be used as a monitoring platform too