Quby, an Amsterdam-based technology company, offers solutions that empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI-powered products that are used by hundreds of thousands of users on a daily basis. Delta Lake ensures the quality of incoming records through schema enforcement and evolution. But it is the data engineer’s role to check that the expected data is ingested into the Delta Lake at the right time, with the expected metrics, so that downstream processes can do their jobs. Re-training models and serving them on the fly can also go wrong unless the right monitoring infrastructure is in place.
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Quality Check on Delta Lake
1. Monitoring Half a Million ML Models, IoT Streaming Data and Automated Quality Check on Delta Lake
Aemro Amare, Resident Solution Architect at Databricks
https://www.linkedin.com/in/aemro
Shekh Morshed Akther, Senior Data Engineer, Quby
https://www.linkedin.com/in/shekh-akther
2. Households use 44% of all natural gas
and 27% of all electricity in the EU
Source: Eurostat
How can we prevent energy waste?
4. Personalized services by applying ML to IoT and customer data
IoT & customer data → Quby Platform → Personalised services
▪ Smart thermostat control
▪ Home monitoring
▪ Advice & Insights
6. Quby’s Data Journey With Databricks
2016
• On premises, Cloudera & Apache Storm
• < 1 terabyte total data
• Continuous failures
2017
• Moved to the cloud with Databricks
• More data and algorithms in production
• Stable batch processing
2018
• Spark Streaming & optimization
• Data team united
• Algorithm patents
2019
• More data sources, more services
• Algorithms as a Service
• < ½ million models run daily
2020
• Stronger CI/CD & monitoring
• Petabytes of data in Delta Lake
• Spark 3.0
• Automated model and algorithm deployment
7. Agenda
First presenter: Aemro Amare
- Explains how Quby monitors data collection & ingestion, and model performance
Second presenter: Shekh Morshed Akther
- Walks through sample code and implementations
8. How we monitor our data collection & ingestion, and model performance
Aemro Amare
9. Quby’s Big Data Eco System
s3 → Batch & Streaming Ingestion → Bronze data → Data Curation → Silver data → Machine Learning → Gold data → Insight & Trigger Services
11. Quby’s Data Lake
▪ Multi Tenancy
▪ Currently 4 Tenants
▪ Logical data isolation
▪ Prod and Acceptance env.
▪ Infrastructure as Code
▪ GDPR Compliant
▪ Automated Deployment
13. So what is missing in this architecture?
Monitoring and alerting at every stage
14. Monitoring is Another Big Chunk of Work
The infrastructure needed for running ML systems in production is vast and complex; we’ll focus on the monitoring part.
Source: Google’s 2015 paper, “Hidden Technical Debt in Machine Learning Systems”
15. What can go wrong on the bronze layer
During data loading:
▪ Streaming jobs may have unexpected lags
▪ Delayed delivery of data
▪ IoT devices might be disconnected
▪ Unexpected data format and schema
▪ Missing values in timeseries data
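The last two failure modes (disconnected devices and missing timeseries values) can be detected by scanning for gaps between consecutive readings. A minimal, self-contained sketch — not Quby’s actual code; in production this would run as a Spark job over the bronze Delta table, and the 10-second reading interval is illustrative:

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive readings are further
    apart than tolerance * expected_interval."""
    gaps = []
    ordered = sorted(timestamps)
    for prev, cur in zip(ordered, ordered[1:]):
        if (cur - prev) > expected_interval * tolerance:
            gaps.append((prev, cur))
    return gaps

# Example: 10-second device readings with one dropout of ~50 seconds
readings = [
    datetime(2020, 6, 1, 0, 0, 0),
    datetime(2020, 6, 1, 0, 0, 10),
    datetime(2020, 6, 1, 0, 1, 0),   # device was disconnected in between
    datetime(2020, 6, 1, 0, 1, 10),
]
gaps = find_gaps(readings, timedelta(seconds=10))
```

Any non-empty `gaps` result for a device would then feed the alerting path described later in the deck.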
16. What can go wrong on the silver layer
During data curation and ML runs:
▪ Slow Spark jobs
▪ Job failures
▪ Model drift
▪ Runs on incomplete data
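The “runs on incomplete data” risk is typically handled with a completeness guard that blocks the ML run until enough records have landed. A minimal sketch, assuming a hypothetical `min_ratio` threshold and invented device counts:

```python
def is_complete(actual_count, expected_count, min_ratio=0.95):
    """Guard an ML run: proceed only when at least min_ratio of the
    expected records for the period have landed in the silver table."""
    if expected_count <= 0:
        return False
    return actual_count / expected_count >= min_ratio

# Example: expect one reading per device per 10 seconds over one hour
expected = 1200 * 360          # 1200 devices * 360 intervals per hour
ingested = 410_000             # rows actually present in the silver table
ready = is_complete(ingested, expected)
```

In a scheduled notebook, `ready` would gate the model-training step rather than letting it silently run on partial data.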
17. What can go wrong on the gold layer
During final aggregation and result delivery:
▪ Daily result delivery might miss SLAs
▪ Customers might miss results
▪ Jobs might fail
▪ Customers might receive wrong results
▪ Network connection issues across services
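An SLA check on the gold layer can compare each customer’s delivery timestamp against a daily deadline, catching both late and missing results. A sketch with an invented 08:00 deadline and hypothetical customer ids:

```python
from datetime import datetime, time

def missed_sla(delivery_timestamps, expected_customers, deadline=time(8, 0)):
    """Return customer ids whose daily result was late or never delivered.

    delivery_timestamps: {customer_id: datetime of delivery, or None}
    """
    late = []
    for cid in expected_customers:
        ts = delivery_timestamps.get(cid)
        if ts is None or ts.time() > deadline:
            late.append(cid)
    return late

deliveries = {
    "c1": datetime(2020, 6, 1, 7, 30),   # on time
    "c2": datetime(2020, 6, 1, 9, 15),   # delivered, but after the SLA
    "c3": None,                          # never delivered
}
missed = missed_sla(deliveries, ["c1", "c2", "c3"])
```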
18. How Did Quby Build Monitoring and Alerting?
19. How Quby Built Monitoring and Alerting
s3 → Batch & Streaming Ingestion → Bronze data → Data Curation → Silver data → Machine Learning → Gold data → Insight & Trigger Services
Monitoring jobs run against each layer (bronze, silver and gold) and feed into:
• Dashboards
• Alerting
• Slack integration
• Email integration
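Slack integration for alerts is commonly done through Slack’s incoming webhooks. A minimal sketch — the webhook URL is a placeholder and the message format is illustrative, not Quby’s actual payload:

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def build_alert(check_name, value, threshold):
    """Format a monitoring alert as a Slack incoming-webhook payload."""
    return {
        "text": (f":warning: *{check_name}* breached its threshold: "
                 f"observed {value}, allowed {threshold}")
    }

def send_alert(payload):
    """POST the payload to the configured Slack incoming webhook."""
    req = request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

payload = build_alert("streaming_lag_seconds", 340, 120)
# send_alert(payload)  # uncomment with a real webhook URL
```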
20. Databricks Dashboards
• Monitoring jobs run periodically
• Dashboards are displayed on big screens; we refresh them using a Chrome plugin
• We set thresholds to trigger alerts
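The “thresholds to trigger alerts” step amounts to comparing each dashboard metric against a configured limit. A small sketch with invented metric names and limits:

```python
def evaluate_thresholds(metrics, thresholds):
    """Compare dashboard metrics against their alert thresholds and
    return the names of metrics that should trigger an alert."""
    breached = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            breached.append(name)
    return breached

# Hypothetical metrics produced by the periodic monitoring jobs
metrics = {"ingestion_lag_min": 45, "failed_jobs": 0, "missing_devices": 320}
thresholds = {"ingestion_lag_min": 30, "failed_jobs": 0, "missing_devices": 500}
breached = evaluate_thresholds(metrics, thresholds)
```

Each name in `breached` would then be routed to the Slack or email alerting path.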
21. Sample Code and Implementations: How We Built Monitoring and Alerting
Shekh Morshed Akther
25. Example: Automated Model Performance Alerting
▪ Use the MLflow API to get the experiment by name/id
▪ Get the latest 2 runs and compare the metric values
▪ If the latest run’s value deviates by more than a threshold (e.g. 50), send an alert notification in Slack
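The steps above can be sketched with the MLflow API (`mlflow.get_experiment_by_name` and `mlflow.search_runs` are real MLflow calls; the helper names and metric are illustrative, and the threshold of 50 follows the slide — this is not Quby’s actual notebook):

```python
def deviates(previous, latest, threshold=50):
    """True when the latest run's metric moved away from the previous
    run's value by more than the allowed threshold."""
    return abs(latest - previous) > threshold

def latest_two_metric_values(experiment_name, metric):
    """Fetch a metric from the two most recent runs of an experiment.
    Requires the mlflow package; imported lazily so the comparison
    logic above stays dependency-free."""
    import mlflow

    experiment = mlflow.get_experiment_by_name(experiment_name)
    runs = mlflow.search_runs(              # returns a pandas DataFrame
        experiment_ids=[experiment.experiment_id],
        order_by=["start_time DESC"],
        max_results=2,
    )
    latest, previous = runs[f"metrics.{metric}"].tolist()
    return previous, latest

# Usage sketch, assuming an experiment and metric with these names exist:
# previous, latest = latest_two_metric_values("daily_training", "rmse")
# if deviates(previous, latest):
#     ...send a Slack notification...
```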
28. Sample project
The sample notebook relevant to this presentation can be found at:
https://github.com/quby-io/databricks-workflow
This repository is an example of how to use Databricks for setting
up a multi-environment data processing pipeline. If you are part of
a Data Engineering or Data Science team, and you want to start a
project in Databricks, you can use this repository as a jump start.
29. Why are Databricks notebooks good for monitoring?
▪ Scalability
▪ Easy for the data team
▪ No running overhead
▪ Easy to change when business logic changes
▪ Easy to call MLflow APIs
30. Key Takeaways
▪ The biggest part of building a monitoring system is knowing what could go wrong
▪ Adding unnecessary metrics to your dashboard adds more confusion
▪ Building a monitoring dashboard should be on the same cycle as product development
▪ Databricks can be used as a monitoring platform too