Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
×
1 of 29

Lessons Learned Developing and Managing High Volume Apache Spark Pipelines in Production with Erni Durdevic

1

Share

Download to read offline

Quby is the creator and provider of Toon, a leading European smart home platform. We enable Toon users to control and monitor their homes using both an in-home display and app. As a data driven company, we use machine learning algorithms to generate actionable insights for our end users. We have developed data driven services to ensure that users do not needlessly waste energy and can receive real-time alerts about problems with their heating system.

In this talk, Erni will describe our journey of productionizing data science algorithms. We’ll take a deep dive into our pipeline and describe our streamlined development and deployment workflow. We’ll explain how we define and manage dependencies between jobs in multiple environments (test, acceptance and production) and schedule the pipeline computation. We’ll delve into scale challenges, metrics, monitoring and data quality. Also, we will reflect on the lessons learned while building high volume infrastructure that offers multiple data driven services to hundreds of thousands of users.

Related Books

Free with a 30 day trial from Scribd

See all

Lessons Learned Developing and Managing High Volume Apache Spark Pipelines in Production with Erni Durdevic

  1. 1. Erni Durdevic, Quby B.V. Lessons Learned Developing and Managing High Volume Apache Spark Pipelines in Production #SAISML4
  2. 2. Launched in 2016 Launched in 2017 Launched in 2012 Testing phase Our partners: Over 350,000 connected homes across Europe
  3. 3. SMART THERMOSTAT & APP TOON SOLAR BOILER MONITORINGWASTE CHECKER MONTHLY ENERGY INSIGHT WATER INSIGHT SECURITY PACKAGE & APP SMART METER DONGLE & APP ENERGY INSIGHTS APP DATA SERVICES
  4. 4. #SAISML4 Z-Wave Meter adaptor Boiler adapter Gas sensor Philips Hue Z-Wave Central heating system Solar panels Click data Smart plugs & smoke detectors Electricity sensor Over 300 types of sensor and user interaction data >1200 TB in total
  5. 5. #SAISML4 data collector Master node Job execution Batch pipeline
  6. 6. USE CASE #1 Waste Checker ? #SAISML4
  7. 7. Energy Waste Checker “We don’t always notice how much energy we’re wasting. Toon can now expose the energy guzzlers in your home.” Launched in December 2017 to all Eneco Toon users #SAISML4
  8. 8. Quby’s disaggregation algorithms Patent pending algorithms can detect appliances from 10 second resolution electricity meter data DishwasherWashing machine Washing machine(only night hours shown) DryerDryer #SAISML4
  9. 9. Use case example: Inefficient dishwasher diagnosis Disaggregation algorithms run on the 10s electricity meter data Compared with industry standards and peers Translated to personalised advice for the end user Toon determines the “fingerprint” of the appliance through features #SAISML4
  10. 10. Scale of the Waste checker Each day we detect over 75,000dishwasher cycles That’s 13 years of dishwashers running continuously and over 25% are used inefficiently #SAISML4
  11. 11. #SAISML4 Example data flow 11
  12. 12. #SAISML4 Example data flow 12 Getting the data ready • Extraction & Cleaning • Resampling • Vectorization
  13. 13. #SAISML4 Example data flow 13 Detecting appliances • Signal processing • Machine learning (One Big model) Vectorized electricity Predictions
  14. 14. #SAISML4 Example data flow 14 Finding user behavior • Clustering per user • Many, many, many (small) models + Detections Behaviour
  15. 15. #SAISML4 Example data flow 15 Drawing conclusions • Comparing to other users • Comparing to industry standards
  16. 16. The Data Pipeline #SAISML4
  17. 17. #SAISML4 Managing jobs (Option 1) 17 • Databricks Jobs & Notebook Workflows https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache- spark-pipelines.html
  18. 18. #SAISML4 Managing jobs (Option 2) 18 • Airflow DAGs https://airflow.apache.org/ • Must read: – ETL principles: https://gtoonstra.github.io/etl-with-airflow/principles.html – Gotcha’s: https://gtoonstra.github.io/etl-with-airflow/gotchas.html
  19. 19. #SAISML4 When designing data pipelines 19 • Enforce idempotent constraints • Enforce reproducibility • Let data transformations be chainable • Leverage partitioning and data locality
  20. 20. #SAISML4 Monitoring • Live dashboards with aggregated data 20
  21. 21. #SAISML4 Monitoring and Validation • Daily email to Quby’s VIP employees 21
  22. 22. #SAISML4 Alerting 22 Alerting (via Email / Slack) • If anything goes wrong • If an independent monitoring job detects missing data
  23. 23. USE CASE #2 Detecting anomalies in heating systems #SAISML4
  24. 24. #SAISML4 Structured Streaming 24 Anomaly classification
  25. 25. #SAISML4 Multiple Streaming aggregations 25 When working with streams in spark it is not possible to do multiple aggregations on the same stream • E.g. Forking a stream in multiple streams • E.g. Do consecutive aggregations Work around Output the aggregations on a sink and read it back in Spark (E.g. Kafka, Kinesis, Delta tables) stream groupBy(window($”timestamp”, “5 minutes”)) groupBy(window($”timestamp”, “1 hour”)) stream filter(…) groupBy(…) filter(…) groupBy(…) Delta Delta
  26. 26. #SAISML4 Non time-based windows on streams 26 • When working with streams in spark it is not possible to compute non time-based window operations – E.g. Compute the derivative of a signal Our solution Compute non time-based window operations • Use (Flat) mapGroupWithState (Beware: no ordering guarantee) • Inside a time-based window by collecting a list of Struct(Timestamp, Value) timestamp lag(timestamp) 2018-10-04 14:00:00 2018-10-04 15:00:00 2018-10-04 14:00:00 2018-10-04 16:00:00 2018-10-04 15:00:00
  27. 27. #SAISML4 Stream to stream joins • When doing stream to stream joins, keep an eye on the distinct key count on aggregation state 27
  28. 28. #SAISML4 Streaming in production • Structured Streaming in Production checklist – Setup recovery of queries from failure • Configure Checkpointing • Query restart – Configure Spark scheduler pool for efficiency – Optimize performance of stateful streaming queries – Configure multiple watermark policy Reference: https://docs.databricks.com/spark/latest/structured-streaming/production.html 28
  29. 29. Erni Durdevic Machine Learning Engineer +31 (0) 638 11 94 12 erni.durdevic@quby.com #SAISML4

×