Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake

Don’t underestimate the Hidden Technical Debt in Machine Learning Systems.

Leverage Apache Kafka’s open ecosystem as a scalable and flexible Event Streaming Platform to build one pipeline for real-time and batch use cases.

Use Streaming Machine Learning with Apache Kafka, Tiered Storage, and TensorFlow IO to simplify your big data architecture.

Tiered Storage for Kafka provides:
- one platform for all data processing
- an event-based source of truth for materialized views
- no need for a pipeline between Kafka and a Data Lake like Hadoop

Benefits:
- cost reduction
- long-term backup
- performance isolation (real-time and historical analysis in the same cluster)

Use Cases for Reprocessing Historical Events:
- New consumer application
- Error-handling
- Compliance / regulatory processing
- Query and analyze existing events
- Model training

Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake

  1. Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake. Kai Waehner, Technology Evangelist. contact@kai-waehner.de | LinkedIn | @KaiWaehner | www.confluent.io | www.kai-waehner.de
  2. Disclaimer: Status for Tiered Storage in August 2020. KIP-405 (Add Tiered Storage Support to Kafka): Confluent is actively working on this with the open source community; Uber is leading this initiative. Confluent Tiered Storage is available today in Confluent Platform and used under the hood in Confluent Cloud. https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage (in the works) www.kai-waehner.de | @KaiWaehner | Streaming Machine Learning without a Data Lake
  3. Event Streaming. Stream processing: create and store materialized views, filter, analyze in-flight. (Diagram: events along a time axis.)
  4. Machine Learning to Improve Traditional and to Build New Use Cases. Real Time Tracking, Predictive Maintenance, Fraud Detection, Cross Selling, Transportation Rerouting, Customer Service, Inventory Management, Autonomous Driving, Face Recognition, Robotics, Speech Translation, Video Generation, Supply Chain Optimization, Simulations, Real Time Information, Digital Transformation, Strategic Goals, Customer Churn.
  5. Global Automotive Company Builds Connected Car Infrastructure. Digital Transformation: improve customer experience, increase revenue, reduce risk. Timeline: 3 years ago, project begins; today, connected car infrastructure in production for first use cases; 2 years in the future, improved processes leveraging machine learning (predictive maintenance, cross-selling).
  6. Streaming Analytics for Predictive Maintenance at Scale. Diagram labels: IoT Integration Layer; Batch Analytics Platform; BI Dashboard; Streaming Platform; Big Data Integration Layer; Car Sensors; Other Components; Real Time Monitoring System; All Data; Critical Data; Ingest Data; Human Intelligence.
  7. Machine Learning (ML) ...allows computers to find hidden insights without being programmed where to look. Machine Learning: Decision Trees, Naïve Bayes, Clustering, Neural Networks, etc. Deep Learning: CNN, RNN, Transformer, Autoencoder, etc.
  8. Streaming Analytics for Predictive Maintenance at Scale. Diagram labels: IoT Integration Layer; Batch Analytics Platform; BI Dashboard; Streaming Platform; Big Data Integration Layer; Car Sensors; Analytics Platform; Other Components; Real Time Monitoring System; All Data; Critical Data; Ingest Data; Potential Detect; Train Analytic Model; Data Processing; Analytic Model; Preprocess Data; Consume Data; Deploy Analytic Model.
  9. The First Analytic Models. How to deploy the models in production? ...real-time processing? ...at scale? ...24/7 with zero downtime?
  10. Hidden Technical Debt in Machine Learning Systems. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  11. Scalable, Technology-Agnostic ML Infrastructures. What is this thing used everywhere? https://www.infoq.com/presentations/netflix-ml-meson https://eng.uber.com/michelangelo https://www.infoq.com/presentations/paypal-data-service-fraud
  12. A Streaming Platform: The Underpinning of an Event-Driven Architecture. Producers: Microservices, DBs, SaaS apps, Mobile. Consumers: Customer 360, Real-time fraud detection, Data warehouse. Streams of real time events: Database change, Microservices events, SaaS data, Customer experiences. Connectors and stream processing apps sit on both sides.
  13. Apache Kafka as Infrastructure for ML
  14. Apache Kafka’s Open Ecosystem as Infrastructure for ML: Kafka Streams / ksqlDB, Kafka Connect, Confluent REST Proxy, Confluent Schema Registry, Go/.NET/Python Kafka Producer, ksqlDB Python Client.
  15. Ingestion of IoT Data. Diagram labels: Cars; Kafka Connect; Replication (MirrorMaker / Confluent Replicator); Analytics / Machine Learning.
  16. Data Preprocessing. Diagram labels: Streams; Preprocessing (filter, transform, anonymize, extract features); Data Ready For Model Training.
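The four preprocessing steps named on slide 16 (filter, transform, anonymize, extract features) could be sketched in Python as below. This is only an illustration, not the speaker's pipeline: the event fields, the Fahrenheit-to-Celsius conversion, and the hashing of `car_id` (strictly speaking pseudonymization rather than full anonymization) are all assumptions made for the example.

```python
import hashlib
from statistics import mean


def preprocess(event):
    """Filter, transform, anonymize, and extract features from one raw event.

    Returns None when the event is filtered out.
    """
    readings = event.get("sensor_input") or []
    if not readings:
        # Filter: drop events that carry no sensor readings.
        return None
    # Transform: assumed unit conversion, Fahrenheit to Celsius.
    celsius = [(f - 32.0) * 5.0 / 9.0 for f in readings]
    # Anonymize (assumption): replace the car id with a truncated hash.
    car_hash = hashlib.sha256(event["car_id"].encode("utf-8")).hexdigest()[:12]
    # Extract features: simple summary statistics per event.
    return {
        "car": car_hash,
        "mean_c": round(mean(celsius), 2),
        "max_c": round(max(celsius), 2),
    }


raw = [
    {"car_id": "car-1", "sensor_input": [212.0, 32.0]},
    {"car_id": "car-2", "sensor_input": []},  # filtered out
]
ready = [r for r in map(preprocess, raw) if r is not None]
```

In a Kafka deployment the same function body would typically live inside a stream processing application rather than a list comprehension.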
  17. Preprocessing with ksqlDB: SELECT c.car_id, c.event_id, c.car_model_id, c.sensor_input FROM car_sensor c LEFT JOIN car_models m ON c.car_model_id = m.car_model_id WHERE m.car_model_type = 'Audi_A8';
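To make the semantics of the ksqlDB query on slide 17 concrete, here is a rough in-memory Python equivalent. The sample rows are invented, and the `car_models` lookup is modelled as a plain dictionary; the real query runs continuously over a stream and a table.

```python
# Hypothetical lookup table: car_model_id -> model metadata.
car_models = {
    1: {"car_model_type": "Audi_A8"},
    2: {"car_model_type": "Audi_A4"},
}

# Hypothetical sensor events; model id 9 has no entry in car_models.
car_sensor = [
    {"car_id": "c1", "event_id": 10, "car_model_id": 1, "sensor_input": [0.1]},
    {"car_id": "c2", "event_id": 11, "car_model_id": 2, "sensor_input": [0.2]},
    {"car_id": "c3", "event_id": 12, "car_model_id": 9, "sensor_input": [0.3]},
]

COLUMNS = ("car_id", "event_id", "car_model_id", "sensor_input")


def select_model(events, models, wanted_type):
    """Left-join each event to its model row, then keep one model type."""
    for event in events:
        model = models.get(event["car_model_id"])  # LEFT JOIN: may be None
        # WHERE on a joined column: unmatched rows (None) never pass.
        if model is not None and model["car_model_type"] == wanted_type:
            yield {name: event[name] for name in COLUMNS}


audi_a8_events = list(select_model(car_sensor, car_models, "Audi_A8"))
```

Because the filter tests a column from the joined side, events without a matching model row are discarded, so the left join behaves like an inner join in this query.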
  18. Data Ingestion into a Data Store for Model Training (and Consumption by other Decoupled Applications). Diagram labels: Preprocessed Data; Connect; Batch; Near Real Time; Real Time.
  19. Model Training Using an Elastic Infrastructure in the Cloud. Extreme scale using TensorFlow and TPUs in the cloud! Analytic Model.
  20. TensorFlow Model: Autoencoder for Anomaly Detection
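Slide 20 shows only the title, so the decision rule commonly paired with an autoencoder is sketched here: flag a sample when its reconstruction error exceeds a threshold. The `fake_autoencoder` function is a hypothetical stand-in for a trained TensorFlow model, and the threshold value is arbitrary.

```python
def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def find_anomalies(samples, reconstruct, threshold):
    """Return indices of samples whose reconstruction error exceeds threshold."""
    return [
        i for i, sample in enumerate(samples)
        if mse(sample, reconstruct(sample)) > threshold
    ]


# Hypothetical stand-in for a trained autoencoder: it "reconstructs" every
# input as the typical healthy reading, so unusual inputs get a large error.
NORMAL = [0.5, 0.5, 0.5]


def fake_autoencoder(sample):
    return list(NORMAL)


samples = [[0.5, 0.5, 0.5], [0.5, 0.75, 0.5], [4.5, 0.5, 0.5]]
anomalies = find_anomalies(samples, fake_autoencoder, threshold=1.0)
```

A real autoencoder learns to reconstruct normal data well, and the threshold is usually chosen from the error distribution on known-good data.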
  21. Streaming Ingestion and Model Training with TensorFlow IO. Direct streaming ingestion for model training with TensorFlow I/O + Kafka Plugin (no additional data storage like S3 or HDFS required!). Diagram: a Producer appends to the Distributed Commit Log over time; Model A, Model B, and Model X (at a later time) consume from it. https://github.com/tensorflow/io
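The idea behind slide 21 is that several training jobs read the same commit log independently, each from its own offset, and a new model can replay the log later. The sketch below simulates that with an in-memory list; it deliberately does not use the TensorFlow I/O Kafka dataset, and `CommitLog` and `train` are hypothetical names for illustration only.

```python
class CommitLog:
    """Append-only log; readers keep track of their own offsets."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records):
        return self._records[offset:offset + max_records]


def train(log, start_offset, batch_size):
    """Consume the log in batches from start_offset; return records seen."""
    seen, offset = 0, start_offset
    while True:
        batch = log.read(offset, batch_size)
        if not batch:
            return seen
        seen += len(batch)  # a real trainer would fit the model on the batch
        offset += len(batch)


log = CommitLog()
for i in range(6):
    log.append({"sensor": i})

model_a_seen = train(log, start_offset=0, batch_size=2)  # full history
model_b_seen = train(log, start_offset=3, batch_size=2)  # recent data only
log.append({"sensor": 6})
model_x_seen = train(log, start_offset=0, batch_size=4)  # trained later
```

Reading never removes records, which is what lets Model X start from offset 0 after Models A and B have already consumed the data.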
  22. Store Data Long-Term in Kafka? Today, Kafka works well for recent events, short horizon storage, and manual data balancing. Kafka’s present-day design offers extraordinarily low messaging latency by storing topic data on fast disks that are collocated with brokers. This is usually good. But sometimes, you need to store a huge amount of data for a long time. Diagram labels: Apps; Kafka; Processing (transactions, auth, quota enforcement, compaction, ...); Storage.
  23. Simplified Data Lake Architecture. Tiered Storage for Kafka provides: one platform for all data processing; an event-based source of truth for materialized views; no need for a pipeline between Kafka and a Data Lake like Hadoop. Benefits: cost reduction; long-term backup; performance isolation (real-time and historical analysis in the same cluster).
  24. Confluent Tiered Storage for Kafka (only available in Confluent Platform). Diagram labels: Apps; Kafka; Processing (transactions, auth, quota enforcement, compaction, ...); Storage, split into Local and Remote (Object Store). Store Forever: older data is offloaded to inexpensive object storage, permitting it to be consumed at any time. Save $$$: storage limitations, like capacity and duration, are effectively uncapped. Instantaneously scale up and down: your Kafka clusters will be able to automatically self-balance load and hence elastically scale.
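The offloading behaviour described on slide 24 can be illustrated with a toy model: recent records stay on a small local tier, older ones move to a remote tier, and readers address both by the same offset. This is a simplification under stated assumptions; `TieredLog` is a hypothetical class, and the real feature works on log segments with time- and size-based policies rather than single records.

```python
class TieredLog:
    """Toy log with a bounded local tier and an unbounded remote tier."""

    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = []   # stands in for fast broker disks
        self.remote = []  # stands in for an object store

    def append(self, record):
        self.local.append(record)
        # Offload the oldest records once the local tier is over capacity.
        while len(self.local) > self.local_capacity:
            self.remote.append(self.local.pop(0))

    def read(self, offset):
        """Return (record, tier) for a logical offset, whichever tier holds it."""
        if offset < len(self.remote):
            return self.remote[offset], "remote"
        return self.local[offset - len(self.remote)], "local"


tiered = TieredLog(local_capacity=2)
for i in range(5):
    tiered.append(f"event-{i}")

old_record = tiered.read(1)  # served from the remote tier
new_record = tiered.read(4)  # served from the local tier
```

The consumer-facing offset does not change when a record moves between tiers, which is why historical reads need no separate pipeline.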
  25. Confluent Tiered Storage for Kafka (only available in Confluent Platform)
  26. Use Cases for Reprocessing Historical Events. "Give me all events from time A to time B": new consumer application; error-handling; compliance / regulatory processing; query and analyze existing events; model training. Diagram labels: Real-time Producer; Real-time Consumer; Consumer of Historical Data; Time.
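The request "give me all events from time A to time B" from slide 26 amounts to a range lookup over a time-ordered log. A minimal sketch, assuming events are stored as `(timestamp, payload)` tuples sorted by timestamp and that the end of the range is exclusive:

```python
from bisect import bisect_left


def events_between(log, start_ts, end_ts):
    """Return events with start_ts <= timestamp < end_ts from a sorted log."""
    timestamps = [ts for ts, _ in log]
    lo = bisect_left(timestamps, start_ts)  # first event at or after start_ts
    hi = bisect_left(timestamps, end_ts)    # first event at or after end_ts
    return log[lo:hi]


log = [(100, "a"), (200, "b"), (300, "c"), (400, "d")]
window = events_between(log, 200, 400)
```

With a real cluster the equivalent step is a timestamp-to-offset lookup on the consumer (`offsetsForTimes` in the Java client), followed by a seek to the returned offset and reading until the end timestamp is reached.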
  27. Separation of Model Training and Model Inference. Diagram labels: Model Training in Cloud; Analytic Model; Model Deployment at the Edge; Local Predictions.
  28. Stream Processing with External Model and RPC. Diagram labels: Input Event; Streams Application; Request and Response over gRPC / HTTP; Model Serving (TensorFlow Serving); Prediction.
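For the HTTP variant of the external-model pattern on slide 28, TensorFlow Serving exposes a REST predict endpoint that takes a JSON body with an `instances` list and answers with a `predictions` list. The sketch below only builds the request and parses a response; the host, the model name `anomaly_detector`, and the sample payload are assumptions, and no network call is made.

```python
import json


def build_predict_request(host, model_name, sensor_rows):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": sensor_rows})
    return url, body


def parse_predict_response(text):
    """Extract the predictions list from a predict response body."""
    return json.loads(text)["predictions"]


# Hypothetical host and model name; 8501 is the usual REST port.
url, body = build_predict_request("localhost:8501", "anomaly_detector", [[0.1, 0.2]])

# Example response text as the server might return it (made-up values).
preds = parse_predict_response('{"predictions": [[0.1, 0.19]]}')
```

In a stream processing application each input event would trigger one such request, so the model server's latency and availability become part of the pipeline's latency and availability.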
  29. Stream Processing with Embedded Model. Diagram labels: Input Event; Streams; Stream Processing; Model; doPrediction(); return value; Prediction.
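The embedded-model pattern on slide 29 keeps the model inside the stream processing application, so a prediction is a local function call instead of a remote request. A minimal sketch, where `ThresholdModel` is a hypothetical stand-in for a loaded TensorFlow model and `do_prediction` mirrors the slide's `doPrediction()`:

```python
class ThresholdModel:
    """Hypothetical in-process model; stands in for a loaded TensorFlow model."""

    def __init__(self, limit):
        self.limit = limit

    def do_prediction(self, values):
        # Flags the event when any reading exceeds the limit.
        return max(values) > self.limit


def process_stream(events, model):
    """Apply the embedded model to each input event and emit a prediction."""
    for event in events:
        yield {
            "sensor_id": event["sensor_id"],
            "anomaly": model.do_prediction(event["sensor_values"]),
        }


events = [
    {"sensor_id": "s1", "sensor_values": [70.0, 72.5]},
    {"sensor_id": "s2", "sensor_values": [71.0, 99.0]},
]
predictions = list(process_stream(events, ThresholdModel(limit=90.0)))
```

Compared with the RPC approach, there is no network hop per event, at the cost of redeploying the application whenever the model changes.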
  30. Model Deployment with Apache Kafka, ksqlDB and TensorFlow. User Defined Function (UDF): "CREATE STREAM AnomalyDetection AS SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine;"
  31. Streaming Analytics with Kafka and TensorFlow. Diagram labels: MQTT Proxy; MongoDB Storage; MongoDB Dashboards; Search; Analytics; Kafka Cluster; Kafka Connect; Car Sensors; Kafka Ecosystem; TensorFlow; Other Components; Kafka Streams Application; All Data; Critical Data; Ingest Data; Potential Detect; Train Analytic Model; ksqlDB; Analytic Model; Preprocess Data; Consume Data; Deploy Analytic Model; Tiered Storage; Mobile App; BI Tool.
  32. Demo: 100,000 Connected Cars (Kafka + ksqlDB + MQTT + TensorFlow) https://github.com/kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference
  33. Live Demo
  34. Machine Learning + Apache Kafka → Examples @ GitHub: https://github.com/kaiwaehner
  35. One pipeline to rule them all. Real-time model scoring, batch model training, near-real time BI analytics. "Give me all events from time A to time B". Diagram labels: Car sensors (MQTT connector); Time; Production infrastructure (Java); Data science / analytics infrastructure (Python + Jupyter).
  36. Questions? Feedback? Let’s connect! Kai Waehner, Technology Evangelist. contact@kai-waehner.de | @KaiWaehner | LinkedIn | www.confluent.io | www.kai-waehner.de
