Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Apache Kafka in conjunction with Apache Spark has become the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.

Unfortunately, this flexibility and scalability make operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.

This session explores different ways to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.

We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.

Real-world use cases show the joint value and explore the benefit of the Delta Lake integration.

  1. Serverless Kafka and Spark in a Multi-Cloud Data Lakehouse Architecture. Kai Waehner, Field CTO. kai.waehner@confluent.io | linkedin.com/in/kaiwaehner | @KaiWaehner | confluent.io | kai-waehner.de
  2. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe
  3. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  4. Storage at Rest. Table columns: USER (JAY, SUE, FRED), CREDIT_SCORE (695, 430, 710). Version labels: V1, V3, V2
  5. Analytics at Rest. Active Query: SELECT * FROM DB_TABLE. Passive Data: DB Table
  6. Use Cases for Data at Rest • Reporting • Business Intelligence • Data Engineering • Big Data Analytics • Machine Learning
  7. Apache Spark – The De Facto Standard for Big Data at Rest. Big Data In, Big Data Out, Big Data Storage and Processing. From Historical Data to Insights
  8. Delta Lake: Open-source storage framework and open format for data analytics
  9. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  10. Real-time Data beats Slow Data.
  11. Real-time Data beats Slow Data. Transportation: Real-time sensor diagnostics, Driver-rider match, ETA updates. Insurance: Claim processing, Fraud detection, Omnichannel quote processing. Retail: Real-time inventory, Real-time POS reporting, Personalization. Entertainment: Real-time recommendations, Personalized news feed, In-app purchases
  12. Data at Rest: Active Query (SELECT * FROM DB_TABLE), Passive Data (DB Table). Data in Motion: Active Data (Event Stream), Passive Query (CREATE TABLE T AS SELECT * FROM EVENT_STREAM)
  13. Tables at Rest: USER (JAY, SUE, FRED), CREDIT_SCORE (695, 430, 710), version labels V1, V3, V2. Streams in Motion: PAYMENTS (42, 18, 65, ...), USER (JAY, SUE, FRED, ...)
  14. Data Streaming = Data at Rest + Data in Motion. Payments Stream, Credit Score Stream. CREATE TABLE credit_scores AS SELECT user, updateScore(p.amount)…
  15. Apache Kafka – The De Facto Standard for Data in Motion. Diagram labels: Database, CRM, Sensors, Mobile, Customer 360, Real-time Alerting System, Data warehouse, Producers, Consumers, Streams of real time events, Stream processing apps, Connectors, Incident, Alert, Forecast, Pricing, Customer, Order
  16. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  17. Data Lakehouse. Building the Data Lakehouse (Author: Bill Inmon). Lakehouse is a logical view, not physical!
  18. Lambda Architecture Option 1: Unified serving layer. Diagram labels: Data Source, Real-Time Layer (Data Processing in Motion), Batch Layer (Data Processing at Rest), Serving Layer, Real-Time App (Data Processing in Motion), Batch App (Data Processing at Rest). Latency: ms, min/hr
  19. Lambda Architecture Option 2: Separate serving layers. Diagram labels: Data Source, Real-Time Layer (Data Processing in Motion), Batch Layer (Data Processing at Rest), Speed View, Batch View, Real-time Query, Mixed Query, Batch Query. Latency: ms, min/hr
  20. Kappa Architecture: One pipeline for real-time and batch consumers. Diagram labels: Data Source, Real-Time Layer (Data Processing in Motion), Real-Time App (Data Processing in Motion), Batch App (Data Processing at Rest), Storage. Latency: ms, min/hr
  21. Kappa @ Uber kai-waehner.de | @KaiWaehner | Kappa vs. Lambda Architecture
  22. Confluent + Databricks Reference Architecture. Diagram labels: Sources (Legacy Data Stores: Netezza, Teradata, Oracle, Mainframes; Databases; IoT); Kafka Connect (On Premises or any cloud); Data Streaming Platform built on Kafka (On Premises or any cloud); Kafka Streams & ksqlDB - real-time stream processing and transformations; Databricks Delta Lake Sink Connector for Confluent Cloud (AWS); Databricks Data Science Workspace; Databricks BI Workspace; Data Streaming; Analytics
  23. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  24. Connected Car Infrastructure at Audi • Real Time Data Analysis • Swarm Intelligence • Collaboration with Partners • Predictive AI • … https://www.youtube.com/watch?v=yGLKi3TMJv8
  25. Connected Car Infrastructure at Audi https://www.youtube.com/watch?v=yGLKi3TMJv8
  26. Kappa Architecture for a Lakehouse with Kafka and Spark. Diagram labels: MQTT Proxy, Spark Core, Storage, Spark SQL, Reporting, Kafka Cluster, Kafka Connect, Car Sensors, Kafka Ecosystem, Spark Ecosystem, Other Components, Kafka Streams, All Data, Critical Data, Ingest Data, Potential Detect, Spark MLlib, Model Training, ksqlDB, Model Deployment, Preprocess Data, Consume Data, Deploy Analytic Model, Mobile App, BI Tool
  27. Machine Learning Model Training with Spark MLlib https://dev.to/siddhantpatro/spark-mllib-for-big-data-and-machine-learning-330j
  28. Model Deployment with Apache Kafka, ksqlDB and Spark MLlib. User Defined Function (UDF): “CREATE STREAM AnomalyDetection AS SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine;”
  29. Stream Processing with Kafka or Spark? Kafka Streams / ksqlDB: Component of the data streaming infrastructure; Low latency; Focus on 24/7 operations; Lightweight, decoupled microservices. Spark Streaming: Component of the data analytics infrastructure; Strong integration with the rest of the Spark ecosystem; Stream and batch; Machine Learning “embedded”
  30. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  31. Cloud-Native Deployment → Elastic Infrastructure and Faster Time-to-Market
  32. What is a (truly) fully-managed SaaS? Deployment models compared: Self-managed IaaS, Hosted Cloud Service, Fully Managed SaaS. Legend: You Manage, Provider Managed. Layers listed for each model: Scaling, Load balancing, Partition placement, Logical Storage, Broker settings, ZooKeeper, Kafka patching, JVM, O/S, VMs, Servers. Axes: Provider managed features, Product ease of use (Self-Managed, Partially Managed, Fully Managed)
  33. Agenda • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  34. AWS Cloud Outage hit Disney World Visitors… https://www.cnet.com/tech/services-and-software/disney-parks-were-already-facing-heat-from-fans-then-an-aws-outage-came-along/
  35. Disaster Recovery – RPO and RTO. RPO = Recovery Point Objective. RTO = Recovery Time Objective
  36. Use Cases for Hybrid and Multi-Cloud Data Lakehouses • Disaster Recovery and High Availability: Create a disaster recovery cluster, and fail over to it during an outage. • Global and Multi-Cloud Replication: Move and aggregate data across regions and clouds. • Data Sharing: Share data with other teams, lines-of-business, or organizations. • Data Migration: Migrate data and workloads from one cluster to another (like from legacy on-premises data warehouse to cloud-native data lakehouse). Data Replication at Rest or in Motion?
  37. Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. Global Data Lakehouse across Edge and Hybrid Cloud. Streaming Replication between Kafka Clusters. Bridge to Databases, Data Lakes, Apps, APIs, SaaS. Aggregation of Edge Deployments with Replication (Aggregation). Disaster Recovery Operations with Multi-Region Clusters for RPO=0 and RTO~0. Global Data Streaming with Replication and Cluster Linking
  38. A data mesh for decentralized data products. Data Product: Independent Data Products for Reporting, Analytics, Data Streaming. For instance: A KSQL microservice
  39. Kai Waehner, Field CTO. kai.waehner@confluent.io | @KaiWaehner | confluent.io | kai-waehner.de | linkedin.com/in/kaiwaehner. Questions? Feedback? Let’s connect!
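
The stream-to-table idea on slides 13 and 14 (a credit score table kept up to date by a payments stream) can be sketched in plain Python. This is only an illustration under stated assumptions, not the deck's implementation: the pairing of users with payment amounts, the `Payment` type, the `materialize_credit_scores` helper, and the scoring rule inside `update_score` are all hypothetical, since the slide elides the body of `updateScore`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Payment:
    """One event on the (hypothetical) payments stream."""
    user: str
    amount: int


def update_score(current: int, amount: int) -> int:
    # Hypothetical rule, assumed for illustration only:
    # every 10 units paid adds one point, capped at 850.
    return min(850, current + amount // 10)


def materialize_credit_scores(
    payments: Iterable[Payment], initial: Dict[str, int]
) -> Dict[str, int]:
    """Fold a stream of payment events into a table keyed by user.

    The table is the "data at rest" view; each event is "data in motion"
    that updates exactly one row as it arrives.
    """
    table = dict(initial)  # copy so the starting table is not mutated
    for event in payments:
        table[event.user] = update_score(table.get(event.user, 0), event.amount)
    return table


# Starting table taken from the slide; the user/amount pairing is assumed.
initial_scores = {"JAY": 695, "SUE": 430, "FRED": 710}
payments = [Payment("JAY", 42), Payment("SUE", 18), Payment("FRED", 65)]

credit_scores = materialize_credit_scores(payments, initial_scores)
print(credit_scores)
```

In ksqlDB the same fold would be expressed declaratively as a continuously running query, whereas this sketch processes a finite list once.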