©2022 Databricks Inc. — All rights reserved
Streaming Data Into the Lakehouse
Current.io 2022
Frank Munz, Ph.D.
Oct 2022
About me
• @Databricks, Data and AI Products, formerly DevRel EMEA
• All things large scale data & compute
• Based in Munich, 🍻 ⛰ 🥨
• Twitter: @frankmunz
• Formerly AWS Tech Evangelist, SW architect, data scientist, published author, etc.
What do you love most about Apache Spark?
Spark SQL and DataFrame API
# PYTHON: read multiple JSON files with multiple datasets per file
cust = spark.read.json('/data/customers/')
cust.createOrReplaceTempView("customers")
cust.filter(cust['car_make'] == "Audi").show()
%sql
select * from customers where car_make = "Audi"
-- PYTHON:
-- spark.sql('select * from customers where car_make ="Audi"').show()
Automatic Parallelization and Scale
Reading from Files vs Reading from Streams
# read file(s)
cust = spark.read.format("json").load('/data/customers/')

# read from messaging platform
sales = spark.readStream.format("kafka|kinesis|socket").load()
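Each streaming source takes its own connection options. A minimal Kafka read sketch; the broker address and topic name are hypothetical:

sales = (spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
  .option("subscribe", "sales")                       # hypothetical topic
  .option("startingOffsets", "earliest")
  .load())
# Kafka records arrive as binary key/value columns plus topic, partition, offset, and timestamp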
Stream Processing

Traditional processing is one-off and bounded: (1) data source, (2) processing, done.
Stream processing is continuous and unbounded: data flows from the source through processing as it arrives.
Technical Advantages

• A more intuitive way of capturing and processing continuous and unbounded data
• Lower latency for time-sensitive applications and use cases
• Better fault tolerance through checkpointing
• Better compute utilization and scalability through continuous and incremental processing
Business Benefits

• BI and SQL Analytics: fresher and faster insights, leading to quicker and better business decisions
• Data Engineering: sooner availability of cleaned data, enabling more business use cases
• Data Science and ML: more frequent model updates and inference, giving better model efficacy
• Event-Driven Applications: faster customized response and action, creating a better, differentiated customer experience
Streaming Misconceptions
Misconception #1: Stream processing is only for low latency use cases

Wrong: stream processing can be applied to use cases of any latency; "batch" is a special case of streaming.

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "1")    // bound the size of each micro-batch
  .load(inputDir)
  .writeStream
  .trigger(Trigger.AvailableNow)        // process everything available, then stop
  .option("checkpointLocation", chkdir)
  .start()
Misconception #2: The lower the latency, the "better"

Wrong: latency, accuracy, and cost pull against each other. Choose the right latency, accuracy, and cost tradeoff for each specific use case.
Streaming is about the programming paradigm
Spark Structured Streaming
Structured Streaming

A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

+ Project Lightspeed: predictable low latencies
Structured Streaming

• Source: read from an initial offset position, and keep tracking the offset position as processing makes progress
• Transformation: apply the same transformations using a standard dataframe
• Sink: write to a target, and keep updating the checkpoint as processing makes progress
• Trigger: determines when each round of processing runs
Streaming ETL

spark.readStream.format("kafka|kinesis|socket|...")            // source
  .option(<>,<>)...                                            // config
  .load()
  .select($"value".cast("string").alias("jsonData"))           // transformation
  .select(from_json($"jsonData", jsonSchema).alias("payload"))
  .writeStream                                                 // sink
  .format("delta")
  .option("path", ...)
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .option("checkpointLocation", ...)
  .start()
Trigger Types

● Default: process as soon as the previous batch has been processed
● Fixed interval: process at a user-specified time interval
● One-time: process all of the available data and then stop
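A minimal PySpark sketch of setting these trigger types; the source, sink, and checkpoint paths are hypothetical:

df = spark.readStream.format("delta").load("/tmp/events")  # hypothetical streaming source

# Default: the next micro-batch starts as soon as the previous one finishes
df.writeStream.option("checkpointLocation", "/tmp/ckpt1").start("/tmp/out1")

# Fixed interval: one micro-batch every 30 seconds
df.writeStream.trigger(processingTime="30 seconds").option("checkpointLocation", "/tmp/ckpt2").start("/tmp/out2")

# One-time: process all available data, then stop (Spark 3.3+ also offers availableNow=True)
df.writeStream.trigger(once=True).option("checkpointLocation", "/tmp/ckpt3").start("/tmp/out3")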
Output Modes

● Append (default): only new rows added to the result table since the last trigger are output to the sink
● Complete: the whole result table is output to the sink after each trigger
● Update: only the rows updated in the result table since the last trigger are output to the sink
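A sketch of how an output mode is set; complete mode is typical for aggregations like this hypothetical count:

df = spark.readStream.format("delta").load("/tmp/events")  # hypothetical streaming source
counts = df.groupBy("car_make").count()                    # hypothetical aggregation

(counts.writeStream
  .outputMode("complete")       # rewrite the whole result table every trigger
  .format("memory")             # in-memory sink, handy for interactive inspection
  .queryName("counts_by_make")
  .start())

spark.sql("SELECT * FROM counts_by_make").show()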
Structured Streaming Benefits

• High Throughput: optimized for high throughput and low cost
• Rich Connector Ecosystem: streaming connectors ranging from message buses to object storage services
• Exactly Once Semantics: fault tolerance and exactly-once semantics guarantee correctness
• Unified Batch and Streaming: a unified API makes development and maintenance simple
The Data Lakehouse
Data Maturity Curve

From "What happened?" to "What will happen?": as data + AI maturity grows, so does competitive advantage. Reports, clean data, and ad hoc queries sit on the data warehouse for BI; data exploration, predictive modeling, prescriptive analytics, and automated decision making sit on the data lake for AI.
Two disparate, incompatible data platforms

The data warehouse holds structured tables, governed and secured via table ACLs, and serves business intelligence and SQL analytics. The data lake holds unstructured files (logs, text, images, video), governed and secured at the level of files and blobs, and serves data science & ML and data streaming. Copying subsets of data between the two creates:

• Disjointed and duplicative data silos
• Incompatible security and governance models
• Incomplete support for use cases
Databricks Lakehouse Platform

One platform for data warehousing, data engineering, data science and ML, and data streaming, over all structured and unstructured data in the cloud data lake, with Unity Catalog for fine-grained governance of data and AI, and Delta Lake for data reliability and performance.

• Simple: unify your data warehousing and AI use cases on a single platform
• Open: built on open source and open standards
• Multicloud: one consistent data platform across clouds
Streaming on the Lakehouse

Streaming sources such as DB change data feeds, clickstreams, machine & application logs, mobile & IoT data, and application events feed streaming ingestion, streaming ETL, event processing, and event-driven applications on the Lakehouse Platform. Typical outcomes include ML inference, live dashboards, near real-time queries, alerts, fraud prevention, dynamic UIs, dynamic ads, dynamic pricing, device control, game scene updates, etc.
Lakehouse Differentiations

• Unified Batch and Streaming: no overhead of learning, developing on, or maintaining two sets of APIs and data processing stacks
• Favorite Tools: provide diverse users with their favorite tools to work with streaming data, enabling the broader organization to take advantage of streaming
• End-to-End Streaming: has everything you need, with no need to stitch together different streaming technology stacks or tune them to work together
• Optimal Cost Structure: easily configure the right latency-cost tradeoff for each of your streaming workloads
but what is the foundation of the Lakehouse Platform?
One source of truth for all your data

Open-format Delta Lake is the foundation of the Lakehouse:

• open source format, based on Parquet
• adds quality, reliability, and performance to your existing data lakes
• provides one common data management framework for batch & streaming, ETL, analytics & ML

✓ ACID Transactions ✓ Time Travel ✓ Schema Enforcement ✓ Identity Columns
✓ Advanced Indexing ✓ Caching ✓ Auto-tuning ✓ Python, SQL, R, Scala Support

All open source.
Lakehouse: Delta Lake Publication
Link to PDF
Delta’s expanding ecosystem of connectors
(Diagram: connector ecosystem, with several connectors marked "New!" and "Coming Soon!")
Under the Hood

A write-ahead log with optimistic concurrency provides serializable ACID transactions:

my_table/                 ← table name
  _delta_log/             ← delta log
    00000.json            ← delta for tx
    00001.json
    …
    00010.json
    00010.chkpt.parquet   ← checkpoint files
  date=2022-01-01/        ← optional partition
    File-1.parquet        ← data in parquet format
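Since the log is plain JSON on storage, it can be inspected directly; a sketch, assuming the table lives at the hypothetical path /data/my_table:

log = spark.read.json("/data/my_table/_delta_log/*.json")  # one row per logged action
log.select("commitInfo.operation", "add.path", "remove.path").show(truncate=False)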
Delta Lake Quickstart

pyspark --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
https://github.com/delta-io/delta
Spark: Convert to Delta
# convert to delta format
cust = spark.read.json('/data/customers/')
cust.write.format("delta").saveAsTable(table_name)
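If the source data is already Parquet, Delta Lake can also convert it in place without rewriting the files; a sketch with a hypothetical path:

# registers the existing Parquet files as a Delta table by writing a _delta_log
spark.sql("CONVERT TO DELTA parquet.`/data/customers_parquet/`")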
All of Delta Lake 2.0 is open

ACID transactions, scalable metadata, time travel, open source, unified batch/streaming, schema evolution/enforcement, audit history, DML operations, OPTIMIZE compaction, OPTIMIZE ZORDER, change data feed, table restore, S3 multi-cluster writes, MERGE enhancements, stream enhancements, simplified logstore, data skipping via column stats, multi-part checkpoint writes, generated columns, column mapping, generated column support with partitioning, identity columns, subqueries in deletes and updates, clones, Iceberg to Delta converter, fast metadata-only deletes (coming soon!)
Without Delta Lake: No UPDATE (plain data-lake formats such as Parquet do not support DML; Delta Lake adds it on top of Parquet)
Delta Examples (in SQL for brevity)

-- Time travel
SELECT * FROM student TIMESTAMP AS OF "2022-05-28"

-- Clones
CREATE TABLE test_student SHALLOW CLONE student

-- Table history
DESCRIBE HISTORY student

-- Update/Upsert
UPDATE student SET lastname = "Miller" WHERE id = 2805

-- Compact files and co-locate data
OPTIMIZE student ZORDER BY age

-- Clean up stale files
VACUUM student RETAIN 12 HOURS
… but how does this perform for DWH workloads?
Databricks SQL: Serverless + Photon

• Serverless: eliminate compute infrastructure management; instant, elastic compute; zero management; lower TCO
• Photon: vectorized C++ execution engine with the Apache Spark API

TPC-DS benchmark, 100 TB: https://dbricks.co/benchmark
… and streaming?
Streaming: Built into the foundation

Delta tables can be streaming sources and sinks:

# from pyspark.sql.functions import window
display(spark.readStream.format("delta").table("heart.bpm")
  .groupBy("bpm", window("time", "1 minute"))
  .avg("bpm")
  .orderBy("window", ascending=True))
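The query above uses a Delta table as a streaming source; the sink direction is symmetric. A minimal sketch of writing a stream into a Delta table, with a hypothetical target table and checkpoint path:

(spark.readStream.format("delta").table("heart.bpm")
  .writeStream.format("delta")
  .option("checkpointLocation", "/tmp/ckpt/bpm_copy")
  .toTable("heart.bpm_copy"))  # continuously appends into the target Delta table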
How can you simplify Streaming Ingestion and Data Pipelines?
Auto Loader
• Python & Scala (and SQL in Delta Live Tables!)
• Streaming data source with incremental loading
• Exactly once ingestion
• Scales to large amounts of data
• Designed for structured, semi-structured and unstructured data
• Schema inference, enforcement with data rescue, and evolution (sketched after the example below)
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .load("/path/to/table"))
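The schema inference, rescue, and evolution mentioned above are controlled by a few options; a sketch with hypothetical paths:

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/tmp/schemas/customers")  # where the inferred schema is tracked
  .option("cloudFiles.schemaEvolutionMode", "rescue")             # unexpected fields land in _rescued_data
  .load("/data/customers/"))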
“We ingest more than 5 petabytes per day with Auto Loader”
Delta Live Tables

Reliable, declarative, streaming data pipelines in SQL or Python:

CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/raw_data", "json")

CREATE LIVE TABLE clean_data
AS SELECT …
FROM LIVE.raw_data

• Accelerate ETL development: declare SQL or Python, and DLT automatically orchestrates the DAG, handles retries, and copes with changing data
• Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
• Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
• Unify batch and streaming: get the simplicity of SQL with the freshness of streaming in one unified API
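The same pipeline expressed in Python uses the dlt decorator API; a minimal sketch (table names, path, and filter column are hypothetical):

import dlt
from pyspark.sql.functions import col

@dlt.table
def raw_data():
    # incremental ingestion with Auto Loader
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw_data"))

@dlt.table
def clean_data():
    # read the upstream streaming live table and drop incomplete records
    return dlt.read_stream("raw_data").where(col("car_make").isNotNull())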
COPY INTO
• SQL command
• Idempotent and incremental
• Great when the source directory contains on the order of thousands of files
• Schema automatically inferred
COPY INTO my_delta_table
FROM 's3://my-bucket/path/to/csv_files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header'='true','inferSchema'='true')
10k View
Lakehouse Platform: Streaming Options

• Messaging systems (Apache Kafka, Kinesis, …) are read via the Apache Kafka connector for Spark Structured Streaming
• Files / object stores (S3, ADLS, GCS, …) are read via Databricks Auto Loader
• Both paths run through Spark Structured Streaming and Delta Live Tables (SQL/Python) into Delta Lake
• Data consumers read from Delta Lake directly, or through the Delta Lake sink connector
It's demo time!
Demo 1
Delta Live Tables
Twitter Sentiment Analysis
Tweepy API: Streaming Twitter Feed
Auto Loader: Streaming Data Ingestion
Ingest Streaming Data in a Delta Live Table pipeline
Declarative, auto-scaling Data Pipelines in SQL
CTAS Pattern: Create Table As Select …
Declarative, auto-scaling Data Pipelines
DWH / SQL Persona
ML/DS: Hugging Face Transformer
Demo 2
Demo: Data Donation Project
https://corona-datenspende.de/science/en/
System Architecture
Resources
Demos on Databricks Github
https://github.com/databricks/delta-live-tables-notebooks
Databricks Blog: DLT with Apache Kafka
https://www.databricks.com/blog/2022/08/09/low-latency-streaming-data-pipelines-with-delta-live-tables-and-apache-kafka.html
Resources
• The best platform to run your Spark workloads
• Databricks Demo Hub
• Databricks GitHub
• Databricks Blog
• Databricks Community / Forum (join to ask your tech questions!)
• Training & Certification
Thank You!
https://speakerdeck.com/fmunz
@frankmunz
