Streaming Data Into Your Lakehouse With Frank Munz | Current 2022
1. ©2022 Databricks Inc. — All rights reserved
Streaming Data
Into the Lakehouse
Current.io 2022
Frank Munz, Ph.D.
Oct 2022
2. ©2022 Databricks Inc. — All rights reserved
About me
• @Databricks, Data and AI Products, formerly DevRel EMEA
• All things large scale data & compute
• Based in Munich, 🍻 ⛰ 🥨
• Twitter: @frankmunz
• Formerly AWS Tech Evangelist, SW architect, data scientist,
published author etc.
4. ©2022 Databricks Inc. — All rights reserved
Spark SQL and DataFrame API
# PYTHON: read multiple JSON files with multiple datasets per file
cust = spark.read.json('/data/customers/')
cust.createOrReplaceTempView("customers")
cust.filter(cust['car_make'] == "Audi").show()
%sql
select * from customers where car_make = "Audi"
-- PYTHON equivalent:
-- spark.sql('select * from customers where car_make = "Audi"').show()
6. ©2022 Databricks Inc. — All rights reserved
Reading from Files vs Reading from Streams
# read file(s)
cust = spark.read.format("json").load('/data/customers/')
# read from messaging platform
sales = spark.readStream.format("kafka|kinesis|socket").load()
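A minimal sketch extending the file case to a file-based stream (paths and schema are made up; unlike spark.read, a file stream needs an explicit schema):

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("car_make", StringType())])

# the same directory can also be consumed as an unbounded stream:
# files that arrive later are picked up incrementally
cust_stream = (spark.readStream.format("json")
    .schema(schema)  # file streams require a user-supplied schema
    .load("/data/customers/"))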
7. ©2022 Databricks Inc. — All rights reserved
Stream Processing
Traditional processing is one-off and bounded: a data source is processed once, end to end.
Stream processing is continuous and unbounded: the data source feeds processing that never terminates.
8. ©2022 Databricks Inc. — All rights reserved
Technical Advantages
A more intuitive way of capturing and processing
continuous and unbounded data
Lower latency for time-sensitive applications and use cases
Better fault-tolerance through checkpointing
Better compute utilization and scalability through
continuous and incremental processing
9. ©2022 Databricks Inc. — All rights reserved
Business Benefits
● BI and SQL Analytics: fresher and faster insights → quicker and better business decisions
● Data Engineering: sooner availability of cleaned data → more business use cases
● Data Science and ML: more frequent model updates and inference → better model efficacy
● Event-Driven Applications: faster customized responses and actions → better, differentiated customer experience
11. ©2022 Databricks Inc. — All rights reserved
Misconception #1
✗ Stream processing is only for low-latency use cases
spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "1")
  .load(inputDir)
  .writeStream
  .trigger(Trigger.AvailableNow)
  .option("checkpointLocation", chkdir)
  .start()
Stream processing can be applied to use cases of any latency.
"Batch" is a special case of streaming.
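The snippet above uses the Scala Trigger API. A rough PySpark equivalent (a sketch; inputDir and chkdir are placeholders, and availableNow requires Spark 3.3+):

(spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", "1")
    .load(inputDir)
    .writeStream
    .trigger(availableNow=True)  # drain all available data, then stop
    .option("checkpointLocation", chkdir)
    .start())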
12. ©2022 Databricks Inc. — All rights reserved
Misconception #2
✗ The lower the latency, the "better"
Choose the right latency, accuracy, and cost tradeoff for each specific use case.
15. ©2022 Databricks Inc. — All rights reserved
Structured Streaming
A scalable and fault-tolerant stream processing engine built on the Spark SQL engine
+ Project Lightspeed: predictable low latencies
16. ©2022 Databricks Inc. — All rights reserved
Structured Streaming
● Source: read from an initial offset position, and keep tracking the offset position as processing makes progress
● Transformation: apply the same transformations using a standard DataFrame
● Sink: write to a target, and keep updating the checkpoint as processing makes progress
● Trigger: determines when the next micro-batch runs
17. ©2022 Databricks Inc. — All rights reserved
Streaming ETL
// source
spark.readStream.format("kafka|kinesis|socket|...")
  .option(<>, <>)...
  .load()
// transformation
  .select($"value".cast("string").alias("jsonData"))
  .select(from_json($"jsonData", jsonSchema).alias("payload"))
// sink
  .writeStream
  .format("delta")
  .option("path", ...)
// config
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .option("checkpointLocation", ...)
  .start()
18. ©2022 Databricks Inc. — All rights reserved
Trigger Types
● Default: Process as soon as the previous batch has been processed
● Fixed interval: Process at a user-specified time interval
● One-time: Process all of the available data and then stop
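In PySpark the three trigger types map onto the trigger() call roughly like this (a sketch; df, chk, and out are placeholders):

# default: the next micro-batch starts as soon as the previous one finishes
df.writeStream.option("checkpointLocation", chk).start(out)

# fixed interval: one micro-batch every 5 minutes
(df.writeStream.trigger(processingTime="5 minutes")
    .option("checkpointLocation", chk).start(out))

# one-time: process all available data, then stop
(df.writeStream.trigger(once=True)
    .option("checkpointLocation", chk).start(out))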
19. ©2022 Databricks Inc. — All rights reserved
Output Modes
● Append (Default): Only new rows added to the result
table since the last trigger will be output to the sink
● Complete: The whole result table will be output to the
sink after each trigger
● Update: Only the rows updated in the result table since
the last trigger will be output to the sink
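A sketch of selecting the mode in PySpark (events, counts, chk, and out are placeholders; complete and update only make sense for queries that maintain a result table, such as aggregations):

counts = events.groupBy("car_make").count()

# append (default): emit only rows added since the last trigger
(events.writeStream.outputMode("append")
    .format("delta").option("checkpointLocation", chk).start(out))

# complete: re-emit the whole aggregated result table every trigger
counts.writeStream.outputMode("complete").format("console").start()

# update: emit only rows whose aggregates changed since the last trigger
counts.writeStream.outputMode("update").format("console").start()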
20. ©2022 Databricks Inc. — All rights reserved
Structured Streaming Benefits
● High Throughput: optimized for high throughput and low cost
● Rich Connector Ecosystem: streaming connectors ranging from message buses to object storage services
● Exactly-Once Semantics: fault tolerance and exactly-once semantics guarantee correctness
● Unified Batch and Streaming: a unified API makes development and maintenance simple
22. ©2022 Databricks Inc. — All rights reserved
Data Maturity Curve
As data + AI maturity grows, so does competitive advantage: Reports → Clean Data → Ad Hoc Queries → Data Exploration → Predictive Modeling → Prescriptive Analytics → Automated Decision Making
The descriptive stages answer "What happened?" (data warehouse for BI); the predictive stages answer "What will happen?" (data lake for AI).
23. ©2022 Databricks Inc. — All rights reserved
Two disparate, incompatible data platforms
● Data Warehouse: structured tables with governance and security via table ACLs, serving business intelligence and SQL analytics
● Data Lake: unstructured files (logs, text, images, video) with governance and security on files and blobs, serving data science & ML and data streaming
● Copying subsets of data between the two creates disjointed and duplicative data silos, incompatible security and governance models, and incomplete support for use cases
24. ©2022 Databricks Inc. — All rights reserved
Databricks Lakehouse Platform
One platform for Data Warehousing, Data Engineering, Data Science and ML, and Data Streaming:
● Delta Lake: data reliability and performance
● Unity Catalog: fine-grained governance for data and AI
● Cloud Data Lake: all structured and unstructured data
Simple: unify your data warehousing and AI use cases on a single platform
Open: built on open source and open standards
Multicloud: one consistent data platform across clouds
25. ©2022 Databricks Inc. — All rights reserved
Streaming on the Lakehouse
Lakehouse Platform (Delta Lake, Unity Catalog, Cloud Data Lake) for all structured and unstructured data:
● Sources: DB change data feeds, clickstreams, machine & application logs, mobile & IoT data, application events
● Workloads: streaming ingestion, streaming ETL, event processing, event-driven applications
● Outcomes: ML inference, live dashboards, near real-time queries, alerts, fraud prevention, dynamic UIs, dynamic ads, dynamic pricing, device control, game scene updates, etc.
26. ©2022 Databricks Inc. — All rights reserved
Lakehouse Differentiations
● Unified Batch and Streaming: no overhead of learning, developing on, or maintaining two sets of APIs and data processing stacks
● Favorite Tools: provide diverse users with their favorite tools to work with streaming data, enabling the broader organization to take advantage of streaming
● End-to-End Streaming: has everything you need; no need to stitch together different streaming technology stacks or tune them to work together
● Optimal Cost Structure: easily configure the right latency-cost tradeoff for each of your streaming workloads
28. ©2022 Databricks Inc. — All rights reserved
One source of truth for all your data
Open-format Delta Lake is the foundation of the Lakehouse:
• open source format, based on Parquet
• adds quality, reliability, and performance to your existing data lakes
• provides one common data management framework for batch & streaming, ETL, analytics & ML
✓ ACID Transactions ✓ Time Travel ✓ Schema Enforcement ✓ Identity Columns
✓ Advanced Indexing ✓ Caching ✓ Auto-tuning ✓ Python, SQL, R, Scala Support
All Open Source
30. ©2022 Databricks Inc. — All rights reserved
Delta’s expanding ecosystem of connectors
(Connector ecosystem diagram; several connectors labeled "New!" and "Coming Soon!")
31. ©2022 Databricks Inc. — All rights reserved
Under the Hood
A write-ahead log with optimistic concurrency provides serializable ACID transactions:
my_table/                  ← table name
  _delta_log/              ← transaction log
    00000.json             ← delta for one transaction
    00001.json
    …
    00010.json
    00010.chkpt.parquet    ← checkpoint file
  date=2022-01-01/         ← optional partition
    File-1.parquet         ← data in Parquet format
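Each numbered JSON file holds the actions of one commit, one JSON object per line (protocol, metaData, add, remove, commitInfo). A quick way to peek inside (a sketch; assumes a locally readable table path, and note that real log file names are zero-padded to 20 digits):

import json

with open("my_table/_delta_log/00000000000000000000.json") as f:
    for line in f:
        action = json.loads(line)
        print(list(action.keys()))  # e.g. ['commitInfo'], ['protocol'], ['metaData'], ['add']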
32. ©2022 Databricks Inc. — All rights reserved
Delta Lake Quickstart
pyspark --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
https://github.com/delta-io/delta
33. ©2022 Databricks Inc. — All rights reserved
Spark: Convert to Delta
# convert to Delta format
cust = spark.read.json('/data/customers/')
cust.write.format("delta").saveAsTable(table_name)
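For data that is already in Parquet, Delta also offers an in-place conversion that rewrites only the metadata, not the data files (the path is a placeholder):

-- SQL: convert an existing Parquet directory in place
CONVERT TO DELTA parquet.`/data/customers_parquet/`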
34. ©2022 Databricks Inc. — All rights reserved
All of Delta Lake 2.0 is open
● ACID Transactions ● Scalable Metadata ● Time Travel ● Open Source
● Unified Batch/Streaming ● Schema Evolution/Enforcement ● Audit History ● DML Operations
● OPTIMIZE Compaction ● OPTIMIZE ZORDER ● Change Data Feed ● Table Restore
● S3 Multi-cluster Writes ● MERGE Enhancements ● Stream Enhancements ● Simplified LogStore
● Data Skipping via Column Stats ● Multi-part Checkpoint Writes ● Generated Columns ● Column Mapping
● Generated Column Support w/ Partitioning ● Identity Columns ● Subqueries in Deletes and Updates ● Clones
● Iceberg to Delta Converter ● Fast Metadata-Only Deletes
(Newest items labeled "Coming Soon!" on the original slide)
36. ©2022 Databricks Inc. — All rights reserved
Delta Examples (in SQL for brevity)
-- Time travel
SELECT * FROM student TIMESTAMP AS OF "2022-05-28"
-- Clones
CREATE TABLE test_student SHALLOW CLONE student
-- Table history
DESCRIBE HISTORY student
-- Update/Upsert
UPDATE student SET lastname = "Miller" WHERE id = 2805
-- Compact files and co-locate
OPTIMIZE student ZORDER BY age
-- Remove stale files
VACUUM student RETAIN 12 HOURS
38. ©2022 Databricks Inc. — All rights reserved
Databricks SQL: Serverless + Photon
● Serverless: eliminates compute infrastructure management; instant, elastic compute; zero management; lower TCO
● Photon: vectorized C++ execution engine with the Apache Spark API
TPC-DS benchmark (100 TB): https://dbricks.co/benchmark
40. ©2022 Databricks Inc. — All rights reserved
Streaming: Built into the foundation
Delta tables can be streaming sources and sinks:
display(spark.readStream.format("delta").table("heart.bpm")
  .groupBy("bpm", window("time", "1 minute"))
  .avg("bpm")
  .orderBy("window", ascending=True))
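The reverse direction is symmetric; a minimal sketch of a Delta table as a streaming sink (table and checkpoint names are placeholders):

(spark.readStream.format("delta").table("heart.bpm")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/bpm_copy")
    .table("heart.bpm_copy"))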
42. ©2022 Databricks Inc. — All rights reserved
How can you simplify
Streaming Ingestion and
Data Pipelines?
43. ©2022 Databricks Inc. — All rights reserved
Auto Loader
• Python & Scala (and SQL in Delta Live Tables!)
• Streaming data source with incremental loading
• Exactly once ingestion
• Scales to large amounts of data
• Designed for structured, semi-structured and unstructured data
• Schema inference, enforcement with data rescue, and evolution
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/path/to/table"))
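A slightly fuller sketch with schema inference and evolution enabled (paths and table name are placeholders; cloudFiles.schemaLocation is where Auto Loader tracks the inferred schema across restarts):

(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/path/to/schemas")  # enables inference + evolution
    .load("/path/to/input")
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)  # Spark 3.3+; use once=True on older runtimes
    .table("bronze_events"))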
44. ©2022 Databricks Inc. — All rights reserved
“We ingest more than
5 petabytes per day
with Auto Loader”
45. ©2022 Databricks Inc. — All rights reserved
Delta Live Tables
Reliable, declarative, streaming data pipelines in SQL or Python
CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/raw_data", "json")

CREATE LIVE TABLE clean_data
AS SELECT …
FROM LIVE.raw_data
Accelerate ETL development
Declare SQL or Python and DLT automatically
orchestrates the DAG, handles retries, changing data
Automatically manage your infrastructure
Automates complex tedious activities like recovery,
auto-scaling, and performance optimization
Ensure high data quality
Deliver reliable data with built-in quality controls,
testing, monitoring, and enforcement
Unify batch and streaming
Get the simplicity of SQL with the freshness of streaming in one unified API
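Data quality rules are declared inline as expectations; a sketch in DLT SQL (constraint and table names are made up):

CREATE STREAMING LIVE TABLE clean_data (
  CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_data)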
46. ©2022 Databricks Inc. — All rights reserved
COPY INTO
• SQL command
• Idempotent and incremental
• Great when the source directory contains on the order of thousands of files
• Schema automatically inferred
COPY INTO my_delta_table
FROM 's3://my-bucket/path/to/csv_files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header'='true','inferSchema'='true')
48. ©2022 Databricks Inc. — All rights reserved
Lakehouse Platform: Streaming Options
● Messaging systems (Apache Kafka, Kinesis, …) → Apache Kafka connector for Structured Streaming, or the Delta Lake Sink Connector
● Files / object stores (S3, ADLS, GCS, …) → Databricks Auto Loader
Both paths feed Spark Structured Streaming and Delta Live Tables (SQL/Python), landing in Delta Lake for data consumers.
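End to end, the Kafka path through this diagram is only a few lines of Structured Streaming (a sketch; broker, topic, and table names are placeholders):

(spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/clickstream")
    .table("bronze_clickstream"))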
53. ©2022 Databricks Inc. — All rights reserved
Auto Loader: Streaming Data Ingestion
Ingest Streaming Data in a Delta Live Table pipeline
54. ©2022 Databricks Inc. — All rights reserved
Declarative, auto-scaling data pipelines in SQL
CTAS pattern: CREATE TABLE … AS SELECT …
59. ©2022 Databricks Inc. — All rights reserved
Demo: Data Donation Project
https://corona-datenspende.de/science/en/
62. ©2022 Databricks Inc. — All rights reserved
Demos on Databricks Github
https://github.com/databricks/delta-live-tables-notebooks
63. ©2022 Databricks Inc. — All rights reserved
Databricks Blog: DLT with Apache Kafka
https://www.databricks.com/blog/2022/08/09/low-latency-streaming-data-pipelines-with-delta-live-tables-and-apache-kafka.html
64. ©2022 Databricks Inc. — All rights reserved
Resources
• The best platform to run your Spark workloads
• Databricks Demo Hub
• Databricks GitHub
• Databricks Blog
• Databricks Community / Forum (join to ask your tech questions!)
• Training & Certification