In this session, we provide an overview of the “Lakehouse” architecture and how Apache Pulsar™ can support it through integrations with Apache Spark™ and Delta Lake to build a reliable data lake. We also discuss the current state of the Pulsar + Spark and Delta Lake connectors, review real-world use cases, and present the roadmap for future integrations between the Spark, Delta Lake, and Pulsar communities.
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - Pulsar Summit Europe 2021 Keynote
1. Pulsar Virtual Summit Europe 2021
Pulsar in the Lakehouse
Ryan Zhu
Staff Software Engineer, Databricks
Addison Higham
Chief Architect, StreamNative
2. Pulsar Virtual Summit Europe 2021
Ryan Zhu, Staff Software Engineer at Databricks
● Tech Lead of Delta Ecosystem team
● Apache Spark PMC member and committer
● Experience:
○ One of the core developers of Delta Lake and Spark Structured
Streaming. Working on these two projects since the beginning.
○ Recently working on Delta Sharing, a new open protocol for sharing data.
3. Pulsar Virtual Summit Europe 2021
Addison Higham, Chief Architect at StreamNative
● Apache Pulsar Committer
● Experience:
○ 10+ years as a software engineer, with 7 years working on streaming systems.
○ 3+ years Pulsar experience, including leading the successful adoption
of Pulsar at Instructure.
5. Pulsar Virtual Summit Europe 2021
● Data is fragmented across many systems
● Cost and complexity is a drag on the organization
● Silos get in the way of data team collaboration
6. Pulsar Virtual Summit Europe 2021
Data infrastructure is too complicated
[Diagram: a data lake holds semi-structured and unstructured data and feeds Machine Learning and Data Science, while multiple data warehouses hold structured data and each serve BI]
7. Pulsar Virtual Summit Europe 2021
Data Warehouse
● Pros: Great for Business Intelligence (BI) applications
● Cons: Limited support for Machine Learning (ML) workloads; proprietary systems with only a SQL interface
Data Lake
● Pros: Supports ML; completely open ecosystem of tools and formats
● Cons: Poor support for BI; complex to manage and govern → data swamp
8. Pulsar Virtual Summit Europe 2021
Lakehouse
One platform to unify all your data, analytics, and AI workloads
[Diagram: an open data lake, with data management & governance, underpins BI & SQL, real-time data applications, and data science & ML]
9. Pulsar Virtual Summit Europe 2021
Building the foundation of a Lakehouse - Delta Lake
[Diagram: data from CSV, JSON, TXT files and Kinesis lands in a BRONZE layer (raw ingestion and history), is refined into a SILVER layer (filtered, cleaned, augmented), then aggregated into a GOLD layer (business-level aggregates); QUALITY increases at each stage, serving BI & Reporting, Streaming Analytics, and Data Science & ML]
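A minimal Spark Structured Streaming sketch of the first two hops of this pipeline, assuming a Spark session with the Delta Lake extensions configured; the paths and event schema are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Spark session with the Delta Lake extensions enabled.
val spark = SparkSession.builder()
  .appName("medallion-sketch")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// BRONZE: land raw JSON events as-is, keeping full history.
val rawSchema = new StructType()
  .add("eventId", StringType)
  .add("eventType", StringType)
  .add("payload", StringType)

spark.readStream.format("json").schema(rawSchema).load("/data/raw")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/bronze")
  .start("/delta/bronze")

// SILVER: incrementally filter and clean the bronze stream.
spark.readStream.format("delta").load("/delta/bronze")
  .where("eventType IS NOT NULL")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/silver")
  .start("/delta/silver")
```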
10. Pulsar Virtual Summit Europe 2021
● 350+ PB processed / day
● 75% of data scanned
● 3K+ customers in production
11. Pulsar Virtual Summit Europe 2021
OSS Delta Lake Key Features
● ACID Transactions — Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level. Learn more at Diving into Delta Lake: Unpacking the Transaction Log.
● Scalable Metadata Handling — Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
● Time Travel (data versioning) — Delta Lake provides data snapshots to access and revert to earlier versions of data for audits, rollbacks, or to reproduce experiments.
● Open Format — All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
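As a minimal sketch of the ACID transaction and time travel features above (the table path and contents are illustrative):

```scala
// Each write is a single ACID commit recorded in the Delta transaction log.
spark.range(0, 100).toDF("id")
  .write.format("delta").mode("overwrite").save("/delta/demo")

spark.range(100, 200).toDF("id")
  .write.format("delta").mode("append").save("/delta/demo")

// Time travel: read the table as of an earlier version for audit or rollback.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/delta/demo")
v0.count() // 100 rows: the state before the append
```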
12. Pulsar Virtual Summit Europe 2021
OSS Delta Lake Key Features (Continued)
● Unified Batch and Streaming Source and Sink — A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
● Schema Enforcement and Evolution — Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that data types are correct and required columns are present, preventing bad data from causing data corruption. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution.
● Audit History — The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
● DML Operations — Delta Lake supports SQL, Scala / Java, and Python APIs to merge, update, and delete datasets, allowing you to easily comply with GDPR and CCPA and simplifying use cases like change data capture. For more information, refer to Diving Into Delta Lake: DML Internals.
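A short sketch of the unified batch/streaming and DML features, using the documented DeltaTable API; the paths and delete predicate are illustrative:

```scala
import io.delta.tables.DeltaTable

// The same Delta table is a batch table, a streaming source, and a sink:
// this continuously copies new commits into a second table.
spark.readStream.format("delta").load("/delta/demo")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/demo-copy")
  .start("/delta/demo-copy")

// DML on the same table, e.g. a GDPR/CCPA-style delete.
DeltaTable.forPath(spark, "/delta/demo").delete("id < 10")
```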
13. Pulsar Virtual Summit Europe 2021
Upcoming features
● Column dropping and renaming — Allow users to drop a column and rename a column.
● Atomic data replacement — Allow users to delete a portion of data from the table and replace it with new data atomically.
● Schema evolution improvement for MERGE — StructType in ArrayType will support schema evolution in the MERGE command (a sketch of today's MERGE schema evolution follows this list).
● MERGE support for generated columns — Generated Columns is a feature added in Delta 1.0 to support generating columns based on SQL expressions. MERGE will support these columns.
● New release cadence — One release every 3 months.
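For context, a hedged sketch of MERGE with today's automatic schema evolution flag, which the roadmap items above extend to nested StructType-in-ArrayType and to generated columns; the table path and columns are illustrative:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit

// Existing flag that lets MERGE evolve the target's top-level schema.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// Illustrative updates with a column the target does not yet have.
val updates = spark.range(0, 5).toDF("id").withColumn("newCol", lit("x"))

DeltaTable.forPath(spark, "/delta/demo").as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```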
14. Pulsar Virtual Summit Europe 2021
Delta Lake ecosystem
Ecosystem Project — Status
● Delta Standalone Reader — Available
● Delta Standalone Writer — Q4 '21
● Flink/Delta Source — Q1 '22
● Flink/Delta Sink — Q4 '21
● Pulsar/Delta Source — Q4 '21
● Pulsar/Delta Sink — Q1 '22
● PrestoDB/Trino integration — Q4 '21
● Rust Integration (kafka-delta-ingest) — Available
● Nessie Integration — Q4 '21
● LakeFS Integration — Q4 '21
● Hive3 Connector — Available
● Spark 3.2 Support — Q4 '21
16. Pulsar Virtual Summit Europe 2021
Pulsar is the unified messaging and
streaming platform for real-time teams
17. Pulsar Virtual Summit Europe 2021
Why Pulsar?
● Streams and messages to support more workloads
● Multi-tenancy to break down data silos and ease data ingestion
● Geo-replication to support multi-cloud and global business
18. Pulsar Virtual Summit Europe 2021
Pulsar + Delta Lake enable data unification
● Delta Lake and the Lakehouse support a unified system for data, analytics, and ML
● Pulsar unifies real-time data across diverse use cases like streaming, messaging, and microservices
Delta Lake + Pulsar = simplified data infrastructure across your entire organization
19. Pulsar Virtual Summit Europe 2021
Pulsar, Delta Lake, and Spark Connectors
The Pulsar and Spark/Delta Lake communities are committed to building solid integrations.
● Spark Pulsar Connector — Connectors for Spark for reading data from and writing data to Pulsar with the DataFrame and DataStream APIs. https://github.com/streamnative/pulsar-spark. Discussions in progress for upstream contribution.
● Pulsar IO Delta Lake Source — A Pulsar “Source” for reading data directly from Delta Lake within the Pulsar IO framework. It’s built on top of the Delta Standalone project. In progress; expect a first release this year.
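A hedged sketch of the Spark Pulsar connector feeding a Delta table; option names follow the pulsar-spark README, while the URLs, topic, and paths are illustrative:

```scala
// Stream application events from a Pulsar topic into a bronze Delta table.
val events = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/app-events")
  .load()

// Schema-less topics expose the payload as a binary `value` column,
// plus metadata columns such as `__publishTime`.
events.selectExpr("CAST(value AS STRING) AS json", "__publishTime")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/app-events")
  .start("/delta/bronze/app-events")
```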
20. Pulsar Virtual Summit Europe 2021
[Diagram: databases feed Pulsar through Pulsar IO sources; applications connect to Pulsar over the Pulsar protocol, KoP, AoP, WebSocket, and HTTP]
21. Pulsar Virtual Summit Europe 2021
Pulsar offers many options for integration, including the Pulsar protocol, KoP, AoP, WebSocket, HTTP, and connectors, to connect with many systems in real time.
22. Pulsar Virtual Summit Europe 2021
Delta Lake connectors allow data to be exchanged between Delta Lake and Pulsar.
23. Pulsar Virtual Summit Europe 2021
Spark’s Pulsar connector allows developers to write Spark jobs that read data from Pulsar topics, transform the data, and write it back to Pulsar topics.
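A minimal read-transform-write sketch with the connector; the service URLs, topics, and toy uppercase transform are illustrative placeholders:

```scala
// Read from one Pulsar topic, transform, and write back to another.
val in = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/raw")
  .load()

in.selectExpr("UPPER(CAST(value AS STRING)) AS value") // toy transform
  .writeStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/processed")
  .option("checkpointLocation", "/ckpt/processed")
  .start()
```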
24. Pulsar Virtual Summit Europe 2021
Application events stored in Delta Lake for use in ML
25. Pulsar Virtual Summit Europe 2021
ML Results made available to applications
26. Pulsar Virtual Summit Europe 2021
CDC Events transformed and stored in Delta Lake
27. Pulsar Virtual Summit Europe 2021
Data from other systems made available in Delta Lake for Data Science
28. Pulsar Virtual Summit Europe 2021
Pulsar IO Delta Lake Source
With the Pulsar IO Delta Lake source, users will be able to ingest Delta Lake changes
into Pulsar without running a separate component
[Diagram: the Delta Lake Source reads the Delta transaction log (new files, removed files, metadata changes) and Parquet files, updates schema records, and writes records to a change topic with schema]
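Once released, consuming the source's change topic should look like any other Pulsar subscription. A sketch with the standard Pulsar Java client (callable from Scala); the topic and subscription names are hypothetical:

```scala
import org.apache.pulsar.client.api.{PulsarClient, SubscriptionInitialPosition}

val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650")
  .build()

// Subscribe to the change topic the Delta Lake source writes to
// ("delta-changes" is a hypothetical topic name).
val consumer = client.newConsumer()
  .topic("persistent://public/default/delta-changes")
  .subscriptionName("delta-change-reader")
  .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
  .subscribe()

val msg = consumer.receive()
println(new String(msg.getData))
consumer.acknowledge(msg)
```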
29. Pulsar Virtual Summit Europe 2021
Future of Pulsar + Delta Lake
One of Pulsar’s unique features is tiered storage, which allows streams to be offloaded out of Apache BookKeeper into S3, GCS, etc.
Work is in progress to offload data as Delta Lake compatible files, with the required metadata, allowing Pulsar to make streams available to Delta Lake without any need to copy data out of Pulsar, while the data can still be read as streams.
Stay connected to learn more in early 2022!
30. Pulsar Virtual Summit Europe 2021
Pulsar and Delta Lake are technologies
designed to simplify your data
infrastructure
Connect with us on #connector-pulsar in
Delta Lake Slack to learn more!
Today I’m pretty excited to have the chance to talk about Delta Lake and Lakehouse.
Before we jump into Lakehouse, I would like to talk about the challenges people are facing today.
Data infrastructure is too complicated and expensive to manage for advanced use cases.
Many systems are not open. They are using proprietary formats, and you need to copy data around multiple systems.
Teams are not well connected to collaborate.
This is one classic architecture example. The vast majority of data flows into a data lake. Companies do a lot of data validation so that they can serve data science and machine learning on top of these data lakes. At the same time, a huge amount of data is ETLed to many downstream data warehouses for business intelligence and other use cases. We have to do that because BI workloads are often too slow to run on a data lake directly. Depending on the workload, data also needs to be moved out of the data warehouse back to the data lake if it has been updated in the data warehouse. Increasingly, machine learning workloads are also reading from and writing to the data warehouses at the same time.
The root of the problem is the inherent difference between data lakes and data warehouses.
On one hand, we have data lakes that do a great job supporting machine learning. They have open formats and a big ecosystem on top of them. But they have poor support for business intelligence and suffer from complex data quality problems.
[CLICK]
On the other hand, we have data warehouses that are great for business intelligence applications. But they have limited support for machine learning workloads, and they are proprietary systems with only a SQL interface.
Unifying these systems can be transformational in how we think about data. This is why we’re such big believers in the lakehouse to provide one platform to unify all of your data, analytics, and AI to allow all members of the data team to collaborate together.
By definition, a lakehouse is based on open standards and open source.
Because without being open, it’s impossible to create the unification across all data types, all these tools, and workloads.
And, the communities for the best open source projects to enable the lakehouse are in the audience today across Apache Spark, Delta Lake, MLflow, and Redash.
The foundation of your lakehouse is Delta Lake. Delta Lake allows you to build a data quality framework for your data lake by ensuring the data is reliable via ACID transactions.
In this model, data flows into Delta Lake from various data sources. Data quality is improved incrementally, from raw data to intermediate data to clean data. Then downstream applications such as machine learning, AI, and BI can build on top of clean, fresh, and reliable data that fits the use cases it's intended for.
Delta Lake is the foundation of this model: an open source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.
Today Delta Lake is used all over the world. Over 350 PB of data per day is processed on Delta Lake.
70% of the data scanned on the Databricks platform uses Delta Lake.
And Delta Lake has been deployed to over 3,000 customers in their production lakehouse architectures.
So what key features does Delta Lake provide to help you build your lakehouse?
Delta Lake provides ACID transactions, scalable metadata handling to process tables with billions of files, and data versioning, which provides the ability to time travel.
Data is stored in open format to allow users to leverage existing tools.
In addition, Delta Lake unifies batch and streaming queries, making it easy to write either batch or streaming applications.
Delta Lake automatically handles schema validation to prevent bad records from causing data corruption, and supports schema evolution.
It allows you to audit history through transaction logs.
Delta Lake also supports DML operations such as update, delete and merge.
We also continue to improve Delta Lake for more use cases, and there are multiple exciting features coming soon.
Delta Lake will allow users to drop a column or rename a column.
You will be able to replace a portion of data in your table with new data atomically.
The MERGE command will get better schema evolution support and generated column support.
We also plan to make a release every 3 months so that people can get new Delta Lake features quickly.
Moreover, the Delta Lake community is excited to expand Delta Lake and make it available everywhere. We are building the Delta Standalone project to allow users to read and write Delta Lake natively. This enables the community to build connectors for various systems, such as Pulsar, Flink, and Presto.
Next, I will hand over to Addison to talk about the work of Pulsar connector for Delta Lake.