In this session, we provide an overview of the “Lakehouse” architecture and how Apache Pulsar™ can support it through integrations with Apache Spark™ and Delta Lake to build a reliable data lake. We also discuss the current state of the Pulsar + Spark and Delta Lake connectors, review real-world use cases, and present the roadmap for future integrations between the Spark, Delta Lake, and Pulsar communities.
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - Pulsar Summit Europe 2021 Keynote
1. Pulsar Virtual Summit Europe 2021
Pulsar in the Lakehouse
Ryan Zhu
Staff Software Engineer, Databricks
Addison Higham
Chief Architect, StreamNative
2. Pulsar Virtual Summit Europe 2021
Ryan Zhu, Staff Software Engineer at Databricks
● Tech Lead of Delta Ecosystem team
● Apache Spark PMC member and committer
● Experience:
○ One of the core developers of Delta Lake and Spark Structured
Streaming. Working on these two projects since the beginning.
○ Recently working on Delta Sharing, a new open protocol for sharing data.
3. Pulsar Virtual Summit Europe 2021
Addison Higham, Chief Architect at StreamNative
● Apache Pulsar Committer
● Experience:
○ 10+ years as a software engineer, with 7 years working on streaming systems.
○ 3+ years Pulsar experience, including leading the successful adoption
of Pulsar at Instructure.
5. Pulsar Virtual Summit Europe 2021
● Data is fragmented across many systems
● Cost and complexity is a drag on the organization
● Silos get in the way of data team collaboration
6. Pulsar Virtual Summit Europe 2021
Data infrastructure is too complicated
[Diagram: a data lake holds semi-structured and unstructured data and feeds Machine Learning and Data Science, while multiple data warehouses hold structured data and each serve BI]
7. Pulsar Virtual Summit Europe 2021
Data Warehouse
● Pros: Great for Business Intelligence (BI) applications
● Cons: Limited support for Machine Learning (ML) workloads; proprietary systems with only a SQL interface
Data Lake
● Pros: Supports ML; completely open ecosystem of tools and formats
● Cons: Poor support for BI; complex to manage and govern → data swamp
8. Pulsar Virtual Summit Europe 2021
Lakehouse
One platform to unify all your data, analytics, and AI workloads
[Diagram: an open data lake, with data management & governance, underpins BI & SQL, real-time data applications, and data science & ML]
9. Pulsar Virtual Summit Europe 2021
Building the foundation of a Lakehouse - Delta Lake
[Diagram: data from CSV, JSON, TXT files and Kinesis lands in a BRONZE layer (raw ingestion and history), is refined into a SILVER layer (filtered, cleaned, augmented), then aggregated into a GOLD layer (business-level aggregates); QUALITY increases at each stage, serving BI & Reporting, Streaming Analytics, and Data Science & ML]
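A minimal Spark Structured Streaming sketch of the first two hops of this pipeline, assuming a Spark session with the Delta Lake extensions configured; the paths and event schema are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Spark session with the Delta Lake extensions enabled.
val spark = SparkSession.builder()
  .appName("medallion-sketch")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// BRONZE: land raw JSON events as-is, keeping full history.
val rawSchema = new StructType()
  .add("eventId", StringType)
  .add("eventType", StringType)
  .add("payload", StringType)

spark.readStream.format("json").schema(rawSchema).load("/data/raw")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/bronze")
  .start("/delta/bronze")

// SILVER: incrementally filter and clean the bronze stream.
spark.readStream.format("delta").load("/delta/bronze")
  .where("eventType IS NOT NULL")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/silver")
  .start("/delta/silver")
```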
10. Pulsar Virtual Summit Europe 2021
● 350+ PB processed / day
● 75% of data scanned
● 3K+ customers in production
11. Pulsar Virtual Summit Europe 2021
OSS Delta Lake Key Features
● ACID Transactions — Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level. Learn more at Diving into Delta Lake: Unpacking the Transaction Log.
● Scalable Metadata Handling — Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
● Time Travel (data versioning) — Delta Lake provides data snapshots to access and revert to earlier versions of data for audits, rollbacks, or to reproduce experiments.
● Open Format — All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
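As a minimal sketch of the ACID transaction and time travel features above (the table path and contents are illustrative):

```scala
// Each write is a single ACID commit recorded in the Delta transaction log.
spark.range(0, 100).toDF("id")
  .write.format("delta").mode("overwrite").save("/delta/demo")

spark.range(100, 200).toDF("id")
  .write.format("delta").mode("append").save("/delta/demo")

// Time travel: read the table as of an earlier version for audit or rollback.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/delta/demo")
v0.count() // 100 rows: the state before the append
```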
12. Pulsar Virtual Summit Europe 2021
OSS Delta Lake Key Features (Continued)
● Unified Batch and Streaming Source and Sink — A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
● Schema Enforcement and Evolution — Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that data types are correct and required columns are present, preventing bad data from causing data corruption. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution.
● Audit History — The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
● DML Operations — Delta Lake supports SQL, Scala / Java, and Python APIs to merge, update, and delete datasets, allowing you to easily comply with GDPR and CCPA and simplifying use cases like change data capture. For more information, refer to Diving Into Delta Lake: DML Internals.
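A short sketch of the unified batch/streaming and DML features, using the documented DeltaTable API; the paths and delete predicate are illustrative:

```scala
import io.delta.tables.DeltaTable

// The same Delta table is a batch table, a streaming source, and a sink:
// this continuously copies new commits into a second table.
spark.readStream.format("delta").load("/delta/demo")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/demo-copy")
  .start("/delta/demo-copy")

// DML on the same table, e.g. a GDPR/CCPA-style delete.
DeltaTable.forPath(spark, "/delta/demo").delete("id < 10")
```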
13. Pulsar Virtual Summit Europe 2021
Upcoming features
● Column dropping and renaming — Allow users to drop a column and rename a column.
● Atomic data replacement — Allow users to delete a portion of data from the table and replace it with new data atomically.
● Schema evolution improvement for MERGE — StructType in ArrayType will support schema evolution in the MERGE command (a sketch of today's MERGE schema evolution follows this list).
● MERGE support for generated columns — Generated Columns is a feature added in Delta 1.0 to support generating columns based on SQL expressions. MERGE will support these columns.
● New release cadence — One release every 3 months.
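For context, a hedged sketch of MERGE with today's automatic schema evolution flag, which the roadmap items above extend to nested StructType-in-ArrayType and to generated columns; the table path and columns are illustrative:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit

// Existing flag that lets MERGE evolve the target's top-level schema.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

// Illustrative updates with a column the target does not yet have.
val updates = spark.range(0, 5).toDF("id").withColumn("newCol", lit("x"))

DeltaTable.forPath(spark, "/delta/demo").as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```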
14. Pulsar Virtual Summit Europe 2021
Delta Lake ecosystem
Ecosystem Project — Status
● Delta Standalone Reader — Available
● Delta Standalone Writer — Q4 '21
● Flink/Delta Source — Q1 '22
● Flink/Delta Sink — Q4 '21
● Pulsar/Delta Source — Q4 '21
● Pulsar/Delta Sink — Q1 '22
● PrestoDB/Trino integration — Q4 '21
● Rust Integration (kafka-delta-ingest) — Available
● Nessie Integration — Q4 '21
● LakeFS Integration — Q4 '21
● Hive3 Connector — Available
● Spark 3.2 Support — Q4 '21
16. Pulsar Virtual Summit Europe 2021
Pulsar is the unified messaging and
streaming platform for real-time teams
17. Pulsar Virtual Summit Europe 2021
Why Pulsar?
● Streams and messages to support more workloads
● Multi-tenancy to break down data silos and ease data ingestion
● Geo-replication to support multi-cloud and global business
18. Pulsar Virtual Summit Europe 2021
Pulsar + Delta Lake enable data unification
● Delta Lake and the Lakehouse support a unified system for data, analytics, and ML
● Pulsar unifies real-time data across diverse use cases like streaming, messaging, and microservices
Delta Lake + Pulsar = simplified data infrastructure across your entire organization
19. Pulsar Virtual Summit Europe 2021
Pulsar, Delta Lake, and Spark Connectors
The Pulsar and Spark/Delta Lake communities are committed to building solid integrations.
● Spark Pulsar Connector — Connectors for Spark for reading data from and writing data to Pulsar with the DataFrame and DataStream APIs. https://github.com/streamnative/pulsar-spark. Discussions in progress for upstream contribution.
● Pulsar IO Delta Lake Source — A Pulsar “Source” for reading data directly from Delta Lake within the Pulsar IO framework. It’s built on top of the Delta Standalone project. In progress; expect a first release this year.
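A hedged sketch of the Spark Pulsar connector feeding a Delta table; option names follow the pulsar-spark README, while the URLs, topic, and paths are illustrative:

```scala
// Stream application events from a Pulsar topic into a bronze Delta table.
val events = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/app-events")
  .load()

// Schema-less topics expose the payload as a binary `value` column,
// plus metadata columns such as `__publishTime`.
events.selectExpr("CAST(value AS STRING) AS json", "__publishTime")
  .writeStream.format("delta")
  .option("checkpointLocation", "/ckpt/app-events")
  .start("/delta/bronze/app-events")
```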
20. Pulsar Virtual Summit Europe 2021
[Diagram: databases feed Pulsar through Pulsar IO sources; applications connect to Pulsar over the Pulsar protocol, KoP, AoP, WebSocket, and HTTP]
21. Pulsar Virtual Summit Europe 2021
Pulsar offers many options for integration, including the Pulsar protocol, KoP, AoP, WebSocket, HTTP, and connectors, to connect with many systems in real time.
22. Pulsar Virtual Summit Europe 2021
Delta Lake connectors allow data to be exchanged between Delta Lake and Pulsar.
23. Pulsar Virtual Summit Europe 2021
Spark’s Pulsar connector allows developers to write Spark jobs that read data from Pulsar topics, transform the data, and write it back to Pulsar topics.
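A minimal read-transform-write sketch with the connector; the service URLs, topics, and toy uppercase transform are illustrative placeholders:

```scala
// Read from one Pulsar topic, transform, and write back to another.
val in = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/raw")
  .load()

in.selectExpr("UPPER(CAST(value AS STRING)) AS value") // toy transform
  .writeStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/processed")
  .option("checkpointLocation", "/ckpt/processed")
  .start()
```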
24. Pulsar Virtual Summit Europe 2021
Application events stored in Delta Lake for use in ML
25. Pulsar Virtual Summit Europe 2021
ML Results made available to applications
26. Pulsar Virtual Summit Europe 2021
CDC Events transformed and stored in Delta Lake
27. Pulsar Virtual Summit Europe 2021
Data from other systems made available in Delta Lake for Data Science
28. Pulsar Virtual Summit Europe 2021
Pulsar IO Delta Lake Source
With the Pulsar IO Delta Lake source, users will be able to ingest Delta Lake changes
into Pulsar without running a separate component
[Diagram: the Delta Lake Source reads the Delta transaction log (new files, removed files, metadata changes) and Parquet files, updates schema records, and writes records to a change topic with schema]
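Once released, consuming the source's change topic should look like any other Pulsar subscription. A sketch with the standard Pulsar Java client (callable from Scala); the topic and subscription names are hypothetical:

```scala
import org.apache.pulsar.client.api.{PulsarClient, SubscriptionInitialPosition}

val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650")
  .build()

// Subscribe to the change topic the Delta Lake source writes to
// ("delta-changes" is a hypothetical topic name).
val consumer = client.newConsumer()
  .topic("persistent://public/default/delta-changes")
  .subscriptionName("delta-change-reader")
  .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
  .subscribe()

val msg = consumer.receive()
println(new String(msg.getData))
consumer.acknowledge(msg)
```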
29. Pulsar Virtual Summit Europe 2021
Future of Pulsar + Delta Lake
One of Pulsar’s unique features is tiered storage, which allows streams to be offloaded out of Apache BookKeeper into S3, GCS, etc.
Work is in progress to offload data as Delta Lake compatible files, with the required metadata, allowing Pulsar to make streams available to Delta Lake without any need to copy data out of Pulsar, while the data can still be read as streams.
Stay connected to learn more in early 2022!
30. Pulsar Virtual Summit Europe 2021
Pulsar and Delta Lake are technologies
designed to simplify your data
infrastructure
Connect with us on #connector-pulsar in
Delta Lake Slack to learn more!
Today I’m pretty excited to have the chance to talk about Delta Lake and Lakehouse.
Before we jump into Lakehouse, I would like to talk about the challenges people are facing today.
Data infrastructure is too complicated and expensive to manage for advanced use cases.
Many systems are not open. They are using proprietary formats, and you need to copy data around multiple systems.
Teams are not well connected to collaborate.
This is one classic architecture example. The vast majority of data flows into a data lake. Companies do a lot of data validation so that they can serve data science and machine learning on top of these data lakes. At the same time, a huge amount of data is ETLed to many downstream data warehouses for business intelligence and other use cases. We have to do that because BI workloads are often too slow to run on a data lake directly. Depending on the workload, data also needs to be moved out of the data warehouse back to the data lake if it has been updated in the data warehouse. Increasingly, machine learning workloads are also reading from and writing to the data warehouses at the same time.
The root of the problem is the inherent difference between data lakes and data warehouses.
On one hand, we have data lakes that do a great job supporting machine learning. They have open formats and a big ecosystem on top of them. But they have poor support for business intelligence and suffer from complex data quality problems.
[CLICK]
On the other hand, we have data warehouses that are great for business intelligence applications. But they have limited support for machine learning workloads, and they are proprietary systems with only a SQL interface.
Unifying these systems can be transformational in how we think about data. This is why we’re such big believers in the lakehouse to provide one platform to unify all of your data, analytics, and AI to allow all members of the data team to collaborate together.
By definition, a lakehouse is based on open standards and open source.
Because without being open, it’s impossible to create the unification across all data types, all these tools, and workloads.
And, the communities for the best open source projects to enable the lakehouse are in the audience today across Apache Spark, Delta Lake, MLflow, and Redash.
The foundation of your lakehouse is Delta Lake. Delta Lake allows you to build a data quality framework for your data lake by ensuring the data is reliable via ACID transactions.
In this model, data flows into Delta Lake from various data sources. Data quality is improved incrementally, from raw data to intermediate data to clean data. Then downstream applications such as machine learning, AI, and BI can build on top of clean, fresh, and reliable data that fits the use cases it's intended for.
Delta Lake is the foundation of this model: an open source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.
Today Delta Lake is used all over the world. Over 350 PB of data per day is processed on Delta Lake.
70% of the data scanned on the Databricks platform uses Delta Lake.
And Delta Lake has been deployed to over 3,000 customers in their production lakehouse architectures.
So what key features does Delta Lake provide to help you build your lakehouse?
Delta Lake provides ACID transactions, scalable metadata handling to process tables with billions of files, and data versioning, which provides the ability to time travel.
Data is stored in open format to allow users to leverage existing tools.
In addition, Delta Lake unifies batch and streaming queries, making it easy to write either batch or streaming applications.
Delta Lake automatically handles schema validation to prevent bad records from causing data corruption, and supports schema evolution.
It allows you to audit history through transaction logs.
Delta Lake also supports DML operations such as update, delete and merge.
We also continue to improve Delta Lake for more use cases, and there are multiple exciting features coming soon.
Delta Lake will allow users to drop a column or rename a column.
You will be able to replace a portion of data in your table with new data atomically.
The MERGE command will get better schema evolution support and generated column support.
We also plan to make a release every 3 months so that people can get new Delta Lake features quickly.
Moreover, the Delta Lake community is excited to expand Delta Lake and make it available everywhere. We are building the Delta Standalone project to allow users to read and write Delta Lake natively. This enables the community to build connectors for various systems, such as Pulsar, Flink, and Presto.
Next, I will hand over to Addison to talk about the work of Pulsar connector for Delta Lake.