Lakehouses are quickly growing in popularity as a new approach to data platform architecture, bringing some long-established benefits of the OLTP world to OLAP: transactions, record-level updates and deletes, and change streaming. In this talk, we discuss Apache Hudi and how it unlocks the possibility of building your own fully open-source Lakehouse, featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will: - Explain what Lakehouses are and why they are needed. - Describe what Apache Hudi is and how it works. - Present a use case and demo that applies Apache Hudi's DeltaStreamer tool to ingest data from Apache Pulsar.
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache Hudi - Pulsar Summit SF 2022
1. Pulsar Summit
San Francisco
Hotel Nikko
August 18, 2022
Ecosystem
Unlocking the power
of Lakehouse
architectures
with Apache Pulsar
and Apache Hudi
Alexey Kudinkin
Founding Engineer • Onehouse
Addison Higham
Chief Architect • StreamNative
2. Alexey Kudinkin
Founding Engineer
Onehouse
● Founding Engineer at Onehouse
● Spent the prior 5 years at Uber
(re)building its Fulfillment Platform
from the ground up
@alexeykudinkin @alexeykudinkin
3. Addison Higham
Chief Architect
StreamNative
● Member of the StreamNative team for 2 years
● Last 8 years in big data / streaming data
● Apache Pulsar committer and member of the Pulsar community for 5.5 years
● Previously Data and Platform Architect at Instructure
@addisonjh @addisonj
4. Agenda
1. What are Lakehouses?
2. Overview of Apache Hudi
3. Why Pulsar for Lakehouse?
4. Apache Pulsar integration
5. Demo
5. Unlocking the Power of Lakehouses using Pulsar and Hudi
What are Lakehouses?
6. What is a Lakehouse?
Two evolution tracks converge (timeline):
● Warehouse track: on-prem data warehouses (traditional BI/reporting) → cloud warehouses: 2012 BigQuery (serverless), 2013 Redshift (cloud), 2014 Snowflake (decoupling/UX)
● Data lake track: 2000s Hadoop data lakes (search/social) → 2014 Apache Spark (data science) → Lakehouses: 2016 Apache Hudi (txns, streams), 2017 Databricks Delta*
*Databricks was the one to coin the term "Lakehouse"
7. What is a Lakehouse?
Lakehouse Architecture
[Diagram: traditional data lakes vs. Lakehouses]
● Traditional data lakes: query engine nodes (each with its own SQL exec, optimizer, and local cache) read open file formats (Parquet/ORC/JSON/CSV/…) directly from cloud storage (S3/GCS/ABS/…).
● Lakehouses: the same query engines and cloud storage, with a transactional layer in between: a table format plus metadata, table services, a transaction manager, and indexes.
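As a concrete illustration of what the transactional layer adds over a plain data lake, here is a minimal sketch of a record-level upsert through the Hudi Spark datasource (the table name, path, and field names are hypothetical; the options are standard Hudi write options):

// Upsert: incoming records whose key already exists in the table
// update the stored records in place; new keys are inserted
updatesDF.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.table.name", "rt_impressions").
  // Record key used to locate existing records
  option("hoodie.datasource.write.recordkey.field", "event_id").
  // On key collisions within a batch, the record with the larger "ts" wins
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(SaveMode.Append).
  save("s3a://hudi-tables/rt_impressions")

With a traditional data lake layout, the same update would require rewriting and re-publishing whole files or partitions by hand.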
8. Overview of Apache Hudi
Unlocking the Power of Lakehouses using Pulsar and Hudi
9. Apache Hudi Overview
The Hudi stack, bottom to top:
● Lake Storage (cloud object stores, HDFS, …)
● Open file/data formats (Parquet, HFile, Avro, ORC, …)
● Transactional database layer:
  - Table format (schema, file listings, stats, evolution, …)
  - Indexes (Bloom filter, HBase, bucket index, hash-based, Lucene, …)
  - Concurrency control (OCC, MVCC, non-blocking, lock providers, scheduling, …)
  - Table services (cleaning, compaction, clustering, indexing, file sizing, …)
  - Lake Cache* (columnar, transactional, mutable, WIP, …)
  - Metaserver* (stats, table service coordination, …)
● Programming API:
  - Writers (inserts, updates, deletes, smart layout management, etc.)
  - Readers (snapshot, time travel, incremental, etc.)
● User interface:
  - Query engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
  - Platform services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)
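To make the reader side concrete, here is a minimal sketch of a snapshot read and an incremental read through the Hudi Spark datasource (the table path and begin instant are hypothetical; the query-type options are standard Hudi read options, and a SparkSession named spark is assumed, as in the connector example later in the deck):

// Snapshot query: reads the latest committed version of the table
val snapshotDF = spark.read.format("hudi").
  load("s3a://hudi-tables/rt_impressions")

// Incremental query: reads only records committed after the given instant
val incrementalDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  // Commit instant to start from (hypothetical value)
  option("hoodie.datasource.read.begin.instanttime", "20220818000000").
  load("s3a://hudi-tables/rt_impressions")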
10. Apache Hudi Overview
COW, MOR, WTH?
[Diagram: write and read paths for the two table types]
● Copy-on-Write (COW): incoming data is written by rewriting the affected base files as new versions (v1 → v2). A snapshot read simply fetches the latest version of each file.
● Merge-on-Read (MOR): incoming data is appended to change logs next to the versioned base files. A snapshot read fetches the latest base files and merges in the change logs; a background compaction later folds the change logs into new base file versions (v1 → v2).
11. Apache Hudi Overview
Comparing COW and MOR
● Writing cost: COW high (one updated record rewrites the whole file); MOR low (updated records are persisted in change logs).
● Ingestion latency: COW high (see above); MOR low (see above).
● Querying speed: COW fast (data is read as-is); MOR slower without compaction (change-log records have to be merged with the original ones at read time) and fast after compaction (data is read as-is, identical to COW).
● Overall: COW gives fast querying at the expense of write amplification and slower ingestion; MOR gives fast writing, amortizing the update cost across many writes.
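The table type is picked once, at write time, with a single option; a minimal sketch, reusing the hypothetical table from the sketches above:

// Create/write a Merge-on-Read table; omit the option (or pass
// "COPY_ON_WRITE") to get the default Copy-on-Write behavior
df.write.format("hudi").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.table.name", "rt_impressions").
  option("hoodie.datasource.write.recordkey.field", "event_id").
  mode(SaveMode.Append).
  save("s3a://hudi-tables/rt_impressions")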
12. Apache Hudi Overview
Who’s using it?
Uber rides - 250+ Petabytes from 24h+ to minutes latency
https://eng.uber.com/uber-big-data-platform/
Package deliveries - real-time event analytics at Petabyte scale
https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
TikTok/Bytedance recommendation system at *Exabyte* scale
http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
Trading transactions - Near real-time CDC from 4000+ postgres tables
https://s.apache.org/hudi-robinhood-talk
150 source systems, ETL processing for 10,000+ tables
https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
Real-time advertising for 20M+ concurrent viewers
https://www.youtube.com/watch?v=mFpqrVxxwKc
Store transactions - CDC & Warehousing
https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
13. Why Pulsar for Lakehouse?
Unlocking the Power of Lakehouses using Pulsar and Hudi
14. Why Pulsar for Lakehouse
Different domains, analogous problems
● Application domain: the move from microservices to event-driven architecture requires a comprehensive platform; legacy messaging tech can't keep up, and streams alone are not enough for the app tier.
● Data domain: the move from batch to streams, and from hours to minutes of latency, breaks data lakes; legacy meta-stores and inconsistent metadata aren't sufficient.
15. Why Pulsar for Lakehouse
Different domains, similar goals
● Pulsar is the multi-tenant real-time data platform that scales across orgs to simplify building event-driven apps with messages and streams.
● Lakehouse is the modern solution for providing consistent access to minute-latency data and batch data across a rich ecosystem of tools.
16. Apache Pulsar + Apache Hudi
Unlocking the Power of Lakehouses using Pulsar and Hudi
17. Apache Pulsar + Apache Hudi
Across the data ecosystem
Apache Pulsar + Apache Hudi together provide teams with a powerful solution for data across app, data, and time domains.
[Diagram: a latency spectrum from milliseconds, seconds, and minutes (app tier and real-time analytics) out to hours, days, months, and forever (BI/batch); topics are offloaded to Hudi tables on the batch end, and Hudi tables can be loaded back into topics on the real-time end.]
18. Apache Pulsar + Apache Hudi = Lakehouse
Integration Options
There are currently a few ways to ingest data from Pulsar using Spark and Hudi:
1. Using Pulsar's Apache Spark connector
2. Using the DeltaStreamer utility from Hudi
3. Using the StreamNative Lakehouse Sink (Beta)
19. Apache Pulsar + Apache Hudi = Lakehouse
Using Pulsar's Apache Spark connector

val topicName = "realtime-impressions"
val tableName = "rt_impressions"
// Offset bounds for the batch read ("earliest"/"latest" cover the whole topic)
val startingOffsets = "earliest"
val endingOffsets = "latest"

// Fetching the data from Pulsar
val df =
  spark.read.format("pulsar").
    option("service.url", "pulsar://localhost:6650").
    option("topics", topicName).
    option("startingOffsets", startingOffsets).
    option("endingOffsets", endingOffsets).
    load()

// And writing it into a Hudi table
df.write.format("hudi").
  option("hoodie.datasource.write.table.name", tableName).
  option("hoodie.datasource.write.operation", "bulk_insert").
  // Record keys are necessary for Hudi to efficiently perform delete/update operations
  option("hoodie.datasource.write.recordkey.field", "event_id").
  // We're creating a non-partitioned table
  option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  mode(SaveMode.Append).
  save(s"s3a://hudi-tables/$tableName")
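The same connector also supports Structured Streaming, so the batch read above can be turned into continuous ingestion. A minimal sketch, assuming the connector version used above supports streaming reads with the same options; the checkpoint path and the use of Pulsar's __publishTime metadata column as the precombine field are assumptions:

// Continuously ingest the topic into the Hudi table
val streamDF = spark.readStream.format("pulsar").
  option("service.url", "pulsar://localhost:6650").
  option("topics", topicName).
  load()

streamDF.writeStream.format("hudi").
  option("hoodie.datasource.write.table.name", tableName).
  option("hoodie.datasource.write.recordkey.field", "event_id").
  // Break ties between duplicate keys by Pulsar publish time
  option("hoodie.datasource.write.precombine.field", "__publishTime").
  option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  // Spark requires a checkpoint location for streaming writes
  option("checkpointLocation", s"s3a://hudi-tables/checkpoints/$tableName").
  outputMode("append").
  start(s"s3a://hudi-tables/$tableName")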
25. Apache Pulsar + Apache Hudi = Lakehouse
Using DeltaStreamer from Hudi
Demo Script
Step #1
1. Ingest the 1st batch into Pulsar (stock ticks dataset)
2. Run DeltaStreamer (ingests the 1st batch into Hudi)
3. Show the dataset (schema, the data itself, counts)
4. Run DeltaStreamer again (no new data, nothing to ingest)
Step #2
1. Ingest the 2nd batch into Pulsar (stock ticks dataset)
2. Run DeltaStreamer (ingests the 2nd batch into Hudi)
3. Show the dataset (schema, the data itself, counts)
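For reference, a DeltaStreamer run against a Pulsar topic looks roughly like the spark-submit invocation below. This is a sketch, not the demo's actual command: the paths, field names, and topic name are made up, and the Pulsar source config keys are assumptions based on Hudi's hudi-utilities PulsarSource, so check them against your Hudi version.

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.PulsarSource \
  --source-ordering-field ts \
  --target-base-path s3a://hudi-tables/stock_ticks \
  --target-table stock_ticks \
  --hoodie-conf hoodie.datasource.write.recordkey.field=key \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.topic=stock-ticks \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.service.url=pulsar://localhost:6650 \
  --hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.admin.url=http://localhost:8080

Re-running the same command with no new data in the topic should be a no-op, which is exactly what step #1.4 of the demo script shows.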