Understanding conversion funnels and rates is essential for deciphering e-commerce shopping behavior. In this live event, Albert Wong from StarRocks will provide an anonymized, real-world customer dataset featuring 87 million events and 4 million unique products spanning 10,000 product categories. He'll showcase how to deploy a modern data lakehouse with #ApacheHudi and MinIO, then conduct complex analytics, including JOIN operations, to analyze purchasing patterns and product conversion rates with #StarRocks as the analytical engine.
You can catch the live event:
https://youtu.be/-Wp7itPDtgo
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
1. Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
Presenters:
● Nadine Farah {nadine@onehouse.ai}
● Albert Wong {albert.wong@celerdata.com}
February 22nd 2024
2. Speaker Bio
Nadine Farah
❏ Dev Rel @Onehouse
❏ Contributor @Apache Hudi
❏ Former @Rockset, @Bose
in/nadinefarah/ · @nfarah86
Albert Wong
❏ Dev Rel @CelerData
❏ Contributor @StarRocks
❏ Former MongoDB, Red Hat, IBM
in/atwong/
5. Apache Hudi is a Lakehouse Platform
User Interface
● Readers (Snapshot, Time Travel, Incremental, etc.)
● Writers (Inserts, Updates, Deletes, Smart Layout Management, etc.)
● Programming API
Transactional Database Layer
● Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling, ...)
● Table Services (cleaning, compaction, clustering, indexing, file sizing, ...)
● Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, ...)
● Table Format (Schema, File listings, Stats, Evolution, ...)
● Lake Cache* (Columnar, transactional, mutable, WIP, ...)
● Metaserver* (Stats, table service coordination, ...)
Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, ...)
Query Engines (Spark, Flink, Hive, Presto, Trino, StarRocks, Redshift, BigQuery, Snowflake, ...)
Lake Storage (Cloud Object Stores, HDFS, ...)
Open File/Data Formats (Parquet, HFile, Avro, ORC, ...)
6. Why Choose Apache Hudi for Data Storage & Processing
● Fast Upserts and Deletes: Hudi supports data mutability & offers multiple index types
○ Works well with streaming data from Kafka, Flink, Spark Structured Streaming, etc.
● Incremental Processing: Avoid full table scans and table rewrites
● Proven at Scale: Petabyte & Exabyte data with Bytedance, Uber & more
● Interoperability with Query Engines: StarRocks & other popular engine support
● Easy to Manage Table Services: Automatic file sizing, cleaning, clustering & more
7. Hudi Table: Copy On Write
Commit time=0 (Insert: A, B, C, D, E)
● file1_t0.parquet: A, B
● file2_t0.parquet: C, D
● file3_t0.parquet: E
Commit time=1 (Update: A => A’, D => D’)
● file1_t1.parquet: A’, B (file1 rewritten)
● file2_t1.parquet: C, D’ (file2 rewritten)
Commit time=2 (Update: A’ => A”, E => E’; Insert: F)
● file1_t2.parquet: A”, B
● file3_t2.parquet: E’, F
Snapshot Query: t0: A,B,C,D,E | t1: A’,B,C,D’,E | t2: A”,B,C,D’,E’,F
Incremental Query: t0: A,B,C,D,E | t1: A’,D’ | t2: A”,E’,F
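The copy-on-write flow above can be sketched in plain Python. This is a toy model, not Hudi's actual storage engine: each commit rewrites the affected "files" in full, a snapshot query reads the latest version of every file, and an incremental query returns only records touched after a given commit.

```python
# Toy copy-on-write table: every commit rewrites affected "files" in full.
# Illustrative sketch only; class and method names are invented for this demo.

class CopyOnWriteTable:
    def __init__(self):
        self.files = {}      # file_id -> {key: value}, latest version only
        self.commits = []    # list of (commit_time, {key: value} changes)

    def commit(self, changes, file_of):
        """Apply {key: value} changes; file_of maps a key to its file id."""
        t = len(self.commits)
        for key, value in changes.items():
            fid = file_of(key)
            new_file = dict(self.files.get(fid, {}))  # rewrite the whole file
            new_file[key] = value
            self.files[fid] = new_file
        self.commits.append((t, dict(changes)))
        return t

    def snapshot_query(self):
        """Latest value of every record (reads only current base files)."""
        out = {}
        for f in self.files.values():
            out.update(f)
        return out

    def incremental_query(self, since):
        """Only records changed by commits after `since` (no full scan)."""
        out = {}
        for t, changes in self.commits:
            if t > since:
                out.update(changes)
        return out

# Re-create the slide's example: keys A..E land in three files.
placement = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}
table = CopyOnWriteTable()
table.commit({k: k for k in "ABCDE"}, placement.get)             # t=0 insert
table.commit({"A": "A'", "D": "D'"}, placement.get)              # t=1 update
table.commit({"A": "A''", "E": "E'", "F": "F"}, placement.get)   # t=2 update+insert
```

Note the cost profile the slide implies: updating one record at t=1 rewrites its whole file, which is why COW favors read-heavy workloads.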
8. Hudi Table: Merge On Read
Commit time=0 (Insert: A, B, C, D, E)
● file1_t0.parquet: A, B
● file2_t0.parquet: C, D
● file3_t0.parquet: E
Commit time=1 (Update: A => A’, D => D’)
● .file1_t1.log: A’
● .file2_t1.log: D’
Commit time=2 (Update: A’ => A”, E => E’; Insert: F)
● .file1_t2.log: A”
● .file3_t2.log: E’, F
Commit time=3 (Compaction: base files merged with their logs)
● file1_t3.parquet: A”, B
● file2_t3.parquet: C, D’
● file3_t3.parquet: E’, F
Snapshot Query: t0: A,B,C,D,E | t1: A’,B,C,D’,E | t2: A”,B,C,D’,E’,F | t3: A”,B,C,D’,E’,F
Read Optimized Query: t0: A,B,C,D,E | t1: A,B,C,D,E | t2: A,B,C,D,E | t3: A”,B,C,D’,E’,F
Incremental Query: t0: A,B,C,D,E | t1: A’,D’ | t2: A”,E’,F
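The merge-on-read flow can be modeled the same way. In this toy sketch (again, invented names, not Hudi's engine) updates append to cheap log "files", a snapshot query merges base files with logs at read time, a read-optimized query reads only base files, and compaction folds logs back into new base files:

```python
# Toy merge-on-read table: updates go to append-only logs; compaction
# merges logs into new base files. Illustrative sketch only.

class MergeOnReadTable:
    def __init__(self):
        self.base = {}   # file_id -> {key: value} (parquet base files)
        self.logs = {}   # file_id -> {key: value} (accumulated log records)

    def insert(self, records, file_of):
        for key, value in records.items():
            self.base.setdefault(file_of(key), {})[key] = value

    def update(self, changes, file_of):
        """Writes land in cheap logs instead of rewriting parquet files."""
        for key, value in changes.items():
            self.logs.setdefault(file_of(key), {})[key] = value

    def snapshot_query(self):
        """Merge base files with logs at read time: freshest data."""
        out = {}
        for f in self.base.values():
            out.update(f)
        for f in self.logs.values():
            out.update(f)
        return out

    def read_optimized_query(self):
        """Read base files only: faster, but misses un-compacted updates."""
        out = {}
        for f in self.base.values():
            out.update(f)
        return out

    def compact(self):
        """Fold logs into base files; read-optimized results are fresh again."""
        for fid, log in self.logs.items():
            self.base.setdefault(fid, {}).update(log)
        self.logs = {}

placement = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}
t = MergeOnReadTable()
t.insert({k: k for k in "ABCDE"}, placement.get)            # commit t=0
t.update({"A": "A'", "D": "D'"}, placement.get)             # t=1: log files
t.update({"A": "A''", "E": "E'", "F": "F"}, placement.get)  # t=2: log files
```

This mirrors the slide: before compaction, the read-optimized query still returns A,B,C,D,E while the snapshot query already sees A”,B,C,D’,E’,F.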
9. Hudi for Data Lake
Choose Copy On Write if:
- Write cost is not an issue, but need fast reads
- Workload is fairly understood and not bursty
- Bound by parquet ingestion performance
- Simple to operate
Choose Merge On Read if:
- Need quick ingestion
- Workload can be changing or spiky
- Some operational chops
- Both read optimized and real time
10. Query Types
● Snapshot query/Real time query
○ Latest data
● Read optimized Query
○ Favors faster query latency by trading off fresh data
● Incremental Query
○ Incremental processing, Medallion architecture
● Time travel query
○ As of timestamp
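The relationship between snapshot, time travel, and incremental queries is just different views over the same commit timeline. A minimal sketch (toy commit log, not Hudi's API):

```python
# Sketch of Hudi's query types over a commit timeline (toy model).
# The commit data reuses the COW/MOR example from the earlier slides.

commits = [
    (0, {"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"}),
    (1, {"A": "A'", "D": "D'"}),
    (2, {"A": "A''", "E": "E'", "F": "F"}),
]

def time_travel(as_of):
    """State of the table as of commit `as_of` (inclusive)."""
    state = {}
    for t, changes in commits:
        if t <= as_of:
            state.update(changes)
    return state

def snapshot():
    """Latest data: time travel to the newest commit."""
    return time_travel(commits[-1][0])

def incremental(since):
    """Only records written after `since`. In a Medallion architecture,
    downstream (silver/gold) layers process just these records instead
    of re-scanning the whole table."""
    changed = {}
    for t, changes in commits:
        if t > since:
            changed.update(changes)
    return changed
```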
11. The Community
● 4000+ Slack Members
● 300+ Contributors
● 3000+ GH Engagers
● 30+ Committers
● 1M DLs/month (400% YoY)
● 800B+ Records/Day (from even just 1 customer!)
● Pre-installed on 5 cloud providers
● Diverse PMC/Committers
● Rich community of participants
14. StarRocks Architecture Overview
● Open Source OLAP compute engine
● Seamless integration with the Ecosystem
● Ease of Use
● Real-world Performance
● Open Table Formats as the Foundation
● Support for Open Storage
● Separated compute and storage architecture
● Cloud Native with k8s Operator
Linux Foundation project with Apache 2.0 license.
More diagrams: https://github.com/StarRocks/starrocks-reference-architecture
15. StarRocks with Open Data Lake
More diagrams: https://github.com/StarRocks/starrocks-reference-architecture
16. StarRocks 3.x series roadmap
The goal of the 3.x series roadmap is to 1) build and optimize more core data warehouse features, 2) reach feature parity between the shared-nothing and shared-data architectures, and 3) be able to query both the StarRocks table format and all the popular open table formats such as Apache Iceberg, Apache Hudi, Apache Hive, Delta Lake, and Apache Paimon.
3.0: Initial release of Shared Data Architecture
● Decouple compute and storage layers.
● Further development of StarRocks tables, materialized views, JOIN performance, cache.
● Enhancements to Iceberg, Hudi, Delta Lake, Hive support.
3.1: Incremental improvement to 3.x goals
● Mirroring features from the shared-nothing to the shared-data architecture.
● Further development of core DW features and open table format support.
3.2: Incremental improvement to 3.x goals
● Mirroring features from the shared-nothing to the shared-data architecture.
● Further development of core DW features and open table format support.
3.3: Incremental improvement to 3.x goals
● To be determined.
3.4: Incremental improvement to 3.x goals
● To be determined.
18. Vectorized Query Engine with SIMD
Modern CPUs have vectorized (SIMD) instruction sets that perform an operation on multiple data elements simultaneously, which can make queries 3x to 5x faster than on non-SIMD databases.
19. JOIN performance at scale
● CBO will do intelligent Join reorder and Join method selection
● StarRocks can join 100 million rows of data per second using only 1 CPU. Details at https://www.starrocks.io/blog/benchmark-test
Simplify your data engineering pipeline and infrastructure by using JOINs; denormalization is optional.
SQL JOINs supported: Inner ✅ · Left ✅ · Right ✅ · Full ✅ · Cross ✅ · Semi ✅ · Anti ✅
JOIN optimization techniques: Broadcast ✅ · Shuffle ✅ · Bucket Shuffle ✅ · Co-Located ✅ · Replicated ✅ · Local ✅
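Of the join types listed, semi and anti joins are the least familiar. A small pure-Python sketch of their semantics (toy hash joins with invented table data; StarRocks executes these in its vectorized C++ engine):

```python
# Toy hash-join sketches. Sample tables are invented for illustration:
# orders = (order_id, user), users = (user, country), joined on user.

orders = [("o1", "alice"), ("o2", "bob"), ("o3", "carol")]
users = [("alice", "US"), ("bob", "DE")]

def inner_join(orders, users):
    """Rows whose join key appears on both sides, with columns from both."""
    lookup = dict(users)                      # build hash table on the right side
    return [(oid, u, lookup[u]) for oid, u in orders if u in lookup]

def semi_join(orders, users):
    """Orders that HAVE a matching user; right-side columns are not emitted
    (what `WHERE user IN (SELECT ...)` typically becomes)."""
    keys = {u for u, _ in users}
    return [(oid, u) for oid, u in orders if u in keys]

def anti_join(orders, users):
    """Orders with NO matching user, e.g. finding orphaned rows
    (what `WHERE user NOT IN (SELECT ...)` typically becomes)."""
    keys = {u for u, _ in users}
    return [(oid, u) for oid, u in orders if u not in keys]
```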
20. SQL Hybrid-Based Optimizer
Analyzes a SQL query and chooses the most efficient execution plan by estimating the cost of different potential plans.
21. Cache System
The cache lets you pull data from memory instead of storage, which can improve query efficiency by 3x to 17x.
Transparent Speedup (Cache Functionality): Metadata ✅ · Query ✅ · Page ✅ · Data ✅
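The "transparent speedup" idea is that repeated reads are served from memory without the query changing at all. A minimal sketch using Python's standard memoization decorator (`fetch_page` is an invented stand-in for a slow storage read; this is not StarRocks' cache implementation):

```python
# Sketch of a transparent page cache: the caller always calls fetch_page,
# but only cache misses actually touch "storage".

from functools import lru_cache

storage_reads = {"count": 0}

@lru_cache(maxsize=1024)            # in-memory page cache
def fetch_page(page_id):
    storage_reads["count"] += 1     # simulate a slow storage/object-store read
    return f"data-for-{page_id}"

fetch_page("p1")   # miss: goes to storage
fetch_page("p1")   # hit: served from memory, no storage read
fetch_page("p2")   # miss
```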
22. Separated compute and storage architecture
Design approach for databases and data platforms that decouples the processing power (compute) from the data storage layer.
23. SQL Connectivity through MySQL wire protocol support with Trino dialect
Communicate with StarRocks through MySQL statements and utilities. StarRocks also understands the Trino SQL dialect.
24. Thank you.
● Community starrocks.io
● Enterprise celerdata.com
● Managed Service cloud.celerdata.com
27. Demo Resources
Hudi
- MOR table type with Snapshot query
StarRocks
- https://github.com/StarRocks/demo/tree/master/documentation-samples/datalakehouse
28. Engage With Our Community
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w
Twitter : https://twitter.com/apachehudi
Github: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
31. Data Lake Challenges at Uber
Context
❏ Uber had PBs of data
❏ Frequent updates
❏ HDFS/Cloud storage is immutable
Problems
❏ Extremely poor ingest performance
❏ Wasteful reading/writing (compute)
❏ Zero concurrency control or ACID
32. Motivations for Hudi
● Uber needed FAST data: they needed to power faster and fresher analytics
● Late-arriving data: updates go beyond the current day and can span months in the past. With a data lake, you would need to rewrite the whole table or partition.
33. How Uber solved its petabyte-scale data challenge on a Data Lake
34. Apache Hudi: Improved efficiency
Context
❏ Uber in hypergrowth
❏ Moving from warehouse to lake
❏ HDFS/Cloud storage is immutable
Solutions
❏ Efficient ingestion: support of mutability, row-level updates & deletes
❏ Efficient reading/writing performance: support for MOR tables, indexes, improved file layout & timeline
❏ Concurrency control & ACID guarantees