Understanding conversion funnels and rates is essential for deciphering e-commerce shopping behavior. In this live event, Albert Wong from StarRocks will provide an anonymized, real-world customer dataset featuring 87 million events and 4 million unique products spanning 10,000 product categories. He'll showcase how to deploy a modern data lakehouse with #ApacheHudi and MinIO, then conduct complex analytics, including JOIN operations, to analyze purchasing patterns and product conversion rates with #StarRocks as the analytical engine.
You can catch the live event:
https://youtu.be/-Wp7itPDtgo
Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
1. Unlock user behavior with 87 Million events using Hudi, StarRocks & MinIO
Presenters:
● Nadine Farah {nadine@onehouse.ai}
● Albert Wong {albert.wong@celerdata.com}
February 22nd 2024
2. Speaker Bio
Nadine Farah
❏ Dev Rel @Onehouse
❏ Contributor @Apache Hudi
❏ Former @Rockset, @Bose
in/nadinefarah/ · @nfarah86
Albert Wong
❏ Dev Rel @CelerData
❏ Contributor @StarRocks
❏ Former MongoDB, Red Hat, IBM
in/atwong/
5. Apache Hudi is a Lakehouse Platform
User Interface
● Readers (Snapshot, Time Travel, Incremental, etc.)
● Writers (Inserts, Updates, Deletes, Smart Layout Management, etc.)
● Programming API
Transactional Database Layer
● Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Scheduling, ...)
● Table Services (cleaning, compaction, clustering, indexing, file sizing, ...)
● Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, ...)
● Table Format (Schema, File listings, Stats, Evolution, ...)
● Lake Cache* (Columnar, transactional, mutable, WIP, ...)
● Metaserver* (Stats, table service coordination, ...)
Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, ...)
Query Engines (Spark, Flink, Hive, Presto, Trino, StarRocks, Redshift, BigQuery, Snowflake, ...)
Lake Storage (Cloud Object Stores, HDFS, ...)
Open File/Data Formats (Parquet, HFile, Avro, ORC, ...)
6. Why Choose Apache Hudi for Data Storage & Processing
● Fast Upserts and Deletes: Hudi supports data mutability & offers multiple index types
○ Works well with streaming data from Kafka, Flink, Spark Structured Streaming, etc.
● Incremental Processing: Avoid full table scans and table rewrites
● Proven at Scale: Petabyte & Exabyte data with Bytedance, Uber & more
● Interoperability with Query Engines: StarRocks & other popular engine support
● Easy to Manage Table Services: Automatic file sizing, cleaning, clustering & more
7. Hudi Table: Copy On Write
Commit time=0 (Insert: A, B, C, D, E)
● file1_t0.parquet: A, B
● file2_t0.parquet: C, D
● file3_t0.parquet: E
Commit time=1 (Update: A => A’, D => D’)
● file1_t1.parquet: A’, B (file1 rewritten)
● file2_t1.parquet: C, D’ (file2 rewritten)
Commit time=2 (Update: A’ => A”, E => E’; Insert: F)
● file1_t2.parquet: A”, B
● file3_t2.parquet: E’, F
Snapshot Query: t0: A,B,C,D,E | t1: A’,B,C,D’,E | t2: A”,B,C,D’,E’,F
Incremental Query: t0: A,B,C,D,E | t1: A’,D’ | t2: A”,E’,F
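The copy-on-write flow above can be sketched in plain Python. This is a toy model, not Hudi's actual storage engine: each commit rewrites the affected "files" in full, a snapshot query reads the latest version of every file, and an incremental query returns only records touched after a given commit.

```python
# Toy copy-on-write table: every commit rewrites affected "files" in full.
# Illustrative sketch only; class and method names are invented for this demo.

class CopyOnWriteTable:
    def __init__(self):
        self.files = {}      # file_id -> {key: value}, latest version only
        self.commits = []    # list of (commit_time, {key: value} changes)

    def commit(self, changes, file_of):
        """Apply {key: value} changes; file_of maps a key to its file id."""
        t = len(self.commits)
        for key, value in changes.items():
            fid = file_of(key)
            new_file = dict(self.files.get(fid, {}))  # rewrite the whole file
            new_file[key] = value
            self.files[fid] = new_file
        self.commits.append((t, dict(changes)))
        return t

    def snapshot_query(self):
        """Latest value of every record (reads only current base files)."""
        out = {}
        for f in self.files.values():
            out.update(f)
        return out

    def incremental_query(self, since):
        """Only records changed by commits after `since` (no full scan)."""
        out = {}
        for t, changes in self.commits:
            if t > since:
                out.update(changes)
        return out

# Re-create the slide's example: keys A..E land in three files.
placement = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}
table = CopyOnWriteTable()
table.commit({k: k for k in "ABCDE"}, placement.get)             # t=0 insert
table.commit({"A": "A'", "D": "D'"}, placement.get)              # t=1 update
table.commit({"A": "A''", "E": "E'", "F": "F"}, placement.get)   # t=2 update+insert
```

Note the cost profile the slide implies: updating one record at t=1 rewrites its whole file, which is why COW favors read-heavy workloads.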
8. Hudi Table: Merge On Read
Commit time=0 (Insert: A, B, C, D, E)
● file1_t0.parquet: A, B
● file2_t0.parquet: C, D
● file3_t0.parquet: E
Commit time=1 (Update: A => A’, D => D’)
● .file1_t1.log: A’
● .file2_t1.log: D’
Commit time=2 (Update: A’ => A”, E => E’; Insert: F)
● .file1_t2.log: A”
● .file3_t2.log: E’, F
Commit time=3 (Compaction: base files merged with their logs)
● file1_t3.parquet: A”, B
● file2_t3.parquet: C, D’
● file3_t3.parquet: E’, F
Snapshot Query: t0: A,B,C,D,E | t1: A’,B,C,D’,E | t2: A”,B,C,D’,E’,F | t3: A”,B,C,D’,E’,F
Read Optimized Query: t0: A,B,C,D,E | t1: A,B,C,D,E | t2: A,B,C,D,E | t3: A”,B,C,D’,E’,F
Incremental Query: t0: A,B,C,D,E | t1: A’,D’ | t2: A”,E’,F
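The merge-on-read flow can be modeled the same way. In this toy sketch (again, invented names, not Hudi's engine) updates append to cheap log "files", a snapshot query merges base files with logs at read time, a read-optimized query reads only base files, and compaction folds logs back into new base files:

```python
# Toy merge-on-read table: updates go to append-only logs; compaction
# merges logs into new base files. Illustrative sketch only.

class MergeOnReadTable:
    def __init__(self):
        self.base = {}   # file_id -> {key: value} (parquet base files)
        self.logs = {}   # file_id -> {key: value} (accumulated log records)

    def insert(self, records, file_of):
        for key, value in records.items():
            self.base.setdefault(file_of(key), {})[key] = value

    def update(self, changes, file_of):
        """Writes land in cheap logs instead of rewriting parquet files."""
        for key, value in changes.items():
            self.logs.setdefault(file_of(key), {})[key] = value

    def snapshot_query(self):
        """Merge base files with logs at read time: freshest data."""
        out = {}
        for f in self.base.values():
            out.update(f)
        for f in self.logs.values():
            out.update(f)
        return out

    def read_optimized_query(self):
        """Read base files only: faster, but misses un-compacted updates."""
        out = {}
        for f in self.base.values():
            out.update(f)
        return out

    def compact(self):
        """Fold logs into base files; read-optimized results are fresh again."""
        for fid, log in self.logs.items():
            self.base.setdefault(fid, {}).update(log)
        self.logs = {}

placement = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}
t = MergeOnReadTable()
t.insert({k: k for k in "ABCDE"}, placement.get)            # commit t=0
t.update({"A": "A'", "D": "D'"}, placement.get)             # t=1: log files
t.update({"A": "A''", "E": "E'", "F": "F"}, placement.get)  # t=2: log files
```

This mirrors the slide: before compaction, the read-optimized query still returns A,B,C,D,E while the snapshot query already sees A”,B,C,D’,E’,F.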
9. Hudi for Data Lake
Choose Copy On Write if:
- Write cost is not an issue, but need fast reads
- Workload is fairly understood and not bursty
- Bound by parquet ingestion performance
- Simple to operate
Choose Merge On Read if:
- Need quick ingestion
- Workload can be changing or spiky
- Some operational chops
- Both read optimized and real time
10. Query Types
● Snapshot query/Real time query
○ Latest data
● Read optimized Query
○ Favors faster query latency by trading off fresh data
● Incremental Query
○ Incremental processing, Medallion architecture
● Time travel query
○ As of timestamp
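The relationship between snapshot, time travel, and incremental queries is just different views over the same commit timeline. A minimal sketch (toy commit log, not Hudi's API):

```python
# Sketch of Hudi's query types over a commit timeline (toy model).
# The commit data reuses the COW/MOR example from the earlier slides.

commits = [
    (0, {"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"}),
    (1, {"A": "A'", "D": "D'"}),
    (2, {"A": "A''", "E": "E'", "F": "F"}),
]

def time_travel(as_of):
    """State of the table as of commit `as_of` (inclusive)."""
    state = {}
    for t, changes in commits:
        if t <= as_of:
            state.update(changes)
    return state

def snapshot():
    """Latest data: time travel to the newest commit."""
    return time_travel(commits[-1][0])

def incremental(since):
    """Only records written after `since`. In a Medallion architecture,
    downstream (silver/gold) layers process just these records instead
    of re-scanning the whole table."""
    changed = {}
    for t, changes in commits:
        if t > since:
            changed.update(changes)
    return changed
```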
11. The Community
● 4000+ Slack Members
● 300+ Contributors
● 3000+ GH Engagers
● 30+ Committers
● 1M DLs/month (400% YoY)
● 800B+ Records/Day (from even just 1 customer!)
● Pre-installed on 5 cloud providers
● Diverse PMC/Committers
● Rich community of participants
14. StarRocks Architecture Overview
● Open Source OLAP compute engine
● Seamless integration with the Ecosystem
● Ease of Use
● Real-world Performance
● Open Table Formats as the Foundation
● Support for Open Storage
● Separated compute and storage architecture
● Cloud Native with k8s Operator
Linux Foundation project with Apache 2.0 license.
More diagrams: https://github.com/StarRocks/starrocks-reference-architecture
15. StarRocks with Open Data Lake
More diagrams: https://github.com/StarRocks/starrocks-reference-architecture
16. StarRocks 3.x series roadmap
The goal of the 3.x series roadmap is to 1) build and optimize more core data warehouse features, 2) reach feature parity between the shared-nothing and shared-data architectures, and 3) be able to query both the StarRocks table format and all the popular open table formats such as Apache Iceberg, Apache Hudi, Apache Hive, Delta Lake, and Apache Paimon.
3.0: Initial release of Shared Data Architecture
● Decouple compute and storage layers.
● Further development of StarRocks tables, materialized views, JOIN performance, cache.
● Enhancements to Iceberg, Hudi, Delta Lake, Hive support.
3.1: Incremental improvement to 3.x goals
● Mirroring features from the shared-nothing to the shared-data architecture.
● Further development of core DW features and open table format support.
3.2: Incremental improvement to 3.x goals
● Mirroring features from the shared-nothing to the shared-data architecture.
● Further development of core DW features and open table format support.
3.3: Incremental improvement to 3.x goals
● To be determined.
3.4: Incremental improvement to 3.x goals
● To be determined.
18. Vectorized Query Engine with SIMD
Modern CPUs have vectorized (SIMD) instruction sets that perform an operation on multiple data elements simultaneously, which can make queries 3x to 5x faster than on non-SIMD databases.
19. JOIN performance at scale
● CBO will do intelligent Join reorder and Join method selection
● StarRocks can join 100 million rows of data per second using only 1 CPU. Details at https://www.starrocks.io/blog/benchmark-test
Simplify your data engineering pipeline and infrastructure by using JOINs; denormalization is optional.
SQL JOINs supported: Inner ✅ · Left ✅ · Right ✅ · Full ✅ · Cross ✅ · Semi ✅ · Anti ✅
JOIN optimization techniques: Broadcast ✅ · Shuffle ✅ · Bucket Shuffle ✅ · Co-Located ✅ · Replicated ✅ · Local ✅
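Of the join types listed, semi and anti joins are the least familiar. A small pure-Python sketch of their semantics (toy hash joins with invented table data; StarRocks executes these in its vectorized C++ engine):

```python
# Toy hash-join sketches. Sample tables are invented for illustration:
# orders = (order_id, user), users = (user, country), joined on user.

orders = [("o1", "alice"), ("o2", "bob"), ("o3", "carol")]
users = [("alice", "US"), ("bob", "DE")]

def inner_join(orders, users):
    """Rows whose join key appears on both sides, with columns from both."""
    lookup = dict(users)                      # build hash table on the right side
    return [(oid, u, lookup[u]) for oid, u in orders if u in lookup]

def semi_join(orders, users):
    """Orders that HAVE a matching user; right-side columns are not emitted
    (what `WHERE user IN (SELECT ...)` typically becomes)."""
    keys = {u for u, _ in users}
    return [(oid, u) for oid, u in orders if u in keys]

def anti_join(orders, users):
    """Orders with NO matching user, e.g. finding orphaned rows
    (what `WHERE user NOT IN (SELECT ...)` typically becomes)."""
    keys = {u for u, _ in users}
    return [(oid, u) for oid, u in orders if u not in keys]
```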
20. SQL Hybrid-Based Optimizer
Analyzes a SQL query and chooses the most efficient execution plan by estimating the cost of different potential plans.
21. Cache System
The cache lets you pull data from memory instead of storage, which can improve query efficiency by 3x to 17x.
Transparent Speedup (Cache Functionality): Metadata ✅ · Query ✅ · Page ✅ · Data ✅
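The "transparent speedup" idea is that repeated reads are served from memory without the query changing at all. A minimal sketch using Python's standard memoization decorator (`fetch_page` is an invented stand-in for a slow storage read; this is not StarRocks' cache implementation):

```python
# Sketch of a transparent page cache: the caller always calls fetch_page,
# but only cache misses actually touch "storage".

from functools import lru_cache

storage_reads = {"count": 0}

@lru_cache(maxsize=1024)            # in-memory page cache
def fetch_page(page_id):
    storage_reads["count"] += 1     # simulate a slow storage/object-store read
    return f"data-for-{page_id}"

fetch_page("p1")   # miss: goes to storage
fetch_page("p1")   # hit: served from memory, no storage read
fetch_page("p2")   # miss
```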
22. Separated compute and storage architecture
Design approach for databases and data platforms that decouples the processing power (compute) from the data storage layer.
23. SQL Connectivity through MySQL wire protocol support with Trino dialect
Communicate with StarRocks through MySQL statements and utilities. StarRocks also understands the Trino SQL dialect.
24. Thank you.
● Community starrocks.io
● Enterprise celerdata.com
● Managed Service cloud.celerdata.com
27. Demo Resources
Hudi
- MOR table type with Snapshot query
StarRocks
- https://github.com/StarRocks/demo/tree/master/documentation-samples/datalakehouse
28. Engage With Our Community
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w
Twitter : https://twitter.com/apachehudi
Github: https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
31. Data Lake Challenges at Uber
Context
❏ Uber had PBs of data
❏ Frequent updates
❏ HDFS/Cloud storage is immutable
Problems
❏ Extremely poor ingest performance
❏ Wasteful reading/writing (compute)
❏ Zero concurrency control or ACID
32. Motivations for Hudi
● Uber needed FAST data: they needed to power faster and fresher analytics
● Late-arriving data: updates go beyond the current day and can span months in the past. With a data lake, you would need to rewrite the whole table or partition.
33. How Uber solved its petabyte-scale data challenge on a Data Lake
34. Apache Hudi: Improved efficiency
Context
❏ Uber in hypergrowth
❏ Moving from warehouse to lake
❏ HDFS/Cloud storage is immutable
Solutions
❏ Efficient ingestion: support of mutability, row-level updates & deletes
❏ Efficient reading/writing performance: support for MOR tables, indexes, improved file layout & timeline
❏ Concurrency control & ACID guarantees