Unlock user behavior with 87 Million
events using Hudi, StarRocks & MinIO
Presenters:
● Nadine Farah {nadine@onehouse.ai}
● Albert Wong {albert.wong@celerdata.com}
February 22nd 2024
Speaker Bio
Nadine Farah
❏ Dev Rel @Onehouse
❏ Contributor @Apache Hudi
❏ Former @Rockset, @Bose
❏ in/nadinefarah/
❏ @nfarah86
Albert Wong
❏ Dev Rel @CelerData
❏ Contributor @StarRocks
❏ Former MongoDB, Red Hat, IBM
❏ in/atwong/
Hudi Overview & Table
Type Deep-Dive
Apache Hudi is a Lakehouse Platform
Lake Storage
(Cloud Object Stores, HDFS, …)
Open File/Data Formats
(Parquet, HFile, Avro, Orc, …)
Concurrency Control
(OCC, MVCC, Non-blocking, Lock
providers, Scheduling...)
Table Services
(cleaning, compaction, clustering,
indexing, file sizing,...)
Indexes
(Bloom filter, HBase, Bucket index,
Hash based, Lucene..)
Table Format
(Schema, File listings, Stats,
Evolution, …)
Lake Cache*
(Columnar, transactional, mutable,
WIP,...)
Metaserver*
(Stats, table service coordination,...)
Query Engines
(Spark, Flink, Hive, Presto, Trino,
StarRocks, Redshift, BigQuery,
Snowflake,..)
Platform Services
(Streaming/Batch ingest, various
sources, Catalog sync, Admin CLI,
Data Quality,...)
Transactional
Database
Layer
User Interface
Readers
(Snapshot, Time Travel,
Incremental, etc)
Writers
(Inserts, Updates, Deletes, Smart
Layout Management, etc)
Programming API
Why Choose Apache Hudi for Data
Storage & Processing
● Fast Upserts and Deletes: Hudi supports data mutability & offers multiple index types
○ Works with streaming sources and engines such as Kafka, Flink, and Spark Structured Streaming
● Incremental Processing: Avoid full table scans and table rewrites
● Proven at Scale: Petabyte & exabyte data at ByteDance, Uber & more
● Interoperability with Query Engines: Supported by StarRocks & other popular engines
● Easy to Manage Table Services: Automatic file sizing, cleaning, clustering & more
Hudi Table: Copy On Write
● commit time=0: Insert A, B, C, D, E
○ Files written: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
○ Snapshot Query: A, B, C, D, E
○ Incremental Query: A, B, C, D, E
● commit time=1: Update A => A’, D => D’
○ Files written: file1_t1.parquet (A’, B), file2_t1.parquet (C, D’)
○ Snapshot Query: A’, B, C, D’, E
○ Incremental Query: A’, D’
● commit time=2: Update A’ => A”, E => E’; Insert F
○ Files written: file1_t2.parquet (A”, B), file3_t2.parquet (E’, F)
○ Snapshot Query: A”, B, C, D’, E’, F
○ Incremental Query: A”, E’, F
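The copy-on-write behavior on this slide can be sketched with a small in-memory model. This is only a toy illustration, not how Hudi is implemented: the class name `CopyOnWriteTable`, its methods, and the file group ids are hypothetical, and real Hudi tracks commits on a timeline rather than in Python dictionaries. It assumes every update rewrites the whole base file of the affected file group.

```python
from dataclasses import dataclass, field


@dataclass
class CopyOnWriteTable:
    """Toy model (hypothetical): each commit rewrites the full base file of
    every file group it touches."""
    # file group id -> list of (commit_time, {record_key: value}) file versions
    file_groups: dict = field(default_factory=dict)
    # commit_time -> {record_key: value} written by that commit
    commits: dict = field(default_factory=dict)

    def commit(self, commit_time, changes):
        """changes maps file group id -> {record_key: new_value}."""
        written = {}
        for group, records in changes.items():
            versions = self.file_groups.setdefault(group, [])
            latest = dict(versions[-1][1]) if versions else {}
            latest.update(records)  # copy the old file and merge in the changes
            versions.append((commit_time, latest))
            written.update(records)
        self.commits[commit_time] = written

    def snapshot(self, as_of=None):
        """Read the newest file version per group, optionally as of a commit."""
        result = {}
        for versions in self.file_groups.values():
            visible = [v for t, v in versions if as_of is None or t <= as_of]
            if visible:
                result.update(visible[-1])
        return result

    def incremental(self, after):
        """Return only records written by commits later than `after`."""
        result = {}
        for t in sorted(self.commits):
            if t > after:
                result.update(self.commits[t])
        return result


# Replay the three commits from the slide.
table = CopyOnWriteTable()
table.commit(0, {"file1": {"A": "A", "B": "B"},
                 "file2": {"C": "C", "D": "D"},
                 "file3": {"E": "E"}})
table.commit(1, {"file1": {"A": "A'"}, "file2": {"D": "D'"}})
table.commit(2, {"file1": {"A": "A''"}, "file3": {"E": "E'", "F": "F"}})

print(table.snapshot())            # latest view of all records
print(table.incremental(after=1))  # only what commit 2 changed
```

Note that the untouched record B is copied into every new version of file1, which is the write amplification that copy-on-write trades for fast reads.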
Hudi Table: Merge On Read
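● commit time=0: Insert A, B, C, D, E
○ Base files written: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
○ Snapshot Query: A, B, C, D, E
○ Incremental Query: A, B, C, D, E
○ Read Optimized Query: A, B, C, D, E
● commit time=1: Update A => A’, D => D’
○ Log files written: .file1_t1.log (A’), .file2_t1.log (D’)
○ Snapshot Query: A’, B, C, D’, E
○ Incremental Query: A’, D’
○ Read Optimized Query: A, B, C, D, E
● commit time=2: Update A’ => A”, E => E’; Insert F
○ Log files written: .file1_t2.log (A”), .file3_t2.log (E’, F)
○ Snapshot Query: A”, B, C, D’, E’, F
○ Incremental Query: A”, E’, F
○ Read Optimized Query: A, B, C, D, E
● commit time=3: Compaction
○ Base files written: file1_t3.parquet (A”, B), file2_t3.parquet (C, D’), file3_t3.parquet (E’, F)
○ Snapshot Query: A”, B, C, D’, E’, F
○ Incremental Query: A”, E’, F
○ Read Optimized Query: A”, B, C, D’, E’, F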
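The difference between snapshot and read optimized queries on a merge-on-read table can be sketched the same way. Again this is a toy model under stated assumptions, not Hudi's implementation: `MergeOnReadTable` and its method names are hypothetical, base files and log files are plain dictionaries, and compaction is triggered by hand.

```python
class MergeOnReadTable:
    """Toy model (hypothetical): updates go to log files and are merged
    into base files only when compaction runs."""

    def __init__(self):
        self.base = {}  # file group id -> {record_key: value} in the base file
        self.logs = {}  # file group id -> list of uncompacted log blocks

    def insert_base(self, group, records):
        self.base[group] = dict(records)

    def append_log(self, group, records):
        # Cheap write: append a delta instead of rewriting the base file.
        self.logs.setdefault(group, []).append(dict(records))

    def read_optimized(self):
        """Read base files only: fast, but blind to uncompacted updates."""
        result = {}
        for records in self.base.values():
            result.update(records)
        return result

    def snapshot(self):
        """Merge base files with log blocks at read time: fresh, but costlier."""
        result = self.read_optimized()
        for blocks in self.logs.values():
            for block in blocks:
                result.update(block)
        return result

    def compact(self):
        """Fold all log blocks into new base files and clear the logs."""
        for group, blocks in self.logs.items():
            merged = self.base.setdefault(group, {})
            for block in blocks:
                merged.update(block)
        self.logs = {}


# Replay commits 0 to 2 from the slide, then compact (commit 3).
mor = MergeOnReadTable()
mor.insert_base("file1", {"A": "A", "B": "B"})
mor.insert_base("file2", {"C": "C", "D": "D"})
mor.insert_base("file3", {"E": "E"})
mor.append_log("file1", {"A": "A'"})
mor.append_log("file2", {"D": "D'"})
mor.append_log("file1", {"A": "A''"})
mor.append_log("file3", {"E": "E'", "F": "F"})

stale = mor.read_optimized()  # still the commit-0 data
fresh = mor.snapshot()        # includes every logged update
mor.compact()
after_compaction = mor.read_optimized()
print(stale, fresh, after_compaction, sep="\n")
```

After compaction the read optimized view catches up with the snapshot view, which matches the commit time=3 column on the slide.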
Hudi for Data Lake
Choose Copy On Write if:
- Write cost is not an issue, but you need fast reads
- The workload is fairly well understood and not bursty
- You are bound by Parquet ingestion performance
- You want something simple to operate
Choose Merge On Read if:
- You need quick ingestion
- The workload can be changing or spiky
- You have some operational chops
- You need both read optimized and real time queries
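In practice, the choice above usually comes down to one write option. The sketch below builds an options dictionary for the Hudi Spark datasource; the option keys are standard Hudi configuration names, while the helper `hudi_write_options`, the table name `user_events`, and the field names `event_id` and `event_ts` are hypothetical placeholders for illustration.

```python
def hudi_write_options(table_name, record_key, precombine_field, merge_on_read):
    """Build Hudi Spark datasource write options for an upsert workload.

    The helper itself is hypothetical; only the option keys come from Hudi.
    """
    table_type = "MERGE_ON_READ" if merge_on_read else "COPY_ON_WRITE"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Hypothetical table and field names, chosen only for this example.
mor_options = hudi_write_options("user_events", "event_id", "event_ts", merge_on_read=True)
cow_options = hudi_write_options("user_events", "event_id", "event_ts", merge_on_read=False)

# With a Spark session and the Hudi bundle on the classpath, the options
# would typically be passed like this (not executed here):
#   df.write.format("hudi").options(**mor_options).mode("append").save(base_path)
print(mor_options["hoodie.datasource.write.table.type"])
print(cow_options["hoodie.datasource.write.table.type"])
```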
Query Types
● Snapshot query/Real time query
○ Latest data
● Read optimized Query
○ Favors lower query latency at the cost of data freshness
● Incremental Query
○ Incremental processing, Medallion architecture
● Time travel query
○ As of timestamp
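The four query types map to read options on the Hudi Spark datasource. The following is a minimal sketch, assuming a Spark reader: the option keys are Hudi configuration names, but the helper `hudi_read_options` and the instant value `20240222000000` are hypothetical.

```python
def hudi_read_options(query_type, begin_instant=None, as_of_instant=None):
    """Return Hudi Spark datasource read options for a given query type.

    The helper is hypothetical; the option keys are Hudi configuration names.
    """
    if query_type == "time_travel":
        if as_of_instant is None:
            raise ValueError("time travel needs as_of_instant")
        return {"as.of.instant": as_of_instant}

    if query_type not in ("snapshot", "read_optimized", "incremental"):
        raise ValueError(f"unknown query type: {query_type}")

    options = {"hoodie.datasource.query.type": query_type}
    if query_type == "incremental":
        if begin_instant is None:
            raise ValueError("incremental queries need begin_instant")
        # Only commits after this instant are returned.
        options["hoodie.datasource.read.begin.instanttime"] = begin_instant
    return options


snapshot_opts = hudi_read_options("snapshot")
read_optimized_opts = hudi_read_options("read_optimized")
incremental_opts = hudi_read_options("incremental", begin_instant="20240222000000")
time_travel_opts = hudi_read_options("time_travel", as_of_instant="20240222000000")

# With Spark available, usage would look like (not executed here):
#   spark.read.format("hudi").options(**incremental_opts).load(base_path)
print(incremental_opts)
```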
The Community
● 4000+ Slack members
● 300+ contributors
● 3000+ GH engagers
● 30+ committers
● 1M DLs/month (400% YoY)
● 800B+ records/day (from even just 1 customer!)
● Pre-installed on 5 cloud providers
● Diverse PMC/Committers
● Rich community of participants
StarRocks Overview
StarRocks Community
● 7500+ GitHub stars
● 350+ contributors
● 18,000+ community members
StarRocks Architecture Overview
More diagrams: https://github.com/StarRocks/starrocks-reference-architecture
● Seamless integration with the ecosystem
● Ease of use
● Real-world performance
● Open source OLAP compute engine
● Open table formats as the foundation
● Support for open storage
● Separated compute and storage architecture
● Cloud native with k8s Operator
● Linux Foundation project with Apache 2.0 license
StarRocks with Open Data Lake
More diagrams: https://github.com/StarRocks/starrocks-reference-architecture
StarRocks 3.x series roadmap
The goals of the 3.x series roadmap are to 1) build out and optimize core data warehouse features, 2) reach feature parity between the shared-nothing and shared-data architectures, and 3) be able to query the StarRocks table format and all the popular open table formats, such as Apache Iceberg, Apache Hudi, Apache Hive, Delta Lake, and Apache Paimon.
3.0
● Initial release of the shared-data architecture, which decouples the compute and storage layers
● Further development of StarRocks tables, materialized views, JOIN performance, and cache
● Enhancements to Iceberg, Hudi, Delta Lake, and Hive support
3.1
● Incremental improvement to 3.x goals
● Mirroring features from the shared-nothing to the shared-data architecture
● Further development of core DW features and open table format support
3.2
● Incremental improvement to 3.x goals
● Mirroring features from the shared-nothing to the shared-data architecture
● Further development of core DW features and open table format support
3.3
● Incremental improvement to 3.x goals
● To be determined
3.4
● Incremental improvement to 3.x goals
● To be determined
Major Features in StarRocks
Vectorized Query Engine with SIMD
Modern CPUs provide vectorized (SIMD) instruction sets that operate on multiple data elements at once,
which can make queries 3x to 5x faster than on non-SIMD databases.
JOIN performance at scale
Simplify your data engineering pipeline and infrastructure by using JOINs; denormalization is optional.
● The CBO (cost-based optimizer) performs intelligent join reordering and join method selection
● StarRocks can join 100 million rows of data per second using only 1 CPU. Details at https://www.starrocks.io/blog/benchmark-test
Types of JOINs supported
● Inner Join ✅
● Left Join ✅
● Right Join ✅
● Full Join ✅
● Cross Join ✅
● Semi Join ✅
● Anti Join ✅
JOIN optimization techniques
● Broadcast Join ✅
● Shuffle Join ✅
● Bucket Shuffle Join ✅
● Co-Located Join ✅
● Replicated Join ✅
● Local Join ✅
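To make two of the listed techniques concrete, here is a small single-process sketch of the idea behind broadcast and shuffle joins. It is not StarRocks code and says nothing about its actual execution engine: the function names, the `user_id` and `country` fields, and the sample rows are all hypothetical, and "partitions" are plain Python lists standing in for nodes.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner join: build a hash table on the right side, probe with the left."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def broadcast_join(left_partitions, small_right, key):
    """Each partition of the large side joins against a full copy of the small side."""
    out = []
    for part in left_partitions:
        out.extend(hash_join(part, small_right, key))
    return out


def shuffle_join(left, right, key, num_partitions):
    """Both sides are repartitioned by join key so matching rows meet in one partition."""
    def partition(rows):
        parts = [[] for _ in range(num_partitions)]
        for row in rows:
            parts[hash(row[key]) % num_partitions].append(row)
        return parts

    out = []
    for left_part, right_part in zip(partition(left), partition(right)):
        out.extend(hash_join(left_part, right_part, key))
    return out


# Hypothetical sample data: a larger fact table and a small dimension table.
events = [
    {"user_id": 1, "event": "view"},
    {"user_id": 2, "event": "cart"},
    {"user_id": 1, "event": "purchase"},
    {"user_id": 3, "event": "view"},
]
users = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "DE"}]


def sort_rows(rows):
    return sorted(rows, key=lambda r: (r["user_id"], r["event"]))


broadcast_result = sort_rows(broadcast_join([events[:2], events[2:]], users, "user_id"))
shuffle_result = sort_rows(shuffle_join(events, users, "user_id", num_partitions=2))
print(broadcast_result)
```

Both strategies return the same rows; they differ in what moves over the network. Broadcasting copies the small table everywhere, while shuffling moves rows from both tables, which is why an optimizer picks between them based on table sizes.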
SQL Hybrid-Based Optimizer
Analyzes a SQL query and chooses the most efficient execution plan by estimating the cost of different potential
plans
Cache System
Caching lets queries read data from memory instead of storage, which can improve query efficiency by 3x to 17x.
Transparent speedup (cache functionality):
● Metadata ✅
● Query ✅
● Page ✅
● Data ✅
Separated compute and storage architecture
Design approach for databases and data platforms that decouples the processing power (compute) from the
data storage layer.
SQL Connectivity through MySQL wire protocol
support with Trino dialect
Communicate with StarRocks through MySQL statements and utilities. Also understands the Trino SQL
dialect.
Thank you.
● Community: starrocks.io
● Enterprise: celerdata.com
● Managed Service: cloud.celerdata.com
Demo
Architecture
Demo Resources
Hudi
- MOR table type with Snapshot query
StarRocks
- https://github.com/StarRocks/demo/tree/master/documentation-samples/datalakehouse
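For a setup like this demo, StarRocks typically reaches Hudi tables on MinIO through an external catalog. The sketch below only renders the SQL statement as a string, which could then be sent over any MySQL-protocol client. The property names follow the StarRocks Hudi catalog documentation as best understood and should be verified against the StarRocks version in use; the helper name, catalog name, hostnames, and credentials are hypothetical placeholders and are not taken from the demo repository.

```python
def create_hudi_catalog_sql(catalog_name, metastore_uri, s3_endpoint, access_key, secret_key):
    """Render a CREATE EXTERNAL CATALOG statement for Hudi tables on S3-compatible storage.

    Hypothetical helper; check the property names against the StarRocks docs.
    """
    properties = {
        "type": "hudi",
        "hive.metastore.type": "hive",
        "hive.metastore.uris": metastore_uri,
        "aws.s3.use_instance_profile": "false",
        "aws.s3.access_key": access_key,
        "aws.s3.secret_key": secret_key,
        "aws.s3.enable_ssl": "false",
        # MinIO is usually addressed with path-style URLs.
        "aws.s3.enable_path_style_access": "true",
        "aws.s3.endpoint": s3_endpoint,
    }
    body = ",\n".join(f'    "{k}" = "{v}"' for k, v in properties.items())
    return f"CREATE EXTERNAL CATALOG {catalog_name}\nPROPERTIES (\n{body}\n);"


# All values below are placeholders for illustration only.
catalog_sql = create_hudi_catalog_sql(
    catalog_name="hudi_catalog",
    metastore_uri="thrift://hive-metastore:9083",
    s3_endpoint="http://minio:9000",
    access_key="EXAMPLE_ACCESS_KEY",
    secret_key="EXAMPLE_SECRET_KEY",
)
print(catalog_sql)
```

Once the catalog exists, tables are addressed as catalog.database.table in ordinary SELECT statements.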
Engage With Our Community
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : https://join.slack.com/t/apache-hudi/shared_invite/zt-1e94d3xro-JvlNO1kSeIHJBTVfLPlI5w
Twitter : https://twitter.com/apachehudi
GitHub : https://github.com/apache/hudi/ (give us a star ⭐!)
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?
Preface: Uber’s Petabyte
Data Challenge
Data Lake Challenges at Uber
Context
❏ Uber had PBs of data
❏ Frequent updates
❏ HDFS/Cloud storage is immutable
Problems
❏ Extremely poor ingest performance
❏ Wasteful reading/writing (compute)
❏ Zero concurrency control or ACID
Motivations for Hudi
● Uber needed FAST data: They needed to power faster and fresher
analytics
● Late-arriving data: Updates go beyond the current day and can span
months in the past. With a data lake, you would need to rewrite the
whole table or partition.
How does Uber solve its petabyte data
challenge on a data lake?
Apache Hudi: Improved efficiency
Context
❏ Uber in hypergrowth
❏ Moving from warehouse to lake
❏ HDFS/Cloud storage is immutable
Solutions
❏ Efficient ingestion: support of mutability,
row-level updates & deletes
❏ Efficient reading/writing performance:
support for MOR tables, indexes, improved
file layout & timeline
❏ Concurrency control & ACID guarantees
