[Live] Incremental data processing with Hudi & Spark + dbt
December 06, 2023
Shiyan Xu
Apache Hudi PMC
Speaker Bio
Shiyan Xu
❏ PMC member @ Apache Hudi
❏ Open Source Engineer @ Onehouse
❏ ex Tech Lead Manager @ Zendesk
in/xushiyan
@rshiyanxu
blog.datumagic.com
The medallion architecture
Medallion Architecture Overview
So, what does it take to build a medallion architecture?
Challenges in the Medallion Architecture
But … what if you can simplify the medallion architecture?
Simplified architecture with Apache Hudi
Apache Hudi Overview
● Open Formats
● CDC Incremental Change Feed
● Transactions + Concurrency
● Managed Perf Tuning
● +++ More
● Auto Catalog Sync
● Merge-On-Read
● Stream Writers
Catalogs: AWS Glue Data Catalog, Metastore, BigQuery, + Many More
Lakehouse Platform: Apache Kafka → Raw → Cleaned → Derived
Incremental processing with Spark + dbt
dbt overview
Apache Kafka → Raw → Cleaned → Derived
Lakehouse storage
Extract & Load, Transform
dbt (data build tool)
● handles the T in ELT
● compiles and runs SQL with engines like Spark
Read more: What, exactly, is dbt?
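To make "compiles and runs SQL" concrete, here is a minimal Python sketch of what the compile step produces for a templated model. It is a toy stand-in, not dbt's implementation: dbt renders models with Jinja, while the hypothetical `compile_model` helper below uses regular expressions and handles only `ref`, `this`, and `is_incremental()`. The template text mirrors the profiles model from the case study later in this deck, and the schema name is taken from that case study.

```python
import re


def compile_model(sql: str, schema: str, model: str, incremental: bool) -> str:
    """Toy stand-in for dbt's compile step (hypothetical helper, regex-based)."""

    def keep_or_drop(match: re.Match) -> str:
        # Keep the guarded SQL only when the run is incremental.
        return match.group(1) if incremental else ""

    # Resolve `{% if is_incremental() %} ... {% endif %}` blocks.
    sql = re.sub(
        r"\{%\s*if is_incremental\(\)\s*%\}(.*?)\{%\s*endif\s*%\}",
        keep_or_drop,
        sql,
        flags=re.DOTALL,
    )
    # Resolve ref('name') to a schema-qualified table name.
    sql = re.sub(
        r"\{\{\s*ref\('(\w+)'\)\s*\}\}",
        lambda m: f"{schema}.{m.group(1)}",
        sql,
    )
    # Resolve {{ this }} to the model's own table.
    sql = re.sub(r"\{\{\s*this\s*\}\}", f"{schema}.{model}", sql)
    # Collapse the whitespace left behind by removed template tags.
    return " ".join(sql.split())


MODEL = """
select user_id, city, updated_at from {{ ref('raw_updates') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
"""

first_run = compile_model(MODEL, "dbt_example_cdc", "profiles", incremental=False)
later_run = compile_model(MODEL, "dbt_example_cdc", "profiles", incremental=True)
print(first_run)
print(later_run)
```

The first run compiles to a plain select over the upstream table; later runs add the filter against the model's own table, so the engine only scans new rows.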
dbt project structure
tells dbt the project context
lets dbt know how to build a specific data set
defines transformations between data sets
defines data set schemas
contains compiled and runtime SQL
dbt case study: update user profiles
Profile update events → Raw updates → Profiles → Profile changes → Downstream jobs
dbt case study: update user profiles
-- raw_updates.sql
{{
    config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='insert_overwrite'
    )
}}

-- Hard-coded rows standing in for profile update events;
-- each row is stamped with the time of the run.
with source_data as (
    select '101' as user_id, 'A' as city, unix_timestamp() as updated_at
    union all
    select '102' as user_id, 'B' as city, unix_timestamp() as updated_at
    union all
    select '103' as user_id, 'C' as city, unix_timestamp() as updated_at
)

select *
from source_data
select user_id, city, updated_at from raw_updates
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
|    101|   A|1701083620|
|    103|   C|1701083620|
|    102|   B|1701083620|
|    101|   D|1701084137|
|    102|   E|1701084365|
|    103|   F|1701084369|
+-------+----+----------+
dbt case study: update user profiles
-- profiles.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        merge_update_columns=['city', 'updated_at'],
        unique_key='user_id',
        file_format='hudi',
        options={
            'type': 'cow',
            'primaryKey': 'user_id',
            'preCombineField': 'updated_at',
            'hoodie.table.cdc.enabled': 'true'
        }
    )
}}

with new_updates as (
    select user_id, city, updated_at
    from {{ ref('raw_updates') }}
    {% if is_incremental() %}
    -- On incremental runs, read only rows newer than the latest row already in this table.
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}
)

select user_id, city, updated_at
from new_updates
select user_id, city, updated_at from profiles
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
|    101|   D|1701084137|
|    102|   E|1701084365|
|    103|   F|1701084369|
+-------+----+----------+
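The merge behaviour of the profiles model can be traced by hand with a small Python sketch. It is an illustration under assumptions, not Hudi's or dbt's implementation: `incremental_merge` is a hypothetical helper, the target table is modelled as a dict keyed by `user_id`, and the rule "the row with the larger `updated_at` wins" is an assumed reading of the `preCombineField` setting. The input rows are the ones shown in the raw_updates query result.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (user_id, city, updated_at)


def incremental_merge(target: Dict[str, Row], source: List[Row]) -> Dict[str, Row]:
    """Apply one incremental run: filter new rows, then upsert them by user_id."""
    # Mirrors `where updated_at > (select max(updated_at) from {{ this }})`.
    # On the first run the target is empty, so every source row qualifies.
    high_water_mark = max((row[2] for row in target.values()), default=None)
    new_rows = [
        row for row in source
        if high_water_mark is None or row[2] > high_water_mark
    ]

    merged = dict(target)
    for user_id, city, updated_at in new_rows:
        current = merged.get(user_id)
        # Assumed ordering rule: keep the row with the larger updated_at per key.
        if current is None or updated_at >= current[2]:
            merged[user_id] = (user_id, city, updated_at)
    return merged


# Rows visible in raw_updates after the first run and after the later updates.
run1 = [("101", "A", 1701083620), ("103", "C", 1701083620), ("102", "B", 1701083620)]
run2 = run1 + [
    ("101", "D", 1701084137),
    ("102", "E", 1701084365),
    ("103", "F", 1701084369),
]

profiles = incremental_merge({}, run1)
profiles = incremental_merge(profiles, run2)
for row in sorted(profiles.values()):
    print(row)
```

Running the same input again leaves the table unchanged, because no source row is newer than the stored maximum `updated_at`.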
dbt case study: update user profiles
-- profile_changes.sql
{{
    config(
        materialized='incremental',
        file_format='hudi'
    )
}}

with new_changes as (
    select
        GET_JSON_OBJECT(after, '$.user_id') AS user_id,
        GET_JSON_OBJECT(after, '$.city') AS new_city,
        ts_ms as process_ts
    -- Read CDC records of the profiles table, starting from 24 hours ago.
    from hudi_table_changes(
        'dbt_example_cdc.profiles',
        'cdc',
        from_unixtime(unix_timestamp() - 3600 * 24, 'yyyyMMddHHmmss')
    )
    {% if is_incremental() %}
    -- On incremental runs, skip changes that were already processed.
    where ts_ms > (select max(process_ts) from {{ this }})
    {% endif %}
)

select user_id, new_city, process_ts
from new_changes
select user_id, new_city from profile_changes
+-------+--------+
|user_id|new_city|
+-------+--------+
|    102|       E|
|    103|       F|
|    101|       D|
+-------+--------+
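The profile_changes model does two things: it computes a start time 24 hours back in `yyyyMMddHHmmss` form, and it pulls fields out of the JSON `after` image of each change record. A minimal Python sketch of both steps follows. The record layout (a dict with `ts_ms` and a JSON string in `after`), the `op` values, and the `ts_ms` numbers are illustrative assumptions; `cdc_start_time` and `extract_changes` are hypothetical helpers, not Hudi APIs.

```python
import json
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def cdc_start_time(now: datetime) -> str:
    """Format 'now minus 24 hours' as yyyyMMddHHmmss, like the model's from_unixtime call."""
    return (now - timedelta(hours=24)).strftime("%Y%m%d%H%M%S")


def extract_changes(cdc_records: List[Dict], last_process_ts: Optional[int]) -> List[Dict]:
    """Turn change records into (user_id, new_city, process_ts) rows."""
    changes = []
    for record in cdc_records:
        # Mirrors `where ts_ms > (select max(process_ts) from {{ this }})`;
        # last_process_ts is None on the first (non-incremental) run.
        if last_process_ts is not None and record["ts_ms"] <= last_process_ts:
            continue
        # A missing `after` image yields null fields, as GET_JSON_OBJECT would.
        after = json.loads(record["after"]) if record["after"] else {}
        changes.append({
            "user_id": after.get("user_id"),
            "new_city": after.get("city"),
            "process_ts": record["ts_ms"],
        })
    return changes


# Illustrative change records; the ts_ms and op values are made up for this sketch.
cdc_records = [
    {"op": "u", "ts_ms": 1001, "after": json.dumps({"user_id": "102", "city": "E"})},
    {"op": "u", "ts_ms": 1002, "after": json.dumps({"user_id": "103", "city": "F"})},
    {"op": "u", "ts_ms": 1003, "after": json.dumps({"user_id": "101", "city": "D"})},
]

first = extract_changes(cdc_records, last_process_ts=None)
second = extract_changes(cdc_records, max(c["process_ts"] for c in first))
print(cdc_start_time(datetime(2023, 12, 6, 12, 0, 0)))
print([(c["user_id"], c["new_city"]) for c in first])
print(second)
```

The second call returns nothing because every record is at or below the stored maximum `process_ts`, which is the property that keeps repeated incremental runs from duplicating changes.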
dbt docs UI
dbt x Hudi recap
● dbt supports incremental & merge semantics
● Hudi CDC feature supports rich data capabilities and fits the incremental model
● Efficiency & cost-saving
● Sample code @ https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-dbt
Come Build With The Community!
Check out the Hudi docs 🔖
Give us a star on GitHub ⭐
Join Hudi Slack 👥
Follow us on LinkedIn!
Join our Twitter Community!
Subscribe to our mailing list (send an empty email to subscribe) 📩
Subscribe to the Apache Hudi YouTube channel
Thanks!
Questions?
Join Hudi Slack
in/xushiyan
@rshiyanxu
blog.datumagic.com
