11. dbt overview
[Diagram: data flows from Apache Kafka into lakehouse storage (Raw → Cleaned → Derived); Extract & Load feed the raw layer, Transform produces the cleaned and derived layers]
dbt (data build tool)
● handles the T in ELT
● compiles and runs SQL with engines like Spark
Read more: What, exactly, is dbt?
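To make the compile-and-run idea concrete: a dbt model is just a SQL file, and references to other models go through `ref()`, which dbt resolves into fully-qualified table names at compile time before submitting the SQL to the engine. A minimal sketch (the model and column names here are illustrative, not from the talk):

```sql
-- models/cleaned_events.sql (hypothetical model)
-- dbt compiles {{ ref('raw_events') }} into the actual table name
-- and runs the resulting SQL on the configured engine, e.g. Spark.
select
  event_id,
  lower(event_type) as event_type,
  cast(ts as timestamp) as event_ts
from {{ ref('raw_events') }}
where event_id is not null
```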
12. dbt project structure
● dbt_project.yml — tells dbt the project context
● model config blocks — let dbt know how to build a specific data set
● model SQL files — define transformations between data sets
● schema.yml — defines data set schemas
● target/ — contains compiled/runtime SQLs
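The pieces above map onto a directory tree roughly like this (a minimal sketch; the model file names are taken from the case study that follows, the project name is illustrative):

```
my_dbt_project/
├── dbt_project.yml       # project context: name, profile, model paths
├── models/
│   ├── schema.yml        # data set schemas, tests, documentation
│   ├── raw_updates.sql   # one transformation per data set
│   ├── profiles.sql
│   └── profile_changes.sql
└── target/               # compiled/runtime SQL written by dbt
```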
13. dbt case study: update user profiles
[Pipeline: Profile update events → Raw updates → Profiles → Profile changes → Downstream jobs]
14. dbt case study: update user profiles
-- raw_updates.sql
{{
config(
materialized='incremental',
file_format='hudi',
incremental_strategy='insert_overwrite'
)
}}
with source_data as (
  select '101' as user_id, 'A' as city, unix_timestamp() as updated_at
  union all
  select '102' as user_id, 'B' as city, unix_timestamp() as updated_at
  union all
  select '103' as user_id, 'C' as city, unix_timestamp() as updated_at
)
select *
from source_data
select user_id, city, updated_at from raw_updates
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
| 101| A|1701083620|
| 103| C|1701083620|
| 102| B|1701083620|
| 101| D|1701084137|
| 102| E|1701084365|
| 103| F|1701084369|
+-------+----+----------+
15. dbt case study: update user profiles
-- profiles.sql
{{
config(
materialized='incremental',
incremental_strategy='merge',
merge_update_columns = ['city', 'updated_at'],
unique_key='user_id',
file_format='hudi',
options={
'type': 'cow',
'primaryKey': 'user_id',
'preCombineField': 'updated_at',
'hoodie.table.cdc.enabled': 'true'
}
)
}}
with new_updates as (
select user_id, city, updated_at from {{ ref('raw_updates') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
)
select user_id, city, updated_at from new_updates
select user_id, city, updated_at from profiles
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
| 101| D|1701084137|
| 102| E|1701084365|
| 103| F|1701084369|
+-------+----+----------+
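For intuition on the merge semantics: with `incremental_strategy='merge'`, dbt compiles the model above into (roughly) a MERGE statement against the existing table, using `unique_key` as the join key and `merge_update_columns` as the update set. A simplified sketch of the generated SQL; the Spark adapter's actual output differs in naming and detail:

```sql
-- Hypothetical simplification of what dbt generates for profiles.sql
merge into profiles as dest
using (
  select user_id, city, updated_at from raw_updates
  where updated_at > (select max(updated_at) from profiles)
) as src
on dest.user_id = src.user_id           -- unique_key
when matched then update set            -- merge_update_columns
  city = src.city,
  updated_at = src.updated_at
when not matched then insert *
```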
16. dbt case study: update user profiles
-- profile_changes.sql
{{
config(
materialized='incremental',
file_format='hudi'
)
}}
with new_changes as (
select
GET_JSON_OBJECT(after, '$.user_id') AS user_id,
GET_JSON_OBJECT(after, '$.city') AS new_city,
ts_ms as process_ts
from hudi_table_changes('dbt_example_cdc.profiles', 'cdc',
from_unixtime(unix_timestamp() - 3600 * 24, 'yyyyMMddHHmmss'))
{% if is_incremental() %}
where ts_ms > (select max(process_ts) from {{ this }})
{% endif %}
)
select user_id, new_city, process_ts
from new_changes
select user_id, new_city from profile_changes
+-------+--------+
|user_id|new_city|
+-------+--------+
| 102| E|
| 103| F|
| 101| D|
+-------+--------+
18. dbt x Hudi recap
● dbt supports incremental & merge semantics
● Hudi's CDC feature captures row-level changes and fits naturally with the incremental model
● Efficiency & cost savings
● Sample code @ https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-dbt
19. Come Build With The Community!
Check out the Hudi docs 🔖
Give us a star on GitHub ⭐
Join the Hudi Slack 👥
Follow us on LinkedIn!
Join our Twitter Community!
Subscribe to our mailing list (send an empty email to subscribe) 📩
Subscribe to the Apache Hudi YouTube channel