Functional Data Engineering
- A Blueprint for adopting functional principles in data pipelines
Ananth Packkildurai
Slack, Data Engineer
Zendesk, Principal Data Engineer
Creator, Schemata - Data Contract Platform
Author, Data Engineering Weekly
Key Principles of Functional Data Engineering
1. Reproducibility
2. Re-Computability
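One way to read these two principles, sketched under stated assumptions: a task is a pure function of its input, and writing its output overwrites the target partition instead of appending to it, so re-running the same date gives the same result. The names `transform`, `run_task`, and the in-memory `warehouse` dict are hypothetical stand-ins for illustration and do not come from the talk.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical in-memory stand-in for partitioned storage: {ds: rows}.
warehouse: Dict[str, List[Row]] = {}


def transform(rows: List[Row]) -> List[Row]:
    """Pure function: the output depends only on the input rows."""
    return [{**r, "user_name": str(r["user_name"]).strip().lower()} for r in rows]


def run_task(ds: str, source_rows: List[Row]) -> None:
    """Overwrite (never append to) the partition for `ds`."""
    warehouse[ds] = transform(source_rows)


source = [{"user_id": 1, "user_name": "  Ada "}]
run_task("2022-12-20", source)
first = list(warehouse["2022-12-20"])
run_task("2022-12-20", source)  # re-compute the same partition
```

Because the partition is replaced as a whole, running the task twice leaves one copy of the data, which is what makes backfills safe to repeat.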
The Modern Data Cloud = LakeHouse & Warehouse
State of the Data 2023
Separation of storage and compute
Unlimited scale data repository
ACID transaction and mutation support
Schema Classification
Warehouse
LakeHouse
CREATE TABLE dw.user (
  user_id BIGINT, user_name STRING, created_at DATE
) PARTITIONED BY (ds STRING);
-- ds = date stamp of the snapshot
s3://dw/user/2022-12-20/<all users data at the time of snapshot>
s3://dw/user/2022-12-21/<all users data at the time of snapshot>
DateTime Partition Table Design
Entity Modeling
1. Incremental Snapshot
2. Full Snapshot
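The two snapshot styles can be related with a small sketch. It assumes a full snapshot holds every entity as of `ds`, an incremental snapshot holds only the rows that changed on `ds`, rows are keyed by `user_id`, and there are no deletes. Under those assumptions, the next full snapshot can be rebuilt from the previous full snapshot plus the day's increment. The function name `apply_increment` and the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_increment(previous_full: List[Row], increment: List[Row]) -> List[Row]:
    """Build the next full snapshot; rows in `increment` win on key collision."""
    merged = {r["user_id"]: r for r in previous_full}
    merged.update({r["user_id"]: r for r in increment})
    return sorted(merged.values(), key=lambda r: r["user_id"])


full_dec_20 = [
    {"user_id": 1, "user_name": "ada"},
    {"user_id": 2, "user_name": "bob"},
]
incr_dec_21 = [
    {"user_id": 2, "user_name": "robert"},  # changed row
    {"user_id": 3, "user_name": "cy"},      # new row
]
full_dec_21 = apply_increment(full_dec_20, incr_dec_21)
```

The previous snapshot is left untouched, so each `ds` partition stays an immutable record of that day.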
Entity Modeling
CREATE OR REPLACE VIEW dw.user_latest AS
SELECT
  user_id,
  user_name,
  created_at,
  ds
FROM dw.user
WHERE ds = <current DateTime partition>;
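The view above leaves open how the current partition is resolved. A minimal sketch, assuming `ds` values are ISO date strings and that "current" means the newest partition not later than a given date; the helper `latest_partition` is hypothetical.

```python
from datetime import date
from typing import List, Optional


def latest_partition(partitions: List[str], as_of: date) -> Optional[str]:
    """Return the newest `ds` (ISO date string) that is not after `as_of`."""
    eligible = [ds for ds in partitions if date.fromisoformat(ds) <= as_of]
    # ISO date strings sort lexicographically in chronological order.
    return max(eligible, default=None)


available = ["2022-12-20", "2022-12-21"]
current = latest_partition(available, date(2022, 12, 21))
```

Passing an earlier `as_of` date reads the table as it looked on that day, which is what keeps historical runs reproducible.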
Event Modeling
Key Challenges
1. Late Arriving Data
2. Data Deletion
Apply Window Functions
[Figure: two panels, Tumbling Window and Sliding Window, showing Hour T1, Hour T2, and Hour T3 data feeding the Hour T1, Hour T2, and Hour T3 pipelines.]
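The difference between the two window types can be sketched as window assignment: a tumbling window places each event in exactly one fixed, non-overlapping interval, while a sliding window places it in every overlapping interval that covers it. The function names, the epoch alignment, and the sample sizes below are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # half-open interval [start, end)
EPOCH = datetime(1970, 1, 1)        # assumed alignment origin


def tumbling_window(event_time: datetime, size: timedelta) -> Window:
    """Return the single fixed, non-overlapping window containing the event."""
    start = EPOCH + ((event_time - EPOCH) // size) * size
    return (start, start + size)


def sliding_windows(event_time: datetime, size: timedelta,
                    slide: timedelta) -> List[Window]:
    """Return every window of length `size`, advancing by `slide`, that contains the event."""
    start = EPOCH + ((event_time - EPOCH) // slide) * slide
    windows = []
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event = datetime(2022, 12, 20, 1, 30)
one_hour = timedelta(hours=1)
tumbling = tumbling_window(event, one_hour)
sliding = sliding_windows(event, size=2 * one_hour, slide=one_hour)
```

With a two-hour window sliding every hour, the 01:30 event is processed by two pipeline runs, which gives data from the previous hour a second chance to be included.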
Apply Watermark
[Figure: Hour T1 data, followed by a window time, after which the Hour T1 pipeline starts.]
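A simplified sketch of the idea: the pipeline for an hour starts only after a waiting period has elapsed past the end of that hour, and anything arriving later counts as late. This uses arrival (processing) time for clarity, whereas stream processors typically derive watermarks from observed event times. The function names and the 15-minute wait are assumptions.

```python
from datetime import datetime, timedelta


def can_start(window_end: datetime, now: datetime, wait: timedelta) -> bool:
    """The pipeline for a window may start once `wait` has elapsed after it closes."""
    return now >= window_end + wait


def is_late(event_time: datetime, arrival_time: datetime,
            window_end: datetime, wait: timedelta) -> bool:
    """An event is late if it belongs to the window but arrives after the wait expires."""
    return event_time < window_end and arrival_time >= window_end + wait


t1_end = datetime(2022, 12, 20, 2, 0)  # Hour T1 covers 01:00 to 02:00
wait = timedelta(minutes=15)           # assumed waiting period
event_time = datetime(2022, 12, 20, 1, 59)
```

A longer wait captures more late data at the cost of higher latency for every run.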
Adopt Reconciliation
[Figure: the Hour T1, Hour T2, and Hour T3 pipelines, followed by a Reconciliation pipeline.]
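A minimal sketch of a reconciliation pass, assuming the hourly pipelines publish a per-hour event count and a later run recomputes those counts from everything that has arrived since, overwriting only the hours that changed. The names `count_by_hour` and `reconcile`, the hour labels, and the sample events are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (event_hour, payload); hour labels are illustrative


def count_by_hour(events: List[Event]) -> Dict[str, int]:
    """Aggregate events into a count per event hour."""
    return dict(Counter(hour for hour, _ in events))


def reconcile(published: Dict[str, int], all_events: List[Event]) -> Dict[str, int]:
    """Recompute from all arrived data; return only the hours whose published result is stale."""
    recomputed = count_by_hour(all_events)
    return {h: n for h, n in recomputed.items() if published.get(h) != n}


on_time = [("T1", "a"), ("T1", "b"), ("T2", "c")]
published = count_by_hour(on_time)  # what the hourly pipelines produced
late = [("T1", "d")]                # arrives after the Hour T1 pipeline ran
corrections = reconcile(published, on_time + late)
published.update(corrections)       # overwrite only the affected partitions
```

Only Hour T1 is rewritten here; Hour T2 is left alone because its recomputed value matches what was published.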
Choose your Confidence Window of Correctness
Data Deletion
1. Reprocessing
2. Deletion Audit Log
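The slide lists the two items separately; one possible way to combine them, sketched under assumptions, is to append every deletion request to an audit log and have each reprocessing run filter its input against that log, so recomputing an old partition does not bring back deleted records. The names `record_deletion` and `reprocess_partition` and the log's fields are hypothetical.

```python
from typing import Dict, List, Set

Row = Dict[str, object]

# Hypothetical append-only log of deletion requests.
deletion_audit_log: List[Dict[str, object]] = []


def record_deletion(user_id: int, requested_on: str) -> None:
    """Append a deletion request; entries are never updated or removed."""
    deletion_audit_log.append({"user_id": user_id, "requested_on": requested_on})


def reprocess_partition(rows: List[Row]) -> List[Row]:
    """Recompute a partition, dropping rows for users listed in the audit log."""
    deleted: Set[object] = {entry["user_id"] for entry in deletion_audit_log}
    return [r for r in rows if r["user_id"] not in deleted]


partition = [
    {"user_id": 1, "user_name": "ada"},
    {"user_id": 2, "user_name": "bob"},
]
record_deletion(2, "2022-12-21")
cleaned = reprocess_partition(partition)
```

The log also serves as evidence of when each deletion was requested and lets any later backfill apply the same deletions again.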
https://schemata.app
https://www.linkedin.com/in/ananthdurai
ananth@dataengineeringweekly.com
