Next Gen Data Modeling in the Open Data Platform With Doron Porat and Liran Yogev | Current 2022
At Yotpo, we have a rich and busy data lake consisting of thousands of data sets ingested and digested by different engines, the main one being Spark.
We built our data infrastructure to enable our users to produce and consume data via self-service tooling, giving them the utmost freedom.
This freedom came with a cost.
We struggled with poor standardization, little data reusability, a lack of data lineage, and flaky data sets.
We also watched the landscape our platform was built on change dramatically, and so did our analytics needs and expectations.
We came to the understanding that the modeling layer should be decoupled from the execution layer in order to shed the limitations we were bound by:
Batch and stream should be no more than attributes as part of a wider abstraction
A Kafka topic and a data lake table are no different and should be treated the same way
Observability of our data pipelines should have the same quality and depth across all execution engines, storage methods, and formats
Governance should be an implicit part of our ecosystem to serve as a basis for both exploration and automation/anomaly detection
That's when we started building YODA (soon to be open-sourced), which gives us a killer dev experience with the level of abstraction we always dreamed of.
Combining DBT, Databricks, lakeFS, and a multitude of streaming engines - we started seeing our vision come to life.
In this talk, we'll share our journey of redesigning the data lake and how to best address organizational needs without giving up on high-end tooling and technology. We are taking this to the next level.
3. Doron Porat, Data infra group leader
Liran Yogev, Director of engineering
Ex-coworkers who LOVE data and still share a successful Israeli podcast about Data Engineering.
6. The open data platform - Main principles
Adaptable: flexible just enough to cope with technological and procedural changes
Interoperable and interchangeable: parts of the platform can be replaced over time by different/similar solutions
Scalable: built for big data, for many consumers, for many producers
Clear purpose: solves a specific problem in the data platform
9. “Data Transform V1”
• Spark-based
• SQL-oriented
• Many supported inputs
• Many supported writers
• Data unit-testing
• DQ (data quality) checks
Hundreds of data pipelines were built with this tool by generalist developers.
11. We need to be better governors.
Enablement is not enough.
Assume nothing.
Lost metadata cannot be recovered.
Coupling is dangerous.
12. V2 is all about “Governance Driven Development”.
13. "DATA TRANSFORM V2"
Our Key Objectives
Developer Experience Abstraction Data As A Product
Simple, Reusable and
Testable
Orchestration, Quality,
Consistency and Ownership
Consumer Awareness,
Documentation and Observability
15. Our quest for finding an open-source solution was a short one!
18. DBT - What does work for us?
Modeling-compute decoupling: DBT runs on multiple environments and technologies using different adapters (see the sketch below)
The community is the best: DBT is embraced by many organizations and data tooling providers
Improves the survivability factor: DBT encourages metadata collection during the development process (GDD)
Can adapt beautifully to our needs: open source, highly extensible, and already contains many of our requirements
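To make the modeling/compute decoupling concrete, here is a minimal, hypothetical dbt model (model, source, and column names are invented for illustration, and the source would need to be declared in a sources YAML file). The same SQL-plus-Jinja definition is compiled and executed by whichever adapter the target environment uses:

-- models/daily_orders.sql (hypothetical example)
{{ config(materialized='table') }}

SELECT
    order_date,
    COUNT(*)   AS orders,
    SUM(total) AS revenue
FROM {{ source('shop', 'orders') }}
GROUP BY order_date

Switching the profile/target (for example, from a Spark adapter to a warehouse adapter) changes where this runs, not how it is modeled.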
19. DBT is not a perfect fit - Looking back at V2's key objectives
Dev experience: CLI manual development, mono-repo, dev testing
Abstraction: single adapter, not Spark-optimized, built for batch, no orchestration
28. Real-time analytics in DBT
Real-time analytics databases: work well with the DBT architecture; ClickHouse implementation (link); Rockset implementation (link)
Vs.
Streaming engines: require an extensive SQL interface; most testing cannot run in-process; difficult to convert batch to stream
29. Streaming engines
• Materialize is the first streaming engine in DBT (here)
• Materialize is a streaming SQL database
• Based on incrementally-updated materialized views (see the sketch below)
• Extensive ANSI SQL support
• DBT support includes modeling, documenting, running and testing
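As a rough sketch of what a Materialize-backed model can look like (not taken from the talk; model and column names are invented, and the custom materialization name depends on the dbt-materialize adapter version), the model is declared once and Materialize keeps the result incrementally up to date:

-- models/customer_totals.sql (hypothetical dbt-materialize model)
{{ config(materialized='materializedview') }}

SELECT
    customer_id,
    SUM(total) AS lifetime_total
FROM {{ ref('orders') }}
GROUP BY customer_id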
31. So, Flink and DBT?
• Flink has a SQL interface
• Flink can connect to our metastore
• Flink can store Kafka topic references in our metastore (see the sketch below)
• SQL jobs can be deployed remotely via simple Python code
• Supports both batch and stream
• Can write directly to the data lake
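A minimal Flink SQL sketch of the idea (table names, topic, and watermark are assumptions; the connector properties mirror the model config on the next slide): the Kafka topic is registered as a table in the shared metastore, and a continuous INSERT populates a downstream table.

-- Hypothetical Flink SQL: register a Kafka topic as a catalog table
CREATE TABLE transactions (
    id               STRING,
    total            DECIMAL(10, 2),
    currency_code    STRING,
    transaction_time TIMESTAMP(3),
    WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'transactions',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);

-- Continuous query into another (assumed, pre-registered) catalog table,
-- e.g. a data lake table; the same statement also runs in batch mode.
INSERT INTO transactions_sink
SELECT id, total, currency_code, transaction_time
FROM transactions;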
33. Flink in DBT (concept) - model

transactions_with_rates.yml

version: 2
models:
  - name: transactions_with_rates
    description: transactions joined with rates
    config:
      meta:
        external:
          location: 'transactions_with_rates'
          connector: 'kafka'
          properties.bootstrap.servers: 'localhost:9092'
          format: 'json'
    columns:
      - name: id
        description: 'Transaction id'
        data_type: STRING
        tests:
          - unique
          - not_null
      - name: total_eur
        data_type: DECIMAL(10,2)
        description: 'Total in euro'
      - name: total
        description: 'Total'
        data_type: DECIMAL(10,2)
      - name: currency_code
        description: 'Currency code'
        data_type: STRING
      - name: transaction_time
        description: 'Transaction time'
        data_type: TIMESTAMP(3)

transactions_with_rates.sql

SELECT
    t.id,
    t.total * c.eur_rate AS total_eur,
    t.total,
    c.currency_code,
    t.transaction_time
FROM {{ source('kafka_tables', 'transactions') }} t
JOIN {{ source('kafka_tables', 'currency_rates') }} FOR SYSTEM_TIME AS OF t.transaction_time AS c
    ON t.currency_code = c.currency_code;
34. Flink in DBT (concept) - tests
Materialize's implementation is great!

transactions_with_rates.yml

  - name: transactions_with_rates
    description: transactions joined with rates
    columns:
      - name: id
        data_type: STRING
        description: "ID"
        tests:
          - unique:
              config:
                store_failures: true

Compiled test query:

INSERT INTO test_unique_transactions_with_rates_id
SELECT id, COUNT(1) AS count_id
FROM transactions_with_rates
GROUP BY id
HAVING count_id > 1
35. So, we have streaming figured out! (theoretically)
36. Challenges
• Generalization has a price
• Not everything can be abstracted this way
• We lose expertise across the organization
• Heavy dependency on DataOps
• Requires LOTS of building
37. Data Modeling is the last survivor
("The early demise of Apache Flink", DALL-E, 2022)
• Not all tech is here to stay
• Our users can't keep up
• Puts governance first
• Business logic is the true organizational asset