Next Gen Data Modeling in the Open Data Platform With Doron Porat and Liran Yogev | Current 2022
At Yotpo, we have a rich and busy data lake consisting of thousands of data sets ingested and digested by different engines, the main one being Spark.
We built our data infrastructure to enable our users to produce and consume data via self-service tooling, giving them the utmost freedom.
This freedom came with a cost.
We struggled with poor standardization, little data reusability, a lack of data lineage, and flaky data sets.
We also watched the landscape our platform was built on change dramatically, and so did our analytics needs and expectations.
We came to the understanding that the modeling layer should be decoupled from the execution layer in order to shed the limitations we were bound by:
Batch and stream should be no more than attributes as part of a wider abstraction
A Kafka topic and a data lake table are no different and should be treated the same way
Observability of our data pipelines should have the same quality and depth across all execution engines, storage methods, and formats
Governance should be an implicit part of our ecosystem to serve as a basis for both exploration and automation/anomaly detection
That's when we started building YODA (soon to be open-sourced), which gives us a killer dev experience with the level of abstraction we always dreamed of.
Combining DBT, Databricks, lakeFS, and a multitude of streaming engines - we started seeing our vision come to life.
In this talk, we'll share our journey of redesigning the data lake and how to best address organizational needs without giving up on high-end tooling and technology. We are taking this to the next level.
3. Doron Porat, Data infra group leader
Liran Yogev, Director of engineering
Ex-coworkers who LOVE data and still share a successful Israeli podcast about Data Engineering.
6. The open data platform - Main principles
Adaptable: flexible just enough to cope with technological and procedural changes
Interoperable and interchangeable: parts of the platform can be replaced over time by different/similar solutions
Scalable: built for big data, for many consumers, for many producers
Clear purpose: solves a specific problem in the data platform
9. “Data Transform V1”
• Spark-based
• SQL-oriented
• Many supported inputs
• Many supported writers
• Data unit-testing
• DQ (data quality) checks
Hundreds of data pipelines were built with this tool by generalist developers.
11. We need to be better governors.
Enablement is not enough.
Assume nothing.
Lost metadata cannot be recovered.
Coupling is dangerous.
12. V2 is all about “Governance Driven Development”.
13. "DATA TRANSFORM V2"
Our Key Objectives
Developer Experience Abstraction Data As A Product
Simple, Reusable and
Testable
Orchestration, Quality,
Consistency and Ownership
Consumer Awareness,
Documentation and Observability
15. Our quest for finding an open-source solution was a short one!
18. DBT - What does work for us?
Modeling-compute decoupling: DBT runs on multiple environments and technologies using different adapters (see the sketch below)
The community is the best: DBT is embraced by many organizations and data tooling providers
Improves the survivability factor: DBT encourages metadata collection during the development process (GDD)
Can adapt beautifully to our needs: open source, highly extensible, and already contains many of our requirements
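To make the modeling/compute decoupling concrete, here is a minimal, hypothetical dbt model (model, source, and column names are invented for illustration, and the source would need to be declared in a sources YAML file). The same SQL-plus-Jinja definition is compiled and executed by whichever adapter the target environment uses:

-- models/daily_orders.sql (hypothetical example)
{{ config(materialized='table') }}

SELECT
    order_date,
    COUNT(*)   AS orders,
    SUM(total) AS revenue
FROM {{ source('shop', 'orders') }}
GROUP BY order_date

Switching the profile/target (for example, from a Spark adapter to a warehouse adapter) changes where this runs, not how it is modeled.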
19. DBT is not a perfect fit - Looking back at V2's key objectives
Dev experience: CLI manual development, mono-repo, dev testing
Abstraction: single adapter, not Spark-optimized, built for batch, no orchestration
28. Real-time analytics in DBT
Real-time analytics databases: work well with the DBT architecture; ClickHouse implementation (link); Rockset implementation (link)
Vs.
Streaming engines: require an extensive SQL interface; most testing cannot run in-process; difficult to convert batch to stream
29. Streaming engines
• Materialize is the first streaming engine in DBT (here)
• Materialize is a streaming SQL database
• Based on incrementally-updated materialized views (see the sketch below)
• Extensive ANSI SQL support
• DBT support includes modeling, documenting, running and testing
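As a rough sketch of what a Materialize-backed model can look like (not taken from the talk; model and column names are invented, and the custom materialization name depends on the dbt-materialize adapter version), the model is declared once and Materialize keeps the result incrementally up to date:

-- models/customer_totals.sql (hypothetical dbt-materialize model)
{{ config(materialized='materializedview') }}

SELECT
    customer_id,
    SUM(total) AS lifetime_total
FROM {{ ref('orders') }}
GROUP BY customer_id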
31. So, Flink and DBT?
• Flink has a SQL interface
• Flink can connect to our metastore
• Flink can store Kafka topic references in our metastore (see the sketch below)
• SQL jobs can be deployed remotely via simple Python code
• Supports both batch and stream
• Can write directly to the data lake
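A minimal Flink SQL sketch of the idea (table names, topic, and watermark are assumptions; the connector properties mirror the model config on the next slide): the Kafka topic is registered as a table in the shared metastore, and a continuous INSERT populates a downstream table.

-- Hypothetical Flink SQL: register a Kafka topic as a catalog table
CREATE TABLE transactions (
    id               STRING,
    total            DECIMAL(10, 2),
    currency_code    STRING,
    transaction_time TIMESTAMP(3),
    WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'transactions',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);

-- Continuous query into another (assumed, pre-registered) catalog table,
-- e.g. a data lake table; the same statement also runs in batch mode.
INSERT INTO transactions_sink
SELECT id, total, currency_code, transaction_time
FROM transactions;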
33. Flink in DBT (concept) - model

transactions_with_rates.yml

version: 2
models:
  - name: transactions_with_rates
    description: transactions joined with rates
    config:
      meta:
        external:
          location: 'transactions_with_rates'
          connector: 'kafka'
          properties.bootstrap.servers: 'localhost:9092'
          format: 'json'
    columns:
      - name: id
        description: 'Transaction id'
        data_type: STRING
        tests:
          - unique
          - not_null
      - name: total_eur
        data_type: DECIMAL(10,2)
        description: 'Total in euro'
      - name: total
        description: 'Total'
        data_type: DECIMAL(10,2)
      - name: currency_code
        description: 'Currency code'
        data_type: STRING
      - name: transaction_time
        description: 'Transaction time'
        data_type: TIMESTAMP(3)

transactions_with_rates.sql

SELECT
    t.id,
    t.total * c.eur_rate AS total_eur,
    t.total,
    c.currency_code,
    t.transaction_time
FROM {{ source('kafka_tables', 'transactions') }} t
JOIN {{ source('kafka_tables', 'currency_rates') }} FOR SYSTEM_TIME AS OF t.transaction_time AS c
    ON t.currency_code = c.currency_code;
34. Flink in DBT (concept) - tests
Materialize's implementation is great!

transactions_with_rates.yml

  - name: transactions_with_rates
    description: transactions joined with rates
    columns:
      - name: id
        data_type: STRING
        description: "ID"
        tests:
          - unique:
              config:
                store_failures: true

Compiled test query:

INSERT INTO test_unique_transactions_with_rates_id
SELECT id, COUNT(1) AS count_id
FROM transactions_with_rates
GROUP BY id
HAVING count_id > 1
35. So, we have streaming figured out! (theoretically)
36. Challenges
• Generalization has a price
• Not everything can be abstracted this way
• We lose expertise across the organization
• Heavy dependency on DataOps
• Requires LOTS of building
37. Data Modeling is the last survivor
("The early demise of Apache Flink", DALL-E, 2022)
• Not all tech is here to stay
• Our users can't keep up
• Puts governance first
• Business logic is the true organizational asset