SlideShare a Scribd company logo
1 of 38
Download to read offline
‹#›
Next gen data modeling in the open
data platform
Doron Porat, Liran Yogev
Current, 2022
Let’s talk about
mistakes
Data infra group leader
Doron Porat
Director of engineering
Liran Yogev
Ex-coworkers who LOVE data and still
share a successful Israeli podcast about
Data Engineering.
Who sent us here
The open data platform
Flexible just enough to cope with
technological and procedural changes
Adaptable
Parts of the platform can be replaced
over time by different/similar solutions
Interoperable and interchangeable
Built for big data, for many consumers, for
many producers
Scalable
Solves a specific problem in the data
platform
Clear purpose
The open data platform
Main principles
Yotpo’s open data platform
Data generation in the open data
platform
“Data Transform V1”
• Spark based
• Sql oriented
• Many supported inputs
• Many supported writers
• Data unit-testing
• Dq checks
Hundreds of data pipelines were built with this tool,
by generalist developers.
The reality of V1
We need to be
better governors.
Enablement is not enough.
Assume nothing.
Lost metadata cannot be recovered.
Coupling is dangerous.
V2 is all about
“Governance Driven Development”.
"DATA TRANSFORM V2"
Our Key Objectives
Developer Experience Abstraction Data As A Product
Simple, Reusable and
Testable
Orchestration, Quality,
Consistency and Ownership
Consumer Awareness,
Documentation and Observability
Our quest for finding an open-source
solution was a short one!
DBT terminologies:
• Sources
• Models
• Macros
• Exposures
• Metrics
DBT runs on multiple environments and
technologies using different adapters
Modeling - Compute decoupling
DBT embraced by many organizations and
data tooling providers
The community is the best
DBT encourages metadata collection during
the development process (GDD)
Improves the survivability factor
Open source, highly extensible and already
contains many of our requirements
Can adapt beautifully to our needs
DBT
What does work for us?
DBT if not a perfect fit
Looking back at V2's key objectives
CLI manual development
Mono-repo
Dev testing
Dev experience
Single adapter
Not Spark optimized
Built for batch
No orchestration
Abstraction
So how can we fix it?
Demo time
‹#›
Data CI/CD
Orchestration
But what about real-time workloads?
Works well with DBT architecture
Clickhouse implementation (link)
Rockset implementation (link)
Real-time analytics databases
Requires extensive SQL interface
Most testing cannot run in-process
Difficult to convert batch to stream
Streaming engines
Vs.
Real-time analytics in DBT
Streaming engines
• Materialize is the first streaming engine in DBT (here)
• Materialize is a streaming SQL database
• Based on incrementally-updated materialized views
• Extensive ANSI SQL support
• DBT support includes modeling, documenting, running and
testing
This is super cool!
But we use Flink…
So, Flink and DBT?
• Flink has an SQL interface
• Flink can connect to our metastore
• Flink can store kafka topics references in our metastore
• SQL jobs can be deployed remotely via simple python
code
• Supports both batch and stream
• Can write directly to the data lake
- name: transactions
description:
external:
location: transactions
connector: 'kafka'
properties.bootstrap.servers: 'localhost:9092'
key.format: 'raw'
value.format: 'json'
key.fields: 'id'
value.fields-include: 'ALL'
watermark:
rowtime_column_name: transaction_time
watermark_strategy_expression: transaction_time - INTERVAL '30’ SECONDS
columns:
- name: id
data_type: STRING
description: "ID"
- name: currency_code
data_type: STRING
description: "Currency Code"
- name: total
data_type: DECIMAL(10,2)
description: "Total amount spent"
- name: transaction_time
data_type: TIMESTAMP(3)
description: "Time of transaction"
Kafka topic #1
Flink in DBT (concept) - sources
version: 2
models:
- name: transactions_with_rate
description: transactions joined with rate
config:
meta:
external:
location: 'transactions_with_rates’
connector: 'kafka’
properties.bootstrap.servers: 'localhost:9092’
format: 'json'
columns:
- name: id
description: 'Transaction id'
data_type: STRING
tests:
- unique
- not_null
- name: total_eur
data_type: DECIMAL(10,2)
description: 'Total in euro'
- name: total
description: 'Total'
data_type: DECIMAL(10,2)
- name: currency_code
description: 'Currency code'
data_type: STRING
- name: transaction_time
description: 'Transaction time'
data_type: TIMESTAMP(3)
SELECT
t.id,
t.total * c.eur_rate AS total_eur,
t.total,
c.currency_code,
t.transaction_time
FROM {{ source(kafka_tables, transactions) }} t
JOIN {{ source(kafka_tables, currency_rates) }} FOR
SYSTEM_TIME AS OF t.transaction_time AS c
ON t.currency_code = c.currency_code;
Flink in DBT (concept) - model
transactions_with_rates.sql
transactions_with_rates.yml
Materialize’s implementation is great!
Flink in DBT (concept) - tests
- name: transactions_with_rates
description: transactions joined with rates
columns:
- name: id
data_type: STRING
description: "ID"
tests:
- unique
config:
store_failures: true
INSERT INTO test_unique_transactions_with_rates_id
SELECT id, COUNT(1) as count_id
FROM transactions_with_rates
GROUP BY id
HAVING count_id > 0
transactions_with_rates.yml
So, we have streaming figured out!
(theoretically)
Challenges
• Generalization has a price
• Not everything can be abstracted this way
• We lose expertise across the organization
• Heavy dependency on Dataops
• Requires LOTS of building
The early demise of apache flink, Dall-e, 2022
• Not all tech is here to stay
• Our users can’t keep up
• Puts governance first
• Business logic is the true organizational asset
Data Modeling is the
last survivor
‹#›
Questions?

More Related Content

Similar to Next Gen Data Modeling in the Open Data Platform With Doron Porat and Liran Yogev | Current 2022

NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...Maarten Balliauw
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at SchibstedLars Marius Garshol
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaborationJulien Pivotto
 
Azure Digital Twins.pdf
Azure Digital Twins.pdfAzure Digital Twins.pdf
Azure Digital Twins.pdfTomasz Kopacz
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationDenodo
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Innovation in SDN Tools and Platforms
Innovation in SDN Tools and PlatformsInnovation in SDN Tools and Platforms
Innovation in SDN Tools and PlatformsUmesh Krishnaswamy
 
Metarhia: Node.js Macht Frei
Metarhia: Node.js Macht FreiMetarhia: Node.js Macht Frei
Metarhia: Node.js Macht FreiTimur Shemsedinov
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Speeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachSpeeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachDatabricks
 
IoT Story: From Edge to HDP
IoT Story: From Edge to HDPIoT Story: From Edge to HDP
IoT Story: From Edge to HDPDataWorks Summit
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to MicroservicesAd van der Veer
 

Similar to Next Gen Data Modeling in the Open Data Platform With Doron Porat and Liran Yogev | Current 2022 (20)

NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at Schibsted
 
VoltDB on SolftLayer Cloud
VoltDB on SolftLayer CloudVoltDB on SolftLayer Cloud
VoltDB on SolftLayer Cloud
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
 
Azure Digital Twins.pdf
Azure Digital Twins.pdfAzure Digital Twins.pdf
Azure Digital Twins.pdf
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
Neethu_Abraham
Neethu_AbrahamNeethu_Abraham
Neethu_Abraham
 
Zakir_Hussain_cv
Zakir_Hussain_cvZakir_Hussain_cv
Zakir_Hussain_cv
 
Ramesh_resume
Ramesh_resumeRamesh_resume
Ramesh_resume
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Innovation in SDN Tools and Platforms
Innovation in SDN Tools and PlatformsInnovation in SDN Tools and Platforms
Innovation in SDN Tools and Platforms
 
Metarhia: Node.js Macht Frei
Metarhia: Node.js Macht FreiMetarhia: Node.js Macht Frei
Metarhia: Node.js Macht Frei
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Speeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachSpeeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT Approach
 
IoT Story: From Edge to HDP
IoT Story: From Edge to HDPIoT Story: From Edge to HDP
IoT Story: From Edge to HDP
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
 

More from HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonHostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolHostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesHostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaHostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonHostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonHostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyHostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersHostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformHostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubHostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonHostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLHostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceHostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondHostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsHostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemHostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksHostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Recently uploaded

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Recently uploaded (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Next Gen Data Modeling in the Open Data Platform With Doron Porat and Liran Yogev | Current 2022

  • 1. ‹#› Next gen data modeling in the open data platform Doron Porat, Liran Yogev Current, 2022
  • 3. Data infra group leader Doron Porat Director of engineering Liran Yogev Ex-coworkers who LOVE data and still share a successful Israeli podcast about Data Engineering.
  • 4. Who sent us here
  • 5. The open data platform
  • 6. Flexible just enough to cope with technological and procedural changes Adaptable Parts of the platform can be replaced over time by different/similar solutions Interoperable and interchangeable Built for big data, for many consumers, for many producers Scalable Solves a specific problem in the data platform Clear purpose The open data platform Main principles
  • 8. Data generation in the open data platform
  • 9. “Data Transform V1” • Spark based • Sql oriented • Many supported inputs • Many supported writers • Data unit-testing • Dq checks Hundreds of data pipelines were built with this tool, by generalist developers.
  • 11. We need to be better governors. Enablement is not enough. Assume nothing. Lost metadata cannot be recovered. Coupling is dangerous.
  • 12. V2 is all about “Governance Driven Development”.
  • 13. "DATA TRANSFORM V2" Our Key Objectives Developer Experience Abstraction Data As A Product Simple, Reusable and Testable Orchestration, Quality, Consistency and Ownership Consumer Awareness, Documentation and Observability
  • 14.
  • 15. Our quest for finding an open-source solution was a short one!
  • 16. DBT terminologies: • Sources • Models • Macros • Exposures • Metrics
  • 17.
  • 18. DBT runs on multiple environments and technologies using different adapters Modeling - Compute decoupling DBT embraced by many organizations and data tooling providers The community is the best DBT encourages metadata collection during the development process (GDD) Improves the survivability factor Open source, highly extensible and already contains many of our requirements Can adapt beautifully to our needs DBT What does work for us?
  • 19. DBT if not a perfect fit Looking back at V2's key objectives CLI manual development Mono-repo Dev testing Dev experience Single adapter Not Spark optimized Built for batch No orchestration Abstraction
  • 20. So how can we fix it?
  • 25.
  • 26.
  • 27. But what about real-time workloads?
  • 28. Works well with DBT architecture Clickhouse implementation (link) Rockset implementation (link) Real-time analytics databases Requires extensive SQL interface Most testing cannot run in-process Difficult to convert batch to stream Streaming engines Vs. Real-time analytics in DBT
  • 29. Streaming engines • Materialize is the first streaming engine in DBT (here) • Materialize is a streaming SQL database • Based on incrementally-updated materialized views • Extensive ANSI SQL support • DBT support includes modeling, documenting, running and testing
  • 30. This is super cool! But we use Flink…
  • 31. So, Flink and DBT? • Flink has an SQL interface • Flink can connect to our metastore • Flink can store kafka topics references in our metastore • SQL jobs can be deployed remotely via simple python code • Supports both batch and stream • Can write directly to the data lake
  • 32. - name: transactions description: external: location: transactions connector: 'kafka' properties.bootstrap.servers: 'localhost:9092' key.format: 'raw' value.format: 'json' key.fields: 'id' value.fields-include: 'ALL' watermark: rowtime_column_name: transaction_time watermark_strategy_expression: transaction_time - INTERVAL '30’ SECONDS columns: - name: id data_type: STRING description: "ID" - name: currency_code data_type: STRING description: "Currency Code" - name: total data_type: DECIMAL(10,2) description: "Total amount spent" - name: transaction_time data_type: TIMESTAMP(3) description: "Time of transaction" Kafka topic #1 Flink in DBT (concept) - sources
  • 33. version: 2 models: - name: transactions_with_rate description: transactions joined with rate config: meta: external: location: 'transactions_with_rates’ connector: 'kafka’ properties.bootstrap.servers: 'localhost:9092’ format: 'json' columns: - name: id description: 'Transaction id' data_type: STRING tests: - unique - not_null - name: total_eur data_type: DECIMAL(10,2) description: 'Total in euro' - name: total description: 'Total' data_type: DECIMAL(10,2) - name: currency_code description: 'Currency code' data_type: STRING - name: transaction_time description: 'Transaction time' data_type: TIMESTAMP(3) SELECT t.id, t.total * c.eur_rate AS total_eur, t.total, c.currency_code, t.transaction_time FROM {{ source(kafka_tables, transactions) }} t JOIN {{ source(kafka_tables, currency_rates) }} FOR SYSTEM_TIME AS OF t.transaction_time AS c ON t.currency_code = c.currency_code; Flink in DBT (concept) - model transactions_with_rates.sql transactions_with_rates.yml
  • 34. Materialize’s implementation is great! Flink in DBT (concept) - tests - name: transactions_with_rates description: transactions joined with rates columns: - name: id data_type: STRING description: "ID" tests: - unique config: store_failures: true INSERT INTO test_unique_transactions_with_rates_id SELECT id, COUNT(1) as count_id FROM transactions_with_rates GROUP BY id HAVING count_id > 0 transactions_with_rates.yml
  • 35. So, we have streaming figured out! (theoretically)
  • 36. Challenges • Generalization has a price • Not everything can be abstracted this way • We lose expertise across the organization • Heavy dependency on Dataops • Requires LOTS of building
  • 37. The early demise of apache flink, Dall-e, 2022 • Not all tech is here to stay • Our users can’t keep up • Puts governance first • Business logic is the true organizational asset Data Modeling is the last survivor