Streaming SQL for Data Engineers: The Next Big Thing?

Yaroslav Tkachenko, Principal Software Engineer at Goldsky
Streaming SQL Products
Open source and vendor solutions
● Apache Flink
● Apache Spark
● Apache Beam
● AWS Kinesis
● Google Cloud Dataflow
● Databricks
● ksqlDB
● …
Companies building internal platforms
● Meta
● LinkedIn
● Pinterest
● DoorDash
● Alibaba
● …
👋 Hi, I’m Yaroslav
● Principal Software Engineer @ Goldsky
● Staff Data Engineer @ Shopify
● Software Architect @ Activision
● …
❤ Apache Flink
🤔
TableEnvironment tableEnv = TableEnvironment.create(/*…*/);
Table revenue = tableEnv.sqlQuery(
"SELECT cID, cName, SUM(revenue) AS revSum " +
"FROM Orders " +
"WHERE cCountry = 'FRANCE' " +
"GROUP BY cID, cName"
);
… but why SQL?
Why SQL?
● Wide adoption
● Declarative transformation model
● Planner!
● Common type system
What instead of How
[Diagram: Imperative Style: the user's intention is expressed directly as execution on the runtime. Declarative SQL Style: the user states the intention, and a planner turns it into execution on the runtime.]
SQL API
SELECT * FROM Orders
INNER JOIN Product
ON Orders.productId = Product.id
DataStream API
● LOTS of code!
● Create an operator to connect two streams
● Define and accumulate state
● Implement a mechanism for emitting the latest value per key
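To make the "LOTS of code" point concrete, here is a minimal plain-Java sketch of the bookkeeping a hand-written join has to do: connect two inputs, accumulate state per key, and emit the latest value per key. It deliberately uses no Flink classes. The class and method names (`StreamingJoinSketch`, `onOrder`, `onProduct`) are hypothetical, and a real DataStream job would also need checkpointed keyed state, state expiry, and serialization, none of which is shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: the state a hand-written Orders/Product join keeps.
class StreamingJoinSketch {
    // Orders seen so far, accumulated per join key (productId).
    private final Map<String, List<String>> ordersByProduct = new HashMap<>();
    // Latest product value seen per join key.
    private final Map<String, String> latestProduct = new HashMap<>();

    // Handles one order event and returns the joined rows to emit.
    List<String> onOrder(String orderId, String productId) {
        ordersByProduct.computeIfAbsent(productId, k -> new ArrayList<>()).add(orderId);
        List<String> out = new ArrayList<>();
        String product = latestProduct.get(productId);
        if (product != null) {
            out.add(orderId + " -> " + product);
        }
        return out;
    }

    // Handles one product event and re-emits buffered orders with the latest value.
    List<String> onProduct(String productId, String productName) {
        latestProduct.put(productId, productName);
        List<String> out = new ArrayList<>();
        for (String orderId : ordersByProduct.getOrDefault(productId, Collections.emptyList())) {
            out.add(orderId + " -> " + productName);
        }
        return out;
    }

    public static void main(String[] args) {
        StreamingJoinSketch join = new StreamingJoinSketch();
        System.out.println(join.onOrder("o1", "p1"));       // nothing to join with yet
        System.out.println(join.onProduct("p1", "Widget")); // buffered order is now joined
        System.out.println(join.onOrder("o2", "p1"));       // joins against stored product
    }
}
```

The three-line SQL join expresses the same intent and leaves all of this state handling to the planner and runtime.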
Declarative Transformation Model
SQL API
SELECT * FROM Orders
INNER JOIN Product
ON Orders.productId = Product.id
Why not Table API?
val orders = tEnv.from("Orders")
.select($"productId", $"a", $"b")
val products = tEnv.from("Products")
.select($"id", $"c", $"d")
val result = orders
.join(products)
.where($"productId" === $"id")
.select($"a", $"b", $"c")
Declarative Transformation Model
Top-N Query
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ticker
    ORDER BY price DESC) AS row_num
  FROM stock_table)
WHERE row_num <= 10;
Declarative Transformation Model
Row Pattern Recognition in SQL
(ISO/IEC TR 19075-5:2016)
SELECT *
FROM stock_table
MATCH_RECOGNIZE(
PARTITION BY ticker
ORDER BY event_time
MEASURES
A.event_time AS initialPriceTime,
C.event_time AS dropTime,
A.price - C.price AS dropDiff,
A.price AS initialPrice,
C.price AS lastPrice
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B* C) WITHIN INTERVAL '10' MINUTES
DEFINE
B AS B.price > A.price - 500
)
Flink Planner Migration
From https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance
Planner Decoupling
Planner Optimizations & Query Rewrite
● Predicate push down
● Projection push down
● Join rewrite
● Join elimination
● Constant inlining
● …
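As a rough illustration of the first item, predicate push-down, the sketch below runs the same query as two plans over in-memory lists: one joins first and filters afterwards, the other filters first. This is not Flink planner code. The class `PredicatePushDownSketch`, its row types, and the `rowsEnteringJoin` counter are all hypothetical and exist only to show that the rewrite keeps the result while shrinking the join input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of predicate push-down; not how Flink implements it.
class PredicatePushDownSketch {
    static final class Order {
        final int productId;
        final String country;
        Order(int productId, String country) { this.productId = productId; this.country = country; }
    }

    static final class Product {
        final int id;
        final String name;
        Product(int id, String name) { this.id = id; this.name = name; }
    }

    // Number of order rows that reached the join in the most recent run.
    static int rowsEnteringJoin;

    // Plan as written: join everything, then apply the WHERE clause.
    static List<String> joinThenFilter(List<Order> orders, List<Product> products, String country) {
        rowsEnteringJoin = orders.size();
        List<String> out = new ArrayList<>();
        for (Order o : orders) {
            for (Product p : products) {
                if (o.productId == p.id && o.country.equals(country)) {
                    out.add(p.name);
                }
            }
        }
        return out;
    }

    // Rewritten plan: the WHERE clause is pushed below the join.
    static List<String> filterThenJoin(List<Order> orders, List<Product> products, String country) {
        List<Order> filtered = new ArrayList<>();
        for (Order o : orders) {
            if (o.country.equals(country)) {
                filtered.add(o);
            }
        }
        rowsEnteringJoin = filtered.size();
        List<String> out = new ArrayList<>();
        for (Order o : filtered) {
            for (Product p : products) {
                if (o.productId == p.id) {
                    out.add(p.name);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order(1, "FRANCE"), new Order(2, "CANADA"), new Order(1, "CANADA"));
        List<Product> products = Arrays.asList(new Product(1, "Widget"), new Product(2, "Gadget"));
        System.out.println(joinThenFilter(orders, products, "FRANCE") + " rows joined: " + rowsEnteringJoin);
        System.out.println(filterThenJoin(orders, products, "FRANCE") + " rows joined: " + rowsEnteringJoin);
    }
}
```

In a streaming job, fewer rows entering a join generally also means less join state to keep, which is one reason to let a planner apply such rewrites.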
SQL API DataStream API
val postgresSink: SinkFunction[Envelope] = JdbcSink.sink(
"INSERT INTO table " +
"(id, number, timestamp, author, difficulty, size, vid, block_range) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?) " +
"ON CONFLICT (id) DO UPDATE SET " +
"number = excluded.number, " +
"timestamp = excluded.timestamp, " +
"author = excluded.author, " +
"difficulty = excluded.difficulty, " +
"size = excluded.size, " +
"vid = excluded.vid, " +
"block_range = excluded.block_range " +
"WHERE excluded.vid > table.vid",
new JdbcStatementBuilder[Envelope] {
override def accept(statement: PreparedStatement, record: Envelope): Unit = {
val payload = record.payload
payload.id.foreach { id => statement.setString(1, id) }
payload.number.foreach { number => statement.setBigDecimal(2, new java.math.BigDecimal(number)) }
payload.timestamp.foreach { timestamp => statement.setBigDecimal(3, new java.math.BigDecimal(timestamp)) }
payload.author.foreach { author => statement.setString(4, author) }
payload.difficulty.foreach { difficulty => statement.setBigDecimal(5, new java.math.BigDecimal(difficulty)) }
payload.size.foreach { size => statement.setBigDecimal(6, new java.math.BigDecimal(size)) }
payload.vid.foreach { vid => statement.setLong(7, vid.toLong) }
payload.block_range.foreach { block_range => statement.setObject(8, new PostgresIntRange(block_range), Types.O
}
},
CREATE TABLE TABLE (
  id BIGINT,
  number INTEGER,
  timestamp TIMESTAMP,
  author STRING,
  difficulty STRING,
  size INTEGER,
  vid BIGINT,
  block_range STRING,
  PRIMARY KEY (vid) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'table-name' = 'table'
);
😱
Common Type System
When you start using SQL, you get access to decades of advancements in database design
When NOT to use
● Complex serialization / deserialization logic
● Low-level optimizations, especially with state and timers
● Not always debugging-friendly
Dealing with Complexity
UDFs for heavy lifting
● Calling 3rd-party libraries
● External calls
● Enrichments
Templating
● Control structures
● dbt-style macros and references
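A dbt-style reference can be thought of as a placeholder that is resolved to a concrete table name before the statement is submitted. The sketch below shows that idea with a regular expression in plain Java. It is not dbt's implementation (dbt uses Jinja), and the class `RefTemplateSketch`, the `render` method, and the model-to-table map are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of resolving {{ ref('model') }} placeholders in SQL text.
class RefTemplateSketch {
    // Matches {{ ref('model_name') }} with optional whitespace inside the braces.
    private static final Pattern REF =
        Pattern.compile("\\{\\{\\s*ref\\('([^']+)'\\)\\s*\\}\\}");

    // Replaces every reference with the table name registered for that model.
    static String render(String sql, Map<String, String> models) {
        Matcher m = REF.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String table = models.get(m.group(1));
            if (table == null) {
                throw new IllegalArgumentException("Unknown model: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(table));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> models = new HashMap<>();
        models.put("market_orders", "analytics.market_orders"); // assumed mapping
        System.out.println(render("SELECT symbol FROM {{ ref('market_orders') }}", models));
    }
}
```

Failing fast on an unknown model is a design choice here: a broken reference is caught when the statement is rendered, not after a long-running job has been deployed.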
Convinced? Let’s use it!
Ways to use
● Structured Statements
● dbt-style Project
● Notebooks
● Managed Runtime
Requirements
● Version control
● Code organization
● Testability
● CI/CD
● Observability
Structured Statements
def revenueByCountry(country: String): Table = {
  tEnv.sqlQuery(
    s"""
       |SELECT name, SUM(revenue) AS totalRevenue
       |FROM Orders
       |WHERE country = '${country}'
       |GROUP BY name""".stripMargin
  )
}
✅ structure
✅ mock/stub for testing
Structured Statements
● Treat them like code
● Only make sense when Table API is not available
● Mix with other API flavours
● SQL also has style guides
● Otherwise it’s a typical streaming application!
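One way to "treat them like code" is to keep the statement text in a small function that can be unit tested without a running cluster. The sketch below is a plain-Java take on that idea. The class `RevenueStatements` and its methods are hypothetical, and it only builds the SQL string; passing the result to `sqlQuery` is left out. It also doubles single quotes in the interpolated value, which is an addition not present in the slide's version.

```java
// Hypothetical illustration: SQL text built by a function that is easy to test.
class RevenueStatements {
    // Wraps a value in a SQL string literal, doubling any embedded single quotes.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Builds the statement text only; executing it is the caller's job.
    static String revenueByCountry(String country) {
        return String.join("\n",
            "SELECT name, SUM(revenue) AS totalRevenue",
            "FROM Orders",
            "WHERE country = " + quote(country),
            "GROUP BY name");
    }

    public static void main(String[] args) {
        System.out.println(revenueByCountry("FRANCE"));
    }
}
```

A unit test can then assert on the generated text, while an integration test runs the same statement against a small fixture table.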
Structured Statements
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟡
● Observability: 🟢
dbt-style Project
➔ models
◆ common
● users.sql
● users.yml
◆ sales.sql
◆ sales.yml
◆ …
➔ tests
◆ …
✅ structured
✅ schematized
✅ testable
dbt-style Project
SELECT
  ((text::jsonb)->>'bid_price')::FLOAT AS bid_price,
  (text::jsonb)->>'order_quantity' AS order_quantity,
  (text::jsonb)->>'symbol' AS symbol,
  (text::jsonb)->>'trade_type' AS trade_type,
  to_timestamp(((text::jsonb)->'timestamp')::BIGINT) AS ts
FROM {{ ref('market_orders_raw') }}
{{ config(materialized='materializedview') }}
SELECT symbol,
  AVG(bid_price) AS avg
FROM {{ ref('market_orders') }}
GROUP BY symbol
dbt-style Project
● Works well for heavy analytical use-cases
● Could write tests in Python/Scala/etc.
● Probably needs more tooling than you think (state
management, observability, etc.)
● Check dbt adapter from Materialize!
dbt-style Project
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟡
● Observability: 🟡
Notebooks
Apache Zeppelin
Notebooks
● Great UX
● Ideal for exploratory analysis and BI
● Complements all other patterns really well
● Way more important for realtime workloads
Notebooks
"We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks"
https://www.thoughtworks.com/en-ca/radar/techniques/productionizing-notebooks
Notebooks
● Version control: 🟡
● Code organization: 🔴
● Testability: 🔴
● CI/CD: 🔴
● Observability: 🔴
Managed Runtime
decodable
Managed Runtime
● Managed ≈ “Serverless”
● Auto-scaling
● Automated deployments, rollbacks, etc.
● Testing for different layers is decoupled (runtime vs jobs)
Managed Runtime
Reference Architecture
[Diagram: a Control Plane and a Data Plane; components shown are UI, CLI, API, Reconciler, and Streaming Job]
Any managed runtime requires excellent developer experience to succeed
Managed Runtime: Ideal Developer Experience
● Notebooks UX
● Version Control Integration
● dbt-style Project Structure
➔ models
◆ common
◆ sales
◆ shipping
◆ marketing
◆ …
● Versioning (Version 1, Version 2, Version 3, …)
● Previews
User | Count
Irene | 100
Alex | 53
Josh | 12
Jane | 1
Managed Runtime
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟢
● Observability: 🟢
Summary
                  | Structured Statements | dbt-style Project | Notebooks | Managed Runtime
Version Control   | 🟢 | 🟢 | 🟡 | 🟢
Code Organization | 🟢 | 🟢 | 🔴 | 🟢
Testability       | 🟡 | 🟡 | 🔴 | 🟡
CI/CD             | 🟡 | 🟡 | 🔴 | 🟢
Observability     | 🟢 | 🟡 | 🔴 | 🟢
Complexity        | 🟢 | 🟡 | 🟡 | 🔴
General Guidelines
● Long-running streaming apps require special attention to state management
● Try to avoid mutability: every change is a new version
● Integration testing > unit testing
● Embrace the SRE mentality
Really dislike SQL?
Malloy PRQL
Questions?
@sap1ens
1 of 57

Recommended

A Thorough Comparison of Delta Lake, Iceberg and Hudi by
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
11.1K views27 slides
Optimizing Delta/Parquet Data Lakes for Apache Spark by
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
2.5K views51 slides
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ... by
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
6.2K views33 slides
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard by
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
1.3K views42 slides
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang by
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
5.8K views103 slides
Data Discovery at Databricks with Amundsen by
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
1.2K views45 slides

More Related Content

What's hot

Building Lakehouses on Delta Lake with SQL Analytics Primer by
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
428 views32 slides
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga... by
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
139 views23 slides
Incremental View Maintenance with Coral, DBT, and Iceberg by
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergWalaa Eldin Moustafa
599 views57 slides
Apache Iceberg Presentation for the St. Louis Big Data IDEA by
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
607 views13 slides
Data Quality With or Without Apache Spark and Its Ecosystem by
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
1.3K views22 slides
Apache Iceberg - A Table Format for Hige Analytic Datasets by
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
6.6K views28 slides

What's hot(20)

Building Lakehouses on Delta Lake with SQL Analytics Primer by Databricks
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks428 views
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga... by DataScienceConferenc1
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
Incremental View Maintenance with Coral, DBT, and Iceberg by Walaa Eldin Moustafa
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
Apache Iceberg Presentation for the St. Louis Big Data IDEA by Adam Doyle
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle607 views
Data Quality With or Without Apache Spark and Its Ecosystem by Databricks
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
Databricks1.3K views
Apache Iceberg - A Table Format for Hige Analytic Datasets by Alluxio, Inc.
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.6.6K views
Radical Speed for SQL Queries on Databricks: Photon Under the Hood by Databricks
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks1.1K views
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka... by HostedbyConfluent
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
HostedbyConfluent1.1K views
Apache Flink internals by Kostas Tzoumas
Apache Flink internalsApache Flink internals
Apache Flink internals
Kostas Tzoumas12.4K views
Intro to databricks delta lake by Mykola Zerniuk
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk316 views
Using Queryable State for Fun and Profit by Flink Forward
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
Flink Forward257 views
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da... by Andrew Lamb
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
Andrew Lamb181 views
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... by Databricks
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks1.5K views
Intro to Delta Lake by Databricks
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks1.5K views
Databricks Delta Lake and Its Benefits by Databricks
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks5.1K views
Tame the small files problem and optimize data layout for streaming ingestion... by Flink Forward
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward803 views
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg by Anant Corporation
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation219 views
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture by Kai Wähner
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner1.9K views
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... by Flink Forward
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward3.2K views

Similar to Streaming SQL for Data Engineers: The Next Big Thing?

Shaping serverless architecture with domain driven design patterns - py web-il by
Shaping serverless architecture with domain driven design patterns - py web-ilShaping serverless architecture with domain driven design patterns - py web-il
Shaping serverless architecture with domain driven design patterns - py web-ilAsher Sterkin
615 views29 slides
Apache Samza 1.0 - What's New, What's Next by
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
292 views45 slides
Sprint 58 by
Sprint 58Sprint 58
Sprint 58ManageIQ
238 views47 slides
Serverless in-action by
Serverless in-actionServerless in-action
Serverless in-actionAssaf Gannon
318 views44 slides
Shaping serverless architecture with domain driven design patterns by
Shaping serverless architecture with domain driven design patternsShaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patternsShimon Tolts
250 views26 slides
Shaping serverless architecture with domain driven design patterns by
Shaping serverless architecture with domain driven design patternsShaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patternsAsher Sterkin
3K views26 slides

Similar to Streaming SQL for Data Engineers: The Next Big Thing?(20)

Shaping serverless architecture with domain driven design patterns - py web-il by Asher Sterkin
Shaping serverless architecture with domain driven design patterns - py web-ilShaping serverless architecture with domain driven design patterns - py web-il
Shaping serverless architecture with domain driven design patterns - py web-il
Asher Sterkin615 views
Sprint 58 by ManageIQ
Sprint 58Sprint 58
Sprint 58
ManageIQ238 views
Serverless in-action by Assaf Gannon
Serverless in-actionServerless in-action
Serverless in-action
Assaf Gannon318 views
Shaping serverless architecture with domain driven design patterns by Shimon Tolts
Shaping serverless architecture with domain driven design patternsShaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patterns
Shimon Tolts250 views
Shaping serverless architecture with domain driven design patterns by Asher Sterkin
Shaping serverless architecture with domain driven design patternsShaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patterns
Asher Sterkin3K views
Sprint 45 review by ManageIQ
Sprint 45 reviewSprint 45 review
Sprint 45 review
ManageIQ1.2K views
Improving Apache Spark Downscaling by Databricks
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
Databricks827 views
Sprint 55 by ManageIQ
Sprint 55Sprint 55
Sprint 55
ManageIQ860 views
Advanced Code Flow, Notes From the Field by Ariel Moskovich
Advanced Code Flow, Notes From the FieldAdvanced Code Flow, Notes From the Field
Advanced Code Flow, Notes From the Field
Ariel Moskovich361 views
Fast federated SQL with Apache Calcite by Chris Baynes
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes1.4K views
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha... by Databricks
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Databricks677 views
Modular Web Applications With Netzke by netzke
Modular Web Applications With NetzkeModular Web Applications With Netzke
Modular Web Applications With Netzke
netzke1.1K views
How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa... by ScyllaDB
How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...
How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...
ScyllaDB863 views
GraphQL the holy contract between client and server by Pavel Chertorogov
GraphQL the holy contract between client and serverGraphQL the holy contract between client and server
GraphQL the holy contract between client and server
Pavel Chertorogov736 views
Sprint 59 by ManageIQ
Sprint 59Sprint 59
Sprint 59
ManageIQ280 views
SamzaSQL QCon'16 presentation by Yi Pan
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
Yi Pan1.9K views
The path to a Serverless-native era with Kubernetes by Paolo Mainardi by NETWAYS
The path to a Serverless-native era with Kubernetes by Paolo MainardiThe path to a Serverless-native era with Kubernetes by Paolo Mainardi
The path to a Serverless-native era with Kubernetes by Paolo Mainardi
NETWAYS105 views

More from Yaroslav Tkachenko

Apache Flink Adoption at Shopify by
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
1.1K views36 slides
Storing State Forever: Why It Can Be Good For Your Analytics by
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
483 views38 slides
It's Time To Stop Using Lambda Architecture by
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureYaroslav Tkachenko
213 views37 slides
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming by
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingYaroslav Tkachenko
542 views39 slides
Apache Kafka: New Features That You Might Not Know About by
Apache Kafka: New Features That You Might Not Know AboutApache Kafka: New Features That You Might Not Know About
Apache Kafka: New Features That You Might Not Know AboutYaroslav Tkachenko
4.9K views25 slides
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson... by
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...Yaroslav Tkachenko
1.8K views44 slides

More from Yaroslav Tkachenko(17)

Storing State Forever: Why It Can Be Good For Your Analytics by Yaroslav Tkachenko
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
Yaroslav Tkachenko483 views
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming by Yaroslav Tkachenko
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Yaroslav Tkachenko542 views
Apache Kafka: New Features That You Might Not Know About by Yaroslav Tkachenko
Apache Kafka: New Features That You Might Not Know AboutApache Kafka: New Features That You Might Not Know About
Apache Kafka: New Features That You Might Not Know About
Yaroslav Tkachenko4.9K views
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson... by Yaroslav Tkachenko
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...
Yaroslav Tkachenko1.8K views
Designing Scalable and Extendable Data Pipeline for Call Of Duty Games by Yaroslav Tkachenko
Designing Scalable and Extendable Data Pipeline for Call Of Duty GamesDesigning Scalable and Extendable Data Pipeline for Call Of Duty Games
Designing Scalable and Extendable Data Pipeline for Call Of Duty Games
Yaroslav Tkachenko1.1K views
10 tips for making Bash a sane programming language by Yaroslav Tkachenko
10 tips for making Bash a sane programming language10 tips for making Bash a sane programming language
10 tips for making Bash a sane programming language
Yaroslav Tkachenko764 views
Kafka Streams: the easiest way to start with stream processing by Yaroslav Tkachenko
Kafka Streams: the easiest way to start with stream processingKafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processing
Yaroslav Tkachenko6.6K views
Building Stateful Microservices With Akka by Yaroslav Tkachenko
Building Stateful Microservices With AkkaBuilding Stateful Microservices With Akka
Building Stateful Microservices With Akka
Yaroslav Tkachenko1.9K views
Akka Microservices Architecture And Design by Yaroslav Tkachenko
Akka Microservices Architecture And DesignAkka Microservices Architecture And Design
Akka Microservices Architecture And Design
Yaroslav Tkachenko4.6K views
Why Actor-Based Systems Are The Best For Microservices by Yaroslav Tkachenko
Why Actor-Based Systems Are The Best For MicroservicesWhy Actor-Based Systems Are The Best For Microservices
Why Actor-Based Systems Are The Best For Microservices
Yaroslav Tkachenko940 views
Why actor-based systems are the best for microservices by Yaroslav Tkachenko
Why actor-based systems are the best for microservicesWhy actor-based systems are the best for microservices
Why actor-based systems are the best for microservices
Yaroslav Tkachenko4.3K views
Building Eventing Systems for Microservice Architecture by Yaroslav Tkachenko
Building Eventing Systems for Microservice Architecture  Building Eventing Systems for Microservice Architecture
Building Eventing Systems for Microservice Architecture
Yaroslav Tkachenko3.6K views
Быстрая и безболезненная разработка клиентской части веб-приложений by Yaroslav Tkachenko
Быстрая и безболезненная разработка клиентской части веб-приложенийБыстрая и безболезненная разработка клиентской части веб-приложений
Быстрая и безболезненная разработка клиентской части веб-приложений
Yaroslav Tkachenko801 views

Recently uploaded

Data about the sector workshop by
Data about the sector workshopData about the sector workshop
Data about the sector workshopinfo828217
12 views27 slides
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...StatsCommunications
5 views26 slides
CRM stick or twist.pptx by
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptxinfo828217
11 views16 slides
How Leaders See Data? (Level 1) by
How Leaders See Data? (Level 1)How Leaders See Data? (Level 1)
How Leaders See Data? (Level 1)Narendra Narendra
15 views76 slides
Organic Shopping in Google Analytics 4.pdf by
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdfGA4 Tutorials
16 views13 slides
Survey on Factuality in LLM's.pptx by
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptxNeethaSherra1
7 views9 slides

Recently uploaded(20)

Data about the sector workshop by info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821712 views
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
CRM stick or twist.pptx by info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 views

Streaming SQL for Data Engineers: The Next Big Thing?

  • 1. Streaming SQL for Data Engineers: The Next Big Thing?
  • 4. Open source and vendor solutions: ● Apache Flink ● Apache Spark ● Apache Beam ● AWS Kinesis ● Google Cloud Dataflow ● Databricks ● ksqlDB ● … Companies building internal platforms: ● Meta ● LinkedIn ● Pinterest ● DoorDash ● Alibaba ● …
  • 8. 👋 Hi, I’m Yaroslav ● Principal Software Engineer @ Goldsky ● Staff Data Engineer @ Shopify ● Software Architect @ Activision ● … ❤ Apache Flink
  • 9. 🤔
      TableEnvironment tableEnv = TableEnvironment.create(/*…*/);
      Table revenue = tableEnv.sqlQuery(
          "SELECT cID, cName, SUM(revenue) AS revSum " +
          "FROM Orders " +
          "WHERE cCountry = 'FRANCE' " +
          "GROUP BY cID, cName"
      );
  • 10. … but why SQL?
  • 11. Why SQL? ● Wide adoption ● Declarative transformation model ● Planner! ● Common type system
  • 15. Declarative Transformation Model
      SQL API:
        SELECT * FROM Orders
        INNER JOIN Product
        ON Orders.productId = Product.id
      DataStream API:
        ● LOTS of code!
        ● Create an operator to connect two streams
        ● Define and accumulate state
        ● Implement a mechanism for emitting the latest value per key
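To make the "LOTS of code" point concrete, here is a minimal, framework-free sketch of the three DataStream-side steps listed on slide 15: connecting two inputs, accumulating state, and emitting joined rows per key. It is deliberately not Flink code. The class `StreamingJoinSketch`, its method names, and the `orderId|productName` output format are all hypothetical, and a real engine would also have to handle checkpointing, state expiry, and retractions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a stateful streaming inner join (not a Flink API).
final class StreamingJoinSketch {
    // State: every order id seen so far, grouped by productId.
    private final Map<String, List<String>> ordersByProduct = new HashMap<>();
    // State: latest product name per product id.
    private final Map<String, String> productById = new HashMap<>();

    // Handles one event from the Orders input; returns the rows it emits.
    List<String> onOrder(String orderId, String productId) {
        ordersByProduct.computeIfAbsent(productId, k -> new ArrayList<>()).add(orderId);
        List<String> out = new ArrayList<>();
        String product = productById.get(productId);
        if (product != null) {
            out.add(orderId + "|" + product);
        }
        return out;
    }

    // Handles one event from the Product input; re-emits a row for every
    // buffered order of that product using the latest product value.
    List<String> onProduct(String productId, String name) {
        productById.put(productId, name);
        List<String> out = new ArrayList<>();
        for (String orderId : ordersByProduct.getOrDefault(productId, List.of())) {
            out.add(orderId + "|" + name);
        }
        return out;
    }

    public static void main(String[] args) {
        StreamingJoinSketch join = new StreamingJoinSketch();
        System.out.println(join.onOrder("o1", "p1"));       // [] (no product yet)
        System.out.println(join.onProduct("p1", "Widget")); // [o1|Widget]
        System.out.println(join.onOrder("o2", "p1"));       // [o2|Widget]
    }
}
```

Even this toy version needs two state maps and two handlers for what the SQL API expresses as a three-line INNER JOIN.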
  • 16. Declarative Transformation Model
      SQL API:
        SELECT * FROM Orders
        INNER JOIN Product
        ON Orders.productId = Product.id
      Why not Table API?
        val orders = tEnv.from("Orders")
          .select($"productId", $"a", $"b")
        val products = tEnv.from("Products")
          .select($"id", $"c", $"d")
        val result = orders
          .join(products)
          .where($"productId" === $"id")
          .select($"a", $"b", $"c")
  • 17. Declarative Transformation Model: Top-N Query
      SELECT * FROM (
        SELECT *,
          ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY price DESC) AS row_num
        FROM stock_table)
      WHERE row_num <= 10;
  • 18. Row Pattern Recognition in SQL (ISO/IEC TR 19075-5:2016)
      SELECT *
      FROM stock_table
        MATCH_RECOGNIZE(
          PARTITION BY ticker
          ORDER BY event_time
          MEASURES
            A.event_time AS initialPriceTime,
            C.event_time AS dropTime,
            A.price - C.price AS dropDiff,
            A.price AS initialPrice,
            C.price AS lastPrice
          ONE ROW PER MATCH
          AFTER MATCH SKIP PAST LAST ROW
          PATTERN (A B* C) WITHIN INTERVAL '10' MINUTES
          DEFINE
            B AS B.price > A.price - 500
        )
  • 19. Planner Decoupling: Flink Planner Migration (from https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance)
  • 20. Planner Optimizations & Query Rewrite ● Predicate push down ● Projection push down ● Join rewrite ● Join elimination ● Constant inlining ● …
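As an illustration of the first item on slide 20, the sketch below runs the same query two ways over in-memory rows: join first and filter afterwards, or filter first (predicate pushed below the join) and join afterwards. It is not how Flink's planner is implemented. The class `PredicatePushDownSketch`, the sample rows, and the `probes` counter are hypothetical and exist only to show that both plans return the same rows while the rewritten plan does less join work. The rewrite is valid here because the predicate only touches columns of the Orders side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of predicate push down on in-memory rows.
final class PredicatePushDownSketch {
    // Counts row pairs compared by the nested-loop join.
    static int probes = 0;

    // Orders rows: {orderId, productId, country}. Product rows: {productId, name}.
    static final List<String[]> ORDERS = List.of(
        new String[] {"o1", "p1", "FRANCE"},
        new String[] {"o2", "p2", "GERMANY"},
        new String[] {"o3", "p1", "SPAIN"});
    static final List<String[]> PRODUCTS = List.of(
        new String[] {"p1", "Widget"},
        new String[] {"p2", "Gadget"});

    // Inner join on productId; joined rows are {orderId, productName, country}.
    static List<String[]> join(List<String[]> orders, List<String[]> products) {
        List<String[]> out = new ArrayList<>();
        for (String[] o : orders) {
            for (String[] p : products) {
                probes++;
                if (o[1].equals(p[0])) {
                    out.add(new String[] {o[0], p[1], o[2]});
                }
            }
        }
        return out;
    }

    // WHERE country = ?; country sits at index 2 before and after the join.
    static List<String[]> whereCountry(List<String[]> rows, String country) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : rows) {
            if (r[2].equals(country)) {
                out.add(r);
            }
        }
        return out;
    }

    static List<String> render(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(String.join("|", r));
        }
        return out;
    }

    public static void main(String[] args) {
        probes = 0;
        List<String> naive = render(whereCountry(join(ORDERS, PRODUCTS), "FRANCE"));
        int naiveProbes = probes;

        probes = 0;
        List<String> pushed = render(join(whereCountry(ORDERS, "FRANCE"), PRODUCTS));
        int pushedProbes = probes;

        System.out.println(naive + " probes=" + naiveProbes);   // [o1|Widget|FRANCE] probes=6
        System.out.println(pushed + " probes=" + pushedProbes); // [o1|Widget|FRANCE] probes=2
    }
}
```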
  • 21. Common Type System 😱
      DataStream API:
        val postgresSink: SinkFunction[Envelope] = JdbcSink.sink(
          "INSERT INTO table " +
            "(id, number, timestamp, author, difficulty, size, vid, block_range) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?) " +
            "ON CONFLICT (id) DO UPDATE SET " +
            "number = excluded.number, " +
            "timestamp = excluded.timestamp, " +
            "author = excluded.author, " +
            "difficulty = excluded.difficulty, " +
            "size = excluded.size, " +
            "vid = excluded.vid, " +
            "block_range = excluded.block_range " +
            "WHERE excluded.vid > table.vid",
          new JdbcStatementBuilder[Envelope] {
            override def accept(statement: PreparedStatement, record: Envelope): Unit = {
              val payload = record.payload
              payload.id.foreach { id => statement.setString(1, id) }
              payload.number.foreach { number => statement.setBigDecimal(2, new java.math.BigDecimal(number)) }
              payload.timestamp.foreach { timestamp => statement.setBigDecimal(3, new java.math.BigDecimal(timestamp)) }
              payload.author.foreach { author => statement.setString(4, author) }
              payload.difficulty.foreach { difficulty => statement.setBigDecimal(5, new java.math.BigDecimal(difficulty)) }
              payload.size.foreach { size => statement.setBigDecimal(6, new java.math.BigDecimal(size)) }
              payload.vid.foreach { vid => statement.setLong(7, vid.toLong) }
              payload.block_range.foreach { block_range => statement.setObject(8, new PostgresIntRange(block_range), Types.O
            }
          },
      SQL API:
        CREATE TABLE TABLE (
          id BIGINT,
          number INTEGER,
          timestamp TIMESTAMP,
          author STRING,
          difficulty STRING,
          size INTEGER,
          vid BIGINT,
          block_range STRING,
          PRIMARY KEY (vid) NOT ENFORCED
        ) WITH (
          'connector' = 'jdbc',
          'table-name' = 'table'
        );
  • 22. When you start using SQL you get access to the decades of advancements in database design
  • 23. When NOT to use ● Complex serialization / deserialization logic ● Low-level optimizations, especially with state and timers ● Not always debugging-friendly
  • 24. Dealing with Complexity. UDFs for heavy lifting: ● Calling 3rd-party libraries ● External calls ● Enrichments. Templating: ● Control structures ● dbt-style macros and references
  • 26. Ways to use ● Structured Statements ● dbt-style Project ● Notebooks ● Managed Runtime
  • 27. Requirements ● Version control ● Code organization ● Testability ● CI/CD ● Observability
  • 29. Structured Statements
      def revenueByCountry(country: String): Table = {
        tEnv.sqlQuery(
          s"""
            |SELECT name, SUM(revenue) AS totalRevenue
            |FROM Orders
            |WHERE country = '${country}'
            |GROUP BY name""".stripMargin
        )
      }
      ✅ structure ✅ mock/stub for testing
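The "mock/stub for testing" point can be sketched without a running cluster. In the example below, `SqlRunner` is a hypothetical seam standing in for `tEnv.sqlQuery`, and `RecordingRunner` is a hand-written stub that captures the SQL text instead of executing it. None of these names come from Flink. The quote escaping in `revenueByCountry` is an extra assumption not present on the slide, added because the country value is interpolated into the statement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: testing a structured SQL statement through a stubbed runner.
final class StructuredStatementSketch {
    // Stand-in for the real query entry point (for example tEnv.sqlQuery).
    interface SqlRunner {
        List<String> run(String sql);
    }

    static final class RevenueQueries {
        private final SqlRunner runner;

        RevenueQueries(SqlRunner runner) {
            this.runner = runner;
        }

        // Same statement as on the slide, with single quotes in the
        // parameter doubled (an assumption added for safety).
        List<String> revenueByCountry(String country) {
            String sql = String.join("\n",
                "SELECT name, SUM(revenue) AS totalRevenue",
                "FROM Orders",
                "WHERE country = '" + country.replace("'", "''") + "'",
                "GROUP BY name");
            return runner.run(sql);
        }
    }

    // Stub that records every statement and returns a canned row.
    static final class RecordingRunner implements SqlRunner {
        final List<String> seen = new ArrayList<>();

        @Override
        public List<String> run(String sql) {
            seen.add(sql);
            return List.of("stub-row");
        }
    }

    public static void main(String[] args) {
        RecordingRunner stub = new RecordingRunner();
        List<String> rows = new RevenueQueries(stub).revenueByCountry("FRANCE");
        System.out.println(rows);             // [stub-row]
        System.out.println(stub.seen.get(0)); // the generated SQL text
    }
}
```

A stub like this only checks the generated statement. Whether the query behaves correctly against real data still needs an integration test.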
  • 30. Structured Statements ● Treat them like code ● Only make sense when Table API is not available ● Mix with other API flavours ● SQL also has style guides ● Otherwise it’s a typical streaming application!
  • 31. Structured Statements ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟡 ● Observability: 🟢
  • 33. dbt-style Project
      ➔ models
        ◆ common
          ● users.sql
          ● users.yml
        ◆ sales.sql
        ◆ sales.yml
        ◆ …
      ➔ tests
        ◆ …
      ✅ structured ✅ schematized ✅ testable
  • 34. dbt-style Project
      SELECT
        ((text::jsonb)->>'bid_price')::FLOAT AS bid_price,
        (text::jsonb)->>'order_quantity' AS order_quantity,
        (text::jsonb)->>'symbol' AS symbol,
        (text::jsonb)->>'trade_type' AS trade_type,
        to_timestamp(((text::jsonb)->'timestamp')::BIGINT) AS ts
      FROM {{ REF('market_orders_raw') }}
      {{ config(materialized='materializedview') }}
      SELECT symbol,
        AVG(bid_price) AS avg
      FROM {{ REF('market_orders') }}
      GROUP BY symbol
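To show what a `{{ REF('…') }}` placeholder amounts to, here is a minimal sketch that substitutes model references with concrete relation names. It is not how dbt works internally: dbt renders models with Jinja and also derives the dependency graph from the references. The class `RefTemplateSketch`, the regular expression, and the `analytics.` schema prefix are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of resolving dbt-style ref() placeholders in a SQL model.
final class RefTemplateSketch {
    // Matches {{ ref('model_name') }} in any letter case, with optional spaces.
    private static final Pattern REF = Pattern.compile(
        "\\{\\{\\s*ref\\(\\s*'([^']+)'\\s*\\)\\s*\\}\\}", Pattern.CASE_INSENSITIVE);

    // Replaces each placeholder with the relation registered for that model.
    static String render(String template, Map<String, String> relations) {
        Matcher m = REF.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String relation = relations.get(m.group(1));
            if (relation == null) {
                throw new IllegalArgumentException("Unknown model: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(relation));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String model = "SELECT symbol, AVG(bid_price) AS avg "
            + "FROM {{ REF('market_orders') }} GROUP BY symbol";
        // Prints: SELECT symbol, AVG(bid_price) AS avg FROM analytics.market_orders GROUP BY symbol
        System.out.println(render(model, Map.of("market_orders", "analytics.market_orders")));
    }
}
```

Failing fast on an unknown model name is a design choice of this sketch: a missing reference is caught at render time rather than when the statement is submitted.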
  • 35. dbt-style Project ● Works well for heavy analytical use-cases ● Could write tests in Python/Scala/etc. ● Probably needs more tooling than you think (state management, observability, etc.) ● Check dbt adapter from Materialize!
  • 36. dbt-style Project ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟡 ● Observability: 🟡
  • 39. Notebooks ● Great UX ● Ideal for exploratory analysis and BI ● Complements all other patterns really well ● Way more important for realtime workloads
  • 40. Notebooks: "We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks" (https://www.thoughtworks.com/en-ca/radar/techniques/productionizing-notebooks)
  • 41. Notebooks ● Version control: 🟡 ● Code organization: 🔴 ● Testability: 🔴 ● CI/CD: 🔴 ● Observability: 🔴
  • 43. Managed Runtime ● Managed ≈ “Serverless” ● Auto-scaling ● Automated deployments, rollbacks, etc. ● Testing for different layers is decoupled (runtime vs jobs)
  • 44. Managed Runtime Reference Architecture [diagram labels: Control Plane, Data Plane, API, Reconciler, Streaming Job, UI, CLI]
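The slide only names a Reconciler and does not describe how it works. One common way to build such a component is a desired-state loop: compare what users asked for with what is running and derive the actions that close the gap. The sketch below assumes that pattern. The class `ReconcilerSketch`, the job-name-to-version maps, and the action strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of a desired-state reconciler for streaming jobs.
final class ReconcilerSketch {
    // desired: job name -> version requested through the API.
    // actual:  job name -> version currently running in the data plane.
    static List<String> reconcile(Map<String, Integer> desired, Map<String, Integer> actual) {
        List<String> actions = new ArrayList<>();
        // Visit job names in sorted order so the output is deterministic.
        TreeSet<String> names = new TreeSet<>(desired.keySet());
        names.addAll(actual.keySet());
        for (String name : names) {
            Integer want = desired.get(name);
            Integer have = actual.get(name);
            if (have == null) {
                actions.add("START " + name + " v" + want);
            } else if (want == null) {
                actions.add("STOP " + name);
            } else if (!want.equals(have)) {
                actions.add("REPLACE " + name + " v" + have + " -> v" + want);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, Integer> desired = Map.of("sales", 2, "shipping", 1);
        Map<String, Integer> actual = Map.of("sales", 1, "marketing", 1);
        // Prints: [STOP marketing, REPLACE sales v1 -> v2, START shipping v1]
        System.out.println(reconcile(desired, actual));
    }
}
```

In a real runtime each action would also have to deal with job state (for example, stopping with a savepoint before a replacement starts), which this sketch leaves out.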
  • 45. Any managed runtime requires excellent developer experience to succeed
  • 46. Managed Runtime: Ideal Developer Experience. Notebooks UX: SELECT * … SELECT * …
  • 47. Managed Runtime: Ideal Developer Experience. Version Control Integration: SELECT * … SELECT * …
  • 48. Managed Runtime: Ideal Developer Experience. dbt-style Project Structure: SELECT * … SELECT * … ➔ models ◆ common ◆ sales ◆ shipping ◆ marketing ◆ …
  • 49. Managed Runtime: Ideal Developer Experience. Versioning: SELECT * … SELECT * … ● Version 1 ● Version 2 ● Version 3 ● …
  • 50. Managed Runtime: Ideal Developer Experience. Previews: SELECT * … SELECT * … Result preview (User, Count): Irene 100; Alex 53; Josh 12; Jane 1
  • 51. Managed Runtime ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟢 ● Observability: 🟢
  • 52. Summary
                          | Structured Statements | dbt-style Project | Notebooks | Managed Runtime
      Version Control     | 🟢                    | 🟢                | 🟡        | 🟢
      Code Organization   | 🟢                    | 🟢                | 🔴        | 🟢
      Testability         | 🟡                    | 🟡                | 🔴        | 🟡
      CI/CD               | 🟡                    | 🟡                | 🔴        | 🟢
      Observability       | 🟢                    | 🟡                | 🔴        | 🟢
      Complexity          | 🟢                    | 🟡                | 🟡        | 🔴
  • 53. General Guidelines ● Long-running streaming apps require special attention to state management ● Try to avoid mutability: every change is a new version ● Integration testing > unit testing ● Embrace the SRE mentality
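The "every change is a new version" guideline can be sketched as an append-only store: publishing a changed statement never overwrites an earlier one, so any previous version stays available for rollback or comparison. This is an illustrative assumption about how the guideline could be applied, not something the slides specify. The class `VersionedQuerySketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an append-only, versioned store of SQL statements.
final class VersionedQuerySketch {
    // Job name -> full history of statements; index 0 holds version 1.
    private final Map<String, List<String>> history = new HashMap<>();

    // Appends a new version and returns its 1-based version number.
    int publish(String name, String sql) {
        List<String> versions = history.computeIfAbsent(name, k -> new ArrayList<>());
        versions.add(sql);
        return versions.size();
    }

    // Returns the statement stored under the given version.
    String get(String name, int version) {
        List<String> versions = history.get(name);
        if (versions == null || version < 1 || version > versions.size()) {
            throw new IllegalArgumentException("No version " + version + " for " + name);
        }
        return versions.get(version - 1);
    }

    // Returns the latest version number, or 0 if the job is unknown.
    int latest(String name) {
        return history.getOrDefault(name, List.of()).size();
    }

    public static void main(String[] args) {
        VersionedQuerySketch store = new VersionedQuerySketch();
        store.publish("sales", "SELECT name FROM Orders");
        store.publish("sales", "SELECT name, country FROM Orders");
        System.out.println(store.latest("sales")); // 2
        System.out.println(store.get("sales", 1)); // SELECT name FROM Orders
    }
}
```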