Paulo Gutierrez
Spark Meetup Tokyo #2 (Spark+AI Summit EU 2019)
Credits to Prakash Chockalingam for various Delta slides, used with permission
The Delta Architecture
A step beyond Lambda Architecture
A Data Engineer’s Dream...
Diagram: Data Lake (CSV, JSON, TXT…) and Kinesis → AI & Reporting
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming
The Data Engineer’s Journey...
Diagram (built up over eight slides): Events → Stream → Table (data gets written continuously) → AI & Reporting. Each slide adds a component and names the problem it runs into.
1. Starting point: a stream writes events to a table continuously. Problem: the Spark job gets slower with time due to small files.
2. Diagram adds: batch jobs and a second table (data gets compacted every hour). Problem: late-arriving data means processing needs to be delayed.
3. Problem: a few hours of latency doesn’t satisfy business needs.
4. Diagram adds: a Unified View over the stream and batch paths. Problem: the Lambda architecture increases operational burden.
5. Diagram adds: Validation. Problem: validations and other cleanup actions need to be done twice.
6. Diagram adds: Reprocessing. Problem: fixing mistakes means blowing up partitions and doing an atomic re-publish.
7. Diagram adds: Update & Merge. Problem: updates and merges get complex with a data lake.
8. Can this be simplified?
What was missing?
1. Ability to read consistent data while data is being written
2. Ability to read incrementally from a large table with good throughput
3. Ability to roll back in case of bad writes
4. Ability to replay historical data alongside new data that has arrived
5. Ability to handle late-arriving data without having to delay downstream processing
So… What is the answer?
Diagram: Data Lake (CSV, JSON, TXT…) and Kinesis → ? → AI & Reporting
Structured Streaming + Delta Lake = The Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Connecting the dots...
Diagram (repeated on each build of this slide): Data Lake (CSV, JSON, TXT…) and Kinesis → ? → AI & Reporting
Each missing ability is paired with the capability that provides it:
1. Ability to read consistent data while data is being written
   → Snapshot isolation between writers and readers
2. Ability to read incrementally from a large table with good throughput
   → Optimized file source with scalable metadata handling
3. Ability to roll back in case of bad writes
   → Time travel
4. Ability to replay historical data alongside new data that has arrived
   → Stream the backfilled historical data through the same pipeline
5. Ability to handle late-arriving data without having to delay downstream processing
   → Stream any late-arriving data as it gets added to the table
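To make the first three pairings concrete, here is a minimal sketch of a versioned, append-only table in plain Python. It is a toy model for illustration only, not Delta Lake's implementation or API: the class `ToyVersionedTable` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Toy append-only commit log; NOT Delta Lake's actual implementation."""

    # commits[i] holds the rows added by version i
    commits: list = field(default_factory=list)

    def commit(self, rows):
        """Atomically publish a batch of rows and return the new version number."""
        self.commits.append(list(rows))
        return len(self.commits) - 1

    def latest_version(self):
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Read every row visible as of `version` (time travel); default is latest."""
        if version is None:
            version = self.latest_version()
        return [row for batch in self.commits[: version + 1] for row in batch]

    def changes_since(self, version):
        """Incremental read: only rows committed after `version`."""
        return [row for batch in self.commits[version + 1 :] for row in batch]


table = ToyVersionedTable()
v0 = table.commit([{"id": 1}, {"id": 2}])

# A reader pins the version it started with, so a concurrent write does not
# change what it sees (snapshot isolation).
reader_version = table.latest_version()
v1 = table.commit([{"id": 3}])
pinned_view = table.snapshot(reader_version)

# Incremental read: a consumer that has already processed v0 only needs v1.
new_rows = table.changes_since(v0)

# Time travel: read the table as it was before the second write.
before_second_write = table.snapshot(v0)
```

Because every write is a new version rather than an in-place change, rolling back a bad write amounts to reading (or restoring) an earlier version.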
The Delta Architecture
A continuous data flow model to unify batch & streaming
Diagram: Data Lake (CSV, JSON, TXT…) and Kinesis → Streaming Analytics and AI & Reporting
Characteristics of the Delta Architecture
#1. Adopt a continuous data flow model
Stream to and from a Delta Lake table whenever possible.
● Unify batch and streaming: same engine, same APIs, same user code. No need to reason about system complexities separately.
● Incrementally load new data efficiently. No need to manage state to track which new files have been added.
● Process data quickly as it arrives, without delays.
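The "same user code" point can be sketched in plain Python: one transformation function serves both a one-off batch run and incremental micro-batches. This is a toy illustration, not Structured Streaming itself; `clean` and `IncrementalReader` are hypothetical names, and the stored offset stands in for the checkpoint a streaming engine would manage.

```python
def clean(events):
    """Shared business logic: drop events without a user and normalize the type."""
    return [
        {"user": e["user"], "type": e["type"].lower()}
        for e in events
        if e.get("user") is not None
    ]


class IncrementalReader:
    """Remembers how far into the source it has read (a stand-in for a checkpoint)."""

    def __init__(self, source):
        self.source = source
        self.offset = 0

    def read_new(self):
        new_events = self.source[self.offset :]
        self.offset = len(self.source)
        return new_events


source = [{"user": "a", "type": "CLICK"}, {"user": None, "type": "VIEW"}]
reader = IncrementalReader(source)
streamed = []

streamed += clean(reader.read_new())          # first micro-batch
source.append({"user": "b", "type": "View"})  # new data arrives
streamed += clean(reader.read_new())          # picks up only the new event

batch = clean(source)  # one-off batch run over the full source, same function
```

Both paths produce the same rows, which is the property that removes the need for separate batch and streaming code.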
#2. Use Intermediate Hops
Diagram: a source table feeds transformations T1 to T7, which write the destination tables. In the first build all intermediate DataFrames stay in memory; later builds insert one and then two intermediate tables between the transformations.
Materialize DataFrames wherever applicable, especially when a large number of transformations is involved. Materialization helps with:
● Fault recovery
● Easy troubleshooting
● Serving multiple expected consumers
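A minimal sketch of the idea in plain Python, using a JSON file as a stand-in for an intermediate table. The stage names (`parse`, `total_per_user`) and the file layout are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

calls = {"parse": 0}  # counts how often the upstream stage actually runs


def parse(raw_lines):
    """Upstream stage (hypothetical): parse raw "user,amount" lines."""
    calls["parse"] += 1
    return [
        {"user": u, "amount": int(a)}
        for u, a in (line.split(",") for line in raw_lines)
    ]


def materialize(rows, path):
    """Persist an intermediate result so later stages can restart from it."""
    path.write_text(json.dumps(rows))
    return path


def load(path):
    return json.loads(path.read_text())


def total_per_user(rows):
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals


workdir = Path(tempfile.mkdtemp())
hop = materialize(parse(["a,1", "b,2", "a,3"]), workdir / "parsed.json")

# Two consumers (or a retry after a downstream failure) read the materialized
# hop instead of re-running parse().
totals = total_per_user(load(hop))
row_count = len(load(hop))
```

The materialized file can also be inspected directly when troubleshooting, which is harder with DataFrames that only ever exist in memory.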
#3. Cost vs Latency Tradeoff
1. Streams (data arriving continuously): Keep an always-on cluster continuously processing data.
2. Frequent batches (data arriving every few minutes, say every 30 mins): Use a warm pool of machines. Turn off the cluster when idle and start it when data needs to be processed. Use streaming Trigger.Once mode.
3. Infrequent batches (data arriving every few hours or days): Turn off the cluster when idle and start it when data needs to be processed. Use streaming Trigger.Once mode.
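The three cases above can be written down as a small decision helper. This is only a sketch: `RunPlan`, `plan_for`, and in particular the 60-minute cut-off between "frequent" and "infrequent" are assumptions, since the slide gives examples rather than exact thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    cluster: str  # how compute is provisioned
    trigger: str  # how the streaming query is triggered


def plan_for(arrival_interval_minutes):
    """Map how often data arrives to a plan. The 60-minute cut-off is an assumption."""
    if arrival_interval_minutes == 0:  # data arrives continuously
        return RunPlan(cluster="always-on", trigger="runs continuously")
    if arrival_interval_minutes <= 60:  # frequent batches, e.g. every 30 minutes
        return RunPlan(cluster="warm pool, stopped when idle", trigger="Trigger.Once")
    # infrequent batches: every few hours or days
    return RunPlan(cluster="started on demand, stopped when idle", trigger="Trigger.Once")
```

In all three cases the pipeline code stays a streaming query; only the provisioning and the trigger change.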
#4. Optimize storage layout
Optimize storage for good read performance on the columns used in common query predicates by:
● Partitioning on low-cardinality columns (ensure > 1 GB per partition).
○ partitionBy("date", "eventType") // There are only 100 distinct event types.
● Z-Ordering on high-cardinality columns.
○ OPTIMIZE table ZORDER BY (userId) // 100M distinct user IDs
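Why Z-ordering helps can be shown with a small sketch of the Z-order (Morton) curve: interleaving the bits of two columns yields a sort key that keeps rows close in both columns together, so each file covers a narrow range of each. This illustrates the general technique only; the engine's actual implementation may differ, and the 4 x 4 grid and file size are made-up values.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x supplies the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y supplies the odd bit positions
    return key


# Hypothetical rows: every (x, y) pair on a 4 x 4 grid.
rows = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(rows, key=lambda r: z_value(*r))

# Split the sorted rows into 4 "files" of 4 rows and record min/max per column.
files = [z_sorted[i : i + 4] for i in range(0, len(z_sorted), 4)]
x_ranges = [(min(r[0] for r in f), max(r[0] for r in f)) for f in files]
y_ranges = [(min(r[1] for r in f), max(r[1] for r in f)) for f in files]
```

Here each file ends up covering half of the range of `x` and half of the range of `y`, so a filter on either column can skip two of the four files based on min/max statistics alone. Sorting by `x` only would give tight `x` ranges but leave every file spanning the full range of `y`.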
#5. Reprocessing
Infinite retention of raw data + stream = trivial recomputation
● Simply clear out the result table and restart the stream
● Leverage cloud elasticity to quickly process the initial backfill
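The "clear the result table and restart the stream" recipe can be sketched in plain Python. `ToyStream` is a hypothetical toy, not a streaming engine: the raw log is a list that is never truncated (infinite retention), and the offset stands in for a streaming checkpoint.

```python
class ToyStream:
    """Toy stream from a fully retained raw log into a result table (a list)."""

    def __init__(self, raw_log, transform):
        self.raw_log = raw_log
        self.transform = transform
        self.result = []
        self.offset = 0  # stand-in for a streaming checkpoint

    def run(self):
        """Process whatever has arrived since the last run."""
        for event in self.raw_log[self.offset :]:
            self.result.append(self.transform(event))
        self.offset = len(self.raw_log)

    def reprocess(self, transform):
        """Swap in corrected logic, clear the result table, replay from the start."""
        self.transform = transform
        self.result.clear()
        self.offset = 0
        self.run()


raw_log = [{"cents": 150}, {"cents": 275}]
stream = ToyStream(raw_log, transform=lambda e: e["cents"] / 10)  # bug: wrong divisor
stream.run()
buggy_result = list(stream.result)

stream.reprocess(lambda e: e["cents"] / 100)  # fixed logic, full replay
raw_log.append({"cents": 50})
stream.run()  # new events flow through the same corrected pipeline
```

The backfill and the ongoing processing use the same code path, so fixing a mistake does not require a separate batch job.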
#6. Tune Data Quality
● Merge schemas automatically for raw ingestion tables: make sure you capture all the raw events without ignoring any data.
● Enforce schema on write for high-quality analytics tables: make sure the data is clean and ready for analytics by enforcing schema restrictions (and, in the future, data expectations).
Diagram: CSV, JSON, TXT… and Kinesis → Bronze (raw ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → Streaming Analytics and AI & Reporting, all on the Data Lake
* Data Quality Levels *
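The difference between merging schemas on raw ingestion and enforcing a schema on analytics tables can be sketched in plain Python. The functions `merge_schema` and `enforce_schema` are hypothetical, and a schema is modeled as a simple column-to-type-name mapping; this is not how a real table format stores or checks schemas.

```python
def merge_schema(table_schema, record):
    """Raw ingestion: widen the schema so no incoming field is dropped."""
    merged = dict(table_schema)
    for column, value in record.items():
        merged.setdefault(column, type(value).__name__)
    return merged


def enforce_schema(table_schema, record):
    """Analytics-table write: reject records that do not match the declared schema."""
    actual = {column: type(value).__name__ for column, value in record.items()}
    if actual != table_schema:
        raise ValueError(f"schema mismatch: expected {table_schema}, got {actual}")
    return record


raw_schema = {"user": "str"}
event = {"user": "a", "amount": 3}

# Raw table: the unexpected "amount" column is captured, not ignored.
raw_schema = merge_schema(raw_schema, event)

# Analytics table: the same event is rejected until the schema is changed on purpose.
clean_schema = {"user": "str"}
try:
    enforce_schema(clean_schema, event)
    rejected = False
except ValueError:
    rejected = True
```

Permissive at the raw end and strict at the analytics end means nothing is lost on ingestion while downstream consumers still get a stable contract.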
Summary of the key characteristics
1. Adopt a continuous data flow model to unify batch and streaming
2. Use intermediate hops to improve reliability and troubleshooting
3. Make the cost vs latency tradeoff based on your use cases and business needs
4. Optimize the storage layout based on the access patterns
5. Reprocess the historical data as needed by simply clearing the result table and restarting the stream
6. Incrementally improve the quality of your data until it is ready for consumption with schema management options and data expectations
Benefits of the Delta Architecture
1. Reduce end-to-end pipeline SLA.
a. Organizations reduced pipeline SLAs from days and hours to minutes.
2. Reduce pipeline maintenance burden.
a. Eliminate lambda architectures for minute-latency use cases.
3. Handle updates and deletes easily.
a. Change data capture, GDPR, sessionization, and deduplication use cases simplified.
4. Lower infrastructure costs with elastic, independent compute & storage.
a. Organizations reduce infrastructure costs by up to 10x.
Thank you
paulo@databricks.com
Website: https://delta.io
Community (Slack/Email): https://delta.io/#community

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 

Delta Architecture

  • 1. The Delta Architecture: A step beyond Lambda Architecture. Paulo Gutierrez, Spark Meetup Tokyo #2 (Spark+AI Summit EU 2019). Credits to Prakash Chockalingam for various Delta slides, used with permission.
  • 2. A Data Engineer’s Dream... Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; AI & Reporting]
  • 3. The Data Engineer’s Journey... Spark jobs get slower over time due to small files. [Diagram labels: Events; Stream; Table (data gets written continuously); AI & Reporting]
  • 4. The Data Engineer’s Journey... Late-arriving data means processing needs to be delayed. [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); AI & Reporting]
  • 5. The Data Engineer’s Journey... A latency of a few hours doesn’t satisfy business needs. [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); AI & Reporting]
  • 6. The Data Engineer’s Journey... The Lambda architecture increases the operational burden. [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); Unified View; AI & Reporting]
  • 7. The Data Engineer’s Journey... Validations and other cleanup actions need to be done twice. [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); Validation; Unified View; AI & Reporting]
  • 8. The Data Engineer’s Journey... Fixing mistakes means blowing up partitions and doing an atomic re-publish. [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); Validation; Reprocessing; Unified View; AI & Reporting]
  • 9. The Data Engineer’s Journey... Updates and merges get complex with a data lake. [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); Validation; Reprocessing; Update & Merge; Unified View; AI & Reporting]
  • 10. The Data Engineer’s Journey... Updates and merges get complex with a data lake. Can this be simplified? [Diagram labels: Events; Stream; Table (data gets written continuously); Batch; Table (data gets compacted every hour); Validation; Reprocessing; Update & Merge; Unified View; AI & Reporting]
  • 11. What was missing? 1. Ability to read consistent data while data is being written 2. Ability to read incrementally from a large table with good throughput 3. Ability to roll back in case of bad writes 4. Ability to replay historical data alongside newly arrived data 5. Ability to handle late-arriving data without having to delay downstream processing [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; ?; AI & Reporting]
  • 12. So… What is the answer? Structured Streaming + [Delta Lake] = The Delta Architecture 1. Unify batch & streaming with a continuous data flow model 2. Infinite retention to replay/reprocess historical events as needed 3. Independent, elastic compute and storage to scale while balancing costs
  • 13. Connecting the dots... 1. Ability to read consistent data while data is being written: snapshot isolation between writers and readers. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; ?; AI & Reporting]
  • 14. Connecting the dots... 1. Ability to read consistent data while data is being written: snapshot isolation between writers and readers. 2. Ability to read incrementally from a large table with good throughput: optimized file source with scalable metadata handling. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; ?; AI & Reporting]
  • 15. Connecting the dots... 1. Ability to read consistent data while data is being written: snapshot isolation between writers and readers. 2. Ability to read incrementally from a large table with good throughput: optimized file source with scalable metadata handling. 3. Ability to roll back in case of bad writes: time travel. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; ?; AI & Reporting]
  • 16. Connecting the dots... 1. Ability to read consistent data while data is being written: snapshot isolation between writers and readers. 2. Ability to read incrementally from a large table with good throughput: optimized file source with scalable metadata handling. 3. Ability to roll back in case of bad writes: time travel. 4. Ability to replay historical data alongside newly arrived data: stream the backfilled historical data through the same pipeline. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; ?; AI & Reporting]
  • 17. Connecting the dots... 1. Ability to read consistent data while data is being written: snapshot isolation between writers and readers. 2. Ability to read incrementally from a large table with good throughput: optimized file source with scalable metadata handling. 3. Ability to roll back in case of bad writes: time travel. 4. Ability to replay historical data alongside newly arrived data: stream the backfilled historical data through the same pipeline. 5. Ability to handle late-arriving data without having to delay downstream processing: stream any late-arriving data as it gets added to the table. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; ?; AI & Reporting]
  • 18. Connecting the dots... 1. Ability to read consistent data while data is being written: snapshot isolation between writers and readers. 2. Ability to read incrementally from a large table with good throughput: optimized file source with scalable metadata handling. 3. Ability to roll back in case of bad writes: time travel. 4. Ability to replay historical data alongside newly arrived data: stream the backfilled historical data through the same pipeline. 5. Ability to handle late-arriving data without having to delay downstream processing: stream any late-arriving data as it gets added to the table. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; AI & Reporting]
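The time travel capability listed above can be illustrated with a minimal sketch. It assumes a SparkSession that already has Delta Lake configured; the function names, the table path and the version number are hypothetical and are not taken from the slides. `versionAsOf` and `timestampAsOf` are the Delta Lake reader options for reading an earlier snapshot.

```python
def read_version(spark, table_path, version):
    """Return the table as it looked at a given commit version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # Delta time travel by version number
        .load(table_path)
    )


def read_as_of(spark, table_path, timestamp):
    """Return the table as it looked at a given timestamp string."""
    return (
        spark.read.format("delta")
        .option("timestampAsOf", timestamp)  # Delta time travel by timestamp
        .load(table_path)
    )
```

For example, `read_version(spark, "/data/events", 3)` would return a DataFrame over the snapshot written by commit 3, which can be compared against the current table to inspect a bad write before deciding how to repair it.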
  • 19. The Delta Architecture: a continuous data flow model to unify batch & streaming. [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; Streaming Analytics; AI & Reporting]
  • 21. #1. Adopt a continuous data flow model. Stream to and from a Delta Lake table whenever possible. ● Unify batch and streaming: same engine, same APIs, same user code. No need to reason about system complexities separately. ● Incrementally load new data efficiently. No need to manage state about which files are new. ● Process data quickly as it arrives, without any delays.
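Streaming from one Delta Lake table into another might look like the sketch below. It assumes a SparkSession with Delta Lake configured; the function name, the paths and the `eventType` filter are hypothetical placeholders standing in for real cleanup logic. The checkpoint location is what lets the stream track which data it has already processed, so user code does no bookkeeping of new files.

```python
def start_bronze_to_silver(spark, bronze_path, silver_path, checkpoint_path):
    """Continuously copy cleaned rows from one Delta table to another."""
    # Read the source Delta table incrementally as a stream.
    events = spark.readStream.format("delta").load(bronze_path)

    # Hypothetical cleanup step; the same DataFrame code would work in batch.
    cleaned = events.filter("eventType IS NOT NULL")

    # Append the result to the destination Delta table.
    return (
        cleaned.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(silver_path)
    )
```

The returned streaming query keeps running until it is stopped, picking up new commits to the source table as they land.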
  • 22. #2. Use Intermediate Hops. Materialize DataFrames wherever applicable, especially when a large number of transformations is involved. Materialization could be for: ● Fault recovery ● Easy troubleshooting ● Multiple expected consumers [Diagram labels: Source Table; transformations T1 to T7; intermediate DataFrames in memory; Dest Tables]
  • 23. #2. Use Intermediate Hops. Materialize DataFrames wherever applicable, especially when a large number of transformations is involved. Materialization could be for: ● Fault recovery ● Easy troubleshooting ● Multiple expected consumers [Diagram labels: transformations T1 to T7; Intermediate Table]
  • 24. #2. Use Intermediate Hops. Materialize DataFrames wherever applicable, especially when a large number of transformations is involved. Materialization will help with: ● Fault recovery ● Easy troubleshooting ● Multiple expected consumers [Diagram labels: transformations T1 to T7; Intermediate Table; Intermediate Table]
  • 25. #3. Cost vs Latency Tradeoff 1. Streams (data arriving continuously): keep an always-on cluster continuously processing data. 2. Frequent batches (data arriving every few minutes, say 30 mins): use a warm pool of machines. Turn off the cluster when idle and start it when data needs to be processed. Use the streaming Trigger.Once mode. 3. Infrequent batches (data arriving every few hours or days): turn off the cluster when idle and start it when data needs to be processed. Use the streaming Trigger.Once mode.
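The Trigger.Once mode mentioned above can be sketched as follows, assuming a SparkSession with Delta Lake configured; the function name and the paths are hypothetical. In PySpark, Trigger.Once is requested with `trigger(once=True)`: the query processes whatever data is available, then stops, so the cluster can be shut down afterwards while the checkpoint remembers where to resume.

```python
def run_once(spark, source_path, dest_path, checkpoint_path):
    """Process all data that arrived since the last run, then stop."""
    query = (
        spark.readStream.format("delta")
        .load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)  # PySpark spelling of Trigger.Once
        .start(dest_path)
    )
    # Block until the single run has finished.
    query.awaitTermination()
    return query
```

A scheduler could call `run_once` every 30 minutes or once a day with the same checkpoint path; only the schedule changes between the frequent-batch and infrequent-batch cases. Newer Spark releases (3.3 and later) also accept `trigger(availableNow=True)`.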
  • 26. #4. Optimize storage layout. Optimize storage for good read performance on common query predicate columns by: ● Partitioning on low-cardinality columns (ensure > 1 GB per partition). ○ partitionBy(date, eventType) // there are only 100 distinct event types ● Z-Ordering on high-cardinality columns. ○ OPTIMIZE table ZORDER BY userId // 100M distinct user IDs
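The two layout techniques on this slide could be expressed roughly as below. This is a sketch under assumptions: the helper names and the table path are hypothetical, the column names are the ones used on the slide, and the `OPTIMIZE ... ZORDER BY` statement is only available where the Delta Lake runtime supports it (Databricks, or open-source Delta Lake 2.0 and later).

```python
def write_partitioned(df, table_path):
    """Append a DataFrame to a Delta table partitioned on low-cardinality columns."""
    (
        df.write.format("delta")
        .partitionBy("date", "eventType")  # few distinct values per column
        .mode("append")
        .save(table_path)
    )


def zorder_sql(table_path, columns):
    """Build an OPTIMIZE statement that Z-orders on high-cardinality columns."""
    cols = ", ".join(columns)
    return f"OPTIMIZE delta.`{table_path}` ZORDER BY ({cols})"


def optimize_layout(spark, table_path):
    """Run the Z-order optimization for the userId column."""
    return spark.sql(zorder_sql(table_path, ["userId"]))
```

Partitioning prunes whole directories for predicates on `date` or `eventType`, while Z-ordering co-locates rows with similar `userId` values inside files so that file-level statistics can skip data.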
  • 27. #5. Reprocessing. Infinite retention of raw data + stream = trivial recomputation ● Simply clear out the result table and restart the stream ● Leverage cloud elasticity to quickly process the initial backfill
  • 28. #6. Tune Data Quality ● Merge schemas automatically for raw ingestion tables: make sure you capture all the raw events without ignoring any data. ● Enforce schema on write for high-quality analytics tables: make sure the data is clean and ready for analytics by enforcing schema restrictions (and data expectations in future). [Diagram, data quality levels: Bronze (Raw Ingestion); Silver (Filtered, Cleaned, Augmented); Gold (Business-level Aggregates). Other labels: Kinesis; CSV, JSON, TXT…; Data Lake; Streaming Analytics; AI & Reporting]
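The two schema policies above map onto Delta Lake write options roughly as in this sketch. The function names and paths are hypothetical. Delta Lake enforces the table schema on write by default, and the `mergeSchema` writer option lets an append add new columns to the table schema instead of failing.

```python
def append_raw(df, bronze_path):
    """Append to a raw ingestion table, evolving the schema as needed."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # accept new columns instead of failing
        .save(bronze_path)
    )


def append_curated(df, silver_path):
    """Append to an analytics table, relying on default schema enforcement."""
    # No mergeSchema option: a write whose schema does not match the table
    # is rejected, which keeps unexpected columns out of curated tables.
    df.write.format("delta").mode("append").save(silver_path)
```

Using the permissive writer only at the raw ingestion level keeps every event, while the stricter default on downstream tables surfaces schema drift as a failed write instead of silently changing the analytics tables.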
  • 29. Summary of the key characteristics 1. Adopt a continuous data flow model to unify batch and streaming 2. Use intermediate hops to improve reliability and troubleshooting 3. Make the cost vs latency tradeoff based on your use cases and business needs 4. Optimize the storage layout based on the access patterns 5. Reprocess the historical data as needed by simply clearing the result table and restarting the stream 6. Incrementally improve the quality of your data until it is ready for consumption with schema management options and data expectations.
  • 30. Benefits of the Delta Architecture 1. Reduce end-to-end pipeline SLA. a. Organizations reduced pipeline SLAs from days and hours to minutes. 2. Reduce pipeline maintenance burden. a. Eliminate lambda architectures for minute-latency use cases. 3. Handle updates and deletes easily. a. Change data capture, GDPR, sessionization, and deduplication use cases simplified. 4. Lower infrastructure costs with elastic, independent compute & storage. a. Organizations reduce infrastructure costs by up to 10x.
  • 31. Thank you paulo@databricks.com Website: https://delta.io Community (Slack/Email): https://delta.io/#community