Lakehouse architecture with Delta Lake & Databricks
Data engineer / Technical & Engineering lead
Dragan Beric
Agenda
● Motivation
● Architecture evolution
● Lakehouse
● Delta lake
Motivation
Top 3 challenges in enterprise data usage:
○ Access
○ Reliability
○ Timeliness
Architecture evolution
two-tier architecture
• flexible storage
• low-cost storage
• easy access for ML & DS
• data duplication
• additional jobs for data movement
• maintenance (data + development)
Architecture evolution
lakehouse architecture
• flexible storage
• low-cost storage
• easy access for ML & DS
• single source of truth
• governance
• schema validation
• ACID transactions over files
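To make "schema validation" concrete, here is a minimal sketch in plain Python of rejecting a write that does not match the table schema. It is not Delta Lake or Databricks code: `SCHEMA`, `validate_row` and `append_rows` are hypothetical names, and the table is just an in-memory list.

```python
from typing import Any

# Hypothetical table schema: column name -> expected Python type.
SCHEMA: dict[str, type] = {"Product": str, "Price": float}


def validate_row(row: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    for column in sorted(schema.keys() - row.keys()):
        problems.append(f"missing column: {column}")
    for column in sorted(row.keys() - schema.keys()):
        problems.append(f"unexpected column: {column}")
    for column, expected in schema.items():
        if column in row and not isinstance(row[column], expected):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    return problems


def append_rows(table: list[dict], rows: list[dict], schema: dict[str, type]) -> None:
    """Append all rows or none: the whole batch is rejected if any row is invalid."""
    for row in rows:
        problems = validate_row(row, schema)
        if problems:
            raise ValueError(f"schema mismatch in {row!r}: {problems}")
    table.extend(rows)


table: list[dict] = []
append_rows(table, [{"Product": "Apple", "Price": 1.5}], SCHEMA)
```

Because every row is checked before anything is appended, a bad batch leaves the table unchanged. That is the behaviour the two-tier data lake lacks when files are written without checks.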
Lakehouse
A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses.
Lakehouse
data warehouse: fast, structured, operational, governed
data lake: cheap, flexible, scalable
Delta lake
what is delta?
Databricks Delta is a unified data management system that brings reliability and performance to existing data lakes.
Delta Lake is an optimized, managed format for organizing & working with Parquet files.
“It’s Parquet, just better!”
Delta lake
challenges with Parquet
• Hard to append data
• Update / Merge not supported
• Metadata does not scale (a lot of small files)
• A lot of small Parquet files (no auto-compaction)
• Jobs failing midway
Delta lake
structure
folder that contains:
• data in Parquet
• delta log (_delta_log)
Delta lake
delta log – transactional layer
folder that contains:
• table schema
• commit info
• metadata
• a checkpoint file, written every 10 commits (transactions)
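One way to picture how a folder of commit files can act as a transactional layer is the sketch below: a writer may only create commit N if no file for N exists yet, so two concurrent writers cannot both claim the same version. This is plain Python on a local temp directory, not the Delta Lake implementation. `try_commit` and `commit_file_name` are hypothetical helpers, and the 20-digit zero-padded file name is an assumption (the slides abbreviate names as 0000.json).

```python
import json
import os
import tempfile


def commit_file_name(version: int) -> str:
    # Zero-padded so that lexical order matches commit order (width is an assumption).
    return f"{version:020d}.json"


def try_commit(log_dir: str, version: int, actions: list[dict]) -> bool:
    """Write one commit file; return False if that version already exists."""
    path = os.path.join(log_dir, commit_file_name(version))
    try:
        # O_EXCL makes creation fail if the file exists, so only one writer wins.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")  # one action per line
    return True


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "part-01.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-99.parquet"}}])
```

Here `first` succeeds and `second` is refused, so the losing writer has to re-read the log and retry against the next version instead of silently overwriting the winner.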
Delta lake
query execution plan
• Query is received
• Processing _delta_log
• Find and read latest checkpoint file
• Read transactions after the checkpoint
• Read data referenced by checkpoint and transactions
• Return results
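The steps above can be sketched as a small log replay. This is an illustration in plain Python, not Delta Lake code: the log is modelled as in-memory dictionaries, `live_files` is a hypothetical function, and the file names and version numbers in the example are made up.

```python
def live_files(checkpoints: dict[int, set[str]], commits: dict[int, list[dict]]) -> set[str]:
    """Replay the log: start from the latest checkpoint, then apply later commits."""
    # Find and read the latest checkpoint (empty state if there is none).
    checkpoint_version = max(checkpoints, default=-1)
    files = set(checkpoints.get(checkpoint_version, set()))
    # Apply only the transactions written after the checkpoint, in order.
    for version in sorted(v for v in commits if v > checkpoint_version):
        for action in commits[version]:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    # The caller then reads exactly these data files to answer the query.
    return files


# Made-up example: a checkpoint at version 10 and three commit files.
checkpoints = {10: {"part-07.parquet", "part-09.parquet"}}
commits = {
    9: [{"add": {"path": "part-old.parquet"}}],  # already covered by the checkpoint
    11: [{"remove": {"path": "part-07.parquet"}}, {"add": {"path": "part-12.parquet"}}],
    12: [{"add": {"path": "part-13.parquet"}}],
}
result = sorted(live_files(checkpoints, commits))
```

Commit 9 is never read because the checkpoint already summarises everything up to version 10, which is why the checkpoint keeps query planning cheap as the log grows.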
Delta lake
DML operations – merge/update
table before:
Product   Price($)
Apple     1.5
Banana    0.53
Lemon     1.42
statement:
UPDATE DimProduct
SET Price = 1
WHERE Product = 'Lemon'
_delta_log:
0000.json   "add":{"part-01.parquet",…}
0001.json   "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…}
storage before: part-01 (3 rows)
storage after: part-01 (3 rows), part-02 (3 rows)
table after:
Product   Price($)
Apple     1.5
Banana    0.53
Lemon     1
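The update shown here is a copy-on-write: the old file is never modified, a new file is written, and the commit records one "remove" and one "add". A minimal sketch in plain Python, assuming storage is a dictionary of file name to rows; `copy_on_write_update` is a hypothetical helper, not a Delta Lake API.

```python
Row = tuple[str, float]


def copy_on_write_update(
    storage: dict[str, list[Row]],
    old_file: str,
    new_file: str,
    product: str,
    new_price: float,
) -> list[dict]:
    """Apply an UPDATE by writing a new file; return the actions for the commit."""
    rewritten = [
        (name, new_price if name == product else price)
        for name, price in storage[old_file]
    ]
    storage[new_file] = rewritten  # the old file stays untouched on storage
    return [{"remove": {"path": old_file}}, {"add": {"path": new_file}}]


storage = {"part-01.parquet": [("Apple", 1.5), ("Banana", 0.53), ("Lemon", 1.42)]}
actions = copy_on_write_update(
    storage, "part-01.parquet", "part-02.parquet", "Lemon", 1.0
)
```

After the call both files exist on storage, each with 3 rows, matching the slide. Only the log decides that part-02 is the current one.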
Delta lake
DML operations – delete
table before:
Product   Price($)
Apple     1.5
Banana    0.53
Lemon     1
statement:
DELETE FROM DimProduct
WHERE Product = 'Lemon'
_delta_log:
0001.json   "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…}
0002.json   "remove":{"part-02.parquet",…}, "add":{"part-03.parquet",…}
storage before: part-01 (3 rows), part-02 (3 rows)
storage after: part-01 (3 rows), part-02 (3 rows), part-03 (2 rows)
table after:
Product   Price($)
Apple     1.5
Banana    0.53
Delta lake
DML operations – insert
table before:
Product   Price($)
Apple     1.5
Banana    0.53
statement:
INSERT INTO DimProduct
VALUES ('Mango', 3)
_delta_log:
0002.json   "remove":{"part-02.parquet",…}, "add":{"part-03.parquet",…}
0003.json   "add":{"part-04.parquet",…}
storage before: part-01 (3 rows), part-02 (3 rows), part-03 (2 rows)
storage after: part-01 (3 rows), part-02 (3 rows), part-03 (2 rows), part-04 (1 row)
table after:
Product   Price($)
Apple     1.5
Banana    0.53
Mango     3
Delta lake
features – conclusion
• ACID transactions
• indexing
• metadata scaling
• governance
• schema validation
• time travel
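Time travel follows from the log structure in the DML slides: because old files stay on storage and every commit is numbered, a reader can replay the log only up to a chosen version. A minimal sketch in plain Python using the four commits from the slides (0000.json to 0003.json); `files_as_of` is a hypothetical function, not a Delta Lake API.

```python
# The four commits from the DML slides, keyed by version number.
COMMITS: dict[int, list[dict]] = {
    0: [{"add": {"path": "part-01.parquet"}}],
    1: [{"remove": {"path": "part-01.parquet"}}, {"add": {"path": "part-02.parquet"}}],
    2: [{"remove": {"path": "part-02.parquet"}}, {"add": {"path": "part-03.parquet"}}],
    3: [{"add": {"path": "part-04.parquet"}}],
}


def files_as_of(commits: dict[int, list[dict]], version: int) -> list[str]:
    """Return the data files that made up the table at the given version."""
    files: set[str] = set()
    for v in sorted(commits):
        if v > version:
            break  # ignore everything committed after the requested version
        for action in commits[v]:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


original = files_as_of(COMMITS, 0)  # before the UPDATE
after_update = files_as_of(COMMITS, 1)  # after the UPDATE, before the DELETE
latest = files_as_of(COMMITS, 3)  # after the INSERT
```

Reading version 0 resolves to part-01 alone, so the table with Lemon at 1.42 is still queryable as long as that file has not been removed from storage.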
HTEC group | USA, San Francisco | 535 Mission St, 14th floor | +1 415 490 8175 | office-sf@htecgroup.com | htecgroup.com
Q&A

[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Dragan Beric
