BUILDING STREAMING DATA WAREHOUSES
SEBASTIAN HEROLD
2019-11-05
# Principal Data Engineer / Architect
# 7y @ Immo-/Scout24
# DataDevOps Manifesto
# Data Platform
# 2y @ Zalando
# ML Productivity
# Streaming DWH
@heroldamus
Sebastian Herold
WE BRING FASHION TO PEOPLE
2008-2009
2010
2011
2012-2013
2018
17 markets
9 fulfillment centers
>28M active customers
€5.4B revenue 2018
>300M visits/month
>14k employees
>400k product choices
>80% visits from mobile
TECH@SCALE
>350 accounts
>100 clusters
>250 teams
>5 data lakes
API
>800 micro services
BEYOND CLASSICAL ANALYTICS:
ML EVERYWHERE!
WHY DOES OUR CENTRAL DWH
NO LONGER SUCCEED?
DWH History
[Diagram: evolution from Early Years through Rise of MicroServices to ML + Data Mesh Chaos; components shown: DB, DWH, Kafka, MPP, Data Lake, ML]
DRAWBACKS OF CENTRAL DWH
# Heavy integration of unstructured data into relational tables
# Datasets are needed in distributed form
# Lower latency required by AI use-cases, other data warehouses, near-realtime use-cases
# Multiple teams do the same low-latency event integration
SALES ORDER EXAMPLE
order.created: order_id, order_date, items, ...
shipment.created: order_id, shipping_date, shipped_items, ...
payment.done: payment_id, payment_date, order_id, ...
item.returned: order_id, return_date, returned_item, ...
sales-order:
    order_id, order_date, payment_id, payment_date,
    items: [shipped_at, returned_at, ...],
    calculated_1, calculated_2, ...
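
A minimal Python sketch of how the four events could be folded into one nested sales-order entity. The field names come from the slide; the `item_id` key, the representation of `items` as a list of records, and the function names `apply_event` and `build_sales_order` are assumptions for illustration, not the implementation used in the talk.

```python
from copy import deepcopy


def _item(order, item_id):
    """Return the nested record for item_id, creating it if it is missing."""
    for item in order["items"]:
        if item["item_id"] == item_id:
            return item
    item = {"item_id": item_id, "shipped_at": None, "returned_at": None}
    order["items"].append(item)
    return item


def apply_event(order, event_type, payload):
    """Fold one event into a copy of the nested sales-order entity."""
    order = deepcopy(order) if order is not None else {"items": []}
    order["order_id"] = payload["order_id"]  # every event carries the join key
    if event_type == "order.created":
        order["order_date"] = payload["order_date"]
        for item_id in payload["items"]:
            _item(order, item_id)
    elif event_type == "payment.done":
        order["payment_id"] = payload["payment_id"]
        order["payment_date"] = payload["payment_date"]
    elif event_type == "shipment.created":
        for item_id in payload["shipped_items"]:
            _item(order, item_id)["shipped_at"] = payload["shipping_date"]
    elif event_type == "item.returned":
        _item(order, payload["returned_item"])["returned_at"] = payload["return_date"]
    else:
        raise ValueError(f"unknown event type: {event_type}")
    return order


def build_sales_order(events):
    """Apply a sequence of (event_type, payload) pairs for one order."""
    order = None
    for event_type, payload in events:
        order = apply_event(order, event_type, payload)
    return order


# Hypothetical sample events for a single order.
events = [
    ("order.created", {"order_id": "o1", "order_date": "2019-11-01",
                       "items": ["sku-1", "sku-2"]}),
    ("payment.done", {"payment_id": "p1", "payment_date": "2019-11-02",
                      "order_id": "o1"}),
    ("shipment.created", {"order_id": "o1", "shipping_date": "2019-11-03",
                          "shipped_items": ["sku-1", "sku-2"]}),
    ("item.returned", {"order_id": "o1", "return_date": "2019-11-10",
                       "returned_item": "sku-2"}),
]
sales_order = build_sales_order(events)
```

Because missing item records are created on demand, the fold also tolerates events that arrive out of order, for example a shipment event seen before the order event.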
ADVANTAGE OF NESTED STRUCTURE
# Contains the whole entity
# Can be streamed as new event
# Can be flattened for analytics
# Easier feature extraction for ML
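
The last two points can be sketched in a few lines of Python. This is an illustration under assumptions: `items` is modelled as a list of records with a hypothetical `item_id` key, and `flatten_sales_order` and `item_return_rate` are made-up helper names.

```python
def flatten_sales_order(order):
    """Return one flat row per item, repeating the order-level fields."""
    header = {key: value for key, value in order.items() if key != "items"}
    return [{**header, **item} for item in order["items"]]


def item_return_rate(order):
    """Example ML feature: share of items in the order that were returned."""
    items = order["items"]
    if not items:
        return 0.0
    returned = sum(1 for item in items if item["returned_at"] is not None)
    return returned / len(items)


# Hypothetical nested sales-order entity.
order = {
    "order_id": "o1",
    "order_date": "2019-11-01",
    "payment_id": "p1",
    "payment_date": "2019-11-02",
    "items": [
        {"item_id": "sku-1", "shipped_at": "2019-11-03", "returned_at": None},
        {"item_id": "sku-2", "shipped_at": "2019-11-03", "returned_at": "2019-11-10"},
    ],
}
rows = flatten_sales_order(order)   # two rows, one per item
rate = item_return_rate(order)      # 0.5
```

The flat rows suit a relational analytics table, while the feature is computed directly on the nested entity without any join.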
STREAMING ARCHITECTURE
[Diagram: Kafka, S3 (changes), S3 (snapshots); consumers shown: Real-time Analytics, Machine Learning, Stream Processing, DWHs, Machine Learning, Batch Processing]
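
The slide only names the two S3 stores, changes and snapshots. One common way to relate them, shown here as an assumption and not as the architecture from the talk, is to rebuild the current state from the latest snapshot plus every change recorded after it. The `offset`, `key`, and `value` fields and the function name `rebuild_state` are hypothetical.

```python
def rebuild_state(snapshot, changes, snapshot_offset):
    """Overlay changes newer than the snapshot onto a copy of the snapshot.

    snapshot: dict mapping entity key to its state at snapshot_offset
    changes: iterable of dicts with "offset", "key", "value" (None = delete)
    """
    state = dict(snapshot)
    for change in sorted(changes, key=lambda c: c["offset"]):
        if change["offset"] <= snapshot_offset:
            continue  # already contained in the snapshot
        if change["value"] is None:
            state.pop(change["key"], None)
        else:
            state[change["key"]] = change["value"]
    return state


# Hypothetical data: snapshot taken at offset 2, change log up to offset 5.
snapshot = {"o1": {"status": "created"}, "o2": {"status": "created"}}
changes = [
    {"offset": 1, "key": "o1", "value": {"status": "created"}},
    {"offset": 2, "key": "o2", "value": {"status": "created"}},
    {"offset": 4, "key": "o2", "value": None},
    {"offset": 3, "key": "o1", "value": {"status": "shipped"}},
    {"offset": 5, "key": "o3", "value": {"status": "created"}},
]
state = rebuild_state(snapshot, changes, snapshot_offset=2)
```

In this reading, batch consumers can start from a snapshot while stream consumers follow the change log, and reprocessing means replaying changes on top of an older snapshot.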
CHALLENGES IN STREAMING DATA INTEGRATION
# Integration of forecasts
# Find the right state store
# Cyclic dependencies
# Reprocessing
DATABRICKS DELTA FILE FORMAT
[Diagram: S3 between two streams]
# now open source
# based on parquet (columnar)
# supports batch and stream
# supports “transactions”
# schema evolution
# scalable metadata handling
# time travel
SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violated DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
LESSONS LEARNED
# Streaming needs different thinking
# DWH ~ Backend Programming
# Don’t start with SQL just because it’s easy
# Watch your data formats
# Works best with a cross-functional team
Thanks
Questions?

TDWI Schweiz 2019 - Building Streaming Data Warehouses

Editor's Notes

  • #4 2008/2009: Germany, Austria; 2010: Netherlands, France; 2011: Italy, UK, Switzerland; 2012/2013: Sweden, Belgium, Spain, Denmark, Finland, Poland, Norway, Luxembourg; 2018: Czech Republic, Ireland
  • #7 Our key competencies are our key to success: Fashion, Technology, Operations. Together, these three core competencies are the basis of our platform strategy.