This document summarizes a talk on the journey of building streaming data pipelines at Zalando. It describes how Zalando transitioned from a centralized data warehouse to Spark Streaming for processing data in near real-time from sources such as Kafka, which allows unstructured data to be processed at scale with low latency. The talk also covers practices such as using Scala instead of SQL for streaming ETL, persisting state, integrating multiple streaming applications, and using Databricks Delta as the storage format for the data lake.
2. Journey of Building Data Pipelines - BEDCon ‘19
Sebastian Herold (@heroldamus)
# Principal Data Engineer / Architect
# 7y @ Immo-/Scout24
# DataDevOps Manifesto
# Data Platform
# 2y @ Zalando
# ML Productivity
# Streaming DWH
3. WE BRING FASHION TO PEOPLE
[Timeline graphic: 2008-2009, 2010, 2011, 2012-2013, 2018]
# 17 markets
# 9 fulfillment centers
# >28M active customers
# €5.4B revenue in 2018
# >300M visits/month
# >14k employees
# >400k product choices
# >80% of visits from mobile
4. TECH@SCALE
# >350 accounts
# >100 clusters
# >250 teams
# >5 data lakes
# >800 microservices
# APIs
5. JOURNEY OF PAIN@SCALE WITH THE DATA WAREHOUSE
[Architecture evolution diagram: Early Years (2008): Apps -> DB -> DWH -> MSTR; Rise of Microservices: Apps -> Kafka -> MPP; ML + Data Mesh Chaos: ML training jobs, a Data Lake, and multiple DWHs]
8. DRAWBACKS OF CENTRAL DWH
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
9. DRAWBACKS OF CENTRAL DWH
MULTIPLE TEAMS DO THE SAME LOW-LATENCY EVENT INTEGRATION
10. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES
DISTRIBUTED DATASETS ARE NEEDED
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
MULTIPLE TEAMS DO THE SAME LOW-LATENCY EVENT INTEGRATION
⇒ STREAMING
11. HOW CAN SPARK STREAMING HELP?
# Easy processing of unstructured data
# Distribution via S3 and Kafka
# Low latency by design
# Single integration to avoid duplication
# Scalable infrastructure
# Spread heavy calculations over the whole day
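To make these bullets concrete, here is a minimal sketch of a Spark Structured Streaming application that parses unstructured JSON events from Kafka and distributes the result via S3. It is not Zalando's actual job: the topic name, broker address, bucket, and event schema are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object OrderEventIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("order-event-ingest").getOrCreate()

    // Assumed schema of the order.created event (illustrative only)
    val orderSchema = new StructType()
      .add("order_id", StringType)
      .add("order_date", TimestampType)
      .add("items", ArrayType(StringType))

    // Read raw JSON events from a Kafka topic (topic and broker are placeholders)
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "order.created")
      .load()

    // Parse the unstructured payload into typed columns
    val orders = raw
      .select(from_json(col("value").cast("string"), orderSchema).as("event"))
      .select("event.*")

    // Continuously write the result to S3 as Parquet, with a checkpoint for recovery
    orders.writeStream
      .format("parquet")
      .option("path", "s3://example-bucket/orders/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
      .start()
      .awaitTermination()
  }
}
```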
12. SALES ORDER EXAMPLE
Input events:
# order.created: order_id, order_date, items, ...
# shipment.created: order_id, shipping_date, shipped_items, ...
# payment.done: payment_id, payment_date, order_id, ...
# item.returned: order_id, return_date, returned_item, ...
Resulting sales-order entity:
# sales-order: order_id, order_date, payment_id, payment_date, items (with shipped_at, returned_at, ...), calculated_1, calculated_2, ...
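As a rough illustration, the sales-order entity above could be modeled as Scala case classes that the four event types fold into; field and type details beyond what the slide shows are assumptions.

```scala
import java.sql.Timestamp

// Illustrative sketch of the target sales-order entity built from the four event types
case class OrderItem(
  itemId: String,
  shippedAt: Option[Timestamp] = None,   // filled from shipment.created
  returnedAt: Option[Timestamp] = None   // filled from item.returned
)

case class SalesOrder(
  orderId: String,
  orderDate: Timestamp,                  // from order.created
  paymentId: Option[String] = None,      // from payment.done
  paymentDate: Option[Timestamp] = None,
  items: Seq[OrderItem] = Seq.empty
)
```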
13. HOW DOES THE STREAMING APP WORK?
[Architecture diagram: microservices (MS) publish to Kafka topics; the streaming app consumes them, keeps a state store, and writes to S3 and an output topic; historic DWH data on S3 is loaded via a Bootstrap/Snapshotter]
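A minimal sketch of this topology, assuming placeholder broker, bucket, and checkpoint paths: the four event topics from the sales-order example are read into one stream, and the (elided) merge result is written both to S3 and to a downstream Kafka topic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SalesOrderTopology {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sales-order-streaming").getOrCreate()

    // One source per input topic; names follow the sales-order example and are assumptions
    def topic(name: String) = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", name)
      .load()
      .select(col("key").cast("string"), col("value").cast("string"))

    val events = topic("order.created")
      .union(topic("shipment.created"))
      .union(topic("payment.done"))
      .union(topic("item.returned"))

    // ... merge events into sales-order records here (see the state sketch on the next slide) ...
    val salesOrders = events

    // Sink 1: the data lake on S3
    salesOrders.writeStream
      .format("parquet")
      .option("path", "s3://example-bucket/sales-order/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/sales-order-s3/")
      .start()

    // Sink 2: a downstream Kafka topic for other consumers
    salesOrders.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "sales-order")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/sales-order-kafka/")
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```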
14. HOW TO PERSIST THE STATE?
# Find the needle in a haystack
# 0.02% of orders touched per hour
# >200GB in size, growing fast
# Even old orders are touched
# Rebootstrapping ingests >500M items
# Currently: on the cluster
# Thinking of HBase in the future
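One way per-order state can live "on the cluster" is Spark's built-in state store via flatMapGroupsWithState. The sketch below uses simplified, invented event and state types; moving the state to HBase instead would mean replacing GroupState with explicit reads and writes against that store.

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.{Dataset, SparkSession}

// Simplified event and state types for illustration
case class OrderEvent(orderId: String, eventType: String, payload: String)
case class OrderState(orderId: String, events: Seq[OrderEvent])

object OrderStateUpdater {
  // Merge newly arrived events for one order into its long-lived state
  def update(orderId: String,
             events: Iterator[OrderEvent],
             state: GroupState[OrderState]): Iterator[OrderState] = {
    val current = state.getOption.getOrElse(OrderState(orderId, Seq.empty))
    val updated = current.copy(events = current.events ++ events)
    state.update(updated)          // persisted in Spark's state store (on the cluster)
    Iterator(updated)              // emit the new version of the sales order
  }

  def attach(spark: SparkSession, events: Dataset[OrderEvent]): Dataset[OrderState] = {
    import spark.implicits._
    events
      .groupByKey(_.orderId)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(update)
  }
}
```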
15. HOW TO INTEGRATE MULTIPLE APPS?
[Topic dependency diagram: streaming apps chained via Kafka topics; an ML topic feeding its output back upstream creates a CYCLE, hence cycle detection is needed]
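Because chained apps and ML feedback topics form a dependency graph, a cycle check can run over the "which topic feeds which" map. This is an illustrative, self-contained sketch with invented topic names, not the detection actually used at Zalando.

```scala
// Detect cycles in a topic dependency graph via depth-first search
object TopicCycleCheck {
  def hasCycle(graph: Map[String, Set[String]]): Boolean = {
    // path = nodes on the current DFS stack, seen = nodes already explored
    def visit(node: String, path: Set[String], seen: Set[String]): (Boolean, Set[String]) = {
      if (path.contains(node)) (true, seen)          // back edge -> cycle
      else if (seen.contains(node)) (false, seen)    // already fully explored
      else graph.getOrElse(node, Set.empty).foldLeft((false, seen + node)) {
        case ((true, s), _)     => (true, s)
        case ((false, s), next) => visit(next, path + node, s)
      }
    }
    graph.keys.foldLeft((false, Set.empty[String])) {
      case ((true, s), _)  => (true, s)
      case ((false, s), n) => visit(n, Set.empty, s)
    }._1
  }

  def main(args: Array[String]): Unit = {
    val topics = Map(
      "order.created" -> Set("sales-order"),
      "sales-order"   -> Set("ml.scores"),
      "ml.scores"     -> Set("sales-order")  // ML output fed back in -> cycle
    )
    println(s"cycle detected: ${hasCycle(topics)}")  // prints: cycle detected: true
  }
}
```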
16. COMPARISON BETWEEN BATCH AND STREAMING ETL
Criteria compared for Classic DWH vs. Batch vs. Streaming:
# Scalability
# Unstructured data
# Low latency
# Efficiency
# MTTR
# Connectivity
17. SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violates the DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
⇒ SCALA
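The move to Scala pays off because the transformation logic becomes plain functions that are reusable and unit-testable without a cluster, instead of being repeated inside long SQL strings. A hedged sketch with invented types and a made-up KPI:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative types; not the actual Zalando schema
case class Order(orderId: String, items: Seq[String], returnedItems: Seq[String])
case class OrderKpi(orderId: String, returnRate: Double)

object OrderKpis {
  // Pure function: easy to unit-test without Spark
  def returnRate(order: Order): OrderKpi = {
    val rate =
      if (order.items.isEmpty) 0.0
      else order.returnedItems.size.toDouble / order.items.size
    OrderKpi(order.orderId, rate)
  }

  // Reused unchanged in the streaming (or batch) job
  def withKpis(spark: SparkSession, orders: Dataset[Order]): Dataset[OrderKpi] = {
    import spark.implicits._
    orders.map(returnRate)
  }
}

// A unit test needs no cluster at all:
// assert(OrderKpis.returnRate(Order("1", Seq("a", "b"), Seq("a"))).returnRate == 0.5)
```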
18. DATABRICKS DELTA FILE FORMAT
[Diagram: streams writing to and reading from Delta tables on S3]
# Now open source
# Based on Parquet (columnar)
# Supports batch and streaming
# Supports “transactions”
# Schema evolution
# Scalable metadata handling
# Time travel
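A brief sketch of how a stream might target Delta instead of plain Parquet, and how the same table can be read back in batch with time travel. Paths and the version number are placeholders, and the Delta Lake library is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeltaSink {
  // Write a streaming DataFrame to a Delta table on S3 (paths are placeholders)
  def writeSalesOrders(salesOrders: DataFrame): Unit = {
    salesOrders.writeStream
      .format("delta")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/sales-order-delta/")
      .start("s3://example-bucket/delta/sales-order/")
  }

  // The same table is also a batch source, including "time travel" to an older version
  def readOlderVersion(spark: SparkSession): DataFrame =
    spark.read
      .format("delta")
      .option("versionAsOf", 42)   // hypothetical version number
      .load("s3://example-bucket/delta/sales-order/")
}
```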
19. LESSONS LEARNED
# Streaming needs different thinking
# DWH ~ backend programming
# Don’t start with SQL just because it’s easy
# Databricks Delta supersedes plain Parquet
# Make sure all data is available in S3