Zalando's AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate data that is already outdated. Modern data integration pipelines need to deliver high-quality data sets that are fast and easy to consume. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and as snapshots at the same time. The talk covers challenges in our fashion data platform and gives a detailed architectural deep dive into separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta's MERGE command, Scala API vs. Spark SQL, and schema evolution give more insights and guidance for similar use cases.
7. DWH History
[Diagram. Phases: Early Years, Rise of MicroServices, ML + Data Mesh Chaos. Components shown: DB, DWH, Kafka, MPP, Data Lake, ML]
10. DRAWBACKS OF CENTRAL DWH
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
11. DRAWBACKS OF CENTRAL DWH
MULTIPLE TEAMS DO SAME LOW-LATENCY EVENT INTEGRATION
12. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES
DISTRIBUTED DATASETS ARE NEEDED
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
MULTIPLE TEAMS DO SAME LOW-LATENCY EVENT INTEGRATION
14. ADVANTAGE OF NESTED STRUCTURE
sales-order
  order_id,
  order_date,
  payment_id,
  payment_date,
  items:
    shipped_at,
    returned_at,
    ...
  calculated_1,
  calculated_2
  ...
# Contains the whole entity
# Can be streamed as new event
# Can be flattened for analytics
# Easier feature extraction for ML
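The "can be flattened for analytics" point can be illustrated with a minimal sketch in plain Python. It assumes that `shipped_at` and `returned_at` live on each entry of `items`, which the slide only hints at; the function name `flatten_order`, the `item_` prefix, and all sample values are hypothetical.

```python
from typing import Any


def flatten_order(order: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one flat row per item, repeating the order-level fields."""
    # Order-level fields are everything except the nested item list.
    header = {k: v for k, v in order.items() if k != "items"}
    # Prefix item fields so they cannot collide with order-level names.
    return [
        {**header, **{f"item_{k}": v for k, v in item.items()}}
        for item in order.get("items", [])
    ]


# Hypothetical sample entity following the field names on the slide.
sales_order = {
    "order_id": "o-1",
    "order_date": "2019-10-01",
    "payment_id": "p-1",
    "payment_date": "2019-10-02",
    "items": [
        {"shipped_at": "2019-10-03", "returned_at": None},
        {"shipped_at": "2019-10-04", "returned_at": "2019-10-10"},
    ],
}

rows = flatten_order(sales_order)
for row in rows:
    print(row)
```

The nested entity stays the single source of truth; the flat, one-row-per-item view is derived from it on demand.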
16. CHALLENGES IN STREAMING DATA INTEGRATION
# Integration of forecasts
# Find the right state store
# Cyclic dependencies
# Reprocessing
17. DATABRICKS DELTA FILE FORMAT
[Diagram: stream, S3, stream]
# now open source
# based on parquet (columnar)
# supports batch and stream
# supports “transactions”
# schema evolution
# scalable metadata handling
# time travelling
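The abstract mentions serving streams and snapshots together and using Delta's MERGE command. The sketch below is not the Delta API; it is a plain-Python, in-memory illustration of the upsert idea behind such a merge: an incoming event updates the row when the key already exists and is inserted otherwise. The function name `merge_into_snapshot` and the fields `order_id`, `status`, and `updated_at` are assumptions made for this example.

```python
from typing import Any

Row = dict[str, Any]


def merge_into_snapshot(
    snapshot: dict[str, Row],
    events: list[Row],
    key: str = "order_id",
    version: str = "updated_at",
) -> dict[str, Row]:
    """Upsert a batch of events into a copy of the snapshot.

    Matched key: replace the row unless the event is older than it.
    Unmatched key: insert the event as a new row.
    """
    result = dict(snapshot)  # leave the previous snapshot untouched
    for event in events:
        current = result.get(event[key])
        if current is None or event[version] >= current[version]:
            result[event[key]] = event
    return result


# Hypothetical micro-batches arriving from a stream.
batch_1 = [
    {"order_id": "o-1", "status": "placed", "updated_at": 1},
    {"order_id": "o-2", "status": "placed", "updated_at": 1},
]
batch_2 = [
    {"order_id": "o-1", "status": "shipped", "updated_at": 2},
    {"order_id": "o-2", "status": "stale", "updated_at": 0},  # late, older event
]

snapshot_v1 = merge_into_snapshot({}, batch_1)
snapshot_v2 = merge_into_snapshot(snapshot_v1, batch_2)
print(snapshot_v2)
```

Comparing versions before overwriting is one possible way to keep late or reprocessed events from clobbering newer state; whether the team handled it this way is not stated in the slides.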
18. SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violated DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
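The DRY and unit-testing points apply to any general-purpose language. As a rough illustration, sketched here in Python rather than the Scala discussed in the talk, a rule that would be copied into every timestamp column of a SQL statement becomes one function that is reused across nested fields and tested in isolation. The names `to_utc_iso`, `normalize_item`, `normalize_order`, and `TIMESTAMP_FIELDS` are hypothetical, and the normalization rule itself is an assumed example.

```python
from datetime import datetime, timezone
from typing import Any, Optional

# Assumed item-level timestamp fields, taken from the sales-order slide.
TIMESTAMP_FIELDS = {"shipped_at", "returned_at"}


def to_utc_iso(raw: Optional[str]) -> Optional[str]:
    """Normalize an ISO-8601 timestamp with a UTC offset to UTC; keep None."""
    if raw is None:
        return None
    return datetime.fromisoformat(raw).astimezone(timezone.utc).isoformat()


def normalize_item(item: dict[str, Any]) -> dict[str, Any]:
    """Apply the single timestamp rule to every timestamp field of an item."""
    return {
        field: (to_utc_iso(value) if field in TIMESTAMP_FIELDS else value)
        for field, value in item.items()
    }


def normalize_order(order: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the order with all nested items normalized."""
    return {**order, "items": [normalize_item(i) for i in order.get("items", [])]}


# A unit test needs no cluster and no database, only plain values.
order = {
    "order_id": "o-1",
    "items": [{"shipped_at": "2019-10-03T12:00:00+02:00", "returned_at": None}],
}
normalized = normalize_order(order)
assert normalized["items"][0]["shipped_at"] == "2019-10-03T10:00:00+00:00"
assert normalized["items"][0]["returned_at"] is None
```

Changing the rule later means touching one function, and refactoring is covered by the test.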
19. LESSONS LEARNED
# Streaming needs different thinking
# DWH ~ Backend Programming
# Don’t start with SQL just because it’s easy
# Watch your data formats
# Works best with a cross-functional team
Our key competencies are our key to success:
Fashion
Technology
Operations
Together, these three core competencies are the basis of our platform strategy