SlideShare a Scribd company logo
1 of 20
Download to read offline
JOURNEY OF
BUILDING
STREAMING
DATA PIPELINES
STREAMING DATA WAREHOUSING
SEBASTIAN HEROLD
2019-09-06
2
# Principal Data Engineer / Architect
# 7y @ Immo-/Scout24
# DataDevOps Manifesto
# Data Platform
# 2y @ Zalando
# ML Productivity
# Streaming DWH
@heroldamus
Journey of Building Data Pipelines - BEDCon ‘19
Sebastian Herold
3
WE BRING FASHION TO PEOPLE
2008-2009
2010
2012-2013
2011
2018
17 markets
9 fulfillment centers
>28M active customers
5.4B revenue 2018
>300M visits/month
>14k employees
>400k product choices
>80% visits from mobile
4
TECH@SCALE
Journey of Building Data Pipelines - BEDCon ‘19
>350 accounts
>100 clusters
>250 teams
>5 data lakes
API
>800 micro services
5
JOURNEY OF PAIN@SCALE WITH THE DATA WAREHOUSE
Journey of Building Data Pipelines - BEDCon ‘19
Early Years2008
DB
App App App
DWH
MSTR
App
App
App
App
App
App
App
Rise of MicroServices
Kafka
App
MPP
ML + Data Mesh Chaos
ML
ML
ML
ML
ML Training
ML Training
Data Lake
ML Training
DWH
DWH
HEAVY INTEGRATION OF
UNSTRUCTURED DATA
INTO RELATIONAL TABLES
DRAWBACKS OF CENTRAL DWH
DATASETS ARE NEEDED
DISTRIBUTED
DRAWBACKS OF CENTRAL DWH
LOWER LATENCY REQUIRED BY
AI USE-CASES,
OTHER DATA WAREHOUSES,
NEAR-REALTIME USE-CASES
DRAWBACKS OF CENTRAL DWH
MULTIPLE TEAMS DO SAME
LOW-LATENCY EVENT INTEGRATION
DRAWBACKS OF CENTRAL DWH
HEAVY INTEGRATION OF
UNSTRUCTURED DATA
INTO RELATIONAL TABLES
DATASETS ARE NEEDED
DISTRIBUTED
LOWER LATENCY REQUIRED BY
AI USE-CASES,
OTHER DATA WAREHOUSES,
NEAR-REALTIME USE-CASES
MULTIPLE TEAMS DO SAME
LOW-LATENCY EVENT
INTEGRATION
STREAMING
11
# Easy processing of unstructured data
# Distribution via S3 and Kafka
# Low-latency by design
# Single integration to avoid duplication
# Scalable Infrastructure
# Do heavy calculation during the whole day
HOW SPARK STREAMING CAN HELP?
Journey of Building Data Pipelines - BEDCon ‘19
12
SALES ORDER EXAMPLE
Journey of Building Data Pipelines - BEDCon ‘19
order.created
order_id,
order_date,
items,
...
shipment.created
order_id,
shipping_date,
shipped_items,
...
payment.done
payment_id,
payment_date,
order_id,
...
item.returned
order_id,
return_date,
returned_item,
...
sales-order
order_id,
order_date,
payment_id,
payment_date,
items:
shipped_at,
returned_at,
...
calculated_1,
calculated_2
...
13
HOW STREAMING APP WORKS?
Journey of Building Data Pipelines - BEDCon ‘19
Topic
Topic
Topic
MS
MS
MS
S3
Streaming
S3
Topic
State
StoreHistoric
DWH S3
Bootstrap Snapshotter S3
14
HOW TO PERSIST THE STATE?
Journey of Building Data Pipelines - BEDCon ‘19
# Find the needle in a haystack
# 0,02% orders touched per hour
# >200GB in size, growing fast
# Even old orders are touched
# Rebootstrapping ingests >500M items
# Currently: on the cluster
# Tought of HBase in future
?
15 Journey of Building Data Pipelines - BEDCon ‘19
HOW TO INTEGRATE MULTIPLE APPS?
Topic
Topic
Topic
Topic
Topic
Topic
Topic
Topic
Topic ML Topic
CYCLE
Cycle Detection
16
COMPARISON BETWEEN BATCH AND STREAMING ETL
Journey of Building Data Pipelines - BEDCon ‘19
Classic DWH
StreamingBatch
Scalability
Unstructured Data
Low-Latency
Efficiency
MTTR
Connectivity
17 Journey of Building Data Pipelines - BEDCon ‘19
SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violates DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
SCALA
18
DATABRICKS DELTA FILE FORMAT
Journey of Building Data Pipelines - BEDCon ‘19
S3
stream stream
# now open source
# based on parquet (columnar)
# supports batch and stream
# supports “transactions”
# schema evolution
# scalable metadata handling
# time travelling
19
LESSONS LEARNED
Journey of Building Data Pipelines - BEDCon ‘19
# Streaming needs different thinking
# DWH ~ Backend Programming
# Don’t start with SQL because it’s easy
# Databricks Delta succeeds Parquet
# Make sure all data is available in S3
20
QUESTIONS?
INTERESTED?
Journey of Building Data Pipelines - BEDCon ‘19

More Related Content

Similar to Building Streaming Data Pipelines

Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoDatabricks
 
Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...George Walters
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AITorsten Steinbach
 
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningRisk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningCambridge Semantics
 
Metadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI ModernizationMetadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI ModernizationEric Kavanagh
 
Data Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data WarehousingData Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data WarehousingAll Things Open
 
IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services
IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services
IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services Torsten Steinbach
 
Refactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics ProductsRefactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics ProductsLuke Han
 
SQL In Hadoop: Big Data Innovation Without the Risk
SQL In Hadoop: Big Data Innovation Without the RiskSQL In Hadoop: Big Data Innovation Without the Risk
SQL In Hadoop: Big Data Innovation Without the RiskInside Analysis
 
Building Products with Data at Core
Building Products with Data at Core Building Products with Data at Core
Building Products with Data at Core Sandeep Adwankar
 
數據架構導入經驗談
數據架構導入經驗談數據架構導入經驗談
數據架構導入經驗談Canner2
 
Accelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyAccelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyMongoDB
 
Testing IoT Apps with the Cloud
Testing IoT Apps with the CloudTesting IoT Apps with the Cloud
Testing IoT Apps with the CloudJosiah Renaudin
 
Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....
Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....
Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....Kai Wähner
 
Liberate Legacy Data Sources with Precisely and Databricks
Liberate Legacy Data Sources with Precisely and DatabricksLiberate Legacy Data Sources with Precisely and Databricks
Liberate Legacy Data Sources with Precisely and DatabricksPrecisely
 
Enable the business and make Artificial Intelligence accessible for everyone!
Enable the business and make Artificial Intelligence accessible for everyone! Enable the business and make Artificial Intelligence accessible for everyone!
Enable the business and make Artificial Intelligence accessible for everyone! Marc Lelijveld
 
Tableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtTableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtMongoDB
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise ArchitectsNeo4j
 
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBReal-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBVoltDB
 

Similar to Building Streaming Data Pipelines (20)

Data Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at ZalandoData Warehousing with Spark Streaming at Zalando
Data Warehousing with Spark Streaming at Zalando
 
Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AI
 
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningRisk Analytics Using Knowledge Graphs / FIBO with Deep Learning
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning
 
Metadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI ModernizationMetadata Mastery: A Big Step for BI Modernization
Metadata Mastery: A Big Step for BI Modernization
 
Data Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data WarehousingData Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data Warehousing
 
IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services
IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services
IBM THINK 2020 - Cloud Data Lake with IBM Cloud Data Services
 
datavault2.pptx
datavault2.pptxdatavault2.pptx
datavault2.pptx
 
Refactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics ProductsRefactoring your EDW with Mobile Analytics Products
Refactoring your EDW with Mobile Analytics Products
 
SQL In Hadoop: Big Data Innovation Without the Risk
SQL In Hadoop: Big Data Innovation Without the RiskSQL In Hadoop: Big Data Innovation Without the Risk
SQL In Hadoop: Big Data Innovation Without the Risk
 
Building Products with Data at Core
Building Products with Data at Core Building Products with Data at Core
Building Products with Data at Core
 
數據架構導入經驗談
數據架構導入經驗談數據架構導入經驗談
數據架構導入經驗談
 
Accelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyAccelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data Strategy
 
Testing IoT Apps with the Cloud
Testing IoT Apps with the CloudTesting IoT Apps with the Cloud
Testing IoT Apps with the Cloud
 
Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....
Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....
Apache Kafka in the Automotive Industry (Connected Vehicles, Manufacturing 4....
 
Liberate Legacy Data Sources with Precisely and Databricks
Liberate Legacy Data Sources with Precisely and DatabricksLiberate Legacy Data Sources with Precisely and Databricks
Liberate Legacy Data Sources with Precisely and Databricks
 
Enable the business and make Artificial Intelligence accessible for everyone!
Enable the business and make Artificial Intelligence accessible for everyone! Enable the business and make Artificial Intelligence accessible for everyone!
Enable the business and make Artificial Intelligence accessible for everyone!
 
Tableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtTableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of Thought
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
 
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBReal-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
 

Recently uploaded

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 

Recently uploaded (20)

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 

Building Streaming Data Pipelines

  • 1. JOURNEY OF BUILDING STREAMING DATA PIPELINES STREAMING DATA WAREHOUSING SEBASTIAN HEROLD 2019-09-06
  • 2. 2 # Principal Data Engineer / Architect # 7y @ Immo-/Scout24 # DataDevOps Manifesto # Data Platform # 2y @ Zalando # ML Productivity # Streaming DWH @heroldamus Journey of Building Data Pipelines - BEDCon ‘19 Sebastian Herold
  • 3. 3 WE BRING FASHION TO PEOPLE 2008-2009 2010 2012-2013 2011 2018 17 markets 9 fulfillment centers >28M active customers 5.4B revenue 2018 >300M visits/month >14k employees >400k product choices >80% visits from mobile
  • 4. 4 TECH@SCALE Journey of Building Data Pipelines - BEDCon ‘19 >350 accounts >100 clusters >250 teams >5 data lakes API >800 micro services
  • 5. 5 JOURNEY OF PAIN@SCALE WITH THE DATA WAREHOUSE Journey of Building Data Pipelines - BEDCon ‘19 Early Years2008 DB App App App DWH MSTR App App App App App App App Rise of MicroServices Kafka App MPP ML + Data Mesh Chaos ML ML ML ML ML Training ML Training Data Lake ML Training DWH DWH
  • 6. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES DRAWBACKS OF CENTRAL DWH
  • 8. LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES DRAWBACKS OF CENTRAL DWH
  • 9. MULTIPLE TEAMS DO SAME LOW-LATENCY EVENT INTEGRATION DRAWBACKS OF CENTRAL DWH
  • 10. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES DATASETS ARE NEEDED DISTRIBUTED LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES MULTIPLE TEAMS DO SAME LOW-LATENCY EVENT INTEGRATION STREAMING
  • 11. 11 # Easy processing of unstructured data # Distribution via S3 and Kafka # Low-latency by design # Single integration to avoid duplication # Scalable Infrastructure # Do heavy calculation during the whole day HOW SPARK STREAMING CAN HELP? Journey of Building Data Pipelines - BEDCon ‘19
  • 12. 12 SALES ORDER EXAMPLE Journey of Building Data Pipelines - BEDCon ‘19 order.created order_id, order_date, items, ... shipment.created order_id, shipping_date, shipped_items, ... payment.done payment_id, payment_date, order_id, ... item.returned order_id, return_date, returned_item, ... sales-order order_id, order_date, payment_id, payment_date, items: shipped_at, returned_at, ... calculated_1, calculated_2 ...
  • 13. 13 HOW STREAMING APP WORKS? Journey of Building Data Pipelines - BEDCon ‘19 Topic Topic Topic MS MS MS S3 Streaming S3 Topic State StoreHistoric DWH S3 Bootstrap Snapshotter S3
  • 14. 14 HOW TO PERSIST THE STATE? Journey of Building Data Pipelines - BEDCon ‘19 # Find the needle in a haystack # 0,02% orders touched per hour # >200GB in size, growing fast # Even old orders are touched # Rebootstrapping ingests >500M items # Currently: on the cluster # Tought of HBase in future ?
  • 15. 15 Journey of Building Data Pipelines - BEDCon ‘19 HOW TO INTEGRATE MULTIPLE APPS? Topic Topic Topic Topic Topic Topic Topic Topic Topic ML Topic CYCLE Cycle Detection
  • 16. 16 COMPARISON BETWEEN BATCH AND STREAMING ETL Journey of Building Data Pipelines - BEDCon ‘19 Classic DWH StreamingBatch Scalability Unstructured Data Low-Latency Efficiency MTTR Connectivity
  • 17. 17 Journey of Building Data Pipelines - BEDCon ‘19 SQL vs SCALA # Started with 200 lines of SQL # Grew fast to 400 lines # Violates DRY principle # Hard to unit-test # Hard to refactor # Bad support for nested structures SCALA
  • 18. 18 DATABRICKS DELTA FILE FORMAT Journey of Building Data Pipelines - BEDCon ‘19 S3 stream stream # now open source # based on parquet (columnar) # supports batch and stream # supports “transactions” # schema evolution # scalable metadata handling # time travelling
  • 19. 19 LESSONS LEARNED Journey of Building Data Pipelines - BEDCon ‘19 # Streaming needs different thinking # DWH ~ Backend Programming # Don’t start with SQL because it’s easy # Databricks Delta succeeds Parquet # Make sure all data is available in S3
  • 20. 20 QUESTIONS? INTERESTED? Journey of Building Data Pipelines - BEDCon ‘19