This document summarizes a talk on the journey of building streaming data pipelines at Zalando. It describes how Zalando transitioned from a centralized data warehouse to Spark Streaming for processing data in near real-time from sources such as Kafka, which allows unstructured data to be processed at scale with low latency. The talk also covers practices such as using Scala instead of SQL for streaming ETL, persisting state, integrating multiple streaming applications, and using Databricks Delta as the storage format for the data lake.
2. Journey of Building Data Pipelines - BEDCon ‘19
Sebastian Herold (@heroldamus)
# Principal Data Engineer / Architect
# 7y @ Immo-/Scout24
# DataDevOps Manifesto
# Data Platform
# 2y @ Zalando
# ML Productivity
# Streaming DWH
3. WE BRING FASHION TO PEOPLE
[Timeline graphic: 2008-2009, 2010, 2011, 2012-2013, 2018]
# 17 markets
# 9 fulfillment centers
# >28M active customers
# €5.4B revenue in 2018
# >300M visits/month
# >14k employees
# >400k product choices
# >80% of visits from mobile
4. TECH@SCALE
# >350 accounts
# >100 clusters
# >250 teams
# >5 data lakes
# >800 microservices
# APIs
5. JOURNEY OF PAIN@SCALE WITH THE DATA WAREHOUSE
[Architecture evolution diagram: Early Years (2008): Apps -> DB -> DWH -> MSTR; Rise of Microservices: Apps -> Kafka -> MPP; ML + Data Mesh Chaos: ML training jobs, a Data Lake, and multiple DWHs]
8. DRAWBACKS OF CENTRAL DWH
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
9. DRAWBACKS OF CENTRAL DWH
MULTIPLE TEAMS DO THE SAME LOW-LATENCY EVENT INTEGRATION
10. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES
DISTRIBUTED DATASETS ARE NEEDED
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
MULTIPLE TEAMS DO THE SAME LOW-LATENCY EVENT INTEGRATION
⇒ STREAMING
11. HOW CAN SPARK STREAMING HELP?
# Easy processing of unstructured data
# Distribution via S3 and Kafka
# Low latency by design
# Single integration to avoid duplication
# Scalable infrastructure
# Spread heavy calculations over the whole day
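To make these bullets concrete, here is a minimal sketch of a Spark Structured Streaming application that parses unstructured JSON events from Kafka and distributes the result via S3. It is not Zalando's actual job: the topic name, broker address, bucket, and event schema are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object OrderEventIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("order-event-ingest").getOrCreate()

    // Assumed schema of the order.created event (illustrative only)
    val orderSchema = new StructType()
      .add("order_id", StringType)
      .add("order_date", TimestampType)
      .add("items", ArrayType(StringType))

    // Read raw JSON events from a Kafka topic (topic and broker are placeholders)
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "order.created")
      .load()

    // Parse the unstructured payload into typed columns
    val orders = raw
      .select(from_json(col("value").cast("string"), orderSchema).as("event"))
      .select("event.*")

    // Continuously write the result to S3 as Parquet, with a checkpoint for recovery
    orders.writeStream
      .format("parquet")
      .option("path", "s3://example-bucket/orders/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
      .start()
      .awaitTermination()
  }
}
```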
12. SALES ORDER EXAMPLE
Input events:
# order.created: order_id, order_date, items, ...
# shipment.created: order_id, shipping_date, shipped_items, ...
# payment.done: payment_id, payment_date, order_id, ...
# item.returned: order_id, return_date, returned_item, ...
Resulting sales-order entity:
# sales-order: order_id, order_date, payment_id, payment_date, items (with shipped_at, returned_at, ...), calculated_1, calculated_2, ...
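As a rough illustration, the sales-order entity above could be modeled as Scala case classes that the four event types fold into; field and type details beyond what the slide shows are assumptions.

```scala
import java.sql.Timestamp

// Illustrative sketch of the target sales-order entity built from the four event types
case class OrderItem(
  itemId: String,
  shippedAt: Option[Timestamp] = None,   // filled from shipment.created
  returnedAt: Option[Timestamp] = None   // filled from item.returned
)

case class SalesOrder(
  orderId: String,
  orderDate: Timestamp,                  // from order.created
  paymentId: Option[String] = None,      // from payment.done
  paymentDate: Option[Timestamp] = None,
  items: Seq[OrderItem] = Seq.empty
)
```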
13. HOW DOES THE STREAMING APP WORK?
[Architecture diagram: microservices (MS) publish to Kafka topics; the streaming app consumes them, keeps a state store, and writes to S3 and an output topic; historic DWH data on S3 is loaded via a Bootstrap/Snapshotter]
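A minimal sketch of this topology, assuming placeholder broker, bucket, and checkpoint paths: the four event topics from the sales-order example are read into one stream, and the (elided) merge result is written both to S3 and to a downstream Kafka topic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SalesOrderTopology {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sales-order-streaming").getOrCreate()

    // One source per input topic; names follow the sales-order example and are assumptions
    def topic(name: String) = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", name)
      .load()
      .select(col("key").cast("string"), col("value").cast("string"))

    val events = topic("order.created")
      .union(topic("shipment.created"))
      .union(topic("payment.done"))
      .union(topic("item.returned"))

    // ... merge events into sales-order records here (see the state sketch on the next slide) ...
    val salesOrders = events

    // Sink 1: the data lake on S3
    salesOrders.writeStream
      .format("parquet")
      .option("path", "s3://example-bucket/sales-order/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/sales-order-s3/")
      .start()

    // Sink 2: a downstream Kafka topic for other consumers
    salesOrders.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "sales-order")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/sales-order-kafka/")
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```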
14. HOW TO PERSIST THE STATE?
# Find the needle in a haystack
# 0.02% of orders touched per hour
# >200GB in size, growing fast
# Even old orders are touched
# Rebootstrapping ingests >500M items
# Currently: on the cluster
# Thinking of HBase in the future
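One way per-order state can live "on the cluster" is Spark's built-in state store via flatMapGroupsWithState. The sketch below uses simplified, invented event and state types; moving the state to HBase instead would mean replacing GroupState with explicit reads and writes against that store.

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.{Dataset, SparkSession}

// Simplified event and state types for illustration
case class OrderEvent(orderId: String, eventType: String, payload: String)
case class OrderState(orderId: String, events: Seq[OrderEvent])

object OrderStateUpdater {
  // Merge newly arrived events for one order into its long-lived state
  def update(orderId: String,
             events: Iterator[OrderEvent],
             state: GroupState[OrderState]): Iterator[OrderState] = {
    val current = state.getOption.getOrElse(OrderState(orderId, Seq.empty))
    val updated = current.copy(events = current.events ++ events)
    state.update(updated)          // persisted in Spark's state store (on the cluster)
    Iterator(updated)              // emit the new version of the sales order
  }

  def attach(spark: SparkSession, events: Dataset[OrderEvent]): Dataset[OrderState] = {
    import spark.implicits._
    events
      .groupByKey(_.orderId)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(update)
  }
}
```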
15. HOW TO INTEGRATE MULTIPLE APPS?
[Topic dependency diagram: streaming apps chained via Kafka topics; an ML topic feeding its output back upstream creates a CYCLE, hence cycle detection is needed]
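Because chained apps and ML feedback topics form a dependency graph, a cycle check can run over the "which topic feeds which" map. This is an illustrative, self-contained sketch with invented topic names, not the detection actually used at Zalando.

```scala
// Detect cycles in a topic dependency graph via depth-first search
object TopicCycleCheck {
  def hasCycle(graph: Map[String, Set[String]]): Boolean = {
    // path = nodes on the current DFS stack, seen = nodes already explored
    def visit(node: String, path: Set[String], seen: Set[String]): (Boolean, Set[String]) = {
      if (path.contains(node)) (true, seen)          // back edge -> cycle
      else if (seen.contains(node)) (false, seen)    // already fully explored
      else graph.getOrElse(node, Set.empty).foldLeft((false, seen + node)) {
        case ((true, s), _)     => (true, s)
        case ((false, s), next) => visit(next, path + node, s)
      }
    }
    graph.keys.foldLeft((false, Set.empty[String])) {
      case ((true, s), _)  => (true, s)
      case ((false, s), n) => visit(n, Set.empty, s)
    }._1
  }

  def main(args: Array[String]): Unit = {
    val topics = Map(
      "order.created" -> Set("sales-order"),
      "sales-order"   -> Set("ml.scores"),
      "ml.scores"     -> Set("sales-order")  // ML output fed back in -> cycle
    )
    println(s"cycle detected: ${hasCycle(topics)}")  // prints: cycle detected: true
  }
}
```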
16. COMPARISON BETWEEN BATCH AND STREAMING ETL
Criteria compared for Classic DWH vs. Batch vs. Streaming:
# Scalability
# Unstructured data
# Low latency
# Efficiency
# MTTR
# Connectivity
17. SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violates the DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
⇒ SCALA
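The move to Scala pays off because the transformation logic becomes plain functions that are reusable and unit-testable without a cluster, instead of being repeated inside long SQL strings. A hedged sketch with invented types and a made-up KPI:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative types; not the actual Zalando schema
case class Order(orderId: String, items: Seq[String], returnedItems: Seq[String])
case class OrderKpi(orderId: String, returnRate: Double)

object OrderKpis {
  // Pure function: easy to unit-test without Spark
  def returnRate(order: Order): OrderKpi = {
    val rate =
      if (order.items.isEmpty) 0.0
      else order.returnedItems.size.toDouble / order.items.size
    OrderKpi(order.orderId, rate)
  }

  // Reused unchanged in the streaming (or batch) job
  def withKpis(spark: SparkSession, orders: Dataset[Order]): Dataset[OrderKpi] = {
    import spark.implicits._
    orders.map(returnRate)
  }
}

// A unit test needs no cluster at all:
// assert(OrderKpis.returnRate(Order("1", Seq("a", "b"), Seq("a"))).returnRate == 0.5)
```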
18. DATABRICKS DELTA FILE FORMAT
[Diagram: streams writing to and reading from Delta tables on S3]
# Now open source
# Based on Parquet (columnar)
# Supports batch and streaming
# Supports “transactions”
# Schema evolution
# Scalable metadata handling
# Time travel
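A brief sketch of how a stream might target Delta instead of plain Parquet, and how the same table can be read back in batch with time travel. Paths and the version number are placeholders, and the Delta Lake library is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeltaSink {
  // Write a streaming DataFrame to a Delta table on S3 (paths are placeholders)
  def writeSalesOrders(salesOrders: DataFrame): Unit = {
    salesOrders.writeStream
      .format("delta")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/sales-order-delta/")
      .start("s3://example-bucket/delta/sales-order/")
  }

  // The same table is also a batch source, including "time travel" to an older version
  def readOlderVersion(spark: SparkSession): DataFrame =
    spark.read
      .format("delta")
      .option("versionAsOf", 42)   // hypothetical version number
      .load("s3://example-bucket/delta/sales-order/")
}
```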
19. LESSONS LEARNED
# Streaming needs different thinking
# DWH ~ backend programming
# Don’t start with SQL just because it’s easy
# Databricks Delta supersedes plain Parquet
# Make sure all data is available in S3