Data Warehousing with Spark Streaming at Zalando

Zalando's AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate data that is already outdated. Modern data integration pipelines need to deliver high-quality data sets that are fast and easy to consume. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and as snapshots at the same time.

The talk covers challenges in our fashion data platform and gives a detailed architectural deep dive into separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta’s MERGE command, Scala API vs Spark SQL, and schema evolution give more insights and guidance for similar use cases.
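The relationship between a stream of changes and a snapshot can be illustrated without Spark or Delta. The following is a minimal plain-Scala sketch, not the talk's implementation and not the Delta API: the `Record` type, its `version` field, and the `SnapshotSketch` object are hypothetical names introduced for illustration. It applies one micro-batch of changes to a snapshot with upsert semantics (update on match, insert otherwise), which is the kind of behaviour a MERGE is typically used to express.

```scala
// Hypothetical record type: field names are illustrative, not from the talk.
final case class Record(id: String, version: Long, payload: String)

object SnapshotSketch {
  type Snapshot = Map[String, Record]

  // Keep only the newest change per key within one micro-batch.
  def latestPerKey(changes: Seq[Record]): Map[String, Record] =
    changes.groupBy(_.id).map { case (id, rs) => id -> rs.maxBy(_.version) }

  // Upsert: insert unknown keys, overwrite known keys only if the change is newer.
  def applyChanges(snapshot: Snapshot, changes: Seq[Record]): Snapshot =
    latestPerKey(changes).foldLeft(snapshot) { case (acc, (id, change)) =>
      acc.get(id) match {
        case Some(existing) if existing.version >= change.version => acc
        case _ => acc.updated(id, change)
      }
    }
}
```

Because `applyChanges` ignores changes that are not newer than the stored version, replaying the same batch leaves the snapshot unchanged. That idempotence is an assumption of this sketch, chosen because it makes recovery after a failed batch easier to reason about.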

Data Warehousing with Spark Streaming at Zalando

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Sebastian Herold, Zalando SE: Data Warehousing with Spark Streaming @ Zalando. #UnifiedDataAnalytics #SparkAISummit
  3. Sebastian Herold (@heroldamus), Data Warehousing with Spark Streaming: Principal Data Engineer / Architect; 7y @ Immo-/Scout24; DataDevOps Manifesto; Data Platform; 2y @ Zalando; ML Productivity; Streaming DWH
  4. WE BRING FASHION TO PEOPLE: 2008-2009, 2010, 2012-2013, 2011, 2018; 17 markets; 9 fulfillment centers; >28M active customers; 5.4B revenue 2018; >300M visits/month; >14k employees; >400k product choices; >80% visits from mobile
  5. TECH@SCALE (Data Warehousing with Spark Streaming - Spark + AI Summit Amsterdam ’19): >350 accounts; >100 clusters; >250 teams; >5 data lakes; API; >800 micro services
  6. WHY OUR CENTRAL DWH DOES NOT SUCCEED ANYMORE?
  7. DRAWBACKS OF CENTRAL DWH: HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES
  8. DRAWBACKS OF CENTRAL DWH: DATASETS ARE NEEDED DISTRIBUTED
  9. DRAWBACKS OF CENTRAL DWH: LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
  10. DRAWBACKS OF CENTRAL DWH: MULTIPLE TEAMS DO SAME LOW-LATENCY EVENT INTEGRATION
  11. STREAMING: HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES; DATASETS ARE NEEDED DISTRIBUTED; LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES; MULTIPLE TEAMS DO SAME LOW-LATENCY EVENT INTEGRATION
  12. SALES ORDER EXAMPLE: order.created (order_id, order_date, items, ...); shipment.created (order_id, shipping_date, shipped_items, ...); payment.done (payment_id, payment_date, order_id, ...); item.returned (order_id, return_date, returned_item, ...); sales-order (order_id, order_date, payment_id, payment_date, items: shipped_at, returned_at, ..., calculated_1, calculated_2, ...)
  13. HOW WE STARTED? [Diagram labels: Topics, Streaming, S3, nakadi.io, S3, Delta Table, Downstream] WAIT!
  14. INTEGRATION OF HISTORIC DATA [Diagram labels: Topics, Streaming, S3, nakadi.io, S3, Delta Table, Central DWH, Bootstrap, Delta Table, Downstream] BOOM! Batch time increased to 2h! MERGE command slow for needles in the haystack
  15. INTRODUCE SNAPSHOTS AND CHANGES TABLE [Diagram labels: Topics, Streaming, S3, nakadi.io, S3, Central DWH, Bootstrap, Delta Table, Downstream, Snapshot, Changes, Snapshotter] Better, but still slow!
  16. LOAD SNAPSHOT INTO CLUSTER [Diagram labels: Topics, Streaming, S3, nakadi.io, S3, Central DWH, Bootstrap, Downstream, Snapshot, Changes, Snapshotter, Snapshot]
  17. WHAT’S COMING NEXT? [Diagram labels: Topics, Streaming, S3, nakadi.io, Central DWH, Bootstrap, Downstream, Snapshotter, S3, Snapshot, Changes, Snapshot, Changes, State Store ???, Snapshotter]
  18. SQL vs SCALA: Started with 200 lines of SQL; Grew fast to 400 lines; Violated DRY principle; Hard to unit-test; Hard to refactor; Bad support for nested structures; SCALA
  19. LESSONS LEARNED: Streaming needs different thinking; DWH ~ Backend Programming; Don’t start with SQL because it’s easy; Databricks Delta succeeds Parquet; Make sure all data is available in S3
  20. THANKS A LOT! QUESTIONS? WE ARE HIRING!
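
The sales order example on slide 12 and the unit-testing argument on slide 18 can be made concrete with a small plain-Scala sketch. The event and field names follow slide 12; everything else (the case classes, `ItemState`, `SalesOrderSketch`, string-typed dates, and the fold itself) is an assumption made for illustration and does not claim to be the code used at Zalando. It runs without Spark, which is what makes the integration logic straightforward to unit-test.

```scala
// Event and field names follow slide 12; the types and the fold are assumptions.
sealed trait Event { def orderId: String }
final case class OrderCreated(orderId: String, orderDate: String, items: Seq[String]) extends Event
final case class ShipmentCreated(orderId: String, shippingDate: String, shippedItems: Seq[String]) extends Event
final case class PaymentDone(paymentId: String, paymentDate: String, orderId: String) extends Event
final case class ItemReturned(orderId: String, returnDate: String, returnedItem: String) extends Event

// Per-item state nested inside the sales order.
final case class ItemState(shippedAt: Option[String] = None, returnedAt: Option[String] = None)

// Fields are optional because events may arrive in any order.
final case class SalesOrder(
  orderId: String,
  orderDate: Option[String] = None,
  paymentId: Option[String] = None,
  paymentDate: Option[String] = None,
  items: Map[String, ItemState] = Map.empty
)

object SalesOrderSketch {
  // Create the item entry if missing, then apply the update function to it.
  private def touch(items: Map[String, ItemState], item: String)(f: ItemState => ItemState): Map[String, ItemState] =
    items.updated(item, f(items.getOrElse(item, ItemState())))

  // Fold a single event into the current state of its sales order.
  def applyEvent(order: SalesOrder, event: Event): SalesOrder = event match {
    case OrderCreated(_, date, items) =>
      order.copy(orderDate = Some(date), items = items.foldLeft(order.items)((m, i) => touch(m, i)(s => s)))
    case ShipmentCreated(_, date, shipped) =>
      order.copy(items = shipped.foldLeft(order.items)((m, i) => touch(m, i)(_.copy(shippedAt = Some(date)))))
    case PaymentDone(pid, date, _) =>
      order.copy(paymentId = Some(pid), paymentDate = Some(date))
    case ItemReturned(_, date, item) =>
      order.copy(items = touch(order.items, item)(_.copy(returnedAt = Some(date))))
  }

  // Integrate a batch of events into one sales order per order_id.
  def integrate(events: Seq[Event]): Map[String, SalesOrder] =
    events.foldLeft(Map.empty[String, SalesOrder]) { (orders, e) =>
      val current = orders.getOrElse(e.orderId, SalesOrder(e.orderId))
      orders.updated(e.orderId, applyEvent(current, e))
    }
}
```

Each event type is handled by one small, typed case, and the nested `items` structure is an ordinary `Map`. This is one way to read the slide's points about refactoring, unit tests, and nested structures being easier in Scala than in a long SQL statement.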
