
Operationalizing Big Data Pipelines At Scale


Running a global, world-class business with data-driven decision making requires ingesting and processing diverse sets of data at tremendous scale. How does a company achieve this while ensuring quality and honoring their commitment as responsible stewards of data? This session will detail how Starbucks has embraced big data, building robust, high-quality pipelines for faster insights to drive world-class customer experiences.

Published in: Data & Analytics

  1. OPERATIONALIZING BIG DATA PIPELINES AT SCALE. Starbucks BI & Data Services, June 24, 2020. Brad May, Arjit Dhavale
  2. Enterprise Data Analytics Platform • Azure Databricks + Delta Stack • 4+ PB Delta Lake • 1000+ Pipelines (Streaming + Batch) • 13 Domains / 20 Sub-domains • 1000+ Users across workgroups
  3. Data Lake. Sources (CSV, JSON, ...) land in a Raw zone and flow to a Published zone; Delta supports DELETE, MERGE, OVERWRITE, and INSERT; consumers include AI & Reporting, Streaming Analytics, and Integration. Raw zone: • Data loaded as-is • Segregated by source • Limited retention. Published zone, first layer: • Segregated by domain • Partitioned by date loaded • Minimally processed • Schema applied • Adheres to data retention schedule. Published zone, second layer: • Segregated by domain • Partitioned by usage patterns • Schema and business logic applied • Adheres to data retention schedule
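The zone layout above can be sketched as path conventions. This is a minimal illustration, not the actual Starbucks layout; the `/lake/...` prefixes and zone names are assumptions for the example.

```python
from datetime import date

def raw_path(source: str, load_date: date) -> str:
    """Raw zone: data lands as-is, segregated by source, partitioned by load date."""
    return f"/lake/raw/{source}/load_date={load_date.isoformat()}"

def published_path(domain: str, table: str, load_date: date) -> str:
    """Published zone: schema applied, segregated by domain, partitioned by load date."""
    return f"/lake/published/{domain}/{table}/load_date={load_date.isoformat()}"

print(raw_path("pos", date(2020, 6, 24)))
# /lake/raw/pos/load_date=2020-06-24
```

Partitioning the published layer by usage patterns (rather than load date) would simply swap the partition key in the path.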
  4. Ingestion: Database Extracts • Spark Utility • High degree of parallelism using a replicated instance • How to choose the distribution column • Time savings are proportional to the parallelism achieved, e.g. on a 10-node cluster the extract runs roughly 10x faster
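One way to achieve that parallelism is to split the table into ranges on a distribution column and read each range concurrently against the replicated instance. The sketch below is a hedged illustration in plain Python: the table name, column, and bounds are hypothetical, and the `extract` function stands in for a JDBC-style range read (e.g. Spark's `spark.read.jdbc(..., predicates=...)`).

```python
from concurrent.futures import ThreadPoolExecutor

def make_predicates(col, lower, upper, num_partitions):
    """Split [lower, upper] on a numeric distribution column into
    contiguous, non-overlapping range predicates, one per partition."""
    step = (upper - lower) // num_partitions + 1
    bounds = [(lower + i * step, min(lower + (i + 1) * step, upper + 1))
              for i in range(num_partitions)]
    return [f"{col} >= {lo} AND {col} < {hi}" for lo, hi in bounds]

def extract(predicate):
    # Placeholder for a real range read against the replicated instance.
    return f"SELECT * FROM orders WHERE {predicate}"

# Ten partitions -> ten concurrent extracts, so wall-clock time shrinks
# roughly in proportion to the parallelism achieved.
preds = make_predicates("order_id", 0, 1_000_000, 10)
with ThreadPoolExecutor(max_workers=10) as pool:
    queries = list(pool.map(extract, preds))
```

Choosing a well-distributed column matters: skewed ranges leave some workers idle while one does most of the reading, eroding the proportional speedup.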
  5. Ingestion: Streaming Data • Azure Event Hubs • Spark Structured Streaming • Enforced Schema – Delta Format • Auto Optimization • Delta Small File Efficiencies • Delta Optimization Time Savings • For the Starbucks use case, queries on Delta-optimized tables run 15x faster • Streaming vs. Batch
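Schema enforcement is what keeps a streaming table from silently drifting: records in a micro-batch that do not match the declared schema are rejected rather than written. A toy sketch of that check in plain Python (the field names and the validation logic are illustrative, not the Delta implementation):

```python
# Expected schema for one hypothetical event type.
EXPECTED_SCHEMA = {"store_id": int, "item": str, "qty": int}

def validate(record: dict) -> bool:
    """Accept only records with exactly the expected fields and types."""
    return (set(record) == set(EXPECTED_SCHEMA)
            and all(isinstance(record[k], t) for k, t in EXPECTED_SCHEMA.items()))

micro_batch = [
    {"store_id": 101, "item": "latte", "qty": 2},
    {"store_id": "101", "item": "mocha", "qty": 1},  # wrong type -> rejected
]
good = [r for r in micro_batch if validate(r)]
bad = [r for r in micro_batch if not validate(r)]
```

In Delta itself this behavior comes for free on write; the rejected records would typically be routed to a quarantine location for inspection rather than dropped.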
  6. Processing • What are we building: Raw Data vs. Data Sets vs. Data Products • How are we building: • APPEND PATTERN – Idempotency • MERGE PATTERN – Only available with Delta • PARTITION OVERWRITE / REPLACE WHERE PATTERN – Transaction isolation / always available
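The MERGE pattern can be shown with a minimal in-memory sketch: match incoming rows to the target on a key, update the matched rows, and insert the unmatched ones. On Delta this is `MERGE INTO target USING updates ON ... WHEN MATCHED ... WHEN NOT MATCHED ...`; the dict-based version below and its column names are assumptions for illustration only.

```python
def merge(target: dict, updates: list, key: str) -> dict:
    """Upsert: matched keys are overwritten, unmatched keys are inserted."""
    merged = dict(target)
    for row in updates:
        merged[row[key]] = row
    return merged

target = {1: {"id": 1, "qty": 2}, 2: {"id": 2, "qty": 5}}
updates = [{"id": 2, "qty": 7},   # matched  -> update
           {"id": 3, "qty": 1}]   # unmatched -> insert
result = merge(target, updates, "id")
# id 1 kept, id 2 updated to qty 7, id 3 inserted
```

Note the idempotency property the slide calls out for the append pattern holds here too: replaying the same updates against the merged result leaves it unchanged, which is what makes pipeline reruns safe.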
  7. Consumption • Consumer Workspace Model • Meta-sync Process • Shared access / collaboration • Data Democratization • Operational Reporting • Analytical Capabilities • AI/ML Capabilities
  8. Feedback. Your feedback is important to us. Don't forget to rate and review the sessions.
