OPERATIONALIZING BIG DATA
PIPELINES AT SCALE
STARBUCKS BI & DATA SERVICES
JUNE 24, 2020
BRAD MAY
ARJIT DHAVALE
Enterprise Data
Analytics Platform
• Azure Databricks + Delta Stack
• 4+ PB Delta Lake
• 1000+ Pipelines (Streaming + Batch)
• 13 Domains / 20 Sub-domains
• 1000+ Users across workgroups
Data Lake

Zones: Raw → Integration → Published, feeding Streaming Analytics and AI & Reporting. Sources land as CSV, JSON, etc.; Delta supports INSERT, DELETE, MERGE, and OVERWRITE.

Raw
• Data loaded as-is
• Segregated by source
• Limited retention

Integration
• Segregated by Domain
• Partitioned by date loaded
• Minimally processed
• Schema applied
• Adheres to data retention schedule

Published
• Segregated by Domain
• Partitioned by usage patterns
• Schema and business logic applied
• Adheres to data retention schedule
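The zone layout above can be sketched as a path convention. The folder names below are illustrative assumptions, not the actual Starbucks layout; the point is only that raw data is segregated by source and load date, while published data is segregated by domain:

```python
# Toy sketch of a three-zone lake path convention (folder names are assumptions).
def zone_path(zone, source=None, domain=None, table=None, load_date=None):
    """Build a storage path for the raw, integration, or published zone."""
    if zone == "raw":
        # Raw: data as-is, segregated by source, limited retention.
        return f"/lake/raw/{source}/{table}/load_date={load_date}"
    if zone == "integration":
        # Integration: segregated by domain, partitioned by date loaded.
        return f"/lake/integration/{domain}/{table}/load_date={load_date}"
    if zone == "published":
        # Published: segregated by domain, partitioned by usage patterns.
        return f"/lake/published/{domain}/{table}"
    raise ValueError(f"unknown zone: {zone}")

print(zone_path("raw", source="pos", table="orders", load_date="2020-06-24"))
```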
Ingestion
Database Extracts
• Spark Utility
• High Degree of Parallelism Using a Replicated Instance
• How to Choose the Distribution
• Time savings are proportional to the parallelism achieved; e.g., in a 10-node cluster, time savings are roughly 10x
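The parallel-extract idea can be sketched in plain Python. This roughly mirrors how Spark's JDBC reader turns `lowerBound`/`upperBound`/`numPartitions` on a chosen distribution column into parallel range predicates; the function name and exact predicate text here are illustrative assumptions, and the 10x claim above holds only when the chosen column distributes rows evenly across the ranges:

```python
# Sketch of splitting a numeric distribution column into parallel range
# predicates, one per partition, in the spirit of Spark's JDBC reader.
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions  # integer stride per partition
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also sweeps up NULLs and anything below lower.
            preds.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended so no rows above upper are lost.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

for p in jdbc_partition_predicates("order_id", 0, 100, 4):
    print(p)
```

Each predicate becomes the WHERE clause of one concurrent extract query, so a skewed distribution column concentrates rows in one range and erases the speedup.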
Ingestion
Streaming Data
• Azure Event Hubs
• Spark Structured Streaming
• Enforced Schema – Delta Format
• Auto Optimization
• Delta Small File Efficiencies
• Delta Optimization Time Savings
• For the Starbucks use case, queries on Delta-optimized tables run roughly 15x faster
• Streaming vs. Batch
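Streaming ingestion produces many small files, which is where the Delta small-file efficiencies above come in. The toy sketch below illustrates the compaction idea behind Delta's OPTIMIZE, greedily bin-packing small files toward a target output size; the function name and greedy policy are assumptions for illustration, not Delta's actual algorithm:

```python
# Toy sketch of small-file compaction: group many small files into bins
# of roughly the target size, so each compacted output replaces many inputs.
def plan_compaction(file_sizes_mb, target_mb=1024):
    bins, current, current_size = [], [], 0
    # Largest-first greedy packing keeps bins close to the target size.
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)          # close this bin; start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 200 ten-MB files compact into far fewer target-sized files.
print(len(plan_compaction([10] * 200, target_mb=100)))
```

Fewer, larger files means fewer file-open and listing operations per query, which is the mechanism behind the read-time speedups cited above.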
Processing
• What are we building?
• Raw Data vs. Data Sets vs. Data Products
• How are we building it?
• APPEND PATTERN
• Idempotency
• MERGE PATTERN
• Only Available with Delta
• PARTITION OVERWRITE / REPLACE WHERE PATTERN
• Transaction Isolation / Always Available
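The MERGE pattern and its idempotency can be sketched in plain Python, without Spark: matched keys are updated, unmatched keys are inserted, and replaying the same batch leaves the table unchanged. The `merge` function below is a toy stand-in for Delta's MERGE, not its implementation:

```python
# Plain-Python sketch of MERGE (upsert) semantics keyed on "id":
# matched rows are updated, unmatched rows are inserted.
def merge(target, updates, key="id"):
    merged = {row[key]: row for row in target}
    for row in updates:
        # Overlay the update onto any existing row with the same key.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

target = [{"id": 1, "qty": 2}]
batch = [{"id": 1, "qty": 5}, {"id": 2, "qty": 1}]
once = merge(target, batch)
twice = merge(once, batch)
assert once == twice  # replaying the same batch changes nothing: idempotent
```

A plain append, by contrast, would duplicate rows on replay; that is why the append pattern needs its own idempotency guard (e.g., dedup on a batch key), while MERGE is naturally safe to retry.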
Consumption
• Consumer Workspace Model
• Meta-sync Process
• Shared access/collaboration
• Data Democratization
• Operational Reporting
• Analytical Capabilities
• AI/ML Capabilities
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.