OPERATIONALIZING BIG DATA
PIPELINES AT SCALE
STARBUCKS BI & DATA SERVICES
JUNE 24, 2020
BRAD MAY
ARJIT DHAVALE
Enterprise Data
Analytics Platform
• Azure Databricks + Delta Stack
• 4+ PB Delta Lake
• 1000+ Pipelines (Streaming + Batch)
• 13 Domains / 20 Sub-domains
• 1000+ Users across workgroups
Data Lake

Zones: Raw → Integration → Published, feeding Streaming Analytics and AI & Reporting. Sources land as CSV, JSON, etc.; Delta supports INSERT, DELETE, MERGE, and OVERWRITE.

Raw
• Data loaded as-is
• Segregated by source
• Limited retention

Integration
• Segregated by Domain
• Partitioned by date loaded
• Minimally processed
• Schema applied
• Adheres to data retention schedule

Published
• Segregated by Domain
• Partitioned by usage patterns
• Schema and business logic applied
• Adheres to data retention schedule
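The zone layout above can be sketched as a path convention. The folder names below are illustrative assumptions, not the actual Starbucks layout; the point is only that raw data is segregated by source and load date, while published data is segregated by domain:

```python
# Toy sketch of a three-zone lake path convention (folder names are assumptions).
def zone_path(zone, source=None, domain=None, table=None, load_date=None):
    """Build a storage path for the raw, integration, or published zone."""
    if zone == "raw":
        # Raw: data as-is, segregated by source, limited retention.
        return f"/lake/raw/{source}/{table}/load_date={load_date}"
    if zone == "integration":
        # Integration: segregated by domain, partitioned by date loaded.
        return f"/lake/integration/{domain}/{table}/load_date={load_date}"
    if zone == "published":
        # Published: segregated by domain, partitioned by usage patterns.
        return f"/lake/published/{domain}/{table}"
    raise ValueError(f"unknown zone: {zone}")

print(zone_path("raw", source="pos", table="orders", load_date="2020-06-24"))
```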
Ingestion
Database Extracts
• Spark Utility
• High Degree of Parallelism Using a Replicated Instance
• How to Choose the Distribution
• Time savings are proportional to the parallelism achieved; e.g., in a 10-node cluster, time savings are roughly 10x
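The parallel-extract idea can be sketched in plain Python. This roughly mirrors how Spark's JDBC reader turns `lowerBound`/`upperBound`/`numPartitions` on a chosen distribution column into parallel range predicates; the function name and exact predicate text here are illustrative assumptions, and the 10x claim above holds only when the chosen column distributes rows evenly across the ranges:

```python
# Sketch of splitting a numeric distribution column into parallel range
# predicates, one per partition, in the spirit of Spark's JDBC reader.
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions  # integer stride per partition
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also sweeps up NULLs and anything below lower.
            preds.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended so no rows above upper are lost.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

for p in jdbc_partition_predicates("order_id", 0, 100, 4):
    print(p)
```

Each predicate becomes the WHERE clause of one concurrent extract query, so a skewed distribution column concentrates rows in one range and erases the speedup.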
Ingestion
Streaming Data
• Azure Event Hubs
• Spark Structured Streaming
• Enforced Schema – Delta Format
• Auto Optimization
• Delta Small File Efficiencies
• Delta Optimization Time Savings
• For the Starbucks use case, queries on Delta-optimized tables run roughly 15x faster
• Streaming vs. Batch
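Streaming ingestion produces many small files, which is where the Delta small-file efficiencies above come in. The toy sketch below illustrates the compaction idea behind Delta's OPTIMIZE, greedily bin-packing small files toward a target output size; the function name and greedy policy are assumptions for illustration, not Delta's actual algorithm:

```python
# Toy sketch of small-file compaction: group many small files into bins
# of roughly the target size, so each compacted output replaces many inputs.
def plan_compaction(file_sizes_mb, target_mb=1024):
    bins, current, current_size = [], [], 0
    # Largest-first greedy packing keeps bins close to the target size.
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)          # close this bin; start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 200 ten-MB files compact into far fewer target-sized files.
print(len(plan_compaction([10] * 200, target_mb=100)))
```

Fewer, larger files means fewer file-open and listing operations per query, which is the mechanism behind the read-time speedups cited above.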
Processing
• What are we building?
• Raw Data vs. Data Sets vs. Data Products
• How are we building it?
• APPEND PATTERN
• Idempotency
• MERGE PATTERN
• Only Available with Delta
• PARTITION OVERWRITE / REPLACE WHERE PATTERN
• Transaction Isolation / Always Available
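The MERGE pattern and its idempotency can be sketched in plain Python, without Spark: matched keys are updated, unmatched keys are inserted, and replaying the same batch leaves the table unchanged. The `merge` function below is a toy stand-in for Delta's MERGE, not its implementation:

```python
# Plain-Python sketch of MERGE (upsert) semantics keyed on "id":
# matched rows are updated, unmatched rows are inserted.
def merge(target, updates, key="id"):
    merged = {row[key]: row for row in target}
    for row in updates:
        # Overlay the update onto any existing row with the same key.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

target = [{"id": 1, "qty": 2}]
batch = [{"id": 1, "qty": 5}, {"id": 2, "qty": 1}]
once = merge(target, batch)
twice = merge(once, batch)
assert once == twice  # replaying the same batch changes nothing: idempotent
```

A plain append, by contrast, would duplicate rows on replay; that is why the append pattern needs its own idempotency guard (e.g., dedup on a batch key), while MERGE is naturally safe to retry.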
Consumption
• Consumer Workspace Model
• Meta-sync Process
• Shared access/collaboration
• Data Democratization
• Operational Reporting
• Analytical Capabilities
• AI/ML Capabilities
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.