Operationalizing Big Data Pipelines At Scale


Running a global, world-class business with data-driven decision making requires ingesting and processing diverse sets of data at tremendous scale. How does a company achieve this while ensuring quality and honoring its commitment as a responsible steward of data? This session details how Starbucks has embraced big data, building robust, high-quality pipelines that deliver faster insights and drive world-class customer experiences.

Data & Analytics

OPERATIONALIZING BIG DATA PIPELINES AT SCALE
STARBUCKS BI & DATA SERVICES
JUNE 24, 2020
BRAD MAY
ARJIT DHAVALE

Enterprise Data Analytics Platform
• Azure Databricks + Delta Stack
• 4+ PB Delta Lake
• 1000+ Pipelines (Streaming + Batch)
• 13 Domains / 20 Sub-domains
• 1000+ Users across workgroups

Data Lake
[Diagram labels: CSV, JSON, ..; DELETE, MERGE, OVERWRITE, INSERT; AI & Reporting; Streaming Analytics]

Integration
• Data loaded as-is
• Segregated by source
• Limited retention

Raw
• Segregated by Domain
• Partitioned by date loaded
• Minimally processed
• Schema applied
• Adheres to data retention schedule

Published
• Segregated by Domain
• Partitioned by usage patterns
• Schema and business logic applied
• Adheres to data retention schedule

Ingestion
Database Extracts
• Spark Utility
• High degree of parallelism using a replicated instance
• How to choose the distribution
• Time savings are proportional to the parallelism achieved, e.g. on a 10-node cluster the extract runs about 10x faster
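
The point about choosing a distribution can be illustrated with a small, plain-Python sketch. It splits a numeric key range into contiguous per-task ranges, similar in spirit to the `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` options of Spark's JDBC reader. This is not the Spark utility from the talk: the function names, the `id` column and the timings are hypothetical, and the speed-up helper assumes a perfectly even distribution with no overhead.

```python
from typing import List, Tuple


def split_key_range(lower: int, upper: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Split the half-open key range [lower, upper) into contiguous, near-equal chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    total = upper - lower
    # Never create more partitions than there are keys.
    parts = min(num_partitions, total)
    base, extra = divmod(total, parts)
    ranges: List[Tuple[int, int]] = []
    start = lower
    for i in range(parts):
        # The first `extra` partitions take one additional key each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def ideal_extract_minutes(serial_minutes: float, parallel_tasks: int) -> float:
    """Idealized wall-clock time: even distribution, no scheduling or network overhead."""
    if parallel_tasks < 1:
        raise ValueError("parallel_tasks must be >= 1")
    return serial_minutes / parallel_tasks


if __name__ == "__main__":
    # Hypothetical table with ids 0..999,999 read by 10 parallel tasks.
    for lo, hi in split_key_range(0, 1_000_000, 10):
        print(f"WHERE id >= {lo} AND id < {hi}")
    print(ideal_extract_minutes(120.0, 10))
```

A skewed key (many rows in a few ranges) breaks the even-distribution assumption, which is why the choice of distribution column matters as much as the task count.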

Ingestion
Streaming Data
• Azure Event Hubs
• Spark Structured Streaming
• Enforced Schema – Delta Format
• Auto Optimization
• Delta Small File Efficiencies
• Delta Optimization Time Savings
• For the Starbucks use case, queries on Delta-optimized tables run 15x faster.
• Streaming vs. Batch
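
Streaming writes tend to produce many small files, and compaction groups them into fewer, larger ones so queries open fewer files. The sketch below shows that idea with a simple first-fit-decreasing grouping in plain Python. It is an illustration only and not Delta Lake's actual optimization algorithm; the function name, file sizes and target size are hypothetical.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group files into batches whose combined size stays within target_mb.

    Files are placed largest first into the first batch that still has room.
    A file larger than target_mb ends up in a batch of its own.
    """
    if target_mb <= 0:
        raise ValueError("target_mb must be positive")
    batches: List[List[int]] = []
    totals: List[int] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                batches[i].append(size)
                totals[i] += size
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([size])
            totals.append(size)
    return batches


if __name__ == "__main__":
    small_files = [4] * 64  # hypothetical: 64 files of 4 MB each
    plan = plan_compaction(small_files, target_mb=128)
    print(f"{len(small_files)} files -> {len(plan)} compacted files")
```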

Processing
• What are we building
• Raw Data vs Data Sets vs Data Products
• How are we building
• APPEND PATTERN
• Idempotency
• MERGE PATTERN
• Only Available with Delta
• PARTITION OVERWRITE/REPLACE WHERE PATTERN
• Transaction Isolation/Always Available
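
The three write patterns above can be contrasted with a toy in-memory table. This is a plain-Python sketch of the semantics only, not a Delta Lake API: `ToyTable`, its methods and the batch-id bookkeeping are hypothetical names chosen for illustration. In Delta Lake the corresponding features are the `MERGE INTO` statement and the `replaceWhere` write option.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


class ToyTable:
    """In-memory stand-in for a partitioned table (illustration only)."""

    def __init__(self) -> None:
        self.partitions: Dict[str, List[Row]] = {}
        self._applied_batches: Set[str] = set()

    def append(self, batch_id: str, partition: str, rows: List[Row]) -> bool:
        """Append pattern: apply each batch_id at most once, so a retry is a no-op."""
        if batch_id in self._applied_batches:
            return False
        self.partitions.setdefault(partition, []).extend(dict(r) for r in rows)
        self._applied_batches.add(batch_id)
        return True

    def merge(self, partition: str, rows: List[Row], key: str) -> None:
        """Merge pattern: update rows whose key already exists, insert the rest."""
        existing = self.partitions.setdefault(partition, [])
        index = {r[key]: i for i, r in enumerate(existing)}
        for row in rows:
            if row[key] in index:
                existing[index[row[key]]] = dict(row)
            else:
                index[row[key]] = len(existing)
                existing.append(dict(row))

    def overwrite_partition(self, partition: str, rows: List[Row]) -> None:
        """Partition overwrite: build the new rows, then swap them in with one assignment."""
        self.partitions[partition] = [dict(r) for r in rows]

    def count(self) -> int:
        return sum(len(rows) for rows in self.partitions.values())


if __name__ == "__main__":
    table = ToyTable()
    table.append("batch-1", "2020-06-24", [{"id": 1, "status": "new"}])
    table.append("batch-1", "2020-06-24", [{"id": 1, "status": "new"}])  # retry, ignored
    table.merge("2020-06-24", [{"id": 1, "status": "paid"}, {"id": 2, "status": "new"}], key="id")
    table.overwrite_partition("2020-06-24", [{"id": 1, "status": "closed"}])
    print(table.partitions)
```

The append path is idempotent because a replayed batch is detected and skipped; the overwrite path leaves every other partition untouched, which is what lets readers keep querying the table while one partition is being replaced.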

Consumption
• Consumer Workspace Model
• Meta-sync Process
• Shared access/collaboration
• Data Democratization
• Operational Reporting
• Analytical Capabilities
• AI/ML Capabilities

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
