SlideShare a Scribd company logo
1 of 21
Download to read offline
Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed
quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent.
Pinot: Near Real-Time Analytics @ Uber
U B E R | Data
Xiang Fu 

Sr Software Engineer II @ Uber

Streaming Analytics Team
Quick Introduction
U B E R | Data
Uber Scale
Messages Bytes
Apache Kafka Trillion per day ~PB per day

Streaming Analytics
Platform
Billions processed
per day
100s of TB
processed per day
Pinot 100s of Billions 10s of TB
U B E R | Data
Agenda
● Pinot @ Uber

● Architecture

● Case Study

● Pinot Perf
Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed
quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent.
Pinot @ Uber
U B E R | Data
Experimentation platform (Internal Dashboard)
A / B Tests

See progress of
tests in real-time
U B E R | Data
UberEats (Realtime User Facing Product)
UberEats Restaurant
Manager

“What is my revenue for
past 90 days?”
U B E R | Data
Many More…
• UberPool Analytics
• Mobile Analytics
...
Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed
quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent.
Architecture
U B E R | Data
Pinot Workflow
Athena-X
Hive/Spark SQL/oozie
● Projection, Filtering

● Window Aggregation 

● Join
U B E R | Data
Pinot Realtime: Self Service
● Projection, Filtering

● Window Aggregation 

● Join
Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed
quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent.
Case Study
U B E R | Data
Pinot Data Model
Column Name Column Type Filtering Compression Indexing
RiderId SingleValue/
Dimension
Yes Dictionary Sorted
DriverId SingleValue/
Dimension
Yes Dictionary Inverted
TripId SingleValue/
Dimension
No No Dictionary No
PickUpPoints MultiValue/
Dimension
No No Dictionary No
TripFare SingleValue/
Metric
No No Dictionary No
Step 1
List Column Spec
Step 2
Analyze Query Pattern
Step 3
Decide Compression &
Indexing Strategy
U B E R | Data
Pinot Data Ingestion
Realtime Ingestion:
Consumer Type Scalability Consistency
High Level Consumer Hard to scale beyond one node
Sacrificing consistency during
failures
Low Level Consumer Scalable beyond one node
Strong consistency guarantees even
during failure
Segment Persistence: 500k msg or 6 hours
Offline Ingestion:
Using Oozie to schedule daily incremental backfill from Hive to Pinot
Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed
quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent.
Pinot Perf
U B E R | Data
Pinot Realtime Ingestion
Hardware 4 base SKU boxes(24 cores, 128G RAM)
Consumer Type HLC LLC
Peak Traffic(msg/sec/box) 20k 200k
Peak Traffic(bytes/sec/box) 4M 40M
Storage Kafka Pinot
Total Data Volume(GB) 500 60
U B E R | Data
Pinot/Druid Data Size
Raw Data: 

500M Rows, 30 columns

Raw Json: 391.9G
Three Storage Tiers 

in Pinot/Druid
- Segments in Deep Storage 

(NFS or HDFS)
- Local Disk Cache
- Memory
U B E R | Data
Pinot/Druid Query Performance
Max Duration:
select max(duration) from trips
Count All Grouped by City:
select count(*) from trips
group by city_id top 10000
Count All in One Month:
select count(*) from trips
where Month = '201601'
Count All in SF:
select count(*) from trips
where city_id=1 group by Month
Unique Drivers in SF:
select distinctCountHLL(driver_uuid)
from trips where city_id=1
Unique Drivers By Date:
select distinctCountHLL(driver_uuid)
from trips group by Date
U B E R | Data
Pinot/Druid Concurrent Query
Query: select count(*) from trips group by city_id
U B E R | Data
Guaranteed SLA for Site Facing Products
Aggregation on Rider
trips:
select count(*) from trips
where riderId = x and
date > 20170225
Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed
quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent.
Thank you

More Related Content

What's hot

Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기NAVER D2
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영NAVER D2
 
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기AWSKRUG - AWS한국사용자모임
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Amazon Web Services
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Roopa Tangirala
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafkaconfluent
 
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018Seunghyun Lee
 

What's hot (20)

Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기[215] Druid로 쉽고 빠르게 데이터 분석하기
[215] Druid로 쉽고 빠르게 데이터 분석하기
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
 
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
Spark + S3 + R3를 이용한 데이터 분석 시스템 만들기
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
 
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
Pinot: Realtime OLAP for 530 Million Users - Sigmod 2018
 

Similar to Pinot: Near Realtime Analytics @ Uber

ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoNikolay Stoitsev
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...StreamNative
 
Spectra Logic's BlackPearl Developers Summit 2016
Spectra Logic's BlackPearl Developers Summit 2016Spectra Logic's BlackPearl Developers Summit 2016
Spectra Logic's BlackPearl Developers Summit 2016spectralogic
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaRicardo Bravo
 
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...Yann Cluchey
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
 
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformThe Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformAlluxio, Inc.
 
Ceph Day Beijing - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Beijing - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Beijing - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Beijing - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Community
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroGaurav "GP" Pal
 
Webinar replay: MySQL Query Tuning Trilogy: Query tuning process and tools
Webinar replay: MySQL Query Tuning Trilogy: Query tuning process and toolsWebinar replay: MySQL Query Tuning Trilogy: Query tuning process and tools
Webinar replay: MySQL Query Tuning Trilogy: Query tuning process and toolsSeveralnines
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
ITCamp 2011 - Cristian Lefter - SQL Server code-name Denali
ITCamp 2011 - Cristian Lefter - SQL Server code-name DenaliITCamp 2011 - Cristian Lefter - SQL Server code-name Denali
ITCamp 2011 - Cristian Lefter - SQL Server code-name DenaliITCamp
 
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...HostedbyConfluent
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumarconfluent
 
Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016Daniel Jacobson
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud ComputingAmazon Web Services
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud ComputingAmazon Web Services
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 

Similar to Pinot: Near Realtime Analytics @ Uber (20)

ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to Go
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
 
Spectra Logic's BlackPearl Developers Summit 2016
Spectra Logic's BlackPearl Developers Summit 2016Spectra Logic's BlackPearl Developers Summit 2016
Spectra Logic's BlackPearl Developers Summit 2016
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data PlatformThe Practice of Presto & Alluxio in E-Commerce Big Data Platform
The Practice of Presto & Alluxio in E-Commerce Big Data Platform
 
Ceph Day Beijing - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Beijing - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Beijing - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Beijing - Ceph on All-Flash Storage - Breaking Performance Barriers
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
 
Webinar replay: MySQL Query Tuning Trilogy: Query tuning process and tools
Webinar replay: MySQL Query Tuning Trilogy: Query tuning process and toolsWebinar replay: MySQL Query Tuning Trilogy: Query tuning process and tools
Webinar replay: MySQL Query Tuning Trilogy: Query tuning process and tools
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
ITCamp 2011 - Cristian Lefter - SQL Server code-name Denali
ITCamp 2011 - Cristian Lefter - SQL Server code-name DenaliITCamp 2011 - Cristian Lefter - SQL Server code-name Denali
ITCamp 2011 - Cristian Lefter - SQL Server code-name Denali
 
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
 
Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016
 
History of Apache Pinot
History of Apache Pinot History of Apache Pinot
History of Apache Pinot
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud Computing
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud Computing
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Pinot: Near Realtime Analytics @ Uber

  • 1. Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent. Pinot: Near Real-Time Analytics @ Uber
  • 2. U B E R | Data Xiang Fu Sr Software Engineer II @ Uber Streaming Analytics Team Quick Introduction
  • 3. U B E R | Data Uber Scale Messages Bytes Apache Kafka Trillion per day ~PB per day Streaming Analytics Platform Billions processed per day 100s of TB processed per day Pinot 100s of Billions 10s of TB
  • 4. U B E R | Data Agenda ● Pinot @ Uber ● Architecture ● Case Study ● Pinot Perf
  • 5. Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent. Pinot @ Uber
  • 6. U B E R | Data Experimentation platform (Internal Dashboard) A / B Tests See progress of tests in real-time
  • 7. U B E R | Data UberEats (Realtime User Facing Product) UberEats Restaurant Manager “What is my revenue for past 90 days?”
  • 8. U B E R | Data Many More… • UberPool Analytics • Mobile Analytics ...
  • 9. Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent. Architecture
  • 10. U B E R | Data Pinot Workflow Athena-X Hive/Spark SQL/oozie ● Projection, Filtering ● Window Aggregation ● Join
  • 11. U B E R | Data Pinot Realtime: Self Service ● Projection, Filtering ● Window Aggregation ● Join
  • 12. Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent. Case Study
  • 13. U B E R | Data Pinot Data Model Column Name Column Type Filtering Compression Indexing RiderId SingleValue/ Dimension Yes Dictionary Sorted DriverId SingleValue/ Dimension Yes Dictionary Inverted TripId SingleValue/ Dimension No No Dictionary No PickUpPoints MultiValue/ Dimension No No Dictionary No TripFare SingleValue/ Metric No No Dictionary No Step 1 List Column Spec Step 2 Analyze Query Pattern Step 3 Decide Compression & Indexing Strategy
  • 14. U B E R | Data Pinot Data Ingestion Realtime Ingestion: Consumer Type Scalability Consistency High Level Consumer Hard to scale beyond one node Sacrificing consistency during failures Low Level Consumer Scalable beyond one node Strong consistency guarantees even during failure Segment Persistence: 500k msg or 6 hours Offline Ingestion: Using Oozie to schedule daily incremental backfill from Hive to Pinot
  • 15. Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent. Pinot Perf
  • 16. U B E R | Data Pinot Realtime Ingestion Hardware 4 base SKU boxes(24 cores, 128G RAM) Consumer Type HLC LLC Peak Traffic(msg/sec/box) 20k 200k Peak Traffic(bytes/sec/box) 4M 40M Storage Kafka Pinot Total Data Volume(GB) 500 60
  • 17. U B E R | Data Pinot/Druid Data Size Raw Data: 
 500M Rows, 30 columns
 Raw Json: 391.9G Three Storage Tiers 
 in Pinot/Druid - Segments in Deep Storage 
 (NFS or HDFS) - Local Disk Cache - Memory
  • 18. U B E R | Data Pinot/Druid Query Performance Max Duration: select max(duration) from trips Count All Grouped by City: select count(*) from trips group by city_id top 10000 Count All in One Month: select count(*) from trips where Month = '201601' Count All in SF: select count(*) from trips where city_id=1 group by Month Unique Drivers in SF: select distinctCountHLL(driver_uuid) from trips where city_id=1 Unique Drivers By Date: select distinctCountHLL(driver_uuid) from trips group by Date
  • 19. U B E R | Data Pinot/Druid Concurrent Query Query: select count(*) from trips group by city_id
  • 20. U B E R | Data Guaranteed SLA for Site Facing Products Aggregation on Rider trips: select count(*) from trips where riderId = x and date > 20170225
  • 21. Edit or delete footer text in Master ipsandella doloreium dem isciame ndaestia nessed quibus aut hiligenet ut ea debisci eturiate poresti vid min core, vercidigent. Thank you