SlideShare a Scribd company logo
Presto @ Netflix: Interactive Queries
at Petabyte Scale
Nezih Yigitbasi and Zhenxiao Luo
Big Data Platform
Outline
» Big data platform @ Netflix
» Why we love Presto?
» Our contributions
» What are we working on?
» What else we need?
Cloud
Apps
S3
Suro Ursula
SSTable
s
Cassandra Aegisthus
Event Data
15m
Daily
Dimension Data
Our Data Pipeline
Data
Warehouse
Service
Tool
s
Gateways
Big Data Platform Architecture
Prod
Clients
Clusters
VPCQuery Prod TestBonusProd
» Batch jobs (Pig, Hive)
» ETL jobs
» reporting and other analysis
» Ad-hoc queries
» interactive data exploration
» Looked at Impala, Redshift, Spark, and Presto
Our Use Cases
Deployment
» v 0.86
» 1 coordinator (r3.4xlarge)
» 250 workers (m2.4xlarge)
Tooling
Numbers
» ~2.5K queries/day against our 10PB Hive DW on S3
» 230+ Presto users out of 300+ platform users
» presto-cli, Python, R,
BI tools (ODBC/JDBC), etc.
» Atlas/Suro for monitoring/logging
Presto @ Netflix
Why we love Presto?
» Open source
» Fast
» Scalable
» Works well on AWS
» Good integration with the Hadoop stack
» ANSI SQL
Our Contributions
24 open PRs, 60+ commits
» S3 file system
» multipart upload, IAM roles, retries, monitoring, etc.
» Functions for complex types
» Parquet
» name/index-based access, type coercion, etc.
» Query optimization
» Various other bug fixes
» Vectorized reader* Read based on column vectors
» Predicate pushdown Use statistics to skip data
» Lazy load Postpone loading the data until needed
» Lazy materialization Postpone decoding the data until needed
What are we Working on?
Parquet Optimizations
* PARQUET-
Netflix Integration
» BI tools integration
» ODBC driver, Tableau web connector, etc.
» Better monitoring
» Ganglia ⟶ Atlas
» Data lineage
» Presto ⟶ Suro ⟶ Charlotte
» Graceful cluster shrink
» Better resource management
» Dynamic type coercion for all file formats
» Support for more Hive types (e.g., decimal)
» Predictable metastore cache behavior
» Big table joins similar to Hive
What else we need?
THANK YOU

More Related Content

What's hot

Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
Taro L. Saito
 
Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.
Wojciech Biela
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
Superset druid realtime
Superset druid realtimeSuperset druid realtime
Superset druid realtime
arupmalakar
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
kbajda
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
viirya
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
DataWorks Summit
 
Presto Summit 2018 - 01 - Facebook Presto
Presto Summit 2018  - 01 - Facebook PrestoPresto Summit 2018  - 01 - Facebook Presto
Presto Summit 2018 - 01 - Facebook Presto
kbajda
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
Databricks
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop Meetup
Wojciech Biela
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
Presto
PrestoPresto
Presto
Knoldus Inc.
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
HostedbyConfluent
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
Zhenxiao Luo
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
Zhenxiao Luo
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
kbajda
 
Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017
Zhenxiao Luo
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
kbajda
 

What's hot (20)

Presto
PrestoPresto
Presto
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
 
Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
Superset druid realtime
Superset druid realtimeSuperset druid realtime
Superset druid realtime
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Presto Summit 2018 - 01 - Facebook Presto
Presto Summit 2018  - 01 - Facebook PrestoPresto Summit 2018  - 01 - Facebook Presto
Presto Summit 2018 - 01 - Facebook Presto
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop Meetup
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Presto
PrestoPresto
Presto
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017Presto GeoSpatial @ Strata New York 2017
Presto GeoSpatial @ Strata New York 2017
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 

Viewers also liked

Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
Jeff Magnusson
 
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Kurt Brown
 
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Martin Traverso
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
Jordan Open Source Association
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
Eva Tse
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
Blake Irvine
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
Amazon Web Services
 

Viewers also liked (7)

Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
Netflix - Elevating Your Data Platform - TDWI Keynote - San Diego 2015
 
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
Presto at Facebook - Presto Meetup @ Boston (10/6/2015)
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
 

Similar to Presto@Netflix Presto Meetup 03-19-15

4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
PROIDEA
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
Blake Irvine
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
HostedbyConfluent
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
Robert Grossman
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
Benchmark Showdown: Which Relational Database is the Fastest on AWS?
Benchmark Showdown: Which Relational Database is the Fastest on AWS?Benchmark Showdown: Which Relational Database is the Fastest on AWS?
Benchmark Showdown: Which Relational Database is the Fastest on AWS?
Clustrix
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
Databricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
DataWorks Summit/Hadoop Summit
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
Chicago Hadoop Users Group
 
1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release
Jen Stirrup
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis
 
Big Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No CodeBig Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No Code
Liana Ye
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Folio3 Software
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
Riccardo Zamana
 

Similar to Presto@Netflix Presto Meetup 03-19-15 (20)

4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Benchmark Showdown: Which Relational Database is the Fastest on AWS?
Benchmark Showdown: Which Relational Database is the Fastest on AWS?Benchmark Showdown: Which Relational Database is the Fastest on AWS?
Benchmark Showdown: Which Relational Database is the Fastest on AWS?
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Big Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No CodeBig Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No Code
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 

Recently uploaded

Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
TristanJasperRamos
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
Himani415946
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
Output determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CCOutput determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CC
ShahulHameed54211
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 

Recently uploaded (16)

Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
Output determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CCOutput determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CC
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 

Presto@Netflix Presto Meetup 03-19-15

  • 1. Presto @ Netflix: Interactive Queries at Petabyte Scale Nezih Yigitbasi and Zhenxiao Luo Big Data Platform
  • 2. Outline » Big data platform @ Netflix » Why we love Presto? » Our contributions » What are we working on? » What else we need?
  • 3. Cloud Apps S3 Suro Ursula SSTable s Cassandra Aegisthus Event Data 15m Daily Dimension Data Our Data Pipeline
  • 4. Data Warehouse Service Tool s Gateways Big Data Platform Architecture Prod Clients Clusters VPCQuery Prod TestBonusProd
  • 5. » Batch jobs (Pig, Hive) » ETL jobs » reporting and other analysis » Ad-hoc queries » interactive data exploration » Looked at Impala, Redshift, Spark, and Presto Our Use Cases
  • 6. Deployment » v 0.86 » 1 coordinator (r3.4xlarge) » 250 workers (m2.4xlarge) Tooling Numbers » ~2.5K queries/day against our 10PB Hive DW on S3 » 230+ Presto users out of 300+ platform users » presto-cli, Python, R, BI tools (ODBC/JDBC), etc. » Atlas/Suro for monitoring/logging Presto @ Netflix
  • 7. Why we love Presto? » Open source » Fast » Scalable » Works well on AWS » Good integration with the Hadoop stack » ANSI SQL
  • 8. Our Contributions 24 open PRs, 60+ commits » S3 file system » multipart upload, IAM roles, retries, monitoring, etc. » Functions for complex types » Parquet » name/index-based access, type coercion, etc. » Query optimization » Various other bug fixes
  • 9. » Vectorized reader* Read based on column vectors » Predicate pushdown Use statistics to skip data » Lazy load Postpone loading the data until needed » Lazy materialization Postpone decoding the data until needed What are we Working on? Parquet Optimizations * PARQUET-
  • 10. Netflix Integration » BI tools integration » ODBC driver, Tableau web connector, etc. » Better monitoring » Ganglia ⟶ Atlas » Data lineage » Presto ⟶ Suro ⟶ Charlotte
  • 11. » Graceful cluster shrink » Better resource management » Dynamic type coercion for all file formats » Support for more Hive types (e.g., decimal) » Predictable metastore cache behavior » Big table joins similar to Hive What else we need?

Editor's Notes

  1. data from apps/services. event data 200b events: app logs, user activity (search event, movie detail click from website, etc.), system operational data ursula demultiplex the events into event types (~150 event types right now). latency of this ursula pipeline is 15m dimension data: subscriber data. aegisthus extracts data from cassandra which is the online backing store for netflix and writes to s3.
  2. mention that we have single dw on s3, spin up multiple clusters. ittle perf diff. on s3 vs hdfs as we are mostly cpu bound. http://netflix.github.io/ sting: reporting charlotte: lineage
  3. impala: no s3 support spark loads all data, doesn’t stream + stability issues at that time. it couldn’t even handle an hour worth of data ~ 2013. spark sql recently graduated from alpha with the spark 1.3 release (https://spark.apache.org/releases/spark-release-1-3-0.html) redshift: need to copy data from s3 to redshift
  4. r3.4xlarge and m2.4xlarge are both memory optimized instances where m2 is a previous generation instance type 5PB of our 10PB Hive DW is in Parquet format
  5. single warehouse on s3, spin up multiple test/prod presto clusters and query live data etc.
  6. s3 fs: exp backoff, exposed various configs for the aws sdk, multipart upload, IAM roles, and monitoring prestoS3FileSystem and AWS sdk better tooling/community support for parquet. good integration with existing tools hive, spark, etc.. several bug fixes and new functions to manipulate complex types to close the gap between hive and presto DDL: alter/create table optimization:(2085) Rewrite Single Distinct Aggregation into GroupBy and (1937) and Optimize joins with similar subqueries complex types: array: contains, concat, sort, map: map_agg and map constructors, map_keys, map_values, etc. bridge the gap between hive and presto
  7. We log queries to our internal data pipeline (Suro) and another internal tool (Charlotte) analyzes data lineage
  8. we are pushing reporting to Presto with our Tableau/MS work. not for ETL. → monitoring, scheduling improvements. Presto’s distributed join is still memory-limited as there is no spills. hive decimal type: https://github.com/facebook/presto/issues/2417 -> at least be able to read it, still open