Speed up Interactive Analytic
Queries over Existing Big Data on
Hadoop with Presto	

Liang-Chi Hsieh
HadoopCon 2014 in Taiwan
In Today’s talk
• Introduction to Presto

• Distributed architecture	

• Query model	

• Deployment and configuration	

• Data visualization with Presto - Demo
SQL on/over Hadoop
• Hive	

• Mature and proven solution (0.13.x)

• Drawbacks: execution model based on MapReduce 	

• Better execution engines: Hive-Tez and Hive-Spark	

• Alternative and usually faster options include

• Impala, Presto, Drill, ...
Presto
• Presto is a distributed SQL query engine optimized
for ad-hoc analysis at interactive speed	

• Data scale: GBs to PBs	

• Deployment at:	

• Facebook, Netflix, Dropbox, Treasure Data, Airbnb, Qubole
History of Presto
• Fall 2012	

• Development of Presto started at Facebook

• Spring 2013	

• It was rolled out to the entire company and became
the major interactive data warehouse

• Winter 2013	

• Open-sourced
The Problems to Solve
• Hive is not optimized for interactive data analysis as
the data size grows to petabyte scale	

• In practice, we do need to have reduced data
stored in an interactive DB that provides quick
query response	

• Redundant maintenance cost, out-of-date data
views, data transfer, ...

• The need to incorporate other data that are not
stored in HDFS
Typical Batch Data Architecture
[Diagram: Data Flow → HDFS → Batch Run → DB ← Query]
• Views generated in batch may be out of date

• Batch workflow is too slow
Interactive Query on HDFS
[Diagram: Data Flow → HDFS; Query → Presto → Interactive query → HDFS]
Interactive Query on HDFS and other Data Sources
[Diagram: Data Flow → HDFS; Query → Presto → Interactive query → HDFS, MySQL, Cassandra]
Distributed Architecture
• Coordinator	

• Parsing statements	

• Planning queries	

• Managing Presto workers	

• Worker	

• Executing tasks	

• Processing data
Storage Plugins
• Connectors	

• Providing interfaces for fetching metadata, getting data
locations, accessing the data	

• Current connectors (v0.76)	

• Hive: Hadoop 1.x, Hadoop 2.x, CDH 4, CDH 5	

• Cassandra	

• MySQL	

• Kafka	

• PostgreSQL
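The three connector duties listed above (metadata, data locations, data access) can be sketched as a small interface. Presto's real connector SPI is written in Java; the names `Split`, `Connector`, and `InMemoryConnector` below are hypothetical and only illustrate the division of responsibilities.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol, Tuple


@dataclass(frozen=True)
class Split:
    """A section of a table that one task can read independently."""
    table: str
    start: int
    end: int


class Connector(Protocol):
    """Hypothetical interface mirroring the three duties on the slide."""

    def get_columns(self, table: str) -> List[str]: ...              # metadata
    def get_splits(self, table: str, size: int) -> List[Split]: ...  # data locations
    def read_split(self, split: Split) -> Iterator[Tuple]: ...       # data access


class InMemoryConnector:
    """Toy connector backed by Python lists instead of HDFS or MySQL."""

    def __init__(self, tables: dict):
        # tables maps a table name to (column names, list of row tuples)
        self._tables = tables

    def get_columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def get_splits(self, table: str, size: int) -> List[Split]:
        row_count = len(self._tables[table][1])
        return [Split(table, i, min(i + size, row_count))
                for i in range(0, row_count, size)]

    def read_split(self, split: Split) -> Iterator[Tuple]:
        yield from self._tables[split.table][1][split.start:split.end]
```

An engine written against `Connector` does not need to know where the rows come from, which is what lets a single query engine sit on top of several storage systems.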
Presto Clients
• Protocol: HTTP + JSON	

• Client libraries available in several
programming languages:	

• Python, PHP, Ruby, Node.js, Java, R	

• ODBC through Prestogres
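The HTTP + JSON protocol can be sketched as a short polling loop. This is a minimal sketch, assuming the `/v1/statement` endpoint, the `X-Presto-*` request headers, and the `nextUri` / `data` / `error` response fields of Presto's REST API. The transport is passed in as a parameter so the loop can be exercised without a running server; `run_query` and `http_transport` are illustrative names, not part of any client library.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

# A transport takes (method, url, body, headers) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[bytes], dict], dict]


def http_transport(method: str, url: str, body: Optional[bytes], headers: dict) -> dict:
    """Real transport built on the standard library only."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(server: str, sql: str, user: str, catalog: str, schema: str,
              send: Transport = http_transport) -> Iterator[list]:
    """Submit a statement, then follow nextUri until the result is exhausted."""
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    reply = send("POST", server.rstrip("/") + "/v1/statement",
                 sql.encode("utf-8"), headers)
    while True:
        if "error" in reply:
            raise RuntimeError(reply["error"].get("message", "query failed"))
        # A page may or may not carry rows.
        for row in reply.get("data", []):
            yield row
        next_uri = reply.get("nextUri")
        if next_uri is None:
            return
        reply = send("GET", next_uri, None, headers)
```

Usage against a real coordinator would look like `list(run_query("http://example.net:8080", "SELECT 1", "demo", "hive", "default"))`, where the user, catalog, and schema values are placeholders.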
Query Model
• Presto’s execution engine does not use
MapReduce	

• It employs a custom query and execution engine	

• Based on a DAG, similar to Apache Tez,
Spark, or MPP databases
Query Execution
• Presto executes ANSI-compatible SQL statements	

• Coordinator	

• SQL parser	

• Query planner	

• Execution planner	

• Workers	

• Task execution scheduler
Query Execution
[Diagram: AST → Query planner → Query plan → Execution planner → Execution plan; inputs: Connector Metadata, NodeManager]
Query Planner
SQL: SELECT name, count(*) FROM logs GROUP BY name
Logical query plan: Table scan → GROUP BY → Output
Distributed query plan:
• Stage-2: Table scan → Partial aggregation → Output buffer
• Stage-1: Exchange client → Final aggregation → Output buffer
• Stage-0: Exchange client → Output
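The partial and final aggregation steps for the GROUP BY query can be sketched in a few lines. This is a simplified illustration, not Presto's implementation: each inner list stands for one split of the `logs` table, and a single final aggregation merges all partial results, whereas a real engine can spread the final aggregation over several tasks.

```python
from collections import Counter
from typing import Iterable, List


def partial_aggregation(split: Iterable[str]) -> Counter:
    """Stage-2 work: count names within a single split."""
    return Counter(split)


def final_aggregation(partials: Iterable[Counter]) -> Counter:
    """Stage-1 work: merge the partial counts received from upstream tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(splits: List[List[str]]) -> List[tuple]:
    """Stage-0 work: collect the final result as sorted (name, count) rows."""
    partials = [partial_aggregation(split) for split in splits]
    return sorted(final_aggregation(partials).items())


if __name__ == "__main__":
    logs = [["alice", "bob", "alice"], ["bob", "carol"]]
    print(run(logs))  # [('alice', 2), ('bob', 2), ('carol', 1)]
```

Counting inside each split first means only one small counter per split has to cross the network, instead of every row.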
Distributed query plan:
• Stage-2: Table scan → Partial aggregation → Output buffer
• Stage-1: Exchange client → Final aggregation → Output buffer
• Stage-0: Exchange client → Output
Task placement:
• Worker 1: Table scan → Partial aggregation → Output buffer; Exchange client → Final aggregation → Output buffer; Exchange client → Output
• Worker 2: Table scan → Partial aggregation → Output buffer; Exchange client → Final aggregation → Output buffer
* Tasks run on workers
Query Execution on Presto
• SQL is converted into stages, tasks, and drivers

• Tasks operate on splits, which are sections of the data

• The lowest stages retrieve splits from connectors
Query Execution on Presto
• Tasks are run in parallel	

• Pipelined to reduce wait time between stages	

• If one task fails, the query fails

• No disk I/O

• If aggregated data does not fit in memory, the
query fails

• May spill to disk in the future
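The execution behaviour described above (parallel tasks, no task recovery, no spilling) can be mimicked with a thread pool. This is a toy model under stated assumptions: `run_tasks`, `QueryFailed`, and the row-count limit are hypothetical stand-ins, and the limit only imitates a memory cap rather than measuring memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class QueryFailed(Exception):
    """Raised when any task fails; this sketch has no per-task retry."""


def run_tasks(splits: Sequence[list], task: Callable[[list], int],
              max_rows_in_memory: int) -> List[int]:
    """Run one task per split in parallel and fail the whole query on any error."""

    def guarded(split: list) -> int:
        # Stand-in for a memory limit: nothing spills to disk, so oversize input fails.
        if len(split) > max_rows_in_memory:
            raise MemoryError("split does not fit in memory")
        return task(split)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(guarded, split) for split in splits]
        try:
            return [future.result() for future in futures]
        except Exception as error:
            # Give up on the remaining work instead of recovering the failed task.
            for future in futures:
                future.cancel()
            raise QueryFailed(str(error)) from error
```

For example, `run_tasks([[1, 2], [3]], sum, max_rows_in_memory=10)` returns `[3, 3]`, while lowering the limit to 1 raises `QueryFailed` for the whole call even though the second split would have succeeded.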
Deployment & Configuration
• Basically, there are four configurations to set
up for Presto	

• Node properties: environment configuration
specific to each node	

• JVM config	

• Config properties: configuration for Presto server	

• Catalog properties: configuration for connectors	

• Detailed documentation is available on the Presto site
Node Properties
• etc/node.properties	

• Minimal configuration:
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
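Since node.id has to be unique for every node and stay the same across restarts, one option is to generate the file once at provisioning time. A minimal sketch follows; the helper name `render_node_properties` is hypothetical, and the use of a random UUID is an assumption that matches the format of the example value above.

```python
import uuid


def render_node_properties(environment: str, data_dir: str, node_id: str = "") -> str:
    """Render etc/node.properties; a fresh UUID is used when no node_id is given."""
    node_id = node_id or str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",
        f"node.id={node_id}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"
```

Write the result to disk once and keep it, rather than regenerating it on every start, so that the identifier stays stable.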
Config Properties
• etc/config.properties	

• Minimal configuration for coordinator:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080
Config Properties
• Minimal configuration for worker:
coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery.uri=http://example.net:8080
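The coordinator and worker files differ in only a few lines, so both can be rendered from one template. The sketch below reproduces the two minimal configurations from the slides; `render_config_properties` is a hypothetical helper, and the property names are those of the Presto release covered in this talk, so they may differ in later versions.

```python
def render_config_properties(is_coordinator: bool, discovery_uri: str,
                             http_port: int = 8080,
                             task_max_memory: str = "1GB") -> str:
    """Render the minimal etc/config.properties shown on the slides."""
    properties = {"coordinator": str(is_coordinator).lower()}
    if is_coordinator:
        # Keep query work off the coordinator node.
        properties["node-scheduler.include-coordinator"] = "false"
    properties["http-server.http.port"] = str(http_port)
    properties["task.max-memory"] = task_max_memory
    if is_coordinator:
        # The slides run the discovery service inside the coordinator.
        properties["discovery-server.enabled"] = "true"
    # Every node, coordinator included, points at the same discovery URI.
    properties["discovery.uri"] = discovery_uri
    return "".join(f"{key}={value}\n" for key, value in properties.items())
```

Calling it with `is_coordinator=False` yields the four worker lines; with `True` it yields the six coordinator lines.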
Catalog Properties
• Presto connectors are mounted in catalogs	

• Create catalog properties in etc/catalog	

• For example, the configuration etc/catalog/hive.properties
for the Hive connector:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
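A catalog takes its name from its properties file, so etc/catalog/hive.properties is addressed as `hive` in queries. The sketch below derives that name and builds a catalog.schema.table reference; the helper names are hypothetical, and the schema `default` and table `logs` in the usage line are illustrative assumptions.

```python
from pathlib import PurePosixPath


def catalog_name(properties_path: str) -> str:
    """Map etc/catalog/hive.properties to 'hive' (the catalog is named after the file)."""
    path = PurePosixPath(properties_path)
    if path.suffix != ".properties":
        raise ValueError(f"not a catalog properties file: {properties_path}")
    return path.stem


def qualified_table(properties_path: str, schema: str, table: str) -> str:
    """Build the catalog.schema.table name used in SQL statements."""
    return ".".join([catalog_name(properties_path), schema, table])
```

For instance, `qualified_table("etc/catalog/hive.properties", "default", "logs")` gives `hive.default.logs`, which can then appear in a FROM clause.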
Presto’s Roadmap
• In the next year:

• Complex data structures	

• Create table with partitioning	

• Huge joins and aggregations	

• Spill to disk	

• Basic task recovery	

• Native store	

• Authentication & authorization
* Based on the Presto Meetup, May 2014
Data Visualization with Presto - Demo
• There will be an official ODBC driver for connecting
Presto to major BI tools, according to Presto’s
roadmap

• Prestogres provides an alternative solution for now

• Use PostgreSQL’s ODBC driver	

• It is also not difficult to integrate Presto with other
data visualization tools such as Grafana
Grafana
• An open source metrics dashboard and graph
editor for Graphite, InfluxDB & OpenTSDB	

• But we may not be satisfied with these DBs or
just want to visualize data on HDFS, especially
for large-scale data
Integrating Presto with Grafana
• Presto provides many useful date & time
functions	

• current_date → date

• current_time → time with time zone

• current_timestamp → timestamp with time zone

• from_unixtime(unixtime) → timestamp

• localtime → time

• now() → timestamp with time zone	

• to_unixtime(timestamp) → double
Integrating Presto with Grafana
• Presto also supports many common aggregation
functions	

• avg(x) → double	

• count(x) → bigint	

• max(x) → [same as input]	

• min(x) → [same as input]	

• sum(x) → [same as input]	

• ...
Integrating Presto with Grafana
• So we implemented a custom datasource for
Presto to work with Grafana	

• Interactively visualize data on HDFS
[Diagram: Grafana → Presto → Interactive query → HDFS]
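The date/time and aggregation functions above are enough to turn a table into a time series. The sketch below is not the datasource from this talk; it is a hypothetical illustration of the idea. It builds a bucketed `avg` query with `to_unixtime` and `from_unixtime`, then reshapes result rows into `[value, timestamp_ms]` pairs, which is assumed here to be the shape a Grafana time-series panel consumes. Table and column names are placeholders and are interpolated as trusted input.

```python
from typing import Iterable, List, Tuple


def build_timeseries_sql(table: str, time_column: str, value_column: str,
                         start_epoch: int, end_epoch: int, interval_s: int) -> str:
    """Average value_column per interval using to_unixtime and from_unixtime."""
    bucket = f"floor(to_unixtime({time_column}) / {interval_s}) * {interval_s}"
    return (
        f"SELECT {bucket} AS bucket, avg({value_column}) AS value "
        f"FROM {table} "
        f"WHERE {time_column} >= from_unixtime({start_epoch}) "
        f"AND {time_column} < from_unixtime({end_epoch}) "
        f"GROUP BY {bucket} ORDER BY bucket"
    )


def to_datapoints(rows: Iterable[Tuple[float, float]]) -> List[list]:
    """Convert (bucket_seconds, value) rows to [value, timestamp_ms] pairs."""
    return [[value, int(bucket * 1000)] for bucket, value in rows]
```

A datasource built this way would send the generated SQL to Presto whenever the dashboard's time range changes and feed the converted rows to the graph.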
Demo
References
• Martin Traverso, “Presto: Interacting with petabytes of data at Facebook”

• Sadayuki Furuhashi, “Presto: Interactive SQL Query Engine for Big Data”

• Sundstrom, “Presto: Past, Present, and Future”

• “Presto Concepts” in the Presto documentation
34

More Related Content

What's hot

Presto at Twitter
Presto at TwitterPresto at Twitter
Presto at TwitterBill Graham
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FutureDataWorks Summit
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine kiran palaka
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
 
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Matt Fuller
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloudQubole
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Wojciech Biela
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Zhenxiao Luo
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookTreasure Data, Inc.
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 

What's hot (20)

Presto
PrestoPresto
Presto
 
Presto at Twitter
Presto at TwitterPresto at Twitter
Presto at Twitter
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and Future
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.Presto - Analytical Database. Overview and use cases.
Presto - Analytical Database. Overview and use cases.
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15Presto@Netflix Presto Meetup 03-19-15
Presto@Netflix Presto Meetup 03-19-15
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 

Viewers also liked

Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkkbajda
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis FirehoseAmazon Web Services
 

Viewers also liked (6)

Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 

Similar to Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indixYu Ishikawa
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Dipti Borkar
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 

Similar to Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto (20)

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Apache drill
Apache drillApache drill
Apache drill
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 

Recently uploaded

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto

  • 1. Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto • Liang-Chi Hsieh • HadoopCon 2014 in Taiwan
  • 2. In Today’s Talk • Introduction to Presto • Distributed architecture • Query model • Deployment and configuration • Data visualization with Presto - Demo
  • 3. SQL on/over Hadoop • Hive • Mature and proven solution (0.13.x) • Drawback: execution model based on MapReduce • Better execution engines: Hive-Tez and Hive-Spark • Alternative and usually faster options include Impala, Presto, Drill, ...
  • 4. Presto • Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed • Data scale: GBs to PBs • Deployed at: Facebook, Netflix, Dropbox, Treasure Data, Airbnb, Qubole
  • 5. History of Presto • Fall 2012: development of Presto started at Facebook • Spring 2013: it was rolled out to the entire company and became a major interactive data warehouse • Winter 2013: open-sourced
  • 6. The Problems to Solve • Hive is not optimized for interactive data analysis as the data size grows to petabyte scale • In practice, reduced data has to be stored in an interactive DB that provides quick query responses • This brings redundant maintenance cost, out-of-date data views, data transfer, ... • There is also a need to incorporate other data that is not stored in HDFS
  • 7. Typical Batch Data Architecture • [Diagram: Data Flow → HDFS → Batch Run → DB → Query] • Views generated in batch may be out of date • Batch workflow is too slow
  • 8. Interactive Query on HDFS • [Diagram: Data Flow → HDFS; Query → Presto → interactive query on HDFS]
  • 9. Interactive Query on HDFS and other Data Sources • [Diagram: Data Flow → HDFS; Query → Presto → interactive query on HDFS, MySQL, and Cassandra]
  • 10. Distributed Architecture • Coordinator: parsing statements, planning queries, managing Presto workers • Worker: executing tasks, processing data
  • 12. Storage Plugins • Connectors provide interfaces for fetching metadata, getting data locations, and accessing the data • Current connectors (v0.76): Hive (Hadoop 1.x, Hadoop 2.x, CDH 4, CDH 5), Cassandra, MySQL, Kafka, PostgreSQL
  • 14. Presto Clients • Protocol: HTTP + JSON • Client libraries are available in several programming languages: Python, PHP, Ruby, Node.js, Java, R • ODBC through Prestogres
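The HTTP + JSON protocol mentioned above can be illustrated with a minimal Python sketch. It only builds the request and parses one response page; nothing is sent over the network. The `/v1/statement` endpoint, the `X-Presto-*` headers, and the `data` / `nextUri` response fields are stated here as assumptions about Presto's REST protocol rather than taken from the slides, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_statement_request(base_url, sql, user, catalog="hive", schema="default"):
    """Build (but do not send) the HTTP request that submits a SQL statement.

    Assumption: the coordinator accepts the statement text as the POST body
    on /v1/statement and reads the session from X-Presto-* headers.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        method="POST",
        headers={
            "X-Presto-User": user,
            "X-Presto-Catalog": catalog,
            "X-Presto-Schema": schema,
        },
    )


def rows_from_page(page_json):
    """Return (rows, next_uri) from one JSON response page.

    Assumption: each page may carry a "data" array and a "nextUri" link;
    a client keeps following "nextUri" until it is absent.
    """
    page = json.loads(page_json)
    return page.get("data", []), page.get("nextUri")


request = build_statement_request(
    "http://example.net:8080", "SELECT name, count(*) FROM logs GROUP BY name", "demo"
)
```

A real client would pass `request` to `urllib.request.urlopen` and loop on `rows_from_page` until no next URI is returned; the client libraries listed on the slide wrap this loop.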
  • 15. Query Model • Presto’s execution engine does not use MapReduce • It employs a custom query and execution engine • Execution is based on a DAG, which makes it more like Apache Tez, Spark, or MPP databases
  • 16. Query Execution • Presto executes ANSI-compatible SQL statements • Coordinator: SQL parser, query planner, execution planner • Workers: task execution scheduler
  • 17. Query Execution • [Diagram labels: AST, Query planner, Query plan, Execution planner, Execution plan, Connector Metadata, NodeManager]
  • 18. Query Planner • SQL: SELECT name, count(*) FROM logs GROUP BY name • Logical query plan: Table scan → GROUP BY → Output • Distributed query plan: Stage-2 (Table scan → Partial aggregation → Output buffer) → Stage-1 (Exchange client → Final aggregation → Output buffer) → Stage-0 (Exchange client → Output)
  • 19. Distributed query plan • [Diagram: the stages above shown as tasks on Worker 1 and Worker 2; each worker runs Table scan → Partial aggregation → Output buffer and Exchange client → Final aggregation → Output buffer, and Stage-0 runs Exchange client → Output] • * Tasks run on workers
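The split between partial and final aggregation in the distributed plan can be mimicked in a few lines of Python. This is a simplified sketch, not Presto's implementation: the two "workers" are plain lists, the exchange is a generator, and a single final aggregation is used, whereas a real engine may partition the final aggregation across workers. The sample rows are made up for illustration.

```python
from collections import Counter


def partial_aggregation(split):
    """Stage-2 task: count names within one split of the logs table."""
    return Counter(row["name"] for row in split)


def final_aggregation(partials):
    """Stage-1 task: merge the partial counts received through the exchange."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical splits of the logs table, one scanned by each worker.
splits = [
    [{"name": "a"}, {"name": "b"}, {"name": "a"}],  # worker 1
    [{"name": "b"}, {"name": "c"}],                 # worker 2
]

# SELECT name, count(*) FROM logs GROUP BY name
result = final_aggregation(partial_aggregation(split) for split in splits)
```

Each worker sends only its per-name counts downstream, not the scanned rows, which is why the partial step reduces the data moved between stages.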
  • 20. Query Execution on Presto • SQL is converted into stages, tasks, and drivers • Tasks operate on splits, which are sections of data • The lowest stages retrieve splits from connectors
  • 21. Query Execution on Presto • Tasks run in parallel • Stages are pipelined to reduce wait time between them • If one task fails, the query fails • No disk I/O • If aggregated data does not fit in memory, the query fails • May spill to disk in the future
  • 22. Deployment & Configuration • Presto needs four configurations: • Node properties: environment configuration specific to each node • JVM config • Config properties: configuration for the Presto server • Catalog properties: configuration for connectors • Detailed documentation is provided on the Presto site
  • 23. Node Properties • etc/node.properties • Minimal configuration:
      node.environment=production
      node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
      node.data-dir=/var/presto/data
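Since `node.id` identifies a single node, each machine needs its own value. A minimal sketch of generating the file contents follows, assuming a random UUID is an acceptable identifier; `render_node_properties` is a hypothetical helper, not part of Presto.

```python
import uuid


def render_node_properties(environment, data_dir, node_id=None):
    """Return the text of etc/node.properties for one node.

    If no node_id is given, a random UUID is generated (assumption: any
    unique string works as node.id; the slide uses a UUID-shaped value).
    """
    node_id = node_id or str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",
        f"node.id={node_id}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


text = render_node_properties(
    "production", "/var/presto/data", node_id="ffffffff-ffff-ffff-ffff-ffffffffffff"
)
```

The generated ID should be written once and kept stable across restarts of the same node, so in practice the file is created at provisioning time rather than on every start.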
  • 24. Config Properties • etc/config.properties • Minimal configuration for coordinator:
      coordinator=true
      node-scheduler.include-coordinator=false
      http-server.http.port=8080
      task.max-memory=1GB
      discovery-server.enabled=true
      discovery.uri=http://example.net:8080
  • 25. Config Properties • Minimal configuration for worker:
      coordinator=false
      http-server.http.port=8080
      task.max-memory=1GB
      discovery.uri=http://example.net:8080
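The coordinator and worker configurations differ only in a few keys, so both can be rendered from one function. The sketch below reproduces exactly the properties shown on the two slides; `render_config_properties` and its parameters are hypothetical names introduced for illustration.

```python
def render_config_properties(is_coordinator, discovery_uri,
                             http_port=8080, task_max_memory="1GB"):
    """Return the text of etc/config.properties for a coordinator or a worker."""
    props = [("coordinator", "true" if is_coordinator else "false")]
    if is_coordinator:
        # Keep query processing off the coordinator, as in the slide.
        props.append(("node-scheduler.include-coordinator", "false"))
    props.append(("http-server.http.port", str(http_port)))
    props.append(("task.max-memory", task_max_memory))
    if is_coordinator:
        # The coordinator in this setup also hosts the discovery server.
        props.append(("discovery-server.enabled", "true"))
    props.append(("discovery.uri", discovery_uri))
    return "\n".join(f"{key}={value}" for key, value in props) + "\n"


coordinator_conf = render_config_properties(True, "http://example.net:8080")
worker_conf = render_config_properties(False, "http://example.net:8080")
```

Both roles point `discovery.uri` at the same address, which is how workers find the coordinator in this minimal setup.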
  • 26. Catalog Properties • Presto connectors are mounted in catalogs • Create catalog properties files in etc/catalog • For example, etc/catalog/hive.properties configures the Hive connector:
      connector.name=hive-hadoop2
      hive.metastore.uri=thrift://example.net:9083
  • 27. Presto’s Roadmap • In the next year: • Complex data structures • Create table with partitioning • Huge joins and aggregations • Spill to disk • Basic task recovery • Native store • Authentication & authorization • * Based on the Presto Meetup, May 2014
  • 28. Data Visualization with Presto - Demo • According to Presto’s roadmap, there will be an official ODBC driver for connecting Presto to major BI tools • Prestogres provides an alternative for now: use PostgreSQL’s ODBC driver • It is also not difficult to integrate Presto with other data visualization tools such as Grafana
  • 29. Grafana • An open source metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB • But we may not be satisfied with these DBs, or may just want to visualize data on HDFS, especially large-scale data
  • 30. Integrating Presto with Grafana • Presto provides many useful date & time functions • current_date → date • current_time → time with time zone • current_timestamp → timestamp with time zone • from_unixtime(unixtime) → timestamp • localtime → time • now() → timestamp with time zone • to_unixtime(timestamp) → double
  • 31. Integrating Presto with Grafana • Presto also supports many common aggregation functions • avg(x) → double • count(x) → bigint • max(x) → [same as input] • min(x) → [same as input] • sum(x) → [same as input] • ...
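The date/time and aggregation functions listed above are enough to produce the time-bucketed series a dashboard plots. The following Python sketch builds such a query as a string; it is not the custom datasource from the talk. The table `metrics` and the columns `ts` and `latency_ms` are hypothetical, and identifiers are interpolated without escaping, so the helper should only ever receive trusted names.

```python
def build_timeseries_query(table, time_column, value_column,
                           start_unix, end_unix, interval_seconds):
    """Build a SQL string that averages a value per fixed-width time bucket.

    to_unixtime() turns the timestamp into seconds, which are rounded down
    to the start of the bucket; from_unixtime() converts the requested
    range back into timestamps for the WHERE clause.
    """
    interval = int(interval_seconds)
    bucket = f"floor(to_unixtime({time_column}) / {interval}) * {interval}"
    return (
        f"SELECT {bucket} AS bucket, avg({value_column}) AS value "
        f"FROM {table} "
        f"WHERE {time_column} >= from_unixtime({int(start_unix)}) "
        f"AND {time_column} < from_unixtime({int(end_unix)}) "
        f"GROUP BY {bucket} "
        f"ORDER BY bucket"
    )


query = build_timeseries_query("metrics", "ts", "latency_ms", 0, 3600, 60)
```

Each result row is then a (bucket start in Unix seconds, value) pair, a shape that a graphing front end can plot directly.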
  • 32. Integrating Presto with Grafana • So we implemented a custom datasource for Presto to work with Grafana • Interactively visualize data on HDFS • [Diagram: Grafana → Presto → interactive query → HDFS]
  • 34. References • Martin Traverso, “Presto: Interacting with petabytes of data at Facebook” • Sadayuki Furuhashi, “Presto: Interactive SQL Query Engine for Big Data” • Sundstrom, “Presto: Past, Present, and Future” • “Presto Concepts” in the Presto documentation