SQL on Hadoop 
A Perspective from a Cloud-based 
Managed Service Provider 
Masahiro Nakagawa 
Sep 13, 2014 
Hadoop Meetup in Taiwan
Today’s agenda 
> Self introduction 
> Why SQL? 
> Hive 
> Presto 
> Conclusion
Who are you? 
> Masahiro Nakagawa 
> github/twitter: @repeatedly 
> Treasure Data, Inc. 
> Senior Software Engineer 
> Fluentd / td-agent developer 
> I love OSS :) 
> D language - Phobos committer 
> Fluentd - Main maintainer 
> MessagePack / RPC - D and Python (RPC only) 
> The organizer of Presto Source Code Reading 
> etc…
Do you love SQL?
Why do we love SQL? 
> Easy to understand what we are doing 
> declarative language 
> common interface for data manipulation 
> There are many users 
> SQL is not the best language, but it is 
better than uncommon interfaces
We want to use SQL 
in the Hadoop world
SQL Players on Hadoop 
(the original slide color-codes commercial products) 
> Batch (latency: minutes - hours) 
> Hive 
> Spark SQL 
> Short Batch / Low latency (latency: seconds - minutes) 
> Presto 
> Impala 
> Drill 
> HAWQ 
> Actian 
> etc… 
> Stream (latency: immediate) 
> Norikra 
> StreamSQL
SQL Players on Hadoop 
(the original slide color-codes commercial products) 
> Red Ocean: Batch and Short Batch / Low latency 
> Hive, Spark SQL 
> Presto, Impala, Drill, HAWQ, Actian, etc… 
> Blue Ocean?: Stream 
> Norikra 
> StreamSQL
3 query engines on Treasure Data 
> Hive (batch) 
> for ETL and scheduled reporting 
> Presto (short batch / low latency) 
> for Ad hoc queries 
> Pig 
> Not SQL 
> There aren’t as many users… ;( 
(Hive and Presto are today’s talk)
Hive 
https://hive.apache.org/
What’s Hive 
> Needs no explanation ;) 
> Most popular project in the ecosystem 
> HiveQL and MapReduce 
> Writing MapReduce code is hard 
> Hive is evolving rapidly through the Stinger initiative 
> Vectorized Processing 
> Query optimization with statistics 
(both sketched below) 
> Tez instead of MapReduce 
> etc…
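A hedged sketch of turning these Stinger-era features on in plain HiveQL (property names as of Hive 0.13-era releases and defaults vary by version; the impressions table is just an example): 
 
-- Enable vectorized query execution (processes rows in batches) 
SET hive.vectorized.execution.enabled = true; 
-- Enable the cost-based optimizer (Calcite/Optiq based, Hive 0.13+) 
SET hive.cbo.enable = true; 
-- Gather the table- and column-level statistics the optimizer relies on 
ANALYZE TABLE impressions COMPUTE STATISTICS; 
ANALYZE TABLE impressions COMPUTE STATISTICS FOR COLUMNS;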
Apache Tez 
> Low-level framework for YARN applications 
> Next-generation query engine 
> Provides a good IR for Hive, Pig and more 
> Task- and DAG-based pipelining 
> Spark uses a similar DAG model 
[Diagram: a Task = Input → Processor → Output; tasks are wired into a DAG] 
http://tez.apache.org/
Hive on MR vs. Hive on Tez 
SELECT g1.x, g1.avg, g2.cnt 
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1 
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 
ON (g1.x = g2.x) ORDER BY avg;
[Diagram: the same query on MapReduce vs. Tez. 
MapReduce: GROUP a BY a.x and GROUP b BY b.x run as separate MR jobs, each writing its result to HDFS; JOIN (a, b) runs as another job writing to HDFS; ORDER BY runs as a final job. 
Tez: GROUP BY a.x, GROUP BY b.x, JOIN (a, b) and ORDER BY form a single DAG, so the unnecessary intermediate HDFS writes are avoided. 
Source: http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9]
Why still use MapReduce? 
> The emphasis is on stability / reliability 
> Speed is important but not most important 
> Can use an MPP query engine for short batch 
> Tez/Spark are immature 
> Hard to manage in a multi-tenant env 
> Different failure models 
> We are now testing Tez for Hive 
> No code change needed for Hive; Spark is hard… 
> Disabling Tez is easy: just remove 
‘set hive.execution.engine=tez;’ (sketch below)
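A minimal sketch of that switch in HiveQL (‘mr’ was the default engine at the time): 
 
-- Run subsequent queries on Tez 
SET hive.execution.engine=tez; 
-- Fall back to classic MapReduce 
SET hive.execution.engine=mr;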
Presto 
http://prestodb.io/
What’s Presto? 
A distributed SQL query engine 
for interactive data analysis 
against GBs to PBs of data.
Presto’s history 
> 2012 Fall: Project started at Facebook 
> Designed for interactive queries 
with the speed of a commercial data 
warehouse 
> and scalability to the size of Facebook 
> 2013 Winter: Open sourced! 
> 30+ contributors in 6 months 
> including people outside of Facebook
What problems does it solve? 
> We couldn’t visualize data in HDFS directly 
using dashboards or BI tools 
> because Hive is too slow (not interactive) 
> or ODBC connectivity is unavailable/unstable 
> We needed to store daily-batch results to an 
interactive DB for quick response 
(PostgreSQL, Redshift, etc.) 
> Interactive DBs cost more and are less scalable 
> Some data is not stored in HDFS 
> We need to copy the data into HDFS to analyze it
[Diagram: two separate platforms. Batch analysis platform: daily/hourly batches run on HDFS + Hive. Visualization platform: PostgreSQL, etc. answers interactive queries from dashboards and commercial BI tools.]
[Same diagram, annotated with the pain points: 
✓ the interactive DB is less scalable 
✓ extra cost 
✓ can’t query against “live” data directly 
✓ more work to manage 2 platforms]
[Diagram: the two platforms merge. Presto sits beside Hive on HDFS: Hive keeps running the daily/hourly batches, while dashboards run interactive queries through Presto directly, replacing the separate PostgreSQL-style interactive DB.]
[Diagram: Presto runs SQL on any data set: HDFS (with Hive for daily/hourly batches), Cassandra, MySQL, and commercial DBs, serving interactive queries to dashboards.]
[Diagram: the unified data analysis platform. Presto serves interactive queries over HDFS, Cassandra, MySQL, and commercial DBs to dashboards and commercial BI tools: 
✓ IBM Cognos 
✓ Tableau 
✓ ...]
[Screenshot: dashboard on chart.io: https://chartio.com/]
What can Presto do? 
> Query interactively (in milliseconds to minutes) 
> MapReduce and Hive are still necessary for ETL 
> Query using commercial BI tools or dashboards 
> Reliable ODBC/JDBC connectivity 
> Query across multiple data sources such as 
Hive, HBase, Cassandra, or even commercial DBs 
> Plugin mechanism 
> Integrate batch analysis + visualization 
into a single data analysis platform
Presto’s deployment 
> Facebook 
> Multiple geographical regions 
> scaled to 1,000 nodes 
> actively used by 1,000+ employees 
> processing 1PB/day 
> Netflix, Dropbox, Treasure Data, Airbnb, 
Qubole, LINE, GREE, Scaleout, etc 
> Presto as a Service 
> Treasure Data, Qubole
Distributed architecture
[Diagram: a Client talks to the Coordinator over HTTP; the Coordinator and three Workers load a Connector Plugin, which fronts Storage / Metadata; all servers register with a Discovery Service.] 
 
How a query flows: 
1. Client sends a query using HTTP 
2. Coordinator builds a query plan 
(the connector plugin provides metadata such as table schema) 
3. Coordinator sends tasks to workers 
4. Workers read data through the connector plugin 
5. Workers run tasks in memory and in parallel 
6. Client gets the result from a worker
What are connectors? 
> Access to storage and metadata 
> provide table schema to coordinators 
> provide table rows to workers 
> Connectors plug into Presto 
> written in Java 
> Implementations: 
> Hive connector 
> Cassandra connector 
> MySQL through JDBC connector (prerelease) 
> Or your own connector
Hive connector 
[Diagram: the Hive Connector gives the coordinator and workers access to HDFS and the Hive Metastore; the Discovery Service finds servers in a cluster.]
Cassandra connector 
[Diagram: the same architecture, with the Cassandra Connector in front of a Cassandra cluster.]
Multiple connectors in a query 
[Diagram: one query uses the Hive Connector (HDFS / Metastore), the Cassandra Connector (Cassandra), and other connectors for other data sources at the same time.]
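A sketch of what this enables, using Presto’s catalog.schema.table addressing (catalog names depend on your connector configuration; the tables and columns here are hypothetical): 
 
-- Join “live” Cassandra data with history in HDFS, in one query 
SELECT u.name, count(*) AS pv 
FROM hive.web.access_logs AS l 
JOIN cassandra.prod.users AS u ON l.user_id = u.user_id 
GROUP BY u.name 
ORDER BY pv DESC 
LIMIT 10;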
Distributed architecture 
> 3 types of servers: 
> Coordinator, worker, discovery service 
> Get data/metadata through connector 
plugins 
> Presto is NOT a database 
> Presto provides SQL on existing data stores 
> Client protocol is HTTP + JSON 
> Language bindings: 
Ruby, Python, PHP, Java (JDBC), R, Node.js...
Query Execution
Presto’s execution model 
> Presto is NOT MapReduce 
> Uses its own execution engine 
> Presto’s query plan is based on DAG 
> more like Apache Tez / Spark or 
traditional MPP databases 
> Impala and Drill use a similar model
How does a query run? 
> Coordinator 
> SQL Parser 
> Query Planner 
> Execution planner 
> Workers 
> Task execution scheduler
[Diagram: SQL → SQL Parser → AST → Logical Planner → Logical Query Plan → Optimizer → Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan. 
The Connector provides metadata (✓ table schema) to the planners; the NodeManager and Discovery Service provide the ✓ node list to the Execution Planner.]
[Same diagram: the Logical Planner, Optimizer, and Distributed Planner together form the Query Planner (today’s talk).]
Query Planner 
 
SQL: 
SELECT 
name, 
count(*) AS c 
FROM impressions 
GROUP BY name 
 
Table schema: 
impressions ( 
name varchar, 
time bigint 
) 
 
Logical query plan: 
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c) 
 
Distributed query plan: 
Table scan → Partial aggr → Sink → Exchange → Final aggr → Sink → Exchange → Output
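A minimal sketch for inspecting these plans yourself with Presto’s EXPLAIN statement (option support may vary by Presto version): 
 
-- Print the logical query plan 
EXPLAIN (TYPE LOGICAL) 
SELECT name, count(*) AS c FROM impressions GROUP BY name; 
-- Print the distributed query plan, split into stages 
EXPLAIN (TYPE DISTRIBUTED) 
SELECT name, count(*) AS c FROM impressions GROUP BY name;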
Query Planner - Stages 
The distributed plan is cut into stages at the exchanges: 
> Stage-2: Table scan → Partial aggr → Sink 
> Stage-1: Exchange → Final aggr → Sink (pipelined aggregation) 
> Stage-0: Exchange → Output 
Each Exchange boundary is an inter-worker data transfer.
Execution Planner 
+ Node list (✓ 2 workers) 
[Diagram: each stage is instantiated on the cluster. Worker 1 and Worker 2 each get a copy of the Partial aggr stage (Table scan → Partial aggr → Sink) and the Final aggr stage (Exchange → Final aggr → Sink); a single Exchange → Output collects the result.]
Execution Planner - Tasks 
> 1 task / worker / stage 
> ✓ All tasks run in parallel 
[Diagram: Worker 1 and Worker 2 each run one Partial-aggr task and one Final-aggr task; Exchange → Output sits on top.]
Execution Planner - Split 
Each task is divided into splits: 
> table scan: many splits / task = many threads / worker 
> other stages: 1 split / task = 1 thread / worker 
[Diagram: the Table scan tasks on Worker 1 and Worker 2 fan out into many splits; the aggregation tasks run as one split each.]
MapReduce vs. Presto 
[Diagram: in MapReduce, each stage writes its data to disk and the next stage waits, then reads it back (map → disk → reduce → disk → map → …). In Presto, tasks stream data to each other.] 
All stages are pipelined: 
✓ No wait time 
✓ No fault-tolerance 
Memory-to-memory data transfer: 
✓ No disk IO 
✓ Data chunk must fit in memory
Query Execution 
> SQL is converted into stages, tasks and splits 
> All tasks run in parallel 
> No wait time between stages (pipelined) 
> If one task fails, all tasks fail at once (query fails) 
> Memory-to-memory data transfer 
> No disk IO 
> If aggregated data doesn’t fit in memory, 
query fails 
> Note: the query dies but the worker doesn’t; 
memory consumption of all queries is fully managed
Why select Presto? 
> The ease of operations 
> Easy to deploy: just drop in a jar 
> Easy to extend its functionality 
> Pluggable, DI-based loose coupling 
> Doesn’t crash when a query fails 
> Standard SQL syntax 
> Important for existing DB/DWH users 
> HiveQL is for MapReduce, not an MPP DB (example below)
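For example, expanding an array column differs between the two dialects (a sketch; the users table and its tags array column are hypothetical): 
 
-- HiveQL: MapReduce-flavored extension 
SELECT name, tag 
FROM users LATERAL VIEW explode(tags) t AS tag; 
-- Presto: standard SQL UNNEST 
SELECT name, tag 
FROM users CROSS JOIN UNNEST(tags) AS t (tag);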
Our customer use cases 
 
> Online Ad 
> Hive: scheduled reporting for customers, once every hour 
> Presto: checking ad-network performance; optimizing delivery logic in realtime 
> Web/Social 
> Hive: scheduled reporting for management; computing KPIs 
> Presto: aggregation for user support; measuring the effect of user campaigns 
> Retail 
> Hive: scheduled reporting for website, PoS and touch panel data (hard deadlines!) 
> Presto: ad-hoc queries for basket analysis; aggregating data for product development
Conclusion
Batch summary 
> MapReduce-based Hive is still the default choice 
> Stable, with lots of shared experience and knowledge 
> Hive with Tez is for Hadoop users 
> No code change needed 
> HDP includes Tez by default 
> Spark and Spark SQL are a good alternative 
> Can’t reuse Hadoop knowledge 
> Mainly for in-memory processing for now
Short batch summary 
> Presto is a good default choice 
> Easy to manage and has useful features 
> Need faster queries? Try Impala 
> for HDFS and HBase 
> CDH includes Impala by default 
> If you are a challenger, check out Drill 
> The project’s goal is ambitious 
> The status is developer preview
Stream summary 
> Fluentd and Norikra 
> Fluentd is for robust log collection 
> Norikra is for SQL-based CEP 
> StreamSQL 
> for Spark users 
> Current status is POC
Lastly… 
> Use different engines for different requirements 
> Hadoop/Spark for batch jobs 
> MapReduce won't die for the time being 
> MPP query engine for interactive queries 
> These engines will be integrated into 
one system in the future 
> Batch engines now use DAG pipelines 
> Short batch engines will support task recovery 
The differences will become minimal
Enjoy SQL!
Cloud service for the entire data pipeline, 
including Presto 
Check: treasuredata.com
