This document discusses SQL engines for Hadoop, including Hive, Presto, and Impala. Hive is best for batch jobs due to its stability. Presto provides interactive queries across data sources and is easier to manage than Hive with Tez. Presto's distributed architecture allows queries to run in parallel across nodes. It supports pluggable connectors to access different data stores and has language bindings for multiple clients.
3. Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC: D and Python (RPC only)
> The organizer of Presto Source Code Reading
> etc…
5. Why do we love SQL?
> Easy to understand what we are doing
> declarative language
> common interface for data manipulation
> There are many users
> SQL is not the best, but it's
better than uncommon interfaces
7. SQL Players on Hadoop
This color indicates a commercial product
> Batch (latency: minutes - hours)
> Hive, Spark SQL
> Short Batch / Low latency (latency: seconds - minutes)
> Presto, Impala, Drill, HAWQ, Actian, etc…
> Stream (latency: immediate)
> Norikra, StreamSQL
8. SQL Players on Hadoop
This color indicates a commercial product
> Red Ocean: Batch (Hive, Spark SQL) and
Short Batch / Low latency (Presto, Impala, Drill, HAWQ, Actian, etc…)
> Blue Ocean?: Stream (Norikra, StreamSQL)
9. 3 query engines on Treasure Data
> Hive (batch)
> for ETL and scheduled reporting
> Presto (short batch / low latency)
> for Ad hoc queries
> Pig
> Not SQL
> There aren’t as many users… ;(
Today’s talk
11. What’s Hive
> Needs no explanation ;)
> Most popular project in the ecosystem
> HiveQL and MapReduce
> Writing MapReduce code is hard
> Hive is evolving rapidly thanks to the Stinger initiative
> Vectorized Processing
> Query optimization with statistics
> Tez instead of MapReduce
> etc…
12. Apache Tez
> Low-level framework for YARN applications
> Next-generation query engine
> Provides a good IR for Hive, Pig and more
> Task- and DAG-based pipelining
> Spark uses a similar DAG model
(Diagram: a Task is Input → Processor → Output; tasks compose into a DAG)
http://tez.apache.org/
13. Hive on MR vs. Hive on Tez
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x) ORDER BY avg;
(Diagram: with MapReduce, each step (GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), ORDER BY) runs as a separate map/reduce job and writes its intermediate result to HDFS; with Tez, the same plan runs as a single DAG of tasks, which can avoid the unnecessary HDFS writes between stages.)
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
14. Why still use MapReduce?
> The emphasis is on stability / reliability
> Speed is important but not most important
> Can use an MPP query engine for short batch
> Tez/Spark are immature
> Hard to manage in a multi-tenant env
> Different failure models
> We are now testing Tez for Hive
> No code change needed for Hive; Spark is hard…
> Disabling Tez is easy: just remove
'set hive.execution.engine=tez;'
16. What’s Presto?
A distributed SQL query engine
for interactive data analysis
against GBs to PBs of data.
17. Presto’s history
> 2012 Fall: Project started at Facebook
> Designed for interactive queries
with the speed of a commercial data
warehouse
> and scalability to the size of Facebook
> 2013 Winter: Open sourced!
> 30+ contributors in 6 months
> including people outside of Facebook
18. What problems does it solve?
> We couldn’t visualize data in HDFS directly
using dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results to an
interactive DB for quick response
(PostgreSQL, Redshift, etc.)
> An interactive DB costs more and is less scalable
> Some data is not stored in HDFS
> We need to copy the data into HDFS to analyze it
22. Batch analysis platform / Visualization platform
(Diagram: HDFS + Hive run the daily/hourly batches; results are loaded into PostgreSQL etc., which serves interactive queries to dashboards and commercial BI tools.)
23. The same diagram, annotated with the pain points:
✓ Less scalable
✓ Extra cost
✓ Can’t query against “live” data directly
✓ More work to manage 2 platforms
25. Presto
(Diagram: Presto runs SQL on any data set (HDFS/Hive, Cassandra, MySQL, commercial DBs) and serves interactive queries to the dashboard, while Hive keeps running the daily/hourly batches.)
26. The same diagram as a single data analysis platform: Presto also serves commercial BI tools
✓ IBM Cognos
✓ Tableau
✓ ...
28. What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity
> Query across multiple data sources such as
Hive, HBase, Cassandra, or even commercial DBs
> Plugin mechanism
> Integrate batch analysis + visualization
into a single data analysis platform
29. Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb,
Qubole, LINE, GREE, Scaleout, etc
> Presto as a Service
> Treasure Data, Qubole
39. What are Connectors?
> Access to storage and metadata
> provide table schema to coordinators
> provide table rows to workers
> Connectors are pluggable to Presto
> written in Java
> Implementations:
> Hive connector
> Cassandra connector
> MySQL through JDBC connector (prerelease)
> Or your own connector
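The division of labor above can be sketched conceptually: a connector hands table schemas to the coordinator for planning and table rows to the workers for execution. Real connectors are Java plugins implementing Presto's SPI; the class and method names below are hypothetical, purely to illustrate that split.

```python
# Conceptual sketch only: real Presto connectors are Java plugins.
# All names here (InMemoryConnector, get_table_schema, get_rows) are
# hypothetical illustrations of a connector's responsibilities.

class InMemoryConnector:
    """Serves table schemas to the coordinator and rows to workers."""

    def __init__(self, tables):
        # tables: {name: {"columns": [(col, type), ...], "rows": [...]}}
        self.tables = tables

    def get_table_schema(self, table):
        # Called by the coordinator during query planning.
        return self.tables[table]["columns"]

    def get_rows(self, table, split_index, num_splits):
        # Called by a worker; each worker reads one "split" of the data.
        rows = self.tables[table]["rows"]
        return rows[split_index::num_splits]

connector = InMemoryConnector({
    "impressions": {
        "columns": [("name", "varchar"), ("time", "bigint")],
        "rows": [("a", 1), ("b", 2), ("a", 3), ("c", 4)],
    }
})

print(connector.get_table_schema("impressions"))
print(connector.get_rows("impressions", 0, 2))  # split 0 of 2
```

Because the engine only ever sees this narrow interface, swapping HDFS for Cassandra or MySQL is a matter of providing another implementation.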
40. Hive connector
(Diagram: the client sends a query to the coordinator; the coordinator uses the Hive connector to reach HDFS and the Hive Metastore, and distributes work to the workers. The discovery service finds servers in the cluster.)
41. Cassandra connector
(Diagram: the same architecture with the Cassandra connector; workers read rows directly from Cassandra. The discovery service finds servers in the cluster.)
42. Multiple connectors in a query
(Diagram: a single query can combine the Hive connector (HDFS / Metastore), the Cassandra connector, and other connectors for other data sources, all behind the same client / coordinator / worker / discovery-service architecture.)
43. Distributed architecture
> 3 types of servers:
> Coordinator, worker, discovery service
> Get data/metadata through connector
plugins.
> Presto is NOT a database
> Presto provides SQL on existing data stores
> Client protocol is HTTP + JSON
> Language bindings:
Ruby, Python, PHP, Java (JDBC), R, Node.js...
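The HTTP + JSON protocol can be sketched with the standard library: the client POSTs the SQL text to the coordinator's /v1/statement endpoint, then follows the nextUri link in each JSON response until the query completes, collecting the data pages along the way. A minimal sketch, assuming a coordinator on localhost:8080 and a placeholder user name:

```python
import json
import urllib.request

PRESTO = "http://localhost:8080"  # placeholder coordinator address

def collect_rows(pages):
    """Merge the 'data' arrays from a sequence of Presto response pages."""
    rows = []
    for page in pages:
        rows.extend(page.get("data", []))
    return rows

def http_json(url, body=None, headers=None):
    """GET (or POST, if body is given) a URL and parse the JSON response."""
    req = urllib.request.Request(url, data=body, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_query(sql, user="example"):
    """POST SQL to /v1/statement, then follow nextUri links to completion."""
    page = http_json(PRESTO + "/v1/statement", body=sql.encode(),
                     headers={"X-Presto-User": user})
    pages = [page]
    while "nextUri" in page:
        page = http_json(page["nextUri"])
        pages.append(page)
    return collect_rows(pages)

# Page merging demonstrated on canned responses (no server needed):
pages = [{"columns": [{"name": "name"}, {"name": "c"}], "data": [["a", 2]]},
         {"data": [["b", 1]]}]
print(collect_rows(pages))  # [['a', 2], ['b', 1]]
```

Because the protocol is just HTTP + JSON, bindings in any language reduce to this loop, which is why clients exist for so many languages.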
45. Presto’s execution model
> Presto is NOT MapReduce
> It uses its own execution engine
> Presto’s query plan is based on DAG
> more like Apache Tez / Spark or
traditional MPP databases
> Impala and Drill use a similar model
47. Query compilation
(Diagram: SQL → SQL Parser → AST → Logical Planner → Logical Query Plan → Optimizer → Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan. The Logical Planner gets table schemas through the connector's Metadata interface; the Execution Planner gets the node list from the NodeManager, which talks to the Discovery Server.)
48. (The same pipeline; the SQL Parser, Logical Planner, Optimizer and Distributed Planner together form the Query Planner, the subject of today's talk.)
49. Query Planner
SQL:
SELECT
name,
count(*) AS c
FROM impressions
GROUP BY name
Table schema:
impressions (
name varchar,
time bigint
)
Logical query plan:
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c)
Distributed query plan:
Table scan → Partial aggr → Sink → Exchange → Final aggr → Sink → Exchange → Output
50. Query Planner - Stages
(The distributed plan is cut into stages at the Exchanges, which are inter-worker data transfers: Stage-2 is Table scan → Partial aggr → Sink, Stage-1 is Exchange → Final aggr → Sink, Stage-0 is Exchange → Output. The partial aggregation is pipelined with the table scan.)
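The partial aggr / final aggr split is the classic two-phase distributed GROUP BY: each worker pre-aggregates the rows of its own split, and only the compact partial results cross the Exchange to be merged. A toy sketch of the idea (the splits and table contents are made up):

```python
from collections import Counter

# Rows of the impressions table, divided across two hypothetical workers.
split_0 = [("a",), ("b",), ("a",)]
split_1 = [("a",), ("c",)]

def partial_aggr(rows):
    """Stage-2: each worker counts its own split, pipelined with the scan."""
    return Counter(name for (name,) in rows)

def final_aggr(partials):
    """Stage-1: merge the partial counts after the inter-worker exchange."""
    total = Counter()
    for p in partials:
        total += p
    return dict(total)

result = final_aggr([partial_aggr(split_0), partial_aggr(split_1)])
print(result)  # {'a': 3, 'b': 1, 'c': 1}
```

Only the per-worker counters travel between stages, not the raw rows, which is what makes the memory-to-memory exchange cheap.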
54. MapReduce vs. Presto
(Diagram: in MapReduce, map tasks write their output to disk, and reduce tasks wait between stages and write data to disk again. In Presto, all stages are pipelined and tasks transfer data memory-to-memory:
✓ No wait time
✓ No disk IO
✓ No fault-tolerance
✓ Data chunks must fit in memory)
55. Query Execution
> SQL is converted into stages, tasks and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (query fails)
> Memory-to-memory data transfer
> No disk IO
> If aggregated data doesn’t fit in memory,
query fails
> Note: the query dies, but the worker doesn't;
memory consumption of all queries is fully managed
56. Why select Presto?
> Ease of operations
> Easy to deploy. Just drop a jar
> Easy to extend its functionalities
> Pluggable and DI-based loose coupling
> Doesn’t crash when a query fails
> Standard SQL syntax
> Important for existing DB/DWH users
> HiveQL is for MapReduce, not MPP DB
57. Our customer use cases
> Online Ad
> Hive: scheduled reporting for customers (once every hour)
> Presto: checking ad-network performance; delivery-logic optimization in realtime
> Web/Social
> Hive: scheduled reporting for management; computing KPIs
> Presto: aggregation for user support; measuring the effect of user campaigns
> Retail
> Hive: scheduled reporting for website, PoS and touch panel data (hard deadlines!)
> Presto: ad-hoc queries for basket analysis; aggregating data for product development
59. Batch summary
> MapReduce-based Hive is still the default choice
> Stable & Lots of shared experience and knowledge
> Hive with Tez is for Hadoop users
> No code change needed
> HDP includes Tez by default
> Spark and Spark SQL are a good alternative
> Can’t reuse Hadoop knowledge
> Mainly for in-memory processing for now
60. Short batch summary
> Presto is a good default choice
> Easy to manage, and it has useful features
> Need faster queries? Try Impala
> for HDFS and HBase
> CDH includes Impala by default
> If you are a challenger, check out Drill
> The project’s goal is ambitious
> The status is developer preview
61. Stream summary
> Fluentd and Norikra
> Fluentd is for robust log collection
> Norikra is for SQL-based CEP
> StreamSQL
> for Spark users
> Current status is POC
62. Lastly…
> Use different engines for different requirements
> Hadoop/Spark for batch jobs
> MapReduce won't die for the time being
> MPP query engine for interactive queries
> These engines will be integrated into
one system in the future
> Batch engines now use DAG pipelines
> Short batch engines will support task recovery
The differences will be minimal