SQL on Hadoop 
A Perspective from a Cloud-based 
Managed Service Provider 
Masahiro Nakagawa 
Sep 13, 2014 
Hadoop Meetup in Taiwan
Today’s agenda 
> Self introduction 
> Why SQL? 
> Hive 
> Presto 
> Conclusion
Who are you? 
> Masahiro Nakagawa 
> github/twitter: @repeatedly 
> Treasure Data, Inc. 
> Senior Software Engineer 
> Fluentd / td-agent developer 
> I love OSS :) 
> D language - Phobos committer 
> Fluentd - Main maintainer 
> MessagePack / RPC - D and Python (RPC only) 
> The organizer of Presto Source Code Reading 
> etc…
Do you love SQL?
Why do we love SQL? 
> Easy to understand what we are doing 
> declarative language 
> common interface for data manipulation 
> There are many users 
> SQL is not the best language, but it is 
better than uncommon interfaces
We want to use SQL 
in the Hadoop world
SQL Players on Hadoop 
(the original slide color-codes commercial products) 
> Batch (latency: minutes - hours) 
> Hive 
> Spark SQL 
> Short Batch / Low latency (latency: seconds - minutes) 
> Presto 
> Impala 
> Drill 
> HAWQ 
> Actian 
> etc… 
> Stream (latency: immediate) 
> Norikra 
> StreamSQL
SQL Players on Hadoop 
(the original slide color-codes commercial products) 
> Red Ocean: Batch and Short Batch / Low latency 
> Hive, Spark SQL 
> Presto, Impala, Drill, HAWQ, Actian, etc… 
> Blue Ocean?: Stream 
> Norikra 
> StreamSQL
3 query engines on Treasure Data 
> Hive (batch) 
> for ETL and scheduled reporting 
> Presto (short batch / low latency) 
> for Ad hoc queries 
> Pig 
> Not SQL 
> There aren’t as many users… ;( 
(Hive and Presto are today’s talk)
Hive 
https://hive.apache.org/
What’s Hive 
> Needs no explanation ;) 
> Most popular project in the ecosystem 
> HiveQL and MapReduce 
> Writing MapReduce code is hard 
> Hive is evolving rapidly through the Stinger initiative 
> Vectorized Processing 
> Query optimization with statistics 
(both sketched below) 
> Tez instead of MapReduce 
> etc…
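A hedged sketch of turning these Stinger-era features on in plain HiveQL (property names as of Hive 0.13-era releases and defaults vary by version; the impressions table is just an example): 
 
-- Enable vectorized query execution (processes rows in batches) 
SET hive.vectorized.execution.enabled = true; 
-- Enable the cost-based optimizer (Calcite/Optiq based, Hive 0.13+) 
SET hive.cbo.enable = true; 
-- Gather the table- and column-level statistics the optimizer relies on 
ANALYZE TABLE impressions COMPUTE STATISTICS; 
ANALYZE TABLE impressions COMPUTE STATISTICS FOR COLUMNS;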
Apache Tez 
> Low-level framework for YARN applications 
> Next-generation query engine 
> Provides a good IR for Hive, Pig and more 
> Task- and DAG-based pipelining 
> Spark uses a similar DAG model 
[Diagram: a Task = Input → Processor → Output; tasks are wired into a DAG] 
http://tez.apache.org/
Hive on MR vs. Hive on Tez 
SELECT g1.x, g1.avg, g2.cnt 
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1 
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 
ON (g1.x = g2.x) ORDER BY avg;
[Diagram: the same query on MapReduce vs. Tez. 
MapReduce: GROUP a BY a.x and GROUP b BY b.x run as separate MR jobs, each writing its result to HDFS; JOIN (a, b) runs as another job writing to HDFS; ORDER BY runs as a final job. 
Tez: GROUP BY a.x, GROUP BY b.x, JOIN (a, b) and ORDER BY form a single DAG, so the unnecessary intermediate HDFS writes are avoided. 
Source: http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9]
Why still use MapReduce? 
> The emphasis is on stability / reliability 
> Speed is important but not most important 
> Can use an MPP query engine for short batch 
> Tez/Spark are immature 
> Hard to manage in a multi-tenant env 
> Different failure models 
> We are now testing Tez for Hive 
> No code change needed for Hive; Spark is hard… 
> Disabling Tez is easy: just remove 
‘set hive.execution.engine=tez;’ (sketch below)
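A minimal sketch of that switch in HiveQL (‘mr’ was the default engine at the time): 
 
-- Run subsequent queries on Tez 
SET hive.execution.engine=tez; 
-- Fall back to classic MapReduce 
SET hive.execution.engine=mr;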
Presto 
http://prestodb.io/
What’s Presto? 
A distributed SQL query engine 
for interactive data analysis 
against GBs to PBs of data.
Presto’s history 
> 2012 Fall: Project started at Facebook 
> Designed for interactive queries 
with the speed of a commercial data 
warehouse 
> and scalability to the size of Facebook 
> 2013 Winter: Open sourced! 
> 30+ contributors in 6 months 
> including people outside of Facebook
What problems does it solve? 
> We couldn’t visualize data in HDFS directly 
using dashboards or BI tools 
> because Hive is too slow (not interactive) 
> or ODBC connectivity is unavailable/unstable 
> We needed to store daily-batch results to an 
interactive DB for quick response 
(PostgreSQL, Redshift, etc.) 
> Interactive DBs cost more and are less scalable 
> Some data is not stored in HDFS 
> We need to copy the data into HDFS to analyze it
[Diagram: two separate platforms. Batch analysis platform: daily/hourly batches run on HDFS + Hive. Visualization platform: PostgreSQL, etc. answers interactive queries from dashboards and commercial BI tools.]
[Same diagram, annotated with the pain points: 
✓ the interactive DB is less scalable 
✓ extra cost 
✓ can’t query against “live” data directly 
✓ more work to manage 2 platforms]
[Diagram: the two platforms merge. Presto sits beside Hive on HDFS: Hive keeps running the daily/hourly batches, while dashboards run interactive queries through Presto directly, replacing the separate PostgreSQL-style interactive DB.]
[Diagram: Presto runs SQL on any data set: HDFS (with Hive for daily/hourly batches), Cassandra, MySQL, and commercial DBs, serving interactive queries to dashboards.]
[Diagram: the unified data analysis platform. Presto serves interactive queries over HDFS, Cassandra, MySQL, and commercial DBs to dashboards and commercial BI tools: 
✓ IBM Cognos 
✓ Tableau 
✓ ...]
[Screenshot: dashboard on chart.io: https://chartio.com/]
What can Presto do? 
> Query interactively (in milliseconds to minutes) 
> MapReduce and Hive are still necessary for ETL 
> Query using commercial BI tools or dashboards 
> Reliable ODBC/JDBC connectivity 
> Query across multiple data sources such as 
Hive, HBase, Cassandra, or even commercial DBs 
> Plugin mechanism 
> Integrate batch analysis + visualization 
into a single data analysis platform
Presto’s deployment 
> Facebook 
> Multiple geographical regions 
> scaled to 1,000 nodes 
> actively used by 1,000+ employees 
> processing 1PB/day 
> Netflix, Dropbox, Treasure Data, Airbnb, 
Qubole, LINE, GREE, Scaleout, etc 
> Presto as a Service 
> Treasure Data, Qubole
Distributed architecture
[Diagram: a Client talks to the Coordinator over HTTP; the Coordinator and three Workers load a Connector Plugin, which fronts Storage / Metadata; all servers register with a Discovery Service.] 
 
How a query flows: 
1. Client sends a query using HTTP 
2. Coordinator builds a query plan 
(the connector plugin provides metadata such as table schema) 
3. Coordinator sends tasks to workers 
4. Workers read data through the connector plugin 
5. Workers run tasks in memory and in parallel 
6. Client gets the result from a worker
What are connectors? 
> Access to storage and metadata 
> provide table schema to coordinators 
> provide table rows to workers 
> Connectors plug into Presto 
> written in Java 
> Implementations: 
> Hive connector 
> Cassandra connector 
> MySQL through JDBC connector (prerelease) 
> Or your own connector
Hive connector 
[Diagram: the Hive Connector gives the coordinator and workers access to HDFS and the Hive Metastore; the Discovery Service finds servers in a cluster.]
Cassandra connector 
[Diagram: the same architecture, with the Cassandra Connector in front of a Cassandra cluster.]
Multiple connectors in a query 
[Diagram: one query uses the Hive Connector (HDFS / Metastore), the Cassandra Connector (Cassandra), and other connectors for other data sources at the same time.]
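A sketch of what this enables, using Presto’s catalog.schema.table addressing (catalog names depend on your connector configuration; the tables and columns here are hypothetical): 
 
-- Join “live” Cassandra data with history in HDFS, in one query 
SELECT u.name, count(*) AS pv 
FROM hive.web.access_logs AS l 
JOIN cassandra.prod.users AS u ON l.user_id = u.user_id 
GROUP BY u.name 
ORDER BY pv DESC 
LIMIT 10;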
Distributed architecture 
> 3 types of servers: 
> Coordinator, worker, discovery service 
> Get data/metadata through connector 
plugins 
> Presto is NOT a database 
> Presto provides SQL on existing data stores 
> Client protocol is HTTP + JSON 
> Language bindings: 
Ruby, Python, PHP, Java (JDBC), R, Node.js...
Query Execution
Presto’s execution model 
> Presto is NOT MapReduce 
> Uses its own execution engine 
> Presto’s query plan is based on DAG 
> more like Apache Tez / Spark or 
traditional MPP databases 
> Impala and Drill use a similar model
How does a query run? 
> Coordinator 
> SQL Parser 
> Query Planner 
> Execution planner 
> Workers 
> Task execution scheduler
[Diagram: SQL → SQL Parser → AST → Logical Planner → Logical Query Plan → Optimizer → Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan. 
The Connector provides metadata (✓ table schema) to the planners; the NodeManager and Discovery Service provide the ✓ node list to the Execution Planner.]
[Same diagram: the Logical Planner, Optimizer, and Distributed Planner together form the Query Planner (today’s talk).]
Query Planner 
 
SQL: 
SELECT 
name, 
count(*) AS c 
FROM impressions 
GROUP BY name 
 
Table schema: 
impressions ( 
name varchar, 
time bigint 
) 
 
Logical query plan: 
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c) 
 
Distributed query plan: 
Table scan → Partial aggr → Sink → Exchange → Final aggr → Sink → Exchange → Output
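A minimal sketch for inspecting these plans yourself with Presto’s EXPLAIN statement (option support may vary by Presto version): 
 
-- Print the logical query plan 
EXPLAIN (TYPE LOGICAL) 
SELECT name, count(*) AS c FROM impressions GROUP BY name; 
-- Print the distributed query plan, split into stages 
EXPLAIN (TYPE DISTRIBUTED) 
SELECT name, count(*) AS c FROM impressions GROUP BY name;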
Query Planner - Stages 
The distributed plan is cut into stages at the exchanges: 
> Stage-2: Table scan → Partial aggr → Sink 
> Stage-1: Exchange → Final aggr → Sink (pipelined aggregation) 
> Stage-0: Exchange → Output 
Each Exchange boundary is an inter-worker data transfer.
Execution Planner 
+ Node list (✓ 2 workers) 
[Diagram: each stage is instantiated on the cluster. Worker 1 and Worker 2 each get a copy of the Partial aggr stage (Table scan → Partial aggr → Sink) and the Final aggr stage (Exchange → Final aggr → Sink); a single Exchange → Output collects the result.]
Execution Planner - Tasks 
> 1 task / worker / stage 
> ✓ All tasks run in parallel 
[Diagram: Worker 1 and Worker 2 each run one Partial-aggr task and one Final-aggr task; Exchange → Output sits on top.]
Execution Planner - Split 
Each task is divided into splits: 
> table scan: many splits / task = many threads / worker 
> other stages: 1 split / task = 1 thread / worker 
[Diagram: the Table scan tasks on Worker 1 and Worker 2 fan out into many splits; the aggregation tasks run as one split each.]
MapReduce vs. Presto 
[Diagram: in MapReduce, each stage writes its data to disk and the next stage waits, then reads it back (map → disk → reduce → disk → map → …). In Presto, tasks stream data to each other.] 
All stages are pipelined: 
✓ No wait time 
✓ No fault-tolerance 
Memory-to-memory data transfer: 
✓ No disk IO 
✓ Data chunk must fit in memory
Query Execution 
> SQL is converted into stages, tasks and splits 
> All tasks run in parallel 
> No wait time between stages (pipelined) 
> If one task fails, all tasks fail at once (query fails) 
> Memory-to-memory data transfer 
> No disk IO 
> If aggregated data doesn’t fit in memory, 
query fails 
> Note: the query dies but the worker doesn’t; 
memory consumption of all queries is fully managed
Why select Presto? 
> The ease of operations 
> Easy to deploy: just drop in a jar 
> Easy to extend its functionality 
> Pluggable, DI-based loose coupling 
> Doesn’t crash when a query fails 
> Standard SQL syntax 
> Important for existing DB/DWH users 
> HiveQL is for MapReduce, not an MPP DB (example below)
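For example, expanding an array column differs between the two dialects (a sketch; the users table and its tags array column are hypothetical): 
 
-- HiveQL: MapReduce-flavored extension 
SELECT name, tag 
FROM users LATERAL VIEW explode(tags) t AS tag; 
-- Presto: standard SQL UNNEST 
SELECT name, tag 
FROM users CROSS JOIN UNNEST(tags) AS t (tag);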
Our customer use cases 
 
> Online Ad 
> Hive: scheduled reporting for customers, once every hour 
> Presto: checking ad-network performance; optimizing delivery logic in realtime 
> Web/Social 
> Hive: scheduled reporting for management; computing KPIs 
> Presto: aggregation for user support; measuring the effect of user campaigns 
> Retail 
> Hive: scheduled reporting for website, PoS and touch panel data (hard deadlines!) 
> Presto: ad-hoc queries for basket analysis; aggregating data for product development
Conclusion
Batch summary 
> MapReduce-based Hive is still the default choice 
> Stable, with lots of shared experience and knowledge 
> Hive with Tez is for Hadoop users 
> No code change needed 
> HDP includes Tez by default 
> Spark and Spark SQL are a good alternative 
> Can’t reuse Hadoop knowledge 
> Mainly for in-memory processing for now
Short batch summary 
> Presto is a good default choice 
> Easy to manage and has useful features 
> Need faster queries? Try Impala 
> for HDFS and HBase 
> CDH includes Impala by default 
> If you are a challenger, check out Drill 
> The project’s goal is ambitious 
> The status is developer preview
Stream summary 
> Fluentd and Norikra 
> Fluentd is for robust log collection 
> Norikra is for SQL-based CEP 
> StreamSQL 
> for Spark users 
> Current status is POC
Lastly… 
> Use different engines for different requirements 
> Hadoop/Spark for batch jobs 
> MapReduce won't die for the time being 
> MPP query engine for interactive queries 
> These engines will be integrated into 
one system in the future 
> Batch engines now use DAG pipelines 
> Short batch engines will support task recovery 
The differences will become minimal
Enjoy SQL!
Cloud service for the entire data pipeline, 
including Presto 
Check: treasuredata.com
