Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto
Liang-Chi Hsieh
HadoopCon 2014 in Taiwan
2. In Today’s talk
• Introduction of Presto
• Distributed architecture
• Query model
• Deployment and configuration
• Data visualization with Presto - Demo
3. SQL on/over Hadoop
• Hive
• Mature and proven solution (0.13.x)
• Drawback: execution model based on MapReduce
• Better execution engines: Hive on Tez and Hive on Spark
• Alternative, usually faster options include
• Impala, Presto, Drill, ...
4. Presto
• Presto is a distributed SQL query engine optimized
for ad-hoc analysis at interactive speed
• Data scale: GBs to PBs
• Deployment at:
• Facebook, Netflix, Dropbox, Treasure Data, Airbnb, Qubole
5. History of Presto
• Fall 2012
• Development of Presto started at Facebook
• Spring 2013
• It was rolled out to the entire company and became
the major interactive data warehouse
• Winter 2013
• Open-sourced
6. The Problems to Solve
• Hive is not optimized for interactive data analysis as
the data size grows to petabyte scale
• In practice, reduced data has to be stored in an
interactive DB that provides quick query responses
• This brings redundant maintenance cost, out-of-date
data views, data transfer, ...
• The need to incorporate other data that are not
stored in HDFS
7. Typical Batch Data Architecture
[Diagram labels: HDFS, Data Flow, Batch Run, DB, Query]
• Views generated in batch may be
out of date
• Batch workflow is too slow
14. Presto Clients
• Protocol: HTTP + JSON
• Client libraries are available in several
programming languages:
• Python, PHP, Ruby, Node.js, Java, R
• ODBC through Prestogres
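The HTTP + JSON protocol can be sketched in a few lines of Python. This is a minimal sketch, not one of the client libraries listed above. It assumes the REST flow where the SQL text is POSTed to `/v1/statement` on the coordinator and each JSON response carries a `nextUri` to poll until it disappears. The function names, the default catalog and schema, and the injectable `fetch` parameter are hypothetical and exist only to keep the example self-contained.

```python
import json
import urllib.request


def http_fetch(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response (stdlib only)."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, headers=headers or {}, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(sql, coordinator, user, catalog="hive", schema="default", fetch=http_fetch):
    """Submit a query and collect all result rows by following nextUri."""
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # Assumed flow: POST the SQL text, then GET each nextUri in turn.
    page = fetch("POST", coordinator + "/v1/statement", sql, headers)
    rows = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = fetch("GET", next_uri, None, headers)
```

Against a live cluster the call would look like `run_query("SELECT 1", "http://example.net:8080", "analyst")`, where the host and user name are placeholders.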
15. Query Model
• Presto’s execution engine does not use
MapReduce
• It employs a custom query and execution engine
• Based on a DAG, closer to Apache Tez,
Spark, or MPP databases
20. Query Execution on Presto
• SQL is converted into stages, tasks, and drivers
• Tasks operate on splits, which are sections of
the data
• The lowest stages retrieve splits from
connectors
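The relationship between splits, tasks, and stages can be illustrated with a toy model. This is a sketch only and not Presto code: all function names are hypothetical, and a plain Python list stands in for the data a connector would serve. It mimics what a query like `SELECT country, count(*) ... GROUP BY country` might do: leaf tasks count within their own split, and an upper stage merges the partial counts.

```python
from collections import Counter


def make_splits(rows, split_size):
    """Cut the input into splits, the sections of data that tasks operate on."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def leaf_task(split):
    """Lowest stage: read one split and emit a partial count per key."""
    return Counter(split)


def final_stage(partials):
    """Upper stage: merge the partial results into the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


rows = ["tw", "us", "tw", "jp", "us", "tw"]
splits = make_splits(rows, split_size=2)
# The generator hands each partial result to the final stage as it is produced.
result = final_stage(leaf_task(split) for split in splits)
```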
21. Query Execution on Presto
• Tasks are run in parallel
• Pipelined to reduce wait time between stages
• If one task fails, the whole query fails
• No disk I/O
• If aggregated data does not fit in memory, the
query fails
• May spill to disk in the future
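The failure semantics above can be sketched as follows. This is a toy model under stated assumptions, not Presto's scheduler: `run_stage`, `count_task`, and `QueryFailed` are hypothetical names, threads stand in for tasks, and `max_groups` is a stand-in for the memory budget of the aggregation. There is no retry and no spill, so any task error or an oversized aggregation state fails the whole query.

```python
from concurrent.futures import ThreadPoolExecutor


class QueryFailed(Exception):
    """Raised when any task fails or the aggregation outgrows its budget."""


def count_task(split):
    """Count keys in one split; a None record simulates a task failure."""
    counts = {}
    for key in split:
        if key is None:
            raise ValueError("bad record in split")
        counts[key] = counts.get(key, 0) + 1
    return counts


def run_stage(task, splits, max_groups):
    """Run one task per split in parallel and merge the results in memory."""
    merged = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        try:
            # pool.map re-raises a task's exception when its result is read.
            for partial in pool.map(task, splits):
                for key, count in partial.items():
                    merged[key] = merged.get(key, 0) + count
                    if len(merged) > max_groups:
                        raise MemoryError("aggregation state exceeds the memory budget")
        except Exception as exc:
            # No task recovery and no spill to disk: the query fails as a whole.
            raise QueryFailed(str(exc)) from exc
    return merged
```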
22. Deployment & Configuration
• Four configurations need to be set up for
Presto
• Node properties: environment configuration
specific to each node
• JVM config
• Config properties: configuration for Presto server
• Catalog properties: configuration for connectors
• Detailed documentation is available on the Presto site
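As a rough illustration of the four configurations, the sketch below renders one minimal file for each. It assumes the file layout described in the Presto deployment documentation (`etc/node.properties`, `etc/jvm.config`, `etc/config.properties`, `etc/catalog/*.properties`). Every value is a placeholder, the helper `render_properties` is hypothetical, and a real deployment needs more settings than shown here.

```python
def render_properties(settings):
    """Render a dict as the key=value lines used by .properties files."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Placeholder values only; adjust every entry to the actual cluster.
layout = {
    # Node properties: environment configuration specific to each node
    "etc/node.properties": render_properties({
        "node.environment": "production",
        "node.id": "node-1",
        "node.data-dir": "/var/presto/data",
    }),
    # JVM config: one command-line option per line
    "etc/jvm.config": "-server\n-Xmx16G\n",
    # Config properties: configuration for the Presto server
    "etc/config.properties": render_properties({
        "coordinator": "true",
        "http-server.http.port": "8080",
        "discovery-server.enabled": "true",
        "discovery.uri": "http://example.net:8080",
    }),
    # Catalog properties: configuration for one connector
    "etc/catalog/hive.properties": render_properties({
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": "thrift://example.net:9083",
    }),
}
```

Writing each entry of `layout` to disk under the installation directory would produce the skeleton; the dictionary form just makes the four pieces easy to compare side by side.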
26. Catalog Properties
• Presto connectors are mounted in catalogs
• Create catalog properties in etc/catalog
• For example, the configuration etc/catalog/hive.properties
for the Hive connector:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
27. Presto’s Roadmap
• In the next year:
• Complex data structures
• Create table with partitioning
• Huge joins and aggregations
• Spill to disk
• Basic task recovery
• Native store
• Authentication & authorization
* Based on the Presto Meetup, May 2014
28. Data Visualization with Presto - Demo
• According to Presto’s roadmap, there will be an official
ODBC driver for connecting Presto to major BI tools
• Prestogres provides an alternative solution for now
• Use PostgreSQL’s ODBC driver
• It is also not difficult to integrate Presto with other
data visualization tools such as Grafana
29. Grafana
• An open source metrics dashboard and graph
editor for Graphite, InfluxDB & OpenTSDB
• But we may not be satisfied with these DBs or
just want to visualize data on HDFS, especially
for large-scale data
30. Integrating Presto with Grafana
• Presto provides many useful date & time
functions
• current_date → date
• current_time → time with time zone
• current_timestamp → timestamp with time zone
• from_unixtime(unixtime) → timestamp
• localtime → time
• now() → timestamp with time zone
• to_unixtime(timestamp) → double
31. Integrating Presto with Grafana
• Presto also supports many common aggregation
functions
• avg(x) → double
• count(x) → bigint
• max(x) → [same as input]
• min(x) → [same as input]
• sum(x) → [same as input]
• ...
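The date/time and aggregation functions combine naturally into the kind of time-bucketed query a dashboard panel needs. The helper below is a minimal sketch of that idea, not the query the demo actually used: the function name, the table `metrics`, and the columns `ts` and `cpu` are hypothetical. It only builds the SQL text, and it assumes the table and column names come from trusted configuration, since identifiers are not escaped.

```python
def bucketed_avg_query(table, time_column, value_column, start, end, interval):
    """Build a query that averages value_column per interval-second bucket."""
    # Coerce the numeric parameters so only integers reach the SQL text.
    start, end, interval = int(start), int(end), int(interval)
    # to_unixtime() yields seconds as a double; floor() snaps it to a bucket.
    bucket = f"floor(to_unixtime({time_column}) / {interval}) * {interval}"
    return (
        f"SELECT {bucket} AS bucket, avg({value_column}) AS value "
        f"FROM {table} "
        f"WHERE {time_column} >= from_unixtime({start}) "
        f"AND {time_column} < from_unixtime({end}) "
        f"GROUP BY 1 ORDER BY 1"
    )


sql = bucketed_avg_query("metrics", "ts", "cpu", 1400000000, 1400003600, 60)
```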
32. Integrating Presto with Grafana
• So we implemented a custom datasource for
Presto to work with Grafana
• Interactively visualize data on HDFS
[Diagram: interactive query path between Grafana, Presto, and HDFS]
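One step such a datasource has to perform is reshaping query result rows into a series the dashboard can plot. The sketch below is an assumption about that step, not the datasource from the talk: it targets the `[value, timestamp]` pair layout of Graphite's JSON render output, and the function name and row shape (`[bucket_seconds, value]`) are hypothetical.

```python
def rows_to_series(name, rows):
    """Convert [bucket_seconds, value] rows to a Graphite-style series dict."""
    ordered = sorted(rows, key=lambda row: row[0])  # plot in time order
    datapoints = [[value, int(bucket)] for bucket, value in ordered]
    return {"target": name, "datapoints": datapoints}


series = rows_to_series("cpu", [[120.0, 0.5], [60.0, 0.25]])
```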
34. References
• Martin Traverso, “Presto: Interacting with petabytes of data at
Facebook”
• Sadayuki Furuhashi, “Presto: Interactive SQL Query Engine
for Big Data”
• Sundstrom, “Presto: Past, Present, and Future”
• “Presto Concepts” in the Presto documentation