Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Keith Steward, Ph.D.
Specialist (EMR) Solution Architect, AWS
2016-08-28
Running Fast, Interactive
Queries on Petabyte Datasets
using Presto on Amazon EMR

What We’ll Cover:
 The challenges of using Data Warehouses. Then a non-DW alternative.
 High level steps for querying petabyte-scale data on S3.
 Amazon S3
 Amazon EMR
 Apache Presto: History, Goals & Benefits, Architecture
 Presto on EMR
 Demo – Querying 29 years of U.S. Air Flights data on S3 using Presto on EMR.

Challenges in using Data Warehouses
Schema-on-Write
Data
Data
Warehouse
schema
Significant “time to answer”
Schema-on-Read
Data
Shorter “time to answer”
$$$$
$$

1. Store your petabyte-scale data in S3.
2. Configure & launch an EMR cluster with Presto.
3. Login to the EMR cluster
4. Expose S3 data as a Hive Table(s)
5. Issue SQL queries against the Hive table(s) using Presto.
6. Get query results.
How to Query Petabyte-Scale Datasets on S3?

 Store anything (object storage)
 Scalable / Elastic
 99.999999999% durability
 Effectively infinite inbound bandwidth
 Extremely low cost: $0.03/GB-Mo; $30.72/TB-Mo
 Data layer for virtually all AWS services
Amazon S3

Aggregate all Data in S3 as your Data Lake
Surrounded by a collection of the right tools
EMR Kinesis
Redshift DynamoDB RDS
Data Pipeline
Spark Streaming Storm
Amazon
S3
Import / Export
Snowball

Exposing petabyte-scale datasets in S3 as Hive tables
hive> CREATE EXTERNAL TABLE airdelays (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
. . .
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY ''
LINES TERMINATED BY 'n'
LOCATION 's3://flightdelays-kls/csv’;
S3 bucket with data:
Ask Hive to
expose S3
data as
table:
hive> describe airdelays;
OK
yr int
quarter int
month int
dayofmonth int
dayofweek int
flightdate string
. . .
div5wheelsoff string
div5tailnum string
year string
# Partition Information
# col_name data_type comment
year string
Time taken: 0.169 seconds, Fetched: 115 row(s)
Hive now knows about table:

 Scalable Hadoop clusters as a service
 Hadoop, Hive, Spark, Presto, Hbase, etc.
 Easy to use; fully managed
 On demand, reserved, spot pricing
 HDFS, S3, and Amazon EBS filesystems
 End to end security
Amazon EMR

EMRFS makes it easier to leverage Amazon S3
 Better performance and error handling options
 Transparent to applications – just read/write to “s3://”
 Support for Amazon S3 server-side and client-side encryption
 Faster listing using EMRFS metadata
 HDFS is still available via local instance storage or Amazon EBS

Amazon S3 as your cluster’s persistent data store
Amazon S3
 Designed for 99.999999999% durability
 Separate compute and storage
 Resize and shut down Amazon EMR
clusters with no data loss
 Point multiple Amazon EMR clusters at
same data in Amazon S3 using the EMR
File System (EMRFS)

Demo: Let’s spin up an EMR cluster (with Presto) …

(History)
PB scale interactive query engine designed by
Facebook in 2012
Originally designed for exploring existing Hive
tables without triggering slow mapreduce jobs
Open Source in late 2013

(Benefits)
In-memory distributed query engine
Support standard ANSI-SQL
Support rich analytical functions
Support wide range of data sources
Combine data from multiple sources in single query
Response time ranges from seconds to minutes

Extensibility
Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL,
PostgreSQL, MySQL, SystemSchema, TPCH
JDBC, ODBC for commercial BI tools or Dashboards, like data
visualization
Client Protocol: HTTP+JSON, support various languages
(Python, Ruby, PHP, Node.js, Java(JDBC), C#,…)
ANSI SQL
complex queries, joins, aggregations, various functions
(Window functions)
High Performance: 10x faster than Hive
E.g. Netflix: runs 3500+ Presto queries / day on 25+ PB
dataset in S3 with 350 active platform users
(Features)

High Level Architecture
 A distributed system that runs on a cluster of machines.
 Components: a coordinator and multiple workers.
 Queries are submitted by a client such as the Presto CLI to the coordinator.
 The coordinator parses, analyzes and plans the query execution, then distributes
the processing to the workers.
https://prestodb.io/overview.html

In memory parallel queries
Pipeline task execution
Data local computation with multi-threading
Cache hot queries and data
Just-in-time compile-to-bye-code operator
SQL optimization
Other optimizations (e.g. Predicate Pushdown)
(Why is it so fast?)

Presto: In-memory processing and pipelining

Presto: Accessing Petabyte-Scale Datasets in S3
Any table known to Hive Metastore can be accessed / queried by
Presto.
Including data in S3 exposed via CREATE EXTERNAL TABLE
statements in Hive.

Embedded
Mode
 Uses Derby
 not recommended
for Production
Hive Metastore Deployment Modes
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html

Embedded
Mode
• Uses Derby
• not recommended for
Production
Local
Mode
• metastore service in
same process as main
HiveServer process
• Metastore db runs in
separate process

Embedded
Mode
• Uses Derby
• not recommended for
Production
Local
Mode
• metastore service in
same process as main
HiveServer process
• Metastore db runs in
separate process
Remote
Mode
• metastore service runs in
own JVM process
• processes communicate
with it via Thrift network
API
• metastore service
communicates with
metastore db over JDBC

Supported Data Sources
Currently Presto provides connectors for the following
 Hive
 Cassandra
 MySQL
 PostgreSQL
 Kafka
 Redis

Common Use Cases
When to use Presto?
 Need fast interactive query ability with high concurrency
 Need ANSI SQL
When might you not want to use Presto?
 You focus on batch processing (ETL, enriching, aggregation, etc) for large data sets.
Hive or Spark recommended.
 Need to compute (e.g. machine learning, graph algorithms) over the retrieved data.
Spark recommended.
 Star-schema organization of data. Amazon Redshift data warehouse recommended.

Airpal – a Presto GUI designed & open-sourced
by Airbnb
 Optional access controls for users
 Search and find tables
 See metadata, partitions, schemas & sample rows
 Write queries in an easy-to-read editor
 Submit queries through a web interface
 Track query progress
 Get the results back through the browser as a CSV
 Create new Hive table based on the results of a query
 Save queries once written
 Searchable history of all queries run within the tool

Demo: Let’s now query 29 years worth of Air-traffic data
in S3 using Presto on our EMR cluster…

Summary for Presto on EMR with Data in S3
 Data in S3 is queryable using Presto on EMR
 Presto is easy to deploy on Amazon EMR
 Presto provides fast ad-hoc queries
 Supports wide range of data sources
 In-memory data processing with pipelining
 Feature-rich
 Increasing adoption & active community
Amazon S3
Amazon
EMR

Thank you!
Keith Steward,
stewardk@amazon.com
 http://www.slideshare.net/GuorongLIANG/facebook
-presto-presentation
 https://prestodb.io
 https://github.com/airbnb/airpal#airpal
 https://github.com/treasure-data/prestogres
References:
If you want to run this demo later in
your own AWS account, go to:
http://bit.ly/1Xg0111

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series

Similar to Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Recently uploaded

Recently uploaded (20)

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS July 2016 Webinar Series