1. Enabling Presto to Handle Massive Scale at Lightning Speed
Fast and Scalable Data Processing
Raunaq Morarka
27/4/2019
2. About the Presenter
● Raunaq Morarka, Staff Engineer at Qubole, Bangalore
● Bio: I currently work on the Presto team at Qubole. My areas of interest are distributed
database systems and software performance optimization. At Qubole I have worked on features
related to scheduling, autoscaling and the use of spot nodes for running Presto as a service in the
cloud. I have recently started contributing to the PrestoSQL open source project. Before Qubole,
I worked at Akamai on a distributed columnar time-series database supporting real-time ingest
and low-latency queries.
3. Agenda
● State of Presto today
○ Background – Introduction, Why Presto
○ Presto Architecture
○ Usage Overview – Recent Growth and Adoption Trends
● Presto in the Cloud
○ Optimizing for Scale
■ Autoscaling
■ Maximizing the Benefits of the Cloud
○ Optimizing for Speed
■ Dynamic Filtering, Join Reordering, Join Strategy Selection
■ RubiX – Next-generation, column-level optimized caching for Presto
● Future roadmap
5. State of Presto Today – Background
What is Presto?
• Distributed SQL query engine, originated at Facebook in 2013
• ANSI SQL Compliant
• Supports Federated Queries
• Pluggable data sources
• Completely in-memory and pipelined execution model
Why Presto?
• Built for a variety of use cases: low-latency user-facing applications, exploratory analysis through BI tools, batch ETL
• Data source agnostic: HDFS, RDBMSs, NoSQL, stream processing, cloud object stores (S3, ADLS, GCS)
• Zero configuration ideology
• Proven in production at Petabyte Scale: Facebook, Netflix, Airbnb, Uber, LinkedIn, Qubole, and more
• Highly extensible
7. Query Lifecycle
● Client submits a SQL query to the Coordinator over the HTTP REST API
● An ANTLR-based parser converts the query into a syntax tree
● The Logical Planner generates a tree of plan nodes
● Optimizer transforms logical plan into an efficient execution strategy
• RBO (predicate and limit pushdown, column pruning, partition pruning, etc.)
• CBO (join reordering, join strategy selection)
• Takes advantage of data layout (partitioning, sorting, grouping and indices)
• Inter-node parallelism by breaking up the plan into stages that can be executed in parallel across workers
• Intra-node parallelism by running a sequence of operators (a pipeline) in multiple threads
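The optimizer's output can be inspected with Presto's EXPLAIN statement. A minimal sketch, reusing the date_dim table from the dynamic filtering example later in this deck:

-- Show the optimized logical plan; the effects of RBO rules such as
-- predicate pushdown and column pruning are visible in the scan node.
EXPLAIN (TYPE LOGICAL)
SELECT d_year, count(*)
FROM date_dim
WHERE d_moy = 12
GROUP BY d_year;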
9. Scheduling
● Coordinator distributes plan to workers, starts execution of tasks and then
begins to enumerate splits, which are opaque handles to an addressable
chunk of data in an external storage system
● Splits are assigned to the tasks responsible for reading this data
10. Exchange (Shuffles)
● Presto uses in-memory buffered shuffles over HTTP to exchange intermediate results for
different stages of a query
● Tasks produce data into an in-memory output buffer
● Workers consume results from other workers through an exchange client which uses HTTP
long-polling
● Exchange client buffers data before it is processed (input buffer)
● Exchange server retains data until client acknowledges receipt
● Engine tunes parallelism to maintain target utilization rates for output and input buffers
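The stage boundaries where these exchanges occur can be inspected with a distributed EXPLAIN. A sketch, again using the tables from the dynamic filtering example later in this deck:

-- Each Fragment in the output corresponds to a stage; the exchange
-- annotations (e.g. REPARTITION, GATHER) mark where intermediate
-- results are shuffled between workers over HTTP.
EXPLAIN (TYPE DISTRIBUTED)
SELECT d_year, count(*)
FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
GROUP BY d_year;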
11. Split Assignment
Presto asks connectors to enumerate small batches of splits, and assigns them to tasks lazily
● Decouples query response time from time taken for listing files
● Avoids enumerating all splits when queries are cancelled or finish early, e.g. when a LIMIT
clause is satisfied (see the example after this list)
● Workers maintain a queue of splits. The coordinator assigns new splits to tasks with the
shortest queue. Keeping these queues small allows the system to adapt to variance in CPU cost
of processing different splits and performance differences among workers
● Allows queries to execute without having to hold all their metadata in memory
● Lazy split enumeration can make it difficult to accurately estimate and report query progress
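As a concrete illustration of the LIMIT case, consider a hypothetical query against the store_sales table used later in this deck:

-- With lazy enumeration, the coordinator stops asking the connector
-- for more splits once 10 rows have been produced, rather than
-- listing every file under the table up front.
SELECT * FROM store_sales LIMIT 10;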
12. State of Presto Today – Usage Overview
Presto usage on Qubole's cloud platform grew 420% in compute hours from January 2017 to 2018.
In aggregate, customers run 24x more commands per hour in Presto than in Apache Spark and 6x
more than in Apache Hadoop/Hive.
13. State of Presto Today – Usage Overview
Top Three Industries Using Presto
1. Entertainment
2. Travel Services
3. Gaming
Verticals everywhere are adopting Presto
15. Optimizing for Scale – Autoscaling
● Scale clusters in range [min size, max size]
● Scale up when the workload increases
● Scale down when load drops
● Graceful scale down
● Usually implemented by defining rules on top of CPU/memory/IO metrics exported by the system
● Qubole’s implementation
○ Monitor progress of queries
○ Intelligent decision making: scale up only when it will help meet a given SLA
○ Handle bursty workloads by avoiding fixed step sizes
○ Finer controls like grouped scale up/down, cool down period, etc.
○ Automatic termination of idle clusters
○ Self start of cluster in response to first query on a shutdown cluster
16. Optimizing for Scale – Required Workers
● Non-source stages cannot be redistributed to take advantage of newly added nodes
● Min size of cluster must be large enough to avoid query failures
● Choice between high cost and degraded performance for initial queries
● Required Workers is a mechanism to delay query execution until a minimum number of worker
nodes have joined the cluster
● Integration with Qubole’s autoscaling
○ Scale up cluster to satisfy min workers requirement
○ Avoid scaling up for DDL and monitoring related queries
○ Scale down to 1 node during periods of inactivity
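In open source Presto this mechanism is exposed through configuration and session properties. A sketch, assuming a Presto version where the required_workers_count session property is available:

-- Hold this query until at least 6 workers have joined the cluster,
-- giving autoscaling time to bring it up to the required size.
SET SESSION required_workers_count = 6;
SET SESSION required_workers_max_wait_time = '10m';
SELECT count(*) FROM store_sales;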
17. Required Workers – Results

                           Config A   Config B   Config C
Total time taken           5h 12m     4h 26m     4h 37m
Total node runtime (s)     143,137    134,664    124,351
Min cluster size           2          6          1 (6 required nodes)
19. Optimizing for Scale – Maximizing the Benefits of the Cloud
● Spot nodes are generally available at highly discounted prices
● Presto cannot utilize them well out of the box due to its pipelined, in-memory execution
architecture
● Spot loss will lead to failure of all queries which had any part of their execution tree running on
that spot node
● Presto is usually run on newer generation, high memory instance types which experience spot
loss more often due to greater demand
● Qubole’s handling of Spot Termination Notifications (STNs)
○ Proactive addition of nodes to maintain cluster size
○ Avoid scheduling tasks on a spot node after receiving an STN
○ Acquire on-demand quickly
○ Lazily rebalance to achieve desired spot ratio
○ No query failures if all running queries finish within the 2-minute termination notice window
21. Query Retries
● Fallback for query failures that cannot be handled by the STN integration
● Query retries should be transparent to clients and work with all Presto clients: Java CLI,
JDBC/ODBC drivers, Ruby client, etc.
● The retry should happen only if it is guaranteed that no partial results have yet been sent to
the client
● The retry should happen only if changes (if any) caused by the failed query have been rolled
back e.g. in case of Inserts, data written by the failed query has been deleted
● The retry should happen only if there is a chance of successful query execution
● Qubole’s implementation (Smart Query Retry)
○ The Presto server is responsible for retries; clients are redirected to the new query with no
client-side changes required
○ Convert SELECT queries into IOD queries (INSERT OVERWRITE DIRECTORY), clients
get result only after query has finished
○ Track rollback status of query
○ Retry in bigger cluster if the failure is due to insufficient memory
○ Retry when cluster size stabilizes if the failure is due to node loss
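A sketch of the IOD rewrite described above; the output location and query here are hypothetical:

-- Original client query:
SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk;

-- Rewritten internally to INSERT OVERWRITE DIRECTORY (IOD); results
-- are staged in cloud storage and served to the client only after the
-- query (or its retry) finishes, so no partial results ever leak.
INSERT OVERWRITE DIRECTORY 's3://bucket/query-results/<query-id>/'
SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk;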
22. Optimizing for Speed – Dynamic Filtering
SELECT (...)
FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk (...)
WHERE d_year = 2000 and d_moy = 12 (...)
(... GROUP BY ... ORDER BY ...)
Today (assuming the tables are not partitioned), Presto performs a full table scan of both tables. Dynamic filtering enables the engine to:
1. Skip accessing fact table partitions not needed by the query (runtime partition pruning)
2. Filter rows on probe side of join by sending only the subset of rows that match the join keys
across the network (runtime row filtering)
3. If storage format supports predicate pushdown, use runtime filters to avoid scanning data on
probe side (runtime predicate pushdown)
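Conceptually, dynamic filtering behaves as if the query had been rewritten with the join keys collected at runtime; a sketch, not the actual plan transformation:

-- The build side (date_dim) is read first; the matching join keys are
-- then pushed into the scan of the probe side (store_sales).
SELECT (...)
FROM store_sales
WHERE ss_sold_date_sk IN (
    SELECT d_date_sk FROM date_dim
    WHERE d_year = 2000 AND d_moy = 12
);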
24. Dynamic Filtering Results
• Runtime of 13 queries improved by at least 5X.
• Runtime of 13 queries improved between 3X - 5X.
• Runtime of 22 queries improved between 1.5X - 3X.
• 14 queries that did not run before succeeded.
25. Optimizing for Speed – Join Reordering
• Smaller table on the right (the build side, which is held in memory) for better performance
• Difficult to ensure in a multi-join query
• The Join Reordering optimizer rule to the rescue
26. Optimizing for Speed – Join Reordering
[Figure: join plans before and after reordering – A JOIN B ON A.a = B.b rewritten as B JOIN A ON B.b = A.a]
Join reordering targets the case where the build side of the join is expensive.
Results on TPC-DS scale 3000*: 3~6x speedups, geomean improvement 3.1x
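In open source Presto, cost-based join reordering can be enabled per session. A minimal sketch, assuming a version with the CBO available:

-- Let the cost-based optimizer pick the join order from table
-- statistics instead of the order written in the query text.
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SELECT count(*)
FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk;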
28. Optimizing for Speed – Join Strategy Selection
● Broadcast (Map-side join) vs Repartitioned (Shuffle join)
● Repartitioned
○ Default
○ Low memory usage
○ Both build and probe side need to be partitioned
○ More efficient for joins between large tables of similar size
● Broadcast
○ High memory usage, build side table must fit in memory
○ Probe side does not need to be partitioned
○ Build side table broadcast to all nodes
○ More efficient for joins where one table is of small size
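The strategy can likewise be chosen per session in open source Presto; a sketch:

-- AUTOMATIC lets the CBO decide; BROADCAST or PARTITIONED forces a
-- strategy, e.g. broadcasting a small dimension table to all workers.
SET SESSION join_distribution_type = 'AUTOMATIC';
-- or: SET SESSION join_distribution_type = 'BROADCAST';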
29. Optimizing for Speed – RubiX
● A caching framework for big data engines in the cloud
● Open source https://github.com/qubole/rubix
● Built for zero configuration, SQL first, zero bottlenecks, auto rebalancing
● Adopted within Qubole for Presto, Hive and Spark
● Adopted outside Qubole, e.g. Azure HDInsight IO Cache
● https://www.qubole.com/blog/rubix-fast-cache-access-for-big-data-analytics-on-cloud-storage
32. Presto Open Source Roadmap
● Coordinator High Availability
● Allow connectors to participate in query optimization
● Improvements to spill-to-disk functionality
● Partial recovery support for failure of long running queries
● Ranger integration
● Qubole collaborations with community
○ Dynamic Filtering
○ Kinesis Connector
○ Support for insert-only transactional Hive tables
○ Data locality based scheduling of source splits
○ Presto UDFs (https://github.com/qubole/presto-udfs)