Speed Up Uber's Presto with Alluxio

Chen Liang: Senior Software Engineer@Uber Data Analytics
Beinan Wang: Software Engineer@Alluxio

Data informs every decision at Uber
Marketplace
Pricing
Community
Operations
Growth Marketing Data Science
Compliance
Eats

Presto @ Uber: Numbers
7K
Weekly Active Users
500K
Queries/day
2
Regions
5K
Nodes
15
Clusters
50PB
HDFS bytes read/day

Workloads
Interactive
Ad hoc queries
Batch
Scheduled

Alluxio Local Cache: Overview
Production Deployment
● Deployed to 3 clusters, with >200 nodes each
● Plugged in as a local library in Presto worker
● Leverage Presto workers’ local NVMe disks
● Selective caching based on cache filter
https://prestodb.io/blog/2020/06/16/alluxio-datacaching

Challenges Alluxio@Uber
● Challenge #1: Realtime Partition Updates
● Challenge #2: Cluster Membership Change
● Challenge #3: Cache Size Restriction

Challenge #1: Realtime Partition Updates
● At Uber, a lot of tables/partitions are constantly changing
○ Upsert queries constantly write into Hudi tables
● Partition id alone as caching key is not sufficient
○ The same partition may have changed in Hive, while Alluxio still caches the outdated version
● Result: partitions in cache become outdated

Challenge #1: Realtime Partition Updates
● Solution: Add the latest Hive modification time to the caching key
○ Previous caching key: hdfs://<path>
○ New caching key: hdfs://<path><mod time>
■ Concatenate last modification time to the path
● New partition with latest modification gets cached
● Tradeoff: the outdated partition is still present in cache, wasting cache space until evicted
○ Improving the eviction strategy is work in progress
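
A minimal sketch of the key change above, assuming the modification time is available as epoch milliseconds. The function name, the example path, and the timestamps are hypothetical and only illustrate the concatenation; this is not the actual Presto or Alluxio code.

```python
def make_cache_key(path: str, last_modified_ms: int) -> str:
    """Build a caching key by concatenating the path and its last modification time."""
    return f"{path}{last_modified_ms}"


# Hypothetical partition file and modification times (assumptions for illustration).
path = "hdfs://namenode/warehouse/db/table/ds=2021-01-01/part-0"
old_key = make_cache_key(path, 1609459200000)
new_key = make_cache_key(path, 1609462800000)  # same path after an upsert

cache = {old_key: b"outdated bytes"}

# The lookup with the new key misses, so the fresh version gets read and cached.
is_hit = new_key in cache
cache[new_key] = b"fresh bytes"

# Tradeoff from the slide: the outdated entry stays in the cache until evicted.
stale_still_present = old_key in cache
```

With the path alone as the key, the lookup after the upsert would have returned the outdated bytes.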

Challenge #2: Cluster Membership Change
● Cached bytes only present on certain nodes
○ SOFT_AFFINITY
● Presto worker nodes may go up/down due to operational activities
○ Node crash
○ Node taken down for maintenance
○ Ad-hoc node restart
● When cluster membership changes, node selection may route a key to a node that does not hold the cached bytes

[Diagram: Presto Coordinator looks up key=4 among Presto Worker#0, Presto Worker#1, Presto Worker#2]
Currently, simple hash-mod-based node lookup: key 4 % 3 nodes = 1 → worker#1

[Diagram: Presto Coordinator looks up key=4 among Presto Worker#0, Presto Worker#1, Presto Worker#2]
Now the third node (worker#2) goes down, new lookup: key 4 % 2 nodes = 0 → worker#0, but worker#0 does not have the bytes
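
The two lookups above can be reproduced with a few lines of Python. This is only an illustration of the mod-based scheme; the function name and the range of sample keys are assumptions.

```python
from typing import List


def pick_worker(key: int, workers: List[str]) -> str:
    """Mod-based lookup: index the list of live workers by key % node count."""
    return workers[key % len(workers)]


three_workers = ["worker#0", "worker#1", "worker#2"]
two_workers = ["worker#0", "worker#1"]  # worker#2 is down

before = pick_worker(4, three_workers)  # 4 % 3 = 1
after = pick_worker(4, two_workers)     # 4 % 2 = 0

# Count how many of 1000 sample keys land on a different worker
# after the membership change.
moved = sum(
    1
    for key in range(1000)
    if pick_worker(key, three_workers) != pick_worker(key, two_workers)
)
```

Note that keys move even when their original worker is still alive, which is what makes the mod-based lookup fragile under membership changes.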

Solution: Node-id-based consistent hashing
● All nodes are placed on a virtual ring
● Relative ordering of nodes on the ring doesn’t change
● Always look up the key on the ring
○ Instead of using mod-based hash
● Use replication for better robustness
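
A minimal sketch of a consistent hash ring keyed by node id, assuming MD5 as the hash function and 100 virtual points per node. Both choices, and the class and method names, are assumptions for illustration and not the Presto implementation.

```python
import bisect
import hashlib
from typing import List


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, node_ids: List[str], virtual_nodes: int = 100):
        # Each node id is placed on the ring at several positions derived
        # from the id itself, so positions do not depend on the other nodes.
        self._ring = sorted(
            (_hash(f"{node}-{i}"), node)
            for node in node_ids
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(node_ids))

    def lookup(self, key: str, replicas: int = 1) -> List[str]:
        """Return up to `replicas` distinct nodes, walking clockwise from the key."""
        want = min(replicas, self._node_count)
        start = bisect.bisect_right(self._points, _hash(key))
        found: List[str] = []
        step = 0
        while len(found) < want:
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            step += 1
        return found


full = ConsistentHashRing(["worker#0", "worker#1", "worker#2"])
reduced = ConsistentHashRing(["worker#0", "worker#1"])  # worker#2 is down
primary = full.lookup("key-4")[0]
primary_and_backup = full.lookup("key-4", replicas=2)
```

In this sketch, a key whose primary node is still alive keeps the same primary after worker#2 leaves, and a key that was on worker#2 falls over to the node that was its second replica.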

Challenge #3: Cache Size Restriction
● At Uber, the data accessed by Presto queries (PBs) >> the disk space available on worker nodes
○ 50PB of data accessed daily vs. 500GB of local cache space per node
○ Heavy eviction can hurt overall cache performance
● Only a selected set of data can fit into cache:
○ certain tables
○ certain number of partitions

● Solution: Cache Filter
○ A mechanism that decides whether to cache a table and how many partitions
○ Based on a static JSON config that specifies:
■ which tables are eligible for caching
■ how many partitions to cache for each table
● A sample configuration:
{
  "databases": [{
    "name": "database_foo",
    "tables": [{
      "name": "table_bar",
      "maxCachedPartitions": 100
    }]
  }]
}
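
A hedged sketch of how a filter could consult such a config. The slides do not say how maxCachedPartitions is applied, so this example assumes it means "cache only the N most recent partitions of the table"; the function names and the partition_rank parameter are hypothetical.

```python
import json
from typing import Dict, Tuple

SAMPLE_CONFIG = """
{
  "databases": [{
    "name": "database_foo",
    "tables": [{
      "name": "table_bar",
      "maxCachedPartitions": 100
    }]
  }]
}
"""


def load_limits(config_text: str) -> Dict[Tuple[str, str], int]:
    """Flatten the config into {(database, table): maxCachedPartitions}."""
    config = json.loads(config_text)
    return {
        (db["name"], table["name"]): table["maxCachedPartitions"]
        for db in config["databases"]
        for table in db["tables"]
    }


def should_cache(
    limits: Dict[Tuple[str, str], int],
    database: str,
    table: str,
    partition_rank: int,
) -> bool:
    """Decide whether a partition is eligible for caching.

    partition_rank is 0 for the newest partition, 1 for the next one, and so
    on (an assumption about how the per-table limit is interpreted).
    """
    limit = limits.get((database, table))
    return limit is not None and partition_rank < limit


limits = load_limits(SAMPLE_CONFIG)
cache_recent = should_cache(limits, "database_foo", "table_bar", 0)
cache_old = should_cache(limits, "database_foo", "table_bar", 100)
cache_unlisted = should_cache(limits, "database_foo", "other_table", 0)
```

Tables that are not listed are never cached in this sketch, which matches the idea of an explicit allow list.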

● Greatly increased cache hit rate
○ From ~65% to >90%
● Notes on Cache Filter
○ Manual, static configuration
○ Should be based on traffic pattern, e.g.:
■ Most frequently accessed tables
■ Most common # of partitions being accessed
■ Tables that do not change too frequently
■ Ideally, should be based on shadow caching numbers and table level metrics

Monitoring/Dashboarding
● Integrated with Uber’s internal metrics platform
● JMX metrics emitted to a Grafana-based dashboard

Current Status and Next Steps
● Deployed to production cluster
○ 3 clusters of 200+ nodes each, all nodes on NVMe disks, 500GB cache space per node
○ Using cache filter to cache ~20 most frequently accessed tables
○ Initial measurement shows great improvement
■ ~1/3 of wall time for input scan (TableScanOperator and ScanFilterProjectOperator) vs no cache
● Next Steps
○ Onboard more tables/Improve process of table onboarding
■ E.g. shadow cache
○ Better support for changing partitions/Hudi tables
○ Other optimizations
■ E.g. load balancing between nodes

Table Level Metrics - Motivation

Table Level Metrics - Architecture

Table Level Metrics - Dashboard

Persistent File Level Metadata for Local Cache
● Prevent stale caching
○ The underlying data files might be changed by 3rd-party frameworks. (This situation might be rare in Hive tables, but is very common in Hudi tables)
● Metadata should be recoverable after server restart
● File or Partition Level Eviction

Future Work
● Performance Tuning
○ Improve cache efficiency
○ Optimize for SATA or mechanical hard drives
● Adopt Shadow Cache
○ Table-level working set estimation
○ Partition-level popularity estimation
