Enable Presto® Caching at Uber with Alluxio
Zhongting Hu: TLM @Uber Data Analytics
Beinan Wang: Software Engineer @Alluxio
Data informs every decision at Uber
Marketplace
Pricing
Community
Operations
Growth Marketing
Data Science
Compliance
Eats
Presto @ Uber-scale
12K Monthly Active Users
400K Queries/day
2 Data Centers
6K Nodes
14 Clusters
50PB HDFS data processed/day
Presto Deployment
Workloads
● Interactive: ad hoc queries
● Batch: scheduled queries
Data: From On-Premise to Cloud
● What
○ BI (Application)
○ Analytics (Compute)
○ Storage
● How
○ Feature Compatibility
○ Performance Measurement
○ Security / Compliance
○ Tech Debt?
● Why
○ Cost Efficiency
○ Usability / Scalability / Reliability
Alluxio Local Caching -- High-Level Architecture
Runs as a local library inside the Presto worker
Key <-> value: the HDFS file path serves as the cache key
https://prestodb.io/blog/2020/06/16/alluxio-datacaching
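The linked blog post describes enabling the local cache through the Hive connector's catalog properties. A minimal sketch, assuming the property names from that post (values are illustrative and should be checked against the Presto and Alluxio versions in use):

    # catalog/hive.properties (illustrative values)
    # Route splits to the workers that are likely to hold their cached data
    hive.node-selection-strategy=SOFT_AFFINITY
    # Enable the worker-local data cache backed by the Alluxio local-cache library
    cache.enabled=true
    cache.type=ALLUXIO
    # Local SSD directory and capacity backing the cache on each worker
    cache.base-directory=file:///mnt/flash/cache
    cache.alluxio.max-cache-size=500GB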
Presto on GCP
Key Problems -- Data
● Data Characteristics
○ Mostly partitioned by date
○ Hudi incremental updates on files
○ Staging directories / partitions created by the ETL framework
● Cache Data Hit Ratio
○ 3+ PB of distinct data accessed per day
○ ~10% of it is frequently accessed
○ ~3% is hot data
● Data Cache Filtering
○ Offline query analytics on table (and partition) access
○ Onboard only the hot data (see the sketch below)
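Since only a small fraction of the data is hot, a cache filter decides which splits are worth caching at all. A minimal Java sketch, assuming a hypothetical allow-list of hot tables and partitions produced by the offline query analytics (class and method names are illustrative, not the actual implementation):

    import java.util.Map;
    import java.util.Set;

    // Hypothetical cache filter: only splits from known-hot tables/partitions are cacheable.
    public class CacheFilter {
        // Table name -> hot partition values (e.g. datestr), built from offline query analytics.
        private final Map<String, Set<String>> hotPartitionsByTable;

        public CacheFilter(Map<String, Set<String>> hotPartitionsByTable) {
            this.hotPartitionsByTable = hotPartitionsByTable;
        }

        // True if the split's data should be cached on the local worker.
        public boolean isCacheable(String tableName, String partitionValue) {
            Set<String> hotPartitions = hotPartitionsByTable.get(tableName);
            return hotPartitions != null && hotPartitions.contains(partitionValue);
        }
    }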
Key Problems -- Apache Hadoop® HDFS Latency
● DataNodes can introduce random latency
● In a real production environment, most CPU wall time is spent reading data
Key Problems -- HDFS Latency, Cont.
● Reading from the local cache gives much more predictable latency
● Fixed a NameNode listing bug (the listLocatedStatus API)
Key Problems -- Presto Soft Affinity Scheduling
● Computing preferred workers
○ The split overrides getPreferredNodes() to return the two preferred workers
○ Simple mod-based algorithm (see the sketch after this list)
○ Try the preferred workers one by one, checking whether each is busy
○ If both preferred workers are busy, select the least busy worker and mark the split as non-cacheable (cacheable = false)
● Defining a busy worker
○ Max splits per node: node-scheduler.max-splits-per-node
○ Max pending splits per task: node-scheduler.max-pending-splits-per-task
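A minimal Java sketch of the mod-based soft-affinity selection described above, assuming a hypothetical isBusy() check driven by the two node-scheduler limits (all names are illustrative, not Presto's actual scheduler code):

    import java.util.List;

    public class SoftAffinityScheduler {
        // A worker counts as busy when it exceeds node-scheduler.max-splits-per-node
        // or node-scheduler.max-pending-splits-per-task (checks not shown here).
        interface Worker { boolean isBusy(); int queuedSplits(); }

        record Assignment(Worker worker, boolean cacheable) {}

        // Pick a worker for a split identified by its HDFS file path.
        public static Assignment assign(String filePath, List<Worker> workers) {
            int n = workers.size();
            int first = Math.floorMod(filePath.hashCode(), n);  // simple mod-based choice
            int second = (first + 1) % n;                       // second preferred worker
            if (!workers.get(first).isBusy()) {
                return new Assignment(workers.get(first), true);
            }
            if (!workers.get(second).isBusy()) {
                return new Assignment(workers.get(second), true);
            }
            // Both preferred workers are busy: fall back to the least busy worker,
            // and mark the split non-cacheable so it does not pollute that worker's cache.
            Worker fallback = workers.stream()
                    .min((a, b) -> Integer.compare(a.queuedSplits(), b.queuedSplits()))
                    .orElseThrow();
            return new Assignment(fallback, false);
        }
    }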
Key Problems -- Soft Affinity with Consistent Hash
● Changed from simple mod-based node selection to consistent hashing (see the sketch below)
● 10 virtual nodes, on the original 400-node cluster
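A minimal Java sketch of a consistent-hash ring with virtual nodes, assuming the slide's 10 virtual nodes are per worker (the hash function and class names are illustrative):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // Consistent-hash ring: each worker is inserted 10 times (virtual nodes).
    public class ConsistentHashRing {
        private static final int VIRTUAL_NODES = 10;
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public ConsistentHashRing(List<String> workers) {
            for (String worker : workers) {
                for (int i = 0; i < VIRTUAL_NODES; i++) {
                    ring.put(hash(worker + "#" + i), worker);
                }
            }
        }

        // Map an HDFS file path to its preferred worker: the first virtual node
        // clockwise from the path's hash position on the ring.
        public String preferredWorker(String filePath) {
            long h = hash(filePath);
            SortedMap<Long, String> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }
    }

With a ring like this, adding or removing a worker only remaps roughly 1/N of the file paths, whereas a mod-based mapping reshuffles almost all of them and invalidates most of the cache.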
Current Status and Next Steps
● Initial testing is finished, showing significant query improvements
● TPC-DS testing at scale factor 10k is in progress
● Historical table/partition analytics to set up cache filters
● Dashboarding, monitoring, metadata integrations
Persistent File Level Metadata for Local Cache
● Prevent stale caching
○ The underlying data files might be changed by third-party frameworks (rare for Hive tables, but very common for Hudi tables)
● Scoped quota management
○ For example, do you want to set a cache quota per table?
● Metadata should be recoverable after a server restart
File Level Metadata -- High Level Approach
● Implement a file-level metadata store that keeps the last modified time and the scope of each cached data file
● The metadata store is persisted on disk so its contents survive a server restart (see the sketch below)
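A minimal Java sketch of the staleness check this metadata enables: compare the modification time recorded when the file was cached against the current one reported by HDFS, and drop the cached pages on a mismatch (the names below are illustrative, not Alluxio's actual API):

    // Hypothetical in-memory view of the persisted per-file metadata.
    record CachedFileInfo(String filePath, long lastModifiedTime, String scope) {}

    class StalenessChecker {
        // True if the cached pages for this file must be dropped because the underlying
        // file was rewritten (e.g. by a Hudi update/compaction) after it was cached.
        static boolean isStale(CachedFileInfo cached, long currentModificationTime) {
            return currentModificationTime != cached.lastModifiedTime();
        }
    }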
Cache Data and Metadata Structure
root_path/page_size(ulong)/bucket(uint)/file_id(str)/
    timestamp1/
        Page_file1 (the filename is a ulong)
        Page_file2
        ...
        Page_fileN
    timestamp2/
        Page_file1 (the filename is a ulong)
        Page_file2
        ...
        Page_fileN
    metadata (stores FileInfo in protobuf format; contains the timestamp and scope)
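A small Java sketch that builds on-disk paths following the layout above (purely illustrative of the directory scheme, not Alluxio's implementation):

    import java.nio.file.Path;

    class CachePathLayout {
        // root_path/page_size/bucket/file_id/timestamp/page_file
        static Path pagePath(Path rootPath, long pageSize, int bucket,
                             String fileId, long timestamp, long pageIndex) {
            return rootPath
                    .resolve(Long.toString(pageSize))    // page_size (ulong)
                    .resolve(Integer.toString(bucket))   // bucket (uint)
                    .resolve(fileId)                     // file_id (str)
                    .resolve(Long.toString(timestamp))   // timestamp directory
                    .resolve(Long.toString(pageIndex));  // page file named by a ulong
        }

        // The sibling "metadata" file under file_id/ stores FileInfo (timestamp + scope) in protobuf.
        static Path metadataPath(Path rootPath, long pageSize, int bucket, String fileId) {
            return rootPath
                    .resolve(Long.toString(pageSize))
                    .resolve(Integer.toString(bucket))
                    .resolve(fileId)
                    .resolve("metadata");
        }
    }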
Metadata Awareness -- Cache Context (New in 2.6.1)
Per-Query Metrics Aggregation on the Presto Side
Future Work
● Performance Tuning
● Semantic Cache
● More efficient deserialization
