RaptorX: Building a 10X Faster Presto with Hierarchical Cache
Rohit Jain, Software Engineer
June 24th, 2021
10X faster Presto for Facebook-scale petabyte workloads
Presto @ Facebook Scale
• 50K+ servers
• ~1 EB of data scanned per day
Presto Today: Disaggregated Storage and Physics!
• Data is growing exponentially faster than the use of compute
• The resulting industry trend is to scale storage and compute independently, e.g., Snowflake on S3, AWS EMR on S3, BigQuery on Google Cloud Storage
• This helps customers and cloud providers scale independently, reducing cost
• Data for querying and processing needs to be streamed from remote storage nodes
• This creates a new challenge for query latency: scanning huge amounts of data over the wire becomes I/O bound when the network is saturated
CAPTION: Presto Servers need to retrieve data from remote storage
The distance between compute and storage has increased, and overcoming physics is hard
RaptorX: Hierarchical Caching for Interactive Workloads!
• RaptorX's goal is to create a no-migration query acceleration solution for existing Presto customers, so that existing workloads benefit seamlessly
• The challenge is to accelerate petabyte-scale interactive workloads without replicating data
• We found the top opportunities to increase performance by doing a comprehensive audit of the query lifecycle
• Caching is the obvious answer, and it is not new; however, it is a lot of work to manage, e.g., cache invalidation!
• What's new is 'true no-work' query acceleration: responses are returned up to 10X faster with no change in pipelines or queries
CAPTION: Presto with RaptorX smartly caches at every opportunity
Reduce the distance between compute and storage intelligently!
Metastore Cache: 20% latency decrease
• Every Presto query makes a metastore call, getPartitions(), to learn about metadata (e.g., schema, partition list, and partition info)
• At Facebook scale, partitions are complex and can introduce latency!
• The Presto coordinator (the SQL endpoint) caches metadata to avoid calls to the metastore
• Slow-changing partitions benefit particularly from this (e.g., date-based partitions)
• The cache is versioned to confirm the validity of cached metadata (see the sketch below)
  - A version number is attached to each cached key-value pair
  - For every read request, the coordinator either fetches and caches the partition information if it is not yet cached, or confirms with the metastore that the cached information is up to date
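A minimal sketch of this versioning scheme, assuming a hypothetical metastore client that can return a table's current version cheaply (the class and method names here are illustrative, not Presto's actual API):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative versioned metadata cache; a sketch, not Presto's implementation.
    class VersionedMetastoreCache {
        // A cached value together with the metastore version it was read at.
        record Entry(long version, Object partitionInfo) {}

        private final Map<String, Entry> cache = new ConcurrentHashMap<>();
        private final Metastore metastore; // assumed thin client, defined below

        VersionedMetastoreCache(Metastore metastore) {
            this.metastore = metastore;
        }

        Object getPartitions(String tableKey) {
            Entry cached = cache.get(tableKey);
            // Cheap call: ask only for the current version, not the full metadata.
            long currentVersion = metastore.getVersion(tableKey);
            if (cached != null && cached.version() == currentVersion) {
                return cached.partitionInfo(); // hit: cached metadata is up to date
            }
            // Miss or stale: fetch the full partition info and cache it with its version.
            Object fresh = metastore.getPartitions(tableKey);
            cache.put(tableKey, new Entry(currentVersion, fresh));
            return fresh;
        }

        interface Metastore {
            long getVersion(String tableKey);
            Object getPartitions(String tableKey);
        }
    }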
CAPTION: RaptorX caches table metadata with versioning
[Figure: The Presto coordinator (the SQL endpoint) serves metadata from a versioned cache in front of the metastore]
File List Cache: 100ms drop per query
• A listFile() call is used by Presto to retrieve the list of file names from the remote file system
• The coordinator caches file lists in memory to avoid long listFile calls to remote storage
• The challenge is that this only applies to partitions/directories that are compacted or sealed, i.e., no new data will be added to the partition
• However, real-time ingestion and serving depend on fresh data, i.e., partitions/directories that are open/not compacted
• For open partitions, RaptorX skips caching the directories to guarantee data freshness (see the sketch below)
• Note that consistency is still maintained when a query uses a mix of compacted/sealed and open partitions
CAPTION: RaptorX caches file lists to lower query latency
[Figure: The Presto coordinator (the SQL endpoint) with a file list cache in front of remote storage]
Affinity Scheduling for Compute/Data Locality
• Presto optimizes cluster utilization by assigning work uniformly to the worker nodes across all running queries
• This prevents nodes from becoming overloaded, which would slow queries down as the overloaded nodes become a compute bottleneck
• With affinity scheduling, the Presto coordinator instead schedules requests that process a certain data/file to the same Presto worker node
• Sending requests for the same data consistently to the same worker node means fewer remote storage calls to retrieve data
• There is a high probability that this data/file is already cached on that particular worker node
• The scheduling policy is "soft": if the destination worker node is too busy or unavailable, the scheduler falls back to its secondary worker pick (see the sketch below)
• Stay tuned for the results of more sophisticated scheduling (currently in testing)
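A minimal sketch of soft affinity, assuming a per-worker load check is available (illustrative; Presto's actual scheduler is more involved):

    import java.util.List;
    import java.util.function.Predicate;

    // Illustrative soft affinity scheduler; a sketch, not Presto's scheduler.
    class AffinityScheduler {
        private final List<String> workers;           // e.g., worker host names
        private final Predicate<String> isOverloaded; // assumed load check

        AffinityScheduler(List<String> workers, Predicate<String> isOverloaded) {
            this.workers = workers;
            this.isOverloaded = isOverloaded;
        }

        // Hash the file path so the same file keeps landing on the same worker.
        String pickWorker(String filePath) {
            int n = workers.size();
            int primary = Math.floorMod(filePath.hashCode(), n);
            if (!isOverloaded.test(workers.get(primary))) {
                return workers.get(primary); // likely has the file cached already
            }
            // "Soft" policy: fall back to a deterministic secondary pick,
            // and to the primary anyway if everything is busy.
            int secondary = Math.floorMod(filePath.hashCode() * 31 + 17, n);
            if (!isOverloaded.test(workers.get(secondary))) {
                return workers.get(secondary);
            }
            return workers.get(primary);
        }
    }

A production scheduler would typically use consistent hashing instead of plain modulo, so that adding or removing workers remaps as few files as possible.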
CAPTION: RaptorX makes a best effort to send jobs that use data from remote storage to nodes that have processed jobs with the same data, reducing remote storage calls
[Figure: The coordinator's scheduler hashes the file path to send processing work for that file to the same worker instance; load balancing takes over if the target worker node is at capacity]
File Desc & Footer Cache: 40% CPU & latency decrease
• openFile() calls to remote storage are used to learn about columnar file data
• Footers have a high hit rate because they are the indexes to the data itself
• Presto worker nodes cache file descriptors in memory to avoid long openFile calls to remote storage
• This is especially beneficial for super-wide tables that contain hundreds or thousands of columns: up to a 40% CPU and latency decrease
• Presto worker nodes also cache common columnar file footers and stripe footers in memory (see the sketch below)
• Supported file formats are ORC, DWRF, and Parquet
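A minimal sketch of a bounded in-memory footer cache keyed by file path (illustrative; the real worker caches are size-aware and format-specific):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Illustrative LRU footer cache; a sketch, not Presto's implementation.
    class FooterCache<F> {
        private final int maxEntries;
        private final Map<String, F> lru;

        FooterCache(int maxEntries) {
            this.maxEntries = maxEntries;
            // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
            this.lru = new LinkedHashMap<String, F>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, F> eldest) {
                    return size() > FooterCache.this.maxEntries;
                }
            };
        }

        // Return the cached footer, or read and parse it once on a miss.
        synchronized F get(String filePath, Function<String, F> readFooter) {
            return lru.computeIfAbsent(filePath, readFooter);
        }
    }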
CAPTION: RaptorX caches file descriptors to lower query latency
[Figure: Presto workers keep a file descriptor cache in front of remote storage; an Optimized Row Columnar (ORC) file consists of a header, stripes (index data, row data, stripe footer), file metadata, a file footer, and a postscript]
Data cache using Alluxio: 10X - 20X latency decrease
• Performance is improved by caching data on flash disks co-located with the Presto workers; the Alluxio and Presto teams collaborated to create a worker-node-level embedded cache library
• The cache is transparent to Presto (standard HDFS interface); Presto falls back to the remote data source if there are disk failures
• On a cache hit, the Alluxio local cache reads the data directly from the local disk and returns it to Presto; otherwise, it retrieves the data from the remote data source and caches it on the local disk for follow-up queries
• The caching mechanism aligns each read to 1MB chunks, where 1MB is configurable so it can be adapted to different storage media
• Example I/O: [1.1MB, 5.6MB] (see the sketch below)
  - Alluxio will issue the I/O [1MB, 6MB]
  - It then saves the following 5 chunks on disk: [1MB, 2MB], [2MB, 3MB], [3MB, 4MB], [4MB, 5MB], and [5MB, 6MB]
  - If there is another I/O [4.3MB, 7.8MB], then [4.3MB, 6MB] is served locally, and [6MB, 8MB] is issued and cached as two extra chunks: [6MB, 7MB] and [7MB, 8MB]
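A minimal sketch of the chunk alignment math that reproduces the example above (illustrative; Alluxio's actual local cache tracks chunks per file and stores them on flash):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    // Illustrative chunk-aligned read logic; a sketch, not Alluxio's implementation.
    class ChunkAlignedCache {
        static final long CHUNK = 1 << 20; // 1MB; configurable in the real cache

        private final Set<Long> cachedChunks = new TreeSet<>(); // chunk start offsets

        // Given a byte range [start, end), return the chunk-aligned ranges that must
        // be fetched from remote storage; all other chunks are served locally.
        List<long[]> read(long start, long end) {
            List<long[]> remoteFetches = new ArrayList<>();
            long firstChunk = (start / CHUNK) * CHUNK;            // round down
            long lastChunk = ((end + CHUNK - 1) / CHUNK) * CHUNK; // round up
            for (long c = firstChunk; c < lastChunk; c += CHUNK) {
                if (cachedChunks.add(c)) {
                    // Not cached yet: fetch [c, c + CHUNK) and keep it on disk.
                    remoteFetches.add(new long[] {c, c + CHUNK});
                }
            }
            return remoteFetches;
        }
    }

Reading [1.1MB, 5.6MB] fetches the five chunks covering [1MB, 6MB); a later read of [4.3MB, 7.8MB] then fetches only [6MB, 7MB) and [7MB, 8MB).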
CAPTION: RaptorX workers serve cache hits from the local Alluxio cache and fall back to remote storage on cache misses
[Figure: A Presto worker with embedded Alluxio caching stores 1MB chunks on local flash; cache hits are read locally, while cache misses go to remote storage]
Fragmented Result Cache: 45% latency decrease and 75% CPU decrease
• An exact-results cache has been around for a long time, but it does not help if queries differ
• RaptorX instead uses a fragmented result cache, which caches the results of plan fragments
• This is especially beneficial for slice-and-dice, drill-down, sliding-window reporting, and visualization use cases, or queries where customers add/remove filters and projections
• Consider two aggregate queries over overlapping time periods, Query 1 and Query 2
• The partially computed sum for each of the 2021-03-22, 2021-03-23, and 2021-03-24 partitions (i.e., the corresponding files) is cached on the Presto workers, forming fragment results for Query 1
• A subsequent query then only needs to aggregate the 2021-03-25 and 2021-03-26 partitions, reducing both compute and I/O cost (see the sketch below)
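A minimal sketch of this reuse, assuming fragment results are keyed by a canonical plan fragment plus a partition value, and that the partial aggregates are sums (illustrative names; the real cache keys on a canonicalized plan shape, as the next slide explains):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiFunction;

    // Illustrative fragment result cache for a partial SUM; a sketch only.
    class FragmentResultCache {
        // Key: canonical fragment plus partition, e.g. "sum(col)|ds=2021-03-22".
        private final Map<String, Long> partialSums = new HashMap<>();

        long sumOverPartitions(String canonicalFragment, List<String> partitions,
                               BiFunction<String, String, Long> computePartialSum) {
            long total = 0;
            for (String partition : partitions) {
                String key = canonicalFragment + "|" + partition;
                // Cached partitions are served from the cache; only uncached
                // partitions (e.g., 2021-03-25 and 2021-03-26) are scanned.
                total += partialSums.computeIfAbsent(key,
                        k -> computePartialSum.apply(canonicalFragment, partition));
            }
            return total; // final aggregation over cached + fresh fragments
        }
    }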
CAPTION: RaptorX’s fragment result cache reduces compute and I/O cost
Query 1:
SELECT SUM(col) FROM T
WHERE ds BETWEEN '2021-03-22' AND '2021-03-24'

Query 2:
SELECT SUM(col) FROM T
WHERE ds BETWEEN '2021-03-22' AND '2021-03-26'

[Figure: Query 2 reuses the cached fragment results for the 2021-03-22, 2021-03-23, and 2021-03-24 partitions, runs scan + partial sum(col) only for 2021-03-25 and 2021-03-26, and then computes the final sum(col) over 03-22 to 03-26]
Fragmented Result Cache
• The previous example explains intelligent cache handling when filtering on partition columns
• Another query type contains non-partition-column filters; cache misses for such queries are reduced by partition-statistics-based pruning
• Consider Query 3, where time is a non-partition column. NOW() is a function whose value changes all the time, so caching against its absolute value results in 0% cache hits
• The predicate time > NOW() - INTERVAL '3' DAY is a "loose" condition: it is true for all rows in most of the partitions, so for those partitions the predicate can be removed from the plan
• For example, if today is 2021-03-24, we know that for partition ds = 2021-03-23 the predicate time > NOW() - INTERVAL '3' DAY is always true
• RaptorX produces a normalized plan shape with (see the sketch below)
  - plan canonicalization/normalization
  - partition column pruning
  - non-partition column pruning based on partition stats
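A minimal sketch of the stats-based pruning decision, assuming the partition stats include the minimum value of the time column (illustrative; the real implementation operates on canonicalized plan trees):

    import java.time.Duration;
    import java.time.Instant;

    // Illustrative check: can `time > NOW() - INTERVAL '3' DAY` be dropped from a
    // partition's plan fragment? A sketch, not Presto's planner.
    class PredicatePruner {
        static boolean filterAlwaysTrue(Instant partitionMinTime, Instant now) {
            Instant cutoff = now.minus(Duration.ofDays(3));
            // If even the smallest time in the partition is after the cutoff, the
            // predicate holds for every row, so the Filter node can be pruned and
            // the remaining fragment becomes cacheable.
            return partitionMinTime.isAfter(cutoff);
        }

        public static void main(String[] args) {
            Instant now = Instant.parse("2021-03-24T00:00:00Z");
            // Partition ds = 2021-03-23: every row's time falls on that day.
            Instant minTime = Instant.parse("2021-03-23T00:00:00Z");
            System.out.println(filterAlwaysTrue(minTime, now)); // prints true
        }
    }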
CAPTION: RaptorX's intelligent fragmented result cache reduces compute and I/O cost
Query 3:
SELECT SUM(col) FROM T
WHERE ds BETWEEN '2021-03-22' AND '2021-03-26'
  AND time > NOW() - INTERVAL '3' DAY

[Figure: For a partition such as ds = 2021-03-23, the fragment Scan -> Filter(time > NOW() - INTERVAL '3' DAY) -> partial sum(col) normalizes to Scan -> partial sum(col), because the filter is always true for that partition, so its cached result can be reused]
RaptorX: 10X faster than Presto!
• We see a more than 10X increase in query performance with RaptorX in production at Facebook
• A TPC-H benchmark between Presto and RaptorX also confirms the performance difference!
• The test was run on a 114-node cluster with 1TB SSDs and 4 threads per task
• The TPC-H scale factor was 100, with the data in remote storage
• Scan- and aggregation-heavy queries show a 10X improvement (Q1, Q6, Q12-Q16, Q19, and Q22)
• Join-heavy queries show between a 3X and 5X improvement (e.g., Q2, Q5, Q10, and Q17)
CAPTION: Presto + cache, i.e., RaptorX, is on average 10X faster
[Figure: Per-query TPC-H results for Presto vs. RaptorX]
10X better performance with no change in pipelines!
Not a research project: RaptorX is in production!
• RaptorX is battle-tested!
• RaptorX is widely deployed (10K+ machines) within Facebook for interactive workloads that need low-latency query performance
• Other low-latency query engines (with co-located storage or disaggregated row-based storage) have been consolidated into RaptorX
• RaptorX is the engine of choice for interactive queries within Facebook!
Come join us!
facebook.com/careers
