Improve Presto architectural decisions with
shadow cache
Zhenyu Song (Princeton University)
Ke Wang (Facebook)
October 12, 2021
Introduction
● Zhenyu Song
○ Ph.D. candidate at Princeton University
○ Interested in caching systems
● Ke Wang
○ Engineer at Facebook
○ Focused on low-latency queries on the Presto team
Motivation: cache operation decisions
Shadow cache: a lightweight Alluxio component to track the working set size & infinite cache hit ratio
Cache operator questions:
● How to size my cache for each tenant?
● What is the potential hit ratio improvement?
Motivation: cache operation decisions
Cache operator question → Shadow cache answer:
● How to size my cache for each tenant? → Total unique bytes (pages) accessed in the past 24 h
● What is the potential hit ratio improvement? → Total #hits/#misses if the cache could hold all pages requested in the past 24 h
Shadow cache design challenges
● Goal: track the working set size & infinite-size hit ratio
● Challenges:
○ Small memory & CPU overhead
○ Accuracy
○ Dynamic updates
Solution to overhead & accuracy challenge: Bloom filter
● Space-efficient probabilistic data structure for membership testing
● Intuition: each object is represented with only a few bits
● May return false positives, but never false negatives
● It uses k hash functions:
○ To add an element, apply each hash function and set the corresponding bit to 1
○ To query an element, apply each hash function and AND the corresponding bits
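The add/query steps above can be sketched in a few lines of Python. This is a toy illustration, not the Alluxio implementation; the filter size, hash count, and SHA-256-based hashing scheme are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, key):
        # Derive k independent bit positions by salting the hash with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        # Set the bit selected by each hash function to 1.
        for p in self._positions(key):
            self.bits[p] = 1

    def contains(self, key):
        # AND the bits: present only if every hashed position is set.
        return all(self.bits[p] for p in self._positions(key))
```

A hit can be a false positive (distinct keys may happen to set the same bits), but a miss is definitive: if any hashed bit is 0, the key was never added.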
Why does a Bloom filter help?
● To get the infinite-size hit ratio, on each get(key) we query whether the key is in the Bloom filter.
● To measure the working set size, we leverage the approximation
n* = -(m / k) · ln(1 - X / m)
where n* is an estimate of the number of items in the filter, m is the length (size) of the filter in bits, k is the number of hash functions, and X is the number of bits set to one.
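The approximation is easy to evaluate directly (the function name and symbol names here are mine):

```python
import math

def estimated_items(m, k, ones):
    """Estimate n* = -(m/k) * ln(1 - X/m) for a Bloom filter with
    m bits, k hash functions, and X = `ones` bits set to one."""
    return -(m / k) * math.log(1 - ones / m)
```

With no bits set the estimate is 0; as X approaches m the estimate diverges, reflecting that a saturated filter can no longer distinguish cardinalities. For example, a filter of m = 8 bits with k = 2 hash functions and X = 4 set bits yields n* = -(8/2)·ln(0.5) = 4·ln 2 ≈ 2.77 items.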
Solution to dynamic update: Bloom filter chain
● The shadow cache is implemented as a chain of Bloom filters; each filter tracks the unique objects accessed in one period
[Diagram: a chain of four Bloom filters, each covering one 6h segment of the 24h window]
Bloom filter chain: insert()
[Diagram: insert(key) adds the key to the current (newest) Bloom filter in the chain]
Bloom filter chain: get()
[Diagram: get(key) queries every Bloom filter in the chain and ORs the results]
Bloom filter chain: switch()
[Diagram: switch() removes the oldest Bloom filter from the chain and appends a new empty one]
Bloom filter chain: estimate_working_set_size()
[Diagram: estimate_working_set_size() ORs the bits of all Bloom filters into one combined filter, then applies the cardinality estimate]
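The four operations above can be sketched together as a small chain class. This is a simplified model, not the Alluxio API: the method names mirror the slides, but the filter sizes, hashing scheme, and four-segment default are illustrative assumptions.

```python
import hashlib
import math

class _Filter:
    """Minimal Bloom filter used as one segment of the chain."""

    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, key):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def contains(self, key):
        return all(self.bits[p] for p in self._positions(key))

class ShadowCache:
    """Chain of Bloom filters; each covers one time segment (6h in the talk)."""

    def __init__(self, segments=4, m=65536, k=3):
        self.m, self.k = m, k
        self.chain = [_Filter(m, k) for _ in range(segments)]

    def insert(self, key):
        # New accesses land in the current (newest) segment.
        self.chain[-1].add(key)

    def get(self, key):
        # OR across segments: present in any segment counts as a hit.
        return any(f.contains(key) for f in self.chain)

    def switch(self):
        # Rotate: drop the oldest segment, start a fresh one.
        self.chain.pop(0)
        self.chain.append(_Filter(self.m, self.k))

    def working_set_size(self):
        # OR all bit vectors, then apply n* = -(m/k) ln(1 - X/m).
        ones = sum(1 for bits in zip(*(f.bits for f in self.chain)) if any(bits))
        if ones >= self.m:
            return float("inf")
        return -(self.m / self.k) * math.log(1 - ones / self.m)
```

Rotating segments is what bounds the tracking window: after four switch() calls every key older than 24h has aged out, without ever storing the keys themselves.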
Memory overhead estimation
● Example: tracking 27 M pages (a 27 TB working set) uses 125 MB of memory, with only 3% error
○ Assumes four Bloom filters; each page is 1 MB
○ The memory overhead is independent of the page key type (currently {string, long})
● Memory can be further reduced with HyperLogLog, but HyperLogLog cannot support infinite-size hit ratio estimation
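A back-of-envelope check of these numbers, using the standard Bloom filter formulas. The 3% figure on the slide refers to the working-set estimate error; the false-positive rate computed below is a separate quantity, and the MiB interpretation of "125MB" is my assumption:

```python
import math

budget_bits = 125 * 1024 * 1024 * 8   # 125 MB total budget, in bits
num_filters = 4                       # four 6h segments
items = 27_000_000                    # 27 M pages ~= 27 TB at 1 MB/page

bits_per_filter = budget_bits / num_filters
bits_per_item = bits_per_filter / items           # ~9.7 bits per page
k = round(bits_per_item * math.log(2))            # optimal #hash functions
# False-positive rate with k hash functions: (1 - e^(-k*n/m))^k
fpr = (1 - math.exp(-k * items / bits_per_filter)) ** k
```

At roughly 9.7 bits per page the optimal hash count comes out to k = 7 and the per-filter false-positive rate is below 1%, comfortably consistent with the quoted 3% error. Note the budget buys the same accuracy regardless of key type, since only hashed bits are stored.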
Implementation
● Uses the Guava BloomFilter library
● Automatically selects the Bloom filter configuration (bits, #hash functions) from a user-defined memory overhead budget and the shadow cache window
● Supports working set size in terms of #pages and #bytes
● Supports infinite-size byte hit ratio and object hit ratio
Usage
# The past window that defines the working set
alluxio.user.client.cache.shadow.window=24h
# The total memory overhead for the Bloom filters used for tracking
alluxio.user.client.cache.shadow.memory.overhead=125MB
# The number of Bloom filters used for tracking; each tracks a segment of the window
alluxio.user.client.cache.shadow.bloomfilter.num=4
Conclusion
● We designed shadow cache: a lightweight Alluxio component to track the working set size & infinite cache hit ratio
● Code merged:
https://github.com/Alluxio/alluxio/blob/master/core/client/fs/src/main/java/alluxio/client/file/cache/CacheManagerWithShadowCache.java
● Many optimization opportunities remain
Shadow cache at Facebook
Project RaptorX
Motivation
1. We want to understand whether a cluster is bounded by cache storage: will adding more storage improve the cache hit rate, and thus query latency?
2. It would also be useful to explore the potential improvement from better caching algorithms.
3. We want to optimize the routing algorithm for better balance and efficiency.
Presto routing for RaptorX
● We shard the cache among clusters based on table name
● Queries that access the same table always go to the same target cluster, to maximize cache reuse
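Table-based sharding can be sketched as a deterministic hash of the table name. The function and hashing choice here are illustrative, not Presto's actual router:

```python
import hashlib

def pick_cluster(table_name, clusters):
    """Route by table name so every query on a given table lands on
    the same cluster and reuses that cluster's cache."""
    # Hash the table name to a stable integer, then index into the
    # cluster list; the mapping only changes if the list changes.
    h = int.from_bytes(hashlib.md5(table_name.encode()).digest()[:8], "big")
    return clusters[h % len(clusters)]
```

The downside of this scheme, as the CPU skew slide shows, is that hot tables pin all their load to a single cluster.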
[Figure: CPU skew across clusters under table-based routing]
Options for optimizing routing logic
● Secondary cluster
○ When the primary cluster is busy, route to a designated secondary cluster that also has the cache turned on for those queries
○ Requires storing additional tables' cache on each cluster
● Two clusters both serving as designated primaries, with load balancing between them
○ Doubles cache disk usage
● Shuffle tables between clusters based on query pattern to even out the CPU distribution
○ Could make the cache storage distribution uneven and requires extra cache space
Key metrics on shadow cache
● Shadow cache gives us insight into the cache working set and what the cache hit rate would look like with infinite cache space.
● C1: real cache usage at a given point in time
● C2: shadow cache working set over a time window (1 day / 1 week)
● H1: real cache hit rate
● H2: shadow cache hit rate
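One way to act on these four metrics is a simple decision function. The thresholds and branch logic below are hypothetical, my reading of how the metrics could be combined; the actual decision tree in the talk is a figure and may differ:

```python
def next_step(c1, c2, h1, h2):
    """Hypothetical decision logic over the four shadow-cache metrics.
    c1: real cache usage, c2: shadow working set,
    h1: real hit rate,   h2: infinite-size (shadow) hit rate."""
    if h2 - h1 < 0.05:
        # Even unlimited capacity barely improves the hit rate:
        # the workload has little reuse to exploit.
        return "caching is not the bottleneck"
    if c2 > c1:
        # The working set exceeds provisioned capacity:
        # more cache storage should move h1 toward h2.
        return "add cache storage"
    # Capacity covers the working set but the hit rate still lags:
    # the eviction policy is the likely culprit.
    return "improve the caching algorithm"
```

The key insight shadow cache enables is separating the two failure modes: H2 bounds what any cache could achieve, and comparing C2 against C1 tells you whether the gap is a capacity problem or an algorithm problem.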
Decision tree based on key metrics
[Figure: decision tree over C1, C2, H1, H2]
Thank you!
Q&A
