Efficient Query Processing with Optimistically
Compressed Hash Tables & Strings in the USSR
Presented by Huaiyu Xu
PingCAP.com
Motivation
● Hash tables frequently used in analytical queries
● Crucial for overall performance
● But (large) HTs bottlenecked by main memory bandwidth
What can we do about it?
Motivation
Orthogonal approaches:
● Optimize access
○ partitioning
● Increase fill-rate
○ Cuckoo, Robin Hood hashing
● Shrink the table itself
○ reduce the bucket/row size and, consequently, increase cache efficiency (this has not received as much attention)
Shrinking Hash Tables
Suppose a 100 MiB hash table magically shrinks by 10x:
● Increase query throughput
● Downsize your computer
Bonus:
● The 10 MiB HT fits into the L3/LLC cache
● Improved runtime
Better Latency & Throughput
Shrinking Hash Tables
● Domain-guided prefix suppression
● Optimistic splitting
● Unique Strings self-aligned Region (USSR)
Domain-guided prefix suppression
● Domain-guided:
○ per-column min/max information from metadata
● Prefix Suppression
○ subtract the domain minimum from each value
○ pack multiple columns together
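The two steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Domain` class and the `pack`/`unpack` helpers are hypothetical names, and the little-endian field order inside the packed word is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Per-column min/max, as it would come from metadata (hypothetical type)."""
    lo: int
    hi: int

    @property
    def bits(self) -> int:
        # Bits needed to represent any value in [0, hi - lo].
        return (self.hi - self.lo).bit_length()


def pack(values, domains):
    """Subtract each column's minimum and pack the results into one integer."""
    word, shift = 0, 0
    for v, d in zip(values, domains):
        assert d.lo <= v <= d.hi, "value outside the known domain"
        word |= (v - d.lo) << shift
        shift += d.bits
    return word


def unpack(word, domains):
    """Inverse of pack: extract each field and add the minimum back."""
    out = []
    for d in domains:
        mask = (1 << d.bits) - 1
        out.append((word & mask) + d.lo)
        word >>= d.bits
    return out


domains = [Domain(1000, 1999), Domain(-5, 5)]
packed = pack([1500, 0], domains)   # 10 bits + 4 bits instead of two full integers
restored = unpack(packed, domains)
```

Because packing is a bijection on the domain, two rows have equal keys exactly when their packed words are equal, so equality can be checked without unpacking.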
Domain-guided prefix suppression
● Compression and Decompression
○ lightweight: a handful of bitwise operations
○ fast equality comparisons on compressed data
● Generating Pre-Compiled Kernels
○ restrict the number of inputs to 4
○ restrict the types we pack into to 32-, 64- and 128-bit unsigned integers
○ impose an order on the inputs
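A rough sketch of why these restrictions keep the number of pre-compiled kernels small. The function names are hypothetical, and ordering the inputs by descending bit width is only an assumption about what "impose an order" could mean; the slides do not say which order is used.

```python
PACK_WIDTHS = (32, 64, 128)  # allowed packed word sizes, in bits
MAX_INPUTS = 4               # allowed number of input columns per kernel


def packed_width(bit_widths):
    """Pick the smallest allowed word that holds all compressed columns."""
    if len(bit_widths) > MAX_INPUTS:
        raise ValueError("too many inputs for one kernel")
    total = sum(bit_widths)
    for width in PACK_WIDTHS:
        if total <= width:
            return width
    raise ValueError("columns do not fit into 128 bits")


def kernel_key(bit_widths):
    """Canonical kernel signature: permutations of the same inputs share a kernel.

    Sorting by descending width is an assumed ordering, for illustration only.
    """
    return tuple(sorted(bit_widths, reverse=True)), packed_width(bit_widths)


same_kernel = kernel_key([4, 10, 4, 10]) == kernel_key([10, 4, 10, 4])
```

With a canonical order, any permutation of the same input columns maps to one kernel instead of one kernel per permutation.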
Optimistic Splitting
● Decrease the effective memory footprint
● Example: select sum(a) from t; a is int64 (8 bytes) → sum is a decimal: a + a + ... (40 bytes)
● int64 / decimal
● Decompose HT into:
Hot HT:
● Frequently accessed
● Cache-resident
● Aggregates:
○ SUM: sub-sums fit in a smaller data type
Cold HT:
● Rarely accessed
● Main memory
● Aggregates:
○ SUM: store the full SUM or an overflow counter
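The hot/cold split for SUM can be sketched as below, using the overflow-counter variant. This is an illustration under stated assumptions: the class name `SplitSum`, the 32-bit hot width, and the restriction to non-negative inputs are all choices made for the example, not details from the slides.

```python
HOT_BITS = 32                 # assumed width of the hot partial sum
HOT_LIMIT = 1 << HOT_BITS


class SplitSum:
    """SUM aggregate split into a small hot part and a rarely touched cold part."""

    def __init__(self):
        self.hot = 0           # partial sum, always below 2**HOT_BITS
        self.cold = 0          # overflow counter: how often the hot part wrapped
        self.cold_touches = 0  # instrumentation only: accesses to the cold part

    def add(self, value):
        assert 0 <= value < HOT_LIMIT, "sketch handles small non-negative values only"
        total = self.hot + value
        if total >= HOT_LIMIT:
            # Rare path: the optimistic assumption failed, so touch the cold table.
            self.cold += 1
            self.cold_touches += 1
            total -= HOT_LIMIT
        self.hot = total

    def result(self):
        # Combine both parts only when the final value is needed.
        return self.cold * HOT_LIMIT + self.hot


s = SplitSum()
for _ in range(3):
    s.add(1 << 31)
```

In the common case `add` reads and writes only the hot field, which is the part meant to stay cache-resident; the cold field is touched once per overflow.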
Unique Strings self-aligned Region (USSR)
● Assumption: Many strings repeat
● USSR
○ Query-wide dictionary
○ Limited size (cache resident)
○ Built during scan
● 768kB (hash table 256kB, data region 512kB)
● data region
○ 2^16 slots (8 bytes / slot)
○ each string takes at least two slots (one for the hash and one for the string)
○ all pointers inside a data region start with the same 45-bit prefix
● hash table region
○ 2^16 buckets (4 bytes / bucket)
○ each bucket consists of a 16-bit hash extract and a 16-bit slot number
○ load factor < 50% (2^16 buckets for at most 2^15 strings)
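The sizes above follow from simple arithmetic, and the shared 45-bit prefix gives a cheap membership test for pointers. The sketch below assumes 64-bit pointers and a data region aligned to its own size; the helper names, and the choice to put the hash extract in the upper half of the bucket, are assumptions for illustration.

```python
SLOT_BYTES = 8
NUM_SLOTS = 1 << 16
REGION_BYTES = SLOT_BYTES * NUM_SLOTS       # 512 KiB data region
REGION_BITS = REGION_BYTES.bit_length() - 1  # 19 offset bits; 64 - 19 = 45 prefix bits

BUCKET_BYTES = 4
NUM_BUCKETS = 1 << 16
TABLE_BYTES = BUCKET_BYTES * NUM_BUCKETS    # 256 KiB hash table region


def in_ussr(ptr, base):
    """A pointer is inside the region iff it shares the region's upper 45 bits."""
    return (ptr >> REGION_BITS) == (base >> REGION_BITS)


def slot_of(ptr):
    """Slot number of a pointer inside the region (fits in 16 bits)."""
    return (ptr & (REGION_BYTES - 1)) // SLOT_BYTES


def make_bucket(hash_extract, slot):
    """Pack a 16-bit hash extract and a 16-bit slot number into 4 bytes."""
    assert 0 <= hash_extract < (1 << 16) and 0 <= slot < (1 << 16)
    return (hash_extract << 16) | slot


def split_bucket(bucket):
    return bucket >> 16, bucket & 0xFFFF


base = 12345 << REGION_BITS   # hypothetical region start, aligned to 512 KiB
ptr = base + 100 * SLOT_BYTES
```

Since a string occupies at least two slots, at most 2^15 strings fit into 2^16 slots, which is where the bound on the load factor of the 2^16 buckets comes from.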
Experiments
Micro-bench: Faster HashJoin Probe
● Micro-benchmark of Domain-Guided Prefix Suppression
● 4 keys [0...1000], 4 payloads [0...10]
● 2.5x faster hash probe, including the tuple reconstruction cost
● > 10^6 rows: the speedups were caused by the more cache-resident hash table
● < 10^6 rows: mostly affected by the more efficient comparisons directly on compressed data
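A back-of-the-envelope calculation for the benchmark's domains shows how small the packed rows can become. The uncompressed layout is an assumption here (eight 4-byte integers per row); the slides do not state the original column types or whether keys and payloads are packed into the same word.

```python
def bits_for_range(lo, hi):
    """Bits needed for any value in [lo, hi] after subtracting lo."""
    return (hi - lo).bit_length()


key_bits = 4 * bits_for_range(0, 1000)    # 4 keys, 10 bits each
payload_bits = 4 * bits_for_range(0, 10)  # 4 payloads, 4 bits each
packed_bits = key_bits + payload_bits     # total bits per row after suppression

# Assumed uncompressed layout: eight 32-bit integer columns per row.
uncompressed_bits = 8 * 32
```

Under these assumptions a whole row fits into a single 64-bit word, so comparing the four keys reduces to integer comparison on the packed value.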
Micro-bench: USSR and Group-By
● SELECT COUNT(*) FROM T GROUP BY s
● 10 unique strings, all strings had the same length
● The time spent on string comparisons when checking the keys inside group by’s hash table
● The time spent on computing the hash of the string keys
TPC-H (sf = 100): memory footprint
● Over TPC-H we measured up to 2.1x lower memory consumption
● However, Optimistic Splitting in fact increases (rather than reduces) the overall memory consumption, as it introduces additional data
● The main idea behind Optimistic Splitting is to reduce memory pressure rather than overall memory consumption
TPC-H (sf = 100): query performance
● USSR alone: Q4, Q12, Q16 benefit from faster string hashing and equality comparisons
● CHT alone: improvement of at least 10%, provided by a) more efficient expression evaluation on smaller data types and b) more cache-efficient hash table operations on compressed keys
● CHT + OPTIMISTIC + USSR: Q1, Q15 benefited from the Optimistic SUM aggregate, which sped up the aggregate computation
● Q2: the regression was caused by type-casting overhead that occurs when operating on compact data types
Faster Real-World Workload (Public BI)
● string heavy
● “CommonGovernment” workbook:
Thank You!
