Paper Reading: HashKV and Beyond
Presented by Xinye Tao
Part I - Problem Recap
Background: SSD
● Coarse-Grained Access
○ data is returned by Page
○ an in-place update requires a costly read-erase-rewrite procedure on a Block
● Parallelism
○ unlike a hard disk, an SSD can serve multiple requests at the same time
○ sequential or patterned random access can be optimized
■ sequential access is faster due to the hardware-tuned address translation unit
What we need: append writes & prefetch reads
Background: LSM and Key-Value Separation
● LSM is a radical compromise in pursuit of online latency
● a Cache + Log architecture, not Index + Data
● Pros:
○ Append-Only is SSD-Friendly
○ Immutability is Parallel-Friendly
● Cons:
○ queries need to traverse the log: read amplification
○ stale data: space amplification (1 / fanout)
○ data purging needs rewriting of cold records: write amplification
Background: LSM and Key-Value Separation
● the three amplifications cannot all be beaten (RUM conjecture)
● but we can slow their growth by feeding the LSM less data
● one feasible solution is Key-Value Separation
Problem Formalization: Designing a Value Store
● Architecture (a minimal interface is sketched after this list)
○ Index Store (key -> address)
○ Value Store (address -> value)
● Metrics
○ Read, Scan
○ Write
○ Space Amplification
○ Background Load and Variance
○ Update (on Index Store) Frequency
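A minimal sketch of this two-store split; all class and method names here are illustrative, not taken from any specific system:

    # Minimal sketch of key-value separation: the Index Store maps key -> address,
    # the Value Store maps address -> value. Names and layout are illustrative only.
    class IndexStore:
        def __init__(self):
            self._index = {}                  # stand-in for the LSM-tree of (key, address)
        def put(self, key, address):
            self._index[key] = address
        def get(self, key):
            return self._index.get(key)

    class ValueStore:
        def __init__(self):
            self._log = []                    # stand-in for an append-only value log
        def append(self, key, value):
            self._log.append((key, value))
            return len(self._log) - 1         # address = offset in the log
        def read(self, address):
            return self._log[address][1]

    def put(index, values, key, value):
        index.put(key, values.append(key, value))   # write the value first, then publish its address

    def get(index, values, key):
        addr = index.get(key)
        return None if addr is None else values.read(addr)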
Previous Attempt: WiscKey
● Circular List
● Append at tail
● Purge at head
○ if IndexStore.get(log.key) != log.address: Purge
○ else: Relocate (the full GC loop is sketched below)
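The GC loop above can be sketched as follows; this is a simplified, single-threaded illustration with assumed pop_head/append helpers, not WiscKey's actual code:

    # Sketch of WiscKey-style vLog GC: scan a batch from the head of the circular
    # log, purge entries whose address no longer matches the Index Store, and
    # relocate still-valid (cold) entries to the tail. Helpers are assumed.
    def vlog_gc(index_store, vlog, batch_size):
        for _ in range(batch_size):
            entry = vlog.pop_head()                 # assumed: returns (key, value, address) or None
            if entry is None:
                break
            key, value, address = entry
            if index_store.get(key) != address:
                continue                            # stale record: purge (drop it)
            new_address = vlog.append(key, value)   # valid record: relocate to tail
            index_store.put(key, new_address)       # Index Store must be updated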
Previous Attempt: WiscKey
● Evaluation
○ steady state requires GC_Speed >= Update_Speed
○ Update Frequency (on Index Store): GC_Speed * Cold_Rate
○ Background Load: GC_Speed * { Index Query + Cold_Rate * (Append + Index Update) } (worked example below)
● GC’s load variance is tied to Cold_Rate and Update_Speed
● Index Store’s locality is disturbed by unnecessary relocation of cold records
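A small worked example of these estimates, with made-up numbers:

    # Worked example of the WiscKey GC cost model above (numbers are made up).
    update_speed = 10_000            # user updates per second
    cold_rate    = 0.7               # fraction of scanned entries that are still valid
    gc_speed     = update_speed      # steady state requires GC_Speed >= Update_Speed

    # Relocations of valid (cold) entries trigger extra Index Store updates.
    index_update_freq = gc_speed * cold_rate                 # 7000 updates/s
    # Every scanned entry costs one index query; valid ones also cost an append + index update.
    background_ops = gc_speed * (1 + cold_rate * (1 + 1))    # 24000 ops/s

    print(index_update_freq, background_ops)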
Part II - HashKV Details
HashKV: Hash-Based Data Grouping
● Value Store is hash-partitioned, with each partition dynamically sized
● Global GC: choose a partition by priority
○ an in-memory heap holds the write frequency of each partition
● Local GC: local scan without issuing new requests to the Index Store
○ an in-memory temporary hash map tracks the latest key:value (both GC paths are sketched after the diagram below)
(diagram: Main Segments 1, 2, 3, ... each with incremental log segments, located via a Segment Table)
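A rough sketch of the two GC paths described above, under simplified assumptions (flat per-partition entry lists instead of main + log segments):

    # Sketch of HashKV-style GC on a hash-partitioned value store.
    # Simplification: each partition is a flat append-only list of (key, value),
    # instead of a fixed-size main segment plus dynamically allocated log segments.
    class Partition:
        def __init__(self):
            self.entries = []                 # append-only (key, value) records
            self.writes = 0                   # write counter used for GC priority

        def append(self, key, value):
            self.entries.append((key, value))
            self.writes += 1

    def pick_gc_partition(partitions):
        # Global GC: pick the partition with the highest write frequency.
        # (HashKV keeps an in-memory heap of these counters; max() here for brevity.)
        return max(partitions, key=lambda p: p.writes)

    def local_gc(partition):
        # Local GC: scan only this partition; a temporary in-memory hash map keeps
        # the latest version of each key, so no validity query to the Index Store is needed.
        latest = {}
        for key, value in partition.entries:
            latest[key] = value
        partition.entries = list(latest.items())
        partition.writes = 0
        return latest                         # new locations are then published to the Index Store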
HashKV: Hotness-Awareness
● Rewriting cold entries that are rarely updated wastes resources
● In HashKV, such entries are identified and moved to a separate vLog during GC (sketched after the diagram below)
○ Index Store will point to the new value in the vLog
○ the old entry is tagged as cold
● Further updates may trigger a cold-to-hot switch
○ the vLog is GCed using WiscKey’s approach
(diagram: the segment keeps only a tagged stub "meta(tagged), key", the full k-v moves to the vLog, and a later update writes "meta, key, new-value" back into the segment)
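A sketch of this cold/hot separation during GC; latest_versions and the tagging format are assumptions for illustration, not the paper's on-disk layout:

    # Sketch of hotness-aware GC: keys not updated since the last GC round are
    # treated as cold; their values move to the cold vLog and only a tagged stub
    # (metadata + key) stays in the hash partition. Illustrative only.
    def hotness_aware_gc(partition, cold_vlog, index_store, updated_since_last_gc):
        survivors = []
        for key, value in latest_versions(partition):     # assumed helper: latest (key, value) pairs
            if key in updated_since_last_gc:
                survivors.append((key, value))            # hot: stays in the hash partition
            else:
                addr = cold_vlog.append(key, value)       # cold: relocated to the separate vLog
                index_store.put(key, addr)                # Index Store now points into the vLog
                survivors.append((key, None))             # tagged stub: meta + key, value elsewhere
        partition.entries = survivors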
Benchmark: Setup
● Hardware
○ CPU: 4-core Xeon E3-1240 v2
○ Memory: 16 GB
○ Disk: 128 GB SSD × (1 + 6)
● Input
○ Key: 24 bytes
○ Value: 992 bytes
○ Dataset: three phases of 40 GB updates over an existing 40 GB
Benchmark: Overall Performance
Benchmark: Impact of Value Size
Part III - Beyond HashKV
Pros and Cons of HashKV
● Evaluation
○ Read, Scan: read randomness is worse than WiscKey
■ could be optimized by software prefetch, at the cost of CPU resources (sketched after this list)
■ could be optimized by range grouping, which needs a global scheduler (e.g. TiKV Region)
○ Write: the global append flow is divided across partitions, introducing write randomness
■ could be optimized by batched writes
■ at the cost of parallelism
○ Background Load: GC needs no checking queries and has no external dependency
○ Update Frequency: once for cold records, several times for hot records
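A sketch of the software-prefetch idea for scans; read_value is an assumed accessor, not HashKV's API:

    from concurrent.futures import ThreadPoolExecutor

    # Sketch of software prefetch for scans over a hash-partitioned value store:
    # consecutive keys map to scattered partitions, so value reads are issued in
    # parallel (bounded by prefetch_depth) while earlier results are consumed.
    # read_value(address) is an assumed accessor, not HashKV's API.
    def scan(index_store, read_value, keys, prefetch_depth=8):
        with ThreadPoolExecutor(max_workers=prefetch_depth) as pool:
            futures = [pool.submit(read_value, index_store.get(k)) for k in keys]
            for key, fut in zip(keys, futures):
                yield key, fut.result()

This trades extra CPU and threads for lower effective scan latency, which is exactly the cost noted above.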
Insights from HashKV: modular
(diagram: LSM → LSM Index + vLog → LSM Index + Hashed Log + Cold vLog → ?)
optimize for workflow
Insights from HashKV: tunable
● A tunable system is naturally adaptable
○ design decisions are made at runtime, not fixed at setup time
● HashKV is tunable in that
○ global GC can be delayed without compromising the KV service
○ local GC can be scheduled in a tunable manner
○ cold/hot separation is also controlled by an independent criterion (knobs sketched below)
● e.g. the self-driving database Peloton by CMU, now remade as terrier
optimize for workload
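A sketch of what such runtime knobs could look like; every name and default value here is hypothetical:

    # Hypothetical runtime knobs illustrating the "tunable" point above:
    # GC decisions are driven by adjustable thresholds, not fixed at setup time.
    class GcTuning:
        def __init__(self, gc_lag_bytes=4 << 30, cold_threshold=2, io_budget=0.3):
            self.gc_lag_bytes = gc_lag_bytes        # how much stale data global GC may tolerate
            self.cold_threshold = cold_threshold    # updates per round below which a key is cold
            self.io_budget = io_budget              # fraction of device bandwidth local GC may use

    def should_run_global_gc(stale_bytes, tuning):
        # Global GC can be delayed as long as accumulated stale data stays under the knob.
        return stale_bytes > tuning.gc_lag_bytes

    def is_cold(update_count, tuning):
        # Cold/hot separation is governed by its own independent, adjustable criterion.
        return update_count < tuning.cold_threshold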
References (and some related papers)
ATC ‘18, HashKV: Enabling Efficient Updates in KV Storage via Hashing
EDBT ‘16, Designing Access Methods: The RUM Conjecture
FAST ‘16, WiscKey: Separating Keys from Values in SSD-conscious Storage
FAST ‘16, Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs
ATC ‘17, TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores
CIDR ‘17, Optimizing Space Amplification in RocksDB
SOSP ‘17, PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
ATC ‘18, Redesigning LSMs for Nonvolatile Memory with NoveLSM
VLDB ‘19, Efficient Data Ingestion and Query Processing for LSM-Based Storage Systems
Thank You !


Editor's Notes

  • #3 Paper name: HashKV: Enabling Efficient Updates in KV Storage via Hashing; write-intensive workload.
  • #5 Here write amplification refers to the peak write ratio when several compactions are issued, while the write-optimized property of LSM refers to the data transfer required per write, i.e. the throughput when no compaction is running.
  • #9 a. Sometimes being controllable is more important than best-case performance. b. Only relocate cold values.
  • #16 During updates, HashKV outperforms vLog by 4x. PebblesDB has better throughput but a larger output size, due to its lower compaction frequency.
  • #17 Data from the P3 update phase. The gap between vLog and HashKV closes for larger value sizes; reason: a lower pair count and a smaller LSM make queries and updates on the Index Store cheaper.
  • #19 Background: local GC and dynamic sizing.
  • #20 Keys, hot values, and cold values have different workload properties and thus different expectations of the storage structure: keys need point and scan queries; values should be space-optimized; cold data should occupy few resources.