Keeping Latency Low and Throughput High with Application-level Priority Management
Avi Kivity
Co-founder and CTO at ScyllaDB
Creator and ex-maintainer of Kernel-based Virtual Machine (KVM)
Creator of the Seastar I/O framework
Comparing throughput and latency
Throughput computing (~ OLAP)
■ Want to maximize utilization
■ Extensive buffering to hide device/network latency
■ Total time is important
■ Fewer operations; serialization is permissible

Latency computing (~ OLTP)
■ Leave free cycles to absorb bursts
■ Cannot predict which data will be read
■ Often must write synchronously
■ Individual operation time is important
■ Many operations execute concurrently
Why mix throughput and latency computing?
■ Run different workloads on the same data - HTAP
● Fewer resources than dedicated clusters
■ Maintenance operations on an OLTP workload
● Garbage collection
● Grooming a Log-Structured Merge Tree (LSM Tree)
● Cluster maintenance - add/remove/rebuild/backup/scrub nodes
General plan
1. Isolate/identify different tasks
2. Schedule tasks
Isolating tasks in threads
■ Each operation becomes a thread
● Perhaps temporarily borrowed from a thread pool
■ Let the kernel schedule these threads
■ Influence kernel choices with priority
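A minimal sketch of this model (Linux-specific; handle_query and handle_compaction are invented placeholders): each operation gets its own thread, and the only lever for influencing the kernel is a nice value.

```cpp
#include <sys/resource.h>   // setpriority
#include <cstdio>
#include <thread>

void handle_query()      { std::puts("serving a query"); }          // latency work
void handle_compaction() { std::puts("compacting in background"); } // throughput work

int main() {
    std::thread latency([] {
        // Linux-specific: setpriority(PRIO_PROCESS, 0, ...) acts on the
        // calling thread; negative nice values require privileges.
        setpriority(PRIO_PROCESS, 0, -5);
        handle_query();
    });
    std::thread throughput([] {
        setpriority(PRIO_PROCESS, 0, 10);  // deprioritize bulk work
        handle_compaction();
    });
    latency.join();
    throughput.join();
}
```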
Isolating tasks in threads
Advantages
■ Well understood
■ Large ecosystem
Disadvantages
■ Context switches are expensive
■ Communicating priority to the OS is hard
● Priority levels are not meaningful
■ Locking becomes complex and expensive
■ Priority inversion is possible
■ Kernel scheduling granularity may be too high
Application-level task isolation
■ Every operation is a normal object
■ Operations are multiplexed on a small number of threads
● Ideally one thread per logical core
● Both throughput and latency tasks on the same thread!
■ Concurrency framework assigns tasks to threads
■ Concurrency framework controls order
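A minimal sketch of this model, using invented names (task, task_queue, core_scheduler) rather than Seastar's actual API:

```cpp
#include <deque>
#include <functional>
#include <vector>

// An operation is a plain object, not a kernel thread.
struct task {
    std::function<void()> run;
};

// Tasks of one scheduling class (e.g. "query", "compaction") share a queue.
struct task_queue {
    unsigned shares;          // relative CPU weight used by the scheduler
    std::deque<task> tasks;   // FIFO order within the queue
};

// One instance per logical core: throughput and latency queues live on
// the same thread, and the framework decides what runs next.
struct core_scheduler {
    std::vector<task_queue> queues;
    void submit(unsigned q, task t) {
        queues[q].tasks.push_back(std::move(t));
    }
};
```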
Application-level task isolation
Advantages
■ Full control
■ Low overhead with cooperative scheduling
■ Many locks become unnecessary
■ Good CPU affinity
■ Fewer surprises from the kernel
Disadvantages
■ Full control
■ Less mature ecosystem
Application-managed tasks
[Diagram: a per-core scheduler feeds task queues tq1, tq2, tq3, …, tqn; the execution timeline shows slices of tq1, tq2, tq3 interleaved over time.]
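A sketch of one slice of that timeline: run tasks from the chosen queue until it empties or its timeslice expires, then return so the scheduler can poll I/O and pick the next queue. (Reading the clock per task is shown only for simplicity; the preemption slides below cover the cheap check used in practice.)

```cpp
#include <chrono>
#include <deque>
#include <functional>

using task = std::function<void()>;

void run_queue_slice(std::deque<task>& queue,
                     std::chrono::steady_clock::duration slice) {
    auto deadline = std::chrono::steady_clock::now() + slice;
    while (!queue.empty() && std::chrono::steady_clock::now() < deadline) {
        task t = std::move(queue.front());
        queue.pop_front();
        t();  // cooperative: a long task must contain its own preemption checks
    }
}
```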
Switching queues
■ When queue is exhausted
● Common for latency-sensitive queues
■ When time slice is exhausted
● Throughput oriented queues
● Queue may have more tasks
● Tasks can be preempted
■ Poll for I/O
● io_uring_enter or equivalent
■ Make scheduling decision
● Pick next queue
● Scheduling goal is to keep q_runtime / q_shares equal across queues
● Selection of queue is not round-robin
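A sketch of that decision, with an invented queue_state type: always pick the runnable queue with the lowest runtime/shares ratio, which equalizes q_runtime / q_shares across queues over time.

```cpp
#include <chrono>
#include <vector>

struct queue_state {
    std::chrono::nanoseconds runtime{0};  // CPU time consumed so far
    unsigned shares = 1;                  // relative weight
    bool runnable = false;
};

// Not round-robin: an under-served queue wins regardless of position.
int pick_next(const std::vector<queue_state>& qs) {
    int best = -1;
    double best_ratio = 0.0;
    for (int i = 0; i < (int)qs.size(); ++i) {
        if (!qs[i].runnable) continue;
        double ratio = double(qs[i].runtime.count()) / qs[i].shares;
        if (best == -1 || ratio < best_ratio) {
            best = i;
            best_ratio = ratio;
        }
    }
    return best;  // -1: nothing runnable, go poll for I/O
}
```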
Preemption techniques
■ Read clock and compare to timeslice end deadline
● Prohibitively expensive
■ Use timer+signal
● Works, but locking from a signal handler is icky
■ Use kernel timer to write to user memory location
● linux-aio or io_uring
● Tricky but very efficient
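A sketch of the memory-write technique. Here a plain helper thread stands in for the kernel timer that linux-aio/io_uring would arm; the point is the hot path, where checking for preemption is a single relaxed load instead of a clock read.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// The word the timer writes to; in the real mechanism the kernel
// performs this write.
std::atomic<bool> preempt_requested{false};

void arm_timeslice(std::chrono::milliseconds slice) {
    std::thread([slice] {
        std::this_thread::sleep_for(slice);
        preempt_requested.store(true, std::memory_order_relaxed);
    }).detach();
}

// Hot-path check: far cheaper than clock_gettime on every task.
inline bool need_preempt() {
    return preempt_requested.load(std::memory_order_relaxed);
}
```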
Stall detector
■ Signal-based mechanism to detect where you “forgot” to add a preemption check
■ cf. Accidentally Quadratic
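A simplified sketch of such a detector using a POSIX timer and SIGALRM (an illustration, not ScyllaDB's exact mechanism): re-arm a one-shot timer at every preemption point; if it ever fires, some task ran past the threshold without reaching a check.

```cpp
#include <csignal>
#include <ctime>
#include <unistd.h>

static timer_t stall_timer;

// Fires only when a preemption check was missed for too long;
// real code would capture and log a backtrace here.
extern "C" void on_stall(int) {
    const char msg[] = "stall detected: missing preemption check?\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

void init_stall_detector() {
    struct sigaction sa{};
    sa.sa_handler = on_stall;
    sigaction(SIGALRM, &sa, nullptr);
    timer_create(CLOCK_MONOTONIC, nullptr, &stall_timer);  // defaults to SIGALRM
}

// Called at every preemption point to push the deadline out again.
void rearm_stall_detector(long threshold_ms) {
    itimerspec its{};   // one-shot: it_interval stays zero
    its.it_value.tv_sec  = threshold_ms / 1000;
    its.it_value.tv_nsec = (threshold_ms % 1000) * 1000000L;
    timer_settime(stall_timer, 0, &its, nullptr);
}
```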
Implementation in ScyllaDB
About ScyllaDB
■ Distributed OLTP NoSQL Database
■ Compatibility
● Apache Cassandra (CQL, Thrift)
● AWS DynamoDB (JSON/HTTP)
● Redis (RESP)
■ ~10X performance on same hardware
■ Low latency, esp. higher percentiles
■ C++20, Open Source
■ Fully asynchronous; Seastar!
Dynamic Shares Adjustment
■ Internal feedback loops to balance competing loads
[Diagram: the Seastar scheduler multiplexes memtable flush, compaction, query, repair, and commitlog queues onto the CPU, with I/O flowing to the SSD and WAN; a compaction backlog monitor and a memory monitor adjust queue priorities.]
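A sketch of one such loop, with invented constants: a monitor periodically maps the normalized compaction backlog to scheduler shares, so compaction gets more CPU as it falls behind and less as it catches up.

```cpp
// backlog is normalized to [0, 1]: 0 = fully caught up,
// 1 = about to stall incoming writes.
struct backlog_controller {
    float min_shares = 50;
    float max_shares = 1000;

    float shares_for(float backlog) const {
        return min_shares + backlog * (max_shares - min_shares);
    }
};

// A monitor would run something like this periodically:
//   scheduler.set_shares(compaction_queue,
//                        controller.shares_for(measure_backlog()));
```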
Resource partitioning (QoS)
■ Provide different quality of service to different users
[Diagram: same architecture, with separate Query 1 and Query 2 queues so each user class runs in its own scheduling group.]
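A sketch of that partitioning, with illustrative share values: each user class gets its own scheduling group, and the share ratio only matters under contention; an idle group's capacity flows to the others.

```cpp
#include <map>
#include <string>

struct scheduling_group {
    unsigned shares;  // relative weight under contention
};

// Two query classes sharing the same node and the same data:
std::map<std::string, scheduling_group> service_levels = {
    {"query-1", {800}},  // premium tenant: ~80% of contended CPU
    {"query-2", {200}},  // best-effort tenant: ~20% under contention
};
```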
I/O scheduling
■ Logically, the same as CPU scheduling
■ But the scheduled entity (a disk) is much more complicated than a CPU core
■ More difficult cross-core coordination
■ More in Pavel’s talk
● “What We Need to Unlearn about Persistent Storage”
Avi Kivity
@AviKivity
