Keeping Latency Low and Throughput
High with Application-level
Priority Management
Avi Kivity
CTO at ScyllaDB
Creator and ex-maintainer of Kernel-based Virtual Machine (KVM)
Creator of the Seastar I/O framework
Co-founder, CTO @ ScyllaDB
Comparing throughput and latency
Throughput computing (~ OLAP)
■ Want to maximize utilization
■ Extensive buffering to hide
device/network latency
■ Total time is important
■ Fewer operations, serialization is
permissible
Latency computing (~ OLTP)
■ Leave free cycles to absorb
bursts
■ Cannot predict data to read
■ Often must synchronously write
■ Individual operation time is
important
■ Many operations execute
concurrently
Why mix throughput and latency computing?
■ Run different workloads on the same data - HTAP
● Fewer resources than dedicated clusters
■ Maintenance operations on an OLTP workload
● Garbage collection
● Grooming a Log-Structured Merge Tree (LSM Tree)
● Cluster maintenance - add/remove/rebuild/backup/scrub nodes
General plan
1. Isolate/identify different tasks
2. Schedule tasks
Isolating tasks in threads
■ Each operation becomes a thread
● Perhaps temporarily borrowed from a thread pool
■ Let the kernel schedule these threads
■ Influence kernel choices with priority
Isolating tasks in threads
Advantages
■ Well understood
■ Large ecosystem
Disadvantages
■ Context switches are expensive
■ Communicating priority to the OS is
hard
● Priority levels not meaningful
■ Locking becomes complex and
expensive
■ Priority inversion is possible
■ Kernel scheduling granularity may be
too coarse
Application-level task isolation
■ Every operation is a normal object
■ Operations are multiplexed on a small number of threads
● Ideally one thread per logical core
● Both throughput and latency tasks on the same thread!
■ Concurrency framework assigns tasks to threads
■ Concurrency framework controls order
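
The bullets above can be illustrated with a minimal sketch. It assumes plain Python callables stand in for the framework's task objects; `RunLoop`, `submit`, and `demo` are hypothetical names chosen for illustration and are not Seastar's API. A throughput-style scan is split into chunks that re-queue themselves, so a latency-style point read gets to run between chunks on the same thread.

```python
from collections import deque
from typing import Callable, Deque, List


class RunLoop:
    """Hypothetical per-thread loop; ideally one per logical core."""

    def __init__(self) -> None:
        self._ready: Deque[Callable[[], None]] = deque()

    def submit(self, task: Callable[[], None]) -> None:
        # A task is an ordinary object; no kernel thread is created for it.
        self._ready.append(task)

    def run(self) -> None:
        # The framework, not the kernel, controls the order of execution.
        while self._ready:
            task = self._ready.popleft()
            task()


def demo() -> List[str]:
    log: List[str] = []
    loop = RunLoop()

    def scan(chunk: int) -> None:
        # Throughput-style work, cut into chunks that re-queue themselves.
        log.append(f"scan chunk {chunk}")
        if chunk < 2:
            loop.submit(lambda: scan(chunk + 1))

    loop.submit(lambda: scan(0))
    loop.submit(lambda: log.append("point read"))  # latency-style task
    loop.run()
    return log
```

Because only one task runs at a time on a given thread, state that is private to that thread needs no lock in this model.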
Application-level task isolation
Advantages
■ Full control
■ Low overhead with cooperative scheduling
■ Many locks become unnecessary
■ Good CPU affinity
■ Fewer surprises from the kernel
Disadvantages
■ Full control
■ Less mature ecosystem
Application-managed tasks
[Diagram: a Scheduler and task queues tq1, tq2, tq3, ..., tqn]
Execution timeline
[Diagram: time axis with slices tq1, tq2, tq3, tq1, tq2, tq3]
Switching queues
■ When queue is exhausted
● Common for latency sensitive queues
■ When time slice is exhausted
● Throughput oriented queues
● Queue may have more tasks
● Tasks can be preempted
■ Poll for I/O
● io_uring_enter or equivalent
■ Make scheduling decision
● Pick next queue
● Scheduling goal is to keep q_runtime / q_shares equal across queues
● Selection of queue is not round-robin
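
A minimal sketch of the scheduling decision described above, under stated assumptions: each task returns a simulated cost instead of being timed, the names `TaskQueue`, `pick_next`, and `run_queues` are hypothetical, and the I/O polling step is only marked by a comment. The scheduler always picks the runnable queue with the smallest `runtime / shares`, which pushes that ratio toward equality across queues.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional


@dataclass
class TaskQueue:
    name: str
    shares: int
    runtime: float = 0.0  # accumulated (simulated) run time
    tasks: Deque[Callable[[], float]] = field(default_factory=deque)

    def normalized_runtime(self) -> float:
        return self.runtime / self.shares


def pick_next(queues: List[TaskQueue]) -> Optional[TaskQueue]:
    # Not round-robin: choose the runnable queue that is furthest behind.
    runnable = [q for q in queues if q.tasks]
    if not runnable:
        return None
    return min(runnable, key=TaskQueue.normalized_runtime)


def run_queues(queues: List[TaskQueue], time_slice: float) -> List[str]:
    timeline: List[str] = []
    while (q := pick_next(queues)) is not None:
        used = 0.0
        # Switch when the queue is exhausted or its time slice is used up.
        while q.tasks and used < time_slice:
            used += q.tasks.popleft()()  # each task returns its cost
        q.runtime += used
        timeline.append(q.name)
        # A real loop would poll for I/O completions here.
    return timeline


def demo() -> List[str]:
    query = TaskQueue("query", shares=3)
    compaction = TaskQueue("compaction", shares=1)
    for _ in range(6):
        query.tasks.append(lambda: 1.0)
        compaction.tasks.append(lambda: 1.0)
    return run_queues([query, compaction], time_slice=1.0)
```

With shares of 3 and 1, the query queue receives about three slices for every compaction slice while both have work; once the query queue is exhausted, compaction runs unopposed.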
Preemption techniques
■ Read clock and compare to timeslice end deadline
● Prohibitively expensive
■ Use timer+signal
● Works, icky locking
■ Use kernel timer to write to user memory location
● linux-aio or io_uring
● Tricky but very efficient
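
The third technique can be sketched as follows, with a loud caveat: Python cannot show a kernel timer writing to user memory, so a `threading.Timer` stands in for it here. `PreemptFlag`, `arm`, and `run_slice` are hypothetical names. The point of the sketch is only that the check inside the task loop is a plain memory read rather than a clock read.

```python
import threading


class PreemptFlag:
    """A memory location that a timer sets when the time slice ends."""

    def __init__(self) -> None:
        self.need_preempt = False

    def arm(self, time_slice_s: float) -> threading.Timer:
        # Stand-in (assumption) for arming a kernel timer.
        self.need_preempt = False
        timer = threading.Timer(time_slice_s, self._expire)
        timer.daemon = True
        timer.start()
        return timer

    def _expire(self) -> None:
        self.need_preempt = True


def run_slice(flag: PreemptFlag, start: int, total: int) -> int:
    """Process items from `start`; return the index to resume from."""
    i = start
    while i < total:
        if flag.need_preempt:  # cheap check: one attribute read, no clock
            break
        i += 1  # stand-in for one unit of real work
    return i


def demo() -> int:
    flag = PreemptFlag()
    timer = flag.arm(0.01)
    resume_at = run_slice(flag, 0, 10**9)
    timer.cancel()
    return resume_at
```

The task returns its resume index so the scheduler can put the remaining work back on its queue and continue it in a later slice.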
Stall detector
■ Signal-based mechanism to detect where you “forgot” to add a
preemption check
■ cf. Accidentally Quadratic
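
A minimal sketch of a signal-based stall detector, assuming a Unix system and code running on the main thread (Python only delivers signal handlers there). `StallDetector` and `accidentally_long_task` are hypothetical names, not the Seastar implementation. A one-shot interval timer is armed when a task starts; if the task has not finished by the threshold, the handler records the stack at the point of interruption.

```python
import signal
import time
import traceback
from typing import List


class StallDetector:
    def __init__(self, threshold_s: float) -> None:
        self.threshold_s = threshold_s
        self.reports: List[str] = []

    def _on_alarm(self, signum, frame) -> None:
        # The timer fired before the task yielded: record where it is stuck.
        self.reports.append("".join(traceback.format_stack(frame)))

    def __enter__(self) -> "StallDetector":
        self._old = signal.signal(signal.SIGALRM, self._on_alarm)
        signal.setitimer(signal.ITIMER_REAL, self.threshold_s)
        return self

    def __exit__(self, *exc) -> None:
        signal.setitimer(signal.ITIMER_REAL, 0)  # task finished: disarm
        signal.signal(signal.SIGALRM, self._old)


def accidentally_long_task(duration_s: float) -> None:
    # Busy loop with no preemption check, standing in for a forgotten one.
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        pass
```

Wrapping a task in `with StallDetector(0.01) as det:` leaves a stack trace in `det.reports` when the task overruns the threshold, which points at the loop that needs a preemption check.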
Implementation in ScyllaDB
About ScyllaDB
■ Distributed OLTP NoSQL Database
■ Compatibility
● Apache Cassandra (CQL, Thrift)
● AWS DynamoDB (JSON/HTTP)
● Redis (RESP)
■ ~10X performance on same hardware
■ Low latency, esp. higher percentiles
■ C++20, Open Source
■ Fully asynchronous; Seastar!
Dynamic Shares Adjustment
■ Internal feedback loops to balance competing loads
[Diagram: Seastar Scheduler with Memtable, Compaction, Query, Repair, and Commitlog; resources CPU, SSD, and WAN; a Compaction Backlog Monitor and a Memory Monitor, each labeled "Adjust priority"]
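
One way such a feedback loop could look is a proportional controller that maps a backlog measurement to scheduler shares. This is a sketch under assumptions: the linear form, the constants, and the name `shares_for_backlog` are all hypothetical and are not taken from ScyllaDB's controllers.

```python
def shares_for_backlog(
    backlog: float,
    min_shares: int = 50,      # assumed floor so the queue never starves
    max_shares: int = 1000,    # assumed ceiling to protect other queues
    gain: float = 100.0,       # assumed shares added per unit of backlog
) -> int:
    """Map a normalized backlog to scheduler shares (proportional control)."""
    raw = min_shares + gain * max(backlog, 0.0)
    return int(min(max(raw, min_shares), max_shares))
```

A monitor would call this periodically and hand the result to the scheduler, so a growing backlog gradually earns its queue more CPU and a shrinking one gives it back.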
Resource partitioning (QoS)
■ Provide different quality of service to different users
[Diagram: Seastar Scheduler with Memtable, Compaction, Query 1, Query 2, Repair, and Commitlog; resources CPU, SSD, and WAN; a Compaction Backlog Monitor and a Memory Monitor, each labeled "Adjust priority"]
I/O scheduling
■ Logically the same as CPU scheduling
■ But the entity being scheduled is much more complicated than a CPU core
■ More difficult cross-core coordination
■ More in Pavel’s talk
● “What We Need to Unlearn about Persistent Storage”
Avi Kivity
@AviKivity
