Hardware Assisted Latency Investigations
Motivation and Challenges
Investigating Tail Latency is About Probing for the Atypical
■ Execution samples that land in the tail have been slowed down for a reason that differs from the average case:
● They differ in the amount (or type) of work performed, or
● They were affected by some uncommon event or interference, or
● They encountered more waiting (resource starvation lasting longer than in the average case).
■ In particular, it is not generally true that tail-latency samples and the remaining execution samples exhibit comparable software, hardware, or scheduling histories.
■ Standard approaches for exposing throughput limiters do not generally expose the causes of peak latency.
Illustrative Scenario: 1
[Charts: latency distributions under load, with the SLA-violation threshold marked]
Illustrative Scenario: 1 -- Why?
[Charts compare the default sleep state (C9) against the minimum sleep state (C1), with the SLA-violation point marked on each]
■ The default power setting results in a latency SLA violation at half the throughput.
■ Ironically, low utilization results in an earlier onset of tail latency because of power-management interference (not in the application’s control).
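A common remediation for this particular interference, sketched below as an illustration rather than taken from the slides: clamp the deepest allowed C-state through Linux’s PM QoS interface, /dev/cpu_dma_latency. The kernel honors the constraint only while the file descriptor stays open; the 10-microsecond target is illustrative, not a recommendation.
clamp_cstates.c (illustrative sketch)
/* Clamp the CPU wakeup-latency target via PM QoS so deep sleep states are
 * avoided while a latency-critical service runs. Requires root; the
 * constraint is dropped as soon as the file descriptor is closed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t target_us = 10;   /* maximum tolerated wakeup latency, microseconds */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) { perror("open /dev/cpu_dma_latency"); return 1; }
    if (write(fd, &target_us, sizeof(target_us)) != sizeof(target_us)) {
        perror("write"); close(fd); return 1;
    }
    pause();   /* stand-in for the latency-sensitive workload; keep fd open */
    close(fd);
    return 0;
}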
Illustrative Scenario: 2
■ Reducing latency to a minimum frequently entails choosing between conflicting options for scheduling resources like CPU time:
● Reduce overheads vs.
● Preempt sooner vs.
● Prioritize critical sections vs.
● Do not thrash the cache vs.
● Be fair at a fine-grained scale: do not starve tasks (or activities like garbage collection), etc.
■ Online video gaming at high scale:
● High FPS (frames per second).
● Hotness in the L1 and L2 caches.
● Streamlined execution: do not preempt too soon!
● Deadline priorities: schedule quickly upon waking up!
■ For a first-order impact on reducing frame drops: make task migration immediate.
Tunable | CFS default | New value
kernel.sched_migration_cost_ns | 500000 | 0
kernel.sched_min_granularity_ns | 3000000 | 1000000
kernel.sched_wakeup_granularity_ns | 4000000 | 0
Scheduler tuning for dramatically reducing frame drops
kernel/sched_fair.c (simplified excerpt from the wakeup-preemption path)
. . .
gran = sysctl_sched_wakeup_granularity;
. . .
/* If the virtual runtime of the currently executing task (se) exceeds that of
 * the wakeup target (pse) by more than the safety margin gran, the current
 * task is forced to yield to the wakeup target. */
if (pse->vruntime + gran < se->vruntime)
        resched_task(curr);
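The tunables in the table are ordinary sysctls. As a hedged sketch, assuming a pre-EEVDF kernel where these entries still live under /proc/sys/kernel/ (newer kernels moved some of them under debugfs), they can be applied programmatically as below; sysctl -w or a sysctl.d drop-in achieves the same thing.
apply_sched_tunables.c (illustrative sketch)
/* Write the "New value" column from the table above into the scheduler
 * sysctls. Requires root; paths assume a kernel that still exposes these
 * CFS tunables under /proc/sys/kernel/. */
#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    int rc = (fputs(value, f) == EOF) ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    write_sysctl("/proc/sys/kernel/sched_migration_cost_ns", "0");
    write_sysctl("/proc/sys/kernel/sched_min_granularity_ns", "1000000");
    write_sysctl("/proc/sys/kernel/sched_wakeup_granularity_ns", "0");
    return 0;
}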
Vital to Understand Transitions at Sub-ms Grain
■ Short time ranges contain the vital signals for analyzing tail-latency growth.
● Sampling and averaging of CPU and platform software telemetry helps only a little; it is too blunt to tie short-range causes to effects that persist.
● For example: the scheduling of a large I/O may take only a few microseconds but cause lingering effects in shared caches.
■ Tuning of schedulers (as in Scenario 2) requires the ability to observe intra- and inter-process effects more or less continuously.
● Interestingly, scheduler tuning can also free up CPU headroom, creating an opportunity to handle load spikes within a latency budget.
Solution Space and Role of Hardware
perf-sched: a Dump and Post-process Approach for
Capturing and Analyzing Scheduler Events
■ perf-sched record
■ perf-sched timehist
■ perf-sched map
■ …
https://www.brendangregg.com/blog/2017-03-16/perf-sched.html
From: perf sched for Linux CPU scheduler analysis,
by Brendan Gregg, 2017
Get to an understanding of: How far is the max scheduling delay from the average? When does wait time blow up, and what contemporaneous activities are in play when the max delays occur? Do those activities show similar effects, or is their own execution unusual?
perf-sched timehist to Understand and Tune Various
Scheduling Parameters
https://www.brendangregg.com/blog/2017-03-16/perf-sched.html
From: perf sched for Linux CPU scheduler analysis, by Brendan Gregg, 2017
KUtrace by Richard Sites
■ All kernel-to-user and user-to-kernel transitions are collected at very low space and time overhead.
■ Does one thing, and that one thing well, for production coverage.
■ Additionally, captures the instruction counter from the PMU at transition points to understand IPC, all at ~40 ns overhead.
From: KUTrace: Where have all the nanoseconds gone?, Richard Sites, Tracing Summit, 2017.
Hardware Role in Monitoring
■ Modern CPUs build in a substantial ability to monitor many events at very low overhead.
■ PMU registers are multiplexed over the event space (under software control) at a moderate granularity.
■ A very large subset of these events supports event-based sampling, so correlated ratios can be collected and time-aligned for dissecting the likely causes of IPC shifts.
https://perfmon-events.intel.com/
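As a hedged illustration of how such counters are reached from software, the sketch below opens two architectural events through perf_event_open(2) and derives IPC over a placeholder region of interest; the same mechanism underlies event-based sampling once a sample period and sample type are added.
ipc_counter.c (illustrative sketch)
/* Count cycles and retired instructions for the calling thread with
 * perf_event_open(2) and report IPC. Assumes perf events are permitted
 * (kernel.perf_event_paranoid); error handling kept minimal. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1);   /* leader starts disabled; the group follows it */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cycles);
    if (cycles < 0 || insns < 0) { perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    volatile uint64_t sink = 0;   /* placeholder for the region of interest */
    for (uint64_t i = 0; i < 10000000ULL; i++) sink += i;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t c = 0, n = 0;
    if (read(cycles, &c, sizeof(c)) != sizeof(c)) c = 0;
    if (read(insns, &n, sizeof(n)) != sizeof(n)) n = 0;
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)c, (unsigned long long)n,
           c ? (double)n / c : 0.0);
    return 0;
}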
PMU Based Extraction of Long-latency Paths
■ Timed LBRs (last branch records)
PMU Based Extraction of Long-latency Paths
■ A more detailed view (different example case)
Identifying Cache Miss Hotspots (Data, Code)
• Similarly, a data heatmap can be generated without requiring software instrumentation or tracing (e.g., with Valgrind, Pin, etc.), by sampling the precise load-latency events:
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_<threshold> (threshold is a power of two, e.g., 4 ... 512 cycles)
From Sampling to Tracing
■ Profiling tools today (e.g., perf, Intel® VTune, etc.) are mostly based on statistical hotspot collection: sampling and averaging -- and therefore they lose short-interval transitions.
■ Tracing today (e.g., with insertion of tracepoints) requires a software developer to anticipate where to instrument code. This is not generally easy, unless a lot of engineering has already gone into preselecting the points (e.g., KUtrace).
● Instrumenting everything, or over-collecting, incurs too much CPU penalty
● And memory and cache pollution
■ Challenges beyond trace collection:
● Much effort and data pruning is needed before tail latencies can be linked to likely causes
● Usually pushed to offline analysis.
■ Understanding (and remediating) tail latencies itself needs to be a low-latency endeavor.
eBPF
■ eBPF provides programmable triggering and conditional collection
● at low overhead
● in user or kernel context
■ Thus, for example, one can do something like this: use event-based sampling of the longest-latency accesses as a trigger, and snapshot the timed LBR buffer with eBPF.
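A hedged sketch of that trigger path follows. It assumes a kernel with the bpf_read_branch_records() helper (5.6+) and the BPF ring buffer (5.8+), plus a user-space loader (not shown) that opens the precise long-latency load event with perf_event_open -- including PERF_SAMPLE_BRANCH_STACK so LBRs are captured -- and attaches this program to it. File and map names are illustrative.
lbr_snapshot.bpf.c (illustrative sketch)
/* On every long-latency load sample delivered by the attached PMU event,
 * copy the (timed) LBR entries into a ring buffer for user-space analysis. */
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <linux/perf_event.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

#define MAX_LBR 32

struct lbr_snapshot {
    __u32 pid;
    __u32 nr;                                    /* number of valid entries */
    struct perf_branch_entry entries[MAX_LBR];   /* from, to, cycles, ... */
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 22);
} snapshots SEC(".maps");

SEC("perf_event")
int on_long_latency_load(struct bpf_perf_event_data *ctx)
{
    struct lbr_snapshot *s = bpf_ringbuf_reserve(&snapshots, sizeof(*s), 0);
    if (!s)
        return 0;

    long bytes = bpf_read_branch_records(ctx, s->entries, sizeof(s->entries), 0);
    s->nr = bytes > 0 ? (__u32)(bytes / sizeof(struct perf_branch_entry)) : 0;
    s->pid = bpf_get_current_pid_tgid() >> 32;

    bpf_ringbuf_submit(s, 0);
    return 0;
}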
eBPF and KUTrace/perf-sched and . . .
■ KUtrace is intentionally austere
● so it can be deployed in production, at scale, and be available and running continuously
● (adding bells and whistles to it is not a good idea -- discouraged!)
+ eBPF’s in-kernel filtering
■ For deep insights: KUtrace provides all user<->kernel transitions
■ With eBPF-controlled tracing, we could invoke traces only in areas of interest -- e.g., trace all gRPC requests with payload size > 64 KB (perhaps that is the only case where high tail latencies appear).
■ Or start with eBPF and perf-sched latency co-monitoring and triggering, while connecting eBPF-based probes for latency monitoring in select higher stack layers. (A sketch of this pattern follows.)
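The sketch referred to above, under stated assumptions: a BTF-enabled kernel, libbpf CO-RE, and an illustrative 1 ms threshold. It measures per-task wakeup-to-run delay and emits an event only when the delay exceeds the threshold, so the filtering happens in the kernel and only tail events reach user space (libbpf-tools’ runqslower implements the same pattern more completely).
sched_delay.bpf.c (illustrative sketch)
/* Record the sched_wakeup timestamp per pid; on sched_switch, compute the
 * wakeup-to-run delay and emit an event only when it exceeds THRESHOLD_NS. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

#define THRESHOLD_NS (1 * 1000 * 1000ULL)   /* 1 ms, illustrative */

struct delay_event {
    __u32 pid;
    __u64 delay_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, __u32);      /* pid */
    __type(value, __u64);    /* wakeup timestamp, ns */
} wakeup_ts SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);
} events SEC(".maps");

SEC("tp_btf/sched_wakeup")
int BPF_PROG(handle_wakeup, struct task_struct *p)
{
    __u32 pid = BPF_CORE_READ(p, pid);
    __u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&wakeup_ts, &pid, &ts, BPF_ANY);
    return 0;
}

SEC("tp_btf/sched_switch")
int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
             struct task_struct *next)
{
    __u32 pid = BPF_CORE_READ(next, pid);
    __u64 *tsp = bpf_map_lookup_elem(&wakeup_ts, &pid);
    if (!tsp)
        return 0;

    __u64 delay = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&wakeup_ts, &pid);
    if (delay < THRESHOLD_NS)
        return 0;               /* in-kernel filtering: only the tail is emitted */

    struct delay_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    e->pid = pid;
    e->delay_ns = delay;
    bpf_ringbuf_submit(e, 0);
    return 0;
}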
Summing It Up
■ Tail latency control is crucial, with the penetration of real-time complex event processing into virtually all sectors.
■ Low-overhead, agile monitoring of latency excursions is needed.
■ Equally, to unveil the causes, contributing factors need to be collected at low overhead and in a timely manner -- ideally through conditional collection and filtering.
■ Hardware performance monitoring capabilities are rich and can collect a wide variety of events at very low overhead.
■ Linking eBPF-based, latency-focused monitoring (e.g., timed LBRs and long-latency cache misses) is one direction.
■ Another is triggering eBPF-based collection of hardware event rates, time-aligned with scheduler events filtered for high scheduling delays (wait signaling and post-wait dispatching).
Brought to you by
Thank You