Automating the Hunt for
Non-Obvious Sources of
Latency Spreads
Kshitij Doshi, Sr. Principal Engr at Intel
Harshad S. Sane, Principal Engr at Intel
Datacenter
& AI
Kshitij Doshi
■ Ph.D., Rice Univ – Comm. efficient parallel algorithms
■ Performance of Systems, DB, Cloud-native apps
■ Research interests in storage, memory, distributed systems
■ 20 years at Intel; previously, 13 years at Unix Systems Labs & Novell.
Harshad Sane
■ Harshad Sane is a Principal Engineer in Intel's Data Center
and AI group
■ Deep technical expertise in system software, memory, and
CPU architectures.
■ Specializes in Performance Engineering with extensive
experience and expertise in Telemetry, Observability,
Monitoring, Software optimization.
■ Section 1 - About tail latency spreads
■ Section 2 - Two non-obvious causes of latency escapes
■ Section 3 - How to decide if either of them is hurting your application
■ Section 4 - Mitigations, if they are hurting your application
■ Section 5 - Summary
Agenda
Hurdles are not always predictable.
Courtesy: bing.com/images
ScyllaDB is engineered for usages needing high
throughputs and predictable, low latencies...
https://resources.scylladb.com/videos/build-low-latency-applications-in-rust-on-scylladb
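[Figure: ScyllaDB userspace I/O scheduler. Query, Commitlog, and Compaction each feed a queue into the userspace I/O scheduler, which dispatches to disk; the figure is labelled 0.5 msec.]
[Figure: 3-tier architecture: Frontend, Services, Database.]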
. . . Latency landmines can be present, however, in
other layers and inter-services interactions, or, in
infrastructure services
Microservices
architecture
frequently there is some issue that intersects in an unpredictable manner
with execution of normal hotspots
When Small Performance Fluctuations Magnify
Into Sudden, Large Spikes in Response Times…
repeating over and over with minor perturbations in end-to-end
latencies for each iteration
Consider a Streamlined Flow of Execution
Such a hiccup . . . propagates and throws both timing and resource usage out of
balance, for some period of time.
But this period of non-streamlined flow can feed on itself and produce secondary
spikes in end-to-end latencies, even as overall flow throughput evens out.
Where Something Goes Out of Balance Momentarily
and Causes a Hiccup.
Consider two such issues . . .
[Figure: two modules of one application.
A first module in an application: thread T0 sets random seed S, then waits 100 ms; threads T1 to T5 use S.
A second module in the application: a producer puts X in queue L; a consumer gets Y from queue L.]
Producer frequently modifies tail of queue L, while
consumer frequently modifies head of queue L.
First issue:
The threads working in the two modules have no logical intersection, yet they become cross-coupled if the
variable S ends up on a cacheline that also stores either or both of the head / tail pointers of queue L.
This is not a significant problem unless updates of queue L become frequent.
First issue: false sharing
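A minimal sketch of how such an accidental overlap can be checked from a structure layout. The `Unpadded` structure, its field names, and the 64-byte cacheline size are illustrative assumptions, not taken from the slides; the check also assumes the structure itself starts on a cacheline boundary.

```python
import ctypes

CACHELINE_BYTES = 64  # assumption: typical x86 cacheline size


class Unpadded(ctypes.Structure):
    """Hypothetical layout: seed S sits right next to queue L's head and tail."""
    _fields_ = [
        ("seed", ctypes.c_uint64),  # written by T0, read by T1..T5
        ("head", ctypes.c_uint64),  # modified by the consumer
        ("tail", ctypes.c_uint64),  # modified by the producer
    ]


def cacheline_index(struct_type, field_name, line_bytes=CACHELINE_BYTES):
    """Cacheline (relative to the struct start) holding the field's first byte."""
    return getattr(struct_type, field_name).offset // line_bytes


def fields_sharing_lines(struct_type, line_bytes=CACHELINE_BYTES):
    """Group field names by starting cacheline; keep only lines with 2+ fields."""
    groups = {}
    for name, *_ in struct_type._fields_:
        line = cacheline_index(struct_type, name, line_bytes)
        groups.setdefault(line, []).append(name)
    return {line: names for line, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # All three fields start within the first 64 bytes, so they share line 0.
    print(fields_sharing_lines(Unpadded))
```

Any line reported here is a candidate for cross-coupling between otherwise unrelated threads; whether it matters in practice depends on how often the fields are written.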
High CPU frequency
High power consumption
More instructions/time
Low CPU frequency
Low power consumption
Fewer instructions/time
One of several sleep states
Very low power consumption
Active idle
Operating system and hardware algorithms together with system configuration parameters
determine conditions under which CPUs transition among different states of operation
Second Issue: CPU Active (P-states) and
Sleep (C-states) States
Transitions out of deeper sleep states into active execution take many microseconds up to the
point of normal instruction execution, and also experience transient effects of colder caches.
Second Issue: CPU Power Management
Transitions
Transitions from low power states to normal execution go through a series of frequency step-ups, causing
dependency-chained software actions to stall while waiting on inter-thread or inter-process data/events.
Second issue: CPU Power (and Sleep) State
Transitions
[Figure: timeline of threads ThrA, ThrB, and ThrC, where each thread triggers the next (ThrA triggers ThrB, ThrB triggers ThrC, ThrC triggers ThrA, and so on).]
Ideal progression . . .
[Figure: the same timeline with one CPU waking late.]
Imagine that this CPU is transitioning out of a deep sleep state . . .
• Delayed trigger for ThrA
• ThrC runs slower
• Possibly this CPU has entered a deeper sleep as a result
• ThrA runs slower as a result
• Delayed trigger for ThrB
Cascading Sleep -> Wakeup -> Sleep transients can take time to
fade out, and cause high peak latencies …
… even though impact on average latency gets amortized.
Detecting and untangling the causes of these intersecting issues is very
challenging, particularly if a high degree of instrumentation, such as tracing
or logging, interferes with and distorts the effects.
These effects are not easily noticed through lightweight sampling or
counting of events.
Collecting Traces
Security, Collection overheads at CPU, Caches,
Bandwidth to memory and storage/network.
Analyzing Traces
Like searching for a needle in multiple haystacks
without knowing if a needle is to be found at all.
Scheduling collection
and analysis
Like figuring out when crime is going to occur in order
to launch crime scene analysis
Challenges
Indicating power transitions
• Turbostat
• Powertop
• Runqueue lengths
• CoreFreq - https://github.com/cyring/CoreFreq
Indicating cache coherence issues
• Sharp IPC drop with concurrency
• No obvious mem/disk data bottleneck
• High utilization, low runqueue lengths
Good but circumstantial clues
Correlated Events That Can Be Collected
at Low Overhead
Picking up on
disruptive events
Courtesy: bing.com/images
To see what is available:
sudo perf list | grep cstate
Example output:
cstate_core/c3-residency/
cstate_core/c6-residency/
cstate_core/c7-residency/
cstate_pkg/c2-residency/
…
Capturing C-State Transitions
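As a complement to the perf residency events listed above, C-state entry counts can also be polled cheaply. The following is a minimal sketch, assuming a Linux system that exposes the cpuidle sysfs files (`cpuN/cpuidle/stateM/name` and `usage`); the function names are hypothetical, and this is a different counting approach from the perf events shown on the slide.

```python
import time
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # Linux cpuidle sysfs (assumed present)


def read_cstate_usage(root=CPU_ROOT):
    """Return {(cpu, state_name): times_entered} from the cpuidle 'usage' files."""
    counts = {}
    for usage_file in root.glob("cpu[0-9]*/cpuidle/state[0-9]*/usage"):
        state_dir = usage_file.parent
        cpu = state_dir.parent.parent.name                 # e.g. "cpu3"
        state = (state_dir / "name").read_text().strip()   # e.g. "C6"
        counts[(cpu, state)] = int(usage_file.read_text())
    return counts


def transitions_between(before, after):
    """Per-(cpu, state) increase in entry count between two snapshots."""
    deltas = {}
    for key, count in after.items():
        delta = count - before.get(key, 0)
        if delta > 0:
            deltas[key] = delta
    return deltas


if __name__ == "__main__":
    first = read_cstate_usage()
    time.sleep(0.25)  # one 250 ms observation window
    second = read_cstate_usage()
    for (cpu, state), n in sorted(transitions_between(first, second).items()):
        print(f"{cpu} {state}: entered {n} times in 250 ms")
```

On systems without cpuidle support (some virtual machines and containers), the snapshots are simply empty.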
sudo perf timechart record
sudo perf timechart
Monitoring sleep, wait, and
run times per CPU.
4 threads in intermittent
sleeps
4 threads in variable I/O
wait durations
Timecharting
1. perf sched timehist
(-Mw): migrations and wakeups
Tracks scheduler latency by event, including time to wakeup and
latency from wakeup to run (sched delay)
2. perf sched map
Per CPU timeline of processes as they are context switched
Visualizing scheduling events by time
P-states (≡ frequencies)
‒ BIOS controlled
‒ OS controlled via scaling drivers
‒ HW controlled P states
‒ Turbo
Monitoring and controlling P-states
Credits:https://images.anandtech.com/doci/9582/43.jpg
To monitor the P-states:
‒ Turbostat, CoreFreq
‒ Profiling tools (Perf, Vtune, etc.)
To control P-states: OS-controlled P-states can be manually
configured through scaling drivers, using tools such as
‒ Cpupower
‒ CoreFreq https://github.com/cyring/CoreFreq
to control the performance governor for P-states.
[Flowchart: automated detection]
• Response time R, monitored at application level: compare an exponentially weighted moving-window average with a short-range average to detect an upward heave
• C-State Monitoring and P-State Monitoring: Tn = transitions count over 250 ms windows; flag when Tn > Threshold
• Detect overlap of the two conditions
• On overlap, snapshot runqlat and timechart activity from the last ‘n’ secs
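The detection flow above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the smoothing factor `alpha`, the short window length, and the `ratio` that defines an "upward heave" are all assumed values chosen for the example.

```python
from collections import deque


class HeaveDetector:
    """Flags when the short-range average of response times rises well above
    an exponentially weighted moving average (the long-run baseline)."""

    def __init__(self, alpha=0.05, short_window=8, ratio=1.5):
        self.alpha = alpha          # EWMA smoothing factor (assumed value)
        self.ratio = ratio          # how far above baseline counts as a heave
        self.baseline = None        # EWMA of response times seen so far
        self.recent = deque(maxlen=short_window)

    def observe(self, response_time):
        """Feed one sample; return True if an upward heave is detected."""
        self.recent.append(response_time)
        short_avg = sum(self.recent) / len(self.recent)
        if self.baseline is None:
            self.baseline = response_time
            return False
        heave = short_avg > self.ratio * self.baseline
        # Update the baseline after the comparison so a spike does not
        # immediately mask itself.
        self.baseline += self.alpha * (response_time - self.baseline)
        return heave


def should_snapshot(heave, transitions_in_window, threshold):
    """Trigger heavier collection only when a latency heave overlaps with a
    high C-/P-state transition count (Tn > Threshold) in the same window."""
    return heave and transitions_in_window > threshold


if __name__ == "__main__":
    detector = HeaveDetector(short_window=4)
    samples = [1.0] * 20 + [5.0]
    flags = [detector.observe(s) for s in samples]
    print(flags[-1], should_snapshot(flags[-1], transitions_in_window=12, threshold=10))
```

Gating the snapshot on the overlap keeps the expensive tracing step off the common path, which matches the intent of triggering collection only on good precursors.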
Step 1: Establish whether sufficient clues exist to suspect false sharing
(a) Likelihood: higher concurrency + low scaling + higher response times, with low IPC
despite low LLC misses/instruction
(b) Sensitivity: insensitive to runq-lengths (not sensitive to CPU subscription)
(c) Clues: higher number of coherence misses in L1 and/or L2 (PMC snoop events
S2I, M2I); increased inter-socket link utilization in a multi-socket system
Clues for False-Sharing (With Low
Overhead)
Drilling down for concrete evidence of false sharing
perf c2c
Sampling based detection of cachelines where false sharing was likely – based on
the HITM event (see below).
These are read or write accesses for which a different core’s cache reports a “hit” in
“modified” state (HITM).
Provides insights into data addresses, code addresses, processes and threads that
generate sharing conflicts.
Conditionally upon step 1 indicating possible false sharing
Step 2: Collect perf c2c profiles identifying data and code addresses producing the contention
Cacheline summary
Cacheline Access details (look for HITMs)
What perf c2c profiling looks like
Courtesy: bing.com/images
When power management actions are suspected to provoke high tail latencies:
1. Choose less extreme power-performance settings
power-save, energy-efficient etc.
2. Explore changes in scheduler tunings, such as –
a. Quicker preemption (reducing wakeup -> onproc)
b. Smaller time-slices
c. Different (usually lower) migration thresholds
Solution Space
When false-sharing is suspected to provoke high tail latencies:
1. Some data structure layout possibilities:
a. Data structure / global variable padding (if possible)
b. Changing the affected data structure to better separate (quasi)- immutable from
mutable cachelines
c. Splitting data structures in question into sub-structures
2. Possible computation strategy changes:
a. Rate-limiting writers to cachelines that are accessed frequently by readers
b. Colocating to same socket or sub-numa clusters
c. Make code bimodal: normal computation until a monitor signals rise in coherence
events, and one of (2a/2b) after.
Solution Space
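Option 1a (padding) can be illustrated with a layout sketch. The `Padded` structure, its field names, and the 64-byte cacheline size are assumptions made for the example; the separation also holds only if the structure is allocated on a cacheline boundary, which `ctypes` does not guarantee by itself.

```python
import ctypes

CACHELINE_BYTES = 64  # assumption: typical x86 cacheline size
PAD = CACHELINE_BYTES - ctypes.sizeof(ctypes.c_uint64)


class Padded(ctypes.Structure):
    """Hypothetical layout: each frequently written field gets its own line."""
    _fields_ = [
        ("seed", ctypes.c_uint64),        # mostly read, rarely written
        ("_pad0", ctypes.c_char * PAD),   # fill the rest of line 0
        ("head", ctypes.c_uint64),        # modified by the consumer
        ("_pad1", ctypes.c_char * PAD),   # fill the rest of line 1
        ("tail", ctypes.c_uint64),        # modified by the producer
        ("_pad2", ctypes.c_char * PAD),   # keep neighbours off line 2
    ]


def line_of(field_name):
    """Cacheline (relative to the struct start) where the field begins."""
    return getattr(Padded, field_name).offset // CACHELINE_BYTES


if __name__ == "__main__":
    for name in ("seed", "head", "tail"):
        print(name, "starts on line", line_of(name))
    print("total size:", ctypes.sizeof(Padded), "bytes")
```

The cost is memory: three 8-byte fields now occupy three full cachelines, which is why padding is listed as "if possible" and why splitting or regrouping fields (1b, 1c) may be preferable for large or numerous structures.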
• Latency instrumentation needs to be made as close to real time as possible.
• Tracing needs to be combined with sampling over short intervals, and triggered by good
precursors, so overhead is kept to a minimum.
• We outlined two issues—
• False sharing
• Power management transitions
that may not arise frequently, but can have measurable effects on tail latencies, which can be
hard to detect.
In this presentation we have shown the role these 2 components play in application
performance, their detectability and possible solutions.
Summary
Thank You
Stay in Touch
kshitij.a.doshi@intel.com
harshad.s.sane@intel.com

More Related Content

Similar to Automating the Hunt for Non-Obvious Sources of Latency Spreads

Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianC* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianDataStax Academy
 
Low level java programming
Low level java programmingLow level java programming
Low level java programmingPeter Lawrey
 
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Michael Christofferson
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @JavaPeter Lawrey
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCoburn Watson
 
Considerations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmfConsiderations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmfhik_lhz
 
Scaling Networks Lab Manual 1st Edition Cisco Solutions Manual
Scaling Networks Lab Manual 1st Edition Cisco Solutions ManualScaling Networks Lab Manual 1st Edition Cisco Solutions Manual
Scaling Networks Lab Manual 1st Edition Cisco Solutions Manualnudicixox
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의DzH QWuynh
 
High Frequency Trading and NoSQL database
High Frequency Trading and NoSQL databaseHigh Frequency Trading and NoSQL database
High Frequency Trading and NoSQL databasePeter Lawrey
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsJose Pinilla
 
CA UNIT I PPT.ppt
CA UNIT I PPT.pptCA UNIT I PPT.ppt
CA UNIT I PPT.pptRAJESH S
 
Advanced Pipelining in ARM Processors.pptx
Advanced Pipelining  in ARM Processors.pptxAdvanced Pipelining  in ARM Processors.pptx
Advanced Pipelining in ARM Processors.pptxJoyChowdhury30
 

Similar to Automating the Hunt for Non-Obvious Sources of Latency Spreads (20)

Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianC* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
 
Cisco OpenSOC
Cisco OpenSOCCisco OpenSOC
Cisco OpenSOC
 
Low level java programming
Low level java programmingLow level java programming
Low level java programming
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Oversimplified CA
Oversimplified CAOversimplified CA
Oversimplified CA
 
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Considerations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmfConsiderations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmf
 
Scaling Networks Lab Manual 1st Edition Cisco Solutions Manual
Scaling Networks Lab Manual 1st Edition Cisco Solutions ManualScaling Networks Lab Manual 1st Edition Cisco Solutions Manual
Scaling Networks Lab Manual 1st Edition Cisco Solutions Manual
 
Postgres clusters
Postgres clustersPostgres clusters
Postgres clusters
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
 
High Frequency Trading and NoSQL database
High Frequency Trading and NoSQL databaseHigh Frequency Trading and NoSQL database
High Frequency Trading and NoSQL database
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) Limitations
 
CA UNIT I PPT.ppt
CA UNIT I PPT.pptCA UNIT I PPT.ppt
CA UNIT I PPT.ppt
 
Advanced Pipelining in ARM Processors.pptx
Advanced Pipelining  in ARM Processors.pptxAdvanced Pipelining  in ARM Processors.pptx
Advanced Pipelining in ARM Processors.pptx
 

More from ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 

More from ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 

Recently uploaded

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Automating the Hunt for Non-Obvious Sources of Latency Spreads

  • 1. Automating the Hunt for Non-Obvious Sources of Latency Spreads Kshitij Doshi, Sr. Principal Engr at Intel Harshad S. Sane, Principal Engr at Intel Datacenter & AI
  • 2. Kshitij Doshi ■ Ph.D., Rice Univ – Comm. efficient parallel algorithms ■ Performance of Systems, DB, Cloud-native apps ■ Research interests in storage, memory, distributed systems ■ 20 y at Intel; previously, 13 y at Unix Systems Labs & Novell.
  • 3. Harshad Sane ■ Harshad Sane is a Principal Engineer in Intel's Data Center and AI group ■ Deep technical expertise in system software, memory, and CPU architectures. ■ Specializes in Performance Engineering with extensive experience and expertise in Telemetry, Observability, Monitoring, Software optimization.
  • 4. Agenda ■ Section 1 - About tail latency spreads ■ Section 2 - Two non-obvious causes of latency escapes ■ Section 3 - How to decide if either of them is hurting your application ■ Section 4 - Mitigations, if they are hurting your application ■ Section 5 - Summary
  • 5. Hurdles are not always predictable. Courtesy: bing.com/images
  • 6. ScyllaDB is engineered for usages needing high throughputs and predictable, low latencies... https://resources.scylladb.com/videos/build-low-latency-applications-in-rust-on-scylladb (Diagram labels: Query, Commitlog, Compaction; Queue, Queue, Queue; 0.5 msec; Userspace I/O Scheduler; Disk.)
  • 7. . . . Latency landmines can be present, however, in other layers and inter-services interactions, or in infrastructure services. (Diagram labels: 3-tier architecture with Frontend, Database, Services; Microservices architecture.)
  • 8. When Small Performance Fluctuations Magnify Into Sudden, Large Spikes in Response Times… frequently there is some issue that intersects in an unpredictable manner with the execution of normal hotspots.
  • 9. Consider a Streamlined Flow of Execution, repeating over and over with minor perturbations in end-to-end latencies for each iteration.
  • 10. Where Something Goes Out of Balance Momentarily and Causes a Hiccup. Such a hiccup . . . propagates and throws both timing and resource usage out of balance for some period of time. But this period of non-streamlined flow can feed on itself and produce secondary spikes in end-to-end latencies, even as overall flow throughput evens out.
  • 11. Consider two such issues . . .
  • 12. First issue: A first module in an application: T0 sets random seed S; T1, T2, T3, T4, and T5 each use S (diagram label: Wait 100 ms). A second module in the application: a producer puts X in queue L, and a consumer gets Y from queue L. The producer frequently modifies the tail of queue L, while the consumer frequently modifies the head of queue L.
  • 13. First issue: false sharing. (Same diagram as slide 12.) The threads working in the two modules have no logical intersection, but they do get cross-coupled if the variable S ends up on a cacheline that is also used for storing either or both of the head / tail pointers of queue L. This is not a significant problem unless updates of queue L become frequent.
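The layout hazard on slide 13 can be illustrated with a small sketch. It models a hypothetical C-style struct with Python's ctypes and reports whether the seed S lands on the same cacheline as the queue pointers. The struct name, the field names, and the 64-byte line size are assumptions for illustration and do not come from the talk.

```python
import ctypes

CACHELINE = 64  # assumed cacheline size in bytes; verify for the target CPU


class SharedState(ctypes.Structure):
    # Hypothetical layout: the seed S is declared next to queue L's pointers.
    _fields_ = [
        ("seed", ctypes.c_uint64),  # S, used by threads in the first module
        ("head", ctypes.c_uint64),  # head of L, modified by the consumer
        ("tail", ctypes.c_uint64),  # tail of L, modified by the producer
    ]


def cacheline_index(struct_type, field_name):
    """Cacheline index of a field's first byte, assuming the struct
    itself starts on a cacheline boundary."""
    return getattr(struct_type, field_name).offset // CACHELINE


def fields_sharing_a_line(struct_type, field_a, field_b):
    """True if the two fields start on the same cacheline."""
    return (cacheline_index(struct_type, field_a)
            == cacheline_index(struct_type, field_b))


if __name__ == "__main__":
    for name in ("head", "tail"):
        print(name, "shares a line with seed:",
              fields_sharing_a_line(SharedState, "seed", name))
```

With this layout all three fields start in the first 24 bytes, so every write to head or tail would also invalidate the line holding the seed in other cores' caches.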
  • 14. Second issue: CPU Active (P-states) and Sleep (C-states) States. Diagram states: high CPU frequency (high power consumption, more instructions/time); low CPU frequency (low power consumption, fewer instructions/time); one of several sleep states (very low power consumption); active idle. Operating system and hardware algorithms, together with system configuration parameters, determine the conditions under which CPUs transition among different states of operation.
  • 15. Second issue: CPU Power Management Transitions. (Same state diagram as slide 14.) Transitions out of deeper sleep states into active execution take many microseconds up to the point of normal instruction execution, and also experience transient effects of colder caches.
  • 16. Second issue: CPU Power (and Sleep) State Transitions. (Same state diagram and text as slide 15, plus the following.) Transitions from low power states to normal execution go through a series of frequency step-ups, causing software actions that are dependency-chained to stall due to inter-thread or inter-process data/event waiting.
  • 18. ThrA ThrB ThrC Imagine that this CPU is transitioning out of a deep sleep state . . .
  • 19. ThrA ThrB ThrC Imagine that this CPU is transitioning out of a deep sleep state . . . Diagram annotations: delayed trigger for ThrA; ThrC runs slower; possibly this CPU has entered a deeper sleep as a result; ThrA runs slower as a result; delayed trigger for ThrB. Cascading Sleep -> Wakeup -> Sleep transients can take time to fade out and cause high peak latencies … even though the impact on average latency gets amortized.
  • 20. Detecting and untangling the causes of these intersections of issues is very challenging, particularly if a high degree of instrumentation, such as tracing or logging, interferes with and distorts the effects. These effects are not easily noticed through lightweight sampling or counting of events.
  • 21. Challenges. Collecting traces: security; collection overheads at CPU, caches, bandwidth to memory and storage/network. Analyzing traces: like searching for a needle in multiple haystacks without knowing if a needle is to be found at all. Scheduling collection and analysis: like figuring out when a crime is going to occur in order to launch crime scene analysis.
  • 22. Correlated Events That Can Be Collected at Low Overhead. Good but circumstantial clues. Indicating power transitions: • Turbostat • Powertop • Runqueue lengths. Indicating cache coherence issues: • Sharp IPC drop with concurrency • No obvious mem/disk data bottleneck • High utilization, low runqueue lengths. • CoreFreq - https://github.com/cyring/CoreFreq
  • 23. Picking up on disruptive events Courtesy: bing.com/images
  • 24. Capturing C-State Transitions. To see what is available: sudo perf list | grep cstate. Example output: cstate_core/c3-residency/, cstate_core/c6-residency/, cstate_core/c7-residency/, cstate_pkg/c2-residency/, …
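Besides the perf residency events, the Linux cpuidle sysfs interface exposes a per-CPU, per-state usage counter that counts how often each idle state was requested. The sketch below assumes that interface is present and turns two snapshots into a transition count over a short window, such as the 250 ms windows mentioned on slide 28. The function names and the root parameter are hypothetical, chosen for illustration.

```python
import glob
import os
import time


def read_cstate_usage(root="/sys/devices/system/cpu"):
    """Return {(cpu, state): entry_count} from cpuidle 'usage' files under root."""
    counts = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", "state[0-9]*", "usage")
    for path in glob.glob(pattern):
        parts = path.split(os.sep)
        cpu, state = parts[-4], parts[-2]  # e.g. "cpu0", "state2"
        with open(path) as f:
            counts[(cpu, state)] = int(f.read().strip())
    return counts


def transitions_between(before, after):
    """Total idle-state entries between two snapshots (keys present in both)."""
    return sum(after[k] - before[k] for k in after if k in before)


if __name__ == "__main__":
    window_s = 0.25  # assumed window length
    first = read_cstate_usage()
    time.sleep(window_s)
    second = read_cstate_usage()
    print("idle-state entries in window:", transitions_between(first, second))
```

Reading a handful of small sysfs files per window is cheap compared with tracing, which is the property the talk asks of a precursor signal.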
  • 25. Timecharting. Commands: sudo perf timechart record, then sudo perf timechart. Monitoring sleep, wait, and run times per CPU. Example charts: 4 threads in intermittent sleeps; 4 threads in variable I/O wait durations.
  • 26. Visualizing scheduling events by time. 1. perf sched timehist (-Mw): migrations and wakeups. Tracks scheduler latency by event, including time to wakeup and latency from wakeup to run (sched delay). 2. perf sched map: per-CPU timeline of processes as they are context switched.
  • 27. Monitoring and controlling P-states. P-states (≡ frequencies): ‒ BIOS controlled ‒ OS controlled via scaling drivers ‒ HW controlled P-states ‒ Turbo. To monitor the P-states: ‒ Turbostat, CoreFreq ‒ Profiling tools (Perf, VTune, etc.). To control P-states: OS-controlled P-states can be manually configured through scaling drivers, using tools such as ‒ Cpupower ‒ CoreFreq (https://github.com/cyring/CoreFreq) to control the performance governor for P-states. Credits: https://images.anandtech.com/doci/9582/43.jpg
  • 28. Flow diagram: Response time R, monitored at application level -> exp-weighted moving window avg and short-range average -> detect upward heave. C-State monitoring and P-State monitoring -> Tn = transitions count over 250 ms windows -> Tn > Threshold. Detect overlap of the two -> snapshot runqlat and timechart activity from the last ‘n’ secs.
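The trigger logic on slide 28 can be sketched as follows: a short-range average of response times is compared with a longer-range exponentially weighted average, and a snapshot is requested only when that upward heave overlaps with a high transition count. The class name, the smoothing factor, the window length, the ratio, and the threshold are all assumptions; the talk does not give concrete values.

```python
from collections import deque


class HeaveDetector:
    """Flags an upward heave when the short-range average of response
    times exceeds `ratio` times the long-range EWMA."""

    def __init__(self, alpha=0.05, short_n=8, ratio=1.5):
        self.alpha = alpha                  # EWMA smoothing factor (assumed)
        self.ratio = ratio                  # heave sensitivity (assumed)
        self.recent = deque(maxlen=short_n)  # short-range window
        self.ewma = None

    def observe(self, response_time):
        """Feed one response time; return True if a heave is detected."""
        self.recent.append(response_time)
        if self.ewma is None:
            self.ewma = response_time
            return False
        short_avg = sum(self.recent) / len(self.recent)
        window_full = len(self.recent) == self.recent.maxlen
        # Compare against the EWMA as it stood before this sample.
        heave = window_full and short_avg > self.ratio * self.ewma
        self.ewma += self.alpha * (response_time - self.ewma)
        return heave


def should_snapshot(heave, transitions, threshold):
    """Overlap test: latency heave AND transition count Tn above threshold."""
    return heave and transitions > threshold


if __name__ == "__main__":
    det = HeaveDetector()
    samples = [1.0] * 20 + [5.0] * 8  # synthetic response times
    for i, r in enumerate(samples):
        if should_snapshot(det.observe(r), transitions=900, threshold=500):
            print("snapshot requested at sample", i)
```

Gating the expensive snapshot on both signals keeps collection overhead low during normal operation, which is the design goal the slide describes.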
  • 29. Clues for False-Sharing (With Low Overhead). Step 1: Establish whether sufficient clues exist to suspect false sharing. (a) Likelihood: higher concurrency + low scaling + higher response times, with low IPC despite low LLC misses/instruction. (b) Sensitivity: insensitive to runq-lengths (not sensitive to CPU subscription). (c) Clues: higher number of coherence misses in L1 and/or L2 (PMC snoop events S2I, M2I); increased inter-socket link utilization in a multi-socket system.
  • 30. Drilling down for concrete evidence of false sharing. Step 2 (conditional upon step 1 indicating possible false sharing): Collect perf c2c profiles identifying data and code addresses producing the contention. perf c2c: sampling-based detection of cachelines where false sharing was likely, based on the HITM event. These are read or write accesses for which a different core’s cache reports a “hit” in “modified” state (HITM). Provides insights into data addresses, code addresses, processes and threads that generate sharing conflicts.
  • 31. What perf c2c profiling looks like: cacheline summary; cacheline access details (look for HITMs).
  • 33. Solution Space. When power management actions are suspected to provoke high tail latencies: 1. Choose less extreme power-performance settings (power-save, energy-efficient, etc.). 2. Explore changes in scheduler tunings, such as: a. Quicker preemption (reducing wakeup -> onproc) b. Smaller time-slices c. Different (usually lower) migration thresholds
  • 34. Solution Space. When false sharing is suspected to provoke high tail latencies: 1. Some data structure layout possibilities: a. Data structure / global variable padding (if possible) b. Changing the affected data structure to better separate (quasi-)immutable from mutable cachelines c. Splitting the data structures in question into sub-structures 2. Possible computation strategy changes: a. Rate-limiting writers to cachelines that are accessed frequently by readers b. Colocating to the same socket or sub-NUMA clusters c. Making code bimodal: normal computation until a monitor signals a rise in coherence events, and one of (2a/2b) after.
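Option 1a on slide 34, padding, can be sketched with the same kind of ctypes layout model. Each frequently written field is given its own cacheline by following it with explicit padding. The struct and field names and the 64-byte line size are hypothetical; a real fix would be made in the application's own language, for example with alignment attributes in C or C++.

```python
import ctypes

CACHELINE = 64  # assumed cacheline size in bytes; verify for the target CPU
WORD = ctypes.sizeof(ctypes.c_uint64)


class PaddedState(ctypes.Structure):
    # Hypothetical padded layout: seed, head, and tail each start
    # on their own cacheline.
    _fields_ = [
        ("seed", ctypes.c_uint64),
        ("_pad0", ctypes.c_char * (CACHELINE - WORD)),
        ("head", ctypes.c_uint64),
        ("_pad1", ctypes.c_char * (CACHELINE - WORD)),
        ("tail", ctypes.c_uint64),
        ("_pad2", ctypes.c_char * (CACHELINE - WORD)),
    ]


def cacheline_index(struct_type, field_name):
    """Cacheline index of a field's first byte, assuming the struct
    itself starts on a cacheline boundary."""
    return getattr(struct_type, field_name).offset // CACHELINE


if __name__ == "__main__":
    for name in ("seed", "head", "tail"):
        print(name, "-> cacheline", cacheline_index(PaddedState, name))
    print("struct size:", ctypes.sizeof(PaddedState), "bytes")
```

The cost is memory: three 8-byte fields now occupy three full cachelines, which is why the slide qualifies padding with "if possible".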
  • 35. Summary. • Latency instrumentation needs to be made as close to real time as possible. • Tracing needs to be combined with sampling over short intervals and triggered by good precursors, so that overhead is kept to a minimum. • We outlined two issues, false sharing and power management transitions, that may not arise frequently but can have measurable effects on tail latencies and can be hard to detect. In this presentation we have shown the role these two components play in application performance, their detectability, and possible solutions.
  • 36. Thank You. Stay in Touch: kshitij.a.doshi@intel.com, harshad.s.sane@intel.com