SlideShare a Scribd company logo
1 of 45
Download to read offline
When the OS
gets in the way
(and what you can do about it)
Mark Price
@epickrram
LMAX Exchange
When the OS
gets in the way
(and what you can do about it)
Linux
Mark Price
@epickrram
LMAX Exchange
● Linux is an excellent general-purpose OS
● Many target platforms
● Scheduling is actually fairly complicated
● Low-latency is a special use-case
● We need to provide some hints
It’s not the OS’s fault
Why should I care?
Useful in some scenarios
● Low latency applications
● Response times < 1ms
● Compute-intensive workloads
● Long-running jobs
A real-world scenario: LMAX
System
Latency = T1 - T0
Before tuning:
250us / 10+ms
After tuning:
80us / <1ms
(mean / max)
Jitter
● “slight irregular movement, variation, or
unsteadiness, especially in an electrical
signal or electronic device”
● Variation in response time latency
● Long-tail in response time
Dealing with it
● First take care of the low-hanging fruit
○ e.g. Garbage collection (gc-free / Zing)
○ e.g. Slow I/O
● Once response times are < 10ms the fun
begins
● Make sure your code is running!
Measure first
● Need to validate changes are good
● End-to-end tests
● Using realistic load
● Change one thing and observe
● A refresher...
Modern hardware layout
Multi-tasking
● num(tasks) > num(HyperThreads)
● OS must share out hardware resources
● Clever? Dumb? Fast? Slow?
● Fair...
Linux CFS
● Completely Fair Scheduler
● Maintains a task ‘queue’ per HT
● Runs the task with the lowest runtime
● Updates task runtime after execution
● Higher priority implies longer execution time
● Tasks are load-balanced across HTs
An example application ...
Threads
… running on a language runtime
… running on an operating system
Optimise for locality - PCI/memory
Target deployment
How do I start?
● BIOS settings for maximum performance
● That’s a whole other talk...
Start with the metal
● lstopo is a useful tool for looking at hardware
● Provided by the hwloc package
● Displays:
○ HyperThreads
○ Physical cores
○ NUMA nodes
○ PCI locality
Discover what’s available
lstopo
lstopo
HyperThread
Core
Caches
NUMA-local
RAM
● Use isolcpus to reserve cpu resource
● kernel boot parameter
● isolcpus=0-5,10-13
● Use taskset to pin your application to cpus:
● taskset -c 10-13 java …
● Set affinity of hot threads:
● sched_setaffinity(...)
Reserve & use specific resource
Deploy the application
sched_setaffinity() !{isolcpus} taskset
You have no load-balancer
Pile-up
A solution: cpusets
● Create hierarchical sets of reserved
resource
● CPU, memory
● Userland tools: cset (SUSE)
Isolate OS processes
● cset set --set=/system --cpu=6-9
○ create a cpuset with cpus 6-9
○ create it at the path /system
● cset proc --move --from-set=/ --to-set=/system
○ move all processes from / to /system
○ -k => move unbound kernel threads
○ --threads => move child threads
○ --force => erm... force
Run the application
● cset set --cpu=0-5,10-13 --set=/app
● cset proc --exec /app taskset -cp 10-13 java …
○ start a process in the /app cpuset
○ run the program on cpus 10-13
● sched_setaffinity() to pin the hot threads to cpus 1,3,5
Isolated threads
/app
sched_setaffinity()
/system
/app
taskset
No more jitter?
● Sampling tracer
● Static/dynamic trace points
● Very low overhead
● A good starting point for digging deeper
● perf list to view available trace points
● network, file-system, scheduler, etc
perf_events
What’s happening CPU?
● perf record -e "sched:sched_switch" -C 3
○ Sample task switches on CPU 3
● perf report (best for multiple events)
● perf script (best for single events)
Rogue process
java
36049 [003] 3011858.780856: sched:sched_switch: java:
36049 [110] R ==> kworker/3:1:13991 [120]
kworker/3:1
13991 [003] 3011858.780861: sched:sched_switch:
kworker/3:1:13991 [120] S ==> java:36049 [110]
ftrace
● Function tracer
● Static/dynamic trace points
● Higher overhead
● But captures everything
● Can provide function graphs
● trace-cmd is the usable front-end
So what is that kernel thread doing?
● trace-cmd record -P <pid> -p function_graph
○ Trace functions called by process <pid>
● trace-cmd report
○ Display captured trace data
Some things can’t be deferred
kworker/3:1-13991 [003] 3013287.180771: funcgraph_entry: | process_one_work() {
kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: | cache_reap() {
kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: 0.137 us | mutex_trylock();
kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: 0.289 us | drain_array();
kworker/3:1-13991 [003] 3013287.180773: funcgraph_entry: 0.040 us | _cond_resched();
………
………
kworker/3:1-13991 [003] 3013287.180859: funcgraph_exit: +86.735 us | }
+86.735 us
Things to look out for
● cache_reap() - SLAB allocator
● vmstat_update() - kernel stats
● other workqueue events
○ perf record -e “workqueue:*” -C 3
● Interrupts - set affinity in /proc/irq
● Timer ticks - tickless mode
● CPU governor - set to performance
○ /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor
Some numbers
● Inter-thread latency is a good proxy
● 2 busy-spinning threads passing a message
● Time taken between producer & consumer
● Record times over several seconds
● Compare tuned/untuned
Results
== Latency (ns) ==
mean
min
50.00%
90.00%
99.00%
99.90%
99.99%
max
untuned
466
200
464
608
768
992
2432
69632
tuned
216
128
208
288
336
544
1664
69632
tuned vs untuned
tuned vs untuned (log scale)
Results (loaded system)
== Latency (ns) ==
mean
min
50.00%
90.00%
99.00%
99.90%
99.99%
max
untuned
545
144
464
544
736
2944
294913
884739
tuned
332
216
336
352
448
544
704
36864
tuned vs untuned (loaded system)
Summary
● Select threads that need access to CPU
● Isolate CPUs from the OS
● Pin important threads to isolated CPUs
● Don’t forget interrupts
● There will be more things…
● Always test assumptions!
● Run validation tests to ensure tunings are as
expected
Thank you
● lmax.com/blog/staff-blogs/
● epickrram.blogspot.com
● github.com/epickrram/perf-workshop
● @epickrram

More Related Content

What's hot

Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Tier1 App
 

What's hot (20)

LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)
 
Kgdb kdb modesetting
Kgdb kdb modesettingKgdb kdb modesetting
Kgdb kdb modesetting
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
 
Interruption Timer Périodique
Interruption Timer PériodiqueInterruption Timer Périodique
Interruption Timer Périodique
 
Le guide de dépannage de la jvm
Le guide de dépannage de la jvmLe guide de dépannage de la jvm
Le guide de dépannage de la jvm
 
RCU
RCURCU
RCU
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
BKK16-104 sched-freq
BKK16-104 sched-freqBKK16-104 sched-freq
BKK16-104 sched-freq
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profit
 
Kernel Recipes 2018 - New GPIO interface for linux user space - Bartosz Golas...
Kernel Recipes 2018 - New GPIO interface for linux user space - Bartosz Golas...Kernel Recipes 2018 - New GPIO interface for linux user space - Bartosz Golas...
Kernel Recipes 2018 - New GPIO interface for linux user space - Bartosz Golas...
 
Linux Troubleshooting
Linux TroubleshootingLinux Troubleshooting
Linux Troubleshooting
 
Process scheduling linux
Process scheduling linuxProcess scheduling linux
Process scheduling linux
 
Smarter Scheduling
Smarter SchedulingSmarter Scheduling
Smarter Scheduling
 
Kernel Recipes 2018 - KernelShark 1.0; What's new and what's coming - Steven ...
Kernel Recipes 2018 - KernelShark 1.0; What's new and what's coming - Steven ...Kernel Recipes 2018 - KernelShark 1.0; What's new and what's coming - Steven ...
Kernel Recipes 2018 - KernelShark 1.0; What's new and what's coming - Steven ...
 
Ganglia Monitoring Tool
Ganglia Monitoring ToolGanglia Monitoring Tool
Ganglia Monitoring Tool
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
 
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
 
Metrics with Ganglia
Metrics with GangliaMetrics with Ganglia
Metrics with Ganglia
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
test1_361
test1_361test1_361
test1_361
 

Viewers also liked

Extent3 turquoise equity_trading_2012
Extent3 turquoise equity_trading_2012Extent3 turquoise equity_trading_2012
Extent3 turquoise equity_trading_2012
extentconf Tsoy
 
Extent3 exactpro testing_of_hft_gui
Extent3 exactpro testing_of_hft_guiExtent3 exactpro testing_of_hft_gui
Extent3 exactpro testing_of_hft_gui
extentconf Tsoy
 
Levchenko Andrey
Levchenko AndreyLevchenko Andrey
Levchenko Andrey
Ekaterina Vashurina
 

Viewers also liked (19)

Performance: Observe and Tune
Performance: Observe and TunePerformance: Observe and Tune
Performance: Observe and Tune
 
Tuned
TunedTuned
Tuned
 
FPGA Applications in Finance
FPGA Applications in FinanceFPGA Applications in Finance
FPGA Applications in Finance
 
TMPA-2015: FPGA-Based Low Latency Sponsored Access
TMPA-2015: FPGA-Based Low Latency Sponsored AccessTMPA-2015: FPGA-Based Low Latency Sponsored Access
TMPA-2015: FPGA-Based Low Latency Sponsored Access
 
Writing and testing high frequency trading engines in java
Writing and testing high frequency trading engines in javaWriting and testing high frequency trading engines in java
Writing and testing high frequency trading engines in java
 
Extent3 turquoise equity_trading_2012
Extent3 turquoise equity_trading_2012Extent3 turquoise equity_trading_2012
Extent3 turquoise equity_trading_2012
 
Extent3 exactpro testing_of_hft_gui
Extent3 exactpro testing_of_hft_guiExtent3 exactpro testing_of_hft_gui
Extent3 exactpro testing_of_hft_gui
 
Work4 23
Work4 23Work4 23
Work4 23
 
Premiare i dipendenti: i nostri 5 suggerimenti
Premiare i dipendenti: i nostri 5 suggerimentiPremiare i dipendenti: i nostri 5 suggerimenti
Premiare i dipendenti: i nostri 5 suggerimenti
 
Come (e perché) le HR devono sviluppare la loro resilienza...
Come (e perché) le HR devono sviluppare la loro resilienza...Come (e perché) le HR devono sviluppare la loro resilienza...
Come (e perché) le HR devono sviluppare la loro resilienza...
 
Esclarecer o-habitus
Esclarecer o-habitusEsclarecer o-habitus
Esclarecer o-habitus
 
Levchenko Andrey
Levchenko AndreyLevchenko Andrey
Levchenko Andrey
 
great
greatgreat
great
 
Madhusmita pati
Madhusmita patiMadhusmita pati
Madhusmita pati
 
Presupuesto publico
Presupuesto publicoPresupuesto publico
Presupuesto publico
 
C3 new
C3 newC3 new
C3 new
 
Three NJ High Schools Roll Out New CTEP Marketing Course to Prepare Students ...
Three NJ High Schools Roll Out New CTEP Marketing Course to Prepare Students ...Three NJ High Schools Roll Out New CTEP Marketing Course to Prepare Students ...
Three NJ High Schools Roll Out New CTEP Marketing Course to Prepare Students ...
 
Silabus mtk xii
Silabus mtk xiiSilabus mtk xii
Silabus mtk xii
 
Ayesha tanwir ppt
Ayesha tanwir pptAyesha tanwir ppt
Ayesha tanwir ppt
 

Similar to When the OS gets in the way

BUD17-309: IRQ prediction
BUD17-309: IRQ prediction BUD17-309: IRQ prediction
BUD17-309: IRQ prediction
Linaro
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency Investigations
ScyllaDB
 

Similar to When the OS gets in the way (20)

BUD17-309: IRQ prediction
BUD17-309: IRQ prediction BUD17-309: IRQ prediction
BUD17-309: IRQ prediction
 
Optimizing Linux Servers
Optimizing Linux ServersOptimizing Linux Servers
Optimizing Linux Servers
 
OS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switchOS scheduling and The anatomy of a context switch
OS scheduling and The anatomy of a context switch
 
HKG15-409: ARM Hibernation enablement on SoCs - a case study
HKG15-409: ARM Hibernation enablement on SoCs - a case studyHKG15-409: ARM Hibernation enablement on SoCs - a case study
HKG15-409: ARM Hibernation enablement on SoCs - a case study
 
UNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsUNIT 3 - General Purpose Processors
UNIT 3 - General Purpose Processors
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptx
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptxEmbedded_ PPT_4-5 unit_Dr Monika-edited.pptx
Embedded_ PPT_4-5 unit_Dr Monika-edited.pptx
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency Investigations
 
PERFORMANCE_SCHEMA and sys schema
PERFORMANCE_SCHEMA and sys schemaPERFORMANCE_SCHEMA and sys schema
PERFORMANCE_SCHEMA and sys schema
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloud
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)
 
Analyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE MethodAnalyzing OS X Systems Performance with the USE Method
Analyzing OS X Systems Performance with the USE Method
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
 
Lec 12-15 mips instruction set processor
Lec 12-15 mips instruction set processorLec 12-15 mips instruction set processor
Lec 12-15 mips instruction set processor
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementation
 
VMworld 2016: vSphere 6.x Host Resource Deep Dive
VMworld 2016: vSphere 6.x Host Resource Deep DiveVMworld 2016: vSphere 6.x Host Resource Deep Dive
VMworld 2016: vSphere 6.x Host Resource Deep Dive
 
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel FernandesKernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
 
Operating Systems: Revision
Operating Systems: RevisionOperating Systems: Revision
Operating Systems: Revision
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

When the OS gets in the way

  • 1. When the OS gets in the way (and what you can do about it) Mark Price @epickrram LMAX Exchange
  • 2. When the OS gets in the way (and what you can do about it) Linux Mark Price @epickrram LMAX Exchange
  • 3. ● Linux is an excellent general-purpose OS ● Many target platforms ● Scheduling is actually fairly complicated ● Low-latency is a special use-case ● We need to provide some hints It’s not the OS’s fault
  • 4. Why should I care?
  • 5. Useful in some scenarios ● Low latency applications ● Response times < 1ms ● Compute-intensive workloads ● Long-running jobs
  • 6. A real-world scenario: LMAX System Latency = T1 - T0 Before tuning: 250us / 10+ms After tuning: 80us / <1ms (mean / max)
  • 7. Jitter ● “slight irregular movement, variation, or unsteadiness, especially in an electrical signal or electronic device” ● Variation in response time latency ● Long-tail in response time
  • 8. Dealing with it ● First take care of the low-hanging fruit ○ e.g. Garbage collection (gc-free / Zing) ○ e.g. Slow I/O ● Once response times are < 10ms the fun begins ● Make sure your code is running!
  • 9. Measure first ● Need to validate changes are good ● End-to-end tests ● Using realistic load ● Change one thing and observe ● A refresher...
  • 11. Multi-tasking ● num(tasks) > num(HyperThreads) ● OS must share out hardware resources ● Clever? Dumb? Fast? Slow? ● Fair...
  • 12. Linux CFS ● Completely Fair Scheduler ● Maintains a task ‘queue’ per HT ● Runs the task with the lowest runtime ● Updates task runtime after execution ● Higher priority implies longer execution time ● Tasks are load-balanced across HTs
  • 13. An example application ... Threads
  • 14. … running on a language runtime
  • 15. … running on an operating system
  • 16. Optimise for locality - PCI/memory
  • 18. How do I start?
  • 19. ● BIOS settings for maximum performance ● That’s a whole other talk... Start with the metal
  • 20. ● lstopo is a useful tool for looking at hardware ● Provided by the hwloc package ● Displays: ○ HyperThreads ○ Physical cores ○ NUMA nodes ○ PCI locality Discover what’s available
  • 23. ● Use isolcpus to reserve cpu resource ● kernel boot parameter ● isolcpus=0-5,10-13 ● Use taskset to pin your application to cpus: ● taskset -c 10-13 java … ● Set affinity of hot threads: ● sched_setaffinity(...) Reserve & use specific resource
  • 25. You have no load-balancer Pile-up
  • 26. A solution: cpusets ● Create hierarchical sets of reserved resource ● CPU, memory ● Userland tools: cset (SUSE)
  • 27. Isolate OS processes ● cset set --set=/system --cpu=6-9 ○ create a cpuset with cpus 6-9 ○ create it at the path /system ● cset proc --move --from-set=/ --to-set=/system ○ move all processes from / to /system ○ -k => move unbound kernel threads ○ --threads => move child threads ○ --force => erm... force
  • 28. Run the application ● cset set --cpu=0-5,10-13 --set=/app ● cset proc --exec /app taskset -cp 10-13 java … ○ start a process in the /app cpuset ○ run the program on cpus 10-13 ● sched_setaffinity() to pin the hot threads to cpus 1,3,5
  • 31. ● Sampling tracer ● Static/dynamic trace points ● Very low overhead ● A good starting point for digging deeper ● perf list to view available trace points ● network, file-system, scheduler, etc perf_events
  • 32. What’s happening CPU? ● perf record -e "sched:sched_switch" -C 3 ○ Sample task switches on CPU 3 ● perf report (best for multiple events) ● perf script (best for single events)
  • 33. Rogue process java 36049 [003] 3011858.780856: sched:sched_switch: java: 36049 [110] R ==> kworker/3:1:13991 [120] kworker/3:1 13991 [003] 3011858.780861: sched:sched_switch: kworker/3:1:13991 [120] S ==> java:36049 [110]
  • 34. ftrace ● Function tracer ● Static/dynamic trace points ● Higher overhead ● But captures everything ● Can provide function graphs ● trace-cmd is the usable front-end
  • 35. So what is that kernel thread doing? ● trace-cmd record -P <pid> -p function_graph ○ Trace functions called by process <pid> ● trace-cmd report ○ Display captured trace data
  • 36. Some things can’t be deferred kworker/3:1-13991 [003] 3013287.180771: funcgraph_entry: | process_one_work() { kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: | cache_reap() { kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: 0.137 us | mutex_trylock(); kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: 0.289 us | drain_array(); kworker/3:1-13991 [003] 3013287.180773: funcgraph_entry: 0.040 us | _cond_resched(); ……… ……… kworker/3:1-13991 [003] 3013287.180859: funcgraph_exit: +86.735 us | } +86.735 us
  • 37. Things to look out for ● cache_reap() - SLAB allocator ● vmstat_update() - kernel stats ● other workqueue events ○ perf record -e “workqueue:*” -C 3 ● Interrupts - set affinity in /proc/irq ● Timer ticks - tickless mode ● CPU governor - set to performance ○ /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor
  • 38. Some numbers ● Inter-thread latency is a good proxy ● 2 busy-spinning threads passing a message ● Time taken between producer & consumer ● Record times over several seconds ● Compare tuned/untuned
  • 39. Results == Latency (ns) == mean min 50.00% 90.00% 99.00% 99.90% 99.99% max untuned 466 200 464 608 768 992 2432 69632 tuned 216 128 208 288 336 544 1664 69632
  • 41. tuned vs untuned (log scale)
  • 42. Results (loaded system) == Latency (ns) == mean min 50.00% 90.00% 99.00% 99.90% 99.99% max untuned 545 144 464 544 736 2944 294913 884739 tuned 332 216 336 352 448 544 704 36864
  • 43. tuned vs untuned (loaded system)
  • 44. Summary ● Select threads that need access to CPU ● Isolate CPUs from the OS ● Pin important threads to isolated CPUs ● Don’t forget interrupts ● There will be more things… ● Always test assumptions! ● Run validation tests to ensure tunings are as expected
  • 45. Thank you ● lmax.com/blog/staff-blogs/ ● epickrram.blogspot.com ● github.com/epickrram/perf-workshop ● @epickrram