Multi-threaded Performance Pitfalls

                   Ciaran McHale




CiaranMcHale.com
License
Copyright © 2008 Ciaran McHale.
Permission is hereby granted, free of charge, to any person obtaining a copy of this
training course and associated documentation files (the “Training Course”), to deal in
the Training Course without restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Training
Course, and to permit persons to whom the Training Course is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies
or substantial portions of the Training Course.
THE TRAINING COURSE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE
AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE TRAINING COURSE
OR THE USE OR OTHER DEALINGS IN THE TRAINING COURSE.



Purpose of this presentation
•   Some issues in multi-threading are counter-intuitive

•   Ignorance of these issues can result in poor performance
    -   Performance can actually get worse when you add more CPUs

•   This presentation explains the counter-intuitive issues




1. A case study




Architectural diagram

[Figure: web browser -> load-balancing router -> J2EE App Server1,
 J2EE App Server2, ..., J2EE App Server6 -> CORBA C++ server on
 8-CPU Solaris box -> DB]

Architectural notes
•   The customer felt J2EE was slower than CORBA/C++

•   So, the architecture had:
    -   Multiple J2EE App Servers acting as clients to…
    -   Just one CORBA/C++ server that ran on an 8-CPU Solaris box

•   The customer assumed the CORBA/C++ server “should be
    able to cope with the load”




Strange problems were observed
•   Throughput of the CORBA server decreased as the number of
    CPUs increased
    -   It ran fastest on 1 CPU
    -   It ran slower but “fast enough” with moderate load on 4 CPUs
        (development machines)
    -   It ran very slowly on 8 CPUs (production machine)

•   The CORBA server ran faster if a thread pool limit was
    imposed

•   Under a high load in production:
    -   Most requests were processed in < 0.3 seconds
    -   But some took up to a minute to be processed
    -   A few took up to 30 minutes to be processed

•   This is not what you hope to see

2. Analysis of the problems




What went wrong?
•   Investigation showed that scalability problems were caused by
    a combination of:
    -   Cache consistency in multi-CPU machines

    -   Unfair mutex wakeup semantics

•   These issues are discussed in the following slides

•   Another issue contributed (slightly) to scalability problems:
    -   Bottlenecks in application code
    -   A discussion of this is outside the scope of this presentation




Cache consistency
•   RAM access is much slower than the CPU
    -   Solution: high-speed cache memory sits between CPU and RAM

•   Cache memory works well:
    -   In a single-CPU machine
    -   In a multi-CPU machine if the threads of a process are “bound” to a
        CPU

•   Cache memory can backfire if the threads in a program are
    spread over all the CPUs:
    -   Each CPU has a separate cache
    -   Cache consistency protocols require cache flushes to RAM
        (the cache consistency protocol is driven by calls to lock() and
        unlock())




Cache consistency (cont’d)
•   Overhead of cache consistency protocols worsens as:
    -   Overhead of a cache synchronization increases
        (this increases as the number of CPUs increases)

    -   Frequency of cache synchronization increases
        (this increases with the rate of mutex lock() and unlock() calls)

•   Lessons:
    -   Increasing the number of CPUs can decrease the performance of a server
    -   Work around this by:
         - Having multiple server processes instead of just one
         - Binding each process to a CPU (avoids the need for cache
           synchronization)
    -   Try to minimize the need for mutex lock() and unlock() in the application
         - Note: malloc()/free() and new/delete use a mutex
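
One way to act on the last lesson is to lock less often. The following is a minimal sketch, not code from the case study: each thread accumulates into a local variable and takes the mutex once to merge its subtotal, instead of calling lock() and unlock() on every iteration. The names sum_with_few_locks, num_threads and items_per_thread are hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: each thread counts items_per_thread items.
// The shared total is protected by a mutex, but each thread locks it
// only once (to merge its local subtotal) rather than once per item.
long sum_with_few_locks(int num_threads, int items_per_thread)
{
    long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> threads;

    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&total, &total_mutex, items_per_thread] {
            long local = 0;                  // private to this thread: no lock needed
            for (int i = 0; i < items_per_thread; ++i) {
                local += 1;
            }
            std::lock_guard<std::mutex> guard(total_mutex);
            total += local;                  // one lock()/unlock() pair per thread
        });
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return total;
}
```

With 4 threads and 1000 items each, this version performs 4 lock()/unlock() pairs where a lock-per-item loop would perform 4000.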


Unfair mutex wakeup semantics
•   A mutex does not guarantee First In First Out (FIFO) wakeup
    semantics
    -   To do so would prevent two important optimizations
        (discussed on the following slides)

•   Instead, a mutex provides:
    -   Unfair wakeup semantics
         - Can cause temporary starvation of a thread
         - But is guaranteed to avoid infinite starvation
    -   High-speed lock() and unlock()




Unfair mutex wakeup semantics (cont’d)
•   Why does a mutex not provide fair wakeup semantics?

•   Because most of the time, speed matters more than fairness
    -   When FIFO wakeup semantics are required, developers can write a
        FIFOMutex class and take a performance hit




Mutex optimization 1
•   Pseudo-code:
        void lock()
        {
            // About 98% of the time, queue the waiting thread at the head
            if ((rand() % 100) < 98) {
                add thread to head of list; // LIFO wakeup
            } else {
                add thread to tail of list; // FIFO wakeup
            }
        }

•   Notes:
    -   Last In First Out (LIFO) wakeup increases likelihood of cache hits for
        the woken-up thread (avoids expense of cache misses)
    -   Occasionally putting a thread at the tail of the queue prevents infinite
        starvation
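
The queueing policy in the pseudo-code can be modelled in a few lines. This is a sketch under stated assumptions rather than a real mutex implementation: threads are represented by integer ids, the front of a std::deque is the next thread to be woken, and the hypothetical parameter roll stands in for rand() % 100 so that the behaviour is reproducible.

```cpp
#include <deque>

// Hypothetical model of the wait queue described above.
// The thread at the front of the queue is the next one to be woken.
void enqueue_waiter(std::deque<int>& wait_queue, int thread_id, int roll)
{
    if (roll < 98) {
        wait_queue.push_front(thread_id);   // LIFO: woken before earlier waiters
    } else {
        wait_queue.push_back(thread_id);    // FIFO: woken after earlier waiters
    }
}

// Remove and return the id of the next thread to wake.
// Precondition: the queue is not empty.
int wake_next(std::deque<int>& wait_queue)
{
    int thread_id = wait_queue.front();
    wait_queue.pop_front();
    return thread_id;
}
```

If thread 1 and then thread 2 are queued with a roll below 98, thread 2 is woken first even though thread 1 has waited longer. This is the temporary starvation that the earlier slides describe.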




Mutex optimization 2
•   Assume several threads concurrently execute the following
    code:
        for (i = 0; i < 1000; i++) {
            lock(a_mutex);
            process(data[i]);
            unlock(a_mutex);
        }

•   A thread context switch is (relatively) expensive
    -   Context switching on every unlock() would add a lot of overhead

•   Solution (this is an unfair optimization):
    -   Defer context switches until the end of the current thread’s time slice
    -   Current thread can repeatedly lock() and unlock() mutex in a
        single time slice
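
To observe this behaviour, the loop above can be run in several threads while counting how often the mutex changes hands. The following is a minimal sketch using std::mutex; LoopStats, run_locked_loops and the handoffs counter are hypothetical names, and a simple counter increment stands in for process(data[i]).

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct LoopStats {
    long processed;   // total iterations completed by all threads
    long handoffs;    // acquisitions by a different thread than the previous one
};

// Runs the 1000-iteration loop from the slide in num_threads threads.
LoopStats run_locked_loops(int num_threads)
{
    std::mutex a_mutex;
    LoopStats stats{0, 0};
    int last_owner = -1;              // id of the thread that last held the mutex
    std::vector<std::thread> threads;

    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&a_mutex, &stats, &last_owner, t] {
            for (int i = 0; i < 1000; i++) {
                a_mutex.lock();
                if (last_owner != t) {    // the mutex changed hands
                    ++stats.handoffs;
                    last_owner = t;
                }
                ++stats.processed;        // stands in for process(data[i])
                a_mutex.unlock();
            }
        });
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return stats;
}
```

The value of handoffs varies from run to run and between platforms. When it is much smaller than processed, each thread is re-acquiring the mutex many times in a row before another thread gets a turn.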


3. Improving throughput




Improving throughput
•   A 20x increase in throughput was obtained by a combination of:
    -   Limiting the size of the CORBA server’s thread pool
         - This decreased the maximum length of the mutex wakeup queue
         - Which decreased the maximum wakeup time

    -   Using several server processes (each with a small thread pool)
        rather than one server process (with a very large thread pool)

    -   Binding each server process to one CPU
         - This avoided the overhead of cache consistency
         - Binding was achieved with the pbind command on Solaris
         - Windows has an equivalent of process binding:
            - Use the SetProcessAffinityMask() API function
            - Or, in Task Manager, right-click on a process and choose the
              menu option
              (this menu option is visible only if you have a multi-CPU machine)
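
A process can also bind itself from code. The slide covers Solaris (pbind) and Windows (SetProcessAffinityMask()); the sketch below instead assumes Linux with glibc, where sched_setaffinity() plays a similar role. That platform choice and the function name bind_self_to_one_cpu are assumptions made for illustration and are not part of the case study.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)

// Bind the calling thread to the first CPU it is currently allowed to run on.
// Threads created afterwards inherit the same CPU mask, so calling this at
// the start of main() effectively binds the whole process.
// Returns the CPU number on success, or -1 on failure.
int bind_self_to_one_cpu()
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // A pid of 0 refers to the calling thread.
            if (sched_setaffinity(0, sizeof(one), &one) != 0) {
                return -1;
            }
            return cpu;
        }
    }
    return -1;
}
```

To spread several server processes over the CPUs, as in the case study, each process would need to pick a different CPU rather than the first allowed one; how the CPUs are assigned is left open here.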

4. Finishing up




Recap: architectural diagram

[Figure: web browser -> load-balancing router -> J2EE App Server1,
 J2EE App Server2, ..., J2EE App Server6 -> CORBA C++ server on
 8-CPU Solaris box -> DB]

The case study is not an isolated incident
•   The project’s high-level architecture is quite common:
    -   Multi-threaded clients communicate with a multi-threaded server
    -   Server process is not “bound” to a single CPU
    -   Server’s thread pool size is unlimited
        (this is the default case in many middleware products)

•   It is likely that many projects have similar scalability problems:
    -   But the system load is not high enough (yet) to trigger problems

•   Problems are not specific to CORBA
    -   They are independent of your choice of middleware technology

•   Multi-core CPUs are becoming more common
    -   So, expect to see these scalability issues occurring more frequently




Summary: important things to remember
•   Recognize danger signs:
    -   Performance drops as number of CPUs increases
    -   Wide variation in response times with a high number of threads

•   Good advice for multi-threaded servers:
    -   Put a limit on the size of a server’s thread pool
    -   Have several server processes with a small number of threads instead
        of one process with many threads
    -   Bind each server process to a CPU




•   Acknowledgements:
    -   Ciaran McHale’s employer, IONA Technologies (www.iona.com)
        generously gave permission for this presentation to be released under
        the stated open-source license.

Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 

Multi-threaded Performance Pitfalls

  • 1. Multi-threaded Performance Pitfalls
      Ciaran McHale, CiaranMcHale.com
  • 2. License
      Copyright © 2008 Ciaran McHale.
      Permission is hereby granted, free of charge, to any person obtaining a copy of this training course and associated documentation files (the "Training Course"), to deal in the Training Course without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Training Course, and to permit persons to whom the Training Course is furnished to do so, subject to the following conditions:
      The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Training Course.
      THE TRAINING COURSE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE TRAINING COURSE OR THE USE OR OTHER DEALINGS IN THE TRAINING COURSE.
  • 3. Purpose of this presentation
      - Some issues in multi-threading are counter-intuitive
      - Ignorance of these issues can result in poor performance
          - Performance can actually get worse when you add more CPUs
      - This presentation explains the counter-intuitive issues
  • 4. 1. A case study
  • 5. Architectural diagram
      [Diagram: web browser -> load balancing router -> J2EE App Server1, J2EE App Server2, ..., J2EE App Server6 -> CORBA C++ server on 8-CPU Solaris box -> DB]
  • 6. Architectural notes
      - The customer felt J2EE was slower than CORBA/C++
      - So, the architecture had:
          - Multiple J2EE App Servers acting as clients to…
          - Just one CORBA/C++ server that ran on an 8-CPU Solaris box
      - The customer assumed the CORBA/C++ server "should be able to cope with the load"
  • 7. Strange problems were observed
      - Throughput of the CORBA server decreased as the number of CPUs increased
          - It ran fastest on 1 CPU
          - It ran slower but "fast enough" with moderate load on 4 CPUs (development machines)
          - It ran very slowly on 8 CPUs (production machine)
      - The CORBA server ran faster if a thread pool limit was imposed
      - Under a high load in production:
          - Most requests were processed in < 0.3 second
          - But some took up to a minute to be processed
          - A few took up to 30 minutes to be processed
      - This is not what you hope to see
  • 8. 2. Analysis of the problems
  • 9. What went wrong?
      - Investigation showed that scalability problems were caused by a combination of:
          - Cache consistency in multi-CPU machines
          - Unfair mutex wakeup semantics
      - These issues are discussed in the following slides
      - Another issue contributed (slightly) to scalability problems:
          - Bottlenecks in application code
          - A discussion of this is outside the scope of this presentation
  • 10. Cache consistency
      - RAM access is much slower than the speed of the CPU
          - Solution: high-speed cache memory sits between CPU and RAM
      - Cache memory works great:
          - In a single-CPU machine
          - In a multi-CPU machine if the threads of a process are "bound" to a CPU
      - Cache memory can backfire if the threads in a program are spread over all the CPUs:
          - Each CPU has a separate cache
          - Cache consistency protocols require cache flushes to RAM (the cache consistency protocol is driven by calls to lock() and unlock())
  • 11. Cache consistency (cont'd)
      - Overhead of cache consistency protocols worsens as:
          - Overhead of a cache synchronization increases (this increases as the number of CPUs increases)
          - Frequency of cache synchronization increases (this increases with the rate of mutex lock() and unlock() calls)
      - Lessons:
          - Increasing the number of CPUs can decrease performance of a server
          - Work around this by:
              - Having multiple server processes instead of just one
              - Binding each process to a CPU (avoids need for cache synchronization)
          - Try to minimize need for mutex lock() and unlock() in application
              - Note: malloc()/free(), and new/delete use a mutex
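The advice to minimize lock() and unlock() calls can be illustrated with a small sketch. It is not taken from the presentation: the function name `parallel_sum` and the use of C++ standard-library threads are assumptions made for illustration. Each thread accumulates into a private variable and takes the shared mutex once, rather than once per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: sum a vector using several threads.
// Each thread works on its own slice and locks the shared mutex
// only once, to publish its partial result.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    long long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &total, &total_mutex, begin, end] {
            long long local = 0;  // private to this thread, no lock needed
            for (std::size_t i = begin; i < end; ++i) {
                local += data[i];
            }
            std::lock_guard<std::mutex> guard(total_mutex);  // one lock per thread
            total += local;
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total;
}
```

With this structure the number of mutex operations depends on the number of threads, not on the amount of data, so the rate of lock()/unlock() calls stays low as the workload grows.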
  • 12. Unfair mutex wakeup semantics
      - A mutex does not guarantee First In First Out (FIFO) wakeup semantics
          - To do so would prevent two important optimizations (discussed on the following slides)
      - Instead, a mutex provides:
          - Unfair wakeup semantics
              - Can cause temporary starvation of a thread
              - But guarantees to avoid infinite starvation
          - High speed lock() and unlock()
  • 13. Unfair mutex wakeup semantics (cont'd)
      - Why does a mutex not provide fair wakeup semantics?
      - Because most of the time, speed matters more than fairness
          - When FIFO wakeup semantics are required, developers can write a FIFOMutex class and take a performance hit
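One possible shape for the FIFOMutex class mentioned above is a ticket lock. The sketch below is an assumption, not the author's implementation: it is built on `std::mutex` and `std::condition_variable`, and the helper `count_with_fifo_mutex` exists only to exercise it. Threads are served in the order in which they took a ticket.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical FIFOMutex: each caller of lock() takes a ticket and
// waits until that ticket is being served.
class FIFOMutex {
public:
    void lock() {
        std::unique_lock<std::mutex> guard(state_mutex_);
        const unsigned long my_ticket = next_ticket_++;
        // Sleep until every earlier ticket holder has called unlock().
        turn_changed_.wait(guard, [&] { return now_serving_ == my_ticket; });
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> guard(state_mutex_);
            ++now_serving_;
        }
        // Every waiter wakes up, but only the next ticket holder proceeds.
        turn_changed_.notify_all();
    }

private:
    std::mutex state_mutex_;
    std::condition_variable turn_changed_;
    unsigned long next_ticket_ = 0;
    unsigned long now_serving_ = 0;
};

// Usage example: several threads increment a counter under the FIFOMutex.
long count_with_fifo_mutex(int num_threads, int iterations) {
    FIFOMutex mutex;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<FIFOMutex> guard(mutex);
                ++counter;
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return counter;
}
```

The `notify_all()` call wakes every waiting thread on each unlock, and all but one go back to sleep. That wasted work is one form the performance hit can take.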
  • 14. Mutex optimization 1
      - Pseudo-code:
            void lock() {
                if ((rand() % 100) < 98) {
                    add thread to head of list; // LIFO wakeup
                } else {
                    add thread to tail of list; // FIFO wakeup
                }
            }
      - Notes:
          - Last In First Out (LIFO) wakeup increases likelihood of cache hits for the woken-up thread (avoids expense of cache misses)
          - Occasionally putting a thread at the tail of the queue prevents infinite starvation
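The wait-list policy in the pseudo-code can be modelled with a double-ended queue. This is a simplified illustration only: the names `enqueue_waiter` and `wake_next` are hypothetical, thread ids stand in for real threads, and the random roll is passed in as a parameter so the behaviour is easy to follow. A real mutex implementation would do this inside the threading library or kernel.

```cpp
#include <deque>

// Hypothetical model of the wait list. `roll` is a number in [0, 100),
// standing in for rand() % 100 in the pseudo-code.
void enqueue_waiter(std::deque<int>& waiters, int thread_id, int roll) {
    if (roll < 98) {
        waiters.push_front(thread_id);  // head of list: LIFO wakeup
    } else {
        waiters.push_back(thread_id);   // tail of list: FIFO wakeup
    }
}

// The thread at the head of the list is woken next.
// Precondition: `waiters` is not empty.
int wake_next(std::deque<int>& waiters) {
    const int thread_id = waiters.front();
    waiters.pop_front();
    return thread_id;
}
```

A thread placed at the head is usually the one that ran most recently, so its data is more likely to still be in a CPU cache when it is woken.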
  • 15. Mutex optimization 2
      - Assume several threads concurrently execute the following code:
            for (i = 0; i < 1000; i++) {
                lock(a_mutex);
                process(data[i]);
                unlock(a_mutex);
            }
      - A thread context switch is (relatively) expensive
          - Context switching on every unlock() would add a lot of overhead
      - Solution (this is an unfair optimization):
          - Defer context switches until the end of the current thread's time slice
          - Current thread can repeatedly lock() and unlock() mutex in a single time slice
  • 17. Improving throughput
      - 20X increase in throughput was obtained by a combination of:
          - Limiting size of the CORBA server's thread pool
              - This decreased the maximum length of the mutex wakeup queue
              - Which decreased the maximum wakeup time
          - Using several server processes (each with a small thread pool) rather than one server process (with a very large thread pool)
          - Binding each server process to one CPU
              - This avoided the overhead of cache consistency
              - Binding was achieved with the pbind command on Solaris
      - Windows has an equivalent of process binding:
          - Use the SetProcessAffinityMask() system call
          - Or, in Task Manager, right click on a process and choose the menu option (this menu option is visible only if you have a multi-CPU machine)
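Binding can also be requested from inside a program. The sketch below is an illustration under stated assumptions: the Windows branch uses the SetProcessAffinityMask() call named above, while the Linux branch uses `sched_setaffinity()`, which is not mentioned in the presentation and is shown only as an analogous call. The function names `bind_process_to_cpu` and `first_allowed_cpu` are hypothetical.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros on Linux
#endif
#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <sched.h>
#endif

// Returns the index of the first CPU this process may run on, or -1.
int first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
#else
    return 0;  // assumption: CPU 0 is usable on other platforms
#endif
}

// Binds the caller to one CPU (zero-based index). Returns true on success.
bool bind_process_to_cpu(int cpu) {
#if defined(_WIN32)
    const DWORD_PTR mask = static_cast<DWORD_PTR>(1) << cpu;
    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means the calling thread; threads created later inherit it.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)cpu;
    return false;  // elsewhere, use the platform's own tool (e.g. pbind on Solaris)
#endif
}
```

On Linux the call applies to the calling thread, so it would need to run early in main(), before any worker threads are started, for the whole process to end up on one CPU.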
  • 19. Recap: architectural diagram
      [Diagram: web browser -> load balancing router -> J2EE App Server1, J2EE App Server2, ..., J2EE App Server6 -> CORBA C++ server on 8-CPU Solaris box -> DB]
  • 20. The case study is not an isolated incident
      - The project's high-level architecture is quite common:
          - Multi-threaded clients communicate with a multi-threaded server
          - Server process is not "bound" to a single CPU
          - Server's thread pool size is unlimited (this is the default case in many middleware products)
      - Likely that many projects have similar scalability problems:
          - But the system load is not high enough (yet) to trigger problems
      - Problems are not specific to CORBA
          - They are independent of your choice of middleware technology
      - Multi-core CPUs are becoming more common
          - So, expect to see these scalability issues occurring more frequently
  • 21. Summary: important things to remember
      - Recognize danger signs:
          - Performance drops as number of CPUs increases
          - Wide variation in response times with a high number of threads
      - Good advice for multi-threaded servers:
          - Put a limit on the size of a server's thread pool
          - Have several server processes with a small number of threads instead of one process with many threads
          - Bind each server process to a CPU
      - Acknowledgements:
          - Ciaran McHale's employer, IONA Technologies (www.iona.com), generously gave permission for this presentation to be released under the stated open-source license.
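The first piece of advice, limiting the size of the thread pool, can be sketched as follows. This is a minimal illustration and not the thread pool of any CORBA product: the class name `FixedThreadPool` and the helper `run_tasks_on_pool` are assumptions. The number of worker threads is fixed at construction, so the number of threads that can queue up on any mutex is bounded.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool: tasks wait in a queue instead of
// each request getting its own thread.
class FixedThreadPool {
public:
    explicit FixedThreadPool(std::size_t num_threads) {
        for (std::size_t i = 0; i < num_threads; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    // Finishes all queued tasks, then joins the workers.
    ~FixedThreadPool() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            stopping_ = true;
        }
        work_available_.notify_all();
        for (auto& worker : workers_) {
            worker.join();
        }
    }

    FixedThreadPool(const FixedThreadPool&) = delete;
    FixedThreadPool& operator=(const FixedThreadPool&) = delete;

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            tasks_.push(std::move(task));
        }
        work_available_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(mutex_);
                work_available_.wait(guard, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // stopping and nothing left to do
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the task without holding the pool's mutex
        }
    }

    std::mutex mutex_;
    std::condition_variable work_available_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Usage example: run `num_tasks` small tasks on a pool of `num_threads`.
int run_tasks_on_pool(std::size_t num_threads, int num_tasks) {
    std::atomic<int> completed{0};
    {
        FixedThreadPool pool(num_threads);
        for (int i = 0; i < num_tasks; ++i) {
            pool.submit([&completed] { ++completed; });
        }
    }  // the destructor waits for all queued tasks to finish
    return completed.load();
}
```

Combining this with the other two recommendations would mean running several such processes, each with a small pool, and binding each process to its own CPU.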