Optimizing Thread Performance for a
Genomics Variant Caller
This talk
• Introduce two tools that can help improve the performance of
multithreaded code
• Apply the tools to a real world Genomics code
Tool 1: Allinea Performance Reports – benchmarking and
characterization
Tool 2: Allinea Forge - Debugging and Profiling
• Debug and profile from
one interface,
configuration
• Secure native remote and
local access
• Rapidly switch between
the tasks
• Edit, build, commit,
debug, profile, optimize..
Small data files
<5% slowdown
No instrumentation
No recompilation
Our profiler finds the performance bottlenecks
Our debugger helps with bugs and performance
• Observe why
workload is
imbalanced
• Observe why
particular code paths
are followed
• .. And fix any bugs
that optimization
creates!
Above all…
• The tools are aimed at any performance problem that matters
– Focus on time: the ultimate judge of performance
• Do not prejudge the problem
– Don’t assume it’s MPI messages, threads or I/O before profiling!
• If there’s a problem..
– Allinea Performance Reports shows it, and advises you on solutions
– Allinea Forge’s profiler shows it, next to your code
6 steps to improve performance
Get a realistic test
case
• Performance on real data
matters
• Keep the test case for
reference and re-use
Profile your code
• Add “-g” flag to your
compilation
• Run with a profiler
Look for the significant
• Which part/phase of the
code dominates time?
• Is there any unexpected
significant time use?
What is the nature of
the problem?
• Compute? I/O? MPI?
Thread synchronization?
• Display the metrics that
show the problem best
Apply brain to solve
• MPI – can you balance the
work better?
• Compute – is memory time
dominant – can you improve
layout?
Think of the future
• Try larger process or thread
counts to watch for
scalability problems
• Keep the profile (.map file)
for future comparison
Example: Improving Thread Usage in Genomics
• DISCOVAR
– Variant caller and small genome assembler
– Sub-mammalian sized genomes
– Newer DISCOVAR de novo for larger genomes
• C++ and OpenMP
• Developed by Broad Institute at MIT
A first look – on real hardware
• It’s not I/O intensive
• Good quantity of
OpenMP time
• No vectorization
OpenMP in detail
• Physical cores are
200% loaded:
hyperthreading is on
• 17% of parallel region
time is synchronization
• .. That’s quite high
Investigating the OpenMP synchronization
• Horizontal time axis:
colour coded
– Dark green – single core
– Light green – OpenMP work
– Light blue – pthread
synchronization
– Gray – idle
• Vertical axis
– #cores doing something
• Something’s very wrong
towards the end – with
all the gray
Zoom in on the region
• Stacks, code, regions,
time are all focused on
zoom area
• Key observation:
– OpenMP region with
“omp critical” is where
the time is being wasted
Fixing
• #pragma omp critical
– Lets only one thread at a
time execute the block, to
ensure safety
• Is costing too much
– Passing “token” from
thread to thread to do
small pieces of work.
• Run whole section on
one thread instead
– Has same semantics
Impact of change
• Runtime down by 7%
As a performance report
• Improvements in
– Runtime
– Synchronization
overhead
Let’s try something bigger – into Amazon cloud!
• C4.8xlarge
– 36 hyperthreaded cores
– 60 GB RAM
– Xeon E5-2666 v3 Haswell
– 25 MB Cache
– 2.6 GHz
vs
• Our physical server
– 24 hyperthreaded cores
– 24 GB RAM
– Xeon E5-2407 v2
– 10MB Cache
– 2.4GHz
$ ./runme.sh
discovar version: Discovar r52488
loadaverage: 0.05 0.98 1.36 1/790 16317
2015-07-27 07:57 PERF: REAL 835.857 USER 36.188 SYSTEM 5.441 PERC 4.71
835 seconds to run on EC2
… vs …
~448 seconds on our physical server
Why?
Profile with Allinea Forge to find where the problem is
• Focus on initial 300
seconds: something
must be wrong here
• Serious lack of good
“green” compute
In detail…
• 36 threads, waiting… but who is using madvise?!
Why is glibc so bad?
• madvise system call in
_int_free()
– At least two context
switches each call ..
– This glibc version has
issues…?
• What other options are
there?
Maybe Google TCMalloc?
• Optimized for multithreaded
applications
• No-win
– Same run time
– Issue is use of sys_futex
not madvise
• .. Not optimized for this
multithreaded
application!
Jemalloc?
• As recommended by
the Broad Institute
• … same runtime
Jemalloc – same problem
• Source proves the issue
again…
Can Intel libraries help?
• We try the Intel TBB
multithreaded allocator
• 14 minutes down to 10
minutes!
• .. But still this code has
scope for more…
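To compare allocators like the ones tried above in isolation, a small stress loop can help. The sketch below is an assumption for illustration only: the function name, task count and block size are hypothetical, and a synthetic loop like this is not guaranteed to reproduce the madvise or futex behaviour seen in the profile.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical micro-benchmark: many threads allocating and freeing
// small blocks. Returns the number of completed malloc/free pairs.
// `bytes` must be at least 1 because the first byte is written.
std::size_t allocation_stress(int tasks, int rounds, std::size_t bytes) {
    std::size_t completed = 0;
    #pragma omp parallel for reduction(+:completed)
    for (int t = 0; t < tasks; ++t) {
        for (int r = 0; r < rounds; ++r) {
            void* p = std::malloc(bytes);
            if (p != nullptr) {
                // Touch the block so the compiler cannot drop the pair.
                static_cast<volatile char*>(p)[0] = 1;
                std::free(p);
                ++completed;
            }
        }
    }
    return completed;
}
```

Timing a call such as allocation_stress(36, 1000000, 64) with each candidate allocator linked or preloaded gives a rough first comparison. The real test is still the full application on real data.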
Real optimization of OpenMP regions
• NB – still profiling for
first 300 seconds only
• Significant inactivity in
final 60 seconds
• OpenMP region
– #pragma omp parallel for
• Is it working?
– No – the threads are idle
• Let’s remove
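The talk removes the pragma outright. A gentler alternative, which is an assumption here and not what was done in the talk, is OpenMP's if clause: the region stays in the source but only runs in parallel when there is enough work. The function and the threshold below are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cut-off; a real value would come from profiling.
constexpr std::size_t kParallelThreshold = 100000;

long long sum_of_squares(const std::vector<int>& v) {
    long long total = 0;
    const long n = static_cast<long>(v.size());
    // When the if clause is false the region runs on one thread,
    // so small inputs do not pay for waking idle threads.
    #pragma omp parallel for reduction(+:total) if(v.size() >= kParallelThreshold)
    for (long i = 0; i < n; ++i) {
        total += static_cast<long long>(v[i]) * v[i];
    }
    return total;
}
```

If the profile shows the threads idle even on large inputs, as on this slide, removing the pragma is the simpler fix.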
After the first fix…
• Now able to run to
completion
– 358 seconds
• Still inactivity at end of
run
Zoomed to the inactivity…
• Another OpenMP region
• Quick edit: comment out
the OpenMP, again!
… and the impact
• Down to 304 seconds
Finally… something to sort out
• Recursive, in-place
multithreaded sorter
• Does not scale well with
thread count
• Options?
– Re-engineer
– Replace
– Tune
Let’s tune
• Try limiting the thread pool to 8 workers
– Better than 36 clashing threads?
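How the cap was applied in DISCOVAR is not shown on the slide. As a rough sketch of the idea, the code below limits a sort to 8 workers with OpenMP's num_threads clause. It is not the recursive sorter from the talk: the function name and the chunk-then-merge scheme are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cap taken from the slide; the mechanism below is an assumption.
constexpr int kMaxWorkers = 8;

void capped_parallel_sort(std::vector<int>& data) {
    const std::size_t n = data.size();
    const std::size_t chunk = (n + kMaxWorkers - 1) / kMaxWorkers;
    if (chunk == 0) return;  // empty input

    // Sort up to kMaxWorkers chunks at once, never more than that,
    // however many cores the machine has.
    #pragma omp parallel for num_threads(kMaxWorkers)
    for (int c = 0; c < kMaxWorkers; ++c) {
        const std::size_t lo = std::min(n, c * chunk);
        const std::size_t hi = std::min(n, lo + chunk);
        std::sort(data.begin() + lo, data.begin() + hi);
    }

    // Merge the sorted chunks pairwise on one thread.
    for (std::size_t width = chunk; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            const std::size_t hi = std::min(n, lo + 2 * width);
            std::inplace_merge(data.begin() + lo,
                               data.begin() + lo + width,
                               data.begin() + hi);
        }
    }
}
```

The worker count is a tuning knob: 8 is the value tried in the talk, and the best value for another machine or data set would need to be measured.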
Result…
• Runtime 4.7 minutes
• 3x improvement on
original
• #1 position on the
Broad Benchmark list
for a sub-$2 / hour
system!
Lessons learned
• Real codes exhibit many different performance patterns
– Profiling real data sets at real scales is vital to target the effort
– Small test cases do not expose all the problems
– Small thread counts can be too small to find real problems
• Changing code can be simple
– Use threads wisely – it will not always be faster
– Changing libraries – someone else might have fixed your problem
• Re-engineering is sometimes necessary
– Take advantage of vector units
– Take advantage of threads
Increase the performance of your software
Analyze and tune
with Allinea
Performance Reports
Develop, profile and
debug applications
with Allinea Forge
With professional
support when you
need it most
Read more!
