Increasing HMMER3 performance on Genepool and Cori
William Arndt
NERSC Post Doc
July 6, 2016
What HMMER3 Does
Protein Homology Search
A Hidden Markov Model is used to define a profile
that describes a protein domain. When a domain is
shared by proteins it suggests they have a common
function, structure, or evolutionary history.
Does this protein match the profile?
Search all pairs between millions of sequences and
tens of thousands of models.
HMMER3 filter pipeline
The overwhelming majority of searches find nothing.
Speed is gained by giving up on a search as soon as
possible.
• Filtering Pipeline:
– Single/Multiple Segment Viterbi filter (25% of CPU)
– Full Viterbi filter (15% of CPU)
– Forward filter (5% of CPU)
– Hit processing (30% of CPU)
The ratio of data to processing is high in the average case.
The code is very conditional.
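The early-exit idea can be sketched as a cascade in C. This is a minimal sketch and not HMMER3's code: the stage type, the two toy scoring functions, and the thresholds are hypothetical stand-ins. It only shows that a cheap stage can reject a candidate before a more expensive stage ever runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage signature: returns a score for one sequence. */
typedef double (*filter_fn)(const char *seq);

typedef struct {
    filter_fn score;      /* stage scoring function (stand-in) */
    double    threshold;  /* candidate must reach this score to continue */
} filter_stage;

/* Run the stages in order, cheapest first. Returns how many stages were
 * executed; *passed is set to 1 only if every stage accepted. */
static size_t run_pipeline(const filter_stage *stages, size_t n_stages,
                           const char *seq, int *passed)
{
    assert(stages != NULL && seq != NULL && passed != NULL);
    *passed = 0;
    for (size_t i = 0; i < n_stages; i++) {
        if (stages[i].score(seq) < stages[i].threshold)
            return i + 1;          /* give up as soon as a stage rejects */
    }
    *passed = 1;
    return n_stages;
}

/* Toy scorers used only for illustration (not real filters). */
static double cheap_score(const char *seq)  { return seq[0] == 'M' ? 1.0 : 0.0; }
static double costly_score(const char *seq) { return seq[1] == 'K' ? 1.0 : 0.0; }
```

With the two toy stages chained, a sequence that fails the cheap stage costs one stage evaluation instead of two, which is the effect the pipeline relies on when most searches find nothing.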
- 4 -
HMMER3 Threading and I/O
HMMER3 division of labor
For each model:
• Create worker threads with private copies of the model
• Master thread reads blocks of sequences from disk and places them in the work queue
• Worker threads take blocks from the work queue, process them, and pass results back to the master
When all sequences have been processed:
• Discard worker threads
• Write output
• Rewind sequence file
• Repeat with next model
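The overheads that this loop repeats can be tallied with a small C model. This is a sketch under the assumptions stated in the list above (fresh workers, one full pass over the sequence file, and one wait-for-all point per model); the struct and function names are hypothetical and nothing here is measured.

```c
#include <assert.h>

typedef struct {
    long thread_creations;   /* worker threads created over the whole run */
    long sequence_passes;    /* full reads of the sequence file */
    long barriers;           /* points where all workers must finish */
} loop_cost;

/* Tally the repeated work of the per-model loop for n_models models
 * and n_workers worker threads. */
static loop_cost per_model_loop_cost(long n_models, long n_workers)
{
    assert(n_models >= 0 && n_workers >= 1);
    loop_cost c = {0, 0, 0};
    for (long m = 0; m < n_models; m++) {
        c.thread_creations += n_workers; /* fresh workers for each model */
        c.sequence_passes  += 1;         /* file is rewound and re-read */
        c.barriers         += 1;         /* wait for workers, then write output */
    }
    return c;
}
```

For 100 models and 4 workers this gives 400 thread creations, 100 passes over the sequence file, and 100 synchronization points, all of which grow linearly with the number of models.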
How well does that work?
Figure: speedup versus cores. Haswell processor, HMMER3, hmmsearch, Swissprot sequence database (~550k sequences) searching 100 Pfam models.
VTune Concurrency
Why no core scaling?
• Reading blocks of sequence from disk has a modest
overhead of disk access, formatting, and error
checking. This compounds as the entire sequence
file is completely re-read for each model.
• The work queue is either full (< 4 workers) or empty
(> 4 workers) with no middle ground. A roofline
pattern results.
• A barrier for every model. The worst case is serialization of 1000 sequence searches.
• Thread creation and destruction overhead for every model. Threads are not reused.
JGI splits files as a workaround
Figure: speedup versus cores. Haswell processor, HMMER3, hmmsearch, Swissprot sequence database (~550k sequences) searching 100 Pfam models.
Modified HMMER3
VTune Counting I/O instructions
Buffer and reuse I/O data
Store several models and their results in a buffer so that each sequence read from disk can be searched against multiple models.
This puts a denominator under the number of sequence-related disk access calls needed: the count is divided by the number of buffered models.
Two buffers can alternate: I/O is performed on one while computation runs on the other.
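The effect of the denominator, and the alternation of the two buffers, can be written down in a few lines of C. This is an illustrative sketch: the function and struct names are hypothetical, and the buffer sizes in the usage note are made-up values, not figures from the talk.

```c
#include <assert.h>

/* Number of full passes over the sequence file when buffer_models models
 * are searched per pass instead of one. */
static long sequence_passes(long n_models, long buffer_models)
{
    assert(n_models >= 0 && buffer_models >= 1);
    return (n_models + buffer_models - 1) / buffer_models;  /* ceiling */
}

/* Two alternating buffers: one index is being filled by I/O while the
 * other is used for computation; swapping exchanges the roles. */
typedef struct { int io; int compute; } buffer_roles;

static buffer_roles swap_roles(buffer_roles r)
{
    buffer_roles s = { r.compute, r.io };
    return s;
}
```

With 100 models, a one-model buffer needs 100 passes over the sequence file, while a hypothetical 8-model buffer needs 13.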
Building blocks
• int load_hmm_buffer(...);
– Read enough models from disk to fill the hmm buffer
• int load_seq_buffer(...);
– Read enough sequences from disk to fill the sequence buffer; at EOF, reset the file to the beginning
• int write_hmm_output(...);
– Empty the results contained in the model buffer into the output files
• void thread_kernel(...);
– Create private data copies, process searches, and load results into the model buffer
• int work_counter;
– When active tasks fall below the thread count, fork half of the remaining work into new thread_kernels.
OpenMP Work Distribution
...
while model file not yet EOF
    #pragma omp task
        output_hmm_buffer(...) //unless first iteration
        load_hmm_buffer(...)
    do //step sequence buffer through sequence file until EOF
        #pragma omp taskgroup
            #pragma omp task
                load_seq_buffer(...)
            for each model in hmm buffer
                #pragma omp atomic
                work_counter++;
                #pragma omp task
                thread_kernel(...)
        swap sequence buffers
    … // repeat the task group for the last sequence buffer
    #pragma omp taskwait //in case work finishes before hmm (unlikely)
    swap model buffers
output_hmm_buffer(...) //write output for the final work block
...
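The overlap of loading and computing inside one task group can be shown in a reduced, runnable form. The sketch below is a simplification and not the modified hmmsearch: it uses an in-memory array in place of the sequence file, a single compute task per buffer instead of one per model, and hypothetical names (`load_chunk`, `process_chunk`, `CHUNK`). Compile with OpenMP enabled (for example `-fopenmp`); without it the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4   /* items per buffer (hypothetical size) */

/* Stand-in for load_seq_buffer: copy up to CHUNK items starting at *pos
 * from an in-memory "file"; returns how many were loaded. */
static int load_chunk(const int *file, int file_len, int *pos, int *buf)
{
    int n = file_len - *pos;
    if (n > CHUNK) n = CHUNK;
    memcpy(buf, file + *pos, (size_t)n * sizeof *buf);
    *pos += n;
    return n;
}

/* Stand-in for the search kernel: here it just sums the buffer. */
static long process_chunk(const int *buf, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) s += buf[i];
    return s;
}

/* Sum the whole "file" while the load of the next chunk overlaps with
 * the processing of the current one. */
static long double_buffered_sum(const int *file, int file_len)
{
    int bufs[2][CHUNK];
    int count[2] = {0, 0};
    int pos = 0, cur = 0;
    long total = 0;

    assert(file != NULL && file_len >= 0);
    count[cur] = load_chunk(file, file_len, &pos, bufs[cur]);

    #pragma omp parallel
    #pragma omp single
    while (count[cur] > 0) {
        int nxt = 1 - cur;
        long part = 0;
        #pragma omp taskgroup
        {
            /* I/O task fills the idle buffer. */
            #pragma omp task shared(pos, count, bufs) firstprivate(nxt)
            count[nxt] = load_chunk(file, file_len, &pos, bufs[nxt]);

            /* Compute task works on the buffer loaded previously. */
            #pragma omp task shared(part, count, bufs) firstprivate(cur)
            part = process_chunk(bufs[cur], count[cur]);
        }   /* both tasks are complete here */
        total += part;
        cur = nxt;   /* swap sequence buffers */
    }
    return total;
}
```

The two tasks touch disjoint buffers, so no locking is needed inside the task group; the end of the task group is the only synchronization point per buffer swap.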
The Work Kernel
int thread_kernel(range of sequences, ...)
    ...//prepare private pipeline data
    for each sequence in range
        if work_counter < threads
            #pragma omp atomic
            work_counter++;
            #pragma omp task
            thread_kernel(half range, ...);
        call HMMER3 pipeline
    #pragma omp critical
    ...//write results to model buffer
    ...//destroy private pipeline data
    #pragma omp atomic
    work_counter--;
VTune Concurrency
Now how well does it work?
Figure: speedup versus cores. Haswell processor, HMMER3, hmmsearch, Swissprot sequence database (~550k sequences) searching 100 Pfam models.
A production sized search
• Entire Pfam 29.0 database (16k models) searched against the entire Swissprot database (550k sequences)
– 1 thread, standard hmmsearch, estimated: 25 hours
– 4 threads, standard hmmsearch: 8 hours
– 32 threads, standard hmmsearch, sharded: 1 hour
– Full Haswell + HT, modified hmmsearch: 27 minutes
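The relative gains follow directly from the four wall times on this slide. A small helper, with a hypothetical name, makes the arithmetic explicit; the inputs are the slide's numbers converted to minutes.

```c
#include <assert.h>

/* Ratio of a baseline wall time to an improved wall time, both in minutes. */
static double speedup(double baseline_min, double improved_min)
{
    assert(baseline_min > 0.0 && improved_min > 0.0);
    return baseline_min / improved_min;
}
```

Against the modified hmmsearch at 27 minutes, `speedup(25 * 60, 27)` is roughly 55.6 for the estimated single-thread run, `speedup(8 * 60, 27)` is roughly 17.8 for the 4-thread run, and `speedup(60, 27)` is roughly 2.2 for the sharded 32-thread run.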
HMMER3 Vectorization
Work in progress
HMMER3 uses SSE intrinsics
HMMER3 uses a heavily optimized pipeline of search filters that explicitly apply a complex vector striping pattern to the underlying dynamic programming algorithms.
The code is a uniform mixture of the base algorithms, ordinary optimizations (like loop unrolling), adjustments to widen vectors used by certain filters with less precise data types, and workarounds for missing instructions in SSE2.
Compiler auto-vectorization can’t compete.
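To make "striping" concrete, here is a sketch of the index mapping used by the common striped layout for vectorized dynamic programming (Farrar's layout). The slides do not spell out HMMER3's exact layout, so treat this mapping as an assumption; the type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { int vec; int lane; } stripe_pos;

/* Number of vectors needed for a model of length m with `lanes` lanes. */
static int stripe_count(int m, int lanes)
{
    assert(m >= 1 && lanes >= 1);
    return (m + lanes - 1) / lanes;   /* ceiling */
}

/* Map 0-based model position k to (vector, lane) in a striped layout:
 * consecutive positions go to consecutive vectors, and each lane covers
 * one contiguous segment of the model. */
static stripe_pos stripe_of(int k, int m, int lanes)
{
    int q = stripe_count(m, lanes);
    assert(k >= 0 && k < m);
    stripe_pos p = { k % q, k / q };
    return p;
}
```

For a model of length 10 and 4 lanes, 3 vectors are needed; positions 0, 3, 6, and 9 all land in vector 0, in lanes 0 to 3. Hand-written intrinsics have to maintain this interleaving explicitly, which is part of why a compiler does not derive it from a plain scalar loop.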
Will be customized to use AVX2
The exact same design with AVX2 vectors would
experience diminishing returns:
• When larger stripes divide the search, the growing remainders are wasted work
• Lane restrictions between the high and low 128-bit lanes require less efficient implementations for certain instructions, such as right and left byte shifts
My modified implementation will search one sequence against two models at a time, each in its own AVX2 lane.
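The remainder waste can be quantified with a short calculation. The sketch assumes 8-bit scores, so a 128-bit SSE vector holds 16 cells and a 256-bit AVX2 vector holds 32; the function name and the model length in the usage note are hypothetical.

```c
#include <assert.h>

/* Padded (unused) cells when a model of length m is striped across
 * vectors with `lanes` cells each. */
static int padding_cells(int m, int lanes)
{
    assert(m >= 1 && lanes >= 1);
    int vectors = (m + lanes - 1) / lanes;   /* ceiling */
    return vectors * lanes - m;
}
```

For a model of length 100, `padding_cells(100, 16)` is 12 while `padding_cells(100, 32)` is 28, so simply doubling the vector width more than doubles the padding in this case. Running two models side by side, each in its own 128-bit half, keeps the 16-cell striping per model.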
National Energy Research Scientific Computing Center
