Increased HMMER3 performance on Genepool
William Arndt
August 2, 2016
- 1 -
The result first:
A production sized use case: the HMMER3 hmmsearch tool searching the Pfam 29.0 database (16k models) against the swissprot database (550k sequences):
1 thread: 25 hours
4 threads: 8 hours
32 threads: 8 hours
32 threads, sharded input files: 1 hour
NEW hmmsearch, 32 threads + HT: 27 minutes
- 2 -
What HMMER3 Does
- 3 -
Protein Homology Search
Start with a multiple sequence alignment (MSA) describing an interesting protein domain, profile, or motif. An MSA is used to build a hidden Markov model through which HMMER3 can search protein sequences for matches with statistical significance.
Compare millions of sequences against tens of thousands of protein HMMs, and use the results for annotation.
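As a concrete illustration with the standard HMMER3 tools (the file names here are placeholders):
hmmbuild 1-cysPrx_C.hmm 1-cysPrx_C.sto          # build a profile HMM from an MSA
hmmsearch 1-cysPrx_C.hmm uniprot_sprot.fasta    # search it against a protein database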
- 4 -
HMMER3 filter pipeline
The overwhelming majority of sequences don’t match. Speed is gained by discarding a miss as soon as possible.
• Filtering Pipeline (sketched in code after this list):
– Multiple Segment Viterbi filter: high-scoring diagonals; 2% pass; uses 25% of CPU time
– Viterbi filter: optimal alignment with indels; 5% pass; uses 15% of CPU time
– Forward/Backward filter: combined score of all alignments; 1% pass; uses 5% of CPU time
– Hit processing and output: 30% of CPU time
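A minimal C sketch of this early-exit cascade; the scoring functions, thresholds, and types below are toy placeholders, not the real HMMER3 API:

/* Toy illustration of the filter cascade: each stage is cheap relative
 * to the next, and a miss exits at the first stage it fails. */
#include <stdio.h>

typedef struct { const char *name; double toy_score; } Seq;
typedef struct { const char *name; } Hmm;

/* Stubs standing in for the real SIMD filter implementations. */
static double msv_score(const Seq *s, const Hmm *h) { (void)h; return s->toy_score; }
static double vit_score(const Seq *s, const Hmm *h) { (void)h; return s->toy_score; }
static double fwd_score(const Seq *s, const Hmm *h) { (void)h; return s->toy_score; }

/* Returns 1 if seq survives every filter against hmm, 0 on early exit. */
static int pipeline(const Seq *seq, const Hmm *hmm)
{
    if (msv_score(seq, hmm) < 10.0) return 0;  /* ~98% of sequences exit here */
    if (vit_score(seq, hmm) < 20.0) return 0;  /* ~95% of survivors exit here */
    if (fwd_score(seq, hmm) < 30.0) return 0;  /* ~99% of survivors exit here */
    printf("hit: %s vs %s\n", seq->name, hmm->name);  /* hit processing/output */
    return 1;
}

int main(void)
{
    Hmm model = { "1-cysPrx_C" };
    Seq seqs[] = { { "typical miss", 5.0 }, { "strong hit", 50.0 } };
    for (int i = 0; i < 2; i++) pipeline(&seqs[i], &model);
    return 0;
}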
- 5 -
HMMER3 output
Query: 1-cysPrx_C [M=40]
Accession: PF10417.6
Description: C-terminal domain of 1-Cys peroxiredoxin
Scores for complete sequences (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Sequence Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------
5.8e-18 69.2 1.8 1.1e-17 68.4 1.8 1.5 1 sp|O67024|TDXH_AQUAE Peroxiredo...
3.4e-15 60.4 0.0 9e-15 59.0 0.0 1.8 1 sp|Q9Y7F0|TSA1_CANAL Peroxiredo...
7.9e-14 56.0 0.0 1.5e-13 55.1 0.0 1.5 1 sp|Q26695|TDX_TRYBR Thioredoxi...
...
- 6 -
Domain annotation for each sequence:
>> sp|O67024|TDXH_AQUAE Peroxiredoxin OS=Aquifex aeolicus (strain VF5) GN=aq_858 PE=3 SV=1
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
--- ------ ----- --------- --------- ------- ------- ------- ------- ------- ------- ----
1 ! 68.4 1.8 2.9e-21 1.1e-17 1 40 [] 160 209 .. 160 209 .. 0.99
>> sp|Q9Y7F0|TSA1_CANAL Peroxiredoxin TSA1 OS=Candida albicans (strain SC5314 / ATCC MYA-2876) ...
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
--- ------ ----- --------- --------- ------- ------- ------- ------- ------- ------- ----
1 ! 59.0 0.0 2.4e-18 9e-15 1 40 [] 158 193 .. 158 193 .. 0.98
>> sp|Q26695|TDX_TRYBR Thioredoxin peroxidase OS=Trypanosoma brucei rhodesiense PE=2 SV=1
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
--- ------ ----- --------- --------- ------- ------- ------- ------- ------- ------- ----
1 ! 55.1 0.0 4e-17 1.5e-13 1 39 [. 162 196 .. 162 197 .. 0.97
...
Why HMMER3 is inefficient on Genepool
- 7 -
HMMER3 memory scrooge
HMMER3 was engineered to be as portable as possible: it targets the small memory footprint of a 2010-era desktop or laptop, far less than what is available in an HPC environment.
Instead of reading a fasta file once and using memory to store it, HMMER3 goes back to disk over and over again. The overhead limits the rate at which data can be prepared, and that rate is slower than the rate at which multiple threads can consume it. Any more than 4 worker threads will sit idle waiting for data.
- 8 -
Counting I/O instructions
- 9 -
sqascii_Read() and header_fasta() are the sequence reading functions. Standard hmmsearch spends 25% of its compute reading the same sequence file over and over again.
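The slide's numbers came from profiling; one way to reproduce this kind of function-level attribution is Linux perf (a sketch only; the input file names are assumptions):
# Record where CPU time goes, then list the hottest symbols.
perf record -g -o hmmsearch.perf hmmsearch Pfam-A.hmm uniprot_sprot.fasta > /dev/null
perf report -i hmmsearch.perf --sort symbol | head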
Utilization of Genepool nodes
• Core Utilization
– Genepool has nodes with 16 or 32 cores
– HMMER3 can use no more than 4 cores efficiently
– All threads wait for stragglers after every model
– Mitigation options include:
• Ignore the problem
• Share a node with -pe pe_slots 4 plus --cpu 3
• Shard input files, run multiple hmmsearch processes on one node, then combine output (see the sketch after this list)
• Memory Utilization
– All Genepool nodes have more than 100GB of memory
– HMMER3 won’t use 95% of that unless you do something absurd like search TITIN against its own model.
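A hedged sketch of the sharding mitigation; the awk splitter, shard count, and file names are illustrative, not from the slides:
# Split the fasta into 8 shards, round-robin by record.
awk -v n=8 '/^>/ { f = sprintf("shard%d.fa", i++ % n) } { print > f }' uniprot_sprot.fasta
# Run one 4-thread hmmsearch per shard on the same node (--cpu counts worker threads).
for s in shard*.fa; do
    hmmsearch --cpu 3 --tblout "${s%.fa}.tbl" Pfam-A.hmm "$s" > /dev/null &
done
wait
# Combine the per-shard tables.
cat shard*.tbl > combined.tbl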
- 10 -
Modified HMMER3
- 11 -
Buffer the I/O data and reuse it
Store several models and their results in a memory buffer so that each sequence read from disk can be searched against multiple models. This divides the number of sequence-related disk accesses by the size of the model buffer; the 25% of CPU instructions spent on sequence I/O drops to under 1%.
Two buffers can alternate: I/O is performed on one while computation runs on the other. If I/O finishes early, that thread converts itself into a worker (see the sketch below).
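A minimal C sketch of the alternating buffers, assuming pthreads; the types and function names are placeholders, not the hpc_hmmsearch source:

/* Two buffers swap roles each pass: a reader thread fills one from disk
 * while the main thread searches the other. */
#include <pthread.h>

typedef struct { int count; /* buffered sequences would live here */ } SeqBuffer;

static SeqBuffer bufs[2];

/* Placeholder I/O: in HMMER3 this is the fasta reader. */
static void *fill_from_disk(void *arg)
{
    SeqBuffer *b = arg;
    b->count = 0;  /* read up to --seq_buffer sequences into b */
    return NULL;
}

/* Placeholder compute: every buffered sequence against every buffered model.
 * In the real design, if fill_from_disk() finishes first, its thread would
 * join in here as an extra worker instead of idling. */
static void search_buffer(SeqBuffer *b) { (void)b; }

int main(void)
{
    pthread_t reader;
    int active = 0;
    fill_from_disk(&bufs[active]);           /* prime the first buffer */
    for (int pass = 0; pass < 3; pass++) {   /* demo: three passes */
        pthread_create(&reader, NULL, fill_from_disk, &bufs[1 - active]);
        search_buffer(&bufs[active]);        /* compute overlaps the I/O */
        pthread_join(reader, NULL);
        active = 1 - active;                 /* swap roles */
    }
    return 0;
}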
- 12 -
Original HMMER3 thread behavior
- 13 -
New thread behavior
- 14 -
How to use custom hmmsearch
- 15 -
warndt@genepool13:~$ module load hmmer/3.1b2-opt
warndt@genepool13:~$ hpc_hmmsearch -h
# hpc_hmmsearch :: search profile(s) against a sequence database, custom modified for improved thread performance
# HMMER 3.1b2 (February 2015); http://hmmer.org/
...
Input buffer and thread control:
--seq_buffer <n> : set # of sequences per thread buffer [200000] (n>=1)
--hmm_buffer <n> : set # of hmms per thread hmm buffer [500] (n>=1)
--cpu <n> : set # of threads [1] (n>=1)
...
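For example, a full-node run might look like the following, assuming hpc_hmmsearch keeps hmmsearch's -o flag and <hmmfile> <seqdb> argument order (file names are illustrative):
hpc_hmmsearch --cpu 32 --seq_buffer 200000 --hmm_buffer 500 \
    -o pfam_vs_sprot.out Pfam-A.hmm uniprot_sprot.fasta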
Future work
- 16 -
HMMER3 on other NERSC systems
Cori phase I hardware is functionally identical (Haswell processors with 128GB memory) to the -pe pe_slots 32 nodes available on Genepool. No custom HMMER3 module on Cori yet, but that can be fixed in 5 minutes when someone wants it.
HMMER3 runs on Cori phase II hardware (Knights Landing many-core architecture) but not as well as on phase I. My current best KNL time for swissprot against Pfam is 38 minutes.
- 17 -
hmmscan modification
JGI usage of hmmscan is approximately an order of magnitude less than hmmsearch.
Its design is very similar to hmmsearch; conversion would be straightforward and take approximately a week. As soon as someone expresses interest in running high volume hmmscan, I'll complete the conversion and make it available.
- 18 -
Upgrading vector code
The six-year-old single instruction, multiple data (SIMD) instructions in the HMMER3 pipeline do not run well on KNL hardware.
I am currently working on new filters that use more modern vector instructions and will run more efficiently on the phase II machine.
- 19 -
HMMER4 is coming
• Sean Eddy has been actively developing a new major version of HMMER.
• The components I am hacking for better performance today will be completely replaced in the future with theoretically superior algorithms.
• It won’t be available for at least a year, and probably more like two or three.
• If I’m still around, I’ll help everyone transition to the new application.
- 20 -
HMMER3 translated search
Translated, frameshift-aware HMMER3 search is currently in development. An alpha version is available, and anyone interested is welcome to give it a try and provide feedback.
/global/homes/w/warndt/edison-t-hmmer/hmmer/src/phmmert
/global/homes/w/warndt/edison-t-hmmer/hmmer/src/nhmmscant
- 21 -
National Energy Research Scientific Computing Center
- 22 -