William Arndt increased the performance of HMMER3 on Genepool by modifying it to use buffering and multiple threads more efficiently. An example showed that searching a large protein database took 27 minutes using the modified HMMER3 with 32 threads, compared to 25 hours using the original single-threaded version. The modifications avoid repeated reading of input files by buffering sequence and HMM data in memory for reuse across multiple models.
4. Protein Homology Search
Start with a multiple sequence alignment describing
an interesting protein domain, profile, or motif.
An MSA is used to build a Hidden Markov Model (HMM),
which HMMER3 uses to search protein
sequences for statistically significant matches.
Compare millions of sequences against tens of
thousands of protein HMMs. Use the results for
annotation.
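As a minimal sketch of the workflow above, the two HMMER3 command lines can be assembled as follows. The file names (domain.sto, domain.hmm, proteins.fasta, hits.tbl) are hypothetical placeholders, and the sketch only builds and prints the commands; it does not assume HMMER3 is installed.

```python
import shlex
from typing import List


def hmmbuild_cmd(hmm_out: str, msa_file: str) -> List[str]:
    """Command that builds a profile HMM from a multiple sequence alignment."""
    return ["hmmbuild", hmm_out, msa_file]


def hmmsearch_cmd(hmm_file: str, seq_db: str, cpus: int, tblout: str) -> List[str]:
    """Command that searches a sequence file with the profile HMM(s).

    --cpu sets the number of worker threads; --tblout writes a
    per-sequence hit table that is convenient for annotation.
    """
    return ["hmmsearch", "--cpu", str(cpus), "--tblout", tblout, hmm_file, seq_db]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in (
        hmmbuild_cmd("domain.hmm", "domain.sto"),
        hmmsearch_cmd("domain.hmm", "proteins.fasta", cpus=4, tblout="hits.tbl"),
    ):
        print(shlex.join(cmd))
```

To actually run a command, the argument list could be passed to subprocess.run(cmd, check=True) on a system where HMMER3 is on the PATH.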
5. HMMER3 filter pipeline
The overwhelming majority of sequences don’t
match. Speed is gained by discarding a miss as soon as
possible.
• Filtering pipeline:
– Multiple Segment Viterbi (MSV) filter:
• high-scoring diagonals; 2% pass; uses 25% of CPU time
– Viterbi filter:
• optimal alignment with indels; 5% pass; uses 15% of CPU time
– Forward/Backward filter:
• combined score of all alignments; 1% pass; uses 5% of CPU time
– Hit processing and output: uses 30% of CPU time
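The early-discard idea behind the pipeline above can be illustrated with a toy cascade. This is only a sketch: the three predicates are hypothetical stand-ins ordered from cheap to expensive, not the MSV, Viterbi, or Forward/Backward algorithms. The point is that later stages are called only on sequences that survived the earlier ones.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

# A filter is a (name, predicate) pair; predicates here are toy stand-ins.
Filter = Tuple[str, Callable[[str], bool]]


def run_cascade(
    sequences: Iterable[str], filters: Sequence[Filter]
) -> Tuple[List[str], Dict[str, int]]:
    """Apply filters in order, discarding a sequence at its first failure.

    Returns the surviving sequences and how many times each stage ran.
    """
    calls = {name: 0 for name, _ in filters}
    survivors: List[str] = []
    for seq in sequences:
        for name, passes in filters:
            calls[name] += 1
            if not passes(seq):
                break  # a miss: skip all later, more expensive stages
        else:
            survivors.append(seq)  # passed every stage
    return survivors, calls


if __name__ == "__main__":
    toy_filters: List[Filter] = [
        ("cheap", lambda s: "W" in s),             # hypothetical fast check
        ("medium", lambda s: len(s) >= 5),         # hypothetical mid-cost check
        ("expensive", lambda s: s.count("W") >= 2),  # hypothetical slow check
    ]
    hits, calls = run_cascade(["AAAA", "AWAA", "AWAAAA", "WWAAAA"], toy_filters)
    print(hits, calls)
```

In this toy run the cheap stage sees every sequence while the expensive stage sees only half of them, which is the same shape as the pass rates listed on the slide.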
8. HMMER3 memory scrooge
HMMER3 was engineered to be as portable as possible.
Running on a 2010-era desktop or laptop requires a
much smaller memory footprint than available in an
HPC environment.
Instead of reading a FASTA file once and using memory
to store it, HMMER3 goes back to disk over and over
again. The overhead limits the rate at which data can be
prepared, and that rate is slower than the rate at which
multiple threads can consume it. Any worker threads
beyond 4 will sit idle waiting for data.
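The bottleneck described above can be written as a simple producer/consumer model: overall throughput is capped by whichever is slower, the reader or the pool of workers. The rates below are invented for illustration (chosen so that saturation happens at 4 workers, as on the slide); they are not measurements of HMMER3.

```python
def throughput(n_workers: int, producer_rate: float, worker_rate: float) -> float:
    """Sequences per unit time when one reader feeds n_workers workers."""
    return min(producer_rate, n_workers * worker_rate)


def idle_workers(n_workers: int, producer_rate: float, worker_rate: float) -> float:
    """Average number of workers left waiting for data."""
    busy = min(float(n_workers), producer_rate / worker_rate)
    return n_workers - busy


if __name__ == "__main__":
    # Hypothetical rates: the reader prepares 4 units of data per unit time,
    # each worker consumes 1 unit per unit time.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, throughput(n, 4.0, 1.0), idle_workers(n, 4.0, 1.0))
```

Under these assumed rates, adding workers beyond 4 leaves throughput flat and only increases the number of idle threads.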
9. Counting I/O instructions
sqascii_Read() and header_fasta() are the functions that
read sequences. Standard hmmsearch spends 25% of its
CPU instructions reading the same sequence file over and
over again.
10. Utilization of Genepool nodes
• Core Utilization
– Genepool has nodes with 16 or 32 cores
– HMMER3 can use no more than 4 cores efficiently
– All threads wait for stragglers after every model
– Mitigation options include:
• Ignore the problem
• Share a node: request 4 slots with -pe pe_slots 4 and run hmmsearch with --cpu 3
• Shard the input files, run multiple hmmsearch processes on one
node, then combine the output
• Memory Utilization
– All Genepool nodes have more than 100GB of memory
– HMMER3 won’t use 95% of that unless you do something
absurd like search TITIN against its own model.
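The sharding option listed above can be sketched as a round-robin split of FASTA records. This is a minimal illustration, not a tool from the talk: the function names are hypothetical, and round-robin assignment balances record counts rather than total residues.

```python
from typing import Iterable, Iterator, List


def fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)  # a new header closes the previous record
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def shard_fasta(lines: Iterable[str], n_shards: int) -> List[str]:
    """Distribute records round-robin and return the text of each shard."""
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for i, rec in enumerate(fasta_records(lines)):
        shards[i % n_shards].append(rec)
    return ["".join(parts) for parts in shards]


if __name__ == "__main__":
    demo = [">a\n", "MKV\n", ">b\n", "LLG\n", ">c\n", "GGA\n"]
    for idx, text in enumerate(shard_fasta(demo, 2)):
        print(f"--- shard {idx} ---")
        print(text, end="")
```

Each shard would then be written to its own file and searched by a separate hmmsearch process. When combining the output, note that E-values depend on the number of sequences searched; hmmsearch's -Z option can be used to set that number explicitly so shards report comparable values.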
12. Buffer the I/O data and reuse it
Store several models and their results in a memory
buffer so that each sequence, once read, can be
searched against multiple models.
This divides the number of sequence-related disk
access calls by the number of models held in the buffer;
the 25% of CPU instructions spent on reading drops to
<1% this way.
Two buffers can alternate: I/O is performed on one while
computation runs on the other. If I/O finishes early, the
I/O thread converts itself to a worker.
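The effect of buffering several models can be sketched by counting passes over the sequence file. This is a simplified single-threaded illustration with hypothetical names; it does not reproduce the modified HMMER3 code or its alternating double buffer. With one model per buffer it behaves like the original loop (one pass per model); with larger buffers the number of passes shrinks while the results stay the same.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def search(
    read_sequences: Callable[[], Iterable[str]],
    models: Sequence[str],
    score: Callable[[str, str], int],
    models_per_buffer: int,
) -> Tuple[Dict[str, List[int]], int]:
    """Score every sequence against every model; count passes over the file.

    read_sequences() stands in for re-opening and reading the sequence file.
    """
    results: Dict[str, List[int]] = {m: [] for m in models}
    passes = 0
    for start in range(0, len(models), models_per_buffer):
        batch = models[start:start + models_per_buffer]
        passes += 1  # one pass over the sequence file per buffer of models
        for seq in read_sequences():
            for model in batch:  # reuse the sequence for every buffered model
                results[model].append(score(model, seq))
    return results, passes


if __name__ == "__main__":
    seqs = ["MKV", "LLG", "GGA"]
    toy_models = ["M", "L", "G", "A"]  # toy "models": single residues
    toy_score = lambda m, s: s.count(m)  # hypothetical scoring function
    for size in (1, 2, 4):
        _, n_passes = search(lambda: iter(seqs), toy_models, toy_score, size)
        print(f"models per buffer = {size}: {n_passes} passes over the file")
```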
17. HMMER3 on other NERSC systems
Cori phase I hardware (Haswell processors with 128GB
memory) is functionally identical to the -pe
pe_slots 32 nodes available on Genepool. There is no
custom HMMER3 module on Cori yet, but that can
be fixed in 5 minutes when someone wants it.
HMMER3 runs on Cori phase II hardware (Knights
Landing many-core architecture) but not as well as
on phase I. My current best KNL time for swissprot
against Pfam is 38 minutes.
18. hmmscan modification
JGI usage of hmmscan is approximately an order of
magnitude less than that of hmmsearch.
The design of hmmscan is very similar to that of hmmsearch.
Conversion would be straightforward and take approximately a week.
As soon as someone expresses interest in running high-
volume hmmscan, I’ll complete it and make it available.
19. Upgrading vector code
The 6-year-old single instruction, multiple data (SIMD)
instructions in the HMMER3 pipeline do not run well on
KNL hardware.
I am currently working on new filters which will use more
modern vector instructions and will run more efficiently
on the phase II machine.
20. HMMER4 is coming
• Sean Eddy has been actively developing a new major
version of HMMER.
• The components I am hacking for better performance
today will be completely replaced in the future with
theoretically superior algorithms.
• It won’t be available for at least a year, and probably
more like two or three.
• If I’m still around, I’ll help everyone transition to the
new application.
21. HMMER3 translated search
Translated, frameshift-aware HMMER3 search is
currently in development. An alpha version is
available and anyone interested is welcome to give it
a try and provide feedback.
/global/homes/w/warndt/edison-t-hmmer/hmmer/src/phmmert
/global/homes/w/warndt/edison-t-hmmer/hmmer/src/nhmmscant