3. Protein Homology Search
A Hidden Markov Model is used to define a profile
that describes a protein domain. When a domain is
shared by proteins it suggests they have a common
function, structure, or evolutionary history.
Does this protein match the profile?
Search all pairs between millions of sequences and
tens of thousands of models.
4. HMMER3 filter pipeline
The overwhelming majority of searches find nothing.
Speed is gained by giving up on a search as soon as
possible.
• Filtering Pipeline:
– Single/Multiple Segment Viterbi filter (25% of cpu)
– Full Viterbi filter (15% of cpu)
– Forward filter (5% of cpu)
– Hit processing (30% of cpu)
In the average case, the ratio of data to processing is high.
The code is heavily conditional.
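The early-exit idea behind the filter pipeline can be sketched as a cascade of stages ordered from cheapest to most expensive. This is only an illustration, not HMMER3's code: filter_fn, run_pipeline, and the two toy filters are hypothetical names and do not exist in HMMER3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage signature: returns nonzero if the sequence survives. */
typedef int (*filter_fn)(const char *seq, size_t len);

/* Run the stages in order and stop at the first rejection.
 * Returns the number of stages that were actually executed;
 * *is_hit is set to 1 only if every stage accepted the sequence. */
size_t run_pipeline(const filter_fn *stages, size_t n_stages,
                    const char *seq, size_t len, int *is_hit)
{
    assert(stages != NULL && is_hit != NULL);
    *is_hit = 0;
    for (size_t i = 0; i < n_stages; i++) {
        if (!stages[i](seq, len))
            return i + 1;   /* rejected: later, costlier stages are skipped */
    }
    *is_hit = 1;            /* survived every filter: goes on to hit processing */
    return n_stages;
}

/* Toy stand-ins for the real filters (assumptions for illustration only). */
int toy_len_filter(const char *seq, size_t len)
{
    (void)seq;
    return len >= 4;
}

int toy_char_filter(const char *seq, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (seq[i] == 'W')
            return 1;
    return 0;
}
```

With these toy filters, a two-residue sequence is rejected after one stage, so the second filter never runs. That is the property the pipeline relies on when most searches find nothing.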
6. HMMER3 division of labor
For each model:
• Create worker threads, each with a private copy of
the model
• Master thread reads blocks of sequences from disk
and places them in the work queue
• Worker threads take blocks from the work queue,
process them, and pass results back to the master
When all sequences have been processed:
• Discard worker threads
• Write output
• Rewind sequence file
• Repeat with next model
7. How well does that work?
[Figure: speedup versus number of cores.
Haswell processor, HMMER3, hmmsearch;
100 Pfam models searched against the Swissprot
sequence database (~550k sequences)]
9. Why no core scaling?
• Reading blocks of sequence from disk has a modest
overhead of disk access, formatting, and error
checking. This compounds as the entire sequence
file is completely re-read for each model.
• The work queue is either full (< 4 workers) or empty
(> 4 workers) with no middle ground. A roofline
pattern results.
• A barrier for every model. The worst case is
serialization of 1000 sequence searches
• Thread creation and destruction overhead for every
model. No thread reuse
13. Buffer and reuse I/O data
Store several models and their results in a buffer such
that each read sequence can be used to search
multiple models.
This divides the number of sequence-related disk
access calls by the number of models held in the buffer.
Two buffers can alternate; I/O performed on one and
computation on the other.
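The saving can be made concrete with a little arithmetic. The sketch below counts how many times the sequence file is read end to end for a given model-buffer size. The function names and the buffer size of 16 are assumptions chosen for illustration, not values from the modified hmmsearch.

```c
#include <assert.h>

/* Number of full passes over the sequence file when `buffer_models`
 * models are searched per pass (ceiling division). */
long sequence_file_passes(long n_models, long buffer_models)
{
    assert(n_models >= 0 && buffer_models > 0);
    return (n_models + buffer_models - 1) / buffer_models;
}

/* With two alternating buffers (indices 0 and 1), return the index of
 * the buffer that I/O should fill while `current` is being computed on. */
int other_buffer(int current)
{
    assert(current == 0 || current == 1);
    return 1 - current;
}
```

For the 100-model benchmark, sequence_file_passes(100, 1) gives 100 passes, which corresponds to the original one-model-at-a-time design, while a hypothetical buffer of 16 models gives 7 passes.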
14. Building blocks
• int load_hmm_buffer(...);
– Read enough models from disk to fill the hmm buffer
• int load_seq_buffer(...);
– Read enough sequences from disk to fill the sequence
buffer; at EOF, rewind the file to the beginning
• int write_hmm_output(...);
– Empty results contained in model buffer to output files
• void thread_kernel(...);
– create private data copies, process searches, and load
results into model buffer
• int work_counter;
– When active tasks fall below thread count, fork half of
remaining work into new thread_kernels.
15. OpenMP Work Distribution
...
while model file not yet EOF
    #pragma omp task
        output_hmm_buffer(...) //unless first iteration
        load_hmm_buffer(...)
    do //step sequence buffer through sequence file until EOF
        #pragma omp taskgroup
            #pragma omp task
                load_seq_buffer(...)
            for each model in hmm buffer
                #pragma omp atomic
                work_counter++;
                #pragma omp task
                thread_kernel(...)
        swap sequence buffers
    … // repeat the task group for the last sequence buffer
    #pragma omp taskwait //in case work finishes before hmm (unlikely)
    swap model buffers
output_hmm_buffer(...) //write output for the final work block
...
16. The Work Kernel
int thread_kernel(range of sequences, ...)
    ...//prepare private pipeline data
    for each sequence in range
        if work_counter < threads
            #pragma omp atomic
            work_counter++;
            #pragma omp task
            thread_kernel(half range, ...);
        call HMMER3 pipeline
        #pragma omp critical
        ...//write results to model buffer
    ...//destroy private pipeline data
    #pragma omp atomic
    work_counter--;
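A compilable sketch of this kernel is given here, under several assumptions: the sequence range is a pair of integer indices, "half range" means the upper half of the work still remaining, and the HMMER3 pipeline call is replaced by a hypothetical stub search_one. The driver search_range, the processed counter, and the guard against splitting a single remaining sequence are also additions for illustration. Without OpenMP enabled the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

static int work_counter = 0;   /* number of active kernels */
static long processed = 0;     /* stand-in for results in the model buffer */

/* Hypothetical stand-in for the HMMER3 pipeline call. */
static void search_one(int seq_index) { (void)seq_index; }

/* Process sequences [lo, hi); while fewer kernels than threads are
 * active, hand the upper half of the remaining range to a new task. */
void thread_kernel(int lo, int hi, int threads)
{
    for (int i = lo; i < hi; i++) {
        int active;
        #pragma omp atomic read
        active = work_counter;

        if (active < threads && hi - i > 1) {
            int mid = i + (hi - i) / 2;
            int old_hi = hi;
            hi = mid;                      /* keep the lower half here */
            #pragma omp atomic
            work_counter++;
            #pragma omp task firstprivate(mid, old_hi, threads)
            thread_kernel(mid, old_hi, threads);
        }

        search_one(i);
        #pragma omp critical
        processed++;                       /* "write results to model buffer" */
    }
    #pragma omp atomic
    work_counter--;
}

/* Hypothetical driver: one root kernel inside a taskgroup, so that all
 * descendant tasks have finished before the count is returned. */
long search_range(int n_sequences, int threads)
{
    processed = 0;
    work_counter = 1;                      /* the root kernel is active */
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        thread_kernel(0, n_sequences, threads);
    }
    return processed;
}
```

Calling search_range(1000, 8) should process every sequence exactly once however the range ends up being split, and work_counter returns to zero once all kernels have finished.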
18. Now how well does it work?
[Figure: speedup versus number of cores.
Haswell processor, HMMER3, hmmsearch;
100 Pfam models searched against the Swissprot
sequence database (~550k sequences)]
19. A production sized search
• Entire Pfam 29.0 database (16k models)
searched against entire Swissprot database
(550k sequences)
– 1 thread, standard hmmsearch, estimated: 25 hours
– 4 threads, standard hmmsearch: 8 hours
– 32 threads, standard hmmsearch, sharded: 1 hour
– Full Haswell + HT, modified hmmsearch: 27 minutes
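The relative gains follow directly from the four timings listed on this slide. The helper below is a trivial sketch (speedup is a hypothetical name) for turning those wall-clock times into ratios. Note that the 25-hour single-thread figure is itself an estimate.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline, both given in minutes. */
double speedup(double baseline_min, double run_min)
{
    assert(baseline_min >= 0.0 && run_min > 0.0);
    return baseline_min / run_min;
}
```

Relative to the 27-minute modified run, this gives roughly 56x over the estimated single-thread time (1500 / 27), roughly 18x over 4 threads (480 / 27), and roughly 2.2x over the 32-thread sharded run (60 / 27).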
21. HMMER3 uses SSE intrinsics
HMMER3 uses a heavily optimized pipeline of search
filters that explicitly apply a complex vector striping
pattern to the underlying dynamic programming
algorithms
The code is a uniform mixture of the base algorithms,
ordinary optimizations (like loop unrolling), adjustments
to widen vectors used by certain filters with less precise
data types, and workarounds for missing instructions in
SSE2
Compiler auto-vectorization can’t compete
22. Will be customized to use AVX2
The exact same design with AVX2 vectors would
experience diminishing returns:
• When larger stripes divide the search, the growing
remainders are wasted work
• Lane restrictions between the high and low 128-bit
lanes require less efficient implementations for certain
instructions, such as right and left byte shifts
My modified implementation will search one sequence
against two models at a time, each in its own AVX2 lane
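The remainder cost can be illustrated with a simplified model, assuming byte-wide scores so that a 128-bit SSE register holds 16 cells and a 256-bit AVX2 register holds 32. The function names and the model length of 100 are assumptions for illustration.

```c
#include <assert.h>

/* Number of vectors needed to cover a model of `model_len` positions
 * when each vector holds `lanes` cells (ceiling division). */
int vectors_needed(int model_len, int lanes)
{
    assert(model_len > 0 && lanes > 0);
    return (model_len + lanes - 1) / lanes;
}

/* Cells that are computed but carry no model position (padding). */
int padding_cells(int model_len, int lanes)
{
    return vectors_needed(model_len, lanes) * lanes - model_len;
}
```

For a model of length 100, 16-cell vectors need 7 vectors with 12 padding cells, while 32-cell vectors need 4 vectors with 28 padding cells. Under this simplified model, giving each model its own 128-bit half keeps the padding at the 16-cell level for each model.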