Winnowmap2: A long read mapping method
for highly repetitive reference sequences
Chirag Jain
Indian Institute of Science
chirgjain
github.com/marbl/Winnowmap
Repeat-aware read mapping
• Human (and other mammalian) genomes approaching completion

• Megabase-long gaps resolved (chrX: Miga et al., chr8: Logsdon et al.)

• Segmental duplications, rRNA genes, centromeres

• Read-mapping & re-sequencing accuracy is critical

• Variant calling

• Epigenetics / Transcriptomics / validating de-novo assemblies
Repeat-aware read mapping
• Exact: Smith-Waterman
• Practical heuristics: seed -> chain -> extend
• Mask repetitive k-mers: addressed in Winnowmap using weighted minimizers [ISMB’20]
• Is the highest scoring alignment always a correct placement for a read? Improved in Winnowmap2
[Figure: example DNA sequence illustrating the seed, chain and extend steps]
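As a rough illustration of the seeding idea on this slide, the sketch below samples window minimizers while ranking repetitive k-mers last, so they are chosen only when a window contains nothing else. This is a simplification and not Winnowmap's actual weighting scheme; the function names, the hash choice and the count-based definition of "repetitive" are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def weighted_minimizers(seq, k, w, repetitive):
    """Return sorted (position, k-mer) pairs sampled from seq.

    In every window of w consecutive k-mers the smallest key is picked.
    The key places k-mers from `repetitive` after all other k-mers, so a
    repetitive k-mer is only sampled when its window holds no alternative.
    This ordering is an assumption of the sketch, not the published method.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    keys = [(km in repetitive, kmer_hash(km), i) for i, km in enumerate(kmers)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(keys[start:start + w])[2])
    return [(i, kmers[i]) for i in sorted(picked)]


# Toy usage: call a k-mer "repetitive" if it occurs at least twice.
seq = "ACGTACGTACGTTTGCA"
counts = Counter(seq[i:i + 4] for i in range(len(seq) - 4 + 1))
repetitive = {km for km, c in counts.items() if c >= 2}
print(weighted_minimizers(seq, k=4, w=3, repetitive=repetitive))
```

Ranking repetitive k-mers last, rather than dropping them, keeps at least one seed in every window, which is the property that plain masking loses inside long repeats.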
Allelic bias: Illustration 1
[Figure: an ancestral genome with repeat-copy I and repeat-copy II accumulates mutations, leading to the reference genome and the sequenced individual; the copies differ by paralog-specific variants (PSVs). On resequencing, long reads can be mapped incorrectly.]
Degner et al. (2009) Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data
Allelic bias: Illustration 2
[Figure: a long read carrying mutations is aligned to a reference with repeat-copy I (true) and repeat-copy II. Panel scores against the two copies: 18 vs. 20 for the long read, 9 vs. 6 for a short substring, with the better score marked in each panel.]
• Identify short read substrings that can be confidently mapped, i.e., minimal confidently alignable substrings (MCASs)
MCAS: minimal confidently alignable substring
Given a read Q and a position i, 0 ≤ i < |Q|, MCAS(i) is a substring of minimum length beginning at position i of Q that maps confidently to the reference.
[Figure: substrings of a read starting at position i (e.g., ACCTCTCTT) compared against reference copies such as ACCTCTCGTAGTCGCAT and ACCTCTCGTCGTCGCAT; MCASs marked along the read and the reference]
MCAS: minimal confidently alignable substring
Given a read Q and a position i, 0 ≤ i < |Q|, MCAS(i) is a substring of minimum length beginning at position i of Q that maps confidently to the reference.
• "Confidently" is judged using the ratio of best & second-best end-to-end alignment scores, ≈ mapping quality (mapQ)
Lemma. Computing MCAS(i) ∀ 0 ≤ i < |Q| requires O(|Q||R|) time and O(|R|) space. (proof in preprint)
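To make the definition concrete, here is a brute-force sketch of MCAS(i) on toy strings. It is not the algorithm behind the lemma and is far slower than the stated bound. The ungapped match/mismatch scoring, the `max_ratio` threshold and all function names are assumptions chosen for illustration; the slides only say that confidence is judged from the ratio of the best and second-best end-to-end alignment scores.

```python
def ungapped_score(sub, ref, pos, match=1, mismatch=-1):
    """Score of placing `sub` at ref[pos:] without gaps (a stand-in for real alignment)."""
    return sum(match if a == b else mismatch
               for a, b in zip(sub, ref[pos:pos + len(sub)]))


def is_confident(sub, ref, max_ratio=0.8):
    """True if the best placement clearly beats the second-best placement."""
    scores = sorted((ungapped_score(sub, ref, p)
                     for p in range(len(ref) - len(sub) + 1)), reverse=True)
    if not scores or scores[0] <= 0:
        return False
    if len(scores) == 1:
        return True
    return scores[1] <= max_ratio * scores[0]


def mcas(read, ref, i, max_ratio=0.8):
    """Shortest read[i:j] that maps confidently to ref, or None if there is none."""
    for j in range(i + 1, len(read) + 1):
        if is_confident(read[i:j], ref, max_ratio):
            return read[i:j]
    return None


# The reference holds two near-identical copies, ACGT and ACGA.
# Only the last base of the read tells them apart.
print(mcas("ACGA", "ACGTACGA", 0))
```

In this toy case the prefixes A, AC and ACG score equally well on both copies, so the substring has to grow until it covers the distinguishing base before it becomes confidently alignable.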
MCAS: minimal confidently alignable substring
PRACTICAL HEURISTICS
• For a read, periodically sample MCASs (e.g., every 500th base) instead of computing all of them
• Grow a substring exponentially rather than linearly
• Use seed and extend, rather than DP-based exact alignment, to map substrings
  • minimap2’s alignment and mapQ scoring
  • weighted minimizers
• Final step: consolidate MCASs
[Figure: a read ACCTCTCGTACCTCTCGTACCTCTCGT… with a substring starting at position i that is grown and mapped via seed -> chain -> extend]
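The first two heuristics can be sketched as a small loop: sample start positions at a fixed stride and double the substring length until it maps confidently. The predicate `maps_confidently` is a hypothetical stand-in for the seed-chain-extend mapper with a mapQ cutoff, and `start_len` is an assumed parameter; neither comes from the slides.

```python
def sample_mcas_candidates(read, maps_confidently, stride=500, start_len=50):
    """Return (start, end) intervals of read substrings that map confidently.

    Start positions are sampled every `stride` bases. From each start the
    substring length doubles until `maps_confidently` accepts it or the
    read end is reached. Doubling finds a length within a factor of two
    of the minimum, using a logarithmic number of mapping attempts.
    """
    found = []
    for i in range(0, len(read), stride):
        length = start_len
        while True:
            end = min(i + length, len(read))
            if maps_confidently(read[i:end]):
                found.append((i, end))
                break
            if end == len(read):
                break  # ran out of read without a confident mapping
            length *= 2
    return found


# Toy predicate: pretend any substring of at least 120 bases maps confidently.
intervals = sample_mcas_candidates("A" * 1100, lambda s: len(s) >= 120)
print(intervals)
```

The trade-off is that the reported substring is only approximately minimal, which is the price paid for avoiding one mapping attempt per added base.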
Result I (chr8: β-defensin locus)
[Figure: T2T CHM13 chromosome 8 with 3 large segmental duplications (544 kbp, 693 kbp and 644 kbp); axis labels 7.1M, 11.5M, 12.2M]
Logsdon et al. (2020) “The structure, function, and evolution of a complete human chromosome 8”
• Simulate 40x ONT reads
• Add an artificial 1 kbp deletion variant at position 12,000,000
• If mapped correctly, deletion should appear in overlapping alignments
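One way to check the last point is to scan each overlapping alignment's CIGAR string for a deletion of roughly the expected size at the expected reference position. The sketch below assumes a 0-based reference start and standard SAM CIGAR operators; the function names and tolerances are made up for the example and are not part of the experiment described on the slide.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operators that advance the reference coordinate


def deletions(ref_start, cigar, min_len=50):
    """List (ref_pos, length) for every deletion of at least min_len bases."""
    pos = ref_start
    found = []
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op == "D" and n >= min_len:
            found.append((pos, n))
        if op in REF_CONSUMING:
            pos += n
    return found


def supports_deletion(ref_start, cigar, target_pos, target_len, tol=100):
    """True if the alignment contains a deletion near target_pos of about target_len."""
    return any(abs(pos - target_pos) <= tol and abs(length - target_len) <= tol
               for pos, length in deletions(ref_start, cigar))


# A read starting 5 kbp upstream that spans the simulated 1 kbp deletion.
print(supports_deletion(11_995_000, "5000M1000D3000M", 12_000_000, 1000))
```

Soft clips and insertions do not advance the reference position, so the reported coordinate stays correct for clipped reads.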
Result I (chr8: β-defensin locus)
Winnowmap2 in action!
[Figure: dot-plot of MCAS alignments of a simulated ONT read against the chr8 reference sequence, with the simulated 1 kbp deletion marked]
Result I (chr8: β-defensin genes)
[Figure: IGV views of the alignments from WINNOWMAP2, NGMLR, MINIMAP2 and GRAPHMAP]
Result II (T2T chromosomes 8, X)
Setup:
• T2T CHM13 chromosome 8 (146 Mbp), T2T CHM13 chromosome X (154 Mbp)
• Mutate references using SURVIVOR (1000 indel SVs, 100 inv)
• Simulate reads (HiFi, ONT)
• Align reads -> Call SVs (Sniffles)
• Compare variants to ground truth
github.com/fritzsedlazeck/Sniffles fritzsedlazeck/SURVIVOR
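The comparison against ground truth could look roughly like the sketch below. The slides do not state how calls are matched or which denominators are used for the percentages, so the matching rule (same chromosome and SV type, breakpoint within `tol` bases) and the denominators (calls for false positives, truth set for false negatives) are assumptions, as are the names.

```python
def sv_error_rates(truth, calls, tol=1000):
    """Return (false_positive_pct, false_negative_pct).

    truth and calls are lists of (chrom, pos, svtype) tuples. A call
    matches a true variant if chromosome and type agree and the
    positions differ by at most `tol` bases.
    """
    def matches(a, b):
        return a[0] == b[0] and a[2] == b[2] and abs(a[1] - b[1]) <= tol

    fp = sum(1 for c in calls if not any(matches(c, t) for t in truth))
    fn = sum(1 for t in truth if not any(matches(t, c) for c in calls))
    fp_pct = 100.0 * fp / len(calls) if calls else 0.0
    fn_pct = 100.0 * fn / len(truth) if truth else 0.0
    return fp_pct, fn_pct


# Made-up toy variants, only to exercise the function.
truth = [("chr8", 1000, "DEL"), ("chr8", 50000, "INS"),
         ("chr8", 90000, "INV"), ("chr8", 200000, "DEL")]
calls = [("chr8", 1200, "DEL"), ("chr8", 50010, "INS"),
         ("chr8", 300000, "DEL"), ("chr8", 90000, "DEL")]
print(sv_error_rates(truth, calls))
```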
Result II (T2T chromosomes 8, X)
[Figure: false positives (%) and false negatives (%) of SV calls for winnowmap2, winnowmap, ngmlr and minimap2, using hifi 20x, hifi 40x, ont 20x and ont 40x reads; panels: chromosome 8, chromosome X, chromosome 8 (repeats), chromosome X (repeats)]
repeats: ≥95% identity, ≥10 kbp length
Result III (GIAB SV callset)
• Excludes >10 kbp-sized repeats of the human genome
• Results comparable to minimap2
[Figure: false negative and false positive rates (%) for hifi 35x, ont 35x and ont 50x]
[Figure: Time and memory usage]
Zook et al. (2020) “A robust benchmark for detection of germline large deletions and insertions”
Conclusions
• Allelic bias: highest scoring alignment ≠ correct read placement
• Minimal confidently alignable substrings can be mapped independently of non-reference bases
• MCASs are more tolerant of structural variation and more sensitive to paralog-specific variants
• Winnowmap2 enables superior downstream variant call accuracy in complex repeats
Arang Rhie, Nancy Hansen, Sergey Koren, Adam Phillippy
doi.org/10.1101/2020.11.01.363887
chirag@iisc.ac.in
github.com/marbl/Winnowmap
