Graph and assembly strategies for the
MHC and ribosomal DNA regions
Alexander Dilthey
The MHC is the zebrafish of the genome!
(model region)
PRGs – Population Reference Graphs
• Simple: acyclic, directed (sub-class of general variation graphs)
• Usually built from MSA, preserve gap positions
(i.e. global homology between input sequences).
• Generative model: Recombination
• Ploidy well-defined (0, 1, 2)
[Figure: example PRG drawn as a directed acyclic graph; branch nodes include TA, CT, A, G, C and gap (_) symbols]
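The PRG properties listed above can be illustrated with a minimal sketch. It assumes a toy two-row MSA (not from the slides) and a simple column-wise construction; the names build_prg, is_path and spell are hypothetical and this is not the MHC-PRG implementation.

```python
from collections import defaultdict

def build_prg(msa):
    """Build a column-wise directed acyclic graph from equal-length MSA rows.

    Nodes are (column, symbol) pairs; '_' is kept as an explicit gap symbol
    so that the column structure (global homology) of the MSA is preserved.
    Edges link the symbols seen in adjacent columns of the same input row.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")
    levels = [sorted({row[col] for row in msa}) for col in range(length)]
    edges = defaultdict(set)
    for row in msa:
        for col in range(length - 1):
            edges[(col, row[col])].add((col + 1, row[col + 1]))
    return levels, edges

def is_path(path, edges):
    """True if every consecutive pair of nodes in the path is joined by an edge."""
    return all(b in edges[a] for a, b in zip(path, path[1:]))

def spell(path):
    """Spell the haplotype sequence of a path, dropping gap symbols."""
    return "".join(sym for _, sym in path if sym != "_")

# Toy input (assumption): two aligned haplotypes with one gap column.
msa = ["AC_GT", "ATCGA"]
levels, edges = build_prg(msa)

# A recombinant haplotype: follows row 1 up to column 3, then switches to row 2.
recombinant = [(0, "A"), (1, "C"), (2, "_"), (3, "G"), (4, "A")]
```

Because edges only ever point from one column to the next, the graph is acyclic by construction, and a diploid sample corresponds to two paths through it.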
Outline
• Quick recap:
What we know about the utility of graph genome approaches
• New results:
Haplotyping in hypervariable regions (HLA)
Pseudo graph alignment
• De novo assembly of ribosomal DNA
In most of the MHC, single-reference
approaches work just fine…
[Figure: bar chart of the number of kmers (millions), recovered vs. not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped]
+ long-read validation with consistent results (not shown)
Dilthey et al., Nature Genetics 2015
… graph genomes outperform in the most
complex sub-region of the MHC …
Dilthey et al., Nature Genetics 2015
… remaining problems driven by incomplete
input haplotypes + algorithmics.
[Figure: aligned kmers; read position (kb) plotted against chromotype position (kb)]
Incomplete input haplotypes:
Large uncharacterized inversion
Algorithmics:
Incorrect HLA haplotyping.
Dilthey et al., Nature Genetics 2015
HLA haplotyping
• Hypothesis: Whole-genome sequencing data contains the information
necessary for accurate HLA typing
• “HLA typing”: HLA gene exon sequences
• HLA class I: exons 2 and 3
• HLA class II: exon 2
• Challenge: align reads to the right gene – homology hell.
• Proper read-to-graph alignment instead of k-Mers.
Class I exon homology
Exon 2 Exon 3
HLA-A 3284 alleles
HLA-B 4077 alleles
HLA-C 2799 alleles
Approach: deep PRG + mapping
Exonic MSA
T*01:01 _ _ A C G T A C T _ _
T*01:02 C A A C A T A C T _ _
T*01:03 _ _ A C G C G C T _ _
T*01:04 _ _ A T C C G C T A C
T*01:05 _ _ A T C C C C T _ _
T*01:06 _ _ _ C C T A C T _ _
Genomic MSA
T*01:01 A G C A _ _ A C G T A C T _ _ C C T A
T*01:02 A C C A C A A C A T A C T _ _ C C T A
T*01:04 _ T T A _ _ A T C C G C T A C C C T A
8 xMHC reference haplotypes
PGF (with T*01:01) A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
MANN (with T*01:04) T T T _ T T A _ _ A T C C G C T A C C C T A T G A
1) Gene-only PRG – 46 (pseudo) genes, mostly HLA
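[Figure: Gene 1, Gene 2 and Gene 3 separated by NNN padding; each gene laid out as Padding, UTR, Exon 1, Intron 1, Exon 2, UTR, Padding; a second panel shows the number of reference sequences along the gene, with the region covered by 'genomic' sequences marked]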
2) Varying numbers of input sequences across PRG
3) Use hierarchical MSA approach to combine in
Approach: deep PRG + mapping
[Figure: a read aligned to the PRG, level by level (Level 1, 2, 3, …), passing through gap (_) nodes and an ambiguous {G, C} position]
4) Seed-and-extend paired-end mapping to PRG
5) Likelihood-based inference: maximize L( aligned reads | HLA types )
(independently per locus)
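Step 5 can be sketched as follows. This is a minimal illustration, assuming that each aligned read comes with a per-allele probability P(read | allele) and that a read is drawn from either allele of a diploid genotype with probability 1/2; the read probabilities below are made up, and the function names are hypothetical rather than taken from HLA*PRG.

```python
import itertools
import math

def log_likelihood(read_probs, allele_a, allele_b):
    """log L(aligned reads | {allele_a, allele_b}) under a 50/50 mixture of the two alleles."""
    total = 0.0
    for per_allele in read_probs:
        total += math.log(0.5 * per_allele[allele_a] + 0.5 * per_allele[allele_b])
    return total

def best_genotype(read_probs, alleles):
    """Return the unordered allele pair that maximizes the read likelihood at one locus."""
    pairs = itertools.combinations_with_replacement(alleles, 2)
    return max(pairs, key=lambda pair: log_likelihood(read_probs, *pair))

# Toy locus "T" (allele names reused from the MSA example; probabilities are invented).
alleles = ["T*01:01", "T*01:02", "T*01:04"]
read_probs = [
    {"T*01:01": 0.90, "T*01:02": 0.01, "T*01:04": 0.01},
    {"T*01:01": 0.90, "T*01:02": 0.01, "T*01:04": 0.01},
    {"T*01:01": 0.01, "T*01:02": 0.01, "T*01:04": 0.90},
    {"T*01:01": 0.01, "T*01:02": 0.01, "T*01:04": 0.90},
]
genotype = best_genotype(read_probs, alleles)
```

Running the search once per locus, with only the reads aligned to that locus, mirrors the "independently per locus" note above.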
High-quality WGS data enables gold-standard
accuracy
(of note: 2/3 original discrepancies with validation data were errors in the validation data!)
… but not from exome, MiSeq data
Sequencing error?
Effective fragment length? [2 x read length + IS]
Conclusion (intermediate)
• If the input sequencing data is “good enough”, we manage near-perfect
haplotyping in the genome’s most polymorphic region
• Effective fragment length likely the most important factor
• Not-so-good sequencing data: joint haplotyping + alignment
(i.e. alignment location is not independent of inferred haplotype)
• Read mapping implementation SLOW
Pseudo graph mapping
Input sequences
Graph
Align short reads to input sequences...
... transpose onto graph
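The "transpose onto graph" step can be sketched as a coordinate lift. The sketch assumes that the graph keeps the MSA column coordinates of its input sequences and that the linear alignment is gapless; the function names are hypothetical, and Seq1 is the toy row from the scrubbing example.

```python
def sequence_to_msa_columns(aligned_row, gap="_"):
    """For each ungapped position of an input sequence, return its 1-based MSA column."""
    return [col for col, sym in enumerate(aligned_row, start=1) if sym != gap]

def transpose(read_start, read_length, aligned_row):
    """Lift a gapless linear alignment (1-based start on the ungapped sequence)
    onto the MSA/graph columns of that input sequence."""
    columns = sequence_to_msa_columns(aligned_row)
    return columns[read_start - 1 : read_start - 1 + read_length]

# Seq1 as it appears in the MSA; ungapped it reads AACACTTT.
seq1_row = "AACAC_TTT"

# A 4-base read aligned at position 4 of ungapped Seq1 (bases ACTT).
graph_columns = transpose(4, 4, seq1_row)
```

The read skips MSA column 6 because Seq1 has a gap there; handling such jumps, and INDELs in the linear alignment itself, is what the later scrubbing, cutting and cleaning steps are for.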
Scrubbing, cutting, cleaning
              Input MSA    Lin. alignment    MSA coor.     Scrubbed
Coordinates   123456789                      123456X789    123456789
Seq1          AACAC_TTT    Seq1 AACAC_TTT    AACAC__TTT    AACAC_TTT
Seq2 / Read   TTCACGTTT    Read AACACGTTT    AACAC_GTTT    AACACGTTT
[Figure: graph TTCAC, then a branch (_ or G), then TTT]
Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system
Cutting: Examine alignment gap structure; cut in “bad” areas; use longest stretch
Cleaning: Find the best gap-less sequence-to-graph alignment + extension with gaps
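The scrubbing step can be sketched on the toy example above. This is one possible reading of the slide, assuming a read insertion shows up as an extra column (labelled X) and can be folded into the directly preceding MSA column whenever the read has a gap there; the function name and the X label convention are hypothetical.

```python
def scrub(columns, read_row, gap="_", inserted="X"):
    """Fold read insertions into the preceding MSA column when the read has a gap there.

    columns  : one label per alignment column; `inserted` marks columns that
               exist only because of an INDEL in the linear alignment
    read_row : the read in the same column coordinates
    Inserted columns that cannot be folded are left in place.
    """
    out_cols, out_read = [], []
    for label, base in zip(columns, read_row):
        if label == inserted and out_read and out_read[-1] == gap:
            out_read[-1] = base  # reuse the existing MSA column instead of the extra one
        else:
            out_cols.append(label)
            out_read.append(base)
    return "".join(out_cols), "".join(out_read)

# Read in MSA coordinates before scrubbing: the inserted G sits in the extra column X.
scrubbed_columns, scrubbed_read = scrub("123456X789", "AACAC_GTTT")
```

After scrubbing, the read occupies the original nine MSA columns and its G lands in column 6, where Seq2 already carries a G in the graph.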
Graph alignment
123456789
Graph AACACGTTT
Seq1 AACACGTTT
Accuracy slightly worse; fast!
Conclusion: perhaps there is a middle ground between graph and linear sequence
alignment. Work in progress. Further tuning?
Highest resolution

Cohort          Locus   N     MHC-PRG-2                          HLA*PRG
                              Inferred   Accuracy   Call Rate    Inferred   Accuracy   Call Rate
Platinum Trio   A       6     6          1.00       1.00         6          1.00       1.00
Platinum Trio   B       6     6          1.00       1.00         6          1.00       1.00
Platinum Trio   C       6     6          1.00       1.00         6          1.00       1.00
Platinum Trio   DQA1    6     6          1.00       1.00         6          1.00       1.00
Platinum Trio   DQB1    6     6          1.00       1.00         6          1.00       1.00
Platinum Trio   DRB1    6     6          1.00       1.00         6          1.00       1.00
1000 Genomes    A       22    22         0.86       1.00         22         1.00       1.00
1000 Genomes    B       22    22         1.00       1.00         22         1.00       1.00
1000 Genomes    C       22    22         1.00       1.00         22         1.00       1.00
1000 Genomes    DQA1    12    12         1.00       1.00         12         1.00       1.00
1000 Genomes    DQB1    22    22         1.00       1.00         22         1.00       1.00
1000 Genomes    DRB1    22    22         0.91       1.00         22         0.95       1.00
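The two metrics in the table can be computed as sketched below. The definitions are assumptions, since the slides do not spell them out: call rate is taken as called alleles over all alleles, and accuracy as correct calls over called alleles. The allele labels and the count of 19 correct calls are purely illustrative.

```python
def typing_metrics(truth, calls):
    """Return (accuracy, call_rate), both rounded to two decimals.

    truth : validated allele per chromosome copy
    calls : inferred allele per chromosome copy, or None where no call was made
    """
    called = [(t, c) for t, c in zip(truth, calls) if c is not None]
    call_rate = len(called) / len(truth)
    accuracy = sum(t == c for t, c in called) / len(called) if called else float("nan")
    return round(accuracy, 2), round(call_rate, 2)

# Illustrative only: 22 alleles, all called, 19 of them matching the validation data.
truth = ["allele_x"] * 22
calls = ["allele_x"] * 19 + ["allele_y"] * 3
metrics = typing_metrics(truth, calls)
```

Under these assumed definitions, 19 correct calls out of 22 would round to an accuracy of 0.86 at a call rate of 1.00.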
Towards additional high-quality reference
haplotypes…
Remaining challenges: extreme repeats, haplotypes.
Sergey Koren
Ribosomal DNA
• Encodes ribosomal RNA
• Hundreds of copies
(tandem repeat arrays)
• Variation poorly characterized
• Step 1: Targeted approach
• Step 2: WGS-based
• Step 3: Variation graph
Read error vs variation
… from whole-genome data?
Long reads  de Bruijn graph Technology!
6% > 50k
Summary
• Variation graphs are worth the effort – at least in highly complex regions.
• Evidence: MHC “model system”
+ overall improvement of genome inference accuracy
+ complex-locus haplotyping
• Incorporate LD?
• Middle ground between full graph alignment and linear sequence
alignment?
• Ribosomal DNA – let me know if you’re also interested!
Acknowledgements
NIH
Adam Phillippy
Sergey Koren
Brian Walenz
Jung-Hyun Kim
Vladimir Larionov
Oxford
Gil McVean
Zam Iqbal
Alexander Mentzer
Histogenetics
Nezih Cereb
UCSF/Nantes
Pierre-Antoine Gourraud
GSK
Matt Nelson
Charles Cox

More Related Content

Viewers also liked

Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
Genome Reference Consortium
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
Genome Reference Consortium
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
Genome Reference Consortium
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
Genome Reference Consortium
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
Genome Reference Consortium
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
Genome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
Genome Reference Consortium
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
Genome Reference Consortium
 
Everyday de novo assembly
Everyday de novo assemblyEveryday de novo assembly
Everyday de novo assembly
Genome Reference Consortium
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
Genome Reference Consortium
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
Genome Reference Consortium
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
Genome Reference Consortium
 
agbt 2016 workshop church
agbt 2016 workshop churchagbt 2016 workshop church
agbt 2016 workshop church
Genome Reference Consortium
 
AQA A2 Psychology Unit 4 - Schizophrenia
AQA A2 Psychology Unit 4 - SchizophreniaAQA A2 Psychology Unit 4 - Schizophrenia
AQA A2 Psychology Unit 4 - Schizophrenia
Snowfairy007
 

Viewers also liked (14)

Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
Everyday de novo assembly
Everyday de novo assemblyEveryday de novo assembly
Everyday de novo assembly
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
agbt 2016 workshop church
agbt 2016 workshop churchagbt 2016 workshop church
agbt 2016 workshop church
 
AQA A2 Psychology Unit 4 - Schizophrenia
AQA A2 Psychology Unit 4 - SchizophreniaAQA A2 Psychology Unit 4 - Schizophrenia
AQA A2 Psychology Unit 4 - Schizophrenia
 

Similar to Graph and assembly strategies for the MHC and ribosomal DNA regions

20100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture0720100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture07Computer Science Club
 
DNA Compression (Encoded using Huffman Encoding Method)
DNA Compression (Encoded using Huffman Encoding Method)DNA Compression (Encoded using Huffman Encoding Method)
DNA Compression (Encoded using Huffman Encoding Method)
Marwa Al-Rikaby
 
An Efficient Biological Sequence Compression Technique Using LUT and Repeat ...
An Efficient Biological Sequence Compression Technique Using  LUT and Repeat ...An Efficient Biological Sequence Compression Technique Using  LUT and Repeat ...
An Efficient Biological Sequence Compression Technique Using LUT and Repeat ...
IOSR Journals
 
High throughput qPCR: tips for analysis across multiple plates
High throughput qPCR: tips for analysis across multiple platesHigh throughput qPCR: tips for analysis across multiple plates
High throughput qPCR: tips for analysis across multiple plates
Integrated DNA Technologies
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
Deanna Church
 
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...Christian Have
 
Introducing data analysis: reads to results
Introducing data analysis: reads to resultsIntroducing data analysis: reads to results
Introducing data analysis: reads to results
AGRF_Ltd
 
Characterization of Novel ctDNA Reference Materials Developed using the Genom...
Characterization of Novel ctDNA Reference Materials Developed using the Genom...Characterization of Novel ctDNA Reference Materials Developed using the Genom...
Characterization of Novel ctDNA Reference Materials Developed using the Genom...
Thermo Fisher Scientific
 
Wang labsummer2010
Wang labsummer2010Wang labsummer2010
Wang labsummer2010
russodl
 
LPEI_ZCNI_Poster
LPEI_ZCNI_PosterLPEI_ZCNI_Poster
LPEI_ZCNI_PosterLong Pei
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Torsten Seemann
 
Paired-end alignments in sequence graphs
Paired-end alignments in sequence graphsPaired-end alignments in sequence graphs
Paired-end alignments in sequence graphs
Chirag Jain
 
Daly altshuler.labmeeting
Daly altshuler.labmeetingDaly altshuler.labmeeting
Daly altshuler.labmeeting
Manuel Rivas
 
Computational Chemistry Robots
Computational Chemistry RobotsComputational Chemistry Robots
Computational Chemistry Robots
University of Cambridge
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 
Rt2 pcr arraydataanalysisquickcarde
Rt2 pcr arraydataanalysisquickcardeRt2 pcr arraydataanalysisquickcarde
Rt2 pcr arraydataanalysisquickcardeElsa von Licy
 

Similar to Graph and assembly strategies for the MHC and ribosomal DNA regions (20)

Biochip
BiochipBiochip
Biochip
 
20100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture0720100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture07
 
DNA Compression (Encoded using Huffman Encoding Method)
DNA Compression (Encoded using Huffman Encoding Method)DNA Compression (Encoded using Huffman Encoding Method)
DNA Compression (Encoded using Huffman Encoding Method)
 
An Efficient Biological Sequence Compression Technique Using LUT and Repeat ...
An Efficient Biological Sequence Compression Technique Using  LUT and Repeat ...An Efficient Biological Sequence Compression Technique Using  LUT and Repeat ...
An Efficient Biological Sequence Compression Technique Using LUT and Repeat ...
 
High throughput qPCR: tips for analysis across multiple plates
High throughput qPCR: tips for analysis across multiple platesHigh throughput qPCR: tips for analysis across multiple plates
High throughput qPCR: tips for analysis across multiple plates
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
 
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
 
ETH_SymposiumCR
ETH_SymposiumCRETH_SymposiumCR
ETH_SymposiumCR
 
Introducing data analysis: reads to results
Introducing data analysis: reads to resultsIntroducing data analysis: reads to results
Introducing data analysis: reads to results
 
Characterization of Novel ctDNA Reference Materials Developed using the Genom...
Characterization of Novel ctDNA Reference Materials Developed using the Genom...Characterization of Novel ctDNA Reference Materials Developed using the Genom...
Characterization of Novel ctDNA Reference Materials Developed using the Genom...
 
Wang labsummer2010
Wang labsummer2010Wang labsummer2010
Wang labsummer2010
 
LPEI_ZCNI_Poster
LPEI_ZCNI_PosterLPEI_ZCNI_Poster
LPEI_ZCNI_Poster
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
 
Abrf poster2007
Abrf poster2007Abrf poster2007
Abrf poster2007
 
Paired-end alignments in sequence graphs
Paired-end alignments in sequence graphsPaired-end alignments in sequence graphs
Paired-end alignments in sequence graphs
 
Daly altshuler.labmeeting
Daly altshuler.labmeetingDaly altshuler.labmeeting
Daly altshuler.labmeeting
 
Computational Chemistry Robots
Computational Chemistry RobotsComputational Chemistry Robots
Computational Chemistry Robots
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Rt2 pcr arraydataanalysisquickcarde
Rt2 pcr arraydataanalysisquickcardeRt2 pcr arraydataanalysisquickcarde
Rt2 pcr arraydataanalysisquickcarde
 
Notes on Mutation
Notes on MutationNotes on Mutation
Notes on Mutation
 

More from Genome Reference Consortium

Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
Genome Reference Consortium
 
What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?
Genome Reference Consortium
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
Genome Reference Consortium
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
Genome Reference Consortium
 
Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
Genome Reference Consortium
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
Genome Reference Consortium
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
Genome Reference Consortium
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
Genome Reference Consortium
 
Mane v2 final
Mane v2 finalMane v2 final
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
Genome Reference Consortium
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
Genome Reference Consortium
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
Genome Reference Consortium
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copy
Genome Reference Consortium
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
Genome Reference Consortium
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
Genome Reference Consortium
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
Genome Reference Consortium
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
Genome Reference Consortium
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
Genome Reference Consortium
 

More from Genome Reference Consortium (18)

Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 
What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?What's new and what's next for the human reference assembly?
What's new and what's next for the human reference assembly?
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
Mane v2 final
Mane v2 finalMane v2 final
Mane v2 final
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
 
2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final2018 1016 trio_binning_ashg_arhie_final
2018 1016 trio_binning_ashg_arhie_final
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copy
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
Ashg sedlazeck grc_share
Ashg sedlazeck grc_shareAshg sedlazeck grc_share
Ashg sedlazeck grc_share
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 

Recently uploaded

Identification and nursing management of congenital malformations .pptx
Identification and nursing management of congenital malformations .pptxIdentification and nursing management of congenital malformations .pptx
Identification and nursing management of congenital malformations .pptx
MGM SCHOOL/COLLEGE OF NURSING
 
Basavarajeeyam - Ayurvedic heritage book of Andhra pradesh
Basavarajeeyam - Ayurvedic heritage book of Andhra pradeshBasavarajeeyam - Ayurvedic heritage book of Andhra pradesh
Basavarajeeyam - Ayurvedic heritage book of Andhra pradesh
Dr. Madduru Muni Haritha
 
Non-respiratory Functions of the Lungs.pdf
Non-respiratory Functions of the Lungs.pdfNon-respiratory Functions of the Lungs.pdf
Non-respiratory Functions of the Lungs.pdf
MedicoseAcademics
 
Aortic Association CBL Pilot April 19 – 20 Bern
Aortic Association CBL Pilot April 19 – 20 BernAortic Association CBL Pilot April 19 – 20 Bern
Aortic Association CBL Pilot April 19 – 20 Bern
suvadeepdas911
 
Role of Mukta Pishti in the Management of Hyperthyroidism
Role of Mukta Pishti in the Management of HyperthyroidismRole of Mukta Pishti in the Management of Hyperthyroidism
Role of Mukta Pishti in the Management of Hyperthyroidism
Dr. Jyothirmai Paindla
 
Antimicrobial stewardship to prevent antimicrobial resistance
Antimicrobial stewardship to prevent antimicrobial resistanceAntimicrobial stewardship to prevent antimicrobial resistance
Antimicrobial stewardship to prevent antimicrobial resistance
GovindRankawat1
 
Top-Vitamin-Supplement-Brands-in-India List
Top-Vitamin-Supplement-Brands-in-India ListTop-Vitamin-Supplement-Brands-in-India List
Top-Vitamin-Supplement-Brands-in-India List
SwisschemDerma
 
Hemodialysis: Chapter 4, Dialysate Circuit - Dr.Gawad
Hemodialysis: Chapter 4, Dialysate Circuit - Dr.GawadHemodialysis: Chapter 4, Dialysate Circuit - Dr.Gawad
Hemodialysis: Chapter 4, Dialysate Circuit - Dr.Gawad
NephroTube - Dr.Gawad
 
Management of Traumatic Splenic injury.pptx
Management of Traumatic Splenic injury.pptxManagement of Traumatic Splenic injury.pptx
Management of Traumatic Splenic injury.pptx
AkshaySarraf1
 
Tests for analysis of different pharmaceutical.pptx
Tests for analysis of different pharmaceutical.pptxTests for analysis of different pharmaceutical.pptx
Tests for analysis of different pharmaceutical.pptx
taiba qazi
 
share - Lions, tigers, AI and health misinformation, oh my!.pptx
share - Lions, tigers, AI and health misinformation, oh my!.pptxshare - Lions, tigers, AI and health misinformation, oh my!.pptx
share - Lions, tigers, AI and health misinformation, oh my!.pptx
Tina Purnat
 
Integrating Ayurveda into Parkinson’s Management: A Holistic Approach
Integrating Ayurveda into Parkinson’s Management: A Holistic ApproachIntegrating Ayurveda into Parkinson’s Management: A Holistic Approach
Integrating Ayurveda into Parkinson’s Management: A Holistic Approach
Ayurveda ForAll
 
NVBDCP.pptx Nation vector borne disease control program
NVBDCP.pptx Nation vector borne disease control programNVBDCP.pptx Nation vector borne disease control program
NVBDCP.pptx Nation vector borne disease control program
Sapna Thakur
 
The Best Ayurvedic Antacid Tablets in India
The Best Ayurvedic Antacid Tablets in IndiaThe Best Ayurvedic Antacid Tablets in India
The Best Ayurvedic Antacid Tablets in India
Swastik Ayurveda
 
Cardiac Assessment for B.sc Nursing Student.pdf
Cardiac Assessment for B.sc Nursing Student.pdfCardiac Assessment for B.sc Nursing Student.pdf
Cardiac Assessment for B.sc Nursing Student.pdf
shivalingatalekar1
 
Local Advanced Lung Cancer: Artificial Intelligence, Synergetics, Complex Sys...
Local Advanced Lung Cancer: Artificial Intelligence, Synergetics, Complex Sys...Local Advanced Lung Cancer: Artificial Intelligence, Synergetics, Complex Sys...
Local Advanced Lung Cancer: Artificial Intelligence, Synergetics, Complex Sys...
Oleg Kshivets
 
Efficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in AyurvedaEfficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in Ayurveda
Dr. Jyothirmai Paindla
 
The Electrocardiogram - Physiologic Principles
The Electrocardiogram - Physiologic PrinciplesThe Electrocardiogram - Physiologic Principles
The Electrocardiogram - Physiologic Principles
MedicoseAcademics
 
Netter's Atlas of Human Anatomy 7.ed.pdf
Netter's Atlas of Human Anatomy 7.ed.pdfNetter's Atlas of Human Anatomy 7.ed.pdf
Netter's Atlas of Human Anatomy 7.ed.pdf
BrissaOrtiz3
 
micro teaching on communication m.sc nursing.pdf
micro teaching on communication m.sc nursing.pdfmicro teaching on communication m.sc nursing.pdf
micro teaching on communication m.sc nursing.pdf
Anurag Sharma
 

Recently uploaded (20)

Identification and nursing management of congenital malformations .pptx
Identification and nursing management of congenital malformations .pptxIdentification and nursing management of congenital malformations .pptx
Identification and nursing management of congenital malformations .pptx
 
Basavarajeeyam - Ayurvedic heritage book of Andhra pradesh
Basavarajeeyam - Ayurvedic heritage book of Andhra pradeshBasavarajeeyam - Ayurvedic heritage book of Andhra pradesh
Basavarajeeyam - Ayurvedic heritage book of Andhra pradesh
 
Non-respiratory Functions of the Lungs.pdf
Non-respiratory Functions of the Lungs.pdfNon-respiratory Functions of the Lungs.pdf

Graph and assembly strategies for the MHC and ribosomal DNA regions

  • 1. Graph and assembly strategies for the MHC and ribosomal DNA regions Alexander Dilthey
  • 2. The MHC is the zebrafish of the genome! (model region)
  • 3. PRGs – Population Reference Graphs • Simple: acyclic, directed (sub-class of general variation graphs) • Usually built from MSA, preserve gap positions (i.e. global homology between input sequences). • Generative model: Recombination • Ploidy well-defined (0, 1, 2) [Figure: example PRG with nucleotide and gap (_) nodes]
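
The construction described on slide 3 can be sketched in a few lines of Python. This is a minimal illustration and not the MHC-PRG implementation: it assumes one graph level per MSA column, one node per distinct symbol in that column (the gap `_` included), and an edge wherever two symbols are adjacent in at least one input row. The names `build_prg` and `count_paths` are hypothetical; the toy rows are the exonic MSA shown on slide 10.

```python
from collections import defaultdict

# Toy exonic MSA from slide 10 ('_' marks a gap position).
MSA = {
    "T*01:01": "__ACGTACT__",
    "T*01:02": "CAACATACT__",
    "T*01:03": "__ACGCGCT__",
    "T*01:04": "__ATCCGCTAC",
    "T*01:05": "__ATCCCCT__",
    "T*01:06": "___CCTACT__",
}


def build_prg(msa):
    """Return (nodes, edges) of a column-wise acyclic, directed graph.

    nodes[i] is the sorted list of symbols seen in MSA column i.
    edges[(i, sym)] is the set of symbols in column i + 1 that follow
    `sym` in at least one input sequence.
    """
    rows = list(msa.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all MSA rows must have the same length")
    nodes = [sorted({row[i] for row in rows}) for i in range(length)]
    edges = defaultdict(set)
    for row in rows:
        for i in range(length - 1):
            edges[(i, row[i])].add(row[i + 1])
    return nodes, edges


def count_paths(nodes, edges):
    """Count source-to-sink paths by dynamic programming from the last level."""
    ways = {sym: 1 for sym in nodes[-1]}
    for i in range(len(nodes) - 2, -1, -1):
        ways = {sym: sum(ways[nxt] for nxt in edges[(i, sym)]) for sym in nodes[i]}
    return sum(ways.values())


if __name__ == "__main__":
    nodes, edges = build_prg(MSA)
    print(nodes[4])                   # symbols observed in the fifth column
    print(count_paths(nodes, edges))  # number of haplotypes the graph can generate
```

Because every level is shared by all inputs, the graph encodes more paths than the six input rows, which is the sense in which recombination acts as the generative model.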
  • 4. Outline • Quick recap: What we know about the utility of graph genome approaches • New results: Haplotyping in hypervariable regions (HLA) Pseudo graph alignment • De novo assembly of ribosomal DNA
  • 5. In most of the MHC, single-reference approaches work just fine… [Bar chart: number of k-mers (millions), recovered vs. not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped] + long-read validation with consistent results (not shown) Dilthey et al., Nature Genetics 2015
  • 6. … graph genomes outperform in the most complex sub-region of the MHC … Dilthey et al., Nature Genetics 2015
  • 7. … remaining problems driven by incomplete input haplotypes + algorithmics. [Dot plot of aligned k-mers: chromotype position (kb) vs. read position (kb)] Incomplete input haplotypes: Large uncharacterized inversion. Algorithmics: Incorrect HLA haplotyping. Dilthey et al., Nature Genetics 2015
  • 8. HLA haplotyping • Hypothesis: Whole-genome sequencing data contains the information necessary for accurate HLA typing • “HLA typing”: HLA gene exon sequences • HLA class I: exons 2 and 3 • HLA class II: exon 2 • Challenge: align reads to the right gene – homology hell. • Proper read-to-graph alignment instead of k-mers.
  • 9. Class I exon homology [Figure: exon 2 and exon 3] • HLA-A: 3284 alleles • HLA-B: 4077 alleles • HLA-C: 2799 alleles
  • 10. Approach: deep PRG + mapping
      Exonic MSA
        T*01:01  _ _ A C G T A C T _ _
        T*01:02  C A A C A T A C T _ _
        T*01:03  _ _ A C G C G C T _ _
        T*01:04  _ _ A T C C G C T A C
        T*01:05  _ _ A T C C C C T _ _
        T*01:06  _ _ _ C C T A C T _ _
      Genomic MSA
        T*01:01  A G C A _ _ A C G T A C T _ _ C C T A
        T*01:02  A C C A C A A C A T A C T _ _ C C T A
        T*01:04  _ T T A _ _ A T C C G C T A C C C T A
      8 xMHC reference haplotypes
        PGF (with T*01:01)   A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
        MANN (with T*01:04)  T T T _ T T A _ _ A T C C G C T A C C C T A T G A
      1) Gene-only PRG – 46 (pseudo) genes, mostly HLA
         [Figure: Gene 1 |--NNN--| Gene 2 |--NNN--| Gene 3; per gene: Padding, UTR, Exon 1, Intron 1, Exon 2, UTR, Padding]
      2) Varying numbers of input sequences across PRG
         [Figure: number of reference sequences; region covered by 'genomic' sequences]
      3) Use hierarchical MSA approach to combine in
  • 11. Approach: deep PRG + mapping [Figure: an aligned read placed on graph levels 1–26] 4) Seed-and-extend paired-end mapping to PRG 5) Likelihood-based inference: maximize L( aligned reads | HLA types ) (independently per locus)
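
Step 5 on slide 11 can be illustrated with a toy maximum-likelihood genotype caller. The slide only states the objective, maximize L( aligned reads | HLA types ) per locus, so everything else here is an assumption made for the sketch: a uniform per-base error rate `err`, gap-free reads with known start positions, each read drawn with probability 0.5 from either allele, and the hypothetical function names. The alleles are the ungapped cores (MSA columns 3 to 9) of four rows of the slide 10 exonic MSA.

```python
import math
from itertools import combinations_with_replacement

# Ungapped core of four toy alleles from slide 10.
ALLELES = {
    "T*01:01": "ACGTACT",
    "T*01:03": "ACGCGCT",
    "T*01:04": "ATCCGCT",
    "T*01:05": "ATCCCCT",
}

# (read sequence, 0-based start on the allele); illustrative data only.
READS = [("ACGT", 0), ("TACT", 3), ("ATCC", 0), ("CCCT", 3)]


def read_loglik(read, start, allele, err=0.01):
    """log P(read | allele) under an assumed uniform substitution-error model."""
    segment = allele[start:start + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the allele")
    return sum(
        math.log(1 - err) if r == a else math.log(err / 3)
        for r, a in zip(read, segment)
    )


def pair_loglik(reads, allele1, allele2, err=0.01):
    """log L(reads | diploid genotype), each read from either allele with prob. 0.5."""
    total = 0.0
    for read, start in reads:
        l1 = read_loglik(read, start, allele1, err)
        l2 = read_loglik(read, start, allele2, err)
        top = max(l1, l2)  # log-sum-exp for numerical stability
        total += top + math.log(0.5 * math.exp(l1 - top) + 0.5 * math.exp(l2 - top))
    return total


def call_genotype(reads, alleles, err=0.01):
    """Return the allele pair with the highest likelihood for one locus."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return max(pairs, key=lambda p: pair_loglik(reads, alleles[p[0]], alleles[p[1]], err))


if __name__ == "__main__":
    print(call_genotype(READS, ALLELES))
```

With these four reads, only the pair T*01:01 / T*01:05 explains every read without a mismatch, so it receives the highest likelihood.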
  • 12. High-quality WGS data enables gold-standard accuracy (of note: 2/3 original discrepancies with validation data were errors in the validation data!)
  • 13. … but not from exome, MiSeq data
  • 15. Effective fragment length? [2 x read length + IS]
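
The bracketed formula on slide 15 is simple arithmetic. The sketch below assumes that "IS" denotes the insert between the two mates (the part of the fragment not covered by either read), which is what the formula implies; the function name and the example library parameters are hypothetical and not taken from the talk.

```python
def effective_fragment_length(read_length, insert_size):
    """Slide 15 formula: 2 x read length + IS (all values in bp)."""
    return 2 * read_length + insert_size


if __name__ == "__main__":
    # Two made-up paired-end libraries, for comparison only.
    for name, read_length, insert_size in [("2x100 bp", 100, 300), ("2x250 bp", 250, 50)]:
        print(name, effective_fragment_length(read_length, insert_size))
```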
  • 16. Conclusion (intermediate) • If the input sequencing data is "good enough", we manage near-perfect haplotyping in the genome's most polymorphic region • Effective fragment length likely the most important factor • Not-so-good sequencing data: joint haplotyping + alignment (i.e. alignment location is not independent of inferred haplotype) • Read mapping implementation SLOW
  • 18. Pseudo graph mapping Input sequences Graph
  • 19. Pseudo graph mapping Input sequences Graph Align short reads to input sequences...
  • 20. Pseudo graph mapping Input sequences Graph Align short reads to input sequences... ... transpose onto graph
  • 21. Scrubbing, cutting, cleaning
      Input MSA (columns 123456789)
        Seq1  AACAC_TTT
        Seq2  TTCACGTTT
      Lin. alignment
        Seq1  AACAC_TTT
        Read  AACACGTTT
      MSA coor. (columns 123456X789)
        Seq1  AACAC__TTT
        Read  AACAC_GTTT
      Scrubbed (columns 123456789)
        Seq1  AACAC_TTT
        Read  AACACGTTT
      Graph alignment (columns 123456789)
        Graph AACACGTTT
        Seq1  AACACGTTT
      [Figure: graph built from the input MSA]
      Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system
      Cutting: Examine alignment gap structure; cut in "bad" areas; use longest stretch
      Cleaning: Find the best gap-less sequence-to-graph alignment + extension with gaps
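
The transposition and scrubbing steps on slides 20 and 21 can be sketched as a coordinate lift-over from a linear alignment to MSA columns. This is one simplified reading of the slide, not the actual implementation: the function names are hypothetical, and the rule that an inserted read base is absorbed by an adjacent MSA column where the input sequence itself has a gap is an assumption based on the slide's Seq1 / Read example.

```python
def column_map(msa_row, gap="_"):
    """MSA column index of every non-gap base of one input sequence."""
    return [col for col, sym in enumerate(msa_row) if sym != gap]


def transpose(msa_row, ref_aln, read_aln, ref_start=0, gap="_"):
    """Lift a pairwise alignment (input sequence vs. read) onto MSA columns.

    ref_aln / read_aln are the two gapped rows of the linear alignment;
    ref_start is the 0-based ungapped position where the alignment begins.
    Returns (placed, unplaced): placed maps MSA column -> read symbol
    (a gap symbol marks a deletion); unplaced lists insertions that no
    gap column could absorb, as (previous column, base).
    """
    cols = column_map(msa_row, gap)
    placed, unplaced = {}, []
    ref_pos = ref_start
    last_col = cols[ref_start] - 1 if ref_start < len(cols) else len(msa_row)
    for ref_sym, read_sym in zip(ref_aln, read_aln):
        if ref_sym != gap:
            # Aligned (or deleted) base: use the column of the input-sequence base.
            last_col = cols[ref_pos]
            placed[last_col] = read_sym
            ref_pos += 1
        elif read_sym != gap:
            # Insertion in the read: "scrub" it into the next column if the
            # input sequence has a gap there, instead of creating a new column.
            candidate = last_col + 1
            if candidate < len(msa_row) and msa_row[candidate] == gap:
                placed[candidate] = read_sym
                last_col = candidate
            else:
                unplaced.append((last_col, read_sym))
    return placed, unplaced


if __name__ == "__main__":
    placed, unplaced = transpose("AACAC_TTT", "AACAC_TTT", "AACACGTTT")
    print("".join(placed[c] for c in sorted(placed)), unplaced)
```

On the slide's example, the inserted G lands in the column where Seq1 has a gap and Seq2 has a G, so the read keeps the nine-column coordinate system of the input MSA.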
  • 22. Accuracy slightly worse; fast! Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?
      Highest resolution
                                  MHC-PRG-2                      HLA*PRG
      Cohort         Locus  N     Inferred  Accuracy  Call Rate  Inferred  Accuracy  Call Rate
      Platinum Trio  A      6     6         1.00      1.00       6         1.00      1.00
      Platinum Trio  B      6     6         1.00      1.00       6         1.00      1.00
      Platinum Trio  C      6     6         1.00      1.00       6         1.00      1.00
      Platinum Trio  DQA1   6     6         1.00      1.00       6         1.00      1.00
      Platinum Trio  DQB1   6     6         1.00      1.00       6         1.00      1.00
      Platinum Trio  DRB1   6     6         1.00      1.00       6         1.00      1.00
      1000 Genomes   A      22    22        0.86      1.00       22        1.00      1.00
      1000 Genomes   B      22    22        1.00      1.00       22        1.00      1.00
      1000 Genomes   C      22    22        1.00      1.00       22        1.00      1.00
      1000 Genomes   DQA1   12    12        1.00      1.00       12        1.00      1.00
      1000 Genomes   DQB1   22    22        1.00      1.00       22        1.00      1.00
      1000 Genomes   DRB1   22    22        0.91      1.00       22        0.95      1.00
  • 23. Towards additional high-quality reference haplotypes… Remaining challenges: extreme repeats, haplotypes. Sergey Koren
  • 24. Ribosomal DNA • Encodes ribosomal RNA • Hundreds of copies (tandem repeat arrays) • Variation poorly characterized • Step 1: Targeted approach • Step 2: WGS-based • Step 3: Variation graph
  • 25. Read error vs variation … from whole-genome data? Long reads  de Bruijn graph Technology! 6% > 50k
  • 26. Summary • Variation graphs are worth the effort – at least in highly complex regions. • Evidence: MHC "model system" + overall improvement of genome inference accuracy + complex-locus haplotyping • Incorporate LD? • Middle ground between full graph alignment and linear sequence alignment? • Ribosomal DNA – let me know if you're also interested!
  • 27. Acknowledgements • NIH: Adam Phillippy, Sergey Koren, Brian Walenz, Jung-Hyun Kim, Vladimir Larionov • Oxford: Gil McVean, Zam Iqbal, Alexander Mentzer • Histogenetics: Nezih Cereb • UCSF/Nantes: Pierre-Antoine Gourraud • GSK: Matt Nelson, Charles Cox