Variation graphs and population assisted genome inference copy


Presentation by Benedict Paten at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.

  1. 1. Human Genome Variation Graphs Benedict Paten - UC Santa Cruz Genomics Institute benedict@soe.ucsc.edu https://cgl.genomics.ucsc.edu/ Twitter: @BenedictPaten
  2. 2. Triumph of the reference human genome • The publication of the human reference genome unleashed the field of large-scale human genomics • It offers a coordinate system to: • describe gene sequences • display annotations • interpret molecular assays • However, the reference genome represents only a single instance among billions of unique human genomes...
  3. 3. Triumph of the reference human genome • The publication of the human reference genome unleashed the field of large-scale human genomics • It offers a coordinate system to: • describe gene sequences • display annotations • interpret molecular assays • However, the reference genome represents only a single instance among billions of unique human genomes...
     [Figure: UCSC Genome Browser view of GTEx data, Human Feb. 2009 (GRCh37/hg19), chr17:37,783,223-37,826,720 (6,701 bp). Tracks: 100 vertebrates basewise conservation by PhyloP; UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics); transcription levels assayed by RNA-seq on 9 cell lines from ENCODE; GTEx RNA-seq signal and read coverage for brain caudate (basal ganglia), brain cortex, skeletal muscle, sun-exposed skin (lower leg) and thyroid. Genes shown: PPP1R1B, STARD3, TCAP, PNMT.]
     Supplementary Figure 2 | 6700 bp exon-focused view of a 43 Kbp region of human chromosome 17 where GTEx RNA-seq highlights tissue-specific expression of two genes. TCAP (titin cap protein) is highly expressed in muscle tissue, while PPP1R1B (a therapeutic target for neurologic disorders) shows expression in brain basal ganglia but not muscle (or brain cortex). In this UCSC Genome Browser view, 33 samples from 5 tissues were selected for display, from the total 7304 (in 53 tissues) available on the GTEx public track hub. The hub is available on both hg19 (GRCh37) and hg38 (GRCh38) human genome assemblies. The hg19 tracks were generated using the UCSC liftOver tool to transform coordinates from the hg38 bedGraph files generated by STAR in the Toil pipeline. The browser display was configured to use the Multi-region exon view.
     Figure: UCSC Browser view of GTEx data from Vivian et al. Nature Biotechnology 35, 314–316 (2017) doi:10.1038/nbt.3772. bioRxiv preprint doi: http://dx.doi.org/10.1101/062497, first posted online Jul. 7, 2016, made available under a CC-BY-NC 4.0 International license.
  7. 7. Triumph of the reference human genome • The publication of the human reference genome unleashed the field of large-scale human genomics • It offers a coordinate system to: • describe gene sequences • display annotations • interpret molecular assays • However, the primary ref genome represents only a single instance among billions of unique germline human genomes...
     [Figure: UCSC Genome Browser view of GTEx RNA-seq data (Supplementary Figure 2; Human Feb. 2009 (GRCh37/hg19), chr17:37,783,223-37,826,720), showing tissue-specific expression of TCAP in skeletal muscle and PPP1R1B in brain basal ganglia.]
     Figure: UCSC Browser view of GTEx data from Vivian et al. Nature Biotechnology 35, 314–316 (2017) doi:10.1038/nbt.3772. bioRxiv preprint doi: http://dx.doi.org/10.1101/062497, first posted online Jul. 7, 2016.
  8. 8. The problem with the reference • Avg. 4-5 million point variants / individual • 80 million point variants w/ >= 0.1% freq. • Avg. > 10 megabases (MB) in copy-number variants (CNVs) / individual • 350-400 MB in CNVs w/ >= 0.1% freq. • Avg. > 6 MB in large indels / individual • > 100 MB in large indels w/ >= 0.1% freq.
  9. 9. The problem with the reference • Avg. 4-5 million point variants / individual • 80 million point variants w/ >= 0.1% freq. • Avg. > 10 megabases (MB) in copy-number variants (CNVs) / individual • 350-400 MB in CNVs w/ >= 0.1% freq. • Avg. > 6 MB in large indels / individual • > 100 MB in large indels w/ >= 0.1% freq.
     [Sources shown on slide: Sharp AJ, Cheng Z, Eichler EE. Structural Variation of the Human Genome. Annual Reviews, 2006 (www.annualreviews.org). Kidd JM, Sampas N, Antonacci F, et al. Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions. Nat Methods. 2010 May; 7(5): 365–371.]
  10. 10. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics
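The reference allele bias named on this slide can be illustrated with a toy calculation. This is a minimal sketch, not any real aligner's model: it assumes a hypothetical mapper that accepts a read only if it has at most `max_mismatches` mismatches against the reference, independent sequencing errors at a made-up `error_rate`, and a SNP base that is always read correctly. The 35 bp read length is borrowed from the Degner et al. abstract shown later in the deck; the function names are illustrative only.

```python
from math import comb


def prob_at_most(n: int, k: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); evaluates to 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def reference_fraction(read_len: int = 35,
                       max_mismatches: int = 2,
                       error_rate: float = 0.02) -> float:
    """Expected share of mapped reads carrying the reference allele
    at a heterozygous SNP, when both alleles are sequenced equally.

    Assumptions (all hypothetical): a hard mismatch threshold,
    independent per-base errors, and an error-free SNP base.
    """
    # Reference-allele read: only sequencing errors count as mismatches.
    p_ref = prob_at_most(read_len, max_mismatches, error_rate)
    # Alternate-allele read: the SNP itself already uses up one mismatch,
    # leaving one fewer mismatch for errors in the other read_len - 1 bases.
    p_alt = prob_at_most(read_len - 1, max_mismatches - 1, error_rate)
    return p_ref / (p_ref + p_alt)


if __name__ == "__main__":
    for k in (0, 1, 2, 3):
        print(k, reference_fraction(max_mismatches=k))
```

Under these assumptions the share exceeds one half whenever the error rate is above zero, and it reaches 1.0 when no mismatches are tolerated, because alternate-allele reads can then never map. Real mappers use alignment scores rather than a hard threshold, so this only shows the direction of the bias, not its size.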
  11. 11. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics
     [Source shown on slide: Pei B, Sisu C, Frankish A, et al. The GENCODE pseudogene resource. Genome Biology 2012, 13:R51. http://genomebiology.com/2012/13/9/R51]
  12. 12. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics
     [Sources shown on slide: Pei B, Sisu C, Frankish A, et al. The GENCODE pseudogene resource. Genome Biology 2012, 13:R51. http://genomebiology.com/2012/13/9/R51. Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, Vol. 25 no. 24, 2009. From the Degner et al. abstract: mapping 35 bp RNA-seq reads from two HapMap Yoruba individuals to the human genome showed, at heterozygous SNPs, a significant bias toward higher mapping rates of the allele in the reference sequence compared with the alternative allele.]
  13. 13. The problem with the reference
• These differences create a failure of representation, for example:
• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)
• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants
• The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations
• In summary: the current reference genome has become an impediment to personal genomics
Papers shown on the slide (first pages, partly cropped):
Jacob F. Degner, John C. Marioni, Athma A. Pai, Joseph K. Pickrell, Everlyne Nkadori, Yoav Gilad and Jonathan K. Pritchard. "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data." Bioinformatics, Vol. 25 no. 24, 2009.
Motivation: Next-generation sequencing has become an important tool for genome-wide quantification of DNA and RNA. However, a major technical hurdle lies in the need to map short sequence reads back to their correct locations in a reference genome. Here, we investigate the impact of SNP variation on the reliability of read-mapping in the context of detecting allele-specific expression (ASE).
Results: We generated 16 million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. When we mapped these reads to the human genome we found that, at heterozygous SNPs, there was a significant bias toward higher mapping rates of the allele in the reference sequence, compared with the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but, surprisingly, did not lead to more reliable results overall. We find that even after masking, ∼5–10% of SNPs still have an inherent bias toward more effective mapping of one allele. Filtering out inherently biased SNPs removes 40% of the top signals of ASE. The remaining SNPs showing ASE are enriched in […]
Kazutoyo Osoegawa, Aaron G. Mammoser, Chenyan Wu, Eirik Frengen, Changjiang Zeng, Joseph J. Catanese and Pieter J. de Jong. "A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome." Department of Cancer Genetics, Roswell Park Cancer Institute, Buffalo, New York 14263, USA. Published by Cold Spring Harbor Laboratory Press (genome.cshlp.org).
A 30-fold redundant human bacterial artificial chromosome (BAC) library with a large average insert size (178 kb) has been constructed to provide the intermediate substrate for the international genome sequencing effort. The DNA was obtained from a single anonymous volunteer, whose identity was protected through a double-blind donor selection protocol. DNA fragments were generated by partial digestion with EcoRI (library segments 1–4: 24-fold) and MboI (segment 5: sixfold) and cloned into the pBACe3.6 and pTARBAC1 vectors, respectively. The quality of the library was assessed by extensive analysis of 169 clones for rearrangements and artifacts. Eighteen BACs (11%) revealed minor insert rearrangements, and none was chimeric. This BAC library, designated as “RPCI-11,” has been used widely as the central resource for insert-end sequencing, clone fingerprinting, high-throughput sequence analysis and as a source of mapped clones for diagnostic and functional studies.
Baikang Pei, Cristina Sisu, Adam Frankish, Cédric Howald, Lukas Habegger, Xinmeng Jasmine Mu, Rachel Harte, Suganthi Balasubramanian, Andrea Tanzer, Mark Diekhans, Alexandre Reymond, Tim J Hubbard, Jennifer Harrow and Mark B Gerstein. "The GENCODE pseudogene resource." Genome Biology 2012, 13:R51. http://genomebiology.com/2012/13/9/R51
Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
Background (main text): Pseudogenes are defined as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frame shifts and premature stop codons [1–4]. The functional paralogs of pseudogenes are often referred to as parent genes. Based on the mechanism of their creation, pseudogenes can be categorized into three large groups: (1) processed pseudogenes, created by retrotransposition of mRNA from functional protein-coding loci back into the genome; (2) duplicated (also referred to as unprocessed) pseudogenes, derived from duplication of functional genes; and (3) unitary pseudogenes, which arise through in situ mutations in previously functional protein-coding genes [1,4–6]. Different types of pseudogenes exhibit different genomic features. Duplicated pseudogenes have intron-exon-like genomic structures and may still maintain the upstream regulatory sequences of their parents. In contrast, processed pseudogenes, having lost their introns, contain only exonic sequence and do not retain the upstream regulatory regions. Processed pseudogenes may preserve evidence of their insertion in the form of polyadenine features at their 3’ end. These features of processed pseudogenes are shared with other genomic elements commonly known as retrogenes [7]. However, retrogenes differ from pseudogenes in that they have intact coding frames and encode functional proteins [8]. The composition of different types of pseudogenes varies among organisms [9]. In the human genome, processed pseudogenes are the most abundant type due to […]
  14. 14. The problem with the reference
• These differences create a failure of representation, for example:
• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)
• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants
• The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations
• In summary: the current primary reference genome is an imperfect lens for personal genomics
[Shown alongside: first pages of Degner et al., "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data" (Bioinformatics, Vol. 25 no. 24, 2009); Osoegawa et al., "A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome" (the RPCI-11 library); and Pei et al., "The GENCODE pseudogene resource" (Genome Biology 2012, 13:R51).]
  15. 15. Alternate haplotypes
  16. 16. Alternate haplotypes GRCh38 is a graph!
  17. 17. Human Genome Variation Graph Project • Goals: • Develop next generation human genetic reference that includes known variation from all human ethnic populations • Provide tools to map, call, phase and represent genomes Figure courtesy Kiran Garimella & Gil McVean
  18. 18. Existing Variation is Fragmented • Variants associated with phenotype • Genome- and locus-specific variation databases • Sequencing projects • Human reference genome
  19. 19. A Rosetta Stone for human genomics
  20. 20. Merge diverse genomes into one graph [Figure: the major histocompatibility complex. Credit: Kiran Garimella & Gil McVean]
  21. 21. Zooming in, you see local structure
  22. 22. At base level we assign unique position identifiers
  23. 23. Variation Graphs – The Essentials • Set of sequences (nodes), e.g. GTCCCAA, ACGTGG, ACTACCA, TTACTAC • Joins (edges) connect sides of sequences.
  24. 24. Variation Graphs – The Essentials • Joins can connect either side of a sequence (bidirected edges) • Walks encode DNA strings, with side of entry determining strand [Figure labels: GTCCCAA ACGTGG TTACTAC]
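The two "Essentials" slides can be made concrete with a small sketch. The code below is an illustration only, not vg's data model: the class name `VariationGraph`, the side labels "L" and "R", and the particular joins between the four node sequences are all assumptions chosen for the example. It shows how a join between two node sides constrains a walk, and how entering a node from its right side reads it on the reverse strand.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class VariationGraph:
    """Toy bidirected sequence graph (hypothetical structure, not vg's API)."""
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # each join: frozenset of two (node id, side)

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, side_a, side_b):
        # A side is (node id, "L") or (node id, "R").
        self.edges.add(frozenset([side_a, side_b]))

    def walk_sequence(self, walk):
        """Spell out a walk given as [(node id, is_reverse), ...]."""
        pieces = []
        for i, (node_id, is_reverse) in enumerate(walk):
            if i > 0:
                prev_id, prev_reverse = walk[i - 1]
                # A forward visit leaves by the right side; a reverse visit by the left.
                exit_side = (prev_id, "L" if prev_reverse else "R")
                # A forward visit enters by the left side; a reverse visit by the right.
                entry_side = (node_id, "R" if is_reverse else "L")
                if frozenset([exit_side, entry_side]) not in self.edges:
                    raise ValueError(f"no join between {exit_side} and {entry_side}")
            seq = self.nodes[node_id]
            pieces.append(reverse_complement(seq) if is_reverse else seq)
        return "".join(pieces)


graph = VariationGraph()
for node_id, seq in [(1, "GTCCCAA"), (2, "ACGTGG"), (3, "ACTACCA"), (4, "TTACTAC")]:
    graph.add_node(node_id, seq)
graph.add_edge((1, "R"), (2, "L"))   # ordinary join, forward to forward
graph.add_edge((2, "R"), (4, "L"))
graph.add_edge((1, "R"), (3, "R"))   # join into the right side of node 3 (inversion-like)

forward = graph.walk_sequence([(1, False), (2, False), (4, False)])
inverted = graph.walk_sequence([(1, False), (3, True)])
backward = graph.walk_sequence([(4, True), (2, True), (1, True)])
```

Walking the same joins in the opposite direction (`backward`) spells the reverse complement of `forward`, which is the sense in which a single bidirected graph represents both strands.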
  25. 25. Essential operations on variation graphs • To switch to variation graphs a complete ecosystem must be redeveloped • “rebooting genomics” (Erik Garrison) [Figure labels: variation graph; another variation graph. Adapted from “Computational Pan-Genomics: Status, Promises and Challenges”, Computational Pan-Genomics Consortium, Briefings in Bioinformatics (2016)]
  26. 26. Essential operations on variation graphs • To switch to variation graphs a complete ecosystem must be redeveloped • https://github.com/vgteam/vg [Figure labels: variation graph; another variation graph. Adapted from “Computational Pan-Genomics: Status, Promises and Challenges”, Computational Pan-Genomics Consortium, Briefings in Bioinformatics (2016)]
  27. 27. Now lots of good genome graph development …
  28. 28. Genome Graph Vignettes • Read mapping • Haplotypes vs. graphs • Visualization • Alleles and sites • Variant calling
  29. 29. Variation graph mapping GRCh38 alts in B-3106 from human MHC
  30. 30. Simulation Study - Human • 10 M reads from a genome with 1% error • Subset of reads with >=1 match to non-primary ref [Figure: two ROC panels, TPR vs. FPR, with FPR on a log scale from 1e−06 to 1e−02. Points are labelled 0 to 60 and sized by number of reads. Aligners: bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se]
  31. 31. Simulation Study - Human • 10 M reads from a genome with 1% error • Subset of reads with >=1 match to non-primary ref [Figure: ROC panel, TPR (0.95 to 1.00) vs. FPR, with FPR on a log scale from 1e−06 to 1e−02. Points are labelled 0 to 60 and sized by number of reads (250000 to 1000000). Aligners: bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se]
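For readers unfamiliar with this kind of plot, here is a minimal sketch of how such ROC points might be computed from simulated reads. It is an assumption that the point labels 0 to 60 are mapping-quality (MAPQ) thresholds and that both rates are normalised by the total number of reads; the slides do not state either, and the function name `roc_by_mapq` and the toy data are hypothetical.

```python
def roc_by_mapq(alignments, thresholds=(60, 50, 40, 30, 20, 10, 0)):
    """Cumulative ROC points from simulated reads.

    alignments: iterable of (mapq, is_correct), where is_correct says whether
    the aligner placed the read at its simulated (true) position.
    Returns a list of (threshold, tpr, fpr), both rates taken over all reads.
    """
    alignments = list(alignments)
    total = len(alignments)
    if total == 0:
        raise ValueError("need at least one alignment")
    points = []
    for q in thresholds:
        # Keep only reads the aligner reports with confidence >= q.
        kept = [ok for mapq, ok in alignments if mapq >= q]
        true_pos = sum(kept)
        false_pos = len(kept) - true_pos
        points.append((q, true_pos / total, false_pos / total))
    return points


# Toy data: 100 simulated reads as (MAPQ, correctly placed?).
toy_reads = (
    [(60, True)] * 90 + [(60, False)] * 1 +
    [(20, True)] * 5 +
    [(0, True)] * 2 + [(0, False)] * 2
)
roc_points = roc_by_mapq(toy_reads)
```

Lowering the threshold admits more reads, so both rates can only grow from one point to the next; a better aligner reaches a higher TPR at the same FPR.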
  32. 32. Human - Indel Mapping Bias Alleviated [Figure, panel (b): allele fraction vs variant size. Points sized by number of reads. Aligners: bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se]
  33. 33. Mapping improvements differ by population [Figure: MHC. x-axis: 1000 Genomes Super Population; y-axis: % Diff. in perfect map., primary vs. 1KG]
  34. 34. Embedding Haplotypes • Genome graphs do not encode linkage • To restrict linkage, natural solution is to duplicate paths: • But duplication creates mapping ambiguity [Figure: a graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp, and a second version in which node 4 is duplicated as 4': 38 bp]
  35. 35. Embedding Haplotypes • Instead maintain projection from haplotypes to graph: • The question then becomes how to encode this projection? [Figure: haplotype copies 1', 4', 7' projected onto the graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp]
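The linkage point can be illustrated with a short sketch. The node numbering follows the figure (two SNP bubbles, 2/3 and 5/6), but the edge list, the two embedded haplotypes `hapA` and `hapB`, and the helper names are assumptions made up for the example. The graph alone admits four source-to-sink walks, while only two of them correspond to an embedded haplotype; keeping the haplotypes as paths over shared nodes preserves that information without duplicating sequence.

```python
# Successor lists for the graph in the figure (node 1 = 82 bp flank, etc.).
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}

# Hypothetical haplotypes stored as paths over the shared nodes (the "projection").
haplotypes = {
    "hapA": [1, 2, 4, 5, 7],   # A ... C
    "hapB": [1, 3, 4, 6, 7],   # G ... T
}


def all_walks(edges, source, sink):
    """Enumerate every walk from source to sink (assumes an acyclic graph)."""
    if source == sink:
        return [[sink]]
    walks = []
    for nxt in edges[source]:
        for tail in all_walks(edges, nxt, sink):
            walks.append([source] + tail)
    return walks


def supporting_haplotypes(walk, haplotypes):
    """Names of haplotypes that contain the walk as a contiguous sub-path."""
    names = []
    for name, path in haplotypes.items():
        for start in range(len(path) - len(walk) + 1):
            if path[start:start + len(walk)] == walk:
                names.append(name)
                break
    return sorted(names)


walks = all_walks(edges, 1, 7)
unsupported = [w for w in walks if not supporting_haplotypes(w, haplotypes)]
```

Here `unsupported` holds the two recombinant walks (A with T, G with C) that the graph allows but no embedded haplotype supports.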
  36. 36. Embedding Haplotypes • The Graph Positional Burrows Wheeler Transform (gPBWT) • Reversible, compressible, enables efficient indexed queries
From Novak et al., “A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications”, WABI 2016 (bioRxiv preprint first posted online May 2, 2016; doi: http://dx.doi.org/10.1101/051409; CC-BY 4.0 International license).
[Paper excerpt shown on the slide: […] counting of the number of threads in T that contain a given new thread as a subthread. Figure 2 and Table 1 give a worked example.]
[Fig. 1 of the paper, with example entries 1 2 3 2 1 3 1 1 2 2 and slide label gPBWTk[]: An illustration of the B0[] array for a single side numbered 0. Threads visiting this side may enter their next nodes on sides 1, 2, or 3. The B0[] array records, for each visit of a thread to side 0, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, they are likely to run in “ribbons” of multiple threads […]]
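The caption's description of B0[] can be turned into a toy sketch. This is not the gPBWT itself: the visits here are recorded in thread-insertion order rather than in the ordering the paper defines, there is no succinct index, and the function names and thread data are hypothetical. The sketch only shows what a per-side "next side" array holds and why similar threads make it run-length compressible.

```python
from collections import defaultdict
from itertools import groupby


def build_next_side_arrays(threads):
    """For each side, list the side entered next by each visiting thread.

    threads: list of threads, each a list of side identifiers in visit order.
    Simplification: visits are appended in input order, not in the sorted
    order used by the real gPBWT.
    """
    arrays = defaultdict(list)
    for thread in threads:
        for side, next_side in zip(thread, thread[1:]):
            arrays[side].append(next_side)
    return dict(arrays)


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Five toy threads that all start at side 0; most share long stretches.
toy_threads = [
    [0, 1, 4],
    [0, 1, 4],
    [0, 1, 5],
    [0, 2, 4],
    [0, 2, 4],
]
next_side = build_next_side_arrays(toy_threads)
b0_runs = run_length_encode(next_side[0])
```

Because the toy threads travel in "ribbons", the array for side 0 collapses to two runs, which is the intuition behind the compressibility bullet on the slide.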
  37. 37. gPBWT Performance • Experiment: • chr22 • 50,818,468 bp • 5004 Haplotypes • Result: • 356 MB gPBWT + vg graph • 0.011 bits per base - 200x compression • ~336 GB for whole genome w/80 million point variants @ 100,000 diploid genomes
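The "bits per base" figure on this slide can be checked with a few lines of arithmetic. The sketch assumes that MB means 10^6 bytes and that the compression ratio is measured against a plain 2-bits-per-base encoding of every haplotype; neither assumption is stated on the slide, and the variable names are illustrative.

```python
index_bytes = 356e6          # 356 MB for gPBWT + vg graph, assuming MB = 10**6 bytes
bases_per_haplotype = 50_818_468
haplotype_count = 5004

total_bases = bases_per_haplotype * haplotype_count
bits_per_base = index_bytes * 8 / total_bases

# Assumed baseline: 2 bits per base for uncompressed A/C/G/T.
compression_ratio = 2 / bits_per_base

print(f"{bits_per_base:.4f} bits per base, about {compression_ratio:.0f}x vs 2 bits/base")
```

Under these assumptions the result is about 0.011 bits per base, matching the slide, and a ratio somewhat under 200x, the same order of magnitude as the slide's rounded figure.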
  38. 38. Embedding Haplotypes • Tube Maps Wolfgang Beyer
  39. 39. Embedding Haplotypes Prototype: Wolfgang Beyer https://vgteam.github.io/sequenceTubeMap/
  40. 40. Haplotype Probabilities • Li & Stephens: Efficiently compute P(h|H), where h is haplotype and H is population [Figure: “Li and Stephens” on sequence graphs. Li and Stephens: sequences h are generated by walks x across the space of all haplotypes H]
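As background for P(h|H), here is a minimal sketch of the classic (linear, non-graph) Li and Stephens copying model evaluated with the forward algorithm. It is a textbook-style formulation rather than the implementation used in the talk: the function name, the single recombination parameter `rho`, the single mismatch parameter `mu`, and the assumption of biallelic sites with equal-length haplotypes are all choices made for the example.

```python
def li_stephens_probability(h, panel, rho=0.01, mu=0.001):
    """P(h | panel) under a simple copying model.

    h: query haplotype as a string of alleles, one character per site.
    panel: list of reference haplotypes, same length as h.
    rho: per-site probability of switching to a uniformly chosen panel haplotype.
    mu: per-site probability that the copied allele is observed as the other allele.
    """
    k = len(panel)
    if k == 0 or len(h) == 0 or any(len(p) != len(h) for p in panel):
        raise ValueError("panel must be non-empty and match the query length")

    def emit(j, site):
        return 1.0 - mu if panel[j][site] == h[site] else mu

    # forward[j] = P(alleles so far, currently copying haplotype j)
    forward = [emit(j, 0) / k for j in range(k)]
    for site in range(1, len(h)):
        total = sum(forward)
        forward = [
            emit(j, site) * ((1.0 - rho) * forward[j] + rho * total / k)
            for j in range(k)
        ]
    return sum(forward)


toy_panel = ["000", "011", "111"]
p_copy = li_stephens_probability("011", toy_panel)     # present in the panel
p_mosaic = li_stephens_probability("001", toy_panel)   # needs a switch or a mismatch
```

The recurrence uses the fact that every switch lands on a uniformly chosen haplotype, so each site costs O(K) rather than O(K^2) for a panel of K haplotypes. A haplotype already in the panel (`p_copy`) scores far higher than a mosaic (`p_mosaic`).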
  41. 41. Haplotype Probabilities • Graph Li & Stephens: Efficiently compute P(x|H), where x is haplotype walk in a genome graph [Figure: Li and Stephens: sequences h are generated by walks x across the space of all haplotypes […]. […] model: sequences h are generated by walks x through G which follow segments […] of haplotypes in H. Labels: h; x c/w h; g1, g2, g3 ∈ H]
  42. 42. Haplotype Probabilities • Applied to vg mapped reads: • Non recombinants, 90% • Single recombinants, 9% • Double recombinants, 1%
  43. 43. What’s a site and an allele in a genome graph? • Use subgraph decomposition to find single-source, single-sink subgraphs (bubbles and superbubbles); the set of paths through each subgraph is its set of alleles [Figures: a bubble and a superbubble, with node labels A, T, C]
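Once a single-source, single-sink subgraph has been found, reading off its alleles amounts to enumerating the paths through it. The sketch below assumes the site has already been identified and is acyclic; the function name, node sequences and edges are hypothetical, and finding the sites themselves (the subgraph decomposition) is not shown.

```python
def site_alleles(sequences, edges, source, sink):
    """Alleles of a site: the interior sequence of every source-to-sink path.

    sequences: node id -> DNA string.
    edges: node id -> list of successor node ids (site assumed acyclic).
    The source and sink flank the site and are excluded from the alleles.
    """
    alleles = []

    def extend(node, interior):
        for nxt in edges.get(node, []):
            if nxt == sink:
                alleles.append("".join(sequences[n] for n in interior))
            else:
                extend(nxt, interior + [nxt])

    extend(source, [])
    return sorted(alleles)


# A simple bubble: one SNP between two flanking nodes.
bubble_sequences = {1: "GATT", 2: "A", 3: "T", 4: "ACA"}
bubble_edges = {1: [2, 3], 2: [4], 3: [4]}
bubble_alleles = site_alleles(bubble_sequences, bubble_edges, 1, 4)

# A larger site: a nested choice plus a direct source-to-sink edge (a deletion).
nested_sequences = {1: "GATT", 2: "A", 3: "T", 4: "C", 5: "GG", 6: "ACA"}
nested_edges = {1: [2, 3, 6], 2: [4, 5], 3: [6], 4: [6], 5: [6]}
nested_alleles = site_alleles(nested_sequences, nested_edges, 1, 6)
```

The empty string among the second site's alleles is the deletion allele, produced by the edge that skips the site's interior entirely.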
  44. 44. A haplotype phasing pipeline • Read mapping → Variant calling → Haplotype phasing, with known population information as input • Population Assisted Variant Calling [Figure labels: h; haplotype likelihood; read likelihood; genome posterior probability]
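One way to read the figure labels is as Bayes' rule: the posterior probability of a candidate genome is proportional to its haplotype likelihood under the population (acting as a prior) times the likelihood of the reads given that genome. That reading is an assumption, as are the function name and the toy numbers below; the sketch works in log space to avoid underflow when many reads are combined.

```python
import math


def genome_posteriors(candidates):
    """Normalised posteriors from per-candidate log likelihoods.

    candidates: name -> (log haplotype likelihood, log read likelihood).
    Assumes posterior is proportional to haplotype likelihood * read likelihood.
    """
    log_joint = {name: lh + lr for name, (lh, lr) in candidates.items()}
    peak = max(log_joint.values())
    # log-sum-exp for a numerically stable normalising constant
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    return {name: math.exp(v - log_norm) for name, v in log_joint.items()}


# Toy example: the reads slightly favour genome B, the population strongly favours A.
toy_candidates = {
    "genome_A": (math.log(0.9), math.log(0.2)),
    "genome_B": (math.log(0.1), math.log(0.6)),
}
posteriors = genome_posteriors(toy_candidates)
```

In this toy case the population information outweighs the read evidence, which is the kind of effect "population assisted" calling is meant to capture.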
  45. 45. Genome Variation Graphs Summary • A shared reference graph will provide a single canonical naming scheme for human variants: either it is already a (named) path in the graph, or it is a new canonically named augmentation • A better prior: Clear benefits for simplifying and improving read mapping and variant calling - could ultimately lower cost of genome inference • Additional haplotype data can be embedded (gPBWT) • The natural reference is a population cohort - we should build a public cohort for hundreds of thousands of individuals - let’s change the culture of de-identified sharing • True population assisted genome inference is coming • Still many open problems: repeatome, annotations, RNA
  46. 46. Thanks! • UCSC: Adam Novak, Glenn Hickey, Sean Blum, Yohei Rosen, Jordan Eizenga, Wolfgang Beyer, Karen Hayden, David Haussler • Team VG: Erik Garrison, Eric Dawson, Mike Lin, Jouni Siren (and many more) • GA4GH ref-var group: Andres Kahles, Ben Murray, Goran Rakocevic, Alex Dilthey, Sarah Guthrie, Jerome Kelleher, Heng Li, Stephen Keenan, Richard Durbin, Gil McVean • Opportunities: https://cgl.genomics.ucsc.edu/ benedict@soe.ucsc.edu
