Human Genome Variation Graphs
Benedict Paten - UC Santa Cruz Genomics Institute

benedict@soe.ucsc.edu

https://cgl.genomics.ucsc.edu/

Twitter: @BenedictPaten
Triumph of the reference human genome
• The publication of the human reference genome unleashed
the field of large-scale human genomics
• It offers a coordinate system to:
• describe gene sequences
• display annotations
• interpret molecular assays
• However, the primary reference genome represents only a single
instance among billions of unique germline human genomes...
Supplementary Figure 2 – Browser
Figure: UCSC Genome Browser window, Human Feb. 2009 (GRCh37/hg19), chr17:37,783,223-37,826,720 (6,701 bp displayed).
Tracks shown: 100 vertebrates Basewise Conservation by PhyloP; UCSC Genes (PPP1R1B, STARD3, TCAP, PNMT); Transcription Levels Assayed by RNA-seq on 9 Cell Lines from ENCODE; GTEx RNA-seq read coverage from Brain - Caudate (basal ganglia), Brain - Cortex, Muscle - Skeletal, Skin - Sun Exposed (Lower leg), and Thyroid.
Supplementary Figure 2 | 6700 bp exon-focused view of a 43 Kbp region of human chromosome 17 where GTEx
RNA-seq highlights tissue-specific expression of two genes. TCAP (titin cap protein) is highly expressed in
muscle tissue, while PPP1R1B (a therapeutic target for neurologic disorders) shows expression in brain basal ganglia
but not muscle (or brain cortex). In this UCSC Genome Browser view, 33 samples from 5 tissues were selected for
display, from the total 7304 (in 53 tissues) available on the GTEx public track hub. The hub is available on both hg19
(GRCh37) and hg38 (GRCh38) human genome assemblies. The hg19 tracks were generated using the UCSC liftOver
tool to transform coordinates from the hg38 bedGraph files generated by STAR2 in the Toil pipeline. The browser
display was configured to use the Multi-region exon view.
bioRxiv preprint first posted online Jul. 7, 2016; doi: http://dx.doi.org/10.1101/062497. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC 4.0 International license.
Figure: UCSC Genome Browser view of GTEx data, from Vivian et al., Nature Biotechnology 35, 314–316 (2017), doi:10.1038/nbt.3772
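The figure caption describes a 6700 bp exon-focused view of a 43 Kbp region. As a minimal sketch of how a reference coordinate system is used in practice, the snippet below parses the browser position string from the figure and computes the span it covers. The function names `parse_position` and `span_bp` are illustrative and not part of any UCSC tool; the sketch assumes that browser position strings are 1-based and inclusive.

```python
import re

def parse_position(position: str):
    """Parse a 'chrom:start-end' string (commas allowed) into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a genomic position: {position!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))

def span_bp(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval (assumed convention)."""
    return end - start + 1

chrom, start, end = parse_position("chr17:37,783,223-37,826,720")
print(chrom, start, end, span_bp(start, end))
```

The computed span is 43,498 bp, which matches the "43 Kbp region" in the caption; the 6,701 bp reported in the browser window is the smaller number of bases displayed by the multi-region exon view.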
The problem with the reference
• Avg. 4-5 million point variants / individual
• 80 million point variants with >= 0.1% frequency
• Avg. > 10 megabases (Mb) in copy-number variants (CNVs) / individual
• 350-400 Mb in CNVs with >= 0.1% frequency
• Avg. > 6 Mb in large indels / individual
• > 100 Mb in large indels with >= 0.1% frequency
Figure: title pages of Sharp, Cheng, and Eichler, "Structural Variation of the Human Genome" (Annual Reviews, 2006), and Kidd et al., "Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions", Nat Methods. 2010 May; 7(5): 365–371.
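To put the per-individual and population-level figures above in proportion, the following sketch converts them into fractions of the genome. It assumes a haploid genome size of roughly 3.1 billion bases, a value that does not appear on the slide; the helper names are illustrative only.

```python
# Assumption (not from the slides): approximate haploid human genome size.
GENOME_SIZE_BP = 3.1e9

def percent_of_genome(megabases: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Percentage of the genome covered by a given number of megabases."""
    return 100.0 * megabases * 1e6 / genome_size_bp

def bases_per_variant(n_variants: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Average spacing between variants if they were spread evenly."""
    return genome_size_bp / n_variants

# Per-individual figures taken from the slide.
for n in (4e6, 5e6):
    print(f"{n:.0e} point variants: one every {bases_per_variant(n):.0f} bp")
print(f"> 10 Mb in CNVs: > {percent_of_genome(10):.2f}% of the genome")
print(f"> 6 Mb in large indels: > {percent_of_genome(6):.2f}% of the genome")

# Population-level figures (variants with >= 0.1% frequency).
print(f"350-400 Mb in CNVs: {percent_of_genome(350):.1f}-{percent_of_genome(400):.1f}% of the genome")
```

Under the 3.1 Gb assumption, 4-5 million point variants correspond to roughly one variant every 620-775 bp, and 10 Mb of CNV sequence to about 0.3% of the genome. Note that variants are not evenly spaced in real genomes, so the spacing is only an average.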
The problem with the reference
• These differences create a failure of
representation, for example:
• Some functional (transcribed) genes
are either present in disabled form or
absent from the current reference (e.g.
some HLA genes)
• Reference Allele Bias: Mapping
algorithms are intrinsically biased
towards ignoring evidence of variants
• The current reference is largely derived
from one individual, making it less
suitable for the study of genomes that
derive from other subpopulations
• In summary: the current reference genome
has become an impediment to personal
genomics
Figure: title page of Pei et al., "The GENCODE pseudogene resource", Genome Biology 2012, 13:R51, http://genomebiology.com/2012/13/9/R51. From the abstract: pseudogenes have long been considered nonfunctional genomic sequences, but recent evidence suggests that many of them might have some form of biological activity.
Figure: title page of Degner et al., "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", Bioinformatics Vol. 25 no. 24, 2009. From the abstract: the authors generated 16 million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. At heterozygous SNPs there was a significant bias toward higher mapping rates of the allele in the reference sequence, compared with the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but did not lead to more reliable results overall: even after masking, ∼5–10% of SNPs still have an inherent bias toward more effective mapping of one allele, and filtering out inherently biased SNPs removes 40% of the top signals of allele-specific expression (ASE).
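The reference allele bias described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the method used by Degner et al. or by any real aligner: it assumes a hypothetical mapper that accepts a read only if it has at most `max_mismatches` differences from the reference, a uniform per-base sequencing error rate, and a heterozygous SNP covered equally by both alleles. The 35 bp read length echoes the reads mentioned in the abstract; all other parameter values are arbitrary.

```python
import random

def simulate_mapping(n_reads=20000, read_len=35, error_rate=0.02,
                     max_mismatches=2, seed=0):
    """Toy model: a read maps only if it differs from the reference
    at no more than max_mismatches positions (hypothetical mapper)."""
    rng = random.Random(seed)
    sampled = {"ref": 0, "alt": 0}
    mapped = {"ref": 0, "alt": 0}
    for _ in range(n_reads):
        # Heterozygous site: each read carries either allele with equal probability.
        allele = "ref" if rng.random() < 0.5 else "alt"
        sampled[allele] += 1
        # Random sequencing errors at the other read_len - 1 positions.
        errors = sum(rng.random() < error_rate for _ in range(read_len - 1))
        # A read carrying the alternative allele starts with one mismatch.
        mismatches = errors + (1 if allele == "alt" else 0)
        if mismatches <= max_mismatches:
            mapped[allele] += 1
    return sampled, mapped

sampled, mapped = simulate_mapping()
ref_rate = mapped["ref"] / sampled["ref"]
alt_rate = mapped["alt"] / sampled["alt"]
ref_fraction = mapped["ref"] / (mapped["ref"] + mapped["alt"])
print(f"mapping rate: ref={ref_rate:.3f} alt={alt_rate:.3f}")
print(f"reference-allele fraction among mapped reads: {ref_fraction:.3f}")
```

Because reads carrying the alternative allele spend one of their allowed mismatches on the SNP itself, they are discarded more often, and the reference allele ends up over-represented among mapped reads even though both alleles were sampled equally. In an expression study this skew could be mistaken for allele-specific expression.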
Kazutoyo Osoegawa, Aaron G. Mammoser, Chenyan Wu, Eirik Frengen, Changjiang Zeng, Joseph J. Catanese and Pieter J. de Jong. A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome. Resource, published by Cold Spring Harbor Laboratory Press (genome.cshlp.org). Department of Cancer Genetics, Roswell Park Cancer Institute, Buffalo, New York 14263, USA.
A 30-fold redundant human bacterial artificial chromosome (BAC) library with a large average insert size (17[...] kb) has been constructed to provide the intermediate substrate for the international genome sequencing effort. The DNA was obtained from a single anonymous volunteer, whose identity was protected through a double-blind donor selection protocol. DNA fragments were generated by partial digestion with EcoRI (library segments 1–4: 24-fold) and MboI (segment 5: sixfold) and cloned into the pBACe3.6 and pTARBAC1 vectors, respectively. The quality of the library was assessed by extensive analysis of 169 clones for rearrangements and artifacts. Eighteen BACs (11%) revealed minor insert rearrangements, and none was chimeric. This BAC library, designated as “RPCI-11,” has been used widely as the central resource for insert-end sequencing, clone fingerprinting, high-throughput sequence analysis and as a source of mapped clones for diagnostic and functional studies.
Baikang Pei, Cristina Sisu, Adam Frankish, Cédric Howald, Lukas Habegger, Xinmeng Jasmine Mu, Rachel Harte, Suganthi Balasubramanian, Andrea Tanzer, Mark Diekhans, Alexandre Reymond, Tim J Hubbard, Jennifer Harrow and Mark B Gerstein. The GENCODE pseudogene resource. Genome Biology 2012, 13:R51 (Research, Open Access). http://genomebiology.com/2012/13/9/R51
Abstract
Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
Background
Pseudogenes are defined as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frame shifts and premature stop codons [1–4]. The functional paralogs of pseudogenes are often referred to as parent genes. Based on the mechanism of their creation, pseudogenes can be categorized into three large groups: (1) processed pseudogenes, created by retrotransposition of mRNA from functional protein-coding loci back into the genome; (2) duplicated (also referred to as unprocessed) pseudogenes, derived from duplication of functional genes; and (3) unitary pseudogenes, which arise through in situ mutations in previously functional protein-coding genes [1,4–6].
Different types of pseudogenes exhibit different genomic features. Duplicated pseudogenes have intron-exon-like genomic structures and may still maintain the upstream regulatory sequences of their parents. In contrast, processed pseudogenes, having lost their introns, contain only exonic sequence and do not retain the upstream regulatory regions. Processed pseudogenes may preserve evidence of their insertion in the form of polyadenine features at their 3’ end. These features of processed pseudogenes are shared with other genomic elements commonly known as retrogenes [7]. However, retrogenes differ from pseudogenes in that they have intact coding frames and encode functional proteins [8]. The composition of different types of pseudogenes varies among organisms [9]. In the human genome, processed pseudogenes are the most abundant type due to [...]
The problem with the reference
• These differences create a failure of representation, for example:
• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)
• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants
• The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations
• In summary: the current primary reference genome is an imperfect lens for personal genomics
Alternate haplotypes
GRCh38 is a graph!
Human Genome Variation Graph Project
• Goals:
• Develop a next-generation human genetic reference that includes known variation from all human ethnic populations
• Provide tools to map, call, phase and represent genomes
Figure courtesy Kiran Garimella & Gil McVean
Existing Variation is Fragmented
Variants associated with phenotype
Genome- and locus-specific variation databases
Sequencing projects
Human reference genome
A Rosetta Stone for
human genomics
Merge diverse genomes into one graph
The major histocompatibility complex - Kiran Garimella & Gil McVean
Zooming in, you see local structure
At base level we assign unique position identifiers
Variation Graphs – The Essentials
GTCCCAA
ACGTGG
ACTACCA
TTACTAC
Set of sequences (nodes)
Joins (edges) connect sides of sequences.
Variation Graphs – The Essentials
GTCCCAAACGTGG TTACTAC
Joins can connect either side of a sequence (bidirected edges)
Walks encode DNA strings, with side of entry determining strand
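The two "Essentials" slides can be made concrete with a small sketch. It is only an illustration, not how vg stores graphs: the class name `VariationGraph`, the side labels "L" and "R", and the particular joins are assumptions made for this example; only the four node sequences come from the slide.

```python
# Minimal bidirected sequence graph. The joins added below are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


class VariationGraph:
    def __init__(self):
        self.nodes = {}     # node id -> forward-strand sequence
        self.joins = set()  # each join is an unordered pair of sides

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_join(self, side_a, side_b):
        # A side is (node_id, "L") or (node_id, "R").
        self.joins.add(frozenset((side_a, side_b)))

    def spell(self, walk):
        """Spell the DNA string of a walk given as (node_id, entry_side) visits."""
        pieces = []
        exit_side = None
        for node_id, entry in walk:
            entry_side = (node_id, entry)
            if exit_side is not None and frozenset((exit_side, entry_side)) not in self.joins:
                raise ValueError(f"no join between {exit_side} and {entry_side}")
            seq = self.nodes[node_id]
            # Entering on the left reads the forward strand; entering on the
            # right reads the reverse complement.
            pieces.append(seq if entry == "L" else reverse_complement(seq))
            exit_side = (node_id, "R" if entry == "L" else "L")
        return "".join(pieces)


graph = VariationGraph()
for node_id, seq in [(1, "GTCCCAA"), (2, "ACGTGG"), (3, "ACTACCA"), (4, "TTACTAC")]:
    graph.add_node(node_id, seq)
graph.add_join((1, "R"), (2, "L"))
graph.add_join((2, "R"), (4, "L"))
graph.add_join((2, "R"), (3, "R"))  # right-to-right join: node 3 is read reversed

forward = graph.spell([(1, "L"), (2, "L"), (4, "L")])
inverted = graph.spell([(1, "L"), (2, "L"), (3, "R")])
backward = graph.spell([(4, "R"), (2, "R"), (1, "R")])
print(forward, inverted, backward == reverse_complement(forward))
```

Traversing the same walk in the opposite direction spells the reverse complement, which is why a single bidirected graph can represent both strands.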
Essential operations on variation graphs
• To switch to variation graphs a complete ecosystem must be redeveloped
• “rebooting genomics” - Erik Garrison
[Figure: operations on variation graphs. Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016)]
https://github.com/vgteam/vg
Now lots of good genome graph development …
Genome Graph Vignettes
• Read mapping
• Haplotypes vs. graphs
• Visualization
• Alleles and sites
• Variant calling
Variation graph mapping GRCh38 alts in B-3106 from
human MHC
Simulation Study - Human
[Figure: two panels plotting TPR against FPR (FPR on a log scale, 1e−06 to 1e−02) for the aligners bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe and vg.ref.se. Points are labeled 0 to 60 in steps of 10 and sized by number (250,000 to 1,000,000 in one panel; 2,500,000 to 7,500,000 in the other).]
• 10 M reads from a genome with 1% error
• Subset of reads with >=1 match to non-primary ref match
Human - Indel Mapping Bias Alleviated
[Figure (b): allele fraction vs variant size, for the aligners bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe and vg.ref.se]
Mapping improvements differ by population
[Figure: % diff. in perfect map., primary vs. 1KG, by 1000 Genomes super population, for the MHC]
Embedding Haplotypes
• Genome graphs do not encode linkage
• To restrict linkage, natural solution is to duplicate paths:
[Figure: a graph with nodes 1 (82 bp), 2 (A), 3 (G), 4 (38 bp), 5 (C), 6 (T) and 7 (24 bp), shown again with node 4 duplicated as 4' (38 bp)]
• But duplication creates mapping ambiguity
Embedding Haplotypes
• Instead maintain projection from haplotypes to graph:
[Figure: the graph with nodes 1 (82 bp), 2 (A), 3 (G), 4 (38 bp), 5 (C), 6 (T) and 7 (24 bp), alongside node labels 1' (82 bp), 4' (38 bp) and 7' (24 bp)]
• The question then becomes how to encode this projection?
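One simple way to picture the projection is to leave the graph unchanged and store each haplotype as a path of node ids. The sketch below is an assumption-laden illustration: the topology follows the node numbers in the slide figure, but the two haplotype paths and all function names are hypothetical, and the lookup is a naive linear scan rather than an index.

```python
# Graph topology follows the slide's node ids; haplotype paths are hypothetical.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}
HAPLOTYPES = {
    "hap_A_C": [1, 2, 4, 5, 7],  # carries A at the first site, C at the second
    "hap_G_T": [1, 3, 4, 6, 7],  # carries G at the first site, T at the second
}


def is_walk(path):
    """True if every consecutive pair of nodes is joined by an edge."""
    return all(b in EDGES[a] for a, b in zip(path, path[1:]))


def contains(path, sub):
    """True if sub occurs as a contiguous run inside path."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))


def supporting_haplotypes(walk):
    """Names of embedded haplotypes that contain the walk (forward strand only)."""
    if not is_walk(walk):
        raise ValueError("not a walk in the graph")
    return sorted(name for name, path in HAPLOTYPES.items() if contains(path, walk))


observed = supporting_haplotypes([2, 4, 5])     # A ... C, seen in hap_A_C
recombinant = supporting_haplotypes([2, 4, 6])  # A ... T, a valid walk with no support
print(observed, recombinant)
```

The walk through nodes 2, 4 and 6 is allowed by the graph but matches no embedded haplotype, which is the linkage information the graph alone does not encode.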
Embedding Haplotypes
• The Graph Positional Burrows Wheeler Transform
(gPBWT)
From “Novak et al, A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications (PBWT), WABI 2016”

[Excerpt from the bioRxiv preprint, first posted online May 2, 2016, doi: http://dx.doi.org/10.1101/051409, CC-BY 4.0 International license]
[...] counting of the number of threads in T that contain a given new thread as a subthread. Figure 2 and Table 1 give a worked example.
Fig. 1. An illustration of the B0[] array for a single side numbered 0. Threads visiting this side may enter their next nodes on sides 1, 2, or 3. The B0[] array records, for each visit of a thread to side 0, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, they are likely to run in “ribbons” of multiple threads [...]
gPBWTk[]
• Reversible, compressible, enables efficient indexed queries
gPBWT Performance
• Experiment:
• chr22
• 50,818,468 bp
• 5004 Haplotypes
• Result:
• 356 MB gPBWT + vg graph
• 0.011 bits per base - 200x compression
• ~336 GB for whole genome w/80 million point variants @ 100,000 diploid genomes
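The bits-per-base figure can be checked with a few lines of arithmetic. Two things here are assumptions rather than statements from the slide: that "MB" means 10^6 bytes, and that a plain 2-bits-per-base encoding is a sensible baseline for the compression ratio. Since the 356 MB also includes the vg graph, the result is an upper bound on the cost of the haplotype index itself.

```python
# Figures from the slide.
chr22_bases = 50_818_468
haplotype_count = 5004
index_megabytes = 356

# Assumption: 1 MB = 10**6 bytes.
index_bits = index_megabytes * 10**6 * 8

# Total haplotype bases represented by the index.
stored_bases = chr22_bases * haplotype_count
bits_per_base = index_bits / stored_bases

# Assumption: compare against a naive 2-bits-per-base encoding.
ratio_vs_two_bit = 2.0 / bits_per_base

print(f"{stored_bases:,} bases, {bits_per_base:.4f} bits per base, "
      f"about {ratio_vs_two_bit:.0f}x smaller than 2 bits per base")
```

Under these assumptions the ratio comes out somewhat below 200, the same order as the slide's rounded "200x"; the baseline used on the slide is not stated.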
Embedding Haplotypes
• Tube Maps
Wolfgang Beyer
Embedding Haplotypes
Prototype: Wolfgang Beyer https://vgteam.github.io/sequenceTubeMap/
Haplotype Probabilities
• Li & Stephens: Efficiently compute P(h|H), where h is a haplotype and H is a population of haplotypes
[Figure: Li and Stephens model: sequences h are generated by walks x across the space of all haplotypes H]
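For the linear (non-graph) case, P(h|H) can be computed with the standard forward algorithm for a copying model. The sketch below is a simplified illustration, not the speaker's implementation: it assumes biallelic sites coded as "0" and "1", a single per-site recombination probability `recomb`, a single mismatch probability `mutation`, and a uniform choice of the haplotype to copy after a recombination. All names and parameter values are hypothetical.

```python
from itertools import product


def li_stephens_probability(h, panel, recomb=0.01, mutation=0.001):
    """Forward algorithm: probability of h as an imperfect mosaic of the panel."""
    k = len(panel)

    def emit(observed, copied):
        # Biallelic sites: copy faithfully with probability 1 - mutation.
        return 1.0 - mutation if observed == copied else mutation

    # Start by copying any panel haplotype with equal probability.
    forward = [emit(h[0], hap[0]) / k for hap in panel]
    for site in range(1, len(h)):
        total = sum(forward)
        forward = [
            emit(h[site], hap[site]) * ((1.0 - recomb) * prev + recomb * total / k)
            for hap, prev in zip(panel, forward)
        ]
    return sum(forward)


panel = ["0000", "1111"]
p_copy = li_stephens_probability("0000", panel)         # identical to a panel member
p_one_switch = li_stephens_probability("0011", panel)   # needs one recombination
p_many_switch = li_stephens_probability("0101", panel)  # needs several events
total = sum(li_stephens_probability("".join(bits), panel)
            for bits in product("01", repeat=4))
print(p_copy, p_one_switch, p_many_switch, total)
```

The cost is proportional to the number of sites times the panel size, and the probabilities over all possible haplotypes sum to one, which is a useful sanity check for any variant of the model.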
Haplotype Probabilities
• Graph Li & Stephens: Efficiently compute P(x|H), where x is a haplotype walk in a genome graph
[Figure: Li and Stephens: sequences h are generated by walks x across the space of all haplotypes in H. Graph model: sequences h are generated by walks x through G which follow segments of haplotypes in H. Labels: h; x c/w h; g1, g2, g3 ∈ H]
Haplotype Probabilities
• Applied to vg mapped reads:
• Non recombinants, 90%
• Single recombinants, 9%
• Double recombinants, 1%
What’s a site and an allele in a genome graph?
What’s a site and an allele in a variation graph?
• Use subgraph decomposition to find single source/sink subgraphs, set of paths are the alleles
[Figure: an example bubble and an example superbubble, drawn as small graphs of single-base nodes]
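Given a single-source, single-sink subgraph, the alleles of the site are the sequences spelled by the paths between source and sink. The sketch below enumerates them for two toy sites. The graphs, node ids and the function name `site_alleles` are hypothetical, the subgraph is assumed to be acyclic, and the decomposition step that finds the source and sink is not shown.

```python
def site_alleles(sequences, edges, source, sink):
    """Spell every source-to-sink path, excluding the source and sink themselves."""
    alleles = set()

    def extend(node, spelled):
        for nxt in edges.get(node, []):
            if nxt == sink:
                alleles.add(spelled)
            else:
                extend(nxt, spelled + sequences[nxt])

    extend(source, "")
    return alleles


# A simple bubble: two single-base alternatives between flanking nodes 1 and 4.
bubble_sequences = {1: "A", 2: "T", 3: "C", 4: "A"}
bubble_edges = {1: [2, 3], 2: [4], 3: [4]}
bubble = site_alleles(bubble_sequences, bubble_edges, source=1, sink=4)

# A larger site: a SNP followed by a shared base, plus an edge that skips both.
nested_sequences = {1: "A", 2: "T", 3: "C", 4: "G", 5: "A"}
nested_edges = {1: [2, 3, 5], 2: [4], 3: [4], 4: [5]}
nested = site_alleles(nested_sequences, nested_edges, source=1, sink=5)

print(sorted(bubble), sorted(nested))
```

The empty string in the second result is the deletion allele, produced by the edge that goes straight from source to sink.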
A haplotype phasing pipeline
[Figure: pipeline stages: read mapping, variant calling, haplotype phasing; with known population information as an input]
Population Assisted Variant Calling
[Equation: the genome posterior probability for h, combining a haplotype likelihood term and a read likelihood term]
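A toy version of this calculation, assuming the posterior is proportional to the product of the two likelihood terms named on the slide: the candidate genotypes, the numbers and the function name `genome_posterior` are all made up for illustration.

```python
def genome_posterior(haplotype_likelihood, read_likelihood):
    """Normalize haplotype likelihood x read likelihood over the candidates."""
    unnormalized = {
        candidate: haplotype_likelihood[candidate] * read_likelihood[candidate]
        for candidate in haplotype_likelihood
    }
    total = sum(unnormalized.values())
    return {candidate: value / total for candidate, value in unnormalized.items()}


# Hypothetical values: the population term favors ref/ref, the reads favor ref/alt.
haplotype_likelihood = {"ref/ref": 0.90, "ref/alt": 0.09, "alt/alt": 0.01}
read_likelihood = {"ref/ref": 0.02, "ref/alt": 0.60, "alt/alt": 0.05}

posterior = genome_posterior(haplotype_likelihood, read_likelihood)
best = max(posterior, key=posterior.get)
print(best, {name: round(value, 4) for name, value in posterior.items()})
```

Here the read evidence is strong enough to outweigh the population term; with weaker read support the population term would dominate, which is the sense in which the population acts as a prior.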
Genome Variation Graphs Summary
• A shared reference graph will provide a single canonical naming scheme for human variants: a variant is either already a (named) path in the graph, or it is a new canonically named augmentation
• A better prior: clear benefits for simplifying and improving read mapping and variant calling - could ultimately lower the cost of genome inference
• Additional haplotype data can be embedded (gPBWT)
• The natural reference is a population cohort - we should build a public cohort of hundreds of thousands of individuals - let’s change the culture of de-identified sharing
• True population-assisted genome inference is coming
• Still many open problems: repeatome, annotations, RNA
Thanks!
UCSC
Adam Novak
Glenn Hickey
Sean Blum
Yohei Rosen
Jordan Eizenga
Wolfgang Beyer
Karen Hayden
David Haussler
Team VG:

Erik Garrison
Eric Dawson
Mike Lin
Jouni Siren
(and many more)

GA4GH ref-var group:

Andres Kahles
Ben Murray
Goran Rakocevic
Alex Dilthey
Sarah Guthrie
Jerome Kelleher
Heng Li
Stephen Keenan
Richard Durbin
Gil McVean
Opportunities: https://cgl.genomics.ucsc.edu/ benedict@soe.ucsc.edu

More Related Content

What's hot

hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)Shaojun Xie
 
Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysisGenome Reference Consortium
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amGenome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesGenome Reference Consortium
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCGenome Reference Consortium
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Genome Reference Consortium
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Genome Reference Consortium
 

What's hot (20)

Ashg2015 grc-pruitt
Ashg2015 grc-pruittAshg2015 grc-pruitt
Ashg2015 grc-pruitt
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
 
GRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slidesGRCWorkshop_geval_1KG_slides
GRCWorkshop_geval_1KG_slides
 
Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysis
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Ashg2017 workshop schneider
Ashg2017 workshop schneiderAshg2017 workshop schneider
Ashg2017 workshop schneider
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 
agbt 2016 workshop church
agbt 2016 workshop churchagbt 2016 workshop church
agbt 2016 workshop church
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
Schneider grc workshop_final
Schneider grc workshop_finalSchneider grc workshop_final
Schneider grc workshop_final
 
ABGT 2016 Workshop Schneider
ABGT 2016 Workshop SchneiderABGT 2016 Workshop Schneider
ABGT 2016 Workshop Schneider
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 
agbt 2016 workshop lindsay
agbt 2016 workshop lindsayagbt 2016 workshop lindsay
agbt 2016 workshop lindsay
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)Advancements in the human genome reference assembly (GRCh38)
Advancements in the human genome reference assembly (GRCh38)
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
150224 grc kms
150224 grc kms150224 grc kms
150224 grc kms
 

Variation graphs and population assisted genome inference copy

  • 1. Human Genome Variation Graphs Benedict Paten - UC Santa Cruz Genomics Institute benedict@soe.ucsc.edu https://cgl.genomics.ucsc.edu/ Twitter: @BenedictPaten
  • 2. Triumph of the reference human genome • The publication of the human reference genome unleashed the field of large-scale human genomics • It offers a coordinate system to: • describe gene sequences • display annotations • interpret molecular assays • However, the reference genome represents only a single instance among billions of unique human genomes...
  • 3. Triumph of the reference human genome • The publication of the human reference genome unleashed the field of large-scale human genomics • It offers a coordinate system to: • describe gene sequences • display annotations • interpret molecular assays • However, the reference genome represents only a single instance among billions of unique human genomes... [Figure: UCSC Genome Browser window, Human Feb. 2009 (GRCh37/hg19), chr17:37,783,223-37,826,720 (6,701 bp shown). Tracks: 100 vertebrates basewise conservation by PhyloP; UCSC Genes (PPP1R1B, STARD3, TCAP, PNMT); transcription levels assayed by RNA-seq on 9 cell lines from ENCODE; GTEx RNA-seq read coverage from brain caudate (basal ganglia), brain cortex, skeletal muscle, sun-exposed skin (lower leg), and thyroid.] Supplementary Figure 2 | 6700 bp exon-focused view of a 43 Kbp region of human chromosome 17 where GTEx RNA-seq highlights tissue-specific expression of two genes. TCAP (titin cap protein) is highly expressed in muscle tissue, while PPP1R1B (a therapeutic target for neurologic disorders) shows expression in brain basal ganglia but not muscle (or brain cortex). In this UCSC Genome Browser view, 33 samples from 5 tissues were selected for display, from the total 7304 (in 53 tissues) available on the GTEx public track hub. The hub is available on both hg19 (GRCh37) and hg38 (GRCh38) human genome assemblies. The hg19 tracks were generated using the UCSC liftOver tool to transform coordinates from the hg38 bedGraph files generated by STAR2 in the Toil pipeline. The browser display was configured to use the Multi-region exon view. Figure: UCSC Browser of GTEx data from Vivian et al., Nature Biotechnology 35, 314–316 (2017), doi:10.1038/nbt.3772; bioRxiv preprint first posted online Jul. 7, 2016, http://dx.doi.org/10.1101/062497 (CC-BY-NC 4.0 International license).
  • 7. Triumph of the reference human genome • The publication of the human reference genome unleashed the field of large-scale human genomics • It offers a coordinate system to: • describe gene sequences • display annotations • interpret molecular assays • However, the primary reference genome represents only a single instance among billions of unique germline human genomes... [Figure: UCSC Genome Browser view of GTEx RNA-seq data (GRCh37/hg19, chr17:37,783,223-37,826,720), from Vivian et al., Nature Biotechnology 35, 314–316 (2017), doi:10.1038/nbt.3772.]
  • 8. The problem with the reference • Avg. 4-5 million point variants per individual • 80 million point variants with >= 0.1% frequency • Avg. > 10 megabases (MB) in copy-number variants (CNVs) per individual • 350-400 MB in CNVs with >= 0.1% frequency • Avg. > 6 MB in large indels per individual • > 100 MB in large indels with >= 0.1% frequency
  • 9. The problem with the reference • Avg. 4-5 million point variants per individual • 80 million point variants with >= 0.1% frequency • Avg. > 10 megabases (MB) in copy-number variants (CNVs) per individual • 350-400 MB in CNVs with >= 0.1% frequency • Avg. > 6 MB in large indels per individual • > 100 MB in large indels with >= 0.1% frequency [Slide images: first pages of Sharp, Cheng and Eichler, "Structural Variation of the Human Genome", Annual Reviews (2006); and Kidd et al., "Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions", Nat Methods. 2010 May; 7(5): 365–371.]
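To put the per-individual figures on this slide on a common scale, the rough sketch below converts them to a percentage of the genome. The genome length of 3.1 billion bases is an assumption (it is not stated on the slide), and the names `GENOME_BP` and `percent_of_genome` are illustrative only.

```python
# Assumed haploid human genome length in bases; NOT given on the slide.
GENOME_BP = 3.1e9

# Per-individual figures from the slide, expressed as affected bases.
# A point variant touches one base, so 4-5 million variants ~ 4-5 million bases.
per_individual_bp = {
    "point variants (low estimate)": 4e6,
    "point variants (high estimate)": 5e6,
    "CNVs (slide gives a lower bound)": 10e6,
    "large indels (slide gives a lower bound)": 6e6,
}


def percent_of_genome(bases: float, genome_bp: float = GENOME_BP) -> float:
    """Return the share of the genome covered by `bases`, in percent."""
    return 100.0 * bases / genome_bp


for label, bases in per_individual_bp.items():
    print(f"{label}: {percent_of_genome(bases):.2f}% of the genome")
```

Under this assumption, copy-number variants and large indels each account for more affected bases per individual than point variants do, even though point variants are far more numerous.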
  • 10. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics
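One way reference allele bias can arise is sketched below as a toy model, not as the method of any particular mapper. It assumes a mapper that accepts a read only if it has at most `max_mismatches` mismatches to the reference, independent per-base sequencing errors, and a heterozygous SNP inside the read. A read carrying the alternate allele already spends one mismatch on the SNP, so it tolerates fewer sequencing errors. The function names and default parameters are hypothetical; the 35 bp read length is borrowed from the Degner et al. (2009) study cited later in the deck.

```python
from math import comb


def p_at_most(n: int, k: int, e: float) -> float:
    """P(X <= k) for X ~ Binomial(n, e); evaluates to 0 when k < 0."""
    return sum(comb(n, i) * e**i * (1 - e) ** (n - i) for i in range(0, k + 1))


def expected_ref_fraction(
    read_len: int = 35, max_mismatches: int = 2, error_rate: float = 0.01
) -> float:
    """Expected share of mapped reads that carry the reference allele
    at a heterozygous SNP, under the toy model described above."""
    # Reference-allele read: every mismatch is a sequencing error.
    p_ref = p_at_most(read_len, max_mismatches, error_rate)
    # Alternate-allele read: the SNP uses up one mismatch, so only
    # max_mismatches - 1 errors are tolerated at the other read_len - 1 bases.
    # (Simplification: errors at the SNP position itself are ignored.)
    p_alt = p_at_most(read_len - 1, max_mismatches - 1, error_rate)
    return p_ref / (p_ref + p_alt)


if __name__ == "__main__":
    for m in (1, 2, 3):
        frac = expected_ref_fraction(max_mismatches=m)
        print(f"max_mismatches={m}: expected reference fraction {frac:.4f}")
```

Without sequencing errors the model gives an unbiased 0.5; with any error rate above zero the reference allele is over-represented, and the skew grows as the mismatch budget tightens.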
  • 11. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics [Slide image: first page of Pei et al., "The GENCODE pseudogene resource", Genome Biology 2012, 13:R51, http://genomebiology.com/2012/13/9/R51. The abstract notes that pseudogenes have long been considered nonfunctional, but recent evidence suggests many of them might have some form of biological activity.]
  • 12. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics [Slide images: first pages of Pei et al., "The GENCODE pseudogene resource", Genome Biology 2012, 13:R51; and Degner et al., "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", Bioinformatics Vol. 25 no. 24, 2009. The Degner et al. abstract reports that, when 35 bp mRNA reads from two HapMap Yoruba individuals were mapped to the human genome, heterozygous SNPs showed a significant bias toward higher mapping rates of the allele in the reference sequence.]
  • 13. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current reference genome has become an impediment to personal genomics [Slide images: first pages of Degner et al., "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", Bioinformatics Vol. 25 no. 24, 2009; Osoegawa et al., "A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome", Cold Spring Harbor Laboratory Press (genome.cshlp.org); and Pei et al., "The GENCODE pseudogene resource", Genome Biology 2012, 13:R51. The Osoegawa et al. abstract states that the DNA for the "RPCI-11" BAC library was obtained from a single anonymous volunteer.]
  • 14. The problem with the reference • These differences create a failure of representation, for example: • Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes) • Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants • The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations • In summary: the current primary reference genome is an imperfect lens for personal genomics BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 24 2009, pages 320 doi:10.1093/bioinformatics Genome analysis Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data Jacob F. Degner1,2,∗, John C. Marioni1,∗, Athma A. Pai1, Joseph K. Pickrell1, Everlyne Nkadori1,3, Yoav Gilad1,∗ and Jonathan K. Pritchard1,3,∗ 1Department of Human Genetics, 2Committee on Genetics, Genomics and Systems Biology and 3Howard Hu Medical Institute, University of Chicago, 920 E. 58th St., CLSC 507, Chicago, IL 60637, USA Received on June 25, 2009; revised on September 17, 2009; accepted on September 30, 2009 Advance Access publication October 6, 2009 Associate Editor: Limsoon Wong ABSTRACT Motivation: Next-generation sequencing has become an important tool for genome-wide quantification of DNA and RNA. However, a major technical hurdle lies in the need to map short sequence reads back to their correct locations in a reference genome. Here, we investigate the impact of SNP variation on the reliability of read-mapping in the context of detecting allele-specific expression (ASE). Results: We generated 16 million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. 
When we mapped these reads to the human genome we found that, at heterozygous SNPs, there was a significant bias toward higher mapping rates of the allele in the reference sequence, compared with the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but, surprisingly, did not lead to more reliable results overall. We find that even after masking, ∼ 5–10% of SNPs still have an inherent bias toward more effective mapping of one allele. Filtering out inherently biased SNPs removes 40% of the top signals of ASE. The remaining SNPs showing ASE are enriched in mechanisms can be uncovered through the identification o specific expression (ASE). For example, studies investigati have uncovered both genes harboring cis-regulatory variat imprinted genes that are epigenetically silenced in one copy the other (Babak et al., 2008; Serre et al., 2008; Wang et al. Recently developed sequencing technologies such as the I Genome Analyzer, Roche 454 GS FLX sequencer and A Biosystems SOLiD sequencer have the potential to greatly i our ability to detect ASE and to improve our understan cis-regulatory variation and epigenetic imprinting. Howe detection of ASE depends critically on accurate mapping reads in the presence of sequence variation. Here, using Seq data from two HapMap individuals, along with sim experiments, we characterize the effects of individual SNP quantification of expression levels. Our results are also r to other applications of next-generation sequencing, such discovery, expression QTL mapping and detection of allele- differences in transcription factor binding. A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome Kazutoyo Osoegawa,1 Aaron G. Mammoser, Chenyan Wu,2 Eirik Frengen,3 Changjiang Zeng, Joseph J. Catanese,1,2 and Pieter J. 
de Jong1,2,4 Department of Cancer Genetics, Roswell Park Cancer Institute, Buffalo, New York 14263, USA A 30-fold redundant human bacterial artificial chromosome (BAC) library with a large average insert size (17 kb) has been constructed to provide the intermediate substrate for the international genome sequencing effor The DNA was obtained from a single anonymous volunteer, whose identity was protected through double-blind donor selection protocol. DNA fragments were generated by partial digestion with EcoRI (librar segments 1–4: 24-fold) and MboI (segment 5: sixfold) and cloned into the pBACe3.6 and pTARBAC1 vector respectively. The quality of the library was assessed by extensive analysis of 169 clones for rearrangements an artifacts. Eighteen BACs (11%) revealed minor insert rearrangements, and none was chimeric. This BAC librar designated as “RPCI-11,” has been used widely as the central resource for insert-end sequencing, clon fingerprinting, high-throughput sequence analysis and as a source of mapped clones for diagnostic an functional studies. Resource Cold Spring Harbor Laboratory Presson September 9, 2011 - Published bygenome.cshlp.orgDownloaded from RESEARCH Open Access The GENCODE pseudogene resource Baikang Pei1† , Cristina Sisu1,2† , Adam Frankish3 , Cédric Howald4 , Lukas Habegger1 , Xinmeng Jasmine Mu1 , Rachel Harte5 , Suganthi Balasubramanian1,2 , Andrea Tanzer6 , Mark Diekhans5 , Alexandre Reymond4 , Tim J Hubbard3 , Jennifer Harrow3 and Mark B Gerstein1,2,7* Abstract Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. 
Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes. Background Pseudogenes are defined as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frame shifts and premature stop codons [1–4]. The functional paralogs of pseudogenes are often referred to as parent genes. 
Based on the mechanism of their creation, pseudogenes can be categorized into three large groups: (1) processed pseudogenes, created by retrotransposition of mRNA from functional protein-coding loci back into the genome; (2) duplicated (also referred to as unprocessed) pseudogenes, derived from duplication of functional genes; and (3) unitary pseudogenes, which arise through in situ mutations in previously functional protein-coding genes [1,4–6]. Different types of pseudogenes exhibit different genomic features. Duplicated pseudogenes have intron-exon-like genomic structures and may still maintain the upstream regulatory sequences of their parents. In contrast, processed pseudogenes, having lost their introns, contain only exonic sequence and do not retain the upstream regulatory regions. Processed pseudogenes may preserve evidence of their insertion in the form of polyadenine features at their 3’ end. These features of processed pseudogenes are shared with other genomic elements commonly known as retrogenes [7]. However, retrogenes differ from pseudogenes in that they have intact coding frames and encode functional proteins [8]. The composition of different types of pseudogenes varies among organisms [9]. In the human genome, processed pseudogenes are the most abundant type due to * Correspondence: mark.gerstein@yale.edu † Contributed equally 1 Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA Full list of author information is available at the end of the article Pei et al. Genome Biology 2012, 13:R51 http://genomebiology.com/2012/13/9/R51 © 2012 Pei et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
  • 17. Human Genome Variation Graph Project • Goals: • Develop next generation human genetic reference that includes known variation from all human ethnic populations • Provide tools to map, call, phase and represent genomes Figure courtesy Kiran Garimella & Gil McVean
  • 18. Existing Variation is Fragmented Variants associated with phenotype Genome- and locus-specific variation databases Sequencing projects Human reference genome
  • 19. A Rosetta Stone for human genomics
  • 20. Merge diverse genomes into one graph The major histocompatibility complex - Kiran Garimella & Gil McVean
  • 21. Zooming in, you see local structure
  • 22. At base level we assign unique position identifiers
  • 23. Variation Graphs – The Essentials • Set of sequences (nodes): GTCCCAA, ACGTGG, ACTACCA, TTACTAC • Joins (edges) connect sides of sequences.
  • 24. Variation Graphs – The Essentials • Joins can connect either side of a sequence (bidirected edges) • Walks encode DNA strings, with side of entry determining strand [Figure labels: GTCCCAAACGTGG, TTACTAC]
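The node, join and walk definitions on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the vg data model: the integer node IDs, the "L"/"R" side names, the two example joins and all function names are assumptions made for the example. Only the four node sequences come from the slide.

```python
# Minimal sketch of a bidirected sequence graph (all names are hypothetical).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# Node sequences from the slide; the integer IDs are assumed.
NODES = {1: "GTCCCAA", 2: "ACGTGG", 3: "ACTACCA", 4: "TTACTAC"}

# A join connects two node sides. "L" and "R" are the left and right
# sides of the forward sequence. These two joins are made up for the demo.
EDGES = {
    frozenset({(1, "R"), (2, "L")}),  # forward-to-forward join
    frozenset({(2, "R"), (4, "R")}),  # join that enters node 4 from its right side
}


def other_side(side):
    return "R" if side == "L" else "L"


def spell(walk):
    """Spell the DNA string of a walk given as [(node_id, entry_side), ...].

    Entering on "L" reads the node forward; entering on "R" reads its
    reverse complement. Consecutive steps must be linked by a join.
    """
    pieces = []
    for i, (node, entry) in enumerate(walk):
        if i > 0:
            prev_node, prev_entry = walk[i - 1]
            join = frozenset({(prev_node, other_side(prev_entry)), (node, entry)})
            if join not in EDGES:
                raise ValueError(f"no join between steps {i - 1} and {i}")
        seq = NODES[node]
        pieces.append(seq if entry == "L" else reverse_complement(seq))
    return "".join(pieces)


def reverse_walk(walk):
    """Traverse the same walk in the opposite direction."""
    return [(node, other_side(entry)) for node, entry in reversed(walk)]


walk = [(1, "L"), (2, "L"), (4, "R")]
print(spell(walk))
print(spell(reverse_walk(walk)))
```

Reversing a walk swaps every entry side, so the reversed walk spells the reverse complement of the original string. That is the sense in which the side of entry determines the strand.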
  • 25. Essential operations on variation graphs • To switch to variation graphs a complete ecosystem must be redeveloped • “rebooting genomics” - Erik Garrison • Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016) [Figure labels: variation graph, another variation graph]
  • 26. Essential operations on variation graphs • To switch to variation graphs a complete ecosystem must be redeveloped • Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016) • https://github.com/vgteam/vg [Figure labels: variation graph, another variation graph]
  • 27. Now lots of good genome graph development …
  • 28. Genome Graph Vignettes • Read mapping • Haplotypes vs. graphs • Visualization • Alleles and sites • Variant calling
  • 29. Variation graph mapping GRCh38 alts in B-3106 from human MHC
  • 30. Simulation Study - Human [Figure: two panels plotting TPR against FPR (log scale, 1e-06 to 1e-02); points labeled 0 to 60 in steps of 10; legend "aligner": bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se; legend "number": point size] • 10 M reads from a genome with 1% error • Subset of reads with >=1 match to non-primary ref match
  • 31. Simulation Study - Human [Figure: TPR against FPR (log scale, 1e-06 to 1e-02); points labeled 0 to 60 in steps of 10; legend "aligner": bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se; legend "number": point size] • 10 M reads from a genome with 1% error • Subset of reads with >=1 match to non-primary ref match
  • 32. Human - Indel Mapping Bias Alleviated [Figure: (b) allele fraction vs variant size; legend "aligner": bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se]
  • 33. Mapping improvements differ by population [Figure: MHC; axis labels: 1000 Genomes Super Population; % Diff. in perfect map., primary vs. 1KG]
  • 34. Embedding Haplotypes • Genome graphs do not encode linkage • To restrict linkage, natural solution is to duplicate paths • But duplication creates mapping ambiguity [Figure: graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp; a second copy duplicates node 4 as 4': 38 bp]
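To see why a graph alone does not encode linkage, the seven-node example on this slide can be enumerated directly. The sketch below assumes the topology 1 → {2, 3} → 4 → {5, 6} → 7 read off the node labels, and the two-haplotype panel is invented purely for illustration.

```python
# Successor lists for the slide's example graph (topology assumed from the
# node labels: two SNP sites, A/G and C/T, separated by a 38 bp node).
SUCCESSORS = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}


def walks(node, sink):
    """Enumerate every walk from node to sink in this small acyclic graph."""
    if node == sink:
        return [[node]]
    return [[node] + rest for nxt in SUCCESSORS[node] for rest in walks(nxt, sink)]


# Hypothetical panel: only A...C and G...T were ever observed together.
PANEL = [[1, 2, 4, 5, 7], [1, 3, 4, 6, 7]]

all_walks = walks(1, 7)
unobserved = [w for w in all_walks if w not in PANEL]

print(len(all_walks), "walks through the graph")
print("walks with no supporting haplotype:", unobserved)
```

The graph admits four walks, but only two of them are haplotypes in the assumed panel. The other two are recombinants that the graph permits and the population data does not support, which is the information a haplotype embedding has to carry.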
  • 35. Embedding Haplotypes • Instead maintain projection from haplotypes to graph • The question then becomes how to encode this projection? [Figure: haplotype copies with nodes 1': 82 bp, 4': 38 bp, 7': 24 bp shown alongside graph nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp]
  • 36. Embedding Haplotypes • The Graph Positional Burrows Wheeler Transform (gPBWT) From “Novak et al, A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications (PBWT), WABI 2016”
 [Excerpt from the gPBWT preprint, bioRxiv, http://dx.doi.org/10.1101/051409, first posted online May 2, 2016, CC-BY 4.0 International license] ... counting of the number of threads in T that contain a given new thread as a subthread. Figure 2 and Table 1 give a worked example. Fig. 1. An illustration of the B0[] array for a single side numbered 0. Threads visiting this side may enter their next nodes on sides 1, 2, or 3. The B0[] array records, for each visit of a thread to side 0, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, they are likely to run in “ribbons” of multiple threads ... [Figure values: B0[] = 1 2 3 2 1 3 1 1 2 2] • gPBWT: reversible, compressible, enables efficient indexed queries
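The B0[] array in the figure excerpt can be read as a small lookup structure. The sketch below uses the ten values shown in the figure and two hypothetical helper functions. It covers only the lookup and a rank-style count, and it omits the bookkeeping a full gPBWT index needs for threads arriving at the next side from other sides.

```python
from collections import Counter

# Values of B0[] as printed in the figure: one entry per visit to side 0,
# giving the side on which that thread enters its next node.
B0 = [1, 2, 3, 2, 1, 3, 1, 1, 2, 2]


def next_side(b, visit):
    """Side on which the thread making this visit enters its next node."""
    return b[visit]


def rank(b, visit, side):
    """Number of earlier visits that leave toward the same next side.

    Threads that go to the same side keep their relative order, so this
    count locates the visit among them (an assumption of this sketch).
    """
    return sum(1 for s in b[:visit] if s == side)


visit = 3
side = next_side(B0, visit)
print("visit", visit, "goes to side", side, "after", rank(B0, visit, side), "earlier visit(s)")
print("threads per next side:", dict(Counter(B0)))
```

Runs of equal values in such arrays are what make them compressible, since similar threads travel together in the "ribbons" the caption describes.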
  • 37. gPBWT Performance • Experiment: • chr22 • 50,818,468 bp • 5004 Haplotypes • Result: • 356 MB gPBWT + vg graph • 0.011 bits per base - 200x compression • ~336 GB for whole genome w/80 million point variants @ 100,000 diploid genomes
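The bits-per-base figure on this slide can be reproduced from the other numbers it gives. The check below assumes 1 MB = 10^6 bytes and takes the compression ratio relative to a plain 2-bit-per-base encoding. Both are assumptions, since the slide does not state its baseline.

```python
# Figures quoted on the slide.
index_megabytes = 356
haplotypes = 5004
chr22_bases = 50_818_468

# Assumption: decimal megabytes.
index_bits = index_megabytes * 1_000_000 * 8
total_bases = haplotypes * chr22_bases

bits_per_base = index_bits / total_bases
# Assumption: baseline is 2 bits per base (4-letter alphabet, uncompressed).
compression_vs_2bit = 2 / bits_per_base

print(f"{bits_per_base:.4f} bits per base")
print(f"about {compression_vs_2bit:.0f}x smaller than 2 bits per base")
```

Under these assumptions the result agrees with the quoted 0.011 bits per base, and the ratio is of the same order as the slide's rounded 200x.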
  • 38. Embedding Haplotypes • Tube Maps Wolfgang Beyer
  • 39. Embedding Haplotypes Prototype: Wolfgang Beyer https://vgteam.github.io/sequenceTubeMap/
  • 40. Haplotype Probabilities • Li & Stephens: Efficiently compute P(h|H), where h is haplotype and H is population • Li and Stephens: sequences h are generated by walks x across the space of all haplotypes H [Figure labels: H, x, h]
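For readers unfamiliar with the model, P(h|H) in the classic linear (non-graph) setting can be computed with a forward recursion over sites. The following is a minimal sketch under stated assumptions: biallelic sites encoded as characters, a single per-site switch probability `rho`, a single mismatch probability `mu`, and a uniform prior over panel haplotypes. The function name and parameter values are illustrative, and this is not the graph algorithm presented in the talk.

```python
def li_stephens_likelihood(h, panel, rho=0.01, mu=0.01):
    """Forward algorithm for a simple Li & Stephens copying model.

    h      -- query haplotype, one allele character per site
    panel  -- list of reference haplotypes H, each the same length as h
    rho    -- assumed probability of switching template between sites
    mu     -- assumed probability that the copied allele is miscopied
    """
    k = len(panel)

    def emit(observed, template):
        return 1.0 - mu if observed == template else mu

    # Start by copying from any panel haplotype with equal probability.
    forward = [emit(h[0], hap[0]) / k for hap in panel]
    for j in range(1, len(h)):
        total = sum(forward)
        forward = [
            emit(h[j], hap[j]) * ((1.0 - rho) * f + rho * total / k)
            for f, hap in zip(forward, panel)
        ]
    return sum(forward)


panel = ["0000", "1111"]
for query in ("0000", "0011", "0101"):
    print(query, li_stephens_likelihood(query, panel))
```

In this toy panel a haplotype that matches one template scores highest, a single-recombinant mosaic scores lower, and a mosaic needing several switches scores lower still. The cost is linear in the number of sites and in the panel size.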
  • 41. Haplotype Probabilities • Graph Li & Stephens: Efficiently compute P(x|H), where x is haplotype walk in a genome graph • Li and Stephens: sequences h are generated by walks x across the space of all haplotypes H • ... model: sequences h are generated by walks x through G which follow segments of haplotypes in H [Figure labels: h, x, c/w h; g1, g2, g3 ∈ H]
  • 42. Haplotype Probabilities • Applied to vg mapped reads: Single recombinants, 9% Double recombinants, 1% Non recombinants, 90%
  • 43. What’s a site and an allele in a variation graph? • Bubble • Superbubble • Use subgraph decomposition to find single source/sink subgraphs, set of paths are the alleles [Figure: example bubble and superbubble drawn over A, T, C nodes]
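Once a single source/sink subgraph has been identified, listing its alleles amounts to enumerating the paths between the two boundary nodes. The sketch below does this for a made-up bubble with a SNP and a deletion edge. The node IDs, sequences and edges are all hypothetical, and finding the bubbles themselves (the subgraph decomposition) is not shown.

```python
# Hypothetical site: source node 1 and sink node 4 bound a bubble with
# two alternative nodes (T or C) and a direct edge that skips both.
SEQUENCES = {1: "A", 2: "T", 3: "C", 4: "A"}
SUCCESSORS = {1: [2, 3, 4], 2: [4], 3: [4], 4: []}


def paths(node, sink):
    """All paths from node to sink in an acyclic single source/sink subgraph."""
    if node == sink:
        return [[node]]
    return [[node] + rest for nxt in SUCCESSORS[node] for rest in paths(nxt, sink)]


def alleles(source, sink):
    """Allele strings: sequence of interior nodes on each source-to-sink path."""
    return sorted(
        "".join(SEQUENCES[n] for n in path[1:-1]) for path in paths(source, sink)
    )


print(alleles(1, 4))
```

The empty string stands for the deletion allele, where the path goes straight from source to sink.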
  • 44. A haplotype phasing pipeline • Read mapping • Variant calling • Haplotype phasing • Known population information • Population Assisted Variant Calling: Haplotype likelihood (h), Read likelihood, genome posterior probability
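One way to read the three quantities named on this slide is as the terms of Bayes' rule. The notation below is an assumption introduced for illustration (g for the candidate genome, R for the mapped reads, H for the known population haplotypes). The slide itself gives only the labels.

```latex
% Assumed notation: g = candidate genome, R = mapped reads,
% H = known population haplotypes.
\[
\underbrace{P(g \mid R, H)}_{\text{genome posterior probability}}
\;\propto\;
\underbrace{P(R \mid g)}_{\text{read likelihood}}
\;\times\;
\underbrace{P(g \mid H)}_{\text{haplotype likelihood}}
\]
```

Under this reading the population information acts as a prior on the genome, and the reads supply the evidence, assuming the reads depend on H only through g.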
  • 45. Genome Variation Graphs Summary • A shared reference graph will provide a single canonical naming scheme for human variants: either it is already a (named) path in the graph, or it is a new canonically named augmentation • A better prior: Clear benefits for simplifying and improving read mapping and variant calling - could ultimately lower cost of genome inference • Additional haplotype data can be embedded (gPBWT) • The natural reference is a population cohort - we should build a public cohort for hundreds of thousands of individuals - let’s change the culture of de-identified sharing • True population assisted genome inference is coming • Still many open problems: repeatome, annotations, RNA
  • 46. Thanks! UCSC Adam Novak Glenn Hickey Sean Blum Yohei Rosen Jordan Eizenga Wolfgang Beyer Karen Hayden David Haussler Team VG: Erik Garrison Eric Dawson Mike Lin Jouni Siren (and many more) GA4GH ref-var group: Andres Kahles Ben Murray Goran Rakocevic Alex Dilthey Sarah Guthrie Jerome Kelleher Heng Li Stephen Keenan Richard Durbin Gil McVean Opportunities: https://cgl.genomics.ucsc.edu/ benedict@soe.ucsc.edu