Francesc Lopez
Yale Center for Genome Analysis
Dept. of Genetics
(francesc.lopez@yale.edu)
Next-Generation Sequencing and
its Applications in Medical
Research
Brief History of DNA Sequencing
1953: Discovery of DNA structure by Watson and Crick
1973: First sequence of 24 bases published
1977: Sanger sequencing method published
1982: GenBank started
1987: 1st automated sequencer: Applied Biosystems Prism 373 (up to 600 bases)
1996: First Capillary sequencer: ABI310
2000-2003: Human Genome Sequenced
2005–: First NGS sequencers (454 Life Sciences, Solexa/Illumina, Helicos, Ion Torrent)
Sequencing of the human genome using Sanger technology took more than
a decade and cost an estimated $70 million
Sanger VS NGS
Human genome: 3.3 × 10^9 bases, ~20,000 genes
In 3 days (one run), an Illumina HiSeq 4000 can produce 1,680 × 10^9 bases for ~$32,000
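The comparison above can be made concrete with a little arithmetic. The sketch below uses only the figures quoted on the slide. The 30x depth in the usage note is an illustrative assumption and is not stated in the slides.

```python
# Figures quoted on the slide.
HUMAN_GENOME_BASES = 3.3e9   # size of the human genome in bases
RUN_BASES = 1680e9           # bases produced by one HiSeq 4000 run
RUN_COST_USD = 32000         # approximate cost of that run


def genome_equivalents(run_bases: float, genome_bases: float) -> float:
    """How many times the run output covers one genome (1x each)."""
    return run_bases / genome_bases


def genomes_at_depth(run_bases: float, genome_bases: float, depth: float) -> float:
    """Genomes that fit in one run at a given average depth.

    `depth` is an assumption chosen by the caller, not a slide figure.
    """
    return run_bases / (genome_bases * depth)


def cost_per_gigabase(run_cost: float, run_bases: float) -> float:
    """Run cost divided by output in gigabases."""
    return run_cost / (run_bases / 1e9)


if __name__ == "__main__":
    print(round(genome_equivalents(RUN_BASES, HUMAN_GENOME_BASES)))
    print(round(genomes_at_depth(RUN_BASES, HUMAN_GENOME_BASES, 30), 1))
    print(round(cost_per_gigabase(RUN_COST_USD, RUN_BASES), 2))
```

One run therefore yields roughly 500 genome-equivalents of raw bases, or about 17 genomes at an assumed 30x depth, at roughly $19 per gigabase.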
Production facility. 7,000 Sq Ft
dedicated facility
25 Full time staff including 4
PhD level bioinformaticians
Yale Center for Genome Analysis (YCGA)
Dedicated computation
infrastructure
3.5 Petabytes data storage
4500 cores HPC
7 Illumina HiSeqs
5: 2500
2: 4000
One PacBio RS
Illumina MiSeq
Ion PGM™ Sequencer
Sequencing Platforms at YCGA
Trend of sequencing data output at YCGA
Sequencers are operated at ~70% of the max capacity
Progress made at YCGA in the past years
Types of samples processed at YCGA
[Pie chart "Library Prep Sample Types": ChIP, Whole Genome, mRNA, micro RNA, Seqcap, Whole Exome; slice labels 1%, 5%, 30%, 1%, 63%]
Protein-coding genes (the exome) constitute 1.5% of the human genome but
harbor 85% of disease-causing mutations.
• Significantly cheaper than sequencing the entire genome
• >50,000 exomes sequenced at YCGA
Whole-Genome VS Whole-Exome Sequencing
Choi et al PNAS 2009
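A rough worked calculation shows why exome sequencing is cheaper. It assumes an idealized capture in which every sequenced base falls on target, which real capture kits do not achieve. The 3.3 × 10^9 genome size and 1.5% exome fraction are taken from the slides.

```python
GENOME_BASES = 3.3e9     # human genome size quoted earlier in the deck
EXOME_FRACTION = 0.015   # exome share of the genome quoted on this slide


def exome_bases(genome_bases: float = GENOME_BASES,
                fraction: float = EXOME_FRACTION) -> float:
    """Approximate exome size in bases."""
    return genome_bases * fraction


def sequencing_reduction(fraction: float = EXOME_FRACTION) -> float:
    """Idealized fold-reduction in bases needed at equal depth."""
    return 1 / fraction


if __name__ == "__main__":
    print(f"exome size: {exome_bases() / 1e6:.1f} Mb")
    print(f"fold fewer bases than whole genome: {sequencing_reduction():.0f}x")
```

Under these assumptions the exome is about 50 Mb, so the same depth needs roughly 67 times fewer bases than a whole genome.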
FastQ format – single read
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
× 5 × 10^9 reads in one HiSeq 4000 run
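A minimal sketch of reading the four-line record shown above and decoding its quality string. It assumes the Phred+33 encoding (quality = ASCII code minus 33), which the slide does not state. The names `FastqRecord`, `parse_fastq_record`, `phred_scores` and `error_probability` are illustrative.

```python
from typing import List, NamedTuple, Sequence


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def parse_fastq_record(lines: Sequence[str]) -> FastqRecord:
    """Parse one 4-line FASTQ record: @id, bases, '+', qualities."""
    header, sequence, separator, quality = [ln.rstrip("\n") for ln in lines]
    if not header.startswith("@"):
        raise ValueError("header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset 33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


RECORD = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

if __name__ == "__main__":
    rec = parse_fastq_record(RECORD)
    scores = phred_scores(rec.quality)
    print(rec.seq_id, len(rec.sequence), min(scores), max(scores))
```

Each base gets one quality character, so the two strings must have equal length. Under Phred+33 a score of 20 corresponds to a 1% chance that the base call is wrong.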
Variant annotation
General parameters
• Genome build: hg38
• Variant caller: GATK
• Annotation gene reference: refGene
Variant frequency DBs
• ExAC: 61,000 exomes
• dbSNP
• 1000 Genomes
• NHLBI exomes: 6,500
• Yale exomes: 2,500
Conservation
• PhyloP
• 44 vertebrate species
• 2 invertebrate species (fly and worm)
Functional prediction
• PolyPhen-2
• SIFT
Gene annotation
• OMIM
• GO
• KEGG
• Jackson Lab knockout
Sequencing a genome is simple;
finding the cause of a disease is not
First clinical use of whole genome sequencing shows just
how challenging it can be
Genomes on prescription: Nature 2011
DNA Sequencing and Precision Medicine
• Precision Medicine: Use of genomics to tailor medical care to
individuals based on their genetic makeup.
Diagnosis: Is it benign?
Classification: Which class of cancer?
Prognosis: What are my chances?
Therapeutic choice: Which treatment?
Discovery: How and why?
• Elucidation of causal mechanisms
• Identification of cancer biomarkers
• Therapeutic targets
Genetic diagnosis by whole exome capture and
massively parallel DNA sequencing.
Choi M, et al. (2009) PNAS 106 (45): 19096-101
• A 5-month-old child presented with failure to thrive and dehydration.
• Treatments for kidney disease failed.
• Captured 180,000 exons of 18,673 protein-coding genes, comprising 34.0
Mb of genomic sequence.
• Identified a mutation in the SLC26A3 gene, which causes congenital chloride
diarrhea; treatment for that condition has effectively managed the disease.
• Demonstrated the clinical utility of whole-exome sequencing and its
implications for disease gene discovery and clinical diagnosis.
In the first 362 trios (affected proband plus both parents), ~2,000 putative
de novo variants were detected before filtering.
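The basic idea behind calling a putative de novo variant in a trio can be sketched as a set difference: keep what the proband carries and neither parent does. This is a simplified illustration, not the pipeline used for the slide. Real callers also weigh read depth and genotype quality. The variant tuples below are hypothetical.

```python
from typing import Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def putative_de_novo(proband: Set[Variant],
                     mother: Set[Variant],
                     father: Set[Variant]) -> Set[Variant]:
    """Variants seen in the proband but in neither parent."""
    return proband - (mother | father)


if __name__ == "__main__":
    # Hypothetical calls, for illustration only.
    mother = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
    father = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")}
    proband = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A"),
               ("chr7", 777, "T", "C")}
    print(putative_de_novo(proband, mother, father))
```

Candidates found this way are "pre-filtered": many are sequencing or calling artifacts, which is why further filtering follows.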
Gene burden analysis: unrelated patients with the same
clinical presentation
Comparing variants from cases and controls per gene
allows detection of disease-causing genes
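One common way to compare cases and controls per gene is a one-sided Fisher's exact test on carrier counts. The sketch below assumes that choice. The slides do not say which statistic was used, and the gene names and counts are hypothetical.

```python
from math import comb
from typing import Dict, Tuple


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a = case carriers, b = case non-carriers,
    c = control carriers, d = control non-carriers.
    Returns P(case carriers >= a) under the hypergeometric null.
    """
    cases, controls, carriers = a + b, c + d, a + c
    total = comb(cases + controls, carriers)
    p = 0.0
    for k in range(a, min(cases, carriers) + 1):
        p += comb(cases, k) * comb(controls, carriers - k) / total
    return p


def gene_burden(carrier_counts: Dict[str, Tuple[int, int]],
                n_cases: int, n_controls: int) -> Dict[str, float]:
    """Map each gene to a p-value for excess qualifying variants in cases."""
    return {
        gene: fisher_exact_greater(case_n, n_cases - case_n,
                                   ctrl_n, n_controls - ctrl_n)
        for gene, (case_n, ctrl_n) in carrier_counts.items()
    }


if __name__ == "__main__":
    # Hypothetical counts: (carriers among cases, carriers among controls).
    counts = {"GENE_A": (8, 1), "GENE_B": (3, 30)}
    for gene, p in gene_burden(counts, n_cases=100, n_controls=1000).items():
        print(gene, f"{p:.3g}")
```

Because thousands of genes are tested at once, p-values from such a scan need multiple-testing correction before any gene is called significant.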
• Broad, Baylor/Hopkins, U of Washington, and Yale
• More than 6,000 rare Mendelian disorders affect more than 25 million
individuals in the US
• Discover the genes and variants responsible for as many Mendelian
phenotypes as possible
• Develop and disseminate improved methods for disease gene discovery and
analysis
• Educate colleagues and the public regarding Mendelian disease
Whole-exome/whole-genome analysis is carried out at no
cost and on a collaborative basis
ACKNOWLEDGMENTS
Yale Center for Genome Analysis
Prof. Lifton Lab
YCGA STAFF
In 2013, Angelina Jolie
tested positive for a
BRCA1 mutation
Nine unrelated kindreds with an apparent recessive mode
of inheritance.
Filtering Recessive Variants
                              Kindred 1                Kindred 2
                              Subject 1   Subject 2    Subject 1   Subject 2
High quality                  12,326      12,094       12,959      12,753
Protein altering              3,151       3,072        3,283       3,227
Rare in control databases*    4           1            2           5
Same gene                     1 (DGKE)                 1 (DGKE)
* Yale exome database, NHLBI ESP exome, 1000 Genomes
Lemaire et al., Nature Genetics 2013
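The filtering funnel above (high quality, protein altering, rare in controls, then the same gene across affected subjects) can be sketched as follows. This is a simplified illustration and not the method of Lemaire et al. The `Variant` fields, the 0.001 frequency cutoff and the gene names are hypothetical. A gene counts as a recessive candidate here when it carries one homozygous or at least two heterozygous qualifying variants.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Variant:
    gene: str
    high_quality: bool
    protein_altering: bool
    control_frequency: float  # allele frequency in control databases
    homozygous: bool


def candidate_recessive_genes(variants: Iterable[Variant],
                              max_control_frequency: float = 0.001) -> Set[str]:
    """Genes with two qualifying alleles in one subject.

    The frequency cutoff is an illustrative assumption.
    """
    alleles = Counter()
    for v in variants:
        if (v.high_quality and v.protein_altering
                and v.control_frequency <= max_control_frequency):
            alleles[v.gene] += 2 if v.homozygous else 1
    return {gene for gene, n in alleles.items() if n >= 2}


def shared_genes(subjects: List[List[Variant]]) -> Set[str]:
    """Candidate genes common to every subject."""
    gene_sets = [candidate_recessive_genes(vs) for vs in subjects]
    return set.intersection(*gene_sets) if gene_sets else set()


if __name__ == "__main__":
    # Hypothetical variants for two affected subjects.
    subject_1 = [Variant("GENE_A", True, True, 0.0, True),
                 Variant("GENE_B", True, True, 0.2, True),
                 Variant("GENE_C", True, True, 0.0, False)]
    subject_2 = [Variant("GENE_A", True, True, 0.0, False),
                 Variant("GENE_A", True, True, 0.0005, False),
                 Variant("GENE_D", True, False, 0.0, True)]
    print(shared_genes([subject_1, subject_2]))
```

Intersecting the candidates across affected relatives, and then across unrelated kindreds, is what narrows thousands of variants down to a single shared gene.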
Some machine learning applications in genetics and genomics
Gene prediction (2002): predict which regions of the genome code for proteins.
RNA secondary structure prediction (2006): predict the base-pairing interactions within a
strand of RNA.
Transcription factor target prediction (2007): predict the sequence of bases most likely to
bind a specific transcription factor.
Base calling (2009): predict the base photographed by an Illumina sequencing device
during a sequencing by synthesis reaction.
Enhancer prediction (2012): predict regions of the genome that act as enhancers for
expression using information about the epigenetic marks present on the chromosomes.
Splicing code (2015): predict how a mutation within a gene will affect the splicing of that
gene's transcript.
Pathogenicity prediction (2015): predict the functional impact of a mutation in a sample
of DNA.
Pharmacogenomics (2011): predict if mutations in a person's DNA will impact how a drug
works in their body.
Predicting the functions of long noncoding RNAs (2015)
Predicting effects of noncoding variants using predicted DNaseI hypersensitivity, histone
modifications, and transcription factor binding (2015)
Predicting RNA editing (2016)
List of select publications resulting from next-generation sequencing at YCGA
Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Bilguvar. Nature, v467, 2010
A Novel miRNA Processing Pathway Independent of Dicer Requires Argonaute2 Activity. Cifuentes. Science, v328, 2010
Mitotic recombination in ichthyosis causes reversion of dominant mutations in KRT10. Choate K. Science, v330, 2010
Transcriptomic analysis of avian digits reveals conserved and derived digit identities in birds. Wang S. Nature, v477, 2011
Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Lynch and Wagner. Nat Genet., v43, 2011
K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Choi M. Science, v331, 2011
Recessive LAMC3 mutations cause malformations of occipital cortical development. Barak and Gunel. Nat Genet., v43, 2011
Spatio-temporal transcriptome of the human brain. Kang and Sestan. Nature, v478, 2011
Langerhans cells facilitate epithelial DNA damage and squamous cell carcinoma. Modi and Girardi. Science, v335, 2012
Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Boyden et al. Nature, v482, 2012
De novo point mutations are strongly associated with Autism Spectrum Disorders. Sanders and State. Nature, v485, 2012
Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Krauthammer. Nat Genet., v44, 2012
Genomic Analysis of Non-NF2 Meningiomas Reveals Mutations in TRAF7, KLF4, AKT1, and SMO. Clark V et al. Science, v339, 2013
De novo mutations in histone-modifying genes in congenital heart disease. Zaidi and Lifton. Nature, v498, 2013
Recessive mutations in DGKE cause atypical hemolytic-uremic syndrome. Lemaire and Lifton. Nat Genet., v45, 2013
Somatic and germline CACNA1D calcium channel mutations in aldosterone-producing adenomas. Scholl and Lifton. Nat Genet., v45, 2013
The evolution of lineage-specific regulatory activities in the human embryonic limb. Cotney and Noonan. Cell, v154, 2013
Mutations in DSTYK and dominant urinary tract malformations. Sanna-Cherchi and Gharavi. N Engl J Med., 2013
Nanog, Pou5f1 and SoxB1 activate zygotic gene expression during the maternal-to-zygotic transition. Lee et al. Nature, 2013
Co-expression networks implicate mid-fetal deep cortical projection neurons in the pathogenesis of autism. State. Cell, 2013
CLP1 Founder Mutation Links tRNA Splicing and Maturation to Cerebellar Development. Schaffer and Gleeson. Cell, v157, 2014
Exome sequencing links corticospinal motor neuron disease to neurodegenerative disorders. Novarino and Gleeson. Science, v343, 2014
Recurrent mutations in NF1 and RASopathy genes in sun-exposed melanomas. Krauthammer and Halaban. Nat Genet., v47, 2015
Genetic Causes for Congenital Heart Disease with Neurodevelopmental and Other Deficits. Homsy J et al. Science, 2015
