Vall d’Hebron Institut de Recerca (VHIR)
Rosa Prieto
Head of the High Tech Unit
rosa.prieto@vhir.org
15/05/2014
Health Research Institute accredited by the Instituto de Salud Carlos III (ISCIII)
NEXT GENERATION SEQUENCING
TECHNOLOGIES AND APPLICATIONS
COURSE IN BIOINFORMATICS
FOR BIOMEDICAL RESEARCH
Index
1 INTRODUCTION TO NGS
2 NGS TECHNOLOGY OVERVIEW
3 NGS APPLICATIONS OVERVIEW
4 WHAT IS NEXT IN SEQUENCING TECHNOLOGIES?
NGS applications
-Amplicon sequencing
-Targeted DNA resequencing
-Exome sequencing
-Whole genome sequencing
-Metagenomics
-RNA sequencing
-Targeted RNA resequencing
-Epigenomics
-Sequencing of cell-free DNA/RNA (plasma/serum)
Considerations for using NGS
-What do I want to sequence? Whole genome, exome, several genes, metagenome, epigenome, transcriptome (RNA-seq)...
-How many samples?
-What read length is required?
-Quality and quantity of starting material?
-Size of the nucleic acids to sequence
-Amount of sequence needed: coverage
(Depth of) coverage: how many times a particular base is sequenced.
30x = each base has been read by 30 sequences (on average)
Depth of coverage = (number of reads * read length) / size of target genome
(Breadth of) coverage: fraction of the target sequence that has been covered (at a given depth)
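The two coverage definitions can be illustrated with a short calculation. This is a minimal sketch; the function names and the example numbers (read count, read length, genome size, per-base depths) are illustrative assumptions, not values from the course.

```python
import math


def depth_of_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Average depth = (number of reads * read length) / size of target."""
    return num_reads * read_length / target_size


def reads_needed(target_depth: float, read_length: int, target_size: int) -> int:
    """Invert the formula: reads required to reach a given average depth."""
    return math.ceil(target_depth * target_size / read_length)


def breadth_of_coverage(per_base_depth: list, min_depth: int = 1) -> float:
    """Fraction of target positions covered by at least min_depth reads."""
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


# Hypothetical example: 1,000,000 reads of 150 bp on a 5 Mb target.
print(depth_of_coverage(1_000_000, 150, 5_000_000))          # 30.0, i.e. 30x
print(reads_needed(30, 150, 5_000_000))                      # 1000000
# Hypothetical per-base depths for a 5-base target, threshold 10x.
print(breadth_of_coverage([0, 5, 30, 12, 0], min_depth=10))  # 0.4
```

Average depth says nothing about how evenly the reads are distributed, which is why breadth at a given depth is reported separately.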
Considerations for using NGS
Which depth of coverage do I need?
It is an empirical value that depends on the objective of the study and its particular conditions (consensus values may exist).
Amplicon sequencing: viral quasispecies
HCV, HBV and HIV virus populations have special characteristics:
• In an infected patient, the virus population has high mutation and replication rates. It is a complex mixture of different mutants.
Goal of the study:
• Detection and quantification of mutations, or combinations of mutations, that could confer resistance to viral inhibitors in samples from infected patients.
• Special interest in mutations present at low frequency (minor variants).
Amplicon sequencing: viral quasispecies
WHY IS NGS APPROPRIATE FOR THIS KIND OF STUDY?
• Minor variants often play an important role in the development of resistance to antiviral treatments in patients, even if they make up a very low percentage of the population.
• Minor variants may not be detected by classical sequencing methods
  -You obtain hundreds of sequences with much effort and at high cost
• NGS detects variants present at very low frequency efficiently
  -You obtain thousands of sequences at relatively low cost
454 technology is the most appropriate method in this particular case (long sequences are achieved)
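A rough way to see why thousands of reads help with minor variants is a simple binomial model. The sketch below is an illustration under stated assumptions, not the method used in the study: reads are treated as independent draws from the viral population, sequencing errors are ignored, and the function name, variant frequency and read counts are hypothetical.

```python
from math import comb


def prob_detect(n_reads: int, variant_freq: float, min_reads: int = 1) -> float:
    """Probability that at least min_reads of n_reads carry the variant.

    Assumes each read independently carries the variant with probability
    variant_freq (binomial sampling) and that there are no sequencing errors.
    """
    p_fewer = sum(
        comb(n_reads, k) * variant_freq**k * (1 - variant_freq) ** (n_reads - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Hypothetical variant present in 1% of the viral population.
# With 100 sequences, seeing it even once is far from guaranteed.
print(prob_detect(100, 0.01))
# Requiring 10 supporting reads is hopeless with 100 sequences...
print(prob_detect(100, 0.01, min_reads=10))
# ...but becomes almost certain with 5,000 sequences.
print(prob_detect(5000, 0.01, min_reads=10))
```

In practice the detection limit also depends on the error rate of the platform, so a minimum number of supporting reads is usually required before a minor variant is called.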
Targeted sequencing using gene panels
Array-based capture system
Liquid capture system
Targeted sequencing using gene panels
Illumina
Ion Torrent
Considerations that affect capture efficiency
-Quality and quantity of input DNA
-Repeat elements, tandem repeats and pseudogenes: uneven distribution of coverage
-Extreme GC content: 5’UTR, first exons of genes, promoter regions
-Library insert length and its distribution:
•Different capture platforms recommend different sets of standard practices for sample library preparation.
•As a result of its underlying chemistry, each platform has its own range of recommended fragment sizes: Agilent inserts range from 100 to 300 bp, NimbleGen from 150 to 250 bp, and TruSeq uses the longest inserts, from 300 to 500 bp.
-Consistent laboratory procedures.
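The recommended insert sizes listed above can be kept as a small lookup table to check a library before capture. This is a minimal sketch: the ranges are the ones quoted on the slide, while the dictionary and function names are hypothetical.

```python
# Recommended insert size ranges in bp, as quoted above (inclusive bounds).
RECOMMENDED_INSERT_BP = {
    "Agilent": (100, 300),
    "NimbleGen": (150, 250),
    "TruSeq": (300, 500),
}


def compatible_platforms(insert_size_bp: int) -> list:
    """Return the capture platforms whose recommended range contains the insert size."""
    return [
        name
        for name, (low, high) in RECOMMENDED_INSERT_BP.items()
        if low <= insert_size_bp <= high
    ]


print(compatible_platforms(200))  # ['Agilent', 'NimbleGen']
print(compatible_platforms(400))  # ['TruSeq']
print(compatible_platforms(600))  # []
```

A real library has a distribution of insert sizes rather than a single value, so the check would normally be applied to the mean or to the bulk of the distribution.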
Sequence capture for cancer genomics
Exome vs. whole genome sequencing
PROS:
• Enabling technologies: NGS machines, open-source algorithms, capture reagents, falling costs, large sample collections
• Exomes are more cost-effective (less sequencing for the same coverage): human genome 3.2 Gb vs. human exome approx. 50 Mb (1-2% of the genome)
• Simplified bioinformatics analysis compared to whole genomes
CHALLENGES:
• Many Mendelian disorders still cannot be interpreted
• Rare variants need large sample sizes
• The exome might miss regions of interest (e.g. novel non-coding genes)
• Exome reagents do not capture all exons
• Clinical data sometimes cannot be interpreted successfully
Shendure, Genome Biol 2011
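The cost argument ("less sequencing for the same coverage") follows directly from the target sizes. The sketch below uses the sizes given above; the 30x depth is an example value, and the calculation is idealized because it ignores off-target reads and uneven capture.

```python
GENOME_BP = 3_200_000_000  # human genome, 3.2 Gb
EXOME_BP = 50_000_000      # human exome, approx. 50 Mb


def bases_required(target_bp: int, depth: int) -> int:
    """Total sequenced bases needed for a given average depth of coverage."""
    return target_bp * depth


depth = 30  # example depth, an assumption for illustration
genome_bases = bases_required(GENOME_BP, depth)
exome_bases = bases_required(EXOME_BP, depth)

print(f"Exome fraction of genome: {EXOME_BP / GENOME_BP:.2%}")   # 1.56%
print(f"Whole genome at {depth}x: {genome_bases / 1e9:.1f} Gb")  # 96.0 Gb
print(f"Exome at {depth}x: {exome_bases / 1e9:.1f} Gb")          # 1.5 Gb
print(f"Fold difference: {genome_bases // exome_bases}x")        # 64x
```

Because capture is never perfectly efficient, the real saving is smaller than this ideal 64-fold figure.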
Exome sequencing workflow
[Figure: workflow diagram; one labeled step is emPCR]
Illumina exome sequencing
Kits:
-NimbleGen EZ capture
-Agilent SureSelect
-RainDance
-...
Sequencers
Ion exome sequencing
Whole genome sequencing
-De novo sequencing
-Resequencing
http://www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi
Whole genome sequencing (shotgun)
Sequenced reads → Contigs → Scaffolds → Mapped scaffolds → Genome map
Long reads (454, PacBio, PE Illumina reads)
Whole genome sequencing
Sequencing of the bacterial strain E. coli O104:H4 with GS Junior, MiSeq and PGM.
1. Creation of a reference assembly (Roche GS FLX+ shotgun + 8 kb PE, 32x coverage). It contains 1 chromosome (5.3 Mb) and 2 plasmids. 153 gaps remain, corresponding to unresolved repetitive regions.
2. Sequencing of the same strain using:
• 2 runs of the 454 GS Junior
• 2 316 chips of the Ion Torrent PGM
• 1 run of the MiSeq (2x150 bases)
Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 30 (5): 434-441 (2012)
Whole genome sequencing
Conclusions: “One important conclusion from this evaluation is that saying that one has “sequenced a bacterial genome” means different things on different benchtop sequencing platforms”
                      MiSeq                    GS Junior               Ion Torrent
Throughput/run        The highest              The lowest              The fastest
Errors                The lowest               Intermediate (indels)   Many, especially in homopolymers
Read length           Intermediate (2x150 bp)  The longest (520 bp)    The shortest (100 bp)
Run time              The longest (27 hr)      Intermediate (9 hr)     The shortest (3 hr)
Price per Mb          The cheapest             The most expensive      Intermediate
Other considerations  Unfillable gaps          Errors in homopolymers  The worst performance
Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol. 30 (5): 434-441 (2012)
1000 Genomes Project
• The small fraction of the genome that varies between individuals may explain differences in susceptibility to disease, in drug response or in the reaction to environmental factors. The 1000 Genomes Project aims to establish a map of the human genome that describes as many of its variations as possible, dramatically improving on the information obtained with the HapMap project.
• The project is carried out with the main support of three institutions: the Wellcome Trust Sanger Institute (Hinxton, England), the Beijing Genomics Institute (Shenzhen, China) and the National Human Genome Research Institute, part of the NIH (National Institutes of Health, USA).
1000 Genomes Project
Methods:
1-Low coverage (5x) sequencing: SOLiD + Illumina
2-Whole exome sequencing (80x average coverage across a consensus target of 24 Mb spanning more than 15,000 genes): SeqCap EZ Human Exome Library from NimbleGen and SureSelect All Exon V2 Target Enrichment kit from Agilent.
3-SNP genotyping: initially, all samples were typed using a Sequenom MassArray SNP genotyping panel of 23 SNPs and one gender-determining assay to establish a genetic fingerprint. After gender concordance was verified, the samples were placed on 96-well plates using the Illumina HumanOmni2.5-Quad v1.0 B SNP array.
Personal Genome Project
The project will publish the genotype of the volunteers together with detailed information about their phenotype: medical records, various tests, MRI images, etc. All the information will be available to anyone on the Internet, so that researchers can test hypotheses about the relationships between genotype, environment and phenotype.
NCBI’s Resources for Phenotype (MedGen), Tests (GTR) and Variation (ClinVar)
• MedGen to research the phenotype
http://www.ncbi.nlm.nih.gov/medgen/
• GTR (Genetic Testing Registry) to choose appropriate tests
http://www.ncbi.nlm.nih.gov/gtr/
• ClinVar to research variant pathogenicity
http://www.ncbi.nlm.nih.gov/clinvar/
Patient showing signs compatible with Marfan syndrome:
List of tests for Marfan syndrome (panels included)
Searching ClinVar
NM_000138.4:c.4786C>T
FBN1:c.4786C>T
c.4786C>T
Arg1596Ter
R1596*
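The search terms above are different spellings of the same FBN1 variant, and the link between them can be checked mechanically. The sketch below is illustrative only: it handles simple coding substitutions and simple three-letter protein changes, the regular expressions and function names are assumptions made for this example, and it does not query ClinVar. It relies on the HGVS convention that c. positions are counted from the A of the start codon.

```python
import re

# Simple coding substitution, with an optional reference prefix (transcript or gene).
CODING_SUBSTITUTION = re.compile(
    r"^(?:(?P<reference>[^:]+):)?c\.(?P<position>\d+)"
    r"(?P<ref_base>[ACGT])>(?P<alt_base>[ACGT])$"
)
# Simple protein change in three-letter code, e.g. Arg1596Ter.
PROTEIN_CHANGE = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def parse_coding_substitution(term: str):
    """Split a term such as NM_000138.4:c.4786C>T into its parts, or return None."""
    match = CODING_SUBSTITUTION.match(term)
    if match is None:
        return None
    position = int(match.group("position"))
    return {
        "reference": match.group("reference"),
        "position": position,
        "ref_base": match.group("ref_base"),
        "alt_base": match.group("alt_base"),
        "codon": (position + 2) // 3,  # codon number containing this base
    }


def short_protein_change(term: str):
    """Convert Arg1596Ter to R1596*, or return None if the term does not fit."""
    match = PROTEIN_CHANGE.match(term)
    if match is None:
        return None
    ref, position, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        return None
    return f"{THREE_TO_ONE[ref]}{position}{THREE_TO_ONE[alt]}"


print(parse_coding_substitution("NM_000138.4:c.4786C>T"))
print(parse_coding_substitution("FBN1:c.4786C>T")["codon"])  # 1596
print(short_protein_change("Arg1596Ter"))                    # R1596*
```

Base 4786 falls in codon 1596, which is why the nucleotide-level and protein-level terms point to the same record.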
ClinVar detailed display
Allele summary
• Gene
• Variant type
• Genomic location
• HGVS expressions*
• Molecular consequence*
• Links*
• Frequency*
Phenotype summary
• Names
• Links*
• Age of onset*
• Prevalence*
Interpretation
• Significance
• Review status*
• Accession.version*
* May be provided by NCBI

NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Barcelona)
