Genome Wide Methodologies and Future Perspectives

Brian Krueger, PhD
Duke University
Center for Human Genome Variation

History of Genetic Linkage

• Mendel’s Laws
– Law of segregation
• Each parent randomly passes one of two alleles to offspring
– Law of Independent Assortment
• Separate genes for separate traits are passed independently to
offspring
• In a dihybrid cross, phenotypes should appear in offspring in the ratio of 9:3:3:1
– Laws hold true for genes on different chromosomes or
genes located far away from one another
• Linkage
– Bateson and Punnett quickly found traits that didn’t
assort independently
– Thomas Hunt Morgan and his student Alfred
Sturtevant found that recombination frequency is a
good predictor of distance between genes
• Genes that are inherited together must be closer to one another
– linked
• Generated the first linkage maps
– Serves as an important basis for understanding
genetic association studies
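Sturtevant's observation that recombination frequency predicts distance can be written as a short calculation. The sketch below is illustrative only: the offspring counts are hypothetical, and it uses the usual convention that 1% recombination corresponds to 1 map unit (centimorgan), which is a reasonable approximation only for genes that are fairly close together.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Fraction of offspring that carry a recombinant allele combination."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no offspring counted")
    return recombinant / total


def map_distance_cm(parental: int, recombinant: int) -> float:
    """Approximate map distance: 1% recombination is treated as 1 centimorgan."""
    return 100.0 * recombination_frequency(parental, recombinant)


if __name__ == "__main__":
    # Hypothetical test cross: 830 parental-type and 170 recombinant offspring.
    print(f"{map_distance_cm(830, 170):.1f} cM")
```

Unlinked genes give a recombination frequency near 0.5, so values well below that are the signal that two genes are linked.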

Linkage Studies

• Model Organisms
– Fruit Flies, plants, etc
– Extremely important for understanding human
genetics
– Fruit flies can produce new generations of 400+
offspring approximately every week!
• Can very quickly understand the genetics of trait heritability

• Familial Linkage Studies
– Require multiple generations
– Take decades to develop
– Complicated by family participation
• Association studies
– Subtly different from linkage studies
– Try to apply knowledge of familial linkage to entire
populations

Genome Wide Association Studies

• GWA studies
– Aim to find genetic variants that are associated with
traits
– Typically used to elucidate complex disease traits
– Focus on SNPs, Indels, CNVs
– Most often Case/Control Studies
• SNP (Single Nucleotide Polymorphism)
– Change in a single nucleotide position
• Indel (Insertion/Deletion)
– Describes the insertion or deletion of nucleotides
• CNV (Copy number variations)
– Large deletions or duplications of genetic material
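A case/control comparison for a single SNP can be sketched as a 2x2 allele-count test. This is a minimal illustration, assuming allele counts have already been tallied; the function names and the counts in the usage example are hypothetical, and real GWA pipelines add quality control and multiple-testing correction that are not shown here.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected == 0:
                raise ValueError("a row or column of the table is empty")
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds of carrying the alternate allele in cases relative to controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


if __name__ == "__main__":
    # Hypothetical allele counts: cases 30 alt / 70 ref, controls 10 alt / 90 ref.
    chi2 = allelic_chi_square(30, 70, 10, 90)
    print(chi2, p_value_1df(chi2), odds_ratio(30, 70, 10, 90))
```

Because millions of SNPs are tested, a genome wide study needs a far stricter significance threshold than a single test would.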

GWA Study History

• Human Genome Project (1990-2003; draft sequence announced in 2000)
– Decade long international project to determine the
complete human genome sequence
– Provided the reference genome for future research on
genome variation
• Human HapMap (2002-2009)
– Sequencing whole genomes is expensive
– Needed a shortcut to understand how variation
contributes to disease
– Mapped millions of common known SNPs in 269
individuals
– Theory that common SNPs are inherited and could be
predictive of associated disease
– Determine how SNPs from case/control studies
associate with human disease

Defining Association

• Variants are not always causal!
– SNPs sometimes only serve as markers
– Can play absolutely no role in the disease and even be
located on different chromosomes from the gene
actually responsible for the phenotype
• Population stratification
– Variants differ by population
– Variants that are important markers of disease in one
population or ethnicity may not be effective markers
in another
– For GWA studies to be effective predictors in multiple
populations, large datasets for each ethnicity must be
obtained

GWAS SNP Genotyping

• Bead array genotyping
– Uses a chip containing beads with
covalently attached baits
– Baits hybridized to fragmented DNA
– Baits SPECIFIC for the DNA just upstream
of a SNP
– Base extension with fluorescently labeled
bases allows interrogation of the SNP
(each base has a different color!)
– A single bead chip can assay millions of SNPs
– Colorimetric output plotted
• Blue indicates homozygous for one version of the SNP - CC
• Purple is heterozygous - CA
• Red homozygous for the other version of the SNP - AA

[Figure: genotype cluster plots for SNP rs1372493. Left panel: Intensity (B) vs Intensity (A). Right panel: Norm R vs Norm Theta. Cluster counts: 2317, 834, 74]
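The Norm Theta and Norm R axes come from converting the two allele intensities to polar-style coordinates, after which the three genotype clusters separate along theta. The sketch below uses a commonly described convention (theta as the scaled angle between the two channels, R as their sum); the thresholds and the function names are assumptions for illustration, not the vendor's calling algorithm, which fits clusters per SNP.

```python
import math


def to_theta_r(intensity_a: float, intensity_b: float):
    """Convert normalized A and B channel intensities to (theta, R)."""
    a = max(intensity_a, 0.0)  # clamp small negative background-corrected values
    b = max(intensity_b, 0.0)
    theta = (2.0 / math.pi) * math.atan2(b, a)  # 0 = all A signal, 1 = all B signal
    r = a + b                                   # total signal strength
    return theta, r


def call_genotype(intensity_a, intensity_b, min_r=0.2, low=0.25, high=0.75):
    """Assign AA, AB, BB, or NC (no call) using fixed, hypothetical thresholds."""
    theta, r = to_theta_r(intensity_a, intensity_b)
    if r < min_r:
        return "NC"  # too little signal to trust
    if theta < low:
        return "AA"
    if theta > high:
        return "BB"
    return "AB"


if __name__ == "__main__":
    for a, b in [(1.0, 0.05), (0.5, 0.5), (0.05, 1.0), (0.01, 0.01)]:
        print(a, b, call_genotype(a, b))
```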

GWAS SNP Genotyping and Validation

• Realtime PCR
– Use specific PCR probes to verify SNPs
– Good for validating a handful of SNPs at a time
• Mass Array
– Uses mass spectrometry to find SNPs
– Alleles detected by differences in
fragment mass
– Good for detecting or validating a large number
of SNPs rapidly
• Sanger sequencing
– Gold standard validation method
– Can determine the SNP at its exact position
– Very robust

GWA Study History

• To this point in time, the power of most GWA
studies was lacking
– GWA not really genome wide
– Looked at common variants across genome
– Missed rare variants and not always descriptive of
disease causation
• Whole Genome Sequencing (WGS)
– Actually assays the entire genome
– Discovers all variants
– Prohibitively costly before 2008
– Current cost of WGS ~$4000
• Thousand Genomes Project (2008-)
– Facilitated by plummeting sequencing costs and
technological advancements
– Goal to fully sequence the genomes of 1000 healthy
individuals to provide a true picture of genome wide
variation

Second Generation Sequencing

• Developed to increase throughput of
Sanger sequencing
• Can sequence many molecules in parallel
– Does not require homogeneous input
– Sequenced as clusters
• Sequencing by synthesis
– Bases are added, signals scanned, and then
washed
– Cycle repeated (30-2000x)

2nd Gen: Sequencing by Synthesis Overview

[Figure: sequencing by synthesis workflow. Genomic DNA → Fragmented DNA → Ligate Adaptors → Generate Clusters (On Flowcell or Beads) → Add Bases → Detect Signals → Repeat hundreds of times on millions of clusters]
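The add bases / detect signals cycle can be illustrated with a toy base caller for a single cluster: at each cycle the brightest of four channels is taken as the base. This is a sketch under stated assumptions; the channel order, the `min_purity` cutoff, and the intensity values are hypothetical, and production base callers also model crosstalk and phasing.

```python
CHANNELS = "ACGT"  # assumed channel order for this example


def call_bases(cycles, min_purity=0.6):
    """Call one base per cycle from four channel intensities.

    A cycle is reported as 'N' when the brightest channel carries less than
    `min_purity` of the total signal (for example, a mixed cluster).
    """
    read = []
    for intensities in cycles:
        total = sum(intensities)
        best = max(range(4), key=lambda i: intensities[i])
        if total <= 0 or intensities[best] / total < min_purity:
            read.append("N")
        else:
            read.append(CHANNELS[best])
    return "".join(read)


if __name__ == "__main__":
    cluster = [
        (0.90, 0.05, 0.03, 0.02),  # clear A
        (0.10, 0.10, 0.70, 0.10),  # clear G
        (0.30, 0.30, 0.20, 0.20),  # ambiguous signal
    ]
    print(call_bases(cluster))
```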

Flavors of Sequencing

• Whole Genome Sequencing
– Obtain whole blood or tissue sample
– Create sequencing libraries of all DNA
fragments
• Whole Exome Sequencing
– Utilizes a selection protocol
– Attach complementary RNA strands to beads
– Fish out ONLY coding DNA sequences
– Create sequencing libraries from enriched DNA
– Reduces cost significantly
• Custom Capture
– Same protocol as Exome sequencing
– Only target desired DNA sequences
• Amplicon Sequencing
– Use PCR to amplify target DNA
– Sequence amplified DNA (Amplicon)

NGS Study Designs for Gene Discovery

• Multiplex families
• Case-control studies
• Trio sequencing of sporadic diseases

De novo Mutation Calling/Filtering

• Variant calling
– Individual variant calling
– Multi-sample variant calling
• Cross-checking public databases
– Exome Variant Server (6500 exome sequenced individuals)
• Visual inspection
• Sanger sequencing confirmation
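The filtering steps for trio data can be sketched as a simple screen: keep sites where the child is heterozygous, both parents are homozygous reference, coverage is adequate, and the variant is absent from public databases. The record layout, the VCF-style genotype strings, the depth cutoff, and the variant identifiers below are all hypothetical; surviving candidates would still go on to visual inspection and Sanger confirmation.

```python
from dataclasses import dataclass


@dataclass
class TrioSite:
    variant_id: str   # hypothetical identifier, e.g. "chr1:12345:A>G"
    child: str        # genotypes in VCF-style notation: "0/0", "0/1", "1/1"
    mother: str
    father: str
    min_depth: int    # lowest read depth among the three samples


def candidate_de_novo(sites, known_variants, depth_cutoff=10):
    """Return IDs of sites consistent with a de novo mutation in the child."""
    candidates = []
    for site in sites:
        is_de_novo_pattern = (
            site.child == "0/1" and site.mother == "0/0" and site.father == "0/0"
        )
        well_covered = site.min_depth >= depth_cutoff
        novel = site.variant_id not in known_variants
        if is_de_novo_pattern and well_covered and novel:
            candidates.append(site.variant_id)
    return candidates


if __name__ == "__main__":
    sites = [
        TrioSite("chr1:100:A>G", "0/1", "0/0", "0/0", 35),  # passes all filters
        TrioSite("chr2:200:C>T", "0/1", "0/1", "0/0", 40),  # inherited from mother
        TrioSite("chr3:300:G>A", "0/1", "0/0", "0/0", 4),   # too little coverage
        TrioSite("chr4:400:T>C", "0/1", "0/0", "0/0", 50),  # seen in a public database
    ]
    print(candidate_de_novo(sites, known_variants={"chr4:400:T>C"}))
```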

Detecting Copy Number Variants

ERDS (Estimation by Read Depth with SNVs)
The average read depth (RD) of every 2-kb window was calculated, followed
by GC correction. A paired hidden Markov model was then applied to infer
the copy number of every window using both RD information and
heterozygosity information.

[Figure: read depth across windows, showing a homozygous deletion, a heterozygous deletion, and a duplication]
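The read depth idea can be shown with a much simpler stand-in for the method above. The sketch below applies a median-based GC correction and then rounds the depth ratio to an integer copy number; it deliberately replaces the paired hidden Markov model with plain rounding and ignores heterozygosity, so it is not ERDS itself. Function names, the GC bin width, and the depths are hypothetical.

```python
from statistics import median


def gc_correct(depths, gc_fractions, bin_width=0.05):
    """Rescale each window so that windows with similar GC share the overall median depth."""
    overall = median(depths)
    bins = {}
    for depth, gc in zip(depths, gc_fractions):
        bins.setdefault(round(gc / bin_width), []).append(depth)
    corrected = []
    for depth, gc in zip(depths, gc_fractions):
        bin_median = median(bins[round(gc / bin_width)])
        corrected.append(depth * overall / bin_median if bin_median > 0 else 0.0)
    return corrected


def call_copy_number(depths):
    """Estimate copy number per window, assuming the median window is diploid."""
    baseline = median(depths)
    return [round(2 * depth / baseline) for depth in depths]


if __name__ == "__main__":
    # Hypothetical average read depth for nine consecutive 2-kb windows.
    depths = [30, 31, 29, 30, 15, 0, 45, 30, 30]
    gc = [0.40] * len(depths)
    print(call_copy_number(gc_correct(depths, gc)))
```

In the output, 0 corresponds to a homozygous deletion, 1 to a heterozygous deletion, 2 to the normal diploid state, and 3 or more to a duplication.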

Illumina

• Uses a flow cell
• Clusters generated on slide via bridge
amplification
• Sequencing performed by flowing labeled bases
over the flow cell
– 4 pictures taken (one for each base)
– Cluster color determined at each cycle allows
interrogation of sequence
• Advantages
– Low cost per base
– Very high throughput
• Limitations
– High cost per experiment
– Short read length (30-150bp)
– Acquired a company that uses new tech to
reach read lengths of 2-10Kb
Schadt et al 2010 HMG

Ion Torrent

• Emulsion PCR is used to generate clusters
on a bead
– Chemistry resembles pyrosequencing, which
relies on release of pyrophosphate for
detection
– Instead of a visual cue, system senses the
release of H+ as each base is flowed over the
beads
• Advantages
– Short run time
– Does not require modified bases
– Longer read length (200bp)
• Limitations
– Low data output
– High homopolymer error rate
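Flowing one base at a time means the signal from each flow scales with the number of identical bases incorporated, which is where the homopolymer errors come from. The toy decoder below rounds each flow signal to a base count; the flow order "TACG" and the signal values are assumptions for illustration, and real instruments use more elaborate signal models.

```python
def decode_flows(signals, flow_order="TACG"):
    """Turn per-flow signals into a read by rounding each signal to a base count."""
    read = []
    for index, signal in enumerate(signals):
        base = flow_order[index % len(flow_order)]
        count = max(0, round(signal))  # ~1.0 means one base, ~2.0 a run of two, etc.
        read.append(base * count)
    return "".join(read)


if __name__ == "__main__":
    # Flows: T, A, C, G, T with a clear two-base C homopolymer.
    print(decode_flows([1.02, 0.05, 2.10, 0.00, 0.90]))
    # A noisy long homopolymer: 4.45 and 4.55 round to different lengths.
    print(decode_flows([4.45]), decode_flows([4.55]))
```

The relative gap between a run of four and a run of five is much smaller than the gap between zero and one, so small amounts of noise change the called length of long homopolymers.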

Third Generation Sequencing

• Defined as single molecule sequencing
• Less complex sample prep
• Much longer read length
– Short read length of second generation sequencing
(SGS) is a huge disadvantage for de novo
sequencing applications
• Two categories
– Sequencing by synthesis
– Direct sequencing
• Passing molecule through a nanopore
• Using atomic force microscopy
• Bleeding edge technology
– Many technical hurdles
– Currently very high error rates

Pacific Biosciences

• Utilizes single molecule sequencing by
synthesis
• Extremely complex system
– Each well contains a single DNA molecule and
an immobilized polymerase
– No reagent washing
– Employs confocal microscopy to only detect
fluorescence at the polymerase
• Advantages
– Very long read length (1-15kb)
– Low complexity sample prep
– Very fast data generation (real time)
• Disadvantages
– Prone to sequencing errors (~15% error
rate)
– Company on the verge of bankruptcy

Third/Second Generation Sequencing

• Currently only one viable high throughput
long read sequencing platform
– PacBio system has a 15% error rate
– Need long reads for many applications from de
novo sequencing to haplotyping
• Second generation sequencers are high
throughput and accurate
– Short reads are hard to assemble and leave
gaps in repetitive sequences
• Can use both as a highly accurate and
extremely powerful tool for de novo
sequencing applications
– Use PacBio assembly as a scaffold
– Correct errors by aligning HiSeq reads on top
– Effective error rate of 0.1%
– Expensive but extremely fast and accurate
compared to other methods
Koren et al 2012 Nature Biotechnology
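The idea of correcting a long, error-prone read with accurate short reads can be sketched as a per-position majority vote. This is a simplified illustration, not the published pipeline: it assumes the short reads are already aligned to the long read at known offsets, and it handles substitutions only, whereas long-read errors in practice include many insertions and deletions. The sequences and the `min_support` parameter are hypothetical.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads, min_support=2):
    """Replace each long-read base with the short-read consensus when support is sufficient.

    aligned_short_reads is a list of (offset, sequence) pairs, where offset is the
    position in the long read at which the short read starts.
    """
    pileup = [Counter() for _ in long_read]
    for offset, read in aligned_short_reads:
        for i, base in enumerate(read):
            position = offset + i
            if 0 <= position < len(long_read):
                pileup[position][base] += 1

    corrected = []
    for original, counts in zip(long_read, pileup):
        if counts:
            base, support = counts.most_common(1)[0]
            corrected.append(base if support >= min_support else original)
        else:
            corrected.append(original)  # no short-read coverage: keep as is
    return "".join(corrected)


if __name__ == "__main__":
    noisy_long_read = "ACGTTCGA"  # hypothetical read with an error at position 3
    short_reads = [(0, "ACGAT"), (2, "GATCG"), (3, "ATCGA")]
    print(correct_long_read(noisy_long_read, short_reads))
```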

Future: Nanopore Sequencing

• Leading candidate is Oxford
Nanopore
• Concept
– Detect flow of ionic current through the
pore
– Each base causes a detectable change in
the current
– Results in direct sequencing
– Theoretically could be used to sequence
RNA and protein too
• Advantages
– Long read length
– Plug and play
– Easily scalable
• Disadvantages
– No hard data yet
– No specific release date
Credit: John MacNeill/TechnologyReview

Future: Direct sequencing

• Concept stage techniques
– Significant technical hurdles to overcome
– Mostly proof of concept experiments
• IBM DNA Transistor (Credit: IBM)
– Bases read as single stranded DNA passes
through the transistor
– Gold bands represent metal, gray bands are
the dielectric
• Atomic force microscopy sequencing
– Use AFM tip to detect each base of single
stranded DNA

Credit: Lee et al US PAT 20040124084

Sequencing Applications

• Old techniques which used to take days or
years to perform can now be completed
in hours
• Next generation sequencing has opened a
new door for addressing very complicated
genetic questions
– Has huge potential to revolutionize human
healthcare
– Survey complex tumor types
– Research into macro and micro community
genomics
– Reveal evolutionary history

De novo Sequencing

• Human genome took 10 years to complete and
cost $3 billion
– Done by laboriously cloning overlapping segments of the
human genome into BAC libraries and Sanger
sequencing each one
– Genome assembled using computers to line up
overlapping sequences
• Current cost estimate is around $4000 per genome
– Can be completed in a week
– Companies like Complete Genomics say they have already
sequenced thousands of human genomes
• Future
– Long read sequencers will make agricultural sequencing
more viable
– Whole genome sequencing for human diagnostics will
become routine
– Increasing the catalog of organismal genomes will improve
our understanding of evolution and development
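Lining up overlapping sequences can be illustrated with a greedy merge: repeatedly join the two fragments with the longest suffix-to-prefix overlap. This is a teaching sketch with hypothetical fragments and a hypothetical minimum overlap; real assemblers use graph-based methods and must cope with sequencing errors and repeats, which is exactly where short reads leave gaps.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Merge fragments pairwise by best overlap until no overlaps remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining fragments do not overlap: separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```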

Genome Mutation Analysis

• Previously done by completing
complicated and time consuming familial
linkage studies and targeted Sanger
sequencing
• Next generation sequencing can look at
every gene at once
– Can produce a genetic map of the complete
genome
– Used to detect genetic polymorphisms
– See every possible mutation
• Future
– Whole genome sequence analysis
– Targeted genome sequencing analysis using
predetermined sequence selection arrays (ex:
Exome Enrichment)

Pharmacogenetics

• Very hot topic in the biotech and
insurance industries
• Use genetic typing to guess how a person
might respond to different drug
treatments
• Currently relies on microarrays
• NGS could provide significantly more
information at more loci
– Microarrays only look at a handful of
polymorphisms
– Current NGS approaches port the microarray
technique to enrich pools for sequencing
• Future
– As the catalog of human genomes increases, it
will be easier to calculate responses to
treatment before drugs are administered
Gauthier et al 2007 Cancer Cell

Epigenetics

• Defined as heritable genetic information
that is not coded in the DNA bases
– DNA methylation
– Histone modifications
• Previous methods for detecting these
chromatin or DNA modifications relied on
targeted probing
– ChIP-PCR
– Bisulfite sequencing
– Footprinting assays
• Next generation sequencing changed
everything
– Whole genome methylation mapping (MAP-IT)
– Whole genome histone modification and
protein binding mapping (ChIP-Seq -
acetylation, methylation, etc)
• ENCODE project

ENCyclOpedia of Dna Elements (ENCODE)

• International project
– Follow up to the human genome project
• Only about 2% of the human genome codes for
protein
– Creating and maintaining DNA is biochemically
expensive
– What’s the other 98% of the genome doing?
• ENCODE goals
– Determine the functional elements of the
human genome
– Protein Coding
– Non-Coding RNA
– mRNA Expression
– Regulatory protein binding sites
– Histone modifications
• Preliminary estimates show that 80% of
human DNA is functional!

Transcriptome/Expression Analysis

• Gene expression analysis is important for
disease discovery and cancer diagnosis
• Expression analysis first relied on Northern
blotting followed by DNA microarrays
– Both cases require a probe
– Need to “know” what you are looking for
– Low resolution screening
• Next generation approaches screen the
entire transcriptome (RNA-Seq)
– Single base resolution of expression
– Can see level of expression and also visualize
mutations in expressed sequences
• Future
– Important for diagnosing/treating cancer and
heritable diseases
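Reading an expression level out of RNA-Seq data requires normalizing read counts, because longer genes and deeper runs both produce more reads. One widely used unit is transcripts per million (TPM); the sketch below is a minimal version of that calculation, with hypothetical gene names, counts, and lengths, and it is offered as an example of normalization rather than as the method used in any study cited here.

```python
def tpm(read_counts, gene_lengths_bp):
    """Transcripts per million: length-normalize counts, then scale to sum to 1e6."""
    rates = {gene: read_counts[gene] / gene_lengths_bp[gene] for gene in read_counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("no reads mapped to any gene")
    return {gene: 1e6 * rate / scale for gene, rate in rates.items()}


if __name__ == "__main__":
    # Two hypothetical genes with equal read counts but different lengths.
    counts = {"geneA": 100, "geneB": 100}
    lengths = {"geneA": 1000, "geneB": 2000}
    for gene, value in tpm(counts, lengths).items():
        print(gene, round(value))
```

With equal read counts, the shorter gene receives the higher value, since the same number of reads over fewer bases implies more transcripts.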

Phenotypic Correlation

• NGS data generates huge datasets with
85-99.9% base accuracy
– Must determine which signals are real, and
which are noise/errors
– Most promising hits are validated by other
assays (Sanger, qRT-PCR, Mass Spec)
– How do we determine which hits to validate?
• Current datasets are very small, even
in pharmacogenetics, and have limited
utility
• Validated hits can be distractions
– Tumor diversity presents multiple escape
routes during targeted treatment
– See NYTimes series on whole genome
sequencing: http://nyti.ms/No4fgd
• Future
– Require large validated datasets that are
ethnically and geographically diverse
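The scale of the noise problem follows directly from the accuracy figures. The sketch below converts per-base accuracy to a Phred quality score and estimates how many erroneous base calls to expect; the 3 billion base total is an assumed round number for roughly one pass over a human genome, used only for illustration.

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Phred quality score: Q = -10 * log10(error probability)."""
    error = 1.0 - accuracy
    if error <= 0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error)


def expected_errors(n_bases: int, accuracy: float) -> float:
    """Expected number of incorrect base calls among n_bases."""
    return n_bases * (1.0 - accuracy)


if __name__ == "__main__":
    n_bases = 3_000_000_000  # assumption: about one-fold coverage of a human genome
    for accuracy in (0.85, 0.99, 0.999):
        q = phred_from_accuracy(accuracy)
        errors = expected_errors(n_bases, accuracy)
        print(f"accuracy {accuracy}: Q{q:.0f}, about {errors:.2e} erroneous calls")
```

Even at 99.9% accuracy the expected number of wrong calls per genome pass is in the millions, which is why coverage depth and independent validation matter.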

Metagenomics

• Used to survey macro and micro
environments
– Microbial communities (Soil/Gut)
– Tumors
– Plant communities
– Coral reef ecosystems
• Previous techniques coupled mtDNA or
ribosomal Sanger sequencing with BLAST
analysis
– Limited by number of sequenced species
– Can determine who, but not what is going on
• NGS approaches now being used to
determine exactly what organisms are
present and how they interact
– Can get expression data and link it back to
community groups
– Survey community diversity
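Surveying community diversity usually ends in a summary statistic computed from per-taxon read counts. The Shannon index is one standard choice; the sketch below assumes reads have already been assigned to taxa, and the taxon names and counts are hypothetical.

```python
import math


def shannon_diversity(taxon_counts) -> float:
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(taxon_counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any taxon")
    return -sum(
        (count / total) * math.log(count / total)
        for count in taxon_counts.values()
        if count > 0
    )


if __name__ == "__main__":
    even_community = {"taxon1": 250, "taxon2": 250, "taxon3": 250, "taxon4": 250}
    skewed_community = {"taxon1": 970, "taxon2": 10, "taxon3": 10, "taxon4": 10}
    print(shannon_diversity(even_community))
    print(shannon_diversity(skewed_community))
```

An evenly distributed community scores higher than one dominated by a single taxon, even when both contain the same number of taxa.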

Data

• Absolutely the largest roadblock for next
generation sequencing
• Terabytes of data are useless if we can’t
efficiently analyze the data
• How long should data be kept?
– Depends on application
• Human Diagnostic sequencing?
• Research sequencing?
• Where should data be kept and
processed?
– Local or Cloud (Amazon, etc)?
– Cost of infrastructure vs cost of cloud service
– Security issues
• Future
– Cloud based solutions will become more
attractive
