DNA sequencing: methods
I. Brief history of sequencing
II. Sanger dideoxy method for sequencing
III. Sequencing large pieces of DNA
IV. The “$1,000 genome”
On WebCT
-- “The $1000 genome”
-- review of new sequencing techniques by George Church
Why sequence DNA?
• All genes available for an organism to use -- a
very important tool for biologists
• Not just sequence of genes, but also positioning
of genes and sequences of regulatory regions
• New recombinant DNA constructs must be
sequenced to verify construction or positions of
mutations
• Etc.
History of DNA sequencing
MC chapter 12
History of DNA sequencing
Methods of sequencing
A. Sanger dideoxy (primer extension/chain-termination)
method: most popular protocol for sequencing, very
adaptable, scalable to large sequencing projects
B. Maxam-Gilbert chemical cleavage method: DNA is
labelled and then chemically cleaved in a sequence-
dependent manner. This method is not easily scaled and
is rather tedious
C. Pyrosequencing: measuring chain extension by
pyrophosphate monitoring
for dideoxy sequencing you need:
1) Single stranded DNA template
2) A primer for DNA synthesis
3) DNA polymerase
4) Deoxynucleoside triphosphates and
dideoxynucleotide triphosphates
Primers for DNA sequencing
• Oligonucleotide primers can be synthesized by
phosphoramidite chemistry--usually designed
manually and then purchased
• Sequence of the oligo must be complementary to
DNA flanking the sequenced region
• Oligos are usually 15-30 nucleotides in length
DNA templates for sequencing:
• Single stranded DNA isolated from
recombinant M13 bacteriophage containing
DNA of interest
• Double-stranded DNA that has been
denatured
• Non-denatured double stranded DNA (cycle
sequencing)
One way for obtaining single-stranded DNA from a double
stranded source--magnets
Reagents for sequencing:
DNA polymerases
• Should be highly processive, and
incorporate ddNTPs efficiently
• Should lack exonuclease activity
• Thermostability required for “cycle
sequencing”
Single stranded DNA 5’
3’
5’ 3’
Sanger dideoxy sequencing--basic method
a) Anneal the primer
Sanger dideoxy sequencing: basic method
b) Extend the
primer with DNA
polymerase in the
presence of all four
dNTPs, with a
limited amount of a
dideoxy NTP
(ddNTP)
5’
3’
Direction of
DNA
polymerase
travel
DNA polymerase incorporates ddNTP in a template-
dependent manner, but it works best if the DNA pol
lacks 3’ to 5’ exonuclease (proofreading) activity
Sanger dideoxy sequencing: basic method
5’
3’
5’ 3’
T T T
T
ddA
ddA
ddA
ddA
ddATP in the
reaction: anywhere
there’s a T in the
template strand,
occasionally a ddA
will be added to the
growing strand
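The chain-termination idea can be sketched in a few lines of Python. This is an illustration only, not a model of real reaction kinetics: the function names, the example template, and the convention of writing the template 3'->5' (the order in which the polymerase copies it) are assumptions made for the sketch.

```python
# Complement of each template base, i.e. the base added to the growing strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def termination_lengths(template_3to5, dd_base, primer_len=0):
    """Lengths of extension products that can end in the given ddNTP.

    template_3to5 is the template downstream of the primer, written 3'->5'.
    A fragment can terminate wherever the template base pairs with dd_base.
    """
    lengths = []
    for i, template_base in enumerate(template_3to5):
        if COMPLEMENT[template_base] == dd_base:
            lengths.append(primer_len + i + 1)
    return lengths


def read_gel(template_3to5):
    """Run four separate ddNTP reactions and read the sequence by fragment length."""
    lanes = {base: termination_lengths(template_3to5, base) for base in "ACGT"}
    # Each fragment length appears in exactly one lane; shortest fragment first.
    by_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))


if __name__ == "__main__":
    template = "TGCATTAC"  # hypothetical template, 3'->5'
    print(termination_lengths(template, "A"))  # ddA stops opposite each T
    print(read_gel(template))                  # newly synthesized strand, 5'->3'
```

Sorting the fragments by length and noting which ddNTP reaction each came from is the same logic used when reading a sequencing gel from the bottom up.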
How to visualize DNA fragments?
• Radioactivity
– Radiolabeled primers (kinase with 32P)
– Radiolabelled dNTPs (alpha 35S or 32P)
• Fluorescence
– ddNTPs chemically synthesized to contain fluors
– Each ddNTP fluoresces at a different wavelength
allowing identification
Analysis of sequencing products:
Polyacrylamide gel electrophoresis--good
resolution of fragments differing by a single
nucleotide
– Slab gels: as previously described
– Capillary gels: require only a tiny amount of
sample to be loaded, run much faster than
slab gels, best for high throughput
sequencing
DNA sequencing gels: old school
Analyze sequencing
products by gel
electrophoresis,
autoradiography
Different ddNTP used in
separate reactions
Radioactively labelled primer or
dNTP in sequencing reaction
cycle sequencing: denaturation
occurs during temperature cycles
94°C:DNA denatures
45°C: primer anneals
60-72°C: thermostable DNA
pol extends primer
Repeat 25-35 times
Advantages: don’t need a lot
of template DNA
Disadvantages: DNA pol
may incorporate ddNTPs
poorly
Animation of cycle sequencing: see
http://www.dnai.org/
Click on:
“manipulation”
“techniques”
“sorting and sequencing”
An automated sequencer
The output
Current trends in sequencing:
It is rare for labs to do their own sequencing:
--costly, perishable reagents
--time consuming
--success rate varies
Instead most labs send out for sequencing:
--You prepare the DNA (usually plasmid, M13, or PCR product),
supply the primer, company or university sequencing center
does the rest
--The sequence is recorded by an automated sequencer as an
“electropherogram”
~160 kbp
~1 kbp
Assemble sequences by
matching overlaps
BAC sequence
BAC overlaps give genome sequence
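Assembly by matching overlaps can be sketched as a greedy merge: repeatedly join the two reads with the longest suffix/prefix overlap. This is a toy illustration under simplifying assumptions (error-free reads, one strand only, no repeats); the function names and the example reads are hypothetical, and production assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by best overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more: the remaining reads are separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

If the reads do not all connect, the function returns several contigs, which corresponds to the gaps that have to be closed by other means.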
BREAK UP THE GENOME,
PUT IT BACK TOGETHER
Sequencing large pieces of DNA:
the “shotgun” method
• Break DNA into small pieces (fragments of around
1000 base pairs are preferable)
• Clone pieces of DNA into M13
• Sequence enough M13 clones to ensure complete
coverage (e.g. a 3 million base pair genome requires
5x to 10x coverage, i.e. 15 to 30 million base pairs of
reads, for a reliable representation of the genome)
• Assemble genome through overlap analysis using
computer algorithms, also “polish” sequences using
mapping information from individual clones,
characterized genes, and genetic markers
• This process is assisted by robotics
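The coverage arithmetic can be made concrete with a short calculation. The sketch assumes each read is about 1000 bases (taken from the fragment size above, which is optimistic for a single Sanger read) and uses the standard Lander-Waterman approximation for the fraction of the genome covered; the function names are illustrative.

```python
import math


def reads_needed(genome_bp, read_bp, coverage):
    """Number of reads required to reach a given average coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_fraction_covered(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by at least one read.

    Assumes reads start at uniformly random positions in the genome.
    """
    return 1 - math.exp(-coverage)


if __name__ == "__main__":
    genome = 3_000_000  # 3 million base pair genome, as in the example above
    for cov in (5, 10):
        n = reads_needed(genome, read_bp=1000, coverage=cov)
        frac = expected_fraction_covered(cov)
        print(f"{cov}x coverage: {n} reads, ~{frac:.4%} of bases covered")
```

Under these assumptions even 5x coverage leaves a small fraction of bases unsequenced, which is one reason higher coverage and gap closure are needed.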
Sequencing done by TIGR (Maryland) and The
Sanger Institute (Cambridge, UK)
“Here we report an analysis of the genome sequence of P.
falciparum clone 3D7, including descriptions of chromosome
structure, gene content, functional classification of proteins,
metabolism and transport, and other features of parasite
biology.”
Sequencing strategy
A whole chromosome shotgun sequencing
strategy was used to determine the genome
sequence of P. falciparum clone 3D7. This approach
was taken because a whole genome shotgun
strategy was not feasible or cost-effective with the
technology that was available at the beginning of the
project. Also, high-quality large insert libraries of (A +
T)-rich P. falciparum DNA have never been
constructed in Escherichia coli, which ruled out a
clone-by-clone sequencing strategy. The
chromosomes were separated on pulsed field gels,
and chromosomal DNA was extracted…
The shotgun sequences were assembled into
contiguous DNA sequences (contigs), in some cases with
low coverage shotgun sequences of yeast artificial
chromosome (YAC) clones to assist in the ordering of
contigs for closure. Sequence tagged sites (STSs) [10],
microsatellite markers [11,12] and HAPPY mapping [7] were
also used to place and orient contigs during the gap
closure process. The high (A + T) content of the genome
made gap closure extremely difficult [7–9].
Chromosomes 1–5, 9 and 12 were closed,
whereas chromosomes 6–8, 10, 11, 13 and 14 contained
3–37 gaps (most less than 2.5 kb) per chromosome at the
beginning of genome annotation. Efforts to close the
remaining gaps are continuing.
Methods: Sequencing, gap closure and annotation
The techniques used at each of the three participating
centres for sequencing, closure and annotation are described in
the accompanying Letters [7–9]. To ensure that each centre’s
annotation procedures produced roughly equivalent results, the
Wellcome Trust Sanger Institute (‘Sanger’) and the Institute for
Genomic Research (‘TIGR’) annotated the same 100-kb
segment of chromosome 14. The number of genes predicted in
this sequence by the two centres was 22 and 23; the
discrepancy being due to the merging of two single genes by
one centre. Of the 74 exons predicted by the two centres, 50
(68%) were identical, 9 (12%) overlapped, 6 (8%) overlapped
and shared one boundary, and the remainder were predicted by
one centre but not the other. Thus 88% of the exons predicted
by the two centres in the 100-kb fragment were identical or
overlapped.
The $1000 genome
Venter Foundation (2003): The first group to produce a
technology capable of a $1000 human genome will win
$500,000 …
X Prize Foundation: no, $5-20 million …
National Institutes of Health (2004): $70 million grant program
to reach the $1000 genome
Previous sequencing techniques: one DNA molecule at a time
Needed: many DNA molecules at a time -- arrays
One of these: “pyrosequencing”
Cut a genome to DNA fragments 300 - 500 bases long
Immobilize single strands on a very small plastic bead (one
piece of DNA per bead)
Amplify the DNA on each bead to cover each bead to boost the
signal
Separate each bead on a plate with up to 1.6 million wells
Sequence by DNA polymerase -dependent chain extension,
one base at a time in the presence of a reporter (luciferase)
Luciferase is an enzyme that emits light when supplied with ATP.
The pyrophosphate (PPi) released upon nucleotide addition by
DNA polymerase is converted to ATP (by ATP sulfurylase, using
APS), so each incorporation produces a flash of light
Flashes of light and their intensity are recorded
Extension with individual dNTPs gives a readout
The readout is recorded by
a detector that measures
position of light flashes and
intensity of light flashes
APS = Adenosine phosphosulfate From www.454.com
25 million bases in
about 4 hours
Height of peak indicates the number of
dNTPs added
This sequence: TTTGGGGTTGCAGTT
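The relationship between peak height and sequence can be sketched as follows. Nucleotides are flowed one at a time in a fixed cycle, and each flow gives a signal proportional to the number of bases incorporated. The flow order "TACG" and the function names are assumptions for this sketch, and the simulated signals are ideal (no noise, no incomplete extension).

```python
def flowgram(sequence, flow_order="TACG"):
    """Ideal peak heights: one (base, count) pair per nucleotide flow."""
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    while pos < len(sequence):
        for base in flow_order:
            run = 0
            # Extend through the whole homopolymer run in a single flow.
            while pos < len(sequence) and sequence[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
            if pos == len(sequence):
                break
    return flows


def call_bases(flows):
    """Convert peak heights back to sequence, rounding to the nearest integer."""
    return "".join(base * round(height) for base, height in flows)


if __name__ == "__main__":
    seq = "TTTGGGGTTGCAGTT"  # the sequence from the example readout
    signal = flowgram(seq)
    print(signal[:4])          # first cycle: T x3, A x0, C x0, G x4
    print(call_bases(signal))  # recovers the original sequence
```

Because the height must be rounded to a whole number of bases, long homopolymer runs are the hardest part of the readout to call correctly.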
Introduction to bioinformatics
1) Making biological sense of DNA
sequences
2) Online databases: a brief survey
3) Database in depth: NCBI
4) What is BLAST?
5) Using BLAST for sequence analysis
6) “Biology workbench”, etc.
www.ncbi.nlm.nih.gov
www.tigr.org
http://workbench.sdsc.edu
There’s plenty of DNA to make sense of
http://www.genomesonline.org/
(2006)
Making sense of genome sequences:
1) Genes
a) Protein-coding
• Where are the open reading frames?
• What are the ORFs most similar to? (What is
the function/structure/evolution history?)
b) RNA
2) Non-genes
a) Regulation: promoters and factor-binding sites
b) Transactions: replication, repair, and
segregation, DNA packaging (nucleosomes)
Sequence output
Computer calls
GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC
CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG
GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG
ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA
ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT
TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC
ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT
ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
Raw data
atgttgtatttgtctgaagaaaataaatccgtat
ccactccttgccctcctgataagattatctttga
tgcagagaggggggagtacatttgctctgaaact
ggagaagttttagaagataaaattatagatcaag
ggccagagtggagggccttcacgccagaggagaa
agaaaagagaagcagagttggagggcctttaaac
aatactattcacgataggggtttatccactctta
tagactggaaagataaggatgctatgggaagaac
tttagaccctaagagaagacttgaggcattgaga
tggagaaagtggcaaattaga
What does this sequence do?
Could it encode a protein?
Looking for ORFs (Open Reading Frames)
using “DNA Strider”
ORF map 1) Where are the potential starts (ATG)
and stops (TAA, TAG, TGA)?
2) Which reading frame is correct?
= ATG
= stop
codon
Reading frame #1 appears to encode a protein
Cautions in ORF identification
• Not all genes initiate with ATG, particularly in certain
microbes (archaea)
• What is the shortest possible length of a real ORF? 50
amino acids? 25 amino acids? Cut-off is somewhat
arbitrary.
• In eukaryotes, ORFs can be difficult to identify because
of introns
• Are there other sequences surrounding the ORF that
indicate it might be functional?
– promoter sequences for RNA polymerase binding
– Shine-Dalgarno sequences for ribosome binding?
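A minimal ORF scan along these lines might look like the sketch below. It checks only the three forward reading frames, requires an ATG start, and applies an adjustable minimum length, so it inherits every caution listed above (alternative start codons, introns, arbitrary cut-off). The function name and the default of 25 codons are assumptions for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=25):
    """Find ATG...stop ORFs in the three forward frames.

    Returns (frame, start, end) tuples: frame is 1-3, start is the 0-based
    index of the A in ATG, end is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # hypothetical sequence with one short ORF
    print(find_orfs(toy, min_codons=3))
```

To scan the opposite strand as well, the same function can be run on the reverse complement of the sequence.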
What is the function of
the sequenced gene?
Classical methods:
-- mutate gene, characterize phenotype for clues to function
(genetics)
-- purify protein product, characterize in vitro (biochemistry)
Comparison to previously characterized genes:
-- gene sequences that have high sequence similarity
usually have similar functions
-- if your gene has been previously characterized
(using classical methods) by someone else, you want
to know right away! (avoid duplication of labor)
NCBI
NCBI home page --Go to www.ncbi.nlm.nih.gov for the following
pages
Pubmed: search tool for literature--search by author, subject, title
words, etc.
All databases: “a retrieval system for searching several linked
databases”
BLAST: Basic Local Alignment Search Tool
OMIM: Online Mendelian Inheritance in Man
Books: many online textbooks available
Tax Browser: A taxonomic organization of organisms and their
genomes
Structure: Clearinghouse for solved molecular structures
What does BLAST do?
1) Searches chosen sequence database
and identifies sequences with similarity
to test sequence
2) Ranks similar sequences by statistical
significance (E value)
3) Illustrates alignment between test
sequence and similar sequences
Alignment of sequences:
The principle: two homologous sequences derived from the
same ancestral sequence will have at least some identical
(similar) amino acid residues
Fraction of identical amino acids is called “percent identity”
Similar amino acids: some amino acids have similar
physical/chemical properties, and more likely to substitute for
each other--these give specific similarity scores in
alignments
Gaps in similar/homologous sequences are rare, and are
given penalty scores
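The scoring ideas above (identity, similarity, gap penalties) can be illustrated with a small calculation on an already-aligned pair of sequences. The similarity groups and the score values below are simplified assumptions chosen for the sketch; real alignment programs use substitution matrices such as BLOSUM62 and separate gap-opening and gap-extension penalties.

```python
# Hypothetical groups of amino acids with similar physical/chemical properties.
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def percent_identity(aln_a, aln_b):
    """Identical residues as a percentage of alignment columns (gaps included)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for a, b in zip(aln_a, aln_b) if a == b and a != "-")
    return 100.0 * identical / len(aln_a)


def are_similar(a, b):
    """True if both residues fall in the same illustrative similarity group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def alignment_score(aln_a, aln_b, match=2, similar=1, mismatch=-1, gap=-2):
    """Sum per-column scores: identities, similar pairs, mismatches, gaps."""
    score = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        elif are_similar(a, b):
            score += similar
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    a, b = "MKV-LDE", "MRVALDQ"  # hypothetical aligned pair, '-' marks a gap
    print(f"{percent_identity(a, b):.1f}% identity, score {alignment_score(a, b)}")
```

Note that percent identity depends on the denominator chosen; here it is the full alignment length, which is one common convention among several.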
Homology of proteins
Homology: similarity of biological structure, physiology,
development, and evolution, based on genetic inheritance
Homologous proteins: statistically similar sequence, therefore
similar functions (often, but not always…)
Alignment of TFB and TFIIB sequences
High sequence similarity correlates with functional similarity
40-20% identity: fold can be predicted by similarity but precise
function cannot be predicted (the 40% rule)
enzymes
Non-enzymes
Programs available for BLAST searches
Protein sequence (this is the best option)
blastp--compares an amino acid query sequence against a protein
sequence database
tblastn--compares a protein query sequence against a nucleotide
sequence database translated in all reading frames
DNA sequence
blastn--compares a nucleotide query sequence against a nucleotide
sequence database
blastx--compares a nucleotide query sequence translated in all reading
frames against a protein sequence database
tblastx--compares the six-frame translations of a nucleotide query
sequence against the six-frame translations of a nucleotide sequence
database.
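What "translated in all reading frames" means can be shown with a short six-frame translation sketch. It uses the standard genetic code, written in the compact TCAG ordering used in the NCBI genetic code tables; the function names are illustrative, and codons containing ambiguous bases such as N are translated as "X".

```python
from itertools import product

# Standard genetic code; codons enumerated in TCAG order, third base fastest.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of frames +1..+3 (given strand) and -1..-3 (reverse complement)."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGAAATAG").items():
        print(name, protein)
```

blastx and tblastn perform this kind of translation on the query or the database, respectively, before comparing at the protein level.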
BLAST considers all possible combinations of
matches
mismatches
gaps
in any given alignment
Gives the “best” (highest scoring) alignment of sequences
Three scores
1) percent identity
2) similarity score
3) E-value--the number of alignments with this score or
better expected by chance in a database of this size
(lower number, higher probability of evolutionary
homology, higher probability of similar function)
What is the E-value?
The E value represents the chance that the similarity is
random and therefore insignificant. Essentially, the E value
describes the random background noise that exists for
matches between sequences. For example, an E value of 1
assigned to a hit can be interpreted as meaning that in a
database of the current size one might expect to see 1
match with a similar score simply by chance.
You can change the Expect value threshold on most main
BLAST search pages. When the Expect value is increased
from the default value of 10, a larger list with more low-
scoring hits can be reported.
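How the E value scales with score and database size can be sketched with the usual bit-score relation, E = m x n x 2^(-S'), where m is the query length, n the total database length, and S' the bit score. This is a simplified illustration: BLAST itself uses effective lengths with edge corrections, and the function names below are assumptions.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score and the size of the search space."""
    return query_len * db_len * 2.0 ** (-bit_score)


def is_reported(bit_score, query_len, db_len, threshold=10.0):
    """True if the hit falls at or below the Expect threshold (default 10)."""
    return expect_value(bit_score, query_len, db_len) <= threshold


if __name__ == "__main__":
    m, n = 300, 1_000_000_000  # hypothetical query and database sizes
    for s in (30, 40, 50, 100):
        print(f"bit score {s}: E = {expect_value(s, m, n):.3g}")
```

Each extra bit of score halves the E value, while doubling the database size doubles it, which is why the same alignment becomes less significant as databases grow.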
E values (continued)
From the BLAST tutorial:
Although hits with E values much higher than 0.1 are
unlikely to reflect true sequence relatives, it is useful
to examine hits with lower significance (E values
between 0.1 and 10) for short regions of similarity. In
the absence of longer similarities, these short
regions may allow the tentative assignment of
biochemical activities to the ORF in question. The
significance of any such regions must be assessed
on a case by case basis.
Relationship between E-value and function
Single domain proteins
Multi-domain proteins
E value greater than 10^-10, similar structure but possibly
different functions
Computer calls
GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC
CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG
GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG
ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA
ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT
TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC
ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT
ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
Raw data
What does this sequence do? Cue up BLAST…..
MKCPYCKSRDLVYDRQHGEVFCKKCGSILATNLVDSELSRKT
KTNDIPRYTKRIGEFTREKIYRLRKWQKKISSERNLVLAMSE
LRRLSGMLKLPKYVEEEAAYLYREAAKRGLTRRIPIETTVAA
CIYATCRLFKVPRTLNEIASYSKTEKKEIMKAFRVIVRNLNL
TPKMLLARPTDYVDKFADELELSERVRRRTVDILRRANEEGI
TSGKNPLSLVAAALYIASLLEGERRSQKEIARVTGVSEMTVR
NRYKELA
Find the open reading frame(s)
Translate it:
BLAST against (go to genomes page):
-- Microbial genomes
-- environmental sequences (genomes)
Results:
1) Distribution of hits: query sequence and positions in
sequence that gave alignments
2) Sequences producing significant alignments
1) Accession number (this takes you to the sequence that
yielded the hit: gene or contig)
2) Name of sequence (sometimes identifies the gene)
3) Similarity score
4) E-value
3) Alignments arranged by E value, with links to gene reports
Two problems with BLAST
1) Homology: the function is
only inferred (NOT known)
2) Large percentages of
coding proteins cannot be
assigned function based
on homology
For a current list of databases and bioinformatics
tools see: Nucleic Acids Research annual
bioinformatics issue (comes out every January).
List of all the databases described, by category:
http://www.oxfordjournals.org/nar/database/cap/
Guide to NCBI: see WebCT
Bioinformatics:
making sense of biological sequence
• New DNA sequences are analyzed for ORFs
(Open Reading Frames: protein)
• Any DNA or protein sequence can then be
compared to all other sequences in databases,
and similar sequences identified
• There is much more -- a great diversity of
programs and databases are available
Massively parallel measurements of gene
expression: microarrays
• Defining the “transcriptome”
• The northern blot revisited
• Detecting expression of many genes: arrays
• A typical array experiment
• What to do with all this data?
Brown and Botstein (1999) “Exploring the new world
of the genome with DNA microarrays” Nature
Genetics 21, p. 33-37.
DNA
RNA
protein
genome
“transcriptome”
“proteome”
(we have this)
(we want these)
The value of DNA microarrays for
studying gene expression
1) Study all transcripts at same time
2) Transcript abundance usually correlates with level of gene
expression--much gene control is at level of transcription
3) Changes in transcription patterns often occur as a response to
changing environment--this can be detected with a microarray
Detection of mRNA transcripts
• Northern Blot -- immobilize mRNA on membrane,
detect specific sequence by hybridization with one
labeled probe--requires a separate blotting for
each probe
• DNA microarray -- immobilize many probes
(thousands) in an ordered array, hybridize (base
pair) with labelled mRNA or cDNA
Generating an array of probes
• Identify open reading frames (orfs)
1) PCR each orf (several for each orf), attach
(spot) each PCR product to a solid support in a
specific order (pioneered by Pat Brown’s lab,
Stanford)
2) Chemically synthesize orf-specific
oligonucleotide probes directly on microchip
(Affymetrix)
http://derisilab.ucsf.edu/microarray/
(Derisi Lab at UCSF)
The chip defines
the genes you are
measuring
The hybridization
represents the
measurement
The RNA comes
from the cells and
conditions you are
interested in
A print head for generating arrays
of probes
Print head travels from DNA probe
source (microtiter plate) to solid
support (treated glass slide)
Small amount of DNA probe is put
on a specific spot at a specific
location
Each spot (DNA probe sequence)
has a specific “address”
Print head
Printing needles
A yeast array experiment
vegetative sporulating
Isolate mRNA
Prepare fluorescently
labeled cDNA with two
different-colored fluors
hybridize read-out
Example microarray data
Green: mRNA
more abundant
in vegetative
cells
Red: mRNA more
abundant in
sporulating cells
Yellow: equivalent
mRNA abundance
in vegetative and
sporulating cells
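The colour of a spot reflects the ratio of the two fluorescence signals, usually expressed as a log2 ratio. The sketch below assumes the red channel is cDNA from sporulating cells and the green channel is cDNA from vegetative cells, as in the example above; the two-fold cut-off and the function names are arbitrary choices for illustration.

```python
import math


def log2_ratio(red, green):
    """log2(red / green); positive means more abundant in the red-labelled sample."""
    return math.log2(red / green)


def spot_colour(red, green, fold=2.0):
    """Classify a spot as red, green, or yellow using a fold-change cut-off."""
    ratio = log2_ratio(red, green)
    cutoff = math.log2(fold)
    if ratio >= cutoff:
        return "red"     # more abundant in sporulating cells
    if ratio <= -cutoff:
        return "green"   # more abundant in vegetative cells
    return "yellow"      # roughly equal abundance


if __name__ == "__main__":
    # Hypothetical background-subtracted intensities (red, green) per spot.
    spots = {"gene1": (4000, 1000), "gene2": (500, 2000), "gene3": (1000, 1100)}
    for name, (r, g) in spots.items():
        print(name, round(log2_ratio(r, g), 2), spot_colour(r, g))
```

Intensities must be positive for the ratio to be defined, so in practice spots at or below background are filtered out first.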
What to do with all that data?
Overarching patterns may become apparent
1) Organize data by hierarchical clustering,
profiling to find patterns
2) Display data graphically to allow
assimilation/comprehension
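Hierarchical clustering of expression profiles can be sketched in plain Python. The version below uses average linkage with a distance of 1 minus the Pearson correlation, which is a common choice for expression data but an assumption here; the gene names and values are invented, and real analyses would use a dedicated library.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster(profiles, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until n remain."""
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        # Average linkage: mean of (1 - r) over all pairs of genes.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical log ratios at four time points.
    profiles = {
        "geneA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8.5],
        "geneC": [4, 3, 2, 1],
        "geneD": [8, 6, 5, 2],
    }
    print(cluster(profiles, 2))
```

Genes whose expression rises and falls together end up in the same cluster regardless of absolute level, which is what makes the overarching patterns visible. A profile with no variation would make the correlation undefined and needs to be removed beforehand.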
low mRNA
levels
High mRNA
levels
(Cell synchronization method)
All yeast cell cycle-
regulated genes
(phase in which
gene is
expressed)
MIAME:
The Minimum Information About a Microarray Experiment
(#6 helps correct for variations in the quantity of
starting RNA, and for variable labelling and
detection efficiencies)
DNA
RNA
protein
genome
“transcriptome”
“proteome”
(we have this)
(we want these)
Analysis of the proteome: “proteomics”
• Which proteins are present and when?
• What are the proteins doing?
– What interacts with what?
• Protein-DNA interactions (chromatin
immunoprecipitation)
• Protein-protein interactions
– Functions of proteins?
Phizicky et al. (2003) “Protein analysis on a proteomic
scale” Nature 422, p. 208-215
Which proteins are expressed?
Classical method
– Detect presence of a specific protein
• Using antibodies or specific assay
• Measure changes in protein levels with
changing environment, in different tissues
– Very labor intensive, expensive to scale up to
proteome
Massively parallel detection and
identification of proteins
• 2D gel electrophoresis
– Separate proteins in a given organism or tissue type by migration in gel
electrophoresis
– Identify protein (cut out of gel, sequence or mass-spec)
– Pattern of spots like a barcode for high-throughput studies
• Mass spectrometry
– Separate individual proteins from cell by charge and mass, individual
proteins can be identified (but need genome sequence information for
this)
• Microarrays: isolate things that bind proteins
2D gel electrophoresis
1) Separate proteins on the basis of isoelectric point
This technique is usually
done on a long, narrow gel
4 10
2D gel
electrophoresis
Lay gel containing
isoelectrically focused
protein on SDS-PAGE
gel, separate on the
basis of size
E.coli protein profile
From swissprot database,
www.expasy.ch
Mass spectrometry for identifying proteins in a
mixture
From J.R. Yates 1998 “Mass spectrometry and the age
of the proteome” J Mass Spec. 33, p 1-19
Liquid chromatography
and tandem mass
spectrometry
Software for processing
data
Defining protein function
• Classical methods:
– Define activity of protein, develop an assay for
activity
• Biochemistry: use assay to purify protein from
cell, characterize structure/function of protein in
vitro
• Genetics: obtain mutants with change in activity,
characterize phenotype of mutant, obtain
suppressors to identify genes that interact with
protein of interest
– Time intensive, expensive
Protein activity at the proteome level
• Protein-DNA interactions: identifying binding
sites for DNA-binding proteins: regulation of
gene expression
• Massively parallel screens for activity--protein
arrays
“chromatin immunoprecipitation” (ChIP)
1) Grow cells, add
formaldehyde to cross-link
everything to everything
(including DNA to protein)
2) Lyse cells, break up DNA
by shearing
3) Retrieve protein of interest
(and the DNA it is bound to)
using specific antibody to that
protein (immunoprecipitation)
4) Determine presence of
DNA by quantitative PCR
V. Orlando (2000) TIBS 25, p. 99
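The quantitative PCR readout in step 4 is commonly turned into an enrichment value from threshold cycle (Ct) numbers. The sketch below shows two widely used calculations, fold enrichment over a control immunoprecipitation and percent of input; it assumes perfect doubling per PCR cycle by default, and the function names and Ct values are hypothetical.

```python
import math


def fold_enrichment(ct_ip, ct_control, efficiency=2.0):
    """Enrichment of the target in the specific IP relative to a control IP.

    A lower Ct means more DNA, so enrichment grows with (ct_control - ct_ip).
    """
    return efficiency ** (ct_control - ct_ip)


def percent_input(ct_ip, ct_input, input_fraction=0.01, efficiency=2.0):
    """Recovered DNA as a percentage of the starting chromatin.

    ct_input is measured on a saved fraction of the chromatin (default 1%),
    so it is first adjusted to what 100% input would have given.
    """
    adjusted_input = ct_input - math.log(1 / input_fraction, efficiency)
    return 100 * efficiency ** (adjusted_input - ct_ip)


if __name__ == "__main__":
    print(fold_enrichment(ct_ip=24, ct_control=27))         # 3 cycles earlier
    print(round(percent_input(ct_ip=24, ct_input=25), 2))   # relative to 1% input
```

Both numbers depend on the amplification efficiency, so primer pairs are normally checked against a dilution series before the values are compared between loci.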
Massively parallel ChIP
PCR, label with
fluorescent dyes
Protein arrays for function
Proteins immobilized,
usually by virtue of a tag
sequence (6 x his tag,
biotin, etc.)
Probe all proteins
at once for a
specific activity
Example of a protein microarray
Proteins fused to GST with
6 x histidine tags,
immobilized on Ni++ matrix
Anti-GST tells how much
protein is immobilized on
surface
Specific assays identify
proteins with specific
activities--calmodulin
binding, phosphoinositide
binding
DNA
RNA
protein
genome
“transcriptome”
“proteome”
(we have this)
(we want these)

Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 

DNA Sequencing: History, methods and NGS

  • 1. DNA sequencing: methods I. Brief history of sequencing II. Sanger dideoxy method for sequencing III. Sequencing large pieces of DNA IV. The “$1,000 genome” On WebCT -- “The $1000 genome” -- review of new sequencing techniques by George Church
  • 2. Why sequence DNA? • All genes available for an organism to use -- a very important tool for biologists • Not just sequence of genes, but also positioning of genes and sequences of regulatory regions • New recombinant DNA constructs must be sequenced to verify construction or positions of mutations • Etc.
  • 3. History of DNA sequencing
  • 4. MC chapter 12 History of DNA sequencing
  • 5. Methods of sequencing A. Sanger dideoxy (primer extension/chain-termination) method: most popular protocol for sequencing, very adaptable, scalable to large sequencing projects B. Maxam-Gilbert chemical cleavage method: DNA is labelled and then chemically cleaved in a sequence-dependent manner. This method is not easily scaled and is rather tedious C. Pyrosequencing: measuring chain extension by pyrophosphate monitoring
  • 6. for dideoxy sequencing you need: 1) Single stranded DNA template 2) A primer for DNA synthesis 3) DNA polymerase 4) Deoxynucleoside triphosphates and dideoxynucleotide triphosphates
  • 7. Primers for DNA sequencing • Oligonucleotide primers can be synthesized by phosphoramidite chemistry--usually designed manually and then purchased • Sequence of the oligo must be complementary to DNA flanking the sequenced region • Oligos are usually 15-30 nucleotides in length
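The two primer requirements on this slide (complementarity to the flanking DNA, and a length of 15-30 nucleotides) can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, both sequences are assumed to be written 5' to 3', and real primer design also weighs melting temperature and secondary structure, which are not modelled here.

```python
# Complement lookup for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def primer_anneals(primer, template):
    """True if the primer is exactly complementary to some stretch
    of the single-stranded template (both written 5'->3')."""
    return reverse_complement(primer) in template.upper()

def has_typical_length(primer, shortest=15, longest=30):
    """True if the primer length falls in the 15-30 nt range from the slide."""
    return shortest <= len(primer) <= longest
```

For example, `primer_anneals("GCAT", "TTATGCTT")` is true because the reverse complement of the primer, `ATGC`, occurs in the template.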
  • 8. DNA templates for sequencing: • Single stranded DNA isolated from recombinant M13 bacteriophage containing DNA of interest • Double-stranded DNA that has been denatured • Non-denatured double stranded DNA (cycle sequencing)
  • 9. One way for obtaining single-stranded DNA from a double stranded source--magnets
  • 10. Reagents for sequencing: DNA polymerases • Should be highly processive, and incorporate ddNTPs efficiently • Should lack exonuclease activity • Thermostability required for “cycle sequencing”
  • 11. Sanger dideoxy sequencing--basic method a) Anneal the primer to the single stranded DNA (figure: 5’ and 3’ ends of template and primer)
  • 12. Sanger dideoxy sequencing: basic method b) Extend the primer with DNA polymerase in the presence of all four dNTPs, with a limited amount of a dideoxy NTP (ddNTP) 5’ 3’ Direction of DNA polymerase travel
  • 13. DNA polymerase incorporates ddNTP in a template-dependent manner, but it works best if the DNA pol lacks 3’ to 5’ exonuclease (proofreading) activity
  • 14. Sanger dideoxy sequencing: basic method ddATP in the reaction: anywhere there’s a T in the template strand, occasionally a ddA will be added to the growing strand (figure: template with four T positions, each giving a product that ends in ddA)
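The chain-termination idea can be sketched in Python: for each ddNTP, list the product lengths at which termination can occur, then read the sequence off the combined ladder from shortest to longest. The function names are hypothetical, and the template is assumed to be given in the order the polymerase copies it (starting just past the primer).

```python
# Watson-Crick pairing: template base -> base added to the growing strand.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def terminated_fragments(template, dd_base):
    """Lengths (in added nucleotides) of products that end in dd_base.

    template: bases in the order the polymerase reads them.
    """
    return [i for i, t in enumerate(template.upper(), start=1)
            if PAIR[t] == dd_base]

def read_sequence(template):
    """Combine the four termination ladders and read the new strand 5'->3'."""
    ladder = {}
    for dd_base in "ACGT":
        for length in terminated_fragments(template, dd_base):
            ladder[length] = dd_base
    return "".join(ladder[n] for n in sorted(ladder))
```

With the template `TGCATT`, the ddATP reaction gives products of length 1, 5 and 6 (one for each T in the template), and the full ladder reads `ACGTAA`.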
  • 15. How to visualize DNA fragments? • Radioactivity – Radiolabeled primers (kinase with 32P) – Radiolabelled dNTPs (alpha 35S or 32P) • Fluorescence – ddNTPs chemically synthesized to contain fluors – Each ddNTP fluoresces at a different wavelength allowing identification
  • 16. Analysis of sequencing products: Polyacrylamide gel electrophoresis--good resolution of fragments differing by a single nucleotide – Slab gels: as previously described – Capillary gels: require only a tiny amount of sample to be loaded, run much faster than slab gels, best for high throughput sequencing
  • 17. DNA sequencing gels: old school Analyze sequencing products by gel electrophoresis, autoradiography Different ddNTP used in separate reactions Radioactively labelled primer or dNTP in sequencing reaction
  • 19. cycle sequencing: denaturation occurs during temperature cycles 94°C: DNA denatures; 45°C: primer anneals; 60-72°C: thermostable DNA pol extends primer; repeat 25-35 times. Advantages: don’t need a lot of template DNA. Disadvantages: DNA pol may incorporate ddNTPs poorly
  • 20. Animation of cycle sequencing: see http://www.dnai.org/ Click on: “manipulation” “techniques” “sorting and sequencing”
  • 22. Current trends in sequencing: It is rare for labs to do their own sequencing: --costly, perishable reagents --time consuming --success rate varies Instead most labs send out for sequencing: --You prepare the DNA (usually plasmid, M13, or PCR product), supply the primer, company or university sequencing center does the rest --The sequence is recorded by an automated sequencer as an “electropherogram”
  • 23. BREAK UP THE GENOME, PUT IT BACK TOGETHER (figure: ~1 kbp sequences are assembled by matching overlaps to give a ~160 kbp BAC sequence; BAC overlaps give the genome sequence)
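"Assemble sequences by matching overlaps" can be illustrated with a toy greedy merge in Python. This is a sketch under strong assumptions (error-free reads, all on the same strand, no repeats) and is not the algorithm used by any real assembler; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

Three overlapping reads `ATGCGT`, `CGTTAC` and `TACGGA` merge into the single contig `ATGCGTTACGGA`; reads that share no overlap are returned unmerged.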
  • 24. Sequencing large pieces of DNA: the “shotgun” method • Break DNA into small pieces (around 1000 base pairs is typically preferable) • Clone pieces of DNA into M13 • Sequence enough M13 clones to ensure complete coverage (e.g. sequencing a 3 million base pair genome would require 5x to 10x 3 million base pairs, i.e. 15 to 30 million base pairs of sequence, to have a reliable representation of the genome) • Assemble genome through overlap analysis using computer algorithms, also “polish” sequences using mapping information from individual clones, characterized genes, and genetic markers • This process is assisted by robotics
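The coverage arithmetic on this slide is simple enough to write down in Python. The read length is an assumption (the slide gives the fragment size, not the length of each read), and the uncovered-fraction estimate uses a simple Poisson model of random read placement that the slides do not mention; all function names are hypothetical.

```python
import math

def bases_required(genome_size, coverage):
    """Total bases to sequence for a given fold coverage."""
    return genome_size * coverage

def reads_required(genome_size, coverage, read_length):
    """Number of reads needed, assuming every read has the same length."""
    return math.ceil(genome_size * coverage / read_length)

def expected_uncovered_fraction(coverage):
    """Fraction of the genome expected to be missed if reads land at
    random (Poisson model): e raised to the power -coverage."""
    return math.exp(-coverage)
```

For the 3 million base pair genome in the slide, `bases_required(3_000_000, 5)` gives 15 million bases, and at 10x coverage with an assumed 1000-base read, `reads_required` gives 30,000 reads. Under the Poisson assumption, 5x coverage still leaves under 1% of positions unsequenced, which is one way to see why several-fold coverage is needed.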
  • 25. Sequencing done by TIGR (Maryland) and The Sanger Institute (Cambridge, UK) “Here we report an analysis of the genome sequence of P. falciparum clone 3D7, including descriptions of chromosome structure, gene content, functional classification of proteins, metabolism and transport, and other features of parasite biology.”
  • 26. Sequencing strategy A whole chromosome shotgun sequencing strategy was used to determine the genome sequence of P. falciparum clone 3D7. This approach was taken because a whole genome shotgun strategy was not feasible or cost-effective with the technology that was available at the beginning of the project. Also, high-quality large insert libraries of (A - T)-rich P. falciparum DNA have never been constructed in Escherichia coli, which ruled out a clone-by-clone sequencing strategy. The chromosomes were separated on pulsed field gels, and chromosomal DNA was extracted…
  • 27. The shotgun sequences were assembled into contiguous DNA sequences (contigs), in some cases with low coverage shotgun sequences of yeast artificial chromosome (YAC) clones to assist in the ordering of contigs for closure. Sequence tagged sites (STSs)10, microsatellite markers11,12 and HAPPY mapping7 were also used to place and orient contigs during the gap closure process. The high (A /T) content of the genome made gap closure extremely difficult7–9. Chromosomes 1–5, 9 and 12 were closed, whereas chromosomes 6–8, 10, 11, 13 and 14 contained 3–37 gaps (most less than 2.5 kb) per chromosome at the beginning of genome annotation. Efforts to close the remaining gaps are continuing.
  • 28. Methods: Sequencing, gap closure and annotation The techniques used at each of the three participating centres for sequencing, closure and annotation are described in the accompanying Letters7–9. To ensure that each centre’s annotation procedures produced roughly equivalent results, the Wellcome Trust Sanger Institute (‘Sanger’) and the Institute for Genomic Research (‘TIGR’) annotated the same 100-kb segment of chromosome 14. The number of genes predicted in this sequence by the two centres was 22 and 23; the discrepancy being due to the merging of two single genes by one centre. Of the 74 exons predicted by the two centres, 50 (68%) were identical, 9 (12%) overlapped, 6 (8%) overlapped and shared one boundary, and the remainder were predicted by one centre but not the other. Thus 88% of the exons predicted by the two centres in the 100-kb fragment were identical or overlapped.
  • 29. The $1000 genome Venter Foundation (2003): The first group to produce a technology capable of a $1000 human genome will win $500,000 … X - Prize Foundation: no, $5 - 20 million … National Institutes of Health (2004): $70 million grant program to reach the $1000 genome
  • 30. Previous sequencing techniques: one DNA molecule at a time Needed: many DNA molecules at a time -- arrays One of these: “pyrosequencing” Cut a genome to DNA fragments 300 - 500 bases long Immobilize single strands on a very small plastic bead (one piece of DNA per bead) Amplify the DNA on each bead to cover each bead to boost the signal Separate each bead on a plate with up to 1.6 million wells
  • 31. Sequence by DNA polymerase -dependent chain extension, one base at a time in the presence of a reporter (luciferase) Luciferase is an enzyme that will emit a photon of light in response to the pyrophosphate (PPi) released upon nucleotide addition by DNA polymerase Flashes of light and their intensity are recorded
  • 32. Extension with individual dNTPs gives a readout. The readout is recorded by a detector that measures position of light flashes and intensity of light flashes
  • 33. APS = Adenosine phosphosulfate From www.454.com 25 million bases in about 4 hours
  • 34. Height of peak indicates the number of dNTPs added This sequence: TTTGGGGTTGCAGTT
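Reading a pyrosequencing trace amounts to rounding each peak height to a whole number of incorporated nucleotides. The Python sketch below shows this; the peak heights and the order in which nucleotides are dispensed are made-up illustrative values chosen to reproduce the sequence on the slide, not data from the original figure, and the function name is hypothetical.

```python
def decode_flowgram(flows):
    """Turn (nucleotide, peak height) pairs into a sequence.

    Each peak height is rounded to the nearest whole number of bases;
    a height near 0 means the dispensed nucleotide was not incorporated.
    """
    return "".join(base * round(height) for base, height in flows)

# Hypothetical peak heights, in dispensation order.
example_flows = [
    ("T", 3.1), ("A", 0.05), ("G", 3.9), ("T", 2.0),
    ("G", 1.1), ("C", 0.9), ("A", 1.0), ("G", 1.0), ("T", 2.1),
]
```

Calling `decode_flowgram(example_flows)` returns `TTTGGGGTTGCAGTT`. Long runs of the same base are the weak point of this kind of readout, because the difference between adjacent peak heights shrinks relative to the noise as the run gets longer.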
  • 35. DNA sequencing: methods I. Brief history of sequencing II. Sanger dideoxy method for sequencing III. Sequencing large pieces of DNA IV. The “$1,000 genome” On WebCT -- “The $1000 genome” -- review of new sequencing techniques by George Church
  • 36. Introduction to bioinformatics 1) Making biological sense of DNA sequences 2) Online databases: a brief survey 3) Database in depth: NCBI 4) What is BLAST? 5) Using BLAST for sequence analysis 6) “Biology workbench”, etc. www.ncbi.nlm.nih.gov www.tigr.org http://workbench.sdsc.edu
  • 37. There’s plenty of DNA to make sense of http://www.genomesonline.org/ (2006)
  • 38. Making sense of genome sequences: 1) Genes a) Protein-coding • Where are the open reading frames? • What are the ORFs most similar to? (What is the function/structure/evolution history?) b) RNA 2) Non-genes a) Regulation: promoters and factor-binding sites b) Transactions: replication, repair, and segregation, DNA packaging (nucleosomes)
  • 41. Looking for ORFs (Open Reading Frames) using “DNA Strider”
  • 42. ORF map 1) Where are the potential starts (ATG) and stops (TAA, TAG, TGA)? 2) Which reading frame is correct? = ATG = stop codon Reading frame #1 appears to encode a protein
  • 43. Cautions in ORF identification • Not all genes initiate with ATG, particularly in certain microbes (archaea) • What is the shortest possible length of a real ORF? 50 amino acids? 25 amino acids? Cut-off is somewhat arbitrary. • In eukaryotes, ORFs can be difficult to identify because of introns • Are there other sequences surrounding the ORF that indicate it might be functional? – promoter sequences for RNA polymerase binding – Shine-Dalgarno sequences for ribosome binding?
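An ORF scan of the kind shown in the ORF map can be sketched in Python. This is a minimal version under stated assumptions: it only scans the three forward reading frames, only accepts ATG as a start, ignores introns, and uses an arbitrary minimum length, so it is subject to every caution on the slide above. The function name and return format are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=25):
    """Find ATG...stop open reading frames in the three forward frames.

    Returns (frame, start, end) tuples: frame is 1-3, start is the
    0-based index of the ATG, end is the index just past the stop codon.
    min_codons counts codons from the ATG up to, not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None  # look for the next ATG in this frame
    return orfs
```

For the short test sequence `CCATGAAATTTGGGTAACC`, `find_orfs(seq, min_codons=4)` reports one ORF in frame 3 spanning indices 2 to 17. Raising `min_codons` to 25 or 50 reproduces the arbitrary cut-off choice discussed on the slide.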
  • 44. What is the function of the sequenced gene? Classical methods: -- mutate gene, characterize phenotype for clues to function (genetics) -- purify protein product, characterize in vitro (biochemistry) Comparison to previously characterized genes: -- genes sequences that have high sequence similarity usually have similar functions -- if your gene has been previously characterized (using classical methods) by someone else, you want to know right away! (avoid duplication of labor)
  • 45. NCBI NCBI home page --Go to www.ncbi.nlm.nih.gov for the following pages Pubmed: search tool for literature--search by author, subject, title words, etc. All databases: “a retrieval system for searching several linked databases” BLAST: Basic Local Alignment Search Tool OMIM: Online Mendelian Inheritance in Man Books: many online textbooks available Tax Browser: A taxonomic organization of organisms and their genomes Structure: Clearinghouse for solved molecular structures
  • 46. What does BLAST do? 1) Searches chosen sequence database and identifies sequences with similarity to test sequence 2) Ranks similar sequences by statistical significance of the match (E value) 3) Illustrates alignment between test sequence and similar sequences
  • 47. Alignment of sequences: The principle: two homologous sequences derived from the same ancestral sequence will have at least some identical (similar) amino acid residues Fraction of identical amino acids is called “percent identity” Similar amino acids: some amino acids have similar physical/chemical properties, and more likely to substitute for each other--these give specific similarity scores in alignments Gaps in similar/homologous sequences are rare, and are given penalty scores
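"Percent identity" can be computed directly from two sequences that have already been aligned. The Python sketch below assumes gaps are written as `-` and that gapped columns are left out of the denominator; other conventions (for example dividing by the full alignment length) exist, so this is one reasonable choice rather than the definition BLAST itself uses. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)
```

For the aligned pair `MKT-LV` and `MRTALV`, four of the five ungapped columns match, so the function returns 80.0.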
  • 48. Homology of proteins Homology: similarity of biological structure, physiology, development, and evolution, based on genetic inheritance Homologous proteins: statistically similar sequence, therefore similar functions (often, but not always…) Alignment of TFB and TFIIB sequences
  • 49. High sequence similarity correlates with functional similarity 40-20% identity: fold can be predicted by similarity but precise function cannot be predicted (the 40% rule) (figure panels: enzymes; non-enzymes)
  • 50. Programs available for BLAST searches Protein sequence (this is the best option) blastp--compares an amino acid query sequence against a protein sequence database tblastn--compares a protein query sequence against a nucleotide sequence database translated in all reading frames DNA sequence blastn--compares a nucleotide query sequence against a nucleotide sequence database blastx--compares a nucleotide query sequence translated in all reading frames against a protein sequence database tblastx--compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
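The phrase "translated in all reading frames" (used by tblastn, blastx and tblastx) means six translations: three frames on the given strand and three on its reverse complement. A minimal Python sketch using the standard genetic code follows; the function names are hypothetical, `*` marks a stop codon, and any codon containing a character other than A, C, G or T is written as `X`.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(seq):
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rev_comp = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev_comp[offset:])
    return frames
```

For `ATGAAATAA`, frame +1 translates to `MK*` and frame -1 to `LFH`.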
  • 51. BLAST considers combinations of matches, mismatches, and gaps in any given alignment. Gives the “best” (highest scoring) alignment of sequences. Three scores: 1) percent identity 2) similarity score 3) E-value--the number of alignments with at least this score expected by chance (lower number, higher probability of evolutionary homology, higher probability of similar function)
  • 52. What is the E-value? The E value represents the chance that the similarity is random and therefore insignificant. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. You can change the Expect value threshold on most main BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
  • 53. E values (continued) From the BLAST tutorial: Although hits with E values much higher than 0.1 are unlikely to reflect true sequence relatives, it is useful to examine hits with lower significance (E values between 0.1 and 10) for short regions of similarity. In the absence of longer similarities, these short regions may allow the tentative assignment of biochemical activities to the ORF in question. The significance of any such regions must be assessed on a case by case basis.
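The E-value guidance above can be turned into a simple triage of BLAST hits. The Python sketch below sorts hits into three groups using the thresholds from the slides (0.1 as the cut-off for likely relatives, 10 as the default Expect threshold); the hit names and values are invented for illustration, and the function does not call BLAST or parse its output.

```python
def triage_hits(hits, likely=0.1, report=10.0):
    """Sort (name, e_value) hits into three groups by E value.

    likely:  E values at or below this suggest true sequence relatives.
    report:  default Expect threshold; hits above it are not reported.
    """
    groups = {"likely": [], "examine": [], "dropped": []}
    for name, e_value in hits:
        if e_value <= likely:
            groups["likely"].append(name)
        elif e_value <= report:
            groups["examine"].append(name)  # check case by case for short similar regions
        else:
            groups["dropped"].append(name)
    return groups

# Hypothetical hits for illustration only.
example_hits = [("hit_a", 1e-30), ("hit_b", 0.5), ("hit_c", 50.0)]
```

`triage_hits(example_hits)` places `hit_a` under `likely`, `hit_b` under `examine`, and `hit_c` under `dropped`. Raising `report` above 10 lets more low-scoring hits through, as the slide on the Expect threshold describes.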
  • 54. Relationship between E-value and function (figure panels: single domain proteins; multi-domain proteins) E value greater than 10^-10: similar structure but possibly different functions
  • 57. BLAST against (go to genomes page): -- Microbial genomes -- environmental sequences (genomes) Results: 1) Distribution of hits: query sequence and positions in sequence that gave alignments 2) Sequences producing significant alignments 1) Accession number (this takes you to the sequence that yielded the hit: gene or contig) 2) Name of sequence (sometimes identifies the gene) 3) Similarity score 4) E-value 3) Alignments arranged by E value, with links to gene reports
  • 58. Two problems with BLAST 1) Homology? the function is only inferred (NOT known) 2) Large percentages of coding proteins cannot be assigned function based on homology
  • 59. For a current list of databases and bioinformatics tools see: Nucleic Acids Research annual bioinformatics issue (comes out every January). List of all the databases described, by category: http://www.oxfordjournals.org/nar/database/cap/ Guide to NCBI: see Webct
  • 60. Bioinformatics: making sense of biological sequence • New DNA sequences are analyzed for ORFs (Open Reading Frames: protein) • Any DNA or protein sequence can then be compared to all other sequences in databases, and similar sequences identified • There is much more -- a great diversity of programs and databases are available
  • 61. Massively parallel measurements of gene expression: microarrays • Defining the “transcriptome” • The northern blot revisited • Detecting expression of many genes: arrays • A typical array experiment • What to do with all this data? Brown and Botstein (1999) “Exploring the new world of the genome with DNA microarrays” Nature Genetics 21, p. 33-37.
  • 63. The value of DNA microarrays for studying gene expression 1) Study all transcripts at same time 2) Transcript abundance usually correlates with level of gene expression--much gene control is at level of transcription 3) Changes in transcription patterns often occur as a response to changing environment--this can be detected with a microarray
  • 64. Detection of mRNA transcripts • Northern Blot -- immobilize mRNA on membrane, detect specific sequence by hybridization with one labeled probe--requires a separate blotting for each probe • DNA microarray -- immobilize many probes (thousands) in an ordered array, hybridize (base pair) with labelled mRNA or cDNA
  • 65. Generating an array of probes • Identify open reading frames (orfs) 1) PCR each orf (several for each orf), attach (spot) each PCR product to a solid support in a specific order (pioneered by Pat Brown’s lab, Stanford) 2) Chemically synthesize orf-specific oligonucleotide probes directly on microchip (Affymetrix)
  • 66. http://derisilab.ucsf.edu/microarray/ (Derisi Lab at UCSF) The chip defines the genes you are measuring The hybridization represents the measurement The RNA comes from the cells and conditions you are interested in
  • 68. A print head for generating arrays of probes Print head travels from DNA probe source (microtiter plate) to solid support (treated glass slide) Small amount of DNA probe is put on a specific spot at a specific location Each spot (DNA probe sequence) has a specific “address” Print head Printing needles
  • 71. A yeast array experiment vegetative sporulating Isolate mRNA Prepare fluorescently labeled cDNA with two different-colored fluors hybridize read-out
  • 72. Example microarray data Green: mRNA more abundant in vegetative cells Red: mRNA more abundant in sporulating cells Yellow: equivalent mRNA abundance in vegetative and sporulating cells
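The colour coding on this slide corresponds to the ratio of the two fluorescence signals at each spot, usually expressed as a base-2 logarithm. A minimal Python sketch follows; the two-fold threshold for calling a spot red or green is an assumption (the slide gives no cut-off), the intensities are assumed to be positive and already background-corrected, and the function name is hypothetical.

```python
import math

def spot_colour(vegetative, sporulating, threshold=1.0):
    """Classify one spot from its two fluorescence intensities.

    threshold is in log2 units: 1.0 means a two-fold difference.
    """
    log_ratio = math.log2(sporulating / vegetative)
    if log_ratio >= threshold:
        return "red"     # mRNA more abundant in sporulating cells
    if log_ratio <= -threshold:
        return "green"   # mRNA more abundant in vegetative cells
    return "yellow"      # roughly equivalent abundance
```

A spot with intensities 100 (vegetative) and 400 (sporulating) has a log2 ratio of 2 and is called red; swapping the values gives green, and 100 against 110 gives yellow.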
  • 73. What to do with all that data? Overarching patterns may become apparent 1) Organize data by hierarchical clustering, profiling to find patterns 2) Display data graphically to allow assimilation/comprehension
  • 74. All yeast cell cycle-regulated genes (figure labels: cell synchronization method; phase in which gene is expressed; colour scale from low mRNA levels to high mRNA levels)
  • 75. MIAME: The Minimum Information About a Microarray Experiment (#6 helps correct for variations in the quantity of starting RNA, and for variable labelling and detection efficiencies)
  • 77. Analysis of the proteome: “proteomics” • Which proteins are present and when? • What are the proteins doing? – What interacts with what? • Protein-DNA interactions (chromatin immunoprecipitation) • Protein-protein interactions – Functions of proteins? Phizicky et al. (2003) “Protein analysis on a proteomic scale” Nature 422, p. 208-215
  • 78. Which proteins are expressed? Classical method – Detect presence of a specific protein • Using antibodies or specific assay • Measure changes in protein levels with changing environment, in different tissues – Very labor intensive, expensive to scale up to proteome
  • 79. Massively parallel detection and identification of proteins • 2D gel electrophoresis – Separate proteins in a given organism or tissue type by migration in gel electrophoresis – Identify protein (cut out of gel, sequence or mass-spec) – Pattern of spots like a barcode for hi-throughput studies • Mass spectrometry – Separate individual proteins from cell by charge and mass, individual proteins can be identified (but need genome sequence information for this) • Microarrays: isolate things that bind proteins
  • 80. 2D gel electrophoresis 1) Separate proteins on the basis of isoelectric point. This technique is usually done on a long, narrow gel (figure axis labels: 4, 10)
  • 81. 2D gel electrophoresis Lay gel containing isoelectrically focused protein on SDS-PAGE gel, separate on the basis of size. E. coli protein profile. From swissprot database, www.expasy.ch
  • 82. Mass spectrometry for identifying proteins in a mixture From J.R. Yates 1998 “Mass spectrometry and the age of the proteome” J Mass Spec. 33, p 1-19 Liquid chromatography and tandem mass spectrometry Software for processing data
  • 83. Defining protein function • Classical methods: – Define activity of protein, develop an assay for activity • Biochemistry: use assay to purify protein from cell, characterize structure/function of protein in vitro • Genetics: obtain mutants with change in activity, characterize phenotype of mutant, obtain suppressors to identify genes that interact with protein of interest – Time intensive, expensive
  • 84. Protein activity at the proteome level • Protein-DNA interactions: identifying binding sites for DNA-binding proteins: regulation of gene expression • Massively parallel screens for activity--protein arrays
  • 85. “chromatin immunoprecipitation” (ChIP) 1) Grow cells, add formaldehyde to cross-link everything to everything (including DNA to protein) 2) Lyse cells, break up DNA by shearing 3) Retrieve protein of interest (and the DNA it is bound to) using specific antibody to that protein (immunoprecipitation) 4) Determine presence of DNA by quantitative PCR V. Orlando (2000) TIBS 25, p. 99
  • 86. Massively parallel ChIP: PCR, label with fluorescent dyes
  • 87. Protein arrays for function Proteins immobilized, usually by virtue of a tag sequence (6 x his tag, biotin, etc.) Probe all proteins at once for a specific activity
  • 88. Example of a protein microarray Proteins fused to GST with 6 x histidine tags, immobilized on Ni++ matrix Anti-GST tells how much protein is immobilized on surface Specific assays identify proteins with specific activities--calmodulin binding, phosphoinositide binding