Sequencing and Assembly Cont'd CS273a Lecture 5, Aut08, Batzoglou







Usage Rights

© All Rights Reserved

  • Genomic DNA is extracted from sample cells taken from an individual. In a single tube reaction, genomic DNA is processed into single-stranded oligonucleotide fragments. These are prepared for attachment to Solexa’s Single Molecule Arrays using proprietary primer and anchor molecules. Hundreds of millions of molecules, representing the entire genome of the individual, are deposited and attached at discrete sites on a Single Molecule Array. Fluorescently labelled nucleotides and a polymerase enzyme are added to the Single Molecule Array. Complementary nucleotides base-pair to the first base of each oligonucleotide fragment and are added to the primer by the enzyme. Remaining free nucleotides are removed. Laser light of a specific wavelength for each base excites the label on the incorporated nucleotides, which fluoresce. This fluorescence is detected by a CCD camera that rapidly scans the entire array to identify the incorporated nucleotides on each fragment. Fluorescence is then removed. The identity of the incorporated nucleotide reveals the identity of the base in the sample sequence to which it is paired. In this example, the first base is C (cytosine). This cycle of incorporation, detection and identification is repeated approximately 25 times to determine the first 25 bases in each oligonucleotide fragment. By simultaneously sequencing all molecules on the array, the first 25 bases for the hundreds of millions of oligonucleotide fragments are determined. These hundreds of millions of sequences are aligned and compared to the reference sequence using Solexa’s proprietary bioinformatics system. Known and unknown single nucleotide polymorphisms (SNPs), together with other genetic variations, can then be readily determined.
  • "Polonies" are tiny colonies of DNA, about one micron in diameter, grown on a glass microscope slide (the word itself is a contraction of "polymerase colony"). To create them, researchers first pour a solution containing chopped-up DNA onto the slide. Adding an enzyme called polymerase causes each piece to copy itself repeatedly, creating millions of polonies, each dot containing only copies of the original piece of DNA. The polonies are then exposed to a series of chemically-labeled probes that light up when run through a scanning machine, identifying each nucleotide base in the strand of code, much as dusting with powder allows crime-scene investigators to bring up fingerprints on a surface. Prior to sequencing, dsDNA is denatured and unbound copy strands are washed away. The covalently linked template strands allow for this washing.
  • The availability of large collections of single nucleotide polymorphisms (SNPs), along with the recent large-scale linkage disequilibrium mapping efforts, has brought the promise of whole genome association studies to the forefront of current thinking in human genetics. ParAllele (now part of Affymetrix) has developed a novel technology based on the concept of Molecular Inversion Probes that enables up to 20,000 SNPs to be scored in a single assay. This unprecedented level of multiplexing is made possible through exquisite enzymological specificity using a unimolecular interaction that is insensitive to cross-reactivity among multiple probe molecules. The technology has been demonstrated to exhibit high accuracy while enabling a high rate of conversion of individual SNPs into working multiplexed assays.
    Molecular Inversion Probes: Molecular Inversion Probes are so named because the oligonucleotide probe central to the process undergoes a unimolecular rearrangement from a molecule that cannot be amplified (step 1), into a molecule that can be amplified (step 6). This rearrangement is mediated by hybridization to genomic DNA (step 2) and an enzymatic "gap fill" process that occurs in an allele-specific manner (step 3). The resulting circularized probe can be separated from cross-reacted or unreacted probes by a simple exonuclease reaction (step 4). Figure 1 shows these steps.
    Applications of Molecular Inversion Probes: Molecular Inversion Probe technology is invaluable as a high-throughput SNP genotyping method for both targeted and whole genome SNP analysis projects as well as allele quantitation.
  • Figure 1. A nanopore sensor for sequencing DNA. A channel or nanopore in an insulating membrane separates two ionic solution-filled compartments. In response to a voltage bias (labeled “-” and “+”) across the membrane, ssDNA molecules (yellow) in the “-” compartment are driven, one at a time, into and through the channel. Embedded in the membrane, an electrically connected nanotube (orange) that abuts on the nanopore serves as a sensor to identify the nucleotides in the translocating DNA molecules. Elevated temperatures and denaturants maintain the DNA in an unstructured, single-stranded form.
    The underlying principle of nanopore sequencing is that a single-stranded DNA or RNA molecule can be electrophoretically driven through a nano-scale pore in such a way that the molecule traverses the pore in strict linear sequence, as illustrated in Figure 1. Because a translocating molecule partially obstructs or blocks the nanopore, it alters the pore's electrical properties [1].

Presentation Transcript

  • Sequencing and Assembly Cont’d
  • Steps to Assemble a Genome
    • 1. Find overlapping reads
    • 2. Merge some “good” pairs of reads into longer contigs
    • 3. Link contigs to form supercontigs
    • 4. Derive consensus sequence: ..ACGATTACAATAGGTT..
    Some Terminology
    • read: a 500-900 long word that comes out of the sequencer
    • mate pair: a pair of reads from two ends of the same insert fragment
    • contig: a contiguous sequence formed by several overlapping reads with no gaps
    • supercontig: an ordered and oriented set (scaffold) of contigs, usually by mate pairs
    • consensus sequence: sequence derived from the multiple alignment of reads in a contig
  • 2. Merge Reads into Contigs
    • Overlap graph:
      • Nodes: reads r_1, …, r_n
      • Edges: overlaps (r_i, r_j, shift, orientation, score)
    Reads that come from two regions of the genome (blue and red) that contain the same repeat
    Note: of course, we don’t know the “color” of these nodes
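A minimal sketch of building such an overlap graph, under simplifying assumptions that are not in the slides: overlaps are exact suffix-prefix matches, all reads are on the same strand (so the orientation field is dropped), and the score is just the overlap length. The names `Overlap`, `suffix_prefix_overlap`, and `build_overlap_graph` are hypothetical, and the all-pairs loop is quadratic in the number of reads.

```python
from typing import List, NamedTuple


class Overlap(NamedTuple):
    src: int    # index of the left read
    dst: int    # index of the right read
    shift: int  # start of dst relative to start of src
    score: int  # here simply the overlap length (an assumption)


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    min_len = max(min_len, 1)  # a zero-length overlap is meaningless
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads: List[str], min_len: int = 3) -> List[Overlap]:
    """All-pairs exact overlaps; real assemblers index k-mers instead."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            n = suffix_prefix_overlap(a, b, min_len)
            if n:
                edges.append(Overlap(i, j, len(a) - n, n))
    return edges


reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
for edge in build_overlap_graph(reads):
    print(edge)
```

The three toy reads tile the example consensus ..ACGATTACAATAGGTT.. from the terminology slide. A real overlapper would also tolerate mismatches and consider reverse complements.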
  • 2. Merge Reads into Contigs
    • We want to merge reads up to potential repeat boundaries
    [Figure labels: unique contig; overcollapsed contig; repeat region]
  • 2. Merge Reads into Contigs
    • Remove transitively inferable overlaps
      • If read r overlaps reads r_1 and r_2 to its right, and r_1 overlaps r_2, then (r, r_2) can be inferred from (r, r_1) and (r_1, r_2)
    [Figure: overlapping reads r, r_1, r_2, r_3]
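The rule above can be sketched as follows, assuming the overlap graph is reduced to a set of directed pairs (u, v) meaning "u overlaps v to its right". This simplified version ignores shifts and scores, which a real implementation would check for consistency; the function name `remove_transitive_overlaps` is hypothetical.

```python
def remove_transitive_overlaps(edges):
    """Drop every edge (u, w) for which edges (u, v) and (v, w) both exist."""
    edges = set(edges)
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    reduced = set(edges)
    for u, v in edges:
        # Anything reachable from u through v is inferable from the two hops.
        for w in successors.get(v, ()):
            reduced.discard((u, w))
    return reduced


overlaps = {("r", "r1"), ("r", "r2"), ("r1", "r2"), ("r1", "r3"), ("r2", "r3")}
print(sorted(remove_transitive_overlaps(overlaps)))
```

On this toy input only the chain r, r1, r2, r3 of adjacent overlaps survives.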
  • 2. Merge Reads into Contigs
  • 2. Merge Reads into Contigs
    • Ignore “hanging” reads, when detecting repeat boundaries
    [Figure labels: sequencing error; repeat boundary???; reads a, b]
  • Overlap graph after forming contigs
    Unitigs: Gene Myers, 95
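One way to picture merging reads up to potential repeat boundaries is to collapse maximal non-branching paths of the overlap graph. The sketch below assumes transitive overlaps have already been removed and that the graph is given as directed (u, v) pairs; it is an illustration, not the unitig algorithm from the cited work, and the name `unitigs` is used here only as a label. Isolated cycles are not reported by this simple version.

```python
def unitigs(nodes, edges):
    """Group reads into maximal chains with no branching inside the chain."""
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def is_chain_link(u, v):
        # u -> v can be merged only if it is the sole way out of u and into v.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    paths, seen = [], set()
    for n in nodes:
        # Skip nodes that sit in the middle or at the end of a chain.
        if len(pred[n]) == 1 and is_chain_link(pred[n][0], n):
            continue
        path = [n]
        seen.add(n)
        while len(succ[path[-1]]) == 1:
            nxt = succ[path[-1]][0]
            if nxt in seen or not is_chain_link(path[-1], nxt):
                break
            path.append(nxt)
            seen.add(nxt)
        paths.append(path)
    return paths


# Read c branches to d and e, as it would at a potential repeat boundary.
print(unitigs("abcdef", [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "f")]))
```

The chain stops at c because two different reads follow it, which is the situation the slides describe as a potential repeat boundary.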
  • Repeats, errors, and contig lengths
    • Repeats shorter than read length are easily resolved
      • Read that spans across a repeat disambiguates order of flanking regions
    • Repeats with more base pair diffs than sequencing error rate are OK
      • We throw out overlaps between two reads that come from different copies of the repeat
    • To make the genome appear less repetitive, try to:
      • Increase read length
      • Decrease sequencing error rate
    • Role of error correction:
      • Discards up to 98% of single-letter sequencing errors
      • decreases error rate
      • → decreases effective repeat content
      • → increases contig length
  • 3. Link Contigs into Supercontigs
    • Normal density
    • Too dense → overcollapsed
    • Inconsistent links → overcollapsed?
  • 3. Link Contigs into Supercontigs
    • Find all links between unique contigs
    • Connect contigs incrementally, if ≥ 2 forward-reverse links
    • Result: supercontig (aka scaffold)
  • 3. Link Contigs into Supercontigs
    • Fill gaps in supercontigs with paths of repeat contigs
    • Complex algorithmic step
      • Exponential number of paths
      • Forward-reverse links
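A rough sketch of the incremental linking step from the preceding slides, assuming the threshold is "at least 2" forward-reverse links between a pair of unique contigs. It only groups contigs into supercontigs with a union-find structure; ordering, orientation, and gap filling with repeat contigs are deliberately left out. The function name `link_contigs` and the input layout (one contig pair per mate pair whose reads landed in different contigs) are assumptions for illustration.

```python
from collections import Counter


def link_contigs(mate_links, unique_contigs, min_links=2):
    """Group unique contigs joined by at least min_links mate-pair links."""
    unique = set(unique_contigs)
    counts = Counter(
        frozenset((a, b))
        for a, b in mate_links
        if a != b and a in unique and b in unique
    )

    parent = {c: c for c in unique_contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for pair, n in counts.items():
        if n >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in unique_contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c3", "c4"), ("c3", "c4"), ("c4", "c3")]
print(link_contigs(links, ["c1", "c2", "c3", "c4"]))
```

The single link between c2 and c3 is below the threshold, so the toy input yields two separate supercontigs.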
  • 4. Derive Consensus Sequence
    • Derive multiple alignment from pairwise read alignments
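As an illustration of the last step, the sketch below takes reads that are already placed in a multiple alignment and calls each column by majority vote. The input convention is an assumption made for this example: every row has the same width, a space means the read does not cover that column, and "-" is a gap inside a read. Real assemblers also weight votes by base quality; the names `place` and `consensus` are hypothetical.

```python
from collections import Counter


def place(read: str, offset: int, width: int) -> str:
    """Pad a read with spaces so it starts at the given alignment column."""
    return (" " * offset + read).ljust(width)


def consensus(aligned_reads):
    """Majority vote per column; columns whose winner is a gap are skipped."""
    called = []
    for column in zip(*aligned_reads):
        votes = Counter(ch for ch in column if ch != " ")
        if not votes:
            continue  # no read covers this column
        base, _ = votes.most_common(1)[0]
        if base != "-":
            called.append(base)
    return "".join(called)


rows = [
    place("ACGATTAC", 0, 16),
    place("TTACAATA", 4, 16),
    place("TTGCAATA", 4, 16),  # carries one sequencing error (G instead of A)
    place("AATAGGTT", 8, 16),
]
print(consensus(rows))
```

The erroneous G is outvoted by the two reads that agree on A at that column.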
  • Some Assemblers
    • PHRAP
        • Early assembler, widely used, good model of read errors
        • Overlap O(n^2) → layout (no mate pairs) → consensus
    • Celera
        • First assembler to handle large genomes (fly, human, mouse)
        • Overlap → layout → consensus
    • Arachne
        • Public assembler (mouse, several fungi)
        • Overlap → layout → consensus
    • Phusion
        • Overlap → clustering → PHRAP → assemblage → consensus
    • Euler
        • Indexing → Euler graph → layout by picking paths → consensus
  • Quality of assemblies: mouse
    • 7.7X sequence coverage
    • Terminology: N50 contig length. If we sort contigs from largest to smallest and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
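The N50 definition can be written out in a few lines. This sketch assumes the "genome" being covered is approximated by the total length of the assembled contigs, which is a common convention but not stated on the slide; the function name `n50` is illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the sum."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    raise ValueError("no contigs given")


# Total is 290; the two largest contigs (80 + 70 = 150) pass the halfway mark.
print(n50([80, 70, 50, 40, 30, 20]))
```

If a separate genome size estimate is available, it can replace `total` to follow the slide's wording more literally.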
  • Quality of assemblies: dog
    • 7.5X sequence coverage
  • Quality of assemblies: chimp
    • 3.6X sequence coverage
    • Assisted assembly
  • History of WGA
    • 1982: λ-virus, 48,502 bp
    • 1995: H. influenzae, 1 Mbp
    • 2000: fly, 100 Mbp
    • 2001 – present
      • human (3Gbp), mouse (2.5Gbp), rat * , chicken, dog, chimpanzee, several fungal genomes
    Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
    Phil Green: “That is impossible, and a bad idea anyway” (1997)
  • $399: Personal Genome Service (November 2007)
    $985: deCODEme (November 2007)
    $2,500: Health Compass service (April 2008)
    $350,000: Whole-genome sequencing (November 2007)
    Genetic Information Nondiscrimination Act (May 2008)
  • Applications
    • Whole-genome sequencing
    • Comparative genomics
    • Genome resequencing
    • Structural variation analysis
    • Polymorphism discovery
    • Metagenomics
    • Environmental sequencing
    • Gene expression profiling
    • Genotyping
    • Population genetics
    • Migration studies
    • Ancestry inference
    • Relationship inference
    • Genetic screening
    • Drug targeting
    • Forensics
  • Sequencing applications
    • Demand for more sequencing
    • Sequencing technology improvement
    • Increase in sequencing data output
    • New sequencing applications
  • Sequencing technology: Sanger sequencing (Fred Sanger)
    • Timeline: 1975, 1980, 1990, 2000, 2008
    • Cost per finished bp: $10.00, $1.00, $0.10, $0.01
    • Read length: 15 – 200 bp, later 500 – 1,000 bp
    • Throughput: “grad-student years”, later 2 ∙ 10^6 bp/day
  • Sequencing technology: Sanger sequencing
    • 1x coverage: 3 ∙ 10^9 bp
    • Time: (10x coverage × 3 ∙ 10^9 bp) / (2 ∙ 10^6 bp/day) ≈ 40 years
    • Cost: 10x coverage × 3 ∙ 10^9 bp × $0.001/bp = $30 million
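The back-of-the-envelope estimate on this slide can be reproduced directly. The sketch assumes a single instrument running continuously at the quoted throughput and a 365-day year; the function name `sequencing_effort` is illustrative.

```python
def sequencing_effort(genome_bp, coverage, bp_per_day, dollars_per_bp):
    """Return (years, dollars) needed to sequence a genome at a given coverage."""
    raw_bp = genome_bp * coverage          # total bases that must be read
    days = raw_bp / bp_per_day             # one machine, no downtime (assumption)
    return days / 365, raw_bp * dollars_per_bp


# Slide figures: 3e9 bp genome, 10x coverage, 2e6 bp/day, $0.001 per bp.
years, cost = sequencing_effort(3e9, 10, 2e6, 0.001)
print(f"{years:.0f} years, ${cost / 1e6:.0f} million")
```

The exact result is 15,000 days, or about 41 years, which the slide rounds to 40.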
  • Pyrosequencing on a chip
      • Mostafa Ronaghi, Stanford Genome Technologies Center
      • 454 Life Sciences
  • Sequencing technology: Next-generation sequencing, “short reads”
    • Genome Sequencer / FLX
    • Read length: 250 bp
    • Throughput: 300 Mb/day
    • Cost: ~ 10,000 bp/$
    • De novo: yes
  • Single Molecule Array for Genotyping—Solexa
  • Polony Sequencing
  • Sequencing technology: Next-generation sequencing, “microreads”
    • Genome Analyzer, SOLiD Analyzer
    • Read length: ~ 35 bp
    • Throughput: 300 – 500 Mb/day
    • Cost: ~ 100,000 bp/$
    • De novo: yes
  • Molecular Inversion Probes
  • Illumina Genotype Arrays
  • Sequencing technology: Next-generation sequencing, “SNP chips”
    • Infinium Assay, GeneChip Array (genotypes)
    • Read length: 1 bp
    • Throughput: 1 – 2 Mb/day
    • Cost: 5,000 bp/$
    • De novo: no
  • Nanopore Sequencing
  • Sequencing technology Next-generation sequencing
  • Sequencing technology ?
    Technology    Read length (bp)  Throughput (Mb/day)  Cost (bp/$)  De novo
    Sanger        1,000             2                    1,000        yes
    454           250               300                  10,000       yes
    Solexa / ABI  35                500                  100,000      yes
    SNP chip      1                 2                    5,000        no

    Application by technology (columns: Sanger, 454, Solexa/ABI, SNP chip; entries listed in source order)
    Bacterial sequencing: $, sometimes
    Mammalian sequencing: $$$, not likely today
    Mammalian resequencing: $$$, $, sort of
    Metagenomics: $, ?
    Genotyping: $$$, $$$, $$$