High Throughput Sequencing Technologies: On the path to the $0* genome

Presented to freshman at Duke University on April 7, 2014 - Includes detailed slide notes that loosely follow what I said in the lecture.

  • First off I wanted to mention that I changed the title of this talk from next, next-next generation sequencing to high throughput sequencing. Next and next-next describe the two most recent generations of sequencing, and there are sure to be more, but they’re both high throughput technologies. Instead of adding more nexts to the “generation” nomenclature, I prefer to refer to these technologies numerically as first, second, and third, so from here on, next and next-next will be referred to as second and third generation sequencing.
  • But before we get into the specifics of the latest sequencing technologies, I thought it’d be a good idea to do a quick overview of the molecular biology behind DNA sequencing. The central molecule behind today’s lecture is DNA, and within the cell it’s usually packed into a much larger superstructure called chromatin. There are two types of chromatin in your cells. There’s euchromatin, which is the open and active form of chromatin. This is the chromatin that’s bound by regulatory proteins and polymerases and is usually producing the messenger RNA that codes for protein. The other major form is heterochromatin, a densely packed form of chromatin that isn’t actively producing RNA. Chromatin structures are named based on their nanometer size; for example, a metaphase chromosome is 1400nm, and each chromosome arm is around 700nm. These 700nm arms are composed of tightly packed fibers. We don’t really start seeing the individual DNA strands until we get down to the 11nm nucleosomes, or “beads on a string,” level. The nucleosome is a protein complex made up of 4 histone proteins that dimerize to form an eight-protein complex, and a linker histone acts as a clip to hold the two turns of the DNA strand onto the nucleosome. It isn’t until we get down to the 2nm level that we can finally see the DNA double helix.
http://www.nature.com/scitable/resource?action=showFullImageForTopic&imgSrc=/scitable/content/ne0000/ne0000/ne0000/ne0000/113158606/18847_6.jpg
  • The DNA double helix is a polymer of nucleotides – a sugar, a phosphate, and a base – arranged in a specific order. The bases fall into two classes: the purines, adenine and guanine, and the pyrimidines, cytosine and thymine. The purines form hydrogen bonds with the pyrimidines – adenine binds with thymine and cytosine binds with guanine. The nucleotides themselves have a 5’ phosphate and a 3’ hydroxyl group. These are two very important positions on the DNA because the 5’ phosphate forms covalent bonds with the adjacent sugar’s 3’ oxygen to form the sugar-phosphate backbone. DNA is named deoxyribonucleic acid because the 2’ OH of the ribose sugar has been removed by an enzyme. RNA, or ribonucleic acid, preserves the 2’ OH. Most importantly of all, it’s the sequence of the bases of DNA that controls what genes are expressed and when they are expressed.
http://commons.wikimedia.org/wiki/File:0322_DNA_Nucleotides.jpg
  • And this process of gene expression occurs through a process called transcription. DNA is the storage form of your genetic information and must be converted into messenger RNA before the message encoded by the DNA can be translated into the proteins that make up your cells and cellular machinery. The human genome is made up of 3 billion bases, of which only 2% code for genes. Finding genes is a lot like finding a needle in a haystack, but fortunately evolution has provided your cells with some help finding genes within the genetic code. The genome is obviously filled with non-coding sequence, some of which is leftover information that is no longer used, but much of the code in your DNA serves some function during the process of gene expression. Enhancers are bits of genetic code that bind activator proteins and transcription factors to help RNA polymerases find genes. The promoter is a similar type of sequence closer to the gene; promoters are the bits of DNA that serve as a polymerase staging ground. Once all of the transcription factors have bound the promoter and readied the polymerase for action, the polymerase transcribes the first base of the messenger RNA transcript at the transcription start site.
http://www.nature.com/scitable/topicpage/gene-expression-14121669
  • As the polymerase passes through the gene and copies the DNA to RNA, it passes through what are called exons and introns. Exons are the coding regions of DNA and introns are the non-coding pieces. Introns are copied to RNA but are removed in a process called RNA splicing, which is a sequence dependent process. Once the DNA is copied to RNA, the messenger RNA is capped with a methyl group and a poly-A tail is added. These modifications stabilize the RNA for its trip to the cytoplasm, where it is bound by ribosomes which translate the messenger RNA into proteins. Interestingly, all of the processes described here can be negatively affected by changes to the sequence of the DNA!
http://www.nature.com/scitable/topicpage/gene-expression-14121669
  • There are two large classes of DNA mutations, or variants: sequence variants and structural variants. Sequence variants cover things like single nucleotide variants, which are changes of one single nucleotide, as well as small insertions or deletions, where small chunks of DNA are either inserted or deleted. The bigger mutations include deletions and duplications, which are pretty self-explanatory. There can also be inversions, where a large region of a chromosome flips, or translocations, where arms from separate chromosomes fuse together. Of course, these mutations can have varying effects on the DNA depending on where they happen and which parts of the genetic code are disrupted by these changes.
http://www.nature.com/scitable/topicpage/gene-expression-14121669
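These sequence-variant classes are easy to picture on a toy DNA string. The sketch below is pure illustration – the reference sequence, positions, and inserted/deleted bases are made up – but it shows what each class does to the letters of the code:

```python
def apply_variant(seq, kind, pos, payload=""):
    """Toy model of sequence-variant classes on a DNA string:
    a SNV swaps one base, an insertion adds bases, a deletion drops them."""
    if kind == "snv":
        return seq[:pos] + payload + seq[pos + 1:]  # replace one base
    if kind == "insertion":
        return seq[:pos] + payload + seq[pos:]      # add bases at pos
    if kind == "deletion":
        return seq[:pos] + seq[pos + len(payload):] # remove len(payload) bases
    raise ValueError(kind)

ref = "ATCGGGTCATGTCA"
print(apply_variant(ref, "snv", 10, "A"))          # single nucleotide variant
print(apply_variant(ref, "insertion", 10, "GAC"))  # small insertion
print(apply_variant(ref, "deletion", 10, "GTCA"))  # small deletion
```

Structural variants work the same way in principle, just on chunks of millions of bases instead of a handful.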
  • So obviously the site of a mutation matters, and mutations can have both positive and negative effects. Sometimes there’s no effect. Other times too much or too little protein is made. Mutations in gene expression regulatory regions such as promoters or enhancers can cause gene expression to ramp up or turn off completely. Mutations within the coding region of the DNA can also have a variety of similar effects. They could be silent and cause no issues whatsoever, or they can result in a positive or a negative protein mutant. Mutations at RNA splice sites, the sequences involved in removing introns, can change how the final message is spliced together. More often than not these are severely deleterious mutations, but sometimes they result in the creation of completely novel proteins. And finally, mutations to 5’ and 3’ untranslated regions can affect how the messenger RNA is translated and recognized by the cellular machinery. This can significantly change how much protein is produced. But one thing to keep in mind here is that we are all mutants. Everyone in this room has approximately 4 million single nucleotide variants and 700,000 insertions and deletions. As I said before, though, the genome is 3 billion basepairs and only 2% of the genome codes for protein. For the most part, these mutations are benign and are considered normal human variation. Credit: Cooper, G. M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 12, 628–640 (2011).
  • So how do we detect these variations and use them to better understand how mutations and variations contribute to normal human variation and disease? Up until about 30 years ago this was a bit of a challenge, but in the mid 1970’s two groups invented DNA sequencing. Both employed similar methodologies, using 5’ radioactively labeled DNA strands and gel size selection to determine the exact sequence of the bases. Maxam-Gilbert sequencing relied on exposing the DNA to 4 chemical agents. These chemical agents would cause specific breaks in the DNA, and when these broken bits of DNA were gel size selected, you could determine the sequence of the DNA based on which fragments appeared on the gel and the order in which the fragments appeared. A second method of sequencing was invented by Frederick Sanger. His system also employed radioactively labeled DNA coupled to gel size selection, but his technique used dideoxy nucleotides to allow for sequence decoding. You’ll remember that DNA usually has a 3’ OH which is used to extend a growing DNA strand. Sanger removed this 3’ OH to cause the primer extension reaction to terminate. By spiking in small amounts of these bases, they’d incorporate randomly and generate terminations out to about 800 base pairs.
Maxam-Gilbert: http://upload.wikimedia.org/wikipedia/commons/f/fa/Maxam-Gilbert_sequencing_en.svg
Sanger: http://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg
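The chain-termination idea can be sketched in a few lines of Python. This is a toy model, not real chemistry: each growing strand has a small random chance of incorporating a terminating dideoxy base at every position, and the “gel” is just a sort of the resulting fragments by length – a fragment of length k ends in the k-th base of the template:

```python
import random

def sanger_fragments(template, ddntp_fraction=0.05, n_strands=10000, seed=0):
    """Simulate chain termination: every strand may randomly incorporate
    a dideoxy (chain-terminating) base at each position, so over many
    strands we collect fragments of every possible length."""
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_strands):
        for pos in range(len(template)):
            if rng.random() < ddntp_fraction:
                lengths.add(pos + 1)  # this strand terminated here
                break
    return lengths

def read_gel(template, lengths):
    """Sorting fragments by size (the gel) reveals the sequence:
    a fragment of length k ends in the k-th template base."""
    return "".join(template[k - 1] for k in sorted(lengths))

template = "ATCGGGTCATGTCA"
print(read_gel(template, sanger_fragments(template)))
```

With enough strands, every fragment length shows up, so reading the sorted fragments reconstructs the template.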
  • Sanger sequencing eventually won out as the technique of choice, mostly because Maxam-Gilbert sequencing used extremely toxic chemicals. Over the years, Sanger sequencing became much more automated. Dideoxy bases were replaced with fluorescently labeled dideoxy bases, which could all be combined into a single reaction. Capillary electrophoresis replaced slab gel electrophoresis, and in the final iteration of the technology, lasers and computers were employed to determine the sequence of the bases instead of forcing graduate students and postdocs to do this time consuming work. Sanger sequencing lasted for 30 years as the dominant sequencing technology, and it is still used today as the gold standard validation method for many sequencing diagnostic tests. Although Sanger sequencing has been automated, it still has significant limitations because only a single 800 base pair DNA fragment can be sequenced per reaction.
http://upload.wikimedia.org/wikipedia/commons/3/3d/Radioactive_Fluorescent_Seq.jpg
  • However, despite this 800 base pair limitation, the US government and Celera Genomics used Sanger sequencing starting in the early 90’s to sequence all 3 billion basepairs of the human genome. This process took 10 years and cost 3 billion dollars. The original plan was to sequence the human genome piece by piece from one end to the other, which was projected to take nearly 15 years! Craig Venter, who worked at the NIH at the time, thought that was completely insane and suggested using a technique, also invented by Frederick Sanger, to sequence the majority of the human genome in half that time. This technique was called shotgun sequencing, and it involved sequencing random bits of the DNA and using computers to align the overlapping sequences back together. After a lot of fighting with Francis Collins, who headed up the Human Genome Project, Craig Venter left to start Celera Genomics with the plan to beat the NIH funded effort using shotgun sequencing. It quickly became clear that shotgun sequencing was the way to go for sequencing the majority of the human genome, and the two groups eventually collaborated to release a first draft of the human genome in 2000 and a final draft in 2003. Much of the genome sequencing we do today uses whole genome shotgun as the preferred method.
http://upload.wikimedia.org/wikipedia/commons/b/bd/Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png
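The shotgun idea – sequence random overlapping fragments, then let a computer stitch them back together – can be illustrated with a toy greedy overlap assembler. Real assemblers use overlap or de Bruijn graphs and have to cope with sequencing errors and repeats, both of which this sketch ignores:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap --
    the core intuition behind shotgun assembly."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left to merge
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads

genome = "ATCGGGTCATGTCAGGA"
reads = [genome[i:i + 8] for i in (0, 5, 9)]  # overlapping fragments
print(greedy_assemble(reads))
```

Three overlapping 8-base fragments are enough for the greedy merge to recover the full toy genome.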
  • While it was a monumental achievement to sequence the human genome, it quickly became clear that we were going to have to sequence a lot of human genomes to truly understand how human genetic variation contributes to health and disease. To do this we needed newer, faster, and more efficient sequencing technology. This came in the form of what is now called second generation sequencing technology. This new technology was a gigantic improvement over Sanger sequencing when it was first introduced because it can sequence many DNA molecules as clusters in parallel, whereas Sanger sequencing is limited to sequencing only one DNA sequence per reaction. The latest version of the second generation technology can sequence between 3 and 10 billion independent DNA fragments at the same time, compared to a maximum of 1152 reactions for a Sanger sequencer. This drastic improvement in sequencing output has cut the time for human genome sequencing down from 10 years to 1 day. The dominant technology in use today is the Illumina HiSeq, which can generate anywhere from 600 gigabases to 1.8 terabases, or 5 to 16 genomes worth of data. A second technology called the Ion Torrent Proton is also used heavily in hospital clinics because it has a rapid turnaround time.
  • Both of these second generation technologies function very similarly. For each, genomic DNA is randomly fragmented into smaller 300-400 base pair pieces. Adaptors, which contain known DNA sequence, are then added to the ends of the fragmented DNA. These adaptors contain sequences that allow the DNA to bind to a substrate. In the case of the Illumina technology, the substrate is a glass flowcell. Once bound to the substrate, the DNA strands are amplified using a primer to create clonal DNA clusters which amplify the sequencing signal. Once the clusters are generated, the sequencing can begin. This is done by adding fluorescently labeled DNA bases to the flowcell. The bases are added and the excess is washed away. A camera then takes an image of the flowcell to record the color of each cluster. After imaging, the fluorophore is cleaved off of the DNA, which allows for another round of base addition. This cycle continues hundreds of times on billions of clusters and generates images which can be used to determine the sequence of each cluster. Once completed, the short sequences of DNA are aligned to the human genome, called the reference genome, to determine the full sequence of the sample’s DNA.
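The add-image-cleave cycle can be mocked up in a few lines. The color-to-base mapping below is invented purely for illustration – each real instrument uses its own dye chemistry – but the loop structure mirrors how per-cycle images are stacked into base calls:

```python
# Hypothetical dye scheme for illustration only.
DYE_TO_BASE = {"red": "A", "green": "C", "blue": "G", "yellow": "T"}
BASE_TO_DYE = {b: d for d, b in DYE_TO_BASE.items()}

def image_cycle(clusters, cycle):
    """One sequencing cycle: every cluster incorporates one labeled base
    and the 'camera' records its color before the dye is cleaved off."""
    return {cid: BASE_TO_DYE[seq[cycle]] for cid, seq in clusters.items()}

def call_bases(clusters, n_cycles):
    """Stack the per-cycle images and translate colors back into bases,
    one base per cluster per cycle."""
    reads = {cid: "" for cid in clusters}
    for cycle in range(n_cycles):
        for cid, color in image_cycle(clusters, cycle).items():
            reads[cid] += DYE_TO_BASE[color]
    return reads

clusters = {"cluster_1": "ATCG", "cluster_2": "GGTC"}
print(call_bases(clusters, 4))
```

Because every cluster is imaged in every cycle, billions of fragments are read in parallel – that is the whole throughput advantage over one-reaction-per-fragment Sanger sequencing.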
  • You may be wondering how we then use these aligned bases to find variants. If we’re aligning these sequences to a reference human genome, how do we find where there are differences? While the sequences are being aligned to the genome, we mark the bases that don’t exactly match the reference genome. We also sequence way more than one genome worth of DNA. Typically we try to sequence around 30 or 40 times as much, meaning we sequence 90 to 120 billion bases of DNA. This gives us mathematical power to “call” variants. As I said before, the typical human genome contains 4 million single nucleotide variants and 700,000 indels. That’s a lot of data to sort through! To weed out most of those variants, we look to see if any of them turn up in healthy controls by searching our own internal database or public databases. This usually significantly cuts down the number of hits we have to sort through. We then run other filters to determine if the variants we find land in genes, or land in genes and are likely to change the protein sequence. Once we find a promising hit, we visually inspect the reads. In this example, we’re comparing a family trio, which is composed of a mother, a father, and an affected child. You can see here that neither of the parents has this mutation, while it appears that one copy of the child’s DNA has this mutation because it shows up in only 50% of the reads. We typically follow up the most promising mutants with a secondary validation method such as Sanger sequencing. You can see here that the Sanger data exactly mimics the high throughput sequencing data!
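A cartoon version of that genotype call – roughly 50% of reads carrying the alternate base means heterozygous, roughly 100% means homozygous – looks like this. The thresholds are illustrative; real variant callers use statistical models of sequencing error, base quality, and coverage:

```python
from collections import Counter

def call_site(pileup_bases, ref, min_fraction=0.2):
    """Classify one genome position from the bases of all reads covering
    it: homozygous reference, heterozygous (~50% alternate reads), or
    homozygous alternate (~100% alternate reads)."""
    counts = Counter(pileup_bases)
    depth = sum(counts.values())
    alt, alt_n = max(((b, n) for b, n in counts.items() if b != ref),
                     default=(None, 0), key=lambda x: x[1])
    frac = alt_n / depth
    if frac < min_fraction:
        return (ref, ref)  # homozygous reference
    if frac < 0.8:
        return (ref, alt)  # heterozygous, like the child in the trio
    return (alt, alt)      # homozygous alternate

# 30x coverage at one position: half the reads carry a G-to-A change
print(call_site("G" * 15 + "A" * 15, ref="G"))
```

This is also why deep coverage matters: at 2x coverage one mismatched read could be an error or a real heterozygous variant, but at 30x the read fractions carry real statistical weight.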
  • We can also use whole genome sequencing to detect some of the large structural variants, such as deletions and duplications. We do this by looking at the number of reads we see for each region of the genome. Holes in the coverage are places where there are deletions. A complete absence of reads tells us that both copies are deleted, which is a homozygous deletion, whereas if the region shows half as much DNA as expected, we know it’s a heterozygous deletion. Finally, if we get more than the expected amount of DNA, we know there’s probably a genetic duplication.
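That read-depth logic can be sketched directly. The cutoffs below are illustrative; real copy number callers normalize depth for things like GC content and mappability before classifying windows:

```python
def classify_depth(window_depths, expected):
    """Label genome windows by comparing observed read depth to the
    genome-wide expectation: ~0x -> homozygous deletion, ~0.5x ->
    heterozygous deletion, ~1x -> normal, well above 1x -> duplication."""
    calls = []
    for depth in window_depths:
        ratio = depth / expected
        if ratio < 0.1:
            calls.append("hom_deletion")
        elif ratio < 0.75:
            calls.append("het_deletion")
        elif ratio <= 1.5:
            calls.append("normal")
        else:
            calls.append("duplication")
    return calls

# Expected 30x coverage; one window lost both copies, one lost one copy,
# and one carries an extra copy.
print(classify_depth([31, 0, 16, 29, 61], expected=30))
```

A window at 0x is the “hole,” ~15x is the half-depth heterozygous deletion, and ~60x suggests an extra copy.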
  • I’ll get into the costs associated with sequencing in a minute, but currently whole genome sequencing is very expensive, so we’ve devised shortcuts to make the process cheaper. We can use whole exome sequencing to only look at the coding DNA sequences. Since most genetic disorders are caused by mutations in protein coding regions, it makes sense to look at the exons first. We do this by fishing out the exonic sequences using DNA or RNA “baits” attached to magnetic beads. Once the exonic sequences are isolated, we sequence the DNA just as we would a genome, but now we only have to sequence 65 megabases of DNA versus 3 gigabases of genomic DNA – nearly a 50-fold reduction. If we’re only interested in a specific set of genes, say genes involved in cancer progression, we can use this same capture technique, in what is called custom capture, to further reduce costs. Alternatively, we can use amplicon sequencing to similarly sequence specific genes, but this is usually done for much smaller gene panels on hundreds to thousands of samples.
  • Second generation sequencing does have a few huge disadvantages, though. One of them is that it relies on amplification, both in preparing the libraries and the flowcells. Because the polymerases used to do these amplifications have an inherent error rate, we can introduce errors that aren’t actually in the genome. This means that we’ll probably introduce one error for every 10-100 million bases. This isn’t a trivial number, and it’s why we use secondary validation techniques to confirm all of our most exciting variants. Another problem with second generation technology is that the read lengths are very short, on the order of 200-400 bases. Unfortunately the genome is full of repetitive sequences that are much longer than 400 bases, so it’s impossible to sequence these regions using second generation sequencing. This means we can’t use second generation technology to do de novo sequencing, or sequencing from scratch without aligning to a reference genome. It also means that second generation technology can’t be used to diagnose most trinucleotide repeat diseases such as fragile X or Huntington’s. Short read sequencers also have trouble calling small insertions and deletions, but this is mostly a computational problem because aligning and calling billions of sequences is computationally expensive. We are always generating newer, better, and faster algorithms, so this is becoming less of a problem, but many if not all of the problems listed on this slide could be overcome if we had access to very high quality single molecule long reads! Translocations and inversions usually happen at repetitive sequences, making determining directionality even harder!
  • We are on the cusp of seeing single molecule long read sequencing. This type of sequencing is called third generation sequencing, and it has many advantages, such as being able to sequence tens to hundreds of kilobases in a single read. There are two technologies for this type of sequencing on the market or soon to be on the market. One of them uses sequencing by synthesis, similar to the Illumina or Ion Torrent technology, while the other reads the DNA bases directly. The Pacific Biosciences sequencer is a very advanced sequencing system that uses polymerase-bound nanowells and super microscopes to watch DNA as it is sequenced in real time. This works because the microscope only detects a flash of light at the active site of the polymerase. The direct sequencing technique uses a nanopore that spans a membrane separating two charged spaces. As the DNA passes through the nanopore, the bases change how many ions can flow through the pore based on the size of the bases in the pore. The nanopore technology uses these changes in ionic current at the pore to determine the sequence of the DNA. While technically amazing, these technologies are at the bleeding edge and have many technical hurdles ahead of them, along with very high error rates. In addition, they’re also very expensive when compared with other sequencing technologies. If you really need to generate a de novo genome, you can combine the third generation technologies with the second generation technology to create genome scaffolds, or references, that are then error corrected using second generation short reads.
  • While we have the technology to quickly do whole genome sequencing, the costs of using this technique are still very high. As I said before, other techniques such as exome sequencing and custom capture sequencing are used more often these days because the prices for these tests are much cheaper and more likely to be covered by insurance. The prices for sequencing preparation and data storage are falling, but the most expensive part, at least for clinical sequencing, is the analysis. The analysis is expensive because it can take weeks to analyze the data and pull out the most promising candidate mutations. Those candidate mutations are then followed up, and in most places the data from each sample are evaluated by a panel of medical doctors. This process requires expertise from many different people, and that expertise is not cheap. Given that the cost of whole genome sequencing is currently running around $15,000, how will we ever get to the $0 genome?
  • To understand how we’re going to get there it’s probably best to talk about what the cost per genome has looked like over the past decade. When the final draft of the Human genome was released, the cost of sequencing a human genome had already fallen from a couple billion dollars to around $100 million. There was a steady decline in price until 2003 when the final draft of the human genome was released.
  • Remember, the human genome was done using Sanger sequencing and Sanger was the only technology we had available until around 2007. I call the period between 2004 and 2007 the HGP high because after the human genome was sequenced, a lot of sequencing centers had more Sanger sequencers than they knew what to do with so they cut prices and started sequencing just about anything they could get their hands on. In this time the sea urchin, zebrafish, chicken, puffer fish, platypus, macaque, chimpanzee, dog, mouse and rat were all sequenced.
  • However you’ll notice a very sharp decline in 2007 and this was caused by the release of second generation sequencing technology from a number of competing technologies that were all vying to be the sequencing technology of choice. During this time the cost of sequencing a genome dropped from $10 million in 2007 down to about $50,000 in 2010.
  • As I said, this was the dawn of the second generation sequencers. At the time there were really only 3 technologies. There was the Roche/454 Life Sciences sequencer, which used pyrosequencing. This sequencing method is similar to the Illumina technology that I explained in detail earlier, except pyrosequencing creates DNA clusters on microbeads that are deposited in nanowells on a sequencing substrate. Bases are then flowed over the substrate, and base addition is detected by the release of pyrophosphate, which is a byproduct of the chemical reaction that happens when DNA bases are added to a growing strand of DNA. A camera detects this release of pyrophosphate as a bright dot in the nanowell. The second sequencing competitor during this time was the ABI SOLiD sequencer, which used dye-tagged fragment ligation instead of sequencing by synthesis to sequence the DNA. This technique uses a complicated “color space” scheme to determine the sequence of the DNA. Of course there was Illumina, which I already explained in detail, and finally there was a new startup that appeared on the scene in 2009 and presented the first third generation sequencing technology. This company, named Helicos, appeared to have some very impressive technology, but it was plagued with problems from the start. However, they made a lot of noise, scared a lot of people, and were a pretty significant driver of the cost cutting seen by Illumina and others during this time period.
  • Following the period of drastic cost reductions came a period of price stagnation, because by 2010 it was clear that Illumina had won the sequencer game from a cost and accuracy perspective.
  • In 2010, Illumina released the HiSeq 2000, which was a huge upgrade over the old Genome Analyzer technology and boasted nearly 10 times as much data output. By 2010, Roche had been defeated and was forced to settle for performing niche sequencing projects on microbes. 454 lost mostly because its technology cost significantly more than the Illumina technology, and if you don’t need longer reads, it’s hard to justify spending 5 times as much per base for the sequencing. ABI SOLiD sequencing never really caught on, partly because of errors and complications with sample prep, but mostly because Illumina severely undercut prices and it couldn’t remain competitive. Helicos couldn’t produce a viable, cost-effective sequencer and filed for bankruptcy in 2011.
  • However, the game changed a bit in 2011 and 2012 when new competitors appeared on the market. These included Complete Genomics, Ion Torrent, Pacific Biosciences, and Oxford Nanopore.
  • Complete Genomics took a different path in the sequencing game. They decided to market their sequencing as a service and do all of the sequencing in house. This allowed them to control every aspect of the process and made them attractive for medical diagnostics and personal genomics. At the time, they were also competitively priced with Illumina sequencing. Pacific Biosciences first announced the PacBio RS in 2011 and promised high accuracy, long reads, and single molecule sequencing. Ion Torrent also appeared in 2011. They recycled much of the technology behind the 454 sequencer, but they eliminated the imaging step and replaced it with a semiconductor detector which senses the change in pH in the sequencing nanowell caused by the release of an H+ ion. Ion Torrent’s system was very speedy, and they promised that their technology would be able to deliver a genome for $1000. Finally, the first nanopore sequencing technology was presented in this same time period; however, much like Helicos a few years earlier, Oxford Nanopore gave a very flashy presentation with little data to support their claims. Nanopore even went as far as to say they had a thumb-drive-sized device that could sequence a human genome using a standard laptop. All of this new competition once again forced Illumina to undercut prices with the hope of destroying their competitors.
  • And finally we’ve reached the current era where prices have remained stagnant for almost 2 and a half years.
  • This is mostly because none of the competitors that came out in 2011 could deliver on their promises. Complete Genomics, while good technology, never caught on with the research community because most researchers like to have control over data generation; instead, it has been used mostly in diagnostics. Pacific Biosciences couldn’t deliver on its promises either. Its system had many early problems, including a much higher error rate and significantly shorter reads than promised. It also cost a lot of money: human genome sequencing on this system costs $50-80,000, which is 5-10x the current cost of reagents for Illumina sequencing. Ion Torrent has suffered a similar fate. Their system still has very low data output, and the cost per base is more expensive than Illumina’s technology. It’s also error prone, and they’ve struggled to deliver a system that can sequence a genome in a single sequencing run. This genome-level chip has been promised since 2012 but has now been pushed back until late 2014. Oxford Nanopore finally released actual sequencing data 3 years after they said they were going to revolutionize sequencing. The data was terrible, and to get any useable information out of it at all, it had to be corrected using Illumina sequencing runs. It’s now clear that nanopore sequencing is still in the early proof-of-concept stage, despite their promise to let researchers use the technology this year. And while Complete Genomics and Ion Torrent really only represent a minor concern, Illumina is still worried enough about its market share that this year it released two new sequencers to kill both companies. Illumina released the NextSeq 500, which is targeted at the clinical diagnostic market and has 6 times the data output of the Ion Torrent system with similar speed. To kill off Complete Genomics, which has focused more on whole genome sequencing of populations, Illumina released the HiSeq X, which can actually deliver a $1000 raw genome. This slashes sequencing costs from $5000 to $1000; however, this revolutionary technology is only available to sequencing centers that can afford to purchase 10 systems at a time, and the buyers must agree to only use the sequencers for whole genome sequencing. Scumbag Steve Hat: http://chasesocal.deviantart.com/art/Scumbag-Steve-Hat-413935466
  • With the release of the HiSeq X, the clinical cost of a human genome should drop from $10,000-15,000 down to $6,000-10,000. This brings us much closer to a time when the $0 genome will be feasible; however, this relies on insurance companies or governments footing most of the bill. There are still quite a few hurdles for the $0 genome, though. Many clinicians don’t know how to use or interpret genetic data to affect patient care. New clinical positions and extensive training are needed for us to fully leverage genetic data in the clinic. As a scientist, part of my job is to help show that this data really does have useful clinical value. For the majority of people, genetic data isn’t very informative or predictive, but that’s mostly because our sample sizes are too small to make accurate predictions. We need more population-wide data from healthy genomes to make better predictions and to determine how variants outside of coding regions contribute to health and disease. Where there is clear benefit, whole genome, whole exome, and custom capture sequencing are already being used in the clinic in the areas of cancer, neonatal, fertility, and undiagnosed disease diagnostics. And the final hurdle for the $0 genome is still price. Whole genome sequencing still needs a bit of a price drop for insurance companies to justify the cost. Bringing the cost of the test down into the same range as an MRI will certainly help, and this is likely to happen as we streamline analysis pipelines and reduce the amount of time and the number of people required for genetic analysis. It’s clear that improvements over the next few years will cause more insurance companies to approve payment on whole genome diagnostics.

Transcript

  • 1. High Throughput Sequencing Technologies: On the path to the $0* Genome
    Brian Krueger, PhD – Duke University Center for Human Genome Variation
  • 2. Chromatin Basics (Image credit: Nature Education)
    1) 1400nm – Metaphase chromosome
    2) 700nm – Condensed chromosome
    3) 300nm – Extended condensed chromosome
    4) 30nm – Packed nucleosomes
    5) 11nm – Nucleosome string
    6) 2nm – DNA double helix
    • Chromatin is the DNA packing material
    • Two forms
      – Euchromatin: open and actively transcribed
      – Heterochromatin: packed and not producing RNA
  • 3. DNA Basics Credit: Wikimedia Commons • DNA is built from nucleotides: a sugar-phosphate backbone carrying the four bases – Purines • Adenine • Guanine – Pyrimidines • Cytosine • Thymine • The sequence of bases determines when and what proteins are made
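Base pairing is what makes every sequencing method in this talk possible: A pairs with T and G pairs with C across the two strands of the double helix. A minimal Python sketch of this complementarity (a hypothetical helper, not part of the slides):

```python
# Watson-Crick base pairing: purines (A, G) pair with pyrimidines (T, C).
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (the opposite strand, read 5' to 3')."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

print(reverse_complement("ATCG"))  # -> CGAT
```

Because each strand fully determines the other, sequencing either strand recovers the same information.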
  • 4. Gene Expression – Enhancers/Promoters Image credit: Nature Education • DNA is converted into useable information in a process called transcription – Enhancers • Serve as accessory beacons that bind proteins involved in regulating gene expression • Help the polymerase “find” where a gene is located in the chromatin – Promoter • Located just upstream of the transcription start site • Staging site for the polymerase and the transcription factors that create mRNA – RNA polymerase II – Transcription start site • First transcribed base of the mRNA sequence
  • 5. Gene Expression – Transcription/Translation Image credit: Nature Education • DNA is composed of Exons and Introns – Exons are protein coding regions of DNA – Introns are noncoding regions of DNA that must be removed during transcription to produce mature mRNA • Introns removed during transcription by the RNA spliceosome – Sequence dependent process • Mature mRNA is capped (methylated) and a poly-adenine tail is added for stability • Sequence exported to the cytoplasm for translation and protein production • Mutations to the DNA can negatively affect every step of this process!
  • 6. Common DNA Mutations • Sequence variants – Single nucleotide variant – Small insertion – Small deletion • Structural variants – Deletion – Duplication – Inversion – Translocation (Slide diagrams each variant against a reference chromosome) Credit: Elizabeth Ruzzo, PhD, CHGV
  • 7. Common DNA Mutations • Effects – No effect – Too much protein – Too little protein – No protein – Not the right protein Image Credit: Cooper et al. Nat Rev Genet • Site of Mutation Matters – Exons – RNA splice sites – Enhancers – Promoters – 5’ and 3’ UTR regulatory regions • We’re all mutants! Your genome has 4 million single nucleotide variants and 700,000 insertions/deletions! Luckily, the genome is 3 billion base pairs and only 2% of those bases code for protein
  • 8. • Mutations/Variations can be detected using DNA sequencing – First invented in the mid 1970s – Two very similar methods developed – Maxam-Gilbert Sequencing • Chemical modification and cleavage paired with gel electrophoresis • DNA is 5’ labeled with radioactivity • Exposed to chemical agents that cause specific DNA breaks • Run on a gel and the pattern reveals which base is at each site – Sanger Sequencing • Dideoxy DNA sequencing paired with gel electrophoresis • DNA is 5’ labeled with radioactivity • Small amount of Dideoxy base added to 4 separate primer extension reactions • Run on a gel to determine bases at each position by size DNA Sequencing Maxam-Gilbert Sanger X No 3’-OH, No Extension! Credit: Wikimedia Commons
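The dideoxy trick behind Sanger sequencing can be modeled in a few lines: each chain-terminated fragment is identified by its length and its terminating base, and reading the gel from the shortest fragment up recovers the sequence. A toy model with made-up fragment data:

```python
# Toy model of Sanger dideoxy sequencing: each chain-terminated fragment
# is a (length, terminating_base) pair; the gel separates fragments by
# size, so sorting by length reads the sequence 5' -> 3'.
def read_sanger_gel(fragments):
    return "".join(base for length, base in sorted(fragments))

# Hypothetical fragments produced by terminating at each position of "GATC":
fragments = [(2, "A"), (4, "C"), (1, "G"), (3, "T")]
print(read_sanger_gel(fragments))  # -> GATC
```

The automated version described on the next slide works the same way, except a laser reads the fluorescent label on each terminator as the fragments pass by in size order.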
  • 9. • Sanger sequencing – Beat Maxam-Gilbert sequencing as the method of choice – Became fully automated • Dideoxy bases replaced with fluorescently labeled dideoxy bases (1 reaction now instead of 4) • Capillary electrophoresis replaces slab gel electrophoresis • Lasers and computers replace graduate students and postdocs • By far the dominant sequencing method up until 2007 – 30 years! – Still considered the gold standard for validating sequencing data • Huge limitation for genome wide sequencing: Sanger can only sequence one fragment per reaction First Generation Sequencing Technology Credit: Wikimedia Commons
  • 10. • Done using Sanger sequencing… • Took 10 years to complete • Cost $3 billion • Used a technique called hierarchical whole genome shotgun sequencing – Shotgun sequencing also invented by Frederick Sanger – Genome fragmented into 200-400kb fragments – Fragments cloned into over 30,000 BAC (bacterial artificial chromosome) libraries – Libraries were then fragmented – Sanger sequencing performed – Genome assembled using computers to line up overlapping sequences • Most human genome sequencing today is done using whole genome shotgun sequencing! Human Genome Sequencing Hierarchical Shotgun Sequencing Credit: Wikimedia Commons
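Shotgun assembly reconstructs the original sequence by lining up overlapping reads. A minimal greedy overlap assembler, assuming short error-free fragments (real assemblers are vastly more sophisticated, but the core idea is the same):

```python
# Minimal greedy overlap assembler sketch (not the HGP pipeline):
# repeatedly merge the pair of fragments with the longest
# suffix/prefix overlap, mimicking shotgun read assembly.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(frags):
    frags = list(frags)
    while len(frags) > 1:
        best = (0, 0, 1)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # no overlaps left; just concatenate what remains
            return "".join(frags)
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)]
        frags.append(merged)
    return frags[0]

print(assemble(["ATCGGG", "GGGTCAT", "CATGTCA"]))  # -> ATCGGGTCATGTCA
```

Repeats longer than the read length are where this breaks down in practice, which is exactly the short-read limitation discussed later in the talk.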
  • 11. • Developed to increase throughput of Sanger sequencing • Can sequence many molecules in parallel – Does not require homogeneous input – DNA sequenced as clusters or in nanowells – Single machine can sequence 3-10 billion independent DNA fragments AT THE SAME TIME! – Single Sanger sequencer maxes out at 1152 reactions per machine • Time from DNA to genome reduced from 10 years to 1 day! Second Generation Sequencing Illumina HiSeq (3-9 billion clusters – 600GB-1.8TB) Ion Torrent Proton (100-300 million nanowells – 20-60GB)
  • 12. 2nd Gen: Sequencing by Synthesis Overview • Library prep: Genomic DNA → Fragmented DNA → Ligate adaptors → Bind library and create clusters • Sequencing cycle: Add bases → Image → Cleave → Wash → Repeat hundreds of times on billions of clusters • Align reads to a reference genome
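The cycle above can be sketched as a toy simulation: in each cycle, every cluster incorporates one labeled base complementary to its template, is imaged, and then the dye and terminator are cleaved before the next cycle. This is an idealized, error-free model with hypothetical template data:

```python
# Toy simulation of reversible-terminator sequencing by synthesis.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sequence_clusters(templates, cycles):
    reads = ["" for _ in templates]
    for cycle in range(cycles):
        for i, template in enumerate(templates):
            if cycle < len(template):
                # incorporate and image the complementary labeled base
                reads[i] += COMPLEMENT[template[cycle]]
            # then cleave the dye/terminator, wash, and repeat
    return reads

print(sequence_clusters(["ATCG", "GGTA"], cycles=4))  # -> ['TAGC', 'CCAT']
```

The key contrast with Sanger: every cluster is read in parallel during the same set of cycles, which is why a single run yields billions of reads.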
  • 13. Mutation Calling/Filtering • Variant calling • Visual inspection • Cross-checking public databases – Exome Variant Server: 6,500 exome-sequenced individuals • Sanger sequencing confirmation
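One common filtering step is to drop candidate variants that are common in a population database, since common variants rarely explain rare disease. A hypothetical sketch with made-up variant IDs and frequencies (the kind of allele-frequency data a resource like the Exome Variant Server provides):

```python
# Hypothetical population-frequency filter for candidate variants.
def filter_variants(candidates, population_freq, max_freq=0.01):
    """Keep variants that are absent from, or rare in, the population data."""
    return [v for v in candidates
            if population_freq.get(v, 0.0) <= max_freq]

candidates = ["chr1:1000A>G", "chr2:5000C>T", "chrX:700G>A"]
population_freq = {"chr1:1000A>G": 0.35, "chrX:700G>A": 0.002}
print(filter_variants(candidates, population_freq))
# -> ['chr2:5000C>T', 'chrX:700G>A']
```

Variants that survive filters like this one still go through visual inspection and Sanger confirmation before anyone trusts them.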
  • 14. Detecting Copy Number Variants – heterozygous deletion – homozygous deletion – duplication • ERDS (Estimation by Read Depth with SNVs) – Average read depth (RD) of every 2-kb window is calculated, followed by GC correction – A paired Hidden Markov model is applied to infer the copy number of every window, using both RD information and heterozygosity information
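The read-depth idea can be illustrated with simple thresholds standing in for the paired HMM that ERDS actually uses: windows with roughly half, near-zero, or 1.5x the average depth suggest heterozygous deletions, homozygous deletions, and duplications respectively. A sketch with hypothetical per-window depths:

```python
# Simplified read-depth CNV caller (thresholds instead of ERDS's HMM).
def call_cnv(window_depths, mean_depth):
    calls = []
    for depth in window_depths:
        ratio = depth / mean_depth
        if ratio < 0.25:
            calls.append("homozygous deletion")   # ~0 copies
        elif ratio < 0.75:
            calls.append("heterozygous deletion") # ~1 copy
        elif ratio > 1.4:
            calls.append("duplication")           # 3+ copies
        else:
            calls.append("normal")                # 2 copies
    return calls

depths = [40, 21, 2, 62]  # hypothetical GC-corrected depths per 2-kb window
print(call_cnv(depths, mean_depth=40))
# -> ['normal', 'heterozygous deletion', 'homozygous deletion', 'duplication']
```

The HMM exists because real depth data is noisy: it smooths calls across neighboring windows instead of judging each one in isolation.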
  • 15. Flavors of Sequencing • Whole Genome Sequencing – Obtain whole blood or tissue sample – Create sequencing libraries of all DNA fragments • Whole Exome Sequencing – Utilizes a selection protocol – Attach complementary RNA or DNA strands to beads – Fish out ONLY coding DNA sequences – Create sequencing libraries from enriched DNA – Reduces cost and analysis time • Custom Capture – Same protocol as Exome sequencing – Only target desired DNA sequences • Amplicon Sequencing – Use PCR to amplify target DNA – Sequence amplified DNA (Amplicon)
  • 16. Disadvantages of 2nd Generation Tech • Rely on amplification to create libraries and clusters – All polymerases have an inherent error rate (10⁻⁶-10⁻⁷) – Errors introduced every 1 to 10 million bases – Secondary validation of variants is key • Short reads cannot be used for de novo genome assembly – 2nd generation sequencers have a maximum read length of 400bp – This is too short to span long repeat regions – Not good for detecting trinucleotide repeat expansions, e.g. fragile X, Huntington’s, spinocerebellar ataxias • Short reads can miss large structural variations – Genome translocations and inversions will likely be missed – Require significant read depth at break points for these variations to be detected • Trouble detecting small insertions and deletions – Short reads are computationally hard to align and call • Very high quality single molecule long reads would fix many of these problems!
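A quick back-of-the-envelope calculation shows why amplification errors matter at genome scale, using the polymerase error rates quoted above and a 3-billion-base genome:

```python
# Expected polymerase errors introduced per copy of a human genome.
def expected_errors(genome_size, error_rate):
    return genome_size * error_rate

GENOME_SIZE = 3e9  # base pairs
for rate in (1e-6, 1e-7):
    n = expected_errors(GENOME_SIZE, rate)
    print(f"error rate {rate:.0e}: ~{n:,.0f} errors per genome copy")
```

Hundreds to thousands of amplification artifacts per genome copy is exactly why secondary validation (e.g. Sanger confirmation) of any interesting variant is essential.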
  • 17. • Defined as single molecule sequencing • Less complex sample prep and much longer read length (1-100kb) compared to 200-400bp for 2nd Gen • Two categories – Sequencing by synthesis • Pioneered by Pacific Biosciences • Sequencer uses super microscopes and polymerase bound nanowells to WATCH DNA as it is sequenced in real time • Nanowells filled with DNA bases • Fluorescence of base only detected at the polymerase – Direct sequencing by passing DNA through a nanopore • Bases fed through a membrane bound nanopore • Ionic difference between both sides of the membrane • Detect how ion flow changes at the pore as each base passes through • Bleeding edge technology – Many technical hurdles with very high error rates (10-25%) – Very expensive technology • Costs 3-10x as much as Illumina to do whole genome sequencing – Short/Long read hybrid proposed to leverage the base accuracy of 2nd gen sequencing and the length of 3rd gen • Use long reads as a scaffold and correct the errors with short reads The Future: Third Generation Sequencing PacBio Oxford Nanopore
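The proposed short/long hybrid approach boils down to letting accurate short reads out-vote errors in a noisy long read. A naive sketch that assumes the alignment offsets are already known (real hybrid correctors compute the alignments themselves):

```python
# Naive hybrid error correction: short reads aligned at known offsets
# on a noisy long read vote on each position; majority wins.
from collections import Counter

def correct_long_read(long_read, short_reads):
    votes = [Counter({b: 1}) for b in long_read]  # long read gets one vote per position
    for offset, read in short_reads:
        for i, base in enumerate(read):
            votes[offset + i][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)

long_read = "ATCGXGTCAT"                  # 'X' marks a long-read error
short_reads = [(0, "ATCGG"), (3, "GGGTC"), (5, "GTCAT")]
print(correct_long_read(long_read, short_reads))  # -> ATCGGGTCAT
```

The long read supplies the scaffold that spans repeats; the short reads supply the per-base accuracy, which is the whole point of the hybrid.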
  • 18. Costs Associated with Clinical Sequencing

                             Whole Genome     Exome           Custom Capture   Amplicon
    Size (GB)                100              12.5            0.13-1           0.03-0.13
    Preparation              $400             $200            $80              $40
    Sequencing               $4,300           $400            $12-100          $1-12
    Data Processing/Storage  $350             $200            $50              $25
    Clinical Review          $5,000-10,000    $2,000-6,000    $700-2,000       $400-900
    Total                    $10,000-15,000   $2,800-6,800    $1,000-2,000     $500-1,000

    DNA sequencing costs are falling, but analysis and clinical review costs will likely remain stable for the foreseeable future. New sequencing technology announced this year should reduce the cost of preparing and sequencing a whole genome to $1,000 starting in mid 2014 (does not include analysis and review). How will we ever get to the $0* genome?!?!
  • 19. Sequencing Costs in the Genome Era Image credit: NIH HG Draft HG Final
  • 20. Sequencing Costs in the Genome Era Image credit: NIH Sanger Sanger – HGP High HG Draft HG Final
  • 21. Sequencing Costs in the Genome Era Image credit: NIH Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High HG Draft HG Final
  • 22. Sequencing in the Genome Era: 2008-2010 • The Dawn of the Second Generation Sequencers – Roche 454 - 2007 • Imaging based pyrosequencing • Camera detects pyrophosphate release after each base is added to nanowells – Bright dot = Base present – ABI Solid - 2007 • Dye tagged fragment ligation • Imaging based • Complicated detection scheme using “color space” – Illumina - 2008 • Imaging based reversible dye termination sequencing • Camera detects fluorescently labeled bases in each cluster – Color determines base – Helicos (3rd Gen) - 2009 • First “single molecule” sequencer – Third generation sequencing • Plagued with problems • BUT the fear that it might work helped drive down costs 454 Illumina GAIIx ABI Solid 3 Helicos
  • 23. Sequencing Costs in the Genome Era Image credit: NIH Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High HG Draft HG Final
  • 24. Sequencing Costs in the Genome Era Image credit: NIH Illumina Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High HG Draft HG Final
  • 25. Sequencing in the Genome Era: 2010-2011 • The death of the competition – Illumina • Release of the HiSeq • Drastically increases output 10x over the GAIIx – Roche 454 • Release 454 titanium and 454 Junior • Used primarily for microbes because it can sequence 400bp and do de novo assembly of these small organisms • Expensive and error prone • Roche will phase out the 454 family in 2014 – ABI Solid • Never caught on • Expensive, error prone, complicated sample prep – Helicos • Filed for bankruptcy 2011 • Costs remain level because Illumina has no competition Illumina HiSeq 2000
  • 26. Sequencing Costs in the Genome Era Image credit: NIH Illumina Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High HG Draft HG Final
  • 27. Sequencing Costs in the Genome Era Image credit: NIH Illumina Illumina Complete Genomics Ion Torrent PacBio Nanopore Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High HG Draft HG Final
  • 28. Sequencing in the Genome Era: 2010-2011 • New Contenders – Complete Genomics • Proprietary tech and generate data in-house • Competitive pricing with Illumina sequencing – Pacific Biosciences (3rd Gen) • Announce the PacBio RS • Promise high base accuracy, single molecule sequencing with reads reaching up to 20kb – Ion Torrent • Same sequencing methodology as the Roche 454 system • Difference is that it detects the release of H+ after bases are added • Removes need for time consuming imaging steps • Promise a $1000 genome – Oxford Nanopore (3rd Gen) • Announce MinIon and GridIon • Promise very cheap single molecule sequencing that can be done on a thumb drive • Promising competition forces price reductions PacBio RS Ion Torrent Proton Nanopore MinIon
  • 29. Sequencing Costs in the Genome Era Image credit: NIH Illumina Illumina Complete Genomics Ion Torrent PacBio Nanopore Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High HG Draft HG Final
  • 30. Illumina Complete Genomics Ion Torrent PacBio Nanopore Sequencing Costs in the Genome Era Image credit: NIH Sanger Roche/454 Illumina ABI Solid Helicos Sanger – HGP High Illumina Illumina HG Draft HG Final
  • 31. Sequencing in the Genome Era: 2012-Present • New Contenders Fail - Mostly – Complete Genomics • Not embraced by the research community and serves the diagnostic niche – Pacific Biosciences • Didn’t deliver on promises – 15% error rate, shorter reads (1-10kb) • Slowly improving – reduced error rate to 5-10%, reads reaching 20-50kb – Ion Torrent • Didn’t deliver on promises - Low data output, expensive • Serves niche diagnostic market where speed is more valuable than cost or amount of data output • 60GB PII chip has been “coming” since 2012 – Slated for late 2014 release – Oxford Nanopore • Finally released first data in 2014 • Full of errors and looks like proof of concept tech – Illumina • Release NextSeq500 for the diagnostic market to kill Ion Torrent • Release the HiSeqX which can sequence a human genome for $1000 to kill Complete Genomics (1.8TB of output in 3 days! – 16 genomes) – HiSeqX MUST be purchased as a 10 pack ($10 million) – Contractually forced to ONLY use the HiSeqX for genomes • Prices remain steady 2012-14 because the competition can’t deliver Releases $1000 genome sequencer, Only lets rich people use it. Hat image: chasesocal, Deviant Art
  • 32. The Promise of the $0* Genome • HiSeqX brings clinical genome cost down to $6-10K (mid 2014) • Hurdles for the $0* Genome – *Relies on health insurance companies or governments paying most of the bill – Clinician education • Many clinicians do not understand genetic data or how to use it to affect patient care – Proof of widely applicable value • Genome sequences for MOST people are not very informative • Need more population wide data to accurately predict how variants outside of coding regions contribute to disease • Currently used in cancer, neonatal, fertility and undiagnosed disease diagnostics – Cost reduction • Cost of the all-in test needs to be <$5,000 • Similar to other high diagnostic value, high tech tests such as PET, CT, and MRI scans • Likely to happen with streamlined analysis pipelines • Improvements over the next few years will cause more insurance companies to approve payment on whole genome diagnostics