Before we can talk about whole genome sequencing and how it could revolutionize personalized medicine, it is important that we first discuss the history of genetics. I know that you’ve covered much of this information in previous lectures, so I’m going to highlight a little bit of the history that I think is important for today’s discussion.
I’m going to assume that you haven’t made it to this point in your study of biology without having heard of Gregor Mendel. Mendel is considered the father of modern genetics because he was the first to systematically study how plants pass their traits on to their offspring. He developed two laws through his research on pea plants in the mid-1800s. His first law, the law of segregation, states that each parent randomly passes one of its two alleles for a gene to its offspring. His second law, the law of independent assortment, states that separate genes for separate traits are passed to offspring independently of one another, which is why a cross between two parents that are heterozygous for two traits produces phenotypes in a ratio of 9:3:3:1. These laws held true in Mendel’s experiments; however, the significance of Mendel’s work was not realized until the early 1900s, when scientists were again trying to understand how traits were inherited. Mendel’s work was ignored for more than 30 years! However, the rediscovery of Mendel’s work quickly led to the discovery that not all traits assort independently. Much of this early work was done by Bateson and Punnett, who showed that some traits seemed to assort together, or be linked. Further work by Thomas Hunt Morgan and his student Alfred Sturtevant found that genes did appear to be linked together and that their frequency of recombination, or loss of linkage, was a good predictor of the distance between genes. This allowed for the creation of the first linkage maps of chromosomes. Understanding linkage is important for understanding some of the early work done in genetic association studies.
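To make the link between recombination frequency and distance concrete, here is a minimal sketch in Python. The offspring counts are made up for illustration, and the conversion assumes the usual convention that 1% recombination corresponds to roughly 1 map unit (centimorgan), which is only a reasonable approximation for genes that are fairly close together.

```python
def recombination_frequency(recombinant_offspring, total_offspring):
    """Fraction of offspring that carry a non-parental combination of traits."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must be between 0 and total_offspring")
    return recombinant_offspring / total_offspring


def map_distance_cm(frequency):
    """Convert a recombination frequency to map units (centimorgans).

    Assumes 1% recombination ~ 1 cM, which underestimates long distances
    because double crossovers go undetected.
    """
    return frequency * 100


# Hypothetical cross: 170 recombinant flies out of 1000 scored.
rf = recombination_frequency(170, 1000)
print(f"recombination frequency = {rf:.2f}, distance ~ {map_distance_cm(rf):.0f} cM")
```

Repeating this for several pairs of genes and arranging the genes so that the pairwise distances are consistent is, in essence, how a linkage map is built.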
You may be wondering how scientists were able to make linkage maps of genes in the early 20th century, decades before the discovery that DNA was the genetic material that made up chromosomes. Much of this work was performed in model organisms. Mendel used peas, for example, while others like Sturtevant used the humble fruit fly. Although model organisms sometimes become the target of needless political bickering, these organisms have been and will continue to be extremely important for understanding human genetics. In the early days, fruit flies were a great source of genetic material, not only because they can produce large numbers of offspring very quickly but also because their salivary glands contain giant chromosomes which can be stained and visualized using standard microscope procedures. Upon staining, these chromosomes display a distinct banding pattern based on chromatin density, or how compacted the DNA is in each region. These bands could be used to track genetic recombinations. Of course, the work completed in model organisms was later applied to humans. However, studying genetics in humans is complicated by the fact that we don’t produce 400 offspring every generation, and a family history of disease can be hard to determine if family members have passed away or if relatives refuse to participate. Because of these factors, studying human disease in the early days of genetics was extremely time consuming. It could take decades to develop informative maps to track disease. Further, the chromosomes contained in human cells are much smaller than those found in fruit fly salivary glands, so new techniques and better microscopes had to be invented before the power of chromosome staining could be applied to humans. As techniques have matured and the field of genetics has improved, we have been able to determine how many Mendelian diseases are inherited using these simple techniques.
The next logical step for genetics was to use the information gathered in familial studies and apply these lessons to entire populations. Modern genetics has facilitated our ability to look at genetic associations on a population-wide and genome-wide scale.
Genome-wide association studies aim to find genetic variants that are associated with traits. These can be used to understand human disease, but more generally they can be used to trace any genetic trait. These studies focus on looking at specific changes in the DNA. These can include single nucleotide polymorphisms, or SNPs (pronounced “snips”), which are changes at a single nucleotide position in the DNA. These changes can also include indels, which are insertions or deletions of specific nucleotides. Finally, these studies also track copy number variations, which include large deletions or duplications of genetic material.
Before we start the discussion of genome-wide studies, it is important to introduce the Human Genome Project. Prior to the completion of the Human Genome Project, tracking down genes was a very time consuming process of trial and error using cloning, PCR and Sanger sequencing. For the most part, traits could only be loosely mapped to large genetic regions. Geneticists realized that knowing the sequence of the entire genome would greatly enhance their ability to narrow down the location of all of the genes in the genome. However, the Human Genome Project took over a decade to complete and cost nearly 3 billion dollars. Of course, the best way to look at genome variation in individuals is to sequence all of their DNA individually and find the specific mutations that are responsible for their traits. This just was not possible at the time because genome sequencing was prohibitively expensive, but scientists could use what they already knew about human genetics and single nucleotide polymorphisms to develop a condensed map of the human genome that could then be used to track genetic changes. To create this abridged map, a team of scientists came together with the goal of mapping all of the known common SNPs, those present in 1% or more of the population, in 269 people from 4 different ethnic backgrounds. In essence, the human haplotype map, or HapMap for short, serves as the baseline group for future studies. Once the HapMap was in place, groups of affected individuals could be genotyped to find common SNPs that were predictive of their disease, trait, or response to a drug. In theory, this was a great workaround for the problem of having to whole genome sequence individuals. However, this approach does not necessarily tell scientists which genes in particular are responsible for the trait, and in many cases a SNP only serves as a marker.
We can explore this idea in a little more depth. Sometimes changes in the DNA have no effect whatsoever on the phenotype; however, these changes can be used as markers. This is sometimes a hard concept to understand. Remember how Punnett and Bateson showed that some genes can be linked. Well, you can think of SNPs as markers that can be linked with a trait but not responsible for it. Sometimes SNPs are very near the genes responsible for the trait; other times they are on completely different chromosomes. Because the majority of your DNA is inherited unchanged from your mother and father, there are many common SNPs that have been passed down through generations. Sometimes SNPs can be predictive of disease when looking at large sets of data and calculating statistics. Again, this was the whole idea behind the human HapMap. Certain variants can be statistically associated with a trait while not at all being causal! To further complicate the situation, because different populations of humans have been isolated from one another, each population has passed different sets of SNPs to their offspring. This means that important SNPs in one ethnicity may not be conserved in another. For genome-wide association studies to be effective predictors, large control datasets from each population must be obtained.
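As a rough illustration of what “calculating statistics” means here, the sketch below runs a simple chi-square test on allele counts at a single SNP in cases versus controls. This is one common way to test for association, not necessarily the test used in any particular study, and the counts are invented for the example. A small p-value only says the SNP is associated with the trait; it says nothing about whether the SNP causes it.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi_square_statistic, p_value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must contain at least one allele")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alternate allele is seen 60 times in 100 case
# chromosomes but only 40 times in 100 control chromosomes.
chi2, p = allele_chi_square(60, 40, 40, 60)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

In a real genome-wide study this test would be repeated at hundreds of thousands of SNPs, so the significance threshold has to be made much stricter to account for multiple testing.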
So how do scientists go about determining the SNP genotype for individuals? Much of this is currently done using array technology. The technology we use to do this in the Genomic Analysis Facility relies on bead chip arrays. These beads have DNA baits on them that are complementary to DNA just upstream of a SNP. These beads are exposed to fragmented DNA that has been isolated from an individual and then these captured fragments serve as the template for a DNA extension reaction. This is done using fluorescently labeled DNA bases. Each base has a different color so it’s easy to detect which SNP genotype is present at each targeted SNP location. These signals are then detected by a laser and the genotypes plotted. In the output here you can see a blue, a purple and a red cluster. These clusters indicate the SNP genotype at this position for each individual tested in this assay.
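To give a feel for how the clusters in that plot turn into genotype calls, here is a minimal sketch that calls a genotype from the two color intensities at one SNP. The function name, the fixed thresholds, and the intensity values are all assumptions made for illustration; real genotyping software fits the clusters from many samples rather than using hard cutoffs.

```python
def call_genotype(intensity_a, intensity_b, low=0.2, high=0.8, min_signal=100):
    """Call AA, AB, or BB from two-color intensities at a single SNP.

    intensity_a / intensity_b: signal for the two possible alleles.
    low / high: hypothetical cutoffs on the fraction of signal from allele B.
    min_signal: below this total intensity we refuse to call ("NC" = no call).
    """
    total = intensity_a + intensity_b
    if total < min_signal:
        return "NC"
    fraction_b = intensity_b / total
    if fraction_b < low:
        return "AA"
    if fraction_b > high:
        return "BB"
    return "AB"


# Hypothetical intensities for three individuals at the same SNP.
for sample, (a, b) in {"s1": (1000, 50), "s2": (500, 520), "s3": (40, 900)}.items():
    print(sample, call_genotype(a, b))
```

Each individual lands in one of three groups, which is what the three colored clusters in the array output represent.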
Some other methods for SNP genotyping include quantitative real-time PCR, which can be used to genotype samples at a handful of SNPs. This is typically used when looking at a small number of individual samples or a small number of SNP locations. Another commonly used technology is a mass array. Instead of using PCR probes to find SNPs, the mass array does an extension reaction with very heavy modified bases. Mass spectrometry is then used to detect which heavy base is present in each SNP reaction. Finally, the gold standard technique for detecting and confirming SNPs is Sanger sequencing. Sanger sequencing is beneficial because it allows researchers to look directly at the DNA. It is a very robust technique; however, it is hard to use to genotype a large number of samples at a time. These days, Sanger sequencing is mostly used as a confirmatory assay to back up the findings of other SNP genotyping techniques.
Up until 2008 or so, most genome-wide association studies were lacking because they weren’t really genome wide. They employed genome-wide biomarkers, but their utility in trait discovery was limited because they relied on common variants. These studies missed rare variants, and because many of the SNPs were not mutations in specific genes, GWA studies were not descriptive of disease causation. The major benefit of whole genome sequencing is that it does assay the entire genome. This technique discovers all of the variants in an individual’s genome, which has its drawbacks and benefits. One of the biggest criticisms of whole genome sequencing is that, because this technique finds all of the genetic variation, it is hard to determine which mutations are actually important. Without large control datasets to set the benchmark for normal variation, it’s a huge task to sort out which variants are signal and which are noise. Enter the 1000 Genomes Project. This project is similar to the human HapMap project in that it was set up to provide a large database of genomic variation, except that the 1000 Genomes Project focuses on the complete genomes of 1000 individuals from a range of ethnic backgrounds. Again, the important genomic variants can differ among ethnic populations.
What caused the price of whole genome sequencing to decrease so rapidly and facilitate this new age of whole genome sequencing? Next generation sequencing, or second generation sequencing, was developed to increase the throughput of sequencing. Sanger sequencing had been advanced to the point that 384 sequences could be analyzed at the same time, but this still was not high enough throughput to quickly sequence genomes. Next generation sequencing’s main advantage is that it can sequence millions to billions of sequences in parallel and it does not require a homogeneous input sequence. This is because sequences are obtained as independent clusters. All of the second generation technologies rely on sequencing by synthesis, using a polymerase to generate a complementary DNA strand.
The technical theory behind all of the second generation sequencing technologies is very similar. Genomic DNA is isolated and then fragmented to a particular size range. In the Genomic Analysis Facility, we try to fragment the DNA so that the majority of it is 300 base pairs in length. We then perform some biochemistry to ligate small pieces of DNA to the ends of these fragments. These are called adaptors. The adaptors are very important because they attach a “known” DNA sequence to each fragment. This is important for amplifying the library and also for capturing the library on a flow cell or bead. The flow cell or bead, depending on the sequencing technology, has known complementary DNA attached to its surface. This allows it to capture DNA fragments. Once the fragments have been captured by the bead or the flow cell, the fragments are amplified on the surface to create large fragment clusters. These clusters are important because they allow for the future detection of a sequencing signal. Once the clusters are made, they are placed in the sequencer where the sequences can be read. This is done by sequentially flowing labeled bases over the bead or flow cell and detecting the signals that are emitted from each spot. I’ll go into the details later of how each technology actually sequences the DNA. This cycle of base addition and detection is repeated hundreds of times on millions of clusters to obtain the final sequencing information.
There are many different types of sequencing that can be done using this technology. I’ll highlight the ones that we perform most often in the Genomic Analysis Facility. Of course there’s whole genome sequencing, which is essentially what I described on the previous slide. There’s also whole exome sequencing, which utilizes a selection protocol to only sequence the DNA found in exons, or coding DNA sequences. This technique is advantageous in that it saves a lot of money on the sequencing end if the question you are asking relates to coding DNA sequences. This selection is done by attaching complementary RNA to beads to fish out only the fragments of DNA that relate to coding DNA sequences. A similar technique that further saves cost is custom capture. This technique is employed when a researcher is looking to sequence only specific DNA targets of interest. An example of this would be to create a capture library of only the genes known to be involved in a specific type of cancer. Finally, there’s amplicon sequencing. This technique is usually only used when you want to sequence a small number of genetic regions in a large number of samples. This technique is only cost effective when used on a small number of regions because specific primers need to be created for each targeted region. As the number of regions increases, methods such as custom capture become more cost effective.
There are a few different types of studies that we do in the Center for Human Genome Variation. These include multiplex family studies, case-control studies and trio sequencing studies. In multiplex family studies, we sequence DNA from families with a known history of disease and then probe their genomes for variants that appear to be conserved in affected individuals. The power of multiplex family studies is that the genetic background in these individuals is conserved, which affords us a greater signal to noise ratio. When sequencing entire families is not possible, trio sequencing is employed. This is a powerful technique for sporadic disease, where the disease is not conserved within a family lineage and new genetic mutations appear to be responsible for the disease. For these studies, we sequence the genomes of both parents and the affected child to try to determine where de novo mutations may have occurred. Finally, we’ve already touched on case-control studies a little bit when we talked about the human HapMap and the 1000 Genomes Project. Case-control studies are used to look at how traits or diseases can be identified while looking at populations. In these studies, unaffected controls are sequenced and compared to affected individuals to try to find genetic mutations that are more highly represented in cases than controls.
So how do we discover these disease causing mutations in the sequencing data? Using trio sequencing as an example here, we sequence the genomes of a mother, a father, and the affected child. We run this sequencing data through a computer system that takes the small DNA fragments that we sequenced and aligns them together so that we can compare the three genomes to one another. We use this data to find differences in the three genomes and then compare those genetic differences to a database of known disease causing genetic mutations. Here you can see in this example that the two parents have a normal sequence at this location while the affected child appears to be heterozygous for a mutation. Sanger sequencing of all three individuals at this locus confirms that the affected individual is heterozygous at this location. So how do we know that this mutation is actually the cause of the disease based on the sequence information alone? We don’t, but we can make an educated guess if the mutation occurs in a protein coding region. Additionally, it is becoming more and more common to follow up these association studies in other models to provide functional information about a mutation. This is done by making specific mutations in animal or cell culture models and seeing if the mutation produces a similar phenotype.
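The core comparison in a trio study can be sketched in a few lines of Python. This is a simplified illustration, not our actual analysis pipeline: the positions and genotypes are invented, genotypes are assumed to be already called, and real pipelines also have to weigh sequencing errors and coverage before trusting a candidate.

```python
def is_candidate_de_novo(child, mother, father):
    """True if the child carries an allele that neither parent carries.

    Each genotype is a pair of alleles, e.g. ("C", "T").
    """
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


def find_candidate_de_novo_sites(trio_genotypes):
    """Return the positions where the child has a candidate de novo allele.

    trio_genotypes maps a position to (child, mother, father) genotypes.
    """
    return [
        position
        for position, (child, mother, father) in sorted(trio_genotypes.items())
        if is_candidate_de_novo(child, mother, father)
    ]


# Hypothetical genotype calls at three positions.
trio = {
    101: (("C", "T"), ("C", "C"), ("C", "C")),  # T is new in the child
    205: (("A", "G"), ("A", "G"), ("A", "A")),  # G inherited from the mother
    310: (("G", "G"), ("G", "G"), ("G", "G")),  # no variation
}
print(find_candidate_de_novo_sites(trio))
```

Every position this returns is only a candidate, which is why the Sanger confirmation step on all three family members matters.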
In addition to point mutation detection, we can also use next generation sequencing data to detect copy number variations including insertions and deletions. Because next generation sequencing looks at millions of small sequence fragments, we get multiple sequences for each region of the genome. Based on how often we see a signal for each sequence, we can determine how much of each sequence is likely to be present. Using ERDS analysis, we can look at specific sequences and see if they are completely missing such as a homozygous deletion. We can determine heterozygous deletions if we obtain half as much sequence information as we expect, or we can find duplications if we receive more sequence information than expected.
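The read depth reasoning above can be written down as a small sketch. This is not how ERDS itself is implemented; it is just the underlying arithmetic, assuming a diploid region and a known expected depth, with made-up depth values.

```python
def estimate_copy_number(observed_depth, expected_diploid_depth):
    """Estimate copy number from read depth, assuming 2 copies give the expected depth."""
    if expected_diploid_depth <= 0:
        raise ValueError("expected_diploid_depth must be positive")
    if observed_depth < 0:
        raise ValueError("observed_depth cannot be negative")
    ratio = observed_depth / expected_diploid_depth
    # Round to the nearest whole number of copies.
    return int(2 * ratio + 0.5)


def describe_copy_number(copies):
    """Translate an estimated copy number into the terms used in the lecture."""
    if copies == 0:
        return "homozygous deletion"
    if copies == 1:
        return "heterozygous deletion"
    if copies == 2:
        return "normal"
    return "duplication"


# Hypothetical regions sequenced to an expected depth of 30 reads.
for observed in (0, 15, 30, 45):
    copies = estimate_copy_number(observed, 30)
    print(observed, copies, describe_copy_number(copies))
```

Real read depth varies with GC content and how uniquely a region maps, so actual tools smooth and correct the depth signal before making calls like these.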
I’ve spent the last few slides talking about the applications of genome sequencing in the Genomic Analysis Facility, but I’d also like to provide you with information on the current crop of next generation sequencers and where the field is heading. I’m going to start this off by discussing the technology that we currently use in the Genomic Analysis Facility for sequencing, which is the Illumina system. You will get a chance to see this technology in action when you come to tour the lab on Friday. The Illumina sequencing platform uses a flow cell to process sequencing samples. Clusters are created via bridge amplification within each lane of the flow cell. Sequencing is performed by flowing fluorescently labeled bases over the clusters and imaging cluster tiles for each colored base. Determination of the color of the cluster after each cycle allows for interrogation of the sequence. The advantage of the Illumina system is that it is very high throughput and has a low cost per base. It can generate anywhere from 40 to 240 million reads per lane of a flow cell, depending on the sequencing system and read length. Some of the limitations are that each experiment can cost between $10,000 and $20,000, depending on the depth of sequencing and the length of the run. The Illumina system also is not ideal for de novo genome sequencing because the short read length limits its ability to accurately sequence large repeats. However, Illumina has recently purchased a company that says it can increase the read length of Illumina sequencing to the 2-10 kb range, making it an attractive choice for future sequencing experiments that require longer reads.
The Ion Torrent system uses a completely new detection system. For the most part, all of the next generation sequencing systems use light to determine the sequence. The Ion Torrent system uses semiconductor technology to detect the sequence based on changes in pH. As in pyrosequencing, bases are flowed over the template one type at a time, but instead of using light to detect the pyrophosphate that is released, the Ion Torrent system detects the hydrogen ion that is released each time a base is incorporated. This results in a detectable change in the pH of the reaction well. This is one of the advantages of the system because it does not rely on visual inspection to determine the sequence. This significantly increases the speed at which the sequencing is performed because a camera doesn’t have to image each cluster; instead, there is a pH detecting sensor under each bead. The technique also doesn’t require expensive modified bases or complicated chemistry to perform the sequencing reactions. Another benefit is that the read lengths, at 200 bp, are slightly longer than those of the HiSeq system. On the downside, the Ion Torrent system has a high homopolymer error rate and low data output. It is not an ideal system for large scale projects and has a relatively high cost per base compared to Illumina sequencing. Ion Torrent sequencing is ideal for the array based custom capture sequencing that is starting to be used by diagnostic labs.
We are currently entering the era of third generation sequencing. The defining characteristic of third generation sequencing is that sequences are determined from single DNA molecules, not clusters. One of the main criticisms of cluster based second generation sequencing is that mutations can occur and be amplified during cluster generation. Third generation sequencing also has the added benefit of less complex sample prep and longer read lengths. Current third generation sequencers can produce reads up to 15 kb, with an average read length of 3 kb. These technologies fall under two categories. Like second generation sequencers, some of these techniques employ sequencing by synthesis, while the more bleeding edge technologies sequence the DNA by reading the strand directly. Direct reading of the DNA can be accomplished by passing the DNA through a specialized pore or by using atomic force microscopy. Currently there is only one functioning third generation technology on the market, and this sector still faces many technical hurdles.
The Pacific Biosciences sequencing system is extremely complex. It uses sequencing by synthesis to “watch” a polymerase sequence a single DNA strand in real time. This is done by immobilizing a single polymerase at the bottom of a nanowell and using confocal microscopy to look at a small slice of the visual field and only detect fluorescence at the polymerase. In this system, reagents exist in a large reaction volume and sequencing is allowed to proceed at a very rapid rate. The advantages of the system are the extremely long read lengths, low complexity of sample prep and very fast generation of sequencing data. The major disadvantage of this sequencing technique is the very high error rate, which is on the order of 15%. Because of this, the company is struggling to maintain its customer base and attract new customers.
PacBio may still have the last laugh though. The long reads that their system generates are invaluable to de novo sequencing applications such as determining the genomic sequence of previously unsequenced organisms; however, the high error rate of those reads makes them nearly useless on their own. On the flip side, short read sequencers are very accurate, but their data is hard to assemble into complete genomes because holes are generated in highly repetitive sequences. One recently published solution is to use HiSeq or 454 reads to “fix” the long sequencing reads generated by the PacBio system. By using the PacBio reads as a scaffold, HiSeq reads can be aligned and a consensus can be obtained to fix the reads. The resulting effective error rate is better than that of any other next generation sequencing system on the market, although the cost of doing such a project is very high. This hybrid sequencing system is really a bridge technique until PacBio can improve its error rate or other long read technologies come online.
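The consensus idea can be illustrated with a toy sketch. This is not the published method; it assumes the short reads have already been aligned to the long read, it only handles substitution errors (no insertions or deletions), and the sequences, offsets, and the `min_support` parameter are made up for the example.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads, min_support=2):
    """Correct a noisy long read using a per-position majority vote of short reads.

    aligned_short_reads: list of (offset, sequence) pairs, where offset is the
    position in the long read at which the short read starts.
    min_support: how many short reads must agree before a base is overwritten.
    """
    votes = [Counter() for _ in long_read]
    for offset, sequence in aligned_short_reads:
        for i, base in enumerate(sequence):
            position = offset + i
            if 0 <= position < len(long_read):
                votes[position][base] += 1

    corrected = []
    for original_base, counts in zip(long_read, votes):
        if counts:
            consensus_base, support = counts.most_common(1)[0]
            corrected.append(consensus_base if support >= min_support else original_base)
        else:
            # No short read covers this position, so keep the long read's base.
            corrected.append(original_base)
    return "".join(corrected)


# Hypothetical long read with one wrong base (position 3 should be A).
noisy_long_read = "ACGTTCGA"
short_reads = [(0, "ACGAT"), (1, "CGATC"), (3, "ATCGA")]
print(correct_long_read(noisy_long_read, short_reads))
```

The long read supplies the order and spacing of the sequence, and the accurate short reads supply the correct base at each position wherever enough of them overlap.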
The future of DNA sequencing is in direct interrogation of single molecules. Most of these technologies are in the concept stage, although a small startup says that it will be releasing its product at the end of the year. But they have been saying they’d release a product at the end of the year for the last 3 years, so how close they are to an actual product is anyone’s guess. The concept behind Oxford Nanopore’s technology is to feed a single DNA molecule through a pore in a membrane and detect changes in the electrical current as each base passes through. Every DNA base should cause a detectable change in current flow. The system theoretically can be used to directly sequence RNA and protein too. Oxford says the system has a very long read length, is infinitely scalable, and is plug and play. They even introduced a USB drive sized device that can turn any computer into a high throughput sequencer. At this point, Oxford has yet to release any real product or data, so we can assume that most of this is probably pure speculation and marketing hype.
Other candidates in the direct sequencing game are concept or proof of concept stage devices. One of these is a project from IBM called the DNA transistor, which utilizes a pore to direct DNA through a dielectric transistor to read bases in a similar manner to the Oxford Nanopore system. Other labs have pioneered using atomic force microscopy to sequence short stretches of DNA; however, a commercial system has yet to materialize.
To this point I’ve spent some time talking about sequencing and where the technology is heading. Now I’d like to discuss how these sequencers can be used to answer complex biological questions. Genetic analyses that used to take days or years can now be performed on a sequencer in a matter of hours. Next generation sequencing has opened a new door and has huge potential to revolutionize human healthcare and scientific research.
One of these areas, which I touched on briefly already, is de novo sequencing. Previously this was done by laboriously cloning overlapping segments into plasmids and Sanger sequencing each one. This is how the human genome was sequenced by the NIH. It took over a decade and cost nearly 3 billion dollars. Using the current crop of next generation sequencers, we can perform the same sequencing for around $4000 and complete it in a week. Biotech companies like Complete Genomics report that they have already sequenced a few thousand human genomes. The future of de novo sequencing is evolving and relies on future long read sequencers to make the alignment and data analysis process more efficient. In the case of agriculture, many plants are considered impossible to sequence because of the highly repetitive nature of their DNA. As sequencing costs decrease, de novo sequencing of human genomes will become a routine diagnostic tool. Along these lines, using de novo sequencing to increase the catalog of organismal genomes will improve our understanding of evolution and development.
Another area where next generation DNA sequencing has the potential to change science and medicine is in genome mutation analysis. This type of genome analysis is one of the biggest focuses of the work that we do in the Genomic Analysis Facility. As I stated earlier, complicated and time consuming linkage studies coupled with Sanger sequencing were used in the past to elucidate genetic diseases. Now next generation sequencing can look directly at the entire genome and produce a complete genetic map of a patient’s DNA. In the future, whole genome mutation mapping will be used to diagnose human disease. This will occur slowly and likely proceed through targeted genome sequencing, using selection arrays or panels to look at specific regions of the genome which have already been linked to disease.
Along the same lines as genetic mutation mapping is the current hot topic of pharmacogenetics. Pharmaceutical and insurance companies are really interested in understanding how genetic data can predict a patient’s response to drug treatment. These screens currently rely on microarrays, but next generation sequencing could provide sequence level information at more loci. Microarrays only look at a handful of polymorphisms. The value of these types of screens is enormous if an association between a drug and a phenotype is determined; however, pharmacogenetics suffers from relatively small datasets with low predictive power for most drugs. As the amount of genetic data increases, it should be easier to predict treatment outcomes.
Next generation sequencing has already revolutionized the field of epigenetics, which looks at heritable information that isn’t coded for in the DNA sequence. These marks include DNA methylation and histone modifications that affect gene expression. Previous methods for detecting these modifications relied on low throughput and sometimes extremely complicated techniques to determine the presence of modified sequences or DNA bound proteins. Next generation sequencing changed everything because now researchers can look at chromatin on a global scale and not just at their favorite convenient genomic locus. Whole genome epigenetic sequencing is revealing the complexities of chromatin acetylation and methylation, which in the future may be important for understanding dysregulation of gene expression in a wide variety of diseases.
In the last year, the ENCODE consortium published their initial findings on a wide range of genomic topics. ENCODE stands for the Encyclopedia of DNA Elements and was a follow up to the Human Genome Project. It quickly became clear that knowing the human genome sequence alone wasn’t very helpful for understanding how the genome functions. Of the 3 billion bases sequenced, only 1-2% of DNA actually codes for functional proteins. This means that 98-99% of the genome does not code for protein! Unfortunately, this led to the propagation of the idea that the genome was mostly “junk” DNA. The problem with this idea of junk DNA is that DNA bases are biochemically expensive and maintaining the genome isn’t an easy task. The argument is that this junk DNA must serve some purpose, and finding this purpose is the goal of the ENCODE project. We know that non-coding DNA plays an important role in gene expression via promoters, enhancers and other protein binding sites. Before whole genome sequencing, finding all of these sites was an impossible task. The application of chromatin immunoprecipitation coupled with whole genome sequencing has allowed the ENCODE consortium to determine the protein binding locations of 119 gene expression regulatory proteins in 147 different tissue types. Combining the RNA sequencing information with the protein binding information, the ENCODE consortium has determined that over 80% of the genome is functional in some way, in that it is either bound by protein or converted into RNA. As the available data expands to include all of the 1800 DNA regulatory proteins and additional tissue types, the consortium expects that the amount of functional genomic material will increase. Of course, this 80% figure is still somewhat controversial in the field, with some biologists saying that they believe that only 10-15% of the genome actually contributes to phenotypes. So how do these results relate to pharmacogenetics and personalized medicine?
These results show that sometimes knowing the absolute sequence of the DNA isn’t enough to fully understand how the genome is functioning. This is especially true in cancers where uncontrolled cell growth can be the result of overexpression of a perfectly normal protein, but because gene expression is dysregulated, this protein causes unwanted cellular proliferation.
Transcriptome and gene expression analysis are important for understanding what genes are expressed at any given time in a sample. This kind of information is very important for disease discovery and cancer diagnosis. Before microarrays, expression analysis was done by northern blot; however, both systems require a probe to determine which genes are expressed. The problem with these systems, even with microarrays, is that you need to “know” what you’re looking for, or at least guess what you’re looking for. Next generation sequencing eliminates the guess work because now it is possible to survey every expressed gene at the sequence level to look at how much of each gene is expressed and determine if there are any mutations in those expressed sequences. Transcriptome profiling using next generation sequencing has already been used experimentally to diagnose disease and define cancer targets. Refining this technique for broader use in the clinic has important applications in medicine.
One of the major challenges of next generation sequencing is the data deluge: sifting through millions of bases of DNA to find the mutations that actually result in a phenotype. The error rate of current NGS technologies complicates the issue further. The most promising hits must still be validated by other means like realtime PCR, mass spec, or Sanger sequencing, and one of the major questions plaguing the field is how to choose which data to validate. Just because a mutation occurs in a gene doesn’t necessarily mean it contributes to the phenotype. Dataset size is a huge problem for most phenotypic correlation studies, and most lack the size or the diversity to be very useful for predicting complex diseases. On the other hand, validated hits can be a distraction, especially in tumors, where high diversity gives cancer cells multiple escape routes to continue killing patients. This idea is highlighted well in a great series the New York Times ran last year on whole genome sequencing. For genomes to provide valuable information on phenotypes, we need to continue generating large validated datasets that are ethnically and even geographically diverse.
Another field that can benefit significantly from next generation sequencing technology is metagenomics. This field seeks to determine how communities of organisms interact. This may seem like an odd topic to highlight in a lecture about human genetics and personalized medicine, but there are communities living in and on you that affect your health in ways that we are only just beginning to understand. Metagenomics can be employed to explore everything from micro scale soil and gut microbial communities all the way to macro scale coral reef communities. Previous techniques relied on sequencing of mitochondrial DNA or polymorphic regions of ribosomal DNA to classify the composition of communities. These studies are time intensive and not completely informative because you can only determine who is present and not necessarily what is expressed and how the organisms might be interacting. Next generation sequencing can get around many of these limitations by surveying community diversity and linking DNA and expression back to community members.
I touched on the data problem a few slides back, but data analysis is by far the biggest roadblock for next generation sequencing. The amount of data obtained from these studies is mind boggling, and essentially useless if we can’t analyze it efficiently. Another big question is how long the data should be kept. If you ask the sequencing companies what the solution to this problem is, they’ll say that saving the data indefinitely is too expensive, and if you need the data again, you should just redo the run. Of course there’s a huge economic benefit in that for them. However, I think the answer to this question really depends on the application, and most doctors would tell you that for diagnostic purposes, sequencing data should be stored forever. If the data is kept forever, then where should it be stored? Should local resources be used to maintain the data, or is hosting with an external company preferable? Again, this comes down to a cost-benefit analysis, but I think that in the future, cloud or external systems will become more attractive. This is because large companies like Amazon and Google can afford to invest in very high power resources, which spares universities the cost of investing in constantly changing technologies that quickly become obsolete. We are rapidly approaching a time when every person will have their genome sequenced multiple times in their life. Determining what to do with all of the data and who has access to it is a complicated ethical question that I’m sure will be a future discussion topic in this course.
Genome Wide Methodologies and Future Perspectives Brian Krueger, PhD Duke University Center for Human Genome Variation
History of Genetic Linkage • Mendel’s Laws – Law of segregation • Each parent randomly passes one of two alleles to offspring – Law of Independent Assortment • Separate genes for separate traits are passed independently to offspring • Traits should appear in offspring in the ratio of 9:3:3:1 – Laws hold true for genes on different chromosomes or genes located far away from one another • Linkage – Bateson and Punnett quickly found traits that didn’t assort independently – Thomas Hunt Morgan and his student Alfred Sturtevant found that recombination frequency is a good predictor of distance between genes • Genes that are inherited together must be closer to one another – linked • Generated the first linkage maps – Serves as an important basis for understanding genetic association studies
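The two quantitative ideas on this slide, the 9:3:3:1 ratio and recombination frequency as a predictor of distance, can be made concrete with a little arithmetic. The sketch below is illustrative only: the offspring counts are invented, the function names are hypothetical, and it assumes the usual convention that 1% recombination corresponds to roughly one map unit (centimorgan), which is only a fair approximation for closely linked genes.

```python
def expected_dihybrid_counts(n_offspring):
    """Expected phenotype counts under a 9:3:3:1 ratio (independent assortment)."""
    return [n_offspring * part / 16 for part in (9, 3, 3, 1)]


def map_distance_cm(recombinant, total):
    """Recombination frequency in map units (1% recombination ~ 1 cM, an approximation)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * recombinant / total


# Invented counts of recombinant offspring out of 1000 scored, for three gene pairs.
crosses = {("A", "B"): 170, ("B", "C"): 60, ("A", "C"): 230}
distances = {pair: map_distance_cm(n, 1000) for pair, n in crosses.items()}

# The pair with the largest distance marks the two ends of this small map.
outer_pair = max(distances, key=distances.get)

print(expected_dihybrid_counts(1600))
print(distances)
print(outer_pair)
```

With these invented numbers the A-C distance equals the sum of the A-B and B-C distances, which is what you would expect if the gene order is A, B, C. This additivity is the reasoning behind the first linkage maps.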
Linkage Studies • Model Organisms – Fruit Flies, plants, etc – Extremely important for understanding human genetics – Fruit flies can produce new generations of 400+ offspring approximately every week! • Can very quickly understand the genetics of trait heritability • Familial Linkage Studies – Require multiple generations – Take decades to develop – Complicated by family participation • Association studies – Subtly different from linkage studies – Try to apply knowledge of familial linkage to entire populations
Genome Wide Association Studies • GWA studies – Aim to find genetic variants that are associated with traits – Typically used to elucidate complex disease traits – Focus on SNPs, Indels, CNVs – Most often Case/Control Studies • SNP (Single Nucleotide Polymorphism) – Change in a single nucleotide position • Indel (Insertion/Deletion) – Describes the insertion or deletion of nucleotides • CNV (Copy number variations) – Large deletions or duplications of genetic material
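To show what "associated with a trait" means in a case/control design, here is a minimal sketch of one common test: a chi-square test on allele counts for a single SNP. It is not taken from any particular GWA study; the counts are invented, the function name is hypothetical, and a real analysis would also correct for testing millions of SNPs and for population structure.

```python
import math


def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented example: the alternate allele is seen 300 times in 1000 case
# chromosomes and 200 times in 1000 control chromosomes.
chi2, p = allele_chi_square(300, 700, 200, 800)
print(round(chi2, 2), p)
```

A small p-value only says that the allele frequency differs between cases and controls. It does not say the SNP causes the disease.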
GWA Study History • Human Genome Project (1990-2000) – Decade long international project to determine the complete human genome sequence – Provided the reference genome for future research on genome variation • Human HapMap (2002-2009) – Sequencing whole genomes is expensive – Needed a shortcut to understand how variation contributes to disease – Mapped millions of common known SNPs in 269 individuals – Theory that common SNPs are inherited and could be predictive of associated disease – Determine how SNPs from case/control studies associate with human disease
Defining Association • Variants are not always causal! – SNPs sometimes only serve as markers – Can play absolutely no role in the disease and even be located on different chromosomes from the gene actually responsible for the phenotype • Population stratification – Variants differ by population – Variants important markers of disease in one population or ethnicity may not be effective markers in another – For GWA studies to be effective predictors in multiple populations, large datasets for each ethnicity must be obtained
GWAS SNP Genotyping • Bead array genotyping – Uses a chip containing beads with covalently attached baits – Baits hybridized to fragmented DNA – Baits SPECIFIC for the DNA just upstream of a SNP – Base extension with fluorescently labeled bases allows interrogation of the SNP (each base has a different color!) – A single bead chip can assay millions of SNPs – Colorimetric output plotted • Blue indicates homozygous for one version of the SNP - CC • Purple is heterozygous - CA • Red homozygous for the other version of the SNP - AA • [Figure: genotype cluster plots for SNP rs1372493; left panel plots Intensity (B) against Intensity (A), right panel plots Norm R against Norm Theta]
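The cluster plots on this slide suggest how a genotype can be read from two color intensities. The sketch below is a deliberately simplified illustration, not the array vendor's algorithm: it converts the two intensities to an angle scaled between 0 and 1 (in the spirit of the Norm Theta axis) and applies fixed cut-offs. The thresholds 0.25 and 0.75, the minimum total intensity, and the function names are all assumptions for illustration; real genotyping software clusters many samples per SNP rather than using fixed cut-offs. "A" and "B" here stand for the two alleles of the SNP (C and A in the slide's example).

```python
import math


def normalized_theta(intensity_a, intensity_b):
    """Angle of the (A, B) intensity point, scaled so pure A is 0 and pure B is 1."""
    return (2.0 / math.pi) * math.atan2(intensity_b, intensity_a)


def call_genotype(intensity_a, intensity_b, min_total=1000.0):
    """Return 'AA', 'AB', 'BB', or 'no call' using fixed, illustrative thresholds."""
    if intensity_a + intensity_b < min_total:
        return "no call"  # signal too weak to trust
    theta = normalized_theta(intensity_a, intensity_b)
    if theta < 0.25:
        return "AA"
    if theta > 0.75:
        return "BB"
    return "AB"


# Invented intensity pairs for four samples.
samples = [(10000, 500), (8000, 8000), (500, 10000), (200, 100)]
calls = [call_genotype(a, b) for a, b in samples]
print(calls)
```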
GWAS SNP Genotyping and Validation • Realtime PCR – Use specific PCR probes to verify SNPs – Good for validating a handful of SNPs at a time • Mass Array – Use mass spec to find SNPs – Detected by looking at fragment weight differences – Good for detecting or validating a large number of SNPs rapidly • Sanger sequencing – Gold standard validation method – Can determine the SNP at its exact position – Very robust
GWA Study History • To this point in time, the power of most GWA studies was lacking – GWA not really genome wide – Looked at common variants across genome – Missed rare variants and not always descriptive of disease causation • Whole Genome Sequencing (WGS) – Actually assays the entire genome – Discovers all variants – Prohibitively costly before 2008 – Current cost of WGS ~$4000 • Thousand Genomes Project (2008-) – Facilitated by plummeting sequencing costs and technological advancements – Goal to fully sequence the genomes of 1000 healthy individuals to provide a true picture of genome wide variation
Second Generation Sequencing • Developed to increase throughput of Sanger sequencing • Can sequence many molecules in parallel – Does not require homogenous input – Sequenced as clusters • Sequencing by synthesis – Bases are added, signals scanned, and then washed – Cycle repeated (30-2000x)
2nd Gen: Sequencing by Synthesis Overview • Genomic DNA • Fragmented DNA • Ligate Adaptors • Generate Clusters (On Flowcell or Beads) • Add Bases • Detect Signals • Repeat hundreds of times on millions of clusters
Flavors of Sequencing • Whole Genome Sequencing – Obtain whole blood or tissue sample – Create sequencing libraries of all DNA fragments • Whole Exome Sequencing – Utilizes a selection protocol – Attach complementary RNA strands to beads – Fish out ONLY coding DNA sequences – Create sequencing libraries from enriched DNA – Reduces cost significantly • Custom Capture – Same protocol as Exome sequencing – Only target desired DNA sequences • Amplicon Sequencing – Use PCR to amplify target DNA – Sequence amplified DNA (Amplicon)
NGS Study Designs for Gene Discovery Multiplex families Case-control studies Trio sequencing of sporadic diseases
De novo Mutation Calling/Filtering • Variant calling – Individual variant calling – Multi-sample variant calling • Cross-checking public databases – Exome Variant Server (6500 exome sequenced individuals) • Visual Inspection • Sanger sequencing confirmation
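The filtering logic for de novo mutations in a trio can be sketched with plain set operations. This is a minimal illustration under stated assumptions: variants are represented as hypothetical (chromosome, position, reference, alternate) tuples, the example data are invented, and the "known variants" set stands in for a public database lookup. Candidates that survive this filter would still need visual inspection and Sanger confirmation.

```python
def candidate_de_novo(child, mother, father, known_variants):
    """Variants seen in the child but in neither parent nor a public database."""
    inherited = mother | father
    return sorted(v for v in child if v not in inherited and v not in known_variants)


# Invented variant calls, written as (chromosome, position, ref, alt).
child = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T"), ("chr7", 42, "G", "A")}
mother = {("chr1", 1000, "A", "G")}
father = {("chr3", 777, "T", "C")}
known = {("chr7", 42, "G", "A")}  # stand-in for a public database of known variants

candidates = candidate_de_novo(child, mother, father, known)
print(candidates)
```

Here one child variant is inherited from the mother and one is already known in the population, so only the chr2 variant remains as a candidate.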
Detecting Copy Number Variants • ERDS (Estimation by Read Depth with SNVs) – The average read depth (RD) of every 2-kb window was calculated, followed by GC correction – A paired Hidden Markov model was applied to infer the copy number of every window by utilizing both RD information and heterozygosity information • [Figure: read depth plotted across windows, with panels showing a homozygous deletion, a heterozygous deletion, and a duplication]
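The read depth idea behind this slide can be sketched in a few lines. This is not ERDS: it swaps the paired Hidden Markov model for naive rounding, skips GC correction, and ignores heterozygosity information entirely. It only shows why average depth per 2-kb window carries copy number information. The depth values and function names are invented for illustration.

```python
from statistics import median

WINDOW_BP = 2000  # 2-kb windows, as on the slide


def window_means(per_base_depth, window=WINDOW_BP):
    """Average read depth in consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def naive_copy_numbers(means, normal_copies=2):
    """Scale each window by the median depth and round to a whole copy number."""
    baseline = median(means)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [round(normal_copies * m / baseline) for m in means]


# Invented per-base depth: mostly 30x, with one window at half depth,
# one window with no reads, and one window at 1.5x depth.
depth = ([30] * (3 * WINDOW_BP) + [15] * WINDOW_BP + [0] * WINDOW_BP
         + [45] * WINDOW_BP + [30] * (3 * WINDOW_BP))
copies = naive_copy_numbers(window_means(depth))
print(copies)
```

In this toy example the half-depth window reads as one copy (heterozygous deletion), the empty window as zero copies (homozygous deletion), and the 1.5x window as three copies (duplication).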
Illumina • Uses a flow cell • Cluster generated on slide via bridge amplification • Sequencing by synthesis – Performed by flowing labeled bases over flow cell – 4 pictures taken (one for each base) – Cluster color determined at each cycle allows interrogation of sequence • Advantages – Low cost per base – Very high throughput • Limitations – High cost per experiment – Short read length (30-150bp) – Acquired a company that uses new tech to reach read lengths of 2-10Kb Schadt et al 2010 HMG
Ion Torrent • Emulsion PCR is used to generate clusters on a bead • Sequencing by synthesis – Pyrosequencing – Relies on release of pyrophosphate for detection – Instead of a visual cue, system senses the release of H+ as each base is flowed over the beads • Advantages – Short run time – Does not require modified bases – Longer read length (200bp) • Limitations – Low data output – High homopolymer error rate
Third Generation Sequencing • Defined as single molecule sequencing • Less complex sample prep • Much longer read length – SGS Short read length a huge disadvantage for de novo sequencing applications • Two categories – Sequencing by synthesis – Direct sequencing • Passing molecule through a nanopore • Using atomic force microscopy • Bleeding edge technology – Many technical hurdles – Currently very high error rates
Pacific Biosciences • Utilizes single molecule sequencing by synthesis • Extremely complex system – Each well contains a single DNA molecule and an immobilized polymerase – No reagent washing – Employs confocal microscopy to only detect fluorescence at the polymerase • Advantages – Very long read length (1-15kb) – Low complexity sample prep – Very fast data generation (real time) • Disadvantages – Prone to sequencing errors (~15% error rate) – Company on the verge of bankruptcy
Third/Second Generation Sequencing • Currently only one viable high throughput long read sequencing platform – PacBio system has a 15% error rate – Need long reads for many applications from de novo sequencing to haplotyping • Second generation sequencers high throughput and accurate – Short reads are hard to assemble and leave gaps in repetitive sequences • Can use both as a highly accurate and extremely powerful tool for de novo sequencing applications – Use PacBio assembly as a scaffold – Correct errors by aligning HiSeq reads on top – Effective error rate of 0.1% – Expensive but extremely fast and accurate compared to other methods Koren et al 2012 Nature Biotechnology
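The hybrid strategy on this slide, using an error-prone long read as a scaffold and correcting it with accurate short reads laid on top, can be illustrated with a simple majority vote. This is a toy sketch and not the method of Koren et al: it assumes the short reads are already aligned at known offsets and only handles substitution errors, whereas real long-read errors include many insertions and deletions. All sequences and names are invented.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads):
    """Replace each base of the long read with the majority base of the short reads.

    aligned_short_reads is a list of (offset, sequence) pairs, where offset is the
    position in the long read at which the short read starts. Positions with no
    short-read coverage keep the long read's original base.
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original_base, counter in zip(long_read, votes):
        if counter:
            corrected.append(counter.most_common(1)[0][0])
        else:
            corrected.append(original_base)
    return "".join(corrected)


# Invented example: the true sequence is ACGTACGTAC and the long read has one
# wrong base at position 4.
noisy_long_read = "ACGTTCGTAC"
short_reads = [(0, "ACGTA"), (2, "GTACG"), (4, "ACGTA")]
print(correct_long_read(noisy_long_read, short_reads))
```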
Future: Nanopore Sequencing • Leading candidate is Oxford Nanopore • Concept – Detect the ionic current flowing through the pore – Each base causes a detectable change in the current – Results in direct sequencing – Theoretically could be used to sequence RNA and protein too • Advantages – Long read length – Plug and play – Easily scalable • Disadvantages – No hard data yet – No specific release date • Image credit: John MacNeill/TechnologyReview
Future: Direct sequencing • Concept stage techniques – Significant technical hurdles to overcome – Mostly proof of concept experiments • IBM DNA Transistor – Bases read as single stranded DNA passes through the transistor – Gold bands represent metal, gray bands are the dielectric – Image credit: IBM • Atomic force microscopy sequencing – Use AFM tip to detect each base of single stranded DNA – Image credit: Lee et al US PAT 20040124084
Sequencing Applications • Old techniques which used to take days or years to perform can now be completed in hours • Next generation sequencing has opened a new door for addressing very complicated genetic questions – Has huge potential to revolutionize human healthcare – Survey complex tumor types – Research into macro and micro community genomics – Reveal evolutionary history
De novo Sequencing • Human genome took 10 years to complete and cost $3 billion – Done by laboriously cloning overlapping segments of the human genome into BAC (bacterial artificial chromosome) libraries and Sanger sequencing each one – Genome assembled using computers to line up overlapping sequences • Current estimate is around $4000 – Can be completed in a week – Companies like Complete Genomics say they have already sequenced thousands of human genomes • Future – Long read sequencers will make agricultural sequencing more viable – Whole genome sequencing for human diagnostics will become routine – Increasing the catalog of organismal genomes will improve our understanding of evolution and development
Genome Mutation Analysis • Previously done by completing complicated and time consuming familial linkage studies and targeted Sanger sequencing • Next generation sequencing can look at every gene at once – Can produce a genetic map of the complete genome – Used to detect genetic polymorphisms – See every possible mutation • Future – Whole genome sequence analysis – Targeted genome sequencing analysis using predetermined sequence selection arrays (ex: Exome Enrichment)
Pharmacogenetics • Very hot topic in the biotech and insurance industries • Use genetic typing to guess how a person might respond to different drug treatments • Currently relies on microarrays • NGS could provide significantly more information at more loci – Microarrays only look at a handful of polymorphisms – Current NGS approaches port the microarray technique to enrich pools for sequencing • Future – As the catalog of human genomes increases, it will be easier to calculate responses to treatment before drugs are administered Gauthier et al 2007 Cancer Cell
Epigenetics • Defined as heritable genetic information that is not coded in the DNA bases – DNA methylation – Histone modifications • Previous mechanisms for detecting these Chromatin or DNA modifications relied on targeted probing – ChIP-PCR – Bisulfite sequencing – Footprinting assays • Next generation sequencing changed everything – Whole genome methylation mapping (MAP-IT) – Whole genome histone modification and protein binding mapping (ChIP-Seq - acetylation, methylation, etc) • ENCODE project
ENCyclOpedia of Dna Elements (ENCODE) • International project – Follow up to the human genome project • Only 1-2% of the human genome codes for protein – Creating and maintaining DNA is biochemically expensive – What’s the other 98-99% of the genome doing? • ENCODE goals – Determine the functional elements of the human genome – Protein Coding – Non-Coding RNA – mRNA Expression – Regulatory protein binding sites – Histone modifications • Preliminary estimates show that 80% of human DNA is functional!
Transcriptome/Expression Analysis • Gene expression analysis is important for disease discovery and cancer diagnosis • Expression analysis first relied on Northern blotting followed by DNA microarrays – Both cases require a probe – Need to “know” what you are looking for – Low resolution screening • Next generation approaches screen the entire transcriptome (RNA-Seq) – Single base resolution of expression – Can see level of expression and also visualize mutations in expressed sequences • Future – Important for diagnosing/treating cancer and heritable diseases
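To give a feel for how "level of expression" is read off RNA-Seq data, here is a minimal sketch of one common normalization, transcripts per million (TPM). The slide does not name a specific measure, so TPM is an assumption chosen for illustration, and the gene names, read counts, and lengths are invented. The point is that raw read counts must be adjusted for gene length before genes can be compared.

```python
def tpm(read_counts, gene_lengths_bp):
    """Transcripts per million: length-normalized counts rescaled to sum to 1e6."""
    # Reads per kilobase of gene: longer genes collect more reads at the same expression.
    rates = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000.0)
        for gene in read_counts
    }
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads assigned to any gene")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Invented counts: geneA and geneB have the same number of reads,
# but geneB is twice as long, so it is expressed at half the level.
counts = {"geneA": 100, "geneB": 100, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1500}
expression = tpm(counts, lengths)
print(expression)
```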
Phenotypic Correlation • NGS data generates huge datasets with 85-99.9% base accuracy – Must determine which signals are real, and which are noise/errors – Most promising hits are validated by other assays (Sanger, qRT, Mass Spec) – How do we determine which hits to validate? • Currently have very small datasets, even in pharmacogenetics, that have limited utility • Validated hits can be distractions – Tumor diversity presents multiple escape routes during targeted treatment – See NYTimes Series on whole genome Sequencing: http://nyti.ms/No4fgd • Future – Require large validated datasets that are ethnically and geographically diverse
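A quick calculation shows why the 85-99.9% base accuracy range matters at genome scale. The sketch below assumes a single pass over roughly 3 billion bases (the genome size quoted earlier in the lecture) and independent errors. Real experiments read each position many times, which reduces the effective error rate, so treat these numbers as a rough upper bound on raw errors rather than as final miscalls. The function names are hypothetical.

```python
import math

GENOME_BASES = 3_000_000_000  # approximate size of the human genome


def expected_errors(n_bases, accuracy):
    """Expected number of wrong base calls at a given per-base accuracy."""
    return n_bases * (1.0 - accuracy)


def phred_quality(accuracy):
    """Phred-scaled quality: Q = -10 * log10(error rate)."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error_rate)


for accuracy in (0.85, 0.999):
    print(accuracy,
          round(expected_errors(GENOME_BASES, accuracy)),
          round(phred_quality(accuracy), 1))
```

Even at 99.9% accuracy, a single pass over the genome would contain on the order of three million erroneous base calls, which is why deciding which hits are real and which to validate is such a central question.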
Metagenomics • Used to survey macro and micro environments – Microbial communities (Soil/Gut) – Tumors – Plant communities – Coral reef ecosystems • Previous techniques coupled mtDNA or ribosomal Sanger sequencing with BLAST analysis – Limited by number of sequenced species – Can determine who, but not what is going on • NGS approaches now being used to determine exactly what organisms are present and how they interact – Can get expression data and link it back to community groups – Survey community diversity
Data • Absolutely the largest roadblock for next generation sequencing • Terabytes of data are useless if we can’t efficiently analyze the data • How long should data be kept? – Depends on application • Human Diagnostic sequencing? • Research sequencing? • Where should data be kept and processed? – Local or Cloud (Amazon, etc)? – Cost of infrastructure vs cost of cloud service – Security issues • Future – Cloud based solutions will become more attractive