A Genome Sequence Analysis System Built With Hypertable

This presentation was given by Doug Judd at the NoSQL Now! 2011 conference in San Jose.

Published in: Technology
  • Improvements in the rate of DNA sequencing over the past 30 years and into the future

    1. A Genome Sequence Analysis System Built with Hypertable
         Doug Judd, CEO, Hypertable, Inc.
    2. Application Development Team
         • UCSF-Abbott Viral Diagnostics and Discovery Center
             • Director: Dr. Charles Chiu, M.D./Ph.D.
             • http://vddc.ucsf.edu/
         • Helices Inc.
             • Taylor Sittler, M.D.
             • John Dennis
             • Brad Miller, M.D.
             • http://helic.es/
    3. What is Hypertable?
         • Modeled after Google’s Bigtable
         • Open Source (GPL v2)
         • Horizontally Scalable
         • High Performance Implementation (C++)
         • Thrift Interface for all popular languages (Java, PHP, Ruby, Python, Perl, etc.)
         • NoSQL
             • No joins (not yet)
             • No transactions (not yet)
         • Project Started in March 2007
    4. Hypertable Deployments
    5. Why NoSQL?
    6. Source: Nature 458, 719-724 (2009)
    7. Source: wired.com, February 2011
    8. Genomics 101
    9. Base Pair (aka “base”)
         • Two nucleotides on opposite complementary DNA or RNA strands, connected via hydrogen bonds
         • Double-stranded DNA/RNA is made up of base pairs
         • Adenine (A) pairs with thymine (T)
         • Guanine (G) pairs with cytosine (C)
         • Base-paired DNA sequence:
             ATCGATTGAGCTCTAGCG
             TAGCTAACTCGAGATCGC
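The pairing rules on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the original deck; the names `PAIRS` and `complement` are illustrative. It derives the second strand of the slide's example from the first.

```python
# Watson-Crick pairing for DNA as listed on the slide: A-T and G-C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand, in the same order."""
    return "".join(PAIRS[base] for base in strand)


top = "ATCGATTGAGCTCTAGCG"
bottom = complement(top)
print(bottom)  # TAGCTAACTCGAGATCGC
```

Applying `complement` twice returns the original strand, which is a quick sanity check on the mapping.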
    10. Gene
         • Encodes information on how to make a protein
         • DNA or RNA sequence
         • Thousands to millions of base pairs long
         • Corresponds to various biological traits
         • Human genome contains about 23,000 genes
    11. Biological Samples
         • Specimen taken from a human or animal
             • Nasal swabs
             • Blood serum
             • Diarrheal
             • Cerebrospinal fluid
         • Sent to a sequencing company to be processed into DNA sequence information in digital format
         • Each sample generates anywhere from 1M to 100M “reads”
         • A read is a short DNA sequence snippet of approximately 100 bases
    12. Example Reads File
         GTGGATAGGGGGAGACTAATGTAGTATGATTATCATCATCAACAGAAGCTATGACACCAGGATAAA
         CATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATT
         GGGAGGATACTAGGGATTACTGAAGCCAACTTTGCAGACTCATACATTTGACTAGACACAGCC
         ACATTACAGTTTTCTGAGGAAAATTCTTAAGATGTTACCCCAAAACATAGCATTTTAAATTAAAAC
         GGACCGGCTGAAGCCATGGCAGAAGAACATAAATTGTGAAGATTTCATGGGCATTTATTAGTT
         GGAAGTGATAAGTGTCCATGAAATCTTCACAATTTATGTTCAGAGATTGCAGTAAAGACAGGTGTA
         AAGACACAGCAAAGCTAAGAGGACCCAACACACGGTAGGGTCGGGGACCTTGGAGAAACATGG
         TGGCTTCTTCCTACATGCTTGTGATAGATGACCAAAAAACATTTGTTGAGTTGATGAATAGTACAA
         AAAAGGGGCGGATAATAAATGAAAAGGGAATGTGCTGTTATTTCCTACTAAGATCAGAAAGAG
         ATATAAACAAAAGCTGTCATCACTTAGGGACTTCAGCCACATAAAACAATGTCAGGCTAGTCACTT
         AGAGCTTTGGGACTAGTTGAGTGGCAGCTTAACAAAGCAACGCAATATCCATAGGGATTGGGG
         ATATTTACATCTAGTGGATTCTACCAGTATGGTGGTCTTATGTGGACTGCACGTGGTTTTCTAGTA
         AGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTAT
         GTTGCCAAAGGAGATTAACATTTGAGTCAGTGGGCGGGGTAAGGCCGACCTACCCTTAATCTGGTG
         GAGAAAGAAGCTGCTAATGGAGTTTAAAAGGTTACTGTCATTAATGAAAAATAAATTTACAGC
         CAGACATTTATGAACAGAAATGGGAAAAACACACTAGGAAAGCACTGCAAAGACTAATCTGTCTTT
         AAAGGAGATAGAGTGACTCCAGGCCCCTTAGAAATGACTATACCTGGCAGAGCATGCCAACTG
         ATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAA
         GGAAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGTCATACTGCCCA
         GCGGGGTTTTTTTTTGTTTCATATTAACTTTAAAGTAGTTTTTTTCCATTTTGTGAAGAAAGACAT
         AAAGAACCAAGGCTAATAGTTGTTTGAGTTGTACTTACCATGTTGTTAAATGTCACCTCACAC
         CGCTGCCAGCCTATCAGAGCCGGGAATTACACCGTGCTTGGAGTTCTGGCACAGATCCACAGCTAC
         AGTTCTTCATTGTAAGAAATGGATGCTAACATGTAACAAGAAAACATCTGAAGGTTAAACTCA
         AATAAATGGGTTAATAGTTTGTCTTTCGGTCTTCATACTTTCAATATAAGTGGTTTACTTAGCCGA
    13. Sequence Alignment
         • Arranging sequences of DNA or RNA to identify regions of similarity
         • Fuzzy matching algorithm
         • Alignment methods
             • BLAST - Basic Local Alignment Search Tool
             • MegaBLAST
         • Faster but less accurate alignment methods
             • SOAP - Short Oligonucleotide Analysis Package
             • BLAT - BLAST-like Alignment Tool
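To make "fuzzy matching" concrete, here is a small Python sketch of a Smith-Waterman-style local alignment score. It is not how BLAST, MegaBLAST, SOAP, or BLAT work internally (those tools use heuristics to avoid this exhaustive comparison), and the scoring values `match`, `mismatch`, and `gap` are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Either extend along the diagonal, open a gap in one
            # sequence, or restart the alignment at zero.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share the region "ACGT" but differ around it.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The score grows with the length and quality of the best shared region, so a read can be matched to a reference sequence even when the two are not identical.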
    14. Taxonomy
         • Hierarchical biological classification
         • Method to group and categorize organisms by biological type
         • Basic ranks: Kingdom, Phylum/Division, Class, Order, Family, Genus, Species
         • Downloadable from the National Center for Biotechnology Information (NCBI) website
         • Every node in the taxonomy tree is assigned a unique numeric ID
    15. GenBank
         • NIH genetic sequence database
             • 380,000 distinct organisms
             • 126,551,501,141 nucleotide bases
             • 135,440,924 sequence records
         • Most important and most influential database for research in almost all biological fields
         • Growth rate is exponential
         • Information on each sequence includes:
             • Numeric ID
             • Taxonomic information
    16. Schema Design
    17. Taxa Table
         • Schema
             CREATE TABLE Taxa (ID, Type, Children, Name);
         • Contents (row key, column, value)
             /1                ID            1
             /1                ID:fullName   /root
             /1                Type          no rank
             /1                Children      1,10239,12884,12908,28384,131567
             /1                Name          root
             /1/10239          ID            10239
             /1/10239          ID:fullName   /root/Viruses
             /1/10239          Type          superkingdom
             /1/10239          Children      12333,12429,12877,29258,35237, …
             /1/10239          Name          Viruses
             /1/10239/12333    ID            12333
             /1/10239/12333    ID:fullName   /root/Viruses/unclassified phages
             /1/10239/12333    Type          no rank
             /1/10239/12333    Children      12340,12347,12366,12371,12374, …
             /1/10239/12333    Name          unclassified phages
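The Taxa row keys shown on this slide are the path of taxonomy IDs from the root down to each node. A minimal Python sketch of how such keys could be derived from a child-to-parent map follows. The `PARENT` and `NAME` dictionaries are hypothetical in-memory stand-ins holding only the three nodes from the slide, and treating the root as its own parent is an assumption; none of this is the authors' loader code.

```python
# Hypothetical child -> parent map; the root (1) is assumed to be its own parent.
PARENT = {1: 1, 10239: 1, 12333: 10239}
NAME = {1: "root", 10239: "Viruses", 12333: "unclassified phages"}


def lineage(tax_id: int) -> list[int]:
    """IDs from the root down to tax_id."""
    path = [tax_id]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path[::-1]


def row_key(tax_id: int) -> str:
    """Row key in the style of the slide, e.g. /1/10239/12333."""
    return "".join(f"/{t}" for t in lineage(tax_id))


def full_name(tax_id: int) -> str:
    """Value of the ID:fullName cell, e.g. /root/Viruses/unclassified phages."""
    return "".join(f"/{NAME[t]}" for t in lineage(tax_id))


print(row_key(12333))    # /1/10239/12333
print(full_name(12333))  # /root/Viruses/unclassified phages
```

Because a Bigtable-style store keeps rows sorted by row key, keys built this way place every descendant of a node directly after it, so a whole subtree can be read with a single prefix scan such as `/1/10239/`.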
    18. Reads Table
         • Schema
             CREATE TABLE Reads (Sequence, Quality, GeneKey, Comments);
         • Contents (row key, column, value)
             AbCam1_100_ACAGTG,HWI...56#ACAGTG/1    Sequence                        ATCGCACCATTGAACTCCAGTC...
             AbCam1_100_ACAGTG,HWI...56#ACAGTG/1    Quality                         eeaeeeede_Ycc]dcacab...
             AbCam1_100_ACAGTG,HWI...56#ACAGTG/1    Comments:qualityFilter          11071815...
             AbCam1_100_ACAGTG,HWI...56#ACAGTG/1    Sequence                        GGCTTACGCCTGTAATCCCAGC...
             AbCam1_100_ACAGTG,HWI...56#ACAGTG/1    Quality                         gfee_cgggegggecggggegc...
             AbCam1_100_ACAGTG,HWI...56#ACAGTG/1    GeneKey:gnl|GNOMON|1320663.m    11...
             AbCam1_100_ACAGTG,HWI...17#ACAGTG/1    Sequence                        AGGATACGGAAGGCCCAAGGAG...
             AbCam1_100_ACAGTG,HWI...17#ACAGTG/1    Quality                         cdd`dffffffgffgggegf^e...
             AbCam1_100_ACAGTG,HWI...17#ACAGTG/1    GeneKey:chr10                   110718151643.1308...
             AbCam1_100_ACAGTG,HWI...80#ACAGTG/1    Sequence                        ACGGAAGAGCACACGTCTGAAC...
             AbCam1_100_ACAGTG,HWI...80#ACAGTG/1    Quality                         cbccb[^WUb]_b`_[bR_]...
             AbCam1_100_ACAGTG,HWI...80#ACAGTG/1    Comments:qualityFilter          11071815...
             AbCam1_100_ACAGTG,HWI...88#ACAGTG/1    Sequence                        GAACTCCAGTCACACAGTGATC...
             AbCam1_100_ACAGTG,HWI...88#ACAGTG/1    Quality                         eeeeeeeeeeeceeeeeaeeTQ...
             AbCam1_100_ACAGTG,HWI...88#ACAGTG/1    Comments:qualityFilter          11071815...
    19. Genes Table
         • Schema
             CREATE TABLE Genes (Sequence, TaxID, ID, ReadID);
         • Contents (row key, column, value)
             1000075    Sequence    GAATTCCATGGCAGTAAAACATCTTCCCTTC…
             1000075    TaxID       9606
             1000075    ID:name     HSLFBPS6 Human fructose-1,6-biphosphatase
             1000075    ReadID:0310.Lane8big,HWI-EAS355:8:91:1231:1315#0/1 …
             1000075    ReadID:0908.Mexus2.TATTAT,SCS:1:22:395:324#0/1_TA …
             1000075    ReadID:0916.Enceph2,SCS:6:24:1519:513#0/1
             1000075    ReadID:0916.Mexus,SCS:1:22:410:248#0/1
             1000075    ReadID:0916.MonkeyAdeno,SCS:2:17:811:769#0/1
             1000075    ReadID:0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1
             1000075    ReadID:0916.MonkeyAdeno,SCS:2:24:1207:492#0/1
             1000075    ReadID:0916.MonkeyAdeno,SCS:2:33:1138:547#0/1
             1000075    ReadID:0916.Parecho,SCS:3:4:679:1416#0/1|1
             1000075    ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
             1000075    ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
             1000075    ReadID:HIV.HIV18_Lane7.s_7_sequence.unbiased,SCS:7:30 …
    20. Monitoring Table Overview
    21. Applications
    22. Novel Virus Discovery
         • Process for discovering new viral DNA in a biological sample
         • Algorithm Overview
             • Import biological sample read data from the sequencing company into the system
             • Strip out all reads that align to known DNA sequences
             • What’s left over is novel
    23. Novel Virus Discovery Algorithm Detail
         • Import sample data into the Reads table
         • Run a MapReduce program to filter/align reads and update the Comments column of the Reads table
             • Filter out poor quality (“low entropy”) reads
             • Align to common human RNA/DNA
             • Align to virus database
             • Align to GenBank
         • All reads left in the Reads table with no Comments column are novel
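The steps above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the MapReduce job described in the talk: a plain dictionary stands in for the Reads table, Shannon entropy of the base composition is one possible reading of "low entropy", the `min_entropy` threshold is arbitrary, and each aligner is a hypothetical callable that returns True when a read matches its reference set.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy in bits of the base composition of a non-empty read."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


def annotate(reads, aligners, min_entropy=1.0):
    """Return {read_id: comment} for reads that are filtered or aligned."""
    comments = {}
    for read_id, seq in reads.items():
        if shannon_entropy(seq) < min_entropy:
            comments[read_id] = "qualityFilter"
            continue
        # Try each reference set in order; stop at the first match.
        for label, matches in aligners:
            if matches(seq):
                comments[read_id] = label
                break
    return comments


def novel_reads(reads, comments):
    """Reads that received no comment at all are the candidates for novelty."""
    return [read_id for read_id in reads if read_id not in comments]


reads = {"r1": "AAAAAAAAAA", "r2": "ACGTACGTAC", "r3": "GATTACAGAT"}
# Stand-in aligners for illustration; real ones would call an alignment tool.
aligners = [("human", lambda s: s.startswith("ACGT")), ("virus", lambda s: False)]
comments = annotate(reads, aligners)
print(novel_reads(reads, comments))  # ['r3']
```

Ordering the aligners from most to least common source mirrors the slide's sequence (human, then virus database, then GenBank), so most reads are discarded by the earliest stage.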
    24. Pathogen Discovery in Cancer Samples
         • Accomplished using the same technique as novel virus discovery
         • Matthew Meyerson's Lab @ Broad Institute
    25. Taxonomic Tree Viewer
         • Display taxonomy breakdown of a biological sample
         • For each aligned read in the sample, consult the Genes table to determine its taxonomy ID
         • Populate the HitSummary table with taxonomy IDs for all aligned reads from all samples
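A rough Python sketch of the lookup-and-count step follows. The two dictionaries are hypothetical in-memory stand-ins for the GeneKey column of the Reads table and the TaxID column of the Genes table. The layout of HitSummary is not given in the slides, so counting per (sample, taxonomy ID) is an assumption, as is taking the sample name to be the part of the read row key before the first comma.

```python
from collections import Counter


def hit_summary(read_to_gene, gene_to_taxid):
    """Count aligned reads per (sample, taxonomy ID)."""
    summary = Counter()
    for read_key, gene_key in read_to_gene.items():
        # Assumed key layout: "<sample>,<read id>".
        sample = read_key.split(",", 1)[0]
        tax_id = gene_to_taxid.get(gene_key)
        if tax_id is not None:  # skip genes with no known taxonomy ID
            summary[(sample, tax_id)] += 1
    return summary


# Made-up keys for illustration only.
read_to_gene = {"s1,r1": "g1", "s1,r2": "g1", "s2,r1": "g2", "s2,r9": "unknown"}
gene_to_taxid = {"g1": 9606, "g2": 10239}
summary = hit_summary(read_to_gene, gene_to_taxid)
print(summary[("s1", 9606)])  # 2
```

The resulting counts can then be rolled up along the Taxa row-key paths to show how many reads fall under each branch of the tree.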
    26. Depletion Array (future)
         • Align reads to the human genome
         • Determine the set of probes: the sequences of the human genome with the greatest number of alignments
         • Send probes to Agilent to produce a vial of “magnetized” DNA sequences of the probes
         • Mix the vial in with the biological sample
         • Magnetized DNA binds to human DNA, which precipitates from solution
         • Increases the viral percentage of the sample from roughly 0.01% to 0.1% up to 10%
    27. The End. Questions?
