A Genome Sequence Analysis System Built with Hypertable
         Doug Judd
     CEO, Hypertable, Inc.
Application
      Development Team
UCSF-Abbott Viral Diagnostics and
Discovery Center
  Director: Dr. Charles Chiu, M.D./Ph.D.
  http://vddc.ucsf.edu/
Helices Inc.
  Taylor Sittler, M.D.
  John Dennis
  Brad Miller, M.D.
  http://helic.es/
What is Hypertable?
Modeled after Google’s Bigtable
Open Source (GPL v2)
Horizontally Scalable
High Performance Implementation (C++)
Thrift Interface for all popular languages
(Java, PHP, Ruby, Python, Perl, etc.)
NoSQL
   No joins (not yet)
   No transactions (not yet)
Project Started in March 2007
Hypertable Deployments
Why NoSQL?
Source: Nature 458, 719-724 (2009)
Source: wired.com, February 2011
Genomics 101
Base Pair
              (aka “base”)
Two nucleotides on opposite complementary DNA or RNA
strands, connected via hydrogen bonds
Double-stranded DNA/RNA is made up of base pairs
adenine (A) pairs with thymine (T)
guanine (G) pairs with cytosine (C)
Base-paired DNA sequence:
ATCGATTGAGCTCTAGCG
TAGCTAACTCGAGATCGC
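The pairing rule above can be checked mechanically. A minimal sketch (the function name `complement` is chosen here for illustration) that derives the second strand of the example from the first:

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-paired partner strand, position by position."""
    return "".join(PAIRS[base] for base in strand)

top = "ATCGATTGAGCTCTAGCG"
bottom = complement(top)
print(bottom)  # TAGCTAACTCGAGATCGC
```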
Gene
Encodes info on how to make a protein
DNA or RNA sequence
Thousands to millions of base pairs long
Corresponds to various different biological
traits
Human genome contains about 23,000 genes
Biological Samples
Specimen taken from human or animal
  Nasal Swabs
  Blood Serum
  Diarrheal
  Cerebrospinal fluid
Sent to a sequencing company to process into
DNA sequence information in digital format
Each sample will generate anywhere from 1M to
100M “reads”
A read is a short DNA sequence snippet of
approximately 100 bases
Example Reads File
GTGGATAGGGGGAGACTAATGTAGTATGATTATCATCATCAACAGAAGCTATGACACCAGGATAAA
CATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATT
GGGAGGATACTAGGGATTACTGAAGCCAACTTTGCAGACTCATACATTTGACTAGACACAGCC
ACATTACAGTTTTCTGAGGAAAATTCTTAAGATGTTACCCCAAAACATAGCATTTTAAATTAAAAC
GGACCGGCTGAAGCCATGGCAGAAGAACATAAATTGTGAAGATTTCATGGGCATTTATTAGTT
GGAAGTGATAAGTGTCCATGAAATCTTCACAATTTATGTTCAGAGATTGCAGTAAAGACAGGTGTA
AAGACACAGCAAAGCTAAGAGGACCCAACACACGGTAGGGTCGGGGACCTTGGAGAAACATGG
TGGCTTCTTCCTACATGCTTGTGATAGATGACCAAAAAACATTTGTTGAGTTGATGAATAGTACAA
AAAAGGGGCGGATAATAAATGAAAAGGGAATGTGCTGTTATTTCCTACTAAGATCAGAAAGAG
ATATAAACAAAAGCTGTCATCACTTAGGGACTTCAGCCACATAAAACAATGTCAGGCTAGTCACTT
AGAGCTTTGGGACTAGTTGAGTGGCAGCTTAACAAAGCAACGCAATATCCATAGGGATTGGGG
ATATTTACATCTAGTGGATTCTACCAGTATGGTGGTCTTATGTGGACTGCACGTGGTTTTCTAGTA
AGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTAT
GTTGCCAAAGGAGATTAACATTTGAGTCAGTGGGCGGGGTAAGGCCGACCTACCCTTAATCTGGTG
GAGAAAGAAGCTGCTAATGGAGTTTAAAAGGTTACTGTCATTAATGAAAAATAAATTTACAGC
CAGACATTTATGAACAGAAATGGGAAAAACACACTAGGAAAGCACTGCAAAGACTAATCTGTCTTT
AAAGGAGATAGAGTGACTCCAGGCCCCTTAGAAATGACTATACCTGGCAGAGCATGCCAACTG
ATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAA
GGAAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGTCATACTGCCCA
GCGGGGTTTTTTTTTGTTTCATATTAACTTTAAAGTAGTTTTTTTCCATTTTGTGAAGAAAGACAT
AAAGAACCAAGGCTAATAGTTGTTTGAGTTGTACTTACCATGTTGTTAAATGTCACCTCACAC
CGCTGCCAGCCTATCAGAGCCGGGAATTACACCGTGCTTGGAGTTCTGGCACAGATCCACAGCTAC
AGTTCTTCATTGTAAGAAATGGATGCTAACATGTAACAAGAAAACATCTGAAGGTTAAACTCA
AATAAATGGGTTAATAGTTTGTCTTTCGGTCTTCATACTTTCAATATAAGTGGTTTACTTAGCCGA
Sequence Alignment
Arranging the sequences of DNA or RNA to identify
regions of similarity
Fuzzy matching algorithm
Alignment methods
   BLAST - Basic Local Alignment Search Tool
   MegaBLAST
Faster but less accurate alignment methods
   SOAP - Short Oligonucleotide Analysis Package
   BLAT - BLAST-like Alignment Tool
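To make "fuzzy matching" concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. This is the classic exact dynamic-programming method, not the heuristic that BLAST, SOAP, or BLAT use, and the scoring values (match 2, mismatch -1, gap -2) are assumptions picked for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                      # start a fresh local alignment
                prev[j - 1] + step,     # align a[i-1] with b[j-1]
                prev[j] + gap,          # gap in b
                curr[j - 1] + gap,      # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

# A read that occurs exactly inside a reference scores 2 per base.
reference = "ATCGATTGAGCTCTAGCG"
read = "GATTGAGC"
score = local_alignment_score(read, reference)
```

A single mismatch lowers the score without rejecting the read, which is what separates alignment from exact substring search.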
Taxonomy
Hierarchical biological classification
Method to group and categorize organisms by
biological type
Basic Ranks
Kingdom, Phylum/Division, Class, Order, Family, Genus, Species
Downloadable from National Center for
Biotechnology Information (NCBI) website
Every node in the taxonomy tree is assigned a
unique numeric ID
GenBank
NIH genetic sequence database
  380,000 distinct organisms
  126,551,501,141 nucleotide bases
  135,440,924 sequence records
Most important and most influential database for
research in almost all biological fields
Growth rate is exponential
Information on each sequence includes:
  Numeric ID
  Taxonomic information
Schema Design
Taxa Table
Schema
CREATE TABLE Taxa (ID, Type, Children, Name);
Contents
/1     ID      1
/1     ID:fullName     /root
/1     Type    no rank
/1     Children        1,10239,12884,12908,28384,131567
/1     Name    root
/1/10239       ID      10239
/1/10239       ID:fullName     /root/Viruses
/1/10239       Type    superkingdom
/1/10239       Children        12333,12429,12877,29258,35237, …
/1/10239       Name    Viruses
/1/10239/12333 ID      12333
/1/10239/12333 ID:fullName     /root/Viruses/unclassified phages
/1/10239/12333 Type    no rank
/1/10239/12333 Children        12340,12347,12366,12371,12374, …
/1/10239/12333 Name    unclassified phages
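The row keys above appear to be the path of taxonomy IDs from the root down to each node, and ID:fullName the same path spelled with names. A minimal sketch of deriving both from parent links, using only the three nodes shown (the `NODES` dict and the helper names are illustrative, not part of the system):

```python
# Toy slice of the taxonomy: tax ID -> (parent tax ID, name).
# In the NCBI taxonomy dump the root node (1) lists itself as its parent.
NODES = {
    1: (1, "root"),
    10239: (1, "Viruses"),
    12333: (10239, "unclassified phages"),
}

def lineage(tax_id):
    """Return the tax IDs from the root down to tax_id."""
    path = [tax_id]
    while NODES[path[-1]][0] != path[-1]:   # stop at the self-parented root
        path.append(NODES[path[-1]][0])
    return path[::-1]

def row_key(tax_id):
    """Path of numeric IDs, e.g. /1/10239/12333."""
    return "".join(f"/{t}" for t in lineage(tax_id))

def full_name(tax_id):
    """Path of names, e.g. /root/Viruses/unclassified phages."""
    return "".join(f"/{NODES[t][1]}" for t in lineage(tax_id))

key = row_key(12333)
name = full_name(12333)
```

One likely benefit of this layout, given that Bigtable-style stores keep rows sorted by key: every descendant of a node shares the prefix of its key plus "/", so a subtree can be read with a single range scan.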
Reads Table
Schema
CREATE TABLE Reads (Sequence, Quality, GeneKey, Comments);

Contents
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1   Sequence ATCGCACCATTGAACTCCAGTC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1   Quality    eeaeeeede_Ycc]dcacab...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1   Comments:qualityFilter 11071815...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1   Sequence GGCTTACGCCTGTAATCCCAGC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1   Quality    gfee_cgggegggecggggegc...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1   GeneKey:gnl|GNOMON|1320663.m 11...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1   Sequence AGGATACGGAAGGCCCAAGGAG...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1   Quality    cdd`dffffffgffgggegf^e...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1   GeneKey:chr10 110718151643.1308...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1   Sequence ACGGAAGAGCACACGTCTGAAC...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1   Quality    cbccb[^WUb]_b`_[bR_]...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1   Comments:qualityFilter 11071815...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1   Sequence GAACTCCAGTCACACAGTGATC...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1   Quality    eeeeeeeeeeeceeeeeaeeTQ...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1   Comments:qualityFilter 11071815...
Genes Table
Schema
CREATE TABLE Genes (Sequence, TaxID, ID, ReadID);

Contents
1000075   Sequence GAATTCCATGGCAGTAAAACATCTTCCCTTC…
1000075   TaxID        9606
1000075   ID:name    HSLFBPS6 Human fructose-1,6-biphosphatase
1000075   ReadID:0310.Lane8big,HWI-EAS355:8:91:1231:1315#0/1 …
1000075   ReadID:0908.Mexus2.TATTAT,SCS:1:22:395:324#0/1_TA …
1000075   ReadID:0916.Enceph2,SCS:6:24:1519:513#0/1
1000075   ReadID:0916.Mexus,SCS:1:22:410:248#0/1
1000075   ReadID:0916.MonkeyAdeno,SCS:2:17:811:769#0/1
1000075   ReadID:0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1
1000075   ReadID:0916.MonkeyAdeno,SCS:2:24:1207:492#0/1
1000075   ReadID:0916.MonkeyAdeno,SCS:2:33:1138:547#0/1
1000075   ReadID:0916.Parecho,SCS:3:4:679:1416#0/1|1
1000075   ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075   ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075   ReadID:HIV.HIV18_Lane7.s_7_sequence.unbiased,SCS:7:30 …
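Each ReadID column qualifier links the gene back to a read that aligned to it. A minimal sketch of summarizing those links per sample, assuming (from the rows above, not from any documented format) that the text before the first comma names the sample:

```python
from collections import Counter

# Qualifiers copied from some of the untruncated ReadID cells above.
read_ids = [
    "0916.Enceph2,SCS:6:24:1519:513#0/1",
    "0916.Mexus,SCS:1:22:410:248#0/1",
    "0916.MonkeyAdeno,SCS:2:17:811:769#0/1",
    "0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1",
    "0916.MonkeyAdeno,SCS:2:24:1207:492#0/1",
    "0916.MonkeyAdeno,SCS:2:33:1138:547#0/1",
]

def sample_of(qualifier):
    """Assumed layout: '<sample>,<read id>'; return the sample part."""
    return qualifier.split(",", 1)[0]

# Number of reads from each sample that hit gene 1000075.
hits_per_sample = Counter(sample_of(q) for q in read_ids)
```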
Monitoring Table Overview
Applications
Novel Virus Discovery
Process for discovering new viral DNA in a
biological sample
Algorithm Overview
  Import biological sample read data from
  sequencing company into system
  Strip out all reads that align to known DNA
  sequences
  What’s left over is novel
Novel Virus Discovery: Algorithm Detail
Import sample data into Reads table
Run MapReduce program to filter/align reads
and update Comments column of Reads table
  Filter out poor-quality (“low entropy”) reads
  Align to common human RNA/DNA
  Align to virus database
  Align to GenBank
All reads left in Reads table with no Comments
column are novel
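The filtering logic can be sketched in a few lines. This is a single-process illustration, not the MapReduce program itself. It assumes "low entropy" means low Shannon entropy over the bases, uses exact substring search as a stand-in for a real aligner, and invents the 1.5-bit threshold and the "aligned:<name>" comment label (only "qualityFilter" appears in the Reads table above).

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Bits per base: 0 for a homopolymer, 2 for even A/C/G/T usage."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

def classify(read, references, min_entropy=1.5):
    """Return a comment for the first stage that claims the read, else None."""
    if shannon_entropy(read) < min_entropy:
        return "qualityFilter"
    for name, ref in references:        # ordered: human, virus, GenBank
        if read in ref:                 # stand-in for a real alignment
            return f"aligned:{name}"    # hypothetical comment label
    return None

def novel_reads(reads, references):
    """Reads that no stage commented on are the novel candidates."""
    return [r for r in reads if classify(r, references) is None]

references = [("human", "ATCGATTGAGCTCTAGCG"), ("virus", "GGACCGGCTGAAGCC")]
reads = ["AAAAAAAAAA", "GATTGAGCTC", "CGGCTGAAGC", "TTGCACGTAG"]
novel = novel_reads(reads, references)
```

Ordering the stages from cheapest and most common (quality, human) to largest (GenBank) means most reads are eliminated before the expensive alignments run.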
Pathogen Discovery in Cancer Samples
Accomplished using same technique as
novel virus discovery
Matthew Meyerson's Lab @ Broad Institute
Taxonomic Tree Viewer
Display Taxonomy breakdown of biological
sample
For each aligned read in sample, consult
Genes table to determine Taxonomy ID
Populate HitSummary table with taxonomy
IDs for all aligned reads from all samples
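A tree viewer needs a count at every level, not only at the leaf each read hit. A minimal sketch of that roll-up, assuming each aligned read has already been resolved (via Genes.TaxID) to a Taxa row key; the HitSummary layout is not shown in the slides, so the `Counter` keyed by taxonomy path is purely illustrative:

```python
from collections import Counter

def hit_summary(tax_paths):
    """Count one hit for every node on each read's taxonomy path."""
    summary = Counter()
    for path in tax_paths:
        ids = path.strip("/").split("/")
        for depth in range(1, len(ids) + 1):
            summary["/" + "/".join(ids[:depth])] += 1
    return summary

# Three aligned reads: two hit unclassified phages, one hit Viruses itself.
hits = ["/1/10239/12333", "/1/10239/12333", "/1/10239"]
summary = hit_summary(hits)
```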
Depletion Array (future)
Align reads to human genome
Determine set of probes: sequences of human
genome with the most alignments
Send probes to Agilent to produce vial of
“magnetized” DNA sequences of the probes
Mix vial in with biological sample
Magnetized DNA binds to human DNA which
precipitates from solution
Increases viral percentage of sample from
~0.01%-0.1% to 10%
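These percentages imply how thorough the depletion has to be. A back-of-the-envelope sketch, under the simplifying assumptions that depletion removes only non-viral DNA and leaves the viral DNA untouched (function names are illustrative):

```python
def required_depletion(start_fraction, target_fraction):
    """Fraction of non-viral DNA to remove so the viral share hits target."""
    keep = (start_fraction * (1 - target_fraction)) / (
        target_fraction * (1 - start_fraction))
    return 1 - keep

def viral_fraction_after(start_fraction, depletion):
    """Viral share of the sample once `depletion` of the rest is removed."""
    viral = start_fraction
    other = (1 - start_fraction) * (1 - depletion)
    return viral / (viral + other)

# Starting at 0.1% viral and aiming for 10% viral.
needed_high = required_depletion(0.001, 0.10)
# Starting at 0.01% viral and aiming for 10% viral.
needed_low = required_depletion(0.0001, 0.10)
```

Under these assumptions, roughly 99.1% of the non-viral DNA must be removed when starting at 0.1% viral, and roughly 99.91% when starting at 0.01%.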
The End

Questions?
