blueplastic.com/dna.pdf




       Analyzing the human
       genome/DNA with
       Cassandra
       BY SAMEER FAROOQUI
                                linkedin.com/in/blueplastic/
       SAMEER@BLUEPLASTIC.COM
                                @blueplastic

                                http://youtu.be/ziqx2hJY8Hg
3.3 billion base pairs

          Thymine                                       Cytosine

             T A A C C C N A A C                                  . . .
             A T T G G G N T T G
          Adenine                                        Guanine
                         98.5 % of genome identical
                           (3 in 10,000 bases differ)

1 Human:                      US population:            Earth population:
3 GB (uncompressed)           900 PB (uncompressed)     2.8 exabytes (uncompressed)
900 MB (compressed gz)                                     1 Exabyte = 1 million TB
Humans: 46 Chromosomes


                                                      3.3 billion bp
242 MB
                 Mom   Dad
                                   178 MB
                                                       4 billion bp



                                                          XY
                                                          XX

                                                   Y chromosome:
                                                   58 million base pairs
                                                   (2% of total DNA)

                                  59 MB            X chromosome:
         61 MB                            154 MB   155 million base pairs
                                                   (5% of total DNA)
Why Chromosomes ??



Species               Chromosomes   Genome size
Adder's Tongue Fern   1,200
Fruit Fly             8             165 million bp
Garden Snail          54            2 billion bp
Gorilla               48            3.4 billion bp
Elephant              56            5.8 billion bp
Onion                 16            ~18 billion bp (highly repetitive)
Human Genome Project           vs   1000 Genomes Project

Human Genome Project
- ~15 year project: 1989 – 2003
- Sequenced 99% of the genome (400 gaps)
- >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
- Cost about $3 billion

1000 Genomes Project
- Launched Jan 2008
- Oct 2012: 1092 human genomes complete from 14 populations
- Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans
- Work done by 111 global institutions
- Cost about $40 million ($16,000 per person)
- Download @ http://www.1000genomes.org/
- In 2010: 179 human genomes

                          - Discussed DNA from 2 families of:
                                Mother / Father / Child

                          - One of biology’s most cited papers in 2011




Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/
Download at : http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes




                                                       - Feb 2009 assembly of one human
                                                         genome (hg19)


                                                       - One gzip FASTA file per
                                                         chromosome




      1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
      2) gunzip <file>.fa.gz
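Once a chromosome file is downloaded, it can be read without unpacking it first. The following is a minimal sketch using only the Python standard library; it is not the tooling used in the talk, and the names `read_fasta` and `base_counts` are illustrative assumptions.

```python
import gzip
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, gzip-compressed or plain."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one first.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def base_counts(sequence):
    """Count bases case-insensitively (lowercase marks repeat-masked regions)."""
    return Counter(sequence.upper())
```

For example, `for name, seq in read_fasta("chr21.fa.gz"): print(name, len(seq), base_counts(seq))` would print one summary line per record, assuming `chr21.fa.gz` is one of the files fetched by the rsync step. Note that the whole sequence of a record is held in memory, which may be acceptable for a single chromosome but is a design shortcut.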
Exploring DNA from your browser…                Chromosome 2
                                                Gene: MCM6
           A:T - Can digest milk                SNP: rs4988235
           G:C - Lactose intolerance            Position: 136,608,646 bp from pter


                              Click here




                http://useast.ensembl.org/Homo_sapiens/Location/Genome
T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G

              Chromosome #1 :            250 million base pairs (across both C-pairs)
                                                                   (8% of total DNA)


                                     centromere
  pter            P (short arm)                                       Q (long arm)           qter
                                       0 0                                                       43
36

                                             1q12                                       1q42.2

 1p36.32             1p31.1                                                                1q43


                                  4,316 known genes
Compound Keys
                                                  (22 pairs + X + Y)
Partition key : remaining keys                  24 Column Families
  humanID:cell_type:parent

                                      Chrom-1   Chrom-2            Chrom-3   Chrom-Y
   humanID       cell_type   parent
   595-36-0000   normal      mother
   595-36-0000   normal      father
   595-36-0000   cancer      mother
   595-36-0000   cancer      father
   595-36-1111   normal      mother
   595-36-1111   normal      father
XY

        XX
                                         Chrom-1             Chrom-2                Chrom-X        Chrom-Y
humanID       cell     parent       1     2
595-36-0000   normal   mother
595-36-0000   normal   father
595-36-1111   normal   mother
595-36-1111   normal   father



                                        Chrom-1 Column Family on disk

     595-36-0000                [normal,           [normal, mother,    [normal, father,   [normal, father,
                                mother, 1]: TAG    2]: GCC             1]: TAG            2]: GCC
     595-36-1111                [normal,           [normal, mother,    [normal, father,   [normal, father,
                                mother, 1]: TAG    2]: GCC             1]: TAG            2]: GCC
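The on-disk picture above can be imitated with plain Python data structures. This is only an illustrative model of one column family, assuming a dict keyed by humanID whose cells are kept sorted by the composite column name (cell_type, parent, position). The class `ColumnFamily` and its methods are hypothetical and are not part of any Cassandra client.

```python
import bisect
from collections import defaultdict


class ColumnFamily:
    """Toy model of one column family: partition key -> cells sorted by column name."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(list)  # humanID -> sorted list of (column_name, value)

    def insert(self, human_id, cell_type, parent, position, codon):
        column_name = (cell_type, parent, position)
        # Keep cells ordered by composite name, whatever the insertion order.
        bisect.insort(self.rows[human_id], (column_name, codon))

    def slice(self, human_id, cell_type, parent):
        """Return codons for one (cell_type, parent) prefix, in position order."""
        return [value for (ct, p, _), value in self.rows[human_id]
                if (ct, p) == (cell_type, parent)]


chrom1 = ColumnFamily("Chrom-1")
for parent in ("mother", "father"):
    chrom1.insert("595-36-0000", "normal", parent, 2, "GCC")
    chrom1.insert("595-36-0000", "normal", parent, 1, "TAG")

print(chrom1.slice("595-36-0000", "normal", "mother"))  # -> ['TAG', 'GCC']
```

Because Python tuples sort lexicographically, this toy puts "father" cells before "mother" cells, unlike the drawing; the point is only that all cells for one humanID sit together in one sorted row.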
Partition based on humanID w/ Murmur3Partitioner
     B

A          C
                                                     Chrom-1               Chrom-Y
     D            humanID       cell_type   parent
                  595-36-0000   normal      mother
                  595-36-0000   normal      father
Send to range A
                  595-36-0000   cancer      mother
                  595-36-0000   cancer      father
                  595-36-1111   normal      mother
Send to range D
                  595-36-1111   normal      father



                  Now it’s possible to do row range scans down the same humanID…
                               … and get all the DNA for human #1000
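The routing idea can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's implementation: Murmur3 is not in the standard library, so MD5 is used here as a stand-in hash, and the four equal ranges A to D on a 32-bit ring are assumptions made to mirror the drawing.

```python
import bisect
import hashlib

# Hypothetical 4-range ring; each entry is the last token owned by that range.
RING_BITS = 32
RANGE_NAMES = ["A", "B", "C", "D"]
RANGE_ENDS = [(i + 1) * (2 ** RING_BITS // 4) - 1 for i in range(4)]


def token_for(human_id):
    """Hash the partition key to a token. MD5 is a stand-in here, NOT Murmur3."""
    digest = hashlib.md5(human_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def range_for(human_id):
    """Find the ring range whose token interval contains this key's token."""
    return RANGE_NAMES[bisect.bisect_left(RANGE_ENDS, token_for(human_id))]


for human_id in ("595-36-0000", "595-36-1111"):
    print(human_id, "-> range", range_for(human_id))
```

Only the humanID is hashed, so every (cell_type, parent) row for the same person lands in the same range, which is what makes the per-person scan possible. The ranges printed by this stand-in hash will not necessarily match the A and D shown on the slide.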
Chromosome #1: 4,316 genes

                                        Breast Cancer                          11 m: Alzheimers
                                   DIRAS3 protein                                  Presenilin 2


                                          1q31                                     1q31-q42
                                 68,511,644 - 68,516,459                   227,058,272 – 227,083,803
Neuroblastoma Cancer
(deletion of: 1p36.1 – 1p36.3)    DIRAS3 (4,800 bp)                                  PSEN2    (25,000 bp)




                                                         1q12                                   1q42.2
1p36.32                          1p31.1                                                                1q43
  ABCA4     AGL        AMPD1     BSND        CDC73      CHRNB2   COL8A2     CPT2       DBT        EDARADD
  ACADM     ALDH4A1    ASPM      CACNA1S     CFH        CLCNKA   COL9A2     CRB1       DIRAS3     EGLN1
  ACTA1     ALPL       ATP1A2    CASQ2       CFHR5      CLCNKB   COL11A1    DARS2      DPYD       EIF2B3
Conditions related to genes on Chromosome 1

         - Alzheimer disease

         - Neuroblastoma

         - breast cancer

         - color vision deficiency

         - early-onset glaucoma

         - Emery-Dreifuss muscular dystrophy

         - Parkinson disease
Conditions related to genes on Chromosome 1
                                                        familial cold autoinflammatory syndrome
                                                        familial erythrocytosis                                              leukoencephalopathy with vanishing white matter
                                                        familial hemiplegic migraine                                         limb-girdle muscular dystrophy
actin-accumulation myopathy                             familial hypertrophic cardiomyopathy                                 malignant hyperthermia
adenosine monophosphate deaminase deficiency            familial hypobetalipoproteinemia                                     maple syrup urine disease
age-related macular degeneration                        familial isolated hyperparathyroidism                                medium-chain acyl-CoA dehydrogenase deficiency
Alagille syndrome                                       familial restrictive cardiomyopathy                                  Muckle-Wells syndrome
Alzheimer disease                                       Fuchs endothelial dystrophy                                          multiminicore disease
amyotrophic lateral sclerosis                           fucosidosis                                                          multiple epiphyseal dysplasia
anencephaly                                             fumarase deficiency                                                  nemaline myopathy
ankylosing spondylitis                                  galactosemia                                                         neonatal onset multisystem inflammatory disease
arrhythmogenic right ventricular cardiomyopathy         gastrointestinal stromal tumor                                       neuroblastoma
atypical hemolytic-uremic syndrome                      Gaucher disease                                                      nonsyndromic deafness
auriculo-condylar syndrome                              Gitelman syndrome                                                    nonsyndromic paraganglioma
autosomal dominant nocturnal frontal lobe epilepsy      GLUT1 deficiency syndrome                                            Noonan syndrome
autosomal recessive primary microcephaly                glycogen storage disease type III                                    osteogenesis imperfecta
Bartter syndrome                                        Greenberg dysplasia                                                  Parkinson disease
3-beta-hydroxysteroid dehydrogenase deficiency          hemochromatosis                                                      popliteal pterygium syndrome
breast cancer                                           hereditary antithrombin deficiency                                   porphyria
cap myopathy                                            hereditary leiomyomatosis and renal cell cancer                      primary myelofibrosis
carnitine palmitoyltransferase II deficiency            hereditary paraganglioma-pheochromocytoma                            psoriatic arthritis
catecholaminergic polymorphic ventricular tachycardia   hereditary sensory and autonomic neuropathy type V                   pyruvate kinase deficiency
Charcot-Marie-Tooth disease                             homocystinuria                                                       REN-related kidney disease
Chediak-Higashi syndrome                                Hutchinson-Gilford progeria syndrome                                 retinitis pigmentosa
chronic granulomatous disease                           3-hydroxy-3-methylglutaryl-CoA lyase deficiency                      rhizomelic chondrodysplasia punctata
color vision deficiency                                 hypercholesterolemia                                                 severe congenital neutropenia
congenital fiber-type disproportion                     hypermanganesemia with dystonia, polycythemia, and cirrhosis         Shprintzen-Goldberg syndrome
congenital hypothyroidism                               hyperparathyroidism-jaw tumor syndrome                               spina bifida
congenital insensitivity to pain with anhidrosis        hyperprolinemia                                                      Stargardt macular degeneration
Cowden syndrome                                         hypohidrotic ectodermal dysplasia                                    Stickler syndrome
Crohn disease                                           hypokalemic periodic paralysis                                       systemic scleroderma
dense deposit disease                                   hypophosphatasia                                                     thiamine-responsive megaloblastic anemia syndrome
Diamond-Blackfan anemia                                 idiopathic inflammatory myopathy                                     thrombocytopenia-absent radius syndrome
dihydropyrimidine dehydrogenase deficiency              intranuclear rod myopathy                                            trimethylaminuria
early-onset glaucoma                                    junctional epidermolysis bullosa                                     Usher syndrome
Ehlers-Danlos syndrome                                  juvenile idiopathic arthritis                                        van der Woude syndrome
Emery-Dreifuss muscular dystrophy                       Kufs disease                                                         vitiligo
essential thrombocythemia                               Leber congenital amaurosis                                           Vohwinkel syndrome
factor V Leiden thrombophilia                           leukoencephalopathy with brainstem and spinal cord involvement and   WNT4 Müllerian aplasia and ovarian dysfunction
familial adenomatous polyposis                          lactate elevation
What read queries do we want to perform?


   Write once, read many times type of database


1) Give me the PSEN2 gene for 2,000 people w/ Alzheimer's
           25,000 sequential bp



2) Give me all of the humans who have the lactose intolerance SNP on CR-2
Translation: DNA       -> Proteins

   codon




DNA
            T A A       C C C      T A A         C C C          T A A       A C T
            A T T       G G G      A T T         G G G          A T T       T G A


Amino       Isoleucine  Glycine    Isoleucine    Glycine        Isoleucine  STOP
Acids          I          G           I             G              I

                                          Protein   (20 different types)
Translation: ATT   -> Isoleucine
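The codon-to-amino-acid lookup can be sketched as follows. This is a minimal example built on the standard genetic code, written with the standard library only; the function name `translate` is an assumption, and the input is the bottom strand shown on the slide.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string codon by codon, stopping after the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


print(translate("ATTGGGATTGGGATTTGA"))  # bottom strand from the slide -> IGIGI*
```

Here I is isoleucine, G is glycine and `*` is the stop signal. Codons containing N (unknown bases) are not in the table and would raise a `KeyError` in this sketch.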
125 million bp
                   = 41 m cols          (short arm)   P           centromere         Q   (long arm)
       3                           36                                0 0                                43


                                                                   Chrom-1                                   1q43.7492932
humanID       cell_type   parent    1p36 1p35         ... 1p1      1p0   1q0   1q1       ...   1q43.7
595-36-0000   normal      mother    TAG    GCC        CAG   CAG          TCA   CTG       NNN      GAT
595-36-0000   normal      father    TAG    GCC        CAG   CAG          TAA   CTG       NNN      GAT
595-36-0000   cancer      mother    TAG    GCC        CAG   CAG          TCA   CTG       NNN      GAT
595-36-0000   cancer      father    TAG    GCC        CAG   CAG                CTG       NNN      GAT
595-36-1111   normal      mother    TAG    GCC        CAG   CAG    TCC   TCA   CTG       NNN      GAT
595-36-1111   normal      father    TAG    GCC        CAG   CAG          TCA   CTG       NNN      GAT            Chrom-21
595-36-1111   normal      3rd


           (SNP) Point Mutation     TAG GCC CAG CAG TCA CTG                    TAG GCC CAG CAG TAA CTG
              Deletion Mutation     TAG GCC CAG CAG TCA CTG                    TAG GCC CAG CAG ___ CTG
              Insertion Mutation    TAG GCC CAG CAG TCA CTG                    TAG GCC CAG CAG TCC TCA CTG
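The three mutation types above can be told apart by aligning a sample's codons against a reference. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a real alignment tool; it is not the method used in the talk, and the function name `classify_mutations` is an assumption.

```python
from difflib import SequenceMatcher


def classify_mutations(reference, sample):
    """Compare two codon lists and report (kind, reference_codons, sample_codons)."""
    kinds = {"replace": "point", "delete": "deletion", "insert": "insertion"}
    matcher = SequenceMatcher(a=reference, b=sample, autojunk=False)
    return [(kinds[tag], reference[i1:i2], sample[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


reference = "TAG GCC CAG CAG TCA CTG".split()
print(classify_mutations(reference, "TAG GCC CAG CAG TAA CTG".split()))
print(classify_mutations(reference, "TAG GCC CAG CAG CTG".split()))
print(classify_mutations(reference, "TAG GCC CAG CAG TCC TCA CTG".split()))
```

The three calls report a point change of TCA to TAA, a deletion of TCA, and an insertion of TCC. A "replace" spanning several codons would also be labelled "point" by this sketch, so the label should be read loosely.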
4x reduction in total data size + 35% faster reads
                                                                         To detect SNPs
                  Excellent candidate for compression!
                                                                      Create Secondary Index



                                                             Chrom-1
humanID       cell_type   parent   1p36 1p35    ... 1p1      1p0    1q0   1q1      ...   1q43.7
595-36-0000   normal      mother    TAG   GCC   CAG   CAG           TCA    CTG     NNN     GAT
595-36-0000   normal      father    TAG   GCC   CAG   CAG           TAA    CTG     NNN     GAT
595-36-0000   cancer      mother    TAG   GCC   CAG   CAG           TCA    CTG     NNN     GAT
595-36-0000   cancer      father    TAG   GCC   CAG   CAG                  CTG     NNN     GAT
595-36-1111   normal      mother    TAG   GCC   CAG   CAG    TCC    TCA    CTG     NNN     GAT
595-36-1111   normal      father    TAG   GCC   CAG   CAG           TCA    CTG     NNN     GAT    Chrom-21
595-36-1111   normal      3rd
                                          To get all of the people with the SNP:
                                     cqlsh:dna_table> SELECT humanID FROM "Chrom-1"
                                                      WHERE "1q0" = 'TAA';
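Conceptually, a secondary index on a column is an inverted map from column value to the rows that hold it. The Python sketch below illustrates only that lookup idea with the 1q0 values from the table; it is not how Cassandra stores its indexes, and `build_index` is a hypothetical helper.

```python
from collections import defaultdict

# Rows from the slide, reduced to the 1q0 column: (humanID, cell_type, parent) -> codon.
# The cancer/father row is omitted because its 1q0 cell is empty on the slide.
rows = {
    ("595-36-0000", "normal", "mother"): "TCA",
    ("595-36-0000", "normal", "father"): "TAA",
    ("595-36-0000", "cancer", "mother"): "TCA",
    ("595-36-1111", "normal", "mother"): "TCA",
    ("595-36-1111", "normal", "father"): "TCA",
}


def build_index(rows):
    """Invert the column: codon value -> set of humanIDs that carry it."""
    index = defaultdict(set)
    for (human_id, _cell_type, _parent), codon in rows.items():
        index[codon].add(human_id)
    return index


index_1q0 = build_index(rows)
print(sorted(index_1q0["TAA"]))  # -> ['595-36-0000']
```

With the index in place, finding everyone who carries a given codon is a single dictionary lookup instead of a scan over every row.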
Query: Give me the X gene for 2 people

                                                                        X

                                                            Chrom-1
humanID       cell_type   parent    1p36 1p35   ... 1p1     1p0   1q0       1q1   ...   1q43.7
595-36-0000   normal      mother    TAG   GCC   CAG   CAG         TCA       CTG   NNN     GAT
595-36-0000   normal      father    TAG   GCC   CAG   CAG         TAA       CTG   NNN     GAT
595-36-0000   cancer      mother    TAG   GCC   CAG   CAG         TCA       CTG   NNN     GAT
595-36-0000   cancer      father    TAG   GCC   CAG   CAG                   CTG   NNN     GAT
595-36-1111   normal      mother    TAG   GCC   CAG   CAG   TCC   TCA       CTG   NNN     GAT
595-36-1111   normal      father    TAG   GCC   CAG   CAG         TCA       CTG   NNN     GAT    Chrom-21
595-36-1111   normal      3rd

                                cqlsh:dna_table> SELECT "1q0", "1q1" FROM "Chrom-1"
                                                 WHERE humanID IN ('595-36-0000', '595-36-1111');
Storing the total USA population Genome in Cassandra
                            (314 million people)
                                                                   9 million columns

                                  41 million columns



                                        Chrom-1          Chrom-2      Chrom-3           Chrom-Y
                      P-key
 3 billion cols
                      SS:CT:M       125 MB of data
    1.5 GB
                      SS:CT:F       125 MB of data
                      SS:CT:M
                      SS:CT:F
 900 PB   =X          SS:CT:M
                      SS:CT:F
     46,000
     nodes
                                                                1000 Genomes Project
 (20 TB each)     630 million rows (2 for each person)    Oct 2012: 1092 genomes sequenced
No Replication
                                                                    3.2 TB data total
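The sizing figures on this slide can be reproduced with simple arithmetic. The sketch below takes the population, per-genome size and per-node capacity from the slides; the unit conventions (decimal for the total, 1 PB = 1024 TB for the node count) are assumptions chosen because they appear to match the rounded numbers shown.

```python
PEOPLE = 314_000_000          # US population figure from the slide
GB_PER_GENOME = 3             # uncompressed size of one genome, from the slide
NODE_CAPACITY_TB = 20         # per-node capacity from the slide
TOTAL_PB_ON_SLIDE = 900       # rounded total quoted on the slide

rows = PEOPLE * 2                                     # one row per parent
exact_total_pb = PEOPLE * GB_PER_GENOME / 1_000_000   # decimal units: 1 PB = 10**6 GB
nodes = TOTAL_PB_ON_SLIDE * 1024 // NODE_CAPACITY_TB  # assumes 1 PB = 1024 TB

print(f"rows:  {rows:,}")
print(f"total: {exact_total_pb:,.0f} PB before rounding")
print(f"nodes: {nodes:,} with no replication")
```

This gives 628,000,000 rows, about 942 PB and 46,080 nodes, in line with the slide's rounded 630 million rows, 900 PB and 46,000 nodes. Any replication factor would multiply the node count accordingly.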
Cost per Human Genome sequence
[Chart: cost per genome by year, 2001 to 2012. Linear scale, $0 to $120,000,000 in $20 million increments. Annotation: "Huh ?"]
Cost per Human Genome sequence
[Chart: cost per genome by year, 2001 to 2012, with the "Genome Sequencing" series plotted against "Moore's Law". Logarithmic scale, $100 to $100,000,000 in 10x increments. Annotations: "Jan 2008: Switched to next-gen sequencing"; "Super Logarithmic Scale!"]
Coding vs Non-coding DNA




     98% non-coding DNA




     Coding   Non-coding
8% of human DNA
                             (98,000 fragments)




T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G


   HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta
Get Ubuntu 12.10
          http://www.ubuntu.com/download
          (note CentOS/Red Hat has install issues with Biopython)




          DataStax Community Edition of Cassandra + OpsCenter
          http://www.datastax.com/download/community




          Free python tools for biological computation
          http://biopython.org




          Cassandra python client library
Pycassa   https://github.com/pycassa/pycassa
                                                                      Polychaos dubium
                                                                      620 billion bp (200x humans)
     Sameer Farooqui
     sameer@blueplastic.com

     - Freelance Big Data consultant and trainer
     - Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack

     Ex: Hortonworks, Accenture R&D, Symantec




                 - Co-author on v2 of Cassandra book
                 - Coming late 2013                              linkedin.com/in/blueplastic/

                                                                 @blueplastic

                                                                 http://youtu.be/ziqx2hJY8Hg
James Watson: How we discovered DNA                                 Juan Enriquez: The life-code that will reshape the future




http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html   http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html
Resources to get started for beginners…

NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

2008 PGSAS G-nomes
2008 PGSAS G-nomes2008 PGSAS G-nomes
2008 PGSAS G-nomesgfb1
 
Human genome project
Human genome projectHuman genome project
Human genome project
YashaswineeSahoo
 
L14 human genome
L14 human genomeL14 human genome
L14 human genomeMUBOSScz
 
Human genome project
Human genome projectHuman genome project
Human genome project
Dilip jaipal
 
Human genome project
Human genome projectHuman genome project
Human genome project
Shital Pal
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
Sucheta Tripathy
 
Topic Five: Genetics
Topic Five: GeneticsTopic Five: Genetics
Topic Five: Genetics
Bob Smullen
 
Human genome project (2) converted
Human genome project (2) convertedHuman genome project (2) converted
Human genome project (2) converted
GAnchal
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
c.titus.brown
 
Genome project.pdf
Genome project.pdfGenome project.pdf
Genome project.pdf
ManchikantiDivya
 
Project IdeaFor our project, we will be focusing on the wastewat.docx
Project IdeaFor our project, we will be focusing on the wastewat.docxProject IdeaFor our project, we will be focusing on the wastewat.docx
Project IdeaFor our project, we will be focusing on the wastewat.docx
briancrawford30935
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
kiran singh
 
Chromosomes, Crops and Superdomestication - Pat Heslop-Harrison Malaysia
Chromosomes, Crops and Superdomestication - Pat Heslop-Harrison MalaysiaChromosomes, Crops and Superdomestication - Pat Heslop-Harrison Malaysia
Chromosomes, Crops and Superdomestication - Pat Heslop-Harrison Malaysia
Pat (JS) Heslop-Harrison
 
Markers
MarkersMarkers
Markers
GaurabSircar2
 
Hgp
HgpHgp
Chapter 7 genome structure, chromatin, and the nucleosome (1)
Chapter 7   genome structure, chromatin, and the nucleosome (1)Chapter 7   genome structure, chromatin, and the nucleosome (1)
Chapter 7 genome structure, chromatin, and the nucleosome (1)Roger Mendez
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdf
NoraCRuizGuevara
 

NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

  • 1. blueplastic.com/dna.pdf Analyzing the human genome/DNA with Cassandra BY SAMEER FAROOQUI linkedin.com/in/blueplastic/ SAMEER@BLUEPLASTIC.COM @blueplastic http://youtu.be/ziqx2hJY8Hg
  • 2. T A T A G C A T
  • 3. 3.3 billion base pairs. Bases: Thymine (T), Adenine (A), Cytosine (C), Guanine (G)
       T A A C C C N A A C . . .
       A T T G G G N T T G
       - 98.5 % of genome identical (3 in 10,000 bases differ)
       - 1 Human: 3 GB (uncompressed), 900 MB (compressed gz)
       - US population: 900 PB (uncompressed)
       - Earth population: 2.8 exabytes (uncompressed)
       - 1 Exabyte = 1 million TB
  • 4. Humans: 46 Chromosomes (Mom / Dad; XY or XX)
       - Y chromosome: 58 million base pairs (2% of total DNA)
       - X chromosome: 155 million base pairs (5% of total DNA)
       - Figure labels: 3.3 billion bp, 4 billion bp; 242 MB, 178 MB, 154 MB, 61 MB, 59 MB
  • 5. Why Chromosomes ??
       - Adder's Tongue Fern: 1,200 Chromosomes
       - Fruit Fly: 8 Ch, 165 million bp
       - Garden Snail: 54 Ch, 2 billion bp
       - Gorilla: 48 Ch, 3.4 billion bp
       - Elephant: 56 Ch, 5.8 billion bp
       - Onion: 16 Ch, ~18 billion bp, highly repetitive
  • 6. Human Genome Project vs 1000 Genomes Project
       Human Genome Project:
       - ~15 year project: 1989 – 2003
       - Sequenced 99% of the genome (400 gaps)
       - >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
       - Cost about $3 billion
       1000 Genomes Project:
       - Launched Jan 2008
       - Oct 2012: 1092 human genomes complete from 14 populations
       - Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans
       - Work done by 111 global institutions
       - Cost about $40 million ($16,000 per person)
       Download @ http://www.1000genomes.org/
  • 7. - In 2010: 179 human genomes - Discussed DNA from 2 families of: Mother / Father / Child - One of biology’s most cited papers in 2011 Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/
  • 8. Download at : http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes - Feb 2009 assembly of one human genome (hg19) - One gzip FASTA file per chromosome 1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ . 2) gunzip <file>.fa.gz
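Once a chromosome file has been gunzipped, a first sanity check is to read the FASTA records and count bases. The following is a minimal sketch using only the Python standard library; the function name `read_fasta` and the tiny in-memory sample are illustrative assumptions, not part of the talk, and the upper-casing assumes the file may mix lower- and upper-case letters.

```python
import io
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_name, sequence) pairs from FASTA-formatted text lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record; emit the previous one first.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            # Normalise case so 'a' and 'A' are counted as the same base.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


# Tiny in-memory stand-in for a gunzipped chromosome file (hypothetical data).
sample = io.StringIO(">chrDemo\nNNNNtaaccc\nTAACCCtaac\n")
records = list(read_fasta(sample))
name, sequence = records[0]
counts = Counter(sequence)
print(name, len(sequence), dict(counts))
```

To try it on a real download, pass an open file handle instead of the sample, for example `read_fasta(open("chr1.fa"))`, where the file name is assumed. Note that this sketch holds a whole record in memory, which is a few hundred MB for the largest chromosomes.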
  • 9. Exploring DNA from your browser…
       - Chromosome 2, Gene: MCM6, SNP: rs4988235, Position: 136,608,646 bp from pter
       - A:T - Can digest milk
       - G:C - Lactose intolerance
       Click here: http://useast.ensembl.org/Homo_sapiens/Location/Genome
  • 10. Chromosome #1 : 250 million base pairs (across both C-pairs) (8% of total DNA), 4,316 known genes
       T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
       A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G
       - Diagram labels: pter, P (short arm), centromere, Q (long arm), qter; band numbers 36, 0, 0, 43
       - Bands shown: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43
  • 11. Compound Keys. 24 Column Families (22 pairs + X + Y): Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y
       Key layout (partition key : remaining keys): humanID:cell_type:parent
       humanID      cell_type  parent
       595-36-0000  normal     mother
       595-36-0000  normal     father
       595-36-0000  cancer     mother
       595-36-0000  cancer     father
       595-36-1111  normal     mother
       595-36-1111  normal     father
  • 12. XY / XX. Column families: Chrom-1, Chrom-2, ... Chrom-X, Chrom-Y
       Logical rows (columns: humanID, cell, parent, 1, 2):
       595-36-0000  normal  mother
       595-36-0000  normal  father
       595-36-1111  normal  mother
       595-36-1111  normal  father
       Chrom-1 Column Family on disk:
       595-36-0000: [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
       595-36-1111: [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
  • 13. Partition based on humanID w/ Murmur3Partitioner (ring with token ranges A, B, C, D; column families Chrom-1 ... Chrom-Y)
       humanID      cell_type  parent
       595-36-0000  normal     mother   }
       595-36-0000  normal     father   }  Send to range A
       595-36-0000  cancer     mother   }
       595-36-0000  cancer     father   }
       595-36-1111  normal     mother   }  Send to range D
       595-36-1111  normal     father   }
       Now it’s possible to do row range scans down the same humanID… and get all the DNA for human #1000
  • 14. Chromosome #1: 4,316 genes
       - Breast Cancer: DIRAS3 protein, 1q31, 68,511,644 - 68,516,459; DIRAS3 (4,800 bp)
       - 11 m: Alzheimers: Presenilin 2, 1q31-q42, 227,058,272 – 227,083,803; PSEN2 (25,000 bp)
       - Neuroblastoma Cancer (deletion of: 1p36.1 – 1p36.3)
       - Bands shown: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43
       - Genes listed: ABCA4, ACADM, ACTA1, AGL, ALDH4A1, ALPL, AMPD1, ASPM, ATP1A2, BSND, CACNA1S, CASQ2, CDC73, CFH, CFHR5, CHRNB2, CLCNKA, CLCNKB, COL8A2, COL9A2, COL11A1, CPT2, CRB1, DARS2, DBT, DIRAS3, DPYD, EDARADD, EGLN1, EIF2B3
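The placement rule in the compound-key slides can be sketched in a few lines: only the partition key (humanID) decides which token range a row lands in, while cell_type and parent order the rows inside that partition. This is a rough illustration, not Cassandra's implementation. MD5 is used here purely as a stand-in hash because Murmur3 is not in the Python standard library, the four equal ranges are an assumption taken from the ring drawing, and the helper names are hypothetical. The range letters it produces will therefore not match the slide.

```python
import hashlib
from collections import defaultdict

RANGES = ["A", "B", "C", "D"]  # four equal token ranges, as in the ring drawing


def token_for(human_id: str) -> int:
    """Stand-in 64-bit token. A real cluster would use Murmur3, not MD5."""
    digest = hashlib.md5(human_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def range_for(human_id: str) -> str:
    """Map a partition key to one of the equal-width token ranges."""
    index = (token_for(human_id) * len(RANGES)) >> 64
    return RANGES[index]


rows = [
    ("595-36-0000", "normal", "mother"),
    ("595-36-0000", "normal", "father"),
    ("595-36-0000", "cancer", "mother"),
    ("595-36-0000", "cancer", "father"),
    ("595-36-1111", "normal", "mother"),
    ("595-36-1111", "normal", "father"),
]

# Group rows by partition key, then sort by the remaining key parts,
# mimicking how clustering columns order data inside one partition.
partitions = defaultdict(list)
for human_id, cell_type, parent in rows:
    partitions[human_id].append((cell_type, parent))
for human_id in partitions:
    partitions[human_id].sort()

placement = {human_id: range_for(human_id) for human_id in partitions}
for human_id, token_range in placement.items():
    print(human_id, "-> range", token_range, partitions[human_id])
```

Because every row for one humanID shares a token, all of that person's rows sit together in one range, which is what makes the range scan down a single humanID cheap.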
  • 15. Conditions related to genes on Chromosome 1 - Alzheimer disease - Neuroblastoma - breast cancer - color vision deficiency - early-onset glaucoma - Emery-Dreifuss muscular dystrophy - Parkinson disease
  • 16. Conditions related to genes on Chromosome 1: actin-accumulation myopathy; adenosine monophosphate deaminase deficiency; age-related macular degeneration; Alagille syndrome; Alzheimer disease; amyotrophic lateral sclerosis; anencephaly; ankylosing spondylitis; arrhythmogenic right ventricular cardiomyopathy; atypical hemolytic-uremic syndrome; auriculo-condylar syndrome; autosomal dominant nocturnal frontal lobe epilepsy; autosomal recessive primary microcephaly; Bartter syndrome; 3-beta-hydroxysteroid dehydrogenase deficiency; breast cancer; cap myopathy; carnitine palmitoyltransferase II deficiency; catecholaminergic polymorphic ventricular tachycardia; Charcot-Marie-Tooth disease; Chediak-Higashi syndrome; chronic granulomatous disease; color vision deficiency; congenital fiber-type disproportion; congenital hypothyroidism; congenital insensitivity to pain with anhidrosis; Cowden syndrome; Crohn disease; dense deposit disease; Diamond-Blackfan anemia; dihydropyrimidine dehydrogenase deficiency; early-onset glaucoma; Ehlers-Danlos syndrome; Emery-Dreifuss muscular dystrophy; essential thrombocythemia; factor V Leiden thrombophilia; familial adenomatous polyposis; familial cold autoinflammatory syndrome; familial erythrocytosis; familial hemiplegic migraine; familial hypertrophic cardiomyopathy; familial hypobetalipoproteinemia; familial isolated hyperparathyroidism; familial restrictive cardiomyopathy; Fuchs endothelial dystrophy; fucosidosis; fumarase deficiency; galactosemia; gastrointestinal stromal tumor; Gaucher disease; Gitelman syndrome; GLUT1 deficiency syndrome; glycogen storage disease type III; Greenberg dysplasia; hemochromatosis; hereditary antithrombin deficiency; hereditary leiomyomatosis and renal cell cancer; hereditary paraganglioma-pheochromocytoma; hereditary sensory and autonomic neuropathy type V; homocystinuria; Hutchinson-Gilford progeria syndrome; 3-hydroxy-3-methylglutaryl-CoA lyase deficiency; hypercholesterolemia; hypermanganesemia with dystonia, polycythemia, and cirrhosis; hyperparathyroidism-jaw tumor syndrome; hyperprolinemia; hypohidrotic ectodermal dysplasia; hypokalemic periodic paralysis; hypophosphatasia; idiopathic inflammatory myopathy; intranuclear rod myopathy; junctional epidermolysis bullosa; juvenile idiopathic arthritis; Kufs disease; Leber congenital amaurosis; leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation; leukoencephalopathy with vanishing white matter; limb-girdle muscular dystrophy; malignant hyperthermia; maple syrup urine disease; medium-chain acyl-CoA dehydrogenase deficiency; Muckle-Wells syndrome; multiminicore disease; multiple epiphyseal dysplasia; nemaline myopathy; neonatal onset multisystem inflammatory disease; neuroblastoma; nonsyndromic deafness; nonsyndromic paraganglioma; Noonan syndrome; osteogenesis imperfecta; Parkinson disease; popliteal pterygium syndrome; porphyria; primary myelofibrosis; psoriatic arthritis; pyruvate kinase deficiency; REN-related kidney disease; retinitis pigmentosa; rhizomelic chondrodysplasia punctata; severe congenital neutropenia; Shprintzen-Goldberg syndrome; spina bifida; Stargardt macular degeneration; Stickler syndrome; systemic scleroderma; thiamine-responsive megaloblastic anemia syndrome; thrombocytopenia-absent radius syndrome; trimethylaminuria; Usher syndrome; van der Woude syndrome; vitiligo; Vohwinkel syndrome; WNT4 Müllerian aplasia and ovarian dysfunction
  • 17. What read queries do we want to perform? (Write once, read many times type of database)
       1) Give me the PSEN2 gene for 2,000 people w/ Alzheimer's (25,000 sequential bp)
       2) Give me all of the humans who have the lactose intolerance SNP on CR-2
  • 18. Translation: DNA -> Proteins
       DNA (grouped into codons):
       TAA CCC TAA CCC TAA ACT
       ATT GGG ATT GGG ATT TGA
       Amino acids: Isoleucine (I), Glycine (G), Isoleucine (I), Glycine (G), Isoleucine (I), STOP
       Protein (20 different types)
  • 19. Translation: ATT -> Isoleucine
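The codon-to-amino-acid lookup from the translation slides can be sketched with the standard genetic code, written here for DNA codons on the coding strand. The table construction and the `translate` helper are illustrative assumptions rather than code from the talk; in practice Biopython, which the talk lists among its tools, provides this functionality.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA string, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino = CODON_TABLE[dna[start:start + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# The lower strand from the translation slide: ATT GGG ATT GGG ATT TGA
protein = translate("ATTGGGATTGGGATTTGA")
print(protein)
```

With the slide's sequence this yields I, G, I, G, I and then stops at TGA, matching the amino acid row shown on the slide.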
  • 20. 125 million bp = 41 m cols
       Diagram labels: P (short arm), centromere, Q (long arm); 3, 36, 0, 0, 43; last column 1q43.7492932
       Chrom-1 columns: humanID, cell_type, parent, 1p36, 1p35, ..., 1p1, 1p0, 1q0, 1q1, ..., 1q43.7
       595-36-0000  normal  mother   TAG GCC CAG CAG TCA CTG NNN GAT
       595-36-0000  normal  father   TAG GCC CAG CAG TAA CTG NNN GAT
       595-36-0000  cancer  mother   TAG GCC CAG CAG TCA CTG NNN GAT
       595-36-0000  cancer  father   TAG GCC CAG CAG CTG NNN GAT
       595-36-1111  normal  mother   TAG GCC CAG CAG TCC TCA CTG NNN GAT
       595-36-1111  normal  father   TAG GCC CAG CAG TCA CTG NNN GAT
       Chrom-21: 595-36-1111  normal  3rd
       (SNP) Point Mutation:  TAG GCC CAG CAG TCA CTG  ->  TAG GCC CAG CAG TAA CTG
       Deletion Mutation:     TAG GCC CAG CAG TCA CTG  ->  TAG GCC CAG CAG ___ CTG
       Insertion Mutation:    TAG GCC CAG CAG TCA CTG  ->  TAG GCC CAG CAG TCC TCA CTG
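The three mutation types on this slide can be told apart by comparing a sample's codon columns against a reference row. The sketch below is deliberately naive: it classifies by length and first differing position only, which works for the slide's examples but is not how real variant callers operate. The function name `classify` and its return labels are assumptions made for illustration.

```python
from typing import List, Tuple


def classify(reference: List[str], sample: List[str]) -> Tuple[str, int]:
    """Return (mutation kind, index of the first differing codon column)."""
    shared = min(len(reference), len(sample))
    # First column where the two rows disagree; 'shared' if none does.
    position = next(
        (i for i in range(shared) if reference[i] != sample[i]), shared
    )
    if len(sample) < len(reference):
        kind = "deletion"
    elif len(sample) > len(reference):
        kind = "insertion"
    else:
        differing = sum(1 for r, s in zip(reference, sample) if r != s)
        if differing == 0:
            kind = "identical"
        elif differing == 1:
            kind = "point mutation"
        else:
            kind = "multiple substitutions"
    return kind, position


reference = "TAG GCC CAG CAG TCA CTG".split()
point = classify(reference, "TAG GCC CAG CAG TAA CTG".split())
deletion = classify(reference, "TAG GCC CAG CAG CTG".split())
insertion = classify(reference, "TAG GCC CAG CAG TCC TCA CTG".split())
print(point, deletion, insertion)
```

All three slide examples diverge from the reference at codon index 4, the column the slide labels 1q0.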
  • 21. Excellent candidate for compression! 4x reduction in total data size + 35% faster reads. To detect SNPs: Create Secondary Index
       Chrom-1 columns: humanID, cell_type, parent, 1p36, 1p35, ..., 1p1, 1p0, 1q0, 1q1, ..., 1q43.7
       595-36-0000  normal  mother   TAG GCC CAG CAG TCA CTG NNN GAT
       595-36-0000  normal  father   TAG GCC CAG CAG TAA CTG NNN GAT
       595-36-0000  cancer  mother   TAG GCC CAG CAG TCA CTG NNN GAT
       595-36-0000  cancer  father   TAG GCC CAG CAG CTG NNN GAT
       595-36-1111  normal  mother   TAG GCC CAG CAG TCC TCA CTG NNN GAT
       595-36-1111  normal  father   TAG GCC CAG CAG TCA CTG NNN GAT
       Chrom-21: 595-36-1111  normal  3rd
       To get all of the people with the SNP (identifiers that contain a hyphen or start with a digit must be double-quoted in CQL):
       cqlsh:dna_table> SELECT "humanID" FROM "Chrom-1" WHERE "1q0" = 'TAA';
  • 22. Query: Give me the X gene for 2 people
       Chrom-1 columns: humanID, cell_type, parent, 1p36, 1p35, ..., 1p1, 1p0, 1q0, 1q1, ..., 1q43.7
       595-36-0000  normal  mother   TAG GCC CAG CAG TCA CTG NNN GAT
       595-36-0000  normal  father   TAG GCC CAG CAG TAA CTG NNN GAT
       595-36-0000  cancer  mother   TAG GCC CAG CAG TCA CTG NNN GAT
       595-36-0000  cancer  father   TAG GCC CAG CAG CTG NNN GAT
       595-36-1111  normal  mother   TAG GCC CAG CAG TCC TCA CTG NNN GAT
       595-36-1111  normal  father   TAG GCC CAG CAG TCA CTG NNN GAT
       Chrom-21: 595-36-1111  normal  3rd
       cqlsh:dna_table> SELECT "1q0", "1q1" FROM "Chrom-1" WHERE "humanID" IN ('595-36-0000', '595-36-1111');
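Conceptually, a secondary index on a codon column is an inverted map from column value to the row keys that hold it, so the SNP lookup no longer needs to scan every row. The following is an in-memory sketch of that idea only; it does not reflect how Cassandra stores its indexes, and the `build_index` helper and the reduced sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

RowKey = Tuple[str, str, str]  # (humanID, cell_type, parent)

# A few rows from the slide, reduced to the two columns around the SNP.
rows: Dict[RowKey, Dict[str, str]] = {
    ("595-36-0000", "normal", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-0000", "normal", "father"): {"1q0": "TAA", "1q1": "CTG"},
    ("595-36-0000", "cancer", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-1111", "normal", "father"): {"1q0": "TCA", "1q1": "CTG"},
}


def build_index(table: Dict[RowKey, Dict[str, str]], column: str) -> Dict[str, Set[str]]:
    """Invert one column: codon value -> set of humanIDs carrying it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for (human_id, _cell_type, _parent), columns in table.items():
        if column in columns:
            index[columns[column]].add(human_id)
    return index


index_1q0 = build_index(rows, "1q0")
carriers = sorted(index_1q0["TAA"])
print(carriers)
```

The lookup `index_1q0["TAA"]` plays the role of the `WHERE` clause on the indexed column: one dictionary access instead of a pass over every row.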
  • 23. Storing the total USA population Genome in Cassandra (314 million people)
       - Column families: Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y; row keys (P-key) of the form SS:CT:M and SS:CT:F
       - Figure labels: 41 million columns, 9 million columns, 3 billion cols; 125 MB of data per row; 1.5 GB
       - 630 million rows (2 for each person)
       - 900 PB = 46,000 nodes (20 TB each), No Replication
       - 1000 Genomes Project: Oct 2012: 1092 genomes sequenced, 3.2 TB data total
  • 24. Cost per Human Genome sequence [chart: linear scale, $0 to $120,000,000 in $20 million increments, 2001 to 2012; annotation: "Huh ?"]
  • 25. Cost per Human Genome sequence [chart: logarithmic scale, $100 to $100,000,000 in 10x increments, 2001 to 2012; series: Genome Sequencing, Moore's Law; annotations: "Super Logarithmic Scale!", "Jan 2008: Switched to next-gen sequencing"]
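The sizing figures on this slide can be reproduced with back-of-the-envelope arithmetic. The sketch below takes the 3 GB per genome figure from the earlier data-size slide and the other inputs from this one. The use of 1024 TB per PB for the node count is an assumption on my part; it happens to land near the slide's 46,000 nodes, but the talk does not state how that number was derived.

```python
PEOPLE = 314_000_000        # USA population used on the slide
GB_PER_GENOME = 3           # uncompressed size of one human genome
ROWS_PER_PERSON = 2         # one row per parent (mother / father)
NODE_CAPACITY_TB = 20       # per-node capacity quoted on the slide
ROUNDED_TOTAL_PB = 900      # the slide's rounded total

# Two rows per person: about 630 million rows.
total_rows = PEOPLE * ROWS_PER_PERSON

# Raw total before rounding, in decimal petabytes (1 PB = 1,000,000 GB).
raw_total_pb = PEOPLE * GB_PER_GENOME / 1_000_000

# Nodes needed for the rounded total, assuming 1 PB = 1024 TB, no replication.
nodes = ROUNDED_TOTAL_PB * 1024 // NODE_CAPACITY_TB

# 1000 Genomes Project as of Oct 2012: 1092 genomes, in decimal terabytes.
thousand_genomes_tb = 1092 * GB_PER_GENOME / 1000

print(f"rows: {total_rows:,}")
print(f"raw total: {raw_total_pb:.0f} PB")
print(f"nodes at {NODE_CAPACITY_TB} TB each: {nodes:,}")
print(f"1000 Genomes total: {thousand_genomes_tb:.1f} TB")
```

Any replication factor multiplies the node count directly, so a typical factor of 3 would roughly triple it.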
  • 26. Coding vs Non-coding DNA 98% non-coding DNA Coding Non-coding
  • 27. 8% of human DNA (98,000 fragments) T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta
  • 28. Tools:
       - Get Ubuntu 12.10: http://www.ubuntu.com/download (note: CentOS/Red Hat has install issues with Biopython)
       - DataStax Community Edition of Cassandra + OpsCenter: http://www.datastax.com/download/community
       - Free Python tools for biological computation: http://biopython.org
       - Cassandra Python client library Pycassa: https://github.com/pycassa/pycassa
  • 29. blueplastic.com/dna.pdf Polychaos dubium 620 billion bp (200x humans) Sameer Farooqui sameer@blueplastic.com - Freelance Big Data consultant and trainer - Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack Ex: Hortonworks, Accenture R&D, Symantec - Co-author on v2 of Cassandra book - Coming late 2013 linkedin.com/in/blueplastic/ @blueplastic http://youtu.be/ziqx2hJY8Hg
  • 30. Talks:
       - James Watson: How we discovered DNA: http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html
       - Juan Enriquez: The life-code that will reshape the future: http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html
  • 31. Resources to get started for beginners…