NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Published in: Technology
  • I have attended this presentation, and I was intrigued by the claim that African DNA has more base pairs (4e9 vs 3.3e9, see slide 4). I cannot find any mentions of this in web searches. Could you please post some references to scientific literature about this?

  1. Analyzing the human genome/DNA with Cassandra, by Sameer Farooqui. Contact: sameer@blueplastic.com, @blueplastic, linkedin.com/in/blueplastic/. Slides: blueplastic.com/dna.pdf. Video: http://youtu.be/ziqx2hJY8Hg
  2. [Figure: DNA base letters T A T A G C A T]
  3. 3.3 billion base pairs. Bases: Adenine (A), Thymine (T), Cytosine (C), Guanine (G). Paired strands: T A A C C C N A A C . . . over A T T G G G N T T G. 98.5% of genome identical (3 in 10,000 bases differ) [1].
     - Human: 3 GB (uncompressed), 900 MB (compressed gz)
     - US population: 900 PB (uncompressed)
     - Earth population: 2.8 exabytes (uncompressed)
     - 1 exabyte = 1 million TB
  4. Humans: 46 chromosomes (Mom: XX, Dad: XY). 3.3 billion bp; 4 billion bp. Y chromosome: 58 million base pairs (2% of total DNA), 59 MB. X chromosome: 155 million base pairs (5% of total DNA), 154 MB. Other sizes shown: 242 MB, 178 MB, 61 MB.
  5. Why Chromosomes?
     - Garden Snail: 54 Ch
     - Adder's Tongue Fern: 1,200 chromosomes
     - Fruit Fly: 8 Ch
     - Genome sizes shown for the row above: 2 billion bp, 165 million bp
     - Gorilla: 48 Ch, 3.4 billion bp
     - Elephant: 56 Ch, 5.8 billion bp
     - Onion: 16 Ch, ~18 billion bp, highly repetitive
  6. Human Genome Project vs 1000 Genomes Project
     Human Genome Project:
     - ~15 year project: 1989 – 2003
     - Sequenced 99% of the genome (400 gaps)
     - >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
     - Cost about $3 billion
     1000 Genomes Project:
     - Launched Jan 2008
     - Oct 2012: 1092 human genomes complete from 14 populations
     - Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans
     - Work done by 111 global institutions
     - Cost about $40 million ($16,000 per person)
     - Download @ http://www.1000genomes.org/
  7. - In 2010: 179 human genomes
     - Discussed DNA from 2 families of: Mother / Father / Child
     - One of biology’s most cited papers in 2011
     Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/
  8. Download at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes
     - Feb 2009 assembly of one human genome (hg19)
     - One gzip FASTA file per chromosome
     1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
     2) gunzip <file>.fa.gz
  9. Exploring DNA from your browser: http://useast.ensembl.org/Homo_sapiens/Location/Genome
     - Chromosome 2, Gene: MCM6, SNP: rs4988235, Position: 136,608,646 bp from pter
     - A:T - Can digest milk
     - G:C - Lactose intolerance
  10. Chromosome #1: 250 million base pairs (across both C-pairs), 8% of total DNA, 4,316 known genes. Strands shown: T A A C C C (repeated) paired with A T T G G G (repeated). [Figure: chromosome ideogram running from pter along P (short arm), through the centromere, along Q (long arm) to qter; bands labeled 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]
  11. Compound Keys. 24 Column Families (22 pairs + X + Y): Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y. Key layout is partition key : remaining keys, i.e. humanID:cell_type:parent.
      humanID      cell_type  parent
      595-36-0000  normal     mother
      595-36-0000  normal     father
      595-36-0000  cancer     mother
      595-36-0000  cancer     father
      595-36-1111  normal     mother
      595-36-1111  normal     father
  12. XY / XX. Column families: Chrom-1, Chrom-2, Chrom-X, Chrom-Y.
      humanID      cell    parent  1    2
      595-36-0000  normal  mother  TAG  GCC
      595-36-0000  normal  father  TAG  GCC
      595-36-1111  normal  mother  TAG  GCC
      595-36-1111  normal  father  TAG  GCC
      Chrom-1 Column Family on disk:
      595-36-0000 -> [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
      595-36-1111 -> [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
  13. Partition based on humanID w/ Murmur3Partitioner. Token ring with ranges A, B, C, D; column families Chrom-1 ... Chrom-Y.
      humanID      cell_type  parent
      595-36-0000  normal     mother
      595-36-0000  normal     father
      595-36-0000  cancer     mother
      595-36-0000  cancer     father
      595-36-1111  normal     mother
      595-36-1111  normal     father
      Rows for 595-36-0000: send to range A. Rows for 595-36-1111: send to range D.
      Now it’s possible to do row range scans down the same humanID and get all the DNA for human #1000.
  14. Chromosome #1: 4,316 genes
      - Breast cancer: DIRAS3 protein (DIRAS3, 4,800 bp), 1p31, 68,511,644 - 68,516,459
      - Alzheimers: Presenilin 2 (PSEN2, 25,000 bp), 1q31-q42, 227,058,272 – 227,083,803
      - Neuroblastoma cancer: deletion of 1p36.1 – 1p36.3
      Ideogram bands: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43
      Genes include: ABCA4, ACADM, ACTA1, AGL, ALDH4A1, ALPL, AMPD1, ASPM, ATP1A2, BSND, CACNA1S, CASQ2, CDC73, CFH, CFHR5, CHRNB2, CLCNKA, CLCNKB, COL8A2, COL9A2, COL11A1, CPT2, CRB1, DARS2, DBT, DIRAS3, DPYD, EDARADD, EGLN1, EIF2B3
  15. Conditions related to genes on Chromosome 1: Alzheimer disease, neuroblastoma, breast cancer, color vision deficiency, early-onset glaucoma, Emery-Dreifuss muscular dystrophy, Parkinson disease
  16. Conditions related to genes on Chromosome 1: actin-accumulation myopathy; adenosine monophosphate deaminase deficiency; age-related macular degeneration; Alagille syndrome; Alzheimer disease; amyotrophic lateral sclerosis; anencephaly; ankylosing spondylitis; arrhythmogenic right ventricular cardiomyopathy; atypical hemolytic-uremic syndrome; auriculo-condylar syndrome; autosomal dominant nocturnal frontal lobe epilepsy; autosomal recessive primary microcephaly; Bartter syndrome; 3-beta-hydroxysteroid dehydrogenase deficiency; breast cancer; cap myopathy; carnitine palmitoyltransferase II deficiency; catecholaminergic polymorphic ventricular tachycardia; Charcot-Marie-Tooth disease; Chediak-Higashi syndrome; chronic granulomatous disease; color vision deficiency; congenital fiber-type disproportion; congenital hypothyroidism; congenital insensitivity to pain with anhidrosis; Cowden syndrome; Crohn disease; dense deposit disease; Diamond-Blackfan anemia; dihydropyrimidine dehydrogenase deficiency; early-onset glaucoma; Ehlers-Danlos syndrome; Emery-Dreifuss muscular dystrophy; essential thrombocythemia; factor V Leiden thrombophilia; familial adenomatous polyposis; familial cold autoinflammatory syndrome; familial erythrocytosis; familial hemiplegic migraine; familial hypertrophic cardiomyopathy; familial hypobetalipoproteinemia; familial isolated hyperparathyroidism; familial restrictive cardiomyopathy; Fuchs endothelial dystrophy; fucosidosis; fumarase deficiency; galactosemia; gastrointestinal stromal tumor; Gaucher disease; Gitelman syndrome; GLUT1 deficiency syndrome; glycogen storage disease type III; Greenberg dysplasia; hemochromatosis; hereditary antithrombin deficiency; hereditary leiomyomatosis and renal cell cancer; hereditary paraganglioma-pheochromocytoma; hereditary sensory and autonomic neuropathy type V; homocystinuria; Hutchinson-Gilford progeria syndrome; 3-hydroxy-3-methylglutaryl-CoA lyase deficiency; hypercholesterolemia; hypermanganesemia with dystonia, polycythemia, and cirrhosis; hyperparathyroidism-jaw tumor syndrome; hyperprolinemia; hypohidrotic ectodermal dysplasia; hypokalemic periodic paralysis; hypophosphatasia; idiopathic inflammatory myopathy; intranuclear rod myopathy; junctional epidermolysis bullosa; juvenile idiopathic arthritis; Kufs disease; Leber congenital amaurosis; leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation; leukoencephalopathy with vanishing white matter; limb-girdle muscular dystrophy; malignant hyperthermia; maple syrup urine disease; medium-chain acyl-CoA dehydrogenase deficiency; Muckle-Wells syndrome; multiminicore disease; multiple epiphyseal dysplasia; nemaline myopathy; neonatal onset multisystem inflammatory disease; neuroblastoma; nonsyndromic deafness; nonsyndromic paraganglioma; Noonan syndrome; osteogenesis imperfecta; Parkinson disease; popliteal pterygium syndrome; porphyria; primary myelofibrosis; psoriatic arthritis; pyruvate kinase deficiency; REN-related kidney disease; retinitis pigmentosa; rhizomelic chondrodysplasia punctata; severe congenital neutropenia; Shprintzen-Goldberg syndrome; spina bifida; Stargardt macular degeneration; Stickler syndrome; systemic scleroderma; thiamine-responsive megaloblastic anemia syndrome; thrombocytopenia-absent radius syndrome; trimethylaminuria; Usher syndrome; van der Woude syndrome; vitiligo; Vohwinkel syndrome; WNT4 Müllerian aplasia and ovarian dysfunction
  17. What read queries do we want to perform? Write once, read many times type of database.
      1) Give me the PSEN2 gene for 2,000 people w/ Alzheimers (25,000 sequential bp)
      2) Give me all of the humans who have the lactose intolerance SNP on CR-2
  18. Translation: DNA -> Proteins
      DNA (each group of three bases is a codon):
      T A A C C C T A A C C C T A A A C T
      A T T G G G A T T G G G A T T T G A
      Amino acids: Isoleucine, Glycine, Isoleucine, Glycine, Isoleucine, STOP (I G I G I)
      Protein (20 different types)
  19. Translation: ATT -> Isoleucine
  20. 125 million bp = 41 m cols. [Figure: chromosome 1 ideogram with P (short arm), centromere, Q (long arm); last column of Chrom-1 is 1q43.7492932]
      Chrom-1 columns: humanID, cell_type, parent, then 1p36, 1p35, ..., 1p1, 1p0, 1q0, 1q1, ..., 1q43.7
      595-36-0000  normal  mother  TAG GCC CAG CAG TCA CTG NNN GAT
      595-36-0000  normal  father  TAG GCC CAG CAG TAA CTG NNN GAT
      595-36-0000  cancer  mother  TAG GCC CAG CAG TCA CTG NNN GAT
      595-36-0000  cancer  father  TAG GCC CAG CAG CTG NNN GAT
      595-36-1111  normal  mother  TAG GCC CAG CAG TCC TCA CTG NNN GAT
      595-36-1111  normal  father  TAG GCC CAG CAG TCA CTG NNN GAT
      Chrom-21: 595-36-1111  normal  3rd
      (SNP) Point Mutation: TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG TAA CTG
      Deletion Mutation: TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG ___ CTG
      Insertion Mutation: TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG TCC TCA CTG
  21. Excellent candidate for compression: 4x reduction in total data size + 35% faster reads. To detect SNPs, create a secondary index on the Chrom-1 column family (same table as slide 20; written Chrom_1 in CQL because table names cannot contain hyphens). To get all of the people with the SNP:
      cqlsh:dna_table> SELECT humanID FROM Chrom_1 WHERE "1q0" = 'TAA';
  22. Query: Give me the X gene for 2 people (same Chrom-1 table as slide 20; written Chrom_1 in CQL because table names cannot contain hyphens):
      cqlsh:dna_table> SELECT "1q0", "1q1" FROM Chrom_1 WHERE humanID IN ('595-36-0000', '595-36-1111');
  23. Storing the total USA population Genome in Cassandra (314 million people)
      - Column families: Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y
      - P-key: SS:CT:M and SS:CT:F; 630 million rows (2 for each person)
      - Column counts shown: 41 million columns, 9 million columns, 3 billion cols
      - SS:CT:M: 125 MB of data; SS:CT:F: 125 MB of data; 1.5 GB
      - 900 PB = 46,000 nodes (20 TB each), No Replication
      - 1000 Genomes Project, Oct 2012: 1092 genomes sequenced, 3.2 TB data total
  24. [Chart: Cost per Human Genome sequence, 2001 to 2012. Linear scale, $20 million increments from $0 to $120,000,000. Annotation: "Huh ?"]
  25. [Chart: Cost per Human Genome sequence, 2001 to 2012. Logarithmic scale, 10x increments from $100 to $100,000,000. Series: Genome Sequencing, Moore's Law. Annotations: "Super Logarithmic Scale!"; "Jan 2008: Switched to next-gen sequencing"]
  26. Coding vs Non-coding DNA: 98% non-coding DNA. [Chart: Coding vs Non-coding]
  27. 8% of human DNA (98,000 fragments). Strands shown: T A A C C C (repeated) paired with A T T G G G (repeated). HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta
  28. - Get Ubuntu 12.10: http://www.ubuntu.com/download (note CentOS/Red Hat has install issues with Biopython)
      - DataStax Community Edition of Cassandra + OpsCenter: http://www.datastax.com/download/community
      - Free python tools for biological computation: http://biopython.org
      - Cassandra python client library, Pycassa: https://github.com/pycassa/pycassa
  29. Polychaos dubium: 620 billion bp (200x humans). Slides: blueplastic.com/dna.pdf
      Sameer Farooqui, sameer@blueplastic.com, linkedin.com/in/blueplastic/, @blueplastic, http://youtu.be/ziqx2hJY8Hg
      - Freelance Big Data consultant and trainer
      - Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack. Ex: Hortonworks, Accenture R&D, Symantec
      - Co-author on v2 of Cassandra book, coming late 2013
  30. - James Watson: How we discovered DNA. http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html
      - Juan Enriquez: The life-code that will reshape the future. http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html
  31. Resources to get started for beginners…
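
The storage figures on slides 3 and 23 can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes decimal units (1 PB = 10^15 bytes) and takes the slide's rounded 3 GB per uncompressed genome at face value. The 2-bits-per-base figure is an extra illustration that is not in the deck (four bases fit in 2 bits if the `N` placeholder is ignored). The totals come out at roughly 940 PB and 47,000 nodes, close to the slide's rounded 900 PB and 46,000 nodes.

```python
# Back-of-the-envelope check of the sizes quoted on slides 3 and 23.
# Assumption: decimal units throughout (1 GB = 1e9 bytes, 1 PB = 1e15 bytes).

GENOME_BYTES = 3e9      # slide 3: about 3 GB per uncompressed genome
BASE_PAIRS = 3.3e9      # slide 3: 3.3 billion base pairs
US_POPULATION = 314e6   # slide 23: 314 million people
NODE_BYTES = 20e12      # slide 23: 20 TB per node

# Total uncompressed size for the US population, and nodes needed to hold it
# with no replication.
us_total_bytes = GENOME_BYTES * US_POPULATION
nodes_needed = us_total_bytes / NODE_BYTES

# Illustration only (not from the slides): packing A/C/G/T into 2 bits each.
packed_bytes = BASE_PAIRS * 2 / 8

print(f"US population: {us_total_bytes / 1e15:.0f} PB")
print(f"Nodes at 20 TB each: {nodes_needed:,.0f}")
print(f"One genome at 2 bits per base: {packed_bytes / 1e6:.0f} MB")
```

The 2-bit estimate lands near the slide's 900 MB gzip figure, which suggests generic compression gets close to a naive fixed-width encoding for this data.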
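
Slide 8 downloads one gzip FASTA file per chromosome. A minimal sketch of reading such a file with only the Python standard library follows; the deck itself points to Biopython, so this is a stand-in rather than the author's code. The function names `open_fasta`, `read_fasta` and `base_counts` and the sample record `chrDemo` are hypothetical.

```python
import gzip
import io

def open_fasta(path):
    """Open a FASTA file in text mode, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fasta(handle):
    """Yield (header, sequence) pairs from a text-mode FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def base_counts(sequence):
    """Count each base letter, ignoring case."""
    counts = {}
    for base in sequence.upper():
        counts[base] = counts.get(base, 0) + 1
    return counts

# Tiny in-memory stand-in for a chromosome file (hypothetical data).
sample = io.StringIO(">chrDemo\nNNNNTAACCC\ntaaccctaac\n")
records = list(read_fasta(sample))
counts = base_counts(records[0][1])
print(records[0][0], counts)
```

For a real download, `open_fasta("chr1.fa.gz")` would replace the in-memory sample, assuming the file name matches what rsync fetched.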
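
Slides 18 and 19 read codons straight off a DNA strand and map them to amino acids. The sketch below does the same with the standard genetic code, built from the conventional TCAG ordering. It is a simplification in the same spirit as the slide: it skips the mRNA step and just translates the strand as written. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon.

    Codons outside the table (for example NNN) are rendered as "X".
    """
    dna = dna.replace(" ", "").upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

# Lower strand from slide 18.
strand = "A T T G G G A T T G G G A T T T G A"
print(translate(strand))
```

On the slide's strand this yields the same I G I G I sequence, ending at the TGA stop codon.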
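
Slide 20 contrasts point, deletion and insertion mutations by comparing codon columns between rows. A minimal sketch of that comparison follows. It assumes space-separated codons as printed on the slide and classifies purely by codon count, which is a simplification and not a real alignment. The function names are hypothetical.

```python
def codons(sequence):
    """Split a space-separated codon string into a list of codons."""
    return sequence.split()

def point_mutations(reference, sample):
    """Return (index, reference_codon, sample_codon) for each differing column.

    Only meaningful when both sequences have the same number of codons.
    """
    ref, smp = codons(reference), codons(sample)
    if len(ref) != len(smp):
        raise ValueError("codon counts differ: likely an insertion or deletion")
    return [(i, r, s) for i, (r, s) in enumerate(zip(ref, smp)) if r != s]

def classify(reference, sample):
    """Rough classification based on codon count alone."""
    ref_len, smp_len = len(codons(reference)), len(codons(sample))
    if smp_len < ref_len:
        return "deletion"
    if smp_len > ref_len:
        return "insertion"
    return "point mutation" if point_mutations(reference, sample) else "identical"

# Rows from slide 20.
reference = "TAG GCC CAG CAG TCA CTG"
snp_row = "TAG GCC CAG CAG TAA CTG"
deletion_row = "TAG GCC CAG CAG CTG"
insertion_row = "TAG GCC CAG CAG TCC TCA CTG"

print(point_mutations(reference, snp_row))
print(classify(reference, deletion_row), classify(reference, insertion_row))
```

In the Cassandra layout from the deck, the point-mutation case is the one a secondary index on a single codon column can answer directly; insertions and deletions shift every later column, so they would need a different approach.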
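
Slide 13 partitions rows by humanID so that every row for one person lands in the same token range. The sketch below illustrates that idea only. It uses MD5 from the standard library as a stand-in hash, not Murmur3, and splits a made-up 64-bit ring into four equal ranges named A to D, so the range a given ID maps to will not match the slide or a real cluster. The names `token` and `owner` are hypothetical.

```python
import hashlib

RING_SIZE = 2 ** 64   # assumed token space for this illustration
RANGES = "ABCD"       # four equal token ranges, as drawn on slide 13

def token(partition_key):
    """Hash a partition key to a token (MD5 stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def owner(partition_key):
    """Map a partition key to the range that owns its token."""
    return RANGES[token(partition_key) * len(RANGES) // RING_SIZE]

# Rows from slide 13: (humanID, cell_type, parent).
rows = [
    ("595-36-0000", "normal", "mother"),
    ("595-36-0000", "normal", "father"),
    ("595-36-0000", "cancer", "mother"),
    ("595-36-0000", "cancer", "father"),
    ("595-36-1111", "normal", "mother"),
    ("595-36-1111", "normal", "father"),
]

# Only humanID feeds the hash, so cell_type and parent never change placement.
placement = {}
for human_id, cell_type, parent in rows:
    placement.setdefault(human_id, set()).add(owner(human_id))

for human_id, ranges in placement.items():
    print(human_id, "->", sorted(ranges))
```

Because only the partition key is hashed, the remaining key parts (cell_type, parent) stay together on one range, which is what makes the per-person scans described on the slide possible.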
