NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

blueplastic.com/dna.pdf

Analyzing the human genome/DNA with Cassandra
BY SAMEER FAROOQUI
linkedin.com/in/blueplastic/
SAMEER@BLUEPLASTIC.COM
@blueplastic

http://youtu.be/ziqx2hJY8Hg

3.3 billion base pairs

Thymine Cytosine

T A A C C C N A A C . . .
A T T G G G N T T G
Adenine Guanine
98.5 % of genome identical
(3 in 10,000 bases differ)
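The pairing rule in the diagram above (Adenine with Thymine, Cytosine with Guanine) can be sketched in a few lines of Python. This is only an illustration: the `PAIRS` table and the `complement` helper are hypothetical names, and passing `N` (unknown base) through unchanged is an assumption based on the example strands on the slide.

```python
# Base pairing as drawn on the slide: A <-> T and C <-> G.
# "N" (unknown base) is passed through unchanged; that is an assumption.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


top = "TAACCCNAAC"          # upper strand from the slide
bottom = complement(top)    # lower strand, position by position
print(top)
print(bottom)
```

Because the lower strand is fully determined by the upper one, only one strand needs to be stored.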

                   Uncompressed    Compressed (gz)
1 Human:           3 GB            900 MB
US population:     900 PB
Earth population:  2.8 exabytes

1 Exabyte = 1 million TB

Humans: 46 Chromosomes

[Figure: chromosome sets inherited from Mom and Dad, for XY and XX; labels: 3.3 billion bp, 4 billion bp, 242 MB, 178 MB, 154 MB, 61 MB, 59 MB]

Y chromosome: 58 million base pairs (2% of total DNA)
X chromosome: 155 million base pairs (5% of total DNA)

Why Chromosomes ??

Species               Chromosomes   Genome size
Fruit Fly             8             165 million bp
Garden Snail          54            2 billion bp
Adder's Tongue Fern   1,200
Gorilla               48            3.4 billion bp
Elephant              56            5.8 billion bp
Onion                 16            ~18 billion bp (highly repetitive)

Human Genome Project vs 1000 Genomes Project

Human Genome Project

- ~15 year project: 1989 – 2003
- Sequenced 99% of the genome (400 gaps)
- >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
- Cost about $3 billion

1000 Genomes Project

- Launched Jan 2008
- Oct 2012: 1092 human genomes complete from 14 populations
- Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans

- Work done by 111 global institutions

- Cost about $40 million ($16,000 per person)
Download @ http://www.1000genomes.org/

- In 2010: 179 human genomes

- Discussed DNA from 2 families of:
Mother / Father / Child

- One of biology’s most cited papers in 2011

Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/

Download at : http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes

- Feb 2009 assembly of one human
genome (hg19)

- One gzip FASTA file per
chromosome

1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
2) gunzip <file>.fa.gz
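Once a chromosome file is downloaded, it can be read with plain Python. The sketch below is a minimal FASTA reader using only the standard library (Biopython, listed in the resources at the end, offers a fuller parser). The function names are hypothetical, and the file name in the usage note is just an example of what the rsync step might produce.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def base_counts(sequence):
    """Count each base, ignoring upper/lower case."""
    counts = {}
    for base in sequence.upper():
        counts[base] = counts.get(base, 0) + 1
    return counts
```

Usage might look like `for name, seq in read_fasta("chr21.fa.gz"): print(name, base_counts(seq))`, where `chr21.fa.gz` is an assumed file name. Since `read_fasta` opens `.gz` files directly, the gunzip step is optional for this sketch.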

Exploring DNA from your browser…

Chromosome 2
Gene: MCM6
SNP: rs4988235
Position: 136,608,646 bp from pter

A:T - Can digest milk
G:C - Lactose intolerance

http://useast.ensembl.org/Homo_sapiens/Location/Genome

T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G

Chromosome #1 : 250 million base pairs (across both C-pairs)
(8% of total DNA)

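[Figure: chromosome 1 ideogram running from pter along the P (short arm), bands 36 to 0, through the centromere, then along the Q (long arm), bands 0 to 43, to qter; labeled bands: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]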

4,316 known genes

Compound Keys

Partition key : remaining keys
humanID:cell_type:parent

24 Column Families (22 pairs + X + Y): Chrom-1, Chrom-2, Chrom-3 ... Chrom-Y

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father

[Figure: XY and XX individuals; column families Chrom-1, Chrom-2 ... Chrom-X, Chrom-Y; columns: humanID, cell, parent, 1, 2]

Chrom-1 Column Family on disk

595-36-0000 | [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
595-36-1111 | [normal, mother, 1]: TAG | [normal, mother, 2]: GCC | [normal, father, 1]: TAG | [normal, father, 2]: GCC
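The on-disk layout above can be mimicked with nested dictionaries: one partition per humanID, and inside it cells keyed by (cell_type, parent, position). This is a rough in-memory sketch, not Cassandra code; `chrom_1`, `write_codon` and `read_row` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# "chrom_1" stands in for the Chrom-1 column family.
# Outer key: humanID (partition key).
# Inner key: (cell_type, parent, position), the remaining key parts.
chrom_1 = defaultdict(dict)


def write_codon(cf, human_id, cell_type, parent, position, codon):
    """Store one three-base cell under its compound key."""
    cf[human_id][(cell_type, parent, position)] = codon


def read_row(cf, human_id, cell_type, parent):
    """Return the codons of one (cell_type, parent) row in column order."""
    cells = cf.get(human_id, {})
    keys = sorted(k for k in cells if k[:2] == (cell_type, parent))
    return [cells[k] for k in keys]


for parent in ("mother", "father"):
    write_codon(chrom_1, "595-36-0000", "normal", parent, 1, "TAG")
    write_codon(chrom_1, "595-36-0000", "normal", parent, 2, "GCC")

print(read_row(chrom_1, "595-36-0000", "normal", "father"))
```

Sorting the inner keys mirrors how cells within a partition are kept in order, so a contiguous stretch of positions can be read as a slice.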

Partition based on humanID w/ Murmur3Partitioner
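[Figure: token ring with ranges A, B, C and D. Rows of the column families Chrom-1 ... Chrom-Y (humanID, cell_type, parent; e.g. 595-36-0000 / cancer / mother and 595-36-0000 / cancer / father) are hashed on humanID and sent to a range such as A or D.]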

Now it’s possible to do row range scans down the same humanID…
… and get all the DNA for human #1000
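The idea of partitioning on humanID alone can be sketched as follows. Note the loud assumption: MurmurHash3 is not in the Python standard library, so MD5 is used here purely as a stand-in hash; the four ranges A to D and the helper names are hypothetical and only show that every row of one human lands in the same range.

```python
import hashlib

RANGES = ["A", "B", "C", "D"]   # four equal token ranges on the ring
TOKEN_SPACE = 2 ** 64


def token_for(human_id):
    """Map a humanID to a token in [0, 2**64).

    Stand-in only: Murmur3Partitioner uses MurmurHash3, not MD5.
    """
    digest = hashlib.md5(human_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def range_for_row(row):
    """Pick the range for a (humanID, cell_type, parent) row.

    Only the humanID part of the key is hashed.
    """
    human_id = row[0]
    return RANGES[token_for(human_id) * len(RANGES) // TOKEN_SPACE]


rows = [
    ("595-36-0000", "normal", "mother"),
    ("595-36-0000", "cancer", "father"),
    ("595-36-1111", "normal", "mother"),
]
for row in rows:
    print(row, "-> range", range_for_row(row))
```

Because cell_type and parent never enter the hash, all rows for one human sit together, which is what makes the scan over a single humanID possible.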

Chromosome #1: 4,316 genes

Breast Cancer: DIRAS3 protein
  DIRAS3 (4,800 bp), 1q31, 68,511,644 - 68,516,459

Alzheimers (11 m): Presenilin 2
  PSEN2 (25,000 bp), 1q31-q42, 227,058,272 – 227,083,803

Neuroblastoma Cancer
  (deletion of: 1p36.1 – 1p36.3)

[Figure: chromosome 1 ideogram; labeled bands: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]
ABCA4 AGL AMPD1 BSND CDC73 CHRNB2 COL8A2 CPT2 DBT EDARADD
ACADM ALDH4A1 ASPM CACNA1S CFH CLCNKA COL9A2 CRB1 DIRAS3 EGLN1
ACTA1 ALPL ATP1A2 CASQ2 CFHR5 CLCNKB COL11A1 DARS2 DPYD EIF2B3
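The rounded gene sizes quoted above (4,800 bp and 25,000 bp) follow from the start and end coordinates. A small check, assuming the coordinates are inclusive on both ends (the slide does not say):

```python
def gene_length(start, end):
    """Length in bp of a gene spanning start..end, both ends inclusive."""
    return end - start + 1


# Coordinates as given on the slide.
GENES = {
    "DIRAS3": (68_511_644, 68_516_459),
    "PSEN2": (227_058_272, 227_083_803),
}

for name, (start, end) in GENES.items():
    print(name, gene_length(start, end), "bp")
```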

Conditions related to genes on Chromosome 1

- Alzheimer disease

- Neuroblastoma

- breast cancer

- color vision deficiency

- early-onset glaucoma

- Emery-Dreifuss muscular dystrophy

- Parkinson disease

Conditions related to genes on Chromosome 1
actin-accumulation myopathy
adenosine monophosphate deaminase deficiency
age-related macular degeneration
Alagille syndrome
Alzheimer disease
amyotrophic lateral sclerosis
anencephaly
ankylosing spondylitis
arrhythmogenic right ventricular cardiomyopathy
atypical hemolytic-uremic syndrome
auriculo-condylar syndrome
autosomal dominant nocturnal frontal lobe epilepsy
autosomal recessive primary microcephaly
Bartter syndrome
3-beta-hydroxysteroid dehydrogenase deficiency
breast cancer
cap myopathy
carnitine palmitoyltransferase II deficiency
catecholaminergic polymorphic ventricular tachycardia
Charcot-Marie-Tooth disease
Chediak-Higashi syndrome
chronic granulomatous disease
color vision deficiency
congenital fiber-type disproportion
congenital hypothyroidism
congenital insensitivity to pain with anhidrosis
Cowden syndrome
Crohn disease
dense deposit disease
Diamond-Blackfan anemia
dihydropyrimidine dehydrogenase deficiency
early-onset glaucoma
Ehlers-Danlos syndrome
Emery-Dreifuss muscular dystrophy
essential thrombocythemia
factor V Leiden thrombophilia
familial adenomatous polyposis
familial cold autoinflammatory syndrome
familial erythrocytosis
familial hemiplegic migraine
familial hypertrophic cardiomyopathy
familial hypobetalipoproteinemia
familial isolated hyperparathyroidism
familial restrictive cardiomyopathy
Fuchs endothelial dystrophy
fucosidosis
fumarase deficiency
galactosemia
gastrointestinal stromal tumor
Gaucher disease
Gitelman syndrome
GLUT1 deficiency syndrome
glycogen storage disease type III
Greenberg dysplasia
hemochromatosis
hereditary antithrombin deficiency
hereditary leiomyomatosis and renal cell cancer
hereditary paraganglioma-pheochromocytoma
hereditary sensory and autonomic neuropathy type V
homocystinuria
Hutchinson-Gilford progeria syndrome
3-hydroxy-3-methylglutaryl-CoA lyase deficiency
hypercholesterolemia
hypermanganesemia with dystonia, polycythemia, and cirrhosis
hyperparathyroidism-jaw tumor syndrome
hyperprolinemia
hypohidrotic ectodermal dysplasia
hypokalemic periodic paralysis
hypophosphatasia
idiopathic inflammatory myopathy
intranuclear rod myopathy
junctional epidermolysis bullosa
juvenile idiopathic arthritis
Kufs disease
Leber congenital amaurosis
leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation
leukoencephalopathy with vanishing white matter
limb-girdle muscular dystrophy
malignant hyperthermia
maple syrup urine disease
medium-chain acyl-CoA dehydrogenase deficiency
Muckle-Wells syndrome
multiminicore disease
multiple epiphyseal dysplasia
nemaline myopathy
neonatal onset multisystem inflammatory disease
neuroblastoma
nonsyndromic deafness
nonsyndromic paraganglioma
Noonan syndrome
osteogenesis imperfecta
Parkinson disease
popliteal pterygium syndrome
porphyria
primary myelofibrosis
psoriatic arthritis
pyruvate kinase deficiency
REN-related kidney disease
retinitis pigmentosa
rhizomelic chondrodysplasia punctata
severe congenital neutropenia
Shprintzen-Goldberg syndrome
spina bifida
Stargardt macular degeneration
Stickler syndrome
systemic scleroderma
thiamine-responsive megaloblastic anemia syndrome
thrombocytopenia-absent radius syndrome
trimethylaminuria
Usher syndrome
van der Woude syndrome
vitiligo
Vohwinkel syndrome
WNT4 Müllerian aplasia and ovarian dysfunction

What read queries do we want to perform?

Write once, read many times type of database

1) Give me the PSEN2 gene for 2,000 people w/ Alzheimer's
25,000 sequential bp

2) Give me all of the humans who have the lactose intolerance SNP on CR-2

Translation: DNA -> Proteins

codon

DNA
T A A C C C T A A C C C T A A A C T
A T T G G G A T T G G G A T T T G A

Amino Acids: Isoleucine, Glycine, Isoleucine, Glycine, Isoleucine, STOP
             I G I G I

Protein (20 different types)

Translation: ATT -> Isoleucine
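The codon-by-codon translation shown above can be sketched with a lookup table. The table below is deliberately partial: it holds only the three codons that appear on the slide (read from the lower strand), and `translate` is a hypothetical helper, not a full genetic-code implementation.

```python
# Partial codon table: only the codons used in the slide's example.
CODON_TABLE = {
    "ATT": "Isoleucine",
    "GGG": "Glycine",
    "TGA": "STOP",
}


def translate(dna):
    """Translate DNA three bases at a time until a STOP codon.

    Raises KeyError for codons missing from the partial table.
    Trailing bases that do not fill a codon are ignored.
    """
    protein = []
    usable = len(dna) - len(dna) % 3
    for i in range(0, usable, 3):
        amino = CODON_TABLE[dna[i:i + 3].upper()]
        protein.append(amino)
        if amino == "STOP":
            break
    return protein


print(translate("ATTGGGATTGGGATTTGA"))
```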

125 million bp = 41 m cols

[Figure: Chrom-1 ideogram: P (short arm), bands 36 to 0; centromere; Q (long arm), bands 0 to 43; label: 1q43.7492932]

Chrom-1
humanID      cell_type  parent  1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7
595-36-0000  normal     mother  TAG GCC CAG CAG TCA CTG NNN GAT
595-36-0000  normal     father  TAG GCC CAG CAG TAA CTG NNN GAT
595-36-0000  cancer     mother  TAG GCC CAG CAG TCA CTG NNN GAT
595-36-0000  cancer     father  TAG GCC CAG CAG CTG NNN GAT
595-36-1111  normal     mother  TAG GCC CAG CAG TCC TCA CTG NNN GAT
595-36-1111  normal     father  TAG GCC CAG CAG TCA CTG NNN GAT

Chrom-21
595-36-1111  normal     3rd
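The "125 million bp = 41 m cols" figure is consistent with storing three bases per column, which is also how the cells in the table look (TAG, GCC, ...). A short arithmetic sketch under that assumption; the helper names are hypothetical:

```python
BASES_PER_COLUMN = 3   # assumption: one three-base cell per column


def columns_needed(base_pairs):
    """Number of columns needed to hold a run of bases (ceiling division)."""
    return -(-base_pairs // BASES_PER_COLUMN)


def column_index(position):
    """0-based column that holds a 1-based base position."""
    return (position - 1) // BASES_PER_COLUMN


print(columns_needed(125_000_000))   # columns in one Chrom-1 row
print(columns_needed(25_000))        # columns for a 25,000 bp gene
```

Under the same assumption, a query for a 25,000 bp gene such as PSEN2 becomes a contiguous slice of roughly 8,300 columns per row.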

 
(SNP) Point Mutation:  TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG TAA CTG
Deletion Mutation:     TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG ___ CTG
Insertion Mutation:    TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG TCC TCA CTG
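The three mutation types above can be told apart by comparing a reference row with a sample row, codon by codon. This is a deliberately naive sketch: `classify_mutation` is a hypothetical helper, it treats `___` as the slide's marker for a missing codon, and it does no real sequence alignment.

```python
def classify_mutation(reference, sample):
    """Classify a sample against a reference, both as space-separated codons."""
    ref = reference.split()
    # "___" marks a deleted codon on the slide, so it is dropped here.
    alt = [codon for codon in sample.split() if codon != "___"]
    if len(alt) < len(ref):
        return "deletion"
    if len(alt) > len(ref):
        return "insertion"
    differences = sum(1 for r, a in zip(ref, alt) if r != a)
    if differences == 0:
        return "none"
    return "point" if differences == 1 else "multiple"


REFERENCE = "TAG GCC CAG CAG TCA CTG"
for sample in (
    "TAG GCC CAG CAG TAA CTG",
    "TAG GCC CAG CAG ___ CTG",
    "TAG GCC CAG CAG TCC TCA CTG",
):
    print(classify_mutation(REFERENCE, sample), ":", sample)
```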

Excellent candidate for compression!
4x reduction in total data size + 35% faster reads

To detect SNPs: Create Secondary Index

To get all of the people with the SNP:
cqlsh:dna_table> SELECT humanID FROM "Chrom-1"
WHERE "1q0" = 'TAA';
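Conceptually, a secondary index on one column is an inverted map from column value to the rows that hold it. The sketch below shows that idea in plain Python; it is not how Cassandra implements its indexes, and the sample rows and helper name are illustrative assumptions loosely based on the Chrom-1 table.

```python
from collections import defaultdict

# Illustrative rows: (humanID, cell_type, parent) -> {column name: codon}.
rows = {
    ("595-36-0000", "normal", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-0000", "normal", "father"): {"1q0": "TAA", "1q1": "CTG"},
    ("595-36-1111", "normal", "father"): {"1q0": "TCA", "1q1": "CTG"},
}


def build_index(table, column):
    """Map each value seen in `column` to the set of humanIDs that have it."""
    index = defaultdict(set)
    for (human_id, _cell_type, _parent), columns in table.items():
        if column in columns:
            index[columns[column]].add(human_id)
    return index


index_1q0 = build_index(rows, "1q0")
carriers = sorted(index_1q0.get("TAA", set()))
print(carriers)
```

With the index built once, the lookup for a given codon no longer has to scan every row.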

Query: Give me the X gene for 2 people

cqlsh:dna_table> SELECT "1q0", "1q1" FROM "Chrom-1"
WHERE humanID IN ('595-36-0000', '595-36-1111');

Storing the total USA population Genome in Cassandra
(314 million people)
[Figure: column families Chrom-1, Chrom-2, Chrom-3 ... Chrom-Y, each keyed by P-key SS:CT:M and SS:CT:F; labels: 41 million columns, 9 million columns, 3 billion cols, 125 MB of data per row, 1.5 GB]

- 630 million rows (2 for each person)
- 900 PB = 46,000 nodes (20 TB each)
- No Replication

1000 Genomes Project
- Oct 2012: 1092 genomes sequenced
- 3.2 TB data total
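The sizing numbers above can be reproduced with simple arithmetic. The sketch assumes 3 GB per uncompressed genome (from the earlier slide), binary units (1 PB = 1024 TB) and no replication; the variable names are illustrative.

```python
PEOPLE = 314_000_000        # US population used on the slide
ROWS_PER_PERSON = 2         # one row per parent (mother, father)
GB_PER_GENOME = 3           # uncompressed size of one human genome
NODE_CAPACITY_TB = 20       # data per node
UNITS = 1024                # assumption: binary units between GB, TB, PB

total_rows = PEOPLE * ROWS_PER_PERSON
us_total_pb = PEOPLE * GB_PER_GENOME / UNITS / UNITS

# The slide rounds the US total to 900 PB before sizing the cluster.
nodes_for_900_pb = 900 * UNITS // NODE_CAPACITY_TB

# 1000 Genomes Project: 1092 genomes at 3 GB each.
thousand_genomes_tb = 1092 * GB_PER_GENOME / UNITS

print(f"rows: {total_rows:,}")
print(f"US total: {us_total_pb:.0f} PB")
print(f"nodes: {nodes_for_900_pb:,}")
print(f"1000 Genomes: {thousand_genomes_tb:.1f} TB")
```

Any replication factor above 1 would multiply the node count accordingly.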

Cost per Human Genome sequence
[Chart, linear scale, $20 million increments: y-axis $0 to $120,000,000; x-axis 2001 to 2012. Annotation: "Huh ?"]

Cost per Human Genome sequence
[Chart, logarithmic scale, 10x increments: y-axis $100 to $100,000,000; x-axis 2001 to 2012; series: Genome Sequencing, Moore's Law. Annotations: "Super Logarithmic Scale!"; Jan 2008: Switched to next-gen sequencing]

Coding vs Non-coding DNA

98% non-coding DNA

Coding Non-coding

8% of human DNA
(98,000 fragments)

T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G

HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta

Get Ubuntu 12.10
http://www.ubuntu.com/download
(note CentOS/Red Hat has install issues with Biopython)

DataStax Community Edition of Cassandra + OpsCenter
http://www.datastax.com/download/community

Free python tools for biological computation
http://biopython.org

Cassandra python client library
Pycassa https://github.com/pycassa/pycassa

Polychaos dubium
620 billion bp (200x humans)
Sameer Farooqui
sameer@blueplastic.com

- Freelance Big Data consultant and trainer
- Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack

Ex: Hortonworks, Accenture R&D, Symantec

- Co-author on v2 of Cassandra book
- Coming late 2013

linkedin.com/in/blueplastic/

@blueplastic

http://youtu.be/ziqx2hJY8Hg

Resources to get started for beginners…

James Watson: How we discovered DNA
http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html

Juan Enriquez: The life-code that will reshape the future
http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html
