
Hadoop and Graph Databases (Neo4j): Winning Combination for Bioanalytics - Jonathan Freeman @ GraphConnect NY 2013

14,777 views

This talk describes a prototype application that demonstrates how Hadoop and Neo4j can be used together for Big Data analysis.

Published in: Technology, Education

  1. {GraphConnect NYC} Hadoop and Graph Databases (Neo4j): Winning Combination for Bioinformatics. Jonathan Freeman, @freethejazz. {Open Software Integrators} {www.osintegrators.com} {@osintegrators}
  2. Hadoop + Neo4j = Bioanalytics Win. Open Software Integrators
     ● Founded January 2008 by Andrew C. Oliver
       ○ Durham, NC. Revenue and staff have at least doubled every year since 2009.
     ● New office (2012) in Chicago, IL
       ○ We're hiring associate to senior level as well as UI Developers (jQuery, JavaScript, HTML, CSS)
       ○ Up to 50% travel (probably less), salary + bonus, 401k, health, etc.
       ○ Preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery
       ○ Nice to have: Hadoop, Neo4j, MongoDB, Ruby and/or at least one Cloud platform
  3. Hadoop + Neo4j = Bioinformatics Win. Questions to answer
     ● uhh, bioinformatics?
     ● What is Hadoop? Why is it a good fit?
     ● And Neo4j? Why the combination?
     ● I want this now! How do I do it?!?!
  4. Bioinformatics
  5. “dynamic information processing system”
  6. Life http://www.labtimes.org/labtimes/issues/lt2011/lt07/lt_2011_07_26_29.pdf
  7. ● Storing/Retrieving Biological Data
     ● Organizing Biological Data
     ● Analyzing Biological Data
  8. Biological Data
     ● amino acid sequences
     ● nucleotide sequences
     ● protein structures
  9. ● Genetic sequence analysis
     ● Tracing biological evolution
     ● Analysis of gene expression
     ● Studying mutations in cancer
     ● Predicting protein structure and function
     ● Molecular Interaction
  10. ● Genetic sequence analysis
      ● Tracing biological evolution
      ● Analysis of gene expression
      ● Studying mutations in cancer
      ● Predicting protein structure and function
      ● Molecular Interaction
  11. Full Human Genome Sequencing Then: 13 Years, $2,700,000,000
  12. Full Human Genome Sequencing Now: 1 Day, $5,000
  13. http://www.genome.gov/images/content/cost_per_genome_apr.jpg
  14. So what are we waiting for?
  15. [no text]
  16. well, the thing about that…
  17. [no text]
  18. [no text]
  19. ...ATTCCAGGAGTATTGACACCAT...
  20. [no text]
  21. [no text]
  22. [no text]
  23. AGGATTACCAGGA CAAAGGATT TTACCAGGATACCAG TGACAA AAGGATTAC GATACCAGTA CAAGGATT GTGACAA
  24. Hadoop
  25. Infrastructure for distributed computing
      ● HDFS: A distributed file system.
      ● MapReduce: An implementation of a programming model for processing very large data sets.
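
The MapReduce programming model named on this slide can be illustrated with a minimal in-memory sketch. This is not Hadoop code and not the speaker's implementation: it only mimics the map, shuffle, and reduce phases in plain Python, counting k-mers in the short reads shown on the earlier slide. The k-mer length `K` and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Short reads copied from the slides.
READS = [
    "AGGATTACCAGGA", "CAAAGGATT", "TTACCAGGATACCAG", "TGACAA",
    "AAGGATTAC", "GATACCAGTA", "CAAGGATT", "GTGACAA",
]

K = 4  # hypothetical k-mer length, chosen only for this example


def map_read(read, k=K):
    """Map phase: emit (k-mer, 1) for every k-length window of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one k-mer."""
    return key, sum(values)


def kmer_counts(reads, k=K):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    emitted = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(emitted)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    for kmer, count in sorted(kmer_counts(READS).items()):
        print(kmer, count)
```

In a real Hadoop job the map and reduce functions would run on different machines over data stored in HDFS, and the framework would perform the shuffle; here everything runs in one process so the data flow is easy to follow.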
  26. …
  27. [no text]
  28. [no text]
  29. Infrastructure for distributed computing
      ● HDFS: A distributed file system.
      ● MapReduce: An implementation of a programming model for processing very large data sets.
  30. AGGATTACCAGGA CAAAGGATT TTACCAGGATACCAG TGACAA AAGGATTAC GATACCAGTA CAAGGATT GTGACAA
  31. ...ATTCCAGGAGTATTGACACCAT...
  32. 1000 CPU hours
  33. 3 hours, $85, OSS: http://bowtie-bio.sourceforge.net/crossbow/index.shtml
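
A quick back-of-the-envelope calculation shows what these figures imply, assuming (the slides suggest this but do not state it) that the 1000 CPU hours, the 3 hours, and the $85 all describe the same job. The function names are hypothetical.

```python
def average_parallel_cpus(cpu_hours, wall_clock_hours):
    """Average number of CPUs that must run concurrently to finish in time."""
    return cpu_hours / wall_clock_hours


def cost_per_cpu_hour(total_cost_usd, cpu_hours):
    """Effective price paid for each CPU hour of work."""
    return total_cost_usd / cpu_hours


if __name__ == "__main__":
    # Figures taken from the slides; treating them as one job is an assumption.
    cpus = average_parallel_cpus(cpu_hours=1000, wall_clock_hours=3)
    price = cost_per_cpu_hour(total_cost_usd=85, cpu_hours=1000)
    print(f"about {cpus:.0f} CPUs in parallel, ${price:.3f} per CPU hour")
```

Under that assumption, the job needs on the order of a few hundred CPUs working at once, which is the kind of workload that a distributed framework is meant to handle.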
  34. And Neo4j?
  35. [no text]
  36. MATCH (snp)<-[:INFLUENCED_BY]-(conditions)
      WHERE snp.id = "rs1234"
      RETURN conditions;
  37. MATCH (p)-[:GENOME_CONTAINS]->(snp),
            (snp)<-[:INFLUENCED_BY]-(conditions)
      WHERE p.name = "Jonathan Freeman"
      RETURN conditions;
  38. MATCH (p)-[:GENOME_CONTAINS]->(snp),
            (snp)<-[:INFLUENCED_BY]-(conditions)
      WHERE conditions.name = "Parkinsons"
      RETURN p;
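
The three queries above all walk the same two relationships: a person's genome contains SNPs, and conditions are influenced by SNPs. A minimal sketch in plain Python makes that traversal explicit. It uses no Neo4j driver, and the data is invented toy data, not real genetic associations; the dictionary layout, the second person, the second SNP id, and the function names are all assumptions for illustration.

```python
# Hypothetical toy graph; none of these links are real genetic findings.
GENOME_CONTAINS = {  # person name -> SNP ids found in that person's genome
    "Jonathan Freeman": {"rs1234", "rs5678"},
    "Jane Doe": {"rs5678"},
}
INFLUENCED_BY = {  # condition name -> SNP ids that influence it
    "Parkinsons": {"rs1234"},
    "Condition B": {"rs5678"},
}


def conditions_for_snp(snp_id):
    """Mirror of the first query: conditions influenced by one SNP."""
    return {cond for cond, snps in INFLUENCED_BY.items() if snp_id in snps}


def conditions_for_person(name):
    """Mirror of the second query: conditions linked to a person's SNPs."""
    snps = GENOME_CONTAINS.get(name, set())
    return {cond for cond, linked in INFLUENCED_BY.items() if linked & snps}


def people_for_condition(condition):
    """Mirror of the third query: people carrying a SNP tied to a condition."""
    linked = INFLUENCED_BY.get(condition, set())
    return {person for person, snps in GENOME_CONTAINS.items() if snps & linked}


if __name__ == "__main__":
    print(conditions_for_snp("rs1234"))
    print(conditions_for_person("Jonathan Freeman"))
    print(people_for_condition("Parkinsons"))
```

Each function scans every entry, which is fine for a toy example; a graph database instead follows stored relationships directly from the starting node, which is the point of the Cypher patterns on the slides.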
  39. How can I haz?!?!?!1
  40. Step 1: Get local copies
      ● Hadoop: http://hadoop.apache.org/releases.html#Download
      ● Neo4j: http://www.neo4j.org/download
      ● Batch Importer: https://github.com/jexp/batch-import
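
One plausible way to hand Hadoop results to a bulk importer is to write them out as delimited node and relationship files. The sketch below only shows that general idea in plain Python. The column headers, the file layout, and the sample rows are assumptions, not the batch importer's documented format; check that project's README for the headers and options it actually expects.

```python
import csv
import io


def to_tsv(header, rows):
    """Serialize a header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical reducer output: (person, SNP id) pairs.
person_snp_pairs = [("Jonathan Freeman", "rs1234")]

# Assumed node layout: one row per node with a name and a kind.
nodes_tsv = to_tsv(
    ["name", "kind"],
    [("Jonathan Freeman", "person"), ("rs1234", "snp")],
)

# Assumed relationship layout: start node, end node, relationship type.
rels_tsv = to_tsv(
    ["start", "end", "type"],
    [(person, snp, "GENOME_CONTAINS") for person, snp in person_snp_pairs],
)

if __name__ == "__main__":
    print(nodes_tsv)
    print(rels_tsv)
```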
  41. Step 2: Familiarize yourself with the languages
      ● MapReduce: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
      ● Pig: http://pig.apache.org/docs/r0.12.0/start.html
      ● Hive: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  42. Step 3: Find a dataset
      ● Typical starter data: http://www.gutenberg.org/
      ● Amazon’s public data sets: http://aws.amazon.com/publicdatasets/
  43. Step 4: Start Playing!!!
  44. Step 5: Take Hadoop to the cloud
      ● http://aws.amazon.com/elasticmapreduce/
  45. Doing this in production?
      http://blog.xebia.com/2012/11/13/combining-neo4j-and-hadoop-part-i/
      http://blog.xebia.com/2013/01/17/combining-neo4j-and-hadoop-part-ii/
  46. Thank You @freethejazz
  47. Image Attribution:
      Sand Timer: http://bit.ly/HyCAgy
      Money: http://bit.ly/1e4lhS6
      Scraggly DNA drawings: Jonathan Freeman :)
