Graph databases in computational biology: case of neo4j and TitanDB

Code used for the demos is available from the https://github.com/chiffa/neo4jDemo repository.
Code used for IO over Reactome is available from: https://github.com/chiffa/PolyPharma

Published in: Technology
5 Comments
  • @Danielbrami: Daniel, I am sorry for the delayed response; I haven't been back to this site in a while. I currently use the legacy neo4j 1.9.6 version, because 2.0 no longer supports Gremlin (efficiency reasons). Installing Gremlin atop 2.0 is possible, but it seemed quite complicated to me and I preferred to stick with the legacy solution. I hope this can still help. Beyond that, in the future, it is much easier to reach me on Twitter (https://twitter.com/andrei_chiffa) or through the GitHub repository associated with the project.
  • Hi Andrei, I am bioinformaticist in San Diego and I am trying to 'tinker' with neo4j and bulbs. But I am having a really tough time - seems to be due to update 2.0+ to neo4j. Any advice on setting it up?
  • @Roelof Pieters: this is quite personal, since my experience is not that extensive. Neo4j works well for prototyping and for applications where the amount of data is reasonable, sharding is not much of an issue, and simultaneous writes and traversals are not critical. It has awesome customer support and is broadly tested and supported. Titan is a little rougher to install and configure and, compared to neo4j, lacks docs and support. However, once it's up and running it is extremely robust and scalable. I've seen people run it in production over Cassandra and Hadoop for a year without an issue, and they're pretty ecstatic about it. If you want to know more, ask one of the guys from Alkemics (http://www.alkemics.com/#/); they would be able to give you a more detailed review.
  • Andrei could you maybe write a bit about the plusses and minuses in using neo4j versus titan. When would you prefer to go for either of the two?
  • take a look at bio4j http://bio4j.com
Slide notes
  • Why join millions of rows if only 10 relationships are interesting? And what should we do if we want traversals?
Transcript of "Graph databases in computational biology: case of neo4j and TitanDB"

    1. Graph databases in computational biology: Neo4j and TitanDB
       Andrei Kucharavy, 23/08/2013
    2. Why even bother?
       ● Rigid structure of interactions = interactome
       ● Knowledge access structure = GO
       ● Those are graphs
    3. Why even bother?
       ● ~1 GB of raw data from Reactome
       ● ~300 MB of data from UniProt / GO / Ensembl / … mappings
       ● => this is way over the conventional 1024 MB JVM limit => heap crash
       ● ~15 minutes to load
       ● Nightmare to visualize and debug
    4. Relational databases (source: Intro to neo4j presentation, jexp @ SlideShare)
    5. Data models
    6. Graph databases (source: Intro to neo4j presentation, jexp @ SlideShare)
    7. Core abstractions
       ● Objects:
         – Nodes (vertices)
         – Relationships between nodes (edges)
         – Properties for vertices and edges
    8. [Figure: example property graph with nodes Node1, Node2 and Node3 and their properties]
    9. Core abstractions
       ● Objects:
         – Nodes (vertices)
         – Relationships between nodes (edges)
         – Properties for vertices and edges
       ● Operations:
         – Immediate relations
         – Traversals
           ● Get the shortest path from j to k
           ● Get the path with least weight from j to k, ...
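The abstractions and operations on slides 7 to 9 can be sketched in plain Python. This is a toy in-memory model, not how neo4j or TitanDB store data; the class name `PropertyGraph`, the `weight` property and the example nodes are assumptions made for illustration only.

```python
import heapq
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}      # node id -> dict of node properties
        self.adjacency = {}  # node id -> list of (neighbour id, edge properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.adjacency.setdefault(node_id, [])

    def add_edge(self, source, target, **properties):
        # Undirected for simplicity: the relationship is stored both ways.
        self.adjacency[source].append((target, properties))
        self.adjacency[target].append((source, properties))

    def neighbours(self, node_id):
        """Immediate relations of a node."""
        return [target for target, _ in self.adjacency[node_id]]

    def _rebuild(self, previous, goal):
        # Walk the predecessor links back from the goal to the start.
        path = []
        current = goal
        while current is not None:
            path.append(current)
            current = previous[current]
        return path[::-1]

    def shortest_path(self, start, goal):
        """Fewest-hops path via breadth-first search; None if unreachable."""
        previous = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                return self._rebuild(previous, goal)
            for neighbour in self.neighbours(current):
                if neighbour not in previous:
                    previous[neighbour] = current
                    queue.append(neighbour)
        return None

    def least_weight_path(self, start, goal, weight_key="weight"):
        """Dijkstra over a non-negative numeric edge property."""
        best = {start: 0.0}
        previous = {start: None}
        heap = [(0.0, start)]
        while heap:
            cost, current = heapq.heappop(heap)
            if current == goal:
                return cost, self._rebuild(previous, goal)
            if cost > best[current]:
                continue  # stale heap entry, a cheaper route was already found
            for neighbour, props in self.adjacency[current]:
                new_cost = cost + props[weight_key]
                if new_cost < best.get(neighbour, float("inf")):
                    best[neighbour] = new_cost
                    previous[neighbour] = current
                    heapq.heappush(heap, (new_cost, neighbour))
        return None


graph = PropertyGraph()
for name in ("j", "a", "b", "k"):
    graph.add_node(name, kind="protein")
graph.add_edge("j", "k", weight=10.0)  # direct but expensive
graph.add_edge("j", "a", weight=1.0)
graph.add_edge("a", "b", weight=1.0)
graph.add_edge("b", "k", weight=1.0)

hops = graph.shortest_path("j", "k")                # fewest relationships
cost, cheapest = graph.least_weight_path("j", "k")  # lowest total weight
```

In this example the two traversals disagree on purpose: the fewest-hops path uses the direct edge, while the least-weight path goes through `a` and `b`.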
    10. Main advantages promised
        ● Increased speed for graph-type applications
          – Avoid a “join” on 10M rows to get ~20 “related” elements
          – Traversals
        ● Simplified programming
          – Java objects
          – XML / RDF / OWL
          – Schema alterations
    11. Main advantages promised
        ● Ease of deployment / maintenance:
          – Scalability
          – Complexity
          – Modifications
          – Schema migrations
    12. neo4j
        ● Started in 2003
        ● Schema-free
        ● ACID transactions
        ● Reasonably scalable, reasonably replicable
        ● 10 000 open source projects, 1000 commercial customers
    13. neo4j
        ● Started in 2003
        ● Schema-free
        ● ACID transactions
        ● 10 000 open source projects, 1000 commercial customers
        ● 100 % open source
    14. https://github.com/neo4j/neo4j
    15. neo4j
        ● Started in 2003
        ● Schema-free
        ● ACID transactions
        ● 10 000 open source projects, 1000 commercial customers
        ● 100 % open source
        ● Master-slave replication
        ● AGPL 3 license: if you are open source, it is free, even the support
    16. neo4j
        ● Started in 2003
        ● Schema-free
        ● ACID transactions
        ● 10 000 open source projects, 1000 commercial customers
        ● 100 % open source
        ● Master-slave replication
        ● AGPL 3 license: if you are open source, it is free, even the support
        ● Plus a graphical interface => debugging!
    17. Deployment demo
        ● cd to the specific DB location (better as a special user)
        ● ./neo4j start
        ● ./neo4j stop
        ● => Serves localhost:7474
        ● 40 000 files => mainly indexes / user accesses
    18. Under the hood
        ● Java & JVM
        ● Split in two
          – In-RAM “pre-heated” vs. whole on-HDD
        ● Scalability:
          – 32 G nodes / 32 G relations / 64 G properties
          – 1 M traversals / sec, independent of graph size
        ● Lucene index: instant search
    19. Interfaces
        ● Two-fold interface:
          – REST server
          – Local instance
        ● Dedicated query language: Cypher
    20. Interfaces
        ● Two-fold interface:
          – REST server
          – Local instance
        ● Dedicated query language: Cypher
        ● Interoperability: support for the TinkerPop stack
    21. TinkerPop stack, REST APIs
    22. What is Gremlin
        ● Domain-specific graph language
        ● Built atop Groovy
          – JVM
          – Dynamically evaluated
          – ~ scripting in Java
        ● Core = Java
          – Java
          – Scala / Clojure
          – JPype / Jython / JRuby
        ● Supported by most graph databases
    23. Interfaces
        ● Two-fold interface:
          – REST server
          – Local instance
        ● Dedicated query language: Cypher
        ● Interoperability: support for the TinkerPop stack
        ● Native bindings:
          – Java
          – Python, PHP, Ruby / Rails, node.js, .Net
          – Scala, Clojure, Haskell, ...
        ● My stack:
          – Native Python and Python through Bulbs and REST
    24. Python + Bulbs + REST + neo4j
        ● Bulbs = Pythonic wrapper for Gremlin
        ● Portability (Blueprints + Rexster)
          – Titan DB (will be discussed later on)
          – Bitsy
          – InfiniteGraph
          – Sqrrl
          – ArangoDB
        ● Class inheritance and DDT:
          – Java-like class inheritance
    25. Demo 2
        ● Datatype declaration
        ● GraphDB connection and declaration
        ● Fill-in
        ● Graphical interface
    26. neo4j-specific
        ● Lucene index in the backend
          – Exact indexing => constant-time retrieval
          – Full-text indexing => searching partial names and adding the missing links
            ● SRC = SRC_HUMAN = SRC1
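The difference between exact and partial-name lookup on slide 26 can be illustrated without a database. The sketch below is not Lucene: a dictionary stands in for the exact index, and a linear substring scan stands in for full-text search. The class `NameIndex` and the node ids are hypothetical.

```python
class NameIndex:
    """Toy name index: exact lookup by key, partial lookup by substring scan."""

    def __init__(self):
        self._by_name = {}  # lower-cased name -> set of node ids

    def add(self, name, node_id):
        self._by_name.setdefault(name.lower(), set()).add(node_id)

    def get(self, name):
        """Exact match: one dictionary lookup, independent of index size."""
        return set(self._by_name.get(name.lower(), set()))

    def search(self, fragment):
        """Partial match: every indexed name containing the fragment."""
        fragment = fragment.lower()
        hits = set()
        for name, node_ids in self._by_name.items():
            if fragment in name:
                hits |= node_ids
        return hits


index = NameIndex()
index.add("SRC", 1)
index.add("SRC_HUMAN", 2)
index.add("SRC1", 3)
index.add("EGFR", 4)

exact_hits = index.get("SRC")       # only the node named exactly "SRC"
partial_hits = index.search("src")  # candidates for "the same protein"
```

The partial search returns the three SRC-like names as candidates; deciding which of them really denote the same entity, and adding the missing links, is still a curation step.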
    27. Demo 3
        ● Constant node retrieval time / internode connection distance time
        ● Performing the partial search
        ● Adding missing links
        ● Neo4j server vs. local database
        ● Performing simple Gremlin queries
    28. Use case:
        ● Existing map of correlations: [diagram: ProteinDomain, Domain Type, Protein function]
    29. Use case:
        ● Existing map of correlations: [diagram: ProteinDomain, Domain Type, Protein function]
        ● Wanted map of correlations: [diagram: ProteinDomain, Domain Type, Protein function]
    30. Use case:
        ● Existing map of correlations: [diagram: ProteinDomain, Domain Type, Protein function]
        ● Wanted map of correlations: [diagram: ProteinDomain, Domain Type, Protein function]
    31. Use case
        ● SQL with Python / SQLAlchemy:
          – Create a new table
          – Add foreign keys, primary key, indexes, ...
          – Add the table to the data model
          – Create functions for access / update
          – ...
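The relational steps on slide 31 can be sketched with `sqlite3` from the standard library (swapped in for SQLAlchemy so the example is self-contained). The tables `protein`, `go_function` and `protein_go_function` and the helper functions are hypothetical stand-ins for whatever new correlation is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing schema (hypothetical): two entity tables, not yet linked.
conn.executescript("""
    CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE go_function (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO protein VALUES (1, 'SRC');
    INSERT INTO go_function VALUES (1, 'protein tyrosine kinase activity');
""")

# Step 1 and 2: a new kind of correlation needs a new link table,
# with its foreign keys, primary key and an index for reverse lookups.
conn.executescript("""
    CREATE TABLE protein_go_function (
        protein_id     INTEGER NOT NULL REFERENCES protein(id),
        go_function_id INTEGER NOT NULL REFERENCES go_function(id),
        PRIMARY KEY (protein_id, go_function_id)
    );
    CREATE INDEX idx_pgf_function ON protein_go_function(go_function_id);
""")


# Step 3 and 4: the data model needs matching access / update functions.
def link_protein_to_function(connection, protein_id, function_id):
    connection.execute(
        "INSERT INTO protein_go_function VALUES (?, ?)",
        (protein_id, function_id),
    )


def functions_of(connection, protein_id):
    rows = connection.execute(
        "SELECT f.name FROM go_function AS f "
        "JOIN protein_go_function AS pf ON pf.go_function_id = f.id "
        "WHERE pf.protein_id = ? ORDER BY f.name",
        (protein_id,),
    )
    return [name for (name,) in rows]


link_protein_to_function(conn, 1, 1)
src_functions = functions_of(conn, 1)
```

In a property graph the same change is one new relationship type between two existing nodes, with no schema migration, which is the contrast the use case is making.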
    32. Use case
        ● Bulbs / Neo4j => live demo
    33. Use case 2
        ● In the human proteome, find all chemical groups A and B separated by less than x Å
          – Database structure:
            ● Suppose all the proteins are connected to a “Type node”
            ● Each protein is linked to its domains, each domain to its amino acids, each amino acid to its chemical groups and ultimately to atoms
            ● Chemical groups have an assigned distance between them and the groups they are close to
          – Algorithm:
            ● Select a protein of interest
            ● Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr. / a.a.)
            ● Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch.gr.)
            ● Recover the proteins: 1000 * 3 * 100 * 2
            ● With 1M traversals per second => 0.6 s to execute the query
          – With TitanDB, ElasticSearch and geo-queries (all within a circle of radius x), higher speeds are possible
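The 0.6 s estimate on slide 33 is a product of the per-level fan-outs divided by the traversal rate. The sketch below reproduces that arithmetic with the slide's round numbers; the variable names are assumptions, and the final factor of 2 is taken from the slide as given.

```python
# Round figures from slide 33; all of them are order-of-magnitude assumptions.
amino_acids_per_protein = 1000
chemical_groups_per_amino_acid = 3
possible_contacts_per_group = 100
recovery_factor = 2              # "Recover the proteins" step, as on the slide
traversals_per_second = 1_000_000

chemical_groups = amino_acids_per_protein * chemical_groups_per_amino_acid
candidate_contacts = chemical_groups * possible_contacts_per_group
total_traversals = candidate_contacts * recovery_factor

estimated_seconds = total_traversals / traversals_per_second
print(f"{total_traversals} traversals, about {estimated_seconds:.1f} s")
```

The estimate scales linearly in each factor, so a protein ten times larger or a contact list ten times longer moves the query from sub-second to several seconds.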
    34. Limitations
        ● Node number:
          – 32 giga nodes / edges is a lot on servers
            ● ~100 TB of data
            ● 1 Unix partition
            ● 40 000++ simultaneously opened files (indexes + users)
          – 32 giga edges is relatively small in biology
            ● ~43 M nodes in UniProt only
            ● GO x UniProt x EMBL x GeneNames x interaction maps x localisations x names & accesses ...
            ● All potentially druggable molecules, all protein atoms, all atom-atom interactions
    35. Limitations
        ● Absence of parallelism / distribution
          – One process at a time:
            ● 1 traversal at a time
            ● ACID => database locks
            ● Though master-slave distribution
          – Single partition
            ● Replication
            ● 100 TB + RAID!?
            ● Though full support for AWS and VMs
    36. Limitations
        ● Bulbs: Python over Gremlin scripts
          – Gremlin → Groovy → JVM → do what you want => SQL-style (Gremlin) injections
          – Request sanitization needed
            ● Hashes of the queries without variables
            ● Pre-filtering before query referral to the server
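One way to read "hashes of the queries without variables" on slide 36 is an allow-list of script templates, with user input passed only as separate parameters. The sketch below is an assumption about how such a pre-filter could look, not the author's implementation; the template text, `APPROVED_TEMPLATES` and `prepare_query` are hypothetical, and the script is treated as an opaque string that is never sent anywhere.

```python
import hashlib


def template_digest(script):
    """SHA-256 of a script template (the query text without any values)."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


# Hypothetical allow-list: only scripts written by the developers are hashed.
APPROVED_TEMPLATES = {
    template_digest("g.idx('proteins')[[name: protein_name]]"),
}

ALLOWED_PARAM_TYPES = (str, int, float, bool)


def prepare_query(script, params):
    """Pre-filter a request before it is referred to the server.

    The script must match an approved template byte for byte, and the
    values must be plain scalars that travel separately from the script.
    """
    if template_digest(script) not in APPROVED_TEMPLATES:
        raise PermissionError("script is not an approved template")
    for key, value in params.items():
        if not isinstance(value, ALLOWED_PARAM_TYPES):
            raise TypeError(f"parameter {key!r} has an unsupported type")
    return script, dict(params)


safe_script, safe_params = prepare_query(
    "g.idx('proteins')[[name: protein_name]]",
    {"protein_name": "SRC_HUMAN'); System.exit(0); //"},
)
```

Because the hostile string stays in the parameter dictionary and is never concatenated into the script, it cannot change what the server evaluates, provided the client sends parameters as separate bindings.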
    37. Limitations
        ● Bulk insert not natively implemented in Bulbs:
          – Insertion rate ~10 nodes / sec
          – Naive Python binding tests:
            ● ~60 ms for ACID compliance (HDD write)
            ● ~1.8 ms / node for cold insertion routines (HDD sequential write)
            ● ~0.3 ms / node for hot-write insertion routines (RAM buffer)
          – 500 - 1500 nodes / sec if packages of 1000
            ● 6 h to fill the database up to the theoretical limit
          – github.com/chefjerome/graphalchemy implements an efficient flush based on Bulbs (alpha and thus unstable right now)
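The timings on slide 37 suggest why batching helps: the ~60 ms commit cost is paid once per transaction rather than once per node. The sketch below is a rough model under that assumption (fixed commit overhead plus a constant per-node cost), not a benchmark; the function name and constants are illustrative.

```python
COMMIT_OVERHEAD_S = 0.060   # ~60 ms per ACID commit (HDD write)
COLD_PER_NODE_S = 0.0018    # ~1.8 ms per node, cold insertion
HOT_PER_NODE_S = 0.0003     # ~0.3 ms per node, hot insertion


def nodes_per_second(batch_size, per_node_s, commit_overhead_s=COMMIT_OVERHEAD_S):
    """Throughput when one commit covers `batch_size` node insertions."""
    seconds_per_batch = commit_overhead_s + batch_size * per_node_s
    return batch_size / seconds_per_batch


one_by_one = nodes_per_second(1, COLD_PER_NODE_S)
batched_cold = nodes_per_second(1000, COLD_PER_NODE_S)
batched_hot = nodes_per_second(1000, HOT_PER_NODE_S)

print(f"one node per commit:    {one_by_one:7.0f} nodes/s")
print(f"1000 per commit (cold): {batched_cold:7.0f} nodes/s")
print(f"1000 per commit (hot):  {batched_hot:7.0f} nodes/s")
```

Under this model, one node per commit lands in the same order of magnitude as the ~10 nodes / sec observed, and the cold batched figure falls near the lower end of the measured 500 - 1500 range. The hot figure is more optimistic than what was measured, so the model ignores some real overhead.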
    38. Port to TitanDB
    39. TitanDB
        ● HBase / Cassandra / BerkeleyDB as backend
    40. TitanDB
        ● HBase / Cassandra / BerkeleyDB as storage backend
        ● Lucene / ElasticSearch as indexing backend
        ● Served over a Rexster server
        ● Full distribution
        ● > 500 simultaneous connections (5000 is still stable)
        ● Automatic replication (Hadoop)
        ● Multiple simultaneous queries
        ● Sky is the only limit for storage quantities => TitanDB / HBase is stable up to 5 PB in production
    41. Neo4j for bioinformatics: parsing and curating Reactome.org
        ● Reactome.org:
          – BioPAX: XML / RDF / OWL
    42. Neo4j for bioinformatics: parsing and curating Reactome.org
        ● Reactome.org structure:
          – BioPAX: XML / RDF / OWL
          – Physical entities:
            ● Proteins, small molecules, complexes, RNA, DNA
            ● Fragments of physical entities
          – Interaction:
            ● Degradation / polymerisation / biochemical reactions
            ● Molecular interaction
            ● Genetic interaction
          – Pathways, genes, post-translational modifications...
    43. Protege
    44. Neo4j for bioinformatics: parsing and curating Reactome.org
        ● Reality of Reactome.org:
          – Main connected component: ~22 000 entities, but 6 others with >100 elements
          – Presence of generic classes: groups of objects
          – Proteins = mix of proteins, domains, groups, groups of domains…
          – 15 000 proteins, 5000 UniProt references
          – 156 genes, 56 RNA molecules => translation / transcription regulation is not well described
    45. Neo4j for bioinformatics: parsing and curating Reactome.org
        ● Reality of Reactome.org:
          – Heavily comment-based: case of SRC
    46. Neo4j for bioinformatics: parsing and curating Reactome.org
    47. Neo4j for bioinformatics: parsing and curating Reactome.org
        ● Completed with HiNT protein-protein interactions from Yue lab at Cornell
        ● Re-indexed:
          – SwissProt protein names
          – Full names from SwissProt
          – Gene names
          – KEGG, GO, EMBL, ChEBI cross-references
          – PDB implemented, not re-run
    48. Neo4j for bioinformatics: parsing and curating Reactome.org
        ● Example of pathway parsing
    49. Conclusion
        ● Systems biology is more about graphs than about systems of tables
        ● Graph databases are awesome
        ● Neo4j is terrific
        ● TitanDB is cool
        ● You should definitely pick one of them, load the Reactome.org dataset or whatever you are interested in, and play with it.
    50. Questions?
    51. Thanks
        Pr. Philp Bourne, Pr. Bart Deplanke, Cedric Merlot, Li Xie, Spencer Blieven, Jiang Wang, Julia Ponomarenko, Cole Christie, Andreas Prilic, Lilia Iakoucheva