Bio4j: A pioneer graph based database for the integration of biological Big Data

www.ohnosequences.com www.bio4j.com

What’s Bio4j?
Bio4j is a bioinformatics graph-based DB that includes most of the data
available in:
UniProt (Swiss-Prot + TrEMBL)

Gene Ontology (GO)

UniRef (50,90,100)

NCBI Taxonomy

RefSeq

Enzyme DB


It provides a completely new and powerful framework
for protein related information querying and
management.

Since it relies on a high-performance graph engine, data
is stored in a way that semantically represents its own
structure.


Bio4j uses Neo4j technology, a "high-performance graph
engine with all the features of a mature and robust
database".

Because it is built on Neo4j and comes with its own API,
Bio4j is also very scalable, so anyone can easily
incorporate their own data and make the most of it.


Everything in Bio4j is open source!

Released under AGPLv3


Highly interconnected, overlapping knowledge spread throughout different DBs

Bioinformatics DBs and Graphs

Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the
Cloud

Upcoming features


However, in most cases all this data is modeled in relational databases, and sometimes even just as plain CSV files.

As the amount and diversity of data grow, domain models become crazily complicated!


With a relational paradigm, the double implication

Entity ⇔ Table

does not go both ways.

You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.

You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).

Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.

Integrating/incorporating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.


Life in general and biology in particular are probably not 100% like a graph…

but one thing’s sure, they are not a set of tables!


NoSQL (not only SQL)

NoSQ… what !??

Let’s see what Wikipedia says…

“NoSQL is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.”


NoSQL data models


Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable.

MongoDB (from "humongous") is an open source document-oriented NoSQL database system written in the C++ programming language.


Neo4j is a high-performance, NoSQL graph database with all the features of a mature and robust database.

The programmer works with an object-oriented, flexible network structure rather than with strict and static tables.

All the benefits of a fully transactional, enterprise-strength database.

For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.


Initial motivation

Ok, but why start all this?
Were you so bored…?!

It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations.

At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:

- I went crazy ‘deciphering’ how to extract UniProt protein annotations from the official GO table schema

- UniProt and GO official protein annotations were not always consistent

- Populating my own DB took really long due to all the joins and subqueries needed in order to get and store the protein annotations

Soon enough we also needed massive access to basic protein information.


These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data).

The UniProt web services available were too limited:

- Slow

- Limited number of queries

- Too little information available

So I downloaded the whole UniProt DB in XML format (Swiss-Prot + TrEMBL)

and started to have some fun with it!


We got used to having massive direct access to all this protein-related information…

So why not add other resources that we needed quite often in most projects and that were now becoming a sort of bottleneck compared to all those already included in Bio4j?

Then came:

- Isoform sequences

- Protein interactions and features

- UniRef 50, 90, and 100

- RefSeq

- NCBI Taxonomy

- Enzyme Expasy DB


Bio4j structure

Let’s dig a bit into the Bio4j structure:

Data sources and their relationships:


The Graph DB model: representation

Core abstractions:

Nodes

Relationships between nodes

Properties on both


Let’s dig a bit into the Bio4j structure:

How are things modeled?

Couldn’t be simpler!

Entities → Nodes

Associations / Relationships → Edges


Some examples of nodes would be:

GO term
Protein
Genome Element

and relationships:

Protein --PROTEIN_GO_ANNOTATION--> GO term
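
The node/relationship/property model described in these slides can be sketched in plain Java. This is only an in-memory illustration under stated assumptions: the class names `Node` and `Relationship`, the `connect` helper, and the property keys and values are hypothetical and are not part of the Bio4j or Neo4j APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PropertyGraphSketch {

    /** A node: a label plus arbitrary key/value properties. */
    static final class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        final List<Relationship> incoming = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** A directed, typed relationship; it can carry properties as well. */
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Creates a relationship and registers it on both of its end points. */
    static Relationship connect(Node start, String type, Node end) {
        Relationship rel = new Relationship(type, start, end);
        start.outgoing.add(rel);
        end.incoming.add(rel);
        return rel;
    }

    public static void main(String[] args) {
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");

        Node goTerm = new Node("GO term");
        goTerm.properties.put("id", "GO:0005739"); // illustrative value

        Relationship annotation = connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        annotation.properties.put("evidence", "IEA"); // illustrative value

        // Walk the outgoing relationships of the protein node.
        for (Relationship rel : protein.outgoing) {
            System.out.println(rel.start.properties.get("accession")
                    + " -[" + rel.type + "]-> " + rel.end.properties.get("id"));
        }
    }
}
```

The point of the sketch is that no join table or artificial ID is needed: the relationship itself is a first-class object that links the two entities.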


We have developed a tool intended to serve both as a reference manual and as a first contact with the Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

• Navigate through all nodes and relationships

• Access the javadocs of any node or relationship

• Graphically explore the neighborhood of a node/relationship

• Look up the indexes that may serve as an entry point for a node

• Check incoming/outgoing relationships of a specific node

• Check start/end nodes of a specific relationship


Entry points and indexing

There are two kinds of entry points for the graph:

Auxiliary relationships going from the reference node, e.g.

- CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology

- MAIN_DATASET: leads to both main datasets: Swiss-Prot and TrEMBL

Node indexing

There are two types of node indexes:

- Exact: Only exact values are considered hits

- Fulltext: Regular expressions can be used
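
To make the difference between the two index types concrete, here is a toy Java sketch. It only mirrors the behavior described on the slide (identical values versus pattern matches) and says nothing about how Neo4j implements its indexes; the class `IndexLookupSketch`, its methods, and the node ids are hypothetical, and the indexed names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

final class IndexLookupSketch {
    // Indexed property value -> node id (one node per value, for brevity).
    private final Map<String, String> valueToNodeId = new LinkedHashMap<>();

    void index(String value, String nodeId) {
        valueToNodeId.put(value, nodeId);
    }

    /** Exact lookup: only an identical value is a hit; returns null otherwise. */
    String exact(String value) {
        return valueToNodeId.get(value);
    }

    /** Pattern lookup: every indexed value containing a match of the regex is a hit. */
    Set<String> matching(String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> entry : valueToNodeId.entrySet()) {
            if (pattern.matcher(entry.getKey()).find()) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexLookupSketch names = new IndexLookupSketch();
        names.index("Aspartate aminotransferase, mitochondrial", "node-1");
        names.index("Aspartate aminotransferase, cytoplasmic", "node-2");
        names.index("Alcohol dehydrogenase", "node-3");

        System.out.println(names.exact("Aspartate aminotransferase")); // null: not an identical value
        System.out.println(names.matching("aminotransferase"));        // [node-1, node-2]
    }
}
```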


Some samples

Retrieving protein info (Bio4jModel Java API)

//--creating manager and node retriever----
Bio4jManager manager = new Bio4jManager("/mybio4jdb");
NodeRetriever nR = new NodeRetriever(manager);

ProteinNode protein = nR.getProteinNodeByAccession("P12345");

Getting more related info...

List<InterproNode> interpros = protein.getInterpro();
OrganismNode organism = protein.getOrganism();
List<GoTermNode> goAnnotations = protein.getGOAnnotations();

List<ArticleNode> articles = protein.getArticleCitations();
for (ArticleNode article : articles) {
    System.out.println(article.getPubmedId());
}

// And don't forget to close the Bio4jManager
manager.shutDown();


Proteins with InterPro motif ‘IPR000847’ (Bio4jModel Java API)

//--creating manager and node retriever----
Bio4jManager manager = new Bio4jManager("/mybio4jdb");
NodeRetriever nR = new NodeRetriever(manager);

InterproNode interpro = nR.getInterproById("IPR000847");
ProteinInterproRel rel = new ProteinInterproRel(null);

// getRelationships returns an Iterable, so ask it for an iterator
Iterator<Relationship> iterator =
    interpro.getNode().getRelationships(rel, Direction.INCOMING).iterator();

while (iterator.hasNext()) {
    // the start node of each incoming relationship is a protein
    ProteinNode p = new ProteinNode(iterator.next().getStartNode());
    System.out.println(p.getAccession());
}

// And don't forget to close the Bio4jManager
manager.shutDown();


Querying Bio4j with Cypher

Getting a keyword by its ID

START k=node:keyword_id_index(keyword_id_index = "KW-0181")
return k.name, k.id

Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
      circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
return p.accession, p2.accession, p3.accession

Check this blog post for more info and our Bio4j Cypher cheat sheet
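
For readers who are new to pattern matching, the circuit query can be read as three nested hops that must end where they started. The following Java sketch performs the same search on a small in-memory adjacency map. It is an illustration only, not how Cypher evaluates the query: the class `CircuitSketch`, the method `circuits`, and the ids `P1`..`P4` are hypothetical, and the sketch additionally requires the three proteins to be distinct.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CircuitSketch {

    /**
     * Returns every directed cycle p -> p2 -> p3 -> p over distinct proteins
     * whose first protein belongs to startSet (e.g. the Swiss-Prot proteins).
     */
    static List<List<String>> circuits(Map<String, Set<String>> interactions, Set<String> startSet) {
        List<List<String>> result = new ArrayList<>();
        for (String p : startSet) {
            for (String p2 : interactions.getOrDefault(p, Collections.emptySet())) {
                if (p2.equals(p)) {
                    continue;
                }
                for (String p3 : interactions.getOrDefault(p2, Collections.emptySet())) {
                    if (p3.equals(p) || p3.equals(p2)) {
                        continue;
                    }
                    // Close the circuit: p3 must interact with the starting protein.
                    if (interactions.getOrDefault(p3, Collections.emptySet()).contains(p)) {
                        result.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Directed PROTEIN_PROTEIN_INTERACTION edges between placeholder proteins.
        Map<String, Set<String>> interactions = Map.of(
                "P1", Set.of("P2", "P4"),
                "P2", Set.of("P3"),
                "P3", Set.of("P1"));
        Set<String> swissProt = Set.of("P1");

        System.out.println(circuits(interactions, swissProt)); // [[P1, P2, P3]]
    }
}
```

If several proteins of the same circuit belong to the start set, the sketch reports that circuit once per starting protein.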


Get protein by its accession number and return its full name

gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
==> Aspartate aminotransferase, mitochondrial

Get proteins (accessions) associated with an InterPro motif (limited to 4 results)

gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
==> E2GK26
==> G3PMS4
==> G3Q865
==> G3PIL8

Check our Bio4j Gremlin cheat sheet


REST Server

You can also query/navigate through Bio4j with the REST API!

The default representation is JSON, both for responses and for data sent with POST/PUT requests.

Get protein by its accession number (Q9UR66):

http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66

Get outgoing relationships for protein Q9UR66:

http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
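
As a rough sketch of how a client might call the index-lookup URL above from Java, the example below builds the URL and issues a GET request with the JDK's built-in HTTP client (Java 11+). The class `Bio4jRestSketch`, its method names, and the `http://localhost:7474` default are assumptions made for illustration; only the URL layout comes from the slide.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class Bio4jRestSketch {

    /** Builds the index-lookup URL: /db/data/index/node/{index}/{key}/{value}. */
    static URI indexLookup(String serverUrl, String indexName, String key, String value) {
        String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
        return URI.create(serverUrl + "/db/data/index/node/" + indexName + "/" + key + "/" + encoded);
    }

    /** Sends a GET request asking for JSON and returns the response body as text. */
    static String get(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Optional first argument: base URL of a running Bio4j REST server.
        String serverUrl = args.length > 0 ? args[0] : "http://localhost:7474"; // assumed default
        URI uri = indexLookup(serverUrl, "protein_accession_index", "protein_accession_index", "Q9UR66");
        System.out.println(uri);

        // Only contact the server when a base URL was given explicitly.
        if (args.length > 0) {
            System.out.println(get(uri));
        }
    }
}
```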


Visualizations (1): REST Server Data Browser

Navigate through Bio4j data in real time!


Visualizations (2): Bio4j + Gephi

Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform


Visualizations (3): Bio4j GO Tools


Why would I use Bio4j?

Massive access to protein/genome/taxonomy… related information

Integration of your own DBs/resources around common information

Development of services tailored to your needs built around Bio4j

Network analysis

Visualizations

Besides many others I cannot think of myself…
If you have something in mind for which Bio4j might be useful, please let
us know so we can all see how it could help you meet your needs! ;)


Bio4j + Cloud (1)

We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving us the following benefits:

Interoperability and data distribution

Releases are available as public EBS Snapshots, so AWS users can create volumes with the Bio4j DB 100% ready and attach them to their instances in just a few seconds.

CloudFormation templates:

- Basic Bio4j DB Instance

- Bio4j REST Server Instance


Backup and Storage using S3 (Simple Storage Service)

We use S3 both for backup (indirectly through the EBS snapshots) and storage (directly storing RefSeq sequences as independent S3 files)

What kind of benefits do we get from this?

• Easy to use

• Flexible

• Cost-Effective

• Reliable

• Scalable and high-performance

• Secure


Web servers and service providers in the cloud

Deploying your own web server in AWS using Bio4j as back-end is really simple.

A good example of this would be Bio4jTestServer, a continuously developed server showcasing Web Services based on Bio4j.


Upcoming features

- Relationship indexing for relationships going to and coming from supernodes

No one’s perfect, and Bio4j is no exception.
Relationship fetching can become a bottleneck whenever you have to deal with supernodes (unless you index these relationships). Fortunately this is something that Neo4j is going to fix in the next version(s).

- More resources available (Reactome…)

- Improvements in the importing process

- A more complete version of Bio4jModel

Allowing users to perform almost all sorts of queries without having to worry about the Neo4j core API.

- New tools, services and visualizations built around Bio4j


Community

Bio4j has a fast-growing internet presence:

- Twitter: check @bio4j for updates

- Blog: go to http://blog.bio4j.com

- Mailing list: ask any question you may have in our list.

- LinkedIn: check the Bio4j group

- GitHub issues: don’t be shy! Open a new issue if you think something’s going wrong.


and... Who’s behind all this?

Bio4j is being developed by the Oh no sequences! team and Era7 Bioinformatics members:

- Pablo Pareja Tobes: Main developer (that’s me!)

- Eduardo Pareja Tobes: Technology and architecture main advisor

- Raquel Tobes: Bioinformatics main advisor

- Marina Manrique: Bioinformatics support

- Eduardo Pareja: Scientific advisor


That’s it!

Thanks for your time ;)

