Bio4j: A pioneer graph based database for the integration of biological Big Data




www.ohnosequences.com                   www.bio4j.com
Who am I?
     I am currently working as a bioinformatics consultant/developer/researcher at
     Oh no sequences!


    Oh no what!?
     We are the R&D group at Era7 Bioinformatics.
     We like bioinformatics, cloud computing, NGS, category theory, bacterial
     genomics…
     well, lots of things.


    What about Era7 Bioinformatics?
     Era7 Bioinformatics is a bioinformatics company specializing in sequence analysis,
     knowledge management and sequencing data interpretation.
     Our area of expertise revolves around biological sequence analysis, particularly
     Next Generation Sequencing data management and analysis.




A bit of background…

    In Bioinformatics we have highly interconnected overlapping knowledge spread
    throughout different DBs




However, in most cases all this data is modeled in relational databases,
        sometimes even just as plain CSV files.

               As the amount and diversity of data grows, domain models
               become crazily complicated!




With a relational paradigm, the double implication

                               Entity ⇔ Table

         does not go both ways.


              You get ‘auxiliary’ tables that have no relationship with the small
              piece of reality you are modeling.


              You need ‘artificial’ IDs just to connect entities (and these get mixed
              with IDs that somehow live in reality).


              Entity-relationship models are cool, but in the end you always have to
              deal with ‘raw’ tables plus SQL.


              Integrating/incorporating new knowledge into already existing
              databases is hard, and sometimes not even possible, without changing
              the domain model.




Life in general and biology in particular are probably not 100% like a graph…




                                but one thing’s sure, they are not a set of tables!



What’s Bio4j?


     Bio4j is a graph-based bioinformatics DB that includes most of the data
     available in:

        UniProt KB (Swiss-Prot + TrEMBL)  NCBI Taxonomy

        Gene Ontology (GO)                RefSeq

        UniRef (50, 90, 100)              Enzyme DB




What’s Bio4j?

     It provides a completely new and powerful framework
     for protein related information querying and
     management.


     Since it relies on a high-performance graph engine, data
     is stored in a way that semantically represents its own
     structure




What’s Bio4j?

     Bio4j uses Neo4j technology, a "high-performance graph
     engine with all the features of a mature and robust
     database".

     Because it is built on Neo4j and comes with its own API,
     Bio4j is also very scalable, so anyone can easily
     incorporate their own data and make the most
     out of it.



What’s Bio4j?


                        Everything in Bio4j is open source!



       Released under AGPLv3




Bio4j in numbers

     The current version (0.8) includes:



            Relationships: 717,484,649

            Nodes: 92,667,745

            Relationship types: 144

            Node types: 42



            We’re approaching 1 billion relationships! :)


Let’s dig a bit into Bio4j’s structure…


               Data sources and their relationships:




Bio4j domain model




Bio4j modules


   Bio4j includes different data sources but you may not always be interested in having
   all of them.

   That’s why the importing process is modular and customizable, allowing you to
   import just the data you are interested in.




Bio4j modules


   Keep in mind, however, that the data sources you choose to include in your database
   must be coherent with each other; that is to say, you cannot import, for example,
   protein interactions without first having imported the proteins! ;)
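
A minimal sketch of such a coherence check, written in plain Java rather than with the actual Bio4j importer. The module names and the `ImportPlanSketch` class are hypothetical; the only dependency encoded is the one stated above (interactions need proteins).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ImportPlanSketch {

    // Lists every "module requires prerequisite" pair that the selection violates.
    static List<String> missingPrerequisites(Set<String> selected,
                                             Map<String, List<String>> requires) {
        List<String> problems = new ArrayList<>();
        for (String module : selected) {
            List<String> prerequisites =
                    requires.getOrDefault(module, Collections.<String>emptyList());
            for (String prerequisite : prerequisites) {
                if (!selected.contains(prerequisite)) {
                    problems.add(module + " requires " + prerequisite);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical module names; only this one dependency comes from the slide.
        Map<String, List<String>> requires = new HashMap<>();
        requires.put("protein-interactions", Arrays.asList("proteins"));

        // Selecting interactions alone is not coherent, so one problem is reported.
        Set<String> selected = new LinkedHashSet<>(Arrays.asList("protein-interactions"));
        System.out.println(missingPrerequisites(selected, requires));
    }
}
```

An empty result means the selection can be imported; anything else names the data source that has to be added first.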

   Here’s a schema showing
   the dependencies
   for the importing process:




The Graph DB model: representation


          Core abstractions:

             Nodes

             Relationships between nodes

             Properties on both
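
To make the three abstractions concrete, here is a small in-memory sketch in plain Java. It is not the Neo4j or Bio4j API: `SketchNode`, `SketchRelationship` and the property values are hypothetical and only illustrate nodes, typed relationships, and properties on both.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A node carries its own properties and knows its incoming/outgoing relationships.
class SketchNode {
    final Map<String, Object> properties = new HashMap<>();
    final List<SketchRelationship> outgoing = new ArrayList<>();
    final List<SketchRelationship> incoming = new ArrayList<>();

    // Creates a typed relationship from this node to `end` and registers it on both ends.
    SketchRelationship connectTo(SketchNode end, String type) {
        SketchRelationship relationship = new SketchRelationship(this, end, type);
        outgoing.add(relationship);
        end.incoming.add(relationship);
        return relationship;
    }
}

// A relationship is directed, typed, and can carry properties too.
class SketchRelationship {
    final SketchNode start;
    final SketchNode end;
    final String type;
    final Map<String, Object> properties = new HashMap<>();

    SketchRelationship(SketchNode start, SketchNode end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

class PropertyGraphSketch {
    public static void main(String[] args) {
        SketchNode protein = new SketchNode();
        protein.properties.put("accession", "P12345");

        SketchNode goTerm = new SketchNode();
        goTerm.properties.put("name", "placeholder GO term"); // made-up value

        SketchRelationship annotation = protein.connectTo(goTerm, "PROTEIN_GO_ANNOTATION");
        annotation.properties.put("evidence", "placeholder"); // made-up property

        System.out.println(protein.outgoing.get(0).type);
    }
}
```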




How are things modeled?




                            Couldn’t be simpler!




                 Entities                       →  Nodes

                 Associations / Relationships   →  Edges




Some examples of nodes would be:


                  Protein          GO term          Genome Element


     and relationships:


                  Protein  -[PROTEIN_GO_ANNOTATION]->  GO term




We have developed Bio4jExplorer, a tool meant to serve both as a reference manual
    and as a first point of contact with the Bio4j domain model.



     Bio4jExplorer allows you to:

     • Navigate through all nodes and relationships


     • Access the javadocs of any node or relationship


     • Graphically explore the neighborhood of a node/relationship


     • Look up the indexes that may serve as an entry point for a node


     • Check incoming/outgoing relationships of a specific node


     • Check start/end nodes of a specific relationship




Entry points and indexing

        There are two kinds of entry points for the graph:



               Auxiliary relationships going from the reference node, e.g.

                 - CELLULAR_COMPONENT : leads to the root of GO cellular component
                 sub-ontology

                 - MAIN_DATASET : leads to both main datasets: Swiss-Prot and Trembl


               Node indexing

               There are two types of node indexes:

                 - Exact: Only exact values are considered hits

                 - Fulltext: Regular expressions can be used
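
The difference between the two kinds of hits can be illustrated with a toy lookup in plain Java. This is only a sketch of the matching behaviour described above, not how Neo4j or its underlying index engine works; `IndexLookupSketch` and the node ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class IndexLookupSketch {
    // Indexed value -> ids of the nodes carrying that value (sorted for stable output).
    private final Map<String, List<String>> entries = new TreeMap<>();

    void add(String value, String nodeId) {
        entries.computeIfAbsent(value, v -> new ArrayList<>()).add(nodeId);
    }

    // Exact lookup: a hit only when the whole value is identical.
    List<String> exactHits(String value) {
        return entries.getOrDefault(value, Collections.<String>emptyList());
    }

    // Pattern lookup: a hit when the whole indexed value matches the regular expression.
    List<String> patternHits(String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : entries.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                hits.addAll(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexLookupSketch index = new IndexLookupSketch();
        index.add("Swiss-Prot", "node-1"); // node ids are made up
        index.add("Trembl", "node-2");
        System.out.println(index.exactHits("Swiss"));     // no hit: not the full value
        System.out.println(index.patternHits("Swiss.*")); // matches "Swiss-Prot"
    }
}
```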




Retrieving protein info (Bio4j Java API)

     // Create the manager and the node retriever
     Bio4jManager manager = new Bio4jManager("/mybio4jdb");
     NodeRetriever nR = new NodeRetriever(manager);

     // Look up a protein by its accession
     ProteinNode protein = nR.getProteinNodeByAccession("P12345");


     // Getting more related info...

     List<InterproNode> interpros = protein.getInterpro();
     OrganismNode organism = protein.getOrganism();
     List<GoTermNode> goAnnotations = protein.getGOAnnotations();

     List<ArticleNode> articles = protein.getArticleCitations();

     // Print the PubMed ID of every article citing the protein
     for (ArticleNode article : articles) {
         System.out.println(article.getPubmedId());
     }

     // Don't forget to shut down the manager
     manager.shutDown();




Querying Bio4j with Cypher


     Getting a keyword by its ID

     START k=node:keyword_id_index(keyword_id_index = "KW-0181")
     return k.name, k.id


     Finding circuits/simple cycles of length 3 where at least one protein is from Swiss-Prot
     dataset:

     START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
     MATCH d <-[r:PROTEIN_DATASET]- p,
           circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
     return p.accession, p2.accession, p3.accession
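
For readers who prefer to see the pattern spelled out, the same search can be sketched in plain Java over an adjacency map. This is an illustration, not Bio4j code: `CircuitSketch` and the toy accessions are hypothetical, and the sketch requires the three proteins to be distinct, an assumption based on the phrase "simple cycles".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CircuitSketch {

    // interactions: protein accession -> accessions it points to (directed edges).
    // Returns every circuit p -> p2 -> p3 -> p that starts at one of startProteins.
    static List<List<String>> circuitsOfLength3(Map<String, Set<String>> interactions,
                                                Set<String> startProteins) {
        List<List<String>> circuits = new ArrayList<>();
        for (String p : startProteins) {
            for (String p2 : successors(interactions, p)) {
                if (p2.equals(p)) continue;                      // skip self-interactions
                for (String p3 : successors(interactions, p2)) {
                    if (p3.equals(p) || p3.equals(p2)) continue; // keep the cycle simple
                    if (successors(interactions, p3).contains(p)) {
                        circuits.add(Arrays.asList(p, p2, p3));
                    }
                }
            }
        }
        return circuits;
    }

    private static Set<String> successors(Map<String, Set<String>> graph, String node) {
        return graph.getOrDefault(node, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        // Toy network with made-up accessions: A -> B -> C -> A, plus A -> D.
        Map<String, Set<String>> interactions = new HashMap<>();
        interactions.put("A", new TreeSet<>(Arrays.asList("B", "D")));
        interactions.put("B", new TreeSet<>(Arrays.asList("C")));
        interactions.put("C", new TreeSet<>(Arrays.asList("A")));

        // Only "A" plays the role of a Swiss-Prot protein here.
        Set<String> swissProt = new TreeSet<>(Arrays.asList("A"));
        System.out.println(circuitsOfLength3(interactions, swissProt));
    }
}
```

As in the query, the circuit is anchored at a protein from the chosen dataset, so the same three proteins are reported once per anchored starting point.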


              Check this blog post for more info and our Bio4j Cypher cheat sheet




Mining Bio4j data

      Finding topological patterns in Protein-Protein
                  Interaction networks




Gremlin: a graph traversal language


     Get protein by its accession number and return its full name

     gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
     ==> Aspartate aminotransferase, mitochondrial


     Get proteins (accessions) associated with an InterPro motif (limited to 4 results)

     gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
     ==> E2GK26
     ==> G3PMS4
     ==> G3Q865
     ==> G3PIL8


            Check our Bio4j Gremlin cheat sheet




REST Server


     You can also query and navigate Bio4j with the Neo4j REST API!

     The default representation is JSON, both for responses and for data sent with
     POST/PUT requests.


     Get protein by its accession number: (Q9UR66)

     http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66


     Get outgoing relationships for protein Q9UR66

     http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
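
A small Java sketch of how a client might assemble these two URLs. It only builds the addresses following the shapes shown above and sends nothing; `RestUriSketch`, the `localhost` address and the numeric node id are assumptions (in practice the node id would be taken from the response to the index lookup).

```java
import java.net.URI;

class RestUriSketch {

    // URL of an index lookup: .../index/node/{index name}/{key}/{value}
    static URI indexLookup(String serverUrl, String indexName, String key, String value) {
        return URI.create(serverUrl + "/db/data/index/node/" + indexName + "/" + key + "/" + value);
    }

    // URL listing the outgoing relationships of the node with the given id.
    static URI outgoingRelationships(String serverUrl, long nodeId) {
        return URI.create(serverUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    public static void main(String[] args) {
        String server = "http://localhost:7474"; // placeholder server address
        System.out.println(indexLookup(server, "protein_accession_index",
                "protein_accession_index", "Q9UR66"));
        System.out.println(outgoingRelationships(server, 42L)); // 42 is a made-up node id
    }
}
```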




Visualizations (1)  REST Server Data Browser


      Navigate through Bio4j data in real time !




Visualizations (2)  Bio4j GO Tools




Visualizations (3)  Bio4j + Gephi

      Get really cool graph visualizations using Bio4j and the Gephi visualization and
      exploration platform




Bio4j + Cloud

     We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving
     us the following benefits:


          Interoperability and data distribution

           Releases are available as public EBS snapshots, so AWS users can create
           volumes holding a 100% ready Bio4j DB and attach them to their instances
           in just a few seconds.

           CloudFormation templates:

             - Basic Bio4j DB Instance

             - Bio4j REST Server Instance


           Backup and Storage using S3 (Simple Storage Service)

           We use S3 both for backup (indirectly through the EBS snapshots) and
           storage (directly storing RefSeq sequences as independent S3 files)



Why would I use Bio4j ?


    Massive access to protein/genome/taxonomy… related information


    Integration of your own DBs/resources around common information


    Development of services tailored to your needs built around Bio4j


    Network analysis


    Visualizations


    Besides many others I cannot think of myself…
    If you have something in mind for which Bio4j might be useful, please let us know so we
    can all see how it could help you meet your needs! ;)




OK, but why start all this?
   Were you that bored…?!

    It all started somehow around our need for massive access to protein GO
    (Gene Ontology) annotations.

     At that point I had to develop my own MySQL DB based on the official
     GO SQL database, and problems started right from the beginning:


          I went crazy ‘deciphering’ how to extract UniProt protein annotations
          from the official GO table schema.

          UniProt and official GO protein annotations were not always consistent.

          Populating my own DB took a really long time because of all the joins and
          subqueries needed to get and store the protein annotations.

          Soon enough we also needed massive access to basic
          protein information.




These processes had to be automated for BG7, our bacterial genome annotation system
  designed specifically for NGS data (PLOS ONE 2012, in press)



              The available UniProt web services were too limited:

               - Slow

               - Limited number of queries

               - Too little information available




                  So I downloaded the whole Uniprot DB in XML format
                  (Swiss-Prot + Trembl)

                  and started to have some fun with it !




We got used to having massive direct access to all this protein related
      information…


           So why not add other resources that we needed quite often in most
           projects and that were now becoming a sort of bottleneck
           compared to those already included in Bio4j?

       Then we incorporated:
            - Isoform sequences

            - Protein interactions and features

            - Uniref 50, 90, and 100

            - RefSeq

            - NCBI Taxonomy

            - Enzyme Expasy DB




Bio4j + MG7 + 48 Blast XML files (~1GB each)


     Some numbers:

               • 157 639 502 nodes

               • 742 615 705 relationships

               • 632 832 045 properties

               • 148 relationship types

               • 44 node types


            And it works just fine!


MG7 domain model




What’s MG7?

     MG7 lets you choose which parameters set the thresholds for filtering
     the BLAST hits:

     i.    E-value
     ii.   Identity and query coverage
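
A hedged Java sketch of what such threshold filtering could look like. It is not MG7 code: the `BlastHitSketch` fields, the method names and the cut-off values in `main` are all hypothetical and only illustrate the two filtering options listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified view of one BLAST hit.
class BlastHitSketch {
    final String subjectId;
    final double eValue;
    final double identityPercent;
    final double queryCoveragePercent;

    BlastHitSketch(String subjectId, double eValue,
                   double identityPercent, double queryCoveragePercent) {
        this.subjectId = subjectId;
        this.eValue = eValue;
        this.identityPercent = identityPercent;
        this.queryCoveragePercent = queryCoveragePercent;
    }
}

class HitFilterSketch {

    // Option i: keep hits whose E-value does not exceed the threshold.
    static List<BlastHitSketch> byEValue(List<BlastHitSketch> hits, double maxEValue) {
        List<BlastHitSketch> kept = new ArrayList<>();
        for (BlastHitSketch hit : hits) {
            if (hit.eValue <= maxEValue) kept.add(hit);
        }
        return kept;
    }

    // Option ii: keep hits that reach both the identity and the query coverage threshold.
    static List<BlastHitSketch> byIdentityAndCoverage(List<BlastHitSketch> hits,
                                                      double minIdentityPercent,
                                                      double minQueryCoveragePercent) {
        List<BlastHitSketch> kept = new ArrayList<>();
        for (BlastHitSketch hit : hits) {
            if (hit.identityPercent >= minIdentityPercent
                    && hit.queryCoveragePercent >= minQueryCoveragePercent) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up hits and thresholds, for illustration only.
        List<BlastHitSketch> hits = Arrays.asList(
                new BlastHitSketch("hitA", 1e-30, 98.0, 95.0),
                new BlastHitSketch("hitB", 1e-3, 70.0, 40.0));
        System.out.println(byEValue(hits, 1e-10).size());
        System.out.println(byIdentityAndCoverage(hits, 90.0, 80.0).size());
    }
}
```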


     It allows exporting the results of the analysis to different data formats like:
     • XML
     • CSV
     • Gexf (Graph exchange XML format)

     It also provides heat maps and graph visualizations, along with MG7Viewer, a
     user-friendly interface that gives access to the alignment responsible for each
     functional or taxonomic read assignment and displays the frequencies in the
     taxonomic tree.




Heat-map Viz




Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
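
The idea can be sketched in plain Java on a child-to-parent map. This is one straightforward way to compute a lowest common ancestor, not necessarily the approach used with Bio4j; `LowestCommonAncestorSketch` and the taxon names are hypothetical, the map is assumed to describe a tree, and a taxon counts as its own ancestor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LowestCommonAncestorSketch {

    // parentOf maps each taxon to its parent; the root has no entry.
    // Returns the deepest taxon that is an ancestor of (or equal to) every given taxon,
    // or null if the taxa share no ancestor.
    static String lowestCommonAncestor(Map<String, String> parentOf, List<String> taxa) {
        if (taxa.isEmpty()) {
            throw new IllegalArgumentException("at least one taxon is required");
        }
        // Lineage of the first taxon, ordered from the taxon itself up to the root.
        List<String> candidates = lineage(parentOf, taxa.get(0));
        for (String taxon : taxa.subList(1, taxa.size())) {
            Set<String> ancestors = new HashSet<>(lineage(parentOf, taxon));
            candidates.retainAll(ancestors); // keeps the deepest-first order
        }
        return candidates.isEmpty() ? null : candidates.get(0);
    }

    // Walks from a taxon up to the root, collecting every node on the way.
    static List<String> lineage(Map<String, String> parentOf, String taxon) {
        List<String> path = new ArrayList<>();
        for (String current = taxon; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy taxonomy with made-up names: root -> A -> {B, C}, root -> D.
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("A", "root");
        parentOf.put("B", "A");
        parentOf.put("C", "A");
        parentOf.put("D", "root");

        System.out.println(lowestCommonAncestor(parentOf, Arrays.asList("B", "C")));
        System.out.println(lowestCommonAncestor(parentOf, Arrays.asList("B", "C", "D")));
    }
}
```

In a graph database the parent lookup would be a traversal along the taxonomy's parent relationship instead of a map access; the intersection logic stays the same.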




Future directions


     Improvements in modules

     Integration of even more massive data

     Application to Cancer genomics

     Gene flux tool (New tool for bacterial comparative genomics)

     Pathways tool: data from MetaCyc is going to be included in Bio4j. This data
     would make it possible to dissect the metabolic pathways in which a genome element,
     organism or community (metagenomic samples) is involved.
     Data visualization, network analysis and much more…




Community

     Bio4j has a fast growing internet presence:



            - Twitter: check @bio4j for updates

            - Blog: go to http://blog.bio4j.com

            - Mailing list: ask any question you may have on our list.

            - LinkedIn: check the Bio4j group

            - GitHub issues: don’t be shy! Open a new issue if you think
                            something’s going wrong.




That’s it !


                        Thanks for
                        your time ;)




Representing and reasoning with biological knowledgeBenjamin Good
 
Ontology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesOntology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesConnected Data World
 
Building and Using Ontologies to do biology
Building and Using Ontologies to do biologyBuilding and Using Ontologies to do biology
Building and Using Ontologies to do biologyrobertstevens65
 
Web Science - ISoLA 2012
Web Science - ISoLA 2012Web Science - ISoLA 2012
Web Science - ISoLA 2012Mark Wilkinson
 
BioThings and SmartAPI: building an ecosystem of interoperable biological kno...
BioThings and SmartAPI: building an ecosystem of interoperable biological kno...BioThings and SmartAPI: building an ecosystem of interoperable biological kno...
BioThings and SmartAPI: building an ecosystem of interoperable biological kno...Chunlei Wu
 

Similar to Bio4j: A graph based database for biological Big Data integration (20)

Bio4j: A pioneer graph based database for the integration of biological Big D...
Bio4j: A pioneer graph based database for the integration of biological Big D...Bio4j: A pioneer graph based database for the integration of biological Big D...
Bio4j: A pioneer graph based database for the integration of biological Big D...
 
BITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequences
 
2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...
 
Introduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental BiologyIntroduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental Biology
 
Knetminer Backend Training, Nov 2018
Knetminer Backend Training, Nov 2018Knetminer Backend Training, Nov 2018
Knetminer Backend Training, Nov 2018
 
Ontology Web Services for Semantic Applications
Ontology Web Services for Semantic Applications Ontology Web Services for Semantic Applications
Ontology Web Services for Semantic Applications
 
Pham yang embl-ebi
Pham yang embl-ebiPham yang embl-ebi
Pham yang embl-ebi
 
BioThings SDK: a toolkit for building high-performance data APIs in biology
BioThings SDK: a toolkit for building high-performance data APIs in biologyBioThings SDK: a toolkit for building high-performance data APIs in biology
BioThings SDK: a toolkit for building high-performance data APIs in biology
 
Collaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of LifeCollaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of Life
 
MADICES Mungall 2022.pptx
MADICES Mungall 2022.pptxMADICES Mungall 2022.pptx
MADICES Mungall 2022.pptx
 
InterPro and InterProScan 5.0
InterPro and InterProScan 5.0InterPro and InterProScan 5.0
InterPro and InterProScan 5.0
 
Biothings presentation
Biothings presentationBiothings presentation
Biothings presentation
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
Representing and reasoning with biological knowledge
Representing and reasoning with biological knowledgeRepresenting and reasoning with biological knowledge
Representing and reasoning with biological knowledge
 
Ontology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesOntology Services for the Biomedical Sciences
Ontology Services for the Biomedical Sciences
 
Building and Using Ontologies to do biology
Building and Using Ontologies to do biologyBuilding and Using Ontologies to do biology
Building and Using Ontologies to do biology
 
iBioSearch: The Integrated Biological Database Search
iBioSearch: The Integrated Biological Database SearchiBioSearch: The Integrated Biological Database Search
iBioSearch: The Integrated Biological Database Search
 
Harvester I
Harvester IHarvester I
Harvester I
 
Web Science - ISoLA 2012
Web Science - ISoLA 2012Web Science - ISoLA 2012
Web Science - ISoLA 2012
 
BioThings and SmartAPI: building an ecosystem of interoperable biological kno...
BioThings and SmartAPI: building an ecosystem of interoperable biological kno...BioThings and SmartAPI: building an ecosystem of interoperable biological kno...
BioThings and SmartAPI: building an ecosystem of interoperable biological kno...
 

Recently uploaded

APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 

Bio4j: A graph based database for biological Big Data integration

  • 1. Bio4j: A pioneer graph based database for the integration of biological Big Data www.ohnosequences.com www.bio4j.com
  • 2. Who am I? I am currently working as a bioinformatics consultant/developer/researcher at Oh no sequences! Oh no what!? We are the R&D group at Era7 Bioinformatics. We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things. What about Era7 Bioinformatics? Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation. Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis.
  • 3. A bit of background… In bioinformatics we have highly interconnected, overlapping knowledge spread throughout different DBs.
  • 4. However, in most cases all this data is modeled in relational databases, sometimes even just as plain CSV files. As the amount and diversity of data grows, domain models become crazily complicated!
  • 5. With a relational paradigm, the double implication Entity ⇔ Table does not go both ways.
      - You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.
      - You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).
      - Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.
      - Integrating/incorporating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.
  • 6. Life in general and biology in particular are probably not 100% like a graph… but one thing’s sure, they are not a set of tables!
  • 7. What’s Bio4j? Bio4j is a bioinformatics graph-based DB including most of the data available in:
      - Uniprot KB (Swiss-Prot + Trembl)
      - NCBI Taxonomy
      - Gene Ontology (GO)
      - RefSeq
      - UniRef (50, 90, 100)
      - Enzyme DB
  • 8. What’s Bio4j? It provides a completely new and powerful framework for querying and managing protein-related information. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
  • 9. What’s Bio4j? Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database". Thanks both to being based on Neo4j and to the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the best of it.
  • 10. What’s Bio4j? Everything in Bio4j is open source, released under AGPLv3!
  • 11. Bio4j in numbers. The current version (0.8) includes:
      - Relationships: 717,484,649
      - Nodes: 92,667,745
      - Relationship types: 144
      - Node types: 42
    We’re approaching 1 billion relationships! :)
  • 12. Let’s dig a bit into Bio4j’s structure… Data sources and their relationships:
  • 14. Bio4j modules. Bio4j includes different data sources, but you may not always be interested in having all of them. That’s why the importing process is modular and customizable, allowing you to import just the data you are interested in.
  • 15. Bio4j modules. Keep in mind, however, that you must be coherent when choosing the data sources to include in your database; that is to say, you cannot import, for example, protein interactions without having first included the proteins! ;) Here’s a schema showing the dependencies for the importing process:
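
The coherence rule above can be sketched as a small dependency check. This is only an illustration in plain Java, not the Bio4j importer: the class name `ImportPlanner` and all module names are hypothetical, and the only dependency taken from the slide is that protein interactions require proteins. The other entries are assumptions added to make the example less trivial.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ImportPlanner {
    // Hypothetical module names. Only "protein_interactions needs proteins"
    // comes from the slide; the remaining entries are illustrative assumptions.
    static final Map<String, List<String>> REQUIRES = Map.of(
        "proteins", List.of(),
        "protein_interactions", List.of("proteins"),
        "go", List.of(),
        "protein_go_annotations", List.of("proteins", "go"));

    /**
     * Returns the selected modules in an order where every module comes
     * after its prerequisites. Throws if a prerequisite was not selected.
     */
    static List<String> importOrder(Set<String> selected) {
        Set<String> ordered = new LinkedHashSet<>();
        // Iterate in sorted order so the result is deterministic.
        for (String module : new TreeSet<>(selected)) {
            visit(module, selected, ordered, new HashSet<>());
        }
        return new ArrayList<>(ordered);
    }

    // Depth-first visit: prerequisites are emitted before the module itself.
    private static void visit(String module, Set<String> selected,
                              Set<String> ordered, Set<String> inProgress) {
        if (ordered.contains(module)) {
            return;
        }
        if (!selected.contains(module)) {
            throw new IllegalArgumentException("Missing prerequisite module: " + module);
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Cyclic dependency at: " + module);
        }
        for (String prerequisite : REQUIRES.getOrDefault(module, List.of())) {
            visit(prerequisite, selected, ordered, inProgress);
        }
        inProgress.remove(module);
        ordered.add(module);
    }

    public static void main(String[] args) {
        System.out.println(importOrder(Set.of("proteins", "protein_interactions")));
    }
}
```

Selecting only `protein_interactions` makes `importOrder` throw, which mirrors the "proteins first" rule from the slide.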
  • 16. The graph DB model: representation. Core abstractions:
      - Nodes
      - Relationships between nodes
      - Properties on both
  • 17. How are things modeled? Couldn’t be simpler!
      - Entities → Nodes
      - Associations / Relationships → Edges
  • 18. Some examples of nodes would be: GO term, Protein, Genome Element. And of relationships: Protein -[PROTEIN_GO_ANNOTATION]-> GO term
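
To make the node/relationship/property model concrete, here is a minimal in-memory sketch in plain Java. It does not use the Neo4j or Bio4j APIs. The class and method names (`PropertyGraphSketch`, `connect`, `follow`) are hypothetical, and the GO identifier is a placeholder rather than a real annotation of P12345.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PropertyGraphSketch {
    /** A typed node carrying arbitrary key/value properties. */
    static class Node {
        final String type;
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }
    }

    /** A typed, directed relationship that can carry its own properties. */
    static class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Creates a relationship of the given type from start to end. */
    static Relationship connect(Node start, String type, Node end) {
        Relationship relationship = new Relationship(type, start, end);
        start.outgoing.add(relationship);
        return relationship;
    }

    /** Returns the end nodes of all outgoing relationships of one type. */
    static List<Node> follow(Node start, String relationshipType) {
        List<Node> result = new ArrayList<>();
        for (Relationship relationship : start.outgoing) {
            if (relationship.type.equals(relationshipType)) {
                result.add(relationship.end);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");
        Node goTerm = new Node("GoTerm");
        goTerm.properties.put("id", "GO:0000000"); // placeholder, not a real annotation
        connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        for (Node node : follow(protein, "PROTEIN_GO_ANNOTATION")) {
            System.out.println(node.properties.get("id"));
        }
    }
}
```

The point of the sketch is that the relationship is a first-class object with a type and properties, so no auxiliary join table or artificial ID is needed to link the two entities.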
  • 19. We have developed a tool intended to serve both as a reference manual and as a first contact with the Bio4j domain model: Bio4jExplorer. Bio4jExplorer allows you to:
      - Navigate through all nodes and relationships
      - Access the javadocs of any node or relationship
      - Graphically explore the neighborhood of a node/relationship
      - Look up the indexes that may serve as an entry point for a node
      - Check incoming/outgoing relationships of a specific node
      - Check start/end nodes of a specific relationship
  • 20. Entry points and indexing. There are two kinds of entry points for the graph:
      - Auxiliary relationships going from the reference node, e.g. CELLULAR_COMPONENT (leads to the root of the GO cellular component sub-ontology) and MAIN_DATASET (leads to both main datasets: Swiss-Prot and Trembl).
      - Node indexing. There are two types of node indexes: Exact (only exact values are considered hits) and Fulltext (regular expressions can be used).
  • 21. Retrieving protein info (Bio4j Java API)
        // Create the manager and the node retriever
        Bio4jManager manager = new Bio4jManager("/mybio4jdb");
        NodeRetriever nR = new NodeRetriever(manager);
        ProteinNode protein = nR.getProteinNodeByAccession("P12345");

        // Getting more related info...
        List<InterproNode> interpros = protein.getInterpro();
        OrganismNode organism = protein.getOrganism();
        List<GoTermNode> goAnnotations = protein.getGOAnnotations();
        List<ArticleNode> articles = protein.getArticleCitations();
        for (ArticleNode article : articles) {
            System.out.println(article.getPubmedId());
        }

        // Don't forget to close the manager
        manager.shutDown();
  • 22. Querying Bio4j with Cypher
      Getting a keyword by its ID:
        START k=node:keyword_id_index(keyword_id_index = "KW-0181")
        return k.name, k.id
      Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:
        START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
        MATCH d <-[r:PROTEIN_DATASET]- p,
              circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
        return p.accession, p2.accession, p3.accession
      Check this blog post for more info, and our Bio4j Cypher cheat sheet.
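
For readers less used to pattern matching, the length-3 circuit query can be mirrored in plain Java over an in-memory adjacency map. This is a sketch of the pattern only, not of how Neo4j evaluates it. The class `ThreeCycles`, the map layout, and the one-letter accessions in `main` are hypothetical, and the sketch assumes that the three proteins must be distinct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ThreeCycles {
    /**
     * Returns every directed circuit p -> p2 -> p3 -> p whose first protein
     * belongs to startSet (e.g. the proteins of one dataset).
     * The three proteins are required to be distinct.
     */
    static List<List<String>> find(Map<String, Set<String>> interactsWith,
                                   Set<String> startSet) {
        List<List<String>> circuits = new ArrayList<>();
        for (String p : new TreeSet<>(startSet)) {
            for (String p2 : targets(interactsWith, p)) {
                for (String p3 : targets(interactsWith, p2)) {
                    boolean distinct = !p.equals(p2) && !p2.equals(p3) && !p3.equals(p);
                    boolean closes = targets(interactsWith, p3).contains(p);
                    if (distinct && closes) {
                        circuits.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return circuits;
    }

    // Outgoing interaction targets of one protein, sorted for stable output.
    private static Set<String> targets(Map<String, Set<String>> graph, String protein) {
        return new TreeSet<>(graph.getOrDefault(protein, Set.of()));
    }

    public static void main(String[] args) {
        // Hypothetical accessions, for illustration only.
        Map<String, Set<String>> interactions = Map.of(
            "A", Set.of("B"),
            "B", Set.of("C"),
            "C", Set.of("A", "D"));
        System.out.println(find(interactions, Set.of("A")));
    }
}
```

As in the Cypher version, a circuit is reported once per qualifying starting protein, so the same triangle can appear in several rotations when more than one of its proteins is in the start set.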
  • 23. Mining Bio4j data: finding topological patterns in protein-protein interaction networks
  • 24. A graph traversal language
      Get a protein by its accession number and return its full name:
        gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
        ==> Aspartate aminotransferase, mitochondrial
      Get proteins (accessions) associated with an InterPro motif (limited to 4 results):
        gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
        ==> E2GK26
        ==> G3PMS4
        ==> G3Q865
        ==> G3PIL8
      Check our Bio4j Gremlin cheat sheet.
  • 25. REST Server. You can also query/navigate through Bio4j with the Neo4j REST API! The default representation is JSON, both for responses and for data sent with POST/PUT requests.
      Get a protein by its accession number (Q9UR66):
        http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66
      Get outgoing relationships for protein Q9UR66:
        http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
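
A small Java sketch of how a client might build and call the two URLs above. It assumes Java 11+ for `java.net.http.HttpClient` and a reachable Neo4j server exposing the REST endpoints shown on the slide. The class `Bio4jRestSketch` and its method names are hypothetical helpers, not part of Bio4j or Neo4j.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class Bio4jRestSketch {
    /**
     * Builds the index lookup URL. On the slide the index name is also
     * used as the index key, so it appears twice in the path.
     */
    static URI indexLookup(String serverUrl, String indexName, String value) {
        return URI.create(serverUrl + "/db/data/index/node/"
                + indexName + "/" + indexName + "/" + value);
    }

    /** Builds the URL listing the outgoing relationships of a node id. */
    static URI outgoingRelationships(String serverUrl, long nodeId) {
        return URI.create(serverUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    /** Sends a GET request asking for JSON and returns the response body. */
    static String getJson(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Only prints the URL; call getJson(...) against a running server.
        System.out.println(
            indexLookup("http://localhost:7474", "protein_accession_index", "Q9UR66"));
    }
}
```

The node id needed by `outgoingRelationships` would come from the JSON returned by the index lookup, which is why the slide writes it as `Q9UR66_node_id`.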
  • 26. Visualizations (1): REST Server Data Browser. Navigate through Bio4j data in real time!
  • 27. Visualizations (2): Bio4j GO Tools
  • 28. Visualizations (3): Bio4j + Gephi. Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
  • 29. Bio4j + Cloud. We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:
      - Interoperability and data distribution: releases are available as public EBS snapshots, so AWS users can create volumes with the Bio4j DB 100% ready and attach them to their instances in just a few seconds.
      - CloudFormation templates: Basic Bio4j DB Instance and Bio4j REST Server Instance.
      - Backup and storage using S3 (Simple Storage Service): we use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).
  • 30. Why would I use Bio4j?
      - Massive access to protein/genome/taxonomy… related information
      - Integration of your own DBs/resources around common information
      - Development of services tailored to your needs, built around Bio4j
      - Network analysis
      - Visualizations
      - Besides many others I cannot think of myself…
    If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
  • 31. OK, but why start all this? Were you so bored…?! It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations. At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:
      - I went crazy ‘deciphering’ how to extract Uniprot protein annotations from the official GO table schema.
      - Uniprot and GO official protein annotations were not always consistent.
      - Populating my own DB took really long due to all the joins and subqueries needed to get and store the protein annotations.
    Soon enough we also needed massive access to basic protein information.
  • 32. These processes had to be automated for BG7, our bacterial genome annotation system specifically designed for NGS data (PLOS ONE 2012, in press). The Uniprot web services available were too limited:
      - Slow
      - Limited number of queries
      - Too little information available
    So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + Trembl) and started to have some fun with it!
  • 33. We got used to having massive direct access to all this protein-related information… So why not add other resources that we needed quite often in most projects, and which were now becoming a sort of bottleneck compared to those already included in Bio4j? Then we incorporated:
      - Isoform sequences
      - Protein interactions and features
      - UniRef 50, 90, and 100
      - RefSeq
      - NCBI Taxonomy
      - Enzyme Expasy DB
  • 34. Bio4j + MG7 + 48 Blast XML files (~1GB each). Some numbers:
      - 157,639,502 nodes
      - 742,615,705 relationships
      - 632,832,045 properties
      - 148 relationship types
      - 44 node types
    And it works just fine!
  • 36. What’s MG7? MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits:
      i. E-value
      ii. Identity and query coverage
    It can export the results of the analysis to different data formats, such as:
      - XML
      - CSV
      - GEXF (Graph Exchange XML Format)
    It also provides the user with heat maps and graph visualizations, and includes a user-friendly interface that gives access to the alignment responsible for each functional or taxonomical read assignment and displays the frequencies in the taxonomical tree: MG7Viewer.
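
Threshold-based filtering of BLAST hits might look roughly like the following Java sketch. It is not MG7 code: the `Hit` class, its field names and the threshold values in `main` are all hypothetical, and the sketch assumes that the E-value, identity and query coverage criteria are applied together, which the slide does not state.

```java
import java.util.List;
import java.util.stream.Collectors;

class BlastHitFilter {
    /** Minimal view of one BLAST hit; field names are illustrative. */
    static class Hit {
        final String subjectId;
        final double eValue;
        final double identityPercent;
        final double queryCoveragePercent;

        Hit(String subjectId, double eValue,
            double identityPercent, double queryCoveragePercent) {
            this.subjectId = subjectId;
            this.eValue = eValue;
            this.identityPercent = identityPercent;
            this.queryCoveragePercent = queryCoveragePercent;
        }
    }

    /**
     * Keeps hits whose E-value is at most maxEValue and whose identity and
     * query coverage are at least the given minimum percentages.
     */
    static List<Hit> filter(List<Hit> hits, double maxEValue,
                            double minIdentityPercent, double minQueryCoveragePercent) {
        return hits.stream()
                .filter(hit -> hit.eValue <= maxEValue)
                .filter(hit -> hit.identityPercent >= minIdentityPercent)
                .filter(hit -> hit.queryCoveragePercent >= minQueryCoveragePercent)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up hits and thresholds, for illustration only.
        List<Hit> hits = List.of(
            new Hit("hitA", 1e-30, 98.0, 95.0),
            new Hit("hitB", 1e-3, 99.0, 99.0),
            new Hit("hitC", 1e-40, 60.0, 90.0));
        for (Hit hit : filter(hits, 1e-10, 90.0, 80.0)) {
            System.out.println(hit.subjectId);
        }
    }
}
```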
  • 38. Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
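
The idea behind a lowest common ancestor lookup can be sketched without any database, using a child-to-parent map. This is an assumption-laden illustration, not the Bio4j implementation: `TaxonomyLca` is a hypothetical class, and the toy tree in `main` uses real taxon names but a heavily simplified lineage.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TaxonomyLca {
    /**
     * Returns the deepest node that is an ancestor of (or equal to) every
     * node in the collection. parentOf maps each node to its parent; the
     * root has no entry. The map is assumed to describe a tree (no cycles).
     */
    static String lowestCommonAncestor(Map<String, String> parentOf,
                                       Collection<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("At least one node is required");
        }
        Iterator<String> iterator = nodes.iterator();
        // Candidates are ordered from the first node up to the root.
        List<String> candidates = pathToRoot(parentOf, iterator.next());
        while (iterator.hasNext()) {
            Set<String> ancestors = new HashSet<>(pathToRoot(parentOf, iterator.next()));
            candidates.retainAll(ancestors); // keeps the node-to-root order
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("The nodes share no common ancestor");
        }
        return candidates.get(0);
    }

    /** Lists a node followed by all of its ancestors, ending at the root. */
    static List<String> pathToRoot(Map<String, String> parentOf, String node) {
        List<String> path = new ArrayList<>();
        for (String current = node; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy tree with simplified lineages, for illustration only.
        Map<String, String> parentOf = Map.of(
            "Escherichia", "Proteobacteria",
            "Salmonella", "Proteobacteria",
            "Proteobacteria", "Bacteria",
            "Bacillus", "Firmicutes",
            "Firmicutes", "Bacteria",
            "Bacteria", "root");
        System.out.println(
            lowestCommonAncestor(parentOf, List.of("Escherichia", "Salmonella", "Bacillus")));
    }
}
```

In a graph database the same walk would follow parent relationships between taxon nodes instead of map lookups, but the intersection-of-ancestor-paths logic stays the same.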
  • 39. Future directions:
      - Improvements in modules
      - Integration of even more massive data
      - Application to cancer genomics
      - Gene flux tool (new tool for bacterial comparative genomics)
      - Pathways tool: data from MetaCyc is going to be included in Bio4j. This data would make it possible to dissect the metabolic pathways in which a genome element, organism or community (metagenomic samples) is involved.
      - Data visualization, network analysis and much more…
  • 40. Community. Bio4j has a fast-growing internet presence:
      - Twitter: check @bio4j for updates
      - Blog: go to http://blog.bio4j.com
      - Mailing list: ask any question you may have on our list
      - LinkedIn: check the Bio4j group
      - GitHub issues: don’t be shy! Open a new issue if you think something’s going wrong.
  • 41. That’s it! Thanks for your time ;)