Neo4j and Bioinformatics




www.ohnosequences.com                   www.bio4j.com
But who’s this guy talking here?
     I am currently working as a bioinformatics consultant/developer/researcher at
     Oh no sequences!


     Oh no what!?
     We are the R&D group at Era7 Bioinformatics.
     We like bioinformatics, cloud computing, NGS, category theory, bacterial
     genomics…
     well, lots of things.


     What about Era7 Bioinformatics?
     Era7 Bioinformatics is a bioinformatics company specializing in sequence analysis,
     knowledge management and sequencing data interpretation.
     Our area of expertise revolves around biological sequence analysis, particularly
     Next Generation Sequencing data management and analysis.




In bioinformatics we have highly interconnected, overlapping knowledge spread
    across different DBs




However, in most cases all this data is modeled in relational databases.
        Sometimes it even lives in plain CSV files.

               As the amount and diversity of data grows, domain models
               become crazily complicated!




With a relational paradigm, the double implication

                              Entity ⇔ Table

         does not go both ways.


              You get ‘auxiliary’ tables that have no relationship with the small
              piece of reality you are modeling.


              You need ‘artificial’ IDs only for connecting entities (and these are mixed
              with IDs that somehow live in reality)


              Entity-relationship models are cool but in the end you always have to
              deal with ‘raw’ tables plus SQL.


              Integrating/incorporating new knowledge into already existing
              databases is hard, and sometimes not even possible without changing
              the domain model




Life in general and biology in particular are probably not 100% like a graph…




                                but one thing’s sure, they are not a set of tables!



NoSQL data models




Neo4j is a high-performance NoSQL graph database with all
           the features of a mature and robust database.


           The programmer works with an object-oriented, flexible
           network structure rather than with strict and static tables


           All the benefits of a fully transactional, enterprise-strength
           database.


           For many applications, Neo4j offers performance
           improvements on the order of 1000x or more compared to
           relational DBs.




What’s Bio4j?


     Bio4j is a bioinformatics graph-based DB that includes most of the data
     available in:

        UniProt KB (Swiss-Prot + TrEMBL)   NCBI Taxonomy

        Gene Ontology (GO)                 RefSeq

        UniRef (50, 90, 100)               Enzyme DB




What’s Bio4j?

     It provides a completely new and powerful framework
     for querying and managing protein-related
     information.


     Since it relies on a high-performance graph engine, data
     is stored in a way that semantically represents its own
     structure




What’s Bio4j?

     Bio4j uses Neo4j technology, a "high-performance graph
     engine with all the features of a mature and robust
     database".

     Because it is built on the Neo4j DB and provides its own
     API, Bio4j is also very scalable, allowing anyone
     to easily incorporate their own data and make the
     most of it.



What’s Bio4j?


                        Everything in Bio4j is open source!



       Released under AGPLv3




Bio4j in numbers


     The current version (0.7) includes:



             Relationships: 530,642,683

             Nodes: 76,071,411

             Relationship types: 139

             Node types: 38




Let’s dig a bit into the Bio4j structure…


               Data sources and their relationships:




Bio4j domain model




The Graph DB model: representation


          Core abstractions:

             Nodes

             Relationships between nodes

             Properties on both




How are things modeled?




                            Couldn’t be simpler!




                 Entities           Associations / Relationships




                  Nodes                        Edges




Some examples of nodes would be:


                                      GO term
                  Protein
                                                         Genome Element




     and relationships:




                            Protein   PROTEIN_GO_ANNOTATION


                                                      GO term
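As a rough illustration of the model (nodes, typed relationships, properties on both), here is a minimal in-memory sketch in plain Java. It is not the Bio4j or Neo4j API: the classes `Node` and `Relationship`, the `evidence` property and the GO identifier are hypothetical placeholders; only the relationship type name PROTEIN_GO_ANNOTATION and the accession P12345 come from the slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PropertyGraphSketch {

    /** A node: a type label plus arbitrary key/value properties. */
    static final class Node {
        final String type;
        final Map<String, Object> properties = new LinkedHashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        final List<Relationship> incoming = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }

        /** Creates a typed, directed relationship from this node to target. */
        Relationship relateTo(Node target, String relationshipType) {
            Relationship r = new Relationship(this, target, relationshipType);
            outgoing.add(r);
            target.incoming.add(r);
            return r;
        }
    }

    /** A relationship: directed, typed, and able to carry properties too. */
    static final class Relationship {
        final Node start;
        final Node end;
        final String type;
        final Map<String, Object> properties = new LinkedHashMap<>();

        Relationship(Node start, Node end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Builds Protein -[PROTEIN_GO_ANNOTATION]-> GO term and returns the protein. */
    static Node buildExample() {
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");
        Node goTerm = new Node("GO term");
        goTerm.properties.put("id", "GO:0000000");      // placeholder identifier
        Relationship r = protein.relateTo(goTerm, "PROTEIN_GO_ANNOTATION");
        r.properties.put("evidence", "placeholder");    // hypothetical property
        return protein;
    }

    public static void main(String[] args) {
        Node protein = buildExample();
        for (Relationship r : protein.outgoing) {
            System.out.println(protein.type + " -[" + r.type + "]-> " + r.end.type);
        }
    }
}
```

Keeping both `outgoing` and `incoming` lists on each node is what makes "check incoming/outgoing relationships of a node" a direct lookup rather than a search.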




We have developed a tool meant to serve both as a reference manual and as a
    first contact with the Bio4j domain model: Bio4jExplorer



     Bio4jExplorer allows you to:

     • Navigate through all nodes and relationships


     • Access the javadocs of any node or relationship


     • Graphically explore the neighborhood of a node/relationship


     • Look up the indexes that may serve as an entry point for a node


     • Check incoming/outgoing relationships of a specific node


     • Check start/end nodes of a specific relationship




Entry points and indexing

         There are two kinds of entry points for the graph:



               Auxiliary relationships going from the reference node, e.g.

                 - CELLULAR_COMPONENT: leads to the root of GO cellular component
                 sub-ontology

                 - MAIN_DATASET: leads to both main datasets: Swiss-Prot and TrEMBL


               Node indexing

               There are two types of node indexes:

                 - Exact: Only exact values are considered hits

                 - Fulltext: Regular expressions can be used
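The difference between the two index types can be sketched with a toy in-memory index in plain Java. This is only an illustration of the matching semantics, under the assumption that "fulltext" means pattern matching as the slide states; a real index does not scan every entry, and the class `IndexSketch`, the node identifiers and the indexed values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class IndexSketch {

    // indexed value -> node identifier (insertion order kept for stable output)
    private final Map<String, String> entries = new LinkedHashMap<>();

    void add(String value, String nodeId) {
        entries.put(value, nodeId);
    }

    /** Exact lookup: only an identical value is a hit. Returns null when absent. */
    String exact(String value) {
        return entries.get(value);
    }

    /** Pattern lookup: every value that fully matches the regex is a hit. */
    List<String> matching(String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            if (pattern.matcher(e.getKey()).matches()) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }

    /** Two hypothetical dataset nodes indexed by name. */
    static IndexSketch example() {
        IndexSketch index = new IndexSketch();
        index.add("Swiss-Prot", "node-1");
        index.add("TrEMBL", "node-2");
        return index;
    }

    public static void main(String[] args) {
        IndexSketch index = example();
        System.out.println(index.exact("Swiss-Prot"));   // exact value: a hit
        System.out.println(index.exact("swiss-prot"));   // different case: no hit
        System.out.println(index.matching("swiss.*"));   // pattern: a hit
    }
}
```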




Retrieving protein info (Bio4jModel Java API)

     // Create the manager and the node retriever
     // (Bio4jManager and NodeRetriever come from the Bio4jModel library)
     Bio4jManager manager = new Bio4jManager("/mybio4jdb");
     NodeRetriever nR = new NodeRetriever(manager);

     // Look up a protein by its accession
     ProteinNode protein = nR.getProteinNodeByAccession("P12345");

     // Getting more related info...
     List<InterproNode> interpros = protein.getInterpro();
     OrganismNode organism = protein.getOrganism();
     List<GoTermNode> goAnnotations = protein.getGOAnnotations();

     List<ArticleNode> articles = protein.getArticleCitations();

     // Print the PubMed ID of every article cited for the protein
     for (ArticleNode article : articles) {
         System.out.println(article.getPubmedId());
     }

     // Don't forget to shut down the manager
     manager.shutDown();




Querying Bio4j with Cypher


     Getting a keyword by its ID

     START k=node:keyword_id_index(keyword_id_index = "KW-0181")
     return k.name, k.id


     Finding circuits/simple cycles of length 3 where at least one protein is from the
     Swiss-Prot dataset:

     START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
     MATCH d <-[r:PROTEIN_DATASET]- p,
           circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
     return p.accession, p2.accession, p3.accession
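To make the pattern in the circuit query concrete, here is a hedged sketch of the same search over a toy in-memory interaction map in plain Java. It does not touch Bio4j: the class `TriangleSketch`, the protein names A to D and the requirement that the three proteins be distinct are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TriangleSketch {

    /**
     * Finds directed circuits p -> p2 -> p3 -> p whose first protein is in starts.
     * interactions maps each protein to the proteins it points to.
     */
    static List<List<String>> circuits(Map<String, Set<String>> interactions,
                                       Set<String> starts) {
        List<List<String>> found = new ArrayList<>();
        for (String p : starts) {
            for (String p2 : interactions.getOrDefault(p, Set.of())) {
                for (String p3 : interactions.getOrDefault(p2, Set.of())) {
                    // Assumption: a simple cycle visits three distinct proteins.
                    boolean distinct = !p.equals(p2) && !p2.equals(p3) && !p.equals(p3);
                    if (distinct && interactions.getOrDefault(p3, Set.of()).contains(p)) {
                        found.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return found;
    }

    /** Toy network: A -> B -> C -> A, plus C -> D (sorted collections for stable order). */
    static Map<String, Set<String>> exampleNetwork() {
        Map<String, Set<String>> g = new TreeMap<>();
        g.put("A", new TreeSet<>(List.of("B")));
        g.put("B", new TreeSet<>(List.of("C")));
        g.put("C", new TreeSet<>(List.of("A", "D")));
        return g;
    }

    public static void main(String[] args) {
        // Pretend only A belongs to the Swiss-Prot dataset.
        Set<String> swissProt = new TreeSet<>(List.of("A"));
        System.out.println(circuits(exampleNetwork(), swissProt)); // prints [[A, B, C]]
    }
}
```

As in the query, a circuit is reported once per starting protein, so a cycle whose three proteins all belong to the starting set shows up three times, once per rotation.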


              Check this blog post for more info and our Bio4j Cypher cheat sheet




A graph traversal language


     Get protein by its accession number and return its full name

     gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
     ==> Aspartate aminotransferase, mitochondrial


     Get proteins (accessions) associated with an InterPro motif (limited to 4 results)

     gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
     ==> E2GK26
     ==> G3PMS4
     ==> G3Q865
     ==> G3PIL8


            Check our Bio4j Gremlin cheat sheet




REST Server


     You can also query/navigate through Bio4j with the Neo4j REST API!

     The default representation is JSON, both for responses and for data sent with
     POST/PUT requests


     Get protein by its accession number (Q9UR66):

     http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66


     Get outgoing relationships for protein Q9UR66:

     http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
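The two URL shapes above can be assembled programmatically. The sketch below only builds the URIs, following the paths shown on the slide; the class `Bio4jRestUris`, its method names, the server address `http://localhost:7474` and the node id 42 are hypothetical placeholders.

```java
import java.net.URI;

public class Bio4jRestUris {

    /** Index lookup: {server}/db/data/index/node/{index}/{key}/{value} */
    static URI nodeByIndex(String serverUrl, String indexName, String key, String value) {
        return URI.create(serverUrl + "/db/data/index/node/" + indexName + "/" + key + "/" + value);
    }

    /** Outgoing relationships of a node: {server}/db/data/node/{id}/relationships/out */
    static URI outgoingRelationships(String serverUrl, long nodeId) {
        return URI.create(serverUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    public static void main(String[] args) {
        String server = "http://localhost:7474"; // placeholder server address
        System.out.println(nodeByIndex(server, "protein_accession_index",
                "protein_accession_index", "Q9UR66"));
        System.out.println(outgoingRelationships(server, 42L)); // 42 is a made-up node id
    }
}
```

The resulting `URI` values can be handed to any HTTP client, for example `java.net.http.HttpClient` on Java 11 or later, to issue the GET requests and read the JSON responses.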




Visualizations (1): REST Server Data Browser


      Navigate through Bio4j data in real time!




Visualizations (2): Bio4j GO Tools




Visualizations (3): Bio4j + Gephi

      Get really cool graph visualizations using Bio4j and the Gephi visualization and
      exploration platform




Bio4j + Cloud

     We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving
     us the following benefits:


          Interoperability and data distribution

           Releases are available as public EBS Snapshots, so AWS users can create
           volumes with a ready-to-use Bio4j DB and attach them to their instances
           in just a few seconds.

           CloudFormation templates:

             - Basic Bio4j DB Instance

             - Bio4j REST Server Instance


           Backup and Storage using S3 (Simple Storage Service)

           We use S3 both for backup (indirectly through the EBS snapshots) and
           storage (directly storing RefSeq sequences as independent S3 files)



Why would I use Bio4j?


    Massive access to protein/genome/taxonomy… related information


    Integration of your own DBs/resources around common information


    Development of services tailored to your needs built around Bio4j


    Network analysis


    Visualizations


    And many other uses I have not thought of myself…
    If you have something in mind for which Bio4j might be useful, please let us know so we
    can all see how it could help you meet your needs! ;)




Community

     Bio4j has a fast-growing online presence:



            - Twitter: check @bio4j for updates

            - Blog: go to http://blog.bio4j.com

            - Mailing list: ask any question you may have on our list.

            - LinkedIn: check the Bio4j group

            - GitHub issues: don’t be shy! Open a new issue if you think
                             something’s going wrong.




OK, but why start all this?
   Were you so bored…?!

    It all started with our need for massive access to protein GO
    (Gene Ontology) annotations.

     At that point I had to develop my own MySQL DB based on the official
     GO SQL database, and problems started from the beginning:


          I went crazy ‘deciphering’ how to extract UniProt protein annotations
          from the official GO table schema

          The official UniProt and GO protein annotations were not always consistent


          Populating my own DB took a really long time because of all the joins and
          subqueries needed to get and store the protein annotations.

          Soon enough we also needed massive access to basic
          protein information.




These processes had to be automated for BG7, our bacterial genome annotation
  system designed specifically for NGS data



              The UniProt web services available were too limited:

                - Slow

                - Limits on the number of queries

                - Too little information available




                  So I downloaded the whole UniProt DB in XML format
                  (Swiss-Prot + TrEMBL)

                  and started to have some fun with it!




BG7 algorithm


       1. Selection of the specific reference protein set

       2. Prediction of possible genes by BLAST similarity

       3. Gene definition: merging compatible similarity regions, detecting start and stop

       4. Solving overlapped predicted genes

       5. RNA prediction by BLAST similarity

       6. Final annotation and complete deliverables. Quality control.




www.era7bioinformatics.com
We got used to having massive direct access to all this protein-related
      information…


           So why not add other resources that we needed in most projects
           and that were becoming a sort of bottleneck compared
           to everything already included in Bio4j?

       Then we incorporated:
            -   Isoform sequences

            -   Protein interactions and features

            -   UniRef 50, 90, and 100

            -   RefSeq

            -   NCBI Taxonomy

            -   ExPASy Enzyme DB




Bio4j + MG7 + 48 BLAST XML files (~1GB each)


     Some numbers:

                •   157 639 502 nodes

                •   742 615 705 relationships

                •   632 832 045 properties

                •   148 relationship types

                •   44 node types


             And it works just fine!


MG7 domain model




What’s MG7?

     MG7 provides the possibility of choosing different parameters to fix the
     thresholds for filtering the BLAST hits:

     i.    E-value
     ii.   Identity and query coverage
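A threshold filter of this kind can be sketched in a few lines of plain Java. This is not MG7 code: the class `BlastHitFilterSketch`, the `Hit` fields, the example hits and the choice to apply all three thresholds together are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BlastHitFilterSketch {

    /** One BLAST hit, reduced to the fields the thresholds need. */
    static final class Hit {
        final String subjectId;
        final double eValue;            // lower is better
        final double identityPct;       // 0..100
        final double queryCoveragePct;  // 0..100

        Hit(String subjectId, double eValue, double identityPct, double queryCoveragePct) {
            this.subjectId = subjectId;
            this.eValue = eValue;
            this.identityPct = identityPct;
            this.queryCoveragePct = queryCoveragePct;
        }
    }

    /** Keeps the hits that satisfy every threshold. */
    static List<Hit> filter(List<Hit> hits, double maxEValue,
                            double minIdentityPct, double minQueryCoveragePct) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (h.eValue <= maxEValue
                    && h.identityPct >= minIdentityPct
                    && h.queryCoveragePct >= minQueryCoveragePct) {
                kept.add(h);
            }
        }
        return kept;
    }

    /** Made-up hits: only the first one passes the thresholds used in main. */
    static List<Hit> exampleHits() {
        return List.of(
                new Hit("hit-1", 1e-30, 95.0, 90.0),
                new Hit("hit-2", 1e-3, 99.0, 99.0),    // E-value too high
                new Hit("hit-3", 1e-40, 60.0, 90.0));  // identity too low
    }

    public static void main(String[] args) {
        for (Hit h : filter(exampleHits(), 1e-5, 80.0, 70.0)) {
            System.out.println(h.subjectId); // prints hit-1
        }
    }
}
```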


     It can export the results of the analysis to different data formats, such as:
     • XML
     • CSV
     • GEXF (Graph Exchange XML Format)

     It also provides heat maps and graph visualizations, plus a user-friendly
     interface (MG7Viewer) that gives access to the alignment responsible for each
     functional or taxonomic read assignment and displays the frequencies in the
     taxonomic tree.




Heat-map Viz




Graph Viz




MG7 Viewer




Mining Bio4j data

      Finding topological patterns in Protein-Protein
                  Interaction networks




Finding the lowest common ancestor of a set of NCBI
                taxonomy nodes with Bio4j
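One simple way to compute a lowest common ancestor, sketched here in plain Java over a child-to-parent map rather than over Bio4j nodes, is to take the path from the first node to the root and keep only the ancestors shared with every other node. The class `LcaSketch` and the toy taxonomy (heavily simplified, with ranks omitted) are hypothetical, and the sketch assumes the root has no parent entry in the map.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LcaSketch {

    /** Path from node up to the root, the node itself first. */
    static List<String> pathToRoot(Map<String, String> parentOf, String node) {
        List<String> path = new ArrayList<>();
        for (String current = node; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    /** Lowest node that is an ancestor of (or equal to) every given node, or null. */
    static String lowestCommonAncestor(Map<String, String> parentOf, List<String> nodes) {
        if (nodes.isEmpty()) {
            return null;
        }
        // Candidates are ordered from the first node upwards to the root.
        List<String> candidates = pathToRoot(parentOf, nodes.get(0));
        for (String node : nodes.subList(1, nodes.size())) {
            Set<String> path = new HashSet<>(pathToRoot(parentOf, node));
            candidates.removeIf(c -> !path.contains(c)); // keep shared ancestors only
        }
        return candidates.isEmpty() ? null : candidates.get(0);
    }

    /** Toy taxonomy, child -> parent; "root" has no entry. */
    static Map<String, String> exampleTaxonomy() {
        return Map.of(
                "Bacteria", "root",
                "Proteobacteria", "Bacteria",
                "Escherichia", "Proteobacteria",
                "Salmonella", "Proteobacteria",
                "Bacillus", "Bacteria");
    }

    public static void main(String[] args) {
        Map<String, String> taxonomy = exampleTaxonomy();
        System.out.println(lowestCommonAncestor(taxonomy,
                List.of("Escherichia", "Salmonella")));             // prints Proteobacteria
        System.out.println(lowestCommonAncestor(taxonomy,
                List.of("Escherichia", "Salmonella", "Bacillus"))); // prints Bacteria
    }
}
```

Because the candidate list stays ordered from the first node towards the root, the first surviving candidate is the lowest shared ancestor.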




Future directions (1)


    Gene flux tool

    New tool for bacterial comparative genomics: massive tracing of vertical and
    horizontal gene flux between genome elements based on the analysis of the
    similarity between their proteins. It would analyze similarity relationships that could
    be fixed to a 90% or 100% similarity threshold.



    Pathways tool
    Data from MetaCyc is going to be included in Bio4j. This data would make it possible
    to dissect the metabolic pathways in which a genome element, organism or community
    (metagenomic samples) is involved. Gephi could be used to represent the
    metabolic pathways for each of them.




Future directions (2)


    Detector of common annotations in gene clusters

    Many biological problems involve searching for common annotations in a set of genes.
    Some examples:

       - a set of overexpressed genes
       - a set of proteins with local structural similarities (WIP)
       - a set of genes bearing SNPs in cancer samples
       - a set of exclusive genes in a pathogenic bacterial strain

    The detection of common annotations can help in the inference of important functional
    connections.
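In its simplest reading, "common annotations" is the intersection of the annotation sets of the genes in the cluster. The sketch below shows that reading in plain Java; it is not the planned tool, which might well rely on a statistical criterion instead of a strict intersection. The class `CommonAnnotationsSketch`, the gene names and the annotation labels are hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CommonAnnotationsSketch {

    /** Annotations shared by every gene in genes (empty if genes is empty). */
    static Set<String> commonAnnotations(Map<String, Set<String>> annotationsByGene,
                                         Collection<String> genes) {
        Set<String> common = null;
        for (String gene : genes) {
            Set<String> annotations = annotationsByGene.getOrDefault(gene, Set.of());
            if (common == null) {
                common = new TreeSet<>(annotations); // start from the first gene
            } else {
                common.retainAll(annotations);       // intersect with each later gene
            }
        }
        return common == null ? new TreeSet<>() : common;
    }

    /** Made-up annotations for three genes. */
    static Map<String, Set<String>> exampleAnnotations() {
        return Map.of(
                "geneA", Set.of("annotation-1", "annotation-2", "annotation-3"),
                "geneB", Set.of("annotation-2", "annotation-3"),
                "geneC", Set.of("annotation-3", "annotation-4"));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> annotations = exampleAnnotations();
        System.out.println(commonAnnotations(annotations, List.of("geneA", "geneB")));
        System.out.println(commonAnnotations(annotations, List.of("geneA", "geneB", "geneC")));
    }
}
```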




That’s it !


                        Thanks for
                        your time ;)





 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis Framework
 

Neo4j and bioinformatics

  • 2. But who’s this guy talking here? I am currently working as a bioinformatics consultant/developer/researcher at Oh no sequences! Oh no what!? We are the R&D group at Era7 Bioinformatics. We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things. What about Era7 Bioinformatics? Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation. Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis. www.ohnosequences.com www.bio4j.com
  • 3. In bioinformatics we have highly interconnected, overlapping knowledge spread throughout different DBs.
  • 4. However, in most cases all this data is modeled in relational databases, sometimes even just as plain CSV files. As the amount and diversity of data grows, domain models become crazily complicated!
  • 5. With a relational paradigm, the double implication Entity ⇔ Table does not go both ways. You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling. You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality). Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL. Integrating/incorporating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.
  • 6. Life in general and biology in particular are probably not 100% like a graph… but one thing’s sure, they are not a set of tables!
  • 8. Neo4j is a high-performance NoSQL graph database with all the features of a mature and robust database. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables, with all the benefits of a fully transactional, enterprise-strength database. For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
  • 9. What’s Bio4j? Bio4j is a bioinformatics graph-based DB including most data available in: Uniprot KB (SwissProt + Trembl), NCBI Taxonomy, Gene Ontology (GO), RefSeq, UniRef (50, 90, 100), Enzyme DB.
  • 10. What’s Bio4j? It provides a completely new and powerful framework for querying and managing protein-related information. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
  • 11. What’s Bio4j? Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database". Thanks to both being based on Neo4j DB and the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the best of it.
  • 12. What’s Bio4j? Everything in Bio4j is open source, released under AGPLv3!
  • 13. Bio4j in numbers. The current version (0.7) includes: Relationships: 530.642.683; Nodes: 76.071.411; Relationship types: 139; Node types: 38.
  • 14. Let’s dig a bit into the Bio4j structure… Data sources and their relationships:
  • 16. The Graph DB model: representation. Core abstractions: nodes, relationships between nodes, and properties on both.
  • 17. How are things modeled? Couldn’t be simpler! Entities → Nodes; Associations / Relationships → Edges.
  • 18. Some examples of nodes would be: GO term, Protein, Genome Element. And relationships: Protein -PROTEIN_GO_ANNOTATION-> GO term.
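The node/relationship/property model on these slides can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class `PropertyGraphSketch` and its methods are hypothetical names invented here, not part of the Bio4j or Neo4j APIs, and the GO term id is a placeholder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory property graph; NOT the Bio4j or Neo4j API.
final class PropertyGraphSketch {

    // A node has a type (e.g. "Protein") and free-form properties.
    static final class Node {
        final String type;
        final Map<String, Object> properties = new HashMap<>();
        Node(String type) { this.type = type; }
    }

    // A relationship is a typed, directed edge that can also carry properties.
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();
        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    final List<Relationship> relationships = new ArrayList<>();

    Relationship connect(Node start, String type, Node end) {
        Relationship r = new Relationship(type, start, end);
        relationships.add(r);
        return r;
    }

    // Follow the outgoing relationships of one type from a node.
    List<Node> outgoing(Node start, String type) {
        List<Node> result = new ArrayList<>();
        for (Relationship r : relationships) {
            if (r.start == start && r.type.equals(type)) {
                result.add(r.end);
            }
        }
        return result;
    }

    // Builds the Protein -PROTEIN_GO_ANNOTATION-> GO term example from the slide.
    static List<Node> example() {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");
        Node goTerm = new Node("GO term");
        goTerm.properties.put("id", "GO:EXAMPLE"); // placeholder id, illustration only
        g.connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        return g.outgoing(protein, "PROTEIN_GO_ANNOTATION");
    }

    public static void main(String[] args) {
        for (Node n : example()) {
            System.out.println(n.type + " " + n.properties);
        }
    }
}
```

The point of the sketch is that entities and associations map one-to-one onto nodes and edges, with no auxiliary join tables or artificial IDs in between.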
  • 19. We have developed a tool intended both as a reference manual and as a first point of contact with the Bio4j domain model: Bio4jExplorer. Bio4jExplorer allows you to: • Navigate through all nodes and relationships • Access the javadocs of any node or relationship • Graphically explore the neighborhood of a node/relationship • Look up the indexes that may serve as an entry point for a node • Check incoming/outgoing relationships of a specific node • Check start/end nodes of a specific relationship
  • 20. Entry points and indexing. There are two kinds of entry points for the graph: (1) Auxiliary relationships going from the reference node, e.g. - CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology - MAIN_DATASET: leads to both main datasets, Swiss-Prot and Trembl. (2) Node indexing. There are two types of node indexes: - Exact: only exact values are considered hits - Fulltext: regular expressions can be used
  • 21. Retrieving protein info (Bio4jModel Java API)
      //--creating manager and node retriever----
      Bio4jManager manager = new Bio4jManager("/mybio4jdb");
      NodeRetriever nR = new NodeRetriever(manager);
      ProteinNode protein = nR.getProteinNodeByAccession("P12345");
      //--getting more related info----
      List<InterproNode> interpros = protein.getInterpro();
      OrganismNode organism = protein.getOrganism();
      List<GoTermNode> goAnnotations = protein.getGOAnnotations();
      List<ArticleNode> articles = protein.getArticleCitations();
      for (ArticleNode article : articles) {
          System.out.println(article.getPubmedId());
      }
      //Don't forget to close the manager
      manager.shutDown();
  • 22. Querying Bio4j with Cypher
      Getting a keyword by its ID:
      START k=node:keyword_id_index(keyword_id_index = "KW-0181")
      return k.name, k.id
      Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:
      START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
      MATCH d <-[r:PROTEIN_DATASET]- p,
            circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
      return p.accession, p2.accession, p3.accession
      Check this blog post for more info and our Bio4j Cypher cheat sheet
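To make the circuit query on the Cypher slide more concrete, here is a rough plain-Java sketch of the same idea over an adjacency map. It is not how Neo4j evaluates the pattern: the class `CircuitSketch`, the toy accessions A to D, and the choice to require three distinct proteins are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for PROTEIN_PROTEIN_INTERACTION edges; NOT the Bio4j API.
final class CircuitSketch {

    // Outgoing interaction targets of a protein, or an empty set if it has none.
    static Set<String> targets(Map<String, Set<String>> interactions, String protein) {
        Set<String> out = interactions.get(protein);
        return out == null ? Collections.<String>emptySet() : out;
    }

    // Finds directed circuits p -> p2 -> p3 -> p over three distinct proteins,
    // where p is taken from `starts` (e.g. the proteins of the Swiss-Prot dataset).
    static List<String> circuitsOfLength3(Map<String, Set<String>> interactions,
                                          Set<String> starts) {
        List<String> found = new ArrayList<>();
        for (String p : starts) {
            for (String p2 : targets(interactions, p)) {
                if (p2.equals(p)) continue;
                for (String p3 : targets(interactions, p2)) {
                    if (p3.equals(p) || p3.equals(p2)) continue;
                    if (targets(interactions, p3).contains(p)) {
                        found.add(p + "," + p2 + "," + p3);
                    }
                }
            }
        }
        return found;
    }

    // Toy interaction data with made-up accessions A..D.
    static Map<String, Set<String>> toyInteractions() {
        Map<String, Set<String>> g = new TreeMap<>();
        g.put("A", new TreeSet<>(List.of("B")));
        g.put("B", new TreeSet<>(List.of("C", "D")));
        g.put("C", new TreeSet<>(List.of("A")));
        return g;
    }

    public static void main(String[] args) {
        Set<String> swissProt = new TreeSet<>(List.of("A"));
        System.out.println(circuitsOfLength3(toyInteractions(), swissProt)); // prints [A,B,C]
    }
}
```

As in the query, only the first protein of each circuit is constrained to the starting set, so the same cycle can be reported once per qualifying start node.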
  • 23. A graph traversal language
      Get protein by its accession number and return its full name:
      gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
      ==> Aspartate aminotransferase, mitochondrial
      Get proteins (accessions) associated to an interpro motif (limited to 4 results):
      gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
      ==> E2GK26
      ==> G3PMS4
      ==> G3Q865
      ==> G3PIL8
      Check our Bio4j Gremlin cheat sheet
  • 24. REST Server. You can also query/navigate through Bio4j with the Neo4j REST API! The default representation is JSON, both for responses and for data sent with POST/PUT requests.
      Get protein by its accession number (Q9UR66):
      http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66
      Get outgoing relationships for protein Q9UR66:
      http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
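The two REST URLs on this slide follow a fixed pattern, which the following sketch assembles with `java.net.URI`. The path segments are copied from the slide; the class `RestUrlSketch`, the `http://localhost:7474` base and the node id `42` are placeholders assumed here, and the sketch only builds the addresses without sending any request.

```java
import java.net.URI;

// Builds the REST URLs shown on the slide; it does not contact any server.
final class RestUrlSketch {

    // Index lookup of a protein node by accession (path copied from the slide).
    static URI proteinByAccession(String serverUrl, String accession) {
        return URI.create(serverUrl
                + "/db/data/index/node/protein_accession_index/protein_accession_index/"
                + accession);
    }

    // Outgoing relationships of a node, given its internal node id.
    static URI outgoingRelationships(String serverUrl, long nodeId) {
        return URI.create(serverUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    public static void main(String[] args) {
        String serverUrl = "http://localhost:7474"; // assumed placeholder for server_url:7474
        System.out.println(proteinByAccession(serverUrl, "Q9UR66"));
        System.out.println(outgoingRelationships(serverUrl, 42L)); // 42 is a made-up node id
    }
}
```

In practice the node id for the second call would come from the JSON returned by the first one, since the slide addresses relationships by node id rather than by accession.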
  • 25. Visualizations (1): REST Server Data Browser. Navigate through Bio4j data in real time!
  • 26. Visualizations (2): Bio4j GO Tools
  • 27. Visualizations (3): Bio4j + Gephi. Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
  • 28. Bio4j + Cloud. We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving us the following benefits: Interoperability and data distribution: releases are available as public EBS Snapshots, giving AWS users the opportunity to create 100% ready Bio4j DB volumes and attach them to their instances in just a few seconds. CloudFormation templates: - Basic Bio4j DB Instance - Bio4j REST Server Instance. Backup and storage using S3 (Simple Storage Service): we use S3 both for backup (indirectly through the EBS snapshots) and storage (directly storing RefSeq sequences as independent S3 files).
  • 29. Why would I use Bio4j? Massive access to protein/genome/taxonomy… related information. Integration of your own DBs/resources around common information. Development of services tailored to your needs built around Bio4j. Network analysis. Visualizations. Besides many others I cannot think of myself… If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
  • 30. Community. Bio4j has a fast-growing internet presence: - Twitter: check @bio4j for updates - Blog: go to http://blog.bio4j.com - Mailing list: ask any question you may have on our list - LinkedIn: check the Bio4j group - Github issues: don’t be shy! Open a new issue if you think something’s going wrong.
  • 31. OK, but why start all this? Were you so bored…?! It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations. At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning: I went crazy ‘deciphering’ how to extract Uniprot protein annotations from the official GO table schema; Uniprot and GO official protein annotations were not always consistent; populating my own DB took really long due to all the joins and subqueries needed in order to get and store the protein annotations. Soon enough we also needed massive access to basic protein information.
  • 32. These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data). The available Uniprot web services were too limited: - Slow - Limited number of queries - Too little information available. So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + Trembl) and started to have some fun with it!
  • 33. BG7 algorithm: 1. Selection of the specific reference protein set 2. Prediction of possible genes by BLAST similarity 3. Gene definition: merging compatible similarity regions, detecting start and stop 4. Solving overlapped predicted genes 5. RNA prediction by BLAST similarity 6. Final annotation and complete deliverables. Quality control. www.era7bioinformatics.com
  • 34. We got used to having massive direct access to all this protein-related information… So why not add other resources that we needed quite often in most projects and that were becoming a sort of bottleneck compared to those already included in Bio4j? Then we incorporated: - Isoform sequences - Protein interactions and features - Uniref 50, 90, and 100 - RefSeq - NCBI Taxonomy - Enzyme Expasy DB
  • 35. Bio4j + MG7 + 48 Blast XML files (~1GB each). Some numbers: • 157 639 502 nodes • 742 615 705 relationships • 632 832 045 properties • 148 relationship types • 44 node types. And it works just fine!
  • 37. What’s MG7? MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits: i. E-value ii. Identity and query coverage. It can export the results of the analysis to different data formats such as: • XML • CSV • GEXF (Graph Exchange XML Format). It also provides the user with heat maps and graph visualizations, together with a user-friendly interface (MG7Viewer) that gives access to the alignment responsible for each functional or taxonomic read assignment and displays the frequencies in the taxonomic tree.
  • 41. Mining Bio4j data: Finding topological patterns in Protein-Protein Interaction networks
  • 42. Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
  • 43. Future directions (1) Gene flux tool: a new tool for bacterial comparative genomics, for massive tracing of vertical and horizontal gene flux between genome elements based on the analysis of the similarity between their proteins. It would analyze similarity relationships that could be fixed to a 90% or 100% similarity threshold. Pathways tool: data from Metacyc is going to be included in Bio4j. This data would make it possible to dissect the metabolic pathways in which a genome element, organism or community (metagenomic samples) is involved. Gephi could be used to represent the metabolic pathways for each of them.
  • 44. Future directions (2) Detector of common annotations in gene clusters: many biological problems involve searching for common annotations in a set of genes. Some examples: - a set of overexpressed genes - a set of proteins with local structural similarities (WIP) - a set of genes bearing SNPs in cancer samples - a set of genes exclusive to a pathogenic bacterial strain. Detecting common annotations can help infer important functional connections.
  • 45. That’s it! Thanks for your time ;)
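
Slide 42 mentions finding the lowest common ancestor (LCA) of a set of NCBI taxonomy nodes. The slides do not show how Bio4j does this, so what follows is only a minimal sketch of the general idea in plain Python, not Bio4j's implementation or API. It assumes the taxonomy is available as a child-to-parent mapping; the `TOY_PARENT` dictionary is a made-up, heavily simplified tree used purely for illustration, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical, heavily simplified taxonomy: child -> parent (None marks the root).
TOY_PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_from_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the list of nodes from the root down to `node` (inclusive)."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    path.reverse()  # collected leaf-to-root, so flip to root-to-leaf
    return path


def lowest_common_ancestor(
    nodes: Iterable[str], parent: Dict[str, Optional[str]]
) -> Optional[str]:
    """Return the deepest node shared by the root paths of all `nodes`.

    Returns None if the nodes do not share any ancestor.
    """
    paths = [path_from_root(n, parent) for n in nodes]
    if not paths:
        raise ValueError("at least one node is required")
    lca: Optional[str] = None
    # Walk all root-to-node paths level by level while they still agree.
    for level in zip(*paths):
        if all(name == level[0] for name in level):
            lca = level[0]
        else:
            break
    return lca


if __name__ == "__main__":
    print(lowest_common_ancestor(
        ["Escherichia coli", "Salmonella enterica", "Bacillus subtilis"],
        TOY_PARENT,
    ))
```

In a graph database the same walk would follow parent relationships between taxonomy nodes instead of dictionary lookups; the level-by-level comparison stays the same.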