Graph DB + Bioinformatics describes applications of graph databases in bioinformatics. Bio4j is a graph database that integrates biological data from sources like Uniprot, Gene Ontology, and NCBI Taxonomy. It provides a novel framework for querying and managing protein information that is more scalable and integrates new knowledge more easily than traditional relational databases. Era7 Bioinformatics develops Bio4j and other bioinformatics tools using an open source business model.
Bio4j is a graph database for integrating biological big data. It stores data from sources like Uniprot, Gene Ontology, NCBI Taxonomy, and Enzyme Database in a graph structure, with nodes for entities and edges for relationships. This allows for more flexible querying and analysis of interconnected biological data compared to traditional relational databases. Bio4j uses the Neo4j graph database and is open source.
Bio4j: A pioneer graph based database for the integration of biological Big Data (Pablo Pareja Tobes)
1. Bio4j
2. What’s Bio4j?: Data included
3. What’s Bio4j?: A completely new and powerful framework for protein
4. What’s Bio4j?: Neo4j --> very scalable
5. What's Bio4j?: Everything in Bio4j is open source released under AGPLv3
6. Bioinformatics DBs and Graphs: Highly interconnected overlapping knowledge spread throughout different databases
7. Bioinformatics DBs and Graphs: Data is in most cases modeled in relational databases (sometimes even just as plain CSV files)
8. Bioinformatics DBs and Graphs: Problems of a relational paradigm
9. Bioinformatics DBs and Graphs: Life + Biology like a graph
10. Bioinformatics DBs and Graphs: NoSQL
11. Bioinformatics DBs and Graphs: NoSQL data models
12. Bioinformatics DBs and Graphs: The Graph DB model: representation
13. Bioinformatics DBs and Graphs: Neo4j
14. Initial motivation: Why starting all this?
15. Initial motivation: Processes had to be automated for BG7 (http://bg7.ohnosequences.com)
This document discusses Neo4j and its applications in bioinformatics. It describes Bio4j, an open source bioinformatics graph database built using Neo4j that integrates data from sources like Uniprot, NCBI taxonomy, Gene Ontology, and more. Bio4j models biological data as nodes and relationships in a graph structure rather than tables. This allows for more flexible querying and knowledge integration. The document provides examples of how Bio4j can be accessed through its Java API, Cypher query language, Gremlin traversal language, and REST API. It also describes some tools and visualizations for exploring and analyzing Bio4j data.
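The access routes listed there (Java API, Cypher, Gremlin, REST) all ultimately perform path traversals over typed relationships. As a minimal, self-contained sketch of that idea, assuming hypothetical node identifiers and relationship types rather than Bio4j's actual schema, the same kind of traversal can be mimicked over a plain adjacency map:

```python
# Sketch only: a toy protein graph as an adjacency map.
# Node and relationship names are illustrative, not Bio4j's real schema.
graph = {
    "P53_HUMAN": [("ANNOTATED_WITH", "GO:0006915"), ("HAS_KEYWORD", "Apoptosis")],
    "GO:0006915": [("IS_A", "GO:0008219")],
    "GO:0008219": [],
    "Apoptosis": [],
}

def neighbours(node, edge_type):
    """Follow outgoing edges of a given type from a node."""
    return [dst for (etype, dst) in graph.get(node, []) if etype == edge_type]

# Walk protein -> GO annotation -> parent GO term, a Cypher-style
# path traversal expressed in plain Python.
terms = neighbours("P53_HUMAN", "ANNOTATED_WITH")
parents = [p for t in terms for p in neighbours(t, "IS_A")]
```

In a graph store, each hop is an index-free pointer dereference, which is why traversals like this stay cheap even as the dataset grows.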
The document discusses using ontologies and Schema.org properties to connect biomedical data to ontology terms and concepts. Over 200 biomedical ontologies are in active use by life science databases at EMBL-EBI. Schema.org properties like MedicalCode and CreativeWork can be used to mark up ontology terms, data resources, and their relationships. This would allow questions about which ontologies and terms are used in specific data, and enable richer searching and discovery across data and ontologies.
1) The document discusses EBI's efforts to facilitate semantic alignment of its resources through building ontologies and annotating data with ontologies.
2) It describes EBI's work developing ontologies like the Experiment Factor Ontology and using ontologies to enhance search, data visualization, and data integration.
3) The challenges of representing EBI data in RDF are discussed, and future directions are outlined that could make RDF deployment simpler and enable more interesting queries over EBI data.
GraphConnect Europe 2016 - Building a Repository of Biomedical Ontologies wit... (Neo4j)
This document discusses building a repository of biomedical ontologies using Neo4j. It describes loading over 140 ontologies with over 4.5 million terms and 11 million relations into Neo4j. This allows complex ontology queries and navigation of relationships. The repository is accessed through a REST API and is being used by over 2000 users for tasks like exploring disease relationships and taxonomic classification. Maintaining and improving the system is an ongoing effort.
All together now: piecing together the knowledge graph of life (Chris Mungall)
The document summarizes challenges in organizing biological knowledge and progress made through collaborative ontology development. It discusses how early efforts focused on individual ontologies but challenges emerged in maintenance and linking data. New approaches focus on shared principles, standardized mappings between ontologies, and modeling knowledge as graphs. Tools like Boomer and LinkML help reconcile mappings and model data, while community efforts like OBO Foundry and Biolink Model advance integration through open collaboration. Overall progress has been made but more work is needed to operationalize ontologies and build interconnected knowledge graphs.
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biomedical terminology.
The Ontology Lookup Service (OLS) (http://www.ebi.ac.uk/ols) integrates over 160 ontologies and provides a central point for the biomedical community to query and visualise ontologies. OLS also provides a RESTful API over the ontologies that is used in high-throughput data annotation pipelines. OLS is built on top of a Neo4j database that provides efficient indexes for extracting ontological relationships. We have developed generic tools for loading RDF/OWL ontologies into Neo4j, where the indexes are optimised for serving common ontology queries. We are now moving to adopt graph databases more widely in applications relating to ontology mapping prediction and recommendation systems for data annotation.
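A typical "common ontology query" of the kind such indexes serve is finding all ancestors of a term through transitive is_a edges. A minimal sketch of that traversal, using made-up terms rather than any real ontology:

```python
from collections import deque

# Sketch only: ancestor lookup over "is_a" edges, the kind of
# traversal an ontology service indexes for. Terms are invented.
IS_A = {
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
    "cell": [],
}

def ancestors(term):
    """All transitive is_a ancestors of a term (breadth-first)."""
    seen, queue = set(), deque(IS_A.get(term, []))
    while queue:
        t = queue.popleft()
        if t not in seen:
            seen.add(t)
            queue.extend(IS_A.get(t, []))
    return seen
```

A graph database answers this with a variable-length relationship traversal; the Python breadth-first search above is the same computation spelled out.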
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015 (Mark Wilkinson)
The primary slide deck for the SADI tutorial. We explain the motivation, simple SADI services, more complex SADI services, and then do a detailed walk-through of building a service, including the Perl service code and examples of service invocation at the command line, and using the SHARE client. You will want to look at the sample data/queries in this slide deck: http://www.slideshare.net/markmoby/sample-data-and-other-ur-ls-55737183 and the example service code in this slide deck: http://www.slideshare.net/markmoby/example-code-for-the-sadi-bmi-calculator-web-service?related=1
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... (Mark Wilkinson)
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
The Royal Society of Chemistry hosts one of the largest online chemistry databases containing almost 30 million unique chemical structures. The database, ChemSpider, provides the underpinning for a series of eScience projects allowing for the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it specifically in the ChemSpider project to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a Global Chemistry Network encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.
How SADI & SHARE help restore the Scientific Method to in silico science (Mark Wilkinson)
This document discusses the transition from BioMoby to SADI as a framework for semantic web services. It provides statistics on BioMoby usage and describes demonstrations of how SADI allows complex queries to be answered by discovering and executing relevant web services without a centralized database. The author's vision is for SADI to support the scientific method by enabling personal ontologies and hypotheses to be explicitly expressed and evaluated dynamically.
This document provides information on how to search free crystallography databases. It discusses several free databases including the Crystallography Open Database (COD), Spectral Database for Organic Compounds (SDBS), American Mineralogist Crystal Structure Database, and Bilbao Crystallographic Server. It also shows how to search the commercial SciFinder-n database through a university library subscription. Examples are provided of searching for ibuprofen data on PubChem, ChemSpider, and COD to find chemical structures, properties and spectral information. Free software like Avogadro can be used to view crystal structure files from these databases.
This document provides an introduction to bio-ontologies and the semantic web. It discusses what ontologies are and how they are used in the bio domain through initiatives like the OBO Foundry. It introduces key semantic web technologies like RDF, URIs, Turtle syntax, and SPARQL query language. It provides examples of ontologies like the Gene Ontology and how ontologies can be represented and queried using these semantic web standards.
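To make the RDF/SPARQL pairing mentioned there concrete: RDF reduces to subject-predicate-object triples, and a SPARQL basic graph pattern is just a triple with variables in some positions. A toy sketch, using CURIE-style shortened identifiers and illustrative data:

```python
# Sketch only: RDF-style triples and a tiny SPARQL-like pattern match.
# Identifiers are CURIE-shortened and the data is illustrative.
triples = [
    ("GO:0006915", "rdfs:label", "apoptotic process"),
    ("GO:0006915", "rdfs:subClassOf", "GO:0008219"),
    ("GO:0008219", "rdfs:label", "cell death"),
]

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None plays the role of a SPARQL variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to: SELECT ?o WHERE { GO:0006915 rdfs:subClassOf ?o }
supers = [o for (_, _, o) in match(s="GO:0006915", p="rdfs:subClassOf")]
```

Real triple stores add indexes over the (s, p, o) permutations and join multiple patterns, but the variable-binding idea is exactly this.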
What is the fuzz on triple stores? Will triple stores eventually replace relational databases? This talk looks at the big picture, explains the technology and tries to look at the road ahead.
Chado is a relational database schema for evolutionary science. It has several key modules including sequence, cv (controlled vocabularies), phenotype, and phylogeny. The cv module stores ontologies and terminologies. The phenotype module currently uses an EAV model but may need extensions. The phylogeny module allows storing and querying of evolutionary trees, though linking phenotypes to phylogenies remains a challenge. Overall, Chado provides an extensible framework but some parts like phenotype modeling require more development.
The document discusses Chado and OBD, two database schemas for storing biological data and annotations. Chado is a relational database schema developed for model organism databases to store various types of genomics data and track provenance. It uses ontologies and supports modules for different data types. OBD is designed for biomedical annotations using semantic web technologies like RDF triples and SPARQL querying. It aims to index data from various sources and link to external databases. The document compares the two approaches and discusses wrapping existing databases as SPARQL endpoints.
Libraries and Linked Data: Looking to the Future (2) (ALATechSource)
The document discusses options for new bibliographic frameworks after MARC. It describes three scenarios: 1) a relational/object-oriented RDA database, 2) linked bibliographic and authority records, and 3) flat files without links. It then discusses three approaches to implementing a new framework: 1) going native by using URIs for things, elements and values, 2) extracting data from existing MARC records, and 3) serializing data into key-value pairs, XML, or JSON. Advantages and disadvantages of each approach are outlined.
The document discusses the transition from BioMoby to SADI as a framework for semantic web services. It provides statistics on BioMoby usage and describes demonstrations of complex queries being answered through SADI and SHARE without a centralized database. The demonstrations include finding pathways for a protein and lab results for transplant patients. It advocates for SADI to support the scientific method and personal hypotheses through distributed ontologies rather than centralized ones.
From OBO to OWL and back - building scalable ontologies (dosumis)
This document provides an introduction to converting ontologies between the OBO format and the OWL format. It discusses the benefits of using OWL, including taking advantage of reasoning and automated classification. It also introduces Oort, a tool for generating OBO files that do not require reasoning from ontologies that do. The document then provides a tutorial on building ontologies, including maintaining multiple classification schemes, using relationships to specify necessary and sufficient conditions for class membership, and using error messages to identify issues.
The document discusses views and materialized views in data warehousing and decision support systems. It covers three main points:
1) OLAP queries typically involve aggregate queries, so precomputation is essential for fast response times. Materialized views allow precomputing aggregates across multiple dimensions.
2) Warehouses can be thought of as collections of asynchronously replicated tables and periodically maintained views, renewing interest in efficient view maintenance.
3) Materialized views store the results of views in the database for fast access like a cache, but they require maintenance as underlying tables change. Incremental maintenance algorithms are ideal to efficiently update materialized views.
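The incremental idea in point 3 can be shown in miniature: keep a running SUM and COUNT per group and apply only the delta of each inserted row, rather than rescanning the base table. A hedged sketch, with an invented table layout and column names:

```python
# Sketch only: incremental maintenance of a materialized SUM/COUNT
# aggregate. The "table" and grouping key are invented for illustration.
base_table = []          # rows: (group_key, measure)
view = {}                # group_key -> [running_sum, running_count]

def insert_row(key, value):
    """Append a row and apply its delta to the materialized view."""
    base_table.append((key, value))
    agg = view.setdefault(key, [0, 0])
    agg[0] += value      # maintain SUM incrementally
    agg[1] += 1          # maintain COUNT incrementally

insert_row("2016-Q1", 100)
insert_row("2016-Q1", 50)
insert_row("2016-Q2", 70)
# AVG is derivable per group as running_sum / running_count
```

Deletions and updates work the same way with negative deltas; that is what makes SUM/COUNT (and hence AVG) "self-maintainable" aggregates, while MIN/MAX may force a rescan after a delete.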
The document discusses using XML and object-oriented APIs to improve interoperability between databases and applications. It proposes a specification called XORT to map XML to the relational model in a generic way. This allows applications to communicate with databases via XML in a language-independent manner. It also describes using PostgreSQL functions and ChadoXML services to encapsulate domain logic in a reusable way across different applications.
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine... (Rothamsted Research, UK)
Graph-based modelling is becoming more popular, in the sciences and elsewhere, as a flexible and powerful way to exploit data to power world-changing digital applications. Compared to the initial vision of the Semantic Web, knowledge graphs and graph databases are becoming a practical and computationally less formal way to manage graph data. On the other hand, linked data based on Semantic Web standards are a complementary, rather than alternative, approach to deal with these data, since they still provide a common way to represent and exchange information. In this paper we introduce rdf2neo, a tool to populate Neo4j databases starting from RDF data sets, based on a configurable mapping between the two. By employing agrigenomics-related real use cases, we show how such mapping can allow for a hybrid approach to the management of networked knowledge, based on taking advantage of the best of both RDF and property graphs.
Libraries and Linked Data: Looking to the Future (3) (ALATechSource)
This document provides an overview of tools for linking data, vocabularies, and application programming. It discusses common types of entities to describe like people, places, concepts and events. It also lists vocabularies and ontologies for identifying these entities as well as tools for developing vocabularies and metadata. Finally, it outlines several programming tools and frameworks for working with semantic data, building applications, and querying datasets, including Apache Jena, Pellet, Snoggle and Virtuoso.
Part of a joint presentation with Midori Harris comparing OWL (Web Ontology Language) and OBO (Open Biomedical Ontologies) as ontology languages, This presentation concentrates on OWL, Midori Harris presented OBO.
We’re all SMILES! Building Chemical Semantic Web Services with SADI, ChEBI, a... (Michel Dumontier)
This document summarizes a presentation about building chemical semantic web services using SADI, ChEBI, and CHEMINF. It discusses how SADI allows services to be discovered and composed to answer queries by reasoning over chemical ontologies. As an example, it describes how a SPARQL query engine could discover and invoke web services to determine if caffeine is a drug-like molecule according to the Lipinski rule of five.
The power of graphs to analyze biological data (datablend)
This document discusses using graphs to analyze biological datasets. It provides examples of using MongoDB to store gene expression data and performing Pearson correlation calculations to create a co-expression graph in Neo4j. It also discusses a prototype called FluxGraph that adds time-awareness to graphs, allowing traversal of graph states through time and comparison of temporal graphs. Potential use cases discussed include longitudinal analysis of patient data over many years.
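The co-expression step described there can be sketched simply: compute Pearson's r between two expression profiles and add an edge to the graph when it exceeds a threshold. The gene names, expression values, and the 0.9 cutoff below are all illustrative:

```python
import math

# Sketch only: Pearson correlation between two expression profiles,
# creating a co-expression edge above a chosen threshold.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up expression profiles across four samples.
expression = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [2.1, 3.9, 6.2, 8.0]}

edges = []
r = pearson(expression["geneA"], expression["geneB"])
if r > 0.9:                 # co-expression threshold, illustrative choice
    edges.append(("geneA", "geneB", r))
```

In the pipeline the abstract describes, the profiles would come from a document store and the resulting weighted edges would be bulk-loaded into Neo4j for traversal-based analysis.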
Part of a joint presentation with Midori Harris comparing OWL (Web Ontology Language) and OBO (Open Biomedical Ontologies) as ontology languages, This presentation concentrates on OWL, Midori Harris presented OBO.
We’re all SMILES! Building Chemical Semantic Web Services with SADI, ChEBI, a...Michel Dumontier
This document summarizes a presentation about building chemical semantic web services using SADI, ChEBI, and CHEMINF. It discusses how SADI allows services to be discovered and composed to answer queries by reasoning over chemical ontologies. As an example, it describes how a SPARQL query engine could discover and invoke web services to determine if caffeine is a drug-like molecule according to the Lipinski rule of five.
The power of graphs to analyze biological datadatablend
This document discusses using graphs to analyze biological datasets. It provides examples of using MongoDB to store gene expression data and performing Pearson correlation calculations to create a co-expression graph in Neo4j. It also discusses a prototype called FluxGraph that adds time-awareness to graphs, allowing traversal of graph states through time and comparison of temporal graphs. Potential use cases discussed include longitudinal analysis of patient data over many years.
This document discusses genome database systems. It begins with an introduction to bioinformatics and genomes. It then discusses the background of genome databases, including some examples. The major characteristics of genome database systems are described as having high complex data, schema changes at a rapid pace, and complex queries. The key areas of data management in genome databases are discussed as non-standard data, complex queries, data interpretation, integration across databases, and uniform management solutions. Major research areas and applications that impact society are also summarized.
FluxGraph: a time-machine for your graphsdatablend
The document describes FluxGraph, a time-aware graph library built on top of Datomic. FluxGraph allows users to travel through different versions of a graph over time, iterate over vertices and edges within a specific time scope, and compare graphs at different points in time. It implements the Blueprints graph interface to provide time-aware graph capabilities while retaining compatibility with existing graph tools and libraries.
WebRTC enables context based, embedded communication in any app or website. Skylink makes using WebRTC as simple as using jQuery for web developers.
Here is the link to the JS Remote Conf talk this presentation was held first: https://www.youtube.com/watch?v=x2IHJBp2TTo
Each month, join us as we highlight and discuss hot topics ranging from the future of higher education to wearable technology, best productivity hacks and secrets to hiring top talent. Upload your SlideShares, and share your expertise with the world!
Not sure what to share on SlideShare?
SlideShares that inform, inspire and educate attract the most views. Beyond that, ideas for what you can upload are limitless. We’ve selected a few popular examples to get your creative juices flowing.
SlideShare is a global platform for sharing presentations, infographics, videos and documents. It has over 18 million pieces of professional content uploaded by experts like Eric Schmidt and Guy Kawasaki. The document provides tips for setting up an account on SlideShare, uploading content, optimizing it for searchability, and sharing it on social media to build an audience and reputation as a subject matter expert.
Bio4j: A pioneer graph based database for the integration of biological Big D...graphdevroom
The document describes Bio4j, an open source graph database for biological data integration. Bio4j stores biological data like Uniprot, Gene Ontology, and taxonomies in a graph structure using Neo4j technology. This allows for flexible querying of semantic relationships between data. Bio4j provides APIs for easy access to integrated data and can be customized with additional datasets. It aims to improve on relational databases for biological data through its scalable graph model.
. Images have an irrefutably central role in scientific discovery and discourse.
However, the issues associated with knowledge management and utility operations
unique to image data are only recently gaining recognition. In our previous
work, we have developed Yale Image finder (YIF), which is a novel Biomedical image
search engine that indexes around two million biomedical image data, along with
associated metadata. While YIF is considered to be a veritable source of easily accessible
biomedical images, there are still a number of usability and interoperability challenges
that have yet to be addressed. To overcome these issues and to accelerate the
adoption of the YIF for next generation biomedical applications, we have developed a
publically accessible semantic API for biomedical images with multiple modalities.
The core API called iCyrus is powered by a dedicated semantic architecture that exposes
the YIF content as linked data, permitting integration with related information
resources and consumption by linked data-aware data services. To facilitate the adhoc
integration of image data with other online data resources, we also built semantic
web services for iCyrus, such that it is compatible with the SADI semantic web service
framework. The utility of the combined infrastructure is illustrated with a number
of compelling use cases and further extended through the incorporation of Domeo, a
well known tool for open annotation. Domeo facilitates enhanced search over the
images using annotations provided through crowdsourcing. The iCyrus triplestore
currently holds more than thirty-five million triples and can be accessed and operated
through syntactic or semantic query interfaces. Core features of the iCyrus API,
namely: data reusability, system interoperability, semantic image search, automatic
update and dedicated semantic infrastructure make iCyrus a state of the art resource
for image data discovery and retrieval
Open interoperability standards, tools and services at EMBL-EBIPistoia Alliance
In this webinar Dr Henriette Harmse from EMBL-EBI presents how they are using their ontology services at EMBL-EBI to scale up the annotation of data and deliver added value through ontologies and semantics to their users.
This document provides an introduction and table of contents to the textbook "An Introduction to Relational Database Theory" by Hugh Darwen. The introduction dedicates the book to researchers at IBM in the 1970s who designed the relational database language ISBL. The table of contents outlines the 8 chapters and 2 appendices that make up the book, providing an overview of the key topics to be covered including relational algebra, constraints, database design, and more.
Life Science Database Cross Search and MetadataMaori Ito
Life science databases are sometimes difficult to understand due to lack of information. I'd like to add metadata into databases and improve search results.
Elsevier is the world's largest publisher of scientific, medical and technical (STM) content. An early adopter of XML as a standard representation for content, Elsevier has used MarkLogic in the development of a range of information access and discovery solutions for its customers. This presentation will cover Elsevier's experience with XML-centric content management systems in general and MarkLogic's technology in specific, describing Elsevier's initial adoption and uptake of the technology, current use within the Elsevier suite of online products and solutions, and opportunities for future use. Design patterns for content repositories within a publishing context that have emerged during our use of the technology will be described, and we will touch on a number of issues that have emerged, including XQuery and its adoption within the developer community, the challenges facing XML from new representations for documents and metadata such as JSON and RDF, and the delivery of search applications based on XML infrastructure.
A report presented in my BNF 216 (Database Design and Modeling for Bioinformatics) class regarding principles and tips to follow in designing biological databases.
The digital universe is booming, especially metadata and user-generated data. This raises strong challenges in order to identify the relevant portions of data which are relevant for a particular problem and to deal with the lifecycle of data. Finer grain problems include data evolution and the potential impact of change in the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Objects concept as the means to indentify and structure relevant data in scientific domains, addressing data as first-class citizens. We also identify and formally represent the main reasons for decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss on the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators the ability to link data is critical for dynamic interoperability. Adoption of linked data paradigm allows BioPharma to focus on core business: delivering valuable therapeutics in a timely manner.
The document introduces Sean Bechhofer and provides his contact information, including that he is from the University of Manchester, his email address, Twitter handle, and blog. It then lists several publications and projects related to reproducible and open research, including myExperiment and Research Objects, with the goal of facilitating exchange and reuse of digital knowledge. Key challenges discussed are how to move beyond linear paper publications to frameworks that better support reuse of digital assets like workflows and datasets.
This presentation discusses standards for sharing functional genomics data. It summarizes lessons learned from the Minimum Information About a Microarray Experiment (MIAME) standard, including that simply depositing data is not enough - metadata, analysis code, and usable formats are also needed for reproducibility. For high-throughput sequencing data, a Minimum Information about a high-throughput Nucleotide Sequencing Experiment (MINSEQE) standard is proposed with similar requirements as MIAME. The presentation emphasizes keeping standards simple while ensuring machine-readability for reuse.
Rashad Badrawi has training in biological and computer sciences. He has a BS in Biology, MS degrees in Pharmacology and Information Systems. He has worked as a software engineer and bioinformatics specialist at several universities and companies. Some of his projects include building GeneMania's data warehouse, translating the BIND interactions database, and designing interoperability between Virtual Cell and systems biology standards. His strengths include being a self-starter, team player, and mentor who enjoys building products from start to finish while keeping up with advances in biomedical informatics.
Linked Data is an evolving set of techniques for publishing and consuming data on the Web. Learn how Linked Data can turn the Web into a distributed database and how you can participate. In this session, Bernadette Hyland takes the mystery out of Linked Data by summarizing seven steps to prepare your data sets as Linked Data and announce it so others will use it.
Polyglot Persistence with MongoDB and Neo4jCorie Pollock
Learn how to enhance your application by using Neo4j and MongoDB together. Polyglot persistence is the concept of taking advantage of the strengths of different database technologies to improve functionality and enhance your application. In this webinar we will examine some use cases where it makes sense to use a document database (MongoDB) with a graph database (Neo4j) in a single application. Specifically, we will show how MongoDB can be used to provide search and browsing functionality for a product catalog while using Neo4j to provide personalized product recommendations. Finally we will look at the Neo4j Doc Manager project which facilitates syncing data from MongoDB to Neo4j to make polyglot persistence with MongoDB and Neo4j much easier.
2010 CASCON - Towards a integrated network of data and services for the life ...Michel Dumontier
Towards a integrated network of data and services for the life sciences Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions to formalize web service inputs and outputs using OWL ontologies that enable the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example in the design and deployment of chemical semantic web services using the Chemical Development Toolkit, chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper level ontology of basic types and relations. I will discuss how one can make use of the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical semantic web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.
Business Intelligence is becoming more important for both Government and Commercial Enterprises as a result of the increased focus on operational efficiency in a less buoyant marketplace.
We have all known for a while that Discoverer was no longer a strategic BI product for Oracle. The question now is where do we go next?
This presentation will consider the advantages and disadvantages of a number of options including Oracle BI Enterprise Edition, Application Express, and other non Oracle offerings such as Jasper Reports. We will also look at various approaches for moving to these alternatives.
This document contains an overview of object-oriented programming (OOP) concepts and common OOP interview questions. It begins with basic questions about OOP terms and features like classes, objects, encapsulation, inheritance and polymorphism. It then covers more advanced topics such as the differences between compile-time and runtime polymorphism, abstract classes, interfaces and access specifiers. The document provides examples in C++ and Java to illustrate various OOP concepts.
The document provides an overview of object-oriented programming (OOP) concepts and common OOP interview questions. It begins with basic questions about OOP terms and features. It then covers more advanced topics like classes, objects, encapsulation, polymorphism, inheritance, and abstraction. The document lists over 40 questions on OOP concepts and includes coding problems. It is intended to help prepare for OOP interviews.
Similar to Graph DB + Bioinformatics: Bio4j, recent applications and future directions (20)
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
Main Java[All of the Base Concepts}.docxadhitya5119
This is part 1 of my Java Learning Journey. This Contains Custom methods, classes, constructors, packages, multithreading , try- catch block, finally block and more.
Graph DB + Bioinformatics: Bio4j, recent applications and future directions
1. Graph DB + Bioinformatics: Bio4j, recent applications and future directions
www.ohnosequences.com www.bio4j.com
2. But who's this guy talking here?
I am currently working as a bioinformatics consultant/developer/researcher at
Oh no sequences!, and I have been here at the Ohio State University working as a
Visiting Scholar during these last two months.
Oh no what!?
We are the R&D group at Era7 Bioinformatics.
We like bioinformatics, cloud computing, NGS, category theory, bacterial
genomics… well, lots of things.
What about Era7 Bioinformatics?
Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis,
knowledge management and sequencing data interpretation.
Our area of expertise revolves around biological sequence analysis, particularly
Next Generation Sequencing data management and analysis.
3. We're a small but quite peculiar company! (in the good sense, of course)
Currently we have offices in:
Granada (Spain)
Madrid (Spain)
Boston MA (USA)
Yeah, I know what you're thinking:
they are not precisely ugly cities…
4. Our team is multidisciplinary: bioinformaticians, mathematicians, lab
researchers, immunologists, biologists specialized in biochemistry, and IT
professionals.
A team formed by people with different backgrounds is able to analyze the
same problem from different points of view.
We are based in research:
in a fast-changing area, our activity depends on being able to offer
cutting-edge solutions. This is only possible by maintaining a continuous
research and innovation activity.
In addition, since many of our customers are researchers, being part
of that community allows us to be really customer oriented.
5. Everything we do is 100% open source!
Yes, we hate patents.
And no, we're not crazy (or maybe just a bit…)
OK, that's really nice, but how can that actually work??
• Free marketing and dissemination
• We can use other bioinformatics open source tools/DBs/etc.
• Faster adaptation to a fast-changing field (bioinformatics, genomics)
• You may not earn a lot of money, but you earn enough money doing many
creative things
6. Money? Where from??
• Providing services
• Adapting services to different infrastructures and frameworks…
OK, but you could probably get much more money with
a different business model…
Yeah, but this is our philosophy!
7. We are also based on Cloud Computing
Cloud Computing has revolutionized the world of computing because in this
paradigm you get infrastructure as a service (IaaS). We are experts in the
use of the leader of this world: Amazon Web Services (AWS).
So, what do we get?
a) No investment in infrastructure. Pay per use.
b) Scalability: for example, we can launch just one virtual server for two
hours, or more than one hundred for ten hours, depending on the
amount of data to be analyzed in different projects.
8. What's Bio4j?
Bio4j is a bioinformatics graph-based DB including most data
available in:
Uniprot KB (SwissProt + Trembl)
Gene Ontology (GO)
UniRef (50, 90, 100)
NCBI Taxonomy
RefSeq
Enzyme DB
9. What's Bio4j?
It provides a completely new and powerful framework
for protein-related information querying and management.
Since it relies on a high-performance graph engine, data
is stored in a way that semantically represents its own
structure.
10. What's Bio4j?
Bio4j uses Neo4j technology, a "high-performance graph
engine with all the features of a mature and robust
database".
Thanks both to being based on the Neo4j DB and to the API
provided, Bio4j is also very scalable, allowing anyone
to easily incorporate their own data and make the best
out of it.
11. What's Bio4j?
Everything in Bio4j is open source!
Released under AGPLv3.
12. Bioinformatics DBs and Graphs
Highly interconnected, overlapping knowledge spread throughout different DBs.
Talk sections:
• Bioinformatics DBs and Graphs
• Initial motivation
• Bio4j structure
• Some samples
• Why Bio4j?
• Bio4j and the Cloud
13. Bioinformatics DBs and Graphs
However, all this data is in most cases modeled in relational databases,
sometimes even just as plain CSV files.
As the amount and diversity of data grows, domain models
become crazily complicated!
14. Bioinformatics DBs and Graphs
With a relational paradigm, the double implication
Entity <-> Table
does not go both ways:
• You get 'auxiliary' tables that have no relationship with the small
piece of reality you are modeling.
• You need 'artificial' IDs only for connecting entities (and these are mixed
with IDs that somehow live in reality).
• Entity-relationship models are cool, but in the end you always have to
deal with 'raw' tables plus SQL.
• Integrating/incorporating new knowledge into already existing
databases is hard, and sometimes even not possible without changing
the domain model.
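The 'artificial IDs' point can be made concrete with a minimal sketch (plain Java with hypothetical names and data, not Bio4j code): in the relational style, a protein is wired to a GO term through surrogate keys in a join table, while in the graph style the association is a direct, typed reference.

```java
import java.util.*;

public class JoinVsEdge {
    // Relational style: entities keyed by artificial surrogate IDs,
    // connected through a separate join table of ID pairs.
    static final Map<Integer, String> PROTEINS = Map.of(1, "P53_HUMAN");
    static final Map<Integer, String> GO_TERMS = Map.of(7, "GO:0006915");
    static final List<int[]> ANNOTATION_JOIN = List.of(new int[]{1, 7}); // (protein_id, go_id)

    // Resolving the association means an ID lookup per hop (a "join").
    static List<String> annotationsViaJoin() {
        List<String> out = new ArrayList<>();
        for (int[] row : ANNOTATION_JOIN) {
            out.add(PROTEINS.get(row[0]) + " -> " + GO_TERMS.get(row[1]));
        }
        return out;
    }

    // Graph style: the association is a direct, typed reference, so no
    // surrogate keys are needed just to wire the two entities together.
    record GoTerm(String id, String name) {}
    record Protein(String accession, List<GoTerm> goAnnotations) {}

    static Protein p53() {
        GoTerm apoptosis = new GoTerm("GO:0006915", "apoptotic process");
        return new Protein("P53_HUMAN", List.of(apoptosis));
    }

    public static void main(String[] args) {
        System.out.println(annotationsViaJoin());
        System.out.println(p53().goAnnotations().get(0).id());
    }
}
```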
15. Bioinformatics DBs and Graphs
Life in general, and biology in particular, are probably not 100% like a graph…
but one thing's sure: they are not a set of tables!
16. NoSQL (not only SQL)
NoSQ… what!??
Let's see what Wikipedia says:
"NoSQL is a broad class of database management systems
that differ from the classic model of the relational database
management system (RDBMS) in some significant ways.
These data stores may not require fixed table schemas,
usually avoid join operations and typically scale
horizontally."
17. NoSQL data models
18. Neo4j
Neo4j is a high-performance, NoSQL graph database with all
the features of a mature and robust database.
The programmer works with an object-oriented, flexible
network structure rather than with strict and static tables,
and gets all the benefits of a fully transactional,
enterprise-strength database.
For many applications, Neo4j offers performance
improvements on the order of 1000x or more compared to
relational DBs.
19. Initial motivation
Ok, but why start all this? Were you so bored…?!
It all started somehow around our need for massive access to
protein GO (Gene Ontology) annotations.
At that point I had to develop my own MySQL DB based on the official
GO SQL database, and problems started from the beginning:
• I got crazy 'deciphering' how to extract Uniprot protein annotations
from the GO official tables schema.
• Uniprot and GO official protein annotations were not always consistent.
• Populating my own DB took really long due to all the joins and
subqueries needed in order to get and store the protein annotations.
Soon enough we also had the need for massive access to basic
protein information.
20. Initial motivation
These processes had to be automated for our bacterial genome
annotation system BG7 (specifically designed for NGS data).
The available Uniprot web services were too limited:
- Slow
- Limited number of queries
- Too little information available
So I downloaded the whole Uniprot DB in XML format
(Swiss-Prot + Trembl)
and started to have some fun with it!
21. BG7 algorithm
1. Selection of the specific reference protein set
2. Prediction of possible genes by BLAST similarity
3. Gene definition: merging compatible similarity regions, detecting start and stop
4. Solving overlapped predicted genes
5. RNA prediction by BLAST similarity
6. Final annotation and complete deliverables. Quality control.
www.era7bioinformatics.com
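Step 3 (gene definition by merging compatible similarity regions) can be sketched, in a heavily simplified form, as merging overlapping or adjacent intervals on a contig. The method name and data shapes here are illustrative assumptions, not BG7's actual implementation.

```java
import java.util.*;

public class RegionMerge {
    // Heavily simplified sketch of BG7's step 3: "compatible" is reduced here
    // to overlapping-or-adjacent BLAST similarity regions, which are merged
    // into a single candidate gene region.
    static List<int[]> mergeRegions(List<int[]> regions) {
        List<int[]> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] r : sorted) {
            if (!merged.isEmpty() && r[0] <= merged.get(merged.size() - 1)[1] + 1) {
                int[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], r[1]); // extend the current region
            } else {
                merged.add(new int[]{r[0], r[1]}); // start a new region
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two overlapping similarity hits plus one separate hit.
        List<int[]> hits = List.of(new int[]{100, 250}, new int[]{240, 400}, new int[]{500, 650});
        for (int[] r : mergeRegions(hits)) System.out.println(r[0] + ".." + r[1]);
    }
}
```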
22. Initial motivation
We got used to having massive direct access to all this protein-related
information…
So why not add other resources that we needed quite often in most
projects, and which were now becoming a sort of bottleneck compared
to all those already included in Bio4j?
Then came:
- Isoform sequences
- Protein interactions and features
- UniRef 50, 90, and 100
- RefSeq
- NCBI Taxonomy
- Expasy Enzyme DB
23. Bio4j structure
Let's dig a bit into the Bio4j structure:
data sources and their relationships.
24. Bio4j structure
Bio4j domain model
25. The Graph DB model: representation
Core abstractions:
• Nodes
• Relationships between nodes
• Properties on both
26. Bioinformatics DBs and Graphs: Bio4j structure
Let's dig a bit into the Bio4j structure: how are things modeled?
Couldn't be simpler!
- Entities --> Nodes
- Associations / Relationships --> Edges
27. Bioinformatics DBs and Graphs: Some samples
Some examples of nodes would be:
- GO term
- Protein
- Genome Element
and of relationships:
- Protein -[PROTEIN_GO_ANNOTATION]-> GO term
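The node/relationship mapping above can be illustrated with a toy property graph in plain Python. This is only a stand-in for the real Bio4j/Neo4j model: the class, the node IDs, and the property names are invented for illustration (the protein name is taken from the P12345 entry shown later in the Gremlin examples):

```python
class Graph:
    """A minimal property graph: nodes and edges, with properties on nodes."""

    def __init__(self):
        self.nodes = {}   # node id -> property dict (including a 'type')
        self.edges = []   # (source id, relationship type, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))

    def out_neighbors(self, node_id, rel_type):
        """Follow outgoing edges of the given relationship type."""
        return [t for s, r, t in self.edges if s == node_id and r == rel_type]

g = Graph()
g.add_node("P12345", type="Protein",
           full_name="Aspartate aminotransferase, mitochondrial")
g.add_node("GO:0005739", type="GoTerm", name="mitochondrion")
g.add_edge("P12345", "PROTEIN_GO_ANNOTATION", "GO:0005739")

g.out_neighbors("P12345", "PROTEIN_GO_ANNOTATION")  # -> ["GO:0005739"]
```

The point of the model is exactly this directness: a protein's GO annotations are one edge traversal away, with no join tables in between.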
28. Bioinformatics DBs and Graphs: Bio4jExplorer
We have developed Bio4jExplorer, a tool aimed to serve both as a reference manual and as a first contact with the Bio4j domain model.
Bio4jExplorer allows you to:
• Navigate through all nodes and relationships
• Access the javadocs of any node or relationship
• Graphically explore the neighborhood of a node/relationship
• Look up the indexes that may serve as an entry point for a node
• Check the incoming/outgoing relationships of a specific node
• Check the start/end nodes of a specific relationship
29. Bioinformatics DBs and Graphs: Entry points and indexing
There are two kinds of entry points into the graph:
1. Auxiliary relationships going from the reference node, e.g.:
   - CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology
   - MAIN_DATASET: leads to both main datasets, Swiss-Prot and TrEMBL
2. Node indexing. There are two types of node indexes:
   - Exact: only exact values are considered hits
   - Fulltext: regular expressions can be used
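The two index semantics can be sketched in a few lines of Python: an exact index returns hits only for the exact value, while a fulltext index can match a regular expression. The index contents below are made up for illustration:

```python
import re

# A toy index: indexed value -> list of node ids (contents are invented).
keyword_index = {
    "KW-0181": ["node-42"],
    "KW-0181-like": ["node-43"],
}

def exact_lookup(index, value):
    """Exact index: only exact values are considered hits."""
    return index.get(value, [])

def fulltext_lookup(index, pattern):
    """Fulltext index: regular expressions can be used."""
    regex = re.compile(pattern)
    return [node_id
            for key, nodes in index.items() if regex.search(key)
            for node_id in nodes]

exact_lookup(keyword_index, "KW-0181")      # -> ["node-42"]
fulltext_lookup(keyword_index, r"KW-0181")  # -> ["node-42", "node-43"]
```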
30. Bioinformatics DBs and Graphs: Querying Bio4j with Cypher
Getting a keyword by its ID:

    START k=node:keyword_id_index(keyword_id_index = "KW-0181")
    RETURN k.name, k.id

Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

    START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
    MATCH d <-[r:PROTEIN_DATASET]- p,
          circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                        -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                        -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
    RETURN p.accession, p2.accession, p3.accession

Check this blog post for more info and our Bio4j Cypher cheat sheet.
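What the second Cypher query does, finding directed 3-cycles of protein-protein interactions anchored at a Swiss-Prot protein, can be mirrored in plain Python over an adjacency map. The interaction graph below is invented for illustration, with 'A' playing the Swiss-Prot protein:

```python
def three_cycles(adj, anchors):
    """Find directed 3-cycles p -> p2 -> p3 -> p starting from anchor nodes."""
    cycles = []
    for p in anchors:
        for p2 in adj.get(p, []):
            for p3 in adj.get(p2, []):
                # Require a genuine 3-cycle closing back to the anchor.
                if p3 != p and p in adj.get(p3, []):
                    cycles.append((p, p2, p3))
    return cycles

# Invented PROTEIN_PROTEIN_INTERACTION edges.
interactions = {"A": ["B"], "B": ["C"], "C": ["A"]}
three_cycles(interactions, anchors=["A"])  # -> [("A", "B", "C")]
```

The Cypher version expresses the same traversal declaratively and lets the database engine pick the execution strategy, which is the main attraction of querying the graph rather than hand-coding the loops.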
31. Bioinformatics DBs and Graphs: Gremlin, a graph traversal language
Get a protein by its accession number and return its full name:

    gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
    ==> Aspartate aminotransferase, mitochondrial

Get the proteins (accessions) associated with an InterPro motif (limited to 4 results):

    gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
    ==> E2GK26
    ==> G3PMS4
    ==> G3Q865
    ==> G3PIL8

Check our Bio4j Gremlin cheat sheet.
32. Bioinformatics DBs and Graphs: Visualizations (1) REST Server Data Browser
Navigate through Bio4j data in real time!
33. Bioinformatics DBs and Graphs: Visualizations (2) Bio4j + Gephi
Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
34. Bioinformatics DBs and Graphs: Visualizations (3) Bio4j GO Tools
35. Bioinformatics DBs and Graphs: Why would I use Bio4j?
- Massive access to protein/genome/taxonomy-related information
- Integration of your own DBs/resources around common information
- Development of services tailored to your needs, built around Bio4j
- Network analysis
- Visualizations
Besides many others I just cannot think of myself…
If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
36. Bio4j + Cloud (1)
We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:
Interoperability and data distribution: releases are available as public EBS snapshots, giving AWS users the chance to create fully ready Bio4j DB volumes and attach them to their instances in just a few seconds.
CloudFormation templates:
- Basic Bio4j DB instance
- Bio4j REST server instance
37. Bio4j + Cloud (2)
Backup and storage using S3 (Simple Storage Service): we use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).
What kind of benefits do we get from this?
• Easy to use
• Flexible
• Cost-effective
• Reliable
• Scalable and high-performance
• Secure
38. Bio4j + Cloud (3)
Web servers and service providers in the cloud: deploying your own web server in AWS with Bio4j as a back-end is really simple.
A good example of this is Bio4jTestServer, a continuously developed server showcasing web services based on Bio4j.
39. Community
Bio4j has a fast-growing internet presence:
- Twitter: check @bio4j for updates
- Blog: go to http://blog.bio4j.com
- Mailing list: ask any question you may have on our list
- LinkedIn: check the Bio4j group
- GitHub issues: don't be shy! Open a new issue if you think something's going wrong.
40. And the best part of all this is: you have the latest version of Bio4j already imported and fully working in EgStation! ;)
41. Bio4j + MG7 for the integration and analysis of ChIP-seq data
42. Bio4j + MG7 + 24 ChIP-seq samples
Some numbers:
• 157,639,502 nodes
• 742,615,705 relationships
• 632,832,045 properties
• 148 relationship types
• 44 node types
And it works just fine!
44. What's MG7?
MG7 is a new system for the massive analysis of sequences from metagenomics samples, specially designed for next-generation sequencing technologies.
MG7 uses cloud computing to solve the problem of massive data analysis, providing scalable, real-time, on-demand computing for metagenomics data analysis.
MG7 is able to obtain annotation and functional profiles for shotgun genomic sequences, and taxonomic assignment for any type of read.
The inference of function and the assignment of taxonomic origin for each sequence are based on massive BLAST similarity analysis.
45. What's MG7?
MG7 lets you choose the parameters that set the thresholds for filtering the BLAST hits:
i. E-value
ii. Identity and query coverage
It allows exporting the results of the analysis to different data formats such as:
• XML
• CSV
• GEXF (Graph Exchange XML Format)
It also provides the user with heat maps and graph visualizations, and includes a user-friendly interface, MG7Viewer, that gives access to the alignment responsible for each functional or taxonomic read assignment and displays the frequencies in the taxonomic tree.
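The threshold-based hit filtering described above can be sketched as follows; the hit fields, the sample data, and the default cut-offs are illustrative assumptions, not MG7's actual values:

```python
def filter_hits(hits, max_evalue=1e-10, min_identity=90.0, min_query_coverage=70.0):
    """Keep BLAST hits passing the e-value, identity and query-coverage thresholds.

    Field names and default thresholds are invented for illustration.
    """
    return [
        h for h in hits
        if h["evalue"] <= max_evalue
        and h["identity"] >= min_identity
        and h["query_coverage"] >= min_query_coverage
    ]

hits = [
    {"id": "hit1", "evalue": 1e-30, "identity": 98.0, "query_coverage": 95.0},
    {"id": "hit2", "evalue": 1e-5,  "identity": 99.0, "query_coverage": 99.0},
    {"id": "hit3", "evalue": 1e-40, "identity": 60.0, "query_coverage": 85.0},
]
[h["id"] for h in filter_hits(hits)]  # -> ["hit1"]
```

Only hit1 survives here: hit2 fails the e-value cut-off and hit3 the identity cut-off, which is exactly the kind of tuning the exposed parameters allow.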
49. Bio4j + GRG
A completely new approach for modeling genomic information and gene regulatory networks
50. Bio4j + GRG
Integrating genomic information from organisms such as:
• Zea mays subsp. mays
• Oryza sativa Japonica Group
• Sorghum bicolor
• Brachypodium distachyon
• Arabidopsis thaliana Columbia
• Arabidopsis lyrata subsp. lyrata MN47
51. Bio4j + GRG domain model
52. Bio4j + GRG
Get all the advantages of Bio4j and graph DBs while modeling genomic data for grasses (although it could also be applied to other species/families).
The possibility of integrating data from other projects here at CAPS/EGLab in a common framework.
Data mining of data that is currently not accessible, or simply not structured well enough to explore, whether external genomic data from sites like Phytozome or data coming directly from the experiments/analyses performed in the lab.
A common framework for accessing all this information together with other "universal" resources such as Uniprot, RefSeq, or Gene Ontology.
53. Bio4j + GRG
A chance for the lab to enter the cloud and graph DB world, pioneering in providing access to this sort of data to a whole set of different possible users.
No more worrying about possible problems with back-ups, maintaining infrastructure, or things like that…
And, what's more important, scalability: being able to adapt to the specific needs of new projects as they go along.
54. And the best part… Acknowledgments!
Bio4j + MG7 + ChIP-seq results
Bio4j + GRG
55. The other guys from the basement…
(Brett)
(Matias)
(Andrew)
56. And of course the rest of the Lab !
57. That's it! Thanks for your time ;)