This document discusses Neo4j and its applications in bioinformatics. It describes Bio4j, an open-source bioinformatics graph database built on Neo4j that integrates data from sources such as UniProt, NCBI Taxonomy, Gene Ontology, and more. Bio4j models biological data as nodes and relationships in a graph structure rather than as tables, which allows for more flexible querying and knowledge integration. The document provides examples of how Bio4j can be accessed through its Java API, the Cypher query language, the Gremlin traversal language, and a REST API. It also describes some tools and visualizations for exploring and analyzing Bio4j data.
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.3- Introduction to NGS Variant Calling Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) of the Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Introduction to protein bioinformatics, including databases like InterPro, UniProt, and PDB.
- An introduction to protein families, domains, sequence features, and protein signatures.
- An introduction to protein structure prediction.
Next-generation sequencing and quality control: An Introduction (2016) — Sebastian Schmeier
This lecture is part of an introductory bioinformatics workshop. It gives a background on what sequencing is, what the results of a sequencing experiment are, how to assess the quality of a sequencing run, what error sources exist and how to deal with errors. The accompanying website is available at http://sschmeier.com/bioinf-workshop/
Oxford Nanopore was founded as Oxford NanoLabs by Dr. Gordon Sanghera, Dr. Spike Willcocks and Professor Hagan Bayley. Nanopore sequencing has been around since the 1990s, when Church et al. and Deamer and Akeson separately proposed that it is possible to sequence DNA using nanopore sensors.
Short tutorials on how to use the web-based tool DAVID (Database for Annotation, Visualization and Integrated Discovery) - http://david.abcc.ncifcrf.gov/
DAVID provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.
This is the first presentation of the BITS training on 'Comparative genomics'.
It reviews the basic concepts of sequence homology on different levels.
Thanks to Klaas Vandepoele of the PSB department.
Course: Bioinformatics for Biomedical Research (2014).
Session: 4.1- Introduction to RNA-seq and RNA-seq Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) of the Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Managing Genetic Ancestry at Scale with Neo4j and Kafka - StampedeCon 2015 — StampedeCon
At the StampedeCon 2015 Big Data Conference: The global Monsanto R&D pipeline produces millions of new plant populations every year, each of which contributes to a dataset of genetic ancestry spanning several decades. Historically, the constraints of modeling and processing this data within an RDBMS have made drawing inferences from this dataset complex and computationally infeasible at large scale. Fortunately, the genetic history of any plant population forms a naturally occurring directed acyclic graph, a property that has allowed us to utilize graph theory to re-imagine how ancestral lineage data is modeled, stored, and queried.
In this talk we present our solutions to these problems, as realized using a graph-based approach within Neo4j. We will discuss our learnings around using Neo4j in a production setting that includes transactional and high-throughput computation, including how we transitioned from recursive JOIN queries to using Cypher and the Neo4j traversal framework to take full advantage of index-free adjacency. Our approach to polyglot persistence will be discussed via our use of a distributed commit log, Apache Kafka, to feed our graph store from sources of live transactional data. Finally, we will touch upon how we are using these technologies to annotate our genetic ancestry dataset with molecular genomics data in order to build a pipeline-scale genotype imputation platform with core algorithms built using Apache Spark.
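The shift the talk describes, from recursive self-JOINs to graph traversals, can be sketched in Cypher. This is a hedged illustration only, not Monsanto's actual schema: the `Population` label, `CHILD_OF` relationship type, and the `id`/`year` properties are assumed names.

```cypher
// Because plant ancestry forms a directed acyclic graph, a single
// variable-length pattern replaces an unbounded recursive self-JOIN:
// every ancestor of one population, at any depth, in one match.
MATCH (p:Population {id: 'POP-42'})-[:CHILD_OF*1..]->(ancestor:Population)
RETURN DISTINCT ancestor.id, ancestor.year
ORDER BY ancestor.year
```

With index-free adjacency each hop is a pointer dereference rather than an index lookup, which is why traversal depth scales far better than nested JOINs.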
Bio4j: A pioneer graph-based database for the integration of biological Big Data — Pablo Pareja Tobes
1. Bio4j
2. What’s Bio4j?: Data included
3. What’s Bio4j?: A completely new and powerful framework for protein-related information querying and management
4. What’s Bio4j?: Neo4j --> very scalable
5. What's Bio4j?: Everything in Bio4j is open source released under AGPLv3
6. Bioinformatics DBs and Graphs: Highly interconnected overlapping knowledge spread throughout different databases
7. Bioinformatics DBs and Graphs: Data is in most cases modeled in relational databases (sometimes even just as plain CSV files)
8. Bioinformatics DBs and Graphs: Problems of a relational paradigm
9. Bioinformatics DBs and Graphs: Life + Biology like a graph
10. Bioinformatics DBs and Graphs: NoSQL
11. Bioinformatics DBs and Graphs: NoSQLdata models
12. Bioinformatics DBs and Graphs: The Graph DB model: representation
13. Bioinformatics DBs and Graphs: Neo4j
14. Initial motivation: Why start all this?
15. Initial motivation: Processes had to be automated for BG7 (http://bg7.ohnosequences.com)
Graph databases in computational biology: the case of Neo4j and TitanDB — Andrei Kucharavy
Code used for demos is available from the https://github.com/chiffa/neo4jDemo repository
Code used for IO over the reactome is available from: https://github.com/chiffa/PolyPharma
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage... — Jonathan Eisen
Talk by Jonathan Eisen March 7, 2012 at the National Academy of Sciences Institute of Medicine "Forum on Microbial Threats" meeting on the "Social Biology of Microbes"
BITS: Overview of important biological databases beyond sequences — BITS
Module 4 Other relevant biological data sources beyond sequences
Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
2010 CASCON - Towards an integrated network of data and services for the life ... — Michel Dumontier
Towards an integrated network of data and services for the life sciences. Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions to formalize web service inputs and outputs using OWL ontologies that enable the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example in the design and deployment of chemical semantic web services using the Chemical Development Toolkit, chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper-level ontology of basic types and relations. I will discuss how one can make use of the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical semantic web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.
BioThings SDK: a toolkit for building high-performance data APIs in biology — Chunlei Wu
This is from my talk at BOSC 2017.
What’s BioThings?
We use “BioThings” to refer to objects of any biomedical entity-type represented in the biological knowledge space, such as genes, genetic variants, drugs, chemicals, diseases, etc.
BioThings SDK
SDK stands for “Software Development Kit”. BioThings SDK provides a Python-based toolkit to build high-performance data APIs (or web services) from a single data source or multiple data sources. It has a particular focus on building data APIs for biomedical entities, a.k.a. “BioThings”, though it’s not necessarily limited to the biomedical scope. For any given “BioThings” type, BioThings SDK helps developers aggregate annotations from multiple data sources and expose them as a clean, high-performance web API.
Demonstration of the applicability of the Linked Data Modeling Language and CHEMROF ( https://chemkg.github.io/chemrof/) for semantic chemical sciences. Presented at MADICES 2022. https://github.com/MADICES/MADICES-2022
Connecting life sciences data at the European Bioinformatics Institute — Connected Data World
Tony Burdett's slides from his talk at Connected Data London. Tony is a Senior Software Engineer at The European Bioinformatics Institute. He presented the complexity of data at the EMBL-EBI and what is their solution to make sense of all this data.
Event: Plant and Animal Genomes conference 2012
Speaker: Sandra Orchard
InterPro is an open-source protein resource used for the automatic annotation of proteins, and is scalable to the analysis of entire new genomes through the use of a downloadable version of InterProScan, which can be incorporated into an existing local pipeline. InterPro integrates protein signatures from 11 major signature databases (CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, and TIGRFAMs) into a single resource, taking advantage of the different areas of specialization of each to produce a resource that provides protein classification on multiple levels: protein families, structural superfamilies and functionally close subfamilies, as well as functional domains, repeats and important sites. The InterPro website has been improved, following extensive community consultation and a new version of InterProScan promises improved speed, ease of implementation as well as additional functionalities.
Ontologies and Semantic Web technologies play an important role in the life sciences to help make data more interoperable and reusable. There are now many publicly available ontologies that enable biologists to describe everything from gene function through to animal physiology and disease.
Various efforts such as the Open Biomedical Ontologies (OBO) foundry provide central registries for biomedical ontologies and ensure they remain interoperable through a set of common shared development principles.
At EMBL-EBI we contribute to the development of biomedical ontologies and make extensive use of them in the annotation of public datasets. Biological data typically comes with rich and often complex metadata, so the ontologies provide a standard way to capture “what the data is about” and gives us hooks to connect to more data about similar things.
These ontology annotations have been put to good use in a number of large-scale data integration efforts and there’s an increasing recognition of the need for ontologies in making data FAIR (Findable, Accessible, Interoperable and Reusable).
EMBL-EBI builds a number of integrative data platforms where ontologies are at the core of our domain models. One example is the Open Targets platform, where data about disease from 18 different databases can be aggregated and grouped based on therapeutic areas in the ontology and used to identify potential drug targets.
The ontologies team at EMBL-EBI provides a suite of services that are aimed at making ontologies more accessible for both humans and machines. We work with scientific data curators and software developers to integrate ontologies and semantics into both the data generation and data presentation workflows. We provide:
– An ontology lookup service (OLS) that provides search and visualisation services for over 200 ontologies
– Services for automating the annotation of metadata and learning from previous annotations (Zooma)
– An ontology mapping and alignment service (OXO)
– Tools for working with metadata and ontologies in spreadsheets (Webulous)
– Software for enriching documents in search engines to support “semantic” query expansion
I’ll present how we are using these services at EMBL-EBI to scale up the semantic annotation of metadata. I’ll talk about our open source technology stack and describe how we utilise a polyglot persistence approach (graph databases, triples stores, document stores etc) to optimize how we deliver ontologies and semantics to our users.
GraphRAG is All You Need? LLM & Knowledge Graph — Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GridMate - End to end testing is a critical piece to ensure quality and avoid... — ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
DevOps and Testing slides at DASA Connect — Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... — James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Communications Mining Series - Zero to Hero - Session 1 — DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
How to Get CNIC Information System with Paksim Ga.pptx — danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 — Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs — Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Securing your Kubernetes cluster: a step-by-step guide to success! — KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Pushing the limits of ePRTC: 100ns holdover for 100 days — Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Enhancing adoption of Open Source Libraries: A case study on Albumentations.AI — Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
The Art of the Pitch: WordPress Relationships and Sales — Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 5 — DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf — Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silos continue to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
2. But who’s this guy talking here?
I am currently working as a bioinformatics consultant/developer/researcher at Oh no sequences!
Oh no what!?
We are the R&D group at Era7 Bioinformatics: we like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things.
What about Era7 Bioinformatics?
Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation. Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis.
www.ohnosequences.com www.bio4j.com
3. In Bioinformatics we have highly interconnected overlapping knowledge spread throughout different DBs.
4. However, all this data is in most cases modeled in relational databases, sometimes even just as plain CSV files.
As the amount and diversity of data grows, domain models become crazily complicated!
5. With a relational paradigm, the double implication Entity ↔ Table does not go both ways:
You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.
You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).
Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.
Integrating/incorporating new knowledge into already existing databases is hard and sometimes even not possible without changing the domain model.
6. Life in general and biology in particular are probably not 100% like a graph… but one thing’s sure: they are not a set of tables!
8. Neo4j is a high-performance, NoSQL graph database with all the features of a mature and robust database.
The programmer works with an object-oriented, flexible network structure rather than with strict and static tables, with all the benefits of a fully transactional, enterprise-strength database.
For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
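As a rough illustration of where that speed-up comes from: a multi-hop traversal is a single pattern in Cypher, while the relational equivalent needs one self-JOIN per hop. A sketch in the legacy START-based Cypher dialect used later in this deck, reusing the protein_accession_index and PROTEIN_PROTEIN_INTERACTION names from the Bio4j examples (the accession is arbitrary):

```cypher
// Proteins reachable within three interaction hops of P12345.
// In SQL this would be three self-JOINs on an interactions table;
// in a graph each hop just follows pointers (index-free adjacency).
START p = node:protein_accession_index(protein_accession_index = "P12345")
MATCH (p)-[:PROTEIN_PROTEIN_INTERACTION*1..3]-(partner)
RETURN DISTINCT partner.accession
```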
9. What’s Bio4j?
Bio4j is a bioinformatics graph-based DB including most data available in:
UniProt KB (Swiss-Prot + TrEMBL), NCBI Taxonomy, Gene Ontology (GO), RefSeq, UniRef (50, 90, 100), Enzyme DB.
10. What’s Bio4j?
It provides a completely new and powerful framework for protein-related information querying and management.
Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
11. What’s Bio4j?
Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database".
Thanks both to being based on Neo4j and to the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the best out of it.
12. What’s Bio4j?
Everything in Bio4j is open source, released under AGPLv3!
13. Bio4j in numbers
The current version (0.7) includes:
Relationships: 530,642,683
Nodes: 76,071,411
Relationship types: 139
Node types: 38
14. Let’s dig a bit into Bio4j structure…
Data sources and their relationships:
16. The Graph DB model: representation
Core abstractions:
Nodes
Relationships between nodes
Properties on both
17. How are things modeled? Couldn’t be simpler!
Entities → Nodes
Associations / Relationships → Edges
18. Some examples of nodes would be: GO term, Protein, Genome Element;
and relationships: Protein -[PROTEIN_GO_ANNOTATION]-> GO term.
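The PROTEIN_GO_ANNOTATION example above can be written directly as a query pattern. A minimal sketch using the legacy START syntax from this deck; the GO term property names (`id`, `name`) are assumed for illustration and may differ in the actual Bio4j model:

```cypher
// GO terms annotating protein P12345, following the
// Protein -[PROTEIN_GO_ANNOTATION]-> GO term edge.
START p = node:protein_accession_index(protein_accession_index = "P12345")
MATCH (p)-[:PROTEIN_GO_ANNOTATION]->(goTerm)
RETURN goTerm.id, goTerm.name
```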
19. We have developed a tool aimed to be used both as a reference manual and as an initial contact with the Bio4j domain model: Bio4jExplorer.
Bio4jExplorer allows you to:
• Navigate through all nodes and relationships
• Access the javadocs of any node or relationship
• Graphically explore the neighborhood of a node/relationship
• Look up the indexes that may serve as an entry point for a node
• Check incoming/outgoing relationships of a specific node
• Check start/end nodes of a specific relationship
20. Entry points and indexing
There are two kinds of entry points for the graph:
Auxiliary relationships going from the reference node, e.g.
- CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology
- MAIN_DATASET: leads to both main datasets: Swiss-Prot and TrEMBL
Node indexing: there are two types of node indexes:
- Exact: only exact values are considered hits
- Fulltext: regular expressions can be used
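The two index types behave differently at query time. A hedged sketch: keyword_id_index is taken from the Cypher examples in this deck, while protein_fulltext_index and its protein_full_name key are illustrative names, not necessarily the ones shipped with Bio4j:

```cypher
// Exact index: only the exact value "KW-0181" is a hit.
START k = node:keyword_id_index(keyword_id_index = "KW-0181")
RETURN k.name, k.id

// Fulltext index: Lucene query syntax, so wildcards are allowed.
START p = node:protein_fulltext_index("protein_full_name:aspartate*")
RETURN p.accession
```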
21. Retrieving protein info (Bio4jModel Java API)

    // Creating manager and node retriever
    Bio4jManager manager = new Bio4jManager("/mybio4jdb");
    NodeRetriever nR = new NodeRetriever(manager);
    ProteinNode protein = nR.getProteinNodeByAccession("P12345");

    // Getting more related info...
    List<InterproNode> interpros = protein.getInterpro();
    OrganismNode organism = protein.getOrganism();
    List<GoTermNode> goAnnotations = protein.getGOAnnotations();
    List<ArticleNode> articles = protein.getArticleCitations();
    for (ArticleNode article : articles) {
        System.out.println(article.getPubmedId());
    }

    // Don't forget to close the manager
    manager.shutDown();
22. Querying Bio4j with Cypher
Getting a keyword by its ID:

    START k = node:keyword_id_index(keyword_id_index = "KW-0181")
    RETURN k.name, k.id

Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

    START d = node:dataset_name_index(dataset_name_index = "Swiss-Prot")
    MATCH d <-[r:PROTEIN_DATASET]- p,
          circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
    RETURN p.accession, p2.accession, p3.accession

Check this blog post for more info and our Bio4j Cypher cheat sheet.
23. A graph traversal language: Gremlin
Get a protein by its accession number and return its full name:

    gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
    ==> Aspartate aminotransferase, mitochondrial

Get proteins (accessions) associated to an InterPro motif (limited to 4 results):

    gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
    ==> E2GK26
    ==> G3PMS4
    ==> G3Q865
    ==> G3PIL8

Check our Bio4j Gremlin cheat sheet.
24. REST Server
You can also query/navigate through Bio4j with the Neo4j REST API!
The default representation is JSON, both for responses and for data sent with POST/PUT requests.
Get a protein by its accession number (Q9UR66):
http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66
Get outgoing relationships for protein Q9UR66:
http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
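As a minimal sketch, the two REST lookups above could be issued from Python with the standard library alone; `server_url` is the placeholder hostname from the slides, and the actual HTTP call is left commented out since it needs a live Neo4j server.

```python
# Sketch of building the two Neo4j REST URLs shown above.
# BASE uses the placeholder hostname from the slides.
import json
from urllib.request import urlopen

BASE = "http://server_url:7474/db/data"

def index_lookup_url(index, key, value):
    # Legacy Neo4j index lookup: /index/node/<index>/<key>/<value>
    return f"{BASE}/index/node/{index}/{key}/{value}"

def outgoing_rels_url(node_id):
    # Outgoing relationships for a node: /node/<id>/relationships/out
    return f"{BASE}/node/{node_id}/relationships/out"

url = index_lookup_url("protein_accession_index", "protein_accession_index", "Q9UR66")
print(url)
# A real call would look like this (needs a running server):
# nodes = json.loads(urlopen(url).read())
```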
25. Visualizations (1) REST Server Data Browser
Navigate through Bio4j data in real time!
27. Visualizations (3) Bio4j + Gephi
Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
28. Bio4j + Cloud
We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving
us the following benefits:
Interoperability and data distribution
Releases are available as public EBS snapshots, so AWS users can create 100%-ready Bio4j DB volumes and attach them to their instances in just a few seconds.
CloudFormation templates:
- Basic Bio4j DB Instance
- Bio4j REST Server Instance
Backup and Storage using S3 (Simple Storage Service)
We use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).
29. Why would I use Bio4j?
Massive access to protein/genome/taxonomy… related information
Integration of your own DBs/resources around common information
Development of services tailored to your needs built around Bio4j
Network analysis
Visualizations
Besides many others I cannot think of myself…
If you have something in mind for which Bio4j might be useful, please let us know so we
can all see how it could help you meet your needs! ;)
30. Community
Bio4j has a fast growing internet presence:
- Twitter: check @bio4j for updates
- Blog: go to http://blog.bio4j.com
- Mailing list: ask any questions you may have on our list.
- LinkedIn: check the Bio4j group
- GitHub issues: don't be shy! Open a new issue if you think something's going wrong.
31. OK, but why start all this?
Were you so bored…?!
It all started somehow around our need for massive access to protein GO
(Gene Ontology) annotations.
At that point I had to develop my own MySQL DB based on the official GO SQL database, and the problems started from the beginning:
- I went crazy 'deciphering' how to extract Uniprot protein annotations from the official GO table schema
- Uniprot and GO official protein annotations were not always consistent
- Populating my own DB took a really long time due to all the joins and subqueries needed to retrieve and store the protein annotations.
Soon enough we also needed massive access to basic protein information.
32. These processes had to be automated for BG7, our bacterial genome annotation system specifically designed for NGS data.
The available Uniprot web services were too limited:
- Slow
- Limits on the number of queries
- Too little information available
So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + TrEMBL) and started to have some fun with it!
33. BG7 algorithm
1. Selection of the specific reference protein set
2. Prediction of possible genes by BLAST similarity
3. Gene definition: merging compatible similarity regions, detecting start and stop
4. Resolving overlapping predicted genes
5. RNA prediction by BLAST similarity
6. Final annotation and complete deliverables. Quality control.
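Step 3 boils down to an interval-merging problem: BLAST similarity regions that overlap (or nearly touch) on the genome are fused into a single candidate gene region. A minimal sketch, with invented coordinates and a hypothetical `max_gap` parameter not taken from BG7 itself:

```python
# Sketch of merging compatible (overlapping or near-adjacent) BLAST
# similarity regions into candidate gene regions. Coordinates are
# invented illustration data, not real BG7 input.

def merge_regions(regions, max_gap=0):
    """Merge intervals that overlap or lie within max_gap bases of each other."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + max_gap:
            # extend the previous region instead of opening a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

hits = [(100, 250), (240, 400), (500, 650)]
print(merge_regions(hits))  # [(100, 400), (500, 650)]
```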
www.era7bioinformatics.com
34. We got used to having massive direct access to all this protein-related information…
So why not add other resources that we needed quite often in most projects, and which were now becoming a bottleneck compared to all those already included in Bio4j?
Then we incorporated:
- Isoform sequences
- Protein interactions and features
- UniRef 50, 90, and 100
- RefSeq
- NCBI Taxonomy
- ExPASy ENZYME DB
35. Bio4j + MG7 + 48 BLAST XML files (~1 GB each)
Some numbers:
• 157,639,502 nodes
• 742,615,705 relationships
• 632,832,045 properties
• 148 relationship types
• 44 node types
And it works just fine!
37. What's MG7?
MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits:
i. E-value
ii. Identity and query coverage
It allows exporting the results of the analysis to different data formats such as:
• XML
• CSV
• GEXF (Graph Exchange XML Format)
It also provides the user with heat maps and graph visualizations, and includes a user-friendly interface that gives access to the alignment responsible for each functional or taxonomic read assignment and displays the frequencies in the taxonomic tree --> MG7Viewer
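The threshold-based filtering described above can be sketched as a simple predicate over the hits. The field names and values below are illustrative assumptions, not MG7's actual data model:

```python
# Hedged sketch of MG7-style hit filtering: keep BLAST hits that pass the
# user-chosen e-value, identity and query-coverage thresholds.
# Field names and example values are invented for illustration.

def filter_hits(hits, max_evalue=1e-10, min_identity=90.0, min_coverage=70.0):
    return [
        h for h in hits
        if h["evalue"] <= max_evalue
        and h["identity"] >= min_identity
        and h["coverage"] >= min_coverage
    ]

hits = [
    {"id": "read1", "evalue": 1e-30, "identity": 98.0, "coverage": 95.0},
    {"id": "read2", "evalue": 1e-5,  "identity": 99.0, "coverage": 99.0},  # e-value too high
    {"id": "read3", "evalue": 1e-20, "identity": 80.0, "coverage": 90.0},  # identity too low
]
print([h["id"] for h in filter_hits(hits)])  # ['read1']
```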
41. Mining Bio4j data
Finding topological patterns in protein-protein interaction networks
42. Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
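A minimal sketch of the lowest-common-ancestor computation on a parent-pointer taxonomy, the kind of traversal one might run over Bio4j's NCBI taxonomy nodes; the tiny taxonomy below is a hand-made example, not real NCBI data:

```python
# Sketch: lowest common ancestor over a child -> parent mapping.
# The example taxonomy is invented for illustration.

def ancestors(parent, node):
    """Path from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lowest_common_ancestor(parent, nodes):
    # Intersect the ancestor paths of all nodes, then walk up from the
    # first node: the first shared ancestor encountered is the LCA.
    common = set(ancestors(parent, nodes[0]))
    for n in nodes[1:]:
        common &= set(ancestors(parent, n))
    for a in ancestors(parent, nodes[0]):
        if a in common:
            return a

parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
}
print(lowest_common_ancestor(parent, ["Escherichia coli", "Salmonella enterica"]))
# Enterobacteriaceae
```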
43. Future directions (1)
Gene flux tool
A new tool for bacterial comparative genomics: massive tracing of vertical and horizontal gene flux between genome elements based on the analysis of the similarity between their proteins. It would analyze similarity relationships that could be fixed at a 90% or 100% similarity threshold.
Pathways tool
Data from MetaCyc is going to be included in Bio4j. This data would make it possible to dissect the metabolic pathways in which a genome element, organism or community (metagenomic samples) is involved. Gephi could be used to represent the metabolic pathways for each of them.
44. Future directions (2)
Detector of common annotations in gene clusters
Many biological problems involve searching for common annotations in a set of genes.
Some examples:
- a set of overexpressed genes
- a set of proteins with local structural similarities (WIP)
- a set of genes bearing SNPs in cancer samples
- a set of exclusive genes in a pathogenic bacterial strain
The detection of common annotations can help infer important functional connections.
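The detection described above amounts to intersecting per-gene annotation sets. A minimal sketch with invented gene names and GO terms:

```python
# Illustrative sketch: annotations shared by every gene in a cluster,
# computed as a set intersection. Gene names and GO terms are invented.

def common_annotations(annotations_by_gene):
    sets = [set(a) for a in annotations_by_gene.values()]
    return set.intersection(*sets) if sets else set()

cluster = {
    "geneA": ["GO:0006915", "GO:0008283", "GO:0005524"],
    "geneB": ["GO:0006915", "GO:0005524"],
    "geneC": ["GO:0006915", "GO:0016301", "GO:0005524"],
}
print(sorted(common_annotations(cluster)))  # ['GO:0005524', 'GO:0006915']
```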
45. That's it!
Thanks for your time ;)