Graph DB + Bioinformatics: Bio4j, recent applications and future directions


Published in: Education, Technology

  1. 1. Graph DB + Bioinformatics: Bio4j, recent applications and future directions
  2. 2. But who’s this guy talking here? I am currently working as a Bioinformatics consultant/developer/researcher at Oh no sequences! and I have been here at the Ohio State University working as a Visiting Scholar during these last two months. Oh no what!? We are the R&D group at Era7 Bioinformatics. We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things. What about Era7 Bioinformatics? Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation. Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and
  3. 3. We’re a small but quite peculiar company! (in the good sense, of course) Currently we have offices in: Madrid (Spain), Granada (Spain), Boston MA (USA). Yeah, I know what you’re thinking, they are not precisely ugly cities…
  4. 4. Our team is multidisciplinary: bioinformaticians, mathematicians, lab researchers, immunologists, biologists specialized in biochemistry and IT professionals. A team formed by people with different backgrounds is able to analyze the same problem from different points of view. We are based in research. In a fast-changing area, our activity is based on being able to offer cutting-edge solutions. This is only possible by maintaining a continuous research and innovation activity. In addition, since many of our customers are researchers, being part of that community allows us to be really customer
  5. 5. Everything we do is 100% open source! Yes, we hate patents. And no, we’re not crazy (or maybe just a bit…) OK, that’s really nice, but how can that actually work?? • Free marketing and dissemination • We can use other bioinformatics open source tools/DBs/etc… • Faster adaptation to a fast-changing field (bioinformatics, genomics) • You may not earn a lot of money but you earn enough money doing many creative
  6. 6. Money? Where from ?? • Providing services • Adapting services to different infrastructures and frameworks… OK, but you could probably get much more money with a different business model… Yeah, but this is our philosophy!
  7. 7. We are also based on Cloud Computing. Cloud Computing has revolutionized the world of computing because in this paradigm you get the infrastructure as a service (IaaS). We are experts in using the leader in this space: Amazon Web Services (AWS). So, what do we get? a) No investment in infrastructure. Pay per use. b) Scalability: for example, we can launch just one virtual server for two hours or more than one hundred for ten hours, depending on the amount of data that should be analyzed in different
  8. 8. What’s Bio4j? Bio4j is a bioinformatics graph based DB including most data available in: Uniprot KB (SwissProt + Trembl), Gene Ontology (GO), UniRef (50, 90, 100), NCBI Taxonomy, RefSeq, Enzyme
  9. 9. What’s Bio4j? It provides a completely new and powerful framework for protein related information querying and management. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own
  10. 10. What’s Bio4j? Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database". Thanks to both being based on Neo4j DB and the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data making the best out of
  11. 11. What’s Bio4j? Everything in Bio4j is open source! released under
  12. 12. Outline: Bioinformatics DBs and Graphs | Initial motivation | Bio4j structure | Some samples | Why Bio4j? | Bio4j and the Cloud. Highly interconnected overlapping knowledge spread throughout different DBs.
  13. 13. However, all this data is in most cases modeled in relational databases. Sometimes even just as plain CSV files. As the amount and diversity of data grows, domain models become crazily complicated!
  14. 14. With a relational paradigm, the double implication Entity ⇔ Table does not go both ways. You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling. You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality). Entity-relationship models are cool but in the end you always have to deal with ‘raw’ tables plus SQL. Integrating/incorporating new knowledge into already existing databases is hard and sometimes even not possible without changing the domain
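To make the ‘auxiliary table’ point concrete, here is a minimal sketch in Python using the standard `sqlite3` module. The table names, column names and sample rows (`PROT_A`, `GO_X`, `GO_Y`) are invented for illustration only; they are not the GO or Uniprot schema. The relational version needs a linking table and artificial integer keys, while the graph-style version stores the association directly.

```python
import sqlite3

# Relational model: two entity tables plus one auxiliary linking table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE protein (pk INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE go_term (pk INTEGER PRIMARY KEY, go_id TEXT);
    -- Auxiliary table: exists only to connect the two entities,
    -- using artificial keys that have no meaning in biology.
    CREATE TABLE protein_go (protein_pk INTEGER, go_pk INTEGER);
    INSERT INTO protein VALUES (1, 'PROT_A');
    INSERT INTO go_term VALUES (10, 'GO_X'), (11, 'GO_Y');
    INSERT INTO protein_go VALUES (1, 10), (1, 11);
""")


def go_terms_relational(accession):
    """Return the GO ids of a protein via two joins over the linking table."""
    rows = conn.execute(
        """SELECT g.go_id
           FROM protein p
           JOIN protein_go pg ON pg.protein_pk = p.pk
           JOIN go_term g ON g.pk = pg.go_pk
           WHERE p.accession = ?
           ORDER BY g.go_id""",
        (accession,),
    ).fetchall()
    return [row[0] for row in rows]


# Graph-style model: the association is stored next to the entity itself.
annotations = {"PROT_A": ["GO_X", "GO_Y"]}


def go_terms_graph(accession):
    """Return the GO ids of a protein by following its outgoing edges."""
    return sorted(annotations.get(accession, []))


print(go_terms_relational("PROT_A"))
print(go_terms_graph("PROT_A"))
```

Both functions answer the same question; the difference is how much of the model exists only to serve the storage layer.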
  15. 15. Life in general and biology in particular are probably not 100% like a graph… but one thing’s sure, they are not a set of tables!
  16. 16. NoSQL (not only SQL). NoSQ… what!?? Let’s see what Wikipedia says… “NoSQL is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.”
  17. 17. NoSQL data models
  18. 18. Neo4j is a high-performance, NoSQL graph database with all the features of a mature and robust database. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables. All the benefits of a fully transactional, enterprise-strength database. For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational
  19. 19. OK, but why start all this? Were you so bored…?! It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations. At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning: I got crazy ‘deciphering’ how to extract Uniprot protein annotations from the official GO tables schema. Uniprot and GO official protein annotations were not always consistent. Populating my own DB took really long due to all the joins and subqueries needed in order to get and store the protein annotations. Soon enough we also had the need of having massive access to basic protein
  20. 20. These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data). The Uniprot web services available were too limited: - Slow - Number of queries limitation - Too little information available. So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + Trembl) and started to have some fun with it!
  21. 21. BG7 algorithm: 1. Selection of the specific reference protein set. 2. Prediction of possible genes by BLAST similarity. 3. Gene definition: merging compatible similarity regions, detecting start and stop. 4. Solving overlapped predicted genes. 5. RNA prediction by BLAST similarity. 6. Final annotation and complete deliverables. Quality
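The slide does not say what makes two similarity regions ‘compatible’, so the following Python sketch is only a guess at the general idea behind the merging step, not BG7’s actual method. It assumes that regions are compatible when they lie on the same strand and overlap or sit within `max_gap` bases of each other; `Region`, `merge_regions` and `max_gap` are hypothetical names.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A similarity region on a contig (hypothetical structure)."""
    start: int
    end: int
    strand: str  # '+' or '-'


def merge_regions(regions: List[Region], max_gap: int = 0) -> List[Region]:
    """Merge same-strand regions that overlap or are within max_gap bases.

    ASSUMPTION: this compatibility rule is illustrative only.
    """
    merged: List[Region] = []
    # Sorting by strand first keeps each strand's regions contiguous.
    for region in sorted(regions, key=lambda r: (r.strand, r.start, r.end)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.strand == region.strand
            and region.start <= last.end + max_gap
        ):
            # Extend the previous region instead of starting a new one.
            merged[-1] = Region(last.start, max(last.end, region.end), last.strand)
        else:
            merged.append(region)
    return merged


hits = [
    Region(100, 200, "+"),
    Region(150, 300, "+"),
    Region(400, 500, "+"),
    Region(120, 180, "-"),
]
print(merge_regions(hits))
```

Raising `max_gap` makes the merge more permissive; a real pipeline would also have to check reading frame and reference protein, which this sketch ignores.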
  22. 22. We got used to having massive direct access to all this protein related information… So why not add other resources we needed quite often in most projects and which were now becoming a sort of bottleneck compared to all those already included in Bio4j? Then came: - Isoform sequences - Protein interactions and features - Uniref 50, 90, and 100 - RefSeq - NCBI Taxonomy - Enzyme Expasy
  23. 23. Let’s dig a bit into Bio4j structure: Data sources and their relationships:
  24. 24. Bio4j domain model
  25. 25. The Graph DB model: representation. Core abstractions: Nodes, Relationships between nodes, Properties on both
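The three core abstractions (nodes, relationships between nodes, properties on both) can be sketched in a few lines of Python. This is a toy in-memory model for illustration, not the Neo4j or Bio4j API; the class names, the `connect` helper and the `evidence` property are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)  # identity comparison avoids recursing through the graph
class Node:
    node_type: str
    properties: Dict[str, Any] = field(default_factory=dict)
    outgoing: List["Relationship"] = field(default_factory=list, repr=False)
    incoming: List["Relationship"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Relationship:
    rel_type: str
    start: Node
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


def connect(start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
    """Create a typed, directed relationship and register it on both nodes."""
    rel = Relationship(rel_type, start, end, dict(properties))
    start.outgoing.append(rel)
    end.incoming.append(rel)
    return rel


protein = Node("Protein", {"accession": "P12345"})
go_term = Node("GO term", {"id": "GO_X"})  # placeholder id
annotation = connect(protein, "PROTEIN_GO_ANNOTATION", go_term, evidence="illustrative")

print(annotation.rel_type, annotation.start.properties, annotation.end.properties)
```

Because each node keeps its own incoming and outgoing relationships, walking from a protein to its annotations never needs a join.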
  26. 26. Let’s dig a bit into Bio4j structure: How are things modeled? Couldn’t be simpler! Entities are nodes; associations / relationships connect them.
  27. 27. Some examples of nodes would be: GO term, Protein, Genome Element; and relationships: Protein -[PROTEIN_GO_ANNOTATION]-> GO term
  28. 28. We have developed a tool aimed to be used both as a reference manual and initial contact for the Bio4j domain model: Bio4jExplorer. Bio4jExplorer allows you to: • Navigate through all nodes and relationships • Access the javadocs of any node or relationship • Graphically explore the neighborhood of a node/relationship • Look up the indexes that may serve as an entry point for a node • Check incoming/outgoing relationships of a specific node • Check start/end nodes of a specific
  29. 29. Entry points and indexing. There are two kinds of entry points for the graph: 1) Auxiliary relationships going from the reference node, e.g. - CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology - MAIN_DATASET: leads to both main datasets: Swiss-Prot and Trembl. 2) Node indexing. There are two types of node indexes: - Exact: Only exact values are considered hits - Fulltext: Regular expressions can be
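The difference between the two index types can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not how Neo4j implements its indexes: `ExactIndex` is a plain dictionary lookup, and `PatternIndex` emulates pattern-based search by running a case-insensitive regular expression over the stored values. Both class names are hypothetical.

```python
import re
from collections import defaultdict


class ExactIndex:
    """Only an identical value counts as a hit."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, value, node):
        self._entries[value].append(node)

    def get(self, value):
        return list(self._entries.get(value, []))


class PatternIndex:
    """Toy stand-in for a fulltext index: matches values against a regex."""

    def __init__(self):
        self._entries = []

    def add(self, value, node):
        self._entries.append((value, node))

    def query(self, pattern):
        regex = re.compile(pattern, re.IGNORECASE)
        return [node for value, node in self._entries if regex.search(value)]


protein = {
    "accession": "P12345",
    "full_name": "Aspartate aminotransferase, mitochondrial",
}

accession_index = ExactIndex()
accession_index.add(protein["accession"], protein)

name_index = PatternIndex()
name_index.add(protein["full_name"], protein)

print(accession_index.get("P12345"))
print(name_index.query(r"amino\w+ase"))
```

An exact index suits identifiers such as accessions, where a partial match would be meaningless; pattern search suits free-text fields such as names.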
  30. 30. Querying Bio4j with Cypher
      Getting a keyword by its ID:
          START k=node:keyword_id_index(keyword_id_index = "KW-0181") return k.id
      Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:
          START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
          MATCH d <-[r:PROTEIN_DATASET]- p,
                circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
          return p.accession, p2.accession, p3.accession
      Check this blog post for more info and our Bio4j Cypher
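For readers new to pattern matching, the circuit query can be restated as three nested loops over an adjacency map. The Python sketch below is an illustration only: the interaction data is invented, `three_cycles` is a hypothetical helper, and it requires the three proteins to be distinct, which matches the ‘simple cycle’ wording but may be stricter than what the Cypher pattern enforces.

```python
def three_cycles(interactions, anchors):
    """Return (p, p2, p3) triples with p -> p2 -> p3 -> p and p in anchors.

    interactions maps an accession to the set of accessions it points to;
    anchors plays the role of the proteins linked to the Swiss-Prot dataset.
    """
    found = []
    for p in sorted(anchors):
        for p2 in sorted(interactions.get(p, ())):
            for p3 in sorted(interactions.get(p2, ())):
                closes = p in interactions.get(p3, ())
                if closes and len({p, p2, p3}) == 3:
                    found.append((p, p2, p3))
    return found


# Invented directed interaction data: A -> B -> C -> A forms a circuit.
interactions = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A"},
    "D": {"A"},
}

print(three_cycles(interactions, anchors={"A", "D"}))
```

The same circuit is reported once per anchored starting protein, so anchoring on more than one of its members yields rotated copies of the triple, as the Cypher query would.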
  31. 31. Gremlin: a graph traversal language
      Get protein by its accession number and return its full name:
          gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
          ==> Aspartate aminotransferase, mitochondrial
      Get proteins (accessions) associated to an interpro motif (limited to 4 results):
          gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
          ==> E2GK26
          ==> G3PMS4
          ==> G3Q865
          ==> G3PIL8
      Check our Bio4j Gremlin
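The second traversal reads as a pipeline: start at the motif, take its incoming PROTEIN_INTERPRO edges, step to the vertex each edge comes from, and keep the first four results. A rough Python equivalent is sketched below, using an invented edge list with placeholder identifiers (`PROT_1`, `MOTIF_X`, …); `in_edges` and `proteins_for_motif` are hypothetical helpers, not part of any Gremlin or Bio4j API.

```python
from itertools import islice

# Each edge is (out vertex, label, in vertex): it points from protein to motif.
EDGES = [
    ("PROT_1", "PROTEIN_INTERPRO", "MOTIF_X"),
    ("PROT_2", "PROTEIN_INTERPRO", "MOTIF_X"),
    ("PROT_3", "PROTEIN_INTERPRO", "MOTIF_Y"),
    ("PROT_4", "PROTEIN_INTERPRO", "MOTIF_X"),
    ("PROT_5", "PROTEIN_INTERPRO", "MOTIF_X"),
    ("PROT_6", "PROTEIN_INTERPRO", "MOTIF_X"),
    ("PROT_1", "PROTEIN_GO_ANNOTATION", "GO_X"),
]


def in_edges(vertex, label, edges=EDGES):
    """Lazily yield edges with the given label that arrive at the vertex."""
    return (edge for edge in edges if edge[2] == vertex and edge[1] == label)


def proteins_for_motif(motif, limit=4, edges=EDGES):
    """Follow incoming PROTEIN_INTERPRO edges back to their source vertices."""
    sources = (source for source, _, _ in in_edges(motif, "PROTEIN_INTERPRO", edges))
    # islice stops after `limit` items, like the [0..3] range in the traversal.
    return list(islice(sources, limit))


print(proteins_for_motif("MOTIF_X"))
```

Generators keep the pipeline lazy, so the scan stops as soon as the limit is reached instead of collecting every matching protein first.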
  32. 32. Visualizations (1): REST Server Data Browser. Navigate through Bio4j data in real time!
  33. 33. Visualizations (2): Bio4j + Gephi. Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
  34. 34. Visualizations (3): Bio4j GO Tools
  35. 35. Why would I use Bio4j? Massive access to protein/genome/taxonomy… related information. Integration of your own DBs/resources around common information. Development of services tailored to your needs built around Bio4j. Networks analysis. Visualizations. Besides many others I cannot think of myself… If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
  36. 36. Bio4j + Cloud (1). We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving us the following benefits: Interoperability and data distribution. Releases are available as public EBS Snapshots, so AWS users can create fully loaded Bio4j DB volumes and attach them to their instances in just a few seconds. CloudFormation templates: - Basic Bio4j DB Instance - Bio4j REST Server
  37. 37. Bio4j + Cloud (2). Backup and Storage using S3 (Simple Storage Service). We use S3 both for backup (indirectly through the EBS snapshots) and storage (directly storing RefSeq sequences as independent S3 files). What kind of benefits do we get from this? • Easy to use • Flexible • Cost-Effective • Reliable • Scalable and high-performance •
  38. 38. Bio4j + Cloud (3). Web servers and service providers in the cloud. Deploying your own web server in AWS using Bio4j as back-end is really simple. A good example of this would be Bio4jTestServer, a continuously developed server showcasing Web Services based on Bio4j.
  39. 39. Community. Bio4j has a fast growing internet presence: - Twitter: check @bio4j for updates - Blog: go to http://blog.bio4j.com - Mail-list: ask any question you may have in our list. - LinkedIn: check the Bio4j group - Github issues: don’t be shy! open a new issue if you think something’s going
  40. 40. And the best part of all this is: You have the latest version of Bio4j already imported and fully working in EgStation! ;)
  41. 41. Bio4j + MG7 for the integration and analysis of Chip-seq
  42. 42. Bio4j + MG7 + 24 Chip-Seq samples Some numbers: • 157 639 502 nodes • 742 615 705 relationships • 632 832 045 properties • 148 relationship types • 44 node types And it works just fine!
  43. 43. MG7 domain
  44. 44. What’s MG7? MG7 is a new system for massive analysis of sequences from metagenomics samples, specially designed for next generation sequencing technologies. MG7 uses cloud computing to solve the problem of massive data analysis, providing scalable, real time, on demand computing for metagenomics data analysis. MG7 is able to obtain annotation and functional profiles for shotgun genomic sequences and taxonomic assignment for any type of read. The inference of function and the assignment of taxonomical origin for each sequence are based on massive BLAST similarity
  45. 45. What’s MG7? MG7 provides the possibility of choosing different parameters to fix the thresholds for filtering the BLAST hits: i. E-value ii. Identity and query coverage. It allows exporting the results of the analysis to different data formats like: • XML • CSV • Gexf (Graph exchange XML format). It also provides the user with heat maps and graph visualizations, along with a user-friendly interface that gives access to the alignment responsible for each functional or taxonomical read assignment and that displays the frequencies in the taxonomical tree
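Threshold-based filtering of BLAST hits can be sketched as follows. This is a hedged illustration in Python, not MG7’s code: the `BlastHit` fields, the `filter_hits` function, the default threshold values and the sample hits are all assumptions, and the sketch applies the three thresholds together even though the slide lists E-value and identity/query coverage as separate options.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlastHit:
    """One read-to-reference alignment (hypothetical structure)."""
    read_id: str
    subject_id: str
    evalue: float          # lower is better
    identity: float        # percent identity, 0-100
    query_coverage: float  # percent of the read covered, 0-100


def filter_hits(
    hits: List[BlastHit],
    max_evalue: float = 1e-5,         # illustrative default, not MG7's
    min_identity: float = 0.0,
    min_query_coverage: float = 0.0,
) -> List[BlastHit]:
    """Keep only hits that satisfy every threshold."""
    return [
        hit
        for hit in hits
        if hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.query_coverage >= min_query_coverage
    ]


sample_hits = [
    BlastHit("read_1", "subject_1", 1e-20, 98.0, 95.0),
    BlastHit("read_2", "subject_2", 1e-3, 99.0, 99.0),
    BlastHit("read_3", "subject_3", 1e-30, 60.0, 90.0),
]

strict = filter_hits(sample_hits, min_identity=90.0, min_query_coverage=80.0)
print([hit.read_id for hit in strict])
```

Keeping the thresholds as parameters makes it easy to rerun the same assignment with looser or stricter settings and compare the resulting profiles.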
  46. 46. Heat-map
  47. 47. Graph
  48. 48. MG7
  49. 49. Bio4j + GRG A completely new approach for modeling genomic information and gene regulatory
  50. 50. Bio4j + GRG Integrating genomic information from organisms such as: • Zea mays subsp. mays • Oryza sativa Japonica Group • Sorghum bicolor • Brachypodium distachyon • Arabidopsis thaliana Columbia • Arabidopsis lyrata lyrata
  51. 51. Bio4j + GRG domain
  52. 52. Bio4j + GRG Get all the advantages of Bio4j and Graph DB while modeling genomic data for grasses (although it could also be applied to other species/families). Possibility of integrating data from other projects here at CAPS/EGLab in a common framework. Data-mining of data that currently is not accessible or simply is not structured well enough to explore, both for external genomic data included in sites like Phytozome and for data coming directly from the experiments/analysis performed in the lab. Common framework for accessing all this information together with other “Universal” resources such as Uniprot, RefSeq or Gene
  53. 53. Bio4j + GRG Chance for the Lab to enter the Cloud and Graph DB world, being a pioneer in providing access to this sort of data to a whole set of possible different users. Not worrying anymore about possible problems with back-ups, maintaining infrastructure or things like that… And what’s more important: Scalability: being able to adapt to the specific needs of new projects as they go
  54. 54. And the best part… Acknowledgments! Bio4j + MG7 + Chip-Seq results Bio4j +
  55. 55. The other guys from the basement…  (Brett) (Matias) (Andrew)
  56. 56. And of course the rest of the Lab !
  57. 57. That’s it ! Thanks for your time ;)