BioSolr: building a better search for bioinformatics

Tom Winch & Matt Pearce
21st April 2015
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

The European Bioinformatics Institute
Part of the European Molecular Biology Laboratory
Based on the Wellcome Genome Campus in Hinxton,
Cambridge
Maintains the world’s most comprehensive range of freely
available and up-to-date molecular databases, serving millions
of researchers – indexing over 1 billion items
BioSolr project involves two teams from EMBL-EBI:
Protein Data Bank in Europe (PDBe)
Samples, Phenotypes and Ontologies (SPOt)

The genesis of BioSolr
Grant Ingersoll visits the Wellcome Campus in July '13
Around 90 people attend
Show of hands indicates 75% using Lucene/Solr
Sameer Velankar of EMBL-EBI identifies grant funding
Flax and EMBL-EBI apply successfully to the BBSRC

BioSolr
One year BBSRC funded project from September 2014
“to significantly advance the state of the art with
regard to indexing and querying biomedical data with freely
available open source software”
Outputs:
– Workshops
– Papers & presentations
– Software (Open source of course!)
– Documentation
Inputs: from the PDBe & SPOt teams

BioSolr
Tom Winch
– Working on site with Sameer Velankar & the PDBe team
– facet.contains & XJoin
Matt Pearce
– Working on site with Tony Burdett & the SPOt team
– Indexing ontologies

BioSolr & PDBe - Introduction
Protein Data Bank in Europe (PDBe)
facet.contains – autosuggest
https://issues.apache.org/jira/browse/SOLR-1387
In Solr 5.1
DNA sequence similarity
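
A minimal sketch of how facet.contains can drive an autosuggest box. The parameters facet.contains and facet.contains.ignoreCase are the ones added by SOLR-1387; the field name molecule_name and the helper autosuggest_params are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode


def autosuggest_params(field, fragment, limit=10):
    """Build Solr request parameters for a facet-based autosuggest."""
    return {
        "q": "*:*",
        "rows": 0,                           # only facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.contains": fragment,          # keep facet values containing this text
        "facet.contains.ignoreCase": "true",
        "facet.limit": limit,
        "wt": "json",
    }


# "molecule_name" is an assumed field name, not taken from the PDBe schema.
params = autosuggest_params("molecule_name", "kinase")
query_string = urlencode(params)
print("/select?" + query_string)
```

The facet values returned are the suggestions, so no documents need to be fetched.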

BioSolr & PDBe – XJoin concepts
The problem: sequences come from a live source
Joining with data from an external source
Custom Solr code
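
The join idea can be sketched outside Solr as follows. This is a conceptual illustration only, not the XJoin API: the real component is Java code running inside Solr, and external_search, xjoin and the external_score field are hypothetical names.

```python
def external_search(sequence):
    """Stand-in for a live external service (hypothetical).

    Returns a mapping of join id to an externally computed score.
    """
    return {"1abc": 0.92, "2xyz": 0.75}


def xjoin(solr_docs, external_results, join_field="id"):
    """Keep only documents that have an external result, attach its score,
    and order the documents by that score (highest first)."""
    joined = []
    for doc in solr_docs:
        key = doc[join_field]
        if key in external_results:
            merged = dict(doc)
            merged["external_score"] = external_results[key]
            joined.append(merged)
    joined.sort(key=lambda d: d["external_score"], reverse=True)
    return joined


solr_docs = [
    {"id": "2xyz", "title": "Entry B"},
    {"id": "3def", "title": "Entry C"},
    {"id": "1abc", "title": "Entry A"},
]
results = xjoin(solr_docs, external_search("ACGT"))
for doc in results:
    print(doc["id"], doc["external_score"])
```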

BioSolr & PDBe – Solr classes
XJoinResultsFactory, XJoinResults
XJoinSearchComponent
XJoinQParserPlugin
XJoinValueSourceParser

BioSolr & PDBe – What next?
Solr contrib – SOLR-7341
https://issues.apache.org/jira/browse/SOLR-7341
Joining from multiple external sources
Federated search

BioSolr & SPOt – Indexing Ontologies
[Figure from: Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5]

Indexing Ontologies - the problem
You have a collection of documents annotated with ontology
references.
You want to search both the documents and the associated
ontology data.
This may include associated nodes – “has location”, “is
part of”, etc.
Faceting by ontology reference would be nice!

Approach 1
– Keep the data separate
[Diagram: Documents → Documents Indexer → documents core; Ontology → Ontology Indexer → ontology core]

Approach 1 - steps
Index the documents, with the node annotations, but no
further detail.
Index the ontology in its own core.
Search the documents, then cross-match against the
ontology.
BUT: requires multiple calls, and doesn't allow
searching both cores at the same time.
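
The second call in this approach might be built like this. A rough sketch: the field names annotation_uri and uri, the example.org URIs and the helper ontology_lookup_query are all assumptions made for illustration.

```python
def ontology_lookup_query(docs, annotation_field="annotation_uri"):
    """Build the follow-up query for the ontology core from document hits.

    Returns None when no document carries an annotation.
    """
    uris = []
    for doc in docs:
        uri = doc.get(annotation_field)
        if uri and uri not in uris:          # de-duplicate, keep first-seen order
            uris.append(uri)
    if not uris:
        return None
    # Quote each URI so that ':' and '/' are not parsed as query syntax.
    quoted = ['"%s"' % u.replace("\\", "\\\\").replace('"', '\\"') for u in uris]
    return "uri:(" + " OR ".join(quoted) + ")"


# Step 1 (simulated): hits from the documents core.
document_hits = [
    {"id": "d1", "annotation_uri": "http://example.org/onto/NODE_1"},
    {"id": "d2", "annotation_uri": "http://example.org/onto/NODE_2"},
    {"id": "d3", "annotation_uri": "http://example.org/onto/NODE_1"},
]
# Step 2: the query to send to the ontology core.
ontology_query = ontology_lookup_query(document_hits)
print(ontology_query)
```

The two round trips are visible here: the ontology query cannot be built until the document results have come back.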

Approach 2
– Add some ontology data to your documents.
[Diagram: Documents and Ontology → Indexer → documents core]

Approach 2 – step 1
Index node references, plus their labels and synonyms.
Easier to include the ontology references in your search.
Can boost some fields over others.
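
A sketch of what an enriched document and a boosted query could look like. It assumes the eDisMax query parser with its qf parameter; the field names (title, annotation_label, annotation_synonyms) and the boost values are illustrative, not taken from the SPOt schema.

```python
from urllib.parse import urlencode

# A document carrying its ontology reference plus the node's label and synonyms.
enriched_doc = {
    "id": "d1",
    "title": "Example study",
    "annotation_uri": "http://example.org/onto/NODE_1",   # hypothetical URI
    "annotation_label": "heart disease",
    "annotation_synonyms": ["cardiac disease", "cardiopathy"],
}


def annotated_search_params(text):
    """Search document text and ontology text together, weighting each field."""
    return {
        "q": text,
        "defType": "edismax",
        # Assumed weights: title matches count most, synonyms least.
        "qf": "title^3 annotation_label^2 annotation_synonyms",
        "wt": "json",
    }


search_params = annotated_search_params("cardiac disease")
print("/select?" + urlencode(search_params))
```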

Approach 2 – step 2
Expand the ontology data being stored.
Include single (or multi)-level parent and child nodes, with
labels.
Use dynamic fields to store additional relationships.
Dynamic fields allow searches across specific relation types.
BUT: requires some additional Solr look-ups to be fully
dynamic.
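
One way to map relationships onto dynamic fields, as a sketch. It assumes a dynamic field pattern such as *_rel_labels is declared as multi-valued in the schema; that suffix and both helper names are hypothetical.

```python
import re


def relation_field(relation, suffix="_rel_labels"):
    """Turn a relation name such as 'has location' into a dynamic field name."""
    slug = re.sub(r"[^a-z0-9]+", "_", relation.lower()).strip("_")
    return slug + suffix


def add_relations(doc, relations):
    """Return a copy of doc with one dynamic field per relation type."""
    enriched = dict(doc)
    for relation, labels in relations.items():
        enriched[relation_field(relation)] = list(labels)
    return enriched


doc = add_relations(
    {"id": "d1", "annotation_label": "heart disease"},
    {"has location": ["heart"], "is part of": ["cardiovascular system"]},
)
print(sorted(doc))
```

A search restricted to one relation type then targets a single field, for example has_location_rel_labels:heart.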

Approach 3
Search the ontology, and cross-match with documents.
Allow SPARQL queries over the ontology index.
SPARQL is the W3C query language for RDF (semantic) data.

Adding Apache Jena
To allow SPARQL queries, we use Apache Jena to provide
TDB-querying.
Jena uses Solr to search label fields.
Jena uses its own triple store (TDB) for other fields.
Need to include the reference URI in the returned fields.

Integrating Jena results
Returned Jena data needs to be cross-matched against
the document index.
Use a filter query to choose the matching documents.
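
The cross-match could be sketched like this, assuming the SPARQL endpoint returns results in the standard SPARQL JSON results format and that the query binds the node URI to a variable named uri. The field name annotation_uri, the example.org URIs and both helpers are hypothetical.

```python
import json


def uris_from_sparql_json(payload, var="uri"):
    """Collect the URI bound to `var` in each row of a SPARQL JSON result."""
    data = json.loads(payload)
    uris = []
    for binding in data["results"]["bindings"]:
        cell = binding.get(var)
        if cell and cell.get("type") == "uri":
            uris.append(cell["value"])
    return uris


def filter_query(uris, field="annotation_uri"):
    """Build a Solr fq matching any of the URIs; None if there are none."""
    if not uris:
        return None                          # caller should treat this as "no matches"
    quoted = ['"%s"' % u.replace('"', '\\"') for u in uris]
    return "%s:(%s)" % (field, " OR ".join(quoted))


sparql_response = json.dumps({
    "head": {"vars": ["uri", "label"]},
    "results": {"bindings": [
        {"uri": {"type": "uri", "value": "http://example.org/onto/NODE_7"},
         "label": {"type": "literal", "value": "heart"}},
        {"uri": {"type": "uri", "value": "http://example.org/onto/NODE_9"},
         "label": {"type": "literal", "value": "heart valve"}},
    ]},
})
fq = filter_query(uris_from_sparql_json(sparql_response))
print(fq)
```

This is why the reference URI must be among the returned fields: without it there is nothing to filter the document index on.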

Summary so far
We can search documents and ontology data with a single call
to Solr.
We can dynamically search over additional related ontology
nodes.
We can use SPARQL to search.
Can facet on individual ontology annotations... but we still can't
present the facets in a tree.
https://github.com/flaxsearch/BioSolr/tree/master/spot

The ultimate goal
A generic ontology indexer using Solr.
Multiple ontologies stored in the same index.
Unique integer keys for each node, allowing
cross-matching from document indexes.
Optional customisation, allowing for additional lookups or
data manipulation.
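
As a rough illustration of the integer-key idea, the sketch below hands out one stable key per (ontology, node URI) pair. The class NodeKeyRegistry, the ontology names and the URIs are hypothetical, and a real indexer would need to persist the mapping rather than hold it in memory.

```python
class NodeKeyRegistry:
    """Assign one integer key per (ontology, node URI) pair."""

    def __init__(self):
        self._keys = {}

    def key_for(self, ontology, node_uri):
        ident = (ontology, node_uri)
        if ident not in self._keys:
            self._keys[ident] = len(self._keys) + 1   # next unused integer
        return self._keys[ident]


registry = NodeKeyRegistry()
key_a = registry.key_for("onto_a", "http://example.org/onto_a/NODE_1")
key_b = registry.key_for("onto_b", "http://example.org/onto_b/NODE_1")
key_a_again = registry.key_for("onto_a", "http://example.org/onto_a/NODE_1")

# A document index would store the integer key instead of the full URI.
document = {"id": "d1", "annotation_key": key_a}
print(document)
```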

BioSolr conclusions
Final workshop at EMBL-EBI in September
https://github.com/flaxsearch/BioSolr
Investigating funding to continue the project
– We have some ideas around federated Solr search...

Thank you!
Any questions?
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
