Tom Winch & Matt Pearce
21st April 2015
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
BioSolr
building a better search for bioinformatics
The European Bioinformatics Institute
Part of the European Molecular Biology Laboratory
Based on the Wellcome Genome Campus in Hinxton,
Cambridge
Maintains the world’s most comprehensive range of freely
available and up-to-date molecular databases, serving millions
of researchers – indexing over 1 billion items
BioSolr project involves two teams from EMBL-EBI:
Protein Data Bank in Europe (PDBe)
Samples, Phenotypes and Ontologies (SPOt)
The genesis of BioSolr
Grant Ingersoll visits the Wellcome Campus in July '13
Around 90 people attend
Show of hands indicates 75% using Lucene/Solr
Sameer Velankar of EMBL-EBI identifies grant funding
Flax and EMBL-EBI apply successfully to the BBSRC
BioSolr
One-year BBSRC-funded project from September 2014
“to significantly advance the state of the art with
regard to indexing and querying biomedical data with freely
available open source software”
Outputs:
– Workshops
– Papers & presentations
– Software (Open source of course!)
– Documentation
Inputs: from the PDBe & SPOt teams
BioSolr
Tom Winch
– Working on site with Sameer Velankar & the PDBe team
– facet.contains & XJoin
Matt Pearce
– Working on site with Tony Burdett & the SPOt team
– Indexing ontologies
BioSolr & PDBe - Introduction
Protein Data Bank in Europe (PDBe)
facet.contains – autosuggest
https://issues.apache.org/jira/browse/SOLR-1387
In Solr 5.1
DNA sequence similarity
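A minimal sketch of how a facet.contains autosuggest request might be assembled. The parameters `facet.contains` and `facet.contains.ignoreCase` are standard Solr faceting parameters (from 5.1); the field name `molecule_name` is a hypothetical placeholder, not taken from the PDBe schema.

```python
from urllib.parse import urlencode


def autosuggest_params(fragment, field="molecule_name", limit=10):
    """Build Solr parameters that return facet values containing `fragment`.

    `field` is a hypothetical facet field name used for illustration only.
    """
    return {
        "q": "*:*",
        "rows": 0,                            # only facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.contains": fragment,           # keep values containing this text
        "facet.contains.ignoreCase": "true",  # match regardless of case
        "facet.limit": limit,
        "wt": "json",
    }


# Query string to append to a core's /select handler.
query_string = urlencode(autosuggest_params("kinase"))
```

Each keystroke in the search box would send a request like this, and the returned facet values become the suggestions.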
BioSolr & PDBe – XJoin concepts
The problem - sequences come from a live source
Joining with data from an external source
Custom Solr code
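The idea of joining Solr results with data from an external source can be illustrated with a small client-side sketch. This is not the XJoin API: XJoin is custom code running inside Solr, whereas the function below only shows the concept. The ids and scores are invented sample data.

```python
def join_external(solr_docs, external_hits, key="id"):
    """Keep only Solr docs matched by the external source and attach its data.

    `external_hits` maps a join key to whatever the external service
    returned for it (here, a hypothetical similarity score).
    """
    joined = []
    for doc in solr_docs:
        hit = external_hits.get(doc[key])
        if hit is not None:
            joined.append({**doc, "external": hit})
    # Order by the external score, highest first.
    joined.sort(key=lambda d: d["external"]["score"], reverse=True)
    return joined


# Invented sample data standing in for Solr documents and a live sequence search.
solr_docs = [
    {"id": "1abc", "title": "Entry A"},
    {"id": "2xyz", "title": "Entry B"},
    {"id": "3def", "title": "Entry C"},
]
external_hits = {"3def": {"score": 0.97}, "1abc": {"score": 0.42}}
results = join_external(solr_docs, external_hits)
```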
BioSolr & PDBe – Solr classes
XJoinResultsFactory, XJoinResults
XJoinSearchComponent
XJoinQParserPlugin
XJoinValueSourceParser
BioSolr & PDBe – What next?
Solr contrib – SOLR-7341
https://issues.apache.org/jira/browse/SOLR-7341
Joining from multiple external sources
Federated search
Washington, N. & Lewis, S. (2008) Ontologies: Scientific
Data Sharing Made Easy. Nature Education 1(3):5
BioSolr & SPOt – Indexing Ontologies
Indexing Ontologies - the problem
You have a collection of documents annotated with ontology
references.
You want to search both the documents and the associated
ontology data.
This may include associated nodes – “has location”, “is
part of”, etc.
Faceting by ontology reference would be nice!
Approach 1
– Keep the data separate
[Diagram: a Documents Indexer feeds the documents index; a separate Ontology Indexer feeds the ontology index]
Approach 1 - steps
Index the documents, with the node annotations, but no
further detail.
Index the ontology in its own core.
Search the documents, then cross-match against the
ontology.
BUT: requires multiple calls, and doesn't allow
searching both cores at the same time.
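The two-call pattern above might look roughly like this: after the document search returns, collect the annotation references and build a second query against the ontology core. The field names `annotation_uri` and `uri` are assumptions for illustration.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes for Solr."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def ontology_lookup_params(doc_results, annotation_field="annotation_uri",
                           ontology_key="uri"):
    """Build the second (ontology core) query from first-call document results.

    Field names are hypothetical; returns None when no document is annotated.
    """
    uris = sorted({uri for doc in doc_results
                   for uri in doc.get(annotation_field, [])})
    if not uris:
        return None
    clause = " OR ".join(quote(u) for u in uris)
    return {"q": ontology_key + ":(" + clause + ")", "rows": len(uris)}


# Invented sample of what the first call might return.
doc_results = [
    {"id": "d1", "annotation_uri": ["http://example.org/onto/A"]},
    {"id": "d2", "annotation_uri": ["http://example.org/onto/B",
                                    "http://example.org/onto/A"]},
]
lookup = ontology_lookup_params(doc_results)
```

The extra round trip is visible here: the ontology query cannot be built until the document results are in.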
Approach 2
– Add some ontology data to your documents
[Diagram: a single Indexer combines Documents and Ontology data into the documents index]
Approach 2 – step 1
Index node references, plus their labels and synonyms.
Easier to include the ontology references in your search.
Can boost some fields over others.
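One way to boost some fields over others is the `qf` parameter of Solr's edismax query parser. A sketch, assuming hypothetical field names for the title, ontology labels and synonyms:

```python
def build_qf(boosts):
    """Render a {field: boost} mapping as an edismax `qf` string."""
    return " ".join(
        field if boost == 1 else f"{field}^{boost}"
        for field, boost in boosts.items()
    )


def search_params(user_query, boosts):
    """Build edismax parameters that search document and ontology fields together."""
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": build_qf(boosts),
    }


# Hypothetical field names and boost values, chosen only for illustration.
params = search_params(
    "lung cancer",
    {"title": 3, "annotation_label": 2, "annotation_synonyms": 1},
)
```

Here a match on the document title counts for more than a match on an ontology label, which in turn counts for more than a synonym match.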
Approach 2 – step 2
Expand the ontology data being stored.
Include single (or multi)-level parent and child nodes, with
labels.
Use dynamic fields to store additional relationships.
Dynamic fields allow searches across specific relation types.
BUT: requires some additional Solr look-ups to be fully
dynamic.
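A sketch of how a document could be enriched at index time, with each relation type stored in its own dynamic field. It assumes the schema declares a dynamic field matching `*_rel_labels`; that suffix and all other field names are hypothetical.

```python
import re


def relation_field(relation_name, suffix="_rel_labels"):
    """Turn a relation name such as 'has location' into a dynamic-field name."""
    slug = re.sub(r"[^a-z0-9]+", "_", relation_name.lower()).strip("_")
    return slug + suffix


def enrich(doc, node):
    """Copy ontology data for one annotated node onto a document.

    `node` is a plain dict standing in for whatever the ontology loader
    provides; its keys are assumptions made for this sketch.
    """
    enriched = dict(doc)
    enriched["annotation_uri"] = node["uri"]
    enriched["annotation_label"] = node["label"]
    enriched["annotation_synonyms"] = node.get("synonyms", [])
    enriched["parent_labels"] = node.get("parents", [])
    enriched["child_labels"] = node.get("children", [])
    # One dynamic field per relation type, e.g. has_location_rel_labels.
    for relation, labels in node.get("relations", {}).items():
        enriched[relation_field(relation)] = labels
    return enriched


# Invented sample node and document.
node = {
    "uri": "http://example.org/onto/A",
    "label": "example disease",
    "synonyms": ["sample disorder"],
    "parents": ["disease"],
    "relations": {"has location": ["lung"], "is part of": ["example syndrome"]},
}
solr_doc = enrich({"id": "d1", "title": "A study"}, node)
```

A search restricted to one relation type can then target a single field such as `has_location_rel_labels`.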
Approach 2 – search screen
Approach 2 – search results
Approach 3
Search the ontology, and cross-match with documents.
Allow SPARQL queries over the ontology index.
SPARQL is a semantic query language
Approach 3
Adding Apache Jena
To allow SPARQL queries, we use Apache Jena to provide
TDB-querying.
Jena uses Solr to search label fields.
Uses its own Triple Store for other fields.
Need to include reference URI in returned fields.
Integrating Jena results
Returned Jena data needs to be cross-matched against
document index.
Use a filter query to choose the matching documents.
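The cross-matching step could be sketched as follows: take the reference URIs from the SPARQL result rows and turn them into a Solr filter query using the `{!terms}` query parser. The variable name `uri` and the field name `annotation_uri` are assumptions, and the rows are plain dicts standing in for Jena results.

```python
def uri_filter_query(sparql_rows, uri_var="uri", field="annotation_uri"):
    """Build an `fq` value restricting documents to the URIs in the SPARQL rows.

    Uses Solr's terms query parser, which splits its input on commas,
    so a URI containing a comma is rejected rather than silently split.
    Returns None when there are no rows to match against.
    """
    uris = []
    for row in sparql_rows:
        uri = row[uri_var]
        if "," in uri:
            raise ValueError("URI contains a comma: " + uri)
        if uri not in uris:  # drop duplicates, keep first-seen order
            uris.append(uri)
    if not uris:
        return None
    return "{!terms f=" + field + "}" + ",".join(uris)


# Invented rows standing in for a SPARQL result set.
rows = [
    {"uri": "http://example.org/onto/A", "label": "example disease"},
    {"uri": "http://example.org/onto/B", "label": "another disease"},
    {"uri": "http://example.org/onto/A", "label": "example disease"},
]
fq = uri_filter_query(rows)
```

The resulting string is passed as the `fq` parameter of the document search, which is why the reference URI has to be among the fields Jena returns.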
Integrating Jena results
Summary so far
We can search documents and ontology data with a single call
to Solr.
We can dynamically search over additional related ontology
nodes.
We can use SPARQL to search.
We can facet on individual ontology annotations, but we still
can't present the facets in a tree.
https://github.com/flaxsearch/BioSolr/tree/master/spot
The ultimate goal
A generic ontology indexer using Solr.
Multiple ontologies stored in the same index.
Unique integer keys for each node, allowing cross-matching
from document indexes.
Optional customisation, allowing for additional lookups or
data manipulation.
BioSolr conclusions
Final workshop at EMBL-EBI in September
https://github.com/flaxsearch/BioSolr
Investigating funding to continue the project
– We have some ideas around federated Solr search...
Thank you!
Any questions?

Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 

BioSolr: building a better search for bioinformatics

  • 1. Tom Winch & Matt Pearce 21st April 2015 charlie@flax.co.uk www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @FlaxSearch BioSolr building a better search for bioinformatics
  • 2. The European Bioinformatics Institute Part of the European Molecular Biology Laboratory Based on the Wellcome Genome Campus in Hinxton, Cambridge Maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, serving millions of researchers – indexing over 1 billion items BioSolr project involves two teams from EMBL-EBI: Protein Data Bank in Europe (PDBe) Samples, Phenotypes and Ontologies (SPOt)
  • 3. The genesis of BioSolr Grant Ingersoll visits the Wellcome Campus in July '13 Around 90 people attend Show of hands indicates 75% using Lucene/Solr Sameer Velankar of EMBL-EBI identifies grant funding Flax and EMBL-EBI apply successfully to the BBSRC
  • 4. BioSolr One year BBSRC funded project from September 2014 “to significantly advance the state of the art with regard to indexing and querying biomedical data with freely available open source software” Outputs: – Workshops – Papers & presentations – Software (Open source of course!) – Documentation Inputs: from the PDBe & SPOt teams
  • 5. BioSolr Tom Winch – Working on site with Sameer Velankar & the PDBe team – Facet.contains & Xjoin Matt Pearce – Working on site with Tony Burdett & the SPOt team – Indexing ontologies
  • 6. BioSolr & PDBe - Introduction Protein Data Bank (PDBe) facet.contains – autosuggest https://issues.apache.org/jira/browse/SOLR-1387 In Solr 5.1 DNA sequence similarity
  • 7. BioSolr & PDBe – Xjoin concepts The problem - sequences come from a live source Joining with data from an external source Custom SOLR code
  • 8. BioSolr & PDBe – Solr classes XJoinResultsFactory, XJoinResults XJoinSearchComponent XJoinQParserPlugin XJoinValueSourceParser
  • 9. BioSolr & PDBe – What next? SOLR contrib – SOLR-7341 https://issues.apache.org/jira/browse/SOLR-7341 Joining from multiple external sources Federated search
  • 10. Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5 BioSolr & SPOt – Indexing Ontologies
  • 11. Indexing Ontologies - the problem You have a collection of documents annotated with ontology references. You want to search both the documents and the associated ontology data. This may include associated nodes – “has location”, “is part of”, etc. Faceting by ontology reference would be nice!
  • 12. Approach 1 – Keep the data separate [Diagram: documents, Documents Indexer; ontology, Ontology Indexer]
  • 13. Approach 1 - steps Index the documents, with the node annotations, but no further detail. Index the ontology in its own core. Search the documents, then cross-match against the ontology. BUT - Requires multiple calls, doesn't allow searching both cores at the same time.
  • 14. Approach 2 – Add some ontology data to your documents. [Diagram: Documents, Ontology, Indexer]
  • 15. Approach 2 – step 1 Index node references, plus their labels and synonyms. Easier to include the ontology references in your search. Can boost fields over others.
  • 16. Approach 2 – step 2 Expand the ontology data being stored. Include single (or multi)-level parent and child nodes, with labels. Use dynamic fields to store additional relationships. Dynamic fields allow searches across specific relation types. BUT Requires some additional Solr look-ups to be fully dynamic.
  • 17. Approach 2 – search screen
  • 18. Approach 2 – search results
  • 19. Approach 3 Search the ontology, and cross-match with documents. Allow SPARQL queries over the ontology index. SPARQL is a semantic query language
  • 21. Adding Apache Jena To allow SPARQL queries, we use Apache Jena to provide TDB-querying. Jena uses Solr to search label fields. Uses its own Triple Store for other fields. Need to include reference URI in returned fields.
  • 22. Integrating Jena results Returned Jena data needs to be cross-matched against document index. Use a filter query to choose the matching documents.
  • 24. Summary so far We can search documents and ontology data with a single call to Solr. We can dynamically search over additional related ontology nodes. We can use SPARQL to search. Can facet on individual ontology annotations...but we still can't present the facets in a tree. https://github.com/flaxsearch/BioSolr/tree/master/spot
  • 25. The ultimate goal A generic ontology indexer using Solr. Multiple ontologies stored in the same index. Unique integer keys for each node, allowing cross-matching from document indexes. Optional customisation, allowing for additional lookups or data manipulation.
  • 26. BioSolr conclusions Final workshop at EMBL-EBI in September https://github.com/flaxsearch/BioSolr Investigating funding to continue the project – We have some ideas around federated Solr search...
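
The Approach 2 slides describe adding ontology labels, synonyms and related nodes to each document, with dynamic fields holding the extra relationships. A minimal sketch of what such a flattened document might look like is below. All field names (`annotation_uri`, `annotation_label`, the `*_rel_labels` suffix) and the sample ontology data are assumptions made for illustration; they are not taken from the BioSolr code.

```python
def relation_field_name(relation):
    """Turn a relation name such as "has location" into a dynamic field name.

    The "_rel_labels" suffix is a hypothetical naming convention; a matching
    dynamicField pattern would need to exist in the Solr schema.
    """
    return relation.strip().lower().replace(" ", "_") + "_rel_labels"


def build_solr_doc(doc_id, title, node, relations):
    """Flatten a document and its ontology annotation into one Solr document.

    node      -- dict with "uri", "label" and optional "synonyms"
    relations -- dict mapping a relation name to a list of related node dicts
    """
    solr_doc = {
        "id": doc_id,
        "title": title,
        # Step 1: node reference plus its label and synonyms.
        "annotation_uri": node["uri"],
        "annotation_label": node["label"],
        "annotation_synonyms": list(node.get("synonyms", [])),
    }
    # Step 2: one dynamic field per relation type, so a search can target
    # a specific relation such as "is part of".
    for relation, related_nodes in relations.items():
        solr_doc[relation_field_name(relation)] = [n["label"] for n in related_nodes]
    return solr_doc


# Illustrative data only; the URIs are placeholders.
example_node = {
    "uri": "http://example.org/onto/0001",
    "label": "heart",
    "synonyms": ["cardiac organ"],
}
example_relations = {
    "is part of": [{"uri": "http://example.org/onto/0002", "label": "circulatory system"}],
    "has location": [{"uri": "http://example.org/onto/0003", "label": "thorax"}],
}
example_doc = build_solr_doc("doc-1", "Sample record", example_node, example_relations)
```

Because each relation type gets its own field, a query can search or boost a single relation without touching the others, which is the benefit the slides attribute to dynamic fields.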
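
The "Integrating Jena results" slide says the URIs returned by the SPARQL query are cross-matched against the document index using a filter query. The sketch below shows one way such a filter query could be assembled. The field name `annotation_uri` and the helper names are hypothetical, and the code only builds request parameters; it does not contact Solr or Jena.

```python
from urllib.parse import urlencode


def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_filter_query(field, uris):
    """Build an fq clause matching documents annotated with any of the URIs."""
    unique_uris = list(dict.fromkeys(uris))  # drop duplicates, keep order
    if not unique_uris:
        raise ValueError("at least one URI is required")
    return field + ":(" + " OR ".join(quote_term(u) for u in unique_uris) + ")"


def build_select_params(user_query, field, uris):
    """URL-encode the parameters for a Solr select request."""
    return urlencode({
        "q": user_query,
        "fq": build_filter_query(field, uris),
        "wt": "json",
    })


# URIs as they might come back from the SPARQL query (placeholders).
jena_uris = [
    "http://example.org/onto/0001",
    "http://example.org/onto/0002",
    "http://example.org/onto/0001",
]
fq = build_filter_query("annotation_uri", jena_uris)
params = build_select_params("*:*", "annotation_uri", jena_uris)
```

Quoting each URI keeps the colons and slashes from being read as query syntax. For a large number of URIs a long OR clause may become unwieldy, so this form suits small result sets.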