EMBASE: A Technical Introduction from a Search Engineer’s Perspective
Junte Zhang
A Typical Search Engine
Set-based
Algebraic
Probabilistic
Feature-based
Basic Building Blocks
A systematic approach to information retrieval.
Source: Lalmas et al. 2001, fig. 2.
Query Refinement
• Facets and filtering
• Suggestions and autocomplete
Collecting and Saving Results
• In EMBASE:
  • Archiving
  • Analytics
• Other domains also:
  • Bookmarking
  • Checkout and buy
EMBASE Architecture
Indexing with an Inverted Index
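As a minimal sketch of the inverted-index idea (not EMBASE's or Lucene's actual implementation), the following maps each term to the documents that contain it and answers a boolean AND query by intersecting posting lists. The sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_and(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Illustrative documents, not taken from EMBASE.
docs = {
    1: "water treatment in hospitals",
    2: "reclaimed water quality",
    3: "hospital infection control",
}
index = build_inverted_index(docs)
water_docs = index["water"]
both_terms = search_and(index, "reclaimed water")
```

Looking up a term is a dictionary access, so query cost depends on the length of the posting lists rather than on the number of documents scanned.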
Documents, Data Model, Indexing
XML Docs (OpsBank, DWH)
Fabrication
Document enrichment
- Emtree Backposting
- Field updates
Document transformation
Kafka
Data Model in POJO: XML to POJO to JSON to XML
Pre-processing of content, i.e. combining fields and meta-fields
Cleaning
Document Loader
Document Feeder
Elasticsearch
AWS S3
Elasticsearch topology
(Diagram labels: Node, Index, Shard)
• 3 master nodes
• 18 data nodes
• ~700 GB index with 18 shards
• Shard size of ~40 GB
• Running in Docker containers on Kubernetes
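The shard figures above can be checked with simple arithmetic. This sketch assumes the ~700 GB refers to primary shards only and that shards are spread evenly over the data nodes; neither is stated on the slide. `number_of_shards` is the standard Elasticsearch index setting; the rest of the settings are omitted.

```python
index_size_gb = 700   # approximate total index size from the slide
primary_shards = 18   # shard count from the slide
data_nodes = 18       # data node count from the slide

# Average shard size: about 38.9 GB, which the slide rounds to ~40 GB.
shard_size_gb = index_size_gb / primary_shards

# Assumption: even distribution, giving one primary shard per data node.
shards_per_node = primary_shards / data_nodes

# Index settings fragment as it would be passed at index creation time.
index_settings = {"settings": {"number_of_shards": primary_shards}}
```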
Query Processing
• See: https://confluence.elsevier.com/display/EM/Embase+command+language+grammar+visualization
Query
Query Parser
Query Builders
Lexer and AST with ANTLR
Semantic Tree Parser
Elasticsearch
Lucene
Query Language (grammar rules)
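The parse-then-build flow can be sketched for a single query form. This is a hand-written regular-expression parser, not the ANTLR grammar used in EMBASE, and it only handles the quoted-term-with-field-suffix form `'term':ti` that appears on the Matching slide. The mapping of `ti` to an index field called `title` and the choice of a `match_phrase` query are assumptions made for illustration.

```python
import re

# Hypothetical mapping from command-language field codes to index field names.
FIELD_NAMES = {"ti": "title"}

# Matches exactly one quoted term followed by a field code, e.g. 'water':ti
TERM_WITH_FIELD = re.compile(r"^\s*'(?P<term>[^']+)'\s*:\s*(?P<field>[a-z]+)\s*$")


def parse(query):
    """Turn the query string into a tiny AST node, or raise on unsupported input."""
    match = TERM_WITH_FIELD.match(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    return {"type": "field_term", "term": match["term"], "field": match["field"]}


def build(ast):
    """Translate the AST node into an Elasticsearch query clause (as a dict)."""
    field = FIELD_NAMES[ast["field"]]
    return {"match_phrase": {field: ast["term"]}}


ast = parse("'amine-functionalized':ti")
es_query = build(ast)
```

Keeping parsing and query building as separate steps mirrors the Query Parser and Query Builders boxes above: the builder only sees a validated tree, never raw user input.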
Matching
• Tokenization by whitespace
• Exact matching, including on compound terms
• 'aminefunctionalized':ti is equivalent to 'amine-functionalized':ti
• Punctuation is removed, but special characters and sub/superscripts remain searchable
• ASCII folding and lowercasing
• No language processing (stemming/stopwords)
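The matching steps above can be approximated in a few lines. This is a sketch, not the Elasticsearch analyzer configuration used in EMBASE: it assumes that compound terms match because hyphens are dropped along with other punctuation, and it makes no attempt at the special-character and sub/superscript handling mentioned above.

```python
import re
import unicodedata


def fold_ascii(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Whitespace tokenization, folding, lowercasing and punctuation removal."""
    tokens = []
    for raw in text.split():
        token = fold_ascii(raw).lower()
        token = re.sub(r"[^\w]", "", token)  # drops punctuation, including hyphens
        if token:
            tokens.append(token)
    return tokens


hyphenated = analyze("Amine-Functionalized silica")
compounded = analyze("aminefunctionalized silica")
```

Because there is no stemming or stopword removal, a token such as `running` is indexed as is and does not match `run`.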
Ranking
• By Publication Year and Entry Date
• By Relevance
• Default BM25 relevance scoring of Elasticsearch (probabilistic model)
• Similarity search
• Vector Space Model with term boosting
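To make the BM25 bullet concrete, here is a sketch of the textbook BM25 formula with k1 = 1.2 and b = 0.75, which are the Elasticsearch defaults. It is not Lucene's implementation (recent Lucene versions, for example, drop the constant (k1 + 1) factor), and the sample documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "water quality in hospitals",
    "reclaimed water and washing water reuse",
    "hospital infection control",
]
scores = bm25_scores("water", docs)
```

In this example the second document ranks first because it contains the query term twice, even though its greater length reduces the gain from the extra occurrence.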
Query Refinement:
Autocomplete
• Lookup of cached (grouped) Emtree terms
• Hit counts
  • Live
  • Cached by a cron job that computes the hit counts from ES with terms aggregations and partitions
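The cached lookup can be sketched as a prefix search over a sorted term list with precomputed hit counts. The class name, the terms and the counts below are all hypothetical; in EMBASE the counts would come from the cached aggregation results described above.

```python
from bisect import bisect_left


class TermCache:
    """In-memory autocomplete over cached terms and their hit counts."""

    def __init__(self, hit_counts):
        # Terms are assumed to be stored lowercased.
        self._hit_counts = dict(hit_counts)
        self._terms = sorted(self._hit_counts)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` (term, hit_count) pairs, most frequent first."""
        prefix = prefix.lower()
        start = bisect_left(self._terms, prefix)
        matches = []
        for term in self._terms[start:]:
            if not term.startswith(prefix):
                break  # sorted order: no later term can share the prefix
            matches.append((term, self._hit_counts[term]))
        matches.sort(key=lambda pair: (-pair[1], pair[0]))
        return matches[:limit]


# Made-up hit counts for illustration only.
cache = TermCache({
    "water": 1200,
    "water pollution": 300,
    "warfarin": 900,
    "washing water": 40,
})
suggestions = cache.suggest("wat")
```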
Query Refinement: Synonyms
Emtree thesaurus:
<Term>
  <TermName LinkType="drug">water</TermName>
  <Synonym>dihydrogen oxide</Synonym>
  <Synonym>hydrogen oxide</Synonym>
  <Synonym>hydrogen oxide o 16</Synonym>
  <Synonym>reclaimed water</Synonym>
  <Synonym>washing water</Synonym>
  <Synonym>water o 16</Synonym>
  <HistoryNote>
    <CreationYear>1974</CreationYear>
  </HistoryNote>
</Term>
Dorland's dictionary:
<entry disabled="false">
  <term>metaiodobenzylguanidine</term>
  <emtreeTerm>(3 iodobenzyl)guanidine</emtreeTerm>
  <definition>iobenguane.</definition>
</entry>
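One way such thesaurus entries could drive synonym expansion is sketched below: the Emtree `<Term>` record shown above is parsed with the standard library, each synonym is mapped to its preferred term, and a query term is expanded to the preferred term plus all of its synonyms. The function names are hypothetical, and how EMBASE actually applies synonyms at query time is not described in the slides.

```python
import xml.etree.ElementTree as ET

# The Emtree record from the slide.
EMTREE_XML = """
<Term>
  <TermName LinkType="drug">water</TermName>
  <Synonym>dihydrogen oxide</Synonym>
  <Synonym>hydrogen oxide</Synonym>
  <Synonym>hydrogen oxide o 16</Synonym>
  <Synonym>reclaimed water</Synonym>
  <Synonym>washing water</Synonym>
  <Synonym>water o 16</Synonym>
  <HistoryNote>
    <CreationYear>1974</CreationYear>
  </HistoryNote>
</Term>
"""


def synonym_map(term_xml):
    """Map every synonym in a <Term> record to its preferred term name."""
    term = ET.fromstring(term_xml)
    preferred = term.findtext("TermName")
    return {syn.text: preferred for syn in term.findall("Synonym")}


def expand(query_term, synonyms):
    """Return the preferred term followed by all of its synonyms."""
    preferred = synonyms.get(query_term, query_term)
    variants = sorted(s for s, p in synonyms.items() if p == preferred)
    return [preferred] + variants


synonyms = synonym_map(EMTREE_XML)
expanded = expand("reclaimed water", synonyms)
```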
Query Refinement: Faceting (1)
• Using multiple dimensions to narrow down results.
• “…allowing users to narrow down search results by applying multiple filters based on faceted classification of the items.”
• https://en.wikipedia.org/wiki/Faceted_search
• EMBASE uses Elasticsearch aggregations for creating facets
• “The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.”
• https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
Query Refinement: Faceting (2)
• Plain faceting
• Hierarchical faceting with Subheadings and Triplelinks
• Faceting with Venn diagrams
• Facets with name normalization
• We use Elasticsearch Aggregations
• (Terms, Nested, Reverse Nested, Adjacency Matrix, Filter)
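Plain faceting with a terms aggregation might look like the sketch below. The request and response shapes follow the Elasticsearch aggregations DSL, but the field names (`title`, `pub_year`, `pub_type`) and the sample response are invented, and no Elasticsearch client is called.

```python
def facet_request(query_text, facet_fields, size=10):
    """Build a search body that returns only facet counts (no hits)."""
    return {
        "size": 0,  # skip the hits; only the aggregations are needed
        "query": {"match": {"title": query_text}},
        "aggs": {
            field: {"terms": {"field": field, "size": size}}
            for field in facet_fields
        },
    }


def facets_from_response(response):
    """Flatten terms-aggregation buckets into (value, count) pairs per facet."""
    return {
        name: [(bucket["key"], bucket["doc_count"]) for bucket in agg["buckets"]]
        for name, agg in response.get("aggregations", {}).items()
    }


request_body = facet_request("water", ["pub_year", "pub_type"])

# Hand-written response in the shape Elasticsearch returns for terms aggregations.
sample_response = {
    "aggregations": {
        "pub_year": {"buckets": [{"key": 2020, "doc_count": 12},
                                 {"key": 2019, "doc_count": 7}]},
        "pub_type": {"buckets": [{"key": "article", "doc_count": 15}]},
    }
}
facets = facets_from_response(sample_response)
```

The other aggregation types listed above (Nested, Reverse Nested, Adjacency Matrix, Filter) compose in the same way, by nesting further `aggs` blocks inside a parent aggregation.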
Exporting
• Standard ES pagination (from/size) cannot be used for retrieving large amounts of results
• To retrieve large amounts of results:
• ES Scroll API and search_after parameter
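The `search_after` approach can be sketched as a loop that feeds the sort values of the last hit on each page into the next request. The `search` callable stands in for a real Elasticsearch client call, and `fake_search_factory` is an in-memory substitute written only so the example runs; the sort fields `entry_date` and `doc_id` are assumptions.

```python
def export_all(search, page_size=1000):
    """Yield every hit, paging with the sort values of the last hit seen."""
    body = {
        "size": page_size,
        # A unique tiebreaker field keeps the ordering stable across pages.
        "sort": [{"entry_date": "asc"}, {"doc_id": "asc"}],
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        body = {**body, "search_after": hits[-1]["sort"]}


def fake_search_factory(docs):
    """In-memory stand-in that mimics sorted search with search_after."""
    ordered = sorted(docs, key=lambda d: (d["entry_date"], d["doc_id"]))

    def search(body):
        after = body.get("search_after")
        rows = [d for d in ordered
                if after is None or [d["entry_date"], d["doc_id"]] > after]
        page = rows[: body["size"]]
        return {"hits": {"hits": [
            {"_source": d, "sort": [d["entry_date"], d["doc_id"]]} for d in page
        ]}}

    return search


docs = [{"doc_id": i, "entry_date": "2020-01-%02d" % (i % 3 + 1)} for i in range(7)]
exported = list(export_all(fake_search_factory(docs), page_size=3))
```

Unlike from/size paging, each request only has to find the documents that sort after the last one returned, so the cost per page does not grow with the depth of the export.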
Summary
Overview of EMBASE from a search engineering perspective
Explained how EMBASE does:
• Indexing
• Query processing and building
• Matching and ranking
• Query refinement with autocomplete, synonyms, faceting and filtering
• Exporting
