8. Documents, Data Model, Indexing
[Diagram: indexing pipeline. Labels recovered from the figure:]
• XML Docs (OpsBank, DWH)
• Fabrication
• Document enrichment
  - Emtree Backposting
  - Field updates
• Document transformation
• Kafka
• Data Model in POJO (XML to POJO to JSON to XML)
• Pre-processing of content, i.e. combining fields and meta-fields
• Cleaning
• Document Loader
• Document Feeder
• Elasticsearch
• AWS S3
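The "XML to POJO to JSON" step and the field-combining pre-processing can be illustrated with a small sketch. This is not the EMBASE code: a Python dataclass stands in for the POJO, and the element names (record, title, abstract), the id attribute and the combined all_text field are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class Record:
    # Hypothetical data model; the real EMBASE model is not shown in the slides.
    doc_id: str
    title: str
    abstract: str
    all_text: str  # combined field built during pre-processing


def clean(text):
    # Cleaning: trim and collapse runs of whitespace left over from the XML.
    return " ".join((text or "").split())


def xml_to_record(xml_text):
    # XML to object: parse the document and fill the data model.
    root = ET.fromstring(xml_text)
    title = clean(root.findtext("title"))
    abstract = clean(root.findtext("abstract"))
    return Record(
        doc_id=root.get("id", ""),
        title=title,
        abstract=abstract,
        # Pre-processing: combine several fields into one searchable field.
        all_text=f"{title} {abstract}".strip(),
    )


def record_to_json(record):
    # Object to JSON: the form a search index would ingest.
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the object model in the middle means enrichment and cleaning steps operate on typed fields rather than on raw XML or JSON strings.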
12. Matching
• Tokenization by whitespace
• Exact matching, including on compounded terms
• 'aminefunctionalized':ti equals 'amine-functionalized':ti
• Removal of punctuation, while still allowing searches for special characters and sub/superscripts
• ASCII folding and lowercasing
• No language processing (stemming/stopwords)
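The matching rules above can be approximated in a few lines. This is a minimal sketch in plain Python, not the Elasticsearch analyzer configuration EMBASE uses; the function names are hypothetical, and the special handling of special characters and sub/superscripts is not modeled.

```python
import unicodedata


def ascii_fold(text):
    # Decompose accented characters and drop the combining marks (e.g. é -> e).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    tokens = []
    # Tokenization by whitespace only.
    for raw in text.split():
        folded = ascii_fold(raw).lower()
        # Removal of punctuation: hyphenated compounds collapse to one token.
        token = "".join(ch for ch in folded if ch.isalnum())
        if token:
            tokens.append(token)
    # No stemming and no stopword removal: tokens are otherwise left as-is.
    return tokens
```

With this chain, analyze("amine-functionalized") and analyze("aminefunctionalized") produce the same single token, which is why the two title queries above are equivalent.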
13. Ranking
• By Publication Year and Entry Date
• By Relevance
• Default BM25 relevance scoring of Elasticsearch (probabilistic model)
• Similarity search
• Vector Space Model with term boosting
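For reference, BM25 scoring can be sketched as follows. This is the textbook form of the formula with the Elasticsearch default parameters k1 = 1.2 and b = 0.75; the exact constants inside Lucene vary between versions, and the function names here are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # Inverse document frequency: rare terms contribute more.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: long documents are penalized, controlled by b.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less, controlled by k1.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


def bm25_score(query_terms, doc_tokens, avg_doc_len, n_docs, doc_freqs):
    # Sum the per-term scores over the query terms present in the document.
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0 or term not in doc_freqs:
            continue
        score += bm25_term_score(tf, len(doc_tokens), avg_doc_len, n_docs, doc_freqs[term])
    return score
```

The saturation in the term-frequency part is the main practical difference from a plain TF-IDF vector space score, where repeated terms keep increasing the score without bound.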
14. Query Refinement: Autocomplete
• Lookup of cached (grouped) Emtree terms
• Hit counts
• Live
• Cached with a cronjob that computes the hit counts from ES with terms aggregations and partitions
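A cached hit-count job of this kind could build its requests roughly as below. The sketch uses the partition filtering option of the Elasticsearch terms aggregation ("include" with "partition" and "num_partitions"); the aggregation name hit_counts, the bucket size and any field name passed in are assumptions, and sending the requests to a cluster is left out.

```python
def partitioned_terms_requests(field, num_partitions, bucket_size=10000):
    # One request body per partition; each returns a disjoint slice of the terms.
    bodies = []
    for partition in range(num_partitions):
        bodies.append({
            "size": 0,  # no hits needed, only the aggregation
            "aggs": {
                "hit_counts": {
                    "terms": {
                        "field": field,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": bucket_size,
                    }
                }
            },
        })
    return bodies


def merge_hit_counts(responses):
    # Combine the buckets of all partition responses into one term -> count map.
    counts = {}
    for response in responses:
        for bucket in response["aggregations"]["hit_counts"]["buckets"]:
            counts[bucket["key"]] = bucket["doc_count"]
    return counts
```

Splitting the work into partitions keeps each response small when the field has a very large number of distinct terms, at the cost of one request per partition.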
16. Query Refinement: Faceting (1)
• Using multiple dimensions to narrow down the results.
• “…allowing users to narrow down search results by applying multiple filters based on faceted classification of the items.”
• https://en.wikipedia.org/wiki/Faceted_search
• EMBASE uses Elasticsearch aggregations for creating facets
• “The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.”
• https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
17. Query Refinement: Faceting (2)
• Plain faceting
• Hierarchical Faceting with Subheadings and Triplelinks
• Faceting with Venn diagrams
• Facets with name normalization
• We use Elasticsearch Aggregations
• (Terms, Nested, Reverse Nested, Adjacency Matrix, Filter)
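As an illustration of how the adjacency matrix aggregation can feed a Venn diagram, the sketch below builds a request body and turns the returned buckets into the three regions of a two-set diagram. The helper names, the aggregation name venn and the filter names are hypothetical; the bucket shape (keys such as "a" and "a&b", with single-filter counts including the overlap) follows the Elasticsearch adjacency_matrix aggregation.

```python
def venn_facet_request(query, named_filters):
    # Search body with an adjacency_matrix aggregation over the named filters.
    return {
        "size": 0,  # only aggregation results are needed
        "query": query,
        "aggs": {"venn": {"adjacency_matrix": {"filters": named_filters}}},
    }


def two_set_venn(buckets, a, b):
    # Map bucket keys to counts; empty intersections are simply absent.
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    both = counts.get(f"{a}&{b}", counts.get(f"{b}&{a}", 0))
    # A single-filter bucket counts every match, so subtract the overlap.
    return {
        f"only_{a}": counts.get(a, 0) - both,
        f"only_{b}": counts.get(b, 0) - both,
        "both": both,
    }
```

One request therefore yields all pairwise overlaps at once, instead of issuing a separate filtered count query per region of the diagram.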
18. Exporting
• We cannot use ES from/size pagination for retrieving large amounts of results (by default it is capped at 10,000 hits by index.max_result_window)
• To retrieve large amounts of results:
• ES Scroll API and search_after parameter
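The search_after approach can be sketched as a simple loop. Here search is a hypothetical callable that takes a request body and returns an Elasticsearch-style response; the page size is an assumption, and the sort must end in a unique tiebreaker field so that no hit is skipped or repeated between pages.

```python
def export_all(search, query, sort, page_size=1000):
    # Yield every hit for the query, one page at a time.
    search_after = None
    while True:
        body = {"size": page_size, "query": query, "sort": sort}
        if search_after is not None:
            # Resume right after the last hit of the previous page.
            body["search_after"] = search_after
        hits = search(body)["hits"]["hits"]
        if not hits:
            return  # an empty page means the export is complete
        yield from hits
        # Each hit carries its sort values; the last one is the next cursor.
        search_after = hits[-1]["sort"]
```

Because the cursor is derived from the sort values rather than an offset, the cost of each page stays roughly constant regardless of how deep into the result set the export has progressed.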
19. Summary
Overview of EMBASE from a search engineering perspective
Explained how EMBASE does:
• Indexing
• Query processing and building
• Matching and ranking
• Query refinement with autocomplete, synonyms, faceting and filtering
• Exporting