Using Graph and Transformer Embeddings for Vector Based Retrieval
Sujit Pal
Tony Scerri
4 November 2020
Agenda
• Term vs Vector vs Graph Search
• Problem Formulation
• Experiments and Results
• Conclusions and Future Work
Term vs Vector vs Graph Search
Intuition
Query:
Donald Trump
Term based:
Donald Trump
Donald Trump, Jr.
Vector based:
Melania Trump
Ivanka Trump
Jared Kushner
Barack Obama
George W Bush
Hillary Clinton
Joe Biden
Graph based:
Rudy Giuliani
Bill Barr
Paul Manafort
Michael Flynn
Michael Cohen
Jeffrey Epstein
Prince Andrew
Robert Mueller III
Christopher Steele
Term based search
• Query and documents represented as high dimensional sparse vectors of
term weights.
• Inverted index works well with sparse vectors.
• Term based search captures term similarity / overlap.
• Unsupervised operation.
• Scales to large document sets.
• BM25 more popular nowadays for ranking.
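The term-overlap behaviour described above can be illustrated with a minimal BM25 sketch. This is not the scoring code of any particular search engine: the function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and the IDF uses the always non-negative "plus one" form.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption: the "+1 inside the log" form).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus mirroring the "Intuition" example.
docs = [
    "donald trump".split(),
    "donald trump jr".split(),
    "barack obama".split(),
]
scores = bm25_scores("donald trump".split(), docs)
```

In this toy corpus only the documents that share terms with the query receive a non-zero score, and the shorter exact match scores above the longer one.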
Vector based search
• Based on Distributional Hypothesis.
• Leverages “word” and other
embeddings.
• Can be based on document content or
document graph structure.
• Captures semantic similarity
• Vectors are low dimensional and
dense.
• Approximate Nearest Neighbor (ANN)
methods work best.
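As a rough sketch of what vector based search computes, the snippet below ranks documents by cosine similarity to a query vector using an exact brute-force scan. ANN methods approximate this ranking to avoid scanning every vector. The three-dimensional vectors and document ids are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Exact top-k search; ANN indexes trade a little accuracy for speed."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.3, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}
top = nearest_neighbors([1.0, 0.2, 0.0], doc_vecs, k=2)
```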
Graph based search and reranking
• Leverages relationships between documents.
• Citation graph, co-authorship network, term/concept co-occurrence
networks, etc.
• We use the SCOPUS citation graph to calculate:
• Citation Count
• PageRank
• Localized citation count (based on results set)
• Combinations based on relative ranks or normalized scores
• Re-rank using computed graph metrics.
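One possible reading of the localized citation count is sketched below: count only citations where both the citing and the cited paper are in the current result set, then re-rank by that count. This interpretation, the helper names, and the toy ids are assumptions, not the exact computation used in the experiments.

```python
def localized_citation_counts(result_ids, citations):
    """Count citations to each result that come from other results.

    `citations` is an iterable of (citing_id, cited_id) pairs.
    """
    in_results = set(result_ids)
    counts = {doc_id: 0 for doc_id in result_ids}
    for citing, cited in citations:
        if citing in in_results and cited in in_results:
            counts[cited] += 1
    return counts

def rerank_by_metric(result_ids, metric):
    """Sort descending by metric; ties keep their original relevance order."""
    return sorted(result_ids, key=lambda doc_id: metric[doc_id], reverse=True)

# Hypothetical result list (in relevance order) and citation edges.
results = ["p1", "p2", "p3"]
citations = [("p1", "p3"), ("p2", "p3"), ("p9", "p1"), ("p3", "p2")]
lcb = localized_citation_counts(results, citations)
reranked = rerank_by_metric(results, lcb)
```

The citation from `p9` is ignored because `p9` is outside the result set, which is what distinguishes the localized count from the global citation count.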
Graph based search examples
• Global graph metrics (e.g.
PageRank)
• Indicates importance.
• Graph neighborhood features
(e.g. Node2Vec)
• Indicates Topological similarity
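To make the global importance metric concrete, here is a small power-iteration sketch of PageRank over a directed edge list. The damping factor of 0.85, the fixed iteration count, and the even redistribution from dangling nodes are conventional choices assumed here, not details taken from the experiments.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Dangling nodes (no out-links) spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical citation edges: a cites c, b cites c, c cites a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
```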
Graph + Vector Hybrid search
• SPECTER: Document-level Representation Learning using
Citation-informed Transformers
• Minimizes a triplet loss between papers
• Related/Unrelated papers based on
citation graph
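The triplet objective can be sketched as follows: a query paper should be closer to a related (cited) paper than to an unrelated one by at least a margin. The L2 distance, the margin value of 1.0, and the two-dimensional toy vectors are assumptions for illustration only.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(query, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by `margin`."""
    return max(
        l2_distance(query, positive) - l2_distance(query, negative) + margin,
        0.0,
    )

# Positive is near the query and the negative is far away: no loss.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the loss is the distance gap plus the margin.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```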
Evaluation Metric: NDCG
• Search Result Quality measured with Normalized Discounted
Cumulative Gain (NDCG).
• Measured for k=1, 3, 5, 10, 20, 50.
• rel(i) is relevance score, usually relevance(query, document(i)).
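A minimal NDCG@k sketch is given below. It assumes the linear-gain form of DCG, where the relevance at rank i is divided by log2(i + 1); other formulations use an exponential gain. The function names and the optional `judged_relevances` argument are illustrative assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k, judged_relevances=None):
    """DCG normalized by the DCG of the ideal ordering.

    If `judged_relevances` (all judged documents for the query) is given,
    the ideal ordering is built from it; otherwise from the ranked list.
    """
    pool = ranked_relevances if judged_relevances is None else judged_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([2, 1, 0], k=3)
reversed_order = ndcg_at_k([0, 1, 2], k=3)
no_relevant = ndcg_at_k([0, 0], k=2)
```

Averaging `ndcg_at_k` over all queries gives a single figure per cut-off k.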
Problem Formulation
Reformulating the Problem
• Objective: quantitatively compare various search approaches on SCOPUS.
• This needed labeled data, i.e., judgement lists, which weren't available.
• TREC-COVID has (incomplete) judgement lists (35 queries so far).
• TREC-COVID uses CORD-19 data (from 01-May-2020), some of which is available in
SCOPUS.
• Some degree of duplication within the corpus causes minor discrepancies.
• Using the subset of papers from the May 1 CORD-19 dataset that are in SCOPUS.
• Using the subset of TREC-COVID judgement lists for these papers.
• Promising candidate solutions to be applied back to SCOPUS.
Setup
• SOLR index created from CORD-19 corpus (Scopus subset only)
• Original plus stemmed fields
• Baseline created using eDismax query applied to original and
stemmed fields
• NDCG measured with filtered (to Scopus) judgements
• Various reranking schemes applied to eDismax results
• Alternative query methods (SOLR MLT, vector based query) tried
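For orientation, a baseline eDisMax request over original and stemmed fields might be assembled like this. The parameters `defType`, `q`, `qf`, `fl` and `rows` are standard Solr query parameters, but the field names (`title`, `title_stem`, and so on) are hypothetical; the actual schema used in the experiments is not shown in the slides.

```python
from urllib.parse import urlencode

def edismax_params(query, rows=50):
    """Build request parameters for an eDisMax query (field names assumed)."""
    return {
        "defType": "edismax",
        "q": query,
        # Hypothetical original and stemmed fields searched together.
        "qf": "title abstract body title_stem abstract_stem body_stem",
        "fl": "id,score",
        "rows": rows,
    }

# The encoded string would be appended to a Solr select handler URL.
query_string = urlencode(edismax_params("coronavirus origin"))
```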
Experiments and Results
Conditions
• Unless otherwise stated:
• Use of the query field from TREC-COVID (not the question or narrative descriptions)
• NDCG based on Scopus matched records only
• NDCG reported is the average across all 35 queries
Basic Search
• eDisMax (original text and stemmed)
• MLT (More Like This, using the top eDisMax result)
• Using title, abstract and body fields (where available)
• Experiments removing "coronavirus" from each query, or using only "coronavirus" as the query
Full NDCG is based on the full set of matching documents
NDCG                   @1       @3       @5       @10      @20      @50      Full
eDisMax (orig)         0.41428  0.33743  0.33996  0.32035  0.29261  0.26126  0.54092
eDisMax (stem)         0.44285  0.37126  0.35589  0.32939  0.29697  0.26580  0.54744
MLT (stem)             0.41428  0.29457  0.26813  0.23619  0.19322  0.15559  0.38331
Just "coronavirus"     0        0        0.00604  0.00701  0.00608  0.01003  0.29161
Without "coronavirus"  0.27142  0.22841  0.23478  0.22058  0.21453  0.19836  0.46307
Reranking
• As previously mentioned, various reranking schemes were tried:
• Cited By (descending)
• PageRank (descending)
• Localized Cited By (descending)
• Combinations included:
• Relevancy + PageRank (ranks, ascending)
• Relevancy * PageRank (normalized inverse ranks, ascending)
• Cited By + Localized Cited By (ranks, ascending)
• Relevancy + Cited By (ranks, ascending)
• Relevancy + Localized Cited By (ranks, ascending)
• Relevancy + (10%) Localized Cited By (normalized scores, descending)
• Relevancy + (10%) PageRank (normalized scores, descending)
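The two families of combinations listed above can be sketched as follows: summing rank positions (sorted ascending) versus adding a down-weighted normalized graph score to the normalized relevance score (sorted descending). Min-max normalization, the helper names, and the toy scores are assumptions; the slides do not state which normalization was used.

```python
def ranks(scores):
    """Map doc id to its 1-based rank; the highest score gets rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: pos for pos, doc_id in enumerate(ordered, start=1)}

def min_max(scores):
    """Scale scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def combine_ranks(relevance, metric):
    """'Relevancy + metric (ranks, ascending)': smallest rank sum first."""
    rel_rank, met_rank = ranks(relevance), ranks(metric)
    return sorted(relevance, key=lambda d: rel_rank[d] + met_rank[d])

def blend_scores(relevance, metric, weight=0.1):
    """'Relevancy + (10%) metric (normalized scores, descending)'."""
    rel, met = min_max(relevance), min_max(metric)
    return sorted(relevance, key=lambda d: rel[d] + weight * met[d], reverse=True)

# Hypothetical relevance and PageRank scores for three papers.
relevance = {"p1": 9.0, "p2": 7.5, "p3": 2.0}
page_rank = {"p1": 0.01, "p2": 0.30, "p3": 0.02}
by_ranks = combine_ranks(relevance, page_rank)
by_blend = blend_scores(relevance, page_rank)
```

In this toy case the rank sum lets the graph metric overturn the relevance order, while the 10% blend leaves it intact.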
Reranking Results
NDCG              @1        @3        @5        @10       @20       @50       Full
Cited By          0         0         0         0         0.00104   0.00123   0.28192
PageRank          0         0         0         0.00515   0.00343   0.00484   0.28854
Localized CB      0         0.00335   0.01319   0.008832  0.02060   0.01347   0.31731
Rel+PR Ranks      0.11428   0.08731   0.07885   0.06334   0.05242   0.04228   0.34680
Rel*PR Ranks      0.11428   0.08731   0.07885   0.06334   0.05242   0.04208   0.34007
CB+LCB Ranks      0         0         0.00187   0.00436   0.00541   0.00762   0.30514
Rel+CB Ranks      0.07142   0.06121   0.05258   0.04764   0.03682   0.03160   0.33967
Rel+LCB Ranks     0.08571   0.07725   0.08522   0.07550   0.06465   0.06281   0.38644
Rel+0.1LCB Score  0.44285   0.37039   0.35858   0.33305   0.29205   0.26444   0.54096
Rel+0.1PR Score   0.41428   0.35626   0.34943   0.32226   0.28849   0.25534   0.53191
eDisMax (stem)    0.44285   0.37126   0.35589   0.32939   0.29697   0.26580   0.54744
The full results list from the stemmed eDisMax query was used for reranking.
Most schemes score significantly lower than pure eDisMax.
(Searching) Ranking
• Sorting of the full corpus, no cut-off applied
• Vector based reranking including:
• BERT embedding
• Node2Vec embeddings
• SPECTER embeddings
• Compare BERT embedding vector distance of query to title and abstract
• Cosine distance and Euclidean distance
• Query vs
• Title
• Title + Abstract (max or mean pooling)
• Also tried best eDismax result document per query
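The pooling and distance options above can be sketched in a few lines. This assumes that pooling combines a title vector and an abstract vector into one document vector; the helper names and toy vectors are illustrative, and a real run would use BERT output vectors instead.

```python
import math

def mean_pool(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def cosine_distance(u, v):
    """One minus the cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical title and abstract embeddings for one document.
title_vec, abstract_vec = [1.0, 3.0], [3.0, 5.0]
doc_mean = mean_pool([title_vec, abstract_vec])
doc_max = max_pool([title_vec, abstract_vec])
query_vec = [2.0, 4.0]
dist_cos = cosine_distance(query_vec, doc_mean)
dist_euc = euclidean_distance(query_vec, doc_max)
```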
Extra Embedding
• Also looked at the question and narrative compared to the query, e.g.:
• Query : coronavirus origin
• Question : what is the origin of COVID-19
• Narrative : seeking range of information about the SARS-CoV-2 virus's origin,
including its evolution, animal source, and first transmission into humans
• Questions and narratives work better than queries
• Likely due to longer text with words (including synonyms and
concepts) in context (more natural)
Ranking by BERT
NDCG              @1       @3       @5       @10      @20      @50      Full
Question          0.31428  0.24006  0.19769  0.16458  0.13224  0.10945  0.36517
Narrative         0.17142  0.15291  0.13685  0.11555  0.0982   0.08420  0.33321
Query             0.04285  0.0386   0.03562  0.03288  0.03255  0.02948  0.29186
Best eDisMax doc  0.44285  0.24325  0.19999  0.13860  0.10391  0.07905  0.31552
eDisMax (stem)    0.44285  0.37126  0.35589  0.32939  0.29697  0.26580  0.54744
• Generally:
• Mean pooling better than max pooling
• Best match using Question over Narrative over Query
• Best match on Title + Abstract
• Cosine vs Euclidean: only slight variations across the board
• Results shown are based on:
• Mean pooling, cosine distance, using Title + Abstract
All are lower than the baseline.
Ranking by Node2Vec
• Node2Vec node embedding of a document used as the query
• Results: cosine similarity, with the single top result from stemmed eDisMax
as the query document
• Some queries' top results were not in the edge list and therefore yield
zero NDCG
• Core network, all edges connect two CORD documents
• Extended network, all edges touch one CORD document
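The difference between the two networks comes down to an edge filter, sketched below. "Touch one CORD document" is read here as "at least one endpoint is a CORD document", which is an assumption, as are the function names and toy ids.

```python
def core_edges(edges, cord_ids):
    """Keep edges whose two endpoints are both CORD documents."""
    return [(s, t) for s, t in edges if s in cord_ids and t in cord_ids]

def extended_edges(edges, cord_ids):
    """Keep edges with at least one CORD endpoint (assumed reading)."""
    return [(s, t) for s, t in edges if s in cord_ids or t in cord_ids]

# Hypothetical citation edges; x1 and x2 are papers outside CORD-19.
cord_ids = {"c1", "c2"}
edges = [("c1", "c2"), ("c1", "x1"), ("x1", "x2")]
core = core_edges(edges, cord_ids)
extended = extended_edges(edges, cord_ids)
```

A document that appears in neither filtered edge list has no node embedding, which matches the zero-NDCG cases noted above.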
NDCG            @1       @3       @5       @10      @20      @50      Full
Core            0.44285  0.21205  0.15721  0.10224  0.06997  0.04739  0.33174
Extended        0.44285  0.21628  0.16048  0.10527  0.06909  0.04360  0.32761
eDisMax (stem)  0.44285  0.37126  0.35589  0.32939  0.29697  0.26580  0.54744
Ranking by SPECTER
• SPECTER document embedding used as the query
• Embeddings as supplied with the CORD-19 dataset
• Query document is the best document returned by the eDisMax queries
NDCG            @1       @3       @5       @10      @20      @50      Full
Stemmed         0.44285  0.28204  0.25158  0.19540  0.15484  0.12281  0.41963
eDisMax (stem)  0.44285  0.37126  0.35589  0.32939  0.29697  0.26580  0.54744
Training Embedding
• Short experiment on training query-document embeddings for improved
ranking
• We did not try this with the question or narrative, limiting it to the query
• Poor results assumed to be caused by:
• Lack of variability in the queries, preventing generalisation
• Nearly all queries contain "coronavirus"
Final Results
• Taking the best results from each set of experiments
• None of the reranking strategies, including the embedding based
ones (content, graph, or hybrid) beat the stemmed eDismax baseline.
NDCG            @1       @3       @5       @10      @20      @50      Full
Rel + 0.1*LCB   0.44285  0.37039  0.35858  0.33305  0.29205  0.26444  0.54096
BERT reranking  0.44285  0.24325  0.19999  0.13860  0.10391  0.07905  0.31552
Node2Vec core   0.44285  0.21205  0.15721  0.10224  0.06997  0.04739  0.33174
SPECTER (stem)  0.44285  0.28204  0.25158  0.19540  0.15484  0.12281  0.41963
eDisMax (stem)  0.44285  0.37126  0.35589  0.32939  0.29697  0.26580  0.54744
Conclusions and Future Work
Summary
• CORD-19 corpus with incomplete judgement data (it is still being extended based on
results from systems)
• eDisMax appears to do OK...
• Recall suffers due to term mismatching
• Beyond basic synonyms
• Query intent is represented by a single limiting query clause
• The question and narrative descriptors provide much more natural text for embeddings to work
from
• Graph metrics for importance may have limited application depending on user task
• Incomplete judgement data makes NDCG questionable
• ...and there is insufficient information on sampling to apply infNDCG
• Open question: do embeddings capture a general sense of semantic equivalence or concept
identity (synonyms)?
Future Work
• Experiments based on Scopus and our own judgement data
• Application of graph metrics, including more than just basic citation
graph
• Investigation of fine-tuned embeddings combining text and graph
• Apply ML based reranking
• Investigate the balance between concept, semantic, freshness and
importance
Questions?
38

More Related Content

What's hot

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
Julien SIMON
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Sease
 
Haystack 2019 - Query relaxation - a rewriting technique between search and r...
Haystack 2019 - Query relaxation - a rewriting technique between search and r...Haystack 2019 - Query relaxation - a rewriting technique between search and r...
Haystack 2019 - Query relaxation - a rewriting technique between search and r...
OpenSource Connections
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
Sujit Pal
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
Abdullah Khan Zehady
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
Yingjun Wu
 
Deep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender SystemsDeep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender Systems
Huiji Gao
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
Justin Basilico
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
Sri Ambati
 
Personalization at Netflix - Making Stories Travel
Personalization at Netflix -  Making Stories Travel Personalization at Netflix -  Making Stories Travel
Personalization at Netflix - Making Stories Travel
Sudeep Das, Ph.D.
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
Sease
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Institute of Contemporary Sciences
 
Vector database
Vector databaseVector database
Vector database
Guy Korland
 
Numeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrNumeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and Solr
Vadim Kirilchuk
 
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
Find it! Nail it!Boosting e-commerce search conversions with machine learnin...Find it! Nail it!Boosting e-commerce search conversions with machine learnin...
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
Rakuten Group, Inc.
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Bhaskar Mitra
 
Learned Embeddings for Search and Discovery at Instacart
Learned Embeddings for  Search and Discovery at InstacartLearned Embeddings for  Search and Discovery at Instacart
Learned Embeddings for Search and Discovery at Instacart
Sharath Rao
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
Databricks
 
Content based filtering
Content based filteringContent based filtering
Content based filtering
Bendito Freitas Ribeiro
 
Indexing & Query Optimization
Indexing & Query OptimizationIndexing & Query Optimization
Indexing & Query OptimizationMongoDB
 

What's hot (20)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Haystack 2019 - Query relaxation - a rewriting technique between search and r...
Haystack 2019 - Query relaxation - a rewriting technique between search and r...Haystack 2019 - Query relaxation - a rewriting technique between search and r...
Haystack 2019 - Query relaxation - a rewriting technique between search and r...
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
Deep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender SystemsDeep Natural Language Processing for Search and Recommender Systems
Deep Natural Language Processing for Search and Recommender Systems
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Personalization at Netflix - Making Stories Travel
Personalization at Netflix -  Making Stories Travel Personalization at Netflix -  Making Stories Travel
Personalization at Netflix - Making Stories Travel
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
 
Vector database
Vector databaseVector database
Vector database
 
Numeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and SolrNumeric Range Queries in Lucene and Solr
Numeric Range Queries in Lucene and Solr
 
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
Find it! Nail it!Boosting e-commerce search conversions with machine learnin...Find it! Nail it!Boosting e-commerce search conversions with machine learnin...
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Learned Embeddings for Search and Discovery at Instacart
Learned Embeddings for  Search and Discovery at InstacartLearned Embeddings for  Search and Discovery at Instacart
Learned Embeddings for Search and Discovery at Instacart
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Content based filtering
Content based filteringContent based filtering
Content based filtering
 
Indexing & Query Optimization
Indexing & Query OptimizationIndexing & Query Optimization
Indexing & Query Optimization
 

Similar to Using Graph and Transformer Embeddings for Vector Based Retrieval

Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptx
ImXaib
 
Search quality in practice
Search quality in practiceSearch quality in practice
Search quality in practice
Alexander Sibiryakov
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
Justin Sybrandt, Ph.D.
 
Overview of the TREC 2019 Deep Learning Track
Overview of the TREC 2019 Deep Learning TrackOverview of the TREC 2019 Deep Learning Track
Overview of the TREC 2019 Deep Learning Track
Nick Craswell
 
intership summary
intership summaryintership summary
intership summaryJunting Ma
 
Benchmarking search relevance in industry vs academia
Benchmarking search relevance in industry vs academiaBenchmarking search relevance in industry vs academia
Benchmarking search relevance in industry vs academia
Nick Craswell
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
Rakebul Hasan
 
Hierarchical Entity Extraction and Ranking with Unsupervised Graph Convolutions
Hierarchical Entity Extraction and Ranking with Unsupervised Graph ConvolutionsHierarchical Entity Extraction and Ranking with Unsupervised Graph Convolutions
Hierarchical Entity Extraction and Ranking with Unsupervised Graph Convolutions
Jinho Choi
 
SPSS Data Cleaning and Management
SPSS Data Cleaning and ManagementSPSS Data Cleaning and Management
SPSS Data Cleaning and Management
Statistics Solutions
 
Multi-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender SystemsMulti-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender Systems
Aravind Sesagiri Raamkumar
 
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Nicolas Robinson-Garcia
 
Presentation of Project and Critique.pptx
Presentation of Project and Critique.pptxPresentation of Project and Critique.pptx
Presentation of Project and Critique.pptx
BillyMoses1
 
DS601-Data Science Processes for Data science Student.pdf
DS601-Data Science Processes for Data science Student.pdfDS601-Data Science Processes for Data science Student.pdf
DS601-Data Science Processes for Data science Student.pdf
Bolando
 
Can Short Queries be Even Shorter?
Can Short Queries be Even Shorter?Can Short Queries be Even Shorter?
Can Short Queries be Even Shorter?
Twitter Inc.
 
Lecture_4_Data_Gathering_and_Analysis.pdf
Lecture_4_Data_Gathering_and_Analysis.pdfLecture_4_Data_Gathering_and_Analysis.pdf
Lecture_4_Data_Gathering_and_Analysis.pdf
AbdullahOmar64
 
Hpd 1
Hpd 1Hpd 1
Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...
Zide Meng
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
Jen Stirrup
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
Rrubaa Panchendrarajan
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and Processing
Mehul Gondaliya
 

Similar to Using Graph and Transformer Embeddings for Vector Based Retrieval (20)

Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptx
 
Search quality in practice
Search quality in practiceSearch quality in practice
Search quality in practice
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
 
Overview of the TREC 2019 Deep Learning Track
Overview of the TREC 2019 Deep Learning TrackOverview of the TREC 2019 Deep Learning Track
Overview of the TREC 2019 Deep Learning Track
 
intership summary
intership summaryintership summary
intership summary
 
Benchmarking search relevance in industry vs academia
Benchmarking search relevance in industry vs academiaBenchmarking search relevance in industry vs academia
Benchmarking search relevance in industry vs academia
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
Hierarchical Entity Extraction and Ranking with Unsupervised Graph Convolutions
Hierarchical Entity Extraction and Ranking with Unsupervised Graph ConvolutionsHierarchical Entity Extraction and Ranking with Unsupervised Graph Convolutions
Hierarchical Entity Extraction and Ranking with Unsupervised Graph Convolutions
 
SPSS Data Cleaning and Management
SPSS Data Cleaning and ManagementSPSS Data Cleaning and Management
SPSS Data Cleaning and Management
 
Multi-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender SystemsMulti-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender Systems
 
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
Evaluating the possibilities of DataCite for developing 'Open data metrics' o...
 
Presentation of Project and Critique.pptx
Presentation of Project and Critique.pptxPresentation of Project and Critique.pptx
Presentation of Project and Critique.pptx
 
DS601-Data Science Processes for Data science Student.pdf
DS601-Data Science Processes for Data science Student.pdfDS601-Data Science Processes for Data science Student.pdf
DS601-Data Science Processes for Data science Student.pdf
 
Can Short Queries be Even Shorter?
Can Short Queries be Even Shorter?Can Short Queries be Even Shorter?
Can Short Queries be Even Shorter?
 
Lecture_4_Data_Gathering_and_Analysis.pdf
Lecture_4_Data_Gathering_and_Analysis.pdfLecture_4_Data_Gathering_and_Analysis.pdf
Lecture_4_Data_Gathering_and_Analysis.pdf
 
Hpd 1
Hpd 1Hpd 1
Hpd 1
 
Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and Processing
 

More from Sujit Pal

Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge GraphSupporting Concept Search using a Clinical Healthcare Knowledge Graph
Supporting Concept Search using a Clinical Healthcare Knowledge Graph
Sujit Pal
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
Sujit Pal
 
Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...Building Learning to Rank (LTR) search reranking models using Large Language ...
Building Learning to Rank (LTR) search reranking models using Large Language ...
Sujit Pal
 
Cheap Trick for Question Answering
Cheap Trick for Question AnsweringCheap Trick for Question Answering
Cheap Trick for Question Answering
Sujit Pal
 
Searching Across Images and Test
Searching Across Images and TestSearching Across Images and Test
Searching Across Images and Test
Sujit Pal
 
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...Learning a Joint Embedding Representation for Image Search using Self-supervi...
Learning a Joint Embedding Representation for Image Search using Self-supervi...
Sujit Pal
 
The power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestringThe power of community: training a Transformer Language Model on a shoestring
The power of community: training a Transformer Language Model on a shoestring
Sujit Pal
 
Backprop Visualization
Backprop VisualizationBackprop Visualization
Backprop Visualization
Sujit Pal
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
Sujit Pal
 
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19
Sujit Pal
 
Leslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal ClubLeslie Smith's Papers discussion for DL Journal Club
Leslie Smith's Papers discussion for DL Journal Club
Sujit Pal
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length Inputs
Sujit Pal
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other Stories
Sujit Pal
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
Sujit Pal
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
Sujit Pal
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Sujit Pal
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
Sujit Pal
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
Sujit Pal
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity Search
Sujit Pal
 
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...
Sujit Pal
 


Using Graph and Transformer Embeddings for Vector Based Retrieval

  • 1. Using Graph and Transformer Embeddings for Vector based retrieval
      Sujit Pal, Tony Scerri
      4 November 2020
  • 2. Agenda
      • Term vs Vector vs Graph Search
      • Problem Formulation
      • Experiments and Results
      • Conclusions and Future Work
  • 3. Term vs Vector vs Graph Search
  • 6. Intuition
      Query: Donald Trump
      Term based: Donald Trump; Donald Trump, Jr.
      Vector based: Melania Trump; Ivanka Trump; Jared Kushner; Barack Obama; George W Bush; Hillary Clinton; Joe Biden
  • 7. Intuition
      Query: Donald Trump
      Term based: Donald Trump; Donald Trump, Jr.
      Vector based: Melania Trump; Ivanka Trump; Jared Kushner; Barack Obama; George W Bush; Hillary Clinton; Joe Biden
      Graph based: Rudy Giuliani; Bill Barr; Paul Manafort; Michael Flynn; Michael Cohen; Jeffrey Epstein; Prince Andrew; Robert Mueller III; Christopher Steele
  • 8. Term based search
      • Query and documents represented as high dimensional sparse vectors of term weights.
      • Inverted index works well with sparse vectors.
      • Term based search captures term similarity / overlap.
      • Unsupervised operation.
      • Scales to large document sets.
      • BM25 more popular nowadays for ranking.
  • 9. Vector based search
      • Based on Distributional Hypothesis.
      • Leverages “word” and other embeddings.
      • Can be based on document content or document graph structure.
      • Captures semantic similarity.
      • Vectors are low dimensional and dense.
      • Approximate Nearest Neighbor (ANN) methods work best.
  • 10. Graph based search and reranking
      • Leverages relationships between documents.
      • Citation graph, co-authorship network, term/concept co-occurrence networks, etc.
      • We use the SCOPUS citation graph to calculate:
          • Citation Count
          • PageRank
          • Localized citation count (based on results set)
          • Combinations based on relative ranks or normalized scores
      • Re-rank using computed graph metrics.
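The localized citation count above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes "localized" means counting only citations whose citing and cited documents both appear in the current result set, and the function name, document ids and edge list are hypothetical.

```python
from collections import Counter

def localized_citation_counts(result_ids, citations):
    """Count, per result, how many other documents in the same result set cite it.

    citations is an iterable of (citing_id, cited_id) pairs (a hypothetical edge list).
    """
    in_set = set(result_ids)
    counts = Counter()
    for citing, cited in citations:
        # Ignore edges that leave the result set, and self-citations.
        if citing in in_set and cited in in_set and citing != cited:
            counts[cited] += 1
    return {doc: counts.get(doc, 0) for doc in result_ids}

results = ["d1", "d2", "d3"]
edges = [("d2", "d1"), ("d3", "d1"), ("d9", "d1"), ("d1", "d3")]
local_counts = localized_citation_counts(results, edges)
```

Here the citation from `d9` is ignored because `d9` is not in the result set, so the count differs from a global citation count.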
  • 11. Graph based search examples
      • Global graph metrics (e.g. PageRank)
          • Indicates importance.
      • Graph neighborhood features (e.g. Node2Vec)
          • Indicates topological similarity.
  • 12. Graph + Vector Hybrid search
      • SPECTER: Document level learning using citation-informed Transformers
      • Minimize Triplet loss between papers
      • Related/Unrelated papers based on citation graph
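A triplet loss of the kind mentioned above can be sketched as below. This is an illustration under assumptions: it uses Euclidean distance and a margin of 1.0, and the vectors are toy values rather than real paper embeddings. The related (positive) and unrelated (negative) papers would come from the citation graph.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(query, positive, negative, margin=1.0):
    """Zero once the positive is closer to the query than the negative by at least the margin."""
    return max(l2_distance(query, positive) - l2_distance(query, negative) + margin, 0.0)

# Positive already much closer than the negative: no loss.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Negative closer than the positive: the loss is positive and training would push them apart.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```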
  • 13. Evaluation Metric: NDCG
      • Search Result Quality measured with Normalized Discounted Cumulative Gain (NDCG).
      • Measured for k=1, 3, 5, 10, 20, 50.
      • rel(i) is relevance score, usually relevance(query, document(i)).
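For reference, NDCG@k can be computed as in the sketch below. The formula on the original slide is not recoverable from the transcript, so this assumes the common linear-gain form, DCG@k = sum of rel(i) / log2(i + 1) over ranks i = 1..k, normalized by the DCG of the ideal ordering of all judged documents. The relevance grades are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    # Position i is 0-based, so the discount log2(i + 2) equals 1 at rank 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant judged documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal

judged = [2, 1, 0, 0]                        # all judged grades for one query
perfect = ndcg_at_k([2, 1, 0, 0], judged, k=3)
shuffled = ndcg_at_k([0, 2, 1, 0], judged, k=3)
```

A ranking that returns the judged documents in ideal order scores 1.0, and any ranking that places a non-relevant document first scores lower.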
  • 15. Reformulating the Problem
      • Objective: quantitatively compare various search approaches on SCOPUS.
      • Needed labeled data, i.e., judgement lists, which weren't available.
      • TREC-COVID has (incomplete) judgement lists (35 queries so far).
      • TREC-COVID uses CORD-19 data (from 01-May-2020), some of which are available in SCOPUS.
      • Some degree of duplication within corpus causes minor discrepancies.
      • Using subset of SCOPUS papers from May 1 CORD-19 dataset.
      • Using subset of TREC-COVID judgement lists for these papers.
      • Promising candidate solutions applied back to SCOPUS.
  • 16. Setup
      • SOLR index created from CORD-19 corpus (Scopus subset only)
          • Original plus stemmed fields
      • Baseline created using eDisMax query applied to original and stemmed fields
      • NDCG measured with filtered (to Scopus) judgements
      • Various reranking schemes applied to eDisMax results
      • Alternative query methods (SOLR MLT, vector based query) tried
  • 18. Conditions
      • Unless otherwise stated:
          • Use of query from CORD-19 (not question or narrative descriptions)
          • NDCG based on Scopus matched records only
          • NDCG reported is the average across all 35 queries
  • 19. Basic Search
      • eDisMax (original text and stemmed)
      • MLT (using top eDisMax result)
      • Using title, abstract and body fields (where available)
      • Experiment removing coronavirus or using only coronavirus for each query

      Full NDCG is based on the full set of matching documents.

      NDCG                     @1       @3       @5       @10      @20      @50      Full
      eDisMax (orig)           0.41428  0.33743  0.33996  0.32035  0.29261  0.26126  0.54092
      eDisMax (stem)           0.44285  0.37126  0.35589  0.32939  0.29697  0.26580  0.54744
      MLT (stem)               0.41428  0.29457  0.26813  0.23619  0.19322  0.15559  0.38331
      Just "coronavirus"       0        0        0.00604  0.00701  0.00608  0.01003  0.29161
      Without "coronavirus"    0.27142  0.22841  0.23478  0.22058  0.21453  0.19836  0.46307
  • 21. Reranking
      • As previously mentioned, various reranking schemes were tried:
          • Cited By (descending)
          • PageRank (descending)
          • Localized Cited By (descending)
      • Combinations included:
          • Relevancy + PageRank (ranks, ascending)
          • Relevancy * PageRank (normalized inverse ranks, ascending)
          • Cited By + Localized Cited By (ranks, ascending)
          • Relevancy + Cited By (ranks, ascending)
          • Relevancy + Localized Cited By (ranks, ascending)
          • Relevancy + (10%) Localized Cited By (normalized scores, descending)
          • Relevancy + (10%) PageRank (normalized scores, descending)
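Two of the combination styles listed above can be sketched as follows: summing ranks (sorted ascending) and adding a 10% weighted graph metric to the relevancy score after normalization (sorted descending). The slides do not say how scores were normalized, so min-max scaling is an assumption here, and the document ids and scores are made up.

```python
def ranks(scores):
    """Map each document to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: i + 1 for i, doc in enumerate(ordered)}

def min_max(scores):
    """Scale scores to [0, 1]; assumed normalization, not confirmed by the slides."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}

def rerank_by_rank_sum(relevancy, metric):
    """Relevancy + metric on ranks: smaller combined rank is better (ascending)."""
    r_rel, r_met = ranks(relevancy), ranks(metric)
    return sorted(relevancy, key=lambda d: r_rel[d] + r_met[d])

def rerank_by_weighted_score(relevancy, metric, weight=0.1):
    """Relevancy + weight * metric on normalized scores: larger is better (descending)."""
    n_rel, n_met = min_max(relevancy), min_max(metric)
    return sorted(relevancy, key=lambda d: n_rel[d] + weight * n_met[d], reverse=True)

relevancy = {"a": 9.0, "b": 7.0, "c": 2.0}   # e.g. search engine scores
cited_by = {"a": 5, "b": 1, "c": 50}         # e.g. citation counts
by_rank = rerank_by_rank_sum(relevancy, cited_by)
by_score = rerank_by_weighted_score(relevancy, cited_by)
```

In this toy case the rank sum lets the heavily cited but weakly matching document `c` overtake `b`, while the 10% weighting leaves the relevancy order intact.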
  • 22. Reranking Results

      NDCG                @1        @3        @5        @10       @20       @50       Full
      Cited By            0         0         0         0         0.00104   0.00123   0.28192
      PageRank            0         0         0         0.00515   0.00343   0.00484   0.28854
      Localized CB        0         0.00335   0.01319   0.008832  0.02060   0.01347   0.31731
      Rel+PR Ranks        0.11428   0.08731   0.07885   0.06334   0.05242   0.04228   0.34680
      Rel*PR Ranks        0.11428   0.08731   0.07885   0.06334   0.05242   0.04208   0.34007
      CB+LCB Ranks        0         0         0.00187   0.00436   0.00541   0.00762   0.30514
      Rel+CB Ranks        0.07142   0.06121   0.05258   0.04764   0.03682   0.03160   0.33967
      Rel+LCB Ranks       0.08571   0.07725   0.08522   0.07550   0.06465   0.06281   0.38644
      Rel+0.1LCB Score    0.44285   0.37039   0.35858   0.33305   0.29205   0.26444   0.54096
      Rel+0.1PR Score     0.41428   0.35626   0.34943   0.32226   0.28849   0.25534   0.53191
      eDisMax (stem)      0.44285   0.37126   0.35589   0.32939   0.29697   0.26580   0.54744

      The full results list from the stemmed eDisMax query was used for reranking.
      Most are significantly lower than pure eDisMax.
  • 24. (Searching) Ranking
      • Sorting of the full corpus, no cut-off applied
      • Vector based reranking including:
          • BERT embedding
          • Node2Vec embeddings
          • SPECTER embeddings
      • Compare BERT embedding vector distance of query to title and abstract
          • Cosine distance and Euclidean distance
          • Query vs
              • Title
              • Title + Abstract (max or mean pooling)
      • Also tried best eDisMax result document per query
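Ranking a corpus by embedding distance can be sketched as below. The vectors are toy two-dimensional values standing in for real embeddings, and pooling over one title vector and one abstract vector is an assumption, since the slide does not say exactly what was pooled.

```python
import math

def mean_pool(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def max_pool(vectors):
    """Element-wise maximum of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query_vec, doc_vecs, distance=cosine_distance):
    """Sort the whole corpus by distance to the query, closest first, with no cut-off."""
    return sorted(doc_vecs, key=lambda doc: distance(query_vec, doc_vecs[doc]))

query = [1.0, 0.0]
docs = {
    "doc1": mean_pool([[1.0, 0.0], [0.8, 0.2]]),   # title vector, abstract vector
    "doc2": mean_pool([[0.0, 1.0], [0.2, 0.8]]),
}
by_cosine = rank_by_distance(query, docs)
by_euclidean = rank_by_distance(query, docs, distance=euclidean_distance)
```

Swapping `mean_pool` for `max_pool`, or the distance function, gives the other variants compared in the experiments.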
  • 25. Extra Embedding
      • Also looked at the question and narrative compared to the query, e.g.
          • Query: coronavirus origin
          • Question: what is the origin of COVID-19
          • Narrative: seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans
      • Question and narratives work better
          • Likely due to longer text with words (including synonyms and concepts) in context (more natural)
  • 26. Ranking by BERT

      NDCG                @1        @3        @5        @10       @20       @50       Full
      Question            0.31428   0.24006   0.19769   0.16458   0.13224   0.10945   0.36517
      Narrative           0.17142   0.15291   0.13685   0.11555   0.0982    0.08420   0.33321
      Query               0.04285   0.0386    0.03562   0.03288   0.03255   0.02948   0.29186
      Best eDisMax doc    0.44285   0.24325   0.19999   0.13860   0.10391   0.07905   0.31552
      eDisMax (stem)      0.44285   0.37126   0.35589   0.32939   0.29697   0.26580   0.54744

      • Generally:
          • Mean better than max pooling
          • Best match using Question over Narrative over Query
          • Best match on Title + Abstract
          • Cosine vs Euclidean slight variations across the board
      • Results based on:
          • Mean, Cosine using Title+Abstract
      • All lower than the baseline
  • 27. Ranking by BERT (cont’d)
  • 28. Ranking by Node2Vec
      • Node embedding for node2vec as the query
      • Results: Cosine, single top result from eDisMax with stemming as the query
      • Some query top results were not in the edge list and therefore yield zero NDCG
      • Core network, all edges connect two CORD documents
      • Extended network, all edges touch one CORD document

      NDCG                @1        @3        @5        @10       @20       @50       Full
      Core                0.44285   0.21205   0.15721   0.10224   0.06997   0.04739   0.33174
      Extended            0.44285   0.21628   0.16048   0.10527   0.06909   0.04360   0.32761
      eDisMax (stem)      0.44285   0.37126   0.35589   0.32939   0.29697   0.26580   0.54744
  • 29. Ranking by Node2Vec (cont’d)
  • 30. Ranking by SPECTER
      • SPECTER document embedding for the query
          • as precomputed in the CORD-19 dataset
          • Best document returned by eDisMax queries

      NDCG                @1        @3        @5        @10       @20       @50       Full
      Stemmed             0.44285   0.28204   0.25158   0.19540   0.15484   0.12281   0.41963
      eDisMax (stem)      0.44285   0.37126   0.35589   0.32939   0.29697   0.26580   0.54744
  • 31. Ranking by SPECTER (cont’d)
  • 32. Training Embedding
      • Short experiment to look at training query-document embeddings for improved ranking
      • We did not try this with question or narrative, limiting to query
      • Poor results assumed to be caused by:
          • Lack of variability in query preventing generalisation
          • Nearly all queries contain "coronavirus"
  • 33. Final Results
      • Taking the best results from each set of experiments
      • None of the reranking strategies, including the embedding based ones (content, graph, or hybrid), beat the stemmed eDisMax baseline.

      NDCG                @1        @3        @5        @10       @20       @50       Full
      Rel + 0.1*LCB       0.44285   0.37039   0.35858   0.3305    0.29205   0.26444   0.54096
      BERT reranking      0.44285   0.24325   0.19999   0.13860   0.10391   0.07905   0.31552
      Node2Vec core       0.44285   0.21205   0.15721   0.10224   0.06997   0.04739   0.33174
      SPECTER (stem)      0.44285   0.28204   0.25158   0.19540   0.15484   0.12281   0.41963
      eDisMax (stem)      0.44285   0.37126   0.35589   0.32939   0.29697   0.26580   0.54744
  • 36. Summary
      • CORD-19 corpus with incomplete judgement data (they are continuing to add to it based on results from systems)
      • eDisMax appears to do OK...
      • Recall suffers due to term mismatching
          • Beyond basic synonyms
      • Query intent is represented by a single limiting query clause
      • The question and narrative descriptors provide much more natural text for embeddings to work from
      • Graph metrics for importance may have limited application depending on user task
      • Incomplete judgement data makes NDCG questionable
          • ...insufficient information on sampling to apply infNDCG
      • Open question for embeddings: general sense of semantic equivalence vs concept identity (synonyms)
  • 37. Future Work
      • Experiments based on Scopus and our own judgement data
      • Application of graph metrics, including more than just basic citation graph
      • Investigation of fine-tuned embeddings combining text and graph
      • Apply ML based reranking
      • Investigate the balance between concept, semantic, freshness and importance