Evaluating the Impact of Word Embeddings on
Similarity Scoring for Practical Information
Retrieval
Lukas Galke Ahmed Saleh Ansgar Scherp
Leibniz Information Centre for Economics
Kiel University
INFORMATIK, 29 Sep 2017
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 1 of 13
Motivation and Research Question
Motivation
Word embeddings are regarded as the main cause of the NLP breakout of the past few years (Goth 2016)
can be employed in various natural language processing tasks such as
classification (Balikas and Amini 2016), clustering (Kusner et al. 2015),
word analogies (Mikolov et al. 2013), language modelling . . .
Information Retrieval is quite different from these tasks, so employing word embeddings is challenging
Research question
Which embedding-based techniques are suitable for similarity scoring in
practical information retrieval?
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 2 of 13
Related Approaches
Paragraph Vectors (Doc2Vec)
Explicitly learn document vectors in a similar fashion to word vectors (Le
and Mikolov 2014). One artificial token per paragraph (or document).
Word Mover’s Distance (WMD)
Solve a constrained optimization problem to compute document similarity (Kusner et al. 2015): minimize the cost of moving the words of one document onto the words of the other document.
Embedding-based Query Language Models
Embedding-based query expansion and embedding-based pseudo-relevance feedback (Zamani and Croft 2016). The query is expanded with nearby words in the embedding space.
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 3 of 13
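As a rough illustration of the WMD idea (not the setup used in the paper), the sketch below uses gensim's wmdistance on pretrained word vectors; the embedding file path and the two toy documents are placeholders, and the pyemd/POT package is assumed to be installed.

```python
# Sketch: Word Mover's Distance between two toy documents (gensim KeyedVectors).
from gensim.models import KeyedVectors

# Placeholder path to a local copy of the GoogleNews word2vec binary.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

# Lower distance = more similar documents (optimal transport of word vectors).
print(vectors.wmdistance(doc1, doc2))
```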
Information Retrieval
Information Retrieval
Given a query, retrieve the k most relevant (to the query) documents from
a corpus in rank order.
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 4 of 13
TF-IDF Retrieval Model
Term frequencies
TF(w, d) is the number of occurrences of word w in document d.
Inverse Document Frequency
Words that occur in many documents are discounted (assuming they carry less discriminative information): IDF(w, D) = log(|D| / |{d ∈ D : w ∈ d}|)
Retrieval Model
Transform each document d of the corpus into its TF-IDF representation.
Compute the TF-IDF representation of the query q.
Rank matching documents by descending cosine similarity
cossim(q, d) = (q · d) / (‖q‖ · ‖d‖)
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 5 of 13
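A minimal sketch of this TF-IDF retrieval model with scikit-learn (not the paper's vec4ir implementation); the corpus and query are toy placeholders.

```python
# Sketch: TF-IDF retrieval ranked by descending cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sits on the mat",
          "dogs chase cats",
          "economics of information retrieval"]
query = "cat on a mat"

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(corpus)      # corpus in TF-IDF representation
q = vectorizer.transform([query])         # query in the same representation

scores = cosine_similarity(q, D).ravel()  # cossim(q, d) for every document d
ranking = np.argsort(-scores)             # descending cosine similarity
print(ranking, scores[ranking])
```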
Word Embedding
One-hot encoding vs. word embedding, illustrated for the words "cat" and "sits":
one-hot: sparse indicator vectors over the vocabulary, e.g. cat = (0, 1, 0, 0, 0), sits = (0, 0, 0, 1, 0)
embedding: dense low-dimensional vectors, e.g. cat = (0.7, 0.2, 0.1), sits = (0.39, -3.1, 0.42)
Word Embedding
Low-dimensional (compared to the vocabulary size) distributed representation that captures semantic and syntactic relations between words. Key principle: similar words should have similar representations.
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 6 of 13
Word Vector Arithmetic
Sums of word vectors and their nearest neighbors in the word embedding¹.
Expression → Nearest tokens
Czech + Currency → koruna, Czech crown, Polish zloty, CTK
Vietnam + capital → Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines → airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
French + actress → Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg
¹ Extracted from Mikolov’s talk at NIPS 2013
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 7 of 13
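A hedged sketch of how such nearest-neighbor queries can be reproduced with gensim; the embedding path is a placeholder, and the exact neighbors depend on the vocabulary and casing of the loaded model.

```python
# Sketch: nearest tokens to the sum of two word vectors.
from gensim.models import KeyedVectors

# Placeholder path to a pretrained word2vec binary (e.g. GoogleNews).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

for expression in (["Czech", "currency"], ["Vietnam", "capital"], ["French", "actress"]):
    # most_similar with a 'positive' list adds the word vectors before the lookup.
    print(expression, vectors.most_similar(positive=expression, topn=4))
```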
Word2vec
Skip-Gram Negative Sampling (Mikolov et al. 2013)
Given a stream of words (tokens) s over vocabulary V and a context size k,
learn word embedding W.
Let sT be the target word with context
C = {sT−k, . . . , sT−1, sT+1, . . . , sT+k} (skip-gram)
Look up the word vector W[sT] for the target word sT
Predict via logistic regression from the word vectors W[sT] and W[x], with:
positive examples: x from the context words C
negative examples: x sampled from V \ C (negative sampling)
Update word vector W[sT ] (via back-propagation)
Repeat with next word T = T + 1
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 8 of 13
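To make the update rule concrete, here is a minimal numpy sketch of one skip-gram negative-sampling step; the vocabulary size, dimensionality, learning rate, and uniform negative sampling are toy assumptions (word2vec actually samples negatives from a smoothed unigram distribution).

```python
# Sketch: one skip-gram negative-sampling (SGNS) update in plain numpy.
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr = 1000, 50, 0.025                  # toy vocabulary size, embedding dim, learning rate
W_in = rng.normal(scale=0.1, size=(V, dim))   # target-word embeddings W
W_out = rng.normal(scale=0.1, size=(V, dim))  # context ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context_words, num_negative=5):
    """Logistic-regression updates for one target word and its context window."""
    for ctx in context_words:
        negatives = rng.integers(0, V, size=num_negative)  # crude uniform negative sampling
        grad_in = np.zeros(dim)
        for x, label in [(ctx, 1.0)] + [(n, 0.0) for n in negatives]:
            score = sigmoid(W_in[target] @ W_out[x])
            g = score - label                 # gradient of the logistic loss w.r.t. the score
            grad_in += g * W_out[x]
            W_out[x] -= lr * g * W_in[target]
        W_in[target] -= lr * grad_in          # back-propagated update of the target word vector

sgns_step(target=42, context_words=[7, 99, 13, 512])
```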
Document Representation
How to employ word embeddings for information retrieval?
Figure: bag-of-words vs. distributed document representations
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 9 of 13
Embedded Retrieval Models
Word Centroid Similarity (WCS)
Aggregate word vectors to their centroid for both the documents and the
query and compute cosine similarity between the centroids.
Word vector centroids: C = TF · W
Given query q in one-hot representation, compute the Word Centroid Similarity
WCS(q, i) = ((qᵀ · W) · Ci) / (‖qᵀ · W‖ · ‖Ci‖)
IDF re-weighted Word Centroid Similarity (IWCS)
IDF re-weighted aggregation of word vectors C = TF · IDF · W
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 10 of 13
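The following numpy/scikit-learn sketch mirrors these formulas on a toy corpus; the random embedding matrix W is a placeholder for pretrained word vectors, and whether the query side is also IDF-weighted is an implementation choice (here both sides use the same weighting).

```python
# Sketch: Word Centroid Similarity (WCS) and IDF re-weighted WCS (IWCS).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sits on the mat",
          "dogs chase cats",
          "economics of information retrieval"]
query = "cat on a mat"

count = CountVectorizer()
TF = count.fit_transform(corpus)  # term-frequency matrix (documents x vocabulary)
W = np.random.default_rng(0).normal(size=(len(count.vocabulary_), 300))  # placeholder embedding

# WCS: centroids C = TF @ W; cosine similarity is scale-invariant, so summing
# instead of averaging the word vectors does not change the ranking.
C = TF @ W
q = count.transform([query])
wcs = cosine_similarity(q @ W, C).ravel()

# IWCS: re-weight term frequencies by IDF before aggregating, C = TF * IDF @ W.
tfidf = TfidfTransformer()
C_idf = tfidf.fit_transform(TF) @ W
iwcs = cosine_similarity(tfidf.transform(q) @ W, C_idf).ravel()

print("WCS:", wcs, "IWCS:", iwcs)
```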
Experiments
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 11 of 13
Results
Ground truth: relevance judgments provided by human annotators.
Evaluation: mean average precision of the top 20 documents (MAP@20)
Datasets: Titles of NTCIR2, Economics, Reuters
Model NTCIR2 Economics Reuters
TF-IDF (baseline) .40 .37 .52
WCS .30 .36 .54
IWCS .41 .37 .60
IWCS-WMD .40 .32 .54
Doc2Vec .24 .30 .48
. . .
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 12 of 13
Conclusion
Word embeddings can be successfully employed in practical IR
IWCS is competitive with the TF-IDF baseline
IDF weighting improves the performance of WCS by 11%
IWCS outperforms the TF-IDF baseline by 15% on Reuters (news
domain)
Code to reproduce the experiments is available at
github.com/lgalke/vec4ir.
MOVING is funded by the EU Horizon 2020 Programme under the project number
INSO-4-2015: 693092
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 13 of 13
Discussion
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 14 of 13
Data Sets and Embeddings
Data set properties
Data set Documents Topics relevant per topic
NTCIR2 135k 49 43.6 (48.8)
Econ62k 62k 4518 72.98 (328.65)
Reuters 100k 102 3,143 (6,316)
Word Embedding properties
Embedding Tokens Vocab Case Dim Training
GoogleNews 3B 3M cased 300 Word2Vec
CommonCrawl 840B 2.2M cased 300 GloVe
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 15 of 13
Preprocessing in Detail
Matching and TF-IDF
Token regexp: \w\w+
English stop words removed
Word2Vec
Token regexp: \w\w*
English stop words removed
GloVe
Punctuation separated by white-space
Token regexp: \S+ (everything but white-space)
No stop word removal
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 16 of 13
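For illustration, the three token patterns translate to Python's re module as follows (stop-word removal and the GloVe-specific punctuation splitting are omitted here):

```python
# Sketch: the three tokenization schemes as regular expressions.
import re

text = "Word embeddings, e.g. word2vec, capture semantics."

matching_tokens = re.findall(r"\w\w+", text.lower())  # matching / TF-IDF: word characters, length >= 2
w2v_tokens      = re.findall(r"\w\w*", text.lower())  # Word2Vec: word characters, length >= 1
glove_tokens    = re.findall(r"\S+", text)            # GloVe: everything but white-space

print(matching_tokens)
print(w2v_tokens)
print(glove_tokens)
```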
Out of Vocabulary Statistics
Embedding Data set Field OOV ratio
GoogleNews NTCIR2 Title 7.4%
Abstract 7.3%
Econ62k Title 2.9%
Full-Text 14.1%
CommonCrawl NTCIR2 Title 5.1%
Abstract 3.5%
Econ62k Title 1.2%
Full-Text 5.2%
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 17 of 13
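A small sketch of how such an out-of-vocabulary ratio can be computed; whether the ratio is over token types or token occurrences is not stated on the slide, so this sketch counts types, and the toy inputs are placeholders.

```python
# Sketch: out-of-vocabulary ratio of a document field w.r.t. an embedding vocabulary.
def oov_ratio(documents, embedding_vocab, tokenize=str.split):
    """Fraction of token types in the documents that are missing from the embedding vocabulary."""
    types = {token for doc in documents for token in tokenize(doc)}
    return len(types - set(embedding_vocab)) / len(types) if types else 0.0

# Toy usage; in the experiments the vocabulary comes from the GoogleNews / CommonCrawl embeddings.
print(oov_ratio(["the cat sits", "quokka economics"],
                {"the", "cat", "sits", "economics"}))
```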
Metrics in Detail
Let r be the list of relevance scores in the rank order as retrieved, with a nonzero score indicating a true positive and zero a false positive. For each metric, the scores are averaged over the queries.
Reciprocal Rank
RR(r, k) = 1 / min{i : ri > 0} if ∃i : ri > 0, else 0
Average Precision
Precision(r, k) = |{ri ∈ r : ri > 0}| / k
AP(r, k) = (1/|r|) Σ_{i=1..k} Precision((r1, . . . , ri), i)
Normalised Discounted Cumulative Gain
DCG(r, k) = r1 + Σ_{i=2..k} ri / log2(i)
nDCG(r, k) = DCG(r, k) / IDCGq,k
where IDCG is the best possible DCG value for a query (w.r.t. the gold standard)
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 18 of 13
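A small Python sketch of these metrics as reconstructed above; the average-precision variant follows the slide's formula, which differs slightly from the textbook definition, and r is a list of relevance scores in retrieved rank order.

```python
# Sketch: reciprocal rank, average precision, and nDCG for one ranked result list.
import math

def reciprocal_rank(r):
    return next((1.0 / (i + 1) for i, rel in enumerate(r) if rel > 0), 0.0)

def precision_at(r, k):
    return sum(1 for rel in r[:k] if rel > 0) / k

def average_precision(r, k):
    # As on the slide: average Precision@i over i = 1..k, normalised by |r|.
    return sum(precision_at(r, i) for i in range(1, k + 1)) / len(r)

def dcg(r, k):
    return r[0] + sum(r[i] / math.log2(i + 1) for i in range(1, min(k, len(r))))

def ndcg(r, k, ideal):
    # ideal: relevance scores of the best possible ranking for the same query (gold standard)
    return dcg(r, k) / dcg(ideal, k)

scores = [1, 0, 2, 0]  # toy relevance scores in retrieved order
print(reciprocal_rank(scores),
      average_precision(scores, 4),
      ndcg(scores, 4, sorted(scores, reverse=True)))
```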
References
Balikas, Georgios, and Massih-Reza Amini. 2016. “An Empirical Study on
Large Scale Text Classification with Skip-Gram Embeddings.” CoRR
abs/1606.06623.
Goth, Gregory. 2016. “Deep or Shallow, NLP Is Breaking Out.” Commun.
ACM 59 (3): 13–16.
Kusner, Matt J., Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger.
2015. “From Word Embeddings to Document Distances.” In ICML,
37:957–66. JMLR Workshop and Conference Proceedings. JMLR.org.
Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of
Sentences and Documents.” In ICML, 32:1188–96. JMLR Workshop and
Conference Proceedings. JMLR.org.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey
Dean. 2013. “Distributed Representations of Words and Phrases and Their
Compositionality.” In NIPS, 3111–9.
Zamani, Hamed, and W. Bruce Croft. 2016. “Embedding-Based Query
Language Models.” In ICTIR, 147–56. ACM.
Lukas Galke, Ahmed Saleh, Ansgar Scherp Word Embeddings for IR 19 of 13