5. Wikimedia
• “The Wikimedia Foundation, Inc. is a nonprofit
charitable organization dedicated to encouraging
the growth, development and distribution of free,
multilingual, educational content, and to providing
the full content of these wiki-based projects to the
public free of charge”
• https://wikimediafoundation.org/wiki/Home
6. What is Wikipedia?
• Wikipedia is an encyclopedia
• no original research
• neutral point of view
• statements must be verifiable
• must reference reliable published sources
• Wikipedia relies on crowd-sourcing
• anyone can edit
• Wikipedia is big
printwikipedia.com
8. Number of Articles & Growth Rate
• ~4.9 million articles on English Wikipedia
• Since 2006, around 30,000 new articles per month
• 287 languages
9. Quality of a Wikipedia Article
1. Are the length and structure of the article an indication of the importance of the subject?
2. Check the edit-history activity: when was the last edit?
3. Check the talk page for debates: what are the issues in this article?
4. Check the references [or lack thereof]
– Are all claims referenced (especially if controversial)?
– What is the quality of the sources?
– How relevant are the sources?
https://upload.wikimedia.org/wikipedia/commons/9/96/Evaluating_Wikipedia_brochure_%28Wiki_Education_Foundation%29.pdf
14. Lens for the Real World
Wikipedia: representative of the real world and of people's understanding
• Ideas
• Thoughts
• Perceptions
15. Unique Visitors and Page Views
• http://reportcard.wmflabs.org/
• 430 million unique visitors in May 2015
• Mobile users are not included!
16. Editing Wikipedia
[Hill BM, Shaw A, 2013] The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation. PLoS ONE 8(6): e65782.
17. Literature Review of Scholarly Articles
• http://wikilit.referata.com/wiki/Main_Page
19. Systems that Use Information from Wikipedia
The task (goal of the system):
• Query operations
• Recommendation
• Sentiment analysis
• Ontology building
• …
The challenge for which Wikipedia is a remedy:
• Sparseness
• Ambiguity
• Cost of manual labor
• Lack of information
• Understanding/perception
• …
Utilized information:
• Concepts/pages, links, categories, redirect pages, views, edits, …
Algorithms & techniques:
• How items are matched with Wikipedia pages
• How data is extracted
• How Wikipedia data is utilized
• How the similarity between concepts is computed
• …
20. IR & Wikipedia
• Wikipedia as a collection is:
– Enormous
– Timely
– Reflective of crowd wisdom: connections between entities in Wikipedia represent the way a large number of people view them (computers cannot understand “concepts” or relate things the way humans do)
– Accessible (free!)
– Accurate
– Broad in coverage
• Weaknesses:
– Incomplete
– Imbalanced
– Citations often incomplete
21. IR & Wikipedia
• Wikipedia is used for
– enhancing the performance of IR systems (mainly relevance)
• Challenge:
– Distilling knowledge from such a large amount of un-/semi-structured information is extremely difficult
– The contents of today’s Web are perfectly suitable for human consumption, but remain hardly accessible to machines
22. How to Use Wikipedia for Your Own Research
24. Partial Diagram of Wikipedia’s Structured Meta-Data
[Created by Gilad Katz]
25. Wikipedia Download
• Client apps: XOWA, WikiTaxi, WikiReader, …
• 16 offline tools for downloading – ~53 GB of disk space
• Page-view downloads (per-hour files) – ~50 GB
• EnwikiContentSource
• Wikipedia Miner (Milne and Witten) [An open source toolkit for mining Wikipedia, 2012]
• XiaoxiaoLi/getWikipediaMetaData
26. Wikipedia Download
Google for: Wikipedia dump files download
https://dumps.wikimedia.org/enwiki/
Torrent: https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki
Your and others’ code to get plain text!
27. DBpedia
• ~4.9 million articles, 4.2 million of which are classified into a consistent ontology
• Persons, places, organizations, diseases
• An effort to transform the knowledge of Wikipedia into "tabular-like" format
• Sophisticated database-query language
– Open Data
– Linked Open Data
UGC
CGC
32. How to Describe the Unknown?
Meno: “And how will you enquire, Socrates, into
that which you do not know? What will you put
forth as the subject of enquiry?
And if you find what you want, how will you ever
know that this is the thing which you did not
know? ”
Written 380 B.C.E. by Plato
33. How to Describe the Unknown?
Automatic QE – a process where the user’s original query is augmented by new features with similar meaning.
1. What you know and what you wish to know
2. Initial vague query vs. concrete topics and terminology
The average length of an initial query at prominent search engines was 2.4 in 2001, 2.08 in 2009, and 3.1 nowadays (and growing…)
[Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic (2001)] "Searching the web: The public and their queries".
[Taghavi, Wills, Patel (2011)] An analysis of web proxy logs with query distribution pattern approach for search engines.
34. Wikipedia-Based Query Expansion
• Wikipedia is rich, highly interconnected, and domain-independent
“The fourth generation iPad (originally marketed as iPad with Retina display, retrospectively marketed as the iPad 4) is a tablet computer produced and marketed by Apple Inc.” – Wikipedia
35. Thesaurus-Based QE
Knowledge-based search engine powered by Wikipedia.
[Milne, Witten and Nichols] (2007)
Initial query → Wikipedia-based thesaurus → query augmented by new features with similar meaning
37. Query Suggestion
query = "obama white house"
Task: Given a query, produce a ranked list of concepts from Wikipedia which are mentioned or meant in the query
Quiz time!
38. Learning Semantic Query Suggestions
query = "obama white house"
Correct concepts: Barack Obama, White House
Step 1: Candidate generation
Use a ranking approach (e.g., language modeling) to score the concepts (articles) in Wikipedia, where each n-gram of the query is considered as a query in its turn (see the sketch below)
[Meij, E., M. Bron, L. Hollink, B. Huurnink and M. de Rijke] (2009)
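A minimal sketch of this candidate-generation step, assuming a toy in-memory article set; a real system would score against an indexed Wikipedia dump, and all identifiers and data here are illustrative rather than taken from Meij et al.:

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia article texts (illustrative only).
articles = {
    "Barack Obama": "barack obama is a politician who served as president "
                    "of the united states",
    "White House": "the white house is the official residence and workplace "
                   "of the president of the united states",
}

def ngrams(tokens, max_n=3):
    """Every contiguous n-gram of the query, each treated as a query in its turn."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def lm_score(gram, text, collection, mu=100.0):
    """Dirichlet-smoothed query likelihood of an article for one n-gram."""
    doc = Counter(text.split())
    dlen = sum(doc.values())
    clen = sum(collection.values())
    score = 0.0
    for t in gram.split():
        p = (doc[t] + mu * collection[t] / clen) / (dlen + mu)
        score += math.log(p) if p > 0 else -1e9  # term unseen anywhere
    return score

collection = Counter(" ".join(articles.values()).split())
scored = [(lm_score(g, text, collection), g, title)
          for g in ngrams("obama white house".split())
          for title, text in articles.items()]
for score, gram, title in sorted(scored, reverse=True)[:5]:
    print(f"{score:9.3f}  {gram!r:22} -> {title}")
```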
39. Step 2: Candidate selection
• Supervised machine learning approach
• Input: a set of labeled examples (query-to-concept mappings)
• Types of features:
• N-gram
• Concept
• N-gram–concept combination
• Current search history
Example concept features for a candidate concept c (see the sketch below):
• # of concepts linking to c
• # of concepts linked from c
• # of associated categories
• # of redirect pages to c
• Importance of c in the query (TF-IDF of q in c)
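A small sketch of extracting the concept-side features listed above. The link/category structures are hypothetical stand-ins for parsed Wikipedia meta-data; none of the identifiers come from the paper, and TF-IDF is reduced to plain TF for brevity:

```python
# Hypothetical link/category structures (illustrative only).
inlinks = {"Barack Obama": {"White House", "Michelle Obama"}}
outlinks = {"Barack Obama": {"White House", "United States"}}
categories = {"Barack Obama": {"Presidents of the United States", "1961 births"}}
redirects = {"Barack Obama": {"Obama", "Barak Obama"}}

def concept_features(c, query_terms, article_text):
    """Concept-side features for a candidate concept c, per the list above."""
    tokens = article_text.lower().split()
    tf = sum(tokens.count(t) for t in query_terms) / max(len(tokens), 1)
    return {
        "n_inlinks": len(inlinks.get(c, ())),        # concepts linking to c
        "n_outlinks": len(outlinks.get(c, ())),      # concepts linked from c
        "n_categories": len(categories.get(c, ())),  # associated categories
        "n_redirects": len(redirects.get(c, ())),    # redirect pages to c
        "query_tf": tf,  # importance of c in the query (TF here, TF-IDF in the paper)
    }

print(concept_features("Barack Obama", ["obama", "white", "house"],
                       "Barack Obama lived in the White House"))
```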
42. CLIR Task
Query in a source language → query in a target language
Collection translation is not scalable!
The solution:
43. WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia
[D. Nguyen, A. Overwijk, C. Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong] 2008
Generating Translation Candidates
• Stage 1: Mapping the query to Wikipedia concepts
• Full query search (“obama white house”)
• Partial query search (n-grams)
• Sole n-grams in links of the top retrieved documents (“house”)
• Whole query with weighted n-grams (“obama white house”)
• Possible to combine search in different fields within a document:
• (title: obama) (content: white house) – to avoid missing terms
44. Creating the Final Translated Query
Stage 2: Creating the final expanded query, with terms weighted according to Stage 1 and an analysis of the Wikipedia pages:
• Concept translation from cross-lingual links
• Try synonyms and spelling variants
• Translated-term weighting: concepts obtained from the whole-query search are more important (see the sketch below)
Obama White House →
Weiße Haus ^ 1.0
Barack Obama ^ 0.5
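A small sketch of assembling such a weighted query string. The ^-weight syntax follows Lucene-style query boosting, which is an assumption about the target retrieval engine; the weights mirror the slide's example:

```python
def weighted_query(translations):
    """Build a weighted query string from (term, weight) pairs."""
    return " ".join(f'"{term}"^{weight}' for term, weight in translations)

# Weights as on the slide: whole-query concepts outrank partial-match ones.
print(weighted_query([("Weiße Haus", 1.0), ("Barack Obama", 0.5)]))
# "Weiße Haus"^1.0 "Barack Obama"^0.5
```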
46. Entity Search
Retrieving a set of entities in response to a user’s query
query = “United States presidents”
47. Passages → Entities
Entity ranking via graph centrality measures: degree, combined with inverse entity frequency (see the sketch below)
Ranking very many typed entities on Wikipedia
[Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, and Giuseppe Attardi] 2007
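A minimal sketch of this idea on toy passage data: entities are ranked by their degree in a co-occurrence graph, damped by an IDF-like inverse entity frequency. This is in the spirit of Zaragoza et al., not their exact formulation:

```python
import math
from collections import defaultdict

# passages[i] = entities co-occurring in passage i (toy data).
passages = [
    {"George Washington", "John Adams"},
    {"John Adams", "Thomas Jefferson"},
    {"George Washington", "Thomas Jefferson", "Mount Vernon"},
]

neighbors = defaultdict(set)   # entity co-occurrence graph
freq = defaultdict(int)        # in how many passages each entity appears
for p in passages:
    for e in p:
        freq[e] += 1
        neighbors[e] |= p - {e}

N = len(passages)
def score(e):
    # Degree centrality damped by an IDF-like inverse entity frequency.
    return len(neighbors[e]) * math.log(1 + N / freq[e])

for e in sorted(neighbors, key=score, reverse=True):
    print(f"{score(e):6.3f}  {e}")
```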
48. Category-Based Semantic Distance Between Entities
• Utilizing the Wikipedia category structure
• Query categories (QC)
• Entity categories (EC)
• Distance(QC, EC) =
• if QC and EC have a common category, then distance = 0
• else: distance = minimal path length
A ranking framework for entity oriented search using Markov random fields.
[Raviv, H., D. Carmel and O. Kurland] (2012)
49. Category-Based Distance Example (see the sketch below)
Query categories: novels, books
If the entity category is “novels”, then distance = 0
If the entity category is “books by Paul Auster”, then distance = 2
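A minimal sketch of this distance, reproducing the example above with a toy fragment of the category graph. Note that this sketch only walks upward through parent categories, while the paper's minimal path length is over the category graph in general; all data here is illustrative:

```python
from collections import deque

# Toy fragment of the Wikipedia category graph (child -> parents).
parents = {
    "books by Paul Auster": {"books by author"},
    "books by author": {"books"},
    "novels": {"books"},
}

def category_distance(query_cats, entity_cats):
    """Distance 0 on a shared category, else shortest path length."""
    if query_cats & entity_cats:
        return 0
    # Breadth-first search upward from the entity's categories.
    frontier = deque((c, 0) for c in entity_cats)
    seen = set(entity_cats)
    while frontier:
        cat, d = frontier.popleft()
        if cat in query_cats:
            return d
        for p in parents.get(cat, ()):
            if p not in seen:
                seen.add(p)
                frontier.append((p, d + 1))
    return float("inf")  # no path found

qc = {"novels", "books"}
print(category_distance(qc, {"novels"}))                # 0 (common category)
print(category_distance(qc, {"books by Paul Auster"}))  # 2
```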
50. The INEX Entity Ranking Task
• The goal of the track is to evaluate how well systems can rank entities in response to a query
• Entities are assumed to correspond to Wikipedia entries
• Used category, association, and link structure
• Took place in 2007, 2008, 2009
Overview of the INEX 2007 Entity Ranking Track
[Arjen P. de Vries, Anne-Marie Vercoustre, James A. Thom, Nick Craswell, Mounia Lalmas]
53. Benefits of Estimating Query Difficulty
Reproduced from a tutorial: David Carmel and Oren Kurland. Query performance prediction for IR. SIGIR 2012
• Feedback to users: the user can rephrase a difficult query
• Feedback to the search engine: choose an alternative retrieval strategy
• Feedback to the system administrator: identify missing content
• For IR applications: e.g., federated search over different datasets
54. Query Performance Prediction
Query1 = “obama white house”   Query2 = “weather in Israel”
Prediction mechanism → prediction value: is Q1 > Q2?
A naive predictor: QueryLength(Q1) = 3, i.e., Query.split().length() (see the sketch below)
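A tiny sketch of pre-retrieval predictors of this kind: the naive query-length predictor from the slide, plus average IDF, a standard corpus-based signal. The document frequencies and corpus size are made up for illustration:

```python
import math

# Made-up document frequencies and corpus size (illustrative only).
df = {"obama": 120, "white": 5000, "house": 4200, "weather": 900, "israel": 300}
N = 100_000

def query_length(q):
    """The naive predictor from the slide."""
    return len(q.split())

def avg_idf(q):
    """Average inverse document frequency; unseen terms default to df = 1."""
    terms = q.lower().split()
    return sum(math.log(N / df.get(t, 1)) for t in terms) / len(terms)

for q in ("obama white house", "weather in Israel"):
    print(f"{q!r}: length = {query_length(q)}, avg IDF = {avg_idf(q):.2f}")
```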
56. Query Performance Prediction in Ad-Hoc Retrieval
Estimating query effectiveness when relevance judgments are not available
57. Query Performance Prediction
• Pre-retrieval prediction
• Operates prior to retrieval time
• Analyzes the query and the corpus
• Computationally efficient
• Post-retrieval prediction
• Analyzes the result list of the most highly ranked documents
• Superior performance
58. New Prediction Type? Absolute Query Difficulty
Regardless of corpus, but with external knowledge:
1. Corpus independent!
2. Information induced from Wikipedia
3. Advantage for non-cooperative search, where corpus-based information is not available
Wikipedia-based query performance prediction.
[Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] 2014
60. Notational Conventions
A page p is associated with a set of terms if its title contains at least one of the terms in the set (soft mapping)
q = “Barack Obama White House” → associated pages
Maximal exact match length (MEML) = 2
M_MEML – the set of pages for which the MEML holds
61. Titles (measuring queries)
Size of the subqueries that have an exact match with a page title (see the sketch below):
1. Maximum
2. Average
Quiz time!
q = “Barack Obama White House”
Maximum – 2
Average – 2
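A sketch of these title-match statistics, assuming an in-memory set of lowercased Wikipedia titles (the title set here is a toy stand-in). It reproduces the quiz answer: the maximal exact matches are “Barack Obama” and “White House”, so both the maximum and the average are 2:

```python
# Toy title set; a real predictor uses the full Wikipedia title list.
titles = {"barack obama", "white house", "obama", "house"}

def matching_spans(query):
    """All subqueries (contiguous n-grams) that exactly match a title."""
    terms = query.lower().split()
    return [(i, i + n) for n in range(1, len(terms) + 1)
            for i in range(len(terms) - n + 1)
            if " ".join(terms[i:i + n]) in titles]

def maximal_spans(spans):
    """Keep only matches not strictly contained in a longer matching span."""
    return [(a, b) for (a, b) in spans
            if not any(c <= a and b <= d and (c, d) != (a, b)
                       for (c, d) in spans)]

spans = maximal_spans(matching_spans("Barack Obama White House"))
lengths = [b - a for a, b in spans]
print("MEML (maximum):", max(lengths))          # 2
print("Average:", sum(lengths) / len(lengths))  # 2.0
```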
62. Titles & Content (measuring pages)
• Number of pages associated with a subquery (fixed length = 1, 2, 3)
• The average length (# of terms) of the pages that are an exact match or a maximal exact match
• Sum, average, standard deviation
→ the “scope” of q in Wikipedia
63. Examples

Query                          | Max subquery match | #pages with one query term in title | #pages with two query terms in title
Horse Leg                      | 1 | 7885  | 3
Poliomyelitis and Post-Polio   | 2 | 6605  | 8
Hubble Telescope Achievements  | 2 | 460   | 4
Endangered Species (Mammals)   | 1 | 3481  | 96
Most Dangerous Vehicles        | 1 | 3978  | 6
African Civilian Deaths        | 2 | 23381 | 14
New Hydroelectric Projects     | 1 | 93858 | 33
Implant Dentistry              | 1 | 268   | 0
Rap and Crime                  | 2 | 3826  | 1
Radio Waves and Brain Cancer   | 2 | 15460 | 33
64. Links and Categories
• Overall # of links that contain at least one of the query’s terms in their anchor text
• # of categories that appear in at least one of the pages associated with a subquery
• # of in/outgoing links for the Wikipedia pages in M_MEML
• Average, standard deviation
65. Links for Query Coherency Prediction
# of links from pages associated with a single-term subquery that point to pages associated with another subquery (see the sketch below)
q = “Barack Obama White House”
• Maximum, average, standard deviation
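A minimal sketch of this coherency count. The term-to-page association map and per-page outlinks are toy stand-ins for parsed Wikipedia data:

```python
# term -> pages whose title contains it (toy data)
associated = {
    "obama": {"Barack Obama", "Michelle Obama"},
    "house": {"White House", "House"},
}
# page -> pages it links to (toy data)
outlinks = {
    "Barack Obama": {"White House", "United States"},
    "Michelle Obama": {"Barack Obama"},
    "White House": {"Barack Obama"},
    "House": set(),
}

def coherency(term_a, term_b):
    """# of links from pages associated with term_a to pages associated with term_b."""
    targets = associated[term_b]
    return sum(len(outlinks.get(p, set()) & targets) for p in associated[term_a])

print(coherency("obama", "house"))  # 1: Barack Obama -> White House
print(coherency("house", "obama"))  # 1: White House -> Barack Obama
```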
66. Examples

Query                          | Avg. # pages containing at least one query term in link | Std. dev. of # categories that contain one query term in the title | Link-based coherency (average)
Horse Leg                      | 1821.5   | 34.7211  | 1731.5
Poliomyelitis and Post-Polio   | 302      | 11.6409  | 1120.333
Hubble Telescope Achievements  | 152.3333 | 11.19255 | 55.66667
Endangered Species (Mammals)   | 5753.333 | 64.60477 | 686
Most Dangerous Vehicles        | 376.6667 | 13.4316  | 396.3333
African Civilian Deaths        | 1673.333 | 18.53009 | 4172.333
New Hydroelectric Projects     | 1046     | 50.96217 | 121477.3
Implant Dentistry              | 209.5    | 13.45904 | 10.5
Rap and Crime                  | 1946     | 35.30528 | 661.5
Radio Waves and Brain Cancer   | 1395.5   | 24.81557 | 2202.5
67. Query Performance Prediction – Summary
• Absolute query difficulty, corpus independent
• # of pages containing a subquery in the title – the most effective predictor
• Coherency is the second most effective
• Integration with state-of-the-art predictors leads to superior performance (Wikipedia-based clarity score)
68. “The prophecy was given to the infants and the fools”
Query-Performance Prediction: Setting the Expectations Straight
[Fiana Raiber and Oren Kurland] (2014)
Quiz time!
Storming of the Bastille!
70. Introduction
• Sentiment analysis or opinion mining
“The location is indeed perfect, across the road from Hyde Park and Kensington Palace. The building itself is charming, … The room, which was an upgrade, was ridiculously small. There was a persistent unpleasant smell. The bathroom sink didn’t drain properly. The TV didn’t work.”
→ The location is indeed perfect
71. Leveraging Language Understanding
• Since they leave your door wide open when they come to clean your room, the mosquitoes get in your room and they buzz around at night
• There were a few mosquitoes, but nothing that a little bug repellant could not solve
• It seems like people on the first floor had more issues with mosquitoes
• Also, there was absolutely no issue with mosquitoes on the upper floors
75. Concepts
– We will go shopping next week
• Relying on ontologies or semantic networks
• Using concepts steps away from blindly using keywords and word occurrences
[Cambria 2013: An introduction to concept level sentiment analysis]
[Cambria et al., 2014]
80. Wikipedia as a Sense-Tagged Corpus
• Use hyperlinked concepts as sense annotations
• Extract senses and paragraphs of mentions of the ambiguous word in Wikipedia articles
• Learn a classification model
Using Wikipedia for Automatic Word Sense Disambiguation
[Mihalcea, 2007]
81. Annotated Text
• In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.
• It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].
• Vehicles of this type may contain expensive audio players, televisions, video players, and [[bar (counter)|bar]]s, often with refrigerators.
• Jenga is a popular beer in the [[bar (establishment)|bar]]s of Thailand.
• This is a disturbance on the water surface of a river or estuary, often caused by the presence of a [[bar (landform)|bar]] or dune on the riverbed.
82. Example of Annotated Text
• In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.
• Feature examples (see the sketch below):
– Current word and its POS
– Surrounding words and their POS
– The verb and noun in the vicinity
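A sketch of turning one such annotation into a labeled training example with the surrounding-word features listed above (POS tags are omitted for brevity; the regex and all identifiers are illustrative, not from Mihalcea's implementation):

```python
import re

# One annotated sentence; the sense label is the wiki-link target.
annotated = ("In 1834, Sumner was admitted to the [[bar (law)|bar]] "
             "at the age of twenty-three.")

def example_from_annotation(text, window=2):
    """Turn a [[sense|surface]] annotation into (features, sense label)."""
    m = re.search(r"\[\[([^|\]]+)\|([^\]]+)\]\]", text)
    sense, surface = m.group(1), m.group(2)
    left = text[:m.start()].split()[-window:]   # surrounding words (left)
    right = text[m.end():].split()[:window]     # surrounding words (right)
    features = {"word": surface}
    features.update({f"left_{i}": w for i, w in enumerate(left)})
    features.update({f"right_{i}": w for i, w in enumerate(right)})
    return features, sense

print(example_from_annotation(annotated))
# ({'word': 'bar', 'left_0': 'to', 'left_1': 'the', ...}, 'bar (law)')
```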
84. Challenge – Granularity Level
One can look at a review at
– Document level, i.e., is this review + or −?
– Sentence level, i.e., is each sentence + or −?
– Entity and feature/aspect level
85. Sentiment Lexicon
• Sentiment words are often the dominating factor for sentiment analysis (Liu, 2012)
– good, excellent, bad, terrible
• A sentiment lexicon holds a score for each word representing the degree of its sentiment
+ : good, excellent, friendly, beautiful
− : bad, terrible, ugly, difficult
86. Sentiment Lexicon
• General lexicon
– There is no general sentiment lexicon that is optimal in every domain
– The sentiment of some terms depends on the context
– Poor coverage
“The device was small and handy.”
“The waiter brought the food on time, but the portion was very small.”
88. Remedy – Identify Opinion Targets
Old wine or warm beer: Target-specific sentiment analysis of adjectives
[Fahrni & Klenner, 2008]
89. • If “cold coca cola” is positive, then “cold coca cola cherry” is positive as well
[Diagram: adjectives (cold, cool, warm, old, expensive) paired with targets (Coca Cola, Sushi, Pizza); the polarities learned for “Coca Cola” carry over to “Coca Cola cherry”]
90. Example (2): Product Attribute Detection
• Discover the attributes that people express opinions about
• Identify all words that are included in Wikipedia titles
Domain Independent Model for Product Attribute Extraction from User Reviews using Wikipedia.
[Kovelamudi, Ramalingam, Sood & Varma, 2011]
“Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz”
91. “Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz now fer like a third the price I paid.. heheh.. oh well....the fact that I didn’t wait a year er so to buy a bigger model for half the price.. most likely from a different store.. ..not namin any namez th0..”
→ WSD (word sense disambiguation of the noisy review text)
93. Lexicon Construction (unsupervised)
• Label propagation (see the sketch below)
• Identify pairs of adjectives based on a set of constraints
• Infer the polarity of new adjectives from known ones
“the room was good and wide” → good (known +) propagates to wide (polarity = 0.97)
Unsupervised Common-Sense Knowledge Enrichment for Domain-Specific Sentiment Analysis
[Ofek, Rokach, Poria, Cambria, Hussain, Shabtai, 2015]
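A minimal sketch of conjunction-based propagation, assuming adjectives joined by “and” share polarity with a known seed. The 0.97 confidence on the slide comes from the paper's data and is not computed here; the seed set and sentences are illustrative:

```python
import re

seeds = {"good": 1.0, "bad": -1.0}   # known-polarity adjectives
sentences = ["the room was good and wide",
             "the hallway was bad and narrow"]

lexicon = dict(seeds)
for s in sentences:
    for a, b in re.findall(r"(\w+) and (\w+)", s):
        if a in lexicon and b not in lexicon:
            lexicon[b] = lexicon[a]   # propagate polarity across "and"
        elif b in lexicon and a not in lexicon:
            lexicon[a] = lexicon[b]

print(lexicon)  # {'good': 1.0, 'bad': -1.0, 'wide': 1.0, 'narrow': -1.0}
```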
95. WikiSent: Sentiment Analysis of Movie Reviews
WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia
[Mukherjee & Bhattacharyya, 2012]
96. Feature Types
• Crew
• Cast
• Specific domain nouns from the text content and plot
– wizard, witch, magic, wand, death-eater, power
• Movie-domain-specific terms
– Movie, Staffing, Casting, Writing, Theory, Rewriting, Screenplay, Format, Treatments, Scriptments, Synopsis
97. Retrieve Relevant Opinionated Text
• Rank sentences according to the participating entities:
1. Title, crew
2. Movie-domain-specific terms
3. Plot
• Subjectivity is scored as a weighted combination (+α·Σ, +β·Σ, −γ·Σ) over these entity classes
• Sentiment classification
– Weakly supervised – identify words’ polarity from general lexicons
98. Classify Blog Sentiment
• Use verb and adjective categories
• Adjectives
– Positive
– Negative
• Verbs
– Positive verb classes: positive mental affecting, approving, praising
– Negative verb classes: abusing, accusing, arguing, doubting, negative mental affecting
Using verbs and adjectives to automatically classify blog sentiment
[Chesley, Vincent, Xu, & Srihari, 2006]
99. Expanding Adjectives (cont.)
• Query Wiktionary
• Assumption: an adjective’s polarity is reflected by the words that define it
• Maintain a set of adjectives with known polarity
• Count mentions of known adjectives to derive the polarity of new ones
102. Recommender Systems
RS are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. (Xiao & Benbasat 2007)
• Different system designs and algorithms
– Based on availability of data
– Domain characteristics
104. Content-Based Recommendation: A General Vector-Space Approach
[Diagram: content is matched against a learned user profile and thresholded to select relevant content; user feedback drives both profile learning and threshold learning]
105. Challenge – Dull Content
• Content does not contain enough information to distinguish items the user likes from items the user does not like
– Result: specificity – more of the same
– Vocabulary mismatch, limited aspects in the content to distinguish relevant items, synonymy, polysemy
– Result: bad recommendations!
106. Remedy: Enriching Background Knowledge
Infusion of external sources
• Movie recommendations
• Three external sources:
– Dictionary – (controlled) extracts of lemma descriptions – linguistic
– Wikipedia – pages related to movies are transformed into semantic vectors using matrix factorization
– User tags – on some sites people can add tags to movies
Knowledge Infusion into Movies Content Based Recommender Systems
[Semeraro, Lops and Basile 2009]
107. • The data combined from all sources is represented as a graph
• The set of terms that describe an item is extended using a spreading-activation model that connects terms based on semantic relations
Knowledge Infusion into Movies Content Based Recommender Systems
[Semeraro, Lops and Basile 2009]
108. Process – Example
Keyword search for “The Shining” yields: axe, murder, paranormal, hotel, winter
Spreading activation over the external knowledge adds: perceptions, killer, psychiatry
The expanded search then retrieves “Carrie” and “The Silence of the Lambs”
109. Tweets Content-Based Recommendations
• Goal: re-rank tweet messages based on similarity to the user’s content profile
• Method: the user interest profile is represented as two vectors: concepts from Wikipedia, and affinity with other users
• The user’s profile is expanded by a random walk on the Wikipedia concept graph, utilizing the inter-links between Wikipedia articles
[Lu, Lam & Zhang, 2012] Twitter User Modeling and Tweets Recommendation based on Wikipedia Concept Graph
110. Algorithm Steps
1. Map a Twitter message to a set of concepts employing Explicit Semantic Analysis (ESA)
2. Generate a user profile from two vectors: a concept vector representing the topics the user is interested in, and a vector representing the affinity with other users
3. To get related concepts, apply a random walk on the Wikipedia graph
4. Represent the user profile and the tweet as sets of weighted Wikipedia concepts
5. Apply cosine similarity to compute the score between the profile and tweet messages
112. Collaborative Filtering
Description: the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future.
Main approaches:
• kNN – nearest neighbors
• SVD – matrix factorization
113. Collaborative Filtering – The Idea
Trying to predict the opinion the user will have on the different items, so as to recommend the “best” items to each user, based on the user’s previous likings and the opinions of other like-minded users
[Diagram: a user–item grid of positive and negative ratings with a missing entry “?” to be predicted]
114. Collaborative Filtering Rating Matrix
• The ratings (or events) of users and items are represented in a matrix
• All CF methods are based on such a rating matrix
• Rows are the users in the system (U1 … Um), columns are the items in the system (I1 … In), and each cell may hold a rating r
[Diagram: a sparse m × n matrix with a few scattered ratings r]
115. Collaborative Filtering – Approach 1: Nearest Neighbors
“People who liked this also liked…”
User-to-User: Recommendations are made by finding users with similar tastes. Jane and Tim both liked Item 2 and disliked Item 3; it seems they might have similar taste, which suggests that in general Jane agrees with Tim. This makes Item 1 a good recommendation for Tim.
Item-to-Item: Recommendations are made by finding items that have similar appeal to many users. Tom and Sandra liked both Item 1 and Item 4. That suggests that, in general, people who liked Item 4 will also like Item 1, so Item 1 will be recommended to Tim. This approach is scalable to millions of users and millions of items.
116. Some Math…
Similarity (Pearson correlation between the active user $a$ and user $u$):
$$w_{a,u} = \frac{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)^2}\,\sqrt{\sum_{i=1}^{m}(r_{u,i}-\bar{r}_u)^2}}$$
Prediction (the active user's mean plus similarity-weighted deviations of the $n$ neighbors):
$$p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n}(r_{u,i}-\bar{r}_u)\,w_{a,u}}{\sum_{u=1}^{n} w_{a,u}}$$
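A direct transcription of the two formulas into code, on toy data (the ratings and identifiers are illustrative):

```python
import math

ratings = {                       # user -> {item: rating} (toy data)
    "a": {"i1": 5, "i2": 3, "i3": 4},
    "u": {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
}

def mean(u):
    return sum(ratings[u].values()) / len(ratings[u])

def sim(a, u):
    """Pearson correlation w_{a,u} over the items both users rated."""
    common = ratings[a].keys() & ratings[u].keys()
    ra, ru = mean(a), mean(u)
    num = sum((ratings[a][i] - ra) * (ratings[u][i] - ru) for i in common)
    da = math.sqrt(sum((ratings[a][i] - ra) ** 2 for i in common))
    du = math.sqrt(sum((ratings[u][i] - ru) ** 2 for i in common))
    return num / (da * du) if da and du else 0.0

def predict(a, i, neighbors):
    """p_{a,i}: the active user's mean plus similarity-weighted deviations."""
    rated = [u for u in neighbors if i in ratings[u]]
    num = sum((ratings[u][i] - mean(u)) * sim(a, u) for u in rated)
    den = sum(sim(a, u) for u in rated)
    return mean(a) + num / den if den else mean(a)

print(round(predict("a", "i4", ["u"]), 3))  # 4.25 on this toy data
```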
117. Computing the Item-to-Item Similarity
We must have:
• For each customer u_i, the list of products bought by u_i
• For each product p_j, the list of users that bought it
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
[Greg Linden, Brent Smith, and Jeremy York], 2003, IEEE Internet Computing
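A minimal sketch of item-to-item similarity from exactly these two lists, using cosine similarity over the sets of buyers. This is a common instantiation of the idea, not necessarily Linden et al.'s exact vector formulation, and the purchase data is illustrative:

```python
import math

bought_by = {                     # product -> set of users who bought it
    "p1": {"u1", "u2", "u3"},
    "p2": {"u2", "u3"},
    "p3": {"u4"},
}

def item_sim(p, q):
    """Cosine similarity between the buyer sets of two products."""
    inter = len(bought_by[p] & bought_by[q])
    return inter / math.sqrt(len(bought_by[p]) * len(bought_by[q]))

print(item_sim("p1", "p2"))  # 2 / sqrt(3*2) ≈ 0.816
print(item_sim("p1", "p3"))  # 0.0
```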
118. The Main Challenge – Lack of Data
• Sparseness
• Long tail – many items in the long tail have only a few ratings
• Cold start – the system cannot draw any inferences for users or items about which it has not yet gathered sufficient data
119. External Sources as a Remedy
[Diagram: a recommendation engine consumes user data and item data, enriched by external sources, and produces recommendations]
External data: about users and items
From: WWW, social networks, other systems, other users’ devices, and Wikipedia
120. Computing Similarities Using Wikipedia for Sparse Matrices
• Utilize knowledge from Wikipedia to infer the similarities between items/users
• Systems differ in what Wikipedia data is used and how it is used
Similar????
121. Example
Step 1: Map items to Wikipedia pages
– Generate several variations of each item’s (movie’s) name
– Compare the generated names with corresponding page titles in Wikipedia
– Choose the page with the largest number of categories that contain the word "film" (e.g., "Films shot in Vancouver")
• Using this technique, 1512 of the 1682 items (89.8%) contained in the MovieLens database were matched
Using Wikipedia to Boost Collaborative Filtering Techniques, RecSys 2011
[Katz, Ofek, Shapira, Rokach, Shani, 2011]
123. Step 2: Use Wikipedia information to compute similarity between items (see the sketch below)
– Text similarity – calculate the cosine similarity between item pages in Wikipedia
– Category similarity – count mutual categories
– Link similarity – count mutual outgoing links
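A sketch of the three similarity signals over toy page data; real inputs would come from the parsed Wikipedia dump, and all page contents here are made up:

```python
import math
from collections import Counter

pages = {
    "Toy Story": {"text": "pixar animated film about toys",
                  "cats": {"1995 films", "Pixar films"},
                  "links": {"Pixar", "Tom Hanks"}},
    "A Bug's Life": {"text": "pixar animated film about insects",
                     "cats": {"1998 films", "Pixar films"},
                     "links": {"Pixar", "Kevin Spacey"}},
}

def text_sim(a, b):
    """Cosine similarity between the pages' term-frequency vectors."""
    va = Counter(pages[a]["text"].split())
    vb = Counter(pages[b]["text"].split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm

def category_sim(a, b):   # mutual categories
    return len(pages[a]["cats"] & pages[b]["cats"])

def link_sim(a, b):       # mutual outgoing links
    return len(pages[a]["links"] & pages[b]["links"])

print(text_sim("Toy Story", "A Bug's Life"))      # 0.8
print(category_sim("Toy Story", "A Bug's Life"))  # 1
print(link_sim("Toy Story", "A Bug's Life"))      # 1
```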
124. Use Wikipedia to Generate “Artificial Ratings”
Step 3: use the item-similarity matrix and each user’s actual ratings to generate additional, “artificial” ratings (see the sketch below)
• i – the item for which we wish to generate a rating
• K – the set of items with actual ratings
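The slide does not reproduce the paper's exact formula; a similarity-weighted average over the user's actually rated items K is the natural reading of Step 3, so this sketch assumes that form:

```python
def artificial_rating(user_ratings, sim_to_i):
    """user_ratings: item -> real rating (the set K);
    sim_to_i: item -> similarity between that item and the target item i."""
    num = sum(sim_to_i[k] * r for k, r in user_ratings.items())
    den = sum(sim_to_i[k] for k in user_ratings)
    return num / den if den else None

K = {"Toy Story": 5, "Alien": 2}              # the user's real ratings
sim_to_i = {"Toy Story": 0.8, "Alien": 0.1}   # similarities to item i
print(artificial_rating(K, sim_to_i))         # ≈ 4.67, dominated by Toy Story
```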
125. Use Wikipedia to Generate “Artificial Ratings”
Step 4: add the artificial ratings to the user-item matrix
• Use an artificial rating only when there is no real value
• This matrix will be used for the collaborative filtering
126. Results
• The sparser the initial matrix, the greater the improvement
127. Results – Collaborative Filtering
[Chart: RMSE (y-axis, 0.9–1.25) vs. % of data sparsity (x-axis, 0.94–1.0), comparing the methods: basic item-item, categories, links, text, IMDB text]
128. Comparison – Wikipedia and IMDB
[Chart: comparison of words per movie – number of movie descriptions (y-axis, 0–1000) by number of words in text, for Wikipedia vs. IMDB]
130. Collaborative Filtering – Approach 2: Matrix Factorization
• MF is a decomposition of a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix
• MF models users and items as vectors of latent features whose interaction produces the rating of the item by the user
• With MF, a matrix is factored into a series of linear approximations that expose the underlying structure of the matrix
• The goal is to uncover latent features that explain the observed ratings
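A minimal latent-factor sketch using numpy's SVD on a mean-filled toy matrix. Real MF recommenders learn factors from observed entries only (e.g., via SGD or ALS); the mean fill and rank choice here are simplifying assumptions:

```python
import numpy as np

R = np.array([[5, 4, 0, 1],      # 0 = unobserved rating (toy data)
              [4, 5, 0, 1],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

mask = R > 0
filled = np.where(mask, R, R.sum() / mask.sum())   # global-mean fill

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2                                              # number of latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k approximation

print(np.round(R_hat, 2))   # predicted ratings, including unobserved cells
```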
131. Latent Factor Models – Example
[Diagram: SVD maps users & ratings to latent concepts or factors]
MF reveals hidden connections and their strength (a hidden concept)
132. Latent Factor Models – Example
[Diagram: users & ratings mapped to latent concepts or factors]
SVD revealed a movie this user might like! (recommendation)
137. Example (1)
• Content recommendations for scholars, using an ontology to describe the scholar’s profile
• Challenge: singular reference ontologies lack sufficient ontological concepts and are unable to represent the scholars’ knowledge
A reference ontology for profiling scholar’s background knowledge in recommender systems, 2014
[Bahram Amini, Roliana Ibrahim, Mohd Shahizan Othman, Nematbakhsh]
138. • Building ontology-based profiles for modeling the background knowledge of scholars
• Build the ontology by merging a few sources
• Wikipedia is used both as a knowledge source and for merging ontologies (verifying the semantic similarity between two candidate terms)
141. Example (2): Tag Recommendation
• Challenge: “bad” tags (e.g., misspelled, irrelevant) lead to improper relationships among items and ineffective searches for topic information
142. Ontology-Based Tag Recommendation
• Integrate Wikipedia categories with WordNet synsets to create a topic ontology for tag recommendation
Effective Tag Recommendation System Based on Topic Ontology using Wikipedia and WordNet
[Subramaniyaswamy and Chenthur Pandian 2012]
145. We demonstrated how Wikipedia was used for:
• Ontology creation – RecSys, SA
• Semantic relatedness – RecSys, QE, QPP
• Synonym detection – CLIR, QE
• Relevance assessment – Entity search, QPP
• Disambiguation – SA, QE
• Domain-specific knowledge acquisition – RecSys, SA
146. What Was Not Covered
• Using behaviors (views, edits)
• Examining the effect of time
• …
• More tasks, more methods (QA, advertising, …)
• Wikipedia weaknesses
• We showed only a few examples of the power of Wikipedia and of its potential