5. Wikimedia
• “The Wikimedia Foundation, Inc. is a nonprofit
charitable organization dedicated to encouraging
the growth, development and distribution of free,
multilingual, educational content, and to providing
the full content of these wiki-based projects to the
public free of charge”
• https://wikimediafoundation.org/wiki/Home
6. What is Wikipedia?
• Wikipedia is an encyclopedia
• no original research
• neutral point of view
• statements must be verifiable
• must reference reliable published sources
• Wikipedia relies on crowd-sourcing
• anyone can edit
• Wikipedia is big
printwikipedia.com
8. Number of Articles & Growth Rate
• ~4.9 million articles on English Wikipedia
• Since 2006, around 30,000 new articles per month
• 287 languages
9. Quality of a Wikipedia Article
1. Are the length and structure of the article an indication of the importance of the subject?
2. Check the edit-history activity: when was the last edit?
3. Check the talk page for debates: what are the issues in this article?
4. Check the references [or lack thereof]
– Are all claims referenced (especially if controversial)?
– What is the quality of the sources?
– How relevant are the sources?
https://upload.wikimedia.org/wikipedia/commons/9/96/Evaluating_Wikipedia_brochure_%28Wiki_Education_Foundation%29.pdf
14. Lens for the Real World
Wikipedia: representative of the real world and of people's understanding
• Ideas
• Thoughts
• Perceptions
15. Unique Visitors and Page Views
• http://reportcard.wmflabs.org/
• 430 million unique visitors in May 2015
• Mobile users are not included!
16. Editing Wikipedia
[Hill BM, Shaw A, 2013] The Wikipedia Gender Gap Revisited: Characterizing Survey Response Bias with Propensity Score Estimation. PLoS ONE 8(6): e65782.
17. Literature Review of Scholarly Articles
• http://wikilit.referata.com/wiki/Main_Page
19. Systems that Use Information from Wikipedia
The task (goal of the system):
• Query operations
• Recommendation
• Sentiment analysis
• Ontology building
• …
The challenge for which Wikipedia is a remedy:
• Sparseness
• Ambiguity
• Cost of manual labor
• Lack of information
• Understanding/perception
• …
Utilized information:
• Concepts/pages, links, categories, redirect pages, views, edits, …
Algorithms & techniques:
• How items are matched with Wikipedia pages
• How data is extracted
• How Wikipedia data is utilized
• How the similarity between concepts is computed
• …
20. IR & Wikipedia
• Wikipedia as a collection is:
– Enormous
– Timely
– Reflective of crowd wisdom: connections between entities in Wikipedia represent the way a large number of people view them (computers cannot understand “concepts” or relate things the way humans do)
– Accessible (free!)
– Accurate
– Broad in coverage
• Weaknesses:
– Incomplete
– Imbalanced
– Citations often incomplete
21. IR & Wikipedia
• Wikipedia is used for
– enhancing the performance of IR systems (mainly relevance)
• Challenge:
– Distilling knowledge from such a large amount of un-/semi-structured information is extremely difficult
– The contents of today’s Web are perfectly suitable for human consumption, but remain hardly accessible to machines
22. How to Use Wikipedia for Your Own Research
24. Partial Diagram of Wikipedia’s Structured Meta-Data
[Created by Gilad Katz]
25. Wikipedia Download
• Client apps: XOWA, WikiTaxi, WikiReader, …
• 16 offline tools for downloading – ~53 GB of disk space
• Page-view downloads (per-hour files) – ~50 GB
• EnwikiContentSource
• Wikipedia Miner (Milne and Witten) [An open source toolkit for mining Wikipedia, 2012]
• XiaoxiaoLi/getWikipediaMetaData
26. Wikipedia Download
Google for: Wikipedia dump files download
https://dumps.wikimedia.org/enwiki/
Torrent: https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki
Your and others’ code to get plain text!
27. DBpedia
• ~4.9 million articles, 4.2 million of which are classified into a consistent ontology
• Persons, places, organizations, diseases
• An effort to transform the knowledge of Wikipedia into "tabular-like" format
• Sophisticated database-query language
– Open Data
– Linked Open Data
UGC
CGC
32. How to Describe the Unknown?
Meno: “And how will you enquire, Socrates, into
that which you do not know? What will you put
forth as the subject of enquiry?
And if you find what you want, how will you ever
know that this is the thing which you did not
know? ”
Written 380 B.C.E. by Plato
33. How to Describe the Unknown?
Automatic QE – a process where the user’s original query is augmented by new features with similar meaning.
1. What you know and what you wish to know
2. Initial vague query vs. concrete topics and terminology
The average length of an initial query at prominent search engines was 2.4 in 2001, 2.08 in 2009, and 3.1 nowadays (and growing…)
[Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic (2001)] "Searching the web: The public and their queries".
[Taghavi, Wills, Patel (2011)] An analysis of web proxy logs with query distribution pattern approach for search engines.
34. Wikipedia-Based Query Expansion
• Wikipedia is rich, highly interconnected, and domain-independent
“The fourth generation iPad (originally marketed as iPad with Retina display, retrospectively marketed as the iPad 4) is a tablet computer produced and marketed by Apple Inc.” – Wikipedia
35. Thesaurus-Based QE
Knowledge-based search engine powered by Wikipedia.
[Milne, Witten and Nichols] (2007)
Initial query → Wikipedia-based thesaurus → query augmented by new features with similar meaning
37. Query Suggestion
query = "obama white house"
Task: Given a query, produce a ranked list of concepts from Wikipedia which are mentioned or meant in the query
Quiz time!
38. Learning Semantic Query Suggestions
query = "obama white house"
Correct concepts: Barack Obama, White House
Step 1: Candidate generation
Use a ranking approach (e.g., language modeling) to score the concepts (articles) in Wikipedia, where each n-gram of the query is considered as a query in its turn (see the sketch below)
[Meij, E., M. Bron, L. Hollink, B. Huurnink and M. de Rijke] (2009)
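A minimal sketch of this candidate-generation step, assuming a toy in-memory article set; a real system would score against an indexed Wikipedia dump, and all identifiers and data here are illustrative rather than taken from Meij et al.:

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia article texts (illustrative only).
articles = {
    "Barack Obama": "barack obama is a politician who served as president "
                    "of the united states",
    "White House": "the white house is the official residence and workplace "
                   "of the president of the united states",
}

def ngrams(tokens, max_n=3):
    """Every contiguous n-gram of the query, each treated as a query in its turn."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def lm_score(gram, text, collection, mu=100.0):
    """Dirichlet-smoothed query likelihood of an article for one n-gram."""
    doc = Counter(text.split())
    dlen = sum(doc.values())
    clen = sum(collection.values())
    score = 0.0
    for t in gram.split():
        p = (doc[t] + mu * collection[t] / clen) / (dlen + mu)
        score += math.log(p) if p > 0 else -1e9  # term unseen anywhere
    return score

collection = Counter(" ".join(articles.values()).split())
scored = [(lm_score(g, text, collection), g, title)
          for g in ngrams("obama white house".split())
          for title, text in articles.items()]
for score, gram, title in sorted(scored, reverse=True)[:5]:
    print(f"{score:9.3f}  {gram!r:22} -> {title}")
```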
39. Step 2: Candidate selection
• Supervised machine learning approach
• Input: a set of labeled examples (query-to-concept mappings)
• Types of features:
• N-gram
• Concept
• N-gram–concept combination
• Current search history
Example concept features for a candidate concept c (see the sketch below):
• # of concepts linking to c
• # of concepts linked from c
• # of associated categories
• # of redirect pages to c
• Importance of c in the query (TF-IDF of q in c)
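A small sketch of extracting the concept-side features listed above. The link/category structures are hypothetical stand-ins for parsed Wikipedia meta-data; none of the identifiers come from the paper, and TF-IDF is reduced to plain TF for brevity:

```python
# Hypothetical link/category structures (illustrative only).
inlinks = {"Barack Obama": {"White House", "Michelle Obama"}}
outlinks = {"Barack Obama": {"White House", "United States"}}
categories = {"Barack Obama": {"Presidents of the United States", "1961 births"}}
redirects = {"Barack Obama": {"Obama", "Barak Obama"}}

def concept_features(c, query_terms, article_text):
    """Concept-side features for a candidate concept c, per the list above."""
    tokens = article_text.lower().split()
    tf = sum(tokens.count(t) for t in query_terms) / max(len(tokens), 1)
    return {
        "n_inlinks": len(inlinks.get(c, ())),        # concepts linking to c
        "n_outlinks": len(outlinks.get(c, ())),      # concepts linked from c
        "n_categories": len(categories.get(c, ())),  # associated categories
        "n_redirects": len(redirects.get(c, ())),    # redirect pages to c
        "query_tf": tf,  # importance of c in the query (TF here, TF-IDF in the paper)
    }

print(concept_features("Barack Obama", ["obama", "white", "house"],
                       "Barack Obama lived in the White House"))
```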
42. CLIR Task
Query in a source language → query in a target language
Collection translation is not scalable!
The solution:
43. WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia
[D. Nguyen, A. Overwijk, C. Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong] 2008
Generating Translation Candidates
• Stage 1: Mapping the query to Wikipedia concepts
• Full query search (“obama white house”)
• Partial query search (n-grams)
• Sole n-grams in links of the top retrieved documents (“house”)
• Whole query with weighted n-grams (“obama white house”)
• Possible to combine search in different fields within a document:
• (title: obama) (content: white house) – to avoid missing terms
44. Creating the Final Translated Query
Stage 2: Creating the final expanded query, with terms weighted according to Stage 1 and an analysis of the Wikipedia pages:
• Concept translation from cross-lingual links
• Try synonyms and spelling variants
• Translated-term weighting: concepts obtained from the whole-query search are more important (see the sketch below)
Obama White House →
Weiße Haus ^ 1.0
Barack Obama ^ 0.5
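A small sketch of assembling such a weighted query string. The ^-weight syntax follows Lucene-style query boosting, which is an assumption about the target retrieval engine; the weights mirror the slide's example:

```python
def weighted_query(translations):
    """Build a weighted query string from (term, weight) pairs."""
    return " ".join(f'"{term}"^{weight}' for term, weight in translations)

# Weights as on the slide: whole-query concepts outrank partial-match ones.
print(weighted_query([("Weiße Haus", 1.0), ("Barack Obama", 0.5)]))
# "Weiße Haus"^1.0 "Barack Obama"^0.5
```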
46. Entity Search
Retrieving a set of entities in response to a user’s query
query = “United States presidents”
47. Passages → Entities
Entity ranking via graph centrality measures: degree, combined with inverse entity frequency (see the sketch below)
Ranking very many typed entities on Wikipedia
[Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, and Giuseppe Attardi] 2007
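A minimal sketch of this idea on toy passage data: entities are ranked by their degree in a co-occurrence graph, damped by an IDF-like inverse entity frequency. This is in the spirit of Zaragoza et al., not their exact formulation:

```python
import math
from collections import defaultdict

# passages[i] = entities co-occurring in passage i (toy data).
passages = [
    {"George Washington", "John Adams"},
    {"John Adams", "Thomas Jefferson"},
    {"George Washington", "Thomas Jefferson", "Mount Vernon"},
]

neighbors = defaultdict(set)   # entity co-occurrence graph
freq = defaultdict(int)        # in how many passages each entity appears
for p in passages:
    for e in p:
        freq[e] += 1
        neighbors[e] |= p - {e}

N = len(passages)
def score(e):
    # Degree centrality damped by an IDF-like inverse entity frequency.
    return len(neighbors[e]) * math.log(1 + N / freq[e])

for e in sorted(neighbors, key=score, reverse=True):
    print(f"{score(e):6.3f}  {e}")
```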
48. Category-Based Semantic Distance Between Entities
• Utilizing the Wikipedia category structure
• Query categories (QC)
• Entity categories (EC)
• Distance(QC, EC) =
• if QC and EC have a common category, then distance = 0
• else: distance = minimal path length
A ranking framework for entity oriented search using Markov random fields.
[Raviv, H., D. Carmel and O. Kurland] (2012)
49. Category-Based Distance Example (see the sketch below)
Query categories: novels, books
If the entity category is “novels”, then distance = 0
If the entity category is “books by Paul Auster”, then distance = 2
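A minimal sketch of this distance, reproducing the example above with a toy fragment of the category graph. Note that this sketch only walks upward through parent categories, while the paper's minimal path length is over the category graph in general; all data here is illustrative:

```python
from collections import deque

# Toy fragment of the Wikipedia category graph (child -> parents).
parents = {
    "books by Paul Auster": {"books by author"},
    "books by author": {"books"},
    "novels": {"books"},
}

def category_distance(query_cats, entity_cats):
    """Distance 0 on a shared category, else shortest path length."""
    if query_cats & entity_cats:
        return 0
    # Breadth-first search upward from the entity's categories.
    frontier = deque((c, 0) for c in entity_cats)
    seen = set(entity_cats)
    while frontier:
        cat, d = frontier.popleft()
        if cat in query_cats:
            return d
        for p in parents.get(cat, ()):
            if p not in seen:
                seen.add(p)
                frontier.append((p, d + 1))
    return float("inf")  # no path found

qc = {"novels", "books"}
print(category_distance(qc, {"novels"}))                # 0 (common category)
print(category_distance(qc, {"books by Paul Auster"}))  # 2
```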
50. The INEX Entity Ranking Task
• The goal of the track is to evaluate how well systems can rank entities in response to a query
• Entities are assumed to correspond to Wikipedia entries
• Used category, association, and link structure
• Took place in 2007, 2008, 2009
Overview of the INEX 2007 Entity Ranking Track
[Arjen P. de Vries, Anne-Marie Vercoustre, James A. Thom, Nick Craswell, Mounia Lalmas]
53. Benefits of Estimating Query Difficulty
Reproduced from a tutorial: David Carmel and Oren Kurland. Query performance prediction for IR. SIGIR 2012
• Feedback to users: the user can rephrase a difficult query
• Feedback to the search engine: choose an alternative retrieval strategy
• Feedback to the system administrator: identify missing content
• For IR applications: e.g., federated search over different datasets
54. Query Performance Prediction
Query1 = “obama white house”   Query2 = “weather in Israel”
Prediction mechanism → prediction value: is Q1 > Q2?
A naive predictor: QueryLength(Q1) = 3, i.e., Query.split().length() (see the sketch below)
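A tiny sketch of pre-retrieval predictors of this kind: the naive query-length predictor from the slide, plus average IDF, a standard corpus-based signal. The document frequencies and corpus size are made up for illustration:

```python
import math

# Made-up document frequencies and corpus size (illustrative only).
df = {"obama": 120, "white": 5000, "house": 4200, "weather": 900, "israel": 300}
N = 100_000

def query_length(q):
    """The naive predictor from the slide."""
    return len(q.split())

def avg_idf(q):
    """Average inverse document frequency; unseen terms default to df = 1."""
    terms = q.lower().split()
    return sum(math.log(N / df.get(t, 1)) for t in terms) / len(terms)

for q in ("obama white house", "weather in Israel"):
    print(f"{q!r}: length = {query_length(q)}, avg IDF = {avg_idf(q):.2f}")
```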
56. Query Performance Prediction in Ad-Hoc Retrieval
Estimating query effectiveness when relevance judgments are not available
57. Query Performance Prediction
• Pre-retrieval prediction
• Operates prior to retrieval time
• Analyzes the query and the corpus
• Computationally efficient
• Post-retrieval prediction
• Analyzes the result list of the most highly ranked documents
• Superior performance
58. New Prediction Type? Absolute Query Difficulty
Regardless of corpus, but with external knowledge:
1. Corpus independent!
2. Information induced from Wikipedia
3. Advantage for non-cooperative search, where corpus-based information is not available
Wikipedia-based query performance prediction.
[Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] 2014
60. Notational Conventions
A page p is associated with a set of terms if its title contains at least one of the terms in the set (soft mapping)
q = “Barack Obama White House” → associated pages
Maximal exact match length (MEML) = 2
M_MEML – the set of pages for which the MEML holds
61. Titles (measuring queries)
Size of the subqueries that have an exact match with a page title (see the sketch below):
1. Maximum
2. Average
Quiz time!
q = “Barack Obama White House”
Maximum – 2
Average – 2
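A sketch of these title-match statistics, assuming an in-memory set of lowercased Wikipedia titles (the title set here is a toy stand-in). It reproduces the quiz answer: the maximal exact matches are “Barack Obama” and “White House”, so both the maximum and the average are 2:

```python
# Toy title set; a real predictor uses the full Wikipedia title list.
titles = {"barack obama", "white house", "obama", "house"}

def matching_spans(query):
    """All subqueries (contiguous n-grams) that exactly match a title."""
    terms = query.lower().split()
    return [(i, i + n) for n in range(1, len(terms) + 1)
            for i in range(len(terms) - n + 1)
            if " ".join(terms[i:i + n]) in titles]

def maximal_spans(spans):
    """Keep only matches not strictly contained in a longer matching span."""
    return [(a, b) for (a, b) in spans
            if not any(c <= a and b <= d and (c, d) != (a, b)
                       for (c, d) in spans)]

spans = maximal_spans(matching_spans("Barack Obama White House"))
lengths = [b - a for a, b in spans]
print("MEML (maximum):", max(lengths))          # 2
print("Average:", sum(lengths) / len(lengths))  # 2.0
```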
62. Titles & Content (measuring pages)
• Number of pages associated with a subquery (fixed length = 1, 2, 3)
• The average length (# of terms) of the pages that are an exact match or a maximal exact match
• Sum, average, standard deviation
→ the “scope” of q in Wikipedia
63. Examples

Query                          | Max subquery match | #pages with one query term in title | #pages with two query terms in title
Horse Leg                      | 1 | 7885  | 3
Poliomyelitis and Post-Polio   | 2 | 6605  | 8
Hubble Telescope Achievements  | 2 | 460   | 4
Endangered Species (Mammals)   | 1 | 3481  | 96
Most Dangerous Vehicles        | 1 | 3978  | 6
African Civilian Deaths        | 2 | 23381 | 14
New Hydroelectric Projects     | 1 | 93858 | 33
Implant Dentistry              | 1 | 268   | 0
Rap and Crime                  | 2 | 3826  | 1
Radio Waves and Brain Cancer   | 2 | 15460 | 33
64. Links and Categories
• Overall # of links that contain at least one of the query’s terms in their anchor text
• # of categories that appear in at least one of the pages associated with a subquery
• # of in/outgoing links for the Wikipedia pages in M_MEML
• Average, standard deviation
65. Links for Query Coherency Prediction
# of links from pages associated with a single-term subquery that point to pages associated with another subquery (see the sketch below)
q = “Barack Obama White House”
• Maximum, average, standard deviation
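A minimal sketch of this coherency count. The term-to-page association map and per-page outlinks are toy stand-ins for parsed Wikipedia data:

```python
# term -> pages whose title contains it (toy data)
associated = {
    "obama": {"Barack Obama", "Michelle Obama"},
    "house": {"White House", "House"},
}
# page -> pages it links to (toy data)
outlinks = {
    "Barack Obama": {"White House", "United States"},
    "Michelle Obama": {"Barack Obama"},
    "White House": {"Barack Obama"},
    "House": set(),
}

def coherency(term_a, term_b):
    """# of links from pages associated with term_a to pages associated with term_b."""
    targets = associated[term_b]
    return sum(len(outlinks.get(p, set()) & targets) for p in associated[term_a])

print(coherency("obama", "house"))  # 1: Barack Obama -> White House
print(coherency("house", "obama"))  # 1: White House -> Barack Obama
```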
66. Examples

Query                          | Avg. # pages containing at least one query term in link | Std. dev. of # categories that contain one query term in the title | Link-based coherency (average)
Horse Leg                      | 1821.5   | 34.7211  | 1731.5
Poliomyelitis and Post-Polio   | 302      | 11.6409  | 1120.333
Hubble Telescope Achievements  | 152.3333 | 11.19255 | 55.66667
Endangered Species (Mammals)   | 5753.333 | 64.60477 | 686
Most Dangerous Vehicles        | 376.6667 | 13.4316  | 396.3333
African Civilian Deaths        | 1673.333 | 18.53009 | 4172.333
New Hydroelectric Projects     | 1046     | 50.96217 | 121477.3
Implant Dentistry              | 209.5    | 13.45904 | 10.5
Rap and Crime                  | 1946     | 35.30528 | 661.5
Radio Waves and Brain Cancer   | 1395.5   | 24.81557 | 2202.5
67. Query Performance Prediction – Summary
• Absolute query difficulty, corpus independent
• # of pages containing a subquery in the title – the most effective predictor
• Coherency is the second most effective
• Integration with state-of-the-art predictors leads to superior performance (Wikipedia-based clarity score)
68. “The prophecy was given to the infants and the fools”
Query-Performance Prediction: Setting the Expectations Straight
[Fiana Raiber and Oren Kurland] (2014)
Quiz time!
Storming of the Bastille!
70. Introduction
• Sentiment analysis or opinion mining
“The location is indeed perfect, across the road from Hyde Park and Kensington Palace. The building itself is charming, … The room, which was an upgrade, was ridiculously small. There was a persistent unpleasant smell. The bathroom sink didn’t drain properly. The TV didn’t work.”
→ The location is indeed perfect
71. Leveraging Language Understanding
• Since they leave your door wide open when they come to clean your room, the mosquitoes get in your room and they buzz around at night
• There were a few mosquitoes, but nothing that a little bug repellant could not solve
• It seems like people on the first floor had more issues with mosquitoes
• Also, there was absolutely no issue with mosquitoes on the upper floors
75. Concepts
– We will go shopping next week
• Relying on ontologies or semantic networks
• Using concepts steps away from blindly using keywords and word occurrences
[Cambria 2013: An introduction to concept level sentiment analysis]
[Cambria et al., 2014]
80. Wikipedia as a Sense-Tagged Corpus
• Use hyperlinked concepts as sense annotations
• Extract senses and paragraphs of mentions of the ambiguous word in Wikipedia articles
• Learn a classification model
Using Wikipedia for Automatic Word Sense Disambiguation
[Mihalcea, 2007]
81. Annotated Text
• In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.
• It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].
• Vehicles of this type may contain expensive audio players, televisions, video players, and [[bar (counter)|bar]]s, often with refrigerators.
• Jenga is a popular beer in the [[bar (establishment)|bar]]s of Thailand.
• This is a disturbance on the water surface of a river or estuary, often caused by the presence of a [[bar (landform)|bar]] or dune on the riverbed.
82. Example of Annotated Text
• In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.
• Feature examples (see the sketch below):
– Current word and its POS
– Surrounding words and their POS
– The verb and noun in the vicinity
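A sketch of turning one such annotation into a labeled training example with the surrounding-word features listed above (POS tags are omitted for brevity; the regex and all identifiers are illustrative, not from Mihalcea's implementation):

```python
import re

# One annotated sentence; the sense label is the wiki-link target.
annotated = ("In 1834, Sumner was admitted to the [[bar (law)|bar]] "
             "at the age of twenty-three.")

def example_from_annotation(text, window=2):
    """Turn a [[sense|surface]] annotation into (features, sense label)."""
    m = re.search(r"\[\[([^|\]]+)\|([^\]]+)\]\]", text)
    sense, surface = m.group(1), m.group(2)
    left = text[:m.start()].split()[-window:]   # surrounding words (left)
    right = text[m.end():].split()[:window]     # surrounding words (right)
    features = {"word": surface}
    features.update({f"left_{i}": w for i, w in enumerate(left)})
    features.update({f"right_{i}": w for i, w in enumerate(right)})
    return features, sense

print(example_from_annotation(annotated))
# ({'word': 'bar', 'left_0': 'to', 'left_1': 'the', ...}, 'bar (law)')
```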
84. Challenge – Granularity Level
One can look at a review at
– Document level, i.e., is this review + or −?
– Sentence level, i.e., is each sentence + or −?
– Entity and feature/aspect level
85. Sentiment Lexicon
• Sentiment words are often the dominating factor for sentiment analysis (Liu, 2012)
– good, excellent, bad, terrible
• A sentiment lexicon holds a score for each word representing the degree of its sentiment
+ : good, excellent, friendly, beautiful
− : bad, terrible, ugly, difficult
86. Sentiment Lexicon
• General lexicon
– There is no general sentiment lexicon that is optimal in every domain
– The sentiment of some terms depends on the context
– Poor coverage
“The device was small and handy.”
“The waiter brought the food on time, but the portion was very small.”
88. Remedy – Identify Opinion Targets
Old wine or warm beer: Target-specific sentiment analysis of adjectives
[Fahrni & Klenner, 2008]
89. • If “cold coca cola” is positive, then “cold coca cola cherry” is positive as well
[Diagram: adjectives (cold, cool, warm, old, expensive) paired with targets (Coca Cola, Sushi, Pizza); the polarities learned for “Coca Cola” carry over to “Coca Cola cherry”]
90. Example (2): Product Attribute Detection
• Discover the attributes that people express opinions about
• Identify all words that are included in Wikipedia titles
Domain Independent Model for Product Attribute Extraction from User Reviews using Wikipedia.
[Kovelamudi, Ramalingam, Sood & Varma, 2011]
“Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz”
91. “Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz now fer like a third the price I paid.. heheh.. oh well....the fact that I didn’t wait a year er so to buy a bigger model for half the price.. most likely from a different store.. ..not namin any namez th0..”
→ WSD (word sense disambiguation of the noisy review text)
93. Lexicon Construction (unsupervised)
• Label propagation (see the sketch below)
• Identify pairs of adjectives based on a set of constraints
• Infer the polarity of new adjectives from known ones
“the room was good and wide” → good (known +) propagates to wide (polarity = 0.97)
Unsupervised Common-Sense Knowledge Enrichment for Domain-Specific Sentiment Analysis
[Ofek, Rokach, Poria, Cambria, Hussain, Shabtai, 2015]
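A minimal sketch of conjunction-based propagation, assuming adjectives joined by “and” share polarity with a known seed. The 0.97 confidence on the slide comes from the paper's data and is not computed here; the seed set and sentences are illustrative:

```python
import re

seeds = {"good": 1.0, "bad": -1.0}   # known-polarity adjectives
sentences = ["the room was good and wide",
             "the hallway was bad and narrow"]

lexicon = dict(seeds)
for s in sentences:
    for a, b in re.findall(r"(\w+) and (\w+)", s):
        if a in lexicon and b not in lexicon:
            lexicon[b] = lexicon[a]   # propagate polarity across "and"
        elif b in lexicon and a not in lexicon:
            lexicon[a] = lexicon[b]

print(lexicon)  # {'good': 1.0, 'bad': -1.0, 'wide': 1.0, 'narrow': -1.0}
```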
95. WikiSent: Sentiment Analysis of Movie Reviews
WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia
[Mukherjee & Bhattacharyya, 2012]
96. Feature Types
• Crew
• Cast
• Specific domain nouns from the text content and plot
– wizard, witch, magic, wand, death-eater, power
• Movie-domain-specific terms
– Movie, Staffing, Casting, Writing, Theory, Rewriting, Screenplay, Format, Treatments, Scriptments, Synopsis
97. Retrieve Relevant Opinionated Text
• Rank sentences according to the participating entities:
1. Title, crew
2. Movie-domain-specific terms
3. Plot
• Subjectivity is scored as a weighted combination (+α·Σ, +β·Σ, −γ·Σ) over these entity classes
• Sentiment classification
– Weakly supervised – identify words’ polarity from general lexicons
98. Classify Blog Sentiment
• Use verb and adjective categories
• Adjectives
– Positive
– Negative
• Verbs
– Positive verb classes: positive mental affecting, approving, praising
– Negative verb classes: abusing, accusing, arguing, doubting, negative mental affecting
Using verbs and adjectives to automatically classify blog sentiment
[Chesley, Vincent, Xu, & Srihari, 2006]
99. Expanding Adjectives (cont.)
• Query Wiktionary
• Assumption: an adjective’s polarity is reflected by the words that define it
• Maintain a set of adjectives with known polarity
• Count mentions of known adjectives to derive the polarity of new ones
102. Recommender Systems
RS are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. (Xiao & Benbasat 2007)
• Different system designs and algorithms
– Based on availability of data
– Domain characteristics
104. Content-Based Recommendation: A General Vector-Space Approach
[Diagram: content is matched against a learned user profile and thresholded to select relevant content; user feedback drives both profile learning and threshold learning]
105. Challenge – Dull Content
• Content does not contain enough information to distinguish items the user likes from items the user does not like
– Result: specificity – more of the same
– Vocabulary mismatch, limited aspects in the content to distinguish relevant items, synonymy, polysemy
– Result: bad recommendations!
106. Remedy: Enriching Background Knowledge
Infusion of external sources
• Movie recommendations
• Three external sources:
– Dictionary – (controlled) extracts of lemma descriptions – linguistic
– Wikipedia – pages related to movies are transformed into semantic vectors using matrix factorization
– User tags – on some sites people can add tags to movies
Knowledge Infusion into Movies Content Based Recommender Systems
[Semeraro, Lops and Basile 2009]
107. • The data combined from all sources is represented as a graph
• The set of terms that describe an item is extended using a spreading-activation model that connects terms based on semantic relations
Knowledge Infusion into Movies Content Based Recommender Systems
[Semeraro, Lops and Basile 2009]
108. Process – Example
Keyword search for “The Shining” yields: axe, murder, paranormal, hotel, winter
Spreading activation over the external knowledge adds: perceptions, killer, psychiatry
The expanded search then retrieves “Carrie” and “The Silence of the Lambs”
109. Tweets Content-Based Recommendations
• Goal: re-rank tweet messages based on similarity to the user’s content profile
• Method: the user interest profile is represented as two vectors: concepts from Wikipedia, and affinity with other users
• The user’s profile is expanded by a random walk on the Wikipedia concept graph, utilizing the inter-links between Wikipedia articles
[Lu, Lam & Zhang, 2012] Twitter User Modeling and Tweets Recommendation based on Wikipedia Concept Graph
110. Algorithm Steps
1. Map a Twitter message to a set of concepts employing Explicit Semantic Analysis (ESA)
2. Generate a user profile from two vectors: a concept vector representing the topics the user is interested in, and a vector representing the affinity with other users
3. To get related concepts, apply a random walk on the Wikipedia graph
4. Represent the user profile and the tweet as sets of weighted Wikipedia concepts
5. Apply cosine similarity to compute the score between the profile and tweet messages
112. Collaborative Filtering
Description: the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future.
Main approaches:
• kNN – nearest neighbors
• SVD – matrix factorization
113. Collaborative Filtering – The Idea
Trying to predict the opinion the user will have on the different items, so as to recommend the “best” items to each user, based on the user’s previous likings and the opinions of other like-minded users
[Diagram: a user–item grid of positive and negative ratings with a missing entry “?” to be predicted]
114. Collaborative Filtering Rating Matrix
• The ratings (or events) of users and items are represented in a matrix
• All CF methods are based on such a rating matrix
• Rows are the users in the system (U1 … Um), columns are the items in the system (I1 … In), and each cell may hold a rating r
[Diagram: a sparse m × n matrix with a few scattered ratings r]
115. Collaborative Filtering – Approach 1: Nearest Neighbors
“People who liked this also liked…”
User-to-User: Recommendations are made by finding users with similar tastes. Jane and Tim both liked Item 2 and disliked Item 3; it seems they might have similar taste, which suggests that in general Jane agrees with Tim. This makes Item 1 a good recommendation for Tim.
Item-to-Item: Recommendations are made by finding items that have similar appeal to many users. Tom and Sandra liked both Item 1 and Item 4. That suggests that, in general, people who liked Item 4 will also like Item 1, so Item 1 will be recommended to Tim. This approach is scalable to millions of users and millions of items.
116. Some Math…
Similarity (Pearson correlation between the active user $a$ and user $u$):
$$w_{a,u} = \frac{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)^2}\,\sqrt{\sum_{i=1}^{m}(r_{u,i}-\bar{r}_u)^2}}$$
Prediction (the active user's mean plus similarity-weighted deviations of the $n$ neighbors):
$$p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n}(r_{u,i}-\bar{r}_u)\,w_{a,u}}{\sum_{u=1}^{n} w_{a,u}}$$
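A direct transcription of the two formulas into code, on toy data (the ratings and identifiers are illustrative):

```python
import math

ratings = {                       # user -> {item: rating} (toy data)
    "a": {"i1": 5, "i2": 3, "i3": 4},
    "u": {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
}

def mean(u):
    return sum(ratings[u].values()) / len(ratings[u])

def sim(a, u):
    """Pearson correlation w_{a,u} over the items both users rated."""
    common = ratings[a].keys() & ratings[u].keys()
    ra, ru = mean(a), mean(u)
    num = sum((ratings[a][i] - ra) * (ratings[u][i] - ru) for i in common)
    da = math.sqrt(sum((ratings[a][i] - ra) ** 2 for i in common))
    du = math.sqrt(sum((ratings[u][i] - ru) ** 2 for i in common))
    return num / (da * du) if da and du else 0.0

def predict(a, i, neighbors):
    """p_{a,i}: the active user's mean plus similarity-weighted deviations."""
    rated = [u for u in neighbors if i in ratings[u]]
    num = sum((ratings[u][i] - mean(u)) * sim(a, u) for u in rated)
    den = sum(sim(a, u) for u in rated)
    return mean(a) + num / den if den else mean(a)

print(round(predict("a", "i4", ["u"]), 3))  # 4.25 on this toy data
```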
117. Computing the Item-to-Item Similarity
We must have:
• For each customer u_i, the list of products bought by u_i
• For each product p_j, the list of users that bought it
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
[Greg Linden, Brent Smith, and Jeremy York], 2003, IEEE Internet Computing
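A minimal sketch of item-to-item similarity from exactly these two lists, using cosine similarity over the sets of buyers. This is a common instantiation of the idea, not necessarily Linden et al.'s exact vector formulation, and the purchase data is illustrative:

```python
import math

bought_by = {                     # product -> set of users who bought it
    "p1": {"u1", "u2", "u3"},
    "p2": {"u2", "u3"},
    "p3": {"u4"},
}

def item_sim(p, q):
    """Cosine similarity between the buyer sets of two products."""
    inter = len(bought_by[p] & bought_by[q])
    return inter / math.sqrt(len(bought_by[p]) * len(bought_by[q]))

print(item_sim("p1", "p2"))  # 2 / sqrt(3*2) ≈ 0.816
print(item_sim("p1", "p3"))  # 0.0
```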
118. The Main Challenge – Lack of Data
• Sparseness
• Long tail – many items in the long tail have only a few ratings
• Cold start – the system cannot draw any inferences for users or items about which it has not yet gathered sufficient data
119. External Sources as a Remedy
[Diagram: a recommendation engine consumes user data and item data, enriched by external sources, and produces recommendations]
External data: about users and items
From: WWW, social networks, other systems, other users’ devices, and Wikipedia
120. Computing Similarities Using Wikipedia for Sparse Matrices
• Utilize knowledge from Wikipedia to infer the similarities between items/users
• Systems differ in what Wikipedia data is used and how it is used
Similar????
121. Example
Step 1: Map items to Wikipedia pages
– Generate several variations of each item’s (movie’s) name
– Compare the generated names with corresponding page titles in Wikipedia
– Choose the page with the largest number of categories that contain the word "film" (e.g., "Films shot in Vancouver")
• Using this technique, 1512 of the 1682 items (89.8%) contained in the MovieLens database were matched
Using Wikipedia to Boost Collaborative Filtering Techniques, RecSys 2011
[Katz, Ofek, Shapira, Rokach, Shani, 2011]
123. Step 2: Use Wikipedia information to compute similarity between items (see the sketch below)
– Text similarity – calculate the cosine similarity between item pages in Wikipedia
– Category similarity – count mutual categories
– Link similarity – count mutual outgoing links
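A sketch of the three similarity signals over toy page data; real inputs would come from the parsed Wikipedia dump, and all page contents here are made up:

```python
import math
from collections import Counter

pages = {
    "Toy Story": {"text": "pixar animated film about toys",
                  "cats": {"1995 films", "Pixar films"},
                  "links": {"Pixar", "Tom Hanks"}},
    "A Bug's Life": {"text": "pixar animated film about insects",
                     "cats": {"1998 films", "Pixar films"},
                     "links": {"Pixar", "Kevin Spacey"}},
}

def text_sim(a, b):
    """Cosine similarity between the pages' term-frequency vectors."""
    va = Counter(pages[a]["text"].split())
    vb = Counter(pages[b]["text"].split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm

def category_sim(a, b):   # mutual categories
    return len(pages[a]["cats"] & pages[b]["cats"])

def link_sim(a, b):       # mutual outgoing links
    return len(pages[a]["links"] & pages[b]["links"])

print(text_sim("Toy Story", "A Bug's Life"))      # 0.8
print(category_sim("Toy Story", "A Bug's Life"))  # 1
print(link_sim("Toy Story", "A Bug's Life"))      # 1
```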
124. Use Wikipedia to Generate “Artificial Ratings”
Step 3: use the item-similarity matrix and each user’s actual ratings to generate additional, “artificial” ratings (see the sketch below)
• i – the item for which we wish to generate a rating
• K – the set of items with actual ratings
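The slide does not reproduce the paper's exact formula; a similarity-weighted average over the user's actually rated items K is the natural reading of Step 3, so this sketch assumes that form:

```python
def artificial_rating(user_ratings, sim_to_i):
    """user_ratings: item -> real rating (the set K);
    sim_to_i: item -> similarity between that item and the target item i."""
    num = sum(sim_to_i[k] * r for k, r in user_ratings.items())
    den = sum(sim_to_i[k] for k in user_ratings)
    return num / den if den else None

K = {"Toy Story": 5, "Alien": 2}              # the user's real ratings
sim_to_i = {"Toy Story": 0.8, "Alien": 0.1}   # similarities to item i
print(artificial_rating(K, sim_to_i))         # ≈ 4.67, dominated by Toy Story
```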
125. Use Wikipedia to Generate “Artificial Ratings”
Step 4: add the artificial ratings to the user-item matrix
• Use an artificial rating only when there is no real value
• This matrix will be used for the collaborative filtering
126. Results
• The sparser the initial matrix, the greater the improvement
127. Results – Collaborative Filtering
[Chart: RMSE (y-axis, 0.9–1.25) vs. % of data sparsity (x-axis, 0.94–1.0), comparing the methods: basic item-item, categories, links, text, IMDB text]
128. Comparison – Wikipedia and IMDB
[Chart: comparison of words per movie – number of movie descriptions (y-axis, 0–1000) by number of words in text, for Wikipedia vs. IMDB]
130. Collaborative Filtering – Approach 2: Matrix Factorization
• MF is a decomposition of a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix
• MF models users and items as vectors of latent features whose interaction produces the rating of the item by the user
• With MF, a matrix is factored into a series of linear approximations that expose the underlying structure of the matrix
• The goal is to uncover latent features that explain the observed ratings
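A minimal latent-factor sketch using numpy's SVD on a mean-filled toy matrix. Real MF recommenders learn factors from observed entries only (e.g., via SGD or ALS); the mean fill and rank choice here are simplifying assumptions:

```python
import numpy as np

R = np.array([[5, 4, 0, 1],      # 0 = unobserved rating (toy data)
              [4, 5, 0, 1],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

mask = R > 0
filled = np.where(mask, R, R.sum() / mask.sum())   # global-mean fill

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2                                              # number of latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k approximation

print(np.round(R_hat, 2))   # predicted ratings, including unobserved cells
```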
131. Latent Factor Models – Example
[Diagram: SVD maps users & ratings to latent concepts or factors]
MF reveals hidden connections and their strength (a hidden concept)
132. Latent Factor Models – Example
[Diagram: users & ratings mapped to latent concepts or factors]
SVD revealed a movie this user might like! (recommendation)
137. Example (1)
• Content recommendations for scholars, using an ontology to describe the scholar’s profile
• Challenge: singular reference ontologies lack sufficient ontological concepts and are unable to represent the scholars’ knowledge
A reference ontology for profiling scholar’s background knowledge in recommender systems, 2014
[Bahram Amini, Roliana Ibrahim, Mohd Shahizan Othman, Nematbakhsh]
138. • Building ontology-based profiles for modeling the background knowledge of scholars
• Build the ontology by merging a few sources
• Wikipedia is used both as a knowledge source and for merging ontologies (verifying the semantic similarity between two candidate terms)
141. Example (2): Tag Recommendation
• Challenge: “bad” tags (e.g., misspelled, irrelevant) lead to improper relationships among items and ineffective searches for topic information
142. Ontology-Based Tag Recommendation
• Integrate Wikipedia categories with WordNet synsets to create a topic ontology for tag recommendation
Effective Tag Recommendation System Based on Topic Ontology using Wikipedia and WordNet
[Subramaniyaswamy and Chenthur Pandian 2012]
145. We demonstrated how Wikipedia was used for:
• Ontology creation – RecSys, SA
• Semantic relatedness – RecSys, QE, QPP
• Synonym detection – CLIR, QE
• Relevance assessment – Entity search, QPP
• Disambiguation – SA, QE
• Domain-specific knowledge acquisition – RecSys, SA
146. What Was Not Covered
• Using behaviors (views, edits)
• Examining the effect of time
• …
• More tasks, more methods (QA, advertising, …)
• Wikipedia weaknesses
• We showed only a few examples of the power of Wikipedia and of its potential