Exploiting Wikipedia for Information
Retrieval Tasks
SIGIR Tutorial, August 2015
Department of Information Systems Engineering
Who We Are
Department of Information Systems Engineering
Ben-Gurion University of the Negev
SIGIR 2015, Tutorial
Sessions Agenda
• Introduction to Wikipedia
• Session 1 – Search related tasks
• Session 2 - Sentiment Analysis
• Session 3 - Recommender Systems
• Summary
SIGIR 2015, Tutorial
Introduction
Getting Started!
SIGIR 2015, Tutorial
Wikimedia
• “The Wikimedia Foundation, Inc. is a nonprofit
charitable organization dedicated to encouraging
the growth, development and distribution of free,
multilingual, educational content, and to providing
the full content of these wiki-based projects to the
public free of charge”
• https://wikimediafoundation.org/wiki/Home
SIGIR 2015, Tutorial
What is Wikipedia?
• Wikipedia is an encyclopedia
• no original research
• neutral point of view
• statements must be verifiable
• must reference reliable published sources
• Wikipedia relies on crowd-sourcing
• anyone can edit
• Wikipedia is big
printwikipedia.com
SIGIR 2015, Tutorial
Structured and Unstructured
Entity Pages
Categories
Links
Disambiguation pages
Redirecting pages
Navigation Boxes
Infoboxes
Discussion pages
User Pages
Page Views
SIGIR 2015, Tutorial
Articles Number & Growth Rate
• ~4.9 million articles on English Wikipedia
• Since 2006 around 30,000 new articles per month
• 287 languages
SIGIR 2015, Tutorial
Quality of Wikipedia Article
1. Are the length and structure of the article an
indication of the importance of this subject?
2. Click on edit history activity: when was the last
edit?
3. Talk page checked for debates: what are the
issues in this article?
4. Check the references [or lack thereof]
– Are all claims referenced (especially if controversial )?
– What is the quality of the sources?
– How relevant are the sources?
https://upload.wikimedia.org/wikipedia/commons/9/96/Evaluating_Wikipedia_brochure_%28Wiki_Education_Foundation%29.pdf
SIGIR 2015, Tutorial
Article Rating System
https://en.wikipedia.org/wiki/Wikipedia:Featured_articles SIGIR 2015, Tutorial
Internet Encyclopaedias Go Head to Head. Nature 438(15), [Giles, J., 2005]
SIGIR 2015, Tutorial
Links
Categories
Amount of text
Images
Various languages
SIGIR 2015, Tutorial
Page Views
Wisdom of the Crowds: Decentralized Knowledge Construction in Wikipedia
[Arazy et al., 2007]
Superior Information Source
Size & Scope, Britannica has 40000
Timely & Updated
Wisdom of the crowd
SIGIR 2015, Tutorial
Lens for the Real World
Wikipedia: representative of the real world and of people's understanding
Ideas!
Thoughts
Perceptions
SIGIR 2015, Tutorial
Unique Visitors and Page Views
• http://reportcard.wmflabs.org/
430 million unique visitors in May 2015
Mobile users are not included!
SIGIR 2015, Tutorial
Editing Wikipedia
[Hill BM, Shaw A ,2013] The Wikipedia Gender Gap Revisited: Characterizing Survey Response
Bias with Propensity Score Estimation. PLoS ONE 8(6): e65782. SIGIR 2015, Tutorial
Literature Review of Scholarly Articles
• http://wikilit.referata.com/wiki/Main_Page
SIGIR 2015, Tutorial
Research Involving Wikipedia
Researching Wikipedia itself
Using Wikipedia content as a
knowledge resource
SIGIR 2015, Tutorial
Systems that Use Information from
Wikipedia
The task – Goal of the
system
•Query operations
•Recommendation
•Sentiment Analysis
•Ontology building
•…..
The challenge for which
Wikipedia is a remedy
•Sparseness
•Ambiguity
•Cost of manual labor
•Lack of information
•Understanding /perception
•……….
Utilized Information
•Concepts/pages, links,
categories, redirect pages,
views, edits…….
Algorithms & techniques
•How items are matched with
Wikipedia pages
•How data is extracted
•How Wikipedia data is utilized
•How the similarity between
concepts is computed
•………..
SIGIR 2015, Tutorial
IR & Wikipedia
• Wikipedia as a collection is:
– enormous
– Timely
– Reflects crowd wisdom
• connections between entities in Wikipedia represent the way a large number of people
view them
• Computers cannot understand “concepts” and cannot relate things the way humans do
– Accessible (free!)
– Accurate
– Coverage
• Weaknesses
– Incomplete
– Imbalanced
– No complete citations
SIGIR 2015, Tutorial
IR & Wikipedia
• Wikipedia is used for
– enhancing the performance of IR systems (mainly
relevance)
• Challenge –
• Distillation of knowledge from such a large amount of un-/semi-structured
information is an extremely difficult task
• The contents of today’s Web are perfectly suitable for human
consumption, but remain hardly accessible to machines.
SIGIR 2015, Tutorial
How to Use Wikipedia for Your Own
Research
SIGIR 2015, Tutorial
Structured Storage
Pages Categories
Links
Pages
Paragraphs
Redirection Pages
Queries
Documents
Collection
(TREC-X)
Schema 2 Schema 1
SIGIR 2015, Tutorial
Partial Diagram of Wikipedia’s
Structured meta-Data
[Created by Gilad Katz]
SIGIR 2015, Tutorial
Wikipedia Download
Client Apps: XOWA, WikiTaxi, WikiReader,….
16 Offline tools for Downloading - 53GB of disk space
Page views download - size and structure 50GB per hour
EnwikiContentSource
Wikipedia Miner (Milne and Witten)
[An open source toolkit for mining Wikipedia,
2012]
XiaoxiaoLi/getWikipediaMetaData
SIGIR 2015, Tutorial
Wikipedia Download
Google for : Wikipedia dump files download
https://dumps.wikimedia.org/enwiki/
Torrent:
https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki
Your and other’s code to get plain text!
SIGIR 2015, Tutorial
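For orientation, a minimal Python sketch of the dump-to-plain-text step mentioned above. It assumes a locally downloaded enwiki-latest-pages-articles.xml.bz2 (the file name and the export namespace version are assumptions) and strips only the most common wikitext markup; real pipelines typically rely on tools such as Wikipedia Miner or WikiExtractor.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Namespace of recent MediaWiki export dumps (assumption; check your dump's <mediawiki> tag).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def strip_markup(wikitext):
    """Very rough wikitext-to-plain-text conversion (illustration only)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                  # drop simple templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # keep link anchor text
    text = re.sub(r"<[^>]+>", "", text)                             # drop HTML tags
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quote markup
    return text

def iter_plain_text(dump_path):
    """Stream (title, plain_text) pairs from a compressed Wikipedia XML dump."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                wikitext = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, strip_markup(wikitext)
                elem.clear()  # free memory as we stream

if __name__ == "__main__":
    for title, text in iter_plain_text("enwiki-latest-pages-articles.xml.bz2"):
        print(title, text[:80])
        break
```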
DBPedia
• ~4.9 million articles, 4.2 million of which are classified into a consistent
ontology
• Persons, places, organizations, diseases
• An effort to transform the knowledge of Wikipedia into "tabular-like"
format
• Sophisticated database-query language
- Open Data
- Linked Open Data
UGC
CGC
SIGIR 2015, Tutorial
Search Related Tasks
Session 1
SIGIR 2015, Tutorial
Crawl Control
Search Engine Basic Structure
Crawlers
Ranking
Indexer
Page Repository
Query Engine
Collection
Analysis
Queries
Results
Indexes
SIGIR 2015, Tutorial
Query Operations Agenda
• Query Expansion
• Cross Language Information Retrieval
• Entity Search
• Query Performance Prediction
SIGIR 2015, Tutorial
Query Expansion
SIGIR 2015, Tutorial
How to Describe the Unknown?
Meno: “And how will you enquire, Socrates, into
that which you do not know? What will you put
forth as the subject of enquiry?
And if you find what you want, how will you ever
know that this is the thing which you did not
know? ”
Written 380 B.C.E
By Plato
SIGIR 2015, Tutorial
Automatic QE - A process where the user’s original query is
augmented by new features with similar meaning.
1.What you know and what you wish to know
2. Initial vague query and concrete topics and
terminology
How to Describe the Unknown?
The average length of an initial query at prominent search engines is
2.4 in 2001, 2.08 in 2009, and 3.1 nowadays (and growing…)
[Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic (2001)].
"Searching the web: The public and their queries”.
[Taghavi, Wills, Patel (2011)] An analysis of web proxy logs with query distribution
pattern approach for search engines
SIGIR 2015, Tutorial
Wikipedia Based Query Expansion
• Wikipedia is:
• Rich, Highly Interconnected, Domain independent
“The fourth generation iPad (originally
marketed as iPad with Retina display,
retrospectively marketed as the iPad 4) is
a tablet computer produced and marketed
by Apple Inc.” Wikipedia
SIGIR 2015, Tutorial
Knowledge based search engine powered by Wikipedia.
[Milne, Witten and Nichols](2007).
Thesaurus Based QE
Initial query Wikipedia
based
thesaurus
augmented by
new features
with similar meaning
SIGIR 2015, Tutorial
Semantic
Relatedness
is quantified.
Consider Jackson
Relatedness:
Co-occurrence
statistics of terms
and of links
(ESA as an alternative)
Synonyms:
Redirect Pages
No NER process needed!
Relevant to particular
document collection
Manual definition vs. automatic generation
KORU
[Milne, Witten and Nichols](2007).
SIGIR 2015, Tutorial
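The link-based relatedness mentioned above is often computed with the Milne–Witten measure over shared in-links; a small sketch under that assumption (the in-link sets below are toy data, not real Wikipedia link lists).

```python
import math

def milne_witten_relatedness(inlinks_a, inlinks_b, n_articles):
    """Normalized-Google-distance style relatedness over Wikipedia in-links."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0
    num = math.log(max(len(a), len(b))) - math.log(len(common))
    den = math.log(n_articles) - math.log(min(len(a), len(b)))
    return max(0.0, 1.0 - num / den)

# Toy example: articles linking to "Jaguar (car)" vs. "Jaguar (animal)".
rel = milne_witten_relatedness({"Car", "Ford", "Coventry"},
                               {"Cat", "Ford", "Amazon rainforest"},
                               n_articles=4_900_000)
print(rel)
```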
Query Suggestion
query = obama white house
Task: Given a query, produce a ranked list of concepts from Wikipedia
which are mentioned or meant in the query
Quiz Time!
SIGIR 2015, Tutorial
Learning Semantic Query Suggestions.
query = obama white house
Barack Obama
White House
Correct concepts
Use some ranking approach (e.g., language modeling) to score
the concepts (articles) in Wikipedia, where each n-gram is
considered as a query in turn
1. Candidates
Generation
[Meij, E., M. Bron, L. Hollink, B. Huurnink and M. De Rijke] (2009)
SIGIR 2015, Tutorial
• Supervised machine learning approach
• Input: a set of labeled examples (query2concept mappings)
• Types of features
• N-gram
• Concept
• N-gram – concept combination
• Current search history
2. Candidates
Selection
# of concepts linking to c
# of concepts linked from c
# of associated categories
# of redirect pages to c
Importance of c in query
(TF-IDF Q in c)
SIGIR 2015, Tutorial
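A rough sketch of the two stages above: enumerating query n-grams as concept anchors and building a per-candidate feature vector. The `wiki` helper object and the exact feature names are illustrative assumptions, not the authors' implementation.

```python
def query_ngrams(query, max_n=3):
    """All contiguous word n-grams of the query (candidate anchors for concepts)."""
    words = query.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def concept_features(concept, ngram, wiki):
    """Per (n-gram, concept) feature vector, mirroring the feature types above.

    `wiki` is a hypothetical helper exposing inlinks/outlinks/categories/redirects."""
    return {
        "n_inlinks": len(wiki.inlinks(concept)),         # concepts linking to c
        "n_outlinks": len(wiki.outlinks(concept)),       # concepts linked from c
        "n_categories": len(wiki.categories(concept)),   # associated categories
        "n_redirects": len(wiki.redirects_to(concept)),  # redirect pages to c
        "ngram_in_title": int(ngram.lower() in concept.lower()),
    }

print(query_ngrams("obama white house"))
```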
Cross Language Information Retrieval
(CLIR)
SIGIR 2015, Tutorial
Machine Translation First
أهلاً (“Hello”)
SIGIR 2015, Tutorial
CLIR Task
Query in a source language
Query in a target language
Collection translation
is not scalable!
The solution:
SIGIR 2015, Tutorial
WikiTranslate: Query Translation for Cross-lingual Information Retrieval using only Wikipedia
[D. Nguyen, A.Overwijk, C.Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong] 2008
• Stage 1: Mapping the query to Wikipedia concepts
• Full query search (“obama white house”)
• Partial query search (N-Grams)
• Sole N-Grams in links of the top retrieved documents (“house”)
• Whole query with weighted n-grams (“obama white house”)
• Possible to combine search in different fields within a document:
• (title: obama) (content: white house) – to avoid missing terms
Generating Translation Candidates
SIGIR 2015, Tutorial
Stage 2: Creating the final expanded query terms weighting
according to Stage 1 and Wikipedia pages analysis:
• Concept translation from cross lingual links
• Try synonyms and spelling variants
• Translated term weighting : Concepts obtained from the whole
query search are more important
Creating the Final Translated Query
Obama White House
Weißes Haus ^ 1.0
Barack Obama ^ 0.5
SIGIR 2015, Tutorial
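A minimal sketch of the final translation step, assuming the query has already been mapped to source-language concepts; the cross-lingual link table and weights below are toy values.

```python
def translate_query(concepts, langlinks, weights):
    """Build a weighted target-language query from mapped Wikipedia concepts.

    concepts  : source-language concepts mapped from the query
    langlinks : dict, source concept -> target-language article title (cross-lingual link)
    weights   : dict, concept -> weight (concepts from the whole-query search weigh more)
    """
    terms = []
    for c in concepts:
        target = langlinks.get(c)
        if target:
            terms.append(f'"{target}"^{weights.get(c, 0.5)}')
    return " ".join(terms)

# Toy example for the German target shown above.
langlinks = {"White House": "Weißes Haus", "Barack Obama": "Barack Obama"}
weights = {"White House": 1.0, "Barack Obama": 0.5}
print(translate_query(["White House", "Barack Obama"], langlinks, weights))
```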
Entity Search
SIGIR 2015, Tutorial
Entity Search
Retrieving a set of entities in response to a user's query
query = “United States presidents”
SIGIR 2015, Tutorial
Passages Entities
Entity Ranking:
Graph centrality measures:
Degree.
Combined with inverse
entity frequency
Ranking very many typed entities on Wikipedia
[Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, and Giuseppe Attardi]
2007
SIGIR 2015, Tutorial
• Utilizing Wikipedia Category Structure
• Query-Categories (QC)
• Entity-Categories (EC)
• Distance(QC,EC)=
• if QC and EC have a common category then Distance = 0
• else: distance = minimal path length
Category Based
Semantic Distance Between Entities
A ranking framework for entity oriented search using Markov random fields.
[Raviv, H., D. Carmel and O. Kurland ](2012).
PUC
SIGIR 2015, Tutorial
Category Based Distance Example
Query categories:
novels, books
If entity category is novels, then distance = zero
If entity category is books by Paul Auster, then distance = 2
SIGIR 2015, Tutorial
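A small sketch of the category-based distance above: breadth-first search upward in the category graph from the entity's categories until a query category is reached. The toy hierarchy reproduces the distance-2 example.

```python
from collections import deque

def category_distance(query_cats, entity_cats, parent_of):
    """Minimal path length in the category graph between any query category
    and any entity category (0 if they already share a category)."""
    query_cats, entity_cats = set(query_cats), set(entity_cats)
    if query_cats & entity_cats:
        return 0
    frontier, seen, dist = deque(entity_cats), set(entity_cats), 0
    while frontier:
        dist += 1
        for _ in range(len(frontier)):
            cat = frontier.popleft()
            for parent in parent_of.get(cat, []):
                if parent in query_cats:
                    return dist
                if parent not in seen:
                    seen.add(parent)
                    frontier.append(parent)
    return float("inf")  # no path found

# Toy category hierarchy for the example above.
parent_of = {"books by Paul Auster": ["books by author"], "books by author": ["books"]}
print(category_distance({"novels", "books"}, {"books by Paul Auster"}, parent_of))  # 2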
The INEX Entity Ranking Task
• The goal of the track is to evaluate how well systems can rank
entities in response to a query
• Entities are assumed to correspond to Wikipedia entries
Used category, association, link structure
Took place in 2007, 2008, 2009
Overview of the INEX 2007 Entity Ranking Track
[Arjen P. de Vries, Anne-Marie Vercoustre, James A. Thom, Nick Craswell, Mounia Lalmas.]
SIGIR 2015, Tutorial
Query Performance Prediction
“The prophecy was given to the infants and the fools”
SIGIR 2015, Tutorial
Which Query is Better?
SIGIR 2015, Tutorial
Benefits of Estimating the Query
Difficulty
Reproduced from a tutorial:
David Carmel and Oren Kurland. Query performance prediction for IR. SIGIR 2012
Feedback to users
User can rephrase a difficult query
Feedback to the search engines
Alternative retrieval strategy
Feedback to system administrator
Missing content
For IR applications
Federated search over different datasets
SIGIR 2015, Tutorial
Query Performance Prediction
Query1 = “obama white house” Query2 = “weather in Israel”
Prediction
Mechanism
Prediction
value
Q1 > Q2
QueryLength(Q1) = 3?
Query.split().length()
SIGIR 2015, Tutorial
Retrieval Effectiveness
Predictors’ values
TREC Collections
(topics,
relevance judgments)
Regression
Framework
MAP
SIGIR 2015, Tutorial
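A minimal sketch of the regression framework above, with hypothetical predictor values and average-precision targets standing in for real TREC data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: rows = TREC queries, columns = predictor values (e.g. the
# Wikipedia-based features discussed later); y = per-query average precision.
X = np.array([[2, 7885, 3], [2, 6605, 8], [1, 3481, 96], [1, 3978, 6]])
y = np.array([0.21, 0.35, 0.12, 0.18])  # hypothetical AP values, for illustration only

model = LinearRegression().fit(X, y)
print(model.predict(np.array([[2, 460, 4]])))  # predicted effectiveness of a new query
```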
Query Performance Prediction
in Ad-Hoc Retrieval
Estimating query effectiveness when relevance judgments are not
available
SIGIR 2015, Tutorial
Query Performance Prediction
• Pre-Retrieval Prediction
• Operate prior to retrieval time
• Analyze the query and the corpus
• Computationally efficient
• Post-Retrieval Prediction
• Analyze the result list of the
most highly ranked documents
• Superior performance
SIGIR 2015, Tutorial
Regardless of corpus, but with external knowledge
Absolute Query Difficulty
1. Corpus independent!
2. Information induced from Wikipedia
3. Advantage for non-cooperative search where corpus-based
information is not available
Wikipedia-based query performance prediction.
[Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] 2014.
New Prediction Type?
SIGIR 2015, Tutorial
Wikipedia Properties
Title
Content Links
Categories
Wikipedia-based query performance prediction.
[Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] 2014. SIGIR 2015, Tutorial
Notational Conventions
A page p is associated with a set of terms if its title contains at
least one of the terms in the set. (soft mapping)
q= “Barack Obama White House”
Associated page
Maximal exact match length (MEML) = 2
The set of pages for which the MEML holds is denoted MMEML
SIGIR 2015, Tutorial
Titles (measuring queries)
Size of subquery which has an exact match
1. Maximum
2. Average
Quiz Time!
q= “Barack Obama White House”
Maximum – 2
Average – 2
SIGIR 2015, Tutorial
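A small sketch of the MEML computation above; soft title matching is reduced here to simple substring containment, which is an approximation.

```python
def subquery_ngrams(query):
    """All contiguous sub-queries of the query, longest first."""
    words = query.lower().split()
    for n in range(len(words), 0, -1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def maximal_exact_match_length(query, titles):
    """Length (in terms) of the longest sub-query appearing verbatim in some title."""
    titles = [t.lower() for t in titles]
    for sub in subquery_ngrams(query):
        if any(sub in t for t in titles):
            return len(sub.split())
    return 0

titles = ["White House", "Barack Obama", "Obama (surname)"]
print(maximal_exact_match_length("Barack Obama White House", titles))  # 2
```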
Titles & Content(measuring pages)
Number of pages associated with a subquery (fixed length = 1,2,3)
The average length (# terms) of the pages that are exact match
or maximal exact match
Sum, Average, Standard Deviation
“scope” of q in Wikipedia
SIGIR 2015, Tutorial
Examples
Query | Max subquery match | # pages containing one query term in the title | # pages containing two query terms in the title
Horse Leg | 1 | 7885 | 3
Poliomyelitis and Post-Polio | 2 | 6605 | 8
Hubble Telescope Achievements | 2 | 460 | 4
Endangered Species (Mammals) | 1 | 3481 | 96
Most Dangerous Vehicles | 1 | 3978 | 6
African Civilian Deaths | 2 | 23381 | 14
New Hydroelectric Projects | 1 | 93858 | 33
Implant Dentistry | 1 | 268 | 0
Rap and Crime | 2 | 3826 | 1
Radio Waves and Brain Cancer | 2 | 15460 | 33
SIGIR 2015, Tutorial
Links and Categories
Overall # of links that contain at least one of query’s terms in their
anchor text
# of categories that appear in at least one of the pages
associated with a subquery
# of in/outgoing links for Wikipedia page in MMEML
Average, Standard Deviation
SIGIR 2015, Tutorial
Links for Query Coherency
Prediction
# of links from pages associated with a single-term
subquery that point to pages associated with another subquery
q= “Barack Obama White House”
Maximum, Average, Standard Deviation
SIGIR 2015, Tutorial
Examples
Query | Avg. # pages containing at least one query term in a link | Std. dev. of # of categories that contain one query term in the title | Link-based coherency (average)
Horse Leg | 1821.5 | 34.7211 | 1731.5
Poliomyelitis and Post-Polio | 302 | 11.6409 | 1120.333
Hubble Telescope Achievements | 152.3333 | 11.19255 | 55.66667
Endangered Species (Mammals) | 5753.333 | 64.60477 | 686
Most Dangerous Vehicles | 376.6667 | 13.4316 | 396.3333
African Civilian Deaths | 1673.333 | 18.53009 | 4172.333
New Hydroelectric Projects | 1046 | 50.96217 | 121477.3
Implant Dentistry | 209.5 | 13.45904 | 10.5
Rap and Crime | 1946 | 35.30528 | 661.5
Radio Waves and Brain Cancer | 1395.5 | 24.81557 | 2202.5
SIGIR 2015, Tutorial
Query Performance
Prediction - Summary
• Absolute query difficulty, corpus independent
• # of pages containing a subquery in the title – the most
effective predictor
• Coherency is the 2nd most effective predictor
• Integration with state-of-the-art predictors leads to
superior performance (Wikipedia-based-clarity-score)
SIGIR 2015, Tutorial
“The prophecy was given to the infants and the fools”
Query-Performance Prediction: Setting the
Expectations Straight
[Fiana Raiber and Oren Kurland] (2014)
Quiz Time!
Storming of the Bastille!
SIGIR 2015, Tutorial
Exploiting Wikipedia for Sentiment
Analysis
Session 2
SIGIR 2015, Tutorial
Introduction
• Sentiment analysis or opinion mining
The location is indeed perfect, across the
road from Hyde Park and Kensington
Palace. The building itself is charming,
…The room, which was an upgrade, was
ridiculously small. There was a persistent
unpleasant smell. The bathroom sink didn't
drain properly. The TV didn't work.
The location is indeed perfect
SIGIR 2015, Tutorial
Leveraging with Language
Understanding
• Since they leave your door wide open when they come clean
your room , the mosquitoes get in your room and they buzz
around at night
• There were a few mosquitoes , but nothing that a little bug
repellant could not solve
• It seems like people on the first floor had more issues with
mosquitoes
• Also there was absolutely no issue with mosquitoes on the
upper floors
SIGIR 2015, Tutorial
Challenge – Common Sense
SIGIR 2015, Tutorial
Wikipedia as a Knowledgebase
SIGIR 2015, Tutorial
?
SIGIR 2015, Tutorial
Concepts
– We will go shopping next week
• Relying on ontologies or semantic networks
• Using concepts is a step away from blindly using keywords and word
occurrences
[Cambria 2013: An introduction to concept level sentiment analysis]
[Cambria et al., 2014]
SIGIR 2015, Tutorial
Semantic Relatedness - ESA
• Concepts representation (training)
Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis
[Gabrilovich, & Markovitch, 2007]
SIGIR 2015, Tutorial
[Diagram: a text fragment is first represented as a word vector (W1, W2, W3, …); each word is looked up in the weighted inverted index, which maps it to weighted Wikipedia concepts (C1, C2, C3, …, with Weight(Cj)); the combined result is the concept-vector representation of the document.]
Document Representation
SIGIR 2015, Tutorial
Similarity
SIGIR 2015, Tutorial
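A toy sketch of ESA as described above: three short "articles" stand in for Wikipedia concepts, a TF-IDF index plays the role of the weighted inverted index, and relatedness is the cosine between concept vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "Wikipedia": each document stands in for one concept/article.
concepts = ["Barack Obama", "White House", "Jaguar"]
articles = ["obama president united states senator",
            "white house president residence washington",
            "jaguar cat feline south america"]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(articles)          # weighted index (concepts x terms)

def esa_vector(text):
    """Project a text fragment onto the concept space (one weight per article)."""
    return vectorizer.transform([text]) @ index.T   # 1 x |concepts|

def esa_relatedness(text_a, text_b):
    return cosine_similarity(esa_vector(text_a), esa_vector(text_b))[0, 0]

print(esa_relatedness("the president spoke", "white house statement"))
```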
Word Sense Disambiguation
plant
SIGIR 2015, Tutorial
Wikipedia as a Sense Tagged Corpus
• Use hyperlinked concepts as sense annotations
• Extract senses and paragraphs of mentions of
the ambiguous word in Wikipedia articles
• Learn a classification model
Using Wikipedia for Automatic Word Sense Disambiguation
[Mihalcea, 2007]
SIGIR 2015, Tutorial
• In 1834, Sumner was admitted to the [[bar(law)|bar]] at
the age of twenty-three, and entered private practice in
Boston.
• It is danced in 3/4 time (like most waltzes), with the couple
turning approx. 180 degrees every [[bar(music)|bar]].
• Vehicles of this type may contain expensive audio players,
televisions, video players, and [[bar (counter)|bar]]s, often
with refrigerators.
• Jenga is a popular beer in the [[bar(establishment)|bar]]s
of Thailand.
• This is a disturbance on the water surface of a river
or estuary, often caused by the presence of a [[bar
(landform)|bar]] or dune on the riverbed.
Annotated Text
SIGIR 2015, Tutorial
• In 1834, Sumner was admitted to the
[[bar(law)|bar]] at the age of twenty-three, and
entered private practice in Boston.
• Features examples
– Current word and its POS
– Surrounding words and their POS
– The verb and noun in the vicinity
Example of Annotated Text
SIGIR 2015, Tutorial
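A minimal sketch of the sense-tagged-corpus idea above, using only bag-of-words/bigram context features rather than the POS features listed; the four link-annotated sentences are the training examples and the link targets are the sense labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data harvested from link annotations like [[bar (law)|bar]]:
# the surrounding sentence is the example, the link target is the sense label.
sentences = [
    "admitted to the bar at the age of twenty-three and entered private practice",
    "turning approx 180 degrees every bar in three four time",
    "expensive audio players televisions video players and bars with refrigerators",
    "a popular beer in the bars of thailand",
]
senses = ["bar (law)", "bar (music)", "bar (counter)", "bar (establishment)"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(sentences, senses)

print(clf.predict(["the lawyer passed the bar exam"]))  # hopefully 'bar (law)'
```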
Opinion Target Detection
SIGIR 2015, Tutorial
Challenge – Granularity Level
One can look at this review from
– Document level, i.e., is this review + or -?
– Sentence level, i.e., is each sentence + or -?
– Entity and feature/aspect level SIGIR 2015, Tutorial
Sentiment Lexicon
• Sentiment words are often the dominating
factor for sentiment analysis (Liu, 2012)
– good, excellent, bad, terrible
• Sentiment lexicon holds a score for each word
representing the degree of its sentiment
good
excellent
friendly
beautiful
+
bad
terrible
ugly
difficult
-
SIGIR 2015, Tutorial
Sentiment Lexicon
• General lexicon
– There is no general sentiment lexicon that is
optimal for every domain
– Sentiment of some terms is dependent on the
context
– Poor coverage
“The device was small and handy."
"The waiter brought the food on time, but the portion was
very small,"
SIGIR 2015, Tutorial
Yet Another Challenge: Polarity
Ambiguity
SIGIR 2015, Tutorial
Remedy- Identify Opinion Targets
Old wine or warm beer: Target-specific sentiment analysis of adjectives
[Fahrni & Klenner, 2008]
SIGIR 2015, Tutorial
• If ’cold coca cola’ is positive then ’cold coca cola
cherry’ is positive as well
cold
cool
warm
old
expensive
Coca Cola
Sushi
Pizza
cold
cool
warm
old
expensive
Coca Cola cherry
SIGIR 2015, Tutorial
Example (2): Product Attributes
Detection
• Discover the attributes that people
express opinions about
• Identifying all words that are included in
Wikipedia titles
Domain Independent Model for Product Attribute Extraction from User Reviews using Wikipedia.
[Kovelamudi, Ramalingam, Sood & Varma, 2011]
Excellent picture quality.. videoz are in
HD.. no complaintz from me. Never had
any trouble with gamez.. Paid WAAAAY
to much for it at the time th0.. it sellz
SIGIR 2015, Tutorial
Excellent picture quality.. videoz are in
HD.. no complaintz from me. Never had
any trouble with gamez.. Paid WAAAAY
to much for it at the time th0.. it sellz
now fer like a third the price I paid..
heheh.. oh well....the fact that I didn’t
wait a year er so to buy a bigger model
for half the price.. most likely from a different
store.. ..not namin any namez th0..
WSD
SIGIR 2015, Tutorial
Candidate attribute words (e.g., “model”, “price”) are scored with two features:
Semantic relatedness of candidate W_i to the other k candidates:
\frac{1}{k}\sum_{j=1,\; j \neq i}^{k} \mathrm{relatedness}(W_i, W_j)
Relative frequency of W_i in the matched page/collection P:
\frac{\mathrm{count}(W_i, P)}{\mathrm{count}(P)}
These scores are used as features.
SIGIR 2015, Tutorial
Lexicon Construction
(unsupervised)
• Label propagation
• Identify pairs of adjectives based on a set of
constraints
• Infer from known adjectives
“the room was good and wide”
good
bad
wide
29 Polarity=0.97
Unsupervised Common-Sense Knowledge Enrichment for Domain-Specific Sentiment Analysis
[Ofek, Rokach, Poria, Cambria, Hussain, Shabtai, 2015]
SIGIR 2015, Tutorial
Computing Polarity
SIGIR 2015, Tutorial
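A heavily simplified sketch of the label-propagation idea above: adjectives joined by "and" are assumed to share polarity, and polarity spreads from a small seed set (the scoring in the cited work is more elaborate than this).

```python
def propagate_polarity(seed_polarity, same_polarity_pairs, iterations=5):
    """Spread polarity from seed adjectives across pairs observed in
    conjunctions like '... good and wide ...' (an 'and' pair shares polarity)."""
    polarity = dict(seed_polarity)
    for _ in range(iterations):
        for a, b in same_polarity_pairs:
            if a in polarity and b not in polarity:
                polarity[b] = polarity[a]
            elif b in polarity and a not in polarity:
                polarity[a] = polarity[b]
    return polarity

seeds = {"good": 1.0, "bad": -1.0}
pairs = [("good", "wide"), ("wide", "spacious"), ("bad", "noisy")]
print(propagate_polarity(seeds, pairs))
# {'good': 1.0, 'bad': -1.0, 'wide': 1.0, 'spacious': 1.0, 'noisy': -1.0}
```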
WikiSent: Sentiment Analysis of Movie
Reviews
Wikisent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia
[Mukherjee & Bhattacharyya, 2012] SIGIR 2015, Tutorial
Feature Types
• Crew
• Cast
• Specific domain nouns from text content and
Plot
– wizard, witch, magic, wand, death-eater,
power
• Movie domain specific terms
– Movie, Staffing, Casting, Writing, Theory,
Rewriting, Screenplay, Format, Treatments,
Scriptments, Synopsis
SIGIR 2015, Tutorial
• Rank sentences according to participating entities
1. Title, crew
2. Movie domain specific terms
3. Plot
Sentiment classification
– Weakly supervised – identify words’ polarity from
general lexicons
Retrieve Relevant Opinionated Text
Sentence score = subjectivity + α·Σ(…) + β·Σ(…) − γ·Σ(…)
SIGIR 2015, Tutorial
Classify Blog Sentiment
• Use verb and adjectives categories
• Adjectives
– Positive
– Negative
• Verbs
– Positive verb classes, positive mental affecting,
approving, praising
– Negative verb classes, abusing, accusing,
arguing, doubting, negative mental affecting
Using verbs and adjectives to automatically classify blog sentiment
[Chesley, Vincent, Xu, & Srihari, 2006] SIGIR 2015, Tutorial
Expanding Adjectives
cont.
• Query Wiktionary
• Assumption: an adjective’s polarity is reflected
by the words that define it
• Maintain a set of adjectives with known polarity
• Count mentions of known adjectives to derive
new polarity
SIGIR 2015, Tutorial
SIGIR 2015, Tutorial
Recommender Systems
Session 3
SIGIR 2015, Tutorial
Recommender Systems
RS are software agents that elicit the
interests and preferences of individual
consumers […] and make recommendations
accordingly.
(Xiao & Benbasat 2007)
• Different system designs and algorithms
– Based on availability of data
– Domain characteristics
SIGIR 2015, Tutorial
SIGIR 2015, Tutorial
Content-based Recommendation
A General Vector-Space Approach
Matching Thresholding
Relevant
content
Feedback
Profile
Learning
Threshold
Learning
threshold
SIGIR 2015, Tutorial
Challenge- Dull Content
• Content does not contain enough information to
distinguish items the user likes from items the user
does not like
– Result: over-specificity – more of the same
– Vocabulary mismatch, limited aspects in content
to distinguish relevant items, synonymy, polysemy
– Result: bad recommendations!
SIGIR 2015, Tutorial
Remedy: Enriching background knowledge
Infusion of external sources
• Movie recommendations
• Three external sources:
– Dictionary – (controlled) extracts of lemma
descriptions - linguistic
– Wikipedia
• pages related to movies are transformed into
semantic vectors using matrix factorization
– User tags – on some sites people can add tags to
movies.
Knowledge Infusion into Movies Content Based Recommender Systems
[Semeraro, Lops and Basile 2009]
SIGIR 2015, Tutorial
• The data combined from all sources is represented
as a graph.
• The set of terms that describe an item is extended
using a spreading activation model that connects
terms based on semantic relations
Knowledge Infusion into Movies Content Based Recommender Systems
[Semeraro, Lops and Basile 2009]
SIGIR 2015, Tutorial
Process – Example
The Shining → keywords: axe, murder, paranormal, hotel, winter
Spreading activation over the external knowledge adds: perceptions, killer, psychiatry
Keyword search then retrieves “Carrie” and “The silence of the lambs”
SIGIR 2015, Tutorial
Tweets Content-Based
Recommendations
• GOAL : Re-rank tweet messages based
on similarity to user content profile
• Method: The user interest profile is represented as
two vectors: concepts from Wikipedia, and affinity with
other users.
The user's profile is expanded by a random walk on the Wikipedia
concepts graph, utilizing the inter-links between Wikipedia
articles.
SIGIR 2015, Tutorial
[Lu, Lam & Xang, 2012] Twitter User Modeling and Tweets Recommendation
based onWikipedia Concept Graph
Algorithm steps
1. Map a Twitter message to a set of concepts employing Explicit Semantic Analysis (ESA).
2. Generate a user profile from two vectors: a concept vector representing the topics the user is interested in, and a vector representing the affinity with other users.
3. To obtain related concepts, a random walk on the Wikipedia graph is applied.
4. The user profile and the tweet are each represented as a set of weighted Wikipedia concepts.
5. Cosine similarity is applied to compute the score between the profile and tweet messages.
SIGIR 2015, Tutorial
Results
SIGIR 2015, Tutorial
Collaborative Filtering
Description: the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future.
Main approaches:
• kNN – Nearest Neighbors
• SVD – Matrix Factorization
SIGIR 2015, Tutorial
Collaborative Filtering
Trying to predict the opinion the user will have on
the different items and be able to recommend the
“best” items to each user based on: the user’s
previous likings and the opinions of other like-minded users
The Idea
?
Positive
Rating
Negative
Rating
SIGIR 2015, Tutorial
The ratings (or events) of users and items are represented in a matrix
All CF methods are based on such a rating matrix
Collaborative Filtering Rating Matrix
Sample of a matrix: rows are the users in the system (U1 … Um), columns are the items in the system (I1 … In); each cell may hold a rating r of an item by a user (most cells are empty).
SIGIR 2015, Tutorial
“People who liked this also liked…”
Collaborative Filtering
Approach 1: Nearest Neighbors
• Item-to-Item
• User-to-User
User-to-User
 Recommendations are made by finding users with similar
tastes. Jane and Tim both liked Item 2 and disliked Item
3; it seems they might have similar taste, which suggests
that in general Jane agrees with Tim. This makes Item 1
a good recommendation for Tim.
Item-to-Item
 Recommendations are made by finding items that have
similar appeal to many users.
Tom and Sandra liked both Item 1 and Item 4. That
suggests that, in general, people who liked Item 4 will
also like Item 1, so Item 1 will be recommended to Tim.
This approach is scalable to millions of users and millions
of items.
SIGIR 2015, Tutorial
Some Math…..
Similarity between the active user a and user u (Pearson correlation over co-rated items):

w_{a,u} = \frac{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)^2}\,\sqrt{\sum_{i=1}^{m}(r_{u,i}-\bar{r}_u)^2}}

Prediction of user a's rating for item i:

p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n}(r_{u,i}-\bar{r}_u)\,w_{a,u}}{\sum_{u=1}^{n}|w_{a,u}|}
SIGIR 2015, Tutorial
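A small NumPy sketch of the two formulas above (NaN marks an unrated item; the toy matrix is made up).

```python
import numpy as np

def user_similarity(r_a, r_u):
    """Pearson-style similarity over items both users rated (NaN = unrated)."""
    both = ~np.isnan(r_a) & ~np.isnan(r_u)
    da, du = r_a[both] - np.nanmean(r_a), r_u[both] - np.nanmean(r_u)
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((du ** 2).sum())
    return (da * du).sum() / denom if denom else 0.0

def predict(a, i, R):
    """Predict user a's rating of item i from the other users' ratings."""
    num = den = 0.0
    for u in range(R.shape[0]):
        if u == a or np.isnan(R[u, i]):
            continue
        w = user_similarity(R[a], R[u])
        num += (R[u, i] - np.nanmean(R[u])) * w
        den += abs(w)
    return np.nanmean(R[a]) + (num / den if den else 0.0)

R = np.array([[5, 4, np.nan, 1],
              [4, 5, 2, np.nan],
              [1, np.nan, 5, 4.0]])
print(predict(0, 2, R))
```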
Computing the Item-to-Item Similarity
We must have:
For each customer ui the list of products bought by ui
For each product pj the list of users that bought that product
Amazon.com Recommendations Item-to-Item Collaborative
Filtering
[Greg Linden, Brent Smith, and Jeremy York], 2003, IEEE Internet Computing
SIGIR 2015, Tutorial
The Main Challenge – Lack of Data
• Sparseness
• Long Tail
– many items in the Long
Tail have only a few
ratings
• Cold Start
– System cannot draw any
inferences for users or
items about which it has
not yet gathered
sufficient data
SIGIR 2015, Tutorial
External Sources as a Remedy
Recommendation
Engine
User data
recommendations
Items data
External
Sources
External data: about users and items
From: WWW, social networks, other systems, other users'
devices, and Wikipedia
SIGIR 2015, Tutorial
Computing Similarities Using
Wikipedia for Sparse Matrices
• Utilize knowledge from Wikipedia to infer
the similarities between items/users
• Systems differ in what Wikipedia data is
used and how it is used
Similar????
SIGIR 2015, Tutorial
Example
Step 1: Map item to Wikipedia pages
– Generate several variations of each item’s (movie) name
– Compare the generated names with corresponding page
titles in Wikipedia.
– Choose the page with the largest number of categories that
contain the word "film" (e.g., "Films shot in Vancouver").
• Using this technique, 1512 items out of the 1682 (89.8%)
contained in the MovieLens database were matched
Using Wikipedia to Boost Collaborative Filtering Techniques , Recsys 2011
[Katz, Ofek, Shapira, Rokach, Shani, 2011],
SIGIR 2015, Tutorial
SIGIR 2015, Tutorial
Step 2: Use Wikipedia information to compute similarity between items
– Text similarity – calculate the Cosine-Similarity between item pages in Wikipedia
– Category Similarity – count mutual categories
– Link Similarity – count mutual outgoing links
SIGIR 2015, Tutorial
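A minimal sketch of the three Wikipedia-based item similarities above; the helper names and the decision to leave the category/link counts unnormalized are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(page_a, page_b):
    """Cosine similarity between the plain texts of two Wikipedia item pages."""
    tfidf = TfidfVectorizer().fit_transform([page_a, page_b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

def category_similarity(cats_a, cats_b):
    """Number of mutual categories (could also be Jaccard-normalized)."""
    return len(set(cats_a) & set(cats_b))

def link_similarity(links_a, links_b):
    """Number of mutual outgoing links."""
    return len(set(links_a) & set(links_b))

print(category_similarity({"1980 films", "Horror films"}, {"Horror films", "1976 films"}))
```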
Use Wikipedia to Generate “Artificial
Ratings”
Step 3: use the item similarity matrix and each user’s actual
ratings to generate additional, “artificial” ratings.
• i – the item for which we wish to generate a
rating
• K – the set of items with actual ratings
SIGIR 2015, Tutorial
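One plausible instantiation of step 3: the artificial rating is a similarity-weighted average of the user's actual ratings (the exact weighting in the cited paper may differ).

```python
def artificial_rating(i, actual_ratings, sim):
    """Weighted average of the user's actual ratings, weighted by the
    Wikipedia-based similarity between item i and each rated item k."""
    num = sum(sim(i, k) * r for k, r in actual_ratings.items())
    den = sum(sim(i, k) for k in actual_ratings)
    return num / den if den else None

# Toy example: the user rated two movies; estimate a third from similarities.
ratings = {"Carrie": 5, "Toy Story": 2}
similarities = {("The Shining", "Carrie"): 0.8, ("The Shining", "Toy Story"): 0.1}
print(artificial_rating("The Shining", ratings, lambda i, k: similarities[(i, k)]))
```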
Use Wikipedia to Generate “Artificial
Ratings”
Step 4: add the artificial ratings to the user-item
matrix
• Use the artificial ratings only when there’s no real value
• This matrix will be used for the collaborative filtering
SIGIR 2015, Tutorial
Results
• The sparser the initial matrix, the greater the
improvement
SIGIR 2015, Tutorial
Results – Collaborative Filtering
[Chart: RMSE as a function of data sparsity (0.94–1.0) for the various methods: basic item-item, categories, links, text, and IMDB text]
SIGIR 2015, Tutorial
Comparison – Wikipedia and IMDB
[Chart: comparison of words per movie: number of movie descriptions by number of words in text, Wikipedia vs. IMDB]
SIGIR 2015, Tutorial
Yet another CF algorithm - SVD
SIGIR 2015, Tutorial
Collaborative Filtering
Approach 2: Matrix Factorization
MF is a decomposition of a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix.
MF models users and items as vectors of latent features which produce the rating of the item by the user.
With MF, a matrix is factored into a series of linear approximations that expose the underlying structure of the matrix.
The goal is to uncover latent features that explain the observed ratings.
SIGIR 2015, Tutorial
Users & Ratings Latent Concepts or Factors
MF Process
abcdSVD
MF reveals
hidden
connections
and its
strength
abcdHidden Concept
Latent Factor Models - Example
User
Rating
abcdSVD
SIGIR 2015, Tutorial
Users & Ratings Latent Concepts or Factors
SVD
revealed
a movie
this user
might like!
abcdRecommendation
Latent Factor Models - Example
SIGIR 2015, Tutorial
09.08.2015
Latent Factor Models - Concept space
SIGIR 2015, Tutorial
Integrating Wikipedia to the SVD
Process
• Learn parameters for the actual and “artificial”
ratings
SIGIR 2015, Tutorial
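A minimal SGD matrix-factorization sketch in which Wikipedia-derived artificial ratings are simply down-weighted relative to actual ratings; the weighting scheme is an assumption used to illustrate "learning parameters for the actual and artificial ratings".

```python
import numpy as np

def train_mf(ratings, weights, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=50):
    """SGD matrix factorization; artificial (Wikipedia-derived) ratings get a
    lower weight than actual ratings so they guide, but do not dominate, training."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors
    for _ in range(epochs):
        for (u, i, r), w in zip(ratings, weights):
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - pu @ qi
            P[u] += lr * w * (err * qi - reg * pu)
            Q[i] += lr * w * (err * pu - reg * qi)
    return P, Q

# (user, item, rating) triples: first two actual, last one Wikipedia-based artificial.
ratings = [(0, 0, 5.0), (1, 1, 2.0), (0, 1, 4.0)]
weights = [1.0, 1.0, 0.3]   # artificial rating down-weighted (assumed scheme)
P, Q = train_mf(ratings, weights, n_users=2, n_items=2)
print(P @ Q.T)
```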
Using Wikipedia for
generating ontology for
recommender systems
SIGIR 2015, Tutorial
Example (1)-
• Content recommendations for scholars using
ontology to describe scholar’s profile
• Challenge: singular reference ontologies lack
sufficient ontological concepts and are unable to
represent the scholars’ knowledge.
A reference ontology for profiling scholar’s background knowledge in recommender
systems, 2014 [Bahram, Roliana, Mohd, Nematbakhsh]
SIGIR 2015, Tutorial
• Building ontology based profiles for modeling
background knowledge of scholars
• Build ontology by merging a few sources
• Wikipedia is used both as a knowledge source
and for merging ontologies (verifying the
semantic similarity between two candidate
terms)
SIGIR 2015, Tutorial
SIGIR 2015, Tutorial
SIGIR 2015, Tutorial
Example (2) Tag Recommendation
• Challenge- “bad” tags (e.g., misspelled,
irrelevant) lead to improper relationship
among items and ineffective searches for
topic information
SIGIR 2015, Tutorial
Ontology Based Tag Recommendation
• Integrate Wikipedia categories with WordNet
Synsets to create Ontology for tag recommendation
Effective Tag Recommendation System Based on Topic Ontology
using Wikipedia and WordNet [Subramaniyaswamy and Chenthur Pandian, 2012]
SIGIR 2015, Tutorial
SIGIR 2015, Tutorial
Results
SIGIR 2015, Tutorial
We demonstrated how Wikipedia
was used for:
• Ontology creation – RecSys, SA
• Semantic relatedness – RecSys, QE, QPP
• Synonym detection – CLIR, QE
• Relevance assessment – Entity search, QPP
• Disambiguation – SA, QE
• Domain-specific knowledge acquisition – RecSys, SA
SIGIR 2015, Tutorial
What was not Covered
• Using behaviors (views, edit)
• Examining time effect
• ……
• More tasks, more methods (QA, advertising…)
• Wikipedia weaknesses
• We only showed a few examples of the
Wikipedia power, and its potential
SIGIR 2015, Tutorial
Use Wikipedia, it is a Treasure…..
SIGIR 2015, Tutorial
Thank You
Complete references list:
http://vitiokm.wix.com/wikitutorial
SIGIR 2015, Tutorial
More Related Content

Viewers also liked

Onato - Bảo vệ bộ não toàn diện
Onato - Bảo vệ bộ não toàn diệnOnato - Bảo vệ bộ não toàn diện
Onato - Bảo vệ bộ não toàn diệnLinh Linh
 
Lebanese Animation - CSM presentation
Lebanese Animation - CSM presentationLebanese Animation - CSM presentation
Lebanese Animation - CSM presentationabouljoud86
 
Mi power point flores
Mi power point floresMi power point flores
Mi power point floresAilinTomas
 
English: That mysterious language - Go! idiomas
English: That mysterious language - Go! idiomasEnglish: That mysterious language - Go! idiomas
English: That mysterious language - Go! idiomasGo! idiomas
 
อุปกรณ์เครือข่ายคอมพิวเตอร์
อุปกรณ์เครือข่ายคอมพิวเตอร์อุปกรณ์เครือข่ายคอมพิวเตอร์
อุปกรณ์เครือข่ายคอมพิวเตอร์Aphisek Zilch
 
Lean startup - Agicalkväll #58
Lean startup -  Agicalkväll #58Lean startup -  Agicalkväll #58
Lean startup - Agicalkväll #58Bengt Nyman
 

Viewers also liked (15)

TIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCIONTIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCION
 
B2C/first part/day3
B2C/first part/day3B2C/first part/day3
B2C/first part/day3
 
Onato - Bảo vệ bộ não toàn diện
Onato - Bảo vệ bộ não toàn diệnOnato - Bảo vệ bộ não toàn diện
Onato - Bảo vệ bộ não toàn diện
 
TIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCIONTIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCION
 
Lebanese Animation - CSM presentation
Lebanese Animation - CSM presentationLebanese Animation - CSM presentation
Lebanese Animation - CSM presentation
 
Day2 mkt rev
Day2 mkt revDay2 mkt rev
Day2 mkt rev
 
TIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCIONTIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCION
 
Mi power point flores
Mi power point floresMi power point flores
Mi power point flores
 
TIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCIONTIPOS O NIVELES DE PREVENCION
TIPOS O NIVELES DE PREVENCION
 
English: That mysterious language - Go! idiomas
English: That mysterious language - Go! idiomasEnglish: That mysterious language - Go! idiomas
English: That mysterious language - Go! idiomas
 
อุปกรณ์เครือข่ายคอมพิวเตอร์
อุปกรณ์เครือข่ายคอมพิวเตอร์อุปกรณ์เครือข่ายคอมพิวเตอร์
อุปกรณ์เครือข่ายคอมพิวเตอร์
 
Lean startup - Agicalkväll #58
Lean startup -  Agicalkväll #58Lean startup -  Agicalkväll #58
Lean startup - Agicalkväll #58
 
P pt ionic formulae
P pt ionic formulaeP pt ionic formulae
P pt ionic formulae
 
Rol del Docente
Rol del DocenteRol del Docente
Rol del Docente
 
Creating Story
Creating StoryCreating Story
Creating Story
 

Similar to Exploiting Wikipedia for Information Retrieval Tasks

Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...
Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...
Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...Amélie Gyrard
 
Presentation1.pptn
Presentation1.pptnPresentation1.pptn
Presentation1.pptnMariamHmoud
 
WikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributionsWikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributionsAll Things Open
 
Building Real Time, Open-Source Tools for Wikipedia
Building Real Time, Open-Source Tools for WikipediaBuilding Real Time, Open-Source Tools for Wikipedia
Building Real Time, Open-Source Tools for WikipediaFITC
 
Semantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in UseSemantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in UseJesse Wang
 
Defining iot.schema.org: Using Knowledge Extraction from Existing IoT-based ...
Defining iot.schema.org: Using Knowledge Extraction from  Existing IoT-based ...Defining iot.schema.org: Using Knowledge Extraction from  Existing IoT-based ...
Defining iot.schema.org: Using Knowledge Extraction from Existing IoT-based ...Amélie Gyrard
 
SIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social Web
SIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social WebSIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social Web
SIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social WebJohn Breslin
 
The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...Dario Taraborelli
 
Reflections On Personal Experiences In Using Wikis
Reflections On Personal Experiences In Using WikisReflections On Personal Experiences In Using Wikis
Reflections On Personal Experiences In Using Wikislisbk
 
Role of Wikipedia in Modern Education
Role of Wikipedia in Modern EducationRole of Wikipedia in Modern Education
Role of Wikipedia in Modern EducationIRJET Journal
 
60 website evaluation and testing with wcag 2
60 website evaluation and testing with wcag 260 website evaluation and testing with wcag 2
60 website evaluation and testing with wcag 2AEGIS-ACCESSIBLE Projects
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationMarieke van Erp
 
Advanced Wikipedia Editing Workshop
Advanced Wikipedia Editing WorkshopAdvanced Wikipedia Editing Workshop
Advanced Wikipedia Editing Workshopdorohoward
 
Open ILRI
Open ILRIOpen ILRI
Open ILRIILRI
 
Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...
Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...
Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...April Sides
 
Estermann Wikidata and Heritage Data 20170914
Estermann Wikidata and Heritage Data 20170914Estermann Wikidata and Heritage Data 20170914
Estermann Wikidata and Heritage Data 20170914Beat Estermann
 
What Does DITA Have To Do With Wiki
What Does DITA Have To Do With WikiWhat Does DITA Have To Do With Wiki
What Does DITA Have To Do With WikiAnne Gentle
 

Similar to Exploiting Wikipedia for Information Retrieval Tasks (20)

Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...
Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...
Knowledge Extraction for the Web of Things (KE4WoT) Challenge: Co-located wit...
 
Presentation1.pptn
Presentation1.pptnPresentation1.pptn
Presentation1.pptn
 
WikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributionsWikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributions
 
Building Real Time, Open-Source Tools for Wikipedia
Building Real Time, Open-Source Tools for WikipediaBuilding Real Time, Open-Source Tools for Wikipedia
Building Real Time, Open-Source Tools for Wikipedia
 
Semantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in UseSemantic Wiki: Social Semantic Web in Use
Semantic Wiki: Social Semantic Web in Use
 
Defining iot.schema.org: Using Knowledge Extraction from Existing IoT-based ...
Defining iot.schema.org: Using Knowledge Extraction from  Existing IoT-based ...Defining iot.schema.org: Using Knowledge Extraction from  Existing IoT-based ...
Defining iot.schema.org: Using Knowledge Extraction from Existing IoT-based ...
 
SIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social Web
SIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social WebSIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social Web
SIOC Tactics: Gaining Acceptance for a Semantic Web Vocabulary on the Social Web
 
Intranet 2.0: Using Wikis
Intranet 2.0: Using WikisIntranet 2.0: Using Wikis
Intranet 2.0: Using Wikis
 
The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...The sum of all human knowledge in the age of machines: A new research agenda ...
The sum of all human knowledge in the age of machines: A new research agenda ...
 
Reflections On Personal Experiences In Using Wikis
Reflections On Personal Experiences In Using WikisReflections On Personal Experiences In Using Wikis
Reflections On Personal Experiences In Using Wikis
 
Role of Wikipedia in Modern Education
Role of Wikipedia in Modern EducationRole of Wikipedia in Modern Education
Role of Wikipedia in Modern Education
 
60 website evaluation and testing with wcag 2
60 website evaluation and testing with wcag 260 website evaluation and testing with wcag 2
60 website evaluation and testing with wcag 2
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and Visualisation
 
Advanced Wikipedia Editing Workshop
Advanced Wikipedia Editing WorkshopAdvanced Wikipedia Editing Workshop
Advanced Wikipedia Editing Workshop
 
Open ILRI
Open ILRIOpen ILRI
Open ILRI
 
Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...
Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...
Drupal Camp Asheville 2014: Assessing the 2014 National Climate Assessment We...
 
Social Media Dataset
Social Media DatasetSocial Media Dataset
Social Media Dataset
 
Estermann Wikidata and Heritage Data 20170914
Estermann Wikidata and Heritage Data 20170914Estermann Wikidata and Heritage Data 20170914
Estermann Wikidata and Heritage Data 20170914
 
Lecture4 Social Web
Lecture4 Social Web Lecture4 Social Web
Lecture4 Social Web
 
What Does DITA Have To Do With Wiki
What Does DITA Have To Do With WikiWhat Does DITA Have To Do With Wiki
What Does DITA Have To Do With Wiki
 

Recently uploaded

Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxAleenaTreesaSaji
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxyaramohamed343013
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 

Recently uploaded (20)

Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 

Exploiting Wikipedia for Information Retrieval Tasks

  • 19. Systems that Use Information from Wikipedia The task – Goal of the system •Query operations •Recommendation •Sentiment Analysis •Ontology building •….. The challenge for which Wikipedia is a remedy •Sparseness •Ambiguity •Cost of manual labor •Lack of information •Understanding /perception •………. Utilized Information •Concepts/pages, links, categories, redirect pages, views, edits……. Algorithms & techniques •How items are matched with Wikipedia pages •How data is extracted •How Wikipedia data is utilized •How the similarity between concepts is computed •……….. SIGIR 2015, Tutorial
  • 20. IR & Wikipedia • Wikipedia as a collection is: – enormous – timely – reflects crowd wisdom • connections between entities in Wikipedia represent the way a large number of people view them • computers cannot understand “concepts” and cannot relate things the way humans do – accessible (free!) – accurate – broad coverage • Weaknesses – incomplete – imbalanced – no complete citations SIGIR 2015, Tutorial
  • 21. IR & Wikipedia • Wikipedia is used for – enhancing the performance of IR systems (mainly relevance) • Challenge – Distillation of knowledge from such a large amount of un-/semi-structured information is an extremely difficult task • The contents of today’s Web are perfectly suitable for human consumption, but remain hardly accessible to machines. SIGIR 2015, Tutorial
  • 22. How to Use Wikipedia for Your Own Research SIGIR 2015, Tutorial
  • 23. Structured Storage Pages Categories Links Pages Paragraphs Redirection Pages Queries Documents Collection (TREC-X) Schema 2 Schema 1 SIGIR 2015, Tutorial
  • 24. Partial Diagram of Wikipedia’s Structured meta-Data [Created by Gilad Katz] SIGIR 2015, Tutorial
  • 25. Wikipedia Download Client Apps: XOWA, WikiTaxi, WikiReader,…. 16 Offline tools for Downloading - 53GB of disk space Page views download - size and structure 50GB per hour EnwikiContentSource Wikipedia Miner (Milne and Witten) [An open source toolkit for mining Wikipedia, 2012] XiaoxiaoLi/getWikipediaMetaData SIGIR 2015, Tutorial
  • 26. Wikipedia Download Google for : Wikipedia dump files download https://dumps.wikimedia.org/enwiki/ Torrent: https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki Your and other’s code to get plain text! SIGIR 2015, Tutorial
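As a rough illustration of the “your own code to get plain text” step, here is a minimal sketch (assuming a locally downloaded enwiki pages-articles dump and the third-party mwparserfromhell package; the XML namespace string may differ across dump versions) that streams pages out of the compressed dump and strips the wiki markup:

```python
# Minimal sketch: stream article plain text out of a Wikipedia XML dump.
import bz2
import xml.etree.ElementTree as ET
import mwparserfromhell

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace may differ per dump version

def iter_plain_text(dump_path):
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                wikitext = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, mwparserfromhell.parse(wikitext).strip_code()
                elem.clear()  # free memory for already-processed pages

if __name__ == "__main__":
    for title, text in iter_plain_text("enwiki-latest-pages-articles.xml.bz2"):
        print(title, text[:80])
        break
```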
  • 27. DBPedia • ~4.9 million articles, ~4.2 million of which are classified into a consistent ontology • Persons, places, organizations, diseases • An effort to transform the knowledge of Wikipedia into a “tabular-like” format • Sophisticated database-query language (SPARQL) - Open Data - Linked Open Data UGC CGC SIGIR 2015, Tutorial
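The database-query language behind DBpedia is SPARQL. A minimal sketch, assuming the public https://dbpedia.org/sparql endpoint is reachable and that the JSON results format is acceptable, fetching the English abstract of one resource:

```python
# Minimal sketch: ask DBpedia (SPARQL over HTTP) for the English abstract of one entity.
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Barack_Obama> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

resp = requests.get(ENDPOINT, params={"query": QUERY,
                                       "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["abstract"]["value"][:200])
```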
  • 28. Search Related Tasks Session 1 SIGIR 2015, Tutorial
  • 29. Crawl Control Search Engine Basic Structure Crawlers Ranking Indexer Page Repository Query Engine Collection Analysis Queries Results Indexes SIGIR 2015, Tutorial
  • 30. Query Operations Agenda • Query Expansion • Cross Language Information Retrieval • Entity Search • Query Performance Prediction SIGIR 2015, Tutorial
  • 32. How to Describe the Unknown? Meno: “And how will you enquire, Socrates, into that which you do not know? What will you put forth as the subject of enquiry? And if you find what you want, how will you ever know that this is the thing which you did not know? ” Written 380 B.C.E By Plato SIGIR 2015, Tutorial
  • 33. Automatic QE - A process where the user’s original query is augmented by new features with similar meaning. 1. What you know and what you wish to know 2. Initial vague query and concrete topics and terminology How to Describe the Unknown? The average length of an initial query at prominent search engines was 2.4 terms in 2001 and 2.08 in 2009; it is around 3.1 nowadays (and growing…) [Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic (2001)] “Searching the web: The public and their queries” [Taghavi, Wills, Patel (2011)] An analysis of web proxy logs with query distribution pattern approach for search engines SIGIR 2015, Tutorial
  • 34. Wikipedia Based Query Expansion • Wikipedia is: • Rich, Highly Interconnected, Domain independent “The fourth generation iPad (originally marketed as iPad with Retina display, retrospectively marketed as the iPad 4) is a tablet computer produced and marketed by Apple Inc.” Wikipedia SIGIR 2015, Tutorial
  • 35. Knowledge based search engine powered by Wikipedia. [Milne, Witten and Nichols](2007). Thesaurus Based QE Initial query Wikipedia based thesaurus augmented by new features with similar meaning SIGIR 2015, Tutorial
  • 36. Semantic Relatedness is quantified. Consider Jackson Relatedness: Co-occurrence statistics of terms and of links (ESA as an alternative) Synonyms: Redirect Pages No NER process needed! Relevant to particular document collection Manual definition vs. automatic generation KORU [Milne, Witten and Nichols](2007). SIGIR 2015, Tutorial
  • 37. Query Suggestion query = obama white house Task: Given a query, produce a ranked list of concepts from Wikipedia which are mentioned or meant in the query Quiz Time! SIGIR 2015, Tutorial
  • 38. Learning Semantic Query Suggestions. query = obama white house Barack Obama White House Correct concepts Use of some ranking (e.g. language modeling) approach to score the concepts (articles) in Wikipedia, where each n-gram is considered as a query in its turn 1. Candidates Generation [Meij, E., M. Bron, L. Hollink, B. Huurnink and M. De Rijke] (2009) SIGIR 2015, Tutorial
  • 39. • Supervised machine learning approach • Input: a set of labeled examples (query-to-concept mappings) • Types of features • N-gram • Concept • N-gram–concept combination • Current search history 2. Candidates Selection # of concepts linking to c # of concepts linked from c # of associated categories # of redirect pages to c Importance of c in the query (TF-IDF of Q in c) SIGIR 2015, Tutorial
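A hedged sketch of the concept-side features listed above (inlink/outlink counts, associated categories, redirects, and the TF-IDF weight of the query inside the candidate article). The lookup structures inlinks, outlinks, categories, redirects and idf are hypothetical precomputed dictionaries, not part of the cited paper's released code:

```python
# Minimal sketch: concept-side features for selecting query-to-concept candidates.
# inlinks / outlinks / categories / redirects are hypothetical precomputed dicts
# mapping an article title to the corresponding set; idf maps a term to its IDF.
from collections import Counter

def concept_features(concept, query_terms, article_text,
                     inlinks, outlinks, categories, redirects, idf):
    tf = Counter(article_text.lower().split())
    tfidf_q = sum(tf[t] * idf.get(t, 0.0) for t in query_terms)  # query importance inside the concept
    return {
        "n_inlinks": len(inlinks.get(concept, ())),
        "n_outlinks": len(outlinks.get(concept, ())),
        "n_categories": len(categories.get(concept, ())),
        "n_redirects": len(redirects.get(concept, ())),
        "query_tfidf_in_concept": tfidf_q,
    }
```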
  • 40. Cross Language Information Retrieval (CLIR) SIGIR 2015, Tutorial
  • 42. CLIR Task Query in a source language Query in a target language Collection translation is not scalable! The solution: SIGIR 2015, Tutorial
  • 43. WikiTranslate: Query Translation for Cross-lingual Information Retrieval using only Wikipedia [D. Nguyen, A.Overwijk, C.Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong] 2008 • Stage 1: Mapping the query to Wikipedia concepts • Full query search (“obama white house”) • Partial query search (N-Grams) • Sole N-Grams in links of the top retrieved documents (“house”) • Whole query with weighted n-grams (“obama white house”) • Possible to combine search in different fields within a document: • (title: obama) (content: white house) – to avoid missing terms Generating Translation Candidates SIGIR 2015, Tutorial
  • 44. Stage 2: Creating the final expanded query terms weighting according to Stage 1 and Wikipedia pages analysis: • Concept translation from cross lingual links • Try synonyms and spelling variants • Translated term weighting : Concepts obtained from the whole query search are more important Creating the Final Translated Query Obama White House Weiße Haus ^ 1.0 Barack Obama ^ 0.5 SIGIR 2015, Tutorial
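A minimal sketch of the cross-lingual-link step, assuming the live MediaWiki API (prop=langlinks) is an acceptable stand-in for a dump-based lookup; the 1.0 vs. 0.5 weights simply follow the slide's example of favouring whole-query concepts:

```python
# Minimal sketch: translate a Wikipedia concept title via its cross-language links.
import requests

API = "https://en.wikipedia.org/w/api.php"

def translate_concept(title, target_lang="de"):
    params = {"action": "query", "prop": "langlinks", "titles": title,
              "lllang": target_lang, "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    for page in pages.values():
        for ll in page.get("langlinks", []):
            return ll["*"]          # title of the linked article in the target language
    return None

# Concepts found for the whole query get a higher weight than partial matches.
weighted_query = [(translate_concept("White House"), 1.0),
                  (translate_concept("Barack Obama"), 0.5)]
print(weighted_query)
```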
  • 46. Entity Search Retrieving a set of entities in response to user’s query query = “United States presidents” SIGIR 2015, Tutorial
  • 47. Passages Entities Entity Ranking: Graph centrality measures: Degree. Combined with inverse entity frequency Ranking very many typed entities on Wikipedia [Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita, and Giuseppe Attardi] 2007 SIGIR 2015, Tutorial
  • 48. • Utilizing Wikipedia Category Structure • Query-Categories (QC) • Entity-Categories (EC) • Distance(QC,EC)= • if QC and EC have a common category then Distance = 0 • else: distance = minimal path length Category Based Semantic Distance Between Entities A ranking framework for entity oriented search using Markov random fields. [Raviv, H., D. Carmel and O. Kurland ](2012). PUC SIGIR 2015, Tutorial
  • 49. Category Based Distance Example Query categories: novels, books If entity category is novels, then distance = zero If entity category is books by Paul Auster, then distance = 2 SIGIR 2015, Tutorial
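To make the distance concrete, here is a small sketch assuming the category graph is available as an adjacency dictionary parent_categories (category → set of parent categories); it illustrates the zero-distance / minimal-path-length idea only, not the full MRF ranking framework of the cited paper:

```python
# Minimal sketch: category-based semantic distance between a query and an entity.
from collections import deque

def category_distance(query_cats, entity_cats, parent_categories, max_depth=5):
    """0 if the two sets share a category, otherwise the shortest path length
    found by breadth-first search upward from the entity categories."""
    if set(query_cats) & set(entity_cats):
        return 0
    frontier = deque((c, 0) for c in entity_cats)
    seen, targets = set(entity_cats), set(query_cats)
    while frontier:
        cat, d = frontier.popleft()
        if d >= max_depth:
            continue
        for parent in parent_categories.get(cat, ()):
            if parent in targets:
                return d + 1
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, d + 1))
    return max_depth  # treated as "far" when no path is found within the depth limit

# e.g. with "books by Paul Auster" -> "books by author" -> "books" in the graph,
# category_distance({"books"}, {"books by Paul Auster"}, graph) returns 2, as on the slide.
```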
  • 50. The INEX Ranking Task • The goal of the track is to evaluate how well systems can rank entities in response to a query • Entities are assumed to correspond to Wikipedia entries Used category, association, link structure Took place in 2007, 2008, 2009 Overview of the INEX 2007 Entity Ranking Track [Arjen P. de Vries, Anne-Marie Vercoustre, James A. Thom, Nick Craswell, Mounia Lalmas.] SIGIR 2015, Tutorial
  • 51. Query Performance Prediction “The prophecy was given to the infants and the fools” SIGIR 2015, Tutorial
  • 52. Which Query is Better? SIGIR 2015, Tutorial
  • 53. Benefits of Estimating the Query Difficulty Reproduced from a tutorial: David Carmel and Oren Kurland. Query performance prediction for IR. SIGIR 2012 Feedback to users User can rephrase a difficult query Feedback to the search engines Alternative retrieval strategy Feedback to system administrator Missing content For IR applications Federated search over different datasets SIGIR 2015, Tutorial
  • 54. Query Performance Prediction Query1 = “obama white house” Query2 = “weather in Israel” Prediction mechanism → prediction value: is Q1 better than Q2? e.g., QueryLength(Q1) = 3, computed as Query.split().length() SIGIR 2015, Tutorial
  • 55. Retrieval Effectiveness Predictors’ values TREC Collections (topics, relevance judgments) Regression Framework MAP SIGIR 2015, Tutorial
  • 56. Query Performance Prediction in Ad-Hoc Retrieval Estimating query effectiveness when relevance judgments are not available SIGIR 2015, Tutorial
  • 57. Query Performance Prediction • Pre-Retrieval Prediction • Operate prior to retrieval time • Analyze the query and the corpus • Computationally efficient • Post-Retrieval Prediction • Analyze the result list of the most highly ranked documents • Superior performance SIGIR 2015, Tutorial
  • 58. Regardless of corpus, but with external knowledge Absolute Query Difficulty 1. Corpus independent! 2. Information induced from Wikipedia 3. Advantage for non-cooperative search where corpus-based information is not available Wikipedia-based query performance prediction. [Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] 2014. New Prediction Type? SIGIR 2015, Tutorial
  • 59. Wikipedia Properties Title Content Links Categories Wikipedia-based query performance prediction. [Gilad Katz, Anna Shtock, Oren Kurland, Bracha Shapira, and Lior Rokach] 2014. SIGIR 2015, Tutorial
  • 60. Notational Conventions A page p is associated with a set of terms if its title contains at least one of the terms in the set (soft mapping). q = “Barack Obama White House” → associated page; Maximal Exact Match Length (MEML) = 2; M_MEML = the set of pages for which the MEML holds SIGIR 2015, Tutorial
  • 61. Titles (measuring queries) Size of subquery which has an exact match 1. Maximum 2. Average Quiz Time! q= “Barack Obama White House” Maximum – 2 Average – 2 SIGIR 2015, Tutorial
  • 62. Titles & Content(measuring pages) Number of pages associated with a subquery (fixed length = 1,2,3) The average length (# terms) of the pages that are exact match or maximal exact match SumAverage Standard Deviation “scope” of q in Wikipedia SIGIR 2015, Tutorial
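A rough sketch of these title-based predictors, assuming titles is an in-memory list of lowercased Wikipedia page titles and reading “exact match” as string equality between a contiguous subquery and a title; a real implementation would use an index rather than a linear scan:

```python
# Minimal sketch: title-based predictors for Wikipedia query-performance prediction.
def title_predictors(query, titles):
    """titles: iterable of lowercased Wikipedia page titles (assumed precomputed)."""
    terms = query.lower().split()
    title_set = set(titles)
    title_term_sets = [set(t.split()) for t in titles]

    # MEML: length of the longest contiguous subquery that exactly matches a page title.
    meml = 0
    for k in range(len(terms), 0, -1):
        ngrams = (" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1))
        if any(g in title_set for g in ngrams):
            meml = k
            break

    # "Scope" of q: number of titles containing exactly one / exactly two query terms.
    query_terms = set(terms)
    one = sum(1 for ts in title_term_sets if len(ts & query_terms) == 1)
    two = sum(1 for ts in title_term_sets if len(ts & query_terms) == 2)
    return {"meml": meml, "titles_with_1_term": one, "titles_with_2_terms": two}

# e.g. "barack obama white house" should give meml == 2 if the title "barack obama" exists
```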
  • 63. Examples
Query | Max subquery match | # pages containing one query term in the title | # pages containing two query terms in the title
Horse Leg | 1 | 7885 | 3
Poliomyelitis and Post-Polio | 2 | 6605 | 8
Hubble Telescope Achievements | 2 | 460 | 4
Endangered Species (Mammals) | 1 | 3481 | 96
Most Dangerous Vehicles | 1 | 3978 | 6
African Civilian Deaths | 2 | 23381 | 14
New Hydroelectric Projects | 1 | 93858 | 33
Implant Dentistry | 1 | 268 | 0
Rap and Crime | 2 | 3826 | 1
Radio Waves and Brain Cancer | 2 | 15460 | 33
SIGIR 2015, Tutorial
  • 64. Links and Categories Overall # of links that contain at least one of query’s terms in their anchor text # of categories that appear in at least one of the pages associated with a subquery # of in/outgoing links for Wikipedia page in MMEML Average Standard Deviation SIGIR 2015, Tutorial
  • 65. Links for Query Coherency Prediction # of links from pages associated with a single-term subquery that point to pages associated with another subquery q= “Barack Obama White House” MaximumAverage Standard Deviation SIGIR 2015, Tutorial
  • 66. Examples
Query | Average # pages containing at least one query term in a link | Standard deviation of # of categories that contain one query term in the title | Link-based coherency (average)
Horse Leg | 1821.5 | 34.7211 | 1731.5
Poliomyelitis and Post-Polio | 302 | 11.6409 | 1120.333
Hubble Telescope Achievements | 152.3333 | 11.19255 | 55.66667
Endangered Species (Mammals) | 5753.333 | 64.60477 | 686
Most Dangerous Vehicles | 376.6667 | 13.4316 | 396.3333
African Civilian Deaths | 1673.333 | 18.53009 | 4172.333
New Hydroelectric Projects | 1046 | 50.96217 | 121477.3
Implant Dentistry | 209.5 | 13.45904 | 10.5
Rap and Crime | 1946 | 35.30528 | 661.5
Radio Waves and Brain Cancer | 1395.5 | 24.81557 | 2202.5
SIGIR 2015, Tutorial
  • 67. Query Performance Prediction - Summary • Absolute query difficulty, corpus independent • # of pages containing a subquery in the title – the most effective predictor • Coherency is the 2nd • Integration with state-of-the-art predictors leads to superior performance (Wikipedia-based-clarity-score) SIGIR 2015, Tutorial
  • 68. “The prophecy was given to the infants and the fools” Query-Performance Prediction: Setting the Expectations Straight [Fiana Raiber and Oren Kurland] (2014) Quiz Time! Storming of the Bastille! SIGIR 2015, Tutorial
  • 69. Exploiting Wikipedia for Sentiment Analysis Session 2 SIGIR 2015, Tutorial
  • 70. Introduction • Sentiment analysis or opinion mining The location is indeed perfect, across the road from Hyde Park and Kensington Palace. The building itself is charming, …The room, which was an upgrade, was ridiculously small. There was a persistent unpleasant smell. The bathroom sink didn't drain properly. The TV didn't work. The location is indeed perfect SIGIR 2015, Tutorial
  • 71. Leveraging with Language Understanding • Since they leave your door wide open when they come clean your room , the mosquitoes get in your room and they buzz around at night • There were a few mosquitoes , but nothing that a little bug repellant could not solve • It seems like people on the first floor had more issues with mosquitoes • Also there was absolutely no issue with mosquitoes on the upper floors SIGIR 2015, Tutorial
  • 72. Challenge – Common Sense SIGIR 2015, Tutorial
  • 73. Wikipedia as a Knowledgebase SIGIR 2015, Tutorial
  • 75. Concepts – We will go shopping next week • Relying on ontologies or semantic networks • Using concepts steps away from blindly using keywords and words occurrences [Cambria 2013: An introduction to concept level sentiment analysis] [Cambria et al., 2014] SIGIR 2015, Tutorial
  • 76. Semantic Relatedness - ESA • Concepts representation (training) Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis [Gabrilovich, & Markovitch, 2007] SIGIR 2015, Tutorial
  • 77. Document Representation [diagram]: a text fragment is first represented as a word vector (W1, W2, W3, …); a weighted inverted index maps each word to the Wikipedia concepts in which it appears (C1, …, C4), each with a weight Weight(Cj), yielding a concept vector. SIGIR 2015, Tutorial
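A compact sketch of the ESA idea, assuming concept_texts is a dictionary from Wikipedia concept (article) titles to their plain text; scikit-learn's TfidfVectorizer plays the role of the weighted inverted index, and relatedness is the cosine between the two concept vectors:

```python
# Minimal sketch: ESA-style concept vectors and semantic relatedness.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_esa(concept_texts):
    names = list(concept_texts)
    vectorizer = TfidfVectorizer(stop_words="english")
    concept_matrix = vectorizer.fit_transform(concept_texts[c] for c in names)  # concepts x terms
    return names, vectorizer, concept_matrix

def concept_vector(text, vectorizer, concept_matrix):
    # project a text fragment onto the concept space: similarity to every concept
    return cosine_similarity(vectorizer.transform([text]), concept_matrix)[0]

def relatedness(text_a, text_b, vectorizer, concept_matrix):
    va = concept_vector(text_a, vectorizer, concept_matrix)
    vb = concept_vector(text_b, vectorizer, concept_matrix)
    return cosine_similarity([va], [vb])[0, 0]
```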
  • 80. Wikipedia as a Sense Tagged Corpus • Use hyperlinked concepts as sense annotations • Extract senses and paragraphs of mentions of the ambiguous word in Wikipedia articles • Learn a classification model Using Wikipedia for Automatic Word Sense Disambiguation [Mihalcea, 2007] SIGIR 2015, Tutorial
  • 81. • In 1834, Sumner was admitted to the [[bar(law)|bar]] at the age of twenty-three, and entered private practice in Boston. • It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar(music)|bar]]. • Vehicles of this type may contain expensive audio players, televisions, video players, and [[bar (counter)|bar]]s, often with refrigerators. • Jenga is a popular beer in the [[bar(establishment)|bar]]s of Thailand. • This is a disturbance on the water surface of a river or estuary, often caused by the presence of a [[bar (landform)|bar]] or dune on the riverbed. Annotated Text SIGIR 2015, Tutorial
  • 82. • In 1834, Sumner was admitted to the [[bar(law)|bar]] at the age of twenty-three, and entered private practice in Boston. • Features examples – Current word and its POS – Surrounding words and their POS – The verb and noun in the vicinity Example of Annotated Text SIGIR 2015, Tutorial
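A sketch of turning such links into training data: a regex collects [[sense article|surface form]] mentions of the ambiguous word from article wikitext, and a simple bag-of-words context classifier stands in for the richer POS-based features listed above. The wikitexts variable (an iterable of article wikitext) is hypothetical:

```python
# Minimal sketch: Wikipedia links as sense-tagged training data for WSD.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LINK = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")   # [[target|surface]]

def collect_examples(wikitexts, ambiguous_word="bar", window=40):
    contexts, senses = [], []
    for text in wikitexts:
        for m in LINK.finditer(text):
            target, surface = m.group(1), m.group(2)
            if surface.lower() == ambiguous_word:
                start, end = max(0, m.start() - window), m.end() + window
                contexts.append(text[start:end])
                senses.append(target)          # e.g. "bar (law)", "bar (music)"
    return contexts, senses

# wikitexts is a hypothetical corpus of article wikitext
contexts, senses = collect_examples(wikitexts)
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(contexts, senses)
print(clf.predict(["He was admitted to the bar and practiced law in Boston."]))
```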
  • 84. Challenge – Granularity Level One can look at this review from – Document level, i.e., is this review + or -? – Sentence level, i.e., is each sentence + or -? – Entity and feature/aspect level SIGIR 2015, Tutorial
  • 85. Sentiment Lexicon • Sentiment words are often the dominating factor for sentiment analysis (Liu, 2012) – good, excellent, bad, terrible • Sentiment lexicon holds a score for each word representing the degree of its sentiment good excellent friendly beautiful + bad terrible ugly difficult - SIGIR 2015, Tutorial
  • 86. Sentiment Lexicon • General lexicon – There is no general sentiment lexicon that is optimal in any domain – Sentiment of some terms is dependent on the context – Poor coverage “The device was small and handy." "The waiter brought the food on time, but the portion was very small," SIGIR 2015, Tutorial
  • 87. Yet Another Challenge: Polarity Ambiguity SIGIR 2015, Tutorial
  • 88. Remedy- Identify Opinion Targets Old wine or warm beer: Target-specific sentiment analysis of adjectives [Fahrni & Klenner, 2008] SIGIR 2015, Tutorial
  • 89. • If ‘cold coca cola’ is positive then ‘cold coca cola cherry’ is positive as well [adjective–target grid: cold, cool, warm, old, expensive × Coca Cola, Sushi, Pizza, Coca Cola cherry] SIGIR 2015, Tutorial
  • 90. Example (2): Product Attributes Detection • Discover what are the attributes that people express opinions about • Identifying all words that are included in Wikipedia titles Domain Independent Model for Product Attribute Extraction from User Reviews using Wikipedia. [Kovelamudi, Ramalingam, Sood & Varma, 2011] Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz SIGIR 2015, Tutorial
  • 91. Excellent picture quality.. videoz are in HD.. no complaintz from me. Never had any trouble with gamez.. Paid WAAAAY to much for it at the time th0.. it sellz now fer like a third the price I paid.. heheh.. oh well....the fact that I didn’t wait a year er so to buy a bigger model for half the price.. most likely from a different store.. ..not namin any namez th0.. WSD SIGIR 2015, Tutorial
  • 93. Lexicon Construction (unsupervised) • Label propagation • Identify pairs of adjectives based on a set of constraints • Infer from known adjectives “the room was good and wide” good bad wide 29 Polarity=0.97 Unsupervised Common-Sense Knowledge Enrichment for Domain-Specific Sentiment Analysis [Ofek, Rokach, Poria, Cambria, Hussein, Shabtai, 2015] SIGIR 2015, Tutorial
  • 95. WikiSent: Sentiment Analysis of Movie Reviews Wikisent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia [Mukherjee & Bhattacharyya, 2012] SIGIR 2015, Tutorial
  • 96. Feature Types • Crew • Cast • Specific domain nouns from text content and Plot – wizard, witch, magic, wand, death-eater, power • Movie domain specific terms – Movie, Staffing, Casting, Writing, Theory, Rewriting, Screenplay, Format, Treatments, Scriptments, Synopsis SIGIR 2015, Tutorial
  • 97. • Rank sentences according to participating entities 1. Title, crew 2. Movie domain specific terms 3. Plot Sentiment classification – Weakly supervised – identify words’ polarity from general lexicons Retrieve Relevant Opinionated Text (sentence score: subjectivity + α×Σ(…) + β×Σ(…) − γ×Σ(…)) SIGIR 2015, Tutorial
  • 98. Classify Blog Sentiment • Use verb and adjectives categories • Adjectives – Positive – Negative • Verbs – Positive verb classes, positive mental affecting, approving, praising – Negative verb classes, abusing, accusing, arguing, doubting, negative mental affecting Using verbs and adjectives to automatically classify blog sentiment [Chesley, Vincent, Xu, & Srihari, 2006] SIGIR 2015, Tutorial
  • 99. Expanding Adjectives cont. • Query Wiktionary • Assumption: an adjective’s polarity is reflected by the words that define it • Maintain a set of adjectives with known polarity • Count mentions of known adjectives to derive the new adjective’s polarity SIGIR 2015, Tutorial
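A tiny sketch of that assumption: count known positive vs. negative adjectives in the definition text and turn the balance into a polarity score. Fetching the definition itself (the commented-out definition_of helper) is hypothetical and could go through the Wiktionary API or a dump:

```python
# Minimal sketch: derive a new adjective's polarity from its dictionary definition.
KNOWN_POS = {"good", "pleasant", "excellent", "beautiful", "agreeable"}
KNOWN_NEG = {"bad", "unpleasant", "terrible", "ugly", "disagreeable"}

def infer_polarity(definition_text):
    words = definition_text.lower().split()
    pos = sum(w in KNOWN_POS for w in words)
    neg = sum(w in KNOWN_NEG for w in words)
    if pos == neg:
        return 0.0                      # undecided
    return (pos - neg) / (pos + neg)    # in [-1, 1]

# definition_of("delightful") is a hypothetical lookup against Wiktionary:
# print(infer_polarity(definition_of("delightful")))
```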
  • 102. Recommender Systems RS are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. (Xiao & Benbasat 2007) • Different system designs and algorithms – Based on availability of data – Domain characteristics SIGIR 2015, Tutorial
  • 104. Content-based Recommendation A General Vector-Space Approach Matching Thresholding Relevant content Feedback Profile Learning Threshold Learning threshold SIGIR 2015, Tutorial
  • 105. Challenge – Dull Content • Content does not contain enough information to distinguish items the user likes from items the user does not like – Result: specificity – more of the same – Vocabulary mismatch, limited aspects in content to distinguish relevant items, synonymy, polysemy – Result: bad recommendations! SIGIR 2015, Tutorial
  • 106. Remedy: Enriching background knowledge Infusion of external sources • Movie recommendations • Three external sources: – Dictionary – (controlled) extracts of lemma descriptions - linguistic – Wikipedia • pages related to movies are transformed into semantic vectors using matrix factorization – User tags – on some sites people can add tags to movies. Knowledge Infusion into Movies Content Based Recommender Systems [Semeraro, Lops and Basile 2009] SIGIR 2015, Tutorial
  • 107. • The data combined from all sources is represented as a graph. • The set of terms that describe an item is extended using a spreading activation model that connects terms based on semantic relations Knowledge Infusion into Movies Content Based Recommender Systems [Semeraro, Lops and Basile 2009] SIGIR 2015, Tutorial
  • 108. axe, murder, paranormal, hotel, winter Process- Example The Shining Keyword search perceptions, killer, psychiatry. spreading activation of external knowledge search “Carrie” “The silence of the lambs” SIGIR 2015, Tutorial
  • 109. Tweets Content-Based Recommendations • GOAL: Re-rank tweet messages based on similarity to the user’s content profile • Method: The user interest profile is represented as two vectors: concepts from Wikipedia, and affinity with other users; the user’s profile is expanded by a random walk on the Wikipedia concept graph, utilizing the inter-links between Wikipedia articles. SIGIR 2015, Tutorial [Lu, Lam & Xang, 2012] Twitter User Modeling and Tweets Recommendation based on Wikipedia Concept Graph
  • 110. Algorithm steps Map a Twitter message to a set of concepts employing Explicit Semantic Analysis (ESA) Generate a user profile from two vectors: a concept vector representing the user’s topics of interest, and a vector representing the affinity with other users. In order to get related concepts, a random walk on the Wikipedia graph is applied. User profile and tweet are both represented as a set of weighted Wikipedia concepts. Cosine similarity is applied to compute the score between the profile and the tweet messages. SIGIR 2015, Tutorial
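A sketch of the final scoring step, with the user profile and a tweet both represented as sparse dictionaries of Wikipedia concept weights (e.g., produced by ESA plus the random walk); cosine similarity between the two dictionaries ranks the tweets:

```python
# Minimal sketch: rank tweets by cosine similarity between concept-weight dicts.
import math

def cosine(profile, tweet_concepts):
    common = set(profile) & set(tweet_concepts)
    dot = sum(profile[c] * tweet_concepts[c] for c in common)
    norm = math.sqrt(sum(v * v for v in profile.values())) * \
           math.sqrt(sum(v * v for v in tweet_concepts.values()))
    return dot / norm if norm else 0.0

profile = {"Barack Obama": 0.8, "White House": 0.6, "United States Congress": 0.3}
tweets = {"t1": {"White House": 0.9, "Press briefing": 0.4},
          "t2": {"Football": 0.7, "FIFA World Cup": 0.5}}
ranked = sorted(tweets, key=lambda t: cosine(profile, tweets[t]), reverse=True)
print(ranked)   # t1 should outrank t2 for this profile
```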
  • 112. Collaborative Filtering • Description: the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future. • Main approaches: kNN – nearest neighbors; SVD – matrix factorization SIGIR 2015, Tutorial
  • 113. Collaborative Filtering Trying to predict the opinion the user will have on the different items and be able to recommend the “best” items to each user based on: the user’s previous likings and the opinions of other like minded users abcd The Idea ? Positive Rating Negative Rating SIGIR 2015, Tutorial
  • 114. Collaborative Filtering – Rating Matrix • The ratings (or events) of users and items are represented in a matrix; all CF methods are based on such a rating matrix • Rows: the users in the system (U1 … Um) • Columns: the items in the system (I1 … In) • Each cell may hold a rating r SIGIR 2015, Tutorial
  • 115. Collaborative Filtering Approach 1: Nearest Neighbors – “People who liked this also liked…” • User-to-User: recommendations are made by finding users with similar tastes. Jane and Tim both liked Item 2 and disliked Item 3; it seems they might have similar taste, which suggests that in general Jane agrees with Tim. This makes Item 1 a good recommendation for Tim. • Item-to-Item: recommendations are made by finding items that have similar appeal to many users. Tom and Sandra liked both Item 1 and Item 4. That suggests that, in general, people who liked Item 4 will also like Item 1, so Item 1 will be recommended to Tim. This approach is scalable to millions of users and millions of items. SIGIR 2015, Tutorial
  • 116. Some Math… Similarity (Pearson correlation between the active user a and user u over the m items):
$w_{a,u} = \dfrac{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^{m}(r_{a,i}-\bar{r}_a)^2}\;\sqrt{\sum_{i=1}^{m}(r_{u,i}-\bar{r}_u)^2}}$
Prediction (mean-centred weighted average over the n neighbouring users):
$p_{a,i} = \bar{r}_a + \dfrac{\sum_{u=1}^{n}(r_{u,i}-\bar{r}_u)\,w_{a,u}}{\sum_{u=1}^{n} w_{a,u}}$
SIGIR 2015, Tutorial
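A direct, if naive, sketch of the two formulas above, with ratings stored as nested dictionaries ratings[user][item]; for simplicity the similarity is computed over co-rated items and absolute weights are used in the prediction denominator:

```python
# Minimal sketch: user-based CF with Pearson similarity and mean-centred prediction.
import math

def pearson(ratings, a, u):
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = math.sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common)) * \
          math.sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, a, item):
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num = den = 0.0
    for u in ratings:
        if u == a or item not in ratings[u]:
            continue
        w = pearson(ratings, a, u)
        mean_u = sum(ratings[u].values()) / len(ratings[u])
        num += (ratings[u][item] - mean_u) * w
        den += abs(w)          # absolute weights for numerical stability
    if den == 0.0:
        return mean_a          # fall back to the active user's mean rating
    return mean_a + num / den
```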
  • 117. Computing the Item-to-Item Similarity We must have: For each customer ui, the list of products bought by ui For each product pj, the list of users that bought it Amazon.com Recommendations Item-to-Item Collaborative Filtering [Greg Linden, Brent Smith, and Jeremy York], 2003, IEEE Internet Computing SIGIR 2015, Tutorial
  • 118. The Main Challenge – Lack of Data • Sparseness • Long Tail – many items in the Long Tail have only few ratings • Cold Start – System cannot draw any inferences for users or items about which it has not yet gathered sufficient data SIGIR 2015, Tutorial
  • 119. External Sources as a Remedy Recommendation Engine User data recommendations Items data External Sources External data : about users, and items From :WWW, social networks, other systems, other user’s devices, and Wikipedia SIGIR 2015, Tutorial
  • 120. Computing Similarities Using Wikipedia for Sparse Matrices • Utilize knowledge from Wikipedia to infer the similarities between items/users • Systems differ in what Wikipedia data is used and how it is used Similar???? SIGIR 2015, Tutorial
  • 121. Example Step 1: Map item to Wikipedia pages – Generate several variations of each item’s (movie) name – Compare the generated names with corresponding page titles in Wikipedia. – Choose the page with the largest number of categories that contain the word "film" (e.g., "Films shot in Vancouver"). • Using this technique, 1512 items out of the 1682 (89.8%) contained in the MovieLens database were matched Using Wikipedia to Boost Collaborative Filtering Techniques , Recsys 2011 [Katz, Ofek, Shapira, Rokach, Shani, 2011], SIGIR 2015, Tutorial
  • 123. Step 2: Use Wikipedia information to compute similarity between items – Text similarity – calculate the Cosine-Similarity between item pages in Wikipedia – Category Similarity – count mutual categories – Link Similarity – count mutual outgoing links SIGIR 2015, Tutorial
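A sketch of the three similarity signals, assuming each matched item carries the plain text, category set and outgoing-link set of its Wikipedia page; the text part reuses scikit-learn's TF-IDF cosine:

```python
# Minimal sketch: item-item similarities derived from the items' Wikipedia pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(text_a, text_b):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform([text_a, text_b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

def category_similarity(cats_a, cats_b):
    return len(set(cats_a) & set(cats_b))      # count of mutual categories

def link_similarity(links_a, links_b):
    return len(set(links_a) & set(links_b))    # count of mutual outgoing links
```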
  • 124. Use Wikipedia to Generate “Artificial Ratings” Step 3: use the item similarity matrix and each user’s actual ratings to generate additional, “artificial” ratings. • i – the item for which we wish to generate a rating • K – the set of items with actual ratings SIGIR 2015, Tutorial
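The slide does not reproduce the exact weighting formula; as one plausible reading, the sketch below fills the missing rating for item i with the similarity-weighted average of the user's actual ratings on the most similar items in K:

```python
# Minimal sketch (one plausible instantiation, not the paper's exact formula):
# artificial rating = similarity-weighted average of the user's actual ratings.
def artificial_rating(i, user_ratings, item_sim, k=10):
    """user_ratings: dict item -> actual rating; item_sim(i, j): Wikipedia-based similarity."""
    rated = sorted(user_ratings, key=lambda j: item_sim(i, j), reverse=True)[:k]
    num = sum(item_sim(i, j) * user_ratings[j] for j in rated)
    den = sum(item_sim(i, j) for j in rated)
    return num / den if den else None   # None: not enough signal to fabricate a rating
```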
  • 125. Use Wikipedia to Generate “Artificial Ratings” Step 4: add the artificial ratings to the user-item matrix • Use the artificial ratings only when there’s no real value • This matrix will be used for the collaborative filtering SIGIR 2015, Tutorial
  • 126. Results • The sparser the initial matrix, the greater the improvement SIGIR 2015, Tutorial
  • 127. Results – Collaborative Filtering [Chart: RMSE vs. % of data sparsity, performance of the various methods – basic item-item, categories, links, text, IMDB text] SIGIR 2015, Tutorial
  • 128. Comparison – Wikipedia and IMDB [Chart: number of movie descriptions by number of words in text – comparison of words per movie, Wikipedia vs. IMDB] SIGIR 2015, Tutorial
  • 129. Yet another CF algorithm - SVD SIGIR 2015, Tutorial
  • 130. Collaborative Filtering Approach 2: Matrix Factorization • MF is a decomposition of a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix • MF models users and items as vectors of latent features whose interaction produces the rating of the user for the item • With MF, a matrix is factored into a series of linear approximations that expose the underlying structure of the matrix • The goal is to uncover latent features that explain the observed ratings SIGIR 2015, Tutorial
  • 131. Latent Factor Models – Example • MF process: SVD maps users & ratings to latent concepts or factors • MF reveals hidden connections and their strength (hidden concepts) SIGIR 2015, Tutorial
  • 132. Latent Factor Models – Example • Users & ratings mapped to latent concepts or factors • SVD revealed a movie this user might like! (recommendation) SIGIR 2015, Tutorial
  • 133. Latent Factor Models – Concept space SIGIR 2015, Tutorial
  • 134.
  • 135. Integrating Wikipedia to the SVD Process • Learn parameters for the actual and “artificial ratings” SIGIR 2015, Tutorial
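One way to read “learn parameters for the actual and artificial ratings” is to give artificial ratings a lower confidence weight inside an SGD-trained matrix factorization; the sketch below does exactly that, with the weighting scheme being an illustrative assumption rather than the paper's formulation:

```python
# Minimal sketch: SGD matrix factorization trained on real + "artificial" ratings,
# where artificial ratings get a lower confidence weight (an illustrative assumption).
import random
import numpy as np

def train_mf(real, artificial, n_users, n_items, k=20, lr=0.01, reg=0.05,
             artificial_weight=0.3, epochs=20):
    P = np.random.normal(scale=0.1, size=(n_users, k))   # user latent factors
    Q = np.random.normal(scale=0.1, size=(n_items, k))   # item latent factors
    # each training tuple: (user index, item index, rating, confidence weight)
    data = [(u, i, r, 1.0) for u, i, r in real] + \
           [(u, i, r, artificial_weight) for u, i, r in artificial]
    for _ in range(epochs):
        random.shuffle(data)
        for u, i, r, w in data:
            pu = P[u].copy()
            err = r - pu @ Q[i]
            P[u] += lr * (w * err * Q[i] - reg * pu)
            Q[i] += lr * (w * err * pu - reg * Q[i])
    return P, Q   # predicted rating for (u, i): P[u] @ Q[i]
```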
  • 136. Using Wikipedia for generating ontology for recommender systems SIGIR 2015, Tutorial
  • 137. Example (1)- • Content recommendations for scholars using ontology to describe scholar’s profile • Challenge: singular reference ontologies lack sufficient ontological concepts and are unable to represent the scholars’ knowledge. A reference ontology for profiling scholar’s background knowledge in recommender systems, 2014 [Bahram, Roliana, Moohd,Nematbakhsh] SIGIR 2015, Tutorial
  • 138. • Building ontology based profiles for modeling background knowledge of scholars • Build ontology by merging a few sources • Wikipedia is used both as a knowledge source and for merging ontologies (verifying the semantic similarity between two candidate terms) SIGIR 2015, Tutorial
  • 141. Example (2) Tag Recommendation • Challenge- “bad” tags (e.g., misspelled, irrelevant) lead to improper relationship among items and ineffective searches for topic information SIGIR 2015, Tutorial
  • 142. Ontology Based Tag Recommendation • Integrate Wikipedia categories with WordNet Synsets to create Ontology for tag recommendation Effective Tag Recommendation System Based on Topic Ontology using Wikipedia and WordNet [Subramaniyaswamy and Chentur 2012] SIGIR 2015, Tutorial
  • 145. We demonstrated how Wikipedia was used for: • Recsys, SA – Ontology creation • Recsys, QE, QPP – Semantic relatedness • CLIR, QE – Synonym detection • Entity search, QPP – Relevance assessment • SA, QE – Disambiguation • Recsys, SA – Domain-specific knowledge acquisition SIGIR 2015, Tutorial
  • 146. What was not Covered • Using behaviors (views, edit) • Examining time effect • …… • More tasks, more methods (QA, advertising…) • Wikipedia weaknesses • We only showed a few examples of the Wikipedia power, and its potential SIGIR 2015, Tutorial
  • 147. Use Wikipedia, it is a Treasure….. SIGIR 2015, Tutorial
  • 148. Thank You Complete references list: http://vitiokm.wix.com/wikitutorial SIGIR 2015, Tutorial