Random Indexing and Quantum Negation for TV-Shows Retrieval and Classification

High Tech Campus, Philips Research
Eindhoven, Netherlands

Random Indexing and Quantum
Negation for TV-Shows
Retrieval and Classiﬁcation
Cataldo Musto, Ph.D. Student
cataldomusto@di.uniba.it - cataldo.musto@philips.com
University of Bari “Aldo Moro” (Italy), SWAP Research Group
Philips Research Center - Eindhoven (Netherlands) - HI&E Group
14.07.11

outline
• part 1: introduction
• information overload, personalization, information ﬁltering, recommender
systems

• part 2: approaches
• vector space model, random indexing, quantum negation

• part 3: scenario
• tv-show recommendation, description of the data, description of the tasks

• part 4: experimental evaluation
• results, discussion, future work

C.Musto: Random Indexing and Quantum Negation for TV shows Retrieval and Classiﬁcation - Philips Research , Eindhoven (The Netherlands) - 14.07.11

part 1: introduction
what are we talking about?


TV

text messages

phone calls

internet navigation

scenario
• Daily interaction with electronic
devices

• eMail, Web navigation, Social
media, instant messaging

• Continuous ﬂow of
information

• In 2007, 500,000 terabytes of
information were produced
on the Web in one year

• Including telephone, radio, TV
and so on, we reach 18
exabytes of data!


information overload
• Consequences:
cognitive overload

• It is impossible to
effectively deal with
this surplus of
information

• It is difﬁcult to quickly
ﬁnd the information
we really need


information ﬁltering
“An information filtering system is a
system that removes redundant or
unwanted information from an information
stream using automated methods.”
Wikipedia


information ﬁltering systems

• How do they work?
• Usually, in three steps
• Training Step
• User Modeling
• Filtering

Step 1: Training
Step 2: User Modeling
Step 3: Filtering

recommender systems
• A speciﬁc type of Information Filtering system
that attempts to recommend
information items (ﬁlms, television, video on
demand, music, books, etc) that are likely to be of
interest to the user

• Every day we interact with recommender
systems, even if we do not know it!


Amazon

YouTube

recommendation approaches
• Content-based filtering
• No interactions between users. Each user is an atomic entity
• Prerequisite: each item to be recommended has to be described through a set of
textual features
• We store in a user profile the features that often
occur in the items she likes
• Assumption: if a user usually likes items whose descriptions often contain a specific feature, we
can assume that she will also like items with that feature in the future

• e.g.
• If User_A likes a news item containing the features “Football” and “Internazionale FC”
• We can recommend her other news about either Football or Internazionale
FC


part 2: approaches
vector space model, random indexing, quantum negation


vector space model
• Introduced by Salton in
1975

• Given a set of M documents
(items) D = (d_1, ..., d_M)

• Given N features describing
the documents

• Each document (item) is
represented as a point in an
N-dimensional vector space

• The whole corpus is
represented as an N×M matrix
called the term/document
matrix


vector space model
• VSM in a recommendation scenario
• Document: point in the vector space
• User profile: point in the vector space
• e.g. built as the sum of the vector space representation of the documents
liked in the past by the user
• Goal: to find the documents that are the most relevant ones for that user profile
• Assumption

• the most similar documents in the vector space are the most
relevant ones

• Cosine Similarity to compute the similarity between query and
documents
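
The recommendation scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four features, the toy weights and the document names are invented for the example, and the profile is built as the plain sum of the liked document vectors, as suggested on the slide.

```python
import math

def add(u, v):
    """Element-wise sum of two equally sized vectors."""
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy weights over the (hypothetical) features [football, inter, politics, cinema].
documents = {
    "news_1": [2, 1, 0, 0],
    "news_2": [1, 2, 0, 0],
    "news_3": [0, 0, 3, 0],
    "news_4": [1, 0, 0, 2],
}
liked = ["news_1"]

# User profile: sum of the vectors of the documents liked in the past.
profile = [0, 0, 0, 0]
for doc_id in liked:
    profile = add(profile, documents[doc_id])

# Rank the unseen documents by cosine similarity to the profile.
candidates = [d for d in documents if d not in liked]
ranking = sorted(candidates, key=lambda d: cosine(profile, documents[d]), reverse=True)
print(ranking)  # ['news_2', 'news_4', 'news_3']
```

The document that shares both liked features comes first, and the one that shares none comes last.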


vsm analysis (2)
• Weak Points
• Not incremental

• The whole Vector Space has to be generated from scratch
whenever a new item is added to the repository
• High Dimensionality
• NLP operations (stopword elimination, stemming and so on) are needed to limit it
• Does not manage negative evidence
• The vector space representation depends only on the features that occur in
the document; no assumptions are made about the features that do not occur
• Does not manage the latent semantics of documents

• Any permutation of the terms in a document has the same
VSM representation!


idea
• To introduce tools and techniques
able to overcome these drawbacks
• Random Indexing
• Dimensionality reduction technique
Sahlgren, 2005

• Quantum Negation
• Based on Quantum Logic
Widdows, 2007


random indexing
• Random Indexing (RI) is an incremental and effective
technique for dimensionality reduction
• Distributional Models
• Assumption: we can infer information about terms
by analyzing how they are used in a large corpus of data

• Based on the so-called “Distributional Hypothesis”
• “Words that occur in the same context tend to have
similar meanings”
• “Meaning is its use” (Wittgenstein)


how it works?

Random Indexing reduces the original
high-dimensional term/document matrix to a new
lower-dimensional matrix


how it works?
• How?
• By multiplying the original
matrix with a random
one, built in an incremental
way
• formally: A_{n,m} · R_{m,k} = B_{n,k}
• k << m
• After projection, the
distances between points in
the vector space are approximately preserved
• Johnson-Lindenstrauss
Lemma
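
A small numerical check of the projection A · R = B, written as a sketch under explicit assumptions: the random matrix here has Gaussian entries scaled by 1/sqrt(k), which is a common choice for illustrating the Johnson-Lindenstrauss lemma and not the sparse ternary construction used by Random Indexing. The sizes n, m and k are arbitrary toy values.

```python
import math
import random

def random_projection_matrix(m, k, rng):
    """m x k matrix with Gaussian entries scaled by 1/sqrt(k) (an assumption)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(m)]

def project(A, R):
    """Matrix product B = A * R, with matrices stored as lists of rows."""
    k = len(R[0])
    return [[sum(a * R[j][c] for j, a in enumerate(row)) for c in range(k)]
            for row in A]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
n, m, k = 5, 300, 100          # k << m
A = [[rng.random() for _ in range(m)] for _ in range(n)]
R = random_projection_matrix(m, k, rng)
B = project(A, R)

# Ratio between each pairwise distance after and before the projection.
ratios = [distance(B[i], B[j]) / distance(A[i], A[j])
          for i in range(n) for j in range(i + 1, n)]
print(len(B), len(B[0]), min(ratios), max(ratios))
```

The ratios stay close to 1, which is the sense in which distances are preserved: approximately, and with a tolerance that shrinks as k grows.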

random matrix
• How is the random matrix built?
• The whole process is based on the concept of “context”
• Given a term, its “context” could be the whole document, a
paragraph, a sentence, a sliding window of words and so on.

• The deﬁnition of the context inﬂuences the structure of the
matrix

• The matrix is built in an iterative and incremental way

• The vector representing each document depends on the terms
that occur in it

• The vector representing each term depends on its context


item representation
• A context vector is assigned to each context (for simplicity, we
assume the whole document as the context)
• This vector has a fixed dimension (k) and can contain only the values
-1, 0 and 1. The values are distributed randomly, and the number of
non-zero elements is much smaller than k.

• The vector space representation of a term is obtained by summing the context
vectors of all its contexts (the documents it occurs in).

• The Vector Space representation of a document (item) is
obtained by summing the context vectors of the terms that occur in it

• Output: lower-dimensional vector space representation
based on random context vectors
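
The construction described above can be sketched as follows. The corpus, the dimension K and the number of non-zero entries are toy values chosen for the example, and the helper names (`index_vector`, `add_into`) are hypothetical; the context is the whole document, as assumed on the slide.

```python
import random
from collections import defaultdict

def index_vector(k, nonzero, rng):
    """Sparse ternary vector: `nonzero` entries set to +1 or -1, the rest 0."""
    vec = [0] * k
    for pos in rng.sample(range(k), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def add_into(target, vec):
    """Accumulate `vec` into `target` in place."""
    for i, value in enumerate(vec):
        target[i] += value

K, NONZERO = 20, 4
rng = random.Random(42)
corpus = {
    "show_1": ["football", "match", "league"],
    "show_2": ["football", "league", "goal"],
    "show_3": ["cooking", "recipe"],
}

# 1) One random context vector per context (here: per document).
context = {doc: index_vector(K, NONZERO, rng) for doc in corpus}

# 2) Term vector = sum of the context vectors of the documents it occurs in.
term_vectors = defaultdict(lambda: [0] * K)
for doc, terms in corpus.items():
    for term in set(terms):
        add_into(term_vectors[term], context[doc])

# 3) Document vector = sum of the vectors of the terms that occur in it.
doc_vectors = {}
for doc, terms in corpus.items():
    vec = [0] * K
    for term in terms:
        add_into(vec, term_vectors[term])
    doc_vectors[doc] = vec
```

Adding a new show only requires drawing one more context vector and updating the vectors of its terms, which is what makes the approach incremental.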


quantum negation
• Random Indexing is still not capable of managing negative evidence
• RI can be coupled with Quantum Negation (QN) operator
• Definition inherited from quantum logic

• Negation as a form of orthogonality between
vectors
• Given two vectors A and B, we can define the vector A NOT B
• It represents the projection of the vector A onto the subspace
orthogonal to the one generated by the vector B
• In a recommendation scenario, this operator can be used to
combine two vectors, the first one representing positive
evidence and the second one representing negative evidence
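
For a single negated vector, the operator can be written as A NOT B = A - ((A·B) / (B·B)) B. A minimal sketch follows; the function names and the toy vectors are chosen for the example only.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def negate(a, b):
    """Return `a NOT b`: a minus its projection onto b."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)  # nothing to remove when b is the zero vector
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

a = [3.0, 1.0, 2.0]   # positive evidence
b = [1.0, 0.0, 0.0]   # negative evidence
a_not_b = negate(a, b)
print(a_not_b)           # [0.0, 1.0, 2.0]
print(dot(a_not_b, b))   # 0.0
```

The result is orthogonal to B, so its cosine similarity with the unwanted direction is zero.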


...summing up
• VSM is an effective model for document retrieval

• It can be exploited in recommendation scenarios
• It suffers from some well-known drawbacks
• Solutions
• Random Indexing is an incremental and effective approach
that can tackle the high-dimensionality problem
• Quantum Negation can effectively model negative evidence

• The combined use of RI and QN is a good
alternative to VSM, especially for real-life scenarios


part 3: scenario
tv-shows recommendation


Scenario:
EPG (Electronic Program Guides)
personalization

scenario

• Given a set of TV shows,
we want to provide the
user with a set of
suggestions about the
shows that she should
watch, according to her
preferences


approach

Currently the recommendation
model is implemented through
the Vector Space Model (VSM)


data
• TV shows gathered from a set of
47 German-language broadcast
channels

• Each TV show is described
through a set of textual
features (title, synopsis,
description, etc.) gathered from an
XML feed

• Each TV-Show is mapped to a ﬁxed
program type (Movie, Sport,
Documentary, Magazine, etc.)


problems
• How to represent the data?

• We compared two approaches
• Bag of Words (BOW)
• Tag.me

• What are the typical use cases?
• We identiﬁed two tasks
• Classiﬁcation Task
• Retrieval Task


data representation
• Bag of Words
• Each item i is described through the
words that appear in the text

• Weighting of the words
• Counting of the occurrences,
normalization, TF-IDF weighting, etc.


BOW representation
• To improve BOW representation
• Textual descriptions are usually very noisy
• Full of uninformative words
• Further processing can improve
the classical BOW representation
• Stopword removal: ﬁltering of all the
uninformative words (articles, adverbs,
adjectives and so on)


data representation
• Tag.me
• Online tool developed by the University
of Pisa (Italy)
• Goal: to identify Wikipedia concepts that
occur in the text
• Idea: to process original text through Tag.me
in order to avoid noise and provide a novel
representation based on high-level
Wikipedia concepts


tag.me web interface


ﬁnal output
Bow

Tag.me


description of the tasks
• task 1: classification

• Given a flow of TV shows, we want to classify
them against the set of program types

• task 2: retrieval

• Given a set of program types and a repository
of TV shows, we want to retrieve the shows
that belong to a specific program type


VSM for TV shows classiﬁcation

• Steps
• 1) Build a vector space for the tv shows
• 2) Build a vector for each program type
• 3) Use cosine similarity to compare tv shows
and program types
• 4) Assign the TV show to the program type that got
the highest cosine similarity


• Step 1: build a vector space
representation of the TV-shows
• For each TV show we collected a set of words
from the synopsis and the title of the show
• We filtered the words against a
fixed list of 996 stopwords for the
German language
• We calculated the TF-IDF score of each word in each
document
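
Step 1 can be sketched as below. The weighting variant (raw term count times log(N / document frequency)) is an assumption, since the slides do not say which TF-IDF formula was used; the four-word stopword set is a hypothetical stand-in for the 996-word list, and the three descriptions are invented.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Map each document id to a {term: weight} dictionary."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in corpus.values() for term in set(terms))
    weights = {}
    for doc, terms in corpus.items():
        counts = Counter(terms)
        weights[doc] = {t: c * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()}
    return weights

STOPWORDS = {"der", "die", "das", "und"}  # hypothetical stand-in for the real list
raw = {
    "show_1": "der tatort und die polizei".split(),
    "show_2": "die bundesliga und der fussball".split(),
    "show_3": "fussball fussball bundesliga".split(),
}

# Stopword filtering, then weighting.
corpus = {doc: [w for w in words if w not in STOPWORDS] for doc, words in raw.items()}
weights = tf_idf(corpus)
print(weights["show_1"])
```

Words that occur in a single description get the highest weight, while a word that occurred in every description would get a weight of zero.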



• Step 2: build a vector for each
program type
• Given the vector space representation of
each document
• The vector space representation of each
program type is the sum of the
vector space representations of each tv-
show that belongs to that program type



• Given a set of TV shows

• T = (s_1, ..., s_n)
• Given a set of program types

• P = (t_1, ..., t_m)
• We define a function pt: T → P
• It returns the program type of a TV show
• We can build the set S(t_i) as the set of the TV shows that belong to t_i
• S(t_i) = { s ∈ T : pt(s) = t_i }



• Given the set S(t_i) with a cardinality of k, the vector space
representation of the program type t_i is simply given by the sum of the
vectors of the k TV shows in S(t_i)



• Step 3 and Step 4
• Given the vector space representation of both
program types and tv shows
• Use of cosine similarity to compare each TV
show against the set of the program types
• We assigned the TV show to the program type
that got the highest cosine similarity
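
Steps 2 to 4 amount to a nearest-prototype classifier. The sketch below uses invented three-dimensional vectors and two program types; the function names are hypothetical. Because cosine similarity ignores vector length, summing or averaging the show vectors of a program type leads to the same assignment.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def program_type_vectors(training):
    """Step 2: sum the vectors of the training shows of each program type."""
    totals = {}
    for vector, label in training:
        current = totals.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            current[i] += value
    return totals

def classify(vector, type_vectors):
    """Steps 3 and 4: pick the program type with the highest cosine similarity."""
    return max(type_vectors, key=lambda label: cosine(vector, type_vectors[label]))

training = [
    ([3.0, 0.0, 1.0], "Sport"),
    ([2.0, 1.0, 0.0], "Sport"),
    ([0.0, 3.0, 1.0], "Movie"),
    ([0.0, 2.0, 2.0], "Movie"),
]
type_vectors = program_type_vectors(training)
print(classify([1.0, 0.0, 0.0], type_vectors))  # Sport
print(classify([0.0, 1.0, 1.0], type_vectors))  # Movie
```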


RI for TV shows classiﬁcation

• Steps
• 1) Build a vector space for the tv shows
• 2) Reduce the vector space through the
Random Indexing algorithm
• 3) Build a vector for each program type on the (reduced)
vector space
• 4) Use cosine similarity to compare tv shows and
program types
• 5) Assign the TV show to the program type that got the
highest cosine similarity


RI for TV shows retrieval

• Steps
• 1) Build a vector space for the tv shows
• 2) Reduce the vector space through the Random
Indexing algorithm
• 3) Build a positive vector for each program type on the
(reduced) vector space
• 4) Use cosine similarity to compare tv shows and
program types
• 5) Rank the tv shows and assign the ﬁrst N to
the program type


RI+QN for TV shows retrieval

• Steps
• 1) Build a vector space for the tv shows
• 2) Reduce the vector space through the Random Indexing
algorithm
• 3) Build a positive vector for each program type on the
(reduced) vector space
• 4) Build a negative vector for each program type
on the (reduced) vector space
• 5) Use cosine similarity to compare tv shows with
both positive and negative program types vectors
• 6) Rank the tv shows and assign the ﬁrst N to the program type
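
One possible reading of the last steps is sketched below: the positive and negative vectors of a program type are first combined with the negation operator, and the shows are then ranked against the combined vector. The slides do not give this detail, so the combination order is an assumption; the vectors and show names are invented for the example.

```python
import math

def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector has zero length."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def negate(a, b):
    """Return `a NOT b`: a minus its projection onto b."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

def retrieve(shows, positive, negative, n):
    """Rank the shows against `positive NOT negative` and keep the first n."""
    query = negate(positive, negative)
    ranked = sorted(shows, key=lambda s: cosine(shows[s], query), reverse=True)
    return ranked[:n]

shows = {
    "match": [4.0, 0.0, 1.0],
    "talk_about_sport": [2.0, 3.0, 0.0],
    "thriller": [0.0, 4.0, 1.0],
}
positive = [5.0, 1.0, 1.0]   # built from shows of the program type
negative = [0.0, 5.0, 0.0]   # built from shows of the other program types
print(retrieve(shows, positive, negative, 2))  # ['match', 'talk_about_sport']
```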



• Given a set of TV shows

• T = (s_1, ..., s_n)
• Given a set of program types

• P = (t_1, ..., t_m)
• We define a function pt: T → P
• We can build the set S(t_i) as the set of the TV shows that belong to t_i
• S(t_i) = { s ∈ T : pt(s) = t_i }


• Given the set S(t_i) and its complement, with cardinalities k and z
respectively, a positive vector for the program type is built from the
shows in S(t_i) and a negative vector from the shows in its complement
• The positive and negative vectors are combined in order to emphasize the
features that occur in the positive vector and avoid the ones that occur
in the negative one


...summing up
• Classification task
• Comparison of VSM and RI
• We built a vector space
• We applied RI to reduce the vector space
• We classified TV shows in the complete vector space and in the reduced
one, comparing the accuracy
• Retrieval task
• Comparison of RI and RI+QN
• We built a vector space
• We applied RI to reduce the vector space
• We built both positive and negative program type vectors and applied QN
• We retrieved TV shows and compared RI without negation against
RI with negation


part 4: experimental evaluation
results, discussion, future work


dataset
• tv shows: 133,579
• program types: 17
• features (BOW): 306,006
• features (Tag.me): 74,599
• avg features (BOW): 42.11
• avg features (Tag.me): 9.21


experimental design
• 10-fold cross validation
• Dataset split into 10 partitions
• 9 partitions for training the models, the
remaining one for testing

• Results averaged over all the
partitions
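
The protocol can be sketched as follows. The shuffling seed, the way the folds are cut and the `evaluate` callback are assumptions made for the example; the slides only state that 10 partitions were used and the results averaged.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs; every item is used for testing exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def cross_validate(items, k, evaluate):
    """Average the score returned by `evaluate(train, test)` over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

shows = list(range(100))                 # stand-in for the TV show identifiers
splits = list(k_fold_splits(shows, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

In the actual experiment, `evaluate` would train a model on the nine training partitions and return its precision on the held-out one.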


metrics

• classiﬁcation task
• precision
• retrieval task
• precision@n
• precision@k%
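
The formulas did not survive in the slide text, so the sketch below uses the usual textbook definitions as an assumption: precision as the fraction of correctly classified shows, precision@n as the fraction of relevant shows among the first n results, and precision@k% as the same measure computed on the first k% of the ranked list. The experiments may have defined k% differently.

```python
import math

def precision(predicted, actual):
    """Fraction of shows whose predicted program type matches the actual one."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(predicted)

def precision_at_n(ranked, relevant, n):
    """Fraction of relevant shows among the first n ranked shows."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def precision_at_k_percent(ranked, relevant, k):
    """precision@n with n set to k% of the ranked list (at least one show)."""
    n = max(1, math.ceil(len(ranked) * k / 100))
    return precision_at_n(ranked, relevant, n)

predicted = ["Sport", "Movie", "Sport", "News"]
actual = ["Sport", "Movie", "News", "News"]
print(precision(predicted, actual))                   # 0.75

ranked = list("abcdefghij")
relevant = {"a", "c", "d", "j"}
print(precision_at_n(ranked, relevant, 5))            # 0.6
print(precision_at_k_percent(ranked, relevant, 20))   # 0.5
```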

tuning of parameters
• Random Indexing algorithm
• Dimension of the vectors
• Classification task: 500, 700
• Retrieval task: 500, 1000, 1500, 2000
• Minimum number of occurrences
• Classification task: 2
• Retrieval task: 1, 3
• Training Cycles
• Classification task: 1, 2
• Retrieval task: 1


classiﬁcation task - results
size   occur.   cycles   tag.me   bow
500    2        1        37.38    42.91
700    2        1        40.28    47.76
500    2        1        44.61    54.32
700    2        1        45.33    54.33


classiﬁcation task: comparison
(bar chart; precision values shown: 68.7, 54.3, 54.3, 47.7, 42.9)


classiﬁcation - results per program type


classification task - outcomes
• BOW better than Tag.me
• Representation too poor

• Difficult to learn a solid and effective model for text classification
• The dimension of the vector space and the second training cycle affect the
predictive accuracy
• RI does not outperform the baseline

• Vector space reduced by over 99% (from 133579 to 500 or 700)
• Too much loss of information

• but
• Broken down by program type, Random Indexing got better results in
10 out of 17 program types
• The reasons for this need to be investigated


retrieval task - bow - p@n
82.6% / 66.3%
65.9% / 45.2%
58.1% / 36.5%

retrieval task - bow - p@k%
86.0% / 58.1%

retrieval task - bow - p@k%
55.4% / 35.4%

retrieval task - tagme - p@n
61.9% / 47.9%
53.7% / 40.9%
51.6% / 39.0%

retrieval task - tagme - p@k%
76.6% / 57.9%

retrieval task - tagme - p@k%
49.6% / 35.4%

retrieval task - overview
82.6% / 61.9%
65.0% / 53.0%
58.3% / 53.2%

retrieval task - outcomes

• BOW always better than Tag.me
• Between 5 and 20% difference
• Parameters do not affect the accuracy
• QN operator improves the retrieval
accuracy by almost 20%


conclusions & future work
• In scenarios where the recommender system has to deal with a continuous flow of
information, the VSM is not suitable

• RI is able to effectively address typical VSM drawbacks
• Classification task

• Even if its accuracy is lower, these preliminary results need to be further
investigated, for example by testing the algorithm with different values
of the parameters

• Is a loss in precision acceptable for an algorithm that provides a big
improvement in scalability and efficiency?
• Retrieval task

• QN improves the predictive accuracy of the model in the
retrieval task

• QN is a novel operator, so this is an important outcome with good
scientific impact


Thanks for your
attention.


