Social recommender system

Tag Based Social Recommender System (RS)
Project Mentor
Ms Pragya Dwivedi
By
Aditi Gupta
Anirudh kanjani
Abhinav Vasu Rawat
Kapil kumar
Ashutosh Singh

Agenda
• Recommender systems: overview
• Usefulness of Recommender Systems (RS)
• Types of RS
• Relation with information architecture
• Limitations and possible improvements
• Relation with Social Networking

What are they and why are they used?
Recommender systems are a form of information filtering that attempts to present information likely to be of interest to the user. Their advantages are:
• Enhance user experience
◦ Assist users in finding information
◦ Reduce search and navigation time
• Increase productivity
• Increase credibility
• Mutually beneficial proposition

Types of Recommender Systems (RS)

Content based RS
Highlights
• Recommend items similar to those the user preferred in the past
• User profiling is the key
• Items/content usually denoted by keywords
• Matching “user preferences” with “item characteristics” works for textual information
• Vector Space Model widely used

Content based RS
Limitations
• Not all content is well represented by keywords, e.g. images
• Items represented by the same set of features are indistinguishable
• Overspecialization: unrated items not shown
• Users with thousands of purchases are a problem
• New user: no history available
• Shouldn’t show items that are too different, or too similar

Collaborative RS
Highlights
• Use other users’ recommendations (ratings) to judge an item’s utility
• Key is to find users/user groups whose interests match those of the current user
• Vector Space Model widely used (directions of vectors are user-specified ratings)
• More users, more ratings: better results
• Can account for items dissimilar to the ones seen in the past too
Example: movielens.org

Collaborative RS
Limitations
• Different users might use different scales. Possible solution: weighted ratings, i.e. deviations from average rating.
• Finding similar users/user groups isn’t very easy.
• New user: no preferences available.
• New item: no ratings available.
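
The "deviations from average rating" idea can be sketched in a few lines. This is only an illustration: the function name `mean_center` and the two sample users are hypothetical, not taken from the project.

```python
from statistics import mean

def mean_center(ratings):
    """Return each rating as a deviation from the user's own average rating."""
    avg = mean(ratings.values())
    return {item: r - avg for item, r in ratings.items()}

# Hypothetical users: one rates generously, the other harshly.
generous = {"film_a": 5, "film_b": 4, "film_c": 3}
harsh = {"film_a": 3, "film_b": 2, "film_c": 1}

print(mean_center(generous))
print(mean_center(harsh))
```

After centering, both users show the same preferences even though they use different parts of the rating scale.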

Hybrid RS
• Uses both content based and collaborative filtering.
• Introduced to avoid the limitations found in both content based and collaborative methods.
• Example: Netflix makes recommendations by comparing the watching and searching habits of similar users (i.e. collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

Other Variations of RS
Cluster Models
• Create clusters or groups.
• Put each customer into a category.
• Classification simplifies the task of user matching.
• Better scalability and performance.
• Lower accuracy than the standard collaborative filtering method.

Possible Improvement in RS
Better understanding of users and items
– Social network (social RS)
1. User level
• Highlighting interests, hobbies, and keywords people have in common
2. Item level
• Link the keywords to e-commerce (by RS algorithms)

What is a tag?
A tag is a piece of information that describes the data or content it is assigned to. Tags are non-hierarchical keywords used for Internet bookmarks, digital images, videos, files and so on. A tag by itself carries no formal semantics.
Tagging serves many functions, including:
• Classification
• Marking ownership
• Describing content type
• Online identity

About tagging
Labeling and tagging are done to aid in classification, marking ownership, noting boundaries and indicating online identity. Tags may take the form of words, images or marks.
Online and Internet databases deploy them as a way for publishers to help users find content.

Where are they used?
• Social bookmarking: lets users add tags to their bookmarks.
• Flickr: allows users to add their own text tags to each of their pictures, constructing flexible and easy metadata that makes pictures highly searchable.
• YouTube: also implements tagging, categorising content using simple keywords. Users add tags, which are visible and themselves link to other items that share that keyword tag.


Examples
• Within a blog: Many blog systems allow authors to add free-form tags to a post. For example, a post may display that it has been tagged with baseball and tickets.
• For an event: An official tag is a keyword adopted by an event for use in its web applications, such as blog entries, photos of the event and presentation slides.
• In research: An item can be associated with a small number of themes, and a group of tags for these themes can then be attached. In this way, free-form classification allows the author to manage large amounts of information.

Tag types
• Triple tags: A triple tag (or machine tag) uses a special syntax to define extra semantic information about the tag, making it more meaningful for interpretation.
• A triple tag comprises three parts: a namespace, a predicate and a value.
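
A minimal sketch of splitting a triple tag into its three parts, assuming the `namespace:predicate=value` syntax used for machine tags on sites such as Flickr. The names `TripleTag` and `parse_triple_tag` are hypothetical helpers for illustration.

```python
from typing import NamedTuple, Optional

class TripleTag(NamedTuple):
    namespace: str
    predicate: str
    value: str

def parse_triple_tag(tag: str) -> Optional[TripleTag]:
    """Split 'namespace:predicate=value'; return None for ordinary tags."""
    head, sep, value = tag.partition("=")
    if not sep:
        return None  # no '=' present, so this is a plain tag
    namespace, sep, predicate = head.partition(":")
    if not sep or not namespace or not predicate:
        return None  # missing namespace or predicate
    return TripleTag(namespace, predicate, value)

print(parse_triple_tag("geo:lat=57.64"))
print(parse_triple_tag("baseball"))
```

An ordinary free-form tag such as baseball is left alone, so both kinds of tag can live in the same tag list.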

Tag types
• Hash tag: A word or phrase prefixed with #; a form of metadata tag. Short messages on social networks such as Twitter and Facebook may be tagged by putting # before important words.
• Hash tags provide a means of grouping such messages, since one can search for a hash tag and get the set of messages that contain it.
• Knowledge tag: A type of meta-information that describes or defines some aspect of an information resource. Knowledge tags are metadata that capture knowledge in the form of descriptions, classifications, comments, notes, hyperlinks, etc.
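
Grouping messages by hash tag can be sketched as follows. The regular expression, the function name `group_by_hashtag` and the sample messages are assumptions made for illustration; real services apply their own tokenisation rules.

```python
import re
from collections import defaultdict

# Assumption: a hash tag is '#' followed by letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")

def group_by_hashtag(messages):
    """Map each lower-cased hash tag to the messages that contain it."""
    groups = defaultdict(list)
    for msg in messages:
        # Use a set so a tag repeated inside one message is counted once.
        for tag in {t.lower() for t in HASHTAG.findall(msg)}:
            groups[tag].append(msg)
    return dict(groups)

messages = [
    "Great game tonight #baseball",
    "Selling two #tickets for the #baseball final",
    "No tags here",
]
groups = group_by_hashtag(messages)
print(groups["baseball"])
```

Searching for a hash tag then becomes a dictionary lookup that returns every message carrying it.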

Information Retrieval Systems
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full text.

The Information Retrieval Cycle
[Figure: the information retrieval cycle. Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Result, with feedback loops for query reformulation and relevance feedback.]

11/27/2013

Introduction to Information Retrieval


Search Process
[Figure: the search process. Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Results. The Document Collection passes through Indexing to produce the Index used by Search.]
Slide is from Jimmy Lin’s tutorial

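Implementation: How a Recommender System Works
In case we use content based filtering, the cosine similarity formula is used as follows:

$$\mathrm{sim}(c,s) = \cos(\vec{w}_c, \vec{w}_s) = \frac{\vec{w}_c \cdot \vec{w}_s}{\lVert \vec{w}_c \rVert \, \lVert \vec{w}_s \rVert} = \frac{\sum_{i} w_{i,c}\, w_{i,s}}{\sqrt{\sum_{i} w_{i,c}^2}\, \sqrt{\sum_{i} w_{i,s}^2}}$$

where $\vec{w}_c$ and $\vec{w}_s$ are TF-IDF weight vectors.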
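
A minimal sketch of cosine similarity between two weight vectors, assuming each vector is stored as a sparse dictionary from term to weight. The function name and the sample weights are hypothetical and not taken from the project code.

```python
import math

def cosine_similarity(wc, ws):
    """Cosine of the angle between two sparse weight vectors (term -> weight)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(wc[t] * ws[t] for t in wc.keys() & ws.keys())
    norm_c = math.sqrt(sum(v * v for v in wc.values()))
    norm_s = math.sqrt(sum(v * v for v in ws.values()))
    if norm_c == 0 or norm_s == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_c * norm_s)

# Illustrative weights only.
user_profile = {"baseball": 2.0, "tickets": 1.0}
item = {"baseball": 1.0, "stadium": 1.0}
print(round(cosine_similarity(user_profile, item), 3))
```

The result is 1 for vectors pointing in the same direction and 0 when the vectors share no terms.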

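Implementation: How a Recommender System Works
In case we use collaborative filtering, the Pearson similarity formula is used as follows:

$$\mathrm{sim}(x,y) = \frac{\sum_{s \in S_{xy}} (r_{x,s} - \bar{r}_x)(r_{y,s} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{x,s} - \bar{r}_x)^2 \, \sum_{s \in S_{xy}} (r_{y,s} - \bar{r}_y)^2}}$$

• $\mathrm{sim}(x,y)$: similarity between users x and y
• $S_{xy}$: set of items rated by both x and y
• $r_{x,s}$: rating for item “s” given by user “x”
• $r_{y,s}$: rating for item “s” given by user “y”
• $\bar{r}_y$: mean of all ratings by user “y”
• $\bar{r}_x$: mean of all ratings by user “x”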
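
A sketch of Pearson similarity between two users, assuming each user's ratings are stored as a dictionary from item to rating and that the sums run over the items both users have rated. The function name and sample ratings are hypothetical.

```python
import math

def pearson_similarity(ratings_x, ratings_y):
    """Pearson correlation over the items rated by both users."""
    common = ratings_x.keys() & ratings_y.keys()
    if not common:
        return 0.0  # no co-rated items, so no evidence of similarity
    # Means are taken over all ratings of each user, as defined above.
    mean_x = sum(ratings_x.values()) / len(ratings_x)
    mean_y = sum(ratings_y.values()) / len(ratings_y)
    num = sum((ratings_x[s] - mean_x) * (ratings_y[s] - mean_y) for s in common)
    den = math.sqrt(
        sum((ratings_x[s] - mean_x) ** 2 for s in common)
        * sum((ratings_y[s] - mean_y) ** 2 for s in common)
    )
    return num / den if den else 0.0

# Illustrative ratings only.
x = {"a": 5, "b": 3, "c": 1}
y = {"a": 4, "b": 3, "c": 2}
z = {"a": 1, "b": 3, "c": 5}
print(pearson_similarity(x, y), pearson_similarity(x, z))
```

Users x and y agree on the ordering of the items and score close to 1, while x and z have opposite tastes and score close to -1.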

Implementation: How a Recommender System Works
Similarity Model
Vector-space model
This model lets us retrieve documents based on the tags a user gives in a query. The vector space model uses TF-IDF weights to categorise the documents into relevant and non-relevant ones. The end result is the document(s) having the best similarity with the tags given in the query.


The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)



Document Collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn
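
A small sketch of building such a matrix, with one row per document and one column per term. As a simplifying assumption, raw term counts stand in for the weights; the function name and the two sample documents are hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    vocabulary = sorted({term for terms in docs.values() for term in terms})
    counts = {name: Counter(terms) for name, terms in docs.items()}
    # A Counter returns 0 for terms that do not occur in the document.
    matrix = {name: [counts[name][t] for t in vocabulary] for name in docs}
    return vocabulary, matrix

# Illustrative tag lists only.
docs = {"D1": ["sunset", "beach", "sunset"], "D2": ["beach", "dog"]}
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

A zero entry marks a term that does not appear in that document.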

Issues for Vector Space Model
• How to determine important words in a document?
◦ Word sense?
◦ Word n-grams (and phrases, idioms, …) → terms
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?



Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (TF) by dividing by the frequency of the most common term in the document:
TFij = fij / maxi{fij}


Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i
= number of documents containing term i
IDFi = inverse document frequency of term i
= log2(N / dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to TF.


TF-IDF Weighting
• A typical combined term importance indicator is TF-IDF weighting:
wij = TFij · IDFi = TFij · log2(N / dfi)
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, TF-IDF has been found to work well.


Computing TF-IDF - An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: TF = 3/3; IDF = log2(10000/50) = 7.6; TF-IDF = 7.6
B: TF = 2/3; IDF = log2(10000/1300) = 2.9; TF-IDF = 2.0
C: TF = 1/3; IDF = log2(10000/250) = 5.3; TF-IDF = 1.8
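
The worked example can be checked with a short script. This is a sketch that uses TF normalised by the most frequent term and a base-2 logarithm for IDF; the function name `tf_idf` is a hypothetical helper.

```python
import math

def tf_idf(term_freqs, doc_freqs, n_docs):
    """TF-IDF weights with max-normalised TF and log2 IDF."""
    max_f = max(term_freqs.values())
    return {
        term: (f / max_f) * math.log2(n_docs / doc_freqs[term])
        for term, f in term_freqs.items()
    }

weights = tf_idf(
    term_freqs={"A": 3, "B": 2, "C": 1},
    doc_freqs={"A": 50, "B": 1300, "C": 250},
    n_docs=10_000,
)
for term, w in weights.items():
    print(f"{term}: {w:.1f}")
```

Run as written, this prints 7.6 for A, 2.0 for B and 1.8 for C.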

Performance and Correction Measures
• Precision: the fraction of retrieved documents that are relevant to the user’s information need.
• Recall: the fraction of the documents relevant to the query that are successfully retrieved.
• F-Measure
• Mean Absolute Error (MAE)


F-Measure
The F-measure is a weighted harmonic mean of precision and recall. The traditional F-measure, or balanced F-score, weights the two equally:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
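
Precision, recall and the balanced F-measure can be computed together from the retrieved and relevant sets. A minimal sketch; the function name and the document ids are made up for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, balanced F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Illustrative ids: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r, f)
```

Here precision is 2/4 and recall is 2/3, so the F-measure lies between the two at 4/7.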

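Mean Absolute Error (MAE)
The mean absolute error for a set of queries is calculated as the average of the absolute difference between the predicted rating and the actual rating for each query:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert f_i - y_i \rvert = \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert$$

where $n$ is the total number of queries, $f_i$ is the prediction, $y_i$ is the true value, and the absolute error is $\lvert e_i \rvert = \lvert f_i - y_i \rvert$.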
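
A minimal sketch of the MAE computation, assuming predictions and true ratings are given as two lists of equal length. The function name and sample values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(predicted)

# Illustrative values: the errors are 1, 0.5 and 0.
mae = mean_absolute_error([4.0, 3.5, 2.0], [5, 3, 2])
print(mae)
```

A lower MAE means the predicted ratings are, on average, closer to the ratings users actually gave.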

Datasets
We have studied the datasets of some popular sites and have implemented basic functions such as Pearson similarity, cosine similarity, the Resnick prediction formula and the TF-IDF model on them. The datasets we studied are as follows:
• MovieLens Dataset
• Flickr Dataset



MovieLens Dataset
• MovieLens is a recommender system and virtual community website that recommends films based on user-provided ratings.
• The dataset we worked on contains a total of 100,000 ratings from 943 users on 1682 movie items.
• It was collected from September 19th, 1997 to April 22nd, 1998.
• The dataset includes a file in which every entry is a 4-tuple <user_id><item_id><rating><timestamp>.
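
Reading the 4-tuples could look like the sketch below. It assumes the four fields are tab-separated integers, as in the `u.data` file of the MovieLens 100K release; the `Rating` type, the `read_ratings` helper and the two sample rows are illustrative only.

```python
import csv
import io
from typing import NamedTuple

class Rating(NamedTuple):
    user_id: int
    item_id: int
    rating: int
    timestamp: int

def read_ratings(lines):
    """Parse tab-separated <user_id><item_id><rating><timestamp> rows."""
    reader = csv.reader(lines, delimiter="\t")
    return [Rating(*map(int, row)) for row in reader if row]

# Illustrative rows held in memory; a real run would pass an open file instead.
sample = io.StringIO("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")
ratings = read_ratings(sample)
print(ratings[0])
```

Grouping the parsed ratings by `user_id` gives the per-user rating dictionaries that the similarity measures work on.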


Flickr Dataset
• Flickr is an image and video hosting website where people host images that they embed in blogs and social media.
• The dataset we have used is MIRFLICKR-25000, a collection of 25,000 images downloaded from the social photography site Flickr through its public API.
• The average number of tags per image is 8.94. In the collection there are 1386 tags which occur in at least 20 images.
• The dataset includes a metadata folder named “meta” that contains, in a separate file for each image, all the tags associated with that image.


Visit my blog for more:
www.csekapil.wordpress.com
Motilal Nehru National Institute of Technology, Allahabad (India)
