A new similarity measurement based on hellinger distance for collaborating filtering in sparse data set

A New Similarity Measurement based on Hellinger Distance
For Collaborating Filtering in Sparse Data Set
Submitted in Fulfillment of Requirements for the
Degree of
MASTER OF TECHNOLOGY IN
COMPUTER SCIENCE AND ENGINEERING
specialization in
Information Security
by
Prabhu Kumar (15MT000624)
Under the guidance of
Dr. Rajendra Pamula
(Assistant Professor)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY (INDIAN SCHOOL OF MINES), DHANBAD
INDIA
MAY 2017

Outline
• Introduction to recommender systems
• Source of information
• Types of recommendation system
• Architecture
• Similarity measurements
• Proposed method
• Result
• References

Introduction
What is Recommender System?
• It is a generic machine learning technique,
or information filtering system, that predicts
a user’s preferences.

Example of Recommender System
• Recommender systems are widely used for movie, news, and music recommendation, among others.

Source of Information
• The data collected for recommendation comes from content, demographic, and
social media information.

Source of information (Continued..)

Types of Recommendation
1. Collaborative filtering recommendation system: It is based on the way humans
have made decisions throughout history, and it relies on the ratings that users
have given to items they have already used. The algorithm analyzes these ratings
and predicts items to recommend.
2. Content-based recommendation system: It is based on the choices the user made
in the past, in terms of which content the user liked most.
3. Hybrid recommendation system: A combination of both.
If techniques A and B are combined for recommendation, the strengths of A compensate for the
disadvantages of B, and the strengths of B compensate for the disadvantages of A.

Collaborative filtering based recommender system

Content based recommender system

Architecture of recommender system

• For the matching process in a recommender system:
“The KNN algorithm is one of the most useful algorithms for recommending
items to users.”

KNN algorithm (user-oriented)
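
As a rough illustration of user-oriented KNN, the sketch below ranks all other users by a similarity score and keeps the top k. The data layout (one dictionary of item ratings per user), the function name `k_nearest_users`, and the placeholder `overlap_similarity` are assumptions made for this example; any of the similarity measurements discussed in this deck could be passed in instead.

```python
from typing import Callable, Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> {item: rating}

def k_nearest_users(active: str, ratings: Ratings,
                    sim: Callable[[Dict[str, float], Dict[str, float]], float],
                    k: int) -> List[Tuple[str, float]]:
    """Return the k users most similar to `active`, best first."""
    scored = [(other, sim(ratings[active], ratings[other]))
              for other in ratings if other != active]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def overlap_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Placeholder similarity (an assumption): share of items rated by both."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

# Illustrative ratings only.
ratings = {
    "User1": {"Item1": 4, "Item2": 3, "Item3": 5, "Item4": 4},
    "User2": {"Item1": 5, "Item2": 3},
    "User3": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 4},
}
neighbours = k_nearest_users("User1", ratings, overlap_similarity, k=1)
print(neighbours)
```

The neighbours found this way would then supply the ratings from which predictions for the active user are formed.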

Similarity Measurements
• Cosine similarity:
“It measures the angle between two rating vectors; the smaller the angle, the higher the similarity.”
$$\mathrm{sim}(u, v)^{\cos} = \frac{\vec{r}_u \cdot \vec{r}_v}{\lVert \vec{r}_u \rVert \, \lVert \vec{r}_v \rVert}$$
“A vector is a quantity that has both magnitude and direction.”
Drawbacks:
• If the two vectors lie on the same line, for example a=(2,2,2,2) and b=(3,3,3,3), the angle between them is 0
and the cosine value is 1, so the users appear identical even though their ratings differ.
• It depends only on co-rated items, so it suffers when few items are co-rated.
• A similarity measurement is a technique that finds the nearest neighbors of a specific active user for
further processing of the recommendation.
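
The first drawback can be checked numerically. The following is a minimal sketch of cosine similarity on two equal-length rating vectors; the function name and the choice to return 0.0 for a zero vector are assumptions made for this example.

```python
import math

def cosine_similarity(r_u, r_v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(r_u, r_v))
    norm_u = math.sqrt(sum(a * a for a in r_u))
    norm_v = math.sqrt(sum(b * b for b in r_v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_u * norm_v)

a = (2, 2, 2, 2)
b = (3, 3, 3, 3)
sim_ab = cosine_similarity(a, b)
print(sim_ab)  # the vectors are parallel, so the cosine is 1
```

Although the two users rate every item differently, the measure reports maximal similarity because only the direction of the vectors matters.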

• ACOS (Adjusted Cosine Similarity): “Some people tend to rate high even when they do not like an item very
much, while others tend to rate low even when they like the item a lot. ACOS was introduced to handle this.”
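$$\mathrm{sim}(u, v)^{ACOS} = \frac{\sum_{j=1}^{n} (r_{uj} - \bar{r}_{u})(r_{vj} - \bar{r}_{v})}{\sqrt{\sum_{j=1}^{n} (r_{uj} - \bar{r}_{u})^2} \, \sqrt{\sum_{j=1}^{n} (r_{vj} - \bar{r}_{v})^2}}$$
where n is the total number of co-rated items and the barred terms are the corresponding mean ratings.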
Drawbacks:
• Similar rating problems
• Few co-rated item problems
• Pearson’s correlation coefficient (PCC): “It finds the linear correlation between two rating vectors.”
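$$\mathrm{sim}(u, v)^{PCC} = \frac{\sum_{p \in j} (r_{u,p} - \bar{r}_u)(r_{v,p} - \bar{r}_v)}{\sqrt{\sum_{p \in j} (r_{u,p} - \bar{r}_u)^2} \cdot \sqrt{\sum_{p \in j} (r_{v,p} - \bar{r}_v)^2}}$$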
Drawbacks:
• If a rating vector is flat, for example a=(2,2,2,2) with b=(1,2,3,4), the denominator is zero and PCC cannot be calculated.
• If there is only one co-rated item, PCC will be “0”, so it suffers when there are few co-rated items.
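
The flat-vector drawback follows directly from the denominator. The sketch below is a minimal PCC over co-rated ratings, assuming the means are taken over the co-rated items; it returns `None` when the value is undefined, which is a choice made for this example (the slide treats the single-item case as 0).

```python
import math

def pearson(r_u, r_v):
    """PCC over co-rated ratings; None when the denominator is zero."""
    n = len(r_u)
    mean_u = sum(r_u) / n
    mean_v = sum(r_v) / n
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(r_u, r_v))
    den_u = math.sqrt(sum((a - mean_u) ** 2 for a in r_u))
    den_v = math.sqrt(sum((b - mean_v) ** 2 for b in r_v))
    den = den_u * den_v
    return num / den if den != 0 else None

flat = pearson((2, 2, 2, 2), (1, 2, 3, 4))    # flat vector: zero variance
single = pearson((4,), (5,))                  # only one co-rated item
linear = pearson((1, 2, 3, 4), (2, 4, 6, 8))  # perfectly linear relation
print(flat, single, linear)
```

Both problem cases produce a zero denominator, while the linearly related vectors give a correlation of 1.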

PIP (Proximity-Impact-Popularity):
$$\mathrm{sim}(u, v)^{PIP} = \sum_{j \in \text{co-rated items}} PIP(r_{uj}, r_{vj})$$
where $PIP(r_1, r_2) = \mathrm{proximity}(r_1, r_2) \cdot \mathrm{impact}(r_1, r_2) \cdot \mathrm{popularity}(r_1, r_2)$.
If $r_1 > r_{med}$ and $r_2 > r_{med}$:
$$\mathrm{proximity}(r_1, r_2) = |r_1 - r_2|$$
$$\mathrm{impact}(r_1, r_2) = (|r_1 - r_{med}| + 1)(|r_2 - r_{med}| + 1)$$
$$\mathrm{popularity}(r_1, r_2) = 1 + \left(\frac{r_1 + r_2}{2} - \mu_k\right)^2$$
else:
$$\mathrm{proximity}(r_1, r_2) = 2 \cdot |r_1 - r_2|$$
$$\mathrm{impact}(r_1, r_2) = \frac{1}{(|r_1 - r_{med}| + 1)(|r_2 - r_{med}| + 1)}$$
$$\mathrm{popularity}(r_1, r_2) = 1$$
and $\mu_k$ is the average rating of that particular item over all users who rated it.
Drawbacks:
• It doesn’t consider the proportion of common ratings made by users
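
The PIP rules can be read as a small piecewise function. The sketch below transcribes the formulas as they appear on this slide, assuming the differences are absolute values; it is not a reference implementation of the published PIP measure, and the function names and sample numbers are made up for illustration.

```python
def pip_pair(r1, r2, r_med, mu_k):
    """PIP score for one co-rated item, following the slide's two cases."""
    if r1 > r_med and r2 > r_med:
        proximity = abs(r1 - r2)
        impact = (abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1)
        popularity = 1 + ((r1 + r2) / 2 - mu_k) ** 2
    else:
        proximity = 2 * abs(r1 - r2)
        impact = 1 / ((abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1))
        popularity = 1
    return proximity * impact * popularity

def pip_similarity(r_u, r_v, r_med, item_means):
    """Sum of per-item PIP scores over the co-rated items."""
    return sum(pip_pair(a, b, r_med, mu)
               for a, b, mu in zip(r_u, r_v, item_means))

both_high = pip_pair(4, 5, r_med=3, mu_k=4)  # 1 * 6 * 1.25
mixed = pip_pair(2, 4, r_med=3, mu_k=3)      # 4 * 0.25 * 1
total = pip_similarity((4, 2), (5, 4), r_med=3, item_means=(4, 3))
print(both_high, mixed, total)
```

Note that the score depends only on the co-rated ratings, which is why the proportion of common ratings is not reflected in it.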

• Jaccard similarity measurement:
“It only considers the number of common ratings between two users.”
$$\mathrm{sim}(u, v)^{Jaccard} = \frac{|I_u \cap I_v|}{|I_u \cup I_v|}$$
Drawbacks:
• It doesn’t consider the absolute rating values.
• Mean squared difference (MSD):
“It only considers the absolute rating values.”
$$\mathrm{sim}(u, v)^{MSD} = 1 - \frac{\sum_{p \in I} (r_{u,p} - r_{v,p})^2}{|I|}$$
Drawbacks:
• It doesn’t consider the number of common ratings between two users, so it ignores the credibility of the similarity
measurement.
• It ignores the proportion of common ratings between two users.
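
Jaccard and MSD look at complementary information, which a small side-by-side sketch makes visible. The dictionary layout, the function names, and the sample ratings are assumptions for this example; returning 0.0 when there are no common items is also an assumption, and the ratings are left unnormalized as on the slide, so MSD can become negative for large differences.

```python
def jaccard(items_u, items_v):
    """Share of items rated by both users among items rated by either."""
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0

def msd_similarity(ratings_u, ratings_v):
    """1 minus the mean squared rating difference over co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no co-rated items means no evidence
    total = sum((ratings_u[p] - ratings_v[p]) ** 2 for p in common)
    return 1 - total / len(common)

u = {"Item1": 4, "Item2": 3, "Item3": 5, "Item4": 4}
v = {"Item1": 5, "Item2": 3}
jac = jaccard(set(u), set(v))  # 2 common items / 4 items in the union
msd = msd_similarity(u, v)     # 1 - ((4-5)^2 + (3-3)^2) / 2
print(jac, msd)
```

Jaccard ignores how the items were rated, and MSD ignores how many items the users share, which matches the drawbacks listed for each.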

Proposed method
Hellinger distance:
• It is used to quantify the similarity between two vectors.
• The minimum Hellinger distance is zero, which occurs when both users give exactly the same ratings to all
items (or when neither user has rated any item).
• Without normalization, the value ranges from 0 to √2.
• The factor 1/√2 is defined so that H(P,Q) ≤ 1 when P and Q are probability distributions.
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
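Let P = {2, 3, 1} and Q = {3, 2, 3}.
So, Hellinger distance
$$= \frac{1}{\sqrt{2}} \sqrt{(\sqrt{2} - \sqrt{3})^2 + (\sqrt{3} - \sqrt{2})^2 + (\sqrt{1} - \sqrt{3})^2}$$
$$= \frac{1}{\sqrt{2}} \sqrt{0.101021 + 0.101021 + 0.53589838} = \frac{1}{\sqrt{2}} \times 0.85903 = 0.60743$$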
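
The worked example above can be reproduced in a few lines. This is a minimal sketch of the Hellinger distance with the 1/√2 factor, applied directly to the raw rating vectors P and Q from the example; the function name is chosen for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two equal-length non-negative vectors."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

P = [2, 3, 1]
Q = [3, 2, 3]
h = hellinger(P, Q)
print(round(h, 5))
```

Identical vectors give a distance of 0, for example `hellinger(P, P)`.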

Local references:
• It plays an important role in finding local information about the users’ ratings.
• It must provide positive as well as negative correlation between two users.
• It is used to find the actual relation between two users according to their ratings.
$$\mathrm{loc}_{med}(r_{ui}, r_{vi}) = \frac{(r_{ui} - r_{med})(r_{vi} - r_{med})}{\sqrt{\sum_{k \in I_u} (r_{uk} - r_{med})^2} \, \sqrt{\sum_{k \in I_v} (r_{vk} - r_{med})^2}}$$
where k ranges over all items rated by the user (I_u for user u, I_v for user v),
r_ui is the rating by user u for the i-th item,
r_vi is the rating by user v for the i-th item, and
r_med is the average of the ratings by users.
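
To see how the local term yields both positive and negative values, here is a minimal sketch of the formula above. The reference value `r_med` is passed in as a parameter, and the function name, the zero-denominator fallback, and the sample ratings are assumptions for this example.

```python
import math

def loc_med(r_ui, r_vi, ratings_u, ratings_v, r_med):
    """Local similarity of one rating pair, normalised by each user's
    spread around r_med over all items that user has rated."""
    den_u = math.sqrt(sum((r - r_med) ** 2 for r in ratings_u))
    den_v = math.sqrt(sum((r - r_med) ** 2 for r in ratings_v))
    if den_u == 0 or den_v == 0:
        return 0.0  # assumption: no spread around r_med, no local signal
    return (r_ui - r_med) * (r_vi - r_med) / (den_u * den_v)

ratings_u = [4, 3, 5, 4]  # all ratings of user u (illustrative)
ratings_v = [2, 1]        # all ratings of user v (illustrative)
opposite = loc_med(4, 2, ratings_u, ratings_v, r_med=3)  # above vs below
same_side = loc_med(4, 4, ratings_u, ratings_u, r_med=3) # both above
print(opposite, same_side)
```

A pair of ratings on opposite sides of `r_med` contributes a negative value, and a pair on the same side contributes a positive one.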

Proposed method equation :
$$S(u, v) = H(u, v) \cdot \sum_{i \in u} \sum_{j \in v} \mathrm{loc}(r_{ui}, r_{vj}) + \mathrm{Jaccard}(u, v)$$
where
H(u, v) is the Hellinger distance,
loc(r_ui, r_vj) is the local similarity between the two users’ ratings of the items, and
Jaccard(u, v) measures the proportion of ratings the two users have in common.

Result:
• In this graph, the flat item-rating and few-common-rating problems are solved using the proposed
method.
• U1-U3 and U2-U4 have flat ratings; U4-U5 shows the improvement in common rating proportion.
• U3 to U5 have the few co-rated items problem.
        Item1   Item2   Item3   Item4
User1   4       3       5       4
User2   5       3       -       -
User3   4       3       4       4
User4   2       1       -       -
User5   4       2       -       -

• The problems of identical co-rated vectors and few co-rated items are improved using the proposed method, and
the simultaneous rating difference problem is also solved.
• U1 and U3 have the same co-rated vector; this case improves with the proposed method.
• U1 and U5 suffer from few co-rated items.
• U4 and U5 have the simultaneous difference problem.

• The problems of local similarity and rating proportion are improved using the proposed
method.
• U4 and U5 have the rating proportion problem under PIP, which is improved by the proposed method.
• U1 and U4 have the few co-rated items problem.
• U2 and U4 show the local similarity improvement.

Evaluation of Proposed method in large dataset
• The evaluation uses two large MovieLens datasets. ML-100K contains 100,000 ratings from
943 users on 1682 movies. ML-1M contains 1,000,209 ratings from 6040 users on 3952
movies. Each user has rated at least 20 movies.
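
The dataset sizes quoted above make the sparsity in the title concrete. The short calculation below, with a hypothetical helper name, divides the number of ratings by the number of possible user-movie pairs.

```python
def density(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that actually holds a rating."""
    return n_ratings / (n_users * n_items)

ml100k = density(100_000, 943, 1682)
ml1m = density(1_000_209, 6040, 3952)
print(f"ML-100K: {ml100k:.2%} filled, {1 - ml100k:.2%} empty")
print(f"ML-1M:   {ml1m:.2%} filled, {1 - ml1m:.2%} empty")
```

Only about 6% of the ML-100K matrix and about 4% of the ML-1M matrix are filled, so most user pairs share few co-rated items.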

• Movie recommendation using cosine similarity and the proposed method.

• Movie recommendation using PIP (proximity-impact-popularity) and the
proposed method.

References
• J. Bobadilla, F. Ortega, A. Hernando, A. Gutiérrez, Recommender systems survey, Knowl.-Based Syst. 46 (2013) 109–132.
• P. Resnick, H.R. Varian, Recommender systems, Commun. ACM 40 (3) (1997) 56–58.
• G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Comput. 7 (1) (2003) 76–80.
• Y. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 426–434.
• C. Desrosiers, G. Karypis, A comprehensive survey of neighborhood-based recommendation methods, in: Recommender Systems Handbook, 2011, pp. 107–144.
• M.J. Pazzani, D. Billsus, Content-based recommendation systems, The Adap. Web (2007) 325–341.
• H. Junming, C. Xueqi, G. Jiafeng, S. Huawei, Y. Kun, Social recommendation with interpersonal influence, ECAI 10 (2010) 601–606.

Thank You !
Special thanks to my project guide, Dr. Rajendra Pamula, for
guiding and motivating me and for providing fruitful information throughout
the development of this project work.
My sincere gratitude to the panel of teachers present for giving their
precious time to listen to and evaluate my project presentation.
