This document discusses recommender systems and how they work. It covers two main types of recommender systems: content-based and collaborative filtering. Content-based systems recommend items similar to what a user has liked in the past. Collaborative filtering recommends items liked by similar users. The document then discusses key concepts like term frequency-inverse document frequency (TF-IDF) for representing text as vectors and calculating cosine similarity between vectors to find similar documents. It includes code examples for building a basic book recommender system using these techniques.
4. TYPES OF RECOMMENDER SYSTEMS
Content-Based: This technique recommends items to the user that are similar in
characteristics to items the user has liked in the past.
Collaborative Filtering: This technique is based on the idea that similar users
likely share the same interests and therefore like similar items.
5. RECOMMENDER ENGINES ARE POWERFUL
THEY HELP IN CROSS-SELLING AND UPSELLING OF PRODUCTS
6. RECOMMENDER SYSTEMS ARE NOT PERFECT
THERE ARE MINOR PROBLEMS LIKE COLD START AND IRRELEVANT RECOMMENDATIONS, BUT THERE ARE SOLUTIONS TO OVERCOME THEM
7. KEY CONCEPTS TO UNDERSTAND: VECTORS
Vectors: Computers speak and understand the language of numbers (technically only binary, i.e. 0s and 1s).
They do not understand words. Here begins the biggest problem for NLP!
A vector can be thought of as a line from the origin of a vector space with a direction and a magnitude.
Alternatively, it can be thought of as a point or coordinate in n-dimensional space.
Vectors are also commonly represented as a collection of numbers, e.g. [2, 3].
The fundamental idea in NLP is to convert texts or words into vectors and represent them in a vector space model.
This idea is elegant, and in essence this very idea of vectors is what has made the rapid strides in NLP,
machine learning, and AI possible.
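As a minimal illustration in Python (using numpy, which the code walkthrough later relies on as well), a vector is just a collection of numbers with a computable magnitude and direction:

import numpy as np

# A vector as a collection of numbers, e.g. [2, 3]
v = np.array([2.0, 3.0])

# Magnitude: the length of the line from the origin
magnitude = np.linalg.norm(v)   # sqrt(2^2 + 3^2) ≈ 3.6056

# Direction: the unit vector pointing the same way
direction = v / magnitude

print(magnitude, direction)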
In fact, Geoffrey Hinton (the "Father of Deep Learning") acknowledged in an MIT Technology Review article that the AI
institute in Toronto was named the "Vector Institute" owing to the beautiful properties of vectors that have helped
the field of deep learning and other variants of neural networks.
8. KEY CONCEPTS TO UNDERSTAND: TF-IDF (1/4)
TF-IDF stands for Term Frequency and Inverse Document Frequency. TF-IDF helps in evaluating the importance of a term (word) in a document.
TF – Term Frequency
To ascertain how frequently a term/word appears in a document, and to represent the document in vector form, let's break the process down
into the following steps.
Step 1: Create a dictionary of words (also known as a bag of words) present in the whole document space. We ignore common words, also
called stop words (e.g. the, of, a, an, is), since these words are so frequent that they do not help us in our goal of choosing important
words.
In this example I have used the file 'test1.csv', which contains the titles of 50 books. But to drive home the point, consider just 3 book titles
(documents) to be making up the whole document space. So B1 is one document, and B2 and B3 are the others. Together B1, B2, and B3 make
up the document space.
9. KEY CONCEPTS TO UNDERSTAND: TF-IDF (2/4)
B1 — Recommender Systems
B2 — The Elements of Statistical Learning
B3 — Recommender Systems — Advanced
Now, creating an index of these words (stop words ignored):
1. Recommender 2. Systems 3. Elements 4. Statistical 5. Learning 6. Advanced
Step 2: Forming the vector
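As a minimal sketch in Python (assuming simple whitespace tokenization and the vocabulary order from the index above), the term-count vector for each document comes out as:

vocab = ["recommender", "systems", "elements", "statistical", "learning", "advanced"]
stop_words = {"the", "of", "a", "an", "is"}

docs = {
    "B1": "Recommender Systems",
    "B2": "The Elements of Statistical Learning",
    "B3": "Recommender Systems Advanced",
}

# Count how often each vocabulary term appears in each title
for name, title in docs.items():
    words = [w for w in title.lower().split() if w not in stop_words]
    print(name, [words.count(term) for term in vocab])

# B1 [1, 1, 0, 0, 0, 0]
# B2 [0, 0, 1, 1, 1, 0]
# B3 [1, 1, 0, 0, 0, 1]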
10. KEY CONCEPTS TO UNDERSTAND: TF-IDF (3/4)
Term Frequency helps us identify how many times a term or word appears in a document, but there is an inherent problem: TF gives
more importance to frequently occurring terms while ignoring the importance of rare ones.
This is not ideal, as rare words often carry more importance or signal. This problem is resolved by IDF.
Also, a word/term might occur more frequently in longer documents than in shorter ones, hence Term Frequency normalization is carried
out.
TFn = (Number of times term t appears in a document) / (Total number of terms in the document), where n represents normalized.
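For example, in B1 ("Recommender Systems") each of the two terms appears once, so TFn = 1/2 = 0.5 for both "recommender" and "systems"; in B2, each of its three non-stop-word terms gets TFn = 1/3.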
IDF (Inverse Document Frequency)
11. KEY CONCEPTS TO UNDERSTAND: TF-IDF (4/4)
Basically, a simple definition would be: IDF = ln(total number of documents / number of documents with term t in it)
Now let’s take an example from our own dictionary or bag of words and calculate the IDFs
We had 6 terms or words which are as follows
1. Recommender 2. Systems 3. Elements 4. Statistical 5. Learning 6. Advanced
and our documents were :
B1 — Recommender Systems
B2 — The Elements of Statistical Learning
B3 — Recommender Systems — Advanced
Now IDF(w1) = ln(3/2); IDF(w2) = ln(3/2); IDF(w3) = ln(3/1); IDF(w4) = ln(3/1); IDF(w5) = ln(3/1); IDF(w6) = ln(3/1)
We then get the IDF vector:
= (0.4055, 0.4055, 1.0986, 1.0986, 1.0986, 1.0986)
Now the final step is to compute the TF-IDF weight. Each document's TF vector is multiplied element-wise by the IDF vector, giving a TF-IDF matrix.
The TF-IDF weight is represented as: TF-IDF Weight = TF(t, d) * IDF(t, D)
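As a sketch, these weights can be reproduced in numpy (the TF rows are the normalized vectors from the earlier step; ln, the natural log, matches the IDF values above):

import numpy as np

# Normalized TF vectors for B1, B2, B3 over the 6-term vocabulary
tf = np.array([
    [1/2, 1/2, 0,   0,   0,   0  ],   # B1: Recommender Systems
    [0,   0,   1/3, 1/3, 1/3, 0  ],   # B2: Elements, Statistical, Learning
    [1/3, 1/3, 0,   0,   0,   1/3],   # B3: Recommender Systems Advanced
])

# IDF = ln(total documents / documents containing the term)
idf = np.log(np.array([3/2, 3/2, 3, 3, 3, 3]))

# TF-IDF weight = TF(t, d) * IDF(t, D), element-wise per document row
tfidf = tf * idf
print(np.round(tfidf, 4))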
12. KEY CONCEPTS TO UNDERSTAND: COSINE SIMILARITY
Cosine similarity is a measure of similarity between two non-zero vectors. It is a measure of orientation, not magnitude, and is obtained
from the dot product of the two vectors.
The dot product is given by the formula A · B = ||A|| ||B|| cos(θ), so the similarity is cos(θ) = (A · B) / (||A|| ||B||).
This relation is why the cosine of the angle comes into the picture.
One of the beautiful things about vector representation is that we can now see how closely related two sentences are based on the angle their
respective vectors make. The cosine value ranges from -1 to 1.
So if two vectors make an angle of 0, the cosine value is 1, which in turn means the sentences are closely related to each other.
If the two vectors are orthogonal (cos 90° = 0), the sentences are essentially unrelated.
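A minimal numpy sketch of this formula, applied to the hand-computed TF-IDF rows from the earlier example (B1 and B3 share two terms; B1 and B2 share none):

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

b1 = np.array([0.2027, 0.2027, 0, 0, 0, 0])        # TF-IDF of B1
b2 = np.array([0, 0, 0.3662, 0.3662, 0.3662, 0])   # TF-IDF of B2
b3 = np.array([0.1352, 0.1352, 0, 0, 0, 0.3662])   # TF-IDF of B3

print(cosine_similarity(b1, b3))  # ≈ 0.46: shared terms, small angle
print(cosine_similarity(b1, b2))  # 0.0: orthogonal, nothing shared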
14. CODE WALKTHROUGH
• First we import standard packages like pandas, numpy, and sklearn
• We then read the CSV file which contains the 'Book Title' column
• From the sklearn package, we import TfidfVectorizer
• TfidfVectorizer helps create the TF-IDF scores. We apply it on the 'Book Title' column to generate the TF-IDF scores, as sketched below
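A sketch of these steps, assuming the CSV has a 'Book Title' column as described (the linked gist at the end is the authoritative version):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the file containing the 50 book titles
df = pd.read_csv("test1.csv")

# Build TF-IDF vectors from the 'Book Title' column,
# ignoring common English stop words
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(df["Book Title"])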
15. CODE WALKTHROUGH
Next, we calculate the cosine similarity between each pair of documents (read: Book Titles).
We then store the corresponding IDs in descending order of cosine similarity score, as sketched below.
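One way to sketch this step, continuing from the matrix above (the 'ID' column name is an assumption; the linked code may structure this differently):

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between every pair of titles
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# For each title, keep the other titles' IDs sorted by
# descending similarity (skipping the title itself)
results = {}
for idx, row in df.iterrows():
    ranked = cos_sim[idx].argsort()[::-1]
    results[row["ID"]] = [df.loc[i, "ID"] for i in ranked if i != idx]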
16. CODE WALKTHROUGH
Next, we define a function 'item' to get the Book Title for a corresponding ID.
Finally, through the function 'recommend', we recommend similar books once the user gives the arguments (ID, number of books to be recommended), as sketched below.
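A sketch of the two functions under the same assumptions (column names 'ID' and 'Book Title'):

def item(book_id):
    # Look up the Book Title for a given ID
    return df.loc[df["ID"] == book_id, "Book Title"].iloc[0]

def recommend(book_id, num):
    # Print the top-num most similar titles for the given ID
    print(f"Recommending {num} books similar to '{item(book_id)}':")
    for rec_id in results[book_id][:num]:
        print("  - " + item(rec_id))

# Example: top 3 recommendations for the book with ID 5
recommend(5, 3)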
17. CODE AND FILES
List of Books : https://gist.github.com/venkarafa/64df1ee21ae8d62bafe64a96f9bff881
Code : https://gist.github.com/venkarafa/0da815727f1ee098b201c371b60b2d72