CSC410-Presentation

Greymeter
Summer Intern
NAME – RAHUL PATIDAR (2012CS10244)
PROJECT – RECOMMENDATION ENGINE
COMPANY – GREYMETER SERVICES PVT. LTD.
VENUE – NOIDA, INDIA

Greymeter Services Pvt. Ltd.
How can you help me?
If you are a student, you can demonstrate your skills here and companies will hire you.
If you are a company, you can hire students or get your problems solved by students.

What might users like?
Goal: serve users with better services that interest them.

How should I solve this problem? I can’t find a solution.

See… if you want to provide better services to users, you have to recommend what they like.
A recommendation engine
For challenges and for jobs
Exactly… and then you can combine both into one.

Recommendation Engine
Content Based
• Personalized recommendations
• Recommends items similar to what the user has liked in the past
• Example: YouTube
Collaborative Filtering
• Recognizes commonalities between users on the basis of their activities
• Generates new recommendations based on inter-user comparisons
• Example: users who like X also like Y
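As a rough illustration of the "users who like X also like Y" idea, the sketch below counts which items co-occur with a given item across users' likes. The `likes` data and the `also_liked` helper are hypothetical and are not taken from the project.

```python
from collections import Counter

def also_liked(likes, item, top_n=3):
    """Return the items most often liked by users who also liked `item`."""
    counts = Counter()
    for liked in likes.values():
        if item in liked:
            # Count every other item this user liked.
            counts.update(liked - {item})
    return [other for other, _ in counts.most_common(top_n)]

# Hypothetical user -> liked items data.
likes = {
    "u1": {"X", "Y"},
    "u2": {"X", "Y", "Z"},
    "u3": {"Y"},
    "u4": {"X", "Y"},
}

print(also_liked(likes, "X"))  # ['Y', 'Z']
```

Here Y is recommended first because three of the users who liked X also liked Y, while only one also liked Z.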

Let’s find some tools that can help me develop a recommendation engine
Apache Mahout
• Open-source framework
• Uses the Apache Hadoop platform
• A suite of machine learning libraries
• Helps in building scalable machine learning algorithms such as collaborative filtering, classification, and clustering
• Used for big data
• Less efficient with small data

Mahout won’t be required as our data set is small. Let’s look at Scikit-Learn.
Scikit-Learn (sklearn)
• Simple and efficient tools for data mining and data analysis
• Built on Python, NumPy, and SciPy
• Features various classification, regression, and clustering algorithms
• Open source
Let’s go with Scikit-Learn, as it is simple to implement, efficient for small data, and built on Python.

OK… first let’s go for challenge recommendation
• Classify all challenges into categories such as finance, programming, design, management, communications, and marketing
• Calculate the similarity between challenges
• Calculate a recommendation index/score for each challenge based on user history

Classifying the Challenges
• Used a Multinomial Naïve Bayes classifier
• Training datasets: Wikipedia, Stack Overflow, and Stack Exchange
• Refined the training data by removing stopwords and applying stemming
• Converted the training examples into tf-idf form
• Trained the Multinomial Naïve Bayes classifier on this tf-idf matrix

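Tf-idf means Term Frequency-Inverse Document Frequency
• Term frequency:
\[ \mathrm{tf}(t, d) = \log(1 + f_{t,d}) \]
where $f_{t,d}$ is the frequency of term $t$ in document $d$.
• Inverse document frequency:
\[ \mathrm{idf}(t, D) = \log\frac{M}{1 + f_{t,D}} \]
where $f_{t,D}$ is the number of documents in which term $t$ appears and $M$ is the total number of documents.
• Combined weight:
\[ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \]
• Convert the text corpus into an $N \times M$ tf-idf matrix, where $N$ is the number of terms and $M$ is the number of documents:
\[ \begin{pmatrix} t_{11} & \cdots & t_{1M} \\ \vdots & t_{ij} & \vdots \\ t_{N1} & \cdots & t_{NM} \end{pmatrix}, \qquad t_{ij} = \mathrm{tfidf}(t_i, d_j, D) \]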
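The formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the whitespace tokenizer, the `tfidf_matrix` helper, and the sample documents are assumptions. Note that with this idf definition a term that appears in every document gets a negative weight.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (terms, matrix) with matrix[i][j] = tfidf(term i, document j)."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({tok for doc in tokenized for tok in doc})
    m = len(tokenized)                        # M: total number of documents
    counts = [Counter(doc) for doc in tokenized]
    # f_{t,D}: number of documents in which each term appears
    doc_freq = {t: sum(1 for c in counts if t in c) for t in terms}
    matrix = []
    for t in terms:
        idf = math.log(m / (1 + doc_freq[t]))
        # tf(t, d) = log(1 + f_{t,d}); a missing term has frequency 0, so tf = 0
        matrix.append([math.log(1 + c[t]) * idf for c in counts])
    return terms, matrix

# Hypothetical challenge statements.
docs = [
    "design a logo",
    "design a database schema",
    "write a marketing plan",
    "plan a budget",
]
terms, matrix = tfidf_matrix(docs)
```

Each row of `matrix` corresponds to one term and each column to one document, matching the N x M layout described above.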

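Multinomial Naïve Bayes Classifier
• Bayes' theorem:
\[ P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)} \]
\[ P(Y \mid x_1, \ldots, x_i, \ldots, x_n) = \frac{P(Y)\, P(x_1, \ldots, x_i, \ldots, x_n \mid Y)}{P(x_1, \ldots, x_i, \ldots, x_n)} \]
\[ \mathrm{Class}(d) = \arg\max_{Y} P(Y \mid x_1, \ldots, x_i, \ldots, x_n) \]
\[ \mathrm{Class}(d) = \arg\max_{Y} P(Y)\, P(x_1, \ldots, x_i, \ldots, x_n \mid Y) \]
\[ P(x_1, \ldots, x_i, \ldots, x_n \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y) \]
\[ P(x_i \mid Y) = \frac{\text{frequency of } x_i \text{ in class } Y}{\text{total number of terms in class } Y} \]
\[ P(Y) = \frac{|\{d \mid d \in Y\}|}{|D|} \]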
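A minimal plain-Python sketch of these estimates is shown below. It works on raw term counts and in log space, and it adds Laplace smoothing (`alpha`), which is an assumption not stated on the slide, so that unseen terms do not produce zero probabilities. The training examples and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs, alpha=1.0):
    """Estimate log P(Y) and log P(x_i | Y) from (tokens, label) pairs."""
    class_docs = Counter()               # number of documents per class
    term_counts = defaultdict(Counter)   # term frequencies per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    # P(Y) = |{d : d in Y}| / |D|
    log_prior = {y: math.log(n / total_docs) for y, n in class_docs.items()}
    log_likelihood = {}
    for y in class_docs:
        # Assumption: Laplace smoothing with alpha, not mentioned in the slides.
        denom = sum(term_counts[y].values()) + alpha * len(vocab)
        log_likelihood[y] = {
            t: math.log((term_counts[y][t] + alpha) / denom) for t in vocab
        }
    return log_prior, log_likelihood

def classify(tokens, log_prior, log_likelihood):
    """Return argmax_Y of log P(Y) + sum_i log P(x_i | Y); unknown terms are ignored."""
    def score(y):
        return log_prior[y] + sum(
            log_likelihood[y][t] for t in tokens if t in log_likelihood[y]
        )
    return max(log_prior, key=score)

# Hypothetical training examples.
training = [
    ("build a web app in python".split(), "programming"),
    ("fix the sorting algorithm bug".split(), "programming"),
    ("forecast quarterly revenue and budget".split(), "finance"),
    ("estimate loan interest and budget".split(), "finance"),
]
log_prior, log_likelihood = train_nb(training)
print(classify("python algorithm".split(), log_prior, log_likelihood))  # programming
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long documents and does not change the argmax.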

Challenges Similarity
Description similarity
• Textual similarity between challenge statements
• Convert each challenge statement into a tf-idf vector
• Euclidean distance between two vectors as the similarity measure
• The higher the distance, the lower the similarity
Feature similarity
• Features were weighted based on their relevance and on testing
• Calculated the weighted Euclidean distance between two feature vectors
Challenge similarity = description similarity + feature similarity
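One way to read the combination above is sketched here: a plain Euclidean distance on the description vectors plus a weighted Euclidean distance on the feature vectors. The function names, vectors, and weights are hypothetical, and summing the two distances directly is an assumption about how the two parts were combined.

```python
import math

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_euclidean(u, v, weights):
    """Euclidean distance where each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

def challenge_distance(desc_u, desc_v, feat_u, feat_v, weights):
    """Description distance plus weighted feature distance (lower means more similar)."""
    return euclidean(desc_u, desc_v) + weighted_euclidean(feat_u, feat_v, weights)

# Hypothetical description vectors, feature vectors, and feature weights.
print(challenge_distance([0, 0], [3, 4], [1, 0], [0, 0], [4, 1]))  # 7.0
```

The result is a distance, so two challenges are more similar when the value is smaller.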

Let’s see what the user is doing ;) and then recommend challenges to them.
• We monitor various user activities, which form the basis of the recommendations
• Calculation of the recommendation score:
initialize score of each challenge with 0;
for each (ch, act) in activity {
    simChal = challenges similar to ch;
    for each c in simChal {
        score(c) = score(c) + log(w_act * (1 / distance(ch, c)));
    }
}
for each ch in all_challenges {
    score(ch) = score(ch) + log(# common interests of user and ch);
    score(ch) = score(ch) * (1 / (deadline - current date));
}
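The scoring loop can be sketched in Python as below. The data layout (`activity` pairs, the `similar` lookup, `tags`, `deadline` in days) and all names are hypothetical. Three guards are assumptions that the slide does not mention: zero distances are skipped, the interest term is skipped when there are no common interests, and challenges whose deadline has passed are left out.

```python
import math
from datetime import date

def recommendation_scores(activity, similar, challenges, interests, act_weights, today):
    """Score open challenges for one user, following the loop described above."""
    scores = {cid: 0.0 for cid in challenges}
    # Every past activity boosts the challenges similar to the one acted on.
    for ch, act in activity:
        for c, dist in similar.get(ch, []):
            if c in scores and dist > 0:  # assumption: skip zero distances
                scores[c] += math.log(act_weights[act] / dist)
    result = {}
    for cid, info in challenges.items():
        days_left = (info["deadline"] - today).days
        if days_left <= 0:
            continue  # assumption: closed challenges are not recommended
        score = scores[cid]
        common = len(interests & info["tags"])
        if common > 0:  # assumption: log(0) is undefined, so skip the term
            score += math.log(common)
        result[cid] = score / days_left
    return result

# Hypothetical data for one user.
challenges = {
    "c1": {"tags": {"python"}, "deadline": date(2015, 7, 11)},
    "c2": {"tags": {"python", "design"}, "deadline": date(2015, 7, 11)},
    "c3": {"tags": {"finance"}, "deadline": date(2015, 6, 30)},
}
similar = {"c1": [("c2", 0.5)]}          # (similar challenge, distance)
activity = [("c1", "viewed")]
act_weights = {"viewed": 1.0, "submitted": 3.0}
scores = recommendation_scores(
    activity, similar, challenges, {"python", "design"}, act_weights, date(2015, 7, 1)
)
```

In this example `c2` scores highest because it is close to a challenge the user viewed and shares two interests with the user, while `c3` is dropped because its deadline has passed.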

Jobs/internships you may like
• Similar to the challenge recommendation engine
• The only change is that here we have jobs and their features
• Key feature: the number of times the company appears in the user’s challenge activity. Add this factor to the job’s recommendation score
• Everything else is the same.
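A small sketch of the company factor follows. The slide does not say how the count is combined with the score, so adding it linearly with a hypothetical `weight` is an assumption, as are the function name and the sample data.

```python
from collections import Counter

def add_company_factor(job_scores, job_company, activity_companies, weight=1.0):
    """Boost each job by how often its company appears in the user's challenge activity."""
    counts = Counter(activity_companies)
    return {
        job: score + weight * counts[job_company[job]]
        for job, score in job_scores.items()
    }

# Hypothetical base scores, job -> company mapping, and activity history.
job_scores = {"j1": 0.5, "j2": 0.5}
job_company = {"j1": "Acme", "j2": "Globex"}
activity_companies = ["Acme", "Acme", "Initech"]

print(add_company_factor(job_scores, job_company, activity_companies))
# {'j1': 2.5, 'j2': 0.5}
```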

Challenges faced
I was unable to decide which tool/framework I should choose for my work.
I couldn’t find a ready-made dataset that fulfilled our requirements, so the last option was to crawl the web.
And testing was a headache.

Learning and experience
Explored Python and the Scikit-learn platform.
How a startup works: a lot of hard work goes into it, day and night.
Management team member of the Hackathon organized by Greymeter and Unicommerce.
