Greymeter
Summer Intern
NAME – RAHUL PATIDAR (2012CS10244)
PROJECT – RECOMMENDATION ENGINE
COMPANY – GREYMETER SERVICES PVT. LTD.
VENUE - NOIDA, INDIA
1
Greymeter Services Pvt. Ltd.
How can you help me?
If you are a student, you can demonstrate your skills here and companies will hire you.
If you are a company, you can hire students or get your problems solved by students.
2
What might users like?
To serve users better with services that interest them
3
What might users like?
4
Challenges Jobs/Internships
5
How should I solve this
problem… I can’t find one
6
For challenges and for jobs
Exactly… and then you can combine both into one
A recommendation engine
See… if you want to provide better services to users, you have to recommend what they like
7
Recommendation Engine
Content Based
• Personalized recommendations
• Recommends items similar to what the user has liked in the past
• Example – YouTube
Collaborative Filtering
• Recognizes commonalities between users on the basis of their activities
• Generates new recommendations based on inter-user comparisons
• Example – users who like X also like Y
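The "users who like X also like Y" idea can be illustrated with a minimal co-occurrence sketch. This is only an illustration, not something the slides implement: the `likes` data, the user and item names, and the helper names `co_like_counts` and `also_liked` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_like_counts(likes):
    """Count how often each ordered pair of items is liked by the same user."""
    pairs = Counter()
    for items in likes.values():
        for a, b in combinations(sorted(items), 2):
            pairs[(a, b)] += 1
            pairs[(b, a)] += 1
    return pairs


def also_liked(item, likes, top_n=3):
    """Items most often liked together with `item`, most frequent first."""
    pairs = co_like_counts(likes)
    ranked = sorted(
        ((count, other) for (first, other), count in pairs.items() if first == item),
        key=lambda p: (-p[0], p[1]),  # ties broken alphabetically
    )
    return [other for _, other in ranked[:top_n]]


# Hypothetical like history: user -> set of liked items.
likes = {"u1": {"X", "Y"}, "u2": {"X", "Y", "Z"}, "u3": {"Y", "Z"}}
print(also_liked("X", likes))
```

Here Y is liked together with X by two users and Z by one, so Y ranks first.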
8
Let's find some tools to help me develop a recommendation engine
Apache Mahout
• Open-source framework
• Uses the Apache Hadoop platform
• A suite of machine learning libraries
• Helps in building scalable machine learning algorithms such as collaborative filtering, classification and clustering
• Used for big data
• Less efficient with small data
9
Mahout won't be required as our data set is small; let's look at Scikit-Learn
Scikit-Learn (sklearn)
• Simple and efficient tools for data mining and data analysis
• Built on Python, NumPy and SciPy
• Features various classification, regression and clustering algorithms
• Open source
Let's go with Scikit-Learn, as it is simple to implement, efficient for small data and built on Python
10
OK… first let's go for challenge recommendation
• Classified all challenges into categories such as finance, programming, design, management, communications and marketing
• Calculated challenge similarity
• Calculated a recommendation index/score for each challenge based on user history
11
Classifying the Challenges
• Used a Multinomial Naïve Bayes classifier
• Training datasets – Wikipedia, Stack Overflow and Stack Exchange
• Refined the training data by removing stopwords and stemming
• Converted the training examples into tf-idf form
• Used this tf-idf matrix to train the Multinomial Naïve Bayes classifier
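The refinement step (stopword removal and stemming) might look roughly like the sketch below. It uses only the standard library so that it stays self-contained. The stopword list and the suffix-stripping `crude_stem` are deliberately tiny, hypothetical stand-ins for a full stopword list and a real stemmer such as Porter's, and are not what the project necessarily used.

```python
import re

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

# Suffixes stripped by this toy stemmer, checked in this order (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(tok) for tok in tokens if tok not in STOPWORDS]


print(preprocess("The students are solving programming challenges"))
```

The resulting token lists are what would then be converted to tf-idf form and passed to the classifier.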
12
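Tf-idf means Term Frequency-Inverse Document Frequency
• Term frequency:
$$\mathrm{tf}(t, d) = \log\left(1 + f_{t,d}\right)$$
where $f_{t,d}$ is the frequency of term $t$ in document $d$.
• Inverse document frequency:
$$\mathrm{idf}(t, D) = \log\frac{M}{1 + f_{t,D}}$$
where $f_{t,D}$ is the number of documents in which term $t$ appears and $M$ is the total number of documents.
• Combined weight:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
• Convert the text corpus into an N*M tf-idf matrix, where N is the number of terms and M is the number of documents (n = N and m = M in the indices below):
$$\begin{pmatrix} t_{11} & \cdots & t_{n1} \\ \vdots & t_{ij} & \vdots \\ t_{1m} & \cdots & t_{nm} \end{pmatrix}, \qquad t_{ij} = \mathrm{tfidf}(t_i, d_j, D)$$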
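As a rough check of these definitions, the following sketch computes tf, idf and the tf-idf matrix for a toy corpus using only the standard library. The corpus and the function names are made up for illustration, and the natural logarithm is an assumption, since the slides do not state the base. scikit-learn's own tf-idf weighting uses a slightly different smoothing, so its numbers would not match these exactly.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """tf(t, d) = log(1 + f_td), with f_td the raw count of t in d."""
    return math.log(1 + Counter(doc_tokens)[term])


def idf(term, corpus):
    """idf(t, D) = log(M / (1 + f_tD)), f_tD = number of docs containing t."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + docs_with_term))


def tfidf_matrix(corpus):
    """Return (vocabulary, rows): one row per document, one column per term."""
    vocab = sorted({tok for doc in corpus for tok in doc})
    rows = [[tf(t, doc) * idf(t, corpus) for t in vocab] for doc in corpus]
    return vocab, rows


# Hypothetical, already tokenized corpus of four short documents.
corpus = [
    ["python", "code", "code"],
    ["stock", "market"],
    ["logo", "design"],
    ["python", "design"],
]
vocab, rows = tfidf_matrix(corpus)
for doc, row in zip(corpus, rows):
    print(doc, [round(x, 3) for x in row])
```

One consequence of this particular idf formula is worth noting: a term that appears in all but one document gets an idf of exactly 0, and a term that appears in every document gets a slightly negative idf.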
13
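Multinomial Naïve Bayes Classifier
• Bayes' theorem:
$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$
$$P(Y \mid x_1, \ldots, x_n) = \frac{P(Y)\,P(x_1, \ldots, x_n \mid Y)}{P(x_1, \ldots, x_n)}$$
$$\mathrm{Class}(d) = \arg\max_{Y} P(Y \mid x_1, \ldots, x_n)$$
$$\mathrm{Class}(d) = \arg\max_{Y} P(Y)\,P(x_1, \ldots, x_n \mid Y)$$
$$P(x_1, \ldots, x_n \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$$
$$P(x_i \mid Y) = \frac{\text{frequency of } x_i \text{ in class } Y}{\text{total number of terms in class } Y}$$
$$P(Y) = \frac{|\{d \mid d \in Y\}|}{|D|}$$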
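The decision rule above can be sketched from scratch in a few lines. This is an illustrative version, not the scikit-learn implementation: it works in log space to avoid underflow, and it adds Laplace smoothing (`alpha`), which is not on the slide but is assumed here so that a term unseen in a class does not force the whole product to zero. The training documents and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Collect per-class document counts and per-class term counts."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {tok for tokens in docs for tok in tokens}
    return class_docs, term_counts, vocab


def classify(tokens, model, alpha=1.0):
    """Return argmax over Y of log P(Y) + sum_i log P(x_i | Y)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log P(Y)
        total_terms = sum(term_counts[label].values())
        for tok in tokens:
            # Smoothed P(x_i | Y); alpha is an assumption, not on the slide.
            p = (term_counts[label][tok] + alpha) / (total_terms + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tokenized training documents and their categories.
docs = [["python", "code", "bug"], ["code", "compiler"],
        ["stock", "market"], ["market", "loan", "stock"]]
labels = ["programming", "programming", "finance", "finance"]
model = train(docs, labels)
print(classify(["code", "bug"], model))
```

The denominator $P(x_1, \ldots, x_n)$ is dropped because it is the same for every class and so does not change the arg max.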
14
Challenges Similarity
Description similarity
• Textual similarity between challenge statements
• Convert the challenge statements into a tf-idf matrix
• Euclidean distance between two vectors as the similarity measure
• The higher the distance, the lower the similarity
Feature similarity
• Features were weighted based on their relevance and on testing
• Calculated the weighted Euclidean distance between two feature vectors
Challenge similarity = description similarity + feature similarity
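A minimal sketch of the two distances and their sum is given below. The vectors and the feature weights are invented for illustration. Placing each weight inside the square root, multiplying the squared difference, is one common convention for a weighted Euclidean distance and is an assumption here, since the slides do not give the exact form.

```python
import math


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_euclidean(u, v, weights):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))


def challenge_distance(desc_u, desc_v, feat_u, feat_v, weights):
    """Description distance plus weighted feature distance (lower = more similar)."""
    return euclidean(desc_u, desc_v) + weighted_euclidean(feat_u, feat_v, weights)


# Hypothetical description vectors, feature vectors and feature weights.
d = challenge_distance([3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [4.0, 1.0])
print(d)
```

In this toy case the description distance is 5 and the weighted feature distance is 2, so the combined distance is 7.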
15
Let's see what the user is doing…? ;) and then recommend accordingly.
• We monitor various user activities, which form the basis of the recommendations
• Calculation of the recommendation score:
initialize the score of each challenge to 0;
for each (ch, act) in activity {
    simChal = challenges similar to ch;
    for each c in simChal {
        score(c) = score(c) + log(w_act * (1 / distance(ch, c)));
    }
}
for each ch in all_challenges {
    score(ch) = score(ch) + log(# common interests of user and ch);
    score(ch) = score(ch) * (1 / (deadline - current date));
}
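The scoring loops above could be turned into runnable Python along the following lines. This is a sketch under stated assumptions rather than the project's code: the dictionary shapes, the activity name `"solved"`, the weights, dates and distances are all hypothetical. Three guards are added that the slide does not mention: `eps` for a zero distance, `1 + count` so that zero common interests does not hit log(0), and a one-day minimum on the time left so that a challenge due today does not divide by zero.

```python
import math
from datetime import date


def recommend_scores(activity, similar, distance, weights, common_interests,
                     deadlines, today, eps=1e-9):
    """Score every challenge following the slide's two loops.

    activity: list of (challenge_id, activity_type) pairs from the user's history.
    similar: maps a challenge id to the ids of challenges similar to it.
    distance: function (id, id) -> non-negative distance.
    weights: maps an activity type to its weight w_act.
    common_interests: maps a challenge id to the number of shared interests.
    deadlines: maps a challenge id to its deadline date.
    """
    scores = {ch: 0.0 for ch in deadlines}
    for ch, act in activity:
        for c in similar.get(ch, []):
            # eps guards against a zero distance (assumption, not on the slide).
            scores[c] += math.log(weights[act] / (distance(ch, c) + eps))
    for ch in scores:
        # 1 + count avoids log(0) when nothing is shared (assumption).
        scores[ch] += math.log(1 + common_interests.get(ch, 0))
        # Clamp to one day so due-today or past deadlines do not divide by zero.
        days_left = max((deadlines[ch] - today).days, 1)
        scores[ch] *= 1 / days_left
    return scores


# Hypothetical data: the user solved c1, which is similar to c2 and c3.
today = date(2015, 7, 1)
deadlines = {"c1": date(2015, 7, 11), "c2": date(2015, 7, 3), "c3": date(2015, 7, 21)}
dist = {("c1", "c2"): 0.5, ("c1", "c3"): 2.0}
scores = recommend_scores(
    activity=[("c1", "solved")],
    similar={"c1": ["c2", "c3"]},
    distance=lambda a, b: dist[(a, b)],
    weights={"solved": 4.0},
    common_interests={"c2": 1, "c3": 0},
    deadlines=deadlines,
    today=today,
)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these inputs c2 ranks first: it is closest to the solved challenge, shares an interest with the user and has the nearest deadline.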
16
Jobs/internships you may like
• Similar to the challenge recommendation engine
• The only change is that here we have jobs and their features
• Key feature: the number of times a company appears in the user's challenge activity. Add this factor to the job's recommendation score
• Everything else is the same
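The company factor might be added as in the sketch below. The slides do not say how the count enters the score, so adding `weight * count` directly is an assumption, as are the function name, the `weight` parameter and the company and job names.

```python
from collections import Counter


def add_company_factor(job_scores, job_company, activity_companies, weight=1.0):
    """Boost each job by how often its company appears in the user's challenge activity."""
    counts = Counter(activity_companies)
    return {
        job: score + weight * counts[job_company[job]]
        for job, score in job_scores.items()
    }


# Hypothetical base scores, job-to-company mapping and activity history.
job_scores = {"j1": 0.5, "j2": 0.5}
job_company = {"j1": "Acme", "j2": "Globex"}
activity_companies = ["Acme", "Acme", "Initech"]
boosted = add_company_factor(job_scores, job_company, activity_companies)
print(boosted)
```

Here j1 is boosted because the user took part in two challenges from its company, while j2 keeps its base score.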
17
Challenges faced
I was unable to decide which tool/framework I should choose for my work
I couldn't find a ready-made dataset that fulfilled our requirements, so the last option was to crawl the web
And testing was a headache
18
Learning and experience
Explored Python and the scikit-learn platform
How a startup works – a lot of hard work goes into it, day and night
Management team member of the hackathon organized by Greymeter and Unicommerce
19
20

CSC410-Presentation


Editor's Notes

  1. Online skill demonstration platform connecting students and companies. Resumes are generated based on students' performance throughout the journey. Companies can float selection challenges.
  2. Features various classification, regression and clustering algorithms, including support vector machines, k-means, kNN and naïve Bayes.
  3. Describe stemming.
  4. Tf-idf is used by … tf has the +1 because if f_td = 0, log(f_td) would be −∞. idf has the +1 to prevent a term occurring in all documents from getting 0 idf.