Machine learning applications on text data

566 views


Do you get the feeling of 'the cart before the horse' on hearing buzzwords like social data mining or sentiment analysis? Fundamental text mining methods are the real 'workhorses' behind these buzzwords. This presentation aims to explain the fundamentals in plain English.

Published in: Technology, Education


  1. Using Machine Learning and R: Finding Order in the Chaos. Harshad Saykhedkar
  2. Source of text and applications. Emails: spam detection. Product descriptions / reviews: sentiment analysis, recommendation. Blogs / informational content: content recommendations. Web pages / news articles: topic identification, trending topics. Tweets / comments / social content: sentiment analysis, named entity recognition.
  3. (Text mining) is a wonderful world. Let's go exploring...!
  4. Itinerary: R you ready?; Prep camp; The wandering traveller; The seeker
  5. R you ready?
  6. Packing our bags: Checks. Starting R; loading required packages; checking sessionInfo()
  7. Packing our bags: Datatypes. Atomic; Vector; Lists. "Let's try our hands"
  8. Packing our bags: Functions. Expressions which are evaluated; can be passed around; definitions can be nested. Details not covered: argument matching, call by value, environments and lexical scoping, promises, etc.
  9. Prep Camp
  10. Prep camp: Sentiment Analysis. Bag of words model; simple aggregated score. Example reviews: 'terrible service & disorganised', 'OK - some good some bad', 'Great location, fabulous staff'
  11. Prep camp: Improvements. Part of speech ambiguity; further exploration?; equal weightage model; double negations?
  12. The Wandering Traveller
  13. Wandering traveller: Unsupervised Learning. Entity as point in space; can define distance. How to derive this model for text? (Plot axes: Feature 1, Feature 2)
  14. Wandering traveller: Vector Space Model. (Diagram labels: Word, Phrase, Theme; Comments, Blogs, Tweets; Word, Phrase, Theme)
  15. Wandering traveller: TfIdf and other details. "But how to measure the importance of a word for a doc?" Binary: is the 'word' in the 'doc'? Tf: number of times the word occurs in the 'doc'. TfIdf: penalize the obvious!
  16. Wandering traveller: Hierarchical Clustering. Define distance measure; keep merging based on similarity. (Example items: Washing Machine Washer Dryer Camera)
  17. Wandering traveller: Improvements. Stemming, lemmatization; latent semantic analysis. Examples: "Cameras" vs "Camera"; "Phone", "Touch Screen"
  18. The Seeker
  19. Seeker: Supervised Learning. Labels given with features; find rule, classify unobserved case. (Plot axes: Feature 1, Feature 2)
  20. Seeker: Naive Bayes Classifier. Independence of features; train the model on training set; test accuracy on a holdout sample. Confusion matrix, rows actual and columns predicted: Actual 0: F(0, 0) under Predicted 0, F(0, 1) under Predicted 1. Actual 1: F(1, 0) under Predicted 0, F(1, 1) under Predicted 1.
  21. Learnings
  22. Learnings. How to clean up and preprocess data in text form? How to model the data? How to cluster the data? How to classify the data?
  23. Source of text and applications. Emails: spam detection. Product descriptions / reviews: sentiment analysis, recommendation. Blogs / informational content: content recommendations. Web pages / news articles: topic identification, trending topics. Tweets / comments / social content: sentiment analysis, named entity recognition.
  24. Questions?
  25. About me. "Avid R learner, trying to apply a bunch of these techniques to the digital ads world". Contact: harshad.saykhedkar@sokrati.com
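
Slide 10 describes a bag-of-words sentiment model with a simple aggregated score. A minimal sketch of that idea follows, written in Python rather than the R used in the talk, purely for illustration. The word lists are hypothetical and were chosen only to cover the three example reviews; the deck does not say which lexicon was used.

```python
import re

# Hypothetical word lists, chosen only to cover the three example reviews.
POSITIVE = {"good", "great", "fabulous"}
NEGATIVE = {"bad", "terrible", "disorganised"}

def sentiment_score(text):
    """Bag-of-words score: +1 per positive word, -1 per negative word."""
    # Lowercase and keep alphabetic runs only; word order is ignored.
    words = re.findall(r"[a-z]+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

reviews = [
    "terrible service & disorganised",
    "OK - some good some bad",
    "Great location, fabulous staff",
]
scores = [sentiment_score(r) for r in reviews]
for review, score in zip(reviews, scores):
    print(f"{score:+d}  {review}")
```

Every word carries the same weight here, which is the "equal weightage" limitation slide 11 lists; negations and part-of-speech ambiguity are not handled either.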

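
Slide 15 contrasts three ways of weighting a word in a document: binary, Tf and TfIdf. The sketch below computes all three for a hypothetical toy corpus, again in Python for illustration. It assumes the common idf variant log(N / df); the deck does not state which variant it uses.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is already tokenised.
docs = {
    "d1": ["camera", "with", "touch", "screen"],
    "d2": ["washing", "machine", "with", "dryer"],
    "d3": ["camera", "phone", "with", "camera", "flash"],
}

def weights(docs):
    """Return binary, tf and tf-idf weights for every word present in each doc."""
    n_docs = len(docs)
    # Document frequency: in how many docs does each word appear?
    df = Counter(word for tokens in docs.values() for word in set(tokens))
    table = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        table[name] = {
            word: {
                "binary": 1,  # words absent from the doc are implicitly 0
                "tf": count,
                "tfidf": count * math.log(n_docs / df[word]),
            }
            for word, count in tf.items()
        }
    return table

table = weights(docs)
for word, w in table["d3"].items():
    print(word, w)
```

The word "with" occurs in every document, so its TfIdf weight is zero: this is the "penalize the obvious" effect the slide refers to.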
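
Slide 20 evaluates the classifier on a holdout sample using a 2x2 table F(actual, predicted). A minimal sketch of building that table and reading accuracy off its diagonal is shown below, in Python for illustration. The label vectors are hypothetical and stand in for the output of any trained classifier; they are not results from the talk.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return F where F[(a, p)] counts cases with actual label a and predicted label p."""
    counts = Counter(zip(actual, predicted))
    return {(a, p): counts[(a, p)] for a in (0, 1) for p in (0, 1)}

def accuracy(matrix):
    """Share of holdout cases on the diagonal, i.e. predicted correctly."""
    correct = matrix[(0, 0)] + matrix[(1, 1)]
    total = sum(matrix.values())
    return correct / total

# Hypothetical holdout labels and classifier predictions.
actual    = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 1, 0, 1, 0]

F = confusion_matrix(actual, predicted)
acc = accuracy(F)
print(F, acc)
```

The off-diagonal cells F(0, 1) and F(1, 0) are the two kinds of error, which a single accuracy figure hides.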