Do you get the feeling of ‘the cart before the horse’ on hearing buzzwords like social data mining or sentiment analysis? Fundamental text mining methods are the real ‘workhorses’ behind these buzzwords. This presentation aims to explain those fundamentals in plain English.
The main idea
Packing our bags : Checks
● Starting R
● Loading required packages
● Check sessionInfo()
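The checks above might look like this in an R session. This is only a minimal sketch: the slides do not name the required packages, so `stats` and `utils` (which ship with R) are used here as stand-ins.

```r
# Stand-in package list (an assumption): the talk's actual packages
# are not named in the slides, so we use two that ship with R.
required <- c("stats", "utils")

# requireNamespace() returns TRUE when a package can be loaded.
available <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(available)

# sessionInfo() reports the R version, platform and attached packages.
info <- sessionInfo()
print(info$R.version$version.string)
```

If any entry of `available` is `FALSE`, that package would need to be installed with `install.packages()` before continuing.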
Packing our bags : Datatypes
"Let's try our hands"
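A few base R datatypes that come up constantly in text mining can be tried out as follows. The variable names and values are made up for illustration.

```r
# Atomic vectors: every element has the same type.
words   <- c("camera", "phone", "screen")  # character
counts  <- c(3L, 1L, 2L)                   # integer
weights <- c(0.5, 1.2, 0.0)                # double
seen    <- counts > 1L                     # logical

# A list can mix types; a data frame is a list of equal-length columns.
doc <- list(id = 1L, tokens = words)
df  <- data.frame(word = words, count = counts, stringsAsFactors = FALSE)

# A matrix is a vector with dimensions (rows = docs, columns = words here).
m <- matrix(1:6, nrow = 2, dimnames = list(c("doc1", "doc2"), words))

str(doc)
str(df)
print(m)
```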
Packing our bags : Functions
● Expressions which are evaluated
● Can be passed around
● Definitions can be nested
Details not covered : Argument matching, call by value, environments and lexical scoping, promises etc.
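The three properties of functions listed above can be seen in a short sketch. All function names here are hypothetical, chosen only for illustration.

```r
# A function is an expression that is evaluated when it is called.
square <- function(x) x^2

# Functions are values, so they can be passed to other functions.
apply_twice <- function(f, x) f(f(x))

# Definitions can be nested: add_n is defined inside make_adder
# and uses the outer function's argument n.
make_adder <- function(n) {
  add_n <- function(x) x + n
  add_n
}

add_five <- make_adder(5)

print(square(4))               # 16
print(apply_twice(square, 3))  # 81
print(add_five(10))            # 15
```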
Wandering traveller : Unsupervised Learning
How do we derive this model for text?
Wandering traveller : Vector Space Model
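In the vector space model each doc becomes a vector of word counts, one dimension per word in the vocabulary. A minimal base R sketch, using three made-up docs and a deliberately simple tokenizer (split on anything that is not a letter), could look like this:

```r
# Three tiny made-up documents (illustrative data only).
docs <- c(d1 = "camera phone camera",
          d2 = "phone touch screen",
          d3 = "camera screen")

# Split each document into lowercase word tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# The vocabulary is the sorted set of all distinct words.
vocab <- sort(unique(unlist(tokens)))

# One row per doc, one column per word; cells hold raw counts.
dtm <- matrix(0, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (i in seq_along(tokens)) {
  dtm[i, ] <- as.numeric(table(factor(tokens[[i]], levels = vocab)))
}

print(dtm)
```

Each row of `dtm` is the vector for one doc, so docs can now be compared with ordinary distance or similarity measures.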
Wandering traveller : TfIdf and other details
"But how do we measure the importance of a word for a doc?"
● Binary : Is the 'word' in the 'doc'?
● Tf : How many times does the 'word' occur in the 'doc'?
● TfIdf : Penalize the obvious! Words that occur in most docs get a lower weight.
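The three weightings can be computed side by side on a small count matrix. This sketch uses made-up counts and one common TfIdf variant, tf × log(N / df), where N is the number of docs and df is the number of docs containing the word; other variants exist and the slides do not say which one is meant.

```r
# Raw term counts for three made-up docs (illustrative data only).
tf <- matrix(c(2, 1, 0, 0, 3,
               0, 1, 1, 1, 2,
               1, 0, 1, 0, 2),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("d1", "d2", "d3"),
                             c("camera", "phone", "screen", "touch", "the")))

# Binary: is the word in the doc?
binary <- (tf > 0) * 1

# Document frequency: in how many docs does each word appear?
doc_freq <- colSums(tf > 0)

# Inverse document frequency: a word found in every doc gets log(1) = 0.
idf <- log(nrow(tf) / doc_freq)

# TfIdf: scale each column of counts by that word's idf.
tfidf <- sweep(tf, 2, idf, `*`)

print(binary)
print(round(tfidf, 3))
```

Here "the" has the highest raw counts but appears in every doc, so its TfIdf weight is zero, while "touch", which appears in a single doc, gets the largest idf.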
Wandering traveller : Hierarchical Clustering
● Define distance measure
● Keep merging the closest clusters based on similarity
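Both steps map directly onto base R's `dist()` and `hclust()`. The sketch below uses made-up counts; Euclidean distance and average linkage are assumptions, since the slides do not say which distance or linkage is used.

```r
# Made-up term counts: d1 and d2 are about cameras, d3 and d4 about phones.
x <- matrix(c(3, 1, 0, 0,
              2, 1, 0, 0,
              0, 0, 2, 1,
              0, 1, 3, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3", "d4"),
                            c("camera", "lens", "phone", "screen")))

# Step 1: define a distance measure between docs.
d <- dist(x, method = "euclidean")

# Step 2: keep merging the two closest clusters (average linkage).
hc <- hclust(d, method = "average")

# Cut the tree into two groups and inspect the assignment.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling `plot(hc)` in an interactive session draws the dendrogram, which shows the order in which docs were merged.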
Wandering traveller : Improvements
● Stemming, lemmatization : "Cameras" vs "Camera"
● Latent semantic analysis : "Phone" and "Touch Screen"
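Latent semantic analysis can be sketched with a truncated SVD of the term-by-doc matrix. In the made-up data below, "phone" and "touchscreen" (treated as one token, an assumption) never appear in the same doc, but both appear alongside "battery". Keeping k = 2 dimensions is also an arbitrary choice for this toy example.

```r
# Made-up term-by-doc counts (rows = terms, columns = docs).
tdm <- matrix(c(1, 0, 0, 0,
                0, 1, 0, 0,
                1, 1, 0, 0,
                0, 0, 1, 1,
                0, 0, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("phone", "touchscreen", "battery",
                                "camera", "lens"),
                              c("d1", "d2", "d3", "d4")))

# Cosine similarity between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Similarity on raw counts: the two terms share no doc.
raw_sim <- cosine(tdm["phone", ], tdm["touchscreen", ])

# Truncated SVD: keep the k largest singular values.
s <- svd(tdm)
k <- 2
term_space <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(term_space) <- rownames(tdm)

# Similarity in the reduced "latent" space.
lsa_sim <- cosine(term_space["phone", ], term_space["touchscreen", ])

print(c(raw = raw_sim, lsa = lsa_sim))
```

On raw counts the two terms look unrelated, while in the reduced space they end up pointing the same way because of the shared context word.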
Seeker : Supervised Learning
● Labels given with features
● Find a rule, then classify unobserved cases
Seeker : Naive Bayes Classifier
● Independence of features
● Train the model on training set
● Test accuracy on a holdout sample
            Predicted 0    Predicted 1
Actual 0    F(0, 0)        F(0, 1)
Actual 1    F(1, 0)        F(1, 1)
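The train, test and confusion-matrix steps can be sketched end to end in base R. This is a hand-rolled Bernoulli Naive Bayes with add-one smoothing, which is an assumption: the slides do not say which variant or package is used. The data, feature names and the helpers `train_nb` and `predict_nb` are all made up for illustration. Accuracy is the share of holdout cases on the diagonal, (F(0, 0) + F(1, 1)) / total.

```r
# Made-up binary features (1 = word present) and labels (1 = positive).
feats <- c("great", "poor", "camera")
train_x <- matrix(c(1, 0, 1,  1, 1, 0,  1, 0, 0,
                    0, 1, 1,  0, 1, 0,  0, 0, 1),
                  ncol = 3, byrow = TRUE, dimnames = list(NULL, feats))
train_y <- c(1, 1, 1, 0, 0, 0)

# Holdout sample, kept apart from training.
test_x <- matrix(c(1, 0, 1,  0, 1, 0,  1, 0, 0,  1, 1, 1),
                 ncol = 3, byrow = TRUE, dimnames = list(NULL, feats))
test_y <- c(1, 0, 1, 0)

# Train: class priors and per-class P(word present), add-one smoothed.
train_nb <- function(x, y) {
  classes <- sort(unique(y))
  prior <- vapply(classes, function(cl) mean(y == cl), numeric(1))
  p_word <- matrix(0, nrow = length(classes), ncol = ncol(x),
                   dimnames = list(as.character(classes), colnames(x)))
  for (i in seq_along(classes)) {
    rows <- x[y == classes[i], , drop = FALSE]
    p_word[i, ] <- (colSums(rows) + 1) / (nrow(rows) + 2)
  }
  list(classes = classes, prior = prior, p_word = p_word)
}

# Predict: add up log-probabilities, treating features as independent
# given the class, and pick the class with the highest score.
predict_nb <- function(model, x) {
  scores <- matrix(0, nrow = nrow(x), ncol = length(model$classes))
  for (i in seq_along(model$classes)) {
    p <- model$p_word[i, ]
    scores[, i] <- log(model$prior[i]) +
      x %*% log(p) + (1 - x) %*% log(1 - p)
  }
  model$classes[max.col(scores, ties.method = "first")]
}

model <- train_nb(train_x, train_y)
pred  <- predict_nb(model, test_x)

# Confusion matrix: rows = actual, columns = predicted.
conf <- table(Actual    = factor(test_y, levels = c(0, 1)),
              Predicted = factor(pred,   levels = c(0, 1)))
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(accuracy)
```

In this toy holdout set the last case contains both "great" and "poor" and is misclassified, so it lands in the F(0, 1) cell.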