Getting Started with Text Mining
Mathangi Sri R
Let's look at some text
1. I love movies
2. I love icecream
3. I don’t like anything
4. I am not going to tell you anything
5. What are you guys doing
6. Where are you all going with it
7. I love her
8. doggie
When asked the question "What do you love?"
The tokens
['I', 'love', 'movies', 'I', 'love', 'icecream', 'I', 'don’t', 'like',
'anything', 'I', 'am', 'not', 'going', 'to', 'tell', 'you', 'anything',
'What', 'are', 'you', 'guys', 'doing', 'Where', 'are', 'you', 'all',
'going', 'with', 'it', 'I', 'love', 'her', 'doggie']
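The token list above is consistent with splitting each sentence on whitespace. A minimal sketch, assuming plain whitespace tokenization (the slides do not say which tokenizer was used):

```python
# The eight example sentences from the slide
sentences = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

# Split every sentence on whitespace and flatten into one token list
tokens = [tok for sentence in sentences for tok in sentence.split()]
print(len(tokens), tokens[:6])
```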
Word frequency
[('I', 5), ('love', 3), ('movies', 1), ('I', 5), ('love', 3),
('icecream', 1), ('I', 5), ('don’t', 1), ('like', 1),
('anything', 2), ('I', 5), ('am', 1), ('not', 1), ('going', 2),
('to', 1), ('tell', 1), ('you', 3), ('anything', 2),
('What', 1), ('are', 2), ('you', 3), ('guys', 1), ('doing', 1),
('Where', 1), ('are', 2), ('you', 3), ('all', 1),
('going', 2), ('with', 1), ('it', 1), ('I', 5), ('love', 3),
('her', 1), ('doggie', 1)]
Term Frequency
[('I', 0.15), ('love', 0.09), ('movies', 0.03), ('I', 0.15), ('love', 0.09),
('icecream', 0.03), ('I', 0.15), ('don’t', 0.03), ('like', 0.03),
('anything', 0.06), ('I', 0.15), ('am', 0.03), ('not', 0.03), ('going', 0.06),
('to', 0.03), ('tell', 0.03), ('you', 0.09), ('anything', 0.06),
('What', 0.03), ('are', 0.06), ('you', 0.09), ('guys', 0.03), ('doing', 0.03),
('Where', 0.03), ('are', 0.06), ('you', 0.09), ('all', 0.03),
('going', 0.06), ('with', 0.03), ('it', 0.03), ('I', 0.15), ('love', 0.09),
('her', 0.03), ('doggie', 0.03)]
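The counts and term frequencies above can be reproduced with a counter over the 34 tokens. This is a sketch that assumes whitespace tokenization and treats the whole dataset as a single document, which is what the values on the slide imply (for example 5 / 34 rounds to 0.15 for 'I').

```python
from collections import Counter

sentences = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]
tokens = [tok for sentence in sentences for tok in sentence.split()]

# Raw count of every distinct token
word_freq = Counter(tokens)

# Term frequency: count divided by the total number of tokens
term_freq = {tok: round(count / len(tokens), 2) for tok, count in word_freq.items()}

print(word_freq["I"], term_freq["I"])
print(word_freq["love"], term_freq["love"])
```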
TF - IDF
• TF: Term Frequency, which measures how frequently a
term occurs in a document:
TF(t) = (Number of times term t appears in a document) /
(Total number of terms in the document)
• IDF: Inverse Document Frequency, which measures how
rare, and therefore how informative, a term is across the documents:
IDF(t) = log_e(Total number of documents / Number of
documents with term t in it)
• The tf-idf weight of a term in a document is the product TF(t) * IDF(t).
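The two formulas can be applied directly to the example sentences. The sketch below assumes lowercased whitespace tokens; the helper names `tf` and `idf` are illustrative only and do not come from any library.

```python
import math

docs = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]
tokenized = [d.lower().split() for d in docs]

def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    """Natural log of (number of documents / documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    n_with_term = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / n_with_term)

first = tokenized[0]  # "i love movies"
for term in first:
    print(term, round(tf(term, first) * idf(term, tokenized), 3))
```

'movies' occurs in one document only, so it gets the highest weight in the first sentence, while 'i' occurs in five of the eight and gets the lowest.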
Tf-idf for our dataset
• 8*22 matrix (8 records * 22 unique words; 34 words in total)
Columns, in order:
u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
Rows (first six records; one value per column, in the order above):
I love movies: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 0.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I love icecream: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I don’t like anything: 0.00 0.00 0.51 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I am not going to tell you anything: 0.00 0.41 0.34 0.00 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.41 0.00 0.00 0.00 0.30
What are you guys doing: 0.00 0.00 0.00 0.41 0.00 0.49 0.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.35
Where are you all going with it: 0.41 0.00 0.00 0.34 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.30
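The values in this matrix differ from a plain TF * IDF product. They appear consistent with scikit-learn's default TfidfVectorizer settings: tokens of two or more word characters (so 'I' is dropped and 'don’t' becomes 'don'), a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The pure-Python sketch below reproduces the first row under those assumptions; `tfidf_row` is an illustrative helper, not a library function.

```python
import math
import re
from collections import Counter

docs = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

# Lowercase, then keep runs of 2+ word characters (mirrors the default token pattern)
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
doc_freq = Counter(t for toks in tokenized for t in set(toks))
n_docs = len(docs)

def tfidf_row(tokens):
    """Smoothed-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row = tfidf_row(tokenized[0])
print(len(vocab), {t: round(v, 2) for t, v in row.items()})
```

For "I love movies" this gives 0.59 for 'love' and 0.81 for 'movies', matching the first row of the matrix.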
Unigrams, Bi-grams and Tri-grams
• I love movies
-- bi-grams: I love, love movies
In our dataset (unigrams, bi-grams and tri-grams),
[u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you',
u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going',
u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream',
u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not',
u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you',
u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it',
u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
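An n-gram is a run of n consecutive tokens, so the list above can be built with a sliding window. A minimal sketch; `ngrams` is an illustrative helper written here, not a library call.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "going with it".split()

# Unigrams, bi-grams and tri-grams for one short phrase
features = [gram for n in (1, 2, 3) for gram in ngrams(tokens, n)]
print(features)
```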
Python code to generate tf-idf matrix
Input dataset (list of strings):
tt1 = [u'I love movies', u'I love icecream ', u'I don’t like anything', u'I am not going to tell you anything', u'What are you guys doing', u'Where are you all going with it', u'I love her', u'doggie ']
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
# Word-level tf-idf over 1- to 4-grams, with no stop-word removal
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)  # sparse matrix, one row per string
tf1 = tfidf_matrix.toarray()  # dense array (todense() would return np.matrix instead)
Text Classification
Classifying text - Methods
• Supervised classification:
– Requires labelled data
– Classification algorithms – SVM, LR, ensembles, RF, etc.
– Can measure accuracy precisely
– Needed for highly actionable applications
Classifying text - Methods
• Unsupervised
- No labels required
- Accuracy is a ‘loose’ measure
- Measuring homogeneity of clusters
- Useful for quick insights or where grouping is
required
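The slide does not name a homogeneity metric. One simple option is cluster purity, which needs a small hand-labelled sample: each cluster is credited with its most common label, and purity is the share of items that carry their cluster's majority label. The sketch below is an assumption about how this could be measured, with hypothetical cluster ids and labels.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Share of items whose label is the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_total / len(true_labels)

# Hypothetical clustering of five utterances into two clusters
cluster_ids = [0, 0, 0, 1, 1]
true_labels = ["baggage", "baggage", "check in", "check in", "check in"]
print(cluster_purity(cluster_ids, true_labels))
```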
Classifying text - Methods
• Semi-supervised learning is a class of
supervised learning tasks and techniques that
also make use of unlabeled data for training -
typically a small amount of labeled data with a
large amount of unlabeled data.
Supervised Learning – Case Study
Let's look at some text
no. | line | class
20 | get me to check in | check in
21 | check in internet | check in
22 | what is free baggage allowance | baggage
23 | how much baggage | baggage
24 | I have 35 kg should I pay | baggage
25 | how much can I carry | baggage
26 | lots of bags I have | baggage
27 | till how much baggage is free | baggage
28 | how many bags are free | baggage
29 | upto what weight I can carry | baggage
30 | how much can I carry | baggage
31 | baggage carry | baggage
32 | baggage to carry | baggage
33 | number of bags | baggage
34 | carrying bags | baggage
35 | travelling with bags | baggage
36 | money for luggage | baggage
37 | how much luggage I can carry | baggage
38 | too much luggage | baggage
Class Distribution
[Bar chart: share of records per class, y-axis from 0% to 30%; classes: login, other, baggage, check in, greetings, thanks, cancel]
Preprocess the data
• Map equivalent words to a single word group (e.g.,
different place names can all be replaced by one
group name)
• Use regex to normalize dates, dollar values, etc.
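Both preprocessing steps can be sketched with the standard `re` module. The place list, the placeholder tokens (`<PLACE>`, `<DATE>`, `<MONEY>`) and the two patterns are hypothetical choices for illustration; real data would need broader patterns.

```python
import re

# Hypothetical word group: any of these place names maps to one token
PLACES = {"delhi", "mumbai", "chennai"}

# Simple patterns: dates like 12/05/2018 and dollar amounts like $120.50
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
MONEY_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?")

def normalize(text):
    """Replace dates, dollar values and known place names with group tokens."""
    text = DATE_RE.sub("<DATE>", text)
    text = MONEY_RE.sub("<MONEY>", text)
    words = ["<PLACE>" if w.lower() in PLACES else w for w in text.split()]
    return " ".join(words)

print(normalize("fly to Mumbai on 12/05/2018 for $120.50"))
```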
Stop Words
How do you generate stop words from a corpus?
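The slide leaves this as an open question. One common heuristic, not necessarily the speaker's answer, is to treat words that occur in a large share of documents as stop-word candidates. The sketch below assumes lowercased whitespace tokens and a hypothetical threshold of 30% of documents.

```python
from collections import Counter

docs = [
    "I love movies",
    "I love icecream",
    "I don’t like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

def corpus_stop_words(docs, min_doc_share=0.3):
    """Return words that occur in at least `min_doc_share` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    n_docs = len(docs)
    return sorted(w for w, df in doc_freq.items() if df / n_docs >= min_doc_share)

candidates = corpus_stop_words(docs)
print(candidates)
```

On the eight example sentences this flags 'i', 'love' and 'you'. 'love' is informative for this dataset, so a frequency-based list is best treated as a set of candidates to review rather than a final answer.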
Stemming
• Stemming is the process of reducing a word
into its stem, i.e. its root form. The root form
is not necessarily a word by itself, but it can be
used to generate words by concatenating the
right suffix.
Stemmed words
fish, fishes and fishing --- fish
study, studies and studying --- studi
Difference between stemming and lemmatization:
Stemming – can produce meaningless words (e.g. studi)
Lemmatization – produces meaningful (dictionary) words
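To make the idea concrete, here is a toy suffix stripper that reproduces the two examples above. It is a deliberately small rule set written for illustration, not the Porter or Lancaster algorithm, and it will mis-stem many real words.

```python
# (suffix, replacement) pairs, checked in order; a toy rule set only
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", "")]

def toy_stem(word):
    """Strip one common suffix, then turn a final 'y' into 'i'."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain before the suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)] + replacement
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

words = ["fish", "fishes", "fishing", "study", "studies", "studying"]
stems = [toy_stem(w) for w in words]
print(stems)
```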
Stemming and Lemmatizing
Code
from nltk.stem import PorterStemmer
# from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
ps.stem("having")  # Porter stemmer
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem("maximum")  # Lancaster stemmer, usually more aggressive than Porter
Spell checker
• https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst
• https://pypi.python.org/pypi/autocorrect/0.1.0
Sampling – Train and Validation
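# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import StratifiedShuffleSplit
# One stratified 80/20 split; tf1 holds the tf-idf features, tgt3 the class labels
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(tf1, tgt3):
    # print("TRAIN:", train_index, "TEST:", test_index)
    a_train_b, a_test_b = tf1[train_index], tf1[test_index]
    b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]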
Generate features or word tokens and vectorize
from sklearn.feature_extraction.text import TfidfVectorizer
# tt1 is the list of input strings
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
tf1 = tfidf_matrix.toarray()  # dense array (todense() would return np.matrix instead)
Feature Selection
from sklearn.feature_selection import SelectPercentile, f_classif
# Score features with the ANOVA F-test; percentile=100 keeps every feature
selector = SelectPercentile(f_classif, percentile=100)
# Fit on the training data only, then apply the same selection to the test data
a_train_b = selector.fit_transform(a_train_b, b_train_b)
a_test_b = selector.transform(a_test_b)
Build Model
• Logistic Regression
• GBM (gradient boosting machine)
• SVM (support vector machine)
• RF (random forest)
• Neural Nets
• NB (Naive Bayes)
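As an illustration of one option from this list, here is a from-scratch multinomial Naive Bayes classifier with add-one smoothing, trained on a few rows taken from the case-study table. It is a sketch for intuition only: the function names are made up here, the training set is tiny, and in practice a library implementation fitted on the tf-idf features would be used.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Collect per-class document counts, word counts and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log prior plus log likelihood."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

texts = ["get me to check in", "check in internet", "what is free baggage allowance",
         "how much baggage", "how many bags are free", "too much luggage"]
labels = ["check in", "check in", "baggage", "baggage", "baggage", "baggage"]
model = train_nb(texts, labels)
print(predict_nb(model, "internet check in"))
print(predict_nb(model, "how much luggage can I carry"))
```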

Getting started with text mining By Mathangi Sri Head of Data Science at PhonePe at CYPHER 2018

  • 1. Getting Started with Text Mining Mathangi Sri R
  • 2. Lets look at some text 1. I love movies 2. I love icecream 3. I don’t like anything 4. I am not going to tell you anything 5. What are you guys doing 6. Where are you all going with it 7. I love her 8. doggie When asked a question what do you love?
  • 3. the tokens..? ['I', 'love', 'movies', 'I', 'love', 'icecream', 'I', 'donx92t', 'like', 'anything', 'I', 'am', 'not', 'going', 'to', 'tell', 'you', 'anything', 'What', 'are', 'you', 'guys', 'doing', 'Where', 'are', 'you', 'all', 'going', 'with', 'it', 'I', 'love', 'her', 'doggie']
  • 4. word frequency [('I', 5), ('love', 3), ('movies', 1), ('I', 5), ('love', 3), ('icecream', 1), ('I', 5), ('donx92t', 1), ('like', 1), ('anything', 2), ('I', 5), ('am', 1), ('not', 1), ('going', 2), ('to', 1), ('tell', 1), ('you', 3), ('anything', 2), ('What', 1), ('are', 2), ('you', 3), ('guys', 1), ('doing', 1), ('Where', 1), ('are', 2), ('you', 3), ('all', 1), ('going', 2), ('with', 1), ('it', 1), ('I', 5), ('love', 3), ('her', 1), ('doggie', 1)]
  • 5. Term Frequency [('I', 0.15), ('love', 0.09), ('movies', 0.03), ('I', 0.15), ('love', 0.09), ('icecream', 0.03), ('I', 0.15), ('donx92t', 0.03), ('like', 0.03), ('anything', 0.06), ('I', 0.15), ('am', 0.03), ('not', 0.03), ('going', 0.06), ('to', 0.03), ('tell', 0.03), ('you', 0.09), ('anything', 0.06), ('What', 0.03), ('are', 0.06), ('you', 0.09), ('guys', 0.03), ('doing', 0.03), ('Where', 0.03), ('are', 0.06), ('you', 0.09), ('all', 0.03), ('going', 0.06), ('with', 0.03), ('it', 0.03), ('I', 0.15), ('love', 0.09), ('her', 0.03), ('doggie', 0.03)]
  • 6. TF - IDF
      • TF: Term Frequency, which measures how frequently a term occurs in a document.
        TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
      • IDF: Inverse Document Frequency, which measures how important a term is.
        IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
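The two formulas above can be sketched directly in plain Python. This is a minimal illustration, assuming whitespace tokenization and the eight example sentences from the start of the deck; the helper names `tf`, `idf` and `tf_idf` are hypothetical and not from the talk. Values from these formulas will not match a `TfidfVectorizer` matrix exactly, since that class applies a smoothed IDF and L2-normalizes each row by default.

```python
import math

docs = [
    "I love movies",
    "I love icecream",
    "I don't like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

# Assumption: split on whitespace, keep case as written.
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Natural log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


# "movies" occurs in one document, "love" in three, so within
# "I love movies" the rarer word gets the larger weight.
love_weight = tf_idf("love", tokenized[0], tokenized)
movies_weight = tf_idf("movies", tokenized[0], tokenized)
```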
  • 7. Tf-idf for our dataset
      • 8*22 (8 records * 22 unique words. Total words 34)
      Columns, in order: u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
      I love movies: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 0.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00
      I love icecream: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
      I don’t like anything: 0.00 0.00 0.51 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
      I am not going to tell you anything: 0.00 0.41 0.34 0.00 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.41 0.00 0.00 0.00 0.30
      What are you guys doing: 0.00 0.00 0.00 0.41 0.00 0.49 0.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.35
      Where are you all going with it: 0.41 0.00 0.00 0.34 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.30
  • 8. Unigrams, Bi-grams and Tri-grams
      • I love movies -- I love, love movies
      In our dataset:
      [u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you', u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going', u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream', u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not', u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you', u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it', u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
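An n-gram is just a sliding window of n consecutive tokens. The sketch below assumes whitespace tokenization; `ngrams` is a hypothetical helper, not code from the talk. The dataset list on the slide has no 'i' entries, which is consistent with a vectorizer whose default tokenizer lowercases text and drops single-character tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love movies".split()

unigrams = ngrams(tokens, 1)  # ['I', 'love', 'movies']
bigrams = ngrams(tokens, 2)   # ['I love', 'love movies']
trigrams = ngrams(tokens, 3)  # ['I love movies']
```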
  • 9. Python code to generate tf-idf matrix
      Input dataset (list of strings, named tt1 in the code):
      [u'I love movies', u'I love icecream ', u'I don’t like anything', u'I am not going to tell you anything', u'What are you guys doing', u'Where are you all going with it', u'I love her', u'doggie ']
      Code:
      from sklearn.feature_extraction.text import TfidfVectorizer
      # Word-level features, from unigrams up to 4-grams, with no stop-word removal
      tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
      tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
      # toarray() returns a plain ndarray; recent scikit-learn releases reject the np.matrix that todense() returns
      tf1 = tfidf_matrix.toarray()
  • 11. Classifying text - Methods
      • Supervised classification:
        – Requires labelled data
        – Classification algorithms: SVM, LR, Ensemble, RF, etc.
        – Can measure accuracy precisely
        – Needed for highly actionable applications
  • 12. Classifying text - Methods • Unsupervised - No labels required - Accuracy is a ‘loose’ measure - Measuring homogeneity of clusters - Useful for quick insights or where grouping is required
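As a rough sketch of the unsupervised route, the snippet below clusters a few utterances with k-means on TF-IDF features and scores the result without labels. The choice of k-means, `n_clusters=2`, and the silhouette score as a stand-in for cluster quality are all assumptions for illustration; the talk does not name a specific algorithm or measure.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# A handful of unlabelled utterances (sample sentences; no labels are used).
texts = [
    "get me to check in",
    "check in internet",
    "what is free baggage allowance",
    "how much baggage",
    "how many bags are free",
    "too much luggage",
]

features = TfidfVectorizer().fit_transform(texts)

# Assumption: two clusters; in practice the number is chosen by inspection.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(features)

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
score = silhouette_score(features, cluster_ids)
```

Because there is no ground truth, a score like this only indicates how compact the groups are, which is why accuracy is a loose measure here.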
  • 13. Classifying text - Methods • Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data.
  • 15. Let's look at some text
           line                             class
      20   get me to check in               check in
      21   check in internet                check in
      22   what is free baggage allowance   baggage
      23   how much baggage                 baggage
      24   I have 35 kg should I pay        baggage
      25   how much can I carry             baggage
      26   lots of bags I have              baggage
      27   till how much baggage is free    baggage
      28   how many bags are free           baggage
      29   upto what weight I can carry     baggage
      30   how much can I carry             baggage
      31   baggage carry                    baggage
      32   baggage to carry                 baggage
      33   number of bags                   baggage
      34   carrying bags                    baggage
      35   travelling with bags             baggage
      36   money for luggage                baggage
      37   how much luggage I can carry     baggage
      38   too much luggage                 baggage
  • 16. Class Distribution [bar chart: share of records per class, axis from 0% to 30%; classes: login, other, baggage, check in, greetings, thanks, cancel]
  • 17. Preprocess the data
      • Map equivalent words to a single word group (e.g. different place names can all be replaced with one group name)
      • Use regex to normalize dates, dollar values, etc.
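Both preprocessing steps can be sketched with the standard library. The place list, the placeholder tokens (`_place_`, `_date_`, `_amount_`) and the two patterns are hypothetical choices for illustration; real data would need patterns tuned to its own date and currency formats.

```python
import re

# Hypothetical word group: any of these is replaced by one group token.
PLACES = {"bangalore", "mumbai", "delhi"}

# Assumed formats: numeric dates like 12/08/2018 or 12-08-18, and dollar amounts.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
MONEY_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?")


def normalize(text):
    """Replace dates, dollar values and known place names with group tokens."""
    text = DATE_RE.sub("_date_", text)
    text = MONEY_RE.sub("_amount_", text)
    words = ["_place_" if w.lower() in PLACES else w for w in text.split()]
    return " ".join(words)


cleaned = normalize("Fly to Mumbai on 12/08/2018 for $1,250.50")
# cleaned == "Fly to _place_ on _date_ for _amount_"
```

Collapsing many surface forms into one token lets the model learn from "a date" or "a place" instead of from each individual value.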
  • 18. Stop Words How do you generate stop words from a corpus?
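One common answer, not necessarily the one given in the talk, is to treat words that appear in a large share of documents as stop words. The sketch below assumes whitespace tokenization and lowercasing; `corpus_stop_words` and the 50% threshold are hypothetical.

```python
from collections import Counter


def corpus_stop_words(docs, max_doc_share=0.5):
    """Return words that occur in more than `max_doc_share` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document, not once per occurrence.
        doc_freq.update(set(doc.lower().split()))
    return {word for word, count in doc_freq.items()
            if count / len(docs) > max_doc_share}


docs = [
    "I love movies",
    "I love icecream",
    "I don't like anything",
    "I am not going to tell you anything",
    "What are you guys doing",
    "Where are you all going with it",
    "I love her",
    "doggie",
]

stop_words = corpus_stop_words(docs)
# "i" appears in 5 of the 8 documents, the only word above the 50% threshold.
```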
  • 19. Stemming • Stemming is the process of reducing a word into its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.
  • 20. Stemmed words
      fish, fishes and fishing --- fish
      study, studies and studying --- studi
      Difference between stemming and lemmatization: stemming can produce stems that are not real words; lemmatization returns meaningful dictionary words.
  • 21. Stemming and Lemmatizing Code
      from nltk.stem import PorterStemmer
      # from nltk.tokenize import sent_tokenize, word_tokenize
      # Porter stemmer
      ps = PorterStemmer()
      ps.stem("having")
      # Lancaster stemmer, a more aggressive alternative
      from nltk.stem.lancaster import LancasterStemmer
      lancaster_stemmer = LancasterStemmer()
      lancaster_stemmer.stem("maximum")
  • 23. Sampling – Train and Validation
      # sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
      from sklearn.model_selection import StratifiedShuffleSplit
      # tf1: feature matrix; tgt3: NumPy array of class labels, one per row of tf1
      sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
      for train_index, test_index in sss.split(tf1, tgt3):
          # print("TRAIN:", train_index, "TEST:", test_index)
          a_train_b, a_test_b = tf1[train_index], tf1[test_index]
          b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]
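  • 24. Generate features or word tokens and vectorize
      from sklearn.feature_extraction.text import TfidfVectorizer
      # tt1 is the list of input strings
      tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
      tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
      # toarray() returns a plain ndarray; recent scikit-learn releases reject the np.matrix that todense() returns
      tf1 = tfidf_matrix.toarray()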
  • 25. Feature Selection
      from sklearn.feature_selection import SelectPercentile, f_classif
      # percentile=100 keeps every feature; lower it to keep only the top-scoring ones
      selector = SelectPercentile(f_classif, percentile=100)
      # Fit on the training split only, then apply the same selection to the test split
      a_train_b = selector.fit_transform(a_train_b, b_train_b)
      a_test_b = selector.transform(a_test_b)
  • 26. Build Model • Logistic Regression • GBM • SVM • RF • Neural Nets • NB
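As a minimal sketch of the model-building step, the snippet below trains the first option in the list, logistic regression, on a few labelled utterances. The pipeline layout, the `ngram_range` and the `max_iter` value are assumptions for illustration, and a sample this small says nothing about real accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few labelled sample rows (sentence, class).
texts = [
    "get me to check in",
    "check in internet",
    "what is free baggage allowance",
    "how much baggage",
    "how many bags are free",
    "too much luggage",
]
labels = ["check in", "check in", "baggage", "baggage", "baggage", "baggage"]

# Vectorizer and classifier in one object, so the same TF-IDF vocabulary
# is applied at training and prediction time.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predicted = model.predict(["how much baggage is free"])
```

Swapping `LogisticRegression` for another scikit-learn classifier from the list (for example a random forest or naive Bayes) leaves the rest of the pipeline unchanged.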