Text Mining 101
Manohar Swamynathan
August 2012
agenda:
o Text Mining Process Steps
o Calculate Term Weight
o Similarity Distance Measure
o Common Text Mining Techniques
o Appendix
- Required R packages for Text Mining
- Implemented Examples
o R code for obtaining and analyzing tweets.
o RTextTools – Ensemble Classification
o References
Step 1 – Data Assemble
Common text data sources for building the text corpus: flat files, social media, corporate databases.
Step 2 – Data Processing

Data Processing Step | Brief Description
Explore corpus through exploratory data analysis | Understand the types of variables, their functions, permissible values, and so on. Some formats, including HTML and XML, contain tags and other data structures that provide additional metadata.
Convert text to lowercase | Avoids distinguishing between words purely on case.
Remove numbers (if required) | Numbers may or may not be relevant to the analysis.
Remove punctuation | Punctuation can provide grammatical context that supports understanding. For initial analyses it is often ignored; later it can be used to support the extraction of meaning.
Remove English stop words | Stop words are common words found in a language; words like "for", "of" and "are" are common stop words.
Remove own stop words (if required) | Along with English stop words, we can instead or in addition remove our own stop words. The choice of own stop words may depend on the domain of discourse and might not become apparent until some analysis has been done.
Strip whitespace | Eliminate extra white space, i.e., any space beyond the single spaces that occur within a sentence or between words.
Stemming | Uses an algorithm that removes common word endings for English words, such as "es", "ed" and "'s". Example: "computer" & "computers" become "comput".
Lemmatization | Transform to the dictionary base form, e.g., "produce" & "produced" become "produce".
Remove sparse terms | We are often not interested in infrequent terms in our documents; such "sparse" terms should be removed from the document term matrix.
Create document term matrix | A document term matrix is simply a matrix with documents as the rows, terms as the columns, and the word-frequency counts as the cells.

Python packages – textmining, nltk
R packages – tm, qdap, openNLP
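The processing steps above can be sketched end-to-end in a few lines of Python using only the standard library. This is a minimal illustration: the stop-word list is a tiny made-up sample, not the full English list that tm or nltk would supply.

```python
import re
from collections import Counter

STOP_WORDS = {"for", "of", "are", "the", "and", "a", "is"}  # tiny illustrative list

def preprocess(text, extra_stop_words=()):
    """Lowercase, strip punctuation/numbers/extra whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # remove punctuation and numbers
    tokens = text.split()                   # splitting also strips extra whitespace
    stops = STOP_WORDS | set(extra_stop_words)
    return [t for t in tokens if t not in stops]

def document_term_matrix(docs):
    """Rows = documents, columns = terms, cells = term counts."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for toks in token_lists for t in toks})
    return vocab, [[Counter(toks)[term] for term in vocab] for toks in token_lists]

docs = ["Statistics skills are important!", "Programming skills are important."]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['important', 'programming', 'skills', 'statistics']
print(dtm)    # [[1, 0, 1, 1], [1, 1, 1, 0]]
```

In practice the tm (R) or nltk (Python) packages listed above provide proper tokenizers, stop-word lists, stemmers and lemmatizers; this sketch only shows the shape of the pipeline.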
Step 3 - Data Visualization
Frequency chart, word cloud, correlation plot
Step 4 – Models
Clustering
Classification
Sentiment Analysis
Calculate Term Weight (TF * IDF)

Term Frequency - how frequently does a term appear?
Term Frequency TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Inverse Document Frequency - how important is a term?
Document Frequency DF = d (number of documents containing a given term) / D (the size of the collection of documents)
To normalize, take log(d/D); but often D > d, so log(d/D) gives a negative value. Hence we invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large and very small quantities can be compared smoothly.
Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:
- Assume we have 10 million documents overall and the word "spindle" appears in one thousand of them.
- Consider 2 documents of 100 total words each, containing the term "spindle" 3 and 30 times respectively.

Document | spindle Frequency | Total Words | TF | IDF | TF * IDF
1 | 3 | 100 | 3/100 = 0.03 | log(10,000,000/1,000) = 4 | 0.03 * 4 = 0.12
2 | 30 | 100 | 30/100 = 0.3 | log(10,000,000/1,000) = 4 | 0.3 * 4 = 1.2
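The worked example above can be reproduced with a short Python sketch (log base 10, matching the table's log(10,000,000/1,000) = 4):

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's words that are this term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency (log base 10, as in the worked example)."""
    return math.log10(total_docs / docs_with_term)

# 10 million documents overall; "spindle" appears in 1,000 of them.
spindle_idf = idf(10_000_000, 1_000)   # log10(10,000) = 4.0

doc1_weight = tf(3, 100) * spindle_idf    # 0.03 * 4 = 0.12
doc2_weight = tf(30, 100) * spindle_idf   # 0.30 * 4 = 1.20
print(doc1_weight, doc2_weight)
```

Note that some libraries use the natural log or add smoothing terms, so their absolute TF*IDF values differ from this table; the relative ordering of terms is what matters.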
Similarity Distance Measure
Example:
Text 1: statistics skills and programming skills are equally important for analytics
Text 2: statistics skills and domain knowledge are important for analytics
Text 3: I like reading books and travelling
The three vectors are:
T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)
Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%
Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
Additional Reading: Detailed papers comparing the efficiency of different distance measures for text documents:
1) http://home.iitk.ac.in/~spranjal/cs671/project/report.pdf
2) http://users.dsic.upv.es/~prosso/resources/BarronEtAl_ICON09.pdf
Term:   statistics skills and programming knowledge are equally important for analytics domain I like reading books travelling
Text 1: 1 2 1 1 0 1 1 1 1 1 0 0 0 0 0 0
Text 2: 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0
Text 3: 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1
[Figure: Euclidean distance vs. cosine angle between two document vectors X and Y]
- The cosine value will be a number between 0 and 1
- The smaller the angle, the larger the cosine value/similarity
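The similarity figures above (77% and 12%) can be checked with a small Python equivalent of the R %*% computation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Term-count vectors for the three example texts.
T1 = (1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
T2 = (1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
T3 = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

print(round(cosine_similarity(T1, T2), 2))  # 0.77
print(round(cosine_similarity(T1, T3), 2))  # 0.12
```

As expected, the two texts about analytics skills are far more similar to each other than either is to the text about reading and travelling.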
Common Text Mining Techniques
• N-grams
• Shallow Natural Language Processing
• Deep Natural Language Processing
Example: "defense attorney for liberty and montecito"
1-gram:
defense
attorney
for
liberty
and
montecito
2-gram:
defense attorney
attorney for
for liberty
liberty and
and montecito
3-gram:
defense attorney for
attorney for liberty
for liberty and
liberty and montecito
4-gram:
defense attorney for liberty
attorney for liberty and
for liberty and montecito
5-gram:
defense attorney for liberty and
attorney for liberty and montecito
n-gram
Definition:
• An n-gram is a contiguous sequence of n items from a given sequence of text
• The items can be letters, words, syllables or base pairs according to the application
Application:
 Probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model
 Widely used in probability, communication theory, computational linguistics, and biological sequence analysis
Advantage:
 Relatively simple
 Simply increasing n lets the model store more context
Disadvantage:
 The semantic value of the items is not considered
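A minimal word-level n-gram generator reproducing the example above (a sketch, not a full tokenizer):

```python
def ngrams(text, n):
    """Return the list of word-level n-grams of `text` as strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "defense attorney for liberty and montecito"
print(ngrams(sentence, 2))
# ['defense attorney', 'attorney for', 'for liberty',
#  'liberty and', 'and montecito']
```

A sentence of m words has m − n + 1 n-grams, which is why the six-word example yields five 2-grams, four 3-grams, three 4-grams and two 5-grams.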
Shallow NLP Technique
Definition:
- Assign a syntactic label (noun, verb, etc.) to a chunk
- Knowledge extraction from text through a semantic/syntactic analysis approach
Application:
- Taxonomy extraction (predefined terms and entities)
  - Entities: people, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
- Concept extraction (main idea or theme)
Advantage:
- Less noisy than n-grams
Disadvantage:
- Does not specify the role of items in the main sentence
Shallow NLP example

Sentence - "The driver from Europe crashed the car with the white bumper"

- Convert to lowercase & PoS tag each 1-gram:
the/DT (determiner), driver/NN (noun, singular or mass), from/IN (preposition or subordinating conjunction), europe/NNP (proper noun, singular), crashed/VBD (verb, past tense), the/DT, car/NN, with/IN, the/DT, white/JJ (adjective), bumper/NN

Concept Extraction:
- Remove stop words
- Retain only content words (nouns & verbs)

- Bi-grams with content words retained:
Bi-gram | PoS
car white | NN JJ
crashed car | VBD NN
driver europe | NN NNP
europe crashed | NNP VBD
white bumper | JJ NN

- 3-grams with content words retained:
3-gram | PoS
car white bumper | NN JJ NN
crashed car white | VBD NN JJ
driver europe crashed | NN NNP VBD
europe crashed car | NNP VBD NN

Conclusion:
1-gram: Reduced noise, but no clear context
Bi-gram & 3-gram: Increased context, but there is information loss
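The concept-extraction step above (drop stop words, keep content words, form bi-grams) can be sketched in Python. The tagged sentence is hard-coded here for illustration; in practice a PoS tagger (e.g. openNLP in R or nltk in Python) would produce it, and the choice of which tags count as "stop" tags is an assumption of this sketch:

```python
# Determiners and prepositions act as the stop words in this example.
STOP_TAGS = {"DT", "IN"}

tagged = [("the", "DT"), ("driver", "NN"), ("from", "IN"), ("europe", "NNP"),
          ("crashed", "VBD"), ("the", "DT"), ("car", "NN"), ("with", "IN"),
          ("the", "DT"), ("white", "JJ"), ("bumper", "NN")]

# Keep only content words, preserving sentence order.
content = [(w, t) for w, t in tagged if t not in STOP_TAGS]

# Form bi-grams of adjacent content words, carrying their PoS tags along.
bigrams = [(f"{w1} {w2}", f"{t1} {t2}")
           for (w1, t1), (w2, t2) in zip(content, content[1:])]

for words, tags in bigrams:
    print(words, "-", tags)
```

This prints the same five bi-grams as the slide (driver europe, europe crashed, crashed car, car white, white bumper), here in sentence order rather than alphabetical order.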
Deep NLP Technique
Definition:
- Extension to shallow NLP
- Detected relationships are expressed as complex constructions to retain the context
- Example relationships: located in, employed by, part of, married to
Applications:
- Develop features and representations appropriate for complex interpretation tasks
- Fraud detection
- Life science: e.g., predicting activities based on complex RNA-sequence data
Example:
The sentence "The driver from Europe crashed the car with the white bumper" can be represented using triples (Subject : Predicate [Modifier] : Object) without losing the context.
Triples:
driver : crash : car
driver : crash with : bumper
driver : be from : Europe
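As a toy illustration of triple extraction, the heuristic below pulls one (subject, predicate, object) triple from the PoS-tagged sentence. This is a hypothetical simplification, not the algorithm from the triplet-extraction paper in the references; real deep NLP uses a full syntactic parse, handles modifiers, and lemmatizes "crashed" to "crash".

```python
def extract_triple(tagged):
    """Toy heuristic: subject = first noun before the first verb,
    predicate = that verb, object = first noun after it."""
    verb_idx = next(i for i, (_, t) in enumerate(tagged) if t.startswith("VB"))
    subject = next(w for w, t in tagged[:verb_idx] if t.startswith("NN"))
    obj = next(w for w, t in tagged[verb_idx + 1:] if t.startswith("NN"))
    return subject, tagged[verb_idx][0], obj

tagged = [("the", "DT"), ("driver", "NN"), ("from", "IN"), ("europe", "NNP"),
          ("crashed", "VBD"), ("the", "DT"), ("car", "NN"), ("with", "IN"),
          ("the", "DT"), ("white", "JJ"), ("bumper", "NN")]

print(extract_triple(tagged))  # ('driver', 'crashed', 'car')
```

This recovers only the first of the three triples; the "crash with : bumper" and "be from : Europe" triples require relationship detection beyond this simple pattern.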
Techniques - Summary

Technique: N-Gram
General steps:
- Convert to lowercase
- Remove punctuation
- Remove special characters
Pros: Simple technique
Cons: Extremely noisy

Technique: Shallow NLP
General steps:
- PoS tagging
- Lemmatization, i.e., transform to the dictionary base form, e.g., "produce" & "produced" become "produce"
- Stemming, i.e., transform to the root word, e.g., 1) "computer" & "computers" become "comput"; 2) "product", "produce" & "produced" become "produc"
- Chunking, i.e., identify the phrasal constituents in a sentence, including noun/verb phrases, and split the sentence into chunks of semantically related words
Pros: Less noisy than n-grams
Cons: Computationally expensive for analyzing the structure of texts; does not specify the internal structure or the role of words in the sentence

Technique: Deep NLP
General steps:
- Generate syntactic relationships between each pair of words
- Extract subject, predicate, negation, object and named entities to form triples
Pros: Context of the sentence is retained
Cons: Sentence-level analysis is too structured
Appendix
R - Text Mining Process Overview

Step 1 – Data Assemble
- Build the corpus from web, document, and DB sources

Step 2 – Data Processing
2A - Explore corpus through EDA
2B - Convert text to lowercase
2C - Remove
  a) Numbers (if required)
  b) Punctuation
  c) English stop words
  d) Own stop words (if required)
  e) Extra whitespace
2C (continued) - Lemmatization/stemming; remove sparse terms
2D - Create document term matrix

Step 3 - Visualization
- Frequency chart, word cloud, correlation plot

Step 4 – Build Model(s)
 Clustering
 Classification
 Sentiment Analysis
R – Required packages for Text Mining

Package | Category | Description
tm | Text Mining | A framework for text mining applications
topicmodels | Topic Modelling | Fit topic models with Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM)
wordcloud | Visualization | Plot a cloud comparing the frequencies of words across documents
lda | Topic Modelling | Fit topic models with Latent Dirichlet Allocation
wordnet | Text Mining | Database of English commonly used in linguistics and text mining
RTextTools | Text Mining | Automatic text classification via supervised learning
qdap | Sentiment analysis | Transcript analysis, text mining and natural language processing
tm.plugin.dc | Text Mining | A plug-in for package tm to support distributed text mining
tm.plugin.mail | Text Mining | A plug-in for package tm to handle mail
textir | Text Mining | A suite of tools for inference about text documents and associated sentiment
tau | Text Mining | Utilities for text analysis
textcat | Text Mining | N-gram based text categorization
SnowballC | Text Mining | Word stemmer
twitteR | Text Mining | Provides an interface to the Twitter web API
ROAuth | Text Mining | Allows users to authenticate to the server of their choice (e.g., Twitter)
RColorBrewer | Visualization | Provides palettes for drawing nicely shaded maps and plots
ggplot2 | Visualization | Graphing package implemented on top of R, inspired by Leland Wilkinson's seminal Grammar of Graphics
Example 1 - Obtaining and analyzing tweets
Objective: R code for analyzing tweets relating to #AAA2011 (text mining, topic modelling, network analysis, clustering and
sentiment analysis)
What does the code do?
The code details ten steps in the analysis and visualization of the tweets:
1. Acquiring the raw Twitter data
2. Calculating some basic statistics with the raw Twitter data
3. Calculating some basic retweet statistics
4. Calculating the ratio of retweets to tweets
5. Calculating some basic statistics about URLs in tweets
6. Basic text mining for token frequency and token association analysis (word cloud)
7. Calculating sentiment scores of tweets, including on subsets containing tokens of interest
8. Hierarchical clustering of tokens based on multiscale bootstrap resampling
9. Topic modelling the tweet corpus using latent Dirichlet allocation
10. Network analysis of tweeters based on retweets
Code Source: The code was taken from the following link and tweaked where required to ensure it runs correctly:
https://github.com/benmarwick/AAA2011-Tweets
How to run or test the code? From the Word document, copy the R code in the given sequence (highlighted in yellow) and paste it into your R console.
RTextTools – Example of supervised learning for text classification using an ensemble
RTextTools is a free, open source R machine learning package for automatic text classification.
The package includes nine algorithms for ensemble classification (svm, slda, boosting, bagging, random
forests, glmnet, decision trees, neural networks, and maximum entropy), comprehensive analytics, and
thorough documentation.
Users may use n-fold cross validation to calculate the accuracy of each algorithm on their dataset and
determine which algorithms to use in their ensemble.
(Using a four-ensemble agreement approach, Collingwood and Wilkerson (2012) found that when four of their
algorithms agree on the label of a textual document, the machine label matches the human label over 90% of
the time. The rate is just 45% when only two algorithms agree on the text label.)
Code Source: The code is readily available for download and use from the following link:
https://github.com/timjurka/RTextTools . The code can be run without modification for testing; however, it is set up so that changes can be incorporated easily based on your requirements.
Additional Reading: http://www.rtexttools.com/about-the-project.html
Example 2 - RTextTools
References

Penn Treebank PoS tags - https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Stanford InfoLab, Finding Similar Items - http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
Triplet Extraction from Sentences - http://ailab.ijs.si/delia_rusu/Papers/is_2007.pdf
Shallow and Deep NLP Processing for Ontology Learning, a Quick Overview - http://azouaq.athabascau.ca/publications/Conferences,%20Workshops,%20Books/%5BBC2%5D_KDW_2010.pdf
More Related Content

What's hot

Data mining project presentation
Data mining project presentationData mining project presentation
Data mining project presentation
Kaiwen Qi
 

What's hot (20)

Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Cross validation
Cross validationCross validation
Cross validation
 
web mining
web miningweb mining
web mining
 
Analytics
AnalyticsAnalytics
Analytics
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Text MIning
Text MIningText MIning
Text MIning
 
Fraud detection analysis
Fraud detection analysis Fraud detection analysis
Fraud detection analysis
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Text mining
Text miningText mining
Text mining
 
Predictive analytics
Predictive analytics Predictive analytics
Predictive analytics
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Data mining project presentation
Data mining project presentationData mining project presentation
Data mining project presentation
 
Text Mining
Text MiningText Mining
Text Mining
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Web mining
Web miningWeb mining
Web mining
 
Application of predictive analytics
Application of predictive analyticsApplication of predictive analytics
Application of predictive analytics
 
Fraud detection with Machine Learning
Fraud detection with Machine LearningFraud detection with Machine Learning
Fraud detection with Machine Learning
 

Viewers also liked

Text mining, By Hadi Mohammadzadeh
Text mining, By Hadi MohammadzadehText mining, By Hadi Mohammadzadeh
Text mining, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
European Transport Networks
European Transport NetworksEuropean Transport Networks
European Transport Networks
caglarozpinar
 
Project report for railway security monotorin system
Project report for railway security monotorin systemProject report for railway security monotorin system
Project report for railway security monotorin system
ASWATHY VG
 

Viewers also liked (20)

Text mining, By Hadi Mohammadzadeh
Text mining, By Hadi MohammadzadehText mining, By Hadi Mohammadzadeh
Text mining, By Hadi Mohammadzadeh
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extraction
 
Predictive Text Analytics
Predictive Text AnalyticsPredictive Text Analytics
Predictive Text Analytics
 
Enabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsEnabling Exploration Through Text Analytics
Enabling Exploration Through Text Analytics
 
European Transport Networks
European Transport NetworksEuropean Transport Networks
European Transport Networks
 
Log Data Mining
Log Data MiningLog Data Mining
Log Data Mining
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Text Analytics for Dummies 2010
Text Analytics for Dummies 2010
 
Unmanned railway tracking and anti collision system using gsm
Unmanned railway tracking and anti collision  system  using gsmUnmanned railway tracking and anti collision  system  using gsm
Unmanned railway tracking and anti collision system using gsm
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text Mining
 
Log Mining: Beyond Log Analysis
Log Mining: Beyond Log AnalysisLog Mining: Beyond Log Analysis
Log Mining: Beyond Log Analysis
 
Project report for railway security monotorin system
Project report for railway security monotorin systemProject report for railway security monotorin system
Project report for railway security monotorin system
 
Text mining
Text miningText mining
Text mining
 
Efficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessEfficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining Process
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 

Similar to Text Mining Analytics 101

Resume_Clasification.pptx
Resume_Clasification.pptxResume_Clasification.pptx
Resume_Clasification.pptx
MOINDALVS
 
Resume_Clasification.pptx
Resume_Clasification.pptxResume_Clasification.pptx
Resume_Clasification.pptx
MOINDALVS
 
Dsm as theory building
Dsm as theory buildingDsm as theory building
Dsm as theory building
ClarkTony
 
Towards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositoriesTowards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositories
Valentina Paunovic
 
Sales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingSales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R Programming
Nagarjun Kotyada
 
C programming_MSBTE_Diploma_Pranoti Doke
C programming_MSBTE_Diploma_Pranoti DokeC programming_MSBTE_Diploma_Pranoti Doke
C programming_MSBTE_Diploma_Pranoti Doke
Pranoti Doke
 

Similar to Text Mining Analytics 101 (20)

Resume_Clasification.pptx
Resume_Clasification.pptxResume_Clasification.pptx
Resume_Clasification.pptx
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Resume_Clasification.pptx
Resume_Clasification.pptxResume_Clasification.pptx
Resume_Clasification.pptx
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATIONIMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationss
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
 
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
 
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web search
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptAI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
 
Dsm as theory building
Dsm as theory buildingDsm as theory building
Dsm as theory building
 
Towards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositoriesTowards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositories
 
MODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptxMODULE 4-Text Analytics.pptx
MODULE 4-Text Analytics.pptx
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Sales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R ProgrammingSales_Prediction_Technique using R Programming
Sales_Prediction_Technique using R Programming
 
C programming_MSBTE_Diploma_Pranoti Doke
C programming_MSBTE_Diploma_Pranoti DokeC programming_MSBTE_Diploma_Pranoti Doke
C programming_MSBTE_Diploma_Pranoti Doke
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflow
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 

Recently uploaded

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
wsppdmt
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
23050636
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
jk0tkvfv
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 

Recently uploaded (20)

Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AI
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 

Text Mining Analytics 101

  • 1. Text Mining 101 Manohar Swamynathan August 2012
  • 2. agenda: o Text Mining Process Steps o Calculate Term Weight o Similarity Distance Measure o Common Text Mining Techniques o Appendix - Required R packages for Text Mining - Implemented Examples o R code for obtaining and analyzing tweets. o RTextTools – Ensemble Classification o References Manohar Swamynathan Aug 2012
  • 3. Step 1 – Data assemble Text Corpus Flat files Social Corporate Database CommonTextDataSources
  • 4. Data Processing Step Brief Description Explore Corpus through Exploratory Data Analysis Understand the types of variables, their functions, permissible values, and so on. Some formats including html and xml contain tags and other data structures that provide more metadata. Convert text to lowercase This is to avoid distinguish between words simply on case. Remove Number(if required) Numbers may or may not be relevant to our analyses. Remove Punctuations Punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation. Later we will use punctuation to support the extraction of meaning. Remove English stop words Stop words are common words found in a language. Words like for, of, are, etc are common stop words. Remove Own stop words(if required) Along with English stop words, we could instead or in addition remove our own stop words. The choice of own stop word might depend on the domain of discourse, and might not become apparent until we've done some analysis. Strip whitespace Eliminate extra white-spaces. Any additional space that is not the space that occur within the sentence or between words. Stemming Stemming uses an algorithm that removes common word endings for English words, such as “es”, “ed” and “'s”. Example, "computer" & "computers" become "comput" Lemmatization Transform to dictionary base form i.e., "produce" & "produced" become "produce" Sparse terms We are often not interested in infrequent terms in our documents. Such “sparse" terms should be removed from the document term matrix. Document term matrix A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix. Step 2 - Data Processing 4 Python packages – textmining, nltk R packages - tm, qdap, openNLP
  • 5. Step 3 - Data Visualization Frequency Chart Word Cloud Correlation Plot
  • 6. Step 4 – Models: Clustering, Classification, Sentiment Analysis
  • 7. Calculate Term Weight (TF * IDF)
Term Frequency – how frequently does a term appear?
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
Inverse Document Frequency – how important is a term?
Document frequency DF = d / D, where d is the number of documents containing a given term and D is the size of the collection of documents. To normalize we could take log(d/D), but since usually D > d, log(d/D) gives a negative value, so we invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large and very small quantities can be compared smoothly.
IDF(t) = log(total number of documents / number of documents with term t in it)
Example:
- Assume we have 10 million documents overall and the word "spindle" appears in one thousand of them.
- Consider two documents containing 100 total words each, in which the term "spindle" appears some number of times.
Document 1: spindle frequency 3, total words 100; TF = 3/100 = 0.03; IDF = log(10,000,000/1,000) = 4; TF * IDF = 0.03 * 4 = 0.12
Document 2: spindle frequency 30, total words 100; TF = 30/100 = 0.3; IDF = log(10,000,000/1,000) = 4; TF * IDF = 0.3 * 4 = 1.2
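The spindle example works out in a few lines of Python (note the slide's IDF uses a base-10 log, so log(10,000,000/1,000) = 4):

```python
import math

def tf(term_count, total_terms):
    # Term frequency: occurrences of the term / total terms in the document.
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    # Base-10 log, matching the slide: log10(10,000,000 / 1,000) = 4.
    return math.log10(total_docs / docs_with_term)

total_docs, docs_with_spindle = 10_000_000, 1_000

doc1 = tf(3, 100) * idf(total_docs, docs_with_spindle)   # 0.03 * 4 = 0.12
doc2 = tf(30, 100) * idf(total_docs, docs_with_spindle)  # 0.30 * 4 = 1.20
```

Library implementations often use the natural log or add smoothing terms; the slide's plain log10 form is kept here so the numbers match the worked example.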
  • 8. Similarity Distance Measure
Example:
Text 1: statistics skills and programming skills are equally important for analytics
Text 2: statistics skills and domain knowledge are important for analytics
Text 3: I like reading books and travelling
Term-frequency vectors over the 16-term vocabulary (statistics, skills, and, programming, knowledge, are, equally, important, for, analytics, domain, I, like, reading, books, travelling):
T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)
Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%
Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
Cosine: the cosine value will be a number between 0 and 1; the smaller the angle between two document vectors, the larger the cosine value/similarity. (The slide's diagram contrasts the Euclidean distance between vectors X and Y with the cosine of the angle between them.)
Additional Reading: detailed papers comparing the efficiency of different distance measures for text documents:
1) http://home.iitk.ac.in/~spranjal/cs671/project/report.pdf
2) http://users.dsic.upv.es/~prosso/resources/BarronEtAl_ICON09.pdf
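The `%*%` expressions above are R matrix products; the same cosine-similarity calculation can be reproduced in Python over the slide's three vectors:

```python
import math

# Term-frequency vectors from the slide (16-term vocabulary).
T1 = [1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
T2 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
T3 = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

def cosine(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

sim_12 = cosine(T1, T2)  # ≈ 0.77, i.e. 77%
sim_13 = cosine(T1, T3)  # ≈ 0.12, i.e. 12%
```

T1 and T2 share most of their vocabulary, so they score high; T1 and T3 overlap only on the stop word "and", so they score low.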
  • 9. Common Text Mining Techniques • N-grams • Shallow Natural Language Processing • Deep Natural Language Processing
  • 10. n-gram
Definition:
• An n-gram is a contiguous sequence of n items from a given sequence of text.
• The items can be letters, words, syllables or base pairs according to the application.
Example: "defense attorney for liberty and montecito"
1-gram: defense | attorney | for | liberty | and | montecito
2-gram: defense attorney | attorney for | for liberty | liberty and | and montecito
3-gram: defense attorney for | attorney for liberty | for liberty and | liberty and montecito
4-gram: defense attorney for liberty | attorney for liberty and | for liberty and montecito
5-gram: defense attorney for liberty and | attorney for liberty and montecito
Application:
 Probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model.
 Widely used in probability, communication theory, computational linguistics, and biological sequence analysis.
Advantage:
 Relatively simple.
 Simply by increasing n, the model can be made to store more context.
Disadvantage:
 The semantic value of the items is not considered.
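Generating the n-grams from the slide's example is a one-line sliding window over the token list:

```python
def ngrams(tokens, n):
    # Slide over the token list, taking every contiguous run of n items.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "defense attorney for liberty and montecito".split()

bigrams = ngrams(tokens, 2)
# → ['defense attorney', 'attorney for', 'for liberty', 'liberty and', 'and montecito']
fivegrams = ngrams(tokens, 5)
# → ['defense attorney for liberty and', 'attorney for liberty and montecito']
```

A sequence of m tokens yields m − n + 1 n-grams, which is why the counts shrink from six 1-grams down to two 5-grams.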
  • 11. Shallow NLP Technique
Definition:
- Assign a syntactic label (noun, verb, etc.) to a chunk.
- Knowledge extraction from text through a semantic/syntactic analysis approach.
Application:
- Taxonomy extraction (predefined terms and entities). Entities: people, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines.
- Concept extraction (main idea or theme).
Advantage:
- Less noisy than n-grams.
Disadvantage:
- Does not specify the role of items in the main sentence.
  • 12. Shallow NLP Technique
Sentence: "The driver from Europe crashed the car with the white bumper"
Convert to lowercase and PoS-tag each 1-gram:
the/DT driver/NN from/IN europe/NNP crashed/VBD the/DT car/NN with/IN the/DT white/JJ bumper/NN
(DT = determiner; NN = noun, singular or mass; IN = preposition or subordinating conjunction; NNP = proper noun, singular; VBD = verb, past tense; JJ = adjective)
Concept extraction:
- Remove stop words.
- Retain only nouns and verbs.
Bi-grams with nouns and verbs retained (with PoS): driver europe (NN NNP) | europe crashed (NNP VBD) | crashed car (VBD NN) | car white (NN JJ) | white bumper (JJ NN)
3-grams with nouns and verbs retained (with PoS): driver europe crashed (NN NNP VBD) | europe crashed car (NNP VBD NN) | crashed car white (VBD NN JJ) | car white bumper (NN JJ NN)
Conclusion:
1-gram: reduced noise, but no clear context.
Bi-gram & 3-gram: increased context, but some information loss.
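The filter-then-window step can be sketched in Python. Real pipelines would obtain the tags from a tagger (e.g. nltk's `pos_tag` or openNLP in R); here the Penn Treebank tags are hardcoded from the slide's example so the sketch stays self-contained:

```python
# PoS tags hardcoded from the slide; in practice they come from a tagger.
tagged = [
    ("the", "DT"), ("driver", "NN"), ("from", "IN"), ("europe", "NNP"),
    ("crashed", "VBD"), ("the", "DT"), ("car", "NN"), ("with", "IN"),
    ("the", "DT"), ("white", "JJ"), ("bumper", "NN"),
]

# Keep content words: nouns, verbs, and adjectives (the slide's own bi-gram
# output keeps the adjective "white"), dropping determiners and prepositions.
keep = {"NN", "NNS", "NNP", "VB", "VBD", "JJ"}
content = [w for w, t in tagged if t in keep]

# Bi-grams over the retained sequence, as on the slide.
bigrams = [(content[i], content[i + 1]) for i in range(len(content) - 1)]
# → [('driver', 'europe'), ('europe', 'crashed'), ('crashed', 'car'),
#    ('car', 'white'), ('white', 'bumper')]
```

Filtering before windowing is what gives pairs like "europe crashed" that never appear adjacent in the raw sentence: the window runs over the reduced content-word sequence, not the original text.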
  • 13. Deep NLP Technique
Definition:
- An extension of shallow NLP.
- Detected relationships are expressed as complex constructions to retain the context.
- Example relationships: located in, employed by, part of, married to.
Applications:
- Develop features and representations appropriate for complex interpretation tasks.
- Fraud detection.
- Life science: predicting activities based on complex RNA-sequence data.
Example: the sentence "The driver from Europe crashed the car with the white bumper" can be represented using triples (subject : predicate [modifier] : object) without losing the context.
Triples:
driver : crash : car
driver : crash with : bumper
driver : be from : Europe
  • 14. Techniques – Summary
N-gram
- General steps: convert to lowercase; remove punctuation; remove special characters.
- Pros: simple technique.
- Cons: extremely noisy.
Shallow NLP technique
- General steps: PoS tagging; lemmatization, i.e., transform to the dictionary base form ("produce" & "produced" become "produce"); stemming, i.e., transform to the root word (1) "computer" & "computers" become "comput", 2) "product", "produce" & "produced" become "produc"); chunking, i.e., identify the phrasal constituents in a sentence (noun/verb phrases, etc.) and split the sentence into chunks of semantically related words.
- Pros: less noisy than n-grams.
- Cons: computationally expensive for analyzing the structure of texts; does not specify the internal structure or the role of words in the sentence.
Deep NLP technique
- General steps: generate the syntactic relationship between each pair of words; extract subject, predicate, negation, object and named entities to form triples.
- Pros: the context of the sentence is retained.
- Cons: sentence-level analysis is too structured.
  • 16. R - Text Mining Process Overview
Step 1 – Data Assemble: build the corpus from web sources, documents, and databases
Step 2 – Data Processing:
2A - Explore corpus through EDA
2B - Convert text to lowercase
2C - Remove a) numbers (if required) b) punctuation c) English stop words d) own stop words (if required) e) strip whitespace f) lemmatization/stemming g) sparse terms
2D - Create document-term matrix
Step 3 – Visualization: frequency chart, word cloud, correlation plot
Step 4 – Build Model(s): clustering, classification, sentiment analysis
  • 17. R – Required packages for Text Mining
- tm (Text Mining): a framework for text mining applications
- topicmodels (Topic Modelling): fit topic models with Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM)
- wordcloud (Visualization): plot a cloud comparing the frequencies of words across documents
- lda (Topic Modelling): fit topic models with Latent Dirichlet Allocation
- wordnet (Text Mining): database of English commonly used in linguistics and text mining
- RTextTools (Text Mining): automatic text classification via supervised learning
- qdap (Sentiment Analysis): transcript analysis, text mining and natural language processing
- tm.plugin.dc (Text Mining): a plug-in for package tm to support distributed text mining
- tm.plugin.mail (Text Mining): a plug-in for package tm to handle mail
- textir (Text Mining): a suite of tools for inference about text documents and associated sentiment
- tau (Text Mining): utilities for text analysis
- textcat (Text Mining): n-gram based text categorization
- SnowballC (Text Mining): word stemmer
- twitteR (Text Mining): provides an interface to the Twitter web API
- ROAuth (Text Mining): allows users to authenticate to the server of their choice (such as Twitter)
- RColorBrewer (Visualization): provides palettes for drawing nice maps shaded according to a variable
- ggplot2 (Visualization): graphing package implemented on top of R, inspired by Leland Wilkinson's seminal Grammar of Graphics
  • 18. Example 1 - Obtaining and analyzing tweets
Objective: R code for analyzing tweets relating to #AAA2011 (text mining, topic modelling, network analysis, clustering and sentiment analysis).
What does the code do? The code details ten steps in the analysis and visualization of the tweets:
1. Acquiring the raw Twitter data
2. Calculating some basic statistics with the raw Twitter data
3. Calculating some basic retweet statistics
4. Calculating the ratio of retweets to tweets
5. Calculating some basic statistics about URLs in tweets
6. Basic text mining for token frequency and token association analysis (word cloud)
7. Calculating sentiment scores of tweets, including on subsets containing tokens of interest
8. Hierarchical clustering of tokens based on multiscale bootstrap resampling
9. Topic modelling the tweet corpus using latent Dirichlet allocation
10. Network analysis of tweeters based on retweets
Code source: the code was taken from https://github.com/benmarwick/AAA2011-Tweets and tweaked, with additional bits added where required, to ensure it runs correctly.
How to run or test the code? From the Word doc, copy the R code in the given sequence (highlighted in yellow) and paste it into your R console.
  • 19. Example 2 - RTextTools
RTextTools – supervised learning for text classification using an ensemble. RTextTools is a free, open-source R machine learning package for automatic text classification. The package includes nine algorithms for ensemble classification (SVM, SLDA, boosting, bagging, random forests, glmnet, decision trees, neural networks, and maximum entropy), comprehensive analytics, and thorough documentation. Users may use n-fold cross-validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble. (Using a four-ensemble agreement approach, Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document, the machine label matches the human label over 90% of the time. The rate is just 45% when only two algorithms agree on the label.)
Code source: the code is readily available for download and use from https://github.com/timjurka/RTextTools. It can be run without modification for testing, but it is set up so that changes can easily be incorporated based on our requirements.
Additional reading: http://www.rtexttools.com/about-the-project.html
  • 20. References
- Penn Treebank PoS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Stanford InfoLab, Finding Similar Items: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
- Triplet Extraction from Sentences: http://ailab.ijs.si/delia_rusu/Papers/is_2007.pdf
- Shallow and Deep NLP Processing for Ontology Learning, a Quick Overview: http://azouaq.athabascau.ca/publications/Conferences,%20Workshops,%20Books/%5BBC2%5D_KDW_2010.pdf