Introduction to the
tm package
What is Text mining?
Text mining is the process of exploring and analyzing large amounts
of unstructured text data in order to identify concepts, patterns,
topics, keywords and other attributes.
Common challenges of text mining:
• High dimensionality: each distinct word or phrase can become a dimension of its own.
• The data are unstructured, whereas other data mining techniques usually
work with data in a structured, tabular format.
• Words are not statistically independent of each other.
• Ambiguity: “the quality of being open to more than one
interpretation; inexactness.”
Rupak Roy
Text mining applications
• Customer Relationship management (CRM)
• Market Analysis
• NLP (natural language processing)
• Personalization in E-Commerce
Natural language processing (NLP) is a field of AI and a
component of text mining that performs linguistic analysis. It
helps machines understand and analyze the languages that
humans are naturally good at. NLP uses a variety of
methodologies to decipher the ambiguities in human language, such as
automatic summarization, part-of-speech tagging, entity extraction and
relation extraction, as well as disambiguation and natural language
understanding and recognition.
Modeling Techniques
• Supervised Learning
• Unsupervised learning
Supervised Learning: we use labeled data to train a model, and the
trained model then classifies new data. The labels direct the training of
the ML model.
For example: sentiment analysis using classification methods like SVM.
Unsupervised Learning: the opposite of supervised learning. It doesn't
require labeled data to train the model and validate it on test data.
Instead it uses the available unlabeled data to discover structure in it,
such as groups of similar documents or topics.
For example: clustering, topic modeling.
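To make the unsupervised case concrete, here is a minimal sketch in base R. The document-term counts, the document names (d1 to d4) and the terms are made up for illustration, and hierarchical clustering is only one of several clustering methods that could be used here.

```r
# Hypothetical toy counts: rows are documents, columns are terms.
dtm <- rbind(
  d1 = c(ship = 3, droid = 2, force = 0, jedi = 0),
  d2 = c(ship = 2, droid = 3, force = 0, jedi = 0),
  d3 = c(ship = 0, droid = 0, force = 3, jedi = 2),
  d4 = c(ship = 0, droid = 0, force = 2, jedi = 3)
)

# No labels are used: documents are grouped purely by how similar their counts are.
hc <- hclust(dist(dtm))          # Euclidean distance, complete linkage (the defaults)
clusters <- cutree(hc, k = 2)    # cut the tree into two groups
print(clusters)
```

Documents d1 and d2 share vocabulary, as do d3 and d4, so the two pairs should land in different groups without any labels being supplied.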
tm (text mining) package in R
tm is an add-on R package (installed from CRAN, not part of base R) for
preprocessing text data. A typical workflow:
1. Remove unnecessary lines, then convert the text to a corpus (a structured
collection of text documents).
2. Then read and inspect the corpus and create a TDM (term document
matrix).
Corpus: a corpus, or text corpus, is a large and structured set of texts.
a) In a corpus we parse the data to extract words, remove
punctuation and extra spaces, and convert everything to lower case to make it
uniform.
b) Then remove words that carry little meaning by themselves, such as was, as, a, it,
etc. These are called stop words.
c) Finally apply stemming, which is the process of reducing
derived words to their word stem, base or root form. E.g. Consult,
Consulting, Consultation, Consultants = Consult (same meaning).
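Steps a) to c) can be sketched in base R without any package. The sample sentences and the short stop word list are assumptions made for this illustration, and the suffix rule at the end is only a crude stand-in for a real stemmer (tm provides stemDocument() for that purpose).

```r
# Hypothetical sample text.
raw <- c("The consultants WERE consulting, as it was a consultation!",
         "It was a dark   time for the Rebellion.")

# a) Make the text uniform: lower case, no punctuation, single spaces.
clean <- tolower(raw)
clean <- gsub("[[:punct:]]", "", clean)
clean <- trimws(gsub("[[:space:]]+", " ", clean))
tokens <- strsplit(clean, " ", fixed = TRUE)

# b) Drop stop words (a tiny hand-written list, not tm's stopwords("english")).
stop_words <- c("the", "were", "as", "it", "was", "a", "for")
tokens <- lapply(tokens, function(tok) tok[!tok %in% stop_words])

# c) Toy "stemming": strip a few known suffixes. This is NOT the Porter algorithm.
stems <- lapply(tokens, function(tok) sub("(ants|ing|ation)$", "", tok))

print(tokens)
print(stems)
```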
Term Document Matrix
Term Document Matrix (TDM) is a matrix that describes the frequency of
terms that occur in a collection of documents.
Rows = terms
Columns = documents
One of the most widely used functions for cleaning the
data (corpus), e.g. removing whitespace, punctuation and numbers, is the
tm_map() function from the tm package.

Document Term Matrix (rows = documents, columns = terms)
      I    like   hate   Data Science
D1    1    1      0      1
D2    1    0      1      1

Term Document Matrix (rows = terms, columns = documents)
               D1   D2
I              1    1
like           1    0
hate           0    1
Data Science   1    1
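The two-document example can be rebuilt by hand in base R to show what a term document matrix contains. This is a sketch, not how tm builds the matrix internally. It assumes the documents were "I like Data Science" and "I hate Data Science", and it splits on spaces, so "data" and "science" become separate terms rather than one.

```r
# Assumed source documents for the D1/D2 example.
docs <- c(D1 = "I like Data Science", D2 = "I hate Data Science")

# Tokenize: lower case, split on single spaces.
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)
names(tokens) <- names(docs)

# Vocabulary = every distinct term across all documents.
terms <- sort(unique(unlist(tokens)))

# Count each term in each document: rows = terms, columns = documents.
tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

# The document term matrix is simply the transpose.
dtm <- t(tdm)

print(tdm)
print(dtm)
```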
Term Document Matrix
Now, what can we do with a Term Document Matrix (TDM)?
* We can easily find the terms that occur most frequently in the documents, which
helps to identify the keywords. For example, this is very helpful for
understanding Google search keywords.
* We can find associations between words, i.e. how strongly words are
correlated with each other.
* We can group words that behave similarly using clustering
techniques.
* Sentiment analysis: the automated process of understanding an
opinion (negative, positive or neutral) about a given subject from
written or spoken language, helping a business to understand the social
sentiment around its brand, product or service.
Example
#Load the data
star_wars_EPV <- read.csv("SW_EpisodeV.txt", header = TRUE, sep = " ", stringsAsFactors = FALSE)
View(star_wars_EPV)
str(star_wars_EPV)
names(star_wars_EPV)
#Keep only the dialogue column (the second column) as a data frame
dialogue <- data.frame(star_wars_EPV$dialogue, stringsAsFactors = FALSE)
#Rename the column
names(dialogue) <- "dialogue"
str(dialogue)
Example
#Data preprocessing using the tm package
library(tm)
#Build the text corpus
dialogue.corpus <- Corpus(VectorSource(dialogue$dialogue))
summary(dialogue.corpus)
#Inspect the first five documents before cleaning
inspect(dialogue.corpus[1:5])
#Clean the data
#Convert to lower case
dialogue.corpus <- tm_map(dialogue.corpus, content_transformer(tolower))
#Remove extra white space
dialogue.corpus <- tm_map(dialogue.corpus, stripWhitespace)
#Remove punctuation
dialogue.corpus <- tm_map(dialogue.corpus, removePunctuation)
#Remove numbers
dialogue.corpus <- tm_map(dialogue.corpus, removeNumbers)
Example
#Create a list of stop words: common words that carry little meaning on their own
my_stopwords <- c(stopwords("english"), "@", "http*", "url", "www*")
#Remove the stop words
dialogue.corpus <- tm_map(dialogue.corpus, removeWords, my_stopwords)
#Build the term document matrix
dialogue.tdm <- TermDocumentMatrix(dialogue.corpus)
dialogue.tdm
dim(dialogue.tdm) #Dimensions of the term document matrix
inspect(dialogue.tdm[1:10, 1:10])
#Remove sparse terms (words that occur in very few documents)
#sparse = 0.97 drops terms that are missing from about 97% or more of the documents
dialogue.imp <- removeSparseTerms(dialogue.tdm, 0.97)
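The idea behind removing sparse terms can be illustrated on a small hand-made matrix in base R. This is a sketch of the concept only, not tm's implementation: the counts, the term names and the 0.6 threshold are invented for the example, and the sparsity of a term is taken to be the share of documents in which it does not occur.

```r
# Hypothetical term document matrix: rows = terms, columns = documents.
tdm <- rbind(
  force     = c(1, 1, 0, 1),
  ship      = c(0, 2, 0, 0),
  droid     = c(1, 0, 1, 0),
  carbonite = c(0, 0, 0, 1)
)
colnames(tdm) <- paste0("D", 1:4)

# Sparsity of each term = fraction of documents where its count is zero.
sparsity <- rowMeans(tdm == 0)

# Keep only terms whose sparsity is below the chosen threshold.
threshold <- 0.6
kept <- tdm[sparsity < threshold, , drop = FALSE]

print(sparsity)
print(kept)
```

A lower threshold keeps fewer, more widely used terms; a threshold close to 1 (such as 0.97) only drops the rarest ones.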
Example
#Find words and their frequencies
#as.matrix() returns the full term-by-document counts; inspect() is meant for printing
temp <- as.matrix(dialogue.imp)
wordFreq <- data.frame(ST = rownames(temp), Freq = rowSums(temp))
head(wordFreq)
wordFreq <- wordFreq[order(wordFreq$Freq, decreasing = TRUE), ]
View(wordFreq)
Example
##Basic Analysis
#Finding the most frequent terms/words
findFreqTerms(dialogue.tdm,10) #Occurring minimum of 10 times
findFreqTerms(dialogue.tdm,30) #Occurring minimum of 30 times
findFreqTerms(dialogue.tdm,50) #Occurring minimum of 50 times
findFreqTerms(dialogue.tdm,70) #Occurring minimum of 70 times
#Finding association between terms/words
findAssocs(dialogue.tdm,"dont",0.3)
findAssocs(dialogue.tdm,"get",0.2)
findAssocs(dialogue.tdm,"right",0.2)
findAssocs(dialogue.tdm,"will",0.3)
findAssocs(dialogue.tdm,"know",0.3)
findAssocs(dialogue.tdm,"good",0.3)
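As far as the tm documentation describes it, the association score is a correlation between term counts across documents, with the last argument acting as a lower limit. The base R sketch below illustrates that idea on invented counts; the matrix, the target term and the 0.3 limit are assumptions for the example, and it is not a reimplementation of findAssocs().

```r
# Hypothetical term document matrix: rows = terms, columns = 5 documents.
tdm <- rbind(
  know = c(1, 0, 2, 0, 1),
  dont = c(1, 0, 2, 0, 0),
  ship = c(0, 3, 0, 1, 0)
)

target <- "dont"
others <- setdiff(rownames(tdm), target)

# cor() works column-wise, so transpose to correlate terms with each other.
assoc <- cor(t(tdm))[target, others]

# Keep only terms whose correlation with the target reaches the lower limit.
strong <- assoc[assoc >= 0.3]

print(round(assoc, 2))
print(names(strong))
```

Terms that tend to appear in the same documents as the target get a high positive score, while terms that appear in different documents score near zero or below.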
Building Word Cloud
#Visualization using a word cloud
library("wordcloud")
library("RColorBrewer")
#wordcloud() can take the text corpus directly, so the term document matrix is not needed here
#How to choose colors?
?brewer.pal
display.brewer.all() #Shows a chart of all available palettes
#brewer.pal(n, name) returns n colors from the named palette
display.brewer.pal(8, "Dark2")
display.brewer.pal(8, "Purples")
display.brewer.pal(3, "Oranges")
set8 <- brewer.pal(8, "Dark2")
Building Word Cloud
#Plot the word cloud
wordcloud(dialogue.corpus, min.freq = 10, max.words = 60,
random.order = TRUE, colors = set8)
#Same plot drawn with a Hershey script font
wordcloud(dialogue.corpus, min.freq = 10, max.words = 60,
random.order = TRUE, colors = set8, vfont = c("script", "plain"))
Next
We will learn how to use regular expression tools to find and replace
text.
2024_Student Session 2_ Set Plan Preparation.pptx2024_Student Session 2_ Set Plan Preparation.pptx
2024_Student Session 2_ Set Plan Preparation.pptx
 

Introduction to Text Mining

  • 2. What is text mining? Text mining is the process of exploring and analyzing large amounts of unstructured text to identify concepts, patterns, topics, keywords and other attributes. Common challenges of text mining:
      * High dimensionality: each distinct word or phrase can become a dimension of its own.
      * The data are unstructured, unlike in other data mining techniques, where the data come in a structured tabular format.
      * Words are not statistically independent of one another.
      * Ambiguity: "the quality of being open to more than one interpretation; inexactness."
    Rupak Roy
  • 3. Text mining applications
      * Customer relationship management (CRM)
      * Market analysis
      * Natural language processing (NLP)
      * Personalization in e-commerce
    Natural language processing (NLP) is a field of AI and a component of text mining. It performs linguistic analysis that helps machines understand and analyze the languages humans are naturally good at. NLP uses a variety of methods to resolve the ambiguities in human language, such as automatic summarization, part-of-speech tagging, entity extraction and relation extraction, as well as disambiguation and natural language understanding and recognition.
  • 4. Modeling techniques
      * Supervised learning
      * Unsupervised learning
    Supervised learning: we use labeled data to train a model, which then classifies new data. Example: sentiment analysis using a classification method such as SVM.
    Unsupervised learning: the opposite of supervised learning. It does not require labeled data to train the model or to validate it on test data; instead it uses the available unlabeled data to find structure, such as groups of similar documents. Examples: clustering, topic modeling.
  • 5. tm (text mining) package in R. tm is an R package, installed from CRAN rather than shipped with base R, for preprocessing text data:
      1. Remove unnecessary lines, then convert the text to a corpus (a structured collection of texts).
      2. Read and inspect the corpus, then create a TDM (term document matrix).
    Corpus: a corpus, or text corpus, is a large and structured set of texts.
      a) In a corpus we parse the data to extract words, remove punctuation and extra spaces, and convert everything to one case to make it uniform.
      b) Then remove words that carry little meaning on their own, such as "was", "as", "a" and "it"; these are called stop words.
      c) Finally apply stemming, the process of reducing derived words to their word stem, base or root form. E.g. consult, consulting, consultation and consultants all reduce to "consult" (same meaning).
  • 6. Term Document Matrix. A term document matrix (TDM) describes the frequency of the terms that occur in a collection of documents: rows = terms, columns = documents. A document term matrix (DTM) is its transpose: rows = documents, columns = terms.
    Document term matrix:
              i   like   hate   data science
        D1    1   1      0      1
        D2    1   0      1      1
    Term document matrix:
                        D1   D2
        i               1    1
        like            1    0
        hate            0    1
        data science    1    1
    A function widely used for cleaning the corpus (removing whitespace, punctuation and numbers) is tm_map() from the tm package.
  • 7. Term Document Matrix. What can we do with a term document matrix (TDM)?
      * Find the terms that occur frequently in the documents, which helps identify keywords, for example when analyzing Google search keywords.
      * Find associations between words: which words are correlated and how they relate to each other.
      * Group words that behave similarly, using clustering techniques.
      * Sentiment analysis: the automated process of identifying an opinion (negative, positive or neutral) about a given subject from written or spoken language, helping a business understand the social sentiment around its brand, product or service.
  • 8. Example
      # Load the data
      star_wars_EPV <- read.csv("SW_EpisodeV.txt", header = TRUE, sep = " ")
      View(star_wars_EPV)
      str(star_wars_EPV)
      names(star_wars_EPV)
      # Convert to a data frame, keeping only the second column (dialogue)
      dialogue <- data.frame(star_wars_EPV$dialogue)
      # Rename the column
      names(dialogue) <- "dialogue"
      str(dialogue)
  • 9. Example
      # Data preprocessing using the tm package
      library(tm)
      # Build the text corpus
      dialogue.corpus <- Corpus(VectorSource(dialogue$dialogue))
      summary(dialogue.corpus)
      # Inspect the first five elements of the corpus
      inspect(dialogue.corpus[1:5])
      # Clean the data
      # Convert to lower case
      dialogue.corpus <- tm_map(dialogue.corpus, content_transformer(tolower))
      # Remove extra white space
      dialogue.corpus <- tm_map(dialogue.corpus, stripWhitespace)
      # Remove punctuation
      dialogue.corpus <- tm_map(dialogue.corpus, removePunctuation)
      # Remove numbers
      dialogue.corpus <- tm_map(dialogue.corpus, removeNumbers)
  • 10. Example
      # Create a list of stop words: words that carry little meaning on their own
      my_stopwords <- c(stopwords("english"), "@", "http*", "url", "www*")
      # Remove the stop words
      dialogue.corpus <- tm_map(dialogue.corpus, removeWords, my_stopwords)
      # Build the term document matrix
      dialogue.tdm <- TermDocumentMatrix(dialogue.corpus)
      dialogue.tdm
      dim(dialogue.tdm)  # Dimensions of the term document matrix
      inspect(dialogue.tdm[1:10, 1:10])
      # Remove sparse terms (words that occur in few documents).
      # sparse = 0.97 drops terms whose sparsity exceeds 97%, i.e. terms that
      # are missing from more than about 97% of the documents.
      dialogue.imp <- removeSparseTerms(dialogue.tdm, 0.97)
  • 11. Example
      # Find the words and their frequencies
      # as.matrix() gives the full term-by-document counts (inspect() is meant for display)
      temp <- as.matrix(dialogue.imp)
      wordFreq <- data.frame(apply(temp, 1, sum))
      wordFreq <- data.frame(ST = row.names(wordFreq), Freq = wordFreq[, 1])
      head(wordFreq)
      # Sort by frequency, most frequent first
      wordFreq <- wordFreq[order(wordFreq$Freq, decreasing = TRUE), ]
      View(wordFreq)
  • 12. Example
      ## Basic analysis
      # Find the most frequent terms/words
      findFreqTerms(dialogue.tdm, 10)  # Occurring at least 10 times
      findFreqTerms(dialogue.tdm, 30)  # Occurring at least 30 times
      findFreqTerms(dialogue.tdm, 50)  # Occurring at least 50 times
      findFreqTerms(dialogue.tdm, 70)  # Occurring at least 70 times
      # Find associations between terms/words (third argument = lower correlation limit)
      findAssocs(dialogue.tdm, "dont", 0.3)
      findAssocs(dialogue.tdm, "get", 0.2)
      findAssocs(dialogue.tdm, "right", 0.2)
      findAssocs(dialogue.tdm, "will", 0.3)
      findAssocs(dialogue.tdm, "know", 0.3)
      findAssocs(dialogue.tdm, "good", 0.3)
  • 13. Building Word Cloud
      # Visualization using wordcloud
      library("wordcloud")
      library("RColorBrewer")
      # wordcloud() is given the text corpus here, not the term document matrix
      # How to choose colors?
      ?brewer.pal
      display.brewer.all()  # Shows a chart of all available palettes
      # brewer.pal() returns the colors of a named palette
      display.brewer.pal(8, "Dark2")
      display.brewer.pal(8, "Purples")
      display.brewer.pal(3, "Oranges")
      set8 <- brewer.pal(8, "Dark2")
  • 14. Building Word Cloud
      # Plot the word cloud
      wordcloud(dialogue.corpus, min.freq = 10, max.words = 60,
                random.order = TRUE, colors = set8)
      # Same plot with a script-style font
      wordcloud(dialogue.corpus, min.freq = 10, max.words = 60,
                random.order = TRUE, colors = set8, vfont = c("script", "plain"))
  • 15. Next: we will learn how to use regular expression tools to find and replace text.
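
The toy matrices on slide 6 can be rebuilt by hand, which may help show what TermDocumentMatrix() is counting. The following is a minimal sketch in base R only (it does not call tm, so it is not how tm works internally). The two documents, the whitespace tokenisation, the data_science token and the threshold of 2 are all assumptions made for illustration.

```r
# Toy documents matching the slide-6 tables (assumed wording).
# "data science" is written as one token so whitespace splitting keeps it whole.
docs <- c(D1 = "i like data_science",
          D2 = "i hate data_science")

# Tokenise: lower-case, then split on runs of whitespace.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Vocabulary: every distinct term, in order of first appearance.
terms <- unique(unlist(tokens))

# Count how often each vocabulary term occurs in one document.
count_terms <- function(tok) as.integer(table(factor(tok, levels = terms)))

# Term document matrix: rows = terms, columns = documents.
tdm <- vapply(tokens, count_terms, integer(length(terms)))
rownames(tdm) <- terms

# Document term matrix: the transpose (rows = documents, columns = terms).
dtm <- t(tdm)

# Total frequency of each term across all documents.
term_freq <- rowSums(tdm)

# Terms occurring at least twice overall, in the spirit of findFreqTerms(tdm, 2).
frequent_terms <- names(term_freq)[term_freq >= 2]

print(tdm)
print(frequent_terms)
```

On real text, the tm pipeline from slides 9 and 10 replaces the tokenisation step here and stores the result as a sparse matrix, which matters once the vocabulary grows to thousands of terms.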