BIRLA INSTITUTE OF TECHNOLOGY
MESRA, JAIPUR CAMPUS
TOPIC: TEXT MINING OF TWITTER
PRESENTED BY:
MEGHAJ KUMAR MALLICK(MCA/25017/18)
SAJAN SINGH(MCA/25019/18)
2ND YEAR,3RD SEMESTER
OUTLINE
• Introduction
• Tweets Analysis
  • Extracting Tweets
  • Text Cleaning
  • Frequent Words and Word Cloud
• R Packages
• References and Online Resources
TWITTER
Twitter is a microblogging and social networking service on which users post and
interact with messages known as "tweets". Tweets were originally restricted to 140
characters, but on November 7, 2017, this limit was doubled to 280 for all languages
except Chinese, Japanese, and Korean.
Twitter was created in March 2006 by Jack Dorsey, Noah Glass, Biz Stone, and Evan
Williams, and launched in July of that year.
In 2012, more than 100 million users posted 340 million tweets a day, and the
service handled an average of 1.6 billion search queries per day.
In 2013, it was one of the ten most-visited websites and has been described as "the
SMS of the Internet".
As of 2018, Twitter had more than 321 million monthly active users.
PROCESS
1. Extract tweets and followers from the Twitter website with R
and the twitteR package
2. With the tm package, clean text by removing punctuation,
numbers, hyperlinks and stop words, followed by stemming
and stem completion
3. Build a term-document matrix
4. Analyse following/followed and retweeting relationships with
the igraph package
Retrieving Text from Twitter
• Tweets are extracted from Twitter with the code below using userTimeline() in package
twitteR [Gentry, 2012]. Package twitteR depends on package RCurl [Lang, 2012a], which is
available at http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/.
• Another way to retrieve text from Twitter is using package XML [Lang, 2012b], and an
example of that is given at http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/.
For readers who have no access to Twitter, the tweets data “rdmTweets.RData” can
be downloaded at http://www.rdatamining.com/data.
• Note that the Twitter API has required authentication since March 2013. Before running the code
below, please complete authentication by following the instructions in “Section 3: Authentication
with OAuth” in the twitteR vignettes
(http://cran.r-project.org/web/packages/twitteR/vignettes/twitteR.pdf).
Retrieve Tweets
## Option 1: retrieve tweets from Twitter
library(twitteR)
library(ROAuth)
## Twitter authentication
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
                    access_secret)
## 3200 is the maximum to retrieve
tweets <- userTimeline("RDataMining", n = 3200)
## Option 2: download @RDataMining tweets from RDataMining.com
url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
download.file(url, destfile = "./data/RDataMining-Tweets-20160212.rds")
## load tweets into R
tweets <- readRDS("./data/RDataMining-Tweets-20160212.rds")
Twitter Authentication with OAuth:
Section 3 of http://geoffjentry.hexdump.org/twitteR.pdf
(n.tweet <- length(tweets))
## [1] 448
# convert tweets to a data frame
tweets.df <- twListToDF(tweets)
# tweet #190
tweets.df[190, c("id", "created", "screenName", "replyToSN",
                 "favoriteCount", "retweetCount", "longitude", "latitude", "text")]
##                     id             created  screenName re...
## 190 362866933894352898 2013-08-01 09:26:33 RDataMining ...
##     favoriteCount retweetCount longitude latitude
## 190             9            9        NA       NA
## ...
## 190 The R Reference Card for Data Mining now provides lin...
# print tweet #190 and make text fit for slide width
writeLines(strwrap(tweets.df$text[190], 60))
## The R Reference Card for Data Mining now provides links to
## packages on CRAN. Packages for MapReduce and Hadoop added.
## http://t.co/RrFypol8kw
TRANSFORMING TEXT
• The tweets are first converted to a data frame and then to a
corpus, which is a collection of text documents. After that, the
corpus can be processed with functions provided in package
tm [Feinerer, 2012].
# convert tweets to a data frame
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
dim(df)
## [1] 154 10
library(tm)
# build a corpus, and specify the source to be character vectors
myCorpus <- Corpus(VectorSource(df$text))
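The conversion step above can be tried without Twitter access. The following is a minimal sketch in base R, assuming a hypothetical list of plain records (tweet_list) in place of the status objects returned by twitteR; the field names are illustrative only.

```r
# Hypothetical stand-in for a list of tweets: plain lists, not twitteR status objects
tweet_list <- list(
  list(id = "1", screenName = "RDataMining", text = "Text mining with R"),
  list(id = "2", screenName = "RDataMining", text = "Slides on data mining")
)

# Turn each record into a one-row data frame, then stack the rows
tweets_df <- do.call("rbind",
                     lapply(tweet_list, as.data.frame, stringsAsFactors = FALSE))

dim(tweets_df)                 # rows = tweets, columns = fields
tweet_text <- tweets_df$text   # character vector that would feed VectorSource()
```

The text column is the only part the corpus needs; the other columns stay in the data frame for later inspection.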
Text Cleaning
library(tm)
# build a corpus, and specify the source to be character vectors
myCorpus <- Corpus(VectorSource(tweets.df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
# remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
                 "use", "see", "used", "via", "amp")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# keep a copy for stem completion later
myCorpusCopy <- myCorpus
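To see what each cleaning step does to a single tweet, the same regular expressions can be applied to a plain string in base R. This is a sketch only: the stop word list (stop_words) is a short hypothetical one, not the list returned by stopwords('english') in tm.

```r
tweet <- "The R Reference Card for Data Mining now provides links to packages on CRAN. http://t.co/RrFypol8kw"

x <- tolower(tweet)                               # lower case
x <- gsub("http[^[:space:]]*", "", x)             # drop URLs
x <- gsub("[^[:alpha:][:space:]]*", "", x)        # keep letters and spaces only

# Hypothetical, very short stop word list (assumption for this sketch)
stop_words <- c("the", "for", "to", "on")
words <- strsplit(x, "[[:space:]]+")[[1]]
words <- words[words != "" & !(words %in% stop_words)]

cleaned <- paste(words, collapse = " ")           # also collapses extra whitespace
cleaned
```

The order matters: URLs are removed before punctuation, because stripping punctuation first would turn the link into an ordinary-looking word that the URL pattern no longer removes cleanly.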
STEMMING WORDS
• In many applications, words need to be stemmed to retrieve
their radicals, so that various forms derived from a stem
would be taken as the same when counting word frequency.
• For instance, words “update”, “updated” and “updating”
would all be stemmed to “updat”. Word stemming can be
done with the snowball stemmer, which requires packages
Snowball, RWeka, rJava and RWekajars. After that, we can
complete the stems to their original forms, i.e., “update” for
the above example, so that the words would look normal. This
can be achieved with function stemCompletion().
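The idea of stemming followed by stem completion can be illustrated with a toy example in base R. The sketch below does not use the Snowball stemmer: toy_stem() is a deliberately crude suffix stripper and toy_complete() picks the shortest dictionary word starting with the stem. Both are hypothetical helpers written for this illustration.

```r
# Crude suffix stripping; a stand-in for a real stemmer (assumption)
toy_stem <- function(words) sub("(ing|ed|e)$", "", words)

# Complete each stem with the shortest dictionary word that starts with it
toy_complete <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) return(s)   # no match: keep the stem as it is
    hits[which.min(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

words <- c("update", "updated", "updating")
stems <- toy_stem(words)                           # all three collapse to one stem
completed <- toy_complete(stems, dictionary = words)
stems
completed
```

After stemming, the three forms are counted as one term; completion only restores a readable label for that term.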
Stemming and Stem Completion
myCorpus <- tm_map(myCorpus, stemDocument) # stem words
writeLines(strwrap(myCorpus[[190]]$content, 60))
## r refer card data mine now provid link packag cran packag
## mapreduc hadoop ad
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary=dictionary)
  x <- paste(x, sep="", collapse=" ")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
writeLines(strwrap(myCorpus[[190]]$content, 60))
## r reference card data miner now provided link package cran
## package mapreduce hadoop add
Issues in Stem Completion: “Miner” vs “Mining”
# count word frequency (number of documents containing a word that starts with `word`)
wordFreq <- function(corpus, word) {
  results <- lapply(corpus,
    function(x) { grep(as.character(x), pattern=paste0("\\<", word)) }
  )
  sum(unlist(results))
}
n.miner <- wordFreq(myCorpusCopy, "miner")
n.mining <- wordFreq(myCorpusCopy, "mining")
cat(n.miner, n.mining)
## 9 104
# replace oldword with newword
replaceWord <- function(corpus, oldword, newword) {
  tm_map(corpus, content_transformer(gsub),
         pattern=oldword, replacement=newword)
}
myCorpus <- replaceWord(myCorpus, "miner", "mining")
myCorpus <- replaceWord(myCorpus, "universidad", "university")
myCorpus <- replaceWord(myCorpus, "scienc", "science")
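The word-start anchor in the counting pattern matters. A small base-R sketch on a hypothetical character vector (docs) shows the difference between matching "miner" anywhere and matching it only at the start of a word; count_docs() is an illustrative helper, not part of tm.

```r
# Hypothetical documents for illustration
docs <- c("text miner tools", "data mining with r",
          "examiner report", "mining and miner")

# Number of documents containing a word that starts with `word`
count_docs <- function(docs, word) sum(grepl(paste0("\\<", word), docs))

n.anchored <- count_docs(docs, "miner")   # "examiner" is not counted
n.plain    <- sum(grepl("miner", docs))   # "examiner" is counted as well

# Replace only the whole word "miner"
fixed <- gsub("\\<miner\\>", "mining", docs)
cat(n.anchored, n.plain, "\n")
fixed
```

Without the anchors, a replacement such as "miner" to "mining" would also rewrite the inside of longer words.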
BUILDING A TERM-DOCUMENT MATRIX
• A term-document matrix represents the relationship between
terms and documents, where each row stands for a term and
each column for a document, and an entry is the number of
occurrences of the term in the document.
• Alternatively, one can also build a document-term matrix by
swapping rows and columns. In this section, we build a
term-document matrix from the above processed corpus with
function TermDocumentMatrix(). With its default setting,
terms with fewer than three characters are discarded. To keep
“r” in the matrix, we set the range of wordLengths in the
example below.
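The structure of a term-document matrix can be made concrete with a few toy documents. The sketch below builds one in base R rather than with TermDocumentMatrix(); the documents and the names toy.tdm and toy.dtm are assumptions for illustration.

```r
# Three hypothetical, already cleaned documents
docs <- c(d1 = "r data mining", d2 = "big data", d3 = "r r package")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# One column of term counts per document
toy.tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(toy.tdm) <- terms
toy.tdm

# Document-term matrix: swap rows and columns
toy.dtm <- t(toy.tdm)

# Discarding terms with fewer than three characters would drop "r"
long.terms <- rownames(toy.tdm)[nchar(rownames(toy.tdm)) >= 3]
long.terms
```

The last step mirrors why the minimum word length has to be lowered when a one-letter term such as "r" is of interest.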
Build Term Document Matrix
tdm <- TermDocumentMatrix(myCorpus,
                          control = list(wordLengths = c(1, Inf)))
tdm
## <<TermDocumentMatrix (terms: 1073, documents: 448)>>
## Non-/sparse entries: 3594/477110
## Sparsity           : 99%
## Maximal term length: 23
## Weighting          : term frequency (tf)
idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))
as.matrix(tdm[idx, 21:30])
##         Docs
## Terms    21 22 23 24 25 26 27 28 29 30
##   data    0  1  0  0  1  0  0  0  0  1
##   mining  0  0  0  0  1  0  0  0  0  1
##   r       1  1  1  1  0  1  0  1  1  1
Top Frequent Terms
We have a look at the popular words and the association between words. Note that
there are 448 tweets in total, one per document in the term-document matrix.
# inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 20))
##  [1] "analysing"   "analytics"    "australia" "big"
##  [5] "canberra"    "course"       "data"      "example"
##  [9] "group"       "introduction" "learn"     "mining"
## [13] "network"     "package"      "position"  "r"
## [17] "rdatamining" "research"     "science"   "slide"
## [21] "talk"        "text"         "tutorial"  "university"
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 20)
df <- data.frame(term = names(term.freq), freq = term.freq)
library(ggplot2)
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text=element_text(size=7))
[Figure: horizontal bar chart of the 24 terms with frequency of at least 20; axes: Terms, Count (0 to 200)]
WORDCLOUD
• After building a term-document matrix, we can show the importance of
words with a word cloud (also known as a tag cloud), which can be easily
produced with package wordcloud [Fellows, 2012].
• In the code below, we first convert the term-document matrix to a normal
matrix, and then calculate word frequencies.
• After that, we set colors based on word frequency and use
wordcloud() to make a plot for it. With wordcloud(), the first two
parameters give a list of words and their frequencies. Words with
frequency below three are not plotted, as specified by min.freq=3. By
setting random.order=F, frequent words are plotted first, which makes
them appear in the center of the cloud. The code below takes its colors from
a ColorBrewer palette; gray levels based on frequency can be used in the same
way, and a colorful cloud can be generated by setting colors with rainbow().
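As an illustration of frequency-based gray levels, the following base-R sketch maps hypothetical word frequencies to shades of gray so that the most frequent word is darkest. The frequencies and the scaling factor 0.8 are assumptions chosen for this example.

```r
# Hypothetical word frequencies
word.freq <- c(data = 120, r = 90, mining = 60, big = 30, poll = 2)

# Keep words that would pass min.freq = 3, most frequent first
word.freq <- sort(word.freq[word.freq >= 3], decreasing = TRUE)

# 0 = black for the most frequent word, lighter gray for rarer words
gray.level <- 0.8 * (1 - word.freq / max(word.freq))
word.col <- gray(gray.level)
names(word.col) <- names(word.freq)
word.col
```

A vector like word.col holds one color per word in the same order as the frequencies, which is the form a per-word coloring needs.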
WordCloud
# wordcloud also loads RColorBrewer, which provides brewer.pal()
library(wordcloud)
m <- as.matrix(tdm)
# calculate the frequency of words and sort it by frequency
word.freq <- sort(rowSums(m), decreasing = T)
# colors
pal <- brewer.pal(9, "BuGn")[-(1:4)]
# plot word cloud
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,
          random.order = F, colors = pal)
[Figure: word cloud of the terms that occur at least three times in the tweets]
ASSOCIATIONS
# which words are associated with 'r'?
findAssocs(tdm, "r", 0.2)
##             r
## code     0.27
## example  0.21
## series   0.21
## markdown 0.20
## user     0.20
# which words are associated with 'data'?
findAssocs(tdm, "data", 0.2)
##           data
## mining    0.48
## big       0.44
## analytics 0.31
## science   0.29
## poll      0.24
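The association scores reported by findAssocs() are correlations between the rows of the term-document matrix. A minimal base-R sketch on a hypothetical 3 x 5 matrix (toy.tdm) shows the computation; the counts are invented for illustration.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents
toy.tdm <- rbind(
  data   = c(1, 1, 0, 1, 0),
  mining = c(1, 1, 0, 0, 0),
  r      = c(0, 1, 1, 0, 1)
)

# Correlation of every term with "data" across documents
assoc <- cor(t(toy.tdm))["data", ]
assoc <- assoc[names(assoc) != "data"]

# Keep terms whose correlation reaches the lower limit of 0.2
strong <- round(assoc[assoc >= 0.2], 2)
strong
```

Terms that tend to appear in the same tweets get a high score, while terms that rarely co-occur get a low or negative one and fall below the limit.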
Network of Terms
library(graph)
library(Rgraphviz)
plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)
[Figure: network of the frequent terms, linking terms whose correlation is at least 0.1]
R packages
• Twitter data extraction: twitteR
• Text cleaning and mining: tm
• Word cloud: wordcloud
• Social network analysis: igraph, sna
• Visualisation: wordcloud, Rgraphviz, ggplot2
Twitter Data Extraction – Package twitteR
• userTimeline, homeTimeline, mentions,
retweetsOfMe: retrieve various timelines
• getUser, lookupUsers: get information on Twitter user(s)
• getFollowers, getFollowerIDs: retrieve followers (or
their IDs)
• getFriends, getFriendIDs: return a list of Twitter users
(or user IDs) that a user follows
• retweets, retweeters: return retweets or users who
retweeted a tweet
• searchTwitter: issue a search of Twitter
• getCurRateLimitInfo: retrieve current rate
limit information
• twListToDF: convert into data.frame
https://cran.r-project.org/package=twitteR
Text Mining – Package tm
• removeNumbers, removePunctuation, removeWords,
stripWhitespace: remove numbers,
punctuation, words or extra whitespace
• removeSparseTerms: remove sparse terms from a
term-document matrix
• stopwords: various kinds of stopwords
• stemDocument, stemCompletion: stem words
and complete stems
• TermDocumentMatrix, DocumentTermMatrix: build
a term-document matrix or a document-term matrix
• termFreq: generate a term frequency vector
• findFreqTerms, findAssocs: find frequent terms
or associations of terms
• weightBin, weightTf, weightTfIdf, weightSMART,
WeightFunction: various ways to weight a term-document
matrix
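To give a feel for what weighting a term-document matrix means, here is a base-R sketch of binary weighting and of one common, unnormalised TF-IDF variant on a hypothetical matrix. It does not call the tm weighting functions, and their exact formulas and defaults may differ from this simplified version.

```r
# Hypothetical term-document matrix of raw counts
toy.tdm <- rbind(
  data   = c(1, 2, 1),
  mining = c(1, 0, 0),
  r      = c(0, 1, 1)
)
colnames(toy.tdm) <- c("d1", "d2", "d3")

# Binary weighting: 1 if the term occurs in the document, 0 otherwise
toy.bin <- (toy.tdm > 0) * 1

# TF-IDF: term frequency times log2(number of documents / document frequency)
doc.freq <- rowSums(toy.tdm > 0)
idf <- log2(ncol(toy.tdm) / doc.freq)
toy.tfidf <- toy.tdm * idf     # idf[i] scales row i
round(toy.tfidf, 2)
```

A term such as "data" that occurs in every document gets weight zero, while a term confined to one document is weighted up.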
Social Network Analysis and Visualization – Package igraph
• degree, betweenness, closeness, transitivity:
various centrality scores
• neighborhood: neighborhood of graph vertices
• cliques, largest.cliques, maximal.cliques,
clique.number: find cliques, i.e., complete subgraphs
• clusters, no.clusters: maximal connected components of
a graph and the number of them
• fastgreedy.community, spinglass.community:
community detection
• cohesive.blocks: calculate cohesive blocks
• induced.subgraph: create a subgraph of a graph (igraph)
• read.graph, write.graph: read and write graphs from and to
files of various formats
https://cran.r-project.org/package=igraph
References
• Yanchang Zhao. R and Data Mining: Examples and Case
Studies. ISBN 978-0-12-396963-7, December 2012. Academic
Press, Elsevier. 256 pages.
http://www.rdatamining.com/docs/RDataMining-book.pdf
• Yanchang Zhao and Yonghua Cen (Eds.). Data Mining
Applications with R. ISBN 978-0124115118, December 2013.
Academic Press, Elsevier.
• Yanchang Zhao. Analysing Twitter Data with Text Mining
and Social Network Analysis. In Proc. of the 11th
Australasian Data Mining Analytics Conference (AusDM
2013), Canberra, Australia, November 13-15, 2013.
Online Resources
• Chapter 10 – Text Mining & Chapter 11 – Social Network
Analysis, in book R and Data Mining: Examples and Case
Studies
http://www.rdatamining.com/docs/RDataMining.pdf
• RDataMining Reference Card
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
• Online documents, books and tutorials
http://www.rdatamining.com/resources/onlinedocs
• Free online courses
http://www.rdatamining.com/resources/courses
• RDataMining Group on LinkedIn (22,000+ members)
http://group.rdatamining.com
• Twitter (2,700+ followers)
@RDataMining
Call Girls In Aerocity 🤳 Call Us +919599264170Escort Service
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@vikas rana
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...NETWAYS
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationNathan Young
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...marjmae69
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxnoorehahmad
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...NETWAYS
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxJohnree4
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfhenrik385807
 

Recently uploaded (20)

miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptx
 
James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !James Joyce, Dubliners and Ulysses.ppt !
James Joyce, Dubliners and Ulysses.ppt !
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
 
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
 
Philippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptPhilippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.ppt
 
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC  - NANOTECHNOLOGYPHYSICS PROJECT BY MSC  - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism Presentation
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptx
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
 

Text Mining of Twitter in Data Mining

  • 1. BIRLA INSTITUTE OF TECHNOLOGY MESRA, JAIPUR CAMPUS. TOPIC: TEXT MINING OF TWITTER. PRESENTED BY: MEGHAJ KUMAR MALLICK (MCA/25017/18), SAJAN SINGH (MCA/25019/18), 2ND YEAR, 3RD SEMESTER
  • 2. OUTLINE: Introduction; Tweets Analysis (Extracting Tweets, Text Cleaning, Frequent Words and Word Cloud); R Packages; References and Online Resources
  • 3. TWITTER Twitter is a microblogging and social networking service on which users post and interact with messages known as "tweets". Tweets were originally restricted to 140 characters, but on November 7, 2017, this limit was doubled to 280 for all languages except Chinese, Japanese, and Korean. Twitter was created in March 2006 by Jack Dorsey, Noah Glass, Biz Stone, and Evan Williams, and launched in July of that year. In 2012, more than 100 million users posted 340 million tweets a day, and the service handled an average of 1.6 billion search queries per day. In 2013, it was one of the ten most-visited websites and has been described as "the SMS of the Internet". As of 2018, Twitter had more than 321 million monthly active users.
  • 4. PROCESS
    1. Extract tweets and followers from the Twitter website with R and the twitteR package
    2. With the tm package, clean text by removing punctuation, numbers, hyperlinks and stop words, followed by stemming and stem completion
    3. Build a term-document matrix
    4. Analyse following/followed and retweeting relationships with the igraph package
  • 5. Retrieving Text from Twitter
    Tweets are extracted from Twitter with the code below using userTimeline() in package twitteR [Gentry, 2012]. Package twitteR depends on package RCurl [Lang, 2012a], which is available at http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/.
    Another way to retrieve text from Twitter is using package XML [Lang, 2012b], and an example of that is given at http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/. For readers who have no access to Twitter, the tweets data "rdmTweets.RData" can be downloaded at http://www.rdatamining.com/data.
    Note that the Twitter API has required authentication since March 2013. Before running the code below, please complete authentication by following the instructions in "Section 3: Authentication with OAuth" in the twitteR vignette (http://cran.r-project.org/web/packages/twitteR/vignettes/twitteR.pdf).
  • 6. Retrieve Tweets
    ## Option 1: retrieve tweets from Twitter
    library(twitteR)
    library(ROAuth)
    ## Twitter authentication
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
    ## 3200 is the maximum to retrieve
    tweets <- userTimeline("RDataMining", n = 3200)
    ## Option 2: download @RDataMining tweets from RDataMining.com
    url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
    download.file(url, destfile = "./data/RDataMining-Tweets-20160212.rds")
    ## load tweets into R
    tweets <- readRDS("./data/RDataMining-Tweets-20160212.rds")
    Twitter Authentication with OAuth: Section 3 of http://geoffjentry.hexdump.org/twitteR.pdf
  • 7.
    (n.tweet <- length(tweets))
    ## [1] 448
    # convert tweets to a data frame
    tweets.df <- twListToDF(tweets)
    # tweet #190
    tweets.df[190, c("id", "created", "screenName", "replyToSN", "favoriteCount", "retweetCount", "longitude", "latitude", "text")]
    ##                     id             created  screenName re...
    ## 190 362866933894352898 2013-08-01 09:26:33 RDataMining ...
    ##     favoriteCount retweetCount longitude latitude
    ## 190             9            9        NA       NA
    ##     ...
    ## 190 The R Reference Card for Data Mining now provides lin...
    # print tweet #190 and make text fit for slide width
    writeLines(strwrap(tweets.df$text[190], 60))
    ## The R Reference Card for Data Mining now provides links to
    ## packages on CRAN. Packages for MapReduce and Hadoop added.
    ## http://t.co/RrFypol8kw
  • 8. TRANSFORMING TEXT
    The tweets are first converted to a data frame and then to a corpus, which is a collection of text documents. After that, the corpus can be processed with functions provided in package tm [Feinerer, 2012].
    # convert tweets to a data frame
    df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
    dim(df)
    ## [1] 154 10
    library(tm)
    # build a corpus, and specify the source to be character vectors
    myCorpus <- Corpus(VectorSource(df$text))
  • 9. Text Cleaning
    library(tm)
    # build a corpus, and specify the source to be character vectors
    myCorpus <- Corpus(VectorSource(tweets.df$text))
    # convert to lower case
    myCorpus <- tm_map(myCorpus, content_transformer(tolower))
    # remove URLs
    removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
    myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
    # remove anything other than English letters or space
    removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
    myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
    # remove stopwords, but keep "r" and "big"
    myStopwords <- c(setdiff(stopwords('english'), c("r", "big")), "use", "see", "used", "via", "amp")
    myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
    # remove extra whitespace
    myCorpus <- tm_map(myCorpus, stripWhitespace)
    # keep a copy for stem completion later
    myCorpusCopy <- myCorpus
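To see what these cleaning steps do to a single tweet, the following base R sketch applies the same kind of regular expressions to one string, without the tm package. The sample tweet, the helper name clean_text and the short stop-word list are all made up for illustration; tm's stopwords("english") list is much longer.

```r
# Illustrative tweet; the text and URL are invented for this sketch.
tweet <- "The R Reference Card now has 25 links! See http://example.com/card &amp more"

# Hypothetical helper that mirrors the cleaning steps on one string.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                # convert to lower case
  x <- gsub("http[^[:space:]]*", "", x)          # remove URLs
  x <- gsub("[^[:alpha:][:space:]]*", "", x)     # keep letters and spaces only
  words <- unlist(strsplit(x, "[[:space:]]+"))   # split on runs of whitespace
  words <- words[words != "" & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")
}

# A tiny hypothetical stop-word list.
my_stop_words <- c("the", "now", "has", "see", "amp")
cleaned <- clean_text(tweet, my_stop_words)
print(cleaned)
```

Note that "r" survives because it is not in the stop-word list, which is the reason the slide removes "r" and "big" from the English stop words before filtering.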
  • 10. STEMMING WORDS • In many applications, words need to be stemmed to retrieve their radicals, so that various forms derived from a stem would be taken as the same when counting word frequency. • For instance, words “update”, “updated” and “updating” would all be stemmed to “updat”. Word stemming can be done with the snowball stemmer, which requires packages Snowball, RWeka, rJava and RWekajars. After that, we can complete the stems to their original forms, i.e., “update” for the above example, so that the words would look normal. This can be achieved with function stemCompletion().
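As a rough illustration of stem completion, the base R sketch below completes each stem to the most frequent dictionary word that starts with it. This is only similar in spirit to what stemCompletion() does; it is not the tm implementation, and the dictionary, the stems and the helper name complete_stem are invented for the example.

```r
# Toy dictionary of unstemmed words (invented for this sketch).
dictionary <- c("mining", "mining", "miner", "update", "updated", "update", "package")

# Complete a stem to the most frequent dictionary word that starts with it.
complete_stem <- function(stem, dictionary) {
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) return(stem)   # no match: keep the stem as-is
  counts <- table(candidates)
  names(counts)[which.max(counts)]            # most frequent candidate
}

stems <- c("mine", "updat", "packag", "hadoop")
completed <- vapply(stems, complete_stem, character(1),
                    dictionary = dictionary, USE.NAMES = FALSE)
print(completed)
```

In this toy run the stem "mine" is completed to "miner" even though "mining" is more frequent in the dictionary, because "mining" does not start with "mine". Prefix-based completion can therefore pick a word the author did not write.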
  • 11. Stemming and Stem Completion
    myCorpus <- tm_map(myCorpus, stemDocument) # stem words
    writeLines(strwrap(myCorpus[[190]]$content, 60))
    ## r refer card data mine now provid link packag cran packag
    ## mapreduc hadoop ad
    stemCompletion2 <- function(x, dictionary) {
      x <- unlist(strsplit(as.character(x), " "))
      x <- x[x != ""]
      x <- stemCompletion(x, dictionary=dictionary)
      x <- paste(x, sep="", collapse=" ")
      PlainTextDocument(stripWhitespace(x))
    }
    myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
    myCorpus <- Corpus(VectorSource(myCorpus))
    writeLines(strwrap(myCorpus[[190]]$content, 60))
    ## r reference card data miner now provided link package cran
    ## package mapreduce hadoop add
  • 12. Issues in Stem Completion: "Miner" vs "Mining"
    # count word frequency
    wordFreq <- function(corpus, word) {
      # "\\<" anchors the pattern at the start of a word
      results <- lapply(corpus, function(x) { grep(as.character(x), pattern=paste0("\\<", word)) })
      sum(unlist(results))
    }
    n.miner <- wordFreq(myCorpusCopy, "miner")
    n.mining <- wordFreq(myCorpusCopy, "mining")
    cat(n.miner, n.mining)
    ## 9 104
    # replace oldword with newword
    replaceWord <- function(corpus, oldword, newword) {
      tm_map(corpus, content_transformer(gsub), pattern=oldword, replacement=newword)
    }
    myCorpus <- replaceWord(myCorpus, "miner", "mining")
    myCorpus <- replaceWord(myCorpus, "universidad", "university")
    myCorpus <- replaceWord(myCorpus, "scienc", "science")
  • 13. BUILDING A TERM-DOCUMENT MATRIX • A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document. • Alternatively, one can build a document-term matrix by swapping rows and columns. In this section, we build a term-document matrix from the processed corpus above with function TermDocumentMatrix(). With its default setting, terms with fewer than three characters are discarded. To keep "r" in the matrix, we set the range of wordLengths in the example below.
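The structure of a term-document matrix can be sketched in base R on a few toy documents. This is an illustration under stated assumptions rather than what TermDocumentMatrix() does internally: the documents and the names docs, tokens and tdm_toy are invented, and the text is assumed to be already cleaned and space-separated.

```r
# Three toy documents (invented for this sketch), already cleaned.
docs <- c(d1 = "r data mining", d2 = "big data", d3 = "r r package")

# Split each document into terms and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, entries are occurrence counts.
tdm_toy <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(tdm_toy) <- terms
print(tdm_toy)

# The document-term matrix is the transpose.
dtm_toy <- t(tdm_toy)
```

The one-character term "r" is kept here because nothing filters by word length, which corresponds to setting the lower bound of wordLengths to 1.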
  • 14. Build Term Document Matrix
    tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
    tdm
    ## <<TermDocumentMatrix (terms: 1073, documents: 448)>>
    ## Non-/sparse entries: 3594/477110
    ## Sparsity           : 99%
    ## Maximal term length: 23
    ## Weighting          : term frequency (tf)
    idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))
    as.matrix(tdm[idx, 21:30])
    ##         Docs
    ## Terms    21 22 23 24 25 26 27 28 29 30
    ##   data    0  1  0  0  1  0  0  0  0  1
    ##   mining  0  0  0  0  1  0  0  0  0  1
    ##   r       1  1  1  1  0  1  0  1  1  1
  • 15. Top Frequent Terms
    We have a look at the popular words and the association between words. Note that there are 154 tweets in total.
    # inspect frequent words
    (freq.terms <- findFreqTerms(tdm, lowfreq = 20))
    ##  [1] "analysing"    "analytics"    "australia"    "big"
    ##  [5] "canberra"     "course"       "data"         "example"
    ##  [9] "group"        "introduction" "learn"        "mining"
    ## [13] "network"      "package"      "position"     "r"
    ## [17] "rdatamining"  "research"     "science"      "slide"
    ## [21] "talk"         "text"         "tutorial"     "university"
    term.freq <- rowSums(as.matrix(tdm))
    term.freq <- subset(term.freq, term.freq >= 20)
    df <- data.frame(term = names(term.freq), freq = term.freq)
  • 16.
    library(ggplot2)
    ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
      xlab("Terms") + ylab("Count") + coord_flip() +
      theme(axis.text=element_text(size=7))
    [Figure: horizontal bar chart of the frequent terms; axes: Terms, Count]
  • 17. WORDCLOUD • After building a term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud), which can be easily produced with package wordcloud [Fellows, 2012]. • In the code below, we first convert the term-document matrix to a normal matrix, and then calculate word frequencies. • After that, we set gray levels based on word frequency and use wordcloud() to plot them. With wordcloud(), the first two parameters give a list of words and their frequencies. Words with frequency below three are not plotted, as specified by min.freq=3. By setting random.order=F, frequent words are plotted first, which makes them appear in the center of the cloud. We also set the colors to gray levels based on frequency. A colorful cloud can be generated by setting colors with rainbow().
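The frequency and color preparation described here can be sketched in base R without plotting anything. The matrix values below are invented, and the gray-level formula is one possible choice made for this sketch (darker for more frequent words), not necessarily the one used by the authors.

```r
# Toy term-document matrix (counts invented for this sketch).
m <- matrix(c(5, 4, 0,
              1, 1, 0,
              2, 3, 3,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("data", "poll", "r", "card"), paste0("d", 1:3)))

# Word frequencies, most frequent first.
word_freq <- sort(rowSums(m), decreasing = TRUE)

# Keep only the words that min.freq = 3 would plot.
plotted <- word_freq[word_freq >= 3]

# One gray level per word: the more frequent the word, the darker the shade.
gray_levels <- gray(1 - plotted / (max(plotted) + 1))
print(data.frame(word = names(plotted), freq = plotted, colour = gray_levels))
```

The vectors names(plotted), plotted and gray_levels are the kind of inputs that would be passed as the words, freq and colors arguments of wordcloud().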
  • 18. WordCloud
    library(wordcloud) # also loads RColorBrewer, which provides brewer.pal()
    m <- as.matrix(tdm)
    # calculate the frequency of words and sort it by frequency
    word.freq <- sort(rowSums(m), decreasing = T)
    # colors
    pal <- brewer.pal(9, "BuGn")[-(1:4)]
    # plot word cloud
    wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = F, colors = pal)
  • 19. [Figure: word cloud of the terms in the tweets]
  • 20. ASSOCIATIONS
    # which words are associated with 'r'?
    findAssocs(tdm, "r", 0.2)
    ##             r
    ## code     0.27
    ## example  0.21
    ## series   0.21
    ## markdown 0.20
    ## user     0.20
    # which words are associated with 'data'?
    findAssocs(tdm, "data", 0.2)
    ##           data
    ## mining    0.48
    ## big       0.44
    ## analytics 0.31
    ## science   0.29
    ## poll      0.24
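The association scores above are correlations between the rows of the term-document matrix. The base R sketch below reproduces that idea on a toy matrix; the counts and the helper name assoc are invented for illustration, and the helper is a simplified stand-in rather than the findAssocs() implementation.

```r
# Toy term-document matrix (counts invented for this sketch).
m <- matrix(c(1, 0, 2, 0, 1,
              1, 0, 1, 0, 1,
              0, 1, 0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("data", "mining", "poll"), paste0("d", 1:5)))

# Hypothetical helper: correlate one term's row with every other row
# and keep the terms whose correlation reaches the lower limit.
assoc <- function(m, term, corlimit) {
  others <- setdiff(rownames(m), term)
  r <- vapply(others, function(w) cor(m[term, ], m[w, ]), numeric(1))
  r <- r[r >= corlimit]
  round(sort(r, decreasing = TRUE), 2)
}

result <- assoc(m, "data", 0.2)
print(result)
```

In this toy matrix "mining" appears in the same documents as "data" and is returned, while "poll" appears only where "data" is absent, so its correlation is negative and it is filtered out.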
  • 21. Network of Terms
    library(graph)
    library(Rgraphviz)
    plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)
    [Figure: network graph linking the frequent terms]
  • 22. R packages
    Twitter data extraction: twitteR
    Text cleaning and mining: tm
    Word cloud: wordcloud
    Social network analysis: igraph, sna
    Visualisation: wordcloud, Rgraphviz, ggplot2
  • 23. Twitter Data Extraction – Package twitteR
    userTimeline, homeTimeline, mentions, retweetsOfMe: retrieve various timelines
    getUser, lookupUsers: get information on Twitter user(s)
    getFollowers, getFollowerIDs: retrieve followers (or their IDs)
    getFriends, getFriendIDs: return a list of Twitter users (or user IDs) that a user follows
    retweets, retweeters: return retweets or users who retweeted a tweet
    searchTwitter: issue a search of Twitter
    getCurRateLimitInfo: retrieve current rate limit information
    twListToDF: convert into data.frame
    https://cran.r-project.org/package=twitteR
  • 24. Text Mining – Package tm
    removeNumbers, removePunctuation, removeWords, stripWhitespace: remove numbers, punctuation, words or extra whitespace
    removeSparseTerms: remove sparse terms from a term-document matrix
    stopwords: various kinds of stopwords
    stemDocument, stemCompletion: stem words and complete stems
    TermDocumentMatrix, DocumentTermMatrix: build a term-document matrix or a document-term matrix
    termFreq: generate a term frequency vector
    findFreqTerms, findAssocs: find frequent terms or associations of terms
    weightBin, weightTf, weightTfIdf, weightSMART, WeightFunction: various ways to weight a term-document matrix
  • 25. Social Network Analysis and Visualization – Package igraph
    degree, betweenness, closeness, transitivity: various centrality scores and the clustering coefficient (transitivity)
    neighborhood: neighborhood of graph vertices
    cliques, largest.cliques, maximal.cliques, clique.number: find cliques, i.e. complete subgraphs
    clusters, no.clusters: maximal connected components of a graph and the number of them
    fastgreedy.community, spinglass.community: community detection
    cohesive.blocks: calculate cohesive blocks
    induced.subgraph: create a subgraph of a graph (igraph)
    read.graph, write.graph: read and write graphs from and to files of various formats
    https://cran.r-project.org/package=igraph
  • 26. References
    Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. http://www.rdatamining.com/docs/RDataMining-book.pdf
    Yanchang Zhao and Yonghua Cen (Eds.). Data Mining Applications with R. ISBN 978-0124115118, December 2013. Academic Press, Elsevier.
    Yanchang Zhao. Analysing Twitter Data with Text Mining and Social Network Analysis. In Proc. of the 11th Australasian Data Mining Analytics Conference (AusDM 2013), Canberra, Australia, November 13-15, 2013.
  • 27. Online Resources
    Chapter 10 – Text Mining & Chapter 11 – Social Network Analysis, in book R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining.pdf
    RDataMining Reference Card http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
    Online documents, books and tutorials http://www.rdatamining.com/resources/onlinedocs
    Free online courses http://www.rdatamining.com/resources/courses
    RDataMining Group on LinkedIn (22,000+ members) http://group.rdatamining.com
    Twitter (2,700+ followers) @RDataMining