Text Mining with R ∗
Yanchang Zhao
http://www.RDataMining.com
R and Data Mining Course
Beijing University of Posts and Telecommunications,
Beijing, China
July 2019
∗ Chapter 10: Text Mining, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
Contents
Text Mining
Concept
Tasks
Twitter Data Analysis with R
Twitter
Extracting Tweets
Text Cleaning
Frequent Words and Word Cloud
Word Associations
Clustering
Topic Modelling
Sentiment Analysis
Follower Analysis
Retweeting Analysis
R Packages
Wrap Up
Further Readings and Online Resources
Text Data
Text documents in a natural language
Unstructured
Documents in plain text, Word or PDF format
Emails, online chat logs and phone transcripts
Online news and forums, blogs, micro-blogs and social media
. . .
Typical Process of Text Mining
1. Transform text into structured data
Term-Document Matrix (TDM)
Entities and relations
. . .
2. Apply traditional data mining techniques to the above
structured data
Clustering
Classification
Social Network Analysis (SNA)
. . .
Typical Process of Text Mining (cont.)
Term-Document Matrix (TDM)
Its transpose is known as the Document-Term Matrix (DTM)
A 2D matrix
Rows: terms or words
Columns: documents
Entry $m_{i,j}$: number of occurrences of term $t_i$ in document $d_j$
Term weighting schemes: Term Frequency, Binary Weight,
TF-IDF, etc.
TF-IDF
Term Frequency (TF) $tf_{i,j}$: the number of occurrences of
term $t_i$ in document $d_j$
Inverse Document Frequency (IDF) for term $t_i$ is:
$$idf_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|} \qquad (1)$$
$|D|$: the total number of documents
$|\{d \mid t_i \in d\}|$: the number of documents where term $t_i$ appears
Term Frequency - Inverse Document Frequency (TF-IDF)
$$tfidf_{i,j} = tf_{i,j} \cdot idf_i \qquad (2)$$
IDF reduces the weight of terms that occur frequently in
documents and increases the weight of terms that occur rarely.
An Example of TDM
Doc1: I like R.
Doc2: I like Python.
Term Frequency
TF-IDF
IDF
Terms that can distinguish
different documents are given
greater weights.
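The three tables for Doc1 and Doc2 can be recomputed by hand in base R. The following is a minimal sketch, assuming lower-cased, whitespace-separated tokens; the variable names (docs, tokens, terms, tf, idf, tfidf) are illustrative and are not part of the tm package.

```r
# The two example documents from the slide
docs <- c(Doc1 = "I like R", Doc2 = "I like Python")

# Assumption: lower-case the text and split it on spaces
tokens <- lapply(docs, function(d) strsplit(tolower(d), " +")[[1]])
terms <- sort(unique(unlist(tokens)))

# Term-document matrix: rows are terms, columns are documents
tf <- sapply(tokens, function(tk) {
  vapply(terms, function(w) sum(tk == w), numeric(1))
})

# IDF as in equation (1): log2(|D| / number of documents containing the term)
idf <- log2(ncol(tf) / rowSums(tf > 0))

# TF-IDF as in equation (2); idf is recycled down each column (one value per term)
tfidf <- tf * idf

print(tf)
print(idf)
print(tfidf)
```

Terms shared by both documents (i, like) get an IDF of 0 and therefore a TF-IDF of 0, while r and python keep a weight of 1 in the one document that contains them.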
An Example of TDM (cont.)
Doc1: I like R.
Doc2: I like Python.
Term Frequency
Normalized Term Frequency
IDF
Normalized TF-IDF
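The normalized variants can be sketched the same way. This assumes that normalization means dividing each term frequency by the total number of terms in its document, which is how weightTfIdf(normalize=TRUE) in tm is understood to behave; the matrix below is typed in by hand for the two example documents.

```r
# Term frequencies for Doc1 = "I like R" and Doc2 = "I like Python"
tf <- matrix(c(1, 1, 0, 1,    # Doc1: i, like, python, r
               1, 1, 1, 0),   # Doc2: i, like, python, r
             nrow = 4,
             dimnames = list(Terms = c("i", "like", "python", "r"),
                             Docs = c("Doc1", "Doc2")))

# Normalized TF: divide every column by the document length (assumption)
ntf <- sweep(tf, 2, colSums(tf), "/")

# IDF with base-2 logarithm, then normalized TF-IDF
idf <- log2(ncol(tf) / rowSums(tf > 0))
ntfidf <- ntf * idf

print(round(ntf, 3))
print(round(ntfidf, 3))
```

Each document has three terms, so every non-zero normalized term frequency is 1/3, and only r (in Doc1) and python (in Doc2) keep a non-zero normalized TF-IDF.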
An Example of Term Weighting in R
## term weighting
library(magrittr)
library(tm) ## package for text mining
a <- c("I like R", "I like Python")
## build corpus
b <- a %>% VectorSource() %>% Corpus()
## build term document matrix
m <- b %>% TermDocumentMatrix(control=list(wordLengths=c(1, Inf)))
m %>% inspect()
## various term weighting schemes
m %>% weightBin() %>% inspect() ## binary weighting
m %>% weightTf() %>% inspect() ## term frequency
m %>% weightTfIdf(normalize=FALSE) %>% inspect() ## TF-IDF
m %>% weightTfIdf(normalize=TRUE) %>% inspect() ## normalized TF-IDF
More options provided in package tm:
weightSMART
WeightFunction
Text Mining Tasks
Text classification
Text clustering and categorization
Topic modelling
Sentiment analysis
Document summarization
Entity and relation extraction
. . .
Topic Modelling
To identify topics in a set of documents
It groups both documents that use similar words and words
that occur in a similar set of documents.
Intuition: Documents related to R would contain more words
like R, ggplot2, plyr, stringr, knitr and other R packages, than
Python related keywords like Python, NumPy, SciPy,
Matplotlib, etc.
A document can be of multiple topics in different proportions.
For instance, a document can be 90% about R and 10%
about Python. ⇒ soft/fuzzy clustering
Latent Dirichlet Allocation (LDA): the most widely used topic
model
Sentiment Analysis
Also known as opinion mining
To determine attitude, polarity or emotions from documents
Polarity: positive, negative, neutral
Emotions: angry, sad, happy, bored, afraid, etc.
Method:
1. identify individual words and phrases and map them to different
emotional scales
2. adjust the sentiment value of a concept based on modifications
surrounding it
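The two steps of the method can be illustrated with a toy lexicon-based scorer in base R. This is only a sketch under stated assumptions: the lexicon, the negator list and the function score_sentence are hypothetical, and this is not how the sentiment140 package used later in these slides works.

```r
# Hypothetical word-level sentiment scale (step 1)
lexicon <- c(good = 1, great = 2, bad = -1, terrible = -2)
# Hypothetical modifiers that flip the polarity of the next word (step 2)
negators <- c("not", "never", "no")

score_sentence <- function(sentence) {
  # keep letters and spaces only, lower-case, split into words
  cleaned <- tolower(gsub("[^[:alpha:][:space:]]", "", sentence))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words <- words[words != ""]
  score <- 0
  for (i in seq_along(words)) {
    value <- lexicon[words[i]]
    if (is.na(value)) next                  # word is not in the lexicon
    if (i > 1 && words[i - 1] %in% negators) {
      value <- -value                       # adjust for a preceding negator
    }
    score <- score + value
  }
  unname(score)
}

score_sentence("R is great")
score_sentence("This is not good")
score_sentence("I like Python")
```

A positive total suggests positive polarity, a negative total negative polarity, and zero is treated as neutral in this sketch.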
Document Summarization
To create a summary with major points of the original
document
Approaches
Extraction: select a subset of existing words, phrases or
sentences to build a summary
Abstraction: use natural language generation techniques to
build a summary that is similar to natural language
Entity and Relationship Extraction
Named Entity Recognition (NER): identify named entities in
text into pre-defined categories, such as person names,
organizations, locations, date and time, etc.
Relationship Extraction: identify associations among entities
Example:
Ben lives at 5 George St, Sydney.
Entities identified: Ben; 5 George St, Sydney
Twitter
An online social networking service that enables users to send
and read short 280-character (used to be 140 before
November 2017) messages called “tweets” (Wikipedia)
Over 300 million monthly active users (as of 2018)
Creating over 500 million tweets per day
RDataMining Twitter Account
Process†
1. Extract tweets and followers from the Twitter website with R
and the twitteR package
2. With the tm package, clean text by removing punctuation,
numbers, hyperlinks and stop words, followed by stemming
and stem completion
3. Build a term-document matrix
4. Cluster Tweets with text clustering
5. Analyse topics with the topicmodels package
6. Analyse sentiment with the sentiment140 package
7. Analyse following/followed and retweeting relationships with
the igraph package
† More details in the paper titled Analysing Twitter Data with Text Mining and
Social Network Analysis [Zhao, 2013].
Retrieve Tweets
## Option 1: retrieve tweets from Twitter
library(twitteR)
library(ROAuth)
## Twitter authentication
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_
## 3200 is the maximum to retrieve
tweets <- "RDataMining" %>% userTimeline(n = 3200)
See details of Twitter Authentication with OAuth in Section 3 of
http://geoffjentry.hexdump.org/twitteR.pdf.
## Option 2: download @RDataMining tweets from RDataMining.com
library(twitteR)
url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
download.file(url, destfile = "./data/RDataMining-Tweets-20160212.rds")
## load tweets into R
tweets <- readRDS("./data/RDataMining-Tweets-20160212.rds")
(n.tweet <- tweets %>% length())
## [1] 448
# convert tweets to a data frame
tweets.df <- tweets %>% twListToDF()
# tweet #1
tweets.df[1, c("id", "created", "screenName", "replyToSN",
"favoriteCount", "retweetCount", "longitude", "latitude", "text")]
## id created screenName replyToSN
## 1 697031245503418368 2016-02-09 12:16:13 RDataMining <NA>
## favoriteCount retweetCount longitude latitude
## 1 13 14 NA NA
## ...
## 1 A Twitter dataset for text mining: @RDataMining Tweets ex...
# print tweet #1 and make text fit for slide width
tweets.df$text[1] %>% strwrap(60) %>% writeLines()
## A Twitter dataset for text mining: @RDataMining Tweets
## extracted on 3 February 2016. Download it at
## https://t.co/lQp94IvfPf
Text Cleaning Functions
Convert to lower case: tolower
Remove punctuation: removePunctuation
Remove numbers: removeNumbers
Remove URLs
Remove stop words (like ’a’, ’the’, ’in’): removeWords,
stopwords
Remove extra white space: stripWhitespace
## text cleaning
library(tm)
# function for removing URLs, i.e.,
# "http" followed by any non-space letters
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
# function for removing anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
# customize stop words
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
See details of regular expressions by running ?regex in R console.
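The two regular-expression helpers can be tried out without the tm package. The sketch below applies them to the first tweet shown earlier; removeURL and removeNumPunct are copied from the slide, while squish is a hypothetical base-R stand-in for stripWhitespace, and stop word removal is left out.

```r
# Helper functions as defined on the slide
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)

# Hypothetical stand-in for tm's stripWhitespace: collapse runs of
# whitespace and trim both ends
squish <- function(x) trimws(gsub("[[:space:]]+", " ", x))

tweet <- paste("A Twitter dataset for text mining: @RDataMining Tweets",
               "extracted on 3 February 2016. Download it at",
               "https://t.co/lQp94IvfPf")

# Same order as in the cleaning pipeline: lower case, URLs, numbers and
# punctuation, extra whitespace
cleaned <- squish(removeNumPunct(removeURL(tolower(tweet))))
print(cleaned)
```

URLs have to be removed before punctuation; otherwise the URL loses its dots and slashes and no longer matches the "http" pattern up to the next space as a single token.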
Text Cleaning
# build a corpus and specify the source to be character vectors
corpus.raw <- tweets.df$text %>% VectorSource() %>% Corpus()
# text cleaning
corpus.cleaned <- corpus.raw %>%
# convert to lower case
tm_map(content_transformer(tolower)) %>%
# remove URLs
tm_map(content_transformer(removeURL)) %>%
# remove numbers and punctuations
tm_map(content_transformer(removeNumPunct)) %>%
# remove stopwords
tm_map(removeWords, myStopwords) %>%
# remove extra whitespace
tm_map(stripWhitespace)
Stemming and Stem Completion ‡
## stem words
corpus.stemmed <- corpus.cleaned %>% tm_map(stemDocument)
## stem completion
stemCompletion2 <- function(x, dictionary) {
x <- unlist(strsplit(as.character(x), " "))
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
stripWhitespace(x)
}
corpus.completed <- corpus.stemmed %>%
lapply(stemCompletion2, dictionary=corpus.cleaned) %>%
VectorSource() %>% Corpus()
‡ http://stackoverflow.com/questions/25206049/stemcompletion-is-not-working
Before/After Text Cleaning and Stemming
## compare text before/after cleaning
# original text
corpus.raw[[1]]$content %>% strwrap(60) %>% writeLines()
## A Twitter dataset for text mining: @RDataMining Tweets
## extracted on 3 February 2016. Download it at
## https://t.co/lQp94IvfPf
# after basic cleaning
corpus.cleaned[[1]]$content %>% strwrap(60) %>% writeLines()
## twitter dataset text mining rdatamining tweets extracted
## february download
# stemmed text
corpus.stemmed[[1]]$content %>% strwrap(60) %>% writeLines()
## twitter dataset text mine rdatamin tweet extract februari
## download
# after stem completion
corpus.completed[[1]]$content %>% strwrap(60) %>% writeLines()
## twitter dataset text miner rdatamining tweet extract
## download
Issues in Stem Completion: “Miner” vs “Mining”
# count word frequency (number of documents containing a word that starts with `word`)
wordFreq <- function(corpus, word) {
results <- lapply(corpus,
function(x) grep(as.character(x), pattern=paste0("\\<", word)) )
sum(unlist(results))
}
n.miner <- corpus.cleaned %>% wordFreq("miner")
n.mining <- corpus.cleaned %>% wordFreq("mining")
cat(n.miner, n.mining)
## 9 104
# replace old word with new word
replaceWord <- function(corpus, oldword, newword) {
tm_map(corpus, content_transformer(gsub),
pattern=oldword, replacement=newword)
}
corpus.completed <- corpus.completed %>%
replaceWord("miner", "mining") %>%
replaceWord("universidad", "university") %>%
replaceWord("scienc", "science")
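The word-start anchor used in wordFreq matters for the counts. A small base-R sketch, using three made-up documents, shows the difference between a plain substring match and a match anchored with "\\<":

```r
# Made-up documents for illustration only
docs <- c("text mining with r", "a data miner", "determining the topic")

# Plain substring match: also hits "determining", which ends in "mining"
plain <- grepl("mining", docs)

# "\\<" matches the empty string at the start of a word, so only
# words that begin with "mining" are counted
anchored <- grepl("\\<mining", docs)

print(plain)
print(anchored)
sum(anchored)   # number of documents containing a word starting with "mining"
```

Note that the pattern is only anchored on the left, so "\\<miner" would still match longer words such as "miners".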
Build Term Document Matrix
## Build Term Document Matrix
tdm <- corpus.completed %>%
TermDocumentMatrix(control = list(wordLengths = c(1, Inf))) %>%
print
## <<TermDocumentMatrix (terms: 1073, documents: 448)>>
## Non-/sparse entries: 3594/477110
## Sparsity : 99%
## Maximal term length: 23
## Weighting : term frequency (tf)
idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))
tdm[idx, 21:30] %>% as.matrix()
##         Docs
## Terms    21 22 23 24 25 26 27 28 29 30
##   mining  0  0  0  0  1  0  0  0  0  1
##   data    0  1  0  0  1  0  0  0  0  1
##   r       1  1  1  1  0  1  0  1  1  1
Top Frequent Terms
# inspect frequent words
freq.terms <- tdm %>% findFreqTerms(lowfreq = 20) %>% print
## [1] "mining" "rdatamining" "text" "analytics"
## [5] "australia" "data" "canberra" "group"
## [9] "university" "science" "slide" "tutorial"
## [13] "big" "learn" "package" "r"
## [17] "network" "course" "introduction" "talk"
## [21] "analysing" "research" "position" "example"
term.freq <- tdm %>% as.matrix() %>% rowSums()
term.freq <- term.freq %>% subset(term.freq >= 20)
df <- data.frame(term = names(term.freq), freq = term.freq)
## plot frequent words
library(ggplot2)
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip() +
theme(axis.text=element_text(size=7))
[Figure: bar chart of the frequent terms listed above; axes: Terms and Count (0 to 200)]
Wordcloud
## word cloud
m <- tdm %>% as.matrix
# calculate the frequency of words and sort it by frequency
word.freq <- m %>% rowSums() %>% sort(decreasing = T)
# colors
library(RColorBrewer)
pal <- brewer.pal(9, "BuGn")[-(1:4)]
# plot word cloud
library(wordcloud)
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,
random.order = F, colors = pal)
[Figure: word cloud of terms with a frequency of at least 3]
Associations
# which words are associated with 'r'?
tdm %>% findAssocs("r", 0.2)
## $r
## code example series user markdown
## 0.27 0.21 0.21 0.20 0.20
# which words are associated with 'data'?
tdm %>% findAssocs("data", 0.2)
## $data
## mining big analytics science poll
## 0.48 0.44 0.31 0.29 0.24
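Term associations of this kind are based on correlations between the rows of the term-document matrix. As a rough sketch of the idea (not tm's implementation), the base-R code below computes the correlation of "data" with the other terms, using only the ten-document excerpt of the matrix shown earlier (documents 21 to 30); the values therefore differ from the full-corpus results above.

```r
# Excerpt of the term-document matrix (documents 21 to 30) from the slides
tdm_part <- rbind(
  mining = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1),
  data   = c(0, 1, 0, 0, 1, 0, 0, 0, 0, 1),
  r      = c(1, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)
colnames(tdm_part) <- 21:30

# Pearson correlation between term vectors (terms must be in columns for cor)
assoc <- cor(t(tdm_part))["data", ]
assoc <- assoc[names(assoc) != "data"]

# Keep terms whose correlation with "data" reaches the lower limit
strong <- assoc[assoc >= 0.2]
print(round(assoc, 2))
print(round(strong, 2))
```

In this excerpt only "mining" passes the 0.2 limit; "r" is negatively correlated with "data" because it mostly appears in the documents where "data" does not.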
Network of Terms
## network of terms
library(graph)
library(Rgraphviz)
plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)
[Figure: network of the frequent terms, drawn with a correlation threshold of 0.1]
Hierarchical Clustering of Terms
## clustering of terms
# remove sparse terms
m2 <- tdm %>% removeSparseTerms(sparse = 0.95) %>% as.matrix()
# calculate distance matrix
dist.matrix <- m2 %>% scale() %>% dist()
# hierarchical clustering ("ward" was renamed to "ward.D" in R 3.1.0)
fit <- dist.matrix %>% hclust(method = "ward.D")
plot(fit)
fit %>% rect.hclust(k = 6) # cut tree into 6 clusters
groups <- fit %>% cutree(k = 6)
[Figure: Cluster Dendrogram of the terms, hclust (*, "ward.D"); y-axis: Height]
## k-means clustering of documents
m3 <- m2 %>% t() # transpose the matrix to cluster documents (tweets)
set.seed(122) # set a fixed random seed to make the result reproducible
k <- 6 # number of clusters
kmeansResult <- kmeans(m3, k)
round(kmeansResult$centers, digits = 3) # cluster centers
##   mining analytics australia  data canberra university slide
## 1  0.435     0.000     0.000 0.217    0.000      0.000 0.087
## 2  1.128     0.154     0.000 1.333    0.026      0.051 0.179
## 3  0.055     0.018     0.009 0.164    0.027      0.009 0.227
## 4  0.083     0.014     0.056 0.000    0.035      0.097 0.090
## 5  0.412     0.206     0.098 1.196    0.137      0.039 0.078
## 6  0.167     0.133     0.133 0.567    0.033      0.233 0.000
##   tutorial   big package     r network analysing research
## 1    0.043 0.000   0.043 1.130   0.087     0.174    0.000
## 2    0.026 0.077   0.282 1.103   0.000     0.051    0.000
## 3    0.064 0.018   0.109 1.127   0.045     0.109    0.000
## 4    0.056 0.007   0.090 0.000   0.090     0.111    0.000
## 5    0.059 0.333   0.010 0.020   0.020     0.059    0.020
## 6    0.000 0.167   0.033 0.000   0.067     0.100    1.233
##   position example
## 1    0.000   1.043
## 2    0.000   0.026
## 3    0.000   0.000
for (i in 1:k) {
cat(paste("cluster ", i, ": ", sep = ""))
s <- sort(kmeansResult$centers[i, ], decreasing = T)
cat(names(s)[1:5], "\n")
# print the tweets of every cluster
# print(tweets[which(kmeansResult$cluster==i)])
}
## cluster 1: r example mining data analysing
## cluster 2: data mining r package slide
## cluster 3: r slide data package analysing
## cluster 4: analysing university slide package network
## cluster 5: data mining big analytics canberra
## cluster 6: research data position university mining
Topic Modelling
dtm <- tdm %>% as.DocumentTermMatrix()
library(topicmodels)
lda <- LDA(dtm, k = 8) # find 8 topics
term <- terms(lda, 7) # first 7 terms of every topic
term <- apply(term, MARGIN = 2, paste, collapse = ", ") %>% print
## ...
## "data, big, mining, r, research, group,...
## ...
## "analysing, network, r, canberra, data, social,...
## ...
## "r, talk, slide, series, learn, rdatamini...
## ...
## "r, data, course, introduction, free, online...
## ...
## "data, mining, r, application, book, dataset, a...
## ...
## "r, package, example, useful, program, sli...
## ...
## "data, university, analytics, mining, position, research, s...
## ...
## "australia, data, ausdm, submission, workshop, mining...
Topic Modelling
rdm.topics <- topics(lda) # 1st topic identified for every document (twe
library(data.table) # provides as.IDate
rdm.topics <- data.frame(date=as.IDate(tweets.df$created),
topic=rdm.topics)
ggplot(rdm.topics, aes(date, fill = term[topic])) +
geom_density(position = "stack")
[Figure: stacked density plot of topics over time; axes: date (2012 to 2016) and density; fill legend term[topic]:]
analysing, network, r, canberra, data, social, present
australia, data, ausdm, submission, workshop, mining, august
data, big, mining, r, research, group, science
data, mining, r, application, book, dataset, associate
data, university, analytics, mining, position, research, scientist
r, data, course, introduction, free, online, mining
r, package, example, useful, program, slide, code
r, talk, slide, series, learn, rdatamining, time
Another way to plot a stream graph:
http://menugget.blogspot.com.au/2013/12/data-mountains-and-streams-stacked-area.html
Sentiment Analysis
## sentiment analysis
# install package sentiment140
require(devtools)
install_github("okugami79/sentiment140")
# sentiment analysis
library(sentiment)
library(data.table) # provides as.IDate
sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)
# sentiment plot
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
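The scoring and aggregation step does not depend on the sentiment140 API and can be tried on its own. The sketch below uses a small made-up data frame in place of the API result (the dates and polarity labels are hypothetical) and base R's as.Date instead of as.IDate.

```r
# Hypothetical stand-in for the output of sentiment(): one row per tweet
sentiments <- data.frame(
  date = as.Date(c("2016-02-01", "2016-02-01", "2016-02-02", "2016-02-03")),
  polarity = c("positive", "neutral", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Map polarity to a numeric score: positive = 1, negative = -1, neutral = 0
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1

# Net sentiment per day: sum of the scores of all tweets on that date
result <- aggregate(score ~ date, data = sentiments, sum)
print(result)
```

The resulting data frame has one row per date, which can then be plotted as a time series of net sentiment.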
Retrieve User Info and Followers
## follower analysis
user <- getUser("RDataMining")
user$toDataFrame()
friends <- user$getFriends() # who this user follows
followers <- user$getFollowers() # this user's followers
followers2 <- followers[[1]]$getFollowers() # a follower's followers
## [,1] ...
## description "R and Data Mining. Group on LinkedIn: ht...
## statusesCount "583" ...
## followersCount "2376" ...
## favoritesCount "6" ...
## friendsCount "72" ...
## url "http://t.co/LwL50uRmPd" ...
## name "Yanchang Zhao" ...
## created "2011-04-04 09:15:43" ...
## protected "FALSE" ...
## verified "FALSE" ...
## screenName "RDataMining" ...
## location "Australia" ...
## lang "en" ...
Follower Map§
[Figure: map of @RDataMining followers (#: 2376)]
§ Based on Jeff Leek’s twitterMap function at
http://biostat.jhsph.edu/~jleek/code/twitterMap.R
Active Influential Followers
[Figure: scatter plot of followers; axes: #followers / #friends and #tweets per day. Labelled accounts: #AI PR Girl, Marcel Molina, Rahul Kapil, David Smith, Zac S., Christopher D. Long, Data Science London, Ryan Rosario, Roby, Learn R, Statistics Blog, Robert Penner, Prof. Diego Kuonen, DataCamp Derecho Internet, Murari Bhartia, Sharon Machlis, Rob J Hyndman, StatsBlogs, Mitch Sanders, Michal Illich, pavel jašek, biao, Daniel D. Gutierrez, Data Mining, M Kautzar Ichramsyah, Yichuan Wang, Prithwis Mukerjee, Antonio Piccolboni, Duccio Schiavon, LearnDataAnalysis, RDataMining]
Top Retweeted Tweets
## retweet analysis
## select top retweeted tweets
table(tweets.df$retweetCount)
selected <- which(tweets.df$retweetCount >= 9)
## plot them
dates <- strptime(tweets.df$created, format="%Y-%m-%d")
plot(x=dates, y=tweets.df$retweetCount, type="l", col="grey",
xlab="Date", ylab="Times retweeted")
colors <- rainbow(10)[1:length(selected)]
points(dates[selected], tweets.df$retweetCount[selected],
pch=19, col=colors)
text(dates[selected], tweets.df$retweetCount[selected],
tweets.df$text[selected], col=colors, cex=.9)
[Figure: Top Retweeted Tweets; x-axis: Date (2012 to 2016), y-axis: Times retweeted (0 to 15); the labelled tweets are:]
Free online course on Computing for Data Analysis (with R), to start on 24 Sept 2012 https://t.co/Y617n30y
Lecture videos of natural language processing course at Stanford University: 18 videos, with each of over 1 hr length http://t.co/VKKdA9Tykm
The R Reference Card for Data Mining now provides links to packages on CRAN. Packages for MapReduce and Hadoop added. http://t.co/RrFypol8kw
Slides in 8 PDF files on Getting Data from the Web with R http://t.co/epT4Jv07WD
Handling and Processing Strings in R −− an ebook in PDF format, 105 pages. http://t.co/UXnetU7k87
A Twitter dataset for text mining: @RDataMining Tweets extracted on 3 February 2016. Download it at https://t.co/lQp94IvfPf
Tracking Message Propagation
tweets[[1]]
retweeters(tweets[[1]]$id)
retweets(tweets[[1]]$id)
## [1] "RDataMining: A Twitter dataset for text mining: @RData...
## [1] "197489286" "316875164" "229796464" "3316009302"
## [5] "244077734" "16900353" "2404767650" "222061895"
## [9] "11686382" "190569306" "49413866" "187048879"
## [13] "6146692" "2591996912"
## [[1]]
## [1] "bobaiKato: RT @RDataMining: A Twitter dataset for text...
##
## [[2]]
## [1] "VipulMathur: RT @RDataMining: A Twitter dataset for te...
##
## [[3]]
## [1] "tau_phoenix: RT @RDataMining: A Twitter dataset for te...
The tweet potentially reached around 120,000 users.
R Packages
Twitter data extraction: twitteR
Text cleaning and mining: tm
Word cloud: wordcloud
Topic modelling: topicmodels, lda
Sentiment analysis: sentiment140
Social network analysis: igraph, sna
Visualisation: wordcloud, Rgraphviz, ggplot2
Twitter Data Extraction – Package twitteR ¶
userTimeline, homeTimeline, mentions,
retweetsOfMe: retrieve various timelines
getUser, lookupUsers: get information of Twitter user(s)
getFollowers, getFollowerIDs: retrieve followers (or
their IDs)
getFriends, getFriendIDs: return a list of Twitter users
(or user IDs) that a user follows
retweets, retweeters: return retweets or users who
retweeted a tweet
searchTwitter: issue a search of Twitter
getCurRateLimitInfo: retrieve current rate limit
information
twListToDF: convert into data.frame
¶ https://cran.r-project.org/package=twitteR
Text Mining – Package tm
removeNumbers, removePunctuation, removeWords,
stripWhitespace: remove numbers, punctuation, words or
extra whitespace
removeSparseTerms: remove sparse terms from a
term-document matrix
stopwords: various kinds of stopwords
stemDocument, stemCompletion: stem words and
complete stems
TermDocumentMatrix, DocumentTermMatrix: build a
term-document matrix or a document-term matrix
termFreq: generate a term frequency vector
findFreqTerms, findAssocs: find frequent terms or
associations of terms
weightBin, weightTf, weightTfIdf, weightSMART,
WeightFunction: various ways to weight a term-document
matrix
https://cran.r-project.org/package=tm
Topic Modelling and Sentiment Analysis – Packages
topicmodels & sentiment140
Package topicmodels ∗∗
LDA: build a Latent Dirichlet Allocation (LDA) model
CTM: build a Correlated Topic Model (CTM)
terms: extract the most likely terms for each topic
topics: extract the most likely topics for each document
Package sentiment140 ††
sentiment: sentiment analysis with the sentiment140 API,
tuned for Twitter text analysis
∗∗ https://cran.r-project.org/package=topicmodels
†† https://github.com/okugami79/sentiment140
Social Network Analysis and Visualization – Package
igraph ‡‡
degree, betweenness, closeness, transitivity:
various centrality and clustering scores
neighborhood: neighborhood of graph vertices
cliques, largest.cliques, maximal.cliques,
clique.number: find cliques, i.e., complete subgraphs
clusters, no.clusters: maximal connected components
of a graph and the number of them
fastgreedy.community, spinglass.community:
community detection
cohesive.blocks: calculate cohesive blocks
induced.subgraph: create a subgraph of a graph (igraph)
read.graph, write.graph: read and write graphs from and
to files of various formats
‡‡ https://cran.r-project.org/package=igraph
Wrap Up
Transform unstructured data into structured data (i.e.,
term-document matrix), and then apply traditional data
mining algorithms like clustering and classification
Feature extraction: term frequency, TF-IDF and many others
Text cleaning: lower case, removing numbers, punctuation
and URLs, stop words, stemming and stem completion
Stem completion may not always work as expected.
Documents in languages other than English
Further Readings
Text Mining
https://en.wikipedia.org/wiki/Text_mining
TF-IDF
https://en.wikipedia.org/wiki/Tf–idf
Topic Modelling
https://en.wikipedia.org/wiki/Topic_model
Sentiment Analysis
https://en.wikipedia.org/wiki/Sentiment_analysis
Document Summarization
https://en.wikipedia.org/wiki/Automatic_summarization
Natural Language Processing
https://en.wikipedia.org/wiki/Natural_language_processing
An introduction to text mining by Ian Witten
http://www.cs.waikato.ac.nz/%7Eihw/papers/04-IHW-Textmining.pdf
Online Resources
Book titled R and Data Mining: Examples and Case
Studies [Zhao, 2012]
http://www.rdatamining.com/docs/RDataMining-book.pdf
R Reference Card for Data Mining
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
Free online courses and documents
http://www.rdatamining.com/resources/
RDataMining Group on LinkedIn (27,000+ members)
http://group.rdatamining.com
Twitter (3,300+ followers)
@RDataMining
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
How to Cite This Work
Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
BibTex
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-12-396963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}
References
Zhao, Y. (2012).
R and Data Mining: Examples and Case Studies, ISBN 978-0-12-396963-7.
Academic Press, Elsevier.
Zhao, Y. (2013).
Analysing twitter data with text mining and social network analysis.
In Proc. of the 11th Australasian Data Mining Conference (AusDM 2013), Canberra, Australia.

RDataMining slides-text-mining-with-r

  • 1. Text Mining with R ∗ Yanchang Zhao http://www.RDataMining.com R and Data Mining Course Beijing University of Posts and Telecommunications, Beijing, China July 2019 ∗ Chapter 10: Text Mining, in R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 1 / 61
  • 2. Contents Text Mining Concept Tasks Twitter Data Analysis with R Twitter Extracting Tweets Text Cleaning Frequent Words and Word Cloud Word Associations Clustering Topic Modelling Sentiment Analysis Follower Analysis Retweeting Analysis R Packages Wrap Up Further Readings and Online Resources 2 / 61
  • 3. Text Data Text documents in a natural language Unstructured Documents in plain text, Word or PDF format Emails, online chat logs and phone transcripts Online news and forums, blogs, micro-blogs and social media . . . 3 / 61
  • 4. Typical Process of Text Mining 1. Transform text into structured data Term-Document Matrix (TDM) Entities and relations . . . 2. Apply traditional data mining techniques to the above structured data Clustering Classification Social Network Analysis (SNA) . . . 4 / 61
  • 5. Typical Process of Text Mining (cont.) 5 / 61
  • 6. Term-Document Matrix (TDM) Also known as Document-Term Matrix (DTM) A 2D matrix Rows: terms or words Columns: documents Entry mi,j : number of occurrences of term ti in document dj Term weighting schemes: Term Frequency, Binary Weight, TF-IDF, etc. 6 / 61
  • 7. TF-IDF Term Frequency (TF) tfi,j : the number of occurrences of term ti in document dj Inverse Document Frequency (IDF) for term ti is: idfi = log2 |D| |{d | ti ∈ d}| (1) |D|: the total number of documents |{d | ti ∈ d}|: the number of documents where term ti appears Term Frequency - Inverse Document Frequency (TF-IDF) tfidf = tfi,j · idfi (2) IDF reduces the weight of terms that occur frequently in documents and increases the weight of terms that occur rarely. 7 / 61
  • 8. An Example of TDM Doc1: I like R. Doc2: I like Python. Term Frequency TF-IDF IDF 8 / 61
  • 9. An Example of TDM Doc1: I like R. Doc2: I like Python. Term Frequency TF-IDF IDF 8 / 61 Terms that can distinguish different documents are given greater weights.
  • 10. An Example of TDM (cont.) Doc1: I like R. Doc2: I like Python. Term Frequency Normalized Term Frequency IDF Normalized TF-IDF 9 / 61
  • 11. An Example of Term Weighting in R ## term weighting library(magrittr) library(tm) ## package for text mining a <- c("I like R", "I like Python") ## build corpus b <- a %>% VectorSource() %>% Corpus() ## build term document matrix m <- b %>% TermDocumentMatrix(control=list(wordLengths=c(1, Inf))) m %>% inspect() ## various term weighting schemes m %>% weightBin() %>% inspect() ## binary weighting m %>% weightTf() %>% inspect() ## term frequency m %>% weightTfIdf(normalize=F) %>% inspect() ## TF-IDF m %>% weightTfIdf(normalize=T) %>% inspect() ## normalized TF-IDF More options provided in package tm: weightSMART WeightFunction 10 / 61
  • 12. Text Mining Tasks Text classification Text clustering and categorization Topic modelling Sentiment analysis Document summarization Entity and relation extraction . . . 11 / 61
  • 13. Topic Modelling To identify topics in a set of documents It groups both documents that use similar words and words that occur in a similar set of documents. Intuition: Documents related to R would contain more words like R, ggplot2, plyr, stringr, knitr and other R packages, than Python related keywords like Python, NumPy, SciPy, Matplotlib, etc. A document can be of multiple topics in different proportions. For instance, a document can be 90% about R and 10% about Python. ⇒ soft/fuzzy clustering Latent Dirichlet Allocation (LDA): the most widely used topic model 12 / 61
  • 14. Sentiment Analysis Also known as opinion mining To determine attitude, polarity or emotions from documents Polarity: positive, negative, netural Emotions: angry, sad, happy, bored, afraid, etc. Method: 1. identify invidual words and phrases and map them to different emotional scales 2. adjust the sentiment value of a concept based on modifications surrounding it 13 / 61
  • 15. Document Summarization To create a summary with major points of the orignial document Approaches Extraction: select a subset of existing words, phrases or sentences to build a summary Abstraction: use natural language generation techniques to build a summary that is similar to natural language 14 / 61
  • 16. Entity and Relationship Extraction Named Entity Recognition (NER): identify named entities in text into pre-defined categories, such as person names, organizations, locations, date and time, etc. Relationship Extraction: identify associations among entities Example: Ben lives at 5 Geroge St, Sydney. 15 / 61
  • 17. Entity and Relationship Extraction Named Entity Recognition (NER): identify named entities in text into pre-defined categories, such as person names, organizations, locations, date and time, etc. Relationship Extraction: identify associations among entities Example: Ben lives at 5 Geroge St, Sydney. 15 / 61
  • 18. Entity and Relationship Extraction Named Entity Recognition (NER): identify named entities in text into pre-defined categories, such as person names, organizations, locations, date and time, etc. Relationship Extraction: identify associations among entities Example: Ben lives at 5 Geroge St, Sydney. 15 / 61 Ben 5 Geroge St, Sydney
  • 19. Contents Text Mining Concept Tasks Twitter Data Analysis with R Twitter Extracting Tweets Text Cleaning Frequent Words and Word Cloud Word Associations Clustering Topic Modelling Sentiment Analysis Follower Analysis Retweeting Analysis R Packages Wrap Up Further Readings and Online Resources 16 / 61
  • 20. Twitter An online social networking service that enables users to send and read short 280-character (used to be 140 before November 2017) messages called “tweets” (Wikipedia) Over 300 million monthly active users (as of 2018) Creating over 500 million tweets per day 17 / 61
  • 22. Process† 1. Extract tweets and followers from the Twitter website with R and the twitteR package 2. With the tm package, clean text by removing punctuations, numbers, hyperlinks and stop words, followed by stemming and stem completion 3. Build a term-document matrix 4. Cluster Tweets with text clustering 5. Analyse topics with the topicmodels package 6. Analyse sentiment with the sentiment140 package 7. Analyse following/followed and retweeting relationships with the igraph package † More details in paper titled Analysing Twitter Data with Text Mining and Social Network Analysis [Zhao, 2013]. 19 / 61
  • 23. Retrieve Tweets ## Option 1: retrieve tweets from Twitter library(twitteR) library(ROAuth) ## Twitter authentication setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_ ## 3200 is the maximum to retrieve tweets <- "RDataMining" %>% userTimeline(n = 3200) See details of Twitter Authentication with OAuth in Section 3 of http://geoffjentry.hexdump.org/twitteR.pdf. ## Option 2: download @RDataMining tweets from RDataMining.com library(twitteR) url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds" download.file(url, destfile = "./data/RDataMining-Tweets-20160212.rds") ## load tweets into R tweets <- readRDS("./data/RDataMining-Tweets-20160212.rds") 20 / 61
  • 24. (n.tweet <- tweets %>% length()) ## [1] 448 # convert tweets to a data frame tweets.df <- tweets %>% twListToDF() # tweet #1 tweets.df[1, c("id", "created", "screenName", "replyToSN", "favoriteCount", "retweetCount", "longitude", "latitude", "text")] ## id created screenName replyToSN ## 1 697031245503418368 2016-02-09 12:16:13 RDataMining <NA> ## favoriteCount retweetCount longitude latitude ## 1 13 14 NA NA ## ... ## 1 A Twitter dataset for text mining: @RDataMining Tweets ex... # print tweet #1 and make text fit for slide width tweets.df$text[1] %>% strwrap(60) %>% writeLines() ## A Twitter dataset for text mining: @RDataMining Tweets ## extracted on 3 February 2016. Download it at ## https://t.co/lQp94IvfPf 21 / 61
  • 25. Text Cleaning Functions
    Convert to lower case: tolower
    Remove punctuation: removePunctuation
    Remove numbers: removeNumbers
    Remove URLs
    Remove stop words (like 'a', 'the', 'in'): removeWords, stopwords
    Remove extra white space: stripWhitespace
    ## text cleaning
    library(tm)
    # function for removing URLs, i.e.,
    # "http" followed by any non-space letters
    removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
    # function for removing anything other than English letters or space
    removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
    # customize stop words
    myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
                     "use", "see", "used", "via", "amp")
    See details of regular expressions by running ?regex in the R console.
    22 / 61
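The two helper functions above can be tried without tm or a corpus. The following is a minimal sketch in base R; the sample string is hypothetical (it is not a tweet from the dataset), and the final whitespace clean-up only approximates what stripWhitespace does.

```r
# Helper functions as defined on the slide (base R only)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)

# Hypothetical sample text, not taken from the @RDataMining dataset
x <- "Download it at https://t.co/abc123 on 3 Feb 2016!"

cleaned <- tolower(x)               # convert to lower case
cleaned <- removeURL(cleaned)       # drop the URL first
cleaned <- removeNumPunct(cleaned)  # then drop numbers and punctuation
# collapse runs of whitespace and trim both ends
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
print(cleaned)
```

The order of the steps matters: if punctuation were removed before URLs, the URL would collapse into a letter string such as "httpstcoabc" that removeURL would then delete only by accident of its "http" prefix, and any URL fragment after a space would survive.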
  • 26. Text Cleaning
    # build a corpus and specify the source to be character vectors
    corpus.raw <- tweets.df$text %>% VectorSource() %>% Corpus()
    # text cleaning
    corpus.cleaned <- corpus.raw %>%
      # convert to lower case
      tm_map(content_transformer(tolower)) %>%
      # remove URLs
      tm_map(content_transformer(removeURL)) %>%
      # remove numbers and punctuation
      tm_map(content_transformer(removeNumPunct)) %>%
      # remove stopwords
      tm_map(removeWords, myStopwords) %>%
      # remove extra whitespace
      tm_map(stripWhitespace)
    23 / 61
  • 27. Stemming and Stem Completion ‡
    ## stem words
    corpus.stemmed <- corpus.cleaned %>% tm_map(stemDocument)
    ## stem completion
    stemCompletion2 <- function(x, dictionary) {
      x <- unlist(strsplit(as.character(x), " "))
      x <- x[x != ""]
      x <- stemCompletion(x, dictionary=dictionary)
      x <- paste(x, sep="", collapse=" ")
      stripWhitespace(x)
    }
    corpus.completed <- corpus.stemmed %>%
      lapply(stemCompletion2, dictionary=corpus.cleaned) %>%
      VectorSource() %>% Corpus()
    ‡ http://stackoverflow.com/questions/25206049/stemcompletion-is-not-working
    24 / 61
  • 28. Before/After Text Cleaning and Stemming
    ## compare text before/after cleaning
    # original text
    corpus.raw[[1]]$content %>% strwrap(60) %>% writeLines()
    ## A Twitter dataset for text mining: @RDataMining Tweets
    ## extracted on 3 February 2016. Download it at
    ## https://t.co/lQp94IvfPf
    # after basic cleaning
    corpus.cleaned[[1]]$content %>% strwrap(60) %>% writeLines()
    ## twitter dataset text mining rdatamining tweets extracted
    ## february download
    # stemmed text
    corpus.stemmed[[1]]$content %>% strwrap(60) %>% writeLines()
    ## twitter dataset text mine rdatamin tweet extract februari
    ## download
    # after stem completion
    corpus.completed[[1]]$content %>% strwrap(60) %>% writeLines()
    ## twitter dataset text miner rdatamining tweet extract
    ## download
    25 / 61
  • 29. Issues in Stem Completion: “Miner” vs “Mining”
    # count word frequency
    wordFreq <- function(corpus, word) {
      results <- lapply(corpus,
        # "\\<" anchors the match at the start of a word
        function(x) grep(as.character(x), pattern=paste0("\\<", word)) )
      sum(unlist(results))
    }
    n.miner <- corpus.cleaned %>% wordFreq("miner")
    n.mining <- corpus.cleaned %>% wordFreq("mining")
    cat(n.miner, n.mining)
    ## 9 104
    # replace old word with new word
    replaceWord <- function(corpus, oldword, newword) {
      tm_map(corpus, content_transformer(gsub),
             pattern=oldword, replacement=newword)
    }
    corpus.completed <- corpus.completed %>%
      replaceWord("miner", "mining") %>%
      replaceWord("universidad", "university") %>%
      replaceWord("scienc", "science")
    26 / 61
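The counting idea on this slide (how many documents contain a word that starts with a given string) can be illustrated on plain character vectors. This is a small sketch in base R; the four documents and the helper name countDocs are made up for illustration and are not part of the original code.

```r
# Hypothetical mini-corpus of already cleaned documents
docs <- c("data mining with r",
          "a text miner at work",
          "mining tweets",
          "determining sentiment")

# Count documents containing a word that starts with `word`;
# "\\<" matches the empty string at the beginning of a word
countDocs <- function(docs, word) sum(grepl(paste0("\\<", word), docs))

n.miner  <- countDocs(docs, "miner")
n.mining <- countDocs(docs, "mining")

# Without the anchor, "determining" would also be counted for "mining"
n.unanchored <- sum(grepl("mining", docs))
cat(n.miner, n.mining, n.unanchored, "\n")
```

The word-start anchor is what keeps substrings in the middle of longer words from inflating the count.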
  • 30. Build Term Document Matrix
    ## Build Term Document Matrix
    tdm <- corpus.completed %>%
      TermDocumentMatrix(control = list(wordLengths = c(1, Inf))) %>%
      print
    ## <<TermDocumentMatrix (terms: 1073, documents: 448)>>
    ## Non-/sparse entries: 3594/477110
    ## Sparsity           : 99%
    ## Maximal term length: 23
    ## Weighting          : term frequency (tf)
    idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))
    tdm[idx, 21:30] %>% as.matrix()
    ##         Docs
    ## Terms    21 22 23 24 25 26 27 28 29 30
    ##   mining  0  0  0  0  1  0  0  0  0  1
    ##   data    0  1  0  0  1  0  0  0  0  1
    ##   r       1  1  1  1  0  1  0  1  1  1
    27 / 61
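To make the structure of a term-document matrix concrete, here is a rough base R sketch that builds one by hand for the two example documents used earlier in the deck ("I like R." and "I like Python.", lower-cased with punctuation removed) and then applies the TF-IDF weighting tf · log2(|D| / document frequency). It does not use tm; the variable names are illustrative only.

```r
# The two example documents, already lower-cased and stripped of punctuation
docs <- c(Doc1 = "i like r", Doc2 = "i like python")

# Tokenise on single spaces
tokens <- lapply(docs, function(d) strsplit(d, " ", fixed = TRUE)[[1]])
vocab <- sort(unique(unlist(tokens)))

# Term-document matrix: rows are terms, columns are documents,
# entry [i, j] is the number of occurrences of term i in document j
tdm <- vapply(tokens,
              function(w) tabulate(match(w, vocab), nbins = length(vocab)),
              integer(length(vocab)))
rownames(tdm) <- vocab

# IDF: log2 of (number of documents / number of documents containing the term)
doc.freq <- rowSums(tdm > 0)
idf <- log2(ncol(tdm) / doc.freq)

# TF-IDF: each row of the matrix is multiplied by the IDF of its term
tfidf <- tdm * idf
print(tdm)
print(tfidf)
```

Terms that occur in both documents ("i", "like") receive an IDF of 0, so only "r" and "python" keep a non-zero TF-IDF weight.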
  • 31. Top Frequent Terms
    # inspect frequent words
    freq.terms <- tdm %>% findFreqTerms(lowfreq = 20) %>% print
    ## [1] "mining" "rdatamining" "text" "analytics"
    ## [5] "australia" "data" "canberra" "group"
    ## [9] "university" "science" "slide" "tutorial"
    ## [13] "big" "learn" "package" "r"
    ## [17] "network" "course" "introduction" "talk"
    ## [21] "analysing" "research" "position" "example"
    term.freq <- tdm %>% as.matrix() %>% rowSums()
    term.freq <- term.freq %>% subset(term.freq >= 20)
    df <- data.frame(term = names(term.freq), freq = term.freq)
    28 / 61
  • 32.
    ## plot frequent words
    library(ggplot2)
    ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
      xlab("Terms") + ylab("Count") + coord_flip() +
      theme(axis.text=element_text(size=7))
    [Figure: horizontal bar chart of the 24 frequent terms; axes: Terms, Count (0 to 200)]
    29 / 61
  • 33. Wordcloud
    ## word cloud
    m <- tdm %>% as.matrix
    # calculate the frequency of words and sort it by frequency
    word.freq <- m %>% rowSums() %>% sort(decreasing = T)
    # colors
    library(RColorBrewer)
    pal <- brewer.pal(9, "BuGn")[-(1:4)]
    # plot word cloud
    library(wordcloud)
    wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,
              random.order = F, colors = pal)
    30 / 61
  • 34. [Figure: word cloud of terms occurring at least 3 times; the largest words include data, r, mining, slide, big, analysing, package and research] 31 / 61
  • 35. Associations
    # which words are associated with 'r'?
    tdm %>% findAssocs("r", 0.2)
    ## $r
    ##     code  example   series     user markdown
    ##     0.27     0.21     0.21     0.20     0.20
    # which words are associated with 'data'?
    tdm %>% findAssocs("data", 0.2)
    ## $data
    ##    mining       big analytics   science      poll
    ##      0.48      0.44      0.31      0.29      0.24
    32 / 61
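The association scores above can be read as correlations between the rows of the term-document matrix. The sketch below assumes that findAssocs reports Pearson correlations between term vectors and keeps those at or above the given limit (this is my reading of the tm documentation, so treat it as an assumption); the toy matrix is invented for illustration.

```r
# Hypothetical term-document matrix: 3 terms (rows) x 5 documents (columns)
m <- rbind(data   = c(1, 1, 0, 0, 1),
           mining = c(1, 1, 0, 0, 0),
           r      = c(0, 0, 1, 1, 0))
colnames(m) <- paste0("doc", 1:5)

# Correlation of every term with the term "data" across documents
assoc <- cor(t(m))["data", ]
assoc <- assoc[names(assoc) != "data"]   # drop the term itself

# Keep only terms whose correlation reaches the limit (0.2 on the slide)
strong <- assoc[assoc >= 0.2]
print(round(strong, 2))
```

In this toy example "mining" co-occurs with "data" in most documents and passes the limit, while "r" never appears together with "data" and is filtered out.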
  • 36. Network of Terms
    ## network of terms
    library(graph)
    library(Rgraphviz)
    plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)
    [Figure: network graph of the 24 frequent terms in freq.terms]
    33 / 61
  • 37. Hierarchical Clustering of Terms
    ## clustering of terms
    # remove sparse terms
    m2 <- tdm %>% removeSparseTerms(sparse = 0.95) %>% as.matrix()
    # calculate distance matrix
    dist.matrix <- m2 %>% scale() %>% dist()
    # hierarchical clustering ("ward" was renamed "ward.D" in R 3.1.0)
    fit <- dist.matrix %>% hclust(method = "ward.D")
    34 / 61
  • 38.
    plot(fit)
    fit %>% rect.hclust(k = 6) # cut tree into 6 clusters
    groups <- fit %>% cutree(k = 6)
    [Figure: Cluster Dendrogram, hclust (*, "ward.D"); y-axis: Height; leaves: slide, big, australia, analytics, canberra, research, university, position, package, example, analysing, tutorial, network, r, mining, data]
    35 / 61
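The same scale, dist, hclust and cutree steps can be followed on a matrix small enough to check by eye. The sketch below uses an invented 4-term by 6-document matrix in place of m2; it only illustrates the mechanics and says nothing about the real tweets.

```r
# Hypothetical term-document matrix: terms in rows, documents in columns
m <- rbind(data     = c(2, 1, 2, 0, 0, 0),
           mining   = c(1, 2, 1, 0, 0, 0),
           research = c(0, 0, 0, 1, 2, 1),
           position = c(0, 0, 0, 2, 1, 1))

# Standardise the columns, then compute Euclidean distances between terms
dist.matrix <- dist(scale(m))

# Agglomerative clustering with the "ward.D" method
fit <- hclust(dist.matrix, method = "ward.D")

# Cut the tree into 2 groups of terms
groups <- cutree(fit, k = 2)
print(groups)
```

Terms that are used in the same documents ("data" with "mining", "research" with "position") end up in the same group.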
  • 39.
    ## k-means clustering of documents
    m3 <- m2 %>% t() # transpose the matrix to cluster documents (tweets)
    set.seed(122) # set a fixed random seed to make the result reproducible
    k <- 6 # number of clusters
    kmeansResult <- kmeans(m3, k)
    round(kmeansResult$centers, digits = 3) # cluster centers
    ##   mining analytics australia  data canberra university slide
    ## 1  0.435     0.000     0.000 0.217    0.000      0.000 0.087
    ## 2  1.128     0.154     0.000 1.333    0.026      0.051 0.179
    ## 3  0.055     0.018     0.009 0.164    0.027      0.009 0.227
    ## 4  0.083     0.014     0.056 0.000    0.035      0.097 0.090
    ## 5  0.412     0.206     0.098 1.196    0.137      0.039 0.078
    ## 6  0.167     0.133     0.133 0.567    0.033      0.233 0.000
    ##   tutorial   big package     r network analysing research
    ## 1    0.043 0.000   0.043 1.130   0.087     0.174    0.000
    ## 2    0.026 0.077   0.282 1.103   0.000     0.051    0.000
    ## 3    0.064 0.018   0.109 1.127   0.045     0.109    0.000
    ## 4    0.056 0.007   0.090 0.000   0.090     0.111    0.000
    ## 5    0.059 0.333   0.010 0.020   0.020     0.059    0.020
    ## 6    0.000 0.167   0.033 0.000   0.067     0.100    1.233
    ##   position example
    ## 1    0.000   1.043
    ## 2    0.000   0.026
    ## 3    0.000   0.000
    36 / 61
  • 40.
    for (i in 1:k) {
      cat(paste("cluster ", i, ": ", sep = ""))
      s <- sort(kmeansResult$centers[i, ], decreasing = T)
      cat(names(s)[1:5], "\n")
      # print the tweets of every cluster
      # print(tweets[which(kmeansResult$cluster==i)])
    }
    ## cluster 1: r example mining data analysing
    ## cluster 2: data mining r package slide
    ## cluster 3: r slide data package analysing
    ## cluster 4: analysing university slide package network
    ## cluster 5: data mining big analytics canberra
    ## cluster 6: research data position university mining
    37 / 61
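The document clustering and the "top terms per cluster" loop can be reproduced on a toy document-term matrix. This sketch uses invented counts for six hypothetical tweets; unlike the slide, it passes nstart = 10 to kmeans so that the result does not depend on a single random start (an addition of this sketch, not of the original code).

```r
# Hypothetical document-term matrix: tweets in rows, terms in columns
docs <- matrix(c(2, 1, 0, 0,
                 1, 2, 0, 0,
                 2, 2, 0, 0,
                 0, 0, 1, 2,
                 0, 0, 2, 1,
                 0, 0, 2, 2),
               ncol = 4, byrow = TRUE,
               dimnames = list(paste0("tweet", 1:6),
                               c("data", "mining", "research", "position")))

set.seed(122)   # fixed seed, as on the slide
k <- 2
kmeansResult <- kmeans(docs, centers = k, nstart = 10)

# For every cluster, list the terms with the highest centre values
for (i in 1:k) {
  s <- sort(kmeansResult$centers[i, ], decreasing = TRUE)
  cat("cluster ", i, ": ", paste(names(s)[1:2], collapse = " "), "\n", sep = "")
}
```

Each cluster centre holds the average term counts of its tweets, so sorting a centre in decreasing order gives the terms that characterise the cluster.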
  • 41. Topic Modelling
    dtm <- tdm %>% as.DocumentTermMatrix()
    library(topicmodels)
    lda <- LDA(dtm, k = 8) # find 8 topics
    term <- terms(lda, 7) # first 7 terms of every topic
    term <- apply(term, MARGIN = 2, paste, collapse = ", ") %>% print
    ## ...
    ## "data, big, mining, r, research, group,...
    ## ...
    ## "analysing, network, r, canberra, data, social,...
    ## ...
    ## "r, talk, slide, series, learn, rdatamini...
    ## ...
    ## "r, data, course, introduction, free, online...
    ## ...
    ## "data, mining, r, application, book, dataset, a...
    ## ...
    ## "r, package, example, useful, program, sli...
    ## ...
    ## "data, university, analytics, mining, position, research, s...
    ## ...
    ## "australia, data, ausdm, submission, workshop, mining...
    38 / 61
  • 42. Topic Modelling
    library(data.table) # provides as.IDate()
    rdm.topics <- topics(lda) # 1st topic identified for every document (twe
    rdm.topics <- data.frame(date=as.IDate(tweets.df$created), topic=rdm.topics)
    ggplot(rdm.topics, aes(date, fill = term[topic])) +
      geom_density(position = "stack")
    [Figure: stacked density plot of the 8 topics over time; x-axis: date (2012 to 2016), y-axis: density; legend: term[topic], one entry per topic listing its first 7 terms]
    Another way to plot a stream graph: http://menugget.blogspot.com.au/2013/12/data-mountains-and-streams-stacked-area.html
    39 / 61
  • 43. Sentiment Analysis
    ## sentiment analysis
    # install package sentiment140 from GitHub
    require(devtools)
    install_github("okugami79/sentiment140")
    # sentiment analysis
    library(sentiment)
    sentiments <- sentiment(tweets.df$text)
    table(sentiments$polarity)
    # sentiment plot
    library(data.table) # provides as.IDate()
    sentiments$score <- 0
    sentiments$score[sentiments$polarity == "positive"] <- 1
    sentiments$score[sentiments$polarity == "negative"] <- -1
    sentiments$date <- as.IDate(tweets.df$created)
    result <- aggregate(score ~ date, data = sentiments, sum)
    40 / 61
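The scoring and aggregation step does not depend on the sentiment140 API and can be checked on a hand-made table. In this sketch the polarity labels and dates are invented, and as.Date stands in for as.IDate so that only base R is needed.

```r
# Hypothetical polarity labels for four tweets posted on two days
sentiments <- data.frame(
  date = as.Date(c("2016-02-01", "2016-02-01", "2016-02-02", "2016-02-02")),
  polarity = c("positive", "negative", "neutral", "positive"),
  stringsAsFactors = FALSE
)

# Map polarity to a numeric score: positive = 1, negative = -1, otherwise 0
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1

# Net sentiment per day: sum of the scores of that day's tweets
result <- aggregate(score ~ date, data = sentiments, sum)
print(result)
```

A daily sum of 0 can mean either only neutral tweets or positive and negative tweets cancelling out, which is worth keeping in mind when reading the resulting plot.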
  • 44. Retrieve User Info and Followers
    ## follower analysis
    user <- getUser("RDataMining")
    user$toDataFrame()
    friends <- user$getFriends() # who this user follows
    followers <- user$getFollowers() # this user's followers
    followers2 <- followers[[1]]$getFollowers() # a follower's followers
    ##                [,1] ...
    ## description    "R and Data Mining. Group on LinkedIn: ht...
    ## statusesCount  "583" ...
    ## followersCount "2376" ...
    ## favoritesCount "6" ...
    ## friendsCount   "72" ...
    ## url            "http://t.co/LwL50uRmPd" ...
    ## name           "Yanchang Zhao" ...
    ## created        "2011-04-04 09:15:43" ...
    ## protected      "FALSE" ...
    ## verified       "FALSE" ...
    ## screenName     "RDataMining" ...
    ## location       "Australia" ...
    ## lang           "en" ...
    41 / 61
  • 45. Follower Map§
    [Figure: map titled "@RDataMining Followers (#: 2376)", one point per follower location]
    § Based on Jeff Leek's twitterMap function at http://biostat.jhsph.edu/~jleek/code/twitterMap.R
    42 / 61
  • 46. Active Influential Followers
    [Figure: scatter plot of followers; x-axis: #followers / #friends (ticks 5 to 100), y-axis: #Tweets per day (ticks 0.2 to 20.0)]
    Labelled accounts: #AI PR Girl, Marcel Molina, Rahul Kapil, David Smith, Zac S., Christopher D. Long, Data Science London, Ryan Rosario, Roby, Learn R, Statistics Blog, Robert Penner, Prof. Diego Kuonen, DataCamp, Derecho Internet, Murari Bhartia, Sharon Machlis, Rob J Hyndman, StatsBlogs, Mitch Sanders, Michal Illich, RDataMining, pavel jašek, biao, Daniel D. Gutierrez, Data Mining, M Kautzar Ichramsyah, Yichuan Wang, Prithwis Mukerjee, Antonio Piccolboni, Duccio Schiavon, LearnDataAnalysis
    43 / 61
  • 47. Top Retweeted Tweets
    ## retweet analysis
    ## select top retweeted tweets
    table(tweets.df$retweetCount)
    selected <- which(tweets.df$retweetCount >= 9)
    ## plot them
    dates <- strptime(tweets.df$created, format="%Y-%m-%d")
    plot(x=dates, y=tweets.df$retweetCount, type="l", col="grey",
         xlab="Date", ylab="Times retweeted")
    colors <- rainbow(10)[1:length(selected)]
    points(dates[selected], tweets.df$retweetCount[selected],
           pch=19, col=colors)
    text(dates[selected], tweets.df$retweetCount[selected],
         tweets.df$text[selected], col=colors, cex=.9)
    44 / 61
  • 48. Top Retweeted Tweets
    [Figure: line plot of retweet counts over time; x-axis: Date (2012 to 2016), y-axis: Times retweeted (0 to 15); the top retweeted tweets are marked and labelled with their text]
    Labelled tweets:
    Free online course on Computing for Data Analysis (with R), to start on 24 Sept 2012 https://t.co/Y617n30y
    Lecture videos of natural language processing course at Stanford University: 18 videos, with each of over 1 hr length http://t.co/VKKdA9Tykm
    The R Reference Card for Data Mining now provides links to packages on CRAN. Packages for MapReduce and Hadoop added. http://t.co/RrFypol8kw
    Slides in 8 PDF files on Getting Data from the Web with R http://t.co/epT4Jv07WD
    Handling and Processing Strings in R -- an ebook in PDF format, 105 pages. http://t.co/UXnetU7k87
    A Twitter dataset for text mining: @RDataMining Tweets extracted on 3 February 2016. Download it at https://t.co/lQp94IvfPf
    45 / 61
  • 49. Tracking Message Propagation
    tweets[[1]]
    retweeters(tweets[[1]]$id)
    retweets(tweets[[1]]$id)
    ## [1] "RDataMining: A Twitter dataset for text mining: @RData...
    ## [1] "197489286" "316875164" "229796464" "3316009302"
    ## [5] "244077734" "16900353" "2404767650" "222061895"
    ## [9] "11686382" "190569306" "49413866" "187048879"
    ## [13] "6146692" "2591996912"
    ## [[1]]
    ## [1] "bobaiKato: RT @RDataMining: A Twitter dataset for text...
    ##
    ## [[2]]
    ## [1] "VipulMathur: RT @RDataMining: A Twitter dataset for te...
    ##
    ## [[3]]
    ## [1] "tau_phoenix: RT @RDataMining: A Twitter dataset for te...
    The tweet potentially reached around 120,000 users.
    46 / 61
  • 52. R Packages
    Twitter data extraction: twitteR
    Text cleaning and mining: tm
    Word cloud: wordcloud
    Topic modelling: topicmodels, lda
    Sentiment analysis: sentiment140
    Social network analysis: igraph, sna
    Visualisation: wordcloud, Rgraphviz, ggplot2
    49 / 61
  • 53. Twitter Data Extraction – Package twitteR ¶
    userTimeline, homeTimeline, mentions, retweetsOfMe: retrieve various timelines
    getUser, lookupUsers: get information of Twitter user(s)
    getFollowers, getFollowerIDs: retrieve followers (or their IDs)
    getFriends, getFriendIDs: return a list of Twitter users (or user IDs) that a user follows
    retweets, retweeters: return retweets or users who retweeted a tweet
    searchTwitter: issue a search of Twitter
    getCurRateLimitInfo: retrieve current rate limit information
    twListToDF: convert into data.frame
    ¶ https://cran.r-project.org/package=twitteR
    50 / 61
  • 54. Text Mining – Package tm
    removeNumbers, removePunctuation, removeWords, stripWhitespace: remove numbers, punctuation, words or extra whitespace
    removeSparseTerms: remove sparse terms from a term-document matrix
    stopwords: various kinds of stopwords
    stemDocument, stemCompletion: stem words and complete stems
    TermDocumentMatrix, DocumentTermMatrix: build a term-document matrix or a document-term matrix
    termFreq: generate a term frequency vector
    findFreqTerms, findAssocs: find frequent terms or associations of terms
    weightBin, weightTf, weightTfIdf, weightSMART, WeightFunction: various ways to weight a term-document matrix
    https://cran.r-project.org/package=tm
    51 / 61
  • 55. Topic Modelling and Sentiment Analysis – Packages topicmodels & sentiment140
    Package topicmodels ∗∗
    LDA: build a Latent Dirichlet Allocation (LDA) model
    CTM: build a Correlated Topic Model (CTM)
    terms: extract the most likely terms for each topic
    topics: extract the most likely topics for each document
    Package sentiment140 ††
    sentiment: sentiment analysis with the sentiment140 API, tuned to Twitter text analysis
    ∗∗ https://cran.r-project.org/package=topicmodels
    †† https://github.com/okugami79/sentiment140
    52 / 61
  • 56. Social Network Analysis and Visualization – Package igraph ‡‡
    degree, betweenness, closeness, transitivity: various centrality scores
    neighborhood: neighborhood of graph vertices
    cliques, largest.cliques, maximal.cliques, clique.number: find cliques, i.e., complete subgraphs
    clusters, no.clusters: maximal connected components of a graph and the number of them
    fastgreedy.community, spinglass.community: community detection
    cohesive.blocks: calculate cohesive blocks
    induced.subgraph: create a subgraph of a graph (igraph)
    read.graph, write.graph: read and write graphs from and to files of various formats
    ‡‡ https://cran.r-project.org/package=igraph
    53 / 61
  • 58. Wrap Up
    Transform unstructured data into structured data (i.e., a term-document matrix), and then apply traditional data mining algorithms like clustering and classification
    Feature extraction: term frequency, TF-IDF and many others
    Text cleaning: lower case, removing numbers, punctuation and URLs, stop words, stemming and stem completion
    Stem completion may not always work as expected.
    Documents in languages other than English
    55 / 61
  • 60. Further Readings
    Text Mining https://en.wikipedia.org/wiki/Text_mining
    TF-IDF https://en.wikipedia.org/wiki/Tf–idf
    Topic Modelling https://en.wikipedia.org/wiki/Topic_model
    Sentiment Analysis https://en.wikipedia.org/wiki/Sentiment_analysis
    Document Summarization https://en.wikipedia.org/wiki/Automatic_summarization
    Natural Language Processing https://en.wikipedia.org/wiki/Natural_language_processing
    An introduction to text mining by Ian Witten http://www.cs.waikato.ac.nz/%7Eihw/papers/04-IHW-Textmining.pdf
    57 / 61
  • 61. Online Resources
    Book titled R and Data Mining: Examples and Case Studies [Zhao, 2012] http://www.rdatamining.com/docs/RDataMining-book.pdf
    R Reference Card for Data Mining http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
    Free online courses and documents http://www.rdatamining.com/resources/
    RDataMining Group on LinkedIn (27,000+ members) http://group.rdatamining.com
    Twitter (3,300+ followers) @RDataMining
    58 / 61
  • 63. How to Cite This Work
    Citation
    Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
    BibTeX
    @BOOK{Zhao2012R,
      title = {R and Data Mining: Examples and Case Studies},
      publisher = {Academic Press, Elsevier},
      year = {2012},
      author = {Yanchang Zhao},
      pages = {256},
      month = {December},
      isbn = {978-0-12-396963-7},
      keywords = {R, data mining},
      url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
    }
    60 / 61
  • 64. References I
    Zhao, Y. (2012). R and Data Mining: Examples and Case Studies, ISBN 978-0-12-396963-7. Academic Press, Elsevier.
    Zhao, Y. (2013). Analysing Twitter data with text mining and social network analysis. In Proc. of the 11th Australasian Data Mining Conference (AusDM 2013), Canberra, Australia.
    61 / 61