Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Adeleye

Text Mining, Social Network Analysis
Deolu Adeleye
Text Mining
Just as we can mine raw materials from ores, we can also ‘mine’ useful information from collections of text.
Once again, R proves to be a very powerful tool, and packages such as twitteR are quite useful here, as we’ll soon
demonstrate.
As a demonstration, we’ll mine textual information from the popular social network Twitter. We’ll
examine tweets from the Twitter handle ‘@55wordsorless’ (though you could use any handle of your choice when
running the code).
Do note that these demonstrations will require an active internet connection (at least in the beginning to authenticate),
and will be using the following R packages:
• twitteR
• tm
• wordcloud
• SnowballC
• RWeka
• igraph
• ggplot2
The first step is to create a Twitter application for yourself. Go to https://twitter.com/apps/new and log in. After
filling in the basic info, go to the “Settings” tab and select “Read, Write and Access direct messages”. Make sure to
click on the save button after doing this. In the “Details” tab, take note of the following:
• your consumer key
• your consumer secret
• your access token
• your access secret
Once you have these four values, simply pass them to the setup_twitter_oauth function in the form
setup_twitter_oauth("consumer key", "consumer secret", "access token", "access secret"). Here’s ours with the corresponding
values inserted:
#load the twitteR package
library(twitteR)
#authenticate
setup_twitter_oauth(our_key,
                    our_secret,
                    our_token,
                    our_access_secret)
## [1] "Using direct authentication"
You only need to authenticate once per R session.
So, we’ve authenticated. Next, let’s search recent tweets for a particular term, say the hashtag ‘#water’, wherever it
was used recently on Twitter.
#retrieve last 50 tweets where hashtag '#water' is used, for example
watertag<-searchTwitter('#water', n=50)
head(watertag,3)
## [[1]]
## [1] "FrozenMOVlE: #vsco #afterlight #winter #wisconsin #water #lake #michigan #frozen #milwaukee #city http:/
##
## [[2]]
## [1] "FrozenMOVlE: Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD
##
## [[3]]
## [1] "vikashprasad21: RT @WaterNetwork1: #DSRSD #Certified For #Water #Quality #Testing http://t.co/bkF2mz47q
Next, let’s get info from the particular user ‘@55wordsorless’:
#retrieve the last 100 tweets from the specified timeline
tweets <- userTimeline('55wordsorless', n=100)
head(tweets,3)
## [[1]]
## [1] "55WordsOrLess: @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next
##
## [[2]]
## [1] "55WordsOrLess: Join the conversation!! http://t.co/xqDAtWbzVK :D :D"
##
## [[3]]
## [1] "55WordsOrLess: @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D"
For our purposes, we’ll convert these into a data.frame object:
watertag_df <- twListToDF(watertag)
tweets_df <- twListToDF(tweets)
head(watertag_df,3)
##
## 1 #vsco #afterlight #winter #wisconsin #water #lake #michiga
## 2 Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+
## 3 RT @WaterNetwork1: #DSRSD #Certified For #Wat
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 <NA> 2015-01-02 21:06:40 FALSE
## 2 FALSE 0 <NA> 2015-01-02 21:06:21 FALSE
## 3 FALSE 0 <NA> 2015-01-02 21:05:22 FALSE
## replyToSID id replyToUID
## 1 <NA> 551122427166875648 <NA>
## 2 <NA> 551122347814825984 <NA>
## 3 <NA> 551122099348054016 <NA>
## statusSource
## 1 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 2 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 3 <a href="http://spinabell.com" rel="nofollow">spinabell</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 2 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 3 vikashprasad21 1 TRUE FALSE <NA> <NA>
head(tweets_df,3)
## text
## 1 @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next difficulty... :|
## 2 Join the conversation!! http://t.co/xqDAtWbzVK :D :D
## 3 @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 _MissJem_ 2014-11-28 18:03:43 FALSE
## 2 FALSE 0 <NA> 2014-10-28 14:19:27 FALSE
## 3 FALSE 0 BeautifulFeet_ 2014-10-09 19:48:46 FALSE
## replyToSID id replyToUID
## 1 538279840193839108 538392811075559424 434366153
## 2 <NA> 527102349727531008 <NA>
## 3 <NA> 520299852824334337 92370873
## statusSource
## 1 <a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M2)</a>
## 2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 3 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 55WordsOrLess 0 FALSE FALSE NA NA
After that, we’ll convert to a corpus (which is just a collection of text documents) using the tm package:
library(tm)
#build a corpus, and specify the source to be character vectors
watertag_corpus <- Corpus(VectorSource(watertag_df$text))
tweets_corpus <- Corpus(VectorSource(tweets_df$text))
The corpus allows us to perform certain manipulations with functions in the tm package. Run ?Corpus for details,
and getSources() to list the other sources of textual data you can harness.
Let’s proceed by first ‘cleaning’ our data:
#make a copy, just in case we might need the original later
watertag_1 <- watertag_corpus
tweets_1 <- tweets_corpus
# remove punctuation
watertag_corpus <- tm_map(watertag_corpus, removePunctuation)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
# remove numbers
watertag_corpus <- tm_map(watertag_corpus, removeNumbers)
tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
# convert to lower case
watertag_corpus <- tm_map(watertag_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
# remove whitespace
watertag_corpus <- tm_map(watertag_corpus, stripWhitespace)
tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
# remove stopwords such as 'you', 'me', etc.
watertag_corpus <- tm_map(watertag_corpus, removeWords, stopwords("english"))
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("english"))
# remove URLs
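# Punctuation has already been stripped, so a link now reads like 'httptcoxqdatwbzvk';
# we create a function that matches 'http' plus any letters or digits that follow, and deletes the match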
removeURL <- content_transformer(function(x) gsub("http[[:alnum:]]*", "", x))
watertag_corpus <- tm_map(watertag_corpus, removeURL)
tweets_corpus <- tm_map(tweets_corpus, removeURL)
#inspect our results
inspect(head(watertag_corpus,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwaukee city
##
## [[2]]
## taking brothers place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U
##
## [[3]]
## rt waternetwork dsrsd certified water quality testing
inspect(head(tweets_corpus,3))
##
## [[1]]
## missjem someone already hasthough borrow dr whos tardis next difficulty
##
## [[2]]
## join conversation d d
##
## [[3]]
## beautifulfeet read mischievous thoughts well d
Other transformations possible with tm_map can be obtained by running getTransformations().
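To make the effect of each cleaning step easier to see, here is a minimal base-R sketch of a similar pipeline applied to two made-up tweets. The function name clean_text and the sample strings are assumptions for illustration only, and the sketch deliberately removes URLs first (while they are still intact) rather than after punctuation as in the tm code above. Stopword removal is left out.

```r
# Toy tweets, made up for illustration (not taken from the lecture's data)
toy <- c("Join the conversation!! http://t.co/abc123 :D",
         "RT @someone: 3 Reasons to LOVE #water")

# Hypothetical helper mirroring the cleaning steps with base-R regular expressions
clean_text <- function(x) {
  x <- gsub("http[^[:space:]]*", "", x)  # drop URLs while they are still intact
  x <- gsub("[[:punct:]]+", "", x)       # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)       # remove numbers
  x <- tolower(x)                        # convert to lower case
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  trimws(x)                              # strip leading and trailing spaces
}

cleaned <- clean_text(toy)
print(cleaned)
```

Removing links before punctuation means the pattern can stop at the first space, so it does not depend on the URL having been squashed into a single alphanumeric word.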
In many applications, words need to be stemmed to retrieve their radicals, so that various forms derived from a stem
would be taken as the same when counting word frequency. Stemming uses an algorithm that removes common word
endings for English words, such as “es”, “ed” and “’s”. For instance, words “update”, “updated” and “updating”
would all be stemmed to “updat”. It’s not mandatory (and sometimes it may be counter-productive), but it does
pay to understand what it does, so we’ll demonstrate:
# create a copy we'll stem
watertag_stemmed <- watertag_corpus
tweets_stemmed <- tweets_corpus
# stem words
library(SnowballC)
watertag_stemmed <- tm_map(watertag_stemmed, stemDocument)
tweets_stemmed <- tm_map(tweets_stemmed, stemDocument)
# inspect our stemmed results
inspect(head(watertag_stemmed,3))
##
## [[1]]
## vsco afterlight winter wisconsin water lake michigan frozen milwauke citi
##
## [[2]]
## take brother place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+38
##
## [[3]]
## rt waternetwork dsrsd certifi water qualiti test
inspect(head(tweets_stemmed,3))
##
## [[1]]
## missjem someon alreadi hasthough borrow dr whos tardi next difficulti
##
## [[2]]
## join convers d d
##
## [[3]]
## beautifulfeet read mischiev thought well d
A term-document matrix represents the relationship between terms and documents, where each row stands for a
term and each column for a document, and an entry is the number of occurrences of the term in the document.
In contrast, a document-term matrix is simply the transpose of the term-document matrix, with documents as
rows, and columns as terms.
So which should you use? Either of your choice!
#creating term-document matrices
watertag_tdm<-TermDocumentMatrix(watertag_corpus)
tweets_tdm<-TermDocumentMatrix(tweets_corpus)
#creating document-term matrices
watertag_dtm<-DocumentTermMatrix(watertag_corpus)
tweets_dtm<-DocumentTermMatrix(tweets_corpus)
# just to compare the two:
watertag_tdm
## <<TermDocumentMatrix (terms: 311, documents: 50)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)
watertag_dtm
## <<DocumentTermMatrix (documents: 50, terms: 311)>>
## Sparsity : 97%
As seen above, apart from being transposes of each other, they’re practically the same.
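For intuition about what these matrices contain, the following base-R sketch builds a tiny term-document matrix by hand from three made-up documents and then transposes it. The names docs, toy_tdm and toy_dtm are assumptions for illustration; tm stores its matrices in a sparse format, which this sketch does not attempt to reproduce.

```r
# Three toy documents, already cleaned (made up for illustration)
docs <- c(d1 = "water ice water", d2 = "ice sun", d3 = "water sun fun")

tokens <- strsplit(docs, " ", fixed = TRUE)  # split each document into words
terms  <- sort(unique(unlist(tokens)))       # the vocabulary

# Count every vocabulary term in every document:
# rows are terms, columns are documents, entries are occurrence counts
toy_tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(toy_tdm) <- terms

# The document-term matrix is simply the transpose
toy_dtm <- t(toy_tdm)

print(toy_tdm)
print(toy_dtm)
```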
With our matrix, we can perform a number of operations. For instance, we can find the words that occur at least a
given number of times:
#find terms which occur 5 times or more
findFreqTerms(watertag_dtm, 5)
## [1] "amp" "ice" "sun" "water"
#how about 10 times or more?
findFreqTerms(tweets_tdm, lowfreq=10)
## [1] "man" "now"
It is important to note that the results are ordered alphabetically, not according to frequency of occurrence.
If we want the terms ordered by frequency, we first obtain the frequencies as a vector by converting to a matrix and
using the rowSums function if we’re using a tdm, and colSums if dtm:
#remember you can use either dtm or tdm - we're using both interchangeably just to demonstrate
watertag_freq <- colSums(as.matrix(watertag_dtm))
tweets_freq <- rowSums(as.matrix(tweets_tdm))
…and then we sort it in descending order, so it shows the terms with maximum occurrence first:
#display head of most frequent terms
head(sort(watertag_freq,decreasing=TRUE))
## water ice amp sun frozen fun
## 57 7 5 5 4 4
head(sort(tweets_freq,decreasing=TRUE))
## man now just said home never
## 10 10 7 7 6 6
We could even see the frequency of frequencies, to know how many terms occur a given number of times:
head(table(watertag_freq),15)
## watertag_freq
## 1 2 3 4 5 7 57
## 200 85 16 6 2 1 1
head(table(tweets_freq),15)
## tweets_freq
## 1 2 3 4 5 6 7 10
## 569 85 24 21 4 4 2 2
This tells us that in our search results 200 terms occur just once, in our timeline 569 terms occur just once, and
so forth…
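The same two steps (totals per term, then a table of those totals) can be tried on a small made-up matrix where the arithmetic is easy to check by eye. This is only a sketch in base R; the counts and the names toy_tdm, term_freq and freq_of_freq are assumptions for illustration.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents
toy_tdm <- matrix(c(0, 1, 0, 2,    # counts in d1
                    0, 1, 1, 0,    # counts in d2
                    1, 0, 1, 1),   # counts in d3
                  nrow = 4,
                  dimnames = list(Terms = c("fun", "ice", "sun", "water"),
                                  Docs  = c("d1", "d2", "d3")))

term_freq <- rowSums(toy_tdm)       # total count per term from the tdm
same_freq <- colSums(t(toy_tdm))    # the same totals from the transposed (dtm) form

# Most frequent terms first
print(sort(term_freq, decreasing = TRUE))

# Frequency of frequencies: how many terms share each total
freq_of_freq <- table(term_freq)
print(freq_of_freq)
```

Here one term occurs once, two terms occur twice and one term occurs three times, which is exactly what the final table reports.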
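We could also retrieve associations between words, measured as the correlation between their occurrences across
documents: two words that always appear together have a correlation of 1.0, while a correlation of 0.0 means their
occurrences are unrelated. findAssocs reports only the terms at or above the limit you set, between 0 and 1.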
So, let’s say we wanted to see words that have at least a 0.5 correlation with the word ‘time’ in our search results:
findAssocs(watertag_dtm, "time", corlimit=0.5)
## $time
## numeric(0)
Note that a result of numeric(0) indicates that no correlating words were found at the level of correlation you
specified.
How about the words ‘trend’ and ‘food’ from our timeline, this time with a 0.4 correlation?
findAssocs(tweets_tdm, c("trend","food"), corlimit=0.4)
## $trend
## numeric(0)
##
## $food
## diner garbage” protested rat siryou tastes cook
## 1.00 1.00 1.00 1.00 1.00 1.00 0.70
## money paid yet stunned good like
## 0.70 0.70 0.70 0.57 0.49 0.49
What if we wanted to represent our results graphically? We could, and it only requires a few lines of code.
For example, let’s make a bar plot of all the terms that occur at least 5 times in each text source. (5 is a fairly
low threshold, but it serves this particular example well.)
#using ggplot2 package
library(ggplot2)
#from our search on Twitter
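#xlab labels the terms and ylab the counts; coord_flip() then draws the terms along the vertical axis
qplot(names(watertag_freq[watertag_freq>=5]), watertag_freq[watertag_freq>=5], geom="bar",
      stat="identity", xlab="Terms", ylab="Frequency", main="Search Results") + coord_flip()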
Figure 1: Words Occurring At Least 5 Times
#from our timeline
qplot(names(tweets_freq[tweets_freq>=5]), tweets_freq[tweets_freq>=5], geom="bar",
      stat="identity", xlab="Terms", ylab="Frequency", main="@55wordsorless Timeline") + coord_flip()
Figure 2: Words Occurring At Least 5 Times
Wordclouds are also a very cool graphical representation of textual information. Here, the more frequently a word
occurs, the bolder and larger it is displayed, with the reverse being true.
By default the most frequent words have a font scale of 4 and the least have a scale of 0.5, but even that can be
changed, as we’ll demonstrate!
tweets_freq<-sort(tweets_freq,decreasing=TRUE)
watertag_freq<-sort(watertag_freq,decreasing=TRUE)
#wordcloud package allows us to produce wordclouds
library(wordcloud)
#each time wordcloud is run, it randomly produces a layout.
#Though it doesn't really matter, you can set the seed to keep the layout the same
set.seed(77)
#'min.freq' specifies the minimum frequency of the words to be plotted
wordcloud(names(tweets_freq), tweets_freq, min.freq=3)
Figure 3: Wordcloud Using min.freq
#max.words specifies the maximum number of words it should plot
#scale changes font scale
wordcloud(names(tweets_freq), scale=c(5, .1), tweets_freq, max.words=100)
## Warning in wordcloud(names(tweets_freq), scale = c(5, 0.1), tweets_freq, :
## now could not be fit on page. It will not be plotted.
Figure 4: Wordcloud Using max.words
#just adding some colour!
set.seed(79)
wordcloud(names(watertag_freq), watertag_freq, min.freq=2,
random.color=TRUE,colors=rainbow(7))
Figure 5: Wordcloud With Colour!
Run ?wordcloud for even more options you can specify.
Word clusters can also be generated.
Hierarchical
Let’s first use removeSparseTerms to remove sparse terms, which appear in very few documents and are not so important.
The value of sparse is the maximum sparsity allowed for a term: a term is kept only if the share of documents it is
missing from is below this value, so lower values keep fewer terms.
#we're using 0.95 because our text source has only a few terms, and not many recurring words
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
distMatrix <- dist(scale(tweets_sparsed))
#the method formerly called "ward" has been renamed "ward.D" in current versions of R
fit <- hclust(distMatrix, method="ward.D")
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k=10)
(groups <- cutree(fit, k=10))
## ’ll bed check died discovered good
## 1 2 3 1 3 4
## home just kill know last like
## 5 6 7 8 7 4
## love made man marry mischievous never
## 1 8 1 1 3 2
## new now one said smiled steel
## 4 9 4 1 4 3
## still swore take things took went
## 6 2 1 3 10 5
## wife
## 2
Figure 6: Cluster Dendrogram
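The mechanics of the clustering are easier to follow on a handful of terms. The sketch below, in base R, clusters the rows of a made-up term-document matrix; the counts are assumptions for illustration, and unlike the code above it skips the scale() step to keep the distances easy to verify by hand.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents
toy_tdm <- matrix(c(2, 2, 0, 0,
                    2, 1, 0, 0,
                    0, 0, 3, 2,
                    0, 0, 2, 3),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("bed", "home", "check", "discovered"),
                                  paste0("d", 1:4)))

d      <- dist(toy_tdm)                  # Euclidean distance between term rows
fit    <- hclust(d, method = "ward.D")   # agglomerative clustering of the terms
groups <- cutree(fit, k = 2)             # cut the tree into two clusters
print(groups)
```

Terms that occur in the same documents end up close together, so "bed" and "home" share one cluster while "check" and "discovered" share the other.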
K-means
We can also use k-means clustering in our analysis. Because kmeans clusters the rows of its input, and here we want to
group the documents (tweets), for this you MUST use a document-term matrix.
#using DOCUMENT-TERM matrix
(tweets_sparsed <- removeSparseTerms(tweets_dtm, sparse=0.95))
## <<DocumentTermMatrix (documents: 79, terms: 31)>>
## Sparsity : 94%
#setting our value of k
k <- 4
kmeansResult <- kmeans(tweets_sparsed, k)
# cluster centers
round(kmeansResult$centers, digits=3)
## ’ll bed check died discovered good home just kill know last
## 1 0.034 0 0 0.051 0 0.051 0.051 0.102 0.068 0.068 0.051
## 2 0.000 0 1 0.000 1 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.167 0 0 0.083 0 0.083 0.000 0.083 0.000 0.083 0.083
## 4 0.000 1 0 0.000 0 0.000 0.750 0.000 0.000 0.000 0.000
## like love made man marry mischievous never new now one said
## 1 0.051 0.051 0.068 0.0 0.000 0.017 0.051 0.068 0.119 0.085 0.000
## 2 0.000 0.000 0.000 1.0 0.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.083 0.083 0.083 0.5 0.333 0.000 0.000 0.000 0.250 0.000 0.583
## 4 0.000 0.000 0.000 0.0 0.000 0.000 0.750 0.000 0.000 0.000 0.000
## smiled steel still swore take things took went wife
## 1 0.102 0 0.068 0.000 0.017 0 0.085 0.051 0.017
## 2 0.000 1 0.000 0.000 0.000 1 0.000 0.000 0.000
## 3 0.000 0 0.000 0.083 0.250 0 0.083 0.000 0.083
## 4 0.000 0 0.000 0.750 0.000 0 0.000 0.250 0.500
To make things easier, let’s just print the top three words in every cluster:
for (i in 1:k)
{
  cat(paste("cluster ", i, ": ", sep=""))
  #rank the terms of this cluster by their centre values
  s <- sort(kmeansResult$centers[i,], decreasing=TRUE)
  cat(names(s)[1:3], "\n")
  # if you want to print the tweets of every cluster, run the next line
  # print(tweets[which(kmeansResult$cluster==i)])
}
## cluster 1: now just smiled
## cluster 2: check discovered man
## cluster 3: said man marry
## cluster 4: bed home never
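As a small, reproducible illustration of the same idea, the base-R sketch below runs k-means on a made-up document-term matrix with two obvious groups of documents. The counts, the seed and the use of nstart are assumptions added for the example; they are not part of the analysis above.

```r
# Toy document-term matrix (made-up counts): rows are documents, columns are terms
toy_dtm <- matrix(c(2, 1, 0, 0,
                    1, 2, 0, 0,
                    0, 0, 2, 1,
                    0, 0, 1, 2),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("d", 1:4),
                                  c("bed", "home", "check", "discovered")))

set.seed(42)                                      # k-means starts from random centres
km <- kmeans(toy_dtm, centers = 2, nstart = 10)   # keep the best of 10 random starts

print(km$cluster)                                 # cluster assigned to each document

# Top two terms per cluster, ranked by the cluster centre
for (i in seq_len(nrow(km$centers))) {
  top <- names(sort(km$centers[i, ], decreasing = TRUE))[1:2]
  cat("cluster ", i, ": ", paste(top, collapse = " "), "\n", sep = "")
}
```

Since the starting centres are random, cluster numbers can differ between runs; setting a seed and using several starts keeps the grouping itself stable.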
Social Network Analysis
First, we want to produce a term-term matrix, which is basically just a network of terms based on their co-occurrence
in tweets. It is the matrix product of the term-document matrix and the document-term matrix. (We compute the matrix
product with the %*% operator.)
#matrix product;
#using sparsed tweets because original tdm in our example had too many sparse terms
#transposing with 't' operator
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
termTerm <- as.matrix(tweets_sparsed) %*% as.matrix(t(tweets_sparsed))
#inspect few rows and columns
termTerm[1:10,1:10]
## Terms
## Terms ’ll bed check died discovered good home just kill know
## ’ll 4 0 0 0 0 0 0 0 1 0
## bed 0 4 0 0 0 0 3 0 0 0
## check 0 0 4 0 4 0 0 0 0 0
## died 0 0 0 4 0 0 0 1 0 1
## discovered 0 0 4 0 4 0 0 0 0 0
## good 0 0 0 0 0 4 0 1 0 0
## home 0 3 0 0 0 0 8 0 0 0
## just 0 0 0 1 0 1 0 7 0 0
## kill 1 0 0 0 0 0 0 0 4 0
## know 0 0 0 1 0 0 0 0 0 5
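A tiny worked example may help in reading this matrix. The base-R sketch below multiplies a made-up term-document matrix by its transpose; the three terms and their 0/1 counts are assumptions for illustration.

```r
# Toy term-document matrix (made-up 0/1 counts): rows are terms, columns are documents
toy_tdm <- matrix(c(1, 1, 0,
                    1, 1, 0,
                    0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Terms = c("check", "discovered", "home"),
                                  Docs  = c("d1", "d2", "d3")))

# Term-term matrix: term-document matrix times its transpose
term_term <- toy_tdm %*% t(toy_tdm)
print(term_term)
```

With 0/1 counts, each off-diagonal entry is the number of documents in which the two terms occur together ("check" and "discovered" share two documents, "check" and "home" share one), and each diagonal entry is the number of documents containing that term.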
After this, we can use the igraph package to represent this network of terms graphically in a visually appealing way:
library(igraph)
# build a graph from the above matrix
# (newer igraph releases name this function graph_from_adjacency_matrix)
g <- graph.adjacency(termTerm, weighted=T, mode="undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# setting seed to make the layout reproducible
set.seed(1001)
#call to plot network
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout=layout1)
What if we wanted a different layout?
plot(g, layout=layout.kamada.kawai)
Figure 8: Different Layout
What if we wanted an interactive network plot? Easy!
tkplot(g, layout=layout1)
In fact, in our interactive graphs, we can just change the layouts immediately by selecting different options in the
Layout tab.
But the above just produces a graph with a lot of connections. What if we wanted to see straightaway which terms were
more important, and which connections were stronger? We can do that by specifying options with the following code:
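#scale each label by the degree of its vertex 'V', so better-connected terms appear larger
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
#label color
V(g)$label.color <- rgb(0, 0, .2, .8)
#no frame
V(g)$frame.color <- NA
#rescale the edge weights on a log scale, so the heaviest edge gets the value 1
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
#access edges 'E': heavier edges are drawn more opaque and wider
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam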
# plot the graph in layout1
plot(g, layout=layout1)
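The two scaling formulas used above can be checked on plain numbers, without building a graph. In this base-R sketch the degrees and edge weights are made-up stand-ins for degree(g) and E(g)$weight; the variable names are assumptions for illustration.

```r
# Made-up vertex degrees and edge weights standing in for degree(g) and E(g)$weight
deg    <- c(home = 1, bed = 2, man = 6)
weight <- c(1, 3, 4)

# Label size grows linearly with degree: 0.2 for a floor, plus up to 2.2 for the best-connected term
label_cex <- 2.2 * deg / max(deg) + 0.2

# Edge strength on a log scale, normalised so that the heaviest edge equals 1;
# the 0.4 offset keeps an edge of weight 1 (log = 0) from becoming fully transparent
egam <- (log(weight) + 0.4) / max(log(weight) + 0.4)

print(round(label_cex, 2))
print(round(egam, 2))
```

Because egam stays above 0 and never exceeds 1, it can be passed directly as the alpha (opacity) argument of rgb, as the plotting code does.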
…and straightaway we can see which words are more ‘weighted’, and even point out one or two clusters…
How about making this new graph interactive too? As before, just use tkplot:
tkplot(g, layout=layout1)
As usual, there is a plethora of options and settings at your disposal! Just run ?igraph::layout to see them! (We’re
specifying the package because you might have another layout function from another package.)