Text Mining, Social Network Analysis
Deolu Adeleye
Text Mining
Just as we can mine raw materials from ores, we can also intelligently ‘mine’ textual data from larger collections of data.
Once again, R proves to be a very powerful tool, with packages such as twitteR coming in quite handy, as we’ll soon demonstrate.
As a demonstration, we’ll mine textual information from the popular social network Twitter, examining tweets from the Twitter handle ‘@55wordsorless’ (though you could use any handle of your choice when running the code).
Do note that these demonstrations will require an active internet connection (at least in the beginning to authenticate),
and will be using the following R packages:
• twitteR
• tm
• wordcloud
• SnowballC
• RWeka
• igraph
The first step is to create a Twitter application for yourself. Go to https://twitter.com/apps/new and log in. After
filling in the basic info, go to the “Settings” tab and select “Read, Write and Access direct messages”. Make sure to
click on the save button after doing this. In the “Details” tab, take note of the following:
• your consumer key
• your consumer secret
• your access token
• your access secret
Once these four are retrieved, simply insert them into the setup_twitter_oauth function in the format
setup_twitter_oauth("API key", "API secret", "Access token", "Access secret"). Here’s ours with the corresponding
values inserted:
#load the twitteR package
library(twitteR)
#authenticate
setup_twitter_oauth(our_key,
our_secret,
our_token,
our_access_secret)
## [1] "Using direct authentication"
You only need to authenticate once per R session.
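If you’re following along, the call looks like this with placeholder strings (these values are hypothetical — substitute the four items you noted from the “Details” tab):
#hypothetical placeholder credentials - substitute your own values
setup_twitter_oauth("YOUR_API_KEY",
"YOUR_API_SECRET",
"YOUR_ACCESS_TOKEN",
"YOUR_ACCESS_SECRET")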
So, we’ve authenticated. Next, let’s just randomly mine a particular word, say ‘water’, from everywhere it was used
recently on Twitter.
#retrieve last 50 tweets where hashtag '#water' is used, for example
watertag<-searchTwitter('#water', n=50)
head(watertag,3)
## [[1]]
## [1] "FrozenMOVlE: #vsco #afterlight #winter #wisconsin #water #lake #michigan #frozen #milwaukee #city http:/
##
## [[2]]
## [1] "FrozenMOVlE: Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD
##
## [[3]]
## [1] "vikashprasad21: RT @WaterNetwork1: #DSRSD #Certified For #Water #Quality #Testing http://t.co/bkF2mz47q
Next, let’s get info from the particular user ‘@55wordsorless’:
#retrieve the last 100 tweets from the specified timeline
tweets <- userTimeline('55wordsorless', n=100)
head(tweets,3)
## [[1]]
## [1] "55WordsOrLess: @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next
##
## [[2]]
## [1] "55WordsOrLess: Join the conversation!! http://t.co/xqDAtWbzVK :D :D"
##
## [[3]]
## [1] "55WordsOrLess: @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D"
For our purposes, we’ll convert these into a data.frame object:
watertag_df <- twListToDF(watertag)
tweets_df <- twListToDF(tweets)
head(watertag_df,3)
##
## 1 #vsco #afterlight #winter #wisconsin #water #lake #michiga
## 2 Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+
## 3 RT @WaterNetwork1: #DSRSD #Certified For #Wat
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 <NA> 2015-01-02 21:06:40 FALSE
## 2 FALSE 0 <NA> 2015-01-02 21:06:21 FALSE
## 3 FALSE 0 <NA> 2015-01-02 21:05:22 FALSE
## replyToSID id replyToUID
## 1 <NA> 551122427166875648 <NA>
## 2 <NA> 551122347814825984 <NA>
## 3 <NA> 551122099348054016 <NA>
## statusSource
## 1 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 2 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 3 <a href="http://spinabell.com" rel="nofollow">spinabell</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 2 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 3 vikashprasad21 1 TRUE FALSE <NA> <NA>
head(tweets_df,3)
## text
## 1 @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next difficulty... :|
## 2 Join the conversation!! http://t.co/xqDAtWbzVK :D :D
## 3 @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 _MissJem_ 2014-11-28 18:03:43 FALSE
## 2 FALSE 0 <NA> 2014-10-28 14:19:27 FALSE
## 3 FALSE 0 BeautifulFeet_ 2014-10-09 19:48:46 FALSE
## replyToSID id replyToUID
## 1 538279840193839108 538392811075559424 434366153
## 2 <NA> 527102349727531008 <NA>
## 3 <NA> 520299852824334337 92370873
## statusSource
## 1 <a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M2)</a>
## 2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 3 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 55WordsOrLess 0 FALSE FALSE NA NA
## 2 55WordsOrLess 0 FALSE FALSE NA NA
## 3 55WordsOrLess 0 FALSE FALSE NA NA
After that, we’ll convert to a corpus (which is just a collection of text documents) using the tm package:
library(tm)
#build a corpus, and specify the source to be character vectors
watertag_corpus <- Corpus(VectorSource(watertag_df$text))
tweets_corpus <- Corpus(VectorSource(tweets_df$text))
The corpus allows us to perform certain manipulations with functions in the tm package. You should run ?Corpus
to see other possible sources of textual data you can harness.
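For instance, here’s a minimal sketch (with a hypothetical directory path) that builds a corpus from every plain-text file in a folder instead of from a character vector:
#build a corpus from all text files in a local folder - the path is hypothetical
docs_corpus <- Corpus(DirSource("path/to/texts", encoding="UTF-8"))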
Let’s proceed by first ‘cleaning’ our data:
#make a copy, just in case we might need the original later
watertag_1 <- watertag_corpus
tweets_1 <- tweets_corpus
# remove punctuation
watertag_corpus <- tm_map(watertag_corpus, removePunctuation)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
# remove numbers
watertag_corpus <- tm_map(watertag_corpus, removeNumbers)
tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
# convert to lower case
watertag_corpus <- tm_map(watertag_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
# remove whitespace
watertag_corpus <- tm_map(watertag_corpus, stripWhitespace)
tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
# remove stopwords such as 'you', 'me', etc.
watertag_corpus <- tm_map(watertag_corpus, removeWords, stopwords("english"))
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("english"))
# remove URLs
# We'll create a function to look for 'http' in our text, and then delete the links
removeURL <- content_transformer(function(x) gsub("http[[:alnum:]]*", "", x))
watertag_corpus <- tm_map(watertag_corpus, removeURL)
tweets_corpus <- tm_map(tweets_corpus, removeURL)
#inspect our results
inspect(head(watertag_corpus,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwaukee city
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## taking brothers place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## rt waternetwork dsrsd certified water quality testing
inspect(head(tweets_corpus,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## missjem someone already hasthough borrow dr whos tardis next difficulty
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## join conversation d d
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## beautifulfeet read mischievous thoughts well d
Other transformations possible with tm_map can be obtained by running getTransformations(); in the version of tm used here, that returns:
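getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"
## [4] "stemDocument"      "stripWhitespace"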
In many applications, words need to be stemmed to retrieve their radicals, so that various forms derived from a stem
would be taken as the same when counting word frequency. Stemming uses an algorithm that removes common word
endings for English words, such as “es”, “ed” and “’s”. For instance, words “update”, “updated” and “updating”
would all be stemmed to “updat”. It’s not mandatory (and sometimes it may be counter-productive), but it does
pay to understand what it does, so we’ll demonstrate:
# create a copy we'll stem
watertag_stemmed <- watertag_corpus
tweets_stemmed <- tweets_corpus
# stem words
library(SnowballC)
watertag_stemmed <- tm_map(watertag_stemmed, stemDocument)
tweets_stemmed <- tm_map(tweets_stemmed, stemDocument)
# inspect our stemmed results
inspect(head(watertag_stemmed,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwauke citi
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## take brother place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+38
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## rt waternetwork dsrsd certifi water qualiti test
inspect(head(tweets_stemmed,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## missjem someon alreadi hasthough borrow dr whos tardi next difficulti
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## join convers d d
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## beautifulfeet read mischiev thought well d
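As a quick check of what the stemmer does to individual words, you can also call SnowballC’s wordStem function directly (using the three forms mentioned earlier):
#all three forms reduce to the same radical
wordStem(c("update", "updated", "updating"))
## [1] "updat" "updat" "updat"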
A term-document matrix represents the relationship between terms and documents, where each row stands for a
term and each column for a document, and an entry is the number of occurrences of the term in the document.
In contrast, a document-term matrix is simply the transpose of the term-document matrix, with documents as
rows, and columns as terms.
So which should you use? Whichever you prefer!
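To make the distinction concrete, here’s a tiny sketch using a hypothetical two-document corpus (the printed layout may vary slightly by tm version):
#a toy corpus of two short 'documents'
toy <- Corpus(VectorSource(c("water water ice", "ice sun")))
#rows are terms, columns are documents
as.matrix(TermDocumentMatrix(toy))
##        Docs
## Terms   1 2
##   ice   1 1
##   sun   0 1
##   water 2 0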
#creating term-document matrices
watertag_tdm<-TermDocumentMatrix(watertag_corpus)
tweets_tdm<-TermDocumentMatrix(tweets_corpus)
#creating document-term matrices
watertag_dtm<-DocumentTermMatrix(watertag_corpus)
tweets_dtm<-DocumentTermMatrix(tweets_corpus)
# just to compare the two:
watertag_tdm
## <<TermDocumentMatrix (terms: 311, documents: 50)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)
watertag_dtm
## <<DocumentTermMatrix (documents: 50, terms: 311)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)
As seen above, apart from being transposes of each other, they’re practically the same.
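Incidentally, the RWeka package we listed at the start is what you’d use to count multi-word terms (n-grams) rather than single words; here’s a minimal sketch for bigrams, assuming RWeka and its Java dependency are installed:
library(RWeka)
#a tokenizer that splits each document into two-word sequences
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tweets_bigram_tdm <- TermDocumentMatrix(tweets_corpus, control=list(tokenize=BigramTokenizer))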
With our matrices, we can perform quite a number of operations. For example, if we wanted to know the frequency of
occurrence of some words:
#find terms which occur 5 times or more
findFreqTerms(watertag_dtm, 5)
## [1] "amp" "ice" "sun" "water"
#how about 10 times or more?
findFreqTerms(tweets_tdm, lowfreq=10)
## [1] "man" "now"
It is important to note that the results are ordered alphabetically, not according to frequency of occurrence.
If we want them ordered by frequency, we can obtain the counts as a vector by converting to a matrix and using the
rowSums function if we’re using a tdm, or colSums if a dtm:
#remember you can use either dtm or tdm - we're using both interchangeably just to demonstrate
watertag_freq <- colSums(as.matrix(watertag_dtm))
tweets_freq <- rowSums(as.matrix(tweets_tdm))
…and then we sort it in descending order, so it shows the terms with maximum occurrence first:
#display head of most frequent terms
head(sort(watertag_freq,decreasing=TRUE))
## water ice amp sun frozen fun
## 57 7 5 5 4 4
head(sort(tweets_freq,decreasing=TRUE))
## man now just said home never
## 10 10 7 7 6 6
We could even see the frequency of frequencies — that is, how many terms occur a given number of times:
head(table(watertag_freq),15)
## watertag_freq
## 1 2 3 4 5 7 57
## 200 85 16 6 2 1 1
head(table(tweets_freq),15)
## tweets_freq
## 1 2 3 4 5 6 7 10
## 569 85 24 21 4 4 2 2
This tells us that from our search, 200 terms occur just once; from our tweets, 569 terms occur just once; and
so forth…
We could also retrieve associations between words: if two words always appeared together, their correlation would be
1.0; if they never appeared together, 0.0. Those are the boundaries.
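Under the hood, findAssocs (used below) reports the ordinary Pearson correlation between the occurrence vectors of two terms across documents, so you can reproduce any single value by hand — here’s a sketch with ‘water’ and ‘ice’, two terms we know occur in our search results:
#correlation of two terms' occurrence counts across the 50 tweets
m <- as.matrix(watertag_dtm)
cor(m[, "water"], m[, "ice"])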
So, let’s say we wanted to see words that have at least a 0.5 correlation with the word ‘time’ in our search results:
findAssocs(watertag_dtm, "time", corlimit=0.5)
## $time
## numeric(0)
Note that a result of numeric(0) indicates no correlating words were found, meaning the word you searched for didn’t
occur (at the level of correlation you specified).
How about the words ‘trend’ and ‘food’ from our timeline, this time with a 0.4 correlation?
findAssocs(tweets_tdm, c("trend","food"), corlimit=0.4)
## $trend
## numeric(0)
##
## $food
## diner garbage” protested rat siryou tastes cook
## 1.00 1.00 1.00 1.00 1.00 1.00 0.70
## money paid yet stunned good like
## 0.70 0.70 0.70 0.57 0.49 0.49
What if we wanted to graphically represent our results? We could, and it only requires a few lines of code.
For example: let’s make a barplot of all the terms that occur at least 5 times in our text sources. (5 is quite a small
threshold, but it serves this particular example well.)
#using ggplot2 package
library(ggplot2)
#from our search on Twitter
qplot(names(watertag_freq[watertag_freq>=5]), watertag_freq[watertag_freq>=5], geom="bar",
stat="identity", xlab="Terms", ylab="Frequency", main="Search Results") + coord_flip()
Figure 1: Words Occurring At Least 5 Times
#from our timeline
qplot(names(tweets_freq[tweets_freq>=5]), tweets_freq[tweets_freq>=5], geom="bar",
stat="identity", xlab="Terms", ylab="Frequency", main="@55wordsorless Timeline") + coord_flip()
Figure 2: Words Occurring At Least 5 Times
Wordclouds are also a very cool graphical representation of textual information: the more frequently a word occurs,
the bolder and larger it is displayed, and vice versa.
By default the most frequent words have a font scale of 4 and the least have a scale of 0.5, but even that can be
changed, as we’ll demonstrate!
tweets_freq<-sort(tweets_freq,decreasing=TRUE)
watertag_freq<-sort(watertag_freq,decreasing=TRUE)
#wordcloud package allows us to produce wordclouds
library(wordcloud)
#each time wordcloud is run, it randomly produces a layout.
#Though it doesn't really matter, you can set the seed to keep the layout the same
set.seed(77)
#'min.freq' specifies the minimum frequency of the words to be plotted
wordcloud(names(tweets_freq), tweets_freq, min.freq=3)
Figure 3: Wordcloud Using min.freq
#max.words specifies the maximum number of words it should plot
#scale changes font scale
wordcloud(names(tweets_freq), scale=c(5, .1), tweets_freq, max.words=100)
## Warning in wordcloud(names(tweets_freq), scale = c(5, 0.1), tweets_freq, :
## now could not be fit on page. It will not be plotted.
Figure 4: Wordcloud Using max.words
#just adding some colour!
set.seed(79)
wordcloud(names(watertag_freq), watertag_freq, min.freq=2,
random.color=TRUE,colors=rainbow(7))
Figure 5: Wordcloud With Colour!
Run ?wordcloud for even more options you can specify.
Word clusters can also be generated.
Hierarchical
Let’s first remove sparse terms — words that occur in very few documents and are not so important — with
removeSparseTerms. The value of sparse is the maximum allowed sparsity: terms absent from a greater proportion of
documents than this are removed, so with sparse=0.95 a term is kept only if it appears in at least 5% of the documents.
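A quick way to see the effect is to compare term counts before and after pruning (nTerms is a helper from tm); in our data the pruned matrix keeps only 31 terms, as the k-means section below will show:
#number of distinct terms before and after pruning
nTerms(tweets_tdm)
nTerms(removeSparseTerms(tweets_tdm, sparse=0.95))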
#we're using 0.95 because our text source has only a few terms, and not many recurring words
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
distMatrix <- dist(scale(tweets_sparsed))
fit <- hclust(distMatrix, method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k=10)
(groups <- cutree(fit, k=10))
## ’ll bed check died discovered good
## 1 2 3 1 3 4
## home just kill know last like
## 5 6 7 8 7 4
## love made man marry mischievous never
## 1 8 1 1 3 2
## new now one said smiled steel
## 4 9 4 1 4 3
## still swore take things took went
## 6 2 1 3 10 5
## wife
## 2
Figure 6: Cluster Dendrogram
K-means
We can also use k-means clustering in our analysis. However, for this you MUST use a document-term matrix: kmeans clusters the rows of its input, and here we want to cluster documents, not terms.
#using DOCUMENT-TERM matrix
(tweets_sparsed <- removeSparseTerms(tweets_dtm, sparse=0.95))
## <<DocumentTermMatrix (documents: 79, terms: 31)>>
## Non-/sparse entries: 153/2296
## Sparsity : 94%
## Maximal term length: 11
## Weighting : term frequency (tf)
#setting our value of k
k <- 4
kmeansResult <- kmeans(tweets_sparsed, k)
# cluster centers
round(kmeansResult$centers, digits=3)
## ’ll bed check died discovered good home just kill know last
## 1 0.034 0 0 0.051 0 0.051 0.051 0.102 0.068 0.068 0.051
## 2 0.000 0 1 0.000 1 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.167 0 0 0.083 0 0.083 0.000 0.083 0.000 0.083 0.083
## 4 0.000 1 0 0.000 0 0.000 0.750 0.000 0.000 0.000 0.000
## like love made man marry mischievous never new now one said
## 1 0.051 0.051 0.068 0.0 0.000 0.017 0.051 0.068 0.119 0.085 0.000
## 2 0.000 0.000 0.000 1.0 0.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.083 0.083 0.083 0.5 0.333 0.000 0.000 0.000 0.250 0.000 0.583
## 4 0.000 0.000 0.000 0.0 0.000 0.000 0.750 0.000 0.000 0.000 0.000
## smiled steel still swore take things took went wife
## 1 0.102 0 0.068 0.000 0.017 0 0.085 0.051 0.017
## 2 0.000 1 0.000 0.000 0.000 1 0.000 0.000 0.000
## 3 0.000 0 0.000 0.083 0.250 0 0.083 0.000 0.083
## 4 0.000 0 0.000 0.750 0.000 0 0.000 0.250 0.500
To make things easier, let’s just print the top three words in every cluster:
for (i in 1:k)
{
cat(paste("cluster ", i, ": ", sep=""))
s <- sort(kmeansResult$centers[i,], decreasing=T)
cat(names(s)[1:3], "\n")
# if you want to print the tweets of every cluster, run the next line
# print(tweets[which(kmeansResult$cluster==i)])
}
## cluster 1: now just smiled
## cluster 2: check discovered man
## cluster 3: said man marry
## cluster 4: bed home never
Social Network Analysis
First, we want to produce a term-term matrix, which is basically just a network of terms based on their co-occurrence
in tweets. It is the matrix product of the term-document matrix and its transpose, the document-term matrix. (We
produce the matrix product using the %*% operator.)
#matrix product;
#using sparsed tweets because original tdm in our example had too many sparse terms
#transposing with the t() function
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
termTerm <- as.matrix(tweets_sparsed) %*% as.matrix(t(tweets_sparsed))
#inspect few rows and columns
termTerm[1:10,1:10]
## Terms
## Terms ’ll bed check died discovered good home just kill know
## ’ll 4 0 0 0 0 0 0 0 1 0
## bed 0 4 0 0 0 0 3 0 0 0
## check 0 0 4 0 4 0 0 0 0 0
## died 0 0 0 4 0 0 0 1 0 1
## discovered 0 0 4 0 4 0 0 0 0 0
## good 0 0 0 0 0 4 0 1 0 0
## home 0 3 0 0 0 0 8 0 0 0
## just 0 0 0 1 0 1 0 7 0 0
## kill 1 0 0 0 0 0 0 0 4 0
## know 0 0 0 1 0 0 0 0 0 5
After this, we can use the igraph package to represent this network of terms graphically in a visually-appealing way:
library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termTerm, weighted=T, mode="undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# setting seed to make the layout reproducible
set.seed(1001)
#call to plot network
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout=layout1)
What if we wanted a different layout?
plot(g, layout=layout.kamada.kawai)
Figure 7: Network Of Terms
Figure 8: Different Layout
What if we wanted an interactive network plot? Easy!
tkplot(g, layout=layout1)
In fact, in our interactive graphs, we can just change the layouts immediately by selecting different options in the
Layout tab.
But the above just produces a graph with a lot of connections. What if we wanted to see straightaway which terms were
more important? Which connections were stronger? We can do that by specifying options with the following code:
#scale label size by vertex degree, so better-connected terms stand out ('V' accesses vertices)
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
#color
V(g)$label.color <- rgb(0, 0, .2, .8)
#no frame
V(g)$frame.color <- NA
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
# access edges 'E'
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)
…and straightaway we can see which words are more ‘weighted’, and even point out one or two clusters…
How about making this new graph interactive too? As before, just use tkplot:
tkplot(g, layout=layout1)
As usual, there is a plethora of options and settings at your disposal! Just run ?igraph::layout to see them! (We’re
specifying the package because you might have another layout function from another package.)
Figure 9: Weighted Network