2. Table of Contents
I. Goal of research and limitations
II. Data Preparation
   a. Scrape text from blogs (blogs.korrespondent.net)
   b. Stemming and cleaning
   c. Bottlenecks mining Cyrillic
III. Text Mining & clustering
   a. Term normalization (TF-IDF). Most Frequent and Correlated terms
   b. Hierarchical clustering with hclust
   c. Clustering with k-means and k-medoids
   d. Comparing clusters
IV. Conclusion
4. demonstrate the most popular practices for mining dissimilar texts with a low number
of observations
mine blogs on https://blogs.korrespondent.net and identify the most discussed topics
identify bottlenecks when mining Cyrillic
perform hierarchical clustering with the hclust method
perform clustering using the k-means and k-medoids methods
compare the results
6. no initial blog categorization by date range, subject(s), author(s) etc.*
last 245 blogs** from blogs.korrespondent.net as of the day of analysis
blogs with less than 1 KB of plain text are excluded
* The goal is not to achieve the best cluster accuracy, but to identify the most discussed
subjects (clusters).
** 245 is the count after excluding empty and small blogs (<1 KB) from the initial 400 blogs
10. How to scrape blogs..
HTML parsing
parse each page and get urls
not transparent
RSS feed
keeps only 1 day history
Twitter (@Korr_blog)
each tweet has blog URL
easy and transparent for R
11. Parse tweets
Get tweets
Extract URL from text
Remove empty URLs
Unshorten double-shortened URLs
Validate URLs
Remove duplicates
..
[269] "http://blogs.korrespondent.net/journalists/blog/anna-radio/a51779"
[270] "http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727"
[271] "http://blogs.korrespondent.net/celebrities/blog/press13/a51764"
[272] "http://blogs.korrespondent.net/celebrities/blog/olesdoniy/a51736"
[273] "http://blogs.korrespondent.net/journalists/blog/raimanelena/a51724"
..
* Full R code is available at the end
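The extraction, empty-removal, and de-duplication steps listed above can be sketched in base R without any Twitter access. This is a minimal illustration, not the author's code: the tweet texts and the `example.org` URLs are made up, and `extract_urls` is a hypothetical helper name.

```r
# Hypothetical tweet texts; the real ones come from the Twitter timeline.
tweets <- c(
  "New blog http://example.org/a1",
  "No link in this tweet",
  "Again http://example.org/a1",
  "Another one http://example.org/b2 "
)

# Take the first http:// URL of each tweet. regmatches() with regexpr()
# silently drops tweets without a match, so no empty entries remain.
extract_urls <- function(texts) {
  hits <- regmatches(texts, regexpr("http://\\S+", texts))
  unique(hits)  # remove duplicates, keeping first-seen order
}

urls <- extract_urls(tweets)
print(urls)
```

Unshortening and validation are left out here because they need network access.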
13. Clean texts
Translate all blogs into English
Extract the translated text from the HTML code
Load texts into a Corpus
Map to lower case; remove punctuation, stop words, and numbers; strip white space
Stem document
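The cleaning steps can be sketched in base R as follows. This is an illustration under assumptions: the stop-word list is a tiny made-up one (the slides do not show the list that was used), `clean_text` is a hypothetical helper that handles a single string, and stemming is not shown because it needs a stemmer package.

```r
# Small illustrative stop-word list (an assumption, not the list used in the slides).
stop_words <- c("the", "of", "is", "in", "for", "a", "so", "before")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                        # map to lower case
  x <- gsub("[[:punct:]]+", " ", x)      # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)      # remove numbers
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")           # rejoin with single spaces
}

raw <- "Before the official launch of the Euro is 150 days.In the midst of the so-called operational preparation"
cleaned <- clean_text(raw, stop_words)
print(cleaned)
```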
15. Declensions in RU/UA words: after stemming, the same word still has several forms
0xFF problem ("я" is byte 0xFF in windows-1251): DocumentTermMatrix (in R) crops texts
E.g. "янукович" is filtered out, "объявлять" becomes "объ", "братья" becomes "брать" (the sense changes), etc.
Cyrillic texts with pseudo-graphic or special symbols can't be encoded properly with the windows-1251 charset (an additional uurlencoded filter is required, which is not supported in R)
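The byte behind the 0xFF problem can be checked directly. The sketch below assumes the platform's iconv supports the `CP1251` name for windows-1251; it only shows the encoding of "я" and flags words that contain it, and does not reproduce the DocumentTermMatrix cropping itself.

```r
ya <- "\u044f"  # Cyrillic small letter "ya"

# Convert to windows-1251 and look at the raw byte.
bytes <- charToRaw(iconv(ya, from = "UTF-8", to = "CP1251"))
print(bytes)

# Flag words that contain the letter (first word is "yanukovich" in Cyrillic).
words <- c("\u044f\u043d\u0443\u043a\u043e\u0432\u0438\u0447", "euro")
affected <- grepl(ya, words, fixed = TRUE)
print(affected)
```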
16. Translate texts into English
#see the full code in Appendix F
require(RCurl) #getURL
require(XML) #htmlTreeParse, htmlParse, getNodeSet, xmlAttrs, xmlValue
go_tr <- function(url) {
#build the Google Translate URL for the blog page
src.url <- URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
#take the first attribute of the frame named "c"
frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
params <- sapply(frame.c, function(t) xmlAttrs(t)[[1]])
#...
#html is defined in the elided part above
dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
#...
dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
return(dest.url)
}
[1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"
[1] "http://translate.googleusercontent.com/translate_c?
rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJ
rhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
20. Original Blog Text. Example
До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная
подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами,
обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с
украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую
очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС.
...
Translated & extracted Text
Before the official launch of the Euro is 150 days.In the midst of the so-called operational
preparation for the championship.It is about establishing communication between the host cities,
staff training and marafet hover as a whole.It's no secret that, in comparison with the
Ukrainians, the Poles were far more dividends in preparation for the Championship.First of all,
we are talking about bringing considerable resources through financing from EU funds.
...
Cleaned Text
official launch euro days midst called operational preparation championship establishing
communication host cities staff training marafet hover secret comparison ukrainians poles
dividends preparation championship talking bringing considerable resources financing eu funds
...
Stemmed Text
offici launch euro day midst call oper prepar championship establish communic host citi staff
train marafet hover secret comparison ukrainian pole dividend prepar championship talk bring
consider resourc financ eu fund
...
30. Cluster Dendrogram*
#input – DTM normalized with TF-IDF (349 terms, sparse=0.7)
d <- dist(dtm2.df.scale, method = "euclidean") # dissimilarity matrix
#clustering with Ward's method (the method was named "ward" before R 3.1.0)
fit <- hclust(d=d, method="ward.D") #compare: "complete","single","mcquitty","median", "centroid"
* Full result of h-clustering is available in pdf
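A self-contained toy version of the same calls may help when experimenting with linkage methods. The data below are random numbers standing in for the scaled document-term data (an assumption; the real input is `dtm2.df.scale`), and `cutree` is added only to show how a flat grouping can be read off the tree.

```r
set.seed(1)
# Toy stand-in for the scaled document-term data: 6 "documents", 2 "terms".
x <- rbind(
  matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2)
)
rownames(x) <- paste0("doc", 1:6)

d <- dist(x, method = "euclidean")   # dissimilarity matrix
fit <- hclust(d, method = "ward.D")  # swap in "complete", "single", ... to compare
groups <- cutree(fit, k = 2)         # cut the dendrogram into 2 clusters
print(groups)
# plot(fit) would draw the dendrogram
```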
31. Hierarchical Clustering Summary
universal hierarchical clustering with different algorithms, e.g. Ward's objective
function based on squared Euclidean distance (it is worth trying other methods too)
good with a large number of terms and a small number of observations
gives an understanding of the correlation between terms in the Corpus
provides a visual representation of how clusters are nested within each other
33. Description of the k-means algorithm*
1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
* Source: http://en.wikipedia.org/wiki/K-means
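The four steps can be written out in a few lines of base R. This is a plain sketch of the textbook iteration for illustration only; `simple_kmeans` is a hypothetical name, the data are random toy points, and it is not the Hartigan-Wong algorithm that `kmeans` uses by default.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: pick k observations at random as the initial means.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: assign every observation to the nearest mean.
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    new_cluster <- max.col(-d, ties.method = "first")
    # Step 4: stop once the assignments no longer change.
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # Step 3: the centroid of each cluster becomes the new mean.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
x <- rbind(
  matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2)
)
res <- simple_kmeans(x, k = 2)
print(res$cluster)
```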
34. Assess number of clusters using kmeans$withinss
fewer terms in DTM: higher sum of squares, better cluster quality
more terms in DTM: lower sum of squares, lower cluster quality
Unexpected expected results
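One common way to obtain such a curve is to sum `kmeans$withinss` for a range of cluster counts. The sketch below uses random toy data with three groups (an assumption; the slides use the TF-IDF matrix), and computes the k=1 value from the column variances in the same way as the appendix code.

```r
set.seed(42)
# Toy data: three well-separated groups of 20 points each.
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)

ks <- 1:6
wss <- numeric(length(ks))
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))  # k = 1: total sum of squares
for (k in ks[-1]) {
  fit <- kmeans(x, centers = k, iter.max = 40, nstart = 10)
  wss[k] <- sum(fit$withinss)                    # total within-cluster sum of squares
}
print(round(wss, 1))
# plot(ks, wss, type = "b") would show the "elbow"
```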
35. Clustering with 20 centers
#dtm.k – DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
#nstart – let's try 10 random starts to generate centroids
#algorithm – "Hartigan-Wong" (default)
> dtm.clust<-kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")
Cluster sizes
> dtm.clust$size
[1] 41 21 4 1 1 5 1 7 12 5 98 2 3 7 10 1 4 2 1 11
Sum of squares
> dtm.clust$withinss
[1] 0.75166171 0.37998302 0.08702162 0.00000000 0.00000000 0.10884947 0.00000000 0.21350480 0.22052166
[10] 0.07426058 1.35245927 0.03003547 0.05145358 0.12662083 0.25722734 0.00000000 0.08037547 0.02691182
[19] 0.00000000 0.22561816
* See the full R code in the Appendices
39. Assess number of clusters with pam$silinfo$avg.width
Recommended number of clusters: 2. However …
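The average silhouette width can be collected for several values of k and the maximum taken as the recommendation. A minimal sketch on random toy data with two groups (an assumption), using `pam` from the `cluster` package; the names `asw` and `k.best` follow the appendix code.

```r
library(cluster)  # pam()

set.seed(7)
# Toy data: two well-separated groups of 20 points each.
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 8), ncol = 2)
)

ks <- 2:6
asw <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)
k.best <- ks[which.max(asw)]  # k with the highest average silhouette width
print(round(asw, 2))
print(k.best)
```

As the slide notes, the silhouette optimum may be too coarse for topic discovery, so the recommended k is a starting point rather than a final answer.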
40. Perform clustering with 20 centers
#max_diss, av_diss – maximum/average dissimilarity between the observations in a cluster and the cluster's medoid
#diameter – maximum dissimilarity between two observations in the cluster
#separation – minimal dissimilarity between an observation in the cluster and an observation of another cluster
Result: 4 clusters
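The columns described above are those of `pam(...)$clusinfo`. A small sketch on random toy data (an assumption; the slides cluster the TF-IDF matrix with 20 centers) shows where they come from:

```r
library(cluster)  # pam()

set.seed(7)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 8), ncol = 2)
)

fit <- pam(x, k = 2)
info <- fit$clusinfo  # one row per cluster
print(colnames(info))
print(round(info, 2))
```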
44. Recognized clusters* ([cluster - # of blogs])
"Ukrainian elections" [2-21]
"Ukrainian democracy" [3-4]
"social networks, ex.ua" [6-5]
"tymoshenko, opposition, court" [8-7]
"Euro-2012" [9-12]
"Ukraine-Russia relations, gas" [10-5]
"Ukrainian taxes" [12-2]
"Ukraine-EU relations" [14-7]
"protests, demonstrations, human rights" [15-10]
"culture, regulation" [17-4]
"journalist investigations" [20-11]
[13-3]
"all other blogs with various topics" (unrecognized)
Total blogs recognized: 91 of 236 (~40%)
* Based on kmeans
45. Conclusion
the number of elements in the data vector (349) should be significantly smaller than the number of
observations (245)
some of the resulting clusters include "unlike" blogs (see sum of squares)
try kmeans for better precision when mining big dissimilar texts with a low number
of observations. In other cases kmedoids is a more robust model
focus on similar texts for best accuracy (by category, date range)
sentiment analysis would make the analysis even richer
53. Appendix G. R Code. Parse tweets and extract URLs
require(twitteR)
require(plyr) #laply
kb_tweets<-userTimeline('Korr_Blogs', n=400)
#get text of tweets
urls<-laply(kb_tweets, function(t) t$getText())
#extract urls from text
url_expr<-regexec("http://[a-zA-Z0-9].\\S*$", urls)
urls<-regmatches(urls, url_expr)
#remove empty elements from the list
urls[sapply(urls, length)<1]<-NULL
#unshorten double-shortened urls
for(i in seq_along(urls)) { urls[i]<-decode_short_url(decode_short_url(urls[[i]])) }
#remove duplicates
urls<-as.list(unique(unlist(urls)))
#...
#contact me for the rest part of the code
#...
54. Appendix H. R Code. Handle blogs
for(i in seq_along(urls))
{
#translate blogs into English
url<-go_tr(urls[[i]])
blogs<-readLines(tc<-textConnection(url))
close(tc)
pagetree<-try(htmlTreeParse(blogs, useInternalNodes=TRUE, encoding="windows-1251"))
if(inherits(pagetree, "try-error")) next
#extract the text nodes of the translated article
x<-xpathSApply(pagetree,
"//div[@class='article_box']//*[@onmouseover]/text()|//div[@class='article_box']//*[@onmouseover]/a/text()",
xmlValue)
#split into lines and trim leading/trailing white space
x <- unlist(strsplit(x, "\n"))
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
#...
}
#...
#contact me for the rest part of the code
#...
55. Appendix I. R Code. Manage TermDocumentMatrix
#...
corp <- Corpus(DirSource("//kor_blogs/en"),readerControl=list(language="en", encodeString="windows-1251"))
#..
#Clean texts, stemming and so on
#...
#Create DTM for both stemmed and not-stemmed Corpuses
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, sparse=0.995) #0.995 - for both EN and RU
#...
#Find Most Frequent and Associated terms
#Build Correlation Matrix
#..
corrplot(type="lower", tl.cex=.6, corr=corr_stem_ru, title="Correlation matrix", tl.col="grey20",
method="circle", order="FPC", addtextlabel = "ld", outline=TRUE)
#...
#contact me for the rest part of the code
#...
56. Appendix J. R Code. Hierarchical clustering
#...
dtm2<-as.TermDocumentMatrix(dtm)
#...
dtm2.df<-as.data.frame(inspect(dtm2))
#...
(d <- dist(dtm2.df.scale, method = "euclidean")) # distance matrix
fit <- hclust(d=d, method="ward.D") #the method was named "ward" before R 3.1.0
#..
dev.off()
#...
#contact me for the rest part of the code
#...
57. Appendix K. R Code. Wordcloud (most frequent terms)
require(wordcloud)
#...
dtm.m <- as.matrix(dtm)
v <- apply(dtm.m,2,sum) #calculate number of occurrences for each word
v <- sort(v, decreasing=TRUE)
#..
wordcloud(d$word, d$freq, scale=c(9,.1), min.freq=3, max.words=Inf, random.order=F, rot.per=.3, colors=pal2)
#...
#contact me for the rest part of the code
#...
58. Appendix L. R Code. kmeans analysis
#...
# assess number of clusters
wss <- (nrow(dtm)-1)*sum(apply(dtm,2,var)) #for stemmed DTM
dtm_orig <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf)) # non-stemmed DTM
dtm_orig <- removeSparseTerms(dtm_orig, sparse=0.995)
#...
# visualize withinss
# perform clustering
#dtm.k – DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
#nstart – let's try 10 random starts to generate centroids
#algorithm – "Hartigan-Wong" (default)
dtm.clust<-kmeans(x=dtm.k,centers=20,iter.max=40, nstart=10, algorithm="Hartigan-Wong")
dtm.clust$size
#...
#contact me for the rest part of the code
#...
59. Appendix M. R Code. kmedoids analysis
#...
# assess number of clusters
# visualize average silhouette width
ggplot()+geom_line(aes(x=1:236, y=asw),size=1,colour="red4") + theme(axis.text.x=element_text(hjust=0,
colour="grey20", size=14), axis.text.y=element_text(size=14, colour="grey20"),
axis.title.x=element_text(size=20, colour="grey20"), axis.title.y=element_text(angle=90, size=20,
colour="grey20")) + labs(y="average silhouette width", x="k-medoids (# clusters)",size=16) +
scale_x_continuous(breaks=c(k.best,20,40,60,80,100,120,140,160,180,200,220))
# perform kmedoids clustering
#...
dtm.clust.m$clusinfo
#...
#contact me for the rest part of the code
#...
60. Appendix N. R Code. Visualize clusters
#...
#define which cluster to visualize
dtm.clust.v<-dtm.clust # for kmeans
dtm.clust.v<-dtm.clust.m # for kmedoids
#...
require(vegan)
# distance matrix
dtm.dist <- dist(dtm.k, method="euclidean")
dtm_scaled <- cmdscale(dtm.dist) # scale from multi dimensions to two dimensions
#...
for(i in seq_along(groups)){
points(dtm_scaled[factor(dtm.clust.v$cluster) == groups[i], ], col = cols[i], pch = 20)
}
# draw ordihull
ordihull(dtm_scaled, factor(dtm.clust.v$cluster), lty = "dotted", draw="polygon", label=TRUE)
#draw Voronoi diagram
#...
#contact me for the rest part of the code
#...
61. Appendix O. R Code. Visualize heatmaps
#...
dtm.clust.v<-dtm.clust # for kmeans
dtm.clust.v<-dtm.clust.m # for kmedoids
dtm0 <- dtm.k #dtm for kmeans clustering
dtm0 <- removeSparseTerms(dtm0, sparse=0.7) #keep terms with sparsity below 0.7, i.e. present in more than 30% of blogs
dtm.df <- as.data.frame(inspect(dtm0))
dfc <- cbind(dtm.df, id=seq(nrow(dtm.df)), cluster=dtm.clust.v$cluster) #Append id and cluster
#...
require(ggplot2)
dev.off()
dev.new()
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value)) +
theme(axis.text.x=element_text(angle=90, hjust=0, colour="grey20", size=14)) + labs(x="", y="")
#...
#contact me for the rest part of the code
#...