Text Mining with R


       Aleksei Beloshytski
           Kyiv, 2012-Feb
Table of Contents
I.   Goal of research and limitations

II.  Data Preparation
       1. Scrape text from blogs (blogs.korrespondent.net)
       2. Stemming and cleaning
       3. Bottlenecks mining Cyrillic

III. Text Mining & Clustering
       1. Term normalization (TF-IDF). Most frequent and correlated terms
       2. Hierarchical clustering with hclust
       3. Clustering with k-means and k-medoids
       4. Comparing clusters

IV.  Conclusion
Goals..
demonstrate the most popular practices for mining dissimilar texts with a small number
of observations

mine blogs on https://blogs.korrespondent.net and identify the most discussed topics

identify bottlenecks when mining Cyrillic

perform hierarchical clustering with the hclust method

perform clustering using the k-means and k-medoids methods

compare results
Limitations..
no initial blog categorization by date range, subject(s), author(s), etc.*

  the last 245 blogs** from blogs.korrespondent.net as of the day of analysis
  blogs with less than 1 KB of plain text excluded




* The goal is not to achieve the best cluster accuracy, but to identify the most discussed
subjects (clusters).
** 245 – after excluding empty and small blogs (<1 KB) from the initial 400 blogs
Step 1.
Scrape text from blogs
How to scrape blogs..
HTML parsing
parse each page and get URLs
not transparent
RSS feed
keeps only 1 day of history


Twitter (@Korr_blog)
each tweet has the blog URL
easy and transparent for R
Parse tweets

   Get tweets
   Extract URL from text
   Remove empty URLs
   Unshorten double-shorted URLs
   Validate URLs
   Remove duplicates



                        ..
                        [269]   "http://blogs.korrespondent.net/journalists/blog/anna-radio/a51779"
                        [270]   "http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727"
                        [271]   "http://blogs.korrespondent.net/celebrities/blog/press13/a51764"
                        [272]   "http://blogs.korrespondent.net/celebrities/blog/olesdoniy/a51736"
                        [273]   "http://blogs.korrespondent.net/journalists/blog/raimanelena/a51724"
                        ..




   * Full R code is available at the end
Step 2.
Stemming and Cleaning
Clean texts
   Translate all blogs into English
   Extract the translated text from the HTML code
   Load the texts into a Corpus
   Map to lower case; remove punctuation, stop words and numbers; strip whitespace
   Stem each document
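
A minimal sketch of this cleaning pipeline with the tm package (the directory path is an assumption, and SnowballC is used here in place of the older Snowball/Rstem packages listed in Appendix E):

require(tm)
require(SnowballC)
corp <- Corpus(DirSource("kor_blogs/en"), readerControl = list(language = "en"))
corp <- tm_map(corp, content_transformer(tolower))       # map to lower case
corp <- tm_map(corp, removePunctuation)                  # remove punctuation
corp <- tm_map(corp, removeWords, stopwords("english"))  # remove stop words
corp <- tm_map(corp, removeNumbers)                      # remove numbers
corp <- tm_map(corp, stripWhitespace)                    # strip extra whitespace
corp <- tm_map(corp, stemDocument)                       # stem each document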
Bottlenecks mining Cyrillic texts
declensions in RU/UA words: after stemming, the same word still has several forms


0xFF problem ("я" is byte 0xFF in windows-1251): DocumentTermMatrix (in R) crops such texts.
E.g. 'янукович' – filtered out, 'объявлять' – 'объ', 'братья' – 'брать' (the sense changes), etc.


Cyrillic texts with pseudo-graphic or special symbols can't be encoded properly with the windows-
1251 charset (an additional URL-encoding filter is required, which is not supported in R)
Translate texts into English
     #see the full code in Appendix F
     go_tr <- function(url) {
         src.url<-URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
         html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
         frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
         params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
         #...
         dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
         #...
         dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
         return(dest.url)
       }




[1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"


[1] "http://translate.googleusercontent.com/translate_c?
rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJ
rhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
Original Blog Text. Example
До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная
подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами,
обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с
украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую
очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС.
...

 Translated & extracted Text
Before the official launch of the Euro is 150 days.In the midst of the so-called operational
preparation for the championship.It is about establishing communication between the host cities,
staff training and marafet hover as a whole.It's no secret that, in comparison with the
Ukrainians, the Poles were far more dividends in preparation for the Championship.First of all,
we are talking about bringing considerable resources through financing from EU funds.
...

 Cleaned Text
official launch euro days midst called operational preparation championship establishing
communication host cities staff training marafet hover secret comparison ukrainians poles
dividends preparation championship talking bringing considerable resources financing eu funds
...

 Stemmed Text
offici launch euro day midst call oper prepar championship establish communic host citi staff
train marafet hover secret comparison ukrainian pole dividend prepar championship talk bring
consider resourc financ eu fund
...
Step 3.
Text Mining & Clustering
Text Mining and Clustering
    Build a normalized TermDocumentMatrix. Remove sparse terms
    Hierarchical clustering, dendrogram
    k-means: perform clustering and visualize the clusters
    k-medoids: perform clustering and visualize the clusters
Term Normalization
DocumentTermMatrix structure (rows = Docs, nrow = 237; columns = Terms, ncol = 4101;
cell values = TF-IDF weights):

                0.0175105020782697,   ...   0.019135397913606,
                0.0095258656396137,   ...   0.017510502078269,
                0.0099078198722524,   ...   0.014062173579334,
                0.0163576201358285,   ...   0.014114967574557,
                ...
                0.0113371897967796,   ...   0.014732724300492,
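
A minimal sketch of how this matrix is produced (corpus object corp assumed; sparse threshold as in Appendix I):

require(tm)
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
dim(dtm)                                       # documents x terms, e.g. 237 x 4101
dtm <- removeSparseTerms(dtm, sparse = 0.995)  # drop the rarest terms
inspect(dtm[1:3, 1:4])                         # peek at a few TF-IDF weights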
Most Frequent & Correlated Terms.
Why is that important?
Most Frequent Terms (Latin)

       Non stemmed terms
       > findFreqTerms(dtm, lowfreq=1)


[1] "country"         "euro"         “european"   "government" "internet"   "kiev"       "kyiv"  "money"
[9] "opposition"      "party"        "people"     "political“ "power"       "president" "russia" "social"
[17] "society"        "tymoshenko"   "ukraine“    "ukrainian" "world"       "yanukovych"




       Stemmed terms
       > findFreqTerms(dtm, lowfreq=1)

[1]    "chang“    "countri"      "elect"      "euro"       "european“   "govern"       "internet“   "kiev“
[9]    "kyiv"     "leader"       "money"      "opposit"    "parti"      "peopl"        "polit"      "power“
[17]   "presid"   "russia"       "russian"    "social"     "societi"    "tymoshenko“   "ukrain"     "ukrainian“
[25]   "world"    "yanukovych"




 * See the full R code in Appendixes
Correlated Terms (Cyrillic vs Latin). Example
     >findAssocs(dtm, 'евр', 0.35) #correlation with term “евро”



      евр     старт   гарант     хлеб     тыс   талисман   официальн   воплощен   будущ   чемпионат   живет
     1.00      0.76     0.74     0.71    0.62       0.55        0.49       0.48    0.35        0.31    0.22
подготовк    реплик   секрет   футбол
     0.22      0.22     0.21     0.21




     >findAssocs(dtm, 'euro', 0.35)


   euro      championship      footbal   tourist     airport    tournament         fan    poland
   1.00              0.68         0.57      0.49        0.45          0.43        0.42      0.42
horribl     infrastructur      foreign    patrol     unhappi        prepar    flashmob
   0.38              0.38         0.37      0.37        0.37          0.36        0.35
Correlation Matrix (Latin vs Cyrillic). Example




   English Terms: higher correlation, better term accuracy
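
A sketch of how such a correlation matrix can be built from the DTM columns for the frequent terms and drawn with corrplot (object names are assumptions; the original code in Appendix I uses corr_stem_ru for the Cyrillic variant):

require(corrplot)
freq <- findFreqTerms(dtm, lowfreq = 1)    # frequent terms from the previous slides
m    <- as.matrix(dtm[, freq])             # restrict the DTM to those terms
corr <- cor(m)                             # pairwise term-term correlations
corrplot(corr, method = "circle", type = "lower",
         order = "FPC", tl.cex = 0.6, tl.col = "grey20")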
Hierarchical Clustering (hclust)
Cluster Dendrogram*
           #input – DTM normalized with TF-IDF (349 terms, sparse=0.7)
           d <- dist(dtm2.df.scale, method = "euclidean") # dissimilarity matrix
           #clustering with Ward's method
           fit <- hclust(d=d, method="ward") #compare: "complete","single","mcquitty","median", "centroid"




* Full result of h-clustering is available in pdf
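
To draw and cut the dendrogram, something like the following would follow the hclust() call (k = 20 groups is illustrative):

plot(fit, cex = 0.6, main = "Cluster Dendrogram")   # draw the dendrogram
groups <- cutree(fit, k = 20)                       # cut the tree into 20 groups
rect.hclust(fit, k = 20, border = "red")            # outline the groups on the plot
table(groups)                                       # blogs per group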
Hierarchical Clustering Summary

    universal hierarchical clustering with different algorithms, e.g. Ward's objective
    function based on squared Euclidean distance (it is worth experimenting with other methods)

    good with a large number of terms and a small number of observations

    gives an understanding of the correlations between terms in the Corpus

    provides a visual representation of how clusters are nested within each other




* The full result of h-clustering is available in pdf
Clustering with kmeans
Description of the k-means algorithm*


 1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
 2) k clusters are created by associating every observation with the nearest mean. The partitions
    here represent the Voronoi diagram generated by the means.
 3) The centroid of each of the k clusters becomes the new means.
 4) Steps 2 and 3 are repeated until convergence has been reached.




* Source: http://en.wikipedia.org/wiki/K-means
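
A toy illustration of these four steps on random 2-D data (purely illustrative, not the blog data):

set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))   # three loose point clouds
km <- kmeans(x, centers = 3)                   # iterate steps 2-3 until convergence
plot(x, col = km$cluster, pch = 20)            # observations coloured by cluster
points(km$centers, pch = 4, cex = 2, lwd = 3)  # final means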
Assess number of clusters using kmeans$withinss


fewer terms in the DTM  –  higher sum of squares  –  better cluster quality

more terms in the DTM   –  lower sum of squares   –  lower cluster quality




                  Unexpected expected results
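
A sketch of the curves behind this slide: the total within-cluster sum of squares over a range of k, continuing the partial code in Appendix L (the range of k is an assumption):

m   <- as.matrix(dtm)                              # TF-IDF DTM as a plain matrix
wss <- (nrow(m) - 1) * sum(apply(m, 2, var))       # total sum of squares for k = 1
for (k in 2:40) {
  wss[k] <- sum(kmeans(m, centers = k, nstart = 5)$withinss)
}
plot(1:40, wss, type = "b",
     xlab = "k (number of clusters)", ylab = "within-cluster sum of squares")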
Clustering with 20 centers
  #dtm.k – DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
  #nstart – let's try 10 random starts to generate centroids
  #algorithm – "Hartigan-Wong" (default)
  > dtm.clust<-kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")




   Cluster sizes
     > dtm.clust$size

       [1] 41 21 4 1 1 5 1 7 12 5 98 2 3 7 10 1 4 2 1 11



   Sum of squares
     > dtm.clust$withinss

 [1] 0.75166171 0.37998302 0.08702162 0.00000000 0.00000000 0.10884947 0.00000000 0.21350480 0.22052166
 [10] 0.07426058 1.35245927 0.03003547 0.05145358 0.12662083 0.25722734 0.00000000 0.08037547 0.02691182
 [19] 0.00000000 0.22561816




* See the full R code in Appendixes
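
To see which terms drive each cluster (used later to label the clusters), one can rank the centroid weights; a hedged sketch:

for (i in 1:nrow(dtm.clust$centers)) {
  top <- sort(dtm.clust$centers[i, ], decreasing = TRUE)[1:5]  # top 5 terms by weight
  cat("cluster", i, ":", names(top), "\n")
}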
kmeans. Cluster Visualization




                           Distance matrix (Euclidean)
                           Scale the multi-dimensional DTM down to 2 dimensions
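
A compact sketch of this projection (the full version, with ordihull and the Voronoi diagram, is in Appendix N):

dtm.dist   <- dist(dtm.k, method = "euclidean")   # distance matrix (Euclidean)
dtm_scaled <- cmdscale(dtm.dist)                  # scale the DTM down to 2 dimensions
plot(dtm_scaled, col = dtm.clust$cluster, pch = 20,
     xlab = "", ylab = "", main = "k-means clusters, 2-D projection")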
k-means clustering Summary
Clustering with kmedoids
Assess number of clusters with pam$silinfo$avg.width




Recommended number of clusters: 2. However …
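
The average-silhouette-width curve behind this slide can be computed with cluster::pam; a sketch matching Appendix M (the range of k and object names are assumptions):

require(cluster)
dtm.dist <- dist(dtm.k, method = "euclidean")
asw <- sapply(2:40, function(k) pam(dtm.dist, k, diss = TRUE)$silinfo$avg.width)
k.best <- which.max(asw) + 1     # +1 because asw[1] corresponds to k = 2
plot(2:40, asw, type = "b",
     xlab = "k-medoids (# clusters)", ylab = "average silhouette width")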
Perform clustering with 20 centers
                       #max_diss, av_diss – maximum/average dissimilarity between
                       observations in the cluster and the cluster's medoid

                       #diameter – maximum dissimilarity between two observations
                       in the cluster

                       #separation – minimal dissimilarity between an observation in
                       the cluster and an observation from another cluster




Result: 4 clusters
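
A minimal sketch of the clustering step itself (object names follow Appendix M):

require(cluster)
dtm.clust.m <- pam(dtm.dist, k = 20, diss = TRUE)  # k-medoids with 20 centers
dtm.clust.m$clusinfo    # size, max_diss, av_diss, diameter, separation per cluster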
kmedoids. Cluster Visualization
k-medoids clustering Summary
Recognize clusters
Recognized clusters* ([cluster – # of blogs])

     "Ukrainian elections"                       [2 – 21]
     "Ukrainian democracy"                       [3 – 4]
     "social networks, ex.ua"                    [6 – 5]
     "tymoshenko, opposition, court"             [8 – 7]
     "Euro-2012"                                 [9 – 12]
     "Ukraine-Russia relations, gas"             [10 – 5]
     "Ukrainian taxes"                           [12 – 2]
     "Ukraine-EU relations"                      [14 – 7]
     "protests, demonstrations, human rights"    [15 – 10]
     "culture, regulation"                       [17 – 4], [13 – 3]
     "journalist investigations"                 [20 – 11]
     "all other blogs with various topics"       (unrecognized)

    Total blogs recognized: 91 of 236 (~40%)

* Based on kmeans
Conclusion


the number of elements in the data vector (349) must be significantly smaller than the number
of observations (245)

some of the resulting clusters include "unlike" blogs (see the sums of squares)

try k-means for better precision when mining large, dissimilar texts with a small number of
observations; in other cases k-medoids is the more robust model

focus on similar texts (by category, date range) for the best accuracy

sentiment analysis would make the analysis even richer
Questions & Answers



                 Aleksei Beloshytski
     Aleksei.Beloshytski@gmail.com
Appendix A. kmeans. Voronoi Diagram (“Euclidean”)
Appendix B. kmeans. Voronoi Diagram (“Manhattan”)
Appendix C. kmeans. Heatmap (most freq. terms). TF-IDF
Appendix D. kmedoids. Heatmap (most freq. terms). TF
Appendix E. R packages required for analysis

                  require(twitteR)
                  require(XML)
                  require(plyr)
                  require(tm)
                  require(Rstem)
                  require(Snowball)
                  require(corrplot)
                  require(RWeka)
                  require(RCurl)
                  require(wordcloud)
                  require(ggplot2)
                  require(vegan)
                  require(reshape2)
                  require(cluster)
                  require(alphahull)
Appendix F. R Code. Translate texts into English

     go_tr <- function(url) {
         src.url<-URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
         html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
         frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
         params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
         src.url <- paste("http://translate.google.com", params, sep = "")
         dest.url <- getURL(src.url, followlocation = TRUE)
         html <- htmlTreeParse(dest.url, useInternalNodes = TRUE)
         dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
         dest.url <- strsplit(dest.url, "URL=", fixed = TRUE)[[1]][2]
         dest.url <- gsub(""/>", "", dest.url, fixed = TRUE)
         dest.url <- gsub(" ", "", dest.url, fixed = TRUE)
         dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
         return(dest.url)
       }




[1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"


[1] "http://translate.googleusercontent.com/translate_c?
rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJ
rhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
Appendix G. R Code. Parse tweets and extract URLs
require(twitteR)
kb_tweets<-userTimeline('Korr_Blogs', n=400)
#get text of tweets
urls<-laply(kb_tweets, function(t) t$getText())
#extract urls from text
url_expr<-regexec("http://[a-zA-Z0-9].S*$", urls);
urls<-regmatches(urls, url_expr)
#remove empty elements from the list
urls[lapply(urls, length)<1]<-NULL
#unshorten double-shortened urls (decode_short_url() is a helper not shown in this excerpt)
for(i in 1:length(urls)) { urls[i]<-decode_short_url(decode_short_url(urls[[i]])) }
#remove duplicates
urls<-as.list(unique(unlist(urls)))

#...

#contact me for the rest part of the code

#...
Appendix H. R Code. Handle blogs
for(i in 1:length(urls))
{
     #translate blogs into English
     url<-go_tr(urls[i])
     blogs<-readLines(tc<-textConnection(url));
     close (tc)

     pagetree<-try(htmlTreeParse(blogs, useInternalNodes=TRUE, encoding="windows-1251"))
          if(class(pagetree)=="try-error") next;
     x<-xpathSApply(pagetree,
"//div[@class='article_box']//*[@onmouseover]/text()|//div[@class='article_box']//*[@onmouseover]/a/text()",
xmlValue)
     x <- unlist(strsplit(x, "n"))
     x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "1", x, perl=TRUE)
     x <- x[!(x %in% c("", "|"))]
#...
}

#...

#contact me for the rest part of the code

#...
Appendix I. R Code. Manage TermDocumentMatrix

#...
corp <- Corpus(DirSource("//kor_blogs/en"),readerControl=list(language="en", encodeString="windows-1251"))
#..
#Clean texts, stemming and so on
#...
#Create DTM for both stemmed and not-stemmed Corpuses
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, sparse=0.995) #0.995 - for both EN and RU
#...
#Find Most Frequent and Associated terms
#Build Correlation Matrix
#..
corrplot(type="lower", tl.cex=.6, corr=corr_stem_ru, title="Correlation matrix", tl.col="grey20",
method="circle", order="FPC", addtextlabel = "ld", outline=TRUE)

#...

#contact me for the rest part of the code

#...
Appendix J. R Code. Hierarchical clustering
#...
dtm2<-as.TermDocumentMatrix(dtm)
#...
dtm2.df<-as.data.frame(inspect(dtm2))
#...
(d <- dist(dtm2.df.scale, method = "euclidean")) # distance matrix
fit <- hclust(d=d, method="ward")
#..
dev.off()

#...

#contact me for the rest part of the code

#...
Appendix K. R Code. Wordcloud (most frequent terms)

require(wordcloud)
#...
dtm.m <- as.matrix(dtm)
v <- apply(dtm.m,2,sum) #calculate the number of occurrences of each word
v <- sort(v, decreasing=TRUE)
#..
wordcloud(d$word, d$freq, scale=c(9,.1), min.freq=3, max.words=Inf, random.order=F, rot.per=.3, colors=pal2)


#...

#contact me for the rest part of the code

#...
Appendix L. R Code. kmeans analysis

#...
# assess number of clusters
wss <- (nrow(dtm)-1)*sum(apply(dtm,2,var)) #for stemmed DTM
dtm_orig <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf)) # non-stemmed DTM
dtm_orig <- removeSparseTerms(dtm_orig, sparse=0.995)
#...
# visualize withinss

# perform clustering
#dtm.k – DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
#nstart – let's try 10 random starts to generate centroids
#algorithm – "Hartigan-Wong" (default)
dtm.clust<-kmeans(x=dtm.k,centers=20,iter.max=40, nstart=10, algorithm="Hartigan-Wong")
dtm.clust$size

#...

#contact me for the rest part of the code

#...
Appendix M. R Code. kmedoids analysis
#...
# assess number of clusters
# visualize the average silhouette width for each k
ggplot()+geom_line(aes(x=1:236, y=asw),size=1,colour="red4") + opts(axis.text.x=theme_text(hjust=0,
colour="grey20", size=14), axis.text.y=theme_text(size=14, colour="grey20"),
axis.title.x=theme_text(size=20, colour="grey20"), axis.title.y=theme_text(angle=90, size=20,
colour="grey20")) + labs(y="average silhouette width", x="k-medoids (# clusters)",size=16) +
scale_x_continuous(breaks=c(k.best,20,40,60,80,100,120,140,160,180,200,220))

# perform kmedoids clustering
#...
dtm.clust.m$clusinfo

#...

#contact me for the rest part of the code

#...
Appendix N. R Code. Visualize clusters

#...
#define which cluster to visualize
dtm.clust.v<-dtm.clust # for kmeans
dtm.clust.v<-dtm.clust.m # for kmedoids
#...
# distance matrix
dtm.dist <- dist(dtm.k, method="euclidean")
dtm_scaled <- cmdscale(dtm.dist) # scale from multiple dimensions down to two
require(vegan)
#...
for(i in seq_along(groups)){
  points(dtm_scaled[factor(dtm.clust.v$cluster) == groups[i], ], col = cols[i], pch = 20)
}
# draw ordihull
ordihull(dtm_scaled, factor(dtm.clust.v$cluster), lty = "dotted", draw="polygon", label=TRUE)

#draw Voronoi diagram

#...

#contact me for the rest part of the code

#...
Appendix O. R Code. Visualize heatmaps

#...
dtm.clust.v<-dtm.clust # for kmeans
dtm.clust.v<-dtm.clust.m # for kmedoids

dtm0 <- dtm.k #dtm for kmeans clustering
dtm0 <- removeSparseTerms(dtm0, sparse=0.7) #get terms which exist in 70% of blogs
dtm.df <- as.data.frame(inspect(dtm0))
dfc <- cbind(dtm.df, id=seq(nrow(dtm.df)), cluster=dtm.clust.v$cluster) #Append id and cluster
#...
require(ggplot2)
dev.off()
dev.new()
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value)) +
opts(axis.text.x=theme_text(angle=90, hjust=0, colour="grey20", size=14)) + labs(x="", y="")

#...

#contact me for the rest part of the code

#...
Appendix P. Most Frequent Terms (Cyrillic)*

  Non-stemmed terms (cropped terms, 0xFF)
   > findFreqTerms(dtm, lowfreq=3)

 [1]   "брать"      "вли"       "вопрос"   "врем"    "выборы"    "высоцкий"   "действи"   "евро"       "знакома"
[10]   "истори"     "написал"   "непри"    "нова"    "объ"       "остаетс"    "попул"     "прав"       "прин"
[19]   "прочитал"   "прошла"    "сегодн"   "суд"     "течение"   "третий"     "украине"   "украинцы"   "украины"
[28]   "хочу"




  Stemmed terms
   > findFreqTerms(dtm, lowfreq=4)

 [1]   "виктор"          "власт"      "вли"         "вопрос"     "врем"       "выбор"     "высоцк"
 [8]   "государствен"    "интересн"   "непри"       "объ"        "очередн"    "попул"     "последн"
[15]   "посто"           "прав"       "прин"        "прочита"    "росси"      "сегодн"    "страниц"
[22]   "течен"           "украин"




* Bold words – cropped; blue – terms don’t exist in non-stemmed variant
TextMining with R

More Related Content

What's hot

Where is my data (in the cloud) tamir dresher
Where is my data (in the cloud)   tamir dresherWhere is my data (in the cloud)   tamir dresher
Where is my data (in the cloud) tamir dresherTamir Dresher
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Sean Golliher
 
08. ElasticSearch : Sorting and Relevance
08.  ElasticSearch : Sorting and Relevance08.  ElasticSearch : Sorting and Relevance
08. ElasticSearch : Sorting and RelevanceOpenThink Labs
 
Python tutorial
Python tutorialPython tutorial
Python tutorialRajiv Risi
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkZalando Technology
 
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, FlaxCoffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, FlaxLucidworks
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
 
Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012Shani729
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang SpecJing Kang
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Kai Chan
 
Introduction to Cassandra & Data model
Introduction to Cassandra & Data modelIntroduction to Cassandra & Data model
Introduction to Cassandra & Data modelDuyhai Doan
 
Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...
Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...
Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...André Ricardo Barreto de Oliveira
 
Introduction to CrossRef Technical Basics Webinar 031815
Introduction to CrossRef Technical Basics Webinar 031815Introduction to CrossRef Technical Basics Webinar 031815
Introduction to CrossRef Technical Basics Webinar 031815Crossref
 

What's hot (18)

Where is my data (in the cloud) tamir dresher
Where is my data (in the cloud)   tamir dresherWhere is my data (in the cloud)   tamir dresher
Where is my data (in the cloud) tamir dresher
 
Quepy
QuepyQuepy
Quepy
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
 
08. ElasticSearch : Sorting and Relevance
08.  ElasticSearch : Sorting and Relevance08.  ElasticSearch : Sorting and Relevance
08. ElasticSearch : Sorting and Relevance
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, FlaxCoffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
 
seo tutorial
seo tutorialseo tutorial
seo tutorial
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
 
Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
 
RFS Search Lang Spec
RFS Search Lang SpecRFS Search Lang Spec
RFS Search Lang Spec
 
Hidden Features in HTTP
Hidden Features in HTTPHidden Features in HTTP
Hidden Features in HTTP
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)
 
Introduction to Cassandra & Data model
Introduction to Cassandra & Data modelIntroduction to Cassandra & Data model
Introduction to Cassandra & Data model
 
Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...
Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...
Liferay Search: Best Practices to Dramatically Improve Relevance - Liferay Sy...
 
Introduction to CrossRef Technical Basics Webinar 031815
Introduction to CrossRef Technical Basics Webinar 031815Introduction to CrossRef Technical Basics Webinar 031815
Introduction to CrossRef Technical Basics Webinar 031815
 

Viewers also liked

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesJeffrey Breen
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With RJahnab Kumar Deka
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify RaisAjay Ohri
 
Social media analysis in R using twitter API
Social media analysis in R using twitter API Social media analysis in R using twitter API
Social media analysis in R using twitter API Mohd Shadab Alam
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Sentiment analysis of tweets
Sentiment analysis of tweetsSentiment analysis of tweets
Sentiment analysis of tweetsVasu Jain
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis worksCJ Jenkins
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and VisualizationSeth Grimes
 
Web data from R
Web data from RWeb data from R
Web data from Rschamber
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisFabio Benedetti
 
Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment AnalysisJaganadh Gopinadhan
 
Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14Rachit Goel
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlBen Healey
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationFlorian Leitner
 
Social Network Analysis With R
Social Network Analysis With RSocial Network Analysis With R
Social Network Analysis With RDavid Chiu
 

Viewers also liked (20)

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlines
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With R
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify Rais
 
Social media analysis in R using twitter API
Social media analysis in R using twitter API Social media analysis in R using twitter API
Social media analysis in R using twitter API
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Sentiment analysis of tweets
Sentiment analysis of tweetsSentiment analysis of tweets
Sentiment analysis of tweets
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and Visualization
 
Web data from R
Web data from RWeb data from R
Web data from R
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
 
Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment Analysis
 
Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14Twitter sentiment-analysis Jiit2013-14
Twitter sentiment-analysis Jiit2013-14
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco Control
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Text MIning
Text MIningText MIning
Text MIning
 
OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
 
Social Network Analysis With R
Social Network Analysis With RSocial Network Analysis With R
Social Network Analysis With R
 

Similar to TextMining with R

PyData Berlin Meetup
PyData Berlin MeetupPyData Berlin Meetup
PyData Berlin MeetupSteffen Wenz
 
Social Graphs and Semantic Analytics
Social Graphs and Semantic AnalyticsSocial Graphs and Semantic Analytics
Social Graphs and Semantic AnalyticsColin Bell
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
DBpedia Framework - BBC Talk
DBpedia Framework - BBC TalkDBpedia Framework - BBC Talk
DBpedia Framework - BBC TalkGeorgi Kobilarov
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearchLukas Vlcek
 
ICDM2019 table tutorial
ICDM2019 table tutorialICDM2019 table tutorial
ICDM2019 table tutorialNancy Wang
 
Abhishek lingineni
Abhishek lingineniAbhishek lingineni
Abhishek lingineniabhishekl404
 
CPlusPus
CPlusPusCPlusPus
CPlusPusrasen58
 
The Rough Guide to MongoDB
The Rough Guide to MongoDBThe Rough Guide to MongoDB
The Rough Guide to MongoDBSimeon Simeonov
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageNeo4j
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.pptTeacherOnat
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.pptJayarAlejo
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.pptJayarAlejo
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.pptEPORI
 
C++ programming: Basic introduction to C++.ppt
C++ programming: Basic introduction to C++.pptC++ programming: Basic introduction to C++.ppt
C++ programming: Basic introduction to C++.pptyp02
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.pptInfotech27
 

Similar to TextMining with R (20)

PyData Berlin Meetup
PyData Berlin MeetupPyData Berlin Meetup
PyData Berlin Meetup
 
Social Graphs and Semantic Analytics
Social Graphs and Semantic AnalyticsSocial Graphs and Semantic Analytics
Social Graphs and Semantic Analytics
 
EAD - QC GSLIS 730
EAD - QC GSLIS 730EAD - QC GSLIS 730
EAD - QC GSLIS 730
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
DBpedia Framework - BBC Talk
DBpedia Framework - BBC TalkDBpedia Framework - BBC Talk
DBpedia Framework - BBC Talk
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearch
 
ICDM2019 table tutorial
ICDM2019 table tutorialICDM2019 table tutorial
ICDM2019 table tutorial
 
Abhishek lingineni
Abhishek lingineniAbhishek lingineni
Abhishek lingineni
 
CPlusPus
CPlusPusCPlusPus
CPlusPus
 
The Rough Guide to MongoDB
The Rough Guide to MongoDBThe Rough Guide to MongoDB
The Rough Guide to MongoDB
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.ppt
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.ppt
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.ppt
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.ppt
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.ppt
 
C++ programming: Basic introduction to C++.ppt
C++ programming: Basic introduction to C++.pptC++ programming: Basic introduction to C++.ppt
C++ programming: Basic introduction to C++.ppt
 
C++_programs.ppt
C++_programs.pptC++_programs.ppt
C++_programs.ppt
 
Labs_20210809.pdf
Labs_20210809.pdfLabs_20210809.pdf
Labs_20210809.pdf
 

Recently uploaded

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 

Recently uploaded (20)

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 

TextMining with R

  • 1. Text Mining with R Aleksei Beloshytski Kyiv, 2012-Feb
  • 2. Table of Contents I. Goal of research and limitations II. Data Preparation II. Scrape text from blogs (blogs.korrespondent.net) III. Stemming and cleaning IV. Bottlenecks mining Cyrillic III. Text Mining & clustering III. Term normalization (TF-IDF). Most Frequent and Correlated terms IV. Hierarchical clustering with hclust V. Clustering with k-means and k-medoids VI. Comparing clusters IV. Conclusion
  • 4. demonstrate most popular practices when mining dissimilar texts with low number of observations mine blogs on https://blogs.korrespondent.net and identify most discussed topics identify bottlenecks when mining Cyrillic perform hierarchical clustering with hclust method perform clustering using k-means and k-medoids methods compare results
  • 6. no initial blog categorization by date range, subject(s), author(s) etc* last 245 blogs** from blogs.korrespondent.net as of the day of analysis blogs less then 1kb of plain text excluded * There is no goal to achieve best cluster accuracy, but most discussed subjects (clusters) should be identified. ** 245 – after excluding empty and small blogs (<1Kb) from initial 400 blogs
  • 7. Step 1. Scrape text from blogs
  • 8. How to scrape blogs.. HTML parsing parse each page and get urls not transparent
  • 9. How to scrape blogs.. HTML parsing parse each page and get urls not transparent RSS feed keeps 1 day history
  • 10. How to scrape blogs.. HTML parsing parse each page and get urls not transparent RSS feed keeps only 1 day history Twitter (@Korr_blog) each tweet has blog URL easy and transparent for R
  • 11. Parse tweets Get tweets Extract URL from text Remove empty URLs Unshorten double-shorted URLs Validate URLs Remove duplicates .. [269] "http://blogs.korrespondent.net/journalists/blog/anna-radio/a51779" [270] "http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727" [271] "http://blogs.korrespondent.net/celebrities/blog/press13/a51764" [272] "http://blogs.korrespondent.net/celebrities/blog/olesdoniy/a51736" [273] "http://blogs.korrespondent.net/journalists/blog/raimanelena/a51724" .. * Full R code is available at the end
  • 13. Clean texts Translate all blogs in English Extract translated text from the html code Load texts into Corpus Map to lower case, rm punctuation, Stop Words, numbers, strip white spaces Stem document
  • 15. declensions in RU/UA words. After stemming the same word has several forms 0xFF-problem (“я”, 0xFF, windows-1251). DocumentTermMatrix (in R) crops texts E.g. „янукович‟ – filtered, „объявлять‟ – „объ‟, „братья‟ – „брать‟ (sense changes) etc Cyrillic texts with pseudo-graphic or special symbols can‟t be encoded with windows- 1251 charset properly (additional filter uurlencoded required, not supported in R)
  • 16. Translate texts into English #see the full code in Appendix F go_tr <- function(url) { src.url<-URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep="")) html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE) frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]') params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]]) #... dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]]) #... dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]]) return(dest.url) } [1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268" [1] "http://translate.googleusercontent.com/translate_c? rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJ rhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
  • 17. Original Blog Text. Example До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами, обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС. ...
  • 18. Original Blog Text. Example До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами, обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС. ... Translated & extracted Text Before the official launch of the Euro is 150 days.In the midst of the so-called operational preparation for the championship.It is about establishing communication between the host cities, staff training and marafet hover as a whole.It's no secret that, in comparison with the Ukrainians, the Poles were far more dividends in preparation for the Championship.First of all, we are talking about bringing considerable resources through financing from EU funds. ...
  • 19. Original Blog Text. Example До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами, обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС. ... Translated & extracted Text Before the official launch of the Euro is 150 days.In the midst of the so-called operational preparation for the championship.It is about establishing communication between the host cities, staff training and marafet hover as a whole.It's no secret that, in comparison with the Ukrainians, the Poles were far more dividends in preparation for the Championship.First of all, we are talking about bringing considerable resources through financing from EU funds. ... Cleaned Text official launch euro days midst called operational preparation championship establishing communication host cities staff training marafet hover secret comparison ukrainians poles dividends preparation championship talking bringing considerable resources financing eu funds ...
• 20. Original Blog Text. Example
До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами, обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС. ...

Translated & Extracted Text
Before the official launch of the Euro is 150 days. In the midst of the so-called operational preparation for the championship. It is about establishing communication between the host cities, staff training and marafet hover as a whole. It's no secret that, in comparison with the Ukrainians, the Poles were far more dividends in preparation for the Championship. First of all, we are talking about bringing considerable resources through financing from EU funds. ...

Cleaned Text
official launch euro days midst called operational preparation championship establishing communication host cities staff training marafet hover secret comparison ukrainians poles dividends preparation championship talking bringing considerable resources financing eu funds ...

Stemmed Text
offici launch euro day midst call oper prepar championship establish communic host citi staff train marafet hover secret comparison ukrainian pole dividend prepar championship talk bring consider resourc financ eu fund ...
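A minimal sketch of this cleaning and stemming pipeline with the tm package (the directory path is the one from Appendix I; depending on the tm version, transformations such as tolower may need to be wrapped in content_transformer()):

require(tm)

# Load the translated blog texts (path as used in Appendix I)
corp <- Corpus(DirSource("//kor_blogs/en"), readerControl = list(language = "en"))

# Lower-case, drop punctuation, numbers and stop words, squeeze whitespace
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# Reduce words to their stems ("official" -> "offici", "preparation" -> "prepar")
corp <- tm_map(corp, stemDocument)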
  • 21. Step 3. Text Mining & Clustering
• 22. Text Mining and Clustering
Build normalized TermDocumentMatrix. Remove Sparse Terms
Hierarchical Clustering, Dendrogram
Kmeans. Perform Clustering and visualize clusters
Kmedoids. Perform Clustering and visualize clusters
• 24. DocumentTermMatrix Structure
Docs (nrow = 237) x Terms (ncol = 4101), TF-IDF weighted values (e.g. 0.0175, 0.0191, 0.0095, 0.0163, ...)
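A sketch of how a matrix of this shape can be built from the stemmed corpus, using the TF-IDF weighting and sparsity threshold from Appendix I:

require(tm)

# TF-IDF weighted document-term matrix over the stemmed corpus
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))

# Drop terms missing from more than 99.5% of blogs
dtm <- removeSparseTerms(dtm, sparse = 0.995)

dim(dtm)   # documents x terms, roughly 237 x 4101 here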
  • 25. Most Frequent & Correlated Terms. Why is that important?
• 26. Most Frequent Terms (Latin)
Non-stemmed terms
> findFreqTerms(dtm, lowfreq=1)
 [1] "country"    "euro"       "european"   "government" "internet"   "kiev"       "kyiv"       "money"
 [9] "opposition" "party"      "people"     "political"  "power"      "president"  "russia"     "social"
[17] "society"    "tymoshenko" "ukraine"    "ukrainian"  "world"      "yanukovych"

Stemmed terms
> findFreqTerms(dtm, lowfreq=1)
 [1] "chang"      "countri"    "elect"      "euro"       "european"   "govern"     "internet"   "kiev"
 [9] "kyiv"       "leader"     "money"      "opposit"    "parti"      "peopl"      "polit"      "power"
[17] "presid"     "russia"     "russian"    "social"     "societi"    "tymoshenko" "ukrain"     "ukrainian"
[25] "world"      "yanukovych"

* See the full R code in Appendixes
• 27. Correlated Terms (Cyrillic vs Latin). Example
> findAssocs(dtm, 'евр', 0.35)   #correlation with term "евро"
      евр     старт    гарант      хлеб       тыс  талисман официальн  воплощен     будущ чемпионат     живет
     1.00      0.76      0.74      0.71      0.62      0.55      0.49      0.48      0.35      0.31      0.22
подготовк    реплик    секрет    футбол
     0.22      0.22      0.21      0.21

> findAssocs(dtm, 'euro', 0.35)
     euro  championship   footbal   tourist   airport  tournament       fan    poland
     1.00          0.68      0.57      0.49      0.45        0.43      0.42      0.42
  horribl infrastructur   foreign    patrol   unhappi      prepar  flashmob
     0.38          0.38      0.37      0.37      0.37        0.36      0.35
• 28. Correlation Matrix (Latin vs Cyrillic). Example
English terms show higher correlations and better term accuracy.
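One way to build such a matrix is to correlate the TF-IDF columns of the frequent terms and plot the result with corrplot; this is only a sketch, the original correlation matrices come from the code in Appendix I:

require(tm)
require(corrplot)

# Term-term correlations over the TF-IDF profiles of the frequent terms
freq.terms <- findFreqTerms(dtm, lowfreq = 1)
m <- as.matrix(dtm)[, freq.terms]
corr.mat <- cor(m)

corrplot(corr.mat, type = "lower", method = "circle",
         tl.cex = .6, tl.col = "grey20", outline = TRUE)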
• 30. Cluster Dendrogram*
#input - DTM normalized with TF-IDF (349 terms, sparse=0.7)
d <- dist(dtm2.df.scale, method = "euclidean")   # dissimilarity matrix

#clustering with Ward's method
fit <- hclust(d=d, method="ward")   #compare: "complete", "single", "mcquitty", "median", "centroid"

* Full result of h-clustering is available in pdf
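A short sketch of drawing and cutting the resulting tree (k = 20 groups is illustrative only, not the value used in the original analysis):

# Draw the dendrogram and outline k groups
plot(fit, cex = .6, main = "Cluster Dendrogram", xlab = "", sub = "")
k <- 20                                   # illustrative number of groups
rect.hclust(fit, k = k, border = "red4")

# Cluster membership of each blog for the chosen cut
groups <- cutree(fit, k = k)
table(groups)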
• 31. Hierarchical Clustering Summary
universal: supports different agglomeration algorithms, e.g. Ward's objective function based on squared Euclidean distance (it is worth experimenting with the other methods)
works well with a large number of terms and a small number of observations
gives an understanding of how terms in the Corpus correlate with each other
provides a visual representation of how clusters are nested within each other

* Full result of h-clustering is available in pdf
• 33. Description of the k-means algorithm*
1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new means.
4) Steps 2 and 3 are repeated until convergence has been reached.

* Source: http://en.wikipedia.org/wiki/K-means
• 34. Assess number of clusters using kmeans$withinss
fewer terms in the DTM -> higher sum of squares -> better cluster quality
more terms in the DTM -> lower sum of squares -> lower cluster quality
Unexpected (but expected) results
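A sketch of the within-sum-of-squares (elbow) computation behind this assessment, following the fragment in Appendix L; the tested range of k is an assumption:

# Total within-cluster sum of squares for k = 1
dtm.m <- as.matrix(dtm.k)
wss <- (nrow(dtm.m) - 1) * sum(apply(dtm.m, 2, var))

# Within-ss of kmeans solutions for k = 2..40
for (k in 2:40) {
    wss[k] <- sum(kmeans(dtm.m, centers = k, iter.max = 40, nstart = 10)$withinss)
}

plot(1:40, wss, type = "b",
     xlab = "number of clusters (k)", ylab = "within-groups sum of squares")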
• 35. Clustering with 20 centers
#dtm.k - DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
#nstart - let's try 10 random starts to generate centroids
#algorithm - "Hartigan-Wong" (default)
> dtm.clust <- kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")

Cluster sizes
> dtm.clust$size
 [1] 41 21  4  1  1  5  1  7 12  5 98  2  3  7 10  1  4  2  1 11

Sum of squares
> dtm.clust$withinss
 [1] 0.75166171 0.37998302 0.08702162 0.00000000 0.00000000 0.10884947 0.00000000 0.21350480 0.22052166
[10] 0.07426058 1.35245927 0.03003547 0.05145358 0.12662083 0.25722734 0.00000000 0.08037547 0.02691182
[19] 0.00000000 0.22561816

* See the full R code in Appendixes
• 36. kmeans. Cluster Visualization
Distance matrix (Euclidean); the multi-dimensional DTM is scaled to 2 dimensions.
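A compact sketch of the 2-D projection behind this plot (classical MDS on the Euclidean distance matrix, points coloured by kmeans cluster); the fuller version with ordihull and the Voronoi diagram is in Appendix N:

# Euclidean distances between blogs, then classical MDS down to two dimensions
dtm.dist <- dist(as.matrix(dtm.k), method = "euclidean")
dtm_scaled <- cmdscale(dtm.dist, k = 2)

# One point per blog, coloured by its kmeans cluster
plot(dtm_scaled, col = dtm.clust$cluster, pch = 20,
     xlab = "dimension 1", ylab = "dimension 2", main = "kmeans clusters (MDS)")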
• 39. Assess number of clusters with pam$silinfo$avg.width
Recommended number of clusters: 2. However …
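A sketch of how the average silhouette width can be evaluated over a range of k with pam, following the idea in Appendix M; the range 2..40 and the use of the distance matrix from the previous step are assumptions:

require(cluster)

# Average silhouette width of the pam solution for each k
asw <- numeric(40)
for (k in 2:40) {
    asw[k] <- pam(dtm.dist, k = k, diss = TRUE)$silinfo$avg.width
}

k.best <- which.max(asw)   # widest average silhouette (k = 2 in this analysis)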
• 40. Perform clustering with 20 centers
#max_diss, av_diss - maximum/average dissimilarity between observations in the cluster and the cluster's medoid
#diameter - maximum dissimilarity between two observations in the cluster
#separation - minimal dissimilarity between an observation in the cluster and an observation of another cluster
Result: 4 clusters
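A minimal sketch of the k-medoids run itself and of where these per-cluster quantities come from (pam is assumed to run on the same Euclidean distance matrix used above):

require(cluster)

# k-medoids with 20 medoids on the blog distance matrix
dtm.clust.m <- pam(dtm.dist, k = 20, diss = TRUE)

# One row per cluster: size, max_diss, av_diss, diameter, separation
dtm.clust.m$clusinfo
dtm.clust.m$clustering   # cluster assignment of each blog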
• 44. Recognized clusters* ([cluster - # of blogs])
"tymoshenko, opposition, court" [2-21]
"Ukrainian elections" [3-4]
"Ukrainian democracy" [6-5]
"social networks, ex.ua" [8-7]
"Ukraine-Russia relations, gas" [9-12]
"Ukrainian taxes" [10-5]
"Ukraine-EU relations" [12-2]
"Euro-2012" [14-7]
"protests, demonstrations, human rights" [15-10]
"culture, regulation" [17-4]
"journalist investigations" [20-11]
"all other blogs with various topics" (unrecognized) [13-3]

Total blogs recognized: 91 of 236 (~40%)
* Based on kmeans
• 45. Conclusion
the number of elements in the data vector (349) must be significantly smaller than the number of observations (245)
some of the resulting clusters include "unlike" blogs (see the sums of squares)
try kmeans for better precision when mining big dissimilar texts with a low number of observations; in other cases kmedoids is the more robust model
focus on similar texts (by category, date range) for the best accuracy
sentiment analysis would make the results even more interesting
  • 46. Questions & Answers Aleksei Beloshytski Alelsei.Beloshytski@gmail.com
  • 47. Appendix A. kmeans. Voronoi Diagram (“Euclidean”)
  • 48. Appendix B. kmeans. Voronoi Diagram (“Manhattan”)
  • 49. Appendix C. kmeans. Heatmap (most freq. terms). TF-IDF
  • 50. Appendix D. kmedoids. Heatmap (most freq. terms). TF
• 51. Appendix E. R packages required for analysis
require(twitteR)
require(XML)
require(plyr)
require(tm)
require(Rstem)
require(Snowball)
require(corrplot)
require(RWeka)
require(RCurl)
require(wordcloud)
require(ggplot2)
require(vegan)
require(reshape2)
require(cluster)
require(alphahull)
• 52. Appendix F. R Code. Translate texts into English
go_tr <- function(url) {
    src.url <- URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
    html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
    frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
    params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
    src.url <- paste("http://translate.google.com", params, sep = "")
    dest.url <- getURL(src.url, followlocation = TRUE)
    html <- htmlTreeParse(dest.url, useInternalNodes = TRUE)
    dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
    dest.url <- strsplit(dest.url, "URL=", fixed = TRUE)[[1]][2]
    dest.url <- gsub("\"/>", "", dest.url, fixed = TRUE)
    dest.url <- gsub(" ", "", dest.url, fixed = TRUE)
    dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
    return(dest.url)
}

[1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"
[1] "http://translate.googleusercontent.com/translate_c?rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJrhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
• 53. Appendix G. R Code. Parse tweets and extract URLs
require(twitteR)
kb_tweets <- userTimeline('Korr_Blogs', n=400)

#get text of tweets
urls <- laply(kb_tweets, function(t) t$getText())

#extract urls from text
url_expr <- regexec("http://[a-zA-Z0-9].\\S*$", urls)
urls <- regmatches(urls, url_expr)

#remove empty elements from the list
urls[lapply(urls, length) < 1] <- NULL

#unshorten double-shorted urls
for(i in 1:length(urls)) {
    urls[i] <- decode_short_url(decode_short_url(urls[[i]]))
}

#remove duplicates
urls <- as.list(unique(unlist(urls)))
#...
#contact me for the rest part of the code
#...
• 54. Appendix H. R Code. Handle blogs
for(i in 1:length(urls)) {
    #translate blogs into English
    url <- go_tr(urls[i])
    blogs <- readLines(tc <- textConnection(url)); close(tc)
    pagetree <- try(htmlTreeParse(blogs, useInternalNodes=TRUE, encoding="windows-1251"))
    if(class(pagetree) == "try-error") next;
    x <- xpathSApply(pagetree, "//div[@class='article_box']//*[@onmouseover]/text()|//div[@class='article_box']//*[@onmouseover]/a/text()", xmlValue)
    x <- unlist(strsplit(x, "\n"))
    x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
    x <- x[!(x %in% c("", "|"))]
    #...
}
#...
#contact me for the rest part of the code
#...
• 55. Appendix I. R Code. Manage TermDocumentMatrix
#...
corp <- Corpus(DirSource("//kor_blogs/en"), readerControl=list(language="en", encodeString="windows-1251"))
#..
#Clean texts, stemming and so on
#...
#Create DTM for both stemmed and not-stemmed Corpuses
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, sparse=0.995)   #0.995 - for both EN and RU
#...
#Find Most Frequent and Associated terms
#Build Correlation Matrix
#..
corrplot(type="lower", tl.cex=.6, corr=corr_stem_ru, title="Correlation matrix", tl.col="grey20",
         method="circle", order="FPC", addtextlabel="ld", outline=TRUE)
#...
#contact me for the rest part of the code
#...
• 56. Appendix J. R Code. Hierarchical clustering
#...
dtm2 <- as.TermDocumentMatrix(dtm)
#...
dtm2.df <- as.data.frame(inspect(dtm2))
#...
(d <- dist(dtm2.df.scale, method = "euclidean"))   # distance matrix
fit <- hclust(d=d, method="ward")
#..
dev.off()
#...
#contact me for the rest part of the code
#...
• 57. Appendix K. R Code. Wordcloud (most frequent terms)
require(wordcloud)
#...
dtm.m <- as.matrix(dtm)
v <- apply(dtm.m, 2, sum)   #calculate number of occurrences of each word
v <- sort(v, decreasing=TRUE)
#..
wordcloud(d$word, d$freq, scale=c(9,.1), min.freq=3, max.words=Inf, random.order=F, rot.per=.3, colors=pal2)
#...
#contact me for the rest part of the code
#...
• 58. Appendix L. R Code. kmeans analysis
#...
# assess number of clusters
wss <- (nrow(dtm)-1)*sum(apply(dtm,2,var))   #for stemmed DTM
dtm_orig <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))   # non-stemmed DTM
dtm_orig <- removeSparseTerms(dtm_orig, sparse=0.995)
#...
# visualize withinss

# perform clustering
#dtm.k - DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
#nstart - let's try 10 random starts to generate centroids
#algorithm - "Hartigan-Wong" (default)
dtm.clust <- kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")
dtm.clust$size
#...
#contact me for the rest part of the code
#...
• 59. Appendix M. R Code. kmedoids analysis
#...
# assess number of clusters
# visualize average silhouette width
ggplot() + geom_line(aes(x=1:236, y=asw), size=1, colour="red4") +
    opts(axis.text.x=theme_text(hjust=0, colour="grey20", size=14),
         axis.text.y=theme_text(size=14, colour="grey20"),
         axis.title.x=theme_text(size=20, colour="grey20"),
         axis.title.y=theme_text(angle=90, size=20, colour="grey20")) +
    labs(y="average silhouette width", x="k-medoids (# clusters)", size=16) +
    scale_x_continuous(breaks=c(k.best,20,40,60,80,100,120,140,160,180,200,220))

# perform kmedoids clustering
#...
dtm.clust.m$clusinfo
#...
#contact me for the rest part of the code
#...
• 60. Appendix N. R Code. Visualize clusters
#...
#define which clustering to visualize
dtm.clust.v <- dtm.clust     # for kmeans
dtm.clust.v <- dtm.clust.m   # for kmedoids
#...
require(vegan)
dtm.dist <- dist(dtm.k, method="euclidean")   # distance matrix
dtm_scaled <- cmdscale(dtm.dist)              # scale from multi dimensions to two dimensions
#...
for(i in seq_along(groups)) {
    points(dtm_scaled[factor(dtm.clust.v$cluster) == groups[i], ], col = cols[i], pch = 20)
}
# draw ordihull
ordihull(dtm_scaled, factor(dtm.clust.v$cluster), lty = "dotted", draw="polygon", label=TRUE)
#draw Voronoi diagram
#...
#contact me for the rest part of the code
#...
• 61. Appendix O. R Code. Visualize heatmaps
#...
dtm.clust.v <- dtm.clust     # for kmeans
dtm.clust.v <- dtm.clust.m   # for kmedoids
dtm0 <- dtm.k                                 #dtm for kmeans clustering
dtm0 <- removeSparseTerms(dtm0, sparse=0.7)   #get terms which exist in 70% of blogs
dtm.df <- as.data.frame(inspect(dtm0))
dfc <- cbind(dtm.df, id=seq(nrow(dtm.df)), cluster=dtm.clust.v$cluster)   #Append id and cluster
#...
require(ggplot2)
dev.off()
dev.new()
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value)) +
    opts(axis.text.x=theme_text(angle=90, hjust=0, colour="grey20", size=14)) +
    labs(x="", y="")
#...
#contact me for the rest part of the code
#...
• 62. Appendix P. Most Frequent Terms (Cyrillic)*
Non-stemmed terms (cropped terms, 0xFF)
> findFreqTerms(dtm, lowfreq=3)
 [1] "брать"     "вли"      "вопрос"   "врем"     "выборы"   "высоцкий" "действи"  "евро"     "знакома"
[10] "истори"    "написал"  "непри"    "нова"     "объ"      "остаетс"  "попул"    "прав"     "прин"
[19] "прочитал"  "прошла"   "сегодн"   "суд"      "течение"  "третий"   "украине"  "украинцы" "украины"
[28] "хочу"

Stemmed terms
> findFreqTerms(dtm, lowfreq=4)
 [1] "виктор"        "власт"     "вли"      "вопрос"   "врем"     "выбор"    "высоцк"
 [8] "государствен"  "интересн"  "непри"    "объ"      "очередн"  "попул"    "последн"
[15] "посто"         "прав"      "прин"     "прочита"  "росси"    "сегодн"   "страниц"
[22] "течен"         "украин"

* Bold words - cropped; blue - terms that do not exist in the non-stemmed variant