TextMining with R

Goal: Demonstrate popular practices when mining big dissimilar texts
Object of mining: texts from site: http://blogs.korrespondent.net
Tool: R

Published in: Technology, Education

  1. 1. Text Mining with R. Aleksei Beloshytski, Kyiv, February 2012
  2. 2. Table of Contents
      I. Goal of research and limitations
      II. Data Preparation
          • Scrape text from blogs (blogs.korrespondent.net)
          • Stemming and cleaning
          • Bottlenecks mining Cyrillic
      III. Text Mining & clustering
          • Term normalization (TF-IDF). Most Frequent and Correlated terms
          • Hierarchical clustering with hclust
          • Clustering with k-means and k-medoids
          • Comparing clusters
      IV. Conclusion
  3. 3. Goals..
  4. 4. Goals:
      • demonstrate the most popular practices when mining dissimilar texts with a low number of observations
      • mine blogs on https://blogs.korrespondent.net and identify the most discussed topics
      • identify bottlenecks when mining Cyrillic
      • perform hierarchical clustering with the hclust method
      • perform clustering using the k-means and k-medoids methods
      • compare results
  5. 5. Limitations..
  6. 6. Limitations:
      • no initial blog categorization by date range, subject(s), author(s), etc.*
      • last 245 blogs** from blogs.korrespondent.net as of the day of analysis
      • blogs with less than 1 KB of plain text excluded
      * The goal is not to achieve the best cluster accuracy, but the most discussed subjects (clusters) should be identified.
      ** 245 blogs remain after excluding empty and small blogs (<1 KB) from the initial 400 blogs.
  7. 7. Step 1. Scrape text from blogs
  10. 10. How to scrape blogs
      • HTML parsing: parse each page and get URLs; not transparent
      • RSS feed: keeps only 1 day of history
      • Twitter (@Korr_blog): each tweet has a blog URL; easy and transparent for R
  11. 11. Parse tweets
      • Get tweets
      • Extract URL from text
      • Remove empty URLs
      • Unshorten double-shortened URLs
      • Validate URLs
      • Remove duplicates
      ..
      [269] "http://blogs.korrespondent.net/journalists/blog/anna-radio/a51779"
      [270] "http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727"
      [271] "http://blogs.korrespondent.net/celebrities/blog/press13/a51764"
      [272] "http://blogs.korrespondent.net/celebrities/blog/olesdoniy/a51736"
      [273] "http://blogs.korrespondent.net/journalists/blog/raimanelena/a51724"
      ..
      * Full R code is available at the end
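The URL-handling steps above can be sketched in base R as follows. This is a minimal illustration, not the author's code: the tweet texts are made up, the regular expression is an assumption about what counts as a URL, and the unshortening step is left out because it needs network access.

```r
# Hypothetical tweet texts; real input would come from the Twitter API.
tweets <- c(
  "New blog post http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727",
  "No link in this tweet",
  "Repost: http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727",
  "Read http://blogs.korrespondent.net/journalists/blog/anna-radio/a51779 now"
)

# Extract URL-like tokens; tweets without a URL contribute nothing,
# which also covers the "remove empty URLs" step.
url_pattern <- "https?://[^[:space:]]+"
urls <- unlist(regmatches(tweets, gregexpr(url_pattern, tweets)))

# Validate: keep only links that point at the blog host (assumed rule).
is_blog <- grepl("^http://blogs\\.korrespondent\\.net/", urls)

# Remove duplicates while keeping the original order.
blog_urls <- unique(urls[is_blog])
print(blog_urls)
```

Keeping the validation rule as a separate logical vector makes it easy to inspect which links were rejected before they are dropped.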
  12. 12. Step 2. Stemming and Cleaning
  13. 13. Clean texts
      • Translate all blogs into English
      • Extract translated text from the HTML code
      • Load texts into a Corpus
      • Map to lower case; remove punctuation, stop words, and numbers; strip white space
      • Stem documents
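The cleaning steps (everything except translation and stemming) can be sketched in base R. The function name `clean_text` and the tiny stop-word list are assumptions made for this illustration; the deck itself uses a Corpus from the tm package, which ships a much longer stop-word list.

```r
# Tiny illustrative stop-word list (an assumption); real lists are much longer.
stop_words <- c("before", "the", "of", "is", "in", "a", "for", "so")

clean_text <- function(text, stop_words) {
  text <- tolower(text)                        # map to lower case
  text <- gsub("[[:punct:]]+", " ", text)      # punctuation -> space
  text <- gsub("[[:digit:]]+", " ", text)      # drop numbers
  words <- strsplit(text, "[[:space:]]+")[[1]] # split and strip white space
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")
}

cleaned <- clean_text("Before the official launch of the Euro is 150 days.", stop_words)
print(cleaned)
```

With this stop-word list the sentence reduces to "official launch euro days", which matches the start of the cleaned text shown in the example slide.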
  14. 14. Bottlenecks mining Cyrillic texts
  15. 15. Bottlenecks:
      • Declensions in RU/UA words: after stemming, the same word still has several forms
      • The 0xFF problem ("я" is byte 0xFF in windows-1251): DocumentTermMatrix (in R) crops texts. E.g. 'янукович' is filtered out, 'объявлять' becomes 'объ', 'братья' becomes 'брать' (the sense changes), etc.
      • Cyrillic texts with pseudo-graphic or special symbols can't be encoded properly with the windows-1251 charset (additional filter uurlencoded required, not supported in R)
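The 0xFF problem can be reproduced in isolation. The sketch below only confirms that "я" maps to byte 0xFF in windows-1251 and then simulates a cropping rule that cuts a word at the first 0xFF byte. That rule, and the helper name `crop_at_ff`, are assumptions chosen because they reproduce the examples on the slide; the deck does not say where exactly the cropping happens. It also assumes the platform's iconv knows the encoding name "CP1251".

```r
# "\u044f" is the Cyrillic letter "я"; \u escapes keep this script independent
# of the file encoding.
ya_byte <- iconv("\u044f", from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]
print(ya_byte)  # the single windows-1251 byte for "я"

# Hypothetical cropping rule (an assumption): cut the byte stream at the first 0xFF.
crop_at_ff <- function(word_utf8) {
  bytes <- iconv(word_utf8, from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]
  ff_pos <- which(bytes == as.raw(0xff))
  if (length(ff_pos) == 0) return(word_utf8)   # no "я": word is untouched
  kept <- bytes[seq_len(ff_pos[1] - 1)]
  if (length(kept) == 0) return("")            # word starts with "я": nothing left
  iconv(rawToChar(kept), from = "CP1251", to = "UTF-8")
}

bratya  <- "\u0431\u0440\u0430\u0442\u044c\u044f"  # "братья"
cropped <- crop_at_ff(bratya)                      # "брать" under this rule
print(cropped)
```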
  16. 16. Translate texts into English
      # see the full code in Appendix F
      go_tr <- function(url) {
        src.url <- URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
        html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
        frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
        params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
        #...
        dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
        #...
        dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
        return(dest.url)
      }
      [1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"
      [1] "http://translate.googleusercontent.com/translate_c?rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJrhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
  20. 20. Original Blog Text. Example
      Original text:
      До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами, обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС....
      Translated & extracted text:
      Before the official launch of the Euro is 150 days. In the midst of the so-called operational preparation for the championship. It is about establishing communication between the host cities, staff training and marafet hover as a whole. Its no secret that, in comparison with the Ukrainians, the Poles were far more dividends in preparation for the Championship. First of all, we are talking about bringing considerable resources through financing from EU funds....
      Cleaned text:
      official launch euro days midst called operational preparation championship establishing communication host cities staff training marafet hover secret comparison ukrainians poles dividends preparation championship talking bringing considerable resources financing eu funds...
      Stemmed text:
      offici launch euro day midst call oper prepar championship establish communic host citi staff train marafet hover secret comparison ukrainian pole dividend prepar championship talk bring consider resourc financ eu fund...
  21. 21. Step 3. Text Mining & Clustering
  22. 22. Text Mining and Clustering
      • Build normalized TermDocumentMatrix. Remove sparse terms
      • Hierarchical clustering, dendrogram
      • k-means: perform clustering and visualize clusters
      • k-medoids: perform clustering and visualize clusters
  23. 23. Term Normalization
  24. 24. DocumentTermMatrix Structure: rows are documents (nrow=237), columns are terms (ncol=4101), and each cell holds a TF-IDF weight (e.g. 0.0175105020782697, 0.019135397913606, 0.0095258656396137, ...)
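To make the TF-IDF weighting concrete, here is a small base-R sketch on a made-up count matrix. The formula used (term count divided by document length, times log2 of the number of documents over the document frequency) is one common variant and is stated here as an assumption; the deck does not spell out which variant was applied.

```r
# Toy document-term counts (hypothetical data): 3 documents, 3 stemmed terms.
counts <- matrix(
  c(2, 0, 1,
    0, 1, 1,
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("euro", "gas", "ukrain"))
)

tf    <- counts / rowSums(counts)                  # term frequency per document
idf   <- log2(nrow(counts) / colSums(counts > 0))  # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)                    # weight every column by its idf

print(round(tfidf, 3))
```

A term that occurs in every document ("ukrain" here) gets weight 0, which is why TF-IDF pushes down words that do not help to tell documents apart.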
  25. 25. Most Frequent & Correlated Terms. Why is that important?
  26. 26. Most Frequent Terms (Latin)
      Non-stemmed terms
      > findFreqTerms(dtm, lowfreq=1)
       [1] "country" "euro" "european" "government" "internet" "kiev" "kyiv" "money"
       [9] "opposition" "party" "people" "political" "power" "president" "russia" "social"
      [17] "society" "tymoshenko" "ukraine" "ukrainian" "world" "yanukovych"
      Stemmed terms
      > findFreqTerms(dtm, lowfreq=1)
       [1] "chang" "countri" "elect" "euro" "european" "govern" "internet" "kiev"
       [9] "kyiv" "leader" "money" "opposit" "parti" "peopl" "polit" "power"
      [17] "presid" "russia" "russian" "social" "societi" "tymoshenko" "ukrain" "ukrainian"
      [25] "world" "yanukovych"
      * See the full R code in the Appendices
  27. 27. Correlated Terms (Cyrillic vs Latin). Example
      > findAssocs(dtm, 'евр', 0.35)  # correlation with term "евро"
      евр 1.00, старт 0.76, гарант 0.74, хлеб 0.71, тыс 0.62, талисман 0.55, официальн 0.49, воплощен 0.48, будущ 0.35, чемпионат 0.31, живет 0.22, подготовк 0.22, реплик 0.22, секрет 0.21, футбол 0.21
      > findAssocs(dtm, 'euro', 0.35)
      euro 1.00, championship 0.68, footbal 0.57, tourist 0.49, airport 0.45, tournament 0.43, fan 0.42, poland 0.42, horribl 0.38, infrastructur 0.38, foreign 0.37, patrol 0.37, unhappi 0.37, prepar 0.36, flashmob 0.35
  28. 28. Correlation Matrix (Latin vs Cyrillic). Example. English terms: higher correlation, better term accuracy
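The idea behind the association lists above can be sketched with plain Pearson correlation between term columns. The helper `assoc_with` and the toy matrix are hypothetical; they illustrate the principle and are not a reimplementation of the tm function.

```r
# Toy document-term matrix (hypothetical data): 4 documents, 3 terms.
dtm <- matrix(
  c(3, 2, 0,
    0, 0, 1,
    2, 1, 0,
    0, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("euro", "championship", "gas"))
)

# Correlate one term with all terms and keep those at or above the limit.
assoc_with <- function(dtm, term, corlimit) {
  r <- cor(dtm)[, term]
  sort(r[r >= corlimit], decreasing = TRUE)
}

euro_assoc <- assoc_with(dtm, "euro", 0.35)
print(round(euro_assoc, 2))
```

Terms that tend to appear in the same documents ("euro" and "championship" here) correlate strongly, while a term from unrelated documents ("gas") falls below the limit and is dropped.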
  29. 29. Hierarchical Clustering (hclust)
  30. 30. Cluster Dendrogram*
      # input: DTM normalized with TF-IDF (349 terms, sparse=0.7)
      d <- dist(dtm2.df.scale, method = "euclidean")  # dissimilarity matrix
      # clustering with Ward's method (spelled "ward.D" in R >= 3.1.0)
      fit <- hclust(d=d, method="ward")
      # compare: "complete", "single", "mcquitty", "median", "centroid"
      * Full result of h-clustering is available in pdf
  31. 31. Hierarchical Clustering Summary
      • universal hierarchical clustering with different algorithms, e.g. Ward's objective function based on squared Euclidean distance (it's worth experimenting with other methods)
      • good with a large number of terms and a small number of observations
      • gives an understanding of the correlation between terms in the Corpus
      • provides a visual representation of how clusters are nested within each other
      * Full result of h-clustering is available in pdf
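A self-contained toy run of the same two calls may help to see what the dendrogram encodes. The six two-dimensional points are made up, and cutting the tree into two groups with `cutree` is an addition for illustration; the deck itself only plots the dendrogram.

```r
# Six hypothetical "documents" described by two features: two tight groups.
x <- rbind(
  doc1 = c(0.0, 0.0), doc2 = c(0.1, 0.0), doc3 = c(0.0, 0.1),
  doc4 = c(5.0, 5.0), doc5 = c(5.1, 5.0), doc6 = c(5.0, 5.1)
)

d   <- dist(x, method = "euclidean")  # dissimilarity matrix
fit <- hclust(d, method = "ward.D")   # Ward's method, current spelling
groups <- cutree(fit, k = 2)          # cut the tree into two clusters

print(groups)
# plot(fit) would draw the dendrogram for these six points.
```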
  32. 32. Clustering with kmeans
  33. 33. Description of the k-means algorithm*
      1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
      2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
      3) The centroid of each of the k clusters becomes the new mean.
      4) Steps 2 and 3 are repeated until convergence has been reached.
      * Source: http://en.wikipedia.org/wiki/K-means
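The four steps above translate almost line by line into code. The function below is a teaching sketch with a made-up name (`lloyd_kmeans`) and made-up data; it is the plain iterative scheme described on the slide, not the Hartigan-Wong variant that R's `kmeans` uses by default.

```r
lloyd_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: pick k distinct observations as the initial means.
  means <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every observation to every mean,
    # then assign each observation to its nearest mean.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, means[j, ])^2))
    new_assignment <- max.col(-d2, ties.method = "first")
    # Step 4: stop once the assignments no longer change.
    if (identical(new_assignment, assignment)) break
    assignment <- new_assignment
    # Step 3: the centroid of each cluster becomes the new mean.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) means[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = means)
}

# Hypothetical data: two tight groups of three points each.
x <- rbind(c(0.0, 0.0), c(0.1, 0.0), c(0.0, 0.1),
           c(5.0, 5.0), c(5.1, 5.0), c(5.0, 5.1))
set.seed(1)
res <- lloyd_kmeans(x, k = 2)
print(res$cluster)
```

An empty cluster simply keeps its previous mean here; production implementations handle that case more carefully and usually try several random starts.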
  34. 34. Assess number of clusters using kmeans$withinss
      • fewer terms in DTM: higher sum of squares, better cluster quality
      • more terms in DTM: lower sum of squares, lower cluster quality
      Unexpected expected results
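One common way to use the within-cluster sum of squares for choosing k is to compute it for a range of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic data with three well-separated groups; the data, the range of k, and the use of `tot.withinss` (the sum of the per-cluster `withinss` values) are assumptions for illustration.

```r
set.seed(42)

# Synthetic data: three groups of 20 points around made-up centers.
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(20, centers[i, 1], 0.3), rnorm(20, centers[i, 2], 0.3))
}))

# Total within-cluster sum of squares for k = 1..6.
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

print(round(wss, 2))
# plot(ks, wss, type = "b") would show the bend at the natural number of groups.
```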
  35. 35. Clustering with 20 centers
      # dtm.k: DocumentTermMatrix (TF-IDF) with 349 terms (cleaned with sparse=0.9)
      # nstart: let's try 10 random starts to generate centroids
      # algorithm: "Hartigan-Wong" (default)
      > dtm.clust <- kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")
      Cluster sizes
      > dtm.clust$size
       [1] 41 21 4 1 1 5 1 7 12 5 98 2 3 7 10 1 4 2 1 11
      Sum of squares
      > dtm.clust$withinss
       [1] 0.75166171 0.37998302 0.08702162 0.00000000 0.00000000 0.10884947 0.00000000 0.21350480 0.22052166
      [10] 0.07426058 1.35245927 0.03003547 0.05145358 0.12662083 0.25722734 0.00000000 0.08037547 0.02691182
      [19] 0.00000000 0.22561816
      * See the full R code in the Appendices
  36. 36. kmeans. Cluster Visualization: distance matrix (Euclidean); scale the multi-dimensional DTM to 2 dimensions
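Scaling a multi-dimensional matrix down to two dimensions from its distance matrix can be done with classical multidimensional scaling. The sketch below assumes `cmdscale` as the scaling method and uses random data; the deck does not name the function it used.

```r
set.seed(1)

# Hypothetical weights: 10 documents described by 5 terms.
x <- matrix(rnorm(10 * 5), nrow = 10)

d  <- dist(x, method = "euclidean")  # distance matrix between documents
xy <- cmdscale(d, k = 2)             # 2-D coordinates that approximate d

clusters <- kmeans(x, centers = 2, nstart = 10)$cluster

print(dim(xy))
# plot(xy, col = clusters, pch = 19) would draw the documents colored by cluster.
```

Clustering is done on the full matrix and only the plot uses the 2-D coordinates, so the picture is an approximation of the distances the algorithm actually saw.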
  37. 37. k-means clustering Summary
  38. 38. Clustering with kmedoids
  39. 39. Assess number of clusters with pam$silinfo$avg.width. Recommended number of clusters: 2. However …
  40. 40. Perform clustering with 20 centers
      # max_diss, av_diss: maximum/average dissimilarity between the observations in a cluster and the cluster's medoid
      # diameter: maximum dissimilarity between two observations in the cluster
      # separation: minimal dissimilarity between an observation in the cluster and an observation of another cluster
      Result: 4 clusters
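Choosing k by average silhouette width can be sketched as below. The synthetic data and the range of k are assumptions; `pam` comes from the cluster package listed in Appendix E.

```r
library(cluster)

set.seed(42)

# Synthetic data: three groups of 20 points around made-up centers.
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(20, centers[i, 1], 0.3), rnorm(20, centers[i, 2], 0.3))
}))

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

best_k <- ks[which.max(avg_width)]
print(round(avg_width, 3))
print(best_k)
```

The silhouette width is undefined for a single cluster, which is why the search starts at k = 2.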
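The per-cluster statistics described in the comments above are available in the `clusinfo` component of a `pam` fit. A minimal sketch on made-up data:

```r
library(cluster)

# Six hypothetical "documents" described by two features: two tight groups.
x <- rbind(
  doc1 = c(0.0, 0.0), doc2 = c(0.1, 0.0), doc3 = c(0.0, 0.1),
  doc4 = c(5.0, 5.0), doc5 = c(5.1, 5.0), doc6 = c(5.0, 5.1)
)

fit <- pam(x, k = 2)

# One row per cluster: size, max_diss, av_diss, diameter, separation.
print(fit$clusinfo)

# Medoids are actual observations, unlike the means used by k-means.
print(rownames(x)[fit$id.med])
```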
  41. 41. kmedoids. Cluster Visualization
  42. 42. k-medoids clustering Summary
  43. 43. Recognize clusters
  44. 44. Recognized clusters* ([cluster - # of blogs]) “tymoshenko, “Ukrainian “Ukrainian “social networks, opposition, elections” democracy” ex.ua” court” [2-21] [3-4] [6-5] [8-7] “Ukraine-Russia “Ukrainian “Ukraine-EU “Euro-2012” relations, gas” taxes” relations” [9-12] [10-5] [12-2] [14-7] “protests, “culture, “all other blogs “journalist demonstrations, regulation” with various investigations” human rights” [17-4] topics” [20-11] [15-10] [13-3] (unrecognized) Total blogs recognized: 91 of 236 (~40%)* Based on kmeans
  45. 45. Conclusion
      • number of elements in data vector (349) must be significantly < number of observations (245)
      • some resulting clusters include "unlike" blogs (see sum of squares)
      • try kmeans for better precision when mining big dissimilar texts with a low number of observations. In other cases kmedoids is the more robust model
      • focus on similar texts for best accuracy (by category, date range)
      • sentiment analysis would make the analysis even more interesting
  46. 46. Questions & Answers Aleksei Beloshytski Alelsei.Beloshytski@gmail.com
  47. 47. Appendix A. kmeans. Voronoi Diagram (“Euclidean”)
  48. 48. Appendix B. kmeans. Voronoi Diagram (“Manhattan”)
  49. 49. Appendix C. kmeans. Heatmap (most freq. terms). TF-IDF
  50. 50. Appendix D. kmedoids. Heatmap (most freq. terms). TF
  51. 51. Appendix E. R packages required for analysis
      require(twitteR)
      require(XML)
      require(plyr)
      require(tm)
      require(Rstem)
      require(Snowball)
      require(corrplot)
      require(RWeka)
      require(RCurl)
      require(wordcloud)
      require(ggplot2)
      require(vegan)
      require(reshape2)
      require(cluster)
      require(alphahull)
  52. 52. Appendix F. R Code. Translate texts into English
      require(XML)   # htmlTreeParse, getNodeSet, xmlAttrs, xmlValue
      require(RCurl) # getURL
      go_tr <- function(url) {
        # build the Google Translate URL for the page
        src.url <- URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
        html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
        # the translated page is loaded inside the frame named "c"
        frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
        params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
        src.url <- paste("http://translate.google.com", params, sep = "")
        dest.url <- getURL(src.url, followlocation = TRUE)
        html <- htmlTreeParse(dest.url, useInternalNodes = TRUE)
        # take the target URL from the meta refresh tag
        dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
        dest.url <- strsplit(dest.url, "URL=", fixed = TRUE)[[1]][2]
        dest.url <- gsub("\"/>", "", dest.url, fixed = TRUE)
        dest.url <- gsub(" ", "", dest.url, fixed = TRUE)
        dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
        return(dest.url)
      }
      [1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"
      [1] "http://translate.googleusercontent.com/translate_c?rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJrhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
  53. 53. Appendix G. R Code. Parse tweets and extract URLs
      require(twitteR)
      require(plyr) # laply
      kb_tweets <- userTimeline("Korr_Blogs", n=400)
      # get text of tweets
      urls <- laply(kb_tweets, function(t) t$getText())
      # extract urls from text
      url_expr <- regexec("http://[a-zA-Z0-9].\\S*$", urls)
      urls <- regmatches(urls, url_expr)
      # remove empty elements from the list
      urls[sapply(urls, length) < 1] <- NULL
      # unshorten double-shortened urls
      for(i in 1:length(urls)) { urls[i] <- decode_short_url(decode_short_url(urls[[i]])) }
      # remove duplicates
      urls <- as.list(unique(unlist(urls)))
      #...
      # contact me for the rest of the code
      #...
  54. 54. Appendix H. R Code. Handle blogs
      for(i in 1:length(urls)) {
        # translate blogs into English
        url <- go_tr(urls[i])
        blogs <- readLines(tc <- textConnection(url)); close(tc)
        pagetree <- try(htmlTreeParse(blogs, useInternalNodes=TRUE, encoding="windows-1251"))
        # skip pages that could not be parsed
        if(inherits(pagetree, "try-error")) next
        # extract the article text nodes
        x <- xpathSApply(pagetree, "//div[@class='article_box']//*[@onmouseover]/text()|//div[@class='article_box']//*[@onmouseover]/a/text()", xmlValue)
        x <- unlist(strsplit(x, "\n"))
        # trim leading and trailing whitespace
        x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
        x <- x[!(x %in% c("", "|"))]
        #...
      }
      #...
      # contact me for the rest of the code
      #...
  55. 55. Appendix I. R Code. Manage TermDocumentMatrix
      #...
      corp <- Corpus(DirSource("//kor_blogs/en"), readerControl=list(language="en", encodeString="windows-1251"))
      #..
      # Clean texts, stemming and so on
      #...
      # Create DTM for both stemmed and non-stemmed corpora
      dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
      dtm <- removeSparseTerms(dtm, sparse=0.995) # 0.995 - for both EN and RU
      #...
      # Find most frequent and associated terms
      # Build correlation matrix
      #..
      corrplot(type="lower", tl.cex=.6, corr=corr_stem_ru, title="Correlation matrix", tl.col="grey20", method="circle", order="FPC", addtextlabel = "ld", outline=TRUE)
      #...
      # contact me for the rest of the code
      #...
  56. 56. Appendix J. R Code. Hierarchical clustering
      #...
      dtm2 <- as.TermDocumentMatrix(dtm)
      #...
      dtm2.df <- as.data.frame(inspect(dtm2))
      #...
      (d <- dist(dtm2.df.scale, method = "euclidean")) # distance matrix
      fit <- hclust(d=d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0
      #..
      dev.off()
      #...
      # contact me for the rest of the code
      #...
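The elided steps in Appendix J can be filled in along these lines. This is a sketch on a hypothetical count matrix, not the author's omitted code; `tdm`, `tdm.scale`, and the choice of three groups are assumptions made for illustration.

```r
set.seed(3)
# Hypothetical term-by-document counts: 12 "terms" (rows) x 8 "documents" (columns).
tdm <- matrix(rpois(12 * 8, lambda = 3), nrow = 12,
              dimnames = list(paste0("term", 1:12), paste0("doc", 1:8)))

tdm.scale <- scale(tdm)                      # center and scale each column
d <- dist(tdm.scale, method = "euclidean")   # distances between terms
fit <- hclust(d, method = "ward.D")          # "ward" was renamed "ward.D" in R 3.1.0

groups <- cutree(fit, k = 3)                 # cut the dendrogram into 3 groups
table(groups)
```

Calling `plot(fit)` draws the dendrogram, and `rect.hclust(fit, k = 3)` can outline the groups on it.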
  57. 57. Appendix K. R Code. Wordcloud (most frequent terms)
      require(wordcloud)
      #...
      dtm.m <- as.matrix(dtm)
      v <- apply(dtm.m, 2, sum) # calculate number of occurrences for each word
      v <- sort(v, decreasing=TRUE)
      #..
      wordcloud(d$word, d$freq, scale=c(9,.1), min.freq=3, max.words=Inf, random.order=FALSE, rot.per=.3, colors=pal2)
      #...
      # contact me for the rest of the code
      #...
  58. 58. Appendix L. R Code. kmeans analysis
      #...
      # assess number of clusters
      wss <- (nrow(dtm)-1)*sum(apply(dtm,2,var)) # for stemmed DTM
      dtm_orig <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf)) # non-stemmed DTM
      dtm_orig <- removeSparseTerms(dtm_orig, sparse=0.995)
      #...
      # visualize withinss
      # perform clustering
      # dtm.k – DocumentTermMatrix (TF-IDF) with 349 terms (cleaned with sparse=0.9)
      # nstart – let's try 10 random starts to generate centroids
      # algorithm – "Hartigan-Wong" (default)
      dtm.clust <- kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")
      dtm.clust$size
      #...
      # contact me for the rest of the code
      #...
  59. 59. Appendix M. R Code. kmedoids analysis
      #...
      # assess number of clusters
      # visualize average silhouette width
      ggplot() + geom_line(aes(x=1:236, y=asw), size=1, colour="red4") +
        theme(axis.text.x=element_text(hjust=0, colour="grey20", size=14), axis.text.y=element_text(size=14, colour="grey20"), axis.title.x=element_text(size=20, colour="grey20"), axis.title.y=element_text(angle=90, size=20, colour="grey20")) +
        labs(y="average silhouette width", x="k-medoids (# clusters)", size=16) +
        scale_x_continuous(breaks=c(k.best,20,40,60,80,100,120,140,160,180,200,220))
      # perform kmedoids clustering
      #...
      dtm.clust.m$clusinfo
      #...
      # contact me for the rest of the code
      #...
  60. 60. Appendix N. R Code. Visualize clusters
      #...
      # define which cluster to visualize
      dtm.clust.v <- dtm.clust   # for kmeans
      dtm.clust.v <- dtm.clust.m # for kmedoids
      #...
      require(vegan)
      # distance matrix
      dtm.dist <- dist(dtm.k, method="euclidean")
      dtm_scaled <- cmdscale(dtm.dist) # scale from multi dimensions to two dimensions
      #...
      for(i in seq_along(groups)) {
        points(dtm_scaled[factor(dtm.clust.v$cluster) == groups[i], ], col = cols[i], pch = 20)
      }
      # draw ordihull
      ordihull(dtm_scaled, factor(dtm.clust.v$cluster), lty = "dotted", draw="polygon", label=TRUE)
      # draw Voronoi diagram
      #...
      # contact me for the rest of the code
      #...
  61. 61. Appendix O. R Code. Visualize heatmaps
      #...
      dtm.clust.v <- dtm.clust   # for kmeans
      dtm.clust.v <- dtm.clust.m # for kmedoids
      dtm0 <- dtm.k # dtm for kmeans clustering
      dtm0 <- removeSparseTerms(dtm0, sparse=0.7) # keep terms with sparsity below 0.7, i.e. present in at least ~30% of blogs
      dtm.df <- as.data.frame(inspect(dtm0))
      dfc <- cbind(dtm.df, id=seq(nrow(dtm.df)), cluster=dtm.clust.v$cluster) # append id and cluster
      #...
      require(ggplot2)
      dev.off()
      dev.new()
      ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value)) +
        theme(axis.text.x=element_text(angle=90, hjust=0, colour="grey20", size=14)) + labs(x="", y="")
      #...
      # contact me for the rest of the code
      #...
  62. 62. Appendix P. Most Frequent Terms (Cyrillic)*
      Non-stemmed terms (cropped terms, 0xFF)
      > findFreqTerms(dtm, lowfreq=3)
       [1] "брать" "вли" "вопрос" "врем" "выборы" "высоцкий" "действи" "евро" "знакома"
      [10] "истори" "написал" "непри" "нова" "объ" "остаетс" "попул" "прав" "прин"
      [19] "прочитал" "прошла" "сегодн" "суд" "течение" "третий" "украине" "украинцы" "украины"
      [28] "хочу"
      Stemmed terms
      > findFreqTerms(dtm, lowfreq=4)
       [1] "виктор" "власт" "вли" "вопрос" "врем" "выбор" "высоцк"
       [8] "государствен" "интересн" "непри" "объ" "очередн" "попул" "последн"
      [15] "посто" "прав" "прин" "прочита" "росси" "сегодн" "страниц"
      [22] "течен" "украин"
      * Bold words – cropped; blue – terms don't exist in non-stemmed variant
