An Introduction to Data Mining with R

Slides of a talk on Introduction to Data Mining with R at University of Canberra, Sept 2013

  1. An Introduction to Data Mining with R. Yanchang Zhao, http://www.RDataMining.com, 6 September 2013
  2-4. Questions: Do you know data mining and techniques for it? Have you used R before? Have you used R in your data mining research or projects?
  5. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  6. What is R? R (http://www.r-project.org/) is a free software environment for statistical computing and graphics. R can be easily extended with the 4,728 packages available on CRAN (http://cran.r-project.org/) as of 6 Sept 2013, and many other packages are provided on Bioconductor (http://www.bioconductor.org/), R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), etc. R manuals on CRAN (http://cran.r-project.org/manuals.html): An Introduction to R; The R Language Definition; R Data Import/Export; ...
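      Installing and loading any of these packages takes one call each; a minimal sketch (using package party, which also appears in the classification examples below):
      # install a package from CRAN and load it into the current session
      install.packages("party")
      library(party)
      # browse the package's help pages and vignettes
      help(package = "party")
      vignette(package = "party")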
  7. Why R? R is widely used in both academia and industry. R is ranked no. 1 again in the KDnuggets 2013 poll on top languages for analytics, data mining and data science (http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html). The CRAN Task Views (http://cran.r-project.org/web/views/) provide collections of packages for different tasks: machine learning & statistical learning; cluster analysis & finite mixture models; time series analysis; multivariate statistics; analysis of spatial data; ...
  8. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  9. Classification with R: Decision trees: rpart, party; Random forest: randomForest, party; SVM: e1071, kernlab; Neural networks: nnet, neuralnet, RSNNS; Performance evaluation: ROCR
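      The next slides build a decision tree with party; as a hedged sketch of another option listed above, a random forest on the same kind of iris split might look like this (my own illustration with randomForest defaults, not taken from the slides):
      library(randomForest)
      set.seed(1234)
      ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
      iris.train <- iris[ind==1, ]
      iris.test  <- iris[ind==2, ]
      # fit a random forest with 100 trees
      iris.rf <- randomForest(Species ~ ., data=iris.train, ntree=100)
      # confusion matrix on the held-out test data
      table(predict(iris.rf, newdata=iris.test), iris.test$Species)
      # variable importance (mean decrease in Gini)
      importance(iris.rf)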
  10. The Iris Dataset
      # iris data
      str(iris)
      ## 'data.frame': 150 obs. of 5 variables:
      ##  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
      ##  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
      ##  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
      ##  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
      ##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: ...
      # split into training and test datasets
      set.seed(1234)
      ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))
      iris.train <- iris[ind==1, ]
      iris.test <- iris[ind==2, ]
  11. Build a Decision Tree
      # build a decision tree
      library(party)
      iris.formula <- Species ~ Sepal.Length + Sepal.Width +
                                Petal.Length + Petal.Width
      iris.ctree <- ctree(iris.formula, data=iris.train)
  12. plot(iris.ctree)
      [Plot: the fitted ctree, splitting first on Petal.Length (<= 1.9 vs > 1.9, p < 0.001), then on Petal.Width (<= 1.7 vs > 1.7, p < 0.001) and Petal.Length (<= 4.4 vs > 4.4, p = 0.026); terminal nodes 2 (n = 40), 5 (n = 21), 6 (n = 19) and 7 (n = 32) show the class distributions.]
  13. Prediction
      # predict on test data
      pred <- predict(iris.ctree, newdata = iris.test)
      # check prediction result
      table(pred, iris.test$Species)
      ##
      ## pred         setosa versicolor virginica
      ##   setosa         10          0         0
      ##   versicolor      0         12         2
      ##   virginica       0          0        14
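      Not on the original slide, but the overall accuracy can be read straight off this confusion matrix:
      # proportion of correctly classified test cases
      tab <- table(pred, iris.test$Species)
      sum(diag(tab)) / sum(tab)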
  14. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  15. Clustering with R: k-means: kmeans(), kmeansruns(); k-medoids: pam(), pamk(); hierarchical clustering: hclust(), agnes(), diana(); DBSCAN: fpc; BIRCH: birch. (Names followed by "()" are functions; the others are packages.)
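      k-means and DBSCAN are demonstrated on the following slides; a minimal sketch of hierarchical clustering with hclust() on a sample of iris (my own illustration, not from the slides):
      # hierarchical clustering of 40 randomly sampled iris records
      idx <- sample(1:nrow(iris), 40)
      iris.sample <- iris[idx, -5]                     # drop the Species column
      hc <- hclust(dist(iris.sample), method = "ave")  # average linkage
      plot(hc, hang = -1, labels = iris$Species[idx])
      # cut the dendrogram into 3 clusters and compare with the species
      groups <- cutree(hc, k = 3)
      table(groups, iris$Species[idx])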
  16. k-means Clustering
      set.seed(8953)
      iris2 <- iris
      # remove class IDs
      iris2$Species <- NULL
      # k-means clustering
      iris.kmeans <- kmeans(iris2, 3)
      # check result
      table(iris$Species, iris.kmeans$cluster)
      ##
      ##                1  2  3
      ##   setosa       0 50  0
      ##   versicolor   2  0 48
      ##   virginica   36  0 14
  17. # plot clusters and their centers
      plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
      points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
             col=1:3, pch="*", cex=5)
      [Plot: scatter of Sepal.Length vs Sepal.Width coloured by cluster, with the three cluster centers marked by "*".]
  18. Density-based Clustering
      library(fpc)
      iris2 <- iris[-5]  # remove class IDs
      # DBSCAN clustering
      ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
      # compare clusters with original class IDs
      table(ds$cluster, iris$Species)
      ##
      ##     setosa versicolor virginica
      ##   0      2         10        17
      ##   1     48          0         0
      ##   2      0         37         0
      ##   3      0          3        33
  19. # 1-3: clusters; 0: outliers or noise
      plotcluster(iris2, ds$cluster)
      [Plot: discriminant projection of the iris data with points labelled by cluster number; clusters 1-3 form separate groups and 0 marks outliers/noise.]
  20. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  21. Association Rule Mining with R: Association rules: apriori(), eclat() in package arules; Sequential patterns: arulesSequences; Visualisation of associations: arulesViz
  22. The Titanic Dataset
      load("./data/titanic.raw.rdata")
      dim(titanic.raw)
      ## [1] 2201    4
      idx <- sample(1:nrow(titanic.raw), 8)
      titanic.raw[idx, ]
      ##       Class    Sex   Age Survived
      ## 501     3rd   Male Adult       No
      ## 477     3rd   Male Adult       No
      ## 674     3rd   Male Adult       No
      ## 766    Crew   Male Adult       No
      ## 1485    3rd Female Adult       No
      ## 1388    2nd Female Adult       No
      ## 448     3rd   Male Adult       No
      ## 590     3rd   Male Adult       No
  23. Association Rule Mining
      # find association rules with the APRIORI algorithm
      library(arules)
      rules <- apriori(titanic.raw, control=list(verbose=F),
                       parameter=list(minlen=2, supp=0.005, conf=0.8),
                       appearance=list(rhs=c("Survived=No", "Survived=Yes"),
                                       default="lhs"))
      # sort rules
      quality(rules) <- round(quality(rules), digits=3)
      rules.sorted <- sort(rules, by="lift")
      # have a look at rules
      # inspect(rules.sorted)
  24. Rules sorted by lift (output of inspect(rules.sorted)):
      ##    lhs                                    rhs             support confidence  lift
      ## 1  {Class=2nd, Age=Child}              => {Survived=Yes}    0.011      1.000 3.096
      ## 2  {Class=2nd, Sex=Female, Age=Child}  => {Survived=Yes}    0.006      1.000 3.096
      ## 3  {Class=1st, Sex=Female}             => {Survived=Yes}    0.064      0.972 3.010
      ## 4  {Class=1st, Sex=Female, Age=Adult}  => {Survived=Yes}    0.064      0.972 3.010
      ## 5  {Class=2nd, Sex=Male, Age=Adult}    => {Survived=No}     0.070      0.917 1.354
      ## 6  {Class=2nd, Sex=Female}             => {Survived=Yes}    0.042      0.877 2.716
      ## 7  {Class=Crew, Sex=Female}            => {Survived=Yes}    0.009      0.870 2.692
      ## 8  {Class=Crew, Sex=Female, Age=Adult} => {Survived=Yes}    0.009      0.870 2.692
      ## 9  {Class=2nd, Sex=Male}               => {Survived=No}     0.070      0.860 1.271
      ## 10 {Class=2nd, ...
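      Several of these rules are redundant (a shorter rule with the same or higher lift already covers them). A hedged sketch of pruning them, assuming is.subset() returns a dense logical matrix as it did in arules versions of that time:
      # a rule is redundant if an earlier (higher-lift) rule is a subset of it
      subset.matrix <- is.subset(rules.sorted, rules.sorted)
      subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
      redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
      # keep only the non-redundant rules
      rules.pruned <- rules.sorted[!redundant]
      inspect(rules.pruned)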
  25. library(arulesViz)
      plot(rules, method = "graph")
      [Plot: graph of the 12 rules; vertex size maps to support (0.006 - 0.192) and colour to lift (1.222 - 3.096); itemsets such as {Class=2nd,Age=Child} and {Class=1st,Sex=Female} point to {Survived=Yes}, while {Class=3rd,Sex=Male} and {Class=2nd,Sex=Male,Age=Adult} point to {Survived=No}.]
  26. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  27. Text Mining with R: Text mining: tm; Topic modelling: topicmodels, lda; Word cloud: wordcloud; Twitter data access: twitteR
  28. Fetch Twitter Data
      ## retrieve tweets from the user timeline of @rdatamining
      library(twitteR)
      # tweets <- userTimeline('rdatamining')
      load(file = "./data/rdmTweets.RData")
      (nDocs <- length(tweets))
      ## [1] 320
      strwrap(tweets[[320]]$text, width = 50)
      ## [1] "An R Reference Card for Data Mining is now"
      ## [2] "available on CRAN. It lists many useful R"
      ## [3] "functions and packages for data mining"
      ## [4] "applications."
      # convert tweets to a data frame
      df <- do.call("rbind", lapply(tweets, as.data.frame))
  29. Text Cleaning
      library(tm)
      # build a corpus
      myCorpus <- Corpus(VectorSource(df$text))
      # convert to lower case
      myCorpus <- tm_map(myCorpus, tolower)
      # remove punctuation & numbers
      myCorpus <- tm_map(myCorpus, removePunctuation)
      myCorpus <- tm_map(myCorpus, removeNumbers)
      # remove URLs
      removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
      myCorpus <- tm_map(myCorpus, removeURL)
      # remove 'r' and 'big' from stopwords
      myStopwords <- setdiff(stopwords("english"), c("r", "big"))
      # remove stopwords
      myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
  30. Stemming
      # keep a copy of corpus
      myCorpusCopy <- myCorpus
      # stem words
      myCorpus <- tm_map(myCorpus, stemDocument)
      # stem completion
      myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
      # replace "miners" with "mining", because "mining" was
      # first stemmed to "mine" and then completed to "miners"
      myCorpus <- tm_map(myCorpus, gsub, pattern="miners", replacement="mining")
      strwrap(myCorpus[320], width=50)
      ## [1] "r reference card data mining now available cran"
      ## [2] "list used r functions package data mining"
      ## [3] "applications"
  31. Frequent Terms
      myTdm <- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(1, Inf)))
      # inspect frequent words
      (freq.terms <- findFreqTerms(myTdm, lowfreq=20))
      ##  [1] "analysis"     "big"      "computing"
      ##  [4] "data"         "examples" "mining"
      ##  [7] "network"      "package"  "position"
      ## [10] "postdoctoral" "r"        "research"
      ## [13] "slides"       "social"   "tutorial"
      ## [16] "university"   "used"
  32. Associations
      # which words are associated with 'r'?
      findAssocs(myTdm, "r", 0.2)
      ## examples     code  package
      ##     0.32     0.29     0.20
      # which words are associated with 'mining'?
      findAssocs(myTdm, "mining", 0.25)
      ##     data   mahout recommendation     sets supports
      ##     0.47     0.30           0.30     0.30     0.30
      ## frequent  itemset
      ##     0.26     0.26
  33. Network of Terms
      library(graph)
      library(Rgraphviz)
      plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)
      [Plot: network of the frequent terms (analysis, big, computing, data, examples, mining, network, package, position, postdoctoral, r, research, slides, social, tutorial, university, used), with edges between correlated terms.]
  34. Word Cloud
      library(wordcloud)
      m <- as.matrix(myTdm)
      freq <- sort(rowSums(m), decreasing=T)
      wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)
      [Plot: word cloud of the tweet terms; the most prominent words include "r", "data", "mining", "analysis", "package", "research", "slides" and "examples".]
  35. Topic Modelling
      library(topicmodels)
      set.seed(123)
      myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
      terms(myLda, 5)
      ##      Topic 1       Topic 2    Topic 3  Topic 4
      ## [1,] "data"        "r"        "r"      "research"
      ## [2,] "mining"      "package"  "time"   "position"
      ## [3,] "big"         "examples" "series" "data"
      ## [4,] "association" "used"     "users"  "university"
      ## [5,] "rules"       "code"     "talk"   "postdoctoral"
      ##      Topic 5      Topic 6     Topic 7    Topic 8
      ## [1,] "mining"     "group"     "data"     "analysis"
      ## [2,] "data"       "data"      "r"        "network"
      ## [3,] "slides"     "used"      "mining"   "social"
      ## [4,] "modelling"  "software"  "analysis" "text"
      ## [5,] "tools"      "kdnuggets" "book"     "slides"
  36. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  37. Time Series Analysis with R: Time series decomposition: decomp(), decompose(), arima(), stl(); Time series forecasting: forecast; Time series clustering: TSclust; Dynamic Time Warping (DTW): dtw
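      None of these is demonstrated in the deck; a minimal sketch with the built-in AirPassengers series, assuming the forecast package and its auto.arima() and forecast() functions:
      # decompose the monthly series into trend, seasonal and random components
      f <- decompose(AirPassengers)
      plot(f)
      # fit an ARIMA model and forecast the next 12 months
      library(forecast)
      fit <- auto.arima(AirPassengers)
      fc <- forecast(fit, h = 12)
      plot(fc)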
  38. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  39. Social Network Analysis with R: Packages: igraph, sna; Centrality measures: degree(), betweenness(), closeness(), transitivity(); Clusters: clusters(), no.clusters(); Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number(); Community detection: fastgreedy.community(), spinglass.community()
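      A small sketch of these functions on a toy graph (my own illustration, assuming the igraph function names of that time):
      library(igraph)
      # a small undirected toy graph
      g <- graph.formula(A-B, A-C, B-C, C-D, D-E, E-F, D-F)
      degree(g)        # number of neighbours of each vertex
      betweenness(g)   # betweenness centrality
      closeness(g)     # closeness centrality
      transitivity(g)  # global clustering coefficient
      # detect communities and plot the graph coloured by community membership
      fc <- fastgreedy.community(g)
      plot(g, vertex.color = membership(fc))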
  40. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  41. R and Hadoop: Packages: RHadoop, RHive. RHadoop (https://github.com/RevolutionAnalytics/RHadoop/wiki) is a collection of three R packages: rmr2, which performs data analysis with R via MapReduce on a Hadoop cluster; rhdfs, which connects to the Hadoop Distributed File System (HDFS); and rhbase, which connects to the NoSQL HBase database. You can play with it on a single PC (in standalone or pseudo-distributed mode), and code developed there will also run on a cluster of PCs (in fully distributed mode). Step-by-step instructions for setting up a first R Hadoop system: http://www.rdatamining.com/tutorials/rhadoop
  42. An Example of MapReduce with R
      library(rmr2)
      map <- function(k, lines) {
        # split each line into words
        words.list <- strsplit(lines, " ")
        words <- unlist(words.list)
        return(keyval(words, 1))
      }
      reduce <- function(word, counts) {
        keyval(word, sum(counts))
      }
      wordcount <- function(input, output = NULL) {
        mapreduce(input = input, output = output, input.format = "text",
                  map = map, reduce = reduce)
      }
      ## Submit job
      out <- wordcount(in.file.path, out.file.path)
      (From Jeffrey Breen's presentation on Using R with Hadoop,
      http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/)
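      As a hedged sketch (not from the slides), the same map and reduce functions can be tried without a Hadoop cluster by switching rmr2 to its local backend and feeding data in with to.dfs():
      library(rmr2)
      rmr.options(backend = "local")  # run MapReduce jobs in-process, no Hadoop needed
      lines <- c("hello world", "hello r and hadoop")
      out <- mapreduce(input = to.dfs(lines), map = map, reduce = reduce)
      from.dfs(out)                   # word counts as key-value pairs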
  43. Outline: Introduction; Classification with R; Clustering with R; Association Rule Mining with R; Text Mining with R; Time Series Analysis with R; Social Network Analysis with R; R and Hadoop; Online Resources
  44. Online Resources
      RDataMining website: http://www.rdatamining.com
        R Reference Card for Data Mining
        R and Data Mining: Examples and Case Studies
      RDataMining Group on LinkedIn (3,100+ members): http://group.rdatamining.com
      RDataMining on Twitter (1,200+ followers): http://twitter.com/rdatamining
      Free online courses: http://www.rdatamining.com/resources/courses
      Online documents: http://www.rdatamining.com/resources/onlinedocs
  45. The End. Thanks! Email: yanchang(at)rdatamining.com
