
Visualization of Supervised Learning with {arules} + {arulesViz}


My talk at Global TokyoR #1

Comments:
  • @kw Try plot(rules, method="graph", control=list(type="items", main="")), where rules is a rules object produced by apriori (check with class(rules)).
  • Just an idea: try writing two columns whose rows take the form "v1, value_x".
  • Can I transform the graph produced by {arulesViz}, where each node represents a rule item {v1 = value}, into a graph where each node is a variable {v1}? Thank you.

Visualization of Supervised Learning with {arules} + {arulesViz}

  1. Visualization of Supervised Learning with {arules} + {arulesViz}. Takashi J. OZAKI, Ph.D., Recruit Communications Co., Ltd. 2014/4/17
  2. About me  Twitter: @TJO_datasci  Data Scientist (Quant Analyst) in the Recruit group  A group of companies in advertising media and human resources, known as a major player in big data  Current mission: ad-hoc analysis of various marketing data  Actually, I'm still new to the field of data science
  3. About me  Original background: neuroscience of the human brain (6 years' experience as a postdoc researcher) (Ozaki, PLoS One, 2011)
  4. About me  English version of my blog: http://tjo-en.hatenablog.com/
  5. Tonight's topic is:
  6. Graphical Visualization of Supervised Learning
  7. Advantages of this technique  More intuitive  Easy to grasp even for high-dimensional data  Even laypeople can easily understand it  Useful for presentations
  8. Supervised learning: lower dimension, more intuitive  In the case of 2D data (e.g. nonlinear SVM):
     x          y          label
      0.924335  -1.0665    Yes
      2.109901   2.615284  No
      0.988192  -0.90812   Yes
      1.299749   0.944518  No
     -0.60885    0.457816  Yes
     -2.25484    1.615489  Yes
  9. Supervised learning: higher dimension, less intuitive  In the case of 7D data… no way!!!
     game1 game2 game3 social1 social2 app1 app2 cv
     0     0     0     1       0       0    0    No
     1     0     0     1       1       0    0    No
     0     1     1     1       1       1    0    Yes
     0     0     1     1       0       1    1    Yes
     1     0     1     0       1       1    1    Yes
     0     0     0     1       1       1    0    No
     …     …     …     …       …       …    …    …
     ???
  10. Is there any technique that can easily visualize supervised learning in higher dimensions (for lay people)?
  11.  {arules} + {arulesViz}
  12. Why association rules and their visualization?  Very roughly, association rules can be interpreted as a kind of generative modeling: a large set of conditional probabilities  If they can be regarded as a set of conditional probabilities, they can also be described as something like a Bayesian network "X → Y"  If it's like a Bayesian network, it can be visualized as a graph, e.g. with {igraph}
      supp(X → Y) = σ(X ∪ Y) / M
      conf(X → Y) = supp(X → Y) / supp(X)
      lift(X → Y) = conf(X → Y) / supp(Y)
      (M is the total number of transactions; σ(X ∪ Y) counts the transactions containing both X and Y)
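     These three measures can be computed by hand from a binary transaction matrix, which makes the definitions above concrete. A minimal base-R sketch; the matrix m, its item columns X and Y, and all values are illustrative assumptions, not data from the talk:

        set.seed(1)
        # toy transaction matrix: M = 100 transactions, items X and Y as logical columns
        m <- matrix(rbinom(200, 1, 0.5) == 1, ncol = 2, dimnames = list(NULL, c("X", "Y")))
        M <- nrow(m)
        supp_xy <- sum(m[, "X"] & m[, "Y"]) / M   # supp(X -> Y) = sigma(X ∪ Y) / M
        conf_xy <- supp_xy / (sum(m[, "X"]) / M)  # conf(X -> Y) = supp(X -> Y) / supp(X)
        lift_xy <- conf_xy / (sum(m[, "Y"]) / M)  # lift(X -> Y) = conf(X -> Y) / supp(Y)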
  13. Further points  Only when all independent variables are binary can they be handled as "basket transactions" (a conversion sketch follows below):
     game1 game2 game3 social1 social2 app1 app2 cv
     0     0     0     1       0       0    0    No
     1     0     0     1       1       0    0    No
     0     1     1     1       1       1    0    Yes
     0     0     1     1       0       1    1    Yes
     1     0     1     0       1       1    1    Yes
     0     0     0     1       1       1    0    No
     …
     →
     {social1, No}
     {game1, social1, social2, No}
     {game2, game3, social1, social2, app1, Yes}
     {game3, social1, app1, app2, Yes}
     {game1, game3, social2, app1, app2, Yes}
     {social1, social2, app1, No}
     …
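     As that sketch: {arules} can coerce a logical item matrix directly into a transactions object. The toy data frame below is a hypothetical three-row excerpt (column names mimic the slide; the values are made up):

        library(arules)
        d2toy <- data.frame(game1 = c(0, 1, 0), social1 = c(1, 1, 1),
                            social2 = c(0, 1, 1), yes = c(0, 0, 1), no = c(1, 1, 0))
        trans <- as(as.matrix(d2toy) == 1, "transactions")  # logical matrix -> transactions
        inspect(trans)  # each row becomes an itemset, e.g. {social1, no}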
  14. Let's try it in R!
  15. Sample data "d1"
     game1 game2 game3 social1 social2 app1 app2 cv
     0     0     0     1       0       0    0    No
     1     0     0     1       1       0    0    No
     0     1     1     1       1       1    0    Yes
     0     0     1     1       0       1    1    Yes
     1     0     1     0       1       1    1    Yes
     0     0     0     1       1       1    0    No
     …
     Imagine you're working on a web entertainment platform with 3 smartphone (SP) games, 2 SP social networking services, and 2 apps. The data record each user's activity on each content item during the first month after registration, and the "cv" label indicates whether the user is still active after that month.
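     The deck doesn't distribute d1 itself; to follow along, here is a hedged stand-in generator with the same shape (seven 0/1 columns plus a two-level cv factor, 3,000 rows as in the apriori output later). The coefficients are arbitrary choices so the models below have structure to find, not the talk's real data:

        set.seed(71)
        n <- 3000
        d1 <- data.frame(game1 = rbinom(n, 1, 0.3), game2 = rbinom(n, 1, 0.3),
                         game3 = rbinom(n, 1, 0.3), social1 = rbinom(n, 1, 0.5),
                         social2 = rbinom(n, 1, 0.5), app1 = rbinom(n, 1, 0.4),
                         app2 = rbinom(n, 1, 0.4))
        # retention loosely driven by app1 and social2, depressed by social1 (assumed)
        p <- plogis(-1.4 + 5 * d1$app1 + 1.5 * d1$social2 - 3 * d1$social1 + d1$game1)
        d1$cv <- factor(ifelse(runif(n) < p, "Yes", "No"))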
  16. In the case of svm {e1071}:
     > d1.svm <- svm(cv ~ ., d1)  # install and require {e1071}
     > table(d1$cv, predict(d1.svm, d1[, -8]))
            No  Yes
       No  1402   98
       Yes   80 1420
     # Good accuracy (but only on the training data)
  17. In the case of randomForest {randomForest}:
     > tuneRF(d1[, -8], d1[, 8], doBest = T)  # install and require {randomForest}
     # (omitted)
     > d1.rf <- randomForest(cv ~ ., d1, mtry = 2)
     > table(d1$cv, predict(d1.rf, d1[, -8]))
            No  Yes
       No  1413   87
       Yes   92 1408
     # Good accuracy
     > importance(d1.rf)
             MeanDecreaseGini
     game1          20.640253
     game2          12.115196
     game3           2.355584
     social1       189.053648
     social2        76.476470
     app1          796.937087
     app2            2.804019
     # Variable importance (without any directionality)
  18. In the case of glm {stats}:
     > d1.glm <- glm(cv ~ ., d1, family = binomial)
     > summary(d1.glm)
     Call: glm(formula = cv ~ ., family = binomial, data = d1)
     # (omitted)
     Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
     (Intercept) -1.37793    0.25979  -5.304 1.13e-07 ***
     game1        1.05846    0.17344   6.103 1.04e-09 ***
     game2       -0.54914    0.16752  -3.278  0.00105 **
     game3        0.12035    0.16803   0.716  0.47386
     social1     -3.00110    0.21653 -13.860  < 2e-16 ***
     social2      1.53098    0.17349   8.824  < 2e-16 ***
     app1         5.33547    0.19191  27.802  < 2e-16 ***
     app2         0.07811    0.16725   0.467  0.64048
     ---
     # (omitted)
  19. Sample data converted for transactions: "d2"
     game1 game2 game3 social1 social2 app1 app2 yes no
     0     0     0     1       0       0    0    0   1
     1     0     0     1       1       0    0    0   1
     0     1     1     1       1       1    0    1   0
     0     0     1     1       0       1    1    1   0
     1     0     1     0       1       1    1    1   0
     0     0     0     1       1       1    0    0   1
     …
     The "cv" column was simply split into two binary (0 or 1) columns, "yes" and "no"
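     The split amounts to two indicator columns; a minimal sketch, assuming the d1 data frame from slide 15:

        # replace the cv factor with 0/1 indicator columns "yes" and "no"
        d2 <- d1[, setdiff(names(d1), "cv")]
        d2$yes <- as.integer(d1$cv == "Yes")
        d2$no  <- as.integer(d1$cv == "No")
        head(d2)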
  20. Run apriori {arules} to get association rules
     > d2.ap.small <- apriori(as.matrix(d2))  # install and require {arules}
     parameter specification:
      confidence minval smax arem aval originalSupport support minlen maxlen target ext
             0.8    0.1    1 none FALSE           TRUE     0.1      1     10  rules FALSE
     algorithmic control:
      filter tree heap memopt load sort verbose
         0.1 TRUE TRUE  FALSE TRUE    2    TRUE
     apriori - find association rules with the apriori algorithm
     version 4.21 (2004.05.09)    (c) 1996-2004 Christian Borgelt
     set item appearances ...[0 item(s)] done [0.00s].
     set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s].
     sorting and recoding items ... [9 item(s)] done [0.00s].
     creating transaction tree ... done [0.00s].
     checking subsets of size 1 2 3 4 5 done [0.00s].
     writing ... [50 rule(s)] done [0.00s].  # only 50 rules…
     creating S4 object ... done [0.00s].
  21. Run apriori {arules} to get association rules
     > d2.ap.large <- apriori(as.matrix(d2), parameter = list(support = 0.001))
     parameter specification:
      confidence minval smax arem aval originalSupport support minlen maxlen target ext
             0.8    0.1    1 none FALSE           TRUE   0.001      1     10  rules FALSE
     algorithmic control:
      filter tree heap memopt load sort verbose
         0.1 TRUE TRUE  FALSE TRUE    2    TRUE
     apriori - find association rules with the apriori algorithm
     version 4.21 (2004.05.09)    (c) 1996-2004 Christian Borgelt
     set item appearances ...[0 item(s)] done [0.00s].
     set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s].
     sorting and recoding items ... [9 item(s)] done [0.00s].
     creating transaction tree ... done [0.00s].
     checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
     writing ... [182 rule(s)] done [0.00s].  # as many as 182 rules
     creating S4 object ... done [0.00s].
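     Before plotting, it can help to eyeball the strongest rules; a small sketch, assuming the d2.ap.large object from the run above:

        library(arules)
        # the five rules with the highest lift, with their support/confidence/lift
        inspect(head(sort(d2.ap.large, by = "lift"), 5))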
  22. OK, just visualize it
     > require("arulesViz")
     # (omitted)
     > plot(d2.ap.small, method = "graph", control = list(type = "items", layout = layout.fruchterman.reingold))
     > plot(d2.ap.large, method = "graph", control = list(type = "items", layout = layout.fruchterman.reingold))
     # The Fruchterman-Reingold force-directed drawing algorithm (from {igraph}) places nodes at distances roughly proportional to the shortest path length between them
     # So nodes (items) end up located according to their "closeness" to one another
  23. Small set of rules visualized with {arulesViz} (figure)
  24. Compare with the result of glm:
     > d1.glm <- glm(cv ~ ., d1, family = binomial)
     > summary(d1.glm)
     Call: glm(formula = cv ~ ., family = binomial, data = d1)
     # (omitted)
     Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
     (Intercept) -1.37793    0.25979  -5.304 1.13e-07 ***
     game1        1.05846    0.17344   6.103 1.04e-09 ***
     game2       -0.54914    0.16752  -3.278  0.00105 **
     game3        0.12035    0.16803   0.716  0.47386
     social1     -3.00110    0.21653 -13.860  < 2e-16 ***
     social2      1.53098    0.17349   8.824  < 2e-16 ***
     app1         5.33547    0.19191  27.802  < 2e-16 ***
     app2         0.07811    0.16725   0.467  0.64048
     ---
     # (omitted)
  25. Large set of rules visualized with {arulesViz} (figure)
  26. Compare with the result of randomForest:
     > tuneRF(d1[, -8], d1[, 8], doBest = T)  # install and require {randomForest}
     # (omitted)
     > d1.rf <- randomForest(cv ~ ., d1, mtry = 2)
     > table(d1$cv, predict(d1.rf, d1[, -8]))
            No  Yes
       No  1413   87
       Yes   92 1408
     # Good accuracy
     > importance(d1.rf)
             MeanDecreaseGini
     game1          20.640253
     game2          12.115196
     game3           2.355584
     social1       189.053648
     social2        76.476470
     app1          796.937087
     app2            2.804019
     # Variable importance (without any directionality)
  27. See how far nodes are from "yes" / "no" (figure)
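     That "how far" reading can also be made concrete with {igraph}: build an item-level edge list from the rules and measure shortest-path distances to the yes/no nodes. A hedged sketch, assuming a rules object such as d2.ap.small from slide 20 (the pair-expansion approach is mine, not from the talk):

        library(arules)
        library(igraph)
        # one directed edge per (lhs item, rhs item) pair across all rules
        pairs <- do.call(rbind, mapply(function(l, r)
                     expand.grid(from = l, to = r, stringsAsFactors = FALSE),
                     LIST(lhs(d2.ap.small)), LIST(rhs(d2.ap.small)), SIMPLIFY = FALSE))
        g <- graph_from_data_frame(unique(pairs))
        # hop distance from every item to the "yes" and "no" nodes (if present)
        distances(g, to = intersect(c("yes", "no"), V(g)$name))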
  28. Large set of rules visualized with {arulesViz} (figure)
  29. Advantages of this technique  More intuitive  Easy to grasp even for high-dimensional data  Even laypeople can easily understand it  Useful for presentations
  30. Disadvantages of this technique  Less rigorous  Not quantitative
  31. Any questions or comments? Don't hesitate to ask me! @TJO_datasci
