Visualization of
Supervised Learning with
{arules} + {arulesViz}
Takashi J. OZAKI, Ph. D.
Recruit Communications Co., Ltd.
2014/4/17 1
About me
- Twitter: @TJO_datasci
- Data Scientist (Quant Analyst) in the Recruit group
  - A group of companies in advertising media and human resources
  - Known as a major player in big data
- Current mission: ad hoc analysis of various marketing data
- Actually, I'm still new to the field of data science
- Original background: neuroscience of the human brain (6 years' experience as a postdoctoral researcher) (Ozaki, PLoS One, 2011)
- English version of my blog: http://tjo-en.hatenablog.com/
Tonight’s topic is:
Graphical Visualization of Supervised Learning
Advantages of this technique
- More intuitive
- Easy to grasp even for high-dimensional data
- Even lay people can easily understand it
- Useful for presentations
Supervised learning: lower dimension, more intuitive
- In the case of 2D data… (e.g. nonlinear SVM)

x          y          label
0.924335   -1.0665    Yes
2.109901   2.615284   No
0.988192   -0.90812   Yes
1.299749   0.944518   No
-0.60885   0.457816   Yes
-2.25484   1.615489   Yes
Supervised learning: higher dimension, less intuitive
- In the case of 7D data… no way!

game1  game2  game3  social1  social2  app1  app2  cv
0      0      0      1        0        0     0     No
1      0      0      1        1        0     0     No
0      1      1      1        1        1     0     Yes
0      0      1      1        0        1     1     Yes
1      0      1      0        1        1     1     Yes
0      0      0      1        1        1     0     No
…      …      …      …        …        …     …     …
???
Is there any technique that can easily visualize supervised learning in higher dimensions?
(…for lay people?)
→ {arules} + {arulesViz}
Why association rules and their visualization?
- Very roughly, association rules can be interpreted as something like a generative model
  - A large set of conditional probabilities
- If they can be regarded as a set of conditional probabilities, they can also be described as something like a Bayesian network
  “X → Y”
- If they are like a Bayesian network, they can be visualized as a graph, e.g. with {igraph}
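$$\mathrm{supp}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{M}$$
$$\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \rightarrow Y)}{\mathrm{supp}(X)}$$
$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{conf}(X \rightarrow Y)}{\mathrm{supp}(Y)}$$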
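To make the three measures concrete, here is a minimal base-R sketch that evaluates them for the rule {app1} → {yes} on the six example rows of the 7D table shown earlier. The helper names `supp` and `rule_measures` are illustrative assumptions, not {arules} functions, and σ(·) is taken to be the number of transactions containing the itemset, with M the total number of transactions.

```r
# Six example rows from the slides (1 = the user touched that content);
# "yes" is 1 where the cv label is "Yes".
d <- data.frame(
  game1   = c(0, 1, 0, 0, 1, 0),
  game2   = c(0, 0, 1, 0, 0, 0),
  game3   = c(0, 0, 1, 1, 1, 0),
  social1 = c(1, 1, 1, 1, 0, 1),
  social2 = c(0, 1, 1, 0, 1, 1),
  app1    = c(0, 0, 1, 1, 1, 1),
  app2    = c(0, 0, 0, 1, 1, 0),
  yes     = c(0, 0, 1, 1, 1, 0)
)

# Hypothetical helper: fraction of transactions containing every item in `items`.
supp <- function(data, items) {
  mean(rowSums(data[, items, drop = FALSE]) == length(items))
}

# Hypothetical helper: support, confidence and lift of the rule lhs -> rhs.
rule_measures <- function(data, lhs, rhs) {
  s    <- supp(data, c(lhs, rhs))
  conf <- s / supp(data, lhs)
  c(support = s, confidence = conf, lift = conf / supp(data, rhs))
}

m <- rule_measures(d, "app1", "yes")
print(m)
```

On these six rows the rule has support 3/6, confidence 3/4 and lift 1.5; a lift above 1 means app1 users are labelled “Yes” more often than users overall.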
Further points…
- Only when all of the independent variables are binary can the rows be handled as “basket transactions”

game1  game2  game3  social1  social2  app1  app2  cv
0      0      0      1        0        0     0     No
1      0      0      1        1        0     0     No
0      1      1      1        1        1     0     Yes
0      0      1      1        0        1     1     Yes
1      0      1      0        1        1     1     Yes
0      0      0      1        1        1     0     No
…      …      …      …        …        …     …     …

{social1, No}
{game1, social1, social2, No}
{game2, game3, social1, social2, app1, Yes}
{game3, social1, app1, app2, Yes}
{game1, game3, social2, app1, app2, Yes}
{social1, social2, app1, No}
…
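The conversion from binary rows to baskets can be sketched in base R as follows. This is only an illustration on the six rows shown in the slides; `d1_head` and `to_baskets` are hypothetical names introduced here, and {arules} has its own transaction classes that are not used in this sketch.

```r
# Six example rows from the slides, with the class label in "cv".
d1_head <- data.frame(
  game1   = c(0, 1, 0, 0, 1, 0),
  game2   = c(0, 0, 1, 0, 0, 0),
  game3   = c(0, 0, 1, 1, 1, 0),
  social1 = c(1, 1, 1, 1, 0, 1),
  social2 = c(0, 1, 1, 0, 1, 1),
  app1    = c(0, 0, 1, 1, 1, 1),
  app2    = c(0, 0, 0, 1, 1, 0),
  cv      = c("No", "No", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one basket per row, holding the names of the columns
# equal to 1 plus the row's label.
to_baskets <- function(data, label) {
  items <- setdiff(names(data), label)
  lapply(seq_len(nrow(data)), function(i) {
    used <- items[unlist(data[i, items]) == 1]
    c(used, as.character(data[i, label]))
  })
}

baskets <- to_baskets(d1_head, "cv")
for (b in baskets) cat("{", paste(b, collapse = ", "), "}\n", sep = "")
```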
Let’s try in R!
Sample data “d1”

game1  game2  game3  social1  social2  app1  app2  cv
0      0      0      1        0        0     0     No
1      0      0      1        1        0     0     No
0      1      1      1        1        1     0     Yes
0      0      1      1        0        1     1     Yes
1      0      1      0        1        1     1     Yes
0      0      0      1        1        1     0     No
…      …      …      …        …        …     …     …

Imagine you are working on a web entertainment platform.
It has 3 SP games, 2 SP social networking services, and 2 apps.
The data record each user’s activity on each piece of content during the month after registration, and the “cv” label indicates whether the user is still active after that month.
In the case with svm {e1071}…
> d1.svm <- svm(cv ~ ., d1)  # install and require {e1071}
> table(d1$cv, predict(d1.svm, d1[, -8]))

        No  Yes
  No  1402   98
  Yes   80 1420
# Good accuracy (but only on the training data)
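“Good accuracy” can be put into numbers directly from the table above. The following base-R sketch re-enters the printed confusion matrix by hand (it does not refit the model) and computes the overall training accuracy and the per-class recall; the variable names are illustrative.

```r
# Confusion matrix copied from the svm slide: rows = actual, columns = predicted.
cm <- matrix(c(1402, 80, 98, 1420), nrow = 2,
             dimnames = list(actual = c("No", "Yes"),
                             predicted = c("No", "Yes")))

# Share of all 3000 training rows that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each actual class that was classified correctly.
recall <- diag(cm) / rowSums(cm)

print(accuracy)
print(recall)
```

Because the predictions are made on the same rows the model was trained on, these figures are optimistic estimates of performance on new users.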
In the case with randomForest {randomForest}…
> tuneRF(d1[, -8], d1[, 8], doBest = TRUE)  # install and require {randomForest}
# (omitted)
> d1.rf <- randomForest(cv ~ ., d1, mtry = 2)
# randomForest {randomForest}
> table(d1$cv, predict(d1.rf, d1[, -8]))

        No  Yes
  No  1413   87
  Yes   92 1408
# Good accuracy
> importance(d1.rf)
        MeanDecreaseGini
game1          20.640253
game2          12.115196
game3           2.355584
social1       189.053648
social2        76.476470
app1          796.937087
app2            2.804019
# Variable importance (without any directionality)
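The importance values are easier to read once they are ranked and expressed as shares of the total. The sketch below copies the printed MeanDecreaseGini values by hand rather than refitting the forest; `gini`, `ranked` and `share` are names chosen for this illustration.

```r
# MeanDecreaseGini values copied from the importance(d1.rf) output.
gini <- c(game1 = 20.640253, game2 = 12.115196, game3 = 2.355584,
          social1 = 189.053648, social2 = 76.476470,
          app1 = 796.937087, app2 = 2.804019)

# Rank the variables from most to least important.
ranked <- sort(gini, decreasing = TRUE)

# Each variable's share of the total importance.
share <- ranked / sum(ranked)

print(round(share, 3))
```

The ranking shows which variables matter, but, as the slide notes, not whether a variable pushes users towards “Yes” or “No”.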
In the case with glm {stats}…
> d1.glm <- glm(cv ~ ., d1, family = binomial)
> summary(d1.glm)

Call:
glm(formula = cv ~ ., family = binomial, data = d1)

# (omitted)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.37793    0.25979  -5.304 1.13e-07 ***
game1        1.05846    0.17344   6.103 1.04e-09 ***
game2       -0.54914    0.16752  -3.278  0.00105 **
game3        0.12035    0.16803   0.716  0.47386
social1     -3.00110    0.21653 -13.860  < 2e-16 ***
social2      1.53098    0.17349   8.824  < 2e-16 ***
app1         5.33547    0.19191  27.802  < 2e-16 ***
app2         0.07811    0.16725   0.467  0.64048
---
# (omitted)
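Unlike the random forest importance, the logistic regression coefficients carry a direction. A small base-R sketch of how to read them is given below: the estimates are copied by hand from the output above, and exponentiating a log-odds coefficient gives an odds ratio. It assumes that “Yes” is the modelled level, which is R's default when cv is a factor with levels No and Yes.

```r
# Coefficient estimates (log-odds) copied from summary(d1.glm).
coefs <- c(game1 = 1.05846, game2 = -0.54914, game3 = 0.12035,
           social1 = -3.00110, social2 = 1.53098,
           app1 = 5.33547, app2 = 0.07811)

# exp() turns a log-odds coefficient into an odds ratio:
# > 1 raises the odds of "Yes", < 1 lowers them.
odds_ratio <- exp(coefs)

effect <- data.frame(
  odds_ratio = round(odds_ratio, 3),
  direction  = ifelse(coefs > 0, "towards Yes", "towards No")
)
print(effect)
```

Note that game3 and app2 are not statistically significant in the output above, so their directions should not be over-interpreted.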
Sample data converted for transactions “d2”

game1  game2  game3  social1  social2  app1  app2  yes  no
0      0      0      1        0        0     0     0    1
1      0      0      1        1        0     0     0    1
0      1      1      1        1        1     0     1    0
0      0      1      1        0        1     1     1    0
1      0      1      0        1        1     1     1    0
0      0      0      1        1        1     0     0    1
…      …      …      …        …        …     …     …    …

Only the “cv” column was split into two binary (0 or 1) columns: “yes” and “no”
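One way to derive such a table from the labelled data is sketched below in base R, using only the six rows shown in the slides. The names `d1_head` and `d2_head` are hypothetical; the slides do not show how “d2” was actually built.

```r
# Six example rows of d1 from the slides.
d1_head <- data.frame(
  game1   = c(0, 1, 0, 0, 1, 0),
  game2   = c(0, 0, 1, 0, 0, 0),
  game3   = c(0, 0, 1, 1, 1, 0),
  social1 = c(1, 1, 1, 1, 0, 1),
  social2 = c(0, 1, 1, 0, 1, 1),
  app1    = c(0, 0, 1, 1, 1, 1),
  app2    = c(0, 0, 0, 1, 1, 0),
  cv      = c("No", "No", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Drop the label column, then add one 0/1 indicator column per class.
d2_head <- d1_head[, setdiff(names(d1_head), "cv")]
d2_head$yes <- as.integer(d1_head$cv == "Yes")
d2_head$no  <- as.integer(d1_head$cv == "No")

print(d2_head)
```

Turning both classes into items lets rules end in either “yes” or “no”, so both outcomes can appear as nodes in the rule graph.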
Run apriori {arules} to get association rules
> d2.ap.small <- apriori(as.matrix(d2))  # install and require {arules}

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [50 rule(s)] done [0.00s].  # only 50 rules…
creating S4 object ... done [0.00s].
Run apriori {arules} again with a lower minimum support
> d2.ap.large <- apriori(as.matrix(d2), parameter = list(support = 0.001))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 3000 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
writing ... [182 rule(s)] done [0.00s].  # as many as 182 rules
creating S4 object ... done [0.00s].
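The two runs differ only in the minimum support (0.1 by default, 0.001 in the second call). As a back-of-the-envelope sketch, not output from {arules}, the threshold can be read as the approximate number of the 3000 transactions in which an itemset must appear; the variable names below are illustrative.

```r
# Number of transactions reported in the apriori log.
n_transactions <- 3000

# Minimum support used in the two runs.
min_support <- c(small = 0.1, large = 0.001)

# Approximate minimum number of transactions an itemset must appear in.
min_count <- min_support * n_transactions
print(min_count)
```

Lowering the threshold from roughly 300 to roughly 3 transactions admits much rarer itemsets, which is why the second run checks larger subsets and returns 182 rules instead of 50.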
OK, just visualize it
> require("arulesViz")
# (omitted)
> plot(d2.ap.small, method = "graph",
+      control = list(type = "items", layout = layout.fruchterman.reingold))
> plot(d2.ap.large, method = "graph",
+      control = list(type = "items", layout = layout.fruchterman.reingold))
# The Fruchterman-Reingold force-directed graph drawing algorithm tends to place nodes so that the distances between them roughly reflect the “shortest path length” between them
# So nodes (items) should end up positioned according to their “closeness” to each other
Small set of rules visualized with {arulesViz}
Compare with the glm result (the summary(d1.glm) coefficients above)
Large set of rules visualized with {arulesViz}
Compare with the randomForest result (the confusion table and importance(d1.rf) above)
See how far each node is from “yes” and “no” in the large set of rules visualized with {arulesViz}
Advantages of this technique
- More intuitive
- Easy to grasp even for high-dimensional data
- Even lay people can easily understand it
- Useful for presentations
Disadvantages of this technique
- Less rigorous
- Not quantitative
Any questions or comments?
Don’t hesitate to ask me!
@TJO_datasci
