Datafesta 20141004_05

datafestjp
https://sites.google.com/site/datafestjp/home
http://datafest.connpass.com/event/8792/


  1. Team festa @datafesta
     Shinagawa Intercity 10F
     https://sites.google.com/site/datafestjp/home
     Sunday, October 5, 2014
     片岡豊 + 太田博三
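  2. Contents
     1. Why we chose the Bank Marketing Data Set, and what it contains
     2. Today's hypothesis and goal
        2.1 Syllogism: A→B, B→C, therefore A→C
            A: age, B: campaign, C: y (whether the client has a term deposit)
        2.2 Regression model
     3. Entering all explanatory variables and backward elimination
     4. Discussion of the model
        4.1 Model selection and the AIC criterion ← a scientific approach
        4.2 Human interpretation of the model
     5. Summary
     Appendix: code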
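  3. 1. Why we chose the Bank Marketing Data Set, and what it contains
     • We were interested in marketing
     • We wanted to build a model
     • It was data we could make sense of ourselves
     cf. The crime data had too many similar variables.
     cf. We tried natural language processing on London Olympics tweet data, but it was too heavy and we gave up!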
  4. 1. Bank Marketing Data Set
     # bank client data:
     1 - age (numeric)
     2 - job: type of job
     3 - marital: marital status
     4 - education
     5 - default
     6 - housing
     7 - loan
     8 - contact
     9 - month
     10 - day_of_week
     11 - duration
     12 - campaign
     13 - pdays
     14 - previous
     15 - poutcome
     16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
     17 - cons.price.idx: consumer price index - monthly indicator (numeric)
     18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
     19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
     20 - nr.employed: number of employees - quarterly indicator (numeric)
     Output variable (desired target):
     21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
  5. 1. Bank Marketing Data Set
     • Attribute Information:
     • Input variables: 1-20, the explanatory variables (# bank client data)
     • Output variable (desired target), the response variable:
       21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
     • https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
  6. 1. Bank Marketing Data Set
     • $Y = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots + \alpha_{20} x_{20} + \xi$
     • Y: y, whether or not the client has subscribed to a term deposit
     • https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
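  7. 2. Today's hypothesis and goal
     2.1 Syllogism: A→B, B→C, therefore A→C
         A: age, B: campaign, C: y (term deposit)
     2.2 Regression model
         → We want a binary classification of whether or not a client has subscribed to a term deposit, and we want to express the factors behind it with appropriate explanatory variables.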
  8. 2.1 Syllogism: A→B, B→C, therefore A→C
     [Diagram: Target (A: age) → Campaign (B: campaign, the bank telemarketing campaign) → Conversion (C: y, term deposit)]
  9. 2.1 Syllogism: A→B, B→C, therefore A→C
     [Diagram: the same three boxes as on the previous slide, with the links between them labeled ①, ②, ③]
  10. 2.2 Regression model and data processing
      $Y = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots + \alpha_{20} x_{20} + \xi$
      Y: y, whether or not the client has subscribed to a term deposit

      [41121] no  no  yes yes yes yes yes yes yes no  yes no  yes no  yes no  no  no  yes no
      [41141] yes yes yes no  no  yes yes yes yes no  no  yes no  yes no  no  yes no  yes yes
      [41161] yes no  no  yes yes yes yes no  no  no  no  yes yes yes yes no  no  no  yes no
      [41181] no  yes no  yes no  no  yes no
      Levels: no yes

      yy = y                      # y is a factor with levels "no", "yes"
      head(yy)
      yy = as.numeric(yy)         # factor level codes: no = 1, yes = 2
      yy = ifelse(yy==1, 0,yy)    # no  -> 0
      yy = ifelse(yy==2, 1,yy)    # yes -> 1
      tail(yy)
      data2$y = yy                # store the 0/1 response back in the data frame

      [41041] 1 0 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0
      [41081] 0 0 0 1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1
      [41121] 0 0 1 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 0 1 1
      [41161] 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0
  11. 3. Entering all explanatory variables and backward elimination
      library(MASS)                                          # provides stepAIC()
      glm_result = glm(y ~., data=data2,family="binomial")   # logistic regression on all 20 variables
      glm_result
      stepAIC(glm_result)                                    # backward selection by AIC
      libglm_result2 <- stepAIC(glm_result)

      > glm_result = glm(y ~., data=data2,family="binomial")
      > glm_result

      Call:  glm(formula = y ~ ., family = "binomial", data = data2)

      Coefficients:
           (Intercept)               age
            -2.366e+02         1.966e-04
        jobblue-collar   jobentrepreneur
            -2.347e-01        -1.780e-01
          jobhousemaid     jobmanagement
            -2.432e-02        -5.614e-02
            jobretired  jobself-employed
             2.858e-01        -1.578e-01

      Degrees of Freedom: 41187 Total (i.e. Null);  41135 Residual
      Null Deviance:     29000
      Residual Deviance: 17080    AIC: 17180
  12. 4. Discussion of the model
      4.1 Model selection and the AIC criterion ← a scientific approach
          → A smaller AIC value is better.
      4.2 Human interpretation of the model
          → The explanatory variables that remained after narrowing down could also be judged useful from a qualitative standpoint.

      Step:  AIC=17170.27
      y ~ job + default + contact + month + day_of_week + duration +
          campaign + pdays + poutcome + emp.var.rate + cons.price.idx +
          cons.conf.idx + euribor3m + nr.employed

                         Df Deviance   AIC
      <none>                   17094 17170
      - nr.employed       1    17097 17171
      - cons.conf.idx     1    17101 17175
      - euribor3m         1    17101 17175
      - campaign          1    17107 17181
      - day_of_week       4    17117 17185
      - pdays             1    17111 17185
      - default           2    17117 17189
      - job              11    17145 17199
      - cons.price.idx    1    17168 17242
      - contact           1    17169 17243
      - poutcome          2    17184 17256
      - emp.var.rate      1    17246 17320
      - month             9    17658 17716
      - duration          1    22724 22798
      >
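  13. 4. Discussion of the model
      4.1 Model selection and the AIC criterion ← a scientific approach
          → A smaller AIC value is better.
      4.2 Human interpretation of the model
          → The explanatory variables that remained after narrowing down could also be judged useful from a qualitative standpoint.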
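  14. 5. Summary
      Although we entered as many as 20 explanatory variables this time, the GLM held up well, and we were able to build a good, robust model under the AIC criterion.
      We also believe the resulting model is easy to interpret and practical!
  15. Thanks a lot!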
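  16. Code listing
      # Final version used for the presentation
      # stringsAsFactors=TRUE keeps text columns as factors (the default in older R versions)
      data2 <- read.csv("bank-additional-full.csv",header=T,sep=";",stringsAsFactors=TRUE)
      head(data2)
      summary(data2)
      attach(data2)          # make columns such as age, loan, y visible by name
      head(age)
      plot(loan,age)
      # Exploration
      length(loan)
      plot(loan)
      plot(y,loan)
      plot(loan,y)
      par(mfrow=c(2,2))      # 2 rows x 2 columns: split the plotting area for the diagnostic plots
      lm2 <- lm(as.numeric(y) ~ ., data=data2)
      plot(lm2)
      lm3 <- step(lm2)       # a smaller AIC is better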
  17. Code listing
      # Recode y from a no/yes factor to 0/1
      yy = y
      head(yy)
      yy = as.numeric(yy)         # factor level codes: no = 1, yes = 2
      yy = ifelse(yy==1, 0,yy)
      yy = ifelse(yy==2, 1,yy)
      tail(yy)
      data2$y = yy
      # Logistic regression on all explanatory variables
      glm_result = glm(y ~., data=data2,family="binomial")
      glm_result
      length(yy)
      length(housing)             # housing is a vector, so use length() rather than nrow()
      head(data2)
      library(boot)
      library(MASS)               # provides stepAIC()
      stepAIC(glm_result)
      libglm_result2 <- stepAIC(glm_result)
      # Classification tree
      library(rpart)              # provides rpart(); the original slides loaded mvpart here
      tree_result <- rpart(y ~ .,data=data2,method="class")
      tree_result
      plot(tree_result,uniform=T,branch=0.4,margin=0.05)
      text(tree_result,use.n=T,all=T)
      library(rpart.plot)
      prp(tree_result, type=2, extra=102,nn=TRUE, fallen.leaves=TRUE, faclen=0, varlen=0,shadow.col="grey", branch.lty=3, cex = 1.2, split.cex=1.2,under.cex = 1.2)
      plotcp(tree_result)
  18. Appendix. Bank Marketing Data Set
      • Attribute Information:
      • Input variables:
      # bank client data:
      1 - age (numeric)
      2 - job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
      3 - marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
      4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
      5 - default: has credit in default? (categorical: 'no','yes','unknown')
      6 - housing: has housing loan? (categorical: 'no','yes','unknown')
      7 - loan: has personal loan? (categorical: 'no','yes','unknown')
      # related with the last contact of the current campaign:
      8 - contact: contact communication type (categorical: 'cellular','telephone')
      9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
      10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
      11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
      # other attributes:
      12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
      13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
      14 - previous: number of contacts performed before this campaign and for this client (numeric)
      15 - poutcome: outcome of the previous marketing campaign (categorical:
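
Note on the model formula: the slides write the model in a linear form, but `glm(..., family="binomial")` fits a logistic regression, and the AIC used for selection has a simple closed form. The block below is a sketch added for clarity. The intercept $\alpha_0$, the probability $p$, the maximized likelihood $\hat{L}$ and the coefficient count $k$ are notation introduced here and do not appear in the slides.

```latex
\log\frac{p}{1-p} = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_{20} x_{20},
\qquad p = \Pr(y = 1 \mid x_1, \dots, x_{20})

\mathrm{AIC} = -2 \log \hat{L} + 2k
```

For a 0/1 response the residual deviance equals $-2\log\hat{L}$, so AIC = deviance + 2k. In the final step printed in section 4, the retained terms account for 37 degrees of freedom (the sum of the Df column), which with the intercept gives k = 38, and 17094 + 2 × 38 = 17170, matching the reported AIC.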
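
Note on the workflow: the recode, fit and backward-elimination steps from sections 2.2 to 4 can be tried without the real CSV. The sketch below uses simulated data, so every value in `toy` is an assumption made for illustration, and the `noise` column is a hypothetical irrelevant predictor. It uses base R's `step()` in place of `MASS::stepAIC()` so that no extra package is needed.

```r
# Synthetic stand-in for the bank data: column names mirror the slides,
# but all values are simulated purely for illustration.
set.seed(1)
n <- 500
toy <- data.frame(
  age      = round(runif(n, 18, 80)),
  campaign = rpois(n, 2) + 1,
  duration = round(rexp(n, 1 / 250)),
  noise    = rnorm(n)                 # hypothetical irrelevant predictor
)

# Simulate the response so that only duration and campaign matter.
p <- plogis(-2 + 0.006 * toy$duration - 0.3 * toy$campaign)
toy$y <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# Recode the two-level factor to 0/1 in one step (no = 0, yes = 1).
toy$y <- as.integer(toy$y == "yes")

# Fit the full logistic regression, then drop terms backward by AIC.
full    <- glm(y ~ ., data = toy, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

# Inspect which terms survived and compare the two AIC values.
print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Comparing against `"yes"` directly avoids relying on the order of the factor level codes, which is what the two `ifelse()` calls in the slides depend on. Backward selection only removes a term when doing so lowers the AIC, so the reduced model's AIC is never higher than the full model's.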
  • kazuyashishirai

    Oct. 6, 2014
