Machine Learning for Hackers (入門機械学習) Reading Group: Session 2

2nd Machine Learning for Hackers Reading Group
2013.04.27
@kzfm

Preparation
•  R
   http://www.r-project.org/
•  RStudio
   http://www.rstudio.com/
•  Sample code
   https://github.com/johnmyleswhite/ML_for_Hackers
•  Run source("package_installer.R")
> setwd("/Users/kzfm/lang/rcode/ML_for_Hackers/")
> source("package_installer.R")

Me and R
@kzfm (http://blog.kzfmix.com/)
I use R for a wide range of work, from medical statistics to text mining

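Reading files
•  Lesson from last time
•  Instead of changing the working directory with setwd, it may also work to move the required files into the working directory reported by getwd and read them from there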
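The getwd approach can be sketched as follows. The file name example_heights.csv and its contents are hypothetical, created only so the snippet is self-contained; in practice the file would be one of the book's data files.

```r
# Create a small CSV somewhere else (a temp directory) to stand in for a data file.
# The file name and contents are made up for illustration.
src <- file.path(tempdir(), "example_heights.csv")
write.csv(data.frame(Height = c(170, 160), Weight = c(65, 50)),
          src, row.names = FALSE)

# Show the current working directory instead of changing it with setwd().
print(getwd())

# Copy the file into the working directory, then read it by its bare name.
file.copy(src, getwd(), overwrite = TRUE)
dat <- read.csv("example_heights.csv", header = TRUE)
print(dat)

# Clean up the copy so the sketch leaves no files behind.
file.remove("example_heights.csv")
```

Reading by bare file name works here only because the file sits in the directory that getwd reports.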

From 2-9 onward
library(ggplot2)
library(gridExtra)
setwd("~/lang/rcode/ML_for_Hackers/02-Exploration/")
heights.weights <- read.csv("data/01_heights_weights_genders.csv", header=TRUE, sep=',')
g <- ggplot(heights.weights, aes(x=Height))
g1 <- g + geom_histogram(binwidth=1)
g2 <- g + geom_histogram(binwidth=5)
g3 <- g + geom_histogram(binwidth=0.001)
g4 <- g + geom_density()
g5 <- g + geom_density(aes(fill=Gender))
g6 <- g5 + facet_grid(Gender ~ .)
grid.arrange(g1, g2, g3, g4, g5, g6, ncol=2, nrow=3)

•  Run install.packages("gridExtra") beforehand

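Caveats about histograms
•  The appearance changes with the bin width
•  Choosing an appropriate width is difficult
•  It is often hard to tell whether the distribution is unimodal or multimodal
•  Use a density plot alongside the histogram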
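The bin-width effect can be checked numerically without drawing anything. The following is a minimal sketch using base R and simulated bimodal data (an assumption; it is not the book's heights data set), and count_bins is a hypothetical helper written for this example.

```r
set.seed(1)
# Simulated mixture of two normal distributions (illustrative data only).
x <- c(rnorm(500, mean = 64, sd = 2.5), rnorm(500, mean = 69, sd = 2.5))

# Hypothetical helper: bin counts for a given bin width, without plotting.
count_bins <- function(x, binwidth) {
  breaks <- seq(floor(min(x)), ceiling(max(x)) + binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

narrow <- count_bins(x, 1)   # many bins, noisier shape
wide   <- count_bins(x, 5)   # few bins, modes may be merged
print(length(narrow))
print(length(wide))

# A kernel density estimate does not depend on a bin width
# (it has its own smoothing parameter, the bandwidth).
d <- density(x)
print(d$bw)
```

The same data yield very different bin counts for the two widths, which is why the slide suggests checking a density plot as well.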

Drawing scatter plots
h <- ggplot(heights.weights, aes(x=Height, y=Weight))
h1 <- h + geom_point()
h2 <- h1 + geom_smooth()
h3 <- ggplot(heights.weights[1:20,], aes(x=Height, y=Weight)) + geom_point() + geom_smooth()
# The ggplot(...) calls for h4 and h5 are cut off in the slide text;
# the row subsets 1:200 and 1:2000 below are assumptions.
h4 <- ggplot(heights.weights[1:200,], aes(x=Height, y=Weight)) + geom_point() + geom_smooth()
h5 <- ggplot(heights.weights[1:2000,], aes(x=Height, y=Weight)) + geom_point() + geom_smooth()
grid.arrange(h1, h2, h3, h4, h5, ncol=2, nrow=3)

•  Scatter plots

Skipped here because chapter 5 covers it
# logit.mode: the logistic regression fit from chapter 5 (not defined in these slides)
c <- coef(logit.mode)
ggplot(heights.weights, aes(x=Weight, y=Height, color=Gender)) +
  geom_point() +
  stat_abline(intercept=-c[1]/c[2], slope=-c[3]/c[2], geom='abline', color='black')

Chapter 3: Spam filter
Bayesian classification

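Supervised / unsupervised learning
•  Supervised learning is one of the methods of machine learning
•  The name comes from treating data given in advance as "worked examples (= advice from a teacher)" and using them as a guide for learning (= some kind of fitting to the data).
•  From Wikipedia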

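Rolling dice
•  Roll a 6-sided die (A) and an 8-sided die (B)
•  The probability that both show 3 when rolled together
•  The probability that B shows 3 given that A showed 3
•  The probability that A shows 3 given that B showed 3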
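The three probabilities can be found by enumerating all 48 equally likely outcomes. This is a small sketch that assumes fair, independent dice; the variable names are chosen for this example.

```r
# All 6 x 8 = 48 equally likely (A, B) outcomes.
outcomes <- expand.grid(A = 1:6, B = 1:8)

# Joint probability that both dice show 3.
p_both <- mean(outcomes$A == 3 & outcomes$B == 3)

# Conditional probabilities: joint probability divided by the probability of the condition.
p_B_given_A <- p_both / mean(outcomes$A == 3)
p_A_given_B <- p_both / mean(outcomes$B == 3)

print(c(both = p_both, B_given_A = p_B_given_A, A_given_B = p_A_given_B))
```

Because the dice are independent, conditioning on one die does not change the other: the results are 1/48, 1/8 and 1/6.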

Bayes' theorem
[Figure: table of the outcomes of die A (columns 1 to 6) against die B (rows 1 to 8); the cells where A = 3 or B = 3 are filled in, meeting at (3, 3)]

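The box problem
•  Draw a ball from a box
•  A (1 white, 5 black): chosen with probability 0.8
•  B (3 white, 1 black): chosen with probability 0.2
•  Box B is an old release and unpopular, so only one person in five chooses it
•  When a white ball is drawn, how likely is it that it came from box B?
A(0.8) B(0.2)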

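Using Bayes' theorem
•  P(B|white) = P(white|B) x P(B) / P(white)
   = 0.75 * 0.2 / 0.283
   ≈ 0.529
•  Here P(white) = P(white|A) x P(A) + P(white|B) x P(B) = (1/6) * 0.8 + 0.75 * 0.2 = 17/60 ≈ 0.283
•  The probability that B was chosen was originally 20%; observing a white ball raises it to about 52.9%
•  It may seem strange that the probability of which box was chosen changes (the original box-selection probabilities can also be regarded as nothing more than an assumption, that is, a prior.)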
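The posterior for the box problem can be recomputed from the stated box contents and selection probabilities. This is a minimal sketch, with variable names chosen for this example. P(white) is obtained from the law of total probability, not from the pooled fraction of white balls (4 out of 10), because the boxes are not chosen in proportion to the number of balls they hold.

```r
# Prior probability of choosing each box.
prior <- c(A = 0.8, B = 0.2)

# Probability of drawing white from each box: A has 1 white of 6, B has 3 white of 4.
p_white_given <- c(A = 1/6, B = 3/4)

# Law of total probability: P(white) = sum over boxes of P(white | box) * P(box).
p_white <- sum(p_white_given * prior)

# Bayes' theorem: P(box | white) = P(white | box) * P(box) / P(white).
posterior <- p_white_given * prior / p_white

print(p_white)     # 17/60, about 0.283
print(posterior)   # B: 9/17, about 0.529
```

The two posterior values sum to 1, which is a quick sanity check on the calculation.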

Spam classification
•  P(spam|words) =
   P(words|spam) * P(spam) / P(words)
•  If the word frequencies in spam and non-spam are known, we can compute the probability that a message is spam when a given word appears in it

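The box analogy
•  Draw several balls (words) from a box at once
•  The probability of choosing either box is fifty-fifty
•  When the words are observed, how likely is it that they were drawn from the spam box?
spam(0.5) ham(0.5)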

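Supplementary notes on the formula
•  Naive Bayes classifier
•  Assumes conditional independence
Because conditional independence is assumed
Z is a constant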
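The formulas themselves did not survive in the slide text. Assuming they were the standard naive Bayes form, with class C and features (words) F_1, ..., F_n, the two notes above correspond to:

```latex
% Bayes' theorem for a class C given features F_1, ..., F_n
p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}

% Conditional independence: the likelihood factorizes over the features
p(F_1, \dots, F_n \mid C) = \prod_{i=1}^{n} p(F_i \mid C)

% Hence, with Z = p(F_1, \dots, F_n), which does not depend on C:
p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)
```

Since Z is the same for every class, it can be ignored when only comparing the spam and ham scores.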

Workflow (shall we do it?)
•  Build a TDM (term document matrix) with the tm (text mining) package
•  Build a classifier
•  Decide what to do when unknown words appear
•  Test
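The classifier and unknown-word steps can be sketched in base R. Everything here is an illustrative assumption: the word probabilities are made up instead of being computed from a TDM built with tm, the priors are 0.5 each, and unseen words get a small default probability of 1e-6. log_score and classify are hypothetical helper names.

```r
# Made-up per-class word probabilities (in practice these would come from a TDM).
spam_probs <- c(viagra = 0.30,  free = 0.40, meeting = 0.02)
ham_probs  <- c(viagra = 0.001, free = 0.10, meeting = 0.30)

unknown_prob <- 1e-6  # assumed default probability for words never seen in training

# Log of prior * product of word probabilities (logs avoid numeric underflow).
log_score <- function(words, probs, prior, default = unknown_prob) {
  p <- ifelse(words %in% names(probs), probs[words], default)
  log(prior) + sum(log(unname(p)))
}

# Pick the class with the larger score; Z is common to both and can be ignored.
classify <- function(words, prior_spam = 0.5) {
  s <- log_score(words, spam_probs, prior_spam)
  h <- log_score(words, ham_probs, 1 - prior_spam)
  if (s > h) "spam" else "ham"
}

print(classify(c("free", "viagra")))
print(classify(c("meeting")))
print(classify(c("free", "unseenword")))
```

With the same default probability in both classes, an unknown word lowers both scores equally, so the decision rests on the known words.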

Using e1071
data(iris)
library(e1071)
classifier <- naiveBayes(iris[,1:4], iris[,5])
train <- predict(classifier, iris[,-5])
table(train, iris[,5], dnn=list('predicted','actual'))

            actual
predicted    setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

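What did we just do?
library(ggplot2)
g <- ggplot(iris, aes(x=Petal.Length, color=Species))
# Per-species mean (column 1) and sd (column 2) of Petal.Length
pl <- classifier$tables$Petal.Length
g + geom_histogram() +
  stat_function(fun=dnorm, colour='red', args=list(mean=pl[1,1], sd=pl[1,2])) +
  stat_function(fun=dnorm, colour='green', args=list(mean=pl[2,1], sd=pl[2,2])) +
  stat_function(fun=dnorm, colour='blue', args=list(mean=pl[3,1], sd=pl[3,2]))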
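To make the idea behind the plot concrete, the following sketch rebuilds a Gaussian naive Bayes classifier by hand in base R, using only Petal.Length. It is a simplification for illustration (one feature, no e1071), not a reimplementation of naiveBayes.

```r
data(iris)

# Per-species mean and standard deviation of Petal.Length.
means <- tapply(iris$Petal.Length, iris$Species, mean)
sds   <- tapply(iris$Petal.Length, iris$Species, sd)

# Class priors: the fraction of rows belonging to each species.
priors <- table(iris$Species) / nrow(iris)

# Score of each class for each flower: prior * normal density at the observed length.
scores <- sapply(levels(iris$Species), function(cl) {
  priors[[cl]] * dnorm(iris$Petal.Length, mean = means[[cl]], sd = sds[[cl]])
})

# Predict the class with the highest score in each row.
predicted <- factor(colnames(scores)[max.col(scores, ties.method = "first")],
                    levels = levels(iris$Species))

print(means)
print(table(predicted, actual = iris$Species))
accuracy <- mean(predicted == iris$Species)
print(accuracy)
```

Each species gets its own normal curve, as in the plot, and a flower is assigned to the species whose curve (weighted by the prior) is highest at its petal length.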

Chapter 3 summary
•  We tried out Bayesian classification
•  If you want to apply it to documents, Python's NLTK is convenient.
•  Python also has a machine learning package called scikit-learn
