How to Solve Classification Problems with SVM
Yiwei Chen
2016.10
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Load the iris dataset and hold out 10% as the test set
dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.1, stratify=dataset.target)

# Scale each feature to [0, 1], using the training data only
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# Grid-search C and gamma with 5-fold cross-validation
param_grid = {
    "C": np.logspace(-5, 15, num=6, base=2),
    "gamma": np.logspace(-13, 3, num=5, base=2)
}
grid = GridSearchCV(
    estimator=SVC(kernel="rbf", max_iter=10000000),
    param_grid=param_grid, cv=5)
grid.fit(X_scaled, y_train)

# Train the final model with the best parameters found
clf = SVC(kernel="rbf",
          C=grid.best_params_["C"],
          gamma=grid.best_params_["gamma"],
          max_iter=10000000)
clf.fit(X_scaled, y_train)

# Predict a novel flower (scale it with the same scaler first)
novel_X = np.array([[5.9, 3.2, 3.9, 1.5]])
novel_X_scaled = scaler.transform(novel_X)
print(novel_X_scaled)
print(clf.predict(novel_X_scaled))

# Evaluate on the held-out test set
X_test_scaled = scaler.transform(X_test)
print(clf.predict(X_test_scaled))
print(clf.score(X_test_scaled, y_test))
If the previous two pages already make sense to you,
you can stop reading this deck here.
There are many ways to learn,
and many different goals of learning.
[Figure: fruit photos labeled not sweet / sweet / sweet]
Learning the hidden rules of Mother Nature from experience
(Learn Mother Nature from experience)
This deck focuses on
supervised classification
[Diagram: Mother Nature assigns the true labels (sweet, not sweet, not sweet, sweet, ??)]
[Diagram: training — fruits with known labels → train → a model that answers "sweet / not sweet?"]
[Diagram: prediction — the model predicts sweet / not sweet for a new, unlabeled fruit]
Supervised Classification
● You have training data: some items/events + their classes
● You train a model; when a new item arrives, it predicts that item's class
The classes can be two (sweet / not sweet, binary classification)
or more (Taiwanese / Japanese / Korean, multi-class classification)
Support Vector Machine (SVM)
● You have training data: vectors + their classes
● You train a model -- here, a function -- and when a new vector arrives, it predicts its class
The classes can be two (sweet / not sweet, binary classification)
or more (Taiwanese / Japanese / Korean, multi-class classification)
[Diagram: training — labeled vectors
(2.4, 1, 0, 0, …, 22) O
(8.7, 1, 0, 0, …, -3) X
(1.2, 0, 0, 1, …, 57) O
(0.3, 0, 1, 0, …, 33) X
⋮
→ train → model ƒ: vector → O/X]
[Diagram: prediction — the model ƒ maps a new vector (1.2, 0, 1, …, 8) to O or X]
Feature engineering
● Convert every item into a vector in the same way
● Size: 8 cm or 80 mm? Pick one unit and use it consistently
● red/yellow/green: (1,0,0)/(0,1,0)/(0,0,1) (one-hot encoding; see the sketch below)
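A minimal sketch of the color encoding above, on a hypothetical fruit item (the one_hot helper and the always-cm convention are illustrative assumptions, not from the deck):

import numpy as np

colors = ["red", "yellow", "green"]

def one_hot(color):
    # red -> (1,0,0), yellow -> (0,1,0), green -> (0,0,1)
    v = np.zeros(len(colors))
    v[colors.index(color)] = 1.0
    return v

# One item: size in cm (always cm, never mm) + one-hot color
x = np.concatenate(([8.0], one_hot("red")))   # -> [8., 1., 0., 0.]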
There are many methods for solving supervised classification problems
● SVM
● Decision trees
● Neural networks
● Deep learning
● …
They can solve supervised classification problems;
that does not mean it is all they can solve.
Agenda
● Supervised classification
● Support Vector Machine
● Software environment
● Use Support Vector Machines
[Diagram (recap): labeled vectors → train → model ƒ; the model then predicts O or X for a new vector (1.2, 0, 1, …, 8)]
Support Vector Machine ??
Example: two-dimensional vectors, two classes
[Plot: points in the Feature 1 / Feature 2 plane → train → model (a separating function)]
Support Vector Machine ??
Example: two-dimensional vectors, two classes
[Plot: the trained model assigns a class to each new "?" point → predict]
Maximum Margin
Properties of SVM
● Distance related
● The wider the separation, the better (maximum margin)
Characteristics of SVM
● Distance related
● The wider the separation, the better (maximum margin)
● Parameterized
○ the decision boundary may be curved
○ misclassification is allowed, but penalized
Training with different parameters gives different results … (see the sketch below)
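A minimal sketch of that last point, on assumed toy data (make_moons is a stand-in dataset, not from the deck): the same training set with two parameter settings yields two different models.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)            # assumed toy data
loose = SVC(kernel="rbf", C=0.1, gamma=0.1).fit(X, y)   # smoother boundary
tight = SVC(kernel="rbf", C=100, gamma=10).fit(X, y)    # wigglier boundary
print(loose.score(X, y), tight.score(X, y))             # different training accuracy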
Agenda
● Supervised classification
● Support Vector Machine
● Software environment
● Use Support Vector Machines
If you use Python:
[Stack diagram:
scikit-learn (sklearn) — SVM, decision trees, …
numpy — arrays, …
scipy — variance, …
all running on python]
Anaconda: every wish granted at once
● An open-source scientific platform running on Python
○ Linux / OSX / Windows
● Installs everything you can think of
● Fast. No thinking required.
● https://www.continuum.io/anaconda-overview
Agenda
● Supervised classification
● Support Vector Machine
● Software environment
● Use Support Vector Machines
[Diagram (recap): labeled vectors → train → model ƒ; the model then predicts O or X for a new vector (1.2, 0, 1, …, 8)]
The general workflow
[Flowchart: settle on the evaluation formula + baseline predictor → train → predict → go live]
Evaluation
● Accuracy
○ Training accuracy
○ Testing accuracy
● precision, recall, Type I / Type II error, AUC, …
Before doing any training, decide how you will evaluate the results! (a metrics sketch follows below)
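A minimal sketch of the first few metrics, on hypothetical labels (the two arrays below are made up for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 0, 1, 1, 0]   # ground truth: sweet = 1, not sweet = 0
y_pred = [1, 0, 1, 1, 0, 0]   # some model's predictions
print(accuracy_score(y_true, y_pred))    # fraction predicted correctly
print(precision_score(y_true, y_pred))   # of those predicted sweet, how many truly are
print(recall_score(y_true, y_pred))      # of the truly sweet, how many were found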
Baseline predictor
● Simple and easy: guess with your eyes closed
● Used for comparison (would you even know if you did worse than the baseline?)
[Diagram: a baseline that "trains" by predicting ALL items as one class]
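One such eyes-closed baseline, as a minimal sketch using sklearn's DummyClassifier (my choice of tool here, not the deck's; it reuses X_train/y_train from the code above):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)          # "learns" only the majority class
print(baseline.score(X_test, y_test))   # the accuracy your SVM must beat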
The SVM workflow
[Flowchart: settle on the evaluation formula + baseline predictor →
prepare data (both for training and for prediction) →
scale features → search for the best parameters → train the model →
scale features (on the prediction side) → predict]
# 1. Data preparation
dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.1, stratify=dataset.target)

# 2. Feature scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# 3. Search for the best parameters (grid + CV)
param_grid = {
    "C": np.logspace(-5, 15, num=6, base=2),
    "gamma": np.logspace(-13, 3, num=5, base=2)
}
grid = GridSearchCV(
    estimator=SVC(kernel="rbf", max_iter=10000000),
    param_grid=param_grid, cv=5)
grid.fit(X_scaled, y_train)

# 4. Train the model with the best parameters
clf = SVC(kernel="rbf",
          C=grid.best_params_["C"],
          gamma=grid.best_params_["gamma"],
          max_iter=10000000)
clf.fit(X_scaled, y_train)

# Predict novel data
novel_X = np.array([[5.9, 3.2, 3.9, 1.5]])
novel_X_scaled = scaler.transform(novel_X)
print(novel_X_scaled)
print(clf.predict(novel_X_scaled))

# Scale and evaluate the test set
X_test_scaled = scaler.transform(X_test)
print(clf.predict(X_test_scaled))
print(clf.score(X_test_scaled, y_test))
1. Data preparation
● Transform object → vector
● Whole training data at once
○ X in numpy.array (2-D) or scipy.sparse.csr_matrix
○ y in numpy.array

Example (O encoded as 1, X as 0):
(2.4, 1, -3) O
(8.7, 1, 22) X
(1.2, 0, 57) O

X = np.array([[2.4, 1, -3],
              [8.7, 1, 22],
              [1.2, 0, 57]])
y = np.array([1, 0, 1])
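A minimal sketch of the sparse alternative mentioned above, useful when most entries are zero (e.g. after one-hot encoding many categories); SVC accepts this format too:

import numpy as np
from scipy.sparse import csr_matrix

X_dense = np.array([[2.4, 1, -3],
                    [8.7, 1, 22],
                    [1.2, 0, 57]])
X_sparse = csr_matrix(X_dense)   # same data, stored sparsely
y = np.array([1, 0, 1])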
2. Feature Scaling
[Diagram: rescale every feature to 0 ~ 1.
Feature 1 ranges 0.3 ~ 10.3, so n ↦ (n − 0.3) × 0.1;
the 0/1 features already span 0 ~ 1, so n ↦ (n + 0) × 1.
(1.2, 0, 0, …) O → scale → (0.09, 0, 0, …) O
(8.7, 1, 0, …) X → scale → (0.84, 1, 0, …) X
(2.4, 1, 0, …) O → scale → (0.21, 1, 0, …) O
(0.3, 0, 1, …) X → scale → (0,    0, 1, …) X]
2. Feature Scaling
[Diagram: the same rescaling as on the previous slide]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
3. Search for the best parameter

param_grid = {
    "C": np.logspace(-5, 15, num=6, base=2),
    "gamma": np.logspace(-13, 3, num=5, base=2)
}
grid = GridSearchCV(
    estimator=SVC(kernel="rbf",
                  max_iter=10000000),
    param_grid=param_grid, cv=5)
grid.fit(X_scaled, y_train)
3. Search for the best (??) C and γ
3. What is "best"?
[Diagram: with only the training data (sweet, not sweet, not sweet, sweet, ??) you can train a model, but you do not yet know how well it handles unseen data]
3. Search for the best - validation
[Diagram: hold out part of the training data and treat it as new, never-seen data; train on the rest, then validate the model on the held-out part]
3. Search for the best - cross-validation
Cross-validation (CV): each fold validates in turn
[Diagram: the training data is split into folds; each fold takes one turn as the validation set while the remaining folds train]
Given C=12, γ=34, the validation accuracy = 0.56
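A minimal sketch of one such measurement with sklearn's cross_val_score, using the example values C=12, γ=34 from the slide:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(kernel="rbf", C=12, gamma=34),
                         X_scaled, y_train, cv=5)
print(scores.mean())   # the CV accuracy for this (C, gamma) pair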
3. Search for the best parameter - Grid
[Plot: candidate (C, γ) pairs laid out on a grid; each grid point gets its own cross-validated accuracy]
3. Search for the best parameter

param_grid = {
    "C": np.logspace(-5, 15, num=6, base=2),
    "gamma": np.logspace(-13, 3, num=5, base=2)
}
grid = GridSearchCV(
    estimator=SVC(kernel="rbf",
                  max_iter=10000000),
    param_grid=param_grid, cv=5)
grid.fit(X_scaled, y_train)
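After grid.fit, the winner can be inspected directly (these are standard GridSearchCV attributes):

print(grid.best_params_)   # e.g. {"C": ..., "gamma": ...}
print(grid.best_score_)    # its mean cross-validation accuracy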
4. Train Model
Use the best parameters from CV to train:

clf = SVC(kernel="rbf",
          C=grid.best_params_["C"],
          gamma=grid.best_params_["gamma"],
          max_iter=10000000)
clf.fit(X_scaled, y_train)
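A side note, not from the deck: because GridSearchCV refits the best setting on the full training data by default (refit=True), an equivalent shortcut is:

clf = grid.best_estimator_   # the already-refit model with the best parameters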
Predicting novel data
● Scale (with the scaler fitted on the training data)
● Predict
novel_X = np.array([[5.9, 3.2, 3.9, 1.5]])
novel_X_scaled = scaler.transform(novel_X)
print(clf.predict(novel_X_scaled))
Scale Training Data
[Diagram: learn the rescaling from the training data.
Feature 1 ranges 0.3 ~ 10.3, so n ↦ (n − 0.3) × 0.1;
the 0/1 features already span 0 ~ 1, so n ↦ (n + 0) × 1.
(1.2, 0, 0, …) O → scale → (0.09, 0, 0, …) O
(8.7, 1, 0, …) X → scale → (0.84, 1, 0, …) X
(2.4, 1, 0, …) O → scale → (0.21, 1, 0, …) O
(0.3, 0, 1, …) X → scale → (0,    0, 1, …) X]
Scale Testing Data
[Diagram: apply the transforms learned from the training data,
n ↦ (n − 0.3) × 0.1 and n ↦ (n + 0) × 1, to the test vectors.
Scaled test values may fall outside 0 ~ 1:
(2.3, 0, 0, …) O → scale → (0.20, 0, 0, …) O
(-0.7, 1, 1, …) X → scale → (-0.1, 1, 1, …) X
(1.3, 1, 1, …) O → scale → (0.10, 1, 1, …) O
(100, 0, 0, …) X → scale → (9.97, 0, 0, …) X]
dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.1, stratify=dataset.target)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
param_grid = {
    "C": np.logspace(-5, 15, num=6, base=2),
    "gamma": np.logspace(-13, 3, num=5, base=2)
}
grid = GridSearchCV(
    estimator=SVC(kernel="rbf", max_iter=10000000),
    param_grid=param_grid, cv=5)
grid.fit(X_scaled, y_train)
clf = SVC(kernel="rbf",
          C=grid.best_params_["C"],
          gamma=grid.best_params_["gamma"],
          max_iter=10000000)
clf.fit(X_scaled, y_train)
novel_X = np.array([[5.9, 3.2, 3.9, 1.5]])
novel_X_scaled = scaler.transform(novel_X)
print(novel_X_scaled)
print(clf.predict(novel_X_scaled))
X_test_scaled = scaler.transform(X_test)
print(clf.predict(X_test_scaled))
print(clf.score(X_test_scaled, y_test))
Agenda
● Supervised classification
● Support Vector Machine
● Software environment
● Use Support Vector Machines
Takeaway…
[Diagram (recap): labeled examples (sweet, not sweet, not sweet, sweet) → train → model; the model then answers "sweet / not sweet?" for new items → predict]
The SVM workflow
[Flowchart: Evaluation criteria + Baseline predictor →
prepare data (for training and for prediction) →
scale features → search best param: CV on grid → train model →
scale features (on the prediction side) → predict]
Now that you know how to operate the microwave correctly...
● Data collection (sourcing the ingredients)
● Model evaluation monitoring (are the customers satisfied?)
● Feature engineering (prepping the ingredients)
● Model update from novel data (keeping up with the times)
● Training / prediction at large scale (ingredients in bulk)
● A robust pipeline that integrates all of the above
(running the restaurant)
Happy Training!
More materials
“Support” Vectors?
Maximum Margin
Why scaling?
Model Serialization
http://scikit-learn.org/stable/modules/model_persistence.html
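A minimal sketch of what that page describes (joblib is the tool recommended there; in 2016-era scikit-learn it ships as sklearn.externals.joblib):

from sklearn.externals import joblib

joblib.dump(clf, "svm_model.pkl")     # save the trained classifier
clf2 = joblib.load("svm_model.pkl")   # load it back later for prediction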