1.
Apply Machine Learning to Your Data
Starting with R
@ COSCUP 2013, David Chiu
2.
About Me
Trend Micro
Taiwan R User Group
ywchiu-tw.appspot.com
3.
Big Data Era
Quick analysis, finding meaning beneath data.
4.
Data Analysis
1. Preparing the data (Munging)
2. Running the model (Analysis)
3. Interpreting the result
5.
Machine Learning
Black-box, algorithmic approach to producing predictions or
classifications from data
A computer program is said to learn from
experience E with respect to some task T and some
performance measure P, if its performance on T, as
measured by P, improves with experience E
Tom Mitchell (1998)
11.
Regression
Predict one set of numbers given another set of numbers
Given the number of friends x, predict how many
"goods" (likes) I will receive on each Facebook post
12.
Scatter Plot
dataset <- read.csv('fbgood.txt', header=TRUE, sep='\t', row.names=1)
x = dataset$friends
y = dataset$getgoods
plot(x,y)
13.
Linear Fit
fit <- lm(y ~ x);
abline(fit, col = 'red', lwd=3)
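To read off what the fitted line says, the coefficients and a prediction for a new x can be pulled from the model. A minimal sketch, assuming synthetic data in place of fbgood.txt (the values below are made up for illustration and are not from the talk):

```r
# Synthetic stand-in for fbgood.txt (illustrative values only)
set.seed(1)
x <- seq(50, 500, by = 25)               # number of friends
y <- 2 + 0.05 * x + rnorm(length(x))     # "goods" per post, plus noise

fit <- lm(y ~ x)                         # ordinary least squares line
coefs <- coef(fit)                       # named vector: intercept and slope
print(coefs)

# Predicted "goods" for someone with 300 friends
new_pred <- predict(fit, newdata = data.frame(x = 300))
print(new_pred)
```

The slope is the expected change in "goods" per additional friend; `summary(fit)` adds standard errors and R-squared.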
14.
2nd order polynomial fit
plot(x,y)
polyfit2 <- lm(y ~ poly(x, 2))
lines(sort(x), fitted(polyfit2)[order(x)], col = 2, lwd = 3)
15.
3rd order polynomial fit
plot(x,y)
polyfit3 <- lm(y ~ poly(x, 3))
lines(sort(x), fitted(polyfit3)[order(x)], col = 2, lwd = 3)
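One way to decide which order to keep is to compare the nested fits. A rough sketch, assuming synthetic quadratic data rather than the Facebook dataset:

```r
# Synthetic data that is truly quadratic, plus noise (an assumption for illustration)
set.seed(2)
x <- runif(60, 0, 10)
y <- 1 + 0.5 * x - 0.3 * x^2 + rnorm(60)

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ poly(x, 2))
fit3 <- lm(y ~ poly(x, 3))

# Residual sum of squares never increases as the order goes up
rss <- c(deviance(fit1), deviance(fit2), deviance(fit3))
print(rss)

# F-tests between nested models: is each extra order a significant improvement?
print(anova(fit1, fit2, fit3))
```

A lower residual sum of squares alone does not justify a higher order, since it always shrinks on the training data; the F-test asks whether the drop is larger than noise would explain.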
16.
Other Regression Packages
MASS rlm - Robust Regression
GLM - Generalized linear Models
GAM - Generalized Additive Models
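As an illustration of why robust regression can matter, the sketch below compares `lm` with `MASS::rlm` on synthetic data containing one gross outlier (the data and the size of the outlier are assumptions for the example):

```r
library(MASS)   # recommended package, shipped with standard R distributions

set.seed(3)
x <- 1:30
y <- 3 + 2 * x + rnorm(30)    # true slope is 2
y[30] <- y[30] + 100          # one gross outlier at the right edge

ols    <- lm(y ~ x)           # least squares: pulled toward the outlier
robust <- rlm(y ~ x)          # M-estimation (Huber by default): down-weights it

print(coef(ols))
print(coef(robust))
```

The robust slope should stay much closer to the true value, because large residuals receive smaller weights in the iteratively reweighted fit.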
17.
Classification
Identifying to which of a set of categories a new observation belongs,
on the basis of a training set of data
Given the features of a bank customer, predict whether
the client will subscribe to a term deposit
18.
Data Description
Features:
age, job, marital, education, default, balance, housing, loan, contact
Labels:
Whether the customer subscribes to a term deposit (Yes/No)
19.
Classify Data With LibSVM
library(e1071)
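# stringsAsFactors=TRUE so that y is a factor and svm() performs classification
dataset <- read.csv('bank.csv', header=TRUE, sep=';', stringsAsFactors=TRUE)
# split.data() is a helper not shown here (it is not part of e1071 or base R);
# it presumably returns a random 70/30 train/test split
dati = split.data(dataset, p = 0.7)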
train = dati$train
test = dati$test
model <- svm(y~., data = train, probability = TRUE)
pred <- predict(model, test[,1:(dim(test)[[2]]-1)], probability = TRUE)
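The slide does not show how `split.data` is defined. A hypothetical version, assuming it simply draws a random sample of rows for training and keeps the rest for testing, might look like this (the built-in `iris` data is used only to demonstrate it):

```r
# Hypothetical helper (an assumption, not the author's code):
# randomly split a data frame into train/test parts.
# `p` is the fraction of rows that goes to the training set.
split.data <- function(data, p = 0.7) {
  n <- nrow(data)
  train_idx <- sample(n, size = round(p * n))
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

set.seed(42)
parts <- split.data(iris, p = 0.7)
print(nrow(parts$train))
print(nrow(parts$test))
```

Setting a seed before the split makes the partition reproducible, which helps when comparing models on the same test set.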
20.
Verify the predictions
table(pred, test[, dim(test)[2]])
pred    no  yes
  no  1183   99
  yes   27   47
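The table can be summarized with a few standard metrics. The sketch below re-enters the counts from the slide by hand (rows are predictions, columns are the true labels) and computes accuracy, plus precision and recall for the "yes" class:

```r
# Confusion matrix from the slide: rows = predicted, columns = actual
cm <- matrix(c(1183, 27, 99, 47), nrow = 2,
             dimnames = list(pred = c("no", "yes"), actual = c("no", "yes")))

accuracy  <- sum(diag(cm)) / sum(cm)             # (1183 + 47) / 1356
precision <- cm["yes", "yes"] / sum(cm["yes", ]) # 47 / (27 + 47)
recall    <- cm["yes", "yes"] / sum(cm[, "yes"]) # 47 / (99 + 47)
baseline  <- sum(cm[, "no"]) / sum(cm)           # always predicting "no"

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, baseline = baseline), 3))
```

Accuracy is about 0.907, but always predicting "no" would already reach about 0.892, and recall for "yes" is only about 0.322. With imbalanced labels, accuracy alone overstates how well the subscribers are being found.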
23.
Support Vector Machines and
Kernel Methods
e1071 - LIBSVM
kernlab - SVM, RVM and other kernel learning algorithms
klaR - SVMlight
rdetools - Model selection and prediction
24.
Dimension Reduction
Seeks linear combinations of the columns of X with maximal variance
Calculate a new index to measure the economy
of each city/county in Taiwan
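A minimal sketch of building such a composite index with principal component analysis, assuming the built-in `USArrests` dataset as a stand-in for the Taiwan county data (which is not reproduced here):

```r
# USArrests is a built-in dataset, used here only as a stand-in
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first component serve as a single composite index per row
index <- pca$x[, 1]
print(head(sort(index)))
```

Scaling matters when the columns are in different units, since otherwise the variable with the largest numeric range dominates the first component. The sign of a component is arbitrary, so the loadings in `pca$rotation` should be checked before interpreting high or low index values.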
25.
Economic Index of Taiwan
County
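County/City
Sales of profit-seeking enterprises
Economic development expenditure as a share of annual expenditure
Average disposable income per income earner
CommonWealth Magazine (天下雜誌), 2012 Happy City Survey, Issue 505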
38.
Machine Learning Diagnostic
1. Get more training examples
2. Try smaller sets of features
3. Try getting additional features
4. Try adding polynomial features
5. Try increasing/decreasing model parameters
39.
Overfitting
Training error is low but test error is high, e.g. J_training(θ) < J_test(θ)
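The gap between training and test error can be seen by fitting polynomials of increasing degree to a small sample. A rough sketch, assuming synthetic sine-wave data and an arbitrary choice of degrees:

```r
# Synthetic data: noisy sine wave, split into 20 training and 20 test points
set.seed(7)
n <- 40
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x[1:20],  y = y[1:20])
test  <- data.frame(x = x[21:40], y = y[21:40])

# Mean squared error of a fitted model on a given data frame
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)

# Fit polynomials of increasing degree and record both errors
degrees <- c(1, 3, 12)
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train = mse(fit, train), test = mse(fit, test))
}))
print(errors)
```

Training error can only fall as the degree rises, while test error typically falls and then climbs again once the model starts fitting noise. That divergence is the signature of overfitting.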