Rdatamining

R을 이용한 데이터 마이닝
bt22dr@gmail.com

목차
•회귀분석
–단순선형회귀분석
–다중선형회귀분석
•분류
–로지스틱 회귀분석
–나이브베이즈 분류기
–K-최근접 이웃
•군집화
–K-means 군집화
•기타
–주성분분석
–추천시스템

선형회귀분석
•단순선형회귀 (Simple Linear Regression) :
–연속형 변수 대상
–독립변수와 종속변수 간의 상관관계에 따른 수학적 모델인 선형적 관계식 도출
–독립변수가 주어졌을 때 이에 따른 종속변수를 예측
•종류
–단순회귀분석 : 1개의 종속변수와 1개의 독립변수 사이의 관계를 분석
–다중회귀분석 : 1개의 종속변수와 여러 개의 독립변수 사이의 관계를 규명
•회귀식 추정
–단순선형회귀모델 :
–표본회귀식 : y = a + bx
–최소제곱법 : 푒푖 2= 푛푖 =1 (푦푖−푦 푖)2= 푛푖 =1 (푦푖−푎−푏푥푖)2푛푖 =1를 최소화하는 a, b
•회귀모형 적합도 판단
–회귀모형이 적합한지 확인하기 위해 결정계수 푅2을 사용
–회귀모형의 독립변수가 종속변수 변동의 몇%를 설명하는지를 나타내는 지표
–결정계수 푅2값은 상관계수 r을 제곱한 값과 같다.
y = α + βx + ε, ε ~ iid N(0, σ²)

선형회귀분석
•상관분석 (Correlation Analysis)
–두 변수간에 어떤 선형적 관계를 갖고 있는 지를 분석하는 방법
–상관계수(Correlation coefficient)로 상관관계의 정도를 파악
–두 변수간의 연관된 정도를 나타낼 뿐 인과관계를 설명하는 것은 아니다
•피어슨 상관계수 (Pearson correlation coefficient)
– X와 Y가 함께 변하는 정도X와 Y가 따로 변하는 정도 = 두 변수 X,Y의 표준편차의 곱에 대한 공분산의 비율
–모상관계수 =
–표본상관계수 =
–상관성을 -1과 1 사이 값으로 측정
•방향 : 변수들이 어떻게 변하는지를 나타낸다
•크기 : 절대값이 크면 상관관계가 더 강하다

선형회귀분석
•단순선형회귀분석 실습
women
ggplot(women, aes(x=height, y=weight)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
model = lm(weight ~ height, women)
summary(model)
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14

선형회귀분석
•단순선형회귀분석 실습
predict(model, data.frame(height=73))
1
167.7833
cor(women$weight, women$height)
cor(women$weight, women$height)^2
summary(model)$r.squared
[1] 0.9954948
[1] 0.9910098
[1] 0.9910098

선형회귀분석
•다중선형회귀 (Multiple Linear Regression) :
–일반적인 자연 현상은 여러 독립 변수들에 의해 설명된다.
–2개 이상의 독립 변수와 종속 변수간의 관계 도출
–다중회귀모형 : 푦= 훽0+ 훽1푥1+ 훽2푥2+ …+ 훽푛푥푛+ 휀
•다중공선성
–현실에서 설명변수들이 완벽하게 독립인 경우는 드물다.
–변수간 상관정도가 높으면 회귀분석 결과에 나쁜 영향을 미친다.
•변수선택
–전진 선택법
–변수 소거법
–단계적 방법

선형회귀분석
•다중선형회귀분석 실습
library(car)
mtcars_num <- mtcars[,c(1,3:7)]
# 산점도 행렬과 상관계수만으로 다중공선성 1차 진단
pairs(mtcars_num, panel=panel.smooth)
cor(mtcars_num[,-1])
# 모델 생성 후 값 추정
model <- lm(mpg~., data=mtcars_num)
summary(model)
predict(model, data.frame(disp=170, hp=120, drat=4.0, wt=2.7, qsec=17.0))
# stepwise selection 방법으로 변수 선택
out <- step(model, direction="both")
summary(out)
formula(out)
# 선택 변수만으로 모델 재생성 후 값 추정
model <- lm(formula(out), data=mtcars_num)
summary(model)
predict(model, data.frame(disp=170, hp=120, drat=4.0, wt=2.7, qsec=17.0))

로지스틱 회귀분석
•종속변수가 연속형이 아닌 범주형 자료인 이항 변수로 구성됨
•일반화된 선형 모델 (generalized linear model)을 이용
–푦 = ln( 푝 1−푝 )=훽0+ 훽1푥1+훽2푥2+ …+훽푝푥푝+ 휀,휀 ~ 푁(0,휎2)
•모델 설명
–푝 : X인 경우 Y가 발생할 확률 (0 ~ 1)
–Odds : Y가 0일 확률에 대한 1일 확률의 비율 = 푝 1−푝 (0 ~ ∞)
–Logit : odds에 자연로그 취한 상태 = ln( 푝 1−푝 ) (−∞ ~ ∞)
•위 모델로부터
–푝 = Pr(Y=1 | X) = 푒푧 1+푒푧 = 11+푒−푧
(푧= 훽0+ 훽1푥1+훽2푥2+ …+훽푝푥푝)
1(1+푒−푥)

•로지스틱회귀분석 실습
# 임의의 데이터 생성 x <- c(rnorm(50, mean=0), rnorm(50, mean=2)) y <- c(rnorm(50, mean=2), rnorm(50, mean=0)) z <- c(rep("a", 50), rep("b", 50)) data <- data.frame(x, y, z) # 데이터 스플릿 training_set <- sample(100,67) train <- data[training_set,] test <- data[-training_set,] # 트레이닝 데이터 관찰 plot(train$x, train$y, col=train$z, xlab="x", ylab="y") # 로지스틱회귀모형 model <- glm(z~x+y, binomial(link="logit"), train) summary(model)
출처 : http://www.cleveralgorithms.com/machinelearning/regression/logistic_regression.html

•로지스틱회귀분석 실습
# 테스트 데이터로 검증
probabilities <- predict.glm(model, test[,1:2], type="response")
predictions <- cut(probabilities, c(-Inf,0.5,Inf), labels=c("a","b"))
table(predictions, test$z)
# 모델이 테스트 데이터를 분류하는 모습 관찰
plot(test$x, test$y, col=test$z, xlab="x", ylab="y")
abline(intercept, slope, lty=5)
predictions a b
a 16 2
b 1 14
출처 : http://www.cleveralgorithms.com/machinelearning/regression/logistic_regression.html

나이브베이즈 분류기
•베이즈 정리를 기반으로 하는 단순한 확률 분류기
–어떤 feature 변수들 Fi가 주어졌을 때 이것이 Class C에서 나왔을 확률인 사후확률을 측정하여 가장 확률이 높은 클래스로 분류한다.
–
•보유한 데이터에서 해당 카테고리의 확률인 사전확률을 알 수 있고 해당 카테고리에서 각 아이템이 어떤 확률로 발생하는지(우도, likelyhood)를 알 수 있으니 계산을 통해 사후확률을 구할 수 있다.
–
–
–

나이브베이즈 분류기
•나이브베이즈 분류기 실습
install.packages('ElemStatLearn') # spam 이메일 데이터셋
install.packages('e1071')
library(e1071)
# 이메일 데이터 로드
data(spam, package="ElemStatLearn")
spam[1,]
# 데이터 스플릿
cv = sample(nrow(spam), floor(nrow(spam) * 0.9), replace=FALSE)
train = spam[cv,]
test = spam[-cv,]
# 분류기 테스트
num_feature <- ncol(spam) # label class feature 위치
modle <- naiveBayes(train[,-num_feature], train[,num_feature], laplace=1)
predict(modle, test[,-num_feature])
table(predict(m, test[,-num_feature]), test$spam)

K-최근접 이웃
•직관적으로 이해하기 쉽고 매우 효과적인 분류기
•분류 항목 표시가 주어지지 않은 테스트 데이터가 새로 주어졌을 때 트레이닝 데이터와 비교하여 가장 유사한 k개의 데이터를 골라내고 그 중 가장 많은 label을 테스트 데이터의 클래스로 예측한다.
•선형적으로 분류하기 힘든 데이터에서도 간단하고 직관적인 방법으로 분류를 수행

K-최근접 이웃
•K-최근접 이웃 실습 : OCR 예제
–머신러닝 인 액션 2장의 소스코드 참조
–데이터셋 : http://www.manning-source.com/books/pharrington/MLiA_SourceCode.zip
# 이미지를 벡터로 변환
img2vector <- function(filename) {
vec <- rep(0, 1024)
f <- file(filename, "rt")
lines <- readLines(f)
vec <- as.integer(unlist(strsplit(paste(lines, collapse=''), NULL)))
close(f)
return(vec)
}
# img2vector 테스트
test_vec <- img2vector('./data/digits/trainingDigits/0_0.txt') # 0
image(matrix(test_vec, ncol=32))
test_vec <- img2vector('./data/digits/trainingDigits/3_0.txt') # 3
image(matrix(test_vec, ncol=32))
소스코드 발췌 : 머신러닝 인 액션 2장

K-최근접 이웃
# training data 생성
hwLabels <- vector('integer')
training_file_list <- list.files('./data/digits/trainingDigits')
m <- length(training_file_list)
training_mat <- matrix(nrow=m, ncol=1024)
for(i in 1:m) {
file_name <- training_file_list[i]
print(file_name)
file_str <- unlist(strsplit(file_name, split="."))[1] # 0_0, txt
class_num <- as.integer(unlist(strsplit(file_str, split="_"))[1]) # 0, 0
hwLabels <- append(hwLabels, class_num)
training_mat[i,] <- img2vector(paste('./data/digits/trainingDigits/', file_name, sep=""))
}

K-최근접 이웃
# test data 분류
test_file_list <- list.files('./data/digits/testDigits')
error_cnt <- as.integer(0)
m_test <- length(test_file_list)
for(i in 1:m_test) {
file_name <- test_file_list[i]
print(file_name)
file_str <- unlist(strsplit(file_name, split="."))[1] # 0_0, txt
class_num <- as.integer(unlist(strsplit(file_str, split="_"))[1]) # 0, 0
test_vec <- img2vector(paste('./data/digits/testDigits/', file_name, sep=""))
cl_result <- knn(training_mat, test_vec, hwLabels, k=3)
print(sprintf("knn() result : %s, the real answer : %d", cl_result, class_num))
print("---------------")
if(cl_result != class_num) error_cnt <- error_cnt + 1
}
print(sprintf("the total error rate is : %f %%", error_cnt/m_test*100))

K-means 군집화
•Cluster
–Object들의 집합
–비슷한 Object끼리 같은 cluster에 묶음
•Clustering Algorithm
–Object 집합을 여러 개의 group으로 묶는 알고리즘
•좋은 Clustering Algorithm
–Vector distance는 최대한 가깝게
–Cluster distance는 최대한 멀게

K-means 군집화
•K-means Clustering
–주어진 데이터를 k개의 클러스터로 묶는 알고리즘
–각 클러스터 센터와의 거리 차이의 분산을 최소화하는 방식으로 동작
–
–알고리즘 동작 방식

K-means 군집화
•K-means 군집화 실습
# 데이터셋 준비
newiris <- iris
newiris$Species <- NULL
# k-means 알고리즘 수행
(kc <- kmeans(newiris, 3))
table(iris$Species, kc$cluster)
# 결과 확인
plot(newiris[c("Sepal.Length", "Sepal.Width")], col=kc$cluster)
points(kc$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=8, cex=2)

PCA
•주성분 분석 : PCA (Principal component analysis)
–분포의 주 성분을 분석하는 방법
–즉, 다차원의 데이터에서 정보 유실을 최소화하면서 더 적은 차원의 데이터로 설명할 수 있게 하는 방법
–주성분 벡터 -> 해당 벡터 방향으로 데이터들의 분산이 가장 큰 방향 벡터
–데이터를 한 개의 축으로 사상시켰을 때 그 분산이 가장 커지는 축이 첫 번째 좌표축으로 오고, 두 번째로 커지는 축이 두 번째… 와 같은 방법으로 차례로 놓이도록 새로운 좌표계로 데이터를 선형 변환한다.

PCA
•PCA 실습
# 데이터셋 생성 및 관찰
data <- matrix(c(1,1,5,4,4,6,5,5,6,5,9,9),byrow=T,ncol=2)
(mu <- apply(data, 2, mean))
plot(data, xlim=c(0,10), ylim=c(0,10), asp=1)
(pc <- prcomp(data))
(r <- pc$rotation)
(result <- predict(pc, newdata=data))
plot(result, asp=1) # result를 그려보면 각 eigenvector의 관점에서 오리지널 데이터가 그려진다.
# 하지만 위에서는 모든 PC를 유지했기 때문에 차원 리덕션 효과는 없었다.
# PC1만을 남겨두어 차원 축소를 수행해보면
(pc <- prcomp(data, tol=0.3)) # tol 값은 0.7745967/3.6055513 보다 큰 값을 주면 된다.
(result <- predict(pc, newdata=data)) # 원본 데이터에서 차원 축소된 1차원 결과 데이터를 얻는다.
# 결과 데이터를 원본 데이터의 영역에서 비교해보자
result <- cbind(result, matrix(rep(0,6),byrow=F)) # 더미 차원을 추가해주고
plot(result, asp=1, col=2)
# eigenvector를 통한 transformation을 역으로 적용함
# t(t(r))의 의미는 solve(t(r)), 연산을 위해 rotation의 eigenvector를 col->row로 변환한 놈의 역행렬
(b <- t(t(t(r)) %*% t(result) + mu))
plot(data, xlim=c(0,10), ylim=c(0,10), asp=1)
par(new=T)
plot(b, xlim=c(0,10), ylim=c(0,10), asp=1, col=2, pch=4)
lines(c(0,10), c(0,10), lty="dotted")

추천시스템
•등장 배경
•User-Item Matrix
Off-line
On-line
물리적 공간의 제약
물리적 공간 제약 없음
제한된 고객 수
고객 수의 제한 없음
Pareto(80/20) 법칙
Long tail 법칙
개인화 불가능
개인화 가능

추천시스템
•추천시스템 분류
–Content-based RS
–Collaborative Filtering
•Memory-based
–User-based CF
–Item-based CF
•Model-based
•일반적인 성능 및 결과 비교
Content-based < User-based CF <= Item-based CF < Model-based

추천시스템
•추천 시스템 실습
library(recommenderlab)
data(MovieLense)
MovieLense
image(sample(MovieLense, 500))
r <- Recommender(MovieLense, method="UBCF")
recom <- predict(r, MovieLense[1:2], n=3)
as(recom, "list")
[[1]]
[1] "Glory (1989)" "Schindler's List (1993)" "Casablanca (1942)"
[[2]]
[1] "Boot, Das (1981)" "Dead Man Walking (1995)" "Lone Star (1996)"

Rdatamining

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Rdatamining

Similar to Rdatamining (20)

More from Kangwook Lee

More from Kangwook Lee (20)

Rdatamining