Programming Collective Intelligence 06: Decision Trees 01

Collective Intelligence
Chapter 7. Modeling with Decision Trees
Kwang Woo Nam
Department of Computer and Information Engineering
Kunsan National University
Textbook: Programming Collective Intelligence, Toby Segaran

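Applying Decision Trees to Predicting Website Sign-up Types
▪ Goal
◦ Implement a decision tree that uses web log information to predict whether a member will decide to purchase
– Web log information: member information, referrer and location, whether the FAQ was read, page views
(Figure callout: the decision we want the tree to predict)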

Applying Decision Trees to Predicting Website Sign-up Types
▪ Building the data
▪ Implementation file: treepredict.py
▪ Downloading and using the example file
– https://github.com/nico/collectiveintelligence-book/blob/master/decision_tree_example.txt
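
A minimal loading sketch, assuming the example file is tab-separated with one visitor per line and the outcome in the last column. The helper name `parse_rows`, the column order, and the two sample rows are illustrative assumptions, not taken from the slides.

```python
def parse_rows(text):
    """Parse tab-separated lines into rows; purely numeric fields become ints."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        rows.append([int(f) if f.isdigit() else f for f in fields])
    return rows

# Hypothetical rows in the assumed column order:
# referrer, location, read FAQ, pages viewed, service chosen
sample = "slashdot\tUSA\tyes\t18\tNone\ngoogle\tFrance\tyes\t23\tPremium\n"
my_data = parse_rows(sample)
print(my_data)
```

With the real file, the same helper could be applied to the text returned by `open(...).read()`.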

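What Is a Decision Tree?
▪ Definition
◦ A quantitative analysis method that charts decision rules as a tree and uses them to classify the population of interest into several subgroups or to make predictions
▪ Advantage
– Results are expressed as rules of the form "if condition A and condition B, then outcome group C", so they are easier to understand and apply than other quantitative methods for classification or prediction
Figure source: http://jaek.khu.ac.kr/datamining/684
Data Mining: Collective Intelligence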

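▪ Example of a decision tree

▪ Choosing the split attribute of a decision tree
◦ The attribute and its split value are chosen by how much impurity the split removes from the data
– Selection criterion: current impurity minus impurity after splitting the node
▪ Example: predicting whether a household buys a riding lawn mower
Choosing the split value for the lot-size attribute
- What is the impurity if 15 is chosen?
- What is the impurity if 17 is chosen?
- What about 19, 21, or 23?
Choosing the split value for the income attribute
- What is the impurity if 40 is chosen?
- What is the impurity if 60 is chosen?
- What about 80 or 100?
Figure source: http://jaek.khu.ac.kr/datamining/684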
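
The questions above can be answered mechanically: for each candidate split value, compute the impurity before the split and the size-weighted impurity after it. The sketch below uses the Gini index for this. The function names and the toy data are hypothetical (this is not the lawn-mower data set from the figure).

```python
def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_reduction(values, labels, threshold):
    """Current impurity minus the size-weighted impurity after splitting."""
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    after = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - after

# Hypothetical toy data, made up for illustration only
lot_size = [14, 16, 18, 20, 22, 24]
bought = ["no", "no", "no", "yes", "yes", "yes"]
candidates = (15, 17, 19, 21, 23)

for t in candidates:
    print(t, round(impurity_reduction(lot_size, bought, t), 3))

# The candidate with the largest reduction is the preferred split value
best = max(candidates, key=lambda t: impurity_reduction(lot_size, bought, t))
```

On this toy data the split at 19 separates the two classes completely, so it removes all of the impurity.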

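Decision Tree: Measuring Impurity
▪ Choosing the split attribute of a decision tree
◦ Child nodes are formed by finding which input variable, split in which way, best separates the distribution of the target variable; how well the distribution is separated is measured by purity or impurity
– Purity: the degree to which a node contains objects of one particular category
– Impurity: how many different categories of objects are mixed together in a node
▪ Selecting the split attribute
– Form child nodes so that their purity increases relative to the parent node
• For example, a node whose group 0 and group 1 proportions are 45% and 55% has lower purity (higher impurity) than a node whose proportions are 90% and 10%.
▪ Measures of impurity
– P-value of the chi-square statistic
– Gini index
– Entropy index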

▪ Gini index:
◦ A measure of impurity; child nodes are formed using the predictor variable, and its optimal split, that reduces the Gini index the most
Figure source: http://jaek.khu.ac.kr/datamining/684

▪ Diagram of Gini index values
◦ Impurity reaches its maximum value, 0.5, when two categories are mixed 50:50
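
The formula image for this slide did not survive extraction. Assuming the standard definition of the Gini index, with $p_i$ the proportion of category $i$ among $k$ categories in a node (the symbols are assumptions), the 50:50 maximum and the 45/55 versus 90/10 comparison work out as follows:

```latex
G = 1 - \sum_{i=1}^{k} p_i^{2}

G_{50/50} = 1 - (0.5^{2} + 0.5^{2}) = 0.5
G_{45/55} = 1 - (0.45^{2} + 0.55^{2}) = 0.495
G_{90/10} = 1 - (0.9^{2} + 0.1^{2}) = 0.18
```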

▪ Entropy index
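
The slide's formula is missing here as well. Assuming the usual base-2 definition, with $p_i$ the proportion of category $i$ in a node and $0 \log_2 0$ taken as 0, a sketch of the entropy index is:

```latex
H = -\sum_{i=1}^{k} p_i \log_2 p_i

H_{50/50} = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1
H_{100/0} = -(1 \log_2 1) = 0
```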

▪ Measuring impurity with the Gini index and the entropy index

▪ Splitting the tree by impurity

Decision Tree: Implementation
Implementing tree nodes
▪ Create a new class called decisionnode, which represents each node in the tree:
– col: the column index of the criteria to be tested
– value: the value that the column must match to get a true result
– results: stores a dictionary of results for this branch; None for everything except endpoints
– tb, fb: decisionnodes, which are the next nodes in the tree if the result is true or false, respectively
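
The slide's code image is not in the transcript. A minimal sketch of such a class, using the five attribute names that the later functions rely on (`col`, `value`, `results`, `tb`, `fb`); the default values are assumptions:

```python
class decisionnode:
    """One node of the decision tree."""

    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # column index of the criteria to be tested
        self.value = value      # value the column must match for a true result
        self.results = results  # dict of outcome counts; None except at leaves
        self.tb = tb            # next node if the test is true
        self.fb = fb            # next node if the test is false

# A hypothetical two-leaf tree: "did the user read the FAQ (column 2)?"
root = decisionnode(col=2, value="yes",
                    tb=decisionnode(results={"Basic": 2}),
                    fb=decisionnode(results={"None": 1}))
```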

Decision Tree: Training the Tree
CART (Classification and Regression Trees)
▪ Short for Classification And Regression Tree
▪ Invented in 1984 by Breiman and his colleagues
▪ A product of machine learning experiments
▪ The most widely used decision tree algorithm
1. Create a root node
2. Choose the best variable to divide up the data
C4.5
▪ Developed by the Australian researcher J. Ross Quinlan
▪ The early version, ID3 (Iterative Dichotomiser 3), was developed in 1986
▪ Unlike CART, it allows multiway splits at each node.
▪ For a categorical input variable, the node splits into as many branches as there are categories.
▪ It uses the entropy index as the impurity function.
▪ Pruning is done using the training data.

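Implementation: divideset
▪ A function that splits rows into two sets based on the data in a specific column
– The two resulting sets of rows are used to compute impurity

Implementation: divideset (continued)
▪ Example: splitting the rows by whether the user read the FAQ (yes, no)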
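
A sketch of how such a function might look, assuming numeric columns are split with `>=` and all other columns with equality (the same convention the classify code in this deck uses). The sample rows and column positions are hypothetical.

```python
def divideset(rows, column, value):
    """Split rows into (matching, non-matching) sets on one column."""
    if isinstance(value, (int, float)):
        def matches(row):
            return row[column] >= value   # numeric: threshold test
    else:
        def matches(row):
            return row[column] == value   # categorical: equality test
    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2

# Hypothetical rows: referrer, location, read FAQ, pages viewed, service
rows = [
    ["slashdot", "USA", "yes", 18, "None"],
    ["google", "France", "yes", 23, "Premium"],
    ["direct", "UK", "no", 21, "Basic"],
]
read_faq, not_read = divideset(rows, 2, "yes")   # split on the FAQ column
heavy, light = divideset(rows, 3, 20)            # split on pages viewed >= 20
```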

Choosing the Best Split
Implementation: uniquecounts
▪ Finds all the different possible outcomes and returns them as a dictionary of how many times they each appear
▪ This is used by the other functions to calculate how mixed a set is.
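
A minimal sketch of this counting step, assuming the outcome is stored in the last column of each row; the sample rows are hypothetical.

```python
def uniquecounts(rows):
    """Count how many times each outcome (last column) appears."""
    results = {}
    for row in rows:
        outcome = row[-1]
        results[outcome] = results.get(outcome, 0) + 1
    return results

# Hypothetical rows; only the last column (the outcome) matters here
rows = [["yes", 18, "None"], ["yes", 23, "Premium"], ["no", 21, "None"]]
counts = uniquecounts(rows)
```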

Choosing the Best Split: Implementing Gini Impurity
Implementing Gini impurity
▪ Gini impurity is the expected error rate if one of the results from a set is randomly applied to one of the items in the set.
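
One way to turn the "expected error rate" description into code is to sum, over every pair of different outcomes, the probability of drawing an item with the first outcome and assigning it the second. This is a sketch under that reading, with the outcome assumed to be in the last column; it gives the same value as 1 minus the sum of squared proportions.

```python
def uniquecounts(rows):
    """Count how many times each outcome (last column) appears."""
    results = {}
    for row in rows:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return results

def giniimpurity(rows):
    """Probability that a randomly applied result is wrong for a random item."""
    total = len(rows)
    counts = uniquecounts(rows)
    imp = 0.0
    for k1 in counts:
        p1 = counts[k1] / total
        for k2 in counts:
            if k1 == k2:
                continue  # same outcome: no error
            imp += p1 * (counts[k2] / total)
    return imp

mixed = [["a"], ["a"], ["b"], ["b"]]   # hypothetical 50:50 set
pure = [["a"], ["a"], ["a"]]           # hypothetical single-category set
```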

Choosing the Best Split: Implementing Entropy
Implementing the entropy index:
– p(i) = frequency(outcome) = count(outcome) / count(total rows)
– Entropy = −(sum of p(i) × log2(p(i)) for all outcomes)
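
The two lines above translate almost directly into code. A sketch, assuming base-2 logarithms and the outcome in the last column of each row:

```python
from math import log2

def uniquecounts(rows):
    """Count how many times each outcome (last column) appears."""
    results = {}
    for row in rows:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return results

def entropy(rows):
    """Negative sum of p(i) * log2(p(i)) over all outcomes."""
    ent = 0.0
    for count in uniquecounts(rows).values():
        p = count / len(rows)
        ent -= p * log2(p)
    return ent

mixed = [["a"], ["a"], ["b"], ["b"]]   # hypothetical 50:50 set
pure = [["a"], ["a"], ["a"]]           # hypothetical single-category set
```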

Choosing the Best Split: Example Run
Test: the Gini impurity and entropy metrics
▪ The main difference between entropy and Gini impurity is that entropy peaks more slowly.

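Building the Tree Recursively
Choosing tree nodes by information gain
▪ Information gain
– The difference between the current entropy and the weighted-average entropy of the two new groups
• The algorithm computes the information gain for every attribute and picks the one with the highest gain
◦ Current impurity minus the impurity after splitting into two groups
– Split the tree recursively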
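
Written as a formula, with $S$ the current set of rows, $S_1$ and $S_2$ the two groups produced by a split, and $H$ the entropy (the symbols are assumptions introduced here), the information gain described above is:

```latex
\mathrm{Gain}(S; S_1, S_2) = H(S) - \left( \frac{|S_1|}{|S|}\, H(S_1) + \frac{|S_2|}{|S|}\, H(S_2) \right)
```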

Implementation: buildtree
▪ A recursive function that builds the tree by choosing the best dividing criteria for the current set

Run: build the tree
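
The slide's code is not in the transcript, so the following is a self-contained sketch of a recursive builder rather than the original listing. It assumes entropy as the default score, the outcome in the last column, and the node attributes named earlier in this deck; the toy rows are hypothetical.

```python
from math import log2

class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col, self.value, self.results = col, value, results
        self.tb, self.fb = tb, fb

def divideset(rows, column, value):
    """Split rows on one column: >= for numbers, == otherwise."""
    if isinstance(value, (int, float)):
        test = lambda row: row[column] >= value
    else:
        test = lambda row: row[column] == value
    return [r for r in rows if test(r)], [r for r in rows if not test(r)]

def uniquecounts(rows):
    results = {}
    for row in rows:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return results

def entropy(rows):
    ent = 0.0
    for count in uniquecounts(rows).values():
        p = count / len(rows)
        ent -= p * log2(p)
    return ent

def buildtree(rows, scoref=entropy):
    if len(rows) == 0:
        return decisionnode()
    current_score = scoref(rows)
    best_gain, best_criteria, best_sets = 0.0, None, None
    # Try every value of every input column (the last column is the outcome)
    for col in range(len(rows[0]) - 1):
        for value in {row[col] for row in rows}:
            set1, set2 = divideset(rows, col, value)
            if not set1 or not set2:
                continue  # a split that leaves one side empty is useless
            p = len(set1) / len(rows)
            gain = current_score - p * scoref(set1) - (1 - p) * scoref(set2)
            if gain > best_gain:
                best_gain, best_criteria, best_sets = gain, (col, value), (set1, set2)
    if best_gain > 0:
        # Recurse on both halves of the best split
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=buildtree(best_sets[0], scoref),
                            fb=buildtree(best_sets[1], scoref))
    return decisionnode(results=uniquecounts(rows))  # leaf: store outcome counts

# Hypothetical toy rows: read FAQ, pages viewed, service chosen
toy = [["yes", 20, "Basic"], ["yes", 25, "Basic"],
       ["no", 20, "None"], ["no", 25, "None"]]
tree = buildtree(toy)
```

On the toy rows the FAQ column separates the outcomes completely, so the root tests column 0 and both children are leaves.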

Displaying the Tree
Implementation: printtree
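
A possible text display, sketched here because the slide's listing is missing. The helper `tree_lines` and the exact output layout (`col:value?` for a test, `T->` and `F->` for the branches) are assumptions made for this example.

```python
class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col, self.value, self.results = col, value, results
        self.tb, self.fb = tb, fb

def tree_lines(tree, indent=""):
    """Return the text representation of a tree as a list of lines."""
    if tree.results is not None:
        return [str(tree.results)]           # leaf: show the outcome counts
    lines = [f"{tree.col}:{tree.value}?"]    # decision node: show the test
    for label, branch in (("T->", tree.tb), ("F->", tree.fb)):
        sub = tree_lines(branch, indent + "  ")
        lines.append(f"{indent}{label} {sub[0]}")
        lines.extend(sub[1:])
    return lines

def printtree(tree):
    print("\n".join(tree_lines(tree)))

# Hypothetical two-leaf tree
root = decisionnode(col=2, value="yes",
                    tb=decisionnode(results={"Basic": 2}),
                    fb=decisionnode(results={"None": 1}))
printtree(root)
```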

Displaying the Tree: Graphical Display
Implementation: getwidth
▪ Implementation: getdepth
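
A sketch of the two helpers, assuming the width of a tree is its number of leaves and the depth is the number of decisions on the longest path from the root to a leaf. These definitions are assumptions, since the slide's code is missing.

```python
class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col, self.value, self.results = col, value, results
        self.tb, self.fb = tb, fb

def getwidth(tree):
    """Total number of leaves under this node."""
    if tree.tb is None and tree.fb is None:
        return 1
    return getwidth(tree.tb) + getwidth(tree.fb)

def getdepth(tree):
    """Length of the longest path from this node down to a leaf."""
    if tree.tb is None and tree.fb is None:
        return 0
    return max(getdepth(tree.tb), getdepth(tree.fb)) + 1

# Hypothetical tree: one leaf on the true side, a two-leaf subtree on the false side
leaf = lambda label: decisionnode(results={label: 1})
sample = decisionnode(col=0, value="x", tb=leaf("A"),
                      fb=decisionnode(col=1, value=3, tb=leaf("B"), fb=leaf("C")))
```

A drawing routine can use the width and depth to size the canvas before placing nodes.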

Python Imaging Library
▪ http://pythonware.com
▪ Add this import statement at the beginning of treepredict.py:
    from PIL import Image, ImageDraw

Implementation: drawtree
▪ Determines the appropriate total size and sets up a canvas

drawnode
▪ Draws the decision nodes of the tree

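Pruning the Tree
▪ Overfitting
– The tree over-reflects the training data: branches are created even for a tiny decrease in entropy.
– One remedy: stop splitting when entropy does not decrease by at least some minimum amount.
• Drawback: a split may reduce entropy only slightly yet allow a large reduction at the next split.
– Alternative: build the full tree first, then remove unnecessary nodes.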

prune function
    def prune(tree, mingain):
        # If the branches aren't leaves, then prune them first
        if tree.tb.results is None:
            prune(tree.tb, mingain)
        if tree.fb.results is None:
            prune(tree.fb, mingain)
        # If both subbranches are now leaves, see if they should be merged
        if tree.tb.results is not None and tree.fb.results is not None:
            # Build a combined dataset
            tb, fb = [], []
            for v, c in tree.tb.results.items():
                tb += [[v]] * c
            for v, c in tree.fb.results.items():
                fb += [[v]] * c
            # Reduction in entropy: combined entropy minus the mean of the two leaf entropies
            delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
            if delta < mingain:
                # Merge the branches
                tree.tb, tree.fb = None, None
                tree.results = uniquecounts(tb + fb)

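Classifying New Observations
▪ Use the classifier to classify observations
– The classify function is applied recursively.
– It follows the tree down to a leaf, evaluating the observation's values at each node.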

classify function
    def classify(observation, tree):
        # A leaf holds the answer
        if tree.results is not None:
            return tree.results
        v = observation[tree.col]
        if isinstance(v, (int, float)):
            # Numeric column: threshold test
            branch = tree.tb if v >= tree.value else tree.fb
        else:
            # Categorical column: equality test
            branch = tree.tb if v == tree.value else tree.fb
        return classify(observation, branch)

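Dealing with Missing Data
▪ Handling missing data
– A data set may be missing pieces of information.
– Example: the user's location cannot be determined from the IP address, so that field is left blank.
▪ Choosing a branch
– Follow both branches.
– Weight the results from the two branches differently.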

mdclassify function
    def mdclassify(observation, tree):
        if tree.results is not None:
            return tree.results
        v = observation[tree.col]
        if v is None:
            # Missing value: follow both branches and weight their results
            tr = mdclassify(observation, tree.tb)
            fr = mdclassify(observation, tree.fb)
            tcount = sum(tr.values())
            fcount = sum(fr.values())
            tw = float(tcount) / (tcount + fcount)
            fw = float(fcount) / (tcount + fcount)
            result = {}
            for k, n in tr.items():
                result[k] = n * tw
            for k, n in fr.items():
                # Accumulate so an outcome present in both branches is not overwritten
                result[k] = result.get(k, 0) + n * fw
            return result
        if isinstance(v, (int, float)):
            branch = tree.tb if v >= tree.value else tree.fb
        else:
            branch = tree.tb if v == tree.value else tree.fb
        return mdclassify(observation, branch)

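Dealing with Numerical Outcomes
▪ Numerical output vs. categorical output
– Categorical output treats each category as completely separate from the others.
– With numerical output, some values are close together and others are far apart.
– Variance is used to reflect how close or far apart the numbers are.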

variance function
    def variance(rows):
        if len(rows) == 0:
            return 0
        # The numerical outcome is the last column of each row
        data = [float(row[len(row) - 1]) for row in rows]
        mean = sum(data) / len(data)
        variance = sum([(d - mean) ** 2 for d in data]) / len(data)
        return variance

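Modeling Home Prices
▪ Zillow API
– Zillow tracks real-estate prices and uses this information to estimate the prices of other homes.
– It provides an API for retrieving home details and estimated prices.
• http://www.zillow.com/howto/api/APIOverview.htm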

Extracting Attributes
▪ Get a free developer key:
    import xml.dom.minidom
    import urllib.request  # urllib2 in Python 2
    zwskey = "X1-ZWz1chwxis15aj_9skq6"
▪ Extract the attribute information (fragment of a function body; doc is the parsed XML response):
    try:
        zipcode = doc.getElementsByTagName('zipcode')[0].firstChild.data
        use = doc.getElementsByTagName('useCode')[0].firstChild.data
        year = doc.getElementsByTagName('yearBuilt')[0].firstChild.data
        bath = doc.getElementsByTagName('bathrooms')[0].firstChild.data
        bed = doc.getElementsByTagName('bedrooms')[0].firstChild.data
        rooms = doc.getElementsByTagName('totalRooms')[0].firstChild.data
        price = doc.getElementsByTagName('amount')[0].firstChild.data
    except (IndexError, AttributeError):
        # A tag is missing or empty: no usable data for this address
        return None

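Building the Data List
    def getpricelist():
        l1 = []
        # One street address per line in addresslist.txt
        with open('addresslist.txt') as f:
            for line in f:
                data = getaddressdata(line.strip(), 'Cambridge,MA')
                # getaddressdata returns None when the lookup fails; skip those
                if data is not None:
                    l1.append(data)
        return l1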

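Modeling "Hotness"
▪ Users rate other users' appearance, which produces a rating score.
▪ An open API provides demographic information about the members who have been rated for "hotness".
▪ It is an interesting test case: it has input variables, an output variable, and a potentially interesting reasoning process.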

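When to Use Decision Trees
▪ Advantages:
– The trained model is easy to understand.
– They work with both categorical and numerical data.
▪ Disadvantages:
– They are inefficient for data sets with many possible outcomes.
– With numerical data, they can only create greater-than/less-than decision points.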
