Titanic Survival Prediction
- 안재형 -
Data Analysis

CONTENTS
Part 1 Data Exploration
Part 2 Data Preprocessing
Part 3 Data Analysis
Part 4 Summary of Results
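Part 1 Data Exploration

1. Data Exploration
1) Summary and description of each variable
• The analysis uses the Titanic dataset published on OpenML.
• The data record information about the passengers at the time the Titanic sank on April 15, 1912.
• The data consist of 1,309 passengers and 11 variables.
• The dependent variable is survived, which indicates survival (1 = survived, 0 = did not survive).

Variable   Description                                                  Note
Survived   Survival (1 = survived, 0 = did not survive)                 Dependent variable
Pclass     Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
Name       Name                                                         Text
Sex        Sex (female, male)
Age        Age                                                          263 missing values
Sibsp      Number of siblings or spouses
Parch      Number of parents or children
Ticket     Ticket number
Fare       Passenger fare                                               1 missing value
Cabin      Cabin number                                                 1014 missing values
Embarked   Port of embarkation                                          2 missing values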
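Part 2 Data Preprocessing

2. Data Preprocessing
1) Handling missing values
• Missing values are summarized by variable, together with how each one is handled.
• Age: imputed with the median Age of each Title group, where Title is taken from the Name variable.
• Fare and Embarked: imputed with the overall median and the overall mode, respectively.
• Cabin: not used, because more than 50% of its values are missing.

Variable   Missing values   Treatment
Age        263              Median by Title, or overall median of Age
Fare       1                Overall median of Fare
Cabin      1014             Dropped
Embarked   2                Overall mode of Embarked
Summary of missing values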
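The overall-median and overall-mode treatments in the summary above can be sketched in plain Python. This is a minimal illustration, not the author's code: the helper name `impute_rough` and the toy records are assumptions, and only the standard library is used so the sketch stays self-contained.

```python
from statistics import median, mode

def impute_rough(rows):
    """Fill Age and Fare with the overall median, Embarked with the mode, and drop Cabin."""
    def observed(key):
        return [r[key] for r in rows if r[key] is not None]

    fill = {
        "Age": median(observed("Age")),
        "Fare": median(observed("Fare")),
        "Embarked": mode(observed("Embarked")),
    }
    cleaned = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != "Cabin"}  # Cabin is not used
        for key, value in fill.items():
            if new[key] is None:
                new[key] = value
        cleaned.append(new)
    return cleaned

# Hypothetical toy records (not from the slides); None marks a missing value.
rows = [
    {"Age": 22.0, "Fare": 7.25, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Fare": 71.28, "Cabin": "C85", "Embarked": "C"},
    {"Age": 30.0, "Fare": None, "Cabin": None, "Embarked": "S"},
    {"Age": 40.0, "Fare": 8.05, "Cabin": None, "Embarked": None},
]
cleaned = impute_rough(rows)
```

Computing the fill values once from the observed entries keeps the imputation independent of row order.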
1-1) Handling missing Age values
• Missing values of Age are imputed based on the Title information in the Name variable.
• A Title variable is created from the information in the Name variable.

Name                                              Title
Allen, Miss. Elisabeth Walton                     Miss.
Allison, Master. Hudson Trevor                    Master.
Allison, Miss. Helen Loraine                      Miss.
Allison, Mr. Hudson Joshua Creighton              Mr.
Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   Mrs.
(Example) Title extraction for 5 passengers

Title          Frequency
Capt           1
Col            4
Don            1
Dona           1
Dr             8
Jonkheer       1
Lady           1
Major          2
Master         61
Miss           260
Mlle           2
Mme            1
Mr             757
Mrs            197
Ms             2
Rev            8
Sir            1
the Countess   1
Summary of Title
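One way to extract the Title is a regular expression that captures the text between the comma after the surname and the first period. The sketch below is an assumption about how this could be done, since the slides do not show the extraction code; `TITLE_PATTERN` and `extract_title` are illustrative names.

```python
import re

# Captures the text between the comma after the surname and the first period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the title in a 'Surname, Title. Given names' string, or None if absent."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(n) for n in names]
```

The capture group excludes the trailing period, so the result matches the labels used in the frequency summary (for example `Miss` rather than `Miss.`), and multi-word titles such as `the Countess` are kept whole.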
• The 18 Title values are merged into 6.
• Titles that are used with the same meaning are unified.

[Figure: Barplot of the original Title (x-axis: Title, y-axis: count); Mr (757, 58%), Miss (260, 20%), Mrs (197, 15%), and Master (61, 5%) dominate, and every other title is at or below 1%]
[Figure: Barplot of the merged Title (x-axis: Title, y-axis: count)]

Original Title                         Merged Title   Frequency   Share
Major, Col, Sir, Don, Jonkheer, Capt   Mr             767         59%
Lady, the Countess, Dona               Mrs            200         15%
Mlle, Mme, Ms                          Miss           265         20%
                                       Dr             8           1%
                                       Master         61          5%
                                       Rev            8           1%
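The merge can be written as a lookup table and checked against the reported frequencies. The mapping and the counts below are copied from the slides; the names `MERGE` and `merge_title` are illustrative assumptions.

```python
from collections import Counter

# Rare titles mapped to their merged title, as listed in the table above.
MERGE = {
    "Major": "Mr", "Col": "Mr", "Sir": "Mr", "Don": "Mr", "Jonkheer": "Mr", "Capt": "Mr",
    "Lady": "Mrs", "the Countess": "Mrs", "Dona": "Mrs",
    "Mlle": "Miss", "Mme": "Miss", "Ms": "Miss",
}

def merge_title(title):
    """Map a raw title to one of the six merged titles; unmapped titles are kept as-is."""
    return MERGE.get(title, title)

# Frequencies of the 18 original titles, from the Title summary.
original_counts = {
    "Capt": 1, "Col": 4, "Don": 1, "Dona": 1, "Dr": 8, "Jonkheer": 1,
    "Lady": 1, "Major": 2, "Master": 61, "Miss": 260, "Mlle": 2, "Mme": 1,
    "Mr": 757, "Mrs": 197, "Ms": 2, "Rev": 8, "Sir": 1, "the Countess": 1,
}

merged_counts = Counter()
for title, count in original_counts.items():
    merged_counts[merge_title(title)] += count
```

Summing the original frequencies through the mapping reproduces the merged counts (Mr 767, Mrs 200, Miss 265) and preserves the total of 1309 passengers.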
• The distribution of Age differs by Title.
• Missing Age values are imputed with the median Age of the corresponding Title.

[Figure: Boxplot of Age by Title (x-axis: title, y-axis: age, 0 to 80) for Dr, Master, Miss, Mr, Mrs, Rev]

Title    Median of Age
Dr       49
Master   4
Miss     22
Mr       30
Mrs      36
Rev      41.5
Median of Age by Title
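Imputation by group median can be sketched as follows. The records are hypothetical toy data and `impute_age_by_title` is an illustrative name; the sketch also assumes that every Title has at least one observed Age.

```python
from collections import defaultdict
from statistics import median

def impute_age_by_title(rows):
    """Replace a missing Age (None) with the median Age of passengers sharing the Title."""
    ages_by_title = defaultdict(list)
    for r in rows:
        if r["Age"] is not None:
            ages_by_title[r["Title"]].append(r["Age"])
    # Assumes each title has at least one observed age; otherwise the lookup below fails.
    medians = {title: median(ages) for title, ages in ages_by_title.items()}
    return [
        {**r, "Age": medians[r["Title"]] if r["Age"] is None else r["Age"]}
        for r in rows
    ]

# Hypothetical toy records (not from the slides).
rows = [
    {"Title": "Master", "Age": 3.0},
    {"Title": "Master", "Age": 5.0},
    {"Title": "Master", "Age": None},
    {"Title": "Mr", "Age": 28.0},
    {"Title": "Mr", "Age": 32.0},
    {"Title": "Mr", "Age": None},
]
filled = impute_age_by_title(rows)
```

Compared with a single overall median, this keeps children (Master) from being assigned an adult age.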
2) Feature Engineering
• Passengers whose Sibsp or Parch is greater than 0 are assumed to be traveling with family.
• Family members can be seen to share the same Ticket and Fare values.
• Passengers who are not family but share the same Ticket and Fare values are designated a Ticket group.

Survived  Pclass  Name                                                      Sex     Age  Sibsp  Parch  Ticket  Fare   Group
1         1       Cherry, Miss. Gladys                                      female  30   0      0      110152  86.5   Ticket group
1         1       Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)  female  33   0      0      110152  86.5   Ticket group
NA        1       Maioni, Miss. Roberta                                     female  16   0      0      110152  86.5   Ticket group
1         1       Taussig, Miss. Ruth                                       female  18   0      2      110413  79.65  Family
1         1       Taussig, Mrs. Emil (Tillie Mandelbaum)                    female  39   1      1      110413  79.65  Family
NA        1       Taussig, Mr. Emil                                         male    52   1      1      110413  79.65  Family
(Example) Groups of 6 passengers
• The Group_type variable is created from Sibsp, Parch, Ticket frequency, and Fare frequency.
• The Group_size variable is created from Sibsp, Parch, and Ticket frequency.

Creating the Group_type variable
• If Sibsp + Parch > 0 => Family
• If Sibsp + Parch == 0 => Single
• If (Sibsp + Parch == 0) & (Ticket frequency > 1) & (Fare frequency > 1) => Ticket (this rule takes precedence over Single)

Creating the Group_size variable
• Family: Sibsp + Parch
• Single: 0
• Ticket: Ticket frequency
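The two rules can be combined into one pass over the data. The sketch below is an assumption about the implementation: it takes Ticket frequency and Fare frequency to be the number of passengers sharing a Ticket or Fare value, and it checks the Ticket rule before falling back to Single. The first six rows come from the example slide; the last row is hypothetical.

```python
from collections import Counter

def add_group_features(rows):
    """Attach Group_type and Group_size to each passenger record."""
    ticket_freq = Counter(r["Ticket"] for r in rows)
    fare_freq = Counter(r["Fare"] for r in rows)
    out = []
    for r in rows:
        relatives = r["Sibsp"] + r["Parch"]
        if relatives > 0:
            group_type, group_size = "Family", relatives
        elif ticket_freq[r["Ticket"]] > 1 and fare_freq[r["Fare"]] > 1:
            group_type, group_size = "Ticket", ticket_freq[r["Ticket"]]
        else:
            group_type, group_size = "Single", 0
        out.append({**r, "Group_type": group_type, "Group_size": group_size})
    return out

rows = [
    {"Name": "Cherry, Miss. Gladys", "Sibsp": 0, "Parch": 0, "Ticket": "110152", "Fare": 86.5},
    {"Name": "Rothes, the Countess. of", "Sibsp": 0, "Parch": 0, "Ticket": "110152", "Fare": 86.5},
    {"Name": "Maioni, Miss. Roberta", "Sibsp": 0, "Parch": 0, "Ticket": "110152", "Fare": 86.5},
    {"Name": "Taussig, Miss. Ruth", "Sibsp": 0, "Parch": 2, "Ticket": "110413", "Fare": 79.65},
    {"Name": "Taussig, Mrs. Emil", "Sibsp": 1, "Parch": 1, "Ticket": "110413", "Fare": 79.65},
    {"Name": "Taussig, Mr. Emil", "Sibsp": 1, "Parch": 1, "Ticket": "110413", "Fare": 79.65},
    # Hypothetical passenger traveling alone, added only to show the Single case.
    {"Name": "Doe, Mr. John", "Sibsp": 0, "Parch": 0, "Ticket": "000000", "Fare": 7.25},
]
featured = add_group_features(rows)
```

On these rows the three passengers on ticket 110152 become a Ticket group of size 3, the Taussigs are Family with size 2, and the hypothetical solo traveler is Single with size 0.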
• The distribution of the newly created Group_type variable is examined.
• The distribution of survived differs across groups.

[Figure: Barplot of Group_type (x-axis: Group Type, y-axis: count): Family 519 (40%), Single 665 (51%), Ticket 125 (10%)]
Summary of Group_type

[Figure: Barplot of Survived (0, 1) counts by Group_type (x-axis: Group Type, y-axis: count)]
Distribution of Survived by Group_type
3) Outlier Detection
• Outlier detection is performed on the training data.
• An anomaly score is computed for each observation using Isolation Forest (carried out with 5-fold CV).
• The distribution of anomaly scores shows that a small number of observations have scores of 0.55 or higher.
• The threshold is set to 0.55 and the outliers are removed.

Ranking   Index   Anomaly_score
1         46      0.56920767
2         158     0.56774173
3         83      0.56553181
4         340     0.56188462
5         167     0.56178377
6         341     0.56146659
7         27      0.5608269
8         800     0.55886405
9         351     0.55857549
10        36      0.55360205
11        195     0.55342874
12        342     0.55300336
13        119     0.55015317
Removed outliers

[Figure: Histogram of anomaly scores (x-axis: anomaly_score, y-axis: count), with the threshold marked at 0.55]
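For context, the original Isolation Forest formulation defines the anomaly score as s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of x over the trees and c(n) is the average path length of an unsuccessful binary-search-tree lookup among n points. A score near 0.5 means the point is no easier to isolate than average, which fits a threshold slightly above 0.5. The sketch below assumes this standard definition; the slides do not say which implementation produced the scores, and the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Isolation Forest score 2 ** (-E[h(x)] / c(n)); shorter paths give higher scores."""
    return 2.0 ** (-mean_path_length / c(n))

def remove_outliers(rows, scores, threshold=0.55):
    """Keep only the rows whose anomaly score is below the threshold."""
    return [r for r, s in zip(rows, scores) if s < threshold]

# The first and last scores appear in the table above; the middle one is hypothetical.
scores = [0.56920767, 0.48, 0.55015317]
kept = remove_outliers(["obs_46", "obs_hypothetical", "obs_119"], scores)
```

Observations scoring exactly 0.55 are dropped here, matching the "0.55 or higher" wording of the slide.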
4) Eight training datasets produced by preprocessing
• The preprocessing steps described above yield a total of 8 transformed datasets.
• The same models are fitted to each dataset to see what changes the preprocessing causes.

No.   Missing Value Imputation   Feature Engineering   Outlier Removal   Dataset name
1     Rough                      X                     X                 Rough
2     Rough                      X                     O                 Rough_Out
3     Rough                      O                     X                 Rough_Feat
4     Rough                      O                     O                 Rough_Out_Feat
5     By Title                   X                     X                 ByTitle
6     By Title                   X                     O                 ByTitle_Out
7     By Title                   O                     X                 ByTitle_Feat
8     By Title                   O                     O                 ByTitle_Out_Feat
(O = applied, X = not applied)
Transformed datasets
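The eight variants are the full 2 x 2 x 2 combination of the three preprocessing choices. A short sketch that enumerates them in the same order as the table (the helper `dataset_name` is illustrative, not from the slides):

```python
from itertools import product

def dataset_name(imputation, feature_engineering, outlier_removal):
    """Build the dataset name from the three preprocessing choices."""
    name = imputation
    if outlier_removal:
        name += "_Out"
    if feature_engineering:
        name += "_Feat"
    return name

# Imputation varies slowest, then feature engineering, then outlier removal.
variants = [
    dataset_name(imp, feat, out)
    for imp, feat, out in product(["Rough", "ByTitle"], [False, True], [False, True])
]
```

Enumerating the combinations programmatically makes it easy to fit the same set of models to every variant in a loop.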
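Part 3 Data Analysis

3. Data Analysis
1) Definition of the evaluation metric
• An evaluation metric is defined for comparing the models.
• Accuracy is used for the Titanic survival prediction analysis.

[Figure: Confusion matrix]
Evaluation metric
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.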
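As a worked example of the metric, accuracy can be computed from the four confusion-matrix counts. The labels below are hypothetical and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 = survived."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels for eight passengers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```

Here three survivors and three non-survivors are classified correctly, with one false positive and one false negative, so the accuracy is 6/8 = 0.75.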
2) Models and evaluation procedure
• A total of 12 models are fitted.
• Models are compared on Train / Test data (models that require tuning use 5-fold CV).

Model Description
Model name       Description                                                              Tuning parameter
Logistic         Logistic regression
Logistic_ridge   Logistic regression + ridge                                              Lambda
Logistic_LASSO   Logistic regression + LASSO                                              Lambda
Logistic_SCAD    Logistic regression + SCAD                                               Lambda
Nbayes           Naïve Bayes
KNN              K-Nearest Neighbor                                                       K
CART             CART                                                                     Alpha (complexity parameter)
RandFor          Random Forest                                                            Mtry (number of variables used at each split)
XGBoost          XGBoost                                                                  Nrounds (maximum number of iterations)
SVM_linear       SVM with linear kernel                                                   Gamma, cost
SVM_radial       SVM with radial kernel                                                   Gamma, cost
Ensemble         Ensemble classifier (majority vote over the predictions of all models)

Model Fitting
• The full data are split 7:3 according to a fixed index.
• Models are trained on the 70% (training data); tuning uses 5-fold CV.
• Predictions are made on the 30% (test data).
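The Ensemble row describes a majority vote over the predictions of the other models. A minimal sketch, assuming each model outputs a 0/1 label per test passenger (the predictions below are hypothetical and cover only three of the models):

```python
from collections import Counter

def majority_vote(predictions_by_model):
    """Combine per-model label lists into one label per observation by majority vote."""
    n_obs = len(next(iter(predictions_by_model.values())))
    combined = []
    for i in range(n_obs):
        votes = Counter(preds[i] for preds in predictions_by_model.values())
        combined.append(votes.most_common(1)[0][0])
    return combined

# Hypothetical predictions for four test passengers.
predictions = {
    "Logistic": [1, 0, 1, 0],
    "CART": [1, 1, 0, 0],
    "SVM_radial": [0, 0, 1, 0],
}
ensemble = majority_vote(predictions)
```

With the 11 base models listed in the table, a binary vote cannot tie, so no tie-breaking rule is needed in that setting.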
Part 4 Summary of Results

4. Summary of Results
1) Comparison of models and datasets by Accuracy
• On the slide, the maximum of each column is marked with a blue box.
• The maximum over all datasets is marked with a red box.
• Overall accuracy is higher when missing Age values are imputed by Title.
• SVM_radial achieves the highest accuracy when all preprocessing techniques are applied.

Accuracy
Model            Rough    Rough Out   Rough Feat   Rough Out Feat   Title    Title Out   Title Feat   Title Out Feat
Logistic         0.7985   0.7959      0.7934       0.7934           0.7934   0.7959      0.7985       0.7883
Logistic_ridge   0.7985   0.8010      0.7985       0.7959           0.7959   0.7959      0.7985       0.8036
Logistic_LASSO   0.7985   0.7985      0.7959       0.7959           0.7959   0.7883      0.7985       0.7959
Logistic_SCAD    0.7883   0.7883      0.7883       0.7883           0.7883   0.7883      0.7883       0.7883
Nbayes           0.7092   0.7194      0.7092       0.7245           0.7985   0.7908      0.7551       0.7449
KNN              0.6811   0.6811      0.6786       0.6786           0.7041   0.7092      0.6939       0.6964
CART             0.8138   0.8138      0.8189       0.8138           0.8138   0.8138      0.8138       0.8138
RandFor          0.7755   0.8036      0.7704       0.7934           0.7908   0.7857      0.8010       0.8087
XGBoost          0.8138   0.8087      0.8087       0.8163           0.8189   0.8138      0.8061       0.8163
SVM_linear       0.7883   0.7883      0.7883       0.7883           0.7883   0.7883      0.7883       0.7883
SVM_radial       0.8010   0.8036      0.8189       0.8163           0.8061   0.8061      0.8189       0.8240
Ensemble         0.8010   0.8010      0.8010       0.7985           0.7985   0.7908      0.7959       0.7985

Best model and dataset by Accuracy: SVM_radial & Title_Out_Feat
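The column maxima and the overall maximum highlighted on the slide can be recomputed from the accuracy table. The values below are copied from the table; the variable names are illustrative.

```python
DATASETS = [
    "Rough", "Rough Out", "Rough Feat", "Rough Out Feat",
    "Title", "Title Out", "Title Feat", "Title Out Feat",
]

# Test accuracy per model, in the column order of DATASETS.
ACCURACY = {
    "Logistic":       [0.7985, 0.7959, 0.7934, 0.7934, 0.7934, 0.7959, 0.7985, 0.7883],
    "Logistic_ridge": [0.7985, 0.8010, 0.7985, 0.7959, 0.7959, 0.7959, 0.7985, 0.8036],
    "Logistic_LASSO": [0.7985, 0.7985, 0.7959, 0.7959, 0.7959, 0.7883, 0.7985, 0.7959],
    "Logistic_SCAD":  [0.7883, 0.7883, 0.7883, 0.7883, 0.7883, 0.7883, 0.7883, 0.7883],
    "Nbayes":         [0.7092, 0.7194, 0.7092, 0.7245, 0.7985, 0.7908, 0.7551, 0.7449],
    "KNN":            [0.6811, 0.6811, 0.6786, 0.6786, 0.7041, 0.7092, 0.6939, 0.6964],
    "CART":           [0.8138, 0.8138, 0.8189, 0.8138, 0.8138, 0.8138, 0.8138, 0.8138],
    "RandFor":        [0.7755, 0.8036, 0.7704, 0.7934, 0.7908, 0.7857, 0.8010, 0.8087],
    "XGBoost":        [0.8138, 0.8087, 0.8087, 0.8163, 0.8189, 0.8138, 0.8061, 0.8163],
    "SVM_linear":     [0.7883, 0.7883, 0.7883, 0.7883, 0.7883, 0.7883, 0.7883, 0.7883],
    "SVM_radial":     [0.8010, 0.8036, 0.8189, 0.8163, 0.8061, 0.8061, 0.8189, 0.8240],
    "Ensemble":       [0.8010, 0.8010, 0.8010, 0.7985, 0.7985, 0.7908, 0.7959, 0.7985],
}

# Highest accuracy in each dataset column.
column_max = {
    name: max(row[j] for row in ACCURACY.values())
    for j, name in enumerate(DATASETS)
}

# Overall best (model, dataset, accuracy) triple.
best = max(
    ((model, name, acc) for model, row in ACCURACY.items() for name, acc in zip(DATASETS, row)),
    key=lambda item: item[2],
)
```

The overall maximum is unique (SVM_radial on Title Out Feat, 0.8240), while several column maxima are shared by more than one model, for example CART and XGBoost on Rough.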
Q&A
