BA 682 Data Mining
20123820 강준현
I. Building a Logistic Regression Model Using the Universal Bank Data
Data Exploration
[Figures: Income, Family Size, Credit Card Avg, Education]
Data Exploration
[Figures: Experience, Age]

Correlation matrix (lower triangle; columns follow the same order as the rows)
Variable            ID        Age       Exper.    Income    ZIP Code  Family    CCAvg     Educ.     Mortgage  Pers.Loan Sec.Acct  CD Acct      Online    CreditCard
ID                  1
Age                 -0.00847  1
Experience          -0.00833  0.994215  1
Income              -0.01769  -0.05527  -0.04657  1
ZIP Code            0.013432  -0.02922  -0.02863  -0.01641  1
Family              -0.0168   -0.04642  -0.05256  -0.1575   0.011778  1
CCAvg               -0.02467  -0.05203  -0.05009  0.645993  -0.00407  -0.10928  1
Education           0.021463  0.041334  0.013152  -0.18752  -0.01738  0.064929  -0.13614  1
Mortgage            -0.01392  -0.01254  -0.01058  0.206806  0.007383  -0.02044  0.109909  -0.03333  1
Personal Loan       -0.0248   -0.00773  -0.00741  0.502462  0.000107  0.061367  0.366891  0.136722  0.142095  1
Securities Account  -0.01697  -0.00044  -0.00123  -0.00262  0.004704  0.019994  0.015087  -0.01081  -0.00541  0.021954  1
CD Account          -0.00691  0.008043  0.010353  0.169738  0.019972  0.01411   0.136537  0.013934  0.089311  0.316355  0.317034  1
Online              -0.00253  0.013702  0.013898  0.014206  0.01699   0.010354  -0.00362  -0.015    -0.00599  0.006278  0.012627  0.175880016  1
CreditCard          0.017028  0.007681  0.008967  -0.00239  0.007691  0.011588  -0.00669  -0.01101  -0.00723  0.002802  -0.01503  0.278644365  0.00421   1
Data Dimension Reduction

Principal Components
Variable      Component 1    Component 2    Component 3    Component 4    Component 5
Age           0.01554224     0.70662385     0.08264883     0.54202592     -0.44701025
Experience    0.01338275     0.70728117     -0.07694203    -0.54321295    0.44561642
Income        -0.99977607    0.02043895     0.00444289     0.00272685     0.00167583
Family        0.0039179      -0.0041944     0.98944801     -0.1447722     0.00090257
Education     0.00342416     0.00085297     0.09067582     0.6246289      0.77563143

Variance      2119.944092    261.3847961    1.2882899      0.89438522     0.53321916
Variance%     88.92215729    10.96392059    0.05403799     0.03751545     0.02236615
Cum%          88.92215729    99.88607788    99.94011688    99.97763062    99.99999237
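The Variance% and Cum% rows follow directly from the component variances. A minimal sketch, assuming only the five variances reported above (the function name `variance_shares` is illustrative and not part of XLMiner):

```python
# Component variances as reported in the Principal Components table.
variances = [2119.944092, 261.3847961, 1.2882899, 0.89438522, 0.53321916]


def variance_shares(variances):
    """Return (percent, cumulative percent) lists for the given variances."""
    total = sum(variances)
    percents = [100.0 * v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in percents:
        running += p
        cumulative.append(running)
    return percents, cumulative


percents, cumulative = variance_shares(variances)
for i, (p, c) in enumerate(zip(percents, cumulative), start=1):
    print(f"Component {i}: Variance% = {p:.4f}, Cum% = {c:.4f}")
```

The first component alone carries about 89% of the variance and loads almost entirely on Income, which is what one would expect when the variables are not rescaled before the analysis.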
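Data Processing
• Of Age and Experience, only Experience is included as a variable (the two are almost perfectly correlated, r = 0.994)
• Records with a negative Experience value are deleted (52 records)
• The nominal variables ID and ZIP Code are also excluded from the variables
• The data set is randomly split 60:40 into a training set and a validation set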
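The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the deck performs these steps in XLMiner, the function name `preprocess_and_split` is hypothetical, the demo rows are synthetic, and Python's shuffle will not reproduce XLMiner's partition even with the same seed value.

```python
import random


def preprocess_and_split(records, train_fraction=0.6, seed=12345):
    """Drop negative-Experience rows and excluded columns, then split randomly."""
    excluded = ("ID", "ZIP Code", "Age")  # variables left out of the model
    kept = [
        {k: v for k, v in r.items() if k not in excluded}
        for r in records
        if r["Experience"] >= 0
    ]
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(kept)
    n_train = round(train_fraction * len(kept))
    return kept[:n_train], kept[n_train:]


# Synthetic demo rows (NOT the Universal Bank data): every tenth row has
# Experience = -1 so the filter has something to remove.
demo = [
    {"ID": i, "Age": 25 + i % 40, "Experience": i % 10 - 1,
     "Income": 50 + i, "ZIP Code": 90000 + i}
    for i in range(1, 101)
]
train, valid = preprocess_and_split(demo)
print(len(train), len(valid))
```

With the real data, 5000 - 52 = 4948 records remain, and 60% of 4948 rounds to 2969, leaving 1979 for validation.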

Records with negative Experience (38 of the 52 deleted records appear on the slide)
ID    Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  Personal Loan  Securities Account  CD Account  Online  CreditCard
2619  23   -3          55      92704     3       2.40   2          145       0              0                   0           1       0
3627  24   -3          28      90089     4       1.00   3          0         0              0                   0           0       0
4286  23   -3          149     93555     2       7.20   1          0         0              0                   0           1       0
4515  24   -3          41      91768     4       1.00   3          0         0              0                   0           1       0
316   24   -2          51      90630     3       0.30   3          0         0              0                   0           1       0
452   28   -2          48      94132     2       1.75   3          89        0              0                   0           1       0
598   24   -2          125     92835     2       7.20   1          0         0              1                   0           0       1
794   24   -2          150     94720     2       2.00   1          0         0              0                   0           1       0
890   24   -2          82      91103     2       1.60   3          0         0              0                   0           1       1
2467  24   -2          80      94105     2       1.60   3          0         0              0                   0           1       0
2718  23   -2          45      95422     4       0.60   2          0         0              0                   0           1       1
2877  24   -2          80      91107     2       1.60   3          238       0              0                   0           0       0
2963  23   -2          81      91711     2       1.80   2          0         0              0                   0           0       0
3131  23   -2          82      92152     2       1.80   2          0         0              1                   0           0       1
3797  24   -2          50      94920     3       2.40   2          0         0              1                   0           0       0
3888  24   -2          118     92634     2       7.20   1          0         0              1                   0           1       0
4117  24   -2          135     90065     2       7.20   1          0         0              0                   0           1       0
4412  23   -2          75      90291     2       1.80   2          0         0              0                   0           1       1
4482  25   -2          35      95045     4       1.00   3          0         0              0                   0           1       0
90    25   -1          113     94303     4       2.30   3          0         0              0                   0           0       1
227   24   -1          39      94085     2       1.70   2          0         0              0                   0           0       0
525   24   -1          75      93014     4       0.20   1          0         0              0                   0           1       0
537   25   -1          43      92173     3       2.40   2          176       0              0                   0           1       0
541   25   -1          109     94010     4       2.30   3          314       0              0                   0           1       0
577   25   -1          48      92870     3       0.30   3          0         0              0                   0           0       1
584   24   -1          38      95045     2       1.70   2          0         0              0                   0           1       0
650   25   -1          82      92677     4       2.10   3          0         0              0                   0           1       0
671   23   -1          61      92374     4       2.60   1          239       0              0                   0           1       0
687   24   -1          38      92612     4       0.60   2          0         0              0                   0           1       0
910   23   -1          149     91709     1       6.33   1          305       0              0                   0           0       1
1174  24   -1          35      94305     2       1.70   2          0         0              0                   0           0       0
1429  25   -1          21      94583     4       0.40   1          90        0              0                   0           1       0
1523  25   -1          101     94720     4       2.30   3          256       0              0                   0           0       1
1906  25   -1          112     92507     2       2.00   1          241       0              0                   0           1       0
2103  25   -1          81      92647     2       1.60   3          0         0              0                   0           1       1
2431  23   -1          73      92120     4       2.60   1          0         0              0                   0           1       0
2546  25   -1          39      94720     3       2.40   2          0         0              0                   0           1       0
2849  24   -1          78      94720     2       1.80   2          0         0              0                   0           0       0

XLMiner : Data Partition Sheet (Ver: 12.5.3E), Date: 31-Oct-2013 21:57:13
Data source: Data!$A$5:$N$5004
Selected variables: ID, Age, Experience, Income, ZIP Code, Family, CCAvg, Education, Mortgage, Personal Loan, Securities Account, CD Account, Online, CreditCard
Partitioning method: Randomly chosen
Random seed: 12345
# training rows: 3000
# validation rows: 2000

Data
Training data used for building the model: ['UniversalBank_Logistic
# Records in the training data: 2969
Validation data: ['UniversalBank_Logistic
# Records in the validation data: 1979
Logistic Regression
• Set confidence level 95%
• Best subset selection: Exhaustive search
Logistic Regression
The Regression Model
Input variables      Coefficient    Std. Error    p-value       Odds           95% Confidence Interval
Constant term        -13.981266     0.8205896     0             8.47253E-07    8.16778E-07    8.77729E-07
Experience           0.01231669     0.00861601    0.15285712    1.01239288     0.99544007     1.02963436
Income               0.05771863     0.00359579    0             1.05941689     1.0519768      1.06690955
Family               0.73089772     0.0997033     0             2.07694435     1.70827281     2.52518058
CCAvg                0.13275796     0.05343928    0.01298148    1.14197361     1.02841508     1.26807117
Education            1.72314227     0.15280582    0             5.60210371     4.15224123     7.55822325
Mortgage             0.00008745     0.00073139    0.90482515    1.0000875      0.99865484     1.00152206
Securities Account   -1.11542439    0.39214       0.00444875    0.32777616     0.15198027     0.70691556
CD Account           4.12018013     0.4292928     0             61.5703392     26.54341698    142.8190918
Online               -0.79940081    0.20818347    0.00012309    0.44959828     0.29896376     0.67613083
CreditCard           -0.95952284    0.2626732     0.00025928    0.38307562     0.22892682     0.64102113
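The Odds column is the exponential of each coefficient, and the interval bounds are the exponentials of the coefficient plus or minus about 1.96 standard errors. A minimal sketch that recomputes a few rows; the helper name `odds_with_ci` and the use of z = 1.96 for the 95% level are assumptions, not taken from the XLMiner output:

```python
import math

# (coefficient, standard error) pairs copied from the regression table above.
model = {
    "Income": (0.05771863, 0.00359579),
    "Education": (1.72314227, 0.15280582),
    "CD Account": (4.12018013, 0.4292928),
}

Z_95 = 1.96  # assumed normal quantile for a two-sided 95% interval


def odds_with_ci(coef, se, z=Z_95):
    """Return (odds, lower bound, upper bound) for one logistic coefficient."""
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


for name, (coef, se) in model.items():
    odds, lower, upper = odds_with_ci(coef, se)
    print(f"{name:12s} odds = {odds:.4f}  95% CI = ({lower:.4f}, {upper:.4f})")
```

Read this way, each additional unit of Income multiplies the odds of accepting a personal loan by about 1.06, holding the other variables fixed.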

Best subset selection
#Coeffs   RSS            Cp             Probability   Model (Constant present in all models)
2         3183.687744    219.7645264    0             Constant, Income
3         3100.568359    138.6170197    0             Constant, Income, Education
4         3039.182617    79.21051788    0             Constant, Income, Education, CD Account
5         2990.404297    32.41569901    0.00001378    Constant, Income, Family, Education, CD Account
6         2981.825928    25.83442879    0.00019169    Constant, Income, Family, Education, CD Account, Online
7         2972.419922    18.42524338    0.00426458    Constant, Income, Family, Education, CD Account, Online, CreditCard
8         2964.378662    12.38126278    0.06197376    Constant, Income, Family, Education, Securities Account, CD Account, Online, CreditCard
9         2959.063965    9.064768796    0.35692081    Constant, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
10        2957.014404    9.01451492     0.9041543     Constant, Experience, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
11        2957           11.00010586    1             Constant, Experience, Income, Family, CCAvg, Education, Mortgage, Securities Account, CD Account, Online, CreditCard

Best subset selection (second table on the slide; only partly recoverable)
#Coeffs   RSS            Cp             Probability   Model (Constant present in all models)
2         3212.504639    217.5792542    0             Constant, Income
3         3108.05542     115.0950928    0             Constant, Income, Education
4         3059.695313    68.71881104    0             Constant, Income, Education, CD Account
5         3019.44751     30.45754433    0.00001909    Constant, Income, Family, Education (remaining terms not recoverable)
6         3008.855957    21.86244774    0.00062266    Constant, Income, Family, Education (remaining terms not recoverable)
7         2999.336426    14.33973217    0.01660405    Constant, Income, Family, Education (remaining terms not recoverable)
8         2994.993164    11.99501705    0.05081629    Constant, Income, Family (remaining terms not recoverable)
Performance Evaluation

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
              Predicted 1   Predicted 0
Actual 1      179           107
Actual 0      43            2640

Error Report
Class      # Cases   # Errors   % Error
1          286       107        37.41
0          2683      43         1.60
Overall    2969      150        5.05

Validation Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
              Predicted 1   Predicted 0
Actual 1      130           64
Actual 0      30            1755

Error Report
Class      # Cases   # Errors   % Error
1          194       64         32.99
0          1785      30         1.68
Overall    1979      94         4.75

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.3

Classification Confusion Matrix
              Predicted 1   Predicted 0
Actual 1      213           73
Actual 0      93            2590

Error Report
Class      # Cases   # Errors   % Error
1          286       73         25.52
0          2683      93         3.47
Overall    2969      166        5.59

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.2

Classification Confusion Matrix
              Predicted 1   Predicted 0
Actual 1      235           51
Actual 0      148           2535

Error Report
Class      # Cases   # Errors   % Error
1          286       51         17.83
0          2683      148        5.52
Overall    2969      199        6.70
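Each Error Report can be derived from its confusion matrix: the errors for a class are the off-diagonal count in that class's row. A minimal sketch using the training counts at cutoff 0.5 (the function name `error_report` and its argument names are illustrative):

```python
def error_report(n11, n10, n01, n00):
    """Build an error report from confusion-matrix counts.

    n11: actual 1, predicted 1    n10: actual 1, predicted 0
    n01: actual 0, predicted 1    n00: actual 0, predicted 0
    Returns {class: (# cases, # errors, % error)}.
    """
    rows = {
        "1": (n11 + n10, n10),
        "0": (n01 + n00, n01),
        "Overall": (n11 + n10 + n01 + n00, n10 + n01),
    }
    return {
        cls: (cases, errors, 100.0 * errors / cases)
        for cls, (cases, errors) in rows.items()
    }


# Training data, cutoff 0.5 (counts taken from the summary report).
report = error_report(179, 107, 43, 2640)
for cls, (cases, errors, pct) in report.items():
    print(f"{cls:8s}{cases:6d}{errors:6d}{pct:8.2f}")
```

Lowering the cutoff from 0.5 to 0.2 trades a lower class-1 error (37.41% to 17.83%) for a higher class-0 error (1.60% to 5.52%), which is why the cutoff should be chosen with the relative cost of each mistake in mind.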
Performance Evaluation
[Figure: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative; series: cumulative Personal Loan when sorted using predicted values, cumulative Personal Loan using average]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
[Figure: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative; series: cumulative Personal Loan when sorted using predicted values, cumulative Personal Loan using average]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
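The decile-wise lift chart plots, for each tenth of the records ranked by predicted probability, the mean of the actual outcome divided by the global mean. A minimal sketch of that calculation; the function name `decile_lift` is illustrative, the demo data are synthetic (not the deck's results), and it assumes at least as many records as bins and at least one positive outcome:

```python
def decile_lift(actual, predicted, n_bins=10):
    """Return the ratio (bin mean / global mean) for each bin.

    Records are ranked by predicted value, highest first, and cut into
    n_bins groups of (nearly) equal size.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    ranked = [actual[i] for i in order]
    global_mean = sum(ranked) / len(ranked)
    lifts = []
    for b in range(n_bins):
        start = b * len(ranked) // n_bins
        end = (b + 1) * len(ranked) // n_bins
        chunk = ranked[start:end]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts


# Synthetic example: 100 records, 10 positives, all ranked at the top.
actual = [1] * 10 + [0] * 90
predicted = [1 - i / 100 for i in range(100)]
lifts = decile_lift(actual, predicted)
print([round(x, 2) for x in lifts])
```

A model with no predictive value would give a ratio close to 1 in every decile; the further the first deciles rise above 1, the better the ranking.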
II. Building a Naive Bayes Model Using the Flight Delay Data
Data Exploration
[Figures: counts of delayed and ontime flights by Weather (0, 1), Week (1 to 7), Origin (BWI, DCA, IAD), and Destination (EWR, JFK, LGA)]
Data Exploration
[Figure: counts of delayed and ontime flights by Carrier (CO, DH, DL, MQ, OH, RU, UA, US)]
Data Exploration
[Figure: pie chart by Scheduled departure time]
Scheduled departure time   Share
600-700                    4%
700-800                    5%
800-900                    6%
900-1000                   3%
1000-1100                  3%
1100-1200                  1%
1200-1300                  5%
1300-1400                  5%
1400-1500                  15%
1500-1600                  9%
1600-1700                  8%
1700-1800                  15%
1800-1900                  3%
1900-2000                  9%
2000-2100                  2%
2100-2200                  8%
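Data Processing
• Records with departure times of 10 and 109 fall outside the 600 to 2200 range; they are judged to be outliers and deleted
• The Scheduled departure time data is recoded into 16 time blocks
• Two variables are excluded from the analysis: the actual departure time, which cannot be known in advance at prediction time, and a variable that is at a similar level for every flight because all routes run between Washington DC and New York (mean 211.87, median 214, mode 214, standard deviation 13.31; presumably the flight distance)
• The nominal variables tail number and flight number are excluded from the analysis
• The flight date is excluded from the analysis because it is less useful for future prediction than the day of the week
• The data set is randomly split 60:40 into a training set and a validation set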
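The first two steps above (dropping out-of-range departure times and recoding the rest into hourly blocks) can be sketched as follows. The function name `bin_crs_dep_time` is hypothetical, and treating each block as the half-open interval [start, start + 100) is an assumption; the deck does not say how boundary values are handled.

```python
# The 16 hourly blocks from 600-700 up to 2100-2200.
BLOCKS = [f"{h}-{h + 100}" for h in range(600, 2200, 100)]


def bin_crs_dep_time(hhmm):
    """Map a scheduled departure time (integer HHMM) to its time block.

    Returns None for times outside 600 to 2200, which the deck treats
    as outliers to be deleted.
    """
    if not 600 <= hhmm < 2200:
        return None
    start = (hhmm // 100) * 100
    return f"{start}-{start + 100}"


for t in (10, 109, 600, 1530, 2159):
    print(t, "->", bin_crs_dep_time(t))
```

The labels produced this way match the Binned_CRS_DEP_TIME values used in the conditional probability table.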
Naïve Bayes

Conditional probabilities
Input variable         Value        ontime         delayed
CARRIER                CO           0.036312849    0.06122449
CARRIER                DH           0.231843575    0.306122449
CARRIER                DL           0.188081937    0.118367347
CARRIER                MQ           0.118249534    0.163265306
CARRIER                OH           0.013035382    0.012244898
CARRIER                RU           0.174115456    0.244897959
CARRIER                UA           0.016759777    0.004081633
CARRIER                US           0.22160149     0.089795918
DEST                   EWR          0.273743017    0.387755102
DEST                   JFK          0.176908752    0.187755102
DEST                   LGA          0.549348231    0.424489796
ORIGIN                 BWI          0.057728119    0.102040816
ORIGIN                 DCA          0.645251397    0.502040816
ORIGIN                 IAD          0.297020484    0.395918367
Weather                0            1              0.930612245
Weather                1            0              0.069387755
DAY_WEEK               Mon          0.131284916    0.220408163
DAY_WEEK               Tue          0.14990689     0.130612245
DAY_WEEK               Wed          0.148044693    0.151020408
DAY_WEEK               Thur         0.181564246    0.130612245
DAY_WEEK               Fri          0.170391061    0.159183673
DAY_WEEK               Sat          0.111731844    0.069387755
DAY_WEEK               Sun          0.10707635     0.13877551
Binned_CRS_DEP_TIME    600-700      0.058659218    0.032653061
Binned_CRS_DEP_TIME    700-800      0.055865922    0.053061224
Binned_CRS_DEP_TIME    800-900      0.082867784    0.06122449
Binned_CRS_DEP_TIME    900-1000     0.047486034    0.016326531
Binned_CRS_DEP_TIME    1000-1100    0.044692737    0.032653061
Binned_CRS_DEP_TIME    1100-1200    0.040968343    0.016326531
Binned_CRS_DEP_TIME    1200-1300    0.0716946      0.065306122
Binned_CRS_DEP_TIME    1300-1400    0.083798883    0.048979592
Binned_CRS_DEP_TIME    1400-1500    0.090316574    0.146938776
Binned_CRS_DEP_TIME    1500-1600    0.067970205    0.085714286
Binned_CRS_DEP_TIME    1600-1700    0.081005587    0.07755102
Binned_CRS_DEP_TIME    1700-1800    0.104283054    0.13877551
Binned_CRS_DEP_TIME    1800-1900    0.044692737    0.028571429
Binned_CRS_DEP_TIME    1900-2000    0.047486034    0.089795918
Binned_CRS_DEP_TIME    2000-2100    0.019553073    0.024489796
Binned_CRS_DEP_TIME    2100-2200    0.058659218    0.081632653

Prior class probabilities
According to relative occurrences in training data
Class      Prob.
ontime     0.814253222 <-- Success Class
delayed    0.185746778

Example: flying RU (Continental Express Airline) on a Wednesday, departing between 15:00 and 16:00, from IAD to LGA (weather is good)
Ontime = 0.81 * 0.174 * 0.148 * 0.068 * 0.297 * 0.549 * 1 ≈ 0.000231
Delay = 0.186 * 0.245 * 0.424 * 0.396 * 0.151 * 0.0857 * 0.931 ≈ 0.0000922
Ontime probability = 0.000231 / (0.000231 + 0.0000922) ≈ 71.5% (this exceeds the 50% cutoff value, so the flight is classified as ontime)
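The same calculation can be checked in code. A minimal sketch, assuming only the prior and conditional probabilities listed in the tables above; the dictionary keys and the function name `posterior` are illustrative. It uses the unrounded table values, so the result can differ slightly from a hand calculation with rounded factors.

```python
# Prior and conditional probabilities copied from the Naive Bayes tables.
prior = {"ontime": 0.814253222, "delayed": 0.185746778}
cond = {
    "ontime": {
        "CARRIER=RU": 0.174115456,
        "DAY_WEEK=Wed": 0.148044693,
        "Binned_CRS_DEP_TIME=1500-1600": 0.067970205,
        "ORIGIN=IAD": 0.297020484,
        "DEST=LGA": 0.549348231,
        "Weather=0": 1.0,
    },
    "delayed": {
        "CARRIER=RU": 0.244897959,
        "DAY_WEEK=Wed": 0.151020408,
        "Binned_CRS_DEP_TIME=1500-1600": 0.085714286,
        "ORIGIN=IAD": 0.395918367,
        "DEST=LGA": 0.424489796,
        "Weather=0": 0.930612245,
    },
}


def posterior(prior, cond, features):
    """Naive Bayes: multiply prior by each conditional, then normalize."""
    scores = {}
    for cls, p in prior.items():
        score = p
        for f in features:
            score *= cond[cls][f]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


features = list(cond["ontime"])  # the six feature values of the example flight
post = posterior(prior, cond, features)
label = "ontime" if post["ontime"] >= 0.5 else "delayed"
print({k: round(v, 4) for k, v in post.items()}, "->", label)
```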
Performance Evaluation

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                  Predicted ontime   Predicted delayed
Actual ontime     1049               25
Actual delayed    205                40

Error Report
Class      # Cases   # Errors   % Error
ontime     1074      25         2.33
delayed    245       205        83.67
Overall    1319      230        17.44

Validation Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                  Predicted ontime   Predicted delayed
Actual ontime     685                14
Actual delayed    155                26

Error Report
Class      # Cases   # Errors   % Error
ontime     699       14         2.00
delayed    181       155        85.64
Overall    880       169        19.20

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.3

Classification Confusion Matrix
                  Predicted ontime   Predicted delayed
Actual ontime     1074               0
Actual delayed    228                17

Error Report
Class      # Cases   # Errors   % Error
ontime     1074      0          0.00
delayed    245       228        93.06
Overall    1319      228        17.29

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.8

Classification Confusion Matrix
                  Predicted ontime   Predicted delayed
Actual ontime     672                402
Actual delayed    83                 162

Error Report
Class      # Cases   # Errors   % Error
ontime     1074      402        37.43
delayed    245       83         33.88
Overall    1319      485        36.77
Performance Evaluation
[Figure: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative; series: cumulative Flight Status when sorted using predicted values, cumulative Flight Status using average]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
[Figure: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative; series: cumulative Flight Status when sorted using predicted values, cumulative Flight Status using average]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
엑셀마이너를 활용한 데이터 분석

  • 1. BA 682 데이터마이닝(Data Mining) 20123820 강준현
  • 2. I. 유니버설 뱅크 데이터를 사용한 로지스틱 회귀분석 모델 구축
  • 4. Data Exploration Experience Age ID ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard Age 1 -0.00847 -0.00833 -0.01769 0.013432 -0.0168 -0.02467 0.021463 -0.01392 -0.0248 -0.01697 -0.00691 -0.00253 0.017028 1 0.994215 -0.05527 -0.02922 -0.04642 -0.05203 0.041334 -0.01254 -0.00773 -0.00044 0.008043 0.013702 0.007681 Experience Income ZIP Code 1 -0.04657 -0.02863 -0.05256 -0.05009 0.013152 -0.01058 -0.00741 -0.00123 0.010353 0.013898 0.008967 1 -0.01641 -0.1575 0.645993 -0.18752 0.206806 0.502462 -0.00262 0.169738 0.014206 -0.00239 1 0.011778 -0.00407 -0.01738 0.007383 0.000107 0.004704 0.019972 0.01699 0.007691 Family CCAvg Education Mortgage PersonalSecurities Account Account Loan CD Online CreditCard 1 -0.10928 0.064929 -0.02044 0.061367 0.019994 0.01411 0.010354 0.011588 1 -0.13614 0.109909 0.366891 0.015087 0.136537 -0.00362 -0.00669 1 0.00421 1 -0.03333 0.136722 -0.01081 0.013934 -0.015 -0.01101 1 0.142095 -0.00541 0.089311 -0.00599 -0.00723 1 0.021954 1 0.316355 0.317034 1 0.006278 0.012627 0.175880016 0.002802 -0.01503 0.278644365 1
  • 5. Data Dimension Reduction Principal Components Com ponents Variable 1 2 3 4 5 Age Experience Income Family Education 0.01554224 0.01338275 -0.99977607 0.0039179 0.00342416 0.70662385 0.08264883 0.54202592 -0.44701025 0.70728117 -0.07694203 -0.54321295 0.44561642 0.02043895 0.00444289 0.00272685 0.00167583 -0.0041944 0.98944801 -0.1447722 0.00090257 0.00085297 0.09067582 0.6246289 0.77563143 Variance Variance% Cum% 2119.944092 261.3847961 1.2882899 0.89438522 0.53321916 88.92215729 10.96392059 0.05403799 0.03751545 0.02236615 88.92215729 99.88607788 99.94011688 99.97763062 99.99999237
  • 6. Data Processing
      - Of Age and Experience, only Experience is kept as a variable (the two are almost perfectly correlated, r = 0.994).
      - Records with a negative Experience value are deleted (52 records).
      - The nominal variables ID and ZIP Code are also excluded.
      - The data set is randomly split 60:40 into a training set and a validation set.

      Records with negative Experience shown on the slide:
      ID    Age  Experience  Income  ZIP Code  Family  CCAvg
      2619  23   -3          55      92704     3       2.40
      3627  24   -3          28      90089     4       1.00
      4286  23   -3          149     93555     2       7.20
      4515  24   -3          41      91768     4       1.00
      316   24   -2          51      90630     3       0.30
      452   28   -2          48      94132     2       1.75
      598   24   -2          125     92835     2       7.20
      794   24   -2          150     94720     2       2.00
      890   24   -2          82      91103     2       1.60
      2467  24   -2          80      94105     2       1.60
      2718  23   -2          45      95422     4       0.60
      2877  24   -2          80      91107     2       1.60
      2963  23   -2          81      91711     2       1.80
      3131  23   -2          82      92152     2       1.80
      3797  24   -2          50      94920     3       2.40
      3888  24   -2          118     92634     2       7.20
      4117  24   -2          135     90065     2       7.20
      4412  23   -2          75      90291     2       1.80
      4482  25   -2          35      95045     4       1.00
      90    25   -1          113     94303     4       2.30
      227   24   -1          39      94085     2       1.70
      525   24   -1          75      93014     4       0.20
      537   25   -1          43      92173     3       2.40
      541   25   -1          109     94010     4       2.30
      577   25   -1          48      92870     3       0.30
      584   24   -1          38      95045     2       1.70
      650   25   -1          82      92677     4       2.10
      671   23   -1          61      92374     4       2.60
      687   24   -1          38      92612     4       0.60
      910   23   -1          149     91709     1       6.33
      1174  24   -1          35      94305     2       1.70
      1429  25   -1          21      94583     4       0.40
      1523  25   -1          101     94720     4       2.30
      1906  25   -1          112     92507     2       2.00
      2103  25   -1          81      92647     2       1.60
      2431  23   -1          73      92120     4       2.60
      2546  25   -1          39      94720     3       2.40
      2849  24   -1          78      94720     2       1.80

      XLMiner : Data Partition Sheet
      Data source:          Data!$A$5:$N$5004
      Selected variables:   ID, Age, Experience, Income, ZIP Code, Family, CCAvg, Education, Mortgage, Personal Loan, Securities Account, CD Account, Online, CreditCard
      Partitioning method:  Randomly chosen
      Random seed:          12345
      # training rows:      3000
      # validation rows:    2000

      XLMiner model output (Date: 31-Oct-2013 21:57:13, Ver: 12.5.3E)
      Training data used for building the model:  ['UniversalBank_Logistic
      # Records in the training data:              2969
      Validation data:                             ['UniversalBank_Logistic
      # Records in the validation data:            1979
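The preprocessing steps on this slide can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the slides use XLMiner, not Python; the function name `clean_and_split` and the toy records are hypothetical; and reusing the seed 12345 will not reproduce XLMiner's actual partition, because the two tools use different random number generators.

```python
import random


def clean_and_split(records, train_frac=0.6, seed=12345):
    """Drop negative-Experience rows and the excluded columns, then split at random."""
    # Age is dropped in favour of Experience; ID and ZIP Code are nominal.
    dropped_columns = {"ID", "ZIP Code", "Age"}
    kept = [
        {key: value for key, value in row.items() if key not in dropped_columns}
        for row in records
        if row["Experience"] >= 0
    ]
    rng = random.Random(seed)  # seed value taken from the slide, for repeatability only
    rng.shuffle(kept)
    n_train = round(len(kept) * train_frac)
    return kept[:n_train], kept[n_train:]


# Hypothetical toy records (not the Universal Bank data): 100 rows, 5 with negative Experience.
records = [
    {"ID": i, "Age": 25 + i % 40, "Experience": (i % 40) - 2,
     "ZIP Code": 90000 + i, "Income": 50 + i}
    for i in range(1, 101)
]
train, valid = clean_and_split(records)
print(len(train), len(valid))
```

Applied to the 4,948 records that remain after removing 52 of 5,000, the same 60:40 rounding gives 2,969 training and 1,979 validation records, which agrees with the record counts reported in the XLMiner output.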
  • 7. Logistic Regression
      - Set confidence level: 95%
      - Best subset selection: Exhaustive search
  • 8. Logistic Regression
      The Regression Model
      Input variables      Coefficient    Std. Error   p-value     Odds         95% CI lower  95% CI upper
      Constant term        -13.981266     0.8205896    0           8.47253E-07  8.16778E-07   8.77729E-07
      Experience           0.01231669     0.00861601   0.15285712  1.01239288   0.99544007    1.02963436
      Income               0.05771863     0.00359579   0           1.05941689   1.0519768     1.06690955
      Family               0.73089772     0.0997033    0           2.07694435   1.70827281    2.52518058
      CCAvg                0.13275796     0.05343928   0.01298148  1.14197361   1.02841508    1.26807117
      Education            1.72314227     0.15280582   0           5.60210371   4.15224123    7.55822325
      Mortgage             0.00008745     0.00073139   0.90482515  1.0000875    0.99865484    1.00152206
      Securities Account   -1.11542439    0.39214      0.00444875  0.32777616   0.15198027    0.70691556
      CD Account           4.12018013     0.4292928    0           61.5703392   26.54341698   142.8190918
      Online               -0.79940081    0.20818347   0.00012309  0.44959828   0.29896376    0.67613083
      CreditCard           -0.95952284    0.2626732    0.00025928  0.38307562   0.22892682    0.64102113

      Best subset selection (constant present in all models)
      #Coeffs  RSS          Cp           Probability  Model
      2        3183.687744  219.7645264  0            Constant, Income
      3        3100.568359  138.6170197  0            Constant, Income, Education
      4        3039.182617  79.21051788  0            Constant, Income, Education, CD Account
      5        2990.404297  32.41569901  0.00001378   Constant, Income, Family, Education, CD Account
      6        2981.825928  25.83442879  0.00019169   Constant, Income, Family, Education, CD Account, Online
      7        2972.419922  18.42524338  0.00426458   Constant, Income, Family, Education, CD Account, Online, CreditCard
      8        2964.378662  12.38126278  0.06197376   Constant, Income, Family, Education, Securities Account, CD Account, Online, CreditCard
      9        2959.063965  9.064768796  0.35692081   Constant, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
      10       2957.014404  9.01451492   0.9041543    Constant, Experience, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
      11       2957         11.00010586  1            Constant, Experience, Income, Family, CCAvg, Education, Mortgage, Securities Account, CD Account, Online, CreditCard

      A second, partially legible best-subset listing also appears on the slide:
      #Coeffs  RSS          Cp           Probability
      2        3212.504639  217.5792542  0
      3        3108.05542   115.0950928  0
      4        3059.695313  68.71881104  0
      5        3019.44751   30.45754433  0.00001909
      6        3008.855957  21.86244774  0.00062266
      7        2999.336426  14.33973217  0.01660405
      8        2994.993164  11.99501705  0.05081629
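The Odds and confidence-interval columns of the regression table follow from the coefficients and standard errors: the odds are exp(coefficient), and an approximate 95% interval is exp(coefficient ± 1.96 × standard error). The sketch below recomputes them and scores one customer. The function names and the example customer are hypothetical, and 1.96 is assumed as the normal quantile, so the last digits may differ slightly from XLMiner's output.

```python
import math

# (coefficient, standard error) pairs copied from the regression table.
MODEL = {
    "Constant": (-13.981266, 0.8205896),
    "Experience": (0.01231669, 0.00861601),
    "Income": (0.05771863, 0.00359579),
    "Family": (0.73089772, 0.0997033),
    "CCAvg": (0.13275796, 0.05343928),
    "Education": (1.72314227, 0.15280582),
    "Mortgage": (0.00008745, 0.00073139),
    "Securities Account": (-1.11542439, 0.39214),
    "CD Account": (4.12018013, 0.4292928),
    "Online": (-0.79940081, 0.20818347),
    "CreditCard": (-0.95952284, 0.2626732),
}


def odds_with_ci(name, z=1.96):
    """Return (odds, lower bound, upper bound) for one input variable."""
    coef, se = MODEL[name]
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


def predict_probability(customer):
    """Logistic response: P(Personal Loan = 1) for a dict of input values."""
    logit = MODEL["Constant"][0] + sum(
        MODEL[name][0] * value for name, value in customer.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical customer, for illustration only (not a record from the data set).
customer = {
    "Experience": 10, "Income": 100, "Family": 3, "CCAvg": 2.0, "Education": 2,
    "Mortgage": 0, "Securities Account": 0, "CD Account": 0, "Online": 1, "CreditCard": 0,
}
print(odds_with_ci("Family"))
print(predict_probability(customer))
```

Read this way, each additional family member multiplies the odds of accepting a personal loan by roughly 2.08, holding the other inputs fixed.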
  • 9. Performance Evaluation
      Training data scoring, cutoff probability for success = 0.5
                    Predicted 1  Predicted 0  # Cases  # Errors  % Error
      Actual 1      179          107          286      107       37.41
      Actual 0      43           2640         2683     43        1.60
      Overall                                 2969     150       5.05

      Validation data scoring, cutoff probability for success = 0.5
                    Predicted 1  Predicted 0  # Cases  # Errors  % Error
      Actual 1      130          64           194      64        32.99
      Actual 0      30           1755         1785     30        1.68
      Overall                                 1979     94        4.75

      Training data scoring, cutoff probability for success = 0.3
                    Predicted 1  Predicted 0  # Cases  # Errors  % Error
      Actual 1      213          73           286      73        25.52
      Actual 0      93           2590         2683     93        3.47
      Overall                                 2969     166       5.59

      Training data scoring, cutoff probability for success = 0.2
                    Predicted 1  Predicted 0  # Cases  # Errors  % Error
      Actual 1      235          51           286      51        17.83
      Actual 0      148          2535         2683     148       5.52
      Overall                                 2969     199       6.70
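The error reports on this slide can be recomputed directly from the confusion-matrix counts. The short sketch below does so for the three training-data cutoffs; the function name `error_report` is a hypothetical helper, not part of XLMiner.

```python
def error_report(tp, fn, fp, tn):
    """Return (% error for class 1, % error for class 0, overall % error)."""
    n_pos = tp + fn  # actual class 1 cases
    n_neg = fp + tn  # actual class 0 cases
    return (
        100.0 * fn / n_pos,
        100.0 * fp / n_neg,
        100.0 * (fn + fp) / (n_pos + n_neg),
    )


# Training-data counts from the slide, keyed by cutoff: (TP, FN, FP, TN).
TRAINING = {
    0.5: (179, 107, 43, 2640),
    0.3: (213, 73, 93, 2590),
    0.2: (235, 51, 148, 2535),
}

for cutoff, counts in sorted(TRAINING.items(), reverse=True):
    class1, class0, overall = error_report(*counts)
    print(f"cutoff {cutoff}: class 1 {class1:.2f}%, class 0 {class0:.2f}%, overall {overall:.2f}%")
```

Lowering the cutoff trades a higher overall error (5.05% to 6.70%) for far fewer missed loan acceptors (37.41% to 17.83% error on class 1).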
  • 10. Performance Evaluation
      [Figures: lift chart and decile-wise lift chart, each for the training dataset and the validation dataset. Lift chart: cumulative Personal Loan when sorted using predicted values versus cumulative Personal Loan using average, plotted against # cases. Decile-wise lift chart: decile mean / global mean, plotted by decile (1 to 10).]
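The quantity plotted in a decile-wise lift chart, decile mean divided by global mean, can be sketched as follows. This is a minimal illustration, not XLMiner's implementation: the function name `decile_lift` and the toy scores are hypothetical, and ties or uneven bin sizes are handled in the simplest possible way.

```python
def decile_lift(actual, predicted, n_bins=10):
    """Mean actual response per predicted-score bin, divided by the global mean."""
    if len(actual) < n_bins or sum(actual) == 0:
        raise ValueError("need at least n_bins records and one positive response")
    # Rank records from highest to lowest predicted probability.
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    global_mean = sum(actual) / len(actual)
    lifts = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        bin_actual = [actual[i] for i in order[start:stop]]
        lifts.append(sum(bin_actual) / len(bin_actual) / global_mean)
    return lifts


# Hypothetical toy data: 20 records, 3 responders concentrated among the top scores.
toy_predicted = [1.0 - i / 20 for i in range(20)]
toy_actual = [1, 1, 1, 0] + [0] * 16
print(decile_lift(toy_actual, toy_predicted))
```

A first-decile value well above 1 means the model concentrates responders among its highest-scored cases.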
  • 11. II. Building a Naive Bayes Model Using Flight Delay Data
  • 12. Data Exploration
      [Figures: bar charts of delayed and ontime flight counts by Weather (0, 1), by day of the week (1 to 7), by Origin (BWI, DCA, IAD), and by Destination (EWR, JFK, LGA).]
  • 13. Data Exploration
      [Figure: bar chart of delayed and ontime flight counts by Carrier (CO, DH, DL, MQ, OH, RU, UA, US).]
  • 14. Data Exploration
      [Figure: pie chart of scheduled departure time.]
      Time block   Share
      600-700      4%
      700-800      5%
      800-900      6%
      900-1000     3%
      1000-1100    3%
      1100-1200    1%
      1200-1300    5%
      1300-1400    5%
      1400-1500    15%
      1500-1600    9%
      1600-1700    8%
      1700-1800    15%
      1800-1900    3%
      1900-2000    9%
      2000-2100    2%
      2100-2200    8%
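  • 15. Data Processing
      - Records with departure times of 10 and 109 fall outside the 600 to 2200 range, so they are treated as outliers and deleted.
      - The scheduled departure time is regrouped into 16 time blocks.
      - The actual departure time is excluded because it cannot be known in advance at prediction time. Also excluded is a variable that is nearly constant because every route runs between Washington DC and New York (mean 211.87, median 214, mode 214, standard deviation 13.31; presumably the flight distance).
      - The nominal variables tail number and flight number are excluded.
      - The flight date is excluded because it is less useful for future prediction than the day of the week.
      - The data set is randomly split 60:40 into a training set and a validation set.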
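The outlier filter and the regrouping into 16 hourly time blocks can be sketched with a single helper. The function name `bin_departure_time` is hypothetical, and treating 2200 itself as out of range is an assumption, since the slide does not say how that boundary is handled.

```python
def bin_departure_time(hhmm):
    """Map a scheduled departure time such as 1545 to its hourly block label.

    Returns None for times outside 600-2200, which the slide treats as outliers.
    """
    if not 600 <= hhmm < 2200:
        return None
    start = (hhmm // 100) * 100
    return f"{start}-{start + 100}"


# The two outlier values from the slide are rejected; valid times get a block label.
for value in (10, 109, 600, 1545, 2159):
    print(value, bin_departure_time(value))

# All block labels that can be produced: 600-700 through 2100-2200.
BLOCKS = sorted({bin_departure_time(h * 100) for h in range(6, 22)},
                key=lambda label: int(label.split("-")[0]))
print(len(BLOCKS))
```

The labels match the Binned_CRS_DEP_TIME categories used in the naive Bayes conditional probability table.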
  • 16. Naïve Bayes
      Conditional probabilities
      Input variable        Value       P(value | ontime)  P(value | delayed)
      CARRIER               CO          0.036312849        0.06122449
      CARRIER               DH          0.231843575        0.306122449
      CARRIER               DL          0.188081937        0.118367347
      CARRIER               MQ          0.118249534        0.163265306
      CARRIER               OH          0.013035382        0.012244898
      CARRIER               RU          0.174115456        0.244897959
      CARRIER               UA          0.016759777        0.004081633
      CARRIER               US          0.22160149         0.089795918
      DEST                  EWR         0.273743017        0.387755102
      DEST                  JFK         0.176908752        0.187755102
      DEST                  LGA         0.549348231        0.424489796
      ORIGIN                BWI         0.057728119        0.102040816
      ORIGIN                DCA         0.645251397        0.502040816
      ORIGIN                IAD         0.297020484        0.395918367
      Weather               0           1                  0.930612245
      Weather               1           0                  0.069387755
      DAY_WEEK              Mon         0.131284916        0.220408163
      DAY_WEEK              Tue         0.14990689         0.130612245
      DAY_WEEK              Wed         0.148044693        0.151020408
      DAY_WEEK              Thur        0.181564246        0.130612245
      DAY_WEEK              Fri         0.170391061        0.159183673
      DAY_WEEK              Sat         0.111731844        0.069387755
      DAY_WEEK              Sun         0.10707635         0.13877551
      Binned_CRS_DEP_TIME   600-700     0.058659218        0.032653061
      Binned_CRS_DEP_TIME   700-800     0.055865922        0.053061224
      Binned_CRS_DEP_TIME   800-900     0.082867784        0.06122449
      Binned_CRS_DEP_TIME   900-1000    0.047486034        0.016326531
      Binned_CRS_DEP_TIME   1000-1100   0.044692737        0.032653061
      Binned_CRS_DEP_TIME   1100-1200   0.040968343        0.016326531
      Binned_CRS_DEP_TIME   1200-1300   0.0716946          0.065306122
      Binned_CRS_DEP_TIME   1300-1400   0.083798883        0.048979592
      Binned_CRS_DEP_TIME   1400-1500   0.090316574        0.146938776
      Binned_CRS_DEP_TIME   1500-1600   0.067970205        0.085714286
      Binned_CRS_DEP_TIME   1600-1700   0.081005587        0.07755102
      Binned_CRS_DEP_TIME   1700-1800   0.104283054        0.13877551
      Binned_CRS_DEP_TIME   1800-1900   0.044692737        0.028571429
      Binned_CRS_DEP_TIME   1900-2000   0.047486034        0.089795918
      Binned_CRS_DEP_TIME   2000-2100   0.019553073        0.024489796
      Binned_CRS_DEP_TIME   2100-2200   0.058659218        0.081632653

      Prior class probabilities (according to relative occurrences in training data)
      Class     Prob.
      ontime    0.814253222  <-- Success Class
      delayed   0.185746778

      Example: flying RU (Continental Express Airline) on a Wednesday, departing between 15:00 and 16:00, from IAD to LGA, in good weather.
      Ontime score  = 0.81 * 0.174 * 0.148 * 0.068 * 0.297 * 0.549 * 1 ≈ 0.00023
      Delayed score = 0.186 * 0.245 * 0.424 * 0.396 * 0.151 * 0.0857 * 0.931 ≈ 0.000092
      Ontime probability = 0.00023 / (0.00023 + 0.000092) ≈ 71%
      This exceeds the 50% cutoff value, so the flight is classified as ontime.
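The worked example on this slide can be checked by multiplying each class prior by the matching conditional probabilities and normalising. The sketch below uses the full-precision values from the table; the dictionary layout, the feature keys, and the function name `posterior` are hypothetical choices for illustration. With these inputs the delayed-class product is about 9.2e-5, so the normalised ontime probability comes out a little above 0.7.

```python
# Prior class probabilities from the slide.
PRIOR = {"ontime": 0.814253222, "delayed": 0.185746778}

# Conditional probabilities for the example flight only, copied from the table.
COND = {
    "ontime": {
        "CARRIER=RU": 0.174115456,
        "DAY_WEEK=Wed": 0.148044693,
        "DEP=1500-1600": 0.067970205,
        "ORIGIN=IAD": 0.297020484,
        "DEST=LGA": 0.549348231,
        "Weather=0": 1.0,
    },
    "delayed": {
        "CARRIER=RU": 0.244897959,
        "DAY_WEEK=Wed": 0.151020408,
        "DEP=1500-1600": 0.085714286,
        "ORIGIN=IAD": 0.395918367,
        "DEST=LGA": 0.424489796,
        "Weather=0": 0.930612245,
    },
}

FLIGHT = ["CARRIER=RU", "DAY_WEEK=Wed", "DEP=1500-1600",
          "ORIGIN=IAD", "DEST=LGA", "Weather=0"]


def posterior(features):
    """Return (normalised class probabilities, unnormalised class scores)."""
    scores = {}
    for cls, prior in PRIOR.items():
        score = prior
        for feature in features:
            score *= COND[cls][feature]  # naive independence assumption
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}, scores


probabilities, scores = posterior(FLIGHT)
print(scores)
print(probabilities)
predicted = "ontime" if probabilities["ontime"] >= 0.5 else "delayed"
print(predicted)
```

Changing the 0.5 threshold in the last step reproduces the effect of the cutoff setting: a higher cutoff for the ontime success class sends more flights to the delayed class.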
  • 17. Performance Evaluation
      Training data scoring, cutoff probability for success = 0.5
                        Predicted ontime  Predicted delayed  # Cases  # Errors  % Error
      Actual ontime     1049              25                 1074     25        2.33
      Actual delayed    205               40                 245      205       83.67
      Overall                                                1319     230       17.44

      Validation data scoring, cutoff probability for success = 0.5
                        Predicted ontime  Predicted delayed  # Cases  # Errors  % Error
      Actual ontime     685               14                 699      14        2.00
      Actual delayed    155               26                 181      155       85.64
      Overall                                                880      169       19.20

      Training data scoring, cutoff probability for success = 0.3
                        Predicted ontime  Predicted delayed  # Cases  # Errors  % Error
      Actual ontime     1074              0                  1074     0         0.00
      Actual delayed    228               17                 245      228       93.06
      Overall                                                1319     228       17.29

      Training data scoring, cutoff probability for success = 0.8
                        Predicted ontime  Predicted delayed  # Cases  # Errors  % Error
      Actual ontime     672               402                1074     402       37.43
      Actual delayed    83                162                245      83        33.88
      Overall                                                1319     485       36.77
  • 18. Performance Evaluation
      [Figures: lift chart and decile-wise lift chart, each for the training dataset and the validation dataset. Lift chart: cumulative Flight Status when sorted using predicted values versus cumulative Flight Status using average, plotted against # cases. Decile-wise lift chart: decile mean / global mean, plotted by decile (1 to 10).]