Amazon Employee Access Challenge
Predict an employee's access needs, given his/her job role
Yibo Chen
Data Scientist @ Supstat Inc
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
1 of 65 6/13/14, 2:01 PM
Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Introduction to the Challenge
the story
http://www.kaggle.com/c/amazon-employee-access-challenge
It is all about the access we need to do our daily work.
the mission
Build an auto-access model from the historical data that determines the access privilege according to the employee's job role and the resource he or she applied for.
the data
The data consists of real historical data collected from 2010 & 2011.
Employees are manually allowed or denied access to resources over time.
the files
· train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
· test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
the variables
COLUMN NAME        DESCRIPTION
ACTION             ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE           An ID for each resource
MGR_ID             The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1      Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2      Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME      Company role department description (e.g. Retail)
ROLE_TITLE         Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC   Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY        Company role family description (e.g. Retail Manager)
ROLE_CODE          Company role code; this code is unique to each role (e.g. Manager)
the metric
AUC (area under the ROC curve)
· is a metric used to judge predictions in binary response (0/1) problems
· is sensitive only to the order determined by the predictions, not to their magnitudes
· can be computed in R with the package verification or ROCR
(t <- data.frame(true_label=c(0,0,0,0,1,1,1,1),
predict_1=c(1,2,3,4,5,6,7,8),
predict_2=c(1,2,3,6,5,4,7,8),
predict_3=c(1,7,6,4,5,3,2,8)))
## true_label predict_1 predict_2 predict_3
## 1 0 1 1 1
## 2 0 2 2 7
## 3 0 3 3 6
## 4 0 4 6 4
## 5 1 5 5 5
## 6 1 6 4 3
## 7 1 7 7 2
## 8 1 8 8 8
P:4
N:4
TP:2
FP:1
TPR=TP/P=0.5
FPR=FP/N=0.25
table(t$predict_2 >= 6, t$true_label)
##
## 0 1
## FALSE 3 2
## TRUE 1 2
P:4
N:4
TP:3
FP:1
TPR=TP/P=0.75
FPR=FP/N=0.25
table(t$predict_2 >= 5, t$true_label)
##
## 0 1
## FALSE 3 1
## TRUE 1 3
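The two confusion tables above each give one point on the ROC curve. A minimal sketch of that calculation in base R follows; `roc_point` is a hypothetical helper name, and the vectors repeat the toy data from the earlier slide so that the block runs on its own.

```r
# Toy data from the slides: four negatives followed by four positives.
true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)

# roc_point (hypothetical name): TPR and FPR when every score >= threshold
# is predicted as positive.
roc_point <- function(label, score, threshold) {
  predicted_pos <- score >= threshold
  c(TPR = sum(predicted_pos & label == 1) / sum(label == 1),
    FPR = sum(predicted_pos & label == 0) / sum(label == 0))
}

roc_point(true_label, predict_2, 6)  # TPR 0.50, FPR 0.25
roc_point(true_label, predict_2, 5)  # TPR 0.75, FPR 0.25
```

Sweeping the threshold over all observed scores and connecting the resulting points traces the full ROC curve.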
require(ROCR, quietly = T)
pred <- prediction(t$predict_1, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 1
require(verification, quietly = T)
roc.area(t$true_label, t$predict_1)$A
## [1] 1
pred <- prediction(t$predict_1, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
pred <- prediction(t$predict_2, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.875
roc.area(t$true_label, t$predict_2)$A
## [1] 0.875
pred <- prediction(t$predict_2, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
pred <- prediction(t$predict_3, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.5
roc.area(t$true_label, t$predict_3)$A
## [1] 0.5
pred <- prediction(t$predict_3, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
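Because AUC depends only on the ordering of the predictions, it can also be read as the share of (positive, negative) pairs in which the positive case gets the higher score. The sketch below checks this against the three values reported above, using base R only; `auc_by_pairs` is a hypothetical helper name, not a function from ROCR or verification.

```r
true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_1  <- c(1, 2, 3, 4, 5, 6, 7, 8)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)
predict_3  <- c(1, 7, 6, 4, 5, 3, 2, 8)

# auc_by_pairs (hypothetical name): fraction of positive/negative pairs
# that are ranked correctly; a tie counts as half a correct pair.
auc_by_pairs <- function(label, score) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

auc_by_pairs(true_label, predict_1)       # 1
auc_by_pairs(true_label, predict_2)       # 0.875
auc_by_pairs(true_label, predict_3)       # 0.5
# A monotone transformation changes the magnitudes but not the order,
# so the AUC stays the same.
auc_by_pairs(true_label, exp(predict_2))  # 0.875
```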
Look into the Data
load data from files
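The loading code from this slide did not survive, so the following is only a sketch of one plausible way to do it. It assumes that the training and test rows are stacked into one predictor table `x` with lower-case column names, and that the target `y` is NA for the test rows, which is consistent with the later output. `load_access_data` and the returned list layout are hypothetical.

```r
# load_access_data (hypothetical name): read both files and stack them.
load_access_data <- function(train_path, test_path) {
  train <- read.csv(train_path)
  test  <- read.csv(test_path)
  names(train) <- tolower(names(train))
  names(test)  <- tolower(names(test))
  # Every training column except the target is treated as a predictor;
  # columns that exist only in the test file are ignored.
  predictors <- setdiff(names(train), "action")
  x <- rbind(train[, predictors, drop = FALSE],
             test[, predictors, drop = FALSE])
  y <- c(train$action, rep(NA, nrow(test)))  # target unknown for test rows
  list(x = x, y = y, n_train = nrow(train))
}

# Usage, assuming the competition files are in the working directory:
# d <- load_access_data("train.csv", "test.csv")
# x <- d$x; y <- d$y
```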
the target
table(y, useNA = "ifany")
## y
##     0     1  <NA>
##  1897 30872 58921
the predictor
treat the features as Categorical or Numerical?
sapply(x, function(z) {
length(unique(z))
})
##         resource           mgr_id    role_rollup_1    role_rollup_2
##             7518             4913              130              183
##    role_deptname       role_title role_family_desc      role_family
##              476              361             2951               68
##        role_code
##              361
par(mar = c(5, 4, 0, 2))
plot(x$role_title, x$role_code)
length(unique(x$role_title))
## [1] 361
length(unique(x$role_code))
## [1] 361
length(unique(paste(x$role_code, x$role_title)))
## [1] 361
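The three counts agree, which means every role_title pairs with exactly one role_code and vice versa, so one of the two columns is redundant. A small sketch of the same check as a reusable function, on made-up values; `is_one_to_one` is a hypothetical helper name.

```r
# is_one_to_one (hypothetical name): TRUE when the number of distinct
# (a, b) pairs equals both the number of distinct a values and the
# number of distinct b values.
is_one_to_one <- function(a, b) {
  n_pairs <- nrow(unique(data.frame(a = a, b = b)))
  n_pairs == length(unique(a)) && n_pairs == length(unique(b))
}

# Made-up values for illustration only.
title <- c("Engineer", "Manager", "Engineer", "Analyst")
code  <- c(101, 202, 101, 303)

is_one_to_one(title, code)                    # TRUE
is_one_to_one(title, c(101, 202, 102, 303))   # FALSE: Engineer has two codes
```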
x <- x[, names(x) != "role_code"]
sapply(x, function(z) {
length(unique(z))
})
##         resource           mgr_id    role_rollup_1    role_rollup_2
##             7518             4913              130              183
##    role_deptname       role_title role_family_desc      role_family
##              476              361             2951               68
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
Model Building
workflow
· Feature Extraction
· Base Learners
· Ensemble
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
1. the raw features(as numerical)
2. the raw features(as categorical) with level reduction
2.1 choose the top frequency categories
VAR_RAW   FREQUENCY   VAR_WITH_LEVEL_REDUCTION
a         3           a
a         3           a
a         3           a
b         2           b
b         2           b
c         1           other
d         1           other
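for (i in 1:ncol(x)) {
  # work on character values so that "other" can be assigned safely
  x[, i] <- as.character(x[, i])
  # keep the two most frequent levels; every other level becomes "other"
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}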
2.2 use Pearson's Chi-squared Test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
## mgr_770 mgr_not_770
## 0 5 1892
## 1 147 30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
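The slide shows the test for a single level (mgr_id 770, p = 0.2507) but not how the p-values were turned into a reduction rule. The sketch below assumes one simple rule: a level is kept only if its "this level vs. the rest" chi-squared test against the target has a p-value below a cutoff, and all other levels are merged into "other". The function name `reduce_levels_chisq` and the cutoff `alpha = 0.05` are assumptions, not taken from the talk.

```r
# reduce_levels_chisq (hypothetical name): merge levels that show no
# significant association with the target into "other".
reduce_levels_chisq <- function(feature, target, alpha = 0.05) {
  feature <- as.character(feature)
  keep <- character(0)
  for (lv in unique(feature)) {
    is_lv <- feature == lv
    if (all(is_lv)) next  # a single-level feature cannot be tested
    # 2 x 2 table: target (0/1) against "is this level" (FALSE/TRUE)
    p <- suppressWarnings(chisq.test(table(target, is_lv))$p.value)
    if (!is.na(p) && p < alpha) keep <- c(keep, lv)
  }
  feature[!feature %in% keep] <- "other"
  feature
}

# Made-up data: "a" and "b" are strongly tied to the target, "c" and "d"
# each occur once and carry no detectable signal.
feature <- c(rep("a", 50), rep("b", 50), "c", "d")
target  <- c(rep(1, 50), rep(0, 50), 1, 0)
table(reduce_levels_chisq(feature, target))
```

Running one test per level is slow for columns with thousands of levels, and the many tests are not corrected for multiple comparisons; the sketch is meant to show the idea only.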
3. the dummies(in sparse Matrix)
ID   VAR   VAR_A   VAR_B   VAR_C
1    a     1       0       0
2    a     1       0       0
3    a     1       0       0
4    b     0       1       0
5    c     0       0       1
use package Matrix to create the dummies
require(Matrix)
set.seed(114)
Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
## 5 x 8 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . 1 . . . 1
## [2,] . 1 . . . . 1 .
## [3,] 1 . . . . . . .
## [4,] . . . . . 1 . .
## [5,] . . . . . . . .
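The random matrix above only illustrates the sparse storage format. To build the dummy columns from an actual categorical variable, one option is `sparse.model.matrix()` from the same Matrix package; the sketch below applies it to the five-row VAR example from the previous slide. Whether the original solution used this function is not stated in the slides.

```r
library(Matrix)

# The five-row example from the dummy-variable table.
d <- data.frame(var = factor(c("a", "a", "a", "b", "c")))

# "- 1" drops the intercept so that every level gets its own column.
dummies <- sparse.model.matrix(~ var - 1, data = d)

dim(dummies)       # 5 rows, 3 columns
colSums(dummies)   # 3, 1, 1: the level frequencies
```

Only the non-zero entries are stored, which matters here because a column such as resource has several thousand levels.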
4. the dummies including the interaction
ID   M   N   MN_AP   MN_AQ   MN_BP   MN_BQ
1    a   p   1       0       0       0
2    a   p   1       0       0       0
3    a   q   0       1       0       0
4    b   p   0       0       1       0
5    b   q   0       0       0       1
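The interaction columns in the table can be produced by pasting the two variables into one combined factor and then expanding it into dummies. A minimal sketch with the Matrix package, using the five rows from the table; the column name `MN` follows the table, the rest is illustrative.

```r
library(Matrix)

d <- data.frame(M = c("a", "a", "a", "b", "b"),
                N = c("p", "p", "q", "p", "q"))

# One combined level per observed (M, N) pair: "ap", "aq", "bp", "bq".
d$MN <- factor(paste(d$M, d$N, sep = ""))

interaction_dummies <- sparse.model.matrix(~ MN - 1, data = d)
colSums(interaction_dummies)   # 2, 1, 1, 1
```

Only pairs that actually occur in the data become columns, which keeps the number of interaction dummies far below the product of the two level counts.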
5. some derived variables(count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion
tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
'c2_resource_role_deptname_ratio_i',
'c2_resource_role_deptname_ratio_j')]
cbind(tmp1, tmp2)
##     c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
## 114           1             1645                                1
## 115          36             1312                                4
## 116          45              465                               24
## 117         374             2377                              169
##     c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
## 114                            1.0000                         0.0006079
## 115                            0.1111                         0.0030488
## 116                            0.5333                         0.0516129
## 117                            0.4519                         0.0710980
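The numbers in the output are consistent with ratio_i = cnt_ij / c1_resource and ratio_j = cnt_ij / c1_role_deptname (for row 115: 4 / 36 = 0.1111 and 4 / 1312 = 0.0030488). The code that built `cnt_1` and `cnt_2` is not shown, so the following is a sketch of one way to derive such features in base R; `count_ratio_features` and the short column names are hypothetical.

```r
# count_ratio_features (hypothetical name): per-row counts of each
# category, of the category pair, and the two ratios between them.
count_ratio_features <- function(a, b) {
  ones   <- rep(1, length(a))
  cnt_i  <- ave(ones, a, FUN = sum)      # frequency of the row's a value
  cnt_j  <- ave(ones, b, FUN = sum)      # frequency of the row's b value
  cnt_ij <- ave(ones, a, b, FUN = sum)   # frequency of the (a, b) pair
  data.frame(cnt_i, cnt_j, cnt_ij,
             ratio_i = cnt_ij / cnt_i,
             ratio_j = cnt_ij / cnt_j)
}

# Made-up values standing in for resource and role_deptname.
resource <- c("r1", "r1", "r2", "r2", "r2")
deptname <- c("d1", "d2", "d1", "d1", "d2")
count_ratio_features(resource, deptname)
```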
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
Ensemble
1. mean prediction of all models
2. two-stage stacking
· based on 5-fold cv holdout predictions
· algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
· algorithms in level-2 (Regularized Generalized Linear Model)
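The two-stage stacking idea can be sketched as follows. To keep the block self-contained it uses simulated data and plain `glm()` fits as stand-ins for the glmnet and gbm models named above, so it illustrates the mechanics of holdout predictions rather than the actual solution; all variable names are made up.

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))   # simulated binary target

# Stand-ins for the level-1 learners: two logistic models on different inputs.
level1 <- list(m1 = y ~ x1, m2 = y ~ x2)

# 5-fold assignment; each row is predicted by models that never saw it.
folds <- sample(rep(1:5, length.out = n))
holdout <- matrix(NA_real_, n, length(level1),
                  dimnames = list(NULL, names(level1)))
for (k in 1:5) {
  in_fold <- folds == k
  for (m in names(level1)) {
    fit <- glm(level1[[m]], data = d[!in_fold, ], family = binomial)
    holdout[in_fold, m] <- predict(fit, newdata = d[in_fold, ],
                                   type = "response")
  }
}

# Level 2: a logistic model on the holdout predictions of level 1.
stack_data <- cbind(d["y"], as.data.frame(holdout))
level2 <- glm(y ~ m1 + m2, data = stack_data, family = binomial)
coef(level2)
```

To score new data, the level-1 models would be refit on the full training set and their predictions passed through `level2`. Using holdout rather than in-sample predictions keeps the level-2 model from rewarding level-1 models that merely memorized the training rows.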
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
· logistic regression
set.seed(114)  # seed first so that x is reproducible as well
x <- sort(rnorm(100))
y <- c(sample(x=c(0,1),size=30,prob=c(0.9,0.1),replace=TRUE),
       sample(x=c(0,1),size=20,prob=c(0.7,0.3),replace=TRUE),
       sample(x=c(0,1),size=20,prob=c(0.3,0.7),replace=TRUE),
       sample(x=c(0,1),size=30,prob=c(0.1,0.9),replace=TRUE))
m1 <- lm(y~x)  # straight-line fit, for comparison
m2 <- glm(y~x,family=binomial(link=logit))  # logistic regression
y2 <- predict(m2,type='response')  # fitted probabilities for the observed x
par(mar=c(5,4,0,0))
plot(y~x);abline(m1,lwd=3,col=2)
points(x,y2,type='l',lwd=3,col=3)
· convex penalties
  - L1 (lasso)
  - L2 (ridge regression)
  - mixture of L1 & L2 (elastic net)
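The three penalties can be written as one function of a mixing weight. The sketch below follows the parameterization used by glmnet, lambda * (alpha * sum|beta| + (1 - alpha) / 2 * sum(beta^2)), where alpha = 1 gives the lasso and alpha = 0 gives ridge; `elastic_net_penalty` is a hypothetical helper written for illustration and is not part of glmnet.

```r
# elastic_net_penalty (hypothetical name): value of the penalty term that
# is added to the negative log-likelihood of the glm.
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))   # lasso part
  l2 <- sum(beta^2)      # ridge part
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(1, -2)
elastic_net_penalty(beta, lambda = 1, alpha = 1)     # 3    (pure L1)
elastic_net_penalty(beta, lambda = 1, alpha = 0)     # 2.5  (pure L2)
elastic_net_penalty(beta, lambda = 1, alpha = 0.5)   # 2.75 (mixture)
```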
· the dummies (in sparse Matrix)
· the dummies including the interaction
· R package: glmnet
2. Support Vector Machine(just for Diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R package: kernlab, e1071
decision tree
3. Random Forest
decision trees + bagging
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
4. Gradient Boosting Machine
decision trees + boosting
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
Summary
some insights
VARIABLE NAME                                           REL.INF
cnt2_resource_role_deptname_cnt_ij                      2.542974017
cnt2_resource_role_rollup_2_ratio_i                     2.107624216
cnt2_resource_role_deptname_ratio_j                     2.017153645
cnt2_resource_role_rollup_2_ratio_j                     1.910465811
cnt2_resource_role_family_ratio_i                       1.770737494
...                                                     ...
cnt4_resource_mgr_id_role_rollup_2_role_family_desc     0.008938286
cnt4_resource_role_rollup_1_role_rollup_2_role_title    0.008930661
cnt4_resource_mgr_id_role_rollup_1_role_family_desc     0.002106958
summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
'cnt2_resource_role_deptname_ratio_j')])
##  cnt2_resource_role_deptname_cnt_ij  cnt2_resource_role_deptname_ratio_j
##  Min.   :  1.0                       Min.   :0.0003
##  1st Qu.:  2.0                       1st Qu.:0.0061
##  Median :  7.0                       Median :0.0172
##  Mean   : 15.6                       Mean   :0.0315
##  3rd Qu.: 17.0                       3rd Qu.:0.0368
##  Max.   :201.0                       Max.   :1.0000
xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
tt <- t.test(xx ~ y)
list(estimate=tt$estimate,
conf.int=tt$conf.int, p.value=tt$p.value)
## $estimate
## mean in group 0 mean in group 1
## 10.04 13.82
##
## $conf.int
## [1] -4.851 -2.710
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 5.838e-12
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
xxx <- cut(xx, include.lowest=T,
breaks=c(0,1,3,7,14,30,300))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
tt <- t.test(xx ~ y)
list(estimate=tt$estimate,
conf.int=tt$conf.int, p.value=tt$p.value)
## $estimate
## mean in group 0 mean in group 1
## 0.01955 0.02902
##
## $conf.int
## [1] -0.011732 -0.007205
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 3.93e-16
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
xxx <- cut(xx, include.lowest=T,
breaks=quantile(xx, seq(0,1,0.2)))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
overfitting
MODEL                             AUC_CV      AUC_PUBLIC   AUC_PRIVATE
num_glmnet_0                      0.8985069   0.87737      0.87385
stacking_gbm_with_the_glmnet      0.9277316   0.90695      0.90478
stacking_gbm_without_the_glmnet   0.9182303   0.91529      0.91130
Winning solution code and methodology
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
useful discussions
Python code to achieve 0.90 AUC with Logistic Regression
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
Starter code in python with scikit-learn (AUC .885)
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
Patterns in Training data set
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
thank you
PerformancePerformance
Performance
 
[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...[Tutorial] building machine learning models for predictive maintenance applic...
[Tutorial] building machine learning models for predictive maintenance applic...
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
ITB Term Paper - 10BM60066
ITB Term Paper - 10BM60066ITB Term Paper - 10BM60066
ITB Term Paper - 10BM60066
 
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISK
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISKMACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISK
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISK
 
Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)Target Leakage in Machine Learning (ODSC East 2020)
Target Leakage in Machine Learning (ODSC East 2020)
 

More from Vivian S. Zhang

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
Vivian S. Zhang
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
Vivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
Vivian S. Zhang
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
Vivian S. Zhang
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
Vivian S. Zhang
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
Vivian S. Zhang
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Vivian S. Zhang
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Vivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
Vivian S. Zhang
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
Vivian S. Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
Vivian S. Zhang
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
Vivian S. Zhang
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Vivian S. Zhang
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Vivian S. Zhang
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Vivian S. Zhang
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Vivian S. Zhang
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Vivian S. Zhang
 

More from Vivian S. Zhang (20)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
 

Recently uploaded

Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 

Recently uploaded (20)

Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 

Kaggle Talk Series: Top 0.2% Kaggler on the Amazon Employee Access Challenge

  • 1. Amazon Employee Access Challenge: Predict an employee's access needs, given his/her job role. Yibo Chen, Data Scientist @ Supstat Inc. Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html 1 of 65 6/13/14, 2:01 PM
  • 2. Agenda
      1. Introduction to the Challenge
      2. Look into the Data
      3. Model Building
      4. Summary
  • 3. Introduction to the Challenge: the story
      http://www.kaggle.com/c/amazon-employee-access-challenge
      It is all about the access we need to do our daily work.
  • 4. Introduction to the Challenge: the mission
      Build an auto-access model based on the historical data, to determine the access privilege according to the employee's job role and the resource he/she applied for.
  • 5. Introduction to the Challenge: the data
      The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time.
      the files
      · train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
      · test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
  • 6. Introduction to the Challenge: the variables
      COLUMN NAME       DESCRIPTION
      ACTION            ACTION is 1 if the resource was approved, 0 if the resource was not
      RESOURCE          An ID for each resource
      MGR_ID            The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
      ROLE_ROLLUP_1     Company role grouping category id 1 (e.g. US Engineering)
      ROLE_ROLLUP_2     Company role grouping category id 2 (e.g. US Retail)
      ROLE_DEPTNAME     Company role department description (e.g. Retail)
      ROLE_TITLE        Company role business title description (e.g. Senior Engineering Retail Manager)
      ROLE_FAMILY_DESC  Company role family extended description (e.g. Retail Manager, Software Engineering)
      ROLE_FAMILY       Company role family description (e.g. Retail Manager)
      ROLE_CODE         Company role code; this code is unique to each role (e.g. Manager)
  • 7. Introduction to the Challenge: the metric
      AUC (area under the ROC curve)
      · is a metric used to judge predictions in binary response (0/1) problems
      · is sensitive only to the order determined by the predictions, not to their magnitudes
      · is available in R through the packages verification or ROCR
  • 8. Introduction to the Challenge: the metric
      (t <- data.frame(true_label=c(0,0,0,0,1,1,1,1),
                       predict_1=c(1,2,3,4,5,6,7,8),
                       predict_2=c(1,2,3,6,5,4,7,8),
                       predict_3=c(1,7,6,4,5,3,2,8)))
      ##   true_label predict_1 predict_2 predict_3
      ## 1          0         1         1         1
      ## 2          0         2         2         7
      ## 3          0         3         3         6
      ## 4          0         4         6         4
      ## 5          1         5         5         5
      ## 6          1         6         4         3
      ## 7          1         7         7         2
      ## 8          1         8         8         8
  • 9. Introduction to the Challenge: the metric
      table(t$predict_2 >= 6, t$true_label)
      ##
      ##         0 1
      ##   FALSE 3 2
      ##   TRUE  1 2
      P:4 N:4 TP:2 FP:1
      TPR=TP/P=0.5 FPR=FP/N=0.25
  • 10. Introduction to the Challenge: the metric
      table(t$predict_2 >= 5, t$true_label)
      ##
      ##         0 1
      ##   FALSE 3 1
      ##   TRUE  1 3
      P:4 N:4 TP:3 FP:1
      TPR=TP/P=0.75 FPR=FP/N=0.25
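The two threshold tables above can be reproduced with a small helper. This is a minimal base-R sketch, not code from the talk: the names `toy`, `roc_point` and `roc_curve` are hypothetical, and only the data values come from slide 8.

```r
# Toy data from the slides: four negatives followed by four positives.
toy <- data.frame(true_label = c(0, 0, 0, 0, 1, 1, 1, 1),
                  predict_2  = c(1, 2, 3, 6, 5, 4, 7, 8))

# One ROC point: classify as positive when score >= threshold.
# (roc_point is a hypothetical helper name, not part of any package.)
roc_point <- function(label, score, threshold) {
  predicted_pos <- score >= threshold
  tp <- sum(predicted_pos & label == 1)
  fp <- sum(predicted_pos & label == 0)
  c(TPR = tp / sum(label == 1), FPR = fp / sum(label == 0))
}

roc_point(toy$true_label, toy$predict_2, 6)  # TPR 0.50, FPR 0.25
roc_point(toy$true_label, toy$predict_2, 5)  # TPR 0.75, FPR 0.25

# Using every observed score as a threshold gives one ROC point per row.
roc_curve <- t(sapply(sort(unique(toy$predict_2), decreasing = TRUE),
                      function(th) roc_point(toy$true_label, toy$predict_2, th)))
```

Each threshold yields one (FPR, TPR) point; the ROC curve is the set of those points, and AUC is the area beneath it.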
  • 11. Introduction to the Challenge: the metric
  • 12. Introduction to the Challenge: the metric
      require(ROCR, quietly = T)
      pred <- prediction(t$predict_1, t$true_label)
      performance(pred, "auc")@y.values[[1]]
      ## [1] 1
      require(verification, quietly = T)
      roc.area(t$true_label, t$predict_1)$A
      ## [1] 1
      pred <- prediction(t$predict_1, t$true_label)
      perf <- performance(pred, "tpr", "fpr")
      plot(perf, col = 2, lwd = 3)
  • 13. Introduction to the Challenge: the metric
      pred <- prediction(t$predict_2, t$true_label)
      performance(pred, "auc")@y.values[[1]]
      ## [1] 0.875
      roc.area(t$true_label, t$predict_2)$A
      ## [1] 0.875
      pred <- prediction(t$predict_2, t$true_label)
      perf <- performance(pred, "tpr", "fpr")
      plot(perf, col = 2, lwd = 3)
  • 14. Introduction to the Challenge: the metric
      pred <- prediction(t$predict_3, t$true_label)
      performance(pred, "auc")@y.values[[1]]
      ## [1] 0.5
      roc.area(t$true_label, t$predict_3)$A
      ## [1] 0.5
      pred <- prediction(t$predict_3, t$true_label)
      perf <- performance(pred, "tpr", "fpr")
      plot(perf, col = 2, lwd = 3)
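Because AUC depends only on the ordering of the predictions, it can also be computed from ranks, with no ROC curve at all. The sketch below uses the standard rank-sum (Mann-Whitney) identity in base R; the function name `auc_rank` is hypothetical and this is offered as a cross-check of the ROCR and verification results above, not as the speaker's code.

```r
# Toy data from slide 8.
toy <- data.frame(true_label = c(0, 0, 0, 0, 1, 1, 1, 1),
                  predict_1  = c(1, 2, 3, 4, 5, 6, 7, 8),
                  predict_2  = c(1, 2, 3, 6, 5, 4, 7, 8),
                  predict_3  = c(1, 7, 6, 4, 5, 3, 2, 8))

# AUC = fraction of (positive, negative) pairs in which the positive
# is scored higher, computed via the rank-sum identity.
auc_rank <- function(label, score) {
  r     <- rank(score)          # average ranks, so ties count as 1/2
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(toy$true_label, toy$predict_1)       # 1
auc_rank(toy$true_label, toy$predict_2)       # 0.875
auc_rank(toy$true_label, toy$predict_3)       # 0.5

# A monotone transform changes magnitudes but not order, so AUC is unchanged.
auc_rank(toy$true_label, exp(toy$predict_2))  # 0.875
```

The last call illustrates the point made on slide 7: rescaling the predictions does not move the score as long as their order is preserved.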
  • 15. Look into the Data: load data from files
  • 16. Look into the Data: the target
      table(y, useNA = "ifany")
      ## y
      ##     0     1  <NA>
      ##  1897 30872 58921
  • 17. Look into the Data: the predictor
  • 18. Look into the Data: treat the features as Categorical or Numerical?
      sapply(x, function(z) {
        length(unique(z))
      })
      ##         resource           mgr_id    role_rollup_1    role_rollup_2
      ##             7518             4913              130              183
      ##    role_deptname       role_title role_family_desc      role_family
      ##              476              361             2951               68
      ##        role_code
      ##              361
  • 19. Look into the Data
      par(mar = c(5, 4, 0, 2))
      plot(x$role_title, x$role_code)
  • 20. Look into the Data
      length(unique(x$role_title))
      ## [1] 361
      length(unique(x$role_code))
      ## [1] 361
      length(unique(paste(x$role_code, x$role_title)))
      ## [1] 361
  • 21. Look into the Data
      x <- x[, names(x) != "role_code"]
      sapply(x, function(z) {
        length(unique(z))
      })
      ##         resource           mgr_id    role_rollup_1    role_rollup_2
      ##             7518             4913              130              183
      ##    role_deptname       role_title role_family_desc      role_family
      ##              476              361             2951               68
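The three counts on slide 20 agree (361 titles, 361 codes, 361 distinct pairs), which is what justifies dropping role_code as redundant. The same check can be wrapped in a reusable helper. This is a small sketch on made-up data; `is_one_to_one` and the `toy` values are hypothetical.

```r
# TRUE when every value of a maps to exactly one value of b and vice versa.
# Equivalent to comparing the three unique counts on slide 20.
is_one_to_one <- function(a, b) {
  n_pairs <- nrow(unique(data.frame(a = a, b = b)))
  n_pairs == length(unique(a)) && n_pairs == length(unique(b))
}

# Made-up example: role_code mirrors role_title, role_family does not.
toy <- data.frame(role_title  = c(10, 10, 20, 30),
                  role_code   = c(7, 7, 8, 9),
                  role_family = c(1, 1, 1, 2))

is_one_to_one(toy$role_title, toy$role_code)    # TRUE: one column is redundant
is_one_to_one(toy$role_title, toy$role_family)  # FALSE: keep both columns
```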
  • 22. Look into the Data: check the distribution - role_family_desc
      hist(train$role_family_desc, breaks = 100)
      hist(test$role_family_desc, breaks = 100)
  • 23. Look into the Data: check the distribution - resource
      hist(train$resource, breaks = 100)
      hist(test$resource, breaks = 100)
  • 24. Look into the Data: check the distribution - mgr_id
      hist(train$mgr_id, breaks = 100)
      hist(test$mgr_id, breaks = 100)
  • 25. Look into the Data: treat the features as Categorical or Numerical?
      YetiMan shared his findings in the forum:
      · 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
      · 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
  • 26. Look into the Data: our approach
      1. treat all features as Categorical
      2. treat all features as Numerical
      3. treat mgr_id as Numerical, the others as Categorical
  • 27. Model Building: workflow
      · Feature Extraction
      · Base Learners
      · Ensemble
  • 28. Model Building: workflow
  • 29. Model Building: Feature Extraction
      1. the raw features (as numerical)
      2. the raw features (as categorical) with level reduction
      3. the dummies (in sparse Matrix)
      4. the dummies including the interaction
      5. some derived variables (count & ratio)
  • 30. Model Building: 1. the raw features (as numerical)
  • 31. Model Building: 2. the raw features (as categorical) with level reduction
      2.1 choose the top frequency categories
      VAR_RAW  FREQUENCY  VAR_WITH_LEVEL_REDUCTION
      a        3          a
      a        3          a
      a        3          a
      b        2          b
      b        2          b
      c        1          other
      d        1          other
      # keep the 2 most frequent levels of each column, recode the rest as "other"
      for (i in 1:ncol(x)) {
        the_labels <- names(sort(table(x[, i]), decreasing = T)[1:2])
        x[!x[, i] %in% the_labels, i] <- "other"
      }
  • 32. Model Building: 2. the raw features (as categorical) with level reduction
      2.2 use Pearson's Chi-squared Test
      table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
      ##
      ##     mgr_770 mgr_not_770
      ##   0       5        1892
      ##   1     147       30725
      chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
      ## [1] 0.2507
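The slides show the chi-squared test for a single level (mgr_id 770) but not how the p-values were turned into a reduced set of levels. One plausible reading, sketched below, is to test each level against all the others and recode the levels that show no significant association with the target as "other". The function `reduce_levels_chisq`, the cutoff `alpha = 0.05` and the toy data are all assumptions made for illustration, not the speaker's implementation.

```r
# Assumed procedure: keep a level only if "this level vs. the rest" is
# significantly associated with y; pool everything else into "other".
reduce_levels_chisq <- function(x, y, alpha = 0.05) {
  x <- as.character(x)
  keep <- vapply(unique(x), function(lv) {
    # 2 x 2 table of target vs. membership in this level
    p <- suppressWarnings(chisq.test(table(y, x == lv))$p.value)
    !is.na(p) && p < alpha
  }, logical(1))
  x[!x %in% names(keep)[keep]] <- "other"
  x
}

# Made-up data: level "a" is always approved, "b" and "c" are 50/50.
x_toy <- rep(c("a", "b", "c"), each = 20)
y_toy <- c(rep(1, 20), rep(c(0, 1), 10), rep(c(0, 1), 10))

table(reduce_levels_chisq(x_toy, y_toy))
```

With a cutoff like this, mgr_id 770 from the slide (p-value 0.2507) would be pooled into "other". For rare levels the chi-squared approximation is unreliable (R warns about small expected counts, silenced here), so a frequency floor or an exact test may be a better fit.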
  • 33. Model Building: 3. the dummies (in sparse Matrix)
      ID  VAR  VAR_A  VAR_B  VAR_C
      1   a    1      0      0
      2   a    1      0      0
      3   a    1      0      0
      4   b    0      1      0
      5   c    0      0      1
  • 34. Model Building: 3. the dummies (in sparse Matrix)
      use package Matrix to create the dummies
      require(Matrix)
      set.seed(114)
      Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
      ## 5 x 8 sparse Matrix of class "dgCMatrix"
      ##
      ## [1,] . . . 1 . . . 1
      ## [2,] . 1 . . . . 1 .
      ## [3,] 1 . . . . . . .
      ## [4,] . . . . . 1 . .
      ## [5,] . . . . . . . .
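The random matrix on slide 34 only demonstrates the sparse storage class. To build the actual dummy table of slide 33, a categorical column can be mapped straight to (row, column) positions with `Matrix::sparseMatrix`. This is a minimal sketch on the slide's five-row example; the variable names are hypothetical.

```r
library(Matrix)

# The VAR column from slide 33.
var <- factor(c("a", "a", "a", "b", "c"))

# Row i gets a single 1 in the column of its level; all other cells
# are implicit zeros and take no memory.
dummies <- sparseMatrix(i = seq_along(var),
                        j = as.integer(var),
                        x = 1,
                        dims = c(length(var), nlevels(var)),
                        dimnames = list(NULL, paste0("VAR_", toupper(levels(var)))))
dummies
```

With thousands of levels per feature (7518 resources, 4913 managers) a dense dummy matrix would be almost entirely zeros, which is why the sparse representation matters here. `Matrix::sparse.model.matrix` offers a formula interface to the same idea.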
  • 35. Model Building
    4. the dummies including the interaction

    ID  M  N  MN_AP  MN_AQ  MN_BP  MN_BQ
    1   a  p  1      0      0      0
    2   a  p  1      0      0      0
    3   a  q  0      1      0      0
    4   b  p  0      0      1      0
    5   b  q  0      0      0      1
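The interaction table above can be reproduced by pasting the two variables together and then creating one dummy per combined level. This is a dense base-R sketch for readability; the variable names are illustrative, and on the real data one would presumably store the result in a sparse Matrix as on the earlier slides.

```r
# Toy variables M and N from the table
m <- c("a", "a", "a", "b", "b")
n <- c("p", "p", "q", "p", "q")

# Combined category: every observed (M, N) pair becomes one level
mn <- factor(paste0(m, n))

# One 0/1 column per combined level
dummies <- sapply(levels(mn), function(l) as.integer(mn == l))
colnames(dummies) <- paste0("MN_", toupper(levels(mn)))
print(cbind(ID = seq_along(m), dummies))
```

Only pairs that actually occur get a column, which keeps the number of interaction dummies far below the full product of the two level counts.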
  • 36. Model Building
    5. some derived variables (count & ratio)
    · the frequency of every category
    · the frequency of the interactions
    · the proportion
  • 37. Model Building
    5. some derived variables (count & ratio)

    tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
    tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
                             'c2_resource_role_deptname_ratio_i',
                             'c2_resource_role_deptname_ratio_j')]
    cbind(tmp1, tmp2)
    ##     c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
    ## 114           1             1645                                1
    ## 115          36             1312                                4
    ## 116          45              465                               24
    ## 117         374             2377                              169
    ##     c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
    ## 114                            1.0000                         0.0006079
    ## 115                            0.1111                         0.0030488
    ## 116                            0.5333                         0.0516129
    ## 117                            0.4519                         0.0710980
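The printed values are consistent with ratio_i = cnt_ij / c1_resource and ratio_j = cnt_ij / c1_role_deptname (for row 115: 4 / 36 = 0.1111 and 4 / 1312 = 0.0030488). The slides do not show how `cnt_1` and `cnt_2` were built, so the following is only a sketch of one way to derive such columns, on made-up toy data with illustrative column names.

```r
# Toy data (assumption): two categorical columns
d <- data.frame(
  resource      = c("r1", "r1", "r1", "r2", "r2", "r3"),
  role_deptname = c("d1", "d1", "d2", "d1", "d2", "d2"),
  stringsAsFactors = FALSE
)
idx <- seq_len(nrow(d))

# Frequency of every category (first-order counts)
d$c1_resource      <- ave(idx, d$resource, FUN = length)
d$c1_role_deptname <- ave(idx, d$role_deptname, FUN = length)

# Frequency of the interaction (second-order count)
pair_key <- paste(d$resource, d$role_deptname, sep = "_")
d$c2_cnt_ij <- ave(idx, pair_key, FUN = length)

# Proportions of the pair count within each of its two parents
d$c2_ratio_i <- d$c2_cnt_ij / d$c1_resource
d$c2_ratio_j <- d$c2_cnt_ij / d$c1_role_deptname
print(d)
```

Read this way, ratio_i is the share of a resource's requests that come from a given department, and ratio_j is the share of a department's requests that target a given resource.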
  • 38. Model Building
    Base Learners
    1. Regularized Generalized Linear Model
    2. Support Vector Machine
    3. Random Forest
    4. Gradient Boosting Machine
  • 39. Model Building
    Ensemble
    1. mean prediction of all models
    2. two-stage stacking
       · based on 5-fold cv holdout predictions
  • 40. Model Building
    Ensemble
    1. mean prediction of all models
    2. two-stage stacking
       · based on 5-fold cv holdout predictions
       · algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
       · algorithms in level-2 (Regularized Generalized Linear Model)
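The two-stage stacking described above can be sketched in base R. To keep the example self-contained, the level-1 learners here are two plain `glm` logistic models on different features, standing in for the glmnet and gbm models named on the slide, and the level-2 model is an unregularized `glm` rather than glmnet. The data are simulated and all object names are illustrative.

```r
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 + 0.5 * d$x2))

# Stand-in level-1 learners: each takes a training and a test frame
# and returns predicted probabilities for the test frame
level1 <- list(
  m_a = function(train, test) {
    fit <- glm(y ~ x1, family = binomial, data = train)
    predict(fit, newdata = test, type = "response")
  },
  m_b = function(train, test) {
    fit <- glm(y ~ x2 + I(x2^2), family = binomial, data = train)
    predict(fit, newdata = test, type = "response")
  }
)

# 5-fold CV: every row gets a prediction from models that never saw it
fold <- sample(rep(1:5, length.out = n))
holdout <- matrix(NA_real_, n, length(level1),
                  dimnames = list(NULL, names(level1)))
for (k in 1:5) {
  test_idx <- fold == k
  for (m in names(level1)) {
    holdout[test_idx, m] <- level1[[m]](d[!test_idx, ], d[test_idx, ])
  }
}

# Level-2 model: learns how to combine the holdout predictions
level2 <- glm(y ~ ., family = binomial,
              data = data.frame(y = d$y, holdout))
print(coef(level2))
```

Training the level-2 model on holdout predictions rather than in-sample fits matters because in-sample predictions from a flexible level-1 learner are over-optimistic, and the combiner would learn to trust that learner too much.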
  • 41. Model Building
    1. Regularized Generalized Linear Model
    · generalized linear model (glm)
    · convex penalties
  • 42. Model Building
    1. Regularized Generalized Linear Model
    · logistic regression

    set.seed(114)   # seed before any random draw so x is reproducible too
    x <- sort(rnorm(100))
    y <- c(sample(x=c(0,1),size=30,prob=c(0.9,0.1),replace=TRUE),
           sample(x=c(0,1),size=20,prob=c(0.7,0.3),replace=TRUE),
           sample(x=c(0,1),size=20,prob=c(0.3,0.7),replace=TRUE),
           sample(x=c(0,1),size=30,prob=c(0.1,0.9),replace=TRUE))
    m1 <- lm(y~x)                               # linear fit, for comparison
    m2 <- glm(y~x,family=binomial(link=logit))  # logistic regression
    y2 <- predict(m2,type='response')           # fitted probabilities at the training x
    par(mar=c(5,4,0,0))
    plot(y~x);abline(m1,lwd=3,col=2)
    points(x,y2,type='l',lwd=3,col=3)
  • 43. Model Building
    1. Regularized Generalized Linear Model
    · logistic regression
    · convex penalties
  • 44. Model Building
    1. Regularized Generalized Linear Model
    · convex penalties
      - L1 (lasso)
      - L2 (ridge regression)
      - mixture of L1 & L2 (elastic net)
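For reference, the three penalties can be written as one objective. The form below is the penalized logistic regression objective as documented for the glmnet package, with \lambda controlling the overall strength and \alpha the mix. The slides do not state which \alpha or \lambda values were used in the competition.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
    \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
            - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda \Bigl[ \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
                      + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

Setting \alpha = 1 gives the lasso (L1), \alpha = 0 gives ridge regression (L2), and values in between give the elastic net.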
  • 45. Model Building
    1. Regularized Generalized Linear Model
    · the dummies (in sparse Matrix)
    · the dummies including the interaction
    · R package: glmnet
  • 46. Model Building
    2. Support Vector Machine (just for Diversity)
  • 47. Model Building
    2. Support Vector Machine (just for Diversity)
  • 48. Model Building
    2. Support Vector Machine (just for Diversity)
  • 49. Model Building
    2. Support Vector Machine (just for Diversity)
    · the dummies including the interaction
    · some derived variables (count & ratio)
    · R package: kernlab, e1071
  • 50. Model Building
    decision tree
  • 51. Model Building
    3. Random Forest
    decision trees + bagging
  • 52. Model Building
    3. Random Forest
    · the raw features (as numerical)
    · the raw features (as categorical) with level reduction
    · some derived variables (count & ratio)
    · R package: randomForest
  • 53. Model Building
    4. Gradient Boosting Machine
    decision trees + boosting
  • 54. Model Building
    4. Gradient Boosting Machine
    · the raw features (as numerical)
    · the raw features (as categorical) with level reduction
    · some derived variables (count & ratio)
    · R package: gbm
  • 55. Summary
    some insights

    VARIABLE NAME                                          REL.INF
    cnt2_resource_role_deptname_cnt_ij                     2.542974017
    cnt2_resource_role_rollup_2_ratio_i                    2.107624216
    cnt2_resource_role_deptname_ratio_j                    2.017153645
    cnt2_resource_role_rollup_2_ratio_j                    1.910465811
    cnt2_resource_role_family_ratio_i                      1.770737494
    ...                                                    ...
    cnt4_resource_mgr_id_role_rollup_2_role_family_desc    0.008938286
    cnt4_resource_role_rollup_1_role_rollup_2_role_title   0.008930661
    cnt4_resource_mgr_id_role_rollup_1_role_family_desc    0.002106958
  • 56. Summary
    some insights

    summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
                  'cnt2_resource_role_deptname_ratio_j')])
    ##  cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
    ##  Min.   :  1.0                      Min.   :0.0003
    ##  1st Qu.:  2.0                      1st Qu.:0.0061
    ##  Median :  7.0                      Median :0.0172
    ##  Mean   : 15.6                      Mean   :0.0315
    ##  3rd Qu.: 17.0                      3rd Qu.:0.0368
    ##  Max.   :201.0                      Max.   :1.0000
  • 57. Summary
    some insights

    xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
    tt <- t.test(xx ~ y)
    list(estimate=tt$estimate, conf.int=tt$conf.int, p.value=tt$p.value)
    ## $estimate
    ## mean in group 0 mean in group 1
    ##           10.04           13.82
    ##
    ## $conf.int
    ## [1] -4.851 -2.710
    ## attr(,"conf.level")
    ## [1] 0.95
    ##
    ## $p.value
    ## [1] 5.838e-12
    par(mar=c(5,4,2,2))
    boxplot(xx ~ y)
  • 58. Summary
    some insights

    xxx <- cut(xx, include.lowest=T, breaks=c(0,1,3,7,14,30,300))
    par(mar=c(5,2,0,0))
    barplot(table(xxx))
    tb <- table(y, xxx)
    r_0 <- tb[1, ] / colSums(tb)
    par(mar=c(5,2,0,0))
    plot(r_0, type='l', lwd=3)
  • 59. Summary
    some insights

    xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
    tt <- t.test(xx ~ y)
    list(estimate=tt$estimate, conf.int=tt$conf.int, p.value=tt$p.value)
    ## $estimate
    ## mean in group 0 mean in group 1
    ##         0.01955         0.02902
    ##
    ## $conf.int
    ## [1] -0.011732 -0.007205
    ## attr(,"conf.level")
    ## [1] 0.95
    ##
    ## $p.value
    ## [1] 3.93e-16
    par(mar=c(5,4,2,2))
    boxplot(xx ~ y)
  • 60. Summary
    some insights

    xxx <- cut(xx, include.lowest=T, breaks=quantile(xx, seq(0,1,0.2)))
    par(mar=c(5,2,0,0))
    barplot(table(xxx))
    tb <- table(y, xxx)
    r_0 <- tb[1, ] / colSums(tb)
    par(mar=c(5,2,0,0))
    plot(r_0, type='l', lwd=3)
  • 61. Summary
    overfitting

    MODEL                         AUC_CV     AUC_PUBLIC  AUC_PRIVATE
    num_glmnet_0                  0.8985069  0.87737     0.87385
    stacking_gbm_with_the_glmnet  0.9277316  0.90695     0.90478
  • 62. Summary
    overfitting

    MODEL                            AUC_CV     AUC_PUBLIC  AUC_PRIVATE
    num_glmnet_0                     0.8985069  0.87737     0.87385
    stacking_gbm_with_the_glmnet     0.9277316  0.90695     0.90478
    stacking_gbm_without_the_glmnet  0.9182303  0.91529     0.91130
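The table compares AUC on cross-validation, the public leaderboard, and the private leaderboard. As a reminder of what is being measured, here is a minimal base-R sketch that computes AUC from ranks (the Mann-Whitney form). The function name `auc` and the toy scores are illustrative. The speaker's actual evaluation code is not shown in the slides.

```r
# AUC as the probability that a random positive is scored above a random negative
auc <- function(score, label) {
  r  <- rank(score)            # average ranks, so ties count as half
  n1 <- sum(label == 1)        # number of positives
  n0 <- sum(label == 0)        # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy check: 3 of the 4 (positive, negative) pairs are ordered correctly
toy_auc <- auc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
print(toy_auc)  # 0.75
```

Computing this on the 5-fold holdout predictions gives an AUC_CV figure that can be set against the leaderboard values, which is how the gap in the table becomes visible.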
  • 63. Summary
    overfitting
    Winning solution code and methodology
    http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
  • 64. Summary
    useful discussions
    · Python code to achieve 0.90 AUC with Logistic Regression
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
    · Starter code in python with scikit-learn (AUC .885)
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
    · Patterns in Training data set
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
  • 65. thank you