
Kaggle Talk Series: Top 0.2% Kaggler on the Amazon Employee Access Challenge

NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, Machine Learning, Kaggle, Amazon Employee Access Challenge

Published in: Engineering, Technology, Business

  1. Amazon Employee Access Challenge
     Predict an employee's access needs, given his/her job role
     Yibo Chen, Data Scientist @ Supstat Inc
     Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html 1 of 65 6/13/14, 2:01 PM
  2. Agenda
     1. Introduction to the Challenge
     2. Look into the Data
     3. Model Building
     4. Summary
  3. Introduction to the Challenge: the story
     http://www.kaggle.com/c/amazon-employee-access-challenge
     It is all about the access we need to do our daily work.
  4. Introduction to the Challenge: the mission
     Build an auto-access model from the historical data that determines the access privilege according to the employee's job role and the resource he or she applied for.
  5. Introduction to the Challenge: the data
     The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time.
     the files
     · train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval
     · test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
  6. Introduction to the Challenge: the variables
     COLUMN NAME        DESCRIPTION
     ACTION             ACTION is 1 if the resource was approved, 0 if the resource was not
     RESOURCE           An ID for each resource
     MGR_ID             The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
     ROLE_ROLLUP_1      Company role grouping category id 1 (e.g. US Engineering)
     ROLE_ROLLUP_2      Company role grouping category id 2 (e.g. US Retail)
     ROLE_DEPTNAME      Company role department description (e.g. Retail)
     ROLE_TITLE         Company role business title description (e.g. Senior Engineering Retail Manager)
     ROLE_FAMILY_DESC   Company role family extended description (e.g. Retail Manager, Software Engineering)
     ROLE_FAMILY        Company role family description (e.g. Retail Manager)
     ROLE_CODE          Company role code; this code is unique to each role (e.g. Manager)
  7. Introduction to the Challenge: the metric
     AUC (area under the ROC curve)
     · is a metric used to judge predictions in binary response (0/1) problems
     · is sensitive only to the order determined by the predictions, not to their magnitudes
     · is available in R through the packages verification and ROCR
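The statement that AUC depends only on the ordering of the predictions can be checked in a few lines of base R. The sketch below is an illustration rather than code from the talk: `auc_rank` is a hypothetical helper that computes AUC through the rank-sum (Mann-Whitney) identity, and the labels and scores are made up.

```r
# Hypothetical helper (not from the talk): AUC via the rank-sum identity.
auc_rank <- function(label, score) {
  r  <- rank(score)              # average ranks, so ties are handled
  n1 <- sum(label == 1)          # number of positives
  n0 <- sum(label == 0)          # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

label <- c(0, 0, 1, 0, 1, 1)                     # made-up labels
score <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.90)   # made-up predictions

auc_raw <- auc_rank(label, score)
auc_exp <- auc_rank(label, exp(10 * score))      # strictly increasing transform

# Same ordering, so the two values agree even though the magnitudes differ.
print(c(raw = auc_raw, transformed = auc_exp))
```

Because only ranks enter the formula, any strictly increasing rescaling of the predictions leaves the score unchanged.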
  8. Introduction to the Challenge: the metric

         (t <- data.frame(true_label = c(0,0,0,0,1,1,1,1),
                          predict_1 = c(1,2,3,4,5,6,7,8),
                          predict_2 = c(1,2,3,6,5,4,7,8),
                          predict_3 = c(1,7,6,4,5,3,2,8)))
         ##   true_label predict_1 predict_2 predict_3
         ## 1          0         1         1         1
         ## 2          0         2         2         7
         ## 3          0         3         3         6
         ## 4          0         4         6         4
         ## 5          1         5         5         5
         ## 6          1         6         4         3
         ## 7          1         7         7         2
         ## 8          1         8         8         8

  9. Introduction to the Challenge: the metric
     P: 4, N: 4, TP: 2, FP: 1
     TPR = TP/P = 0.5
     FPR = FP/N = 0.25

         table(t$predict_2 >= 6, t$true_label)
         ##
         ##         0 1
         ##   FALSE 3 2
         ##   TRUE  1 2

  10. Introduction to the Challenge: the metric
      P: 4, N: 4, TP: 3, FP: 1
      TPR = TP/P = 0.75
      FPR = FP/N = 0.25

          table(t$predict_2 >= 5, t$true_label)
          ##
          ##         0 1
          ##   FALSE 3 1
          ##   TRUE  1 3

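The two thresholds above are single points on the ROC curve. A minimal base-R sketch of the full sweep follows; it reuses the `true_label` and `predict_2` values from the toy data frame, while `roc_points` and the trapezoid step are hypothetical helpers written for this illustration, not the ROCR implementation.

```r
true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)

# Hypothetical helper: one (fpr, tpr) point per distinct threshold.
roc_points <- function(label, score) {
  thr <- sort(unique(score), decreasing = TRUE)
  data.frame(
    threshold = thr,
    tpr = sapply(thr, function(s) sum(score >= s & label == 1) / sum(label == 1)),
    fpr = sapply(thr, function(s) sum(score >= s & label == 0) / sum(label == 0))
  )
}

roc <- roc_points(true_label, predict_2)
print(roc)

# Trapezoidal area under the curve, starting from the origin (0, 0).
fx <- c(0, roc$fpr)
fy <- c(0, roc$tpr)
auc_trapezoid <- sum(diff(fx) * (head(fy, -1) + tail(fy, -1)) / 2)
print(auc_trapezoid)
```

The rows for thresholds 6 and 5 reproduce the TPR and FPR values worked out by hand on the two slides.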
  11. Introduction to the Challenge: the metric
  12. Introduction to the Challenge: the metric

          require(ROCR, quietly = TRUE)
          pred <- prediction(t$predict_1, t$true_label)
          performance(pred, "auc")@y.values[[1]]
          ## [1] 1
          require(verification, quietly = TRUE)
          roc.area(t$true_label, t$predict_1)$A
          ## [1] 1
          pred <- prediction(t$predict_1, t$true_label)
          perf <- performance(pred, "tpr", "fpr")
          plot(perf, col = 2, lwd = 3)

  13. Introduction to the Challenge: the metric

          pred <- prediction(t$predict_2, t$true_label)
          performance(pred, "auc")@y.values[[1]]
          ## [1] 0.875
          roc.area(t$true_label, t$predict_2)$A
          ## [1] 0.875
          pred <- prediction(t$predict_2, t$true_label)
          perf <- performance(pred, "tpr", "fpr")
          plot(perf, col = 2, lwd = 3)

  14. Introduction to the Challenge: the metric

          pred <- prediction(t$predict_3, t$true_label)
          performance(pred, "auc")@y.values[[1]]
          ## [1] 0.5
          roc.area(t$true_label, t$predict_3)$A
          ## [1] 0.5
          pred <- prediction(t$predict_3, t$true_label)
          perf <- performance(pred, "tpr", "fpr")
          plot(perf, col = 2, lwd = 3)

  15. Look into the Data: load data from files
  16. Look into the Data: the target

          table(y, useNA = "ifany")
          ## y
          ##     0     1  <NA>
          ##  1897 30872 58921

  17. Look into the Data: the predictor
  18. Look into the Data: treat the features as Categorical or Numerical?

          sapply(x, function(z) {
            length(unique(z))
          })
          ##         resource           mgr_id    role_rollup_1    role_rollup_2
          ##             7518             4913              130              183
          ##    role_deptname       role_title role_family_desc      role_family
          ##              476              361             2951               68
          ##        role_code
          ##              361

  19. Look into the Data

          par(mar = c(5, 4, 0, 2))
          plot(x$role_title, x$role_code)

  20. Look into the Data

          length(unique(x$role_title))
          ## [1] 361
          length(unique(x$role_code))
          ## [1] 361
          length(unique(paste(x$role_code, x$role_title)))
          ## [1] 361

  21. Look into the Data

          x <- x[, names(x) != "role_code"]
          sapply(x, function(z) {
            length(unique(z))
          })
          ##         resource           mgr_id    role_rollup_1    role_rollup_2
          ##             7518             4913              130              183
          ##    role_deptname       role_title role_family_desc      role_family
          ##              476              361             2951               68

  22. Look into the Data: check the distribution - role_family_desc

          hist(train$role_family_desc, breaks = 100)
          hist(test$role_family_desc, breaks = 100)

  23. Look into the Data: check the distribution - resource

          hist(train$resource, breaks = 100)
          hist(test$resource, breaks = 100)

  24. Look into the Data: check the distribution - mgr_id

          hist(train$mgr_id, breaks = 100)
          hist(test$mgr_id, breaks = 100)

  25. Look into the Data: treat the features as Categorical or Numerical?
      YetiMan shared his findings in the forum:
      · 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
      · 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
  26. Look into the Data: our approach
      1. treat all features as Categorical
      2. treat all features as Numerical
      3. treat mgr_id as Numerical, the others as Categorical
  27. Model Building: workflow
      · Feature Extraction
      · Base Learners
      · Ensemble
  28. Model Building: workflow
  29. Model Building: Feature Extraction
      1. the raw features (as numerical)
      2. the raw features (as categorical) with level reduction
      3. the dummies (in sparse Matrix)
      4. the dummies including the interaction
      5. some derived variables (count & ratio)
  30. Model Building: 1. the raw features (as numerical)
  31. Model Building: 2. the raw features (as categorical) with level reduction
      2.1 choose the top frequency categories
      VAR_RAW   FREQUENCY   VAR_WITH_LEVEL_REDUCTION
      a         3           a
      a         3           a
      a         3           a
      b         2           b
      b         2           b
      c         1           other
      d         1           other

          for (i in 1:ncol(x)) {
            the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
            x[!x[, i] %in% the_labels, i] <- "other"
          }

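The level-reduction step can be tried on the seven-row example from the table. This is a minimal sketch: `reduce_levels` and its `keep` argument are hypothetical names introduced here, and the conversion to character is an added safeguard, since assigning a new label to a factor column would produce `NA`.

```r
# Hypothetical helper: keep the `keep` most frequent levels, lump the rest.
reduce_levels <- function(v, keep = 2) {
  v   <- as.character(v)   # a factor would turn the new label into NA
  top <- names(sort(table(v), decreasing = TRUE))[seq_len(keep)]
  v[!v %in% top] <- "other"
  v
}

var_raw <- c("a", "a", "a", "b", "b", "c", "d")   # VAR_RAW column from the slide
reduced <- reduce_levels(var_raw, keep = 2)
print(data.frame(var_raw, reduced))
```

With `keep = 2` the result matches the VAR_WITH_LEVEL_REDUCTION column: `a` and `b` survive, `c` and `d` become `other`.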
  32. Model Building: 2. the raw features (as categorical) with level reduction
      2.2 use Pearson's Chi-squared Test

          table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
          ##
          ##     mgr_770 mgr_not_770
          ##   0       5        1892
          ##   1     147       30725
          chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
          ## [1] 0.2507

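The test can be re-run from the four counts alone, without the competition data. The slide does not say how the p-value feeds into level reduction; the sketch below assumes, as one plausible reading, that a level with no significant association with the target is a candidate for merging into "other". The 0.05 cut-off and the name `keep_level` are assumptions for illustration.

```r
# Contingency table rebuilt from the counts printed on the slide.
tab <- matrix(c(5, 147, 1892, 30725), nrow = 2,
              dimnames = list(y   = c("0", "1"),
                              mgr = c("mgr_770", "mgr_not_770")))

res     <- chisq.test(tab)   # for a 2x2 table, Yates' correction is on by default
p_value <- res$p.value
print(round(p_value, 4))

# Assumed rule (not stated in the talk): keep the level only if it is significant.
keep_level <- p_value < 0.05
print(keep_level)
```

Under that assumed rule, `mgr_id == 770` would not be kept as its own category.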
  33. Model Building: 3. the dummies (in sparse Matrix)
      ID   VAR   VAR_A   VAR_B   VAR_C
      1    a     1       0       0
      2    a     1       0       0
      3    a     1       0       0
      4    b     0       1       0
      5    c     0       0       1
  34. Model Building: 3. the dummies (in sparse Matrix)
      use package Matrix to create the dummies

          require(Matrix)
          set.seed(114)
          Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
          ## 5 x 8 sparse Matrix of class "dgCMatrix"
          ##
          ## [1,] . . . 1 . . . 1
          ## [2,] . 1 . . . . 1 .
          ## [3,] 1 . . . . . . .
          ## [4,] . . . . 1 . . .
          ## [5,] . . . . . . . .

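The slide demonstrates the sparse storage format on random data. To build the actual dummy columns from a categorical variable, one option in the same package is `sparse.model.matrix`; whether the talk used this function is not stated, so treat the sketch below as an assumption. The data frame `d` reproduces the five-row VAR example.

```r
library(Matrix)

# Five-row example from the dummy-variable table.
d <- data.frame(ID = 1:5, VAR = factor(c("a", "a", "a", "b", "c")))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
dummies <- sparse.model.matrix(~ VAR - 1, data = d)
print(dummies)
```

Only the non-zero entries are stored, which is what keeps one-hot encoding feasible for features with thousands of levels such as resource or mgr_id.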
  35. Model Building: 4. the dummies including the interaction
      ID   M   N   MN_AP   MN_AQ   MN_BP   MN_BQ
      1    a   p   1       0       0       0
      2    a   p   1       0       0       0
      3    a   q   0       1       0       0
      4    b   p   0       0       1       0
      5    b   q   0       0       0       1
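One way to obtain the interaction dummies in the table is to paste the two variables into a combined factor and one-hot encode that. This is a base-R sketch using dense `model.matrix` for readability; the column renaming is only there to match the slide's headers, and the approach is an assumption about how such columns can be built, not the talk's code.

```r
d <- data.frame(M = c("a", "a", "a", "b", "b"),
                N = c("p", "p", "q", "p", "q"))

# Combined level per row: "ap", "aq", "bp", "bq".
d$MN <- factor(paste0(d$M, d$N))

mn_dummies <- model.matrix(~ MN - 1, data = d)
colnames(mn_dummies) <- paste0("MN_", toupper(levels(d$MN)))
print(mn_dummies)
```

Each row has exactly one 1, in the column of its (M, N) pair.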
  36. Model Building: 5. some derived variables (count & ratio)
      · the frequency of every category
      · the frequency of the interactions
      · the proportion
  37. Model Building: 5. some derived variables (count & ratio)

          tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
          tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
                                   'c2_resource_role_deptname_ratio_i',
                                   'c2_resource_role_deptname_ratio_j')]
          cbind(tmp1, tmp2)
          ##     c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
          ## 114           1             1645                                1
          ## 115          36             1312                                4
          ## 116          45              465                               24
          ## 117         374             2377                              169
          ##     c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
          ## 114                            1.0000                         0.0006079
          ## 115                            0.1111                         0.0030488
          ## 116                            0.5333                         0.0516129
          ## 117                            0.4519                         0.0710980

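The printed rows are consistent with ratio_i = cnt_ij / count of resource and ratio_j = cnt_ij / count of role_deptname (for row 115: 4/36 = 0.1111 and 4/1312 = 0.0030488). A minimal base-R sketch of that construction follows; the toy data and the helper `count_of` are made up for illustration and are not the talk's code.

```r
# Made-up toy data with the two columns used in the slide's example.
d <- data.frame(resource      = c("r1", "r1", "r1", "r2", "r2"),
                role_deptname = c("d1", "d1", "d2", "d1", "d2"),
                stringsAsFactors = FALSE)

# Hypothetical helper: for each row, how often its key occurs in the column.
count_of <- function(key) as.integer(table(key)[key])

cnt_i  <- count_of(d$resource)                          # frequency of the resource
cnt_j  <- count_of(d$role_deptname)                     # frequency of the department
cnt_ij <- count_of(paste(d$resource, d$role_deptname))  # frequency of the pair

derived <- data.frame(cnt_i, cnt_j, cnt_ij,
                      ratio_i = cnt_ij / cnt_i,
                      ratio_j = cnt_ij / cnt_j)
print(derived)
```

These features turn high-cardinality IDs into numeric columns that tree-based learners can split on directly.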
  38. Model Building: Base Learners
      1. Regularized Generalized Linear Model
      2. Support Vector Machine
      3. Random Forest
      4. Gradient Boosting Machine
  39. Model Building: Ensemble
      1. mean prediction of all models
      2. two-stage stacking
         · based on 5-fold cv holdout predictions
  40. Model Building: Ensemble
      1. mean prediction of all models
      2. two-stage stacking
         · based on 5-fold cv holdout predictions
         · algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
         · algorithms in level-2 (Regularized Generalized Linear Model)
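The mechanics of two-stage stacking on holdout predictions can be sketched with base R alone. Everything below is an illustration under stated assumptions: the data are simulated, plain `glm` fits stand in for the glmnet and gbm learners named on the slide, and the two level-1 models simply use different predictors.

```r
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - 0.5 * d$x2))   # simulated binary target

fold <- sample(rep(1:5, length.out = n))          # 5-fold assignment
oof  <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("m1", "m2")))

# Level 1: every row is predicted by models that never saw it during fitting.
for (k in 1:5) {
  tr <- d[fold != k, ]
  ho <- d[fold == k, ]
  m1 <- glm(y ~ x1, family = binomial, data = tr)   # stand-in learner 1
  m2 <- glm(y ~ x2, family = binomial, data = tr)   # stand-in learner 2
  oof[fold == k, "m1"] <- predict(m1, newdata = ho, type = "response")
  oof[fold == k, "m2"] <- predict(m2, newdata = ho, type = "response")
}

# Level 2: a model trained on the holdout predictions of level 1.
level2 <- glm(y ~ m1 + m2, family = binomial, data = data.frame(oof, y = d$y))
print(coef(level2))
```

Training the second stage on out-of-fold predictions, rather than on in-sample fits, keeps it from rewarding level-1 models that merely memorized the training rows.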
  41. Model Building: 1. Regularized Generalized Linear Model
      · generalized linear model (glm)
      · convex penalties
  42. Model Building: 1. Regularized Generalized Linear Model
      · logistic regression

          x <- sort(rnorm(100))
          set.seed(114)
          y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
                 sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
                 sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
                 sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
          m1 <- lm(y ~ x)                                    # straight-line fit, for comparison
          m2 <- glm(y ~ x, family = binomial(link = logit))  # logistic regression
          y2 <- predict(m2, type = 'response')               # fitted probabilities
          par(mar = c(5, 4, 0, 0))
          plot(y ~ x); abline(m1, lwd = 3, col = 2)
          points(x, y2, type = 'l', lwd = 3, col = 3)

  43. Model Building: 1. Regularized Generalized Linear Model
      · logistic regression
      · convex penalties
  44. Model Building: 1. Regularized Generalized Linear Model
      · convex penalties
        - L1 (lasso)
        - L2 (ridge regression)
        - mixture of L1 & L2 (elastic net)
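The three penalties are special cases of one formula. The sketch below uses the elastic-net parameterization documented for the glmnet package, lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|)); the function name `elastic_net_penalty` and the coefficient vector are made up for illustration.

```r
# Hypothetical helper: elastic-net penalty in glmnet's parameterization.
#   alpha = 1 -> L1 (lasso), alpha = 0 -> L2 (ridge), 0 < alpha < 1 -> mixture.
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(1, -2)   # made-up coefficients

pen_lasso <- elastic_net_penalty(beta, lambda = 1, alpha = 1)
pen_ridge <- elastic_net_penalty(beta, lambda = 1, alpha = 0)
pen_mix   <- elastic_net_penalty(beta, lambda = 1, alpha = 0.5)
print(c(lasso = pen_lasso, ridge = pen_ridge, elastic_net = pen_mix))
```

The penalty is added to the negative log-likelihood, so larger `lambda` shrinks coefficients harder, and the L1 part can set some of them exactly to zero.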
  45. Model Building: 1. Regularized Generalized Linear Model
      · the dummies (in sparse Matrix)
      · the dummies including the interaction
      · R package: glmnet
  46. Model Building: 2. Support Vector Machine (just for Diversity)
  47. Model Building: 2. Support Vector Machine (just for Diversity)
  48. Model Building: 2. Support Vector Machine (just for Diversity)
  49. Model Building: 2. Support Vector Machine (just for Diversity)
      · the dummies including the interaction
      · some derived variables (count & ratio)
      · R package: kernlab, e1071
  50. Model Building: decision tree
  51. Model Building: 3. Random Forest
      decision trees + bagging
  52. Model Building: 3. Random Forest
      · the raw features (as numerical)
      · the raw features (as categorical) with level reduction
      · some derived variables (count & ratio)
      · R package: randomForest
  53. Model Building: 4. Gradient Boosting Machine
      decision trees + boosting
  54. Model Building: 4. Gradient Boosting Machine
      · the raw features (as numerical)
      · the raw features (as categorical) with level reduction
      · some derived variables (count & ratio)
      · R package: gbm
  55. Summary: some insights
      VARIABLE NAME                                           REL.INF
      cnt2_resource_role_deptname_cnt_ij                      2.542974017
      cnt2_resource_role_rollup_2_ratio_i                     2.107624216
      cnt2_resource_role_deptname_ratio_j                     2.017153645
      cnt2_resource_role_rollup_2_ratio_j                     1.910465811
      cnt2_resource_role_family_ratio_i                       1.770737494
      ...                                                     ...
      cnt4_resource_mgr_id_role_rollup_2_role_family_desc     0.008938286
      cnt4_resource_role_rollup_1_role_rollup_2_role_title    0.008930661
      cnt4_resource_mgr_id_role_rollup_1_role_family_desc     0.002106958
  56. Summary: some insights

          summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
                        'cnt2_resource_role_deptname_ratio_j')])
          ##  cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
          ##  Min.   :  1.0                      Min.   :0.0003
          ##  1st Qu.:  2.0                      1st Qu.:0.0061
          ##  Median :  7.0                      Median :0.0172
          ##  Mean   : 15.6                      Mean   :0.0315
          ##  3rd Qu.: 17.0                      3rd Qu.:0.0368
          ##  Max.   :201.0                      Max.   :1.0000

  57. Summary: some insights

          xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
          tt <- t.test(xx ~ y)
          list(estimate = tt$estimate, conf.int = tt$conf.int, p.value = tt$p.value)
          ## $estimate
          ## mean in group 0 mean in group 1
          ##           10.04           13.82
          ##
          ## $conf.int
          ## [1] -4.851 -2.710
          ## attr(,"conf.level")
          ## [1] 0.95
          ##
          ## $p.value
          ## [1] 5.838e-12
          par(mar = c(5, 4, 2, 2))
          boxplot(xx ~ y)

  58. Summary: some insights

          xxx <- cut(xx, include.lowest = TRUE,
                     breaks = c(0, 1, 3, 7, 14, 30, 300))
          par(mar = c(5, 2, 0, 0))
          barplot(table(xxx))
          tb <- table(y, xxx)
          r_0 <- tb[1, ] / colSums(tb)
          par(mar = c(5, 2, 0, 0))
          plot(r_0, type = 'l', lwd = 3)

  59. Summary: some insights

          xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
          tt <- t.test(xx ~ y)
          list(estimate = tt$estimate, conf.int = tt$conf.int, p.value = tt$p.value)
          ## $estimate
          ## mean in group 0 mean in group 1
          ##         0.01955         0.02902
          ##
          ## $conf.int
          ## [1] -0.011732 -0.007205
          ## attr(,"conf.level")
          ## [1] 0.95
          ##
          ## $p.value
          ## [1] 3.93e-16
          par(mar = c(5, 4, 2, 2))
          boxplot(xx ~ y)

  60. Summary: some insights

          xxx <- cut(xx, include.lowest = TRUE,
                     breaks = quantile(xx, seq(0, 1, 0.2)))
          par(mar = c(5, 2, 0, 0))
          barplot(table(xxx))
          tb <- table(y, xxx)
          r_0 <- tb[1, ] / colSums(tb)
          par(mar = c(5, 2, 0, 0))
          plot(r_0, type = 'l', lwd = 3)

  61. Summary: overfitting
      MODEL                             AUC_CV      AUC_PUBLIC   AUC_PRIVATE
      num_glmnet_0                      0.8985069   0.87737      0.87385
      stacking_gbm_with_the_glmnet      0.9277316   0.90695      0.90478
  62. Summary: overfitting
      MODEL                             AUC_CV      AUC_PUBLIC   AUC_PRIVATE
      num_glmnet_0                      0.8985069   0.87737      0.87385
      stacking_gbm_with_the_glmnet      0.9277316   0.90695      0.90478
      stacking_gbm_without_the_glmnet   0.9182303   0.91529      0.91130
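The size of the overfit can be read off the table as the gap between the cross-validated AUC and the private leaderboard AUC. The short sketch below only re-enters the numbers from the slide; the column name `cv_minus_private` is introduced here for illustration.

```r
res <- data.frame(
  model       = c("num_glmnet_0",
                  "stacking_gbm_with_the_glmnet",
                  "stacking_gbm_without_the_glmnet"),
  auc_cv      = c(0.8985069, 0.9277316, 0.9182303),
  auc_public  = c(0.87737, 0.90695, 0.91529),
  auc_private = c(0.87385, 0.90478, 0.91130)
)

# Optimism of the CV estimate relative to the private leaderboard.
res$cv_minus_private <- res$auc_cv - res$auc_private
print(res[order(res$cv_minus_private), ])
```

The model with the highest CV score is not the one with the highest private score, and the stacking variant without the glmnet has by far the smallest gap.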
  63. Summary: overfitting
      Winning solution code and methodology
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
  64. Summary: useful discussions
      Python code to achieve 0.90 AUC with Logistic Regression
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
      Starter code in python with scikit-learn (AUC .885)
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
      Patterns in Training data set
      http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
  65. thank you
