A Report to PAKDD 2006 Data Mining Competition
Nan Hu
Shawn Jin
Yingjiu Li
Mar 01st, 2006
DATE: Feb 28th, 2006
FROM: Team Jin/Hu
TO: PAKDD Committee
RE: REPORT FOR PAKDD MODELING PROJECT
PROJECT REPORTS AND SPECIFICATIONS
PROJECT: PAKDD 2006 Data Mining Competition
OBJECTIVE: The project objective is to accurately predict which customers are likely
to switch to using the 3G network for an Asian telco operator which has successfully
launched a third generation (3G) mobile telecommunications network.
The data mining task is a classification problem for which the objective is to
accurately predict as many current 3G customers as possible (i.e. true positives)
from the “holdout” sample provided.
BACKGROUND: An original sample dataset of 20,000 2G network customers and 4,000 3G
network customers has been provided with more than 200 data fields. The target
categorical variable is “Customer_Type” (2G/3G). A 3G customer is defined as a
customer who has a 3G Subscriber Identity Module (SIM) card and is currently
using a 3G network compatible mobile phone.
Three-quarters of the dataset (15K 2G, 3K 3G) will have the target field available
and is meant to be used for training/testing (i.e., training data). The remaining
portion (5K 2G, 1K 3G) will be made available with the target field missing and
is meant to be used for prediction (i.e., testing data).
MEASURES OF PERFORMANCE: True positives from the “holdout” sample.
METHODOLOGY: Logistic regression is applied.
The idea is to build a predictive model that estimates the probability that a
customer will switch to the 3G network. For any given customer, a probability of
being a “3G user” is calculated. The whole training data is rank-ordered by this
probability score. A classification threshold is chosen such that the “3G”
customers predicted by our model best fit the actual “3G users” labeled in the
training data. Our predictive result is obtained by applying the predictive model
and the classification threshold to the testing data.
The detailed steps for this exercise are as follows:
a) Split the training data into a training set and a validating set
Randomly split the training data, 70% into the training set and 30% into the
validating set. After the split, we have the following data structure.
Training set:
3G number percent
no 10500 83.33
yes 2100 16.67
Validating set:
3G number percent
no 4500 83.33
yes 900 16.67
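The identical class percentages in the two tables above are consistent with a split made separately within each class. The following is a minimal Python sketch of such a 70/30 split, not the authors' actual procedure: the function name `stratified_split`, the seed, and the synthetic label list are illustrative assumptions.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, valid_idx), splitting each class separately.

    Hypothetical helper: the report does not say how the split was implemented.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # random order within the class
        cut = round(len(idx) * train_frac)    # 70% of this class goes to training
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return train_idx, valid_idx

# Synthetic labels with the class counts stated in the report (15K 2G, 3K 3G).
labels = ["2G"] * 15000 + ["3G"] * 3000
train_idx, valid_idx = stratified_split(labels)
```

With these class counts the sketch yields 10500/2100 records in the training set and 4500/900 in the validating set, matching the tables.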
b) Variable reduction
The following methods are used in variable reduction:
1. Principal component analysis
2. Cluster analysis
3. Regression loosely fit
4. Logistic regression loosely fit
5. Information value
Based on the outcome of these variable reduction methods, 173
candidate variables are chosen for further analyses.
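The report does not define how information value was computed. As a hedged illustration, the sketch below uses the common weight-of-evidence form, IV = sum over bins of (p_3G - p_2G) * ln(p_3G / p_2G), where p_3G and p_2G are the shares of each class falling in a bin. The function name, the binning, the 0.5 smoothing for empty cells, and the example counts are all assumptions.

```python
import math

def information_value(bins):
    """Information value of one binned candidate variable.

    bins: list of (n_3g, n_2g) counts per bin. Hypothetical helper using the
    standard weight-of-evidence definition, not necessarily the authors' code.
    """
    total_3g = sum(n_3g for n_3g, _ in bins)
    total_2g = sum(n_2g for _, n_2g in bins)
    iv = 0.0
    for n_3g, n_2g in bins:
        # Replace empty cells with 0.5 to avoid log(0); a sketch-level assumption.
        p_3g = max(n_3g, 0.5) / total_3g
        p_2g = max(n_2g, 0.5) / total_2g
        iv += (p_3g - p_2g) * math.log(p_3g / p_2g)
    return iv

# Made-up counts: a variable whose bins separate the classes vs. one that does not.
iv_informative = information_value([(150, 250), (50, 750)])
iv_useless = information_value([(100, 500), (100, 500)])
```

A variable whose bins have the same class mix scores zero, so ranking candidates by this value is one way to discard uninformative fields.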
c) Uni-variate and bi-variate analyses
Frequency analysis is performed on every candidate variable to see its
distribution. Then, bi-variate analysis is performed to examine the
relationship between the candidate variables and the decision variable
(i.e., customer type). Finally, a set of predictor variables is chosen
with appropriate transformations.
In particular, the following steps are carried out for each candidate
variable:
1. Imputation of this variable is decided based on the relationship of
this candidate variable and the decision variable.
2. Floor/capping of this variable is decided based on the imputation
analysis.
3. It is decided whether this candidate variable is a predictor variable.
4. Appropriate transformation is decided based on the relationship of
d) Model selection
Various log-linear models are built based on the selected predictor
variables and their transformations. The model selection is based on
the model performance on the validating set. The selection criteria
include KS, C-value, Gini coefficient, information value, VIF, and
gains table. Also, the lack-of-fit statistic and the Box-Tidwell test are
monitored in the model selection process.
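Three of the criteria named above can be computed directly from validation scores. The sketch below assumes the usual definitions: the C-value is the probability that a random 3G customer outscores a random 2G customer (ties count one half), Gini = 2C - 1, and KS is the largest gap between the two classes' cumulative score distributions. Function names and the toy data are illustrative only.

```python
def c_statistic(scores, labels):
    """Share of (3G, 2G) pairs where the 3G customer has the higher score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Gini coefficient derived from the C-value."""
    return 2 * c_statistic(scores, labels) - 1

def ks_statistic(scores, labels):
    """Maximum gap between the cumulative score distributions of the classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ordered = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(ordered):
        j = i
        # Consume all records tied at this score before comparing the CDFs.
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            cum_pos += ordered[j][1]
            cum_neg += 1 - ordered[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best

# Toy validation scores (1 = 3G, 0 = 2G); values are made up.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
```

The pairwise C-value is quadratic in the sample size, which is acceptable for a validating set of a few thousand records; a rank-based formula would be preferable for larger data.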
e) Model addition
Finally, we calculate the marginal information value of all of the variables
in the training data except the predictor variables.
If some of these variables add significant predictive power, we fit
them into our model. As a result, seventeen predictor variables are
selected in our model.
Model SAS output:

Obs  Variable             Estimate         StdErr    WaldChiSq  ProbChiSq  StandardizedEst  weight  margin_infovalue
  1  loghs_age            -1.085755335692  0.0332    1072.3548  <.0001     -0.4825          0.208   0.076325
  2  logavg_bill_amt       0.629322675176  0.0652      93.1787  <.0001      0.2573          0.111   0.066718
  3  qavg_vas_games        0.001859320313  0.000122   233.2710  <.0001      0.2348          0.101   0.064460
  4  hs_model_p            2.145264443682  0.3095      48.0372  <.0001      0.1620          0.070   0.000000
  5  logage               -1.179460632775  0.1597      54.5522  <.0001     -0.1362          0.059   0.009264
  6  hs_manufacturer_p     0.464200942387  0.0665      48.7006  <.0001      0.1232          0.053   0.000000
  7  qavg_call_intran      0.078541920134  0.0143      30.1814  <.0001      0.1133          0.049   0.042702
  8  itot_retention_camp   0.047355829152  0.00637     55.3522  <.0001      0.1126          0.049   0.007942
  9  logstd_vas_gprs       0.061825882824  0.00761     66.0802  <.0001      0.1094          0.047   0.001511
 10  qavg_bill_voiced     -0.051260044982  0.00928     30.4867  <.0001     -0.1047          0.045   0.009360
 11  iblack_list_flag     -0.660335060879  0.1576      17.5665  <.0001     -0.0926          0.040   0.000000
 12  icontract_flag       -0.379043852991  0.0771      24.1606  <.0001     -0.0881          0.038   0.000000
 13  ivas_ar_flag          0.310846459707  0.0716      18.8644  <.0001      0.0813          0.035   0.000000
 14  logstd_od_amt        -0.050180174464  0.0145      11.9399  0.0005     -0.0610          0.026   0.010664
 15  snum_tel              0.016091722386  0.00443     13.1805  0.0003      0.0595          0.026   0.003099
 16  qstd_call_frw_ratio  -0.974201028084  0.3770       6.6792  0.0098     -0.0554          0.024   0.013133
 17  lucky_no_flag         0.341940817274  0.1100       9.6552  0.0019      0.0436          0.019   0.000000
 18  intercept            -1.691815682630  0.7223       5.4863  0.0192
Resulting model: The prediction model is as follows:
logit=loghs_age*(-1.085755335692)+logavg_bill_amt*(0.6293226751758)+
qavg_vas_games*(0.0018593203132)+itot_retention_camp*(0.0473558291524)+
hs_model_p*(2.1452644436821)+logstd_vas_gprs*(0.0618258828235)+
logage*(-1.179460632775)+ivas_ar_flag*(0.3108464597066)+
hs_manufacturer_p*(0.4642009423865)+qstd_call_frw_ratio*(-0.974201028084)+
qavg_bill_voiced*(-0.051260044981)+qavg_call_intran*(0.0785419201338)+
icontract_flag*(-0.379043852990)+iblack_list_flag*(-0.660335060878)+
snum_tel*(0.0160917223862)+logstd_od_amt*(-0.050180174463)+
lucky_no_flag*(0.3419408172738)+ (-1.691815682629);
probability=exp(logit)/(1+exp(logit));
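For readers who want to apply the scoring formula above outside SAS, here is a minimal Python transcription. The coefficients are copied from the formula; the names `COEFFICIENTS`, `INTERCEPT`, and `score` are assumptions of this sketch. Inputs must already be the transformed variables (for example `loghs_age`), since the report does not spell out the transformations. The expression 1/(1+exp(-logit)) is algebraically equal to exp(logit)/(1+exp(logit)).

```python
import math

# Coefficients copied from the prediction formula in the report.
COEFFICIENTS = {
    "loghs_age": -1.085755335692,
    "logavg_bill_amt": 0.6293226751758,
    "qavg_vas_games": 0.0018593203132,
    "itot_retention_camp": 0.0473558291524,
    "hs_model_p": 2.1452644436821,
    "logstd_vas_gprs": 0.0618258828235,
    "logage": -1.179460632775,
    "ivas_ar_flag": 0.3108464597066,
    "hs_manufacturer_p": 0.4642009423865,
    "qstd_call_frw_ratio": -0.974201028084,
    "qavg_bill_voiced": -0.051260044981,
    "qavg_call_intran": 0.0785419201338,
    "icontract_flag": -0.379043852990,
    "iblack_list_flag": -0.660335060878,
    "snum_tel": 0.0160917223862,
    "logstd_od_amt": -0.050180174463,
    "lucky_no_flag": 0.3419408172738,
}
INTERCEPT = -1.691815682629

def score(customer):
    """Probability of being a 3G user for a dict of transformed predictor values."""
    logit = INTERCEPT + sum(
        coef * customer[name] for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical customer with every transformed predictor equal to zero.
baseline = {name: 0.0 for name in COEFFICIENTS}
baseline_probability = score(baseline)
```

Ranking customers by `score` in descending order reproduces the rank-ordering step described in the methodology, given the same transformed inputs.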
Model prediction: Our model is applied to the validating set and produces a list of
customers ranked by their probability of being “3G” users in descending order. A
classification threshold is chosen such that the customers whose probability is
greater than the threshold best fit the actual 3G users labeled in the validating
set. Finally, we apply our model and the classification threshold to the testing
data. The result is that 1456 customers are identified as potential 3G users.
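The report does not state which criterion defines the "best fit" between predicted and actual 3G users. As one plausible reading, and purely as an assumption, the sketch below scans the ranked validation scores and keeps the cut-off that maximizes the F1 score, 2TP / (number predicted + number actual). The function name and the toy data are hypothetical.

```python
def best_threshold(probs, labels):
    """Return (cutoff, f1): the score cut-off with the highest F1.

    Customers with probability >= cutoff are classified as 3G.
    F1 is an assumed criterion; the report leaves the choice unspecified.
    """
    n_pos = sum(labels)
    ranked = sorted(zip(probs, labels), reverse=True)
    best_f1, best_cut, tp = -1.0, None, 0
    for k, (p, y) in enumerate(ranked, start=1):
        tp += y
        # Evaluate only at the last record of a group of tied scores.
        if k < len(ranked) and ranked[k][0] == p:
            continue
        f1 = 2 * tp / (k + n_pos)   # k = customers predicted as 3G so far
        if f1 > best_f1:
            best_f1, best_cut = f1, p
    return best_cut, best_f1

# Toy validation data (1 = 3G, 0 = 2G); values are made up.
toy_probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_actual = [1, 1, 0, 1, 0, 0]
cutoff, f1 = best_threshold(toy_probs, toy_actual)
```

Because the competition measure is true positives on the holdout sample, a criterion that favors recall more heavily than F1 would push the cut-off lower and flag more customers.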
Model insights: A 3G cell phone is also called a “smart phone.” It is a combination
of a cell phone and a PC. With a larger screen, higher resolution, and better
multimedia support, it enhances users’ experience with games and various data
services. Game lovers and young professionals who need access to various data
services from anywhere at any time are probably among the earliest users to adopt
this product. This insight is supported by our model.
According to our model and SAS output, several variables dominate the
prediction of “3G user.” The top three predictor variables are hs_age (handset
age in months), avg_bill_amt (average billing amount in the last months), and
avg_vas_games (average games utilization in the last six months). The 2G customers
with low hs_age, high avg_bill_amt, and high avg_vas_games have a high
probability of switching to 3G. An interpretation of this is that customers who
frequently change handsets, who pay large bills, and who love to play games are
most likely to adopt 3G.
Some other notable predictor variables are tot_retention_camp (total number of
retention campaigns received in the last 6 months), hs_model (handset model),
hs_manufacturer (handset manufacturer), and std_vas_gprs (standard deviation of
GPRS data utilization in the last 6 months). The probability of customers being 3G
users is positively correlated with these variables. This can be interpreted as
follows: (i) customers who receive many retention campaigns are encouraged to
switch to 3G; (ii) customers currently using fashionable handsets from reputable
manufacturers are likely to try the more fashionable 3G product; (iii) the more
variable the pattern of GPRS data utilization, the more likely the customers are
to switch to 3G.
The probability of customers being 3G users is negatively correlated with the
predictor variables age (in years) and avg_bill_voiced (average billing amount for
voice traffic in the last 6 months). The older the customers, or the more they use
non-data voice services, the less likely they are to choose 3G.
In summary, the predictor variables can be classified into two categories, fashion
effect and usage effect, both driving the adoption of 3G (see figure below). The
frequency of handset change and the current handset type, for example, indicate
fashionable customers who are likely to adopt 3G. The customer age, however, is
negatively correlated with the fashion effect. For the usage effect, the game,
GPRS, and data usage are positive factors, while the voice usage is a negative
one. Overall, the market campaign will influence the relationships between
fashion effect and 3G adoption, and between usage effect and 3G adoption.
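[Figure: diagram of the factors driving 3G adoption. Fashion indicators (frequency
of handset change, current handset type) and age (negative) feed the fashion effect;
usage indicators (data usage, GPRS usage, and voice usage, the last negative) feed
the usage effect; the fashion effect, the usage effect, and the market campaign
each point to 3G adoption.]
Figure. Factors of 3G adoption.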