A Report to PAKDD 2006 Data Mining Competition




                    Nan Hu

                  Shawn Jin

              ...
DATE:       Feb 28th, 2006

   FROM:       Team Jin/Hu

   TO:         PAKDD Committee

   RE:         REPORT FOR PAKDD MO...
A Report to PAKDD 2006 Data Mining Competition

Nan Hu
Shawn Jin
Yingjiu Li

Mar 01st, 2006

DATE:       Feb 28th, 2006
FROM:       Team Jin/Hu
TO:         PAKDD Committee
RE:         REPORT FOR PAKDD MODELING PROJECT

PROJECT REPORTS AND SPECIFICATIONS

PAKDD 2006 Data Mining Competition

PROJECT OBJECTIVE:
The project objective is to accurately predict which customers are likely to switch to using the 3G network of an Asian telco operator that has successfully launched a third generation (3G) mobile telecommunications network. The data mining task is a classification problem for which the objective is to accurately predict as many current 3G customers as possible (i.e., true positives) from the “holdout” sample provided.

BACKGROUND:
An original sample dataset of 20,000 2G network customers and 4,000 3G network customers has been provided with more than 200 data fields. The target categorical variable is “Customer_Type” (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G network compatible mobile phone.

Three-quarters of the dataset (15K 2G, 3K 3G) will have the target field available and is meant to be used for training/testing (i.e., training data). The remaining portion (5K 2G, 1K 3G) will be made available with the target field missing and is meant to be used for prediction (i.e., testing data).

MEASURES OF PERFORMANCE:
True positives from the “holdout” sample.

METHODOLOGY

Logistic regression technique is applied.

The idea is to build a predictive model that estimates the probability that a customer will switch to the 3G network. For any given customer, a probability of being a “3G user” is calculated. The whole training data is rank-ordered by this probability score. A classification threshold is chosen such that the “3G” customers predicted by our model best fit the actual “3G users” labeled in the training data. Our predictive result is obtained by applying the predictive model and the classification threshold to the testing data.

The detailed steps for this exercise are as follows.

a) Split the training data into a training set and a validating set

The training data is split at random, 70% into the training set and 30% into the validating set. After the split, we have the following data structure.

   Training set:
      3G     number   percent
      no     10500    83.33
      yes     2100    16.67

   Validating set:
      3G     number   percent
      no      4500    83.33
      yes      900    16.67

b) Variable reduction

The following methods are used in variable reduction:

   1. Principal component analysis
   2. Cluster analysis
   3. Regression loosely fit
   4. Logistic regression loosely fit
   5. Information value

Based on the outcome of these variable reduction methods, 173 candidate variables are chosen for further analyses.

c) Uni-variate and bi-variate analyses

Frequency analysis is performed on every candidate variable to see its distribution. Then, bi-variate analysis is performed to examine the relationship between the candidate variables and the decision variable (i.e., customer type). Finally, a set of predictor variables is chosen with appropriate transformations. In particular, the following steps are gone through for each candidate variable:

   1. Imputation of this variable is decided based on the relationship of this candidate variable and the decision variable.
   2. Floor/capping of this variable is decided based on the imputation analysis.
   3. It is decided whether this candidate variable is a predictor variable.
   4. Appropriate transformation is decided based on the relationship of

METHODOLOGY (continued)

d) Model selection

Various log-linear models are built based on the selected predictor variables and their transformations. The model selection is based on the model performance on the validating set. The selection criteria include KS, C-value, Gini coefficient, information value, VIF, and the gains table. The lack-of-fit statistic and the Box-Tidwell test are also monitored in the model selection process.

e) Model addition

Finally, we calculate the marginal information on all of the variables in the training data except the predictor variables. If some of these variables add significant additional predictive power, we fit them into our model. As a result, seventeen predictor variables are selected in our model.

Model SAS output

   Obs  Variable              Estimate          StdErr    WaldChiSq  ProbChiSq  StandardizedEst  margin_weight  infovalue
     1  loghs_age             -1.085755335692   0.0332    1072.3548  <.0001     -0.4825          0.208          0.076325
     2  logavg_bill_amt        0.629322675176   0.0652      93.1787  <.0001      0.2573          0.111          0.066718
     3  qavg_vas_games         0.001859320313   0.000122   233.2710  <.0001      0.2348          0.101          0.064460
     4  hs_model_p             2.145264443682   0.3095      48.0372  <.0001      0.1620          0.070          0.000000
     5  logage                -1.179460632775   0.1597      54.5522  <.0001     -0.1362          0.059          0.009264
     6  hs_manufacturer_p      0.464200942387   0.0665      48.7006  <.0001      0.1232          0.053          0.000000
     7  qavg_call_intran       0.078541920134   0.0143      30.1814  <.0001      0.1133          0.049          0.042702
     8  itot_retention_camp    0.047355829152   0.00637     55.3522  <.0001      0.1126          0.049          0.007942
     9  logstd_vas_gprs        0.061825882824   0.00761     66.0802  <.0001      0.1094          0.047          0.001511
    10  qavg_bill_voiced      -0.051260044982   0.00928     30.4867  <.0001     -0.1047          0.045          0.009360
    11  iblack_list_flag      -0.660335060879   0.1576      17.5665  <.0001     -0.0926          0.040          0.000000
    12  icontract_flag        -0.379043852991   0.0771      24.1606  <.0001     -0.0881          0.038          0.000000
    13  ivas_ar_flag           0.310846459707   0.0716      18.8644  <.0001      0.0813          0.035          0.000000
    14  logstd_od_amt         -0.050180174464   0.0145      11.9399  0.0005     -0.0610          0.026          0.010664
    15  snum_tel               0.016091722386   0.00443     13.1805  0.0003      0.0595          0.026          0.003099
    16  qstd_call_frw_ratio   -0.974201028084   0.3770       6.6792  0.0098     -0.0554          0.024          0.013133
    17  lucky_no_flag          0.341940817274   0.1100       9.6552  0.0019      0.0436          0.019          0.000000
    18  intercept             -1.691815682630   0.7223       5.4863  0.0192

Resulting model

The prediction model is as follows:

   logit=loghs_age*(-1.085755335692)+logavg_bill_amt*(0.6293226751758)+
         qavg_vas_games*(0.0018593203132)+itot_retention_camp*(0.0473558291524)+
         hs_model_p*(2.1452644436821)+logstd_vas_gprs*(0.0618258828235)+
         logage*(-1.179460632775)+ivas_ar_flag*(0.3108464597066)+
         hs_manufacturer_p*(0.4642009423865)+qstd_call_frw_ratio*(-0.974201028084)+
         qavg_bill_voiced*(-0.051260044981)+qavg_call_intran*(0.0785419201338)+
         icontract_flag*(-0.379043852990)+iblack_list_flag*(-0.660335060878)+
         snum_tel*(0.0160917223862)+logstd_od_amt*(-0.050180174463)+
         lucky_no_flag*(0.3419408172738)+
         (-1.691815682629);
   probability=exp(logit)/(1+exp(logit));

Model prediction

Our model is applied to the validating set and produces a list of customers ranked by their probability of being “3G” users in descending order. A classification threshold is chosen such that the customers whose probability is greater than the threshold best fit the actual 3G users labeled in the validating set. Finally, we apply our model and the classification threshold to the testing data. The result is that 1456 customers are identified as potential 3G users.
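
Step a) can be sketched in a few lines. The report only says the split is random; the class counts it lists (exactly 70% of each class in the training set) are what a per-class, or stratified, split would give, so stratification is an assumption here. The function name `stratified_split` and the seed are hypothetical.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices into (train, valid), sampling each class separately.

    Assumption: the report does not say the split was stratified; this
    sketch stratifies because it reproduces the reported class counts.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)

# 15,000 2G and 3,000 3G labelled customers, as stated in the report.
labels = ["2G"] * 15000 + ["3G"] * 3000
train_idx, valid_idx = stratified_split(labels)

train_3g = sum(labels[i] == "3G" for i in train_idx)
valid_3g = sum(labels[i] == "3G" for i in valid_idx)
print("training set:  ", len(train_idx) - train_3g, "2G,", train_3g, "3G")
print("validating set:", len(valid_idx) - valid_3g, "2G,", valid_3g, "3G")
```

A plain random split over all 18,000 rows would give class counts that only approximate these figures.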
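
As an illustration of how the resulting model could be applied outside SAS, the sketch below evaluates the same logit and probability formulas in Python. The coefficients are copied from the model equation above; the function names, the all-zero example customer, and the threshold values are hypothetical, since the report does not state the threshold it chose. Inputs are assumed to be already transformed (log, flags, and so on) exactly as the predictor variables were.

```python
import math

# Coefficients copied from the report's model equation.
COEFFICIENTS = {
    "loghs_age": -1.085755335692,
    "logavg_bill_amt": 0.6293226751758,
    "qavg_vas_games": 0.0018593203132,
    "itot_retention_camp": 0.0473558291524,
    "hs_model_p": 2.1452644436821,
    "logstd_vas_gprs": 0.0618258828235,
    "logage": -1.179460632775,
    "ivas_ar_flag": 0.3108464597066,
    "hs_manufacturer_p": 0.4642009423865,
    "qstd_call_frw_ratio": -0.974201028084,
    "qavg_bill_voiced": -0.051260044981,
    "qavg_call_intran": 0.0785419201338,
    "icontract_flag": -0.379043852990,
    "iblack_list_flag": -0.660335060878,
    "snum_tel": 0.0160917223862,
    "logstd_od_amt": -0.050180174463,
    "lucky_no_flag": 0.3419408172738,
}
INTERCEPT = -1.691815682629

def logit(features):
    """Linear predictor: intercept plus coefficient-weighted features."""
    return INTERCEPT + sum(
        coef * features[name] for name, coef in COEFFICIENTS.items()
    )

def probability(features):
    """exp(logit) / (1 + exp(logit)), written to avoid overflow."""
    z = logit(features)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(features, threshold):
    """Flag a customer as a predicted 3G user (threshold is hypothetical)."""
    return probability(features) > threshold

# Hypothetical customer with every transformed predictor equal to zero,
# so the score depends on the intercept alone.
baseline = dict.fromkeys(COEFFICIENTS, 0.0)
print(round(probability(baseline), 4))
```

Ranking customers by `probability` in descending order and cutting at a chosen threshold mirrors the procedure described under "Model prediction".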

Model insights

A 3G cell phone is also called a “smart phone.” It is a combination of cell phone and PC. With a larger screen, higher resolution, and better multimedia support, it enhances users’ experiences with games and various data services. Game lovers and young professionals who need access to various data services from anywhere at any time are probably among the earliest users to adopt this product. This insight is supported by our model.

According to our model and SAS output, several variables dominate the prediction of “3G user.” The top three predictor variables are hs_age (handset age in months), avg_bill_amt (average billing amount in the last months), and avg_vas_games (average games utilization last six months). The 2G customers with low hs_age, high avg_bill_amt, and high avg_vas_games have a high probability of switching to 3G. An interpretation of this is that customers who frequently change handsets, who pay large bills, and who love to play games are most likely to adopt 3G.

Some other notable predictor variables are tot_retention_camp (total number of retention campaigns received in the last 6 months), hs_model (handset model), hs_manufacturer (handset manufacturer), and std_vas_gprs (standard deviation of GPRS data utilization in the last 6 months). The probability of customers being 3G users is positively correlated with these variables. This can be interpreted as follows: (i) customers who receive a large number of retention campaigns would be inspired to switch to 3G; (ii) customers currently using fashionable handsets from reputable manufacturers are likely to try the more fashionable 3G product; (iii) the more dynamic the usage pattern for GPRS data utilization, the more likely the customers are to switch to 3G.

The probability of customers being 3G users is negatively correlated with the predictor variables age (in years) and avg_bill_voiced (average billing amount for voice traffic in the last 6 months). The older the customers are, or the more non-data voice services they use, the less likely they are to choose 3G.

In summary, the predictor variables can be classified into two categories, fashion effect and usage effect, both driving the adoption of 3G (see figure below). The frequency of handset change and the current handset type, for example, indicate fashionable customers who are likely to adopt 3G. The customer age, however, is negatively correlated with the fashion effect. For the usage effect, game, GPRS, and data usage are positive factors, while voice usage is a negative one. Overall, the market campaign will influence the relationships between fashion effect and 3G adoption, and between usage effect and 3G adoption.
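
The ranking of predictors can be checked against the SAS output. The report does not say how the margin_weight column is computed; the sketch below assumes it is each variable's absolute standardized estimate divided by the sum of the absolute standardized estimates, an inference that appears to agree with the reported values to three decimals. The variable names are copied from the SAS output; `weights` and `top_three` are hypothetical names.

```python
# Standardized estimates copied from the report's SAS output.
STANDARDIZED_ESTIMATES = {
    "loghs_age": -0.4825,
    "logavg_bill_amt": 0.2573,
    "qavg_vas_games": 0.2348,
    "hs_model_p": 0.1620,
    "logage": -0.1362,
    "hs_manufacturer_p": 0.1232,
    "qavg_call_intran": 0.1133,
    "itot_retention_camp": 0.1126,
    "logstd_vas_gprs": 0.1094,
    "qavg_bill_voiced": -0.1047,
    "iblack_list_flag": -0.0926,
    "icontract_flag": -0.0881,
    "ivas_ar_flag": 0.0813,
    "logstd_od_amt": -0.0610,
    "snum_tel": 0.0595,
    "qstd_call_frw_ratio": -0.0554,
    "lucky_no_flag": 0.0436,
}

# Assumed definition: share of total absolute standardized effect.
total = sum(abs(value) for value in STANDARDIZED_ESTIMATES.values())
weights = {
    name: abs(value) / total
    for name, value in STANDARDIZED_ESTIMATES.items()
}

# The three variables with the largest share.
top_three = sorted(weights, key=weights.get, reverse=True)[:3]
for name in top_three:
    print(name, round(weights[name], 3))
```

Under this assumption the three largest shares belong to the transformed versions of hs_age, avg_bill_amt, and avg_vas_games, the same three variables the text singles out.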

[Figure: diagram in which fashion indicators (frequency of handset change, current handset type, age) feed the fashion effect, usage indicators (data usage, voice usage, GPRS usage) feed the usage effect, both effects lead to 3G adoption, and the market campaign acts on both links.]

Figure. Factors of 3G adoption.
