Poly02 Document Transcript

  • 1. PAKDD 2006 Data Mining Competition
    TEMASEK POLYTECHNIC, TEMASEK BUSINESS SCHOOL
    Diploma in Business Information Technology
    TAN JUN MING BEVAN 90939308
    BRANDON WONG 96358643
    LIM SIANG HOE 97432617
    WAN PING HAO 91809559
    BHARAT 91800447
    Temasek Polytechnic Business School Page 1 Business Information Technology
  • 2. Part A: Approach and understanding of the problem

    The problem scenario
    An Asian telco operator has just launched its third-generation (3G) mobile telecommunications network and would like to use a "holdout" sample of 24,000 of its 2G and 3G customers, covering existing usage as well as demographic data, to identify which customers are likely to switch to its 3G network.

    Strategy and output attribute
    The data mining strategy the group decided to use is supervised mining: classification and prediction. We chose supervised mining over the other techniques because it allows us to classify the outcomes of new instances after forming concepts from instances whose outcomes are known. Classification lets us separate the existing customers into "2G" and "3G" classes; from there, we can use prediction to estimate the likelihood of customers switching to the 3G network. Since classification deals with the present and prediction deals with the future, a hybrid of both strategies is suitable for this problem.

    Based on the problem, the group identified "CUSTOMER_TYPE" as the output attribute.
  • 3. 7-step KDD process used to create and deploy the model
    Below are the seven Knowledge Discovery in Databases (KDD) steps we used to build our classification model.
  • 4. Data cleaning (data preprocessing and transformation)
    A set of 18,000 instances was compiled and cleaned. First, missing data, noisy data and outliers were identified, as summarised below.

    Noisy: DAYS_TO_CONTRACT_EXPIRY, AVG_MINS_IB, AVG_MINS_IBPK, AVG_VAS_WAP, AVG_VAS_GPRS, AVG_SPHERE, NUM_TEL, AVG_BILL_VOICED
    Outliers: AVG_OD_AMT, STD_OD_AMT, STD_PAY_AMT, STD_NO_RECV, STD_MINS_OP, STD_CALL_FRW, STD_VAS_GPSMS
    Missing: AGE, GENDER, MARITAL_STATUS, NATIONALITY, OCCUP_CD, HS_MANUFACTURER

    The group took the following steps to handle the missing data, noisy data and outliers. For missing or noisy categorical data, the most commonly recurring value was used as the replacement. For missing or noisy numerical data, the class mean was calculated and used as the replacement. The outliers were left alone, as most of them are simply outstanding figures that are still deemed plausible.
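    The two replacement rules above can be sketched as follows. This is a hypothetical illustration (not part of iDA): the `impute` function and record layout are invented for the example, with categorical gaps filled by the most commonly recurring value and numerical gaps by the mean of the instance's own class.

```python
from collections import Counter
from statistics import mean

def impute(records, cat_attrs, num_attrs, class_attr="CUSTOMER_TYPE"):
    for a in cat_attrs:
        # most commonly recurring value across all records with a value
        mode = Counter(r[a] for r in records if r[a] is not None).most_common(1)[0][0]
        for r in records:
            if r[a] is None:
                r[a] = mode
    for a in num_attrs:
        # class-conditional mean (e.g. separate means for 2G and 3G)
        class_means = {}
        for c in {r[class_attr] for r in records}:
            vals = [r[a] for r in records if r[class_attr] == c and r[a] is not None]
            class_means[c] = mean(vals)
        for r in records:
            if r[a] is None:
                r[a] = class_means[r[class_attr]]
    return records
```

    Using the class mean rather than the overall mean keeps, say, a missing 3G customer's AGE closer to the 3G profile instead of pulling it toward the much larger 2G population.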
  • 5. Selection of instances for the training and test sets
    Because of the large volume of data, the group used stratification to select the instances for the training and test sets. The customers were first sorted in ascending order by their 2G and 3G classes: there are 17,100 instances in the 2G class and 900 in the 3G class. From each class, 70% of the customers (11,970 2G instances and 630 3G instances) were selected as the training data. The remaining 30% of each class was selected as the test data and placed below the training data. Stratification thus ensures that the 2G and 3G classes appear in the same proportions in both the training and test data.

    Elimination of attributes
    From the 18,000-instance data set, we chose to eliminate attributes based on their significance to the model; the table in Part C shows which attributes were eliminated. The group eliminated attributes with a domain predictability of more than 0.8 or a numerical significance of less than 0.25, as well as attributes with poor class predictiveness and predictability. Attributes containing only one value were also excluded from the model.

    Overall, the group went through two stages of elimination, to ensure that only significant attributes remained and that the model's accuracy improved. Further stages of elimination could have been attempted, but the group felt two were sufficient and that further elimination would not increase the model's accuracy.

    Mining of the data set
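    The stratified 70/30 split described above can be sketched as follows; `stratified_split` and `label_of` are hypothetical names, and the figures match the report's counts (17,100 2G and 900 3G instances).

```python
def stratified_split(data, label_of, train_frac=0.7):
    # group instances by class
    by_class = {}
    for row in data:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for cls in sorted(by_class):           # process classes in sorted order (2G before 3G)
        rows = by_class[cls]
        cut = int(len(rows) * train_frac)  # 70% of each class for training
        train.extend(rows[:cut])
        test.extend(rows[cut:])            # remaining 30%, placed after the training data
    return train, test
```

    With 17,100 2G and 900 3G instances, this yields 11,970 + 630 = 12,600 training instances and 5,400 test instances, with each set preserving the 19:1 class ratio.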
  • 6. For mining, we included two additional rows in the target set: the attribute type (categorical or numerical) and the attribute usage (U = not used, D = display only, I = included as input, O = output). This is required by iDA. As we are performing supervised mining, we also set the attribute "CUSTOMER_TYPE" as the output.

    Overall, the group mined three different models: the first before any elimination of attributes, the second after the first stage of elimination, and the third after the second stage of elimination.

    Evaluation of the model
    The group cross-compared the models it created to ensure that the model used for prediction is accurate. For instance, the group used each model's confusion matrix, including its false accepts and false rejects, to judge the model's accuracy and efficiency.

    Part B
  • 7. Full technical details of the algorithm(s) used
    We used the iData Analyzer (iDA) to carry out our data mining. It supports the business or technical analyst by offering a visual learning environment, an integrated tool set, and data mining process support. As iDA is an Excel add-on, its user interface is in Microsoft Excel format. iDA consists of the following components: a preprocessor, three data mining tools, and a report generator.

    Preprocessor
    The preprocessor scans for several types of errors, including illegal numeric values, blank lines, and missing items, before the data in a file is presented to one of the iDA mining engines. It corrects several types of errors but does not attempt to fix numerical data errors. The preprocessor outputs a data file ready for data mining as well as a document describing the nature and location of unresolved errors.

    Heuristic agent
    The heuristic agent responds to the presentation of data files containing several thousand instances. It lets us decide whether to extract a representative subset of the data for analysis or to process the entire data set.

    ESX
    This component is an exemplar-based data mining tool that builds a concept hierarchy to generalize data.

    Neural networks
  • 8. iDA contains two neural network architectures: a backpropagation neural network for supervised learning and a self-organizing feature map for unsupervised clustering.

    RuleMaker
    iDA's production rule generator provides several rule-generating options.

    Report generator
    The report generator offers several sheets of summary information for each data mining session.

    Limitations
    The commercial version of iDA is bound by the size of a single MS Excel spreadsheet, which allows a maximum of 65,536 rows and 256 columns. The iDA input format uses the first three rows of a spreadsheet to house information about individual attributes, so a maximum of 65,533 data instances in attribute-value format can be mined. As each MS Excel column holds a single attribute, the maximum number of attributes allowed is 256. The maximum size of an attribute name or attribute value is 250 characters. Also, RuleMaker will not generate rules for more than 20 classes.

    The iDA system architecture
  • 9. The following shows how iDA carries out data mining.
    [Figure: iDA system architecture. Data enters through the interface and passes to the preprocessor; large data sets are first routed through the heuristic agent. The chosen mining technique is either a neural network or ESX. If rules are to be generated, ESX results are passed to RuleMaker; all results flow to the report generator, which outputs Excel sheets.]

    ESX
  • 10. ESX can help create target data, find irregularities in data, perform data mining, and offer insight into the practical value of discovered knowledge. The following is a list of features of the ESX learner model:
    1. It supports both supervised and unsupervised clustering.
    2. It does not make statistical assumptions about the nature of the data to be processed.
    3. It supports an automated method for dealing with missing attribute values.
    4. It can be applied in domains containing both categorical and numerical data.
    5. It can point out inconsistencies and unusual values in data.
    6. For supervised classification, ESX can determine those instances and attributes best able to classify new instances of unknown origin.
    7. For unsupervised clustering, ESX incorporates a globally optimizing evaluation function that encourages a best instance clustering.

    The primary data structure used by ESX is a three-level concept hierarchy. The nodes at the instance level of the tree represent the individual instances that define the concept classes given at the concept level. The concept-level nodes store summary statistics about the attribute values found within their respective instance-level children. The root-level tree node stores summary information about all instances within the domain. Concept- and root-level summary
  • 11. information is given to the report generator, which in turn outputs a summary report in spreadsheet format.
    [Figure: the three-level concept hierarchy. A root-level node has concept-level children C1, C2, …, Cn; each concept-level node Ck has instance-level children Ik1, Ik2, ….]

    Class resemblance scores are stored within the root node and each concept-level node. They form the basis of ESX's evaluation function, providing a measure of overall similarity for the exemplars making up individual concept classes. As ESX is commercial software, details about the class resemblance computation are not available. However, we can state the evaluation rule ESX uses for both supervised learning and unsupervised clustering.

    Given:
    a set of existing concept-level nodes C1, C2, …, Cn;
    an average class resemblance score S, computed by summing the resemblance scores for each class C1, C2, …, Cn and dividing by n; and
    a new instance I to be classified,
  • 12. make I an instance of the concept node that results in the largest average increase, or smallest average decrease, in S. When learning is unsupervised: if a better score is achieved by creating a new concept node, then create a new node Cn+1 with I as its only member.

    That is, for each existing concept-level node Ck, the new instance to be classified is temporarily placed in the concept class represented by the node. A new class resemblance score is computed for Ck, as well as a new average class resemblance score. The winning concept class is the one that creates the largest increase (or smallest decrease) in average class resemblance. When learning is unsupervised, a score for creating a new concept-level node is also computed.

    iDAV format for data mining
    We place a C in a second-row column if the corresponding attribute's data type is categorical (nominal), or an R if the entered data is real-valued (numerical). The third row informs ESX about attribute usage. If learning is supervised, exactly one attribute must contain an O, and the output attribute must be categorical.

    Values for attribute usage:
    I - The attribute is used as an input attribute.
    U - The attribute is not used.
  • 13. D - The attribute is not used for classification or clustering, but attribute value summary information is displayed in all output reports.
    O - The attribute is used as an output attribute. For supervised learning with ESX, exactly one categorical attribute is selected as the output attribute.

    Output reports: supervised learning
    The following components are generated by iDA after a supervised learning session is performed.
    Sheet1 RES MTX: the confusion matrix for the test set data.
    Sheet1 RES TST: individual test set instance classifications; only seen when a test set is applied.
    Sheet1 RES SUM: summary statistics about attribute values, with several heuristics to help us judge the quality of a data mining session.
    Sheet1 RES CLS: information about the classes formed as a result of a supervised mining session.
    Sheet1 RUL TYP: instances listed by class name; the last column shows a typicality value for each instance.
    Sheet1 RES RUL: the production rules generated for each class.
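    The iDAV layout described above (attribute names in row one, C/R type codes in row two, usage codes in row three, then the instances) can be sketched as a simple writer. This is a hypothetical illustration: `write_idav` and the example attributes are invented, and the real iDA input lives in an Excel sheet rather than a CSV file.

```python
import csv
import io

def write_idav(names, types, usages, instances):
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(names)   # first row: attribute names
    w.writerow(types)   # second row: C (categorical) or R (real-valued) per attribute
    w.writerow(usages)  # third row: usage codes; exactly one O for supervised learning
    for inst in instances:
        w.writerow(inst)
    return buf.getvalue()

sheet = write_idav(
    ["AGE", "GENDER", "CUSTOMER_TYPE"],
    ["R", "C", "C"],   # the output attribute must be categorical
    ["I", "I", "O"],   # exactly one output attribute
    [[34, "M", "2G"], [29, "F", "3G"]],
)
```

    Note how "CUSTOMER_TYPE" carries both the C type code and the single O usage code, matching the supervised-learning requirement stated above.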
  • 14. Measurements used for analysis
    To interpret the results, we used class predictability and predictiveness.

    Class predictability is defined as follows: given categorical attribute A with values v1, v2, v3, …, vn, the class C predictability score for vi tells us the percentage of instances within class C showing vi as the value for A. Class predictability is a within-class measure: the sum of the predictability scores for values v1, v2, v3, …, vn within class C is always equal to 1.

    Class predictiveness is defined as follows: given class C and attribute A with values v1, v2, v3, …, vn, the attribute-value predictiveness score for vi is the probability that an instance resides in C given that the instance has value vi for A. Predictiveness scores are between-class measures: the sum of the predictiveness scores across all classes for a given attribute value is always equal to 1.

    Here are some useful observations relating class predictability and predictiveness scores for a given concept class C:

    If an attribute value has a predictability and a predictiveness score of 1.0, the attribute value is said to be necessary and sufficient for membership in C. That is, all instances within class C have the specified value for the attribute, and all instances with this value for the attribute reside in class C.

    If an attribute value has a predictiveness score of 1.0 and a predictability score less than 1.0, we conclude that all instances with the value for the attribute reside in class C. However, there are some instances in C that do not have the value for
  • 15. the attribute in question. We call the attribute value sufficient but not necessary for class membership.

    If an attribute value has a predictability score of 1.0 and a predictiveness score less than 1.0, we conclude that all instances in class C have the same value for the chosen attribute. However, some instances outside of class C also have that value for the given attribute. The attribute value is said to be necessary but not sufficient for class membership.

    In general, any categorical attribute with at least one highly predictive value should be designated as an input attribute, while a categorical attribute with little predictive value ought to be flagged as unused or display-only.

    Source: Data Mining: A Tutorial-Based Primer (Richard J. Roiger and Michael W. Geatz)
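    Both measures can be computed directly from counts. This is a hypothetical sketch (iDA computes these scores internally); it shows that predictability is the joint count divided by the class count, while predictiveness is the joint count divided by the value count.

```python
from collections import Counter

def predict_scores(instances, attr, cls_attr):
    class_counts = Counter(i[cls_attr] for i in instances)
    value_counts = Counter(i[attr] for i in instances)
    joint = Counter((i[cls_attr], i[attr]) for i in instances)
    # predictability(v | C): fraction of instances in class C with value v
    predictability = {cv: joint[cv] / class_counts[cv[0]] for cv in joint}
    # predictiveness(C | v): fraction of instances with value v that lie in C
    predictiveness = {cv: joint[cv] / value_counts[cv[1]] for cv in joint}
    return predictability, predictiveness
```

    For instance, if every 3G customer in a sample has a data plan but a third of data-plan holders are 2G, "data plan" is necessary but not sufficient for the 3G class (predictability 1.0, predictiveness below 1.0).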
  • 16. PART C: Details of the classification model that was produced

    a) Model 1 (first elimination of attributes)

    Elimination of attributes (round 1)
    The table below highlights the attributes to eliminate and the reasons for doing so.
  • 17. Attributes to eliminate:

    Categorical (eliminated on domain statistics): NATIONALITY, COBRAND_CARD_FLAG, HIGHEND_PROGRAM_FLAG, CUSTOMER_CLASS, SUBPLAN_CHANGE_FLAG, CONTRACT_FLAG, LUCKY_NO_FLAG, BLACK_LIST_FLAG, ID_CHANGE_FLAG, REVPAY_PREV_CD, AVG_REVPAY_AMT, TOP3_INT_CD, VAS_CND_FLAG, VAS_CNND_FLAG, VAS_DRIVE_FLAG, VAS_FF_FLAG, VAS_IB_FLAG, VAS_NR_FLAG, VAS_VMN_FLAG, VAS_VMP_FLAG, VAS_GPRS_FLAG, VAS_IEM_FLAG, STD_CALL_1900, STD_VAS_QG, STD_VAS_QI, STD_VAS_QP, STD_VAS_QTXT, STD_VAS_QTUNE, STD_VAS_WAP, STD_VAS_GPRS, STD_VAS_CWAP, STD_VAS_ESMS, STD_VAS_GPSMS

    Numerical (eliminated on domain statistics): DAYS_TO_CONTRACT_EXPIRY, NUM_TEL, NUM_ACT_TEL, NUM_SUSP_TEL, NUM_DELINQ_TEL, PAY_METD_CHG, LST_RETENTION_CAMP, BLACK_LIST_CNT, TELE_CHANGE_FLAG, TOT_LAST_DELINQ_DAYS, TOT_PAST_DELINQ, TOT_PAST_OVERDUE, TOT_PAST_TOS, TOT_TOS_DAYS, TOT_PAST_REVPAY, TOT_DEBIT_SHARE, AVG_PAST_OD_VALUE, AVG_DELINQ_DAYS, OD_REL_SIZE, OD_FREQ, REVPAY_FREQ, AVG_OD_AMT, AVG_PAY_AMT, AVG_MINS_IBOP, AVG_MINS_INT, AVG_MINS_INTT1, AVG_MINS_FRW, AVG_MINS_1900, AVG_CALL_1900, AVG_VAS_QG, AVG_VAS_QI, AVG_VAS_QP, AVG_VAS_QTXT, AVG_VAS_QTUNE, AVG_VAS_WAP, AVG_VAS_XP
  • 18. Attributes with only one value are eliminated, as they have no significance in building the model. They are as follows: STD_VAS_CG, STD_VAS_IFSMS, STD_VAS_#123#, STD_VAS_IEM, STD_VAS_ISMS, STD_VAS_IDU, STD_VAS_WLINK, STD_VAS_MILL, AVG_VAS_MILL, AVG_VAS_IFSMS, AVG_VAS_#123#, AVG_VAS_CG, AVG_VAS_IEM, AVG_VAS_ISMS, AVG_VAS_SS, AVG_VAS_IDU, AVG_VAS_WLINK, VAS_SN_FLAG, VAS_CSMS_FLAG, DELINQ_FREQ, HS_CHANGE, TOT_PAST_DEMAND
  • 19. Important attributes to be kept
    The following attributes are kept as they were found to be important:
    SUBPLAN, HS_MODEL, VAS_AR_FLAG, STD_PK_MINS_RATIO, STD_OP_MINS_RATIO, STD_LOCAL_DURATION, STD_EXTRAN_RATIO, STD_SPHERE, STD_M2M_MINS_RATIO
    These attributes are highly predictive, as they have different values for the different classes.

    Findings from round 1 of elimination
  • 20. CLASS RESEMBLANCE STATISTICS
                         Class 2G   Class 3G   Domain
    Res. score:          0.048      0.611      0.23
    No. of inst.:        10500      2100       12600
    Class significance:  (0.79)     1.68

    The class resemblance statistics show that around 10,500 instances were classified into class 2G and 2,100 instances into class 3G. The class resemblance scores of 0.048 for class 2G and 0.611 for class 3G are respectively below and above the domain score of 0.23. This tells us that the model is not especially meaningful; nevertheless, it can still be used for supervised mining.

    The test results from the confusion matrix for the 5,400 test instances are as follows:

    Confusion Matrix (computed class)
          2G     3G
    2G    4013   487
    3G    601    299
    Percent correct: 79.0%
    Error, upper bound: 22.1%
    Error, lower bound: 19.9%

    The confusion matrix shows that 4,312 instances were correctly classified, with 4,013 as 2G users and 299 as 3G users, a 79% classification accuracy. The
  • 21. 95% confidence interval is meaningful because the number of test set instances is large (>100).

    b) Model 2 (second elimination of attributes)

    Elimination of attributes (round 2)
    The following table highlights the attributes to be eliminated and the reasons for elimination:
  • 22. Attributes to eliminate:

    Categorical (eliminated on domain statistics): PAY_METD_PREV, TOP2_INT_CD, STD_VAS_SR, STD_VAS_YMSMS

    Numerical (eliminated on domain statistics): DAYS_TO_CONTRACT_EXPIRY, NUM_TEL, NUM_ACT_TEL, NUM_SUSP_TEL, NUM_DELINQ_TEL, PAY_METD_CHG, LST_RETENTION_CAMP, BLACK_LIST_CNT, TELE_CHANGE_FLAG, TOT_LAST_DELINQ_DAYS, TOT_LAST_DELINQ_DIST, TOT_DELINQ_DAYS, OD_REL_SIZE, OD_FREQ, REVPAY_FREQ, AVG_OD_AMT, AVG_PAY_AMT, AVG_MINS_IBOP, AVG_MINS_INT, AVG_MINS_INTT1, AVG_MINS_FRW, AVG_MINS_1900, AVG_CALL_1900, AVG_VAS_QG, AVG_VAS_QI, AVG_VAS_QP, AVG_VAS_QTXT, AVG_VAS_QTUNE, AVG_VAS_WAP, AVG_VAS_XP, AVG_VAS_GPRS, AVG_VAS_ESMS, AVG_VAS_GBSMS, AVG_VAS_CWAP, AVG_VAS_GPSMS, AVG_VAS_YMSMS, AVG_PK_MINS_RATIO

    These attributes have a domain predictability of more than 0.8. Domain predictability near 100% indicates that an attribute is not useful for data mining, as the majority of its values are identical. Therefore, they are eliminated.
  • 23. Important attributes to be kept (round 2)
    STD_T1_MINS_CON is kept, as it is highly predictive: it has different values for the different classes.

    Findings from round 2 of elimination

    CLASS RESEMBLANCE STATISTICS
                         Class 2G   Class 3G   Domain
    Res. score:          0          0.594      0.15
    No. of inst.:        10500      2100       12600
    Class significance:  (1.00)     2.91

    The class resemblance statistics show that around 10,500 instances were classified into class 2G and 2,100 instances into class 3G. The class resemblance scores of 0 for class 2G and 0.594 for class 3G are respectively below and above the domain score of 0.15. This tells us that the model is not especially meaningful. Compared to the first model, the class 2G resemblance score has dropped to 0 and the class 3G score has dropped to 0.594.

    The test results from the confusion matrix for the 5,400 test instances are as follows:

    Confusion Matrix
  • 24. Confusion Matrix (computed class)
          2G     3G
    2G    3991   509
    3G    587    313
    Percent correct: 79.0%
    Error, upper bound: 22.1%
    Error, lower bound: 19.9%

    The confusion matrix shows that 4,304 instances were correctly classified, with 3,991 as 2G users and 313 as 3G users, a 79% classification accuracy. The 95% confidence interval is meaningful because the number of test set instances is large (>100). Compared with model 1, the number of correctly classified instances has dropped; however, the classification accuracy remains the same.

    Question D
    Discussion on what insights can be gained from the model in terms of identifying current 2G customers with the potential to switch to 3G (e.g. using false positives)

    From the final model, after two rounds of elimination, the mining tool discovered several attribute values that are typical for the two classes, 2G and 3G. The attributes whose values differ significantly between the two classes are as follows:

    Attribute        2G       3G
    AGE              33-35    28-33
    MARITAL_STATUS   M        S
    AVG_MINS_OB      86       144
    AVG_CALL_OB      57       88
    AVG_VAS_SMS      7        103
  • 25. AVG_VAS_GAMES    0        11790
    STD_BILL_AMT     31       48

    Based on the above table, it is deduced that 3G customers are younger than 2G customers and that 3G appeals more to the market segment belonging to young, single customers. The attributes AVG_MINS_OB and AVG_CALL_OB refer to the average number of minutes of outbound calls and the average number of outbound calls respectively. The knowledge discovered indicates that 3G customers are heavy mobile users in terms of talk time and number of calls made. In addition to high talk time, they also have a higher SMS utilization rate, as seen from AVG_VAS_SMS. 3G, being a multimedia medium, has a greater appeal to customers who often play games on their mobile: the attribute AVG_VAS_GAMES (average games utilization) shows that 3G customers have a far higher value than their 2G counterparts.

    Based on the knowledge discovered for 2G and 3G customers, it can be concluded that 3G customers belong to the young, single market segment and are heavy users of mobile services in terms of talk time, SMS and value-added services such as mobile games. In contrast, 2G customers belong to the mature segment of the market and do not utilize mobile services as heavily. Being heavy users of mobile services, 3G customers have higher bill amounts than 2G customers; on average, their bills are 54% higher.
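    The per-class comparison behind the table above can be sketched as computing a class-conditional mean for each numerical attribute. This is a hypothetical illustration: `class_profile` and the sample records are invented, not the competition data.

```python
from statistics import mean

def class_profile(records, attrs, cls_attr="CUSTOMER_TYPE"):
    profile = {}
    for a in attrs:
        # collect each attribute's values per class, then average them
        by_cls = {}
        for r in records:
            by_cls.setdefault(r[cls_attr], []).append(r[a])
        profile[a] = {c: mean(vs) for c, vs in by_cls.items()}
    return profile
```

    Attributes whose 2G and 3G means differ sharply (like STD_BILL_AMT at 31 versus 48 above) are the ones worth profiling when targeting likely switchers.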
  • 26. Confusion Matrix (computed class)
          2G     3G
    2G    3991   509
    3G    587    313

    Based on this confusion matrix, a sizeable number of 2G customers are classified as 3G customers, indicating a high rate of false positives. This has an impact on the marketing effort required to convert 2G customers to 3G services. A model with more false accepts than false rejects is good for businesses where the cost of false inclusion (i.e. wrongly including 2G customers in the group selected for the 3G promotion) is lower than the loss in profit from false omission (i.e. wrongly excluding customers who would upgrade to 3G, resulting in lost revenue). The telco should therefore adopt a model with more false accepts if its cost of false inclusion is lower than its loss from false omission. On the other hand, businesses where the cost of false inclusion is higher than the loss from false omission should minimise the number of false accepts, and hence select models with more false rejects than false accepts.
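    The evaluation measures discussed in Parts C and D can be reproduced from the Model 2 confusion matrix; `evaluate` is a hypothetical helper that treats 3G as the positive class and computes the accuracy, the false accepts and rejects, and approximate 95% bounds on the error rate.

```python
from math import sqrt

def evaluate(tn, fp, fn, tp):
    n = tn + fp + fn + tp
    accuracy = (tn + tp) / n
    error = 1 - accuracy
    # 95% confidence half-width for a binomial error estimate
    half = 1.96 * sqrt(error * (1 - error) / n)
    return {
        "accuracy": accuracy,
        "false_accepts": fp,   # 2G customers wrongly classified as 3G
        "false_rejects": fn,   # 3G customers wrongly classified as 2G
        "error_bounds": (error - half, error + half),
    }

m = evaluate(tn=3991, fp=509, fn=587, tp=313)
```

    The exact accuracy from these counts is 4304/5400, about 79.7%; the report's 19.9% and 22.1% error bounds appear to follow from first rounding the error rate to 21% before applying the same formula.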