Group Assignment
Data Mining MKTG 5963
Abhinav Garg (11761380)
Tanu Srivastav (11772446)
Tejbeer Chhabra (11756746)
Maunik Desai (11758140)
Maanasa Nagaraja (11678486)
Table of Contents
Executive Summary
Data Audit
Modeling
Model Comparison
Scoring
Segmentation
Conclusion
Appendix A: Data Exploration
Appendix B: Clustering
Appendix C: Data Modeling
Appendix D: Model Comparison
Appendix E: Scored Data
List of Tables
Table 1 Variable Worth in Clusters
Table 2 Sensitivity and Specificity for Forward Regression Model
Table 3 Sensitivity and Specificity for Stepwise Regression
Table 4 Sensitivity and Specificity for Neural Network
Table 5 Model Comparisons
Table 6 Scored Data Summary for Target Variable
Executive Summary
Diversity and SAT scores play an important role in creating a better learning environment and a good college experience for students. Diversity enriches the educational experience and promotes personal growth, and SAT score is a useful predictor of college academic performance.
In our analysis, we aim to identify the prospective students most likely to enroll as new freshmen in Fall 2005. We also propose a marketing strategy the administration can use to increase diversity and SAT scores.
Data Audit
Before modeling, it is critical to explore the data for insights that shape the later steps.
1. DMDB Node
The DMDB tool gave us quick insights into the data: summary statistics for the numerical variables, the number of categories for the class variables, and the extent of missing values. The results show that the categorical variables have no missing values while the interval variables do, and that DISTANCE, HSCRAT, INIT_SPAN, INT1RAT, and INT2RAT exhibit non-normal behavior, which can introduce bias.
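As a rough illustration, the per-variable summary the DMDB node reports can be computed directly. This Python sketch (the function name and sample values are ours, not SAS's) returns the count, missing count, mean, standard deviation, and sample skewness for one interval variable:

```python
import math

def dmdb_summary(column):
    """Summarize one interval variable the way the DMDB node does:
    n, missing count, mean, standard deviation, and skewness."""
    values = [v for v in column if v is not None]
    n, n_miss = len(values), len(column) - len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(var)
    # Adjusted Fisher-Pearson sample skewness, as SAS reports it.
    skew = (n / ((n - 1) * (n - 2))) * sum(((v - mean) / std) ** 3 for v in values)
    return {"n": n, "missing": n_miss, "mean": mean, "std": std, "skewness": skew}

# A right-skewed toy column with one missing entry.
summary = dmdb_summary([1.0, 2.0, 2.0, 3.0, None, 40.0])
print(summary)
```

A strongly positive skewness here, as for HSCRAT or INT1RAT in our data, flags the non-normal behavior noted above.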
2. Data Reduction
The variables ACADEMIC_INTEREST_1 and ACADEMIC_INTEREST_2 have counterparts in INT1RAT and INT2RAT respectively; similarly, IRSCHOOL was converted into HSCRAT. TELECQ had more than 50% missing values. TOTAL_CONTACTS is simply the sum of the other contact counts, and CONTACT_CODE1 has hundreds of levels, so the code itself provides little information. For these reasons, ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, IRSCHOOL, TELECQ, TOTAL_CONTACTS, and CONTACT_CODE1 were all removed from the dataset.
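The missing-value rule used to drop TELECQ amounts to a simple screen. A Python sketch (function and column names are hypothetical) that flags variables whose missing share exceeds a threshold:

```python
def high_missing(columns, threshold=0.5):
    """Flag variables whose share of missing (None) values exceeds the
    threshold, as was done for TELECQ (>50% missing)."""
    return [name for name, col in columns.items()
            if sum(v is None for v in col) / len(col) > threshold]

cols = {"TELECQ": [1.0, None, None, None],   # 75% missing -> flagged
        "SATSCORE": [1100, 1200, 1300, 1250]}  # complete -> kept
flagged = high_missing(cols)
```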
3. Missing Value Imputation
Because our interval variables had many missing values, we used the PROC MI procedure rather than traditional single-value imputation, which can introduce unknown biases into the data. PROC MI both reveals the patterns of missingness and performs the imputation: it simulates multiple complete datasets from the original data by repeatedly replacing missing entries with imputed values.
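The idea of producing several completed copies can be mimicked, very loosely, in Python. This sketch draws each missing entry from a normal distribution fit to the observed values; PROC MI itself uses MCMC or regression-based imputation models, so this is only an illustration of the multiple-dataset output, not of the algorithm:

```python
import random
import statistics

def multiple_impute(column, m=5, seed=0):
    """Simplified stand-in for PROC MI: produce m completed copies of a
    column, filling each missing entry with a draw from a normal
    distribution fit to the observed values."""
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    mu, sigma = statistics.mean(observed), statistics.stdev(observed)
    return [[v if v is not None else rng.gauss(mu, sigma) for v in column]
            for _ in range(m)]

datasets = multiple_impute([10.0, 12.0, None, 11.0, None], m=3)
```

Downstream analyses are then run on each completed dataset and the results are pooled, which is what distinguishes multiple imputation from single-value fills.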
4. Data Filter Node
Extreme values are problematic because they can exert undue influence on a model. We handled them by excluding observations containing outliers or other extreme values we did not want in the model. This also reduces skewness and brings the variables closer to a normal distribution. The filtering methods used were Standard Deviations from the Mean for the interval variables and Rare Values (Percentage) for the class variables.
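The two filtering rules can be sketched in Python; the threshold values below are illustrative defaults, not the node's exact settings:

```python
import statistics

def filter_extremes(values, k=3.0):
    """Filter node, interval variables: drop observations more than k
    standard deviations from the mean."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [v for v in values if abs(v - mu) <= k * sd]

def filter_rare_levels(labels, min_share=0.02):
    """Filter node, class variables: drop rows whose level occurs in
    less than min_share of the data."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return [lab for lab in labels if counts[lab] / n >= min_share]

kept_values = filter_extremes([1.0] * 50 + [1000.0])   # 1000 is ~7 sd out
kept_labels = filter_rare_levels(["A"] * 99 + ["B"])   # "B" is a 1% level
```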
5. Data Partitioning
Before building our models, we split the data into training (70%) and validation (30%) sets. We chose a 70-30 split because it is a good compromise between model fitting and honest assessment, and because it preserved roughly the same target proportion as the original dataset. A summary of the split is provided in the appendix.
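A stratified 70/30 split that preserves the target proportion, as the Data Partition node does, can be sketched as follows (Python; the field name is hypothetical):

```python
import random

def stratified_split(rows, target, train_frac=0.7, seed=42):
    """70/30 partition that preserves the target proportion by splitting
    within each target level separately (stratified sampling)."""
    rng = random.Random(seed)
    by_level = {}
    for row in rows:
        by_level.setdefault(row[target], []).append(row)
    train, valid = [], []
    for level_rows in by_level.values():
        rng.shuffle(level_rows)
        cut = round(train_frac * len(level_rows))
        train.extend(level_rows[:cut])
        valid.extend(level_rows[cut:])
    return train, valid

# 30% positives overall; the split keeps that share in both partitions.
rows = [{"enroll": 1}] * 30 + [{"enroll": 0}] * 70
train, valid = stratified_split(rows, "enroll")
```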
6. Data Transformation
Data transformation corrects for skewed distributions in the numerical inputs and for large numbers of classes in the categorical variables. The skewness and kurtosis values obtained after filtering show that the independent variables are already approximately normal. We nevertheless tested the "Maximum Normal" transformation on the independent variables, one of the better power-transformation techniques in the Box-Cox family, to gauge its effectiveness in reducing skewness and kurtosis.
Although the skewness values dropped, the decrease was not large enough to justify the method. For instance, HSCRAT has a skewness of 2.64, and after the log transformation suggested by Maximum Normal, the skewness is still 1.9. Transformed variables also carry a cost: they are harder to interpret (log, square root), especially in a business scenario. We therefore chose not to transform the data.
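The pattern we saw for HSCRAT can be reproduced on made-up data: a log transform reduces right skew but does not remove it. The sample values below are hypothetical:

```python
import math

def skewness(values):
    """Adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (n / ((n - 1) * (n - 2))) * sum(((v - mean) / sd) ** 3 for v in values)

# A strongly right-skewed toy variable, loosely in the spirit of HSCRAT.
raw = [0.01, 0.02, 0.02, 0.03, 0.05, 0.10, 0.40]
logged = [math.log(v) for v in raw]
print(skewness(raw), skewness(logged))  # skew drops but stays positive
```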
Modeling
We have used Decision Tree, Forward and Stepwise Regression, Neural Network, and Auto Neural data
modeling techniques.
1. Decision Trees
The decision tree is a commonly used data mining method for building classification systems from multiple covariates and for developing prediction algorithms for a target variable. A split-search algorithm handles input selection, and model complexity is controlled by pruning. The settings used for the Decision Tree node were Maximum Branch = 2, Maximum Depth = 6, and Minimum Leaf Size = 5, with assessment as the subtree method and misclassification as the assessment measure.
The variable importance report is below:

Variable Name       Importance
SELF_INIT_CNTCTS    1.0000
HSCRAT              0.3798
STUEMAIL            0.2767
INIT_SPAN           0.1404
MAILQ               0.0816
INTEREST            0.0698
INT1RAT             0.0638
Table 1 Variable Worth in Clusters
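The split search the tree relies on can be sketched as an exhaustive search over candidate cut points for one interval input, minimizing weighted Gini impurity. (Gini is a common split criterion used here for illustration; the node's actual criterion depends on its settings.)

```python
def gini(counts):
    """Gini impurity from class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_binary_split(xs, ys):
    """Try every midpoint between sorted distinct values of one input and
    keep the cut with the lowest weighted Gini impurity (maximum branch = 2,
    matching our tree settings)."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        score = (len(left) * gini([left.count(0), left.count(1)]) +
                 len(right) * gini([right.count(0), right.count(1)])) / len(pairs)
        if score < best[1]:
            best = (cut, score)
    return best

# A perfectly separable toy input: the best cut falls between 3 and 8.
cut, score = best_binary_split([1, 2, 3, 8, 9, 10], [0, 0, 0, 1, 1, 1])
```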
Model Assessment: Validation misclassification rate = 0.0512; training misclassification rate = 0.058. SELF_INIT_CNTCTS was found to be the most important variable in determining a prospective student's enrollment decision, and the minimum misclassification rate occurred at 12 leaves.
2. Regression
Since our dependent variable ENROLL is a binary categorical variable, logistic regression was used.
2.1 Forward Regression
Forward regression creates a sequence of models of increasing complexity. At each step, every variable not yet in the model is tested for inclusion, and the most significant is added, provided its p-value is below SLENTRY = 0.05. The variables selected are SELF_INIT_CNTCTS, STUEMAIL, HSCRAT, INIT_SPAN, DISTANCE, SATSCORE, MAILQ, and INT2RAT. We also studied interaction effects in the regression model; the significant interactions are:
- CAMPUS_VISIT * PREMIERE
- REFERRAL_CNTCTS * INTEREST
- INTEREST * PREMIERE
- INSTATE * MAILQ
- TRAVEL_INIT_CNTCTS * INTEREST
- TERRITORY * STUEMAIL
Model Assessment: Validation Misclassification Rate = 0.0786 and Training Misclassification = 0.073.
Sensitivity Specificity
91% 93%
Table 2 Sensitivity and Specificity for Forward Regression Model
2.2 Stepwise Regression
Stepwise regression combines elements of both the forward and backward selection procedures. The only variable selected is SELF_INIT_CNTCTS:
Logit (Enroll = 1) = -2.956 + 1.030(SELF_INIT_CNTCTS)
With every unit increase in SELF_INIT_CNTCTS, the odds of enrollment are multiplied by exp(1.030) ≈ 2.80, holding everything else constant.
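The fitted equation can be applied directly; this Python sketch (function name is ours) shows where the 2.80 odds multiplier comes from and how the logit maps to an enrollment probability:

```python
import math

def enroll_probability(self_init_cntcts):
    """Probability of enrollment from the fitted stepwise model:
    logit(p) = -2.956 + 1.030 * SELF_INIT_CNTCTS."""
    logit = -2.956 + 1.030 * self_init_cntcts
    return 1.0 / (1.0 + math.exp(-logit))

# One more self-initiated contact multiplies the odds by exp(1.030).
odds_ratio = math.exp(1.030)
p0, p3 = enroll_probability(0), enroll_probability(3)
```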
Model Assessment: Validation misclassification rate = 0.106 and Training misclassification rate = 0.113.
Sensitivity Specificity
89.7% 88.9%
Table 3 Sensitivity and Specificity for Stepwise Regression
Conclusion: Out of two regression models, based on validation misclassification rate, Forward regression gives
better results.
3. Neural Network
In a neural network, the prediction formula is similar to a regression but with flexible additions. Neural networks do not easily perform input selection, so we used the variables selected by forward regression as the inputs to the neural network model.
3.1 Neural Network
The default network with 3 hidden units did not meet the convergence criterion, so we reduced the number of hidden units to 2, at which point the criterion was satisfied. This gave a validation misclassification rate of 0.071 and a training misclassification rate of 0.072.
Sensitivity Specificity
94.77% 91.15%
Table 4 Sensitivity and Specificity for Neural Network
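The shape of the final network, one hidden layer with 2 tanh units feeding a logistic output, can be sketched as a forward pass. All weights below are invented for illustration; the real ones are estimated by the Neural Network node during training:

```python
import math

def mlp_predict(x, W1, b1, w2, b2):
    """Forward pass: two tanh hidden units, then a logistic output unit,
    mirroring the architecture of our converged network."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(W1, b1)]
    z = sum(wi * hi for wi, hi in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))

# Two standardized inputs and made-up weights.
p = mlp_predict([0.5, -1.0],
                W1=[[1.0, 0.3], [-0.4, 0.8]], b1=[0.1, -0.2],
                w2=[1.5, -2.0], b2=0.0)
```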
3.2 Auto Neural
The Auto Neural tool automatically explores alternative network architectures, fitting neural networks with increasing hidden-unit counts. The output summarizes the training process: for each step, fit statistics are shown from the iteration with the smallest validation misclassification. Refer to Appendix C for the results.
Model Comparison
The Model Comparison node evaluates the candidate models against a range of fit statistics.
It is apparent from the summary of fit statistics that the Decision Tree ranks as the best model, with the lowest training and validation misclassification rates; its validation misclassification rate is 5.12%. We used this model to score the SCORE_DATA set.
Scoring
Since our training data are balanced, the raw prediction estimates reflect the target proportion in the training sample rather than in the population, making the score-ranking plots inaccurate and misleading. To correct for this separate sampling, we adjusted the prior probability of the primary outcome to 3.1%.
Table 6 Scored Data Summary for Target Variable
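The prior adjustment reweights the model's posterior by the ratio of population priors to sample proportions. A Python sketch, assuming the training sample was balanced 50/50 (our case) and the population prior for enrollment is 3.1%:

```python
def adjust_for_priors(p_model, prior=0.031, sample_rate=0.5):
    """Convert a posterior estimated on a separately sampled (balanced)
    training set back to the population scale."""
    num = p_model * prior / sample_rate
    den = num + (1 - p_model) * (1 - prior) / (1 - sample_rate)
    return num / den

# On a balanced sample, a model score of 0.5 maps back to the 3.1% prior.
p_adjusted = adjust_for_priors(0.5)
```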
Segmentation
The administration is interested in increasing enrollment, diversity, and SAT scores. Clustering supports this by dividing the data set into mutually exclusive groups with varied diversity; it also identifies the group of prospective students with high SAT scores, so the administration can focus its marketing strategy on that target group.
The node chose 3 clusters; the relative size of each cluster is shown in the pie chart below.
Figure 1 Segmentation Pie Chart
Model                 Validation Misclassification Rate
Decision Tree         0.0512
Auto Neural           0.0663
Neural Network        0.0714
Forward Regression    0.0786
Stepwise Regression   0.1067
Table 5 Model Comparisons
DISTANCE was found to be the most important factor differentiating the three segments.
Cluster 1: SOLICITED_CNTCTS and INSTATE are the most important predictor variables. Avg. SAT score = 1143.3; the male-to-female split is 59%/41%.
Cluster 2: DISTANCE and INSTATE are the most important predictor variables. Avg. SAT score = 1094.9; the male-to-female split is 60%/40%.
Cluster 3: INSTATE and INT2RAT are the most important predictor variables. Avg. SAT score = 1127.0; the male-to-female split is 66%/34%.
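The assignment step behind such a segmentation can be sketched as a k-means-style nearest-center rule; the Cluster node's actual algorithm and distance settings may differ, and the points and centers below are invented:

```python
def assign_clusters(points, centers):
    """One k-means assignment step: each applicant record goes to the
    nearest of the k cluster centers (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            for p in points]

# Three toy 2-D records against three centers (k = 3, as in our solution).
labels = assign_clusters([(0, 0), (10, 10), (0, 1)],
                         [(0, 0), (10, 10), (20, 20)])
```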
Conclusion
To raise the SAT scores of prospective enrollees, the administration should target Cluster 1 students, as they have the highest average score. This group is also diverse in terms of both ethnicity and sex: even though ethnicity C dominates the data, Cluster 1 contains people from almost every ethnicity in some proportion, and its gender split is approximately even, with 59% males and 41% females.
Figure 2 Modeling Diagram
Based on the best-fitting model, the Decision Tree, the most important independent variables are SELF_INIT_CNTCTS and HSCRAT. This suggests the administration should focus on students who initiated contact themselves and who come from the high schools with the highest enrollment over the last five years. If the administration offered these students a special welcome kit or a similar welcome offer, it could convert more of them into enrollments.
Appendix A: Data Exploration

Variable             Standard Deviation   Skewness     Kurtosis      Missing Values
Enroll               0.500485             1.10E-16     -2.0007756    0
TOTAL_CONTACTS       3.480081             1.0517156    0.8543522     0
SELF_INT_CNTCTS      3.0988946            0.8850357    0.305132      0
TRAVEL_INT_CNTCTS    0.6702278            1.6645745    3.673759      0
SOLICITED_CNTCTS     0.7613853            1.8541231    8.1394438     0
REFERRAL_CNTCTS      0.288625             5.9594486    50.9889296    0
CAMPUS_VISIT         0.3774713            2.3172482    4.4588036     0
SATSCORE             151.4914425          -0.2413784   0.462179      1887
SEX                  0.4860401            -0.4837916   -1.7666479    127
MAILQ                1.6001673            -0.8884705   -0.9904749    0
TELECQ               0.8074666            0.9573248    0.7499086     3055
PREMIERE             0.4094565            1.4024778    -0.033069     0
INTEREST             0.4118758            2.3363909    5.0932373     0
STUEMAIL             0.4379744            -1.1022225   -0.7854101    0
INIT_SPAN            9.1778057            -2.4740469   84.078123     0
INT1RAT              0.0358866            9.3802148    207.82226     0
INT2RAT              0.039164             7.2199216    111.921867    0
HSCRAT               0.1457441            4.4813438    23.4212246    0
AVG_INCOME           23083.61             0.9640883    0.9048331     763
DISTANCE             370.781848           2.580719     10.6365752    671
Table 7 Data Exploration for Given Data Set
Table 8 Skewness and Kurtosis Results after Filtering
Table 9 Filtering Results for Class Variables
Table 10 Filtering Results for Interval Variables
Figure 3 Excluded Observations after Filtering
Figure 4 Partition Summaries
Appendix B: Clustering
Table 11 Ethnicity distributions for each Segment
Table 12 Segmentation Report
1. Segment 1:
Table 13 Variable Worth for Segment 1
Table 14 Worth Plot for Segment 1
2. Segment 2:
Table 15 Variable worth for Segment 2
Figure 5 Worth Plot for Segment 2
3. Segment 3:
Table 16 Variable Worth for Segment 3
Figure 6 Worth Plot for Segment 3
Table 17 Overall Variable Importance in Clustering
Appendix C: Data Modeling
1. Decision Tree
Figure 7 Subtree Assessment Plot
Table 18 Fit Statistics for Decision Tree
Figure 8 English Rules for Decision Tree
Figure 9 Decision Tree
2. Forward Regression
Table 19 Fit Statistics for Forward Regression
Figure 10 Model Iteration Plot for Forward Regression
3. Stepwise Regression
Table 20 Fit Statistics Report for Stepwise Regression
Figure 11 Model Iteration Plot for Stepwise Regression
Figure 12 Summary of Stepwise Selection
4. Neural Network:
Table 21 Fit Statistics Table for Neural Network
Figure 13 Optimization Summaries for Neural Network
5. Auto Neural
Table 22 Fit Statistics for Auto Neural Network
Appendix D: Model Comparison
Table 23 Model Comparison Summaries
Figure 14 Sensitivity and Specificity of Models
Appendix E: Scored Data
Table 24 Scored Data Summary for Target Variable
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 

Recently uploaded (20)

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 

Data Mining using SAS

  • 1. Group Assignment Data Mining MKTG 5963 Abhinav Garg (11761380) Tanu Srivastav (11772446) Tejbeer Chhabra (11756746) Maunik Desai (11758140) Maanasa Nagaraja (11678486)
  • 2. Table of Contents Executive Summary ..... 1 Data Audit ..... 1 Modeling ..... 2 Model Comparison ..... 4 Scoring ..... 4 Segmentation ..... 4 Conclusion ..... 5 Appendix A: Data Exploration ..... i Appendix B: Clustering ..... iii Appendix C: Data Modeling ..... vi Appendix D: Model Comparison ..... xi Appendix E: Scored Data ..... xii List of Tables Table 1 Variable Worth in Clusters ..... 2 Table 2 Sensitivity and Specificity for Forward Regression Model ..... 3 Table 3 Sensitivity and Specificity for Stepwise Regression ..... 3 Table 4 Sensitivity and Specificity for Neural Network ..... 3 Table 5 Model Comparisons ..... 4 Table 6 Scored Data Summary for Target Variable ..... 4
  • 3. MKTG 5963 Data Mining Group Assignment 1 Executive Summary Diversity and SAT scores play an important role in creating a better learning environment and a good college experience for students. Diversity enriches the educational experience and promotes personal growth, while SAT score is a useful predictor of college academic performance. In our analysis, we aim to identify prospective students who are most likely to enroll as new freshmen in Fall 2005. We also propose a marketing strategy the administration can use to increase diversity and SAT scores. Data Audit Before performing data modeling, it is critical to explore the data for interesting insights. 1. DMDB Node The DMDB tool gave us quick insights into the data in the form of summary statistics for numerical variables, the number of categories for class variables, and the extent of missing values. From the results, it is apparent that the categorical variables have no missing values, while the interval variables do; DISTANCE, HSCRAT, INIT_SPAN, INT1RAT, and INT2RAT exhibit non-normal behavior, which can introduce bias. 2. Data Reduction The variables ACADEMIC_INTEREST_1 and ACADEMIC_INTEREST_2 have counterparts in INT1RAT and INT2RAT, respectively; similarly, IRSCHOOL was converted into HSCRAT. TELECQ had more than 50% missing values. TOTAL_CONTACTS is simply the sum of the other contact counts. CONTACT_CODE1 has hundreds of levels, and such a code does not provide much information. For these reasons, ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, IRSCHOOL, TELECQ, TOTAL_CONTACTS, and CONTACT_CODE1 were removed from the dataset. 3. Missing Value Imputation Since our interval variables had many missing values, we used the PROC MI procedure to impute them rather than traditional imputation methods, which can create unknown biases in the data.
The PROC MI procedure supports both finding the patterns of missing data and imputation: it simulates multiple complete datasets from the original data by repeatedly replacing missing entries with imputed values. 4. Data Filter Node Extreme values are problematic because they may have undue influence on the model. We handled them by excluding observations containing outliers or other extreme values that we did not want in the model. This also improves skewness and brings the variables closer to a normal distribution. The filtering methods used for the interval and class variables were Standard Deviations from the Mean and Rare Values (Percentage), respectively. 5. Data Partitioning Before building our models, we split the data into training (70%) and validation (30%) sets. We chose a 70–30 split because it strikes a good balance for honest assessment, and it preserved a proportion of our target similar to that in the original dataset. A summary of the split is provided in the appendix.
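The stratified 70–30 partition described above can be sketched in plain Python. This is only an illustration of the idea (the report itself used the SAS Enterprise Miner Data Partition node); the `ENROLL` toy data and the `stratified_split` helper are hypothetical, assuming the goal is to preserve the target proportion in both partitions.

```python
import random
from collections import defaultdict

def stratified_split(rows, target_key, train_frac=0.7, seed=42):
    """Split rows ~70/30 while preserving the target class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)
    train, valid = [], []
    for members in by_class.values():
        rng.shuffle(members)                     # randomize within each class
        cut = int(round(len(members) * train_frac))
        train.extend(members[:cut])              # 70% of this class to training
        valid.extend(members[cut:])              # remaining 30% to validation
    return train, valid

# Toy data: 10% positive class, mimicking an imbalanced enrollment target.
data = [{"ENROLL": 1 if i % 10 == 0 else 0} for i in range(1000)]
train, valid = stratified_split(data, "ENROLL")
```

Because the split is done per class, both partitions carry the same 10% positive rate as the full toy dataset, which is the property the report relies on for honest assessment.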
  • 4. MKTG 5963 Data Mining Group Assignment 2 6. Data Transformation Data transformation corrects for skewed distributions of the numerical input variables and for a large number of classes in the categorical variables. From the skewness and kurtosis values obtained after filtering, the independent variables exhibit approximately normal distributions. We performed data transformation using “Maximum Normal” for the independent variables, a power transformation from the Box–Cox family, to analyze its effectiveness in reducing skewness and kurtosis. Although the skewness values dropped, the decrease is not large enough to justify this methodology. For instance, HSCRAT shows a skewness of 2.64, and after the log transformation suggested by Maximum Normal, the skewness is still 1.9. Moreover, transformed variables come at a cost: they are complicated to interpret (log, square root), especially in a business scenario. Therefore, we chose not to perform any transformation. Modeling We used Decision Tree, Forward and Stepwise Regression, Neural Network, and Auto Neural modeling techniques. 1. Decision Trees The decision tree is a commonly used data mining method for building classification systems based on multiple covariates and for developing prediction algorithms for a target variable. A split-search algorithm facilitates input selection, and model complexity is addressed by pruning. The settings used for the decision tree node are Maximum Branch = 2, Maximum Depth = 6, and Minimum Leaf Size = 5, with the assessment method and misclassification as the assessment measure.
Below is the variable importance report: Variable Name Importance SELF_INIT_CNTCTS 1.0000 HSCRAT 0.3798 STUEMAIL 0.2767 INIT_SPAN 0.1404 MAILQ 0.0816 INTEREST 0.0698 INT1RAT 0.0638 Table 1 Variable Worth in Clusters Model Assessment: Validation Misclassification Rate = 0.0512 and Training Misclassification Rate = 0.058. SELF_INIT_CNTCTS was found to be the most important variable in determining the enrollment decision of a prospective student, and the minimum misclassification rate was found at Number of Leaves = 12. 2. Regression Since our dependent variable ENROLL is a binary categorical variable, the type of regression chosen is logistic. 2.1 Forward Regression Forward regression creates a sequence of models of increasing complexity. At each step, every variable not already in the model is tested for inclusion, and the most significant of these is added, as long as its p-value is below SLENTRY = 0.05. The variables selected are SELF_INIT_CNTCTS, STUEMAIL, HSCRAT, INIT_SPAN, DISTANCE, SATSCORE, MAILQ, and INT2RAT. We also studied interaction effects through the regression model. The significant interactions are:
  • 5. MKTG 5963 Data Mining Group Assignment 3  CAMPUS_VISIT * PREMIERE  REFERRAL_CNTCTS * INTEREST  INTEREST * PREMIERE  INSTATE * MAILQ  TRAVEL_INIT_CNTCTS * INTEREST  TERRITORY * STUEMAIL Model Assessment: Validation Misclassification Rate = 0.0786 and Training Misclassification Rate = 0.073. Sensitivity Specificity 91% 93% Table 2 Sensitivity and Specificity for Forward Regression Model 2.2 Stepwise Regression Stepwise regression combines elements of both the forward and backward selection procedures. The final variable selected is SELF_INIT_CNTCTS: Logit(Enroll = 1) = -2.956 + 1.030(SELF_INIT_CNTCTS) With every unit increase in SELF_INIT_CNTCTS, the odds of enrollment increase by a factor of 2.802, holding everything else constant. Model Assessment: Validation Misclassification Rate = 0.106 and Training Misclassification Rate = 0.113. Sensitivity Specificity 89.7% 88.9% Table 3 Sensitivity and Specificity for Stepwise Regression Conclusion: Of the two regression models, forward regression gives the better validation misclassification rate. 3. Neural Network In a neural network, the prediction formula is similar to a regression but with flexible additions. Neural networks do not easily handle input selection, so we used the variables selected by forward regression as inputs to the neural network model. 3.1 Neural Network The convergence criterion was not met with the default setting of 3 hidden units, so we reduced the number of hidden units to 2, with which the convergence criterion was satisfied. This gave a validation misclassification rate of 0.071 and a training misclassification rate of 0.072. Sensitivity Specificity 94.77% 91.15% Table 4 Sensitivity and Specificity for Neural Network 3.2 Auto Neural The Auto Neural tool offers an automatic way to explore alternative network architectures with increasing hidden-unit counts.
The block of output below summarizes the training process; fit statistics from the iteration with the smallest validation misclassification are shown for each step. Refer to Appendix C for the results.
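The odds interpretation of the stepwise model follows directly from the fitted logit above. A quick numeric check in plain Python, using the reported coefficients (the `p_enroll` helper is an illustration, not part of the original SAS output):

```python
import math

# Intercept and SELF_INIT_CNTCTS coefficient from the stepwise model.
b0, b1 = -2.956, 1.030

# The odds ratio for a one-unit increase is exp(coefficient),
# which reproduces the 2.802 reported (to rounding of the coefficient).
odds_ratio = math.exp(b1)

def p_enroll(x):
    """Predicted probability of enrollment for x self-initiated contacts."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
```

With zero self-initiated contacts the predicted enrollment probability is only about 5%, and it crosses 50% at roughly three contacts, which is consistent with SELF_INIT_CNTCTS dominating the tree's variable importance as well.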
  • 6. MKTG 5963 Data Mining Group Assignment 4 Model Comparison The Model Comparison node helps us evaluate which model is best in terms of various fit statistics. From the summary of the fit statistics, it is apparent that the Decision Tree ranks as the best model, with the lowest training and validation misclassification rates; its validation misclassification rate is 5.12%. We therefore used this model to score the SCORE_DATA set. Scoring Since our modeling data is balanced, the prediction estimates reflect the target proportion in the training sample rather than in the population, so score ranking plots would be inaccurate and misleading. To fix this, we adjusted for separate sampling by setting the prior probability of the primary outcome to 3.1%. Table 6 Scored Data Summary for Target Variable Segmentation The administration is interested in increasing enrollment, diversity, and SAT scores. The best way to do this is clustering, which divides the data set into mutually exclusive groups with varied diversity. This also lets us identify the group of prospective students with high SAT scores so the administration can focus its marketing strategy on that target group. The node chose 3 clusters; the relative size of each cluster is shown in the pie chart below. Figure 1 Segmentation Pie Chart Model Comparison Validation Misclassification Rate Decision Tree 0.0512 Auto Neural 0.0663 Neural Network 0.0714 Forward Regression 0.0786 Stepwise Regression 0.1067 Table 5 Model Comparisons
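The prior adjustment for separate sampling can be written as the standard rescaling of posteriors from the training mix to the population prior. A minimal sketch, assuming the model was trained on a balanced (50/50) sample and the true prior for the primary outcome is the 3.1% stated above; Enterprise Miner applies this correction internally when priors are specified, so the function below is only illustrative:

```python
def adjust_posterior(p, prior=0.031, train_prop=0.5):
    """Rescale a model posterior from the oversampled training mix
    (train_prop positives) to the population prior (prior positives)."""
    num = p * prior / train_prop
    den = num + (1.0 - p) * (1.0 - prior) / (1.0 - train_prop)
    return num / den
```

A useful sanity check of the formula: a score of 0.5 on the balanced sample (i.e., no evidence either way) maps back to exactly the 3.1% population prior.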
  • 7. MKTG 5963 Data Mining Group Assignment 5 DISTANCE was found to be the most important factor in differentiating the three segments. Cluster 1: SOLICITED_CNTCTS and INSTATE are the most important predictor variables. Avg. SAT score = 1143.3; the male-to-female proportion is 59%–41%. Cluster 2: DISTANCE and INSTATE are the most important predictor variables. Avg. SAT score = 1094.9; the male-to-female proportion is 60%–40%. Cluster 3: INSTATE and INT2RAT are the most important predictor variables. Avg. SAT score = 1127.0; the male-to-female proportion is 66%–34%. Conclusion To increase the SAT score of prospective enrollees, the administration should target cluster 1 students, as they have the highest average score. This group is also diverse in terms of both ethnicity and sex: even though Ethnicity C dominates the data, cluster 1 includes nearly every ethnicity in some proportion, and it has a roughly balanced mix of males (59%) and females (41%). Figure 2 Modeling Diagram Based on the best-fitting model, the Decision Tree, the most important independent variables are SELF_INIT_CNTCTS and HSCRAT. This suggests that the administration should focus on students who themselves initiated contact and who come from high schools with the highest enrollment over the last five years. If the administration offered a special welcome kit or similar welcome offer to these students, it could convert more enrollments.
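The segmentation step above can be sketched with a minimal k-means on a single numeric feature such as DISTANCE. The actual analysis used the Enterprise Miner Cluster node, so this pure-Python `kmeans_1d` (with a deterministic spread initialization rather than random seeding) and the toy distance values are purely illustrative:

```python
def kmeans_1d(values, k=3, iters=20):
    """Minimal 1-D k-means: returns final centroids and a cluster label per value."""
    vs = sorted(values)
    # Deterministic spread initialization: evenly spaced order statistics.
    centroids = [vs[int(i * (len(vs) - 1) / (k - 1))] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: recompute each centroid as its cluster's mean.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels

# Three well-separated distance bands should recover three segments.
dist = [5, 6, 7, 100, 110, 120, 500, 520, 540]
centroids, labels = kmeans_1d(dist, k=3)
```

On this toy data the three bands are recovered exactly, mirroring how DISTANCE separates the report's three segments; real segmentation would of course use all the standardized inputs, not one feature.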
  • 8. MKTG 5963 Data Mining Group Assignment i Appendix A : Data Exploration Variable Standard Deviation Skewness Kurtosis Missing Values Enroll 0.500485 1.10E-16 -2.0007756 0 TOTAL_CONTACTS 3.480081 1.0517156 0.8543522 0 SELF_INT_CNTCTS 3.0988946 0.8850357 0.305132 0 TRAVEL_INT_CNTCTS 0.6702278 1.6645745 3.673759 0 SOLICITED_CNTCTS 0.7613853 1.8541231 8.1394438 0 REFERRAL_CNTCTS 0.288625 5.9594486 50.9889296 0 CAMPUS_VISIT 0.3774713 2.3172482 4.4588036 0 SATSCORE 151.4914425 -0.2413784 0.462179 1887 SEX 0.4860401 -0.4837916 -1.7666479 127 MAILQ 1.6001673 -0.8884705 -0.9904749 0 TELECQ 0.8074666 0.9573248 0.7499086 3055 PREMIERE 0.4094565 1.4024778 -0.033069 0 INTEREST 0.4118758 2.3363909 5.0932373 0 STUEMAIL 0.4379744 -1.1022225 -0.7854101 0 INIT_SPAN 9.1778057 -2.4740469 84.078123 0 INT1RAT 0.0358866 9.3802148 207.82226 0 INT2RAT 0.039164 7.2199216 111.921867 0 HSCRAT 0.1457441 4.4813438 23.4212246 0 AVG_INCOME 23083.61 0.9640883 0.9048331 763 DISTANCE 370.781848 2.580719 10.6365752 671 Table 7 Data Exploration for Given Data set Table 8 Skewness and Kurtosis Results after Filtering
  • 9. MKTG 5963 Data Mining Group Assignment ii Table 9 Filtering Results for Class Variables Table 10 Filtering Results for Interval Variables Figure 3 Excluded Observations after Filtering Figure 4 Partition Summaries
  • 10. MKTG 5963 Data Mining Group Assignment iii Appendix B: Clustering Table 11 Ethnicity distributions for each Segment Table 12 Segmentation Report 1. Segment 1: Table 13 Variable Worth for Segment 1
  • 11. MKTG 5963 Data Mining Group Assignment iv Table 14 Worth Plot for Segment 1 2. Segment 2: Table 15 Variable worth for Segment 2 Figure 5 Worth Plot for Segment 2
  • 12. MKTG 5963 Data Mining Group Assignment v 3. Segment 3: Table 16 Variable Worth for Segment 3 Figure 6 Worth Plot for Segment 3 Table 17 Overall variable importances in Clustering
  • 13. MKTG 5963 Data Mining Group Assignment vi Appendix C: Data Modeling 1. Decision Tree Figure 7 Subtree Assessment Plot Table 18 Fit Statistics for Decision Tree
  • 14. MKTG 5963 Data Mining Group Assignment vii Figure 8 English Rules for Decision Tree Figure 9 Decision Tree
  • 15. MKTG 5963 Data Mining Group Assignment viii 2. Forward Regression Table 19 Fit Statistics for Forward Regression Figure 10 Model Iteration Plot for Forward Regression
  • 16. MKTG 5963 Data Mining Group Assignment ix 3. Stepwise Regression Table 20 Fit Statistics Report for Stepwise Regression Figure 11 Model Iteration Plot for Stepwise Regression
  • 17. MKTG 5963 Data Mining Group Assignment x Figure 12 Summary of Stepwise Selection 4. Neural Network: Table 21 Fit Statistics Table for Neural Network Figure 13 Optimization Summaries for Neural Network
  • 18. MKTG 5963 Data Mining Group Assignment xi 5. AUTO NEURAL Table 22 Fit Statistics for Auto Neural Network Appendix D: MODEL COMPARISON Table 23 Model Comparison Summaries
  • 19. MKTG 5963 Data Mining Group Assignment xii Figure 14 Sensitivity and Specificity of Models Appendix E: Scored Data Table 24 Scored Data Summary for Target Variable