A New Method To Deal With 2 Level Variables in Big Data Analysis
John Gao, Harriot Jesse and Lu Fang
Business Intelligence, Endurance Group International
In big data, variables often come at multiple levels. Here we mainly study
data with 2 levels, or 2 layers (a child level and a parent level). In the
medical area, for example, the two levels are the patient and the hospital.
The patient's outcome is strongly related to
• the health condition of the individual patient (child level)
• hospital and doctor service quality (parent level).
Another example is leads (or trialers) and sales reps. The trialers or leads
(regarded as the child level) have their own characteristics, such as geographic
location and activity patterns on the web site. The sales reps (regarded as the
parent level) try to help them convert into paid customers. The conversion of
these trialers therefore depends both on the trialers themselves and on the
sales reps' performance.
The most common method, which is also the simplest, is to dump all variables
collected from all data sources into statistical software and apply a machine
learning or statistical analysis method, such as stepwise selection, to find
the drivers of the binary outcome at the child level. These drivers may come
from both levels or from the child level only. In addition, such a one-step
analysis takes a long time to run when the data file has hundreds or thousands
of variables, and the important drivers at the parent level are often not
captured.
This paper presents a new approach for big data with 2 levels, where the outcome
is binary at the child level. The first step is to develop a predictive model at
the child level for the binary outcome, such as re-admission for patients or
conversion for trialers. The aggregation of the predicted probabilities over the
children of each parent is the expected, or natural, overall outcome for that
parent. The aggregation of the actual outcomes over the same children may not
equal this expected outcome, and the difference between the actual and expected
outcomes is attributed to the parent's performance. Therefore, the second step
is to identify the impact of the parent's performance on the child outcome. In
this step the dependent variable is continuous: it is built from the ratio of
the aggregated actual outcomes to the aggregated expected probabilities over all
children of a parent. The second-step analysis then identifies the drivers among
the parent-level variables.
For example, let $Z$ be the binary outcome, with the independent variables
$X_1,\dots,X_n$
$Y_1,\dots,Y_m$
Here $X_j$ ($j=1,\dots,n$) are the variables at the child level, with N individual children, and
$Y_k$ ($k=1,\dots,m$) are the variables at the parent level, with M individual parents.
The first step is to estimate the probability of the binary outcome from the
variables at the child level, that is, to model
$0 \le E(Z \mid X_1,\dots,X_n) \le 1$
using logistic regression with the SAS procedure
proc logistic data=train;
/* event='1' assumes Z is coded 0/1; without it SAS models the lower level */
model Z(event='1') = x1-xn /
selection=stepwise sls=0.05 sle=0.05 details include=0;
run;
This gives $Z = E(Z \mid X_1,\dots,X_n) + \varepsilon = \hat{Z} + \varepsilon$ with $\varepsilon \sim N(0,\sigma^2)$.
Here the logit function is $\beta_0 + \beta_1 x'_1 + \dots + \beta_{n'} x'_{n'}$ and
$$\hat{Z} = \frac{e^{\beta_0 + \beta_1 x'_1 + \dots + \beta_{n'} x'_{n'}}}{1 + e^{\beta_0 + \beta_1 x'_1 + \dots + \beta_{n'} x'_{n'}}}$$
is called the predicted probability of the binary outcome from the
drivers $X'_1,\dots,X'_{n'}$ identified by the logistic regression procedure at the child level.
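As an illustration of how the predicted probability follows from the logit, here is a minimal Python sketch (Python is used only for illustration; the paper's workflow is in SAS). The intercept and coefficient values are hypothetical and do not come from any fitted model.

```python
import math

def predicted_probability(intercept, coefs, xs):
    """Return Z_hat = e^eta / (1 + e^eta), where eta = intercept + sum(coef * x)."""
    eta = intercept + sum(b * x for b, x in zip(coefs, xs))
    # Use the algebraically equivalent form that avoids overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical coefficients for two selected child-level drivers x'_1, x'_2.
beta0, betas = -1.0, [0.8, -0.5]
z_hat = predicted_probability(beta0, betas, [1.5, 2.0])
print(round(z_hat, 4))
```

The result always lies strictly between 0 and 1, which is the constraint on E(Z | X) stated earlier.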
Then each individual parent k with K children has the average value of the actual
binary outcomes
$$\bar{\alpha}_k = \frac{1}{K}\sum_{i=1}^{K} Z_i$$
However, the mean value of the predicted outcome from the individual child
predictions in the previous logistic regression is given by
$$\bar{A}_k = \frac{1}{K}\sum_{i=1}^{K} \hat{Z}_i$$
which is called the expected outcome from the predictive model at the child level.
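The per-parent actual and expected averages can be sketched in a few lines of Python. This is an illustrative sketch only: the parent identifiers, outcomes, and predicted probabilities below are made up, and `parent_averages` is a hypothetical helper name.

```python
from collections import defaultdict

def parent_averages(rows):
    """rows: iterable of (parent_id, z_actual, z_hat).

    Returns {parent_id: (actual_mean, expected_mean, n_children)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for parent_id, z, z_hat in rows:
        acc = sums[parent_id]
        acc[0] += z        # running sum of actual binary outcomes
        acc[1] += z_hat    # running sum of predicted probabilities
        acc[2] += 1        # number of children seen for this parent
    return {p: (a / n, e / n, n) for p, (a, e, n) in sums.items()}

# Made-up child rows: (parent, actual outcome Z, predicted probability Z_hat).
rows = [
    ("H1", 1, 0.30), ("H1", 0, 0.20), ("H1", 1, 0.50), ("H1", 0, 0.20),
    ("H2", 0, 0.40), ("H2", 0, 0.30), ("H2", 1, 0.50),
]
averages = parent_averages(rows)
for parent, (actual, expected, n) in averages.items():
    print(parent, n, round(actual, 3), round(expected, 3))
```

In this toy data parent H1 does better than its children's predictions suggest, while H2 does worse.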
We can define the random variable $S_k = (\bar{\alpha}_k - \bar{A}_k)/\bar{A}_k$ ($k=1,\dots,M$), which is the
relative difference between the actual average outcome and the prediction from
the information at the child level. This random variable is directly related to
the independent variables $Y_1,\dots,Y_m$ at the parent level. Note that S is a
continuous dependent variable. Therefore, we are going to study the model
$$S - (b_0 + b_1 y'_1 + \dots + b_{n'} y'_{n'}) \sim N(0,\sigma^2)$$
by using the linear regression procedure in SAS
proc reg data=train2;
model S = Y1-Ym /
selection=stepwise sls=0.05 sle=0.05 details include=0;
run;
The linear regression gives the estimate of the random variable S,
$$\hat{S} = b_0 + b_1 y'_1 + \dots + b_{n'} y'_{n'}$$
If $b_1 = b_2 = \dots = b_{n'} = 0$, then the binary outcome is independent of the
parent level. If not, the binary outcome is also related to the variables at the
parent level.
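The second step can be illustrated with a small Python sketch that computes S for each parent and fits an ordinary least squares line on a single parent-level variable. This is a simplified stand-in for the stepwise PROC REG run, with no variable selection and no significance tests; the data and the helper names `relative_gap` and `simple_ols` are hypothetical.

```python
def relative_gap(actual_mean, expected_mean):
    """S = (actual - expected) / expected for one parent."""
    return (actual_mean - expected_mean) / expected_mean

def simple_ols(predictor, response):
    """Least-squares intercept and slope for one predictor."""
    n = len(predictor)
    p_bar = sum(predictor) / n
    r_bar = sum(response) / n
    sxx = sum((p - p_bar) ** 2 for p in predictor)
    sxy = sum((p - p_bar) * (r - r_bar) for p, r in zip(predictor, response))
    slope = sxy / sxx
    return r_bar - slope * p_bar, slope

# Made-up parents: (actual mean, expected mean, parent-level variable y1).
parents = [
    (0.50, 0.40, 1.0),
    (0.30, 0.30, 2.0),
    (0.15, 0.20, 3.0),
    (0.20, 0.40, 4.0),
]
s = [relative_gap(a, e) for a, e, _ in parents]
y1 = [y for _, _, y in parents]
b0, b1 = simple_ols(y1, s)
s_hat = [b0 + b1 * y for y in y1]   # fitted S for each parent
print(round(b0, 3), round(b1, 3))
```

A slope close to zero would suggest that this parent-level variable does not explain the gap between actual and expected outcomes; a real analysis would test that formally rather than inspect the estimate.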
Then we develop the final model by using variables at both the child
level and the parent level. There are several ways to build the final model. The
first is to combine the results of the 2 level models by using logistic regression with
$0 \le E(Z \mid \hat{Z}, \hat{S}) \le 1$
and
proc logistic data=train3;
/* Z_hat and S_hat are the dataset variables holding the two model scores */
model Z(event='1') = Z_hat S_hat /
selection=stepwise sls=0.05 sle=0.05 details include=0;
run;
or we can use the logit function
$\text{logit} = \beta_0 + \beta_1 x'_1 + \dots + \beta_{n'} x'_{n'}$
and
$\hat{S} = b_0 + b_1 y'_1 + \dots + b_{n'} y'_{n'}$
as independent variables
proc logistic data=train3;
model Z(event='1') = logit S_hat /
selection=stepwise sls=0.05 sle=0.05 details include=0;
run;
or we can use a nonlinear approach, entering both the logit and the predicted probability:
proc logistic data=train3;
model Z(event='1') = logit Z_hat S_hat /
selection=stepwise sls=0.05 sle=0.05 details include=0;
run;
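The final-model input has one row per child, carrying the child-level score and the fitted parent-level score of that child's parent. The Python sketch below shows one way such a table could be assembled; the column names (`Z_hat`, `logit`, `S_hat`), the identifiers, and the values are illustrative assumptions, not the layout of the paper's train3 dataset.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def build_final_rows(child_rows, s_hat_by_parent):
    """child_rows: (child_id, parent_id, z, z_hat); returns one dict per child."""
    out = []
    for child_id, parent_id, z, z_hat in child_rows:
        out.append({
            "child_id": child_id,
            "parent_id": parent_id,
            "Z": z,
            "Z_hat": z_hat,                        # child-level predicted probability
            "logit": logit(z_hat),                 # child-level linear predictor
            "S_hat": s_hat_by_parent[parent_id],   # fitted parent-level gap
        })
    return out

# Made-up inputs.
child_rows = [
    ("c1", "H1", 1, 0.30),
    ("c2", "H1", 0, 0.50),
    ("c3", "H2", 0, 0.40),
]
s_hat_by_parent = {"H1": 0.25, "H2": -0.10}
final_rows = build_final_rows(child_rows, s_hat_by_parent)
for row in final_rows:
    print(row["child_id"], round(row["logit"], 3), row["S_hat"])
```

Every child of the same parent receives the same `S_hat`, so the parent-level effect enters the final model as a shared adjustment.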
The final model predicts the binary outcome by weighting the child's
information and the parent's information. Equivalently, the final model can be
regarded as a modification of the previous child-level model by the parent's
variables. The difference between the final model score and the previous model
score can be regarded as the impact of the parent's drivers on the individual
child's binary outcome. This approach has been adopted by NHA (National
Healthcare Association) for the evaluation of post-acute care nursing homes on
re-admission of Medicare and Medicaid patients. We have also used this approach
to evaluate our partners' performance in customer retention.
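The parent impact described above, the final model score minus the child-level score, can be sketched as follows. The sketch assumes the final-model variant that uses the logit and the fitted S as predictors; the coefficients `g0`, `g_logit`, and `g_s` are hypothetical placeholders, not estimates from any real data.

```python
import math

def sigmoid(eta):
    """Logistic function, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def parent_impact(z_hat, s_hat, g0, g_logit, g_s):
    """Final model score minus child-level score for one child."""
    child_logit = math.log(z_hat / (1.0 - z_hat))
    final_score = sigmoid(g0 + g_logit * child_logit + g_s * s_hat)
    return final_score - z_hat

# Hypothetical final-model coefficients.
impact_good = parent_impact(0.30, 0.25, 0.0, 1.0, 0.8)    # parent above expectation
impact_poor = parent_impact(0.30, -0.25, 0.0, 1.0, 0.8)   # parent below expectation
impact_none = parent_impact(0.30, 0.25, 0.0, 1.0, 0.0)    # parent term switched off
print(round(impact_good, 4), round(impact_poor, 4), round(impact_none, 4))
```

When the parent coefficient is zero and the logit enters with coefficient one, the final score reduces to the child-level score and the impact is zero.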

SAS_paper_2016_SESUG
