2. Syllabus
• Problem definition
• Analysis method
• Data mining process
• Baseline regression
• Dimension reduction
• Tree-based models
• Evaluation
3. How to analyze the dataset
Problem Definition → Data Gathering & Preparation → Model Building & Evaluation → Knowledge Deployment
• Data Access
• Data Sampling
• Data Transformation
• Create Model
• Dimension Reduction
• Test Model
• Evaluate Model
• Model Apply
• Report
4. Problem definition: Attributes vs Response
Attributes (predictors):
Cement
Blast Furnace Slag
Fly Ash
Water
Superplasticizer
Coarse Aggregate
Fine Aggregate
Age
Response: Concrete (compressive strength)
6. Correlation between response and predictors
Response: Concrete
Predictors: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer,
Coarse Aggregate, Fine Aggregate, Age
7. Baseline Linear Regression
Conclusion
1. The regression is significant.
Concrete = 2.637325 + 0.11630(Cement) + 0.101271(Slag) + 0.077125(Ash)
− 0.198928(Water) + 0.251661(Superplasticizer) − 0.011350(Coarse)
+ 0.009328(Fine) + 0.114203(Age)
2. Coarse Aggregate, Fine Aggregate, and the intercept are not significant.
3. 63.84% of the variation in Concrete is explained by variation in the
predictors (r² = SSR/SST = 63.84%).
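The slides' fit appears to come from R's lm(); the sketch below mirrors the baseline multiple regression with scikit-learn. The data here is synthetic (the coefficients are borrowed from the slide only to generate plausible stand-in data), so the fitted numbers will not match the slide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the 8-predictor concrete dataset (1030 rows in UCI).
rng = np.random.default_rng(0)
n = 1030
X = rng.normal(size=(n, 8))  # Cement, Slag, Ash, Water, Superplasticizer, Coarse, Fine, Age
true_coef = np.array([0.116, 0.101, 0.077, -0.199, 0.252, -0.011, 0.009, 0.114])
y = 2.637 + X @ true_coef + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # R^2 = SSR/SST = 1 - SSE/SST
print(round(r2, 4))
```

`score` returns the same r² = SSR/SST quantity the slide reports for the real data.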
9. Linear Regression with Interaction
Conclusion
1. The new regression is significant (p-value < 2.2e-16).
2. The interaction term (Fine Aggregate × Age) is significant
(p-value = 0.0199).
3. Coarse Aggregate, Fine Aggregate, Age, and the intercept are not
significant.
4. r² = 64.11% > 63.84% (baseline)
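An interaction term like R's Fine.Aggregate:Age is just an extra product column in the design matrix. A minimal sketch on synthetic data (the 0.8 interaction coefficient is invented for illustration):

```python
import numpy as np

# Fit y = b0 + b1*fine + b2*age + b3*(fine*age) by least squares.
rng = np.random.default_rng(1)
n = 500
fine, age = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * fine + 0.3 * age + 0.8 * fine * age + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, main effects, and the interaction column.
X = np.column_stack([np.ones(n), fine, age, fine * age])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # beta[3] estimates the interaction coefficient
```

Whether the recovered beta[3] differs from zero is what the slide's p-value of 0.0199 tests on the real data.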
12. Polynomial Regression (Water)
Conclusion
1. The polynomial regression is significant (p-value < 2.2e-16).
Concrete = 35.8039 + 140.9707(Water) + 102.3895(Water²)
+ 95.6686(Water³) − 38.8344(Water⁴)
2. All predictors are significant.
3. 20.09% of the variation in Concrete is explained by variation in Water.
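A degree-4 polynomial fit can be sketched with scikit-learn as below. One caveat: the slide's large coefficients come from R's poly(Water, 4), which uses an orthogonal polynomial basis, so they are not directly comparable to the raw-power coefficients this sketch produces. Data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Regress strength on powers of Water alone (degree 4), synthetic data.
rng = np.random.default_rng(2)
water = rng.uniform(120, 250, size=400)
y = 80 - 0.2 * water + 1e-4 * water**2 + rng.normal(scale=2.0, size=400)

Xp = PolynomialFeatures(degree=4, include_bias=False).fit_transform(water[:, None])
fit = LinearRegression().fit(Xp, y)
print(round(fit.score(Xp, y), 3))  # share of variance explained by Water alone
```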
13. Linear Regression with Polynomial (Water)
Conclusion
1. The polynomial regression is significant (p-value < 2.2e-16).
2. poly(Water, 4)2, poly(Water, 4)4, Superplasticizer, Coarse.Aggregate,
and Fine.Aggregate are not significant.
3. 64.7% of the variation in Concrete is explained by variation in the
predictors (r² = 64.7% > 63.84% (baseline)).
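Combining the Water polynomial with the remaining linear predictors, as in R's lm(Concrete ~ poly(Water, 4) + ...), amounts to stacking the polynomial columns next to the other columns. A synthetic-data sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(8)
n = 600
water = rng.uniform(0, 1, size=n)
others = rng.normal(size=(n, 3))  # stand-ins for the other predictors
y = 2 * water - 3 * water**2 + others @ np.array([1.0, 0.5, -0.7]) \
    + rng.normal(scale=0.1, size=n)

# poly(Water, 4) columns plus the linear terms in one design matrix.
water_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(water[:, None])
X = np.column_stack([water_poly, others])
fit = LinearRegression().fit(X, y)
print(round(fit.score(X, y), 3))
```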
19. Regression Trees
1. Uses six variables:
"Age", "Cement", "Water", "Slag", "Superplasticizer"
2. Has 13 terminal nodes.
3. Residual mean deviance is 71.99.
22. Tree Pruning (tree size = 11)
1. Uses five variables:
"Age", "Superplasticizer", "Cement", "Water", "Slag"
2. Has 11 terminal nodes.
3. Residual mean deviance is 80.17.
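The slides use R's tree() and prune.tree(best = 11); a rough scikit-learn equivalent caps the number of terminal nodes with max_leaf_nodes. Synthetic data, so the chosen splits will not match the slide's tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Grow a tree limited to 13 leaves, then a "pruned" version with 11 leaves.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))  # stand-ins for Age, Cement, Water, Slag, Superplasticizer
y = np.where(X[:, 0] > 0, 10.0, 2.0) + 3 * X[:, 1] + rng.normal(scale=0.5, size=600)

full = DecisionTreeRegressor(max_leaf_nodes=13, random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(max_leaf_nodes=11, random_state=0).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())  # at most 13 and 11 terminal nodes
```

Note this is not identical to cost-complexity pruning (sklearn's ccp_alpha would be the closer analogue), but it reproduces the slide's fixed-size comparison.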
23. Testing Error (10-fold Cross-Validation)
Regression tree 1 (original tree, 13 nodes): MSE 85.055
Regression tree 2 (pruned tree, 11 nodes): MSE 80.673
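The 10-fold cross-validated test MSE comparison can be sketched as below (synthetic data, so the MSE values will not match the slide's 85.055 vs 80.673):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 5))
y = 5 * X[:, 0] + 3 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=600)

# 10-fold CV MSE for the 13-leaf ("original") and 11-leaf ("pruned") trees.
mses = {}
for leaves in (13, 11):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10, scoring="neg_mean_squared_error")
    mses[leaves] = -scores.mean()
    print(leaves, round(mses[leaves], 3))
```

As on the slide, a smaller pruned tree can generalize better even though its training (residual mean) deviance is worse.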
24. Bagging
1. Draw 500 bootstrap samples.
2. Use the 500 samples to build 500 decision trees.
3. Average them to get a single low-variance model.
MSE 30.78
Adjusted R² 0.8928
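The bagging recipe above (500 bootstrap samples, 500 trees, averaged predictions) maps to scikit-learn's BaggingRegressor. Synthetic data, so the test MSE will not match the slide's 30.78.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(800, 8))
y = 5 * X[:, 0] + 4 * np.sin(3 * X[:, 1]) + rng.normal(scale=1.0, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 trees, each fit on a bootstrap sample; predictions are averaged.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500,
                       random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, bag.predict(X_te))
print(round(mse, 3))
```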
25. Random Forest
1. Draw 500 bootstrap samples.
2. Use the 500 samples to build 500 decision trees; at each split,
randomly choose 3 variables as split candidates.
3. Average them to get a single low-variance model.
MSE 30.44
Adjusted R² 0.894
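A random forest differs from bagging only in restricting each split to a random subset of predictors (R's mtry = 3), which decorrelates the trees. Sketch on the same kind of synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
X = rng.normal(size=(800, 8))
y = 5 * X[:, 0] + 4 * np.sin(3 * X[:, 1]) + rng.normal(scale=1.0, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 bootstrapped trees; only 3 of the 8 predictors considered per split.
rf = RandomForestRegressor(n_estimators=500, max_features=3,
                           random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, rf.predict(X_te))
print(round(mse, 3))
```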
26. Boosting
1. Build 500 decision trees sequentially.
2. Use the 1st decision tree as the basic model.
3. Add a shrunken version of the 2nd decision tree to the basic model
as the boosting model.
4. Repeat the last step 498 times with the remaining 498 trees to get
the final boosting model.
MSE 31.86
Adjusted R² 0.8821
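The sequential recipe above, where each new tree is added in shrunken form to correct the current model's residuals, is gradient boosting; the slide's analysis appears to be R's gbm, and the scikit-learn analogue is sketched below on synthetic data (the learning rate 0.05 and depth 3 are illustrative choices, not from the slide).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 8))
y = 5 * X[:, 0] + 4 * np.sin(3 * X[:, 1]) + rng.normal(scale=1.0, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 trees fit one after another; learning_rate is the shrinkage factor
# applied to each tree before it is added to the running model.
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, gbm.predict(X_te))
print(round(mse, 3))
```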