2. Syllabus
• Problem definition
• Analysis method
• Data mining process
• Baseline regression
• Dimension reduction
• Tree-based models
• Evaluation
3. How to analyze the dataset
Problem Definition → Data Gathering & Preparation → Model Building & Evaluation → Knowledge Deployment
• Data Access
• Data Sampling
• Data Transformation
• Create Model
• Dimension Reduction
• Test Model
• Evaluate Model
• Model Apply
• Report
4. Problem definition: Attributes vs Response
Attributes (predictors):
Cement
Blast Furnace Slag
Fly Ash
Water
Superplasticizer
Coarse Aggregate
Fine Aggregate
Age
Response: Concrete (compressive strength)
6. Correlation between response and predictors
Response: Concrete
Predictors: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer,
Coarse Aggregate, Fine Aggregate, Age
7. Baseline Linear Regression
Conclusion
1. The regression is significant.
Concrete = 2.637325 + 0.11630(Cement) + 0.101271(Slag) + 0.077125(Ash)
− 0.198928(Water) + 0.251661(Superplasticizer) − 0.011350(Coarse)
+ 0.009328(Fine) + 0.114203(Age)
2. Coarse Aggregate, Fine Aggregate, and the intercept are not significant.
3. 63.84% of the variation in Concrete is explained by variation in the
predictors (r² = SSR/SST = 63.84%).
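The slides' fit appears to come from R's lm(); the sketch below mirrors the baseline multiple regression with scikit-learn. The data here is synthetic (the coefficients are borrowed from the slide only to generate plausible stand-in data), so the fitted numbers will not match the slide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the 8-predictor concrete dataset (1030 rows in UCI).
rng = np.random.default_rng(0)
n = 1030
X = rng.normal(size=(n, 8))  # Cement, Slag, Ash, Water, Superplasticizer, Coarse, Fine, Age
true_coef = np.array([0.116, 0.101, 0.077, -0.199, 0.252, -0.011, 0.009, 0.114])
y = 2.637 + X @ true_coef + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # R^2 = SSR/SST = 1 - SSE/SST
print(round(r2, 4))
```

`score` returns the same r² = SSR/SST quantity the slide reports for the real data.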
9. Linear Regression with Interaction
Conclusion
1. The new regression is significant (p-value < 2.2e-16).
2. The interaction term (Fine Aggregate × Age) is significant
(p-value = 0.0199).
3. Coarse Aggregate, Fine Aggregate, Age, and the intercept are not
significant.
4. r² = 64.11% > 63.84% (baseline)
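An interaction term like R's Fine.Aggregate:Age is just an extra product column in the design matrix. A minimal sketch on synthetic data (the 0.8 interaction coefficient is invented for illustration):

```python
import numpy as np

# Fit y = b0 + b1*fine + b2*age + b3*(fine*age) by least squares.
rng = np.random.default_rng(1)
n = 500
fine, age = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * fine + 0.3 * age + 0.8 * fine * age + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, main effects, and the interaction column.
X = np.column_stack([np.ones(n), fine, age, fine * age])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # beta[3] estimates the interaction coefficient
```

Whether the recovered beta[3] differs from zero is what the slide's p-value of 0.0199 tests on the real data.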
12. Polynomial Regression (Water)
Conclusion
1. The polynomial regression is significant (p-value < 2.2e-16).
Concrete = 35.8039 + 140.9707(Water) + 102.3895(Water²)
+ 95.6686(Water³) − 38.8344(Water⁴)
2. All predictors are significant.
3. 20.09% of the variation in Concrete is explained by variation in Water.
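A degree-4 polynomial fit can be sketched with scikit-learn as below. One caveat: the slide's large coefficients come from R's poly(Water, 4), which uses an orthogonal polynomial basis, so they are not directly comparable to the raw-power coefficients this sketch produces. Data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Regress strength on powers of Water alone (degree 4), synthetic data.
rng = np.random.default_rng(2)
water = rng.uniform(120, 250, size=400)
y = 80 - 0.2 * water + 1e-4 * water**2 + rng.normal(scale=2.0, size=400)

Xp = PolynomialFeatures(degree=4, include_bias=False).fit_transform(water[:, None])
fit = LinearRegression().fit(Xp, y)
print(round(fit.score(Xp, y), 3))  # share of variance explained by Water alone
```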
13. Linear Regression with Polynomial (Water)
Conclusion
1. The polynomial regression is significant (p-value < 2.2e-16).
2. poly(Water, 4)2, poly(Water, 4)4, Superplasticizer, Coarse.Aggregate,
and Fine.Aggregate are not significant.
3. 64.7% of the variation in Concrete is explained by variation in the
predictors (r² = 64.7% > 63.84% (baseline)).
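Combining the Water polynomial with the remaining linear predictors, as in R's lm(Concrete ~ poly(Water, 4) + ...), amounts to stacking the polynomial columns next to the other columns. A synthetic-data sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(8)
n = 600
water = rng.uniform(0, 1, size=n)
others = rng.normal(size=(n, 3))  # stand-ins for the other predictors
y = 2 * water - 3 * water**2 + others @ np.array([1.0, 0.5, -0.7]) \
    + rng.normal(scale=0.1, size=n)

# poly(Water, 4) columns plus the linear terms in one design matrix.
water_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(water[:, None])
X = np.column_stack([water_poly, others])
fit = LinearRegression().fit(X, y)
print(round(fit.score(X, y), 3))
```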
19. Regression Trees
1. Uses six variables:
"Age", "Cement", "Water", "Slag", "Superplasticizer"
2. Has 13 terminal nodes.
3. Residual mean deviance is 71.99.
22. Tree Pruning (tree size = 11)
1. Uses five variables:
"Age", "Superplasticizer", "Cement", "Water", "Slag"
2. Has 11 terminal nodes.
3. Residual mean deviance is 80.17.
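The slides use R's tree() and prune.tree(best = 11); a rough scikit-learn equivalent caps the number of terminal nodes with max_leaf_nodes. Synthetic data, so the chosen splits will not match the slide's tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Grow a tree limited to 13 leaves, then a "pruned" version with 11 leaves.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))  # stand-ins for Age, Cement, Water, Slag, Superplasticizer
y = np.where(X[:, 0] > 0, 10.0, 2.0) + 3 * X[:, 1] + rng.normal(scale=0.5, size=600)

full = DecisionTreeRegressor(max_leaf_nodes=13, random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(max_leaf_nodes=11, random_state=0).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())  # at most 13 and 11 terminal nodes
```

Note this is not identical to cost-complexity pruning (sklearn's ccp_alpha would be the closer analogue), but it reproduces the slide's fixed-size comparison.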
23. Testing Error (10-fold Cross-Validation)
Regression tree 1 (original tree, 13 nodes): MSE 85.055
Regression tree 2 (pruned tree, 11 nodes): MSE 80.673
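The 10-fold cross-validated test MSE comparison can be sketched as below (synthetic data, so the MSE values will not match the slide's 85.055 vs 80.673):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 5))
y = 5 * X[:, 0] + 3 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=600)

# 10-fold CV MSE for the 13-leaf ("original") and 11-leaf ("pruned") trees.
mses = {}
for leaves in (13, 11):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10, scoring="neg_mean_squared_error")
    mses[leaves] = -scores.mean()
    print(leaves, round(mses[leaves], 3))
```

As on the slide, a smaller pruned tree can generalize better even though its training (residual mean) deviance is worse.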
24. Bagging
1. Draw 500 bootstrap samples.
2. Use the 500 samples to build 500 decision trees.
3. Average them to get a single low-variance model.
MSE 30.78
Adjusted R² 0.8928
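The bagging recipe above (500 bootstrap samples, 500 trees, averaged predictions) maps to scikit-learn's BaggingRegressor. Synthetic data, so the test MSE will not match the slide's 30.78.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(800, 8))
y = 5 * X[:, 0] + 4 * np.sin(3 * X[:, 1]) + rng.normal(scale=1.0, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 trees, each fit on a bootstrap sample; predictions are averaged.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500,
                       random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, bag.predict(X_te))
print(round(mse, 3))
```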
25. Random Forest
1. Draw 500 bootstrap samples.
2. Use the 500 samples to build 500 decision trees; at each split,
randomly choose 3 variables as split candidates.
3. Average them to get a single low-variance model.
MSE 30.44
Adjusted R² 0.894
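A random forest differs from bagging only in restricting each split to a random subset of predictors (R's mtry = 3), which decorrelates the trees. Sketch on the same kind of synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(6)
X = rng.normal(size=(800, 8))
y = 5 * X[:, 0] + 4 * np.sin(3 * X[:, 1]) + rng.normal(scale=1.0, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 bootstrapped trees; only 3 of the 8 predictors considered per split.
rf = RandomForestRegressor(n_estimators=500, max_features=3,
                           random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, rf.predict(X_te))
print(round(mse, 3))
```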
26. Boosting
1. Build 500 decision trees sequentially.
2. Use the 1st decision tree as the basic model.
3. Add a shrunken version of the 2nd decision tree to the basic model
as the boosting model.
4. Repeat the last step 498 times with the remaining 498 trees to get
the final boosting model.
MSE 31.86
Adjusted R² 0.8821
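The sequential recipe above, where each new tree is added in shrunken form to correct the current model's residuals, is gradient boosting; the slide's analysis appears to be R's gbm, and the scikit-learn analogue is sketched below on synthetic data (the learning rate 0.05 and depth 3 are illustrative choices, not from the slide).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 8))
y = 5 * X[:, 0] + 4 * np.sin(3 * X[:, 1]) + rng.normal(scale=1.0, size=800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 500 trees fit one after another; learning_rate is the shrinkage factor
# applied to each tree before it is added to the running model.
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, gbm.predict(X_te))
print(round(mse, 3))
```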