Using Excel to prepare the data, including imputation and feature engineering.
Using BigML to build, tune, and evaluate models, make predictions, and choose the best model.
Midterm Project
1. Data preparation and feature engineering
I did the data preparation in Excel. First, I replaced all blank cells with zeros.
Then I created the following new variables in the dataset:
1) “AvgRatingPlayer 1:14” for each of the 14 players: the mean of that player’s five rating values.
2) “AvgCRating”, “AvgGSRating”, “AvgEFRating”, “AvgFFRating”, and “AvgPRating”: the 14 players’
corresponding rating values summed and divided by “NumPlayers”.
3) “AvgTotal”: the sum of “AvgCRating”, “AvgGSRating”, “AvgEFRating”, “AvgFFRating”, and
“AvgPRating” divided by 5.
4) “AvgDiffTotalPlayer 1:14”: AvgTotal minus AvgRatingPlayer 1:14. If AvgRatingPlayer 1:14
is 0, the corresponding AvgDiffTotalPlayer is set to 0.
5) “FirstDiffTotal”: AvgTotal minus the AvgRatingPlayer value of the player named in “First”.
In the formula below, A2 is “First”, W2 is “AvgTotal”, and I2:V2 are the AvgRatingPlayer 1:14 values.
=IF(A2="P1",W2-I2,
 IF(A2="P2",W2-J2,
 IF(A2="P3",W2-K2,
 IF(A2="P4",W2-L2,
 IF(A2="P5",W2-M2,
 IF(A2="P6",W2-N2,
 IF(A2="P7",W2-O2,
 IF(A2="P8",W2-P2,
 IF(A2="P9",W2-Q2,
 IF(A2="P10",W2-R2,
 IF(A2="P11",W2-S2,
 IF(A2="P12",W2-T2,
 IF(A2="P13",W2-U2,
 IF(A2="P14",W2-V2))))))))))))))
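The feature engineering above can be sketched in plain Python for a single row. The key names here (e.g. "CRatingPlayer1", "NumPlayers", "First") are my assumptions about the dataset schema based on this report, not verified column names; this is a sketch of the logic, not the actual workflow, which was done in Excel.

```python
RATING_TYPES = ["C", "GS", "EF", "FF", "P"]
N_PLAYERS = 14

def engineer_features(row):
    """row: dict mapping assumed column names to values; returns the new variables."""
    out = {}
    # 1) AvgRatingPlayer i: mean of each player's five rating values.
    for i in range(1, N_PLAYERS + 1):
        ratings = [row[f"{t}RatingPlayer{i}"] for t in RATING_TYPES]
        out[f"AvgRatingPlayer{i}"] = sum(ratings) / len(ratings)
    # 2) Avg<T>Rating: the 14 players' ratings of one type, summed and
    #    divided by NumPlayers (absent players contribute 0).
    for t in RATING_TYPES:
        total = sum(row[f"{t}RatingPlayer{i}"] for i in range(1, N_PLAYERS + 1))
        out[f"Avg{t}Rating"] = total / row["NumPlayers"]
    # 3) AvgTotal: mean of the five per-type averages.
    out["AvgTotal"] = sum(out[f"Avg{t}Rating"] for t in RATING_TYPES) / len(RATING_TYPES)
    # 4) AvgDiffTotalPlayer i, forced to 0 for players with no ratings.
    for i in range(1, N_PLAYERS + 1):
        avg_i = out[f"AvgRatingPlayer{i}"]
        out[f"AvgDiffTotalPlayer{i}"] = out["AvgTotal"] - avg_i if avg_i != 0 else 0.0
    # 5) FirstDiffTotal: AvgTotal minus the first-place player's average
    #    (this replaces the long nested-IF Excel formula above).
    first = int(row["First"].lstrip("P"))  # e.g. "P3" -> 3
    out["FirstDiffTotal"] = out["AvgTotal"] - out[f"AvgRatingPlayer{first}"]
    return out
```

A loop over all rows of the exported CSV would then produce the same derived columns as the spreadsheet.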
2. Upload data to BigML
After data preparation, I uploaded the dataset to BigML and set First, Second, and Third as
categorical variables. Then I split the dataset into 80% for the Training dataset and 20% for the
Test dataset.
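The split itself was done inside BigML's interface; as a rough equivalent, a shuffled 80/20 split can be sketched as follows (the fixed seed is my addition for reproducibility, not something the report specifies):

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle the rows deterministically and return (training, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)  # first 80% -> training, last 20% -> test
    return rows[:cut], rows[cut:]
```

BigML's own deterministic sampling would give a different but analogous partition.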
3. Build Decision Tree Models, evaluate and choose the best model
In all model-building runs, I excluded these variables: Second, Third, CapFirstScore,
CapSecondScore, and CapThirdScore.
I built 7 Decision Tree models using different combinations of settings and features while
evaluating and tuning model performance. I evaluated each model on the 20% Test dataset,
downloaded the confusion matrix, and recorded Avg F, Avg Precision, and Avg Accuracy for each
model.
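BigML reports these averages in its evaluation view. Assuming they are macro averages (per-class scores averaged over the classes, with per-class accuracy counting both true positives and true negatives for that class), they can be recomputed from raw predictions like this:

```python
from collections import Counter

def macro_scores(actual, predicted):
    """Return (avg_f, avg_precision, avg_accuracy), macro-averaged over classes."""
    classes = sorted(set(actual) | set(predicted))
    n = len(actual)
    pairs = Counter(zip(actual, predicted))  # (actual, predicted) -> count
    f1s, precs, accs = [], [], []
    for c in classes:
        tp = pairs[(c, c)]
        pred_c = sum(v for (a, p), v in pairs.items() if p == c)  # predicted as c
        act_c = sum(v for (a, p), v in pairs.items() if a == c)   # actually c
        p = tp / pred_c if pred_c else 0.0
        r = tp / act_c if act_c else 0.0
        precs.append(p)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        # per-class accuracy: rows correctly classified as c or as not-c
        accs.append((n - pred_c - act_c + 2 * tp) / n)
    k = len(classes)
    return sum(f1s) / k, sum(precs) / k, sum(accs) / k
```

Feeding in the actual and predicted First labels from the Test dataset would reproduce the three figures recorded for each model, if the macro-averaging assumption matches BigML's definition.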
The best Decision Tree model, with the highest Avg F score, is the 523 model. Its Avg F score is
0.268, its Avg Precision is 0.242, and its Avg Accuracy is 0.285. In this model I kept all default
settings except Threshold=523, Sample rate=81%, and Ordering=linear.
4. Build Logistic Regression Models, evaluate and choose the best model
I built 5 Logistic Regression models. I evaluated each on the 20% Test dataset, downloaded the
confusion matrix, and recorded Avg F, Avg Precision, and Avg Accuracy for each model.
The best Logistic Regression model, with the highest Avg F score, is the 48 model. Its Avg F score is
0.2754, its Avg Precision is 0.2557, and its Avg Accuracy is 0.2845. In this model I kept all default
settings except Sampling rate=48% and excluding the Bias Term.
5. Build Ensemble Models, evaluate and choose the best model
I built 9 Ensemble models, including 2 models with Weight=CapFirstScore. I evaluated each on the
20% Test dataset, downloaded the confusion matrix, and recorded Avg F, Avg Precision, and Avg
Accuracy for each model.
The best Ensemble model, with the highest Avg F score, is the 333 model. Its Avg F score is 0.2992,
its Avg Precision is 0.2724, and its Avg Accuracy is 0.3255. In this model I kept all default settings.
6. Predict the 20% Test dataset and calculate AvgPoints
After model evaluation, I made predictions on the 20% Test dataset using the three best models and
the two weighted Ensemble models, and recorded AvgPoints for each prediction.
For the 523 model, the best Decision Tree model, AvgPoints is 6.7533.
For the 333 model, the best Ensemble model, AvgPoints is 7.3021.
For the 48 model, the best Logistic Regression model, AvgPoints is 5.4875.
For the 777 model, the Ensemble model with Weight=CapFirstScore, Sampling rate=49%, and
Threshold=293, AvgPoints is 6.6289.
For the 888 model, the Ensemble model with Weight=CapFirstScore and Sampling rate=52%,
AvgPoints is 6.6021.
At this point, the best model is the 333 model.
7. More actions after step 6
I made predictions on the 20% Test dataset with several other models and recorded AvgPoints as follows.
1) Predicted three more Ensemble models that have high performance values.
For the 1618 model, which has the second-highest Avg F among the Ensemble models, AvgPoints is 7.5081.
For the 555 model, which has the third-highest Avg F among the Ensemble models, AvgPoints is 7.3389.
For the 913 model, which has the fourth-highest Avg F among the Ensemble models, AvgPoints is 7.3895.
Model Name   Avg F    Avg Precision   Avg Accuracy   AvgPoints
333 model    0.2992   0.2724          0.3255         7.3021
1618 model   0.2964   0.303           0.3328         7.5081
555 model    0.2956   0.279           0.3235         7.3389
913 model    0.2894   0.2889          0.3313         7.3895
The 1618 model has the highest AvgPoints, followed by the 913 model and the 555 model. This table
shows that AvgPoints is not determined by the F value alone; it is also related to Precision and
Accuracy. The 1618 model, which has the highest Precision and Accuracy, is better than the 333
model even though the 333 model has the highest F value. Similarly, the 555 model has a higher
Precision than the 333 model, and the 913 model has the second-highest Accuracy.
2) Predicted two more Decision Tree models that have high performance values.
Model Name   Avg F    Avg Precision   Avg Accuracy   AvgPoints
523 model    0.268    0.242           0.285          6.7533
512 model    0.2451   0.2496          0.2775         6.5981
1813 model   0.249    0.2549          0.2857         6.6802
Among these three models, the 523 model has the highest AvgPoints. Although the 1813 model has
the highest Precision and Accuracy of the three, its AvgPoints is not higher than that of the 523 model.
3) Predicted two more Logistic Regression models that have high performance values.
Model Name   Avg F    Avg Precision   Avg Accuracy   AvgPoints
48 model     0.2754   0.2557          0.2845         5.4875
111 model    0.2706   0.2531          0.2938         5.4898
222 model    0.2641   0.2741          0.2823         5.5663
The 222 model has the highest AvgPoints among these three models.
In sum, the Ensemble models performed better overall than the Decision Tree and Logistic
Regression models.
The top three models in this project:
The best model is the 1618 model, which has the highest AvgPoints, 7.5081. It is an Ensemble model
that uses all variables and all default settings except Threshold=1618 and Ordering=linear.
The second-best model is the 913 model, with AvgPoints 7.3895. It is an Ensemble model that uses
all variables and all default settings except Threshold=913 and Ordering=Random shuffling.
The third-best model is the 555 model, with AvgPoints 7.3389. It is an Ensemble model that uses all
default settings but excludes the variables from CRatingPlayer1:14 through PRatingPlayer1:14; it
used only the new variables I created earlier.