This document discusses building machine learning models to classify flower species using a dataset with 4 numeric attributes for each flower. It provides code to load and explore the data and to build and evaluate several classification models, including logistic regression, decision trees, and random forests. The best-performing model is a random forest with an accuracy of 87% on the test set, beating the baseline accuracy of 33%. Feature importance is examined using Gini impurity, which shows that petal attributes are more predictive than sepal attributes.
2. What makes this a good learning dataset?
• Clean dataset available: 4 numeric attributes with no missing values
• Target is one of 3 flower species, so this is multi-class classification.
• Well-known dataset
3. What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/
4. Data Source
• Titanic dataset on Kaggle
• https://www.kaggle.com/c/titanic
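import pandas as pd

def load_data(url):
    '''
    Loads data into the Python environment.
    Parameters: url (or path) of a headerless .csv file
    Returns: dataframe
    '''
    # The file has no header row, so the column names are supplied here
    variables = ['sepal_len', 'sepal_w', 'petal_len',
                 'petal_w', 'class']
    df = pd.read_csv(url, names=variables)
    return df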
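As a quick check of how the loader treats a headerless file, the sketch below reads three sample rows from an in-memory string instead of a URL. The rows and the bare species names are illustrative assumptions about the file layout, not data taken from the slides.

```python
import io

import pandas as pd

# Same column names as in load_data
variables = ['sepal_len', 'sepal_w', 'petal_len', 'petal_w', 'class']

# Hypothetical sample rows in the assumed layout: four measurements, then the label
sample = io.StringIO(
    '5.1,3.5,1.4,0.2,setosa\n'
    '7.0,3.2,4.7,1.4,versicolor\n'
    '6.3,3.3,6.0,2.5,virginica\n'
)

# names= tells pandas there is no header row, so the first line is kept as data
df = pd.read_csv(sample, names=variables)
print(df.shape)
print(df.dtypes)
```

Because `names` is passed, the first line of the file is read as a data row rather than as column headers.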
6. Summary Statistics
• # of rows, # of features
• Frequency distribution
• Number of missing values

Variable        # of missing values
Sepal Width     0
Sepal Length    0
Petal Width     0
Petal Length    0
def summary_statistics(df):
    '''
    Prints summary statistics: the # of rows & columns,
    descriptive statistics (including the 5 # summary),
    class frequencies, and missing-value counts.
    Parameters: dataframe
    Returns: none
    '''
    # shape
    print('Shape of dataframe: %d instances and %d features' %
          (df.shape[0], df.shape[1]))
    # description
    print(df.describe())
    # class frequency
    print(df.groupby('class').size())
    # missing values per column
    print(df.isnull().sum())
Flower Species    # of instances
setosa            50
versicolor        50
virginica         50

The base rate is 0.33 (50 of the 150 instances belong to each species). Our model has to beat that.
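The base rate follows directly from the class counts: it is the accuracy of always predicting the most frequent class. A minimal sketch, where the helper name `base_rate` is hypothetical:

```python
from collections import Counter

def base_rate(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Class counts from the frequency table: 50 instances per species
labels = ['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50
print('Base rate: %.2f' % base_rate(labels))
```

With three equally sized classes, any single-class guess is right one time in three.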
8. Data Processing
• The dataset is already clean, so minimal processing is needed.
• All four features are used.
• Split into training and test sets
9. Split Data Set
• We have a small data set, so later on we will use 10-fold cross-validation to
get a more reliable estimate of model performance.
from sklearn import model_selection

def split_train_test(df):
    '''
    Splits available data into 80% training set, 20% test
    set.
    Parameters: dataframe
    Returns: training set - features and output, test set -
    features and output
    '''
    array = df.values
    X = array[:, 0:4]   # the four numeric features
    Y = array[:, 4]     # the class label
    # 80% training set, 20% test set
    n_test = 0.2
    seed = 7            # fixed seed so the split is reproducible
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=n_test, random_state=seed)
    return X_train, X_test, Y_train, Y_test
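A purely random 80/20 split does not guarantee equal class counts in the test set. As an optional variation (not what the function above does), the sketch below passes `stratify` to `train_test_split` so each species keeps its share. It assumes scikit-learn's bundled `load_iris` copy of the data as a stand-in for the loader above.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data source (assumption): scikit-learn's bundled iris copy
iris = load_iris()
X, Y = iris.data, iris.target

# stratify=Y keeps the class proportions the same in both subsets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=7, stratify=Y)

test_counts = Counter(Y_test.tolist())
print(len(X_train), len(X_test))
print(sorted(test_counts.items()))
```

Whether to stratify is a design choice: it removes one source of variance from the test-set metrics, at the cost of a test set that is slightly less "random".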
from sklearn import model_selection

def k_fold_validation(models, X_train, Y_train):
    '''
    Performs 10-fold cross-validation and prints the mean and
    standard deviation of accuracies.
    Parameters: array of models, training set - features and output
    Returns: array of names, models, means, and stds
    '''
    results = []
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 7
    for name, model in models:
        # shuffle=True is needed for random_state to take effect;
        # recent scikit-learn versions raise an error without it
        kfold = model_selection.KFold(n_splits=10, shuffle=True,
                                      random_state=seed)
        cv_results = model_selection.cross_val_score(
            model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        msg = '%s: %f (%f)' % (name, cv_results.mean(),
                               cv_results.std())
        print(msg)
    return names, models, means, stds
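To make the fold mechanics concrete, here is a minimal pure-Python sketch of how a k-fold partition can be formed. It illustrates the idea only and is not scikit-learn's implementation; the helper name `k_fold_indices` is hypothetical, and shuffling is left out for brevity.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop

# 120 training instances (80% of 150) split into 10 folds
folds = list(k_fold_indices(120, 10))
for train_idx, val_idx in folds[:2]:
    print(len(train_idx), len(val_idx))
```

Each instance lands in exactly one validation fold, so every model is scored on all of the training data once, 12 instances at a time.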
10. Model Building = Equation
• Multi-Class Classification with only numeric variables
• Logistic Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
11. Model Building code
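from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_model(X_train, Y_train):
    '''
    Builds the list of candidate models: Logistic Regression,
    Linear Discriminant Analysis, KNN, Decision Tree, Random
    Forest, Naive Bayes, and Support Vector Machine.
    The models are not fitted here.
    Parameters: training set - features and output
    Returns: array of (name, model) pairs
    '''
    models = []
    # One-vs-rest logistic regression; the wrapper stands in for the
    # multi_class='ovr' argument, which newer scikit-learn deprecates
    models.append(('LR', OneVsRestClassifier(
        LogisticRegression(solver='liblinear'))))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF', RandomForestClassifier(n_estimators=100,
                                                max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models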
12. Estimation
• Gini Impurity
• The dimensions of the petal are more predictive than those of the sepal.

Feature         Gini Importance
Petal Width     0.46
Petal Length    0.42
Sepal Length    0.09
Sepal Width     0.03
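For reference, the Gini impurity of a node with class proportions p_k is 1 - Σ p_k², and a feature's importance in a tree-based model accumulates the impurity decrease of the splits that use it. The sketch below works through one split; the helper name `gini` and the split itself are hypothetical, chosen only to show the arithmetic.

```python
def gini(counts):
    """Gini impurity of a node, given the number of instances per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Root node: three balanced species
root = [50, 50, 50]
# Hypothetical split that isolates one species on the left
left, right = [50, 0, 0], [0, 50, 50]

n = sum(root)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
decrease = gini(root) - weighted_children

print(round(gini(root), 3))   # impurity before the split
print(round(gini(right), 3))  # impurity of the mixed child
print(round(decrease, 3))     # impurity decrease credited to the split feature
```

A pure node has impurity 0, so a split that cleanly separates one species earns its feature a large share of the total decrease.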
def gini_impurity(models, X_train, Y_train, X_test,
                  df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    # The random forest is the fifth entry in the models list
    random_forest = models[4][1]
    # Names of the four feature columns (the class column is excluded)
    keys = df.columns[:4]
    random_forest.fit(X_train, Y_train)
    # Print (importance, feature) pairs, most important first
    importances = [round(float(x), 4)
                   for x in random_forest.feature_importances_]
    print(sorted(zip(importances, keys), reverse=True))
13. Model Evaluation
• Run on training set
• Performance metric: Accuracy
• The base rate is the accuracy we would get by always predicting a single species, e.g. setosa. Only
algorithms that beat this base rate will be considered.
• Base rate = 0.33 (equivalently, a null error rate of 0.67)
• Visualize in: Error Bars
14. Model Evaluation
• Error bars show the mean cross-validation accuracy of each model, plus or minus one standard deviation.
import matplotlib.pyplot as plt

def evaluate_error_bar(names, models, means, stds):
    '''
    Compare accuracy values with Error Bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    # error bar
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(1, 1, 1)
    # Model names are used directly as categorical x positions,
    # so no separate tick labels are needed
    ax.errorbar(names, means, yerr=stds, linestyle='None',
                marker='^')
    ax.set_ylim(0.92, 1)
    plt.show()
15. Explanation
• Run on test set
• Performance metric: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Fits the random forest on the training set and runs the
    test set through it. Prints the confusion matrix and
    classification report.
    Parameters: training set and test set, array of models
    Returns: none
    '''
    for name, model in models:
        # Only the random forest is evaluated on the test set
        if name == 'RF':
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name,
                  accuracy_score(Y_test, pred)))
            # Rows are actual classes, columns are predicted classes
            labels = np.unique(Y_test)
            confusion = confusion_matrix(Y_test, pred,
                                         labels=labels)
            print('\nConfusion Matrix:')
            print(pd.DataFrame(confusion, index=labels,
                               columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
16. Explanation
Accuracy = 0.87
• Better than Base Rate = 0.33
• Precision
  • Precision for setosa is perfect (1.00). This means that if the model
  predicted that the flower species is setosa, then it is always right.
• Recall
  • Recall for setosa is perfect (1.00). This means that we correctly
  identified all setosa flowers.
• F1 Score
  • Harmonic mean of precision and recall. Here we see that we do a
  better job at identifying setosa (F1 = 1.00) than the other two flower
  species (F1 = 0.83 and 0.82).

Confusion Matrix

                 Predicted Class
Actual Class     setosa    versicolor    virginica
setosa           7         0             0
versicolor       0         10            2
virginica        0         2             9

Classification Report

Actual Class     Precision    Recall    F1 Score
setosa           1.00         1.00      1.00
versicolor       0.83         0.83      0.83
virginica        0.82         0.82      0.82
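The figures in the classification report can be recomputed from the confusion matrix by hand. The sketch below does so in plain Python, using the matrix values reported above; the helper name `per_class_metrics` is hypothetical, and the code assumes every class is predicted at least once (otherwise precision would divide by zero).

```python
labels = ['setosa', 'versicolor', 'virginica']
# Rows are actual classes, columns are predicted classes
confusion = [
    [7, 0, 0],
    [0, 10, 2],
    [0, 2, 9],
]

def per_class_metrics(confusion, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column total
        actual = sum(confusion[i])                    # row total
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics

correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / sum(sum(row) for row in confusion)
metrics = per_class_metrics(confusion, labels)

print('Accuracy: %.2f' % accuracy)
for label, (p, r, f1) in metrics.items():
    print('%-11s precision=%.2f recall=%.2f f1=%.2f' % (label, p, r, f1))
```

Precision reads down a column (of everything predicted as a species, how much was right), while recall reads across a row (of everything that truly is a species, how much was found).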