wk5ppt1_Titanic

Would you survive the Titanic?
Building Models with Data

What makes this question an ML problem?
• The problem is complex enough: simple rules (such as gender or ticket price) are not enough to determine the survival outcome
• A clean dataset is available: supervised algorithms can be applied to historical data

What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/

Data Source
• Structured data
• Features contain categorical and numeric values
• Target is 1 or 0 → binary classification
• Titanic dataset on Kaggle
• https://www.kaggle.com/c/titanic
import pandas as pd

def read_data(csv):
    '''
    Reads csv file.
    Parameters: file path
    Returns: Pandas dataframe
    '''
    df = pd.read_csv(csv)
    return df

Exploratory Data Analysis
1. Summary statistics
2. Data visualization
3. Data processing

Summary Statistics
• Five-number summary
• Mean, median, frequency
• Number of missing values
def summary_statistics(df):
    '''
    Prints summary statistics about data set.
    Parameters: dataframe
    Returns: none
    '''
    print(df[['Age', 'Parch', 'Fare']].describe())
    print(df.isnull().sum())
    print('%d instances and %d columns' % (df.shape[0], df.shape[1]))
    return
Variable    # of missing values
Age         177
Cabin       687
Embarked    2
• How to treat missing values
• Age (imputation): replace missing values with the median
• Cabin (deletion): the cabin number could give us useful info, such as how close the person was to the rescue boats. But since 687 out of 891 values are missing, deleting the column is the better option here.
• Embarked (deletion): only two rows are missing a value, so delete those two rows
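The three treatments above can be sketched without any library. This is only an illustration on a few made-up rows (not taken from the Kaggle file); the helper name `treat_missing` is hypothetical, and `None` stands in for a missing value.

```python
from statistics import median

def treat_missing(rows):
    """Apply the three treatments to a list of passenger dicts."""
    # Imputation: the median is computed from the known ages only.
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    age_median = median(known_ages)
    cleaned = []
    for r in rows:
        # Deletion of rows: skip passengers with no Embarked value.
        if r["Embarked"] is None:
            continue
        # Deletion of column: copy every field except Cabin.
        new_row = {k: v for k, v in r.items() if k != "Cabin"}
        if new_row["Age"] is None:
            new_row["Age"] = age_median
        cleaned.append(new_row)
    return cleaned

# Made-up rows for illustration only.
rows = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": 30.0, "Cabin": None, "Embarked": None},
    {"Age": 40.0, "Cabin": None, "Embarked": "Q"},
]
cleaned = treat_missing(rows)
print(cleaned)
```

With pandas, the same three decisions would typically map to `fillna`, `drop(columns=...)`, and `dropna(subset=...)`.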

Data Visualization
• Waffle chart
• Scatter matrix
• Correlation matrix
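import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from pywaffle import Waffle

def visualize(df):
    '''
    Visualizes data using waffle chart, scatter matrix, and
    correlation matrix.
    Parameters: df
    Returns: none
    '''
    # waffle chart of Parch (parents/children aboard) frequencies
    freq = df.Parch.value_counts()
    fig = plt.figure(
        FigureClass=Waffle,
        rows=15,
        values=list(freq.values),
        labels=[str(i) for i in freq.index]
    )
    # scatter matrix
    numeric_df = df.drop(columns=['PassengerId', 'Name', 'Ticket'])
    scatter_matrix(numeric_df, alpha=0.2, figsize=(9, 9))
    plt.show()
    # correlation matrix of the numeric columns only
    corr = df.select_dtypes('number').corr()
    print(corr)  # in a notebook, display corr.style.background_gradient() instead
    return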

Data Processing
a. Encoding: turning categorical variables into numeric values
b. Standardizing: scale data into the [0, 1] range (min-max scaling)
c. Imputation: replace null values with median
d. Feature Selection: manually or automatically select input variables
e. Split data into training and validation data sets

Encoding
These machine learning algorithms only understand numbers.
import numpy as np

def encoding(df):
    '''
    Converts all categorical variables into numeric
    representations, i.e. encoding.
    Returns: dataframe
    '''
    columns = df.columns.values
    for c in columns:
        dictionary = {}
        def conversion(val):
            return dictionary[val]
        # only encode columns that are not already numeric
        if df[c].dtype != np.int64 and df[c].dtype != np.float64:
            # sort so that the same value always gets the same code
            unique = sorted(set(df[c].values.tolist()), key=str)
            x = 0
            for u in unique:
                dictionary[u] = x
                x += 1
            df[c] = list(map(conversion, df[c]))
    return df

Standardizing
Algorithms that exploit distances or similarities, like KNN or SVM, need standardized inputs. Graphical classifiers like tree-based models do not share this need, although standardizing is still a good idea.
import pandas as pd
from sklearn import preprocessing

def standardize(df, variable_name):
    '''
    Standardizes variable by scaling to range [0, 1].
    Parameters: dataframe, name of variable to standardize
    Returns: dataframe
    '''
    variable = pd.DataFrame(df[variable_name])
    standardized_variable = preprocessing.MinMaxScaler().fit_transform(variable)
    # fit_transform returns an (n, 1) array; flatten it back to one column
    df[variable_name] = standardized_variable.ravel()
    return df

Imputation
Replace null with median.
def imputation_null_median(df):
    df['Age'] = df['Age'].fillna(df['Age'].median())
    # median() requires numeric values, so this must run after encoding()
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].median())
    return df
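To see why distance-based algorithms such as KNN are sensitive to scale, the sketch below compares Euclidean distances before and after scaling to [0, 1]. The three passengers and the helper names `min_max_scale` and `euclidean` are made up for illustration; the scaling formula is the usual min-max one, (x - min) / (max - min).

```python
import math

def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up passengers described by (Age, Fare).
ages = [20.0, 60.0, 22.0]
fares = [10.0, 12.0, 500.0]

raw = list(zip(ages, fares))
scaled = list(zip(min_max_scale(ages), min_max_scale(fares)))

# Distance from passenger 0 to passengers 1 and 2, before and after scaling.
raw_d1, raw_d2 = euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2])
scaled_d1, scaled_d2 = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])
print(raw_d1, raw_d2)
print(scaled_d1, scaled_d2)
```

On the raw values the large Fare gap dominates, so passenger 2 looks far more distant than passenger 1 even though passenger 1 differs by 40 years in Age. After scaling, both features contribute on the same footing and the two distances are nearly equal.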

Feature Selection
• Remove variables like Passenger ID, Ticket ID because primary keys
have no predictive power
• Remove variables with too many NULL values
def choose_features(df):
    '''
    Selects the feature columns (X) and the target column (Y).
    Returns: array of X, array of Y
    '''
    array = df.values
    # features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
    X = array[:, [2, 4, 5, 6, 7, 9, 11]]
    # target: Survived
    Y = array[:, 1]
    return X, Y
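The positional indices in choose_features are easier to check when mapped back to column names. The sketch below assumes the column order of the Kaggle train.csv header; the constant names are hypothetical.

```python
# Column order of the Kaggle train.csv header (assumed unchanged).
COLUMNS = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
           "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

FEATURE_INDICES = [2, 4, 5, 6, 7, 9, 11]
TARGET_INDEX = 1

feature_names = [COLUMNS[i] for i in FEATURE_INDICES]
target_name = COLUMNS[TARGET_INDEX]
# Everything that is neither a feature nor the target is dropped.
dropped = [c for c in COLUMNS if c not in feature_names and c != target_name]

print(feature_names)
print(target_name)
print(dropped)
```

The dropped columns are the identifiers (PassengerId, Name, Ticket) and the mostly empty Cabin column, which matches the two removal rules above.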

Split Data Set
from sklearn import model_selection

def split_train_test(X, Y, percentage_test):
    seed = 12  # fixed seed so the split is reproducible
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=percentage_test, random_state=seed)
    return X_train, X_test, Y_train, Y_test
• We have a small data set, so later on we will use 10-fold cross-validation to get a more reliable estimate of model performance.
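As a rough illustration of what k-fold cross-validation does, the sketch below partitions sample indices into folds without shuffling; each fold serves once as the validation set while the rest is used for training. The helper name `k_fold_indices` and the sample count of 25 are made up for the example.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=25, n_splits=10))
for train, validation in folds:
    print(len(train), validation)
```

Every sample is used for validation exactly once, and the reported performance is the mean (and spread) of the ten per-fold scores.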

Best Practices in Coding
• Modular programming: write each step in a separate function.
• Here, the sub-steps of “Data Processing” are included in one parent function.
def process_data(df, standardize_variable_name, percentage_test):
    '''
    Processes data by encoding, standardizing, imputing,
    selecting features, and splitting.
    Parameters: dataframe, variable to be standardized,
    fraction of data held out as test set
    Returns: X_train, X_test, Y_train, Y_test
    '''
    df = encoding(df)
    df = standardize(df, standardize_variable_name)
    df = imputation_null_median(df)
    X, Y = choose_features(df)
    X_train, X_test, Y_train, Y_test = split_train_test(X, Y, percentage_test)
    return X_train, X_test, Y_train, Y_test

Model Building = Equation
• Binary Classification with both numeric and categorical variables
• Logistic Regression
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine

Model Building code
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_model(X_train, Y_train):
    '''
    Sets up Logistic Regression, KNN, Decision Tree, Random
    Forest, Naive Bayes, and Support Vector Machine.
    The models are fitted later, during evaluation.
    Parameters: training set - features and output
    Returns: array of (name, model) tuples
    '''
    models = []
    models.append(('LR', LogisticRegression(solver='liblinear')))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF', RandomForestClassifier(n_estimators=100, max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models

Estimation
• Gini Impurity
• The features with the most predictive power are Sex, Fare, and Pclass
Feature     Gini Importance
Sex         0.47
Fare        0.18
Pclass      0.13
Age         0.10
Parch       0.04
SibSp       0.04
Embarked    0.04
def estimate_4e(models, X_train, Y_train, X_test, df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    # models[3] is the ('RF', RandomForestClassifier) entry
    random_forest = models[3][1]
    keys = df.keys()
    keys = keys[[2, 4, 5, 6, 7, 9, 11]]
    random_forest.fit(X_train, Y_train)
    # print (importance, feature name) pairs, most important first
    print(sorted(zip(map(lambda x: round(x, 4),
                         random_forest.feature_importances_), keys),
                 reverse=True))
    return
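The importances above come from how much each feature's splits reduce Gini impurity across the forest. A minimal sketch of the underlying quantity for a single split follows; the labels are made up and the helper names `gini_impurity` and `impurity_decrease` are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Made-up survival labels (1 = survived, 0 = died), split on a hypothetical feature.
left = [1, 1, 1, 0]
right = [0, 0, 0, 1]
parent = left + right

print(gini_impurity(parent))                   # 0.5
print(impurity_decrease(parent, left, right))  # 0.125
```

A feature whose splits produce larger decreases, summed over the trees and normalized, ends up with a larger importance.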

Model Evaluation
• Run on training set
• Performance metric: Accuracy
• Base Rate is the baseline accuracy if we predicted every person as having died on the Titanic (the majority class). Only algorithms that beat this base rate will be considered.
• Base Rate = 0.56 = actual # who died / total
• Visualize in: Error Bars
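A base rate is the accuracy of predicting the same class for everyone. The sketch below computes it for both constant predictions, using the test-set class counts that can be read off the confusion matrix later in this deck (100 died, 79 survived); the helper name is hypothetical.

```python
def constant_prediction_accuracy(actual, predicted_class):
    """Accuracy obtained by predicting the same class for everyone."""
    return sum(1 for label in actual if label == predicted_class) / len(actual)

# Test-set class counts from the confusion matrix in this deck:
# 93 + 7 = 100 died (0), 34 + 45 = 79 survived (1).
actual = [0] * 100 + [1] * 79

always_died = constant_prediction_accuracy(actual, 0)
always_survived = constant_prediction_accuracy(actual, 1)
print(round(always_died, 2))      # 0.56
print(round(always_survived, 2))  # 0.44
```

The 0.56 figure corresponds to always predicting the majority class, which is the harder baseline to beat.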

Accuracy Error Bars
from sklearn import model_selection

def evaluate_accuracy(models, X_train, Y_train):
    '''
    Scores each model with 10-fold cross-validation.
    Parameters: array of models, training set - features
    and output
    Returns: array of names, models, means, and stds
    '''
    results = []
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 12
    for name, model in models:
        # random_state only takes effect (and is only allowed) with shuffle=True
        kfold = model_selection.KFold(n_splits=10, shuffle=True,
                                      random_state=seed)
        cv_results = model_selection.cross_val_score(
            model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
        print(msg)
    return names, models, means, stds
import matplotlib.pyplot as plt

def evaluate_error_bar(names, models, means, stds):
    '''
    Compares accuracy of models with Error Bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    # error bar
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(1, 1, 1)
    # the model names on the x-axis become the tick labels
    ax.errorbar(names, means, yerr=stds, linestyle='None', marker='^')
    plt.show()
    return
Here we select Random Forest, which has the highest accuracy.

Explanation
• Run on test set
• Performance metric: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Runs test data through the selected model (Random Forest).
    Prints confusion matrix and classification report.
    Parameters: training set and test set, array of models
    Returns: none
    '''
    for name, model in models:
        if name == 'RF':
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name,
                  accuracy_score(Y_test, pred)))
            labels = np.unique(Y_test)
            confusion = confusion_matrix(Y_test, pred, labels=labels)
            print('\nConfusion Matrix:')
            # rows = actual class, columns = predicted class
            print(pd.DataFrame(confusion, index=labels, columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
    return

Explanation
Confusion Matrix

                    Predicted Class
Actual Class        Died    Survived
Died                93      7
Survived            34      45

Classification Report

                    Precision   Recall   F1_Score
Actual: Died        0.73        0.93     0.82
Actual: Survived    0.87        0.57     0.69

Accuracy = 0.77
• Better than Base Rate (0.56)
• Precision
• Precision for survivors is high (0.87). This means that if the model predicted
that you would survive, then you have a good chance of actually surviving.
• Recall
• Recall rate for survivors is low (0.57). This means that out of those who
actually survived the Titanic, we categorized many of them as having died.
• Recall rate for those who died is high (0.93). This means that out of those
who died, we miscategorized very few.
• F1 Score
• Harmonic mean of precision and recall. Here we see that we do a better job at identifying those who died (F1 = 0.82) than those who survived (F1 = 0.69)
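The figures in the classification report can be reproduced directly from the four confusion-matrix counts. This is a plain-arithmetic sketch; the helper name `class_metrics` is hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrix from the slide (rows = actual, columns = predicted).
died_died, died_survived = 93, 7
survived_died, survived_survived = 34, 45

total = died_died + died_survived + survived_died + survived_survived
accuracy = (died_died + survived_survived) / total

# For each class, false positives come from the other row,
# false negatives from the other column of the same row.
died = class_metrics(tp=died_died, fp=survived_died, fn=died_survived)
survived = class_metrics(tp=survived_survived, fp=died_survived, fn=survived_died)

print("Accuracy: %.2f" % accuracy)             # 0.77
print("Died:     %.2f %.2f %.2f" % died)       # 0.73 0.93 0.82
print("Survived: %.2f %.2f %.2f" % survived)   # 0.87 0.57 0.69
```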

Best Practices in Coding
• Modular programming
• Every step has a function
• Avoids hardcoding
• Easier to reproduce
• Docstrings
• Can be turned into HTML documentation of each function and of the workflow (for example with pydoc)
