Would you survive the Titanic?
Building Models with Data
What warrants this question as an ML problem?
• The problem is complex enough - simple rules (such as gender or
ticket price alone) are not enough to determine the survival
outcome
• A clean dataset is available - supervised algorithms can be applied to
historical data
What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/
Data Source
• Structured data
• Features contain categorical and numeric values
• Target is 1 or 0 → binary classification
• Titanic dataset on Kaggle
• https://www.kaggle.com/c/titanic
import pandas as pd

def read_data(csv):
    '''
    Reads csv file.
    Parameters: file path
    Returns: Pandas dataframe
    '''
    df = pd.read_csv(csv)
    return df
Exploratory Data Analysis
1. Summary statistics
2. Data visualization
3. Data processing
Summary Statistics
• Five-number summary
• Mean, median, frequency
• Number of missing values
def summary_statistics(df):
    '''
    Prints summary statistics about data set.
    Parameters: dataframe
    Returns: none
    '''
    print(df[['Age','Parch','Fare']].describe())
    print(df.isnull().sum())
    print('%d instances and %d columns' % (df.shape[0], df.shape[1]))
    return
Variable    # of missing values
Age         177
Cabin       687
Embarked    2
• How to treat missing values
• Age (imputation): replace missing values with the median age
• Cabin (deletion): the cabin number could give us useful
info, like how close the person was to the
lifeboats. But since 687 out of 891
values are missing, deleting the column is
the better option here.
• Embarked (deletion): very few values are missing, so
those two rows are deleted
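The two treatments above, imputation and deletion, can be sketched on a toy frame. This is only an illustration: the column names mirror the Titanic data, but the values are made up and the names `toy`, `imputed`, and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the Titanic columns; not real passenger data.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S', 'Q'],
})

# Imputation: replace missing ages with the median of the observed ages.
age_median = toy['Age'].median()
imputed = toy.assign(Age=toy['Age'].fillna(age_median))

# Deletion of a column: drop Cabin because most of its values are missing.
imputed = imputed.drop(columns=['Cabin'])

# Deletion of rows: drop the few rows where Embarked is missing.
cleaned = imputed.dropna(subset=['Embarked'])

print(cleaned)
```

Imputation keeps every row at the cost of inventing values, while deletion keeps only observed values at the cost of losing data, which is why the choice depends on how much is missing.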
Data Visualization
• Waffle chart
• Scatter matrix
• Correlation matrix
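import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from pywaffle import Waffle

def visualize(df):
    '''
    Visualizes data using waffle chart, scatter matrix, and
    correlation matrix.
    Parameters: df
    Returns: none
    '''
    # waffle: one block per passenger, grouped by Parch value
    freq = df.Parch.value_counts()
    fig = plt.figure(
        FigureClass=Waffle,
        rows=15,
        values=list(freq.values),
        labels=[str(i) for i in freq.index]
    )
    # scatter matrix
    numeric_df = df.drop(columns=['PassengerId','Name','Ticket'])
    scatter_matrix(numeric_df, alpha=0.2, figsize=(9,9))
    plt.show()
    # correlation matrix over numeric columns (numeric_only needs pandas >= 1.5)
    corr = df.corr(numeric_only=True)
    # the styled table is only displayed inside a Jupyter notebook
    corr.style.background_gradient()
    plt.show()
    return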
Data Processing
a. Encoding: turn categorical variables into numeric values
b. Standardizing: scale data into the [0, 1] range (min-max scaling)
c. Imputation: replace null values with the median
d. Feature Selection: manually or automatically select input variables
e. Split data into training and test data sets
Encoding and Standardizing
import pandas as pd

def encoding(df):
    '''
    Converts all categorical variables into numeric
    representations, i.e. encoding.
    Parameters: dataframe
    Returns: dataframe
    '''
    for c in df.columns:
        # only non-numeric (categorical or text) columns need encoding
        if not pd.api.types.is_numeric_dtype(df[c]):
            # map each distinct value to an integer code; sorting keeps
            # the codes reproducible between runs
            unique = sorted(df[c].dropna().unique())
            dictionary = {u: x for x, u in enumerate(unique)}
            # missing values stay missing so they can be imputed later
            df[c] = df[c].map(dictionary)
    return df
from sklearn import preprocessing

def standardize(df, variable_name):
    '''
    Standardizes variable by scaling to range [0, 1].
    Parameters: dataframe, name of variable to standardize
    Returns: dataframe
    '''
    variable = pd.DataFrame(df[variable_name])
    standardized_variable = preprocessing.MinMaxScaler().fit_transform(variable)
    # fit_transform returns a 2D array; flatten it back into one column
    df[variable_name] = standardized_variable.ravel()
    return df
Imputation
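def imputation_null_median(df):
    '''
    Replaces null values in Age with the median and in Embarked
    with the most frequent value.
    '''
    df['Age'] = df['Age'].fillna(df['Age'].median())
    # a median is not meaningful for a categorical column, so use the mode
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    return df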
These machine learning algorithms only
understand numbers, so categorical values must be encoded first.
Algorithms that rely on distances or similarities, like KNN or SVM, need
standardized features. Tree-based models do not share this
need, although standardizing is still a good idea.
Replace null with median.
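The [0, 1] scaling used for standardizing is min-max scaling, x' = (x - min) / (max - min). A minimal hand-rolled sketch of that formula follows; `min_max_scale` is a hypothetical helper and the fares are made-up numbers, so treat it as an illustration of what the scaler computes rather than a replacement for it.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up fares, chosen so the arithmetic is easy to follow.
fares = [0.0, 10.0, 25.0, 100.0]
scaled = min_max_scale(fares)
print(scaled)
```

After scaling, a large-valued feature such as Fare no longer dominates the distance calculations in KNN or SVM simply because of its units.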
Feature Selection
• Remove variables like Passenger ID, Ticket ID because primary keys
have no predictive power
• Remove variables with too many NULL values
def choose_features(df):
    '''
    Selects the feature columns (X) and the target column (Y).
    Parameters: dataframe
    Returns: array of X, array of Y
    '''
    array = df.values
    # columns: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
    X = array[:,[2, 4, 5, 6, 7, 9, 11]]
    # column 1 is the target, Survived
    Y = array[:,1]
    return X, Y
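Positional indices silently select the wrong columns if the column order ever changes. A minimal sketch of the same selection by column name follows. It assumes the standard column names of the Kaggle train.csv file; `choose_features_by_name` is a hypothetical variant, and the toy rows are made-up, already-encoded values.

```python
import pandas as pd

# Assumed column names, following the Kaggle Titanic train.csv layout.
FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
TARGET = 'Survived'

def choose_features_by_name(df, features=FEATURES, target=TARGET):
    """Return (X, Y) arrays selected by column name instead of position."""
    X = df[features].to_numpy()
    Y = df[target].to_numpy()
    return X, Y

# Made-up, already-encoded rows used only to show the call.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': [1, 0, 0],
    'Age': [22.0, 38.0, 26.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.28, 7.93],
    'Embarked': [2, 0, 2],
})
X, Y = choose_features_by_name(toy)
print(X.shape, Y)
```

Identifier columns such as PassengerId are left out simply by not listing them in `FEATURES`.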
Split Data Set
from sklearn import model_selection

def split_train_test(X, Y, percentage_test):
    '''
    Splits features and target into training and test sets.
    '''
    seed = 12
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=percentage_test, random_state=seed)
    return X_train, X_test, Y_train, Y_test
• We have a small data set, so later on we will use 10-fold cross-validation to
get a more reliable estimate of model performance.
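To make the idea of 10-fold cross-validation concrete, here is a hand-rolled sketch of how the training rows are partitioned into folds. It is not scikit-learn's implementation and it does not shuffle; `k_fold_indices` is a hypothetical helper, and 712 rows is an assumption based on an 80/20 split of the 891 passengers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Assumed training-set size: roughly 80% of 891 rows.
folds = list(k_fold_indices(n_samples=712, k=10))
for i, (train_idx, val_idx) in enumerate(folds):
    print('fold %d: %d training rows, %d validation rows'
          % (i, len(train_idx), len(val_idx)))
```

Each row is used for validation exactly once and for training nine times, so the averaged score depends less on one lucky or unlucky split.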
Best Practices in Coding
• Modular programming: write each step in a separate function.
• Here, the sub-steps of “Data Processing” are wrapped in one parent
function.
def process_data(df, standardize_variable_name, percentage_test):
    '''
    Processes data by encoding, standardizing, imputing,
    selecting features, and splitting into training and test sets.
    Parameters: dataframe, variable to be standardized,
    fraction of data held out as test set
    Returns: training and test sets - features and output
    '''
    df = encoding(df)
    df = standardize(df, standardize_variable_name)
    df = imputation_null_median(df)
    X, Y = choose_features(df)
    X_train, X_test, Y_train, Y_test = split_train_test(X, Y, percentage_test)
    return X_train, X_test, Y_train, Y_test
Model Building = Equation
• Binary Classification with both numeric and categorical variables
• Logistic Regression
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
Model Building code
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def build_model(X_train, Y_train):
    '''
    Creates unfitted Logistic Regression, KNN, Decision Tree,
    Random Forest, Naive Bayes, and Support Vector Machine models.
    Parameters: training set - features and output
    Returns: list of (name, model) tuples
    '''
    models = []
    # the target is binary, so no multi-class option is needed
    models.append(('LR', LogisticRegression(solver='liblinear')))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF', RandomForestClassifier(n_estimators=100, max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models
Estimation
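• Gini impurity: a feature's importance is the total decrease in Gini impurity it produces across the trees of the Random Forest
• The features with the most predictive power are Sex, Fare, and Pclass
Feature     Gini importance
Sex         0.47
Fare        0.18
Pclass      0.13
Age         0.10
Parch       0.04
SibSp       0.04
Embarked    0.04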
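The Gini impurity of a tree node is 1 minus the sum of squared class proportions, and a split is scored by how much it lowers the weighted impurity of the child nodes. The sketch below works through one split by hand. The class counts are made up, and `gini_impurity` and `impurity_decrease` are hypothetical helpers; a Random Forest aggregates such decreases per feature over all its trees.

```python
def gini_impurity(counts):
    """Gini impurity of a node given its per-class counts: 1 - sum(p_k ** 2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini_impurity(left) \
        + (sum(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Made-up counts in the form [died, survived]; not taken from the Titanic data.
parent = [50, 50]
left = [40, 10]    # e.g. one side of a split on some feature
right = [10, 40]   # the other side

print(gini_impurity(parent))                   # impurity before the split
print(impurity_decrease(parent, left, right))  # how much the split helps
```

A pure node has impurity 0 and an evenly mixed two-class node has impurity 0.5, so larger decreases mean a more useful split.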
def estimate_4e(models, X_train, Y_train, X_test, df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    # models[3] is the ('RF', RandomForestClassifier) entry
    random_forest = models[3][1]
    keys = df.keys()
    # same column positions as in choose_features
    keys = keys[[2, 4, 5, 6, 7, 9, 11]]
    random_forest.fit(X_train, Y_train)
    pred = random_forest.predict(X_test)
    print(sorted(zip(map(lambda x: round(x, 4),
                         random_forest.feature_importances_), keys),
                 reverse=True))
    return
Model Evaluation
• Run on training set
• Performance metric: Accuracy
• Base Rate is the baseline accuracy we would get by predicting the majority class for everyone, here that every person died on the
Titanic. Only algorithms that beat this base rate will be considered.
• Base Rate = 0.56 = actual # who died / total (100 of 179 in the test set)
• Visualize in: Error Bars
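The base rate can be computed directly from the labels as the share of the most common class. A minimal sketch follows, assuming the class counts implied by the test-set confusion matrix in this deck (100 died, 79 survived); `base_rate` is a hypothetical helper.

```python
from collections import Counter

def base_rate(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# 0 = died, 1 = survived; counts assumed from the test-set confusion matrix.
y_test = [0] * 100 + [1] * 79
label, rate = base_rate(y_test)
print('always predict %d: accuracy %.2f' % (label, rate))
```

Any model whose accuracy does not clearly exceed this number has learned nothing beyond the class balance.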
Accuracy Error Bars
from sklearn import model_selection

def evaluate_accuracy(models, X_train, Y_train):
    '''
    Estimates the accuracy of each model with 10-fold
    cross-validation on the training set.
    Parameters: array of models, training set - features
    and output
    Returns: array of names, models, means, and stds
    '''
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 12
    for name, model in models:
        # shuffle=True is required for random_state to have an effect
        kfold = model_selection.KFold(n_splits=10, shuffle=True,
                                      random_state=seed)
        cv_results = model_selection.cross_val_score(
            model, X_train, Y_train, cv=kfold, scoring=scoring)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        msg = '%s: %f (%f)' % (name, cv_results.mean(), cv_results.std())
        print(msg)
    return names, models, means, stds
import matplotlib.pyplot as plt

def evaluate_error_bar(names, models, means, stds):
    '''
    Compares accuracy of models with Error Bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    # error bar: mean accuracy with one standard deviation per model
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(1, 1, 1)
    # plotting against the names labels the x-axis ticks automatically
    ax.errorbar(names, means, yerr=stds, linestyle='None', marker='^')
    plt.show()
    return
Here we select Random Forest, which has the highest accuracy.
Explanation
• Run on test set
• Performance metrics: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Runs test data through the selected model (Random Forest).
    Prints its confusion matrix and classification report.
    Parameters: training set and test set, array of models
    Returns: none
    '''
    for name, model in models:
        if name == 'RF':
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name,
                  accuracy_score(Y_test, pred)))
            labels = np.unique(Y_test)
            # rows are actual classes, columns are predicted classes
            confusion = confusion_matrix(Y_test, pred, labels=labels)
            print('\nConfusion Matrix:')
            print(pd.DataFrame(confusion, index=labels, columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
    return
Explanation
Confusion Matrix
                    Predicted: Died    Predicted: Survived
Actual: Died        93                 7
Actual: Survived    34                 45

Classification Report
                    Precision    Recall    F1 Score
Actual: Died        0.73         0.93      0.82
Actual: Survived    0.87         0.57      0.69

Accuracy = 0.77
• Better than Base Rate (0.56)
• Precision
• Precision for survivors is high (0.87). This means that if the model predicted
that you would survive, then you have a good chance of actually surviving.
• Recall
• Recall rate for survivors is low (0.57). This means that out of those who
actually survived the Titanic, we categorized many of them as having died.
• Recall rate for those who died is high (0.93). This means that out of those
who died, we miscategorized very few.
• F1 Score
• Harmonic mean of precision and recall. Here we see that we do a better job
at identifying those who died (F1 = 0.82) than those who survived (F1 =
0.69)
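The reported precision, recall, F1, and accuracy values can be recomputed from the four cells of the confusion matrix. The sketch below does this by hand for both classes; `class_metrics` is a hypothetical helper, and only the four counts come from the slides.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)   # of those predicted positive, how many were right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Confusion matrix from the slides: rows are actual, columns are predicted.
died_as_died, died_as_survived = 93, 7
survived_as_died, survived_as_survived = 34, 45

# Treat "survived" as the positive class, then "died".
survived = class_metrics(tp=survived_as_survived,
                         fp=died_as_survived,
                         fn=survived_as_died)
died = class_metrics(tp=died_as_died,
                     fp=survived_as_died,
                     fn=died_as_survived)

total = died_as_died + died_as_survived + survived_as_died + survived_as_survived
accuracy = (died_as_died + survived_as_survived) / total

print('Died:     precision %.2f, recall %.2f, F1 %.2f' % died)
print('Survived: precision %.2f, recall %.2f, F1 %.2f' % survived)
print('Accuracy: %.2f' % accuracy)
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why the low recall for survivors drags their F1 down to 0.69.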
Best Practices in Coding
• Modular programming
• Every step has a function
• Avoids hardcoding
• Easier to reproduce
• Docstrings
• Can be turned into HTML documentation of each
function and of the workflow (for example with python -m pydoc -w <module>)
