Would you survive the Titanic?
Building Models with Data
What warrants this question as an ML problem?
• Complexity of the problem is high enough - concrete rules (a certain gender or ticket price) are not enough to determine the survival outcome
• A clean dataset is available - supervised algorithms can be applied to historical data
What software do I need?
• IDE to run Python
• Online: https://repl.it
• Code Editor: VS Code https://code.visualstudio.com/download
• Data Science Platform: Anaconda https://www.anaconda.com/distribution/
Data Source
• Structured data
• Features contain categorical and numeric values
• Target is 1 or 0 → binary classification
• Titanic dataset on Kaggle
• https://www.kaggle.com/c/titanic
import pandas as pd

def read_data(csv):
    '''
    Reads a csv file.
    Parameters: file path
    Returns: pandas DataFrame
    '''
    df = pd.read_csv(csv)
    return df
Exploratory Data Analysis
1. Summary statistics
2. Data visualization
3. Data processing
Summary Statistics
• Five-number summary
• Mean, median, frequency
• Number of missing values
def summary_statistics(df):
    '''
    Prints summary statistics about the data set.
    Parameters: dataframe
    Returns: none
    '''
    print(df[['Age', 'Parch', 'Fare']].describe())   # five-number summary, mean, count
    print(df.isnull().sum())                         # missing values per column
    print('%d instances and %d columns' % (df.shape[0], df.shape[1]))
    return
Variable    # of missing values
Age         177
Cabin       687
Embarked    2
• How to treat missing values
• Age: impute with the median (imputation)
• Cabin: the cabin number could give us useful info, like how close the person was to the rescue boats. But since 687 out of 891 values are missing, deleting the column is the better option here (deletion).
• Embarked: very few rows are missing, so we simply delete those two rows (deletion)
Data Visualization
• Waffle chart
• Scatter matrix
• Correlation matrix
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from pywaffle import Waffle

def visualize(df):
    '''
    Visualizes data using a waffle chart, scatter matrix, and
    correlation matrix.
    Parameters: df
    Returns: none
    '''
    # waffle chart of Parch (parents/children aboard) frequencies
    freq = df.Parch.value_counts()
    fig = plt.figure(
        FigureClass=Waffle,
        rows=15,
        values=list(freq.values),
        labels=[str(i) for i in freq.index]   # legend labels as strings
    )
    plt.show()
    # scatter matrix over the informative columns
    numeric_df = df.drop(columns=['PassengerId', 'Name', 'Ticket'])
    scatter_matrix(numeric_df, alpha=0.2, figsize=(9, 9))
    plt.show()
    # correlation matrix (background_gradient renders in a notebook)
    corr = numeric_df.corr(numeric_only=True)
    corr.style.background_gradient()
    return
Data Processing
a. Encoding: turning categorical variables into numeric values
b. Standardizing: scale data into [0, 1] range
c. Imputation: replace null values with median
d. Feature Selection: manually or automatically select input variables
e. Split data into training and validation data sets
Encoding and Standardizing
import numpy as np

def encoding(df):
    '''
    Converts all categorical variables into numeric
    representations, i.e. encoding.
    Parameters: dataframe
    Returns: dataframe
    '''
    for c in df.columns.values:
        # only encode non-numeric columns
        if df[c].dtype != np.int64 and df[c].dtype != np.float64:
            # build a category -> integer lookup table
            dictionary = {}
            x = 0
            for u in set(df[c].values.tolist()):
                dictionary[u] = x
                x += 1
            df[c] = list(map(dictionary.get, df[c]))
    return df
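This hand-rolled mapping mirrors what pandas' built-in factorize does; a one-line alternative (a sketch, not the deck's code - note that NaN becomes -1 rather than its own category):

df[c], _ = pd.factorize(df[c])   # assigns a stable integer to each category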
from sklearn import preprocessing

def standardize(df, variable_name):
    '''
    Standardizes a variable by scaling it to the range [0, 1].
    Parameters: dataframe, name of variable to standardize
    Returns: dataframe
    '''
    variable = pd.DataFrame(df[variable_name])
    standardized_variable = preprocessing.MinMaxScaler().fit_transform(variable)
    df[variable_name] = standardized_variable
    return df
Imputation
def imputation_null_median(df):
    '''
    Replaces null values with the median (Age) or, since Embarked
    is categorical, the most frequent value.
    Parameters: dataframe
    Returns: dataframe
    '''
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    return df
These machine learning algorithms only understand numbers, so categorical variables must be encoded.
Inputs to algorithms that exploit distances or similarities, like KNN or SVM, need to be standardized. Tree-based models do not share this need, although standardizing rarely hurts.
Replace nulls with the median (or the most frequent value for categorical variables).
Feature Selection
• Remove identifier variables like PassengerId and Ticket - primary keys have no predictive power
• Remove variables with too many NULL values (here, Cabin)
def choose_features(df):
    '''
    Selects the feature columns and the target column.
    Parameters: dataframe
    Returns: array of X (features), array of Y (target)
    '''
    array = df.values
    # columns 2, 4, 5, 6, 7, 9, 11 = Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
    X = array[:, [2, 4, 5, 6, 7, 9, 11]]
    Y = array[:, 1]   # column 1 = Survived
    return X, Y
Split Data Set
from sklearn import model_selection

def split_train_test(X, Y, percentage_test):
    '''
    Splits the data into training and test sets.
    Parameters: X, Y, fraction of data held out for testing
    Returns: X_train, X_test, Y_train, Y_test
    '''
    seed = 12   # fixed seed for reproducibility
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=percentage_test, random_state=seed)
    return X_train, X_test, Y_train, Y_test
• We have a small data set, so later on we will use 10-fold cross-validation to get a more reliable picture of model performance (a stratified variant is sketched below).
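A minimal sketch of stratified 10-fold cross-validation, which keeps the died/survived ratio consistent across folds (the deck's own evaluation code further down uses plain KFold; X_train and Y_train come from the split above):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=12)
scores = cross_val_score(LogisticRegression(solver='liblinear'),
                         X_train, Y_train, cv=skf, scoring='accuracy')
print('%.3f (+/- %.3f)' % (scores.mean(), scores.std()))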
Best Practices in Coding
• Modular programming: write each step in a separate function.
• Here, the sub-steps of “Data Processing” are composed into one parent function.
def process_data(df, standardize_variable_name, percentage_test):
    '''
    Processes data: encoding, standardizing, imputation, feature
    selection, and the train/test split.
    Parameters: dataframe, variable to be standardized, test fraction
    Returns: X_train, X_test, Y_train, Y_test
    '''
    df = encoding(df)
    df = standardize(df, standardize_variable_name)
    df = imputation_null_median(df)
    X, Y = choose_features(df)
    X_train, X_test, Y_train, Y_test = split_train_test(X, Y, percentage_test)
    return X_train, X_test, Y_train, Y_test
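A minimal end-to-end usage sketch (the file path and the choice of Fare as the standardized column are assumptions, not stated on the slides):

df = read_data('train.csv')   # Kaggle Titanic training file, assumed local path
X_train, X_test, Y_train, Y_test = process_data(df, 'Fare', 0.2)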
Model Building = Equation
• Binary Classification with both numeric and categorical variables
• Logistic Regression
• K Nearest Neighbor
• Decision Tree
• Random Forest
• Naïve Bayes
• Support Vector Machine
Model Building code
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def build_model(X_train, Y_train):
    '''
    Constructs the candidate classifiers: Logistic Regression,
    KNN, Decision Tree, Random Forest, Naive Bayes, and Support
    Vector Machine. (Fitting happens later, during evaluation.)
    Parameters: training set - features and output (unused here)
    Returns: array of (name, model) tuples
    '''
    models = []
    models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('RF', RandomForestClassifier(n_estimators=100, max_depth=5)))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC(gamma='auto')))
    return models
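Usage is a one-liner, and the returned list feeds the evaluation functions on the following slides:

models = build_model(X_train, Y_train)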
Estimation
• Gini Impurity
• The features with the most predictive power are Sex, Fare, and Pclass
Feature     Gini Index
Sex         0.47
Fare        0.18
Pclass      0.13
Age         0.10
Parch       0.04
SibSp       0.04
Embarked    0.04
def estimate_4e(models, X_train, Y_train, X_test, df):
    '''
    Examines feature importance using Gini impurity.
    Parameters: models, training set, test set, dataframe
    Returns: none
    '''
    random_forest = models[3][1]   # the ('RF', RandomForestClassifier) entry
    keys = df.keys()[[2, 4, 5, 6, 7, 9, 11]]   # names of the selected features
    random_forest.fit(X_train, Y_train)
    print(sorted(zip(map(lambda x: round(x, 4),
                         random_forest.feature_importances_), keys),
                 reverse=True))
    return
Model Evaluation
• Run on training set
• Performance metric: Accuracy
• Base Rate is the baseline accuracy of always predicting the majority class - here, that every passenger died. Only algorithms that beat this base rate will be considered.
• Base Rate = # who actually died / total = 0.56 (see the sketch below)
• Visualize in: Error Bars
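A minimal sketch of the base-rate computation (Y_test assumed from the earlier split):

import numpy as np

# Accuracy of always predicting the majority class.
values, counts = np.unique(Y_test, return_counts=True)
base_rate = counts.max() / counts.sum()   # ~0.56: 100 of the 179 test passengers died
print('Base rate: %.2f' % base_rate)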
Accuracy Error Bars
def evaluate_accuracy(models, X_train, Y_train):
    '''
    Scores each model with 10-fold cross-validation on the training set.
    Parameters: array of models, training set - features and output
    Returns: array of names, models, means, and stds
    '''
    results = []
    means = []
    stds = []
    names = []
    scoring = 'accuracy'
    seed = 12
    for name, model in models:
        # shuffle=True is required when passing random_state to KFold
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, Y_train,
                                                     cv=kfold, scoring=scoring)
        results.append(cv_results)
        means.append(cv_results.mean())
        stds.append(cv_results.std())
        names.append(name)
        print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
    return names, models, means, stds
def evaluate_error_bar(names, models, means, stds):
    '''
    Compares the accuracy of the models with an error bar graph.
    Parameters: array of names, models, means, stds
    Returns: none
    '''
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    # plotting against the name strings gives a labeled categorical x-axis
    plt.errorbar(names, means, stds, linestyle='None', marker='^')
    plt.show()
    return
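Chaining the two evaluation steps (a sketch; all names come from the functions above):

names, models, means, stds = evaluate_accuracy(models, X_train, Y_train)
evaluate_error_bar(names, models, means, stds)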
Here we select Random Forest, which has the highest cross-validated accuracy.
Explanation
• Run on test set
• Performance metrics: Accuracy, Recall, Precision, F1 score
• Visualize in: Confusion Matrix and Classification Report
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def test_set(X_train, Y_train, X_test, Y_test, models):
    '''
    Runs the test data through the selected model. Prints the
    confusion matrix and classification report.
    Parameters: training set and test set, array of models
    Returns: none
    '''
    for name, model in models:
        if name == 'RF':   # only the selected model (Random Forest)
            model.fit(X_train, Y_train)
            pred = model.predict(X_test)
            print('\n\n\n%s Accuracy: %.2f' % (name, accuracy_score(Y_test, pred)))
            labels = np.unique(Y_test)
            confusion = confusion_matrix(Y_test, pred, labels=labels)
            print('\nConfusion Matrix:')
            print(pd.DataFrame(confusion, index=labels, columns=labels))
            print('\nClassification Report:')
            print(classification_report(Y_test, pred))
    return
Explanation
Confusion Matrix:

                     Predicted: Died   Predicted: Survived
Actual: Died                93                  7
Actual: Survived            34                 45

Classification Report:

                     Precision   Recall   F1 Score
Actual: Died            0.73      0.93      0.82
Actual: Survived        0.87      0.57      0.69

Accuracy = 0.77
• Better than the Base Rate (0.56)
• Precision
• Precision for survivors is high (0.87). This means that if the model predicted that you would survive, you had a good chance of actually surviving.
• Recall
• Recall for survivors is low (0.57). This means that out of those who actually survived the Titanic, we categorized many of them as having died.
• Recall for those who died is high (0.93). This means that out of those who died, we miscategorized very few.
• F1 Score
• The harmonic mean of precision and recall. Here we see that we do a better job of identifying those who died (F1 = 0.82) than those who survived (F1 = 0.69); the sketch below checks these numbers.
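A minimal sanity check of the reported metrics, computed directly from the confusion matrix above:

# Counts from the confusion matrix (positive class = Survived).
tp, fn = 45, 34   # survivors predicted as survived / as died
fp, tn = 7, 93    # died predicted as survived / as died

precision = tp / (tp + fp)                                  # 45/52  ~ 0.87
recall    = tp / (tp + fn)                                  # 45/79  ~ 0.57
f1        = 2 * precision * recall / (precision + recall)   # ~ 0.69
accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 138/179 ~ 0.77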
Best Practices in Coding
• Modular programming
• Every step has a function
• Avoids hardcoding
• Easier to reproduce
• Docstrings
• Let tools like pydoc generate HTML documentation of each function and of the workflow
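For instance, Python's built-in pydoc can render the docstrings as an HTML page (the module name here is hypothetical):

python -m pydoc -w titanic_pipeline   # writes titanic_pipeline.html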