Predicting Titanic Survival Presentation

Predicting Survival on Titanic
Kaggle Machine Learning Project
Farhana Agufa | Associate Data Scientist
Titanic_README.md
Titanic Shipwreck | Python

🎯 Objective
Build a machine learning model to predict which passengers were more
likely to survive the Titanic disaster, based on passenger characteristics.
✅ Goal
Develop a supervised classification model that accurately distinguishes
survivors from non-survivors using the training data.

Dataset Overview
Train Set: 891 entries
Test Set: 418 entries
Features: Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Target: Survived (0 = No, 1 = Yes)

Data Cleaning
✅ No duplicate records were found in the dataset.
⚠ Missing values were handled as follows:
Age: imputed with the median
Embarked: imputed with the mode
Cabin: dropped because most of its values are missing
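
The cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual notebook: the small DataFrame stands in for the Kaggle training file, and its values are made up.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame standing in for the training data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
    "Cabin": [None, "C85", None, None, "E46"],
})

# Count fully duplicated rows and missing values per column.
n_duplicates = int(df.duplicated().sum())
missing_before = df.isna().sum()

# Age: fill gaps with the median. Embarked: fill gaps with the most frequent value.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop the column entirely.
df = df.drop(columns=["Cabin"])

print(n_duplicates)
print(df)
```

If the test set is cleaned the same way, reusing the median and mode computed on the training set keeps the two sets consistent.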

Exploratory Data Analysis
● Survival by Class:
Passengers in 1st class had the highest survival rate, followed by 2nd class.
3rd class passengers had the highest mortality rate, indicating a strong association
between passenger class and survival.

● Survival by Gender:
Females were significantly more likely to survive than males.
This pattern is consistent with the "women and children first" policy followed during evacuation.
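
Survival rates like the ones described above are typically computed with a groupby on the target column. The sketch below uses a made-up eight-row frame, so its numbers illustrate the technique only and are not the real Titanic rates.

```python
import pandas as pd

# Made-up rows for illustration; not real passenger data.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Because Survived is 0/1, its mean within a group is that group's survival rate.
rate_by_class = df.groupby("Pclass")["Survived"].mean()
rate_by_sex = df.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```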

Feature Engineering
Created a new feature: FamilySize = SibSp + Parch + 1
● Combines the number of siblings/spouses (SibSp) and parents/children (Parch) aboard
to represent family presence on board.
Dropped non-informative features:
● Name and Ticket were removed because their values are nearly unique per passenger and offer limited predictive value in raw form.
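
A minimal sketch of these two steps, assuming the data is held in a pandas DataFrame (the three rows below are made up):

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Name":   ["Passenger A", "Passenger B", "Passenger C"],
    "Ticket": ["T1", "T2", "T3"],
    "SibSp":  [1, 0, 3],
    "Parch":  [0, 0, 2],
})

# FamilySize counts siblings/spouses, parents/children, and the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Drop the columns that are not used as model inputs.
df = df.drop(columns=["Name", "Ticket"])

print(df)
```

The "+ 1" means a passenger travelling alone has FamilySize 1 rather than 0.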

Data Preprocessing: Encoding & Scaling
📌 Encoded Categorical Features:
Used pd.get_dummies() to one-hot encode the following categorical variables:
● Embarked
● Sex
📌 Scaled Numerical Features:
Applied StandardScaler to standardize (zero mean, unit variance) the following numerical features:
● Age
● Fare
● FamilySize
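
The encoding and scaling steps could look roughly like this. The four-row frame is made up, and the variable name num_cols is only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration.
df = pd.DataFrame({
    "Sex":        ["male", "female", "female", "male"],
    "Embarked":   ["S", "C", "Q", "S"],
    "Age":        [22.0, 38.0, 26.0, 35.0],
    "Fare":       [7.25, 71.28, 7.92, 53.10],
    "FamilySize": [2, 2, 1, 2],
})

# One-hot encode the categorical columns; each category becomes its own column.
df = pd.get_dummies(df, columns=["Sex", "Embarked"])

# Standardize the numeric columns to zero mean and unit variance.
num_cols = ["Age", "Fare", "FamilySize"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df.columns.tolist())
```

One caveat worth keeping in mind: fitting the scaler on the training portion only, then calling transform on the validation and test data, avoids leaking information from held-out rows. Tree-based models such as random forests are also largely insensitive to feature scaling, so this step matters more if other model types are tried.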

Splitting the Training Dataset
Data Preparation Steps:
● Categorical features were encoded to make them machine-readable.
● The dataset was split into training and validation sets:
○ 80% for training (X_train, y_train), 20% for validation (X_val, y_val)
○ A random state of 42 was used to ensure reproducibility.
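
A split like the one described is usually done with scikit-learn's train_test_split. The sketch below uses a synthetic 100-row array in place of the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the Survived target.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Hold out 20% of the rows for validation; random_state fixes the shuffle.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape)
```

Passing stratify=y is an optional extra (not mentioned in the slides) that keeps the survivor ratio the same in both parts.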

Model Selection
Selected Model: RandomForestClassifier
Why Random Forest? It handles non-linear relationships well and reduces overfitting by averaging the
predictions of many decision trees.
Evaluation Strategy: Performed 5-fold cross-validation to assess model robustness.
Results:
● Cross-validation scores: [0.83916084 0.85314685 0.82394366 0.78169014 0.82394366]
● Mean Accuracy: 0.8244
● Standard Deviation: 0.0240
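
A sketch of how such scores can be produced with cross_val_score. The synthetic dataset and the hyperparameters (n_estimators=100, random_state=42) are assumptions for illustration, so the scores it prints will differ from the ones reported above. The last lines recompute the summary statistics from the reported fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hyperparameters here are illustrative assumptions, not the project's settings.
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Five folds: each score is the accuracy on one held-out fifth of the data.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores, scores.mean(), scores.std())

# Recompute the summary from the fold scores reported in the slides.
reported = np.array([0.83916084, 0.85314685, 0.82394366, 0.78169014, 0.82394366])
reported_mean = reported.mean()
reported_std = reported.std()  # NumPy default: population std (ddof=0)
print(round(reported_mean, 4), round(reported_std, 4))
```

The reported standard deviation of 0.0240 matches the population formula (ddof=0), which is what NumPy's std uses by default.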

Model Performance
The model was trained and evaluated using key performance metrics: Accuracy,
Precision, Recall, and F1-score. These metrics provided a comprehensive view of the
model’s ability to correctly predict survival outcomes.
Accuracy: 81%
Precision: 82%
Recall: 69%
F1-score: 75%
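
These four metrics are available in sklearn.metrics. The sketch below computes them on a made-up set of ten labels, then checks that the reported F1-score follows from the reported precision and recall.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)

# F1 implied by the reported precision (82%) and recall (69%).
reported_f1 = 2 * 0.82 * 0.69 / (0.82 + 0.69)
print(round(reported_f1, 2))
```

Recall being lower than precision means the model misses more actual survivors than it wrongly labels non-survivors as survivors.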

Making Predictions on Test Data
Goal:
Generate predictions for the test set and prepare the submission file in the required Kaggle
format.
Key Steps:
● Use the trained model to predict the Survived column for the test dataset.
● Create a submission DataFrame using:
○ PassengerId from the original test set
○ The predicted Survived values
● Export the results using to_csv() for submission to Kaggle.
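
The key steps above can be sketched as follows. The training rows, test rows, feature columns, and hyperparameters are all made up for illustration; only the two-column PassengerId/Survived layout comes from the steps listed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data with two illustrative feature columns.
X_train = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Fare": [71.3, 7.9, 13.0, 8.1]})
y_train = [1, 0, 1, 0]

# Made-up test rows; PassengerId is kept aside and not used as a feature.
test_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 1, 2],
    "Fare": [7.8, 60.0, 9.7],
})

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Predict with the same feature columns, in the same order, as used for training.
predictions = model.predict(test_df[["Pclass", "Fare"]])

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": predictions,
})

# With no path, to_csv returns the CSV text; pass a file name
# (for example "submission.csv") to write it to disk instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Setting index=False keeps the DataFrame's row index out of the file, so the output contains only the two expected columns.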
