🎯 Objective
Build a machine learning model to predict which passengers were more
likely to survive the Titanic disaster, based on passenger characteristics.
✅ Goal
Develop a supervised classification model that accurately distinguishes
survivors from non-survivors using the training data.
3. Dataset Overview
Train Set: 891 entries
Test Set: 418 entries
Features: Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Target: Survived (0 = No, 1 = Yes)
4. Data Cleaning
✅ No duplicate records were found in the dataset.
⚠ Missing values were handled as follows:
Age: imputed with the median
Embarked: imputed with the mode
Cabin: dropped due to the high proportion of missing values
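The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up toy DataFrame, not the original notebook code; the variable names `df` and `n_duplicates` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Count fully duplicated rows; the report found none in the real data.
n_duplicates = int(df.duplicated().sum())

# Age: fill missing values with the column median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Embarked: fill missing values with the most frequent port (the mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop the column because too many of its values are missing.
df = df.drop(columns=["Cabin"])
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for a skewed column such as Age.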
5. Exploratory Data Analysis
● Survival by Class:
Passengers in 1st class had the highest survival rate, followed by 2nd class.
3rd class passengers had the highest mortality rate, indicating a strong association
between class and survival.
6. Exploratory Data Analysis (continued)
● Survival by Gender:
Females were significantly more likely to survive than males.
This trend reflects the "women and children first" policy during evacuation.
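Survival rates per group, like those summarized above, can be computed with a group-by. The sketch below uses a tiny made-up DataFrame, so its numbers are illustrative only and are not the report's results.

```python
import pandas as pd

# Toy data (made up); the real analysis would use the full training set.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
survival_by_class = df.groupby("Pclass")["Survived"].mean()
survival_by_sex = df.groupby("Sex")["Survived"].mean()
```

Each result is a Series indexed by the grouping column, which can be printed or plotted directly.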
7. Feature Engineering
Created a new feature: FamilySize = SibSp + Parch + 1
● Combines the number of siblings/spouses (SibSp) and parents/children (Parch) aboard,
plus the passenger themselves (the +1), to represent family presence on board.
Dropped non-informative features:
● Name and Ticket were removed due to limited predictive value and high cardinality.
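A minimal sketch of these two steps, assuming a pandas DataFrame named `df` (the name and the toy values are illustrative):

```python
import pandas as pd

# Toy rows with made-up values.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Ticket": ["T1", "T2", "T3"],
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# FamilySize counts relatives aboard plus the passenger themselves (+1),
# so a passenger travelling alone gets FamilySize == 1.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Drop the columns the report treats as non-informative.
df = df.drop(columns=["Name", "Ticket"])
```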
8. Data Preprocessing: Encoding & Scaling
📌 Encoded Categorical Features:
Used pd.get_dummies() to one-hot encode the following categorical variables:
● Embarked
● Sex
📌 Scaled Numerical Features:
Applied StandardScaler to standardize the following numerical features (zero mean, unit variance):
● Age
● Fare
● FamilySize
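The encoding and scaling steps might look like the sketch below. The toy values and the names `df`, `num_cols`, and `scaler` are assumptions for illustration; the original code is not shown in the report.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "FamilySize": [2, 2, 1, 2],
})

# One-hot encode the categorical columns: each category becomes its own column,
# e.g. Sex_female, Sex_male, Embarked_C, Embarked_Q, Embarked_S.
df = pd.get_dummies(df, columns=["Embarked", "Sex"])

# Standardize the numeric columns to zero mean and unit variance.
num_cols = ["Age", "Fare", "FamilySize"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```

To apply the same transformation to the test set, the already fitted `scaler` would be reused with `scaler.transform(...)` rather than fitted again. Random forests are largely insensitive to feature scaling, so this step mainly matters if other model types are tried later.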
9. Splitting the Training Dataset
Data Preparation Steps:
● Categorical features were encoded to make them machine-readable.
● The dataset was split into training and validation sets:
○ 80% for training (X_train, y_train), 20% for validation (X_val, y_val)
○ A random state of 42 was used to ensure reproducibility.
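A sketch of the 80/20 split with scikit-learn's `train_test_split`, using a made-up 10-row dataset so the resulting sizes are easy to check:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix and target (values are made up).
X = pd.DataFrame({
    "Age": np.arange(10, dtype=float),
    "Fare": np.arange(10, dtype=float) * 2,
})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], name="Survived")

# test_size=0.2 holds out 20% of the rows for validation;
# random_state=42 makes the shuffle reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With 10 rows this yields 8 training rows and 2 validation rows; on the 891-row training set the same call gives 712 and 179.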
10. Model Selection
Selected Model: RandomForestClassifier
Why Random Forest? It handles non-linear relationships well and reduces overfitting by
averaging the predictions of many decision trees (ensemble learning).
Evaluation Strategy: Performed 5-fold cross-validation to assess model robustness.
Results:
● Cross-validation scores: [0.83916084 0.85314685 0.82394366 0.78169014 0.82394366]
● Mean Accuracy: 0.8244
● Standard Deviation: 0.0240
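The cross-validation step can be sketched as below. The data here is synthetic (generated with `make_classification`), and the hyperparameters `n_estimators=50` and `random_state=42` are assumptions, since the report does not state the model settings. The last lines recompute the summary statistics from the five scores reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training split (not the Titanic data).
X_train, y_train = make_classification(n_samples=200, n_features=6, random_state=42)

# Hyperparameters are illustrative assumptions.
model = RandomForestClassifier(n_estimators=50, random_state=42)

# cv=5 produces one accuracy score per fold, five in total.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Summary statistics for the scores listed in the report.
reported = np.array([0.83916084, 0.85314685, 0.82394366, 0.78169014, 0.82394366])
reported_mean = reported.mean()
reported_std = reported.std()  # population standard deviation (ddof=0)
```

The reported mean (0.8244) and standard deviation (0.0240) match `reported.mean()` and `reported.std()`, which indicates the population standard deviation (NumPy's default) was used.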
11. Model Performance
The model was trained and evaluated using key performance metrics: Accuracy,
Precision, Recall, and F1-score. Together, these metrics give a comprehensive view of the
model’s ability to correctly predict survival outcomes.
Accuracy: 81%
Precision: 82%
Recall: 69%
F1-score: 75%
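These four metrics can be computed with `sklearn.metrics`. The labels and predictions below are made up to keep the arithmetic easy to follow; they do not reproduce the figures above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and predictions (1 = survived, 0 = did not survive).
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_val, y_pred)    # share of all predictions that are correct
precision = precision_score(y_val, y_pred)  # share of predicted survivors who survived
recall = recall_score(y_val, y_pred)        # share of actual survivors that were found
f1 = f1_score(y_val, y_pred)                # harmonic mean of precision and recall
```

The reported F1-score is consistent with the reported precision and recall: 2 × 0.82 × 0.69 / (0.82 + 0.69) ≈ 0.75. A recall lower than the precision means the model misses some actual survivors more often than it wrongly predicts survival.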
12. Making Predictions on Test Data
Goal:
Generate predictions for the test set and prepare the submission file in the required Kaggle
format.
Key Steps:
● Use the trained model to predict the Survived column for the test dataset.
● Create a submission DataFrame using:
○ PassengerId from the original test set
○ The predicted Survived values
● Export the results using to_csv() for submission to Kaggle.
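The key steps above could be sketched as follows. The toy data, the two-feature model, and the in-memory buffer are assumptions made to keep the example self-contained; in practice a file path such as "submission.csv" (a hypothetical name) would be passed to `to_csv()`.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training data and a small model (made-up values, illustrative settings).
X_train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0],
                        "Fare": [7.25, 71.28, 7.92, 53.10]})
y_train = pd.Series([0, 1, 1, 0])
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)

# Toy test set; it must go through the same preprocessing as the training data.
test_df = pd.DataFrame({"PassengerId": [892, 893],
                        "Age": [34.5, 47.0],
                        "Fare": [7.83, 7.00]})
X_test = test_df[["Age", "Fare"]]

# Submission: PassengerId from the original test set plus the predicted Survived values.
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": model.predict(X_test),
})

# index=False keeps the row index out of the file, leaving exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```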