Predicting Survival on Titanic
Kaggle Machine Learning Project
Farhana Agufa | Associate Data Scientist
Titanic_README.md
Titanic Shipwreck | Python
🎯 Objective
Build a machine learning model to predict which passengers were more
likely to survive the Titanic disaster, based on passenger characteristics.
✅ Goal
Develop a supervised classification model that accurately distinguishes
survivors from non-survivors using the training data.
Dataset Overview
Train Set: 891 entries
Test Set: 418 entries
Features: Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Target: Survived (0 = No, 1 = Yes)
Data Cleaning
✅ No duplicate records were found in the dataset.
⚠ Missing values were handled as follows:
Age: imputed with the median
Embarked: imputed with the mode
Cabin: dropped due to a high proportion of missing values
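The cleaning steps above can be sketched as follows. A small toy frame stands in for the real training data (column names match the dataset; the values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the Titanic training data.
df = pd.DataFrame({
    "Age": [22.0, None, 38.0, None, 26.0],
    "Embarked": ["S", "C", None, "S", "S"],
    "Cabin": [None, "C85", None, None, "E46"],
})

# Age: impute with the median of the observed ages
df["Age"] = df["Age"].fillna(df["Age"].median())

# Embarked: impute with the mode (most frequent embarkation port)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop the column entirely (mostly missing)
df = df.drop(columns=["Cabin"])
```

After these steps the frame has no missing values left.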
Exploratory Data Analysis
● Survival by Class:
Passengers in 1st class had the highest survival rate, followed by 2nd class.
3rd class passengers had the highest mortality rate, indicating a strong correlation
between class and survival.
● Survival by Gender:
Females were significantly more likely to survive than males.
This trend reflects the "women and children first" policy during evacuation.
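Both findings come straight out of a groupby: since Survived is 0/1, the mean of the target per group is the group's survival rate. A sketch on toy rows (column names from the dataset, values illustrative):

```python
import pandas as pd

# Toy sample with the relevant columns.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Mean of the 0/1 target per group is the survival rate for that group.
by_class = df.groupby("Pclass")["Survived"].mean()
by_sex = df.groupby("Sex")["Survived"].mean()
```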
Feature Engineering
Created a new feature: FamilySize = SibSp + Parch + 1
● Combines the number of siblings/spouses (SibSp) and parents/children (Parch) aboard
to represent family presence on board.
Dropped non-informative features:
● Name and Ticket were removed due to limited predictive value and high variability.
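The feature-engineering step is a one-liner plus a drop. Sketched on a toy frame (the Name/Ticket values are made up; the formula is the one above):

```python
import pandas as pd

df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 1],
    "Name": ["Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Allen, Mr. William"],
    "Ticket": ["A/5 21171", "STON/O2", "373450"],
})

# FamilySize counts the passenger plus siblings/spouses and parents/children.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Drop high-cardinality, low-signal columns.
df = df.drop(columns=["Name", "Ticket"])
```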
Data Preprocessing: Encoding & Scaling
📌 Encoded Categorical Features:
Used pd.get_dummies() to one-hot encode the following categorical variables:
● Embarked
● Sex
📌 Scaled Numerical Features:
Applied StandardScaler to standardize the following numerical features:
● Age
● Fare
● FamilySize
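The two preprocessing steps can be sketched together on a toy frame. pd.get_dummies handles the one-hot encoding; StandardScaler rescales each numeric column to zero mean and unit variance:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "FamilySize": [2, 2, 1, 2],
})

# One-hot encode the categorical variables.
df = pd.get_dummies(df, columns=["Sex", "Embarked"])

# Standardize the numeric features to zero mean, unit variance.
num_cols = ["Age", "Fare", "FamilySize"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```

Note that in practice the scaler should be fit on the training split only and reused to transform the validation and test sets.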
Splitting the Training Dataset
Data Preparation Steps:
● Categorical features were encoded to make them machine-readable.
● The dataset was split into training and validation sets:
○ 80% for training (X_train, y_train), 20% for validation (X_val, y_val)
○ A random state of 42 was used to ensure reproducibility.
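The split above is a single call to scikit-learn's train_test_split. A sketch with a stand-in feature matrix and target (the real X and y come from the prepared training data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and target in place of the prepared Titanic data.
rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y = pd.Series(rng.integers(0, 2, size=100), name="Survived")

# 80/20 train/validation split, fixed seed for reproducibility.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```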
Model Selection
Selected Model: RandomForestClassifier
Why Random Forest? It handles non-linear relationships well and reduces overfitting through ensemble
learning.
Evaluation Strategy: Performed cross-validation to assess model robustness.
Results:
● Cross-validation scores: 0.8392, 0.8531, 0.8239, 0.7817, 0.8239
● Mean Accuracy: 0.8244
● Standard Deviation: 0.0240
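The evaluation strategy above is 5-fold cross-validation via cross_val_score. A sketch using synthetic data in place of the prepared features (the reported scores come from the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the prepared features.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(model, X, y, cv=5)
mean_acc, std_acc = scores.mean(), scores.std()
```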
Model Performance
The model was trained and evaluated using key performance metrics: Accuracy,
Precision, Recall, and F1-score. These metrics provided a comprehensive view of the
model’s ability to correctly predict survival outcomes.
Accuracy: 81%
Precision: 82%
Recall: 69%
F1-score: 75%
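These metrics are standard scikit-learn calls on the validation predictions. A sketch with hypothetical labels (the true and predicted arrays below are made up; the real ones come from the validation split):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical validation labels vs. model predictions.
y_val  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_val, y_pred)    # fraction of correct predictions
prec = precision_score(y_val, y_pred)  # of predicted survivors, how many survived
rec = recall_score(y_val, y_pred)      # of actual survivors, how many were found
f1 = f1_score(y_val, y_pred)           # harmonic mean of precision and recall
```

The gap between precision and recall in the reported results suggests the model misses some actual survivors while keeping its positive predictions fairly reliable.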
Making Predictions on Test Data
Goal:
Generate predictions for the test set and prepare the submission file in the required Kaggle
format.
Key Steps:
● Use the trained model to predict the Survived column for the test dataset.
● Create a submission DataFrame using:
○ PassengerId from the original test set
○ The predicted Survived values
● Export the results using to_csv() for submission to Kaggle.
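The submission steps above can be sketched as follows. The PassengerId values and predictions here are placeholders; the real ones come from the test set and the trained model:

```python
import pandas as pd

# Hypothetical PassengerId values and model predictions for the test set.
test_ids = [892, 893, 894]
preds = [0, 1, 0]

# Kaggle expects exactly two columns: PassengerId and Survived.
submission = pd.DataFrame({"PassengerId": test_ids, "Survived": preds})
submission.to_csv("submission.csv", index=False)
```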