1. EDA on Haberman's Survival Data
Objective:
The objective is to classify whether or not a patient survives after surgery for breast cancer.
Data Description:
Data is collected from https://www.kaggle.com/gowtamsingulur/habermancsv.
The data set contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital
on the survival of patients who had undergone surgery for breast cancer.
Importing Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Loading data
In [2]:
data=pd.read_csv("C:/Users/KIIT/Applied_AI_Practice_Data/Haberman/haberman.csv")
data.head()
In [3]:
data.columns
'age' - Age of the patient at the time of the operation.
'year' - Year of the operation, stored as the year minus 1900 (e.g. 64 means 1964).
'nodes' - Number of positive axillary lymph nodes. (Lymph nodes are small, bean-shaped organs that act as filters; the axillary ones are located
in the underarm. Lymph nodes that contain cancer cells are called positive.)
'status' - Survival status of the patient.
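Since 'year' keeps only the offset from 1900, adding 1900 recovers the calendar year. A minimal sketch, rebuilding the first five rows printed by data.head(); the column name `full_year` is a hypothetical addition and is not part of the dataset.

```python
import pandas as pd

# First five rows of the dataset, as printed by data.head()
sample = pd.DataFrame({
    "age": [30, 30, 30, 31, 31],
    "year": [64, 62, 65, 59, 65],
    "nodes": [1, 3, 0, 2, 4],
    "status": [1, 1, 1, 1, 1],
})

# 'year' holds the year minus 1900; 'full_year' is a hypothetical helper column
sample["full_year"] = sample["year"] + 1900
print(sample[["year", "full_year"]])
```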
In [4]:
data.count()
Out[2]:
   age  year  nodes  status
0   30    64      1       1
1   30    62      3       1
2   30    65      0       1
3   31    59      2       1
4   31    65      4       1
Out[3]:
Index(['age', 'year', 'nodes', 'status'], dtype='object')
In [5]:
data.isnull().sum()
In [6]:
data['status'].value_counts()
Observations
There are 4 columns; 'status' is the output (target) column.
'status' has 2 classes: 1 -> 'survived' (225 rows) and 2 -> 'died' (81 rows).
There are 306 rows in total.
There are no missing values in the dataset.
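The class counts can also be read as proportions, which makes the imbalance between the two classes easier to see. The sketch below rebuilds the 'status' column from the counts reported by value_counts() (225 and 81) instead of loading the CSV, so it runs on its own.

```python
import pandas as pd

# Rebuild the 'status' column from the reported counts: 225 ones, 81 twos
status = pd.Series([1] * 225 + [2] * 81, name="status")

counts = status.value_counts()                         # absolute counts per class
shares = status.value_counts(normalize=True).round(3)  # fraction of rows per class

print(counts)
print(shares)
```

On the real DataFrame the same call would be data['status'].value_counts(normalize=True).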
In [7]:
data['status']=data['status'].apply(lambda x: 'survived' if x==1 else 'died')
For easier reading, status 1 is relabelled 'survived' and status 2 is relabelled 'died'.
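One thing to be aware of with the lambda approach is that any value other than 1 becomes 'died'. A dictionary-based map is a possible alternative that leaves unexpected codes as NaN so they can be spotted. The sketch below uses a toy Series (not the dataset) with a deliberately invalid code 3 to show the difference.

```python
import pandas as pd

# Toy status codes; 3 is a deliberately invalid code for illustration
status = pd.Series([1, 2, 1, 3])

# lambda-style relabelling: anything that is not 1 becomes 'died'
via_lambda = status.apply(lambda x: "survived" if x == 1 else "died")

# dict-based relabelling: codes missing from the dict become NaN
via_map = status.map({1: "survived", 2: "died"})

print(via_lambda.tolist())
print(via_map.tolist())
```

Since the dataset has only the codes 1 and 2 and no missing values, both approaches give the same result here.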
In [8]:
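# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is used instead
s=sns.FacetGrid(data,hue='status',height=6)
s=s.map(sns.histplot,'age',kde=True,stat='density')
s.add_legend()
plt.show()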
Out[4]:
age 306
year 306
nodes 306
status 306
dtype: int64
Out[5]:
age 0
year 0
nodes 0
status 0
dtype: int64
Out[6]:
1 225
2 81
Name: status, dtype: int64
Observation
From the first plot, patients younger than about 35 appear more likely to survive,
and patients older than about 75 appear less likely to survive.
Most of the patients who survived had fewer than 5 positive nodes.
However, the distributions of the two classes overlap heavily, so these plots alone cannot separate them clearly.
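Impressions read off a density plot can be checked numerically by binning a column and computing the share of survivors in each bin. The following is a rough sketch: the helper `survival_share` and the toy rows are hypothetical and are not taken from the dataset, and the bin edges 35 and 75 simply mirror the age thresholds mentioned above.

```python
import pandas as pd

def survival_share(df, column, bins):
    """Share of 'survived' rows within each half-open bin [a, b) of `column`."""
    binned = pd.cut(df[column], bins=bins, right=False)
    survived = df["status"].eq("survived")
    return survived.groupby(binned, observed=False).mean()

# Hypothetical rows for illustration only, not taken from the dataset
toy = pd.DataFrame({
    "age": [30, 33, 40, 52, 60, 78],
    "status": ["survived", "survived", "died", "survived", "died", "died"],
})

share_by_age = survival_share(toy, "age", bins=[0, 35, 75, 120])
print(share_by_age)
```

With the real data, calling survival_share(data, 'age', [0, 35, 75, 120]) after the relabelling step would give the corresponding shares; bins with few patients should be read with caution.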
In [11]:
sns.set_style('whitegrid')
sns.pairplot(data,hue='status',height=5)
plt.show()
Observation
Looking at the scatter plots, we can't distinguish the classes.
In [12]:
sns.boxplot(x='status',y='age',data=data)
plt.show()
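A box plot draws the quartiles of each group, so the same information can be obtained as numbers with a grouped quantile. This is a minimal sketch on hypothetical toy rows (not the dataset), assuming 'status' has already been relabelled to 'survived' and 'died'.

```python
import pandas as pd

# Hypothetical rows for illustration only, not taken from the dataset
toy = pd.DataFrame({
    "age": [30, 40, 50, 60, 45, 55, 65],
    "status": ["survived"] * 4 + ["died"] * 3,
})

# Quartiles of 'age' per class: the box edges and the median line of the box plot
quartiles = toy.groupby("status")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

On the real DataFrame, data.groupby('status')['age'].quantile([0.25, 0.5, 0.75]).unstack() would give the values behind the plotted boxes.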