3. The Dataset
– “Titanic: Machine Learning from Disaster”
(train.csv)
– https://www.kaggle.com/c/titanic
– Sample size n=891
– (Population size= 2,224)
4. The Simplest Model
– Survived: (survived=1, not survived=0)
– Child: (child(Age<15)=1, adult(Age>=15))
– But this model does not satisfy Zero Conditional Mean (MLR4)
𝑺𝒖𝒓𝒗𝒊𝒗𝒆𝒅 = 𝜷 𝟎+𝜷 𝟏 𝑪𝒉𝒊𝒍𝒅
6. New Regression Model
and Hypothesis
– Pclass: Ticket Class (=Room Grade)
– 1= Upper, 2 =Middle, 3=Lower
𝑺𝒖𝒓𝒗𝒊𝒗𝒆𝒅 = 𝜷 𝟎+𝜷 𝟏 𝑪𝒉𝒊𝒍𝒅+ 𝜷 𝟐 𝑷𝒄𝒍𝒂𝒔𝒔
Hypothesis
𝑯 𝟎: 𝒕 𝜷 𝟏
= 0
𝑯 𝟏: 𝒕 𝜷 𝟏
> 0
If 𝐻0 𝑖𝑠 𝑟𝑒𝑗𝑒𝑐𝑡𝑒𝑑
𝑎𝑛𝑑 𝑉𝐼𝐹𝛽1
<10, 𝑉𝐼𝐹𝛽2
<10 are satisfied,
” Children were given priority to be rescued”
7. Random Sampling (MLR2)
(Population Child = 4.9%)(Population Survived = 32%) (Population{1,2,3} = {14.7%,12.9%,31.8%})
Sample Survived Sample Child Sample Pclass
The sample size is the summation of survived and unsurvived
8. Regression Result
– t value (child) = 5.282, t value (Pclass) = -11.435
– Both variables are statistically significant at 1%
– Examining Collinearity 𝑽𝑰𝑭 𝟏=𝑽𝑰𝑭 𝟐=1.01423<10
– Multiple R-squared = 0.1415
𝑺𝒖𝒓𝒗𝒊𝒗𝒆𝒅 = 0.84004 +0.2846 𝑪𝒉𝒊𝒍𝒅-0.2084 𝑷𝒄𝒍𝒂𝒔𝒔
9. But….
– There is an undesirable fact…
– I separated the dataset into 3
– Pclass=1, Pclass=2, Pclass=3