The document describes a term project analyzing data from the Titanic disaster to build a classification model that predicts passenger survival. Key steps included interpreting passenger data with attributes such as class, gender, and age; cleaning missing values; discretizing continuous attributes; applying decision tree models (REPTree, Random Forest, J48); and testing the best model, REPTree, on a hold-out dataset, reaching 86.66% accuracy.
2. OUTLINES
• PROBLEM DESCRIPTION
• DATA INTERPRETATION
• DATA CLEANING AND PROCESSING
• MODELING APPROACHES
• MODEL COMPARISON
• PROJECT MANAGEMENT
• CONCLUSION
3. PROBLEM DESCRIPTION
• The Titanic was built in 1912 and was the biggest ship in the world at that time.
It started its maiden voyage from Southampton, with New York as its final
destination. However, it sank on that first voyage, causing the deaths of
approximately 1,500 people.
• We try to build a model of the Titanic disaster, and our aim is to classify
the passengers' survival status. We randomly chose 30 people and try to
predict whether they survived.
4. DATA INTERPRETATION
• We found our dataset on data.world. It contains information on 1,309
people who were passengers on the Titanic.
• Our dataset has 12 different kinds of attributes.
5. PASSENGER CLASS
• Passenger class is a numerical attribute, and 3rd class has
the highest death ratio. There are two reasons:
• Half of the passengers traveled in 3rd class because of the
low passenger fare.
• Rescue efforts probably gave more priority to 1st and 2nd
class.
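The class-wise death ratio described above can be computed with a simple group-by; a minimal sketch on invented toy rows (the real dataset has 1,309 rows, and the ratios below are not the real ones):

```python
import pandas as pd

# Toy rows standing in for the real 1,309-passenger dataset (values invented).
df = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Death ratio per passenger class: 1 minus the mean of the survived flag.
death_ratio = 1 - df.groupby("pclass")["survived"].mean()
print(death_ratio)
```

The same group-by works for Gender, Age range, and Embarked in the slides that follow.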
6. GENDER
• The death ratio of men is higher than that of women,
because when the ship started to sink, women were
given priority over men.
7. AGE
• Most people died in the 19-48 age range.
There are two reasons:
• Most of the passengers are in this age
range.
• Passengers outside this age range had
priority for the lifeboats. (Children)
9. PASSENGER FARE
• Most of the passengers had cheap tickets, so the cheap
ticket category has the most dead people. Rich people
probably had more opportunity to access the lifeboats,
which is why the two expensive ticket categories have
lower death ratios.
10. EMBARKED
• There are 3 types of embarkation port: S, C, and Q. Type C has the
best survival ratio and type S the worst. The Embarked attribute is
related to the other attributes: type S has more males and more
passengers in the 19-48 age range.
11. CLASS
• Our class attribute is the passenger's survival status. (We
have two values: Alive or Dead.)
12. DATA CLEANING AND PROCESSING
• We removed the nominal attributes (Name, Ticket Number), because these
attributes are unique for every passenger and cannot affect our model.
• We started to examine our data and noticed missing values in the Age,
Cabin, and Lifeboat attributes.
• We applied the ReplaceMissingValues filter to Age. Our data has too
many missing values in the Cabin attribute, so for that reason we had to
remove that attribute.
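The cleaning steps above can be sketched outside Weka as well; a minimal pandas version on an invented three-row frame (Weka's ReplaceMissingValues filter replaces missing numeric values with the column mean, which is what the fillna line mimics):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the structure described above (values invented).
df = pd.DataFrame({
    "name":     ["A", "B", "C"],
    "ticket":   ["t1", "t2", "t3"],
    "age":      [22.0, np.nan, 40.0],
    "cabin":    [np.nan, np.nan, "C85"],
    "survived": [1, 0, 1],
})

# 1) Drop nominal identifiers that are unique per passenger.
df = df.drop(columns=["name", "ticket"])

# 2) Impute missing ages with the column mean, as Weka's
#    ReplaceMissingValues filter does for numeric attributes.
df["age"] = df["age"].fillna(df["age"].mean())

# 3) Cabin is mostly missing, so drop the whole attribute.
df = df.drop(columns=["cabin"])
print(df)
```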
13. OVERFITTING
PROBLEM
• After that we applied all of the tree learners provided
by Weka, but we got a high accuracy ratio for all
types of trees (between 90% and 95%). This showed that
we had overfitting in our model.
• Overfitting refers to a model that fits the training
data too well.
• It happens when a model learns the detail and noise
in the training data to the extent that this negatively
impacts the performance of the model on new data.
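The suspiciously high 90-95% accuracy can be illustrated with a toy sketch (scikit-learn instead of Weka, invented random data): a column that nearly copies the class label, like the Lifeboat attribute removed in the next slide, inflates accuracy regardless of the model's real skill.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)            # toy survival label
honest = rng.integers(0, 3, (n, 1))  # a genuinely weak, uninformative feature
leaky = y.reshape(-1, 1)             # a "Lifeboat"-style column that copies the label

acc = {}
for name, X in [("honest", honest), ("with leak", np.hstack([honest, leaky]))]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    acc[name] = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)
    print(name, round(acc[name], 2))
```

The tree with the leaking column scores near-perfectly, while the honest feature stays near chance level: the high score measures the leak, not the model.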
14. DISCRETIZE
• To solve this problem, we removed the Lifeboat
attribute and then categorized the Age and
Passenger Fare attributes.
• The Age attribute has 5 categories. These
are 0-19, 19-37, 37-55, 55-73, 73-91.
• The most crowded age range is 19-37.
• The Passenger Fare attribute has 3
categories. These are 0-171, 171-342,
342-513.
• The most crowded passenger fare range is
0-171.
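The binning above (Weka's Discretize filter with 5 and 3 equal-width bins) can be sketched with pandas, using the bin edges named on the slide and a few invented sample values:

```python
import pandas as pd

# Invented sample values; bin edges taken from the slide.
ages = pd.Series([5.0, 22.0, 40.0, 60.0, 80.0])
age_bins = [0, 19, 37, 55, 73, 91]
age_cat = pd.cut(ages, bins=age_bins,
                 labels=["0-19", "19-37", "37-55", "55-73", "73-91"])
print(age_cat.tolist())

fares = pd.Series([7.25, 200.0, 512.33])
fare_bins = [0, 171, 342, 513]
fare_cat = pd.cut(fares, bins=fare_bins,
                  labels=["0-171", "171-342", "342-513"])
print(fare_cat.tolist())
```

Note that `pd.cut` makes the intervals right-inclusive, so a boundary value such as 19 falls in the lower bin; Weka's equal-width Discretize behaves similarly.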
26. MODELING APPROACHES
In this part, we tried to find the best model for our problem, so we checked all trees (accuracy
ratio, mean error, and squared error).
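The comparison step can be sketched with scikit-learn stand-ins (the project itself compared Weka's REPTree, J48, and Random Forest on the Titanic attributes; the data below is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for the discretized Titanic attributes.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "pruned tree (stand-in for REPTree/J48)": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

Comparing models on cross-validated accuracy (plus the error measures) rather than training accuracy is what guards against picking an overfitted tree.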
In REPTree we can see all of our attributes, so we chose
REPTree.
• For model comparison, we chose 30
passengers randomly. These 30
passengers are our test set, and our
model did not know their class
variable. We applied this test to the
REPTree decision tree, which is the
most logical decision tree for our
model. When we asked it about these
30 passengers, our model replied
86.66% correctly. We asked about 18
alive and 12 dead passengers, and it
gave 2 wrong answers for each group.
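The hold-out figure checks out arithmetically; a minimal sketch reconstructing it from the counts on the slide (18 alive, 12 dead, 2 errors in each group; which specific passengers were wrong is invented):

```python
# Reproduce the 86.66% figure: 30 hold-out passengers, 2 errors among the
# 18 "alive" and 2 among the 12 "dead" (counts taken from the slide).
true_labels = ["alive"] * 18 + ["dead"] * 12
predictions = list(true_labels)
predictions[0] = predictions[1] = "dead"     # 2 wrong among the alive
predictions[18] = predictions[19] = "alive"  # 2 wrong among the dead

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = 100 * correct / len(true_labels)
print(f"{accuracy:.2f}%")  # prints 86.67%
```

26 correct out of 30 gives 86.67% when rounded; the slide's 86.66% is the same value truncated.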
29. MODEL COMPARISON
30. CONCLUSION
• As a result, we applied the steps above to get the best result from the
model while avoiding overfitting and error problems. We removed the
problematic parts using Weka and obtained the solution. We chose
REPTree for our model, with the highest accuracy of 80.2139%. We could
not apply a regression method to our model, because our attributes are
not suitable for regression.