University of Delaware
Contemporary Applications of Mathematics
Milestone 5: Stretching for Survival
Authors:
Christina Dehn
Devon Gonzalez
Johanna Jan
December 10, 2015
Contents
1 Introduction
2 Data Analysis
3 Methods for Predictions
  3.1 Neural Networks
  3.2 Random Forests
  3.3 Choosing Random Forest
4 Decision Trees
  4.1 Entropy
  4.2 Implementation
    4.2.1 Age Tree
    4.2.2 Entropy Tree
    4.2.3 Pruned Entropy Tree
5 sklearn's Random Forest
6 Missing Data
  6.1 Defining the Problem
  6.2 Implementing a Solution
7 Non-numeric Values
  7.1 Sex and Embark
  7.2 Title
8 Our Statistics
9 Statistical Comparisons
10 Conclusion
11 Future Work
1 Introduction
On April 15, 1912, one of the most infamous shipwrecks in history occurred: the RMS Titanic sank. The vessel was 882 feet long, making it the largest ship in service at the time. Because of its great size, its creators and regulators did not believe the ship could possibly sink. However, during its maiden voyage, the RMS Titanic sank after colliding with an iceberg in the North Atlantic Ocean while traveling from Southampton, UK to New York, NY. The disaster caused the deaths of 1502 out of 2224 passengers and crew. Safety regulations in the early 1900s were poor, and the ship did not carry enough lifeboats; better regulations and more lifeboats would have saved more lives. This tragedy exposed the importance of safety and led to better safety regulations for maritime vessels.
Although luck was one factor in who survived the sinking, it was not the only factor. Women, children, and upper-class passengers were chosen first for the lifeboats. Because of this, some groups of people were more likely to survive than others. For this project, we are analyzing the Titanic's survival statistics to design a model for predicting which passengers or crew members were more likely to survive. The features included in the survival statistics consist of each person's class, name, sex, age, number of family members aboard, cabin, and port of embarkation.
2 Data Analysis
The data set is provided on the Kaggle website. We are given two main csv files, a training set and a testing set. These sets include each passenger's title, name, ID, age, sex, class, cabin number, ticket fare, ticket number, number of siblings or spouses, number of parents or children, and port of embarkation as features. The training set consists of 891 passengers and records whether each survived (1) or died (0). The purpose of this set is to "train" the prediction program to make accurate predictions. The testing set consists of 418 passengers without their survival status; this is the set of passengers that Kaggle uses to score submissions.
As mentioned above, Kaggle has supplied some classification features for each passenger. Below is a list of the feature
names and what they represent:
• PassengerId: An integer that represents a passenger’s ID
• Pclass: An integer (1, 2, or 3) that represents 1st (1), 2nd (2), and 3rd (3) class
• Name: The name of the passenger
• Sex: The sex of the passenger (male or female)
• Age: The age of the passenger
• SibSp: The number of siblings or spouses the passenger has on board (the data does not distinguish between siblings and spouses)
• Parch: The number of parents or children the passenger has on board (the data does not distinguish between parents and children)
• Ticket: The passenger’s Ticket Number
• Fare: The price of the passenger’s ticket
• Cabin: The cabin number in which the passenger stayed on board
• Embarked: A symbol (Q, S, C) that represents where the passenger boarded the ship. Q represents Queenstown, S represents Southampton, and C represents Cherbourg.
Because we are given a large amount of information about the passengers, we wanted an easy way to look for trends within the data. To do this, we read in the data from the training set and analyzed it by feature; a short sketch of this kind of feature-level summary appears after the list below. Our findings included:
• There are almost twice as many males as there are females
• There are about twelve times more adults (greater than or equal to age fourteen) than children (under age
fourteen)
• Over 50% of the passengers were third class
• Over 60% of passengers embarked from Southampton.
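As a rough illustration (not our exact analysis script), the following pandas sketch reproduces this kind of feature-level summary; the column names come from Kaggle's train.csv, and the age-14 cutoff matches our adult/child split.
import pandas as pd

train = pd.read_csv('train.csv')

print(train['Sex'].value_counts())                        # males vs. females
known_age = train['Age'].dropna()
print((known_age >= 14).sum(), (known_age < 14).sum())    # adults vs. children (known ages only)
print(train['Pclass'].value_counts(normalize=True))       # fraction of passengers in each class
print(train['Embarked'].value_counts(normalize=True))     # fraction embarking at each port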
To further our analysis, below are a histogram displaying the passengers' age distribution, a pie chart categorized by class, and a pie chart categorized by embarkation port.
Figure 1: A histogram that displays the age distributions in the training set.
The histogram shows that many of the passengers were in their twenties and thirties.
Figure 2: A pie chart that displays the class distribution in the training set.
The class distribution pie chart indicates that the numbers of passengers in first and second class are fairly even, but more than half of the passengers were in third class.
Figure 3: A pie chart that displays the embarkation distribution in the training set.
The embarkation distribution pie chart indicates that most passengers boarded the ship at Southampton, the port where the voyage began. Most of the crew also joined the ship at this port, which could be a strong reason why Southampton was such a popular embarkation location.
3 Methods for Predictions
Emerging in the 1990s, machine learning is a field of computer science that explores algorithms that can learn from and make predictions on data. Machine learning makes predictions based on properties learned from the training data set rather than solely following a program's fixed instructions. Kaggle suggests that participants apply the tools of machine learning to predict which passengers survived. When researching common machine learning techniques, we found many researchers' approaches; the most popular were random forests and neural networks, which are discussed below. The majority of the researchers suggest random forests, but some determine survival through neural networks.
3.1 Neural Networks
Neural networks are a common technique used to estimate functions that may depend on a large number of inputs, some of which may be unknown. The method is inspired by our understanding of how the brain learns. A network is organized into layers: an input layer, one or more hidden layers, and an output layer. Input vectors are fed into the network and output vectors are produced at the end. One example of a neural network for the Titanic problem uses social class, age (adult or child), and sex as inputs; that network had an error rate of 0.2. One important property of a neural network is that it uses weights to capture the relative importance of each input. Some inputs are more important, such as whether the passenger is a child. The overall effect (excitatory or inhibitory) of a neuron is found by looking at the weights from the input neurons to the hidden neurons and from the hidden neurons to the output neurons [5].
3.2 Random Forests
Random forests are groups of decision trees that rate the importance of certain features and predict the outcome for an input based on the values of those features. The final prediction for a testing set is based on the decision trees built from an inputted training set. When a data point is fed into the forest, the forest outputs either the class that is the mode of the individual trees' classes (classification) or the mean of the individual trees' predictions (regression). Classification returns the result that the majority of the trees in the forest predict as the final outcome; regression returns the mean of all of the trees' outputs. For our problem, which asks for a 0/1 survival label, the forest aggregates its trees by classification.
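As a small illustration of the two aggregation modes (not code from our project), suppose five trees each emit a prediction for a single passenger:
from collections import Counter

tree_predictions = [1, 0, 1, 1, 0]   # survival votes from five hypothetical trees

# Classification: take the mode (majority vote) of the trees' predicted classes
classification = Counter(tree_predictions).most_common(1)[0][0]    # -> 1 (survived)

# Regression: take the mean of the trees' outputs
regression = sum(tree_predictions) / float(len(tree_predictions))  # -> 0.6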
3.3 Choosing Random Forest
After careful consideration, we decided to use random forests because they seemed more flexible and easier to implement than neural networks. We felt that we would have more freedom using random forests because more resources are available for them than for neural networks. In addition, Kaggle suggests them, and other people are using them with good results. Random forests are also good at taking values from different kinds of categories and outputting categorical results [3].
We will be using Python to tackle this project. Python has many useful libraries for data analysis. For this project, we are using Python's pandas library, sklearn's Random Forest Classifier, and their csv-handling routines. These libraries are useful because pandas can read and write csv files and sklearn has a random forest module, which is a convenient tool for creating random forests.
4 Decision Trees
Before discussing Random Forests, we must first define decision trees. Decision trees are a specific type of tree data
structure that make predictions by breaking down a large data set into smaller subsets. A decision node holds a
specific feature and has two or more branches stemming out of it. The first decision it has to make is at the root
node. Based on the input, the node will determine which branch to take in relation to its specified feature. Each
branch will lead to a different decision node. This process continues until it hits a leaf node and cannot branch
anywhere else. The leaf node represents the final outcome for that input. Multiple decision trees make up a random
forest.
4.1 Entropy
Entropy can be used to find the homogeneity of a sample. Entropy normally represents disorder, but in information
theory, it is the expected value of information contained in a message (in our case, the survival of a person on
board). We used conditional entropy to determine the splits in our decision tree. Conditional entropy calculates the
expected value of information given another piece of information. Two kinds of entropy need to be calculated to
make a decision tree: the entropy of a single attribute and conditional entropy. In relation to our topic, the single-attribute entropy would be the entropy of survival, and the conditional entropy could be the entropy of survival given sex (male or female). The equation for the entropy of one attribute is:
E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i \qquad (1)
where S is the target (survival), p_i is the probability that the i-th value of the attribute occurs, and c is the number of possible values of that attribute. The equation for the conditional entropy of two (or more) attributes is:
E(S, X) = \sum_{c \in X} P(c)\,E(c) \qquad (2)
where S (survival) and X (e.g., sex: male or female) are the attributes, P(c) is the probability of value c of X, and E(c) is the entropy of S restricted to the records where X takes the value c.
To build the tree for a target, in this case survival, split the data on an attribute, calculate the entropy of each branch, and subtract the weighted result from the entropy before the split; this difference is the information gain. Then choose the attribute with the largest information gain, equivalently the smallest conditional entropy, as the next node, because the smaller the conditional entropy, the more accurate the prediction is. Run this algorithm recursively until all of the data is in the tree. Small conditional entropy values lead to more accurate predictions because little additional information is needed to determine the outcome once the attribute is known. If the entropy is 0, then the outcome is completely determined by the given information 100% of the time. Like most algorithms, decision trees are not perfect. Problems that exist with decision trees include handling continuous attributes, overfitting, attributes with many values, and missing values [2].
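To make Equations (1) and (2) concrete, here is a small sketch (not our production code) that computes the entropy of survival and the conditional entropy of survival given sex from the training file; the column names follow train.csv.
import math
import pandas as pd

def entropy(series):
    # Equation (1): E(S) = sum over values of -p_i * log2(p_i)
    probs = series.value_counts(normalize=True)
    return sum(-p * math.log(p, 2) for p in probs if p > 0)

def conditional_entropy(df, target, attribute):
    # Equation (2): weighted average of the target's entropy within each attribute value
    total = 0.0
    for value, subset in df.groupby(attribute):
        total += (len(subset) / float(len(df))) * entropy(subset[target])
    return total

train = pd.read_csv('train.csv')
gain = entropy(train['Survived']) - conditional_entropy(train, 'Survived', 'Sex')
print(gain)   # information gain from splitting on Sex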
4.2 Implementation
We began predicting the survival of passengers by implementing a few different decision trees. Some were built primarily from features chosen by logic, like age, while others were built primarily from features chosen by entropy, like fare. We will start with the most basic decision tree: the Age Tree.
4.2.1 Age Tree
Based on our data and its trends, we were able to implement a basic decision tree that focused on age, sex, and class
since these logically seemed to be the most influential classifications from our research and data analysis. We also
added a Parent/Child branch for our third class females because our initial research showed a positive correlation
between families and survival, meaning those traveling with families had a higher chance of surviving. This branch
was only added for third class females because it was the only node that had uncertainty. Below is the layout of our
most basic decision tree with the corresponding algorithm implemented in Python:
Figure 4: A flow representation of Age Tree, our most basic decision tree.
import math

def ageTreePredict(passenger):
    # Age Tree: predict survival (1) or death (0) from a passenger row
    age = passenger.loc['Age']
    pclass = passenger.loc['Pclass']
    sex = passenger.loc['Sex']
    parch = passenger.loc['Parch']
    if math.isnan(age):  # passengers with a missing age receive a predicted age
        age = missingAge(passenger)
    if age < 14:  # child
        if pclass == 1 or pclass == 2:
            survived = 1
        else:
            if sex == 'female':
                survived = 1
            else:
                survived = 0
    else:  # adult
        if sex == 'male':
            survived = 0
        else:
            if pclass == 1 or pclass == 2:
                survived = 1
            else:
                if parch > 0:  # third-class female traveling with parents/children
                    survived = 1
                else:
                    survived = 0
    return survived
If the passenger does not have an age, then the missingAge() function is called to predict the passenger’s age. These
predictions are based on the ticket number and whether the passenger has siblings, a spouse, or a parent/child. This
algorithm will be explained in more detail in section 6.2.
4.2.2 Entropy Tree
Another decision tree we implemented was based on the conditional entropy equation (2), which we used to calculate the entropy for every possible place we could split the tree. We then chose the split that left no more than 2/3 of the data on either side, meaning we chose the split that divided the data as evenly as possible while maintaining a low entropy score; a low conditional entropy score means the split separates the data into the possible outcomes as cleanly as possible. Out of all of the possible splits, the "Fare" feature produced low entropy scores most often. We believe that fare may be a good indicator of survival and may also be influenced by the other variables. Therefore, our second decision tree used fare as well as age, sex, and class as fare's influencers. Below is an illustration of this decision tree.
Figure 5: A flow representation of our Entropy Tree.
4.2.3 Pruned Entropy Tree
One common issue that occurs with decision trees is overfitting: an algorithm that is too specific to the training set does not predict other data well. To counteract this potential issue for the Entropy Tree above, we used the following estimated-error formula:
e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}} \qquad (3)
where z = 1.96 (the z-value for a 95% confidence level), f is the error rate on the training data at the node, and N is the number of instances covered by the node. We use this formula to estimate the error at each leaf and at each parent node whose children are all leaves. We then compute the weighted sum of the errors of the leaves, where each leaf's weight is the fraction of the parent's instances that fall within that leaf. If the weighted sum of the leaves' errors is greater than the estimated error of the parent node, then we prune the tree by removing those leaves [3]. A small sketch of this check appears below, followed by our pruned version of the entropy tree.
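The following sketch (our own helper names, with simplified bookkeeping) shows the pruning check implied by Equation (3):
import math

def estimated_error(f, N, z=1.96):
    # Equation (3): pessimistic error estimate for a node with observed error rate f over N instances
    numerator = f + z**2 / (2.0 * N) + z * math.sqrt(f / N - f**2 / N + z**2 / (4.0 * N**2))
    return numerator / (1.0 + z**2 / N)

def should_prune(parent_f, parent_N, leaves):
    # leaves is a list of (f, N) pairs, one per leaf child of the parent node;
    # prune when the weighted sum of the leaves' estimated errors exceeds the parent's estimate
    leaf_error = sum((N / float(parent_N)) * estimated_error(f, N) for f, N in leaves)
    return leaf_error > estimated_error(parent_f, parent_N)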
Figure 6: A flow representation of our Pruned Entropy Tree to counteract overfitting.
Our pruned entropy tree results can be found in the table (Figure 9) in Section 8. This tree was one of our best algorithms at predicting survivors, with a survival accuracy of 91%. Its death accuracy did not score as well at 78%, and it achieved a Kaggle accuracy of 78%.
5 sklearn’s Random Forest
After researching and implementing our own decision trees, we were ready to create a random forest. Python's sklearn package comes equipped with a ready-to-use random forest tool. We feed in the training set, a csv file of 891 passengers with all of the given features along with whether they survived (1) or died (0), and a list of features from the training set as inputs. The tool uses these inputs to create a collection of decision trees that it believes will give the most accurate results. We then feed in the testing set, another csv file of 418 passengers with all of the given features but without their survival status, and sklearn uses the decision trees built from the training set to predict the survival statuses of the passengers in the testing set.
Below is a list of the random forest parameters and their default settings:
• n_estimators: (default=10) An integer that sets the number of trees in the forest
• criterion: (default='gini') A string that sets the function used to measure the quality of a split
• max_features: (default='auto') An integer, float, string, or None that sets the number of features to consider when determining the best split
• max_depth: (default=None) An integer or None that sets the maximum depth of a tree
• min_samples_split: (default=2) An integer that sets the minimum number of samples an internal node must have in order to be split
• min_samples_leaf: (default=1) An integer that sets the minimum number of samples in a leaf node
• min_weight_fraction_leaf: (default=0) A float that sets the minimum weighted fraction of samples required to be in a leaf node
• max_leaf_nodes: (default=None) An integer or None that sets the maximum number of leaf nodes in a tree
• bootstrap: (default=True) A boolean that sets whether or not bootstrap samples are used when building a tree
• oob_score: (default=False) A boolean that sets whether or not out-of-bag samples are used to estimate error
• n_jobs: (default=1) An integer that sets the number of parallel jobs to run
• random_state: (default=None) An integer, RandomState instance, or None that sets the seed for the random number generator
• verbose: (default=0) An integer that controls the verbosity of the tree building
• warm_start: (default=False) A boolean that sets whether or not the random forest reuses the previous solution and adds more estimators
• class_weight: (default=None) A dictionary, list of dictionaries, "balanced", "balanced_subsample", or None that sets the weights of each class
We made a few models using the standard random forest structure and training in sklearn. There are optimizations that can be tried, like using multiple CPU cores and specifying the minimum and maximum number of samples per split. We can also increase the number of estimators, that is, the number of trees in the forest. We used 100 estimators for some of our random forests because it is fast and reasonably accurate. This model helps prevent overfitting because each decision tree is built from part of the training set rather than the entire set. However, the tool has problems if data is missing or non-numeric. In our csv files, the Age, Fare, Ticket, Cabin, Embarked, and Sex features either have missing data or have values that are not numbers. So, as our most basic random forest, we created one based on the features that have numeric values and no missing data: Pclass (passenger class), Parch (parents or children), and SibSp (siblings or spouses). To clarify, we did not use the Age, Fare, Sex, or Embarked features for our first couple of random forests because, at that point, we were still working on filling in missing data and converting non-numeric values to usable numeric values. We also did not plan to use Ticket or Cabin in any of our random forests because we lacked the information needed to fill in or numerically encode these features, and we did not see them as necessary predictors. However, these features can still be useful for other purposes, like filling in missing data for other features; this is all explained in more detail in Section 6. Below is the code for creating our first basic random forest:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

cols = ['Pclass', 'SibSp', 'Parch']       # features with no missing or non-numeric values
colsRes = ['Survived']                    # training labels

trainArr = train.as_matrix(cols)          # training feature array
trainRes = train.as_matrix(colsRes)       # training results
rf = RandomForestClassifier(n_estimators=100)   # initialize the forest
rf = rf.fit(trainArr, trainRes.ravel())   # fit the training data to the algorithm
testArr = test.as_matrix(cols)
results = rf.predict(testArr)             # predicted survival for the test set
One random forest we made uses the features Pclass, SibSp, Parch, Sex, Age, Port, and Fare. We used most of the default parameters from sklearn's random forest package; the only parameters we changed were max_features=1 and oob_score=True. max_features is the number of features considered when determining the best split; the default setting is 'auto', or sqrt(n_features). Setting it to 1 should create relatively random trees, since every split considers only one feature. This works well because it reduces overfitting. oob_score controls whether the random forest uses out-of-bag samples to estimate error. The default is False, but setting it to True also helps to reduce overfitting.
Our best random forest uses the features Pclass, SibSp, Parch, Sex, Age, Port, Fare, and Title, and the parameters max_features=1, oob_score=True, and max_depth=7. Adding a new feature (Title) allows for more predictive power because it gives the forest more data to work with. The "new" max_depth parameter defines the maximum depth to which any one tree can be grown; the default is None, which does not restrict the depth, so the trees can vary greatly in size and depth. Setting this parameter to a fixed number reduces overfitting further by not allowing any tree in the forest to grow too large. After "opening up" our previous random forest, we noticed that the depths of its decision trees did not exceed ten levels, so we tried various maximum depths under 10 to see which, if any, would make a difference in our results. After some brute forcing, we found that a max_depth of 7 yielded the best results. The results for all of our random forests are shown in Section 8.
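For reference, here is a sketch of how our best forest was configured; the numeric column names ('Gender', 'Port', 'TitleNum') stand in for the converted columns described in Sections 6 and 7 and may differ from the names in our actual script, and parameters not shown keep sklearn's defaults.
cols = ['Pclass', 'SibSp', 'Parch', 'Gender', 'Age', 'Port', 'Fare', 'TitleNum']
rf = RandomForestClassifier(max_features=1, oob_score=True, max_depth=7)
rf = rf.fit(train.as_matrix(cols), train.as_matrix(['Survived']).ravel())
print(rf.oob_score_)                        # out-of-bag estimate of accuracy on the training set
results = rf.predict(test.as_matrix(cols))  # predictions submitted to Kaggle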
6 Missing Data
In order to use the sklearn random forest tool, all of our data has to be filled in; we cannot have any holes or NaNs (the "not a number" marker pandas uses for missing values). Some features, like Ticket and Cabin, have too many blanks and there is no reasonable algorithm to guess their values, so we will not use them as predictors in our random forest, although we might be able to use them for another purpose. Other features, like Age and Fare, have blanks as well, but these can be filled in with the right algorithm. We start attacking the missing data problem with Age.
6.1 Defining the Problem
It is pretty clear that Age is one of the most important features and one of the easiest gaps to fill. The bar chart below shows the number of passengers with a known Age and with a missing Age in the testing and training sets.
Figure 7: A histogram that displays the missing Age data distributions in the training and test sets. About 25% of
the passengers in the testing set and about 20% of the passengers in the training set have a missing Age.
Our first algorithm set the Age of every passenger with a missing Age to the average age computed from the training set. This essentially classified each passenger with a missing age as an adult. Since there were far more adults than children aboard the Titanic, this was not a bad assumption, but some of our decision trees made decisions based on an age more specific than just adult or child. Therefore, we had to design a more accurate algorithm.
The algorithm that we created is dependent on the Ticket, Sibsp, and Parch features. From research, we noticed
that family members all share the same Ticket number, probably because they all bought their tickets as a group.
We used this idea to our advantage. If the passenger currently being predicted has NaN as an Age, we first check whether they have a number greater than zero in the SibSp column. If they do, we look through our whole list of passengers for another passenger with a matching ticket number. If we find another passenger with a matching ticket number and a numeric Age, then we use that same Age for the current passenger. If the current passenger does not have a sibling or spouse, we check whether he or she has any parents or children (Parch). If so, we look through our whole list of passengers, just as we did for SibSp, and try to find another passenger with a matching ticket number and an Age that is not NaN. If such a passenger is found, then we take the "opposite" Age of that passenger and use it for the current passenger. Opposite age refers to either 30 or 3, depending on whether the parent or child is an adult or a child. If the parent/child's Age is greater than or equal to 14, then the current passenger's Age is set to 3 (to represent a child), which is the most frequent child age. If the parent/child's Age is less than 14, then the current passenger's Age is set to 30 (to represent an adult), which is the most frequent adult age. (These frequencies were taken from Figure 1 in Section 2.) We are essentially classifying the current passenger as the opposite of their parent or child. If, however, the current passenger does not have any siblings/spouses or parents/children on board, we fill in his or her Age as 30 (the average age), because a passenger traveling alone is most likely an adult.
6.2 Implementing a Solution
In order to fill in missing data, we created a function in Python, missingAge(), that implements the algorithm described above. To review: the method first checks whether the passenger has a sibling or spouse on board. If so, the passenger list is searched until the matching ticket is found, and the missing Age is replaced with that sibling's or spouse's Age. If not, it checks whether the passenger has a parent or child on board. If so, the csv file is searched until the matching ticket is found and the "opposite" Age is used: if the matching passenger is a child (under 14), the missing Age becomes 30, and if the matching passenger is an adult (14 or older), the missing Age becomes 3. If the passenger has neither a sibling/spouse nor a parent/child on board, the missing Age is set to the average age, which is 30. Below is the missingAge() algorithm:
import numpy as np

def missingAge(passenger):
    # lines is the DataFrame of all passengers, read in elsewhere
    spSib = passenger.loc['SibSp']
    parch = passenger.loc['Parch']
    ticket = passenger.loc['Ticket']
    age = 30                                    # default: average (adult) age
    if spSib > 0:                               # traveling with a sibling or spouse
        for i in range(len(lines)):
            if lines.loc[i, 'Ticket'] == ticket:
                if not np.isnan(lines.loc[i, 'Age']):
                    age = lines.loc[i, 'Age']   # reuse the sibling's/spouse's age
                    break
    elif parch > 0:                             # traveling with a parent or child
        for i in range(len(lines)):
            if lines.loc[i, 'Ticket'] == ticket:
                pcage = lines.loc[i, 'Age']
                if np.isnan(pcage):             # skip matches with no recorded age
                    continue
                if pcage < 14:                  # match is a child, so assume an adult
                    age = 30
                else:                           # match is an adult, so assume a child
                    age = 3                     # most frequent child age (see Section 6.1)
                break
    else:
        age = 30                                # traveling alone: assume the average age
    return age
In order for our random forest to work, we must have a csv file that does not have any missing data. So, after
writing the function for missing Age, we generated a new csv file in Python that does not have missing Ages. We
accomplished this by saving the new Age found in the missingAge() function as the passenger’s Age.
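A minimal sketch of how the filled-in file can be produced with the missingAge() function above; the output filename and the use of iterrows() are our own choices here, not necessarily those of the original script.
import numpy as np
import pandas as pd

lines = pd.read_csv('train.csv')              # full passenger list; missingAge() reads this same DataFrame
for i, passenger in lines.iterrows():
    if np.isnan(passenger.loc['Age']):
        lines.loc[i, 'Age'] = missingAge(passenger)
lines.to_csv('train_filled.csv', index=False) # new csv with no missing Ages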
7 Non-numeric Values
7.1 Sex and Embark
After filling in Age, we addressed another very important feature: Sex. This feature is not missing any data in either the training or testing set, but its entries are not numeric. We had to change the entries from "female" and "male" to 0 and 1. This was a fairly easy fix: we added a total of two lines to our random forest code, shown below:
train = pd.read_csv('train.csv')
train['Gender'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)  # changing Sex to numbers
test = pd.read_csv('test.csv')
test['Gender'] = test['Sex'].map({'female': 0, 'male': 1}).astype(int)    # changing Sex to numbers
Similarly, we had to change the non-numeric Embarked entries to numeric values, so we changed the entries from "Q," "C," and "S" to 0, 1, and 2 using code similar to that shown above.
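A sketch of the Embarked conversion, mirroring the Sex conversion above; the new column name ('Port') and the handling of any blank Embarked entries in the training set are assumptions on our part.
train['Port'] = train['Embarked'].map({'Q': 0, 'C': 1, 'S': 2})
train['Port'] = train['Port'].fillna(2).astype(int)   # assume missing ports are Southampton, the most common
test['Port'] = test['Embarked'].map({'Q': 0, 'C': 1, 'S': 2}).astype(int)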
7.2 Title
Another feature we wanted to include in our random forest is the passenger's Title, which is another non-numeric value. In order to change the Titles to numeric values, we assigned each Title a number. Before assigning the numbers, we had to parse the Title out of the passenger's name; the method used to achieve this can be seen in the code below. The passenger's name in the csv file is a string of the general form last name, comma, title, period, first name. For example, one passenger's name is listed as "Kelly, Mr. James." Our code checks for a period in the passenger's name and then extracts the title, which lies between the comma and the period. The most frequent Title was given the lowest number: for example, the Title 'Mr' was given 0, 'Mrs' was given 1, and so on. This process was used for both the train and test csv files. The method for converting the strings to integers is the same as explained in the previous subsection; the code had to be adjusted for each file since the Titles varied between the files. We added another column to the csv file that contains each passenger's assigned Title number. The Python code can be seen below:
title = list()
# The name has the form "Last, Title. First", e.g. "Kelly, Mr. James" -> "Mr"
if '.' in passenger.loc['Name']:
    title.append(passenger.loc['Name'].split(',')[1].split('.')[0].strip())
else:
    title.append('nan')   # no period found, so no title could be parsed
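A hedged sketch of how the parsed Titles can be assigned numbers by frequency (most frequent title gets 0); the variable names and the use of Counter are ours, and the actual numbering in our code differed slightly between train.csv and test.csv.
from collections import Counter

# titles is the list built by the parsing code above, one entry per passenger
counts = Counter(titles)
ordered = [t for t, _ in counts.most_common()]         # most frequent title first
title_to_num = {t: i for i, t in enumerate(ordered)}   # e.g. {'Mr': 0, 'Mrs': 1, ...} per the text
train['TitleNum'] = [title_to_num[t] for t in titles]  # new numeric Title column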
8 Our Statistics
To test the accuracy of our forests and decision trees, we submitted multiple csv files containing each passenger ID and a survival prediction (0 or 1) to Kaggle. Kaggle provides several benchmarks (which Kaggle generated) that we can use to compare our results against its basic models and to see where we rank relative to other teams. "Assume All Perished" is the lowest benchmark, with a score of 62.679%. Another threshold is the "Gender Based Model" benchmark, in which all females survive and all males die, which has a score of 76.555%. The "My First Random Forest" benchmark, which scored 77.512%, is a random forest algorithm that uses the Python package sklearn with the features Age, Sex, Pclass, Fare, SibSp, Embarked, and Parch. The top benchmark is the "Gender, Price and Class Based Model," which is based on gender, ticket class, and ticket price and has a score of 77.990%. Below is a summary of the benchmarks:
Benchmark: Score
Assume All Perished: 0.62679
Gender Based Model: 0.76555
My First Random Forest: 0.77512
Gender, Price, and Class Based Model: 0.77990
Figure 8: A summary of Kaggle’s benchmarks. The highest score that can be achieved is 1.
We first ran our single decision trees and submitted their predictions to Kaggle to see where we stand compared to other teams. The score of our Age model is 77.033%, which placed us one rank above the "Gender Based Model" threshold. The score of our Entropy model is 78.469% and the score of our Pruned Entropy model is 77.990%. Our highest scoring model, the Entropy model, is 555 places above the "Gender Based Model" threshold and 3 places above the "Gender, Price and Class Based Model."
In addition to our decision trees, we also ran sklearn's random forest tool on the basic features (Pclass, Parch, and SibSp) and, separately, on the basic features plus Sex. Our Basic Random Forest obtained a score of 0.68421, which falls above the "Assume All Perished" benchmark and below the "Gender Based Model" benchmark. These results were not as good as our Entropy Tree submission. Our Basic plus Gender Random Forest was much better than the previous basic model, with a Kaggle accuracy of 0.77990, which is around the "Gender, Price, and Class Based Model" benchmark.
After successfully filling in missing Ages and converting Sex to numeric form, we ran sklearn's random forest tool on all five features: Age, Sex, Pclass, Parch, and SibSp. Ideally, this random forest should have improved our overall accuracy, but after submitting it to Kaggle we found that it actually performed worse than our entropy-based trees. We scored 0.71770 on Kaggle, which is above the "Assume All Perished" benchmark but well below the "Gender Based Model" benchmark. We also added the Title feature (so that random forest had six features) after filling in its missing data points, and it yielded Kaggle results similar to those of the basic random forest with Age and Sex.
After experimenting with the features, we started adjusting the different parameters of sklearn's random forest tool. For one of the forests we made, we set the parameters max_features=1 and oob_score=True. This forest received a score of 0.78947, performing better than all of Kaggle's benchmarks; it did about 1% better than the best benchmark (the "Gender, Price, and Class Based Model"). After also setting the parameter max_depth=7, we created our best scoring random forest. It received a score of 0.80383, which put us at rank 791 out of 3910.
Below is a summary comparing how well our trees and forests performed. The survival, death, and total accuracies were calculated by dividing the number of correct results by the total number of results, and they are based on the training set; the Kaggle score is based on the test set. The Survival Accuracy is the number of correctly predicted survivors divided by the total number of survivors. The Death Accuracy is the number of correctly predicted deaths divided by the total number of passengers who died. The Total Accuracy is the number of correct predictions (survivors and deaths) divided by the total number of passengers. The Kaggle Accuracy is computed the same way as the Total Accuracy, but on the test set, and is computed by Kaggle.
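A sketch of how the training-set accuracies can be computed, assuming predictions is an array of 0/1 predictions aligned with the rows of the training set (the variable names are ours):
import numpy as np

predictions = np.asarray(predictions)        # 0/1 predictions produced by a tree or forest
actual = train['Survived'].values

survival_accuracy = np.mean(predictions[actual == 1] == 1)   # correct survivors / total survivors
death_accuracy = np.mean(predictions[actual == 0] == 0)      # correct deaths / total deaths
total_accuracy = np.mean(predictions == actual)              # all correct / total passengers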
Results

Alg # | Type | Features / Parameters                                                                                        | Survival Acc. | Death Acc. | Total Acc. | Kaggle Acc.
1     | T    | Age                                                                                                          | 87%           | 74%        | 79%        | 77%
2     | T    | Entropy                                                                                                      | 92%           | 77%        | 83%        | 78%
3     | T    | Pruned Entropy                                                                                               | 91%           | 78%        | 83%        | 78%
4     | RF   | Features: Pclass, Parch, SibSp; Parameters: n_estimators=100                                                 | 57%           | 78%        | 72%        | 68%
5     | RF   | Features: Pclass, Parch, SibSp, Sex; Parameters: n_estimators=100                                            | 70%           | 91%        | 83%        | 78%
6     | RF   | Features: Pclass, Parch, SibSp, Sex, Age; Parameters: n_estimators=100                                       | 84%           | 95%        | 91%        | 72%
7     | RF   | Features: Pclass, Parch, SibSp, Sex; Parameters: max_features=1, oob_score=True                              | 67%           | 95%        | 85%        | 79%
8     | RF   | Features: Pclass, Parch, SibSp, Sex, Age, Titles, Fare, Embarked; Parameters: max_features=1, oob_score=True | 75%           | 95%        | 88%        | 80%
Figure 9: Summary of our statistics from decision trees and random forests. For type, T stands for Tree and RF
stands for Random Forest. All parameters that are not listed are set as the default.
9 Statistical Comparisons
Our team's highest score, achieved by our best random forest, is 0.80383, which ranks 791 out of 3910 entries. This score is above the "Gender, Price, and Class Based Model," the highest benchmark, which scores 0.7799. Scores close to ours use the port of embarkation, sex, cabin number, and age. From observing the leaderboard on Kaggle, it seems that only the top 20 scores are above 90%, while there are many scores in the 80% range. High-scoring teams use decision tree techniques that include nodes based on the passenger's title and fare, analyze whether there is a mother-and-child pair, match families together, and much more. These methods seem to work well and would be good to explore in the future.
10 Conclusion
We have researched our topic in depth, learned several machine learning methods for tackling the problem, and created multiple random forests and decision trees. Most of our submissions received scores in the high seventies, which is above the middle of the rankings. We were surprised at first to discover that our first few submissions ranked fairly high, but we later realized that it is hard to push past a score of 80%; an 80% accuracy score could serve as another benchmark, since it is extremely difficult to reach. Currently, our best Kaggle score comes from our best random forest, which has a 75% survival accuracy, 95% death accuracy, 88% total accuracy, and 80% Kaggle accuracy. This forest put us at spot 791 out of 3910. Our score is above all of Kaggle's benchmark scores, but it does not put us in the top 20% (the impressive scores).
11 Future Work
Our statistics show fairly consistent results, with an average of about 76% Kaggle accuracy, but they also show varying survival and death accuracies across the different algorithms. The Survival Accuracy varies between 57% and 92% and the Death Accuracy varies between 74% and 95%. According to our Total Accuracy, our highest-scoring algorithm should be the random forest built from the features Pclass, Parch, SibSp, Sex, and Age with n_estimators set to 100, which has a total accuracy of 91%. However, Kaggle scored the results of this algorithm at 72%. This discrepancy is likely due to overfitting. If someone were to continue this project in the future, one idea to consider is using all of our algorithms together, giving each one a different weight, to try to optimize the accuracies. This technique is called bagging, short for "bootstrap aggregating."
There are multiple ways to use the bagging idea; a random forest itself uses it to make decisions. A random forest draws k subsets s of size m from the training set S, runs its tree-building algorithm A on each subset, and outputs k hypotheses h, one per subset. These hypotheses then run through an aggregation function f, which outputs a single final hypothesis hf.
By a similar principle, it is possible to feed our testing set through the different algorithms we have implemented and aggregate their outputs. Let's say we have n algorithms a in our algorithm set A, a ∈ A, and S is our testing set. Then for each s ∈ S, we run s through all of the algorithms in A and output n hypotheses h. In notation, we have
∀ s ∈ S: a_i(s) = h_i(s), i = 1, ..., n
where s is a passenger in the testing set S, the a_i are the algorithms in A, and h_i(s) is the hypothesis of algorithm i for that passenger. From here, all of the hypotheses h_i(s) for one passenger s go through an aggregation function f, which outputs a single final hypothesis hf(s).
The aggregation function f typically weights each of the algorithms in A and decides which hypothesis to choose as its final hf. This function is the most important part of bagging: if created correctly, it would ideally choose the appropriate algorithm for each passenger. For example, our statistics in the Results table show that the Entropy Tree has the best Survival Accuracy at 92% and our last random forest has (one of) the best Death Accuracies at 95%. If we knew that a given passenger survived, then the ideal aggregation function (without knowing the outcome beforehand) would pick the hypothesis produced by the Entropy Tree as that passenger's final hypothesis. Likewise, if we knew that the given passenger died, then the function, with knowledge only of the features, would pick the hypothesis produced by our last random forest as that passenger's final hypothesis. The question is, how does one come up with such an aggregation function?
Based on our statistics, algorithms with more accurate results in Survival, Death, Total, or Kaggle accuracy should receive a higher weight than the other algorithms. Algorithms 2, 8, and 6 should be weighted higher than the others, and Algorithm 4 should be weighted lowest since it has the lowest Survival and Kaggle accuracy scores. The aggregation function should then prioritize the features that hold the most importance. While analyzing our data at the start of this project, we noticed that age, class, and sex matter most; when calculating entropy, we noticed that fare also has some importance. From these features, it is possible to group passengers so that a certain combination of feature values selects a certain algorithm. For example, if a passenger is a 10-year-old female in first class, her chances of survival are very high, so the aggregation function should pick Algorithm 2 (our strongest algorithm for survival) for this passenger. If a passenger is a 30-year-old male in third class, then the aggregation function should pick Algorithm 8 (our strongest algorithm for death). However, not every combination of these priority features yields a clear algorithm choice; for example, an 18-year-old male in second class could plausibly either survive or die. Some combinations will make a more definite decision (perhaps 95% for Algorithm 2 for a 10-year-old female in first class) while other combinations will make a less certain decision (perhaps 60% for Algorithm 8 for an 18-year-old male in second class). Each combination of features (or at least the more frequent combinations) should have a decision distribution over which algorithm to use, based on previous data analysis. For example, an 18-year-old male in second class might give a distribution of 60% for Algorithm 8, 30% for Algorithm 2, and 10% for Algorithm 6. The aggregation function would then pick the algorithm with the highest percentage from this distribution and return its corresponding hypothesis.
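A speculative sketch of such an aggregation function; the per-combination weight table is hypothetical, and nothing below was implemented for this project.
def aggregate(hypotheses, weights):
    # hypotheses: dict mapping algorithm number -> that algorithm's 0/1 prediction for one passenger
    # weights: dict mapping algorithm number -> weight for this passenger's feature combination,
    #          e.g. {8: 0.6, 2: 0.3, 6: 0.1} for an 18-year-old male in second class
    best_alg = max(weights, key=weights.get)   # pick the most trusted algorithm for this combination
    return hypotheses[best_alg]

# Example: the forest (Algorithm 8) predicts death, the Entropy Tree (2) predicts survival
final = aggregate({8: 0, 2: 1, 6: 0}, {8: 0.6, 2: 0.3, 6: 0.1})   # -> 0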
Our results for this project suggest that we have an overfitting problem, and bagging is a method known to help reduce this particular obstacle. The approach above is a good starting point for blending our existing algorithms together to yield more accurate results.
References
[1] Berghammer, R. Relations and Kleene Algebra in Computer Science: 11th International Conference on Relational Methods in Computer Science, RelMiCS 2009, and 6th International Conference on Applications of Kleene Algebra, AKA 2009, Doha, Qatar, November 1-5, 2009: Proceedings. Berlin: Springer, 2009. Print.
[2] "Decision Tree." Web. 11 Oct. 2015. http://www.saedsayad.com/decision_tree.htm
[3] "Decision Tree." Web. 11 Oct. 2015. http://www.saedsayad.com/decision_tree_overfitting.htm
[4] "Public Leaderboard - Titanic: Machine Learning from Disaster." Kaggle. 28 Sept. 2012. Web. 28 Oct. 2015. https://www.kaggle.com/c/titanic/leaderboard
[5] "Titanic Survivor Prediction." Dan's Website. 5 Sept. 2012. Web. 29 Oct. 2015. http://logicalgenetics.com/titanic-survivor-prediction
16

More Related Content

Viewers also liked

EurActory Preview, Innovative Tools July 2014
EurActory Preview, Innovative Tools July 2014EurActory Preview, Innovative Tools July 2014
EurActory Preview, Innovative Tools July 2014euractiv
 
James Silcott Resume 04 March 2015
James Silcott Resume 04 March 2015James Silcott Resume 04 March 2015
James Silcott Resume 04 March 2015James Silcott
 
Building A Radio Station Project
Building A Radio Station ProjectBuilding A Radio Station Project
Building A Radio Station ProjectSalvador Lopez
 
Marathi language Basics chapter 2 for foreigner And Basic learners
Marathi language Basics chapter 2 for foreigner And Basic learnersMarathi language Basics chapter 2 for foreigner And Basic learners
Marathi language Basics chapter 2 for foreigner And Basic learnersUniversity Of Wuerzburg,Germany
 
Best Minecraft Minigames
Best Minecraft MinigamesBest Minecraft Minigames
Best Minecraft MinigamesDoubleUpGaming
 

Viewers also liked (8)

2015_CAST_results
2015_CAST_results2015_CAST_results
2015_CAST_results
 
EurActory Preview, Innovative Tools July 2014
EurActory Preview, Innovative Tools July 2014EurActory Preview, Innovative Tools July 2014
EurActory Preview, Innovative Tools July 2014
 
James Silcott Resume 04 March 2015
James Silcott Resume 04 March 2015James Silcott Resume 04 March 2015
James Silcott Resume 04 March 2015
 
diversityarticle
diversityarticlediversityarticle
diversityarticle
 
ELS NOSTRES AVANTPASSATS
ELS NOSTRES AVANTPASSATSELS NOSTRES AVANTPASSATS
ELS NOSTRES AVANTPASSATS
 
Building A Radio Station Project
Building A Radio Station ProjectBuilding A Radio Station Project
Building A Radio Station Project
 
Marathi language Basics chapter 2 for foreigner And Basic learners
Marathi language Basics chapter 2 for foreigner And Basic learnersMarathi language Basics chapter 2 for foreigner And Basic learners
Marathi language Basics chapter 2 for foreigner And Basic learners
 
Best Minecraft Minigames
Best Minecraft MinigamesBest Minecraft Minigames
Best Minecraft Minigames
 

Similar to milestone-5-stretching

Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16Sebastian
 
College Essay Rubric Template Telegraph
College Essay Rubric Template TelegraphCollege Essay Rubric Template Telegraph
College Essay Rubric Template TelegraphKimberly Berger
 
Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...
Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...
Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...Crystal Adams
 
Webber-thesis-2015
Webber-thesis-2015Webber-thesis-2015
Webber-thesis-2015Darcy Webber
 
Titanic who do you think survived
Titanic   who do you think survived Titanic   who do you think survived
Titanic who do you think survived Vaibhav Agarwal
 
3 Unsupervised learning
3 Unsupervised learning3 Unsupervised learning
3 Unsupervised learningDmytro Fishman
 
Analyzing Titanic Disaster using Machine Learning Algorithms
Analyzing Titanic Disaster using Machine Learning AlgorithmsAnalyzing Titanic Disaster using Machine Learning Algorithms
Analyzing Titanic Disaster using Machine Learning Algorithmsijtsrd
 
An intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alAn intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alRazzaqe
 
Dynamically-Driven Galaxy Evolution in Clusters of Galaxies
Dynamically-Driven Galaxy Evolution in Clusters of GalaxiesDynamically-Driven Galaxy Evolution in Clusters of Galaxies
Dynamically-Driven Galaxy Evolution in Clusters of GalaxiesPete Abriani Jensen
 
Order Paper Writing Help 247 - How To Write A Re
Order Paper Writing Help 247 - How To Write A ReOrder Paper Writing Help 247 - How To Write A Re
Order Paper Writing Help 247 - How To Write A ReJessica Adams
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsKrishna Sankar
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Pedro Ernesto Alonso
 
MS Tomlinson Thesis 2004-s
MS Tomlinson Thesis 2004-sMS Tomlinson Thesis 2004-s
MS Tomlinson Thesis 2004-sMSTomlinson
 

Similar to milestone-5-stretching (20)

Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16
 
M3R.FINAL
M3R.FINALM3R.FINAL
M3R.FINAL
 
College Essay Rubric Template Telegraph
College Essay Rubric Template TelegraphCollege Essay Rubric Template Telegraph
College Essay Rubric Template Telegraph
 
Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...
Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...
Action Speaks Louder Than Words Essay. Action Speak louder than Words Essay f...
 
Webber-thesis-2015
Webber-thesis-2015Webber-thesis-2015
Webber-thesis-2015
 
Titanic who do you think survived
Titanic   who do you think survived Titanic   who do you think survived
Titanic who do you think survived
 
3 Unsupervised learning
3 Unsupervised learning3 Unsupervised learning
3 Unsupervised learning
 
Analyzing Titanic Disaster using Machine Learning Algorithms
Analyzing Titanic Disaster using Machine Learning AlgorithmsAnalyzing Titanic Disaster using Machine Learning Algorithms
Analyzing Titanic Disaster using Machine Learning Algorithms
 
An intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et alAn intro to applied multi stat with r by everitt et al
An intro to applied multi stat with r by everitt et al
 
Essay Database.pdf
Essay Database.pdfEssay Database.pdf
Essay Database.pdf
 
BENVENISTE-DISSERTATION-2014
BENVENISTE-DISSERTATION-2014BENVENISTE-DISSERTATION-2014
BENVENISTE-DISSERTATION-2014
 
Dynamically-Driven Galaxy Evolution in Clusters of Galaxies
Dynamically-Driven Galaxy Evolution in Clusters of GalaxiesDynamically-Driven Galaxy Evolution in Clusters of Galaxies
Dynamically-Driven Galaxy Evolution in Clusters of Galaxies
 
Order Paper Writing Help 247 - How To Write A Re
Order Paper Writing Help 247 - How To Write A ReOrder Paper Writing Help 247 - How To Write A Re
Order Paper Writing Help 247 - How To Write A Re
 
THESIS.DI.AJP.GM
THESIS.DI.AJP.GMTHESIS.DI.AJP.GM
THESIS.DI.AJP.GM
 
Thesis-DelgerLhamsuren
Thesis-DelgerLhamsurenThesis-DelgerLhamsuren
Thesis-DelgerLhamsuren
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
eclampsia
eclampsiaeclampsia
eclampsia
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
 
MS Tomlinson Thesis 2004-s
MS Tomlinson Thesis 2004-sMS Tomlinson Thesis 2004-s
MS Tomlinson Thesis 2004-s
 
refman
refmanrefman
refman
 

milestone-5-stretching

  • 1. University of Delaware Contemporary Applications of Mathematics Milestone 5: Stretching for Survival Authors: Christina Dehn Devon Gonzalez Johanna Jan December 10, 2015
  • 2. Contents 1 Introduction 2 2 Data Analysis 2 3 Methods for Predictions 4 3.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3 Choosing Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4 Decision Trees 5 4.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.2.1 Age Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.2.2 Entropy Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.2.3 Pruned Entropy Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 5 sklearn’s Random Forest 8 6 Missing Data 10 6.1 Defining the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 6.2 Implementing a Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 7 Non-numeric Values 12 7.1 Sex and Embark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 7.2 Title . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 8 Our Statistics 13 9 Statistical Comparisons 14 10 Conclusion 14 11 Future Work 15 1
  • 3. 1 Introduction On April 12, 1912, one of the most infamous shipwrecks in history occurred; the RMS Titanic shockingly sank. This vessel was 882 feet long, making it the the largest ship in service during its time. Because of its great size, the creators and regulators did not believe that this ship could possibly sink. However during its maiden voyage, RMS Titanic sank after colliding with an iceberg in the North Atlantic Ocean while traveling from Southampton, UK to New York, NY. The collision caused the death of 1502 out of 2224 passengers and crew. The early 1900s had poor safety regulations, including the lack of lifeboats on board; better safety regulations and more life boats would have saved more lives. As a result, this tragedy exposed the importance of safety and led to better safety regulations for maritime vessels. Although luck was one factor in who survived the sinking, it wasn’t the only factor. Women, children, and upper class were chosen first for the lifeboats. Because of this, some groups of people, were more likely to survive than others. For this project, we are analyzing the Titanics survival statistics to design a model for predicting which passengers or crew members were more likely to survive. The features included in the survival statistics consist of their class, name, sex, age, number of family members aboard, cabin, and where they embarked from. 2 Data Analysis The data set is provided on the Kaggle Website. We are given two main csv files, a training set and a testing set. These sets include the passengers title, name, ID, age, sex, class, cabin number, ticket fare, ticket number, number of siblings, number of parents or children, and port of debarkation as features. The training set consists of 891 passengers including if they survived (1) or if they died (0). The purpose of this set is to “train” the prediction program to make accurate predictions of outcomes. The testing set consists of 418 passengers without their survival status. This is the set of passengers that Kaggle uses to test submissions on. As mentioned above, Kaggle has supplied some classification features for each passenger. Below is a list of the feature names and what they represent: • PassengerId: An integer that represents a passenger’s ID • Pclass: An integer (1, 2, or 3) that represents 1st (1), 2nd (2), and 3rd (3) class • Name: The name of the passenger • Sex: The sex of the passenger (male or female) • Age: The age of the passenger • SibSp: The number of siblings or spouse that the passenger has on board (does not specify between sibling and spouse) • Parch: The number of parents or children the passenger has on board (does not specify between parent and child) • Ticket: The passenger’s Ticket Number • Fare: The price of the passenger’s ticket • Cabin: The cabin number in which the passenger stayed on board • Embarked: A symbol (Q, S, C) that represents where the passenger boarded the ship. Q represents Queen- stown, S represents Southampton, and C represents Chernbourg. Because we are given a huge amount of information about the passengers, we wanted to easily look for trends within the data. To do this, we read in the data from the training set and analyzed the data by feature. Our findings included: • There are almost twice the amount of males than there are females 2
  • 4. • There are about twelve times more adults (greater than or equal to age fourteen) than children (under age fourteen) • Over 50% of the passengers were third class • Over 60% of passengers embarked from Southampton. To further push our analysis, below is a histogram displaying the passengers’ age distributions, a pie chart categorized by class, and a pie chart categorized by embarkation. Figure 1: A histogram that displays the age distributions in the training set. The age distribution of the histogram shows that many of the passengers were in their twenties and thirties. Class Distribution for train.csv Figure 2: A pie chart that displays the class distribution in the test set. The class distribution pie chart indicates that the amount of passengers in first and second class are pretty even, but more than half of the passengers were third class. 3
3 Methods for Predictions

Emerging in the 1990s, machine learning is a field of computer science that explores algorithms that can learn from and make predictions on given data. Machine learning makes predictions based on known properties learned from the training data set, rather than solely following a program's instructions. Kaggle suggests that participants apply the tools of machine learning to predict which passengers survived. When researching common machine learning techniques, we found many researchers' approaches; the most popular were random forests and neural networks, which are discussed below. The majority of researchers suggest random forests, but some determine survival through neural networks.

3.1 Neural Networks

Neural networks are a common technique used to estimate functions that may depend on a large number of inputs, some of which may be unknown. The method is inspired by our understanding of how the brain learns. A network has layers: an input layer, hidden layers, and an output layer. Input vectors are fed into the neural network and output vectors are produced at the end. One example of a neural network for the Titanic problem uses social class, age (adult or child), and sex as inputs; that network had an error rate of 0.2. One important fact about a neural network is that it uses weights for the relative importance of each input. Some inputs are more important, such as whether the passenger is a child. The overall effect (excitatory or inhibitory) of a neuron is found by looking at the weights from the input to hidden neurons and from the hidden to output neurons [5].

3.2 Random Forests

Random forests are groups of decision trees that rate the importance of certain features and predict the outcome of an input based on the values of those features. The final prediction for a testing set is based on decision trees built from an inputted training set. When a data point is fed into the forest, it outputs either the class that is the mode of the individual trees' classes, known as classification, or their mean prediction, known as regression. Classification is the result that most of the trees in the forest predict as the final outcome.
Regression is the mean prediction of all of the trees' final outcomes. Based on these results, the forest decides which of the two approaches gives the better prediction for each input.

3.3 Choosing Random Forest

After careful consideration, we decided to use random forests because they seemed more flexible and easier to implement than neural networks. We felt that we would have more freedom using random forests because there are more resources available for them than for neural networks. In addition, Kaggle suggested them, and other people are using them with good results. Random forests are also good at taking values from different kinds of categories and outputting categorical results [3].

We will be using Python to tackle this project. Python has many useful libraries for data analysis. For this project, we are using Python's pandas library, the sklearn Random Forest Classifier, and Python's csv library. These libraries are useful because pandas can read and write csv files and sklearn has a random forest module, which is a useful tool for creating random forests.

4 Decision Trees

Before discussing random forests, we must first define decision trees. Decision trees are a specific type of tree data structure that make predictions by breaking down a large data set into smaller subsets. A decision node holds a specific feature and has two or more branches stemming out of it. The first decision is made at the root node: based on the input, the node determines which branch to take in relation to its specified feature. Each branch leads to a different decision node. This process continues until the input reaches a leaf node and cannot branch any further. The leaf node represents the final outcome for that input. Multiple decision trees make up a random forest.

4.1 Entropy

Entropy can be used to find the homogeneity of a sample. Entropy normally represents disorder, but in information theory it is the expected value of the information contained in a message (in our case, the survival of a person on board). We used conditional entropy to determine the splits in our decision tree. Conditional entropy calculates the expected value of information given another piece of information. Two kinds of entropy need to be calculated to make a decision tree: the entropy of one attribute and conditional entropy. In relation to our topic, the entropy of one attribute is that of survival, and conditional entropy could be that of survival given male or female. The equation for the entropy of one attribute is:

E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i   (1)

where S is the target (survival), p_i is the probability of the i-th value of the attribute, and c is the number of possible values of that attribute. The equation for the conditional entropy of two (or more) attributes is:

E(S|X) = \sum_{c \in X} P(c)\, E(S_c)   (2)

where S is the target (survival), X is the conditioning attribute (for example, sex), P(c) is the probability that X takes the value c, and E(S_c) is the entropy of S restricted to the records where X = c.

To build the tree for a target (in our case, survival), split the data, calculate the entropy of each branch, and subtract the weighted branch entropies from the entropy before the split; this difference is the information gain. Then choose the attribute with the largest information gain, or equivalently the smallest conditional entropy, as the next node, because the smaller the entropy, the more accurate the prediction. Run this algorithm recursively until all of the data is in the tree. Small conditional entropy values lead to more accurate predictions because little information remains to be gained once the value of the split attribute is known.
If the entropy is 0, then the end result is known from the other information 100% of the time. Like most algorithms, decision trees are not perfect. Some problems with decision trees include handling continuous attributes, overfitting, attributes with many values, and missing values [2].
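As a concrete illustration of equations (1) and (2), both quantities can be computed in a few lines of Python. This is a minimal sketch, assuming the survival labels and one categorical attribute (here Sex) are available as pandas columns; it is not the code we used to build our trees.

import numpy as np
import pandas as pd

def entropy(series):
    # Equation (1): entropy of a single attribute, e.g. survival
    probs = series.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def conditional_entropy(target, attribute):
    # Equation (2): expected entropy of the target given an attribute
    total = 0.0
    for value, weight in attribute.value_counts(normalize=True).items():
        total += weight * entropy(target[attribute == value])
    return total

train = pd.read_csv('train.csv')
print(entropy(train['Survived']))                             # E(Survived)
print(conditional_entropy(train['Survived'], train['Sex']))   # E(Survived | Sex)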
4.2 Implementation

We began predicting the survival of passengers by implementing a few different decision trees. Some were based primarily on features chosen by logic, like age, while others were based primarily on features chosen by entropy, like fare. We will start with the most basic decision tree: the Age Tree.

4.2.1 Age Tree

Based on our data and its trends, we implemented a basic decision tree that focused on age, sex, and class, since these logically seemed to be the most influential classifications from our research and data analysis. We also added a Parent/Child branch for third-class females because our initial research showed a positive correlation between families and survival, meaning those traveling with families had a higher chance of surviving. This branch was only added for third-class females because it was the only node that had uncertainty. Below is the layout of our most basic decision tree with the corresponding algorithm implemented in Python:

Figure 4: A flow representation of Age Tree, our most basic decision tree.

if math.isnan(age):
    # passengers with a missing age have one predicted by missingAge()
    age = missingAge(passenger)
if age < 14:
    if pclass == 1 or pclass == 2:
        survived = 1
    else:
        if sex == 'female':
            survived = 1
        else:
            survived = 0
else:
    if sex == 'male':
        survived = 0
    else:
        if pclass == 1 or pclass == 2:
            survived = 1
        else:
            if parch > 0:
                survived = 1
            else:
                survived = 0
return survived

If the passenger does not have an age, then the missingAge() function is called to predict the passenger's age. These predictions are based on the ticket number and whether the passenger has siblings, a spouse, or a parent/child. This algorithm is explained in more detail in Section 6.2.

4.2.2 Entropy Tree

Another decision tree we implemented was based on the conditional entropy equation (2): we calculated the conditional entropy for every possible place we could split the tree. We then chose the split that placed no more than 2/3 of the data on either side, meaning we chose the split that divided the tree as evenly as possible while maintaining a low entropy score. A low entropy score means the split separates the given data into the possible outcomes as cleanly as possible. Out of all of the possible splits, the "Fare" classification came out with low entropy scores most often. We believe that fare may be a good indicator of survival rate and may also be influenced by the other variables. Therefore, our second decision tree used fare, with age, sex, and class as fare's influencers. Below is an illustration of this decision tree.

Figure 5: A flow representation of our Entropy Tree.

4.2.3 Pruned Entropy Tree

One common issue that occurs with decision trees is overfitting. Overfitting is when an algorithm is too specific to the training set and does not predict other data well. To reduce this potential issue for the above Entropy Tree, we used the following error estimate formula:
e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}   (3)

where z = 1.96 (the z-value for a 95% confidence level), f is the observed error on the training data at the node, and N is the number of training instances covered by the node or leaf. We used this formula to estimate the error at a node whose children are all leaves, as well as at each of those leaves. Then we found the weighted sum of the leaves' estimated errors, where each weight is the percentage of the parent's instances that fall within that leaf. If the weighted sum for the leaves is greater than the estimated error of the parent node, then we prune the tree by removing those leaves [3]. Below is our pruned version of the entropy tree.

Figure 6: A flow representation of our Pruned Entropy Tree to counteract overfitting.

Our pruned entropy tree results can be found in the table (Figure 9) in Section 8. This tree was one of our best algorithms at predicting survival, with 91% survival accuracy. The death accuracy did not score as well at 78%, and the tree earned a Kaggle accuracy of 78%.
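For reference, equation (3) and the pruning test described above can be written out in a few lines of Python. This is a sketch of the calculation only; the leaf error rates and instance counts in the example are hypothetical.

import math

def estimated_error(f, N, z=1.96):
    # Equation (3): pessimistic error estimate at a node (95% confidence)
    numerator = f + z**2 / (2 * N) + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2))
    return numerator / (1 + z**2 / N)

def should_prune(parent, leaves):
    # Prune when the weighted estimated error of the leaves exceeds the parent's.
    # parent and leaves are (f, N) pairs; each weight is the leaf's share of the
    # parent's instances.
    parent_error = estimated_error(*parent)
    total = sum(N for _, N in leaves)
    leaf_error = sum((N / total) * estimated_error(f, N) for f, N in leaves)
    return leaf_error > parent_error

# Hypothetical example: a parent covering 30 instances with two leaves
print(should_prune(parent=(0.33, 30), leaves=[(0.25, 16), (0.43, 14)]))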
5 sklearn's Random Forest

After researching and implementing our own decision trees, we were finally ready to create a random forest. Python's sklearn package comes equipped with a ready-to-use random forest tool. We feed in the training set, which is a csv file of 891 passengers with all of the given features along with whether they survived (1) or died (0), and a list of features from the training set as inputs. It uses these inputs to create a set of decision trees that it believes will give the most accurate results. Then we feed in the testing set, another csv file of 418 passengers with all of the given features but without their survival status, and sklearn uses the decision trees built from the training set to predict the survival statuses of the passengers in the testing set. Below is a list of the random forest parameters and their default settings:

• n_estimators: (default=10) An integer that sets the number of trees in the forest
• criterion: (default='gini') A string that sets the function used to measure the quality of a split
• max_features: (default='auto') An integer, float, string, or None that sets the number of features to consider when determining the best split
• max_depth: (default=None) An integer or None that sets the maximum depth of a tree
• min_samples_split: (default=2) An integer that sets the minimum number of samples an internal node must have to be split
• min_samples_leaf: (default=1) An integer that sets the minimum number of samples in a leaf node
• min_weight_fraction_leaf: (default=0) A float that sets the minimum weighted fraction of samples required to be in a leaf node
• max_leaf_nodes: (default=None) An integer or None that sets the maximum number of leaf nodes in a tree
• bootstrap: (default=True) A boolean that sets whether or not bootstrap samples are used when building a tree
• oob_score: (default=False) A boolean that sets whether or not out-of-bag samples are used to estimate error
• n_jobs: (default=1) An integer that sets the number of parallel jobs to run
• random_state: (default=None) An integer, RandomState instance, or None that sets the seed for the random number generator
• verbose: (default=0) An integer that controls the verbosity of the tree building
• warm_start: (default=False) A boolean that sets whether or not the random forest reuses the previous solution and adds more estimators
• class_weight: (default=None) A dictionary, list of dictionaries, "balanced", "balanced_subsample", or None that sets the weight of each class

We made a few models using the standard random forest structure and training in sklearn. There are optimizations that can be tried, like using multiple CPU cores and specifying the minimum and maximum number of splits per tree. We can also increase the number of estimators, that is, the number of trees in the forest. We used 100 estimators for some of our random forests because it is fast and reasonably accurate. This approach also helps prevent overfitting because each decision tree is built from part of the training set, not the entire set.

However, this tool has problems if data is missing or not numeric. In our csv files, the Age, Fare, Ticket, Cabin, Embarked, and Sex features either have missing data or have values that are not numbers. So we created our most basic random forest from the features that have numeric values and no missing data: Pclass (passenger class), Parch (parent or child), and SibSp (sibling or spouse). To clarify, we did not use the Age, Fare, Sex, or Embarked features for our first couple of random forests because, at that point, we were still working on filling in missing data and converting non-numeric values to usable numeric values.
We also did not plan to use Ticket or Cabin in any of our random forests: not only did we lack the information needed to fill in or convert these values, but we also did not see them as necessary features. However, these features could still be useful for purposes other than making predictions, such as filling in missing data for other features; this is explained in more detail in Section 6. Below is the code for creating our first basic random forest:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

cols = ['Pclass', 'SibSp', 'Parch']   # numeric features with no missing values
colsRes = ['Survived']                # training labels

trainArr = train.as_matrix(cols)      # training array
trainRes = train.as_matrix(colsRes)   # training results

rf = RandomForestClassifier(n_estimators=100)   # initialize
rf = rf.fit(trainArr, trainRes.ravel())         # fit the data to the algorithm

testArr = test.as_matrix(cols)
results = rf.predict(testArr)

One random forest we made uses the features Pclass, SibSp, Parch, Sex, Age, Port, and Fare. We used most of the standard parameters from sklearn's random forest package; the only parameters we changed were max_features=1 and oob_score=True. max_features is the number of classification features that are considered when determining the best split; the standard setting is 'auto', or sqrt(n_features). Setting this to 1 should create relatively random trees, since every split considers only one feature, and this works well because it reduces overfitting. oob_score controls whether or not the random forest uses out-of-bag samples to estimate error. The default setting is False, but setting it to True gave us an out-of-bag estimate of error that helped us watch for overfitting.

Our best random forest uses the features Pclass, SibSp, Parch, Sex, Age, Port, Fare, and Title, and the parameters max_features=1, oob_score=True, and max_depth=7. Adding a new feature (Title) allows for more predictive power because it gives the forest more data to work with. The "new" max_depth parameter defines the maximum depth to which any one tree can grow; the default is None, which does not restrict the depth, so the trees can vary greatly in size and depth. Setting this parameter to a fixed number reduces overfitting further by not allowing any tree in the forest to grow too large. After "opening up" our previous random forest, we noticed that the depths of its decision trees did not exceed ten levels, so we tried various maximum depths under 10 to see which, if any, would make a difference in our results. After some brute forcing, we found that a max_depth of 7 yielded optimal results. The results for all of our random forests are shown in Section 8.
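To tie the pieces of this section together, the best forest described above could be constructed as in the sketch below. This assumes the preprocessing of Sections 6 and 7 has already produced fully numeric, gap-free files; the file names train_clean.csv and test_clean.csv are hypothetical, the column names Gender, Port, and Title follow the conventions used in this report, and n_estimators=100 is the value we used for our other forests rather than a documented choice for this one.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv('train_clean.csv')   # hypothetical preprocessed training file
test = pd.read_csv('test_clean.csv')     # hypothetical preprocessed testing file

cols = ['Pclass', 'SibSp', 'Parch', 'Gender', 'Age', 'Port', 'Fare', 'Title']

# One feature considered per split, out-of-bag scoring, and trees capped at depth 7
rf = RandomForestClassifier(n_estimators=100, max_features=1,
                            oob_score=True, max_depth=7)
rf = rf.fit(train[cols], train['Survived'])

predictions = rf.predict(test[cols])
print(rf.oob_score_)   # out-of-bag estimate of accuracy on the training set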
6 Missing Data

In order to use sklearn's random forest tool, all of our data has to be fully filled in; we cannot have any holes or NaNs (Python's marker for a missing value). Some features, like Ticket and Cabin, have too many blanks and there is no reasonable algorithm to guess what they are, so we will not use them as predictors when making our random forest, although we might be able to use them for another purpose. Other features, like Age and Fare, have blanks as well, but these could be filled in with the right algorithm. We start attacking the missing data problem with Age.

6.1 Defining the Problem

It is clear that Age is one of the most important features and one of the easiest gaps to fill. The bar chart below shows the number of passengers with an Age and with an Age missing in the testing set and training set.

Figure 7: A bar chart that displays the missing Age distributions in the training and test sets.

About 25% of the passengers in the testing set and about 20% of the passengers in the training set have a missing Age. Our first algorithm consisted of setting the Age feature of every passenger with a missing Age to the average age computed from the training set. This essentially classified each passenger with a missing age as an adult. Since there were statistically more adults than children aboard the Titanic, this was not a bad assumption, but some of our decision trees made decisions based on an age more specific than just adult or child. Therefore, we had to design a more accurate algorithm.

The algorithm that we created depends on the Ticket, SibSp, and Parch features. From research, we noticed that family members all share the same ticket number, probably because they bought their tickets as a group. We used this idea to our advantage. If the current passenger has NaN as an Age, we first check whether they have a number greater than zero in the SibSp column. If they do, we look through our whole list of passengers to find another passenger with a matching ticket number. If we find another passenger with a matching ticket number and a numeric Age, then we use that same Age for the current passenger. If the current passenger does not have a sibling or spouse, we check whether he or she has any parents or children (Parch). If so, then we look through our whole list of passengers, just as we did for SibSp, and try to find another passenger with a matching ticket number and an Age that is not NaN. If such a passenger is found, then we take the opposite Age of that passenger and use it for the current passenger. The opposite age is either 30 or 3, depending on whether the parent or child is an adult or a child. If the parent/child's Age is greater than or equal to 14, then the current passenger's Age is set to 3 (to represent a child), which is the most frequent child age. If the parent/child's Age is less than 14, then the current passenger's Age is set to 30 (to represent an adult), which is the most frequent adult age. (These frequencies were taken from Figure 1 in Section 2.) We are essentially classifying the current passenger as the opposite of their parent or child. If, however, the current passenger does not have any siblings/spouse or any parents/children on board, we fill in his or her Age as 30 (the average age), because a passenger traveling alone is most likely an adult.

6.2 Implementing a Solution

In order to fill in missing data, we created a function, missingAge(), in Python that mimics the algorithm described above. To review, the method is to check whether the passenger has a sibling on board. If they do, the passenger list is searched until the matching ticket is found; then the missing Age is replaced with the sibling's Age. If they do not have a sibling on board, it checks whether they have a parent or child on board. If this is true, then the csv file is searched until the matching ticket is found and the Age is swapped.
If the matching passenger is a child (under 14), the missing Age becomes 30. If the matching passenger is an adult (14 or older), the missing Age is changed to 3. If the passenger has neither a sibling nor a parent/child, then the missing Age is given the average age, which is 30. Below is the missingAge() algorithm:

def missingAge(passenger):
    # 'lines' is the full passenger DataFrame read in earlier
    spSib = passenger.loc['SibSp']
    parch = passenger.loc['Parch']
    ticket = passenger.loc['Ticket']
    age = 30                                  # default: the average (adult) age
    if spSib > 0:
        # siblings/spouses share a ticket, so copy a matching passenger's age
        for i in range(len(lines)):
            if lines.loc[i, 'Ticket'] == ticket:
                if not np.isnan(lines.loc[i, 'Age']):
                    age = lines.loc[i, 'Age']
                    break
    elif parch > 0:
        # parents/children share a ticket; assign the "opposite" age
        for i in range(len(lines)):
            if lines.loc[i, 'Ticket'] == ticket:
                pcage = lines.loc[i, 'Age']
                if pcage < 14:
                    age = 30                  # relative is a child, so assume an adult
                else:
                    age = 3                   # relative is an adult, so assume a child
                break
    else:
        age = 30                              # traveling alone: assume an adult
    return age

In order for our random forest to work, we must have a csv file that does not have any missing data. So, after writing the function for missing Age, we generated a new csv file in Python that has no missing Ages. We accomplished this by saving the new Age found by the missingAge() function as the passenger's Age.

7 Non-numeric Values

7.1 Sex and Embark

After filling in Age, we fixed another very important feature: Sex. This feature was not missing any data in either the testing or training table, but its entries were not numeric. We had to change the entries from "female" and "male" to 0 and 1. This was a fairly easy fix; we added a total of two lines to our random forest code, which are shown below:

train = pd.read_csv('train.csv')
train['Gender'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)   # changing Sex to numbers

test = pd.read_csv('test.csv')
test['Gender'] = test['Sex'].map({'female': 0, 'male': 1}).astype(int)     # changing Sex to numbers

Similarly, we had to change the non-numeric Embarked entries to numeric values, so we changed the entries from "Q," "C," and "S" to 0, 1, and 2 using code similar to that shown above.
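For completeness, the corresponding change for Embarked could look like the sketch below. This is an illustration rather than our exact code: filling any blank Embarked entries with 'S' (the most common port) is an assumption, and the new column name 'Port' simply matches the feature name used in Section 5.

import pandas as pd

train = pd.read_csv('train.csv')

# Map the ports to integers; blank entries are assumed to be Southampton ('S')
train['Port'] = train['Embarked'].fillna('S').map({'Q': 0, 'C': 1, 'S': 2}).astype(int)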
7.2 Title

Another data type we wanted to include in our random forest is the passenger's Title, another non-numeric value. In order to change the Titles to numeric values, we assigned each Title a number. Before doing so, we had to parse the Title out of the passenger's name; the method used to achieve this can be seen in the code below. The passenger's name in the csv file is in the general form last name, title, then first name, as strings. For example, one passenger's name is listed as "Kelly, Mr. James." Our code checks for a period in the passenger's name and then appends the title, which sits between the comma and the period. The most frequent Title was given the lowest number: for example, the Title 'Mr' was given 0, 'Mrs' was given 1, and so on. This process was used for both the test and train csv files. The method for converting from strings to integers is the same as explained in the previous subsection; the code had to be modified for each file since the Titles varied between the files. We added another column to the csv file containing each passenger's assigned Title number. The Python code can be seen below:

title = list()
if '.' in passenger.loc['Name']:
    # the title sits between the comma and the period, e.g. "Kelly, Mr. James" -> "Mr"
    title.append(passenger.loc['Name'].split(',')[1].split('.')[0].strip())
else:
    title.append('nan')

8 Our Statistics

To test the accuracy of our forests and decision trees, we submitted multiple csv files with the passenger ID and whether they survived (0 or 1) to Kaggle. Kaggle provides several benchmarks (that Kaggle generated) that we can use to compare how accurate our results are relative to its basic models, as well as where we stand in the rankings compared to other teams. "Assume All Perished" is the lowest benchmark, with a score of 62.679%. Another threshold is the "Gender Based Model" benchmark, where all females survive and all males die, which has a score of 76.555%. The "My First Random Forest" benchmark, which scored 77.512%, is a random forest algorithm that uses the Python package sklearn with the features Age, Sex, Pclass, Fare, SibSp, Embarked, and Parch. The top benchmark is the "Gender, Price and Class Based Model," which is based on gender, ticket class, and ticket price, and has a score of 77.990%. Below is a summary of the benchmarks:

Benchmark                                Score
Assume All Perished                      0.62679
Gender Based Model                       0.76555
My First Random Forest                   0.77512
Gender, Price, and Class Based Model     0.77990

Figure 8: A summary of Kaggle's benchmarks. The highest score that can be achieved is 1.

We first ran our single decision trees and submitted them to Kaggle to see where we stand compared to other teams. The score of our Age model is 77.033%, which placed us one rank above the Gender Based Model benchmark. The score of our Entropy model is 78.469% and the score of our Pruned Entropy model is 77.990%. Our highest scoring model, the Entropy model, is 555 places above the Gender Based Model benchmark and 3 places above the Gender, Price and Class Based Model.

In addition to our decision trees, we also ran sklearn's random forest tool on the basic features Pclass, Parch, and SibSp, and separately on the basic features plus Sex. Our Basic Random Forest obtained a score of 0.68421, which falls above the "Assume All Perished" benchmark and below the "Gender Based Model" benchmark. These results were not as good as our Entropy Tree submission. Our Basic plus Gender Random Forest was much better than the previous basic model, with a Kaggle accuracy of 0.77990, which is around the "Gender, Price, and Class Based Model" benchmark. After successfully filling in missing Ages and converting Sex to a numeric form, we ran sklearn's random forest tool on all five features: Age, Sex, Pclass, Parch, and SibSp.
Ideally, this random forest should have improved our overall accuracy, but after submitting it to Kaggle we found that it actually performed worse than our entropy-based trees. We scored 0.71770 on Kaggle, which is above the "Assume All Perished" benchmark but well below the "Gender Based Model" benchmark. We also added the Title feature (so the random forest then had six features) after filling in its missing data points, and it yielded Kaggle results similar to those of the basic random forest with Age and Sex.
After experimenting with the features, we started experimenting with the different parameters of sklearn's random forest tool. For one of the forests we made, we set the parameters max_features=1 and oob_score=True. This forest received a score of 0.78947, performing better than all of Kaggle's benchmarks; it did about 1% better than the best benchmark ("Gender, Price, and Class Based Model"). After also setting the parameter max_depth=7, we created our best scoring random forest. It got a score of 0.80383, which put us at rank 791 out of 3910.

Below is a summary comparing how well our trees and forests performed against each other. The survival, death, and total accuracies were calculated by dividing the number of correct results by the total number of results, and they are based on the training set; the Kaggle score is based on the test set. The Survival Accuracy is the number of correctly predicted survivors divided by the total number of survivors. The Death Accuracy is the number of correctly predicted passengers who died divided by the total number of passengers who died. The Total Accuracy is the total number correct (survivors and deaths) divided by the total number of passengers. The Kaggle Accuracy is computed the same way as the Total Accuracy, but by Kaggle on the test set instead of the training set.

Results

Alg # | Type | Features and parameters                                                              | Survival | Death | Total | Kaggle
1     | T    | Age Tree                                                                             | 87%      | 74%   | 79%   | 77%
2     | T    | Entropy Tree                                                                         | 92%      | 77%   | 83%   | 78%
3     | T    | Pruned Entropy Tree                                                                  | 91%      | 78%   | 83%   | 78%
4     | RF   | Pclass, Parch, SibSp; n_estimators=100                                               | 57%      | 78%   | 72%   | 68%
5     | RF   | Pclass, Parch, SibSp, Sex; n_estimators=100                                          | 70%      | 91%   | 83%   | 78%
6     | RF   | Pclass, Parch, SibSp, Sex, Age; n_estimators=100                                     | 84%      | 95%   | 91%   | 72%
7     | RF   | Pclass, Parch, SibSp, Sex; max_features=1, oob_score=True                            | 67%      | 95%   | 85%   | 79%
8     | RF   | Pclass, Parch, SibSp, Sex, Age, Title, Fare, Embarked; max_features=1, oob_score=True | 75%      | 95%   | 88%   | 80%

Figure 9: Summary of our statistics from decision trees and random forests. For Type, T stands for Tree and RF stands for Random Forest; the accuracy columns are Survival, Death, Total, and Kaggle Accuracy. All parameters that are not listed are set to their defaults.

9 Statistical Comparisons

Our team's highest score is 0.80383, which places us at number 791 out of 3910 entries with our best random forest. This score is above the "Gender, Price, and Class Based Model," the highest benchmark, which scores 0.77990. Teams with scores close to ours use the port of embarkation, sex, cabin number, and age. From observing the leaderboard on Kaggle, it seems that only the top 20 scores are above 90%, while there are many scores in the 80% range. High-scoring teams use decision tree techniques that include nodes for the passenger's title and fare, analyze whether there is a mother-and-child pair, match families, and much more. These methods seem to work well and would be good to explore in the future.

10 Conclusion

We have researched our topic in depth, learned several machine learning methods for tackling the problem, and created multiple random forests and decision trees. Most of our submissions received scores in the high seventies, which is above the middle of the rankings. We were surprised at first to discover that our first few submissions ranked fairly high, but we later realized that it is hard to push past a score of 80%; an 80% accuracy score could almost be considered another benchmark, since it is so difficult to achieve.
Currently, our best Kaggle score comes from our best random forest, which has a 75% survival accuracy, 95% death accuracy, 88% total accuracy, and
80% Kaggle accuracy. This forest put us at spot 791 out of 3910. Our score is above all of Kaggle's benchmark scores, but it does not put us in the top 20% (the impressive scores).

11 Future Work

Our statistics show fairly consistent results, with an average of about 76% Kaggle Accuracy, but they also show varying Survival and Death Accuracies for the different algorithms. The Survival Accuracy varies between 57% and 92%, and the Death Accuracy varies between 74% and 95%. According to our Total Accuracy, our highest-scoring algorithm should be the random forest consisting of the features Pclass, Parch, SibSp, Sex, and Age with n_estimators set to 100, which reached 91%. However, Kaggle scored the results for this algorithm at 72%. This discrepancy could be due to overfitting.

If someone else were to work on this project in the future, something to consider is using all of our algorithms together but giving each one a different weight to try to optimize the accuracies. This technique is related to bagging, or "bootstrap aggregating." There are multiple ways to use the bagging method; a random forest actually uses it to make decisions. A random forest splits the training set S into k bootstrap samples s of size m. It then runs its algorithm A on each sample s and outputs k hypotheses h (one for each sample). These hypotheses h then run through an aggregation function f, which finally outputs a single final hypothesis h_f.

Using a similar principle to the random forest algorithm, it is possible to feed our testing set into the different algorithms we have implemented. Let's say we have n algorithms a in our algorithm set A, a ∈ A, and S is our testing set. Then for each s ∈ S, we run s through all of the algorithms in A and output n hypotheses h. In notation,

\forall s \in S, \quad a_i(s) = h_i, \quad i = 1, \ldots, n

where s is a passenger in the testing set S, a_i are the algorithms in A, and h_i are the hypotheses produced by each algorithm. From here, all of the hypotheses h_i for one passenger s go through an aggregation function f, which outputs a single final hypothesis h_f. The aggregation function f typically weights each of the algorithms in A and decides which hypothesis to choose as its final h_f. This function is the most important part of bagging. If created correctly, the function would ideally choose the appropriate algorithm for each passenger. For example, our statistics in the Results table show that the Entropy Tree has the best Survival Accuracy at 92% and our last random forest has (one of) the best Death Accuracies at 95%. If we knew that a given passenger survived, then the ideal aggregation function (without knowing this beforehand) would pick the hypothesis produced by the Entropy Tree as that passenger's final hypothesis. Likewise, if we knew that the given passenger died, then the function, with only knowledge of the features, would pick the hypothesis produced by our last random forest as that passenger's final hypothesis. The question is, how does one come up with such an aggregation function? Based on our statistics, algorithms with more accurate results in Survival, Death, Total, or Kaggle Accuracy should have a higher weight than other algorithms. Algorithms 2, 8, and 6 should be weighted higher than the others. Algorithm 4 should be weighted the lowest since it has the lowest Survival and Kaggle Accuracy scores.
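To make this weighting idea concrete before refining it below, one possible aggregation function is sketched here. It is purely illustrative: the weights are hypothetical placeholders reflecting the ranking above, not values we have tuned.

# Hypothetical weights reflecting the ranking above (algorithms 2, 8, 6 high; 4 low)
weights = {1: 0.10, 2: 0.20, 3: 0.10, 4: 0.02, 5: 0.08, 6: 0.15, 7: 0.15, 8: 0.20}

def aggregate(hypotheses, weights):
    # Weighted vote: hypotheses maps algorithm number -> predicted label (0 or 1)
    votes = {0: 0.0, 1: 0.0}
    for alg, label in hypotheses.items():
        votes[label] += weights[alg]
    return max(votes, key=votes.get)

# Example: algorithms 2 and 6 predict survival, the rest predict death
print(aggregate({1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0}, weights))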
Then, the aggregation function should prioritize which features hold the most importance. While analyzing our data when we first started this project, we noticed that age, class, and sex have the most importance; when calculating entropy, we noticed that fare also has some importance. It is possible to group these features in such a way that if a passenger has a certain combination, the function picks a certain algorithm. For example, if a passenger is a 10 year old female in first class, her chances of survival are very high, so the aggregation function should pick algorithm 2 (the strongest algorithm for survival) for this specific passenger. If a passenger is a 30 year old male in third class, then the aggregation function should pick algorithm 8 (the strongest algorithm for death). However, not every combination of these priority features yields a clear algorithm decision. For example, an 18 year old male in second class has the potential both to survive and to die. Some combinations make a more definite decision (like 95% for algorithm 2 for a 10 year old female in first class) while other combinations make a more uncertain decision (like perhaps 60% for algorithm 8 for an 18 year old male in second class). Each combination of features (or at least the more frequent combinations) should have a decision distribution over which algorithm to use, based on previous data analysis. For example, an 18 year old male in second class might give a distribution of 60% for algorithm 8, 30% for algorithm 2, and 10% for algorithm 6. The aggregation function would then pick the algorithm with the highest percentage from this distribution and return its
corresponding hypothesis.

Our results for this project suggest that we have an overfitting problem. Bagging is a method that is known to help reduce this particular obstacle, and the above algorithm is a good start toward melding our existing algorithms together to yield more accurate results.

References

[1] Berghammer, R. Relations and Kleene Algebra in Computer Science: 11th International Conference on Relational Methods in Computer Science, RelMiCS 2009, and 6th International Conference on Applications of Kleene Algebra, AKA 2009, Doha, Qatar, November 1-5, 2009: Proceedings. Berlin: Springer, 2009. Print.
[2] "Decision Tree." Web. 11 Oct. 2015. http://www.saedsayad.com/decision_tree.htm
[3] "Decision Tree - Overfitting." Web. 11 Oct. 2015. http://www.saedsayad.com/decision_tree_overfitting.htm
[4] "Public Leaderboard - Titanic: Machine Learning from Disaster." Kaggle. 28 Sept. 2012. Web. 28 Oct. 2015. https://www.kaggle.com/c/titanic/leaderboard
[5] "Titanic Survivor Prediction." Dan's Website. 5 Sept. 2012. Web. 29 Oct. 2015. http://logicalgenetics.com/titanic-survivor-prediction