MS5103 Business Analytics Project
An Analysis of the First Time Bookings of Airbnb Users
Group Members (ID): Patrick Leddy (08370231)
Brian O Conghaile (11311151)
Níamh Ryan (11307801)
Supervisor: Michael Lang
Declaration of Originality
Project Details
Module Code: MS5103
Assignment Title: M.Sc. Business Analytics - Major Project
Group Members: (please use BLOCK CAPITALS)
Student ID Student Name Contact Details (Email, Telephone)
08370231 Patrick Leddy pleddy91@gmail.com
0834819255
11311151 Brian O Conghaile connobrian@gmail.com
0870694753
11307801 Níamh Ryan nellieryan93@gmail.com
0873119544
I/We hereby declare that this project is my/our own original work. I/We have read the University
Code of Practice for Dealing with Plagiarism* and am/are aware that the possible penalties for
plagiarism include expulsion from the University.
Signature Date
* http://www.nuigalway.ie/plagiarism
Abstract
The intention of our research project is to analyse the Airbnb datasets acquired from Kaggle, with the aim of uncovering interesting patterns and trends in the data. Some of the minor areas we focused on were social media trends and seasonal patterns. However, our main goal was to predict the destination to which the next new user will travel when booking on the Airbnb website. We wanted to gain insight into the booking patterns of new users, including as many variables as possible to aid our analysis. We used a wide variety of tools and techniques for this project, including RStudio for decision trees and XGBoost, and Minitab for time series analysis.
Table of Contents
1. Company Background
2. Outline of Problem
2.1 Objectives
2.2 Defining the Lifecycle
2.3 Selecting the Model
3. Description of Datasets
4. Data Preparation
4.1 Merging the Datasets
4.2 Missing Values
4.3 Creating the Dummy Variables
5. Initial Understanding of Data
5.1 Social Media Trends
5.2 Seasonal Trends
6. Explanation of Tools and Techniques
6.1 R Studio
6.2 Minitab
6.3 Excel
7. Findings
7.1 Association Between Variables
7.2 The Earlier Models
7.3 XGBoost
7.4 The Final Model
7.5 Testing the Variables
7.6 The Results
8. Conclusion
9. Appendices
1. Company Background
Founded in 2008, Airbnb is a relatively young company which has experienced major growth over the last number of years and is now a market leader in the hospitality sector. The idea of Airbnb is simple: it provides a platform and a service whereby its customers, people seeking places to stay, can connect with its clients, hosts who are looking to rent out their property. Customers are matched to clients based on their own preferences, such as room type, price, host language and other factors, all in a timely and efficient manner (Airbnb, 2016).
According to the Airbnb website, there are over 2 million listings worldwide, more than 60 million guests, and more than 191 countries on offer to visit. Airbnb also boasts more than 34 thousand cities on offer, as well as more than 1,400 castles to stay in (Airbnb, 2016).
2. Outline of Problem
A vast amount of data is available to us in this project. Like all big data projects, the biggest issue we encountered before beginning was how to organise, understand and manage the data we were given in a suitable manner so as to achieve our objectives.
2.1 Objectives
A real and worrying issue for organisations attempting projects involving large data and prediction models is that they often worry about what type of model they are trying to build, or what data they have available. Instead, they should be focusing on the business problem they are trying to solve, and on whether or not they are asking the right questions to begin with. We outlined four major objectives to reach by the end of our project, which we felt would be beneficial to Airbnb for the sake of better understanding its customers and their behavioural patterns.
2.1.1 Location of Booking (Main Objective)
As the data was retrieved from a competition run by Airbnb, with the goal of predicting where a customer would book their first holiday destination, the main aim of this project was to imagine ourselves working for Airbnb, deciding why this information would be of value to our employer and how we would go about achieving this goal. Naturally, one of the main ways of utilising data to create value is personalisation, or even a reward strategy. In order to create value from the large volume and variety of data available with Big Data, there are different types of value creation, such as performance management, data exploration, social analysis and decision science. For the kind of data collected by Airbnb, the value creation method that best suits the company is social analysis. Airbnb has collected information on everything its customers have done on its website to date and could use this greatly to its advantage (Parise, Iyer, and Vesset, 2015).
Seeing as Airbnb acts as an intermediary between buyers and sellers, knowing where a customer is likely to book their accommodation allows the company to act more efficiently, and as such there is an opportunity for better forecasting of future income. This sort of knowledge will allow Airbnb to interact with its customers, for example by providing them with individually customised offers. This could easily reduce the time between the creation of an account and the customer's first booking.
One downside of this objective is the question of whether we really have all the necessary data available to us. In a decision as personal as choosing a holiday destination, many different factors come into play. People do not choose holiday destinations based solely on their age, gender and the time they created their account. Social, economic and trending issues, as well as the cost of flights to and from a particular location, can all influence the decision. We do not necessarily have all of this information to hand.
2.1.2 Social Media Trends
Another objective of our project was to assess how social media affected different aspects of the Airbnb customer data. There is potential to yield some very interesting trends and results from analysing the data in relation to the method of sign-up, be it via Facebook, Google+, or so on.
2.1.3 Seasonal Trends
Seasonality is something we would all expect to be closely linked to the choice of holiday destination. One would expect seasonal trends to have perhaps the greatest influence on the main objective, the location of users' bookings, so it is worth looking into this on a more individualised basis in search of interesting patterns. Time series analysis could potentially be run to determine how seasonal trends affect different aspects of our datasets, not only the location but the sign-up method, among others. With this vast variety and volume of data, there is room for potential seasonal trends in all areas, not just location alone. We also need to take into consideration that all customers in these datasets are from America, so they may have different ideas of holiday trends from our own Irish perspective.
2.2 Defining the Lifecycle
In order to focus our time and ensure rigor and completeness, it was important that we clearly
defined the analytics lifecycle to approach the problem correctly and follow a certain
framework. Well-defined processes can help guide any analytic project and with an analytics
lifecycle, the focus is on Data Science rather than Business Intelligence. Data Science projects
differ in the fact that they require more due diligence in the discovery phase, tend to lack
shape or structure and contain less predictable data.
Figure 2.2.1: The Analytics Lifecycle
● Discovery - Learn the business domain, assess resources available, frame the problem
and begin learning the data.
● Data Preparation - Create analytic sandbox (conduct discovery and situational
analytics), ETL to get data into the sandbox so team can work with it and analyse it,
increase data familiarisation, data conditioning.
● Model Planning - Determine methods, techniques and workflow, data exploration,
select key variables and most suitable model.
● Model Building - Develop datasets for testing and training, build and execute model,
review existing tools and environment.
● Communicate Results - Determine success of results, identify key findings, quantify
business value, develop narrative to summarise and convey findings.
● Operationalise - Deliver final report including code used and technical specifications,
no pilot run as there is no production environment.
Once models have been run and findings produced, it is critical to frame these results in a way
that is tailored to the audience that engages the project and demonstrates clear value. If a team
performs a technically accurate analysis but fails to translate the results into a language that
resonates with the audience, people will not see the value, and much of the time and effort on
the project will have been wasted.
2.3 Selecting the Model
As described by Finlay (2014), whether or not a model is of any use to the organisation can be determined by asking three simple questions:
● Does the model improve efficiency?
● Does the model result in better decision making?
● Does the model enable you to do something new?
When deliberating on whether or not the model is sufficient, or of any use to the organisation, it is helpful to apply the above questions to the model, as well as determining whether the model is in line with the objectives outlined in section 2.1.
3. Description of Datasets
For our project we have a number of interconnected datasets provided by Airbnb for the purpose of the challenge. In total we were supplied with four datasets. The first dataset we interact with is the training dataset. This main dataset consists of 213,451 rows by 16 columns. The 16 columns are:
ID
Date a/c was created
Timestamp of First activity
Date of first booking
Gender
Age
Sign_up method
Sign_up flow
Language
Affiliate channel
Affiliate provider
First affiliate tracked
Sign up App
First device type
First browser
Country destination
The next dataset is the age gender dataset which contains the following columns:
Age bracket
Country destination
Gender
Population (in thousands)
Year
The countries dataset contains the following columns:
Country destination
Latitude of destination
Longitude of destination
Distance from US in km
Area of destination
Destination language
Language levenshtein distance
The test dataset is very similar to the training set, except that it lacks the country destination column; the reason being that this location is what we need to predict after training the model.
The last dataset was the Sessions dataset, which contained the following columns:
User_Id
Action
Action_Type
Action Detail
Device Type
Seconds Elapsed
4. Data Preparation
4.1 Merging of the Datasets
What we needed to do first was to cleanse the datasets and merge them together.
The data sets utilised were of the following dimensions:
Sessions: 10,567,737 x 6
Training_Users: 213,451 x 16
Test_Users: 62,096 x 15
Figure 4.1.1: Reading and Loading the Datasets into RStudio
Particular packages that are of use in R are dplyr and ggplot2, which are installed and loaded below.
dplyr: This package is useful for data manipulation, using "verb" functions to perform the manipulation. In particular, the subset, group_by, summarize and arrange functions are of use when preparing the data for analysis.
ggplot2: This package is primarily used to create complex, customisable graphs in a stylish and clear manner, as opposed to the default graphing options available in R, which have many limitations.
Figure 4.1.2: Loading the Packages into RStudio
The main goal of the data preparation phase is to combine information available in the
Sessions dataset to the Training and Test datasets.
The Sessions dataset consists of the following information.
Figure 4.1.3: Sessions Variables
The first issue that arises with the Sessions dataset is that it contains multiple rows of data for each user, as opposed to the single row per user in the Training and Test datasets. In order to combine the Sessions dataset with the others, only one row of data per user would need to be created. This row would be a summary of the user's actions. The new variables created for each user were: total time on the site, average time spent per action, standard deviation (sd) of time per action, total number of actions taken, and number of key actions taken.
The first 3 new variables of total time on the site, average time spent per action and sd of time
per action were calculated using the secs_elapsed variable.
Figure 4.1.4: Creating Summary Statistics of the Sessions Dataset
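The report's summary step was done in R with dplyr; a minimal Python sketch of the same idea is shown below, using a tiny hypothetical Sessions table (the real column names and values may differ).

```python
import pandas as pd

# Hypothetical miniature of the Sessions data; real columns/values may differ.
sessions = pd.DataFrame({
    "user_id":      ["a", "a", "a", "b", "b"],
    "action":       ["search", "lookup", "book", "search", "lookup"],
    "secs_elapsed": [100.0, 200.0, 300.0, 50.0, 150.0],
})

# One summary row per user: total, mean and sd of time, plus a count of actions.
sessions_new = sessions.groupby("user_id").agg(
    total_time=("secs_elapsed", "sum"),
    mean_time=("secs_elapsed", "mean"),
    sd_time=("secs_elapsed", "std"),
    n_actions=("action", "count"),
).reset_index()
```

The grouped aggregation collapses the many rows per user into the single summary row described above.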
As for the number of actions variable, the original variable of action was used to get a count
of how many actions each user took.
Figure 4.1.5: Creating the Count for the Actions Variables
Finally, the number of key actions builds on the original action variable and gives a count
based on how many key actions the user takes.
Figure 4.1.6: Creating the Count for Important Actions Variables
All of these data sets are then combined into one data frame.
Figure 4.1.7: Combining the Newly Created Variables
This creates a new data frame with the following dimensions that is made up of both Training
and Test user data.
Sessions_New: 135,484 x 6
The next step in the process is to filter the Training and Test datasets based on the Sessions information. The Sessions information provided only dates back as far as 1/1/14, whereas the Training data dates back to 1/1/10. With regard to tailoring the dataset to answer the main objective, the Sessions information was deemed important for the analysis; therefore only Training users who recorded session activity from 1/1/14 onwards were considered.
The Sessions_New data frame was split up between Training and Testing.
Training:
Figure 4.1.8: Splitting Sessions into Training Dataset
Testing:
Figure 4.1.9: Splitting Sessions into Testing Datasets
The original Training Set was then filtered, keeping only users with session information.
Figure 4.1.10: Filtering the Training and the Training Sessions Sets
The original Test Set is similarly filtered to users with session information.
Figure 4.1.11: Filtering the Test and the Test Sessions Sets
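The filtering described here was done in R; the idea can be sketched in Python as a membership test against the sessions summary, using hypothetical ids.

```python
import pandas as pd

# Hypothetical sketch: keep only users whose id appears in the sessions summary.
train = pd.DataFrame({"id": ["a", "b", "c"], "age": [30, 25, 40]})
sessions_new = pd.DataFrame({"user_id": ["a", "c"], "total_time": [600.0, 90.0]})

train_filtered = train[train["id"].isin(sessions_new["user_id"])]       # has sessions
train_no_sessions = train[~train["id"].isin(sessions_new["user_id"])]   # no sessions
```

The complement (`train_no_sessions`) corresponds to the separate set of users without session information mentioned below.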
Users with no session information are added to a different data set, making use of the Hmisc
package in R.
Figure 4.1.12: Set of Users Sessions Information not in the Training or Test Sets
The appropriate data sets were then combined column wise in R.
Figure 4.1.13: Creating the New Training and Test Sets
As for users in the Test Set with no session activity, a value of 0 was given for all variables featuring session information, and these users were added to the Test Set via Excel.
It was then decided to drop some variables that would have no impact on the overall objective, namely the date of account creation and the timestamp of first activity. As the date of first booking was null in the Test Set, it was also removed.
With appropriate filtering, new variables created and unneeded variables dropped, the Test
and Training sets were of the following dimensions.
Training Set: 73,815 x 18
Test Set: 62,096 x 17
4.2 Missing values
The next step of the data preparation process was dealing with missing values. This was an issue within the age variable in particular. As users signed up, there was no obligation to give an age, which left a large portion of users with an undefined age. Within the training set, for example, there were 32,362 users with no age, which is nearly half of the users. As this was such a large portion of the dataset, we could not simply remove the users with no age. Several alternative options were available to deal with the missing values.
The first option was to turn age into a dummy variable, with value 0 indicating no age
specified and value 1 indicating an age was specified. The second option was to impute the
missing values by running a regression model to determine the age of a user based on the
other variables that were given. The third and final option was to take the missing values and
impute them with the mean value for age (≈ 35).
In an ideal world, the second option of imputing the missing values would have been the leading candidate; however, the results of the regression analysis run in Minitab were disappointing, with no relationship found between the other variables and the age of the user. As both the first and third options seemed viable, it was decided to try both in our final model, with the third option producing the more accurate results.
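The mean-imputation option that was ultimately chosen can be sketched as follows; the ages below are hypothetical, but the observed mean matches the report's value of roughly 35.

```python
import pandas as pd

# Hypothetical ages with missing entries; the report's observed mean was ~35.
ages = pd.Series([28.0, None, 42.0, None, 35.0])

mean_age = round(ages.mean())        # mean of the observed ages only
ages_imputed = ages.fillna(mean_age)  # missing ages replaced by the mean
```

A note on the design choice: mean imputation preserves every row, at the cost of shrinking the variance of the age variable.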
4.3 Creation of Dummy Variables
In order to perform any form of prediction modelling or analysis on our datasets, we needed to convert our character variables into respective dummy variables to be used in their place. By doing this, we can run our models without issues such as converting to numeric form.
The following variables required the creation of dummy variables:
● Gender; Four levels, which included male, female, -unknown- and other.
● Signup Method; Four levels, which included google, facebook, basic and
weibo.
● Language; This variable had 25 levels which included english, chinese, italian
and spanish, to name but a few.
● Affiliate Channel; Eight levels, which include seo, sem-non-brand, sem-brand,
remarketing, other, direct, content, and api.
● First Affiliate Tracked; Eight levels, which includes linked, local ops,
marketing, omg, product, tracked - other, untracked, and empty cells.
● Signup App; Four levels, which include android, iOS, moweb, and web.
● Affiliate Provider; 17 levels, which includes yahoo, padmapper, craigslist,
google and many more.
● First Device Type; Nine levels, which includes android phone, android tablet,
desktop (other), iPad, iPhone, mac desktop, other/unknown, smartphone
(other), and windows desktop.
● First Browser; 39 levels, which includes chrome, IE, opera, safari, firefox and
many more.
The dummy variables were created using nested IF statements in Microsoft Excel. Following
on from the creation of these variables, their respective character version (or original
versions) were removed from the dataset as they no longer served any purpose in the analysis.
Figure 4.3.1: Creation of Dummy Variables
The IF statement method was used for the majority of the dummy variable creations. However, certain columns had too many levels to make the IF statement method practical. For those, we sorted the data alphabetically on the relevant column and then manually input the dummy value for each distinct level. Although this was time consuming, it was a guaranteed way to ensure the test set and the training set had the same dummy variables for the same variable type.
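The dummy-variable step above was carried out with nested IF statements in Excel; for comparison, a one-line Python equivalent is sketched below on a hypothetical sign-up column.

```python
import pandas as pd

# Hypothetical sign-up column; pd.get_dummies plays the role of the nested IFs.
users = pd.DataFrame({"signup_method": ["basic", "facebook", "basic", "google"]})

dummies = pd.get_dummies(users["signup_method"], prefix="signup")
# Drop the original character column, keeping only the dummy columns.
users = pd.concat([users.drop(columns="signup_method"), dummies], axis=1)
```

Note that, as with the manual approach, care is still needed to ensure the training and test sets end up with the same dummy columns when a level is absent from one of them.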
5. Initial Understanding of Data
5.1 Social Media Trends
Social media is a huge part of the modern era. Airbnb offers a few different methods of signing up for the website: Google+ (0), Facebook (1), the basic website sign-up method (2), and Weibo (3). We wanted to get a sense of the number of bookings actually made via each sign-up method, and to see whether they were evenly enough dispersed for other booking patterns to emerge based on the choice of social media.
Through the use of RStudio, we created a bar chart of each type of sign-up method, one for the training set and one for the test set.
Figure 5.1.1: Code for the Social Media Trends Graph
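The bar chart itself was produced in RStudio; the underlying computation is just a frequency count, sketched here in Python on a hypothetical slice of the data.

```python
import pandas as pd

# Hypothetical slice of the training data; the bar chart is these counts plotted.
train = pd.DataFrame(
    {"signup_method": ["basic", "basic", "facebook", "basic", "google"]}
)
counts = train["signup_method"].value_counts()
# In an interactive session, counts.plot(kind="bar") reproduces the figure.
```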
From the graphs below, it is very clear that the basic method is the most popular sign-up option. The data is so skewed that there is no particular connection apparent between the sign-up method and the destination travelled to; too much of the data falls into just one or two categories. The training set is also clearly missing one of the methods. As such, there does not appear to be a strong connection between the sign-up method and the destination.
Figure 5.1.2: Training and Test Set bar charts for Signup Methods
5.2 Seasonal Trends
When we tidied up the data, we had removed all seasonal factors, so in order to perform this analysis we had to go back to the original datasets to find booking patterns. A big issue with this kind of analysis is that there are so many variables in the datasets that it is difficult to focus on just one.
In attempting to discover trends or patterns in the dataset, we quickly realised we had too many variables for the system to handle the analysis correctly, or even to plot a seasonal graph for us. So, in order to create any sort of seasonal plot, we used R to subset our variables and create a new dataset which simply contains the total number of bookings per month. From this we could get a better understanding of the times of year at which bookings were being made.
Figure 5.2.1: Code for Separating the Data by the Month
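The subsetting was done in R; the per-month aggregation it produces can be sketched in Python as below, with hypothetical 2014 booking dates standing in for the real ones.

```python
import pandas as pd

# Hypothetical 2014 booking dates; count bookings per calendar month.
bookings = pd.to_datetime(pd.Series(
    ["2014-01-05", "2014-01-20", "2014-02-11", "2014-03-02", "2014-03-28"]
))
per_month = bookings.dt.month.value_counts().sort_index()
```

The resulting monthly totals are exactly the series that was exported to Minitab for the time series plot.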
The above code is a sample of the method used to break our large dataset down into a more basic format, which we then used in Minitab to create a more sensible and understandable time series graphic. Once we analysed the different months, we discovered that the patterns for the individual months did not differ much from one another. As such, we combined the vectors into a single dataset, exported it, and used it in Minitab to display a time series analysis graphic.
Figure 5.2.2: Minitab Time Series Analysis
As seen above, in Minitab we began by creating a simple time series plot to give us an initial understanding of the task at hand. The series was centred on bookings per month, and we created a basic line graph, seen below, to highlight the booking patterns of Airbnb's customers for the year 2014.
Figure 5.2.3: Time Series Analysis Plot of Total Booking for 2014
6. Explanation of Tools and Techniques
6.1 R Studio
RStudio is free software used mainly for statistics and visualisation. This is twofold for our project, as we need not only the ability to apply data mining and analytics, but also the opportunity to display information in graphs and charts in such a way that the average person can understand what we are presenting. RStudio allows the user to analyse data in a number of different ways, including linear and nonlinear modelling, traditional statistical tests (regression, correlation), data manipulation and data handling. This software offers a wide range of opportunities for analysing our datasets, in particular when it comes to the predictive aspect of where the next person may book (Authorimanuel, 2016).
We can also use RStudio as a visualisation tool. We can easily create maps displaying travel trends and the different locations users travel to. The maps allow us to show areas of greater popularity amongst travellers, and to display the results of our various area and social media trend analyses.
6.2 Minitab
A very important thing to remember about our datasets is that they contain information on US
citizens who have made their first booking on the Airbnb website. The information we have
may not necessarily follow the normal trends expected for holiday goers.
Minitab is a statistical software used in many areas but in particular it offers a wide selection
of methods that can be used for time-series analysis such as:
● Simple forecasting and smoothing methods
● Trend analysis
● Decomposition
● Moving average
● Single exponential smoothing
● Double exponential smoothing
● Winters' method
● Correlation analysis and ARIMA modelling
For what we intend to do in relation to time series analysis, the method we feel best fits our intentions is ARIMA modelling. ARIMA modelling not only makes use of the patterns in the data, but is specifically tailored to find patterns that may not be visible from a simple visualisation (Minitab Inc., 2016).
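The "AR" in ARIMA is the autoregressive component, where each observation is modelled on its own past values. As a minimal, self-contained sketch of that idea (in Python with simulated data, not the Airbnb series — Minitab's ARIMA routine does far more, including differencing and moving-average terms):

```python
import numpy as np

# Simulate an AR(1) series y_t = phi * y_{t-1} + noise, then recover phi
# by least squares. phi_true and the noise scale are assumed for illustration.
rng = np.random.default_rng(0)
phi_true = 0.8
y = np.zeros(200)
for t in range(1, 200):
    y[t] = phi_true * y[t - 1] + rng.normal(scale=0.1)

# Regress y_t on y_{t-1}: phi_hat = sum(y_t * y_{t-1}) / sum(y_{t-1}^2)
phi_hat = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])
```

The recovered coefficient lands close to the true value, which is the sense in which an AR model "finds" a pattern that a raw plot may not make obvious.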
6.3 Excel and XLMiner
XLMiner Platform is an analytical tool used within Excel as part of the Analytic Solver Platform. This software is used for both predictive and prescriptive analytics. It has many features: identifying key features, whereby the software uses feature selection to automatically locate the variable with the best power for explaining your classification; methods for prediction, whereby the software offers options such as multiple linear regression, ensembles of regression trees, and neural networks; and finally affinity analysis, which uses market basket analysis and a system of recommendations with specific rules (Frontline Systems, 2016).
Also the Excel Add-In ArcGIS Maps was used to give a geographic representation of the
results generated.
7. Findings
7.1 Association Between Variables
Before proceeding with various models and techniques, it was important to firstly understand
the variables that we had at our disposal and the relationships they had with each other and
the target output variable.
Within our data we had two types of variables: categorical and continuous. With categorical data, the numbers simply distinguish between different groups, whereas with continuous data the value of the number is a measure of magnitude.
With different types of data, finding relationships between variables can be difficult. Analysing the continuous variables at our disposal was the more straightforward task. However, there was a complication in that the data for these variables failed the normality test in Minitab; therefore it was decided to use Spearman's rho. A Spearman's rank correlation matrix was created in Minitab based on the five variables created from the Sessions dataset. The variables were transformed using the rank function in Minitab, producing the following correlation matrix.
                    Time Spent  Mean Time  Sd Time  No. of Actions  Important Actions
Time Spent              1
Mean Time               0.59       1
Sd Time                 0.72       0.94       1
No. of Actions          0.80       0.09       0.25       1
Important Actions       0.24       0.27       0.26       0.11            1
Table 7.1.1: Correlation Table
With a p-value less than the significance level of 0.05, the correlations are statistically
significant. In most cases there is a medium to strong relationship between the variables,
indicating the variables should be a good foundation for a model.
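The rank-then-correlate approach used in Minitab can be sketched in a few lines of Python: Spearman's rho is the Pearson correlation of the ranks. The data below is hypothetical and assumes no tied values (ties would need averaged ranks).

```python
import numpy as np

# Spearman's rho: rank both variables, then take the Pearson correlation of
# the ranks. Assumes no tied values; the data is hypothetical.
def spearman_rho(x, y):
    rank_x = np.argsort(np.argsort(x))  # 0-based ranks; the offset cancels out
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

total_time = np.array([10.0, 50.0, 30.0, 80.0, 60.0])
n_actions = np.array([1, 5, 2, 9, 6])
rho = spearman_rho(total_time, n_actions)
```

Because the two toy variables happen to be perfectly monotonically related, rho comes out as 1 here; the real session variables showed the medium-to-strong values in Table 7.1.1.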
Figure 7.1.2: Summary Table
In terms of determining the relationship between the categorical variables, a chi-squared test of independence was conducted.
The data must first be summarised; for example, a count is taken of each gender type (4 levels) against whether or not a booking was made. For the purpose of this analysis, a new categorical variable, "Booking", with 2 levels, was created to simplify the analysis.
A chi-squared test of independence was run in Minitab to test the association between the two categorical variables, gender and booking.
For this example, a significant interaction was found (χ²(3) = 4281, p < 0.05), thus rejecting the null hypothesis of independence between the variables. A similar approach was taken for analysing the remaining categorical variables, with results similar to the example, indicating dependence between the categorical variables and the output variable "Booking".
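The mechanics of the test can be sketched by hand: build the expected counts from the row and column totals, sum the squared deviations, and compare against the critical value. The counts below are hypothetical (the real table gave χ²(3) = 4281).

```python
import numpy as np

# Chi-squared test of independence on a hypothetical gender-by-booking table.
observed = np.array([
    [120,  80],   # female:    booked / no booking
    [100, 100],   # male
    [ 40,  60],   # -unknown-
    [ 10,  10],   # other
])
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
# Critical value for df = (4 - 1) * (2 - 1) = 3 at alpha = 0.05 is 7.815.
reject_independence = chi2 > 7.815
```

Exceeding the critical value rejects independence, which is the conclusion reached for gender and booking in the report.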
7.2 The Earlier Models
The techniques we used in XLMiner were Multiple Linear Regression and Neural Networks. However, there were certain limitations to XLMiner that we had not anticipated. The biggest was the restriction on training data, which allows only 10,000 rows for training. Unfortunately, that was far too small to train our extensive data. This meant XLMiner was no longer a viable option for our analysis, and it was shelved in the end.
The techniques we used in RStudio were plentiful in this project. We felt RStudio was a major factor in completing the project, as it offered a range of techniques we could use to understand and solve our problems. Some of our ideas fell flat, while others succeeded more than expected. When it came to the machine learning aspect of our project, we thought the use of decision trees might allow for a relatively accurate result. We decided to try two different types of decision tree to determine which would give us the more accurate answer. First we attempted to use the party package. With this technique we created our own model for the decision tree, using the variables we felt would best predict the customers' destinations. Unfortunately, the accuracy of this particular decision tree was very weak, yielding only around 30% accuracy. Although we tried several different variable combinations, the accuracy of our answers did not improve by much.
Figure 7.2.1: Code for the ‘Party’ Package Decision Tree
Next we tried the rpart package. This method involved more detail in the code than the party option, and yielded a much more accurate result of 73% accuracy. The more detailed training is outlined below: a control element is added to the decision tree, setting various parameters of the rpart fit, which allows for improved accuracy in the results.
Figure 7.2.2: Code for the ‘RPart’ Package Decision Tree
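The core move a tree makes, choosing the split that best separates the classes, can be illustrated with a toy one-split "stump" in plain Python. The data and two-class setup below are hypothetical; the actual rpart model used many variables and destinations.

```python
# Toy illustration of the splitting idea behind a decision tree: pick the single
# age threshold that best separates two (hypothetical) destinations.
ages = [22, 25, 30, 45, 50, 55]
labels = ["NDF", "NDF", "NDF", "US", "US", "US"]

def stump_accuracy(threshold):
    # Predict "NDF" below the threshold and "US" at or above it.
    preds = ["NDF" if a < threshold else "US" for a in ages]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

best_threshold = max(ages, key=stump_accuracy)
best_accuracy = stump_accuracy(best_threshold)
```

A full tree recursively repeats this search on each resulting subset, which is what the rpart control parameters govern (depth, minimum node size, and so on).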
7.3 XGBoost
The XGBoost package available in R, which stands for Extreme Gradient Boosting, is a machine learning algorithm used for supervised learning problems. The idea of supervised learning is to use training data xi to predict a target variable yi. The model used in xgboost is tree ensembles: a set of classification and regression trees (CART), where each output variable is classified into different leaves depending on the inputs. CART differs from decision trees in that it gives a score for each leaf within the tree. Similar to random forests, the prediction scores for each tree are combined to give an overall prediction. The main difference between tree ensembles and random forests is in the way the model is trained: training involves determining an objective function and then optimising it, with tree ensembles using an additive training approach.
7.4 The Final Model
The xgboost and associated packages that are required for the prediction model are installed
and loaded on R.
Figure 7.4.1: Loading the Packages in RStudio
The country destination column of the training set is assigned to its own data frame called
labels.
Figure 7.4.2: Assigning the Destination Variable
Then in order to run the xgboost the destination column must be removed from the training
set.
Figure 7.4.3: Eliminating the Destination Column
The next step involves assigning a numeric value for each country destination.
Figure 7.4.4: Assigning the Destination
Once this is complete, training of the model can begin. There are many different parameters involved in the xgboost function. Many of the values given for each parameter are either defaults or are widely used and accepted. Some parameters that influence the output of the model and are worth mentioning are:
Eta (default = 0.3) is used to prevent overfitting, whereby it shrinks the weights at each step, giving a more conservative boosting process.
Max_depth indicates the maximum depth of a tree; the higher the value, the more complex the model becomes.
Subsample is the ratio of training instances, i.e. the ratio of the data sampled to grow each tree. It is also used to prevent overfitting.
Figure 7.4.5: XGBoost Parameters
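For readers without the screenshot, a parameter list of this general shape might look as follows (the values here are illustrative defaults, not the tuned values from our final model):

```r
# Illustrative xgboost parameter list (values are examples, not the
# tuned settings used in the report).
params <- list(
  objective = "multi:softprob",  # output a probability per class
  num_class = 12,                # 11 destinations plus "NDF"
  eta       = 0.3,               # shrinkage; guards against overfitting
  max_depth = 6,                 # deeper trees = more complex model
  subsample = 0.5                # fraction of rows sampled per tree
)

# Training would then proceed along the lines of:
# model <- xgboost::xgb.train(params, dtrain, nrounds = 25)
print(params$objective)
```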
As with many machine learning algorithms, a major consideration in understanding the
accuracy of the output is the fit of the model. A model that is underfit fails to identify
relationships between the variables and the target output variable. On the other hand, a
model is overfit when the relationships it learns are too specific to the training data set and
cannot be generalised to the wider population. Both cases lead to poor predictive
accuracy. The parameter tuning within the xgboost model was performed to find a balanced
model that identifies genuine relationships and can be applied to the wider population.
Figure 7.4.6: Underfitting and Overfitting
The next step involves using the model created to classify the test data. This is done using the
predict function in R.
Figure 7.4.7: Predicting the Destinations using the XGBoost Model
The final steps relate to organising the results generated into a clear and manageable format.
For the purpose of the Kaggle competition, the top 5 most likely destinations for each user
were specified in descending order of likelihood.
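The reshaping-and-ranking step can be sketched as follows (a toy probability matrix with invented values stands in for the output of predict(); in the real script the flat probability vector is reshaped so each row is a user and each column a destination):

```r
# Toy probability matrix: 2 users x 4 destinations (invented values).
probs <- matrix(c(0.10, 0.50, 0.30, 0.10,
                  0.05, 0.15, 0.60, 0.20),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("US", "FR", "IT", "NDF")))

# For each user, order the destinations by descending probability and
# keep the most likely ones (top 5 in the Kaggle submission; top 2 here,
# since this toy example only has 4 classes).
top_k <- t(apply(probs, 1, function(p) names(sort(p, decreasing = TRUE))[1:2]))
print(top_k)
```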
Figure 7.4.8: Generating the Output Excel Files
7.5 Testing the Variables
Once we had found the model that seemed to yield the best accuracy for our predictions, we
proceeded to look at which factors were hurting our results. We examined the evaluation
metric closely and experimented with a few different ideas. The competition outline on
Kaggle suggested the use of the NDCG (Normalised Discounted Cumulative Gain) metric.
Figure 7.5.1: Mathematical Formula for Normalised Discounted Cumulative Gain
This metric measures the performance of a recommendation system whose outputs are
ranked in order of relevance. The resulting values range from 0.0 to 1.0, with higher values
indicating better rankings. The metric is widely used in evaluating the performance of web
search engines such as Google. (Solera, F., 2015)
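For this task the metric simplifies considerably, since each user has exactly one relevant destination, so the ideal DCG is 1. A compact sketch of NDCG@k under that assumption (the function and variable names are our own, not from the competition code):

```r
# NDCG@k when each user has exactly one relevant destination, so the
# ideal DCG equals 1 (relevant item ranked first).
# 'ranked' is a character vector of predicted destinations, best first;
# 'truth' is the user's actual destination.
ndcg_at_k <- function(ranked, truth, k = 5) {
  pos <- match(truth, ranked[seq_len(min(k, length(ranked)))])
  if (is.na(pos)) return(0)       # relevant destination not in top k
  (2^1 - 1) / log2(pos + 1)       # standard DCG term for relevance 1
}

ndcg_at_k(c("US", "FR", "IT"), "US")  # truth ranked 1st -> 1
ndcg_at_k(c("FR", "US", "IT"), "US")  # truth ranked 2nd -> 1/log2(3)
ndcg_at_k(c("FR", "IT", "ES"), "US")  # not in top k -> 0
```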
Since this was the recommended metric, it was natural that we would attempt to use it. Our
only issue was building the formula. We did succeed in writing a function, but ran into an
error in RStudio that we were ultimately unable to solve: there were issues with the size of
certain objects and their incompatibility with other objects. As a result, we decided to move
forward with evaluation metrics that did not require us to write the function ourselves, which
reduced the risk of error.
Figure 7.5.2: R Code to Generate the NDCG Evaluation Metric
Another metric we tried was RMSE (root mean square error), the default evaluation metric
for regression problems in xgboost. Mean square error measures the closeness of a fitted
line to its data points; root mean square error is simply its square root, which puts the error
back on the scale of the data. As statistics go, it is one of the more easily understood.
However, it did not yield a decent accuracy on our classification task, and we felt we could
do better. (Vernier, 2016)
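The calculation itself is easy to state in R (the actual and predicted values below are invented toy numbers):

```r
# Root mean square error on a toy set of predictions (invented numbers).
actual    <- c(1, 2, 3, 4)
predicted <- c(1.1, 1.9, 3.2, 3.8)

mse  <- mean((actual - predicted)^2)  # average squared residual
rmse <- sqrt(mse)                     # back on the scale of the data
print(rmse)
```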
Finally, we looked into using multiclass log loss, or mlogloss as it is referred to in R. This
metric was widely discussed on xgboost forums and blogs, with some describing it as the
best metric for this kind of problem. For each observation, the model predicts a probability
for each of the classes. Mlogloss is then the negative log-likelihood of the specified model,
under the assumption that each observation in the test set is drawn independently from a
distribution that assigns a probability to each class. (Kaggle, 2016)
Figure 7.5.3: Mathematical Formula for Multiclass Log Loss
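A small worked sketch of the metric (the probabilities and class labels are invented toy values):

```r
# Multiclass log loss on a toy example (invented probabilities).
# 'probs' holds one row per observation, one column per class;
# 'truth' gives the index of the correct class for each observation.
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.1, 0.8),
                nrow = 2, byrow = TRUE)
truth <- c(1, 3)

# mlogloss = -(1/N) * sum over observations of log(p of true class);
# probabilities are clipped away from 0 to avoid log(0).
p_true   <- pmax(probs[cbind(seq_along(truth), truth)], 1e-15)
mlogloss <- -mean(log(p_true))
print(mlogloss)
```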
For our final attempt at improving the model, we looked at the different variables we were
using. Ideally we wanted to reduce the model to as few variables as possible, because we
were certain a few of the variables contributed nothing to our predictions, or were actively
making our prediction accuracy worse. We began by altering our datasets and deleting
columns we felt were not necessary for the analysis, starting with the affiliate and device
variables, which we suspected were the least informative. Removing them did change the
results, but for the worse, so we then tried removing the variables we thought were hurting
accuracy. In the end we discovered that all of our variables actually matter in the model:
surprisingly, every detail the model can get on a customer helps shape the prediction of
their next destination.
7.6 The Results
To convey the results generated, we make use of the country_pred data frame created using
the predict function in R. This data frame contains a probability for each destination for
each user, as seen in the example figure below.
Figure 7.6.1: Probability of a Random User Booking in Each Destination
In the table below we have outlined how many people we anticipate will visit each location
as their first booking destination, or in most cases no destination at all. We used an expected
value method to calculate the number of people travelling to each of the 11 destinations or
to no destination:
EUB(x) = Σu Pu(x)
where EUB(x) is the Expected User Bookings for country destination x, and Pu(x) is user u's
predicted probability of booking destination x, summed over all users. This formula was
executed through Excel.
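The same expected-value calculation can be expressed directly in R rather than Excel (a toy stand-in for country_pred with invented probabilities):

```r
# Toy version of country_pred: 3 users x 3 destinations (invented values).
country_pred <- matrix(c(0.6, 0.3, 0.1,
                         0.2, 0.5, 0.3,
                         0.7, 0.1, 0.2),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(NULL, c("US", "FR", "NDF")))

# Expected user bookings per destination: sum each destination's
# probability over all users.
eub <- colSums(country_pred)
print(eub)
```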
Country Expected User Bookings
Australia 552
Canada 802
Germany 662
Spain 883
France 1359
Great Britain 951
Italy 1041
Netherlands 723
Portugal 514
United States of America 13885
Other 3008
No Booking 37716
Table 7.6.2: The Expected Number of Airbnb Users Travelling to Each Destination
To get a clearer picture of the results for those Airbnb users for whom we were able to
predict a destination, we designed the following graphic. As neither 'other' nor NDF can be
represented geographically, they were omitted. The size of the orange dot on each location
is proportional to that destination's share of the users with a definite location. It is clear from
the graphic that the USA attracted the largest percentage of users, which makes sense given
that all the users are from the USA. Australia and Portugal show much smaller dots, as they
were the least popular travel destinations.
Figure 7.6.3: Map of Results
So how accurate was our analysis overall? When it came to checking the accuracy of the
model we had built, we were fortunate to have Kaggle do it for us. Once we had created our
dataset of top 5 predicted countries per user, we simply uploaded it to Kaggle and received a
result showing how accurate our findings actually were. In the end, after much editing of the
model, we arrived at a final accuracy of 87.248% for our predictions.
Figure 7.6.4: The Evaluated Accuracy of the Final Model via Kaggle
8. Conclusion
From the outset of this project our main aim was to predict the destination of the next Airbnb
user's booking. Although we aimed to discover a few other items of information along the
way, that has always been the main objective. From the beginning we clearly understood the
limitations we would face with a project such as this. Predicting people's travel patterns was
always going to be difficult, since we cannot possibly know everything a person considers
when booking a holiday. We cannot predict world disasters, people's personal aversions to
certain locations, or how fickle any one person can be when making a decision like this.
However, we feel we overcame those limitations throughout the project and obtained a result
that, while not one hundred percent accurate, stands up well: no model is ever perfect. Our
final model compares favourably with the winner of the Kaggle competition, whose model
was 88.697% accurate. In the end, our model was strong and our results showed that; even
though we did not achieve a perfect result, we got one that was close, given the information
available for these kinds of decisions.
9. Appendices
Referencing
Airbnb (2016) About us. Available at: https://www.airbnb.com/about-us (Accessed: 27 May
2016)
Amazon Web Services. Model Fit: Underfitting vs. Overfitting.
http://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-
overfitting.html (Accessed 19 June 2016)
Authorimanuel, T. (2013) Top 20 predictive Analytics Freeware software. Available at:
http://www.predictiveanalyticstoday.com/top-predictive-analytics-freeware-software/
(Accessed: 31 March 2016)
Finlay, S. (2014) Predictive Analytics, Data Mining and Big Data Myths, Misconceptions
and Methods
Graphics with ggplot2
http://www.statmethods.net/advgraphs/ggplot2.html
He, T. (2016) An Introduction to XGBoost R package, Available at:
http://dmlc.ml/rstats/2016/03/10/xgboost.html (Accessed 14 June 2016)
Minitab Inc. (2016) Methods for analyzing time series. Available at: http://support.minitab.com/en-
us/minitab/17/topic-library/modeling-statistics/time-series/basics/methods-for-analyzing-
time-series/ (Accessed: 31 March 2016)
Introduction to Boosted Trees, Available at:
http://xgboost.readthedocs.io/en/latest/model.html (Accessed 14 June 2016)
Introduction to dplyr (2015)
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html (Accessed 20 June
2016)
Jain, A. (2016) Complete Guide to Parameter Tuning in XGBoost. Available at:
http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-
with-codes-python/ (Accessed 19 June 2016)
Kaggle (2016) Multi Class Log Loss. Available at:
https://www.kaggle.com/wiki/MultiClassLogLoss (Accessed: 16 June 2016)
Parise, S., Iyer, B. and Vesset, D. (2015) Four strategies to capture and create value from big
data. Available at : http://iveybusinessjournal.com/publication/four-strategies-to-capture-and-
create-value-from-big-data/ (Accessed: 14 June 2016)
R-bloggers, An Introduction to XGBoost R package. Available at:
http://www.r-bloggers.com/an-introduction-to-xgboost-r-package/ (Accessed: 14 June 2016)
PennState, Performing a Chi-Square Test of Independence from Summarized Data in
Minitab
https://onlinecourses.science.psu.edu/stat500/node/181 (Accessed 20 June 2016)
Solera, F. (2015) Normalized Discounted Cumulative Gain. Available at:
https://www.kaggle.com/wiki/NormalizedDiscountedCumulativeGain (Accessed: 16 June
2016)
Vernier Software & Technology (2016) What are mean squared error and root mean squared
error? Available at: http://www.vernier.com/til/1014/ (Accessed: 19 June 2016)
XGBoost R Tutorial, Available at:
http://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html (Accessed 14
June 2016)
EXTRA EARLIER WORKINGS
The following section contains our initial understanding of the information provided and the
situation we were dealing with. We used RStudio to produce a series of graphs to gain a
better insight into what is happening with some of our more major variables.
We started by looking at the sign-up methods, i.e. through Airbnb directly (basic), Facebook,
or Google+. From this initial look at the data it is clear that most users sign up either directly
via the website or through Facebook; Google+ sign-ups are almost non-existent at the
moment.
Following on from that, we looked at the different countries and how many people in our
dataset visited them. A quick glance at the histogram shows that the US is the most popular
destination, which makes sense: since all our customers are from the US, you would
naturally expect the most popular holiday destination to be the US itself. Outside of the US,
the most popular destinations are 'other', France, Italy and Spain; the least popular are
Portugal, Australia and the Netherlands.
Finally, we looked at the first device type used by the customers. The most popular were the
Mac desktop and the Windows desktop; smartphones other than Android or iPhone, along
with Android tablets, were the least popular.
These graphics were created from one of our earlier versions of the datasets. We went
through a few versions before we were happy with them, creating different graphics along
the way to understand the data. We have kept these graphics here as they represent earlier
work on earlier datasets.