Booking cancellations have a substantial impact on demand management decisions in the hospitality industry. Cancellations limit the ability to make accurate forecasts, which are a critical tool for revenue management. To overcome the problems caused by booking cancellations, hotels implement rigid cancellation policies and overbooking strategies, which can themselves have a negative influence on revenue and goodwill. Using data from the world’s leading chain of hotels, homes and spaces, and treating booking cancellation prediction as a classification problem, we try to predict with higher accuracy whether a booking will be cancelled. Supervised machine learning techniques are used to build a viable model for predicting hotel cancellations, which in turn allows the organization to review its cancellation policies and overbooking strategies. This is expected to have a positive impact on the revenue and profitability of the business.
1. Capstone Project on Hotel Cancellation
Group No. 3 | Batch: Apr 2020 | Location: Hyderabad
Bhavik Doshi || K. Sailesh Kumar || Lakshmi Aparna Chirravuri || Mampi Bera || Shubham Baheti
2. Agenda
• Problem Statement & Objective
• Data Description
• Exploratory Data Analysis
• Models Evaluation
• Best Model & Important Variables
• Recommendations
• Question & Answers
3. Problem Statement
Hotel Booking Cancellation Impact
• High Distribution Cost
• Irregular Cash Flow
• Unpredictable Occupancy Rate
• High Opportunity Cost
• Low RevPAR / Loss of Income
5. Data Description
• Data for 119,390 bookings
• 25 variables (11 categorical columns and 14 numerical columns)
• Target variable: ‘Is canceled’
• Period 2018-2020; includes the period of the Covid pandemic, which may not be reflective of regular times
6. Exploratory Data Analysis (Slide 1 of 4)
Is cancelled Mix
• Balanced dataset (37% of bookings got cancelled)
Top 10 Country of Origin
• Majority of bookings are from PRT (Portugal)
• ~57% of the bookings from Portugal were cancelled
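The class mix and the per-country cancellation share above are simple grouped proportions. A minimal sketch of that calculation follows, using only the standard library; the toy records and the function name `cancellation_rates` are illustrative assumptions, not the project's data or code.

```python
from collections import defaultdict

# Hypothetical toy records: (country ISO code, is_cancelled flag).
bookings = [
    ("PRT", 1), ("PRT", 1), ("PRT", 0), ("PRT", 1),
    ("GBR", 0), ("GBR", 0), ("GBR", 1),
    ("FRA", 0), ("FRA", 0), ("ESP", 1),
]


def cancellation_rates(records):
    """Return (overall share cancelled, {country: (bookings, share cancelled)})."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for country, is_cancelled in records:
        totals[country] += 1
        cancelled[country] += is_cancelled
    overall = sum(cancelled.values()) / len(records)
    by_country = {c: (totals[c], cancelled[c] / totals[c]) for c in totals}
    return overall, by_country


overall, by_country = cancellation_rates(bookings)
print(f"overall cancellation share: {overall:.0%}")
# Countries ordered by booking volume, as in a "top countries" chart.
for country, (n, rate) in sorted(by_country.items(), key=lambda kv: -kv[1][0]):
    print(f"{country}: {n} bookings, {rate:.0%} cancelled")
```

The same grouping could be applied to any categorical column, such as deposit type or market segment.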
7. Exploratory Data Analysis (Slide 2 of 4)
Deposit Type
• ~99% of “Non-refundable deposit” bookings got cancelled
• This suggests that the deposit amount is significantly lower than the room rent
No of Special Requests
• Every second booking with no special requests got cancelled
• Approximately one out of five bookings with 1 or 2 special requests got cancelled
8. Exploratory Data Analysis (Slide 3 of 4)
Room Adherence
• Room adherence is not a major factor behind cancellation, as only 802 bookings were cancelled out of the 14,917 cases where room adherence is false
Arrival Year Vs Lead Time
• As lead time increases, cancellations go up
• Median lead time is ~120 days for cancelled bookings, against ~40 days for bookings that were not cancelled
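Lead time here is taken to be the number of days between the booking date and the arrival date, which is an assumption based on the BOOKING_DATE and ARRIVAL_DATE variables in the appendix. The sketch below derives it for a few hypothetical bookings and compares the medians by cancellation status; the dates are invented so that the medians land on the ~120 and ~40 days quoted above.

```python
from datetime import date
from statistics import median

# Hypothetical toy records: (booking_date, arrival_date, is_cancelled).
bookings = [
    (date(2019, 1, 1), date(2019, 5, 1), 1),    # 120 days
    (date(2019, 3, 1), date(2019, 8, 28), 1),   # 180 days
    (date(2019, 6, 1), date(2019, 6, 21), 1),   # 20 days
    (date(2019, 2, 1), date(2019, 3, 13), 0),   # 40 days
    (date(2019, 7, 1), date(2019, 7, 11), 0),   # 10 days
    (date(2019, 9, 1), date(2019, 11, 30), 0),  # 90 days
]


def median_lead_time(records):
    """Median lead time in days, keyed by the is_cancelled flag."""
    lead_times = {0: [], 1: []}
    for booked, arrival, is_cancelled in records:
        lead_times[is_cancelled].append((arrival - booked).days)
    return {flag: median(days) for flag, days in lead_times.items()}


medians = median_lead_time(bookings)
print(f"cancelled: {medians[1]} days, not cancelled: {medians[0]} days")
```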
9. Exploratory Data Analysis (Slide 4 of 4)
Market Segment
• Maximum bookings: market segment “Offline & Online Travel Agents” (#80,696), followed by “Groups & Direct” (#32,417)
• Highest cancellation: 61%, market segment “Group”
10. Model Evaluation (Slide 1 of 7)
Decision Tree
Criterion | Max Depth | Min Samples Leaf | Min Samples Split | Dataset | Accuracy | Recall | Precision | F1_Score | ROC_AUC
Gini | 10 | 800 | 2400 | Train | 0.83 | 0.72 | 0.80 | 0.76 | 0.90
Gini | 10 | 800 | 2400 | Test | 0.82 | 0.72 | 0.79 | 0.75 | 0.90
Entropy | 10 | 800 | 2400 | Train | 0.83 | 0.72 | 0.80 | 0.76 | 0.91
Entropy | 10 | 800 | 2400 | Test | 0.83 | 0.72 | 0.79 | 0.75 | 0.90
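The slides do not show the modelling code. As a rough sketch, the Gini configuration in the table could be fitted and scored with scikit-learn as below. The synthetic data from `make_classification`, the 70/30 split and the helper `evaluate` are assumptions for illustration, so the printed numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data (assumption), ~37% positive class.
X, y = make_classification(n_samples=20000, n_features=20, n_informative=8,
                           weights=[0.63, 0.37], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameters taken from the Gini row of the Decision Tree table.
tree = DecisionTreeClassifier(criterion="gini", max_depth=10,
                              min_samples_leaf=800, min_samples_split=2400,
                              random_state=0)
tree.fit(X_train, y_train)


def evaluate(model, features, labels):
    """Compute the five metrics reported in the evaluation tables."""
    predicted = model.predict(features)
    probability = model.predict_proba(features)[:, 1]
    return {
        "accuracy": accuracy_score(labels, predicted),
        "recall": recall_score(labels, predicted),
        "precision": precision_score(labels, predicted),
        "f1": f1_score(labels, predicted),
        "roc_auc": roc_auc_score(labels, probability),
    }


train_metrics = evaluate(tree, X_train, y_train)
test_metrics = evaluate(tree, X_test, y_test)
for name in train_metrics:
    print(f"{name}: train={train_metrics[name]:.2f} test={test_metrics[name]:.2f}")
```

The large `min_samples_leaf` and `min_samples_split` values keep the tree shallow and coarse, which is consistent with the near-identical train and test scores in the table.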
11. Model Evaluation (Slide 2 of 7)
Random Forest
Max depth | Max features | Min sample leaf | Min sample split | N_Estimators | Dataset | Accuracy | Recall | Precision | F1 score | ROC_AUC
11 | 10 | 60 | 180 | 501 | Train | 0.85 | 0.73 | 0.85 | 0.79 | 0.93
11 | 10 | 60 | 180 | 501 | Test | 0.85 | 0.73 | 0.84 | 0.78 | 0.92
14 | 14 | 40 | 120 | 501 | Train | 0.86 | 0.76 | 0.85 | 0.80 | 0.94
14 | 14 | 40 | 120 | 501 | Test | 0.85 | 0.75 | 0.83 | 0.79 | 0.93
15 | 15 | 20 | 60 | 201 | Train | 0.87 | 0.77 | 0.86 | 0.81 | 0.95
15 | 15 | 20 | 60 | 201 | Test | 0.86 | 0.76 | 0.83 | 0.80 | 0.93
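The agenda lists "Important Variables", and a fitted random forest exposes impurity-based feature importances that can serve this purpose. The sketch below uses the hyperparameters from the third configuration in the table on synthetic data. The data and the placeholder names `feature_0`, `feature_1`, ... are assumptions, since the original feature matrix is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 20 numeric features, ~37% positive class.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=8,
                           weights=[0.63, 0.37], random_state=0)

# Hyperparameters from the third configuration in the Random Forest table.
forest = RandomForestClassifier(n_estimators=201, max_depth=15, max_features=15,
                                min_samples_leaf=20, min_samples_split=60,
                                n_jobs=-1, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one value per feature, normalised to sum to 1.
ranked = sorted(enumerate(forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for index, importance in ranked[:5]:
    print(f"feature_{index}: {importance:.3f}")
```

Impurity-based importances tend to favour high-cardinality features, so a ranking like this is best read as a starting point rather than a final answer.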
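12. Model Evaluation (Slide 3 of 7)
ANN
Iteration | Hidden Layer Size | max iter | solver | Activation | learning rate | tol | Dataset | Accuracy | Recall | Precision | F1_Score | ROC_AUC
1 | 10 | 100 | sgd | relu | constant | 0.01 | Train | 0.85 | 0.75 | 0.84 | 0.79 | 0.93
1 | 10 | 100 | sgd | relu | constant | 0.01 | Test | 0.84 | 0.74 | 0.82 | 0.78 | 0.92
2 | 10 | 100 | adam | tanh | adaptive | 0.01 | Train | 0.86 | 0.76 | 0.84 | 0.80 | 0.94
2 | 10 | 100 | adam | tanh | adaptive | 0.01 | Test | 0.85 | 0.75 | 0.82 | 0.78 | 0.93
3 | 100 | 500 | sgd | relu | constant | 0.001 | Train | 0.89 | 0.84 | 0.87 | 0.86 | 0.96
3 | 100 | 500 | sgd | relu | constant | 0.001 | Test | 0.85 | 0.79 | 0.81 | 0.80 | 0.93
4 | 100 | 1000 | adam | tanh | adaptive | 0.001 | Train | 0.94 | 0.93 | 0.91 | 0.92 | 0.99
4 | 100 | 1000 | adam | tanh | adaptive | 0.001 | Test | 0.84 | 0.80 | 0.78 | 0.79 | 0.92
5 | 100 | 1000 | sgd | relu | constant | 0.0001 | Train | 0.91 | 0.88 | 0.89 | 0.88 | 0.97
5 | 100 | 1000 | sgd | relu | constant | 0.0001 | Test | 0.85 | 0.80 | 0.80 | 0.80 | 0.93
6 | 100 | 1000 | adam | tanh | adaptive | 0.0001 | Train | 0.95 | 0.93 | 0.93 | 0.93 | 0.99
6 | 100 | 1000 | adam | tanh | adaptive | 0.0001 | Test | 0.84 | 0.79 | 0.79 | 0.79 | 0.92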
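In the ANN results, the larger networks score much higher on the training set than on the test set, which is a common sign of overfitting. The short sketch below makes the gap explicit using the F1 values from the table above; the 0.10 threshold is a hypothetical cut-off chosen for illustration, not a value from the slides.

```python
# (iteration, train F1, test F1) copied from the ANN table.
ann_f1 = [
    (1, 0.79, 0.78),
    (2, 0.80, 0.78),
    (3, 0.86, 0.80),
    (4, 0.92, 0.79),
    (5, 0.88, 0.80),
    (6, 0.93, 0.79),
]

GAP_THRESHOLD = 0.10  # hypothetical cut-off for flagging a run as overfitted


def overfit_report(rows, threshold=GAP_THRESHOLD):
    """Return (iteration, train-test F1 gap, flagged) for each run."""
    report = []
    for iteration, train_f1, test_f1 in rows:
        gap = round(train_f1 - test_f1, 2)
        report.append((iteration, gap, gap > threshold))
    return report


report = overfit_report(ann_f1)
flagged = [iteration for iteration, _, is_flagged in report if is_flagged]
for iteration, gap, is_flagged in report:
    print(f"iteration {iteration}: gap={gap:.2f} flagged={is_flagged}")
```

With this threshold, the two adam/tanh runs with 100 hidden units are flagged, while their test F1 is no better than that of the simpler runs.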
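13. Model Evaluation (Slide 4 of 7)
Logistic Regression
Solver | C | N jobs | Columns removed | Dataset | Accuracy | Recall | Precision | F1_Score | ROC_AUC
lbfgs | 1 | -1 | Base model | Train | 0.81 | 0.64 | 0.80 | 0.71 | 0.88
lbfgs | 1 | -1 | Base model | Test | 0.81 | 0.64 | 0.79 | 0.71 | 0.88
newton-cg | 1 | -1 | Drop high VIF country | Train | 0.80 | 0.56 | 0.84 | 0.67 | 0.85
newton-cg | 1 | -1 | Drop high VIF country | Test | 0.80 | 0.56 | 0.84 | 0.67 | 0.85
newton-cg | 1 | -1 | Drop high VIF col (except total days & assigned room) | Train | 0.80 | 0.56 | 0.85 | 0.67 | 0.85
newton-cg | 1 | -1 | Drop high VIF col (except total days & assigned room) | Test | 0.80 | 0.56 | 0.84 | 0.67 | 0.85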
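The "Columns removed" entries refer to the variance inflation factor (VIF), which for a given column equals 1 / (1 - R²), where R² comes from regressing that column on all the others. The slides do not say how VIF was computed or which threshold was used, so the NumPy sketch below is only one possible way to obtain it; the function `vif` and the toy columns are assumptions.

```python
import numpy as np


def vif(X):
    """VIF of each column: 1 / (1 - R^2), regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        # Design matrix: intercept plus every column except column j.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)             # independent of a
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a, so collinear with it
scores = vif(np.column_stack([a, b, c]))
print(np.round(scores, 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as high. In this toy data, columns `a` and `c` exceed that range, while `b` stays close to 1.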
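14. Model Evaluation (Slide 5 of 7)
Linear Discriminant Analysis
Solver | C | N jobs | Columns removed | Dataset | Accuracy | Recall | Precision | F1_Score | ROC_AUC
svd | 1 | -1 | Base model | Train | 0.77 | 0.83 | 0.65 | 0.73 | 0.88
svd | 1 | -1 | Base model | Test | 0.77 | 0.83 | 0.65 | 0.73 | 0.87
newton-cg | 1 | -1 | Drop high VIF country col names | Train | 0.77 | 0.43 | 0.90 | 0.59 | 0.83
newton-cg | 1 | -1 | Drop high VIF country col names | Test | 0.77 | 0.44 | 0.89 | 0.58 | 0.83
newton-cg | 1 | -1 | Drop high VIF col (except total days & assigned room) | Train | 0.77 | 0.43 | 0.90 | 0.58 | 0.83
newton-cg | 1 | -1 | Drop high VIF col (except total days & assigned room) | Test | 0.77 | 0.43 | 0.89 | 0.58 | 0.83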
15. Model Evaluation (Slide 6 of 7)
Naïve Bayes
priors | Var smoothing | Dataset | Accuracy | Recall | Precision | F1_Score | ROC_AUC
None | 1.00E-09 | Train | 0.51 | 0.95 | 0.43 | 0.59 | 0.76
None | 1.00E-09 | Test | 0.51 | 0.95 | 0.43 | 0.59 | 0.76
[0.5,0.5] | 2.00E-09 | Train | 0.57 | 0.93 | 0.46 | 0.61 | 0.76
[0.5,0.5] | 2.00E-09 | Test | 0.57 | 0.93 | 0.46 | 0.61 | 0.76
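Naïve Bayes reaches a recall of 0.95 but an F1 score of only 0.59, because F1 is the harmonic mean of precision and recall and is pulled towards the lower of the two. The worked arithmetic below reproduces the F1 values from the test rows of three of the evaluation tables. The helper is written out for illustration and is not the project's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (model, test precision, test recall) taken from the evaluation tables.
rows = [
    ("Naive Bayes (priors=None)", 0.43, 0.95),
    ("LDA (base model)", 0.65, 0.83),
    ("Logistic Regression (base model)", 0.79, 0.64),
]

f1_by_model = {name: round(f1_score(p, r), 2) for name, p, r in rows}
for name, value in f1_by_model.items():
    print(f"{name}: F1 = {value:.2f}")
```

For Naïve Bayes this gives 2 × 0.43 × 0.95 / (0.43 + 0.95) ≈ 0.59. The model flags almost every booking as a cancellation, so it misses few cancellations but is wrong more often than right when it predicts one.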
17. Best Model Identification
Model | F1_Score
GBC | 0.8
Random Forest | 0.8
ANN | 0.78
CART | 0.75
Logistic Regression | 0.71
LDA | 0.69
Naive Bayes | 0.6
18. Random Forest vs GBC
[Figure: confusion matrices (Actual Value vs Predicted Value, classes 0 and 1) for RF Train, RF Test, GBC Train and GBC Test]
20. Recommendations & Insights (Slide 1 of 3)
Model identifies a ‘potential defaulter’
Minimize cancellation probability by:
• No ‘pay-later’ option
• Raise the booking price
• Indicate low availability of rooms and a high cancellation penalty
Model identifies a ‘trusted customer’
Avoid cancellations by:
• Stay in continuous contact with the customer
• Suggest a personalized itinerary
• Offer freebies to avail during their stay
Opt for ‘Overbooking’ strategy
De-risk last-minute cancellations with an overbooking strategy:
• Set an upper limit on confirmed bookings within available capacity
• For potential defaulters flagged by the model, continue to take additional bookings for a specified number of rooms above capacity
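One way to turn the overbooking recommendation into a number is to add up the model's predicted cancellation probabilities to get the expected count of cancellations, then cap the result at a fixed number of rooms. The slides do not specify a rule, so the function `overbooking_allowance`, the parameter `max_extra_rooms` and the probabilities below are hypothetical.

```python
import math


def overbooking_allowance(cancel_probabilities, max_extra_rooms):
    """Extra bookings to accept above capacity.

    Takes the expected number of cancellations (sum of the predicted
    probabilities), rounds it down, and caps it at max_extra_rooms,
    a hypothetical policy limit set by the hotel.
    """
    expected_cancellations = sum(cancel_probabilities)
    return min(math.floor(expected_cancellations), max_extra_rooms)


# Hypothetical predicted cancellation probabilities for ten confirmed bookings.
probabilities = [0.9, 0.8, 0.7, 0.1, 0.1, 0.05, 0.05, 0.6, 0.2, 0.3]

print(overbooking_allowance(probabilities, max_extra_rooms=5))
print(overbooking_allowance(probabilities, max_extra_rooms=2))
```

Rounding down and capping keeps the policy conservative: the expected value here is about 3.8 cancellations, so at most three extra bookings are taken, or fewer if the cap is lower.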
21. Recommendations & Insights (Slide 2 of 3)
Institute a policy of taking ‘Deposit/Advance’
Minimize revenue loss by taking deposit / advance:
• For Individual / Direct bookings – 100% / 50%
• Non-refundable bookings for prime locations
• Offer credit for Corporate bookings and charge a % penalty for cancellation
Specific focus on ‘Bookings from Portugal’
Specific measures for bookings from Portugal:
• Tie up with airline partners and offer vouchers for booking flight tickets
• Consider these bookings under the ‘overbooking’ strategy to minimize the chance of unoccupied rooms as a result of cancellations
Offer Room upgrade
Delight the customer by offering a room upgrade:
• At the same or a slightly higher price
• With add-on complimentary services
• Offer an additional day of stay for free or as a package
22. Recommendations & Insights (Slide 3 of 3)
Seek ‘Special Requests’
Increase the customer’s probability of staying by asking for ‘special services’:
• Provide preference at the time of booking
• Offer complimentary services for trusted and repeat customers
Focus on OTA-wise cancellation rate
Plan to minimize OTA cancellations:
• Do not promote ‘book now and pay later’
• Restrict ‘free cancellations’
• Pay a lower commission to OTAs with higher cancellations
25. Appendix
Variables
• HOTEL: The type of the hotel booked by the customer
• IS_CANCELLED: Value that indicates if the booking has been cancelled.
‘1’ indicates Cancelled and ‘0’ indicates Not Cancelled.
This is the target variable.
• BOOKING_DATE: Date on which the booking was made.
• ARRIVAL_DATE: Arrival date indicated by the customer at the time of booking.
• STAY IN WEEKEND NIGHTS: Number of weekend nights the customer has planned to stay or has stayed.
• STAY IN WEEKNIGHTS: Number of weeknights (Monday to Friday) the customer has planned to stay or has stayed.
• ADULT: Number of customers more than 18 years of age.
• CHILDREN: Number of customers less than 18 years of age.
• MEAL: Type of meal booked by the customer
1)No meal 2) Only Breakfast 3) Breakfast-dinner 4) Breakfast-lunch-dinner
• COUNTRY: ISO Codes of the country the customer belongs to.
• MARKET SEGMENT: The market segment to which the customer’s booking belongs
1) Aviation 2) Groups 3) Complementary 4) Corporate 5) Direct 6) Online 7) Online Travel Agent 8) Offline Travel Agent 9) Undefined.
• DISTRIBUTION CHANNEL: Type of distribution channel, through which the booking was done.
1) Direct 2) GDS 3) TA/TO 4) Corporate 5) Undefined.
• IS_REPEATED_CUSTOMER: Value indicating whether the booking was from a repeat customer (1) or not (0)
• PREVIOUS_CANCELLED: No. of previous bookings that were cancelled by customer prior to the current booking.
• PREVIOUS_NOT_CANCELLED: No. of previous bookings not cancelled by customer prior to the current booking.
• RESERVED_ROOM_TYPE: Code of room type reserved.
• ASSIGNED_ROOM_TYPE: Code for the type of room assigned to the booking.
• BOOKING_CHANGES: Number of changes/amendments made between the booking date and the check-in date or cancellation.
• DEPOSIT_TYPE: Indicates whether the customer made a deposit to guarantee the booking.
• AGENT: ID of travel agency that made the booking.
• COMPANY: ID of the company/entity that made the booking or responsible for paying the booking.
• DAYS_IN_WAITING_LIST: No of days the booking was in the waiting list before it was confirmed to the customer.
• CUSTOMER_TYPE: Type of booking. Transient: last-minute check-in; contract: associated with a contract; group: associated with a certain club or group; transient-party: transient and associated with another transient booking.
• REQUIRED_CAR_PARKING_SPACES: Number of car parking spaces required by the customer.
• TOTALNO_OF_SPECIAL_REQUESTS: No of special requests made by the customer (e.g., twin bed or high floor)
Of the above 25 variables, 11 are categorical/object and the remaining 14 are continuous (details are available in the Exploratory Data Analysis section).