My MSc dissertation, in which the optimal time to purchase flights from London to seven cities in Asia is predicted based solely on five weeks of pricing data, scraped from the web, prior to departure.
The analysis shows that, despite incomplete data, experimenting with different machine learning techniques makes it possible to predict optimal purchase times for flights to two cities, saving 7-8% on average expected air fares over this period.
These articles were published in CAMA under the following titles: 1. Accuracy of Forecasting Models (Coefficient of Determination vs. Signal Tracking); 2. Head-to-Head Analysis: A320 Family vs. B737NG (Value Analysis); 3. Forecasting by Objectives (Airport Forecasting).
Neural networks have found a wide variety of applications in today's world, driving the development of numerous models for financial markets and investment. This paper presents an approach to predicting share prices using an artificial neural network with a given set of stock market input parameters. Because the stock market is dynamic in nature, predicting share prices is very difficult with standard prediction or computation methods. The main reason is that there is no linear relationship between market parameters and the target closing price. Since input patterns and the corresponding output patterns are not linearly related, a neural network is a natural choice for share market prediction.
Statistics And Probability Tutorial | Statistics And Probability for Data Sci... (Edureka!)
YouTube Link: https://youtu.be/XcLO4f1i4Yo
** Data Science Certification using R: https://www.edureka.co/data-science **
This session on Statistics And Probability will cover all the fundamentals of stats and probability along with a practical demonstration in the R language.
I completed this project as part of my internship capstone at Learnbay. The task was to predict the flight fares for multiple flights in India. Visualizing the data helped me find trends and correlations between the independent and target variables. In the following step, I developed a model to predict the price.
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY (IJDKP)
Flight delay has been a persistent problem for the world's aviation industry, so research into computer systems that predict flight delay propagation is highly significant. Extracting hidden information from large volumes of raw data is one way to build a predictive model. This paper describes the application of classification techniques to analyse flight delay patterns in an Egyptian airline's flight dataset. In this work, four decision tree classifiers were evaluated, and the results show that REPTree has the best accuracy (80.3%) compared with Random Forest, Decision Stump and J48. In addition, four rule-based classifiers were compared, and the results show that PART provides the best accuracy among the studied rule-based classifiers, at 83.1%. By analysing the running time of all the classifiers, the work concludes that REPTree is the most efficient classifier with respect to both accuracy and running time. The work is also extended to apply the Apriori association technique to extract important information about flight delay. Association rules are presented and the association technique is evaluated.
Business Decision Making Part Two, QNT275.docx (RAHUL126667)
Business Decision Making Part Two
QNT275
Descriptive statistics are used to present data sets in the form of meaningful summaries, which makes the important patterns in the data observable. Descriptive statistics may not be useful for drawing final conclusions about the data. They are mainly used to describe quantitative data, as they involve numerical calculations. The main descriptive statistics are measures of central tendency and measures of variability. Measures of central tendency express the central position of a data set; measures of variability, on the other hand, represent the spread of the values in a data set.
In the case of the American Airlines Group, the research involves both quantitative and qualitative data. The operational costs, which represent the dependent variable, can be understood by studying the operational changes that result from the merger. This includes quantitative data on the number of passengers with access to the airline's services. The descriptive statistics that could be used to summarize these data include the mean, mode, and median. The mean represents the average number of passengers using the airline's services over a given period of time, for example one day. The mode represents the most frequently recurring number in the data set: if data were collected over a period of one month, the number of passengers recorded on the most days would be the mode. The median, on the other hand, represents the central value after the data have been arranged in ascending or descending order. Descriptive statistics could also be used to summarize data on the financial capability of the merger, obtained from an audit of the airline's financial data. The measures of variability that could be used in this research include the range, variance, standard deviation, quartiles, and absolute deviation. These measures describe the consistency of the data by presenting its variability (Holcomb, 2017).
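As an illustration of these summary measures, the following sketch computes them with Python's standard `statistics` module on hypothetical daily passenger counts (the numbers are invented for the example):

```python
import statistics

# Hypothetical daily passenger counts for one week (illustrative only)
passengers = [1200, 1350, 1200, 1500, 1420, 1200, 1600]

mean = statistics.mean(passengers)       # central tendency: the average
median = statistics.median(passengers)   # middle value after sorting
mode = statistics.mode(passengers)       # most frequently occurring value
rng = max(passengers) - min(passengers)  # variability: the range
stdev = statistics.stdev(passengers)     # sample standard deviation

print(median, mode, rng)  # 1350 1200 400
```

Here 1,200 passengers is the mode because it occurs on three of the seven days, while the median is the fourth-smallest count.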
Inferential statistics involve making generalizations about a population using facts from a sample. These statistics are useful where the population under study is large; in that case, it is most feasible to select a small group to represent the population. The inferential statistics that can be used in the analysis of the data from the American Airlines Group merger include parameter estimation and hypothesis testing. Parameter estimation involves approximating population parameters using calculated sample statistics; for example, the population mean may be estimated by the sample mean. Because it is economical to study only a small group, parameter estimation is highly useful for making inferences about the whole population (Bernstein & Bernstein, 2011).
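The idea of estimating a population mean from a sample can be sketched as follows, using invented passenger counts and a normal-approximation confidence interval:

```python
import math
import statistics

# Hypothetical sample of daily passenger counts drawn from a much
# larger population of operating days (illustrative numbers)
sample = [1310, 1285, 1402, 1350, 1297, 1368, 1330, 1412, 1289, 1357]

n = len(sample)
sample_mean = statistics.mean(sample)         # point estimate of the population mean
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# Approximate 95% confidence interval (normal approximation; a
# t-critical value would be more exact for a sample this small)
low, high = sample_mean - 1.96 * se, sample_mean + 1.96 * se
print(round(sample_mean, 1))  # 1340.0
```

The interval `[low, high]` quantifies how far the unknown population mean plausibly lies from the sample mean, which is the essence of parameter estimation.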
Hypothesis testing involves testing the accuracy of a claim about the population based on the sample under study. Othe ...
Artificial Intelligence and Stock Marketing (ijsrd.com)
Business intelligence is becoming an important trend in the financial world. One such area is stock market intelligence, which makes use of data mining techniques such as association, clustering, artificial neural networks, decision trees, genetic algorithms, expert systems and fuzzy logic. These techniques can be used to predict stock prices or trading signals automatically with adequate accuracy. Although a great deal of research has been done in this area, many issues have not yet been explored, and it is not clear to new researchers where and how to begin. Data mining can be applied to past and present financial data to generate patterns and decision-making systems. This paper gives a brief overview of several attempts made by researchers at stock prediction, focusing on stock market analysis, and defines a new research area, referred to as stock market intelligence: developing data mining techniques to support all aspects of algorithmic trading, and suggesting a number of research problems in stock intelligence related to forecasting and its accuracy.
Benchmarking data mining approaches for traveler segmentation (IJECEIAES)
The purpose of this study is to propose a hybrid data mining solution for traveler segmentation in the tourism domain, which can be used for planning user-oriented trips, arranging travel campaigns or similar services. The data set used in this work was provided by a travel agency and contains travelers' flight and hotel bookings. Initially, the data set was prepared for running data mining algorithms. Then, various machine learning algorithms were benchmarked on traveler segmentation and prediction tasks. Fuzzy C-means and X-means algorithms were applied for clustering the user data, and J48 and multilayer perceptron (MLP) algorithms were applied for classifying instances based on the segmented user data. According to the findings of this study, J48 yields the most effective classification results when applied to the data set clustered with the X-means algorithm. The proposed hybrid data mining solution can be used by travel agencies to plan trip campaigns for similar travelers.
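The cluster-then-classify idea can be illustrated with a deliberately tiny sketch: a crude two-centroid clustering stands in for X-means, and a nearest-centroid rule stands in for the J48/MLP classifiers (the one-dimensional spend values are invented):

```python
# Toy cluster-then-classify sketch on 1-D "total spend" values (illustrative only)
spend = [120.0, 130.0, 125.0, 900.0, 950.0, 880.0]

# Step 1: a crude two-centroid k-means assigns each traveler a segment
c1, c2 = min(spend), max(spend)
for _ in range(10):
    g1 = [x for x in spend if abs(x - c1) <= abs(x - c2)]
    g2 = [x for x in spend if abs(x - c1) > abs(x - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)

# Step 2: a nearest-centroid rule classifies new travelers into segments,
# standing in for the classifiers trained on the clustered data
def segment(x):
    return "budget" if abs(x - c1) <= abs(x - c2) else "premium"

print(round(c1), round(c2), segment(200.0))  # 125 910 budget
```

The real pipeline uses multidimensional booking features and proper X-means/J48 implementations, but the two-stage structure is the same: cluster labels from stage 1 become the training targets for stage 2.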
TOURISM DEMAND FORECASTING MODEL USING NEURAL NETWORK (ijcsit)
Travel agencies should be able to judge the market demand for tourism in order to develop sales plans accordingly. However, many travel agencies lack this ability and thus make risky business decisions. Accordingly, this study applied an artificial neural network combined with a genetic algorithm (GA) to establish a prediction model of air ticket sales revenue. The GA was used to determine the optimum number of input and hidden nodes of a feedforward neural network. The empirical results suggested that the mean absolute relative error (MARE) between the proposed hybrid model's predicted air ticket sales revenue and the actual value was 10.51%, with a correlation coefficient of 0.913. The proposed model had good predictive capability and could provide travel agency operators with reliable and highly efficient analysis data.
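The MARE figure reported above is simply the average of the absolute prediction errors relative to the actual values; a minimal sketch with invented revenue figures:

```python
# Mean absolute relative error (MARE), the accuracy measure the study reports
def mare(actual, predicted):
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical monthly revenue figures (not the paper's data)
actual    = [100.0, 120.0, 80.0, 150.0]
predicted = [ 90.0, 132.0, 88.0, 135.0]
print(round(mare(actual, predicted), 3))  # each prediction is off by 10%, so MARE is 0.1
```

A MARE of 10.51% therefore means the model's revenue predictions deviated from the actual figures by about a tenth of their value on average.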
COMPARISON OF BANKRUPTCY PREDICTION MODELS WITH PUBLIC RECORDS AND FIRMOGRAPHICS (cscpconf)
Many business operations and strategies rely on bankruptcy prediction. In this paper, we aim to study the impact of public records and firmographics, predicting bankruptcy over a 12-month-ahead horizon using different classification models and adding value to the traditionally used financial ratios. Univariate analysis shows the statistical association and significance of public records and firmographics indicators with bankruptcy. Further, seven statistical models and machine learning methods were developed, including logistic regression, decision tree, random forest, gradient boosting, support vector machine, Bayesian network, and neural network. The performance of the models was evaluated and compared based on classification accuracy, Type I error, Type II error, and ROC curves on the hold-out dataset. Moreover, an experiment was set up to show the importance of oversampling for rare-event prediction. The results also show that the Bayesian network is comparatively more robust than the other models without oversampling.
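Oversampling for rare events, as used in the experiment above, can be as simple as duplicating minority-class rows until the classes balance; a sketch of that idea (the paper's exact resampling scheme is not specified here, so this is an assumption for illustration):

```python
import random

# Duplicate randomly chosen minority-class rows until both classes
# have the same number of examples (random oversampling).
def oversample(rows, labels, minority=1, seed=0):
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_n = sum(1 for y in labels if y != minority)
    extra = [rng.choice(minority_rows)
             for _ in range(majority_n - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]  # 4 solvent firms, 1 bankrupt
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))  # 4 4
```

Without such balancing, a classifier can achieve high accuracy by always predicting "solvent", which is exactly the failure mode the oversampling experiment is designed to expose.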
Better efficiency of a country's air transport system at the national level, especially in terms of its capacity to generate value for passenger flow and cargo transport, effectively depends on identifying the demand-generation potential of each hub for this type of service. This requires mapping the passenger flow and cargo volume of each region served by the system, along with the number of connections. The main goal of this study was to identify important factors that account for the great variability (demand) of the regional hubs of the airport system in operation in the State of São Paulo, the most populated and industrialized state in the Southeast region of Brazil. For this purpose, datasets of passenger and cargo flows for each airport were obtained from time series data covering the period from January 1, 2008 to December 31, 2014. Different data analysis approaches can yield a better mapping of the flows in the air transport system by evaluating factors related to operations and volume. Therefore, different statistical models, such as multiple linear regression with normal errors and new stochastic volatility (SV) models, are introduced in this study to provide a better view of the operation of the four main regional hubs within the large group of 32 airports reported in the dataset.
Can we predict how Airline fares to cities in Asia will change closer to departure?
MSc Business Analytics: Dissertation
Departure area of London Heathrow Terminal 5 (Google images)
Name: Karim Awad
ABSTRACT
This paper examines airline prices departing from London Heathrow to seven cities in North and
South-East Asia. Over a five-week period prior to departure, we collect online pricing data on
fourteen successive one-week return trips and construct pricing curves to understand how airfares
evolve.
We then employ machine learning techniques, ranging from logistic regression through to a simple
neural network, to determine whether this pricing behaviour can be predicted, and to identify
opportune moments to purchase economy-class tickets for flights within our test set.
Using ensemble techniques, we are able to predict when to purchase peak and off-peak flights to
Bangkok and Kuala Lumpur with reasonable accuracy, generating savings of £56-70 relative to
average flight prices. This is despite facing a series of challenges connected to the quality of the
underlying data used.
INTRODUCTION
When is the optimal time to book a flight? Should a consumer book several weeks in advance, or wait
for a "last-minute deal"? This issue has long troubled consumers, spawning price comparison
websites that highlight the lowest price at a given moment without answering the wider question.
Underpinning this uncertainty, airlines and travel vendors employ dynamic pricing models to optimise
aircraft utilisation, discovering daily market-clearing prices that aim to maximise flight revenue
(Lantseva et al, 2015).
This often generates pricing volatility, which can be pronounced as the departure date draws closer,
and between pairs of flights that leave shortly before or after one another. Pricing can also be
affected by competition amongst airlines, the choice of seating class, seasonal factors, and whether
flights to a destination are direct or indirect (Etzioni et al, 2003).
Can we however identify patterns to airline pricing by examining how prices change prior to
departure? Are there non-parametric relationships that machine learning techniques can identify,
which are not apparent through structural analysis?
We examine long-haul flights from London to several cities in Asia: Beijing (BEI), Bangkok (BKK), Hong
Kong (HK), Kuala Lumpur (KL), Seoul (SEU), Singapore (SGP), and Tokyo (TKY). This involves looking at
one-week return flights over a two-week period between 23rd July and 6th August 2018, offering 14
flight pairs to evaluate for each city. We accumulate pricing data on each flight at least 5 weeks prior
to each departure, with data collected twice a day. This methodology is discussed further within our
data collection section.
With data on approximately 50 variables at each point in time (e.g. prices for different airlines, number
of stops, class of transport, different airports in the London region), this provides a potential dataset
of over 49,000 entries.
We partition this dataset into training, validation, and test sets, before applying machine learning
techniques. This involves methods that reduce dimensionality (e.g. Regularisation, Principal
Component Analysis (PCA), Random Forest, Support Vector Machines), and considering results from
algorithms that reduce residual error (e.g. AdaBoost) or better capture non-linearity (neural network).
This paper will undertake a brief literature review, before discussing data collection, processing, and
methodology considerations. We shall then illustrate the results of descriptive analysis, before
undertaking the machine learning approaches discussed above.
LITERATURE REVIEW
Etzioni et al (2003) serves as a useful starting point and has motivated several other articles. Intended
as a pilot study, it examined non-stop, one-week returns for two flight pairs: Los Angeles to Boston,
and Seattle to Washington DC. This was done for departures during January 2003, where prices were
tracked at 3-hour intervals, starting 21 days prior to departure, resulting in 12,000 price observations
over a 41-day period.
They proceed to use a variety of approaches, from a moving-average statistical approach, through to
a RIPPER classification algorithm, Q-learning, and a combination of these three results, using a stacking
generaliser, referred to as HAMLET. These algorithms seek to identify optimal points to purchase a
flight, introducing penalties subject to how near the departure date falls, and the cost of
misclassification (i.e. failing to buy a cheap ticket, and paying more later). Savings are measured
relative to the initial flight cost 21 days prior to departure.
Etzioni et al (2003) found that HAMLET performed best, generating total net savings of $198,074, with
its decisions being optimal 61.8% of the time, relative to having perfect knowledge. They believe this
accuracy level could be higher, as they employed a "…uniform distribution of passengers, (where) 33%
of the passengers arrived at most 7 days before the flight's departure, when savings are hard to come
by…" (p.7, Etzioni et al (2003)).
Groves and Gini (2011) expand on this work by using over 60 days of data prior to departure, applying
both lagged OLS and Partial Least Squares techniques to identify a subset of explanatory variables
from searches covering multiple departure dates for two domestic US flight pairs. This is contrasted
with a naïve approach (i.e. immediate purchase) relative to an optimal approach (perfect
information).
Their results are shown for business trips (Monday-Friday round trips) and low-cost trips (Thursday-
Tuesday round trips), with the aim of out-performing a naïve approach. Their methods confirm this,
albeit by virtue of the small number of buy signals their models generate. They do observe that pricing
volatility is more apparent on routes with less competition, premised on airlines there having more
pricing power.
Tziridis, Kalampokas, and Papakostas (2017) seek to apply a wide variety of machine learning
approaches, but only on a single-ticket flight-pair between Thessaloniki (Greece) and Stuttgart
(Germany) between December and July 2017. These approaches include Multilayer Perceptron,
Neural Networks, Regression Trees with bagging, and Regression Support Vector Machines, amongst
others, with 10-fold cross validation used to train these models.
Their analysis concludes that ordinary, bagging, and random forest regression trees, along with
multilayer perceptron techniques yield consistent results when omitting different features, with
accuracy levels > 80%. This seems surprisingly high when relying on pricing data alone, but may
highlight potential consistency on flights with low levels of flight traffic, and the value in collecting
historical data.
Intriguingly, their analysis also points to accuracy being marginally higher when excluding features,
which included “day of week” of departure, but lower when time of departure and arrival is excluded.
Although their analysis is not directly comparable to the above, it highlights how anecdotal factors
may not yield as much explanatory power; possibly favouring a Random Forest approach.
There are several other articles that seek to apply a wide range of machine learning techniques, both
supervised and unsupervised, to predicting flight prices, predominantly looking at different flight
pairs. This extends to Papadakis (2014), who focused on five significant US airport hubs; Lantseva et al
(2015), who looked at domestic and international pricing from Russian airports; and Lu (2018), who
looks at 8 flight pairs within Europe with trips to and from the same departing airport relative to the
recipient city. Their results all point to different machine learning techniques assisting with
prediction.
Lu’s (2018) article does seek to find a prediction technique that works across routes, and thus focuses
on the variance achieved by each method as a discriminating factor, along with examining Mean-
Squared Error (MSE). Furthermore, in contrast to other authors, he directly addresses imbalances in
the underlying dataset, given the small number of “buy” decisions likely to arise. This involves
performing K-Means and Expectation Maximisation cluster analysis to identify and remove outliers
between buy and wait decisions, along with over-sampling buy decisions in training his algorithms.
More generally, how do we measure the correct time to "buy" or "wait"? The articles above have
focused either on contrasting a starting price and predicting any reduction, on buying when the next
period is predicted to bring a price increase, or on measuring purchasing accuracy against perfect
foresight.
Boin et al (2017) wrote that airlines will need to continue unbundling elements of the travel
experience (e.g. bags on-board, meals, etc), such that additional revenues can be generated in future.
When examining Ryanair’s revenue model, Malighetti et al (2015) note that specific functional
parameters used to derive prices have likely already been determined, with their analysis revealing
prices follow a hyperbola as departure draws close.
This suggests there is a minimum average price, allowing for discounts, whilst sales near to departure
are aimed at boosting overall yields. Understanding historical average prices over a period prior to
departure, may thus help consumers avoid contributing to the airline’s supernormal profits.
DATA CONSIDERATIONS
Data considerations
Obtaining pricing data proved immensely difficult. There are no formal, public repositories of data
documenting prices prior to departure. Those that do exist either provide information only on US
outbound flights (www.faredetective.com), or charge for access to their airfare database
(www.atpco.com); the latter quoting $5,000 for academic purposes, when approached.
Price comparison websites were not initially helpful. Skyscanner, the UK market leader, refused to
provide API access. Other price comparison websites (e.g. www.farecompare.com,
www.expedia.co.uk, etc) only provided price alerts, which would be inadequate in assembling a
database. Informal sources (e.g. those who have collected pricing data) were available, but these were
cross-sectional in nature (i.e. prices for a vast range of flights at one moment in time only), and lacking
in structure suitable for academic study.
Using Selenium, prices were scraped from www.kayak.com, with results tabulated for each location
and departure date. This captured prices for all known airlines operating routes to the location, by
different UK airports, number of stops, and class of flight. Selenium ensured each search was
anonymised, preventing beacon cookies from detecting repeated searches and unduly inflating
prices¹, especially with ticket-price customisation likely (Boin et al, 2017).
Intermediaries (e.g. online travel vendors) typically had advertised prices below those shown by the
corresponding airline. It was unclear if this was attributable to preferential rates, or prices being
artificially depressed. To avoid any misrepresentation, these vendors were excluded.
A more troubling issue was an inability to segregate price changes based on seat purchases relative to
changes in competitor prices. Plane capacity is not disclosed by airlines, resulting in being unable to
track capacity changes and thus deduce own-price and cross-price elasticities.
However, a slight benefit was that the 14 flights studied fell during peak season (Lantseva et al, 2015).
It was also felt that price falls shortly before departure (e.g. one week beforehand) would reflect
under-utilised flights, with prices normally expected to rise (Groves and Gini, 2011). This may generate
more volatility, which we can attempt to measure through an F-test between these two periods.
As data was collected, a further problem emerged, with Kayak providing incomplete entries, leading
to pricing gaps. This was beyond the author's control, resulting in some airlines seldom generating
quoted prices, along with a lack of data for business / first class tickets and non-stop flights.
Given the dynamic nature of pricing, using regression analysis to estimate and back-fill historical prices
was deemed spurious, given the idiosyncratic behaviour of airline prices. For some airlines, this led to
their exclusion, which was not ideal; however, it was felt this could be offset by including lagged
industry prices to mitigate any omitted variable bias. In most instances, prices were filled forward with
the preceding price, which artificially reduced volatility. Overall, these issues were likely to reduce
the effectiveness of our analysis, albeit permitting some limited analysis to still be undertaken.
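The gap-filling described above can be sketched with pandas, whose `ffill` carries the last observed quote forward. The prices below are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical twice-daily quotes for one airline, with NaN where
# Kayak returned no price.
prices = pd.Series([620.0, np.nan, np.nan, 655.0, np.nan, 640.0])

# Carry the last observed quote forward, as done in the text;
# note this flattens the series and artificially reduces volatility.
filled = prices.ffill()
print(filled.tolist())  # [620.0, 620.0, 620.0, 655.0, 655.0, 640.0]
```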
A final problem was deriving suitable training, validation and test sets. In contrast to cross-sectional
or time-series datasets, each flight is a discrete, finite event. It was thus decided to have one week of
training data, three days as a validation set (e.g. Monday, Thursday, and Sunday) to capture off-peak,
peak, and Sunday demand, with the remaining four days used to test our algorithms (two days each for
peak and off-peak). This followed our descriptive analysis, with a similar logic apparent in Groves and
Gini (2011), who found prices are lower for departures between Tuesday and Thursday, and higher
from Thursday through to Saturday (p.3).

¹ "Travel website cookies milk you for dough", Sunday Times (24th June 2018)
Data collection
Concentrating on long-haul flights, these routes were expected to be more competitive than EU short-
haul flights, where pricing could be distorted by regional airport incentives in the EU, and by airports
being located far apart within a city, which may not provide a like-for-like comparison². With long-haul
flights combining direct and transfer passengers, it was hoped this would minimise under-utilised
flights, and thus price changes from demand shortfalls that could distort our analysis. This led to a
focus on airports that serve as significant regional transfer hubs, with Asia being an area of interest.
Prices were recorded both during the morning and evening, to understand if timing contributed to
any difference. With dynamic prices likely to change far more frequently intra-day, this was not ideal,
although it was still felt sufficient to detect general pricing movements.
We accumulate five weeks of historical data for each flight pair prior to departure. This period was
based purely on practical considerations in starting the dissertation in late June. In practice, additional
data would have been helpful in examining pricing volatility going further back in time.
For airlines with incomplete data, we employ a threshold of 40% completeness across our training and
test set flight pairs to determine their inclusion, consistent with Groves and Gini (2011). As this is an
average, there are still flight pairs with minimal pricing disclosure (e.g. due to having no flights that
leave on a specific day), but it provides a subset where analysis can be undertaken.
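The 40% completeness filter can be sketched with pandas, using hypothetical airline price columns where NaN marks a missing quote:

```python
import numpy as np
import pandas as pd

# Hypothetical price columns per airline; NaN marks missing quotes.
df = pd.DataFrame({
    "BA-price": [640, 652, np.nan, 648, 655],   # 80% complete
    "AF-price": [np.nan, np.nan, np.nan, 610, np.nan],  # 20% complete
})

# Keep airlines whose average completeness meets the 40% threshold.
completeness = df.notna().mean()
kept = completeness[completeness >= 0.4].index.tolist()
print(kept)  # ['BA-price']
```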
Overall, despite very significant impediments in data completeness and quality, it was felt some
meaningful analysis could still follow.
² "EU eases competition rules for state aid to regional airports", Financial Times (17th May 2017)
DESCRIPTIVE ANALYSIS
Figure 1 below highlights some of the relationships we observe across economy class flights to our
targeted cities. Prices appear positively skewed towards the bottom quartile, based on median prices.
This is consistent with prices typically being low, but rising closer towards departure, which is
confirmed when examining line graphs of the same data (not shown).
The level of variance in pricing does vary by city. The interquartile ranges for Bangkok and Kuala
Lumpur are small, with differences typically less than £100, although with a significant number of
outliers where prices exceed the maximum tail of our boxplots. This may point to prices rising close to
departure, where there is merit in advanced booking.
By contrast, Beijing has the widest interquartile range but with few outliers, highlighting high levels of
volatility. This suggests value may be derived from later-stage bookings, potentially resulting from
under-utilised plane capacity.
The remaining cities demonstrate behaviour between the two instances above, although HK,
Singapore, Tokyo and Seoul all demonstrate significant outliers for weekend departures.
Departures between Monday to Wednesday appear cheapest, when examining median prices, across
all seven cities. Prices broadly increase between Thursday – Saturday, although do fall on Sunday by
varying levels. When predicting future prices, we shall distinguish between these groups of days, as
this appears consistent across all cities.
The cheapest destination to visit, based on median prices, appears to be Bangkok, with prices between
£600-800 across the week. However, both Beijing and Singapore, when considering tail minimums, do
offer opportunities for ticket prices below £600.
Conversely, median prices indicate that both Tokyo and Seoul are the most expensive cities to visit,
with economy prices approximately between £1,000 - £1,200. This may reflect reduced competition
on these routes, along with geographically being on the periphery of Asia, and thus not benefiting
from as much traffic compared to other hubs.
Data on business class flights is patchy and inconsistent, with data-points carried forward where
intermittent data exists. Line graphs of these results can be found in Appendix 1. Although not
conclusive, business class prices appear to fall closer to departure for Beijing, Bangkok, and HK, with
KL and Singapore prices rising instead, and insufficient data available for Seoul and Tokyo.
A similar lack of data prevails when examining non-stop economy flights (Appendix 2). Both Bangkok
and Tokyo lack sufficient data for conclusions to be drawn. KL, HK and Singapore all demonstrate price
increases nearer to departure, with Beijing and Seoul displaying volatility but being range-bound. The
price levels quoted may be above actual levels, but this suggests potential interchangeability, with
business class flights being cheaper in specific instances.
Figure 1: Boxplot of Economy ticket prices for 7 cities in Asia departing from London
There is a lot of variability in the airline data collected. Applying the 40% threshold discussed earlier
reveals only 10 airlines where pricing data is consistently available across most of our 7 cities: Air
France (AF), British Airways (BA), Emirates (EK), Etihad (EY), KLM (KL), SwissAir (LX), Malaysia Airlines
(MH), Philippine Airlines (PR), Thai Airways (TG), and Vietnam Airlines (VN). Even looking at routes in
isolation, data for direct flights (e.g. Cathay Pacific to Hong Kong, Air China to Beijing, Korean Air to
Seoul, ANA and Nippon Airlines to Tokyo) was not available. This data may still be implicit within non-
stop economy fares, although it may limit our analysis of such competitors if this data series is highly
correlated with British Airways (the only airline in our dataset that flies direct to these locations).
Summary statistics were also examined for differences in airline prices by departure from different
London airports (not shown). This included Heathrow, City Airport, Gatwick, Luton, Stansted, and
Southend. As this data was incomplete, it was felt it would not offer much insight into any competitive
dynamic between departing London airports.
We examine separate correlation heatmaps examining travel class and number of stops respectively,
before tabulating these results for all our variables, for each day within our training set. Figure 2 shows
the results obtained for Thursday departures, although results are not too dissimilar across the week.
We exclude variables with insufficient data, as noted above.
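The heatmaps in Figure 2 are built from a pairwise correlation matrix; a minimal sketch with pandas follows, using hypothetical standardised price series (the actual analysis used ~73 variables per departure day):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical standardised price series for three routes over 68 time points.
prices = pd.DataFrame({
    "BKK-econ": rng.normal(size=68),
    "KL-econ": rng.normal(size=68),
    "SGP-econ": rng.normal(size=68),
})

# Pairwise Pearson correlations; a heatmap simply colour-codes this matrix.
corr = prices.corr()
print(corr.shape)  # (3, 3)
```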
Figure 2: Correlation heatmaps by class and stops for Thursday
We observe strong pricing correlations for flight pairs involving Hong Kong, KL, Seoul and Singapore.
Positive correlations are observed to be strongest based on proximity (e.g. Seoul and Tokyo) or
commercial links (e.g. Singapore and Hong Kong). Negative correlations are noted for Beijing and
Bangkok business class, with no immediate explanation apparent. Moreover, the interaction between
economy and business class flights to the same destination varies from quite weak (e.g. Bangkok)
through to quite strong (e.g. Singapore), highlighting the variable nature of these relationships.

Looking at stops, we would expect to find strong correlations amongst stops to the same destination,
with weak correlations elsewhere. As we can see, this does not hold (e.g. HK and KL), with no
immediate logic apparent.
We can also see how flight prices are correlated both amongst each airline's departure destinations,
and between airlines (not shown). Figure 3 examines same-flight correlations exceeding 0.7 for KL, as
an example, where significant correlations do exist for other destinations.
Figure 3: Same airline correlations for Thursday departures from LHR
Note: Airline for City Comparison should be read across for each of the destinations on the x-axis
Combined with the above, it is evident that multicollinearity exists not only between airlines, but
within airline pricing to different destinations, and by class. This does not consider any interaction
between different departure dates.
A further question is whether prices behave differently several weeks prior to departure, relative to
1-2 weeks beforehand. With a small time-series available, we perform an F-test on prices prior to, and
starting, two weeks before departure. After standardisation, the results (not shown) illustrate that
only Beijing and Tokyo cannot reject our null hypothesis of equal variance. We shall thus introduce
rolling windows, testing permutations involving 2, 3, and 4 weeks to account for any parameter
instability.
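The equal-variance F-test can be sketched as follows. The price samples are hypothetical and standardised; the two-sided p-value comes from scipy's `f` distribution:

```python
import numpy as np
from scipy import stats

def variance_f_test(a, b):
    """Two-sided F-test for equal variances between two samples."""
    f = np.var(a, ddof=1) / np.var(b, ddof=1)
    dfa, dfb = len(a) - 1, len(b) - 1
    # Two-sided p-value: double the smaller tail probability.
    p = 2 * min(stats.f.sf(f, dfa, dfb), stats.f.cdf(f, dfa, dfb))
    return f, p

# Hypothetical standardised prices: >2 weeks out vs within 2 weeks of departure.
early = np.array([-1.2, 0.4, 0.8, -0.5, 0.1, 0.9, -0.7, 0.3])
late = np.array([-2.5, 1.8, 3.1, -1.9, 2.4, -3.0, 1.1, -2.2])

f_stat, p_value = variance_f_test(early, late)
print(f_stat < 1, p_value < 0.05)  # late-period variance is clearly larger here
```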
Arriving Airport | Airline (for city comparison) | BEI | BKK | HKG | KL | SEU | SGP | TKY
BEI | AF-price | 1 | 0.3 | 0.7 | 0.9 | NaN | 0.4 | 0.5
KL | AF-price | 0.9 | 0.1 | 0.5 | 1 | NaN | 0.5 | 0.5
KL | BA-price | 0.5 | 0.5 | 0.7 | 1 | 0 | 0.8 | 0.7
KL | EK-price | 0.6 | 0.6 | 0.6 | 1 | 0.5 | 0.8 | 0.6
KL | EY-price | 0.4 | 0.6 | 0.5 | 1 | 0.5 | 0.3 | 0.4
KL | KL-price | 0.3 | 0.3 | -0.2 | 1 | 0.4 | 0.4 | 0.6
KL | MH-price | 0.3 | 0.1 | 0 | 1 | 0.2 | 0.6 | 0.3
KL | PR-price | NaN | 0.2 | 0.6 | 1 | 0.5 | 0.6 | 0.2
KL | TG-price | 0.3 | 0.6 | 0.2 | 1 | 0.4 | 0.6 | 0.3
KL | VN-price | NaN | 0.5 | 0.7 | 1 | 0.7 | 0.4 | 0
SGP | BA-price | 0.5 | 0.6 | 0.6 | 0.8 | 0.1 | 1 | 0.6
SGP | EK-price | 0.6 | 0.6 | 0.7 | 0.8 | 0.6 | 1 | 0.6
MACHINE LEARNING ANALYSIS
Methodology
We focus on the lowest quartile of economy class prices achieved. This aims to strike a balance
between having enough samples to train our algorithms and identifying flights that are cheaper
than average. We generate a binary signal to denote purchase. Although not as intuitive as directly
predicting price curves, this further facilitates using support vector machine and neural network
techniques.
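Generating the binary buy signal from the lowest quartile can be sketched with pandas, on a hypothetical price series:

```python
import pandas as pd

# Hypothetical price series for one flight pair over the training window.
prices = pd.Series([620, 580, 575, 640, 700, 560, 610, 690])

# Buy signal = 1 when the quoted price falls in the lowest quartile.
q25 = prices.quantile(0.25)
buy = (prices < q25).astype(int)
print(q25, buy.tolist())  # 578.75 [0, 0, 1, 0, 0, 1, 0, 0]
```

Switching the strict `<` to `<=` reproduces the "including vs excluding the 25th percentile level" distinction discussed below, which changes how many buy signals plateau points contribute.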
The lumpy nature of prices for some flight pairs (i.e. quotes being static over several days, or back-
filling) results in plateau points where several prices sit exactly at the 25th percentile. This leads to
our sample of buy signals being below or above 25% of the sample, depending on whether this level
is included. This is most extreme for Wednesday's departure to Seoul, where we have 17 prices at this
level, of which we reduce 4 sample points by £10 to avoid extreme imbalances. Overall, we examine
both including and excluding the 25th percentile level, with this a crude adjustment mechanism
should our samples be too imbalanced.
Despite eliminating some explanatory variables, we have 73 variables remaining for each departure
day, on a total time-series of 68 points. All explanatory variables are normalised, which though
convenient for our analysis, may not be consistent with the underlying distributions observed, possibly
undermining overall accuracy.
The quality of our analysis rests on dimensionality reduction. We initially use a logistic regression with
one-norm regularisation to induce sparsity, along with using a Lasso regression. This is expected to
provide a starting-point for prediction, with non-linear and endogenous relationships likely to exist,
based on our descriptive analysis. These aspects may undermine the quality of these predictions.
We then use principal component analysis to transform our data. Although this eliminates any
endogeneity amongst our variables, it may reduce any correlation between time points. This is
underlined by our scree plots (not shown), where an elbow point of k = 4 or 5 is apparent across our
training set, but which accounts for only 55-60% of variance. Dimensionality is raised to k = 8 to
account for 70% of variance, which is close to the square root of the number of explanatory variables,
despite the downside of adding higher dimensionality to a small dataset.
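Choosing k from cumulative explained variance can be sketched with scikit-learn. The data here is random with the dissertation's shape (68 time points, 73 variables), so the variance profile — and hence the chosen k — will differ from the real dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical normalised training set: 68 time points x 73 variables.
X = rng.normal(size=(68, 73))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose components explain at least 70% of the variance.
k = int(np.searchsorted(cumulative, 0.70)) + 1
print(k, round(cumulative[k - 1], 3))
```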
We then apply ridge regression to our PCA dataset, optimising our hyper-parameters. Given the
possible lack of time-dependency, we separately experiment with over-sampling (SMOTE), to mitigate
any sample imbalance. Finally, given our small subset of variables, we also train the AdaBoost
algorithm to minimise error amongst our boosted samples.
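An AdaBoost fit on a PCA-reduced, imbalanced sample can be sketched with scikit-learn. The data is synthetic (47 training points, k = 8 components, ~25% buy signals) and stands in for the dissertation's own features:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
# Hypothetical PCA-transformed training set: 47 points, k = 8 components.
X = rng.normal(size=(47, 8))
# Imbalanced binary buy/wait target (~25% buys), constructed deterministically.
y = np.array([1 if i % 4 == 0 else 0 for i in range(47)])

# Boosted decision stumps: each round re-weights misclassified points.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (47,)
```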
Beyond PCA, we employ both random forest and Support Vector Machine analysis. We experiment
with different hyperparameters to optimise our classification trees, which are anticipated to better
capture any non-linearities present. The SVM model uses different kernels to transform our dataset,
to better delineate amongst our binary outcomes. This is contrasted with a simple neural network,
which is trained at different drop-out rates to identify suitable network depth.
Depending on how these models perform, either one or several will be applied to our test set. An
ensemble approach based on majority voting will determine whether tickets should be purchased for
a given location. This is similar to the stacking approach used by Etzioni et al (2003) to improve
accuracy.
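The majority-voting ensemble can be sketched directly with numpy; the per-model predictions below are hypothetical:

```python
import numpy as np

# Hypothetical binary buy/wait predictions from three models for five flights.
pred_adaboost = np.array([1, 0, 1, 0, 1])
pred_ridge_smote = np.array([1, 1, 0, 0, 1])
pred_nn = np.array([0, 1, 1, 0, 1])

# Buy only when a majority (2 of 3) of models agree.
votes = pred_adaboost + pred_ridge_smote + pred_nn
ensemble = (votes >= 2).astype(int)
print(ensemble.tolist())  # [1, 1, 1, 0, 1]
```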
Model parameters and approach
Cross validation, or out-of-bag sampling for Random Forest, was used wherever possible in
conjunction with rolling windows of 2, 3, and 4 weeks. This was intended to reduce the variance of
our algorithms, and improve parameter stability respectively. Cross validation was performed on a
rolling basis from a starting-point, testing our hyperparameters before determining optimal window
size.
There were instances where rolling windows were not used. This was premised either on using the
largest sample to identify data points for error reduction (AdaBoost), on time dependency between
observations being felt weak (e.g. after using PCA), or on rolling windows not being suited to the
algorithm (e.g. SVM and neural networks).
Where possible, algorithms were trained to avoid optimising for accuracy, as sample imbalances were
likely to favour no-purchase signals. A ROC-AUC score was preferred, reflecting both true positives
and true negatives, but this was often not possible due to our samples being too imbalanced (i.e. no
buy signals present) or shortcomings within SKlearn generating errors. Mean-squared error was used
in most cases (when undertaking regression), often in conjunction with a high cost-weight in favour
of purchasing tickets, which formed part of our hyperparameter testing (see below).
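The single-class failure mode mentioned above — ROC-AUC being undefined when a validation day contains no buy signals — can be guarded against with a small wrapper (a sketch; the function name is our own):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_score):
    """ROC-AUC, or None when only one class is present (as happens
    for flight pairs with no buy signals)."""
    if len(np.unique(y_true)) < 2:
        return None
    return roc_auc_score(y_true, y_score)

# Degenerate sample: no buy signals at all.
print(safe_roc_auc(np.array([0, 0, 0, 0]), np.array([0.2, 0.4, 0.1, 0.3])))  # None
# Balanced sample where all buys score above all waits.
print(safe_roc_auc(np.array([0, 1, 0, 1]), np.array([0.2, 0.8, 0.3, 0.6])))  # 1.0
```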
The one exception was our Random Forest, where neither cost-weights nor a non-accuracy scoring
measure could be applied. This was attributed to our imbalanced samples, with the code working in
some instances, but with too many omissions likely. Although this significantly compromised potential
performance, this algorithm was still used as a baseline.
The hyperparameters tested varied but, besides cost-weights, included our regularisation parameter
(for Lasso and Ridge), drop-out rate (for the neural network), and kernel type, gamma, and our
regularisation constraint (for SVM). For Random Forest, we examined minimum samples per leaf and
maximum tree depth, and set maximum features to the square root of the total number of
explanatory variables. These hyperparameters were all tested on our training set, before being
applied to our validation set.
As predictions were not confined to binary results, a threshold level was used to distinguish between
such events. This was arbitrarily set slightly below 0.5 (at 0.45), allowing some latitude in forming
judgements on our imbalanced samples. Although this introduces bias, it appears consistent with the
approach adopted by Groves and Gini (2011).
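The thresholding step is a one-liner with numpy; the model scores below are hypothetical:

```python
import numpy as np

# Hypothetical continuous model outputs for five flights.
scores = np.array([0.30, 0.46, 0.52, 0.44, 0.61])

# Classify as "buy" at a threshold slightly below 0.5, as in the text.
decisions = (scores >= 0.45).astype(int)
print(decisions.tolist())  # [0, 1, 1, 0, 1]
```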
Results – validation
For validation, we examine both the ROC-AUC scores obtained from our optimised hyperparameters,
and overall specificity and accuracy (the latter not shown), given the likely sample imbalance
expected.
When considering our samples below the threshold of our first quartile, we obtain the following
results shown in figures 4a and 4b. It is evident the imbalanced nature of our sample has affected the
performance of our algorithms. This is surprising when considering the high class-weights applied to
favour purchasing a ticket, and with the algorithms trained not to focus on accuracy.
The performance of our Logistic and Lasso regressions is not unexpected, along with that of the
Random Forest, given the non-linear relationships the former were not expected to identify and the
manner in which the latter was trained (discussed above). Both SVM and Neural Network techniques
appear to have underperformed, given their ability to capture more complex relationships.
Applying PCA before using AdaBoost, or employing SMOTE to over-sample our population before
Ridge regularisation, seems to have performed relatively well when examining ROC and specificity
scores. This is premised on some predictions being recorded for each departure day within our
training set, with AdaBoost generating higher accuracy levels. In absolute terms, ROC and specificity
scores remain quite low, although Beijing and KL do achieve marginally higher levels than other cities.
Within these scores, we do observe instances of consistency across our algorithms. Most algorithms
generate ROC scores between 0.6-0.85 when examining Sunday flight prices for Bangkok and Seoul
(not shown). Applying SMOTE to our PCA sample generates ROC scores exceeding 0.6 for Bangkok,
KL, Seoul, and Tokyo on the same day. This may highlight that our training/validation procedure of
categorising peak and off-peak days, and validating on days before or within a week of departure,
may be less effective than examining earlier departures on the same day.
Figure 4a – Specificity averaged by Monday-Sunday for each city: below 25th percentile level
Figure 4b – ROC-AUC score averaged by Monday-Sunday for each city: below 25th percentile level
Note: For calculation purposes, results where ROC scores <= 0.5 were marked as 0.5
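The adjustment in the note can be expressed directly: any model doing worse than chance is floored at the no-skill level of 0.5 before averaging. A one-line illustration with hypothetical per-day ROC scores:

```python
raw_roc = [0.62, 0.44, 0.58, 0.31]             # illustrative per-day ROC scores
adjusted = [max(score, 0.5) for score in raw_roc]
print(adjusted)  # [0.62, 0.5, 0.58, 0.5]
```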
To address under-sampling, we adjust our price threshold to include prices both below and at the 25th percentile level. This typically results in more observations than 25% of our sample. The subsequent results can be seen in Figures 5a and 5b.
Adjusting our sample has improved average specificity and marginally increased ROC scores, with our PCA techniques benefiting most, along with the specificity scores of our neural network. These increases have not been universal: some departures for specific cities show a decline in performance based on ROC score and specificity. Accuracy levels have fallen across all techniques, highlighting an increase in false negatives.
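Moving from strictly-below to at-or-below the 25th percentile captures more than a quarter of observations whenever fares tie at that level, which is how the adjusted threshold eases under-sampling. A short numpy sketch with illustrative prices:

```python
import numpy as np

# 20 illustrative fares: six tie at the low end of the distribution
prices = np.array([100] * 6 + list(range(110, 250, 10)))
p25 = np.percentile(prices, 25)

below = prices < p25         # strict threshold: misses every tied observation
at_or_below = prices <= p25  # adjusted threshold used to address under-sampling

print(p25, below.mean(), at_or_below.mean())  # 30% of the sample is now labelled
```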
Figure 4a data – specificity by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.10 | - | - | 0.16 | 0.24 | 0.30 | 0.21 | 0.15
BKK | 0.03 | 0.10 | 0.10 | 0.31 | 0.24 | 0.20 | 0.12 | 0.29
HKG | 0.04 | - | - | 0.08 | 0.22 | 0.12 | 0.10 | 0.23
KL | - | - | - | - | 0.24 | 0.43 | 0.19 | 0.15
SEU | 0.30 | 0.08 | 0.02 | 0.16 | 0.26 | 0.27 | 0.53 | 0.21
SGP | 0.02 | 0.07 | - | 0.26 | 0.21 | 0.21 | 0.18 | 0.08
TKY | - | 0.01 | - | 0.23 | 0.24 | 0.28 | 0.05 | 0.08
Average | 0.07 | 0.04 | 0.02 | 0.17 | 0.24 | 0.26 | 0.20 | 0.17
Figure 4b data – ROC-AUC score by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.53 | 0.50 | 0.50 | 0.52 | 0.60 | 0.54 | 0.54 | 0.54
BKK | 0.50 | 0.52 | 0.55 | 0.56 | 0.55 | 0.52 | 0.51 | 0.59
HKG | 0.50 | 0.50 | 0.50 | 0.51 | 0.53 | 0.52 | 0.50 | 0.55
KL | 0.50 | 0.50 | 0.50 | 0.50 | 0.58 | 0.55 | 0.52 | 0.55
SEU | 0.53 | 0.53 | 0.50 | 0.54 | 0.55 | 0.54 | 0.57 | 0.54
SGP | 0.50 | 0.50 | 0.50 | 0.54 | 0.55 | 0.53 | 0.51 | 0.50
TKY | 0.50 | 0.50 | 0.50 | 0.53 | 0.55 | 0.52 | 0.51 | 0.51
Average | 0.51 | 0.51 | 0.51 | 0.53 | 0.56 | 0.53 | 0.52 | 0.54
Figure 5a – Specificity averaged by Monday-Sunday for each city: below and at 25th percentile level
Figure 5b – ROC-AUC score averaged by Monday-Sunday for each city: below and at 25th percentile level
Note: Results where ROC scores <= 0.5 were marked as 0.5
Examining these results, SMOTE on PCA generates predictions for each departure day, with AdaBoost making predictions for every day barring three occasions.
For Bangkok, Hong Kong, and KL, SMOTE on PCA performs well on our off-peak validation day (i.e. departing on Monday), with Tuesday departures typically generating a higher-than-average ROC score. AdaBoost instead generates ROC scores above 0.65 when considering Wednesday departures for Beijing and KL.
For peak departure days, we observe AdaBoost achieving ROC scores exceeding 0.6 for Saturday departures to Beijing and Singapore. This might capture some interaction with our validation day the following Thursday, with a possible substitution effect. The observations for Sunday above remain valid. Full results are shown in Appendices 3 and 4.
Given these results, SMOTE on PCA generates the highest ROC scores, and there is merit in examining it further. However, with AdaBoost, and Ridge PCA without SMOTE, achieving higher specificity scores, we shall also contrast this with an ensemble approach, with majority voting determining whether a flight should be purchased.
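The majority vote can be sketched simply: each of the three models named above casts a buy (1) / wait (0) vote per observation, and a ticket is flagged for purchase only when at least two agree. The vote lists below are illustrative, not model output:

```python
def majority_vote(*model_predictions):
    """Combine per-observation buy (1) / wait (0) votes from several models."""
    return [1 if sum(votes) > len(votes) / 2 else 0
            for votes in zip(*model_predictions)]

# Hypothetical per-observation votes from the three candidate models
adaboost    = [1, 0, 1, 0]
ridge_pca   = [1, 1, 0, 0]
ridge_smote = [0, 1, 1, 0]
print(majority_vote(adaboost, ridge_pca, ridge_smote))  # [1, 1, 1, 0]
```

Requiring agreement between models trades recall for specificity: fewer buy signals are emitted, but each is corroborated.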
Figure 5a data – specificity by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.19 | 0.10 | - | 0.21 | 0.31 | 0.51 | 0.22 | 0.41
BKK | 0.25 | 0.14 | 0.20 | 0.56 | 0.41 | 0.40 | 0.29 | 0.38
HKG | 0.13 | 0.07 | 0.02 | 0.22 | 0.24 | 0.34 | 0.31 | 0.28
KL | 0.21 | 0.08 | 0.05 | 0.26 | 0.38 | 0.49 | 0.30 | 0.52
SEU | 0.13 | 0.20 | 0.08 | 0.22 | 0.27 | 0.27 | 0.42 | 0.15
SGP | 0.11 | 0.02 | 0.04 | 0.33 | 0.26 | 0.35 | 0.17 | 0.22
TKY | 0.03 | 0.01 | 0.01 | 0.41 | 0.31 | 0.23 | 0.09 | 0.20
Average | 0.15 | 0.09 | 0.06 | 0.32 | 0.31 | 0.37 | 0.26 | 0.31
Figure 5b data – ROC-AUC score by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.55 | 0.54 | 0.50 | 0.55 | 0.59 | 0.58 | 0.54 | 0.55
BKK | 0.57 | 0.52 | 0.58 | 0.60 | 0.60 | 0.56 | 0.52 | 0.54
HKG | 0.51 | 0.50 | 0.50 | 0.53 | 0.54 | 0.51 | 0.52 | 0.52
KL | 0.55 | 0.53 | 0.52 | 0.53 | 0.59 | 0.59 | 0.52 | 0.56
SEU | 0.54 | 0.54 | 0.50 | 0.54 | 0.56 | 0.53 | 0.55 | 0.51
SGP | 0.51 | 0.50 | 0.50 | 0.53 | 0.52 | 0.53 | 0.50 | 0.51
TKY | 0.50 | 0.50 | 0.50 | 0.54 | 0.53 | 0.50 | 0.51 | 0.50
Average | 0.53 | 0.52 | 0.52 | 0.55 | 0.56 | 0.54 | 0.52 | 0.53
Results – testing
With four test days available, we train our algorithms on each respective peak/off-peak day within our training set before juxtaposing with our test days. This results in three training days each being tested against a test day, which helps contrast whether predictions are affected by departures not leaving on the same day during the preceding week. The underlying results are shown in Appendix 5.
Across these three training days, we adopt an ensemble approach through majority voting to determine whether a flight should be purchased. Appendix 6 illustrates the results when solely examining Ridge PCA with SMOTE.
Figure 6 – Ensemble results on test data: Combining three models
City | Departure day | Test period | ROC | Correct predictions | Incorrect predictions | Specificity | Recall
Beijing | Tuesday | Off-peak 1 | 0.37 | - | 1 | - | -
Beijing | Wednesday | Off-peak 2 | 0.83 | 1 | - | 1.00 | 0.04
Beijing | Friday | Peak 1 | 0.59 | 1 | 1 | 0.50 | 0.05
Beijing | Saturday | Peak 2 | 0.31 | - | 2 | - | -
Bangkok | Tuesday | Off-peak 1 | 0.62 | 1 | 1 | 0.50 | 0.06
Bangkok | Wednesday | Off-peak 2 | 0.86 | 2 | - | 1.00 | 0.10
Bangkok | Friday | Peak 1 | 0.54 | 2 | 3 | 0.40 | 0.09
Bangkok | Saturday | Peak 2 | 0.69 | 3 | 2 | 0.60 | 0.18
Hong Kong | Tuesday | Off-peak 1 | 0.51 | 2 | 5 | 0.29 | 0.11
Hong Kong | Wednesday | Off-peak 2 | 0.42 | 1 | 6 | 0.14 | 0.05
Hong Kong | Friday | Peak 1 | 0.47 | 1 | 4 | 0.20 | 0.06
Hong Kong | Saturday | Peak 2 | 0.47 | 1 | 4 | 0.20 | 0.06
KL | Tuesday | Off-peak 1 | 0.37 | - | 2 | - | -
KL | Wednesday | Off-peak 2 | 0.89 | 2 | - | 1.00 | 0.12
KL | Friday | Peak 1 | 0.88 | 2 | - | 1.00 | 0.11
KL | Saturday | Peak 2 | 0.87 | 2 | - | 1.00 | 0.11
Seoul | Tuesday | Off-peak 1 | 0.34 | - | 6 | - | -
Seoul | Wednesday | Off-peak 2 | 0.54 | 2 | 4 | 0.33 | 0.11
Seoul | Friday | Peak 1 | 0.45 | 1 | 5 | 0.17 | 0.06
Seoul | Saturday | Peak 2 | 0.31 | - | 6 | - | -
Singapore | Tuesday | Off-peak 1 | 0.61 | 4 | 5 | 0.44 | 0.24
Singapore | Wednesday | Off-peak 2 | 0.48 | 2 | 7 | 0.22 | 0.11
Singapore | Friday | Peak 1 | 0.57 | 2 | 3 | 0.40 | 0.11
Singapore | Saturday | Peak 2 | 0.58 | 2 | 3 | 0.40 | 0.12
Tokyo | Tuesday | Off-peak 1 | 0.49 | 4 | 11 | 0.27 | 0.21
Tokyo | Wednesday | Off-peak 2 | 0.46 | 3 | 12 | 0.20 | 0.17
Tokyo | Friday | Peak 1 | 0.63 | 2 | 2 | 0.50 | 0.12
Tokyo | Saturday | Peak 2 | 0.48 | 1 | 3 | 0.25 | 0.05

Both ROC and specificity scores would suggest some predictive capability when examining Bangkok and Kuala Lumpur. However, even scores of 0.68-0.73 are relatively low insofar as reducing uncertainty with these predictions is concerned. Notwithstanding some successful predictions elsewhere, there is too much uncertainty for these to prove reliable.
There are several results where ROC scores fall below 0.5, implying improvements could be achieved by undertaking the opposite action to what is suggested. This could indicate scope for model improvements, some of which were highlighted above, as well as reflecting the nature of our dataset (discussed further below).
An ensemble approach across the three algorithms mentioned should refine performance further, helping eliminate noise and spurious predictions. Results are shown in Figure 6.
Average ROC and specificity scores within each city are broadly higher. Although the ensemble generates far fewer predictions, as evidenced by low recall rates (i.e. relative to the total number of actual buy signals), we achieve high specificity levels for peak flights to KL, with the potential to achieve the same for off-peak. Results for Bangkok are also encouraging, with reasonable accuracy attained. Average accuracy levels exceed the 61.8% attained by Etzioni et al (2003).
Figure 7 highlights that our predictions save between 7.0-8.1% versus average flight costs to Bangkok and KL. On a next-period-ahead basis, these predictions vary on average from an additional cost of 1.1% for Bangkok through to an 8.3% saving for KL.
Results for the remaining cities are lacklustre. The models perform poorly when examining Seoul, and generate too many incorrect predictions elsewhere for our models to be relied upon. This is disappointing, although it may be remedied by more comprehensive datasets.
Figure 7 – Additional cost savings / (expense) from Bangkok and KL predictions

Bangkok:
Test period | Predicted price (£) | Period-ahead price (£) | Average over period (£) | Saving vs next (£) | Saving vs average (£)
Off-peak 1 | 734 | 734 | 782 | - | 48
Off-peak 1 | 741 | 741 | 782 | - | 41
Off-peak 2 | 741 | 741 | 796 | - | 55
Off-peak 2 | 741 | 741 | 796 | - | 55
Peak 1 | 813 | 813 | 847 | - | 34
Peak 1 | 813 | 625 | 847 | (188) | 34
Peak 1 | 865 | 806 | 847 | (59) | (18)
Peak 1 | 805 | 805 | 847 | - | 42
Peak 1 | 785 | 865 | 847 | 80 | 62
Peak 2 | 649 | 648 | 751 | (1) | 102
Peak 2 | 649 | 649 | 751 | - | 102
Peak 2 | 711 | 711 | 751 | - | 40
Peak 2 | 635 | 635 | 751 | - | 116
Peak 2 | 685 | 734 | 751 | 49 | 66
Average saving / (cost) (£) | | | 796 | (9) | 56
Average saving / (cost) (%) | | | | -1.1% | 7.0%

Kuala Lumpur:
Test period | Predicted price (£) | Period-ahead price (£) | Average over period (£) | Saving vs next (£) | Saving vs average (£)
Off-peak 1 | 799 | 1104 | 866 | 305 | 67
Off-peak 1 | 798 | 997 | 866 | 199 | 68
Off-peak 2 | 797 | 798 | 858 | 1 | 61
Off-peak 2 | 782 | 811 | 858 | 29 | 76
Peak 1 | 911 | 847 | 943 | (64) | 32
Peak 1 | 715 | 969 | 943 | 254 | 228
Peak 2 | 881 | 849 | 901 | (32) | 20
Peak 2 | 897 | 774 | 901 | (123) | 4
Average saving / (cost) (£) | | | 858 | 71 | 70
Average saving / (cost) (%) | | | | 8.3% | 8.1%
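The Bangkok percentages in Figure 7 follow from simple averages over the fourteen predictions; the sketch below reproduces the saving-vs-average and saving-vs-next figures directly from the table's Bangkok rows:

```python
# (predicted price, period-ahead price, average over period) per Bangkok prediction,
# taken from the Bangkok rows of Figure 7
bangkok = [
    (734, 734, 782), (741, 741, 782), (741, 741, 796), (741, 741, 796),
    (813, 813, 847), (813, 625, 847), (865, 806, 847), (805, 805, 847),
    (785, 865, 847), (649, 648, 751), (649, 649, 751), (711, 711, 751),
    (635, 635, 751), (685, 734, 751),
]
avg_period = sum(avg for _, _, avg in bangkok) / len(bangkok)
saving_vs_avg = sum(avg - pred for pred, _, avg in bangkok) / len(bangkok)
saving_vs_next = sum(nxt - pred for pred, nxt, _ in bangkok) / len(bangkok)
print(round(100 * saving_vs_avg / avg_period, 1))   # 7.0% saving vs average fares
print(round(100 * saving_vs_next / avg_period, 1))  # -1.1% on a next-period basis
```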
CONCLUSIONS
Limitations
A considerable constraint has been the quantity and quality of data. Additional departure data dating back several weeks would have helped provide adequate training and validation datasets. This was not feasible without earmarking further time beyond the scope of this assignment, especially with data collection commencing in late June.
The underlying quality of the data could have been further improved, given the gaps within our dataset. It should be recalled that a low threshold of only 40% complete data was used to inform our variables, highlighting limited disclosure. Back-filling inevitably reduced the explanatory power of our variables. Notwithstanding this, it also highlighted the difficulty of collecting raw, "real-world" data; within this context, it is pleasing we were still able to identify the pricing relationships above.
Summary
Despite data limitations, our study has shown a capacity to predict peak and off-peak airfares for departures to Bangkok and Kuala Lumpur. This also includes Sunday departures, albeit based on our validation datasets, as we had insufficient data to test prior Sunday departures.
Going forward, it would be interesting to evaluate how airfares are affected by flights within two days prior to or after departure, and how sensitive they are to competitor behaviour. It would be ideal to combine this with demand-based data to better approximate own- and cross-price elasticities.
A further extension would be a more comprehensive analysis of business-class flights, especially as these tickets entail far more flexible conditions than economy. The descriptive analysis highlighted instances where prices fell quite significantly in the lead-up to departure, potentially heralding value if flights can be cancelled and re-booked at lower airfares.
APPENDICES
Appendix 1: Line graph of business class tickets for 7 cities in Asia departing from London
Note: Days include AM and PM observations, with 40 days of data shown
Appendix 2: Line graph of non-stop tickets for 7 cities in Asia departing from London
Note: Days include AM and PM observations, with 40 days of data shown
Appendix 6 – Ensemble results on test data: Ridge PCA with SMOTE
City | Departure day | Test period | ROC | Correct predictions | Incorrect predictions | Specificity | Recall
Beijing | Tuesday | Off-peak 1 | 0.37 | - | 3 | - | -
Beijing | Wednesday | Off-peak 2 | 0.66 | 2 | 1 | 0.67 | 0.08
Beijing | Friday | Peak 1 | 0.51 | 3 | 6 | 0.33 | 0.14
Beijing | Saturday | Peak 2 | 0.35 | 1 | 8 | 0.11 | 0.04
Bangkok | Tuesday | Off-peak 1 | 0.46 | 2 | 8 | 0.20 | 0.11
Bangkok | Wednesday | Off-peak 2 | 0.68 | 6 | 4 | 0.60 | 0.30
Bangkok | Friday | Peak 1 | 0.68 | 10 | 7 | 0.59 | 0.45
Bangkok | Saturday | Peak 2 | 0.73 | 10 | 7 | 0.59 | 0.59
Hong Kong | Tuesday | Off-peak 1 | 0.57 | 5 | 8 | 0.38 | 0.28
Hong Kong | Wednesday | Off-peak 2 | 0.41 | 2 | 11 | 0.15 | 0.10
Hong Kong | Friday | Peak 1 | 0.43 | 1 | 6 | 0.14 | 0.06
Hong Kong | Saturday | Peak 2 | 0.43 | 1 | 6 | 0.14 | 0.06
KL | Tuesday | Off-peak 1 | 0.62 | 5 | 6 | 0.45 | 0.29
KL | Wednesday | Off-peak 2 | 0.68 | 6 | 5 | 0.55 | 0.35
KL | Friday | Peak 1 | 0.61 | 5 | 6 | 0.45 | 0.28
KL | Saturday | Peak 2 | 0.55 | 4 | 7 | 0.36 | 0.21
Seoul | Tuesday | Off-peak 1 | 0.32 | 2 | 25 | 0.07 | 0.10
Seoul | Wednesday | Off-peak 2 | 0.59 | 10 | 17 | 0.37 | 0.56
Seoul | Friday | Peak 1 | 0.54 | 7 | 16 | 0.30 | 0.41
Seoul | Saturday | Peak 2 | 0.41 | 5 | 18 | 0.22 | 0.22
Singapore | Tuesday | Off-peak 1 | 0.53 | 7 | 17 | 0.29 | 0.41
Singapore | Wednesday | Off-peak 2 | 0.39 | 3 | 21 | 0.13 | 0.17
Singapore | Friday | Peak 1 | 0.50 | 6 | 17 | 0.26 | 0.33
Singapore | Saturday | Peak 2 | 0.44 | 4 | 19 | 0.17 | 0.24
Tokyo | Tuesday | Off-peak 1 | 0.54 | 6 | 12 | 0.33 | 0.32
Tokyo | Wednesday | Off-peak 2 | 0.47 | 4 | 14 | 0.22 | 0.22
Tokyo | Friday | Peak 1 | 0.54 | 4 | 9 | 0.31 | 0.24
Tokyo | Saturday | Peak 2 | 0.47 | 3 | 10 | 0.23 | 0.16
BIBLIOGRAPHY
R. Boin, W. Coleman, D. Delfassy, G. Palombo. "How airlines can gain a competitive edge through pricing". McKinsey and Company, article (December 2017)
G. Cawley, N. Talbot. "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation". Journal of Machine Learning Research 11 (2010), pp. 2079-2107
R. Culkin, S. R. Das. "Machine Learning in Finance: The Case of Deep Learning for Option Pricing". Santa Clara University (August 2017)
O. Etzioni, C. Knoblock, R. Tuchinda, A. Yates. "To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003
B. Fritz, Y. Chen, T. Murray-Torres, et al. "Using Machine learning techniques to develop forecast algorithms for postoperative complications: protocol for a retrospective study". BMJ Open, Volume 8 (2018), e020124. doi:10.1136/bmjopen-2017-020124
W. Groves and M. Gini. "A regression model for predicting optimal purchase timing for airline tickets". Technical Report 11-025, Department of Computer Science and Engineering, University of Minnesota, October 2011
A. Hussain. "Travel website cookies milk you for dough". Sunday Times (24th June 2018)
A. Lantseva, K. Mukhini, A. Nikishova, S. Ivanov, K. Knyazkov. "Data-driven modelling of Airline Pricing". YSC 2015, 4th International Young Scientists Conference on Computational Science. Procedia Computer Science, Volume 66, 2015
J. Lu. "Machine learning modelling for time series problem: Predicting flight ticket prices". School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne (2018)
P. Malighetti, S. Paleari, R. Redondi. "Pricing Strategies of low-cost airlines: The Ryanair case study". Journal of Air Transport Management 15 (2009)
M. Papadakis. "Predicting Airfare Prices". Stanford University, 2014
R. Toplensky. "EU eases competition rules for state aid to regional airports". Financial Times (17th May 2017)
K. Tziridis, T. Kalampokas, G. Papakostas. "Airfare Prices Prediction Using Machine Learning Techniques". 25th European Signal Processing Conference (EUSIPCO), 2017