My MSc dissertation, in which the optimal time to purchase flights from London to seven cities in Asia is predicted based solely on five weeks of pricing data, scraped from the web, prior to departure.
The analysis shows that, despite incomplete data, experimenting with different machine learning techniques makes it possible to predict optimal purchase times for flights to two cities, saving 7-8% on average expected air fares over this period.
These articles were published in CAMA under the following titles: 1. Accuracy of Forecasting Models (Coefficient of Determination vs. Signal Tracking); 2. Head-to-Head Analysis: A320 Family vs. B737NG (Value Analysis); 3. Forecasting by Objectives (Airport Forecasting).
Neural networks have found a wide variety of applications in today's world, driving the development of numerous models for financial markets and investment. This paper presents an approach to predicting share prices using an artificial neural network with a given set of stock market input parameters. Because the stock market is dynamic in nature, predicting share prices is very difficult with standard prediction or computation methods. The main reason is that there is no linear relationship between market parameters and the target closing price. Since input patterns and the corresponding output patterns are not linearly related, a neural network is a natural choice for share market prediction.
Statistics And Probability Tutorial | Statistics And Probability for Data Sci... (Edureka!)
YouTube Link: https://youtu.be/XcLO4f1i4Yo
** Data Science Certification using R: https://www.edureka.co/data-science **
This session on Statistics And Probability will cover all the fundamentals of stats and probability along with a practical demonstration in the R language.
I completed this project as part of my internship capstone at Learnbay. The task was to predict the flight fares for multiple flights in India. Visualizing the data helped me find trends and correlations between the independent and target variables. In the following step, I developed a model to predict the price.
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY (IJDKP)
Flight delay has been a persistent problem for the world's aviation industry, so research into computer systems that predict flight delay propagation is highly significant. Extracting hidden information from large volumes of raw data is one way to build a predictive model. This paper describes the application of classification techniques to analyse flight delay patterns in an Egyptian airline's flight dataset. In this work, four decision tree classifiers were evaluated, and the results show that REPTree has the best accuracy (80.3%) compared with Random Forest, Decision Stump and J48. In addition, four rule-based classifiers were compared, and the results show that PART provides the best accuracy among the studied rule-based classifiers, at 83.1%. By analysing the running time of all the classifiers, the work concludes that REPTree is the most efficient classifier with respect to both accuracy and running time. The work is also extended to apply the Apriori association technique to extract important information about flight delay. Association rules are presented and the association technique is evaluated.
Business Decision Making Part Two, QNT275.docx (RAHUL126667)
Business Decision Making Part Two
QNT275
Descriptive statistics are used to present data sets in the form of meaningful summaries, which makes the important patterns in the data observable. Descriptive statistics may not be useful for drawing final conclusions about the data. They are mainly used to describe quantitative data, as they involve numerical calculations. The main descriptive statistics are measures of central tendency and measures of variability. Measures of central tendency express the central position of a data set; measures of variability, on the other hand, represent the spread of the values in a data set.
In the case of the American Airlines Group, the research involves both quantitative and qualitative data. The operational costs, which represent the dependent variable, can be understood by studying the operational changes that result from the merger. This includes quantitative data on the number of passengers with access to the airline's services. The descriptive statistics that could be used to summarize these data include the mean, mode, and median. The mean represents the average number of passengers using the airline's services over a given period of time, for example one day. The mode represents the most frequently recurring number in the data set: if data were collected over a period of one month, the number of passengers recorded on the most days would be the mode. The median, on the other hand, represents the central value after the data have been arranged in ascending or descending order. Descriptive statistics could also be used to summarize data on the financial capability of the merger, obtained from an audit of the airline's financial data. The measures of variability that could be used in this research include the range, variance, standard deviation, quartiles, and absolute deviation. These measures describe the consistency of the data by presenting its variability (Holcomb, 2017).
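As an illustration of these summary measures, the following sketch computes them with Python's standard `statistics` module on hypothetical daily passenger counts (the numbers are invented for the example):

```python
import statistics

# Hypothetical daily passenger counts for one week (illustrative only)
passengers = [1200, 1350, 1200, 1500, 1420, 1200, 1600]

mean = statistics.mean(passengers)       # central tendency: the average
median = statistics.median(passengers)   # middle value after sorting
mode = statistics.mode(passengers)       # most frequently occurring value
rng = max(passengers) - min(passengers)  # variability: the range
stdev = statistics.stdev(passengers)     # sample standard deviation

print(median, mode, rng)  # 1350 1200 400
```

Here 1,200 passengers is the mode because it occurs on three of the seven days, while the median is the fourth-smallest count.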
Inferential statistics involve making generalizations about a population using facts from a sample. These statistics are useful where the population under study is large; in that case, it is most feasible to select a small group to represent the population. The inferential statistics that can be used in the analysis of the data from the American Airlines Group merger include parameter estimation and hypothesis testing. Parameter estimation involves approximating population parameters using calculated sample statistics; for example, the population mean may be estimated by the sample mean. Because it is economical to study only a small group, parameter estimation is highly useful for making inferences about the whole population (Bernstein & Bernstein, 2011).
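The idea of estimating a population mean from a sample can be sketched as follows, using invented passenger counts and a normal-approximation confidence interval:

```python
import math
import statistics

# Hypothetical sample of daily passenger counts drawn from a much
# larger population of operating days (illustrative numbers)
sample = [1310, 1285, 1402, 1350, 1297, 1368, 1330, 1412, 1289, 1357]

n = len(sample)
sample_mean = statistics.mean(sample)         # point estimate of the population mean
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# Approximate 95% confidence interval (normal approximation; a
# t-critical value would be more exact for a sample this small)
low, high = sample_mean - 1.96 * se, sample_mean + 1.96 * se
print(round(sample_mean, 1))  # 1340.0
```

The interval `[low, high]` quantifies how far the unknown population mean plausibly lies from the sample mean, which is the essence of parameter estimation.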
Hypothesis testing involves testing the accuracy of a claim about the population based on the sample under study. Othe ...
Artificial Intelligence and Stock Marketing (ijsrd.com)
Business intelligence is becoming an important trend in the financial world. One such area is stock market intelligence, which makes use of data mining techniques such as association, clustering, artificial neural networks, decision trees, genetic algorithms, expert systems and fuzzy logic. These techniques can be used to predict stock prices or trading signals automatically with adequate accuracy. Although a great deal of research has been done in this area, many issues have not yet been explored, and it is not clear to new researchers where and how to begin. Data mining can be applied to past and present financial data to generate patterns and decision-making systems. This paper gives a brief overview of several attempts made by researchers at stock prediction, focusing on stock market analysis, and defines a new research area, referred to as stock market intelligence: developing data mining techniques to support all aspects of algorithmic trading, and suggesting a number of research problems in stock intelligence related to forecasting and its accuracy.
Benchmarking data mining approaches for traveler segmentation (IJECEIAES)
The purpose of this study is to propose a hybrid data mining solution for traveler segmentation in the tourism domain, which can be used for planning user-oriented trips, arranging travel campaigns or similar services. The data set used in this work was provided by a travel agency and contains travelers' flight and hotel bookings. Initially, the data set was prepared for running data mining algorithms. Then, various machine learning algorithms were benchmarked on traveler segmentation and prediction tasks. Fuzzy C-means and X-means algorithms were applied for clustering the user data, and J48 and multilayer perceptron (MLP) algorithms were applied for classifying instances based on the segmented user data. According to the findings of this study, J48 yields the most effective classification results when applied to the data set clustered with the X-means algorithm. The proposed hybrid data mining solution can be used by travel agencies to plan trip campaigns for similar travelers.
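The cluster-then-classify idea can be illustrated with a deliberately tiny sketch: a crude two-centroid clustering stands in for X-means, and a nearest-centroid rule stands in for the J48/MLP classifiers (the one-dimensional spend values are invented):

```python
# Toy cluster-then-classify sketch on 1-D "total spend" values (illustrative only)
spend = [120.0, 130.0, 125.0, 900.0, 950.0, 880.0]

# Step 1: a crude two-centroid k-means assigns each traveler a segment
c1, c2 = min(spend), max(spend)
for _ in range(10):
    g1 = [x for x in spend if abs(x - c1) <= abs(x - c2)]
    g2 = [x for x in spend if abs(x - c1) > abs(x - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)

# Step 2: a nearest-centroid rule classifies new travelers into segments,
# standing in for the classifiers trained on the clustered data
def segment(x):
    return "budget" if abs(x - c1) <= abs(x - c2) else "premium"

print(round(c1), round(c2), segment(200.0))  # 125 910 budget
```

The real pipeline uses multidimensional booking features and proper X-means/J48 implementations, but the two-stage structure is the same: cluster labels from stage 1 become the training targets for stage 2.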
TOURISM DEMAND FORECASTING MODEL USING NEURAL NETWORK (ijcsit)
Travel agencies should be able to judge the market demand for tourism in order to develop sales plans accordingly. However, many travel agencies lack this ability and thus make risky business decisions. Accordingly, this study applied an artificial neural network combined with a genetic algorithm (GA) to establish a prediction model of air ticket sales revenue. The GA was used to determine the optimum number of input and hidden nodes of a feedforward neural network. The empirical results suggested that the mean absolute relative error (MARE) between the proposed hybrid model's predicted air ticket sales revenue and the actual value was 10.51%, with a correlation coefficient of 0.913. The proposed model had good predictive capability and could provide travel agency operators with reliable and highly efficient analysis data.
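The MARE figure reported above is simply the average of the absolute prediction errors relative to the actual values; a minimal sketch with invented revenue figures:

```python
# Mean absolute relative error (MARE), the accuracy measure the study reports
def mare(actual, predicted):
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical monthly revenue figures (not the paper's data)
actual    = [100.0, 120.0, 80.0, 150.0]
predicted = [ 90.0, 132.0, 88.0, 135.0]
print(round(mare(actual, predicted), 3))  # each prediction is off by 10%, so MARE is 0.1
```

A MARE of 10.51% therefore means the model's revenue predictions deviated from the actual figures by about a tenth of their value on average.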
COMPARISON OF BANKRUPTCY PREDICTION MODELS WITH PUBLIC RECORDS AND FIRMOGRAPHICS (cscpconf)
Many business operations and strategies rely on bankruptcy prediction. In this paper, we aim to study the impact of public records and firmographics, predicting bankruptcy over a 12-month-ahead horizon using different classification models and adding value to the traditionally used financial ratios. Univariate analysis shows the statistical association and significance of public records and firmographics indicators with bankruptcy. Further, seven statistical models and machine learning methods were developed, including logistic regression, decision tree, random forest, gradient boosting, support vector machine, Bayesian network, and neural network. The performance of the models was evaluated and compared based on classification accuracy, Type I error, Type II error, and ROC curves on the hold-out dataset. Moreover, an experiment was set up to show the importance of oversampling for rare-event prediction. The results also show that the Bayesian network is comparatively more robust than the other models without oversampling.
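Oversampling for rare events, as used in the experiment above, can be as simple as duplicating minority-class rows until the classes balance; a sketch of that idea (the paper's exact resampling scheme is not specified here, so this is an assumption for illustration):

```python
import random

# Duplicate randomly chosen minority-class rows until both classes
# have the same number of examples (random oversampling).
def oversample(rows, labels, minority=1, seed=0):
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_n = sum(1 for y in labels if y != minority)
    extra = [rng.choice(minority_rows)
             for _ in range(majority_n - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]  # 4 solvent firms, 1 bankrupt
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))  # 4 4
```

Without such balancing, a classifier can achieve high accuracy by always predicting "solvent", which is exactly the failure mode the oversampling experiment is designed to expose.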
Better efficiency of a country's air transport system at the national level, especially in terms of its capacity to generate value for passenger flow and cargo transport, effectively depends on identifying the demand-generation potential of each hub for this type of service. This requires mapping the passenger flow and cargo volume of each region served by the system, along with the number of connections. The main goal of this study was to identify important factors that account for the great variability (demand) of the regional hubs of the airport system in operation in the State of São Paulo, the most populated and industrialized state in the Southeast region of Brazil. For this purpose, datasets of passenger and cargo flows for each airport were obtained from time series data covering the period from January 1, 2008 to December 31, 2014. Different data analysis approaches can yield a better mapping of the flows in the air transport system by evaluating factors related to operations and volume. Therefore, different statistical models, such as multiple linear regression with normal errors and new stochastic volatility (SV) models, are introduced in this study to provide a better view of the operation of the four main regional hubs within the large group of 32 airports reported in the dataset.
Can we predict how Airline fares to cities in Asia will change closer to departure?
MSc Business Analytics: Dissertation
Departure area of London Heathrow Terminal 5 (Google images)
Name: Karim Awad
ABSTRACT
This paper examines airline prices departing from London Heathrow to seven cities in North and
South-East Asia. Over a five-week period prior to departure, we collect online pricing data on
fourteen successive one-week return trips and construct pricing curves to understand how airfares
evolve.
We then employ machine learning techniques, ranging from logistic regression through to a simple
neural network, to determine whether this pricing behaviour can be predicted, and to identify
opportune moments to purchase economy-class tickets for flights within our test set.
Using ensemble techniques, we are able to predict when to purchase peak and off-peak flights to
Bangkok and Kuala Lumpur with reasonable accuracy, generating savings of £56-70 relative to
average flight prices. This is despite facing a series of challenges connected to the quality of the
underlying data used.
INTRODUCTION
When is the optimal time to book a flight? Should a consumer book several weeks in advance, or wait
for a "last-minute deal"? This issue has long troubled consumers, spawning price comparison
websites that highlight the lowest price at a given moment without answering the wider question.
Underpinning this uncertainty, airlines and travel vendors employ dynamic pricing models to optimise
aircraft utilisation, discovering daily market-clearing prices that aim to maximise flight revenue
(Lantseva et al, 2015).
This often generates pricing volatility, which can be pronounced as the departure date draws closer,
and between pairs of flights that leave shortly before or after one another. Pricing can also be
affected by competition amongst airlines, the choice of seating class, seasonal factors, and whether
flights to a destination are direct or indirect (Etzioni et al, 2003).
Can we however identify patterns to airline pricing by examining how prices change prior to
departure? Are there non-parametric relationships that machine learning techniques can identify,
which are not apparent through structural analysis?
We examine long-haul flights from London to several cities in Asia: Beijing (BEI), Bangkok (BKK), Hong
Kong (HK), Kuala Lumpur (KL), Seoul (SEU), Singapore (SGP), and Tokyo (TKY). This involves looking at
one-week return flights over a two-week period between 23rd July and 6th August 2018, offering 14
flight pairs to evaluate for each city. We accumulate pricing data on each flight at least 5 weeks prior
to each departure, with data collected twice a day. This methodology is discussed further within our
data collection section.
With data on approximately 50 variables at each point in time (e.g. prices for different airlines, number
of stops, class of transport, different airports in the London region), this provides a potential dataset
of over 49,000 entries.
We partition this dataset into training, validation, and test sets, before applying machine learning
techniques. This involves methods that reduce dimensionality (e.g. Regularisation, Principal
Component Analysis (PCA), Random Forest, Support Vector Machines), and considering results from
algorithms that reduce residual error (e.g. AdaBoost) or better capture non-linearity (neural network).
This paper will undertake a brief literature review, before discussing data collection, processing, and
methodology considerations. We shall then illustrate the results of descriptive analysis, before
undertaking the machine learning approaches discussed above.
LITERATURE REVIEW
Etzioni et al (2003) serves as a useful starting point and has motivated several other articles. Intended
as a pilot study, it examined non-stop, one-week returns for two flight pairs: Los Angeles to Boston,
and Seattle to Washington DC. This was done for departures during January 2003, where prices were
tracked at 3-hour intervals, starting 21 days prior to departure, resulting in 12,000 price observations
over a 41-day period.
They proceed to use a variety of approaches, from a moving-average statistical approach, through to
a RIPPER classification algorithm, Q-learning, and a combination of these three results, using a stacking
generaliser, referred to as HAMLET. These algorithms seek to identify optimal points to purchase a
flight, introducing penalties subject to how near the departure date falls, and the cost of
misclassification (i.e. failing to buy a cheap ticket, and paying more later). Savings are measured
relative to the initial flight cost 21 days prior to departure.
Etzioni et al (2003) found that HAMLET performed best, generating total net savings of $198,074, with
its decisions being optimal 61.8% of the time, relative to having perfect knowledge. They believe this
accuracy level could be higher, as they employed a "…uniform distribution of passengers, (where) 33%
of the passengers arrived at most 7 days before the flight's departure, when savings are hard to come
by…" (p.7, Etzioni et al (2003)).
Groves and Gini (2011) expand on this work by using over 60 days of data prior to departure, applying
both lagged OLS and Partial Least Squares techniques to identify a subset of explanatory variables
from searches covering multiple departure dates for two domestic US flight pairs. This is contrasted
with a naïve approach (i.e. immediate purchase) relative to an optimal approach (perfect
information).
Their results are shown for business trips (Monday-Friday round trips) and low-cost trips (Thursday-
Tuesday round trips), with the aim of out-performing a naïve approach. Their methods confirm this,
albeit by virtue of the small number of buy signals their models generate. They do observe that pricing
volatility is more apparent on routes with less competition, premised on airlines there having more
pricing power.
Tziridis, Kalampokas, and Papakostas (2017) seek to apply a wide variety of machine learning
approaches, but only on a single-ticket flight-pair between Thessaloniki (Greece) and Stuttgart
(Germany) between December and July 2017. These approaches include Multilayer Perceptron,
Neural Networks, Regression Trees with bagging, and Regression Support Vector Machines, amongst
others, with 10-fold cross validation used to train these models.
Their analysis concludes that ordinary, bagging, and random forest regression trees, along with
multilayer perceptron techniques yield consistent results when omitting different features, with
accuracy levels > 80%. This seems surprisingly high when relying on pricing data alone, but may
highlight potential consistency on flights with low levels of flight traffic, and the value in collecting
historical data.
Intriguingly, their analysis also points to accuracy being marginally higher when excluding features,
which included “day of week” of departure, but lower when time of departure and arrival is excluded.
Although their analysis is not directly comparable to the above, it highlights how anecdotal factors
may not yield as much explanatory power; possibly favouring a Random Forest approach.
There are several other articles that seek to apply a wide range of machine learning techniques, both
supervised and unsupervised, to predicting flight prices, predominantly looking at different flight
pairs. This extends to Papadakis (2014), who focused on five significant US airport hubs; Lantseva et al
(2015), who looked at domestic and international pricing from Russian airports; and Lu (2018), who
looks at 8 flight pairs within Europe with trips to and from the same departing airport relative to the
recipient city. Their results all point to different machine learning techniques assisting with
prediction.
Lu’s (2018) article does seek to find a prediction technique that works across routes, and thus focuses
on the variance achieved by each method as a discriminating factor, along with examining Mean-
Squared Error (MSE). Furthermore, in contrast to other authors, he directly addresses imbalances in
the underlying dataset, given the small number of “buy” decisions likely to arise. This involves
performing K-Means and Expectation Maximisation cluster analysis to identify and remove outliers
between buy and wait decisions, along with over-sampling buy decisions in training his algorithms.
More generally, how do we measure the correct time to "buy" or "wait"? The articles above have
focused either on contrasting a starting price and predicting any reduction, on buying when the next
period is predicted to bring a price increase, or on measuring purchasing accuracy against perfect
foresight.
Boin et al (2017) wrote that airlines will need to continue unbundling elements of the travel
experience (e.g. bags on-board, meals, etc), such that additional revenues can be generated in future.
When examining Ryanair’s revenue model, Malighetti et al (2015) note that specific functional
parameters used to derive prices have likely already been determined, with their analysis revealing
prices follow a hyperbola as departure draws close.
This suggests there is a minimum average price, allowing for discounts, whilst sales near to departure
are aimed at boosting overall yields. Understanding historical average prices over a period prior to
departure, may thus help consumers avoid contributing to the airline’s supernormal profits.
DATA CONSIDERATIONS
Data considerations
Obtaining pricing data proved immensely difficult. There are no formal, public repositories of data
documenting prices prior to departure. Those that do exist either provide information only on US
outbound flights (www.faredetective.com), or charge for access to their airfare database
(www.atpco.com); the latter quoting $5,000 for academic purposes, when approached.
Price comparison websites were not initially helpful. Skyscanner, the UK market leader, refused to
provide API access. Other price comparison websites (e.g. www.farecompare.com,
www.expedia.co.uk, etc) only provided price alerts, which would be inadequate in assembling a
database. Informal sources (e.g. those who have collected pricing data) were available, but these were
cross-sectional in nature (i.e. prices for a vast range of flights at one moment in time only), and lacking
in structure suitable for academic study.
Using Selenium, prices were scraped from www.kayak.com, with results tabulated for each location
and departure date. This captured prices for all known airlines operating routes to the location, by
different UK airports, number of stops, and class of flight. Selenium ensured each search was
anonymised, preventing beacon cookies from detecting repeated searches and unduly inflating
prices¹, especially with ticket-price customisation likely (Boin et al, 2017).
Intermediaries (e.g. online travel vendors) typically had advertised prices below those shown by the
corresponding airline. It was unclear if this was attributable to preferential rates, or prices being
artificially depressed. To avoid any misrepresentation, these vendors were excluded.
A more troubling issue was an inability to segregate price changes based on seat purchases relative to
changes in competitor prices. Plane capacity is not disclosed by airlines, resulting in being unable to
track capacity changes and thus deduce own-price and cross-price elasticities.
However, a slight benefit was that the 14 flights studied fell during peak season (Lantseva et al, 2015).
It was also felt that price falls shortly before departure (e.g. one week beforehand) would reflect
under-utilised flights, with prices normally expected to rise (Groves and Gini, 2011). This may generate
more volatility, which we can attempt to measure through an F-test between these two periods.
As data was collected, a further problem emerged, with Kayak providing incomplete entries, leading
to pricing gaps. This was beyond the author's control, resulting in some airlines seldom generating
quoted prices, along with a lack of data for business / first class tickets and non-stop flights.
Given the dynamic nature of pricing, using regression analysis to estimate and back-fill historical prices
was deemed spurious, given the idiosyncratic behaviour of airline prices. For some airlines, this led to
their exclusion, which was not ideal; however, it was felt this could be offset by including lagged
industry prices to mitigate any omitted variable bias. In most instances, prices were filled forward with
the preceding price, which artificially reduced volatility. Overall, these issues were likely to reduce
the effectiveness of our analysis, albeit permitting some limited analysis to still be undertaken.
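The gap-filling described above can be sketched with pandas, whose `ffill` carries the last observed quote forward. The prices below are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical twice-daily quotes for one airline, with NaN where
# Kayak returned no price.
prices = pd.Series([620.0, np.nan, np.nan, 655.0, np.nan, 640.0])

# Carry the last observed quote forward, as done in the text;
# note this flattens the series and artificially reduces volatility.
filled = prices.ffill()
print(filled.tolist())  # [620.0, 620.0, 620.0, 655.0, 655.0, 640.0]
```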
A final problem was deriving suitable training, validation and test sets. In contrast to cross-sectional
or time-series datasets, each flight is a discrete, finite event. It was thus decided to have one week of
training data, three days as a validation set (e.g. Monday, Thursday, and Sunday) to capture off-peak,
peak, and Sunday demand, with the remaining four days used to test our algorithms (two days each for
peak and off-peak). This followed our descriptive analysis, with a similar logic apparent in Groves and
Gini (2011), who found prices are lower for departures between Tuesday and Thursday, and higher
from Thursday through to Saturday (p.3).

¹ "Travel website cookies milk you for dough", Sunday Times (24th June 2018)
Data collection
Concentrating on long-haul flights, these routes were expected to be more competitive than EU short-
haul flights, where pricing could be distorted by regional airport incentives in the EU, and by airports
being located far apart within a city, which may not provide a like-for-like comparison². With long-haul
flights combining direct and transfer passengers, it was hoped this would minimise under-utilised
flights, and thus price changes from demand shortfalls that could distort our analysis. This led to a
focus on airports that serve as significant regional transfer hubs, with Asia being an area of interest.
Prices were recorded both during the morning and evening, to understand if timing contributed to
any difference. With dynamic prices likely to change far more frequently intra-day, this was not ideal,
although it was still felt sufficient to detect general pricing movements.
We accumulate five weeks of historical data for each flight pair prior to departure. This period was
based purely on practical considerations in starting the dissertation in late June. In practice, additional
data would have been helpful in examining pricing volatility going further back in time.
For airlines with incomplete data, we employ a threshold of 40% completeness across our training and
test set flight pairs to determine their inclusion, consistent with Groves and Gini (2011). As this is an
average, there are still flight pairs with minimal pricing disclosure (e.g. due to having no flights that
leave on a specific day), but it provides a subset where analysis can be undertaken.
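The 40% completeness filter can be sketched with pandas, using hypothetical airline price columns where NaN marks a missing quote:

```python
import numpy as np
import pandas as pd

# Hypothetical price columns per airline; NaN marks missing quotes.
df = pd.DataFrame({
    "BA-price": [640, 652, np.nan, 648, 655],   # 80% complete
    "AF-price": [np.nan, np.nan, np.nan, 610, np.nan],  # 20% complete
})

# Keep airlines whose average completeness meets the 40% threshold.
completeness = df.notna().mean()
kept = completeness[completeness >= 0.4].index.tolist()
print(kept)  # ['BA-price']
```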
Overall, despite very significant impediments in data completeness and quality, it was felt some
meaningful analysis could still follow.
² "EU eases competition rules for state aid to regional airports", Financial Times (17th May 2017)
DESCRIPTIVE ANALYSIS
Figure 1 below highlights some of the relationships we observe across economy class flights to our
targeted cities. Prices appear positively skewed towards the bottom quartile, based on median prices.
This is consistent with prices typically being low, but rising closer towards departure, which is
confirmed when examining line graphs of the same data (not shown).
The level of variance in pricing does vary by city. The interquartile ranges for Bangkok and Kuala
Lumpur are small, with differences typically less than £100, although with a significant number of
outliers where prices exceed the maximum tail of our boxplots. This may point to prices rising close to
departure, where there is merit in advanced booking.
By contrast, Beijing has the widest interquartile range but with few outliers, highlighting high levels of
volatility. This suggests value may be derived from later-stage bookings, potentially resulting from
under-utilised plane capacity.
The remaining cities demonstrate behaviour between the two instances above, although HK,
Singapore, Tokyo and Seoul all demonstrate significant outliers for weekend departures.
Departures between Monday to Wednesday appear cheapest, when examining median prices, across
all seven cities. Prices broadly increase between Thursday – Saturday, although do fall on Sunday by
varying levels. When predicting future prices, we shall distinguish between these groups of days, as
this appears consistent across all cities.
The cheapest destination to visit, based on median prices, appears to be Bangkok, with prices between
£600-800 across the week. However, both Beijing and Singapore, when considering tail minimums, do
offer opportunities for ticket prices below £600.
Conversely, median prices indicate that both Tokyo and Seoul are the most expensive cities to visit,
with economy prices approximately between £1,000 - £1,200. This may reflect reduced competition
on these routes, along with geographically being on the periphery of Asia, and thus not benefiting
from as much traffic compared to other hubs.
Data on business class flights is patchy and inconsistent, with data-points carried forward where
intermittent data exists. Line graphs of these results can be found in Appendix 1. Although not
conclusive, business class prices appear to fall closer to departure for Beijing, Bangkok, and HK, with
KL and Singapore prices rising instead, and insufficient data available for Seoul and Tokyo.
A similar lack of data prevails when examining non-stop economy flights (Appendix 2). Both Bangkok
and Tokyo lack sufficient data for conclusions to be drawn. KL, HK and Singapore all demonstrate price
increases nearer to departure, with Beijing and Seoul displaying volatility but being range-bound. The
price levels quoted may be above actual levels, but this suggests potential interchangeability, with
business class flights being cheaper in specific instances.
Figure 1: Boxplot of Economy ticket prices for 7 cities in Asia departing from London
There is a lot of variability in the airline data collected. Applying the 40% threshold discussed earlier
reveals only 10 airlines where pricing data is consistently available across most of our 7 cities: Air
France (AF), British Airways (BA), Emirates (EK), Etihad (EY), KLM (KL), SwissAir (LX), Malaysia Airlines
(MH), Philippine Airlines (PR), Thai Airways (TG), and Vietnam Airlines (VN). Even looking at routes in
isolation, data for direct flights (e.g. Cathay Pacific to Hong Kong, Air China to Beijing, Korean Air to
Seoul, ANA and Nippon Airlines to Tokyo) was not available. This data may still be implicit within non-
stop economy fares, although it may limit our analysis of such competitors if this data series is highly
correlated with British Airways (the only airline in our dataset that flies direct to these locations).
Summary statistics were also examined for differences in airline prices by departure from different
London airports (not shown). This included Heathrow, City Airport, Gatwick, Luton, Stansted, and
Southend. As this data was incomplete, it was felt it would not offer much insight into any competitive
dynamic between departing London airports.
We examine separate correlation heatmaps examining travel class and number of stops respectively,
before tabulating these results for all our variables, for each day within our training set. Figure 2 shows
the results obtained for Thursday departures, although results are not too dissimilar across the week.
We exclude variables with insufficient data, as noted above.
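The heatmaps in Figure 2 are built from a pairwise correlation matrix; a minimal sketch with pandas follows, using hypothetical standardised price series (the actual analysis used ~73 variables per departure day):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical standardised price series for three routes over 68 time points.
prices = pd.DataFrame({
    "BKK-econ": rng.normal(size=68),
    "KL-econ": rng.normal(size=68),
    "SGP-econ": rng.normal(size=68),
})

# Pairwise Pearson correlations; a heatmap simply colour-codes this matrix.
corr = prices.corr()
print(corr.shape)  # (3, 3)
```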
Figure 2: Correlation heatmaps by class and stops for Thursday
We observe strong pricing correlations for flight pairs involving Hong Kong, KL, Seoul and Singapore.
Positive correlations are observed to be strongest based on proximity (e.g. Seoul and Tokyo) or
commercial links (e.g. Singapore and Hong Kong). Negative correlations are noted for Beijing and
Bangkok business class, with no immediate explanation apparent. Moreover, the interaction between
economy and business class flights to the same destination varies from quite weak (e.g. Bangkok)
through to quite strong (e.g. Singapore), highlighting the variable nature of these relationships.

Looking at stops, we would expect to find strong correlations amongst stops to the same destination,
with weak correlations elsewhere. As we can see, this does not hold (e.g. HK and KL), with no
immediate logic apparent.
We can also see how flight prices are correlated both amongst each airline's departure destinations,
and between airlines (not shown). Figure 3 examines same-flight correlations exceeding 0.7 for KL, as
an example, where significant correlations do exist for other destinations.
Figure 3: Same airline correlations for Thursday departures from LHR
Note: Airline for City Comparison should be read across for each of the destinations on the x-axis
Combined with the above, it is evident that multicollinearity exists not only between airlines, but
within airline pricing to different destinations, and by class. This does not consider any interaction
between different departure dates.
A further question is whether prices behave differently several weeks prior to departure, relative to
1-2 weeks beforehand. With a small time-series available, we perform an F-test on prices prior to, and
starting, two weeks before departure. After standardisation, the results (not shown) illustrate that
only Beijing and Tokyo cannot reject our null hypothesis of equal variance. We shall thus introduce
rolling windows, testing permutations involving 2, 3, and 4 weeks to account for any parameter
instability.
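The equal-variance F-test can be sketched as follows. The price samples are hypothetical and standardised; the two-sided p-value comes from scipy's `f` distribution:

```python
import numpy as np
from scipy import stats

def variance_f_test(a, b):
    """Two-sided F-test for equal variances between two samples."""
    f = np.var(a, ddof=1) / np.var(b, ddof=1)
    dfa, dfb = len(a) - 1, len(b) - 1
    # Two-sided p-value: double the smaller tail probability.
    p = 2 * min(stats.f.sf(f, dfa, dfb), stats.f.cdf(f, dfa, dfb))
    return f, p

# Hypothetical standardised prices: >2 weeks out vs within 2 weeks of departure.
early = np.array([-1.2, 0.4, 0.8, -0.5, 0.1, 0.9, -0.7, 0.3])
late = np.array([-2.5, 1.8, 3.1, -1.9, 2.4, -3.0, 1.1, -2.2])

f_stat, p_value = variance_f_test(early, late)
print(f_stat < 1, p_value < 0.05)  # late-period variance is clearly larger here
```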
Arriving Airport | Airline (for city comparison) | BEI | BKK | HKG | KL | SEU | SGP | TKY
BEI | AF-price | 1 | 0.3 | 0.7 | 0.9 | NaN | 0.4 | 0.5
KL | AF-price | 0.9 | 0.1 | 0.5 | 1 | NaN | 0.5 | 0.5
KL | BA-price | 0.5 | 0.5 | 0.7 | 1 | 0 | 0.8 | 0.7
KL | EK-price | 0.6 | 0.6 | 0.6 | 1 | 0.5 | 0.8 | 0.6
KL | EY-price | 0.4 | 0.6 | 0.5 | 1 | 0.5 | 0.3 | 0.4
KL | KL-price | 0.3 | 0.3 | -0.2 | 1 | 0.4 | 0.4 | 0.6
KL | MH-price | 0.3 | 0.1 | 0 | 1 | 0.2 | 0.6 | 0.3
KL | PR-price | NaN | 0.2 | 0.6 | 1 | 0.5 | 0.6 | 0.2
KL | TG-price | 0.3 | 0.6 | 0.2 | 1 | 0.4 | 0.6 | 0.3
KL | VN-price | NaN | 0.5 | 0.7 | 1 | 0.7 | 0.4 | 0
SGP | BA-price | 0.5 | 0.6 | 0.6 | 0.8 | 0.1 | 1 | 0.6
SGP | EK-price | 0.6 | 0.6 | 0.7 | 0.8 | 0.6 | 1 | 0.6
MACHINE LEARNING ANALYSIS
Methodology
We focus on the lowest quartile of economy class prices achieved. This aims to strike a balance
between having enough samples to train our algorithms and identifying flights that are cheaper
than average. We generate a binary signal to denote purchase. Although not as intuitive as directly
predicting price curves, this further facilitates using support vector machine and neural network
techniques.
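Generating the binary buy signal from the lowest quartile can be sketched with pandas, on a hypothetical price series:

```python
import pandas as pd

# Hypothetical price series for one flight pair over the training window.
prices = pd.Series([620, 580, 575, 640, 700, 560, 610, 690])

# Buy signal = 1 when the quoted price falls in the lowest quartile.
q25 = prices.quantile(0.25)
buy = (prices < q25).astype(int)
print(q25, buy.tolist())  # 578.75 [0, 0, 1, 0, 0, 1, 0, 0]
```

Switching the strict `<` to `<=` reproduces the "including vs excluding the 25th percentile level" distinction discussed below, which changes how many buy signals plateau points contribute.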
The lumpy nature of prices for some flight pairs (i.e. quotes being static over several days, or back-
filling) results in plateau points where several prices sit exactly at the 25th percentile. This leads to
our sample of buy signals being below or above 25% of the sample, depending on whether this level
is included. This is most extreme for Wednesday's departure to Seoul, where we have 17 prices at this
level, of which we reduce 4 sample points by £10 to avoid extreme imbalances. Overall, we examine
both including and excluding the 25th percentile level, with this a crude adjustment mechanism
should our samples be too imbalanced.
Despite eliminating some explanatory variables, we have 73 variables remaining for each departure
day, on a total time-series of 68 points. All explanatory variables are normalised, which though
convenient for our analysis, may not be consistent with the underlying distributions observed, possibly
undermining overall accuracy.
The quality of our analysis rests on dimensionality reduction. We initially use a logistic regression with
one-norm regularisation to induce sparsity, along with using a Lasso regression. This is expected to
provide a starting-point for prediction, with non-linear and endogenous relationships likely to exist,
based on our descriptive analysis. These aspects may undermine the quality of these predictions.
We then use principal component analysis to transform our data. Although this eliminates any
endogeneity amongst our variables, it may reduce any correlation between time points. This is
underlined by our scree plots (not shown), where an elbow point of k = 4 or 5 is apparent across our
training set, but which accounts for only 55-60% of variance. Dimensionality is raised to k = 8 to
account for 70% of variance, which is close to the square root of the number of explanatory variables,
despite the downside of adding higher dimensionality to a small dataset.
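Choosing k from cumulative explained variance can be sketched with scikit-learn. The data here is random with the dissertation's shape (68 time points, 73 variables), so the variance profile — and hence the chosen k — will differ from the real dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical normalised training set: 68 time points x 73 variables.
X = rng.normal(size=(68, 73))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose components explain at least 70% of the variance.
k = int(np.searchsorted(cumulative, 0.70)) + 1
print(k, round(cumulative[k - 1], 3))
```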
We then apply ridge regression to our PCA dataset, optimising our hyper-parameters. Given the
possible lack of time-dependency, we separately experiment with over-sampling (SMOTE), to mitigate
any sample imbalance. Finally, given our small subset of variables, we also train the AdaBoost
algorithm to minimise error amongst our boosted samples.
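An AdaBoost fit on a PCA-reduced, imbalanced sample can be sketched with scikit-learn. The data is synthetic (47 training points, k = 8 components, ~25% buy signals) and stands in for the dissertation's own features:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
# Hypothetical PCA-transformed training set: 47 points, k = 8 components.
X = rng.normal(size=(47, 8))
# Imbalanced binary buy/wait target (~25% buys), constructed deterministically.
y = np.array([1 if i % 4 == 0 else 0 for i in range(47)])

# Boosted decision stumps: each round re-weights misclassified points.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (47,)
```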
Beyond PCA, we employ both random forest and Support Vector Machine analysis. We experiment
with different hyperparameters to optimise our classification trees, which are anticipated to better
capture any non-linearities present. The SVM model uses different kernels to transform our dataset,
to better delineate amongst our binary outcomes. This is contrasted with a simple neural network,
which is trained at different drop-out rates to identify suitable network depth.
Depending on how these models perform, either one or several will be applied to our test set. An
ensemble approach based on majority voting will determine whether tickets should be purchased for
a given location. This is similar to the stacking approach used by Etzioni et al (2003) to improve
accuracy.
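The majority-voting ensemble can be sketched directly with numpy; the per-model predictions below are hypothetical:

```python
import numpy as np

# Hypothetical binary buy/wait predictions from three models for five flights.
pred_adaboost = np.array([1, 0, 1, 0, 1])
pred_ridge_smote = np.array([1, 1, 0, 0, 1])
pred_nn = np.array([0, 1, 1, 0, 1])

# Buy only when a majority (2 of 3) of models agree.
votes = pred_adaboost + pred_ridge_smote + pred_nn
ensemble = (votes >= 2).astype(int)
print(ensemble.tolist())  # [1, 1, 1, 0, 1]
```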
Model parameters and approach
Cross validation, or out-of-bag sampling for Random Forest, was used wherever possible in
conjunction with rolling windows of 2, 3, and 4 weeks. This was intended to reduce the variance of
our algorithms, and improve parameter stability respectively. Cross validation was performed on a
rolling basis from a starting-point, testing our hyperparameters before determining optimal window
size.
There were instances where rolling windows were not used. This was premised either on using the
largest sample to identify data points for error reduction (AdaBoost), on time dependency between
observations being felt weak (e.g. after using PCA), or on rolling windows not being suited to the
algorithm (e.g. SVM and neural networks).
Where possible, algorithms were trained to avoid optimising for accuracy, as sample imbalances were
likely to favour no-purchase signals. A ROC-AUC score was preferred, reflecting both true positives
and true negatives, but this was often not possible due to our samples being too imbalanced (i.e. no
buy signals present) or shortcomings within SKlearn generating errors. Mean-squared error was used
in most cases (when undertaking regression), often in conjunction with a high cost-weight in favour
of purchasing tickets, which formed part of our hyperparameter testing (see below).
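The single-class failure mode mentioned above — ROC-AUC being undefined when a validation day contains no buy signals — can be guarded against with a small wrapper (a sketch; the function name is our own):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_score):
    """ROC-AUC, or None when only one class is present (as happens
    for flight pairs with no buy signals)."""
    if len(np.unique(y_true)) < 2:
        return None
    return roc_auc_score(y_true, y_score)

# Degenerate sample: no buy signals at all.
print(safe_roc_auc(np.array([0, 0, 0, 0]), np.array([0.2, 0.4, 0.1, 0.3])))  # None
# Balanced sample where all buys score above all waits.
print(safe_roc_auc(np.array([0, 1, 0, 1]), np.array([0.2, 0.8, 0.3, 0.6])))  # 1.0
```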
The one exception was our Random Forest, where neither cost-weights nor a non-accuracy scoring
measure could be applied. This was attributed to our imbalanced samples, with the code working in
some instances, but with too many omissions likely. Although this significantly compromised potential
performance, this algorithm was still used as a baseline.
The hyperparameters tested varied but, besides cost-weights, included our regularisation parameter
(for Lasso and Ridge), drop-out rate (for the neural network), and kernel type, gamma, and our
regularisation constraint (for SVM). For Random Forest, we examined minimum samples per leaf and
maximum tree depth, and set maximum features to the square root of the total number of
explanatory variables. These hyperparameters were all tested on our training set, before being
applied to our validation set.
As predictions were not confined to binary results, a threshold level was used to distinguish between
such events. This was arbitrarily set slightly below 0.5 (at 0.45), allowing some latitude in forming
judgements on our imbalanced samples. Although this introduces bias, it appears consistent with the
approach adopted by Groves and Gini (2011).
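The thresholding step is a one-liner with numpy; the model scores below are hypothetical:

```python
import numpy as np

# Hypothetical continuous model outputs for five flights.
scores = np.array([0.30, 0.46, 0.52, 0.44, 0.61])

# Classify as "buy" at a threshold slightly below 0.5, as in the text.
decisions = (scores >= 0.45).astype(int)
print(decisions.tolist())  # [0, 1, 1, 0, 1]
```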
Results – validation
For validation, we examine both the ROC-AUC scores obtained from our optimised hyperparameters,
and overall specificity and accuracy (the latter not shown), given the likely sample imbalance
expected.
When considering our samples below the threshold of our first quartile, we obtain the following
results shown in figures 4a and 4b. It is evident the imbalanced nature of our sample has affected the
performance of our algorithms. This is surprising when considering the high class-weights applied to
favour purchasing a ticket, and with the algorithms trained not to focus on accuracy.
The performance of our Logistic and Lasso regressions is not unexpected, along with that of the
Random Forest, given the non-linear relationships the former were not expected to identify and the
manner in which the latter was trained (discussed above). Both SVM and Neural Network techniques
appear to have underperformed, given their ability to capture more complex relationships.
Applying PCA before using AdaBoost, or employing SMOTE to over-sample our population before
Ridge regularisation, seems to have performed relatively well when examining ROC and specificity
scores. This is premised on some predictions being recorded for each departure day within our
training set, with AdaBoost generating higher accuracy levels. In absolute terms, ROC and specificity
scores remain quite low, although Beijing and KL do achieve marginally higher levels than other cities.
Within these scores, we do observe instances of consistency across our algorithms. Most algorithms
generate ROC scores between 0.6-0.85 when examining Sunday flight prices for Bangkok and Seoul
(not shown). Applying SMOTE to our PCA sample generates ROC scores exceeding 0.6 for Bangkok,
KL, Seoul, and Tokyo on the same day. This may highlight that our training/validation procedure of
categorising peak and off-peak days, and validating on days before or within a week of departure,
may be less effective than examining earlier departures on the same day.
Figure 4a – Specificity averaged by Monday-Sunday for each city: below 25th percentile level
Figure 4b – ROC-AUC score averaged by Monday-Sunday for each city: below 25th percentile level
Note: For calculation purposes, results where ROC scores <= 0.5 were marked as 0.5
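The adjustment in the note can be expressed directly: any model doing worse than chance is floored at the no-skill level of 0.5 before averaging. A one-line illustration with hypothetical per-day ROC scores:

```python
raw_roc = [0.62, 0.44, 0.58, 0.31]             # illustrative per-day ROC scores
adjusted = [max(score, 0.5) for score in raw_roc]
print(adjusted)  # [0.62, 0.5, 0.58, 0.5]
```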
To address under-sampling, we adjust our price threshold to include prices both below and at the 25th percentile level. This typically results in more observations than 25% of our sample. The subsequent results can be seen in Figures 5a and 5b.
Adjusting our sample has improved average specificity and marginally increased ROC scores, with our PCA techniques benefiting most, along with the specificity scores of our neural network. These increases have not been universal: some departures for specific cities show a decline in performance based on ROC score and specificity. Accuracy levels have fallen across all techniques, highlighting an increase in false negatives.
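Moving from strictly-below to at-or-below the 25th percentile captures more than a quarter of observations whenever fares tie at that level, which is how the adjusted threshold eases under-sampling. A short numpy sketch with illustrative prices:

```python
import numpy as np

# 20 illustrative fares: six tie at the low end of the distribution
prices = np.array([100] * 6 + list(range(110, 250, 10)))
p25 = np.percentile(prices, 25)

below = prices < p25         # strict threshold: misses every tied observation
at_or_below = prices <= p25  # adjusted threshold used to address under-sampling

print(p25, below.mean(), at_or_below.mean())  # 30% of the sample is now labelled
```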
Figure 4a data – specificity by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.10 | - | - | 0.16 | 0.24 | 0.30 | 0.21 | 0.15
BKK | 0.03 | 0.10 | 0.10 | 0.31 | 0.24 | 0.20 | 0.12 | 0.29
HKG | 0.04 | - | - | 0.08 | 0.22 | 0.12 | 0.10 | 0.23
KL | - | - | - | - | 0.24 | 0.43 | 0.19 | 0.15
SEU | 0.30 | 0.08 | 0.02 | 0.16 | 0.26 | 0.27 | 0.53 | 0.21
SGP | 0.02 | 0.07 | - | 0.26 | 0.21 | 0.21 | 0.18 | 0.08
TKY | - | 0.01 | - | 0.23 | 0.24 | 0.28 | 0.05 | 0.08
Average | 0.07 | 0.04 | 0.02 | 0.17 | 0.24 | 0.26 | 0.20 | 0.17
Figure 4b data – ROC-AUC score by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.53 | 0.50 | 0.50 | 0.52 | 0.60 | 0.54 | 0.54 | 0.54
BKK | 0.50 | 0.52 | 0.55 | 0.56 | 0.55 | 0.52 | 0.51 | 0.59
HKG | 0.50 | 0.50 | 0.50 | 0.51 | 0.53 | 0.52 | 0.50 | 0.55
KL | 0.50 | 0.50 | 0.50 | 0.50 | 0.58 | 0.55 | 0.52 | 0.55
SEU | 0.53 | 0.53 | 0.50 | 0.54 | 0.55 | 0.54 | 0.57 | 0.54
SGP | 0.50 | 0.50 | 0.50 | 0.54 | 0.55 | 0.53 | 0.51 | 0.50
TKY | 0.50 | 0.50 | 0.50 | 0.53 | 0.55 | 0.52 | 0.51 | 0.51
Average | 0.51 | 0.51 | 0.51 | 0.53 | 0.56 | 0.53 | 0.52 | 0.54
Figure 5a – Specificity averaged by Monday-Sunday for each city: below and at 25th percentile level
Figure 5b – ROC-AUC score averaged by Monday-Sunday for each city: below and at 25th percentile level
Note: Results where ROC scores <= 0.5 were marked as 0.5
Examining these results, SMOTE on PCA generates predictions for each departure day, with AdaBoost making predictions for every day barring three occasions.
For Bangkok, Hong Kong, and KL, SMOTE on PCA performs well on our off-peak validation day (i.e. departing on Monday), with Tuesday departures typically generating a higher-than-average ROC score. AdaBoost instead generates ROC scores above 0.65 when considering Wednesday departures for Beijing and KL.
For peak departure days, we observe AdaBoost achieving ROC scores exceeding 0.6 for Saturday departures to Beijing and Singapore. This might capture some interaction with our validation day the following Thursday, with a possible substitution effect. The observations for Sunday above remain valid. Full results are shown in Appendices 3 and 4.
Given these results, SMOTE on PCA generates the highest ROC scores, and there is merit in examining it further. However, with AdaBoost, and Ridge PCA without SMOTE, achieving higher specificity scores, we shall also contrast this with an ensemble approach, with majority voting determining whether a flight should be purchased.
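The majority vote can be sketched simply: each of the three models named above casts a buy (1) / wait (0) vote per observation, and a ticket is flagged for purchase only when at least two agree. The vote lists below are illustrative, not model output:

```python
def majority_vote(*model_predictions):
    """Combine per-observation buy (1) / wait (0) votes from several models."""
    return [1 if sum(votes) > len(votes) / 2 else 0
            for votes in zip(*model_predictions)]

# Hypothetical per-observation votes from the three candidate models
adaboost    = [1, 0, 1, 0]
ridge_pca   = [1, 1, 0, 0]
ridge_smote = [0, 1, 1, 0]
print(majority_vote(adaboost, ridge_pca, ridge_smote))  # [1, 1, 1, 0]
```

Requiring agreement between models trades recall for specificity: fewer buy signals are emitted, but each is corroborated.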
Figure 5a data – specificity by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.19 | 0.10 | - | 0.21 | 0.31 | 0.51 | 0.22 | 0.41
BKK | 0.25 | 0.14 | 0.20 | 0.56 | 0.41 | 0.40 | 0.29 | 0.38
HKG | 0.13 | 0.07 | 0.02 | 0.22 | 0.24 | 0.34 | 0.31 | 0.28
KL | 0.21 | 0.08 | 0.05 | 0.26 | 0.38 | 0.49 | 0.30 | 0.52
SEU | 0.13 | 0.20 | 0.08 | 0.22 | 0.27 | 0.27 | 0.42 | 0.15
SGP | 0.11 | 0.02 | 0.04 | 0.33 | 0.26 | 0.35 | 0.17 | 0.22
TKY | 0.03 | 0.01 | 0.01 | 0.41 | 0.31 | 0.23 | 0.09 | 0.20
Average | 0.15 | 0.09 | 0.06 | 0.32 | 0.31 | 0.37 | 0.26 | 0.31
Figure 5b data – ROC-AUC score by machine-learning algorithm:

Country | Logistic w/L1 | Lasso | Random Forest | Ridge w/PCA | Ridge w/PCA + SMOTE | AdaBoost | SVM | Neural Nets
BEI | 0.55 | 0.54 | 0.50 | 0.55 | 0.59 | 0.58 | 0.54 | 0.55
BKK | 0.57 | 0.52 | 0.58 | 0.60 | 0.60 | 0.56 | 0.52 | 0.54
HKG | 0.51 | 0.50 | 0.50 | 0.53 | 0.54 | 0.51 | 0.52 | 0.52
KL | 0.55 | 0.53 | 0.52 | 0.53 | 0.59 | 0.59 | 0.52 | 0.56
SEU | 0.54 | 0.54 | 0.50 | 0.54 | 0.56 | 0.53 | 0.55 | 0.51
SGP | 0.51 | 0.50 | 0.50 | 0.53 | 0.52 | 0.53 | 0.50 | 0.51
TKY | 0.50 | 0.50 | 0.50 | 0.54 | 0.53 | 0.50 | 0.51 | 0.50
Average | 0.53 | 0.52 | 0.52 | 0.55 | 0.56 | 0.54 | 0.52 | 0.53
Results – testing
With four test days available, we train our algorithms on each respective peak/off-peak day within our training set before juxtaposing with our test days. This results in three training days each being tested against a test day, which helps contrast whether predictions are affected by departures not leaving on the same day during the preceding week. The underlying results are shown in Appendix 5.
Across these three training days, we adopt an ensemble approach through majority voting to determine whether a flight should be purchased. Appendix 6 illustrates the results when solely examining Ridge PCA with SMOTE.
Figure 6 – Ensemble results on test data: Combining three models
City | Departure day | Test period | ROC | Correct predictions | Incorrect predictions | Specificity | Recall
Beijing | Tuesday | Off-peak 1 | 0.37 | - | 1 | - | -
Beijing | Wednesday | Off-peak 2 | 0.83 | 1 | - | 1.00 | 0.04
Beijing | Friday | Peak 1 | 0.59 | 1 | 1 | 0.50 | 0.05
Beijing | Saturday | Peak 2 | 0.31 | - | 2 | - | -
Bangkok | Tuesday | Off-peak 1 | 0.62 | 1 | 1 | 0.50 | 0.06
Bangkok | Wednesday | Off-peak 2 | 0.86 | 2 | - | 1.00 | 0.10
Bangkok | Friday | Peak 1 | 0.54 | 2 | 3 | 0.40 | 0.09
Bangkok | Saturday | Peak 2 | 0.69 | 3 | 2 | 0.60 | 0.18
Hong Kong | Tuesday | Off-peak 1 | 0.51 | 2 | 5 | 0.29 | 0.11
Hong Kong | Wednesday | Off-peak 2 | 0.42 | 1 | 6 | 0.14 | 0.05
Hong Kong | Friday | Peak 1 | 0.47 | 1 | 4 | 0.20 | 0.06
Hong Kong | Saturday | Peak 2 | 0.47 | 1 | 4 | 0.20 | 0.06
KL | Tuesday | Off-peak 1 | 0.37 | - | 2 | - | -
KL | Wednesday | Off-peak 2 | 0.89 | 2 | - | 1.00 | 0.12
KL | Friday | Peak 1 | 0.88 | 2 | - | 1.00 | 0.11
KL | Saturday | Peak 2 | 0.87 | 2 | - | 1.00 | 0.11
Seoul | Tuesday | Off-peak 1 | 0.34 | - | 6 | - | -
Seoul | Wednesday | Off-peak 2 | 0.54 | 2 | 4 | 0.33 | 0.11
Seoul | Friday | Peak 1 | 0.45 | 1 | 5 | 0.17 | 0.06
Seoul | Saturday | Peak 2 | 0.31 | - | 6 | - | -
Singapore | Tuesday | Off-peak 1 | 0.61 | 4 | 5 | 0.44 | 0.24
Singapore | Wednesday | Off-peak 2 | 0.48 | 2 | 7 | 0.22 | 0.11
Singapore | Friday | Peak 1 | 0.57 | 2 | 3 | 0.40 | 0.11
Singapore | Saturday | Peak 2 | 0.58 | 2 | 3 | 0.40 | 0.12
Tokyo | Tuesday | Off-peak 1 | 0.49 | 4 | 11 | 0.27 | 0.21
Tokyo | Wednesday | Off-peak 2 | 0.46 | 3 | 12 | 0.20 | 0.17
Tokyo | Friday | Peak 1 | 0.63 | 2 | 2 | 0.50 | 0.12
Tokyo | Saturday | Peak 2 | 0.48 | 1 | 3 | 0.25 | 0.05

Both ROC and specificity scores would suggest some predictive capability when examining Bangkok and Kuala Lumpur. However, even scores of 0.68-0.73 are relatively low insofar as reducing uncertainty with these predictions is concerned. Notwithstanding some successful predictions elsewhere, there is too much uncertainty for these to prove reliable.
There are several results where ROC scores fall below 0.5, implying improvements could be achieved by undertaking the opposite action to what is suggested. This could indicate scope for model improvements, some of which were highlighted above, as well as reflecting the nature of our dataset (discussed further below).
An ensemble approach across the three algorithms mentioned should refine performance further, helping eliminate noise and spurious predictions. Results are shown in Figure 6.
Average ROC and specificity scores within each city are broadly higher. Although the ensemble generates far fewer predictions, as evidenced by low recall rates (i.e. relative to the total number of actual buy signals), we achieve high specificity levels for peak flights to KL, with the potential to achieve the same for off-peak. Results for Bangkok are also encouraging, with reasonable accuracy attained. Average accuracy levels exceed the 61.8% attained by Etzioni et al (2003).
Figure 7 highlights that our predictions save between 7.0-8.1% versus average flight costs to Bangkok and KL. On a next-period-ahead basis, these predictions vary on average from an additional cost of 1.1% for Bangkok through to an 8.3% saving for KL.
Results for the remaining cities are lacklustre. The models perform poorly when examining Seoul, and generate too many incorrect predictions elsewhere for our models to be relied upon. This is disappointing, although it may be remedied by more comprehensive datasets.
Figure 7 – Additional cost savings / (expense) from Bangkok and KL predictions

Bangkok:
Test period | Predicted price (£) | Period-ahead price (£) | Average over period (£) | Saving vs next (£) | Saving vs average (£)
Off-peak 1 | 734 | 734 | 782 | - | 48
Off-peak 1 | 741 | 741 | 782 | - | 41
Off-peak 2 | 741 | 741 | 796 | - | 55
Off-peak 2 | 741 | 741 | 796 | - | 55
Peak 1 | 813 | 813 | 847 | - | 34
Peak 1 | 813 | 625 | 847 | (188) | 34
Peak 1 | 865 | 806 | 847 | (59) | (18)
Peak 1 | 805 | 805 | 847 | - | 42
Peak 1 | 785 | 865 | 847 | 80 | 62
Peak 2 | 649 | 648 | 751 | (1) | 102
Peak 2 | 649 | 649 | 751 | - | 102
Peak 2 | 711 | 711 | 751 | - | 40
Peak 2 | 635 | 635 | 751 | - | 116
Peak 2 | 685 | 734 | 751 | 49 | 66
Average saving / (cost) (£) | | | 796 | (9) | 56
Average saving / (cost) (%) | | | | -1.1% | 7.0%

Kuala Lumpur:
Test period | Predicted price (£) | Period-ahead price (£) | Average over period (£) | Saving vs next (£) | Saving vs average (£)
Off-peak 1 | 799 | 1104 | 866 | 305 | 67
Off-peak 1 | 798 | 997 | 866 | 199 | 68
Off-peak 2 | 797 | 798 | 858 | 1 | 61
Off-peak 2 | 782 | 811 | 858 | 29 | 76
Peak 1 | 911 | 847 | 943 | (64) | 32
Peak 1 | 715 | 969 | 943 | 254 | 228
Peak 2 | 881 | 849 | 901 | (32) | 20
Peak 2 | 897 | 774 | 901 | (123) | 4
Average saving / (cost) (£) | | | 858 | 71 | 70
Average saving / (cost) (%) | | | | 8.3% | 8.1%
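The Bangkok percentages in Figure 7 follow from simple averages over the fourteen predictions; the sketch below reproduces the saving-vs-average and saving-vs-next figures directly from the table's Bangkok rows:

```python
# (predicted price, period-ahead price, average over period) per Bangkok prediction,
# taken from the Bangkok rows of Figure 7
bangkok = [
    (734, 734, 782), (741, 741, 782), (741, 741, 796), (741, 741, 796),
    (813, 813, 847), (813, 625, 847), (865, 806, 847), (805, 805, 847),
    (785, 865, 847), (649, 648, 751), (649, 649, 751), (711, 711, 751),
    (635, 635, 751), (685, 734, 751),
]
avg_period = sum(avg for _, _, avg in bangkok) / len(bangkok)
saving_vs_avg = sum(avg - pred for pred, _, avg in bangkok) / len(bangkok)
saving_vs_next = sum(nxt - pred for pred, nxt, _ in bangkok) / len(bangkok)
print(round(100 * saving_vs_avg / avg_period, 1))   # 7.0% saving vs average fares
print(round(100 * saving_vs_next / avg_period, 1))  # -1.1% on a next-period basis
```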
CONCLUSIONS
Limitations
A considerable constraint has been the quantity and quality of data. Additional departure data dating back several weeks would have helped provide adequate training and validation datasets. This was not feasible without earmarking further time beyond the scope of this assignment, especially with data collection commencing in late June.
The underlying quality of the data could have been further improved, given the gaps within our dataset. It should be recalled that a low threshold of only 40% complete data was used to inform our variables, highlighting limited disclosure. Back-filling inevitably reduced the explanatory power of our variables. Notwithstanding this, it also highlighted the difficulty of collecting raw, "real-world" data; within this context, it is pleasing we were still able to identify the pricing relationships above.
Summary
Despite data limitations, our study has shown a capacity to predict peak and off-peak airfares for departures to Bangkok and Kuala Lumpur. This also includes Sunday departures, albeit based on our validation datasets, as we had insufficient data to test prior Sunday departures.
Going forward, it would be interesting to evaluate how airfares are affected by flights within two days prior to or after departure, and how sensitive they are to competitor behaviour. It would be ideal to combine this with demand-based data to better approximate own- and cross-price elasticities.
A further extension would be a more comprehensive analysis of business-class flights, especially as these tickets entail far more flexible conditions than economy. The descriptive analysis highlighted instances where prices fell quite significantly in the lead-up to departure, potentially heralding value if flights can be cancelled and re-booked at lower airfares.
APPENDICES
Appendix 1: Line graph of business class tickets for 7 cities in Asia departing from London
Note: Days include AM and PM observations, with 40 days of data shown
Appendix 2: Line graph of non-stop tickets for 7 cities in Asia departing from London
Note: Days include AM and PM observations, with 40 days of data shown
Appendix 6 – Ensemble results on test data: Ridge PCA with SMOTE
City | Departure day | Test period | ROC | Correct predictions | Incorrect predictions | Specificity | Recall
Beijing | Tuesday | Off-peak 1 | 0.37 | - | 3 | - | -
Beijing | Wednesday | Off-peak 2 | 0.66 | 2 | 1 | 0.67 | 0.08
Beijing | Friday | Peak 1 | 0.51 | 3 | 6 | 0.33 | 0.14
Beijing | Saturday | Peak 2 | 0.35 | 1 | 8 | 0.11 | 0.04
Bangkok | Tuesday | Off-peak 1 | 0.46 | 2 | 8 | 0.20 | 0.11
Bangkok | Wednesday | Off-peak 2 | 0.68 | 6 | 4 | 0.60 | 0.30
Bangkok | Friday | Peak 1 | 0.68 | 10 | 7 | 0.59 | 0.45
Bangkok | Saturday | Peak 2 | 0.73 | 10 | 7 | 0.59 | 0.59
Hong Kong | Tuesday | Off-peak 1 | 0.57 | 5 | 8 | 0.38 | 0.28
Hong Kong | Wednesday | Off-peak 2 | 0.41 | 2 | 11 | 0.15 | 0.10
Hong Kong | Friday | Peak 1 | 0.43 | 1 | 6 | 0.14 | 0.06
Hong Kong | Saturday | Peak 2 | 0.43 | 1 | 6 | 0.14 | 0.06
KL | Tuesday | Off-peak 1 | 0.62 | 5 | 6 | 0.45 | 0.29
KL | Wednesday | Off-peak 2 | 0.68 | 6 | 5 | 0.55 | 0.35
KL | Friday | Peak 1 | 0.61 | 5 | 6 | 0.45 | 0.28
KL | Saturday | Peak 2 | 0.55 | 4 | 7 | 0.36 | 0.21
Seoul | Tuesday | Off-peak 1 | 0.32 | 2 | 25 | 0.07 | 0.10
Seoul | Wednesday | Off-peak 2 | 0.59 | 10 | 17 | 0.37 | 0.56
Seoul | Friday | Peak 1 | 0.54 | 7 | 16 | 0.30 | 0.41
Seoul | Saturday | Peak 2 | 0.41 | 5 | 18 | 0.22 | 0.22
Singapore | Tuesday | Off-peak 1 | 0.53 | 7 | 17 | 0.29 | 0.41
Singapore | Wednesday | Off-peak 2 | 0.39 | 3 | 21 | 0.13 | 0.17
Singapore | Friday | Peak 1 | 0.50 | 6 | 17 | 0.26 | 0.33
Singapore | Saturday | Peak 2 | 0.44 | 4 | 19 | 0.17 | 0.24
Tokyo | Tuesday | Off-peak 1 | 0.54 | 6 | 12 | 0.33 | 0.32
Tokyo | Wednesday | Off-peak 2 | 0.47 | 4 | 14 | 0.22 | 0.22
Tokyo | Friday | Peak 1 | 0.54 | 4 | 9 | 0.31 | 0.24
Tokyo | Saturday | Peak 2 | 0.47 | 3 | 10 | 0.23 | 0.16
BIBLIOGRAPHY
R. Boin, W. Coleman, D. Delfassy, G. Palombo. "How airlines can gain a competitive edge through pricing". McKinsey and Company, article (December 2017)
G. Cawley, N. Talbot. "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation". Journal of Machine Learning Research 11 (2010), pp. 2079-2107
R. Culkin, S. R. Das. "Machine Learning in Finance: The Case of Deep Learning for Option Pricing". Santa Clara University (August 2017)
O. Etzioni, C. Knoblock, R. Tuchinda, A. Yates. "To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003
B. Fritz, Y. Chen, T. Murray-Torres, et al. "Using Machine learning techniques to develop forecast algorithms for postoperative complications: protocol for a retrospective study". BMJ Open, Volume 8 (2018), e020124. doi:10.1136/bmjopen-2017-020124
W. Groves and M. Gini. "A regression model for predicting optimal purchase timing for airline tickets". Technical Report 11-025, Department of Computer Science and Engineering, University of Minnesota, October 2011
A. Hussain. "Travel website cookies milk you for dough". Sunday Times (24th June 2018)
A. Lantseva, K. Mukhini, A. Nikishova, S. Ivanov, K. Knyazkov. "Data-driven modelling of Airline Pricing". YSC 2015, 4th International Young Scientists Conference on Computational Science. Procedia Computer Science, Volume 66, 2015
J. Lu. "Machine learning modelling for time series problem: Predicting flight ticket prices". School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne (2018)
P. Malighetti, S. Paleari, R. Redondi. "Pricing Strategies of low-cost airlines: The Ryanair case study". Journal of Air Transport Management 15 (2009)
M. Papadakis. "Predicting Airfare Prices". Stanford University, 2014
R. Toplensky. "EU eases competition rules for state aid to regional airports". Financial Times (17th May 2017)
K. Tziridis, T. Kalampokas, G. Papakostas. "Airfare Prices Prediction Using Machine Learning Techniques". 25th European Signal Processing Conference (EUSIPCO), 2017