Data Warehousing and Business Intelligence
Project
on
Airfare Analysis of Domestic Airlines in U.S.
Abhishek Surendra Dahale
x17170311
MSc Data Analytics – 2018/9
Submitted to: Dr. Horacio Gonzalez-Velez
National College of Ireland
Project Submission Sheet – 2017/2018
School of Computing
Student Name: Abhishek Surendra Dahale
Student ID: x17170311
Programme: MSc Data Analytics
Year: 2018/9
Module: Data Warehousing and Business Intelligence
Lecturer: Dr. Horacio Gonzalez-Velez
Submission Due Date: 26/11/2018
Project Title: Airfare Analysis of Domestic Airlines in U.S.
I hereby certify that the information contained in this (my submission) is information
pertaining to my own individual work that I conducted for this project. All information
other than my own contribution is fully and appropriately referenced and listed in the
relevant bibliography section. I assert that I have not referred to any work(s) other than
those listed. I also include my TurnItIn report with this submission.
ALL materials used must be referenced in the bibliography section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
another author's written or electronic work is an act of plagiarism and may result in
disciplinary action. Students may be required to undergo a viva (oral examination) if
there is suspicion about the validity of their submitted work.
Signature:
Date: November 25, 2018
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. You must ensure that you retain a HARD COPY of ALL projects, both for
your own reference and in case a project is lost or mislaid. It is not sufficient to keep
a copy on computer. Please do not bind projects or place in covers unless specifically
requested.
3. Assignments that are submitted to the Programme Coordinator office must be placed
into the assignment box located outside the office.
Office Use Only
Signature:
Date:
Penalty Applied (if applicable):
Table 1: Mark sheet – do not edit

Criteria       Mark Awarded   Comment(s)
Objectives     of 5
Related Work   of 10
Data           of 25
ETL            of 20
Application    of 30
Video          of 10
Presentation   of 10
Total          of 100
Project Check List
This section captures the core requirements that the project entails, represented as a
checklist for convenience.
Used LaTeX template
Three Business Requirements listed in introduction
At least one structured data source
At least one unstructured data source
At least three sources of data
Described all sources of data
All sources of data are less than one year old, i.e. released after 17/09/2017
Inserted and discussed star schema
Completed logical data map
Discussed the high-level ETL strategy
Provided 3 BI queries
Detailed the sources of data used in each query
Discussed the implications of results in each query
Reviewed at least 5-10 appropriate papers on the topic of the DWBI project
Airfare Analysis of Domestic Airlines in U.S.
Abhishek Surendra Dahale
x17170311
November 25, 2018
Abstract
Airfare analysis is a trending research topic, as air travel is one of the most used
modes of transportation worldwide. It is the fastest and most convenient means
of transport, connecting every corner of the globe. After the U.S. government
deregulated the airline industry, operating carriers started charging fares according
to the services offered and several other factors, which are used in this project for
the analysis. The focus of this project is to analyze the deciding factors behind
fare prices, which will help passengers make the right choice according to their
travel purpose and traveling cost. This analysis is developed on the basis of
Kimball's approach, also known as the bottom-up approach. It consists of three
stages: extraction of data, transformation of data, and finally loading of data into
the database. A cube was formed using data marts. A certain level of automation
was achieved, including automated deployment of the cube and integration with R.
The results were then used to form business queries that evaluate the business
goals of this research.
1 Introduction
In June 1997, the U.S. Department of Transportation released the first quarterly fare
report, for a quarter of 1996, in response to an increased number of customer inquiries
about airfares. The Department then started releasing a monthly Air Travel Consumer
Report, which includes information about flight delays, oversales, mishandled baggage
and various other consumer complaints. Airlines offer a wide variety of average fares,
which makes airfare analysis essential, so that customers can evaluate prices and book
the best tickets for their requirements. Average fare dependency varies in many ways:
it can differ from carrier to carrier, with the size of airport occupancy, and from city to
city. Customers have benefited greatly from the deregulation of the airline industry;
because of deregulation, airlines provide competitive fares and more services from one
destination to another. If an airline has a wide airfare range, that simply means a large
variety of fares is being offered in the market. Such airlines generally have a limited
number of low-fare seats, offered with many restrictions. In a low-fare market, fares
cluster near the average fare, because these airlines mostly carry passengers who prefer
low fares over paying a higher fare.
The motivation behind building this data warehouse and business intelligence system
is to analyze the reasons behind the wide range of airfares through in-depth analysis of
the deciding factors for average fare. The concept of dynamic pricing, a tool used by
the airline industry for revenue management, was used by Fang et al. (2017) to analyze
the revenue generated from different types of customers. In relation to this, we can
identify how average fare affects pricing strategy. Ferguson et al. (2009) performed an
analysis of airfares, studying how fuel prices affect rising airfares. In addition to this,
we can study more factors involved in pricing strategy, including the number of
passengers, the distance traveled, the origin and destination of the flight, and reviews
and feedback of customers.
The data warehouse built will be able to address the following requirements:
1. Which is the most popular fare class in U.S. domestic airlines?
2. What is the relation between the average fare paid by passengers and the distance
travelled for the corresponding airline?
3. Which is the most trustworthy domestic airline in the United States?
2 Data Sources
To build the data warehouse, data was gathered from three sources. The first two are
structured repositories that yield a large amount of airline-related data; the third is an
unstructured source. The data sources are described as follows:
Source      Type          Brief Summary
Transtats   Structured    Contains detailed information regarding domestic
                          airlines in the U.S., supporting the business
                          requirements.
Statista    Structured    Relevant data supporting the requirement related
                          to average airfare.
Twitter     Unstructured  Reviews extracted from Twitter, supporting the
                          corpus-based sentiment analysis.

Table 2: Summary of sources of data used in the project
2.1 Source 1: Transtats
Within the Department of Transportation (DOT), the Bureau of Transportation
Statistics (BTS) provides researchers with accurate and well-grounded data related
to transportation, which can help to invest in and achieve economic growth. This
data was made available in June 2018. The repository yielded a large amount of
data, which was cleaned using R; unwanted rows and columns were removed. The
R code for this is attached in the appendix.
This dataset consists of 36 columns, of which the attributes relevant to my business
requirements are:
• Year
• Quarter
• Origin State
• Destination State
• Fare Class
• Passengers
• Distance
URL: https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BCoupon_2018_1.zip
This dataset plays an important role in my data warehouse. All the business
requirements mentioned in Section 1 are supported by this data.
2.2 Source 2: Statista
This dataset provides details about the top 10 domestic airports in the United
States, fundamentally providing the average fare paid for these domestic flights.
This data was released in 2018. The data provided by Statista has a very limited
number of rows, which did not require cleaning, but it consists of two data sheets;
R code, attached in the appendix, removes the unused sheet and text. This dataset
includes the following attributes:
• Airport Code
• Average Fare
Here, the average fare paid for particular airlines is used, which answers my
business queries related to fare as mentioned in Section 1. Assumption: the Airline
column is assumed here, taken from the first data set.
URL: https://www.statista.com/statistics/642191/us-domestic-airports-lowest-aver
2.3 Source 3: Twitter (Unstructured)
With reference to the operating carriers in the data set provided by Transtats, I
used Twitter to gather reviews posted by consumers of domestic airlines in the
U.S., as these form a strong basis for future customers to opt for the best airline
with an affordable traveling cost. Sentiment analysis was performed on the data
gathered, so that the sentiments in the collected reviews could be analysed against
the corpus. An AWS (Amazon Web Services) instance was used to pull the data
from Twitter and store it in a MySQL database. This unstructured data was then
pulled into R; the RMySQL package, which works as a MySQL driver for R, helped
me gather this large volume of tweets about airlines. The data was then cleaned,
and sentiment analysis was performed for each airline using the syuzhet package.
Detailed code is attached in the appendix. The attributes of the unstructured data
comprise sentiment scores for:
• anger
• anticipation
• disgust
• fear
• joy
• sadness
• surprise
• trust
• negative
• positive
Considering the above corpus, I used the trust score for my analysis of the most
trustworthy airline, which people would rely on before choosing their carrier. I also
kept the other sentiment scores, which may be used further in my BI application.
URL: https://www.twitter.com/
3 Related Work
Air transport plays an important role in achieving economic growth and development,
providing vital connectivity across the globe. The Department of Transportation
(DOT) and the Bureau of Transportation Statistics help researchers analyze various
service quality elements related to airlines. These involve factors such as airfare,
flight-related services based on the category of class, flight operations, etc. The DOT
releases an Air Travel Consumer Report which includes all the above-mentioned
factors. These datasets were used to provide more in-depth analysis of the effect of
low-fare service on fares.
In addition to the above, Ferguson et al. (2009) referred to this type of data set for
analysis of airfare fluctuations. They used the data sets to study how fuel prices affect
rising airfares, also involving factors such as seasonality, the distance travelled by
passengers and other economic demands. Their analysis shows how an increasing
number of carriers decreases passenger demand. Using this type of data set, I can
analyze how average fare is affected by factors such as the number of passengers, the
operating carrier, the distance travelled, etc.
The concept of dynamic pricing, a tool used by the airline industry for revenue
management, was used by Fang et al. (2017). They analyzed the revenue that airlines
generate from different types of customers and studied the decision-making process
around the best and minimal prices, as well as the discounts to be offered to maximize
profits. The research also involved pricing strategy, designed using a Stackelberg
game. Similarly, Hao & Yu (2008) used a foundational game-theoretic model for
dynamic pricing. Referring to the above work, and since average airfare plays an
important role in pricing strategy, I use the average fare from the dataset in my
business requirements to analyze how other factors in the airline industry are decided
based on this fare.
Social media is the richest source of electronic word of mouth. Reviews, feedback and
comments posted on social media reflect the sentiments of users towards a certain
topic. Social media analytics therefore helps to automatically classify text messages
into sentiment categories (positive, negative, neutral). Wang et al. (2016) proposed a
method for fine-grained sentiment analysis yielding more detailed sentiments; it
described fine-grained sensing which included emotions as well as sentiments. This
gave me a better idea of how to perform sentiment analysis on reviews and feedback
for domestic airlines in the U.S. As per my business requirement stated in Section 1,
I used the trust score of airlines, which can easily help me identify the trustworthy
airlines.
4 Data Model
For building a data warehouse there are two approaches, those of William Inmon and
Ralph Kimball. For the implementation of this data warehouse I chose to follow Ralph
Kimball's bottom-up approach because, in terms of performance, query execution time
is lower, and the database is lightweight and does not contain any complexity.
Kimball's data warehouse supports dimensional data modelling, and the approach is
built around the end user's perspective. I found this approach fit for my project
because the results of the business queries will help people to know the best and most
trustworthy airline to choose, and will help to compare airlines with the lowest
travelling cost [Kimball & Ross (2013)].
The architecture for my data warehouse model is as follows:
Figure 1: Architecture (using Kimball's approach)
The schema I used for the implementation of my data warehouse is the star schema,
which can easily categorize the information in the warehouse and provide a perspicuous
view of the relations within it. As the data warehouse to be implemented is simple,
with a relatively small number of tables on which simple joins are performed, the star
schema is the better approach to follow. In this case, I have a fact table that contains
the primary keys of all the dimension tables, plus measures in the form of sentiment
scores, average fare, distance and number of passengers. The reasons for choosing the
star schema here are query and load performance, and that the structure of the data
warehouse can be easily understood [Chaudhuri & Dayal (1997)]. After performing a
join operation, 81,602 records were generated with very low execution time,
maximizing the performance of the data warehouse.
Figure 2 precisely delineates the star schema:
User queries form the basis of a star schema. The user would typically require specific
information about airline details, the sentiments related to the airlines, the average
fare paid, or the number of passengers traveling, so we can group this information into
tables under particular categories. For example, any factor related to time can be
found in the dimension table DimTime, which here includes the Year and Quarter
attributes of our data. Similarly, location-related information is clubbed into the
dimension table DimLocation, containing attributes such as City, Origin State and
Destination State. Airline-related information can be found in the dimension table
DimAirline, fare-related information in DimFare, and information regarding airline
sentiments in DimSentiment. Given these tables, there is no need to create sub-tables
for a specific dimension, which would lead to the formation of a snowflake schema.
Therefore we achieve a star schema by creating a simple and easily understood
structure for the above-mentioned dimension tables.
Figure 2: Star Schema
The key elements of the fact table are measures, which can be used to perform
different types of aggregation. The above-mentioned dimension tables play a vital role
in populating the measures in the fact table, which relates to the dimension tables
through one-to-many relationships. The fact table populated for the data warehouse
thus comprises the primary keys of the dimension tables, viz. Airline_id, Time_id,
Fare_id, Location_id and Sentiment_id. The measures in the fact table include the
average fare, distance, sentiment scores and number of passengers [Moody & Kortink
(2000)].
As per the business requirements stated in Section 1, the business relation of the
measures is as follows (a sketch of possible table definitions follows the list):
• Average Fare: as per the requirement stated in Section 1, the average fare is used
to analyze the relation with distance.
• Trust Score: based on the trust sentiment score, used to identify the most
trustworthy airlines in the U.S.
• Distance: as per the 2nd requirement, distance was used to analyze the dependency
of fare on it.
• Passengers: as per the 1st requirement, used to demonstrate the number of
passengers travelling in different airline classes, e.g. business or economy class.
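The warehouse tables themselves were created in SSMS, so the following is only a
minimal sketch of what the star schema's tables might look like, issued from R via
DBI/odbc. The DSN, table names, column names and data types are assumptions
based on Figure 2 and Table 3, not the exact DDL used.

# Minimal DDL sketch for the star schema of Figure 2, sent from R via DBI.
# All names and types are illustrative assumptions; the real tables were built in SSMS.
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), dsn = "AIRLINES")  # hypothetical DSN for the database

dbExecute(con, "CREATE TABLE DimAirline (
  Airline_id INT IDENTITY(1,1) PRIMARY KEY,
  Opcarrier  VARCHAR(10))")

dbExecute(con, "CREATE TABLE FactTable (
  Airline_id   INT,           -- keys of the dimension tables
  Time_id      INT,
  Fare_id      INT,
  Location_id  INT,
  Sentiment_id INT,
  AverageFare  DECIMAL(10,2), -- measures
  Distance     INT,
  Passengers   INT,
  trust        FLOAT)")
# DimTime, DimFare, DimLocation and DimSentiment follow the same pattern.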
5 Logical Data Map
Table 3: Logical Data Map describing all transformations, sources and destinations for
all components of the data model illustrated in Figure 2

Source | Column                       | Destination               | Type      | Transformation
1      | Opcarrier                    | DimAirline.Opcarrier      | Dimension | Missing values removed using is.na() and na.omit()
1      | Distance                     | Fact.Distance             | Fact      | Numeric values; rows with missing data removed
1      | Year                         | DimTime.Year              | Dimension | No transformation required (single year, 2018); missing values removed using na.omit()
1      | Quarter                      | DimTime.Quarter           | Dimension | Single value (1, for the 1st quarter); no transformation required
1      | OriginState                  | DimLocation.OriginState   | Dimension | Missing values removed using na.omit()
1      | DestState                    | DimLocation.DestState     | Dimension | No transformation required; missing values removed using na.omit()
1      | Origin                       | DimLocation.Origin        | Dimension | No transformation required; missing values removed using na.omit()
1      | Dest                         | DimLocation.Dest          | Dimension | No transformation required; missing values removed using na.omit()
2      | Average fare in U.S. dollars | DimFare.AverageFare       | Fact      | No cleaning required
3      | carrier_name                 | DimSentiment.carrier_name | Dimension | No transformation required
1      | FareClass                    | DimFare.FareClass         | Dimension | No transformation needed; only missing values removed
1      | Airline                      | DimFare.Airline           | Dimension | Missing values removed using is.na() and na.omit()
3      | anticipation                 | FactTable.anticipation    | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | fear                         | FactTable.fear            | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | joy                          | FactTable.joy             | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | sadness                      | FactTable.sadness         | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | surprise                     | FactTable.surprise        | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | trust                        | FactTable.trust           | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | positive                     | FactTable.positive        | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
3      | negative                     | FactTable.negative        | Fact      | Reviews cleaned; sentimentr and syuzhet used to calculate the sentiment score
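As a small illustration of the dominant transformation in Table 3, the following
sketch, adapted from the appendix code, counts and then drops rows with missing
values:

# Sketch of the missing-value handling named in Table 3 (full code in the appendix).
AirlineData <- read.csv("AirlinesCleanedData.csv")  # cleaned Transtats extract
sapply(AirlineData, function(x) sum(is.na(x)))      # count NAs per column
AirlineData <- na.omit(AirlineData)                 # drop rows containing any NA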
6 ETL Process
Extraction, transformation and loading (ETL) is the foundation of the data warehouse
and business intelligence process, and the collection of data from different sources
plays an important role in it. As a data warehouse may be used in decision-making
and knowledge-management processes, the gathered data must be clean and must not
contain redundant or inconsistent records [Chaudhuri & Dayal (1997)]. The ETL
strategy undertaken to build the data warehouse is described as follows:
6.1 Extraction
The data related to airlines was extracted from three different data sources, as
mentioned in Section 2. Two data sets are structured data sources; the third is
unstructured and was extracted from Twitter.
6.2 Cleaning
While building a data warehouse, it is important to maintain the quality and
consistency of the data. As the data contained a large number of records, it included
anomalous and duplicate entries, as well as rows with null fields. Cleaning was
performed to ensure the available data is clean, consistent and free of redundancy.
A detailed explanation of the cleaning for each dataset follows:
6.2.1 Source 1: Transtats
All the cleaning was performed in R, as a fully automated process: the data
downloaded from the web arrives in zip format, so the zip is extracted and the CSV
file is read into R, after which the data is cleaned and the cleaned data written back
to CSV, ready for the next step of the ETL. Extraction and cleaning were thus
achieved at a one-touch automation level, covering everything from extracting the
data from the web to writing the cleaned data to a CSV file. The R code is attached
in the appendix.
6.2.2 Source 2: Statista
The data gathered from Statista contained two data sheets. According to the
business requirements in Section 1, only one data sheet, which was already clean,
was required. Hence, cleaning consisted of removing the unused data sheet and
text fields using R code, which is attached in the appendix.
6.2.3 Source 3: Twitter
The tweets gathered from Twitter contained a lot of noise, including tweets in
different languages, retweets and many unwanted columns. This data was cleaned
before performing sentiment analysis. Average sentiment scores were calculated for
each airline and written to a CSV file, which was then used as input to the next
ETL steps. All of these tasks were performed in R; the code is attached in the
appendix.
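As a brief sketch of the scoring step, condensed from the appendix code, the NRC
sentiment scores are computed per tweet and then averaged per airline:

# Condensed sketch of the per-airline sentiment scoring (full code in the appendix).
library(syuzhet)
scores <- get_nrc_sentiment(AlaskaAirlinesTweets$tweet_text)  # NRC scores per tweet
colMeans(scores)  # average anger, ..., trust, negative, positive for the airline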
6.3 Transform
After cleaning the data and making sure it contained no noise, the cleaned data from
these sources was transformed into CSV format. Data mapping plays an important
role in transformation, and data integration and aggregation were performed at this
stage.
6.4 Load
Loading is the stage where the actual database comes into the picture. Here,
Microsoft's SSMS (SQL Server Management Studio) was used, with a database named
Airline storing the raw data. The load step of ETL involves loading the flat files into
the target database repository, for which Microsoft's SSIS tool was used. The three
different sources were used at the staging area to form five separate tables, which are
in turn used as input for the dimension tables, and these dimensions are used to
populate the fact table. This is not the end of database creation, however: we can
still modify this data without affecting other processes, for example by increasing the
granularity of dimensions and facts or by adding or removing attributes and facts.
Loading the data is one of the most important stages in ETL. It is a time-consuming
task whose main challenges, such as connectivity to the MOLAP server and data type
mismatches between the staging and dimension areas, were overcome to perform the
ETL successfully.
The overview of the ETL for the data warehouse model built is shown below:
Figure 3: Overview of ETL
6.5 Overview
An illustration of the above model is as follows:
6.5.1 Integration with R
An Execute Process Task is used to integrate R with SSIS. The aim of using an
Execute Process Task at this stage was to achieve automation in SSIS, such that the
data downloaded from the web is read, cleaned, and written in CSV format, ready to
be used as input for the next stage in this pipelined architecture. The Transtats data
was downloaded directly from its URL and cleaned so that it was readily available for
the next stage of input; the same procedure was followed for the Statista data. An R
script was written to perform the extraction, cleaning and writing of the data to CSV.
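The sketch below condenses the appendix code into the kind of script the Execute
Process Task might invoke; the file name and invocation are illustrative.

# clean_transtats.R -- condensed from the appendix code; illustrative of the script
# an SSIS Execute Process Task could run as: Rscript clean_transtats.R
library(utils)

temp <- tempfile()
download.file(paste0("https://transtats.bts.gov/PREZIP/",
                     "Origin_and_Destination_Survey_DB1BCoupon_2018_1.zip"), temp)
AirlineData <- read.csv(unz(temp, "Origin_and_Destination_Survey_DB1BCoupon_2018_1.csv"),
                        header = TRUE, na.strings = "")
AirlineData <- na.omit(AirlineData)  # drop rows with missing values
write.csv(AirlineData, "AirlinesCleanedData.csv", row.names = FALSE)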
6.5.2 Truncation
As data quality plays an important role in the ETL process, the ETL must be
re-runnable and automated. All dimension and fact tables are therefore truncated
here, so that fresh data is loaded every time the ETL process is executed.
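In SSIS this is an Execute SQL Task; an equivalent hedged sketch from R, reusing
the DBI connection and the assumed table names from the Section 4 sketch, would be:

# Truncate the fact and dimension tables so each ETL run starts from a clean state.
# Table names follow the sketch in Section 4 and are assumptions.
for (tbl in c("FactTable", "DimAirline", "DimTime", "DimFare",
              "DimLocation", "DimSentiment")) {
  DBI::dbExecute(con, paste("TRUNCATE TABLE", tbl))
}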
6.5.3 Staging
After truncation, fresh data is loaded through SSIS using flat file sources. Covering
the three different data sources, three staging tables were created and loaded with
all the data with the help of SSIS.
6.5.4 Dimension
The staging tables created in the previous step are used to populate the dimension
tables. While creating each dimension table, the primary key was generated using an
IDENTITY column for the dimension ID. Details of the dimensions are given in
Section 4. One of the key challenges in creating the dimension tables was data type
mismatch, which was overcome by matching the data types to those of the staging
tables.
6.5.5 Fact Table
One of the important tasks in ETL is populating the measures. The output of the
dimension tables forms the input to the fact table in this pipelined architecture. The
fact table mainly consists of the primary keys of the dimension tables plus the
measures, in the form of facts. I used the SSIS Lookup transformation, which
performs the lookup operation: a join query connects the staging tables, and this
data is compared with the lookup columns, i.e. the attributes of the dimension
tables. The fact table was thus populated automatically with the help of lookups.
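Expressed as plain SQL rather than SSIS lookups, the population step might look
roughly like the following; the staging table and all column names are assumptions,
not the exact contents of the SSIS package.

# Rough SQL equivalent of the SSIS lookup-based fact load (names are illustrative).
DBI::dbExecute(con, "
  INSERT INTO FactTable (Airline_id, Time_id, AverageFare, Distance, Passengers, trust)
  SELECT a.Airline_id, t.Time_id, st.AverageFare, st.Distance, st.Passengers, st.trust
  FROM StagingAirlines st
  JOIN DimAirline a ON a.Opcarrier = st.Opcarrier
  JOIN DimTime    t ON t.Year = st.Year AND t.Quarter = st.Quarter
  -- the DimFare, DimLocation and DimSentiment lookups follow the same pattern
")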
6.5.6 Analysis of the Cube in SSIS
After the fact table has been generated, manually processing the cube every time is a
cumbersome task, so I used the Analysis Services Processing Task to process the cube
from SSIS. A sequence container holds two Analysis Services Processing Tasks,
because cube processing is performed efficiently by first processing the dimensions
and then processing the cube, as it is a sequential task.
6.5.7 Creating and Deploying the Cube in SSAS
SSAS (SQL Server Analysis Services) is mainly used for Online Analytical Processing
(OLAP). A connection was made to MOLAP to access the fact and dimension tables
used to create the cube. A data source view was made using the same data
repository, AIRLINES. Dimensions and measures were then selected for processing in
the cube, and the cube was deployed successfully. I used the browser section to
explore the data and browse the dimensions; in this section we can also verify the
data.
6.5.8 Degree of Automation
The degree of automation achieved in the ETL process was one-touch automation.
On clicking the execute button, the following processes run automatically:
• Data extraction
• Cleaning
• Transformation into the required target format
• Loading the data in SSIS
• Populating the fact table
• Cube deployment
7 Application
7.1 BI Query 1: Which is the most popular fare class in U.S. Domestic
Airlines?
The contributing sources for this query are data source 2.1 and data source 2.2.
Figure 4 shows the details of the most popular fare class in U.S. domestic airlines.
The average fare paid for each fare class (C, D, X and Y) is clearly visible. The
number of passengers travelling in business class is higher than in economy class,
even though the business-class fare is higher.
Figure 4: Results for BI Query 1
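The cube was browsed interactively, but an equivalent relational query, using the
connection and assumed names from the Section 4 sketch, would be roughly:

# Hedged relational equivalent of BI Query 1 (all names are assumptions).
popularity <- DBI::dbGetQuery(con, "
  SELECT d.FareClass,
         SUM(f.Passengers)  AS TotalPassengers,
         AVG(f.AverageFare) AS AvgFare
  FROM FactTable f
  JOIN DimFare d ON d.Fare_id = f.Fare_id
  GROUP BY d.FareClass
  ORDER BY TotalPassengers DESC")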
7.2 BI Query 2: What is the relation between the average fare paid by
passengers and the distance travelled for the corresponding airline?
The contributing sources for this query are data source 2.1 and data source 2.2.
The tree map shown in Figure 5 demonstrates the relation between average fare,
distance travelled, origin state and airline. Here, fare is independent of the distance
but depends on the airline. For example, considering all the cases for Delta Airlines
(DL) originating from different states, the fare varies with the origin state and the
travelling distance. On the other hand, for Endeavor Air (9E) we can see a
dependency of fare on distance: as the distance increases, the fare decreases, and
vice versa.
Figure 5: Results for BI Query 2
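An illustrative relational form of this query, under the same assumed names, groups
average fare and distance by airline and origin state:

# Hedged relational equivalent of BI Query 2 (all names are assumptions).
fare_vs_distance <- DBI::dbGetQuery(con, "
  SELECT a.Opcarrier, l.OriginState,
         AVG(f.AverageFare) AS AvgFare,
         AVG(f.Distance)    AS AvgDistance
  FROM FactTable f
  JOIN DimAirline  a ON a.Airline_id  = f.Airline_id
  JOIN DimLocation l ON l.Location_id = f.Location_id
  GROUP BY a.Opcarrier, l.OriginState")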
7.3 BI Query 3: Which is the most trustworthy domestic airline in the
United States?
The contributing sources for this query are data source 2.2 and data source 2.3.
Figure 6 illustrates the general findings on the most trustworthy domestic airlines in
the U.S. Here we can see that OH (PSA Airlines) and MQ (Envoy Air) received a
trust score of 4, which marks them as the most trustworthy airlines, followed by AA
(Alaska Airlines) with a trust score of 3.
Figure 6: Results for BI Query 3
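The trust ranking could likewise be expressed relationally under the same assumed
names:

# Hedged relational equivalent of BI Query 3 (all names are assumptions).
trust_ranking <- DBI::dbGetQuery(con, "
  SELECT s.carrier_name, AVG(f.trust) AS TrustScore
  FROM FactTable f
  JOIN DimSentiment s ON s.Sentiment_id = f.Sentiment_id
  GROUP BY s.carrier_name
  ORDER BY TrustScore DESC")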
7.4 Discussion
All the BI queries discussed above satisfy the business requirements mentioned in
Section 1.
Considering the 1st BI query, fare acts as an important factor. Fang et al. (2017)
performed an analysis of the revenue generated by airlines from different types of
customers, using the concept of dynamic pricing to decide the discounts to be offered.
We can relate this to our query, where the maximum number of passengers travel in
business class; this might be due to the discounts offered on business class compared
to economy class.
In the 2nd BI query, we discussed the relation between the average fare paid, the
distance travelled and the operating carrier. Ferguson et al. (2009) discussed the
effect of fuel prices on rising airfares, and their analysis showed how increasing
numbers of carriers decrease passenger demand. We can relate their work here and
add further points involving the dependency of airfare on the distance travelled, the
operating carrier and the origin state of departure.
The third BI query, discussed in Section 7.3, depicts the most trustworthy airlines.
Wang et al. (2016) proposed a method for fine-grained sentiment analysis which also
involved emotions like trust. Using sentiment scores, we can analyze which airlines to
trust and which not to. Here, MQ and OH can be considered the most trustworthy
airlines, compared to 9K (Cape Air) and OO (SkyWest Airlines) with very low trust
scores. We can conclude that passengers, in general, would opt for PSA Airlines,
Envoy Air and Alaska Airlines on the grounds of their higher trust scores.
8 Conclusion and Future Work
Air transport plays a vital role in connectivity throughout the world. The data
warehouse built here was able to answer all the stated business requirements, covering
the number of passengers traveling in different classes, the most trustworthy domestic
airline in the U.S., and the dependency between fare and distance. The data
warehouse can also answer a number of further queries, such as those on positive and
negative feedback about the airlines, the number of passengers traveling from different
geographic locations, and airfare and the factors affecting it. There are, however,
certain limitations to consider. As the data gathered here was generated by the U.S.
government, social media, etc., we cannot simply rely on it for in-depth analysis of
this topic; more detailed and granular data is needed for reliable outcomes. Nor can
we rely solely on social media data for sentiment analysis; the scope of the research
could be widened to gather data from multiple sources and draw a better conclusion.
In the coming years, the challenges and opportunities of the airline industry need to
be anticipated. A data warehouse built while overcoming the above limitations would
be more effective and would provide users with a deeper analysis of fare-related
queries. An attempt to build such a data warehouse with a wider scope would be
useful for demonstrating all queries related to the airline industry on a single
platform.
References
Chaudhuri, S. & Dayal, U. (1997), 'An overview of data warehousing and OLAP
technology', SIGMOD Rec. 26(1), 65–74.
Fang, Y., Chen, Y. & Li, X. (2017), Joint decision making about price and duration
of discount airfares, in '2017 IEEE International Conference on Industrial
Engineering and Engineering Management (IEEM)', pp. 31–34.
Ferguson, J., Hoffman, K., Sherry, L. & Kara, A. Q. (2009), Effects of fuel prices
on air transportation market average fares and passenger demand, in '9th AIAA
Aviation Technology, Integration and Operations (ATIO) Conference, Aircraft
Noise and Emissions Reduction Symposium (ANERS)'.
Hao, L. & Yu, X. (2008), Dynamic pricing of airline tickets in competitive markets,
in '2008 4th International Conference on Wireless Communications, Networking
and Mobile Computing', pp. 1–5.
Kimball, R. & Ross, M. (2013), The Data Warehouse Toolkit, third edition.
Moody, D. L. & Kortink, M. A. R. (2000), From enterprise models to dimensional
models: a methodology for data warehouse and data mart design, p. 5.
Wang, Z., Chong, C. S., Lan, L., Yang, Y., Ho, S. B. & Tong, J. C. (2016), Fine-
grained sentiment analysis of social media with emotion sensing, in '2016 Future
Technologies Conference (FTC)', pp. 1361–1364.
Appendix
R code

R code for Data source 2.1

setwd("E:/NCI/sem1/datawarehous/project")
# Package for downloading a file from the web
# Reference URL: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html
library(utils)

# Download the zip file directly from the data source, fetching
# real-time data in R, and extract it
temp <- tempfile()
download.file("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BCoupon_2018_1.zip", temp)

# Unzip the file
# Reference URL: https://stackoverflow.com/questions/3053833/using-r-to-download-zip
csvFile <- unz(temp, "Origin_and_Destination_Survey_DB1BCoupon_2018_1.csv")

# Referred from R labs: read the CSV, treating empty strings as NA
AirlineData <- read.csv(csvFile, header = TRUE,
                        na.strings = c(""), stringsAsFactors = TRUE)

# Inspect the data frame before cleaning
ncol(AirlineData)
View(AirlineData)

# Check for NA values in the data frame (referred from R labs)
sapply(AirlineData, function(x) sum(is.na(x)))
names(AirlineData)

# Remove unwanted columns
AirlineData[, c('Gateway', 'ItinGeoType', 'SeqNum', 'Coupons',
                'OriginStateFips', 'X')] <- list(NULL)
AirlineData[, c('CouponType', 'TkCarrier', 'DestStateFips',
                'CouponGeoType', 'DistanceGroup')] <- list(NULL)
AirlineData[, c('RPCarrier', 'DestWac', 'OriginWac')] <- list(NULL)
AirlineData[, c('OriginCountry', 'DestCountry', 'Break', 'ItinID', 'MktID')] <- list(NULL)

# Delete rows with NA in FareClass
AirlineData <- AirlineData[!is.na(AirlineData$FareClass), ]

# Subset the data using sample
# Reference URL: https://www.statmethods.net/management/subset.html
AirlineData <- AirlineData[sample(1:nrow(AirlineData), 81602,
                                  replace = FALSE), ]

# View the data frame and check the number of rows
View(AirlineData)
NROW(AirlineData)

# Write the cleaned data to CSV
write.csv(AirlineData, "AirlinesCleanedData.csv")
R code for Data source 2.2

setwd("E:/NCI/sem1/DWBI/project/Statista")
# Package for reading XLS files
install.packages("readxl")
library(readxl)
StatistaData <- read_excel("statistic_id642191_domestic-airports-in-the-
# Remove unwanted rows
StatistaData <- StatistaData[-c(1, 2), ]
# Change the column name
colnames(StatistaData)[2] <- "AverageFare"
# Write the CSV
write.csv(StatistaData, file = "E:/NCI/sem1/DWBI/project/Statista/Statis
# Referred from Twitter labs
install.packages("tidytext")
install.packages("dplyr")
install.packages("reshape")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("RMySQL")
library(reshape)
library(ggplot2)
library(tidyr)
library(RMySQL)
install.packages("textcat")
install.packages("cld2")
install.packages("cld3")
install.packages("tidyverse")

# Kill any open database connections
killDbConnections <- function() {
  all_cons <- dbListConnections(MySQL())
  print(all_cons)
  for (con in all_cons)
    dbDisconnect(con)
  print(paste(length(all_cons), "connections killed."))
}
# Reference URL: https://stackoverflow.com/questions/32139596/cannot-allocate-a-new-
# Create a connection to the database
con <- dbConnect(MySQL(),
                 user = "twitter", password = "password",
                 dbname = "TWITTER", host = "18.203.249.12")
on.exit(dbDisconnect(con))

# Fetch tweets related to airlines
resultset <- dbSendQuery(con, "select tweet_text,
                               created_at from tweets;")
# Create a data frame from the result set
AirlineTweetsData <- fetch(resultset, n = Inf)
summary(AirlineTweetsData)

# Recognize text categories
library(textcat)
# Google language detectors
library(cld2)
library(cld3)
# Collection of different R packages
library(tidyverse)

# Filter the gathered tweets for the English language
AirlineTweetsData <- AirlineTweetsData %>%
  mutate(textcat = textcat(x = tweet_text),
         cld2 = cld2::detect_language(tweet_text),
         cld3 = cld3::detect_language(tweet_text)) %>%
  select(tweet_text, textcat, cld2, cld3, created_at) %>%
  filter(cld2 == "en" & cld3 == "en")
summary(AirlineTweetsData)

# Identify retweets
AirlineTweetsData$RT <- startsWith(AirlineTweetsData$tweet_text, "RT")
# Remove retweets
AirlineTweetsData <- AirlineTweetsData[!AirlineTweetsData$RT, ]
View(AirlineTweetsData)

# Create a new data frame
AirlinesDataFiltering <- AirlineTweetsData
# Filter tweets based on textcat
FilteredTweets <- AirlinesDataFiltering %>%
  filter(AirlinesDataFiltering$textcat == "english")
summary(FilteredTweets)

# Remove unwanted columns
FilteredTweets[, c('textcat', 'cld2', 'cld3', 'created_at', 'RT')] <- list(NULL)

# Filter tweets from the data frame to collect reviews of
# particular airlines using filter and str_detect
# Reference URL: https://www.datanovia.com/en/lessons/subset-data-frame-rows-in-r/
AlaskaAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Alaska"))
# Add a column with the particular airline code
AlaskaAirlinesTweets['Airline'] <- 'AA'

# For Cape Air
CapeAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Cape"))
CapeAirlinesTweets['Airline'] <- '9K'

# For Delta Airlines
DeltaAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Delta"))
DeltaAirlinesTweets['Airline'] <- 'DL'

# For United Airlines
UnitedAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "United"))
UnitedAirlinesTweets['Airline'] <- 'UA'

# The same task was performed for the remaining airlines
# Packages for performing sentiment analysis
library(syuzhet)
library(sentimentr)

# Sentiment score for each review
mysentiment_review <- get_nrc_sentiment(AlaskaAirlinesTweets$tweet_text)

# Mean of sentiments for Alaska Airlines
mean_AlaskaAirlines <- data.frame(mean(mysentiment_review$anger),
  mean(mysentiment_review$disgust), mean(mysentiment_review$anticipation),
  mean(mysentiment_review$fear), mean(mysentiment_review$joy),
  mean(mysentiment_review$sadness), mean(mysentiment_review$surprise),
  mean(mysentiment_review$trust), mean(mysentiment_review$negative),
  mean(mysentiment_review$positive))

# Change the column names
colnames(mean_AlaskaAirlines) <- c('anger', 'disgust', 'anticipation', 'fear',
  'joy', 'sadness', 'surprise', 'trust', 'negative', 'positive')
# Add a column with the specific airline code
mean_AlaskaAirlines['Airline'] <- 'AA'
# Data frame with the required output for Alaska Airlines
View(mean_AlaskaAirlines)
# For Delta Airlines
# Sentiment score for each review
mysentiment_reviewDeltaAirlines <- get_nrc_sentiment(DeltaAirlinesTweets$tweet_text)
# Mean of sentiments for Delta Airlines
mean_DeltaAirlines <- data.frame(mean(mysentiment_reviewDeltaAirlines$anger),
  mean(mysentiment_reviewDeltaAirlines$disgust),
  mean(mysentiment_reviewDeltaAirlines$anticipation),
  mean(mysentiment_reviewDeltaAirlines$fear),
  mean(mysentiment_reviewDeltaAirlines$joy),
  mean(mysentiment_reviewDeltaAirlines$sadness),
  mean(mysentiment_reviewDeltaAirlines$surprise),
  mean(mysentiment_reviewDeltaAirlines$trust),
  mean(mysentiment_reviewDeltaAirlines$negative),
  mean(mysentiment_reviewDeltaAirlines$positive))

# Change the column names
colnames(mean_DeltaAirlines) <- c('anger', 'disgust', 'anticipation',
  'fear', 'joy', 'sadness', 'surprise', 'trust', 'negative', 'positive')
# Add a column with the specific airline code
mean_DeltaAirlines['Airline'] <- 'DL'
# Data frame with the required output for Delta Airlines
View(mean_DeltaAirlines)

# Similarly, the average sentiments were calculated
# for the remaining airlines

# Merge multiple data frames into a single data frame
# Reference URL: https://www.r-bloggers.com/concatenating-a-list-of-data-frames/
Mean_Airline_Sentiments <- do.call("rbind", list(mean_AlaskaAirlines,
  mean_DeltaAirlines, mean_HawaiianAirlines, mean_AmericanAirlines,
  mean_UnitedAirlines, mean_CapeAir, mean_EndeavorAir, mean_PSAAirlines,
  mean_EnvoyAir, mean_AirWisconsin))

# Write the data to CSV
write.csv(Mean_Airline_Sentiments,
          file = "E:/NCI/sem1/datawarehouse/project/AverageAirlineSetiments.csv")
Screenshots of the data sources used are as follows:
Figure 7: Data Source 1: Transtats
Figure 8: Data Source 2: Statista