Data Warehousing and Business Intelligence
Project
on
Airfare Analysis of Domestic Airlines in U.S.
Abhishek Surendra Dahale
x17170311
MSc Data Analytics – 2018/9
Submitted to: Dr. Horacio Gonzalez-Velez
National College of Ireland
Project Submission Sheet – 2017/2018
School of Computing
Student Name: Abhishek Surendra Dahale
Student ID: x17170311
Programme: MSc Data Analytics
Year: 2018/9
Module: Data Warehousing and Business Intelligence
Lecturer: Dr. Horacio Gonzalez-Velez
Submission Due Date: 26/11/2018
Project Title: Airfare Analysis of Domestic Airlines in U.S.
I hereby certify that the information contained in this (my submission) is information
pertaining to my own individual work that I conducted for this project. All information
other than my own contribution is fully and appropriately referenced and listed in the
relevant bibliography section. I assert that I have not referred to any work(s) other than
those listed. I also include my TurnItIn report with this submission.
ALL materials used must be referenced in the bibliography section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
other authors' written or electronic work is an act of plagiarism and may result in
disciplinary action. Students may be required to undergo a viva (oral examination)
if there is suspicion about the validity of their submitted work.
Signature:
Date: November 25, 2018
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. You must ensure that you retain a HARD COPY of ALL projects, both for
your own reference and in case a project is lost or mislaid. It is not sufficient to keep
a copy on computer. Please do not bind projects or place in covers unless specifically
requested.
3. Assignments that are submitted to the Programme Coordinator office must be placed
into the assignment box located outside the office.
Office Use Only
Signature:
Date:
Penalty Applied (if applicable):
Table 1: Mark sheet – do not edit
Criteria Mark Awarded Comment(s)
Objectives (out of 5)
Related Work (out of 10)
Data (out of 25)
ETL (out of 20)
Application (out of 30)
Video (out of 10)
Presentation (out of 10)
Total (out of 100)
Project Check List
This section captures the core requirements that the project entails, represented as a
checklist for convenience.
Used LaTeX template
Three Business Requirements listed in introduction
At least one structured data source
At least one unstructured data source
At least three sources of data
Described all sources of data
All sources of data are less than one year old, i.e. released after 17/09/2017
Inserted and discussed star schema
Completed logical data map
Discussed the high level ETL strategy
Provided 3 BI queries
Detailed the sources of data used in each query
Discussed the implications of results in each query
Reviewed at least 5-10 appropriate papers on topic of your DWBI project
Airfare Analysis of Domestic Airlines in U.S.
Abhishek Surendra Dahale
x17170311
November 25, 2018
Abstract
Airline fare analysis is a trending research topic, as air travel is among the most
used modes of transportation worldwide and the fastest, most convenient means of
connecting any corner of the globe. After the U.S. government deregulated the
airline industry, operating carriers began charging fares according to the services
offered and various other factors, which this project uses for its analysis. The focus
of the project is to analyse the deciding factors behind fare prices, which will help
passengers make the right choice according to their travel purpose and travelling
cost. The analysis follows Kimball's approach, also known as the bottom-up
approach, which consists of three stages: extraction of data, transformation of data
and, finally, loading of data into the database. A cube was formed using data marts.
A certain level of automation was achieved, including automated deployment of the
cube and integration with R. The results were then used to form business queries
that evaluate the business goals of this research.
1 Introduction
In June 1997 the U.S. Department of Transportation released the first quarterly fare
report, covering a quarter of 1996, in response to an increased number of customer
inquiries about airline fares. The Department then started releasing a monthly Air
Travel Consumer Report, which includes information about flight delays, oversales,
mishandled baggage and various other consumer complaints. Airlines offer a wide
variety of average fares, which makes airfare analysis essential so that customers can
evaluate prices and book the best tickets for their requirements. Average fares vary in
many ways: from carrier to carrier, with the size of airport occupancy, and from city
to city. Customers have benefited greatly from the deregulation of the airline industry;
because of it, airlines provide competitive fares and more services between destinations.
If an airline has a wide airfare range, it simply means a large variety of fares is being
offered in the market; such airlines generally have a limited number of low-fare seats,
offered with many restrictions. In a low-fare market, fares cluster near the average fare,
because these airlines mostly carry passengers who prefer low fares over paying a
higher fare.
The motivation behind building this data warehouse and business intelligence system
is to analyse the reasons behind the wide range of airfares through an in-depth analysis
of the factors that decide the average fare. The concept of dynamic pricing, a tool used
by the airline industry for revenue management, was used by Fang et al. (2017) to
analyse the revenue generated from different types of customers; in relation to this, we
can identify how average fare affects pricing strategy. Ferguson et al. (2009) performed
an analysis of airfares, studying how fuel prices affect rising fares. In addition, we can
study more factors involved in pricing strategy, including the number of passengers,
the distance travelled, the origin and destination of the flight, and customer reviews
and feedback.
The data warehouse built will be able to address the following requirements:
1. Which is the most popular fare class in U.S. Domestic Airlines?
2. What is the relation between Average fare paid by passengers and the distance
travelled for the corresponding Airline?
3. Which is the most trustworthy domestic Airline in the United States?
2 Data Sources
To build the data warehouse, data was gathered from three sources. The first two are
structured repositories that yield a large amount of airline-related data; the third is
an unstructured source. The data sources are described as follows:
Source Type Brief Summary
Transtats Structured Contains detailed information on every U.S. domestic airline, supporting the business requirements.
Statista Structured Provides data relevant to the requirement on average airfare.
Twitter Unstructured Reviews extracted from Twitter support corpus-based sentiment analysis.
Table 2: Summary of sources of data used in the project
2.1 Source 1: Transtats
Within the Department of Transportation (DOT), the Bureau of Transportation
Statistics (BTS) provides researchers with accurate and well-grounded transportation
data, which supports investment decisions and economic growth. This data was made
available in June 2018. The repository yielded a large amount of data, which was
cleaned using R: unwanted rows and columns were removed. The R code is attached
in the appendix.
The dataset consists of 36 columns, of which the attributes relevant to the business
requirements are:
• Year
• Quarter
• Origin State
• Destination State
• Fare Class
• Passengers
• Distance
URL : https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_
DB1BCoupon_2018_1.zip
This dataset plays an important role in the data warehouse: all the business
requirements listed in Section 1 are supported by this data.
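The column selection and row-level cleaning described above were implemented in R (attached in the appendix). Purely as an illustrative sketch of the same logic, not the project's actual code, the following Python snippet keeps the seven relevant attributes from the wide coupon file and drops incomplete rows (column names follow the list above and are assumptions about the raw file's headers):

```python
import csv

# Attributes the business requirements need (the raw DB1B coupon file
# has 36 columns; names here are assumed to match its headers).
KEEP = ["Year", "Quarter", "OriginState", "DestState",
        "FareClass", "Passengers", "Distance"]

def clean_rows(rows):
    """Keep only the relevant columns and drop rows with missing values,
    mimicking the na.omit() step described in the report."""
    cleaned = []
    for row in rows:
        slim = {k: (row.get(k) or "").strip() for k in KEEP}
        if all(slim.values()):          # drop incomplete rows
            cleaned.append(slim)
    return cleaned

def clean_csv(src, dst):
    """Read the raw CSV, clean it, and write the slimmed CSV back out."""
    with open(src, newline="") as f:
        cleaned = clean_rows(csv.DictReader(f))
    with open(dst, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=KEEP)
        w.writeheader()
        w.writerows(cleaned)
```

The same filter-then-write shape is what the appendix R script performs before the data enters the ETL pipeline.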
2.2 Source 2: Statista
This dataset provides details about the top 10 domestic airports in the United
States, including the average fare paid for domestic flights from each. The data was
released in 2018. The data provided by Statista has a very limited number of rows
and did not require row-level cleaning, but the file consists of two data sheets; R code
(attached in the appendix) removes the unused sheet and stray text. The dataset
includes the following attributes:
• Airport Code
• Average fare
Here the average fare paid for particular airlines is used, which answers the
fare-related business queries mentioned in Section 1. Assumption: the Airlines
column was taken from the 1st dataset.
URL : https://www.statista.com/statistics/642191/us-domestic-airports-lowest-aver
2.3 Source 3: Twitter (Unstructured)
With reference to the operating carriers in the Transtats dataset, Twitter was used
to gather reviews posted by consumers of domestic airlines in the U.S., as these
form a strong basis for future customers to choose the best airline at an affordable
travelling cost. Sentiment analysis was performed on the gathered data so that the
sentiment of the reviews could be analysed against a corpus. An AWS (Amazon Web
Services) instance was used to pull the data from Twitter and store it in a MySQL
database. This unstructured data was then pulled into R: the RMySQL package,
which provides a MySQL driver for R, was used to retrieve this large collection of
tweets about airlines. The data was then cleaned, and sentiment analysis was
performed for each airline using the syuzhet package. Detailed code is attached in
the appendix. The attributes of the unstructured data comprise sentiment scores in
the form of:
• anger
• anticipation
• disgust
• fear
• joy
• sadness
• surprise
• trust
• negative
• positive
Considering the above corpus, the trust score is used for the analysis of the most
trustworthy airline, which people would believe in before choosing their carrier. The
other sentiment scores are kept, as they may be used further in the BI application.
URL : https://www.twitter.com/
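The sentiment scoring itself was done in R with syuzhet, which counts NRC emotion-lexicon hits per text. The sketch below shows that underlying idea in Python with a tiny hand-made lexicon; the words and emotion mappings here are invented for illustration only (syuzhet ships the real NRC lexicon):

```python
from collections import defaultdict

# Toy NRC-style lexicon (illustrative only; not the real NRC word list).
LEXICON = {
    "reliable": {"trust", "positive"},
    "delayed":  {"anger", "negative"},
    "lost":     {"sadness", "fear", "negative"},
    "great":    {"joy", "positive"},
}

def emotion_scores(text):
    """Count lexicon hits per emotion for one tweet, syuzhet-style."""
    scores = defaultdict(int)
    for word in text.lower().split():
        for emotion in LEXICON.get(word.strip(".,!?"), ()):
            scores[emotion] += 1
    return scores

def average_trust(tweets_by_airline):
    """Average trust score per airline over its cleaned tweets."""
    return {
        airline: sum(emotion_scores(t)["trust"] for t in tweets) / len(tweets)
        for airline, tweets in tweets_by_airline.items()
    }
```

Averaging the per-tweet scores per carrier, as `average_trust` does, is what produces the one-row-per-airline sentiment CSV fed into the later ETL steps.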
3 Related Work
Air transport plays an important role in achieving economic growth and development,
providing vital connectivity across the globe. The Department of Transportation
(DOT) and the Bureau of Transportation Statistics help researchers analyse various
service-quality elements related to airlines, involving factors such as airfare,
class-based flight services and flight operations. The DOT releases an Air Travel
Consumer Report that includes all the above-mentioned factors. These datasets were
used to provide a more in-depth analysis of the effect of low-fare service on fares.
In addition, Ferguson et al. (2009) referred to this type of dataset to analyse
fluctuations in airfares. They studied how fuel prices affect rising fares, alongside
factors such as seasonality, the distance travelled by passengers and other economic
demands. Their analysis shows how an increasing number of carriers decreases
passenger demand. Using this type of dataset, we can thus analyse how the average
fare is affected by factors such as the number of passengers, the operating carrier and
the distance travelled.
The concept of dynamic pricing, a tool used by the airline industry for revenue
management, was applied by Fang et al. (2017), who analysed the revenue airlines
generate from different types of customers. They studied the decision-making process
around the best and minimal prices, and the discounts to be offered to maximise
profits; the research also involved a pricing strategy designed using a Stackelberg
game. Similarly, Hao & Yu (2008) used a foundational game-theoretic model for
dynamic pricing. Building on this work, and since average airfare plays an important
role in pricing strategy, the average fare from the dataset is used in the business
requirements to analyse how other factors in the airline industry are decided based
on this fare.
Social media is the richest source of electronic word of mouth. Reviews, feedback
and comments posted by users reflect their sentiment towards a topic; social media
analytics therefore helps to automatically classify text messages into sentiment
categories (positive, negative, neutral). Wang et al. (2016) proposed a method for
fine-grained sentiment analysis that captures more detailed sentiments, describing
fine-grained sensing that includes both sentiments and emotions. This informed
the sentiment analysis performed here on reviews and feedback for domestic airlines
in the U.S. As per the business requirement stated in Section 1, the trust score of
each airline is used to identify the trustworthy airlines.
4 Data Model
There are two established approaches to building a data warehouse: William Inmon's
and Ralph Kimball's. This data warehouse follows Ralph Kimball's bottom-up
approach because, in terms of performance, query execution time is lower, and the
database is lightweight and avoids unnecessary complexity. Kimball's data warehouse
supports dimensional data modelling, and the approach is built around the end user's
perspective. It fits this project because the results of the business queries will help
people identify the best and most trustworthy airline and compare airlines by
travelling cost [Kimball & Ross (2013)].
The architecture of the data warehouse model is as follows:
Figure 1: Architecture (using Kimball's approach)
The schema chosen for the data warehouse is the star schema, which easily
categorises the information in the warehouse and provides a perspicuous view of the
relations among the tables. As this is a simple data warehouse with a relatively small
number of tables joined by simple joins, the star schema is a good fit. The fact table
contains the primary keys of all the dimension tables and the measures, in the form
of sentiment scores, average fare, distance and number of passengers. The star
schema was chosen for its query and load performance, and because the structure of
the data warehouse is easy to understand [Chaudhuri & Dayal (1997)]. After
performing the join operation, 81,602 records were generated with a very short
execution time, maximising the performance of the data warehouse.
Figure 2 delineates the star schema:
User queries form the basis of the star schema. A user would typically require
specific information about airline details, the sentiment related to an airline, the
average fare paid, or the number of passengers travelling, so this information is
grouped into tables by category. For example, any time-related information can be
found in the dimension table DimTime, which includes the Year and Quarter of the
year. Similarly, location-related information is clubbed into the dimension table
DimLocation, containing attributes such as City, Origin State and Destination
State. Airline-related information can be found in the dimension table DimAirline,
fare-related information in DimFare, and airline sentiment in DimSentiments.
Given these tables, there is no need to create sub-tables for any specific dimension,
which would lead to the formation of a
Figure 2: Star Schema
snowflake schema. The star schema is therefore achieved by creating a simple and
easily understood structure for the above-mentioned dimension tables.
The key elements of the fact table are the measures, which can be used to perform
different types of aggregation. The dimension tables above play a vital role in
populating the measures in the fact table, which relates to the dimension tables
through one-to-many relationships. The fact table thus comprises the primary keys
of the dimension tables, viz. Airline id, Time id, Fare id, Location id and
Sentiment id, along with the measures: average fare, distance, sentiment scores and
number of passengers [L. Moody & Kortink (2000)].
As per the business requirements stated in Section 1, the business relevance of each
measure is as follows:
• Average Fare: used to analyse the relation with distance.
• Trust Score: the trust sentiment score, used to identify the most trustworthy
airlines in the U.S.
• Distance: as per the 2nd requirement, used to analyse the dependency of fare
on distance.
• Passengers: as per the 1st requirement, used to demonstrate the number of
passengers travelling in each airline class, e.g. business or economy.
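To make the roles of the fact and dimension tables concrete, here is a minimal in-memory sketch (Python, with invented sample keys and values rather than warehouse data) of a star join that aggregates a measure by a dimension attribute, the kind of query the schema is designed to serve:

```python
# Minimal in-memory star schema; all sample values are invented for illustration.
dim_fare = {1: {"FareClass": "C"}, 2: {"FareClass": "Y"}}
dim_airline = {10: {"Opcarrier": "DL"}, 11: {"Opcarrier": "9E"}}

# Fact rows hold dimension surrogate keys plus the measures.
fact = [
    {"Fare_id": 1, "Airline_id": 10, "AverageFare": 480.0, "Passengers": 120},
    {"Fare_id": 2, "Airline_id": 10, "AverageFare": 210.0, "Passengers": 90},
    {"Fare_id": 2, "Airline_id": 11, "AverageFare": 190.0, "Passengers": 60},
]

def avg_fare_by_class(fact, dim_fare):
    """Star join: group a fact measure by a dimension attribute."""
    totals = {}
    for row in fact:
        cls = dim_fare[row["Fare_id"]]["FareClass"]  # key lookup = the join
        s, n = totals.get(cls, (0.0, 0))
        totals[cls] = (s + row["AverageFare"], n + 1)
    return {cls: s / n for cls, (s, n) in totals.items()}
```

Because every dimension is one key-lookup away from the fact row, such aggregations need only single-level joins, which is the query-performance argument made above for choosing a star over a snowflake schema.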
5 Logical Data Map
Table 3: Logical Data Map describing all transformations, sources and destinations
for all components of the data model illustrated in Figure 2
Source Column Destination Column Type Transformation
1 Opcarrier DimAirline Opcarrier Dimension Missing values were removed using is.na() and na.omit()
1 Distance Fact Distance Fact Contains numeric values; rows with missing data were removed
1 Year DimTime Year Dimension No transformation required, as it contains data for a single year (2018); missing values removed using na.omit()
1 Quarter DimTime Quarter Dimension Contains a single value, i.e. 1 for the 1st quarter; no transformation required
1 OriginState DimLocation OriginState Dimension Missing values were removed using na.omit()
1 DestState DimLocation DestState Dimension No transformation required; missing values removed using na.omit()
1 Origin DimLocation Origin Dimension No transformation required; missing values removed using na.omit()
1 Dest DimLocation Dest Dimension No transformation required; missing values removed using na.omit()
2 Average fare in U.S. dollars DimFare Average Fare Fact No cleaning required
3 carrier name DimSentiment carrier name Dimension No transformation required
1 Fare Class DimFare FareClass Dimension No transformation needed; only missing values were removed
1 Airline DimFare Airline Dimension Missing values were removed using is.na() and na.omit()
3 anticipation FactTable anticipation Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 fear FactTable fear Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 joy FactTable joy Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 sadness FactTable sadness Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 surprise FactTable surprise Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 trust FactTable trust Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 positive FactTable positive Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
3 negative FactTable negative Fact Reviews were cleaned; the sentimentr and syuzhet libraries were used to calculate the sentiment score
6 ETL Process
Extract, transform and load (ETL) is the foundation of the data warehouse and
business intelligence process. The collection of data from different sources plays an
important role: since a data warehouse is used in decision-making and knowledge
management, the gathered data must be cleaned and must not contain redundant or
inconsistent records [Chaudhuri & Dayal (1997)]. The ETL strategy undertaken to
build the data warehouse is described as follows:
6.1 Extraction
Data related to airlines was extracted from the three sources described in Section 2:
two structured datasets and one unstructured dataset extracted from Twitter.
6.2 Cleaning
While building a data warehouse, it is important to maintain the quality and
consistency of the data. As the data contained a large number of records, it had
anomalous and duplicate entries as well as null fields. Cleaning was performed to
ensure the available data is clean, consistent and free of redundancy. A detailed
explanation of the cleaning for each dataset follows:
6.2.1 Source 1: Transtats
All cleaning was performed in R. The cleaning task was fully automated: the data
downloaded from the web arrived as a zip archive, which was unzipped; the CSV
file was read into R, cleaned, and the cleaned data written back to CSV, ready for
the next step of the ETL. Extraction and cleaning were thus achieved at a one-touch
automation level, from extracting the data from the web through to writing the
cleaned CSV. The R code is attached in the appendix.
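The first link of that one-touch chain, fetching the zip archive from the web and extracting the CSV member, can be pictured with the following Python stdlib sketch. This is an illustration only (the project's actual implementation is the appendix R script), and the member name passed in would be an assumption about the archive's contents:

```python
import io
import urllib.request
import zipfile

def fetch_csv_from_zip(url, member):
    """Step 1 of the one-touch chain: download a zip archive and return
    the text of one CSV member, ready for the cleaning step."""
    with urllib.request.urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    return archive.read(member).decode("utf-8")
```

The returned text would then be parsed, cleaned and rewritten as in the cleaning step described above, so a single trigger carries the data from URL to clean CSV.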
6.2.2 Source 2: Statista
The data gathered from Statista in CSV format contained two data sheets.
With reference to the business requirements in Section 1, only one sheet, which
already contained clean data, was required. Cleaning therefore removed the unused
data sheet and text fields using R code, which is attached in the appendix.
6.2.3 Source 3: Twitter
The tweets gathered from Twitter contained a lot of noise, including tweets in
different languages, retweets and many unwanted columns. This data was cleaned
before performing sentiment analysis. Average sentiment scores were calculated for
each airline and written to a CSV file, which was used as input to the next ETL
steps. All of this was performed in R; the code is attached in the appendix.
6.3 Transform
After cleaning the data and ensuring it contained no noise, the cleaned data from
these sources was transformed into CSV format. Data mapping plays an important
role in transformation; data integration and aggregation were performed at this
stage.
6.4 Load
Loading is the process where the actual database comes into the picture. Microsoft's
SSMS (SQL Server Management Studio) was used, with a database named Airline
storing the raw data. The load step of ETL involves loading the flat files into the
target database repository; Microsoft's SSIS tool was used for this. The three
sources were used at the staging area to form five separate tables, which serve as
input for the dimension tables, and the dimensions in turn populate the fact table.
This is not the end of database creation: the data can still be modified without
affecting other processes, for example by changing the granularity of dimensions and
facts, or adding and removing attributes and facts. Loading the data is one of the
most important and time-consuming stages in ETL; its main challenges, namely
connectivity to the MOLAP server and data-type mismatches between the staging
and dimension areas, were overcome to perform the ETL successfully.
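The flat-file-to-staging-table step was performed with SSIS against SQL Server. Purely as an illustration of that idea, the sketch below uses Python's sqlite3 as a stand-in target database; the table name is hypothetical, and the column list is taken from the CSV header rather than from the real warehouse schema:

```python
import csv
import io
import sqlite3

def load_staging(conn, table, csv_text):
    """Load a flat file into a staging table. SSIS does this against
    SQL Server; sqlite3 stands in here for illustration only."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cols = list(rows[0].keys())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} ({', '.join(cols)}) "
        f"VALUES ({', '.join('?' * len(cols))})",
        [tuple(r[c] for c in cols) for r in rows],
    )
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Each of the cleaned CSVs would be loaded this way into its own staging table, the tables that the dimension-population step then reads.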
The overview of the ETL for Datawarehouse model built is as shown below:
Figure 3: Overview of ETL
6.5 Overview
Illustration of the above model is as follows:
6.5.1 Integration with R
An Execute Process task is used to integrate R with SSIS. The aim of using Execute
Process at this stage was to automate the SSIS pipeline so that data downloaded
from the web is read, cleaned and written in CSV format, ready to be used as input
for the next stage of this pipelined architecture. The Transtats data was downloaded
directly from its URL and cleaned so it was readily available for the next stage; the
same procedure was followed for the Statista data. An R script performs the
extraction, cleaning and writing of the data to CSV.
6.5.2 Truncation
As data quality plays an important role in ETL, the process must be re-runnable
and automated. All dimension and fact tables are therefore truncated to load fresh
data every time the ETL process is executed.
6.5.3 Staging
After truncation, fresh data is loaded via SSIS from flat file sources. For the three
data sources, three staging tables were created and loaded with the help of SSIS.
6.5.4 Dimension
The staging tables created in the previous step are used to populate the dimension
tables. While creating each dimension table, the primary key was generated using
an identity column for each dimension ID; details of the dimensions are given in
Section 4. One of the key challenges in creating the dimension tables was data-type
mismatch, which was overcome by matching the data types to those of the staging
tables.
6.5.5 Fact Table
One of the important tasks in ETL is the measures. The output of the dimension
tables forms the input to the fact table in this pipelined architecture; the fact table
mainly consists of the primary keys of the dimension tables and the measures, i.e.
the facts. The SSIS Lookup transformation was used here: a join query connects the
staging tables, and this data is compared with the lookup columns, i.e. the
attributes of the dimension tables. The fact table is thus populated automatically
via the lookups.
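The lookup step can be pictured as a key substitution: each natural key in a staging row is swapped for the surrogate key generated in the dimension tables, and the measures are carried along into the fact row. A minimal Python sketch of this idea, with invented sample keys and measures (the real work is done by the SSIS Lookup transformation):

```python
# Dimension tables: natural key -> surrogate (identity) key, as in 6.5.4.
# Sample keys are invented for illustration.
dim_airline = {"DL": 1, "9E": 2}
dim_fare_class = {"X": 1, "Y": 2}

def populate_fact(staging_rows):
    """Mimic the SSIS Lookup transformation: replace each natural key
    with its dimension surrogate key and keep the measures as facts."""
    fact = []
    for row in staging_rows:
        fact.append({
            "Airline_id": dim_airline[row["Opcarrier"]],   # lookup
            "Fare_id": dim_fare_class[row["FareClass"]],   # lookup
            "AverageFare": row["AverageFare"],             # measure
            "Distance": row["Distance"],                   # measure
        })
    return fact
```

A staging row whose natural key has no match in a dimension would raise an error here; in SSIS the same situation is surfaced through the Lookup transformation's no-match output.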
6.5.6 Analysis of the Cube in SSIS
After the fact table is generated, manually processing the cube each time is a
cumbersome task, so the Analysis Services Processing Task was used to process the
cube from SSIS. A sequence container holds two Analysis Services Processing Tasks,
because cube processing is achieved most efficiently as a sequential task: the
dimensions are processed first and then the cube.
6.5.7 Creating and deploying Cube in SSAS
SSAS (SQL Server Analysis Services) is mainly used for Online Analytical
Processing (OLAP). A connection was made to MOLAP to access the fact and
dimension tables used to create the cube. A data source view was made using the
same data repository, AIRLINES; dimensions and measures were then selected for
processing, and the cube was deployed successfully. The browser section was used
to explore the data and browse the dimensions, and also to check the authenticity
of the data.
6.5.8 Degree of Automation
The degree of automation achieved in the ETL process is one-touch automation: on
clicking the Execute button, the following processes run automatically:
• Data extraction
• Cleaning
• Transformation into the required target format
• Loading the data in SSIS
• Populating the fact table
• Cube deployment
7 Application
7.1 BI Query 1: Which is the most popular fare class in U.S. Domestic Airlines?
The contributing sources for this query are data source 2.1 and data source 2.2.
Figure 4 shows details of the most popular fare class among U.S. domestic airlines.
The average fare paid for each airline class (C, D, X, Y) is clearly visible. The
number of passengers travelling in business class is higher than in economy class,
even though the business-class fare is higher.
Figure 4: Results for BI Query 1
7.2 BI Query 2: What is the relation between Average fare paid by passengers and the distance travelled for the corresponding Airline?
The contributing sources for this query are data source 2.1 and data source 2.2.
The tree map in Figure 5 demonstrates the relation between average fare, distance
travelled, origin state and airline. Here, fare is independent of distance but depends
on the airline. For example, considering all the cases for Delta Air Lines (DL)
originating from different states, the fare varies according to the origin state and
the travelling distance. On the other hand, for Endeavor Air (9E) we can see the
dependency of fare on distance: as the distance increases, the fare decreases, and
vice versa.
Figure 5: Results for BI Query 2
7.3 BI Query 3: Which is the most trustworthy domestic Airline in the United States?
The contributing sources for this query are data source 2.2 and data source 2.3.
Figure 6 illustrates the general findings on the most trustworthy domestic airlines
in the U.S. Here we can see that OH (PSA Airlines) and MQ (Envoy Air) have
received a trust score of 4, making them the most trustworthy airlines, followed by
AA (Alaska Airlines) with a trust score of 3.
Figure 6: Results for BI Query 3
7.4 Discussion
All the BI queries discussed above satisfy the business requirements mentioned in
Section 1.
Considering the 1st BI query, fare acts as an important factor. Fang et al. (2017)
performed an analysis of the revenue generated by airlines from different types of
customers, using the concept of dynamic pricing to decide the discounts to be
offered. We can relate this to our query, where the maximum number of passengers
travel in business class; this might be due to the discounts offered on business class
compared with economy class.
In the 2nd BI query, we discussed the relation between the average fare paid, the
distance travelled and the operating carrier. Ferguson et al. (2009) discussed the
effect of fuel prices on rising airfares; their analysis showed how an increasing
number of carriers decreases passenger demand. We can relate their work here and
add further points involving the dependency of airfare on the distance travelled, the
operating carrier and the origin state of departure.
The third BI query, discussed in Section 7.3, pictures the most trustworthy airlines.
Wang et al. (2016) proposed a method for fine-grained sentiment analysis that also
involves emotions such as trust. Using the sentiment scores, we can analyse which
airlines to trust and which not to: MQ and OH can be considered the most
trustworthy airlines, compared with 9K (Cape Air) and OO (SkyWest Airlines),
which have very low trust scores. We can conclude that passengers, in general,
would opt for PSA Airlines, Envoy Air and Alaska Airlines on the grounds of their
higher trust scores.
8 Conclusion and Future Work
Air transport plays a vital role in connectivity throughout the world. The data
warehouse built here was able to answer all the stated business requirements: the
number of passengers travelling in each class, the most trustworthy domestic airline
in the U.S., and the dependency between fare and distance. Although the warehouse
can also answer further queries related to sentiment, such as positive and negative
feedback on airlines, the number of passengers travelling from different geographic
locations, and airfare and the factors affecting it, certain limitations must be
considered. As the data gathered consists of U.S. government data, social media
data, etc., we cannot rely on it alone for an in-depth analysis of this topic; more
detailed and granular data needs to be gathered to reach efficient outcomes.
Likewise, sentiment analysis should not rely on social media data alone; the range
of the research could be widened to gather data from multiple sources and draw
better conclusions. In the coming years, the challenges and opportunities of the
airline industry must be anticipated. A data warehouse built by overcoming the
above limitations would be more effective and would provide users with a more
in-depth analysis of fare-related queries. An attempt to build such a data warehouse
with a wider scope would be useful in demonstrating all queries related to the airline
industry on a single platform.
References
Chaudhuri, S. & Dayal, U. (1997), ā€˜An overview of data warehousing and olap
technologyā€™, SIGMOD Rec. 26(1), 65ā€“74.
Fang, Y., Chen, Y. & Li, X. (2017), Joint decision making about price and dura-
tion of discount airfares, in ā€˜2017 IEEE International Conference on Industrial
Engineering and Engineering Management (IEEM)ā€™, pp. 31ā€“34.
Ferguson, J., Hoļ¬€man, K., Sherry, L. & Kara, A. Q. (2009), Eļ¬€ects of fuel prices
on air transportation market average fares and passenger demand, in ā€˜9th AIAA
Aviation Technology, Integration and Operations (ATIO) Conference, Aircraft
Noise and Emissions Reduction Symposium (ANERS)ā€™.
Hao, L. & Yu, X. (2008), Dynamic pricing of airline tickets in competitive markets,
in ā€˜2008 4th International Conference on Wireless Communications, Networking
and Mobile Computingā€™, pp. 1ā€“5.
Kimball, R. & Ross, M. (2013), The Data Warehouse Toolkit, third edition, Wiley.
Moody, D. L. & Kortink, M. (2000), From enterprise models to dimensional models:
a methodology for data warehouse and data mart design, p. 5.
Wang, Z., Chong, C. S., Lan, L., Yang, Y., Ho, S. B. & Tong, J. C. (2016), Fine-
grained sentiment analysis of social media with emotion sensing, in ā€˜2016 Future
Technologies Conference (FTC)ā€™, pp. 1361ā€“1364.
Appendix
R code
R code for Data source 2.1
setwd("E:/NCI/sem1/datawarehous/project")
#--Package for downloading a file from the web
#--Reference: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html
library(utils)
#--Download the zip file directly from the data source to fetch
#--real-time data in R, then extract it
temp <- tempfile()
download.file("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BCoupon_2018_1.zip", temp)
#--Unzip file
#--Reference: https://stackoverflow.com/questions/3053833/using-r-to-download-zip
csvFile <- unz(temp, "Origin_and_Destination_Survey_DB1BCoupon_2018_1.csv")
#--Referred from R labs
AirlineData <- read.csv(csvFile, header = TRUE,
                        na.strings = c(""), stringsAsFactors = TRUE)
#--View the data frame for the further cleaning process
ncol(AirlineData)
View(AirlineData)
#--Check for NA values in the data frame
#--Referred from R labs
sapply(AirlineData, function(x) sum(is.na(x)))
names(AirlineData)
#--Removing unwanted columns
AirlineData[, c('Gateway', 'ItinGeoType', 'SeqNum', 'Coupons',
                'OriginStateFips', 'X')] <- list(NULL)
AirlineData[, c('CouponType', 'TkCarrier', 'DestStateFips',
                'CouponGeoType', 'DistanceGroup')] <- list(NULL)
AirlineData[, c('RPCarrier', 'DestWac', 'OriginWac')] <- list(NULL)
AirlineData[, c('OriginCountry', 'DestCountry', 'Break',
                'ItinID', 'MktID')] <- list(NULL)
#--Deleting rows with NA in FareClass
AirlineData <- AirlineData[!is.na(AirlineData$FareClass), ]
#--Subsetting the data using sample()
#--Reference: https://www.statmethods.net/management/subset.html
AirlineData <- AirlineData[sample(1:nrow(AirlineData), 81602,
                                  replace = FALSE), ]
#--View the data frame and check the number of rows
View(AirlineData)
NROW(AirlineData)
#--Writing the data to CSV
write.csv(AirlineData, "AirlinesCleanedData.csv")
R code for Data source 2.2
setwd("E:/NCI/sem1/DWBI/project/Statista")
#--Package for reading XLS files
install.packages("readxl")
library(readxl)
StatistaData <- read_excel("statistic_id642191_domestic-airports-in-the-
#--Removing unwanted rows
StatistaData <- StatistaData[-c(1, 2), ]
#--Changing the column name
colnames(StatistaData)[2] <- "AverageFare"
#--Writing CSV
write.csv(StatistaData, file = "E:/NCI/sem1/DWBI/project/Statista/Statis
#--Referred from Twitter labs
install.packages("tidytext")
install.packages("dplyr")
install.packages("reshape")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("RMySQL")
library(reshape)
library(ggplot2)
library(tidyr)
library(RMySQL)
install.packages("textcat")
install.packages("cld2")
install.packages("cld3")
install.packages("tidyverse")
#--Kill any open database connections
killDbConnections <- function() {
  all_cons <- dbListConnections(MySQL())
  print(all_cons)
  for (con in all_cons)
    dbDisconnect(con)
  print(paste(length(all_cons), "connections killed."))
}
#--Reference: https://stackoverflow.com/questions/32139596/cannot-allocate-a-new-
#--Creating a connection to the database
con <- dbConnect(MySQL(),
                 user = "twitter", password = "password",
                 dbname = "TWITTER", host = "18.203.249.12")
on.exit(dbDisconnect(con))
#--Fetching tweets related to airlines
resultset <- dbSendQuery(con, "select tweet_text,
                         created_at from tweets;")
#--Creating a data frame from the result set
AirlineTweetsData <- fetch(resultset, n = Inf)
summary(AirlineTweetsData)
#--Recognizing the text category
library(textcat)
#--Google language detectors
library(cld2)
library(cld3)
#--Collection of different R packages
library(tidyverse)
#--Filtering the gathered tweets for the English language
AirlineTweetsData <- AirlineTweetsData %>%
  mutate(textcat = textcat(x = tweet_text),
         cld2 = cld2::detect_language(text = tweet_text),
         cld3 = cld3::detect_language(text = tweet_text)) %>%
  select(tweet_text, textcat, cld2, cld3, created_at) %>%
  filter(cld2 == "en" & cld3 == "en")
summary(AirlineTweetsData)
#--Analyzing retweets
AirlineTweetsData$RT <- startsWith(AirlineTweetsData$tweet_text, "RT")
#--Removing retweets
AirlineTweetsData <- AirlineTweetsData[!AirlineTweetsData$RT, ]
View(AirlineTweetsData)
#--Creating a new data frame
AirlinesDataFiltering <- AirlineTweetsData
#--Filter tweets based on textcat
FilteredTweets <- AirlinesDataFiltering %>%
  filter(AirlinesDataFiltering$textcat == "english")
summary(FilteredTweets)
#--Removing unwanted columns
FilteredTweets[, c('textcat', 'cld2', 'cld3', 'created_at', 'RT')] <- list(NULL)
#--Filtering tweets from the data frame to collect reviews
#--of individual airlines using filter() and str_detect()
#--Reference: https://www.datanovia.com/en/lessons/subset-data-frame-rows-in-r/
AlaskaAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Alaska"))
#--Adding a column with the particular airline code
AlaskaAirlinesTweets['Airline'] <- 'AA'
#--For Cape Air
CapeAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Cape"))
CapeAirlinesTweets['Airline'] <- '9K'
#--For Delta Airlines
DeltaAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Delta"))
DeltaAirlinesTweets['Airline'] <- 'DL'
#--For United Airlines
UnitedAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "United"))
UnitedAirlinesTweets['Airline'] <- 'UA'
#--The same task was performed for the remaining airlines
#--Packages for performing sentiment analysis
library(syuzhet)
library(sentimentr)
#--Sentiment score for each review
mysentiment_review <- get_nrc_sentiment(AlaskaAirlinesTweets$tweet_text)
#--Mean of sentiments for Alaska Airlines
mean_AlaskaAirlines <- data.frame(mean(mysentiment_review$anger),
  mean(mysentiment_review$disgust), mean(mysentiment_review$anticipation),
  mean(mysentiment_review$fear), mean(mysentiment_review$joy),
  mean(mysentiment_review$sadness), mean(mysentiment_review$surprise),
  mean(mysentiment_review$trust), mean(mysentiment_review$negative),
  mean(mysentiment_review$positive))
#--Changing the column names
colnames(mean_AlaskaAirlines) <- c('anger', 'disgust', 'anticipation', 'fear',
  'joy', 'sadness', 'surprise', 'trust', 'negative', 'positive')
#--Adding a column with the specific airline code
mean_AlaskaAirlines['Airline'] <- 'AA'
#--Data frame with the required output for Alaska Airlines
View(mean_AlaskaAirlines)
#--For Delta Airlines
#--Sentiment score for each review
mysentiment_reviewDeltaAirlines <- get_nrc_sentiment(DeltaAirlinesTweets$tweet_text)
#--Mean of sentiments for Delta Airlines
mean_DeltaAirlines <- data.frame(mean(mysentiment_reviewDeltaAirlines$anger),
  mean(mysentiment_reviewDeltaAirlines$disgust),
  mean(mysentiment_reviewDeltaAirlines$anticipation),
  mean(mysentiment_reviewDeltaAirlines$fear),
  mean(mysentiment_reviewDeltaAirlines$joy),
  mean(mysentiment_reviewDeltaAirlines$sadness),
  mean(mysentiment_reviewDeltaAirlines$surprise),
  mean(mysentiment_reviewDeltaAirlines$trust),
  mean(mysentiment_reviewDeltaAirlines$negative),
  mean(mysentiment_reviewDeltaAirlines$positive))
#--Changing the column names
colnames(mean_DeltaAirlines) <- c('anger', 'disgust', 'anticipation',
  'fear', 'joy', 'sadness', 'surprise', 'trust', 'negative', 'positive')
#--Adding a column with the specific airline code
mean_DeltaAirlines['Airline'] <- 'DL'
#--Data frame with the required output for Delta Airlines
View(mean_DeltaAirlines)
#--Similarly, the average sentiments were calculated
#--for the remaining airlines
#--Merging multiple data frames into a single data frame
#--Reference: https://www.r-bloggers.com/concatenating-a-list-of-data-frames/
Mean_Airline_Sentiments <- do.call("rbind", list(mean_AlaskaAirlines,
  mean_DeltaAirlines, mean_HawaiianAirlines, mean_AmericanAirlines,
  mean_UnitedAirlines, mean_CapeAir, mean_EndeavorAir, mean_PSAAirlines,
  mean_EnvoyAir, mean_AirWisconsin))
#--Writing the data to CSV
write.csv(Mean_Airline_Sentiments,
          file = "E:/NCI/sem1/datawarehouse/project/AverageAirlineSetiments.csv")
Screenshots of the data sources used are as follows:
Figure 7: Data Source 1: Transtats
Figure 8: Data Source 2: Statista
Figure 9: Data Source 3: Twitter (Tweet Count)
More Related Content

Similar to Airfare Analysis of Domestic Airlines in U.S.

Cost Analysis of ComFrame: A Communication Framework for Data Management in ...
Cost Analysis of ComFrame: A Communication Framework for  Data Management in ...Cost Analysis of ComFrame: A Communication Framework for  Data Management in ...
Cost Analysis of ComFrame: A Communication Framework for Data Management in ...IOSR Journals
Ā 
UWProjectHandoffReport
UWProjectHandoffReportUWProjectHandoffReport
UWProjectHandoffReportJill Schulze
Ā 
Route Performance VNO - FRA
Route Performance VNO - FRA Route Performance VNO - FRA
Route Performance VNO - FRA Mohammed Awad
Ā 
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...IRJET Journal
Ā 
Predicting Air Transport Industry - 2018
Predicting Air Transport Industry  - 2018 Predicting Air Transport Industry  - 2018
Predicting Air Transport Industry - 2018 Mohammed Awad
Ā 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining projectAkshay Kumar Bhushan
Ā 
Predicting 2016 Airlines Performance
Predicting 2016   Airlines Performance Predicting 2016   Airlines Performance
Predicting 2016 Airlines Performance Mohammed Awad
Ā 
Air Ticket Price Prediction.pdf
Air Ticket Price Prediction.pdfAir Ticket Price Prediction.pdf
Air Ticket Price Prediction.pdfAdityaAryan45
Ā 
Passenger air transportation global market report 2018
Passenger air transportation global market report 2018Passenger air transportation global market report 2018
Passenger air transportation global market report 2018lakshmipraneethganti
Ā 
Arthur Yang - A1 Poster
Arthur Yang - A1 PosterArthur Yang - A1 Poster
Arthur Yang - A1 PosterArthur Yang
Ā 
Data warehousing and Business intelligence project on Tourism sector's impact...
Data warehousing and Business intelligence project on Tourism sector's impact...Data warehousing and Business intelligence project on Tourism sector's impact...
Data warehousing and Business intelligence project on Tourism sector's impact...SindhujanDhayalan
Ā 
Executive Summary (5 page paper)Ā· Research any of following Webs.docx
Executive Summary (5 page paper)Ā· Research any of following Webs.docxExecutive Summary (5 page paper)Ā· Research any of following Webs.docx
Executive Summary (5 page paper)Ā· Research any of following Webs.docxrhetttrevannion
Ā 
Describe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docxDescribe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docxwrite22
Ā 
Describe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docxDescribe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docx4934bk
Ā 
IRJET- Consumer Complaint Data Analysis
IRJET-  	  Consumer Complaint Data AnalysisIRJET-  	  Consumer Complaint Data Analysis
IRJET- Consumer Complaint Data AnalysisIRJET Journal
Ā 
INTRODUCTIONOne of the most critical factors in customer relat.docx
INTRODUCTIONOne of the most critical factors in customer relat.docxINTRODUCTIONOne of the most critical factors in customer relat.docx
INTRODUCTIONOne of the most critical factors in customer relat.docxbagotjesusa
Ā 
A. Nurra, From ICT survey data to experimental statistics; using IaD source f...
A. Nurra, From ICT survey data to experimental statistics; using IaD source f...A. Nurra, From ICT survey data to experimental statistics; using IaD source f...
A. Nurra, From ICT survey data to experimental statistics; using IaD source f...Istituto nazionale di statistica
Ā 
UNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWS
UNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWSUNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWS
UNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWSIJDKP
Ā 

Similar to Airfare Analysis of Domestic Airlines in U.S. (20)

Air passenger report
Air passenger reportAir passenger report
Air passenger report
Ā 
ANSI CPH Worksheet
ANSI CPH WorksheetANSI CPH Worksheet
ANSI CPH Worksheet
Ā 
Cost Analysis of ComFrame: A Communication Framework for Data Management in ...
Cost Analysis of ComFrame: A Communication Framework for  Data Management in ...Cost Analysis of ComFrame: A Communication Framework for  Data Management in ...
Cost Analysis of ComFrame: A Communication Framework for Data Management in ...
Ā 
UWProjectHandoffReport
UWProjectHandoffReportUWProjectHandoffReport
UWProjectHandoffReport
Ā 
Route Performance VNO - FRA
Route Performance VNO - FRA Route Performance VNO - FRA
Route Performance VNO - FRA
Ā 
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
Ā 
Predicting Air Transport Industry - 2018
Predicting Air Transport Industry  - 2018 Predicting Air Transport Industry  - 2018
Predicting Air Transport Industry - 2018
Ā 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining project
Ā 
Predicting 2016 Airlines Performance
Predicting 2016   Airlines Performance Predicting 2016   Airlines Performance
Predicting 2016 Airlines Performance
Ā 
Air Ticket Price Prediction.pdf
Air Ticket Price Prediction.pdfAir Ticket Price Prediction.pdf
Air Ticket Price Prediction.pdf
Ā 
Passenger air transportation global market report 2018
Passenger air transportation global market report 2018Passenger air transportation global market report 2018
Passenger air transportation global market report 2018
Ā 
Arthur Yang - A1 Poster
Arthur Yang - A1 PosterArthur Yang - A1 Poster
Arthur Yang - A1 Poster
Ā 
Data warehousing and Business intelligence project on Tourism sector's impact...
Data warehousing and Business intelligence project on Tourism sector's impact...Data warehousing and Business intelligence project on Tourism sector's impact...
Data warehousing and Business intelligence project on Tourism sector's impact...
Ā 
Executive Summary (5 page paper)Ā· Research any of following Webs.docx
Executive Summary (5 page paper)Ā· Research any of following Webs.docxExecutive Summary (5 page paper)Ā· Research any of following Webs.docx
Executive Summary (5 page paper)Ā· Research any of following Webs.docx
Ā 
Describe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docxDescribe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docx
Ā 
Describe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docxDescribe how users of the financial statements may benefit from.docx
Describe how users of the financial statements may benefit from.docx
Ā 
IRJET- Consumer Complaint Data Analysis
IRJET-  	  Consumer Complaint Data AnalysisIRJET-  	  Consumer Complaint Data Analysis
IRJET- Consumer Complaint Data Analysis
Ā 
INTRODUCTIONOne of the most critical factors in customer relat.docx
INTRODUCTIONOne of the most critical factors in customer relat.docxINTRODUCTIONOne of the most critical factors in customer relat.docx
INTRODUCTIONOne of the most critical factors in customer relat.docx
Ā 
A. Nurra, From ICT survey data to experimental statistics; using IaD source f...
A. Nurra, From ICT survey data to experimental statistics; using IaD source f...A. Nurra, From ICT survey data to experimental statistics; using IaD source f...
A. Nurra, From ICT survey data to experimental statistics; using IaD source f...
Ā 
UNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWS
UNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWSUNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWS
UNDERSTANDING CUSTOMERS' EVALUATIONS THROUGH MINING AIRLINE REVIEWS
Ā 

Recently uploaded

办ē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€
办ē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€åŠžē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€
办ē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€fhwihughh
Ā 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
Ā 
ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†
ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†
ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†e4aez8ss
Ā 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
Ā 
9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service
9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service
9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
Ā 
1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·
1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·
1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·vhwb25kk
Ā 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
Ā 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
Ā 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]šŸ“Š Markus Baersch
Ā 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
Ā 
Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”
Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”
Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”soniya singh
Ā 
From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...Florian Roscheck
Ā 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
Ā 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
Ā 
办ē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€
办ē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€åŠžē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€
办ē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€F La
Ā 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
Ā 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
Ā 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
Ā 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
Ā 

Recently uploaded (20)

办ē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€
办ē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€åŠžē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€
办ē†å­¦ä½čƁēŗ½ēŗ¦å¤§å­¦ęƕäøščƁ(NYUęƕäøščƁ书ļ¼‰åŽŸē‰ˆäø€ęƔäø€
Ā 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
Ā 
ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†
ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†
ē§‘ē½—ę‹‰å¤šå¤§å­¦ę³¢å°”å¾—åˆ†ę ”ęƕäøščƁ学位čÆęˆē»©å•-åÆ办ē†
Ā 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Ā 
9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service
9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service
9711147426āœØCall In girls Gurgaon Sector 31. SCO 25 escort service
Ā 
1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·
1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·
1:1定制(UQęƕäøščƁļ¼‰ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•äæ®ę”¹ē•™äæ”å­¦åŽ†č®¤čƁ原ē‰ˆäø€ęØ”äø€ę ·
Ā 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
Ā 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Ā 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
Ā 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
Ā 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Ā 
Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”
Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”
Call Girls in Defence Colony Delhi šŸ’ÆCall Us šŸ”8264348440šŸ”
Ā 
From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...
From idea to production in a day ā€“ Leveraging Azure ML and Streamlit to build...
Ā 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
Ā 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
Ā 
办ē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€
办ē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€åŠžē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€
办ē†(UWICęƕäøščƁ书)č‹±å›½å”čæŖ夫城åø‚大学ęƕäøščÆęˆē»©å•åŽŸē‰ˆäø€ęƔäø€
Ā 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
Ā 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
Ā 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
Ā 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
Ā 

Airfare Analysis of Domestic Airlines in U.S.

  • 1. Data Warehousing and Business Intelligence Project on Airfare Analysis of Domestic Airlines in U.S. Abhishek Surendra Dahale x17170311 MSc Data Analytics ā€“ 2018/9 Submitted to: Dr. Horacio Gonzalez-Velez
  • 2. National College of Ireland Project Submission Sheet ā€“ 2017/2018 School of Computing Student Name: Abhishek Surendra Dahale Student ID: x17170311 Programme: MSc Data Analytics Year: 2018/9 Module: Data Warehousing and Business Intelligence Lecturer: Dr.Horacio Gonzalez-Velez Submission Due Date: 26/11/2018 Project Title: Airfare Analysis of Domestic Airlines in U.S. I hereby certify that the information contained in this (my submission) is information pertaining to my own individual work that I conducted for this project. All information other than my own contribution is fully and appropriately referenced and listed in the relevant bibliography section. I assert that I have not referred to any work(s) other than those listed. I also include my TurnItIn report with this submission. ALL materials used must be referenced in the bibliography section. Students are encouraged to use the Harvard Referencing Standard supplied by the Library. To use other authorā€™s written or electronic work is an act of plagiarism and may result in disci- plinary action. Students may be required to undergo a viva (oral examination) if there is suspicion about the validity of their submitted work. Signature: Date: November 25, 2018 PLEASE READ THE FOLLOWING INSTRUCTIONS: 1. Please attach a completed copy of this sheet to each project (including multiple copies). 2. You must ensure that you retain a HARD COPY of ALL projects, both for your own reference and in case a project is lost or mislaid. It is not suļ¬ƒcient to keep a copy on computer. Please do not bind projects or place in covers unless speciļ¬cally requested. 3. Assignments that are submitted to the Programme Coordinator oļ¬ƒce must be placed into the assignment box located outside the oļ¬ƒce. Oļ¬ƒce Use Only Signature: Date: Penalty Applied (if applicable):
  • 3. Table 1: Mark sheet ā€“ do not edit Criteria Mark Awarded Comment(s) Objectives of 5 Related Work of 10 Data of 25 ETL of 20 Application of 30 Video of 10 Presentation of 10 Total of 100
  • 4. Project Check List This section capture the core requirements that the project entails represented as a check list for convenience. Used LATEX template Three Business Requirements listed in introduction At least one structured data source At least one unstructured data source At least three sources of data Described all sources of data All sources of data are less than one year old, i.e. released after 17/09/2017 Inserted and discussed star schema Completed logical data map Discussed the high level ETL strategy Provided 3 BI queries Detailed the sources of data used in each query Discussed the implications of results in each query Reviewed at least 5-10 appropriate papers on topic of your DWBI project
  • 5. Airfare Analysis of Domestic Airlines in U.S. Abhishek Surendra Dahale x17170311 November 25, 2018 Abstract Airline fare price analysis is most trending research topic nowadays as airline is the most used mode of transportation worldwide. Airline is the fastest and convenient means of transport for connectivity across any corner of globe. U.S. government after deregulating the US Airline industry, operating carriers started charging fare according to the services oļ¬€ered and also diļ¬€erent other factors which were used in this project for the analysis. The focus of this project is to analyze the deciding factors behind fare prices which will further help passengers to make a right choice according their travel purpose and traveling cost. This Analysis is developed on the basis of Kimballs Approach which also known as bottom up approach. It consists three stages, Extraction of data, Transformation of data and Finally loading of data into database. A cube was formed using data marts. Certain level of automation was achieved which includes automated deployment of cube and integration with R. Further the results were used for forming business query to evaluate business goals of this research. 1 Introduction In June 1997 the Department of Transportation of USA released the ļ¬rst quarterly fare report for the quarter of 1996 in the response of an increased number of customer in- quiries about airline fare price.Department started releasing Air Travel Consumer Report each month which included information about ļ¬‚ight delayed, over sales and mishandled baggage and various other complaints of consumers. A wide range of variety in average fare is oļ¬€ered by airlines. Due to which airline fare price analysis is essential, so that customer can evaluate the prices and book best tickets according to their requirements. Average fare dependency varies in many ways, it can diļ¬€er carrier to carrier, Size of the airport occupancy and city to city. 
Customers have benefited greatly from airline deregulation: because of it, airlines provide competitive fares and more services between destinations. If an airline has a wide fare range, it simply means that a large variety of fares is being offered in the market; such airlines generally have a limited number of low-fare seats, offered with many restrictions. In a low-fare market, fares cluster near the average fare, because these airlines mostly carry passengers who prefer low fares to paying a premium. The motivation behind building this data warehouse and business intelligence system is to analyse the reasons behind this variety of airfares through an in-depth analysis of the factors that decide the average fare. The concept of dynamic pricing, a tool used by the airline
  • 6. industry for revenue management, was used by Fang et al. (2017) to analyse the revenue generated from different types of customers. In relation to this, we can identify how the average fare affects pricing strategy. Ferguson et al. (2009) analysed airfares, studying how fuel prices affect rising fares. In addition, we can study more factors involved in pricing strategy, including the number of passengers, the distance travelled, the origin and destination of the flight, and customer reviews and feedback. The data warehouse built here addresses the following requirements:
1. Which is the most popular fare class in U.S. domestic airlines?
2. What is the relation between the average fare paid by passengers and the distance travelled for the corresponding airline?
3. Which is the most trustworthy domestic airline in the United States?
2 Data Sources
For building the data warehouse, the first two sources are structured data repositories that yield a large amount of airline-related data. The third source is unstructured. The data sources are described as follows:
  • 7.
Source | Type | Brief Summary
Transtats | Structured | Contains detailed information on every U.S. domestic flight, supporting the business requirements.
Statista | Structured | Data relevant to the requirement related to average airfare.
Twitter | Unstructured | Reviews extracted from Twitter, used for corpus-based sentiment analysis.
Table 2: Summary of sources of data used in the project
2.1 Source 1: Transtats
Within the Department of Transportation (DOT), the Bureau of Transportation Statistics (BTS) provides researchers with accurate and well-grounded transportation data, which can help inform investment and economic growth. This data was made available in June 2018. The repository yielded a large amount of data, which was cleaned using R; unwanted rows and columns were removed (the R code is attached in the appendix). The dataset consists of 36 columns, of which the attributes relevant to the business requirements are:
- Year
- Quarter
- Origin State
- Destination State
- Fare Class
- Passengers
- Distance
URL: https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BCoupon_2018_1.zip
This dataset plays a central role in the data warehouse: all the business requirements listed in Section 1 are supported by it.
2.2 Source 2: Statista
This dataset provides details about the top 10 domestic airports in the United States, including the average fare paid for these domestic flights. It was released in 2018. The data provided by Statista has a very limited number of rows, so no row-level cleaning was required, but the file consists of two sheets; R code (attached in the appendix) removes the unused sheet and text. The dataset includes the following attributes:
  • 8.
- Airport Code
- Average Fare
The average fare paid for a particular airline answers the fare-related business queries mentioned in Section 1.
Assumption: the Airlines column is assumed here, taken from the first dataset.
URL: https://www.statista.com/statistics/642191/us-domestic-airports-lowest-aver
2.3 Source 3: Twitter (Unstructured)
With reference to the operating carriers in the Transtats dataset, Twitter was used to gather reviews posted by consumers of U.S. domestic airlines, as these form a strong basis for future customers choosing the best airline at an affordable travelling cost. Sentiment analysis was performed on the gathered data so that the sentiments in the reviews could be analysed against a corpus. An AWS (Amazon Web Services) instance was used to pull the data from Twitter and store it in a MySQL database. This unstructured data was then pulled into R: the RMySQL package, which provides a MySQL driver for R, was used to gather this large collection of tweets about airlines. The data was cleaned and sentiment analysis was performed for each airline using the syuzhet package. Detailed code is attached in the appendix. The attributes of the unstructured data comprise sentiment scores for:
- anger
- anticipation
- disgust
- fear
- joy
- sadness
- surprise
- trust
- negative
- positive
Of these, the trust score is used for the analysis of the most trustworthy airline, i.e. the one people would believe in before choosing their carrier. The other sentiment scores are retained for possible further use in the BI application.
URL: https://www.twitter.com/
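The per-airline sentiment scores described above are, in essence, averages of lexicon-based emotion counts over each airline's tweets. A minimal Python sketch of that aggregation follows; the tiny emotion lexicon and sample tweets are hypothetical stand-ins for the NRC lexicon that syuzhet uses, not the project's actual data.

```python
from collections import defaultdict

# Hypothetical miniature emotion lexicon (stand-in for the NRC lexicon).
LEXICON = {
    "reliable": "trust", "ontime": "trust", "love": "joy",
    "delayed": "anger", "lost": "sadness",
}

def emotion_scores(text):
    """Count lexicon hits per emotion in one tweet."""
    counts = defaultdict(int)
    for word in text.lower().split():
        emotion = LEXICON.get(word)
        if emotion:
            counts[emotion] += 1
    return counts

def mean_trust(tweets):
    """Average 'trust' score across an airline's tweets."""
    if not tweets:
        return 0.0
    return sum(emotion_scores(t)["trust"] for t in tweets) / len(tweets)

# Hypothetical tweets for one airline.
tweets_dl = ["Crew was reliable and ontime", "Bag lost again", "reliable airline"]
print(mean_trust(tweets_dl))
```

The same per-emotion averaging, repeated for each of the ten categories listed above, yields one sentiment row per airline.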
  • 9. 3 Related Work
Air transport plays an important role in achieving economic growth and development, providing vital connectivity across the globe. Within the Department of Transportation (DOT), the Bureau of Transportation Statistics helps researchers analyse various service-quality elements of airlines, such as airfare, class-dependent in-flight services and flight operations. DOT releases an Air Travel Consumer Report that includes all of the above factors; these datasets have been used for more in-depth analysis of the effect of low-fare service on fares.
Ferguson et al. (2009) used this type of dataset to analyse fluctuations in airfares, studying how fuel prices drive rising fares alongside factors such as seasonality, the distance travelled by passengers and other economic demands. Their analysis shows how an increasing number of carriers decreases passenger demand. Using this type of dataset, we can therefore analyse how the average fare is affected by factors such as the number of passengers, operating carrier, distance travelled, etc.
The concept of dynamic pricing, a tool used by the airline industry for revenue management, was used by Fang et al. (2017), who analysed the revenue that airlines generate from different types of customers. They studied the decision-making process around the best and minimal prices, and the discounts to be offered to maximise profits; a Stackelberg game was used to design the pricing strategy. Similarly, Hao & Yu (2008) used a foundational game-theoretic model for dynamic pricing.
Following the above work, since the average airfare plays an important role in pricing strategy, the average fare from the dataset is used in the business requirements to analyse how other factors in the airline industry are decided based on this fare.
Social media is the richest source of electronic word of mouth. Reviews, feedback and comments posted by users reflect their sentiment towards a topic, so social media analytics can automatically classify text messages into sentiment categories (positive, negative, neutral). Wang et al. (2016) proposed a method for fine-grained sentiment analysis that senses detailed sentiments and emotions. This informed the sentiment analysis performed here on reviews and feedback for U.S. domestic airlines; as per the business requirement stated in Section 1, the trust score is used to identify the most trustworthy airlines.
4 Data Model
There are two main approaches to building a data warehouse: William Inmon's top-down approach and Ralph Kimball's bottom-up approach. For this data warehouse, Kimball's bottom-up approach was chosen because query execution time is lower, and the resulting database is lightweight and free of unnecessary complexity. Kimball's data warehouse supports dimensional data modelling. This approach was
  • 10. built with the end user's perspective in mind. It fits this project because the results of the business queries will help people choose the best and most trustworthy airline and compare airlines on travelling cost [Kimball & Ross (2013)]. The architecture for the data warehouse model is as follows:
Figure 1: Architecture (using Kimball's approach)
The schema chosen for the implementation is the star schema, which cleanly categorises the information in the data warehouse and provides a perspicuous view of the relations among the tables. Since this is a simple data warehouse with a relatively small number of tables joined by simple joins, the star schema is a good fit. The fact table contains the primary keys of all the dimension tables, together with measures in the form of sentiment scores, average fare, distance and number of passengers. The reasons for choosing the star schema are its query and load performance and the ease with which the warehouse structure can be understood [Chaudhuri & Dayal (1997)]. After performing the join operation, 81,602 records were generated with very low execution time, maximising warehouse performance. Figure 2 precisely delineates the star schema.
User queries form the basis of a star schema. A user will typically require specific information about airline details, the sentiments related to an airline, the average fare paid or the number of passengers travelling, so this information is grouped into tables by category. For example, any time-related factor can be found in the dimension table DimTime, which here includes entities such as Year and Quarter.
Similarly, location-related information is grouped into the dimension table DimLocation, with attributes such as City, Origin State and Destination State. Airline-related information is found in DimAirline, fare information in DimFare, and airline sentiment information in DimSentiment. Given these tables, there is no need to create sub-tables for any specific dimension, which would lead to the formation of a
  • 11. Figure 2: Star Schema
snowflake schema. Instead, a star schema is achieved by keeping a simple, easily understood structure for the above dimension tables. The key elements of the fact table are the measures, which can be used to perform different types of aggregation. The dimension tables play a vital role in populating the measures in the fact table, and the fact table relates to the dimension tables through one-to-many relationships. The fact table populated for the data warehouse therefore comprises the primary keys of the dimension tables, viz. Airline_id, Time_id, Fare_id, Location_id and Sentiment_id, together with the measures: average fare, distance, sentiment scores and number of passengers [L. Moody & Kortink (2000)]. As per the business requirements stated in Section 1, the business relevance of the measures is as follows:
- Average Fare: used to analyse the relation with distance, as per the requirement stated in Section 1.
- Trust Score: the trust sentiment score, used to identify the most trustworthy airlines in the U.S.
  • 12.
- Distance: as per the second requirement, used to analyse the dependency of fare on distance.
- Passengers: as per the first requirement, used to demonstrate the number of passengers travelling in each airline fare class, e.g. business or economy class.
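The fact/dimension layout above can be illustrated with a toy in-memory star schema: each fact row holds dimension surrogate keys plus measures, and a query joins back through a dimension to aggregate a measure. All table contents below are hypothetical, chosen only to mirror the warehouse's structure.

```python
# Toy star schema: dimension tables keyed by surrogate id, fact rows hold ids + measures.
dim_airline = {1: "DL", 2: "9E"}
dim_fare = {1: "X", 2: "C"}

# Fact rows: (airline_id, fare_id, passengers, avg_fare, distance)
fact = [
    (1, 1, 120, 210.0, 900),
    (1, 2, 40, 640.0, 900),
    (2, 1, 80, 150.0, 400),
]

def passengers_by_fare_class(fact_rows):
    """Aggregate the 'passengers' measure grouped by the fare-class dimension."""
    totals = {}
    for airline_id, fare_id, passengers, avg_fare, distance in fact_rows:
        fare_class = dim_fare[fare_id]  # star-schema join: fact -> dimension
        totals[fare_class] = totals.get(fare_class, 0) + passengers
    return totals

print(passengers_by_fare_class(fact))
```

This is exactly the shape of query that BI Query 1 runs against the cube: one join from the fact table to a dimension, then a group-by over a measure.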
  • 13. 5 Logical Data Map
Table 3: Logical data map describing all transformations, sources and destinations for all components of the data model illustrated in Figure 2
Source | Column | Destination | Column | Type | Transformation
1 | OpCarrier | DimAirline | OpCarrier | Dimension | Missing values removed using is.na() and na.omit()
1 | Distance | Fact | Distance | Fact | Contained numeric values; rows with missing data removed
1 | Year | DimTime | Year | Dimension | No transformation required (single year: 2018); missing values removed using na.omit()
1 | Quarter | DimTime | Quarter | Dimension | Single value (1, for the 1st quarter); no transformation required
1 | OriginState | DimLocation | OriginState | Dimension | Missing values removed using na.omit()
1 | DestState | DimLocation | DestState | Dimension | No transformation required; missing values removed using na.omit()
1 | Origin | DimLocation | Origin | Dimension | No transformation required; missing values removed using na.omit()
1 | Dest | DimLocation | Dest | Dimension | No transformation required; missing values removed using na.omit()
2 | Average fare in U.S. dollars | DimFare | AverageFare | Fact | No cleaning required
3 | carrier_name | DimSentiment | carrier_name | Dimension | No transformation required
1 | FareClass | DimFare | FareClass | Dimension | No transformation needed; only missing values removed
Continued on next page
  • 14. Table 3 - Continued from previous page
Source | Column | Destination | Column | Type | Transformation
1 | Airline | DimFare | Airline | Dimension | Missing values removed using is.na() and na.omit()
3 | anticipation | FactTable | anticipation | Fact | Reviews cleaned; the sentimentr and syuzhet libraries used to calculate the sentiment score
3 | fear | FactTable | fear | Fact | As above
3 | joy | FactTable | joy | Fact | As above
3 | sadness | FactTable | sadness | Fact | As above
3 | surprise | FactTable | surprise | Fact | As above
3 | trust | FactTable | trust | Fact | As above
3 | positive | FactTable | positive | Fact | As above
3 | negative | FactTable | negative | Fact | As above
  • 15. 6 ETL Process
Extraction, transformation and loading (ETL) is the foundation of the data warehouse and business intelligence process. Collecting data from different sources plays an important role: since a data warehouse is used in decision making and knowledge management, the gathered data must be clean and free of redundant or inconsistent records [Chaudhuri & Dayal (1997)]. The ETL strategy used to build the data warehouse is described below.
6.1 Extraction
Airline-related data was extracted from the three sources described in Section 2: two structured datasets and one unstructured dataset extracted from Twitter.
6.2 Cleaning
While building a data warehouse, it is important to maintain the quality and consistency of the data. As the data contained a large number of records, it included anomalous and duplicate rows as well as rows with null fields. Cleaning was performed to make the data consistent and redundancy-free. The cleaning of each dataset is detailed below.
6.2.1 Source 1: Transtats
All cleaning was performed in R as a fully automated process: the data downloaded from the web arrives as a zip archive, which is unzipped; the CSV file is read into R, cleaned, and the cleaned data written back to CSV so that it is available for the next ETL step. Extraction and cleaning were thus achieved with one-touch automation, from fetching the data from the web to writing the cleaned CSV. The R code is attached in the appendix.
6.2.2 Source 2: Statista
The data gathered from Statista contained two sheets. According to the business requirements in Section 1, only one sheet, which already contained clean data, was required.
Hence, cleaning was performed to remove the unused sheet and text fields; the R code used for this is attached in the appendix.
6.2.3 Source 3: Twitter
The tweets gathered from Twitter contained a lot of noise, including tweets in other languages, retweets and many unwanted columns. This unused data
was further cleaned before performing sentiment analysis. An average sentiment score was calculated for each airline and written to a CSV file, which is used as input to the next ETL steps. All of this was done in R; the code is attached in the appendix.
6.3 Transform
After cleaning the data and ensuring it contained no noise, the cleaned data from all sources was transformed into CSV format. Data mapping plays an important role in transformation, and data integration and aggregation were performed at this stage.
6.4 Load
Loading is where the actual database comes into the picture. Microsoft's SQL Server Management Studio (SSMS) was used, with a database named AIRLINE storing the raw data. The load step of ETL loads the flat files into the target database repository; Microsoft's SSIS tool was used for this. The three sources feed five separate tables at the staging area, which in turn feed the dimension tables, and the dimensions are then used to populate the fact table. This is not the end of database creation: the data can still be modified without affecting other processes, e.g. by increasing the granularity of dimensions and facts, or adding or removing attributes and facts. Loading is one of the most important and most time-consuming stages of ETL; its main challenges here, connectivity to the MOLAP server and data-type mismatches between the staging and dimension areas, were overcome to perform the ETL successfully. The overview of the ETL for the data warehouse model is shown below:
Figure 3: Overview of ETL
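The cleaning step described in Section 6.2 amounts to dropping unwanted columns and removing rows with null fields before handing a flat file to the next stage. A stdlib-only Python sketch of that row/column filter follows; the column names and sample rows are hypothetical, standing in for the Transtats extract.

```python
import csv
import io

# Hypothetical raw extract: one unwanted column (Gateway) and two incomplete rows.
RAW = """Year,Quarter,OpCarrier,Gateway,Distance
2018,1,DL,1,900
2018,1,,0,400
2018,1,9E,1,
"""

KEEP = ["Year", "Quarter", "OpCarrier", "Distance"]  # drop unwanted columns

def clean(raw_csv):
    """Keep only the KEEP columns and drop any row with an empty field (na.omit-style)."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    return [
        {col: row[col] for col in KEEP}
        for row in rows
        if all(row[col] not in ("", None) for col in KEEP)
    ]

print(clean(RAW))  # only the fully populated first row survives
```

The cleaned rows would then be written back out as a flat file for the staging load, just as the R scripts in the appendix write AirlinesCleanedData.csv.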
  • 17. 6.5 Overview
The model above works as follows:
6.5.1 Integration with R
An Execute Process task is used to integrate R with SSIS. The aim of using Execute Process at this stage is to automate SSIS so that data downloaded from the web is read, cleaned and written to CSV, ready to be used as input for the next stage of this pipelined architecture. The Transtats data is downloaded directly from its URL and cleaned so that it is readily available as the next stage's input; the same procedure is followed for the Statista data. An R script performs the extraction, cleaning and writing to CSV.
6.5.2 Truncation
Because data quality plays an important role in ETL, the process must be re-runnable and automated. All dimension and fact tables are therefore truncated so that fresh data is loaded every time the ETL process is executed.
6.5.3 Staging
After truncation, fresh data is loaded into SSIS from the flat-file sources. For the three data sources, three staging tables were created and loaded with the help of SSIS.
6.5.4 Dimension
The staging tables created in the previous step are used to populate the dimension tables. While creating a dimension table, the primary key is generated as an identity column for each dimension ID. The dimensions are detailed in Section 4. A key challenge when creating the dimension tables was data-type mismatch, which was overcome by matching the data types to those of the staging tables.
6.5.5 Fact Table
One of the important tasks in ETL is populating the measures. The output of the dimension tables forms the input to the fact table in this pipelined architecture; the fact table consists mainly of the primary keys of the dimension tables plus the measures (facts).
The Lookup transformation component of SSIS helps with the lookup operation: a join query connects the staging tables, and this data is compared with the lookup columns, i.e. the attributes of the dimension tables. The fact table is thus populated automatically with the help of the lookups.
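The lookup step above replaces each natural key arriving from staging with the matching dimension's surrogate key before the fact row is inserted. A minimal Python sketch of that surrogate-key lookup follows; the table contents are hypothetical, and real SSIS lookups would also handle no-match rows.

```python
# Dimension table built earlier: natural key -> surrogate key (identity column).
dim_airline = {"DL": 1, "9E": 2}

# Staging rows carry the natural key plus the measures.
staging = [
    {"OpCarrier": "DL", "Passengers": 120, "Distance": 900},
    {"OpCarrier": "9E", "Passengers": 80, "Distance": 400},
]

def build_fact(staging_rows, dim):
    """Replace the natural key with the dimension's surrogate key (Lookup-style)."""
    fact = []
    for row in staging_rows:
        airline_id = dim[row["OpCarrier"]]  # lookup against the dimension table
        fact.append({"Airline_id": airline_id,
                     "Passengers": row["Passengers"],
                     "Distance": row["Distance"]})
    return fact

print(build_fact(staging, dim_airline))
```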
  • 18. 6.5.6 Analysis of the Cube in SSIS
After the fact table is generated, manually processing the cube every time is cumbersome, so the Analysis Services Processing Task is used to process the cube from SSIS. A sequence container holds two Analysis Services Processing Tasks, because cube processing is achieved efficiently as a sequential task: the dimensions are processed first, then the cube.
6.5.7 Creating and Deploying the Cube in SSAS
SSAS (SQL Server Analysis Services) is mainly used for online analytical processing (OLAP). A connection was made to MOLAP to access the fact and dimension tables used to create the cube, and a data source view was built on the same AIRLINES repository. Dimensions and measures were then selected for processing, and the cube was deployed successfully. The browser section was used to explore the data and browse the dimensions; it also allows the data's authenticity to be checked.
6.5.8 Degree of Automation
The degree of automation achieved in the ETL process is one-touch automation: clicking the execute button triggers the following automated steps:
- Data extraction
- Cleaning
- Transformation into the required target format
- Loading the data in SSIS
- Populating the fact table
- Cube deployment
7 Application
7.1 BI Query 1: Which is the most popular fare class in U.S. domestic airlines?
The contributing sources for this query are data source 2.1 and data source 2.2. Figure 4 shows the most popular fare classes among U.S. domestic airlines. We can clearly see the average fare paid for each fare class (C, D, X and Y). The number of passengers travelling in business class is higher than in economy class, even though the business-class fare is higher.
  • 19. Figure 4: Results for BI Query 1
7.2 BI Query 2: What is the relation between the average fare paid by passengers and the distance travelled for the corresponding airline?
The contributing sources for this query are data source 2.1 and data source 2.2. The tree map in Figure 5 shows the relation between average fare, distance travelled, origin state and airline. Fare is largely independent of distance but depends on the airline: for example, across all cases for Delta Air Lines (DL) originating from different states, the fare varies with origin state and travelling distance. On the other hand, for Endeavor Air (9E) the fare does depend on distance: as the distance increases, the fare decreases, and vice versa.
  • 20. Figure 5: Results for BI Query 2
7.3 BI Query 3: Which is the most trustworthy domestic airline in the United States?
The contributing sources for this query are data source 2.2 and data source 2.3. Figure 6 illustrates the general findings on the most trustworthy domestic airlines in the U.S.: OH (PSA Airlines) and MQ (Envoy Air) received a trust score of 4, making them the most trustworthy airlines, followed by AA (Alaska Airlines) with a trust score of 3.
  • 21. Figure 6: Results for BI Query 3
7.4 Discussion
All the BI queries discussed above satisfy the business requirements mentioned in Section 1.
Considering the first BI query, fare acts as an important factor. Fang et al. (2017) analysed the revenue airlines generate from different types of customers, using the concept of dynamic pricing to decide the discounts to be offered. We can relate this to our query, where the maximum number of passengers travel in business class; this might be due to discounts offered on business class compared with economy class.
The second BI query discussed the relation between the average fare paid, the distance travelled and the operating carrier. Ferguson et al. (2009) discussed the effect of fuel prices on rising airfares; their analysis showed how an increasing number of carriers decreases passenger demand. We can relate to their work and add further points on the dependency of airfare on distance travelled, operating carrier and the origin state of departure.
The third BI query, discussed in Section 7.3, pictures the most trustworthy airlines. Wang et al. (2016) proposed a method for fine-grained sentiment analysis that also covers emotions such as trust. Using the sentiment scores, we can analyse which airlines to trust and which not to trust: MQ and OH can be considered the most trustworthy airlines, compared with 9K (Cape Air) and OO (SkyWest Airlines), which have very low trust scores. We can conclude that passengers, in general, would opt for PSA Airlines, Envoy Air and Alaska Airlines on the grounds of their higher trust scores.
  • 22. 8 Conclusion and Future Work
Air transport plays a vital role in connectivity throughout the world. The data warehouse built here was able to answer all the stated business requirements: the number of passengers travelling in each class, the most trustworthy domestic airline in the U.S., and the dependency between fare and distance. The warehouse can also answer further queries, such as positive and negative feedback on airlines, the number of passengers travelling from different geographic locations, and airfare and the factors affecting it, but there are limitations to consider. Because the data gathered here comes from the U.S. government, social media, etc., we cannot rely on it alone for in-depth analysis of this topic; more detailed and granular data is needed for efficient outcomes. Likewise, we cannot rely solely on social media data for sentiment analysis, but should widen the research to gather data from multiple sources to draw better conclusions. In the coming years, we need to anticipate the challenges and opportunities of the airline industry. A data warehouse built by overcoming the above limitations would be more effective and would provide more in-depth analysis of fare-related queries to users. Building such a data warehouse at a wider scope would make it possible to demonstrate all airline-industry queries on a single platform.
References
Chaudhuri, S. & Dayal, U. (1997), 'An overview of data warehousing and OLAP technology', SIGMOD Rec. 26(1), 65-74.
Fang, Y., Chen, Y. & Li, X.
(2017), Joint decision making about price and duration of discount airfares, in '2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM)', pp. 31-34.
Ferguson, J., Hoffman, K., Sherry, L. & Kara, A. Q. (2009), Effects of fuel prices on air transportation market average fares and passenger demand, in '9th AIAA Aviation Technology, Integration and Operations (ATIO) Conference, Aircraft Noise and Emissions Reduction Symposium (ANERS)'.
Hao, L. & Yu, X. (2008), Dynamic pricing of airline tickets in competitive markets, in '2008 4th International Conference on Wireless Communications, Networking and Mobile Computing', pp. 1-5.
Kimball, R. & Ross, M. (2013), 'The Data Warehouse Toolkit, third edition'.
L. Moody, D. & Kortink, M. (2000), From enterprise models to dimensional models: a methodology for data warehouse and data mart design, p. 5.
  • 23. Wang, Z., Chong, C. S., Lan, L., Yang, Y., Ho, S. B. & Tong, J. C. (2016), Fine-grained sentiment analysis of social media with emotion sensing, in '2016 Future Technologies Conference (FTC)', pp. 1361-1364.
Appendix
R code
R code for data source 2.1

setwd("E:/NCI/sem1/datawarehous/project")

# Package for downloading files from the web
# Reference URL: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html
library(utils)

# Download the zip file directly from the data source to fetch
# real-time data in R, and extract it
temp <- tempfile()
download.file("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BCoupon_2018_1.zip", temp)

# Unzip the file
# Reference URL: https://stackoverflow.com/questions/3053833/using-r-to-download-zip
csvFile <- unz(temp, "Origin_and_Destination_Survey_DB1BCoupon_2018_1.csv")

# Referred from R labs
AirlineData <- read.csv(csvFile, header = TRUE, na.strings = c(""), stringsAsFactors = TRUE)

# View the data frame for the further cleaning process
colnames(AirlineData)
View(AirlineData)

# Check for NA values in the data frame (referred from R labs)
sapply(AirlineData, function(x) sum(is.na(x)))
names(AirlineData)

# Remove unwanted columns (each column is removed once)
AirlineData[, c('Gateway', 'ItinGeoType', 'SeqNum', 'Coupons', 'OriginStateFips', 'X')] <- list(NULL)
AirlineData[, c('CouponType', 'TkCarrier', 'DestStateFips', 'CouponGeoType', 'DistanceGroup')] <- list(NULL)
AirlineData[, c('RPCarrier', 'DestWac', 'OriginWac')] <- list(NULL)
  • 24.
AirlineData[, c('OriginCountry', 'DestCountry', 'Break', 'ItinID', 'MktID')] <- list(NULL)

# Delete rows with NA
AirlineData <- AirlineData[!is.na(AirlineData$FareClass), ]

# Subsetting of data using sample()
# Reference URL: https://www.statmethods.net/management/subset.html
AirlineData <- AirlineData[sample(1:nrow(AirlineData), 81602, replace = FALSE), ]

# View the data frame and check the number of rows
View(AirlineData)
NROW(AirlineData)

# Write the cleaned data to CSV
write.csv(AirlineData, "AirlinesCleanedData.csv")

R code for data source 2.2

setwd("E:/NCI/sem1/DWBI/project/Statista")

# Package for reading XLS files
install.packages("readxl")
library(readxl)
StatistaData <- read_excel("statistic_id642191_domestic-airports-in-the-")

# Remove unwanted rows
StatistaData <- StatistaData[-c(1, 2), ]

# Change the column name
colnames(StatistaData)[2] <- "AverageFare"

# Write CSV
write.csv(StatistaData, file = "E:/NCI/sem1/DWBI/project/Statista/Statis")

# Referred from Twitter labs
install.packages("tidytext")
install.packages("dplyr")
install.packages("reshape")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("RMySQL")
library(reshape)
library(ggplot2)
library(tidyr)
library(RMySQL)
install.packages("textcat")
install.packages("cld2")
install.packages("cld3")
install.packages("tidyverse")

# Kill any open connections
killDbConnections <- function() {
  all_cons <- dbListConnections(MySQL())
  print(all_cons)
  for (con in all_cons)
    dbDisconnect(con)
  print(paste(length(all_cons), "connections killed."))
}
# Reference URL: https://stackoverflow.com/questions/32139596/cannot-allocate-a-new-

# Creating a connection to the database
con <- dbConnect(MySQL(), user = "twitter", password = "password",
                 dbname = "TWITTER", host = "18.203.249.12")
on.exit(dbDisconnect(con))

# Fetching tweets related to airlines from the database
resultset <- dbSendQuery(con, "select tweet_text, created_at from tweets;")

# Creating a data frame from the result set
AirlineTweetsData <- fetch(resultset, n = Inf)
summary(AirlineTweetsData)

# Text category recognition
library(textcat)
# Google language detectors
library(cld2)
library(cld3)
# Collection of different R packages
library(tidyverse)

# Filtering the gathered tweets for the English language
AirlineTweetsData <- AirlineTweetsData %>%
  mutate(textcat = textcat(x = tweet_text),
         cld2 = cld2::detect_language(text = tweet_text),
         cld3 = cld3::detect_language(text = tweet_text)) %>%
  select(tweet_text, textcat, cld2, cld3, created_at) %>%
  filter(cld2 == "en" & cld3 == "en")
summary(AirlineTweetsData)

#---Analysing retweets
AirlineTweetsData$RT <- startsWith(AirlineTweetsData$tweet_text, "RT")

#-----Removing retweets
AirlineTweetsData <- AirlineTweetsData[!AirlineTweetsData$RT, ]
View(AirlineTweetsData)

# Creating a new data frame
AirlinesDataFiltering <- AirlineTweetsData
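The English-language filter above keeps only tweets that both detectors classify as English. A minimal base-R sketch of that logic with a toy data frame (the sample tweets and pre-filled detector columns are illustrative, not from the project data):

```r
# Toy stand-in for the English-language filter: keep only the rows
# where both (hypothetical) detector columns agree on "en".
tweets <- data.frame(
  tweet_text = c("great flight", "vol en retard", "nice crew"),
  cld2 = c("en", "fr", "en"),
  cld3 = c("en", "fr", "en"),
  stringsAsFactors = FALSE
)
english <- tweets[tweets$cld2 == "en" & tweets$cld3 == "en", ]
nrow(english)  # 2
```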
# Filtering tweets based on textcat
FilteredTweets <- AirlinesDataFiltering %>%
  filter(AirlinesDataFiltering$textcat == "english")
summary(FilteredTweets)

# Removing unwanted columns
FilteredTweets[, c('textcat', 'cld2', 'cld3', 'created_at', 'RT')] <- list(NULL)

# Filtering tweets from the data frame to collect reviews of
# particular airlines using filter() and str_detect()
# Reference URL: https://www.datanovia.com/en/lessons/subset-data-frame-rows-in-r/
AlaskaAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Alaska"))

# Adding a column with the particular airline code
AlaskaAirlinesTweets['Airline'] <- 'AA'

# For Cape Air
CapeAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Cape"))
CapeAirlinesTweets['Airline'] <- '9K'

#--For Delta Airlines
DeltaAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "Delta"))
DeltaAirlinesTweets['Airline'] <- 'DL'

#--For United Airlines
UnitedAirlinesTweets <- FilteredTweets %>%
  filter(str_detect(FilteredTweets$tweet_text, "United"))
UnitedAirlinesTweets['Airline'] <- 'UA'

# The same task was performed for the remaining airlines

#--Packages for performing sentiment analysis
library(syuzhet)
library(sentimentr)

#--Sentiment scores for each review
mysentiment_review <- get_nrc_sentiment(AlaskaAirlinesTweets$tweet_text)

#--Mean of sentiments for Alaska Airlines
mean_AlaskaAirlines <- data.frame(mean(mysentiment_review$anger),
  mean(mysentiment_review$disgust), mean(mysentiment_review$anticipation),
  mean(mysentiment_review$fear), mean(mysentiment_review$joy),
  mean(mysentiment_review$sadness), mean(mysentiment_review$surprise),
  mean(mysentiment_review$trust), mean(mysentiment_review$negative),
  mean(mysentiment_review$positive))
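The per-airline blocks above repeat one pattern: subset by keyword, then tag the matches with a carrier code. As a sketch, the repetition could be factored into a helper (the function name tagAirlineTweets is illustrative and not part of the project code; it uses base R's grepl() in place of str_detect()):

```r
# Hypothetical helper: subset tweets whose text mentions a keyword and
# tag the matching rows with the given airline code.
tagAirlineTweets <- function(tweets, keyword, code) {
  matched <- tweets[grepl(keyword, tweets$tweet_text), , drop = FALSE]
  matched$Airline <- rep(code, nrow(matched))
  matched
}

# Toy data frame standing in for FilteredTweets
toyTweets <- data.frame(
  tweet_text = c("Loved my Delta flight", "United lost my bag"),
  stringsAsFactors = FALSE
)
tagAirlineTweets(toyTweets, "Delta", "DL")$Airline  # "DL"
```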
#--Changing of column names
colnames(mean_AlaskaAirlines) <- c('anger', 'disgust', 'anticipation', 'fear',
  'joy', 'sadness', 'surprise', 'trust', 'negative', 'positive')

# Adding a column with the specific airline
mean_AlaskaAirlines['Airline'] <- 'AA'

# Data frame with the required output for Alaska Airlines
View(mean_AlaskaAirlines)

# For Delta Airlines
# Sentiment scores for each review
mysentiment_reviewDeltaAirlines <- get_nrc_sentiment(DeltaAirlinesTweets$tweet_text)

# Mean of sentiments for Delta Airlines
mean_DeltaAirlines <- data.frame(mean(mysentiment_reviewDeltaAirlines$anger),
  mean(mysentiment_reviewDeltaAirlines$disgust),
  mean(mysentiment_reviewDeltaAirlines$anticipation),
  mean(mysentiment_reviewDeltaAirlines$fear),
  mean(mysentiment_reviewDeltaAirlines$joy),
  mean(mysentiment_reviewDeltaAirlines$sadness),
  mean(mysentiment_reviewDeltaAirlines$surprise),
  mean(mysentiment_reviewDeltaAirlines$trust),
  mean(mysentiment_reviewDeltaAirlines$negative),
  mean(mysentiment_reviewDeltaAirlines$positive))

# Changing of column names
colnames(mean_DeltaAirlines) <- c('anger', 'disgust', 'anticipation',
  'fear', 'joy', 'sadness', 'surprise', 'trust', 'negative', 'positive')

# Adding a column with the specific airline
mean_DeltaAirlines['Airline'] <- 'DL'

# Data frame with the required output for Delta Airlines
View(mean_DeltaAirlines)

# The average sentiments were calculated similarly
# for the remaining airlines

# Merging multiple data frames into a single data frame
# Reference URL: https://www.r-bloggers.com/concatenating-a-list-of-data-frames/
Mean_Airline_Sentiments <- do.call("rbind", list(mean_AlaskaAirlines,
  mean_DeltaAirlines, mean_HawaiianAirlines, mean_AmericanAirlines,
  mean_UnitedAirlines, mean_CapeAir, mean_EndeavorAir, mean_PSAAirlines,
  mean_EnvoyAir, mean_AirWisconsin))

# Writing the data to CSV
write.csv(Mean_Airline_Sentiments, file = "E:/NCI/sem1/datawarehouse/project/AverageAirlineSetiments.csv")
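The ten mean() calls per airline can be collapsed with colMeans(), since get_nrc_sentiment() returns a numeric data frame with one column per emotion. A minimal sketch under that assumption (the helper name meanSentiments and the toy scores are illustrative, not from the project code):

```r
# Hypothetical helper: average the per-tweet sentiment scores and tag
# the resulting one-row data frame with an airline code.
meanSentiments <- function(scores, code) {
  out <- as.data.frame(t(colMeans(scores)))
  out$Airline <- code
  out
}

# Toy scores standing in for get_nrc_sentiment() output
scores <- data.frame(anger = c(0, 1), joy = c(1, 1), positive = c(2, 0))
meanSentiments(scores, "DL")$anger  # 0.5
```

With this helper, each airline's block reduces to one line, and the column names come from the score columns automatically instead of being retyped.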
Screenshots of the data sources used are as follows:

Figure 7: Data Source 1: Transtats

Figure 8: Data Source 2: Statista
Figure 9: Data Source 3: Twitter (Tweet Count)