Built a data warehouse from multiple data sources and ETL methodologies and executed three non-trivial Business Intelligence queries.
Technologies/Tools: R, SQL, Visual Studio, SQL Server Management, Tableau
Impact of Travel & Tourism on Economy
Data Warehousing and Business Intelligence
Project
on
Impact of Travel & Tourism on Economy
Shantanu Deshpande
x18125514
Video Link: https://youtu.be/1upKlsPfWJ4
MSc/PGDip Data Analytics – 2019/20
Submitted to: Prof. Sean Heeney
National College of Ireland
Project Submission Sheet – 2019/2020
School of Computing
Student Name: Shantanu Deshpande
Student ID: x18125514
Programme: MSc Data Analytics
Year: 2019/20
Module: Data Warehousing and Business Intelligence
Lecturer: Prof. Sean Heeney
Submission Due Date: 12/04/2019
Project Title: Impact of Travel & Tourism on the Economy
I hereby certify that the information contained in this (my submission) is information
pertaining to my own individual work that I conducted for this project. All information
other than my own contribution is fully and appropriately referenced and listed in the
relevant bibliography section. I assert that I have not referred to any work(s) other than
those listed. I also include my TurnItIn report with this submission.
ALL materials used must be referenced in the bibliography section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
another author's written or electronic work is an act of plagiarism and may result in
disciplinary action. Students may be required to undergo a viva (oral examination) if
there is suspicion about the validity of their submitted work.
Signature:
Date: April 12, 2019
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. You must ensure that you retain a HARD COPY of ALL projects, both for
your own reference and in case a project is lost or mislaid. It is not sufficient to keep
a copy on computer. Please do not bind projects or place in covers unless specifically
requested.
3. Assignments that are submitted to the Programme Coordinator office must be placed
into the assignment box located outside the office.
Office Use Only
Signature:
Date:
Penalty Applied (if applicable):
Table 1: Mark sheet – do not edit
Criteria Mark Awarded Comment(s)
Objectives of 5
Related Work of 10
Data of 25
ETL of 20
Application of 30
Video of 10
Presentation of 10
Total of 100
Project Check List
This section captures the core requirements of the project, presented as a checklist
for convenience.
Used LATEX template
Three Business Requirements listed in introduction
At least one structured data source
At least one unstructured data source
At least three sources of data
Described all sources of data
All sources of data are less than one year old, i.e. released after 17/09/2017
Inserted and discussed star schema
Completed logical data map
Discussed the high level ETL strategy
Provided 3 BI queries
Detailed the sources of data used in each query
Discussed the implications of results in each query
Reviewed at least 5-10 appropriate papers on topic of your DWBI project
Impact of Travel & Tourism on the Economy
Shantanu Deshpande
x18125514
April 12, 2019
Abstract
The economic importance of tourism to a destination is often under-appreciated.
Advances in the transportation industry have brought bigger and faster airplanes,
which have made travelling to different locations more affordable and convenient.
This has led to an increase in tourist numbers worldwide. Very often, this growth
has a strongly positive impact on a country's key development indicators. To
understand its importance, I have built a Data Warehouse and a Business Intelligence
model that uses tourist arrival data as the main source and compares it with other
development indicators, such as the unemployment rate and GDP, to observe the
impact of tourism on the country's development.
1 Introduction
Travelling has become customary for people across the globe, be it for business or
leisure. Improved air connectivity between countries has reduced travel time and travel
costs, which is one of the major factors that have boosted tourism globally. Tourists
contribute to sales, profits, jobs, tax revenues, and income in an area. The most direct
effects occur within the primary tourism sectors: lodging, restaurants, transportation,
amusements, and retail trade. Through secondary effects, tourism affects most sectors of
the economy. An economic impact analysis of tourism activity normally focuses on changes
in sales, income, and employment in a region resulting from tourism activity (Stynes 1997).
Many countries have started easing visa norms for tourists in view of the economic
advantages (Kumar 2018). In this project, I analyze the year-wise tourism rate across
several countries and compare it with a few key country development indicators, such as
GDP and the unemployment rate, to work out the relationship between them. Using this
data warehouse, I will develop a MOLAP (Multidimensional Online Analytical Processing)
cube that addresses the following business queries:
(1) Is there a relationship between the number of tourists visiting a country and the
unemployment rate of a country?
(2) Does the growth in International travel and tourism affect the GDP of the country?
(3) How have the top 5 European countries performed in terms of the global GDP
contribution of the Travel and Tourism industry?
2 Data Sources
For this project, I have used 6 data sets fetched from 5 different sources. Of the 6 data
sets, 5 are structured and 1 is unstructured. The sources are summarized briefly below.
Source | Type | Brief Summary
World Bank | Structured | Country-wise data on the international arrival rate, which is useful for my query
OECD | Structured | Country-wise unemployment rate, which is joined with the World Bank data for my first query
Wikipedia | Unstructured | An online encyclopedia containing the relevant data table on tourism, which helped me with my second query
WTTC | Structured | Data on the economic impact of the travel and tourism industry, which is useful for comparison with the worldwide GDP contribution
Statista | Structured | Statistical data containing the worldwide total GDP contribution, used as a benchmark for one of the queries
Table 2: Summary of sources of data used in the project
2.1 Source 1: OECD
The data set contains the yearly unemployment rate for around 42 countries and spans
2014 to 2018. The data is periodically updated on the website and was downloaded in
CSV format. It was subsequently loaded into R for cleaning. Two columns that were not
useful were removed from the dataset, and the country code column was converted to
country names in order to perform the join operation. The following columns were used
for the query:
1) Country
2) Year
3) Unemployment Rate
The dataset was downloaded from the following URL-
URL: https://data.oecd.org/unemp/unemployment-rate.htm
Figure 1: Unemployment Rate
2.2 Source 2: World Bank
This data set contains tourist arrival data for around 242 countries over 9 years. In the
data, each row corresponds to the number of tourist arrivals in a country for a particular
year. This is a periodically updated dataset which is available for download on
databank.worldbank.org. The file format was CSV, and the file was loaded into R for the
cleaning process, which included transposing the rows and columns and removing the NULL
values. The relevant R code used for the cleaning process is included in the appendix.
After cleaning, there were 4 columns of data:
1) Country
2) Country Code
3) Year
4) Arrival Nos.
URL: https://databank.worldbank.org/data/reports.aspx?source=2series=ST.INT.ARVL
This dataset would be used for my first query.
Figure 2: Arrival Rate
2.3 Source 3: Wikipedia (Unstructured)
Wikipedia is an online encyclopaedia and one of the most popular repositories for general
reference work. My third dataset was taken from this website. Here, I fetched the data on
countries showing strong international travel and tourism growth between 2010 and 2016.
The data was scraped using Selector Gadget and then loaded into R for cleaning. The
countries listed are some of the most underdeveloped ones; however, in recent years they
have seen a massive inflow of tourists. This may be due to steps taken by their
governments to boost tourism. The following columns were used for the query:
1) Country
2) Percent growth (tourist nos.)
The dataset was downloaded from the URL below:
https://en.wikipedia.org/wiki/Tourism
Figure 3: Tourism Growth Rate
2.4 Source 4: World Travel & Tourism Council (Structured)
Two separate datasets were taken from the World Travel & Tourism Council (WTTC). It hosts
data on the economics of the travel and tourism industry for countries around the globe
through a separate Data Gateway portal. Access to the data was granted after I sent a
written email stating the purpose of the study and how the data would be used. The data
for the relevant countries was fetched individually and then merged using R to form two
separate datasets that were used for two queries. The data on this website is periodically
updated, hence no publication date is available. The data includes the following columns:
1) Year
2) Country
3) Local Currency in Bn Nominal prices
4) Local Currency in Bn Real prices
5) Percentage growth
6) Percentage of GDP
7) USD in bn Nominal prices
8) USD in bn Real prices
The datasets were downloaded from the below URL-
URL: https://www.wttc.org/datagateway
Figure 4: Tourism GDP Contribution
Figure 5: Email Request
Figure 6: Access provided to data portal
2.5 Source 5: Statista (Structured)
Statista is a German online portal for statistics, which makes available data collected by
market and opinion research institutes and data derived from the economic sector. The
company provides statistics and survey results, which are presented in charts and tables
(Statista). The dataset for the third query was downloaded from Statista. The data
represents the total contribution of the travel and tourism industry to the global economy
from 2006 to 2017, with values in trillion US dollars. The data was downloaded in Excel
format with two sheets: the first was the Overview, which contained the metadata, and the
second contained the relevant data for the query. Using R, the first sheet was removed and
the file was converted to CSV format. The release date of the report is March 2018. The
data has the following three columns:
1) Year
2) Direct Contribution (in trillion USD)
3) Total Contribution (in trillion USD)
The link to the dataset is pasted below-
URL: https://www.statista.com/statistics/233223/travel-and-tourism--total-economic-contribution-worldwide/
Figure 7: Worldwide T&T GDP Contribution
3 Related Work
The growth of the global travel industry has been momentous and will presumably continue
for a long time to come. The travel industry's significance to the economy of numerous
industrial and developing nations has also increased drastically. The following points
explore the topic in detail, drawing on past work relevant to the project requirements.
A study covering around 114 articles identified openness to trade and tourist security as
the critical success factors for the growth of the tourism industry (Kristo 2014).
Studies suggest that an increase in tourism demand may alter the country's patterns of
production and specialization, in particular by crowding out internationally traded sectors
(i.e. export and import-competing sectors) (Sahli & Nowak 2005). This will subsequently
affect the GDP of the country and also the employment rate; however, the impact may
be positive or negative depending on the residents' attitude to the growing number of
tourists. While the early phases of tourism development are typically met with a lot of
enthusiasm on the part of local residents in view of the apparent financial advantages,
it is only natural that, as undesirable changes take place in the physical environment and
in the type of tourist being attracted, this feeling gradually becomes increasingly
negative. For tourism growth to be sustainable, these factors need to be monitored
periodically.
Under-developed countries, beset by debilitating rural poverty, have extensive potential
to attract travellers looking for new, authentic experiences in areas of unexploited
natural and cultural riches. The direct effects occur within the primary tourism sectors:
hotels, restaurants, transportation, and retail trade (Stynes 1997). Under-developed
countries generally lack proper employment sectors, hence the economy does not flourish
at the required rate. For such countries, tourism is one sector that can help lift the
economy. The challenging part for the government is to allocate funds for marketing
tourism at a global level.
However, a growing rate of tourism does not necessarily mean a significant increase
in the revenue generated by this industry. The spending behaviour of tourists must be
taken into account so as to benefit the most from each individual tourist, as this will
lead to overall growth in the sector as well as the economy.
4 Data Model
The literature reviewed above clarified what my methodology ought to be and how to
proceed while keeping all my business requirements in view. I have tried to incorporate
the country development indicators in order to build a system that reveals meaningful
relationships and thereby enables the respective government bodies to take steps to
improve the country's economy. The four key decisions to be made during the design of a
dimensional model according to the Kimball approach are:
1. Selecting the business process.
2. Declaring the grain.
3. Identifying the dimensions.
4. Identifying the facts.
Here, I have used Kimball's bottom-up approach, wherein the data marts are created first
and the data warehouse is then built from them. In a general sense, a mart is made of the
dimension tables and the fact table, and data marts are assembled to form a data
warehouse. I picked Kimball's approach because, throughout his work, Ralph Kimball
consistently supported the involvement of end users in the process (Chhabra and Pahwa
2016). For this project, I have assembled data for the last 4-5 years and provide a
filtered analysis that helps end users make sound decisions. To achieve this, I have
joined my datasets on the values they have in common. The dataset described in section
2.1 contains the unemployment rate and is joined with the dataset from section 2.2 on
country and year. Similarly, the other datasets have been joined on either country or
year. From the resulting data, I have created 2 dimensions, DimCountry and DimYear, which
are discussed briefly below.
DimCountry: This dimension table consists of all the countries that are present in
the data tables. As the data sources covered different countries, I had to write a
SQL query with a UNION operation to collect all the countries required for the queries.
The attributes contained in this dimension are Country ID and Country Name. Country ID
is generated in SSIS and is referenced from the fact table as a foreign key.
DimYear: This dimension holds only the Year attribute along with the Year ID, a primary
key that I generated in SSIS. The Year ID acts as a foreign key in my fact table. With
the help of DimYear we can observe how the measures change over time.
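As a minimal sketch, the UNION-based construction of the two dimensions might look like the following. It uses Python's built-in sqlite3 module as a stand-in for SQL Server and SSIS. The staging table names (stg_unemployment, stg_arrivals), the column names and the placeholder rows are hypothetical and are not taken from the project.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the SQL Server staging database.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical staging tables with placeholder rows (not real statistics).
CREATE TABLE stg_unemployment (country TEXT, year INTEGER, unemployment_rate REAL);
CREATE TABLE stg_arrivals     (country TEXT, year INTEGER, arrivals REAL);
INSERT INTO stg_unemployment VALUES ('Country B', 2016, 1.0), ('Country A', 2017, 2.0);
INSERT INTO stg_arrivals     VALUES ('Country A', 2016, 10.0), ('Country C', 2015, 20.0);

-- Surrogate keys are generated by the database here; the report generated them in SSIS.
CREATE TABLE DimCountry (
    CountryID   INTEGER PRIMARY KEY AUTOINCREMENT,
    CountryName TEXT NOT NULL UNIQUE
);
CREATE TABLE DimYear (
    YearID INTEGER PRIMARY KEY AUTOINCREMENT,
    Year   INTEGER NOT NULL UNIQUE
);

-- UNION (not UNION ALL) removes duplicates, so each country and year appears once.
INSERT INTO DimCountry (CountryName)
    SELECT country FROM stg_unemployment
    UNION
    SELECT country FROM stg_arrivals
    ORDER BY 1;
INSERT INTO DimYear (Year)
    SELECT year FROM stg_unemployment
    UNION
    SELECT year FROM stg_arrivals
    ORDER BY 1;
""")

countries = conn.execute(
    "SELECT CountryID, CountryName FROM DimCountry ORDER BY CountryID").fetchall()
years = conn.execute(
    "SELECT YearID, Year FROM DimYear ORDER BY YearID").fetchall()
print(countries)
print(years)
```

'Country A' occurs in both staging tables but ends up in DimCountry only once, which is the reason for using UNION rather than appending the sources one after another.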
Now, let us discuss the facts that act as the measures and support the business process:
Fact Table: The fact table is built with the help of the dimension tables and holds the
primary keys of both dimensions as foreign keys. It plays a crucial role in laying the
groundwork for the business query requirements, as it contains all the measures required
for a thorough analysis of the BI queries. The fact table comprises the following columns:
Country ID: Foreign key referencing the primary key of DimCountry.
Year ID: Foreign key referencing the primary key of DimYear.
Arrival Rate: The arrival rate of tourists in a particular country over the year.
Unemployment Rate: The rate of unemployment of a country over the year.
Tourism growth percent: The tourism growth rate between 2010 and 2016 for 10 less
developed countries.
TnT GDP Contribution10: The contribution of travel and tourism to the country's GDP.
TnT TotalGDPContribution: The overall contribution of the travel and tourism industry to
the worldwide GDP.
TnT GDP Contribution5: The contribution of travel and tourism to the country's GDP.
All the above dimensions and facts form the base of the data mart. Following Kimball's
approach, the following schema is formed:
Figure 8: Star Schema
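The key relationships of the star schema can be sketched as table definitions. The example below is an illustration only: it uses Python's sqlite3 module instead of SQL Server, includes just two of the measures, and assumes a composite primary key on (CountryID, YearID), which the report does not state. The single fact row holds placeholder values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE DimCountry (
    CountryID   INTEGER PRIMARY KEY,
    CountryName TEXT NOT NULL UNIQUE
);
CREATE TABLE DimYear (
    YearID INTEGER PRIMARY KEY,
    Year   INTEGER NOT NULL UNIQUE
);
-- Assumed grain: one fact row per country and year.
CREATE TABLE FactTable (
    CountryID        INTEGER NOT NULL REFERENCES DimCountry(CountryID),
    YearID           INTEGER NOT NULL REFERENCES DimYear(YearID),
    ArrivalRate      REAL,
    UnemploymentRate REAL,
    PRIMARY KEY (CountryID, YearID)
);
""")

conn.execute("INSERT INTO DimCountry VALUES (1, 'Country A')")
conn.execute("INSERT INTO DimYear VALUES (1, 2016)")
conn.execute("INSERT INTO FactTable VALUES (1, 1, 100.0, 5.0)")  # placeholder measures

# A fact row that points at a country missing from DimCountry is rejected.
try:
    conn.execute("INSERT INTO FactTable VALUES (99, 1, 1.0, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Reading the star: the fact table joined out to both dimensions.
rows = conn.execute("""
    SELECT c.CountryName, y.Year, f.ArrivalRate, f.UnemploymentRate
    FROM FactTable f
    JOIN DimCountry c ON c.CountryID = f.CountryID
    JOIN DimYear y ON y.YearID = f.YearID
""").fetchall()
print(orphan_rejected, rows)
```

The foreign key constraints are what keep every measure attached to a known country and year, so a BI query can always resolve the surrogate keys back to names.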
5 Logical Data Map
The Logical Data Map below documents how I arrived at the final star schema. It lists all of the dimensions and facts
that have been used and how they were transformed before loading.
Table 3: Logical Data Map describing all transformations, sources and destinations for all
components of the data model illustrated in Figure 8
Source | Column | Destination | Column | Type | Transformation
1,2,3,4 | Country | DimCountry | Country name | Dimension | In a few data sources the country name was spelled incorrectly, so names were matched using the match() function
1,2,4,5 | Year | DimYear | Year | Dimension | Contained a junk prefix ('x'), which was removed using the separate() function
2 | Tourist arrival | FactTable | Arrival Rate | Fact | Values were converted to float type
1 | Value | FactTable | Unemployment Rate | Fact | Values were converted to float; null values were removed using the na.omit() function
3 | Percentage | FactTable | Tourism growth percent | Fact | Percent symbol was separated using the separate() function
4 | USD in bn real prices | FactTable | TnT GDP contribution10 | Fact | Values were converted to float type
5 | Total contribution | FactTable | TnT Total GDP Contribution | Fact | Values were converted to float type
4 | USD in bn real prices | FactTable | TnT GDP contribution5 | Fact | Values were converted to float type
6 ETL Process
Data warehouses are primarily used for decision-making, so the foremost requirement
of a data warehouse is correct data, which avoids misleading calculations.
The ETL process consists of Extraction, Transformation and Loading, which involves
the following prominent tasks (Rahm & Do 2000):
a) identification of relevant information at the source level.
b) extraction of the appropriate information.
c) integration of the information obtained from multiple sources into a common format.
d) transforming the integrated data model through a cleaning process based on business
rules or requirements.
e) loading the processed information into the data warehouse / data mart.
The ETL strategy that I have applied in my project is described below.
Figure 9: Cube Formation
6.1 Extraction:
The first data set for my project was extracted from OECD. There were 8 columns in total
in the dataset. It includes the year-wise unemployment rate for around 42 countries and
was extracted in CSV format. The second data set was extracted from the World Bank. It
incorporates 16 columns and the tourist arrival rate for 269 countries. The data is laid
out horizontally, with one column per year, so I had to perform a transpose in R. The
third data source is an unstructured one, extracted from Wikipedia. From here, I extracted
a data table of the countries showing strong growth in international travel and tourism
between 2010 and 2016. It had two columns: the country name and the percentage growth in
tourist arrivals. The fourth and fifth data sets were extracted from the World Travel &
Tourism Council (WTTC). Depending on the query requirements, the countries were
individually selected along with the required attributes and downloaded as multiple files.
These files were then combined and split into two datasets using R, one for each of two
separate BI queries. The source of my sixth dataset is Statista. The data extracted from
this website consisted of two Excel sheets. The first sheet was of no use and was removed,
and the second sheet, which contained the contribution of travel and tourism to the global
GDP, was converted into CSV format.
6.2 Transformation:
Data cleaning, also called data cleansing or scrubbing, is an important function of
transformation. It deals with detecting and removing errors and inconsistencies from
data in order to improve data quality. Data quality problems are generally observed
in single data collections, such as files and databases, and are caused by misspellings
during data entry, missing information or other invalid data (Rahm & Do 2000). All the
datasets extracted above need to be transformed into a clean format that is suitable for
loading into the data warehouse. Transformation is one of the most time-consuming
activities, as all the data sources must be thoroughly cleaned in order to produce a
reliable business solution. The cleaning steps I used for each source are described below.

First, I scanned the data extracted from my first source, OECD, and identified the
attributes required for my query. The columns that were not required were removed from
the dataset using R. The dataset contained only country codes and not country names. To
achieve a successful join, I had to replace the country codes with country names. For
this, I used the match() function in R and took the country names from my second dataset,
which was sourced from the World Bank. I then found a few missing values, which were
removed using the na.omit() function in R.

In the World Bank data set, as mentioned above, the data was laid out horizontally, i.e.
each year had a separate column with the value in the respective cell. To perform the
join operation, I needed all the years in a single column. For this, I used the t()
function in R while keeping the first two columns intact, and then used the rep() function
to repeat the values in the first two columns so that they line up with the corresponding
transposed values. Some bad entries in the dataset were replaced with NA values and
subsequently removed using the na.omit() function.

The third dataset is an unstructured dataset sourced from Wikipedia. The data table was
located using Selector Gadget in Chrome and loaded into R for cleaning, using two R
libraries, tidyr and htmltab. The data contained an initial rank column that was not
required and was removed. Another column included the percentage (%) symbol in front of
the number, so I used the separate() function to remove the symbol and converted the
values to numeric. I also had to correct the names of a couple of countries that were
misspelled during extraction.

The fourth data source for this project is the World Travel & Tourism Council (WTTC). As
described above, country-wise datasets were extracted for the relevant queries, which
resulted in multiple files from the source. For cleaning, the libraries readxl, data.table
and tidyr were used. The data, in this case as well, was laid out horizontally; it was
transposed and the country name was repeated to match the number of corresponding
transposed values. The year column contained a junk prefix, 'x', that was removed using
the separate() function.

The fifth and final data source for this project is Statista. The data extracted from
here had two sheets; the first held metadata that was not required and was removed using
R. The second sheet contained three columns that were already clean, so I only had to
remove the initial two rows and change the column names. All the above transformations
were carried out in RStudio and the relevant R code is included in the Appendix.
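The cleaning steps described above can be illustrated with a short, self-contained sketch. It is written in plain Python rather than R, so it mirrors the effect of t(), rep(), separate(), match() and na.omit() without using those functions. The helper names, the input rows and the country codes are hypothetical.

```python
def wide_to_long(rows, id_cols):
    """Turn one-column-per-year rows into one row per (ids, year, value)."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_cols}
        for col, value in row.items():
            if col not in id_cols:
                long_rows.append({**ids, "Year": col, "Value": value})
    return long_rows


def strip_year_prefix(label):
    """Remove the junk 'x' prefix from a year label, e.g. 'x2016' -> 2016."""
    return int(label.lstrip("xX"))


def parse_number(text):
    """Return a float, or None for blanks and bad entries such as '..'."""
    try:
        return float(str(text).replace("%", "").replace(",", "").strip())
    except ValueError:
        return None


def clean(rows, id_cols):
    """Reshape, fix the year labels, convert to float and drop missing values."""
    tidy = []
    for r in wide_to_long(rows, id_cols):
        value = parse_number(r["Value"])
        if value is None:  # same effect as na.omit(): drop incomplete rows
            continue
        tidy.append({**r, "Year": strip_year_prefix(r["Year"]), "Value": value})
    return tidy


# Hypothetical wide input: one column per year, with blanks and a bad entry.
wide = [
    {"Country": "Country A", "Country Code": "AAA", "x2015": "1,000", "x2016": ".."},
    {"Country": "Country B", "Country Code": "BBB", "x2015": "", "x2016": "250"},
]
tidy = clean(wide, id_cols=("Country", "Country Code"))

# Code-to-name lookup, similar in spirit to match(); unknown codes give None.
code_to_name = {r["Country Code"]: r["Country"] for r in wide}
names = [code_to_name.get(code) for code in ["BBB", "AAA", "ZZZ"]]

print(tidy)
print(names, parse_number("29.5%"))
```

After this step every dataset has the same long shape, with country, year and a numeric value, which is what makes the later joins on country and year possible.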
6.3 Loading:
After the completion of the above stages, the data has to be moved to the staging
area. The data is brought from SSMS into SSIS, where it is staged through the Flat File
Source to the OLE DB Destination. SSIS offers several components, of which the OLE DB
Destination is the one that loads data into database tables. At this stage, I created 6
flat files in my staging area to hold the data. Additionally, I created an Execute SQL
Task to create the dimension tables from the flat files. Within this task, I wrote a SQL
query to create the dimension tables, followed by insert queries to insert the data from
all the staging tables. This process produced two dimension tables, which are then used
for populating the fact table. For the fact table, another Execute SQL Task is used, and
it takes the output of the dimension task as its input. In this task, the fact table is
created and a SQL query obtains the values from all the staging tables that are required
for the business queries. The final step is to populate the fact table with the relevant
measures.
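One possible shape for the query that populates the fact table is sketched below. The report does not show its actual SQL, so this is an assumption. It uses Python's sqlite3 module in place of an SSIS Execute SQL Task, hypothetical staging tables with placeholder values, and only two of the measures. Starting from the dimensions and LEFT JOINing each staging table keeps a country and year even when only one of the sources has a value for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCountry (CountryID INTEGER PRIMARY KEY, CountryName TEXT UNIQUE);
CREATE TABLE DimYear    (YearID INTEGER PRIMARY KEY, Year INTEGER UNIQUE);
CREATE TABLE stg_arrivals     (country TEXT, year INTEGER, arrivals REAL);
CREATE TABLE stg_unemployment (country TEXT, year INTEGER, unemployment_rate REAL);
CREATE TABLE FactTable (
    CountryID INTEGER, YearID INTEGER, ArrivalRate REAL, UnemploymentRate REAL,
    PRIMARY KEY (CountryID, YearID)
);

-- Placeholder rows (not real statistics).
INSERT INTO DimCountry VALUES (1, 'Country A'), (2, 'Country B');
INSERT INTO DimYear    VALUES (1, 2016), (2, 2017);
INSERT INTO stg_arrivals     VALUES ('Country A', 2016, 10.0), ('Country B', 2016, 20.0);
INSERT INTO stg_unemployment VALUES ('Country A', 2016, 1.0), ('Country A', 2017, 2.0);

-- Every (country, year) pair is looked up in each staging table;
-- pairs with no measure at all are skipped.
INSERT INTO FactTable (CountryID, YearID, ArrivalRate, UnemploymentRate)
SELECT c.CountryID, y.YearID, a.arrivals, u.unemployment_rate
FROM DimCountry c
CROSS JOIN DimYear y
LEFT JOIN stg_arrivals a
       ON a.country = c.CountryName AND a.year = y.Year
LEFT JOIN stg_unemployment u
       ON u.country = c.CountryName AND u.year = y.Year
WHERE a.arrivals IS NOT NULL OR u.unemployment_rate IS NOT NULL;
""")

facts = conn.execute(
    "SELECT * FROM FactTable ORDER BY CountryID, YearID").fetchall()
print(facts)
```

With inner joins instead, 'Country A' in 2017 and 'Country B' in 2016 would be dropped because each appears in only one staging table.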
When executing the query that populates the fact table, it is important to have the proper
SQL joins in place so that correct values get populated; otherwise the results will be
misleading. This was one of the most challenging parts, as I had 6 different datasets with
differing values and needed to experiment with the join queries multiple times. Each time
the complete process is executed, the data gets loaded again, creating duplicate values.
To avoid this, I added a truncate query at the start of the SSIS process. It truncates all
the staging tables and drops the dimension and fact tables before the process reruns, so
that the data gets populated properly every time. Once the fact table is populated, the
schema has to be created.
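The truncate-and-reload idea can be sketched as a load routine that is safe to run repeatedly. This is an illustration using Python's sqlite3 module, not the SSIS package itself. The function name run_load, the table layout and the rows are hypothetical, and SQLite has no TRUNCATE TABLE statement, so DELETE without a WHERE clause is used instead.

```python
import sqlite3


def run_load(conn, source_rows):
    """One full run: reset every table, reload staging, rebuild dimension and fact."""
    conn.executescript("""
        DROP TABLE IF EXISTS FactTable;
        DROP TABLE IF EXISTS DimCountry;
        CREATE TABLE IF NOT EXISTS stg_arrivals (country TEXT, year INTEGER, arrivals REAL);
        DELETE FROM stg_arrivals;  -- stands in for TRUNCATE TABLE
        CREATE TABLE DimCountry (
            CountryID   INTEGER PRIMARY KEY AUTOINCREMENT,
            CountryName TEXT UNIQUE
        );
        CREATE TABLE FactTable (CountryID INTEGER, Year INTEGER, ArrivalRate REAL);
    """)
    conn.executemany("INSERT INTO stg_arrivals VALUES (?, ?, ?)", source_rows)
    conn.executescript("""
        INSERT INTO DimCountry (CountryName)
            SELECT DISTINCT country FROM stg_arrivals ORDER BY 1;
        INSERT INTO FactTable
            SELECT c.CountryID, s.year, s.arrivals
            FROM stg_arrivals s
            JOIN DimCountry c ON c.CountryName = s.country;
    """)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM FactTable").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [("Country A", 2016, 10.0), ("Country B", 2016, 20.0)]  # placeholder values

first_run = run_load(conn, rows)
second_run = run_load(conn, rows)
print(first_run, second_run)
```

Because every run starts from empty tables, running the process twice leaves the same number of fact rows as running it once, which is the behaviour the truncate query is meant to guarantee.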
Star schemas characteristically consist of fact tables linked to associated dimension tables
via primary/foreign key relationships. OLAP cubes can be equivalent in content to, or
more often derived from, a relational star schema (Kimball/Ross 2016). In SSAS, I first
created a new data source, which defined the connection to the database that SSIS had
loaded. This imported all of my tables, including dimensions and facts, into SSAS. The
main aim was to obtain a star schema, for which I followed Kimball's methodology of data
modelling. To accomplish this, I then created a data source view in which I chose the
tables required. Here I chose my two dimensions and the fact table, and after processing
the data source view I obtained the desired star schema design. The final step was to
create the MOLAP cube. For this, a new cube had to be built, within which the existing
dimension and fact tables were selected. Next, I named the cube and hit the process button
to deploy it. Once the deployment is reported as successful, we can move over to the
browser section, drag the fact count field into the query field and confirm whether the
fact table has been properly populated. With this, the cube is successfully deployed.
7 Application
In the sections above, we saw the successful deployment of the cube, which is now used
to answer the business queries. The following business queries are useful for reviewing
the points discussed in section 3. The results of these queries, along with the previous
related work on the subject, are discussed in detail in section 7.4. We now evaluate the
3 business queries and check the results:
7.1 BI Query 1: Is there a relationship between the number of
tourists visiting a country and the unemployment rate of a
country?
The sources contributing to this query are data source (2.1) and data source (2.2).
Figure 10 shows the visualized comparison of the arrival rate and the unemployment rate
for a country. From the graph, we can clearly observe an inverse relationship between the
two: as the tourist arrival rate increases, the unemployment rate decreases. Hence we can
say that tourism is one of the factors that can reduce the unemployment rate of a country.
Surprisingly, the graph shows that although the tourist arrival rate in South Africa is
increasing, this is not reducing the unemployment rate, which is in fact increasing as
well.
Figure 10: Results for BI Query 1
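The relationship in this query is judged visually from the graph. One way to put a number on it, which the report does not do, is to compute a Pearson correlation coefficient between arrivals and unemployment for each country: a value near -1 indicates an inverse relationship and a value near +1 indicates that both series rise together. The sketch below is plain Python and its two input series are made-up illustrative values, not data from the warehouse.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Illustrative series only: arrivals rise while unemployment falls.
arrivals = [1.0, 2.0, 3.0, 4.0]
unemployment = [8.0, 6.0, 4.0, 2.0]

r = pearson(arrivals, unemployment)
print(round(r, 3))
```

A correlation of this kind describes association only and does not by itself show that tourism causes the change in unemployment.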
7.2 BI Query 2: Does the growth in International travel and tourism
affect the GDP of the country?
The sources contributing to this query are data source (2.3) and data source (2.5). The
first graph in Figure 11 shows the percentage growth in international travel and tourism
between 2010 and 2016. From our analysis, we can see that GDP also increased significantly
between these years, which shows that there is a distinct relationship between the
tourism industry and the GDP of a country. Only for 1 country, Sao Tome and Principe, is
the GDP growth not significant, although the tourism rate has increased by almost 30%.
Figure 11: Results for BI Query 2
7.3 BI Query 3: How have the top 5 European countries performed
in terms of the global GDP contribution of the
Travel and Tourism industry?
The sources contributing to this query are data source (2.4) and data source (2.5).
Figure 12 shows the year-wise growth in the worldwide GDP contribution of the travel
and tourism industry. In the second graph we can see the top 5 European countries by
tourist numbers and the GDP contribution of their tourism industry. As visible from the
graph, the GDP contribution of France and Spain is below the average global GDP
contribution of the travel and tourism industry, whereas the United Kingdom and Italy
have outperformed the global GDP growth rate.
Figure 12: Results for BI Query 3
22. 7.4 Discussion
Having run the business queries and produced the graphs, let us discuss the
implications of each one in detail.

BI query 1 gave us the relation between the tourist arrival rate and the unemployment
rate of a country. We can observe that as tourism grows in a country, the unemployment
rate tends to fall, i.e. the two are inversely related. Tourist spending, as also
mentioned in section 3, goes primarily towards the purchase of goods and services from a
variety of industries, with usually rather less than two-thirds of the expenditure being
in the hotels and restaurants normally identified with the tourism industry (De Kadt
1979). This spending pattern thereby creates jobs for the locals.

The second query gives us the relation between the growth of tourism and the GDP of a
country. The countries studied were some of the less developed countries. As described in
section 3, less developed countries usually have unexploited natural and cultural riches.
If local bodies take proper promotional initiatives, tourists would be willing to explore
new places and thereby boost the country's economy. From our graph, we can notice a
strong positive relation between the two attributes, highlighting the significant impact
tourism can have on the economy.

The third and final query visualizes, on a global level, how the 5 European countries
most popular with tourists have performed over recent years in terms of the tourism
contribution to GDP. From the graph, it can be noted that although the tourism numbers
are high in France and Spain, their contribution to the global economy is below the
average growth rate. The governments of these countries could work on strategies that
encourage tourists to spend more while they travel within them. Germany, Italy and
United Kingdom, in contrast, have contributed almost equally and at a similar rate to
the global GDP contribution.
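The inverse relation discussed for BI query 1 could be checked numerically with a correlation coefficient. The R sketch below assumes two hypothetical numeric vectors for a single country; the values are placeholders for illustration and not data from the warehouse.

```r
# Placeholder yearly values for one hypothetical country; NOT data from the warehouse.
tourist_arrivals  <- c(10, 20, 30, 40, 50)  # e.g. arrivals in millions
unemployment_rate <- c(9, 8, 7.5, 6, 5)     # e.g. percent of labour force

# Pearson correlation: a value close to -1 indicates a strong inverse relation.
r <- cor(tourist_arrivals, unemployment_rate, method = "pearson")
print(r)
```

A negative coefficient would agree with the pattern seen in the graph, although a correlation on its own does not show that tourism causes the fall in unemployment.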
8 Conclusion and Future Work
With all the data that was fetched for analysis, I have built a data warehouse
that helps us correlate the tourism parameters with the country development
indicators. From the graphs, we can say that we achieved the desired results.
What I observed is that, in recent years, there has been very positive growth in the
tourism industry worldwide. This growth is primarily due to reduced travel times and
costs and better means of communication. The notable aspect is the positive impact
on the key country development indicators. Many countries have started acknowledging
this by allocating part of their budget towards the betterment of tourism-related services.
However, it is equally important to consistently monitor the impact on the GDP
contribution, because an increasing tourism rate does not necessarily mean that
GDP will improve at a similar rate. Future work on this study could
analyze tourist sentiment to understand which factors are considered before deciding
to visit a place. This could be compared across different countries and modelled to
improve the tourism rate. Understanding the spending pattern of tourists is also important
in order to monitor the impact of the tourism industry on the GDP contribution.
References
De Kadt, E. (1979), ‘Tourism: Passport to development’, Perspectives on the social and
cultural effects of tourism in developing countries.
Kimball/Ross (2016), ‘Star schema OLAP cube | Kimball dimensional modeling techniques’.
Kristo, J. (2014), ‘Evaluating the tourism-led economic growth hypothesis in a developing
country: The case of Albania’, Mediterranean Journal of Social Sciences 5(8), 39.
Kumar, V. R. (2018), ‘Ease visa norms for free movement of tourists, says
United Nations body’,
https://www.thehindubusinessline.com/news/variety/ease-visa-norms-for-free-movement-of-tourists-says-united-nations-body/article20602398.ece1.
Rahm, E. & Do, H. H. (2000), ‘Data cleaning: Problems and current approaches’, IEEE
Data Eng. Bull. 23(4), 3–13.
Sahli, M. & Nowak, J.-J. (2005), ‘Migration, unemployment and net benefits of inbound
tourism in a developing country’.
Stynes, D. J. (1997), ‘Economic impacts of tourism’, Illinois Bureau of Tourism,
Department of Commerce and Community Affairs.
Appendix
R code example
#Extraction and cleaning Unemployment Rate - Data Source 1
setwd("C:/Users/shant/Desktop/Data-Files")
unemployment <- data.frame(read.csv("Unemployment-total.csv"))
arrival <- data.frame(read.csv("International-arrival.csv"))
#Changing the column names
colnames(unemployment) <- c("Country", "Indicator", "Subject", "Measure", "Fr
#removing columns that are not required
unemployment[,4:5] <- NULL
unemployment[6] <- NULL
#replacing country codes with country names
unemployment$Country <- arrival$Country.Name[match(unemployment$Country , arri
#removing NA values
unemployment <- na.omit(unemployment)
write.csv(unemployment, "Unemployment-rate-cleaned.csv", row.names = F)
#Extraction and cleaning of Arrival data - Data Source 2
#install.packages("reshape2")
library(reshape2)
setwd("C:/Users/shant/Desktop/Data-Files")
arrival_rate <- data.frame(read.csv("International -arrival.csv", stringsAsFac
head(arrival_rate)
arrival_rate[,1:2] <- NULL
arrival_rate <- subset(arrival_rate, select = -c(3,4))
head(arrival_rate)
#arrival_rate <- melt(arrival_rate, id = c("Country.Name", "Country.Code"))
data <- t(arrival_rate)
arrival_rate <-cbind(arrival_rate[rep(1:nrow(arrival_rate),each=10),1:2],#this
Year=c(2009:2018),#this gives the year column
Tourist_Arrival=as.vector(data[3:12 ,])) # the Average Educat
arrival_rate$Country <- NULL
arrival_rate$Tourist_Arrival <- as.character(arrival_rate$Tourist_Arrival)
arrival_rate$Tourist_Arrival[arrival_rate$Tourist_Arrival == ".."] <- "NA"
#arrival_rate <- arrival_rate[!(arrival_rate$Tourist_Arrival == "NA")]
arrival_rate <- arrival_rate[-c(2640:2690), ]
#arrival_rate <- na.omit(arrival_rate)
write.csv(arrival_rate, "arrival-rate-cleanned.csv", row.names = F)
#Extracting and cleaning Wikipedia data - Data source 3
#install.packages("htmltab")
library(tidyr)
#working
setwd("C:/Users/shant/Desktop/Data-Files")
library(htmltab)
url <- "https://en.wikipedia.org/wiki/Tourism"