Search engine traffic as input for predicting tourist arrivals
1. ENTER 2018 Research Track Slide Number 1
Wolfram Höpken (a), Tobias Eberle (a), Matthias Fuchs (b),
and Maria Lexhagen (b)
(a) Business Informatics Group
University of Applied Sciences Ravensburg-Weingarten, Germany
{name.surname}@hs-weingarten.de
(b) European Tourism Research Institute (ETOUR)
Mid-Sweden University, Sweden
{name.surname}@miun.se
Content
• Introduction
• Related work
• Data collection and preparation
• Construction of web search indices with high predictive
power
• Model building and evaluation
• Results
• Conclusion and outlook
Motivation
• Demand prediction in tourism
– Due to perishable nature of tourism products, accurate forecasts of
tourism demand are of utmost relevance (Frechtling, 2002; Fitzsimmons &
Fitzsimmons, 2002)
– Knowledge of long-term trends, imminent changes and short-term
intra-period fluctuations of demand is essential for tourism
management
The importance of accurate and reliable demand forecasts for tourism
businesses and policy makers can hardly be overestimated (Frechtling, 2002)
• Limitations of autoregressive approaches
– Lack of historical data, influence of unexpected events, variety of input
factors and complexity of travel decision-making process (Song et al. 2010)
– Availability of travellers’ web search behaviour as additional input to
demand prediction
Objective
• Extend autoregressive forecasting approach by including
travellers’ web search behaviour
– Does the inclusion of time series data on web search behaviour
increase performance when forecasting tourist arrivals compared to
the purely autoregressive approach?
• Examine behavioural aspects of travellers related to
concrete search terms used in online search for trip planning
– Analyse temporal relationships between search terms and tourist
arrivals
– Identify patterns that reflect online planning behaviour of travellers
before visiting specific destinations
Search engine traffic for demand prediction
• Web search data to predict tourist arrivals
– Google web search data to improve tourism demand prediction
accuracy, compared to purely autoregressive models or exponential
smoothing time-series models (Önder & Gunter, 2016)
– Google web search data to increase forecasting performance using
autoregressive mixed-data sampling (AR-MIDAS) models
(Bangwayo-Skeete & Skeete, 2015)
– Google web search data and econometric indicators to improve
autoregressive prediction of tourist arrivals, comparing statistical and
data mining approaches (Höpken et al., 2017)
Specification of data set
• Tourist arrivals
– Monthly aggregated tourist arrivals (December 2005 - April 2012) for
the leading Swedish mountain destination Åre
– Specified separately for its major sending countries (Denmark, Finland,
Norway and the United Kingdom)
• Web search traffic
– Google Trends as appropriate data source for above sending countries
– Represents relative search volume of popular search terms over time
and, thus, reflects people’s interest in specific search terms across
different geographic regions and topical domains
Collection of web search data
• Google Trends crawling algorithm
– Using search engine-based keyword recommendations by Google’s
Keyword Planner
– Iterative algorithm to identify relevant keywords
• Starting with seed keyword "are" and iterating over related keywords
suggested by Google Keyword Planner
• Normalization of search terms
– Examining search terms for close similarity based on linguistic
variations, synonyms or misspellings
• Transforming search terms by text processing techniques (tokenization,
character substitution, stemming, stop-word removal, generation of word
vector)
• Eliminating semantically identical search terms (cosine similarity = 1)
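The normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the character substitutions and the crude suffix-stripping `stem` function are all assumptions standing in for whatever text processing tools were actually used.

```python
import math
import re
from collections import Counter

# Assumed, illustrative resources; the slides do not list the real ones.
STOP_WORDS = {"in", "the", "of", "to"}
SUBSTITUTIONS = str.maketrans({"å": "a", "ä": "a", "ö": "o", "é": "e"})


def stem(token):
    """Very crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def word_vector(term):
    """Tokenize, substitute characters, drop stop words, stem, count."""
    text = term.lower().translate(SUBSTITUTIONS)
    tokens = re.findall(r"[a-z0-9]+", text)
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(terms):
    """Keep the first term of every group whose vectors have similarity 1."""
    kept = []
    for term in terms:
        vec = word_vector(term)
        if not vec:
            continue
        if any(math.isclose(cosine_similarity(vec, other), 1.0)
               for _, other in kept):
            continue
        kept.append((term, vec))
    return [term for term, _ in kept]
```

With these assumptions, `deduplicate(["Åre skiing", "are ski", "skiing in are", "are hotel"])` keeps one skiing variant and the hotel query, since the three skiing variants map to the same word vector.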
Construction of aggregated search indices
• Identifying optimal time-lag between each search query and
tourist arrivals
– Calculating de-trended cross-correlation analysis (DCCA) coefficients
for time lags 0 to 6, capturing travellers’ short- and mid-term online
travel planning behaviour
– Selecting time lag with maximal DCCA coefficient and weighting search
query by DCCA coefficient
• Constructing compound search indices
– Filtering queries by Hurst exponent to ensure that the constructed search
indices follow the same auto-correlative patterns as their corresponding
tourist arrival series (Pan et al., 2017)
– Aggregating all weighted and time-lagged query series into a compound
search index
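The lag selection and aggregation described above might look like the sketch below. It uses the common DCCA coefficient definition (detrended covariance of the integrated series divided by the product of the two DFA fluctuation functions) with overlapping boxes and linear detrending. The box size of 4, the function names and the plain weighted sum used for aggregation are assumptions; the slides do not specify them, and the Hurst filtering step is left out here.

```python
import math


def _profile(series):
    """Integrated (cumulative, mean-adjusted) series."""
    mean = sum(series) / len(series)
    out, total = [], 0.0
    for value in series:
        total += value - mean
        out.append(total)
    return out


def _linear_residuals(window):
    """Residuals of a least-squares line fitted to one box."""
    m = len(window)
    t_mean = (m - 1) / 2
    w_mean = sum(window) / m
    sxx = sum((t - t_mean) ** 2 for t in range(m))
    sxy = sum((t - t_mean) * (w - w_mean) for t, w in enumerate(window))
    slope = sxy / sxx
    return [w - (w_mean + slope * (t - t_mean)) for t, w in enumerate(window)]


def dcca_coefficient(x, y, box=4):
    """DCCA cross-correlation coefficient for one box size (assumed: 4)."""
    if len(x) != len(y) or len(x) <= box:
        raise ValueError("series must have equal length greater than box")
    px, py = _profile(x), _profile(y)
    fxy = fxx = fyy = 0.0
    for start in range(len(px) - box):  # overlapping boxes of box + 1 points
        rx = _linear_residuals(px[start:start + box + 1])
        ry = _linear_residuals(py[start:start + box + 1])
        fxy += sum(a * b for a, b in zip(rx, ry))
        fxx += sum(a * a for a in rx)
        fyy += sum(b * b for b in ry)
    denom = math.sqrt(fxx * fyy)
    return fxy / denom if denom else 0.0


def best_lag(search, arrivals, max_lag=6, box=4):
    """Lag (in months) at which search volume best matches later arrivals."""
    scores = {}
    for lag in range(max_lag + 1):
        # search at month t is paired with arrivals at month t + lag
        scores[lag] = dcca_coefficient(
            search[:len(search) - lag], arrivals[lag:], box)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


def compound_index(queries, arrivals, max_lag=6, box=4):
    """Weighted sum of lag-shifted queries, aligned with arrivals[max_lag:].

    Summation is an assumed aggregation rule, not taken from the slides.
    """
    n = len(arrivals)
    index = [0.0] * (n - max_lag)
    for series in queries.values():
        lag, weight = best_lag(series, arrivals, max_lag, box)
        for i in range(n - max_lag):
            index[i] += weight * series[i + max_lag - lag]
    return index
```

`queries` is expected to be a dictionary mapping each search term to its monthly series, with every series as long as the arrival series.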
Evaluation of search indices
Index evaluation metrics for different sending countries
High structural similarity between search indices and tourist arrivals
(de-trended cross-correlation analysis appropriate for potentially
non-stationary time series)
Similar Hurst exponents of arrival and search index time series,
indicating the same auto-correlative patterns
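For readers unfamiliar with the Hurst exponent, one way to estimate it is classical rescaled-range (R/S) analysis, sketched below. The slides do not state which estimator was used, so R/S, the doubling window sizes and the minimum window of 8 observations are assumptions made for illustration only.

```python
import math


def rescaled_range(chunk):
    """R/S statistic of one window: range of cumulative deviations / std."""
    mean = sum(chunk) / len(chunk)
    cumulative, lowest, highest = 0.0, 0.0, 0.0
    for value in chunk:
        cumulative += value - mean
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in chunk) / len(chunk))
    return (highest - lowest) / std if std > 0 else None


def hurst_exponent(series, min_size=8):
    """Slope of log(mean R/S) against log(window size)."""
    points = []
    size = min_size
    while size <= len(series) // 2:
        values = []
        for start in range(0, len(series) - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs:  # skip constant or degenerate windows
                values.append(rs)
        if values:
            mean_rs = sum(values) / len(values)
            points.append((math.log(size), math.log(mean_rs)))
        size *= 2
    if len(points) < 2:
        raise ValueError("series too short for the chosen window sizes")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx
```

Values near 0.5 suggest no long-range dependence, while values closer to 1 suggest persistent behaviour. A query would be kept when its exponent is close to that of the arrival series; the tolerance for "close" is not given in the slides.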
Prediction accuracy can be improved when autoregressive forecasting
models are extended by Google Trends data as additional predictor
Stationarity tests
Tests for stationarity and co-integration for arrival data and search indices
Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin
(KPSS) tests confirm stationarity for DK, FI and UK but not for NO
Johansen test shows co-integration relationships between search indices
and corresponding arrival series (-> no series transformations necessary)
Model building
• Building a prediction model
– Linear regression as statistical approach
– Autoregressive approach, using past 25 months as input data
– Search index, constructed from Google Trends data, as additional input
data
– Backward selection to eliminate irrelevant input (kitchen sink problem)
• Evaluation
– Prediction performance evaluated by sliding window validation
(moving a training and consecutive test window along data set)
– Shapiro-Wilk test to check for normal distribution of residuals
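The sliding window validation mentioned above can be sketched as follows. Window sizes, the step width and the `seasonal_naive` forecaster are assumptions chosen to keep the example self-contained; in the study the forecaster would be the linear regression on lagged arrivals and the search index.

```python
import math


def sliding_windows(n_obs, train_size, test_size, step=1):
    """Yield (train, test) index ranges moved along the series."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


def rmse(actual, predicted):
    """Root mean squared error of one test window."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def seasonal_naive(train, horizon, season=12):
    """Placeholder forecaster (assumed): repeat the value of 12 months ago."""
    history = list(train)
    forecast = []
    for _ in range(horizon):
        value = history[-season]
        forecast.append(value)
        history.append(value)
    return forecast


def sliding_window_rmse(series, forecaster, train_size, test_size, step=1):
    """Average RMSE over all windows; the test always follows the training."""
    errors = []
    for train_idx, test_idx in sliding_windows(
            len(series), train_size, test_size, step):
        train = [series[i] for i in train_idx]
        actual = [series[i] for i in test_idx]
        errors.append(rmse(actual, forecaster(train, test_size)))
    return sum(errors) / len(errors)
```

Any callable taking a training window and a horizon can be passed as `forecaster`, so the purely autoregressive model and the model extended by the search index can be scored on identical windows.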
Comparison of forecasting performance
Comparison of prediction accuracy at different forecasting horizons
Adding Google Trends data reduces RMSE for all horizons and countries
Normally distributed residuals -> Google Trends model fits data well
Analysis of customers’ online search behaviour
Significant query lags for sending country Denmark
Three to two months before arrival -> search for activities in Sweden
One month before arrival -> more precise queries, searching specifically for Åre
Potential to
• Analyse customers’ online search behaviour and decision-making process
• Identify most relevant keywords, used by tourists actually visiting the destination
(input to search engine optimization - SEO)
Conclusion and outlook
• Web search data as additional input to demand prediction
– Forecast model with Google Trends data as additional predictor
outperforms purely autoregressive approaches
• Analysis of customers’ online search behaviour
– Most significant search terms and time lags constitute valuable input
to analysing customers’ online search behaviour and decision-making
process
• Open issues and future research activities
– Add further input data, e.g. customers’ online interactions on social
media platforms like YouTube, Facebook, etc. or web navigation data
– Compare statistical approaches with data mining methods (e.g. deep
learning with neural networks)