A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

ENTER 2015 Research Track
Estela Mariné-Roig and Salvador Anton Clavé
Research Group on Territorial Analysis and Tourism Studies (GRATET)
University Rovira i Virgili, Catalonia, Spain
estela.marine@aegern.udl.cat
salvador.anton@urv.cat
http://www.urv.cat/en_index.html

Introduction and aim
• UGC data → a good source of information for DMOs, stakeholders and tourists.
• Travel blogs and online travel reviews (OTRs) → first-hand experiences of travellers.
• They have mostly been analysed through content analysis and narrative analysis (Banyai & Glover, 2012) in the areas of service quality, destination image and reputation, UGC, experiences and behaviour, and mobility patterns (Lu & Stepchenkova, 2014).
• Such UGC data have grown exponentially in recent years, and handling them is now considered to require Big Data technologies.
• However, in most studies of UGC data the collection is done "by hand" (Lu & Stepchenkova, 2014) and is usually non-random → very time-consuming and non-representative.
This article proposes a method for semi-automatically downloading, arranging, cleaning, debugging, and analysing large-scale travel blog and OTR data.

Web mining background
• Web mining uses data mining techniques to find useful information in, or extract knowledge from, the hyperlink structure and content of webpages (Liu, 2011).
• To automate the extraction process, a Web crawler programme is needed first, one capable of traversing the hyperlink structure and downloading the linked webpages.
• There is abundant literature on data mining related to tourism, and some on massive downloads.
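As an illustration of what such a crawler does, here is a minimal breadth-first sketch. It is not the tool used in the study. The names `LinkExtractor`, `crawl` and `fetch`, and the in-memory `SITE` with its example.org URLs, are hypothetical; a real run would pass a `fetch` function that performs HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_level=None):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if max_level is not None and level >= max_level:
            continue  # page is kept, but its links are not followed
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, fragment dropped
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return pages


# Hypothetical four-page site held in memory, so the sketch runs offline.
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.org/a.html": '<a href="/c.html">C</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
    "http://example.org/c.html": "end",
}
one_level = crawl("http://example.org/", SITE.get, max_level=1)
everything = crawl("http://example.org/", SITE.get)
```

With `max_level=1` the start page and the two pages it links to are downloaded; without a limit the crawl also reaches `c.html`.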

Methodology
• Abburu and Babu (2013) propose a framework for web data extraction and analysis based on three basic steps: finding the URLs of webpages, extracting information from webpages, and data analysis.
• Their system architecture is divided into three modules:
  o web crawling
  o information extraction
  o mining
• In this research we add cleaning and debugging phases to eliminate the noise present in each webpage, so that the content analysis phase starts from quality information in the original HTML format → the resulting webpages contain only what the user wrote.
The methodology is applied to the case of Catalonia to analyse about 85,000 travel diaries created between 2004 and 2013.

Destination selected for the case study (Catalonia)
Attributes:
• Millenary history
• Mediterranean destination
• 580 km of shoreline
• Own culture and language (Catalan)
• Rich historical and natural heritage
• Third European region by overnight stays
• Foreign tourists in 2013: 15,631,500
• Nine regional tourism brands:
Tourist brand         Abbr.
Barcelona             Barna
Costa Barcelona       cBarc
Costa Brava           cBrav
Costa Daurada         cDaur
Paisatges Barcelona   pBarc
Pirineus              Pyren
Terres de l’Ebre      tEbre
Terres de Lleida      tLlei
Val d’Aran            vAran
(unclassified)        unCla

Selection of the most suitable websites hosting UGC data
Weighted formula applied (Marine-Roig, 2014a): TBRH = 1*B(V) + 1*B(P) + 2*B(S)
o Borda count (B): method that ranks options in order of preference
Webometrics:
o Visibility (V):
  • Indexed pages in search engines (Google.com, Bing.com)
  • Link-based ranks (Google page rank PR, Yandex topical citation index CY)
o Popularity (P): visit-based ranks (Alexa.com, Compete.com, Quantcast.com)
o Size (S): number of UGC entries related to the case study
Websites hosting UGC data selected:
o 1st TripAdvisor.com (TA): hosts online travel reviews (OTRs)
o 2nd VirtualTourist.com (VT): hosts travel blogs, travelogues and OTRs
o 3rd TravelBlog.org (TB): hosts travel blogs
o 4th TravelPod.com (TP): hosts travel blogs and a few OTRs

Webometrics of the top four websites hosting travel diaries
                                         TA       TB       TP         VT
Indexed pages     Google.com     18,600,000  478,000  759,000  1,120,000
                  Bing.com       23,800,000  320,000  448,000    415,000
Link-based rank   Google PR               8        6        6          7
                  Yandex CY           1,600      110      350        375
Visit-based rank  Compete.com            51   38,742   11,824      2,500
                  Quantcast.com         127   36,067    9,279      2,065
                  Alexa.com             182   21,123   21,324      4,156
Size              Entries            72,874    2,988    2,116      7,791
TBRH              Rank                    1        3        4          2
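The weighted formula TBRH = 1*B(V) + 1*B(P) + 2*B(S) can be sketched with the webometric figures above. This is only an illustration. The slides do not say how the individual metrics are combined inside V and P, so averaging Borda points within each group and giving half a point per tie are assumptions, as is the assignment of figures to the TA, TB, TP and VT columns. The helper names are hypothetical.

```python
def borda_points(values, higher_is_better=True):
    """One point per site beaten on this metric, half a point per tie (assumption)."""
    points = {}
    for site, v in values.items():
        score = 0.0
        for other, w in values.items():
            if other == site:
                continue
            if v == w:
                score += 0.5
            elif (v > w) == higher_is_better:
                score += 1.0
        points[site] = score
    return points


def group_points(metrics, higher_is_better=True):
    """Average the Borda points of all metrics in one group (assumption)."""
    per_metric = [borda_points(m, higher_is_better) for m in metrics]
    return {s: sum(p[s] for p in per_metric) / len(per_metric) for s in per_metric[0]}


VISIBILITY = [  # higher is better
    {"TA": 18_600_000, "TB": 478_000, "TP": 759_000, "VT": 1_120_000},  # Google.com pages
    {"TA": 23_800_000, "TB": 320_000, "TP": 448_000, "VT": 415_000},    # Bing.com pages
    {"TA": 8, "TB": 6, "TP": 6, "VT": 7},                                # Google PR
    {"TA": 1_600, "TB": 110, "TP": 350, "VT": 375},                      # Yandex CY
]
POPULARITY = [  # traffic ranks: lower is better
    {"TA": 51, "TB": 38_742, "TP": 11_824, "VT": 2_500},   # Compete.com
    {"TA": 127, "TB": 36_067, "TP": 9_279, "VT": 2_065},   # Quantcast.com
    {"TA": 182, "TB": 21_123, "TP": 21_324, "VT": 4_156},  # Alexa.com
]
SIZE = [{"TA": 72_874, "TB": 2_988, "TP": 2_116, "VT": 7_791}]  # entries, higher is better

b_v = group_points(VISIBILITY)
b_p = group_points(POPULARITY, higher_is_better=False)
b_s = group_points(SIZE)

# TBRH = 1*B(V) + 1*B(P) + 2*B(S)
tbrh = {site: 1 * b_v[site] + 1 * b_p[site] + 2 * b_s[site] for site in b_v}
ranking = sorted(tbrh, key=tbrh.get, reverse=True)
```

Under these assumptions the resulting order is TA, VT, TB, TP, which agrees with the TBRH ranks reported in the table.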

Gathering process on websites
Filters:
o Level (0, 1, ... no level limit)
Inclusive / exclusive
o URL
  • Protocol (HTTP, FTP, ...)
  • Server
  • Domain
  • Directories (folders)
  • Filename
  • File type (html, jpg, ...)
o Content. Search
  • for all keywords
  • for exact word sequence
  • inside HTML tags
[Figure: simplified flow diagram of the downloading process]
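The URL and content filters listed above can be sketched as two predicates. This is a simplified illustration, not the downloader's actual configuration. The function names, parameters and example.org URLs are hypothetical. An exclusive filter would negate the result.

```python
import posixpath
import re
from urllib.parse import urlsplit


def url_passes(url, protocols=("http", "https"), server=None,
               directory=None, file_types=("html", "htm")):
    """Inclusive URL filter on protocol, server, directory prefix and file type."""
    parts = urlsplit(url)
    if parts.scheme not in protocols:
        return False
    if server is not None and parts.hostname != server:
        return False
    if directory is not None and not parts.path.startswith(directory):
        return False
    extension = posixpath.splitext(parts.path)[1].lstrip(".").lower()
    return extension in file_types


def content_passes(html, all_keywords=(), exact_sequence=None, inside_tag=None):
    """Content filter: every keyword present, exact word sequence, optionally inside one tag."""
    text = html
    if inside_tag is not None:
        tag = re.escape(inside_tag)
        found = re.findall(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, flags=re.I | re.S)
        text = " ".join(found)
    lowered = text.lower()
    if any(keyword.lower() not in lowered for keyword in all_keywords):
        return False
    if exact_sequence is not None and exact_sequence.lower() not in lowered:
        return False
    return True


# Hypothetical page and URLs used only to exercise the filters.
PAGE = "<html><title>Barcelona trip</title><body>We visited Park Guell</body></html>"
keep_diary = url_passes("http://www.example.org/Spain/Catalonia/diary.html",
                        server="www.example.org", directory="/Spain/Catalonia/")
keep_photo = url_passes("http://www.example.org/Spain/Catalonia/photo.jpg",
                        server="www.example.org", directory="/Spain/Catalonia/")
title_match = content_passes(PAGE, all_keywords=("barcelona",), inside_tag="title")
body_match = content_passes(PAGE, exact_sequence="park guell")
```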

UGC data arrangement
Structure of folders and files:
root\website\brand\destination\date_lang_[isFrom]_pageName_[theme].htm
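A small sketch of how such a path could be assembled and then used for classification follows. The function `diary_path` and every field value (root name, date format, language code, page name, theme) are hypothetical examples, and forward slashes are used only so the sketch behaves the same on any platform. The bracketed fields `isFrom` and `theme` are treated as optional.

```python
from pathlib import PurePosixPath


def diary_path(root, website, brand, destination, date, lang, page_name,
               is_from=None, theme=None):
    """Build root/website/brand/destination/date_lang_[isFrom]_pageName_[theme].htm."""
    fields = [date, lang]
    if is_from:
        fields.append(is_from)
    fields.append(page_name)
    if theme:
        fields.append(theme)
    return PurePosixPath(root, website, brand, destination, "_".join(fields) + ".htm")


# Hypothetical diary: all values below are illustrative, not taken from the dataset.
path = diary_path("ugc", "TA", "Barna", "Barcelona",
                  date="20130715", lang="en", page_name="sagrada-familia",
                  is_from="UK", theme="attractions")

# The folder levels make it easy to classify a file without opening it.
website, brand, destination = path.parts[-4], path.parts[-3], path.parts[-2]
```

Because the territorial levels are encoded in the folders and the remaining attributes in the file name, ordinary file utilities are enough to select, say, all diaries of one brand in one language.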

UGC data cleaning
Aims:
The cleaning and debugging phases are essential for obtaining quality information, limited to the web content as written and posted by the diary author, and for overcoming the most significant errors.
Before: 52 KB. After: 2 KB (both without pictures).
Sample of removed HTML elements:
• <meta ... />
• <form ... </form>
• <iframe ... </iframe>
• <div id="header">... </div>
•
• <div id="comment">... </div>
• <div id="footer">... </div>
• <script type ... </script>
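The removal of the sample elements listed above can be sketched with Python's standard `html.parser`. This is an illustrative cleaner under stated assumptions, not the procedure used in the study: the class name `DiaryCleaner`, the sets of tags and `div` ids, and the sample page are hypothetical, and comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser

REMOVED_TAGS = {"form", "iframe", "script"}        # removed together with their content
REMOVED_DIV_IDS = {"header", "comment", "footer"}  # <div id="..."> blocks to remove


class DiaryCleaner(HTMLParser):
    """Re-emits a page while skipping the elements listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities such as &iacute; untouched
        self.out = []
        self.skip_tag = None   # tag name of the element currently being removed
        self.skip_depth = 0    # nesting depth of that tag inside the removed element

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag == "meta":
            return
        is_removed_div = tag == "div" and dict(attrs).get("id") in REMOVED_DIV_IDS
        if tag in REMOVED_TAGS or is_removed_div:
            self.skip_tag, self.skip_depth = tag, 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_tag and tag != "meta":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_tag:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_tag:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_tag:
            self.out.append(f"&#{name};")


def clean_html(html):
    cleaner = DiaryCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


# Hypothetical page used only to exercise the cleaner.
RAW = ('<html><head><meta charset="utf-8"/><script type="text/javascript">var a = 1;</script>'
       '</head><body><div id="header"><div>menu</div></div>'
       '<div id="entry"><p>We loved Gaud&iacute;!</p></div>'
       '<div id="footer">(c)</div></body></html>')
cleaned = clean_html(RAW)
```

Depth counting is what lets the cleaner drop a `div` that itself contains nested `div` elements without cutting the page short.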

UGC data debugging (encoding and common mistakes)
Encoding:
Gaudí: UTF-8 (GaudÃ--), HTML number (Gaud&#237;), HTML name (Gaud&iacute;)
Mistakes:
Correct noun   Misspellings
Barcelona      Bathelona, Barcellona, Barthelonaaaa, Bar-tha-lona, Bar-the-lona ...
Gaudí          Gaudi; Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi ...
Parc Güell     Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Güelle ...
ISO 8859-15 (ASCII Latin-1 extended characters):
   _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
   192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
   208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
E_ à á â ã ä å æ ç è é ê ë ì í î ï
   224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
   240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
HTML entities:
Char  Number   Name
À     &#192;   &Agrave;
Á     &#193;   &Aacute;
Â     &#194;   &Acirc;
Ã     &#195;   &Atilde;
Ä     &#196;   &Auml;
Å     &#197;   &Aring;
UTF-8 encoding:
Char  HEX     Symb
À     c3 80   Ã€
Á     c3 81   Ã?
Â     c3 82   Ã‚
Ã     c3 83   Ãƒ
Ä     c3 84   Ã„
Å     c3 85   Ã…
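The three debugging problems on this slide (UTF-8 bytes read as a single-byte encoding, HTML entities, and misspelled proper nouns) can be sketched as follows. This is a simplified illustration, not the authors' procedure. The function names are hypothetical, and the misspelling lists contain only variants quoted above.

```python
import html
import re


def repair_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 bytes that were decoded with a single-byte encoding."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake (or not repairable this way): leave unchanged


# Variants copied from the misspellings listed on the slide (not exhaustive).
MISSPELLINGS = {
    "Barcelona": ["Bathelona", "Barcellona", "Bar-tha-lona", "Bar-the-lona"],
    "Gaudí": ["Gaudi", "Gaudì", "Gaüdi", "Gaudie", "Gaudii",
              "Goudi", "Goudí", "Guadi", "Gualdi"],
}
PATTERNS = {
    correct: re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in sorted(variants, key=len, reverse=True)) + r")\b"
    )
    for correct, variants in MISSPELLINGS.items()
}


def fix_misspellings(text):
    """Replace each known variant with the correct noun."""
    for correct, pattern in PATTERNS.items():
        text = pattern.sub(correct, text)
    return text


# "Gaudí" in UTF-8 is 47 61 75 64 c3 ad; read as Latin-1 it shows as "Gaud" + "Ã" + soft hyphen.
repaired = repair_mojibake("Gaud\u00c3\u00ad")
from_entities = html.unescape("Gaud&#237; and Gaud&iacute;")
a_grave_hex = "À".encode("utf-8").hex()
a_grave_symbols = "À".encode("utf-8").decode("cp1252")
normalised = fix_misspellings("Goudi's Bar-tha-lona")
```

A dictionary of known variants is the simplest possible approach. Open-ended forms such as "Barthelonaaaa" would need a pattern or fuzzy matching instead.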

Results: Trends
Site     2004   2005   2006   2007   2008   2009   2010   2011   2012   2013
TA         40     38     81    117    204    608  1,421  5,933 28,387 36,045
TB         22    139    254    427    662    415    328    362    231    148
TP         29    100    236    276    258    226    238    218    189    346
VT      1,474  1,498  1,023  1,031    762    413    398    635    306    251
Brand    2004   2005   2006   2007   2008   2009   2010   2011   2012   2013
Barna   1,177  1,374  1,191  1,309  1,367  1,295  1,742  5,828 24,211 30,875
cBarc      34     42     53     70     79     34     63    115    325    560
cBrav     201    204    163    238    191    134    177    332  1,448  1,707
cDaur      61     46     82    117    134    121    288    698  2,599  2,498
pBarc      57     45     38     37     45     20     35     89    412    927
Pyren      10     20     12     25      8     10     22     14     62    149
tLlei       6      1      1      3      5      5     11     19     16     16
tEbre       4      3      0      1      1      5      1      2      3     10
vAran       1      0      7      0      0      3      2      3      9     11
unCla      14     40     47     51     56     35     44     48     28     37
[Figure: trends in web hosting and Catalan brands]
[Figure: monthly distribution of travel blogs and OTRs (TA, TB, TP, & VT)]
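As a quick consistency check on the per-website figures above, the yearly counts can be summed. The variable names are illustrative and the numbers are copied from the table.

```python
YEARS = list(range(2004, 2014))
ENTRIES = {
    "TA": [40, 38, 81, 117, 204, 608, 1421, 5933, 28387, 36045],
    "TB": [22, 139, 254, 427, 662, 415, 328, 362, 231, 148],
    "TP": [29, 100, 236, 276, 258, 226, 238, 218, 189, 346],
    "VT": [1474, 1498, 1023, 1031, 762, 413, 398, 635, 306, 251],
}

# Total entries per website over the whole period.
totals = {site: sum(counts) for site, counts in ENTRIES.items()}

# Total entries per year across the four websites.
per_year = {year: sum(counts[i] for counts in ENTRIES.values())
            for i, year in enumerate(YEARS)}

grand_total = sum(totals.values())
```

The per-website totals (72,874, 2,988, 2,116 and 7,791) equal the Size entries in the webometrics table, and the grand total of 85,769 is consistent with the "about 85,000 travel diaries" stated in the methodology.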

Results: Top keywords
Rank  Keyword          Count    Site-wide density  Average weight  Remark
1     barcelona        197,723  3.77 %             56.26           Capital of Catalonia
2     great             51,525  0.98 %             23.73           Good feeling
3     tour              49,221  0.94 %             18.08
4     sagrada familia   38,341  0.73 %             60.75           Gaudi’s masterpiece
5     gaudi             33,187  0.63 %             19.66           Architect A. Gaudi
6     city              28,155  0.54 %             11.70
7     place             26,597  0.51 %             15.73
8     good              26,098  0.50 %             15.02           Good feeling
9     visit             25,973  0.49 %             14.86
10    amazing           25,242  0.48 %             24.18           Good feeling
11    park              24,962  0.47 %             28.38
12    basilica          23,618  0.45 %             81.68           Religious building
13    park guell        23,367  0.44 %             62.06           Gaudi’s work
14    beautiful         23,322  0.44 %             23.01           Good feeling
15    way               22,996  0.44 %             15.02
• Site Content Analyzer (SCA) was applied to the dataset
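The slides do not define how SCA computes density or weight. The sketch below only illustrates the general idea of a site-wide keyword count and density, assuming density is the keyword count divided by the total number of words. It handles single words only, so multi-word keywords such as "sagrada familia" and the weight column are out of scope. The function name and sample texts are hypothetical.

```python
import re
from collections import Counter


def keyword_density(texts, top=10):
    """Return (keyword, count, density %) for the most frequent words across all texts."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[^\W\d_]+", text.lower()))  # runs of letters only
    counts = Counter(words)
    total = len(words)
    return [(word, count, 100.0 * count / total) for word, count in counts.most_common(top)]


# Hypothetical mini-corpus standing in for the cleaned diaries.
SAMPLE = ["Barcelona is great", "Great tour of Barcelona", "Barcelona!"]
top_two = keyword_density(SAMPLE, top=2)
```

In a real analysis a stop-word list would normally be applied first, so that words such as "is" and "of" do not compete with content keywords.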

Top keywords: Barcelona, Gaudi and two of Gaudi’s works
Barcelona: Guell Park / Mosaic Dragon
Basilica of La Sagrada Familia (Passion façade) / Antoni Gaudi

Conclusions
• The proposed methodology facilitates the massive gathering of UGC data from the most suitable sources for a specific case study.
• The hierarchical territorial structure of folders and the composition of each diary’s file name enable multiple classifications using utilities that sort and manipulate files.
• This structure also allows the analysis to focus on a specific place, language or subject.
• The cleaning and debugging phases are essential for obtaining quality information, limited to what the diary author wrote.
• The HTML dataset is ready for any offline content analysis in future work, and most phases of this method are useful for the content analysis of other web data sources.

Thank you for your attention!
