A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia
1. ENTER 2015 Research Track Slide Number 1
A Method for Analysing Large-Scale UGC Data for Tourism:
Application to the Case of Catalonia
Estela Mariné-Roig and Salvador Anton Clavé
Research Group on Territorial Analysis and Tourism Studies (GRATET)
University Rovira i Virgili, Catalonia, Spain
estela.marine@aegern.udl.cat
salvador.anton@urv.cat
http://www.urv.cat/en_index.html
2. ENTER 2015 Research Track Slide Number 2
Introduction and aim
UGC data are a good source of information for DMOs, stakeholders and tourists.
Travel blogs and Online Travel Reviews (OTRs) record first-hand experiences of
travellers.
They have mostly been analysed with content analysis and narrative analysis
(Banyai & Glover, 2012) in the areas of service quality, destination image and
reputation, UGC, experiences and behaviour, and mobility patterns (Lu &
Stepchenkova, 2014).
Such UGC data have grown exponentially in recent years, and it is now considered
that their manipulation requires the use of Big Data technologies.
However, in most studies concerning UGC data the collection is done “by hand”
(Lu & Stepchenkova, 2014) and is usually non-random, very time-consuming
and non-representative.
This article aims to propose a method for semi-automatic
downloading, arranging, cleaning, debugging, and
analysing large-scale travel blog and OTR data.
3. ENTER 2015 Research Track Slide Number 3
Web mining background
Web mining, using data mining techniques, aims to
find useful information or to extract knowledge from the
hyperlink structure and content of webpages (Liu,
2011).
To automate the extraction process, a Web
crawler programme is first needed, capable of traversing
the hyperlink structure and downloading the linked
webpages.
There is abundant literature on data mining related to
tourism and some on massive downloads.
4. ENTER 2015 Research Track Slide Number 4
Methodology
Abburu and Babu (2013) propose a framework for web data extraction
and analysis based on three basic steps: finding URLs of webpages,
extracting information from webpages, and data analysis.
The above system architecture is divided into three modules:
• web crawling
• information extraction
• mining
In this research we add cleaning and debugging phases to
eliminate the noise present in the webpages, so that the
content analysis phase starts from quality information in the original HTML
format. The resulting webpages contain only what the user wrote.
The methodology is applied to the case of Catalonia to analyse
about 85,000 travel diaries created between 2004 and 2013.
5. ENTER 2015 Research Track Slide Number 5
Destination selected for the case study (Catalonia)
Attributes:
• Millenary history
• Mediterranean destination
• Bathed by 580 km of shoreline
• Own culture and language (Catalan)
• Rich historical and natural heritage
• Third European region (overnight stays)
• Foreign tourists in 2013: 15,631,500
• Nine regional tourism brands:
Tourist brand          Abbr.
Barcelona              Barna
Costa Barcelona        cBarc
Costa Brava            cBrav
Costa Daurada          cDaur
Paisatges Barcelona    pBarc
Pirineus               Pyren
Terres de l’Ebre       tEbre
Terres de Lleida       tLlei
Val d’Aran             vAran
(unclassified)         unCla
6. ENTER 2015 Research Track Slide Number 6
Selection of the most suitable websites hosting UGC data
Weighted formula applied (Marine-Roig, 2014a): TBRH = 1*B(V) + 1*B(P) + 2*B(S)
o Borda count (B): Method that ranks options in order of preference
Webometrics:
o Visibility (V):
• Indexed pages in search engines (Google.com, Bing.com)
• Link-based ranks (Google page rank PR, Yandex topical citation index CY)
o Popularity (P): Visit-based ranks (Alexa.com, Compete.com, Quantcast.com)
o Size (S): Number of UGC entries related to the case study
Websites hosting UGC data selected:
o 1st: TripAdvisor.com (TA): Hosts online travel reviews (OTRs)
o 2nd: VirtualTourist.com (VT): Hosts travel blogs, travelogues and OTRs
o 3rd: TravelBlog.org (TB): Hosts travel blogs
o 4th: TravelPod.com (TP): Hosts travel blogs and a few OTRs
7. ENTER 2015 Research Track Slide Number 7
Webometrics of the top four websites hosting travel diaries
Criterion         Metric          TA          TB       TP       VT
Indexed pages     Google.com      18,600,000  478,000  759,000  1,120,000
                  Bing.com        23,800,000  320,000  448,000  415,000
Link-based rank   Google PR       8           6        6        7
                  Yandex CY       1,600       110      350      375
Visit-based rank  Compete.com     51          38,742   11,824   2,500
                  Quantcast.com   127         36,067   9,279    2,065
                  Alexa.com       182         21,123   21,324   4,156
Size              Entries         72,874      2,988    2,116    7,791
TBRH              Rank            1           3        4        2
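The weighted formula TBRH = 1*B(V) + 1*B(P) + 2*B(S) can be sketched in Python with the figures from the table above. The slides do not say how the individual metrics are combined inside each criterion, so this sketch assumes that Borda points (n-1 for the best option, 0 for the worst, ties sharing the average) are summed per criterion and then re-ranked. The function names are illustrative, not taken from the source.

```python
def borda_points(values, higher_is_better=True):
    """Borda points per option: best gets n-1, worst 0; ties share the average."""
    n = len(values)
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    points, i = {}, 0
    while i < n:
        j = i
        while j + 1 < n and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        shared = sum(n - 1 - k for k in range(i, j + 1)) / (j - i + 1)
        for k in range(i, j + 1):
            points[ordered[k]] = shared
        i = j + 1
    return points


def criterion_points(metrics):
    """ASSUMPTION: sum Borda points over a criterion's metrics, then re-rank."""
    totals = {}
    for values, higher_is_better in metrics:
        for site, pts in borda_points(values, higher_is_better).items():
            totals[site] = totals.get(site, 0) + pts
    return borda_points(totals)


SITES = ("TA", "TB", "TP", "VT")


def row(*numbers):
    return dict(zip(SITES, numbers))


# Figures copied from the webometrics table; for visit-based ranks lower is better.
visibility = [(row(18_600_000, 478_000, 759_000, 1_120_000), True),
              (row(23_800_000, 320_000, 448_000, 415_000), True),
              (row(8, 6, 6, 7), True),
              (row(1_600, 110, 350, 375), True)]
popularity = [(row(51, 38_742, 11_824, 2_500), False),
              (row(127, 36_067, 9_279, 2_065), False),
              (row(182, 21_123, 21_324, 4_156), False)]
size = [(row(72_874, 2_988, 2_116, 7_791), True)]

b_v = criterion_points(visibility)
b_p = criterion_points(popularity)
b_s = criterion_points(size)
tbrh = {s: 1 * b_v[s] + 1 * b_p[s] + 2 * b_s[s] for s in SITES}
print(sorted(tbrh.items(), key=lambda item: -item[1]))
```

Under these assumptions TA and VT come first and second, as on the slide, while TB and TP tie. The slide's third and fourth places therefore depend on aggregation details that are not given here.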
8. ENTER 2015 Research Track Slide Number 8
Gathering process on websites
[Figure: simplified flow diagram of the downloading process]
Filters:
o Level (0, 1, ... no level limit)
Inclusive / exclusive
o URL
• Protocol (HTTP, FTP, ...)
• Server
• Domain
• Directories (folders)
• Filename
• File type (html, jpg, ...)
o Content. Search
• for all keywords
• for exact word sequence
• inside HTML tags
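The level filter can be illustrated with a small breadth-first traversal. This is a minimal sketch, not the downloading tool used in the study: the `links` dictionary is a hypothetical stand-in for fetching a page and reading its hyperlinks, and `crawl` is an illustrative name.

```python
from collections import deque


def crawl(start, links, max_level=None):
    """Return {url: level} for every page reachable within max_level clicks.

    links maps a URL to the URLs it points to (a stand-in for real fetching).
    max_level=None means no level limit.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        level = seen[url]
        if max_level is not None and level >= max_level:
            continue  # do not follow links beyond the level limit
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = level + 1
                queue.append(target)
    return seen


# Hypothetical link structure: an index page, two diaries, one deeper page.
links = {"index": ["diary-a", "diary-b"], "diary-a": ["photo-page"], "photo-page": ["index"]}
print(crawl("index", links, max_level=1))
```

With `max_level=0` only the initial page is kept; with `max_level=1` the pages it links to directly are added.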
9. ENTER 2015 Research Track Slide Number 9
UGC data arrangement
Structure of folders and files:
root\website\brand\destination\date_lang_[isFrom]_pageName_[theme].htm
10. ENTER 2015 Research Track Slide Number 10
UGC data cleaning
Aims:
The cleaning and debugging phases
are essential to be able to obtain
quality information, limited to the
web content as written and posted
by the diary author, and overcoming
the most significant errors.
Before: 52 KB. After: 2 KB (both without pictures)
Sample of removed HTML elements:
• <meta ... />
• <form ... </form>
• <iframe ... </iframe>
• <div id="header">... </div>
• <!-- [comment] -->
• <div id="comment">... </div>
• <div id="footer">... </div>
• <script type ... </script>
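The removal of the sample elements listed above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the cleaning programme used in the study: the class name `DiaryCleaner`, the helper `clean`, and the exact sets of tags and `id` values are chosen only to mirror the list on this slide.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "form", "iframe"}        # removed with their content
SKIP_DIV_IDS = {"header", "footer", "comment"}  # <div id="..."> blocks to remove
VOID_TAGS = {"meta", "br", "img", "hr", "input", "link"}  # never have an end tag


class DiaryCleaner(HTMLParser):
    """Copy the HTML through, dropping noise elements and HTML comments."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities untouched
        self.out = []
        self.skip_stack = []  # tags opened inside an element being removed

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            return
        if self.skip_stack:
            if tag not in VOID_TAGS:
                self.skip_stack.append(tag)
            return
        if tag in SKIP_TAGS or (tag == "div" and dict(attrs).get("id") in SKIP_DIV_IDS):
            self.skip_stack.append(tag)
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "meta" and not self.skip_stack:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_stack:
            if tag in self.skip_stack:
                while self.skip_stack.pop() != tag:
                    pass
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_stack:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_stack:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_stack:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    # handle_comment is not overridden, so <!-- comments --> are dropped.


def clean(page):
    parser = DiaryCleaner()
    parser.feed(page)
    parser.close()
    return "".join(parser.out)
```

The remaining markup is kept as HTML rather than flattened to plain text, in line with the aim of preserving the original format for later keyword weighting.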
11. ENTER 2015 Research Track Slide Number 11
UGC data debugging (encoding and common mistakes)
ISO 8859-15 (ASCII Latin-1 extended characters)
UTF-8 encoding
Encoding: HTML entities
Gaudí: UTF-8 read as Latin-1 (GaudÃ plus an invisible soft hyphen), HTML number (Gaud&#237;), HTML name (Gaud&iacute;)
Mistakes:
Correct noun Misspellings
Barcelona Bathelona, Barcellona, Barthelonaaaa, Bar-tha-lona, Bar-the-lona ...
Gaudí Gaudi; Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi ...
Parc Güell Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Güelle ...
_0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
E_ à á â ã ä å æ ç è é ê ë ì í î ï
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
Char  Number  Name
À     &#192;  &Agrave;
Á     &#193;  &Aacute;
Â     &#194;  &Acirc;
Ã     &#195;  &Atilde;
Ä     &#196;  &Auml;
Å     &#197;  &Aring;
Char  HEX    Symb
À     c3 80  Ã€
Á     c3 81  Ã?
Â     c3 82  Ã‚
Ã     c3 83  Ãƒ
Ä     c3 84  Ã„
Å     c3 85  Ã…
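The encoding problems in the tables above can be reproduced and partly undone in Python. This is a minimal sketch, not the debugging programme used in the study: it assumes that the garbled symbols come from UTF-8 bytes read as Windows-1252, and the names `debug_text`, `normalise` and `MISSPELLINGS` are illustrative. The misspelling variants are a small subset of those listed on this slide.

```python
import html
import re


def debug_text(text):
    """Decode HTML entities, then try to undo UTF-8 bytes read as Windows-1252."""
    text = html.unescape(text)  # Gaud&#237; and Gaud&iacute; both become Gaudí
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already correct, or not recoverable this way


# Subset of the misspellings listed on the slide, mapped to the correct noun.
MISSPELLINGS = {
    "Barcelona": ["Barcellona", "Bathelona"],
    "Gaudí": ["Gaudi", "Gaudì", "Goudi", "Guadi"],
}


def normalise(text):
    """Replace known misspellings (whole words only) with the correct noun."""
    for correct, variants in MISSPELLINGS.items():
        pattern = r"\b(?:" + "|".join(map(re.escape, variants)) + r")\b"
        text = re.sub(pattern, correct, text)
    return text


print(debug_text("Gaud&iacute;"), debug_text("Ã€"), normalise("Barcellona and Goudi"))
```

Bytes such as 0x81 have no Windows-1252 character, which is why the row for Á shows "?" in the table; text damaged in that way falls through the `except` branch unchanged.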
13. ENTER 2015 Research Track Slide Number 13
Results: Top keywords
Rank  Keyword          Count    Site-wide density  Average weight  Remark
1     barcelona        197,723  3.77 %             56.26           Capital of Catalonia
2     great            51,525   0.98 %             23.73           Good feeling
3     tour             49,221   0.94 %             18.08
4     sagrada familia  38,341   0.73 %             60.75           Gaudi’s masterpiece
5     gaudi            33,187   0.63 %             19.66           Architect A. Gaudi
6     city             28,155   0.54 %             11.70
7     place            26,597   0.51 %             15.73
8     good             26,098   0.50 %             15.02           Good feeling
9     visit            25,973   0.49 %             14.86
10    amazing          25,242   0.48 %             24.18           Good feeling
11    park             24,962   0.47 %             28.38
12    basilica         23,618   0.45 %             81.68           Religious building
13    park guell       23,367   0.44 %             62.06           Gaudi’s work
14    beautiful        23,322   0.44 %             23.01           Good feeling
15    way              22,996   0.44 %             15.02
Site Content Analyzer (SCA) was applied to the dataset
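The count and site-wide density columns can be approximated with a few lines of Python. This is a rough sketch that assumes density means keyword count divided by the total number of words; it handles single words only and does not reproduce SCA's weight measure, whose formula is not given in these slides. The function name and the stopword list are illustrative.

```python
import re
from collections import Counter


def keyword_density(texts, top=15, stopwords=frozenset()):
    """Return (keyword, count, density in %) for the most frequent words."""
    words = []
    for text in texts:
        # Letters only, lower-cased; accented letters are kept.
        words += [w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in stopwords]
    total = len(words)
    if total == 0:
        return []
    return [(w, c, 100 * c / total) for w, c in Counter(words).most_common(top)]


diaries = ["Barcelona is great", "Great tour of Barcelona, Barcelona!"]
for keyword, count, density in keyword_density(diaries, top=2):
    print(f"{keyword:10} {count:3} {density:5.2f} %")
```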
14. ENTER 2015 Research Track Slide Number 14
Top keywords: Barcelona, Gaudi and two of Gaudi’s works
Barcelona: Guell Park / Mosaic Dragon
Basilica of La Sagrada Familia (Passion façade) / Antoni Gaudi
15. ENTER 2015 Research Track Slide Number 15
Conclusions
The proposed methodology facilitates the massive gathering of UGC
data from the most suitable sources for a specific case study.
The hierarchical territorial structure of folders and the articulation
of the individual diaries’ file names enable multiple classifications
using utilities to order and manipulate the files.
This structure also allows the analysis to focus on a specific place,
language or subject.
The cleaning and debugging phases are essential to obtain quality
information, limited to what has been written by the diary author.
The HTML dataset is prepared for any offline content analysis in
future work and most phases of this method are useful for the
content analysis of other web data sources.
16. ENTER 2015 Research Track Slide Number 16
Thank you for your attention!
estela.marine@aegern.udl.cat
salvador.anton@urv.cat
Editor's Notes
Catalonia is not an Anglophone region, and therefore the problems related to character codification beyond ASCII 127 should be considered and, specifically, those related to existing accent marks in destination and tourist attraction factor names.
The first thing after selecting the methodology is to
The problem is that blogs come from diverse sources and websites do not have homogeneous structures, which makes it impossible to automatize the process of downloading, classification and refinement, as intended in this study.
WEBSITES WITH A CRITERION: at least 100 entries about Catalonia
HTTrack Website Copier (www.httrack.com).
The first step to download data is to navigate the selected websites manually to identify the initial pages, that is to say, those containing hyperlinks which lead to the individual blogs and OTR pages, and save their complete URLs.
The download filters work as follows (Figure 2):
1. The level filter limits the depth of the download: a Level 0 filter downloads only the page indicated by the initial URL, a Level 1 filter downloads that page and all the resources directly linked to it, and so on.
2. The file type filter makes it possible to download, for example, only HTML files; the remaining files (multimedia, PDF, etc.) can only be viewed if an Internet connection is available. This system is ideal for analysing the textual content of diaries while saving space on the local disk.
3. The URL filters can act on any part of the URL (protocol, server, domain, subdirectories or folders, filename and file type).
4. The content filter is the least efficient, because the page must be downloaded to assess whether or not it contains the key character string, whereas with URL filters only the pages of interest are downloaded.
For example, in the case of TB, it is sufficient to place an inclusive folder filter: /Catalonia/, with no level limit, because the server has a hierarchical territorial structure of folders to store the files. Conversely, in the case of TA, all the files of interest contain the word Catalonia, those which have hyperlinks which lead to OTRs start with Attraction, and those of the same OTRs start with ShowUserReview; therefore, a couple of inclusive filename filters are enough: Attraction*Catalonia and ShowUserReview*Catalonia. To understand the importance of the filters in this case, we ought to bear in mind that TA reached the figure of more than 170,000,000 reviews and opinions, and all its webpages are linked at different levels by hyperlinks.
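The inclusive folder and filename filters described above can be sketched with the standard `fnmatch` module. This only illustrates the idea and is not the filter syntax of the download tool: the URLs are hypothetical, `make_filter` is an illustrative name, and a trailing `*` is added to the patterns so that they also match the file extension.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit
import posixpath


def make_filter(folder=None, filename_patterns=()):
    """Build an inclusive URL filter on a folder name and/or filename patterns."""
    def keep(url):
        path = urlsplit(url).path
        if folder is not None and folder not in path:
            return False
        name = posixpath.basename(path)
        if filename_patterns and not any(fnmatchcase(name, p) for p in filename_patterns):
            return False
        return True
    return keep


# TB: one inclusive folder filter is enough (hierarchical territorial folders).
keep_tb = make_filter(folder="/Catalonia/")
# TA: two inclusive filename filters (index pages and the reviews themselves).
keep_ta = make_filter(filename_patterns=("Attraction*Catalonia*", "ShowUserReview*Catalonia*"))

# Hypothetical URLs, for illustration only.
print(keep_tb("http://tb.example/Europe/Spain/Catalonia/Barcelona/blog-1.html"))
print(keep_ta("http://ta.example/Hotel_Review-g187497-Barcelona_Catalonia.html"))
```

Because both filters look only at the URL, a page that fails them never has to be downloaded, which is the efficiency argument made for URL filters over content filters.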
Geography: TAcode;VTcode;Destination;Brand
g187496;c1;Catalonia;unCla
g187497;430de;Barcelona;Barna
g494960;402fe;Lloret-de-Mar;cBrav
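The semicolon-separated geography file above can be loaded into lookup tables with the standard `csv` module. A minimal sketch: the three rows are the ones shown in these notes, and the function name `load_geography` is illustrative.

```python
import csv
import io

GEOGRAPHY = """TAcode;VTcode;Destination;Brand
g187496;c1;Catalonia;unCla
g187497;430de;Barcelona;Barna
g494960;402fe;Lloret-de-Mar;cBrav
"""


def load_geography(text):
    """Map each website's destination code to (destination, brand)."""
    by_ta, by_vt = {}, {}
    for record in csv.DictReader(io.StringIO(text), delimiter=";"):
        place = (record["Destination"], record["Brand"])
        by_ta[record["TAcode"]] = place
        by_vt[record["VTcode"]] = place
    return by_ta, by_vt


by_ta, by_vt = load_geography(GEOGRAPHY)
print(by_ta["g187497"])
```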
LANGUAGE DETECTION: plain text HTML As Text
Once the CSV files are ready, a batch programme (Marine-Roig, 2013: Annex A3) is run for each website, which goes through all files, extracts internal data such as the date of the diary and the name of the destination, eliminates entries without narrative content (more than 70,000 OTRs in the case of TA), changes the format of such dates to yyyymmdd, creates new territorial directories, and transfers the diary to the destination folder already with its articulated name to facilitate future classifications. Finally, the two-character ISO 639-1 codes are introduced in the name of the files, after the date (Figure 3).
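The renaming step can be sketched as follows. This is not the batch programme cited above: it assumes that the components named on the "Structure of folders and files" slide are nested folders, that the language code has already been detected, and that the source date format is known for each website. The function name and the example values are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def diary_path(root, website, brand, destination, date_text, date_format,
               lang, page_name, is_from=None, theme=None):
    """Build website/brand/destination/date_lang_[isFrom]_pageName_[theme].htm under root."""
    date = datetime.strptime(date_text, date_format).strftime("%Y%m%d")
    parts = [date, lang]
    if is_from:
        parts.append(is_from)  # optional component
    parts.append(page_name)
    if theme:
        parts.append(theme)    # optional component
    return PurePosixPath(root, website, brand, destination, "_".join(parts) + ".htm")


# Hypothetical diary: posted on 5 March 2012, written in English.
print(diary_path("root", "TB", "Barna", "Barcelona", "05/03/2012", "%d/%m/%Y", "en", "blog-1"))
```

Putting the date first in yyyymmdd form means that an ordinary alphabetical sort of the file names is also a chronological sort.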
Eliminating noise. The original HTML format should be preserved in order to be
able to weight keywords and key phrases according to their potential impact
SITE CONTENT ANALYZER, frequency, sitewide-density, weight