ENTER 2015 Research Track Slide Number 1
A Method for Analysing Large-Scale UGC Data for Tourism:
Application to the Case of Catalonia
Estela Mariné-Roig and Salvador Anton Clavé
Research Group on Territorial Analysis and Tourism Studies (GRATET)
University Rovira i Virgili, Catalonia, Spain
estela.marine@aegern.udl.cat
salvador.anton@urv.cat
http://www.urv.cat/en_index.html
ENTER 2015 Research Track Slide Number 2
Introduction and aim
• UGC data → a good source of information for DMOs, stakeholders and tourists.
• Travel blogs and Online Travel Reviews (OTRs) → first-hand experiences of
travellers.
• They have mostly been analysed with content analysis and narrative analysis
(Banyai & Glover, 2012) in the areas of service quality, destination image and
reputation, UGC, experiences and behaviour, and mobility patterns (Lu &
Stepchenkova, 2014).
• Such UGC data have grown exponentially in recent years, and handling them is
now considered to require Big Data technologies.
• However, in most studies of UGC data the collection is done “by hand”
(Lu & Stepchenkova, 2014) and is usually non-random → very time-consuming
and non-representative.
This article proposes a method for semi-automatic
downloading, arranging, cleaning, debugging, and
analysing large-scale travel blog and OTR data.
ENTER 2015 Research Track Slide Number 3
Web mining background
• Web mining uses data mining techniques to find useful
information or to extract knowledge from the
hyperlink structure and content of webpages
(Liu, 2011).
• To automate the extraction process, a Web
crawler programme is needed first, capable of roaming
the hyperlink structure and downloading the linked
webpages.
• There is abundant literature on data mining related to
tourism and some on massive downloads.
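As a rough illustration of what such a crawler does, the sketch below walks the hyperlink structure breadth-first up to a level limit. It is not the tool used in the study: the names `crawl`, `LinkExtractor` and `fetch` are hypothetical, and `fetch` is assumed to be any callable that returns the HTML of a URL (or `None` when the page is unavailable), so the example runs against an in-memory site instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_level=1):
    """Breadth-first crawl; returns {url: html} for pages up to max_level."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        html = fetch(url)  # hypothetical downloader supplied by the caller
        if html is None:
            continue
        pages[url] = html
        if level >= max_level:
            continue  # do not follow links beyond the level limit
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return pages


# Tiny in-memory "website" standing in for real downloads (made-up URLs).
SITE = {
    "http://example.org/": '<a href="/a.htm">A</a> <a href="b.htm">B</a>',
    "http://example.org/a.htm": '<a href="/c.htm">C</a>',
    "http://example.org/b.htm": "no links here",
    "http://example.org/c.htm": "two levels deep",
}
pages = crawl("http://example.org/", SITE.get, max_level=1)
```

With `max_level=1` the start page and the two pages it links to are kept, while `c.htm`, two levels down, is left out.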
ENTER 2015 Research Track Slide Number 4
Methodology
• Abburu and Babu (2013) propose a framework for web data extraction
and analysis based on three basic steps: finding URLs of webpages,
extracting information from webpages, and data analysis.
• That system architecture is divided into three modules:
  o web crawling
  o information extraction
  o mining
• In this research we add cleaning and debugging phases to
eliminate the noise present in the webpages, so that the
content analysis phase starts from quality information in the original HTML
format → the resulting webpages contain only what the user wrote.
The methodology is applied to the case of Catalonia to analyse
about 85,000 travel diaries created between 2004 and 2013.
ENTER 2015 Research Track Slide Number 5
Destination selected for the case study (Catalonia)
Attributes:
• Millennia-old history
• Mediterranean destination
• 580 km of coastline
• Own culture and language (Catalan)
• Rich historical and natural heritage
• Third European region by overnight stays
• Foreign tourists in 2013: 15,631,500
• Nine regional tourism brands:
Tourist brand        Abbr.
Barcelona            Barna
Costa Barcelona      cBarc
Costa Brava          cBrav
Costa Daurada        cDaur
Paisatges Barcelona  pBarc
Pirineus             Pyren
Terres de l’Ebre     tEbre
Terres de Lleida     tLlei
Val d’Aran           vAran
(unclassified)       unCla
ENTER 2015 Research Track Slide Number 6
Selection of the most suitable websites hosting UGC data
Weighted formula applied (Marine-Roig, 2014a): TBRH = 1*B(V) + 1*B(P) + 2*B(S)
o Borda count (B): method that ranks options in order of preference
Webometrics:
o Visibility (V):
  • Indexed pages in search engines (Google.com, Bing.com)
  • Link-based ranks (Google page rank PR, Yandex topical citation index CY)
o Popularity (P): visit-based ranks (Alexa.com, Compete.com, Quantcast.com)
o Size (S): number of UGC entries related to the case study
Websites hosting UGC data selected:
o 1st: TripAdvisor.com (TA), hosts online travel reviews (OTRs)
o 2nd: VirtualTourist.com (VT), hosts travel blogs, travelogues and OTRs
o 3rd: TravelBlog.org (TB), hosts travel blogs
o 4th: TravelPod.com (TP), hosts travel blogs and a few OTRs
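The weighted formula can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how the individual metrics are merged into a single ranking per dimension or how ties are broken, so the sketch takes one ready-made ranking per dimension (best first) as input, gives n points to the first of n options, and breaks ties alphabetically. The function names and the sites "A", "B", "C" are hypothetical.

```python
def borda_points(ranking):
    """Borda count: with n options, the first gets n points and the last 1."""
    n = len(ranking)
    return {option: n - position for position, option in enumerate(ranking)}


def tbrh(visibility, popularity, size, weights=(1, 1, 2)):
    """Weighted sum 1*B(V) + 1*B(P) + 2*B(S); returns (option, score) pairs.

    Each argument is a ranking of the same options, best first.
    Ties are broken alphabetically (an assumption, not from the source).
    """
    scores = {}
    for weight, ranking in zip(weights, (visibility, popularity, size)):
        for option, points in borda_points(ranking).items():
            scores[option] = scores.get(option, 0) + weight * points
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings for three made-up sites.
result = tbrh(
    visibility=["A", "B", "C"],
    popularity=["B", "A", "C"],
    size=["A", "C", "B"],
)
print(result)
```

Because size carries a double weight, a site that hosts many relevant entries can outrank one that is more visible or more visited.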
ENTER 2015 Research Track Slide Number 7
Webometrics of the top four websites hosting travel diaries
                                          TA         TB         TP         VT
Indexed pages     Google.com      18,600,000    478,000    759,000  1,120,000
                  Bing.com        23,800,000    320,000    448,000    415,000
Link-based rank   Google PR                8          6          6          7
                  Yandex CY            1,600        110        350        375
Visit-based rank  Compete.com             51     38,742     11,824      2,500
                  Quantcast.com          127     36,067      9,279      2,065
                  Alexa.com              182     21,123     21,324      4,156
Size              Entries             72,874      2,988      2,116      7,791
TBRH              Rank                     1          3          4          2
ENTER 2015 Research Track Slide Number 8
Gathering process on websites
[Figure: simplified flow diagram of the downloading process]
Filters:
o Level (0, 1, ... no level limit)
Inclusive / exclusive:
o URL
  • Protocol (HTTP, FTP, ...)
  • Server
  • Domain
  • Directories (folders)
  • Filename
  • File type (html, jpg, ...)
o Content. Search
  • for all keywords
  • for exact word sequence
  • inside HTML tags
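The URL and content filters listed above could look roughly like this in code. It is a minimal sketch with hypothetical function and parameter names, and it covers only inclusive filters; an exclusive filter would simply negate the same tests. The download tool used in the study is not named on the slide, so nothing here reproduces its actual options.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_passes(url, *, protocols=("http", "https"), domains=(),
               directories=(), file_types=()):
    """Inclusive URL filter on protocol, domain, directory and file type.

    An empty tuple means "no restriction" for that criterion.
    """
    parts = urlparse(url)
    if parts.scheme not in protocols:
        return False
    host = parts.hostname or ""
    if domains and not any(host == d or host.endswith("." + d) for d in domains):
        return False
    if directories and not any(parts.path.startswith(d) for d in directories):
        return False
    suffix = PurePosixPath(parts.path).suffix.lstrip(".").lower()
    if file_types and suffix not in file_types:
        return False
    return True


def content_passes(html, all_keywords=(), exact_phrase=None):
    """Inclusive content filter: every keyword and, optionally, an exact phrase."""
    text = html.lower()
    if any(keyword.lower() not in text for keyword in all_keywords):
        return False
    if exact_phrase is not None and exact_phrase.lower() not in text:
        return False
    return True


# Made-up URL and page used only to exercise the filters.
ok = url_passes(
    "http://www.example.org/blogs/spain/entry.html",
    domains=("example.org",),
    directories=("/blogs/",),
    file_types=("html", "htm"),
)
page = "<p>Visit Barcelona and Park Guell</p>"
```

A crawler would call `url_passes` before downloading a link and `content_passes` before saving the page.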
ENTER 2015 Research Track Slide Number 9
UGC data arrangement
Structure of folders and files:
root\website\brand\destination\date_lang_[isFrom]_pageName_[theme].htm
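A small sketch of how such a file path might be assembled, assuming that root, website, brand and destination are nested folders and that the bracketed fields (isFrom, theme) are optional. The function name `diary_path`, the date format and the sample values are illustrative assumptions, not taken from the study.

```python
from pathlib import PureWindowsPath


def diary_path(root, website, brand, destination, date, lang, page_name,
               is_from=None, theme=None):
    """Build root\\website\\brand\\destination\\date_lang_[isFrom]_pageName_[theme].htm."""
    fields = [date, lang]
    if is_from:                 # optional field, omitted when unknown
        fields.append(is_from)
    fields.append(page_name)
    if theme:                   # optional field, omitted when unknown
        fields.append(theme)
    return PureWindowsPath(root, website, brand, destination,
                           "_".join(fields) + ".htm")


# Hypothetical values: the date format and page name are made up.
path = diary_path("ugc", "TA", "Barna", "Barcelona",
                  "20130815", "en", "sagrada-familia")
print(path)
```

Keeping website, brand and destination as folders and date and language at the start of the file name means that ordinary file utilities can sort or select diaries by any of those fields.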
ENTER 2015 Research Track Slide Number 10
UGC data cleaning
Aims:
The cleaning and debugging phases
are essential to obtain
quality information, limited to the
web content as written and posted
by the diary author, and to overcome
the most significant errors.
Before: 52 KB. After: 2 KB (both without pictures).
Sample of removed HTML elements:
• <meta ... />
• <form ... </form>
• <iframe ... </iframe>
• <div id="header">... </div>
• <!-- [comment] -->
• <div id="comment">... </div>
• <div id="footer">... </div>
• <script type ... </script>
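One possible way to strip the sample elements listed above while leaving the author's text in its original HTML is sketched below, using only the Python standard library. The class name `Cleaner`, the helper `clean` and the sample page are hypothetical, and the real element list would depend on each website's templates; this is not the cleaning tool used in the study.

```python
from html.parser import HTMLParser

DROP_TAGS = {"form", "iframe", "script"}        # removed with their content
DROP_DIV_IDS = {"header", "comment", "footer"}  # <div id="..."> blocks to remove
VOID_TAGS = {"meta", "br", "img", "hr", "input", "link"}  # tags with no end tag


class Cleaner(HTMLParser):
    """Re-emits the HTML it is fed, minus the unwanted elements."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &iacute; etc. untouched
        self.out = []
        self.skip_stack = []  # open tags inside a region being dropped

    def handle_starttag(self, tag, attrs):
        if self.skip_stack:
            if tag not in VOID_TAGS:
                self.skip_stack.append(tag)
            return
        if tag == "meta":
            return
        if tag in DROP_TAGS or (tag == "div" and dict(attrs).get("id") in DROP_DIV_IDS):
            self.skip_stack.append(tag)
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_stack and tag != "meta":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_stack:
            if tag in self.skip_stack:
                while self.skip_stack.pop() != tag:  # close back to the match
                    pass
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_stack:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_stack:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_stack:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        pass  # comments are always dropped


def clean(html):
    parser = Cleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Made-up page containing several of the elements to be removed.
raw = ('<html><head><meta charset="utf-8"/>'
       '<script type="text/javascript">var x = "<div>";</script></head>'
       '<body><div id="header"><div>Menu</div></div><!-- ad -->'
       '<p>Gaud&iacute; is great</p><div id="footer">(c)</div></body></html>')
cleaned = clean(raw)
```

The stack of skipped tags is what lets a nested `<div>` inside a dropped header close without ending the dropped region too early.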
ENTER 2015 Research Track Slide Number 11
UGC data debugging (encoding and common mistakes)
Encoding: ISO 8859-15 (ASCII Latin-1 extended characters), UTF-8 encoding, HTML entities
Example (Gaudí): UTF-8 (GaudÃ--), HTML number (Gaud&#237;), HTML name (Gaud&iacute;)
Mistakes:
Correct noun  Misspellings
Barcelona     Bathelona, Barcellona, Barthelonaaaa, Bar-tha-lona, Bar-the-lona ...
Gaudí         Gaudi, Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi ...
Parc Güell    Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Güelle ...
Extended characters and their decimal codes:
_0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
E_ à á â ã ä å æ ç è é ê ë ì í î ï
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
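HTML entities:
Char  Number  Name
À     &#192;  &Agrave;
Á     &#193;  &Aacute;
Â     &#194;  &Acirc;
Ã     &#195;  &Atilde;
Ä     &#196;  &Auml;
Å     &#197;  &Aring;
UTF-8 bytes and the symbols shown when they are read as single-byte characters:
Char  HEX    Symb
À     c3 80  Ã €
Á     c3 81  Ã ?
Â     c3 82  Ã ‚
Ã     c3 83  Ã ƒ
Ä     c3 84  Ã „
Å     c3 85  Ã …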
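The three debugging problems on this slide (HTML entities, UTF-8 bytes misread as single-byte characters, and misspelt proper nouns) can each be handled in a few lines. The sketch below is only one way to do it: the function names are hypothetical, Windows-1252 is assumed as the single-byte encoding behind symbols such as "Ã€", and the variant list is a short excerpt of the misspellings shown above.

```python
import html
import re


def decode_entities(text):
    """Turn HTML number and name entities into real characters."""
    return html.unescape(text)


def repair_mojibake(text):
    """Undo UTF-8 text that was decoded as Windows-1252 (an assumption).

    If the text does not round-trip cleanly it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Excerpt of the misspellings listed on the slide, keyed by the correct noun.
VARIANTS = {
    "Barcelona": ["Bathelona", "Barcellona"],
    "Gaudí": ["Gaudi", "Gaudì", "Goudi", "Guadi"],
}


def normalise_names(text):
    """Replace known misspellings (whole words only) with the correct noun."""
    for correct, wrong in VARIANTS.items():
        pattern = r"\b(?:" + "|".join(map(re.escape, wrong)) + r")\b"
        text = re.sub(pattern, correct, text, flags=re.IGNORECASE)
    return text


fixed = normalise_names(decode_entities("Goudi and Gaud&iacute; in Barcellona"))
```

Running the entity and mojibake repairs before the spelling normalisation keeps the variant lists short, because each misspelling then only needs to be listed in one encoding.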
ENTER 2015 Research Track Slide Number 12
Results: Trends
Year 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
TA 40 38 81 117 204 608 1,421 5,933 28,387 36,045
TB 22 139 254 427 662 415 328 362 231 148
TP 29 100 236 276 258 226 238 218 189 346
VT 1,474 1,498 1,023 1,031 762 413 398 635 306 251
Barna 1,177 1,374 1,191 1,309 1,367 1,295 1,742 5,828 24,211 30,875
cBarc 34 42 53 70 79 34 63 115 325 560
cBrav 201 204 163 238 191 134 177 332 1,448 1,707
cDaur 61 46 82 117 134 121 288 698 2,599 2,498
pBarc 57 45 38 37 45 20 35 89 412 927
Pyren 10 20 12 25 8 10 22 14 62 149
tLlei 6 1 1 3 5 5 11 19 16 16
tEbre 4 3 0 1 1 5 1 2 3 10
vAran 1 0 7 0 0 3 2 3 9 11
unCla 14 40 47 51 56 35 44 48 28 37
[Figure: trends in web hosting and Catalan brands]
[Figure: monthly distribution of travel blogs and OTRs (TA, TB, TP and VT)]
ENTER 2015 Research Track Slide Number 13
Results: Top keywords
Rank  Keyword          Count    Site-wide density  Average weight  Remark
1     barcelona        197,723  3.77 %             56.26           Capital of Catalonia
2     great             51,525  0.98 %             23.73           Good feeling
3     tour              49,221  0.94 %             18.08
4     sagrada familia   38,341  0.73 %             60.75           Gaudi’s masterpiece
5     gaudi             33,187  0.63 %             19.66           Architect A. Gaudi
6     city              28,155  0.54 %             11.70
7     place             26,597  0.51 %             15.73
8     good              26,098  0.50 %             15.02           Good feeling
9     visit             25,973  0.49 %             14.86
10    amazing           25,242  0.48 %             24.18           Good feeling
11    park              24,962  0.47 %             28.38
12    basilica          23,618  0.45 %             81.68           Religious building
13    park guell        23,367  0.44 %             62.06           Gaudi’s work
14    beautiful         23,322  0.44 %             23.01           Good feeling
15    way               22,996  0.44 %             15.02
• Site Content Analyzer (SCA) was applied to the dataset
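To make the Count and Density columns concrete, the sketch below counts single-word keywords and expresses each count as a percentage of all words in the corpus. It is not SCA's algorithm: how SCA tokenises text, handles multi-word keywords such as "sagrada familia", or computes the Average Weight column is not described on the slide, and the function name and sample texts are made up.

```python
import re
from collections import Counter


def keyword_density(texts, top=10, stopwords=frozenset()):
    """Return (keyword, count, density %) for the most frequent words.

    Density is the keyword count divided by the total number of words
    (stopwords included in the total), as a percentage.
    """
    tokens = [word
              for text in texts
              for word in re.findall(r"[^\W\d_]+", text.lower())]
    counts = Counter(word for word in tokens if word not in stopwords)
    total = len(tokens)
    return [(word, count, round(100 * count / total, 2))
            for word, count in counts.most_common(top)]


# Three made-up diary snippets.
sample = ["Barcelona is great", "Great tour of Barcelona", "Barcelona"]
table = keyword_density(sample, top=2, stopwords=frozenset({"is", "of"}))
print(table)
```

On the real dataset the same ratio applies at a much larger scale: a count of 197,723 at 3.77 % density implies a corpus of roughly 5.2 million words.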
ENTER 2015 Research Track Slide Number 14
Top keywords: Barcelona, Gaudi and two of Gaudi’s works
Barcelona: Guell Park / Mosaic Dragon
Basilica of La Sagrada Familia (Passion façade) / Antoni Gaudi
ENTER 2015 Research Track Slide Number 15
Conclusions
• The proposed methodology facilitates the massive gathering of UGC
data from the most suitable sources for a specific case study.
• The hierarchical territorial structure of folders and the composition
of the individual diaries’ file names enable multiple classifications
using utilities to order and manipulate the files.
• This structure also allows the analysis to focus on a specific place,
language or subject.
• The cleaning and debugging phases are essential to obtain quality
information, limited to what has been written by the diary author.
• The HTML dataset is prepared for any offline content analysis in
future work, and most phases of this method are useful for the
content analysis of other web data sources.
ENTER 2015 Research Track Slide Number 16
Thank you for your attention!
estela.marine@aegern.udl.cat
salvador.anton@urv.cat

More Related Content

Viewers also liked

Viewers also liked (18)

Transportation Mode Annotation of Tourist GPS Trajectories under Environmenta...
Transportation Mode Annotation of Tourist GPS Trajectories under Environmenta...Transportation Mode Annotation of Tourist GPS Trajectories under Environmenta...
Transportation Mode Annotation of Tourist GPS Trajectories under Environmenta...
 
What Types of Hotels Make Their Guests (Un)Happy? Text Analytics of Customer ...
What Types of Hotels Make Their Guests (Un)Happy? Text Analytics of Customer ...What Types of Hotels Make Their Guests (Un)Happy? Text Analytics of Customer ...
What Types of Hotels Make Their Guests (Un)Happy? Text Analytics of Customer ...
 
Athens Destination Specialist Program: The New Era of Destination Specialist ...
Athens Destination Specialist Program: The New Era of Destination Specialist ...Athens Destination Specialist Program: The New Era of Destination Specialist ...
Athens Destination Specialist Program: The New Era of Destination Specialist ...
 
@Spain is different. Co-branding strategies between Spanish national and regi...
@Spain is different. Co-branding strategies between Spanish national and regi...@Spain is different. Co-branding strategies between Spanish national and regi...
@Spain is different. Co-branding strategies between Spanish national and regi...
 
Linked Data for Cross-Domain Decision-making in Tourism
Linked Data for Cross-Domain Decision-making in TourismLinked Data for Cross-Domain Decision-making in Tourism
Linked Data for Cross-Domain Decision-making in Tourism
 
Online marketing challenges Zürich Tourism
Online marketing challenges Zürich TourismOnline marketing challenges Zürich Tourism
Online marketing challenges Zürich Tourism
 
Exhibition Attendees' Smart Technology Actual Usage: A Case of Near Field Com...
Exhibition Attendees' Smart Technology Actual Usage: A Case of Near Field Com...Exhibition Attendees' Smart Technology Actual Usage: A Case of Near Field Com...
Exhibition Attendees' Smart Technology Actual Usage: A Case of Near Field Com...
 
Usages and Role of Instant Messaging Applications during the Beatification of...
Usages and Role of Instant Messaging Applications during the Beatification of...Usages and Role of Instant Messaging Applications during the Beatification of...
Usages and Role of Instant Messaging Applications during the Beatification of...
 
Smart and Connected Tourism Technologies
Smart and Connected Tourism TechnologiesSmart and Connected Tourism Technologies
Smart and Connected Tourism Technologies
 
Liricon Valley
Liricon ValleyLiricon Valley
Liricon Valley
 
Trevii: Cheaper tickets for tourist attraction in an user-friendly way.
Trevii: Cheaper tickets for tourist attraction in an user-friendly way.Trevii: Cheaper tickets for tourist attraction in an user-friendly way.
Trevii: Cheaper tickets for tourist attraction in an user-friendly way.
 
Tourism destination perspective. Best practices of Zermatt - Matterhorn.
Tourism destination perspective. Best practices of Zermatt - Matterhorn.Tourism destination perspective. Best practices of Zermatt - Matterhorn.
Tourism destination perspective. Best practices of Zermatt - Matterhorn.
 
The Rise of eTourism for Development
The Rise of eTourism for DevelopmentThe Rise of eTourism for Development
The Rise of eTourism for Development
 
Gamification in Tourism: Analysis of Brazil Quest Game
Gamification in Tourism: Analysis of Brazil Quest GameGamification in Tourism: Analysis of Brazil Quest Game
Gamification in Tourism: Analysis of Brazil Quest Game
 
The Role of Personal Value in Information Search Strategies for Community-Bas...
The Role of Personal Value in Information Search Strategies for Community-Bas...The Role of Personal Value in Information Search Strategies for Community-Bas...
The Role of Personal Value in Information Search Strategies for Community-Bas...
 
The Evolution of eTourism Research A Case of ENTER Conference
The Evolution of eTourism Research A Case of ENTER ConferenceThe Evolution of eTourism Research A Case of ENTER Conference
The Evolution of eTourism Research A Case of ENTER Conference
 
How to get research papers published
How to get research papers publishedHow to get research papers published
How to get research papers published
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 

Similar to A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

Introduction to web scraping
Introduction to web scrapingIntroduction to web scraping
Introduction to web scrapingDario Cottafava
 
Latest Trends in Technology: BigData Analytics, Virtualization, Cloud Computi...
Latest Trends in Technology:BigData Analytics, Virtualization, Cloud Computi...Latest Trends in Technology:BigData Analytics, Virtualization, Cloud Computi...
Latest Trends in Technology: BigData Analytics, Virtualization, Cloud Computi...Abzetdin Adamov
 
LEAP into Data Science!
LEAP into Data Science!LEAP into Data Science!
LEAP into Data Science!Dev Gonzalez
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Anastasija Nikiforova
 
Big data@accordantmedia - oanyc summit
Big data@accordantmedia - oanyc summitBig data@accordantmedia - oanyc summit
Big data@accordantmedia - oanyc summitOpen Analytics
 
Improving the reported use and impact of institutional repositories
Improving the reported use and impact of institutional repositoriesImproving the reported use and impact of institutional repositories
Improving the reported use and impact of institutional repositoriesKenning Arlitsch
 
BMO Capital Capital Markets at DES: The State of Advertising Technology
BMO Capital Capital Markets at DES: The State of Advertising TechnologyBMO Capital Capital Markets at DES: The State of Advertising Technology
BMO Capital Capital Markets at DES: The State of Advertising TechnologyDigiday
 
2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...
2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...
2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...eMadrid network
 
a2c Boston Big Data Meet-up: Agile Data Warehouse Design
a2c Boston Big Data Meet-up:  Agile Data Warehouse Designa2c Boston Big Data Meet-up:  Agile Data Warehouse Design
a2c Boston Big Data Meet-up: Agile Data Warehouse Designa2c
 

Similar to A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia (20)

From the projected to the transmitted image: The 2.0 construction of tourist ...
From the projected to the transmitted image: The 2.0 construction of tourist ...From the projected to the transmitted image: The 2.0 construction of tourist ...
From the projected to the transmitted image: The 2.0 construction of tourist ...
 
Introduction to web scraping
Introduction to web scrapingIntroduction to web scraping
Introduction to web scraping
 
Latest Trends in Technology: BigData Analytics, Virtualization, Cloud Computi...
Latest Trends in Technology:BigData Analytics, Virtualization, Cloud Computi...Latest Trends in Technology:BigData Analytics, Virtualization, Cloud Computi...
Latest Trends in Technology: BigData Analytics, Virtualization, Cloud Computi...
 
Mapping of Tourism Destinations to Travel Behavioural Patterns
Mapping of Tourism Destinations to Travel Behavioural PatternsMapping of Tourism Destinations to Travel Behavioural Patterns
Mapping of Tourism Destinations to Travel Behavioural Patterns
 
Leap into data science!
Leap into data science!Leap into data science!
Leap into data science!
 
LEAP into Data Science!
LEAP into Data Science!LEAP into Data Science!
LEAP into Data Science!
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
 
The Design and Implementation of an Electronic Ticket Package System for Tour...
The Design and Implementation of an Electronic Ticket Package System for Tour...The Design and Implementation of an Electronic Ticket Package System for Tour...
The Design and Implementation of an Electronic Ticket Package System for Tour...
 
Big data@accordantmedia - oanyc summit
Big data@accordantmedia - oanyc summitBig data@accordantmedia - oanyc summit
Big data@accordantmedia - oanyc summit
 
Inversini Enter2009(P35)
Inversini Enter2009(P35)Inversini Enter2009(P35)
Inversini Enter2009(P35)
 
Improving the reported use and impact of institutional repositories
Improving the reported use and impact of institutional repositoriesImproving the reported use and impact of institutional repositories
Improving the reported use and impact of institutional repositories
 
IoD Sales and Marketing Forum 8oct13
IoD Sales and Marketing Forum 8oct13IoD Sales and Marketing Forum 8oct13
IoD Sales and Marketing Forum 8oct13
 
Content Analysis of Travel Reviews: Exploring the Needs of Tourists from Diff...
Content Analysis of Travel Reviews: Exploring the Needs of Tourists from Diff...Content Analysis of Travel Reviews: Exploring the Needs of Tourists from Diff...
Content Analysis of Travel Reviews: Exploring the Needs of Tourists from Diff...
 
Information gathering by ubiquitous services for CRM in tourism destinations:...
Information gathering by ubiquitous services for CRM in tourism destinations:...Information gathering by ubiquitous services for CRM in tourism destinations:...
Information gathering by ubiquitous services for CRM in tourism destinations:...
 
BMO Capital Capital Markets at DES: The State of Advertising Technology
BMO Capital Capital Markets at DES: The State of Advertising TechnologyBMO Capital Capital Markets at DES: The State of Advertising Technology
BMO Capital Capital Markets at DES: The State of Advertising Technology
 
diadem-vldb-2015
diadem-vldb-2015diadem-vldb-2015
diadem-vldb-2015
 
Piano rubyslava final
Piano rubyslava finalPiano rubyslava final
Piano rubyslava final
 
2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...
2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...
2013 07 05 (uc3m) lasi emadrid jgzubia deusto learning analytics primeras exp...
 
Forecasting the final penetration rate of online travel agencies in different...
Forecasting the final penetration rate of online travel agencies in different...Forecasting the final penetration rate of online travel agencies in different...
Forecasting the final penetration rate of online travel agencies in different...
 
a2c Boston Big Data Meet-up: Agile Data Warehouse Design
a2c Boston Big Data Meet-up:  Agile Data Warehouse Designa2c Boston Big Data Meet-up:  Agile Data Warehouse Design
a2c Boston Big Data Meet-up: Agile Data Warehouse Design
 

Recently uploaded

Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 

Recently uploaded (20)

Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 

A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

  • 1. ENTER 2015 Research Track Slide Number 1 A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia Estela Mariné-Roig and Salvador Anton Clavé Research Group on Territorial Analysis and Tourism Studies (GRATET) University Rovira i Virgili, Catalonia, Spain estela.marine@aegern.udl.cat salvador.anton@urv.cat http://www.urv.cat/en_index.html
  • 2. ENTER 2015 Research Track Slide Number 2 Introduction and aim  UGC data  good source of information for DMOs, stakeholders and tourists.  Travel blogs and Online Travel Reviews (OTRs)  first-hand experiences of travellers.  They have mostly been analysed with content analysis and narrative analysis (Banyai & Glover, 2012) in the areas of service quality, destination image and reputation, UGC, experiences and behaviour, and mobility patterns (Lu & Stepchenkova, 2014)  Such UGC data have exponentially grown in recent years and it is now considered that its manipulation requires the use of Big Data technologies.  However, in most studies concerning UGC data the collection is done “by hand” (Lu and Stepchenkova, 2014) and is usually non-random  very time-consuming and non-representative. This article aims to propose a method for semi-automatic downloading, arranging, cleaning, debugging, and analysing large-scale travel blog and OTR data.
  • 3. ENTER 2015 Research Track Slide Number 3 Web mining background  Web mining, using data mining techniques, intends to find useful information or to extract knowledge of the hyperlink structure and content of webpages Liu (2011)  To automatize the process of extraction, first a Web crawler programme is needed, capable of roaming the hyperlink structure and downloading the linked webpages.  There is abundant literature on data mining related to tourism and some on massive downloads.
  • 4. ENTER 2015 Research Track Slide Number 4 Methodology  Abburu and Babu (2013) propose a framework for web data extraction and analysis based on three basic steps: finding URLs of webpages, extracting information from webpages, and data analysis.  The above system architecture is divided into three modules:  web crawling  information extraction  Mining  In this research we add the cleaning and debugging phases to eliminate the noise present in the webpage to be able to get to the content analysis phase with quality information in the original HTML format  Resulting webpages only contain what the user wrote. The methodology is applied to the case of Catalonia to analyse about 85,000 travel diaries created between the years 2004 -2013
  • 5. ENTER 2015 Research Track Slide Number 5 Destination selected for the case study (Catalonia) Attributes: •Millenary history •Mediterranean destination •Bathed by 580 km of shoreline •Own culture and language (Catalan) •Wealthy historical and natural heritage •Third European region (overnight stays) •Foreign tourists in 2013: 15,631,500 •Nine regional tourism brands: Tourist brand Abbr. Barcelona Costa Barcelona Costa Brava Costa Daurada Paisatges Barcelona Pirineus Terres de l’Ebre Terres de Lleida Val d’Aran (unclassified) Barna cBarc cBrav cDaur pBarc Pyren tEbre tLlei vAran unCla
  • 6. Selection of the most suitable websites hosting UGC data
      Weighted formula applied (Marine-Roig, 2014a): TBRH = 1*B(V) + 1*B(P) + 2*B(S)
      o Borda count (B): method that ranks options in order of preference
      Webometrics:
      o Visibility (V):
          • Indexed pages in search engines (Google.com, Bing.com)
          • Link-based ranks (Google page rank PR, Yandex topical citation index CY)
      o Popularity (P): visit-based ranks (Alexa.com, Compete.com, Quantcast.com)
      o Size (S): number of UGC entries related to the case study
      Websites hosting UGC data selected:
      o 1st TripAdvisor.com (TA): hosts online travel reviews (OTRs)
      o 2nd VirtualTourist.com (VT): hosts travel blogs, travelogues and OTRs
      o 3rd TravelBlog.org (TB): hosts travel blogs
      o 4th TravelPod.com (TP): hosts travel blogs and a few OTRs
  • 7. Webometrics of the top four websites hosting travel diaries
                                                TA          TB          TP          VT
      Indexed pages     Google.com      18,600,000     478,000     759,000   1,120,000
                        Bing.com        23,800,000     320,000     448,000     415,000
      Link-based rank   Google PR                8           6           6           7
                        Yandex CY            1,600         110         350         375
      Visit-based rank  Compete.com             51      38,742      11,824       2,500
                        Quantcast.com          127      36,067       9,279       2,065
                        Alexa.com              182      21,123      21,324       4,156
      Size              Entries             72,874       2,988       2,116       7,791
      TBRH              Rank                     1           3           4           2
  • 8. Gathering process on websites
      Filters:
      o Level (0, 1, ... no level limit)
        Inclusive / exclusive
      o URL
          • Protocol (HTTP, FTP, ...)
          • Server
          • Domain
          • Directories (folders)
          • Filename
          • File type (html, jpg, ...)
      o Content. Search
          • for all keywords
          • for exact word sequence
          • inside HTML tags
      [Figure: simplified flow diagram of the downloading process]
  • 9. UGC data arrangement
      Structure of folders and files:
      root\website\brand\destination\date_lang_[isFrom]_pageName_[theme].htm
  • 10. UGC data cleaning
      Aims: The cleaning and debugging phases are essential to be able to obtain quality information, limited to the web content as written and posted by the diary author, and overcoming the most significant errors.
      Before: 52 KB. After: 2 KB (both without pictures).
      Sample of removed HTML elements:
      • <meta ... />
      • <form ... </form>
      • <iframe ... </iframe>
      • <div id="header">... </div>
      • <!-- [comment] -->
      • <div id="comment">... </div>
      • <div id="footer">... </div>
      • <script type ... </script>
  • 11. UGC data debugging (encoding and common mistakes)
      Encoding: ISO 8859-15 (ASCII Latin-1 extended characters), UTF-8 encoding, HTML entities
      Gaudí: UTF-8 (GaudÃ--), HTML number (Gaud&#237;), HTML name (Gaud&iacute;)

      Mistakes:
      Correct noun   Misspellings
      Barcelona      Bathelona, Barcellona, Barthelonaaaa, Bar-tha-lona, Bar-the-lona ...
      Gaudí          Gaudi; Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi ...
      Parc Güell     Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Güelle ...

      Extended characters (decimal code under each character):
          _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F
      C_  À   Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï
          192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
      D_  Ð   Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß
          208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
      E_  à   á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï
          224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
      F_  ð   ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ
          240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

      HTML entities:
      Char  Number   Name
      À     &#192;   &Agrave;
      Á     &#193;   &Aacute;
      Â     &#194;   &Acirc;
      Ã     &#195;   &Atilde;
      Ä     &#196;   &Auml;
      Å     &#197;   &Aring;

      UTF-8 bytes and how they display when misread:
      Char  HEX     Symb
      À     c3 80   Ã€
      Á     c3 81   Ã?
      Â     c3 82   Ã‚
      Ã     c3 83   Ãƒ
      Ä     c3 84   Ã„
      Å     c3 85   Ã…
  • 12. Results: Trends
      Trends in web hosting and Catalan brands
               2004   2005   2006   2007   2008   2009   2010   2011   2012   2013
      TA         40     38     81    117    204    608  1,421  5,933 28,387 36,045
      TB         22    139    254    427    662    415    328    362    231    148
      TP         29    100    236    276    258    226    238    218    189    346
      VT      1,474  1,498  1,023  1,031    762    413    398    635    306    251
      Barna   1,177  1,374  1,191  1,309  1,367  1,295  1,742  5,828 24,211 30,875
      cBarc      34     42     53     70     79     34     63    115    325    560
      cBrav     201    204    163    238    191    134    177    332  1,448  1,707
      cDaur      61     46     82    117    134    121    288    698  2,599  2,498
      pBarc      57     45     38     37     45     20     35     89    412    927
      Pyren      10     20     12     25      8     10     22     14     62    149
      tLlei       6      1      1      3      5      5     11     19     16     16
      tEbre       4      3      0      1      1      5      1      2      3     10
      vAran       1      0      7      0      0      3      2      3      9     11
      unCla      14     40     47     51     56     35     44     48     28     37
      [Figure: monthly distribution of travel blogs and OTRs (TA, TB, TP, & VT)]
  • 13. Results: Top keywords
      Rank  Keyword             Count  Site-wide density  Average weight  Remark
      1     barcelona         197,723  3.77 %             56.26           Capital of Catalonia
      2     great              51,525  0.98 %             23.73           Good feeling
      3     tour               49,221  0.94 %             18.08
      4     sagrada familia    38,341  0.73 %             60.75           Gaudi’s masterpiece
      5     gaudi              33,187  0.63 %             19.66           Architect A. Gaudi
      6     city               28,155  0.54 %             11.70
      7     place              26,597  0.51 %             15.73
      8     good               26,098  0.50 %             15.02           Good feeling
      9     visit              25,973  0.49 %             14.86
      10    amazing            25,242  0.48 %             24.18           Good feeling
      11    park               24,962  0.47 %             28.38
      12    basilica           23,618  0.45 %             81.68           Religious building
      13    park guell         23,367  0.44 %             62.06           Gaudi’s work
      14    beautiful          23,322  0.44 %             23.01           Good feeling
      15    way                22,996  0.44 %             15.02
      • Site Content Analyzer (SCA) was applied to the dataset
  • 14. Top keywords: Barcelona, Gaudi and two of Gaudi’s works
      Barcelona: Guell Park / Mosaic Dragon
      Basilica of La Sagrada Familia (Passion façade) / Antoni Gaudi
  • 15. Conclusions
      • The proposed methodology facilitates the massive gathering of UGC data from the most suitable sources for a specific case study.
      • The hierarchical territorial structure of folders and the articulation of the individual diaries’ file names enable multiple classifications using utilities to order and manipulate the files.
      • This structure also allows the analysis to focus on a specific place, language or subject.
      • The cleaning and debugging phases are essential to obtain quality information, limited to what has been written by the diary author.
      • The HTML dataset is prepared for any offline content analysis in future work, and most phases of this method are useful for the content analysis of other web data sources.
  • 16. Thank you for your attention!
      estela.marine@aegern.udl.cat
      salvador.anton@urv.cat
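
The weighted formula on slide 6, TBRH = 1*B(V) + 1*B(P) + 2*B(S), can be illustrated with the webometric figures of slide 7. The slides do not say how the several indicators inside Visibility and Popularity are merged into one Borda count, nor how ties are scored, so the sketch below makes two assumptions: Borda points are summed across indicators and the sums are ranked again, and tied options share the average of their points. The function names are illustrative, and the figures are read column by column from the flattened table.

```python
from typing import Dict, List

Scores = Dict[str, float]


def borda(values: Scores, higher_is_better: bool = True) -> Scores:
    """Borda points: with n options the best gets n-1 and the worst 0.
    Assumption: tied options share the average of the points they span."""
    points: Scores = {}
    for site, value in values.items():
        if higher_is_better:
            beaten = sum(1 for other in values.values() if other < value)
        else:
            beaten = sum(1 for other in values.values() if other > value)
        tied = sum(1 for other in values.values() if other == value) - 1
        points[site] = beaten + tied / 2
    return points


def combined_borda(indicators: List[Scores], higher_is_better: bool) -> Scores:
    """Assumption: sum the Borda points of each indicator, then rank the sums."""
    totals: Scores = {}
    for indicator in indicators:
        for site, pts in borda(indicator, higher_is_better).items():
            totals[site] = totals.get(site, 0.0) + pts
    return borda(totals)


def tbrh(b_v: Scores, b_p: Scores, b_s: Scores) -> Scores:
    """TBRH = 1*B(V) + 1*B(P) + 2*B(S), as given on slide 6."""
    return {site: 1 * b_v[site] + 1 * b_p[site] + 2 * b_s[site] for site in b_s}


# Figures from slide 7.
google = {"TA": 18_600_000, "TB": 478_000, "TP": 759_000, "VT": 1_120_000}
bing = {"TA": 23_800_000, "TB": 320_000, "TP": 448_000, "VT": 415_000}
google_pr = {"TA": 8, "TB": 6, "TP": 6, "VT": 7}
yandex_cy = {"TA": 1_600, "TB": 110, "TP": 350, "VT": 375}
compete = {"TA": 51, "TB": 38_742, "TP": 11_824, "VT": 2_500}
quantcast = {"TA": 127, "TB": 36_067, "TP": 9_279, "VT": 2_065}
alexa = {"TA": 182, "TB": 21_123, "TP": 21_324, "VT": 4_156}
entries = {"TA": 72_874, "TB": 2_988, "TP": 2_116, "VT": 7_791}

# Visibility: larger figures are better. Popularity: a lower visit rank is better.
b_v = combined_borda([google, bing, google_pr, yandex_cy], higher_is_better=True)
b_p = combined_borda([compete, quantcast, alexa], higher_is_better=False)
b_s = borda(entries)

scores = tbrh(b_v, b_p, b_s)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Under these assumptions TA scores 12 and VT scores 8, which agrees with the first and second places on slide 7, while TB and TP tie at 2. The third and fourth places on the slide therefore depend on details of the original formula that this sketch does not capture.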

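The cleaning and debugging phases of slides 10 and 11 can be sketched with Python's standard `html.parser` module. This is not the authors' tool, which the slides do not name: it is a minimal illustration that drops the sample noise elements listed on slide 10, keeps the remaining HTML markup, and normalises the three spellings of "Gaudí" shown on slide 11. The `div` ids are those of the slide's sample and would differ from site to site; `DiaryCleaner`, `fix_mojibake` and `clean` are hypothetical names.

```python
import html
import re
from html.parser import HTMLParser

NOISE_TAGS = {"script", "form", "iframe"}        # removed together with their content
NOISE_DIV_IDS = {"header", "comment", "footer"}  # assumed ids, taken from the slide's sample


def fix_mojibake(token: str) -> str:
    """Undo UTF-8 bytes that were misread as a single-byte code page."""
    for codec in ("cp1252", "latin-1"):
        try:
            return token.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return token  # already correct, or not repairable this way


class DiaryCleaner(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # &iacute; and &#237; arrive as í
        self.out = []
        self.skip_tag = None  # tag name of the noise element being skipped
        self.skip_depth = 0   # nesting depth of that tag inside the noise element

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in NOISE_TAGS or (tag == "div" and dict(attrs).get("id") in NOISE_DIV_IDS):
            self.skip_tag, self.skip_depth = tag, 1
        elif tag != "meta":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> never open a skipped region.
        if not self.skip_tag and tag != "meta" and tag not in NOISE_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def handle_decl(self, decl):
        if not self.skip_tag:
            self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if self.skip_tag:
            return
        # Repair word by word so one correct accent does not block the others.
        fixed = "".join(fix_mojibake(tok) for tok in re.split(r"(\s+)", data))
        self.out.append(html.escape(fixed, quote=False))

    # Comments are dropped because handle_comment is left as the default no-op.


def clean(page: str) -> str:
    parser = DiaryCleaner()
    parser.feed(page)
    parser.close()
    return "".join(parser.out)


page = (
    '<html><head><meta charset="utf-8" />'
    '<script type="text/javascript">var x = 1;</script></head>'
    '<body><div id="header"><div>menu</div></div><!-- advert -->'
    "<p>Gaud&iacute;, Gaud&#237; and Gaud\u00c3\u00ad</p>"
    '<div id="footer">contact</div></body></html>'
)
print(clean(page))
```

The parser re-emits the start tags exactly as written, so the surviving markup stays available for weighting keywords by their HTML context. Misspellings such as "Barcellona" are a separate problem that this sketch does not address.
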
Editor's Notes

  1. Catalonia is not an Anglophone region, so problems with character encoding beyond ASCII 127 must be considered, specifically those caused by accent marks in the names of destinations and tourist attractions.
  2. The first thing after selecting the methodology is to ... The problem is that blogs come from diverse sources and websites do not have homogeneous structures, which makes it impossible to automate the downloading, classification and refinement process intended in this study. Criterion for selecting websites: at least 100 entries about Catalonia.
  3. The first thing after selecting the methodology is to ... The problem is that blogs come from diverse sources and websites do not have homogeneous structures, which makes it impossible to automate the downloading, classification and refinement process intended in this study. Criterion for selecting websites: at least 100 entries about Catalonia.
  4. HTTrack Website Copier (www.httrack.com). The first step in downloading data is to navigate the selected websites manually to identify the initial pages, that is, those containing hyperlinks that lead to the individual blog and OTR pages, and to save their complete URLs. Four kinds of filter then control the download: 1. The level filter limits the link depth: Level 0 downloads only the page indicated by the initial URL, Level 1 downloads that page and all the resources directly linked to it, and so on; 2. The file type filter allows downloading, for example, only HTML files, while the remaining files (multimedia, PDF, etc.) can be viewed only if an Internet connection is available; this is ideal for analysing the textual content of diaries while saving local disk space; 3. The URL filters can act on any part of the URL (protocol, server, domain, subdirectories or folders, filename and file type); and 4. The content filter is the least efficient, because a page must be downloaded before it can be checked for the keyword string, whereas URL filters download only the pages of interest (Figure 2). For example, in the case of TB, a single inclusive folder filter, /Catalonia/, with no level limit is sufficient, because the server stores its files in a hierarchical territorial folder structure. Conversely, in the case of TA, all the files of interest contain the word Catalonia; the pages with hyperlinks leading to OTRs start with Attraction, and the OTR pages themselves start with ShowUserReview. A couple of inclusive filename filters are therefore enough: Attraction*Catalonia and ShowUserReview*Catalonia. The importance of the filters in this case becomes clear when we bear in mind that TA has reached more than 170,000,000 reviews and opinions, and that all its webpages are linked at different levels by hyperlinks.
  5. Geography (CSV file):
       TAcode;VTcode;Destination;Brand
       g187496;c1;Catalonia;unCla
       g187497;430de;Barcelona;Barna
       g494960;402fe;Lloret-de-Mar;cBrav
     Language detection: plain text → HTML As Text.
     Once the CSV files are ready, a batch programme (Marine-Roig, 2013: Annex A3) is run for each website. It goes through all files, extracts internal data such as the date of the diary and the name of the destination, eliminates entries without narrative content (more than 70,000 OTRs in the case of TA), changes the format of the dates to yyyymmdd, creates new territorial directories, and transfers each diary to its destination folder under its articulated name to facilitate future classifications. Finally, the two-character ISO 639-1 language codes are inserted in the file names, after the date (Figure 3).
  6. Eliminating noise. The original HTML format should be preserved so that keywords and key phrases can be weighted according to their potential impact. Site Content Analyzer: frequency, site-wide density, weight.
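
The arrangement step described in note 5 can be sketched as follows. The batch programme itself (Marine-Roig, 2013: Annex A3) is not reproduced in these notes, so this is only an illustration under stated assumptions: the folder order and file-name fields follow slide 9, the CSV rows are the three sample rows of note 5, and the website folder name "TA", the page name, the posting date and the optional field values are hypothetical. `load_geography` and `diary_path` are illustrative names, and `PurePosixPath` is used only to print the result in a platform-independent way.

```python
import csv
import io
from datetime import date
from pathlib import PurePosixPath
from typing import Dict, Optional, Tuple

# Sample rows from note 5.
GEOGRAPHY_CSV = """TAcode;VTcode;Destination;Brand
g187496;c1;Catalonia;unCla
g187497;430de;Barcelona;Barna
g494960;402fe;Lloret-de-Mar;cBrav
"""


def load_geography(text: str, code_column: str) -> Dict[str, Tuple[str, str]]:
    """Map a website-specific geographic code to (brand, destination)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row[code_column]: (row["Brand"], row["Destination"]) for row in reader}


def diary_path(root: str, website: str, geography: Dict[str, Tuple[str, str]],
               code: str, posted: date, lang: str, page_name: str,
               is_from: Optional[str] = None,
               theme: Optional[str] = None) -> PurePosixPath:
    """Build root/website/brand/destination/date_lang_[isFrom]_pageName_[theme].htm."""
    brand, destination = geography[code]
    fields = [posted.strftime("%Y%m%d"), lang]  # yyyymmdd, then ISO 639-1 code
    if is_from:
        fields.append(is_from)
    fields.append(page_name)
    if theme:
        fields.append(theme)
    return PurePosixPath(root, website, brand, destination, "_".join(fields) + ".htm")


ta_geography = load_geography(GEOGRAPHY_CSV, "TAcode")
path = diary_path("root", "TA", ta_geography, "g187497",
                  date(2013, 7, 4), "en", "SagradaFamilia")
print(path)
```

Because the brand, destination, date and language are all encoded in the path, ordinary file utilities can then sort or select diaries by any of them, which is the point made in the conclusions.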