Topic Detection - Identifying relevant topics in tourism reviews
ENTER 2016 Research Track, 5th February 2016
Thomas Menner (a), Wolfram Höpken (a), Matthias Fuchs (b), Maria Lexhagen (b)
(a) Business Informatics Group, University of Applied Sciences Ravensburg-Weingarten, Germany, {name.surname}@hs-weingarten.de
(b) European Tourism Research Institute (ETOUR), Mid-Sweden University, Sweden, {name.surname}@miun.se
Content
• Introduction
• Related work
• Methodology
– Document retrieval, extraction & processing
– Mining (supervised & unsupervised learning)
• Evaluation
• Application
• Summary and outlook
Motivation
• Product reviews as important source of information
– Shared experiences, impressions and opinions of travellers represent a huge pool of potentially useful information for tourism and travel providers
– Product reviews can be extracted and processed automatically using
different information extraction and data mining techniques
• Topic detection
– Next to extracting the sentiment or polarity of reviews, explicit topics mentioned in the review can be extracted
(Fuchs et al., 2014; Höpken et al., 2015)
– Extracted topics can subsequently be used to improve a provider's own services and offerings, depending on whether the topics are mentioned positively or negatively
Objective
• Unsupervised topic detection
– Identifying unknown (and not predefined) topics mentioned within a
review
– New topics, not recognized as relevant quality dimensions of tourism
services so far, can be identified
– Unsupervised topic detection as promising approach to gain new
insights into relevant quality dimensions as well as strengths and
weaknesses of concrete tourism services along those quality dimensions
• Several approaches for unsupervised topic detection are presented and compared with regard to their detection accuracy
• Provide a holistic process for extracting, pre-processing and mining user reviews from rating portals
Related work
• Hu and Liu (2004)
– Approach for extracting features (topics) of products as part of
sentiment analysis
– Since product features are often represented by nouns, their approach
extracts all frequent nouns from the UGC and matches nouns to the
particular opinion words they are rated by
• Ziebarth et al. (2008)
– Approach for topic detection by keyword clustering (clustering of
term-document-matrix based on TF-IDF values)
• Panagiotis et al. (1999)
– Categorize documents by topic using a Latent Semantic Indexing (LSI)
approach through Singular Value Decomposition (SVD)
Document retrieval, extraction & processing
Document retrieval
• Web crawler fetches all HTML pages containing reviews, based on regular expressions
• 124 user reviews, consisting of over 1,200 single review statements, were retrieved (from TripAdvisor for two hotels of the Swedish mountain destination Åre)
Document extraction
• Explicit textual reviews are extracted from the HTML pages, using XPath expressions to select the relevant document parts and regular expressions to clean the review texts of HTML tags
Document processing
• Pre-processing of reviews depending on the applied mining techniques
• Splitting reviews into sentences or single words; tokenisation; filtering of stop-words; stemming; transformation to lower case
• Representation of reviews by a term-document matrix, based on term occurrences, term frequency or TF-IDF values (see the sketch below)
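The following is a minimal Python sketch of the document-processing step (tokenisation, stop-word filtering, stemming, lower-casing and a TF-IDF based term-document matrix), using NLTK and scikit-learn rather than the RapidMiner processes applied in the study; the two reviews are placeholders.

```python
# Minimal sketch of the document-processing step: tokenisation, stop-word
# filtering, stemming, lower-casing and a TF-IDF term-document matrix.
# The two reviews are placeholders for the reviews extracted from TripAdvisor.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

reviews = [
    "The hotel room was clean and the spa was wonderful.",
    "Great skiing, but the pool and the sauna were crowded.",
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # split into single words, lower-case, drop stop-words, stem
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

# term-document matrix based on TF-IDF values
vectorizer = TfidfVectorizer(analyzer=preprocess)
tdm = vectorizer.fit_transform(reviews)  # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(tdm.toarray())
```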
Topic detection approach 1
• Identification of frequent nouns and verbs
– Assumption: nouns and verbs occurring frequently within user reviews
represent the topics discussed in user reviews
• Customers use similar vocabulary (nouns and verbs) to talk about the
same products or services (i.e. topics)
• Thus, nouns and verbs representing topics will occur more frequently than other words
– The approach simply detects nouns and/or verbs based on Part-of-Speech (POS) tagging (making use of the Penn Treebank tag set), and extracts all frequent words above a specific threshold
– The extracted words represent the most important topics
– Nouns, verbs and nouns & verbs have been tested and compared (see the sketch below)
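A minimal Python sketch of this approach, assuming NLTK's Penn Treebank POS tagger, a few placeholder sentences and an arbitrary frequency threshold (the study itself was implemented in RapidMiner Studio):

```python
# Minimal sketch of approach 1: POS-tag the review text and keep the nouns
# and verbs whose frequency exceeds a threshold. Sentences and threshold
# are illustrative assumptions.
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "The hotel room was very clean.",
    "The room had a great view of the ski slopes.",
    "Staff at the hotel were friendly and helpful.",
]

counts = Counter()
for sentence in sentences:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence.lower())):
        # Penn Treebank tags: NN* = nouns, VB* = verbs
        if tag.startswith("NN") or tag.startswith("VB"):
            counts[word] += 1

MIN_FREQ = 2  # assumed threshold
topics = [word for word, freq in counts.items() if freq >= MIN_FREQ]
print(topics)  # frequent nouns/verbs, treated as candidate topics
```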
Topic detection approach 2
• Keyword Clustering
– Starting point: TF-IDF value based term-document-matrix
– Documents are clustered by the k-means clustering algorithm (based
on the cosine similarity as distance measure)
– Words with high TF-IDF values within a cluster then represent words
often co-occurring in reviews and, thus, represent latent topics
– Different settings and parameters were tested
• Clustering with k=40 and k=80
• Clustering on sentences and on whole reviews
• Clustering all words
• Clustering only nouns, only verbs, or nouns and verbs (see the sketch below)
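A minimal scikit-learn sketch of the keyword-clustering idea; note that scikit-learn's KMeans uses Euclidean distance, so the TF-IDF vectors are L2-normalised here to approximate the cosine similarity used as distance measure, and the sentences and k are placeholders:

```python
# Minimal sketch of approach 2: k-means clustering of a TF-IDF
# term-document matrix; the heaviest terms per cluster are read as topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

sentences = [
    "The hotel room was spacious and clean.",
    "Our room at the hotel had a lovely view.",
    "Skiing right outside the door was fantastic.",
    "The ski slopes and lifts are close by.",
    "The spa and the pool were very relaxing.",
    "We enjoyed the sauna after a day on the slopes.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = normalize(vectorizer.fit_transform(sentences))  # L2-normalised TF-IDF

k = 3  # illustrative; the paper tests k=40 and k=80 on the full corpus
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for c in range(k):
    top = km.cluster_centers_[c].argsort()[::-1][:4]  # heaviest terms
    print(f"cluster {c}:", [terms[i] for i in top])
```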
Topic detection approach 3
• Latent Semantic Indexing (LSI)
– Based on Singular Value Decomposition (SVD, a dimension-reduction technique), words that often co-occur are summarised into concepts
• Thus, SVD reduces the variety of words within a text, utilising the fact that there is usually more than one word for describing the same object (e.g. a hotel room)
– A concept then represents the (latent) semantics of all those words, which finally constitutes a latent topic
– Different settings and parameters were tested
• LSI with k=40 and k=80 (k = number of concepts extracted)
• LSI on sentences and on whole reviews
• LSI on all words
• LSI only on nouns, only on verbs, or on nouns and verbs (see the sketch below)
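A minimal sketch of LSI via truncated SVD on a TF-IDF matrix, with placeholder sentences and a small number of concepts (the paper tests k=40 and k=80 on the full review corpus):

```python
# Minimal sketch of approach 3: Latent Semantic Indexing via truncated SVD.
# Each SVD component is read as a concept (latent topic) described by its
# highest-loading terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "The hotel room was spacious and clean.",
    "Our room at the hotel had a lovely view.",
    "Skiing right outside the door was fantastic.",
    "The ski slopes and lifts are close to the hotel.",
    "The spa and the pool were very relaxing.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

k = 2  # number of concepts (illustrative; the paper uses 40 and 80)
svd = TruncatedSVD(n_components=k, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]  # highest-loading terms per concept
    print(f"concept {i}:", [terms[t] for t in top])
```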
Topic detection approach 4
• Named Entity Recognition (NER)
– General approach: extract relevant information from a text in the form of entities such as persons, organisations or locations
– In this study, the original NER approach is modified in a way that only
two pseudo entities are used to declare a specific word as a topic
(entity = Topic) or a non-topic (entity = O)
– A training dataset is prepared with a single data record for each word
• Each single word is classified as topic or non-topic
• Each data record is enriched with linguistic and/or grammatical context, such as the surrounding words of the specific word within the sentence, or its part of speech
Topic detection approach 4 (cont.)
Example of an enriched and pre-classified dataset, based on the sentence "for skiing this is a lovely place" (a minimal sketch of such records follows below).
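A minimal sketch of how such enriched data records could be built for this example sentence, assuming NLTK's POS tagger and a context window of +/- 2 words; the Topic/O labels are illustrative assumptions, not the study's gold standard:

```python
# Minimal sketch of the enriched training records: one record per word with
# its part of speech, the surrounding words (+/- 2) and a Topic/O label.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "for skiing this is a lovely place"
labels = ["O", "Topic", "O", "O", "O", "O", "Topic"]  # assumed gold labels

tokens = nltk.word_tokenize(sentence)
WINDOW = 2  # words before/after (the paper tests 2 to 5)

records = []
for i, (word, pos) in enumerate(nltk.pos_tag(tokens)):
    record = {"word": word, "pos": pos, "label": labels[i]}
    for off in range(1, WINDOW + 1):
        record[f"prev-{off}"] = tokens[i - off] if i - off >= 0 else "<PAD>"
        record[f"next+{off}"] = tokens[i + off] if i + off < len(tokens) else "<PAD>"
    records.append(record)

print(records[1])  # record for "skiing", labelled as a topic
```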
– Based on such pre-classified data records, any classification algorithm can be used to distinguish between topics and non-topics
– The following settings, parameters and algorithms were finally tested (a classification sketch follows after the list)
• 2, 3, 4 and 5 words before and after a specific word
• Naïve Bayes
• K-Nearest-Neighbour with k = 5, 10, 15, 20, 25 and 50
• Support Vector Machines (SVM)
• Sequential Mining based on Conditional Random Fields
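A minimal scikit-learn sketch of the Naïve Bayes variant on such records: DictVectorizer one-hot encodes the word, POS and context features, and BernoulliNB separates "Topic" from "O". The toy training records and labels are purely illustrative; the paper tests the other algorithms listed above on the same data.

```python
# Minimal sketch: classify each word as "Topic" or "O" (non-topic) with
# Naive Bayes on one-hot encoded word/POS/context features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# toy pre-classified records (features as sketched above, window shortened to +/- 1)
train_records = [
    {"word": "skiing", "pos": "NN",  "prev-1": "for",    "next+1": "this"},
    {"word": "lovely", "pos": "JJ",  "prev-1": "a",      "next+1": "place"},
    {"word": "place",  "pos": "NN",  "prev-1": "lovely", "next+1": "<PAD>"},
    {"word": "room",   "pos": "NN",  "prev-1": "the",    "next+1": "was"},
    {"word": "was",    "pos": "VBD", "prev-1": "room",   "next+1": "clean"},
    {"word": "clean",  "pos": "JJ",  "prev-1": "was",    "next+1": "<PAD>"},
]
train_labels = ["Topic", "O", "Topic", "Topic", "O", "O"]

vec = DictVectorizer()
X = vec.fit_transform(train_records)  # one-hot encoding of all feature=value pairs
clf = BernoulliNB().fit(X, train_labels)

test = [{"word": "hotel", "pos": "NN", "prev-1": "the", "next+1": "is"}]
print(clf.predict(vec.transform(test)))  # predicted pseudo entity: "Topic" or "O"
```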
Evaluation

Approach | Accuracy | Recall Topic | Recall No Topic | Precision Topic | Precision No Topic | F1 Topic
Identification of frequent words (nouns only) | 82.86% | 94.20% | 80.14% | 53.19% | 98.29% | 0.6798
Keyword Clustering (nouns only, sentence-based, k=80) | 88.45% | 62.84% | 94.59% | 73.56% | 91.40% | 0.6777
LSI - Latent Semantic Indexing (nouns only, sentence-based, k=80) | 85.46% | 48.95% | 94.21% | 66.96% | 88.51% | 0.5655
NER – Named Entity Recognition (Naïve Bayes, 2 words +/- as context) | 75.17% | 77.94% | 72.39% | 73.84% | 76.65% | 0.7583
Identification of frequent words: even though 94.20% of the manually pre-classified topics could be re-detected, the precision of detecting topics was only 53.19%, with a corresponding F1 measure of 0.6798 (the approach declares a word as a topic far too often, because not every frequent noun is actually a topic).
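As a quick sanity check, the reported F1 value follows directly from topic precision and topic recall:

```python
# F1 (topic) = 2 * precision * recall / (precision + recall)
precision, recall = 0.5319, 0.9420  # frequent-words approach (topic values)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # ~0.6799, consistent with the reported 0.6798 (rounding)
```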
Keyword Clustering: despite the highest accuracy (88.45%) and a high topic precision (73.56%), the topic recall (62.84%) constitutes a clear limitation of the approach (F1 measure 0.6777).
LSI: with a topic recall of 48.95%, LSI could not even re-detect half of the manually pre-classified topic words, which constitutes a severe limitation (leading to the worst F1 measure, 0.5655).
NER: with the second-best topic recall (77.94%) and the best topic precision (73.84%), NER achieves the most balanced result of all tested approaches (leading to the best F1 measure, 0.7583) and can be recommended for topic detection.
Application of topic detection
• Mining UGC by topic detection
– Effective way to get an idea of what customers are talking about
– hotel, room, ski and spa/pool/sauna are the most important and most often mentioned
topics within the analysed online reviews
– hotel and staff seem to be much more important for guests of Copperhill Mountain
Lodge than for those of Holiday Club Åre -> different guest perception based on different
hotel profile (Copperhill Mountain Lodge being a five-star hotel, well known for its
exceptional hotel architecture and design)
Copperhill Mountain Lodge
Identified topic | Total occurrences
hotel | 143
room | 84
ski | 63
spa | 41
staff | 38

Holiday Club Åre
Identified topic | Total occurrences
room | 80
hotel | 62
pool | 44
ski | 39
sauna | 28
Application of topic detection
• Topic detection as one step of overall sentiment analysis
– Consecutive sentiment detection identifies positive or negative
sentiment of review statements (Schmunk/Höpken/Fuchs/Lexhagen, 2014)
• Powerful tool for hoteliers and tourism stakeholders
– Constant transparency about the topics the customers are talking
about within their reviews
– Customers’ opinions and experiences in connection with one’s own hotel
– Insights into the competitors’ businesses because of the public
availability of UGC
• Identify strengths and weaknesses of one’s own business compared to competitors
• Identify direct competitors addressing the same market segment
Summary and outlook
• Summary
– NER (Named Entity Recognition) is the most powerful approach for topic detection (accuracy 75.17%, topic precision 73.84%, topic recall 77.94%, F1 0.7583)
– All data mining processes have been implemented with the data mining tool RapidMiner Studio®
• Outlook
– Additional approaches like Topic Modelling (Blei & Lafferty, 2009) and Dependency Parsing (Wu et al., 2009) (a Topic Modelling sketch follows after this list)
– Feature-based sentiment analysis, combining topic detection with a
consecutive sentiment analysis for each single feature/topic
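As an illustration of the Topic Modelling direction mentioned above, a minimal sketch using scikit-learn's LatentDirichletAllocation on simple word counts; the reviews and the number of topics are placeholders, and this is an outlook item rather than one of the evaluated approaches:

```python
# Minimal sketch of Topic Modelling (LDA) as a possible future approach.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "The hotel room was spacious and the staff were friendly.",
    "Great ski slopes, the lifts were close to the hotel.",
    "The spa, pool and sauna made the stay very relaxing.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]  # most probable words per topic
    print(f"topic {i}:", [terms[t] for t in top])
```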
Editor's Notes
Duration: 20 min (without questions)
Best settings and parameters for each of the four test cases:
Identification of frequent words: identification of frequent nouns
Keyword Clustering: clustering of nouns based on sentences with k=80
LSI: LSI on nouns based on sentences with k=80
NER: Naïve Bayes with 2 words before and after the word to be classified
If we compare the approaches based on accuracy alone, keyword clustering is the best approach, followed by LSI, identification of frequent words and NER.
If we compare precision and recall values and the corresponding F1 measure, NER achieves the best and most balanced result and can be recommended for topic detection.