Topic Detection - Identifying relevant topics in tourism reviews
ENTER 2016 Research Track, 5th February 2016
Thomas Menner (a), Wolfram Höpken (a), Matthias Fuchs (b), Maria Lexhagen (b)
(a) Business Informatics Group, University of Applied Sciences Ravensburg-Weingarten, Germany, {name.surname}@hs-weingarten.de
(b) European Tourism Research Institute (ETOUR), Mid-Sweden University, Sweden, {name.surname}@miun.se
Content
• Introduction
• Related work
• Methodology
– Document retrieval, extraction & processing
– Mining (supervised & unsupervised learning)
• Evaluation
• Application
• Summary and outlook
Motivation
• Product reviews as important source of information
– Shared experiences, impressions and opinions of travellers represent a huge pool of potentially useful information for tourism and travel providers
– Product reviews can be extracted and processed automatically using
different information extraction and data mining techniques
• Topic detection
– Next to extracting the sentiment or polarity of reviews, explicit topics mentioned in the review can be extracted
(Fuchs et al., 2014; Höpken et al., 2015)
– Extracted topics can subsequently be used to improve a provider's own services and offerings, depending on whether the topics are mentioned positively or negatively
Objective
• Unsupervised topic detection
– Identifying unknown (and not predefined) topics mentioned within a
review
– New topics, not recognized as relevant quality dimensions of tourism
services so far, can be identified
– Unsupervised topic detection as promising approach to gain new
insights into relevant quality dimensions as well as strengths and
weaknesses of concrete tourism services along those quality dimensions
• Several approaches for unsupervised topic detection are presented and compared with regard to their detection accuracy
• Provide a holistic process for extracting, pre-processing and mining user reviews from rating portals
Related work
• Hu and Liu (2004)
– Approach for extracting features (topics) of products as part of
sentiment analysis
– Since product features are often represented by nouns, their approach
extracts all frequent nouns from the UGC and matches nouns to the
particular opinion words they are rated by
• Ziebarth et al. (2008)
– Approach for topic detection by keyword clustering (clustering of
term-document-matrix based on TF-IDF values)
• Panagiotis et al. (1999)
– Categorize documents by topic using a Latent Semantic Indexing (LSI)
approach through Singular Value Decomposition (SVD)
Document retrieval, extraction & processing
Document retrieval
• Web crawler fetches all HTML pages containing reviews, based on regular expressions
• 124 user reviews, consisting of over 1,200 single review statements, were retrieved (from TripAdvisor for two hotels of the Swedish mountain destination Åre)
Document extraction
• Explicit textual reviews are extracted from the HTML pages, using XPath expressions to select the relevant document parts and regular expressions to clean the review texts of HTML tags
Document processing
• Pre-processing of reviews depending on the applied mining techniques
• Splitting reviews into sentences or single words; tokenisation; filtering of stop-words; stemming; transformation to lower case
• Representation of reviews by a term-document matrix, based on term occurrences, term frequency or TF-IDF values (see the sketch below)
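The following is a minimal Python sketch of the document-processing step (tokenisation, stop-word filtering, stemming, lower-casing and a TF-IDF based term-document matrix), using NLTK and scikit-learn rather than the RapidMiner processes applied in the study; the two reviews are placeholders.

```python
# Minimal sketch of the document-processing step: tokenisation, stop-word
# filtering, stemming, lower-casing and a TF-IDF term-document matrix.
# The two reviews are placeholders for the reviews extracted from TripAdvisor.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

reviews = [
    "The hotel room was clean and the spa was wonderful.",
    "Great skiing, but the pool and the sauna were crowded.",
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # split into single words, lower-case, drop stop-words, stem
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

# term-document matrix based on TF-IDF values
vectorizer = TfidfVectorizer(analyzer=preprocess)
tdm = vectorizer.fit_transform(reviews)  # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(tdm.toarray())
```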
Topic detection approach 1
• Identification of frequent nouns and verbs
– Assumption: nouns and verbs occurring frequently within user reviews
represent the topics discussed in user reviews
• Customers use similar vocabulary (nouns and verbs) to talk about the
same products or services (i.e. topics)
• Thus, nouns and verbs representing topics will occur more frequently than other words
– The approach simply detects nouns and/or verbs based on Part-of-Speech (POS) tagging (making use of the Penn Treebank tag set), and extracts all frequent words above a specific threshold
– The extracted words represent the most important topics
– Nouns, verbs and nouns & verbs have been tested and compared (see the sketch below)
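A minimal Python sketch of this approach, assuming NLTK's Penn Treebank POS tagger, a few placeholder sentences and an arbitrary frequency threshold (the study itself was implemented in RapidMiner Studio):

```python
# Minimal sketch of approach 1: POS-tag the review text and keep the nouns
# and verbs whose frequency exceeds a threshold. Sentences and threshold
# are illustrative assumptions.
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "The hotel room was very clean.",
    "The room had a great view of the ski slopes.",
    "Staff at the hotel were friendly and helpful.",
]

counts = Counter()
for sentence in sentences:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence.lower())):
        # Penn Treebank tags: NN* = nouns, VB* = verbs
        if tag.startswith("NN") or tag.startswith("VB"):
            counts[word] += 1

MIN_FREQ = 2  # assumed threshold
topics = [word for word, freq in counts.items() if freq >= MIN_FREQ]
print(topics)  # frequent nouns/verbs, treated as candidate topics
```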
Topic detection approach 2
• Keyword Clustering
– Starting point: TF-IDF value based term-document-matrix
– Documents are clustered by the k-means clustering algorithm (based
on the cosine similarity as distance measure)
– Words with high TF-IDF values within a cluster then represent words
often co-occurring in reviews and, thus, represent latent topics
– Different settings and parameters were tested
• Clustering with k=40 and k=80
• Clustering on sentences and on whole reviews
• Clustering all words
• Clustering only nouns, only verbs, or nouns and verbs (see the sketch below)
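A minimal scikit-learn sketch of the keyword-clustering idea; note that scikit-learn's KMeans uses Euclidean distance, so the TF-IDF vectors are L2-normalised here to approximate the cosine similarity used as distance measure, and the sentences and k are placeholders:

```python
# Minimal sketch of approach 2: k-means clustering of a TF-IDF
# term-document matrix; the heaviest terms per cluster are read as topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

sentences = [
    "The hotel room was spacious and clean.",
    "Our room at the hotel had a lovely view.",
    "Skiing right outside the door was fantastic.",
    "The ski slopes and lifts are close by.",
    "The spa and the pool were very relaxing.",
    "We enjoyed the sauna after a day on the slopes.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = normalize(vectorizer.fit_transform(sentences))  # L2-normalised TF-IDF

k = 3  # illustrative; the paper tests k=40 and k=80 on the full corpus
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for c in range(k):
    top = km.cluster_centers_[c].argsort()[::-1][:4]  # heaviest terms
    print(f"cluster {c}:", [terms[i] for i in top])
```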
Topic detection approach 3
• Latent Semantic Indexing (LSI)
– Based on Singular Value Decomposition (SVD, a dimension-reduction technique), words that often co-occur are summarised into concepts
• Thus, SVD reduces the variety of words within a text, utilising the fact that there is usually more than one word for describing the same object (e.g. a hotel room)
– A concept then represents the (latent) semantics of all those words, which finally constitutes a latent topic
– Different settings and parameters were tested
• LSI with k=40 and k=80 (k = number of concepts extracted)
• LSI on sentences and on whole reviews
• LSI on all words
• LSI only on nouns, only on verbs, or on nouns and verbs (see the sketch below)
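A minimal sketch of LSI via truncated SVD on a TF-IDF matrix, with placeholder sentences and a small number of concepts (the paper tests k=40 and k=80 on the full review corpus):

```python
# Minimal sketch of approach 3: Latent Semantic Indexing via truncated SVD.
# Each SVD component is read as a concept (latent topic) described by its
# highest-loading terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "The hotel room was spacious and clean.",
    "Our room at the hotel had a lovely view.",
    "Skiing right outside the door was fantastic.",
    "The ski slopes and lifts are close to the hotel.",
    "The spa and the pool were very relaxing.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

k = 2  # number of concepts (illustrative; the paper uses 40 and 80)
svd = TruncatedSVD(n_components=k, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]  # highest-loading terms per concept
    print(f"concept {i}:", [terms[t] for t in top])
```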
Topic detection approach 4
• Named Entity Recognition (NER)
– General approach: extract relevant information from a text in the form of entities such as persons, organisations or locations
– In this study, the original NER approach is modified in a way that only
two pseudo entities are used to declare a specific word as a topic
(entity = Topic) or a non-topic (entity = O)
– A training dataset is prepared with a single data record for each word
• Each single word is classified as topic or non-topic
• Each data record is enriched with linguistic and/or grammatical context, such as the surrounding words of the specific word within the sentence, or its part of speech
Topic detection approach 4 (cont.)
Example of an enriched and pre-classified dataset, based on the sentence "for skiing this is a lovely place" (a minimal sketch of such records follows below).
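A minimal sketch of how such enriched data records could be built for this example sentence, assuming NLTK's POS tagger and a context window of +/- 2 words; the Topic/O labels are illustrative assumptions, not the study's gold standard:

```python
# Minimal sketch of the enriched training records: one record per word with
# its part of speech, the surrounding words (+/- 2) and a Topic/O label.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "for skiing this is a lovely place"
labels = ["O", "Topic", "O", "O", "O", "O", "Topic"]  # assumed gold labels

tokens = nltk.word_tokenize(sentence)
WINDOW = 2  # words before/after (the paper tests 2 to 5)

records = []
for i, (word, pos) in enumerate(nltk.pos_tag(tokens)):
    record = {"word": word, "pos": pos, "label": labels[i]}
    for off in range(1, WINDOW + 1):
        record[f"prev-{off}"] = tokens[i - off] if i - off >= 0 else "<PAD>"
        record[f"next+{off}"] = tokens[i + off] if i + off < len(tokens) else "<PAD>"
    records.append(record)

print(records[1])  # record for "skiing", labelled as a topic
```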
– Based on such pre-classified data records, any classification algorithm can be used to distinguish between topics and non-topics
– The following settings, parameters and algorithms were finally tested (a classification sketch follows after the list)
• 2, 3, 4 and 5 words before and after a specific word
• Naïve Bayes
• K-Nearest-Neighbour with k = 5, 10, 15, 20, 25 and 50
• Support Vector Machines (SVM)
• Sequential Mining based on Conditional Random Fields
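A minimal scikit-learn sketch of the Naïve Bayes variant on such records: DictVectorizer one-hot encodes the word, POS and context features, and BernoulliNB separates "Topic" from "O". The toy training records and labels are purely illustrative; the paper tests the other algorithms listed above on the same data.

```python
# Minimal sketch: classify each word as "Topic" or "O" (non-topic) with
# Naive Bayes on one-hot encoded word/POS/context features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# toy pre-classified records (features as sketched above, window shortened to +/- 1)
train_records = [
    {"word": "skiing", "pos": "NN",  "prev-1": "for",    "next+1": "this"},
    {"word": "lovely", "pos": "JJ",  "prev-1": "a",      "next+1": "place"},
    {"word": "place",  "pos": "NN",  "prev-1": "lovely", "next+1": "<PAD>"},
    {"word": "room",   "pos": "NN",  "prev-1": "the",    "next+1": "was"},
    {"word": "was",    "pos": "VBD", "prev-1": "room",   "next+1": "clean"},
    {"word": "clean",  "pos": "JJ",  "prev-1": "was",    "next+1": "<PAD>"},
]
train_labels = ["Topic", "O", "Topic", "Topic", "O", "O"]

vec = DictVectorizer()
X = vec.fit_transform(train_records)  # one-hot encoding of all feature=value pairs
clf = BernoulliNB().fit(X, train_labels)

test = [{"word": "hotel", "pos": "NN", "prev-1": "the", "next+1": "is"}]
print(clf.predict(vec.transform(test)))  # predicted pseudo entity: "Topic" or "O"
```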
Evaluation

Approach | Accuracy | Recall Topic | Recall No Topic | Precision Topic | Precision No Topic | F1 Topic
Identification of frequent words (nouns only) | 82.86% | 94.20% | 80.14% | 53.19% | 98.29% | 0.6798
Keyword Clustering (nouns only, sentence-based, k=80) | 88.45% | 62.84% | 94.59% | 73.56% | 91.40% | 0.6777
LSI - Latent Semantic Indexing (nouns only, sentence-based, k=80) | 85.46% | 48.95% | 94.21% | 66.96% | 88.51% | 0.5655
NER – Named Entity Recognition (Naïve Bayes, 2 words +/- as context) | 75.17% | 77.94% | 72.39% | 73.84% | 76.65% | 0.7583
Identification of frequent words: even though 94.20% of the manually pre-classified topics could be re-detected, the precision of detecting topics was only 53.19%, with a corresponding F1 measure of 0.6798 (the approach declares a word as a topic far too often, because not every frequent noun is actually a topic).
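As a quick sanity check, the reported F1 value follows directly from topic precision and topic recall:

```python
# F1 (topic) = 2 * precision * recall / (precision + recall)
precision, recall = 0.5319, 0.9420  # frequent-words approach (topic values)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # ~0.6799, consistent with the reported 0.6798 (rounding)
```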
Keyword Clustering: despite the highest accuracy (88.45%) and a high topic precision (73.56%), the topic recall (62.84%) constitutes a clear limitation of the approach (F1 measure 0.6777).
LSI: with a topic recall of 48.95%, LSI could not even re-detect half of the manually pre-classified topic words, which constitutes a severe limitation (leading to the worst F1 measure, 0.5655).
NER: with the second-best topic recall (77.94%) and the best topic precision (73.84%), NER achieves the most balanced result of all tested approaches (leading to the best F1 measure, 0.7583) and can be recommended for topic detection.
Application of topic detection
• Mining UGC by topic detection
– Effective way to get an idea of what customers are talking about
– hotel, room, ski and spa/pool/sauna are the most important and most often mentioned
topics within the analysed online reviews
– hotel and staff seem to be much more important for guests of Copperhill Mountain
Lodge than for those of Holiday Club Åre -> different guest perception based on different
hotel profile (Copperhill Mountain Lodge being a five-star hotel, well known for its
exceptional hotel architecture and design)
Copperhill Mountain Lodge
Identified topic | Total occurrences
hotel | 143
room | 84
ski | 63
spa | 41
staff | 38

Holiday Club Åre
Identified topic | Total occurrences
room | 80
hotel | 62
pool | 44
ski | 39
sauna | 28
Application of topic detection
• Topic detection as one step of overall sentiment analysis
– Consecutive sentiment detection identifies positive or negative
sentiment of review statements (Schmunk/Höpken/Fuchs/Lexhagen, 2014)
• Powerful tool for hoteliers and tourism stakeholders
– Constant transparency about the topics the customers are talking
about within their reviews
– Customers’ opinions and experiences in connection with one’s own hotel
– Insights into the competitors’ businesses because of the public
availability of UGC
• Identify strengths and weaknesses of one’s own business compared to competitors
• Identify direct competitors addressing the same market segment
Summary and outlook
• Summary
– NER (Named Entity Recognition) is the most powerful approach for topic detection (accuracy 75.17%, topic precision 73.84%, topic recall 77.94%, F1 0.7583)
– All data mining processes have been implemented with the data mining tool RapidMiner Studio®
• Outlook
– Additional approaches like Topic Modelling (Blei & Lafferty, 2009) and Dependency Parsing (Wu et al., 2009) (a Topic Modelling sketch follows after this list)
– Feature-based sentiment analysis, combining topic detection with a
consecutive sentiment analysis for each single feature/topic
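As an illustration of the Topic Modelling direction mentioned above, a minimal sketch using scikit-learn's LatentDirichletAllocation on simple word counts; the reviews and the number of topics are placeholders, and this is an outlook item rather than one of the evaluated approaches:

```python
# Minimal sketch of Topic Modelling (LDA) as a possible future approach.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "The hotel room was spacious and the staff were friendly.",
    "Great ski slopes, the lifts were close to the hotel.",
    "The spa, pool and sauna made the stay very relaxing.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]  # most probable words per topic
    print(f"topic {i}:", [terms[t] for t in top])
```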
Editor's Notes
Duration: 20 min (without questions)
Best settings and parameters for each of the four test cases:
Identification of frequent words: identification of frequent nouns
Keyword Clustering: clustering of nouns based on sentences with k=80
LSI: LSI on nouns based on sentences with k=80
NER: Naïve Bayes with 2 words before and after the word to be classified
If we compare the approaches based on accuracy alone, keyword clustering is the best approach, followed by LSI, identification of frequent words and NER.
If we compare precision and recall values and the corresponding F1 measure, NER achieves the best and most balanced result and can be recommended for topic detection.