ENTER 2016 Research Track, 5th February 2016
Thomas Menner (a), Wolfram Höpken (a), Matthias Fuchs (b), Maria Lexhagen (b)
(a) Business Informatics Group
University of Applied Sciences Ravensburg-Weingarten, Germany
{name.surname}@hs-weingarten.de
(b) European Tourism Research Institute (ETOUR)
Mid-Sweden University, Sweden
{name.surname}@miun.se
Topic Detection - Identifying relevant topics in tourism reviews
Content
• Introduction
• Related work
• Methodology
– Document retrieval, extraction & processing
– Mining (supervised & unsupervised learning)
• Evaluation
• Application
• Summary and outlook
Motivation
• Product reviews as an important source of information
– Shared experiences, impressions and opinions of travellers represent a
huge pool of potentially useful information for tourism and travel
providers
– Product reviews can be extracted and processed automatically using
different information extraction and data mining techniques
• Topic detection
– In addition to extracting the sentiment or polarity of reviews, explicit
topics mentioned in a review can be extracted
(Fuchs et al., 2014; Höpken et al., 2015)
– Extracted topics can subsequently be used to improve a provider's own
services and offerings, depending on whether the topics are mentioned
positively or negatively
Objective
• Unsupervised topic detection
– Identifying unknown (not predefined) topics mentioned within a
review
– New topics, not yet recognized as relevant quality dimensions of tourism
services, can be identified
– Unsupervised topic detection is a promising approach to gain new
insights into relevant quality dimensions as well as the strengths and
weaknesses of concrete tourism services along those quality dimensions
• Several approaches for unsupervised topic detection are
presented and compared with respect to their detection accuracy
• Provide a holistic process for extracting, pre-processing and
mining user reviews from rating portals
Related work
• Hu and Liu (2004)
– Approach for extracting features (topics) of products as part of
sentiment analysis
– Since product features are often represented by nouns, their approach
extracts all frequent nouns from the UGC and matches nouns to the
particular opinion words they are rated by
• Ziebarth et al. (2008)
– Approach for topic detection by keyword clustering (clustering of
term-document-matrix based on TF-IDF values)
• Panagiotis et al. (1999)
– Categorize documents by topic using a Latent Semantic Indexing (LSI)
approach based on Singular Value Decomposition (SVD)
Document retrieval, extraction & processing
Document retrieval
• A web crawler fetches all HTML pages containing reviews, based on regular expressions
• 124 user reviews, consisting of over 1,200 single review statements, were retrieved from
TripAdvisor for two hotels in the Swedish mountain destination Åre
Document extraction
• The textual reviews are extracted from the HTML pages using XPath expressions to select the
relevant document parts and regular expressions to strip HTML tags from the review texts
Document processing
• Pre-processing of reviews depends on the mining technique applied
• Splitting reviews into sentences or single words; tokenisation; filtering of stop words;
stemming; conversion to lower case
• Representation of reviews as a term-document matrix, based on term occurrences, term
frequency or TF-IDF values
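The processing steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the RapidMiner process used in the study: the stop-word list, the function names and the sample review are assumptions, stemming is left out (it would need a stemmer such as Porter's), and TF-IDF is computed in its simplest form (relative term frequency times the log of the inverse document frequency).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "was", "were", "and", "of", "to", "in", "very"}

def split_sentences(review):
    """Split a review into sentences on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]

def tokenize(sentence):
    """Lower-case the text, keep alphabetic tokens and drop stop words."""
    tokens = re.findall(r"[a-zåäö]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in document i."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for doc in documents for t in set(doc))
    rows = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1
        rows.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, rows

# Made-up review text, treated sentence by sentence.
review = "The room was very clean. The spa and the pool were great! Great room."
sentences = split_sentences(review)
documents = [tokenize(s) for s in sentences]
vocabulary, rows = tfidf_matrix(documents)
```

Each sentence becomes one row of the matrix here, which corresponds to the sentence-based setting; passing whole reviews as documents instead would give the review-based setting.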
Topic detection approach 1
• Identification of frequent nouns and verbs
– Assumption: nouns and verbs occurring frequently within user reviews
represent the topics discussed in those reviews
• Customers use similar vocabulary (nouns and verbs) to talk about the
same products or services (i.e. topics)
• Thus, nouns and verbs representing topics will occur more frequently than
other words
– The approach simply detects nouns and/or verbs based on Part-of-Speech
(POS) tagging (making use of the Penn Treebank) and extracts all
words whose frequency exceeds a specific threshold
– Extracted words represent the most important topics
– Nouns only, verbs only, and nouns & verbs combined were tested and compared
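A minimal sketch of this frequency-based approach is given below. It assumes the tokens have already been POS-tagged with Penn Treebank style tags (NN, NNS, VBD, ...); the function name, the threshold and the hand-tagged toy input are hypothetical and not taken from the study.

```python
from collections import Counter

def frequent_words(tagged_tokens, tag_prefixes=("NN",), min_count=2):
    """Return words with a matching POS tag that occur at least min_count times."""
    counts = Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag.startswith(tag_prefixes)
    )
    return {word: n for word, n in counts.items() if n >= min_count}

# Hand-tagged toy input (hypothetical; a real run would use a POS tagger's output).
tagged = [
    ("room", "NN"), ("was", "VBD"), ("clean", "JJ"),
    ("rooms", "NNS"), ("were", "VBD"), ("small", "JJ"),
    ("room", "NN"), ("had", "VBD"), ("view", "NN"),
    ("staff", "NN"), ("was", "VBD"), ("friendly", "JJ"),
    ("staff", "NN"), ("helped", "VBD"),
]

noun_topics = frequent_words(tagged, ("NN",), min_count=2)
noun_verb_topics = frequent_words(tagged, ("NN", "VB"), min_count=2)
```

In this toy input the nouns-only setting yields "room" and "staff", while adding verbs also admits "was". Without stemming, "rooms" is counted separately from "room".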
Topic detection approach 2
• Keyword clustering
– Starting point: term-document matrix based on TF-IDF values
– Documents are clustered with the k-means algorithm (using cosine
similarity as the similarity measure)
– Words with high TF-IDF values within a cluster are words that
often co-occur in reviews and thus represent latent topics
– Different settings and parameters were tested
• Clustering with k=40 and k=80
• Clustering on sentences and on whole reviews
• Clustering all words
• Clustering only nouns, only verbs, or nouns and verbs
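The clustering idea can be illustrated with a small pure-Python k-means that uses cosine similarity. This is a sketch under stated assumptions: the weights are made-up TF-IDF-like values, the initial centroids are chosen by hand through the hypothetical `seeds` parameter, and a cluster's keywords are read from the terms with the highest weight in its centroid. The study itself used RapidMiner with k=40 and k=80.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))

def kmeans_cosine(rows, seeds, iterations=20):
    """Cluster rows by cosine similarity; seeds are the indices of the initial centroids."""
    points = [normalize(r) for r in rows]
    centroids = [points[i] for i in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda c: cosine(p, centroids[c]))
            for p in points
        ]
        # Update step: the new centroid is the normalized mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids

def top_terms(centroid, vocabulary, n=2):
    """Terms with the highest weight in a cluster centroid."""
    ranked = sorted(zip(centroid, vocabulary), reverse=True)
    return [term for weight, term in ranked[:n]]

# Made-up TF-IDF-like weights: one row per sentence, one column per term.
vocabulary = ["pool", "sauna", "ski", "slope"]
rows = [
    [0.9, 0.7, 0.0, 0.0],
    [0.8, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
]
labels, centroids = kmeans_cosine(rows, seeds=[0, 2])
```

With these toy rows the first two sentences end up in one cluster, whose top terms are "pool" and "sauna", and the last two in another, whose top terms are "ski" and "slope".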
Topic detection approach 3
• Latent Semantic Indexing (LSI)
– Based on Singular Value Decomposition (SVD, a dimension reduction
technique), words that often co-occur are combined into concepts
• Thus, SVD reduces the variety of words within a text, utilising the fact that
there is usually more than one word for describing the same object
(e.g. a hotel room)
– A concept then represents the (latent) semantics of all those words
and thus, finally, a latent topic
– Different settings and parameters were tested
• LSI with k=40 and k=80 (k = number of concepts extracted)
• LSI on sentences and on whole reviews
• LSI on all words
• LSI only on nouns, only on verbs, or on nouns and verbs
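For reference, the truncated SVD behind LSI can be written as follows. The notation is an assumption, since the slides give no formula: A is the term-document matrix with m terms and n documents, and k is the number of concepts kept (40 or 80 in the settings above).

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
A \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{n \times k},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0
```

Each column of U_k is one concept, expressed as a weighted combination of terms. Terms with large weights in the same column tend to co-occur and can be read as one latent topic.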
Topic detection approach 4
• Named Entity Recognition (NER)
– General approach: extract relevant information from a text in the form of
entities such as persons, organisations or locations
– In this study, the original NER approach is modified so that only
two pseudo-entities are used to mark a specific word as a topic
(entity = Topic) or a non-topic (entity = O)
– The training dataset is prepared with a single data record for each word
• Each single word is classified as topic or non-topic
• Each data record is enriched with linguistic and/or grammatical information,
such as the surrounding words of a specific word within a sentence, or the
part of speech of that word
Example sentence for an enriched and pre-classified dataset: "for skiing this is a lovely place."
– Based on such pre-classified data records, any classification algorithm can be
used to distinguish between topics and non-topics
– The following settings, parameters and algorithms were tested
• 2, 3, 4 and 5 words before and after a specific word
• Naïve Bayes
• K-Nearest-Neighbour with k = 5, 10, 15, 20, 25 and 50
• Support Vector Machines (SVM)
• Sequential mining based on Conditional Random Fields
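How such enriched, pre-classified records might look can be sketched as follows. The record layout, the field names (`w-1`, `w+1`, ...) and the padding token are assumptions, and the POS tags and Topic/O labels for the example sentence are assigned by hand purely for illustration; they are not taken from the original dataset.

```python
def context_records(tagged_sentence, labels, window=2, pad="<PAD>"):
    """Build one record per word: the word, its POS tag, its neighbours and its label."""
    words = [word for word, _ in tagged_sentence]
    records = []
    for i, (word, tag) in enumerate(tagged_sentence):
        record = {"word": word, "pos": tag, "label": labels[i]}
        for offset in range(1, window + 1):
            before = i - offset
            after = i + offset
            # Positions outside the sentence are filled with a padding token.
            record[f"w-{offset}"] = words[before] if before >= 0 else pad
            record[f"w+{offset}"] = words[after] if after < len(words) else pad
        records.append(record)
    return records

# Example sentence from the slide; tags and labels are hand-assigned assumptions.
sentence = [
    ("for", "IN"), ("skiing", "NN"), ("this", "DT"), ("is", "VBZ"),
    ("a", "DT"), ("lovely", "JJ"), ("place", "NN"),
]
labels = ["O", "Topic", "O", "O", "O", "O", "O"]
records = context_records(sentence, labels, window=2)
```

Each record can then be passed to any classifier that handles categorical features. The `window` argument corresponds to the 2, 3, 4 or 5 words of context tested above.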
Evaluation
Approach | Accuracy | Recall (Topic) | Recall (No Topic) | Precision (Topic) | Precision (No Topic) | F1 (Topic)
Identification of frequent words (nouns only) | 82.86% | 94.20% | 80.14% | 53.19% | 98.29% | 0.6798
Keyword Clustering (nouns only, sentence-based, k=80) | 88.45% | 62.84% | 94.59% | 73.56% | 91.40% | 0.6777
LSI - Latent Semantic Indexing (nouns only, sentence-based, k=80) | 85.46% | 48.95% | 94.21% | 66.96% | 88.51% | 0.5655
NER - Named Entity Recognition (Naïve Bayes, 2 words +/- as context) | 75.17% | 77.94% | 72.39% | 73.84% | 76.65% | 0.7583
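The F1 values for the Topic class follow from the reported topic precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the table; the function and variable names are illustrative, and small deviations in the last digit are expected because the inputs are already rounded percentages.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Topic precision and topic recall as reported in the evaluation table.
reported = {
    "Frequent words": (0.5319, 0.9420),
    "Keyword clustering": (0.7356, 0.6284),
    "LSI": (0.6696, 0.4895),
    "NER": (0.7384, 0.7794),
}
f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}
best = max(f1, key=f1.get)
```

The harmonic mean penalises an imbalance between precision and recall, which is why the frequent-words approach scores lower than NER despite its 94.20% topic recall.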
Identification of frequent words: even though 94.20% of the manually pre-classified topics could be
redetected, the precision of detecting topics was only 53.19% and
the corresponding F1 measure 0.6798
(the approach declares a word a topic far too often because
not every frequent noun is a topic)
Keyword clustering: despite the highest accuracy (88.45%) and a high topic
precision (73.56%), the topic recall (62.84%) constitutes a clear
limitation of the approach (F1 measure 0.6777)
LSI: with a topic recall of 48.95%, LSI could not even redetect half
of the manually pre-classified topic words, which constitutes a severe
limitation (leading to the worst F1 measure of 0.5655)
NER: with the second-best topic recall (77.94%) and the best topic
precision (73.84%), NER reaches the most balanced result of all
tested approaches (leading to the best F1 measure of 0.7583)
and can be recommended for topic detection
Ranking by F1 (Topic): 1. NER (0.7583), 2. Identification of frequent words (0.6798), 3. Keyword Clustering (0.6777), 4. LSI (0.5655)
Application of topic detection
• Mining UGC by topic detection
– Effective way to get an idea of what customers are talking about
– hotel, room, ski and spa/pool/sauna are the most important and most often mentioned
topics within the analysed online reviews
– hotel and staff seem to be much more important for guests of Copperhill Mountain
Lodge than for those of Holiday Club Åre, suggesting a different guest perception based on the different
hotel profile (Copperhill Mountain Lodge is a five-star hotel, well known for its
exceptional architecture and design)
Copperhill Mountain Lodge
Identified topic | Total occurrences
hotel | 143
room | 84
ski | 63
spa | 41
staff | 38
Holiday Club Åre
Identified topic | Total occurrences
room | 80
hotel | 62
pool | 44
ski | 39
sauna | 28
• Topic detection as one step of an overall sentiment analysis
– Subsequent sentiment detection identifies the positive or negative
sentiment of review statements (Schmunk/Höpken/Fuchs/Lexhagen, 2014)
• Powerful tool for hoteliers and tourism stakeholders
– Constant transparency about the topics customers are talking
about in their reviews
– Customers' opinions and experiences in connection with one's own hotel
– Insights into competitors' businesses because of the public
availability of UGC
• Identify strengths and weaknesses of one's own business compared to
competitors
• Identify direct competitors addressing the same market segment
Summary and outlook
• Summary
– NER (Named Entity Recognition) is the most powerful approach for topic
detection (accuracy 75.17%, topic precision 73.84%, topic recall 77.94%, F1
0.7583)
– All data mining processes were implemented with the data mining
tool RapidMiner Studio®
• Outlook
– Additional approaches such as Topic Modelling (Blei & Lafferty, 2009) and
Dependency Parsing (Wu et al., 2009)
– Feature-based sentiment analysis, combining topic detection with a
subsequent sentiment analysis for each single feature/topic

More Related Content

Viewers also liked

Viewers also liked (20)

Why are there more hotels in Tyrol than in Austria. Analyzing schema.org usag...
Why are there more hotels in Tyrol than in Austria. Analyzing schema.org usag...Why are there more hotels in Tyrol than in Austria. Analyzing schema.org usag...
Why are there more hotels in Tyrol than in Austria. Analyzing schema.org usag...
 
Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...
 
User Generated Video Reviews by Hotel Guests
User Generated Video Reviews by Hotel GuestsUser Generated Video Reviews by Hotel Guests
User Generated Video Reviews by Hotel Guests
 
Correlating languages and sentiment analysis on the basis of text-based reviews
Correlating languages and sentiment analysis on the basis of text-based reviewsCorrelating languages and sentiment analysis on the basis of text-based reviews
Correlating languages and sentiment analysis on the basis of text-based reviews
 
Trends in e tourism research
Trends in e tourism researchTrends in e tourism research
Trends in e tourism research
 
Exploring daily deals as a distribution channel and a potential driver of hot...
Exploring daily deals as a distribution channel and a potential driver of hot...Exploring daily deals as a distribution channel and a potential driver of hot...
Exploring daily deals as a distribution channel and a potential driver of hot...
 
Digital Tourist Gaze and Mega Events
Digital Tourist Gaze and Mega EventsDigital Tourist Gaze and Mega Events
Digital Tourist Gaze and Mega Events
 
DataTourism: designing an architecture to process tourism data
DataTourism: designing an architecture to process tourism dataDataTourism: designing an architecture to process tourism data
DataTourism: designing an architecture to process tourism data
 
Making Spain a Smart Destination
Making Spain a Smart DestinationMaking Spain a Smart Destination
Making Spain a Smart Destination
 
Marketing the smart destination
Marketing the smart destinationMarketing the smart destination
Marketing the smart destination
 
Expanding typologies of tourists spatio-temporal activities using the sequenc...
Expanding typologies of tourists spatio-temporal activities using the sequenc...Expanding typologies of tourists spatio-temporal activities using the sequenc...
Expanding typologies of tourists spatio-temporal activities using the sequenc...
 
The impact of sharing economy on the diversification of tourism products imp...
The impact of sharing economy on the diversification of tourism products  imp...The impact of sharing economy on the diversification of tourism products  imp...
The impact of sharing economy on the diversification of tourism products imp...
 
From Information Technology to Mobile Information Technology: Applications in...
From Information Technology to Mobile Information Technology: Applications in...From Information Technology to Mobile Information Technology: Applications in...
From Information Technology to Mobile Information Technology: Applications in...
 
Forecasting the final penetration rate of online travel agencies in different...
Forecasting the final penetration rate of online travel agencies in different...Forecasting the final penetration rate of online travel agencies in different...
Forecasting the final penetration rate of online travel agencies in different...
 
Gender and Instagram hashtags: A study of #Malaysianfood
Gender and Instagram hashtags: A study of #MalaysianfoodGender and Instagram hashtags: A study of #Malaysianfood
Gender and Instagram hashtags: A study of #Malaysianfood
 
The Value of augmented Reality from a Business Model perspective
The Value of augmented Reality from a Business Model perspectiveThe Value of augmented Reality from a Business Model perspective
The Value of augmented Reality from a Business Model perspective
 
The role of information quality, visual appeal and information facilitation i...
The role of information quality, visual appeal and information facilitation i...The role of information quality, visual appeal and information facilitation i...
The role of information quality, visual appeal and information facilitation i...
 
Value co-creation and co-destruction in connected tourist experiences
Value co-creation and co-destruction in connected tourist experiencesValue co-creation and co-destruction in connected tourist experiences
Value co-creation and co-destruction in connected tourist experiences
 
Alpine tourists' willingness to engage in virtual co-creation of experiences
Alpine tourists' willingness to engage in virtual co-creation of experiencesAlpine tourists' willingness to engage in virtual co-creation of experiences
Alpine tourists' willingness to engage in virtual co-creation of experiences
 
Chinese Adoption of Travel Information on Social Media: Moderating Effects of...
Chinese Adoption of Travel Information on Social Media: Moderating Effects of...Chinese Adoption of Travel Information on Social Media: Moderating Effects of...
Chinese Adoption of Travel Information on Social Media: Moderating Effects of...
 

Similar to Topic Detection - Identifying relevant topics in tourism reviews

Similar to Topic Detection - Identifying relevant topics in tourism reviews (20)

How to read research
How to read researchHow to read research
How to read research
 
Fundamentals of Business Research & Report
Fundamentals of Business Research & ReportFundamentals of Business Research & Report
Fundamentals of Business Research & Report
 
Details of the research idea
Details of the research ideaDetails of the research idea
Details of the research idea
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
 
Final 23.3.12 cs3 mod 3 review of analysis and learning 3760
  Final 23.3.12 cs3  mod 3 review of analysis and learning 3760  Final 23.3.12 cs3  mod 3 review of analysis and learning 3760
Final 23.3.12 cs3 mod 3 review of analysis and learning 3760
 
Methodology in scientific writing
Methodology in scientific writingMethodology in scientific writing
Methodology in scientific writing
 
Making Sense of It All: Analyzing Qualitative Data
Making Sense of It All: Analyzing Qualitative DataMaking Sense of It All: Analyzing Qualitative Data
Making Sense of It All: Analyzing Qualitative Data
 
Analyzing observational data during qualitative research
Analyzing observational data during qualitative researchAnalyzing observational data during qualitative research
Analyzing observational data during qualitative research
 
The difference between Method and Methodology
The difference between Method and MethodologyThe difference between Method and Methodology
The difference between Method and Methodology
 
Systematic review international conference slides
Systematic review   international conference slidesSystematic review   international conference slides
Systematic review international conference slides
 
Conducting a Literature Search & Writing Review Paper, Part 1: Systematic Rev...
Conducting a Literature Search & Writing Review Paper, Part 1: Systematic Rev...Conducting a Literature Search & Writing Review Paper, Part 1: Systematic Rev...
Conducting a Literature Search & Writing Review Paper, Part 1: Systematic Rev...
 
1. introduction to research methods
1. introduction to research methods1. introduction to research methods
1. introduction to research methods
 
Action research in Applied Linguistics: An Introduction
Action research in Applied Linguistics: An IntroductionAction research in Applied Linguistics: An Introduction
Action research in Applied Linguistics: An Introduction
 
N vivo tutorial 2020
N vivo tutorial 2020N vivo tutorial 2020
N vivo tutorial 2020
 
Sample Lecture
Sample LectureSample Lecture
Sample Lecture
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
 
Presentation for workshop diu - december 28, 2014
Presentation for workshop   diu - december 28, 2014Presentation for workshop   diu - december 28, 2014
Presentation for workshop diu - december 28, 2014
 
Research process and proposal writing
Research process and proposal writingResearch process and proposal writing
Research process and proposal writing
 
Research seminar lecture_10_analysing_qualitative_data
Research seminar lecture_10_analysing_qualitative_dataResearch seminar lecture_10_analysing_qualitative_data
Research seminar lecture_10_analysing_qualitative_data
 
Week 9 Qualitative Data Analysis 2022.pptx
Week 9 Qualitative Data Analysis 2022.pptxWeek 9 Qualitative Data Analysis 2022.pptx
Week 9 Qualitative Data Analysis 2022.pptx
 

Recently uploaded

🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...
🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...
🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...
Apsara Of India
 
Sample sample sample sample sample sample
Sample sample sample sample sample sampleSample sample sample sample sample sample
Sample sample sample sample sample sample
Casey Keith
 
sample sample sample sample sample sample
sample sample sample sample sample samplesample sample sample sample sample sample
sample sample sample sample sample sample
Casey Keith
 
Visa Consultant in Lahore || 📞03094429236
Visa Consultant in Lahore || 📞03094429236Visa Consultant in Lahore || 📞03094429236
Visa Consultant in Lahore || 📞03094429236
Sherazi Tours
 
Sample sample sample sample sample sample
Sample sample sample sample sample sampleSample sample sample sample sample sample
Sample sample sample sample sample sample
Casey Keith
 

Recently uploaded (20)

WhatsApp Chat: 📞 8617697112 Hire Call Girls Cooch Behar For a Sensual Sex Exp...
WhatsApp Chat: 📞 8617697112 Hire Call Girls Cooch Behar For a Sensual Sex Exp...WhatsApp Chat: 📞 8617697112 Hire Call Girls Cooch Behar For a Sensual Sex Exp...
WhatsApp Chat: 📞 8617697112 Hire Call Girls Cooch Behar For a Sensual Sex Exp...
 
Papi kondalu Call Girls 8250077686 Service Offer VIP Hot Model
Papi kondalu Call Girls 8250077686 Service Offer VIP Hot ModelPapi kondalu Call Girls 8250077686 Service Offer VIP Hot Model
Papi kondalu Call Girls 8250077686 Service Offer VIP Hot Model
 
🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...
🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...
🔥HOT🔥📲9602870969🔥Prostitute Service in Udaipur Call Girls in City Palace Lake...
 
Genuine 8250077686 Hot and Beautiful 💕 Visakhapatnam Escorts call Girls
Genuine 8250077686 Hot and Beautiful 💕 Visakhapatnam Escorts call GirlsGenuine 8250077686 Hot and Beautiful 💕 Visakhapatnam Escorts call Girls
Genuine 8250077686 Hot and Beautiful 💕 Visakhapatnam Escorts call Girls
 
Genuine 8250077686 Hot and Beautiful 💕 Hosur Escorts call Girls
Genuine 8250077686 Hot and Beautiful 💕 Hosur Escorts call GirlsGenuine 8250077686 Hot and Beautiful 💕 Hosur Escorts call Girls
Genuine 8250077686 Hot and Beautiful 💕 Hosur Escorts call Girls
 
Jhargram call girls 📞 8617697112 At Low Cost Cash Payment Booking
Jhargram call girls 📞 8617697112 At Low Cost Cash Payment BookingJhargram call girls 📞 8617697112 At Low Cost Cash Payment Booking
Jhargram call girls 📞 8617697112 At Low Cost Cash Payment Booking
 
Genuine 9332606886 Hot and Beautiful 💕 Pune Escorts call Girls
Genuine 9332606886 Hot and Beautiful 💕 Pune Escorts call GirlsGenuine 9332606886 Hot and Beautiful 💕 Pune Escorts call Girls
Genuine 9332606886 Hot and Beautiful 💕 Pune Escorts call Girls
 
Alipore Call Girls - 📞 8617697112 🔝 Top Class Call Girls Service Available
Alipore Call Girls - 📞 8617697112 🔝 Top Class Call Girls Service AvailableAlipore Call Girls - 📞 8617697112 🔝 Top Class Call Girls Service Available
Alipore Call Girls - 📞 8617697112 🔝 Top Class Call Girls Service Available
 
Sample sample sample sample sample sample
Sample sample sample sample sample sampleSample sample sample sample sample sample
Sample sample sample sample sample sample
 
WhatsApp Chat: 📞 8617697112 Suri Call Girls available for hotel room package
WhatsApp Chat: 📞 8617697112 Suri Call Girls available for hotel room packageWhatsApp Chat: 📞 8617697112 Suri Call Girls available for hotel room package
WhatsApp Chat: 📞 8617697112 Suri Call Girls available for hotel room package
 
High Profile 🔝 8250077686 📞 Call Girls Service in Siri Fort🍑
High Profile 🔝 8250077686 📞 Call Girls Service in Siri Fort🍑High Profile 🔝 8250077686 📞 Call Girls Service in Siri Fort🍑
High Profile 🔝 8250077686 📞 Call Girls Service in Siri Fort🍑
 
Genuine 8250077686 Hot and Beautiful 💕 Diu Escorts call Girls
Genuine 8250077686 Hot and Beautiful 💕 Diu Escorts call GirlsGenuine 8250077686 Hot and Beautiful 💕 Diu Escorts call Girls
Genuine 8250077686 Hot and Beautiful 💕 Diu Escorts call Girls
 
Ooty call girls 📞 8617697112 At Low Cost Cash Payment Booking
Ooty call girls 📞 8617697112 At Low Cost Cash Payment BookingOoty call girls 📞 8617697112 At Low Cost Cash Payment Booking
Ooty call girls 📞 8617697112 At Low Cost Cash Payment Booking
 
sample sample sample sample sample sample
sample sample sample sample sample samplesample sample sample sample sample sample
sample sample sample sample sample sample
 
2k Shots ≽ 9205541914 ≼ Call Girls In Tagore Garden (Delhi)
2k Shots ≽ 9205541914 ≼ Call Girls In Tagore Garden (Delhi)2k Shots ≽ 9205541914 ≼ Call Girls In Tagore Garden (Delhi)
2k Shots ≽ 9205541914 ≼ Call Girls In Tagore Garden (Delhi)
 
❤Personal Contact Number Mcleodganj Call Girls 8617697112💦✅.
❤Personal Contact Number Mcleodganj Call Girls 8617697112💦✅.❤Personal Contact Number Mcleodganj Call Girls 8617697112💦✅.
❤Personal Contact Number Mcleodganj Call Girls 8617697112💦✅.
 
❤Personal Contact Number Varanasi Call Girls 8617697112💦✅.
❤Personal Contact Number Varanasi Call Girls 8617697112💦✅.❤Personal Contact Number Varanasi Call Girls 8617697112💦✅.
❤Personal Contact Number Varanasi Call Girls 8617697112💦✅.
 
Visa Consultant in Lahore || 📞03094429236
Visa Consultant in Lahore || 📞03094429236Visa Consultant in Lahore || 📞03094429236
Visa Consultant in Lahore || 📞03094429236
 
Top travel agency in panchkula - Best travel agents in panchkula
Top  travel agency in panchkula - Best travel agents in panchkulaTop  travel agency in panchkula - Best travel agents in panchkula
Top travel agency in panchkula - Best travel agents in panchkula
 
Sample sample sample sample sample sample
Sample sample sample sample sample sampleSample sample sample sample sample sample
Sample sample sample sample sample sample
 

Topic Detection - Identifying relevant topics in tourism reviews

  • 1. ENTER 2016 Research Track Slide Number 15th February 2016 Thomas Mennera Wolfram Höpkena Matthias Fuchsb Maria Lexhagenb a Business Informatics Group University of Applied Sciences Ravensburg-Weingarten, Germany {name.surname}@hs-weingarten.de b European Tourism Research Institute (ETOUR) Mid-Sweden University, Sweden {name.surname}@miun.se Topic Detection - Identifying relevant topics in tourism reviews
  • 2. ENTER 2016 Research Track Slide Number 25th February 2016 Content • Introduction • Related work • Methodology – Document retrieval, extraction & processing – Mining (supervised & unsupervised learning) • Evaluation • Application • Summary and outlook
  • 3. ENTER 2016 Research Track Slide Number 35th February 2016 Motivation • Product reviews as important source of information – Shared experiences, impressions and opinions of travellers represent huge pool of potentially useful information for tourism and travel providers – Product reviews can be extracted and processed automatically using different information extraction and data mining techniques • Topic detection – Next to extracting the sentiment or polarity of reviews, explicit topics mentioned in the review can be extract (Fuchs et al., 2014; Höpken et al., 2015) – Extracted topics can subsequently be used to improve the own services and offerings depending on whether the topics are mentioned positively or negatively
  • 4. Objective
       • Unsupervised topic detection
           – Identifying unknown (not predefined) topics mentioned within a review
           – New topics, not yet recognized as relevant quality dimensions of tourism services, can be identified
           – Unsupervised topic detection is a promising approach to gain new insights into relevant quality dimensions, as well as into the strengths and weaknesses of concrete tourism services along those quality dimensions
       • Several approaches for unsupervised topic detection are presented and compared with respect to their detection accuracy
       • A holistic process for extracting, pre-processing and mining user reviews from rating portals is provided
  • 6. Related work
       • Hu and Liu (2004)
           – Approach for extracting features (topics) of products as part of sentiment analysis
           – Since product features are often represented by nouns, their approach extracts all frequent nouns from the user-generated content (UGC) and matches the nouns to the opinion words that rate them
       • Ziebarth et al. (2008)
           – Approach for topic detection by keyword clustering (clustering of a term-document matrix based on TF-IDF values)
       • Panagiotis et al. (1999)
           – Categorize documents by topic using a Latent Semantic Indexing (LSI) approach through Singular Value Decomposition (SVD)
  • 8. Document retrieval, extraction & processing
       Document retrieval
         • A web crawler fetches all HTML pages containing reviews, based on regular expressions
         • 124 user reviews, consisting of over 1,200 single review statements, were retrieved (from TripAdvisor, for two hotels of the Swedish mountain destination Åre)
       Document extraction
         • Explicit textual reviews are extracted from the HTML pages using XPath expressions to select the relevant document parts and regular expressions to strip HTML tags from the review texts
       Document processing
         • Pre-processing of reviews depends on the applied mining techniques
         • Splitting reviews into sentences or single words; tokenisation; filtering of stop-words; stemming; transformation to lower case
         • Representation of reviews by a term-document matrix, based on term occurrences, term frequency or TF-IDF values
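The processing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' pipeline (which the deck says was built in RapidMiner Studio): the sample reviews, the tiny stop-word list and the helper names `strip_tags`, `tokenize` and `tf_idf_matrix` are all made up, stemming is left out, and TF-IDF is computed as relative term frequency times log(N / document frequency), which is one common variant.

```python
import math
import re
from collections import Counter

# Hypothetical mini stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "in", "for", "this"}

def strip_tags(html: str) -> str:
    """Remove HTML tags with a regular expression (good enough for a sketch)."""
    return re.sub(r"<[^>]+>", " ", html)

def tokenize(text: str) -> list[str]:
    """Lower-case the text, split it into word tokens and drop stop-words."""
    tokens = re.findall(r"[a-zåäö]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf_matrix(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, rows) with one row of TF-IDF weights per document."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([
            (counts[t] / len(doc)) * math.log(n_docs / doc_freq[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, rows

# Made-up review snippets standing in for crawled HTML fragments.
reviews = [
    "<p>The room was lovely and the staff was friendly.</p>",
    "<p>Great skiing, lovely pool and sauna.</p>",
    "<p>The room was small.</p>",
]
docs = [tokenize(strip_tags(r)) for r in reviews]
vocab, matrix = tf_idf_matrix(docs)
```

Each row of `matrix` is one review and each column one term of `vocab`; splitting reviews into sentences first would give the sentence-based variant mentioned on the later slides.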
  • 9. Topic detection approach 1
       • Identification of frequent nouns and verbs
           – Assumption: nouns and verbs occurring frequently within user reviews represent the topics discussed in those reviews
               • Customers use similar vocabulary (nouns and verbs) to talk about the same products or services (i.e. topics)
               • Thus, nouns and verbs representing topics will occur more frequently than other words
           – The approach simply detects nouns and/or verbs based on Part-of-Speech (POS) tagging (making use of the PENN-database) and extracts all frequent words above a specific threshold
           – Extracted words represent the most important topics
           – Nouns only, verbs only, and nouns & verbs have been tested and compared
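A minimal sketch of the frequency-threshold idea, assuming the tokens have already been POS-tagged with Penn Treebank-style tags (noun tags starting with `NN`, verb tags with `VB`). The tagged sample data, the function name `frequent_words` and the threshold of 2 are illustrative assumptions; the deck does not state the threshold that was actually used.

```python
from collections import Counter

# Hypothetical pre-tagged tokens: (word, Penn Treebank-style POS tag).
tagged_reviews = [
    [("room", "NN"), ("was", "VBD"), ("lovely", "JJ")],
    [("room", "NN"), ("had", "VBD"), ("view", "NN")],
    [("pool", "NN"), ("was", "VBD"), ("warm", "JJ")],
    [("loved", "VBD"), ("pool", "NN"), ("room", "NN")],
]

def frequent_words(tagged_docs, tag_prefixes=("NN",), min_count=2):
    """Count words whose POS tag starts with one of tag_prefixes and keep
    those that occur at least min_count times."""
    counts = Counter(
        word.lower()
        for doc in tagged_docs
        for word, tag in doc
        if tag.startswith(tag_prefixes)
    )
    return {word: n for word, n in counts.items() if n >= min_count}

noun_topics = frequent_words(tagged_reviews)
noun_verb_topics = frequent_words(tagged_reviews, tag_prefixes=("NN", "VB"))
```

In this toy data the nouns-and-verbs variant also picks up the auxiliary "was", which hints at why a purely frequency-based selection can over-generate topics.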
  • 10. Topic detection approach 2
       • Keyword Clustering
           – Starting point: term-document matrix based on TF-IDF values
           – Documents are clustered by the k-means clustering algorithm (using cosine similarity as the distance measure)
           – Words with high TF-IDF values within a cluster then represent words often co-occurring in reviews and, thus, represent latent topics
           – Different settings and parameters were tested
               • Clustering with k=40 and k=80
               • Clustering on sentences and on whole reviews
               • Clustering all words
               • Clustering only nouns, only verbs, or nouns and verbs
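The clustering step can be illustrated with a small, dependency-free k-means that assigns each document to the centroid with the highest cosine similarity. This is a sketch under assumptions: the four-term vocabulary and the weights are invented, the initialisation simply takes the first k rows, and k=2 is used only because the toy data is tiny (the slides report k=40 and k=80).

```python
import math

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(a, b):
    # Both inputs are unit-length here, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(rows, k, iterations=10):
    """Cluster document vectors by cosine similarity to k centroids."""
    rows = [normalize(r) for r in rows]
    centroids = [rows[i] for i in range(k)]  # naive init: first k rows
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each row.
        labels = [max(range(k), key=lambda c: cosine(r, centroids[c])) for r in rows]
        # Update step: new centroid = normalised mean of the cluster members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids

def top_terms(centroid, vocab, n=2):
    """Terms with the highest (positive) weight in a cluster centroid."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [term for weight, term in ranked[:n] if weight > 0]

# Made-up TF-IDF-like weights: one row per sentence, one column per term.
vocab = ["pool", "room", "sauna", "view"]
rows = [
    [0.0, 0.9, 0.0, 0.4],
    [0.8, 0.0, 0.6, 0.0],
    [0.0, 0.7, 0.0, 0.7],
    [0.5, 0.0, 0.9, 0.0],
]
labels, centroids = kmeans_cosine(rows, k=2)
```

Here `top_terms(centroids[0], vocab)` yields the room-related terms and `top_terms(centroids[1], vocab)` the wellness-related ones, which is the sense in which high-weight words per cluster stand for latent topics.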
  • 11. Topic detection approach 3
       • Latent Semantic Indexing (LSI)
           – Based on Singular Value Decomposition (SVD, a dimension reduction technique), words that often co-occur are summarized into concepts
               • Thus, SVD reduces the variety of words within a text, utilising the fact that there is usually more than one word for describing the same object (e.g. a hotel room)
           – A concept then represents the (latent) semantics of all those words and, ultimately, a latent topic
           – Different settings and parameters were tested
               • LSI with k=40 and k=80 (k = number of concepts extracted)
               • LSI on sentences and on whole reviews
               • LSI on all words
               • LSI only on nouns, only on verbs, or on nouns and verbs
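To make the "co-occurring words collapse into one concept" idea concrete, the sketch below extracts only the first (dominant) concept of a small document-term matrix by power iteration on AᵀA, which converges to the first right singular vector. It is an illustration under assumptions: the matrix and vocabulary are invented, and a real LSI run would use a full truncated SVD from a linear algebra library to obtain all k concepts.

```python
import math

def dominant_concept(matrix, iterations=100):
    """Power iteration on A^T A: returns the first right singular vector
    (one weight per term) and the corresponding singular value."""
    n_terms = len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        # u = A v (one value per document), then w = A^T u (one value per term)
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        w = [sum(row[j] * u[i] for i, row in enumerate(matrix)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    return v, sigma

# Made-up term occurrences: rows are sentences, columns are terms.
vocab = ["room", "suite", "bed", "ski", "slope"]
matrix = [
    [1, 0, 1, 0, 0],  # room, bed
    [0, 1, 1, 0, 0],  # suite, bed
    [1, 1, 0, 0, 0],  # room, suite
    [0, 0, 0, 1, 1],  # ski, slope
]
concept, sigma = dominant_concept(matrix)
```

In this toy matrix "room", "suite" and "bed" receive equal weight in the dominant concept while "ski" and "slope" receive essentially none, mirroring the slide's point that several words describing the same object end up in one latent topic.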
  • 12. Topic detection approach 4
       • Named Entity Recognition (NER)
           – General approach: extract relevant information from a text in the form of entities such as persons, organisations or locations
           – In this study, the original NER approach is modified so that only two pseudo-entities are used, declaring a specific word either a topic (entity = Topic) or a non-topic (entity = O)
           – The training dataset is prepared with a single data record for each word
               • Each single word is classified as topic or non-topic
               • Each data record is enriched with linguistic and/or grammatical context, such as the words surrounding a specific word within a sentence, or the part of speech of that word
  • 13. Topic detection approach 4 (continued)
       Example sentence for the enriched and pre-classified dataset: "for skiing this is a lovely place."
           – Based on such pre-classified data records, any classification algorithm can be used to distinguish between topics and non-topics
           – The following settings, parameters and algorithms were finally tested
               • 2, 3, 4 and 5 words before and after a specific word
               • Naïve Bayes
               • K-Nearest-Neighbour with k = 5, 10, 15, 20, 25 and 50
               • Support Vector Machines (SVM)
               • Sequential Mining based on Conditional Random Fields
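The per-word data records described for approach 4 might look roughly like the sketch below, which builds one record per word with a context window of two words on each side. The field names, the `<PAD>` marker and, in particular, the Topic/O labels for the example sentence are assumptions made for illustration; the transcript does not preserve which words the authors actually tagged.

```python
def context_records(tokens, labels, window=2, pad="<PAD>"):
    """Build one record per word: the word itself, `window` words before and
    after it, and its label ("Topic" or "O")."""
    padded = [pad] * window + list(tokens) + [pad] * window
    records = []
    for i, (word, label) in enumerate(zip(tokens, labels)):
        centre = i + window  # position of the word inside the padded list
        records.append({
            "word": word,
            "before": padded[centre - window:centre],
            "after": padded[centre + 1:centre + 1 + window],
            "label": label,
        })
    return records

tokens = ["for", "skiing", "this", "is", "a", "lovely", "place"]
# Hypothetical manual labels; the slide's original example table is not available.
labels = ["O", "Topic", "O", "O", "O", "O", "Topic"]
records = context_records(tokens, labels)
```

Records of this shape can then be fed to any of the listed classifiers; adding the POS tag of each word as a further field would follow the same pattern.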
  • 15. Evaluation

       Approach                                                              | Accuracy | Recall Topic | Recall No Topic | Precision Topic | Precision No Topic | F1 Topic
       Identification of frequent words (nouns only)                         | 82.86%   | 94.20%       | 80.14%          | 53.19%          | 98.29%             | 0.6798
       Keyword Clustering (nouns only, sentence-based, k=80)                 | 88.45%   | 62.84%       | 94.59%          | 73.56%          | 91.40%             | 0.6777
       LSI - Latent Semantic Indexing (nouns only, sentence-based, k=80)     | 85.46%   | 48.95%       | 94.21%          | 66.96%          | 88.51%             | 0.5655
       NER – Named Entity Recognition (Naïve Bayes, 2 words +/- as context)  | 75.17%   | 77.94%       | 72.39%          | 73.84%          | 76.65%             | 0.7583
  • 16. Evaluation: ranking of the approaches by accuracy
       1. Keyword Clustering (88.45%)
       2. LSI - Latent Semantic Indexing (85.46%)
       3. Identification of frequent words (82.86%)
       4. NER – Named Entity Recognition (75.17%)
  • 17. Evaluation: Identification of frequent words (nouns only)
       Even though 94.20% of the manually pre-classified topics could be redetected, the precision of detecting topics was only 53.19% and the corresponding F1 measure 0.6798 (the approach declares a word a topic far too often, because not every frequent noun is a topic).
  • 18. Evaluation: Keyword Clustering (nouns only, sentence-based, k=80)
       Despite the highest accuracy (88.45%) and a high topic precision (73.56%), the topic recall (62.84%) constitutes a clear limitation of the approach (F1 measure 0.6777).
  • 19. Evaluation: LSI - Latent Semantic Indexing (nouns only, sentence-based, k=80)
       With a topic recall of 48.95%, LSI could not even redetect half of the manually pre-classified topic words, a severe limitation that leads to the worst F1 measure (0.5655).
  • 20. Evaluation: NER – Named Entity Recognition (Naïve Bayes, 2 words +/- as context)
       With the second-best topic recall (77.94%) and the best topic precision (73.84%), NER reaches the most balanced result of all tested approaches, leading to the best F1 measure (0.7583), and can be recommended for topic detection.
       Ranking by F1 measure: 1. NER (0.7583), 2. Identification of frequent words (0.6798), 3. Keyword Clustering (0.6777), 4. LSI (0.5655)
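The F1 values quoted on the evaluation slides can be cross-checked from the reported topic precision and recall, using the usual definition F1 = 2 · precision · recall / (precision + recall). The short check below is only a recomputation from the rounded percentages on the slides, so differences in the fourth decimal place are expected; the dictionary keys are shortened approach names chosen here for readability.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (topic precision, topic recall, F1 as reported on the slides)
reported = {
    "frequent words": (0.5319, 0.9420, 0.6798),
    "keyword clustering": (0.7356, 0.6284, 0.6777),
    "LSI": (0.6696, 0.4895, 0.5655),
    "NER": (0.7384, 0.7794, 0.7583),
}

recomputed = {name: f1(p, r) for name, (p, r, _) in reported.items()}
best = max(recomputed, key=recomputed.get)
```

The recomputed values agree with the reported ones to within rounding, and `best` comes out as NER, matching the recommendation on the slide even though NER has the lowest accuracy of the four approaches.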
  • 22. Application of topic detection
       • Mining UGC by topic detection
           – Effective way to get an idea of what customers are talking about
           – hotel, room, ski and spa/pool/sauna are the most important and most often mentioned topics within the analysed online reviews
           – hotel and staff seem to be much more important for guests of Copperhill Mountain Lodge than for those of Holiday Club Åre -> different guest perception based on different hotel profile (Copperhill Mountain Lodge being a five-star hotel, well known for its exceptional hotel architecture and design)

       Copperhill Mountain Lodge
       Identified topics | Total occurrences
       hotel             | 143
       room              | 84
       ski               | 63
       spa               | 41
       staff             | 38

       Holiday Club Åre
       Identified topics | Total occurrences
       room              | 80
       hotel             | 62
       pool              | 44
       ski               | 39
       sauna             | 28
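Because the two hotels have different numbers of topic mentions, the comparison is easier to read as shares than as raw counts. The sketch below normalises each hotel's counts by the sum of its five listed topics; that denominator is an assumption made here (the slide only lists the top five topics per hotel), so the percentages are relative to the listed topics rather than to all detected topics.

```python
# Topic counts as listed on the slide.
copperhill = {"hotel": 143, "room": 84, "ski": 63, "spa": 41, "staff": 38}
holiday_club = {"room": 80, "hotel": 62, "pool": 44, "ski": 39, "sauna": 28}

def shares(counts):
    """Share of each topic among the listed topic occurrences of one hotel."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

copperhill_shares = shares(copperhill)
holiday_club_shares = shares(holiday_club)
```

Under this normalisation "hotel" accounts for roughly 39% of the listed mentions at Copperhill Mountain Lodge against roughly 25% at Holiday Club Åre, which is consistent with the slide's reading; "staff" cannot be compared this way because it is not among the topics listed for Holiday Club Åre.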
  • 23. Application of topic detection
       • Topic detection as one step of an overall sentiment analysis
           – Subsequent sentiment detection identifies the positive or negative sentiment of review statements (Schmunk/Höpken/Fuchs/Lexhagen, 2014)
       • Powerful tool for hoteliers and tourism stakeholders
           – Constant transparency about the topics customers are talking about in their reviews
           – Customers' opinions and experiences in connection with one's own hotel
           – Insights into competitors' businesses because of the public availability of UGC
               • Identify strengths and weaknesses of one's own business compared to competitors
               • Identify direct competitors addressing the same market segment
  • 25. Summary and outlook
       • Summary
           – NER (Named Entity Recognition) is the most powerful approach for topic detection (accuracy 75.17%, precision 73.84%, recall 77.94%, F1 0.7583)
           – All data mining processes have been implemented with the data mining tool RapidMiner Studio®
       • Outlook
           – Additional approaches like Topic Modelling (Blei & Lafferty, 2009) and Dependency Parsing (Wu et al., 2009)
           – Feature-based sentiment analysis, combining topic detection with a subsequent sentiment analysis for each single feature/topic

Editor's Notes

  1. Duration: 20 min (without questions)
  2. Best settings and parameters for each of the four test cases:
       • Identification of frequent words: identification of frequent nouns
       • Keyword Clustering: clustering nouns based on sentences with k=80
       • LSI: LSI on nouns based on sentences with k=80
       • NER: Naïve Bayes with 2 words before and after the word to be classified
  3. If we compare the approaches based on accuracy alone, keyword clustering is the best approach, followed by LSI, identification of frequent words and NER. If we compare precision and recall values and the corresponding F1 measure