1. ENTER 2016 Research Track Slide Number 1
Correlating Languages and Sentiment Analysis on the basis of Text-based Reviews
Aitor García Pablos (a), Angelica Lo Duca (b), Montse Cuadros (a), María Teresa Linaza (a), Andrea Marchetti (b)
(a) Department of eTourism and Cultural Heritage and Department of Human Speech and Language Technologies - Vicomtech-IK4, Spain
{agarciap,mcuadros,mtlinaza}@vicomtech.org
http://www.vicomtech.org
(b) Institute of Informatics and Telematics, National Research Council, Pisa, Italy
{angelica.loduca,andrea.marchetti}@iit.cnr.it
http://www.iit.cnr.it/
Summary
• Introduction
• About this work
• Data gathering process
• Data analysis results and discussion
• Conclusions and further work
Introduction
• The use of online social media is growing
• Many social networks accumulate an
important share of customer comments
and interactions
• The data generated in these social channels
deserves a lot of attention due to the
valuable information it can provide
Introduction (2)
• Some social networks:
– Foursquare: […] a local search and discovery service mobile
app which provides search results for its users. By taking
into account the places a user goes, the things they have
told the app that they like, and the other users whose
advice they trust, Foursquare provides recommendations of
the places to go around a user's current location.
– Google Places: a free geolocation service for businesses that lets
business managers and companies place information about their
business on Google Maps
– Facebook: No need to explain this one, right?
About this work
• A large number of user comments were gathered from the
aforementioned social media sources over a
three-year period (2012 – 2014)
• These comments were grouped by
language and analysed with an automatic text
analysis tool to extract their sentiment orientation
• The analysed data was then examined to draw
insights and conclusions
Data gathering process
• The data used in this study has been retrieved from three popular
social networks: Facebook, Foursquare and Google Places
• For each social network, a tailored crawler was used to extract
reviews about accommodations
• The crawler searched for places within the geographical areas of the
analysed destinations:
– Each geographical area was defined as a circle of a given radius around a
particular geo-coordinate
• Reviews were retrieved and stored using algorithms tailored for each
social network
• The data gathering campaign was able to retrieve customer reviews
ranging from December 2009 to May 2014
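The circle-based area definition above can be illustrated with a minimal sketch. It is not the crawlers' actual code: the function names, the sample places and the 10 km radius are assumptions made for illustration, and the distance is computed with the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def places_in_area(places, center_lat, center_lon, radius_km):
    """Keep only the places that fall inside the circle around the centre."""
    return [
        p for p in places
        if haversine_km(center_lat, center_lon, p["lat"], p["lon"]) <= radius_km
    ]


# Hypothetical sample data, not taken from the study's dataset.
places = [
    {"name": "Hotel A", "lat": 52.3702, "lon": 4.8952},  # central Amsterdam
    {"name": "Hotel B", "lat": 52.0907, "lon": 5.1214},  # Utrecht, well outside the circle
]
nearby = places_in_area(places, 52.3676, 4.9041, radius_km=10.0)
print([p["name"] for p in nearby])
```

In practice the social networks' own search APIs may accept a centre and radius directly; a local filter like this one is only useful as a sanity check on what was returned.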
Resulting dataset
Location    Facebook             Foursquare           Google Places
            #places   #reviews   #places   #reviews   #places   #reviews
Amsterdam       583      7.725     7.735     27.537    13.635      7.623
Barcelona       455      3.732     6.339     46.445    18.499     14.092
Berlin        2.084      3.078    21.875     43.816    39.765     22.630
Dubai         1.052      3.124    14.469     38.347     7.301      3.844
London        4.893      3.263    47.148    137.749   121.723     75.973
Paris           832      4.227     6.545     46.431    51.665     36.572
Rome          4.465        384    16.913     35.503    31.455     15.094
Tuscany        n.a.       n.a.    43.844     40.389    70.072     16.986
Total        14.364     25.533   164.868    416.217   354.115    192.814
Example of user comments
Examples of customer reviews per source (misspellings included)
• Facebook
– One of my favorite hotels! very kind staff and great location!!! In the evening, a warm fireplace in the lobby and a wonderful mood!
– Bar open until you decide to go to bed and largest towels ever !!!
– 10 euros per day for wifi!,... Not acceptable in Europe !
• Foursquare
– Perfect location.
– Nice hosting service. Decent breakfast and honest staff, excepts some night shift recepcionists...
– Worst hostel ever! I could stay in better rooms with this money. Rooms are cold, there are bugs everywhere, sheets are not clean.
• Google Places
– VERY POOR - Back-packers hostel, not a hotel. over 200 euros for the "family room"- very noisy, dirty very dusty, stains on the carpet.
– Very noisy rooms. Cleaning staff continuosly enter in the room despite of "do not disturb" cartel. Reception staff not so helpful.
– Awesome hotel. Nice views (get an upper room) and it has all the top shelf frills you'd expect from a hotel like this.
Data analysis process
• The customer review analysis process includes
two main tasks:
– identification of the language of the review
– simple sentiment analysis (polarity of the words)
• OpeNER has been used as the analysis tool:
– OpeNER project (www.opener-project.eu)
• Sentiment calculation for each review:
– Arithmetic mean of the polarities detected
in the review
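The per-review sentiment calculation can be sketched as follows. This is a toy dictionary-based illustration, not OpeNER's API or lexicon: the word list, the +1/-1 polarity values and the linear mapping onto a 0 to 10 scale are all assumptions.

```python
import re

# Toy polarity lexicon (hypothetical entries): +1 positive, -1 negative.
POLARITY_LEXICON = {
    "great": 1, "perfect": 1, "nice": 1, "awesome": 1, "wonderful": 1,
    "worst": -1, "dirty": -1, "noisy": -1, "cold": -1, "poor": -1,
}


def review_polarity(text, lexicon=POLARITY_LEXICON):
    """Arithmetic mean of the polarities of the lexicon words found in the review.

    Returns None when no polarity word is detected.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return None
    return sum(hits) / len(hits)


def to_zero_ten_scale(mean_polarity):
    """Map a mean polarity in [-1, 1] linearly onto 0..10 (assumed mapping)."""
    return round((mean_polarity + 1) * 5)


for review in ("Perfect location.", "Very noisy rooms, dirty carpet, but great location"):
    mean = review_polarity(review)
    print(review, "->", mean, "->", to_zero_ten_scale(mean))
```

A review with no lexicon hits yields `None` rather than a neutral score, so that "nothing detected" stays distinguishable from "balanced"; whether the original analysis made that distinction is not stated.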
Results
Distribution of the language of the reviews per destination (logarithmic scale)
Results (2)
• Some conclusions looking at the chart:
– For every city/country, the most used language
is the official language of that country
• Except for Amsterdam and Dubai
– For cities/regions of the same country (Rome
and Tuscany) we see a very similar language
pattern, but
• Rome has a significant number of comments in
“other languages”
Results (3)
Distribution of the language (percentage) of the reviews per detected polarity
(polarity 0 means “very negative”, polarity 10 means “very positive”)
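A distribution like the one charted here could be computed along these lines. This is a minimal sketch, assuming each review has already been reduced to a (language, polarity) pair and that percentages are normalised within each language; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def polarity_distribution(reviews):
    """Return {language: {polarity: percentage}} for (language, polarity) pairs.

    Percentages are computed within each language (assumed normalisation),
    over the polarity values 0..10.
    """
    counts = defaultdict(Counter)
    for language, polarity in reviews:
        counts[language][polarity] += 1

    distribution = {}
    for language, counter in counts.items():
        total = sum(counter.values())
        distribution[language] = {
            polarity: 100.0 * counter.get(polarity, 0) / total
            for polarity in range(11)
        }
    return distribution


# Hypothetical sample, not taken from the study's dataset.
sample = [("en", 10), ("en", 10), ("en", 5), ("en", 0), ("nl", 10), ("nl", 8)]
dist = polarity_distribution(sample)
print(dist["en"][10], dist["nl"][8])
```

Normalising within each language makes languages with very different review volumes comparable on the same chart.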
Results (4)
• Some conclusions looking at the chart:
– Negative sentiment is harder to detect* (especially
with a dictionary-based approach)
– Certain patterns seem to depend on language families
(e.g. more neutral comments coming from Romance
languages: French, Italian, Spanish)
– Certain languages are more biased towards positivity than
others (e.g. the Dutch seem very easy-going ;-) )
*This is a classic issue in sentiment analysis, due to the subtle ways of expressing negativity, such as
irony, sarcasm, etc.
Results (5)
Distribution of the language of the reviews per social network
Results (6)
• Some conclusions looking at the chart:
– Foursquare is very popular among English, Italian and
Spanish speakers
– German and French speakers are split fairly evenly
between Foursquare and Google Places
– Facebook contains less content in general
• Facebook is more oriented to multimedia and special-offer
sharing than to building a customer review ecosystem
Conclusions and further work
• Social media is a very valuable resource to obtain
interesting and relevant information
• Different languages show different patterns in the use
of social media (e.g. review polarity, preferred social
networks, etc.)
• It would be interesting to extend the analysis to more
languages (Russian, Chinese, Japanese, etc.)
• A more in-depth analysis would also be interesting (e.g. topic
detection to discover which topics/features of the
accommodations are evaluated by customers from each
country)
That’s all
Thank you very much for your attention!
Questions?