User review sites as a resource for large scale Sociolinguistic studies


A presentation of a paper by Anders Johannsen, Dirk Hovy, and Anders Søgaard of the University of Copenhagen. The paper examines how sociolinguistic variables such as age, gender, and region, together with review ratings, relate to language use in Trustpilot reviews written in different languages.

Published in: Data & Analytics
  1. 1. USER REVIEW SITES AS A RESOURCE FOR LARGE-SCALE SOCIOLINGUISTIC STUDIES. Presented by Ashutosh Bhargave. Paper by Anders Johannsen, Dirk Hovy, and Anders Søgaard, University of Copenhagen.
  2. 2. OUTLINE: • Introduction • Data Format • Data Augmentation • Representativeness • Pilot Studies • Conclusion
  3. 3. Sociolinguistic studies examine the relation between language and extra-linguistic variables. Problems: • Traditional approach: too little data for statistical power. • Social media data: mostly lacks extra-linguistic information. Remedy: • The paper aims to remedy both problems by exploring a large new data source: international review websites with user profiles.
  4. 4. DATA FORMAT: • The Trustpilot Corpus consists of user reviews from the Trustpilot website. • Users need to register with a username in order to leave a review. • There are no mandatory profile fields other than the name. • The authors assign unique identifiers to both users and companies and use those to link up reviews. • They are mostly interested in age, gender, and location in combination with the written reviews.
  5. 5. DATA AUGMENTATION: The retrieved data set is augmented in two ways: 1. gender information based on first names, and 2. geotagging information (latitude and longitude). Problems addressed: 1. users who give no gender information; 2. locations that have to be resolved to a "canonical" town.
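
The first-name augmentation on slide 5 can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function names, the `(name, gender)` input format, and the `min_count` and `min_share` thresholds are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_name_table(known_users, min_count=3, min_share=0.95):
    """Map first names to a gender using users who did state their gender.

    known_users: iterable of (full_name, gender) pairs.
    min_count and min_share are hypothetical thresholds: a name is kept only
    if it is seen often enough and is dominated by one gender.
    """
    counts = defaultdict(Counter)
    for name, gender in known_users:
        parts = name.split()
        if parts:
            counts[parts[0].lower()][gender] += 1

    table = {}
    for first, by_gender in counts.items():
        gender, n = by_gender.most_common(1)[0]
        total = sum(by_gender.values())
        if total >= min_count and n / total >= min_share:
            table[first] = gender
    return table

def infer_gender(name, table):
    """Return the gender for a user's first name, or None if unknown or ambiguous."""
    parts = name.split()
    return table.get(parts[0].lower()) if parts else None
```

Names that occur with both genders, or too rarely, are left unlabeled rather than guessed, which trades coverage for precision.
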
  6. 6. REPRESENTATIVENESS: • The data is restricted to the age range from 16 to 80. • The median age in the data is typically close to the country's median value. • There are more male than female users. • The average number of reviews per user is around 4.
  7. 7. PILOT STUDIES: Discovering gender-specific words
  8. 8. Emoticons, age, and gender: • An emoticon is split into eyes ( : or ; ), nose ( - or none), and mouth ( ( , ) , [ , * etc.). • Women use emoticons almost twice as often as men do. • For all ages, the use of a nose is strongly negatively correlated with age.
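
The eyes/nose/mouth scheme on slide 8 maps naturally onto a regular expression. The sketch below is only an illustration: the slide lists the mouth characters as "( , ) , [ , * etc", so the exact character set used here, and the function names, are assumptions.

```python
import re

# Eyes (: or ;), optional nose (-), then a mouth character.
# The mouth set is an assumed subset of what the slide calls "( , ) , [ , * etc".
EMOTICON = re.compile(r"([:;])(-?)([()\[\]*])")

def find_emoticons(text):
    """Return a list of (eyes, nose, mouth) tuples, one per emoticon found."""
    return EMOTICON.findall(text)

def nose_rate(texts):
    """Share of emoticons that include a nose, or None if there are no emoticons."""
    total = with_nose = 0
    for text in texts:
        for _eyes, nose, _mouth in find_emoticons(text):
            total += 1
            with_nose += bool(nose)
    return with_nose / total if total else None
```

Computing `nose_rate` separately per age group or gender would give the kind of comparison the slide describes.
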
  9. 9. Ratings, categories, gender, and age: • Men tend to give slightly more negative ratings than women. • People in the younger group are more likely to give negative ratings than people in the older group.
  10. 10. DENMARK: • Some speakers no longer distinguish between reflexive and non-reflexive possessive pronouns. • Record the frequency of sin/sit (his/her own) and the joint frequency of all possessive pronouns (including the non-reflexive "his"), then compute the share of the former among all possessive pronouns.
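
The ratio described on slide 10 can be sketched as a simple token count. Only sin/sit come from the slide; the non-reflexive forms `hans` and `hendes`, the tokenization, and the function name are assumptions chosen for illustration.

```python
import re

REFLEXIVE = {"sin", "sit"}           # forms named on the slide
NON_REFLEXIVE = {"hans", "hendes"}   # assumption: illustrative non-reflexive forms

def reflexive_ratio(texts):
    """Share of reflexive forms among all counted possessive pronouns.

    Returns None when no possessive pronoun occurs at all.
    """
    reflexive = non_reflexive = 0
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token in REFLEXIVE:
                reflexive += 1
            elif token in NON_REFLEXIVE:
                non_reflexive += 1
    total = reflexive + non_reflexive
    return reflexive / total if total else None
```

Grouping reviews by region before calling `reflexive_ratio` would expose the regional variation this pilot study looks for.
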
  11. 11. Swear words across location, gender, and age: • As people grow older, they tend to use more conservative language. • Women use the stronger swear words less than men do.
  12. 12. GERMAN: • Spelling change: replacement of ß with ss. • Examples: dass/daß, "that", and the modal müssen/müßen, "must". • Older speakers retain the traditional spelling they acquired in their youth to a much greater extent.
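
One way to check the age effect on slide 12 is to compare the share of the traditional spelling between age groups. The sketch below looks only at dass/daß; the `(age, text)` input format, the two-way split, and the `cutoff_age` value are assumptions for the example, not details from the paper.

```python
import re
from collections import defaultdict

def traditional_share_by_group(reviews, cutoff_age=40):
    """Share of 'daß' among all dass/daß tokens, per age group.

    reviews: iterable of (age, text) pairs.
    cutoff_age is a hypothetical boundary between "younger" and "older".
    Groups with no dass/daß tokens are left out of the result.
    """
    counts = defaultdict(lambda: {"daß": 0, "dass": 0})
    for age, text in reviews:
        group = "older" if age >= cutoff_age else "younger"
        for token in re.findall(r"\w+", text.lower()):
            if token in counts[group]:
                counts[group][token] += 1

    shares = {}
    for group, c in counts.items():
        total = c["daß"] + c["dass"]
        if total:
            shares[group] = c["daß"] / total
    return shares
```

A higher share in the "older" group than in the "younger" group would be consistent with the slide's claim.
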
  13. 13. CONCLUSION: • Traditional sociolinguistic studies often lack the statistical power to draw valid conclusions, and big-data approaches to language studies mostly lack the extra-linguistic information that would enable sociolinguistic studies. • User review sites offer a solution to this dilemma.
  14. 14. QUESTIONS ?
  15. 15. THANK YOU