Comparing user generated content published in different social media sources

The growth of social media has populated the Web with valuable user generated content that can be exploited for many different and interesting purposes, such as explaining or predicting real-world outcomes through opinion mining. In this context, natural language processing techniques are a key technology for analysing user generated content. Such content is characterised by its casual language, with short texts, misspellings, and set phrases, among other characteristics that challenge content analysis. This paper shows the differences in the language used in heterogeneous social media sources, by analysing the distribution of the part-of-speech categories extracted from a morphological analysis of a sample of texts published in such sources. In addition, we evaluate the performance of three natural language processing techniques (i.e., language identification, sentiment analysis, and topic identification), showing the differences in accuracy when such techniques are applied to different types of user generated content.


  1. Comparing user generated content published in different social media sources. Óscar Muñoz-García, Carlos Navarro. "@NLP can u tag #user_generated_content ?!" via lrec-conf.org, 26 May 2012
  2. 2. Introduction The growth of social media has populated the Web with valuable UGC that can be exploited for many interesting purposes  E.g. explaining or predicting real world outcomes through opinion mining Advertising companies use social media content for market research  By mining users’ interests for focusing advertisement actions  By obtaining the opinion of customers about brands NLP lets us automatizing social media content analysis  However, UGC presents differences on text quality w.r.t. content source (e.g., Blogs vs. Twitter)  Such differences challenge existing NLP techniquesComparing user generated content published in different social media sources ⎢2
  3. 3. Introduction We show the differences of the language used in UGC w.r.t. social media sources  By analysing the distribution of PoS categories on different sources We evaluate the performance of three NLP techniques  Language Identification  Sentiment Analysis  Topic Identification Social media sources analysed  Blogs (e.g., Wordpress and Blogger posts)  Forums  Microblogs (e.g., Twitter)  Social networks (e.g., Facebook, Google+, MySpace, LinkedIn and Xing)  Review Sites (e.g., Ciao and Dooyoo)  Audio-visual content publishing sites (e.g., Youtube and Vimeo)  News publishing sites (i.e., mainstream media)  Other sitesComparing user generated content published in different social media sources ⎢3
  4. Distribution of PoS categories
  5. 5. Distribution of PoS categories Content analysed  Corpora with 10,000 posts extracted from heterogeneous SM sources l written in Spanish l related to telecommunications domain The distribution has been obtained by using an automatic tagger  Tools used: l PoS tagging:  TreeTagger [Schmid, 1994] with a Spanish parameterisation l Annotation pipeline:  GATE [Cunningham et al., 2011] Categories identified  Main: noun, adjective, adverb, determiner, conjunction, pronoun, verb, …  Secondary: common noun, proper noun, negation adverb, personal pronoun, … Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK. Hamish Cunningham, Diana Maynard , Kalina Bontcheva et al. 2011. Text Processing with GATE (Version 6). University of Sheffield. Department of Computer Science, April.Comparing user generated content published in different social media sources ⎢5
  6. 6. Distribution of PoS categories  Microblogs: determiners and prepositions are used to a lesser extent  Limitation of length (140 characters)  Posts need to be written more concisely → Meaningless grammatical categories tend to be used less Social News Blogs Video Reviews Microblogs Forums Other networks Nouns 31% 30% 29% 23% 34% 22% 27% 33% Adjectives 9% 8% 6% 8% 9% 7% 8% 6% Adverbs 2% 3% 3% 5% 4% 4% 4% 3% Determiners 11% 10% 8% 8% 6% 8% 9% 7% Conjunctions 6% 8% 7% 10% 6% 10% 9% 7% Pronouns 2% 3% 5% 6% 5% 6% 4% 4% Prepositions 15% 15% 12% 13% 8% 12% 13% 11%Punctuaction marks 11% 8% 13% 9% 8% 9% 10% 11% Verbs 12% 14% 17% 18% 19% 21% 16% 16% Other particles 1% 1% 1% 1% 1% 1% 1% 1% Comparing user generated content published in different social media sources ⎢6
  7. 7. Distribution of PoS categories  News and blogs present similar distributions  Because of similar writing styles  No limitations on the size of posts Social News Blogs Video Reviews Microblogs Forums Other networks Nouns 31% 30% 29% 23% 34% 22% 27% 33% Adjectives 9% 8% 6% 8% 9% 7% 8% 6% Adverbs 2% 3% 3% 5% 4% 4% 4% 3% Determiners 11% 10% 8% 8% 6% 8% 9% 7% Conjunctions 6% 8% 7% 10% 6% 10% 9% 7% Pronouns 2% 3% 5% 6% 5% 6% 4% 4% Prepositions 15% 15% 12% 13% 8% 12% 13% 11%Punctuaction marks 11% 8% 13% 9% 8% 9% 10% 11% Verbs 12% 14% 17% 18% 19% 21% 16% 16% Other particles 1% 1% 1% 1% 1% 1% 1% 1% Comparing user generated content published in different social media sources ⎢7
  8. 8. Distribution of PoS categories Nouns  Common and proper nouns present similar distributions for all sources  PoS tagger fails when proper nouns are written in lower case l In special in Forums and Reviews where discussion about specific products are raised l Solution: use gazetteers  Improves entity detection  Domain dependent  Foreign words are less used in news that in other sources because of style rules of Spanish mainstream media l Avoid foreign words, as far as possible, whenever a Spanish word exists Adjectives  Adjectives of quantity are the most used (47%) in all the channels l Cardinals (30%) more used than ordinals (2%)  Multiplicative, partitive and indefinite quantity adjectives are used more frequently in forums and review sites: l Due to quantitative evaluations and comparison of productsComparing user generated content published in different social media sources ⎢8
  9. 9. Distribution of PoS categories Adverbs  There is a correlation with the distribution of adverbs of negation and the size of the posts l More used in channels with shorter texts l Detection of negations is essential when performing sentiment analysis Conjunctions  The distribution of coordinating conjunctions is higher in News and Blogs l More used in channels with longer texts l Coordinating conjunctions are used to identify opinion chunks as they were punctuation marks. Pronouns  The distribution of personal pronouns is higher in Microblogs, Reviews, Forums and audio-visual content publishing sites l Due to conversations between users vs. narrative style of News and Blogs l Pronouns make it difficult to identify entities within opinions  Entities not explicitly mentionedComparing user generated content published in different social media sources ⎢9
  10. 10. Distribution of PoS categories Punctuation marks  Full stop less used in news l Sentences are longer than in other sources  Comma less used on Microblogs and Audio-visual content sites  Ellipses are more used in Microblogs l To denote unfinished sentences l Automatically truncated messages  Secondary punctuation marks less used in Microblogs l Difficulty for introducing these characters on mobile terminals l Content length limitation Verbs  More used in Microblogs and Forums l Intentions and actions are expressed more often  Past tenses less used in Microblogs l Immediate experiences  Infinitive more used in MicroblogsComparing user generated content published in different social media sources ⎢10
  11. Performance of language identification
  12. 12. Performance of Language Identification Content analysed  3,368 tweets  2,768 posts extracted from other social media sources (not Twitter)  Written in Spanish, Portuguese and English Technique used  Implementation of an existing text categorization algorithm l Analysis of the frequency of n-grams of characters within documents [Cavnar and Trenkle, 1994] Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp. 161-175).Comparing user generated content published in different social media sources ⎢12
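The character n-gram technique cited on this slide can be sketched roughly as follows. This is a minimal illustration of rank-ordered n-gram profiles compared with an "out-of-place" distance, not the implementation evaluated in the talk. The function names, the profile size of 300, the n-gram range 1 to 3, and the tiny training sentences are all assumptions made for the example.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Rank the character n-grams (lengths 1..max_n) of a text, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def identify_language(text, lang_profiles):
    """Return the language whose profile is closest to the profile of the text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training data (hypothetical); real profiles need far more text per language.
TRAINING = {
    "es": "el teléfono no funciona y la casa es muy grande pero el perro duerme",
    "pt": "o telefone não funciona e a casa é muito grande mas o cão dorme",
    "en": "the phone does not work and the house is very big but the dog sleeps",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
```

A call such as `identify_language("the phone is broken", PROFILES)` returns the key of the closest profile; with training texts this small the result is only indicative.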
  13. 13. Performance of Language Identification Language identification methodComparing user generated content published in different social media sources ⎢13
  14. 14. Performance of Language Identification Evaluation Results  Overall accuracy l Twitter: 93.02% l Other sources: 96.76%  Kappa l Twitter: 0.844 l Other sources: 0.916 Normalizing tweets does not improve performance  Syntactic normalization of Twitter messages [Kauffmann and Jugal, 2010] 1. Delete references to users at the beginning of the tweet 2. Delete “RT @user:” sequences 3. Delete hash tags found at the end of the tweet 4. Delete “#” at the beginning of hash tags 5. Delete URLs 6. Delete “…” followed by a URL Max Kaufmann and Kalita Jugal. 2010. Syntactic normalization of twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).Comparing user generated content published in different social media sources ⎢14
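The six normalization steps listed on this slide could be approximated with regular expressions along these lines. This is a hedged sketch: the patterns and the function name `normalize_tweet` are assumptions, and the steps are applied in a different order from the slide (URLs first) so that hash tags left at the end after URL removal are also caught.

```python
import re

URL = r"https?://\S+"

def normalize_tweet(text):
    """Apply the six syntactic normalization steps to one tweet (illustrative patterns)."""
    text = re.sub(r"(?:…|\.\.\.)\s*" + URL, "", text)   # step 6: "…" followed by a URL
    text = re.sub(URL, "", text)                          # step 5: remaining URLs
    text = re.sub(r"\bRT\s+@\w+:?\s*", "", text)          # step 2: "RT @user:" sequences
    text = re.sub(r"^\s*(?:@\w+[:,]?\s*)+", "", text)     # step 1: leading user references
    text = re.sub(r"(?:\s*#\w+)+\s*$", "", text)          # step 3: hash tags at the end
    text = re.sub(r"#(\w+)", r"\1", text)                 # step 4: drop "#" from other hash tags
    return re.sub(r"\s+", " ", text).strip()              # tidy leftover whitespace

print(normalize_tweet(
    "RT @user: @NLP can u tag #user_generated_content ?! … http://t.co/abc #lrec"
))
# prints: can u tag user_generated_content ?!
```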
  15. Performance of sentiment analysis
  16. 16. Performance of Sentiment Analysis Content analysed  1,859 tweets and 1,847 posts extracted from other social media sources (not Twitter) written in Spanish Technique used  Matching of linguistic expressions based on a Lexicon l Each expression is a sequence of pairs (lemma, PoS)  E.g. “Your brand is cool!” matches with {(Σ,Noun),(‘be’,Verb), (‘cool’,Adjective)}  Kind of expressions l For detecting subjectivity (20 expressions)  Use to include specific verbs l For detecting sentiment of opinions (1,480 expressions)  Negative expressions add a value in {-2,-1} to overall sentiment  Positive expressions add a value in {1,2} to overall sentiment l For reversing sentiment (22)  Include negations  Multiply detected sentiment by (-1) l For augmenting or reducing sentiment (32)  Use to include adverbs  Multiply detected sentiment by 1.5 or 0.75Comparing user generated content published in different social media sources ⎢16
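The scoring rules on this slide (values in {-2,-1,1,2}, reversal by -1, scaling by 1.5 or 0.75) can be illustrated with a small sketch. It is deliberately simplified: it matches single lemmas rather than (lemma, PoS) sequences, the three lexicons are hypothetical English stand-ins, and the assumption that a reverser or modifier applies to the next sentiment expression only is ours, since the slide does not specify the scope.

```python
# Hypothetical mini-lexicons; the slide reports 1,480 sentiment expressions,
# 22 reversing expressions and 32 augmenting/reducing expressions.
SENTIMENT = {"cool": 2, "good": 1, "bad": -1, "awful": -2}
REVERSERS = {"not", "never"}
MODIFIERS = {"very": 1.5, "slightly": 0.75}

def score(lemmas):
    """Sum sentiment values, applying pending reversers/modifiers to the next sentiment lemma."""
    total = 0.0
    factor = 1.0  # accumulated multiplier waiting for the next sentiment expression
    for lemma in lemmas:
        if lemma in REVERSERS:
            factor *= -1
        elif lemma in MODIFIERS:
            factor *= MODIFIERS[lemma]
        elif lemma in SENTIMENT:
            total += factor * SENTIMENT[lemma]
            factor = 1.0  # assumed scope: a modifier affects one expression only
    return total

print(score(["your", "brand", "be", "cool"]))                 # 2.0
print(score(["your", "brand", "be", "not", "very", "good"]))  # -1.5
```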
  17. 17. Performance of Sentiment Analysis Evaluation Results  Overall accuracy l Twitter: 66.92% l Other sources: 80.17%  Kappa l Twitter: 0.198 l Other sources: 0.31 Normalizing tweets does not improve performance  Syntactic normalization of Twitter messages [Kauffmann and Jugal, 2010] 1. Delete references to users at the beginning of the tweet 2. Delete “RT @user:” sequences 3. Delete hash tags found at the end of the tweet 4. Delete “#” at the beginning of hash tags 5. Delete URLs 6. Delete “…” followed by a URL Max Kaufmann and Kalita Jugal. 2010. Syntactic normalization of twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).Comparing user generated content published in different social media sources ⎢17
  18. Performance of topic identification
  19. 19. Performance of topic identification  Description of the method [Muñoz-García et al., 2011] Input PoS • “torino”, “art”, “media”, “user”, “cloud” Filtering • http://dbpedia.org/resource/Turin • http://dbpedia.org/resource/Art TopicRecognition • http://dbpedia.org/resource/User_(computing)Language • “Torino”, “arte”, “utente”, “mezzo di comunicazione di massa”, ... Filtering Óscar Muñoz-Garcíaa, Andrés García-Silva, Óscar Corcho, Manuel de la Higuera Hern´andez, and Carlos Navarro. 2011. Identifying Topics in Social Media Posts using DBpedia. In Jean-Dominique Meunier, Halid Hrasnica, and Florent Genoux, editors, Proceedings of the NEM Summit 2011, pages 81–86, Torino, Italy. Eurescom the European Institute for Research and Strategic Studies in Telecommunications GmbH. Comparing user generated content published in different social media sources ⎢19
  20. 20. Performance of topic identification PoS filtering example • But a hardware problem is more likely, especially if you use the phone a lot while eating. The Blackberrys tiny trackball could be suffering the same accumulation of gunk and grime that can plague a computer mouse that still uses a rubber Input ball on the underside to roll around the desk. • Blackberry, phone, trackball, computer, problem, grime, hardware, mouse, desk, PoS filtering rubber ball, gunk exampleComparing user generated content published in different social media sources ⎢20
  21. 21. Performance of topic identification  Topic Recognition (Sem4Tags [García-Silva et al, 2010]) • Blackberry, phone, trackball, computer, problem, grime, hardware, PoS mouse, desk, rubber ball, gunk filtering • Blackberry, {phone, hardware, trackball, mouse} • Computer, {hardware, mouse, problem, desk} ContextSelection • … • http://dbpedia.org/resource/BlackBerry • http://dbpedia.org/resource/ComputerDisambiguation Andrés García-Silva, Oscar Corcho, and Jorge Gracia. 2010. Associating semantics to multilingual tags in folksonomies. In 17th Int. Conference on Knowledge Engineering and Knowledge Management EKAW 2010, Lisbon (Portugal), October Comparing user generated content published in different social media sources ⎢21
  22. 22. Performance of topic identification Context Selection  For each keyword, a set of up to 4 related keywords that will help to disambiguate the its meaning  4 is the number of words above which the context does not add more resolving power to disambiguation [Kaplan, 1955]  We compute semantic relatedness (active context) taking into account the co-ocurrence of words in web pages [Gracia et al, 2009] Keyword Relatedness Keyword Relatedness phone 0.347 hardware 0.347 trackball 0.311 mouse 0.311 computer 0.288 desk 0.287 problem 0.246 rubber ball 0.246 grime 0.190 gunk 0.168 Active context selection for blackberry keyword A. Kaplan.1955. An experimental study of ambiguity and context. Mechanical Translation, 2:39-46 Jorge Gracia and Eduardo Mena. 2009. Multiontology semantic disambiguation in unstructured web contexts. In Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09,Identifying Topics in Social Media Posts using DBpedia ⎢22
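Given relatedness scores like those in the table on this slide, selecting the active context amounts to keeping the top-ranked keywords. The sketch below uses the slide's values for the "blackberry" keyword; the function name `active_context` is an assumption, and the web co-occurrence measure that produces the scores is not reproduced here.

```python
# Relatedness of each candidate keyword to "blackberry", as listed on the slide.
RELATEDNESS = {
    "phone": 0.347, "hardware": 0.347, "trackball": 0.311, "mouse": 0.311,
    "computer": 0.288, "desk": 0.287, "problem": 0.246, "rubber ball": 0.246,
    "grime": 0.190, "gunk": 0.168,
}

def active_context(relatedness, size=4):
    """Return up to `size` keywords with the highest relatedness (ties keep input order)."""
    ranked = sorted(relatedness.items(), key=lambda kv: -kv[1])
    return [keyword for keyword, _ in ranked[:size]]

print(active_context(RELATEDNESS))
# ['phone', 'hardware', 'trackball', 'mouse']
```

The result matches the context shown for Blackberry in the topic recognition example: {phone, hardware, trackball, mouse}.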
  23. 23. Performance of topic identification  Disambiguation Criteria  OPTION 1: Most frequent sense for the ambiguous word l Determined by Wikipedia editors (the first link in a disambiguation page)  OPTION 2: Vector space model 1. A vector containing the keyword and its context 2. A vector containing top N terms is created from each candidate sense is created using TF-IDF (Term Frequency and Inverse Document Frequency) 3. The cosine similarity is used to determine which vectorised sense is more similar to the vector associated to the keyword DBpedia resource Definition Similarity Is a line of mobile e-mail andBlackBerry 0.224 smartphoneBlackberry is an edible fruit 0.15BlackBerry_(song) is a song by the Black Crowes 0.0BlackBerry_Township,_Itasca_County, Is a towship in … Itasca County 0.0_Minnesota Comparing user generated content published in different social media sources ⎢23
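Option 2 on this slide can be sketched as below. This is an illustration under stated assumptions rather than the Sem4Tags implementation: the sense descriptions are invented token lists, the TF-IDF variant (relative term frequency times a smoothed logarithmic IDF) is one of several common definitions, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(sense_docs, top_n=10):
    """Build one TF-IDF vector per candidate sense, keeping its top_n weighted terms."""
    n_docs = len(sense_docs)
    doc_freq = Counter(term for tokens in sense_docs.values() for term in set(tokens))
    vectors = {}
    for sense, tokens in sense_docs.items():
        tf = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in tf.items()
        }
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        vectors[sense] = dict(top)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(keyword, context, sense_docs):
    """Pick the sense whose vector is most similar to the keyword-plus-context vector."""
    query = {term: 1.0 for term in [keyword, *context]}
    vectors = tfidf_vectors(sense_docs)
    return max(vectors, key=lambda sense: cosine(query, vectors[sense]))

# Hypothetical sense descriptions, loosely modelled on the slide's DBpedia candidates.
SENSES = {
    "BlackBerry": "blackberry line mobile e-mail smartphone phone hardware device trackball".split(),
    "Blackberry": "blackberry edible fruit plant berry".split(),
    "BlackBerry_(song)": "blackberry song black crowes".split(),
}

best = disambiguate("blackberry", ["phone", "hardware", "trackball", "mouse"], SENSES)
print(best)  # BlackBerry
```

The similarity values this toy data produces are not comparable to the figures in the slide's table, which come from the actual DBpedia descriptions.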
  24. 24. Performance of topic identification Evaluation settings  Evaluated a random sample of 1,816 posts (18,16%)  47 human evaluators  Each post and topics identified shown to 3 different evaluators  Evaluation options: 1. The topic is not related with the post 2. The topic is somehow related with the post 3. The topic is closely related with the post 4. The evaluator has not enough information for taking a decision  Fleiss’ kappa test l Strength of agreement for 2 evaluators = 0.826 (very good) l Strength of agreement for 3 evaluators = 0.493 (moderate)Comparing user generated content published in different social media sources ⎢24
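For readers unfamiliar with the agreement statistic reported on this slide, Fleiss' kappa can be computed from a table of per-item category counts as sketched below. The function follows the standard textbook definition; the sample ratings are invented (three evaluators, four evaluation options) and are unrelated to the study's data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of rows; row i holds how many raters chose each category for item i.

    Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that fell into each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)

# Invented example: 4 posts, 3 evaluators, 4 options
# (not related, somehow related, closely related, not enough information).
sample = [
    [3, 0, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 3, 0],
    [1, 1, 1, 0],
]
print(round(fleiss_kappa(sample), 3))
```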
  25. 25. Performance of topic identification Evaluation Results  Precision depends on the channel l From 59.19% for social networks  More misspellings  More common nouns l To 88.89% for review sites  Concrete products and brands  Proper nouns tend to have a Wikipedia entry  Context selection criteria also depends on the channel l Active context selection better for microblogs and review sites l Considering all the post keywords as context better for blogs l Without context selection is better for the rest of the cases (almost all the channels)  Naïve default sense selection is effectiveComparing user generated content published in different social media sources ⎢25
  26. Conclusions
  27. 27. Conclusions We have found differences among social media sources for every experiment executed  Distribution of PoS tagging vary across different sources l Since PoS tagging is a previous step for many NLP techniques, the performance of such techniques may be affected  E.g. Using nouns as context for performing term disambiguation.  More nouns → More context  E.g. Adjectives and adverbs for performing sentiment analysis  Language identification is less accurate for content extracted from Twitter  Sentiment analysis is less accurate for content extracted from Twitter  Precision of topic identification also depends on the source l With respect to context selection there is not a technique that performs better for all the sourcesComparing user generated content published in different social media sources ⎢27
  28. Thank you! oscar.munoz@havasmedia.com
