Data Cleaning for social media knowledge extraction


Social media platforms let users share their opinions through textual or multimedia content. In many settings, this content becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media to understand the stance, opinion, and sentiment of their customers, audience, and potential audience. This is crucial for them because it lets them understand trends and identify future commercial and marketing opportunities.

However, all of this relies on a solid and reliable data collection phase, which ensures that all analyses, extractions, and predictions are applied to clean and focused data. Indeed, the typical topic-based collection of social media content, performed through keyword-based search, yields very noisy results.

We recently carried out a simple study aimed at cleaning the data collected from social content within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content. It applies supervised machine learning, namely classification, to data collected from social media platforms through keywords on a specific topic. We define a general method and then validate it through an experiment on data extracted from Twitter about a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
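
The idea of filtering off-topic content with a supervised classifier can be illustrated with a minimal sketch. The description does not name the classifier that was used, so the multinomial Naive Bayes model below is an assumption chosen only because it fits in a few lines. The class name `NaiveBayesRelevance`, the helper `tokenize`, and the four training tweets (two adapted from the examples in the slides, two invented) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word, hashtag and mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())


class NaiveBayesRelevance:
    """Multinomial Naive Bayes with add-one smoothing (illustrative assumption,
    not necessarily the classifier used in the study)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # skip tokens never seen in training
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Hypothetical toy training set; a real one would be labeled by domain experts.
train_texts = [
    "Best #Hotel Deals in #Pompei starting at EUR99.60",
    "Cheap hotel deals near #Colosseo book now",
    "Pompei Hero Pliny the Elder May Have Been Found #archeology #history",
    "New exhibition on Roman history opens at #Colosseo",
]
train_labels = ["non-relevant", "non-relevant", "relevant", "relevant"]

model = NaiveBayesRelevance().fit(train_texts, train_labels)
prediction = model.predict("hotel deals in #Pompei")
print(prediction)
```

Both classes contain the topic keywords, which is why keyword matching alone cannot separate them; the classifier relies on the surrounding words instead.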

For this case, we collaborated with domain experts to label the dataset, and then evaluated and compared the performance of classifiers trained with different feature extraction strategies.
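
Comparing classifiers requires a common set of scores. The slides report accuracy, precision, recall and F1 under 10-fold cross-validation; the sketch below shows how those four scores and the fold boundaries could be computed by hand. The function names `binary_metrics` and `k_fold_indices` and the toy labels are hypothetical, and the folds are taken in order without shuffling, which is a simplification.

```python
def binary_metrics(y_true, y_pred, positive="relevant"):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


# Hypothetical gold labels and predictions for six tweets.
y_true = ["relevant", "relevant", "relevant",
          "non-relevant", "non-relevant", "non-relevant"]
y_pred = ["relevant", "relevant", "non-relevant",
          "relevant", "non-relevant", "non-relevant"]
scores = binary_metrics(y_true, y_pred)

# Fold sizes for a dataset of 726 items split into 10 folds.
fold_sizes = [len(test) for _, test in k_fold_indices(726, 10)]
print(scores, fold_sizes)
```

In a full evaluation, a model would be trained on each `train_idx` split, scored on the matching `test_idx` split, and the per-fold scores averaged.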

Data Cleaning for social media knowledge extraction

  1. Emre Calisir, Marco Brambilla. KDWEB 2018, Cáceres, Spain. The Problem of Data Cleaning for Knowledge Extraction from Social Media
  2. Knowledge Extraction from Social Media is a Need
  3. Keyword- or hashtag-based filtering is insufficient
  4. Is it possible to extract only those content items that are actually relevant to the topic or context of interest?
  5. Examples of Related Studies
     1. Earthquake alarm system: Sakaki et al., Proc. of the 19th Int. Conference on WWW, 2010
     2. Detection of influenza-like illnesses: Culotta, Proc. of the 1st Workshop on Social Media Analytics, Washington, D.C., 2010
     3. Discovering health topics: Paul & Dredze, PLoS ONE 9, e103408, 2014
     4. Detection of prescription medication abuse: Sarker et al., Drug Safety, 2016
     5. Tracking baseball and fashion topics: Lin et al., KDD, 2011
     6. Event detection system: Kunneman & Bosch, BNAIC, 2014
     7. Credibility of trending-topic hashtag usage: Castillo et al., Proc. of the 20th Int. Conference on WWW, ACM, Hyderabad, India, 2011
     8. Non-relevant tweet filtering: Hajjem & Latiri, Procedia Computer Science, 112, 2017
  6. Supervised learning trained on annotated data could help us
  7. Overview (diagram labels): Topic Relevancy Detection Machine; Topic Relevant Dataset
  8. Proposed Data Cleaning Method for Knowledge Extraction
  9. Use Case: Cultural Institutions of Italy
  10. Non-relevant tweet: "Best #Hotel Deals in #Pompei #HotelDegliAmiciPompei starting at EUR99.60". Relevant tweet: "Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later #2017Rewind #archeology #history #Pompei #rome #RomanEmpire"
  11. 4 feature extraction strategies evaluated: n-grams (unigrams, bigrams, trigrams); Word2Vec; Word2Vec + additional tweet features; dimensionality reduction with PCA
  12. Annotated data: 726 tweets containing specific hashtags and keywords related to Pompei, Colosseo and Teatro alla Scala. The data is 50% relevant and 50% non-relevant.
  13. Model 1: Text transformation to n-grams. Number of [unigrams, bigrams, trigrams]: [494, 287, 228]; vocabulary size: 1009 words
  14. Model 2: Text transformation to word2vec
     • The word2vec dimension is set to 25.
     • The word2vec vocabulary is built from 12K unlabeled tweets.
     • Preprocessing before building the word2vec model: convert to lowercase; discard web links and words shorter than 3 characters; eliminate stopwords.
  15. Model 2: Text transformation to word2vec
  16. Model 2: Text transformation to word2vec
  17. Model 3: word2vec + additional features. Example of tweet and author attributes: Full text: #nuovi #corsi #inglese #settembre #pompei #chiamaci per #info; Number of Friends: 4; Number of Followers: 9; Number of Lists: 15; Number of Favourited Tweets: 0; Language: en; Number of Tweets: 4220; Source: PostPickr; Verified Account: False; Number of Favorited: 0; Geo Enabled: False; Number of Retweets: 0; Default Profile: False
  18. Model 4: PCA applied on Model 3
  19. 10-fold cross-validated results

      Metric     Model 1  Model 2  Model 3  Model 4
      Accuracy   0.84     0.81     0.82     0.83
      Precision  0.84     0.78     0.83     0.84
      Recall     0.83     0.86     0.8      0.81
      F1         0.83     0.82     0.81     0.82

      Model 1: n-grams. Model 2: word2vec. Model 3: word2vec + additional features. Model 4: PCA applied on Model 3.
  20. Conclusions: supervised machine learning techniques could help to obtain topic-relevant social media data
  21. Challenges ahead: collecting more data to build a larger Word2Vec vocabulary; new use cases
  22. Thanks! Questions? Emre Calisir, Marco Brambilla (@marcobrambi)
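
The preprocessing steps listed for Model 2 (lowercasing, discarding web links and words shorter than 3 characters, removing stopwords) and the unigram, bigram and trigram features of Model 1 can be sketched as follows. This is an illustration under assumptions: the slides do not say which stopword list or tokenizer was used, nor whether the n-gram model shared this preprocessing, so `STOPWORDS`, `URL_PATTERN`, `preprocess`, `extract_ngrams`, and the example URL are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword list; the real list is not given.
STOPWORDS = {"the", "and", "for", "have", "been", "may", "per"}

# Matches http(s) links and bare www. links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(text, min_length=3):
    """Lowercase, drop web links, short words and stopwords."""
    text = URL_PATTERN.sub(" ", text.lower())
    tokens = text.split()  # whitespace tokenization; punctuation is kept
    return [t for t in tokens if len(t) >= min_length and t not in STOPWORDS]


def extract_ngrams(tokens, max_n=3):
    """Return all contiguous n-grams for n = 1..max_n as space-joined strings."""
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tweet = ("Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later "
         "https://example.com/x #history")
tokens = preprocess(tweet)
features = extract_ngrams(tokens)
print(tokens)
print(len(features))
```

Chaining the two functions here is only for demonstration; counting how often each n-gram occurs across the labeled tweets would then give a vocabulary like the 1009-entry one reported for Model 1.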