Emre Calisir, Marco Brambilla
KDWEB2018, Cáceres, Spain
The Problem of Data Cleaning
for Knowledge Extraction
from Social Media
Knowledge Extraction
from Social Media
is a Need
Keyword- or hashtag-based filtering is insufficient
Is it possible to extract a sub-selection of content items if and only if they are actually relevant to the topic or context of interest?
Examples of Related Studies
1. Earthquake alarm system, Sakaki et al., Proc. of the 19th Int. Conf. on WWW, 2010
2. Detection of influenza-like illnesses, Culotta, Proc. of the 1st Workshop on Social Media Analytics, Washington, D.C., 2010
3. Discovering health topics, Paul & Dredze, PLoS ONE 9, e103408, 2014
4. Detection of prescription medication abuse, Sarker et al., Drug Safety, 2016
5. Tracking baseball and fashion topics, Lin et al., KDD, 2011
6. Event detection system, Kunneman & van den Bosch, BNAIC, 2014
7. Credibility of trend-topic hashtag usage, Castillo et al., Proc. of the 20th Int. Conf. on WWW, ACM, Hyderabad, India, 2011
8. Non-relevant tweet filtering, Hajjem & Latiri, Procedia Computer Science, 112, 2017
Supervised learning
trained on annotated
data could help us
Overview
[Diagram: a Topic Relevancy Detection Machine, trained on annotated data, turns raw social media content into a topic-relevant dataset]
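The diagram boils down to a train-then-filter flow. A minimal sketch, assuming scikit-learn; `features()`, `annotated_tweets`, `labels`, and `new_tweets` are illustrative placeholders, while the linear-kernel SVM is the classifier named in the speaker notes (#7):

```python
from sklearn.svm import SVC

# Train the relevancy detector on the annotated tweets...
clf = SVC(kernel="linear")                    # linear kernel, per the notes
clf.fit(features(annotated_tweets), labels)   # labels: 1 = relevant, 0 = not

# ...then keep only the tweets predicted to be topic-relevant.
predictions = clf.predict(features(new_tweets))
topic_relevant_dataset = [t for t, p in zip(new_tweets, predictions) if p == 1]
```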
Proposed Data Cleaning Method for
Knowledge Extraction
Use Case
Cultural Institutions of Italy
Non-Relevant Tweet:
Best #Hotel Deals in #Pompei #HotelDegliAmiciPompei starting at EUR99.60 https://t.co/5DxkKn4o69

Relevant Tweet:
Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later https://t.co/PyR2rP1Xpe #2017Rewind #archeology #history #Pompei #rome #RomanEmpire
Four feature extraction strategies evaluated:
N-grams (unigrams, bigrams, trigrams)
Word2Vec
Word2Vec + additional tweet features
Dimensionality Reduction with PCA
Annotated Data
726 tweets, containing tweets with specific hashtags and keywords related to Pompei, Colosseo, and Teatro alla Scala.
The data is balanced: 50% relevant and 50% non-relevant.
Model 1: Text transformation to n-grams
Number of [unigrams, bigrams, trigrams]: [494, 287, 228]
Vocabulary size: 1009 words
[Figure: example of a tweet transformed into an n-gram feature vector]
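As a sketch of this transformation, assuming scikit-learn (the talk does not name the library), the counts above correspond to extracting unigrams through trigrams with TF-IDF weighting, which the speaker notes call a best practice; the two sample tweets here are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "best #hotel deals in #pompei starting at eur99.60",
    "pompei hero pliny the elder may have been found 2000 years later",
]

# Unigrams, bigrams, and trigrams, TF-IDF weighted; on the full dataset
# this yields 494 + 287 + 228 = 1009 n-gram features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)
print(X.shape)  # (number of tweets, number of n-gram features)
```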
Model 2: Text transformation to word2vec
• Word2vec dimension selected as 25.
• Word2vec vocabulary built from 12K unlabeled tweets.
• Preprocessing operations before building the word2vec model:
  • convert to lowercase
  • discard web links
  • discard words shorter than 3 characters
  • eliminate stopwords
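A sketch of this preprocessing and training step, assuming gensim >= 4 (older releases call the dimension parameter `size` instead of `vector_size`); the stopword list and sample tweets below are illustrative stand-ins for the 12K unlabeled tweets:

```python
import re
from gensim.models import Word2Vec

STOPWORDS = {"the", "in", "at", "per"}  # placeholder stopword list

def preprocess(tweet):
    tweet = tweet.lower()                           # convert to lowercase
    tweet = re.sub(r"https?://\S+", "", tweet)      # discard web links
    return [w for w in tweet.split()
            if len(w) >= 3 and w not in STOPWORDS]  # drop short words, stopwords

unlabeled_tweets = [
    "Best #Hotel Deals in #Pompei starting at EUR99.60 https://t.co/5DxkKn4o69",
    "Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later",
]
corpus = [preprocess(t) for t in unlabeled_tweets]
model = Word2Vec(corpus, vector_size=25, min_count=1)  # 25-dim embeddings
```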
Model 2: Text transformation to word2vec
[Figure: top-3 most similar words for selected terms, with vector distances in parentheses]
[Figure: 2D plot of the vector representations of a few selected words]
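Per the speaker notes, each tweet is then embedded as the average of its word vectors, and the similarity figure reports top-3 neighbors. A sketch reusing `preprocess` and `model` from above, with numpy assumed:

```python
import numpy as np

def tweet_vector(tokens, model):
    # Average the vectors of the tweet's in-vocabulary words.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

vec = tweet_vector(preprocess("Pliny the Elder found in #Pompei"), model)

# Top-3 most similar words for a term, as in the similarity figure.
print(model.wv.most_similar("#pompei", topn=3))
```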
Model 3: word2vec + Additional Features
Example:
Full text: #nuovi #corsi #inglese #settembre #pompei #chiamaci per #info https://t.co/QRrXlMC0g1
(Italian: new English courses in September, Pompei, call us for info)
Tweet features:
• Language: en
• Source: PostPickr
• Number of Favorited: 0
• Number of Retweets: 0
Author features:
• Number of Friends: 4
• Number of Followers: 9
• Number of Lists: 15
• Number of Favourited Tweets: 0
• Number of Tweets: 4220
• Verified Account: False
• Geo Enabled: False
• Default Profile: False
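The speaker notes name MinMaxScaler for the numerical features and OneHotEncoder for the categorical ones, concatenated with the word2vec tweet vector. A sketch of that assembly using the example above (the exact split of columns into numerical and categorical is an assumption), reusing `preprocess`, `model`, and `tweet_vector` from the Model 2 sketches:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numerical = np.array([[4, 9, 15, 0, 4220, 0, 0]])      # friends, followers, lists, ...
categorical = np.array([["en", "PostPickr", "False"]]) # language, source, verified

num_scaled = MinMaxScaler().fit_transform(numerical)
cat_onehot = OneHotEncoder().fit_transform(categorical).toarray()

text = "#nuovi #corsi #inglese #settembre #pompei #chiamaci per #info"
text_vec = tweet_vector(preprocess(text), model)       # 25-dim word2vec average

features = np.hstack([text_vec.reshape(1, -1), num_scaled, cat_onehot])
```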
Model 4: PCA applied on Model 3
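Per the notes on the dimensionality analysis, 40 principal components gave the best trade-off (below 40 the model underfits, above 40 it overfits). A minimal sketch, assuming scikit-learn, with `X_model3` as a hypothetical name for the stacked Model 3 feature matrix:

```python
from sklearn.decomposition import PCA

# Reduce the Model 3 features to 40 dimensions, per the analysis above.
pca = PCA(n_components=40)
X_reduced = pca.fit_transform(X_model3)  # X_model3: one Model 3 row per tweet
```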
10-fold Cross-Validated Results

Model       1      2      3      4
Accuracy    0.84   0.81   0.82   0.83
Precision   0.84   0.78   0.83   0.84
Recall      0.83   0.86   0.80   0.81
F1          0.83   0.82   0.81   0.82

Model 1: n-grams
Model 2: word2vec
Model 3: word2vec + additional features
Model 4: PCA applied on Model 3
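The speaker notes explain that 10-fold cross-validation was preferred because the labeled dataset is small (726 tweets), with a linear-kernel SVM as the classifier. A minimal sketch of that evaluation, assuming scikit-learn; `X` stands for any of the four feature matrices and `y` for the relevant/non-relevant labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 10-fold CV, as used for the table above; F1 is one of the reported metrics.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10, scoring="f1")
print(scores.mean())
```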
Conclusions
Supervised machine learning techniques could help to obtain topic-relevant social media data
Challenges ahead:
• Collecting more data to build a larger Word2Vec vocabulary
• New use cases
THANKS!
QUESTIONS?
Emre Calisir, Marco Brambilla
The Problem of Data Cleaning for Knowledge Extraction from Social Media
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi


Editor's Notes

  • #3 500 million tweets are posted each day by 100 million daily active users. People tend to share every part of their lives and opinions. Every sector wants to extract knowledge from social media: banking, health, tourism...
  • #4 Data is created by humans, so it is very error-prone, and words can have multiple meanings. Rule-based filtering systems fail (by rule-based systems we mean systems that collect tweets using only queries): they bring related and unrelated content together.
  • #5 Our research question.
  • #6 All of these studies are based on Twitter data, covering various use cases and various machine learning techniques: supervised for studies 1–7 (SVM, Multinomial Bayes, Logistic Regression, Decision Trees, Bayesian Networks) and unsupervised for the last one (Latent Dirichlet Allocation). We benefit from these studies. Early alert system: SVM classifier, with keywords and the context of target-event words as features. Detection of illnesses: logistic regression over bag-of-words. Filtering out non-relevant content: logistic regression over unigrams, bigrams, and trigrams, very similar to the first model in our study. Discovery of patterns of abuse of specific medications. Topic tracking on tweet streams, again with unigrams. Discarding noisy data for an event detection system based on Twitter stream data. Assessment of the credibility of trending topics, to check whether they are really related to the topic of the hashtag used, or spam; they tried SVM, decision trees, and Bayes. Unsupervised technique: a pooling method using information retrieval and Latent Dirichlet Allocation.
  • #7 An annotated dataset, labeled as relevant or non-relevant to the topic. We train an SVM-based model on the labeled data and predict labels for new, unlabeled data. We selected Support Vector Machines with a linear kernel, the recommended approach when the features are sparse vectors.
  • #8 Our basic approach to obtaining a relevant dataset. We applied it to Twitter, but it is also applicable to any kind of social media data.
  • #9 A more detailed illustration of the general flow: first we build our model, then we make the prediction, and finally we have a clean dataset.
  • #10 Pompei, Colosseo, and Teatro alla Scala.
  • #11 Imagine that we have a social media monitoring tool and want to track tweets related to the historical value of Pompei. How do we do it? Label relevant and non-relevant data. Who will label it? Subject experts.
  • #13 Subject experts labeled the tweets as relevant or non-relevant. A random guess would predict with 0.5 accuracy.
  • #14 N-grams are a widely used technique in text classification. This slide shows an example of how we transform a tweet into a feature vector. We also applied TF-IDF to increase accuracy, a best practice in n-gram usage.
  • #15 The default word2vec dimension is 100; however, our dataset is limited. We performed an analysis and achieved better results by representing words with 25 features. The larger the word2vec vocabulary, the better the semantic relations it creates.
  • #16 Top-3 similarities shown for specific words; the vectorial distance between words is given in parentheses. For a given tweet, we calculate the average value of its word vectors.
  • #17 Another illustration of the trained word2vec model. This graph is generated with a few words, just to show the vectorial representations of words.
  • #18 Feature extraction strategy: vector features with Word2Vec, numerical features with MinMaxScaler, categorical features with OneHotEncoder. Dimension of features after transformation: dim(Word2Vec) = 25, dim(numerical features) = 7, dim(one-hot encoding) = 68; total number of features = 125.
  • #19 The target is to analyze the impact of dimensionality reduction, which could improve model accuracy. Graphic: with a target dimension below 40 the model underfits; above 40 it overfits. We selected a dimension of 40.
  • #20 We preferred cross-validation because the size of our labeled dataset is limited (726 tweets). All the classification models are successful. The dotted line shows a random guess (the data contains 50% relevant and 50% non-relevant tweets). Model 1 has the best performance and Model 4 the second best. With a larger word2vec vocabulary, word2vec could achieve better accuracy than n-grams. The ROC curves and AUC scores confirm the performance of our models.
  • #21 We addressed our research question and showed that text classification is a very convenient way to obtain topic-relevant data.
  • #22 We still have a way to go toward building a more accurate system. We are now creating a larger dataset, and we will find new use cases as well as open datasets to compare our results with the literature. We will also try other algorithms, not only SVM, and we can bring in external data by importing content from the web links in tweets.