Sentiment Analysis and Political Disaffection in Italy

Slideshow for my master's thesis. I built a classification system for politics-related concepts on Italian microblogging, and applied it to political disaffection, measuring the correlation between Twitter and h polls.

Transcript

  • 1. Sentiment Analysis for Italian-language microblogging: development and evaluation of an automatic system
    Corrado Monti, 22-04-2013
    Outline: Introduction; Development of the classification system; Application and experimental results; Conclusions
  • 2. Twitter
    Main microblogging platform: instant publication of short textual content
    Tweet → 140 characters
    Widespread in Italy too (December 2012):
        Italian internet users     28.6 M people   100%
        Have heard of Twitter      25.3 M people   88.6% of internet users
        Weekly Twitter users       4.7 M people    16.5% of internet users
        Daily Twitter users        1.24 M people   4.4% of internet users
  • 3. Textual classification
    We would like to classify millions of texts
    Rule-based approach: rules (keywords, regular expressions) defined together with field experts
        Inaccurate, hard to define
    Supervised machine learning
        Needs a training set to learn how to classify new texts
        Every text is transformed into a sequence of numerical features; the algorithm learns what these numbers mean
    Recent problem: Sentiment Analysis → recognize the sentiment expressed by the author of the text
        Many applications, both industrial and academic
  • 4. Measuring collective feelings
    One of the first works is O’Connor et al. (2010), who measured economic confidence through Sentiment Analysis on Twitter
    [Figure: Twitter sentiment ratio and Gallup Economic Confidence plotted against date, 2008-01 to 2009-11]
    For which phenomena is this possible?
    How valid are these measures? Do they work in Italy too?
  • 5. Goals of this work
    Hypothesis: can political disaffection be measured through massive tweet classification?
        It is a relevant phenomenon, especially in Italy
        A lot of interest, both academic (sociology) and non-academic
    We would also like to build reusable classifiers
  • 6. Political disaffection
    How to define a “disaffected” tweet?
    According to domain experts, it must
        1. have a politics-related topic → topic detection
        2. have negative sentiment → sentiment analysis
        3. be directed at all politicians
    Flowchart: Tweet → Is political? (No: discarded) → Is negative? (No: discarded) → Is general? (No: discarded) → Politically disaffected tweet
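The three checks on this slide can be read as a chain of filters. The sketch below is only an illustration of that control flow: the function `disaffected_tweets` and the three toy predicates are hypothetical stand-ins, not the classifiers built in the thesis.

```python
from typing import Callable, Iterable, List

Predicate = Callable[[str], bool]

def disaffected_tweets(tweets: Iterable[str],
                       is_political: Predicate,
                       is_negative: Predicate,
                       is_general: Predicate) -> List[str]:
    """Keep only tweets that pass all three checks, in the slide's order."""
    kept = []
    for tweet in tweets:
        if not is_political(tweet):   # step 1: topic detection
            continue
        if not is_negative(tweet):    # step 2: sentiment analysis
            continue
        if not is_general(tweet):     # step 3: directed at all politicians
            continue
        kept.append(tweet)
    return kept

# Toy predicates (assumptions, for illustration only).
def toy_is_political(t: str) -> bool:
    return "politic" in t.lower()

def toy_is_negative(t: str) -> bool:
    return "shame" in t.lower()

def toy_is_general(t: str) -> bool:
    return "all politicians" in t.lower()

sample = [
    "Nice weather today",
    "Politics is great today",
    "Shame on this one politician",
    "All politicians are a shame",
]
result = disaffected_tweets(sample, toy_is_political,
                            toy_is_negative, toy_is_general)
```

Ordering the cheapest or most selective check first keeps the later, more expensive classifiers from running on tweets that would be discarded anyway.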
  • 7. Training Set
    28 340 tweets labelled by 40 political science students
    Between April and June 2012
    3 labellers for every text
    We keep in the dataset only tweets with a unanimous “political” label
    For the other labels we measure agreement with Krippendorff’s α:
        “negative” → 0.78 → reliable labels
        “generic” → 0.41 → much noise in the labels ×
    Resulting dataset:
                        negative   non-negative
        political       7 965      4 544
        not political   15 831
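For readers unfamiliar with the agreement measure on this slide, here is a minimal sketch of Krippendorff's α for nominal labels, computed from the coincidence matrix as α = 1 − (n − 1) · Σ o_ck / Σ n_c n_k over c ≠ k. The function name and the toy data are assumptions; the slides do not say which implementation was used.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of labels per item (one label per labeller).

    Items with fewer than two labels cannot be paired and are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different labellers counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()                      # n_c: pairable values per category
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        # Only one category was ever used; alpha is undefined, and
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - (n - 1) * observed / expected

# Toy example: three items, two labellers each.
alpha = krippendorff_alpha_nominal([[1, 1], [1, 0], [0, 0]])
```

Values near 1 indicate reliable labels and values near 0 indicate agreement no better than chance, which is why the slide treats 0.78 as reliable and 0.41 as noisy.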
  • 8. 1 – Topic Detection: politics
    Time-robust classifier
        Training set extension: 17 388 news titles (January-October 2012)
    Best feature extraction:
        5-grams of characters → uniformity across different spacings
        tf-idf
        Discard terms with fewer than 4 occurrences
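The feature extraction listed on this slide could look roughly like the sketch below. Several details are assumptions because the slides do not specify them: whitespace is collapsed before extracting n-grams, the "fewer than 4 occurrences" cut is applied to total corpus counts, and idf is taken as log(N / df).

```python
import math
from collections import Counter

def char_ngrams(text, n=5):
    """Character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]

def tfidf_vectors(texts, n=5, min_count=4):
    """Return one sparse {ngram: weight} dict per text."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]

    total = Counter()
    for c in counts:
        total.update(c)
    # Drop n-grams seen fewer than min_count times in the whole corpus.
    vocab = {g for g, k in total.items() if k >= min_count}

    # Document frequency: in how many texts each kept n-gram appears.
    df = Counter(g for c in counts for g in c if g in vocab)
    n_docs = len(texts)

    return [
        {g: tf * math.log(n_docs / df[g]) for g, tf in c.items() if g in vocab}
        for c in counts
    ]

texts = ["governo monti"] * 4 + ["partita di calcio"]
vectors = tfidf_vectors(texts)
```

With this idf variant an n-gram that appears in every text gets weight zero, which is the usual motivation for tf-idf: terms that do not discriminate between texts are ignored.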
  • 9. 1 – Topic Detection: classification
    45 728 points with 78 642 features
    SMO SVM solver, k-Nearest Neighbor, Random Forest and kernel methods are too resource-hungry
    We preferred online algorithms
    ALMA, Passive-Aggressive, PEGASOS, OIPCAC gave good results
        Classifier   Accuracy       F-Measure     Global time
        ALMA         0.88 ± 0.01    0.87 ± 0.01   13.5 ± 1
        PA           0.89 ± 0.01    0.89 ± 0.01   10.6 ± 0.1
        PEGASOS      0.88 ± 0.01    0.88 ± 0.01   1103 ± 10
        OIPCAC       0.89 ± 0.001   0.89 ± 0.01   5911 ± 52
    Other algorithms were tested, but with worse results (e.g. Naïve Bayes)
    We selected Passive-Aggressive: good results, low cost
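To show why an online algorithm is cheap on data of this size, here is a minimal sketch of the basic binary Passive-Aggressive update on sparse feature dictionaries. The slides do not say which PA variant or implementation was used, so this plain version (no aggressiveness parameter) and the toy feature names are assumptions.

```python
def dot(weights, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(k, 0.0) * v for k, v in x.items())

def pa_train(examples, epochs=1):
    """examples: iterable of (features, label) with label in {-1, +1}."""
    weights = {}
    for _ in range(epochs):
        for x, y in examples:
            loss = max(0.0, 1.0 - y * dot(weights, x))   # hinge loss
            norm_sq = sum(v * v for v in x.values())
            if loss > 0.0 and norm_sq > 0.0:
                # Smallest step that makes this example satisfy the margin.
                tau = loss / norm_sq
                for k, v in x.items():
                    weights[k] = weights.get(k, 0.0) + tau * y * v
    return weights

def pa_predict(weights, x):
    return 1 if dot(weights, x) >= 0 else -1

# Toy data: +1 = political, -1 = not political (hypothetical features).
train = [
    ({"gov": 1.0}, 1),
    ({"goal": 1.0}, -1),
    ({"gov": 1.0, "vote": 1.0}, 1),
    ({"goal": 1.0, "match": 1.0}, -1),
]
model = pa_train(train)
```

Each example is visited once per epoch and touches only its own non-zero features, so memory and time grow with the number of non-zeros rather than with the full 78 642-dimensional feature space.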
  • 10. 2 – Sentiment Analysis: feature extraction
    Different approaches were tested:
        n-grams of characters, words, n-grams of words
        boolean presence, term frequency, tf-idf, fuzzy match between n-grams
    Best: term frequency of single words
    Adjustments:
        Fraction of uppercase words as a feature
        Features of synonyms were joined together
    Other adjustments did not give better results:
        Treating words at the beginning or at the end of the tweet as different words
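A minimal sketch of the winning feature set on this slide: word term frequencies, one extra feature for the fraction of uppercase words, and synonyms merged onto a shared feature. The tokenisation, the feature name `__uppercase_fraction__`, and the synonym table are assumptions made for illustration.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def sentiment_features(tweet, synonyms=None):
    """Return {feature: value} with word counts plus the uppercase fraction.

    synonyms maps a word to the canonical word whose feature it shares.
    """
    synonyms = synonyms or {}
    words = tweet.split()
    features = Counter()
    for word in words:
        token = word.lower().strip(PUNCTUATION)
        if token:
            features[synonyms.get(token, token)] += 1
    # Words written entirely in capitals, ignoring single characters.
    upper = sum(1 for w in words if len(w) > 1 and w.isupper())
    features["__uppercase_fraction__"] = upper / len(words) if words else 0.0
    return dict(features)

example = sentiment_features("CHE vergogna, che schifo",
                             synonyms={"schifo": "vergogna"})
```

In this example "schifo" is counted under "vergogna", so the two words reinforce a single feature instead of splitting the evidence.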
  • 11. 2 – Removal of the target of sentiment
    We do not want the sentiment to be guessed from specific entities (PD, Berlusconi...). It should be based on the surrounding words.
    Training set prejudices ⇒ overfitting
    Problem sometimes ignored, especially in Italian-language Sentiment Analysis
    How can we decide which words must be removed?
    Proposed solution: ontology-based filter
    Flowchart: DBpedia → SPARQL query → first list of objects → appeared in text from online newspapers? (No: discarded; Yes: list of objects to remove)
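The second half of the flowchart on this slide can be sketched as follows, assuming the candidate names have already been obtained from DBpedia (the SPARQL step is not shown). Both helper functions are hypothetical, and replacing entities with a placeholder token rather than deleting them is an assumption.

```python
import re

def entities_to_remove(candidates, news_texts):
    """Keep only candidate names that actually appear in the news corpus."""
    corpus = " ".join(news_texts).lower()
    return {
        name for name in candidates
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", corpus)
    }

def mask_entities(tweet, entities, placeholder="ENTITY"):
    """Replace every listed entity so the classifier cannot learn from it."""
    # Longest names first, so multi-word names are masked before their parts.
    for name in sorted(entities, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        tweet = re.sub(pattern, placeholder, tweet, flags=re.IGNORECASE)
    return tweet

candidates = {"Berlusconi", "PD", "Zzyzx Party"}   # hypothetical query result
news = ["Berlusconi attacca il PD"]
to_remove = entities_to_remove(candidates, news)
masked = mask_entities("Il PD e Berlusconi sbagliano", to_remove)
```

Checking candidates against newspaper text keeps the removal list focused on entities people actually talk about, instead of every politician or party the ontology knows.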
  • 12. 2 – Sentiment Analysis: classification
    12 476 points with 2 383 features
        Classifier   Accuracy      F-Measure     Global time
        ALMA         0.70 ± 0.03   0.75 ± 0.03   0.82 ± 0.3
        PA           0.67 ± 0.06   0.71 ± 0.12   0.9 ± 10⁻³
        PEGASOS      0.69 ± 0.03   0.73 ± 0.05   76 ± 0.1
        OIPCAC       0.71 ± 0.03   0.75 ± 0.02   121 ± 25
        RF           0.72 ± 0.03   0.78 ± 0.03   2173 ± 48
    We select ALMA: good results, low cost
  • 13. 3 – Genericity: rule-based approach
    Problem: select tweets directed at all politicians
    Unhelpful training set labels → we simplify the problem and decide to use a rule-based approach
    Rules used:
        Presence of keywords defined with field experts (based on the most “secure” part of the training set)
        ∨
        Entities from at least three different political areas (identified through DBpedia ontologies)
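The disjunction on this slide translates almost directly into code. In the sketch below the keyword list and the entity-to-area table are hypothetical placeholders; the real ones came from field experts and DBpedia and are not given in the slides.

```python
import re

def is_generic(tweet, keywords, entity_area, min_areas=3):
    """True if the tweet matches a keyword OR names entities from
    at least min_areas different political areas."""
    text = tweet.lower()
    if any(keyword in text for keyword in keywords):
        return True
    tokens = set(re.findall(r"\w+", text))
    areas = {area for name, area in entity_area.items() if name in tokens}
    return len(areas) >= min_areas

# Hypothetical rule data, for illustration only.
keywords = {"tutti i politici"}
entity_area = {"pd": "area_a", "pdl": "area_b", "lega": "area_c"}

by_keyword = is_generic("Tutti i politici sono uguali", keywords, entity_area)
by_areas = is_generic("PD, PdL e Lega: nessuna differenza", keywords, entity_area)
single_target = is_generic("Il PD ha sbagliato", keywords, entity_area)
```

Matching entities on whole tokens rather than substrings avoids counting "pd" inside "pdl" as a second entity.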
  • 14. Application of the classification system
    Hypothesis: by applying this classification system to tweets from April-October 2012, can we obtain a good measure of the diffusion of political disaffection in society?
  • 15. Surveys
    Surveys are courtesy of IPSOS
    We compute two indexes:
        1. Inefficacy: fraction of Italians who, for every party, say that their propensity to vote for it is 1 on a 1-to-10 scale
            Sociologists say that this captures the sentiment of inefficacy perceived by citizens, a symptom of political disaffection
        2. Non-vote: fraction of Italians who will not vote
            It measures a behaviour
            Influenced by other factors: election proximity, “moral” perception of abstention...
    We have a survey every ∼10 days in April-October
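As a worked reading of the two definitions on this slide, the sketch below computes both indexes from toy survey records. The data layout (one dictionary of party → propensity per respondent, one boolean per respondent for voting intention) is an assumption; the actual IPSOS format is not described in the slides.

```python
def inefficacy_index(responses):
    """Fraction of respondents whose propensity to vote is 1 for every party.

    responses: list of {party: propensity}, propensity on a 1-to-10 scale.
    """
    if not responses:
        return 0.0
    disaffected = sum(
        1 for answers in responses
        if answers and all(score == 1 for score in answers.values())
    )
    return disaffected / len(responses)

def non_vote_index(will_vote):
    """Fraction of respondents who say they will not vote."""
    if not will_vote:
        return 0.0
    return sum(1 for v in will_vote if not v) / len(will_vote)

# Toy survey with two hypothetical parties "A" and "B".
survey = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 7},
    {"A": 10, "B": 1},
    {"A": 1, "B": 1},
]
inefficacy = inefficacy_index(survey)
non_vote = non_vote_index([True, False, True, True])
```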
  • 16. Tweet sample
    We gather a set of users active in October and we also sample their followers
    We obtain 261 313 users active in October
        ∼5% of Italian users and ∼10% of active Italian users in October 2012
    We select only those active in April → 167 557 users
    Scraping of their tweets in this time period: 35 882 423 tweets
  • 17. Comparison
    For every survey date, we compute the ratio between disaffected tweet volume and political tweet volume in a certain time window. We consider:
        ∆^14_1: last two weeks
        ∆^14_7: between 14 and 7 days before
        ∆^7_1: last week
    [Figure: three June 2012 calendars, one per time window]
    We compare these ratios with polls through the Pearson correlation index
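The comparison described on this slide has two parts: a windowed ratio per survey date and a Pearson correlation between the ratios and the poll values. The sketch below assumes daily counts are available as a dictionary and that both window endpoints are inclusive; neither detail is stated in the slides.

```python
import math
from datetime import date, timedelta

def window_ratio(daily_counts, survey_day, max_days, min_days):
    """Disaffected / political tweet volume from max_days to min_days
    before survey_day (both inclusive).

    daily_counts: {date: (disaffected_count, political_count)}
    """
    disaffected = political = 0
    for offset in range(min_days, max_days + 1):
        d, p = daily_counts.get(survey_day - timedelta(days=offset), (0, 0))
        disaffected += d
        political += p
    return disaffected / political if political else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy counts for June 2012: d disaffected tweets out of 100 political ones.
counts = {date(2012, 6, d): (d, 100) for d in range(1, 31)}
last_week = window_ratio(counts, date(2012, 6, 15), max_days=7, min_days=1)
```

In practice one ratio would be computed per survey date and the resulting list passed to `pearson` together with the survey index values for the same dates.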
  • 18. Correlation results
    Inefficacy:
        ∆^max_min   ρ        95% confidence interval   P-value for ρ > 0
        ∆^14_1      0.7860   0.476-0.922               0.031%
        ∆^14_7      0.7749   0.454-0.917               0.042%
        ∆^7_1       0.6880   0.310-0.878               0.226%
    Non-vote:
        ∆^max_min   ρ        95% confidence interval   P-value for ρ > 0
        ∆^14_1      0.5579   0.190-0.788               0.567%
        ∆^14_7      0.5920   0.248-0.803               0.231%
        ∆^7_1       0.4433   0.049-0.718               3.00%
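The slides do not say how the confidence intervals were obtained or how many surveys entered each correlation. One standard choice is the Fisher z-transformation, sketched below; the sample size `n = 16` is purely hypothetical and the function is an illustration, not the author's procedure.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r
    estimated from n paired observations (requires n > 3)."""
    z = math.atanh(r)                      # Fisher z-transform of r
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Hypothetical sample size; the slides do not report it.
low, high = fisher_interval(0.7860, n=16)
```

The interval is asymmetric around ρ because the transform stretches values near ±1, which matches the shape of the intervals reported in the tables.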
  • 19. [Figure: Twitter disaffection ratio and Inefficacy indicator plotted against time, mid-April to early October]
  • 20. Interpretation
    The data seem to indicate a fairly strong correlation between disaffected tweets and the diffusion of the phenomenon in society
    This does not mean that Twitter is a representative sample
    We can guess that the quantity of discussion about this phenomenon is connected with how much it will spread
  • 21. Further investigation
    With this big tweet sample, we can study the causes of oscillations with text mining techniques
    We compute the ratio day by day and find out when disaffection peaks occur
    For every peak
        1. we take the news of that day
        2. we compare them with the mass of tweets of the peak
        3. which is the most similar news item?
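Steps 1 to 3 on this slide amount to a nearest-document search. The slides do not name the similarity measure, so the sketch below assumes cosine similarity over bag-of-words counts; the function names and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts, ignoring punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return shared / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_news(peak_tweets, headlines):
    """Return the headline closest to the pooled text of the peak's tweets."""
    tweets_vector = bag_of_words(" ".join(peak_tweets))
    return max(headlines,
               key=lambda h: cosine(tweets_vector, bag_of_words(h)))

peak_tweets = ["scandalo lega soldi", "vergogna lega"]
headlines = ["Scandalo Lega: indagini sui fondi", "Calcio: vince la Juve"]
best = most_similar_news(peak_tweets, headlines)
```

Pooling all tweets of the peak into one vector means the match reflects what the mass of tweets has in common, not any single message.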
  • 22. [Figure: daily Twitter disaffection ratio and Inefficacy indicator (polls) plotted against time, April to October; annotated peaks: Lega scandal, local elections (Amministrative), Brindisi bombing, conspiracy theories about state-sponsored massacres, Fiorito scandal]
  • 23. Conclusions
    We built classifiers for political messages
        Topic and sentiment classifiers are reusable
    We showed a correlation between the quantity of discussion on Twitter and the diffusion of a phenomenon
    Future work
        Classifying users as nodes of a social network
        Confirmation or refutation of models that could explain our data
  • 24. Thanks