Violence det ijcnlp13-slideshare

Presentation for the paper "A Weakly Supervised Bayesian Model for Violence Detection in Social Media", presented at IJCNLP 2013.

  • During the last two years we have witnessed the use of social media platforms as a medium for expressing different emotions within society, for example during the Middle East revolutions and the 2011 Japan earthquake. These services have become a proxy of information that communicates the social perception of situations involving, for example, terrorism, social crises, and racism, as well as extremist groups' propaganda. This project aims to leverage this continuous stream of information for detecting and tracking violent radicalization and extremism in social media, thereby becoming a sensor of the social perception of violent activities. The project aims to help in the prompt detection of situations that can lead to the diffusion of messages that could become influential triggers of violence.
  • In this work we focus on Twitter data. In particular, we aim to create models that can identify suspicious tweets offering insight into violent or criminal events as they happen. We seek to detect and extract topics related to violent and criminal activities from large-scale social media data in real time, and to constantly track any events identified as suspicious. Owing to the fast-evolving nature of social media, such a system will be very important for law enforcement to respond to potential security risks in a timely manner. This work aims to develop efficient computational tools for detecting violent radicalization and extremism from social media, which will ultimately help improve national security capability through the online monitoring function offered by the system.
  • But positive sentiments such as excitement can also appear in criminal activities such as rioting.
  • Characterising violence-related content in tweets presents several challenges. The constant redefinition of the vocabulary used to represent current events, and the generation of new jargon in these channels of communication, introduce new difficulties for traditional supervised models, which rely on labelled data. Classification methods trained on labelled data do not necessarily work with social media, since what we see is event-driven, with short life spans. This means that, to keep models tuned, it is necessary to learn continuously from social media in order to re-characterise the feature representation of an event.
  • There has been a large body of work on topic classification of short texts. Weakly supervised approaches include the JST model and the Partially-Labeled LDA model. These two models are part of our baselines, and we will discuss them in more detail later on. To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, and none has carried out deep violence-related topic analysis.
  • Previous approaches rely only on.. To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, none of which has carried out deep violence-related topic analysis.
  • One of the main challenges in detecting violence-related content is that this type of content is event-related, tending to occur during short to medium life-spans, therefore methods which rely only on labeled data can rapidly become outdated.
  • Rather than using traditional machine learning models, in this project we propose the use of a Bayesian model which allows the detection of violence-related topics from social media without the use of labelled data. In particular, prior knowledge capturing words typically expressing violence is derived from external knowledge sources and incorporated into model learning.
  • Consider the following tweet, which contains information about Travis Kvapil, a NASCAR racing driver who seems to have been involved in a domestic dispute.
  • The existing framework of LDA has three hierarchical layers, where topics are associated with documents, and words are associated with topics. In order to model document violence-polarity, we construct a violence detection model (VDM) by adding an additional violence label layer between the document and the topic layer. Hence, VDM is effectively a four-layer hierarchical Bayesian model, where violence labels are associated with documents, under which topics are associated with violence labels and words are associated with both violence labels and topics.
  • Although the model does not require labelled documents for learning, it does require as input a collection of words that are dominant in the topic of interest. Such a list of words is often called a lexicon. In our study, we explore two different sources for deriving violence-related lexicons: DBpedia and Twitter.
  • Our first experiment uses two corpora: the TREC Microblog corpus and DBpedia. DBpedia is the semantified version of Wikipedia. The latest version of DBpedia consists of over 1.8 million resources, which have been classified into 740 thousand Wikipedia categories and over 18 million Yago categories. Social knowledge sources constitute one of the largest repositories built in a collaborative manner. They provide an up-to-date channel of information and knowledge over a large number of topics. These ontologies enable broad coverage of entities in the world and allow entities to bear multiple overlapping types. One of the main advantages of using these knowledge sources for topic classification is that each particular topic is associated with a large number of resources.
  • We created our violence-related corpus by querying DBpedia for all articles belonging to categories and subcategories under the Violence category. After removing categories with fewer than 1,000 articles, we obtained a set of 14 categories.
  • In the case of the Twitter dataset, we selected the documents annotated by OpenCalais as being relevant to the topic of War & Conflict, and used the remaining tweets to derive the non-violent lexicon.
  • Here is an example of the violent and non-violent lexicons derived from these two sources using Relative Word Entropy (RWE). The first column presents a lexicon derived using the DBpedia corpus combined with the Twitter corpus. In this case we chunked the abstracts of all articles related to violent categories, in order to obtain documents of the same average size as a tweet, and used the non-violent documents from the Twitter corpus as the non-violent documents. The second column presents lexicons derived from the DBpedia corpus, and the third those from the Twitter corpus. We compare the performance of the proposed RWE metric against word priors derived using Information Gain (IG). After filtering features using Information Gain, we obtained the probability of a word given a category: P(W|C) = P(C|W) * P(W) / P(C). This measure weights the word as being relevant or not to the violent and non-violent categories.
  • When analysing the TREC corpus we observed that, as expected, there are very few violence-related documents compared with the massive number of tweets discussing other matters. The violence-related tweets are event-oriented, so some dates may not contain violent tweets at all. This is worth noting when considering the implementation of evolving models that depend on previous violence-related occurrences. Our proposed model is not epoch-dependent at the moment and was tested on a collection of tweets taken from a particular epoch.
  • We proposed two different strategies for deriving priors: the first is based on information gain, while the second is based on word entropy. We compare the performance of the proposed approach with three other models. The first two models are unsupervised approaches, namely maximum entropy trained with generalised expectation and maximum entropy trained with posterior regularisation. The third is a weakly supervised approach which also makes use of prior lexicons. We can see that the proposed approach, VDM, outperforms existing approaches, and that the entropy-based strategy for deriving the lexicon from Twitter provides the best performance in precision. It is important to note that the use of DBpedia as a source for deriving lexicon priors turned out to be quite effective, since it reduces the need for Twitter annotations, which can be costly.
  • When analysing the set of topics derived from the VDM model, we notice that these collections of clustered words are very good indicators of current events being discussed in the Twitter-sphere. Our next goal is to enable the automatic labelling of these topics in order to enrich the context of the current events being discussed.
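
The prior-derivation steps described in the notes above (chunking DBpedia abstracts into tweet-sized documents, then estimating P(W|C) = P(C|W) * P(W) / P(C)) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names `chunk_abstract` and `word_given_class` are hypothetical, tokenisation is plain lowercase whitespace splitting, probabilities are estimated from document-level counts, and the Information Gain filtering step is omitted. The nine-word chunk size is taken from slide 26 of the transcript.

```python
from collections import Counter


def chunk_abstract(text, max_words=9):
    """Split an abstract into consecutive chunks of at most `max_words` tokens."""
    words = text.lower().split()  # assumption: lowercase + whitespace tokenisation
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


def word_given_class(docs, labels, target):
    """Estimate P(w | c) for c = `target` as P(c | w) * P(w) / P(c).

    `docs` is a list of token lists; all probabilities are document-level
    estimates (a word counts once per document). This is an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()        # number of documents containing w
    class_doc_freq = Counter()  # number of `target` documents containing w
    n_target = 0
    for tokens, label in zip(docs, labels):
        vocab = set(tokens)
        doc_freq.update(vocab)
        if label == target:
            n_target += 1
            class_doc_freq.update(vocab)
    if n_target == 0:
        raise ValueError(f"no documents labelled {target!r}")
    p_c = n_target / n_docs
    priors = {}
    for word, df in doc_freq.items():
        p_c_given_w = class_doc_freq[word] / df
        p_w = df / n_docs
        priors[word] = p_c_given_w * p_w / p_c
    return priors


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    abstract = ("rebel fighters attacked the town and the army "
                "answered with heavy fire overnight")
    violent_docs = chunk_abstract(abstract)
    non_violent_docs = [["great", "award", "show"], ["sandwich", "for", "lunch"]]
    docs = violent_docs + non_violent_docs
    labels = ["violent"] * len(violent_docs) + ["non-violent"] * len(non_violent_docs)
    print(word_given_class(docs, labels, "violent"))
```

With document-level counts the Bayes expression reduces to the fraction of documents in the target class that contain the word, which is a convenient sanity check on the output.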

Transcript

  • 1. A Weakly Supervised Bayesian Model for Violence Detection in Social Media
    Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+
    *School of Engineering and Applied Science, Aston University, UK
    +Institute of Automation, Chinese Academy of Sciences, China
  • 2. Outline
    o Introduction
    o Research Challenges
    o Violence Detection Model
    o Deriving word priors
    o Experiments
  • 3. Introduction
  • 4. Introduction
  • 5. Introduction: Objectives
    o Identification of suspicious tweets
    o Violence-related topic detection
    o Extraction of violent and criminal events appearing in social media
  • 6. Introduction: Violence-related content analysis
    o Violence-related content
      - Characterised by the use of terms expressing aggression and attitudes towards violence
    o Violence-related content analysis
      - Identifying violence polarity in a piece of text (violence-related or non-violence-related)
      - Involves the detection of particular types of sentiments, not necessarily negative (e.g. anger, shame, excitement)
  • 7. Introduction: Characterising violence-related tweets
    Challenges
    o Restricted number of characters
    o Irregular and ill-formed words
    o Wide variety of language
    o Evolving jargon (e.g. slang and teenage lingo)
    o Event-dependent vocabulary characterising violence-related content
      - Volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent
      - E.g. “fire” and “flame” are negative during the UK riots 2011, but appear to be positive in the London Olympics 2012
      - E.g. “#Jan25” is violence-related during the Egyptian revolution
  • 8. Related Work: Violence-related classification in Social Media
    o Topic classification of short texts
      - Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
      - Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
    o Weakly supervised approaches
      - JST model [Lin&He 2009][Lin&He2012]
      - Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
  • 9. Related Work: Violence-related classification in Social Media
    o Rely on supervised classification techniques or do not cater for the violence detection challenges.
    o Do not discover topics with an associated document category.
  • 10. Related Work: Violence-related classification in Social Media
    o Topic classification of short texts
      - Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
      - Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
    o Since violence-related events tend to occur during short to medium life-spans, methods relying only on labeled data can rapidly become outdated.
    o Rely on supervised classification techniques or do not cater for the violence detection challenges.
    o Do not discover topics with an associated document category.
  • 11. Violence-related classification in Social Media: Challenges
    o How to characterise violence polarity?
    o How to build a model to discriminate across documents to identify violence-related content?
    o How to provide overall information to understand the type of violence-related events?
  • 12. Violence Detection Model (VDM): Problem Formulation and Proposed Method
  • 13. Accessing Topics via Word Distributions
    o Novel Bayesian modelling approach for:
      - Identifying violent content in social media
      - No need of labelled data
      - Inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He2012]
    o Use of knowledge sources (e.g. DBpedia)
      - Priors derivation strategies
  • 14. Accessing Topics via Word Distributions
  • 15. Accessing Topics via Word Distributions
    Each tweet can involve multiple topics
    [Figure label: Topics]
  • 16. Accessing Topics via Word Distributions
    Each tweet also involves words with different violence polarity
    [Figure label: Violence Polarity]
    Casting these intuitions into a generative probabilistic process [Blei-et-al 2003]
      - Each document is a random mixture of corpus-wide topics
      - Each word is drawn from one of those topics
  • 17. Accessing Topics via Word Distributions
    [Figure: two example documents, each shown with a document-level violence polarity (non-violence-related / violence-related) and its text; the first text is non-violence-related, the second violence-related]
  • 18. Violence Detection Model (VDM)
    [Figure: plate diagram; labels: violence label/topic probability, violence label/topic language model, violence probability, violence label, topic, word; plates N_d and D]
  • 19. Violence Detection Model (VDM)
    • Choose ω ∼ Beta(ε), φ_0 ∼ Dir(β_0), φ ∼ Dir(β).
    • For each category (violent or non-violent) c
      o For each topic z under the document category c
        - Choose θ_{c,z} ∼ Dir(α)
    • For each doc m
      o Choose π_m ∼ Dir(γ)
      o For each word w_{m,n} in doc m
        - Choose x_{m,n} ∼ Mult(ω)
        - If x_{m,n} = 0, choose a word w_{m,n} ∼ Mult(φ_0)
        - If x_{m,n} = 1,
            choose a tweet category label c_{m,n} ∼ Mult(π_m),
            choose a topic z_{m,n} ∼ Mult(θ_{c_{m,n}}),
            choose a word w_{m,n} ∼ Mult(φ_{c_{m,n}, z_{m,n}}).
  • 20. Violence Detection Model (VDM)
    o A single document category-topic distribution is shared across all the documents.
    o Assumes words are generated either from a category-specific topic distribution or from a general background model.
  • 21. Deriving Word Priors
  • 22. Violence Lexicon
    o Violence lexicon preparation
      - DBpedia articles from violence-related topics
      - Twitter data for Jan-Dec 2010 (10% Twitter Firehose)
    o Violence-related: fight, war, protest, riots, conflict, bomb, trouble, fear
    o Non-violence-related: twilight, sandwich, award, moon, record, common, excited, great
  • 23. Deriving Priors: Using DBpedia Categories
    o Structured Semantic Web representation of data derived from Wikipedia
      - Maintained by thousands of editors
      - Evolves and adapts as knowledge changes [Syed et al, 2008]
    o Covers a broad range of topics
    o Characterises topics with a large number of resources

                    DBpedia*       Yago2          Freebase
      Resources     2.35 million   447 million    3.6 million
      Classes       359            562,312        1,450
      Properties    1,820          253,213,842    7,000
  • 24. Deriving Priors: Using DBpedia Categories
    [Figure: DBpedia category tree around Violence; nodes shown include Revolutionary Terror, Terrorism, Violence, War, Military Operations, Guerrilla Warfare, …]
  • 25. Obtaining Priors from Tweets
    1 million tweets annotated with OpenCalais-derived topics, including:
    o Business & Finance
    o Disaster & Accident
    o Education
    o Entertainment & Culture
    o Environment
    o Health & Medical
    o Hospitality & Recreation
    o Labor
    o Law & Crime
    o Politics
    o Religion & Belief
    o Social Issues
    o Sports
    o Technology & Internet
    o War & Conflict
    8,338 tweets
  • 26. Datasets for Priors
    o Use OpenCalais to annotate tweets
      - Extracted tweets labelled as “War & Conflict” and considered them as violence-related annotations
      - OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set
    o DBpedia abstracts have longer sentences than tweets
      - Generated tweet-sized documents by chunking the abstracts into chunks of 9 or fewer words

                            Tweets (TW)   DBpedia (DB)   DBpedia chunked (DCH)
      Violent-related       10,432        4,082          32,174
      Non violent-related   11,411        11,411         11,411
  • 27. Relative Word Entropy
    o Corpus Word Entropy captures the dispersion of the usage of word w in the corpus SD
    o Class Word Entropy characterises the usage of a word in a particular document class
    o Relative Word Entropy provides information on the relative importance of that word to a given document class
  • 28. Word Priors Obtained using RWE

      DBpedia-Chunked Priors    DBpedia-derived Priors    Tweets-derived Priors
      Violent    NotViolent     Violent    NotViolent     Violent    NotViolent
      group      customer       group      gop            rebel      ey
      alleg      win            power      lov            destro     nnw
      armour     diff           suffer     back           sectar     vot
      resid      good           soc        good           anti       soc
      cult       sen            palest     twees          mortat     aid
      separat    eat            knif       interest       amnest     job
      influ      surve          rebel      right          drug       good
      democr     afford         campaign   answer         fighter    congrat
  • 29. Experiments
  • 30. Datasets for Experiments
    o TREC Microblog 2011 corpus
      - Comprises over 16 million tweets sampled over a two week period (January 23rd to February 8th, 2011)
      - Includes 49 different events
        violence-related ones such as the Egyptian revolution and the Moscow airport bombing
        non-violence-related ones such as the Super Bowl seating fiasco
    o Training set: 10,581 tweets
    o Testing set: 759 violence-related, 1,000 non violence-related
  • 31. Baselines
    o Learned from labelled features
      - Word priors are used as labelled feature constraints
      - Train MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]
    o Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He2012]
      - Set the number of sentiment classes to 2 (violent or non-violent)
    o Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
      - Assume that some document labels are observed and model per-label latent topics
      - Supervised information is incorporated at the document level rather than at the word level
      - The training set is labelled as violent or non-violent using OpenCalais
  • 32. Violence Classification Results
    o ME-GE and ME-PR perform poorly
    o Best result obtained using VDM with word priors derived from TW using RWE
    o Source data for deriving word priors
      - DB does not improve over TW
      - DCH boosts F-measure in JST and is close to TW for VDM
    o RWE consistently outperforms IG for both JST and VDM
  • 33. Varying Number of Topics
  • 34. Topic Coherence Evaluation
    [Figure panels: Violence-related topics; Non violence-related topics]
  • 35. Example Violence-Related Topics
    Event labels shown on the slide: Protest in Tahrir Square; Middle East uprise; Moscow Airport bombing; Government shut down Facebook

      Topic 1      Topic 2      Topic 3      Topic 4
      egypt        middle       internet     crash
      tahrir       east         egypt        kill
      cair         give         phone        moscow
      strees       power        block        bomb
      police       idea         word         airport
      protester    government   service      tweets
      square       spread       government   injure
      arm          uprise       shut         arrest
      report       fall         facebook     dead
  • 36. Questions?
    Elizabeth Cano    a.cano_basave@aston.ac.uk
    Yulan He          y.he@cantab.net
    Kang Liu          kliu@nlpr.ia.ac.cn
    Jun Zhao          jzhao@nlpr.ia.ac.cn
    Slides available at http://www.slideshare.net/ampaeli
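
The generative process listed on slide 19 can be made concrete with a small forward-sampling sketch. It only draws a synthetic corpus from the model; it does not perform inference, and it is not the authors' code. Several details are assumptions made for illustration: the single-parameter Beta(ε) is read as a symmetric Beta(ε, ε), x = 1 is taken to occur with probability ω, all Dirichlet priors are symmetric, θ is modelled as one topic distribution per category (shared across documents, as slide 20 describes), and φ as one word distribution per (category, topic) pair. The names `dirichlet` and `generate_corpus` and all default hyperparameter values are hypothetical.

```python
import random


def dirichlet(rng, alpha, k):
    """Draw from a symmetric Dirichlet(alpha) of dimension k via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_docs=5, doc_len=8, n_topics=3,
                    alpha=0.5, beta=0.5, beta0=0.5, gamma=0.5, eps=1.0, seed=0):
    """Sample documents as lists of (word, category, topic) triples.

    Background words carry category = topic = None.
    """
    rng = random.Random(seed)
    n_words, n_cats = len(vocab), 2       # categories: 0 = non-violent, 1 = violent
    omega = rng.betavariate(eps, eps)     # assumption: symmetric Beta, P(x = 1) = omega
    phi0 = dirichlet(rng, beta0, n_words) # background word distribution
    # Assumption: one topic distribution per category, shared by all documents.
    theta = [dirichlet(rng, alpha, n_topics) for _ in range(n_cats)]
    # One word distribution per (category, topic) pair.
    phi = [[dirichlet(rng, beta, n_words) for _ in range(n_topics)]
           for _ in range(n_cats)]

    corpus = []
    for _ in range(n_docs):
        pi = dirichlet(rng, gamma, n_cats)  # per-document category proportions
        doc = []
        for _ in range(doc_len):
            if rng.random() >= omega:
                # x = 0: word comes from the general background model
                word = rng.choices(vocab, weights=phi0)[0]
                doc.append((word, None, None))
            else:
                # x = 1: pick a category label, then a topic, then a word
                c = rng.choices(range(n_cats), weights=pi)[0]
                z = rng.choices(range(n_topics), weights=theta[c])[0]
                word = rng.choices(vocab, weights=phi[c][z])[0]
                doc.append((word, c, z))
        corpus.append(doc)
    return corpus


if __name__ == "__main__":
    toy_vocab = ["riot", "fire", "police", "award", "moon", "great"]
    for doc in generate_corpus(toy_vocab, n_docs=2, doc_len=5):
        print(doc)
```

In the weakly supervised setting the slides describe, lexicon words would additionally shape the word priors for each category; the sketch leaves that out and uses uninformative symmetric priors throughout.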