Graph-based Analysis and Opinion Mining in Social Network



Project Report: Graph-based Analysis and Opinion Mining in Social Network

Khan Mostafa
Stony Brook University
Student ID# 109365509

ABSTRACT
This is the final report for the Networks & Data Mining Techniques project, which focuses on mining a social network to estimate public opinion about entities and their associated keywords. The project mines Twitter for recent feeds and analyzes them to estimate, for each tweet, a sentiment score, the entities discussed and the keywords describing them. This data is then used to elicit the overall sentiment associated with each entity. The extracted entities and keywords are also used to form an entity-keyword bigraph, which is in turn used to detect entity communities and the keywords found within those communities. The presented implementation works in linear time.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining.

General Terms
Algorithms, Documentation, Experimentation.

Keywords
Opinion mining, sentiment, graph clustering, graph community detection.

1. INTRODUCTION
This project focuses on mining opinion from social networks. It takes Twitter as a model platform because it has a publicly available stream of posts from people of diverse demographics. The goal is to report public opinion in two forms: (a) overall opinion about some entity and (b) opinion-based clusters of entities and keywords.

Public opinion can be mined from posts about an entity of interest. First, ample posts are fetched from the public stream. Then, each post is individually scored to find its embedded subjectivity. Not all posts are subjective: some assert information while others express feelings. Hence, posts can be broadly classified as objective, positive or negative. However, subjective bias is not discrete; rather, each post embodies mixed polarity. Moreover, attempts to annotate posts manually have shown that different people associate sentiment with the same posts differently.
Therefore, this project focuses on calculating sentiment scores for posts. After each post is individually scored, overall opinion is represented using a few aggregate parameters, including overall score, diversity, and the percentage of each type of polar post. A set of keywords (kw) is also identified to report how the entity (E) is positively and negatively described.

In this project sentiment analysis is done using an approach similar to [1], combining two naïve Bayes classifiers to calculate the polarity score: a PoS-tag-based classifier and an n-gram-based classifier. Keywords and entities are primarily detected using parts of speech. Then, in combined analysis, keywords that occur infrequently for an entity are discarded, as such words are not sufficiently associated with the entity. Likewise, candidate keywords that occur in the descriptions of too many entities are unlikely to be keywords; they are rather stop-words or generic words.

After tweets are individually analyzed, further overall analysis can be done. Tweets are collected from the recent public feed stream using the Twitter API. Analysis reports a polarity score, a set of keywords and a set of entities for each tweet. From the analyzed tweets, an entity-keyword bigraph (E×kw) is computed. In the E×kw bigraph an edge exists between E and kw if both occur in the same tweet. These edges also have an associated polarity score. The E×kw bigraph can be used to generate an E×E graph, in which there is an edge between two entities if they share a keyword with similar sentiment bias. This E×E graph is then clustered using a local clustering algorithm in linear time.

The project is implemented mainly on the .NET Framework (C#), and partially in PHP on an Apache server to access the Twitter API [2]. PoS tagging is done using a third-party TreeTagger recently developed for tweets [3].
The main contributions of this project are:
• Implemented a sentiment analysis tool that can elicit scores for individual tweets
• Implemented a way to report the aggregate sentiment score and associated keywords for a queried entity
• Devised and implemented a simple approach to identify entities and keywords in tweets
• Implemented a fast local graph clustering algorithm using split vectors instead of full-blown matrices
• Used the fast local graph clustering to detect and report entity groups along with keywords and grouped polarity scores

In this report, the following sections present an overview of prior work, a description of the methodology, and the results and analysis of the mined data.

2. BACKGROUND
Mining a social network to elicit public opinion requires sentiment analysis, keyword and entity tagging, and graph clustering. Sentiment analysis has been studied extensively in several fields and is still an open problem. There has also been ample investigation into detecting communities, partitioning, and finding clusters in graphs. In this section a few prior works are briefly discussed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CSE590 Network and Data Mining Techniques, Fall, 2013, Stony Brook University, NY, USA. Copyright 2013

2.1 Sentiment Analysis
Sentiment analysis has been studied thoroughly for a decade or more. One of the earliest works, by Pang et al. [4], amongst others, investigated sentiment classification. This investigation opened a wide arena of research and has led to many results from a multitude of researchers in different fields. Statistics, computational linguistics and machine learning have all been applied to the challenge of sentiment analysis. There are several lexicon-based techniques for opinion mining, viz. [5] and versions of SentiWordNet [6], [7]. A detailed survey of many lexicon-based approaches is given in [8]. Although earlier studies [9] suggested using only adjectives as a subjectivity measure, later investigations revealed that sentiment appraisal is much more diverse. Whitelaw et al. [10] suggested using appraisal taxonomies for sentiment classification. Similar observations were made by [11] and [12], stating that "Adjectives, Verbs and Adverbs are better than Adjectives Alone". Machine learning approaches have widely used Support Vector Machines, e.g. [1], [13], and naïve Bayes classifiers, e.g. [4], [14]. Latent Dirichlet Allocation (LDA) has also been utilized, e.g. [15], [16]. A holistic lexicon-based approach [17] has also been described to address context dependency.

Opinion mining and sentiment analysis on Twitter have been investigated using various approaches, viz. [14] [18] [1] [16] [19]. Most approaches to opinion mining assign a strict subjectivity class (positive, negative, neutral) to individual texts at different granularities (i.e. sentence, post, paragraph and document). However, a score assignment serves better for understanding the intensity of opinion. There is a paucity of studies that try to aggregate sentiment to identify public opinion. The perception of opinion varies for each individual, and better insight into public opinion can be gained by eliciting a few attributes from social media.
Overall sentiment score, percentage of positive and negative opinions, and key descriptions are useful attributes that can be elicited. This project focuses on mining tweets about some entity for these attributes associated with that entity.

2.2 Keyword and Entity detection
There are many diverse approaches to keyword detection. For example, there are machine-learning-based approaches using SVM [20], and approaches that incorporate linguistic knowledge such as n-grams and PoS for supervised keyword extraction [21]. Thesaurus-based approaches [22] use semantic knowledge for machine-based keyword extraction. Most keyword identification approaches use some kind of machine learning technique along with some other knowledge. However, for this project's purpose, a simple method is required to identify keywords. This project employs hints from PoS tagging and then lets the data itself build a keyword lexicon while simultaneously detecting keywords.

Footnote 1. Modularity is the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random. [Wikipedia, Accessed Dec 03, 2013]

2.3 Graph clustering
Graphs have historically been studied extensively from mathematical and theoretical viewpoints, and in the last few decades they have been studied more extensively from a data analytics perspective. Many real-world and physical phenomena can be modeled well as graphs. These graphs can then be investigated efficiently to find latent characteristics of the modeled data. One major graph operation in data mining is dividing a graph into smaller parts. Partitioning can be of different types. One approach is to partition the whole graph into disjoint subgraphs of similar size [23]. For analyzing graphs, a more natural division is often desired. Vertices tend to share edges with vertices whose own neighbors are interconnected, and thus form communities.
However, communities differ in size and are not disconnected from one another. Rather, there are few links between nodes of different communities compared with links between nodes of the same community. Newman and others have conducted several studies [24] [25] [26] on detecting communities in graphs, exploiting the modularity (see footnote 1) of the graph. Most of their early work was limited in scalability, but later spectral optimization of modularity yielded [27] an algorithm that works in near-linear time. Modularity-based approaches cluster a graph into disjoint communities. In practice, however, communities often overlap. Andersen et al. [28] proposed "Local Graph Partitioning using PageRank Vectors" and other derived algorithms. The core idea behind these approaches is to use the conductance (see footnote 2) of the graph to cluster it locally. These approaches work in near-linear time and can detect communities that overlap. This project uses an approach devised by Andersen et al., as it serves several of the project's goals: it can detect communities that overlap, it works in near-linear time, and it can be implemented without creating the full, blown-up matrix.

3. PROJECT DESCRIPTION

3.1 Problem Statement
People express their opinions about entities (viz. locations, persons, products etc.) in social networks. In brief, the goal is to
• extract the overall public opinion of some entity
• elicit opinion-based entity groups in the recent stream
The scope of the project is to mine a popular microblogging platform: Twitter.

3.1.1 Extract overall public opinion of some entity
The goal is to extract opinion about a given entity, E. This is done in terms of ample recent tweets about E. The solution shall be able to yield the following about a given entity E:
• Overall sentiment: Overall sentiment (viz. positive, negative, mixed) about E. A sentiment score in the range [-1, 1] will be given.
  This will also show the percentages of positive, negative and neutral tweets (a threshold can be applied to distinguish between these three classes), as well as the count of analyzed tweets. A measure of how diverse the opinion is (e.g. variance) can also be included.
• Key description: The system will yield a set of keywords (kw) that are used to describe E.

Footnote 2. Conductance is the measure of a subgraph denoting how much it is connected to the rest of the graph. It is the ratio of out-links from the subgraph to its volume (total edge count from nodes in it).

An overall sentiment about an entity is useful to a multitude of clients for various applications. Sets of key descriptive words along with sentiment provide better insight into public feelings.

3.1.2 Opinion based entity groups in recent stream
The goal is to detect how entities are grouped together in terms of sentiment and descriptive keywords. This is done based on a stream of recent tweets. Each tweet shall be individually analyzed, as in 3.1.1. Analysis of each tweet yields:
• Text of the tweet, T
• Entities discussed in it, E
• Keywords in it, kw
• Polarity score, P

These tuples (T,E,kw,P) are then used to build the E×kw bigraph such that:
• There exists an edge between Ei and kwj if there is one or more tweet that contains both Ei and kwj.
• The edge has a weight indicating the co-occurrence count of Ei and kwj, i.e.
  $$weight_{ij} = \left|\{\, T_k \mid E_i \in T_k.E \,\wedge\, kw_j \in T_k.kw \,\}\right|$$
• The edge has a pScore that is the average of pScore (= P) over all such occurrences, i.e.
  $$pScore_{ij} = \frac{\sum \{\, T_k.pScore \mid E_i \in T_k.E \,\wedge\, kw_j \in T_k.kw \,\}}{weight_{ij}}$$

After this, a filter is run on the graph to eliminate links between an entity and a keyword where the keyword is not sufficiently descriptive of the entity. This is done by calculating freq such that
  $$freq_{ij} = \frac{weight_{ij}}{Occurrence(E_i)}$$
If freqij is smaller than a certain threshold, εfreq, then that keyword is filtered out for the entity Ei.
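
The bigraph construction and frequency filter described above can be illustrated with a short Python sketch. This is not the report's C# implementation; the function names `build_bigraph` and `filter_by_freq`, the tuple layout, and the reading of Occurrence(Ei) as the number of tweets that mention Ei are assumptions made for illustration.

```python
from collections import defaultdict


def build_bigraph(tuples):
    """Build E x kw edges from (text, entities, keywords, p_score) tuples.

    Returns (edges, occurrence), where edges maps (entity, keyword) to a dict
    with 'weight' (co-occurrence count) and 'pScore' (average polarity), and
    occurrence maps each entity to the number of tweets mentioning it
    (an assumed reading of Occurrence(E_i)).
    """
    weight = defaultdict(int)
    p_sum = defaultdict(float)
    occurrence = defaultdict(int)
    for _text, entities, keywords, p_score in tuples:
        for entity in set(entities):
            occurrence[entity] += 1
            for keyword in set(keywords):
                weight[(entity, keyword)] += 1
                p_sum[(entity, keyword)] += p_score
    edges = {
        pair: {"weight": w, "pScore": p_sum[pair] / w}
        for pair, w in weight.items()
    }
    return edges, dict(occurrence)


def filter_by_freq(edges, occurrence, eps_freq):
    """Drop edges whose freq = weight / Occurrence(entity) is below eps_freq."""
    return {
        (entity, keyword): attrs
        for (entity, keyword), attrs in edges.items()
        if attrs["weight"] / occurrence[entity] >= eps_freq
    }
```

Storing only the observed (entity, keyword) pairs in a dictionary keeps memory proportional to the number of edges rather than to the full E×kw matrix.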

This E×kw bigraph is then used to build the E×E graph, such that there exists an edge between Ei and Ej if
• Occurrence(Ei) > εeo ∧ Occurrence(Ej) > εeo
• {kwx ∈ kw(Ei) | Occurrence(kwx) < εkwo} ⋂ {kwx ∈ kw(Ej) | Occurrence(kwx) < εkwo} is not empty
• The polarity bias for both is similar

In other words, there is an edge between two entities if they share one or more keywords through links with similar polarity bias. These entities are such that they occur more often than a threshold, εeo. These keywords are such that they do not occur more than some threshold, εkwo, times. The threshold on keywords is motivated by the following intuition:
• If a potential word occurs in the descriptions of most entities, then it is not a keyword but a generic term.

Then, a community detection algorithm is run on this E×E graph to find groups of entities that are bound together by many polarity-aligned keyword links. Once such a group of entities is generated, there is a group of keywords that occur on edges within that community of nodes. A representative averaged pScore can also be calculated for such a group.

To summarize, given a stream of tweets, the system shall be able to generate:
• (T,E,kw,P) tuples
• E×kw bigraph
• E×E graph
• Groups of entities that have similar opinion

3.2 Data collection

3.2.1 Corpus and entity from Twitter
This project requires collecting two types of data. First, a corpus of subjective and objective tweets is collected; this data is used to train the classifier (scorer). After training, the trained model (not the training data set) can be stored in a file so that the scorer can later operate by loading it from the file. Secondly, at query time posts are fetched from Twitter. The following Twitter APIs are used:
• search/tweets: This API is called with 'q' = emoticons for gathering training data (positive and negative posts). At query time, the same API is used with 'q' = query term to fetch related recent posts.
• statuses/user_timeline: This API is used to fetch objective training data by querying 'screen_name' = popular_stream. I used Lifehacker, Gizmodo, New York Times, and The Atlantic as sources.

The Twitter API does not allow fetching more than 100 posts at once. Hence, I had to use max_id to iteratively repeat the same call for different portions of the result. I have collected ten thousand posts of each type for training. At query time, 200~2000 posts are fetched.

3.2.2 Mining recent twitter stream
To generate an E×E graph large enough to detect groupings of entities, a large portion of the Twitter public stream has to be collected. To do this, the Twitter API is again used, and scraped continuously over a large number of windows. Note that, in v1.1, the Twitter API allows only 180 search queries per window per user and 450 queries per window per app. Each query returns a maximum of 100 tweets. Currently, windows are 15 minutes each. Hence, max_id is utilized to continuously fetch tweets using a q="." query. An alternative to the search/tweets API could be a streaming API.

After tweets are fetched, very short tweets are discarded. I have filtered out tweets with fewer than 50 characters, because shorter tweets are difficult to understand. Retweets (RT) are also discarded to avoid the same tweet occurring many times. Furthermore, another filtering stage is applied to remove any remaining duplicate tweets.
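
The three filtering stages just described can be sketched in a few lines of Python. The retweet test (an "RT @" marker in the text) and the case- and whitespace-insensitive duplicate check are assumptions for illustration; the report does not state how its implementation detects retweets or duplicates, and `filter_tweets` is a hypothetical name.

```python
import re

# Assumed retweet convention: "RT @user" at the start or after whitespace.
_RETWEET = re.compile(r"(^|\s)RT\s+@")


def filter_tweets(texts, min_chars=50):
    """Drop short tweets, retweets, and duplicates, preserving input order."""
    seen = set()
    kept = []
    for text in texts:
        if len(text) < min_chars or _RETWEET.search(text):
            continue
        # Normalize case and whitespace so near-identical copies collapse.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```

Because each tweet is examined once and set lookups take constant time on average, this stage runs in time proportional to the number of tweets.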

3.2.3 PoS Tagging
After tweets are collected, they are passed to a TreeTagger for PoS tagging. I used the recently developed GATE Twitter part-of-speech tagger [3], which is based on the Stanford TreeTagger, which in turn is based on the well-known TreeTagger [29] by Schmid. The PoS tags yielded are based on the Penn Treebank tagset [30].

3.3 Implementation

3.3.1 Twitter corpus to train sentiment classifier
Each post is individually scored by two scorers. Following Pak and Paroubek (2010) [1], two classifiers are built. To train them, tweets are queried as follows: (1) positive tweets are fetched with a search of q="", (2) negative tweets are fetched with a search of q="" and (3) objective tweets are fetched from news media accounts.

One classifier exploits the distribution of parts of speech (PoS) amongst objective and polar statements. The PoS distribution also differs between positive and negative statements. See Figure 1 and Figure 2.

The other classifier exploits the distribution of n-grams (n=2). N-grams show strong correlation with bias or with objectivity. People usually use common phrases to express a type of feeling. On the other hand, some phrases are of an assertive nature. This feature of natural language is captured using n-grams. See Table 2 for the top 20 polar n-grams out of 94k n-grams.

The reference work used the classification results from the two classifiers to decide the final classification. This project enhances the approach by implementing the classifiers as scorers that evaluate a PoS score and an n-gram score for each statement. Both scores then contribute to the final score of the statement (tweet).

3.3.2 From scraped tweets to graphs
As outlined in 3.1.2, (T,E,kw,P) tuples, the E×kw bigraph and the E×E graph are generated from a given stream of tweets.

Analyzing tweets
First, each tweet is scored using the sentiment classifier described in 3.3.1. PoS tags are used to make a preliminary identification of entities and keywords.
Entity: Our goal is to analyze entities (locations, places, persons, products etc.). In English, they are generally represented by proper nouns. Also, on Twitter, users can be regarded as entities. Hence, words with the PoS tags for proper nouns and users (NNP, NNPS, USR) are regarded as entities.
Keyword: In English, adjectives, adverbs and verbs are used to describe an entity. This property is exploited by identifying words tagged with these parts of speech (JJ, RB, VB etc.) as keywords. The algorithm also allows an alternative, selected by a parameter, that includes common nouns (not NNP) as keywords.

Entity-keyword bigraph
The (T,E,kw,P) tuples from the analyzed tweets are iterated over to build an E×kw bigraph as described in 3.1.2. A general intuition, also confirmed by several studies, is that graphs are generally sparse. Thus, instead of building a full-blown matrix, two dictionaries (maps) are stored to represent the E×kw bigraph:
• A dictionary of entities, with pointers to keywords, as well as the weight and pScore associated with each link
• For ease of iteration, another dictionary of keywords, which stores pointers back from keywords to entities

This representation ensures small storage for the entire bigraph, yet describes the entire bigraph with its edges and nodes. It reduces the storage from (E*kw) to edgeCount. Note that 2*(E+kw) < edgeCount << (E*kw). The running time for building the bigraph is proportional to the number of edges, i.e. O(edges).

Entity-Entity graph
From the E×kw bigraph generated above, an E×E graph is generated by iterating over each entity. For each entity Ei, its set of keywords kw(Ei) is processed. Each keyword points to another set of entities, E(kw(Ei)). These entities are added as neighbors of Ei. In this step too, a dictionary is used to represent the graph. It requires one dictionary of entities, where each entry also points to its immediate neighbors. This requires storage of 2*edge. The runtime to build this graph is proportional to the number of edges.
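
A minimal Python sketch of the E×E construction follows, using dictionaries in the spirit of the representation above. Several details are assumptions made for illustration, not the report's implementation: `edges` maps (entity, keyword) pairs to a dict with a 'pScore' entry, Occurrence(kw) is taken to be the number of entities linked to the keyword, and "similar polarity bias" is taken to mean that the two pScores have the same sign.

```python
from collections import defaultdict


def build_entity_graph(edges, occurrence, eps_eo, eps_kwo):
    """Link entities that share a non-generic keyword with same-sign pScore.

    edges:      {(entity, keyword): {"pScore": float, ...}}
    occurrence: {entity: number of occurrences}
    eps_eo:     entities must occur more than this many times
    eps_kwo:    keywords linked to this many entities or more are generic
    """
    # Keyword -> {entity: pScore}, i.e. the pointers back to entities.
    kw_to_entities = defaultdict(dict)
    for (entity, keyword), attrs in edges.items():
        kw_to_entities[keyword][entity] = attrs["pScore"]

    neighbors = defaultdict(set)
    for keyword, scored in kw_to_entities.items():
        if len(scored) >= eps_kwo:
            continue  # generic term: describes too many entities
        significant = sorted(
            (e, p) for e, p in scored.items() if occurrence.get(e, 0) > eps_eo
        )
        for i, (a, p_a) in enumerate(significant):
            for b, p_b in significant[i + 1:]:
                if p_a * p_b > 0:  # assumed test for similar polarity bias
                    neighbors[a].add(b)
                    neighbors[b].add(a)
    return {entity: sorted(nbrs) for entity, nbrs in neighbors.items()}
```

The pairwise loop runs only over entities that share one keyword, and the εkwo cut-off bounds how many entities that can be, so the generic-term filter also keeps the construction cheap.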
However, entities are filtered a priori to remove nodes with very few neighbors from the simulation (thus building a set of significant entities). Filtering generic terms from the keyword list (thus using only legitimate keywords) reduces the search space.

3.3.3 Keywords from data
Keywords are filtered in several steps to let the data define the legitimate keywords. In the first step, PoS tagging defines a preliminary set. After all tweets are analyzed, a filter removes low-frequency terms from the keyword list of each entity. After the E×kw bigraph is built, another filter rules out generic terms. Generic terms are those potential keywords that are found for too many entities. A threshold parameter is supplied to the algorithm for this. Finally, after communities are generated, a consolidation step filters out irregular keywords to yield the final set of keywords.

3.3.4 Community detection: group of entities
After the E×E graph, consisting of legitimate keywords and significant entities, is generated, a community detection algorithm can be used to detect communities in it. This project implements a fast derivation of Andersen et al. [28].

Table 1. Community Detection Algorithm
 1. Significant_entities := entities in (E×E)
 2. seed := supplied_seed
 3. if (seed is null or does not exist) then seed := first(Significant_entities)
 4. aCommunity := new Community()
 5. entity := seed
 6. eval := evaluate(entity, aCommunity)
 7. if (eval.member) then
      aCommunity.Add(entity)
      remove(entity, Significant_entities)
    remove(entity, aCommunity.Nbor)
 8. if (aCommunity.Nbor = empty) goto 11
 9. entity := first(aCommunity.Nbor)
10. goto 6
11. add(aCommunity, Communities)
12. if (Significant_entities not empty) then seed := first(Significant_entities); goto 4
13. return Communities

The algorithm described above uses objects of class Community. Its Add() member function adds the entity and updates the community's volume (= edges inside) and outward links. The evaluate() function checks membership by calculating the conductance the community would have if the node were added and comparing it with the original conductance. Conductance is defined as

Cond = (links outward from community) / (edges inside)

This generates a set of communities. After each community is generated, a consolidation step further filters the keywords:

size := number of entities in the community
Threshold := ln(size)
if (Occurrence(kw) < Threshold) then Remove(kw)

After this step, a set of descriptive keywords is associated with the group of entities.

3.3.5 Storing result
The final communities are returned from the implementation as an XML document. The (T,E,kw,P) tuples are also returned as XML. The intermediate graphs, the E×kw bigraph and the E×E graph, are exported as CSV (comma separated value) files.

4. RESULTS AND FINDINGS

4.1 Findings
The findings reported here are based on 160,711 tweets collected in late November 2013.

4.1.1 PoS Distributions and n-grams
Figures 1 and 2 show the PoS distributions over positive-negative statements and over subjective-objective statements. A positive bias value in Figure 1 indicates that the presence of that PoS makes the statement more likely to be positive; the same holds for negative values. The subjectivity score in Figure 2 is read similarly. Table 2 shows the top few n-grams. Note that the PoS distributions and top n-grams differ slightly from the referenced work [1]. Also, if training data is collected at a different time, some slight change will occur.

Table 2. Top n-grams with occurrence in each class of data

n-gram                    Positive  Negative  Objective
'enjoying break'                 1       328          1
'happy birthday'                22       207          1
'so happy'                     106        53          1
'follow back'                   10       132          1
'miss my'                       93        10          1
'no one notices'                97         4          1
'notices my'                    97         1          1
'good day'                       5        82          1
'follow please'                 47        38          1
'my phone'                      64        18          1
'presenting emotional'          60        20          1
'please follow'                 11        66          1
'follow love'                   17        60          1
'am sorry'                      71         4          1
'so sad'                        71         3          1
'miss u'                        65         7          1
'new followers'                 53        17          1

Figure 1. Distribution of PoS in positive and negative statements
(PoS tag, bias): POS 0.600, WP$ 0.500, PDT 0.333, RBS 0.280, URL 0.229, WP 0.217, JJS 0.187, SYM 0.176, USR 0.155, FW 0.127, NNP 0.110, CD 0.068, DT 0.032, VB 0.000, UH -0.004, NN -0.007, JJR -0.010, IN -0.012, NNS -0.015, JJ -0.019, RBR -0.024, WDT -0.031, VBG -0.034, NNPS -0.050, VBZ -0.055, EX -0.064, MD -0.099, CC -0.102, PRP$ -0.114, PRP -0.135, VBP -0.144, TO -0.149, RP -0.175, RB -0.182, VBD -0.227, VBN -0.245, WRB -0.282

Figure 2. Distribution of PoS between subjective and objective tweets
(PoS tag, subjectivity): WRB 0.164, VBN 0.140, VBD 0.128, RB 0.100, RP 0.096, TO 0.081, VBP 0.078, PRP 0.072, PRP$ 0.061, CC 0.054, MD 0.052, EX 0.033, VBZ 0.028, NNPS 0.025, VBG 0.017, WDT 0.016, RBR 0.012, JJ 0.010, NNS 0.008, IN 0.006, JJR 0.005, NN 0.003, UH 0.002, VB 0.000, LS 0.000, DT -0.016, CD -0.033, NNP -0.052, FW -0.060, USR -0.072, SYM -0.081, JJS -0.085, WP -0.098, URL -0.103, RBS -0.123, PDT -0.143, WP$ -0.200, POS -0.231

4.1.2 Power law in Entity and Keywords
Figure 3 and Figure 4 show how entities and keywords follow a power law.

Figure 3. ln(Occurrence) of entities shows a power law
Figure 4. ln(Occurrence) of keywords shows a power law

4.1.3 Distribution of Polarity Score in Entities
Figure 5 shows how polarity scores are distributed amongst entities. The polarity score has a skewed distribution. Figure 6 shows the distribution of polarity score over the natural logarithm (ln) of the occurrence of the entity.

Figure 5. Distribution of Polarity Score over entire entity space
Figure 6. Polarity Score over ln(Occurrence) of entities

4.1.4 Graph BFS & communities in adjacency matrix
Starting from an arbitrary node, the E×E graph is traversed by BFS (breadth first search) to generate a traversal order. This BFS assigns an index to each entity, and the adjacency matrix is then visualized as in Figure 7. Notice that this is a near-diagonal matrix. The diagonal itself is white, since there are no self-edges. Notice the blocks; these blocks are representative of communities. There are both tiny and large communities: 157 communities, with a maximum size of 136.

Figure 7. Adjacency matrix of significant entities

4.1.5 Observation of Groups
Feed tweet sets of different sizes were examined. The number of significant entities and the number of legitimate keywords increase with the size of the tweet set. All of them yield communities of different sizes. When these communities and their keywords were examined manually, they matched intuition. An interesting community, in which the keyword "cries" is associated with two stars, is shown in Figure 8.

<Community id="146" size="2" conductance="0.5" pScore="0.63566754320156">
  <trapped-keywords count="1"> Cries:4, </trapped-keywords>
  <e>Kristen Stewart</e>
  <e>Robert Pattinson</e>
</Community>

Figure 8. XML representation of a community

4.2 Results
Figure 9 shows some sample runs where the system is queried for overall sentiment analysis of an entity.

<opinion entity='mermaid'>
  <score>0.21</score>
  <analysis post-count='1086' percent-positive='52.03' percent-negative='24.59'/>
</opinion>
<opinion entity='bankrupt'>
  <score>-0.18</score>
  <analysis post-count='2073' percent-positive='30.29' percent-negative='47.03'/>
</opinion>
<opinion entity='drunk man'>
  <score>-0.50</score>
  <analysis post-count='1084' percent-positive='11.99' percent-negative='65.59'/>
</opinion>
<opinion entity='November'>
  <score>0.20</score>
  <analysis post-count='2062' percent-positive='53.25' percent-negative='25.12'/>
</opinion>

Figure 9. Result runs for query over entity

A few parameters were varied on the sample to see how they affect the result. Kw threshold (εkwo), Minimum nodes (εeo) and Common Noun as keyword are varied, and the results are shown in Table 3. Using common nouns as keywords yields a few groups of very large size. Thus, it is recommended to discard common nouns from keywords.

Table 3. Effect of parameter changes

Parameter / result         Run 1    Run 2    Run 3
Kw threshold                 350      350      450
Minimum nodes                  2        2        2
Common Noun as keyword     false     true    false
Potential kw               15108    31593    15108
Legitimate kw              14967    31368    14997
Entities                   97147    97147    97147
E occurring > 2             7580     7580     7580
Significant E.              1190     2012     1378
Groups                       170       92      157
Largest size                  70     1256      136

The polarity scores of each entity and keyword are stored and can be accessed directly in the E×kw bigraph. Building a polarity-invariant E×kw bigraph was also tested. For the same settings as the last column of Table 3, the polarity-invariant version generated 174 groups, with a largest group of size 598, for 1854 significant entities. The generated groups are also significantly different.
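
The per-entity attributes reported in Figure 9 (score, post count, percentage of positive and negative posts) can be derived from individual tweet scores roughly as sketched below. This is only an illustration under stated assumptions: the overall score is taken to be the mean of the per-tweet scores, the 0.5 class threshold is borrowed from the evaluation in 4.3.1, variance is used as the diversity measure suggested in 3.1.1, and `summarize_opinion` is a hypothetical name.

```python
from statistics import pvariance


def summarize_opinion(scores, threshold=0.5):
    """Aggregate per-tweet polarity scores in [-1, 1] into summary attributes."""
    if not scores:
        raise ValueError("at least one scored tweet is required")
    count = len(scores)
    positive = sum(1 for s in scores if s > threshold)
    negative = sum(1 for s in scores if s < -threshold)
    return {
        "score": sum(scores) / count,  # assumed: mean of tweet scores
        "post_count": count,
        "percent_positive": 100.0 * positive / count,
        "percent_negative": 100.0 * negative / count,
        "percent_neutral": 100.0 * (count - positive - negative) / count,
        "variance": pvariance(scores),  # diversity of opinion
    }
```

A lower threshold moves posts from the neutral class into the polar classes, so the reported percentages depend directly on that choice.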

Files generated containing result sets are kept online at

4.3 Performance

4.3.1 Sentiment Scoring
There is no available way to evaluate the correctness of overall sentiment analysis. Therefore, the performance of individual scoring is tested against publicly available Mechanical Turk annotated Twitter data [18]. This data set includes 3771 annotated tweets. Note that each tweet was annotated by three humans: they annotated 21600 tweets, and all three agreed on only 3771 of them. As the test set has strict classification, test tweets are scored and then classified for testing purposes with a threshold of 0.5 (i.e. tweets scored above +0.5 are regarded as positive, those scored below -0.5 as negative, and the rest as neutral). This yields only a 61% match of sentiment with the test data. However, most disagreements are seen in entries annotated as non-biased. For biased posts, the mismatch is around 26%.

4.3.2 Opinion based entity groups in recent stream
Scraping data from Twitter takes a long time due to the query restriction per window. The third-party GATE TreeTagger I used performs slowly, and this hinders overall performance. However, the part implemented for the project performs fast, in linear fashion. Table 4 lists how the time varies with the size of the sample.

Table 4. Performance of graph analysis for different data sizes

                        Sample 1   Large Sample   Very large Sample
Tweets                    160711         485447              847276
Time to analyze each      48.91s        148.53s             262.01s
Build Bigraph              9.29s         34.24s              66.45s
Generate EE graph          1.54s          3.49s               4.99s
Time to Find Groups       0.126s         0.310s              0.358s
Groups count                 157            334                 457
Largest Group size           136            183                 162
Significant Entities        1378           2627                3560
Legitimate Keywords        14997          25818               35005

5. Conclusion
This project has devised and studied an approach to mining a social network to elicit public opinion about entities. Public opinion is represented as an analysis of individual entities and a graph analysis of entities based on polarity-aligned keyword relationships.
Sentiment analysis itself is still an open problem and needs further investigation. This project uses an approach to analyzing the sentiment of tweets that is built using Twitter as the learning corpus. This approach yields a polarity score rather than a discrete polarity marker. To elicit overall opinion about an entity, an aggregate polarity score and representative keywords are detected. For grouping entities, an entity graph is built from the entity-keyword bigraph, involving polarity scores. A local community detection mechanism is then used to cluster them. The problem of detecting keywords is solved as an embedded approach: during the steps for building entity groups from scraped tweets, keywords are filtered from a raw candidate set down to the final set of keywords. This approach can be useful for building keyword lexicons dynamically. A report of sample runs of the implementation is also included in this document. Several key observations are noted in section 4.

6. References
[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," in Language Resources and Evaluation, 2010.
[2] Twitter, "REST API v1.1 Resources," [Online]. Available:
[3] "GATE Twitter part-of-speech tagger," [Online]. Available:
[4] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques," in Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Philadelphia, PA, USA, 2002.
[5] T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing contextual polarity in phrase-level sentiment analysis," in HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2005.
[6] A. Esuli and F. Sebastiani, "Sentiwordnet: A publicly available lexical resource for opinion mining," in Proceedings of LREC, 2006.
[7] S. Baccianella, A. Esuli and F. Sebastiani, "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining," in LREC, 2010.
[8] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede, "Lexicon-based methods for sentiment analysis," Computational linguistics, vol. 37, pp. 267-307, 2011.
[9] V. Hatzivassiloglou and J. M. Wiebe, "Effects of adjective orientation and gradability on sentence subjectivity," in Proceedings of the 18th conference on Computational linguistics-Volume 1, 2000.
[10] C. Whitelaw, N. Garg and S. Argamon, "Using appraisal groups for sentiment analysis," in Proceedings of the 14th ACM international conference on Information and knowledge management, 2005.
[11] F. Benamara, C. Cesarano, A. Picariello, D. Reforgiato and V. Subrahmanian, "Sentiment Analysis: Adjectives and Adverbs are better than Adjectives Alone," in International Conference on Weblogs and Social Media, Boulder, CO USA, 2007.
[12] V. S. Subrahmanian and D. Reforgiato, "AVA: Adjective-verb-adverb combinations for sentiment analysis," Intelligent Systems, vol. 23, no. 4, pp. 43-50, 2008.
[13] T. Mullen and N. Collier, "Sentiment Analysis using Support Vector Machines with Diverse Information Sources," in EMNLP, 2004.
[14] A. Bifet and E. Frank, "Sentiment knowledge discovery in Twitter streaming data," in Discovery Science, Berlin Heidelberg, Springer, 2010, pp. 1-15.
[15] C. Lin and Y. He, "Joint sentiment/topic model for sentiment analysis," in Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
[16] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen and X. He, "Interpreting the Public Sentiment Variations on Twitter," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 1, pp. 1-14, 2012.
[17] X. Ding, B. Liu and P. S. Yu, "A holistic lexicon-based approach to opinion mining," in WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining, New York, NY, USA, 2008.
[18] S. Narr, "Annotated Twitter Sentiment Dataset," [Online]. Available: [Accessed 7 10 2013].
[19] "Sentiment140," [Online]. Available:
[20] K. Zhang, H. Xu, J. Tang and J. Li, "Keyword Extraction Using Support Vector Machine," in Advances in Web-Age Information Management, Springer, 2006, pp. 85-96.
[21] A. Hulth, "Improved automatic keyword extraction given more linguistic knowledge," in EMNLP '03 Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2003.
[22] O. Medelyan and I. H. Witten, "Thesaurus based automatic keyphrase indexing," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, 2006.
[23] G. Karypis and V. Kumar, "Multilevel k-way Partitioning Scheme for Irregular Graphs," J. Parallel Distrib. Comput., vol. 48, no. 1, pp. 96-129, 1998.
[24] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proc. Natl. Acad. Sci. USA, vol. 99, 2002.
[25] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, 066133, 2004.
[26] A. Clauset, M. E. J. Newman and C. Moore, "Finding community structure in very large networks," Phys. Rev. E, vol. 70, 066111, 2004.
[27] M. E. J. Newman, "Modularity and community structure in networks," Proc. Natl. Acad. Sci. USA, vol. 103, pp. 8577–8582, 2006.
[28] R. Andersen, F. Chung and K. Lang, "Local graph partitioning using PageRank vectors," in 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), 2006.
[29] H. Schmid, "TreeTagger," TC project at the Institute for Computational Linguistics of the University of Stuttgart, 1994.
[30] B. Santorini, Part-of-speech tagging guidelines for the Penn Treebank Project, 3rd revision ed., 1990.
[31] A. Go, R. Bhayani and L. Huang, "Twitter sentiment classification using distant supervision," Stanford, 2009.
[32] L. Derczynski, A. Ritter, S. Clark and K. Bontcheva, "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data," in Proceedings of the International Conference on Recent Advances in Natural Language Processing, 2013.