Citizen Sensor Data Mining, Social Media Analytics and Development Centric Web Applications

Meenakshi Nagarajan, Amit Sheth, Selvam Velmurugan, "Citizen Sensor Data Mining, Social Media Analytics and Development Centric Web Applications," Tutorial at WWW2011, Hyderabad, India, March 28, 2011.

  • Got carried away with coverage and content – too much material for 3 hours – so the remaining content can be used as background
  • We want to understand meaningful citizen sensor observations: social signals
  • Source for Stats
  • Many media companies use Facebook and Twitter as news-delivery platforms. Many individuals rely on them as a news source. News is increasingly social.
  • A tweet is filled with metadata: information about when it was sent, by whom, using what Twitter application, and so on. (http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php)
  • “isn't much of life's meaning found in the play between limits and the infinite?”
  • Link to media files; context about annotation; a special option to write reviews of movies, books, or links you're sharing. The ISBN of the book, a link to a preview of the movie, and the number of stars in your rating could be included in the Tweet Annotations. Any way you can classify, describe, append, or otherwise enrich a Tweet with words or numbers can be included in Annotations.
  • Interest level (based on description info, lists, and favorite tweets)
  • Semantic metadata, relationships: Inferred?
  • Structure-level metadata:
    - Community size: shows scale (global vs. local)
    - Community growth rate: popularity estimation for a topic
    - Largest strongly connected component size: measures reachability in the directed graph
    - Number of weakly connected components and maximum size: distribution of pre-existing network connections (follower-followee); shows nature (loose vs. compact)
    - Average degree of separation: how many hops between two authors
    - Clustering coefficient: shows the likelihood of association
    Relationship-level metadata:
    - Type of relationship: topic/content (based on retweet, entity, etc.) or follower/followee (based on structure)
    - Relationship strength: strong vs. weak ties based on activity/communication between users; % tie strength
    - User homophily (homophily, i.e., "love of the same", is the tendency of individuals to associate and bond with similar others) based on a certain characteristic (e.g., location, interest): % of users showing similar behavior
    - Reciprocity (mutual relationship): % of users following back their followers
    - Active community/ties: how active the communication between users, or the relationship ties, is; average of tie strength based on activity
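
A few of the structure-level and relationship-level measures listed in the note above can be sketched in plain Python. This is only an illustration under stated assumptions: the `follows` graph and all user names are hypothetical toy data, reciprocity is simplified to the share of follow edges that are returned (rather than a per-user percentage), and the clustering coefficient is computed on the undirected version of the graph.

```python
from collections import deque

# Hypothetical toy follower graph: follows[u] is the set of users that u follows.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann"},
    "dan": {"eve"},
    "eve": set(),
}


def reciprocity(follows):
    """Fraction of follow edges (u -> v) that are returned (v -> u)."""
    edges = [(u, v) for u, targets in follows.items() for v in targets]
    if not edges:
        return 0.0
    mutual = sum(1 for u, v in edges if u in follows.get(v, set()))
    return mutual / len(edges)


def undirected(follows):
    """Drop edge direction: neighbors[u] holds everyone linked to u either way."""
    neighbors = {u: set() for u in follows}
    for u, targets in follows.items():
        for v in targets:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    return neighbors


def weak_components(follows):
    """Weakly connected components, found by BFS over the undirected graph."""
    neighbors = undirected(follows)
    seen, components = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


def clustering_coefficient(follows, user):
    """Share of a user's neighbor pairs that are themselves linked."""
    neighbors = undirected(follows)
    nbrs = sorted(neighbors[user])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return links / (k * (k - 1) / 2)


if __name__ == "__main__":
    print("reciprocity:", round(reciprocity(follows), 3))
    print("weak component sizes:", sorted(len(c) for c in weak_components(follows)))
    print("clustering(ann):", clustering_coefficient(follows, "ann"))
```

On a real follower network, a graph library would normally replace these hand-written helpers; the point here is only how each measure relates to the follower-followee structure.
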
  • Pat Hayes
  • Add some examples of how people store such semantic metadata… When you put social data as LOD, talk about technologies -
  • Building on foundations of: statistical natural language processing; information extraction; Semantic Web/knowledge representation. We will talk about key issues in extracting metadata from informal text and how it differs from what has been done on more well-structured text such as news articles.
  • Social media text is informal for various reasons. Read the red points.
  • Recently two researchers came up with a score to formalize the contextual nature of text and therefore the formality of text: the more context available, the more formal the text. We used the same score on social media text and found that … The score is too limited and probably outdated: it does not consider full sentences or structure, and it does not consider links. A similar network-related score would be good to have.
  • What the two tasks look like in terms of outputs they produce
  • For two types of named entities (movie and music) over two types of social media text, using those cues
  • Focus on only one row at a time. The cultural entity definition is on the next slide.
  • What makes cultural entity extraction more difficult
  • There are two flavors of the cultural entity recognition problem: (1) the same entity appears in multiple senses in the same domain; (2) the same entity appears in multiple senses in different domains.
  • Focusing on the first flavor
  • The same song occurs as multiple instances in MusicBrainz (the knowledge base)
  • Sample real world constraints hard-coded – this work was an experiment into scoping using real-world constraints
  • As you chop away at the domain model, your accuracy increases…
  • This is an application of the NER work
  • Conclusion in RED
  • We have come a long way but still room for improvement
  • A fact can be proven; an opinion cannot. An opinion is normally a subjective statement based on people's thoughts, feelings, and understanding.
  • Social media serves as a platform for people to speak their minds more freely, which leads to a growing volume of opinionated data that can be used by: (1) individuals, for suggestions and recommendations; (2) companies and organizations, for marketing strategies and other decision-making processes; (3) governments, for monitoring social phenomena, being aware of potentially dangerous situations, etc.
  • For the task of classification, supervised or unsupervised learning techniques can be used. For review-like data (e.g., movie or product reviews), it is easy to get training data from websites like IMDb or Amazon. Sentiment classification differs from traditional topic classification because different features are involved.
    Lexicon-based approach: first create a sentiment lexicon, then determine the polarity of a text via some function of the positive and negative clues within the text, as determined by the lexicon.
    Bootstrapping: use the output of an available initial classifier to create labeled data, to which a supervised learning algorithm may be applied.
    The task of extracting the opinion, holder, and target is similar to the traditional information extraction task. The difference is that here the relations between opinion and opinion target are considered important. For example, under the proximity assumption, the opinion expression is assumed to be close to the opinion target in the text; if the opinion target is given, the nearby adjectives can be extracted as opinion candidates. Other possible ways to model the relations between opinion and opinion target include syntactic dependency, co-occurrence, or manually prepared patterns/rules.
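
The lexicon-based approach and the proximity assumption from the note above can be sketched roughly as follows. Everything here is an assumption for illustration: the tiny `POSITIVE` and `NEGATIVE` word sets are hypothetical, the polarity function is a plain count difference, and lexicon membership stands in for the adjective detection that a part-of-speech tagger would provide.

```python
import re

# Hypothetical toy lexicon; a real sentiment lexicon would be far larger.
POSITIVE = {"good", "great", "amazing", "love"}
NEGATIVE = {"bad", "boring", "awful", "hate"}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text):
    """Lexicon-based polarity: positive clue count minus negative clue count."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def opinions_near(text, target, window=3):
    """Proximity assumption: lexicon words within `window` tokens of the target."""
    tokens = tokenize(text)
    lexicon = POSITIVE | NEGATIVE
    found = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        found.extend(t for t in tokens[lo:hi] if t in lexicon)
    return found


if __name__ == "__main__":
    print(polarity("The plot was great but the ending was boring and awful"))
    print(opinions_near("The acting was great but the long trailer was boring",
                        "acting", window=2))
```

With a small window, only "great" is attached to "acting" in the second example, while "boring" is left for the more distant "trailer"; that is the behavior the proximity assumption is meant to capture.
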
  • In this paper, the authors connect measures of public opinion from polls with sentiment measured from tweets. They find that a relatively simple sentiment detector based on tweets replicates consumer confidence and presidential job approval polls. The results highlight the potential of text streams as a substitute and supplement for traditional polling. Positive and negative words are defined by the subjectivity lexicon from OpinionFinder, a word list containing about 1,600 and 1,200 words marked as positive and negative, respectively (Wilson, Wiebe, and Hoffmann, 2005). A message is defined as positive if it contains any positive word, and negative if it contains any negative word. (This allows for messages to be both positive and negative.)
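
The message-labeling rule described in the note above is simple enough to sketch directly. The word sets below are hypothetical stand-ins for the OpinionFinder lexicon, whitespace tokenization is a simplification, and the positive-to-negative ratio is shown only as one plausible way to turn the counts into a daily score.

```python
# Hypothetical stand-ins for the positive and negative word lists.
POSITIVE = {"good", "hope", "confident"}
NEGATIVE = {"bad", "fear", "worried"}


def label(message):
    """Return (is_positive, is_negative); a message can be both."""
    words = set(message.lower().split())
    return bool(words & POSITIVE), bool(words & NEGATIVE)


def daily_counts(messages):
    """Count positive and negative messages for one day's worth of text."""
    positive = negative = 0
    for message in messages:
        is_positive, is_negative = label(message)
        positive += is_positive
        negative += is_negative
    return positive, negative


if __name__ == "__main__":
    day = [
        "good jobs report",
        "worried about jobs",
        "hope fades bad news",
        "nothing here",
    ]
    positive, negative = daily_counts(day)
    print(positive, negative)
    if negative:
        # Assumed score: ratio of positive to negative message counts.
        print(positive / negative)
```
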
  • In this paper, the authors demonstrate how social media content can be used to predict real-world outcomes. In particular, they use tweets to forecast box-office revenues for movies. The results show that a prediction model using the rate at which tweets are created about a movie outperforms market-based methods, and that the sentiments present in tweets about a movie can be used to improve the prediction. The intuition is that a movie with far more positive than negative tweets is likely to be successful. For sentiment classification of tweets, they use a supervised classifier, "DynamicLMClassifier" from LingPipe. Each tweet in the training set is labeled as positive, negative, or neutral by workers from Amazon Mechanical Turk. The classifier is trained using an n-gram model; in their work, they use n=8. They find that the sentiments do provide improvements, although they are not as important as the rate of tweets themselves.
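
To give a feel for n-gram based sentiment classification as mentioned in the note above, here is a much-simplified sketch. It is not LingPipe's implementation: it assumes character n-grams, scores a text as a bag of n-grams with add-one smoothing, ignores class priors, and uses n=3 because the hypothetical training set is tiny (the note reports n=8 on real data).

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams of the lowercased text, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NGramClassifier:
    """Toy classifier: one smoothed character n-gram count table per label."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of n-grams seen for that label
        self.vocab = set() # all n-grams seen in training

    def train(self, text, label):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(label, Counter()).update(grams)
        self.vocab.update(grams)

    def log_score(self, text, label):
        counts = self.counts[label]
        total = sum(counts.values())
        size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        return sum(
            math.log((counts[g] + 1) / (total + size))
            for g in char_ngrams(text, self.n)
        )

    def classify(self, text):
        return max(self.counts, key=lambda label: self.log_score(text, label))


if __name__ == "__main__":
    clf = NGramClassifier(n=3)
    # Hypothetical hand-labeled examples standing in for crowd-labeled tweets.
    clf.train("loved it, great movie", "positive")
    clf.train("great fun, loved the cast", "positive")
    clf.train("boring and awful", "negative")
    clf.train("awful plot, so boring", "negative")
    clf.train("saw the movie today", "neutral")
    print(clf.classify("really boring"))
    print(clf.classify("great cast"))
```
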
  • One of the most attractive advantages of unsupervised approaches is that they do not require training data. Many sentiment analysis applications for social media content use a simple lexicon-based method. However, it does not work for the problem of target-specific sentiment analysis. Consider a simple lexicon-based method that uses a general sentiment lexicon containing positive, negative, and neutral words in the general sense. (1) For the task of "find tweets containing positive opinions about a specific topic", such as a movie, the results will look like what the table shows; however, tweets 2, 3, 5, 6, and 7 do not contain opinions about the movie. (2) For the task of extracting the opinion clues/expressions, the right answers should be as shown in the other picture; however, the simple lexicon-based method might return all the words in orange in the table.
  • We create a general subjective lexicon that contains subjective words in the general sense. This lexicon is created by extending a commonly used subjective lexicon with slang learned from Urban Dictionary. The general lexicon is used to select candidate sentiment units. A bootstrapping method is used to learn domain-dependent sentiment clues from a domain-specific corpus. Most current lexicons contain only words; we employ statistical models to find words, phrases, and patterns that can be used as sentiment clues, such as "must see", "want my money/time back", and "don't miss it" in the movie domain. For the task of identifying opinions towards the given target, we use a syntactic rule-based method as well as a proximity model. Since the informal language structure of tweets makes parsing difficult, our method requires only a very shallow syntactic parse of tweets.
  • Refs: http://en.wikipedia.org/wiki/Writing_style and http://en.wikipedia.org/wiki/Psychometrics
  • Metadata from network analysis is not sufficient to answer the above questions unless we consider context; hence a merged approach (content + network) is better.
  • Example scenario: a buyer wants to buy a movie DVD. There are multiple influencers!
    - Key influencers: media experts
    - Peer influencers: his colleagues (the people the buyer interacts with face-to-face daily)
    - Social influencers: his social circle
    Now how do we find them? Link analysis based on structure alone is not sufficient: SOCIAL MEDIA IS HIGHLY COMPLEX. That's why we need additional context analysis in play.
    - Popularity is NOT the same as influence!
    - We need to understand the audience; their activity level and interest are of greater importance.
    Homophily (tendency to follow similar behavior) limits people's social worlds in a way that has powerful implications for the information they receive, the attitudes they form, and the interactions they experience.
    KLOUT (http://klout.com/kscore):
    - Reach: Are your tweets interesting and informative enough to build an audience? How far has your content been spread across Twitter?
    - Amplification Probability: the likelihood that your content will be acted upon
  • Multiple types of users: HOW DO WE FIND OUT THESE TYPES? Does the external web presence (background knowledge) of a user tell us more than the limited context available in the network?
  • User engagement levels: applications in coordination activities. Connecting the dots here with NGO initiatives (*presented by Selvam)
  • User engagement levels: applications in coordination activities. Connecting the dots here with NGO initiatives (*presented by Selvam). Do not limit this to active vs. passive in general; be specific to a topic and then say 'active' or 'passive' w.r.t. that topic (e.g., active for 'Biology info' vs. passive for 'comp. sci. info').
  • Connections/Relationships: implicit content features
  • What we want to achieve with network analysis for social media: graph traversal, for understanding reachability between people; community formation and sustainability for people
  • We want to analyze these social networks to support various social science studies: diffusion; homophily (tendency to follow similar behavior) based on certain characteristics (demographics, interests, etc.). What factors make the dynamics different here?
  • The authoritative nature of the poster or the volume of follower connections did not predict the retweet behavior associated with the tweets!
  • Spammers are diverting their attention to social media sites.
  • This tweet was by Kenneth Cole at the time of the Egyptian revolution. Though it uses a hashtag that was used to indicate a tweet about the Egypt crisis (#Cairo), the link it contains is not connected to the Egypt crisis.
  • This article was published on the Guardian website in Feb 2010. In it, BBC Director Peter Horrocks states that journalists should use social media as the primary source of information. He had taken over the position of Director a week earlier. Now consider a scenario where a journalist who wants to follow social signals wants to analyze what news is stirring today at a particular location. The problem with this is information overload.
  • This use case requires merging streaming data with background knowledge (e.g., from DBpedia). Examples of ?category include category:Wi-Fi devices and category:Touchscreen portable media players, amongst others. As a result, without having to enumerate all products of interest as keywords to filter a stream, a user can leverage relationships in background knowledge to narrow the stream of tweets down to a subset of interest more effectively.
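
The idea in the note above, filtering a stream by a category from background knowledge instead of by a hand-written keyword list, can be sketched as below. The `CATEGORIES` table is a hypothetical, hand-written stand-in for what a knowledge base such as DBpedia would supply, the product-to-category assignments are illustrative only, and substring matching is a crude substitute for real entity spotting.

```python
# Hypothetical background knowledge: entity label -> categories it belongs to.
CATEGORIES = {
    "ipod touch": {
        "category:Wi-Fi devices",
        "category:Touchscreen portable media players",
    },
    "kindle": {"category:Wi-Fi devices"},
    "walkman": {"category:Portable media players"},
}


def entities_in(category):
    """All known entities that the background knowledge places in `category`."""
    return {entity for entity, cats in CATEGORIES.items() if category in cats}


def filter_stream(tweets, category):
    """Yield only the tweets that mention some entity in the given category."""
    wanted = entities_in(category)
    for tweet in tweets:
        text = tweet.lower()
        if any(entity in text for entity in wanted):
            yield tweet


if __name__ == "__main__":
    stream = [
        "Just got an iPod Touch!",
        "My old Walkman still works",
        "Reading on my Kindle",
        "Lunch time",
    ]
    for tweet in filter_stream(stream, "category:Wi-Fi devices"):
        print(tweet)
```

The user names only the category; the set of matching product names comes from the background knowledge, which is what lets the filter keep up as new products are added to the category.
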
  • How news articles … We collected the output of our system for the healthcare topic over a time period. We also collected articles from the NYTimes for the same period and fed them as input to our extraction pipeline. We then plotted the occurrence of entities in tweets and in NYTimes articles and found that the two coincide very well. We then looked up the events that occurred during the peaks of our time plot on timeline.com and nytimes.com and found this result.
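
The comparison in the note above was done by plotting the two time series. As a numeric complement, one could compute a correlation between the daily entity counts and locate the peak day; the sketch below does this with Pearson correlation, which is an assumption of this example rather than something the note reports, and all counts are made-up illustrative numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def peak_index(series):
    """Index of the largest value, i.e. the day to look up in a news timeline."""
    return max(range(len(series)), key=lambda i: series[i])


if __name__ == "__main__":
    # Hypothetical daily mention counts of one entity over the same week.
    tweet_counts = [12, 15, 40, 90, 35, 14, 10]
    article_counts = [1, 2, 5, 9, 4, 2, 1]
    print("correlation:", round(pearson(tweet_counts, article_counts), 3))
    print("tweet peak day:", peak_index(tweet_counts))
    print("article peak day:", peak_index(article_counts))
```
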
