Twitter LDA
http://www.akshaybhat.com/LDA

Twitter LDA Presentation Transcript

  • Entity and Link annotation in Online Social Networks
    Karan Kurani & Akshay Bhat
    CS 6740 Fall 2010 Project at Cornell University
  • Overview
    • Introduction
    • Prior work
    • Methodology
    • Datasets and Implementation
    • Results
    • Discussion
    • Future Work
  • Introduction
    Motivation:
    We are interested in studying how social networks, and the textual information associated with entities in the network, can be modeled for insight.
    Goal:
    To create a model for annotating entities and links using
    The social network
    Textual content
    Dataset:
    Social network of 36 million Twitter users and 450 million tweets
    Applications:
    Targeted advertising,
    Friend suggestions,
    etc.
  • Prior Work
    Large-scale analysis of user behavior on Twitter: "What is Twitter, a Social Network or a News Media?" by Kwak et al.
    Studies the propagation of information through the network using retweets, to determine user influence.
    "Automatic generation of personalized annotation tags for Twitter users" by Wu et al.
    Uses TF-IDF weights to assign tags to each user, using textual information alone.
  • Prior Work: Motivation
    "Connections between the lines: Augmenting social networks with text" by Chang et al.
    Using Wikipedia and the Bible, annotated with entities, a network between entities and a topic model are constructed.
  • Prior Work: Other models
    Block-LDA: "Jointly modeling entity-annotated text and entity-entity links" by Cohen et al. (Protein-Protein Interaction dataset)
    Predefined undirected network & text associated with each node
  • Prior Work: Other models
    Topic-Link LDA: "Joint models of topic and author community" by Liu et al.
    A corpus of academic publications is modeled using a Bayesian hierarchical topic model
    to find topics within those papers as well as communities of authors.
  • Methodology: Overview
    Community Detection
    • Detect communities of users in the social network
    • Use Label Propagation algorithm
    • Only the network information (who follows whom) is used
    • Communities are detected at various levels
    LDA
    • Each community is considered as a corpus
    • All tweets by a single entity are considered as a single document
    • Only the textual information is used
    • Stemming, stop word removal and rare word removal
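
The community detection step can be illustrated with a small in-memory sketch of label propagation. This is not the authors' Hadoop implementation; it assumes the follower graph is treated as undirected, that updates are asynchronous, and that ties are broken at random. The function name `label_propagation` and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict


def label_propagation(edges, iterations=15, seed=0):
    """Assign a community label to every node of an undirected graph.

    edges: iterable of (node_a, node_b) pairs; direction is ignored here,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}
    nodes = sorted(neighbors)

    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[node] in candidates:
                continue  # current label is already among the most frequent
            labels[node] = rng.choice(candidates)
            changed = True
        if not changed:
            break  # no label changed during a full pass
    return labels


if __name__ == "__main__":
    # Two disconnected triangles, used only as toy data.
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
    print(label_propagation(toy_edges))
```

Stopping the loop at different iteration counts gives communities at different granularities, which is one way to read "communities are detected at various levels".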
  • Methodology: Generating annotations
    A topic is considered relevant to a user if its probability for that user exceeds 0.05.
    Users are annotated using the relevant topics generated by the LDA model.
    For a link between two users, we take the intersection of the topics generated for the two users forming the link.
    We also detect general topics by comparing against topics generated for randomly selected users from the whole network (not the community).
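
The user and link annotation rules above can be sketched as follows. The 0.05 threshold comes from the slide; the function names and the representation of a user's topic distribution as a plain list of probabilities are assumptions made for illustration. Detection of general topics is not covered here.

```python
THRESHOLD = 0.05  # relevance cutoff stated on the slide


def annotate_user(topic_probs, threshold=THRESHOLD):
    """Return the set of topic ids whose probability exceeds the threshold."""
    return {topic for topic, p in enumerate(topic_probs) if p > threshold}


def annotate_link(topics_u, topics_v):
    """A link is annotated with the topics shared by both of its endpoints."""
    return topics_u & topics_v


if __name__ == "__main__":
    # Toy topic distributions over four topics (made-up numbers).
    alice = annotate_user([0.60, 0.30, 0.04, 0.06])
    bob = annotate_user([0.02, 0.50, 0.40, 0.08])
    print(alice)                      # {0, 1, 3}
    print(bob)                        # {1, 2, 3}
    print(annotate_link(alice, bob))  # {1, 3}
```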
  • Methodology: Evaluation
    • Select a single community
    • Generate the LDA model from all users within that community
    • Generate topic probabilities using the model for a set of randomly selected users
    • A classifier (linear SVM) is used to discriminate between users in the community and randomly selected users.
    • Measure Accuracy, Precision and Recall
    • Repeat above procedure for different communities
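
The three reported measures can be computed from classifier predictions as in the sketch below. The classifier itself is omitted: any linear SVM implementation would supply `y_pred`. The label convention (1 for a community member, 0 for a randomly selected user) and the function name are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary labeling."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn

    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    # 1 = user from the community, 0 = randomly selected user (toy labels).
    truth = [1, 1, 1, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 1]
    print(classification_metrics(truth, predicted))
```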
  • Dataset
    • Two types of data sets:
    • Network:
    • Twitter follower network of 36 million users collected in June 2009.
    • Users with more than 900 followers are removed to uncover the underlying social network.
    • Textual data:
    • 450 million tweets from 20 million users.
    • Collected from June 2009 to December 2009.
    • Covers ~20-30% of all public tweets from above time period.
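
The follower-count filter could look like the sketch below. It assumes edges are stored as (follower, followee) pairs and that removing a user also drops every edge touching that user; the slides do not spell out either detail, and the function name is hypothetical.

```python
from collections import Counter


def prune_popular_users(edges, max_followers=900):
    """Drop users with more than max_followers followers.

    edges: list of (follower, followee) pairs.
    Returns (remaining_edges, removed_users).
    """
    follower_counts = Counter(followee for _, followee in edges)
    removed = {user for user, n in follower_counts.items() if n > max_followers}
    kept = [(a, b) for a, b in edges if a not in removed and b not in removed]
    return kept, removed


if __name__ == "__main__":
    # Toy graph with a low cutoff so the effect is visible.
    toy = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "a")]
    print(prune_popular_users(toy, max_followers=2))
```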
  • Implementation
    • Community Detection
    • Communities are detected using label propagation algorithm
    • Implemented on Cornell Web Lab Hadoop cluster
    • 15 iterations are performed
    • Communities from 7th and 15th iterations are considered
    • LDA
    • LingPipe – Java package which provides implementation of LDA.
    • Uses Collapsed Gibbs Sampling to infer the topic distributions.
    • Stop word removal, stemming and rare word removal are performed.
    • Number of topics – 50.
    • Topic prior value – 0.02, word prior – 0.001, number of samples – 2000, burn-in epochs – 100.
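
To show what collapsed Gibbs sampling does with these settings, here is a minimal pure-Python sketch. It is not LingPipe's implementation. The names `alpha` (topic prior) and `beta` (word prior) are this sketch's notation, documents are assumed to be lists of integer word ids, and for simplicity the estimates are read from the final sampler state only rather than from the collected samples.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.02, beta=0.001,
              burn_in=100, num_samples=2000, seed=0):
    """Collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic and per-topic word probabilities.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # counts n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n_kw
    topic_total = [0] * num_topics                              # counts n_k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    topics = range(num_topics)
    for _ in range(burn_in + num_samples):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
           for k in topics]
    return theta, phi


if __name__ == "__main__":
    # Two toy "users", each a document of word ids.
    toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
    theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4,
                           burn_in=10, num_samples=20)
    print(theta)
```

In the setup described on the slide, each document would be the concatenated, preprocessed tweets of one user and `num_topics` would be 50.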
  • Results - Topics
  • Results: User distribution over number of topics
  • Results: Topic Distribution
  • Results: Topic Distribution
  • Results: Portland Specific
  • Results: Portland General
  • Results: India Specific
  • Results: India General
  • Results: Common/Shared Topics
  • Results: Classifier performance
  • Discussion
  • Future Work
    Use Hierarchical Dirichlet Processes to determine the number of topics automatically.
    Also use the online version of LDA currently being developed by David Blei at Princeton, which would make it possible to generate topic distributions over the whole Twitter dataset.