Twitter mining


Published in: Technology

  • Hashtags are indicated by a # symbol and are combined with keywords to indicate a topic of interest. Hashtags become popular when many people use them. Popular topics, known as “trending” topics, appear on the main Twitter page and can significantly increase the number of tweets containing that topic.
  • We are friends. Twitter: follow you. Media: the means of communication, such as radio, television, newspapers, and magazines, that reach or influence people widely. Only 22.1% of user pairs follow each other (Flickr 68%, Yahoo 84%). The majority of topics are headlines. Twitter users are ranked by followers, PageRank, and RT: followers and PageRank (actor, musician, show host, sports star, model); RT (news). A retweet brings a few hundred additional readers (55% of RT < 1 hr). Summary: Low reciprocity distinguishes Twitter from OSNs. Twitter has characteristics of news media: 1. tweets mentioning timely topics; 2. plenty of hubs reaching a large public directly; 3. fast and wide spread of word-of-mouth.
  • Indegree: news sources, politicians, athletes, celebrities. Retweets: content aggregation services, news sites. Mentions: celebrities.
  • The follower/followee ratio “matters” more than the raw number of followers. Following people is a simple way to get followers.
  • TunkRank is an influence ranking tool that helps you identify leading influencers on Twitter. There are two basic ideas: (1) The amount of attention you can give is spread out among all those you follow. The more you follow, the less attention you can give each one. (2) Your influence depends on the amount of attention your followers can give you. As a twitterer, your influence does not depend on how many people you follow. However, your usefulness as a follower does. Having higher influence depends on having many followers who follow relatively few people but are followed by many. Followers like that are more likely to read your tweets and act on them (retweeting, clicking links, responding, blogging, etc.). Their influence trickles up to you. Your TunkRank score reflects how much attention your followers can give you directly and how much attention they bring you from their own network of followers.
  • External URLs; letter+number patterns in usernames; suggestive keywords (“naked”, “girls”, “webcam”); propagation tree.
  • A context extraction algorithm (such as PCA or SVD) runs over the recent history of the trend and reports the keywords most correlated with it. For example, the keyword ‘NBA’ may usually appear in 5 tweets per minute, yet suddenly exhibit a rate of 100 tweets/min.
  • Lots of celebrity names (e.g. lady gaga). @ and # reduce ambiguity, like advanced query operators. Hashtag queries are particularly popular: the most popular queries contain a hashtag 51% of the time, the least popular queries 7% of the time. Celebrity queries are particularly popular: the most popular queries name a celebrity 25% of the time, the least popular queries 4% of the time. Twitter queries are less diverse than Web queries: only 1 in 4 is unique (vs. 2 in 4).
  • There is no collaborative filtering.
  • Length normalization; stopwords; threshold to remove similar ones; cluster.
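  • The researchers used two mood-tracking tools. One is the open-source tool OpinionFinder, which classifies tweets as positive or negative. The other is GPOMS, a new mood-measurement tool that the researchers built on the Profile of Mood States (POMS) questionnaire used in clinical practice. It sorts public mood into six categories: calm, alert, sure, vital, kind, and happy. To validate the two tools, the researchers compared public mood with social events, and the two matched closely. For example, around the presidential election (November 4, 2008), Twitter grew tense the day before election day, turned calm, vital, kind, and happy on election day itself, and returned to normal afterwards. On Thanksgiving (November 27), Twitter was full of happiness, and then returned to normal. Most striking of all, when the “calm” index is shifted forward by 3 days, it tracks the Dow Jones Industrial Average remarkably closely. The other moods show no such effect. The researchers also tested a stock-market prediction model called SOFNN. With stock-market data alone as input, the model already reaches 73.3% accuracy; adding the “calm” mood information raises accuracy to 86.7%. However, the Twitter mood indicator still cannot predict sudden events that shock financial markets. For example, on October 13, 2008, the Federal Reserve abruptly launched a bank bailout plan that made the Dow rebound, and the Twitter calm index from 3 days earlier naturally gave no sign of it. The researchers themselves also recognize that Twitter users and stock-market investors do not fully overlap, so the representativeness of the sample is open to question. Two researchers at the Technical University of Munich analyzed Twitter in finer detail [5]. They selected tweets mentioning companies in the S&P 100 index (for example, $AAPL for Apple), classified them as “buy”, “hold”, or “sell”, and computed the bullishness of each stock. The results are likewise encouraging. For example, tweet volume correlates closely with trading volume, and bullishness with the S&P 100 index. More practically, an investor who bought the 3 most bullish stocks and sold the 3 least bullish would have earned a return of as much as 15% in half a year. Arthur O’Connor [7], a doctoral student at Pace University in the US, took a different approach. He tracked the social-media popularity of Starbucks, Coca-Cola, and Nike and compared it with their share prices. He found that the number of fans on Facebook, followers on Twitter, and views on YouTube all correlate closely with share price. Brand popularity could also predict share-price gains 10 and 30 days later.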
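The TunkRank note above can be sketched as a short iteration. This is a minimal sketch, not the tool's actual implementation: it assumes the commonly cited recurrence Influence(X) = sum over followers Y of (1 + p * Influence(Y)) / |Following(Y)|, where p is an assumed constant probability that a follower passes a tweet on. The function name `tunkrank`, the value p = 0.05, the fixed iteration count, and the toy graph are all hypothetical.

```python
def tunkrank(following, p=0.05, iterations=50):
    """Estimate TunkRank-style influence scores.

    `following` maps each user to the set of users they follow.
    `p` is an assumed constant retweet probability (hypothetical value).
    """
    # Collect every user that appears as a follower or as a followee.
    users = set(following) | {v for vs in following.values() for v in vs}
    influence = {u: 0.0 for u in users}
    for _ in range(iterations):
        updated = {u: 0.0 for u in users}
        for follower, followees in following.items():
            if not followees:
                continue
            # A follower's attention is split evenly among everyone they
            # follow; their own influence trickles up, damped by p.
            share = (1.0 + p * influence[follower]) / len(followees)
            for followee in followees:
                updated[followee] += share
        influence = updated
    return influence


# Toy graph (made up for illustration): a follows c; b follows c and d.
example = {"a": {"c"}, "b": {"c", "d"}}
scores = tunkrank(example)
```

In this toy graph, `a` gives all of its attention to `c` while `b` splits its attention between `c` and `d`, so `c` scores 1.5 and `d` scores 0.5. Neither `a` nor `b` has followers, so both score 0.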

    1. 1. Microblog (Twitter) mining yutao
    2. 2. What is Twitter? • 140-character tweet • Hashtag: # before relevant keywords in a tweet • RT means to “re-tweet” or forward a tweet • @ reference refers to a user’s screen name
    3. 3. Why is it different? • Very short in length • Written in informal style • Social
    4. 4. What is Twitter, a social network or a news media? (WWW 2010) • Following is mostly not reciprocated (not so “social”) • Users talk about timely topics • A few users reach a large audience directly • Most users can reach a large audience quickly by word-of-mouth
    5. 5. Early analysis
    6. 6. Analysis 1: Take the people out • Krishnamurthy et al. (2008) • Users were classified by follower/following counts, numbers and ratios • Means and mechanisms of their engagement: Web (61.7%), mobile/text (7.5%), software (22.4%)
    7. 7. Analysis 2: Content category. Four meta-categories: • daily chatter • conversations • information / URL sharing • news reporting
    8. 8. Analysis 3: Measuring user influence • Indegree, retweets and mentions • Strong correlation between retweets and mentions • Most connected != most influential
    9. 9. User influence
    10. 10. How to detect spam? • Classification • Content attributes: hashtags, trending topics, replies, mentions, HTTP links • User behavior attributes: age of user account • Graph-based attributes
    11. 11. Sentiment analysis • Supervised classification • Training data come from Twitter itself instead of human labeling • Happy emoticons: “:-)”, “:)”, “=)”, “:D” etc. • Sad emoticons: “:-(”, “:(”, “=(”, “;(” etc. • Objective: newspapers and magazines such as “NY Times”
    12. 12. Trend detection • Bursty keyword detection • Bursty keyword grouping • Context extraction (such as PCA, SVD)
    13. 13. Twitter search (WSDM 2011)
    14. 14. The largest difference • Twitter search orders by time • Search engines order by relevance • Social • Time
    15. 15. Recommendation
    16. 16. Recommending content from information streams • The filtering problem: “I get 1000+ items in my stream daily but only have time to read 10 of them. Which ones should I read?” • The discovery problem: “There are millions of URLs posted daily on Twitter. Am I missing something important there outside my own Twitter stream?”
    17. 17. Recommending content from information streams • Recency of content: only interesting within a short time after being published; always a “cold start” situation • Explicit interaction among users: users explicitly interact by subscribing or sharing • User-generated content: people are content producers as well as consumers
    18. 18. Recommending content from information streams
    19. 19. URL sources • Considering all URLs was impossible • FoF: URLs from followees-of-followees • Popular: URLs that are popular across the whole of Twitter
    20. 20. Topic relevance scores • Topic profile of URLs: use term vectors as profiles, built from tweets that have mentioned the URL • Topic profile of users: Self-Topic, a content profile based on what I post; Followee-Topic, a content profile based on what my followees post
    21. 21. Social network scores • “Popular vote” among my followees-of-followees: people “vote” for a URL by tweeting it; votes are weighted using social network structure; URLs with more votes in total are assigned a higher score
    22. 22. Recommending Twitter users to follow • Social graph • Profile user: the user himself, followers, followees
    23. 23. Microblog summarization
    24. 24. The phrase reinforcement algorithm • Looking for the most commonly occurring phrases: users tend to use similar words when describing a particular topic; RT
    25. 25. Hybrid TF-IDF summarization • TF: the document is the entire collection of posts • IDF: the document is a single post
    26. 26. Topic model
    27. 27. Content modeling on Twitter [Figure: surface word features (tf.idf, cosine similarity, etc.) versus deeper natural language processing (parsing, parts of speech, coreference, etc.); example tweet from THE_REAL_SHAQ: “dats yur mom not me lol”]
    28. 28. Content modeling on Twitter [Figure: surface word features (tf.idf, cosine similarity, etc.); dimensionality reduction (topic models, Latent Dirichlet Allocation (LDA), LSA, etc.); supervised classification (Naïve Bayes, SVM, etc.) on #hashtags, emoticons, questions, etc.; Labeled LDA, best model in ranking experiments]
    29. 29. Content modeling with Labeled LDA • Discover unlabeled topics: parameter K=200 latent topic dimensions • Model common labels: 500 - 1000 dimensions for hashtags, emoticons, etc. [Figure: example word lists for latent topics and labels, e.g. “obama president american america says country russia pope island”, “#jobs featured manager sales engineer yahoo location senior”, “Smile :) good day morning thanks happy hope birthday”]
    30. 30. Content modeling with Labeled LDA [Figure: example posts (“new muppetblog political commentary link”, “@kermit heyy wanna catch a movie”, “#yummy just ate a cookie #yummy”) with per-word topic assignments; histogram as signature for a set of posts]
    31. 31. Twitter content by category [Figure: content split across the four categories Substance, Status, Style, and Social (segments of 23%, 27%, 12%, and 38%), with example topic word lists for each]
    32. 32. Characterizing Microblogs with Topic Models. Outline: • Modeling Twitter content with topic models • Characterizing, recommending and filtering
    33. 33. Characterizing users
    34. 34. Characterizing users
    35. 35. TwitterRank: Finding Topic-sensitive Influential Twitterers • Apply LDA to distill topics automatically • Find topics in the twitterer’s content to represent her interests (twitterer’s content = aggregated tweets) • Twitterers with “following” relationships are more similar, in terms of the topics they are interested in, than those without
    36. 36. Topic-specific TwitterRank
    37. 37. Interesting applications • Personalized and automatic social summarization of events in video • Twitter can predict the stock market • Predicting elections with Twitter • Earthquake (time, location)
    38. 38. Thanks. Many pictures and slides come from the internet.
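The hybrid TF-IDF idea on slide 25 (term frequency counted over the whole collection of posts, inverse document frequency counted over single posts), together with the length-normalization note, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `hybrid_tfidf_scores`, the whitespace tokenizer, the log base, the `min_len` normalization floor, and the sample posts are all hypothetical. Stopword removal and the similarity threshold mentioned in the notes are left out.

```python
import math
from collections import Counter


def hybrid_tfidf_scores(posts, min_len=3):
    """Score each post; a higher score suggests a better summary candidate."""
    tokenized = [post.lower().split() for post in posts]
    total_words = sum(len(tokens) for tokens in tokenized)
    # TF: the "document" is the entire collection of posts.
    tf = Counter(word for tokens in tokenized for word in tokens)
    # IDF: each "document" is a single post.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_posts = len(tokenized)

    def weight(word):
        return (tf[word] / total_words) * math.log2(n_posts / df[word])

    scores = []
    for tokens in tokenized:
        # Length normalization: divide by the post length, with an assumed
        # floor (min_len) so that very short posts are not favored.
        norm = max(min_len, len(tokens))
        scores.append(sum(weight(word) for word in tokens) / norm)
    return scores


# Made-up sample posts for illustration only.
posts = [
    "lakers win the game tonight",
    "what a game lakers win again",
    "rain today",
]
scores = hybrid_tfidf_scores(posts)
best = max(zip(scores, posts))[1]
```

Here `best` holds the highest-scoring post, which a summarizer would pick first. A fuller version would then skip candidates that are too similar to posts already selected.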