Who does @timoreilly retweet     most frequently?
Counting Retweets• Map @mentions out of tweets using a regex• Reduce to sum them up• Sort the results• Display results
Retweets by @timoreilly
How frequently is @timoreilly        retweeted?
Retweet Counts• An API resource /statuses/retweet_count exists (and is now functional)• Example: http://twitter.com/status...
Survey Says... @timoreilly is retweeted about 2/3 of the time
How often does @timoreilly include #hashtags in tweets?
Counting Hashtags • Use a mapper to emit #hashtag entities for tweets • Use a reducer to sum them all up • Been there, done...
Survey Says... About 1 out of every 3 tweets by @timoreilly contains #hashtags
But if you order within the next 5 minutes...
Bonus Material: What do #JustinBieber and #TeaParty have in common?
Tweet Entities
#JustinBieber co-occurrences (tag cloud of co-occurring entities, e.g. #bieberblast, #music)
#TeaParty co-occurrences (tag cloud of co-occurring entities, e.g. @STOPOBAMA2012, #jcot, @blogging_tories, @TheFlaCracker)
Hashtag Distributions
Hashtag Analysis• TeaParty: ~ 5 hashtags per tweet.• Example: “Rarely is the questioned asked: Is our children learning?” ...
Common #hashtags: #lol, #dancing, #jesus, #music, #worldcup, #glennbeck, #teaparty, ...
Retweet Patterns
Retweet Behaviors
Friendship Networks
Juxtaposing Friendships• Harvest search results for #JustinBieber and #TeaParty• Get friend ids for each @mention with /fr...
Nodes Degrees
Two Kinds of Hairballs...       #JustinBieber        #TeaParty
The world twitterverse is your oyster
• Twitter: @SocialWebMining • GitHub: http://bit.ly/socialwebmining • Facebook: http://facebook.com/MiningTheSocialWeb ...
Unleashing Twitter Data for Fun and Insight

Matthew Russell's "Unleashing Twitter Data for Fun and Insight" presentation from Strata 2011. See http://strataconf.com/strata2011/public/schedule/detail/17714 for an overview of the talk.

  1. 1. Unleashing Twitter Data for Fun and Insight. Matthew A. Russell, http://linkedin.com/in/ptwobrussell, @ptwobrussell. Agile Data Solutions / Mining the Social Web
  2. 2. Happy Groundhog Day!
  3. 3. Mining the Social Web, Chapters 1-5. Introduction: Trends, Tweets, and Twitterers. Microformats: Semantic Markup and Common Sense Collide. Mailboxes: Oldies but Goodies. Friends, Followers, and Setwise Operations. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
  4. 4. Mining the Social Web, Chapters 6-10. LinkedIn: Clustering Your Professional Network For Fun (and Profit?). Google Buzz: TF-IDF, Cosine Similarity, and Collocations. Blogs et al: Natural Language Processing (and Beyond). Facebook: The All-In-One Wonder. The Semantic Web: A Cocktail Discussion
  5. 5. Overview • Trends, Tweets, and Retweet Visualizations • Friends, Followers, and Setwise Operations • The Tweet, the Whole Tweet, and Nothing but the Tweet
  6. 6. Insight Matters • What is @user's potential influence? • What are @user's passions right now? • Who are @user's most trusted friends?
  7. 7. Part 1: Tweets, Trends, and Retweet Visualizations
  8. 8. A point to ponder:Twitter : Data :: JavaScript : Programming Languages (???)
  9. 9. Getting Ready To Code
  10. 10. Python Installation • Mac users already have it • Linux users probably have it • Windows users should grab ActivePython
  11. 11. easy_install • Installs packages from PyPI • Get it: • http://pypi.python.org/pypi/setuptools • Ships with ActivePython • It really is easy: easy_install twitter easy_install nltk easy_install networkx
  12. 12. Git It? • http://github.com/ptwobrussell/Mining-the-Social-Web • git clone git://github.com/ptwobrussell/Mining-the-Social-Web.git • introduction__*.py • friends_followers__*.py • the_tweet__*.py
  13. 13. Getting Data
  14. 14. Twitter Data Sources• Twitter API Resources• GNIP• Infochimps• Library of Congress
  15. 15. Trending Topics
     >>> import twitter # Remember to "easy_install twitter"
     >>> twitter_search = twitter.Twitter(domain="search.twitter.com")
     >>> trends = twitter_search.trends()
     >>> [ trend['name'] for trend in trends['trends'] ]
     [u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen', u'#Christoferdrew', u'Justin Bieber', u'#WhatwouldItBeLike', u'#Sagittarius', u'SNL', u'#SurveySays', u'#iDoit2']
  16. 16. Search Results>>> search_results = []>>> for page in range(1,6):... search_results.append(twitter_search.search(q="SNL",rpp=100, page=page))
  17. 17. Search Results (continued) >>> import json >>> print json.dumps(search_results, sort_keys=True, indent=1) [ { "completed_in": 0.088122000000000006, "max_id": 11966285265, "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL", "page": 1, "query": "SNL", "refresh_url": "?since_id=11966285265&q=SNL", ...more...
  18. 18. Search Results (continued) "results": [ { "created_at": "Sun, 11 Apr 2010 01:34:52 +0000", "from_user": "bieber_luv2", "from_user_id": 106998169, "geo": null, "id": 11966285265, "iso_language_code": "en", "metadata": { "result_type": "recent" }, ...more...
  19. 19. Search Results (continued) "profile_image_url": "http://a1.twimg.com/profile_images/80...", "source": "&lt;a href=&quot;http://twitter.com/&quo...", "text": "im nt gonna go to sleep happy unless i see ...", "to_user_id": null } ... output truncated - 99 more tweets ... ], "results_per_page": 100, "since_id": 0 }, ... output truncated - 4 more pages ...]
  20. 20. Lexical Diversity • Ratio of unique terms to total terms • A measure of "stickiness"? • A measure of "group think"? • A crude indicator of retweets to originally authored tweets?
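The ratio described above is only a couple of lines of Python; a toy sketch on made-up tweet text (not the harvested SNL data from the surrounding slides):

```python
# Lexical diversity: unique terms divided by total terms.
tweets = [
    "RT @user SNL is on tonight",
    "SNL is on tonight and I am watching SNL",
]
words = [w for t in tweets for w in t.split()]
lexical_diversity = 1.0 * len(set(words)) / len(words)
print(round(lexical_diversity, 2))  # 0.67
```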
  21. 21. Distilling Tweet Text
     >>> # search_results is already defined
     >>> tweets = [ r['text']
     ...     for result in search_results
     ...     for r in result['results'] ]
     >>> words = []
     >>> for t in tweets:
     ...     words += [ w for w in t.split() ]
     ...
  22. 22. Analyzing Data
  23. 23. Lexical Diversity
     >>> len(words)
     7238
     >>> # unique words
     >>> len(set(words))
     1636
     >>> # lexical diversity
     >>> 1.0*len(set(words))/len(words)
     0.22602928985907708
     >>> # average number of words per tweet
     >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
     14.476000000000001
  24. 24. Size Frequency Matters • Counting: always the first step • Simple but effective • NLTK saves us a little trouble
  25. 25. Frequency Analysis
     >>> import nltk
     >>> freq_dist = nltk.FreqDist(words)
     >>> freq_dist.keys()[:50] # 50 most frequent tokens
     [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you', u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)', u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2', u'will']
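If NLTK isn't installed, the standard library's collections.Counter reproduces the most-frequent-tokens list on its own (toy data here; nltk.FreqDist layers much more on top):

```python
from collections import Counter

words = "snl on rt snl is on snl to rt on".split()
freq_dist = Counter(words)
print(freq_dist.most_common(3))  # [('snl', 3), ('on', 3), ('rt', 2)]
```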
  26. 26. Frequency Visualization
  27. 27. Tweet and RT were sitting on a fence. Tweet fell off. Who was left?
  28. 28. RTs: past, present, & future • Retweet: Tweeting a tweet that's already been tweeted • RT or via followed by @mention • Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!? • Relatively new APIs were rolled out last year for retweeting sans conventions
  29. 29. Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
  30. 30. Parsing Retweets
     >>> example_tweets = ["Visualize Twitter search results w/ this simple script http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via @SocialWebMining @ptwobrussell)"]
     >>> import re
     >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)",
     ...     re.IGNORECASE)
     >>> rt_origins = []
     >>> for t in example_tweets:
     ...     try:
     ...         rt_origins += [mention.strip()
     ...             for mention in rt_patterns.findall(t)[0][1].split()]
     ...     except IndexError, e:
     ...         pass
     >>> [rto.strip("@") for rto in rt_origins]
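Slide decks tend to eat the backslashes in that pattern; restored, the retweet parser runs as below (the sample tweet is shortened for space):

```python
import re

# "RT @user" / "via @user" conventions; captures the trailing @mentions.
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

tweet = ("Visualize Twitter search results w/ this simple script "
         "(via @SocialWebMining @ptwobrussell)")
rt_origins = []
try:
    rt_origins += [m.strip() for m in rt_patterns.findall(tweet)[0][1].split()]
except IndexError:
    pass
origins = [rto.strip("@") for rto in rt_origins]
print(origins)  # ['SocialWebMining', 'ptwobrussell']
```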
  31. 31. Visualizing Data
  32. 32. Graph Construction
     >>> import networkx as nx
     >>> g = nx.DiGraph()
     >>> g.add_edge("@SocialWebMining", "@ptwobrussell",
     ...     {"tweet_id" : 4815162342})
  33. 33. Writing out DOT
     OUT_FILE = "out_file.dot"
     try:
         nx.drawing.write_dot(g, OUT_FILE)
     except ImportError, e:
         dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
                for n1, n2 in g.edges()]
         f = open(OUT_FILE, 'w')
         f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
         f.close()
  34. 34. Example DOT Language
     strict digraph {
         "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
         "@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
         "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
         "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
     }
  35. 35. DOT to Image • Download Graphviz: http://www.graphviz.org/ • $ dot -Tpng out_file.dot > graph.png • Windows users might prefer GVEdit
  36. 36. Graphviz: Extreme Closeup
  37. 37. But you want more sexy?
  38. 38. Protovis: Extreme Closeup
  39. 39. It Doesn't Have To Be a Graph (chart: Graph Connectedness)
  40. 40. Part 2: Friends, Followers, and Setwise Operations
  41. 41. Insight Matters • What is my potential influence? • Who are the most popular people in my network? • Who are my mutual friends? • What common friends/followers do I have with @user? • Who is not following me back? • What can I learn from analyzing my friendship cliques?
  42. 42. Getting Data
  43. 43. OAuth (1.0a)
     import twitter
     from twitter.oauth_dance import oauth_dance
     # Get these from http://dev.twitter.com/apps/new
     consumer_key, consumer_secret = 'key', 'secret'
     (oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb', consumer_key, consumer_secret)
     auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret, consumer_key, consumer_secret)
     t = twitter.Twitter(domain='api.twitter.com', auth=auth)
  44. 44. Getting Friendship Data
     friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
     follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)
     # store the data somewhere...
  45. 45. Perspective: Fetching all of Lady Gaga's ~7M followers would take ~4 hours
  46. 46. But there's always a catch...
  47. 47. Rate Limits • 350 requests/hr for authenticated requests • 150 requests/hr for anonymous requests • Coping mechanisms: • Caching & Archiving Data • Streaming API • HTTP 400 codes • See http://dev.twitter.com/pages/rate-limiting
  48. 48. The Beloved Fail Whale • Twitter is sometimes "overcapacity" • HTTP 503 Error • Handle it just as any other HTTP error • RESTfulness has its advantages
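The makeTwitterRequest helper the next slides rely on isn't shown in the deck; a minimal retry-with-backoff sketch (the name, signature, and backoff policy here are assumptions for illustration, not the book's exact code):

```python
import time

def make_twitter_request(api_call, max_errors=3, wait_period=2, *args, **kw):
    """Call api_call, retrying on transient failures such as rate
    limiting or an HTTP 503 "fail whale", doubling the wait each time
    and giving up after max_errors failed attempts."""
    errors = 0
    while True:
        try:
            return api_call(*args, **kw)
        except Exception:  # in practice, catch twitter.api.TwitterHTTPError
            errors += 1
            if errors > max_errors:
                raise
            time.sleep(wait_period)
            wait_period *= 2
```

A real version would inspect the HTTP status code and only sleep and retry on 503 or rate-limit responses, re-raising everything else immediately.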
  49. 49. Abstraction Helps
     friend_ids = []
     wait_period = 2 # secs
     cursor = -1
     while cursor != 0:
         response = makeTwitterRequest(t, # twitter.Twitter instance
                                       t.friends.ids,
                                       screen_name=screen_name,
                                       cursor=cursor)
         friend_ids += response['ids']
         cursor = response['next_cursor']
     # break out of loop early if you don't need all ids
  50. 50. Abstracting Abstractions
     screen_name = 'timoreilly'
     # This is what you ultimately want...
     friend_ids = getFriends(screen_name)
     follower_ids = getFollowers(screen_name)
  51. 51. Storing Data
  52. 52. Flat Files?
     ./
       screen_name1/
         friend_ids.json
         follower_ids.json
         user_info.json
       screen_name2/
       ...
  53. 53. Pickles?
     import cPickle
     o = { 'friend_ids' : friend_ids, 'follower_ids' : follower_ids, 'user_info' : user_info }
     f = open('screen_name1.pickle', 'wb')
     cPickle.dump(o, f)
     f.close()
  54. 54. A relational database?
     import sqlite3 as sqlite
     conn = sqlite.connect('data.db')
     c = conn.cursor()
     c.execute('create table friends...')
     c.execute('insert into friends...')
     # Lots of fun...sigh...
  55. 55. Redis (A Data Structures Server)
     import redis
     r = redis.Redis()
     [ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]
     r.smembers("timoreilly$friend_ids") # returns a set
     Project page: http://redis.io
     Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
  56. 56. Redis Set Operations • Key/value store...on typed values! • Common set operations • smembers, scard • sinter, sdiff, sunion • sadd, srem, etc. • See http://code.google.com/p/redis/wiki/CommandReference • Don't forget to $ easy_install redis
  57. 57. Analyzing Data
  58. 58. Setwise Operations • Union • Intersection • Difference • Complement
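Plain Python sets mirror Redis's sinter, sdiff, and sunion, so the friend/follower arithmetic can be sketched locally with toy ids:

```python
friend_ids = {1, 2, 3, 4}
follower_ids = {3, 4, 5, 6, 7}

mutual = friend_ids & follower_ids              # intersection (sinter)
not_following_back = friend_ids - follower_ids  # difference (sdiff)
not_friended = follower_ids - friend_ids        # difference, other way
everyone = friend_ids | follower_ids            # union (sunion)

print(sorted(mutual), sorted(not_following_back), sorted(not_friended))
# [3, 4] [1, 2] [5, 6, 7]
```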
  59. 59. Venn Diagrams (Friends and Followers circles: Followers - Friends, Friends - Followers, and the Friends/Followers intersection)
  60. 60. Count Your Blessings
     # A utility function
     def getRedisIdByScreenName(screen_name, key_name):
         return 'screen_name$' + screen_name + '$' + key_name
     # Number of friends
     n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))
     # Number of followers
     n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))
  61. 61. Asymmetric Relationships
     # Friends who aren't following back
     friends_diff_followers = r.sdiffstore('temp', [
         getRedisIdByScreenName(screen_name, 'friend_ids'),
         getRedisIdByScreenName(screen_name, 'follower_ids') ])
     # ... compute interesting things ...
     r.delete('temp')
  62. 62. Asymmetric Relationships
     # Followers who aren't friended
     followers_diff_friends = r.sdiffstore('temp', [
         getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids') ])
     # ... compute interesting things ...
     r.delete('temp')
  63. 63. Symmetric Relationships
     mutual_friends = r.sinterstore('temp', [
         getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids') ])
     # ... compute interesting things ...
     r.delete('temp')
  64. 64. Sample Output timoreilly is following 663 timoreilly is being followed by 1,423,704 131 of 663 are not following timoreilly back 1,423,172 of 1,423,704 are not being followed back by timoreilly timoreilly has 532 mutual friends
  65. 65. Who Isn't Following Back?
     user_ids = [ ... ] # Resolve these to user info objects
     while len(user_ids) > 0:
         user_ids_str = ','.join([ str(i) for i in user_ids[:100] ])
         user_ids = user_ids[100:]
         response = t.users.lookup(user_id=user_ids_str)
         if type(response) is dict:
             response = [response]
         r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'),
                       json.dumps(resp)) for resp in response]))
         r.mset(dict([(getRedisIdByScreenName(resp['screen_name'], 'info.json'),
                       json.dumps(resp)) for resp in response]))
  66. 66. Friends in Common
     # Assume we've harvested friends/followers and it's in Redis...
     screen_names = ['timoreilly', 'mikeloukides']
     r.sinterstore('temp$friends_in_common',
         [getRedisIdByScreenName(screen_name, 'friend_ids') for screen_name in screen_names])
     r.sinterstore('temp$followers_in_common',
         [getRedisIdByScreenName(screen_name, 'follower_ids') for screen_name in screen_names])
     # Manipulate the sets
  67. 67. Potential Influence • My followers? • My followers' followers? • My followers' followers' followers? • for n in range(1, 7): # 6 degrees? print "My " + "followers' "*n + "followers?"
  68. 68. Saving a Thousand Words... (tree diagram: branching factor 2, depth 3, nodes numbered 1-15)
  69. 69. Same Data, Different Layout (the same tree drawn with a radial layout)
  70. 70. Space Complexity (nodes reachable, by branching factor and depth):
     Branching | Depth 1 | Depth 2 | Depth 3 | Depth 4 | Depth 5
     2         | 3       | 7       | 15      | 31      | 63
     3         | 4       | 13      | 40      | 121     | 364
     4         | 5       | 21      | 85      | 341     | 1365
     5         | 6       | 31      | 156     | 781     | 3906
     6         | 7       | 43      | 259     | 1555    | 9331
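Each cell above is the node count of a complete tree: the sum of b**i for i from 0 to the given depth. A few lines of Python reproduce the table:

```python
def nodes_reachable(branching_factor, depth):
    # 1 + b + b**2 + ... + b**depth
    return sum(branching_factor ** i for i in range(depth + 1))

print(nodes_reachable(2, 5))  # 63
print(nodes_reachable(6, 5))  # 9331
```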
  71. 71. Breadth-First Traversal
     Create an empty graph
     Create an empty queue to keep track of unprocessed nodes
     Add the starting point to the graph as the "root node"
     Add the root node to a queue for processing
     Repeat until some maximum depth is reached or the queue is empty:
         Remove a node from the queue
         For each of the node's neighbors:
             If the neighbor hasn't already been processed:
                 Add it to the graph
                 Add it to the queue
                 Add an edge to the graph connecting the node & its neighbor
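That recipe, sketched as runnable Python over a toy follower graph (a dict of adjacency lists stands in for the Twitter API, and an edge list stands in for the NetworkX graph):

```python
from collections import deque

def breadth_first_edges(get_neighbors, root, max_depth):
    # Follow the slide's recipe: process nodes level by level,
    # recording an edge the first time each neighbor is seen.
    edges, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for neighbor in get_neighbors(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
                edges.append((node, neighbor))
    return edges

toy = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
print(breadth_first_edges(lambda n: toy[n], 'a', 2))
# [('a', 'b'), ('a', 'c'), ('b', 'd')]
```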
  72. 72. Breadth-First Harvest
     next_queue = [ 'timoreilly' ] # seed node
     d = 1
     while d < depth:
         d += 1
         queue, next_queue = next_queue, []
         for screen_name in queue:
             follower_ids = getFollowers(screen_name=screen_name)
             next_queue += follower_ids
     getUserInfo(user_ids=next_queue)
  73. 73. The Most Popular Followers
     freqs = {}
     for follower in followers:
         cnt = follower['followers_count']
         if not freqs.has_key(cnt):
             freqs[cnt] = []
         freqs[cnt].append({'screen_name': follower['screen_name'], 'user_id': follower['id']})
     popular_followers = sorted(freqs, reverse=True)[:100]
  74. 74. Average # of Followers
     all_freqs = [k for k in freqs for user in freqs[k]]
     avg = sum(all_freqs) / len(all_freqs)
  75. 75. @timoreilly's Popular Followers. The top 10 followers from the sample: aplusk 4,993,072; BarackObama 4,114,901; mashable 2,014,615; MarthaStewart 1,932,321; Schwarzenegger 1,705,177; zappos 1,689,289; Veronica 1,612,827; jack 1,592,004; stephenfry 1,531,813; davos 1,522,621
  76. 76. Futzing the Numbers • The average number of timoreilly's followers' followers: 445 • Discarding the top 10 lowers the average to around 300 • Discarding any follower with less than 10 followers of their own increases the average to over 1,000! • Doing both brings the average to around 800
  77. 77. The Right Tool For the Job: NetworkX for Networks
  78. 78. Friendship Graphs
     for i in ids: # ids is timoreilly's id along with friend ids
         info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
         screen_name = info['screen_name']
         friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name, 'friend_ids')))
         for friend_id in [fid for fid in friend_ids if fid in ids]:
             friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json')))
             g.add_edge(screen_name, friend_info['screen_name'])
     nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle
  79. 79. Clique Analysis • Cliques • Maximum Cliques • Maximal Cliques • http://en.wikipedia.org/wiki/Clique_problem
  80. 80. Calculating Cliques
     cliques = [c for c in nx.find_cliques(g)]
     num_cliques = len(cliques)
     clique_sizes = [len(c) for c in cliques]
     max_clique_size = max(clique_sizes)
     avg_clique_size = sum(clique_sizes) / num_cliques
     max_cliques = [c for c in cliques if len(c) == max_clique_size]
     num_max_cliques = len(max_cliques)
     people_in_every_max_clique = list(reduce(lambda x, y: x.intersection(y),
         [set(c) for c in max_cliques]))
  81. 81. Cliques for @timoreilly Num cliques: 762573 Avg clique size: 14 Max clique size: 26 Num max cliques: 6 Num people in every max clique: 20
  82. 82. Visualizing Data
  83. 83. Graphs, etc • Your first instinct is naturally G = (V, E) ?
  84. 84. Dorling Cartogram • A location-aware bubble chart (ish) • At least 3-dimensional • Position, color, size • Look at friends/followers by state
  85. 85. Sunburst of Friends • A very compact visualization • Slice and dice friends/followers by gender, country, locale, etc.
  86. 86. Part 3: The Tweet, the Whole Tweet, and Nothing but the Tweet
  87. 87. Insight Matters • Which entities frequently appear in @user's tweets? • How often does @user talk about specific friends? • Who does @user retweet most frequently? • How frequently is @user retweeted (by anyone)? • How many #hashtags are usually in @user's tweets?
  88. 88. Pen : Sword :: Tweet : Machine Gun (?!?)
  89. 89. Getting Data
  90. 90. Let me count the APIs... • Timelines • Tweets • Favorites • Direct Messages • Streams
  91. 91. Anatomy of a Tweet (1/2)
     {
       "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
       "id" : 16932571217,
       "text" : "Great idea from @crowdflower: Crowdsourcing ... #opengov",
       "user" : {
         "description" : "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
         "id" : 2384071,
         "location" : "Sebastopol, CA",
         "name" : "Tim O'Reilly",
         "screen_name" : "timoreilly",
         "url" : "http://radar.oreilly.com"
       },
       ...
  92. 92. Anatomy of a Tweet (2/2) ... "entities" : { "hashtags" : [ {"indices" : [ 97, 103 ], "text" : "gov20"}, {"indices" : [ 104, 112 ], "text" : "opengov"} ], "urls" : [{"expanded_url" : null, "indices" : [ 76, 96 ], "url" : "http://bit.ly/9o4uoG"} ], "user_mentions" : [{"id" : 28165790, "indices" : [ 16, 28 ], "name" : "crowdFlower","screen_name" : "crowdFlower"}] }}
  93. 93. Entities & Annotations • Entities • Opt-in now but will "soon" be standard • $ easy_install twitter_text • Annotations • User-defined metadata • See http://dev.twitter.com/pages/annotations_overview
94. Manual Entity Extraction
import twitter_text

extractor = twitter_text.Extractor(tweet['text'])
mentions = extractor.extract_mentioned_screen_names_with_indices()
hashtags = extractor.extract_hashtags_with_indices()
urls = extractor.extract_urls_with_indices()
# Splice info into a tweet object
95. Storing Data
96. Storing Tweets • Flat files? (Really, who does that?) • A relational database? • Redis? • CouchDB (Relax...?)
97. CouchDB: Relax • Document-oriented key/value store • Map/Reduce • RESTful API • Erlang
98. As easy as sitting on the couch • Get it - http://www.couchone.com/get • Install it • Relax - http://localhost:5984/_utils/ • Also - $ easy_install couchdb
99. Storing Timeline Data
import couchdb
import twitter

TIMELINE_NAME = "user"  # or "home" or "public"
t = twitter.Twitter(domain="api.twitter.com", api_version="1")
server = couchdb.Server("http://localhost:5984")
db = server.create(DB)
page_num = 1
while page_num <= MAX_PAGES:
    api_call = getattr(t.statuses, TIMELINE_NAME + "_timeline")
    tweets = makeTwitterRequest(t, api_call, page=page_num)
    db.update(tweets, all_or_nothing=True)
    print "Fetched %i tweets" % len(tweets)
    page_num += 1
100. Analyzing & Visualizing Data
101. Approach: Map/Reduce on Tweets
102. Map/Reduce Paradigm • Mapper: yields key/value pairs • Reducer: operates on keyed mapper output • Example: Computing the sum of squares • Mapper Input: (k, [2, 4, 6]) • Mapper Output: (k, [4, 16, 36]) • Reducer Input: [(k, [4, 16]), (k, [36])] • Reducer Output: 56
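The sum-of-squares example can be sketched in plain Python (no CouchDB required) to make the mapper/reducer contract concrete; the grouping step in the middle is what the framework normally does for you:

```python
from collections import defaultdict

def mapper(key, values):
    """Emit one (key, value**2) pair per input value."""
    for v in values:
        yield (key, v * v)

def reducer(key, values):
    """Sum all values that share a key."""
    return key, sum(values)

# Group mapper output by key, as a Map/Reduce framework would
grouped = defaultdict(list)
for k, v in mapper("k", [2, 4, 6]):
    grouped[k].append(v)

results = [reducer(k, vs) for k, vs in grouped.items()]
print(results)  # [('k', 56)]
```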
103. Which entities frequently appear in @user's tweets?
104. @timoreilly's Tweet Entities
105. How often does @timoreilly mention specific friends?
106. Filtering Tweet Entities • Let's find out how often someone talks about specific friends • We have friend info on hand • We've extracted @mentions from the tweets • Let's count friend vs. non-friend mentions
107. @timoreilly's friend mentions • Number of @user entities in tweets: 20 • Number of @user entities in tweets who are friends: 18 • Number of @user entities in tweets who are not friends: 2
n2vip, timoreilly, ahier, andrewsavikas, pkedrosky, gnat, CodeforAmerica, slashdot, nytimes, OReillyMedia, brady, dalepd, carlmalamud, mikeloukides, pahlkadot, monkchips, make, fredwilson, jamesoreilly, digiphile, andrewsavikas
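The friend/non-friend split reduces to set membership once the friend list and the extracted @mentions are in hand; a sketch with hypothetical data (the screen names below are illustrative, not the actual counts from the slide):

```python
# Hypothetical data: friend ids already resolved to screen names,
# and @mentions already extracted from the tweet entities
friends = {"ahier", "gnat", "slashdot"}
mentions = ["ahier", "n2vip", "gnat", "ahier"]

friend_mentions = [m for m in mentions if m in friends]
non_friend_mentions = [m for m in mentions if m not in friends]

print("Friend mentions:", sorted(set(friend_mentions)))
print("Non-friend mentions:", sorted(set(non_friend_mentions)))
```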
108. Who does @timoreilly retweet most frequently?
109. Counting Retweets • Map @mentions out of tweets using a regex • Reduce to sum them up • Sort the results • Display results
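The map step above boils down to a regex over the tweet text; here is a sketch with invented tweets, matching the common "RT @user" and "via @user" conventions (the exact regex is an assumption, not the book's verbatim pattern):

```python
import re
from collections import Counter

# Matches the usual retweet conventions: "RT @user" or "via @user"
rt_pattern = re.compile(r"(?:RT|via)\s+@(\w+)", re.IGNORECASE)

tweets = [  # hypothetical tweet texts
    "RT @gnat: open data ftw",
    "Interesting read (via @gnat)",
    "RT @pkedrosky: on markets",
]

counts = Counter(name for t in tweets for name in rt_pattern.findall(t))
print(counts.most_common())  # sorted, most-retweeted first
```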
110. Retweets by @timoreilly
111. How frequently is @timoreilly retweeted?
112. Retweet Counts • An API resource /statuses/retweet_count exists (and is now functional) • Example: http://twitter.com/statuses/show/29016139807.json • retweet_count • retweeted
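Reading those two fields out of the JSON the resource returns is a one-liner; the response body below is a fabricated, abbreviated example, not real data for that status id:

```python
import json

# Fabricated example of a /statuses/show/:id.json response (fields abbreviated)
body = '{"id": 123, "retweet_count": 447, "retweeted": false}'
status = json.loads(body)
print(status["retweet_count"], status["retweeted"])
```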
113. Survey Says... @timoreilly is retweeted about 2/3 of the time
114. How often does @timoreilly include #hashtags in tweets?
115. Counting Hashtags • Use a mapper to emit #hashtag entities from tweets • Use a reducer to sum them all up • Been there, done that...
116. Survey Says... About 1 out of every 3 tweets by @timoreilly contains #hashtags
117. But if you order within the next 5 minutes...
118. Bonus Material: What do #JustinBieber and #TeaParty have in common?
119. Tweet Entities
120. #JustinBieber co-occurrences: #bieberblast http://tinyurl.com/343kax4 #music #Eclipse @justinbieber #somebodytolove @JustBieberFact #nowplaying http://bit.ly/aARD4t @TinselTownDirt #Justinbieber http://bit.ly/b2Kc1L #beliebers #JUSTINBIEBER #Escutando #BieberFact #Proform #justinBieber #Celebrity http://migre.me/TJwj #Restart #Dschungel @ProSieben #TT @_Yassi_ @lojadoaltivo #Telezwerge #musicmonday #JustinBieber @rheinzeitung #video #justinbieber #WTF #tickets
121. #TeaParty co-occurrences: @STOPOBAMA2012 #jcot @blogging_tories @TheFlaCracker #tweetcongress #cdnpoli #palin2012 #Obama #fail #AZ #topprog #nra #TopProg #palin #roft #conservative #dems @BrnEyeSuss http://tinyurl.com/386k5hh #acon @crispix49 @ResistTyranny #cspj @koopersmith #tsot #immigration @Kriskxx @ALIPAC #politics #Kagan #majority #hhrs @Liliaep #NoAmnesty #TeaParty #nvsen #patriottweets #vote2010 @First_Patriots @Drudge_Report #libertarian #patriot #military #obama #pjtv #palin12 #ucot @andilinks #rnc #iamthemob @RonPaulNews #TCOT #GOP #ampats http://tinyurl.com/24h36zq #tpp #cnn #spwbt #dnc #jews @welshman007 #twisters #GOPDeficit #FF #sgp #wethepeople #liberty #ocra #asamom #glennbeck #gop @thenewdeal #news #tlot #AFIRE #oilspill #p2 #Dems #rs #tcot @JIDF #Teaparty #teaparty
122. Hashtag Distributions
123. Hashtag Analysis • #TeaParty: ~5 hashtags per tweet • Example: "Rarely is the question asked: Is our children learning?" - G.W. Bush #p2 #topprog #tcot #tlot #teaparty #GOP #FF • #JustinBieber: ~2 hashtags per tweet • Example: #justinbieber is so coool
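Those per-tweet averages are just a count of hashtag entities divided by the number of tweets; a sketch over fabricated tweets with pre-extracted entities:

```python
# Fabricated tweets with pre-extracted hashtag entities
tweets = [
    {"hashtags": ["p2", "topprog", "tcot", "tlot", "teaparty"]},
    {"hashtags": ["GOP", "FF", "tcot", "sgp", "ocra"]},
]

avg_hashtags = sum(len(t["hashtags"]) for t in tweets) / float(len(tweets))
print("Avg hashtags per tweet:", avg_hashtags)
```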
124. Common #hashtags: #lol #dancing #jesus #music #worldcup #glennbeck #teaparty @addthis #AZ #nowplaying #milk #news #ff #WTF #guns #fail #WorldCup #toomanypeople #bp #oilspill #News #catholic
125. Retweet Patterns
126. Retweet Behaviors
127. Friendship Networks
128. Juxtaposing Friendships • Harvest search results for #JustinBieber and #TeaParty • Get friend ids for each @mention with /friends/ids • Resolve screen names with /users/lookup • Populate a NetworkX graph • Analyze it • Visualize with Graphviz
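The "populate a NetworkX graph" step might look like this, with a fabricated mapping from each @mentioned user to their friends (standing in for /friends/ids results resolved via /users/lookup):

```python
import networkx as nx

# Fabricated: screen name -> friends' screen names
friendships = {
    "alice": ["justinbieber", "bob"],
    "bob": ["justinbieber"],
}

g = nx.Graph()
for user, friends in friendships.items():
    for friend in friends:
        g.add_edge(user, friend)

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
degrees = dict(g.degree())  # node degrees feed a degree-distribution chart
```

From here the graph can be handed to Graphviz (e.g. via nx.drawing.nx_agraph) for the hairball renderings shown later.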
129. Node Degrees
130. Two Kinds of Hairballs... #JustinBieber #TeaParty
131. The world twitterverse is your oyster
132. • Twitter: @SocialWebMining • GitHub: http://bit.ly/socialwebmining • Facebook: http://facebook.com/MiningTheSocialWeb