Unleashing Twitter Data
     for fun and insight

Matthew A. Russell
http://linkedin.com/in/ptwobrussell
@ptwobrussell

Happy Groundhog Day!
Mining the Social Web
          Chapters 1-5
Introduction: Trends, Tweets, and Twitterers
Microformats: Semantic Markup and Common Sense Collide
Mailboxes: Oldies but Goodies
Friends, Followers, and Setwise Operations
Twitter: The Tweet, the Whole Tweet, and
Nothing but the Tweet
Mining the Social Web
            Chapters 6-10

LinkedIn: Clustering Your Professional Network For Fun (and
Profit?)
Google Buzz: TF-IDF, Cosine Similarity, and Collocations
Blogs et al: Natural Language Processing (and Beyond)
Facebook: The All-In-One Wonder
The Semantic Web: A Cocktail Discussion
Overview


• Trends, Tweets, and Retweet Visualizations
• Friends, Followers, and Setwise Operations
• The Tweet, the Whole Tweet, and Nothing but the Tweet
Insight Matters


• What is @user's potential influence?
• What are @user's passions right now?
• Who are @user's most trusted friends?
Part 1:
Tweets, Trends, and Retweet
       Visualizations



A point to ponder:
Twitter : Data :: JavaScript : Programming Languages (???)
Getting Ready To Code




Python Installation


• Mac users already have it
• Linux users probably have it
• Windows users should grab ActivePython
easy_install
• Installs packages from PyPI
• Get it:
  • http://pypi.python.org/pypi/setuptools
  • Ships with ActivePython
• It really is easy:
 easy_install twitter
 easy_install nltk
 easy_install networkx
Git It?
• http://github.com/ptwobrussell/Mining-the-Social-Web
• git clone git://github.com/ptwobrussell/Mining-the-Social-Web.git
 • introduction__*.py
 • friends_followers__*.py
 • the_tweet__*.py
Getting Data




Twitter Data Sources


• Twitter API Resources
• GNIP
• Infochimps
• Library of Congress
Trending Topics

>>>   import twitter # Remember to "easy_install twitter"
>>>   twitter_search = twitter.Twitter(domain="search.twitter.com")
>>>   trends = twitter_search.trends()
>>>   [ trend['name'] for trend in trends['trends'] ]

[u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen',
 u'#Christoferdrew', u'Justin Bieber', u'#WhatwouldItBeLike',
 u'#Sagittarius', u'SNL', u'#SurveySays', u'#iDoit2']
Search Results


>>> search_results = []
>>> for page in range(1,6):
...   search_results.append(twitter_search.search(q="SNL",rpp=100, page=page))
Search Results (continued)
 >>> import json
 >>> print json.dumps(search_results, sort_keys=True, indent=1)
 [
   {
     "completed_in": 0.088122000000000006,
     "max_id": 11966285265,
     "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL",
     "page": 1,
     "query": "SNL",
     "refresh_url": "?since_id=11966285265&q=SNL",

   ...more...
Search Results (continued)
  "results": [
   {
     "created_at": "Sun, 11 Apr 2010 01:34:52 +0000",
     "from_user": "bieber_luv2",
     "from_user_id": 106998169,
     "geo": null,
     "id": 11966285265,
     "iso_language_code": "en",
     "metadata": {
      "result_type": "recent"
     },
     ...more...
Search Results (continued)
       "profile_image_url": "http://a1.twimg.com/profile_images/80...",
       "source": "<a href="http://twitter.com/&quo...",
       "text": "im nt gonna go to sleep happy unless i see ...",
       "to_user_id": null
       }
       ... output truncated - 99 more tweets ...
     ],
     "results_per_page": 100,
     "since_id": 0
    },
    ... output truncated - 4 more pages ...
]
Lexical Diversity

• Ratio of unique terms to total terms
  • A measure of "stickiness"?
  • A measure of "group think"?
  • A crude indicator of retweets to originally authored tweets?
Distilling Tweet Text
 >>> # search_results is already defined

 >>> tweets = [ r['text'] 
 ...     for result in search_results 
 ...         for r in result['results'] ]

 >>> words = []

 >>> for t in tweets:
 ...     words += [ w for w in t.split() ]
 ...
Analyzing Data




Lexical Diversity
 >>> len(words)
 7238

 >>> # unique words
 >>> len(set(words))
 1636

 >>> # lexical diversity
 >>> 1.0*len(set(words))/len(words)
 0.22602928985907708

 >>> # average number of words per tweet
 >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
 14.476000000000001
Size Frequency Matters


• Counting: always the first step
• Simple but effective
• NLTK saves us a little trouble
Frequency Analysis
 >>> import nltk
 >>> freq_dist = nltk.FreqDist(words)
 >>> freq_dist.keys()[:50] #50 most frequent tokens

 [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin',
  u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at',
  u'in', u'bieber', u'and', u'you', u'watching', u'tina', u'for',
  u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with',
  u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)',
  u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal',
  u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2',
  u'will']
Frequency Visualization
Tweet and RT were sitting on a fence.
    Tweet fell off. Who was left?
RTs: past, present, & future


• Retweet: Tweeting a tweet that's already been tweeted
• RT or via followed by @mention
• Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?
• Relatively new APIs were rolled out last year for retweeting sans
 conventions
Some people, when confronted with a problem, think "I know,
   I'll use regular expressions." Now they have two
                 problems. -- Jamie Zawinski
Parsing Retweets
 >>> example_tweets = ["Visualize Twitter search results w/ this simple script
 http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via
 @SocialWebMining @ptwobrussell)"]

 >>> import re
 >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)",
 ...                          re.IGNORECASE)

 >>> rt_origins = []
 >>> for t in example_tweets:
 ...    try:
 ...         rt_origins += [mention.strip() 
 ...         for mention in rt_patterns.findall(t)[0][1].split()]
 ...    except IndexError, e:
 ...         pass

 >>> [rto.strip("@") for rto in rt_origins]
Visualizing Data




Graph Construction

 >>> import networkx as nx
 >>> g = nx.DiGraph()
 >>> g.add_edge("@SocialWebMining", "@ptwobrussell", 
 ...            {"tweet_id" : 4815162342},)
Writing out DOT
OUT_FILE = "out_file.dot"

try:
    nx.drawing.write_dot(g, OUT_FILE)
except ImportError, e:
    dot = ['"%s" -> "%s" [tweet_id=%s]' % 
    (n1, n2, g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]

       f = open(OUT_FILE, 'w')
       f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
       f.close()
Example DOT Language

 strict digraph {
   "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
   "@mpcoelho" ->"Lil_Amaral" [tweet_id=11965954427];
   "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
   "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
 }
DOT to Image


• Download Graphviz: http://www.graphviz.org/
•$ dot -Tpng out_file.dot > graph.png
• Windows users might prefer GVEdit
Graphviz: Extreme Closeup
But you want more sexy?
Protovis: Extreme Closeup




It Doesn't Have To Be a Graph

                Graph Connectedness
Part 2:
Friends, Followers, and Setwise
          Operations



Insight Matters

• What is my potential influence?
• Who are the most popular people in my network?
• Who are my mutual friends?
• What common friends/followers do I have with @user?
• Who is not following me back?
• What can I learn from analyzing my friendship cliques?
Getting Data



OAuth (1.0a)
import twitter
from twitter.oauth_dance import oauth_dance

# Get these from http://dev.twitter.com/apps/new
consumer_key, consumer_secret = 'key', 'secret'

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
                                       consumer_key, consumer_secret)

auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                         consumer_key, consumer_secret)

t = twitter.Twitter(domain='api.twitter.com', auth=auth)
Getting Friendship Data


 friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
 follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)

 # store the data somewhere...
Perspective: Fetching all of Lady Gaga's
~7M followers would take ~4 hours
But there's always a catch...
Rate Limits
• 350 requests/hr for authenticated requests
• 150 requests/hr for anonymous requests
• Coping mechanisms:
  • Caching & Archiving Data
  • Streaming API
  • HTTP 400 codes
• See http://dev.twitter.com/pages/rate-limiting
The Beloved Fail Whale


 • Twitter is sometimes "overcapacity"
 • HTTP 503 Error
 • Handle it just as any other HTTP error (see the sketch below)
 • RESTfulness has its advantages
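
The next slide calls a makeTwitterRequest helper without defining it. A minimal sketch of such a wrapper, assuming the twitter package's TwitterHTTPError exposes the HTTP status via e.e.code (the book's actual helper may differ):

 import time
 from twitter.api import TwitterHTTPError

 def makeTwitterRequest(t, twitterFunction, max_errors=3, **kwargs):
     # t is accepted only to match the call on the next slide
     wait_period = 2 # secs; doubled after each rate-limit hit
     error_count = 0
     while True:
         try:
             return twitterFunction(**kwargs)
         except TwitterHTTPError, e:
             if e.e.code == 400: # rate limited (see "Rate Limits")
                 print 'Rate limited. Sleeping %i secs' % wait_period
                 time.sleep(wait_period)
                 wait_period *= 2
             elif e.e.code in (502, 503): # fail whale / overcapacity
                 print 'Twitter overcapacity. Retrying shortly...'
                 time.sleep(wait_period)
             else:
                 error_count += 1
                 if error_count > max_errors:
                     raise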
Abstraction Helps
 friend_ids = []
 wait_period = 2 # secs
 cursor = -1

 while cursor != 0:
     response = makeTwitterRequest(t, # twitter.Twitter instance
                                   t.friends.ids,
                                   screen_name=screen_name,
                                   cursor=cursor)

     friend_ids += response['ids']
     cursor = response['next_cursor']
     # break out of loop early if you don't need all ids
Abstracting Abstractions
 screen_name = 'timoreilly'

 # This is what you ultimately want...

 friend_ids = getFriends(screen_name)
 follower_ids = getFollowers(screen_name)
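
One way those wrappers might be built on makeTwitterRequest and the cursor loop from the previous slides (a sketch; assumes t is the authenticated twitter.Twitter instance):

 def _getIds(api_call, screen_name):
     ids, cursor = [], -1
     while cursor != 0:
         response = makeTwitterRequest(t, api_call,
                                       screen_name=screen_name,
                                       cursor=cursor)
         ids += response['ids']
         cursor = response['next_cursor']
     return ids

 def getFriends(screen_name):
     return _getIds(t.friends.ids, screen_name)

 def getFollowers(screen_name):
     return _getIds(t.followers.ids, screen_name)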
Storing Data



Flat Files?
  ./
  screen_name1/
      friend_ids.json
      follower_ids.json
      user_info.json

  screen_name2/
      ...

  ...
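
If flat files are good enough, the layout above can be written out with json; a sketch (saveToFlatFiles is a hypothetical helper, not from the book):

 import json
 import os

 def saveToFlatFiles(screen_name, friend_ids, follower_ids, base_dir='.'):
     out_dir = os.path.join(base_dir, screen_name)
     if not os.path.isdir(out_dir):
         os.makedirs(out_dir)
     for fname, data in (('friend_ids.json', friend_ids),
                         ('follower_ids.json', follower_ids)):
         f = open(os.path.join(out_dir, fname), 'w')
         json.dump(data, f)
         f.close()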
Pickles?
import cPickle

o = {
    'friend_ids'   : friend_ids,
    'follower_ids' : follower_ids,
    'user_info'    : user_info
}

f = open('screen_name1.pickle', 'wb')
cPickle.dump(o, f)
f.close()
A relational database?
 import sqlite3 as sqlite

 conn = sqlite.connect('data.db')
 c = conn.cursor()

 c.execute('''create table
              friends...''')


 c.execute('''insert into friends...
 ''')


 # Lots of fun...sigh...
Redis (A Data Structures Server)


  import redis

  r = redis.Redis()

  [ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]

  r.smembers("timoreilly$friend_ids") # returns a set


         Project page: http://redis.io
         Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
Redis Set Operations
• Key/value store...on typed values!
• Common set operations
  • smembers, scard
  • sinter, sdiff, sunion
  • sadd, srem, etc.
• See http://code.google.com/p/redis/wiki/CommandReference
• Don't forget to $ easy_install redis
Analyzing Data



Setwise Operations

• Union
• Intersection
• Difference
• Complement
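
For orientation, the same operations with plain Python sets (a toy illustration; the Redis commands on the following slides do the same thing server-side):

 friends = set([1, 2, 3, 4])   # stand-in for friend ids
 followers = set([3, 4, 5, 6]) # stand-in for follower ids

 print friends | followers     # union
 print friends & followers     # intersection (mutual friends)
 print friends - followers     # difference (friends not following back)
 print followers - friends     # difference (followers not friended back)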
Venn Diagrams

(Venn diagram: Friends and Followers drawn as two overlapping sets, with
regions for Friends - Followers, Followers - Friends, and the union
Friends ∪ Followers; the overlap is the mutual friends.)
Count Your Blessings
# A utility function
def getRedisIdByScreenName(screen_name, key_name):
    return 'screen_name$' + screen_name + '$' + key_name


# Number of friends
n_friends = r.scard(getRedisIdByScreenName(screen_name,
                                           'friend_ids'))

# Number of followers
n_followers = r.scard(getRedisIdByScreenName(screen_name,
                                             'follower_ids'))
Asymmetric Relationships


# Friends who aren't following back
friends_diff_followers = r.sdiffstore('temp', [
                 getRedisIdByScreenName(screen_name, 'friend_ids'),
                 getRedisIdByScreenName(screen_name, 'follower_ids')
                 ])
# ... compute interesting things ...
r.delete('temp')
Asymmetric Relationships


# Followers who aren't friended
followers_diff_friends = r.sdiffstore('temp', [
                  getRedisIdByScreenName(screen_name, 'follower_ids'),
                  getRedisIdByScreenName(screen_name, 'friend_ids')
                  ])
# ... compute interesting things ...
r.delete('temp')
Symmetric Relationships

 mutual_friends = r.sinterstore('temp', [
         getRedisIdByScreenName(screen_name, 'follower_ids'),
         getRedisIdByScreenName(screen_name, 'friend_ids')
         ])
 # ... compute interesting things ...
 r.delete('temp')
Sample Output

 timoreilly is following 663

 timoreilly is being followed by 1,423,704

 131 of 663 are not following timoreilly back

 1,423,172 of 1,423,704 are not being followed back by
 timoreilly

 timoreilly has 532 mutual friends
Who Isn't Following Back?
 user_ids = [ ... ] # Resolve these to user info objects

 while len(user_ids) > 0:
   user_ids_str = ','.join([ str(i) for i in user_ids[:100] ])
   user_ids = user_ids[100:]

   response = t.users.lookup(user_id=user_ids_str)

   if type(response) is dict: response = [response]
   r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'), json.dumps(resp))
                for resp in response]))

   r.mset(dict([(getRedisIdByScreenName(resp['screen_name'],'info.json'),
                json.dumps(resp)) for resp in response]))
Friends in Common
# Assume we've harvested friends/followers and it's in Redis...
screen_names = ['timoreilly', 'mikeloukides']

r.sinterstore('temp$friends_in_common',
              [getRedisIdByScreenName(screen_name, 'friend_ids')
              for screen_name in screen_names])

r.sinterstore('temp$followers_in_common',
              [getRedisIdByScreenName(screen_name,'follower_ids')
              for screen_name in screen_names])

# Manipulate the sets
Potential Influence

• My followers?
• My followers' followers?
• My followers' followers' followers?
•for n in range(1, 7): # 6 degrees?
   print "My " + "followers' "*n + "followers?"
Saving a Thousand Words...




(Figure: a tree with branching factor 2 and depth 3: root node 1, its
children 2 and 3, grandchildren 4-7, and leaf nodes 8-15.)
Same Data, Different Layout
(Figure: the same 15-node tree drawn with a radial layout, node 1 at the
center and the leaves around the rim.)
Space Complexity
Total nodes reached, by branching factor and depth:

                      Depth 1   Depth 2   Depth 3   Depth 4   Depth 5
Branching factor 2       3         7        15        31        63
Branching factor 3       4        13        40       121       364
Branching factor 4       5        21        85       341      1365
Branching factor 5       6        31       156       781      3906
Branching factor 6       7        43       259      1555      9331
Breadth-First Traversal
Create an empty graph
Create an empty queue to keep track of unprocessed nodes

Add the starting point to the graph as the "root node"
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from queue
  For each of the node's neighbors:
    If the neighbor hasn't already been processed:
      Add it to the graph
      Add it to the queue
      Add an edge to the graph connecting the node & its neighbor
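
A minimal sketch of that traversal in Python with NetworkX; getNeighbors is a stand-in for whatever returns a node's neighbors (e.g. the getFollowers wrapper from earlier):

 import networkx as nx

 def breadthFirstCrawl(seed, getNeighbors, max_depth=2):
     g = nx.Graph()
     g.add_node(seed) # the "root node"
     queue, depth = [seed], 0
     while queue and depth < max_depth:
         depth += 1
         next_queue = []
         for node in queue:
             for neighbor in getNeighbors(node):
                 if neighbor not in g: # not processed yet
                     g.add_node(neighbor)
                     next_queue.append(neighbor)
                     g.add_edge(node, neighbor)
         queue = next_queue
     return g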
Breadth-First Harvest

 depth = 2 # maximum number of hops to crawl (assumed; not set on the slide)
 next_queue = [ 'timoreilly' ] # seed node
 d = 1

 while d < depth:
     d += 1
     queue, next_queue = next_queue, []
     for screen_name in queue:
         follower_ids = getFollowers(screen_name=screen_name)
         next_queue += follower_ids
         getUserInfo(user_ids=next_queue)
The Most Popular Followers

 freqs = {}
 for follower in followers:
     cnt = follower['followers_count']
     if not freqs.has_key(cnt):
         freqs[cnt] = []

     freqs[cnt].append({'screen_name': follower['screen_name'],
                        'user_id': follower['id']})

 popular_followers = sorted(freqs, reverse=True)[:100]
Average # of Followers

 all_freqs = [k for k in freqs for user in freqs[k]]
 avg = sum(all_freqs) / len(all_freqs)
@timoreilly's Popular Followers

          The top 10 followers from the sample:

          aplusk              4,993,072
          BarackObama         4,114,901
          mashable            2,014,615
          MarthaStewart       1,932,321
          Schwarzenegger      1,705,177
          zappos              1,689,289
          Veronica            1,612,827
          jack                1,592,004
          stephenfry          1,531,813
          davos               1,522,621
Futzing the Numbers

• The average number of timoreilly's followers' followers: 445
• Discarding the top 10 lowers the average to around 300
• Discarding any follower with less than 10 followers of their
 own increases the average to over 1,000!
• Doing both brings the average to around 800
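
A toy sketch of the futzing above, showing how trimming outliers and tiny accounts shifts an average (numbers are made up, not timoreilly's data):

 follower_counts = [2, 3, 5, 8, 40, 50, 75, 120, 4000000] # hypothetical

 def avg(xs):
     return sum(xs) / float(len(xs))

 print avg(follower_counts)                         # raw average
 print avg(sorted(follower_counts)[:-1])            # drop the top outlier
 print avg([c for c in follower_counts if c >= 10]) # drop tiny accounts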
The Right Tool For the Job:
NetworkX for Networks
Friendship Graphs
# g is a NetworkX Graph and r is the Redis client from earlier slides
for i in ids: # ids is timoreilly's id along with friend ids
  info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
  screen_name = info['screen_name']
  friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name,
                                                      'friend_ids')))

  for friend_id in [fid for fid in friend_ids if fid in ids]:
      friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json')))
      g.add_edge(screen_name, friend_info['screen_name'])

nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle
Clique Analysis


                                              • Cliques
                                              • Maximum Cliques
                                              • Maximal Cliques

http://en.wikipedia.org/wiki/Clique_problem
Calculating Cliques
cliques = [c for c in nx.find_cliques(g)]

num_cliques = len(cliques)
clique_sizes = [len(c) for c in cliques]

max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques
max_cliques = [c for c in cliques if len(c) == max_clique_size]
num_max_cliques = len(max_cliques)

people_in_every_max_clique = list(reduce(
    lambda x, y: x.intersection(y),[set(c) for c in max_cliques]
))
Cliques for @timoreilly


         Num   cliques:                762573
         Avg   clique size:                14
         Max   clique size:                26
         Num   max cliques:                 6
         Num   people in every max clique: 20
Visualizing Data



Graphs, etc


    • Your first instinct is naturally
      G = (V, E) ?
Dorling Cartogram

  • A location-aware bubble chart (ish)
  • At least 3-dimensional
    • Position, color, size
  • Look at friends/followers by state
Sunburst of Friends


 • A very compact visualization
 • Slice and dice friends/followers by
  gender, country, locale, etc.
Part 3:
The Tweet, the Whole Tweet, and
     Nothing but the Tweet



Insight Matters

• Which entities frequently appear in @user's tweets?
• How often does @user talk about specific friends?
• Who does @user retweet most frequently?
• How frequently is @user retweeted (by anyone)?
• How many #hashtags are usually in @user's tweets?
Pen : Sword :: Tweet : Machine Gun (?!?)
Getting Data



Let me count the APIs...

• Timelines
• Tweets
• Favorites
• Direct Messages
• Streams
Anatomy of a Tweet (1/2)
{
    "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
    "id" : 16932571217,
    "text" : "Great idea from @crowdflower: Crowdsourcing ... #opengov",
    "user" : {
       "description" : "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
       "id" : 2384071,
       "location" : "Sebastopol, CA",
       "name" : "Tim O'Reilly",
       "screen_name" : "timoreilly",
       "url" : "http://radar.oreilly.com"
    },

    ...
Anatomy of a Tweet (2/2)

    ...

    "entities" : {
      "hashtags" : [    {"indices" : [ 97, 103 ], "text" : "gov20"},
                        {"indices" : [ 104, 112 ], "text" : "opengov"} ],

        "urls" : [{"expanded_url" : null, "indices" : [ 76, 96 ],
                   "url" : "http://bit.ly/9o4uoG"} ],

        "user_mentions" : [{"id" : 28165790, "indices" : [ 16, 28 ],
                            "name" : "crowdFlower","screen_name" : "crowdFlower"}]
    }
}
Entities & Annotations

• Entities
  • Opt-in now but will "soon" be standard
 • $ easy_install twitter_text
• Annotations
  • User-defined metadata
  • See http://dev.twitter.com/pages/annotations_overview
Manual Entity Extraction
 import twitter_text

 extractor = twitter_text.Extractor(tweet['text'])

 mentions = extractor.extract_mentioned_screen_names_with_indices()
 hashtags = extractor.extract_hashtags_with_indices()
 urls = extractor.extract_urls_with_indices()

 # Splice info into a tweet object
Storing Data



Storing Tweets

• Flat files? (Really, who does that?)
• A relational database?
• Redis?
• CouchDB (Relax...?)
CouchDB: Relax

• Document-oriented key/value
• Map/Reduce
• RESTful API
• Erlang
As easy as sitting on the couch


• Get it - http://www.couchone.com/get
• Install it
• Relax - http://localhost:5984/_utils/
• Also - $ easy_install couchdb
Storing Timeline Data
import couchdb
import twitter

TIMELINE_NAME = "user" # or "home" or "public"
DB = 'tweets-user-timeline' # hypothetical database name
MAX_PAGES = 5               # hypothetical page limit

t = twitter.Twitter(domain='api.twitter.com', api_version='1')

server = couchdb.Server('http://localhost:5984')
db = server.create(DB)

page_num = 1
while page_num <= MAX_PAGES:
    api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
    tweets = makeTwitterRequest(t, api_call, page=page_num)
    db.update(tweets, all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets)
    page_num += 1
Analyzing & Visualizing Data



Approach:
Map/Reduce on Tweets
Map/Reduce Paradigm

• Mapper: yields key/value pairs
• Reducer: operates on keyed mapper output
• Example: Computing the sum of squares
  • Mapper Input: (k, [2,4,6])
  • Mapper Output: (k, [4,16,36])
  • Reducer Input: [(k, [4, 16]), (k, [36])]
  • Reducer Output: 56
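
The sum-of-squares example spelled out as a toy mapper and reducer in plain Python (an illustration of the paradigm, not CouchDB code):

 def mapper(key, values):
     return [(key, v ** 2) for v in values] # (k, [2,4,6]) -> [(k,4), (k,16), (k,36)]

 def reducer(key, squared_values):
     return sum(squared_values)             # 4 + 16 + 36 = 56

 mapped = mapper('k', [2, 4, 6])
 print reducer('k', [v for (k, v) in mapped]) # 56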
Which entities frequently appear in
       @mention's tweets?
@timoreilly's Tweet Entities
How often does @timoreilly
 mention specific friends?
Filtering Tweet Entities

• Let's find out how often someone talks about
 specific friends
• We have friend info on hand
• We've extracted @mentions from the tweets
 • Let's count friend vs. non-friend mentions (a sketch follows)
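
A toy sketch of that filtering step; both lists below are hypothetical stand-ins for data harvested earlier (friend ids resolved to screen names in Part 2, @mentions extracted from the tweet entities):

 friend_screen_names = set(['ahier', 'gnat', 'pkedrosky']) # hypothetical
 mentions = ['ahier', 'gnat', 'n2vip', 'ahier']            # hypothetical

 friend_mentions = [m for m in mentions if m in friend_screen_names]
 nonfriend_mentions = [m for m in mentions if m not in friend_screen_names]

 print 'Number of @user entities in tweets:', len(mentions)
 print 'Number who are friends:', len(friend_mentions)
 print 'Number who are not friends:', len(nonfriend_mentions)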
@timoreilly's friend mentions
 Number of @user entities in tweets: 20
 Number of @user entities in tweets who are friends: 18
 Number of @user entities in tweets who are not friends: 2

 Not friends:
   n2vip, timoreilly

 Friends:
   ahier, andrewsavikas, pkedrosky, gnat, CodeforAmerica, slashdot,
   nytimes, OReillyMedia, brady, dalepd, carlmalamud, mikeloukides,
   pahlkadot, monkchips, make, fredwilson, jamesoreilly, digiphile
Who does @timoreilly retweet
     most frequently?
Counting Retweets

• Map @mentions out of tweets using a regex
• Reduce to sum them up
• Sort the results
• Display results (a sketch follows)
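
A sketch of those steps using the retweet regex from Part 1; tweets is a stand-in for the harvested tweet texts:

 import re

 rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

 tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?"] # stand-in

 counts = {}
 for t in tweets:
     for match in rt_patterns.findall(t):       # map @mentions out of tweets
         for origin in match[1].split():
             origin = origin.strip().strip('@')
             counts[origin] = counts.get(origin, 0) + 1 # reduce: sum them up

 # sort and display
 for origin, cnt in sorted(counts.items(), key=lambda x: x[1], reverse=True)[:10]:
     print origin, cnt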
Retweets by @timoreilly
How frequently is @timoreilly
        retweeted?
Retweet Counts


• An API resource /statuses/retweet_count exists (and is now functional)
• Example: http://twitter.com/statuses/show/29016139807.json
  • retweet_count
  • retweeted
Survey Says...
@timoreilly is retweeted about 2/3
            of the time
How often does @timoreilly
include #hashtags in tweets?
Counting Hashtags


• Use a mapper to emit #hashtag entities for tweets
• Use a reducer to sum them all up
• Been there, done that...
Survey Says...
About 1 out of every 3 tweets by
 @timoreilly contains #hashtags
But if you order within the next 5
            minutes...



Bonus Material:
What do #JustinBieber and #TeaParty
         have in common?


Tweet Entities
#JustinBieber co-occurrences

 #bieberblast            @JustBieberFact              #music
 #Eclipse                @TinselTownDirt              @justinbieber
 #somebodytolove         #beliebers                   #nowplaying
 http://bit.ly/aARD4t    #BieberFact                  #Justinbieber
 http://bit.ly/b2Kc1L    #Celebrity                   #JUSTINBIEBER
 #Escutando              #Dschungel                   #Proform
 #justinBieber           @_Yassi_                     http://migre.me/TJwj
 #Restart                #musicmonday                 @ProSieben
 #TT                     #video                       @lojadoaltivo
 #Telezwerge             #tickets                     #JustinBieber
 @rheinzeitung           http://tinyurl.com/343kax4   #justinbieber
 #WTF
#TeaParty co-occurrences
                   @STOPOBAMA2012               #jcot
@blogging_tories   @TheFlaCracker               #tweetcongress
#cdnpoli           #palin2012                   #Obama
#fail              #AZ                          #topprog
#nra               #TopProg                     #palin
#roft              #conservative                #dems
@BrnEyeSuss        http://tinyurl.com/386k5hh   #acon
@crispix49         @ResistTyranny               #cspj
@koopersmith       #tsot                        #immigration
@Kriskxx           @ALIPAC                      #politics
#Kagan             #majority                    #hhrs
@Liliaep           #NoAmnesty                   #TeaParty
#nvsen             #patriottweets               #vote2010
@First_Patriots    @Drudge_Report               #libertarian
#patriot           #military                    #obama
#pjtv              #palin12                     #ucot
@andilinks         #rnc                         #iamthemob
@RonPaulNews       #TCOT                        #GOP
#ampats            http://tinyurl.com/24h36zq   #tpp
#cnn               #spwbt                       #dnc
#jews              @welshman007                 #twisters
#GOPDeficit        #FF                          #sgp
#wethepeople       #liberty                     #ocra
#asamom            #glennbeck                   #gop
@thenewdeal        #news                        #tlot
#AFIRE             #oilspill                    #p2
#Dems              #rs                          #tcot
@JIDF              #Teaparty                    #teaparty
Hashtag Distributions
Hashtag Analysis

• TeaParty: ~ 5 hashtags per tweet.
• Example: “Rarely is the question asked: Is our children
 learning?” - G.W. Bush #p2 #topprog #tcot #tlot #teaparty
 #GOP #FF
• JustinBieber: ~ 2 hashtags per tweet
• Example: #justinbieber is so coool
Common #hashtags
 #lol              #dancing
 #jesus            #music
 #worldcup         #glennbeck
 #teaparty         @addthis
 #AZ               #nowplaying
 #milk             #news
 #ff               #WTF
 #guns             #fail
 #WorldCup         #toomanypeople
 #bp               #oilspill
 #News             #catholic
Retweet Patterns
Retweet Behaviors
Friendship Networks
Juxtaposing Friendships

• Harvest search results for #JustinBieber and #TeaParty
• Get friend ids for each @mention with /friends/ids
• Resolve screen names with /users/lookup
• Populate a NetworkX graph
• Analyze it
• Visualize with Graphviz
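
A compressed sketch of that pipeline. It assumes search results for a hashtag (as in Part 1), the twitter_text extractor, and the getFriends wrapper (as in Part 2); resolving friend ids to screen names with /users/lookup is elided:

 import networkx as nx
 import twitter_text

 g = nx.Graph()
 for result in search_results:            # search_results harvested earlier
     for tweet in result['results']:
         extractor = twitter_text.Extractor(tweet['text'])
         for entity in extractor.extract_mentioned_screen_names_with_indices():
             screen_name = entity['screen_name']
             for friend_id in getFriends(screen_name):
                 g.add_edge(screen_name, friend_id)

 # analyze with NetworkX, or export DOT for Graphviz as in Part 1
 print nx.number_of_nodes(g), nx.number_of_edges(g)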
Node Degrees
Two Kinds of Hairballs...




       #JustinBieber        #TeaParty
The world twitterverse is your oyster
• Twitter: @SocialWebMining
• GitHub: http://bit.ly/socialwebmining
• Facebook: http://facebook.com/MiningTheSocialWeb





More Related Content

Viewers also liked

Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015Jun Julien Matsushita
 
Digital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensbyDigital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensbyTelenor Group
 
Data Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with ResearchData Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with ResearchWalkerSands
 
Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Tempero UK
 
Analysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter DataAnalysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter DataEducational Technology
 
Searching lexis nexis in power search mode
Searching lexis nexis in power search modeSearching lexis nexis in power search mode
Searching lexis nexis in power search modeJoyce Johnston
 
What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?Sparc Media Poland
 
Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections? Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections? Laurence Borel
 
Text Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowText Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowTony Russell-Rose
 
Topic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social scienceTopic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social scienceAlice Oh
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewSeth Grimes
 
Big Data: Mapping Twitter Communities
Big Data: Mapping Twitter CommunitiesBig Data: Mapping Twitter Communities
Big Data: Mapping Twitter CommunitiesSocialphysicist
 
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media PlanningLearn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media PlanningThinkVine
 
How to Build a Basic Model with Analytica
How to Build a Basic Model with AnalyticaHow to Build a Basic Model with Analytica
How to Build a Basic Model with AnalyticaTorsten Röhner
 
Deep Social Insight
Deep Social InsightDeep Social Insight
Deep Social InsightSysomos
 
Staying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human DataStaying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human DataDataSift
 
Evolving in a new Data economy
Evolving in a new Data economyEvolving in a new Data economy
Evolving in a new Data economyAcxiom Corporation
 
Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011Seth Grimes
 

Viewers also liked (20)

Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015
 
Digital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensbyDigital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensby
 
Data Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with ResearchData Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with Research
 
Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization
 
Analysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter DataAnalysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter Data
 
Searching lexis nexis in power search mode
Searching lexis nexis in power search modeSearching lexis nexis in power search mode
Searching lexis nexis in power search mode
 
What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?
 
Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections? Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections?
 
Text Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowText Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and Tomorrow
 
Market Mix Models: Shining a Light in the Black Box
Market Mix Models: Shining a Light in the Black BoxMarket Mix Models: Shining a Light in the Black Box
Market Mix Models: Shining a Light in the Black Box
 
Topic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social scienceTopic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social science
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry View
 
Big Data: Mapping Twitter Communities
Big Data: Mapping Twitter CommunitiesBig Data: Mapping Twitter Communities
Big Data: Mapping Twitter Communities
 
Text mining and Visualizations
Text mining  and VisualizationsText mining  and Visualizations
Text mining and Visualizations
 
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media PlanningLearn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning
 
How to Build a Basic Model with Analytica
How to Build a Basic Model with AnalyticaHow to Build a Basic Model with Analytica
How to Build a Basic Model with Analytica
 
Deep Social Insight
Deep Social InsightDeep Social Insight
Deep Social Insight
 
Staying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human DataStaying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human Data
 
Evolving in a new Data economy
Evolving in a new Data economyEvolving in a new Data economy
Evolving in a new Data economy
 
Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011
 

Similar to Unleashing Twitter Data for Fun and Insight

Mining social data
Mining social dataMining social data
Mining social dataMalk Zameth
 
Life at Twitter + Career Advice for Students
Life at Twitter + Career Advice for StudentsLife at Twitter + Career Advice for Students
Life at Twitter + Career Advice for StudentsChris Aniszczyk
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)Portland R User Group
 
Mining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social HaystackMining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social HaystackMatthew Russell
 
Mining Social Web APIs with IPython Notebook (PyCon 2014)
Mining Social Web APIs with IPython Notebook (PyCon 2014)Mining Social Web APIs with IPython Notebook (PyCon 2014)
Mining Social Web APIs with IPython Notebook (PyCon 2014)Matthew Russell
 
Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)
Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)
Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)Matthew Russell
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchainjasonhaddix
 
Twitter Presentation: #APIConSF
Twitter Presentation: #APIConSFTwitter Presentation: #APIConSF
Twitter Presentation: #APIConSFRyan Choi
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Datasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence ToolDatasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence ToolShubham Mittal
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1HyeonSeok Choi
 
Idea2app
Idea2appIdea2app
Idea2appFlumes
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementLaurent Leturgez
 
Creating More Engaging Content For Social
Creating More Engaging Content For SocialCreating More Engaging Content For Social
Creating More Engaging Content For SocialEric T. Tung
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Build a Twitter Bot with Basic Python
Build a Twitter Bot with Basic PythonBuild a Twitter Bot with Basic Python
Build a Twitter Bot with Basic PythonThinkful
 
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...Cyber Security Alliance
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesLeo Loobeek
 

Similar to Unleashing Twitter Data for Fun and Insight (20)

Mining social data
Mining social dataMining social data
Mining social data
 
Life at Twitter + Career Advice for Students
Life at Twitter + Career Advice for StudentsLife at Twitter + Career Advice for Students
Life at Twitter + Career Advice for Students
 
Developing apps using Perl
Developing apps using PerlDeveloping apps using Perl
Developing apps using Perl
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)
 
Big data. Opportunità e rischi
Big data. Opportunità e rischiBig data. Opportunità e rischi
Big data. Opportunità e rischi
 
Mining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social HaystackMining the Geo Needles in the Social Haystack
Mining the Geo Needles in the Social Haystack
 
Mining Social Web APIs with IPython Notebook (PyCon 2014)
Mining Social Web APIs with IPython Notebook (PyCon 2014)Mining Social Web APIs with IPython Notebook (PyCon 2014)
Mining Social Web APIs with IPython Notebook (PyCon 2014)
 
Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)
Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)
Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Twitter Presentation: #APIConSF
Twitter Presentation: #APIConSFTwitter Presentation: #APIConSF
Twitter Presentation: #APIConSF
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Datasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence ToolDatasploit - An Open Source Intelligence Tool
Datasploit - An Open Source Intelligence Tool
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1
 
Idea2app
Idea2appIdea2app
Idea2app
 
Python and Oracle : allies for best of data management
Python and Oracle : allies for best of data managementPython and Oracle : allies for best of data management
Python and Oracle : allies for best of data management
 
Creating More Engaging Content For Social
Creating More Engaging Content For SocialCreating More Engaging Content For Social
Creating More Engaging Content For Social
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Build a Twitter Bot with Basic Python
Build a Twitter Bot with Basic PythonBuild a Twitter Bot with Basic Python
Build a Twitter Bot with Basic Python
 
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
ASFWS 2012 - Contourner les conditions d’utilisation et l’API du service Twit...
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
 

More from Matthew Russell

Mining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started GuideMining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started GuideMatthew Russell
 
Privacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebPrivacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebMatthew Russell
 
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)Matthew Russell
 
Mining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started GuideMining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started GuideMatthew Russell
 
Why Twitter Is All the Rage: A Data Miner's Perspective
Why Twitter Is All the Rage: A Data Miner's PerspectiveWhy Twitter Is All the Rage: A Data Miner's Perspective
Why Twitter Is All the Rage: A Data Miner's PerspectiveMatthew Russell
 
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014Mining Social Web APIs with IPython Notebook - Data Day Texas 2014
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014Matthew Russell
 
Mining Social Web APIs with IPython Notebook (Strata 2013)
Mining Social Web APIs with IPython Notebook (Strata 2013)Mining Social Web APIs with IPython Notebook (Strata 2013)
Mining Social Web APIs with IPython Notebook (Strata 2013)Matthew Russell
 
Mining Social Web Data Like a Pro: Four Steps to Success
Mining Social Web Data Like a Pro: Four Steps to SuccessMining Social Web Data Like a Pro: Four Steps to Success
Mining Social Web Data Like a Pro: Four Steps to SuccessMatthew Russell
 

More from Matthew Russell (8)

Mining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started GuideMining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started Guide
 
Privacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social WebPrivacy, Ethics, and Future Uses of the Social Web
Privacy, Ethics, and Future Uses of the Social Web
 
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)
Why Twitter Is All The Rage: A Data Miner's Perspective (PyTN 2014)
 
Mining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started GuideMining the Social Web for Fun and Profit: A Getting Started Guide
Mining the Social Web for Fun and Profit: A Getting Started Guide
 
Why Twitter Is All the Rage: A Data Miner's Perspective
Why Twitter Is All the Rage: A Data Miner's PerspectiveWhy Twitter Is All the Rage: A Data Miner's Perspective
Why Twitter Is All the Rage: A Data Miner's Perspective
 
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014Mining Social Web APIs with IPython Notebook - Data Day Texas 2014
Mining Social Web APIs with IPython Notebook - Data Day Texas 2014
 
Mining Social Web APIs with IPython Notebook (Strata 2013)
Mining Social Web APIs with IPython Notebook (Strata 2013)Mining Social Web APIs with IPython Notebook (Strata 2013)
Mining Social Web APIs with IPython Notebook (Strata 2013)
 
Mining Social Web Data Like a Pro: Four Steps to Success
Mining Social Web Data Like a Pro: Four Steps to SuccessMining Social Web Data Like a Pro: Four Steps to Success
Mining Social Web Data Like a Pro: Four Steps to Success
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf

Unleashing Twitter Data for Fun and Insight

  • 1. Unleashing Twitter Data for fun and insight Matthew A. Russell http://linkedin.com/in/ptwobrussell @ptwobrussell Agile Data Solutions Social Web Mining the
  • 3. Mining the Social Web Chapters 1-5 Introduction: Trends, Tweets, and Twitterers Microformats: Semantic Markup and Common Sense Collide Mailboxes: Oldies but Goodies Friends, Followers, and Setwise Operations Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
  • 4. Mining the Social Web Chapters 6-10 LinkedIn: Clustering Your Professional Network For Fun (and Profit?) Google Buzz: TF-IDF, Cosine Similarity, and Collocations Blogs et al: Natural Language Processing (and Beyond) Facebook: The All-In-One Wonder The Semantic Web: A Cocktail Discussion
  • 5. Overview • Trends, Tweets, and Retweet Visualizations • Friends, Followers, and Setwise Operations • The Tweet, the Whole Tweet, and Nothing but the Tweet
  • 6. Insight Matters • What is @user's potential influence? • What are @user's passions right now? • Who are @user's most trusted friends?
  • 7. Part 1: Tweets, Trends, and Retweet Visualizations Agile Data Solutions Social Web Mining the
  • 8. A point to ponder: Twitter : Data :: JavaScript : Programming Languages (???)
  • 9. Getting Ready To Code Agile Data Solutions Social Web Mining the
  • 10. Python Installation • Mac users already have it • Linux users probably have it • Windows users should grab ActivePython
  • 11. easy_install • Installs packages from PyPI • Get it: • http://pypi.python.org/pypi/setuptools • Ships with ActivePython • It really is easy: easy_install twitter easy_install nltk easy_install networkx
  • 12. Git It? • http://github.com/ptwobrussell/Mining-the-Social-Web • git clone git://github.com/ptwobrussell/Mining-the-Social-Web.git • introduction__*.py • friends_followers__*.py • the_tweet__*.py
  • 13. Getting Data Agile Data Solutions Social Web Mining the
  • 14. Twitter Data Sources • Twitter API Resources • GNIP • Infochimps • Library of Congress
  • 15. Trending Topics >>> import twitter # Remember to "easy_install twitter" >>> twitter_search = twitter.Twitter(domain="search.twitter.com") >>> trends = twitter_search.trends() >>> [ trend['name'] for trend in trends['trends'] ] [u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen', u'#Christoferdrew', u'Justin Bieber', u'#WhatwouldItBeLike', u'#Sagittarius', u'SNL', u'#SurveySays', u'#iDoit2']
  • 16. Search Results >>> search_results = [] >>> for page in range(1,6): ... search_results.append(twitter_search.search(q="SNL",rpp=100, page=page))
  • 17. Search Results (continued) >>> import json >>> print json.dumps(search_results, sort_keys=True, indent=1) [ { "completed_in": 0.088122000000000006, "max_id": 11966285265, "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL", "page": 1, "query": "SNL", "refresh_url": "?since_id=11966285265&q=SNL", ...more...
  • 18. Search Results (continued) "results": [ { "created_at": "Sun, 11 Apr 2010 01:34:52 +0000", "from_user": "bieber_luv2", "from_user_id": 106998169, "geo": null, "id": 11966285265, "iso_language_code": "en", "metadata": { "result_type": "recent" }, ...more...
  • 19. Search Results (continued) "profile_image_url": "http://a1.twimg.com/profile_images/80...", "source": "&lt;a href=&quot;http://twitter.com/&quo...", "text": "im nt gonna go to sleep happy unless i see ...", "to_user_id": null } ... output truncated - 99 more tweets ... ], "results_per_page": 100, "since_id": 0 }, ... output truncated - 4 more pages ... ]
  • 20. Lexical Diversity • Ratio of unique terms to total terms • A measure of "stickiness"? • A measure of "group think"? • A crude indicator of retweets to originally authored tweets?
  • 21. Distilling Tweet Text >>> # search_results is already defined >>> tweets = [ r['text'] ... for result in search_results ... for r in result['results'] ] >>> words = [] >>> for t in tweets: ... words += [ w for w in t.split() ] ...
  • 22. Analyzing Data Agile Data Solutions Social Web Mining the
  • 23. Lexical Diversity >>> len(words) 7238 >>> # unique words >>> len(set(words)) 1636 >>> # lexical diversity >>> 1.0*len(set(words))/len(words) 0.22602928985907708 >>> # average number of words per tweet >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets) 14.476000000000001
  • 24. Size Frequency Matters • Counting: always the first step • Simple but effective • NLTK saves us a little trouble
  • 25. Frequency Analysis >>> import nltk >>> freq_dist = nltk.FreqDist(words) >>> freq_dist.keys()[:50] #50 most frequent tokens [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you', u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)', u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2', u'will']
  • 27. Tweet and RT were sitting on a fence. Tweet fell off. Who was left?
  • 28. RTs: past, present, & future • Retweet: Tweeting a tweet that's already been tweeted • RT or via followed by @mention • Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!? • Relatively new APIs were rolled out last year for retweeting sans conventions
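Note: a retweet made through those newer APIs does not necessarily carry "RT @user" in its text at all; the original tweet shows up under a retweeted_status field on the tweet object instead. A minimal sketch of checking for that first (tweet here is any tweet dict returned by the API):

def get_rt_origin(tweet):
    # Native retweets carry the original tweet under 'retweeted_status',
    # so prefer that over parsing "RT @user" / "via @user" out of the text.
    if 'retweeted_status' in tweet:
        return tweet['retweeted_status']['user']['screen_name']
    return None  # fall back to the regex approach on the next slide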
  • 29. Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
  • 30. Parsing Retweets >>> example_tweets = ["Visualize Twitter search results w/ this simple script http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via @SocialWebMining @ptwobrussell)"] >>> import re >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", ... re.IGNORECASE) >>> rt_origins = [] >>> for t in example_tweets: ... try: ... rt_origins += [mention.strip() ... for mention in rt_patterns.findall(t)[0][1].split()] ... except IndexError, e: ... pass >>> [rto.strip("@") for rto in rt_origins]
  • 31. Visualizing Data Agile Data Solutions Social Web Mining the
  • 32. Graph Construction >>> import networkx as nx >>> g = nx.DiGraph() >>> g.add_edge("@SocialWebMining", "@ptwobrussell", ... {"tweet_id" : 4815162342},)
  • 33. Writing out DOT OUT_FILE = "out_file.dot" try: nx.drawing.write_dot(g, OUT_FILE) except ImportError, e: dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id']) for n1, n2 in g.edges()] f = open(OUT_FILE, 'w') f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),)) f.close()
  • 34. Example DOT Language strict digraph { "@ericastolte" -> "bonitasworld" [tweet_id=11965974697]; "@mpcoelho" ->"Lil_Amaral" [tweet_id=11965954427]; "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062]; "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327]; }
  • 35. DOT to Image • Download Graphviz: http://www.graphviz.org/ •$ dot -Tpng out_file.dot > graph.png • Windows users might prefer GVEdit
  • 37. But you want more sexy?
  • 38. Protovis: Extreme Closeup 38 Mining the Social Web
  • 39. It Doesn't Have To Be a Graph (chart: Graph Connectedness)
  • 40. Part 2: Friends, Followers, and Setwise Operations Agile Data Solutions Social Web Mining the
  • 41. Insight Matters • What is my potential influence? • Who are the most popular people in my network? • Who are my mutual friends? • What common friends/followers do I have with @user? • Who is not following me back? • What can I learn from analyzing my friendship cliques?
  • 42. Getting Data Agile Data Solutions Social Web Mining the
  • 43. OAuth (1.0a) import twitter from twitter.oauth_dance import oauth_dance # Get these from http://dev.twitter.com/apps/new consumer_key, consumer_secret = 'key', 'secret' (oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb', consumer_key, consumer_secret) auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret, consumer_key, consumer_secret) t = twitter.Twitter(domain='api.twitter.com', auth=auth)
  • 44. Getting Friendship Data friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1) follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1) # store the data somewhere...
  • 45. Perspective: Fetching all of Lady Gaga's ~7M followers would take ~4 hours
  • 46. But there's always a catch...
  • 47. Rate Limits • 350 requests/hr for authenticated requests • 150 requests/hr for anonymous requests • Coping mechanisms: • Caching & Archiving Data • Streaming API • HTTP 400 codes • See http://dev.twitter.com/pages/rate-limiting
  • 48. The Beloved Fail Whale • Twitter is sometimes "overcapacity" • HTTP 503 Error • Handle it just as any other HTTP error • RESTfulness has its advantages
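The next slide leans on a makeTwitterRequest helper that isn't shown in full here. A minimal sketch of what such a wrapper might look like, assuming the twitter package's TwitterHTTPError exposes the HTTP status as e.e.code (the sleep periods and doubling strategy are arbitrary choices):

import time
import twitter

def makeTwitterRequest(t, twitterFunction, **kwArgs):
    # t is the twitter.Twitter instance (unused here, kept for signature parity)
    wait_period = 2  # secs, doubled after each failed attempt
    while True:
        try:
            return twitterFunction(**kwArgs)
        except twitter.api.TwitterHTTPError, e:
            if e.e.code in (400, 420):    # rate limited
                print 'Rate limited. Sleeping %i secs' % wait_period
            elif e.e.code in (502, 503):  # fail whale / overcapacity
                print 'Twitter overcapacity. Sleeping %i secs' % wait_period
            else:
                raise
            time.sleep(wait_period)
            wait_period *= 2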
  • 49. Abstraction Helps friend_ids = [] wait_period = 2 # secs cursor = -1 while cursor != 0: response = makeTwitterRequest(t, # twitter.Twitter instance t.friends.ids, screen_name=screen_name, cursor=cursor) friend_ids += response['ids'] cursor = response['next_cursor'] # break out of loop early if you don't need all ids
  • 50. Abstracting Abstractions screen_name = 'timoreilly' # This is what you ultimately want... friend_ids = getFriends(screen_name) follower_ids = getFollowers(screen_name)
  • 51. Storing Data Agile Data Solutions Social Web Mining the
  • 52. Flat Files? ./ screen_name1/ friend_ids.json follower_ids.json user_info.json screen_name2/ ... ...
  • 53. Pickles? import cPickle o = { 'friend_ids' : friend_ids, 'follower_ids' : follower_ids, 'user_info' : user_info } f = open('screen_name1.pickle', 'wb') cPickle.dump(o, f) f.close()
  • 54. A relational database? import sqlite3 as sqlite conn = sqlite.connect('data.db') c = conn.cursor() c.execute('''create table friends...''') c.execute('''insert into friends... ''') # Lots of fun...sigh...
  • 55. Redis (A Data Structures Server) import redis r = redis.Redis() [ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ] r.smembers("timoreilly$friend_ids") # returns a set Project page: http://redis.io Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload
  • 56. Redis Set Operations • Key/value store...on typed values! • Common set operations • smembers, scard • sinter, sdiff, sunion • sadd, srem, etc. • See http://code.google.com/p/redis/wiki/CommandReference • Don't forget to $ easy_install redis
  • 57. Analyzing Data Agile Data Solutions Social Web Mining the
  • 58. Setwise Operations • Union • Intersection • Difference • Complement
  • 59. Venn Diagrams [Venn diagram of Friends vs. Followers: the regions Friends - Followers, Followers - Friends, their overlap (mutual friends), and the union Friends U Followers]
  • 60. Count Your Blessings # A utility function def getRedisIdByScreenName(screen_name, key_name): return 'screen_name$' + screen_name + '$' + key_name # Number of friends n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids')) # Number of followers n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))
  • 61. Asymmetric Relationships # Friends who aren't following back friends_diff_followers = r.sdiffstore('temp', [ getRedisIdByScreenName(screen_name, 'friend_ids'), getRedisIdByScreenName(screen_name, 'follower_ids') ]) # ... compute interesting things ... r.delete('temp')
  • 62. Asymmetric Relationships # Followers who aren't friended followers_diff_friends = r.sdiffstore('temp', [ getRedisIdByScreenName(screen_name, 'follower_ids'), getRedisIdByScreenName(screen_name, 'friend_ids') ]) # ... compute interesting things ... r.delete('temp')
  • 63. Symmetric Relationships mutual_friends = r.sinterstore('temp', [ getRedisIdByScreenName(screen_name, 'follower_ids'), getRedisIdByScreenName(screen_name, 'friend_ids') ]) # ... compute interesting things ... r.delete('temp')
  • 64. Sample Output timoreilly is following 663 timoreilly is being followed by 1,423,704 131 of 663 are not following timoreilly back 1,423,172 of 1,423,704 are not being followed back by timoreilly timoreilly has 532 mutual friends
  • 65. Who Isn't Following Back? user_ids = [ ... ] # Resolve these to user info objects while len(user_ids) > 0: user_ids_str = ','.join([ str(i) for i in user_ids[:100] ]) user_ids = user_ids[100:] response = t.users.lookup(user_id=user_ids_str) if type(response) is dict: response = [response] r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'), json.dumps(resp)) for resp in response])) r.mset(dict([(getRedisIdByScreenName(resp['screen_name'],'info.json'), json.dumps(resp)) for resp in response]))
  • 66. Friends in Common # Assume we've harvested friends/followers and it's in Redis... screen_names = ['timoreilly', 'mikeloukides'] r.sinterstore('temp$friends_in_common', [getRedisIdByScreenName(screen_name, 'friend_ids') for screen_name in screen_names]) r.sinterstore('temp$followers_in_common', [getRedisIdByScreenName(screen_name,'follower_ids') for screen_name in screen_names]) # Manipulate the sets
  • 67. Potential Influence • My followers? • My followers' followers? • My followers' followers' followers? •for n in range(1, 7): # 6 degrees? print "My " + "followers' "*n + "followers?"
  • 68. Saving a Thousand Words... [Tree diagram: nodes 1 through 15 arranged with branching factor = 2 and depth = 3]
  • 69. Same Data, Different Layout [The same fifteen-node tree rendered in a radial layout]
  • 70. Space Complexity (total nodes harvested at depths 1 through 5, by branching factor): branching factor 2: 3, 7, 15, 31, 63; branching factor 3: 4, 13, 40, 121, 364; branching factor 4: 5, 21, 85, 341, 1365; branching factor 5: 6, 31, 156, 781, 3906; branching factor 6: 7, 43, 259, 1555, 9331
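Each cell above is just the size of a complete b-ary tree, so the whole table follows from the geometric series total = (b^(d+1) - 1) / (b - 1). A quick check:

# Total nodes reachable from one seed at branching factor b, depth d:
# 1 + b + b**2 + ... + b**d
for b in range(2, 7):
    print b, [(b ** (d + 1) - 1) // (b - 1) for d in range(1, 6)]
# 2 [3, 7, 15, 31, 63]
# 3 [4, 13, 40, 121, 364]
# ... and so on, matching the table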
  • 71. Breadth-First Traversal Create an empty graph Create an empty queue to keep track of unprocessed nodes Add the starting point to the graph as the "root node" Add the root node to a queue for processing Repeat until some maximum depth is reached or the queue is empty: Remove a node from queue For each of the node's neighbors: If the neighbor hasn't already been processed: Add it to the graph Add it to the queue Add an edge to the graph connecting the node & its neighbor
  • 72. Breadth-First Harvest next_queue = [ 'timoreilly' ] # seed node d = 1 while d < depth: d += 1 queue, next_queue = next_queue, [] for screen_name in queue: follower_ids = getFollowers(screen_name=screen_name) next_queue += follower_ids getUserInfo(user_ids=next_queue)
  • 73. The Most Popular Followers freqs = {} for follower in followers: cnt = follower['followers_count'] if not freqs.has_key(cnt): freqs[cnt] = [] freqs[cnt].append({'screen_name': follower['screen_name'], 'user_id': follower['id']}) popular_followers = sorted(freqs, reverse=True)[:100]
  • 74. Average # of Followers all_freqs = [k for k in freqs for user in freqs[k]] avg = sum(all_freqs) / len(all_freqs)
  • 75. @timoreilly's Popular Followers The top 10 followers from the sample: aplusk 4,993,072 BarackObama 4,114,901 mashable 2,014,615 MarthaStewart 1,932,321 Schwarzenegger 1,705,177 zappos 1,689,289 Veronica 1,612,827 jack 1,592,004 stephenfry 1,531,813 davos 1,522,621
  • 76. Futzing the Numbers • The average number of timoreilly's followers' followers: 445 • Discarding the top 10 lowers the average to around 300 • Discarding any follower with less than 10 followers of their own increases the average to over 1,000! • Doing both brings the average to around 800
  • 77. The Right Tool For the Job: NetworkX for Networks
  • 78. Friendship Graphs g = nx.Graph() # assumes networkx imported as nx and friend data already in Redis for i in ids: #ids is timoreilly's id along with friend ids info = json.loads(r.get(getRedisIdByUserId(i, 'info.json'))) screen_name = info['screen_name'] friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name, 'friend_ids'))) for friend_id in [fid for fid in friend_ids if fid in ids]: friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json'))) g.add_edge(screen_name, friend_info['screen_name']) nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle
  • 79. Clique Analysis • Cliques • Maximum Cliques • Maximal Cliques http://en.wikipedia.org/wiki/Clique_problem
  • 80. Calculating Cliques cliques = [c for c in nx.find_cliques(g)] num_cliques = len(cliques) clique_sizes = [len(c) for c in cliques] max_clique_size = max(clique_sizes) avg_clique_size = sum(clique_sizes) / num_cliques max_cliques = [c for c in cliques if len(c) == max_clique_size] num_max_cliques = len(max_cliques) people_in_every_max_clique = list(reduce( lambda x, y: x.intersection(y),[set(c) for c in max_cliques] ))
  • 81. Cliques for @timoreilly Num cliques: 762573 Avg clique size: 14 Max clique size: 26 Num max cliques: 6 Num people in every max clique: 20
  • 82. Visualizing Data Agile Data Solutions Social Web Mining the
  • 83. Graphs, etc • Your first instinct is naturally G = (V, E) ?
  • 84. Dorling Cartogram • A location-aware bubble chart (ish) • At least 3-dimensional • Position, color, size • Look at friends/followers by state
  • 85. Sunburst of Friends • A very compact visualization • Slice and dice friends/followers by gender, country, locale, etc.
  • 86. Part 3: The Tweet, the Whole Tweet, and Nothing but the Tweet Agile Data Solutions Social Web Mining the
  • 87. Insight Matters • Which entities frequently appear in @user's tweets? • How often does @user talk about specific friends? • Who does @user retweet most frequently? • How frequently is @user retweeted (by anyone)? • How many #hashtags are usually in @user's tweets?
  • 88. Pen : Sword :: Tweet : Machine Gun (?!?)
  • 89. Getting Data Mining the Social Web
  • 90. Let me count the APIs... • Timelines • Tweets • Favorites • Direct Messages • Streams
  • 91. Anatomy of a Tweet (1/2) { "created_at" : "Thu Jun 24 14:21:11 +0000 2010", "id" : 16932571217, "text" : "Great idea from @crowdflower: Crowdsourcing ... #opengov", "user" : { "description" : "Founder and CEO, O'Reilly Media. Watching the alpha geeks...", "id" : 2384071, "location" : "Sebastopol, CA", "name" : "Tim O'Reilly", "screen_name" : "timoreilly", "url" : "http://radar.oreilly.com" }, ...
  • 92. Anatomy of a Tweet (2/2) ... "entities" : { "hashtags" : [ {"indices" : [ 97, 103 ], "text" : "gov20"}, {"indices" : [ 104, 112 ], "text" : "opengov"} ], "urls" : [{"expanded_url" : null, "indices" : [ 76, 96 ], "url" : "http://bit.ly/9o4uoG"} ], "user_mentions" : [{"id" : 28165790, "indices" : [ 16, 28 ], "name" : "crowdFlower","screen_name" : "crowdFlower"}] } }
  • 93. Entities & Annotations • Entities • Opt-in now but will "soon" be standard • $ easy_install twitter_text • Annotations • User-defined metadata • See http://dev.twitter.com/pages/annotations_overview
  • 94. Manual Entity Extraction import twitter_text extractor = twitter_text.Extractor(tweet['text']) mentions = extractor.extract_mentioned_screen_names_with_indices() hashtags = extractor.extract_hashtags_with_indices() urls = extractor.extract_urls_with_indices() # Splice info into a tweet object
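One plausible way to do that last splice, mirroring the entities layout shown on slide 92 (a sketch only; the values below come straight from the extractor calls above):

# Attach the extracted entities to the tweet, mimicking the native field
if 'entities' not in tweet:
    tweet['entities'] = {
        'user_mentions': mentions,
        'hashtags': hashtags,
        'urls': urls,
    }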
  • 95. Storing Data Mining the Social Web
  • 96. Storing Tweets • Flat files? (Really, who does that?) • A relational database? • Redis? • CouchDB (Relax...?)
  • 97. CouchDB: Relax • Document-oriented key/value • Map/Reduce • RESTful API • Erlang
  • 98. As easy as sitting on the couch • Get it - http://www.couchone.com/get • Install it • Relax - http://localhost:5984/_utils/ • Also - $ easy_install couchdb
  • 99. Storing Timeline Data import couchdb import twitter TIMELINE_NAME = "user" # or "home" or "public" MAX_PAGES = 15 # for example DB = 'tweets-%s-timeline' % TIMELINE_NAME # any CouchDB database name works t = twitter.Twitter(domain='api.twitter.com', api_version='1') server = couchdb.Server('http://localhost:5984') db = server.create(DB) page_num = 1 while page_num <= MAX_PAGES: api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline') tweets = makeTwitterRequest(t, api_call, page=page_num) db.update(tweets, all_or_nothing=True) print 'Fetched %i tweets' % len(tweets) page_num += 1
  • 100. Analyzing & Visualizing Data Mining the Social Web
  • 102. Map/Reduce Paradigm • Mapper: yields key/value pairs • Reducer: operates on keyed mapper output • Example: Computing the sum of squares • Mapper Input: (k, [2,4,6]) • Mapper Output: (k, [4,16,36]) • Reducer Input: [(k, [4,16]), (k, [36])] • Reducer Output: 56
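The sum-of-squares example in plain Python, just to make the division of labor concrete (illustrative only; CouchDB expresses its views in JavaScript or through couchdb-python rather than like this):

def mapper(key, values):
    # emit one (key, squared value) pair per input value
    return [(key, v ** 2) for v in values]

def reducer(key, mapped_values):
    # combine everything that shares the same key
    return sum(mapped_values)

pairs = mapper('k', [2, 4, 6])               # [('k', 4), ('k', 16), ('k', 36)]
print reducer('k', [v for _, v in pairs])    # 56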
  • 103. Which entities frequently appear in @mention's tweets?
  • 105. How often does @timoreilly mention specific friends?
  • 106. Filtering Tweet Entities • Let's find out how often someone talks about specific friends • We have friend info on hand • We've extracted @mentions from the tweets • Let's count friend vs non-friend mentions
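A minimal sketch of that comparison using plain sets (mentioned_screen_names and friend_screen_names are hypothetical lists assumed to have been collected from the tweet entities and the friends data, respectively):

mentions = set(mentioned_screen_names)
friends = set(friend_screen_names)

friend_mentions = mentions & friends        # mentioned people who are friends
non_friend_mentions = mentions - friends    # everyone else

print 'Number of @user entities in tweets:', len(mentions)
print 'Number who are friends:', len(friend_mentions)
print 'Number who are not friends:', len(non_friend_mentions)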
  • 107. @timoreilly's friend mentions. Number of @user entities in tweets: 20; number of @user entities in tweets who are friends: 18; number of user entities in tweets who are not friends: 2. Screen names mentioned: n2vip timoreilly ahier andrewsavikas pkedrosky gnat CodeforAmerica slashdot nytimes OReillyMedia brady dalepd carlmalamud mikeloukides pahlkadot monkchips make fredwilson jamesoreilly digiphile andrewsavikas
  • 108. Who does @timoreilly retweet most frequently?
  • 109. Counting Retweets • Map @mentions out of tweets using a regex • Reduce to sum them up • Sort the results • Display results
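Those four steps fit in a few lines once the tweets are on hand; a sketch reusing the rt_patterns regex from slide 30 (tweets is assumed to be a list of tweet text, and a defaultdict stands in for a real reducer):

import re
from collections import defaultdict

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

retweet_counts = defaultdict(int)
for t in tweets:
    for match in rt_patterns.findall(t):
        for origin in match[1].split():
            retweet_counts[origin.strip('@')] += 1

# Sort and display the most frequently retweeted people
for screen_name, count in sorted(retweet_counts.items(),
                                 key=lambda x: x[1], reverse=True)[:10]:
    print screen_name, count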
  • 111. How frequently is @timoreilly retweeted?
  • 112. Retweet Counts • An API resource /statuses/retweet_count exists (and is now functional) • Example: http://twitter.com/statuses/show/29016139807.json • retweet_count • retweeted
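A sketch of reading those fields with the same twitter.Twitter instance used earlier (assuming the status id shown on the slide is still retrievable; the field names are exactly the two listed above):

# Fetch a single status and inspect its retweet metadata
status = t.statuses.show(id=29016139807)
print status['retweet_count'], status['retweeted']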
  • 113. Survey Says... @timoreilly is retweeted about 2/3 of the time
  • 114. How often does @timoreilly include #hashtags in tweets?
  • 115. Counting Hashtags • Use a mapper to emit the #hashtag entities for each tweet • Use a reducer to sum them all up • Been there, done that...
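If entities have already been spliced into each tweet (slide 94), the whole thing collapses to a counter; a sketch assuming tweets_with_entities is a list of such tweet dicts and the hashtag entity layout matches slide 92:

from collections import Counter  # Python 2.7+

hashtag_counts = Counter(h['text'].lower()
                         for tw in tweets_with_entities
                         for h in tw['entities']['hashtags'])
print hashtag_counts.most_common(10)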
  • 116. Survey Says... About 1 out of every 3 tweets by @timoreilly contains #hashtags
  • 117. But if you order within the next 5 minutes... Mining the Social Web
  • 118. Bonus Material: What do #JustinBieber and #TeaParty have in common? Mining the Social Web
  • 120. #JustinBieber co-occurrences #bieberblast http://tinyurl.com/ #music #Eclipse 343kax4 @justinbieber #somebodytolove @JustBieberFact #nowplaying http://bit.ly/aARD4t @TinselTownDirt #Justinbieber http://bit.ly/b2Kc1L #beliebers #JUSTINBIEBER #Escutando #BieberFact #Proform #justinBieber #Celebrity http://migre.me/TJwj #Restart #Dschungel @ProSieben #TT @_Yassi_ @lojadoaltivo #Telezwerge #musicmonday #JustinBieber @rheinzeitung #video #justinbieber #WTF #tickets
  • 121. #TeaParty co-occurrences @STOPOBAMA2012 #jcot @blogging_tories @TheFlaCracker #tweetcongress #cdnpoli #palin2012 #Obama #fail #AZ #topprog #nra #TopProg #palin #roft #conservative #dems @BrnEyeSuss http://tinyurl.com/386k5hh #acon @crispix49 @ResistTyranny #cspj @koopersmith #tsot #immigration @Kriskxx @ALIPAC #politics #Kagan #majority #hhrs @Liliaep #NoAmnesty #TeaParty #nvsen #patriottweets #vote2010 @First_Patriots @Drudge_Report #libertarian #patriot #military #obama #pjtv #palin12 #ucot @andilinks #rnc #iamthemob @RonPaulNews #TCOT #GOP #ampats http://tinyurl.com/24h36zq #tpp #cnn #spwbt #dnc #jews @welshman007 #twisters #GOPDeficit #FF #sgp #wethepeople #liberty #ocra #asamom #glennbeck #gop @thenewdeal #news #tlot #AFIRE #oilspill #p2 #Dems #rs #tcot @JIDF #Teaparty #teaparty
  • 123. Hashtag Analysis • TeaParty: ~ 5 hashtags per tweet. • Example: “Rarely is the questioned asked: Is our children learning?” - G.W. Bush #p2 #topprog #tcot #tlot #teaparty #GOP #FF • JustinBieber: ~ 2 hashtags per tweet • Example: #justinbieber is so coool
  • 124. Common #hashtags #lol #dancing #jesus #music #worldcup #glennbeck #teaparty @addthis #AZ #nowplaying #milk #news #ff #WTF #guns #fail #WorldCup #toomanypeople #bp #oilspill #News #catholic
  • 128. Juxtaposing Friendships • Harvest search results for #JustinBieber and #TeaParty • Get friend ids for each @mention with /friends/ids • Resolve screen names with /users/lookup • Populate a NetworkX graph • Analyze it • Visualize with Graphviz
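A compressed sketch of those steps, assuming the getFriends helper from Part 2 and a hypothetical resolveUserInfo wrapper around /users/lookup, and ignoring rate limiting for brevity (hashtag_search_results is a hypothetical list of harvested search results):

import networkx as nx
import twitter_text

g = nx.Graph()
for tweet in hashtag_search_results:        # e.g. harvested for '#TeaParty'
    extractor = twitter_text.Extractor(tweet['text'])
    for m in extractor.extract_mentioned_screen_names_with_indices():
        screen_name = m['screen_name']
        friend_ids = getFriends(screen_name)          # /friends/ids, as in Part 2
        for friend in resolveUserInfo(friend_ids):    # /users/lookup wrapper (hypothetical)
            g.add_edge(screen_name, friend['screen_name'])

nx.drawing.write_dot(g, 'hashtag_friendships.dot')    # then render with Graphviz, as on slide 35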
  • 130. Two Kinds of Hairballs... #JustinBieber #TeaParty
  • 131. The world twitterverse is your oyster
  • 132. • Twitter: @SocialWebMining • GitHub: http://bit.ly/socialwebmining • Facebook: http://facebook.com/MiningTheSocialWeb Mining the Social Web