Mining the social web ch1
Presentation Transcript

• Ch1. Introduction: Hacking on Twitter Data
  chois79, 2011.10.15
  Thursday, October 20, 2011
• Installing Python Development Tools
  ✤ python
    ✤ http://www.python.org/download
  ✤ Python package manager tools
    ✤ allow you to effortlessly install Python packages
    ✤ easy_install
      ✤ http://pypi.python.org/pypi/setuptools
    ✤ pip
      ✤ http://www.pip-installer.org/en/latest/installing.html
  ✤ networkx
    ✤ for creating and manipulating graphs and networks
    ✤ ex) easy_install networkx or pip install networkx
• Collecting and Manipulating Twitter Data
• Tinkering with Twitter’s API (1/2)
  ✤ Setup
    ✤ easy_install twitter
    ✤ but note that Twitter’s APIs have since been updated
      ✤ http://github.com/sixohsix/twitter/issues/56
    ✤ The Minimalist Twitter API for Python is a Python API for Twitter
  ✤ Equivalent REST query
    ✤ http://search.twitter.com/trends.json
• Tinkering with Twitter’s API (2/2)
  ✤ Retrieving Twitter search trends

    # ex.3
    import twitter
    twitter_api = twitter.Twitter()
    WORLD_WOE_ID = 1  # The Yahoo! Where On Earth ID for the entire world
    world_trends = twitter_api.trends._(WORLD_WOE_ID)  # get back a callable
    # [ trend["name"] for trend in world_trends()[0]['trends'] ]
    for trend in world_trends()[0]['trends']:  # call the callable
        print trend["name"]

  ✤ Paging through Twitter search results

    # ex.4
    search_results = []
    for page in range(1, 6):
        search_results.append(twitter_api.search(q="Dennis Ritchie",
                                                 rpp=20, page=page))
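The paging loop above can be sketched offline. Since the old Search API endpoint no longer exists, the snippet below uses a hypothetical `fetch_page()` stub (not part of the `twitter` package) that returns fake result pages, just to show how the per-page responses accumulate into one list:

```python
# Minimal sketch of the paging pattern, with a made-up fetch_page()
# stub standing in for twitter_api.search.
def fetch_page(q, rpp, page):
    # Hypothetical stub: returns `rpp` fake tweets for the requested page.
    return {"results": [{"text": "%s tweet %d" % (q, (page - 1) * rpp + i)}
                        for i in range(rpp)]}

def collect_pages(q, rpp=20, pages=5):
    """Accumulate several result pages into one list, as on the slide."""
    search_results = []
    for page in range(1, pages + 1):
        search_results.append(fetch_page(q, rpp=rpp, page=page))
    return search_results

search_results = collect_pages("Dennis Ritchie", rpp=20, pages=5)
all_tweets = [t for page in search_results for t in page["results"]]
print(len(all_tweets))  # -> 100 (5 pages x 20 results)
```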
• Frequency Analysis and Lexical Diversity (1/5)
  ✤ Lexical diversity
    ✤ One of the most intuitive measurements that can be applied to unstructured text
    ✤ The number of unique tokens in the text divided by the total number of tokens

    >>> words = []
    >>> for t in tweets:
    ...     words += [ w for w in t.split() ]
    >>> len(words)  # total words
    7238
    >>> len(set(words))  # unique words
    1636
    >>> 1.0*len(set(words))/len(words)  # lexical diversity
    0.22602928985907708
    >>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)  # avg words per tweet
    14.476000000000001

  ✤ Each tweet carries about 20 percent unique information
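The same measurement packaged as a small function; the sample tweets below are made up for illustration:

```python
def lexical_diversity(texts):
    """Unique tokens divided by total tokens, as defined on the slide."""
    words = [w for t in texts for w in t.split()]
    return len(set(words)) / len(words)

# Made-up sample tweets: 16 tokens total, 10 of them unique.
tweets = ["justin bieber is on snl tonight",
          "watching snl tonight gonna be great",
          "snl snl snl tonight"]
print(lexical_diversity(tweets))  # -> 0.625
```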
• Frequency Analysis and Lexical Diversity (2/5)
  ✤ Frequency analysis: use NLTK or collections.Counter
    ✤ Very simple, powerful tools

    >>> import nltk
    >>> import cPickle
    >>> words = cPickle.load(open("myData.pickle"))
    >>> freq_dist = nltk.FreqDist(words)
    >>> freq_dist.keys()[:50]  # 50 most frequent tokens
    [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber',
     u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you',
     u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:',
     u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)',
     u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that',
     u'what', u'but', u'tonight!', u':d', u'2', u'will']
    >>> freq_dist.keys()[-50:]  # 50 least frequent tokens
    [u'what?!', u'whens', u'where', u'while', u'white', u'whoever', u'whoooo!!!!',
     u'whose', u'wiating', u'wii', u'wiig', u'win...', u'wink.', u'wknd.', u'wohh',
     u'won', u'wonder', u'wondering', u'wootwoot!', u'worked', u'worth', u'xo.',
     u'xx', u'ya', u'ya<3miranda', u'yay', u'yay!', u'ya\u2665', u'yea', u'yea.',
     u'yeaa', u'yeah!', u'yeah.', u'yeahhh.', u'yes,', u'yes;)', u'yess', u'yess,',
     u'you!!!!!', u"you'll", u'you+snl=', u'you', u'youll', u'youtube??',
     u'youu<3', u'youuuuu', u'yum', u'yumyum', u'~', u'\xac\xac']

  ✤ Frequent tokens refer to entities such as people, times, and activities
  ✤ Infrequent terms amount to mostly noise
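The same frequency analysis can be done with only the standard library via `collections.Counter`, as the slide suggests; the sample tweets below are made up:

```python
# Frequency analysis with collections.Counter instead of nltk.FreqDist.
from collections import Counter

tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite",
          "watching SNL tonight",
          "SNL tonight gonna be great"]
words = [w.lower() for t in tweets for w in t.split()]
freq_dist = Counter(words)

print(freq_dist.most_common(1))  # -> [('snl', 3)]
print(freq_dist["tonight"])      # -> 2
```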
• Frequency Analysis and Lexical Diversity (3/5)
  ✤ Extracting relationships from the tweets
    ✤ The social web is foremost the linkages between people
    ✤ One highly convenient format for storing social web data is a graph
  ✤ Using regular expressions to find retweets
    ✤ RT followed by a username
    ✤ via followed by a username

    >>> import re
    >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
    >>> example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
    ...                   "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
    >>> for t in example_tweets:
    ...     rt_patterns.findall(t)
    [('RT', ' @SocialWebMining')]
    [('via', ' @SocialWebMining')]
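The retweet regex can be verified in current Python 3 on the slide's example tweets; the helper below follows the `get_rt_sources` idea the deck uses later to extract the originating username:

```python
# Retweet extraction: RT or via followed by one or more @usernames.
import re

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def get_rt_sources(tweet):
    # Keep only the @username parts of each (marker, usernames) match.
    return [source.strip()
            for tup in rt_patterns.findall(tweet)
            for source in tup
            if source not in ("RT", "via")]

print(get_rt_sources("RT @SocialWebMining Justin Bieber is on SNL 2nite."))
print(get_rt_sources("Justin Bieber is on SNL 2nite. (via @SocialWebMining)"))
# Both print ['@SocialWebMining']
```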
• Frequency Analysis and Lexical Diversity (4/5)

    >>> import networkx as nx
    >>> import re
    >>> g = nx.DiGraph()
    >>> all_tweets = [ tweet
    ...                for page in search_results
    ...                for tweet in page["results"] ]
    >>> def get_rt_sources(tweet):
    ...     rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
    ...     return [ source.strip()
    ...              for tuple in rt_patterns.findall(tweet)
    ...              for source in tuple
    ...              if source not in ("RT", "via") ]
    >>> for tweet in all_tweets:
    ...     rt_sources = get_rt_sources(tweet["text"])
    ...     if not rt_sources: continue
    ...     for rt_source in rt_sources:
    ...         g.add_edge(rt_source, tweet["from_user"], {"tweet_id": tweet["id"]})
    >>> g.number_of_nodes()
    160
    >>> g.number_of_edges()
    125
    >>> g.edges(data=True)[0]
    (u'@ericastolte', u'bonitasworld', {'tweet_id': 11965974697L})
    >>> len(nx.connected_components(g.to_undirected()))
    37
    >>> sorted(nx.degree(g))
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 9, 37]
• Frequency Analysis and Lexical Diversity (5/5)
  ✤ Analysis
    ✤ 500 tweets
    ✤ 160 nodes: the users involved in retweet relationships with one another
    ✤ 125 edges connecting them
    ✤ 1.28 (160/125): some nodes are connected to more than one other node
    ✤ 37: the graph consists of 37 connected components and is not fully connected
    ✤ The output of degree shows that most nodes are connected to only one other node, while a few are much more highly connected
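The connected-components count above comes from `nx.connected_components`; the idea can be sketched without networkx using a small union-find over an edge list. The toy edges below are just the sample DOT edges shown later in the deck, not the full 500-tweet dataset:

```python
# Dependency-free sketch of counting connected components with
# union-find, mirroring len(nx.connected_components(...)) above.
def connected_component_count(edges):
    """Count components of an undirected graph given as (u, v) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for u, v in edges:
        union(u, v)
    return len({find(n) for n in set(parent)})

toy_edges = [("@ericastolte", "bonitasworld"),
             ("@mpcoelho", "Lil_Amaral"),
             ("@BieberBelle123", "BELIEBE4EVER"),
             ("@BieberBelle123", "sabrina9451")]
print(connected_component_count(toy_edges))  # -> 3
```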
• Visualizing Tweet Graphs (1/3)
  ✤ DOT language
    ✤ Text-based graph description language
    ✤ Supports a simple way of describing graphs that both humans and computer programs can use
  ✤ Graphviz
    ✤ install from source: http://www.graphviz.org/
  ✤ pygraphviz
    ✤ easy_install pygraphviz
    ✤ setup.py: library_path, include_path
• Visualizing Tweet Graphs (2/3)
  ✤ Generating DOT language output

    OUT = "snl_search_results.dot"
    try:
        nx.drawing.write_dot(g, OUT)
    except ImportError, e:
        # Help for Windows users:
        # Not a general-purpose method, but representative of
        # the same output write_dot would provide for this graph
        # if installed and easy to implement
        dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
               for n1, n2 in g.edges()]
        f = open(OUT, 'w')
        f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
        f.close()

  ✤ Output

    strict digraph {
    "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
    "@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
    "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
    "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
    }
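The fallback branch above can be run standalone in Python 3 by substituting a plain dict of edges for the networkx graph; the two edges here are taken from the slide's sample output:

```python
# Emitting DOT text by hand from an edge list, mirroring the
# write_dot fallback on the slide (a dict stands in for the graph).
edges = {
    ("@ericastolte", "bonitasworld"): 11965974697,
    ("@mpcoelho", "Lil_Amaral"): 11965954427,
}

dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, tid)
       for (n1, n2), tid in edges.items()]
dot_text = "strict digraph {\n%s\n}" % (";\n".join(dot),)
print(dot_text)
```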
• Visualizing Tweet Graphs (3/3)
  ✤ Convert
    ✤ $ circo -Tpng -Osnl_search_results snl_search_results.dot
  ✤ (resulting graph image omitted)
• Closing Remarks
  ✤ Illustrated how easy it is to use Python’s interactive interpreter to explore and visualize Twitter data
  ✤ Get comfortable with your Python development environment
  ✤ Spend some time with the Twitter APIs and Graphviz
  ✤ Canviz project
    ✤ Draws Graphviz graphs on a web browser <canvas> element