Mining the Social Web, Chapter 1

Transcript

  • 1. Ch. 1. Introduction: Hacking on Twitter Data (chois79, 2011.10.15)
  • 2. Installing Python Development Tools
    ✤ Python
      ✤ http://www.python.org/download
    ✤ Python package manager tools
      ✤ let you install Python packages effortlessly
      ✤ easy_install: http://pypi.python.org/pypi/setuptools
      ✤ pip: http://www.pip-installer.org/en/latest/installing.html
    ✤ networkx
      ✤ a package for creating and manipulating graphs and networks
      ✤ e.g. easy_install networkx or pip install networkx (a quick sanity check follows below)
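
    As a sanity check after installation (a minimal sketch; the graph and node names are made up for illustration), confirm that networkx imports and can build a graph:

      >>> import networkx as nx
      >>> g = nx.Graph()                    # an empty undirected graph
      >>> g.add_edge('@user_a', '@user_b')  # two hypothetical Twitter handles
      >>> g.number_of_nodes(), g.number_of_edges()
      (2, 1)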
  • 3. Collecting and Manipulating Twitter Data
  • 4. Tinkering with Twitter’s API (1/2)
    ✤ Setup
      ✤ easy_install twitter
      ✤ note: Twitter’s API has since been updated; see http://github.com/sixohsix/twitter/issues/56
      ✤ the twitter package is “The Minimalist Twitter API for Python”
    ✤ Equivalent REST query (see the sketch after this slide)
      ✤ http://search.twitter.com/trends.json
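
    The same trends data could also be fetched without the twitter package by hitting the REST endpoint directly (a minimal sketch against the endpoint as it existed when these slides were written; it has since been retired, and the response shape is assumed from the API docs of the time):

      >>> import json
      >>> import urllib2
      >>> url = 'http://search.twitter.com/trends.json'  # historical endpoint
      >>> response = urllib2.urlopen(url)                # fetch the raw JSON
      >>> data = json.loads(response.read())             # parse into a dict
      >>> [ trend['name'] for trend in data['trends'] ]  # same names as ex. 3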
  • 5. Tinkering with Twitter’s API (2/2)
    ✤ Retrieving Twitter search trends

      # ex. 3
      import twitter
      twitter_api = twitter.Twitter()
      WORLD_WOE_ID = 1  # the Yahoo! Where On Earth ID for the entire world
      world_trends = twitter_api.trends._(WORLD_WOE_ID)  # get back a callable
      #[ trend["name"] for trend in world_trends()[0]['trends'] ]  # call the callable
      for trend in world_trends()[0]['trends']:  # call the callable
          print trend["name"]

    ✤ Paging through Twitter search results (the flattening step used by the later slides follows below)

      # ex. 4
      search_results = []
      for page in range(1, 6):
          search_results.append(twitter_api.search(q="Dennis Ritchie", rpp=20, page=page))
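
    The later slides work with a flat list of tweet texts; a small bridging step (not shown on the slides, but consistent with the structure of search_results above) flattens the paged results first:

      >>> tweets = [ r['text']                      # the text of each tweet
      ...            for result in search_results   # each page of results
      ...            for r in result['results'] ]   # each tweet on that page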
  • 6. Frequency Analysis and Lexical Diversity (1/5)
    ✤ Lexical diversity
      ✤ one of the most intuitive measurements that can be applied to unstructured text
      ✤ the number of unique tokens in the text divided by the total number of tokens

      >>> words = []
      >>> for t in tweets:
      ...     words += [ w for w in t.split() ]
      >>> len(words)  # total words
      7238
      >>> len(set(words))  # unique words
      1636
      >>> 1.0 * len(set(words)) / len(words)  # lexical diversity
      0.22602928985907708
      >>> 1.0 * sum([ len(t.split()) for t in tweets ]) / len(tweets)  # avg words per tweet
      14.476000000000001

    ✤ Each tweet carries about 20 percent unique information (reusable helpers follow below)
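
    Wrapped up as reusable helpers (a sketch; the function names are mine, not from the slides):

      def lexical_diversity(tokens):
          # unique tokens divided by total tokens
          return 1.0 * len(set(tokens)) / len(tokens)

      def average_words(statuses):
          # mean number of whitespace-delimited words per status
          return 1.0 * sum([ len(s.split()) for s in statuses ]) / len(statuses)

      >>> lexical_diversity(words)
      0.22602928985907708
      >>> average_words(tweets)
      14.476000000000001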
  • 7. Frequency Analysis and Lexical Diversity (2/5)
    ✤ Frequency analysis: use NLTK or collections.Counter (Counter sketch below)
      ✤ a very simple but powerful tool

      >>> import nltk
      >>> import cPickle
      >>> words = cPickle.load(open("myData.pickle"))
      >>> freq_dist = nltk.FreqDist(words)
      >>> freq_dist.keys()[:50]  # 50 most frequent tokens
      [u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber',
       u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you',
       u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:',
       u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)',
       u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that',
       u'what', u'but', u'tonight!', u':d', u'2', u'will']
      >>> freq_dist.keys()[-50:]  # 50 least frequent tokens
      [u'what?!', u'whens', u'where', u'while', u'white', u'whoever', u'whoooo!!!!',
       u'whose', u'wiating', u'wii', u'wiig', u'win...', u'wink.', u'wknd.', u'wohh',
       u'won', u'wonder', u'wondering', u'wootwoot!', u'worked', u'worth', u'xo.', u'xx',
       u'ya', u'ya<3miranda', u'yay', u'yay!', u'ya\u2665', u'yea', u'yea.', u'yeaa',
       u'yeah!', u'yeah.', u'yeahhh.', u'yes,', u'yes;)', u'yess', u'yess,', u'you!!!!!',
       u"you'll", u'you+snl=', u'you', u'youll', u'youtube??', u'youu<3', u'youuuuu',
       u'yum', u'yumyum', u'~', u'\xac\xac']

    ✤ Frequent tokens refer to entities such as people, times, and activities
    ✤ Infrequent terms amount to mostly noise
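
    The slide names collections.Counter as an alternative to NLTK; a minimal sketch of the same frequency analysis using only the standard library (Counter is available as of Python 2.7):

      >>> from collections import Counter
      >>> freq = Counter(words)  # maps token -> count
      >>> [ token for token, count in freq.most_common(50) ]      # 50 most frequent
      >>> [ token for token, count in freq.most_common()[-50:] ]  # 50 least frequent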
  • 8. Frequency Analysis and Lexical Diversity (3/5)
    ✤ Extracting relationships from the tweets
      ✤ the social web is, first and foremost, the linkages between people
      ✤ one highly convenient format for storing social web data is a graph
    ✤ Using regular expressions to find retweets
      ✤ RT followed by a username
      ✤ via followed by a username

      >>> import re
      >>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
      >>> example_tweets = ["RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?",
      ...                   "Justin Bieber is on SNL 2nite. w00t?!? (via @SocialWebMining)"]
      >>> for t in example_tweets:
      ...     rt_patterns.findall(t)
      [('RT', ' @SocialWebMining')]
      [('via', ' @SocialWebMining')]
  • 9. Frequency Analysis and Lexical Diversity (4/5)

      >>> import networkx as nx
      >>> import re
      >>> g = nx.DiGraph()
      >>>
      >>> all_tweets = [ tweet
      ...                for page in search_results
      ...                for tweet in page["results"] ]
      >>>
      >>> def get_rt_sources(tweet):
      ...     rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
      ...     return [ source.strip()
      ...              for tuple in rt_patterns.findall(tweet)
      ...              for source in tuple
      ...              if source not in ("RT", "via") ]
      >>>
      >>> for tweet in all_tweets:
      ...     rt_sources = get_rt_sources(tweet["text"])
      ...     if not rt_sources: continue
      ...     for rt_source in rt_sources:
      ...         g.add_edge(rt_source, tweet["from_user"], {"tweet_id": tweet["id"]})
      >>>
      >>> g.number_of_nodes()
      160
      >>> g.number_of_edges()
      125
      >>> g.edges(data=True)[0]
      (u'@ericastolte', u'bonitasworld', {'tweet_id': 11965974697L})
      >>> len(nx.connected_components(g.to_undirected()))
      37
      >>> sorted(nx.degree(g))
      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 9, 37]
  • 10. Frequency Analysis and Lexical Diversity (5/5)
    ✤ Analysis (the key figures are recomputed in the sketch below)
      ✤ 500 tweets
      ✤ 160 users (the number of nodes) are involved in retweet relationships with one another
      ✤ 125 edges connect those nodes
      ✤ 160/125 = 1.28, so some nodes are connected to more than one other node
      ✤ 37: the graph consists of 37 subgraphs and is not fully connected
      ✤ the degree output shows that nodes are connected to anywhere from 1 to 37 other nodes
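
    These figures can be recomputed directly from the graph (a sketch; it assumes the DiGraph g built on the previous slide is still in scope):

      >>> 1.0 * g.number_of_nodes() / g.number_of_edges()  # nodes per edge
      1.28
      >>> sorted(nx.degree(g))[-1]  # the best-connected node touches 37 others
      37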
  • 11. Visualizing Tweet Graphs (1/3)
    ✤ DOT language
      ✤ a text-based graph description language
      ✤ offers a simple way of describing graphs that both humans and computer programs can use
    ✤ Graphviz
      ✤ install from source: http://www.graphviz.org/
    ✤ pygraphviz (an end-to-end check follows below)
      ✤ easy_install pygraphviz
      ✤ may require editing setup.py: library_path, include_path
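
    Once Graphviz and pygraphviz are installed, the whole toolchain can be verified end to end (a sketch; the file names follow the example on the next two slides):

      >>> import pygraphviz as pgv
      >>> ag = pgv.AGraph('snl_search_results.dot')  # parse the DOT file
      >>> ag.layout(prog='circo')                    # circular layout, as on slide 13
      >>> ag.draw('snl_search_results.png')          # render to PNG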
  • 12. Visualizing Tweet Graphs (2/3)
    ✤ Generating DOT language output

      OUT = "snl_search_results.dot"
      try:
          nx.drawing.write_dot(g, OUT)
      except ImportError, e:
          # Help for Windows users:
          # Not a general-purpose method, but representative of the same
          # output write_dot would provide for this graph if installed,
          # and easy to implement
          dot = ['"%s" -> "%s" [tweet_id=%s]' % (n1, n2, g[n1][n2]['tweet_id'])
                 for n1, n2 in g.edges()]
          f = open(OUT, 'w')
          f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
          f.close()

    ✤ Output

      strict digraph {
      "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
      "@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
      "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
      "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
      }
  • 13. Visualizing Tweet Graphs (3/3)
    ✤ Convert
      ✤ $ circo -Tpng -Osnl_search_results snl_search_results.dot
    ✤ (the rendered PNG of the retweet graph appeared here)
  • 14. Closing Remarks
    ✤ Illustrated how easy it is to use Python’s interactive interpreter to explore and visualize Twitter data
    ✤ Get comfortable with your Python development environment
    ✤ Spend some time with the Twitter APIs and Graphviz
    ✤ Canviz project
      ✤ draws Graphviz graphs in a web browser’s <canvas> element