Computational Social Science, Lecture 09: Data Wrangling

Guest lecture by John Myles White.

Published in: Education, Technology

  1. 1. A Guide to Getting Data. John Myles White, April 10, 2013
  2. 2. A Hierarchy of Data Access Schemes: Bulk Downloads, API Access, Web-Scraping
  3. 3. Bulk Downloads
  4. 4. Collections of Bulk Downloads: https://delicious.com/jhofman/data, http://bitly.com/bundles/hmason/1
  5. 5. Some Available Data Sets: Wikipedia, IMDB, Million Song Dataset, SNAP (Social Networks), Sunlight (Congressional Votes)
  6. 6. Data Formats: Delimited Values (CSV, TSV, WSV), JSON, XML, Ad Hoc Formats
  7. 7. JSON: JSON sees the world as hash tables and arrays. Hash tables: {"a": 1, "b": 2}. Arrays: [1, 2, 3]
  8. 8. JSON. Example from json.org:
        {"menu": {
          "id": "file",
          "value": "File",
          "popup": {
            "menuitem": [
              {"value": "New", "onclick": "CreateNewDoc()"},
              {"value": "Open", "onclick": "OpenDoc()"},
              {"value": "Close", "onclick": "CloseDoc()"}
            ]
          }
        }}
  9. 9. XML. XML views the world as a recursive container:
        <container>
          <item>A</item>
          <item>B</item>
          <item>
            <container>
              <item attr="SomePropertyOfC">C’</item>
            </container>
          </item>
        </container>
  10. 10. XML. From Wikipedia XML dump:
        <mediawiki xml:lang="en">
          <page>
            <title>Page title</title>
            <restrictions>edit=sysop:move=sysop</restrictions>
            <revision>
              <timestamp>2001-01-15T13:15:00Z</timestamp>
              <contributor><username>Foobar</username></contributor>
              <comment>I have just one thing to say!</comment>
              <text>A bunch of [[text]] here.</text>
              <minor />
            </revision>
          </page>
        </mediawiki>
  11. 11. Ad Hoc Data Formats: Fixed Width Files, Graph Edgelists, Voting Record Format, many others...
  12. 12. Fixed Width Format. 7-5-5 Format:
        Sam    5    6
        Josh   6    1211
        Nicole 9983 200
  13. 13. Graph Edgelist. Directed Graph Format:
        1 2
        1 3
        1 4
        2 3
        4 4
  14. 14. Voting Records. KH Format:
        1109991099 0USA 200 BUSH9999999999999999696996999999996. . .
  15. 15. Unstructured or Misstructured Data: Which Wikipedia articles link to each other? We have a Wikipedia dump of raw text. We need to parse the XML, find the links, and extract them.
  16. 16. API Access
  17. 17. Sites w/ APIs: NY Times, Twitter, Google, Facebook, Foursquare
  18. 18. Live Demo of NY Times API: http://developer.nytimes.com/docs
  19. 19. Live Demo of Twitter API: https://dev.twitter.com/docs/api/1.1
  20. 20. Use wget or curl:wget http://google.com
  21. 21. API Wrappers: Google API Client, Tweepy (Twitter API), twitteR, ...
  22. 22. Tweepy usage:
        import tweepy

        # Create API object
        # ...
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        api = tweepy.API(auth)

        screen_name = "BarackObama"
        user_info = api.get_user(screen_name)

        # Collect the IDs of every account this user follows, one page at a time
        user_friends = []
        for page in tweepy.Cursor(api.friends_ids, screen_name=screen_name).pages():
            for user_id in page:
                user_friends.append(user_id)
  23. 23. Parsing Data: Regular Expressions; Formal Parsers (XML Parsers, HTML Parsers)
  24. 24. Basics of Regular Expressions: A Pattern Language for Text w/ Three Parts. Character literals: a, b, 5. Repetition operator: *. Logical OR: |
  25. 25. Example patterns:
        (cat)|(dog)
        (cats*)|(dogs*)
        (ha)*
  26. 26. Searching with grep:
        grep "cat" /usr/share/dict/words
        grep -E "(cats*)|(dogs*)" /usr/share/dict/words
  27. 27. Advanced Tools. Complex Repetition: *, +, ?, {m,n}. Character Classes: [0-9], [a-z]. Special Character Classes: \d, \w
  28. 28. Complex Repetition:
        a*: 0 or more occurrences of a
        a+: 1 or more occurrences of a
        a?: 0 or 1 occurrences of a
        a{m,n}: at least m and no more than n occurrences of a
  29. 29. Character Classes:
        [0-9]
        [a-z]
        [0-9a-zA-Z]
        [^0-9]: a leading ^ negates a character class
  30. 30. Special Character Classes:
        \d: Any digit
        \D: Any non-digit
        \w: Any word character
        \s: Any whitespace character
  31. 31. Matching Phone Numbers:
        555-5757
        800-555-5757
        800.555.5757
        1-800-555-5757
        +1 800 555 5757
        5–52—25
  32. 32. First Draft Regular Expression: (\d|-)+
  33. 33. Second Draft: \d\d\d-\d\d\d\d
  34. 34. Third Draft: \d\d\d[.-]\d\d\d\d
  35. 35. Formal Parsers
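  36. 36. Python JSON
        import json

        # Serialize a Python dict to a JSON string
        print(json.dumps({'4': 5, '6': [7, 8]}))
        # Parse a JSON string back into Python lists and dicts
        json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')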
  37. 37. Python XML Parser
        <data>
          <items>
            <item name="item1"></item>
            <item name="item2"></item>
            <item name="item3"></item>
            <item name="item4"></item>
          </items>
        </data>
  38. 38. Python XML Parser
        from xml.dom import minidom

        # Parse the file and collect every <item> element
        xmldoc = minidom.parse('items.xml')
        itemlist = xmldoc.getElementsByTagName('item')
        print(len(itemlist))
        print(itemlist[0].attributes['name'].value)
        # Print the name attribute of each item
        for s in itemlist:
            print(s.attributes['name'].value)
  39. 39. Web-Scraping
  40. 40. Crawling the web; spidering data; scraping HTML for information
  41. 41. wget google.com
  42. 42. Developer Console Demo
  43. 43. Many HTML parsing libraries: Beautiful Soup, Nokogiri
  44. 44. Generic UNIX Tools: grep, sort, more, wc, cut, awk, ...
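
Worked example for slides 7 and 8 (JSON). Since JSON is built from hash tables and arrays, a parsed document can be walked with ordinary dict and list indexing. The sketch below loads the json.org menu example from slide 8; the variable names are illustrative, not from the lecture.

```python
import json

# The json.org menu example from slide 8, kept as a raw JSON string.
MENU = """{"menu": {"id": "file", "value": "File", "popup": {"menuitem": [
  {"value": "New", "onclick": "CreateNewDoc()"},
  {"value": "Open", "onclick": "OpenDoc()"},
  {"value": "Close", "onclick": "CloseDoc()"}]}}}"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
document = json.loads(MENU)

# Walk down the nested hash tables to reach the array of menu items.
items = document["menu"]["popup"]["menuitem"]
labels = [item["value"] for item in items]
handlers = {item["value"]: item["onclick"] for item in items}

print(labels)
print(handlers["Open"])
```

The same indexing pattern applies to most API responses, which tend to be nested dicts and lists of this kind.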
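
Worked example for slide 12 (fixed-width files). A fixed-width record is read by slicing each line at known column offsets rather than splitting on a delimiter. This is a minimal sketch for the 7-5-5 layout; the transcript lost the original spacing, so the left-aligned space padding used to rebuild the sample lines is an assumption, and `parse_fixed_width` is a hypothetical helper name.

```python
# Field widths for the 7-5-5 layout named on slide 12.
WIDTHS = (7, 5, 5)


def parse_fixed_width(line, widths=WIDTHS):
    """Cut one line into fields at fixed column offsets and strip the padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Values from the slide. The padding is rebuilt here (assumed left-aligned,
# space-filled) because the original whitespace did not survive.
rows = [("Sam", "5", "6"), ("Josh", "6", "1211"), ("Nicole", "9983", "200")]
lines = ["".join(value.ljust(width) for value, width in zip(row, WIDTHS))
         for row in rows]

records = [parse_fixed_width(line) for line in lines]
for name, first, second in records:
    print(name, int(first), int(second))
```

Slicing by offset still works when a field contains spaces, which is where a plain `split()` would go wrong.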
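
Worked example for slide 13 (graph edgelists). Each line of a directed edgelist names a source node and a target node, so the file can be folded into an adjacency list in one pass. A minimal sketch using the five edges from the slide; `read_edgelist` is a hypothetical helper, and whitespace-separated integer IDs are assumed.

```python
from collections import defaultdict

# The directed edges shown on slide 13, one "source target" pair per line.
EDGELIST = """\
1 2
1 3
1 4
2 3
4 4
"""


def read_edgelist(text):
    """Build {source: [targets]} from whitespace-separated edge lines."""
    adjacency = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split()
        adjacency[int(source)].append(int(target))
    return dict(adjacency)


graph = read_edgelist(EDGELIST)
print(graph)

# Out-degree of each node that has at least one outgoing edge.
out_degree = {node: len(targets) for node, targets in graph.items()}
print(out_degree)
```

Note that the last edge, 4 4, is a self-loop, and that node 3 only appears as a target, so it has no key of its own in the adjacency list.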
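
Worked example for slide 15 (which Wikipedia articles link to each other). The three steps on the slide, parse the XML, find the links, extract them, can be sketched with the standard library. This assumes a dump shaped like the trimmed sample on slide 10; real dumps declare an XML namespace on the root element, which would need to be included in the tag names. `page_links` and the link pattern are illustrative, not from the lecture.

```python
import re
import xml.etree.ElementTree as ET

# A trimmed copy of the sample dump from slide 10.
DUMP = """<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <text>A bunch of [[text]] here.</text>
    </revision>
  </page>
</mediawiki>"""

# Capture the link target inside [[...]], stopping before "|" (display text),
# "#" (section anchor), or the closing brackets.
LINK = re.compile(r"\[\[([^\]|#]+)")


def page_links(xml_text):
    """Map each page title to the list of article titles it links to."""
    root = ET.fromstring(xml_text)
    links = {}
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        links[title] = [target.strip() for target in LINK.findall(text)]
    return links


print(page_links(DUMP))
```

The resulting title-to-targets mapping is itself a directed edgelist, so it connects back to the graph format on slide 13.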
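
Worked example for slides 31 to 34 (matching phone numbers). One way to compare the three drafts is to run each against the candidate strings from slide 31 and see which ones it accepts in full. The sketch uses Python's `re.fullmatch`, which is a choice made here; the lecture may have used a substring search as with grep.

```python
import re

# The three drafts from slides 32 to 34.
DRAFTS = {
    "first": r"(\d|-)+",
    "second": r"\d\d\d-\d\d\d\d",
    "third": r"\d\d\d[.-]\d\d\d\d",
}

# Candidate strings from slide 31. The last one uses an en dash (\u2013)
# and an em dash (\u2014) rather than hyphens, as on the slide.
CANDIDATES = [
    "555-5757",
    "800-555-5757",
    "800.555.5757",
    "1-800-555-5757",
    "+1 800 555 5757",
    "5\u201352\u201425",
]


def accepted(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [text for text in candidates if re.fullmatch(pattern, text)]


for name, pattern in DRAFTS.items():
    print(name, accepted(pattern, CANDIDATES))
```

The first draft is too loose, since any run of digits and hyphens passes, while the second and third drafts are too strict for anything longer than a seven-digit local number. The third draft differs from the second only in also allowing a dot as the separator.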
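
Worked example for slides 40 to 43 (scraping HTML). Slide 43 names Beautiful Soup and Nokogiri; to stay dependency-free, this sketch swaps in Python's built-in `html.parser` module instead. It collects the `href` of every link in a page. The `LinkCollector` class and the sample markup are made up for illustration; in practice the page text would come from a download such as the `wget` call on slide 41.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content standing in for a downloaded HTML file.
PAGE = """<html><body>
<a href="http://developer.nytimes.com/docs">NY Times API</a>
<a href="https://dev.twitter.com/docs/api/1.1">Twitter API</a>
<p>No link here.</p>
</body></html>"""

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```

A dedicated library is usually more forgiving of malformed markup and offers richer queries, which is the reason the slide points to them.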
