A Guide to Getting Data


    John Myles White




      April 10, 2013
A Hierarchy of Data Access Schemes


Bulk Downloads
API Access
Web-Scraping
Bulk Downloads
Collections of Bulk Downloads


https://delicious.com/jhofman/data
http://bitly.com/bundles/hmason/1
Some Available Data Sets


Wikipedia
IMDB
Million Song Dataset
SNAP (Social Networks)
Sunlight (Congressional Votes)
Data Formats


Delimited Values
    CSV
    TSV
    WSV
JSON
XML
Ad Hoc Formats
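
As a rough illustration of the delimited formats listed above, the sketch below reads the same small table from CSV, TSV, and whitespace-separated text using only Python's standard library. The sample names and scores are made up for the example, and `read_delimited` is a hypothetical helper, not something from the lecture.

```python
import csv
import io

# Hypothetical sample data; the names and scores are invented for illustration.
csv_text = "name,score\nSam,5\nJosh,6\n"
tsv_text = "name\tscore\nSam\t5\nJosh\t6\n"
wsv_text = "name score\nSam    5\nJosh   6\n"

def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

# Whitespace-separated values: str.split() with no argument collapses
# runs of spaces, which the csv module does not do on its own.
header, *body = [line.split() for line in wsv_text.splitlines()]
wsv_rows = [dict(zip(header, fields)) for fields in body]

print(csv_rows[0]["name"], csv_rows[0]["score"])
```

Note that every value comes back as a string; converting `score` to a number is a separate step.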
JSON

JSON sees the world as hash tables and arrays:

    Hash tables: {"a":    1, "b":    2}
    Arrays: [1, 2, 3]
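
In Python, this view maps directly onto built-in types: a JSON hash table becomes a `dict` and a JSON array becomes a `list`. A minimal sketch using the two literals above:

```python
import json

# A JSON object (hash table) parses to a Python dict.
obj = json.loads('{"a": 1, "b": 2}')
# A JSON array parses to a Python list.
arr = json.loads('[1, 2, 3]')

print(type(obj).__name__, obj["a"])
print(type(arr).__name__, arr[0])
```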
JSON

Example from json.org:

{"menu": {
   "id": "file",
   "value": "File",
   "popup": {
     "menuitem": [
       {"value": "New", "onclick": "CreateNewDoc()"},
       {"value": "Open", "onclick": "OpenDoc()"},
       {"value": "Close", "onclick": "CloseDoc()"}
     ]
   }
}}
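
Once parsed, nested JSON is navigated with ordinary indexing. The sketch below loads the menu example and pulls out the label of each menu item; the variable names are chosen here for illustration.

```python
import json

menu_json = """
{"menu": {
   "id": "file",
   "value": "File",
   "popup": {
     "menuitem": [
       {"value": "New", "onclick": "CreateNewDoc()"},
       {"value": "Open", "onclick": "OpenDoc()"},
       {"value": "Close", "onclick": "CloseDoc()"}
     ]
   }
}}
"""

doc = json.loads(menu_json)

# Walk down the hash tables, then iterate over the array of menu items.
labels = [item["value"] for item in doc["menu"]["popup"]["menuitem"]]
print(labels)
```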
XML

XML views the world as a recursive container:

<container>
    <item>A</item>
    <item>B</item>
    <item>
        <container>
            <item attr="SomePropertyOfC">C’</item>
        </container>
    </item>
</container>
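
Because containers can hold containers, code that processes XML is often recursive as well. The following sketch walks the example above with Python's `xml.etree.ElementTree` and collects the text of every leaf element. The function name `leaf_texts` is hypothetical, and the innermost item's text is simplified to a plain `C`.

```python
import xml.etree.ElementTree as ET

xml_text = """<container>
    <item>A</item>
    <item>B</item>
    <item>
        <container>
            <item attr="SomePropertyOfC">C</item>
        </container>
    </item>
</container>"""

def leaf_texts(element):
    """Recursively collect the text of every element that has no children."""
    children = list(element)
    if not children:
        return [(element.text or "").strip()]
    texts = []
    for child in children:
        texts.extend(leaf_texts(child))
    return texts

root = ET.fromstring(xml_text)
leaves = leaf_texts(root)
print(leaves)
```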
XML

From Wikipedia XML dump:

<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [[text]] here.</text>
      <minor />
    </revision>
  </page>
</mediawiki>
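
A minimal sketch of pulling fields out of a page record like the one above, using `xml.etree.ElementTree` path lookups. The sample is trimmed, and its `contributor` tag is written out in full so that it parses. Real dump files typically declare an XML namespace on the root element, in which case the tag names below would need a namespace prefix; that is not handled here.

```python
import xml.etree.ElementTree as ET

dump_text = """<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <text>A bunch of [[text]] here.</text>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(dump_text)

# Build one flat record per <page>, following paths relative to the page.
pages = []
for page in root.iter("page"):
    pages.append({
        "title": page.findtext("title"),
        "timestamp": page.findtext("revision/timestamp"),
        "username": page.findtext("revision/contributor/username"),
        "text": page.findtext("revision/text"),
    })

print(pages[0]["title"], pages[0]["username"])
```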
Ad Hoc Data Formats


Fixed Width Files
Graph Edgelists
Voting Record Format
Many others...
Fixed Width Format

7-5-5 Format:

Sam    5    6
Josh   6    1211
Nicole 9983 200
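
A fixed width file has no delimiters, so each field is recovered by slicing at known column offsets. The sketch below assumes "7-5-5" means field widths of 7, 5, and 5 characters, which is consistent with the three lines above; `parse_fixed_width` is a hypothetical helper.

```python
lines = [
    "Sam    5    6",
    "Josh   6    1211",
    "Nicole 9983 200",
]

def parse_fixed_width(line, widths=(7, 5, 5)):
    """Slice a line into fields of the given widths and trim the padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields

records = [parse_fixed_width(line) for line in lines]
for record in records:
    print(record)
```

Splitting on whitespace would give the same result here, but it breaks as soon as a field is empty or contains a space, which is why slicing by width is the safer default for this format.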
Graph Edgelist

Directed Graph Format:

1   2
1   3
1   4
2   3
4   4
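
An edgelist can be loaded into an adjacency structure with a few lines of code. This sketch assumes each line reads "source target", so that `1   2` is an edge from node 1 to node 2; the slide does not state the column order, so treat that as an assumption.

```python
from collections import defaultdict

edge_text = "1   2\n1   3\n1   4\n2   3\n4   4\n"

# Map each source node to the list of nodes it points to.
adjacency = defaultdict(list)
for line in edge_text.splitlines():
    source, target = line.split()
    adjacency[int(source)].append(int(target))

for node, neighbors in sorted(adjacency.items()):
    print(node, "->", neighbors)
```

The last line of the edgelist, `4   4`, is a self-loop, which this representation stores like any other edge.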
Voting Records

KH Format:
1109991099 0USA 200 BUSH
9999999999999999696996999999996. . .
Unstructured or Misstructured Data


Which Wikipedia articles link to each other?
Have Wikipedia dump of raw text
Need to parse XML, find links, extract them
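
The link-extraction step can be sketched with a regular expression over the raw article text. Wiki links are written as `[[Target]]`, optionally with a `|label` or `#section` suffix; the pattern below keeps only the target. The sample text and the name `extract_links` are made up for illustration, and this pattern is a simplification that ignores templates, nested brackets, and other wikitext details.

```python
import re

# Capture everything after "[[" up to the first "]", "|" or "#".
LINK_PATTERN = re.compile(r"\[\[([^\]|#]+)")

def extract_links(wikitext):
    """Return the link targets found in a piece of wikitext."""
    return [target.strip() for target in LINK_PATTERN.findall(wikitext)]

sample = ("A bunch of [[text]] here, plus [[Data mining|mining]] "
          "and [[Graph theory#History]].")
links = extract_links(sample)
print(links)
```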
API Access
Sites with APIs


NY Times
Twitter
Google
Facebook
Foursquare
Live Demo of NY Times API

http://developer.nytimes.com/docs
Live Demo of Twitter API

https://dev.twitter.com/docs/api/1.1
Use wget or curl:

wget http://google.com
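
The same kind of request can be made from a script. Below is a minimal sketch using Python's standard `urllib.request`; the `fetch` helper and the `User-Agent` string are hypothetical choices, not part of any API discussed here.

```python
from urllib.request import Request, urlopen

def fetch(url, timeout=10):
    """Download a URL and return the response body as bytes."""
    # The User-Agent value is an arbitrary label chosen for this example.
    request = Request(url, headers={"User-Agent": "data-wrangling-demo"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch("http://google.com")` would return the page's raw HTML as bytes, much as `wget` saves it to a file. Most APIs return JSON instead, which can then be passed to `json.loads` after decoding.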
API Wrappers

    Google API Client
    Tweepy - Twitter API
    twitteR
    ...
Tweepy usage:

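import tweepy

# Create API object
# ...
# consumer_key, consumer_secret, access_token and access_token_secret
# are the credentials of your registered Twitter application.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

screen_name = "BarackObama"
user_info = api.get_user(screen_name=screen_name)

# Collect the IDs of the accounts this user follows, one page at a time.
# Method names can differ between Tweepy releases; check your version.
user_friends = []
for page in tweepy.Cursor(api.friends_ids,
                          screen_name=screen_name).pages():
    for user_id in page:
        user_friends.append(user_id)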
Parsing Data


Regular Expressions
Formal Parsers
    XML Parsers
    HTML Parsers
Basics of Regular Expressions

A Pattern Language for Text with Three Parts

    Character literals: a, b, 5
    Repetition operator: *
    Logical OR: |
(cat)|(dog)
(cats*)|(dogs*)
(ha)*
grep "cat" /usr/share/dict/words
grep -E "(cats*)|(dogs*)" /usr/share/dict/words
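
The same three building blocks work in Python's `re` module. One difference worth noting: `grep` reports any line that contains a match, while `re.fullmatch` below requires the whole string to match. The word list is made up for the example.

```python
import re

words = ["cat", "cats", "catss", "dog", "bird", "haha"]

# Literals, repetition (*), and alternation (|) combined.
pattern = re.compile(r"(cats*)|(dogs*)")
full_matches = [word for word in words if pattern.fullmatch(word)]
print(full_matches)

# (ha)* matches zero or more copies of "ha", including the empty string.
laugh = re.compile(r"(ha)*")
print(bool(laugh.fullmatch("haha")), bool(laugh.fullmatch("hah")))
```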
Advanced Tools:

    Complex Repetition: *, +, ?, {m,n}
    Character Classes: [0-9], [a-z]
    Special Character Classes: \d, \w
Complex Repetition:

    a*: 0 or more occurrences of a
    a+: 1 or more occurrences of a
    a?: 0 or 1 occurrences of a
    a{m,n}: At least m and no more than n occurrences of a
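
A small table of cases makes the four repetition operators concrete. This sketch uses `re.fullmatch`, so each pattern must account for the entire string; the `matches` helper and the choice of `{2,3}` as the bounded example are illustrative.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, input, expected result)
cases = [
    (r"a*", "", True),          # zero occurrences is allowed
    (r"a*", "aaa", True),
    (r"a+", "", False),         # at least one is required
    (r"a+", "aaa", True),
    (r"a?", "a", True),
    (r"a?", "aa", False),       # at most one is allowed
    (r"a{2,3}", "a", False),    # fewer than m
    (r"a{2,3}", "aaa", True),
    (r"a{2,3}", "aaaa", False), # more than n
]

outcomes = [matches(pattern, text) for pattern, text, _ in cases]
print(outcomes)
```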
Character Classes:

    [0-9]
    [a-z]
    [0-9a-zA-Z]
    [^0-9] Negate a character class
Special Character Classes:

    \d: Any digit
    \D: Any non-digit
    \w: Any word character
    \s: Any whitespace character
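
The sketch below applies bracketed character classes and the special classes to one sample string. The special classes are written with their backslashes (`\d`, `\w`, `\s`), as Python's `re` module expects. The sample text is made up.

```python
import re

text = "Room 42, call 555-5757 at 9am"

# [0-9] and \d describe the same characters for plain ASCII text.
digits_bracket = re.findall(r"[0-9]+", text)
digits_special = re.findall(r"\d+", text)

# \w covers letters, digits, and underscore.
word_runs = re.findall(r"\w+", text)

# \s+ matches runs of whitespace, which makes it handy for splitting.
tokens = re.split(r"\s+", text)

print(digits_bracket)
print(word_runs)
print(tokens)
```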
Matching Phone Numbers


555-5757
800-555-5757
800.555.5757
1-800-555-5757
+1 800 555 5757
5--52---25
First Draft Regular Expression

(\d|-)+
Second Draft

\d\d\d-\d\d\d\d
Third Draft

\d\d\d[.-]\d\d\d\d
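
To see how the drafts differ, the sketch below runs each one, written with backslash escapes (`\d`), against a list of candidate strings and records the first substring it finds. The last candidate is a deliberately malformed, dash-heavy string assumed here for illustration.

```python
import re

candidates = [
    "555-5757",
    "800-555-5757",
    "800.555.5757",
    "1-800-555-5757",
    "+1 800 555 5757",
    "5--52---25",  # not a phone number
]

drafts = {
    "first": r"(\d|-)+",
    "second": r"\d\d\d-\d\d\d\d",
    "third": r"\d\d\d[.-]\d\d\d\d",
}

# For each draft, map every candidate to the matched substring (or None).
hits = {}
for name, pattern in drafts.items():
    hits[name] = {}
    for candidate in candidates:
        found = re.search(pattern, candidate)
        hits[name][candidate] = found.group(0) if found else None

for name in drafts:
    print(name, hits[name])
```

The first draft is too permissive: it accepts the malformed string in full. The second is too strict about the separator, and the third relaxes it to allow a dot. None of the three handles the space-separated form.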
Formal Parsers
Python JSON

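import json
# Serialize a Python dict to a JSON string
print(json.dumps({'4': 5, '6': [7, 8]}))
# Parse a JSON string into Python lists and dicts
print(json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]'))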
Python XML Parser

<data>
    <items>
        <item   name="item1"></item>
        <item   name="item2"></item>
        <item   name="item3"></item>
        <item   name="item4"></item>
    </items>
</data>
Python XML Parser

from xml.dom import minidom
# Parse the file and collect every <item> element
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
# Print the name attribute of each item
for s in itemlist:
    print(s.attributes['name'].value)
Web-Scraping
Crawling the web
Spidering data
Scraping HTML for information
wget google.com
Developer Console Demo
Many HTML parsing libraries:

    Beautiful Soup
    Nokogiri
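
The basic idea behind these libraries can be sketched with Python's built-in `html.parser`, used here as a stand-in so the example needs no third-party install. It collects the `href` of every link on a page; the `LinkCollector` class and the sample HTML are made up for illustration. A library such as Beautiful Soup offers a more convenient interface for the same task.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html_text = (
    '<html><body>'
    '<a href="http://google.com">Google</a>'
    '<p>No link here</p>'
    '<a href="/about">About</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(html_text)
collector.close()
print(collector.links)
```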
Generic UNIX Tools


grep
sort
more
wc
cut
awk
...

Computational Social Science, Lecture 09: Data Wrangling