Computational Social Science, Lecture 09: Data Wrangling

A Guide to Getting Data

John Myles White

April 10, 2013

A Hierarchy of Data Access Schemes

Bulk Downloads
API Access
Web-Scraping

Collections of Bulk Downloads

https://delicious.com/jhofman/data
http://bitly.com/bundles/hmason/1

Some Available Data Sets

Wikipedia
IMDB
Million Song Database
SNAP (Social Networks)
Sunlight (Congressional Votes)

Data Formats

Delimited Values
CSV
TSV
WSV
JSON
XML
Ad Hoc Formats
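
As a rough illustration of the delimited formats listed above, the sketch below reads comma-, tab-, and whitespace-separated text with Python's standard library. The sample strings and the helper name read_delimited are made up for this example; a real file would be opened from disk instead.

```python
import csv
import io

# Hypothetical sample data; a real file would be opened with open(path, newline="").
csv_text = "name,age\nSam,5\nJosh,6\n"
tsv_text = "name\tage\nSam\t5\nJosh\t6\n"
wsv_text = "name age\nSam   5\nJosh  6\n"

def read_delimited(text, delimiter):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")
# Whitespace-separated values: str.split() with no argument collapses runs of spaces.
wsv_rows = [line.split() for line in wsv_text.splitlines()]
```

All three calls yield the same rows here, which is the point: the formats differ only in the separator.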

JSON

JSON sees the world as hash tables and arrays:

Hash tables: {"a": 1, "b": 2}
Arrays: [1, 2, 3]

JSON

Example from json.org:

{"menu": {
"id": "file",
"value": "File",
"popup": {
"menuitem": [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}
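
A minimal sketch of how the json.org example maps onto Python objects: JSON hash tables become dicts and JSON arrays become lists, so nested values are reached by ordinary indexing. The variable names are illustrative only.

```python
import json

menu_text = """
{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
"""

doc = json.loads(menu_text)

# Hash tables are indexed by key, arrays are iterated in order.
menu_id = doc["menu"]["id"]
labels = [item["value"] for item in doc["menu"]["popup"]["menuitem"]]
```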

XML

XML views the world as a recursive container:

<container>
<item>A</item>
<item>B</item>
<item>
<container>
<item attr="SomePropertyOfC">C’</item>
</container>
</item>
</container>

XML

From Wikipedia XML dump:

<mediawiki xml:lang="en">
<page>
<title>Page title</title>
<restrictions>edit=sysop:move=sysop</restrictions>
<revision>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor><username>Foobar</username></contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[text]] here.</text>
<minor />
</revision>
</page>
</mediawiki>
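
One possible way to pull fields out of a dump shaped like the example above, using the standard library's ElementTree. This is a sketch on a trimmed copy of the sample; it assumes no XML namespace, whereas real dumps typically declare one and would then need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page (assumption: no default namespace declared).
dump_text = """<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <text>A bunch of [[text]] here.</text>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(dump_text)

pages = []
for page in root.iter("page"):
    # findtext() takes a simple path relative to the current element.
    title = page.findtext("title")
    username = page.findtext("revision/contributor/username")
    text = page.findtext("revision/text")
    pages.append((title, username, text))
```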

Ad Hoc Data Formats

Fixed Width Files
Graph Edgelists
Voting Record Format
Many others...

Fixed Width Format

7-5-5 Format:

Sam    5    6
Josh   6    1211
Nicole 9983 200
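
A fixed-width file is parsed by position rather than by delimiter. The sketch below assumes the 7-5-5 layout means a 7-character name followed by two 5-character fields, left-aligned and padded with spaces; the alignment and the helper name parse_fixed_width are assumptions.

```python
# Hypothetical rows laid out as 7-5-5: 7 characters, then 5, then 5.
lines = [
    "Sam    5    6    ",
    "Josh   6    1211 ",
    "Nicole 9983 200  ",
]
WIDTHS = (7, 5, 5)

def parse_fixed_width(line, widths=WIDTHS):
    """Slice a line into fields of the given widths and strip the padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields

records = [parse_fixed_width(line) for line in lines]
```

Splitting on whitespace would also work for these rows, but it breaks as soon as a field is empty or contains a space, which is why slicing by column is the safer default.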

Graph Edgelist

Directed Graph Format:

1 2
1 3
1 4
2 3
4 4
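
Reading an edgelist like the one above into an adjacency structure might look like this. The sketch assumes each line is "source target" with integer node IDs.

```python
from collections import defaultdict

edge_text = """1 2
1 3
1 4
2 3
4 4"""

# Map each source node to the list of nodes it points to.
adjacency = defaultdict(list)
for line in edge_text.splitlines():
    source, target = line.split()
    adjacency[int(source)].append(int(target))

# Out-degree per node; nodes with no outgoing edges (such as 3) do not appear.
out_degree = {node: len(targets) for node, targets in adjacency.items()}
```

The last edge, 4 4, is a self-loop and is stored like any other edge.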

Voting Records

KH Format:
1109991099 0USA 200 BUSH
9999999999999999696996999999996. . .

Unstructured or Misstructured Data

Which Wikipedia articles link to each other?
Have Wikipedia dump of raw text
Need to parse XML, find links, extract them
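
The link-extraction step could be sketched with a regular expression over the raw wikitext. This assumes links are written as [[Target]] or [[Target|label]] and ignores templates, nested markup, and other edge cases that real articles contain; the names LINK_PATTERN and extract_links are illustrative.

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the target part.
LINK_PATTERN = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def extract_links(wikitext):
    """Return the link targets found in a piece of wikitext, in order."""
    return [target.strip() for target in LINK_PATTERN.findall(wikitext)]

links = extract_links(
    "A bunch of [[text]] here, plus [[Barack Obama|the president]]."
)
```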

Sites w/ APIs

NY Times
Twitter
Google
Facebook
Foursquare

Live Demo of NY Times API

http://developer.nytimes.com/docs

Live Demo of Twitter API

https://dev.twitter.com/docs/api/1.1

Use wget or curl:

wget http://google.com

API Wrappers

Google API Client
Tweepy - Twitter API
twitteR
...

Tweepy usage:

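import tweepy

# Create API object
# ...
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

screen_name = "BarackObama"
user_info = api.get_user(screen_name=screen_name)

# Page through the IDs of every account this user follows.
# Note: later Tweepy releases rename friends_ids to get_friend_ids.
user_friends = []
for page in tweepy.Cursor(api.friends_ids,
                          screen_name=screen_name).pages():
    for user_id in page:
        user_friends.append(user_id)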

Parsing Data

Regular Expressions
Formal Parsers
XML Parsers
HTML Parsers

Basics of Regular Expressions

A Pattern Language for Text w/ Three Parts

Character literals: a, b, 5
Repetition operator: *
Logical OR: |

(cat)|(dog)
(cats*)|(dogs*)
(ha)*

grep "cat" /usr/share/dict/words
grep -E "(cats*)|(dogs*)" /usr/share/dict/words
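
The same three patterns can be tried from Python's re module. A small sketch, with a made-up word list standing in for /usr/share/dict/words:

```python
import re

# Hypothetical word list standing in for the system dictionary.
words = ["cat", "cats", "catss", "dog", "bird", "haha", "concatenate"]

either = re.compile(r"(cat)|(dog)")
plural = re.compile(r"(cats*)|(dogs*)")
laughs = re.compile(r"(ha)*")

# search() finds the pattern anywhere in the string, as grep does.
contains_pet = [w for w in words if either.search(w)]

# fullmatch() requires the entire string to match the pattern.
whole_pet = [w for w in words if plural.fullmatch(w)]
whole_laugh = [w for w in ["", "ha", "haha", "hah"] if laughs.fullmatch(w)]
```

Note that "concatenate" is found by search() because it contains "cat", and that (ha)* also matches the empty string, since * allows zero repetitions.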

Advanced Tools:

Complex Repetition: *, +, ?, {m,n}
Character Classes: [0-9], [a-z]
Special Character Classes: \d, \w

Complex Repetition:

a*: 0 or more occurrences of a
a+: 1 or more occurrences of a
a?: 0 or 1 occurrences of a
a{m,n}: At least m and no more than n occurrences of a
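
To make the four repetition operators concrete, the sketch below checks which strings of a's each pattern matches in full. The bounds 2 and 3 are arbitrary example values; Python's re expects them written without a space.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaa"]

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"a*")         # zero or more
plus = full_matches(r"a+")         # one or more
optional = full_matches(r"a?")     # zero or one
bounded = full_matches(r"a{2,3}")  # between two and three
```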

Character Classes:

[0-9]
[a-z]
[0-9a-zA-Z]
[^0-9] Negate a character class
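
A short sketch of these character classes applied to one made-up sample string:

```python
import re

sample = "Room 42B, floor 7"

digits = re.findall(r"[0-9]", sample)             # single digits
lowercase = re.findall(r"[a-z]+", sample)         # runs of lowercase letters
alnum_runs = re.findall(r"[0-9a-zA-Z]+", sample)  # runs of letters or digits
non_digits = re.findall(r"[^0-9]+", sample)       # runs of anything but digits
```

The capital R and B are skipped by [a-z], which is why the first lowercase run is "oom".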

Special Character Classes:

\d: Any digit
\D: Any non-digit
\w: Any word character
\s: Any whitespace character
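
In Python these shorthand classes are written with a leading backslash, usually inside a raw string. A minimal sketch on a made-up sample string:

```python
import re

sample = "Call 555-5757 at 9 am"

digit_runs = re.findall(r"\d+", sample)      # runs of digits
non_digit_runs = re.findall(r"\D+", sample)  # runs of non-digits
word_runs = re.findall(r"\w+", sample)       # runs of word characters
pieces = re.split(r"\s+", sample)            # split on whitespace
```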

Matching Phone Numbers

555-5757
800-555-5757
800.555.5757
1-800-555-5757
+1 800 555 5757
5--52---25

First Draft Regular Expression

(\d|-)+
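
Reading the first draft as "one or more digits or hyphens", written (\d|-)+ in Python, the sketch below runs it against the example numbers. It is meant to show where the draft falls short, not to offer a finished phone-number pattern; the last candidate is a deliberately malformed string of digits and hyphens.

```python
import re

first_draft = re.compile(r"(\d|-)+")

candidates = [
    "555-5757",
    "800-555-5757",
    "800.555.5757",
    "1-800-555-5757",
    "+1 800 555 5757",
    "5--52---25",  # not a phone number, only digits and hyphens
]

# fullmatch() asks whether the whole string fits the pattern.
accepted = [s for s in candidates if first_draft.fullmatch(s)]
rejected = [s for s in candidates if not first_draft.fullmatch(s)]
```

The draft misses the dotted and space-separated forms and accepts the malformed string, so it is both too strict and too loose.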

Python JSON

import json
print(json.dumps({'4': 5, '6': [7, 8]}))
json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')

Python XML Parser

<data>
<items>
<item name="item1"></item>
</items>
</data>

Python XML Parser

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

Crawling web
Spidering data
Scraping HTML for information

Many HTML parsing libraries:

Beautiful Soup
Nokogiri
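
As a dependency-free stand-in for those libraries, the sketch below uses Python's built-in html.parser to collect link targets from a page, which is the core step in both crawling and scraping. The class name LinkCollector and the sample HTML are made up; Beautiful Soup offers a more forgiving and convenient interface for real pages.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content; a crawler would download this instead.
page = (
    '<html><body><a href="/a.html">A</a> <p>text</p> '
    '<a href="http://example.com/b">B</a></body></html>'
)

collector = LinkCollector()
collector.feed(page)
links = collector.links
```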

Generic UNIX Tools

grep
sort
more
wc
cut
awk
...
