This document provides an overview of different methods for accessing and parsing data, including bulk downloads, APIs, web scraping, and unstructured data. It discusses formats like CSV, JSON, XML and examples of each. It also covers using regular expressions and parsers to extract structured data from unstructured sources.
9. XML
XML views the world as a recursive container:
<container>
  <item>A</item>
  <item>B</item>
  <item>
    <container>
      <item attr="SomePropertyOfC">C’</item>
    </container>
  </item>
</container>
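The recursive-container view maps naturally onto a recursive function. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the sample string is modeled on the container above (with plain "C" as the text), and the helper name walk is a hypothetical choice for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document modeled on the nested container shown above.
XML_TEXT = """
<container>
  <item>A</item>
  <item>B</item>
  <item>
    <container>
      <item attr="SomePropertyOfC">C</item>
    </container>
  </item>
</container>
"""

def walk(element, depth=0):
    """Return (depth, tag, text, attributes) for element and all descendants."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text, dict(element.attrib))]
    for child in element:
        # Each child is itself a container, so recurse one level deeper.
        rows.extend(walk(child, depth + 1))
    return rows

rows = walk(ET.fromstring(XML_TEXT.strip()))
for depth, tag, text, attrs in rows:
    print("  " * depth + tag, text, attrs)
```

Each element is handled the same way regardless of how deeply it is nested, which is the point of the container model.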
10. XML
From Wikipedia XML dump:
<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [[text]] here.</text>
      <minor />
    </revision>
  </page>
</mediawiki>
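As a rough sketch of how a fragment like this could be read, the example below pulls a few fields out of each page with xml.etree.ElementTree. The field selection is an assumption made for illustration. Real dumps typically declare an XML namespace on the root element, in which case the tag names would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# The dump fragment from the slide, as a string.
DUMP = """<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [[text]] here.</text>
      <minor />
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(DUMP)
pages = []
for page in root.iter("page"):
    revision = page.find("revision")
    pages.append({
        "title": page.findtext("title"),
        "timestamp": revision.findtext("timestamp"),
        "username": revision.findtext("contributor/username"),
        # <minor /> is an empty flag element: present means "minor edit".
        "minor": revision.find("minor") is not None,
        "text": revision.findtext("text"),
    })

print(pages[0])
```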
11. Ad Hoc Data Formats
Fixed Width Files
Graph Edgelists
Voting Record Format
Many others...
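Ad hoc formats usually need a small hand-written parser. The sketch below handles two of the formats listed: a fixed-width file and a graph edgelist. The column layout, field names, and sample data are all hypothetical, since the slides do not specify any particular file.

```python
# Hypothetical layout: name in columns 0-9, state in 10-11, votes in 12-17.
FIELDS = [("name", 0, 10), ("state", 10, 12), ("votes", 12, 18)]

def parse_fixed_width(line):
    """Slice one fixed-width line into a dict using the FIELDS layout."""
    record = {name: line[start:end].strip() for name, start, end in FIELDS}
    record["votes"] = int(record["votes"])
    return record

def parse_edgelist(lines):
    """Read 'source target' pairs, skipping blank lines and # comments."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        edges.append((source, target))
    return edges

# Build the sample lines with format() so the column widths are exact.
sample_rows = [("Alice", "CA", 120), ("Bob", "NY", 75)]
fixed_lines = ["{:<10}{:<2}{:>6}".format(*row) for row in sample_rows]

records = [parse_fixed_width(line) for line in fixed_lines]
edges = parse_edgelist(["# source target", "a b", "b c", ""])

print(records)
print(edges)
```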
15. Unstructured or Misstructured Data
Which Wikipedia articles link to each other?
Have Wikipedia dump of raw text
Need to parse XML, find links, extract them
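The three steps above (parse the XML, find links, extract them) can be sketched as follows. The sample dump, the function names, and the link pattern are assumptions made for illustration. The pattern only covers simple [[Target]], [[Target|label]], and [[Target#Section]] links, not the full wikitext syntax.

```python
import re
import xml.etree.ElementTree as ET

# Capture the link target: everything after "[[" up to "]", "|" or "#".
LINK = re.compile(r"\[\[([^\[\]|#]+)")

def extract_links(wikitext):
    """Return the targets of all simple [[...]] links in the text."""
    return [target.strip() for target in LINK.findall(wikitext)]

def link_graph(xml_text):
    """Map each page title to the list of page titles it links to."""
    graph = {}
    for page in ET.fromstring(xml_text).iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        graph[title] = extract_links(text)
    return graph

# Hypothetical two-page dump, much smaller than a real one.
DUMP = """<mediawiki>
  <page><title>Alpha</title><revision><text>See [[Beta]] and [[Gamma|the third page]].</text></revision></page>
  <page><title>Beta</title><revision><text>Back to [[Alpha#History]].</text></revision></page>
</mediawiki>"""

graph = link_graph(DUMP)
print(graph)
```

For a full-size dump, a streaming parser such as ET.iterparse would likely be a better fit than loading the whole file into memory.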
27. Advanced Tools:
Complex Repetition: *, +, ?, {m,n}
Character Classes: [0-9], [a-z]
Special Character Classes: \d, \w
28. Complex Repetition:
a*: 0 or more occurrences of a
a+: 1 or more occurrences of a
a?: 0 or 1 occurrences of a
a{m,n}: At least m and no more than n occurrences of a
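A short sketch of the four repetition operators using Python's re module. The helper full is a hypothetical convenience that checks whether the whole string matches. Note that in Python the bounded form has to be written without a space after the comma, otherwise the braces are treated as literal characters.

```python
import re

def full(pattern, text):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "ba*  vs 'b'":    full(r"ba*", "b"),         # zero a's is allowed
    "ba*  vs 'baaa'": full(r"ba*", "baaa"),      # so are many
    "ba+  vs 'b'":    full(r"ba+", "b"),         # + needs at least one a
    "ba+  vs 'ba'":   full(r"ba+", "ba"),
    "ba?  vs 'ba'":   full(r"ba?", "ba"),        # ? allows zero or one a
    "ba?  vs 'baa'":  full(r"ba?", "baa"),       # but not two
    "a{2,3} vs 'a'":    full(r"a{2,3}", "a"),    # too few
    "a{2,3} vs 'aaa'":  full(r"a{2,3}", "aaa"),  # within bounds
    "a{2,3} vs 'aaaa'": full(r"a{2,3}", "aaaa"), # too many
}

for label, matched in results.items():
    print(label, "->", matched)
```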
29. Character Classes:
[0-9]
[a-z]
[0-9a-zA-Z]
[^0-9] Negate a character class
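To make the classes concrete, here is a small sketch that runs each one over the same string with re.findall. The sample text is made up for illustration.

```python
import re

text = "Room 42b, Floor 3"

digits        = re.findall(r"[0-9]", text)         # single digits
digit_runs    = re.findall(r"[0-9]+", text)        # runs of digits
lowercase     = re.findall(r"[a-z]+", text)        # runs of lowercase letters
alphanumerics = re.findall(r"[0-9a-zA-Z]+", text)  # runs of letters or digits
non_digits    = re.findall(r"[^0-9]+", text)       # runs of anything but digits

print(digits)         # ['4', '2', '3']
print(digit_runs)     # ['42', '3']
print(lowercase)      # ['oom', 'b', 'loor']
print(alphanumerics)  # ['Room', '42b', 'Floor', '3']
print(non_digits)     # ['Room ', 'b, Floor ']
```

The ^ only negates when it is the first character inside the brackets.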
30. Special Character Classes:
\d: Any digit
\D: Any non-digit
\w: Any word character
\s: Any whitespace character
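A brief sketch of the shorthand classes in Python's re module, each written with a leading backslash. The sample string is made up, and raw strings (r"...") are used so that Python does not interpret the backslashes itself.

```python
import re

text = "Call 555-0199 at 9am"

numbers     = re.findall(r"\d+", text)  # runs of digits
words       = re.findall(r"\w+", text)  # runs of word characters
tokens      = re.split(r"\s+", text)    # split on runs of whitespace
digits_only = re.sub(r"\D", "", text)   # delete every non-digit

print(numbers)      # ['555', '0199', '9']
print(words)        # ['Call', '555', '0199', 'at', '9am']
print(tokens)       # ['Call', '555-0199', 'at', '9am']
print(digits_only)  # 55501999
```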
38. Python XML Parser
from xml.dom import minidom

# Parse the file and collect every <item> element
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
# Each <item> is expected to carry a "name" attribute
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)
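The slide does not show the contents of items.xml, so the snippet above cannot be run as is. The sketch below uses a hypothetical in-memory document of the assumed shape (item elements with a name attribute) and minidom.parseString, so the same calls can be tried without a file.

```python
from xml.dom import minidom

# Hypothetical stand-in for items.xml; the real file is not shown in the slides.
ITEMS_XML = """<data>
  <items>
    <item name="item1"></item>
    <item name="item2"></item>
  </items>
</data>"""

xmldoc = minidom.parseString(ITEMS_XML)
itemlist = xmldoc.getElementsByTagName("item")

# getAttribute returns the attribute value as a string.
names = [item.getAttribute("name") for item in itemlist]

print(len(itemlist))
print(names)
```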