SCRAPING FROM THE WEB
An Overview That Does Not Contain Too Much Cussing
Feihong Hsu
ChiPy
February 14, 2013
Organization
Definition of scraper
Common types of scrapers
Components of a scraping system
Pro tips
What I mean when I say scraper
Any program that retrieves structured data from the web, and then transforms it to conform with a different structure.
Wait, isn’t that just ETL? (extract, transform, load)
Well, sort of, but I don’t want to call it that...
Notes
Some people would say that “scraping” only applies to web pages. I would argue that getting data from a CSV or JSON file is qualitatively not all that different. So I lump them all together.
Why not ETL? Because ETL implies that there are rules and expectations, and these two things don’t exist in the world of open government data. They can change the structure of their dataset without telling you, or even take the dataset down on a whim. A program that pulls down government data is often going to be a bit hacky by necessity, so “scraper” seems like a good term for that.
Main types of scrapers
CSV
RSS/Atom
JSON
XML
HTML crawler
Web browser
PDF
Database dump
GIS
Mixed
CSV
import csv
You should usually use csv.DictReader.
If the column names are all caps, consider making them lowercase.
Watch out for CSV datasets that don’t have the same number of elements on each row.
def get_rows(csv_file):
    with open(csv_file, newline='') as f:
        reader = csv.reader(f)
        # Get the column names, lowercased.
        column_names = tuple(k.lower() for k in next(reader))
        for row in reader:
            yield dict(zip(column_names, row))
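The same idea can be written with csv.DictReader, as the previous slide recommends. A minimal sketch; the sample data and function name are made up for illustration:

```python
import csv
import io

def get_rows_dictreader(fileobj):
    """Yield one dict per row, with lowercased column names."""
    reader = csv.DictReader(fileobj)
    # Reading .fieldnames consumes the header line; then we
    # replace it with a lowercased copy before iterating rows.
    reader.fieldnames = [name.lower() for name in reader.fieldnames]
    for row in reader:
        yield row

# An in-memory file standing in for a real CSV download.
data = io.StringIO("NAME,AGE\nAlice,30\nBob,25\n")
rows = list(get_rows_dictreader(data))
```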
XML
import lxml.etree

def get_items(xml_string):
    tree = lxml.etree.fromstring(xml_string)
    for el in tree.findall('items/item'):
        children = list(el)
        # Keys are element names.
        keys = (c.tag for c in children)
        # Values are element text contents.
        values = (c.text for c in children)
        yield dict(zip(keys, values))
Notes
Please don’t use urllib2.
If you do use html5lib for parsing, remember that you can do so from within lxml itself. http://lxml.de/html5parser.html
Web browser
If you need a real browser to scrape the data, it’s often not worth it.
But there are tools out there.
I wrote PunkyBrowster, but I can’t really recommend it over ghost.py. It seems to have a better API, supports PySide and Qt, and has a more permissive license (MIT).
PDF
Not as hard as it looks.
There are no Python libraries that handle all kinds of PDF documents in the wild.
Use the pdftohtml command to convert the PDF to XML.
When debugging, use pdftohtml to generate HTML that you can inspect in the browser.
If the text in the PDF is in tabular format, you can group text cells by proximity.
Notes
The “group by proximity” strategy works like this:
1. Find a text cell that has a very distinct pattern (probably a date cell). This is your “anchor”.
2. Find all cells that have the same row position as the anchor (possibly off by a few pixels).
3. Figure out which grouped cells belong to which fields based on column position.
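The three steps above can be sketched in a few lines. The cell dicts below are a hypothetical shape loosely modeled on the top/left attributes that pdftohtml’s XML output carries; the date pattern and tolerance are assumptions you’d tune per dataset:

```python
import re

# Hypothetical cells: {'top': int, 'left': int, 'text': str}
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}')

def group_rows(cells, tolerance=3):
    """Group text cells into rows, anchored on date-like cells."""
    # Step 1: date cells are the anchors.
    anchors = [c for c in cells if DATE_RE.search(c['text'])]
    rows = []
    for anchor in anchors:
        # Step 2: cells at (roughly) the same vertical position.
        row = [c for c in cells if abs(c['top'] - anchor['top']) <= tolerance]
        # Step 3: column position determines field order.
        row.sort(key=lambda c: c['left'])
        rows.append([c['text'] for c in row])
    return rows

cells = [
    {'top': 100, 'left': 10, 'text': '01/02/2013'},
    {'top': 101, 'left': 120, 'text': 'Pothole repair'},
    {'top': 150, 'left': 10, 'text': '01/05/2013'},
    {'top': 149, 'left': 120, 'text': 'Tree trim'},
]
rows = group_rows(cells)
```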
RSS/Atom
import feedparser
Sometimes feedparser can’t handle custom fields, and you’ll have to fall back to lxml.etree.
Unfortunately, plenty of RSS feeds are not compliant XML. Either do some custom munging or try html5lib.
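A sketch of the fallback for custom (namespaced) fields. The slide suggests lxml.etree; the stdlib xml.etree.ElementTree used here has a very similar find/findall API and keeps the example self-contained. The feed and the myagency namespace are invented for illustration:

```python
import xml.etree.ElementTree as etree

FEED = """<rss><channel><item>
  <title>Road closure</title>
  <myagency:severity xmlns:myagency="http://example.com/ns">high</myagency:severity>
</item></channel></rss>"""

def get_entries(feed_text):
    tree = etree.fromstring(feed_text)
    for item in tree.findall('channel/item'):
        # Namespaced custom fields appear in Clark notation: {uri}localname.
        severity = item.find('{http://example.com/ns}severity')
        yield {
            'title': item.findtext('title'),
            'severity': severity.text if severity is not None else None,
        }

entries = list(get_entries(FEED))
```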
Database dump
If it’s a Microsoft Access file, use mdbtools to dump the data.
Sometimes it’s a ZIP file containing CSV files, each of which corresponds to a separate table dump.
Just load it all into a SQLite database and run queries on it.
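Loading those CSV table dumps into SQLite is a few lines with the stdlib; then you get real joins instead of simulating them by hand. A sketch, with made-up table and column names, and no quoting of identifiers (fine for a one-off load of trusted dumps):

```python
import csv
import io
import sqlite3

def load_csv(conn, table, fileobj):
    """Create a table from a CSV dump and bulk-insert its rows."""
    reader = csv.reader(fileobj)
    columns = next(reader)
    conn.execute('CREATE TABLE %s (%s)' % (table, ', '.join(columns)))
    placeholders = ', '.join('?' for _ in columns)
    conn.executemany(
        'INSERT INTO %s VALUES (%s)' % (table, placeholders), reader)

conn = sqlite3.connect(':memory:')
load_csv(conn, 'permits', io.StringIO('id,street\n1,Main St\n2,Oak Ave\n'))
load_csv(conn, 'owners', io.StringIO('permit_id,name\n1,Alice\n2,Bob\n'))
# A real join across two table dumps.
rows = conn.execute(
    'SELECT p.street, o.name FROM permits p '
    'JOIN owners o ON o.permit_id = p.id').fetchall()
```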
Notes
We wrote code that simulated joins using lists of dictionaries. This was painful to write and not so much fun to read. Don’t do this.
GIS
I haven’t worked much with KML or SHP files.
If an organization provides GIS files for download, they usually offer other options as well. Look for those instead.
Mixed
This is very common.
For example: an organization offers a CSV download, but you have to scrape their web page to find the link for it.
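Finding that download link is a tiny scrape of its own. In practice you’d likely use lxml with a CSS or XPath expression, as elsewhere in this talk; the stdlib html.parser version below keeps the sketch self-contained. The page markup is invented:

```python
from html.parser import HTMLParser

class CSVLinkFinder(HTMLParser):
    """Collect hrefs that point at CSV downloads."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.endswith('.csv'):
                self.links.append(href)

# A stand-in for the downloaded page; really you'd fetch it first.
page = ('<html><body><a href="/about">About</a>'
        '<a href="/downloads/permits_2013.csv">Download CSV</a>'
        '</body></html>')
finder = CSVLinkFinder()
finder.feed(page)
```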
Components of a scraping system
Downloader
Cacher
Raw item retriever
Existing item detector
Item transformer
Status reporter
Notes
Caching is essential when scraping a dataset that involves a large number of HTML pages. Test runs can take hours if you’re making requests over the network. A good caching system pretty-prints the files it downloads so you can more easily inspect them.
Reporting is essential if you’re managing a group of scrapers. Since you KNOW that at least one of your scrapers will be broken at any time, you might as well know which ones are broken. A good reporting mechanism shows when your scrapers break, as well as when the dataset itself has issues (determined heuristically).
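The downloader/cacher pair might look something like this. A minimal sketch: the class name is made up, the fetch function is injected so the cache logic stays independent of any particular HTTP client, and pretty-printing is left out:

```python
import hashlib
import os
import tempfile

class CachingDownloader:
    """Cache downloaded pages on disk, keyed by a hash of the URL,
    so repeated test runs skip the network entirely."""
    def __init__(self, fetch, cache_dir):
        self.fetch = fetch          # callable: url -> text
        self.cache_dir = cache_dir

    def get(self, url):
        key = hashlib.sha1(url.encode()).hexdigest()
        path = os.path.join(self.cache_dir, key)
        if os.path.exists(path):
            with open(path) as f:
                return f.read()     # cache hit: no network request
        text = self.fetch(url)
        with open(path, 'w') as f:
            f.write(text)
        return text

# Fake fetch function standing in for a real HTTP client.
calls = []
def fetch(url):
    calls.append(url)
    return '<html>hello</html>'

dl = CachingDownloader(fetch, tempfile.mkdtemp())
first = dl.get('http://example.com/page')
second = dl.get('http://example.com/page')  # served from cache
```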
Steps to writing a scraper
Find the data source
Find the metadata
Analysis (verify the primary key)
Develop
Test
Fix (repeat ∞ times)
Notes
The Analysis step should also include noting which fields should be lookup fields (see design pattern slide).
The Testing step is always done on real data and has three phases: dry run (nothing added or updated), dry run with lookups (only lookups are added), and production run. I run all three phases on my local instance before deploying to production.
A very useful tool for HTML scraping
Firefinder (http://bit.ly/kr0UOY)
Extension for Firebug
Allows you to test CSS and XPath expressions on any page, and visually inspect the results.
Storing scraped data
Don’t create tables before you understand how you want to use the data.
Consider using ZODB (or another nonrelational DB).
Adrian Holovaty’s talk on how EveryBlock avoided creating new tables for each dataset: http://bit.ly/Yl6VAZ (relevant part starts at 7:10)
Design patterns
If a field contains a finite number of possible values, use a lookup table instead of storing each value.
Make a scraper superclass that incorporates common scraper logic.
Notes
The scraper superclass will probably have convenience methods for converting dates/times, cleaning HTML, looking for existing items, etc. It should also incorporate the caching and reporting logic.
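A sketch of both design patterns together: a superclass with a date-conversion helper, and the lookup-table pattern for a field with a finite set of values. All class, field, and table names here are invented, and a plain dict stands in for a real lookup table in the database:

```python
import datetime

class BaseScraper:
    """Common scraper logic shared by all dataset-specific scrapers."""
    date_format = '%m/%d/%Y'

    def parse_date(self, text):
        return datetime.datetime.strptime(
            text.strip(), self.date_format).date()

    def lookup(self, table, value):
        # Lookup-table pattern: store each distinct value once and
        # reuse its id. `table` is a dict standing in for a DB table.
        if value not in table:
            table[value] = len(table) + 1
        return table[value]

class PermitScraper(BaseScraper):
    statuses = {}  # lookup table for the finite set of status values

    def transform(self, raw):
        return {
            'issued': self.parse_date(raw['ISSUED']),
            'status_id': self.lookup(self.statuses, raw['STATUS']),
        }

scraper = PermitScraper()
item = scraper.transform({'ISSUED': '02/14/2013', 'STATUS': 'OPEN'})
```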
Working with government data
Some data sources are only available at certain times of day.
Be careful about rate limiting and IP blocking.
Data scraped from a web page shouldn’t be used for analyzing trends.
When you’re stuck, give them a phone call.
Notes
If you do manage to find an actual person who will talk to you, keep a record of their contact information and do NOT lose it! They are your first line of defense when a dataset you rely on goes down.
Pro tips
When you don’t know what encoding the content is in, use charade, not chardet.
Remember to clean any HTML you intend to display.
If the dataset doesn’t allow filtering by date, it’s a lost cause (unless you just care about historical data).
When your scraper fails, do NOT fix it. If a user complains, consider fixing it.