Scraping the Web

Presentation to the Open Government Hackathon at RubyConf 2010 on November 12, 2010 in New Orleans. Updated on 2010/11/15.

Upload Details

Uploaded as Apple Keynote

Usage Rights

© All Rights Reserved

Comments

  • Also, ScraperWiki is a different approach. Worth considering.
  • Note to self for slide 16: I have enjoyed the Ruby curb library (an interface to curl).
  • Note to self: I want to mention character sets and encodings. Ruby 1.9 makes this relatively easy.

Speaker Notes

  • Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  • Wait. I’ve got this all wrong. I need to rebrand scraping!
  • DRY = Don’t Repeat Yourself
  • See the “Politeness policy” section on http://en.wikipedia.org/wiki/Web_crawler and http://en.wikipedia.org/wiki/User_agent#User_agent_identification
  • Splitting the interface into three parts aids development, because you can run any part in isolation. It typically results in a cleaner, decoupled software design. (See the Rakefile sketch under Suggestion #6 below.)
  • For example: if the number of imported documents decreases by 10%, it probably makes sense to alert someone.
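
A minimal sketch of such a check, in Ruby; the counts file, the 10% threshold, and the records variable (this run's imported documents) are illustrative assumptions, not part of the original notes.

    # Warn when this run imported far fewer documents than the last run.
    current  = records.size
    previous = File.exist?('counts.txt') ? File.read('counts.txt').to_i : nil
    if previous && current < previous * 0.9
      warn "Imported #{current} documents, down from #{previous}; alert someone."
    end
    File.open('counts.txt', 'w') { |f| f.write(current.to_s) }
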
  • It is helpful to avoid false positives when diffing files. In YAML, for example, hashes are unordered and may be serialized in various orders, so the same data structure can produce different text: a false positive. One workaround is sketched below.
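
One workaround, sketched in Ruby: recursively rebuild hashes with sorted keys before dumping, so identical data always serializes to identical text. This relies on Ruby 1.9+ hashes preserving insertion order; the output path is a placeholder.

    require 'yaml'

    # Rebuild hashes with keys in sorted order for deterministic YAML.
    def canonical(obj)
      case obj
      when Hash
        obj.keys.sort_by(&:to_s).each_with_object({}) { |k, h| h[k] = canonical(obj[k]) }
      when Array
        obj.map { |e| canonical(e) }
      else
        obj
      end
    end

    File.open('data/records.yml', 'w') { |f| f.write(canonical(records).to_yaml) }
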

Scraping the Web: Presentation Transcript

  • Scraping the Web David James Sunlight Labs http://twitter.com/djsunlight http://github.com/djsun
  • Definition • Scraping: converting unstructured documents into structured information
  • Why Scrape? • The information is useful • The information might be useful... • if people know about it • if people looked at it • if people built apps on top of it
  • SCRAPING HAS AN IMAGE PROBLEM
  • Warning • Scraping is often maligned because it supposedly offends the sensibilities of programmers who appreciate: • elegance • efficiency • DRYness • eye health...
  • CAN WE RETHINK REBRAND REDEFINE SCRAPING?
  • Scraping Reality • Yes, low-level scraping is rarely fun or glorious. • But if you think of it at a higher level, scraping can be a fun challenge! • Real-world systems (like scraping) are often messy and require creative solutions.
  • FREEDOM SCRAPING David James Sunlight Labs http://twitter.com/djsunlight http://github.com/djsun
  • Definition • Freedom Scraping: converting unstructured documents into structured information about bald eagles, flags, and apple pie.
  • Definition • Freedom Scraping: converting unstructured documents into structured information about your city, local issues, schools, elected officials, taxes, and your favorite issue.
  • Why Freedom Scrape? • The information is useful • The information might be useful... • The unstructured data want to be liberated! Free the DOM!
  • TOOLS for FREEDOM SCRAPING
  • Scraping Sequence 1. Fetch 2. Process 3. Store
  • Scraping Sequence 1. Fetch (HTTP, FTP, etc.) 2. Process (text, HTML, XML, PDF) 3. Store (file, database, API)
  • 1. Fetch
      • net/http: http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html
      • httpclient: http://github.com/nahi/httpclient
      • open-uri: http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/
      • em-http-request: http://github.com/igrigorik/em-http-request
      • httparty: http://httparty.rubyforge.org/
      • mechanize: http://mechanize.rubyforge.org
      • more: http://ruby-toolbox.com/categories/http_clients.html
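
For instance, a minimal fetch using open-uri from the standard library; the URL and cache path are placeholders:

    require 'open-uri'
    require 'fileutils'

    # Fetch a page and keep the raw HTML on disk for later processing.
    url  = 'http://example.gov/reports.html'
    html = URI.parse(url).read  # open-uri adds #read to URI objects

    FileUtils.mkdir_p('cache')
    File.open('cache/reports.html', 'w') { |f| f.write(html) }
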
  • 2. Process
      • Regexp: http://www.ruby-doc.org/core/classes/Regexp.html
      • REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/
      • Treetop: http://treetop.rubyforge.org/
      • Citrus: https://github.com/mjijackson/citrus
      • libxml-ruby: http://libxml.rubyforge.org/
      • Hpricot: http://hpricot.com/
      • Nokogiri: http://nokogiri.org
      • scrAPI: https://github.com/assaf/scrapi
      • scRUBYt: http://github.com/scrubber/scrubyt
      • Ariel: http://ariel.rubyforge.org/
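
As an example, a small Nokogiri sketch that turns the cached HTML into structured records; the CSS selectors and field names are hypothetical and must match the real markup:

    require 'nokogiri'

    # Parse the cached HTML and extract rows from a (hypothetical) table.
    doc = Nokogiri::HTML(File.read('cache/reports.html'))

    records = doc.css('table#reports tr').map { |row|
      cells = row.css('td').map { |td| td.text.strip }
      { :title => cells[0], :date => cells[1] } unless cells.empty?
    }.compact  # header rows have no <td> cells and are dropped
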
  • 3. Store • File (XML, YAML, ...) • Database (relational, non-relational, ...) • API (various)
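
The file option can be as simple as dumping the records to YAML (a sketch; see the speaker note above about keeping the serialization deterministic for diffs):

    require 'yaml'
    require 'fileutils'

    # Store the processed records as a YAML file, one file per run.
    FileUtils.mkdir_p('data')
    File.open('data/reports.yml', 'w') { |f| f.write(records.to_yaml) }
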
  • Challenges (sometimes called constraints)
  • Challenge #1 • External sites can change without warning • Figuring out scraping frequency is messy • Changes can easily break scrapers without warning
  • Challenge #2 • Bad HTTP status codes • (e.g. using 200 OK to signal an error) • You cannot always trust your HTTP library’s default behavior
  • Challenge #3 • Messy HTML markup • In the scraping world, template-generated HTML is nirvana • Human-generated HTML forever burns in one of Dante’s Circles of Hell
  • Challenge #4 (a special case of challenge #3) • Lack of unique identifiers • Figuring out if an entity was created or updated is messy • Figuring out associations between entities is messy
  • 12 Suggestions • I have just enough to make a calendar.
  • Suggestion #1 Scraping Suggestion for January • Don’t scrape unless you have to! • Has someone already written a scraper? • Is there machine readable data available?
  • Suggestion #2 Scraping Suggestion for February • Test! Test! Test! • Download a few sample pages • Mock out HTTP calls • Try FakeWeb or WebMock • Test scraping against the sample pages
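
A sketch of the FakeWeb approach; Scraper.run is a hypothetical entry point standing in for your scraper's real code:

    require 'test/unit'
    require 'fakeweb'

    class ScraperTest < Test::Unit::TestCase
      def test_parses_sample_page
        # Serve a downloaded sample page instead of hitting the live site.
        FakeWeb.register_uri(:get, 'http://example.gov/reports.html',
                             :body => File.read('test/fixtures/reports.html'))
        records = Scraper.run('http://example.gov/reports.html')  # hypothetical
        assert !records.empty?
      end
    end
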
  • Suggestion #3 Scraping Suggestion for March • Cache HTTP files • Speed up development • Keeping a history helps production debugging • Storing cache in revision control can be very helpful for diffs
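
One way to wire up such a cache, as a sketch; fetch_cached is a hypothetical helper, and hashing the URL for the filename is just one convention:

    require 'open-uri'
    require 'fileutils'
    require 'digest/md5'

    # Fetch through a local file cache so repeated development runs
    # do not re-download the same page.
    def fetch_cached(url, cache_dir = 'cache')
      FileUtils.mkdir_p(cache_dir)
      path = File.join(cache_dir, Digest::MD5.hexdigest(url))
      return File.read(path) if File.exist?(path)
      html = URI.parse(url).read
      File.open(path, 'w') { |f| f.write(html) }
      html
    end
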
  • Suggestion #4 Scraping Suggestion for April • Scrape politely • Scrape infrequently • Identify your scraper • Set the user-agent HTTP header
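
With open-uri, header fields can be passed as options to read, so identifying your scraper looks roughly like this (the agent string, contact URL, and delay are placeholders):

    require 'open-uri'

    # Identify the scraper via the User-Agent header and throttle requests.
    html = URI.parse('http://example.gov/reports.html').read(
      'User-Agent' => 'MyScraper/1.0 (+http://example.org/contact)'
    )
    sleep 2  # be polite: pause between fetches
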
  • Suggestion #5 Scraping Suggestion for May • Contact the organization or person behind the scraped site • Ask how to minimize your impact • You might find or gain access to the underlying raw information
  • Suggestion #6 Scraping Suggestion for June • Use a standard interface for your scrapers • Essential for automation (see next slide) • I recommend this interface: rake fetch rake process rake store
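
A minimal Rakefile implementing that interface; Scraper.fetch, Scraper.process, and Scraper.store are hypothetical methods standing in for your scraper's real stages:

    # Rakefile
    require_relative 'scraper'  # hypothetical file defining Scraper

    task :fetch   do Scraper.fetch   end
    task :process do Scraper.process end
    task :store   do Scraper.store   end

    desc 'Run the full scraping sequence'
    task :scrape => [:fetch, :process, :store]
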
  • Suggestion #7 Scraping Suggestion for July • Use a system to manage your scrapers • Key features: scheduling, logging, and notifications • We ❤ continuous integration tools • e.g. Hudson, Integrity, CI Joe, CruiseControl
  • Suggestion #8 Scraping Suggestion for August • Scrape Scrupulously • scrupulous: diligent, thorough, and extremely attentive to details • Scrapers are a primary consumer in your data ecosystem • All your data processing depends on them • You want your scrapers to notice when they see things they don’t understand
  • Suggestion #9 Scraping Suggestion for September • Scrape sanely: pre-processing checks • Do not automatically trust HTTP status codes • Does the shape of the document make sense?
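
For example, a hypothetical pre-processing guard that refuses to parse a response that does not look like the expected page; the size threshold and marker string are assumptions:

    # Fail loudly before parsing anything suspicious. `response` is a
    # Net::HTTP response, whose #code is a string like '200'.
    def check_response!(response, body)
      raise "Unexpected status: #{response.code}" unless response.code == '200'
      raise 'Suspiciously small body' if body.length < 500
      raise 'Expected marker missing' unless body.include?('id="reports"')
    end
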
  • Suggestion #10 Scraping Suggestion for October • Scrape sanely: pre-storing checks • Remove duplicate items • Check expected data types, ranges, string lengths, and so on • Watch aggregate statistics for useful patterns and react appropriately
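
A sketch of such checks; the field names, length limit, and date format are illustrative:

    # Drop duplicates and validate records before storing them.
    seen  = {}
    clean = records.select do |r|
      key = [r[:title], r[:date]]
      next false if seen[key]  # remove duplicate items
      seen[key] = true
      r[:title].to_s.length.between?(1, 200) &&   # expected string length
        r[:date].to_s =~ /\A\d{4}-\d{2}-\d{2}\z/  # expected date format
    end
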
  • Suggestion #11 Scraping Suggestion for November • Take logging seriously • Keep a record of scraping to allow for analytics and debugging • Harness the power of diffs • But avoid false positives*
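
Even the standard library's Logger goes a long way; a sketch, reusing variable names from the earlier sketches:

    require 'logger'
    require 'fileutils'

    FileUtils.mkdir_p('log')
    log = Logger.new('log/scraper.log')  # a persistent record of each run
    log.info "fetched #{url} (#{html.length} bytes)"
    log.info "stored #{clean.size} of #{records.size} records"
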
  • Suggestion #12 Scraping Suggestion for December • Iterate Often • Pay attention to your scrapers • Adjust based on your needs • Extract useful code and share it
  • Open Ended Questions Things you might not have thought about...
  • Automation Open Ended Question #1 • Is a fully automated scraper a “good” answer for what you are doing? • Why not half-automated?
  • Natural Intelligence Open Ended Question #2 • When is human intelligence useful and when should you leverage it? • in-house / editing and curation • free / crowdsourcing • outsourced / Amazon Mechanical Turk
  • Sanity Checks Open Ended Question #3 • How far can you go with sanity checking? • What can statistics about document changes tell you?
  • Learn Open Ended Question #4 • Read more about these nerdy topics: • information extraction • information retrieval • machine learning
  • The Ecosystem Open Ended Question #5 • Build tools to make scraping easier! • Programming is often more fun at higher levels • Cut out the boilerplate • Build tools that help non-programmers scrape pages
  • The End David James Sunlight Labs http://twitter.com/djsunlight http://github.com/djsun