Scraping the Web

Presentation to the Open Government Hackathon at RubyConf 2010 on November 12, 2010 in New Orleans. Updated on 2010/11/15.

Published in: Technology
3 Comments
  • Also, ScraperWiki is a different approach. Worth considering.
  • Note to self for slide 16: I have enjoyed the Ruby curb library (an interface to curl).
  • Note to self: I want to mention character sets and encodings. Ruby 1.9 makes this relatively easy.
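
On the comment about character sets and encodings: since Ruby 1.9, every String carries an encoding, so normalizing scraped text to UTF-8 takes two steps. The sketch below is only an illustration; the helper name `to_utf8` and the sample bytes are assumptions, not something from the slides.

```ruby
# Hypothetical helper: convert scraped bytes to UTF-8, given the charset
# the server declared (for example in the Content-Type header).
def to_utf8(bytes, declared_charset)
  # force_encoding only relabels the bytes; it does not change them.
  labeled = bytes.dup.force_encoding(declared_charset)
  # encode transcodes the labeled string into UTF-8.
  labeled.encode("UTF-8")
end

# "\xE9" is the single byte for "é" in ISO-8859-1 (an assumed sample input).
page_text = to_utf8("caf\xE9", "ISO-8859-1")
```

The important distinction is that `force_encoding` fixes a wrong label, while `encode` actually rewrites the bytes. Mixing the two up is a common source of garbled text.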

  • Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


  • Wait. I’ve got this all wrong. I need to rebrand scraping!
  • DRY = Don't Repeat Yourself

  • See the “Politeness policy” section of http://en.wikipedia.org/wiki/Web_crawler

    http://en.wikipedia.org/wiki/User_agent#User_agent_identification
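
The two ideas referenced above, identifying your scraper and pacing its requests, can be sketched with Ruby's standard `net/http`. This is a minimal sketch, not the presenter's code: the user-agent string, the two-second delay, and the helper names `polite_request` and `fetch_all` are all assumptions chosen for illustration.

```ruby
require "net/http"
require "uri"

# Hypothetical identity; substitute your own project name and contact URL.
USER_AGENT = "ExampleScraper/0.1 (+http://example.com/about-this-bot)"
# Assumed pause between requests; the right value depends on the site.
DELAY_SECONDS = 2

# Build a GET request that tells the server who is asking.
def polite_request(url)
  request = Net::HTTP::Get.new(URI.parse(url).request_uri)
  request["User-Agent"] = USER_AGENT
  request
end

# Fetch each URL in turn, sleeping after every request so the server
# never sees a burst of traffic from this scraper.
def fetch_all(urls, delay = DELAY_SECONDS)
  urls.map do |url|
    uri = URI.parse(url)
    use_ssl = (uri.scheme == "https")
    response = Net::HTTP.start(uri.host, uri.port, :use_ssl => use_ssl) do |http|
      http.request(polite_request(url))
    end
    sleep delay
    response.body
  end
end
```

A descriptive user agent with a contact URL gives site operators a way to reach you before they resort to blocking you.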

  • Splitting the interface into three parts aids in development, because you can run any part in isolation. It will typically result in a cleaner, decoupled software design.
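
One way to picture this split is sketched below. The note does not name the three parts, so the stage names `fetch`, `parse`, and `store` are assumptions, and `fetch` is stubbed with canned HTML so the sketch runs offline. The regular expression is only a stand-in; a real scraper would use a proper HTML parser.

```ruby
require "json"
require "stringio"

# Assumed stage 1: fetch. Returns raw HTML for a URL.
# Stubbed with fixed content here; a real version would make an HTTP request.
def fetch(_url)
  "<html><body><h1>Agenda</h1><h1>Minutes</h1></body></html>"
end

# Assumed stage 2: parse. Turns raw HTML into plain Ruby data.
# It takes a string, so it can be tested against a saved file with no network.
def parse(html)
  html.scan(%r{<h1>(.*?)</h1>}).flatten
end

# Assumed stage 3: store. Writes the parsed records to any IO object,
# so it can be exercised with a StringIO instead of a real destination.
def store(records, io)
  io.puts(JSON.generate(records))
end
```

Because each stage only exchanges plain data with the next, the full pipeline is `store(parse(fetch(url)), $stdout)`, while any single stage can be rerun alone when it fails.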



  • For example: if the number of imported documents decreases by 10%, it probably makes sense to alert someone.
  • It is helpful to avoid false positives when diffing files. In YAML, for example, mappings (hashes) are unordered, so the same data structure may be serialized with its keys in different orders. A diff then reports a change even though the data is identical (a false positive).
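
Both notes above can be sketched in a few lines of Ruby. This is a minimal sketch under stated assumptions: the helper names `significant_drop?`, `canonicalize`, and `canonical_yaml` are hypothetical, the 10% default simply mirrors the example in the note, and sorting keys before dumping is one possible way to get stable output, not necessarily what the presenter used.

```ruby
require "yaml"

# True when `current` has fallen by at least `threshold` (a fraction)
# relative to `previous`. What "alerting someone" means is left to the caller.
def significant_drop?(previous, current, threshold = 0.10)
  return false if previous.zero?
  (previous - current).to_f / previous >= threshold
end

# Recursively rebuild hashes with their keys in sorted order, so that
# equal data always serializes to identical text.
def canonicalize(value)
  case value
  when Hash
    sorted = value.sort_by { |key, _| key.to_s }
    Hash[sorted.map { |key, inner| [key, canonicalize(inner)] }]
  when Array
    value.map { |inner| canonicalize(inner) }
  else
    value
  end
end

# Dump data as YAML in a form that is safe to diff between runs.
def canonical_yaml(data)
  YAML.dump(canonicalize(data))
end
```

Writing every snapshot through `canonical_yaml` means a textual diff between two runs only shows lines where the data itself changed.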






