Don't scrape, Glean!

Lacks the demo part, alas, but it's the slides I used

Published in: Technology

  1. Don’t Scrape, Glean. Tom Morris
  2. Scraping sucks.
  3. def lastlogin
       date = (@hmodel / "//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
       return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
     end
  4. Hpricot for ‘Last login’ date on MySpace.
  5. 5. try : lastlogin = self.soup.findAll( True , { &quot;width&quot; : &quot;193&quot; })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r &quot; [0-9] / [0-9] +/ [0-9]* &quot;) loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None : self.lastlogin = loginregex_inst.group() except : pass pass pass pass pass pass pass pass
  6. Taken from a Python/BeautifulSoup library.
  7. (The Ruby is prettier, but who’s counting?)
  8. getElementsByClassName("foo")[0].children
  9. It’s an edge case. MySpace’s HTML is worse than average.
  10. But it is an ugly recipe for mental turmoil.
  11. The alternative?
  12. flickr.getPhotos()
  13. And you get back nice XML or JSON (or even SOAP!)
  14. But ‘D.R.Y.’! APIs break that principle.
  15. This is the data equivalent of the ‘accessible version’.
  16. Enter GRDDL.
  17. GRDDL defines a transformation process for XHTML » RDF.
  18. XHTML? That’s what the spec says.
  19. HTML 4 works too. Tidy!
  20. RDF? Yes. Trust me. It’s not evil.
  21. GRDDL can work like a data stylesheet on top of your HTML.
  22. You simply use HTML (or XML) in the normal way...
  23. ...and define how the data is transformed.
  24. You can even use it as a bridge for existing APIs and services.
  25. It could even be used for formats other than RDF. Atom?
  26. Simple example: ‘Not Safe For Work’
  27. <a href="http://tubgirl.com" class="nsfw">
  28. I can write that. I can’t write xFolk by hand.
  29. Is ‘nsfw’ a good class name? No.
  30. Do I care? No.
  31. The data layer becomes separated like CSS is from HTML.
  32. That’s the theory. Now for the demo.
  33. irc.freenode.net #swig #swhack
  34. getsemantic.com [email_address]
  35. [email_address] http://tommorris.org
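
Since the demo is missing, here is a rough stand-in for the ‘nsfw’ example on slides 26 to 31. It is only a sketch of the gleaning idea, not GRDDL itself: a real GRDDL setup would point at an XSLT transformation from the page (a profile on the head plus a link with rel="transformation"), whereas this uses nothing but Python's standard library to pull out every link with class "nsfw" and emit one N-Triples statement per link. The NsfwGleaner class, the glean function and the example.org vocabulary URI are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical vocabulary URI, invented for this sketch; GRDDL does not define it.
NSFW_CLASS = "http://example.org/vocab#NotSafeForWork"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class NsfwGleaner(HTMLParser):
    """Collects the href of every <a> whose class list contains 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Valueless attributes come through as None, hence the "or".
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and "nsfw" in classes:
            self.links.append(href.strip())


def glean(html):
    """Return N-Triples lines, one per link marked up with class="nsfw"."""
    parser = NsfwGleaner()
    parser.feed(html)
    return [f"<{href}> <{RDF_TYPE}> <{NSFW_CLASS}> ." for href in parser.links]


page = """
<p>
  <a href="http://example.com/safe">fine</a>
  <a href="http://example.com/risky" class="external nsfw">careful</a>
</p>
"""
triples = glean(page)
for line in triples:
    print(line)
```

The author still only writes class="nsfw" by hand; everything RDF-shaped lives in the transformation, which is the separation slide 31 is getting at.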
