Don't scrape, Glean!

Lacks the demo part, alas, but it's the slides I used

Transcript

  • 1. Don’t Scrape, Glean
    • Tom Morris
  • 2. Scraping sucks.
  • 3. def lastlogin
         # Take the tenth <br />-separated chunk of the cell and keep its last ten characters
         date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
         return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
       end
  • 4. Hpricot for ‘Last login’ date on MySpace.
  • 5. try : lastlogin = self.soup.findAll( True , { &quot;width&quot; : &quot;193&quot; })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r &quot; [0-9] / [0-9] +/ [0-9]* &quot;) loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None : self.lastlogin = loginregex_inst.group() except : pass pass pass pass pass pass pass pass
  • 6. Taken from a Python/BeautifulSoup library.
  • 7. (The Ruby is prettier, but who’s counting?)
  • 8. getElementsByClassName("foo")[0].children
  • 9. It’s an edge case. MySpace’s HTML is worse than average.
  • 10. But it is an ugly recipe for mental turmoil.
  • 11. The alternative?
  • 12. flickr.getPhotos()
  • 13. And you get back nice XML or JSON (or even SOAP!)
  • 14. But ‘D.R.Y.’! APIs break that principle.
  • 15. This is the data equivalent of the ‘accessible version’.
  • 16. Enter GRDDL.
  • 17. GRDDL defines a transformation process for XHTML » RDF.
  • 18. XHTML? That’s what the spec says.
  • 19. HTML 4 works too. Tidy!
  • 20. RDF? Yes. Trust me. It’s not evil.
  • 21. GRDDL can work like a data stylesheet on top of your HTML.
  • 22. You simply use HTML (or XML) in the normal way...
  • 23. ...and define how the data is transformed.
  • 24. You can even use it as a bridge for existing APIs and services.
  • 25. Could even be used for formats other than RDF. Atom?
  • 26. Simple example: ‘Not Safe For Work’
  • 27. <a href="http://tubgirl.com" class="nsfw">
  • 28. I can write that. I can’t write xFolk by hand.
  • 29. Is ‘nsfw’ a good class name? No.
  • 30. Do I care? No.
  • 31. The data layer becomes separated like CSS is from HTML.
  • 32. That’s the theory. Now for the demo.
  • 33. irc.freenode.net #swig #swhack
  • 34. getsemantic.com [email_address]
  • 35. [email_address] http://tommorris.org
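
The ‘Not Safe For Work’ example on slides 26 to 31 can be sketched roughly as follows. This is not the demo from the talk, and it is not GRDDL proper: a real GRDDL setup would point the document at a transformation (typically XSLT) rather than hard-code the rule in Python. The sketch only shows the idea of gleaning data from a plain class name. The predicate URI, the `NsfwGleaner` class and the `glean_nsfw` function are all hypothetical names invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical predicate URI, invented for this sketch; it is not from the slides.
NSFW_PREDICATE = "http://example.org/vocab#nsfwLink"


class NsfwGleaner(HTMLParser):
    """Collect the href of every <a> whose class list contains 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names, or be absent.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "nsfw" in classes and href:
            self.links.append(href.strip())


def glean_nsfw(page_uri, html):
    """Return N-Triples-style lines naming the links that page_uri marks as nsfw."""
    gleaner = NsfwGleaner()
    gleaner.feed(html)
    gleaner.close()
    return [f"<{page_uri}> <{NSFW_PREDICATE}> <{href}> ." for href in gleaner.links]


if __name__ == "__main__":
    sample = (
        '<p><a href="http://example.com/a" class="nsfw">A</a> '
        '<a href="http://example.com/b">B</a></p>'
    )
    for triple in glean_nsfw("http://example.org/page", sample):
        print(triple)
```

The author keeps writing ordinary HTML, and the rule for turning it into data lives in one separate place, much as CSS sits apart from the markup it styles.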