Slideshow transcript
Slide 1: Don’t Scrape, Glean. Tom Morris
Slide 2: Scraping sucks.
Slide 3:
def lastlogin
  date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
Slide 4: Hpricot for ‘Last login’ date on MySpace.
Slide 5: try: lastlogin = self.soup.findAll(True, {"width": "193"})[0].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibl ing.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSi bling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.stri ng loginregex = re.compile( r"[0-9]/[0-9]+/[0-9]*") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None: self.lastlogin = loginregex_inst.group() except: pass pass pass pass pass pass pass pass
Slide 6: Taken from a Python/BeautifulSoup library.
Slide 7: (The Ruby is prettier, but who’s counting?)
Slide 8: getElementsByClassName("foo")[0].children
Slide 9: It’s an edge case. MySpace’s HTML is worse than average.
Slide 10: But it is an ugly recipe for mental turmoil.
Slide 11: The alternative?
Slide 12: flickr.getPhotos()
Slide 13: And you get back nice XML or JSON (or even SOAP!)
Slide 14: But ‘D.R.Y.’! APIs break that principle.
Slide 15: This is the data equivalent of the ‘accessible version’.
Slide 16: Enter GRDDL.
Slide 17: GRDDL defines a transformation process for XHTML » RDF.
Slide 18: XHTML? That’s what the spec says.
Slide 19: HTML 4 works too. Tidy!
Slide 20: RDF? Yes. Trust me. It’s not evil.
Slide 21: GRDDL can work like a data stylesheet on top of your HTML.
Slide 22: You simply use HTML (or XML) in the normal way...
Slide 23: ...and define how the data transformation works.
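A minimal sketch of the first step a GRDDL-aware agent takes: checking that a page opts in through the GRDDL profile on its head element and collecting the transformations it links to. The sample page and the example.org stylesheet URL are made up for illustration, and the function name is hypothetical. Applying the linked XSLT to produce RDF is left out, since Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_text):
    """Return the transformation URLs declared in an XHTML document's head."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # rel is also a space-separated list of tokens.
        if "transformation" in link.get("rel", "").split():
            href = link.get("href")
            if href:
                hrefs.append(href)
    return hrefs


# Hypothetical page: ordinary XHTML plus one line saying how to glean data.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
  </head>
  <body><p>Hello</p></body>
</html>"""

if __name__ == "__main__":
    print(find_transformations(SAMPLE))
```

The page itself stays plain XHTML; the only addition is the profile and the link that points at the transformation.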
Slide 24: You can even use it as a bridge for existing APIs and services.
Slide 25: Could even be used for formats other than RDF. Atom?
Slide 26: Simple example: ‘Not Safe For Work’
Slide 27: <a href="http://tubgirl.com" class="nsfw">
Slide 28: I can write that. I can’t write xFolk by hand.
Slide 29: Is ‘nsfw’ a good class name? No.
Slide 30: Do I care? No.
Slide 31: The data layer becomes separated like CSS is from HTML.
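To make the ‘nsfw’ example concrete, here is a rough sketch of what a transformation for it might produce. It is written in Python rather than the XSLT a real GRDDL transformation would normally use, and the predicate URI, class name and sample page are all hypothetical. It collects links marked class="nsfw" and prints one N-Triples statement for each.

```python
from html.parser import HTMLParser

# Hypothetical predicate; a real vocabulary would define its own URI.
NSFW_PREDICATE = "http://example.org/vocab#contentWarning"


class NsfwGleaner(HTMLParser):
    """Collect one N-Triples statement per <a> element marked class="nsfw"."""

    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = (attr.get("href") or "").strip()
        if href and "nsfw" in classes:
            self.triples.append(f'<{href}> <{NSFW_PREDICATE}> "nsfw" .')


def glean(html_text):
    """Return the statements gleaned from an HTML string."""
    gleaner = NsfwGleaner()
    gleaner.feed(html_text)
    gleaner.close()
    return gleaner.triples


# Made-up page: one flagged link, one ordinary link.
PAGE = (
    '<p><a href="http://example.com/a" class="nsfw">A</a> '
    '<a href="http://example.com/b">B</a></p>'
)

if __name__ == "__main__":
    for triple in glean(PAGE):
        print(triple)
```

The author of the page only writes the class name; deciding what it means as data happens in the transformation, which can change without touching the markup.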
Slide 32: That’s the theory. Now for the demo.
Slide 33: irc.freenode.net #swig #swhack
Slide 34: getsemantic.com semantic-web@w3.org
Slide 35: tom@tommorris.org http://tommorris.org



