Slideshare.net (beta)

 
Don't scrape, Glean!

From tommorris, 5 months ago

Lacks the demo part, alas, but it's the slides I used

Slideshow transcript

Slide 1: Don’t Scrape, Glean. Tom Morris

Slide 2: Scraping sucks.

Slide 3:
    def lastlogin
      date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
      return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
    end

Slide 4: Hpricot for ‘Last login’ date on MySpace.

Slide 5: try: lastlogin = self.soup.findAll(True, {"width": "193"})[0].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibl ing.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSi bling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.stri ng loginregex = re.compile( r"[0-9]/[0-9]+/[0-9]*") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None: self.lastlogin = loginregex_inst.group() except: pass pass pass pass pass pass pass pass

Slide 6: Taken from a Python/BeautifulSoup library.

Slide 7: (The Ruby is prettier, but who’s counting?)

Slide 8: getElementsByClassName("foo")[0].children

Slide 9: It’s an edge case. MySpace’s HTML is worse than average.

Slide 10: But it is an ugly recipe for mental turmoil.

Slide 11: The alternative?

Slide 12: flickr.getPhotos()

Slide 13: And you get back nice XML or JSON (or even SOAP!)
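
For contrast with the scraping code on the earlier slides, here is a minimal sketch of what consuming a JSON API response can look like. The payload and its field names (`photos`, `id`, `title`) are invented for illustration and are not Flickr's actual response format.

```python
import json

# Hypothetical payload: these field names are made up for illustration
# and do not reflect the real Flickr API response format.
SAMPLE_RESPONSE = (
    '{"photos": [{"id": "1", "title": "Sunset"},'
    ' {"id": "2", "title": "Harbour"}]}'
)


def photo_titles(raw):
    """Return the title of every photo in a JSON response string."""
    data = json.loads(raw)
    return [photo["title"] for photo in data["photos"]]


print(photo_titles(SAMPLE_RESPONSE))
```

The lookup is by name rather than by position in the markup, so it does not depend on how the page is laid out.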

Slide 14: But ‘D.R.Y.’! APIs break that principle.

Slide 15: This is the data equivalent of the ‘accessible version’.

Slide 16: Enter GRDDL.

Slide 17: GRDDL defines a transformation process for XHTML » RDF.

Slide 18: XHTML? That’s what the spec says.

Slide 19: HTML 4 works too. Tidy!

Slide 20: RDF? Yes. Trust me. It’s not evil.

Slide 21: GRDDL can work like a data stylesheet on top of your HTML.

Slide 22: You simply use HTML (or XML) in the normal way...

Slide 23: ...and define how the data is transformed.
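
As a rough sketch of the "define the transformation" step: in GRDDL, an XHTML page opts in with the profile `http://www.w3.org/2003/g/data-view` on `<head>` and points at its transformations with `rel="transformation"` links. The code below only discovers those links; it does not run the XSLT. The sample page and the stylesheet URL `http://example.org/nsfw2rdf.xsl` are hypothetical, and for brevity only `<link>` elements in the head are checked.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Hypothetical page; the stylesheet URL is a placeholder.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def grddl_transformations(xhtml):
    """Return the transformation URLs declared in a GRDDL-enabled XHTML page."""
    root = ET.fromstring(xhtml)
    head = root.find(XHTML_NS + "head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.iter(XHTML_NS + "link"):
        # rel is also space-separated, e.g. rel="transformation alternate".
        if "transformation" in link.get("rel", "").split():
            href = link.get("href")
            if href:
                hrefs.append(href)
    return hrefs


print(grddl_transformations(SAMPLE_PAGE))
```

A GRDDL-aware agent would then fetch each returned stylesheet and apply it to the page to obtain RDF.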

Slide 24: You can even use it as a bridge for existing APIs and services.

Slide 25: Could even be used for formats other than RDF. Atom?

Slide 26: Simple example: ‘Not Safe For Work’

Slide 27: <a href="http://tubgirl.com" class="nsfw">
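
To make the idea concrete, here is a minimal sketch of gleaning data from that kind of markup: it collects every link carrying the `nsfw` class and emits one N-Triples line per link. GRDDL itself would do this with a transformation such as XSLT; plain Python stands in for that step here. The class URI `http://example.org/vocab#NotSafeForWork` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical class URI, chosen only for this example.
NSFW_CLASS = "http://example.org/vocab#NotSafeForWork"

SAMPLE_HTML = (
    '<p><a href="http://example.com/a" class="nsfw">one</a> '
    '<a href="http://example.com/b">two</a></p>'
)


class NsfwLinkCollector(HTMLParser):
    """Collect the href of every <a> element whose class list contains 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attribute values can be None when written without a value.
        classes = (attributes.get("class") or "").split()
        href = (attributes.get("href") or "").strip()
        if "nsfw" in classes and href:
            self.links.append(href)


def nsfw_triples(html):
    """Return one N-Triples line per link marked with class 'nsfw'."""
    collector = NsfwLinkCollector()
    collector.feed(html)
    return ["<%s> <%s> <%s> ." % (link, RDF_TYPE, NSFW_CLASS)
            for link in collector.links]


for triple in nsfw_triples(SAMPLE_HTML):
    print(triple)
```

The author only has to write the class name; the mapping to a vocabulary lives in the transformation, not in the page.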

Slide 28: I can write that. I can’t write xFolk by hand.

Slide 29: Is ‘nsfw’ a good class name? No.

Slide 30: Do I care? No.

Slide 31: The data layer becomes separated like CSS is from HTML.

Slide 32: That’s the theory. Now for the demo.

Slide 33: irc.freenode.net #swig #swhack

Slide 34: getsemantic.com semantic-web@w3.org

Slide 35: tom@tommorris.org http://tommorris.org