SlideShare a Scribd company logo
Don’t Scrape, Glean . . ,[object Object]
Scraping sucks.
def  lastlogin   (@hmodel/ &quot;//td[@class='text'][@width='193']&quot; ).first.innerHTML.split(&quot;<br />&quot;[ 9 ].strip[ -10 .. -1 ]   return  date[ -4 .. -1 ] + &quot;-&quot; + date[ -7 .. -6 ] + &quot;-&quot; + date[ -10 .. -9 ] end end end end
Hpricot for ‘Last login’ date on MySpace.
try :   lastlogin = self.soup.findAll( True , { &quot;width&quot; :  &quot;193&quot; })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string   loginregex  =  re.compile(  r &quot; [0-9] / [0-9] +/ [0-9]* &quot;)   loginregex_inst  =  loginregex.search(lastlogin)   if  loginregex_inst  is   not None :   self.lastlogin  =  loginregex_inst.group()   except :   pass pass pass pass pass pass pass pass
Taken from a Python/BeautifulSoup library.
(The Ruby is prettier, but who’s counting?)
getElementsByClassName(“foo”)[0].children
It’s an edge case. MySpace’s HTML  is  worse than average.
But it is an ugly recipe for mental turmoil.
The alternative?
flickr.getPhotos()
And you get back nice XML or JSON (or even SOAP!) (or even SOAP!)
But ‘D.R.Y.’! APIs break that principle. APIs break that principle.
This is the data equivalent of the ‘accessible version’.
Enter GRDDL.
GRDDL defines a transformation process for XHTML » RDF.
XHTML ? That’s what the spec says. That’s what the spec says.
HTML 4 works too. Tidy ! !
RDF? Yes. Trust me. It’s not evil. It’s not evil. It’s not evil.
GRDDL can work like a data stylesheet on top of your HTML. on top of your HTML. on top of your HTML.
You simply use HTML (or XML) in the normal way...
...and define how the data transformation.
You can even use it as a bridge for exisiting APIs and services.
Could even be used for other formats than RDF. Atom? than RDF. Atom? than RDF. Atom?
Simple example: ‘Not Safe For Work’ ‘Not Safe For Work’
<a href=&quot; http://tubgirl.com &quot; class=&quot;nsfw&quot;>
I can write that. I can’t write xFolk by hand. I can’t write xFolk by hand.
Is ‘nsfw’ a good class name? No.
Do I care? No.
The data layer becomes separated like CSS is from HTML.
That’s the theory. Now for the demo. Now for the demo.
irc.freenode.net #swig #swhack #swhack #swhack
getsemantic.com [email_address] [email_address]
[email_address] http://tommorris.org http://tommorris.org

More Related Content

Similar to Don't scrape, Glean!

XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
Stephan Schmidt
 
Ods Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A TutorialOds Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A Tutorial
simienc
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
tutorialsruby
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
tutorialsruby
 
Csphtp1 18
Csphtp1 18Csphtp1 18
Csphtp1 18
HUST
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
Ajax Experience 2009
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaga
nohmad
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Saga
kaven yan
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perl
Joe Jiang
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007
Guillaume Laforge
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP
yucefmerhi
 
Grails and Dojo
Grails and DojoGrails and Dojo
Grails and Dojo
Sven Haiges
 
A Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsA Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on Rails
Rafael García
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
Goro Fuji
 
Debugging and Error handling
Debugging and Error handlingDebugging and Error handling
Debugging and Error handling
Suite Solutions
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в Magento
Magecom Ukraine
 
WordPress Development Confoo 2010
WordPress Development Confoo 2010WordPress Development Confoo 2010
WordPress Development Confoo 2010
Brendan Sera-Shriar
 
Lecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPLecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITP
yucefmerhi
 
JavaScript
JavaScriptJavaScript
JavaScript
Doncho Minkov
 
Orm hero
Orm heroOrm hero
Orm hero
Simone Di Maulo
 

Similar to Don't scrape, Glean! (20)

XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
 
Ods Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A TutorialOds Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A Tutorial
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
Csphtp1 18
Csphtp1 18Csphtp1 18
Csphtp1 18
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaga
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Saga
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perl
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP
 
Grails and Dojo
Grails and DojoGrails and Dojo
Grails and Dojo
 
A Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsA Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on Rails
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
Debugging and Error handling
Debugging and Error handlingDebugging and Error handling
Debugging and Error handling
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в Magento
 
WordPress Development Confoo 2010
WordPress Development Confoo 2010WordPress Development Confoo 2010
WordPress Development Confoo 2010
 
Lecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPLecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITP
 
JavaScript
JavaScriptJavaScript
JavaScript
 
Orm hero
Orm heroOrm hero
Orm hero
 

Recently uploaded

"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 

Recently uploaded (20)

"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

Don't scrape, Glean!