distilling the Web of Data drop by drop (with Java)
An introduction to the machine-readable Web, the various ways to implement it with Microformats, RDFa or Microdata, and how to consume it.

Any23, an open source Java distiller for the Web of Data, is presented.

distilling the Web of Data drop by drop (with Java) Presentation Transcript

  • 1. distilling the Web of Data drop by drop (with Java). Sourcesense UK “Last Wednesday”, Davide Palmisano (@dpalmisano). Wednesday, June 29, 2011
  • 2. the shortest introduction ever to the Web of Data
      - Web page markup technologies are intended for human consumption
      - they let machines present raw data to humans
      - extracting valuable data may require fancy scraping techniques
      - scraping: one size doesn’t fit all
  • 3. the shortest introduction ever to the Web of Data
        <div>
          <div> Canon Rebel T2i (EOS 550D) $899 </div>
          <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
          <div> AN_UCC-13: 013803123784 </div>
          <div> price: 899 USD </div>
        </div>
        </div>
  • 4. the shortest introduction ever to the Web of Data (same markup as slide 3, with a callout: “what does this tag mean?”)
  • 5. the shortest introduction ever to the Web of Data (same markup as slide 3, with two callouts: “what does this tag mean?” and “is this a currency or what?”)
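The “one size doesn’t fit all” point of slides 2 to 5 can be illustrated with a minimal sketch. It is not taken from the talk: the class name, the regular expression and the second sample page are all assumptions made for illustration. A pattern tuned to the layout of the sample markup finds the price there, but silently finds nothing on a page that presents the same data differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: a scraper hard-wired to one page layout.
class NaivePriceScraper {

    // Matches text such as "price: 899 USD", as in the sample markup of slide 3.
    private static final Pattern PRICE =
            Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    /** Returns "amount currency" when the page follows the expected layout. */
    static Optional<String> scrapePrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? Optional.of(m.group(1) + " " + m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String storeA = "<div><div>Canon Rebel T2i (EOS 550D) $899</div>"
                + "<div>price: 899 USD</div></div>";
        // Same information, different (invented) layout: the pattern no longer matches.
        String storeB = "<div><span>Canon Rebel T2i</span><b>USD 899.00</b></div>";

        System.out.println(scrapePrice(storeA));
        System.out.println(scrapePrice(storeB));
    }
}
```

Every new layout needs its own pattern, and nothing in the markup says whether “USD” is a currency; the formats on the following slides address this by putting that meaning into the markup itself.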
  • 6. “meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy
  • 7. Microformats: “Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages” (Andy Mabbett)
      - community-driven initiative
      - largely adopted
      - quick & dirty
      - scarcely extensible
  • 8. Microformats
        <div class="hlisting item">
          <div> Canon Rebel T2i (EOS 550D) $899 </div>
          <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
          <div> AN_UCC-13: 013803123784 </div>
          <div class="price"> price: 899 USD </div>
        </div>
        </div>
  • 9. RDFa: RDF in attributes. Model your data as if they were Web pages connected with named links and properties, and embed them in your (X)HTML using @attributes
      - RDF, graph-based model
      - W3C Recommendation
      - highly extensible, e.g. GoodRelations [1], a full-fledged vocabulary for e-commerce
  • 10. RDFa: RDF in attributes. Model your data [graph diagram: http://mystore.com/product/5642 has ex:producer http://canon.co.uk, ex:description “The Rebel T2i EOS 550D blah blah” and ex:price; the price has ex:value 899 and ex:currency USD]
  • 11. RDFa: RDF in attributes. And then embed them in your (X)HTML pages
        <div about="http://mystore.com/product/5642">
          <div>Canon Rebel T2i (EOS 550D) $899</div>
          <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
          <div rel="gr:hasPriceSpecification">
            <span> price:
              <span property="gr:hasCurrencyValue">899</span>
              <span property="gr:hasCurrency">USD</span>
            </span>
          </div>
        </div>
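To make the graph model of slides 9 to 11 concrete, here is a minimal sketch that writes the product graph of slide 10 as subject-predicate-object triples in N-Triples syntax. Several things are assumptions: the namespace behind the slide's `ex:` prefix, the blank node used for the price, and the reading of the flattened diagram itself. The class and method names are hypothetical and no RDF library is used.

```java
import java.util.List;

// Minimal sketch of the slide 10 graph as subject-predicate-object triples.
class ProductGraphSketch {

    // Assumed namespace standing in for the slide's "ex:" prefix.
    static final String EX = "http://example.org/vocab#";

    /** One RDF statement; terms are stored already in N-Triples syntax. */
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
        String toNTriples() { return subject + " " + predicate + " " + object + " ."; }
    }

    static String iri(String value) { return "<" + value + ">"; }

    static String literal(String value) {
        // Escapes backslashes and double quotes only, enough for this example's plain literals.
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static List<Triple> productGraph() {
        String product = iri("http://mystore.com/product/5642");
        String price = "_:price"; // blank node for the price (an assumption about the diagram)
        return List.of(
            new Triple(product, iri(EX + "producer"), iri("http://canon.co.uk")),
            new Triple(product, iri(EX + "description"), literal("The Rebel T2i EOS 550D blah blah")),
            new Triple(product, iri(EX + "price"), price),
            new Triple(price, iri(EX + "value"), literal("899")),
            new Triple(price, iri(EX + "currency"), literal("USD")));
    }

    public static void main(String[] args) {
        // One line per triple, each terminated by " ."
        productGraph().forEach(t -> System.out.println(t.toNTriples()));
    }
}
```

The RDFa markup of slide 11 carries the same kind of graph inside the page: `about` sets the subject, `property` and `rel` name the predicates, and element content or nested elements supply the objects.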
  • 12. HTML5: Microdata. Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content
      - W3C Working Draft
      - native to the HTML5 specification
      - serializable in RDF
      - Google, Yahoo! and Bing endorsed Schema.org
      - large adoption expected
  • 13. HTML5: Microdata
        <div itemscope itemtype="http://schema.org/Offer">
          <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
          <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
          <div>
            <span> price:
              <span itemprop="price"> 899 </span>
              <span itemprop="priceCurrency"> USD </span>
            </span>
          </div>
        </div>
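The “name-value pairs” idea of slides 12 and 13 can be sketched with nothing but the JDK's XML parser. This is a deliberately simplified illustration, not how Any23 does it: it assumes well-formed XHTML (so `itemscope` is written as `itemscope=""`), and it ignores nested items, `itemref` and non-text values. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Simplified sketch: collect itemprop name-value pairs from well-formed XHTML.
class MicrodataSketch {

    /** Maps each itemprop name to the trimmed text content of its element. */
    static Map<String, String> itemProperties(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList all = doc.getElementsByTagName("*"); // every element, in document order
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                if (e.hasAttribute("itemprop")) {
                    props.put(e.getAttribute("itemprop"), e.getTextContent().trim());
                }
            }
            return props;
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XHTML", ex);
        }
    }

    public static void main(String[] args) {
        String offer =
              "<div itemscope=\"\" itemtype=\"http://schema.org/Offer\">"
            + "<div itemprop=\"name\"> Canon Rebel T2i (EOS 550D) $899 </div>"
            + "<div><span> price: <span itemprop=\"price\"> 899 </span>"
            + "<span itemprop=\"priceCurrency\"> USD </span></span></div>"
            + "</div>";
        // Prints one "name = value" line per itemprop found.
        itemProperties(offer).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Real-world HTML is rarely well-formed XML, which is why a tolerant HTML parser is normally placed in front of this kind of extraction.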
  • 14. % of marked up Web pages [chart: share of Web pages using RDFa, hCard, adr, xfn and hReview in 09/2008, 03/2009 and 10/2010; vertical axis from 0 to 3.5; data from Yahoo! [2]]
  • 15. tie ‘em all together: a uniform, reconciled and unified RDF representation
  • 16. a drop-by-drop distiller. Anything To Triples (any23) is an open source, Apache-licensed
      - Java library,
      - Web service and
      - command-line tool
      able to distill RDF triples from a variety of semantically marked up Web documents. http://developers.any23.org
  • 17. live demo (http://any23.org): a Web site with ~5000 product descriptions marked up with GoodRelations using RDFa
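  • 18. use Any23 in your Java programs
        // Imports assume the org.deri.any23 package namespace referenced on slide 22.
        import java.io.ByteArrayOutputStream;
        import org.deri.any23.Any23;
        import org.deri.any23.http.HTTPClient;
        import org.deri.any23.source.DocumentSource;
        import org.deri.any23.source.HTTPDocumentSource;
        import org.deri.any23.writer.NTriplesWriter;
        import org.deri.any23.writer.TripleHandler;

        Any23 runner = new Any23();
        runner.setHTTPUserAgent("test-user-agent");      // identifies the client to remote servers
        HTTPClient httpClient = runner.getHTTPClient();
        DocumentSource source = new HTTPDocumentSource(  // the document to distill
            httpClient,
            "http://test.com/index.html"
        );
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out); // serializes extracted triples as N-Triples
        runner.extract(source, handler);
        String n3 = out.toString("UTF-8");               // despite the name, this holds N-Triples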
  • 19. Any23: Command-Line tool
        any23-core/bin$ ./any23
        usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
               [-p] [-s] [-t] [-v] {<url>|<file>}
         -e <arg>            comma-separated list of extractors, e.g.
                             rdf-xml,rdf-turtle
         -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
         -l,--log <arg>      logging, please specify a file
         -n,--nesting        disable production of nesting triples
         -o,--output <arg>   ouput file (defaults to stdout)
         -p,--pedantic       validates and fixes HTML content detecting commons issues
         -s,--stats          print out statistics of Any23
  • 20. Any23: Web Service
        blacky:~ davide$ curl "http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on"
        <response>
          <extractors>
            <extractor>rdf-xml</extractor>
          </extractors>
          <report>
            ...
            <validationReport>
              <ruleActivations></ruleActivations>
              ...
            </validationReport>
          </report>
          <data>
            <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
              <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
              <po:pid>b00kygwh</po:pid>
              <dc:title>The Terminator</dc:title>
            </rdf:Description>
          </data>
        </response>
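When the request of slide 20 is built from a program rather than typed into a shell, the target document's URL should be percent-encoded before it is placed in the query string. The sketch below only assembles the request URL. The parameter names `format`, `url` and `report` come from the slide; the class and method names are hypothetical, and it is an assumption that the service decodes an encoded `url` parameter in the usual way.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: assemble a request URL for the Web service shown on slide 20.
class Any23RequestUrl {

    /** Builds serviceBase?format=...&url=...[&report=on] with encoded parameter values. */
    static String build(String serviceBase, String format, String documentUrl, boolean report) {
        String query = "format=" + encode(format) + "&url=" + encode(documentUrl);
        if (report) {
            query += "&report=on";
        }
        return serviceBase + "?" + query;
    }

    private static String encode(String value) {
        // Form-style percent-encoding; needs Java 10+ for the Charset overload.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build(
                "http://any23.org/any23/",
                "nquads",
                "http://www.bbc.co.uk/programmes/b00kygwh",
                true));
    }
}
```

Encoding matters as soon as the document URL has its own query string: an unencoded `&` inside it would be read as the start of another parameter of the service request.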
  • 21. [architecture diagram: Apache Tika (mimetype detection); CyberNeko HTML (DOM extraction); Rule/Fix Validator; Microdata Extractor, RDFa Extractor and Microformat Extractors (hListing, hReview, hCalendar, hCard); ExtractionResult; Sesame with RDF/XML Writer, NQuads Writer and JSON Writer]
  • 22. extractor
        public interface Extractor<Input> {

            /**
             * Executes the extractor. Will be invoked only once, extractors are
             * not reusable.
             *
             * @param in The extractors input
             * @param documentURI The documents URI
             * @param out Sink for extracted data
             * @throws IOException On error while reading from the input stream
             * @throws ExtractionException On other error, such as parse errors
             */
            void run(Input in, URI documentURI, ExtractionResult out)
                throws IOException, ExtractionException;

            /**
             * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
             * this extractor.
             */
            ExtractorDescription getDescription();
        }
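The shape of the contract on slide 22 (an input, the document URI, and a sink for extracted data) can be tried out with a self-contained sketch. The types below are hypothetical, simplified stand-ins and not the Any23 interfaces: `TripleSink` plays the role of `ExtractionResult`, `SimpleExtractor` that of `Extractor<Input>`, and the title extractor is invented for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins mirroring the shape of the Extractor contract; not the Any23 types.
class ExtractorSketch {

    /** Stand-in for ExtractionResult: receives the extracted statements. */
    interface TripleSink {
        void writeTriple(String subject, String predicate, String object);
    }

    /** Simplified analogue of Extractor<Input>. */
    interface SimpleExtractor<I> {
        void run(I in, URI documentURI, TripleSink out);
    }

    /** Emits the document <title> as a dc:title statement about the document URI. */
    static final class TitleExtractor implements SimpleExtractor<String> {
        private static final Pattern TITLE =
                Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        @Override
        public void run(String html, URI documentURI, TripleSink out) {
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                out.writeTriple(documentURI.toString(),
                        "http://purl.org/dc/terms/title", m.group(1).trim());
            }
        }
    }

    /** Runs a fresh extractor instance once and collects its output as strings. */
    static List<String> extractTitle(String html, String uri) {
        List<String> triples = new ArrayList<>();
        new TitleExtractor().run(html, URI.create(uri),
                (s, p, o) -> triples.add(s + " " + p + " " + o));
        return triples;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>The Terminator</title></head><body/></html>";
        extractTitle(html, "http://example.org/doc").forEach(System.out::println);
    }
}
```

Passing the sink in, rather than returning a collection, lets the caller decide where statements go (a writer, a counter, an in-memory list) without the extractor knowing. A new extractor instance is created per run, in line with the “not reusable” note in the Javadoc on the slide.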
  • 23. validate and fix
        public interface Rule {
            String getHRName();
            boolean applyOn(
                DOMDocument document,
                RuleContext context,
                ValidationReportBuilder validationReportBuilder
            );
        }

        public interface Fix {
            String getHRName();
            void execute(Rule rule, RuleContext context, DOMDocument document);
        }

        void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
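The rule/fix pairing of slide 23 can be sketched as follows: a rule detects a problem in the DOM and, when it fires, its paired fix repairs the document before extraction. Everything here is a hypothetical simplification rather than the Any23 API. `SimpleRule` and `SimpleFix` drop the context and report arguments, the “missing itemscope” rule is invented for illustration, and the JDK XML parser stands in for a tolerant HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical, simplified stand-ins for the Rule/Fix pairing; not the Any23 types.
class ValidateAndFixSketch {

    interface SimpleRule { String name(); boolean appliesTo(Document doc); }
    interface SimpleFix { void execute(Document doc); }

    /** Collects elements that declare an itemtype but lack the itemscope attribute. */
    static List<Element> itemsMissingScope(Document doc) {
        List<Element> found = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemtype") && !e.hasAttribute("itemscope")) found.add(e);
        }
        return found;
    }

    static final SimpleRule MISSING_ITEMSCOPE = new SimpleRule() {
        public String name() { return "missing-itemscope"; }
        public boolean appliesTo(Document doc) { return !itemsMissingScope(doc).isEmpty(); }
    };

    static final SimpleFix ADD_ITEMSCOPE =
            doc -> itemsMissingScope(doc).forEach(e -> e.setAttribute("itemscope", ""));

    /** Applies each rule; when one fires, runs its paired fix. Returns the fired rule names. */
    static List<String> validate(Document doc, Map<SimpleRule, SimpleFix> rules) {
        List<String> activated = new ArrayList<>();
        rules.forEach((rule, fix) -> {
            if (rule.appliesTo(doc)) {
                fix.execute(doc);
                activated.add(rule.name());
            }
        });
        return activated;
    }

    static List<String> validateWithDefaults(Document doc) {
        Map<SimpleRule, SimpleFix> rules = new LinkedHashMap<>();
        rules.put(MISSING_ITEMSCOPE, ADD_ITEMSCOPE);
        return validate(doc, rules);
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div itemtype=\"http://schema.org/Offer\">"
                + "<span itemprop=\"price\">899</span></div>");
        System.out.println(validateWithDefaults(doc)); // first pass: the rule fires and the fix runs
        System.out.println(validateWithDefaults(doc)); // second pass: nothing left to fix
    }
}
```

Keeping detection and repair in separate types lets one rule be reported without being fixed, or be paired with a different fix, which is roughly what the `addRule(rule, fix)` registration on the slide suggests.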
  • 24. plugins
        @PluginImplementation
        @Author(name="Michele Mostarda (mostarda@fbk.eu)")
        public class HTMLScraperPlugin implements ExtractorPlugin {

            private static final Logger logger = LoggerFactory.getLogger(HTMLScraperPlugin.class);

            @Init
            public void init() {
                logger.info("Plugin initialization.");
            }

            @Shutdown
            public void shutdown() {
                logger.info("Plugin shutdown.");
            }

            public ExtractorFactory getExtractorFactory() {
                return HTMLScraperExtractor.factory;
            }
        }
  • 25. roadmap
      upcoming 0.6.0 release
      - support for Microdata
      - support for CSV
      - support for RDFa 1.1 prefix mechanism
      - improved app configuration
      - bug fixing
      Apache (pre) Incubation process
      - http://wiki.apache.org/incubator/Any23Proposal
      - supporters and mentors (thanks guys!): Simone Tripodi (@stripodi), Tommaso Teofili (@tteofili)
      - we’re looking for mentors
  • 26. closing credits
      active committers: Giovanni Tummarello (@jccq), Michele Mostarda (@micmos), Davide Palmisano (@dpalmisano), Richard Cyganiak (@cygri)
      thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bugs and feature requests
  • 27. References
      [1] http://purl.org/goodrelations/v1
      [2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/