distilling the Web of Data drop by drop (with Java)

An introduction to the machine-readable Web, the various ways to implement it with Microformats, RDFa or Microdata, and how to consume it.

Any23, an open source Java distiller for the Web of Data, is presented.

Published in: Art & Photos, Technology

  1. distilling the Web of Data drop by drop (with Java). Sourcesense UK “Last Wednesday” - Davide Palmisano (@dpalmisano) - Wednesday, June 29, 2011
  2. the shortest introduction ever to the Web of Data
     - Web page markup technologies are intended for human consumption
     - they let machines present raw data to humans
     - extracting valuable data may require fancy scraping techniques
     - scraping: one size doesn’t fit all
  3. the shortest introduction ever to the Web of Data
       <div>
         <div> Canon Rebel T2i (EOS 550D) $899 </div>
         <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
         <div> AN_UCC-13: 013803123784 </div>
         <div> price: 899 USD </div>
         </div>
       </div>
  4. the shortest introduction ever to the Web of Data (same markup as slide 3, with a callout on the <div> tags): what does this tag mean?
  5. the shortest introduction ever to the Web of Data (same markup as slide 3, with a second callout on “price: 899 USD”): is this a currency or what?
  6. “meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy
  7. Microformats: “Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages” (Andy Mabbett)
     - community-driven initiative
     - widely adopted
     - quick & dirty
     - scarcely extensible
  8. Microformats
       <div class="hlisting item">
         <div> Canon Rebel T2i (EOS 550D) $899 </div>
         <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
         <div> AN_UCC-13: 013803123784 </div>
         <div class="price"> price: 899 USD </div>
         </div>
       </div>
  9. RDFa: RDF in attributes. Model your data as if they were Web pages connected with named links and properties, and embed them in your (X)HTML using @attributes.
     - RDF, a graph-based model
     - W3C Recommendation
     - highly extensible, e.g. GoodRelations[1], a fully fledged vocabulary for e-commerce
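  10. RDFa: RDF in attributes. Model your data [graph diagram; edges as far as they can be recovered from the slide]:
      - http://mystore.com/product/5642 ex:description “The Rebel T2i EOS 550D blah blah”
      - http://mystore.com/product/5642 ex:producer http://canon.co.uk
      - http://mystore.com/product/5642 ex:price (price node)
      - (price node) ex:value 899
      - (price node) ex:currency USD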
  11. RDFa: RDF in attributes. And then embed them in your (X)HTML pages:
       <div about="http://mystore.com/product/5642">
         <div>Canon Rebel T2i (EOS 550D) $899</div>
         <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
         <div rel="gr:hasPriceSpecification">
           <span> price:
             <span property="gr:hasCurrencyValue">899</span>
             <span property="gr:hasCurrency">USD</span>
           </span>
         </div>
       </div>
  12. HTML5: Microdata. Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content.
      - W3C Working Draft
      - native to the HTML5 specification
      - serializable as RDF
      - Google, Yahoo! and Bing endorsed Schema.org
      - large adoption expected
  13. HTML5: Microdata
       <div itemscope itemtype="http://schema.org/Offer">
         <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
         <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
         <div>
           <span> price:
             <span itemprop="price"> 899 </span>
             <span itemprop="priceCurrency"> USD </span>
           </span>
         </div>
       </div>
  14. [Chart: % of marked-up Web pages (axis from 0 to 3.5) for RDFa, hCard, adr, xfn and hReview at 09/2008, 03/2009 and 10/2010; data from Yahoo! [2]]
  15. tie ‘em all together: a uniform, reconciled and unified RDF representation
  16. a drop-by-drop distiller. Anything To Triples (any23) is an open source, Apache-licensed:
      - Java library,
      - Web service and
      - command-line tool
      able to distill RDF triples from a variety of semantically marked-up Web documents. http://developers.any23.org
  17. live demo: http://any23.org, on a Web site with ~5000 product descriptions marked up with GoodRelations using RDFa
  18. use Any23 in your Java programs
        // Configure the Any23 facade and obtain its HTTP client.
        Any23 runner = new Any23();
        runner.setHTTPUserAgent("test-user-agent");
        HTTPClient httpClient = runner.getHTTPClient();
        // The document to distill, fetched over HTTP.
        DocumentSource source = new HTTPDocumentSource(
            httpClient,
            "http://test.com/index.html"
        );
        // Collect the extracted triples as N-Triples in memory.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        runner.extract(source, handler);
        String n3 = out.toString("UTF-8");
  19. Any23: Command-Line tool
        any23-core/bin$ ./any23
        usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
               [-p] [-s] [-t] [-v] {<url>|<file>}
         -e <arg>            comma-separated list of extractors, e.g.
                             rdf-xml,rdf-turtle
         -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
         -l,--log <arg>      logging, please specify a file
         -n,--nesting        disable production of nesting triples
         -o,--output <arg>   ouput file (defaults to stdout)
         -p,--pedantic       validates and fixes HTML content detecting commons issues
         -s,--stats          print out statistics of Any23
  20. Any23: Web Service
        blacky:~ davide$ curl "http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on"
        <response>
          <extractors>
            <extractor>rdf-xml</extractor>
          </extractors>
          <report>
            ...
            <validationReport>
              <ruleActivations></ruleActivations>
              ...
            </validationReport>
          </report>
          <data>
            <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
              <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
              <po:pid>b00kygwh</po:pid>
              <dc:title>The Terminator</dc:title>
            </rdf:Description>
          </data>
        </response>
  21. [Architecture diagram, components as labelled on the slide]
      - Apache Tika: mimetype detection
      - Cyber Neko HTML: DOM extraction
      - Validator: Rule, Fix
      - Microdata Extractor, RDFa Extractor, Microformat Extractors (hListing, hReview, hCalendar, hCard)
      - Sesame
      - RDF/XML Writer, NQuads Writer, JSON Writer
      - ExtractionResult
  22. extractor
        public interface Extractor<Input> {

            /**
             * Executes the extractor. Will be invoked only once, extractors are
             * not reusable.
             *
             * @param in The extractor's input
             * @param documentURI The document's URI
             * @param out Sink for extracted data
             * @throws IOException On error while reading from the input stream
             * @throws ExtractionException On other error, such as parse errors
             */
            void run(Input in, URI documentURI, ExtractionResult out)
                throws IOException, ExtractionException;

            /**
             * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
             * this extractor.
             */
            ExtractorDescription getDescription();
        }
  23. validate and fix
        public interface Rule {
            String getHRName();
            boolean applyOn(
                DOMDocument document,
                RuleContext context,
                ValidationReportBuilder validationReportBuilder
            );
        }

        public interface Fix {
            String getHRName();
            void execute(Rule rule, RuleContext context, DOMDocument document);
        }

        void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
  24. plugins
        @PluginImplementation
        @Author(name="Michele Mostarda (mostarda@fbk.eu)")
        public class HTMLScraperPlugin implements ExtractorPlugin {

            private static final Logger logger =
                LoggerFactory.getLogger(HTMLScraperPlugin.class);

            @Init
            public void init() {
                logger.info("Plugin initialization.");
            }

            @Shutdown
            public void shutdown() {
                logger.info("Plugin shutdown.");
            }

            public ExtractorFactory getExtractorFactory() {
                return HTMLScraperExtractor.factory;
            }
        }
  25. roadmap
      upcoming 0.6.0 release
      - support for Microdata
      - support for CSV
      - support for the RDFa 1.1 prefix mechanism
      - improved app configuration
      - bug fixing
      Apache (pre) Incubation process
      - http://wiki.apache.org/incubator/Any23Proposal
      - supporters and mentors (thanks guys!): Simone Tripodi (@stripodi), Tommaso Teofili (@tteofili)
      - we’re looking for mentors
  26. closing credits
      active committers: Giovanni Tummarello (@jccq), Michele Mostarda (@micmos), Davide Palmisano (@dpalmisano), Richard Cyganiak (@cygri)
      thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bugs and feature requests
  27. References
      [1] http://purl.org/goodrelations/v1
      [2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
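
Slides 2 to 5 argue that scraping unmarked HTML is fragile. A minimal sketch of why, using only the Java standard library: the class name `PriceScraper` and the regular expression are illustrative assumptions, not part of the talk. The pattern is tailored to the `price: 899 USD` layout from slide 3 and finds nothing as soon as a page words the same fact differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class PriceScraper {

    // Tailored to one page layout: "price: <amount> <3-letter currency code>".
    private static final Pattern PRICE =
        Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    /** Returns "<amount> <currency>" if the page follows the expected layout. */
    static Optional<String> scrapePrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find()
            ? Optional.of(m.group(1) + " " + m.group(2))
            : Optional.empty();
    }

    public static void main(String[] args) {
        // Layout from slide 3: the pattern matches.
        System.out.println(scrapePrice("<div> price: 899 USD </div>"));
        // Same fact, different layout: the scraper silently finds nothing.
        System.out.println(scrapePrice("<span>USD 899</span>"));
    }
}
```

Every new site layout needs its own pattern, which is the "one size doesn't fit all" problem that explicit markup (Microformats, RDFa, Microdata) is meant to remove.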
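
The graph on slide 10 can be read as a handful of subject-predicate-object statements. The sketch below writes them out as N-Triples lines with plain Java, under stated assumptions: the `ex:` prefix is expanded to the made-up namespace `http://example.org/vocab#`, and the price node is given the made-up IRI `http://mystore.com/product/5642#price` (the slide does not name it). It does not use Any23.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-rolled N-Triples formatter, not the Any23 API.
public class TripleSketch {

    static final String EX = "http://example.org/vocab#";          // assumed namespace for ex:
    static final String PRODUCT = "http://mystore.com/product/5642";
    static final String PRICE_NODE = PRODUCT + "#price";           // assumed IRI for the price node

    /** Formats one statement; the object is either an IRI or a plain literal. */
    static String toNTriple(String subject, String predicate, String object, boolean objectIsIri) {
        String obj = objectIsIri
            ? "<" + object + ">"
            : "\"" + object.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        return "<" + subject + "> <" + predicate + "> " + obj + " .";
    }

    /** The product description from slide 10 as a list of N-Triples lines. */
    static List<String> productGraph() {
        List<String> triples = new ArrayList<>();
        triples.add(toNTriple(PRODUCT, EX + "description", "The Rebel T2i EOS 550D blah blah", false));
        triples.add(toNTriple(PRODUCT, EX + "producer", "http://canon.co.uk", true));
        triples.add(toNTriple(PRODUCT, EX + "price", PRICE_NODE, true));
        triples.add(toNTriple(PRICE_NODE, EX + "value", "899", false));
        triples.add(toNTriple(PRICE_NODE, EX + "currency", "USD", false));
        return triples;
    }

    public static void main(String[] args) {
        productGraph().forEach(System.out::println);
    }
}
```

This is the kind of output a distiller produces from the RDFa on slide 11, only with the GoodRelations property names instead of the `ex:` placeholders.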
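
The Web service call on slide 20 passes the target page as a `url` query parameter. When building that request from Java, the nested URL is safer percent-encoded so that its own `:` and `/` characters cannot be confused with the outer query string. A minimal sketch, assuming the service decodes query parameters in the usual way; the class and method names are hypothetical, and only the endpoint and parameter names (`format`, `url`, `report`) come from the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
public class Any23ServiceUrl {

    /** Builds a request URL of the form <serviceBase>?format=...&url=...[&report=on]. */
    static String buildRequestUrl(String serviceBase, String format, String targetUrl, boolean report) {
        try {
            StringBuilder sb = new StringBuilder(serviceBase);
            sb.append("?format=").append(URLEncoder.encode(format, "UTF-8"));
            sb.append("&url=").append(URLEncoder.encode(targetUrl, "UTF-8"));
            if (report) {
                sb.append("&report=on");
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildRequestUrl(
            "http://any23.org/any23/",
            "nquads",
            "http://www.bbc.co.uk/programmes/b00kygwh",
            true));
    }
}
```

On the command line the same concern shows up as shell quoting: an unquoted `&` ends the command, so the whole URL needs to be wrapped in quotes when passed to curl.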
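
Slide 23 pairs each validation `Rule` with a `Fix`. The sketch below illustrates that pairing with deliberately simplified, hypothetical interfaces: the document is a `StringBuilder` rather than a DOM, there is no `RuleContext` or report builder, and rules are registered as instances rather than classes. It is not the Any23 validator; it only shows the detect-then-repair idea, using typographic quotes around attribute values as the example issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of the rule/fix pairing; not the Any23 API.
public class RuleFixSketch {

    interface Rule {
        String getHRName();
        boolean applyOn(StringBuilder document);   // true when the issue is present
    }

    interface Fix {
        String getHRName();
        void execute(StringBuilder document);      // repairs the document in place
    }

    private final Map<Rule, Fix> rules = new LinkedHashMap<>();

    void addRule(Rule rule, Fix fix) {
        rules.put(rule, fix);
    }

    /** Runs every rule; applies the paired fix when asked. Returns the number of rule activations. */
    int validate(StringBuilder document, boolean applyFix) {
        int activations = 0;
        for (Map.Entry<Rule, Fix> entry : rules.entrySet()) {
            if (entry.getKey().applyOn(document)) {
                activations++;
                if (applyFix) {
                    entry.getValue().execute(document);
                }
            }
        }
        return activations;
    }

    /** Example pairing: detect curly double quotes and replace them with straight ones. */
    static RuleFixSketch withCurlyQuoteRule() {
        RuleFixSketch validator = new RuleFixSketch();
        validator.addRule(
            new Rule() {
                public String getHRName() { return "curly-quotes-rule"; }
                public boolean applyOn(StringBuilder doc) {
                    return doc.indexOf("\u201C") >= 0 || doc.indexOf("\u201D") >= 0;
                }
            },
            new Fix() {
                public String getHRName() { return "curly-quotes-fix"; }
                public void execute(StringBuilder doc) {
                    for (int i = 0; i < doc.length(); i++) {
                        char c = doc.charAt(i);
                        if (c == '\u201C' || c == '\u201D') {
                            doc.setCharAt(i, '"');
                        }
                    }
                }
            });
        return validator;
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("<div class=\u201Dprice\u201D> price: 899 USD </div>");
        int activations = withCurlyQuoteRule().validate(doc, true);
        System.out.println(activations + " activation(s): " + doc);
    }
}
```

Keeping detection and repair in separate objects lets a caller run the rules in report-only mode, which is roughly what the `report=on` output on slide 20 surfaces as rule activations.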
