distilling the Web of Data drop by drop (with Java)

An introduction to the machine-readable Web, the various ways to implement it with Microformats, RDFa or Microdata, and how to consume it.

The talk presents Any23, an open source Java distiller for the Web of Data.

Published in: Art & Photos, Technology

  1. distilling the Web of Data drop by drop (with Java). Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano. Wednesday, June 29, 2011
  2. the shortest introduction ever to the Web of Data
     Web page markup technologies are intended for human consumption
     they let machines present raw data to humans
     extracting valuable data may require fancy scraping techniques
     scraping: one size doesn’t fit all
  3. the shortest introduction ever to the Web of Data
     <div>
       <div> Canon Rebel T2i (EOS 550D) $899 </div>
       <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
       <div> AN_UCC-13: 013803123784 </div>
       <div> price: 899 USD </div>
       </div>
     </div>
  4. the shortest introduction ever to the Web of Data
     (same markup as slide 3, with a callout on one of the tags: “what does this tag mean?”)
  5. the shortest introduction ever to the Web of Data
     (same markup as slide 3, with two callouts: “what does this tag mean?” and “is this a currency or what?”)
  6. “meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy
  7. Microformats
     “Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages” (Andy Mabbett)
     - community-driven initiative
     - widely adopted
     - quick & dirty
     - scarcely extensible
  8. Microformats
     <div class="hlisting item">
       <div> Canon Rebel T2i (EOS 550D) $899 </div>
       <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
       <div> AN_UCC-13: 013803123784 </div>
       <div class="price"> price: 899 USD </div>
       </div>
     </div>
  9. RDFa: RDF in attributes
     model your data as if they were Web pages connected with named links and properties, and embed them in your (X)HTML using @attributes
     - RDF, graph-based model
     - W3C Recommendation
     - highly extensible, e.g. GoodRelations [1], a full-fledged vocabulary for e-commerce
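  10. RDFa: RDF in attributes
      model your data
      [Graph diagram, as far as the flattened labels allow it to be read:]
      http://mystore.com/product/5642 --ex:producer--> http://canon.co.uk
      http://mystore.com/product/5642 --ex:description--> “The Rebel T2i EOS 550D blah blah”
      http://mystore.com/product/5642 --ex:price--> (price node)
      (price node) --ex:value--> 899
      (price node) --ex:currency--> USD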
  11. RDFa: RDF in attributes
      and then embed them in your (X)HTML pages
      <div about="http://mystore.com/product/5642">
        <div>Canon Rebel T2i (EOS 550D) $899</div>
        <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
        <div rel="gr:hasPriceSpecification">
          <span> price:
            <span property="gr:hasCurrencyValue">899</span>
            <span property="gr:hasCurrency">USD</span>
          </span>
        </div>
      </div>
  12. HTML5: Microdata
      Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content
      - W3C Working Draft
      - native to the HTML5 specification
      - serializable in RDF
      - Schema.org endorsed by Google, Yahoo! and Bing
      - large adoption expected
  13. HTML5: Microdata
      <div itemscope itemtype="http://schema.org/Offer">
        <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
        <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
        <div>
          <span> price:
            <span itemprop="price"> 899 </span>
            <span itemprop="priceCurrency"> USD </span>
          </span>
        </div>
      </div>
  14. % of marked up Web pages
      [Chart: percentage of Web pages (y-axis, 0 to 3.5) marked up with RDFa, hCard, adr, xfn and hReview, measured at 09/2008, 03/2009 and 10/2010; data from Yahoo! [2]]
  15. tie ‘em all together: uniform, reconciled and unified RDF representation
  16. a drop-by-drop distiller
      Anything To Triples (any23) is an open source, Apache-licensed:
      - Java library,
      - Web service and
      - a command-line tool
      able to distill RDF triples from a variety of semantically marked up Web documents
      http://developers.any23.org
  17. live demo
      http://any23.org
      Web site with ~5000 product descriptions marked up with GoodRelations using RDFa
  18. use Any23 in your Java programs
      // create the extraction entry point and set the HTTP user agent
      Any23 runner = new Any23();
      runner.setHTTPUserAgent("test-user-agent");
      // fetch the document over HTTP
      HTTPClient httpClient = runner.getHTTPClient();
      DocumentSource source = new HTTPDocumentSource(
          httpClient,
          "http://test.com/index.html"
      );
      // write the extracted triples as N-Triples into an in-memory buffer
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      TripleHandler handler = new NTriplesWriter(out);
      runner.extract(source, handler);
      String n3 = out.toString("UTF-8");
  19. Any23: Command-Line tool
      any23-core/bin$ ./any23
      usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
             [-p] [-s] [-t] [-v] {<url>|<file>}
       -e <arg>            comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
       -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
       -l,--log <arg>      logging, please specify a file
       -n,--nesting        disable production of nesting triples
       -o,--output <arg>   ouput file (defaults to stdout)
       -p,--pedantic       validates and fixes HTML content detecting commons issues
       -s,--stats          print out statistics of Any23
  20. Any23: Web Service
      blacky:~ davide$ curl "http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on"
      <response>
        <extractors>
          <extractor>rdf-xml</extractor>
        </extractors>
        <report>
          ...
          <validationReport>
            <ruleActivations></ruleActivations>
            ...
          </validationReport>
        </report>
        <data>
          <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
            <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
            <po:pid>b00kygwh</po:pid>
            <dc:title>The Terminator</dc:title>
          </rdf:Description>
        </data>
      </response>
  21. [Architecture diagram; components shown:]
      - Apache Tika: mimetype detection
      - Cyber Neko HTML: DOM extraction
      - Validator: Rule, Fix
      - Microdata Extractor
      - RDFa Extractor
      - Microformat Extractors: hListing, hReview, hCalendar, hCard
      - Sesame: RDF/XML Writer, NQuads Writer, JSON Writer
      - ExtractionResult
  22. extractor
      public interface Extractor<Input> {

          /**
           * Executes the extractor. Will be invoked only once, extractors are
           * not reusable.
           *
           * @param in The extractor's input
           * @param documentURI The document's URI
           * @param out Sink for extracted data
           * @throws IOException On error while reading from the input stream
           * @throws ExtractionException On other error, such as parse errors
           */
          void run(Input in, URI documentURI, ExtractionResult out)
              throws IOException, ExtractionException;

          /**
           * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
           * this extractor.
           */
          ExtractorDescription getDescription();
      }
  23. validate and fix
      public interface Rule {
          String getHRName();
          boolean applyOn(
              DOMDocument document,
              RuleContext context,
              ValidationReportBuilder validationReportBuilder
          );
      }

      public interface Fix {
          String getHRName();
          void execute(Rule rule, RuleContext context, DOMDocument document);
      }

      void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
  24. plugins
      @PluginImplementation
      @Author(name="Michele Mostarda (mostarda@fbk.eu)")
      public class HTMLScraperPlugin implements ExtractorPlugin {

          private static final Logger logger = LoggerFactory.getLogger(HTMLScraperPlugin.class);

          @Init
          public void init() {
              logger.info("Plugin initialization.");
          }

          @Shutdown
          public void shutdown() {
              logger.info("Plugin shutdown.");
          }

          public ExtractorFactory getExtractorFactory() {
              return HTMLScraperExtractor.factory;
          }
      }
  25. roadmap
      upcoming 0.6.0 release
      - support for Microdata
      - support for CSV
      - support for RDFa 1.1 prefix mechanism
      - improved app configuration
      - bug fixing
      Apache (pre) Incubation process
      - http://wiki.apache.org/incubator/Any23Proposal
      - supporters and mentors (thanks guys!): Simone Tripodi (@stripodi), Tommaso Teofili (@tteofili)
      - we’re looking for mentors
  26. closing credits
      active committers: Giovanni Tummarello (@jccq), Michele Mostarda (@micmos), Davide Palmisano (@dpalmisano), Richard Cyganiak (@cygri)
      thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bugs and feature requests
  27. References
      [1] http://purl.org/goodrelations/v1
      [2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
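
To make the idea of "distilling" on slides 13, 15 and 16 concrete, here is a minimal sketch that pulls the `itemprop` values out of the Microdata offer from slide 13 and prints them as N-Triples lines. It is not Any23's implementation and uses none of its API: it relies only on the JDK's XML parser, so the markup is assumed to be well-formed XML (hence `itemscope=""`), the class and method names are hypothetical, the subject URI is borrowed from slide 10 for illustration, and mapping each `itemprop` to `http://schema.org/` plus its name is a simplifying assumption that ignores nested items.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MicrodataToTriples {

    // Slide 13's markup, written as well-formed XML (itemscope="" instead of a bare attribute).
    public static final String OFFER_XHTML =
          "<div itemscope=\"\" itemtype=\"http://schema.org/Offer\">"
        + "<div itemprop=\"name\"> Canon Rebel T2i (EOS 550D) $899 </div>"
        + "<div itemprop=\"description\"> The Rebel T2i EOS 550D is Canon's blah blah</div>"
        + "<div><span> price: <span itemprop=\"price\"> 899 </span>"
        + "<span itemprop=\"priceCurrency\"> USD </span></span></div>"
        + "</div>";

    /** Collects itemprop name -> trimmed text content, in document order. */
    public static Map<String, String> extractItemProps(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (element.hasAttribute("itemprop")) {
                    props.put(element.getAttribute("itemprop"), element.getTextContent().trim());
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    /** Formats one N-Triples line whose object is a plain literal. */
    public static String toNTriple(String subject, String predicate, String literal) {
        String escaped = literal
            .replace("\\", "\\\\")
            .replace("\"", "\\\"")
            .replace("\n", "\\n");
        return "<" + subject + "> <" + predicate + "> \"" + escaped + "\" .";
    }

    /** Turns every itemprop into a triple: subject, vocabulary + property name, text value. */
    public static List<String> distill(String subjectUri, String vocabulary, String xhtml) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> prop : extractItemProps(xhtml).entrySet()) {
            triples.add(toNTriple(subjectUri, vocabulary + prop.getKey(), prop.getValue()));
        }
        return triples;
    }

    public static void main(String[] args) {
        // Subject URI and vocabulary prefix are illustrative assumptions.
        distill("http://mystore.com/product/5642", "http://schema.org/", OFFER_XHTML)
            .forEach(System.out::println);
    }
}
```

Real-world HTML is rarely well-formed XML, which is presumably why the architecture on slide 21 puts an HTML parser and a validate-and-fix step in front of the extractors; a strict XML parser like the one used here would reject most pages.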

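
The Web service call on slide 20 passes the document to distill as a `url` query parameter, next to `format` and `report`. When building such a request from Java, percent-encoding the nested URL keeps its own `:`, `/`, `&` or `#` characters from being mistaken for parts of the outer query string. The sketch below only assembles the request string and does not contact the service; the class and method names are hypothetical, it assumes Java 10 or later for `URLEncoder.encode(String, Charset)`, and it assumes the service decodes query parameters in the standard way.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Any23ServiceUrl {

    /** Builds a request URL using the parameters shown on slide 20: format, url and report. */
    public static String buildRequestUrl(String endpoint, String format,
                                         String documentUrl, boolean report) {
        StringBuilder request = new StringBuilder(endpoint);
        request.append("?format=").append(URLEncoder.encode(format, StandardCharsets.UTF_8));
        // The nested URL is percent-encoded so it survives as a single parameter value.
        request.append("&url=").append(URLEncoder.encode(documentUrl, StandardCharsets.UTF_8));
        if (report) {
            request.append("&report=on");
        }
        return request.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildRequestUrl(
            "http://any23.org/any23/",
            "nquads",
            "http://www.bbc.co.uk/programmes/b00kygwh",
            true));
    }
}
```

The same concern applies on the command line: in most shells an unquoted `&` ends the command and sends it to the background, so the request URL should be quoted when passed to `curl`.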