distilling the Web of Data drop by drop (with Java)


Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano

Wednesday, June 29, 2011
the shortest introduction ever to the Web of Data

      Web page markup technologies are
      intended for human consumption

      they let machines present raw
      data to humans

      extracting valuable data may
      require fancy scraping techniques

      scraping: one size doesn’t fit all




     <div>
         <div> Canon Rebel T2i (EOS 550D) $899 </div>
         <div> The Rebel T2i EOS 550D is Canon's
             top-of-the-line consumer digital SLR
             camera. It can shoot up
             <div> AN_UCC-13: 013803123784 </div>
             <div> price: 899 USD </div>
         </div>
     </div>

     what does this tag mean?
     is this a currency or what?
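
To make the "one size doesn't fit all" point concrete, here is a minimal Java sketch of an ad-hoc scraper for the markup above. The class name, method name and regular expression are hypothetical, chosen only for this illustration; the pattern works only while the page keeps the exact "price: <amount> <currency>" wording.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaivePriceScraper {

    // Matches text such as "price: 899 USD". The pattern is an assumption
    // tailored to the sample markup, not a general rule.
    private static final Pattern PRICE =
            Pattern.compile("price:\\s*(\\d+(?:\\.\\d+)?)\\s+([A-Z]{3})");

    /** Returns "amount currency" if the page uses the expected wording. */
    public static Optional<String> scrapePrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find()
                ? Optional.of(m.group(1) + " " + m.group(2))
                : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<div><div> Canon Rebel T2i (EOS 550D) $899 </div>"
                + "<div> price: 899 USD </div></div>";
        // The sample page uses the wording the pattern expects.
        System.out.println(scrapePrice(page));
        // A site that words the price differently is silently missed.
        System.out.println(scrapePrice("<div> Price - $899 </div>"));
    }
}
```

Every site with a different layout needs its own pattern, which is the maintenance burden that the markup approaches on the following slides try to remove.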


“meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy
Microformats




    “Microformats are a way of adding simple markup
    to human-readable data items such as events,
    contact details or locations, on web pages”
                                        Andy Mabbett


    -    community driven initiative
    -    largely adopted
    -    quick & dirty
    -    scarcely extensible


Microformats

     <div class="hlisting item">
         <div> Canon Rebel T2i (EOS 550D) $899 </div>
         <div class="description"> The Rebel T2i EOS 550D is Canon's
             top-of-the-line consumer digital SLR
             camera. It can shoot up
             <div> AN_UCC-13: 013803123784 </div>
             <div class="price"> price: 899 USD </div>
         </div>
     </div>

RDFa: RDF in attributes

       model your data as if they were Web pages
       connected with named links and properties,
       and embed them in your (X)HTML using
       @attributes

       - RDF, graph-based model
       - W3C Recommendation
       - highly extensible

       e.g. GoodRelations [1], a full-featured
       vocabulary for e-commerce


RDFa: RDF in attributes

       model your data

       http://mystore.com/product/5642
           ex:producer      http://canon.co.uk
           ex:description   "The Rebel T2i EOS 550D blah blah"
           ex:price         (price node)
               ex:value         899
               ex:currency      USD
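
The graph on this slide can be written down as plain subject-predicate-object statements. The following Java sketch is illustrative only: the Triple class is a hypothetical helper rather than a type from any RDF library, the namespace chosen for ex: is an assumption, and the intermediate price node is modelled as a blank node.

```java
import java.util.ArrayList;
import java.util.List;

public class ProductGraph {

    /** One RDF statement; the object is a resource, a blank node or a plain literal. */
    public static final class Triple {
        public final String subject;
        public final String predicate;
        public final String object;
        public final boolean literal;

        public Triple(String subject, String predicate, String object, boolean literal) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.literal = literal;
        }

        /** Serializes the statement as one N-Triples line. */
        public String toNTriples() {
            String obj = literal
                    ? "\"" + object.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
                    : term(object);
            return term(subject) + " " + term(predicate) + " " + obj + " .";
        }

        // Blank nodes keep their _: label; everything else is written as an IRI.
        private static String term(String t) {
            return t.startsWith("_:") ? t : "<" + t + ">";
        }
    }

    // Assumed expansion of the ex: prefix used on the slide.
    static final String EX = "http://example.org/vocab#";

    /** Builds the five statements drawn in the diagram. */
    public static List<Triple> build() {
        String product = "http://mystore.com/product/5642";
        String price = "_:price";
        List<Triple> graph = new ArrayList<>();
        graph.add(new Triple(product, EX + "producer", "http://canon.co.uk", false));
        graph.add(new Triple(product, EX + "description",
                "The Rebel T2i EOS 550D blah blah", true));
        graph.add(new Triple(product, EX + "price", price, false));
        graph.add(new Triple(price, EX + "value", "899", true));
        graph.add(new Triple(price, EX + "currency", "USD", true));
        return graph;
    }

    public static void main(String[] args) {
        for (Triple t : build()) {
            System.out.println(t.toNTriples());
        }
    }
}
```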

RDFa: RDF in attributes

       and then embed them in your
       (X)HTML pages

    <div about="http://mystore.com/product/5642">
        <div>Canon Rebel T2i (EOS 550D) $899</div>
        <div property="gr:description">The Rebel T2i EOS 550D
            is Canon's blah blah</div>

        <div rel="gr:hasPriceSpecification">
            <span> price:
                <span property="gr:hasCurrencyValue">899</span>
                <span property="gr:hasCurrency">USD</span>
            </span>
        </div>
    </div>
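
To show how such attributes turn back into statements, here is a deliberately simplified Java sketch. It is not a conforming RDFa processor and not how Any23 is implemented: it handles only @about, @property and a @rel without an explicit target (which it links to a fresh blank node), it leaves prefixes such as gr: unexpanded, and it needs well-formed XHTML. The class name TinyRdfaWalker is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class TinyRdfaWalker {

    private int blankNodes = 0;
    private final List<String> triples = new ArrayList<>();

    /** Parses well-formed XHTML and returns "subject predicate object" strings. */
    public static List<String> extract(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            TinyRdfaWalker walker = new TinyRdfaWalker();
            walker.walk(root, null);
            return walker.triples;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    private void walk(Element e, String subject) {
        // @about sets the subject for this element and its descendants.
        if (e.hasAttribute("about")) {
            subject = "<" + e.getAttribute("about") + ">";
        }
        // @property emits a literal taken from the element's text content.
        if (e.hasAttribute("property") && subject != null) {
            triples.add(subject + " " + e.getAttribute("property")
                    + " \"" + e.getTextContent().trim() + "\"");
        }
        // Simplification: @rel without a target links to a fresh blank node,
        // which becomes the subject for the nested elements.
        if (e.hasAttribute("rel") && subject != null) {
            String node = "_:b" + blankNodes++;
            triples.add(subject + " " + e.getAttribute("rel") + " " + node);
            subject = node;
        }
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                walk((Element) c, subject);
            }
        }
    }

    public static void main(String[] args) {
        String page = "<div about=\"http://mystore.com/product/5642\">"
                + "<div property=\"gr:description\">The Rebel T2i EOS 550D</div>"
                + "<div rel=\"gr:hasPriceSpecification\"><span> price: "
                + "<span property=\"gr:hasCurrencyValue\">899</span> "
                + "<span property=\"gr:hasCurrency\">USD</span></span></div></div>";
        for (String t : extract(page)) {
            System.out.println(t);
        }
    }
}
```

The result has the same shape as the modelled graph: the product points to a price node, which carries the value and the currency.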

HTML5: Microdata

       Microdata allows nested groups of name-value
       pairs to be added to HTML documents, in
       parallel with the existing content

       - W3C Working Draft
       - native to the HTML5 specification
       - serializable as RDF

       - Google, Yahoo! and Bing endorsed Schema.org
       - large adoption expected


HTML5: Microdata

          <div itemscope itemtype="http://schema.org/Offer">
              <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
              <div itemprop="description"> The Rebel T2i EOS 550D
                  is Canon's blah blah</div>

              <div>
                  <span> price:
                      <span itemprop="price"> 899 </span>
                      <span itemprop="priceCurrency"> USD </span>
                  </span>
              </div>
          </div>
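
The name-value pairs can be read back with very little code. The Java sketch below uses only the JDK's XPath support; the class name ItempropReader is hypothetical, nested items are flattened into a single map, and because an XML parser is used the boolean attribute has to be written as itemscope="" in the input. A real extractor would use a tolerant HTML parser instead.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ItempropReader {

    /** Returns itemprop names mapped to trimmed text content, in document order. */
    public static Map<String, String> read(String xhtml) {
        try {
            // Select every element that carries an itemprop attribute.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[@itemprop]",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            Map<String, String> props = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                props.put(e.getAttribute("itemprop"), e.getTextContent().trim());
            }
            return props;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div itemscope=\"\" itemtype=\"http://schema.org/Offer\">"
                + "<div itemprop=\"name\"> Canon Rebel T2i (EOS 550D) $899 </div>"
                + "<span itemprop=\"price\"> 899 </span>"
                + "<span itemprop=\"priceCurrency\"> USD </span></div>";
        System.out.println(read(page));
    }
}
```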


% of marked up Web pages

      [chart: share of Web pages (scale 0 to 3.5%) marked up with RDFa, hCard, adr, xfn
       and hReview, measured in 09/2008, 03/2009 and 10/2010]

      data from Yahoo! [2]

tie ‘em all together




 uniform, reconciled and
 unified RDF representation

a drop-by-drop distiller

        Anything To Triples (any23) is an open source,
        Apache-licensed:

            - Java library,
            - Web service and
            - a command-line tool

        able to distill RDF triples from a
        variety of semantically marked up Web
        documents

        http://developers.any23.org

live demo http://any23.org

                Web site with ~5000 product descriptions marked up with
                GoodRelations using RDFa

use Any23 in your Java programs

      // Configure the runner and fetch the document over HTTP.
      Any23 runner = new Any23();
      runner.setHTTPUserAgent("test-user-agent");
      HTTPClient httpClient = runner.getHTTPClient();
      DocumentSource source = new HTTPDocumentSource(
            httpClient,
            "http://test.com/index.html"
      );
      // Collect the extracted triples in memory, serialized as N-Triples.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      TripleHandler handler = new NTriplesWriter(out);
      runner.extract(source, handler);
      String n3 = out.toString("UTF-8");




Any23: Command-Line tool

      any23-core/bin$ ./any23

      usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
             [-p] [-s] [-t] [-v] {<url>|<file>}
       -e <arg>            comma-separated list of extractors, e.g.
                           rdf-xml,rdf-turtle
       -f,--format <arg>   Output format [turtle (default),
                           ntriples, rdfxml, quad, uris]
       -l,--log <arg>      logging, please specify a file
       -n,--nesting        disable production of nesting triples
       -o,--output <arg>   ouput file (defaults to stdout)
       -p,--pedantic       validates and fixes HTML content
                           detecting commons issues
       -s,--stats          print out statistics of Any23


Any23: Web Service

  blacky:~ davide$ curl 'http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on'

      <response>
          <extractors>
              <extractor>rdf-xml</extractor>
          </extractors>
          <report>
              ...
              <validationReport>
                  <ruleActivations></ruleActivations>
                  ...
              </validationReport>
          </report>
          <data>
              <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
                  <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
                  <po:pid>b00kygwh</po:pid>
                  <dc:title>The Terminator</dc:title>
              </rdf:Description>
          </data>
      </response>
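
When the same request is issued from a program, the document URL is itself a query parameter, so it is safer to percent-encode it. The Java sketch below only builds the request URL from the three parameters shown in the curl call (format, url, report) and does not contact the service; the class name Any23RequestUrl is hypothetical, and encoding the nested URL is a general HTTP precaution rather than something the slide states.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class Any23RequestUrl {

    /** Builds a GET URL using the query parameters shown in the curl example. */
    public static String build(String serviceBase, String format,
                               String documentUrl, boolean report) {
        try {
            return serviceBase
                    + "?format=" + URLEncoder.encode(format, "UTF-8")
                    // Encode the nested URL so its own ':' and '/' cannot be
                    // confused with the outer request's syntax.
                    + "&url=" + URLEncoder.encode(documentUrl, "UTF-8")
                    + (report ? "&report=on" : "");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(
                "http://any23.org/any23/",
                "nquads",
                "http://www.bbc.co.uk/programmes/b00kygwh",
                true));
    }
}
```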

       mimetype detection     Apache Tika
       DOM extraction         Cyber Neko HTML
       Validator              Rule, Fix
       Extractors             Microdata Extractor, RDFa Extractor,
                              Microformat Extractors (hListing, hReview, hCalendar, hCard)
       ExtractionResult       Sesame
       Writers                RDF/XML Writer, NQuads Writer, JSON Writer



extractor
  public interface Extractor<Input> {

        /**
         * Executes the extractor. Will be invoked only once, extractors are
         * not reusable.
         *
         * @param in         The extractor's input
         * @param documentURI The document's URI
         * @param out        Sink for extracted data
         * @throws IOException         On error while reading from the input stream
         * @throws ExtractionException On other error, such as parse errors
         */
        void run(Input in, URI documentURI, ExtractionResult out)
               throws IOException, ExtractionException;

        /**
         * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
         * this extractor.
         */
        ExtractorDescription getDescription();

  }

validate and fix
  public interface Rule {

        String getHRName();

        boolean applyOn(
           DOMDocument document,
           RuleContext context,
           ValidationReportBuilder validationReportBuilder
        );
  }

  public interface Fix {

        String getHRName();

        void execute(Rule rule, RuleContext context, DOMDocument document);

  }



      void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
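
The rule/fix pairing can be illustrated with a self-contained Java sketch. Everything here is a simplified stand-in, not the Any23 API: SimpleRule and SimpleFix drop the RuleContext and ValidationReportBuilder arguments, the JDK's org.w3c.dom.Document replaces DOMDocument, addRule takes instances instead of classes, and the itemscope rule is a hypothetical example invented for this sketch.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MiniValidator {

    /** Stand-in for Rule: reports whether an issue is present in the document. */
    public interface SimpleRule { boolean applyOn(Document document); }

    /** Stand-in for Fix: repairs the issue detected by its rule. */
    public interface SimpleFix { void execute(Document document); }

    private final Map<SimpleRule, SimpleFix> rules = new LinkedHashMap<>();

    public void addRule(SimpleRule rule, SimpleFix fix) {
        rules.put(rule, fix);
    }

    /** Runs every rule, applies the paired fix when it fires, returns the number of fixes. */
    public int validate(Document document) {
        int applied = 0;
        for (Map.Entry<SimpleRule, SimpleFix> entry : rules.entrySet()) {
            if (entry.getKey().applyOn(document)) {
                entry.getValue().execute(document);
                applied++;
            }
        }
        return applied;
    }

    /** Hypothetical pair: an element declares itemtype but lacks itemscope. */
    public static MiniValidator withItemscopeRule() {
        MiniValidator validator = new MiniValidator();
        validator.addRule(
                doc -> firstMissingItemscope(doc) != null,
                doc -> {
                    Element e;
                    while ((e = firstMissingItemscope(doc)) != null) {
                        e.setAttribute("itemscope", "");
                    }
                });
        return validator;
    }

    private static Element firstMissingItemscope(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemtype") && !e.hasAttribute("itemscope")) {
                return e;
            }
        }
        return null;
    }

    /** Parses well-formed XML into a DOM document. */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div itemtype=\"http://schema.org/Offer\">"
                + "<span itemprop=\"price\">899</span></div>");
        System.out.println("fixes applied: " + withItemscopeRule().validate(doc));
    }
}
```

Keeping detection and repair in separate objects means a rule can be reported on without being fixed, which matches the split between Rule and Fix in the interfaces above.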


plugins

  @PluginImplementation
  @Author(name="Michele Mostarda (mostarda@fbk.eu)")
  public class HTMLScraperPlugin implements ExtractorPlugin {

      private static final Logger logger =
          LoggerFactory.getLogger(HTMLScraperPlugin.class);

      @Init
      public void init() {
          logger.info("Plugin initialization.");
      }

      @Shutdown
      public void shutdown() {
          logger.info("Plugin shutdown.");
      }

      public ExtractorFactory getExtractorFactory() {
          return HTMLScraperExtractor.factory;
      }

  }

roadmap
      upcoming 0.6.0 release
       - support for Microdata
       - support for CSV
       - support for RDFa 1.1 prefix mechanism
       - improved app configuration
       - bug fixes

      Apache (pre) Incubation process
          - http://wiki.apache.org/incubator/Any23Proposal
          - supporters and mentors (thanks guys!)
            Simone Tripodi (@stripodi)
            Tommaso Teofili (@tteofili)
          - we’re looking for mentors

closing credits




                                  active committers

                             Giovanni Tummarello ( @jccq )
                              Michele Mostarda ( @micmos )
                           Davide Palmisano ( @dpalmisano )
                              Richard Cyganiak ( @cygri )

                   thanks to the whole Semantic Web community,
                  especially those who tirelessly challenge us
                         with bugs and feature requests

References

      [1] http://purl.org/goodrelations/v1

      [2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/




distilling the Web of Data drop by drop (with Java)

  • 1. distilling the Web of Data drop by drop (with Java) Sourcesense UK “Last Wednesday” - Davide Palmisano @dpalmisano Wednesday, June 29, 2011
  • 2. the shortest introduction ever to the Web o f Data Web pages markup technologies are intended for human consumption they let machines to present raw data to humans extracting valuable data may require fancy scraping techniques scraping: one size doesn’t fit all Wednesday, June 29, 2011
  • 3. the shortest introduction ever to the Web o f Data <div> <div> Canon Rebel T2i (EOS 550D) $899 </div> <div> The Rebel T2i EOS 550D is Cannon's top-of-the-line consumer digital SLR camera. It can shoot up <div> AN_UCC-13: 013803123784 </div> <div> price: 899 USD </div> </div> </div> Wednesday, June 29, 2011
  • 4. the shortest introduction ever to the Web o f Data <div> <div> Canon Rebel T2i (EOS 550D) $899 </div> <div> The Rebel T2i EOS 550D is Cannon's top-of-the-line consumer digital SLR camera. It can shoot up <div> AN_UCC-13: 013803123784 </div> <div> price: 899 USD </div> what does this </div> tag mean? </div> Wednesday, June 29, 2011
  • 5. the shortest introduction ever to the Web of Data
      <div>
          <div> Canon Rebel T2i (EOS 550D) $899 </div>
          <div> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
              <div> AN_UCC-13: 013803123784 </div>
              <div> price: 899 USD </div>
          </div>
      </div>
      callout: what does this tag mean?
      callout: is this a currency or what?
  • 6. “meaning”, Joseph Kosuth, The Panza Collection, MART - Rovereto, Italy
  • 7. Microformats
      “Microformats are a way of adding simple markup to human-readable data items such as events, contact details or locations, on web pages” (Andy Mabbett)
      - community-driven initiative
      - widely adopted
      - quick & dirty
      - scarcely extensible
  • 8. Microformats
      <div class="hlisting item">
          <div> Canon Rebel T2i (EOS 550D) $899 </div>
          <div class="description"> The Rebel T2i EOS 550D is Canon's top-of-the-line consumer digital SLR camera. It can shoot up
              <div> AN_UCC-13: 013803123784 </div>
              <div class="price"> price: 899 USD </div>
          </div>
      </div>
  • 9. RDFa: RDF in attributes
      model your data as if they were Web pages connected by named links and properties, and embed them in your (X)HTML using @attributes
      - RDF, a graph-based model
      - W3C Recommendation
      - highly extensible, e.g. GoodRelations [1], a full-featured vocabulary for e-commerce
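  • 10. RDFa: RDF in attributes
      model your data (graph diagram):
      http://mystore.com/product/5642 --ex:price--> (price node)
      (price node) --ex:value--> 899
      (price node) --ex:currency--> USD
      http://mystore.com/product/5642 --ex:producer--> http://canon.co.uk
      http://mystore.com/product/5642 --ex:description--> “The Rebel T2i EOS 550D blah blah”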
  • 11. RDFa: RDF in attributes
      and then embed them in your (X)HTML pages
      <div about="http://mystore.com/product/5642">
          <div>Canon Rebel T2i (EOS 550D) $899</div>
          <div property="gr:description">The Rebel T2i EOS 550D is Canon's blah blah</div>
          <div rel="gr:hasPriceSpecification">
              <span> price:
                  <span property="gr:hasCurrencyValue">899</span>
                  <span property="gr:hasCurrency">USD</span>
              </span>
          </div>
      </div>
  • 12. HTML5: Microdata
      “Microdata allows nested groups of name-value pairs to be added to HTML documents, in parallel with the existing content”
      - W3C Working Draft
      - native to the HTML5 specification
      - serializable as RDF
      - Schema.org is endorsed by Google, Yahoo! and Bing
      - wide adoption expected
  • 13. HTML5: Microdata
      <div itemscope itemtype="http://schema.org/Offer">
          <div itemprop="name"> Canon Rebel T2i (EOS 550D) $899 </div>
          <div itemprop="description"> The Rebel T2i EOS 550D is Canon's blah blah</div>
          <div>
              <span> price:
                  <span itemprop="price"> 899 </span>
                  <span itemprop="priceCurrency"> USD </span>
              </span>
          </div>
      </div>
  • 14. % of marked up Web pages
      (chart: y-axis from 0 to 3.5; series RDFa, hCard, adr, xfn, hReview; measured at 09/2008, 03/2009 and 10/2010)
      data from Yahoo! [2]
  • 15. tie ’em all together: a uniform, reconciled and unified RDF representation
  • 16. a drop-by-drop distiller
      Anything To Triples (any23) is an open source, Apache-licensed:
      - Java library,
      - Web service and
      - command-line tool
      able to distill RDF triples from a variety of semantically marked up Web documents
      http://developers.any23.org
  • 17. live demo
      http://any23.org
      Web site with ~5000 product descriptions marked up with GoodRelations using RDFa
  • 18. use Any23 in your Java programs
      // create the Any23 facade and set the User-Agent sent with HTTP requests
      Any23 runner = new Any23();
      runner.setHTTPUserAgent("test-user-agent");
      // fetch the document to distill over HTTP
      HTTPClient httpClient = runner.getHTTPClient();
      DocumentSource source = new HTTPDocumentSource(
          httpClient,
          "http://test.com/index.html"
      );
      // collect the extracted triples in memory, serialized as N-Triples
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      TripleHandler handler = new NTriplesWriter(out);
      runner.extract(source, handler);
      // despite its name, n3 holds the N-Triples output
      String n3 = out.toString("UTF-8");
  • 19. Any23: Command-Line tool
      any23-core/bin$ ./any23
      usage: any23 [-e <arg>] [-f <arg>] [-l <arg>] [-n] [-o <arg>]
             [-p] [-s] [-t] [-v] {<url>|<file>}
       -e <arg>            comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
       -f,--format <arg>   Output format [turtle (default), ntriples, rdfxml, quad, uris]
       -l,--log <arg>      logging, please specify a file
       -n,--nesting        disable production of nesting triples
       -o,--output <arg>   ouput file (defaults to stdout)
       -p,--pedantic       validates and fixes HTML content detecting commons issues
       -s,--stats          print out statistics of Any23
  • 20. Any23: Web Service
      blacky:~ davide$ curl 'http://any23.org/any23/?format=nquads&url=http://www.bbc.co.uk/programmes/b00kygwh&report=on'
      <response>
          <extractors>
              <extractor>rdf-xml</extractor>
          </extractors>
          <report>
              ...
              <validationReport>
                  <ruleActivations></ruleActivations>
                  ...
              </validationReport>
          </report>
          <data>
              <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b00kygwh#programme">
                  <rdf:type rdf:resource="http://purl.org/ontology/po/Episode"/>
                  <po:pid>b00kygwh</po:pid>
                  <dc:title>The Terminator</dc:title>
              </rdf:Description>
          </data>
      </response>
  • 21. (architecture diagram)
      - Apache Tika: mimetype detection
      - CyberNeko HTML: DOM extraction
      - Validator: Rule, Fix
      - extractors: Microdata Extractor, RDFa Extractor, Microformat Extractors (hListing, hReview, hCalendar, hCard)
      - Sesame
      - writers: RDF/XML Writer, NQuads Writer, JSON Writer
      - ExtractionResult
  • 22. extractor
      public interface Extractor<Input> {

          /**
           * Executes the extractor. Will be invoked only once, extractors are
           * not reusable.
           *
           * @param in The extractor's input
           * @param documentURI The document's URI
           * @param out Sink for extracted data
           * @throws IOException On error while reading from the input stream
           * @throws ExtractionException On other error, such as parse errors
           */
          void run(Input in, URI documentURI, ExtractionResult out)
                  throws IOException, ExtractionException;

          /**
           * Returns a {@link org.deri.any23.extractor.ExtractorDescription} of
           * this extractor.
           */
          ExtractorDescription getDescription();
      }
  • 23. validate and fix
      public interface Rule {
          String getHRName();
          boolean applyOn(
              DOMDocument document,
              RuleContext context,
              ValidationReportBuilder validationReportBuilder
          );
      }

      public interface Fix {
          String getHRName();
          void execute(Rule rule, RuleContext context, DOMDocument document);
      }

      void addRule(Class<? extends Rule> rule, Class<? extends Fix> fix);
  • 24. plugins
      @PluginImplementation
      @Author(name="Michele Mostarda (mostarda@fbk.eu)")
      public class HTMLScraperPlugin implements ExtractorPlugin {

          private static final Logger logger = LoggerFactory.getLogger(HTMLScraperPlugin.class);

          @Init
          public void init() {
              logger.info("Plugin initialization.");
          }

          @Shutdown
          public void shutdown() {
              logger.info("Plugin shutdown.");
          }

          public ExtractorFactory getExtractorFactory() {
              return HTMLScraperExtractor.factory;
          }
      }
  • 25. roadmap
      upcoming 0.6.0 release
      - support for Microdata
      - support for CSV
      - support for the RDFa 1.1 prefix mechanism
      - improved app configuration
      - bug fixes
      Apache (pre) Incubation process
      - http://wiki.apache.org/incubator/Any23Proposal
      - supporters and mentors (thanks guys!): Simone Tripodi (@stripodi), Tommaso Teofili (@tteofili)
      - we’re looking for mentors
  • 26. closing credits
      active committers
      - Giovanni Tummarello (@jccq)
      - Michele Mostarda (@micmos)
      - Davide Palmisano (@dpalmisano)
      - Richard Cyganiak (@cygri)
      thanks to the whole Semantic Web community, especially those who tirelessly challenge us with bug reports and feature requests
  • 27. References
      [1] http://purl.org/goodrelations/v1
      [2] http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
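
The distilling idea behind slides 13, 15 and 16 can be sketched in plain Java with no dependencies. This is a minimal illustration, not Any23's implementation: Any23 parses a real DOM, whereas this sketch uses a regular expression that only copes with flat, non-nested itemprop elements. The class name MicrodataSketch, the distill method, and the choice to pass the subject URI and the vocabulary prefix in by hand are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Toy distiller (hypothetical, for illustration only): turns flat Microdata
 * itemprop/value pairs into N-Triples lines. It is not Any23 code.
 */
final class MicrodataSketch {

    // Matches <tag ... itemprop="name" ...>text</tag> where the text holds no nested markup.
    private static final Pattern ITEMPROP = Pattern.compile(
            "<(\\w+)[^>]*\\bitemprop=\"([^\"]+)\"[^>]*>([^<]*)</\\1>");

    /**
     * Returns one N-Triples line per itemprop found in the HTML fragment.
     * subjectUri and vocabulary are supplied by the caller (an assumption:
     * a real extractor derives them from the document and from itemtype).
     */
    static List<String> distill(String html, String subjectUri, String vocabulary) {
        List<String> triples = new ArrayList<>();
        Matcher m = ITEMPROP.matcher(html);
        while (m.find()) {
            String property = m.group(2);
            // Normalize whitespace, then escape backslashes and quotes for the literal.
            String value = m.group(3).trim().replaceAll("\\s+", " ")
                    .replace("\\", "\\\\").replace("\"", "\\\"");
            triples.add("<" + subjectUri + "> <" + vocabulary + property + "> \"" + value + "\" .");
        }
        return triples;
    }

    public static void main(String[] args) {
        String html =
                "<div itemscope itemtype=\"http://schema.org/Offer\">"
              + "<div itemprop=\"name\">Canon Rebel T2i (EOS 550D)</div>"
              + "<span itemprop=\"price\"> 899 </span>"
              + "<span itemprop=\"priceCurrency\"> USD </span>"
              + "</div>";
        for (String triple : distill(html, "http://mystore.com/product/5642", "http://schema.org/")) {
            System.out.println(triple);
        }
    }
}
```

Running main prints three N-Triples lines, one per itemprop. The regular expression is the weak point: nested items, properties carried in attributes such as content or href, and malformed HTML all need a DOM parser plus the validate-and-fix step described on slide 23.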