sitemap4rdf
generate Sitemap files from a SPARQL
              endpoint
          http://www.deri.ie/




     Boris Villazón-Terrazas and Richard Cyganiak (DERI)
    Facultad de Informática, Universidad Politécnica de Madrid
  Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid
                     http://www.oeg-upm.net
           Phone: 34.91.3366605, Fax: 34.91.3524819
ToC



•   Publishing Linked Data from a triple store
•   Search engines
•   The Sitemap protocol
•   sitemap4rdf
•   Summary
•   Future work




Linked Data frontends for triple stores




Source: Pubby website, http://www4.wiwiss.fu-berlin.de/pubby/


Sindice: the best RDF search engine




•   120M+ documents
•   Continuously updating since 2006
•   Search API
•   RDF/XML, Turtle, RDFa, microformats




Sitemap Protocol

• Used by web crawlers
• Efficiently find all your content & discover
  what has been updated
             http://sitemaps.org/




A sitemap file contains information regarding one or more URLs on
   your Web site. The information that is stored there helps search
   engines better spider your website.


Sitemap Protocol: Simple example

<?xml version="1.0" encoding="UTF-8"?>
<urlset
   xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://yoursite/</loc>
   </url>
   <url>
      <loc>http://yoursite/products/53546</loc>
   </url>
   <url>
      <loc>http://yoursite/products/98421</loc>
   </url>
   <url>
      <loc>http://yoursite/products/41003</loc>
   </url>
</urlset>
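
Reading such a file back is straightforward. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the helper name `list_locations` is hypothetical and is not part of sitemap4rdf. It shows that every element lives in the sitemaps.org namespace, so lookups must be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the example above.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def list_locations(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [loc.text.strip()
            for loc in root.iter(f"{{{SITEMAP_NS}}}loc")
            if loc.text]


EXAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url><loc>http://yoursite/</loc></url>
   <url><loc>http://yoursite/products/53546</loc></url>
</urlset>"""

print(list_locations(EXAMPLE))
```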


Sitemap Protocol: Optional parts




<?xml version="1.0" encoding="UTF-8"?>
<urlset
   xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://yoursite/</loc>
      <lastmod>2010-06-24</lastmod>
      <changefreq>daily</changefreq>
   </url>
</urlset>
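
A file with the optional parts could be generated roughly as follows. This is only a sketch under stated assumptions: the function `build_sitemap` and its tuple layout are hypothetical, chosen for illustration, and say nothing about how sitemap4rdf itself writes its output. Passing `None` omits an optional element.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize (loc, lastmod, changefreq) tuples as sitemap bytes.

    lastmod and changefreq are optional; pass None to leave one out.
    """
    # Emit the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod, changefreq in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod is not None:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
        if changefreq is not None:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}changefreq").text = changefreq
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("http://yoursite/", "2010-06-24", "daily"),
    ("http://yoursite/about", None, None),  # hypothetical second URL
])
print(xml_bytes.decode("utf-8"))
```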




Sitemap Protocol: Huge sitemaps


• Gzip-compress your sitemap
• Limit: 50k URLs or 10MB
  • split into multiple sitemap files
  • add a sitemap index file
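
The splitting step can be sketched as below. The file naming scheme (`sitemap1.xml.gz`, ...) and the function names are assumptions made for illustration. The sketch enforces only the 50k URL limit; a real tool would also need to check the 10MB size limit.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit of the Sitemap protocol


def _serialize(root_tag, child_tag, locs):
    """Build <root_tag> with one <child_tag><loc>...</loc></child_tag> per URL."""
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}{root_tag}")
    for loc in locs:
        child = ET.SubElement(root, f"{{{SITEMAP_NS}}}{child_tag}")
        ET.SubElement(child, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def split_sitemap(urls, base_url, max_urls=MAX_URLS):
    """Return ({filename: gzipped sitemap bytes}, sitemap index bytes)."""
    files = {}
    for start in range(0, len(urls), max_urls):
        name = f"sitemap{len(files) + 1}.xml.gz"  # hypothetical naming scheme
        chunk = urls[start:start + max_urls]
        files[name] = gzip.compress(_serialize("urlset", "url", chunk))
    # The index lists the public location of every sitemap file.
    index = _serialize("sitemapindex", "sitemap",
                       [base_url + name for name in files])
    return files, index


# Small limit so the split is visible with only five URLs.
urls = [f"http://yoursite/resource/{i}" for i in range(5)]
files, index = split_sitemap(urls, "http://yoursite/", max_urls=2)
print(sorted(files))
```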




Sitemap Protocol: Discovery

• Publish the sitemap file

• Add a line to http://yoursite/robots.txt
   •   Web site owners use the /robots.txt file to give instructions about their site
       to web robots; this is called The Robots Exclusion Protocol.




 Sitemap: http://yoursite/sitemap.xml
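
From the crawler's side, discovery amounts to scanning robots.txt for `Sitemap:` lines. The sketch below is a simplified illustration with a hypothetical helper, `sitemap_urls`; it treats everything after `#` as a comment and is not a full robots.txt parser.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named by `Sitemap:` lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")  # split at the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: http://yoursite/sitemap.xml
"""

print(sitemap_urls(ROBOTS))
```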




sitemap4rdf


• Simple command line tool
• Sends a SPARQL query to list all URIs
• Generates sitemap

sitemap4rdf http://yoursite/sparql http://yoursite/resource/

Example:

sitemap4rdf http://geo.linkeddata.es/sparql http://geo.linkeddata.es/


• Run sitemap4rdf, specifying the SPARQL endpoint
  and the prefix of the URLs to include in the Sitemap
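
The slides do not show the query that sitemap4rdf sends, so the following is only a guess at the general idea: list the subject URIs that start with the given prefix, then read them from the result set. The query text, the function names, and the use of `STRSTARTS` (SPARQL 1.1) are all assumptions; the `query` request parameter and the JSON result layout come from the SPARQL protocol and results format. No network request is made here.

```python
import json
from urllib.parse import urlencode


def build_query(prefix):
    """Assumed query (not sitemap4rdf's own) listing subject URIs under `prefix`."""
    escaped = prefix.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?s WHERE { ?s ?p ?o . "
        f'FILTER(isIRI(?s) && STRSTARTS(STR(?s), "{escaped}")) }}'
    )


def request_url(endpoint, prefix):
    """GET URL for the endpoint; the query travels in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": build_query(prefix)})


def uris_from_results(results_json):
    """Extract the ?s values from a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [b["s"]["value"] for b in bindings if b["s"]["type"] == "uri"]


print(build_query("http://geo.linkeddata.es/"))
print(request_url("http://geo.linkeddata.es/sparql", "http://geo.linkeddata.es/"))
```

The URIs returned by `uris_from_results` would then become the `<loc>` entries of the generated Sitemap.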


Submit the sitemap location - Sindice

• http://sindice.com/main/submit




Submit the sitemap location - Google

• https://www.google.com/webmasters/tools/




Summary

• Sitemap protocol informs search engines about
  available pages
   • Supported by Sindice!


• sitemap4rdf generates Sitemap files by listing URIs
  in a SPARQL endpoint
   • Open source, Java
   • http://lab.linkeddata.deri.ie/2010/sitemap4rdf/
   • http://mccarthy.dia.fi.upm.es/sitemap4rdf/
   • http://www.oeg-upm.net/index.php/en/downloads/122-sitemap4rdf




Future Work

• Integrate sitemap4rdf with Pubby

• Generate voiD file automatically from a SPARQL
  endpoint

• Generate an entry in CKAN (registry of open
  knowledge packages) automatically through the CKAN API
   • http://ckan.net/package/geolinkeddata


• Interact with prefix.cc (service for remembering and
  looking up URI prefixes) through its API
   • geoes: <http://geo.linkeddata.es/ontology>

Future Work

• Support the semantic sitemap extension (once it is
  compatible with Google)
   • http://sw.deri.org/2007/07/sitemapextension/




