Query Expansion Methods and
                    Performance Evaluation
                              for
                Reusing Linking Open Data of the
              European Public Procurement Notices

                               José María Álvarez Rodríguez
                                   WESO-Universidad de Oviedo
                                   http://purl.org/weso/moldeas/
                      Tecnologías de Linked Data y sus aplicaciones en España (TLDE)
                                       CAEPIA 2011-Tenerife (Spain)
                                          8th of November, 2011

Code: TSI-020100-2010-919
Overview
 Use case & Context

SPARQL & Performance

     Next Steps
Objective




Creation of a pan-European
 e-procurement platform
E-procurement
   Long Tail
 TED
 BOE (official bulletin of the Spanish Government)
 BOPA (official bulletin of the Asturian Government)
To be able to answer…



     Which public procurement notices are
relevant to Dutch companies (only SMEs) that
  want to tender for contracts announced by
 local authorities with a total value lower than
 170K € to procure “Road bridge construction
  work” and a two year duration in the Dutch-
    speaking region of Flanders (Belgium)?
Structuring public procurement notices
[Diagram; labels:]
• Providing new semantic-based services
• LOD enrichment
• Easing the access to the published data using the LOD approach
• Transforming government classifications
Preliminary Results

Semantic-based
         Services


      Problem of
«Query Expansion»
depending on the kind of
  information variable
Methods of «Query Expansion»
Remembering…



     Which public procurement notices are
relevant to Dutch companies (only SMEs) that
  want to tender for contracts announced by
 local authorities with a total value lower than
 170K € to procure “Road bridge construction
  work” and a two year duration in the Dutch-
    speaking region of Flanders (Belgium)?
Query…
cpv:45221111-3, NL
[Diagram; property labels: ppn:nutsCode, ppn:hasDuration, cpv:CodeIn2008, ppn:hasAmount, org:classification]
Applying Query Expansion…
cpv:45221111-3, NL
[Diagram; property labels: ppn:nutsCode, ppn:hasDuration, cpv:CodeIn2008, ppn:hasAmount, org:classification]
Example of SPARQL
                         query
SELECT DISTINCT * WHERE {
   ?ppn    rdf:type         <http://purl.org/weso/ppn/def#ppn> .
   ?ppn    ppn:nutsCode     ?nutsCode .
   ?ppn    cpv:codeIn2008   ?cpvCode .
   ?ppn    ppn:hasDuration  ?duration .
   ?ppn    dc:identifier    ?id .
   ?ppn    dc:date          ?date .
   ?ppn    ppn:hasAmount    ?amount .
   FILTER(?cpvCode = cpv:45221111-3 ... ) .
   # Range operators (>=, &&, <=) are assumed here and in the duration filter.
   FILTER(
       (xsd:double(?amount) >= xsd:long(170000)) &&
       (xsd:double(?amount) <= xsd:long(200000)) ) .
   FILTER(?nutsCode = nuts:B3 ... ) .
   FILTER(
       (xsd:long(?duration) >= xsd:long(2)) &&
       (xsd:long(?duration) <= xsd:long(3)) ) .
}
Context

Performance of SPARQL
       Queries

     ~30 sec.
Hardware & Software
DELL PC, 2 GB RAM and 30 GB hard disk
VirtualBox (version 4.0.6)

Linux 2.6.35-22-server #33-Ubuntu 2 SMP x86_64 GNU/Linux
Ubuntu 10.10

OpenLink Virtuoso Opensource-6-20110218
Question?
How can query execution time be decreased
without modifying the hardware and without
using any vendor-specific feature?
Triple store
    25 graphs
20 M RDF triples
       But…
     8 graphs
11 M RDF triples
Focus on…
The generation of SPARQL
         queries
Let’s start…


9 SPARQL Queries

  3 executions
Simple SPARQL query

SELECT DISTINCT * WHERE {
   ?ppn    rdf:type         <http://purl.org/weso/ppn/def#ppn> .
   ?ppn    ppn:nutsCode     ?nutsCode .
   ?ppn    cpv:codeIn2008   ?cpvCode .
   ?ppn    ppn:hasDuration  ?duration .
   ?ppn    dc:identifier    ?id .
   ?ppn    dc:date          ?date .
   ?ppn    ppn:hasAmount    ?amount .
   FILTER(?cpvCode = cpv:15331137) .
   FILTER(?nutsCode = nuts:UK) .
}
Simple Query

    1 CPV Code
   1 NUTS Code


Time: ~3,29 sec.
T1

Rewrite SPARQL queries:
Match triples from specific to
           general
  Filter as soon as possible
T2

Use the LIMIT clause

 Value set to 10,000
Rewrite SPARQL query

SELECT DISTINCT * WHERE {
   ?ppn    rdf:type         <http://purl.org/weso/ppn/def#ppn> .
   ?ppn    cpv:codeIn2008   ?cpvCode .
   FILTER(?cpvCode = cpv:15331137) .
   ?ppn    ppn:nutsCode     ?nutsCode .
   FILTER(?nutsCode = nuts:UK) .
   ?ppn    ppn:hasDuration  ?duration .
   ?ppn    dc:identifier    ?id .
   ?ppn    dc:date          ?date .
   ?ppn    ppn:hasAmount    ?amount .
}
LIMIT 10000
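
Since the focus is on generating SPARQL queries, the rewritten form can be produced by a small template. The following is a minimal sketch, not the generator used in the project: the function name `build_rewritten_query` is an assumption, and prefix declarations are omitted as on the slides.

```python
# Sketch only: builds the rewritten simple query as a string.
# Prefix declarations (rdf:, cpv:, ppn:, dc:, nuts:) are omitted, as on the slides.
PPN_CLASS = "<http://purl.org/weso/ppn/def#ppn>"


def build_rewritten_query(cpv_code, nuts_code, limit=10000):
    """Place each FILTER right after the pattern that binds its variable
    (T1) and append a LIMIT clause when a limit is given (T2)."""
    lines = [
        "SELECT DISTINCT * WHERE {",
        f"  ?ppn rdf:type {PPN_CLASS} .",
        "  ?ppn cpv:codeIn2008 ?cpvCode .",
        f"  FILTER(?cpvCode = {cpv_code}) .",
        "  ?ppn ppn:nutsCode ?nutsCode .",
        f"  FILTER(?nutsCode = {nuts_code}) .",
        "  ?ppn ppn:hasDuration ?duration .",
        "  ?ppn dc:identifier ?id .",
        "  ?ppn dc:date ?date .",
        "  ?ppn ppn:hasAmount ?amount .",
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


print(build_rewritten_query("cpv:15331137", "nuts:UK"))
```

Passing `limit=None` gives the T1 variant without the LIMIT clause.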
Results T2

    1 CPV Code
   1 NUTS Code



Time: ~3,26 sec.
Evaluation

  There are no significant
changes in execution time
        or gain…
           and
    we are interested in
    “enhanced queries”
T3

Execution of enhanced
       queries
Enhanced SPARQL
            query
SELECT DISTINCT * WHERE {
   ?ppn    rdf:type         <http://purl.org/weso/ppn/def#ppn> .
   ?ppn    ppn:nutsCode     ?nutsCode .
   ?ppn    cpv:codeIn2008   ?cpvCode .
   ?ppn    ppn:hasDuration  ?duration .
   ?ppn    dc:identifier    ?id .
   ?ppn    dc:date          ?date .
   ?ppn    ppn:hasAmount    ?amount .
   FILTER(?cpvCode = cpv:15331137 || ?cpvCode = cpv:48611000 ||
          ?cpvCode = cpv:48611000 || ?cpvCode = cpv:50531510 ||
          ?cpvCode = cpv:15871210) .
   FILTER(?nutsCode = nuts:B3 || ?nutsCode = nuts:PL || ?nutsCode = nuts:RO) .
}
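
An enhanced query needs one FILTER per variable that accepts every code produced by query expansion. A minimal sketch of how such a filter could be generated, assuming a disjunction of equality tests (the SPARQL 1.1 `IN` operator would be an alternative). The helper name `expansion_filter` is hypothetical.

```python
def expansion_filter(variable, codes):
    """Return a FILTER accepting any of the expanded codes for `variable`."""
    unique_codes = list(dict.fromkeys(codes))  # drop repeats, keep order
    if not unique_codes:
        raise ValueError("at least one code is required")
    tests = [f"{variable} = {code}" for code in unique_codes]
    return "FILTER(" + " || ".join(tests) + ")"


print(expansion_filter("?nutsCode", ["nuts:B3", "nuts:PL", "nuts:RO"]))
```

Dropping repeated codes keeps the filter short when two expansion methods return the same code.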
Results T3

    5 CPV Codes
   3 NUTS Codes
       1 query


Time: ~20,65 sec.
T4

Rewrite SPARQL queries
           +
 Use the LIMIT clause
Results T4 wrt T3

    5 CPV Codes
   3 NUTS Codes
       1 query


Time: ~20,55 sec.
Info

     8 graphs

11 M RDF triples
T5

Rewrite SPARQL queries
            +
  Use the LIMIT clause
            +
 Named Graphs (FROM)
Results T5 wrt T3

    5 CPV Codes
   3 NUTS Codes
       1 query


Time: ~20,65 sec.
T6
Rewrite SPARQL queries
             +
  Use the LIMIT clause
             +
 Named Graphs (FROM)
             +
Split into simple queries
Results T6 wrt T3
    5 CPV Codes
   3 NUTS Codes
      4 Graphs
  4 simple queries

Time: ~20,60 sec.
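
The difference between T5 (one query over several graphs) and T6 (one simple query per graph) can be sketched as follows. The graph IRIs are hypothetical, since the slides do not name the graphs, and the WHERE block is shortened to a single pattern for readability.

```python
# Shortened pattern; the full triple patterns are those of the enhanced query.
WHERE_BLOCK = "WHERE { ?ppn cpv:codeIn2008 ?cpvCode . }"

# Hypothetical graph IRIs (the slides only state that 4 graphs are used).
GRAPHS = [f"http://example.org/graph/ppn-{n}" for n in range(1, 5)]


def query_over_graphs(graphs, limit=10000):
    """T5 style: a single query with one FROM clause per graph."""
    from_lines = [f"FROM <{graph}>" for graph in graphs]
    return "\n".join(["SELECT DISTINCT *", *from_lines, WHERE_BLOCK, f"LIMIT {limit}"])


def queries_per_graph(graphs, limit=10000):
    """T6 style: one simple query per graph."""
    return [query_over_graphs([graph], limit) for graph in graphs]


print(query_over_graphs(GRAPHS))
print(len(queries_per_graph(GRAPHS)), "simple queries")
```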
T6-1
        Rewrite SPARQL queries
                    +
          Use the LIMIT clause
                    +
         Named Graphs (FROM)
                    +
Split enhanced query into simple queries
                    +
   Parallelization of query execution
          (ad-hoc map/reduce)
Results T6-1 wrt T3
    5 CPV Codes
   3 NUTS Codes
      4 Graphs
  4 simple queries

Time: ~11,93 sec.
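
The slides do not show the ad-hoc map/reduce itself. A minimal sketch of the idea, assuming a thread pool for the map step and a duplicate-free merge for the reduce step: `execute` stands for whatever function sends one query to the endpoint, and `fake_execute` is a hypothetical stand-in with canned rows.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def run_in_parallel(queries, execute, max_workers=4):
    """Run every simple query concurrently, then merge the result rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(execute, queries))  # map step
    # Reduce step: concatenate the rows and drop duplicates, keeping order.
    merged = dict.fromkeys(chain.from_iterable(partial_results))
    return list(merged)


def fake_execute(query):
    """Hypothetical stand-in for a call to the SPARQL endpoint."""
    canned = {
        "q1": [("ppn:1",), ("ppn:2",)],
        "q2": [("ppn:2",), ("ppn:3",)],
    }
    return canned[query]


print(run_in_parallel(["q1", "q2"], fake_execute))
```

Rows are modelled as tuples so that they are hashable. Threads fit this case because the work is waiting on the triple store rather than local computation.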
T7
        Rewrite SPARQL queries
                   +
         Use the LIMIT clause
                   +
Split enhanced query into simple queries
Results T7 wrt T3
   1 CPV Code (5)
   3 NUTS Codes
  5 simple queries


Time: ~15,81 sec.
T7-1
        Rewrite SPARQL queries
                    +
          Use the LIMIT clause
                    +
Split enhanced query into simple queries
                    +
   Parallelization of query execution
          (ad-hoc map/reduce)
Results T7-1 wrt T3
   1 CPV Code (5)
   3 NUTS Codes
  5 simple queries


Time: ~10,55 sec.
T8
Rewrite SPARQL queries
             +
  Use the LIMIT clause
             +
 Named Graphs (FROM)
             +
Split into simple queries
Results T8 wrt T3
   1 CPV Code (5)
   3 NUTS Codes
      4 Graphs
  20 simple queries

Time: ~32,34 sec.
T8-1
        Rewrite SPARQL queries
                    +
          Use the LIMIT clause
                    +
         Named Graphs (FROM)
                    +
Split enhanced query into simple queries
                    +
   Parallelization of query execution
          (ad-hoc map/reduce)
Results T8-1 wrt T3
   1 CPV Code (5)
   3 NUTS Codes
      4 Graphs
  20 simple queries

Time: ~18,45 sec.
T9
        Rewrite SPARQL queries
                    +
          Use the LIMIT clause
                    +
Split enhanced query into simple queries
       (1 CPV code+1 NUTS code)
Results T9 wrt T3
    1 CPV Code (5)
   1 NUTS Code (3)
  15 simple queries


Time: ~22,462 sec.
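
The number of simple queries in each test follows from how the enhanced query is split. A sketch of the combinations, with hypothetical placeholder names for the CPV codes and graphs; the function name `split_parameters` is also an assumption.

```python
from itertools import product


def split_parameters(cpv_codes, nuts_codes, graphs=None, split_nuts=False):
    """Return one (cpv_codes, nuts_codes, graph) tuple per simple query."""
    # Either one NUTS code per query, or all NUTS codes kept together.
    nuts_groups = [(n,) for n in nuts_codes] if split_nuts else [tuple(nuts_codes)]
    graph_choices = list(graphs) if graphs else [None]
    return [
        ((cpv,), nuts, graph)
        for cpv, nuts, graph in product(cpv_codes, nuts_groups, graph_choices)
    ]


cpv = [f"cpv:code{i}" for i in range(1, 6)]  # placeholders for 5 CPV codes
nuts = ["nuts:B3", "nuts:PL", "nuts:RO"]
graphs = ["g1", "g2", "g3", "g4"]            # placeholders for 4 graphs

print(len(split_parameters(cpv, nuts)))                           # T7
print(len(split_parameters(cpv, nuts, graphs)))                   # T8
print(len(split_parameters(cpv, nuts, split_nuts=True)))          # T9
print(len(split_parameters(cpv, nuts, graphs, split_nuts=True)))  # T10
```

The four counts are 5, 20, 15 and 60, matching the "simple queries" figures reported for T7, T8, T9 and T10.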
T9-1
        Rewrite SPARQL queries
                    +
          Use the LIMIT clause
                    +
Split enhanced query into simple queries
       (1 CPV code+1 NUTS code)
                    +
   Parallelization of query execution
           (ad-hoc map/reduce)
Results T9-1 wrt T3
    1 CPV Code (5)
   1 NUTS Code (3)
  15 simple queries


Time: ~12,77 sec.
T10
 Rewrite SPARQL queries
               +
   Use the LIMIT clause
               +
  Named Graphs (FROM)
               +
  Split into simple queries
(1 CPV code+1 NUTS code)
Results T10 wrt T3
    1 CPV Code (5)
   1 NUTS Code (3)
       4 Graphs
  60 simple queries

Time: ~71,17 sec.
T10-1
        Rewrite SPARQL queries
                    +
          Use the LIMIT clause
                    +
         Named Graphs (FROM)
                    +
Split enhanced query into simple queries
       (1 CPV code+1 NUTS code)
                    +
   Parallelization of query execution
           (ad-hoc map/reduce)
Results T10-1 wrt T3
    1 CPV Code (5)
   1 NUTS Code (3)
       4 Graphs
  60 simple queries

Time: ~35,13 sec.
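Table of Results

Test          Queries   Time (sec.)
Simple query  1         ~3,29
T2            1         ~3,26
T3            1         ~20,65
T4            1         ~20,55
T5            1         ~20,65
T6            4         ~20,60
T6-1          4         ~11,93
T7            5         ~15,81
T7-1          5         ~10,55
T8            20        ~32,34
T8-1          20        ~18,45
T9            15        ~22,462
T9-1          15        ~12,77
T10           60        ~71,17
T10-1         60        ~35,13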
Discussion
•       The number of queries is a key factor
•       More CPV codes imply a longer execution time
•       Parallelization improves execution time
•       T7-1 gives the best execution time:
    •     Rewrite SPARQL queries
    •     Use the LIMIT clause
    •     Split enhanced query into simple queries
    •     Parallelization of query execution
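
The comparison against T3 can be checked with a few lines of arithmetic. The times are those reported on the result slides; the definition of gain as the relative reduction with respect to T3 is an assumption.

```python
T3_TIME = 20.65  # seconds, enhanced query without any optimisation

# Times in seconds, as reported on the result slides.
times = {
    "T4": 20.55, "T5": 20.65, "T6": 20.60, "T6-1": 11.93,
    "T7": 15.81, "T7-1": 10.55, "T8": 32.34, "T8-1": 18.45,
    "T9": 22.462, "T9-1": 12.77, "T10": 71.17, "T10-1": 35.13,
}


def gain_percent(time, baseline=T3_TIME):
    """Relative reduction in execution time with respect to the baseline."""
    return (baseline - time) / baseline * 100


gains = {test: gain_percent(seconds) for test, seconds in times.items()}
best = min(times, key=times.get)

for test, gain in gains.items():
    print(f"{test:6s} {times[test]:7.3f} s  {gain:+7.1f} %")
print("best:", best)
```

A negative gain means the test was slower than T3, which is the case for the splits that generate many queries without parallelization.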
Further Steps

• Distribute graphs across different nodes (HW improvement)
• Use other triple stores (SW comparison)
• Add SPARQL 1.1 new features (expressiveness improvement)
• Cache queries (SW improvement)