History and Background of the USEWOD Data Challenge


Introduction to the USEWOD2012 workshop at WWW2012, Lyon, France. Some background, motivation, statistics, potential problems.

  • Not so much about USEWOD in general, more about the challenge data in particular.
  • Each thing has a URI for the thing itself (the subject of a document) and different URIs for documents about that thing. Servers are set up to return a document based on the kind of data the requesting agent wants, and that choice shows up in the server logs.
  • Ratio of semantic hits to total hits.
  • The challenge dataset grew from the dataset used in this paper and has been growing constantly since then.
  • Close to (in the same university as) some of the people behind the DBpedia project. One of the main drivers of this project.
  • Observations about the logs: which language elements are used most? What are the characteristics of the triple patterns used? Very insightful. Useful for designing query evaluation engines or fine-tuning RDF stores.
  • TODO: ask Knud: is it the same DBpedia data? TODO: ask Knud: what time span is the data from? Open Geo Data: OpenStreetMap as RDF. Bio2RDF: Linked Data for life sciences. Credits go to Knud and Markus. Markus is close to (in the same university as) some of the people behind the DBpedia project. Knud was one of the main drivers of this project.
  • TODO: check the numbers.
  • The linked data twin of Wikipedia.
  • Screenshot of DBpedia.
  • TODO: use a different example.
  • Raw data, but anonymized.
  • Last year there were two interesting papers providing analysis of the dataset. Note: not all USEWOD papers were about the challenge dataset, and the same holds this year.
  • History and Background of the USEWOD Data Challenge

    1. 1. USEWOD2012: History and Background of the USEWOD Data Challenge. Knud Möller, Talis Systems Ltd., @knudmoeller
    2. 2. Möller, K., Hausenblas, M., Cyganiak, R., Grimnes, G., and Handschuh, S. (2010). Learning from linked open data usage: Patterns & metrics. In WebScience 2010, Raleigh, NC, USA. http://journal.webscience.org/302/
    3. 3. Linked Data
        • Conventional “Eye-ball” Web: interlinked documents; mainly people / Web browsers
        • Web of Linked Data: interlinked items of data (URIs, RDF); mainly machine agents (but also people?)
    5. 5. How is Linked Data being used?
        • plenty of research on conventional Web usage
        • what about usage of linked data?
        Why?
        • how healthy is the Web of linked data?
        • who is using the data and how? Is it useful? Are there trends?
        • providers: improve hosting
        • ... just curiosity!
    7. 7. Approach
        particular sites:
        – a URI for each data item ➙ a request for each data item (resource)
        – content negotiation best practices
        – redirection (HTTP 303)
        Example:
        – plain resource URI: http://data.semanticweb.org/conference/www/2009
        – RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
        – HTML document URI: http://data.semanticweb.org/conference/www/2009/html
    8. 8. Approach (ctd.)
        • server log files: common log format (CLF), combined log format
        • fields, in order: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent
          80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"
        • RDF requests vs. “semantic” requests
          90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
          90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 453586 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
    9. 9. [Cropped excerpts from the paper]
        Figure 1: The combined log format
        Table 1: Overview of four LOD datasets (per-day averages in parentheses)
        Dataset     | # triples   | # days | total # hits         | # plain hits         | # RDF hits         | # HTML hits          | SPARQL
        Dog Food    | 79,175      | 597    | 8,427,967 (14,117)   | 1,923,945 (3,223)    | 259,031 (434)      | 1,647,205 (2,759)    | 879,932 (1,471)
        DBpedia     | 109,750,000 | 118    | 87,203,310 (739,011) | 22,821,475 (193,402) | 7,008,310 (59,392) | 22,999,237 (194,909) | 20,972,630 (177,734)
        DBTune      | 74,209,000  | 61     | 7,467,125 (122,412)  | 1,952,185 (32,003)   | 1,135,509 (18,615) | 677,904 (11,113)     | 3,055,493 (50,090)
        RKBExplorer | 91,501,684  | 29     | 529,938 (18,274)     | -                    | -                  | -                    | 9,327 (322)
        [Pie charts for DBpedia, DBTune and Dog Food: shares of plain, HTML and RDF requests, with the semantic share marked]
    10. 10. Agents
        [Bar chart: hits per agent, http://data.semanticweb.org, 21/07/2008 - 20/06/2009; y-axis: hits (0 to 500000); x-axis: agents. Annotation: “ordinary traffic: the usual suspects”]
    11. 11. Agents: How “semantic” are they?
        [Bar chart: semantic hits/total hits (>100 semantic hits), scale 0 to 1, one bar per user agent, from crawlers and RDF libraries (e.g. sindicebot, rdflib-2.4.2, Falconsbot, Slug_SW_Crawler, ARC_Reader) to general-purpose clients (e.g. curl, Wget, libwww-perl, MSIE, Firefox, Python-urllib, Googlebot)]
    12. 12. Demand for LOD?
        [Line chart: DBpedia Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; y-axis 0 to 300000; x-axis 2009-06-20 to 2009-11-07. Annotation: “no increase for semantic requests”]
    13. 13. Impact of Real-world Events
        [Line chart: Irish Lisbon Treaty Referendum (smoothing factor 0.05); series: http://dbpedia.org/resource/Republic_of_Ireland, http://dbpedia.org/resource/European_Union, http://dbpedia.org/resource/Treaty_of_Lisbon; x-axis 2009-06-20 to 2009-11-07. Annotation: “possible impact”]
    14. 14. Impact of Real-world Events
        [Line chart: Michael Jackson Memorial Service (smoothing factor 0.05); series: http://dbpedia.org/resource/Staples_Center, http://dbpedia.org/resource/Michael_Jackson_memorial_service, http://dbpedia.org/resource/Michael_Jackson; x-axis 2009-06-20 to 2009-11-07. Annotation: “possible impact”]
    15. 15. • this research: one motivation for USEWOD
        • expand the dataset, encourage more and different analyses
    16. 16. USEWOD Data Challenge 2012. 2nd International Workshop on Usage Analysis and the Web of Data. Sponsored by the LATC project
    19. 19. USEWOD Data Challenge
        Moving forward by releasing a dataset:
        • to relieve the difficulty of obtaining real-life usage data
        • to allow comparison and combination of approaches done on the same dataset
        • by releasing a relatively new type of logs: usage on the Web of Data.
        Also for longer-term use.
    20. 20. The USEWOD Dataset 2011
        Server logs from two major Web of Data servers:
        • DBpedia: several weeks during 2 months of requests
        • Semantic Web Dog Food: 2 years of requests, Dec 2008 – Dec 2010
    26. 26. USEWOD 2011 Challenge Participants
        • At the time of the workshop, 11 groups had requested the 2011 data
        • By now: 28
        • 7 data challenge paper submissions
        • Winner of the 2011 USEWOD data challenge prize: Mario Arias Gallego, Javier D. Fernández, Miguel A. Martínez-Prieto and Pablo De La Fuente. An Empirical Study of Real-World SPARQL Queries.
    27. 27. The USEWOD Dataset 2012
        Server logs from two major Web of Data servers:
        • DBpedia: several weeks during 2 months of requests
        • Semantic Web Dog Food: 2 years of requests, Dec 2008 – Dec 2010
        • Linked Open Geo Data
        • Bio2RDF
    28. 28. USEWOD 2012 Challenge Participants
        • 20 groups have requested the data so far.
        • 2 data challenge paper submissions…
        • 1 winner of the USEWOD data challenge prize, kindly brought to you by LATC
    29. 29. DBpedia
    33. 33. Semantic Web Dog Food [Screenshots and images taken from http://data.semanticweb.org/]
    36. 36. Requests for Human / Machine readable Web data
        Both servers get requests for RDF:
        • http://dbpedia.org/data/Berlin.rdf
        as well as HTML:
        • http://dbpedia.org/page/Berlin
        and requests for the URI itself:
        • http://dbpedia.org/resource/Berlin (will be served HTML or RDF)
    37. 37. Requests to SPARQL endpoints
        • Both servers have a SPARQL endpoint to request RDF data:
          SELECT DISTINCT ?s ?t ?y ?to ?h
          WHERE {
            ?s dc:title ?t .
            ?s swrc:year ?y .
            OPTIONAL { ?s foaf:homepage ?h } .
            OPTIONAL { ?s foaf:topic ?t }
          }
          order by desc(?y)
          LIMIT 200
    39. 39. Anonymizing the USEWOD Dataset
        • IP addresses:
            • replace all IPs with 0.0.0.0
            • add the country code for the original IP address -> track location of requests
            • add an identifier of the original IP -> track individual requestors
    40. 40. USEWOD2011, Hyderabad, India
        • M. Kirchberg, R. K. L. Ko, and B. S. Lee. From linked data to relevant data - time is the essence. http://arxiv.org/abs/1103.5046
        • M. A. Gallego, J. D. Fernández, M. A. Martínez-Prieto, and P. D. L. Fuente. An empirical study of real-world SPARQL queries. http://arxiv.org/abs/1103.5043
    41. 41. USEWOD2012, Lyon, France
        • A. Raghuveer. Characterizing Machine Agent Behavior through SPARQL Query Mining. http://ir.ii.uam.es/usewod2012/usewod2012_raghuveer.pdf
        • J. Hoxha, M. Junghans, S. Agarwal. Enabling Semantic Analysis of User Browsing Patterns in the Web of Data. http://arxiv.org/abs/1204.2713
    42. 42. What could be improved?
        • does not work well with embedded metadata (e.g., RDFa-based sites)
        • does not take into account usage through meta sites (indexes, search engines, mirrors, ...)
        • does (probably) not take into account usage through apps
        • what about caches?
        • what about bulk/dump downloads of data?
        • not enough usage to be interesting yet?
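
Slides 7, 8 and 36 describe the raw material of the analysis: combined-log-format lines whose request path shows whether a client asked for the plain resource URI, the RDF document, the HTML document, or the SPARQL endpoint. The following is a minimal sketch of how such lines might be parsed and classified. The regular expression, the function names `parse_line` and `classify`, and the category labels are assumptions made for illustration; they are not the tooling used for the paper or the challenge dataset. The path rules follow the Dog Food (`/rdf`, `/html`) and DBpedia (`/data/`, `/page/`, `/resource/`) URI patterns shown on the slides.

```python
import re
from urllib.parse import urlsplit

# Combined log format: IP, identity, user, [date], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # The request string looks like: METHOD PATH PROTOCOL
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else ""
    return entry


def classify(request_path):
    """Label a request path as 'sparql', 'rdf', 'html' or 'plain'.

    Assumption: Dog Food style suffixes (/rdf, /html) and DBpedia style
    prefixes (/data/, /page/) identify document URIs; everything else is
    treated as a plain resource URI.
    """
    path = urlsplit(request_path).path
    if path == "/sparql":
        return "sparql"
    if path.endswith("/rdf") or path.startswith("/data/"):
        return "rdf"
    if path.endswith("/html") or path.startswith("/page/"):
        return "html"
    return "plain"


# The two log lines from slide 8: a 303 on the plain URI, then the RDF document.
SAMPLE_LINES = [
    '90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/'
    'vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" '
    '"rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"',
    '90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/'
    'vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 453586 "-" '
    '"rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"',
]

for sample in SAMPLE_LINES:
    parsed = parse_line(sample)
    print(parsed["status"], classify(parsed["path"]))
```

Counting the labels per user agent would give the kind of per-agent breakdown that slides 10 and 11 plot, although the exact definition of a “semantic” request used there is given in the paper rather than in this sketch.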

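The charts on slides 12 to 14 are labelled “smoothing factor 0.05”, but the slides do not say which smoothing method produced them. A minimal sketch, assuming simple exponential smoothing in which the factor is the weight given to each new daily count; the function name `exponential_smoothing` and the sample numbers are made up for illustration.

```python
def exponential_smoothing(values, alpha=0.05):
    """Smooth a series of daily hit counts.

    Assumption: s[0] = x[0] and s[t] = alpha * x[t] + (1 - alpha) * s[t-1].
    A small alpha such as 0.05 reacts slowly to spikes, so only sustained
    changes in request volume become visible.
    """
    smoothed = []
    previous = None
    for value in values:
        if previous is None:
            previous = float(value)
        else:
            previous = alpha * value + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed


# Hypothetical daily hit counts with a one-day spike.
daily_hits = [100, 100, 2000, 100, 100]
print(exponential_smoothing(daily_hits))
```

With a factor this small, a single-day spike moves the curve only slightly, which matches the cautious “possible impact” annotations on the event slides.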

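Slide 39 lists three anonymization steps for IP addresses: replace the IP with 0.0.0.0, add a country code, and add an identifier for the original IP. The sketch below shows one way these steps could be combined. It is not the procedure used to produce the USEWOD dataset: the output layout (two quoted fields appended to the line), the sequential identifiers, and the caller-supplied `country_of` lookup are all assumptions. A real pipeline would need a GeoIP database for the country lookup.

```python
def make_anonymizer(country_of):
    """Build a function that anonymizes combined-log-format lines.

    country_of: hypothetical callable mapping an IP string to a country code.
    Identifiers are handed out in order of first appearance, so they allow
    tracking one requestor across lines without revealing the IP itself.
    """
    ip_ids = {}

    def anonymize(line):
        # The IP is the first whitespace-separated field of the log line.
        ip, _, rest = line.partition(" ")
        ip_id = ip_ids.setdefault(ip, len(ip_ids) + 1)
        return f'0.0.0.0 {rest} "{country_of(ip)}" "{ip_id}"'

    return anonymize


# Usage with a stub lookup table instead of a real GeoIP database.
stub_countries = {"90.21.243.141": "FR"}
anonymize = make_anonymizer(lambda ip: stub_countries.get(ip, "??"))
print(anonymize('90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] '
                '"GET /sparql HTTP/1.1" 200 123 "-" "curl"'))
```

Sequential identifiers are used here instead of a hash of the IP because an unsalted hash over the small IPv4 address space can be reversed by brute force.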