
Statistical Analysis of Web of Data Usage


Presentation given at the "Workshop on Knowledge Evolution and Ontology Dynamics", co-located with ISWC 2011. Related to the paper

Published in: Technology, Education

  1. 1. Statistical Analysis of Web of Data Usage: Towards (Visual) Maintenance Support for Dataset Publishers. Markus Luczak-Rösch, Markus Bischoff. Freie Universität Berlin, Networked Information Systems
  2. 2. Who is addressed? • rather small/simple ontologies – minimal effort for ontology engineering (OE) – “under-engineered” • unknown user requirements
  3. 3. We propose: A Usage-dependent Life Cycle. Initial Release: • RDB2RDF • Crawling & transformation • … Usage (Requests and Queries): • SELECT * WHERE { ?t a:madeOf a:Plastic } • SELECT * WHERE { ?t b:madeOf b:Wood } Negotiate understanding: • Re-engineering • Re-population • …
  4. 4. (Very) Quick Example • Which instruments does the band The Beatles consist of? • Are the Beatles a “Big Band”? • What are “British” bands?
  5. 5. • Is it what the user expected to see? • Did you know that this happens, and do you know what to do now?
  6. 6. Survey covering approx. 25% of all cloud datasets • size • complexity • engineering methodology • … → Publishers of most of the datasets do not have any (structured) idea of how to maintain their data. Survey ran in October 2010; not yet officially published.
  7. 7. Role of the dataset publisher (more general) • use common vocabularies • provide RDF links to other data resources • provide schema mappings. Diagram (Effort Distribution between Publisher and Consumer): Publisher provides links; links as hints; Consumer generates/mines links. Source: talk by Christian Bizer, “Pay-as-you-go Data Integration” (21/9/2010)
  8. 8. Role of the dataset publisher (more specific)* • Reliability → Is the data valid and complete? • Peak-load → Temporal profiles of important data? • Performance → Are caches and indexes optimal? • Usefulness → What do people find and use frequently? • Attacks → Is the data threatened by spam? * w.r.t. Möller et al.: Learning from Linked Open Data Usage: Patterns & Metrics.
  9. 9. Our Usage-based Approach: digging in log files
  10. 10. How do people access resources on the Web of Data? - - [21/Sep/2009:00:00:00 -0600] "GET /page/Jeroen_Simaeys HTTP/1.1" 200 26777 "" "msnbot/2.0b (+" - - [21/Sep/2009:00:00:00 -0600] "GET /resource/Guano_Apes HTTP/1.1" 303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1; +" - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%..." 200 1844 "" "" What do they get? • RDF graphs • SPARQL Query Results XML Format • …, HTML, JSON, … serialization of results • …, HTML, JSON, … serialization of no results (HTTP 204 would be great here, but for now the usage mining process has to account for this)
  11. 11. Usage mining process (diagram), adapted from Myra Spiliopoulou: “Web usage mining for Web site evaluation”, 2000, Commun. ACM. Phases: Preparation Phase, Mining Phase. Labels in the diagram: Log File; Preparation Tool; Filters; Prepared Log Data; Mining Query; Instructions; Mining Tool; Usage Mining Methods; Mining Results; Access Methods and Patterns; Navigation Patterns; Queries; Patterns; Triples; Sessions and Sequences; Statistics; Visualization Tool; Result Patterns.
  12. 12. Preparation - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%..." 200 1844 "" "" Diagram (from a log entry to query partitions); steps include: Log Entry Selection; SPARQL Query Selection and Validation; Basic Graph Pattern Extraction; Triple Pattern Selection; Query Partition Filter; Query Partition Re-Execution; Evaluation; Query Success Determination; backed by a Query Partitions Database.
  13. 13. Usage Analysis • queries • patterns • triples • primitives (example triple: ns1:A rdf:type ns2:B) Reference for details: M. Luczak-Rösch and H. Mühleisen, "Log File Analysis for Web of Data Endpoints," in Proc. of the 8th Extended Semantic Web Conference (ESWC) Poster-Session, 2011.
  14. 14. Metrics • Ontology heat map – the amount a class or a predicate is used in queries • Primitive usage – position in triples – triple combinations • Resource usage – triple combinations in which a resource is used
  15. 15. Metrics • Time statistics – hourly accesses • Hosts statistics – hourly accesses per host – primitives and triple patterns requested by host • Error statistics – triple patterns that contradict the schema but succeeded – triple patterns that fail due to the modelling
  16. 16. Visualizations • weighted nodes and edges (depending on the applied metric) represent the amount of usage • network overview • zoom in and see details
  17. 17. Evaluation Dataset • DBpedia 3.3 log files – 1,700,000 requests from two randomly chosen days (07/2009) – analysis against a mirror of the 3.3 dataset (inconsistent dataset) – performance issues of dynamic network visualization and reprocessing of queries → limited number of analyzed logs
  18. 18. Starting Point for Visual Analysis
  19. 19. Resource Analysis
  20. 20. Predicate Analysis
  21. 21. Access Time and Hosts Analysis (panels: All hosts; Specific host)
  22. 22. Hosts and Primitives Analysis (panel: Specific host)
  23. 23. Inconsistencies & Weaknesses. Inconsistent data: • ns:Band ns:instrument ?x • ns:Band ns:genre ?y • ns:Band ns:associatedBand ?z Missing facts: • ns:Band ns:knownFor ?x • ns:Band ns:nationality ?y • … Complete analysis can be found at
  24. 24. What to learn from usage analysis? • ontology maintenance – schema evolution – instance population – ontology modularization – error detection
  25. 25. What else to learn? • performance scaling – index generation – store architecture based on frequent SPARQL patterns – hardware scaling at peak times – modularization of data for different hosts
  26. 26. This is OK for the beginning, but … SONIVIS can do more • evaluate (with users!) various network visualizations and find the best one for a specific context
  27. 27. More for the Future • Generic patterns for the metrics + resolution/evolution patterns • Common sense of statistics + Quality-of-dataset index • Temporal analysis • Network metrics (degree, …) • Visualize the effects of change. Central conclusion: Calculate statistics, weaknesses and inconsistencies first and do visual editing afterwards!
  28. 28. Take away • usage-dependent life cycle support for LOD vocabularies and the populated instances • (visual) usage analysis can help to plan and perform maintenance activities • this is a benefit for the dataset publisher and the Web of Data as a whole. Markus Luczak-Rösch, Freie Universität Berlin, Networked Information Systems
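
The preparation and metrics steps named on slides 10, 12, 14 and 15 can be illustrated with a minimal Python sketch. It is not the tooling used for the talk: the function names, the sample log lines (with documentation-range IP addresses) and the deliberately naive triple pattern splitter are all assumptions made for illustration. The splitter only handles plain `s p o .` patterns inside a single pair of braces; a real pipeline would use a proper SPARQL parser.

```python
import re
from collections import Counter
from datetime import datetime
from urllib.parse import urlsplit, parse_qs

# Start of an Apache-style access log line:
# host ident user [time] "METHOD path protocol" status size ...
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical sample entries, modelled on the log excerpts in the slides.
SAMPLE_LOG = [
    '192.0.2.1 - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query='
    'SELECT+*+WHERE+%7B+%3Ft+a%3AmadeOf+a%3APlastic+%7D HTTP/1.1" 200 1844 "" ""',
    '192.0.2.1 - - [21/Sep/2009:00:00:02 -0600] "GET /sparql?query='
    'SELECT+*+WHERE+%7B+%3Ft+b%3AmadeOf+b%3AWood+%7D HTTP/1.1" 200 512 "" ""',
    '198.51.100.7 - - [21/Sep/2009:01:10:00 -0600] '
    '"GET /resource/Guano_Apes HTTP/1.1" 303 0 "" ""',
]


def parse_entry(line):
    """Log entry selection: return host, hour, status and decoded SPARQL query."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    url = urlsplit(match["path"])
    # parse_qs decodes '+' and %XX escapes; non-SPARQL requests yield None.
    sparql = parse_qs(url.query).get("query", [None])[0]
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "host": match["host"],
        "hour": when.hour,
        "status": int(match["status"]),
        "path": url.path,
        "sparql": sparql,
    }


def triple_patterns(query):
    """Naive triple pattern extraction (assumption: simple 's p o .' patterns only)."""
    start, end = query.find("{"), query.rfind("}")
    if start == -1 or end <= start:
        return []
    patterns = []
    # Split on dots followed by whitespace or end of text, so dots in IRIs survive.
    for chunk in re.split(r"\s*\.(?:\s+|$)", query[start + 1:end].strip()):
        terms = chunk.split()
        if len(terms) == 3:
            patterns.append(tuple(terms))
    return patterns


def usage_statistics(lines):
    """Count predicate usage, hourly accesses and accesses per host."""
    predicates, hourly, per_host = Counter(), Counter(), Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        hourly[entry["hour"]] += 1
        per_host[entry["host"]] += 1
        if entry["sparql"]:
            for _subject, predicate, _obj in triple_patterns(entry["sparql"]):
                predicates[predicate] += 1
    return predicates, hourly, per_host


if __name__ == "__main__":
    predicates, hourly, per_host = usage_statistics(SAMPLE_LOG)
    print(predicates.most_common())
    print(sorted(hourly.items()))
    print(per_host.most_common())
```

The three counters correspond loosely to the ontology heat map (predicate usage), the time statistics and the hosts statistics from the metrics slides. Counting subjects and objects in the same loop would give the position-in-triple view of primitive usage.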