Statistical Analysis of Web of Data Usage


Presentation as held at the "Workshop on Knowledge Evolution and Ontology Dynamics" co-located with ISWC 2011. Related to the paper http://ceur-ws.org/Vol-784/evodyn1.pdf

Published in: Technology, Education
  • This is not an approach for all kinds of domains, but within LOD we find characteristic ontologies and vocabularies. Dataset hosts do not necessarily know the requirements of the dataset users.
  • Roughly 25 per cent of all datasets were covered by the survey. That figure relates to the absolute number of datasets, not to the number of triples served. Some of the bigger ones replied, such as DBpedia and Bio2RDF.

    1. Statistical Analysis of Web of Data Usage: Towards (Visual) Maintenance Support for Dataset Publishers. Markus Luczak-Rösch, Markus Bischoff, Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
    2. Who is addressed?
         • rather small/simple ontologies: minimal effort for OE, "under-engineered"
         • unknown user requirements
    3. We propose: A Usage-dependent Life Cycle
         • Initial Release: RDB2RDF, crawling & transformation, …
         • Usage (requests and queries): SELECT * WHERE ?t a:madeOf a:Plastic; SELECT * WHERE ?t b:madeOf b:Wood
         • Negotiate understanding: re-engineering, re-population, …
    4. (Very) Quick Example
         • Which instruments does the band The Beatles consist of?
         • Are the Beatles a "Big Band"?
         • What are "British" bands?
    5. • Is it what the user expected to see?
         • Did you know that this happens, and do you know what to do now?
    6. Survey covering approx. 25% of all cloud datasets
         • size
         • complexity
         • engineering methodology
         • …
       → Publishers of most of the datasets do not have any (structured) idea of how to maintain their data. The survey ran in October 2010 and is not yet officially published.
    7. Role of the dataset publisher (more general)
         • use common vocabularies
         • provide RDF links to other data resources
         • provide schema mappings
       Figure: Effort Distribution between Publisher and Consumer (publisher provides links as hints; consumer generates/mines links). Source: talk by Christian Bizer, "Pay-as-you-go Data Integration" (21/9/2010)
    8. Role of the dataset publisher (more specific)*
         • Reliability → Is the data valid and complete?
         • Peak-load → Temporal profiles of important data?
         • Performance → Are caches and indexes optimal?
         • Usefulness → What do people find and use frequently?
         • Attacks → Is the data threatened by spam?
       * w.r.t. Möller et al.: Learning from Linked Open Data Usage: Patterns & Metrics.
    9. Our Usage-based Approach: digging in log files
    10. How do people access resources on the Web of Data?
          xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /page/Jeroen_Simaeys HTTP/1.1" 200 26777 "" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
          xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /resource/Guano_Apes HTTP/1.1" 303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
          xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%..." 200 1844 "" ""
        What do they get?
          • RDF graphs
          • SPARQL Query Results XML Format
          • …, HTML, JSON, … serialization of results
          • …, HTML, JSON, … serialization of no results (an HTTP 204 would be great here, but for now the usage mining process has to respect this)
    11. Figure: usage mining process with a Preparation Phase and a Mining Phase. Adapted from Myra Spiliopoulou: "Web usage mining for Web site evaluation", Commun. ACM, 2000. Figure labels: Log File, Preparation Tool, Prepared Log Data, Queries, Patterns, Triples, Filters, Sessions and Sequences, Mining Query, Instructions, Mining Tool, Usage Mining Methods, Mining Results, Access Methods and Patterns, Navigation Patterns, Statistics, Visualization Tool, Result Patterns.
    12. Preparation Process
          xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%..." 200 1844 "" ""
        Figure labels: Log Entry Selection, SPARQL Query Selection and Validation, Basic Graph Pattern Extraction, Triple Pattern Selection, Query Partitions Database, Query Partition Re-Execution, Query Partition Success Evaluation, Query Filter Determination.
    13. Usage Analysis
          • queries
          • patterns
          • triples
          • primitives
        Example triple: ns1:A rdf:type ns2:B
        Reference for details: M. Luczak-Rösch and H. Mühleisen, "Log File Analysis for Web of Data Endpoints," in Proc. of the 8th Extended Semantic Web Conference (ESWC) Poster-Session, 2011.
    14. Metrics
          • Ontology heat map: the amount a class or a predicate is used in queries
          • Primitive usage: position in triples; triple combinations
          • Resource usage: triple combinations in which a resource is used
    15. Metrics
          • Time statistics: hourly accesses
          • Hosts statistics: hourly accesses per host; primitives and triple patterns requested by host
          • Error statistics: triple patterns that contradict the schema but succeeded; triple patterns that fail due to the modelling
    16. Visualizations
          • network overview: weighted nodes and edges (depending on the applied metric) represent the amount of usage
          • zoom in and see details
    17. Evaluation Dataset
          • DBpedia 3.3 log files
              · 1,700,000 requests from two randomly chosen days (07/2009)
              · analysis against a mirror of the 3.3 dataset (inconsistent dataset)
              · performance issues of dynamic network visualization and reprocessing of queries → limited number of analyzed logs
    18. Starting Point for Visual Analysis
    19. Resource Analysis
    20. Predicate Analysis
    21. Access Time and Hosts Analysis (all hosts; specific host)
    22. Hosts and Primitives Analysis (specific host)
    23. Inconsistencies & Weaknesses
          • inconsistent data: ns:Band ns:instrument ?x; ns:Band ns:genre ?y; ns:Band ns:associatedBand ?z
          • missing facts: ns:Band ns:knownFor ?x; ns:Band ns:nationality ?y; …
        The complete analysis can be found at http://page.mi.fu-berlin.de/mluczak/pub/visual-analysis-of-web-of-data-usage-dbpedia33/
    24. What to learn from usage analysis?
          • ontology maintenance: schema evolution; instance population; ontology modularization; error detection
        Image source: http://mrg.bz/GgaxPB
    25. What else to learn?
          • performance scaling: index generation; store architecture based on frequent SPARQL patterns; hardware scaling at peak times; modularization of data for different hosts
    26. This is OK for the beginning, but … SONIVIS can do more: evaluate (with users!) various network visualizations and find the best one for a specific context.
    27. More for the Future
          • Generic patterns for the metrics + resolution/evolution patterns
          • Common sense of statistics + Quality-of-dataset index
          • Temporal analysis
          • Network metrics (degree, …)
          • Visualize the effects of change
        Central conclusion: Calculate statistics, weaknesses and inconsistencies first and do visual editing afterwards!
        Image source: http://mrg.bz/8Co9lA
    28. Take away
          • usage-dependent life cycle support for LOD vocabularies and the populated instances
          • (visual) usage analysis can help to plan and perform maintenance activities
          • this is a benefit for the dataset publisher and the Web of Data as a whole
        Markus Luczak-Rösch (luczak@inf.fu-berlin.de), Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
        Image source: http://mrg.bz/jlObbL
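
The log entries on slide 10 and the first preparation steps on slide 12 (selecting log entries and pulling out the SPARQL query) could be sketched roughly as follows. This is a minimal sketch, not the authors' tooling: it assumes the server writes the common "combined" access log layout seen on slide 10, and the names `LOG_LINE` and `parse_entry`, as well as the `kind` labels, are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: host, two unused fields, [timestamp], "request",
# status, size, "referrer", "user agent" (as in the slide 10 samples).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])

    # Classify the request by path; the labels are illustrative only.
    url = urlsplit(entry["target"])
    if url.path == "/sparql":
        entry["kind"] = "sparql"
        # parse_qs undoes the URL encoding ('+' becomes a space, %7B becomes '{').
        entry["query"] = parse_qs(url.query).get("query", [None])[0]
    elif url.path.startswith("/resource/"):
        entry["kind"] = "resource"
    elif url.path.startswith("/page/"):
        entry["kind"] = "page"
    else:
        entry["kind"] = "other"
    return entry
```

Calling `parse_entry` on the `/resource/Guano_Apes` line from slide 10 yields an entry with `kind == "resource"` and `status == 303`. For a `/sparql` request, the decoded query string ends up in `entry["query"]`, ready for validation and pattern extraction.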

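Two of the metrics from slides 14 and 15, primitive usage by position in triples and hourly accesses, could be computed along these lines. This is a naive sketch under stated assumptions: `triple_patterns` is a hypothetical helper that only handles a single brace-delimited group of whitespace-separated triple patterns joined by " . " (it is not a SPARQL parser, and literals containing spaces would break it), and the timestamp format is taken from the slide 10 log samples.

```python
import re
from collections import Counter
from datetime import datetime


def triple_patterns(query):
    """Naively extract (subject, predicate, object) tuples from one {...} group."""
    match = re.search(r"\{(.*)\}", query, re.DOTALL)
    if match is None:
        return []
    patterns = []
    for part in match.group(1).split(" . "):
        terms = part.split()
        if len(terms) == 3:  # skip anything that is not a plain triple pattern
            patterns.append(tuple(terms))
    return patterns


def primitive_usage(queries):
    """Count how often each non-variable term occurs in each triple position."""
    usage = Counter()
    for query in queries:
        for s, p, o in triple_patterns(query):
            for position, term in (("subject", s), ("predicate", p), ("object", o)):
                if not term.startswith("?"):  # variables are not primitives
                    usage[(term, position)] += 1
    return usage


def hourly_accesses(timestamps):
    """Count requests per hour of day from timestamps like '21/Sep/2009:00:00:00 -0600'.

    Note: %b depends on the locale; English month abbreviations are assumed.
    """
    return Counter(
        datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").hour for ts in timestamps
    )
```

The resulting counters could feed node and edge weights in a network view like the one on slide 16. A production version would rely on a real SPARQL parser rather than string splitting.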