This is not an approach for all kinds of domains, but within LOD we find characteristic ontologies and vocabularies. Dataset hosts do not necessarily know the requirements of the dataset users.
Roughly 25 per cent of all datasets were covered by the survey. That relates to the absolute number of datasets, not the amount of triples served. Some of the bigger ones replied, such as DBpedia and Bio2RDF.
Statistical Analysis of Web of Data Usage
Towards (Visual) Maintenance Support for Dataset Publishers
Markus Luczak-Rösch, Markus Bischoff
Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
Who is addressed?
• rather small/simple ontologies
  – min. effort for ontology engineering (OE)
  – “under-engineered”
• unknown user requirements
We propose: A Usage-dependent Life Cycle
Initial Release
• RDB2RDF
• Crawling & transformation
• …
USAGE: Requests and Queries
• SELECT * WHERE ?t a:madeOf a:Plastic
• SELECT * WHERE ?t b:madeOf b:Wood
Negotiate understanding
• Re-engineering
• Re-population
• …
(Very) Quick Example
• Which instruments does the band The Beatles consist of?
• Are the Beatles a “Big Band”?
• What are “British” bands?
• Is it what the user expected to see?
• Did you know that this happens, and do you know what to do now?
Survey covering approx. 25% of all cloud datasets
• size
• complexity
• engineering methodology
• …
Publishers of most of the datasets do not have any (structured) idea of how to maintain their data.
Survey ran in October 2010; not yet officially published.
Role of the dataset publisher (more general)
• use common vocabularies
• provide RDF links to other data resources
• provide schema mappings
[Figure: Effort Distribution between Publisher and Consumer. Labels: “Consumer generates/mines links”, “Publisher provides links as hints”.]
Source: Talk of Chris Bizer. Christian Bizer: Pay-as-you-go Data Integration (21/9/2010)
Role of the dataset publisher (more specific)*
• Reliability: Is the data valid and complete?
• Peak-load: Temporal profiles of important data?
• Performance: Are caches and indexes optimal?
• Usefulness: What do people find and use frequently?
• Attacks: Is the data threatened by spam?
* w.r.t. Möller et al.: Learning from Linked Open Data Usage: Patterns & Metrics.
How do people access resources on the Web of Data?
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /page/Jeroen_Simaeys HTTP/1.1" 200 26777 "" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /resource/Guano_Apes HTTP/1.1" 303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%..." 200 1844 "" ""
What do they get?
• RDF graphs
• SPARQL Query Results XML Format
• …, HTML, JSON, … serialization of results
• …, HTML, JSON, … serialization of no results
A 204 status for empty results would be great, but for now the usage mining process has to take this into account.
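The three kinds of access in the log excerpt (HTML page, resource URI answered with a 303, SPARQL endpoint) can be told apart mechanically. The following is a minimal Python sketch, assuming the Apache combined log format seen in the excerpt and the `/page/`, `/resource/` and `/sparql` path prefixes shown there; the names `LOG_LINE` and `parse_line` are hypothetical and not part of the presented tooling.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: Apache "combined" log format, as in the excerpt above.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<target>[^\s"]+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict describing one log entry, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    url = urlparse(record["target"])
    if url.path == "/sparql":
        record["access"] = "sparql"
        # parse_qs decodes '+' and percent-escapes in the query parameter.
        record["query"] = parse_qs(url.query).get("query", [None])[0]
    elif url.path.startswith("/resource/"):
        record["access"] = "resource"
    elif url.path.startswith("/page/"):
        record["access"] = "page"
    else:
        record["access"] = "other"
    return record
```

Because empty results are served like any other response, the status code alone does not reveal whether a SPARQL request returned anything; a fuller preparation step would need to inspect or re-run the query.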
Adapted from Myra Spiliopoulou: “Web usage mining for Web site evaluation”, Commun. ACM, 2000
[Figure: usage mining process. Labels in the figure:]
• Preparation Phase: Log File, Preparation Tool, Prepared Log Data
• Mining Phase: Instructions, Mining Query, Usage Mining Methods, Mining Results, Result Patterns, Visualization Tool
• Access Methods and Patterns, Navigation Patterns, Queries, Patterns, Triples, Filters, Sessions and Sequences, Statistics
Usage Analysis
• queries
• patterns
• triples
• primitives
Example triple: ns1:A rdf:type ns2:B
Reference for details: M. Luczak-Rösch and H. Mühleisen, "Log File Analysis for Web of Data Endpoints," in Proc. of the 8th Extended Semantic Web Conference (ESWC) Poster-Session, 2011.
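The levels named on this slide (query, pattern, triples, primitives) can be illustrated with a deliberately simple decomposition. This is a sketch under strong assumptions, not the method of the referenced paper: it handles only basic graph patterns written as full `s p o .` triples inside one pair of braces, and ignores `;`/`,` abbreviations, `OPTIONAL`, `FILTER`, and literals with language tags or datatypes. The names `TOKEN`, `triple_patterns` and `primitives` are hypothetical.

```python
import re

# Tokens of a very small SPARQL subset (assumption: basic graph patterns only).
TOKEN = re.compile(r'''
    <[^>]*>                    # full IRI
  | "[^"]*"                    # plain literal without escapes
  | [?$]\w+                    # variable
  | [A-Za-z_][\w-]*:[\w-]+     # prefixed name such as ns1:A
  | \ba\b                      # keyword 'a', shorthand for rdf:type
  | \.                         # triple separator
''', re.VERBOSE)

def triple_patterns(query):
    """Split the text between the outermost braces into (s, p, o) tuples."""
    start, end = query.find("{"), query.rfind("}")
    if start == -1 or end <= start:
        return []
    patterns, current = [], []
    for token in TOKEN.findall(query[start + 1:end]):
        if token == ".":
            current = []  # drop any incomplete triple at a separator
            continue
        current.append("rdf:type" if token == "a" else token)
        if len(current) == 3:
            patterns.append(tuple(current))
            current = []
    return patterns

def primitives(pattern):
    """Return (position, term) pairs for the non-variable terms of one triple."""
    positions = ("subject", "predicate", "object")
    return [(pos, term) for pos, term in zip(positions, pattern)
            if term[0] not in "?$"]
```

For real log data a full SPARQL parser would be the safer choice; the regular expression above is only meant to make the four levels concrete.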
Metrics
• Ontology heat map
  – the amount a class or predicate is used in queries
• Primitive usage
  – position in triples
  – triple combinations
• Resource usage
  – triple combinations in which a resource is used
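These usage metrics reduce to counting over triple patterns. The following is a minimal sketch, assuming each triple pattern is already available as a `(subject, predicate, object)` tuple of strings with variables starting with `?` or `$`; the function name `usage_metrics` and the prefixed names in the usage example are hypothetical.

```python
from collections import Counter, defaultdict

POSITIONS = ("subject", "predicate", "object")

def usage_metrics(patterns):
    """Count how often, where, and in which triples each primitive is used."""
    heat = Counter()                     # primitive -> total uses (heat map weight)
    by_position = defaultdict(Counter)   # primitive -> position -> uses
    combinations = defaultdict(Counter)  # primitive -> triple pattern -> uses
    for pattern in patterns:
        for position, term in zip(POSITIONS, pattern):
            if term.startswith(("?", "$")):
                continue  # variables are not primitives
            heat[term] += 1
            by_position[term][position] += 1
            combinations[term][pattern] += 1
    return heat, by_position, combinations
```

A short usage example with made-up patterns:

```python
patterns = [
    ("?x", "rdf:type", "ns:Band"),
    ("?x", "ns:genre", "?g"),
    ("?y", "rdf:type", "ns:Band"),
]
heat, by_position, combinations = usage_metrics(patterns)
print(heat.most_common())
```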
Metrics
• Time statistics
  – hourly accesses
• Hosts statistics
  – hourly accesses per host
  – primitives and triple patterns requested by host
• Error statistics
  – triple patterns that contradict the schema but succeeded
  – triple patterns that fail due to the modelling
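The time and host statistics are simple aggregations over prepared log entries. This is a minimal sketch, assuming each entry is a `(host, timestamp)` pair with the timestamp in the access-log form `21/Sep/2009:00:00:01 -0600`; the function name and the host labels in the usage example are hypothetical, and the error statistics are not covered here because they require access to the schema and to query results.

```python
from collections import Counter

def time_and_host_statistics(entries):
    """Return hourly access counts overall and per (host, hour)."""
    hourly = Counter()
    hourly_per_host = Counter()
    for host, timestamp in entries:
        # "21/Sep/2009:13:05:01 -0600" -> the field after the first ':' is the hour.
        hour = int(timestamp.split(":", 2)[1])
        hourly[hour] += 1
        hourly_per_host[(host, hour)] += 1
    return hourly, hourly_per_host
```

A short usage example with made-up hosts:

```python
entries = [
    ("host-a", "21/Sep/2009:00:00:00 -0600"),
    ("host-a", "21/Sep/2009:00:59:59 -0600"),
    ("host-b", "21/Sep/2009:13:05:01 -0600"),
]
hourly, hourly_per_host = time_and_host_statistics(entries)
print(sorted(hourly.items()))
```

The hour is taken as logged, in the server's local offset; comparing logs from servers in different time zones would need a conversion step first.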
Visualizations
• weighted nodes and edges (depending on the applied metric) represent the amount of usage
• network overview
• zoom in and see details
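One possible way to derive node and edge weights for such a network is sketched below. It is an assumption, not the presented tool's algorithm: nodes are the non-variable terms of the triple patterns, a node's weight is its number of uses, and an edge's weight is the number of triple patterns in which both terms occur together. The result is a pair of plain dictionaries, independent of any visualization tool; the name `usage_network` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def usage_network(patterns):
    """Build node and edge weights from (subject, predicate, object) tuples."""
    node_weight = Counter()
    edge_weight = Counter()
    for pattern in patterns:
        terms = [t for t in pattern if not t.startswith(("?", "$"))]
        node_weight.update(terms)
        # Sorting makes (a, b) and (b, a) map to the same undirected edge key.
        for a, b in combinations(sorted(set(terms)), 2):
            edge_weight[(a, b)] += 1
    return node_weight, edge_weight
```

Other metrics from the deck (for example, accesses per host) could be swapped in as weights by changing what is counted.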
Evaluation Dataset
• DBpedia 3.3 log files
  – 1,700,000 requests from two randomly chosen days (07/2009)
  – analysis against a mirror of the 3.3 dataset (inconsistent dataset)
  – performance issues of the dynamic network visualization and the reprocessing of queries limited the number of analyzed logs
Inconsistencies & Weaknesses
Inconsistent data:
• ns:Band ns:instrument ?x
• ns:Band ns:genre ?y
• ns:Band ns:associatedBand ?z
Missing facts:
• ns:Band ns:knownFor ?x
• ns:Band ns:nationality ?y
• …
Complete analysis can be found at http://page.mi.fu-berlin.de/mluczak/pub/visual-analysis-of-web-of-data-usage-dbpedia33/
What to learn from usage analysis?
• ontology maintenance
  – schema evolution
  – instance population
  – ontology modularization
  – error detection
Image source: http://mrg.bz/GgaxPB
What else to learn?
• performance scaling
  – index generation
  – store architecture based on frequent SPARQL patterns
  – hardware scaling at peak times
  – modularization of data for different hosts
This is OK for the beginning, but…
• … SONIVIS can do more
• evaluate (with users!) various network visualizations and find the best one for a specific context
More for the Future
• Generic patterns for the metrics + resolution/evolution patterns
• Common sense of statistics + Quality-of-dataset index
• Temporal analysis
• Network metrics (degree, …)
• Visualize the effects of change
Central conclusion: Calculate statistics, weaknesses and inconsistencies first and do visual editing afterwards!
Image source: http://mrg.bz/8Co9lA
Take away
• usage-dependent life cycle support for LOD vocabularies and the populated instances
• (visual) usage analysis can help to plan and perform maintenance activities
• this is a benefit for the dataset publisher and the Web of data as a whole
Markus Luczak-Rösch (firstname.lastname@example.org)
Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
Image source: http://mrg.bz/jlObbL