Statistical Analysis of Web of Data Usage
Towards (Visual) Maintenance Support for Dataset Publishers
Markus Luczak-Rösch, Markus Bischoff


Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
Who is addressed?
• rather small/simple ontologies
  – min. effort for OE
  – “under-engineered”
• unknown user requirements
We propose: A Usage-dependent Life Cycle

Initial Release
• RDB2RDF
• Crawling & transformation
• …

Requests and Queries (USAGE)
• SELECT * WHERE { ?t a:madeOf a:Plastic }
• SELECT * WHERE { ?t b:madeOf b:Wood }

Negotiate understanding
• Re-engineering
• Re-population
• …
(Very) Quick Example
• Which instruments make up the band The Beatles?
• Are the Beatles a “Big Band”?
• What are “British” bands?
• Is it what the user expected to see?
• Did you know that this happens, and do you know what to do now?
Survey covering approx. 25% of all cloud datasets
• size
• complexity
• engineering methodology
• …
Publishers of most of the datasets do not have any (structured) idea how to maintain their data.
Survey ran in October 2010, not yet published officially
Role of the dataset publisher (more general)
• use common vocabularies
• provide RDF links to other resources
• provide schema mappings

Figure: Effort Distribution between Publisher and Consumer
• Consumer generates/data mines links
• Publisher provides links
• Links as hints

Source: Talk of Christian Bizer: Pay-as-you-go Data Integration (21/9/2010)
Role of the dataset publisher (more specific)*
• Reliability → Is the data valid and complete?
• Peak-load → Temporal profiles of important data?
• Performance → Are caches and indexes optimal?
• Usefulness → What do people find and use frequently?
• Attacks → Is the data threatened by spam?

* w.r.t. Möller et al.: Learning from Linked Open Data Usage: Patterns & Metrics.
Our Usage-based Approach




digging in log files
How do people access resources on the Web of Data?

xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600]
    "GET /page/Jeroen_Simaeys HTTP/1.1"
    200 26777 "" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600]
    "GET /resource/Guano_Apes HTTP/1.1"
    303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1;
    +http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600]
    "GET /sparql?query=PREFIX+rdfs%3A+%..."
    200 1844 "" ""


                         What do they get?
                         • RDF-Graphs
                         • SPARQL Query Results XML Format
                         • …, HTML, JSON, … serialization of results
                         • …, HTML, JSON, … serialization of no results


                              HTTP 204 (No Content) for empty results would be great,
                              but for now the usage mining process has to respect this.
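
The log entries above follow the Apache combined log format, so a first preparation step can be sketched as below. This is a minimal sketch, not the authors' tool: the names `LOG_PATTERN`, `classify` and `parse_line` and the coarse labels `page`, `resource` and `sparql` are assumptions made for illustration, and the third sample line uses a hypothetical complete query because the one on the slide is truncated.

```python
import re
from collections import Counter

# One entry in Apache combined log format, as in the sample entries above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def classify(path):
    """Map a request path to a coarse access type (labels are hypothetical)."""
    if path.startswith("/sparql"):
        return "sparql"
    if path.startswith("/resource/"):
        return "resource"
    if path.startswith("/page/"):
        return "page"
    return "other"


def parse_line(line):
    """Return a dict of log fields plus the access type, or None if unparsable."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["kind"] = classify(entry["path"])
    return entry


# Sample lines; the SPARQL request is a hypothetical, complete stand-in.
SAMPLE = [
    'xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] '
    '"GET /page/Jeroen_Simaeys HTTP/1.1" 200 26777 "" '
    '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    'xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] '
    '"GET /resource/Guano_Apes HTTP/1.1" 303 0 "" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    'xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] '
    '"GET /sparql?query=SELECT+%2A+WHERE+%7B+%3Ft+a+%3Fc+%7D HTTP/1.1" '
    '200 1844 "" ""',
]

entries = [parse_line(line) for line in SAMPLE]
kinds = Counter(entry["kind"] for entry in entries if entry is not None)
```

Because an empty result set still comes back with status 200, the status code alone cannot tell the mining process whether a query found anything.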
Figure: usage mining architecture, adapted from Myra Spiliopoulou: “Web usage mining for Web site evaluation”, 2000, Commun. ACM

Preparation Phase
• Log File
• Preparation Tool
• Access Methods and Patterns: Queries, Patterns, Triples, Filters
• Prepared Log Data

Mining Phase
• Instructions
• Mining Query
• Usage Mining Methods: Navigation Patterns, Sessions and Sequences, Statistics
• Mining Results
• Visualization Tool
• Result Patterns
Preparation Process
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600]
     "GET /sparql?query=PREFIX+rdfs%3A+%..."
     200 1844 "" ""

Steps writing to the Query Partitions Database:
• Log Entry Extraction
• SPARQL Query Selection and Validation
• Basic Graph Pattern Selection
• Triple Pattern Selection

Steps working on the Query Partitions Database:
• Query Partition Re-Execution
• Query Partition Success Determination
• Query Filter Evaluation
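
The first steps of the preparation process (log entry extraction, query selection, basic graph pattern and triple pattern selection) might look roughly like the sketch below. It is deliberately naive: the helper names are hypothetical, the example query is made up around the `a:madeOf` pattern from the life cycle slide, and a real implementation would use a proper SPARQL parser instead of regular expressions.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit


def extract_query(path):
    """Return the decoded SPARQL query from a request path, or None."""
    params = parse_qs(urlsplit(path).query)
    values = params.get("query")
    return values[0] if values else None


def basic_graph_patterns(query):
    """Return the text of each innermost {...} group (nesting is not handled)."""
    return re.findall(r"\{([^{}]*)\}", query)


def triple_patterns(bgp):
    """Split a basic graph pattern into (subject, predicate, object) tuples.

    Naive on purpose: ignores ';' and ',' shorthand, FILTER clauses and
    literals that contain whitespace or dots.
    """
    triples = []
    for part in bgp.split(" . "):
        tokens = part.strip().rstrip(".").split()
        if len(tokens) == 3:
            triples.append(tuple(tokens))
    return triples


# Hypothetical query and request path, built here to avoid hand-encoding.
QUERY = "SELECT * WHERE { ?t a:madeOf a:Plastic . ?t rdfs:label ?name }"
PATH = "/sparql?" + urlencode({"query": QUERY})

decoded = extract_query(PATH)
patterns = [t for bgp in basic_graph_patterns(decoded) for t in triple_patterns(bgp)]
```

Each extracted pattern could then be stored as a query partition and re-executed on its own to determine whether it succeeds against the dataset.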
Usage Analysis

• queries
   • patterns
      • triples
         • primitives
Example triple: ns1:A rdf:type ns2:B

Reference for details: M. Luczak-Rösch and H. Mühleisen, "Log File Analysis for Web of Data Endpoints," in Proc. of the 8th Extended Semantic Web Conference (ESWC) Poster-Session, 2011.
Metrics
• Ontology heat map
  – the amount a class or a predicate is used in queries
• Resource usage
  – triple combinations in which a resource is used
• Primitive usage
  – position in triples
  – triple combinations
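
As a rough illustration of the heat map and primitive usage metrics, the sketch below counts predicates, classes and positions over a list of triple patterns. The function names, the convention that a class is the object of an `rdf:type` pattern, and the sample patterns are assumptions for this example, not the metrics' published definitions.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
POSITIONS = ("subject", "predicate", "object")


def is_variable(term):
    """SPARQL variables start with '?' in this simplified representation."""
    return term.startswith("?")


def heat_map(triples):
    """Count how often each predicate and each class is used in triple patterns."""
    counts = Counter()
    for _subject, predicate, obj in triples:
        if not is_variable(predicate):
            counts[predicate] += 1
        # Assumption: a class is whatever appears as the object of rdf:type.
        if predicate == RDF_TYPE and not is_variable(obj):
            counts[obj] += 1
    return counts


def primitive_positions(triples):
    """Count in which triple position each non-variable primitive occurs."""
    counts = Counter()
    for triple in triples:
        for position, term in zip(POSITIONS, triple):
            if not is_variable(term):
                counts[(term, position)] += 1
    return counts


# Hypothetical triple patterns in the style of the ns:Band examples.
TRIPLES = [
    ("?b", "rdf:type", "ns:Band"),
    ("?b", "ns:instrument", "?x"),
    ("?b", "rdf:type", "ns:Band"),
    ("ns:Band", "ns:genre", "?y"),
]

heat = heat_map(TRIPLES)
positions = primitive_positions(TRIPLES)
```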
Metrics
• Time statistics
  – hourly accesses
• Hosts statistics
  – hourly accesses per host
  – primitives and triple patterns requested by host
• Error statistics
  – triple patterns that contradict the schema but succeeded
  – triple patterns that fail due to the modelling
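
The time and hosts statistics amount to simple grouping over parsed log entries. A minimal sketch, assuming entries are available as (host, timestamp) pairs in the log's timestamp format; the host names and function names are placeholders:

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the access log, e.g. "21/Sep/2009:00:00:00 -0600".
# Note: %b expects English month abbreviations under the default C locale.
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def hour_of(timestamp):
    """Return the hour of day in the time zone recorded in the log entry."""
    return datetime.strptime(timestamp, TIME_FORMAT).hour


def hourly_accesses(entries):
    """Count requests per hour of day."""
    return Counter(hour_of(timestamp) for _host, timestamp in entries)


def hourly_accesses_per_host(entries):
    """Count requests per (host, hour of day)."""
    return Counter((host, hour_of(timestamp)) for host, timestamp in entries)


# Hypothetical entries; real logs would supply (anonymized) host addresses.
ENTRIES = [
    ("host-a", "21/Sep/2009:00:00:00 -0600"),
    ("host-a", "21/Sep/2009:00:59:59 -0600"),
    ("host-b", "21/Sep/2009:13:05:00 -0600"),
]

per_hour = hourly_accesses(ENTRIES)
per_host = hourly_accesses_per_host(ENTRIES)
```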
Visualizations
• weighted nodes and edges (depending on the applied metric) represent the amount of usage
• network overview
• zoom in and see details
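
One way to obtain weighted nodes and edges for such a network view is sketched below. It assumes a co-occurrence weighting, where nodes are non-variable primitives and an edge links two primitives used in the same triple pattern. This weighting is an illustrative choice; the slides do not state which graph model the visualization tool uses.

```python
from collections import Counter
from itertools import combinations


def usage_graph(triples):
    """Build node and edge weights from triple patterns.

    Nodes: non-variable primitives, weighted by number of occurrences.
    Edges: unordered pairs of primitives that occur in the same pattern.
    """
    node_weight = Counter()
    edge_weight = Counter()
    for triple in triples:
        primitives = sorted({term for term in triple if not term.startswith("?")})
        for primitive in primitives:
            node_weight[primitive] += 1
        for pair in combinations(primitives, 2):
            edge_weight[pair] += 1
    return node_weight, edge_weight


# Hypothetical triple patterns in the style of the ns:Band examples.
TRIPLES = [
    ("?b", "rdf:type", "ns:Band"),
    ("?b", "rdf:type", "ns:Band"),
    ("ns:Band", "ns:genre", "?y"),
]

nodes, edges = usage_graph(TRIPLES)
```

The resulting weights could be mapped to node size and edge thickness in any graph drawing tool.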
Evaluation Dataset
• DBpedia 3.3 log files
  – 1,700,000 requests from two randomly chosen days (07/2009)
  – analysis against a mirror of the 3.3 dataset (inconsistent dataset)
  – performance issues of dynamic network visualization and reprocessing of queries → limited number of analyzed logs
Starting Point for Visual Analysis
Resource Analysis
Predicate Analysis
Access Time and Hosts Analysis
    All hosts        Specific host
Hosts and Primitives Analysis
           Specific host
Inconsistencies & Weaknesses
Inconsistent data:
    • ns:Band ns:instrument ?x
    • ns:Band ns:genre ?y
    • ns:Band ns:associatedBand ?z
Missing facts:
    • ns:Band ns:knownFor ?x
    • ns:Band ns:nationality ?y
    • …
Complete analysis can be found at http://page.mi.fu-berlin.de/mluczak/pub/visual-analysis-of-web-of-data-usage-dbpedia33/
What to learn from usage analysis?
• ontology maintenance
  – schema evolution
  – instance population
  – ontology modularization
  – error detection




                              Image source http://mrg.bz/GgaxPB
What else to learn?
• performance scaling
  – index generation
  – store architecture based on frequent SPARQL
    patterns
  – hardware scaling at peak times
  – modularization of data for different hosts
This is ok for the beginning but…




… SONIVIS can do more
→ evaluate (with users!) various network visualizations and find the best one for a specific context
More for the Future

• Generic patterns for the metrics
   + resolution/evolution patterns
• Common sense of statistics
   + Quality-of-dataset index
• Temporal analysis
• Network metrics (degree,…)
• Visualize the effects of change

Central conclusion: Calculate statistics, weaknesses and inconsistencies first and do visual editing afterwards!

Image source: http://mrg.bz/8Co9lA
Take away
• usage-dependent life cycle support for LOD vocabularies and the populated instances
• (visual) usage analysis can help to plan and perform maintenance activities
• this is a benefit for the dataset publisher and the Web of Data as a whole

Markus Luczak-Rösch (luczak@inf.fu-berlin.de)
Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)   Image source: http://mrg.bz/jlObbL

Editor's Notes

  1. This is not an approach for all kinds of domains, but within LOD we find characteristic ontologies and vocabularies. Dataset hosts do not necessarily know the requirements of the dataset users.
  2. Roughly 25 per cent of all datasets were covered by the survey. That relates to the absolute number of datasets and not the amount of triples served. Some of the bigger ones replied, such as DBpedia and Bio2RDF.