
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graph Measures


This is my talk in the "Best of Resources" session about a software framework and datasets for the analysis of graph measures on RDF Graphs.

Published in: Engineering

  1. A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs
     Matthäus Zloch (1), Maribel Acosta (2), Daniel Hienert (1), Stefan Dietze (1, 3), Stefan Conrad (3)
     (1) GESIS - Leibniz-Institute for the Social Sciences, Germany
     (2) Karlsruhe Institute of Technology, Germany
     (3) Institute for Computer Science, Heinrich-Heine University, Germany
  2. Motivation (Why studying graph topologies is relevant)
     Studying graph topologies is relevant because
     • the availability and linkage of RDF datasets keep growing
     • various research areas rely on meaningful statistics and measures
     We want to study the topology of RDF graphs
     • not at the instance or schema level
     • but in terms of the implicit data structure of RDF data graphs
  3. Graph-based model of RDF
     [Figure: graph diagram with vertices labeled s and o and edges labeled p, illustrating triples (s, p, o)]
     • number of vertices and number of edges
     • number of parallel edges
     • density or reciprocity
     • degree-based measures
  6. Research areas that may benefit
     • Benchmarking: designers may use the measures to generate more representative synthetic datasets
     • Sampling: more representative samples in terms of the structure
     • Profiling and Evolution: monitor the change in structure over time (influence vs. prominence)
  7. Resource Paper Contribution
     Our paper introduces two resources:
     1. An open-source framework to acquire, prepare, and analyze graph-based measures on RDF graphs [1]
     2. A collection of 280 RDF datasets from the LOD Cloud (late 2017), pre-processed and ready to be re-used. A browsable version is available [2]
     [1] https://github.com/mazlo/lodcc
     [2] https://data.gesis.org/lodcc/2017-08
  8. Framework's Processing Pipeline (How to acquire, prepare, and perform a graph-based analysis on RDF)
     [3] Debattista, J., Lange, C., Auer, S., & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
  9. Dataset's Metadata Preparation
     • Optional: prepare an offline list of all datasets, e.g. for parallel processing.
     • The list should contain each dataset's name, its (official) media type formats with URLs, its domain class, and its modification date.
  10. Graph-Object Preparation
      • Downloads the dump, then extracts*, transforms*, and groups* the RDF files
      • The N-Triples format is the basis for the transformation into an edgelist structure
      * if necessary
  11. Graph-Object Preparation
      • As N-Triples, each statement is a triple (s, p, o):
        <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
  12. Graph-Object Preparation
      • As N-Triples (s, p, o):
        <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
      • Use a non-cryptographic hashing function to "encode" the data [3]:
        43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
      [3] xxhash, https://github.com/Cyan4973/xxHash
  13. Graph-Object Preparation
      • As N-Triples (s, p, o):
        <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
        43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
      • As edgelist (source vertex, target vertex, edge-property):
        43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
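The step from an N-Triples statement to an edgelist row can be sketched roughly as follows. This is a minimal illustration, not the framework's code: the function names are hypothetical, the parser only handles simple statements, and BLAKE2b truncated to 64 bits stands in for xxhash so that the sketch needs nothing beyond the standard library (so the digests will differ from those on the slides).

```python
import hashlib


def term_hash(term: str) -> str:
    """Return a 16-hex-digit (64-bit) digest of one RDF term.

    Assumption: BLAKE2b with an 8-byte digest replaces xxhash here;
    it is a cryptographic hash, used only to keep the sketch stdlib-only.
    """
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()


def parse_ntriple(line: str):
    """Split one N-Triples statement into (s, p, o).

    Simplification: subject and predicate contain no whitespace, and the
    object is everything before the terminating "." of the statement.
    """
    body = line.strip()
    if not body.endswith("."):
        raise ValueError(f"not an N-Triples statement: {line!r}")
    s, p, o = body[:-1].strip().split(None, 2)
    return s, p, o


def to_edgelist_row(line: str):
    """Return (source vertex, target vertex, edge-property) as hashes."""
    s, p, o = parse_ntriple(line)
    # Note the reordering: the edgelist stores s and o first, p last.
    return term_hash(s), term_hash(o), term_hash(p)
```

Hashing every term to a fixed-width key keeps the edgelist compact and lets equal terms map to the same vertex without keeping the full IRIs or literals in memory.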
  14. Graph-Object Instantiation
      • Reads the edgelist and builds the graph structure
      • Reports results on measures from 5 dimensions
  15. Library re-use
      [Figure: processing pipeline annotated with the re-used sources and libraries [4], [5], [6, 7, 8], and [9]]
      [4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json
      [5] Wget, https://www.gnu.org/software/wget/
      [6] dtrx, https://github.com/moonpyk/dtrx
      [7] rapper, http://librdf.org/raptor/rapper.html
      [8] xxhash, https://github.com/Cyan4973/xxHash
      [9] graph-tool, https://graph-tool.skewed.de/
  17. Groups of Measures
      The framework reports on 28 measures from 5 groups:
      • Basic graph measures: number of vertices and edges, parallel edges, unique edges
      • Degree-based measures: max-[in|out]-degree, average degree, h-index (directed/undirected)
      • Centrality measures: graph centralization, max degree centrality
      • Edge-based measures: density, reciprocity, diameter
      • Descriptive statistical measures: variance, standard deviation, coefficient of variation, degree distribution, power-law exponent alpha
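A few of the measures listed above can be illustrated on a plain list of directed edges. This is a sketch under stated assumptions, not how the framework computes them (it builds on graph-tool): the function name is hypothetical, parallel edges are counted as the surplus over unique edges, average degree is taken as edges per vertex, density and reciprocity use unique edges, and the statistics are population variants over the total degree.

```python
import statistics
from collections import Counter


def graph_measures(edges):
    """Compute selected measures for a directed multigraph.

    edges: non-empty list of (source, target) pairs; repeated pairs
    represent parallel edges.
    """
    if not edges:
        raise ValueError("edge list must not be empty")
    vertices = {v for edge in edges for v in edge}
    n, m = len(vertices), len(edges)
    unique = set(edges)

    out_deg = Counter(s for s, _ in edges)
    in_deg = Counter(t for _, t in edges)
    # Total degree per vertex (in + out); Counter yields 0 for absent keys.
    ranked = sorted((in_deg[v] + out_deg[v] for v in vertices), reverse=True)

    # h-index: largest h such that at least h vertices have degree >= h.
    h_index = sum(1 for rank, d in enumerate(ranked, start=1) if d >= rank)

    # An edge is reciprocated if the reverse edge also exists.
    reciprocated = sum(1 for s, t in unique if (t, s) in unique)

    mean = statistics.fmean(ranked)
    std = statistics.pstdev(ranked)
    return {
        "vertices": n,
        "edges": m,
        "unique_edges": len(unique),
        "parallel_edges": m - len(unique),
        "max_in_degree": max(in_deg.values()),
        "max_out_degree": max(out_deg.values()),
        "avg_degree": m / n,
        "h_index": h_index,
        "density": len(unique) / (n * (n - 1)) if n > 1 else 0.0,
        "reciprocity": reciprocated / len(unique),
        "degree_variance": statistics.pvariance(ranked),
        "degree_std": std,
        "degree_cv": std / mean,
    }
```

For example, `graph_measures([("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")])` describes a graph with 3 vertices, 4 edges, and 1 parallel edge.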
  18. Performance
      • Example: datasets and sizes
  20. Performance
      • Example: datasets and sizes
      • Example: runtimes
  22. A Dataset of Pre-Processed RDF Graphs
      Datasets from 9 domains in the LOD Cloud:
      • 12 in May 2007
      • 570 in August 2014
      • 1163 in August 2017
      • 1224 in August 2018
      • 1239 in March 2019
  23. A Dataset of Pre-Processed RDF Graphs (Case Study with Datasets from LOD Cloud)
      • A total of 280 RDF datasets processed and analyzed
      • Values for 28 measures per dataset
      • Graph-objects ready to be re-used, results as CSV, and the original link to the metadata
      Available at our website: https://data.gesis.org/lodcc/2017-08
  24. Graph-based Analysis at large scale
      To analyze RDF graphs at large scale you have to
      • download the list of available datasets
      • acquire the datasets
      • represent each one as a graph-object
      • compute graph measures on it
      Sounds easy, right?
  25. Graph-based Analysis at large scale
      In reality it is not that easy:
      • not all data providers offer data dumps
      • non-standard media type declarations
      • various formats, compressed archives, hierarchies of files and folders
      • erroneous or error-prone data
  26. Acquisition and Preparation
      • 1163: metadata packages
      • 890: 150 different media type statements; URLs for the official media type statements that are supported
      • 486: after filtering 404 and content-type HTML
      • 280: left out SPARQL endpoints; after graph preparation with corrupt downloads, wrong media type statements, syntax errors
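The early filtering stages of this funnel might look roughly like the sketch below. Everything in it is illustrative: the alias table, the record keys (`url`, `format`, `status`, `content_type`), and the function names are assumptions made for the example, not the framework's actual data model. Only the four media types in `SUPPORTED` are real registered RDF media types.

```python
from typing import Optional

# Official media types this sketch treats as supported.
SUPPORTED = {
    "application/n-triples",
    "application/n-quads",
    "application/rdf+xml",
    "text/turtle",
}

# Hypothetical non-standard statements mapped to official media types.
ALIASES = {
    "nt": "application/n-triples",
    "application/x-ntriples": "application/n-triples",
    "ttl": "text/turtle",
    "rdf/xml": "application/rdf+xml",
}


def normalize_media_type(statement: str) -> Optional[str]:
    """Map a declared format to an official media type, or None."""
    key = statement.split(";")[0].strip().lower()  # drop parameters
    if key in SUPPORTED:
        return key
    return ALIASES.get(key)


def select_downloadable(resources):
    """Keep (url, media_type) pairs that pass the filters.

    resources: iterable of dicts with the hypothetical keys
    'url', 'format', 'status', and 'content_type'.
    """
    selected = []
    for res in resources:
        media_type = normalize_media_type(res["format"])
        if media_type is None:
            continue  # unsupported format, e.g. a SPARQL endpoint
        if res.get("status") == 404:
            continue  # dump is no longer available
        if res.get("content_type", "").startswith("text/html"):
            continue  # URL serves a web page, not an RDF dump
        selected.append((res["url"], media_type))
    return selected
```

Later stages (corrupt downloads, syntax errors) can only be detected while preparing the graph, so they are not covered by a metadata-level filter like this one.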
  27. Processed Datasets by Domain
  29. Preliminary Analysis of Results
      • Average degree z does not seem to be affected by the number of edges in any domain except Geography and Government
      • Average edges per vertex:
        • Life Sciences: 63.50
        • Cross Domain: 5.46
        • Average over all domains: 7.9 edges per vertex
  30. Preliminary Analysis of Results
      • hd (h-index, directed) grows exponentially with the number of edges
      • Life Sciences and Government are more "dense"
      • Linguistics forms two clusters, shows almost no dependency on the number of edges, and is low on average
  31. Availability, Maintenance, Sustainability
      The framework
      • Published under the MIT license on GitHub: https://github.com/mazlo/lodcc
      • Actively used in other research activities
      • Future releases (minor, bugfixes, features)
      The datasets
      • Recalculate for newer versions of the LOD Cloud
      • Made available to the community
      • Combine with other datasets (http://stats.lod2.eu)
  32. Future Work and Research
      • Investigate domain- and dataset-specific irregularities
      • Derive implications for modelling tasks, at the dataset level and for applications like benchmarking
      • Offer a SPARQL endpoint to query the results
  33. Thank you for your attention
      [1] https://github.com/mazlo/lodcc
      [2] https://data.gesis.org/lodcc/2017-08
      @matzlo
