A Software Framework and Datasets for the
Analysis of Graph Measures on RDF Graphs
Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1,
Stefan Dietze1,3, Stefan Conrad3
1 GESIS - Leibniz-Institute for the Social Sciences, Germany
2 Karlsruhe Institute of Technology, Germany
3 Institute for Computer Science, Heinrich-Heine University, Germany
Motivation
Studying graph topologies is relevant because
 the availability and linkage of RDF datasets grow
 various research areas rely on meaningful statistics and measures
We want to study the topology of RDF graphs
 not at instance or schema level
 but at the level of the implicit graph structure underlying RDF data
Why studying graph topologies is relevant
Graph-based model of RDF
[Figure: an RDF triple (s, p, o) rendered as a directed graph, with subject s and object o as vertices and predicate p as the labeled edge]
From the resulting graph-object we can compute measures such as
- # vertices and # edges
- # parallel edges
- density or reciprocity
- degree-based measures
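This triple-to-graph mapping can be sketched in a few lines of Python; the triples below are invented toy data, and the counting logic is a minimal illustration rather than the framework's implementation:

```python
from collections import Counter

# Toy RDF triples (s, p, o); invented for illustration
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:bob"),  # parallel edge: same (s, o) pair again
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

# Subjects and objects become vertices; each triple is a directed, labeled edge
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o) for s, _, o in triples]

n_vertices = len(vertices)
n_edges = len(edges)
# Parallel edges: extra edges between an already-connected (s, o) pair
n_parallel = sum(c - 1 for c in Counter(edges).values() if c > 1)
n_unique = len(set(edges))

print(n_vertices, n_edges, n_parallel, n_unique)  # 3 3 1 2
```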
Research areas that may benefit
 Benchmarking – designers may use the measures to generate more representative synthetic datasets
 Sampling – more representative samples in terms of the graph structure
 Profiling and Evolution – monitoring the change in structure over time (influence vs. prominence)
Resource Paper Contribution
Our paper introduces two resources
1. An open-source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1]
2. A collection of 280 RDF datasets from the LOD Cloud (late 2017), pre-processed and ready to be re-used; a browsable version is available [2]
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
Framework’s Processing Pipeline
How to acquire, prepare, and perform a graph-based analysis on RDF
[3] Debattista, J., Lange, C., Auer, S. & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
Dataset’s Metadata Preparation
 Optional. Prepares an offline list of all datasets, e.g. for parallel processing.
 The list should contain all dataset names, the (official) media type format with URLs, the domain class, and the modification date.
Graph-Object Preparation
 Downloads the dump, then extracts*, transforms*, and groups* the RDF files
 The N-Triples format is used to transform the data into an edgelist structure
* if necessary
Graph-Object Preparation
 As N-Triples:
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
 A non-cryptographic hashing function is used to “encode” the RDF terms [8]:
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
 As edgelist (source vertex, target vertex, edge-property):
43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
[8] xxhash, https://github.com/Cyan4973/xxHash
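The encoding step can be sketched as follows. The framework uses xxhash [8]; to keep this snippet dependency-free, Python's built-in blake2b (truncated to 64 bits) stands in for it, and the full URIs are hypothetical stand-ins for the abbreviated ones above:

```python
import hashlib

def encode(term: str) -> str:
    # Stand-in for the framework's xxhash-based encoding: any stable function
    # mapping an RDF term to a short fixed-width id serves the same purpose.
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()

# Hypothetical full-URI version of the slide's abbreviated example triple
s = "<http://example.org/dataset/whisky-circle-info>"
p = "<http://purl.org/dc/terms/title>"
o = '"Whisky Circle"@en'

# Edgelist row: source vertex, target vertex, edge-property (the predicate)
row = f"{encode(s)} {encode(o)} {encode(p)}"
print(row)  # three 16-hex-character ids, one per RDF term
```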
Graph-Object Instantiation
 Reads the edgelist and builds the graph structure
 Reports results on measures from 5 dimensions
Library re-use
[Figure: the processing pipeline annotated with the tools re-used at each stage – metadata acquisition [4], download [5], extraction, transformation, and encoding [6,7,8], graph instantiation [9]]
[4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json
[5] Wget, https://www.gnu.org/software/wget/
[6] dtrx, https://github.com/moonpyk/dtrx
[7] rapper, http://librdf.org/raptor/rapper.html
[8] xxhash, https://github.com/Cyan4973/xxHash
[9] graph-tool, https://graph-tool.skewed.de/
Groups of Measures
The framework reports on 28 measures from 5 groups
Basic graph measures
• no. of vertices, edges
• parallel edges
• unique edges
Degree-based measures
• max-[in|out]-degree
• average degree
• h-index (directed/undirected)
Centrality measures
• graph centralization
• max degree centrality
Edge-based measures
• density
• reciprocity
• diameter
Descriptive statistical measures
• variance, standard deviation, coefficient of variation
• degree distribution, powerlaw exponent alpha
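Several of the degree-based measures can be computed directly from the edgelist. A minimal sketch on invented toy data; the graph h-index used here (the largest h such that at least h vertices have degree ≥ h) follows the standard definition from the literature, not necessarily the framework's exact code:

```python
from collections import Counter

# Toy directed edgelist (source, target); invented data
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "a")]

out_deg = Counter(s for s, _ in edges)
in_deg = Counter(t for _, t in edges)
vertices = set(out_deg) | set(in_deg)

max_out = max(out_deg.values())          # max-out-degree
max_in = max(in_deg.values())            # max-in-degree
avg_degree = len(edges) / len(vertices)  # z = |E| / |V|

def h_index(degrees):
    # Largest h such that at least h vertices have degree >= h
    degrees = sorted(degrees, reverse=True)
    return max((i + 1 for i, d in enumerate(degrees) if d >= i + 1), default=0)

total_deg = [in_deg[v] + out_deg[v] for v in vertices]
print(max_out, max_in, avg_degree, h_index(total_deg))  # 3 2 1.25 2
```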
Performance
[Figures: example datasets and their sizes, and example runtimes of the pipeline]
A Dataset of Pre-Processed RDF Graphs
Datasets from 9 domains in the LOD Cloud:
 12 in May 2007
 570 in August 2014
 1163 in August 2017
 1224 in August 2018
 1239 in March 2019
 A total of 280 RDF datasets processed and analyzed
 Values for 28 measures per dataset
 Graph-objects ready to be re-used, results as CSV, and the original link to the metadata
Available at our website: https://data.gesis.org/lodcc/2017-08
Case Study with Datasets from LOD Cloud
Graph-based Analysis at Large Scale
To analyze RDF graphs at large scale you have to
 download the list of available datasets
 acquire the datasets
 represent each one as a graph-object
 compute graph measures on it
Sounds easy, right?
In reality it is not that easy
 not all data providers offer data dumps
 non-standard media type declarations
 various formats, compressed archives, hierarchies of files and folders
 erroneous/error-prone data
Acquisition and Preparation
 1163 metadata packages
 890 after restricting to URLs with supported official media type statements (150 different media type statements found)
 486 after filtering out 404 responses and content-type HTML
 280 after leaving out SPARQL endpoints, and after graph preparation removed corrupt downloads, wrong media type statements, and syntax errors
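The filtering funnel above can be sketched as a predicate over dataset metadata records; the record layout, media type list, and rules here are illustrative assumptions on toy data, not the framework's actual schema:

```python
# Each record: (name, media_type, http_status); invented toy metadata
records = [
    ("dataset-a", "application/n-triples", 200),
    ("dataset-b", "text/html", 200),              # landing page, not a dump
    ("dataset-c", "application/rdf+xml", 404),    # dump unreachable
    ("dataset-d", "application/sparql-results+json", 200),  # endpoint, no dump
]

# Hypothetical set of supported official media types
SUPPORTED = {"application/n-triples", "application/rdf+xml", "text/turtle"}

def usable(record):
    name, media_type, status = record
    if status == 404:              # filter unreachable dumps
        return False
    if media_type == "text/html":  # filter HTML landing pages
        return False
    return media_type in SUPPORTED # keep supported media types only

kept = [name for name, *_ in records if usable((name, *dict(
        (n, (m, s)) for n, m, s in records)[name]))] if False else \
       [r[0] for r in records if usable(r)]
print(kept)  # ['dataset-a']
```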
Processed Datasets by Domain
[Figure: number of processed datasets per domain]
Preliminary Analysis of Results
 The average degree z seems unaffected by the number of edges, in all domains but Geography and Government
 Average edges per vertex:
 Life Sciences: 63.50
 Cross Domain: 5.46
 Average over all domains: 7.9 edges per vertex
 hd grows exponentially with the number of edges
 Life Sciences and Government are more “dense”
 Linguistics forms two clusters, with almost no dependency on the number of edges, and is low on average
Availability, Maintenance, Sustainability
The framework
• published under the MIT license on GitHub: https://github.com/mazlo/lodcc
• actively used in other research activities
• future releases planned (minor versions, bugfixes, features)
The datasets
• recalculate for newer versions of the LOD Cloud
• made available to the community
• combine with other datasets, e.g. http://stats.lod2.eu
Future Work and Research
 Investigate domain- and dataset-specific irregularities
 Derive implications for modelling tasks, at dataset level and for applications like benchmarking
 Offer a SPARQL endpoint to query the results
Thank you for your attention
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
@matzlo

More Related Content

What's hot

Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
తేజ దండిభట్ల
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
Revolution Analytics
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
Ajay Ohri
 

What's hot (19)

Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
 
Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
R programming Language , Rahul Singh
R programming Language , Rahul SinghR programming Language , Rahul Singh
R programming Language , Rahul Singh
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Information Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open VocabulariesInformation Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open Vocabularies
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...
 
SPARQL and RDF query optimization
SPARQL and RDF query optimizationSPARQL and RDF query optimization
SPARQL and RDF query optimization
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of R
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the web
 
Survey of Graph Indexing
Survey of Graph IndexingSurvey of Graph Indexing
Survey of Graph Indexing
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
 
Mansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analyticsMansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analytics
 
Range Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map ReduceRange Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map Reduce
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 

Similar to ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Measures

DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
Gezim Sejdiu
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Gezim Sejdiu
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
DataWorks Summit
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
RAKESHG79
 

Similar to ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Measures (20)

Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
 
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software ComponentsFIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
 
R tutorial
R tutorialR tutorial
R tutorial
 
Translation of Relational and Non-Relational Databases into RDF with xR2RML
Translation of Relational and Non-Relational Databases into RDF with xR2RMLTranslation of Relational and Non-Relational Databases into RDF with xR2RML
Translation of Relational and Non-Relational Databases into RDF with xR2RML
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 
Visualization Proess
Visualization ProessVisualization Proess
Visualization Proess
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
ADAPTER
ADAPTERADAPTER
ADAPTER
 
AstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & Management
AstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & ManagementAstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & Management
AstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & Management
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 

Recently uploaded

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
Kamal Acharya
 

Recently uploaded (20)

2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Toll tax management system project report..pdf
Toll tax management system project report..pdfToll tax management system project report..pdf
Toll tax management system project report..pdf
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
shape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxshape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptx
 
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Online resume builder management system project report.pdf
Online resume builder management system project report.pdfOnline resume builder management system project report.pdf
Online resume builder management system project report.pdf
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
Arduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectArduino based vehicle speed tracker project
Arduino based vehicle speed tracker project
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 

ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Measures

  • 1. A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1, Stefan Dietze1,3, Stefan Conrad3 1 GESIS - Leibniz-Institute for the Social Sciences, Germany 2 Karlsruhe Institute of Technology, Germany 3 Institute for Computer Science, Heinrich-Heine University, Germany
  • 2. Motivation Studying graph topologies is relevant because  availability and linkage of RDF data sets grow  various research areas rely on meaningful statistics and measures We want to study the topology of RDF graphs  not at instance- or schema-level  but about the implicit data structure on RDF data graphs 2 Why studying graph topologies is relevant
  • 3. Graph-based model of RDF 3 oo o o o - # vertices and # edges - # parallel edges - density or reciprocity - degree-based measures (s, p, o) s o p p p p p p p os p Why studying graph topologies is relevant
  • 4. Research areas that may benefit  Benchmarking – Designers may use the measures to generate more representative synthetic datasets 4 Why studying graph topologies is relevant
  • 5. Research areas that may benefit  Benchmarking – Designers may use the measures to generate more representative synthetic datasets  Sampling – more representative samples in terms of the structure 5 Why studying graph topologies is relevant
  • 6. Research areas that may benefit  Benchmarking – Designers may use the measures to generate more representative synthetic datasets  Sampling – more representative samples in terms of the structure  Profiling and Evolution – monitor the change in structure over time (influence vs. prominence) 6 Why studying graph topologies is relevant
  • 7. Resource Paper Contribution Our paper introduces two resources 1. An open source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1] 2. A dataset of 280 RDF datasets from the LOD Cloud late 2017, pre-processed and ready to be re-used. Browsable version available [2] 7 [1] https://github.com/mazlo/lodcc [2] https://data.gesis.org/lodcc/2017-08
  • 8. Framework’s Processing Pipeline 8 How to acquire, prepare, and perform a graph-based analysis on RDF [3] Debattista, J., Lange, C., Auer, S. & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation.. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
  • 9. Dataset’s Metadata Preparation 9  Optional. Preparation of an offline list of all datasets, e.g. for parallel processing.  List should contain all dataset names, the (official) media type format with URLs, domain class, and modification date. How to acquire, prepare, and perform a graph-based analysis on RDF
  • 10. Graph-Object Preparation 10  Downloads the dump, extracts*, transforms*, and groups* RDF files  N-triples format is used to transform into an edgelist structure * if necessary How to acquire, prepare, and perform a graph-based analysis on RDF
  • 11. Graph-Object Preparation 11 s o (s, p, o) p <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . How to acquire, prepare, and perform a graph-based analysis on RDF  As N-Triples
  • 12. Graph-Object Preparation  As N-Triples  use non-cryptographic hashing function to „encode“ the data [3] 12 <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . 43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02 (s, p, o) s o p [3] xxhash, https://github.com/Cyan4973/xxHash How to acquire, prepare, and perform a graph-based analysis on RDF
  • 13. Graph-Object Preparation  As N-Triples  As edgelist 13 (s, p, o) source vertex target vertex edge-property 43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . 43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02 s o p How to acquire, prepare, and perform a graph-based analysis on RDF
  • 14. Graph-Object Instantiation 14  Reads edgelist and builds graph structure  Reports results on measures from 5 dimensions How to acquire, prepare, and perform a graph-based analysis on RDF
  • 15. Library re-use 15 How to acquire, prepare, and perform a graph-based analysis on RDF [4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json [5] Wget, https://www.gnu.org/software/wget/ [6] dtrx, https://github.com/moonpyk/dtrx [7] rapper, http://librdf.org/raptor/rapper.html [8] xxhash, https://github.com/Cyan4973/xxHash [9] graph-tool, https://graph-tool.skewed.de/ [4] [6,7,8][9] [5]
  • 16. Groups of Measures Framework reports on 28 measures from 5 groups 16 How to acquire, prepare, and perform a graph-based analysis on RDF • no. of vertices, edges • parallel edges • unique edges Basic graph measures • max-[in|out]-degree • average degree • h-index (direct./undirect.) Degree-based measures • graph centralization • max degree centrality Centrality measures
  • 17. Groups of Measures Framework reports on 28 measures from 5 groups 17 How to acquire, prepare, and perform a graph-based analysis on RDF • no. of vertices, edges • parallel edges • unique edges Basic graph measures • max-[in|out]-degree • average degree • h-index (direct./undirect.) Degree-based measures • graph centralization • max degree centrality Centrality measures • density • reciprocity • diameter Edge-based measures • variance, standard dev., coefficient of var. • degree-distribution, powerlaw-exponent alpha Descriptive stat. measures
  • 18. Performance Example: datasets and sizes 18 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 19. Performance Example: datasets and sizes 19 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 20. Performance Example: datasets and sizes Example: runtimes 20 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 21. Performance Example: datasets and sizes Example: runtimes 21 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 22. 22 Datasets from 9 domains in LOD Cloud  12 in May 2007  570 in August 2014  1163 in August 2017  1224 in August 2018  1239 in March 2019 A Dataset of Pre-Processed RDF Graphs
  • 23. A Dataset of Pre-Processed RDF Graphs  Total of 280 RDF datasets processed and analyzed  Values for 28 measures per dataset  Graph-objects ready to be re-used, results as CSV, and original link to metadata 23 Case Study with Datasets from LOD Cloud Available at our website https://data.gesis.org/lodcc/2017-08
  • 24. Graph-based Analysis at large scale To analyze RDF graphs at large scale you have to  Download the list of available datasets  Acquire the datasets  Represent as a graph-object  Compute graph measures on that Sounds easy, right? 24 Case Study with Datasets from LOD Cloud
  • 25. Graph-based Analysis at large scale In reality not that easy  not all data providers offer data dumps  non-standard media type declarations  various formats, compressed archives, hierarchies of files and folders  erroneous/error-prone data 25 Case Study with Datasets from LOD Cloud
  • 26. Acquisition and Preparation 26 1163 • metadata packages 890 • 150 different media type statements • URLs for the official media type statements that are supported 486 • after filtering 404 and content-type HTML 280 • left out SPARQL-Endpoints • after graph preparation with corrupt downloads, wrong media type statements, syntax errors Case Study with Datasets from LOD Cloud
  • 27. Processed Datasets by Domain 27 Case Study with Datasets from LOD Cloud
  • 28. Processed Datasets by Domain 28 Case Study with Datasets from LOD Cloud
  • 29.  Average degree z seems not affected by number of edges, in all but Geography and Government  Average edges per vertex  Life Sciences: 63.50  Cross Domain: 5.46  Average overall domains: 7.9 edges per vertex 29 Preliminary Analysis of Results Preliminary Analysis of Results
  • 30. Preliminary Analysis of Results  hd grows exponentially with number of edges  Life Sciences and Government are more “dense”  Linguistics forms two clusters, almost no dependency to the number of edges, low on avg. 30 Preliminary Analysis of Results
  • 31. Availability, Maintenance, Sustainability 31 The framework • Published under MIT license on GitHub: https://github.com/mazlo/lodcc • Actively used in other research activities • Future releases (minor, bugfixes, features) The datasets • Recalculate for newer versions of the LOD Cloud • Made available to the community • Combine with other datasets, e.g. http://stats.lod2.eu
  • 32. Future Work and Research  Investigate domain- and dataset-specific irregularities  Derive implications for modelling tasks, on dataset level and applications like benchmarking  Offer SPARQL-endpoint to query results 32
  • 33. Thank you for your attention [1] https://github.com/mazlo/lodcc [2] https://data.gesis.org/lodcc/2017-08 @matzlo

Editor's Notes

  1. Our motivation is the study of graph topologies, which is interesting because the availability and linkage of RDF datasets grow. As this number rises, we need meaningful statistics and measures to describe the data. Many approaches collect statistics mostly at instance- and schema-level, but not necessarily from the data structure that an RDF dataset comes with: the RDF data graph. Various research areas rely on such statistics and measures, e.g. data-driven tasks like query processing, studies on the quality of datasets, and services that monitor the evolution of the space.
  2. The implicit data structure that we get from a set of RDF triples composes a directed and labelled graph, where subjects and objects can be defined as vertices while predicates correspond to edges. So, when we build up a graph-object from this, we will be able to compute various measures like ..
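The mapping described in this note can be illustrated with a few lines of plain Python (a toy sketch, not the framework's actual graph-tool based implementation): subjects and objects become vertices, each triple becomes a directed edge, and some of the basic measures fall out directly.

```python
from collections import Counter

# Toy RDF triples (s, p, o); in the framework these come from parsed dumps.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:name",  '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),   # parallel edge: same (s, o) pair
]

# Subjects and objects become vertices, predicates label the edges.
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o) for s, _, o in triples]

n_vertices = len(vertices)
n_edges = len(edges)
# Parallel edges: extra edges between the same ordered vertex pair.
pair_counts = Counter(edges)
n_parallel = sum(c - 1 for c in pair_counts.values() if c > 1)
# Reciprocity: fraction of unique edges whose reverse edge also exists.
unique = set(edges)
reciprocity = sum(1 for s, o in unique if (o, s) in unique) / len(unique)

print(n_vertices, n_edges, n_parallel, round(reciprocity, 2))  # → 3 4 1 0.67
```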
  3. We can think of various research areas that may benefit from such analyses. For instance, BENCHMARKING: benchmark suites aim at simulating a real-world scenario, in that they provide a synthetic dataset generator and common queries. If we look closer, we can see that benchmark datasets interpret growth in terms of the number of edges and the maximum in-degree, not the maximum out-degree; the density of the graph shrinks, and some have no reciprocity.
  4. SAMPLING: almost the same applies to sampling methods, where research aims at delivering a representative sample of an original dataset. Example questions that arise in this field: What does representative mean? How to obtain a (minimal) representative sample? Which method to use? Apart from qualitative aspects like classes, properties, instances, and used vocabularies, topological characteristics should also be considered, since they allow for a more accurate description of the dataset. This applies to all graph-based datasets and is not a LD/RDF-specific issue.
  5. PROFILING: with the growing number of datasets in the LOD Cloud, their linkage and connectivity are of particular interest. Graph measures may help to monitor changes, and the impact of changes, in datasets or even domains over time.
  6. First… And second, a dataset of 280 RDF datasets from the LOD Cloud late 2017, which we processed with the framework. Part of the resource is a website that presents these results for all of the datasets. The datasets are pre-processed and ready to be re-used. In the next slides I am going to present how the framework works, how we did the case study on the LOD Cloud late 2017, and a preliminary analysis of these results over all domains.
  7. This is how the framework works. To be able to instantiate a graph-object from an RDF dataset, we have come up with this pipeline. This can also be found in related work and in other studies. …
  8. The first two steps are optional. First, the corresponding metadata is loaded from datahub, parsed for media types, and saved into a local database. This is advisable for parallel processing and highly recommended when you have many datasets. The framework can work with either a database connection or command-line arguments if you have no database. The list should contain all names, media type statements with URLs, domain affiliations, and modification dates. The framework is currently limited to working with official media types for the most common formats of RDF data, which are N-Triples, RDF/XML, Turtle, N-Quads, and Notation3.
  9. Dumps then get downloaded, extracted, transformed, and grouped in the case of archives with multiple files. To build a graph-object one can use an edgelist, which is a list of source and target vertices, one pair per line. That is why the N-Triples format is very handy and why we need the transformation procedure.
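A minimal sketch of that transformation step in Python, assuming well-formed N-Triples lines (the framework itself relies on external tools such as rapper for robust parsing):

```python
# Turn one N-Triples statement into an edgelist entry (source, target, label).
# This naive split assumes a well-formed line "<s> <p> <o> ." and would need
# more care for, e.g., literals that themselves end in a dot.
def ntriple_to_edge(line):
    s, p, o = line.rstrip(" .\n").split(None, 2)
    return s, o, p  # the predicate becomes the edge label

line = ("<http://example.org/alice> "
        "<http://xmlns.com/foaf/0.1/knows> "
        "<http://example.org/bob> .")
source, target, label = ntriple_to_edge(line)
print(source, target, label)
```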
  10. Here is an example of a statement transformed into the N-Triples format. An issue with N-Triples, however, is that it adds a lot of boilerplate text, because a lot of information gets repeated, mainly the URLs, and so the graph objects get large, both on disk and in memory.
  11. Therefore we used a non-cryptographic hashing function to “encode” each part of the triple. Encoding the data in such a form has many advantages: it saves memory, as on average only 20% of the characters have to be stored, and it makes graphs comparable in terms of contained vertices and edges, because the hash of a URL in dataset 1 will be the same in dataset 2.
  12. To build up the edgelist, we just swapped the positions of the object and the predicate, making the predicate an additional property of the edge that is stored with the graph object.
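The encoding and reordering steps can be sketched as follows. The concrete hash function is an assumption here — the talk only says it is non-cryptographic — so the standard library's `zlib.crc32` stands in for it:

```python
import zlib

def encode(term):
    # Fixed-width hex digest of a non-cryptographic hash: short and deterministic.
    return format(zlib.crc32(term.encode("utf-8")), "08x")

s = "<http://example.org/alice>"
p = "<http://xmlns.com/foaf/0.1/knows>"
o = "<http://example.org/bob>"

# Edgelist line: source, target, then the predicate as an edge property.
edge_line = f"{encode(s)} {encode(o)} {encode(p)}"
print(edge_line)
```

Because the hash is deterministic, the same URI maps to the same vertex id in every dataset, which keeps vertex sets comparable after encoding.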
  13. In the last steps, the graph object is built from the edgelist and the measures are calculated. The framework can be configured to run in parallel. How long it takes to complete depends on your network connection, CPU cores, and hard-disk I/O.
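Conceptually, the parallel configuration looks like the sketch below; the pool type and the `analyze` worker are illustrative assumptions, not the framework's actual worker model:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-dataset worker: in the real pipeline this would download
# the dump, transform it to an edgelist, build the graph, and compute measures.
def analyze(dataset_name):
    return dataset_name, {"status": "done"}

datasets = ["dataset-a", "dataset-b", "dataset-c"]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(analyze, datasets))

print(sorted(results))
```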
  14. The framework computes 28 graph-based measures, which can be grouped into five groups. Here are some examples. BASIC: number of vertices and edges, parallel edges, unique edges. DEGREE-BASED: max (in-/out-)degree, average degree, h-index (directed and undirected). CENTRALITY: graph centralization, max CD. EDGE-BASED: density (ratio of all edges to all possible edges), reciprocity, diameter. DESCRIPTIVE: variance, standard deviation, coefficient of variation, degree distribution and power-law exponent alpha.
  15. For example, BASIC: number of vertices and edges, parallel edges, unique edges. DEGREE-BASED: max (in-/out-)degree, average degree, h-index (directed and undirected). CENTRALITY: graph centralization, max CD. EDGE-BASED: density (ratio of all edges to all possible edges), reciprocity, diameter. DESCRIPTIVE: variance, standard deviation, coefficient of variation, degree distribution and power-law exponent alpha.
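Two of these measures can be sketched on a toy directed edgelist (definitions as commonly used for directed graphs without self-loops; the framework itself computes them with graph-tool):

```python
# Toy directed edgelist.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")]
vertices = {v for e in edges for v in e}

n, m = len(vertices), len(edges)
avg_degree = m / n            # average number of edges per vertex, z
density = m / (n * (n - 1))   # ratio of edges to all possible directed edges

print(avg_degree, density)
```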
  16. This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
  20. Now I will come to the second part, the description of the datasets that we have processed with the framework. We thought there is no better stress test for our framework than the datasets from the LOD Cloud. From a theoretically available number of around 1200 datasets, we managed to analyse 280 datasets from the LOD Cloud late 2017; I will tell you why in a moment. This is the second resource that we publish with the paper.
  21. This resource contains all 280 datasets that we have processed and analyzed with the framework. We obtained values for all 28 measures per dataset and created a website to be able to browse the results. For each of the datasets, you can download the initial metadata that was used to acquire the dump, all results as a CSV export, and a serialized graph-object that you can re-use for further analysis. All of this is available at the website. The main benefit of this collection is that each RDF dataset is already prepared. This makes it possible to reproduce the results and to perform further analyses of graph measures on the graphs without additional preparation. For all datasets we also provide plots, e.g. of the degree distribution.
  22. This is how we did the analysis. To analyze RDF graphs at large scale, in terms of dataset size and dataset quantity, you would have to …
  23. But in reality, not all data providers offer data dumps. And for those that do, you frequently have to deal with non-standard (wrong) media type declarations. Providers use different formats, some compress their files with different algorithms, and some give you a hierarchy of files and folders including non-RDF data. In addition, you will encounter erroneous and error-prone data, like syntax errors, etc.
  24. At first we had all metadata packages at hand. After parsing them, we got 150 different media type statements. Since the framework accepts only official media type statements of the most common media types, we manually mapped them. After this mapping we got 890 datasets with URLs. Further, we filtered out HTTP 404 responses and HTML content types. These were the manual steps in the framework's processing pipeline presented earlier. Furthermore, we concentrated on data dumps, not SPARQL endpoints, so as not to stress them.
  25. This is a snapshot of the website. On the left side you can see the distribution of the datasets per domain for which we were able to do the analysis. Unfortunately, some of them are not well represented.
  26. However, the largest dataset is in the Cross Domain: en-dbpedia with 2.6B edges. Most datasets are in the Linguistics and Publications domains. We did a preliminary analysis of the measures across all domains and could observe dataset- and domain-specific phenomena. I would like to show you two measures, average degree and h-index on the directed graph, which we have plotted across all domains.
  27. Average degree is a frequently consulted measure and gives you the average number of edges that vertices have in the graph object. In this plot you can see the datasets and average degree values for 5 domains. The datasets are ordered descending by number of edges. Looking at the plot, you can see that the average degree seems not to be affected by the number of edges in all domains but Geography and Government, which show an increasing linear relationship. Outliers can be observed in all domains, like bio2rdf-irefindex in Life Sciences with 63. In the Linguistics domain there seem to be two clusters, with one group having higher values than the other. This may be considered a dataset-specific phenomenon, most probably caused by the fact that either one data provider used a specific vocabulary and modelled more accurately (more predicates on average), or two different providers published a lot of small datasets of different kinds. Unfortunately, this may not necessarily be representative, because not all datasets were included.
  28. The second measure that I've plotted here across all domains is the h-index, which is known from citation networks. There it is an indicator of the importance of a vertex in a network; here it is a statistical measure on the graph. Each dot in the figure is a dataset. The datasets are ordered descending by number of edges, and the y-axis is log-scaled. The h-index grows exponentially with the size of the graph. Government and Life Sciences report higher values and could be considered more "dense", while Publications reports lower values. Again, Linguistics shows two clusters with almost constant values that seem to be independent of the number of edges in most cases, in particular for the lower group of datasets.
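The h-index on a graph's degree sequence can be sketched as follows, by analogy with the citation h-index: the largest h such that at least h vertices have degree at least h.

```python
# h-index of a degree sequence: largest h with at least h vertices of
# degree >= h.
def h_index(degrees):
    h = 0
    for i, d in enumerate(sorted(degrees, reverse=True), start=1):
        if d >= i:
            h = i
        else:
            break
    return h

degrees = [9, 7, 6, 2, 1]
print(h_index(degrees))  # → 3: three vertices have degree >= 3
```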
  29. Regarding future work, we would like to investigate the domain- and dataset-specific irregularities: where do they come from, what is the reason, etc., and derive implications for modelling tasks, at dataset level and for applications like benchmarking.