A Software Framework and Datasets for the
Analysis of Graph Measures on RDF Graphs
Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1,
Stefan Dietze1,3, Stefan Conrad3
1 GESIS - Leibniz-Institute for the Social Sciences, Germany
2 Karlsruhe Institute of Technology, Germany
3 Institute for Computer Science, Heinrich-Heine University, Germany
Motivation
Studying graph topologies is relevant because
 the availability and linkage of RDF datasets grow
 various research areas rely on meaningful statistics and measures
We want to study the topology of RDF graphs
 not at instance or schema level
 but at the level of the implicit graph structure underlying RDF data
Why studying graph topologies is relevant
Graph-based model of RDF
[Figure: an RDF triple (s, p, o) rendered as a directed graph, with subject s and object o as vertices and predicate p as the labeled edge]
From the resulting graph-object we can compute measures such as
- # vertices and # edges
- # parallel edges
- density or reciprocity
- degree-based measures
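This triple-to-graph mapping can be sketched in a few lines of Python; the triples below are invented toy data, and the counting logic is a minimal illustration rather than the framework's implementation:

```python
from collections import Counter

# Toy RDF triples (s, p, o); invented for illustration
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:bob"),  # parallel edge: same (s, o) pair again
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

# Subjects and objects become vertices; each triple is a directed, labeled edge
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o) for s, _, o in triples]

n_vertices = len(vertices)
n_edges = len(edges)
# Parallel edges: extra edges between an already-connected (s, o) pair
n_parallel = sum(c - 1 for c in Counter(edges).values() if c > 1)
n_unique = len(set(edges))

print(n_vertices, n_edges, n_parallel, n_unique)  # 3 3 1 2
```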
Research areas that may benefit
 Benchmarking – designers may use the measures to generate more representative synthetic datasets
 Sampling – more representative samples in terms of the graph structure
 Profiling and Evolution – monitoring the change in structure over time (influence vs. prominence)
Resource Paper Contribution
Our paper introduces two resources
1. An open-source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1]
2. A collection of 280 RDF datasets from the LOD Cloud (late 2017), pre-processed and ready to be re-used; a browsable version is available [2]
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
Framework’s Processing Pipeline
How to acquire, prepare, and perform a graph-based analysis on RDF
[3] Debattista, J., Lange, C., Auer, S. & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
Dataset’s Metadata Preparation
 Optional. Prepares an offline list of all datasets, e.g. for parallel processing.
 The list should contain all dataset names, the (official) media type format with URLs, the domain class, and the modification date.
Graph-Object Preparation
 Downloads the dump, then extracts*, transforms*, and groups* the RDF files
 The N-Triples format is used to transform the data into an edgelist structure
* if necessary
Graph-Object Preparation
 As N-Triples:
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
 A non-cryptographic hashing function is used to “encode” the RDF terms [8]:
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
 As edgelist (source vertex, target vertex, edge-property):
43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
[8] xxhash, https://github.com/Cyan4973/xxHash
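The encoding step can be sketched as follows. The framework uses xxhash [8]; to keep this snippet dependency-free, Python's built-in blake2b (truncated to 64 bits) stands in for it, and the full URIs are hypothetical stand-ins for the abbreviated ones above:

```python
import hashlib

def encode(term: str) -> str:
    # Stand-in for the framework's xxhash-based encoding: any stable function
    # mapping an RDF term to a short fixed-width id serves the same purpose.
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()

# Hypothetical full-URI version of the slide's abbreviated example triple
s = "<http://example.org/dataset/whisky-circle-info>"
p = "<http://purl.org/dc/terms/title>"
o = '"Whisky Circle"@en'

# Edgelist row: source vertex, target vertex, edge-property (the predicate)
row = f"{encode(s)} {encode(o)} {encode(p)}"
print(row)  # three 16-hex-character ids, one per RDF term
```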
Graph-Object Instantiation
 Reads the edgelist and builds the graph structure
 Reports results on measures from 5 dimensions
Library re-use
[Figure: the processing pipeline annotated with the tools re-used at each stage – metadata acquisition [4], download [5], extraction, transformation, and encoding [6,7,8], graph instantiation [9]]
[4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json
[5] Wget, https://www.gnu.org/software/wget/
[6] dtrx, https://github.com/moonpyk/dtrx
[7] rapper, http://librdf.org/raptor/rapper.html
[8] xxhash, https://github.com/Cyan4973/xxHash
[9] graph-tool, https://graph-tool.skewed.de/
Groups of Measures
The framework reports on 28 measures from 5 groups
Basic graph measures
• no. of vertices, edges
• parallel edges
• unique edges
Degree-based measures
• max-[in|out]-degree
• average degree
• h-index (directed/undirected)
Centrality measures
• graph centralization
• max degree centrality
Edge-based measures
• density
• reciprocity
• diameter
Descriptive statistical measures
• variance, standard deviation, coefficient of variation
• degree distribution, powerlaw exponent alpha
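Several of the degree-based measures can be computed directly from the edgelist. A minimal sketch on invented toy data; the graph h-index used here (the largest h such that at least h vertices have degree ≥ h) follows the standard definition from the literature, not necessarily the framework's exact code:

```python
from collections import Counter

# Toy directed edgelist (source, target); invented data
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "a")]

out_deg = Counter(s for s, _ in edges)
in_deg = Counter(t for _, t in edges)
vertices = set(out_deg) | set(in_deg)

max_out = max(out_deg.values())          # max-out-degree
max_in = max(in_deg.values())            # max-in-degree
avg_degree = len(edges) / len(vertices)  # z = |E| / |V|

def h_index(degrees):
    # Largest h such that at least h vertices have degree >= h
    degrees = sorted(degrees, reverse=True)
    return max((i + 1 for i, d in enumerate(degrees) if d >= i + 1), default=0)

total_deg = [in_deg[v] + out_deg[v] for v in vertices]
print(max_out, max_in, avg_degree, h_index(total_deg))  # 3 2 1.25 2
```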
Performance
[Figures: example datasets and their sizes, and example runtimes of the pipeline]
A Dataset of Pre-Processed RDF Graphs
Datasets from 9 domains in the LOD Cloud:
 12 in May 2007
 570 in August 2014
 1163 in August 2017
 1224 in August 2018
 1239 in March 2019
 A total of 280 RDF datasets processed and analyzed
 Values for 28 measures per dataset
 Graph-objects ready to be re-used, results as CSV, and the original link to the metadata
Available at our website: https://data.gesis.org/lodcc/2017-08
Case Study with Datasets from LOD Cloud
Graph-based Analysis at Large Scale
To analyze RDF graphs at large scale you have to
 download the list of available datasets
 acquire the datasets
 represent each one as a graph-object
 compute graph measures on it
Sounds easy, right?
In reality it is not that easy
 not all data providers offer data dumps
 non-standard media type declarations
 various formats, compressed archives, hierarchies of files and folders
 erroneous/error-prone data
Acquisition and Preparation
 1163 metadata packages
 890 after restricting to URLs with supported official media type statements (150 different media type statements found)
 486 after filtering out 404 responses and content-type HTML
 280 after leaving out SPARQL endpoints, and after graph preparation removed corrupt downloads, wrong media type statements, and syntax errors
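The filtering funnel above can be sketched as a predicate over dataset metadata records; the record layout, media type list, and rules here are illustrative assumptions on toy data, not the framework's actual schema:

```python
# Each record: (name, media_type, http_status); invented toy metadata
records = [
    ("dataset-a", "application/n-triples", 200),
    ("dataset-b", "text/html", 200),              # landing page, not a dump
    ("dataset-c", "application/rdf+xml", 404),    # dump unreachable
    ("dataset-d", "application/sparql-results+json", 200),  # endpoint, no dump
]

# Hypothetical set of supported official media types
SUPPORTED = {"application/n-triples", "application/rdf+xml", "text/turtle"}

def usable(record):
    name, media_type, status = record
    if status == 404:              # filter unreachable dumps
        return False
    if media_type == "text/html":  # filter HTML landing pages
        return False
    return media_type in SUPPORTED # keep supported media types only

kept = [name for name, *_ in records if usable((name, *dict(
        (n, (m, s)) for n, m, s in records)[name]))] if False else \
       [r[0] for r in records if usable(r)]
print(kept)  # ['dataset-a']
```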
Processed Datasets by Domain
[Figure: number of processed datasets per domain]
Preliminary Analysis of Results
 The average degree z seems unaffected by the number of edges, in all domains but Geography and Government
 Average edges per vertex:
 Life Sciences: 63.50
 Cross Domain: 5.46
 Average over all domains: 7.9 edges per vertex
 hd grows exponentially with the number of edges
 Life Sciences and Government are more “dense”
 Linguistics forms two clusters, with almost no dependency on the number of edges, and is low on average
Availability, Maintenance, Sustainability
The framework
• published under the MIT license on GitHub: https://github.com/mazlo/lodcc
• actively used in other research activities
• future releases planned (minor versions, bugfixes, features)
The datasets
• recalculate for newer versions of the LOD Cloud
• made available to the community
• combine with other datasets, e.g. http://stats.lod2.eu
Future Work and Research
 Investigate domain- and dataset-specific irregularities
 Derive implications for modelling tasks, at dataset level and for applications like benchmarking
 Offer a SPARQL endpoint to query the results
Thank you for your attention
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
@matzlo

More Related Content

What's hot

Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
తేజ దండిభట్ల
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
Revolution Analytics
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
Ajay Ohri
 

What's hot (19)

Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
 
Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
R programming Language , Rahul Singh
R programming Language , Rahul SinghR programming Language , Rahul Singh
R programming Language , Rahul Singh
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Information Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open VocabulariesInformation Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open Vocabularies
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...
 
SPARQL and RDF query optimization
SPARQL and RDF query optimizationSPARQL and RDF query optimization
SPARQL and RDF query optimization
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
The History and Use of R
The History and Use of RThe History and Use of R
The History and Use of R
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the web
 
Survey of Graph Indexing
Survey of Graph IndexingSurvey of Graph Indexing
Survey of Graph Indexing
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
 
Mansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analyticsMansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analytics
 
Range Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map ReduceRange Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map Reduce
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 

Similar to ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Measures

DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
Gezim Sejdiu
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Gezim Sejdiu
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
DataWorks Summit
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
RAKESHG79
 

Similar to ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Measures (20)

Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
 
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software ComponentsFIWARE Global Summit - IDS Implementation with FIWARE Software Components
FIWARE Global Summit - IDS Implementation with FIWARE Software Components
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
 
R tutorial
R tutorialR tutorial
R tutorial
 
Translation of Relational and Non-Relational Databases into RDF with xR2RML
Translation of Relational and Non-Relational Databases into RDF with xR2RMLTranslation of Relational and Non-Relational Databases into RDF with xR2RML
Translation of Relational and Non-Relational Databases into RDF with xR2RML
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 
Visualization Proess
Visualization ProessVisualization Proess
Visualization Proess
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
ADAPTER
ADAPTERADAPTER
ADAPTER
 
AstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & Management
AstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & ManagementAstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & Management
AstraZeneca - Re-imagining the Data Landscape in Compound Synthesis & Management
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 

Recently uploaded

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
Kamal Acharya
 

Recently uploaded (20)

2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Toll tax management system project report..pdf
Toll tax management system project report..pdfToll tax management system project report..pdf
Toll tax management system project report..pdf
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
shape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxshape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptx
 
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Online resume builder management system project report.pdf
Online resume builder management system project report.pdfOnline resume builder management system project report.pdf
Online resume builder management system project report.pdf
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
Arduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectArduino based vehicle speed tracker project
Arduino based vehicle speed tracker project
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 

ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Measures

  • 1. A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1, Stefan Dietze1,3, Stefan Conrad3 1 GESIS - Leibniz-Institute for the Social Sciences, Germany 2 Karlsruhe Institute of Technology, Germany 3 Institute for Computer Science, Heinrich-Heine University, Germany
  • 2. Motivation Studying graph topologies is relevant because  availability and linkage of RDF data sets grow  various research areas rely on meaningful statistics and measures We want to study the topology of RDF graphs  not at instance- or schema-level  but about the implicit data structure on RDF data graphs 2 Why studying graph topologies is relevant
  • 3. Graph-based model of RDF 3 oo o o o - # vertices and # edges - # parallel edges - density or reciprocity - degree-based measures (s, p, o) s o p p p p p p p os p Why studying graph topologies is relevant
  • 4. Research areas that may benefit  Benchmarking – Designers may use the measures to generate more representative synthetic datasets 4 Why studying graph topologies is relevant
  • 5. Research areas that may benefit  Benchmarking – Designers may use the measures to generate more representative synthetic datasets  Sampling – more representative samples in terms of the structure 5 Why studying graph topologies is relevant
  • 6. Research areas that may benefit  Benchmarking – Designers may use the measures to generate more representative synthetic datasets  Sampling – more representative samples in terms of the structure  Profiling and Evolution – monitor the change in structure over time (influence vs. prominence) 6 Why studying graph topologies is relevant
  • 7. Resource Paper Contribution Our paper introduces two resources 1. An open source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1] 2. A dataset of 280 RDF datasets from the LOD Cloud late 2017, pre-processed and ready to be re-used. Browsable version available [2] 7 [1] https://github.com/mazlo/lodcc [2] https://data.gesis.org/lodcc/2017-08
  • 8. Framework’s Processing Pipeline 8 How to acquire, prepare, and perform a graph-based analysis on RDF [3] Debattista, J., Lange, C., Auer, S. & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation.. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
  • 9. Dataset’s Metadata Preparation 9  Optional. Preparation of an offline list of all datasets, e.g. for parallel processing.  List should contain all dataset names, the (official) media type format with URLs, domain class, and modification date. How to acquire, prepare, and perform a graph-based analysis on RDF
  • 10. Graph-Object Preparation 10  Downloads the dump, extracts*, transforms*, and groups* RDF files  N-triples format is used to transform into an edgelist structure * if necessary How to acquire, prepare, and perform a graph-based analysis on RDF
  • 11. Graph-Object Preparation 11 s o (s, p, o) p <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . How to acquire, prepare, and perform a graph-based analysis on RDF  As N-Triples
  • 12. Graph-Object Preparation  As N-Triples  use non-cryptographic hashing function to „encode“ the data [3] 12 <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . 43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02 (s, p, o) s o p [3] xxhash, https://github.com/Cyan4973/xxHash How to acquire, prepare, and perform a graph-based analysis on RDF
  • 13. Graph-Object Preparation  As N-Triples  As edgelist 13 (s, p, o) source vertex target vertex edge-property 43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . 43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02 s o p How to acquire, prepare, and perform a graph-based analysis on RDF
  • 14. Graph-Object Instantiation 14  Reads edgelist and builds graph structure  Reports results on measures from 5 dimensions How to acquire, prepare, and perform a graph-based analysis on RDF
  • 15. Library re-use 15 How to acquire, prepare, and perform a graph-based analysis on RDF [4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json [5] Wget, https://www.gnu.org/software/wget/ [6] dtrx, https://github.com/moonpyk/dtrx [7] rapper, http://librdf.org/raptor/rapper.html [8] xxhash, https://github.com/Cyan4973/xxHash [9] graph-tool, https://graph-tool.skewed.de/ [4] [6,7,8][9] [5]
  • 16. Groups of Measures Framework reports on 28 measures from 5 groups 16 How to acquire, prepare, and perform a graph-based analysis on RDF • no. of vertices, edges • parallel edges • unique edges Basic graph measures • max-[in|out]-degree • average degree • h-index (direct./undirect.) Degree-based measures • graph centralization • max degree centrality Centrality measures
  • 17. Groups of Measures Framework reports on 28 measures from 5 groups 17 How to acquire, prepare, and perform a graph-based analysis on RDF • no. of vertices, edges • parallel edges • unique edges Basic graph measures • max-[in|out]-degree • average degree • h-index (direct./undirect.) Degree-based measures • graph centralization • max degree centrality Centrality measures • density • reciprocity • diameter Edge-based measures • variance, standard dev., coefficient of var. • degree-distribution, powerlaw-exponent alpha Descriptive stat. measures
  • 18. Performance Example: datasets and sizes 18 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 19. Performance Example: datasets and sizes 19 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 20. Performance Example: datasets and sizes Example: runtimes 20 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 21. Performance Example: datasets and sizes Example: runtimes 21 How to acquire, prepare, and perform a graph-based analysis on RDF
  • 22. 22 Datasets from 9 domains in LOD Cloud  12 in May 2007  570 in August 2014  1163 in August 2017  1224 in August 2018  1239 in March 2019 A Dataset of Pre-Processed RDF Graphs
  • 23. A Dataset of Pre-Processed RDF Graphs  Total of 280 RDF datasets processed and analyzed  Values for 28 measures per dataset  Graph-objects ready to be re-used, results as CSV, and original link to metadata 23 Case Study with Datasets from LOD Cloud Available at our website https://data.gesis.org/lodcc/2017-08
  • 24. Graph-based Analysis at large scale To analyze RDF graphs at large scale you have to  Download the list of available datasets  Acquire the datasets  Represent as a graph-object  Compute graph measures on that Sounds easy, right? 24 Case Study with Datasets from LOD Cloud
  • 25. Graph-based Analysis at large scale In reality not that easy  not all data providers offer data dumps  non-standard media type declarations  various formats, compressed archives, hierarchies of files and folders  erroneous/error-prone data 25 Case Study with Datasets from LOD Cloud
  • 26. Acquisition and Preparation 26 1163 • metadata packages 890 • 150 different media type statements • URLs for the official media type statements that are supported 486 • after filtering 404 and content-type HTML 280 • left out SPARQL-Endpoints • after graph preparation with corrupt downloads, wrong media type statements, syntax errors Case Study with Datasets from LOD Cloud
  • 27. Processed Datasets by Domain 27 Case Study with Datasets from LOD Cloud
  • 28. Processed Datasets by Domain 28 Case Study with Datasets from LOD Cloud
  • 29.  Average degree z seems not affected by number of edges, in all but Geography and Government  Average edges per vertex  Life Sciences: 63.50  Cross Domain: 5.46  Average overall domains: 7.9 edges per vertex 29 Preliminary Analysis of Results Preliminary Analysis of Results
  • 30. Preliminary Analysis of Results  hd grows exponentially with number of edges  Life Sciences and Government are more “dense”  Linguistics forms two clusters, almost no dependency to the number of edges, low on avg. 30 Preliminary Analysis of Results
  • 31. Availability, Maintenance, Sustainability 31 The framework • Published under MIT license on GitHub: https://github.com/mazlo/lodcc • Actively used in other research activities • Future releases (minor, bugfixes, features) The datasets • Recalculate for newer versions of the LOD Cloud • Made available to the community • Combine with other datasets, e.g. http://stats.lod2.eu
  • 32. Future Work and Research  Investigate domain- and dataset-specific irregularities  Derive implications for modelling tasks, on dataset level and applications like benchmarking  Offer SPARQL-endpoint to query results 32
  • 33. Thank you for your attention [1] https://github.com/mazlo/lodcc [2] https://data.gesis.org/lodcc/2017-08 @matzlo

Editor's Notes

  1. Our motivation is the study of graph topologies, which is interesting because the availability and linkage of RDF datasets grow. As this number rises, we need meaningful statistics and measures to describe the data. Many approaches collect statistics mostly at instance- and schema-level, but not necessarily from the data structure that an RDF dataset comes with: the RDF data graph. Various research areas rely on such statistics and measures, e.g. data-driven tasks like query processing, studies on the quality of datasets, and services that monitor the evolution of the space.
  2. The implicit data structure that we get from a set of RDF triples composes a directed and labelled graph, where subjects and objects can be defined as vertices while predicates correspond to edges. So, when we build up a graph-object from this, we will be able to compute various measures like ..
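The mapping described in this note can be illustrated with a few lines of plain Python (a toy sketch, not the framework's actual graph-tool based implementation): subjects and objects become vertices, each triple becomes a directed edge, and some of the basic measures fall out directly.

```python
from collections import Counter

# Toy RDF triples (s, p, o); in the framework these come from parsed dumps.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:name",  '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),   # parallel edge: same (s, o) pair
]

# Subjects and objects become vertices, predicates label the edges.
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o) for s, _, o in triples]

n_vertices = len(vertices)
n_edges = len(edges)
# Parallel edges: extra edges between the same ordered vertex pair.
pair_counts = Counter(edges)
n_parallel = sum(c - 1 for c in pair_counts.values() if c > 1)
# Reciprocity: fraction of unique edges whose reverse edge also exists.
unique = set(edges)
reciprocity = sum(1 for s, o in unique if (o, s) in unique) / len(unique)

print(n_vertices, n_edges, n_parallel, round(reciprocity, 2))  # → 3 4 1 0.67
```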
  3. We can think of various research areas that may benefit from such analyses. For instance, BENCHMARKING: benchmark suites aim at simulating a real-world scenario, in that they provide a synthetic dataset generator and common queries. If we look closer, we can see that benchmark datasets interpret growth in terms of the number of edges and the maximum in-degree, not the maximum out-degree; the density of the graph shrinks, and some have no reciprocity.
  4. SAMPLING: almost the same applies to sampling methods, where research aims at delivering a representative sample of an original dataset. Example questions that arise in this field: What does representative mean? How to obtain a (minimal) representative sample? Which method to use? Apart from qualitative aspects like classes, properties, instances, and used vocabularies, topological characteristics should also be considered, since they allow for a more accurate description of the dataset. This applies to all graph-based datasets and is not a LD/RDF-specific issue.
  5. PROFILING: with the growing number of datasets in the LOD Cloud, their linkage and connectivity are of particular interest. Graph measures may help to monitor changes, and the impact of changes, in datasets or even domains over time.
  6. First… And second, a dataset of 280 RDF datasets from the LOD Cloud late 2017, which we processed with the framework. Part of the resource is a website that presents these results for all of the datasets. The datasets are pre-processed and ready to be re-used. In the next slides I am going to present how the framework works, how we did the case study on the LOD Cloud late 2017, and a preliminary analysis of these results over all domains.
  7. This is how the framework works. To be able to instantiate a graph-object from an RDF dataset, we have come up with this pipeline. This can also be found in related work and in other studies. …
  8. The first two steps are optional. First, the corresponding metadata is loaded from datahub, parsed for media types, and saved into a local database. This is advisable for parallel processing and highly recommended when you have many datasets. The framework can work with either a database connection or command-line arguments if you have no database. The list should contain all names, media type statements with URLs, domain affiliations, and modification dates. The framework is currently limited to working with official media types for the most common formats of RDF data, which are N-Triples, RDF/XML, Turtle, N-Quads, and Notation3.
  9. Dumps then get downloaded, extracted, transformed, and grouped in the case of archives with multiple files. To build a graph-object one can use an edgelist, which is a list of source and target vertices, one pair per line. That is why the N-Triples format is very handy and why we need the transformation procedure.
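A minimal sketch of that transformation step in Python, assuming well-formed N-Triples lines (the framework itself relies on external tools such as rapper for robust parsing):

```python
# Turn one N-Triples statement into an edgelist entry (source, target, label).
# This naive split assumes a well-formed line "<s> <p> <o> ." and would need
# more care for, e.g., literals that themselves end in a dot.
def ntriple_to_edge(line):
    s, p, o = line.rstrip(" .\n").split(None, 2)
    return s, o, p  # the predicate becomes the edge label

line = ("<http://example.org/alice> "
        "<http://xmlns.com/foaf/0.1/knows> "
        "<http://example.org/bob> .")
source, target, label = ntriple_to_edge(line)
print(source, target, label)
```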
  10. Here is an example of a statement transformed into the N-Triples format. An issue with N-Triples, however, is that it adds a lot of boilerplate text, because a lot of information gets repeated, mainly the URLs, and so the graph objects get large, both on disk and in memory.
  11. Therefore we used a non-cryptographic hashing function to “encode” each part of the triple. Encoding the data in such a form has many advantages: it saves memory, as on average only 20% of the characters have to be stored, and it makes graphs comparable in terms of contained vertices and edges, because the hash of a URL in dataset 1 will be the same in dataset 2.
  12. To build up the edgelist, we just swapped the positions of the object and the predicate, making the predicate an additional property of the edge that is stored with the graph object.
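The encoding and reordering steps can be sketched as follows. The concrete hash function is an assumption here — the talk only says it is non-cryptographic — so the standard library's `zlib.crc32` stands in for it:

```python
import zlib

def encode(term):
    # Fixed-width hex digest of a non-cryptographic hash: short and deterministic.
    return format(zlib.crc32(term.encode("utf-8")), "08x")

s = "<http://example.org/alice>"
p = "<http://xmlns.com/foaf/0.1/knows>"
o = "<http://example.org/bob>"

# Edgelist line: source, target, then the predicate as an edge property.
edge_line = f"{encode(s)} {encode(o)} {encode(p)}"
print(edge_line)
```

Because the hash is deterministic, the same URI maps to the same vertex id in every dataset, which keeps vertex sets comparable after encoding.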
  13. In the last steps, the graph object is built from the edgelist and the measures are calculated. The framework can be configured to run in parallel. How long it takes to complete depends on your network connection, CPU cores, and hard-disk I/O.
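Conceptually, the parallel configuration looks like the sketch below; the pool type and the `analyze` worker are illustrative assumptions, not the framework's actual worker model:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-dataset worker: in the real pipeline this would download
# the dump, transform it to an edgelist, build the graph, and compute measures.
def analyze(dataset_name):
    return dataset_name, {"status": "done"}

datasets = ["dataset-a", "dataset-b", "dataset-c"]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(analyze, datasets))

print(sorted(results))
```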
  14. The framework computes 28 graph-based measures, which can be grouped into five groups. Here are some examples. BASIC: number of vertices and edges, parallel edges, unique edges. DEGREE-BASED: max (in-/out-)degree, average degree, h-index (directed and undirected). CENTRALITY: graph centralization, max CD. EDGE-BASED: density (ratio of all edges to all possible edges), reciprocity, diameter. DESCRIPTIVE: variance, standard deviation, coefficient of variation, degree distribution and power-law exponent alpha.
  15. For example, BASIC: number of vertices and edges, parallel edges, unique edges. DEGREE-BASED: max (in-/out-)degree, average degree, h-index (directed and undirected). CENTRALITY: graph centralization, max CD. EDGE-BASED: density (ratio of all edges to all possible edges), reciprocity, diameter. DESCRIPTIVE: variance, standard deviation, coefficient of variation, degree distribution and power-law exponent alpha.
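Two of these measures can be sketched on a toy directed edgelist (definitions as commonly used for directed graphs without self-loops; the framework itself computes them with graph-tool):

```python
# Toy directed edgelist.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")]
vertices = {v for e in edges for v in e}

n, m = len(vertices), len(edges)
avg_degree = m / n            # average number of edges per vertex, z
density = m / (n * (n - 1))   # ratio of edges to all possible directed edges

print(avg_degree, density)
```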
  16. This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
  20. Now I will come to the second part, the description of the datasets that we have processed with the framework. We thought there is no better stress test for our framework than the datasets from the LOD Cloud. From a theoretically available number of around 1200 datasets, we managed to analyse 280 datasets from the LOD Cloud late 2017; I will tell you why in a moment. This is the second resource that we publish with the paper.
  21. This resource contains all 280 datasets that we have processed and analyzed with the framework. We obtained values for all 28 measures per dataset and created a website to be able to browse the results. For each of the datasets, you can download the initial metadata that was used to acquire the dump, all results as a CSV export, and a serialized graph-object that you can re-use for further analysis. All of this is available at the website. The main benefit of this collection is that each RDF dataset is already prepared. This makes it possible to reproduce the results and to perform further analyses of graph measures on the graphs without additional preparation. For all datasets we also provide plots, e.g. of the degree distribution.
  22. This is how we did the analysis. To analyze RDF graphs at large scale, in terms of dataset size and dataset quantity, you would have to …
  23. But in reality, not all data providers offer data dumps. And for those that do, you frequently have to deal with non-standard (wrong) media type declarations. Providers use different formats, some compress their files with different algorithms, and some give you a hierarchy of files and folders including non-RDF data. In addition, you will encounter erroneous and error-prone data, like syntax errors, etc.
  24. At first we had all metadata packages at hand. After parsing them, we got 150 different media type statements. Since the framework accepts only official media type statements of the most common media types, we manually mapped them. After this mapping we got 890 datasets with URLs. Further, we filtered out HTTP 404 responses and HTML content types. These were the manual steps in the framework's processing pipeline presented earlier. Furthermore, we concentrated on data dumps, not SPARQL endpoints, so as not to stress them.
  25. This is a snapshot of the website. On the left side you can see the distribution of the datasets per domain for which we were able to do the analysis. Unfortunately, some of them are not well represented.
  26. However, the largest dataset is in the Cross Domain: en-dbpedia with 2.6B edges. Most datasets are in the Linguistics and Publications domains. We did a preliminary analysis of the measures across all domains and could observe dataset- and domain-specific phenomena. I would like to show you two measures, average degree and h-index on the directed graph, which we have plotted across all domains.
  27. Average degree is a frequently consulted measure and gives you the average number of edges that vertices have in the graph object. In this plot you can see the datasets and average degree values for 5 domains. The datasets are ordered descending by number of edges. Looking at the plot, you can see that the average degree seems not to be affected by the number of edges in all domains but Geography and Government, which show an increasing linear relationship. Outliers can be observed in all domains, like bio2rdf-irefindex in Life Sciences with 63. In the Linguistics domain there seem to be two clusters, with one group having higher values than the other. This may be considered a dataset-specific phenomenon, most probably caused by the fact that either one data provider used a specific vocabulary and modelled more accurately (more predicates on average), or two different providers published a lot of small datasets of different kinds. Unfortunately, this may not necessarily be representative, because not all datasets were included.
  28. The second measure that I've plotted here across all domains is the h-index, which is known from citation networks. There it is an indicator of the importance of a vertex in a network; here it is a statistical measure on the graph. Each dot in the figure is a dataset. The datasets are ordered descending by number of edges, and the y-axis is log-scaled. The h-index grows exponentially with the size of the graph. Government and Life Sciences report higher values and could be considered more "dense", while Publications reports lower values. Again, Linguistics shows two clusters with almost constant values that seem to be independent of the number of edges in most cases, in particular for the lower group of datasets.
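The h-index on a graph's degree sequence can be sketched as follows, by analogy with the citation h-index: the largest h such that at least h vertices have degree at least h.

```python
# h-index of a degree sequence: largest h with at least h vertices of
# degree >= h.
def h_index(degrees):
    h = 0
    for i, d in enumerate(sorted(degrees, reverse=True), start=1):
        if d >= i:
            h = i
        else:
            break
    return h

degrees = [9, 7, 6, 2, 1]
print(h_index(degrees))  # → 3: three vertices have degree >= 3
```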
  29. Regarding future work, we would like to investigate the domain- and dataset-specific irregularities: where do they come from, what is the reason, etc., and derive implications for modelling tasks, at dataset level and for applications like benchmarking.