This presentation describes simple statistics for quantifying RDF data sets. It provides counts of triples, literals, URIs, subjects, predicates, objects, and other metrics for several RDF data sets, including LinkedCT, BioGrid, RxNorm, SUNY Reach, and DrugBank. For each data set, it lists the top 5 subjects, objects, and predicates, as well as overall statistics such as literalness, uniqueness, coverage, and type frequencies.
Quantifying RDF data sets: statistics and metrics
1. Quantifying RDF data sets
(a start)
Janos G. Hajagos
Stony Brook University
School of Medicine
2. Resource Description Framework
Graph-based data model:
– Vertices or nodes are identified by URIs
<http://dbpedia.org/resource/Aspirin>
– Vertices can be typed: rdf:type
– Directed edges or links are specified with URIs
– Parallel edges are allowed (multi-graph)
– Literals are properties of vertices
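A minimal sketch of the data model above, using plain Python tuples rather than any particular RDF library. Only the Aspirin URI comes from the slide; the `http://example.org/` namespace, the predicate names, the `Literal` marker class, and `edges_between` are hypothetical and exist only for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ASPIRIN = "http://dbpedia.org/resource/Aspirin"
EX = "http://example.org/"  # made-up namespace, not from the slides


class Literal(str):
    """Marks an object value as a literal rather than a URI."""


# Each triple is (subject, predicate, object): a directed, labeled edge.
triples = {
    (ASPIRIN, RDF_TYPE, EX + "Drug"),               # typed vertex
    (ASPIRIN, EX + "treats", EX + "Headache"),      # directed edge
    (ASPIRIN, EX + "studiedFor", EX + "Headache"),  # parallel edge (multi-graph)
    (ASPIRIN, EX + "label", Literal("Aspirin")),    # literal property of a vertex
}


def edges_between(graph, subject, obj):
    """Return the predicates of all edges from subject to obj."""
    return sorted(p for s, p, o in graph if s == subject and o == obj)


parallel = edges_between(triples, ASPIRIN, EX + "Headache")
literal_count = sum(isinstance(o, Literal) for _, _, o in triples)
```

Here `parallel` holds two predicates for the same pair of vertices, which is what makes the structure a multi-graph.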
4. • Pure Python library
• In-memory only
• PyPy JIT for speed
• API for pattern matching
• No SPARQL support
• Ignores types
• No named graphs
• No HTTP access
5. Counting: 1, 2, 3, . . .
• Number of triples (Nt)
• Number of literals (Nl)
• Number of object URIs (No)
• Number of distinct literals (type removed) (Ndl)
• Number of distinct objects (Ndo)
• Number of distinct subjects (Nds)
• Number of distinct URIs (Nu)
• Number of typed instances (Ni)
• Number of instances of type t (Nit)
• Number of distinct classes (Nc)
• Number of distinct predicates (Ndp)
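The counts above could be gathered in a single pass over the triples. This is a sketch under stated assumptions, not the author's implementation: literals are modeled as `(value, datatype)` tuples so the type can be dropped for Ndl, Nu is taken to mean distinct URIs in subject or object position, and Ni is taken to be the number of `rdf:type` triples. The sample data and the `http://example.org/` namespace are made up.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # made-up namespace


def is_literal(obj):
    # Assumption: literals are (value, datatype) tuples, URIs are strings.
    return isinstance(obj, tuple)


def count_metrics(triples):
    triples = list(triples)
    literal_values = [o[0] for _, _, o in triples if is_literal(o)]
    object_uris = [o for _, _, o in triples if not is_literal(o)]
    subjects = {s for s, _, _ in triples}
    per_class = Counter(
        o for _, p, o in triples if p == RDF_TYPE and not is_literal(o)
    )
    return {
        "Nt": len(triples),
        "Nl": len(literal_values),
        "No": len(object_uris),
        "Ndl": len(set(literal_values)),  # datatype removed before comparing
        "Ndo": len(set(object_uris)),
        "Nds": len(subjects),
        "Nu": len(subjects | set(object_uris)),  # assumption: predicates excluded
        "Ni": sum(per_class.values()),  # assumption: one per rdf:type triple
        "Nit": dict(per_class),
        "Nc": len(per_class),
        "Ndp": len({p for _, p, _ in triples}),
    }


sample = [
    (EX + "aspirin", RDF_TYPE, EX + "Drug"),
    (EX + "ibuprofen", RDF_TYPE, EX + "Drug"),
    (EX + "aspirin", EX + "label", ("Aspirin", "xsd:string")),
    (EX + "aspirin", EX + "altLabel", ("Aspirin", None)),
    (EX + "aspirin", EX + "interactsWith", EX + "ibuprofen"),
]
counts = count_metrics(sample)
```

In this sample the two literals share one value once the datatype is dropped, so Nl is 2 while Ndl is 1.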
6. Simple fractions
“Literalness” = Nl / Nt
“Literal uniqueness” = Ndl / Nl
“Object uniqueness” = Ndo / No
“Structure” = 1 - (Ni + Nl) / Nt
“Subject coverage” = Nds / Nu
“Object coverage” = Ndo / Nu
“Type frequency of class t” = {Nit / Ni , . . .}
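As a worked illustration, the fractions above can be computed directly from a dictionary of counts. The numbers below are invented for the example and do not describe any of the data sets in this presentation; the `ratio` helper and its choice to return NaN for an empty denominator are also assumptions.

```python
def ratio(numerator, denominator):
    # Assumption: report NaN rather than raise when a count is zero.
    return numerator / denominator if denominator else float("nan")


def simple_fractions(c):
    """Compute the slide's fractions from a dict of counts keyed by symbol."""
    return {
        "literalness": ratio(c["Nl"], c["Nt"]),
        "literal_uniqueness": ratio(c["Ndl"], c["Nl"]),
        "object_uniqueness": ratio(c["Ndo"], c["No"]),
        "structure": 1 - ratio(c["Ni"] + c["Nl"], c["Nt"]),
        "subject_coverage": ratio(c["Nds"], c["Nu"]),
        "object_coverage": ratio(c["Ndo"], c["Nu"]),
        "type_frequency": {t: ratio(n, c["Ni"]) for t, n in c["Nit"].items()},
    }


# Hypothetical counts, chosen only so the arithmetic is easy to follow.
example_counts = {
    "Nt": 1000, "Nl": 400, "No": 600,
    "Ndl": 100, "Ndo": 150, "Nds": 200, "Nu": 300,
    "Ni": 100, "Nit": {"ex:Drug": 75, "ex:Trial": 25},
}
fractions = simple_fractions(example_counts)
```

With these counts, literalness is 400 / 1000 = 0.4 and structure is 1 - (100 + 400) / 1000 = 0.5, so half of the triples are neither literal-valued nor type assertions.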
15. VIVO ontology modeling of authorship
The twist is to include only members of the Reach site
16. Graph processing and extraction
• Follow
– Multiple linked steps are allowed
• Collapse parallel edges
– Add weight to edges based on counts
• Export
– Standard graph format like GraphML, an XML format for graph exchange
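The collapse and export steps might look like the following, using only the Python standard library. This is a sketch, not the tool used in the presentation: the edge list, node names, and function names are hypothetical, and the follow step is assumed to have already produced the `(source, label, target)` edges.

```python
from collections import Counter
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def collapse_parallel_edges(edges):
    """Map each (source, target) pair to its number of parallel edges."""
    return Counter((s, t) for s, _label, t in edges)


def to_graphml(weighted_edges):
    """Serialize {(source, target): weight} as a directed GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the edge attribute that carries the collapsed count.
    ET.SubElement(root, "key", {
        "id": "weight", "for": "edge",
        "attr.name": "weight", "attr.type": "int",
    })
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node in sorted({n for pair in weighted_edges for n in pair}):
        ET.SubElement(graph, "node", id=node)
    for (source, target), weight in sorted(weighted_edges.items()):
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")


# Hypothetical edges; the first two are parallel.
edges = [
    ("author_a", "coauthor", "author_b"),
    ("author_a", "coauthor", "author_b"),
    ("author_b", "coauthor", "author_c"),
]
weights = collapse_parallel_edges(edges)
graphml = to_graphml(weights)
```

The two parallel edges from `author_a` to `author_b` become a single edge with weight 2, and the resulting string can be written to a `.graphml` file for tools that read that format.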
20. For Your Information
- LinkedCT: http://queens.db.toronto.edu/~oktie/linkedct/
- BioGrid in PAX: http://www.pathwaycommons.org/pc-snapshot/current-release/biopax/by_source/
- DrugBank: http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt
- DailyMed: http://www4.wiwiss.fu-berlin.de/dailymed/dailymed_dump.nt
- RxNorm is available at:
http://link.informatics.stonybrook.edu/rxnorm/
- Reach VIVO site is at: http://reach.sunysb.edu
SPARQL endpoint:
http://link.informatics.stonybrook.edu/sparql/
named graph http://reach.sunysb.edu