# Quantifying RDF data sets

The Semantic Web is built on the Resource Description Framework (RDF), a graph data model. One would therefore expect that a wide range of network-analysis tools could be applied directly to an RDF data set. However, most network algorithms assume a graph without parallel edges, which the RDF graph model allows. Two approaches are examined: direct measures of RDF graph structure using ratios, and extraction of graphs from an RDF data set. Py-Triple-Simple (http://code.google.com/p/py-triple-simple/), an experimental pure-Python library, can extract “well-behaved” graphs from an N-Triples file and can quantify RDF graph structure using ratios.
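To make the parallel-edge issue concrete, here is a minimal sketch in plain Python. It is not the Py-Triple-Simple API: the `ex:` identifiers, the sample triples, and the helper `collapse_parallel_edges` are all hypothetical and chosen only for illustration. Two triples that share a subject and an object but differ in predicate form parallel edges; merging them into one weighted edge yields a simple graph that typical network algorithms accept.

```python
from collections import Counter

# Hypothetical (subject, predicate, object) triples; not from any real data set.
triples = [
    ("ex:alice", "ex:coauthorOf", "ex:bob"),
    ("ex:alice", "ex:advisorOf", "ex:bob"),   # parallel to the edge above
    ("ex:alice", "ex:coauthorOf", "ex:carol"),
    ("ex:bob", "ex:coauthorOf", "ex:carol"),
]


def collapse_parallel_edges(triples):
    """Merge all edges sharing (subject, object) into one edge.

    The returned Counter maps each (subject, object) pair to the number
    of triples that connected it, which can serve as an edge weight.
    """
    return Counter((s, o) for s, _predicate, o in triples)


weighted_edges = collapse_parallel_edges(triples)
for (source, target), weight in sorted(weighted_edges.items()):
    print(source, target, weight)
```

The predicate is discarded here on purpose. Whether that is acceptable depends on the analysis; an alternative would be to collapse only edges that share a chosen predicate.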

Published in: Technology, Education
### Quantifying RDF data sets

1. Quantifying RDF data sets (a start)
    Janos G. Hajagos, Stony Brook University School of Medicine
2. Resource Description Framework
    Graph-based data model:
    - Vertices or nodes are identified by URIs, e.g. <http://dbpedia.org/resource/Aspirin>
    - Vertices can be typed: rdf:type
    - Directed edges or links are specified with URIs
    - Parallel edges are allowed (multi-graph)
    - Literals are properties of vertices
3. http://challenge.semanticweb.org/submissions/swc2010_submission_15.pdf
4. - Pure Python library
   - In-memory only
   - PyPy JIT for speed
   - API for pattern matching
   - No SPARQL support
   - Ignores types
   - No named graphs
   - No HTTP access
5. Counting: 1, 2, 3, . . .
    - Number of triples (Nt)
    - Number of literals (Nl)
    - Number of object URIs (No)
    - Number of distinct literals (type removed) (Ndl)
    - Number of distinct objects (Ndo)
    - Number of distinct subjects (Nds)
    - Number of distinct URIs (Nu)
    - Number of typed instances (Ni)
    - Number of instances of type t (Nit)
    - Number of distinct classes (Nc)
    - Number of distinct predicates (Ndp)
6. Simple fractions
    - “Literalness” = Nl / Nt
    - “Literal uniqueness” = Ndl / Nl
    - “Object uniqueness” = Ndo / No
    - “Structure” = 1 - (Ni + Nl) / Nt
    - “Subject coverage” = Nds / Nu
    - “Object coverage” = Ndo / Nu
    - “Type frequency of class t” = {Nit / Ni , . . .}
7. LODD + Comparisons
    Source: http://dx.doi.org/10.1186/1758-2946-3-19
8. Linked CT
    - Statistics:
        - Number of triples (Nt): 27,965,909
        - Number of literals (Nl): 11,153,086
        - Number of objects (No): 16,812,823
        - Number of typed instances (Ni): 3,033,501
        - Number of URIs excluding predicates (Nu): 3,269,681
        - Number of distinct classes (Nc): 30
        - Number of distinct subjects (Nds): 3,033,495
        - Number of distinct predicates (Ndp): 123
        - Number of distinct objects (Ndo): 3,148,210
        - Number of distinct literals (Ndl): 5,496,593
        - Number of distinct lexical symbols (Ndls): 8,621,986
        - Literalness (Nl/Nt): 0.399
        - Literal uniqueness (Ndl/Nl): 0.493
        - Object uniqueness (Ndo/No): 0.187
        - Structure (1 - (Nl+Ni)/Nt): 0.492
        - Subject coverage (Nds/Nu): 0.927
        - Object coverage (Ndo/Nu): 0.962
        - Class coverage: [0.15, 0.13, 0.12, 0.08, 0.05, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.02, 0.01, 0.01, 0.009, 0.008, 0.007, 0.007, 0.006, 0.002, 0.002, 0.001, 6.0e-05, 4.0e-05, 9.2e-06, 6.6e-07]
    - Top 5 subjects:
        - <http://data.linkedct.org/resource/country/united-states>, 60,980
        - <http://data.linkedct.org/resource/state/california>, 15,775
        - <http://data.linkedct.org/resource/state/texas>, 13,264
        - <http://data.linkedct.org/resource/state/new-york>, 13,172
        - <http://data.linkedct.org/resource/oversight_info/7eb3d38adc47e7e583ab6031fe2948ba>, 11,963
    - Top 5 objects including literals:
        - "No", 525,210
        - <http://data.linkedct.org/vocab/resource/location>, 477,926
        - <http://data.linkedct.org/vocab/resource/facility>, 387,542
        - <http://data.linkedct.org/vocab/resource/outcome>, 376,231
        - <http://data.linkedct.org/vocab/resource/external_linkage>, 271,431
        - <http://data.linkedct.org/resource/linkage_method/standardized-string-matching>, 185,902
    - Top 5 predicates:
        - <http://data.linkedct.org/vocab/resource/has_provenance>, 7,482,352
        - <http://www.w3.org/2000/01/rdf-schema#label>, 3,142,207
        - <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 3,033,501
        - <http://data.linkedct.org/vocab/resource/trial_location>, 982,202
        - <http://data.linkedct.org/vocab/resource/location_facility>, 477,923
9. BioGrid in BioPax
    - Statistics:
        - Number of triples (Nt): 14,326,621
        - Number of literals (Nl): 5,680,921
        - Number of objects (No): 8,645,700
        - Number of typed instances (Ni): 4,229,345
        - Number of URIs excluding predicates (Nu): 4,229,358
        - Number of distinct classes (Nc): 12
        - Number of distinct subjects (Nds): 4,229,345
        - Number of distinct predicates (Ndp): 23
        - Number of distinct objects (Ndo): 4,009,607
        - Number of distinct literals (Ndl): 1,145,973
        - Number of distinct lexical symbols (Ndls): 5,375,354
        - Literalness (Nl/Nt): 0.400
        - Literal uniqueness (Ndl/Nl): 0.202
        - Object uniqueness (Ndo/No): 0.464
        - Structure (1 - (Nl+Ni)/Nt): 0.309
        - Subject coverage (Nds/Nu): 0.999
        - Object coverage (Ndo/Nu): 0.948
        - Class coverage: [0.295, 0.156, 0.104, 0.104, 0.104, 0.066, 0.052, 0.052, 0.052, 0.007, 0.007, 2.3e-07]
    - Top 5 subjects:
        - <http://cbio.mskcc.org/cpath#CPATH-716194>, 470
        - <http://cbio.mskcc.org/cpath#CPATH-156001>, 362
        - <http://cbio.mskcc.org/cpath#CPATH-738240>, 292
        - <http://cbio.mskcc.org/cpath#CPATH-818091>, 266
        - <http://cbio.mskcc.org/cpath#CPATH-726044>, 229
    - Top 5 objects including literals:
        - <http://www.biopax.org/release/biopax-level2.owl#unificationXref>, 1,249,232
        - <http://www.biopax.org/release/biopax-level2.owl#openControlledVocabulary>, 659,251
        - "PSI-MI", 659,250
        - "PUBMED", 439,528
        - <http://www.biopax.org/release/biopax-level2.owl#publicationXref>, 439,528
    - Top 5 predicates:
        - <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 4,229,345
        - <http://www.biopax.org/release/biopax-level2.owl#DB>, 1,966,356
        - <http://www.biopax.org/release/biopax-level2.owl#ID>, 1,966,356
        - <http://www.biopax.org/release/biopax-level2.owl#XREF>, 1,933,616
        - <http://www.biopax.org/release/biopax-level2.owl#TERM>, 659,251
10. RxNorm
    - Statistics:
        - Number of triples (Nt): 9,169,907
        - Number of literals (Nl): 4,557,110
        - Number of objects (No): 4,612,797
        - Number of typed instances (Ni): 628,852
        - Number of URIs excluding predicates (Nu): 808,979
        - Number of distinct classes (Nc): 6
        - Number of distinct subjects (Nds): 807,722
        - Number of distinct predicates (Ndp): 193
        - Number of distinct objects (Ndo): 471,847
        - Number of distinct literals (Ndl): 2,577,006
        - Number of distinct lexical symbols (Ndls): 3,385,997
        - Literalness (Nl/Nt): 0.497
        - Literal uniqueness (Ndl/Nl): 0.565
        - Object uniqueness (Ndo/No): 0.102
        - Structure (1 - (Nl+Ni)/Nt): 0.434
        - Subject coverage (Nds/Nu): 0.998
        - Object coverage (Ndo/Nu): 0.583
        - Class coverage: [0.748, 0.252, 0.0003, 5.6e-05, 9.5e-06, 6.360e-06]
    - Top 5 subjects:
        - <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/317541>, 11,804
        - <http://link.informatics.stonybrook.edu/rxnorm/RXAUI/3149147>, 9,943
        - <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316949>, 8,668
        - <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316968>, 6,464
        - <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316965>, 4,605
    - Top 5 objects including literals:
        - <http://link.informatics.stonybrook.edu/rxnorm/RXAUI>, 470,170
        - <http://link.informatics.stonybrook.edu/rxnorm/RXCUI>, 158,457
        - <http://link.informatics.stonybrook.edu/rxnorm/SAB/RXNORM>, 143,622
        - <http://link.informatics.stonybrook.edu/rxnorm/SAB/NDFRT>, 134,049
        - <http://link.informatics.stonybrook.edu/rxnorm/TTY/CD>, 101,246
    - Top 5 predicates:
        - <http://www.w3.org/2000/01/rdf-schema#label>, 807,705
        - <http://link.informatics.stonybrook.edu/rxnorm/ATN#NDC>, 634,124
        - <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 628,852
        - <http://link.informatics.stonybrook.edu/rxnorm/REL#has_related_form>, 571,320
        - <http://link.informatics.stonybrook.edu/umls/hasCUI>, 507,950
11. SUNY Reach in VIVO
    - Statistics:
        - Number of triples (Nt): 1,278,216
        - Number of literals (Nl): 562,262
        - Number of objects (No): 715,954
        - Number of typed instances (Ni): 243,263
        - Number of URIs excluding predicates (Nu): 174,488
        - Number of distinct classes (Nc): 71
        - Number of distinct subjects (Nds): 161,459
        - Number of distinct predicates (Ndp): 109
        - Number of distinct objects (Ndo): 172,991
        - Number of distinct literals (Ndl): 224,290
        - Number of distinct lexical symbols (Ndls): 398,887
        - Literalness (Nl/Nt): 0.440
        - Literal uniqueness (Ndl/Nl): 0.399
        - Object uniqueness (Ndo/No): 0.241
        - Structure (1 - (Nl+Ni)/Nt): 0.369
        - Subject coverage (Nds/Nu): 0.925
        - Object coverage (Ndo/Nu): 0.991
        - Class coverage: [0.391, 0.132, 0.128, 0.083, 0.075, 0.040, 0.037, 0.017, . . .]
    - Top 5 subjects:
        - <http://reach.suny.edu/individual/team_1>, 599
        - <http://reach.suny.edu/individual/Faraone_Stephen>, 404
        - <http://reach.suny.edu/individual/Hopkins_L>, 298
        - <http://reach.suny.edu/individual/Genco_Robert>, 272
        - <http://reach.suny.edu/individual/Jusko_William>, 257
    - Top 5 objects including literals:
        - <http://vivoweb.org/ontology/core#Authorship>, 95,303
        - <http://xmlns.com/foaf/0.1/Person>, 32,040
        - <http://reach.suny.edu/ontology/core#Other_Investigator>, 31,170
        - <http://vivoweb.org/ontology/core#Relationship>, 20,176
        - <http://vivoweb.org/ontology/core#InformationResource>, 18,301
    - Top 5 predicates:
        - <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 243,263
        - <http://vivoweb.org/ontology/core#freetextKeyword>, 199,327
        - <http://www.w3.org/2000/01/rdf-schema#label>, 144,653
        - <http://vivoweb.org/ontology/core#informationResourceInAuthorship>, 95,105
        - <http://vivoweb.org/ontology/core#authorInAuthorship>, 95,101
12. DrugBank
    - Statistics:
        - Number of triples (Nt): 766,920
        - Number of literals (Nl): 494,028
        - Number of objects (No): 272,892
        - Number of typed instances (Ni): 24,522
        - Number of URIs excluding predicates (Nu): 103,847
        - Number of distinct classes (Nc): 8
        - Number of distinct subjects (Nds): 19,693
        - Number of distinct predicates (Ndp): 119
        - Number of distinct objects (Ndo): 89,685
        - Number of distinct literals (Ndl): 186,457
        - Number of distinct lexical symbols (Ndls): 290,307
        - Literalness (Nl/Nt): 0.644
        - Literal uniqueness (Ndl/Nl): 0.377
        - Object uniqueness (Ndo/No): 0.329
        - Structure (1 - (Nl+Ni)/Nt): 0.324
        - Subject coverage (Nds/Nu): 0.190
        - Object coverage (Ndo/Nu): 0.863
        - Class coverage: [0.41, 0.20, 0.20, 0.19, 0.004, 0.004, 0.002, 0.0002]
    - Top 5 subjects:
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/587>, 3,767
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/3722>, 3,032
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/357>, 2,780
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/146>, 2,570
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/136>, 2,504
    - Top 5 objects including literals:
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drug_interactions>, 10,153
        - "physiological process", 8,001
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17016423>, 7,191
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17139284>, 7,191
        - "catalytic activity", 6,841
    - Top 5 predicates:
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/generalReference>, 72,359
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationFunction>, 72,232
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationProcess>, 63,520
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/synonym>, 44,949
        - <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/cellularLocation>, 26,258
13. DailyMed
    - Statistics:
        - Number of triples (Nt): 164,276
        - Number of literals (Nl): 59,885
        - Number of objects (No): 104,391
        - Number of typed instances (Ni): 14,934
        - Number of URIs excluding predicates (Nu): 22,365
        - Number of distinct classes (Nc): 6
        - Number of distinct subjects (Nds): 10,015
        - Number of distinct predicates (Ndp): 28
        - Number of distinct objects (Ndo): 21,968
        - Number of distinct literals (Ndl): 45,814
        - Number of distinct lexical symbols (Ndls): 68,181
        - Literalness (Nl/Nt): 0.364
        - Literal uniqueness (Ndl/Nl): 0.765
        - Object uniqueness (Ndo/No): 0.210
        - Structure (1 - (Nl+Ni)/Nt): 0.544
        - Subject coverage (Nds/Nu): 0.448
        - Object coverage (Ndo/Nu): 0.982
        - Class coverage: [0.37, 0.29, 0.29, 0.05, 0.002, 0.0003]
    - Top 5 subjects:
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2245>, 240
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Hospira,_Inc.>, 216
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2019>, 200
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Teva_Pharmaceuticals_USA, 193
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/3505>, 170
    - Top 5 objects including literals:
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/ingredients>, 5,577
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/drugs>, 4,308
        - <http://www4.wiwiss.fu-berlin.de/drugbank/vocab/resource/class/Offer>, 4,308
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/routeOfAdministration/Oral>, 2,465
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/magnesium_stearate>, 1,405
    - Top 5 predicates:
        - <http://www.w3.org/2002/07/owl#sameAs>, 31,929
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/inactiveIngredient>, 28,403
        - <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 14,934
        - <http://www.w3.org/2000/01/rdf-schema#label>, 10,596
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/possibleDiseaseTarget>, 6,124
        - <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/routeOfAdministration>, 4,308
14. Building a co-author network from VIVO with a twist
15. VIVO ontology modeling of authorship
    The twist is to include only members of the Reach site
16. Graph processing and extraction
    - Follow: multiple linked steps are allowed
    - Collapse parallel edges: add weight to edges based on counts
    - Export: standard graph format like GraphML, an XML format for graph exchange
17. Network analysis with NetworkX
18. Network analysis with Mathematica
19. Network visualization with Gephi
20. For Your Information
    - Linked CT: http://queens.db.toronto.edu/~oktie/linkedct/
    - BioGrid in BioPax: http://www.pathwaycommons.org/pc-snapshot/current-release/biopax/by_source/
    - DrugBank: http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt
    - DailyMed: http://www4.wiwiss.fu-berlin.de/dailymed/dailymed_dump.nt
    - RxNorm is available at: http://link.informatics.stonybrook.edu/rxnorm/
    - Reach VIVO site is at: http://reach.sunysb.edu
    - SPARQL endpoint: http://link.informatics.stonybrook.edu/sparql/ (named graph http://reach.sunysb.edu)
21. The End
    http://ctsaconnect.org/
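The “Simple fractions” slide defines each ratio directly in terms of the counts from the “Counting” slide, so the arithmetic can be checked by hand. The following is a minimal sketch, not code from Py-Triple-Simple: the function name `rdf_ratios` and its keyword arguments are hypothetical, while the input counts are the Linked CT figures reported in the transcript above.

```python
def rdf_ratios(nt, nl, no, ni, ndl, ndo, nds, nu):
    """Compute the slide's ratios from raw counts.

    nt: triples, nl: literals, no: object URIs, ni: typed instances,
    ndl: distinct literals, ndo: distinct objects,
    nds: distinct subjects, nu: URIs excluding predicates.
    """
    return {
        "literalness": nl / nt,
        "literal_uniqueness": ndl / nl,
        "object_uniqueness": ndo / no,
        "structure": 1 - (ni + nl) / nt,
        "subject_coverage": nds / nu,
        "object_coverage": ndo / nu,
    }


# Counts as reported for Linked CT in the slide transcript.
linked_ct = rdf_ratios(
    nt=27_965_909,
    nl=11_153_086,
    no=16_812_823,
    ni=3_033_501,
    ndl=5_496_593,
    ndo=3_148_210,
    nds=3_033_495,
    nu=3_269_681,
)

for name, value in linked_ct.items():
    print(f"{name}: {value:.3f}")
```

The computed values agree with the ratios listed on the Linked CT slide to within about 0.001; the small differences in the last digit appear to come from how the slide rounds or truncates.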