Quantifying RDF data sets
               (a start)
Janos G. Hajagos
Stony Brook University
School of Medicine




Resource Description Framework
Graph-based data model:
  – Vertices or nodes are identified by URIs
    <http://dbpedia.org/resource/Aspirin>
  – Vertices can be typed: rdf:type
  – Directed edges or links are specified with URIs
  – Parallel edges are allowed (multi-graph)
  – Literals are properties of vertices
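A minimal sketch of this data model in plain Python, for illustration only. The `URI` and `Literal` wrapper classes and every `http://example.org/` name are hypothetical; only the DBpedia Aspirin URI and `rdf:type` come from the slide.

```python
from dataclasses import dataclass

# Hypothetical wrapper types so URIs and literals can be told apart.
@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class Literal:
    value: str

RDF_TYPE = URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
aspirin = URI("http://dbpedia.org/resource/Aspirin")
warfarin = URI("http://example.org/Warfarin")  # made-up vertex

# Each triple is one directed, labelled edge: (subject, predicate, object).
triples = [
    (aspirin, RDF_TYPE, URI("http://example.org/Drug")),
    (aspirin, URI("http://example.org/interactsWith"), warfarin),
    (aspirin, URI("http://example.org/mentionedWith"), warfarin),
    (aspirin, URI("http://www.w3.org/2000/01/rdf-schema#label"), Literal("Aspirin")),
]

# Parallel edges: two different predicates connect the same pair of vertices.
parallel = [p for s, p, o in triples if s == aspirin and o == warfarin]

# Literals are attached to a vertex rather than being vertices themselves.
literals = [o.value for s, p, o in triples if isinstance(o, Literal)]
```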



http://challenge.semanticweb.org/submissions/swc2010_submission_15.pdf


•   Pure Python library
•   In-memory only
•   PyPy JIT for speed
•   API for pattern matching
•   No SPARQL support
•   Ignores types
•   No named graphs
•   No HTTP access
Counting: 1, 2, 3, . . .
•   Number of triples (Nt)
•   Number of literals (Nl)
•   Number of object URIs (No)
•   Number of distinct literals (type removed) (Ndl)
•   Number of distinct objects (Ndo)
•   Number of distinct subjects (Nds)
•   Number of distinct URIs (Nu)
•   Number of typed instances (Ni)
•   Number of instances of type t (Nit)
•   Number of distinct classes (Nc)
•   Number of distinct predicates (Ndp)
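One way these counts could be gathered in a single pass is sketched below. It is not the library's actual API. It assumes triples are tuples of N-Triples-style strings (literals start with a double quote, everything else is treated as a URI), that Ni is the number of rdf:type triples (which matches the Ni values and rdf:type predicate counts on the dataset slides), and that Nu is the union of subjects and URI objects. `count_triples`, `is_literal` and `strip_type` are hypothetical names.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def is_literal(term):
    # Assumption: N-Triples-style terms, where literals start with a double quote.
    return term.startswith('"')

def strip_type(literal):
    # Drop a trailing datatype such as ^^<...#integer>, keeping the quoted lexical form.
    lexical, sep, _ = literal.rpartition('"^^<')
    return lexical + '"' if sep else literal

def count_triples(triples):
    """Return the counts from the slide, keyed by their symbols."""
    subjects, predicates, uri_objects, literals = set(), set(), set(), set()
    per_class = Counter()
    n_literals = 0
    for s, p, o in triples:
        subjects.add(s)
        predicates.add(p)
        if is_literal(o):
            n_literals += 1
            literals.add(strip_type(o))
        else:
            uri_objects.add(o)
            if p == RDF_TYPE:
                per_class[o] += 1
    return {
        "Nt": len(triples),
        "Nl": n_literals,
        "No": len(triples) - n_literals,     # triples whose object is a URI
        "Ndl": len(literals),
        "Ndo": len(uri_objects),
        "Nds": len(subjects),
        "Nu": len(subjects | uri_objects),   # assumption: predicates excluded
        "Ni": sum(per_class.values()),
        "Nit": dict(per_class),
        "Nc": len(per_class),
        "Ndp": len(predicates),
    }

# Made-up example data.
example = [
    ("<ex:a>", RDF_TYPE, "<ex:Drug>"),
    ("<ex:b>", RDF_TYPE, "<ex:Drug>"),
    ("<ex:a>", "<ex:label>", '"Aspirin"'),
    ("<ex:b>", "<ex:label>", '"Aspirin"^^<xsd:string>'),
    ("<ex:a>", "<ex:interactsWith>", "<ex:b>"),
]
counts = count_triples(example)
```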

Simple fractions
“Literalness” = Nl / Nt
“Literal uniqueness” = Ndl / Nl
“Object uniqueness” = Ndo / No
“Structure” = 1 - (Ni + Nl) / Nt
“Subject coverage” = Nds / Nu
“Object coverage” = Ndo / Nu
“Type frequency of class t” = {Nit / Ni , . . .}
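The fractions are plain ratios of the counts, so they can be sketched as a small function. The names `simple_fractions` and `type_frequencies` are hypothetical; the input counts are the ones reported for Linked CT in this deck. The sketch does not guard against zero denominators (for example a data set with no literals).

```python
def simple_fractions(c):
    """Compute the slide's ratios from a dict of counts keyed by symbol."""
    return {
        "literalness": c["Nl"] / c["Nt"],
        "literal_uniqueness": c["Ndl"] / c["Nl"],
        "object_uniqueness": c["Ndo"] / c["No"],
        "structure": 1 - (c["Ni"] + c["Nl"]) / c["Nt"],
        "subject_coverage": c["Nds"] / c["Nu"],
        "object_coverage": c["Ndo"] / c["Nu"],
    }

def type_frequencies(instances_per_class):
    """Nit / Ni for every class t, largest first."""
    total = sum(instances_per_class.values())
    return sorted((n / total for n in instances_per_class.values()), reverse=True)

# Counts reported for Linked CT in this deck.
linked_ct = {
    "Nt": 27_965_909, "Nl": 11_153_086, "No": 16_812_823,
    "Ni": 3_033_501, "Nu": 3_269_681, "Nds": 3_033_495,
    "Ndo": 3_148_210, "Ndl": 5_496_593,
}
fractions = simple_fractions(linked_ct)
```

The resulting values agree with the Linked CT slide to within 0.001; the small differences look like truncation rather than rounding on the slide.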

LODD + Comparisons




Source: http://dx.doi.org/10.1186/1758-2946-3-19


Linked CT

Statistics:
Number of triples (Nt): 27,965,909
Number of literals (Nl): 11,153,086
Number of objects (No): 16,812,823
Number of typed instances (Ni): 3,033,501
Number of URIs excluding predicates (Nu): 3,269,681
Number of distinct classes (Nc): 30
Number of distinct subjects (Nds): 3,033,495
Number of distinct predicates (Ndp): 123
Number of distinct objects (Ndo): 3,148,210
Number of distinct literals (Ndl): 5,496,593
Number of distinct lexical symbols (Ndls): 8,621,986

Literalness (Nl/Nt): 0.399
Literal uniqueness (Ndl/Nl): 0.493
Object uniqueness (Ndo/No): 0.187
Structure (1 - (Nl+Ni)/Nt): 0.492
Subject coverage (Nds/Nu): 0.927
Object coverage (Ndo/Nu): 0.962
Class coverage: [0.15, 0.13, 0.12, 0.08, 0.05, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.02, 0.01, 0.01, 0.009, 0.008, 0.007, 0.007, 0.006, 0.002, 0.002, 0.001, 6.0e-05, 4.0e-05, 9.2e-06, 6.6e-07]

Top 5 subjects:
<http://data.linkedct.org/resource/country/united-states>, 60,980
<http://data.linkedct.org/resource/state/california>, 15,775
<http://data.linkedct.org/resource/state/texas>, 13,264
<http://data.linkedct.org/resource/state/new-york>, 13,172
<http://data.linkedct.org/resource/oversight_info/7eb3d38adc47e7e583ab6031fe2948ba>, 11,963

Top 5 objects including literals:
"No", 525,210
<http://data.linkedct.org/vocab/resource/location>, 477,926
<http://data.linkedct.org/vocab/resource/facility>, 387,542
<http://data.linkedct.org/vocab/resource/outcome>, 376,231
<http://data.linkedct.org/vocab/resource/external_linkage>, 271,431
<http://data.linkedct.org/resource/linkage_method/standardized-string-matching>, 185,902

Top 5 predicates:
<http://data.linkedct.org/vocab/resource/has_provenance>, 7,482,352
<http://www.w3.org/2000/01/rdf-schema#label>, 3,142,207
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 3,033,501
<http://data.linkedct.org/vocab/resource/trial_location>, 982,202
<http://data.linkedct.org/vocab/resource/location_facility>, 477,923
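The "Top 5" lists on the dataset slides are frequency rankings of the terms at one triple position. A minimal sketch of how such a ranking could be produced with the standard library follows; `top_terms` and the `ex:` example data are hypothetical, not the deck's actual code or data.

```python
from collections import Counter

def top_terms(triples, position, n=5):
    """Most frequent terms at one triple position (0 = subject, 1 = predicate, 2 = object)."""
    return Counter(t[position] for t in triples).most_common(n)

# Made-up example data; literals are written with quotes.
triples = [
    ("ex:trial1", "ex:location", "ex:us"),
    ("ex:trial2", "ex:location", "ex:us"),
    ("ex:trial2", "ex:label", '"No"'),
    ("ex:trial3", "ex:location", "ex:ca"),
]

top_predicates = top_terms(triples, 1)
top_objects = top_terms(triples, 2)   # objects including literals
```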



BioGrid in BioPax

Statistics:
Number of triples (Nt): 14,326,621
Number of literals (Nl): 5,680,921
Number of objects (No): 8,645,700
Number of typed instances (Ni): 4,229,345
Number of URIs excluding predicates (Nu): 4,229,358
Number of distinct classes (Nc): 12
Number of distinct subjects (Nds): 4,229,345
Number of distinct predicates (Ndp): 23
Number of distinct objects (Ndo): 4,009,607
Number of distinct literals (Ndl): 1,145,973
Number of distinct lexical symbols (Ndls): 5,375,354

Literalness (Nl/Nt): 0.400
Literal uniqueness (Ndl/Nl): 0.202
Object uniqueness (Ndo/No): 0.464
Structure (1 - (Nl+Ni)/Nt): 0.309
Subject coverage (Nds/Nu): 0.999
Object coverage (Ndo/Nu): 0.948
Class coverage: [0.295, 0.156, 0.104, 0.104, 0.104, 0.066, 0.052, 0.052, 0.052, 0.007, 0.007, 2.3e-07]

Top 5 subjects:
<http://cbio.mskcc.org/cpath#CPATH-716194>, 470
<http://cbio.mskcc.org/cpath#CPATH-156001>, 362
<http://cbio.mskcc.org/cpath#CPATH-738240>, 292
<http://cbio.mskcc.org/cpath#CPATH-818091>, 266
<http://cbio.mskcc.org/cpath#CPATH-726044>, 229

Top 5 objects including literals:
<http://www.biopax.org/release/biopax-level2.owl#unificationXref>, 1,249,232
<http://www.biopax.org/release/biopax-level2.owl#openControlledVocabulary>, 659,251
"PSI-MI", 659,250
"PUBMED", 439,528
<http://www.biopax.org/release/biopax-level2.owl#publicationXref>, 439,528

Top 5 predicates:
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 4,229,345
<http://www.biopax.org/release/biopax-level2.owl#DB>, 1,966,356
<http://www.biopax.org/release/biopax-level2.owl#ID>, 1,966,356
<http://www.biopax.org/release/biopax-level2.owl#XREF>, 1,933,616
<http://www.biopax.org/release/biopax-level2.owl#TERM>, 659,251



RxNorm

Statistics:
Number of triples (Nt): 9,169,907
Number of literals (Nl): 4,557,110
Number of objects (No): 4,612,797
Number of typed instances (Ni): 628,852
Number of URIs excluding predicates (Nu): 808,979
Number of distinct classes (Nc): 6
Number of distinct subjects (Nds): 807,722
Number of distinct predicates (Ndp): 193
Number of distinct objects (Ndo): 471,847
Number of distinct literals (Ndl): 2,577,006
Number of distinct lexical symbols (Ndls): 3,385,997

Literalness (Nl/Nt): 0.497
Literal uniqueness (Ndl/Nl): 0.565
Object uniqueness (Ndo/No): 0.102
Structure (1 - (Nl+Ni)/Nt): 0.434
Subject coverage (Nds/Nu): 0.998
Object coverage (Ndo/Nu): 0.583
Class coverage: [0.748, 0.252, 0.0003, 5.6e-05, 9.5e-06, 6.360e-06]

Top 5 subjects:
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/317541>, 11,804
<http://link.informatics.stonybrook.edu/rxnorm/RXAUI/3149147>, 9,943
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316949>, 8,668
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316968>, 6,464
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316965>, 4,605

Top 5 objects including literals:
<http://link.informatics.stonybrook.edu/rxnorm/RXAUI>, 470,170
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI>, 158,457
<http://link.informatics.stonybrook.edu/rxnorm/SAB/RXNORM>, 143,622
<http://link.informatics.stonybrook.edu/rxnorm/SAB/NDFRT>, 134,049
<http://link.informatics.stonybrook.edu/rxnorm/TTY/CD>, 101,246

Top 5 predicates:
<http://www.w3.org/2000/01/rdf-schema#label>, 807,705
<http://link.informatics.stonybrook.edu/rxnorm/ATN#NDC>, 634,124
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 628,852
<http://link.informatics.stonybrook.edu/rxnorm/REL#has_related_form>, 571,320
<http://link.informatics.stonybrook.edu/umls/hasCUI>, 507,950



SUNY Reach in VIVO

Statistics:
Number of triples (Nt): 1,278,216
Number of literals (Nl): 562,262
Number of objects (No): 715,954
Number of typed instances (Ni): 243,263
Number of URIs excluding predicates (Nu): 174,488
Number of distinct classes (Nc): 71
Number of distinct subjects (Nds): 161,459
Number of distinct predicates (Ndp): 109
Number of distinct objects (Ndo): 172,991
Number of distinct literals (Ndl): 224,290
Number of distinct lexical symbols (Ndls): 398,887

Literalness (Nl/Nt): 0.440
Literal uniqueness (Ndl/Nl): 0.399
Object uniqueness (Ndo/No): 0.241
Structure (1 - (Nl+Ni)/Nt): 0.369
Subject coverage (Nds/Nu): 0.925
Object coverage (Ndo/Nu): 0.991
Class coverage: [0.391, 0.132, 0.128, 0.083, 0.075, 0.040, 0.037, 0.017, . . .]

Top 5 subjects:
<http://reach.suny.edu/individual/team_1>, 599
<http://reach.suny.edu/individual/Faraone_Stephen>, 404
<http://reach.suny.edu/individual/Hopkins_L>, 298
<http://reach.suny.edu/individual/Genco_Robert>, 272
<http://reach.suny.edu/individual/Jusko_William>, 257

Top 5 objects including literals:
<http://vivoweb.org/ontology/core#Authorship>, 95,303
<http://xmlns.com/foaf/0.1/Person>, 32,040
<http://reach.suny.edu/ontology/core#Other_Investigator>, 31,170
<http://vivoweb.org/ontology/core#Relationship>, 20,176
<http://vivoweb.org/ontology/core#InformationResource>, 18,301

Top 5 predicates:
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 243,263
<http://vivoweb.org/ontology/core#freetextKeyword>, 199,327
<http://www.w3.org/2000/01/rdf-schema#label>, 144,653
<http://vivoweb.org/ontology/core#informationResourceInAuthorship>, 95,105
<http://vivoweb.org/ontology/core#authorInAuthorship>, 95,101




DrugBank

Statistics:
Number of triples (Nt): 766,920
Number of literals (Nl): 494,028
Number of objects (No): 272,892
Number of typed instances (Ni): 24,522
Number of URIs excluding predicates (Nu): 103,847
Number of distinct classes (Nc): 8
Number of distinct subjects (Nds): 19,693
Number of distinct predicates (Ndp): 119
Number of distinct objects (Ndo): 89,685
Number of distinct literals (Ndl): 186,457
Number of distinct lexical symbols (Ndls): 290,307

Literalness (Nl/Nt): 0.644
Literal uniqueness (Ndl/Nl): 0.377
Object uniqueness (Ndo/No): 0.329
Structure (1 - (Nl+Ni)/Nt): 0.324
Subject coverage (Nds/Nu): 0.190
Object coverage (Ndo/Nu): 0.863
Class coverage: [0.41, 0.20, 0.20, 0.19, 0.004, 0.004, 0.002, 0.0002]

Top 5 subjects:
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/587>, 3,767
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/3722>, 3,032
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/357>, 2,780
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/146>, 2,570
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/136>, 2,504

Top 5 objects including literals:
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drug_interactions>, 10,153
"physiological process", 8,001
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17016423>, 7,191
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17139284>, 7,191
"catalytic activity", 6,841

Top 5 predicates:
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/generalReference>, 72,359
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationFunction>, 72,232
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationProcess>, 63,520
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/synonym>, 44,949
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/cellularLocation>, 26,258
DailyMed

Statistics:
Number of triples (Nt): 164,276
Number of literals (Nl): 59,885
Number of objects (No): 104,391
Number of typed instances (Ni): 14,934
Number of URIs excluding predicates (Nu): 22,365
Number of distinct classes (Nc): 6
Number of distinct subjects (Nds): 10,015
Number of distinct predicates (Ndp): 28
Number of distinct objects (Ndo): 21,968
Number of distinct literals (Ndl): 45,814
Number of distinct lexical symbols (Ndls): 68,181

Literalness (Nl/Nt): 0.364
Literal uniqueness (Ndl/Nl): 0.765
Object uniqueness (Ndo/No): 0.210
Structure (1 - (Nl+Ni)/Nt): 0.544
Subject coverage (Nds/Nu): 0.448
Object coverage (Ndo/Nu): 0.982
Class coverage: [0.37, 0.29, 0.29, 0.05, 0.002, 0.0003]

Top 5 subjects:
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2245>, 240
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Hospira,_Inc.>, 216
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2019>, 200
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Teva_Pharmaceuticals_USA, 193
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/3505>, 170

Top 5 objects including literals:
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/ingredients>, 5,577
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/drugs>, 4,308
<http://www4.wiwiss.fu-berlin.de/drugbank/vocab/resource/class/Offer>, 4,308
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/routeOfAdministration/Oral>, 2,465
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/magnesium_stearate>, 1,405

Top 5 predicates:
<http://www.w3.org/2002/07/owl#sameAs>, 31,929
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/inactiveIngredient>, 28,403
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 14,934
<http://www.w3.org/2000/01/rdf-schema#label>, 10,596
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/possibleDiseaseTarget>, 6,124
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/routeOfAdministration>, 4,308
Building a co-author network from VIVO




             with a twist
VIVO ontology modeling of authorship




The twist is to include only members of the Reach site
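A rough sketch of how co-author pairs could be derived from authorship triples, restricted to a given set of site members. The two predicate URIs appear in the Reach "Top 5 predicates" list; the edge directions (person to authorship, information resource to authorship) are an assumption here, and `coauthor_pairs`, `members` and the short example identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

AUTHOR_IN = "http://vivoweb.org/ontology/core#authorInAuthorship"
RESOURCE_IN = "http://vivoweb.org/ontology/core#informationResourceInAuthorship"

def coauthor_pairs(triples, members=None):
    """Return one (person, person) pair per shared information resource."""
    authorship_to_person = {}
    resource_to_authorships = defaultdict(list)
    for s, p, o in triples:
        if p == AUTHOR_IN:            # assumed direction: person -> authorship
            authorship_to_person[o] = s
        elif p == RESOURCE_IN:        # assumed direction: resource -> authorship
            resource_to_authorships[s].append(o)

    pairs = []
    for authorships in resource_to_authorships.values():
        people = {authorship_to_person[a] for a in authorships
                  if a in authorship_to_person}
        if members is not None:
            people &= set(members)    # the twist: keep only site members
        pairs.extend(combinations(sorted(people), 2))
    return pairs

# Made-up example: one paper with three authors, two of them site members.
triples = [
    ("p1", AUTHOR_IN, "a1"), ("p2", AUTHOR_IN, "a2"), ("p3", AUTHOR_IN, "a3"),
    ("paper1", RESOURCE_IN, "a1"),
    ("paper1", RESOURCE_IN, "a2"),
    ("paper1", RESOURCE_IN, "a3"),
]
all_pairs = coauthor_pairs(triples)
member_pairs = coauthor_pairs(triples, members={"p1", "p2"})
```

Pairs are repeated once per shared resource, so the output is a multigraph edge list that can later be collapsed into weighted edges.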
Graph processing and extraction
• Follow
  – Multiple linked steps are allowed
• Collapse parallel edges
  – Add weight to edges based on counts
• Export
  – Standard graph format like GraphML, an XML format for
    graph exchange
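The collapse and export steps can be sketched with the standard library alone. This is an illustration under assumptions, not the tool used in the talk: `collapse_parallel_edges` and `to_graphml` are hypothetical names, edges are treated as undirected, and the GraphML key id `weight` is an arbitrary choice.

```python
from collections import Counter
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def collapse_parallel_edges(pairs):
    """Merge repeated undirected pairs into one edge weighted by its count."""
    return Counter(tuple(sorted(pair)) for pair in pairs)

def to_graphml(weighted_edges):
    """Serialize {(a, b): weight} as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the edge attribute that carries the weight.
    ET.SubElement(root, "key", {
        "id": "weight", "for": "edge",
        "attr.name": "weight", "attr.type": "int",
    })
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node in sorted({n for edge in weighted_edges for n in edge}):
        ET.SubElement(graph, "node", id=node)
    for (a, b), weight in sorted(weighted_edges.items()):
        edge = ET.SubElement(graph, "edge", source=a, target=b)
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")

# Made-up example: "a" and "b" share two parallel edges.
pairs = [("b", "a"), ("a", "b"), ("a", "c")]
weights = collapse_parallel_edges(pairs)
graphml_text = to_graphml(weights)
```

A file written from `graphml_text` should be readable by tools that accept GraphML, such as the ones named on the following slides.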

Network analysis with NetworkX




Network analysis with Mathematica




Network visualization with Gephi




For Your Information
- Linked CT: http://queens.db.toronto.edu/~oktie/linkedct/
- BioGrid in BioPax: http://www.pathwaycommons.org/pc-snapshot/current-release/biopax/by_source/
- DrugBank: http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt
- DailyMed: http://www4.wiwiss.fu-berlin.de/dailymed/dailymed_dump.nt
- RxNorm is available at: http://link.informatics.stonybrook.edu/rxnorm/
- Reach VIVO site is at: http://reach.sunysb.edu
  SPARQL endpoint: http://link.informatics.stonybrook.edu/sparql/
  named graph: http://reach.sunysb.edu




The End




http://ctsaconnect.org/



08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Quantifying RDF data sets: statistics and metrics

  • 1. Quantifying RDF data sets (a start). Janos G. Hajagos, Stony Brook University School of Medicine
  • 2. Resource Description Framework. Graph based data model:
      – Vertices or nodes are identified by URIs, e.g. <http://dbpedia.org/resource/Aspirin>
      – Vertices can be typed: rdf:type
      – Directed edges or links are specified with URIs
      – Parallel edges are allowed (multi-graph)
      – Literals are properties of vertices
  • 4. Pure Python library
      – In-memory only
      – PyPy JIT for speed
      – API for pattern matching
      – No SPARQL support
      – Ignores types
      – No named graphs
      – No http access
  • 5. Counting: 1, 2, 3, . . .
      – Number of triples (Nt)
      – Number of literals (Nl)
      – Number of object URIs (No)
      – Number of distinct literals (type removed) (Ndl)
      – Number of distinct objects (Ndo)
      – Number of distinct subjects (Nds)
      – Number of distinct URIs (Nu)
      – Number of typed instances (Ni)
      – Number of instances of type t (Nit)
      – Number of distinct classes (Nc)
      – Number of distinct predicates (Ndp)
  • 6. Simple fractions
      – "Literalness" = Nl / Nt
      – "Literal uniqueness" = Ndl / Nl
      – "Object uniqueness" = Ndo / No
      – "Structure" = 1 - (Ni + Nl) / Nt
      – "Subject coverage" = Nds / Nu
      – "Object coverage" = Ndo / Nu
      – "Type frequency of class t" = {Nit / Ni , . . .}
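The counts on slide 5 and the fractions on slide 6 can be sketched in a few lines of Python. This is a minimal illustration, not the library described in the talk. It assumes triples arrive as plain (subject, predicate, object) string tuples, that literals are N-Triples style strings starting with a double quote, that Ni is the number of rdf:type triples, and that Nu covers subject and object URIs but not predicates (as the later slides label it). The function names and the `ex:` sample data are hypothetical.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_literal(term):
    """Assumption: a literal is an N-Triples style string starting with a double quote."""
    return term.startswith('"')


def strip_type(literal):
    """Drop any datatype or language suffix, keeping only the quoted lexical form."""
    return literal[: literal.rfind('"') + 1]


def ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator instead of raising."""
    return numerator / denominator if denominator else 0.0


def quantify(triples):
    """Return (counts, fractions, type_frequency) for an iterable of triples."""
    triples = list(triples)
    literals = [o for _, _, o in triples if is_literal(o)]
    object_uris = [o for _, _, o in triples if not is_literal(o)]
    subjects = {s for s, _, _ in triples}
    # Instances per class: one entry for every rdf:type triple.
    classes = Counter(o for _, p, o in triples if p == RDF_TYPE)

    counts = {
        "Nt": len(triples),
        "Nl": len(literals),
        "No": len(object_uris),
        "Ndl": len({strip_type(lit) for lit in literals}),
        "Ndo": len(set(object_uris)),
        "Nds": len(subjects),
        "Nu": len(subjects | set(object_uris)),
        "Ni": sum(classes.values()),
        "Nc": len(classes),
        "Ndp": len({p for _, p, _ in triples}),
    }
    fractions = {
        "literalness": ratio(counts["Nl"], counts["Nt"]),
        "literal_uniqueness": ratio(counts["Ndl"], counts["Nl"]),
        "object_uniqueness": ratio(counts["Ndo"], counts["No"]),
        "structure": 1 - ratio(counts["Ni"] + counts["Nl"], counts["Nt"]),
        "subject_coverage": ratio(counts["Nds"], counts["Nu"]),
        "object_coverage": ratio(counts["Ndo"], counts["Nu"]),
    }
    type_frequency = {cls: ratio(n, counts["Ni"]) for cls, n in classes.items()}
    return counts, fractions, type_frequency


# Hypothetical sample data, for illustration only.
sample = [
    ("ex:aspirin", RDF_TYPE, "ex:Drug"),
    ("ex:aspirin", "rdfs:label", '"Aspirin"'),
    ("ex:aspirin", "ex:treats", "ex:pain"),
    ("ex:ibuprofen", RDF_TYPE, "ex:Drug"),
    ("ex:ibuprofen", "rdfs:label", '"Ibuprofen"@en'),
    ("ex:ibuprofen", "ex:treats", "ex:pain"),
    ("ex:pain", "rdfs:label", '"Pain"'),
    ("ex:pain", RDF_TYPE, "ex:Symptom"),
]
counts, fractions, type_frequency = quantify(sample)
```

Blank nodes are not treated specially in this sketch; a real data set would need a decision on whether they count toward Nu.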
  • 7. LODD + Comparisons. Source: http://dx.doi.org/10.1186/1758-2946-3-19
  • 8. Linked CT
    Statistics:
      – Number of triples (Nt): 27,965,909
      – Number of literals (Nl): 11,153,086
      – Number of objects (No): 16,812,823
      – Number of typed instances (Ni): 3,033,501
      – Number of URIs excluding predicates (Nu): 3,269,681
      – Number of distinct classes (Nc): 30
      – Number of distinct subjects (Nds): 3,033,495
      – Number of distinct predicates (Ndp): 123
      – Number of distinct objects (Ndo): 3,148,210
      – Number of distinct literals (Ndl): 5,496,593
      – Number of distinct lexical symbols (Ndls): 8,621,986
    Fractions:
      – Literalness (Nl/Nt): 0.399
      – Literal uniqueness (Ndl/Nl): 0.493
      – Object uniqueness (Ndo/No): 0.187
      – Structure (1 - (Nl+Ni)/Nt): 0.492
      – Subject coverage (Nds/Nu): 0.927
      – Object coverage (Ndo/Nu): 0.962
      – Class coverage: [0.15, 0.13, 0.12, 0.08, 0.05, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.02, 0.01, 0.01, 0.009, 0.008, 0.007, 0.007, 0.006, 0.002, 0.002, 0.001, 6.0e-05, 4.0e-05, 9.2e-06, 6.6e-07]
    Top 5 subjects:
      – <http://data.linkedct.org/resource/country/united-states>, 60,980
      – <http://data.linkedct.org/resource/state/california>, 15,775
      – <http://data.linkedct.org/resource/state/texas>, 13,264
      – <http://data.linkedct.org/resource/state/new-york>, 13,172
      – <http://data.linkedct.org/resource/oversight_info/7eb3d38adc47e7e583ab6031fe2948ba>, 11,963
    Top 5 objects including literals:
      – "No", 525,210
      – <http://data.linkedct.org/vocab/resource/location>, 477,926
      – <http://data.linkedct.org/vocab/resource/facility>, 387,542
      – <http://data.linkedct.org/vocab/resource/outcome>, 376,231
      – <http://data.linkedct.org/vocab/resource/external_linkage>, 271,431
      – <http://data.linkedct.org/resource/linkage_method/standardized-string-matching>, 185,902
    Top 5 predicates:
      – <http://data.linkedct.org/vocab/resource/has_provenance>, 7,482,352
      – <http://www.w3.org/2000/01/rdf-schema#label>, 3,142,207
      – <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 3,033,501
      – <http://data.linkedct.org/vocab/resource/trial_location>, 982,202
      – <http://data.linkedct.org/vocab/resource/location_facility>, 477,923
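As a worked check, the Linked CT fractions can be recomputed from the counts reported on the slide using the slide 6 definitions. The sketch below only re-does the arithmetic; the variable and function names are hypothetical. On these numbers each recomputed fraction lands within 0.001 of the reported value (the slide appears to mix rounding and truncation in the third decimal), and Nt - Nl equals No.

```python
# Counts as reported for Linked CT on slide 8.
reported_counts = {
    "Nt": 27_965_909,
    "Nl": 11_153_086,
    "No": 16_812_823,
    "Ni": 3_033_501,
    "Nu": 3_269_681,
    "Nds": 3_033_495,
    "Ndo": 3_148_210,
    "Ndl": 5_496_593,
}

# Fractions as reported for Linked CT on slide 8.
reported_fractions = {
    "literalness": 0.399,
    "literal_uniqueness": 0.493,
    "object_uniqueness": 0.187,
    "structure": 0.492,
    "subject_coverage": 0.927,
    "object_coverage": 0.962,
}


def simple_fractions(c):
    """Apply the slide 6 definitions to a dictionary of counts."""
    return {
        "literalness": c["Nl"] / c["Nt"],
        "literal_uniqueness": c["Ndl"] / c["Nl"],
        "object_uniqueness": c["Ndo"] / c["No"],
        "structure": 1 - (c["Ni"] + c["Nl"]) / c["Nt"],
        "subject_coverage": c["Nds"] / c["Nu"],
        "object_coverage": c["Ndo"] / c["Nu"],
    }


recomputed = simple_fractions(reported_counts)

# Absolute gap between each recomputed fraction and the slide's figure.
differences = {
    name: abs(recomputed[name] - reported_fractions[name])
    for name in reported_fractions
}

for name in reported_fractions:
    print(f"{name}: reported {reported_fractions[name]}, recomputed {recomputed[name]:.4f}")
```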
  • 9. BioGrid in BioPax
    Statistics:
      – Number of triples (Nt): 14,326,621
      – Number of literals (Nl): 5,680,921
      – Number of objects (No): 8,645,700
      – Number of typed instances (Ni): 4,229,345
      – Number of URIs excluding predicates (Nu): 4,229,358
      – Number of distinct classes (Nc): 12
      – Number of distinct subjects (Nds): 4,229,345
      – Number of distinct predicates (Ndp): 23
      – Number of distinct objects (Ndo): 4,009,607
      – Number of distinct literals (Ndl): 1,145,973
      – Number of distinct lexical symbols (Ndls): 5,375,354
    Fractions:
      – Literalness (Nl/Nt): 0.400
      – Literal uniqueness (Ndl/Nl): 0.202
      – Object uniqueness (Ndo/No): 0.464
      – Structure (1 - (Nl+Ni)/Nt): 0.309
      – Subject coverage (Nds/Nu): 0.999
      – Object coverage (Ndo/Nu): 0.948
      – Class coverage: [0.295, 0.156, 0.104, 0.104, 0.104, 0.066, 0.052, 0.052, 0.052, 0.007, 0.007, 2.3e-07]
    Top 5 subjects:
      – <http://cbio.mskcc.org/cpath#CPATH-716194>, 470
      – <http://cbio.mskcc.org/cpath#CPATH-156001>, 362
      – <http://cbio.mskcc.org/cpath#CPATH-738240>, 292
      – <http://cbio.mskcc.org/cpath#CPATH-818091>, 266
      – <http://cbio.mskcc.org/cpath#CPATH-726044>, 229
    Top 5 objects including literals:
      – <http://www.biopax.org/release/biopax-level2.owl#unificationXref>, 1,249,232
      – <http://www.biopax.org/release/biopax-level2.owl#openControlledVocabulary>, 659,251
      – "PSI-MI", 659,250
      – "PUBMED", 439,528
      – <http://www.biopax.org/release/biopax-level2.owl#publicationXref>, 439,528
    Top 5 predicates:
      – <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 4,229,345
      – <http://www.biopax.org/release/biopax-level2.owl#DB>, 1,966,356
      – <http://www.biopax.org/release/biopax-level2.owl#ID>, 1,966,356
      – <http://www.biopax.org/release/biopax-level2.owl#XREF>, 1,933,616
      – <http://www.biopax.org/release/biopax-level2.owl#TERM>, 659,251
  • 10. RxNorm
    Statistics:
      – Number of triples (Nt): 9,169,907
      – Number of literals (Nl): 4,557,110
      – Number of objects (No): 4,612,797
      – Number of typed instances (Ni): 628,852
      – Number of URIs excluding predicates (Nu): 808,979
      – Number of distinct classes (Nc): 6
      – Number of distinct subjects (Nds): 807,722
      – Number of distinct predicates (Ndp): 193
      – Number of distinct objects (Ndo): 471,847
      – Number of distinct literals (Ndl): 2,577,006
      – Number of distinct lexical symbols (Ndls): 3,385,997
    Fractions:
      – Literalness (Nl/Nt): 0.497
      – Literal uniqueness (Ndl/Nl): 0.565
      – Object uniqueness (Ndo/No): 0.102
      – Structure (1 - (Nl+Ni)/Nt): 0.434
      – Subject coverage (Nds/Nu): 0.998
      – Object coverage (Ndo/Nu): 0.583
      – Class coverage: [0.748, 0.252, 0.0003, 5.6e-05, 9.5e-06, 6.360e-06]
    Top 5 subjects:
      – <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/317541>, 11,804
      – <http://link.informatics.stonybrook.edu/rxnorm/RXAUI/3149147>, 9,943
      – <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316949>, 8,668
      – <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316968>, 6,464
      – <http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316965>, 4,605
    Top 5 objects including literals:
      – <http://link.informatics.stonybrook.edu/rxnorm/RXAUI>, 470,170
      – <http://link.informatics.stonybrook.edu/rxnorm/RXCUI>, 158,457
      – <http://link.informatics.stonybrook.edu/rxnorm/SAB/RXNORM>, 143,622
      – <http://link.informatics.stonybrook.edu/rxnorm/SAB/NDFRT>, 134,049
      – <http://link.informatics.stonybrook.edu/rxnorm/TTY/CD>, 101,246
    Top 5 predicates:
      – <http://www.w3.org/2000/01/rdf-schema#label>, 807,705
      – <http://link.informatics.stonybrook.edu/rxnorm/ATN#NDC>, 634,124
      – <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 628,852
      – <http://link.informatics.stonybrook.edu/rxnorm/REL#has_related_form>, 571,320
      – <http://link.informatics.stonybrook.edu/umls/hasCUI>, 507,950
  • 11. SUNY Reach in VIVO
    Statistics:
      – Number of triples (Nt): 1,278,216
      – Number of literals (Nl): 562,262
      – Number of objects (No): 715,954
      – Number of typed instances (Ni): 243,263
      – Number of URIs excluding predicates (Nu): 174,488
      – Number of distinct classes (Nc): 71
      – Number of distinct subjects (Nds): 161,459
      – Number of distinct predicates (Ndp): 109
      – Number of distinct objects (Ndo): 172,991
      – Number of distinct literals (Ndl): 224,290
      – Number of distinct lexical symbols (Ndls): 398,887
    Fractions:
      – Literalness (Nl/Nt): 0.440
      – Literal uniqueness (Ndl/Nl): 0.399
      – Object uniqueness (Ndo/No): 0.241
      – Structure (1 - (Nl+Ni)/Nt): 0.369
      – Subject coverage (Nds/Nu): 0.925
      – Object coverage (Ndo/Nu): 0.991
      – Class coverage: [0.391, 0.132, 0.128, 0.083, 0.075, 0.040, 0.037, 0.017, . . .]
    Top 5 subjects:
      – <http://reach.suny.edu/individual/team_1>, 599
      – <http://reach.suny.edu/individual/Faraone_Stephen>, 404
      – <http://reach.suny.edu/individual/Hopkins_L>, 298
      – <http://reach.suny.edu/individual/Genco_Robert>, 272
      – <http://reach.suny.edu/individual/Jusko_William>, 257
    Top 5 objects including literals:
      – <http://vivoweb.org/ontology/core#Authorship>, 95,303
      – <http://xmlns.com/foaf/0.1/Person>, 32,040
      – <http://reach.suny.edu/ontology/core#Other_Investigator>, 31,170
      – <http://vivoweb.org/ontology/core#Relationship>, 20,176
      – <http://vivoweb.org/ontology/core#InformationResource>, 18,301
    Top 5 predicates:
      – <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 243,263
      – <http://vivoweb.org/ontology/core#freetextKeyword>, 199,327
      – <http://www.w3.org/2000/01/rdf-schema#label>, 144,653
      – <http://vivoweb.org/ontology/core#informationResourceInAuthorship>, 95,105
      – <http://vivoweb.org/ontology/core#authorInAuthorship>, 95,101
  • 12. DrugBank
    Statistics:
      – Number of triples (Nt): 766,920
      – Number of literals (Nl): 494,028
      – Number of objects (No): 272,892
      – Number of typed instances (Ni): 24,522
      – Number of URIs excluding predicates (Nu): 103,847
      – Number of distinct classes (Nc): 8
      – Number of distinct subjects (Nds): 19,693
      – Number of distinct predicates (Ndp): 119
      – Number of distinct objects (Ndo): 89,685
      – Number of distinct literals (Ndl): 186,457
      – Number of distinct lexical symbols (Ndls): 290,307
    Fractions:
      – Literalness (Nl/Nt): 0.644
      – Literal uniqueness (Ndl/Nl): 0.377
      – Object uniqueness (Ndo/No): 0.329
      – Structure (1 - (Nl+Ni)/Nt): 0.324
      – Subject coverage (Nds/Nu): 0.190
      – Object coverage (Ndo/Nu): 0.863
      – Class coverage: [0.41, 0.20, 0.20, 0.19, 0.004, 0.004, 0.002, 0.0002]
    Top 5 subjects:
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/587>, 3,767
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/3722>, 3,032
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/357>, 2,780
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/146>, 2,570
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/136>, 2,504
    Top 5 objects including literals:
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drug_interactions>, 10,153
      – "physiological process", 8,001
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17016423>, 7,191
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17139284>, 7,191
      – "catalytic activity", 6,841
    Top 5 predicates:
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/generalReference>, 72,359
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationFunction>, 72,232
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationProcess>, 63,520
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/synonym>, 44,949
      – <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/cellularLocation>, 26,258
  • 13. DailyMed
    Statistics:
      – Number of triples (Nt): 164,276
      – Number of literals (Nl): 59,885
      – Number of objects (No): 104,391
      – Number of typed instances (Ni): 14,934
      – Number of URIs excluding predicates (Nu): 22,365
      – Number of distinct classes (Nc): 6
      – Number of distinct subjects (Nds): 10,015
      – Number of distinct predicates (Ndp): 28
      – Number of distinct objects (Ndo): 21,968
      – Number of distinct literals (Ndl): 45,814
      – Number of distinct lexical symbols (Ndls): 68,181
    Fractions:
      – Literalness (Nl/Nt): 0.364
      – Literal uniqueness (Ndl/Nl): 0.765
      – Object uniqueness (Ndo/No): 0.210
      – Structure (1 - (Nl+Ni)/Nt): 0.544
      – Subject coverage (Nds/Nu): 0.448
      – Object coverage (Ndo/Nu): 0.982
      – Class coverage: [0.37, 0.29, 0.29, 0.05, 0.002, 0.0003]
    Top 5 subjects:
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2245>, 240
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Hospira,_Inc.>, 216
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2019>, 200
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Teva_Pharmaceuticals_USA, 193
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/3505>, 170
    Top 5 objects including literals:
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/ingredients>, 5,577
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/drugs>, 4,308
      – <http://www4.wiwiss.fu-berlin.de/drugbank/vocab/resource/class/Offer>, 4,308
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/routeOfAdministration/Oral>, 2,465
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/magnesium_stearate>, 1,405
    Top 5 predicates:
      – <http://www.w3.org/2002/07/owl#sameAs>, 31,929
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/inactiveIngredient>, 28,403
      – <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 14,934
      – <http://www.w3.org/2000/01/rdf-schema#label>, 10,596
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/possibleDiseaseTarget>, 6,124
      – <http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/routeOfAdministration>, 4,308
  • 14. Building a co-author network from VIVO with a twist
  • 15. VIVO ontology modeling of authorship. The twist is to include only members of the Reach site
  • 16. Graph processing and extraction
      – Follow: multiple linked steps are allowed
      – Collapse parallel edges: add weight to edges based on counts
      – Export: standard graph format like GraphML, an XML format for graph exchange
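The three steps on slide 16 can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the tool used in the talk: it assumes that a person points to an Authorship node through core#authorInAuthorship and that a paper points to the same node through core#informationResourceInAuthorship (both predicates appear in the Reach statistics). The function names and `ex:` identifiers are hypothetical, and the "members of the Reach site only" filter is left out.

```python
from collections import Counter, defaultdict
from itertools import combinations
import xml.etree.ElementTree as ET

VIVO = "http://vivoweb.org/ontology/core#"
AUTHOR_IN = VIVO + "authorInAuthorship"
RESOURCE_IN = VIVO + "informationResourceInAuthorship"


def coauthor_weights(triples):
    """Follow person -> authorship <- paper, then collapse parallel edges into weights."""
    person_of = {}  # authorship node -> person (assumed direction: person is the subject)
    paper_of = {}   # authorship node -> paper (assumed direction: paper is the subject)
    for s, p, o in triples:
        if p == AUTHOR_IN:
            person_of[o] = s
        elif p == RESOURCE_IN:
            paper_of[o] = s

    authors_by_paper = defaultdict(set)
    for authorship, person in person_of.items():
        paper = paper_of.get(authorship)
        if paper is not None:
            authors_by_paper[paper].add(person)

    # One parallel edge per shared paper; the Counter collapses them into a weight.
    weights = Counter()
    for authors in authors_by_paper.values():
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return weights


def to_graphml(weights):
    """Serialize weighted undirected edges as a GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node in sorted({n for edge in weights for n in edge}):
        ET.SubElement(graph, "node", id=node)
    for (a, b), w in sorted(weights.items()):
        edge = ET.SubElement(graph, "edge", source=a, target=b)
        ET.SubElement(edge, "data", key="weight").text = str(w)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: a and b share two papers, a and c share one.
sample = [
    ("ex:a", AUTHOR_IN, "ex:as1"), ("ex:p1", RESOURCE_IN, "ex:as1"),
    ("ex:b", AUTHOR_IN, "ex:as2"), ("ex:p1", RESOURCE_IN, "ex:as2"),
    ("ex:a", AUTHOR_IN, "ex:as3"), ("ex:p2", RESOURCE_IN, "ex:as3"),
    ("ex:b", AUTHOR_IN, "ex:as4"), ("ex:p2", RESOURCE_IN, "ex:as4"),
    ("ex:a", AUTHOR_IN, "ex:as5"), ("ex:p3", RESOURCE_IN, "ex:as5"),
    ("ex:c", AUTHOR_IN, "ex:as6"), ("ex:p3", RESOURCE_IN, "ex:as6"),
]
weights = coauthor_weights(sample)
graphml = to_graphml(weights)
```

The resulting GraphML string can be written to a file and loaded by graph tools that read the format, which is where the network analysis on the following slides picks up.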
  • 17. Network analysis with NetworkX
  • 18. Network analysis with Mathematica
  • 20. For Your Information
      – Linked CT: http://queens.db.toronto.edu/~oktie/linkedct/
      – BioGrid in BioPax: http://www.pathwaycommons.org/pc-snapshot/current-release/biopax/by_source/
      – Drugbank: http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt
      – DailyMed: http://www4.wiwiss.fu-berlin.de/dailymed/dailymed_dump.nt
      – RxNorm is available at: http://link.informatics.stonybrook.edu/rxnorm/
      – Reach VIVO site is at: http://reach.sunysb.edu
      – SPARQL endpoint: http://link.informatics.stonybrook.edu/sparql/ (named graph http://reach.sunysb.edu)