A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage
  Marko A. Rodriguez (1), Johan Bollen, Herbert Van de Sompel
  Digital Library Research & Prototyping Team, Los Alamos National Laboratory - Research Library
  (1) [email_address]
  Acknowledgements: Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL)
  MESUR is supported by the Andrew W. Mellon Foundation.
Overview
  • The MESUR project
  • A quick RDF/RDFS/OWL tutorial
  • Modeling the scholarly community
  • Practical applications of the model
  • Conclusion
What is the MESUR project?
  • MEtrics from Scholarly Usage of Resources: http://www.mesur.org
  • The MESUR project is currently gathering publication, citation, and usage data from providers worldwide in order to engineer a large-scale scholarly model.
  • Publication and citation data from bibliographic databases.
  • Usage data logged by institutions, publishers, and aggregators.
Journal and Article data
  Journal-level bibliographic data
  • Thomson Scientific JCR: > 8,000 indexed journals
  • Thomson Scientific JCR: > 50,000,000 journal citations
  • Thomson Scientific JCR: > 100,000 journal classifications
  • Ex Libris SFX Journal Master List: > 300,000 journal identifiers (title, abbreviated title, ISSN, eISSN) clustered in groups
  • Ex Libris SFX: > 85,000 journal classifications
  Article-level bibliographic data
  • Thomson Scientific Citation Databases: > 37,500,000 articles
  • Thomson Scientific Citation Databases: > 550,000,000 article citations
Usage data
  Institutions (link resolvers, proxies)
  • Los Alamos: > 350,000 (1 year)
  • CalState: > 3,500,000 (2 years)
  • UTexas: > 2,500,000 (5 years)
  • PennState: …
  • …
  Aggregators
  • anonymous: > 2,500,000 (1 year)
  • anonymous: > 50,000,000 (1 week)
  • …
  Publishers
  • BioMed Central: > 24,000,000 (2 years)
  • Elsevier …
The Primary Data Representation
  • So how are we going to represent all this data in one model such that the model can be analyzed computationally?
  • ANSWER: A semantic network.
  Why are we using a semantic network?
  • Able to represent a heterogeneous set of entities related to one another by a heterogeneous set of relationships.
  • All “actors” are represented in the same substrate.
  • Existing technologies and standards support the representation: triple stores, RDF, semantic network analysis algorithms, etc.
  [Figure: nodes A, B, C]
Example Scholarly Relationships
  • <marko, wrote, practical_ontology>
  • <johan, wrote, practical_ontology>
  • <herbertv, wrote, practical_ontology>
  • <practical_ontology, publishedIn, jcdl>
  • <practical_ontology, cites, rdf_specification>
  • <rdf_specification, downloadedBy, 127.0.0.1>
  • <127.0.0.1, from, LANL>
  • <LANL, contains, herbertv>
  • <herbertv, coauthorsWith, marko>
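The example relationships above can be held in memory as plain (subject, predicate, object) tuples. The following is a minimal sketch, not the MESUR implementation; the `match` helper is a hypothetical name introduced only for illustration, and `None` is assumed to act as a wildcard.

```python
# Illustrative only: the slide's example relationships as
# (subject, predicate, object) tuples.
TRIPLES = {
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("herbertv", "wrote", "practical_ontology"),
    ("practical_ontology", "publishedIn", "jcdl"),
    ("practical_ontology", "cites", "rdf_specification"),
    ("rdf_specification", "downloadedBy", "127.0.0.1"),
    ("127.0.0.1", "from", "LANL"),
    ("LANL", "contains", "herbertv"),
    ("herbertv", "coauthorsWith", "marko"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern, sorted; None is a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who wrote practical_ontology?
authors = [s for s, _, _ in match(TRIPLES, p="wrote", o="practical_ontology")]
print(authors)  # ['herbertv', 'johan', 'marko']
```

Because people, articles, venues, and IP addresses all sit in the same set of tuples, one pattern-matching routine serves every relationship type, which is the "same substrate" point made earlier.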
What is the Purpose of an Ontology?
The MESUR Data Flow
RDF, RDFS, OWL
  The Resource Description Framework (RDF)
  • A data model for representing a semantic network.
  • URIs connected to one another by a URI.
  • < lanl:marko, lanl:worksWith, lanl:johan >
  The Resource Description Framework Schema (RDFS)
  • A simple ontology language for defining classes and their relationships to one another (provides basic class hierarchy construction).
  The Web Ontology Language (OWL)
  • A more advanced ontology language (this is the ontology language used in MESUR).
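RDF, RDFS
  [Figure: ontology and instance layers]
  Ontology:
  • < ex:isEating, rdfs:domain, ex:Human >
  • < ex:isEating, rdfs:range, ex:Food >
  Instance:
  • < ex:marko, rdf:type, ex:Human >
  • < ex:cookie, rdf:type, ex:Food >
  • < ex:marko, ex:isEating, ex:cookie >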
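One way to read this slide is that the ontology layer lets a reasoner derive the rdf:type statements of the instance layer. The sketch below assumes the diagram wires ex:isEating with domain ex:Human and range ex:Food, and applies the two standard RDFS domain and range entailment rules once. The function name `infer_types` is hypothetical.

```python
RDF_TYPE, DOMAIN, RANGE = "rdf:type", "rdfs:domain", "rdfs:range"

# Ontology layer (assumed reading of the diagram).
ontology = {
    ("ex:isEating", DOMAIN, "ex:Human"),
    ("ex:isEating", RANGE, "ex:Food"),
}

# Instance layer: a single asserted relationship.
instances = {("ex:marko", "ex:isEating", "ex:cookie")}


def infer_types(ontology, instances):
    """Apply the RDFS domain and range entailment rules once."""
    domains = {(p, c) for p, kind, c in ontology if kind == DOMAIN}
    ranges = {(p, c) for p, kind, c in ontology if kind == RANGE}
    inferred = set()
    for s, p, o in instances:
        # The subject of a property belongs to the property's domain class.
        inferred |= {(s, RDF_TYPE, c) for q, c in domains if q == p}
        # The object of a property belongs to the property's range class.
        inferred |= {(o, RDF_TYPE, c) for q, c in ranges if q == p}
    return inferred


inferred = infer_types(ontology, instances)
print(sorted(inferred))
```

Note that RDFS domain and range are used here to infer types, not to reject data that lacks them.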
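RDF, RDFS, OWL
  [Figure: ontology and instance layers with an OWL restriction]
  Ontology:
  • < ex:hasOwner, rdfs:domain, ex:Pet >
  • < ex:hasOwner, rdfs:range, ex:Human >
  • < ex:Pet, rdfs:subClassOf, _:0123 >
  • < _:0123, rdf:type, owl:Restriction >
  • < _:0123, owl:onProperty, ex:hasOwner >
  • < _:0123, owl:maxCardinality, “1” >
  Instance:
  • < ex:fluffy, rdf:type, ex:Pet >
  • < ex:marko, rdf:type, ex:Human >
  • < ex:fluffy, ex:hasOwner, ex:marko >
  • < ex:fluffy, ex:hasOwner, ex:bob >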
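The owl:maxCardinality restriction on this slide can be illustrated with a small check. This sketch assumes the diagram gives ex:fluffy two ex:hasOwner values (ex:marko and ex:bob) under a maximum cardinality of 1; `over_cardinality` is a hypothetical helper, not part of any OWL toolkit.

```python
from collections import defaultdict

MAX_OWNERS = 1  # the owl:maxCardinality value shown on the slide

# Assumed instance data: one pet with two asserted owners.
instances = [
    ("ex:fluffy", "ex:hasOwner", "ex:marko"),
    ("ex:fluffy", "ex:hasOwner", "ex:bob"),
]


def over_cardinality(triples, prop, limit):
    """Map each subject with more than `limit` distinct values of `prop`
    to the sorted list of those values."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return {s: sorted(objs) for s, objs in values.items() if len(objs) > limit}


flagged = over_cardinality(instances, "ex:hasOwner", MAX_OWNERS)
print(flagged)  # {'ex:fluffy': ['ex:bob', 'ex:marko']}
```

This only flags the subjects involved. Under OWL's open-world semantics, a reasoner would instead conclude that ex:marko and ex:bob denote the same individual unless they are declared different.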
The Triple Store
  • The triple store is to semantic networks what the relational database is to data tables.
  • Storing and querying triples in a triple store.
  • SPARQL query language: like SQL, but for triple stores.
  Example query (simplified notation):
    SELECT ?a ?c
    WHERE
      ( ?a type human )
      ( ?a wrote ?b )
      ( ?b type article )
      ( ?c wrote ?b )
      ( ?c type human )
      ( ?a != ?c )
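The coauthor query on this slide is a join over triple patterns. The sketch below mimics it in plain Python over a small, made-up set of triples; it is an illustration of the query's logic, not of how a triple store executes it, and `coauthor_pairs` is a hypothetical name.

```python
# Made-up data using the slide's simplified vocabulary.
triples = {
    ("marko", "type", "human"),
    ("johan", "type", "human"),
    ("herbertv", "type", "human"),
    ("practical_ontology", "type", "article"),
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("herbertv", "wrote", "practical_ontology"),
}


def coauthor_pairs(triples):
    """Return sorted (?a, ?c) pairs of distinct humans who wrote the same article."""
    humans = {s for s, p, o in triples if p == "type" and o == "human"}
    articles = {s for s, p, o in triples if p == "type" and o == "article"}
    wrote = [(s, o) for s, p, o in triples if p == "wrote"]
    return sorted({
        (a, c)
        for a, b in wrote
        for c, b2 in wrote
        if b == b2 and b in articles      # both wrote the same article ?b
        and a in humans and c in humans   # both authors are humans
        and a != c                        # exclude self-pairs
    })


pairs = coauthor_pairs(triples)
print(pairs)
```

As in the slide's query, each pair appears in both orders, because nothing constrains ?a to precede ?c.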
The Problem of Scale
  • High-end triple stores reasonably support 1+ billion triples.
  • The MESUR solution is to not include all artifact metadata in the triple store.
  • MESUR leverages relational database and triple store technology.
  • The triple store is for relationships.
  • The relational database is for metadata.
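The split described above (relationships in the triple store, metadata in the relational database) can be sketched as follows. This is an illustration under stated assumptions, not the MESUR architecture: a Python set stands in for the triple store, an in-memory SQLite table stands in for the relational database, and the `ex:` identifiers, the `metadata` table, and `titles_citing` are all hypothetical.

```python
import sqlite3

# "Triple store": relationships only, keyed by identifier.
triples = {
    ("ex:article1", "ex:cites", "ex:article2"),
    ("ex:article3", "ex:cites", "ex:article2"),
}

# "Relational database": descriptive metadata, keyed by the same identifiers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (uri TEXT PRIMARY KEY, title TEXT, year INTEGER)")
db.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("ex:article1", "First example article", 2006),
        ("ex:article2", "Second example article", 2005),
        ("ex:article3", "Third example article", 2007),
    ],
)


def titles_citing(target):
    """Find citing articles in the triples, then fetch their titles from SQL."""
    citing = sorted(s for s, p, o in triples if p == "ex:cites" and o == target)
    if not citing:
        return []
    marks = ",".join("?" * len(citing))  # one placeholder per identifier
    rows = db.execute(
        f"SELECT title FROM metadata WHERE uri IN ({marks}) ORDER BY uri", citing
    ).fetchall()
    return [title for (title,) in rows]


print(titles_citing("ex:article2"))
# ['First example article', 'Third example article']
```

The shared identifier is what couples the two stores: graph questions are answered from the triples, and titles or dates are looked up only for the identifiers the graph query returns.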
Relational Database & Triple Store
The MESUR Class Hierarchy
The Context Classes
  • Inspired by OntologyX: http://www.ontologyx.com
The Publishes Context
The Uses Context
Analysis Algorithms
  • ISI Impact Factor
  • Usage Impact Factor
      Bollen, J., Van de Sompel, H., “Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics”, [in review], 2007.
  • H-Index
      Hirsch, J.E., “An index to quantify an individual's scientific research output”, Proceedings of the National Academy of Sciences, 102:46, 2005.
  • Y-Factor
      Bollen, J., Rodriguez, M.A., Van de Sompel, H., “Journal Status”, Scientometrics, 69:3, 2006.
  • Other social network metrics: Eccentricity, Betweenness, Closeness, PageRank, …
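Of the metrics listed, the h-index has the most compact definition: an author has index h if h of their papers have at least h citations each. A minimal sketch of that definition follows; the citation counts are made-up inputs, and nothing here reflects how MESUR computes the metric over its model.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # counts only decrease from here, so stop
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4
```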
Journal Citation and Usage
Calculating the 2007 Impact Factor
  The 2007 impact factor of journal A is the total number of citations to articles published in A in 2005 and 2006 from articles published in 2007 in any journal B, divided by the total number of articles published by journal A in 2005 and 2006.
  Citations (numerator):
    SELECT ?x
    WHERE
      ( ?x rdf:type mesur:Citation )
      ( ?x mesur:hasSource ?a )
      ( ?x mesur:hasSink urn:issn:0028-0836 )
      ( ?x mesur:hasSourceTime ?u ) AND ( ?u == 2007 )
      ( ?x mesur:hasSinkTime ?t ) AND ( ?t > 2004 AND ?t < 2007 )
  Articles published (denominator):
    SELECT ?y
    WHERE
      ( ?y rdf:type mesur:Publishes )
      ( ?y mesur:hasGroup urn:issn:0028-0836 )
      ( ?y mesur:hasTime ?t ) AND ( ?t > 2004 AND ?t < 2007 )
  Result:
    INSERT < _123 rdf:type mesur:ImpactFactor >
    INSERT < _123 mesur:hasObject urn:issn:0028-0836 >
    INSERT < _123 mesur:hasStartTime 2007 >
    INSERT < _123 mesur:hasEndTime 2007 >
    INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >
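The arithmetic behind the two queries and the final ratio can be checked on a toy data set. This is a sketch with made-up records, not MESUR data; the tuple layouts and the `impact_factor` function are assumptions introduced for illustration.

```python
# (citing_year, cited_journal, cited_article_year) per citation; values are made up.
citations = [
    (2007, "A", 2005),
    (2007, "A", 2006),
    (2007, "A", 2006),
    (2007, "A", 2003),  # cited article is outside the two-year window
    (2006, "A", 2005),  # citing year is not 2007
    (2007, "B", 2006),  # citation goes to a different journal
]

# (journal, publication_year) per published article; values are made up.
articles = [("A", 2005), ("A", 2005), ("A", 2006), ("A", 2006), ("A", 2007), ("B", 2006)]


def impact_factor(journal, year, citations, articles):
    """Citations in `year` to the journal's articles from the two preceding
    years, divided by the number of articles it published in those years."""
    window = {year - 2, year - 1}
    cites = sum(
        1 for cy, j, py in citations
        if cy == year and j == journal and py in window
    )
    pubs = sum(1 for j, py in articles if j == journal and py in window)
    if pubs == 0:
        raise ValueError("journal published no articles in the window")
    return cites / pubs


print(impact_factor("A", 2007, citations, articles))  # 0.75
```

Three of the six citations qualify and journal A published four articles in 2005 and 2006, giving 3 / 4.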
Calculating the 2007 Usage Impact Factor
  The 2007 usage impact factor of journal A is the total number of 2007 usage events of articles published in A in 2005 and 2006, divided by the total number of articles published by journal A in 2005 and 2006.
  Usage events (numerator):
    SELECT ?x
    WHERE
      ( ?x rdf:type mesur:Uses )
      ( ?x mesur:hasUnit ?a )
      ( ?x mesur:hasGroup ?b )
      ( ?b mesur:partOf urn:issn:1082-9873 )
      ( ?x mesur:hasTime ?t ) AND ( ?t == 2007 )
      ( ?y rdf:type mesur:Publishes )
      ( ?y mesur:hasUnit ?a )
      ( ?y mesur:hasTime ?u ) AND ( ?u > 2004 AND ?u < 2007 )
  Articles published (denominator):
    SELECT ?y
    WHERE
      ( ?y rdf:type mesur:Publishes )
      ( ?y mesur:hasGroup ?a )
      ( ?a mesur:partOf urn:issn:1082-9873 )
      ( ?y mesur:hasTime ?t ) AND ( ?t > 2004 AND ?t < 2007 )
  Result:
    INSERT < _123 rdf:type mesur:UsageImpactFactor >
    INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
    INSERT < _123 mesur:hasStartTime 2007 >
    INSERT < _123 mesur:hasEndTime 2007 >
    INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >
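The usage-based variant replaces citations with usage events but keeps the same two-year publication window. The sketch below works through the definition on made-up records; the data layouts and the `usage_impact_factor` function are assumptions for illustration only.

```python
# (usage_year, article_id) per usage event; values are made up.
usage_events = [
    (2007, "a1"),
    (2007, "a1"),
    (2007, "a2"),
    (2007, "a3"),  # article published in 2007: outside the window
    (2006, "a1"),  # usage year is not 2007
    (2007, "b1"),  # article belongs to a different journal
]

# article_id -> (journal, publication_year); values are made up.
published = {
    "a1": ("A", 2005),
    "a2": ("A", 2006),
    "a3": ("A", 2007),
    "b1": ("B", 2006),
}


def usage_impact_factor(journal, year, usage_events, published):
    """Usage events in `year` of the journal's articles from the two preceding
    years, divided by the number of such articles."""
    window = {year - 2, year - 1}
    eligible = {
        a for a, (j, py) in published.items() if j == journal and py in window
    }
    if not eligible:
        raise ValueError("journal published no articles in the window")
    uses = sum(1 for uy, a in usage_events if uy == year and a in eligible)
    return uses / len(eligible)


print(usage_impact_factor("A", 2007, usage_events, published))  # 1.5
```

Three usage events qualify (two for a1, one for a2) against two eligible articles, giving 3 / 2.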
Contributions
  Uniting RDF/Semantic Web community technology with scholarly modeling
  • Semantic network model of the scholarly community.
  Architectural set-up supports a massive data set
  • Triple-store/relational database coupling.
  Open ontology for the scholarly community
  • Open source ontology to represent many aspects of the scholarly communication process, including publication, citation, and usage.
Some Related Publications
  • Marko A. Rodriguez, Johan Bollen, and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007.
  • Marko A. Rodriguez. Grammar-based random walkers in semantic networks. (LAUR-06-7791)
  • Marko A. Rodriguez and Jennifer H. Watkins. Grammar-based geodesics in semantic networks. (LAUR-07-4042)
  • Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (arxiv.org:cs.DL/0610154)
  • Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
  • Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
  • Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006. (arxiv.org:cs.DL/0601030)
  • Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.
Questions
  • MESUR is at http://www.mesur.org
  • MESUR ontology is at http://www.mesur.org/schemas/2007-01/mesur/
  • Many thanks to the Andrew W. Mellon Foundation for their support.
