A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage

The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exist, to support usage research, and to be instantiable at the scale of 50 million articles along with their associated artifacts (e.g., authors and journals) and an accompanying 1 billion usage events. The real-world instantiation of the presented abstract ontology is a semantic network model of the scholarly community, which lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.

Presentation Transcript

  • A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage
    • Marko A. Rodriguez (1), Johan Bollen, Herbert Van de Sompel
    • Digital Library Research & Prototyping Team, Los Alamos National Laboratory - Research Library
    • (1) [email_address]
    • Acknowledgements: Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL)
    • MESUR is supported by the Andrew W. Mellon Foundation.
  • Overview
    • The MESUR project
    • A quick RDF/RDFS/OWL tutorial
    • Modeling the scholarly community
    • Practical applications of the model
    • Conclusion
  • What is the MESUR project?
    • MEtrics from Scholarly Usage of Resources
      • http://www.mesur.org
    • The MESUR project is currently gathering publication, citation, and usage data from providers world-wide in order to engineer a large-scale scholarly model.
      • Publication and citation data from bibliographic databases.
      • Usage data logged by institutions, publishers, and aggregators.
  • Journal and Article data
    • Journal-level Bibliographic Data
      • Thomson Scientific JCR: > 8,000 indexed journals
      • Thomson Scientific JCR: > 50,000,000 journal citations
      • Thomson Scientific JCR: > 100,000 journal classifications
      • Ex Libris SFX Journal Master List: > 300,000 journal identifiers (title, abbreviated title, ISSN, eISSN) clustered in groups
      • Ex Libris SFX: > 85,000 journal classifications
    • Article-level Bibliographic Data
      • Thomson Scientific Citation Databases : > 37,500,000 articles
      • Thomson Scientific Citation Databases: > 550,000,000 article citations
  • Usage data
    • Institutions (link resolvers, proxies)
      • Los Alamos: > 350,000 1-year
      • CalState: > 3,500,000 2-years
      • UTexas: > 2,500,000 5-years
      • PennState: …
    • Aggregators
      • anonymous : > 2,500,000 1-year
      • anonymous : > 50,000,000 1-week
    • Publishers
      • BioMed Central: > 24,000,000 2-years
      • Elsevier
  • The Primary Data Representation
    • So how are we going to represent all this data in one model such that the model can be analyzed computationally?
    • ANSWER : A semantic network.
    • Why are we using a semantic network?
      • Able to represent a heterogeneous set of entities related to one another by a heterogeneous set of relationships.
      • All “actors” are represented in the same substrate.
      • Existing technologies and standards support the representation: triple stores, RDF, semantic network analysis algorithms, etc.
  • Example Scholarly Relationships
    • <marko, wrote, practical_ontology>
    • <johan, wrote, practical_ontology>
    • <herbertv, wrote, practical_ontology>
    • <practical_ontology, publishedIn, jcdl>
    • <practical_ontology, cites, rdf_specification>
    • <rdf_specification, downloadedBy, 127.0.0.1>
    • <127.0.0.1, from, LANL>
    • <LANL, contains, herbertv>
    • <herbertv, coauthorsWith, marko>
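The relationships listed on this slide can be held directly as subject-predicate-object tuples. The following is only a minimal sketch: the `match` helper and the use of a plain Python set are assumptions made for illustration, not part of the MESUR implementation.

```python
# Each relationship from the slide as a (subject, predicate, object) tuple.
TRIPLES = {
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
    ("herbertv", "wrote", "practical_ontology"),
    ("practical_ontology", "publishedIn", "jcdl"),
    ("practical_ontology", "cites", "rdf_specification"),
    ("rdf_specification", "downloadedBy", "127.0.0.1"),
    ("127.0.0.1", "from", "LANL"),
    ("LANL", "contains", "herbertv"),
    ("herbertv", "coauthorsWith", "marko"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard.

    Hypothetical helper, introduced here only for illustration.
    """
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who wrote practical_ontology?
authors = [s for s, _, _ in match(TRIPLES, p="wrote", o="practical_ontology")]
print(authors)  # ['herbertv', 'johan', 'marko']
```

Because authors, articles, venues, and usage sources all live in the same set of triples, one pattern-matching function is enough to ask questions about any of them.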
  • What is the Purpose of an Ontology?
  • The MESUR Data Flow
  • RDF, RDFS, OWL
    • The Resource Description Framework
      • A data model for representing a semantic network. URIs connected to one another by a URI. <lanl:marko, lanl:worksWith, lanl:johan>
    • The Resource Description Framework Schema
      • A simple ontology language for defining classes and their relationships to one another. (provides basic class hierarchy construction)
    • The Web Ontology Language
      • A more advanced ontology language. (this is the ontology language used in MESUR)
  • RDF, RDFS
    • Ontology: <ex:isEating, rdfs:domain, ex:Human>, <ex:isEating, rdfs:range, ex:Food>
    • Instance: <ex:marko, rdf:type, ex:Human>, <ex:cookie, rdf:type, ex:Food>, <ex:marko, ex:isEating, ex:cookie>
  • RDF, RDFS, OWL
    • [Figure: ontology layer with classes ex:Pet and ex:Human and the property ex:hasOwner (rdfs:domain, rdfs:range); ex:Pet is an rdfs:subClassOf the owl:Restriction _:0123, which has owl:onProperty ex:hasOwner and owl:maxCardinality "1"; instance layer with ex:fluffy, ex:marko, and ex:bob linked by rdf:type and ex:hasOwner.]
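The rdfs:domain and rdfs:range statements in these figures allow types to be inferred for instances. The sketch below illustrates that standard RDFS rule on plain tuples. The function name `infer_types` is hypothetical, and the code assumes each property has at most one domain and one range, which RDFS itself does not require.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"


def infer_types(triples):
    """Infer rdf:type triples from rdfs:domain and rdfs:range declarations.

    Simplifying assumption: one domain and one range per property.
    """
    domains = {s: o for s, p, o in triples if p == RDFS_DOMAIN}
    ranges = {s: o for s, p, o in triples if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in triples:
        if p in domains:  # the subject must belong to the property's domain
            inferred.add((s, RDF_TYPE, domains[p]))
        if p in ranges:   # the object must belong to the property's range
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred


graph = {
    ("ex:isEating", RDFS_DOMAIN, "ex:Human"),
    ("ex:isEating", RDFS_RANGE, "ex:Food"),
    ("ex:marko", "ex:isEating", "ex:cookie"),
}
print(sorted(infer_types(graph)))
```

Given only the statement that ex:marko is eating ex:cookie, the ontology is enough to conclude that ex:marko is an ex:Human and ex:cookie is an ex:Food.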
  • The Triple Store
    • SELECT ?a ?c WHERE ( ?a type human ) ( ?a wrote ?b ) ( ?b type article ) ( ?c wrote ?b ) ( ?c type human ) ( ?a != ?c )
    • The triple store is to semantic networks what the relational database is to data tables.
    • Storing and querying triples in a triple store
    • SPARQL query language
      • like SQL, but for triple stores
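To make the coauthor query on this slide concrete, here is a rough Python equivalent that performs the same join over an in-memory list of triples. The sample data and the function name `coauthor_pairs` are assumptions for illustration. A real triple store would evaluate the SPARQL query itself.

```python
# Hypothetical sample data using the predicates from the slide's query.
TRIPLES = [
    ("marko", "type", "human"),
    ("johan", "type", "human"),
    ("practical_ontology", "type", "article"),
    ("marko", "wrote", "practical_ontology"),
    ("johan", "wrote", "practical_ontology"),
]


def coauthor_pairs(triples):
    """Return (a, c) pairs of distinct humans who wrote the same article."""
    humans = {s for s, p, o in triples if p == "type" and o == "human"}
    articles = {s for s, p, o in triples if p == "type" and o == "article"}
    wrote = [(s, o) for s, p, o in triples if p == "wrote"]
    pairs = set()
    for a, b in wrote:
        for c, b2 in wrote:
            # Mirrors the query: same article ?b, both human, and ?a != ?c.
            if b == b2 and b in articles and a in humans and c in humans and a != c:
                pairs.add((a, c))
    return sorted(pairs)


print(coauthor_pairs(TRIPLES))  # [('johan', 'marko'), ('marko', 'johan')]
```

As in the query, each pair appears in both orders because nothing constrains which author binds to ?a and which to ?c.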
  • The Problem of Scale
    • High-end triple stores reasonably support 1+ billion triples.
    • The MESUR solution is to not include all artifact metadata in the triple store.
    • MESUR leverages relational database and triple store technology.
      • The triple store is for relationships.
      • The relational database is for metadata.
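The split described here can be sketched with Python's built-in `sqlite3` module standing in for the relational database and a plain set standing in for the triple store. All identifiers, titles, and the table layout below are hypothetical. The slides do not describe the actual MESUR schema.

```python
import sqlite3

# Relationships only: a plain set stands in for the triple store.
triples = {
    ("urn:example:article:1", "publishedIn", "urn:issn:0028-0836"),
    ("urn:example:article:2", "cites", "urn:example:article:1"),
}

# Descriptive metadata lives in the relational database, keyed by the same URIs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metadata (uri TEXT PRIMARY KEY, title TEXT, year INTEGER)")
db.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("urn:example:article:1", "Hypothetical article one", 2005),
        ("urn:example:article:2", "Hypothetical article two", 2007),
    ],
)


def titles_citing(target):
    """Traverse relationships in the triple set, then resolve titles via SQL."""
    citing = sorted(s for s, p, o in triples if p == "cites" and o == target)
    results = []
    for uri in citing:
        row = db.execute(
            "SELECT title FROM metadata WHERE uri = ?", (uri,)
        ).fetchone()
        results.append((uri, row[0] if row else None))
    return results


print(titles_citing("urn:example:article:1"))
```

Keeping titles, abstracts, and similar literals out of the triple set means the number of triples grows with the number of relationships rather than with the volume of metadata.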
  • Relational Database & Triple Store
  • The MESUR Class Hierarchy
  • The Context Classes
    • Inspired by OntologyX: http://www.ontologyx.com
  • The Publishes Context
  • The Uses Context
  • Analysis Algorithms
    • ISI Impact Factor
    • Usage Impact Factor
      • Bollen J., Van de Sompel, H., “ Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics ”, [in review], 2007.
    • H-Index
      • Hirsch, J.E., “ An index to quantify an individual's scientific research output ”, Proceedings of the National Academy of Sciences, 102:46, 2005.
    • Y-Factor
      • Bollen J., Rodriguez, M.A., Van de Sompel, H., “ Journal Status ”, Scientometrics, 69:3, 2006.
    • Other social network metrics
      • Eccentricity, Betweenness, Closeness, PageRank, …
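Of the metrics listed, the h-index has the simplest definition: the largest h such that h of an author's papers have at least h citations each. A minimal sketch follows. The function name and the input format (a list of per-paper citation counts) are assumptions, and the slides do not show how MESUR computes it.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4
```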
  • Journal Citation and Usage
  • Calculating the 2007 Impact Factor
    • Citations: SELECT ?x WHERE ( ?x rdf:type mesur:Citation ) ( ?x mesur:hasSource ?a ) ( ?x mesur:hasSink urn:issn:0028-0836 ) ( ?x mesur:hasSourceTime ?u ) AND ( ?u == 2007 ) ( ?x mesur:hasSinkTime ?t ) AND ( ?t > 2004 AND ?t < 2007 )
    • Publications: SELECT ?y WHERE ( ?y rdf:type mesur:Publishes ) ( ?y mesur:hasGroup urn:issn:0028-0836 ) ( ?y mesur:hasTime ?t ) AND ( ?t > 2004 AND ?t < 2007 )
    • Result:
      • INSERT < _123 rdf:type mesur:ImpactFactor >
      • INSERT < _123 mesur:hasObject urn:issn:0028-0836 >
      • INSERT < _123 mesur:hasStartTime 2007 >
      • INSERT < _123 mesur:hasEndTime 2007 >
      • INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >
    • The 2007 impact factor of journal A is the total number of citations from articles published in 2007, in any journal, to articles published in A in 2005 and 2006, divided by the total number of articles published by journal A in 2005 and 2006.
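The ratio described on this slide can be sketched in Python over plain records. The `Citation` and `Publication` classes, their field names, and the sample data are hypothetical stand-ins for the mesur:Citation and mesur:Publishes contexts. This illustrates the arithmetic only, not the MESUR code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_year: int   # year of the citing article (cf. mesur:hasSourceTime)
    sink_journal: str  # journal of the cited article (cf. mesur:hasSink)
    sink_year: int     # year of the cited article (cf. mesur:hasSinkTime)


@dataclass(frozen=True)
class Publication:
    journal: str  # cf. mesur:hasGroup
    year: int     # cf. mesur:hasTime


def impact_factor(citations, publications, journal, year):
    """Citations in `year` to the journal's articles from the two prior years,
    divided by the number of articles the journal published in those years."""
    window = (year - 2, year - 1)
    cites = sum(
        1 for c in citations
        if c.sink_journal == journal
        and c.source_year == year
        and c.sink_year in window
    )
    pubs = sum(1 for p in publications if p.journal == journal and p.year in window)
    if pubs == 0:
        raise ValueError("journal published no articles in the two-year window")
    return cites / pubs


NATURE = "urn:issn:0028-0836"
OTHER = "urn:issn:0000-0000"  # placeholder identifier, not a real journal
publications = [
    Publication(NATURE, 2005), Publication(NATURE, 2006), Publication(NATURE, 2006),
    Publication(NATURE, 2007), Publication(OTHER, 2005),
]
citations = [
    Citation(2007, NATURE, 2005), Citation(2007, NATURE, 2006),
    Citation(2007, NATURE, 2006),
    Citation(2006, NATURE, 2005),  # excluded: citing article is not from 2007
    Citation(2007, NATURE, 2004),  # excluded: cited article is outside the window
    Citation(2007, OTHER, 2005),   # excluded: cites a different journal
]
print(impact_factor(citations, publications, NATURE, 2007))  # 1.0
```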
  • Calculating the 2007 Usage Impact Factor
    • Usage events: SELECT ?x WHERE ( ?x rdf:type mesur:Uses ) ( ?x mesur:hasUnit ?a ) ( ?x mesur:hasGroup ?b ) ( ?b mesur:partOf urn:issn:1082-9873 ) ( ?x mesur:hasTime ?t ) AND ( ?t == 2007 ) ( ?y rdf:type mesur:Publishes ) ( ?y mesur:hasUnit ?a ) ( ?y mesur:hasTime ?u ) AND ( ?u > 2004 AND ?u < 2007 )
    • Publications: SELECT ?y WHERE ( ?y rdf:type mesur:Publishes ) ( ?y mesur:hasGroup ?a ) ( ?a mesur:partOf urn:issn:1082-9873 ) ( ?y mesur:hasTime ?t ) AND ( ?t > 2004 AND ?t < 2007 )
    • Result:
      • INSERT < _123 rdf:type mesur:UsageImpactFactor >
      • INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
      • INSERT < _123 mesur:hasStartTime 2007 >
      • INSERT < _123 mesur:hasEndTime 2007 >
      • INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >
    • The 2007 usage impact factor of journal A is the total number of 2007 usage events of articles published in A in 2005 and 2006 divided by the total number of articles published by journal A in 2005 and 2006.
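The usage variant differs from the citation-based calculation in that each usage event must first be joined to the publication year of the article it touches. The sketch below shows that join under assumed data structures: the dictionary layout, article ids, and the function name `usage_impact_factor` are all hypothetical.

```python
DLIB = "urn:issn:1082-9873"
OTHER = "urn:issn:0000-0000"  # placeholder identifier, not a real journal

# article id -> (journal, publication year); hypothetical sample data
PUBLISHED = {
    "a1": (DLIB, 2005),
    "a2": (DLIB, 2006),
    "a3": (DLIB, 2007),
    "b1": (OTHER, 2006),
}

# (article id, year in which the usage event occurred)
USAGE = [
    ("a1", 2007), ("a1", 2007), ("a2", 2007),
    ("a2", 2006),  # excluded: the event is not from 2007
    ("a3", 2007),  # excluded: the article is outside the 2005-2006 window
    ("b1", 2007),  # excluded: the article belongs to a different journal
]


def usage_impact_factor(published, usage, journal, year):
    """Usage events in `year` on the journal's articles from the two prior
    years, divided by the number of such articles."""
    window = {year - 2, year - 1}
    eligible = {
        article for article, (j, y) in published.items()
        if j == journal and y in window
    }
    if not eligible:
        raise ValueError("journal published no articles in the two-year window")
    events = sum(1 for article, t in usage if t == year and article in eligible)
    return events / len(eligible)


print(usage_impact_factor(PUBLISHED, USAGE, DLIB, 2007))  # 1.5
```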
  • Contributions
    • Uniting the RDF/Semantic Web community technology with scholarly modeling.
      • Semantic network model of the scholarly community.
    • Architectural set-up supports a massive data set
      • Triple-store/relational database coupling.
    • Open ontology for the scholarly community
      • Open source ontology to represent many aspects of the scholarly communication process including publication, citation, and usage.
  • Some Related Publications
    • Marko A. Rodriguez, Johan Bollen, and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007.
    • Marko A. Rodriguez. Grammar-based random walkers in semantic networks. (LAUR-06-7791)
    • Marko A. Rodriguez and Jennifer H. Watkins. Grammar-based geodesics in semantic networks. (LAUR-07-4042)
    • Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (arxiv.org:cs.DL/0610154)
    • Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
    • Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
    • Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006. (arxiv.org:cs.DL/0601030)
    • Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.
  • Questions
    • MESUR is at http://www.mesur.org
    • MESUR ontology is at http://www.mesur.org/schemas/2007-01/mesur/
    • Many thanks to the Andrew W. Mellon Foundation for their support