A Model of the Scholarly Community

Transcript

  • 1. A Model of the Scholarly Community
    • Marko A. Rodriguez
    • http://www.soe.ucsc.edu/~okram
    • March 30, 2007
  • 2. MESUR Project
    • 2-year project
    • The first half of the project focuses on ontology development, parsing, and the development of analysis algorithms (metrics).
    • The second half of the project covers the analysis of our data structure and the reporting of our findings in the literature.
  • 3. Outline
    • The Data: which and how much data?
    • The Model: how do we represent the data?
    • The Metrics: how do we quantify the entities in our model?
  • 4. Terminology
    • Groups: journals, proceedings, magazines, newspapers, edited books.
    • Units: articles, book chapters, dissertations.
    • Documents: groups and units.
    • Usage-Event: the act of interacting with an article (e.g. getFullText, getAbstract, getReferences); an expression of user interest.
  • 5. The Data
    • Two types of data:
      • Bibliographic data: metadata pertaining to groups and units.
      • Usage data: metadata pertaining to group and unit usage.
  • 6. The Data
    • Group-level Bibliographic Data
      • SFX Master List: > 300,000 groups
      • SFX: > 85,000 group classifications
      • ISI JCR: > 8,000 indexed groups
      • ISI JCR: > 50,000,000 group citations
      • ISI JCR: > 100,000 group classifications
    • Unit-level Bibliographic Data
      • ISI Tapes: > 30,000,000 unit records
      • ISI Tapes: > 500,000,000 unit citations
  • 7. The Data
    • Usage Data
      • Los Alamos: > 400,000 over 1 year
      • BioMed Central: > 24,000,000 over 2 years
      • anonymous: > 1,000,000 over 5 years
      • anonymous: > 2,500,000 over 1 year
      • anonymous: > 50,000,000 over 1 week
  • 8. The Data
    • The semantic network model is estimated to be > 10 billion triples (edges).
      • As of March 2007: 1.2 billion.
  • 9. The Model
    • In order to integrate the various data sets in their various formats, we model all information according to an ontology.
  • 10. The Model
    • RDF, RDFS, OWL [W3C Standards]
      • Resource Description Framework
      • Resource Description Framework Schema
      • Web Ontology Language
    • Provides us with a standardized language in which to represent our entities and their relationships to one another.
  • 11. The Model
    • In OWL, everything is an owl:Thing--both nodes and edges (analogous to java.lang.Object in Java)
    • All owl:Things are represented by a URI.
    • An instance of the ontology provides us with a URI triple list data structure:
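A triple list of this kind can be sketched in a few lines of Python. This is only an illustration, not the MESUR implementation: the `Triple` type, the `index_by_subject` helper, and the node identifier `_:p1` are assumptions made for the example, while the predicates and author URIs are borrowed from the queries later in this deck.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One edge of the semantic network: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# One Publishes context with two authors; "_:p1" is a made-up node identifier.
triples = [
    Triple("_:p1", "rdf:type", "mesur:Publishes"),
    Triple("_:p1", "mesur:hasAuthor", "lanl:marko"),
    Triple("_:p1", "mesur:hasAuthor", "lanl:herbertv"),
]


def index_by_subject(triple_list):
    """Group (predicate, object) pairs under each subject for quick lookup."""
    index = defaultdict(list)
    for t in triple_list:
        index[t.subject].append((t.predicate, t.obj))
    return dict(index)


by_subject = index_by_subject(triples)
# Collect the authors attached to the Publishes context.
authors = [o for p, o in by_subject["_:p1"] if p == "mesur:hasAuthor"]
```

A real triple store would keep several such indexes (by subject, predicate, and object) so that any pattern can be matched without scanning the whole list.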
  • 12. The Model
    • The instance of an OWL ontology resides in a triple store.
  • 13. The Model
    • SPARQL (like SQL, but for triple stores).
    SELECT ?c as grandparent WHERE ( ?a childOf ?b) ( ?b childOf ?c )
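To make the join in the grandparent query concrete, here is a rough Python sketch that evaluates the same two-pattern match over an in-memory triple list. The `ex:` names and the `grandparents` function are hypothetical; a triple store would answer this through its SPARQL engine rather than hand-written loops.

```python
# Hypothetical family data expressed as (subject, predicate, object) triples.
family = [
    ("ex:carol", "ex:childOf", "ex:bob"),
    ("ex:bob", "ex:childOf", "ex:alice"),
    ("ex:dave", "ex:childOf", "ex:carol"),
]


def grandparents(triples):
    """Return every ?c such that (?a childOf ?b) and (?b childOf ?c) both match."""
    parents_of = {}
    for s, p, o in triples:
        if p == "ex:childOf":
            parents_of.setdefault(s, set()).add(o)
    result = set()
    for child, parents in parents_of.items():
        for parent in parents:
            # The shared variable ?b is what joins the two patterns.
            result |= parents_of.get(parent, set())
    return result


found = sorted(grandparents(family))
```

Here `found` holds `ex:alice` (grandparent of `ex:carol`) and `ex:bob` (grandparent of `ex:dave`).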
  • 14. The Model
    • Rodriguez, M.A., Bollen, J., Van de Sompel, H., “A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage”, IEEE/ACM Joint Conference on Digital Libraries, Vancouver, 2007.
  • 15. The Model
  • 16. The Model
  • 17. The Model
    • From the Publishes contexts, generate a weighted coauthorship network.
    SELECT ?x WHERE
      ( ?x rdf:type mesur:Publishes )
      ( ?x mesur:hasAuthor lanl:marko )
      ( ?x mesur:hasAuthor lanl:herbertv )
    INSERT < _123 rdf:type mesur:Coauthor >
    INSERT < _123 mesur:hasSource lanl:marko >
    INSERT < _123 mesur:hasSink lanl:herbertv >
    INSERT < _123 mesur:hasWeight COUNT(?x) >
    INSERT < _456 rdf:type mesur:Coauthor >
    INSERT < _456 mesur:hasSource lanl:herbertv >
    INSERT < _456 mesur:hasSink lanl:marko >
    INSERT < _456 mesur:hasWeight COUNT(?x) >
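The query above handles one author pair. As a minimal sketch, assuming the triples fit in memory, the same idea can be generalized to all pairs in Python: every Publishes context contributes one unit of weight to each directed coauthor edge. The function name, the context identifiers, and the author `ex:third_author` are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations


def coauthor_weights(triples):
    """Count shared Publishes contexts for every ordered (source, sink) author pair."""
    publishes = {s for s, p, o in triples
                 if p == "rdf:type" and o == "mesur:Publishes"}
    authors = defaultdict(set)
    for s, p, o in triples:
        if p == "mesur:hasAuthor" and s in publishes:
            authors[s].add(o)
    weights = defaultdict(int)
    for author_set in authors.values():
        # Both directions are emitted, mirroring the two Coauthor contexts above.
        for source, sink in permutations(sorted(author_set), 2):
            weights[(source, sink)] += 1
    return dict(weights)


# Hypothetical data: two joint papers, plus one paper with a third author.
data = [
    ("_:p1", "rdf:type", "mesur:Publishes"),
    ("_:p1", "mesur:hasAuthor", "lanl:marko"),
    ("_:p1", "mesur:hasAuthor", "lanl:herbertv"),
    ("_:p2", "rdf:type", "mesur:Publishes"),
    ("_:p2", "mesur:hasAuthor", "lanl:marko"),
    ("_:p2", "mesur:hasAuthor", "lanl:herbertv"),
    ("_:p3", "rdf:type", "mesur:Publishes"),
    ("_:p3", "mesur:hasAuthor", "lanl:marko"),
    ("_:p3", "mesur:hasAuthor", "ex:third_author"),
]
network = coauthor_weights(data)
```

In this toy data set the edge between `lanl:marko` and `lanl:herbertv` gets weight 2 in each direction, which corresponds to COUNT(?x) in the query.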
  • 18. The Model
    • Phase 1 looks only at group-level usage and bibliographic data.
  • 19. The Metrics
    • ISI Impact Factor
    • Usage Impact Factor
      • Bollen, J., Van de Sompel, H., “Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics”, [in review], 2007.
    • H-Index
      • Hirsch, J.E., “An index to quantify an individual's scientific research output”, Proceedings of the National Academy of Sciences, 102:46, 2005.
    • Y-Factor
      • Bollen, J., Rodriguez, M.A., Van de Sompel, H., “Journal Status”, Scientometrics, 69:3, 2006.
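As a small worked example of one of the listed metrics, the H-Index is the largest h such that h of an author's papers each have at least h citations. The sketch below follows that standard definition; the function name and the sample citation counts are invented for illustration.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


# Invented citation counts for five papers.
example = h_index([10, 8, 5, 4, 3])
```

With the counts above, four papers have at least four citations each but there are not five papers with five or more, so `example` is 4.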
  • 20. The Metrics
    • From the Publishes and Citation contexts, generate Impact Factor Rankings.
    SELECT ?x WHERE
      ( ?x rdf:type mesur:Publishes )
      ( ?x mesur:hasUnit ?a )
      ( ?x mesur:hasGroup ?b )
      ( ?b mesur:partOf urn:issn:1082-9873 )
      ( ?x mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)
      ( ?y rdf:type mesur:Citation )
      ( ?y mesur:hasSource ?c )
      ( ?y mesur:hasSink ?a )
      ( ?z rdf:type mesur:Publishes )
      ( ?z mesur:hasUnit ?c )
      ( ?z mesur:hasTime ?u ) AND ?u = 2007
    SELECT ?y WHERE
      ( ?y rdf:type mesur:Publishes )
      ( ?y mesur:hasGroup ?a )
      ( ?a mesur:partOf urn:issn:1082-9873 )
      ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)
    INSERT < _123 rdf:type mesur:ImpactFactor >
    INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
    INSERT < _123 mesur:hasStartTime 2007 >
    INSERT < _123 mesur:hasEndTime 2007 >
    INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >
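The arithmetic behind the two queries can be sketched in Python: count the citations made in the target year to units the journal published in the two preceding years, then divide by the number of those units. The record layout, function name, and all unit identifiers below are assumptions for the example; only the ISSN comes from the slide.

```python
def impact_factor(publications, citations, journal, year):
    """Citations made in `year` to the journal's units from the two prior years,
    divided by the number of those units. Returns None if there are no such units."""
    window = {year - 2, year - 1}
    year_of = {unit: pub_year for unit, _, pub_year in publications}
    cited_units = {unit for unit, group, pub_year in publications
                   if group == journal and pub_year in window}
    if not cited_units:
        return None
    citation_count = sum(
        1 for source, sink in citations
        if sink in cited_units and year_of.get(source) == year
    )
    return citation_count / len(cited_units)


JOURNAL = "urn:issn:1082-9873"
# (unit, group, publication year); all unit identifiers are hypothetical.
publications = [
    ("a1", JOURNAL, 2005),
    ("a2", JOURNAL, 2006),
    ("a3", JOURNAL, 2004),   # outside the 2005-2006 window
    ("c1", "ex:other", 2007),
    ("c2", "ex:other", 2007),
    ("c3", "ex:other", 2006),  # cites too early to count for 2007
]
# (citing unit, cited unit)
citations = [("c1", "a1"), ("c2", "a1"), ("c2", "a2"), ("c3", "a2"), ("c1", "a3")]

value = impact_factor(publications, citations, JOURNAL, 2007)
```

In this toy data set three citations qualify against two units in the window, giving 3 / 2.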
  • 21. The Metrics
    • Eigenvector-based global-rank metrics such as PageRank, Eigenvector centrality, Y-Factor, and relative-rank ‘spreading activation’ algorithms can be calculated in a similar fashion.
      • Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, [in review], 2007.
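For readers unfamiliar with eigenvector-based ranking, the following is a plain power-iteration PageRank over a directed edge list. It is a generic textbook sketch, not the grammar-based random walker method cited above; the damping factor of 0.85, the iteration count, and the four-node graph are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for target in out_links[n]:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[n] / len(nodes)
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: a cycle A -> B -> C -> A, plus D citing A.
ranks = pagerank([("A", "B"), ("B", "C"), ("C", "A"), ("D", "A")])
top = max(ranks, key=ranks.get)
```

Node A ends up ranked highest because it is cited by both C and D, while D, which nothing cites, keeps only the teleportation share.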
  • 22. Conclusion
    • Thanks for your time…Good life.
    http://www.mesur.org