ALA 2010 -- Johan Bollen
ALA 2010 -- Johan Bollen Presentation Transcript

  • 1. The MESUR project: an overview and update
    Johan Bollen, Indiana University, School of Informatics and Computing, Center for Complex Networks and Systems Research, jbollen@indiana.edu
    Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL), Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)
    Research supported by the NSF and Andrew W. Mellon Foundation.
    School of Informatics and Computing Indiana University November 5th, 2009
  • 2. When the obvious is staring you in the face
  • 3. The scientific process: the importance of early indicators (Egghe & Rousseau, 2000; Wouters, 1997), (Brody, Harnad, & Carr 2006)
    Usage data:
      • Scale, cf. Elsevier downloads (+1B) vs. WoS citations (650M)
      • Immediate, early stages
      • Variety of resources and actors
    Citation: final products
      • Publication delays
      • Focus on publications
      • Focus on authors
  • 4. What is MESUR?
    MESUR IS: a scientific project to study Science itself from REAL-TIME indicators.
    Foundations:
      • Very large-scale, representative usage data (10^9)
      • Network science and social network analysis
      • Complex systems, complex networks
    Outcomes:
      • Surveys of novel impact metrics and their properties
      • Network models of Science
    [Image: unrelated picture of my daughter (who wants to be a scientist)]
    MESUR IS NOT: a commercial endeavor, end-user service development, a promoter of particular impact metrics, an enemy or friend of particular metrics, an advocacy group, …
  • 5. Timeline and development
      • 2006-2008:
          o Andrew W. Mellon Foundation
          o Digital Library Research and Prototyping team, Los Alamos National Laboratory
          o Collection of large-scale usage data from some of the world’s most significant publishers, aggregators and institutional consortia
          o Feasibility: usage data, usage-based network models of science, usage-based impact metrics
      • 2009 - infinity and beyond:
          o NSF funding (SciSIP, 2009-2012)
          o Indiana University, School of Informatics and Computing
      • 2010: Andrew W. Mellon Foundation
          o Continuation of MESUR data collection and scientific work
          o Investigate evolving to sustainable, open, community-supported infrastructure
  • 6. Presentation structure
      1. MESUR’s Usage reference data set
      2. Mapping scientific activity
      3. Metrics survey
      4. Future research
      5. Discussion
  • 7. Creating the MESUR usage reference data set
    2006-2008: collaborating publishers, aggregators and institutional consortia:
      • BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR, LANL, MIMAS/ZETOC, THOMSON, UPENN (9), UTEXAS
      • Scale:
          o > 1,000,000,000 usage events, and growing…
          o +50M articles, ±100,000 serials
      • Period: 2002-2007, but mostly 2006
  • 8. Data normalization and ingestion
    Minimal requirements for all usage data:
      • Unique usage events (article level)
      • Fields: unique session ID, date/time, unique document ID and/or metadata, request type
      • Note difference with usage statistics
    Sample usage log records shown on the slide (several are cut off at the right edge):
      2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foeabs.edu/abs/1996SPIE.2828..64S http://www.google.com
      2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foeabs.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr
      2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foeabs.edu/abs/2000bioa.conf.333S http://schola
      2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foeabs.edu/abs/1993WRR..29.133S http://scholar.google.com
      2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foeabs.edu
      2007 9 1 0 0 6 CFA cffoe foel25144.4u.com.gh 47002f8eda PHY A 2002AGUFM.S21A0965M http://foeabs.edu/abs/2002AGUFM.S21A0965M http://www.goo
      2007 9 1 0 0 6 CFA cffoe 66-215-171-214.dhcp.ccmn.ca.charter.com 4681d22a6f AST A 2001P&SS..49.657R http://foeabs.edu/cgi-bin/bib_query?bibcode=2001P%
      2007 9 1 0 0 7 CFA cffoe nat-ptouser3.uspto.gov unknown PHY A 2005ApPhL.86g2106M http://foeabs.edu/abs/2005ApPhL.86g2106M http://www.google.com
      2007 9 1 0 0 7 CFA cffoe cpe-71-65-25-115.ma.res.rr.com unknown PHY A 1980SPIE.205.153S http://foeabs.edu/abs/1980SPIE.205.153S http://www.google.com
      2007 9 1 0 0 7 CFA cffoe customer3491.pool1.unallocated-106-0.orangehomedsl.co.uk unknown PHY A 1983ElL..19.883V http://foeabs.edu/abs/1983ElL..19.883V
      2007 9 1 0 0 8 CFA cffoe Uranus.seas.ucla.edu 46672d96b2 PHY A 1966Phy..32.385K http://foeabs.edu/abs/1966Phy..32.385K http://www.google.com
      2007 9 1 0 0 9 CFA cffoe 75-121-173-37.dyn.centurytel.net 46cf1fd8a6 AST D 1984ApJS..56.257J http://vizier.cfa.edu/viz-bin/VizieR?-source=III/92/ http://foe
      2007 9 1 0 0 13 CFA cffoe foel17-18.kln.forthnet.gr unknown AST A 1987cosm.book...C http://foeabs.edu/abs/1987cosm.book...C http://www.google.gr
      2007 9 1 0 0 15 CFA cffoe hades.astro.uiuc.edu 46f707564d PRE A 2007arXiv0707.3146N http://foeabs.edu/abs/2007arXiv0707.3146N http://foeabs.edu
      2007 9 1 0 0 17 CFA cffoe ool-43554752.dyn.optonline.net unknown PHY A 2000PhTea.38.132K http://foeabs.edu/abs/2000PhTea.38.132K http://www.google.com
      2007 9 1 0 0 17 CFA cffoe c-68-33-176-222.hsd1.md.comcast.net unknown GEN A 1994RSPSB.256.177M http://foeabs.edu/abs/1994RSPSB.256.177M http://w
      2007 9 1 0 0 19 CFA cffoe 74-36-139-46.dr02.brvl.mn.frontiernet.net unknown AST T 2002SPIE.4767.114W http://foeabs.edu/cgi-bin/nph-abs_connect?bibcode=20
      2007 9 1 0 0 19 CFA cffoe c-76-16-53-120.hsd1.il.comcast.net 46f667b71b AST F 1916PA...24.613L http://articles.foeabs.edu/cgi-bin/nph-iarticle_query?1916PA
      2007 9 1 0 0 20 CFA cffoe 74-39-37-62.nas03.roch.ny.frontiernet.net unknown PHY E 2007JSTEd.tmp..29B http://dx.doi.org/10.1007/s10972-007-9067-2 http://fo
      2007 9 1 0 0 22 ANU bio-mirror uatu-virtual1.anu.edu.au 46f9e8f87f AST A 2006ApJ..647.128E http://foe.grangenet.net/abs/2006ApJ..647.128E http://foe
      2007 9 1 0 0 22 CFA cffoe fw.hia.nrc.ca 46f1531d59 AST A 2002P&SS..50.745H http://foeabs.edu/abs/2002P%26SS..50.745H http://foeabs.edu
      2007 9 1 0 0 22 CFA cffoe 24-117-0-220.cpe.cableone.net unknown AST A 1984BITA..15.268S http://foeabs.edu/abs/1984BITA..15.268S http://www.google.com
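The minimal fields listed on slide 8 (session ID, date/time, document ID, request type) can be pulled out of a log record like the ones shown there. The following is only a sketch: the slide does not document the column layout, so the field positions below (six date/time columns, then mirror, server, client host, session cookie, database, request type, document ID) are an assumption read off the sample lines, and `UsageEvent` and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class UsageEvent:
    """One article-level usage event (hypothetical structure)."""
    timestamp: datetime
    session_id: Optional[str]  # None when the log records "unknown"
    client: str
    document_id: str
    request_type: str


def parse_record(line: str) -> UsageEvent:
    """Parse one whitespace-separated log record.

    Assumed column order (not confirmed by the slide):
    year month day hour minute second mirror server client
    session database request_type document_id [url] [referrer]
    """
    fields = line.split()
    if len(fields) < 13:
        raise ValueError("record has fewer than 13 fields: %r" % line)
    year, month, day, hour, minute, second = (int(x) for x in fields[:6])
    # fields[6] and fields[7] (mirror and server labels) are not needed here.
    client, session, _database, request_type, document_id = fields[8:13]
    return UsageEvent(
        timestamp=datetime(year, month, day, hour, minute, second),
        session_id=None if session == "unknown" else session,
        client=client,
        document_id=document_id,
        request_type=request_type,
    )
```

Records without a session ID are kept here with `session_id=None`; whether such events can be used for session-based analysis is a separate decision the slides do not settle.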
  • 9. Presentation structure
      1. MESUR’s Usage reference data set
      2. Mapping scientific activity
      3. Metrics survey
      4. Future research
      5. Discussion
  • 10. Data set: subset of MESUR
      • Common time period:
          o March 1st 2006 - February 1st 2007
          o Thomson Scientific (Web of Science), Elsevier (Scopus), JSTOR, Ingenta, University of Texas (9 campuses, 6 health institutions), and California State University (23 campuses)
      • 346,312,045 usage events
      • 97,532 serials (many of which are not journals)
  • 11. How to generate a usage network
    Same session ~ document relatedness
      • Same session, same user: common interest
      • Frequency of co-occurrence = degree of relationship
      • Normalized: conditional probability
    Usage data is on article level:
      • Works for journals and articles
      • Anything for which usage was recorded
    Note: not something we invented: association rule learning in data mining. Beer and diapers!
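The recipe on slide 11 (co-occurrence within a session, normalized to a conditional probability) can be sketched as follows. This is an illustration under stated assumptions, not the MESUR implementation: it counts each document at most once per session, treats any two documents in the same session as co-occurring, and estimates P(b | a) as the share of sessions containing a that also contain b. The function name `build_usage_network` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_usage_network(events):
    """Build a directed, weighted usage network.

    events: iterable of (session_id, document_id) pairs.
    Returns a dict mapping (a, b) to the estimated P(b | a).
    """
    # Group documents by session; a set counts each document once per session.
    sessions = defaultdict(set)
    for session_id, document_id in events:
        sessions[session_id].add(document_id)

    doc_sessions = Counter()   # sessions in which each document appears
    pair_sessions = Counter()  # sessions in which both documents appear
    for docs in sessions.values():
        doc_sessions.update(docs)
        for a, b in combinations(sorted(docs), 2):
            pair_sessions[(a, b)] += 1

    # Normalize co-occurrence counts into conditional probabilities.
    weights = {}
    for (a, b), count in pair_sessions.items():
        weights[(a, b)] = count / doc_sessions[a]  # P(b | a)
        weights[(b, a)] = count / doc_sessions[b]  # P(a | b)
    return weights
```

Because the slide notes that usage is recorded at the article level, the same function yields a journal-level network if each document ID is first mapped to its journal.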
  • 12. Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, February 2009.
  • 13. Network science for impact metrics
    PageRank:
      • PR(vi): PageRank of node vi
      • O(vj): out-degree of journal vj
      • N: number of nodes in network
      • L: damping factor
    Betweenness centrality:
      • Number of geodesics between vi and vj
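The PageRank equation itself did not survive in the transcript of slide 13, only its symbol definitions. In the standard formulation those symbols fit PR(vi) = (1 - L)/N + L · Σ PR(vj)/O(vj), summed over the nodes vj that link to vi. The sketch below iterates that standard formula; it is an assumption that the slide used this exact variant, the default damping value 0.85 is a common convention rather than something the slide states, and spreading the rank of nodes without outgoing links uniformly is one common choice among several.

```python
def pagerank(links, damping=0.85, iterations=200, tol=1e-12):
    """Iterate the standard PageRank formula on a directed graph.

    links: dict mapping each node to a list of nodes it links to.
    Returns a dict mapping each node to its PageRank; values sum to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank
```

For a weighted usage network, the equal `share` per outgoing link would be replaced by a share proportional to the edge weight; that variant is not shown here.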
  • 14. Presentation structure
      1. MESUR’s Usage reference data set
      2. Mapping scientific activity
      3. Metrics survey
      4. Future research
      5. Discussion
  • 15. A variety of impact metrics
    Note:
      • Metrics can be calculated both on citation and usage data
      • “Frequentist”
          o Citation and usage rates
      • “Structural”
          o Citation graph, e.g. 2005 JCR
          o Usage graph, e.g. created by MESUR
      • H-index, G-index, SJR, etc.
    What do they MEAN? What facets of impact do they represent? Which are best suited?
  • 16. Set of metrics calculated on MESUR data set
  • 17. The MESUR Metrics Map
    [Figure: map of metrics with regions labelled BETWEENNESS, PAGERANK(S), USAGE METRICS, TOTAL CITES and RATE METRICS]
    Johan Bollen, Herbert Van de Sompel, Aric Hagberg and Ryan Chute. A Principal Component Analysis of 39 Scientific Impact Measures. PLoS ONE, June 2009. URL: http://dx.plos.org/10.1371/journal.pone.0006022.
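A metrics map of the kind cited on slide 17 rests on comparing how different metrics rank the same journals. The sketch below shows one plausible pipeline, assuming Spearman rank correlations between metrics followed by the leading principal component of the correlation matrix (found by power iteration). The slide does not spell out the procedure, so the choice of correlation measure and all function names here are assumptions, and no metric may be constant across journals.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_matrix(metrics):
    """metrics: dict of metric name -> scores for the same list of journals."""
    names = sorted(metrics)
    ranked = {m: ranks(metrics[m]) for m in names}
    matrix = [[pearson(ranked[a], ranked[b]) for b in names] for a in names]
    return names, matrix


def first_component(matrix, iterations=500):
    """Leading eigenvector of a symmetric matrix via power iteration.

    Starts from a uniform vector, so it fails to find the leading
    eigenvector in the rare case where that vector is orthogonal to it.
    """
    n = len(matrix)
    vec = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(n)) for row in matrix]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("power iteration collapsed to the zero vector")
        vec = [x / norm for x in nxt]
    return vec
```

Metrics whose loadings on the leading components are close would sit near each other on such a map; a full analysis would extract further components, which this sketch omits.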
  • 18. Presentation structure
      1. MESUR’s Usage reference data set
      2. Mapping scientific activity
      3. Metrics survey
      4. Future research
      5. Discussion
  • 19. Samples of future work (can be skipped)
      • Longitudinal studies:
          o Network changes over time: collaboration with Carl Bergstrom (UW)
          o Prediction of innovation using random walk models
      • Logistics:
          o Expand existing data set: focus on standardization, repeatability
          o Establish continued funding, good home for project
          o “Center” model: rather than data->scientists, scientists->data
  • 20. Animated maps: tracing bursts of scientific activity
  • 21. Coordinated bursts
    [Figure: bursts labelled 1, 2 and 3]
  • 22. MESUR Mapping and ranking services
  • 23. MESUR Mapping and ranking services
  • 24. MESUR Mapping and ranking services
  • 25. MESUR: the good ...
    After 3 years of MESUR:
      • Scientific exploration of metrics for scholarly evaluation
      • Creation of large-scale reference data set
      • Mapping science from the viewpoint of users: there is structure!
      • Variety of metrics that cover various aspects of scholarly impact and prestige
      • MESUR dataset contains many more pearls for future research
      • Foundation for future continued research program:
          o Longitudinal studies
          o Models of collective behavior of scientists
  • 26. MESUR: the bad and the ugly …
    Scalability of the approach:
      • Lengthy negotiations to obtain log data
      • No infrastructure standards (yet): recording, aggregating, normalization, ingestion, de-duplication, …
      • No generally accepted policies: privacy, property, …
      • No census data: when is a sample large and representative enough?
    Quality control:
      • Bots, crawlers (detectable, but detection is never perfect)
      • Cheating, manipulation (easier with usage statistics than network metrics)
    Acceptance:
      • Network-based usage metrics require session information. This is overlooked! As a result, will we end up with usage-based statistics only?
      • “As simple as possible, but not more simple!”
  • 27. Moving towards community involvement
    “Registration is now open for "Scholarly Evaluation Metrics: Opportunities and Challenges", a one-day NSF-funded workshop that will take place in the Renaissance Washington DC Hotel on Wednesday, December 16th 2009. Participation in this workshop is limited to 50 people. Registration is free at http://informatics.indiana.edu/scholmet09/registration.html.
    The topic of the workshop is the future of scholarly assessment approaches, including organizational, infrastructural, and community issues. The overall goal is to identify requirements for novel assessment approaches, several of which have been proposed in recent years, to become acceptable to community stakeholders including scholars, academic and research institutions, and funding agencies. The impressive group of speakers and panelists for the workshop includes representatives from each of these constituencies.
    Further details are available at http://informatics.indiana.edu/scholmet09/announcement.html
    Workshop organizers: Johan Bollen (jbollen@indiana.edu), Herbert Van de Sompel (hvdsomp@gmail.com) and Ying Ding (dingying@indiana.edu)”
  • 28. Moving towards community involvement
    Planning process underway to establish sustainable, open, community-supported infrastructure. New support from the Andrew W. Mellon Foundation to figure it all out.
    Logistics:
      • Data aggregation
      • Normalization
      • Data-related services
      • Data management
    Science:
      • Metrics
      • Analysis
      • Prediction
    Services:
      • Ranking
      • Assessment
      • Mapping
    = More than sum of parts:
      • Each component supports the other
      • Various business and funding models
      • Generate added value on all levels
    Can fundamentally change scholarly communication
  • 29. Some relevant publications
      • Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS ONE, March 2009 (In Press).
      • Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Ryan Chute. A principal component analysis of 39 scientific impact measures. arXiv.org/abs/0902.2183
      • Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030).
      • Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008.
      • Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007.
      • Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)
      • Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
      • Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
      • Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.
  • 30. Presentation structure
      1. MESUR’s Usage reference data set
      2. Mapping scientific activity
      3. Metrics survey
      4. Future research
      5. Discussion