Case Study: Public Library of Science Thesaurus: Year One

Presented by Rachel Drysdale of PLOS at the 10th annual Data Harmony Users Group meeting on Tuesday, February 11, 2014. The talk covers how PLOS built its new thesaurus and integrated it into the PLOS journals workflow and publication platform: constructing the thesaurus, creating channels for feedback and updates, building new current awareness and discovery tools, and gathering data for article-level metrics and web site analytics, through to today’s PLOS websites and services.

  1. 1. The PLOS Thesaurus: the first year. Rachel Drysdale, Taxonomy Manager, PLOS. DHUG 2014, 11th February 2014
  2. 2. Public Library of Science - evolution
     • 2000: PLOS founded
     • 2003: PLOS Biology
     • 2004: PLOS Medicine
     • 2005: PLOS Computational Biology (June), PLOS Genetics (July), PLOS Pathogens (September)
     • 2006: PLOS ONE
     • 2007: PLOS Neglected Tropical Diseases
  3. 3. Journal Article Count
     Journal                    Article count
     PLOS Biology                       3,450
     PLOS Medicine                      2,626
     PLOS Computational Biology         3,112
     PLOS Genetics                      4,048
     PLOS Pathogens                     3,639
     PLOS ONE                          87,296
     PLOS Neglected Tropical Diseases   2,444
  4. 4. Journal Article Count (the same counts as on the previous slide), now captioned: beautiful monster….
  5. 5. Overview – today’s talk
     The Solution: Good Thesaurus + Machine Aided Indexing
     • Building the new Thesaurus with AI (Access Innovations)
     • The initial implementation at plos.org
     • MAIstro integration into Publishing workflow
     • Thesaurus maintenance
     The Service:
     • Content Discovery
     • Article Analysis
     • Relative Metrics
  6. 6. Starting point 2011 – the old Taxonomy
     • Inadequate in content – just over 3,100 specific terms
     • Inflexible in structure – terms in pre-defined paths
     • Housed in Editorial Manager – ossified and difficult to update
     • Author-chosen terms – association with article
  7. 7. PLOS delivered to Access Innovations….
     • A copy of the old PLOS Taxonomy
     • Over 2,000 suggested changes
     • “Research analysis and methods” branch request
     • Use cases:
       – Subject Area-based searches
       – Hierarchy-based exploration of our corpus
       – Email Alerts based on Subject Area searches
       – RSS Feeds based on Subject Areas
  8. 8. Access Innovations added:
     • STEM vocabulary
     • Broader/Narrower term relationships
     • Rules for the Machine Aided Indexing
     • Synonyms
     • Analysis with respect to the PLOS corpus
     .....to and fro with PLOS ….
     Result: Vastly improved NISO Z39.19-compliant thesaurus
  9. 9. Statistics
                Old Taxonomy   A. I. Thesaurus
     Terms             3,132            10,156
     Synonyms              0             3,291
     Tiers                 5                 7
     Rules                 0            14,798
  10. 10. Top-level Terms
      1. Biology and life sciences
      2. Computer and information sciences
      3. Earth sciences
      4. Engineering and technology
      5. Environmental sciences and ecology
      6. Medicine and health sciences
      7. Physical sciences
      8. Research and analysis methods
      9. Science policy
      10. Social sciences
  11. 11. Infrastructure
      • PLOS Taxonomy server: Thesaurus – plos2012thes; Data Harmony Thesaurus Master and MAI Rule Builder
      • Corpus fed to the Taxonomy Server for MAI, article by article
      • Initial implementation: Title – Abstract – Results – Methods; top 8 hits selected
  12. 12. Elapsed time from project kick-off until terms appeared on published articles: 9 months
  13. 13. Learning curve – teething troubles
      • Not all articles had Subject Area terms – why not?
      • Initial implementation – text to index: Title + Abstract* + Results + Methods
      • Upon consideration – text to index: Full Text (though not references)
      • Implementation of “all paths”
      • Polyhierarchy implications
  14. 14. The polyhierarchy and Search. Consider “White blood cells”, which appears on four paths:
      • Biology and life sciences > Immunology > Immune cells > White blood cells
      • Medicine and health sciences > Immunology > Immune cells > White blood cells
      • Biology and life sciences > Cell biology > Cellular types > Animal cells > Blood cells > White blood cells
      • Biology and life sciences > Cell biology > Cellular types > Animal cells > Immune cells > White blood cells
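The “all paths” idea for a polyhierarchical term such as “White blood cells” can be sketched in Python. This is a minimal illustration, not PLOS or Data Harmony code: the `BROADER` map and the `all_paths` function are hypothetical names, and the map assumes that Immunology, Immune cells and White blood cells are each a single term with more than one broader term, which is one way the paths on the slide could arise.

```python
from typing import Dict, List

# Hypothetical broader-term map: each term lists its broader (parent) terms.
# Top-level terms have no parents. Term names are taken from the slide.
BROADER: Dict[str, List[str]] = {
    "Biology and life sciences": [],
    "Medicine and health sciences": [],
    "Immunology": ["Biology and life sciences", "Medicine and health sciences"],
    "Cell biology": ["Biology and life sciences"],
    "Cellular types": ["Cell biology"],
    "Animal cells": ["Cellular types"],
    "Blood cells": ["Animal cells"],
    "Immune cells": ["Immunology", "Animal cells"],
    "White blood cells": ["Immune cells", "Blood cells"],
}


def all_paths(term: str, broader: Dict[str, List[str]]) -> List[List[str]]:
    """Return every path from a top-level term down to `term`."""
    parents = broader.get(term, [])
    if not parents:
        # A term with no broader term is the root of its own path.
        return [[term]]
    paths: List[List[str]] = []
    for parent in parents:
        for path in all_paths(parent, broader):
            paths.append(path + [term])
    return paths


for path in all_paths("White blood cells", BROADER):
    print(" > ".join(path))
```

Storing every path, rather than one preferred path, means a search on any ancestor (for example “Immunology” or “Cell biology”) can still reach articles indexed with the narrower term.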
  15. 15. Establishing update cycle - articles:
      • Initial implementation: entire back-corpus indexed at once
      • New papers:
        – PLOS submits text to MAIstro at publication
        – MAI returns terms and term frequencies
        – PLOS stores terms in search engine
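A minimal Python sketch of how the terms and term frequencies returned for a new paper might be cut down to the top 8 hits before they are stored. The function name `top_terms`, the ranking by frequency, the alphabetical tie-break and the sample frequencies are all assumptions made for illustration; the slides do not say how MAIstro ranks hits or how ties are handled.

```python
from typing import Dict, List, Tuple


def top_terms(term_frequencies: Dict[str, int], limit: int = 8) -> List[Tuple[str, int]]:
    """Keep the `limit` most frequent terms.

    Ties are broken alphabetically (an assumption) so the result is deterministic.
    """
    ranked = sorted(term_frequencies.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Illustrative frequencies only; not real indexing output.
sample = {
    "White blood cells": 12,
    "Immune cells": 9,
    "T cells": 9,
    "Memory T cells": 7,
    "Immunology": 5,
    "Cell biology": 4,
    "Blood cells": 3,
    "Animal cells": 2,
    "Obesity": 1,
    "Monocotyledons": 1,
}

for term, count in top_terms(sample):
    print(f"{term}: {count}")
```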
  16. 16. Establishing update cycle - thesauri:
      • Separate instances (nerves):
        – Production server – plosthes.2013-6
        – Working version – plosthes.2013-7
      • When ready to release a new version: load onto test server – MAI corpus – index
      • Test:
        – new/changed/deleted terms
        – rule changes
        – structural changes
        – any implementation changes
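The first test item, new/changed/deleted terms, amounts to a comparison of the production and working versions. A minimal Python sketch follows, assuming each version can be exported as a mapping from term to its set of broader terms. The `diff_versions` function, that export shape and the sample data are hypothetical and are not taken from Data Harmony; the broader terms shown are invented purely to make the example run.

```python
from typing import Dict, List, Set

# Assumed export shape: term -> set of broader terms.
Thesaurus = Dict[str, Set[str]]


def diff_versions(old: Thesaurus, new: Thesaurus) -> Dict[str, List[str]]:
    """List terms that are new, deleted, or moved in the hierarchy."""
    return {
        "new": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        # "changed" here means the set of broader terms differs between versions.
        "changed": sorted(t for t in old.keys() & new.keys() if old[t] != new[t]),
    }


# Illustrative data only; not the actual contents of either version.
production: Thesaurus = {
    "T cells": {"White blood cells"},
    "Webs": {"Ecology"},
    "Pumas": {"Cats"},
}
working: Thesaurus = {
    "T cells": {"White blood cells"},
    "Memory T cells": {"T cells"},
    "Pumas": {"Mammals"},
}

report = diff_versions(production, working)
for kind, terms in report.items():
    print(kind, terms)
```

A report like this gives the tester a checklist of terms whose indexing results on the test corpus are worth inspecting first.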
  17. 17. Thesaurus updates – why?
      • More terms: Memory T cells, Monocotyledons
      • Errrm…: Report gene detection
      • What?: Webs
      • Hierarchy changes deemed desirable: Geographical locations, Organisms
      • (Un)Rule(y): snails, fabrication, pumas
  18. 18. Thesaurus updates – how?
  19. 19. Thesaurus updates – how?
  20. 20. Thesaurus updates – how?
  21. 21. Thesaurus updates – how?
  22. 22. Rule-Building in MAIstro – Pumas before...
  23. 23. Rule-Building in MAIstro – Pumas before... p53 upregulated modifier of apoptosis or
  24. 24. Rule-Building in MAIstro – Pumas after…
  26. 26. Thesaurus updates – prioritisation? Miss-hits and missed term reports:
      • Ourselves: article pages
      • Our readers:
        – in email complaints
        – on Twitter
        – in correspondence with our editorial staff
        – via Journal and Saved Search alerts
        – via article pages – Flagged Term reports
  28. 28. Things we learned – Thesaurus editorial
      • Tension: strict and rigorous taxonomy/ontology construction vs user utility
      • Abbreviations and Synonyms
      • Issues that continue to exercise us: T cells/Memory T cells; Obesity/Childhood obesity. When should we make both explicit?
      • Rule work – working to top 8
  29. 29. Building a new project - exports
  30. 30. Building a new project - import
  31. 31. Content Discovery How has having the thesaurus changed the way that users interact with PLOS web sites?
  32. 32. Problem: How to keep up? Solution: Current Awareness Tools
      • Journal alerts
      • Saved Searches
      • RSS feeds
      • Hierarchy exploration
  34. 34. Journal alerts
  35. 35. Journal alerts
  36. 36. Journal alerts
  37. 37. Journal alerts
  38. 38. Journal alerts
  39. 39. Saved search
  40. 40. Saved search
  41. 41. RSS feeds
  42. 42. RSS feeds
  43. 43. Hierarchy exploration
  44. 44. Hierarchy exploration
  45. 45. Hierarchy exploration
  46. 46. Hierarchy exploration
  47. 47. Hierarchy exploration
  48. 48. Hierarchy exploration
  49. 49. Relative Metrics
  50. 50. Relative Metrics: Defining a Paper’s Peer Group
      1. Group papers by Subject Area: accommodate multiple topics per paper
      2. Group papers by age: important for comparison of cumulative measures like total downloads or citations
      3. Determine norms for peer group: the average usage of each paper is compared with the median usage of its peer group
      More on Relative Metrics at: http://www.plosone.org/static/almInfo#relativeMetrics
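The three steps for defining a peer group can be sketched in Python. This is a minimal illustration under stated assumptions, not the PLOS ALM implementation: “age” is approximated by publication year, “usage” is a single number per paper, and the names `peer_group_medians` and `relative_usage` and all sample figures are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional, Tuple

PeerKey = Tuple[str, int]  # (subject area, publication year); year stands in for age


def peer_group_medians(papers: List[dict]) -> Dict[PeerKey, float]:
    """Steps 1 and 2: group by subject area and age. Step 3: take each group's median usage."""
    groups: Dict[PeerKey, List[float]] = defaultdict(list)
    for paper in papers:
        # A paper with several subject areas joins several peer groups.
        for area in paper["subject_areas"]:
            groups[(area, paper["year"])].append(paper["usage"])
    return {key: median(values) for key, values in groups.items()}


def relative_usage(paper: dict, medians: Dict[PeerKey, float]) -> Dict[str, Optional[float]]:
    """Ratio of a paper's usage to the median of each of its peer groups."""
    result: Dict[str, Optional[float]] = {}
    for area in paper["subject_areas"]:
        norm = medians[(area, paper["year"])]
        # Avoid dividing by zero when a peer group has no recorded usage.
        result[area] = paper["usage"] / norm if norm else None
    return result


# Illustrative figures only; not real PLOS usage data.
papers = [
    {"id": "A", "subject_areas": ["Immunology", "Cell biology"], "year": 2012, "usage": 400},
    {"id": "B", "subject_areas": ["Immunology"], "year": 2012, "usage": 200},
    {"id": "C", "subject_areas": ["Immunology"], "year": 2012, "usage": 100},
    {"id": "D", "subject_areas": ["Cell biology"], "year": 2012, "usage": 100},
    {"id": "E", "subject_areas": ["Immunology"], "year": 2013, "usage": 50},
]

medians = peer_group_medians(papers)
print(relative_usage(papers[0], medians))
```

Using the median rather than the mean as the group norm keeps a handful of very heavily used papers from distorting the comparison.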
  51. 51. Relative Metrics
  52. 52. Relative Metrics
  55. 55. Area of development - Editorial Workflow
  56. 56. The PLOS Thesaurus and Peer Review: maintaining a copy of the PLOS thesaurus in Editorial Manager helps with editor and reviewer matching (Classifications for People, Classifications for Papers)
  57. 57. The PLOS Thesaurus and Peer Review
      • Authors select Subject Area terms related to their article submissions
      • Editors and Reviewers select terms that represent their areas of expertise
      • Staff and Editors use these terms to help ensure editors and reviewers are well matched to the submissions they are handling
  58. 58. Planned Enhancements
      • Automate, with MAIstro, the application of terms associated with Editors, Reviewers and submitted articles
      • Provide Editors and Staff with detailed terms to assist with reviewer selection and vetting
        – Academic disciplines help Editors gauge Subject Area relevance of potential Reviewers
        – Methods, protocols and model organisms help Editors gauge technical suitability of potential Reviewers
  59. 59. Dramatis personae:
      • Jonas Dupuich, Product Manager
      • Patrick Polischuk, Product Manager
      • Sebastian Toomey, Interaction Designer
      • Jennifer Lin, Senior Product Manager
      • Martin Fenner, ALM Technical Lead
      • Kallie Huss, Senior Publications Assistant
      • John Chodacki, Director - Product Management