
CrossRef Text and Data Mining

After a successful pilot under the name "Prospect," CrossRef will provide a means for publishers to simplify text and data mining access for researchers. Both researchers and publishers will benefit from support of standard APIs and data representations to enable text and data mining across open access and subscription-based publishers, and this is what CrossRef is aiming to provide. This webinar was held on October 28, 2014.

Published in: Business, Technology, Education

  1. 1. Rachael Lammey Product Manager, CrossRef 28 October 2014
  2. 2. Not-for-profit association of scholarly publishers All subjects, all business models 4,000+ organizations from all over the world 83 non-publisher affiliates, 2000 library affiliates 68 million content items
  3. 3. 10.1098/rstl.1665.0001
  4. 4. User clicks on CrossRef DOI reference link in Journal A Tani, N., N. Tomaru, M. Araki, AND K. Ohba. 1996. Genetic diversity and differentiation in populations of Japanese stone pine (Pinus pumila) in Japan. Canadian Journal of Forest Research 26: 1454–1462.[CrossRef] DOI directory returns URL User accesses cited article in Journal B
  5. 5. 90,000,000
  6. 6. Services • Cross-publisher reference linking • Cross-publisher Cited-by linking • Cross-publisher metadata feeds • Cross-publisher plagiarism screening • Cross-publisher update identification • Cross-publisher funder identification • Cross-publisher text and data mining Powered by iThenticate
  7. 7. A Text and Data Mining Hub for Researchers
  8. 8. What is text and data mining? Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. http://blogs.plos.org/everyone/2013/04/17/announcing-the-plos-text-mining-collection/ It uses powerful computers to find links between drugs and side effects, or genes and diseases, that are hidden within the vast scientific literature. These are discoveries that a person scouring through papers one by one may never notice. http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden
  9. 9. http://www.jisc.ac.uk/media/documents/publications/textminingbp_rtf.rtf Marc Weeber and colleagues used automated text mining tools to infer that the drug thalidomide could treat several diseases it had not been associated with before. Thalidomide was taken off the market 40 years ago, but is still the subject of research because it seems to benefit leprosy patients via their immune systems. Weeber and Grietje Molema, an immunologist, used text mining tools to search the literature for papers on thalidomide and then pick out those containing concepts related to immunology. One concept, concerning thalidomide’s ability to inhibit Interleukin-12 (IL-12), a chemical involved in the launch of an immune response, struck Molema as particularly interesting. A second automated search for diseases that improve when the action of IL-12 is blocked, revealed several not previously linked with thalidomide, including chronic hepatitis, myasthenia gravis and a type of gastritis. “Type in thalidomide and you get 2-3000 hits. Type in disease and you get 40,000 hits. With automated text mining tools we only had to read 100-200 abstracts and 20 or 30 full papers. We’ve created hypotheses for others to follow up” says Weeber. Weeber et al. J Am Med Inform Assoc. 2003 10 252-259
  10. 10. http://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/
  11. 11. Why? • Researchers find it impractical to negotiate multiple bilateral agreements with hundreds of subscription-based publishers in order to authorize TDM of subscribed content. • Subscription-based publishers find it impractical to negotiate multiple bilateral agreements with thousands of researchers and institutions in order to authorize TDM of subscribed content. • All parties would benefit from support of standard APIs and data representations in order to enable TDM across both open access and subscription-based publishers.
  12. 12. * Chinese Geoscience Union * Chinese Institute Of Automation Engineers (Ciae) * Chinese Journal Of Mechanical Engineering * Chinese Mathematical Society * Chinese Physical Society * Chinese Physiological Society * Chinese Society Of Theoretical And Applied Mechanics * Chonnam National University Medical School (Kamje) * Christ University Bangalore * Cic Edizioni Internazionali * Cig Media Group * Cilip Information Literacy Group * Civil-Comp, Ltd. * Claremont Colleges Library * Classical Association Of The Middle West And South, Inc. (Camws) * Clawar Association Limited * Clay Minerals Society * Cleo Revues.Org * Cleveland Clinic Journal Of Medicine * Clinical Autonomic Research Society * Clinical Laboratory Publications * Clinics Cardive Publishing * Clockss Archive * Cnps * Cnrs France * Cnu Journal Of Agricultural Science
  13. 13. Why use the DOI? Using the DOI as the basis for a common text and data mining API provides several benefits. For example, the DOI provides: • An easy way to de-duplicate documents that may be found on several sites. • Persistent provenance information. • An easy way to document, share and compare corpora without having to exchange the actual documents. • A mechanism to ensure the reproducibility of TDM results using the source documents. • A mechanism to track the impact of updates, corrections, retractions and withdrawals on corpora.
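
The de-duplication and corpus-sharing points on slide 13 can be illustrated with a minimal Python sketch. The helper names (`normalize_doi`, `deduplicate`, `corpus_manifest`) and the mirror URL are hypothetical and do not come from the webinar; the only facts relied on are that DOI names are case-insensitive and are often written as `doi:` strings or resolver URLs.

```python
from urllib.parse import unquote

# Common ways a DOI is written in the wild (assumed list, not exhaustive).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Reduce a DOI or DOI URL to a bare, lower-cased DOI string."""
    doi = unquote(raw.strip())
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOI names are case-insensitive, so lower-casing is a safe canonical form.
    return doi.lower()


def deduplicate(records):
    """Keep the first location seen for each DOI.

    `records` is an iterable of (doi, location) pairs, where the same
    article may appear under several locations.
    """
    seen = {}
    for doi, location in records:
        seen.setdefault(normalize_doi(doi), location)
    return seen


def corpus_manifest(records):
    """Return a sorted DOI list: enough to share or compare a corpus
    without exchanging the documents themselves."""
    return sorted(deduplicate(records))
```

Two research groups could compare their corpora by comparing such manifests, for example with `set(manifest_a) ^ set(manifest_b)`.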
  14. 14. The TDM Workflow
  15. 15. Researchers: Common API
  16. 16. DOI Content Negotiation
  17. 17. http://dx.doi.org/10.5555/12345678 (Accept: text/html)
  18. 18. http://dx.doi.org/10.5555/12345678 (Accept: application/bibjson+json)
  19. 19. Rate Limiting (Optional)
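
DOI content negotiation, as shown on slides 16 to 18, means requesting the same DOI URL with a different `Accept` header to obtain a different representation. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the DOI is the one from the slides written with the usual prefix/suffix slash, and whether a given media type is honoured depends on the resolver and the publisher.

```python
import urllib.request

RESOLVER = "http://dx.doi.org/"  # resolver URL as shown on the slides


def build_doi_request(doi: str, accept: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiated request for a DOI."""
    return urllib.request.Request(RESOLVER + doi, headers={"Accept": accept})


def fetch_representation(doi: str, accept: str, timeout: float = 30.0):
    """Send the request and return (content_type, body_bytes).

    Redirects from the resolver are followed automatically by urllib.
    """
    request = build_doi_request(doi, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type"), response.read()


if __name__ == "__main__":
    # Same DOI, two representations: a landing page and machine-readable metadata.
    for media_type in ("text/html", "application/bibjson+json"):
        content_type, body = fetch_representation("10.5555/12345678", media_type)
        print(media_type, "->", content_type, len(body), "bytes")
```

Separating request construction from sending keeps the negotiation logic easy to inspect without network access.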
  20. 20. CrossRef TDM HTTP Headers • CR-TDM-Rate-Limit: 1500 (the ceiling on requests per window) • CR-TDM-Rate-Limit-Remaining: 1387 (the number of requests left in the current window) • CR-TDM-Rate-Limit-Reset: 1378072800 (the time, in UTC epoch seconds, at which the rate limit resets and a new window starts) *This technique is used by many APIs, including Twitter’s.
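
A TDM client can use the three headers on slide 20 to pace itself. The sketch below is one possible reading, not a reference implementation. It assumes the reset value is an absolute UTC epoch timestamp (which the example value 1378072800 suggests) and that a publisher may omit the headers entirely because rate limiting is optional. The function names are hypothetical.

```python
import time


def parse_rate_limit(headers):
    """Extract the CR-TDM rate-limit values from a header mapping.

    Returns None when the headers are missing or malformed, since
    publishers are not required to send them.
    """
    try:
        return {
            "limit": int(headers["CR-TDM-Rate-Limit"]),
            "remaining": int(headers["CR-TDM-Rate-Limit-Remaining"]),
            "reset": int(headers["CR-TDM-Rate-Limit-Reset"]),
        }
    except (KeyError, ValueError):
        return None


def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next request.

    Zero if requests remain in the window or no limit is advertised;
    otherwise the time left until the window resets.
    """
    info = parse_rate_limit(headers)
    if info is None or info["remaining"] > 0:
        return 0.0
    current = time.time() if now is None else now
    return max(0.0, float(info["reset"]) - current)
```

A download loop would call `time.sleep(seconds_to_wait(response.headers))` after each response. Note that a plain `dict` is case-sensitive, whereas the header objects returned by `urllib` and similar clients match header names case-insensitively.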
  21. 21. Common API Summary • Content Negotiation (Required) • New Metadata (Required) • Full text URIs • License URIs • Rate Limiting Headers (optional)
  22. 22. New Metadata
  23. 23. 1. Full Text Link https://apps.crossref.org/docs/tdm/full-text-uris-technical-details/
  24. 24. 2. License Information https://apps.crossref.org/docs/tdm/license-uris-technical-details/
  25. 25. Example from Hindawi
      <ai:program name="AccessIndicators">
        <ai:license_ref>http://creativecommons.org/licenses/by/3.0/</ai:license_ref>
      </ai:program>
      <doi_data>
        <doi>10.1155/2014/969265</doi>
        <timestamp>20140401090031</timestamp>
        <resource>http://www.hindawi.com/journals/aaa/2014/969265/</resource>
        <collection property="text-mining">
          <item>
            <resource mime_type="application/pdf">http://downloads.hindawi.com/journals/aaa/2014/969265.pdf</resource>
          </item>
          <item>
            <resource mime_type="application/xml">http://downloads.hindawi.com/journals/aaa/2014/969265.xml</resource>
          </item>
        </collection>
      </doi_data>
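
To show how a TDM tool might read the Hindawi deposit on slide 25, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Several things are assumptions made only so the fragment parses on its own: the `<record>` wrapper element, the namespace URI bound to the `ai:` prefix, and the absence of a default namespace on the other elements (real deposit XML is likely to be namespaced differently). The function name `extract_tdm_metadata` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the ai: prefix; the slide does not show it.
AI_NS = "http://www.crossref.org/AccessIndicators.xsd"

# The slide's fragment, wrapped in a hypothetical <record> root so it is well-formed.
SAMPLE = f"""
<record xmlns:ai="{AI_NS}">
  <ai:program name="AccessIndicators">
    <ai:license_ref>http://creativecommons.org/licenses/by/3.0/</ai:license_ref>
  </ai:program>
  <doi_data>
    <doi>10.1155/2014/969265</doi>
    <timestamp>20140401090031</timestamp>
    <resource>http://www.hindawi.com/journals/aaa/2014/969265/</resource>
    <collection property="text-mining">
      <item>
        <resource mime_type="application/pdf">
          http://downloads.hindawi.com/journals/aaa/2014/969265.pdf
        </resource>
      </item>
      <item>
        <resource mime_type="application/xml">
          http://downloads.hindawi.com/journals/aaa/2014/969265.xml
        </resource>
      </item>
    </collection>
  </doi_data>
</record>
"""


def extract_tdm_metadata(xml_text: str) -> dict:
    """Return the DOI, license URIs and full-text links keyed by MIME type."""
    root = ET.fromstring(xml_text)
    licenses = [el.text.strip() for el in root.iter(f"{{{AI_NS}}}license_ref")]
    full_text = {}
    for collection in root.iter("collection"):
        # Only the text-mining collection carries the TDM full-text links.
        if collection.get("property") != "text-mining":
            continue
        for resource in collection.iter("resource"):
            full_text[resource.get("mime_type")] = resource.text.strip()
    return {
        "doi": root.findtext(".//doi"),
        "licenses": licenses,
        "full_text": full_text,
    }


if __name__ == "__main__":
    print(extract_tdm_metadata(SAMPLE))
```

Keying the links by MIME type lets a tool prefer the XML full text when it exists and fall back to the PDF otherwise.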
  26. 26. Stop here if • You are an open access publisher • You include TDM as a part of your subscription license/T&Cs.
  27. 27. Click-Through Service (Optional)
  28. 28. Extended TDM Workflow
  29. 29. Researcher View
  30. 30. Publisher View
  31. 31. Researcher queries DOI using CN + API token Publisher verifies API token If token verified AND access control allows, publisher returns full text (frequency at publisher discretion)
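
The exchange on slide 31 can be sketched from both sides. This is an illustration under stated assumptions, not the click-through service's actual interface: the header name `CR-Clickthrough-Client-Token` is an assumption that should be checked against the CrossRef TDM documentation, the token value is a placeholder, and the HTTP status codes chosen for the publisher's decision are illustrative only.

```python
import urllib.request

# Assumed header name for the researcher's click-through API token.
TOKEN_HEADER = "CR-Clickthrough-Client-Token"


def build_full_text_request(full_text_url: str, api_token: str) -> urllib.request.Request:
    """Researcher side: build (but do not send) a full-text request carrying the token."""
    return urllib.request.Request(full_text_url, headers={TOKEN_HEADER: api_token})


def publisher_decision(token_verified: bool, access_allowed: bool) -> int:
    """Publisher side: return an HTTP status for the request.

    Full text is returned only if the token is verified AND the
    publisher's own access control allows it, as on the slide.
    The specific status codes are an illustrative assumption.
    """
    if not token_verified:
        return 401  # token missing, unknown, or licenses not accepted
    if not access_allowed:
        return 403  # valid token, but no entitlement to this content
    return 200  # serve the full text


if __name__ == "__main__":
    request = build_full_text_request(
        "http://downloads.hindawi.com/journals/aaa/2014/969265.xml",
        "EXAMPLE-TOKEN",  # placeholder, not a real token
    )
    print(request.header_items())
    print(publisher_decision(token_verified=True, access_allowed=True))
```

The token identifies the researcher and the licenses they have accepted; it does not replace the publisher's existing access control, which is why both checks must pass.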
  32. 32. Benefits • Streamlines researcher access to distributed full text for TDM • Enables machine-to-machine, automated access for recognized TDM (i.e. researchers won’t be locked out of publisher sites) • Enables article-level licensing info and easy mechanism for supplemental T&Cs for text and data mining (publishers discussing model license via STM)
  33. 33. What do researchers, publishers and tools developers need to do?
  34. 34. Publishers There are two additional metadata elements that publishers will need to deposit to support TDM via CrossRef. These are: •Full Text URIs: One or more URIs that point to full text representations of the content identified by your CrossRef DOIs. •License URIs: One or more URIs pointing at licenses that govern how the full text content can be used. •OPTIONAL: Add publisher TDM terms and conditions to the click-through service
  35. 35. Researchers • Modify TDM tools to make use of the API token • Modify TDM tools to look for <lic_ref> elements • Register with the click-through service and accept/decline licenses (if applicable)
  36. 36. http://tdmsupport.crossref.org/
  37. 37. Progress to date • DOI content negotiation • CrossRef support for recording links to full text • CrossRef metadata support for ORCIDs, FundRef and license information • CrossRef Metadata Search for Discovery: http://search.labs.crossref.org/ • Click-through license service • Publisher API for verifying and managing tokens • Launched as live service 29th May 2014
  38. 38. Publishers Articles with full-text links and license information deposited: 998,416 Cost? Free to researchers and the public No cost for publishers through 2014, 2015 tbc Register interest at: http://www.crossref.org/tdm/contact_form.html
  39. 39. Usable as is: https://blogs.nd.edu/emorgan/
  40. 40. www.crossref.org http://www.crossref.org/tdm/index.html tdm@crossref.org
