
Elsevier Industry Talk - WSDM 2020



At Elsevier, a lot of effort is focussed on content discovery, allowing users to find the most relevant articles for their research. This, at its core, blurs the boundaries of search and recommendation, as we are both pushing content to the user and allowing them to search the world’s largest catalogue of scientific research. Apart from using the content as is, we can make new content more discoverable with the help of authors at submission time, for example by asking them to write an executive summary of their paper. However, collecting this at submission time means the additional information is not available for older content. This raises the question of how we can utilise the authors’ input on new content to create the same feature retrospectively for the whole Elsevier corpus.

Focusing on one use case, we discuss how an extractive summarization model, trained on the author-submitted summaries, is used to retrospectively generate executive summaries for articles in the catalogue. Further, we show how extractive summarization is used to highlight the salient points (methods, results and findings) within research articles across the complete corpus, helping users identify whether an article is of particular interest to them. As a logical next step, we investigate how these extractions can make research papers more discoverable by connecting them to other papers that share similar findings, methods or conclusions.

In this talk we start from the beginning: understanding what users want from summarization systems. We discuss how the proposed use cases were developed and how this ties into the discovery of new content. We then look in more technical detail at what data is available and which methods can be utilised to implement such a system. Finally, since we are working toward taking this extractive summarization system into production, we need to understand the quality of what is being produced before going live. We discuss how internal annotators were used to confirm the quality of the summaries. Monitoring does not stop there: we continually track user interaction with the extractive summaries as a proxy for quality and satisfaction.


Elsevier Industry Talk - WSDM 2020

  1. Making Content Discoverable: Automating Highlight Generation • Dr. Daniel Kershaw • February 2020 – WSDM, Houston
  2. Outline • About Elsevier • The Product Needs • Models • Validations • Results • Improvements • Production Validation
  3. To help customers meet those challenges, Elsevier combines content with technology to provide actionable knowledge (Content • Technology • Operational Excellence):
     • Chemistry database: 500m published experimental facts
     • User queries: 13m monthly users on ScienceDirect
     • Books: 35,000 published books
     • Drug database: 100% of drug information from pharmaceutical companies, updated daily
     • Research: 16% of the world’s research data and articles published by Elsevier
     • 1,000 technologists employed by Elsevier
     • Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events
     • Machine reading: 475m facts extracted from ScienceDirect
     • Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations
     • Semantic enhancement: knowledge on 50m chemicals captured as 11B facts
  4. Read • Search • Do – Cell Fundamentals, Gray‘s Anatomy, ScienceDirect, Scopus, ClinicalKey, Reaxys, Sherpath, Mendeley, Knovel. Elsevier combines content with technology to provide actionable knowledge.
  6. Product use cases, identified in concept testing:
     • UC1: Evaluate (relevance) – Is this paper relevant to me? What was their approach? What were the key novel findings?
     • UC2: Understand (navigate) – What is the context of this finding? Give me all the places where a specific finding is being discussed.
     • UC3: Discover more – What specifically was matched? What other papers share these findings? Are the results validated with other methods?
  7. Data Science Path:
     • Extract – extract key points from a document, e.g. main findings, methods and results
     • Connect – connect these to core locations within the document
     • Relate – find relations between extracted sentences across documents (OpenIE)
  8. (04.02.20) Can we use extractive summarization to find the key findings/points within a document? Our authors are the best writers.
  9. Available Data: Full Text
  10. Available Data: Title
  11. Available Data
  12. Available Data: Author Submitted Highlights
  13. Focus of text: Paper • Abstract • Author Highlights
  14. ROUGE Sampling (ROUGE-L-F)
     • Selected sentence: “In order to enhance the efficiency of the discovery of natural active constituents from plants, a bioactivity-guided cut CCC separation strategy was developed and used here to isolate LSD1 inhibitors from S. baicalensis Georgi.”
     • Highlight: “A bioactivity-guided cut CCC strategy was designed”
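The sampling step on slide 14 pairs each author highlight with the full-text sentence that best matches it under ROUGE-L F-score, producing labelled training data. A minimal pure-Python sketch of that idea follows; the function names, the greedy best-match strategy and the 0.5 threshold are illustrative assumptions, not the production pipeline.

```python
def lcs_len(a, b):
    # Longest common subsequence length via dynamic programming.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def rouge_l_f(highlight, sentence):
    # ROUGE-L F-score between a highlight and a candidate sentence.
    h, s = highlight.lower().split(), sentence.lower().split()
    lcs = lcs_len(h, s)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(s), lcs / len(h)
    return 2 * p * r / (p + r)

def label_sentences(highlights, sentences, threshold=0.5):
    # Mark a sentence positive if it is the best ROUGE-L match
    # for some highlight and clears the threshold.
    labels = [0] * len(sentences)
    for h in highlights:
        scores = [rouge_l_f(h, s) for s in sentences]
        best = max(range(len(sentences)), key=scores.__getitem__)
        if scores[best] >= threshold:
            labels[best] = 1
    return labels
```

On the slide's own example, the highlight “A bioactivity-guided cut CCC strategy was designed” would select the long CCC sentence over unrelated ones.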
  15. Initial Model – Sentence Classification • Section classification • Content overlap • Number of numbers • Sentence length (Collins, E., Augenstein, I., & Riedel, S. (2017). A Supervised Approach to Extractive Summarisation of Scientific Papers. CoRR, 195–205. http://doi.org/10.18653/v1/K17-1021)
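The four features listed on slide 15 could be computed roughly as below. This is a hedged sketch in the spirit of Collins et al. (2017): the `SECTIONS` vocabulary, the overlap-against-abstract definition and the one-hot encoding are assumptions, not the talk's actual feature code.

```python
import re

# Assumed canonical section labels for the one-hot encoding.
SECTIONS = ["introduction", "methods", "results", "conclusion"]

def sentence_features(sentence, section, abstract):
    # Hand-crafted features per candidate sentence:
    # section one-hot, token overlap with the abstract,
    # count of numeric tokens, and sentence length.
    tokens = sentence.lower().split()
    abstract_tokens = set(abstract.lower().split())
    overlap = sum(1 for t in tokens if t in abstract_tokens) / max(len(tokens), 1)
    n_numbers = len(re.findall(r"\d+(?:\.\d+)?", sentence))
    section_onehot = [1.0 if section == s else 0.0 for s in SECTIONS]
    return section_onehot + [overlap, float(n_numbers), float(len(tokens))]
```

Vectors like these would then feed a linear classifier or small MLP, as in the accuracy table later in the deck.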
  16. Sequential RNN (Kedzie, C., McKeown, K. R., & Daumé, H., III. (2018). Content Selection in Deep Learning Models of Summarization. EMNLP.)
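In the spirit of the sequential extractor of Kedzie et al. (2018), an LSTM reads one sentence embedding per step and emits a per-sentence extraction probability. The toy NumPy sketch below shows the shape of that computation only; the class name, dimensions, and random initialisation are all illustrative assumptions (the real model would be trained, e.g. in a deep learning framework).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SentenceExtractorRNN:
    # Toy single-layer LSTM over a document's sentence embeddings,
    # producing one extraction probability per sentence.
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_in + d_hid)
        self.W = rng.normal(0, scale, (4 * d_hid, d_in + d_hid))
        self.b = np.zeros(4 * d_hid)
        self.w_out = rng.normal(0, scale, d_hid)
        self.d_hid = d_hid

    def forward(self, X):
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        probs = []
        for x in X:  # one sentence embedding per time step
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            probs.append(sigmoid(self.w_out @ h))
        return np.array(probs)
```

Because selection is sequential, the probability for each sentence can depend on what came before it in the document, unlike the per-sentence classifier on the previous slide.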
  17. Accuracy Measures (but accuracy does not make a good summary):
      Model Name                            Test Accuracy
      LSTM                                  0.853
      Abstractnet Classifier                0.718
      Combined Linear Classifier            0.696
      Combined MLP Classifier               0.730
      Perceptron Features Abstract Vector   0.697
      Single Layer NN                       0.696
  18. ROUGE Values of Extractions
  19. “Human (Editor) in the loop” – working with editors from The Lancet & Elsevier (SMEs): 1. Ask editors to rate the output of the machine learning model. 2. Have multiple raters rate the same output. Framework used with the Lancet editors to evaluate computer-generated summaries/assertions: Selection • Ranking • Simplicity • Complete Set.
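When multiple raters judge the same outputs, agreement should be chance-corrected. The slide does not name an agreement statistic, so Cohen's kappa here is an assumption; this is a minimal sketch for two annotators rating the same items.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    # Chance-corrected agreement between two annotators
    # over the same set of items.
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    # Expected agreement if both annotators rated at random
    # with their own label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the editors agree well beyond chance; near 0 means their agreement is what chance alone would give.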
  21. How should we order sentences? Introduction – Methods – Results – Conclusion, or Methods – Results – Conclusion, or Methods – Results. Based on section classification.
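The section-based ordering above can be sketched as a sort over the extracted sentences using the predicted section and the sentence's position in the paper as keys. The tuple layout and the `SECTION_ORDER` map are illustrative assumptions.

```python
# Assumed canonical presentation order for predicted sections.
SECTION_ORDER = {"introduction": 0, "methods": 1, "results": 2, "conclusion": 3}

def order_extractions(extractions):
    # extractions: list of (sentence, predicted_section, position_in_paper).
    # Sort by canonical section order first, then by original position;
    # unknown sections sink to the end.
    return sorted(
        extractions,
        key=lambda e: (SECTION_ORDER.get(e[1], len(SECTION_ORDER)), e[2]),
    )
```

This keeps highlights readable as a miniature paper, whichever subset of sections the classifier found extractions in.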
  22. Ranking Preference Results (pairwise counts between rankers):
      Ranker      Ranker 1  Ranker 2  Ranker 3  Ranker 4  Total
      Ranker 1        –         9         7         4       20
      Ranker 2        5         –         6         8       19
      Ranker 3        9         8         –         7       24
      Ranker 4        4         5         7         –       16
      Total          18        22        20        19       79
  24. Example Extractions (Anshelevich, E., Bhardwaj, O., Elkind, E., Postl, J. and Skowron, P., 2018. Approximating optimal social choice under metric preferences. Artificial Intelligence, 264, pp. 27-51.)
     • We show how closely these rules, which only have access to ordinal preferences, approximate an optimal alternative with respect to the metric costs. We consider two general objective functions to quantify the quality of alternatives, and, for most of the rules we consider, we give distortion bounds with respect to both of these functions.
     • Our other objective function defines the quality of an alternative as the median of agent costs for that alternative: focusing on the median agent is quite common as well, as this reduces the impact of outliers, i.e., agents with very high or very low costs.
     • Our results show that, while some commonly used rules have high distortion, there are important voting rules for which distortion is bounded by a small constant or grows slowly with the number of alternatives.
     • We will now show upper bounds on the distortion of Plurality and Borda, which match the lower bounds for these rules given by Theorem 9, and an upper bound on the distortion of the Harmonic rule, which almost matches the respective lower bound.
     • Although these rules know nothing about the metric costs other than the ordinal preferences induced by them, and cannot possibly find the true optimal alternative, they nevertheless always select an alternative whose quality is only a factor of 5 away from optimal!
     • We prove that the distortion of every voting rule that always outputs a subset of the uncovered set [32] does not exceed 5; this upper bound holds both for the sum objective and for the median objective, though different techniques are required to prove it for the two cases.
  25. Simplification • Selected sentences are a tad too long • They contain irrelevant openings, e.g. “Furthermore” • Solution: split sentences on the first “,” and filter out common openings: thus, however, in summary, finally, in this study, moreover, in this work, furthermore, in addition, in conclusion, in this section, then, to the best of our knowledge, hence, in particular, additionally, also, second, first, as a result, specifically
  26. Simplification example – Original: “In the following work, we design lightweight authentication protocol for three tiers wireless body area network with wearable devices.” Simplified: “We design lightweight authentication protocol for three tiers wireless body area network with wearable devices.”
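The split-and-filter rule from the simplification slides fits in a few lines. `COMMON_OPENINGS` copies the slide's list (plus “in the following work”, taken from the worked example); the capitalisation fix-up on the remaining clause is an assumption.

```python
# Openings listed on the slide, plus the one from the worked example.
COMMON_OPENINGS = {
    "thus", "however", "in summary", "finally", "in this study", "moreover",
    "in this work", "furthermore", "in addition", "in conclusion",
    "in this section", "then", "to the best of our knowledge", "hence",
    "in particular", "additionally", "also", "second", "first",
    "as a result", "specifically", "in the following work",
}

def simplify(sentence):
    # Drop the clause before the first comma when it is a common opener,
    # then re-capitalise the remainder. Otherwise leave the sentence alone.
    head, sep, tail = sentence.partition(",")
    if sep and head.strip().lower() in COMMON_OPENINGS and tail.strip():
        rest = tail.strip()
        return rest[0].upper() + rest[1:]
    return sentence
```

A sentence whose first clause is substantive (not in the list) passes through unchanged, so the rule only trims boilerplate openers.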
  27. Production: Getting user feedback
  28.-31. [Product screenshots; slide 29: “Click”]
  32. Where the data also goes: Search • Editorial Systems
  33. Final Overview • Data science led by user needs • Multiple stages of evaluation • Each evaluation can influence the whole model • Value the original author’s writing • Relatively simple models give powerful results
  34. WE ARE HIRING (AMS & LDN) – Come speak to me • (Senior) Data Scientist (Research/ML)
