
Data for Science: How Elsevier is using data science to empower researchers



Each month 12 million people use Elsevier’s ScienceDirect platform. The Mendeley social network has 4.6 million registered users. 3500 institutions use ClinicalKey to bring the latest medical research to doctors and nurses. How can we help these users be more effective? In this talk, I give an overview of how Elsevier is employing data science to improve its services, from recommendation systems to natural language processing and analytics. While data science is changing how Elsevier serves researchers, it’s also changing research practice itself. In that context, I discuss the impact that large amounts of open research data are having and the challenges researchers face in making use of that data, in particular in terms of data integration and reuse. We are just beginning to see how technology and data are changing science, and correspondingly how best to empower those who practice it.


  1. DATA FOR SCIENCE: HOW ELSEVIER IS USING DATA SCIENCE TO EMPOWER RESEARCHERS Paul Groth | @pgroth | pgroth.com Disruptive Technology Director Elsevier Labs | @elsevierlabs European Data Forum 2016
  2. 12 million people per month
  3. 40 million reactions 75 million compounds 500 million facts
  4. 3 EXAMPLES • Personalized: what should I read? • Actionable: who should I collaborate with? • Consumable: how do I make my data available?
  5. RECOMMENDATIONS AT MENDELEY • Maya Hristakeva • Data Scientist at Mendeley • @mayahhf • Spark Summit 2015 • http://www.slideshare.net/SparkSummit/sparking-science-up-with-research-recommendations-by-maya-hristakeva
  6. Read & Organize Search & Discover Collaborate & Network Experiment & Synthesize MENDELEY BUILDS TOOLS TO HELP RESEARCHERS …
  7. BEING THE BEST RESEARCHER YOU CAN BE! • Good researchers are on top of their game • Large amount of research produced • Takes time to get what you need • Help researchers by recommending relevant research
  8. PERSONALIZED ARTICLE RECOMMENDATION Input: User libraries Output: Suggested articles to read Algorithms: • Collaborative Filtering – Item-based – User-Based – Matrix Factorization • Content-based
  9. [Chart: quadrants labelled Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad. Plotted: Tuned IB Mahout; Tuned UB Mahout; Tuned UB Spark; Tuned IB Spark; UB DimSum Spark MLlib; ALS Matrix Fact. Spark MLlib. Labels: Performance, +100%, +150%, ~$50]
  10. CALCULATING 75 TRILLION METRICS • Benchmark 4600 institutions & 220 countries updated weekly • 40 terabytes of data • HPCC massively parallel compute system – 40 node system
  11. ALL DATA ISN’T CURATED
  12. 60% OF TIME IS SPENT ON DATA PREPARATION
  13. 10 ASPECTS OF HIGHLY EFFECTIVE RESEARCH DATA https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
  14. http://data.mendeley.com/ Each dataset receives a versioned DOI, so it can be cited The citation for the associated article is displayed
  15. ACADEMIC COLLABORATIONS
  16. CONCLUSION • Researchers are faced with an ever growing amount of data and content • Data Science is key to making systems that help them • I’ve shown three Elsevier examples. Many more! • Antonio Gulli’s codingplayground.blogspot.nl • labs.elsevier.com • Of course, we’re hiring. Contact: Paul Groth @pgroth
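
The personalized article recommendation slide lists item-based collaborative filtering with user libraries as input and suggested articles as output. A minimal sketch of that idea follows, assuming libraries are plain sets of article IDs and that similarity is cosine over co-occurrence counts. The function names, the toy data, and the similarity choice are illustrative assumptions, not Mendeley's production code.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(libraries):
    """Cosine similarity between articles, from co-occurrence in user libraries."""
    item_counts = defaultdict(int)   # how many libraries contain each article
    pair_counts = defaultdict(int)   # how many libraries contain both articles
    for articles in libraries.values():
        for a in articles:
            item_counts[a] += 1
        for a, b in combinations(sorted(articles), 2):
            pair_counts[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        s = n / sqrt(item_counts[a] * item_counts[b])
        sims[a][b] = s
        sims[b][a] = s
    return sims


def recommend_item_based(user, libraries, sims, k=3):
    """Score unseen articles by summed similarity to the user's own articles."""
    owned = libraries[user]
    scores = defaultdict(float)
    for a in owned:
        for b, s in sims.get(a, {}).items():
            if b not in owned:
                scores[b] += s
    # Highest score first; ties broken by article ID for a stable order.
    return sorted(scores, key=lambda x: (-scores[x], x))[:k]


# Hypothetical toy data: user -> set of article IDs in their library.
toy_libraries = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "carol": {"B", "C", "D"},
    "dave": {"A", "B"},
}

if __name__ == "__main__":
    toy_sims = item_similarities(toy_libraries)
    print(recommend_item_based("alice", toy_libraries, toy_sims))
```

At the scale the talk describes, the pairwise counting step is the expensive part, which is presumably why the slides compare distributed implementations rather than an in-memory loop like this one.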

Editor's Notes

  • 1.8 million unique authors worldwide submitted 1.3 million manuscripts to Elsevier journals
  • 40 million reactions
    75 million compounds
    500 million experimental facts
  • At Mendeley we build tools to help researchers organise and read research articles, collaborate and connect with other researchers, search and discover new research articles, etc. 
  • 815 million articles
  • “Mendeley Suggest” is our personalised article recommender. It is based on what users have in their libraries, and recommends other related articles. 
  • Calculate for over 4 million users

    We are building a personalised article recommender based on what users read. The input is the users’ libraries and the output is a list of articles they may want to add to their library and read. There are a number of different algorithms we can use to generate the recommendations (content-based, collaborative filtering), and in this talk we’ll focus on three types of collaborative filtering algorithms (user-based and item-based, as well as matrix factorisation).
  • To sum up, we now have a Spark implementation of our production UB CF algorithm which performs well and is a lot simpler to maintain and extend. There are still a few areas where we can tune and optimise further, which could make it faster and bring bigger gains from using Spark. Depending on your data, different algorithms might work better, so do experiment.
  • http://www.tamr.com/piketty-revisited-improving-economics-data-science/
  • NASA, A.40 Computational Modeling Algorithms and Cyberinfrastructure, tech. report, NASA, 19 Dec. 2011
  • Data engineering pipelines
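
The notes above mention a user-based (UB) collaborative filtering algorithm driven by user libraries. A minimal in-memory sketch of the user-based variant is shown here, assuming Jaccard overlap between libraries as the user similarity and a fixed number of nearest neighbours. These choices, the function names, and the toy data are assumptions for illustration; the notes do not say which similarity measure the production system uses.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two libraries: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(user, libraries, n_neighbours=2, k=3):
    """Recommend articles held by the most similar users but not by `user`."""
    owned = libraries[user]
    # Rank every other user by similarity to the target user.
    neighbours = sorted(
        ((jaccard(owned, lib), other)
         for other, lib in libraries.items() if other != user),
        reverse=True,
    )[:n_neighbours]

    scores = defaultdict(float)
    for sim, other in neighbours:
        if sim > 0:  # ignore users with no overlap at all
            for article in libraries[other] - owned:
                scores[article] += sim
    # Highest score first; ties broken by article ID for a stable order.
    return sorted(scores, key=lambda x: (-scores[x], x))[:k]


# Hypothetical toy data: user -> set of article IDs in their library.
toy_libraries = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "carol": {"B", "C", "D"},
    "dave": {"A", "B"},
    "erin": {"X", "Y"},
}

if __name__ == "__main__":
    print(recommend_user_based("alice", toy_libraries))
```

Comparing every pair of users like this does not scale to millions of libraries, which is where a distributed implementation such as the Spark one described in the notes would come in.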
