A Recommender Story: Improving Backend Data Quality While Reducing Costs

Information overload is one of the biggest challenges academics face every day when looking for the right knowledge to advance science. With around 7k research articles published every day, how do you find the right ones?

Elsevier is a global information analytics business that helps institutions and professionals advance healthcare, open science and improve performance. With many data sources and signals being available, data science and big data engineering provide the perfect opportunity to deliver more value to researchers.

Here we focus on Mendeley, an open (free of charge) academic content platform that helps researchers discover new information through features such as a crowdsourced collection of academic documents (the Catalogue) and various personalized recommender systems. Mendeley Suggest, the recommender system, helps millions of researchers worldwide find documents and people relevant to their research field that they did not yet know existed. The personalized recommenders are powered by the Mendeley Catalogue, which clusters 2 billion records into canonical records, together with state-of-the-art algorithms and big data solutions (e.g. Spark).

Over the past few years, we noticed that as our content grew, the quality of the canonical records started drifting because of scalability issues. As a result, we faced clustering accuracy problems which, in turn, also affected the recommenders. In this talk we highlight how we rearchitected the construction of the Mendeley Catalogue to improve its scalability and accuracy. We also show how migrating from Hadoop MapReduce to Spark helped us reduce costs and improve maintainability.



  1. Jacques Doux, Elsevier. A recommender story: Improving backend data quality while reducing costs #UnifiedDataAnalytics #SparkAISummit
  2. Elsevier in a nutshell. We are a Data and Analytics Company delivering solutions for Science and Health. • Modern Elsevier: born in 1880 in Amsterdam (https://www.elsevier.com/about/history) • The biggest academic publisher: 38k books, 3000 journals (~25% of ever cited content) • First publisher to have provided electronic versions of its content: from ADONIS (1979) to ScienceDirect (1997), now hosting ~17M full-text documents • Empowering decision support (https://www.elsevier.com/solutions) − Abstracting and indexing − Research & data management tools − Research evaluation and showcasing tools − Adaptive learning for health professionals − Clinical decision support − Discovery sandbox to combine Elsevier high quality data with your own proprietary data
  3. Mendeley helps academics stay on top • By providing solutions for − Reference management − Academic & research networking − Managing datasets − Finding academic job opportunities − Finding funding opportunities • Powered by − Data − Search and discovery tools
  4. Mendeley Suggest (MS) https://impactrs19.github.io/papers/paper5.pdf Desktop app, mobile, web, emails. 6.8M emails sent weekly with 3 recommendations
  5. How does Mendeley Suggest work? https://doi.org/10.1142/9789813275355_0018 • Custom implementation of user-based collaborative filtering • Significance weighting • Time decay • Impression discounting • Dithering • Data… lots of it • Content records (Catalogue) • User profiles • Ambient data
  6. Issues with the catalogue impact dependent systems. Under-merged: • Relevant, but already added a while back • Very “focused” recommendations! • Many obvious duplicates creeping up in search results… • … splitting metrics apart, impacting relevance ranking
  7. How it’s made: Catalogue. [Figure: Evolution of document records in Mendeley catalogue; x-axis: year of addition (2009 to 2019); y-axis: number of document records (0 M to 2,000 M). Timeline annotations: public birth of Mendeley; old algorithm inception; acquisition by Elsevier; publication of old deduplication algorithm; something needs to be done (>= 24h to run, too many visible issues); start project.] https://doi.org/10.1108/PROG-02-2015-0021
  8. Redesign without breaking: Goals 1. Improve overall user experience: better document disambiguation, i.e. a better classifier 2. Improve scalability and processing speed: it needs to work for > 2B records and run faster than it currently does 3. Improve code maintainability and ease of evolution: migrate from Hadoop MR in Java to Spark + Scala
  9. Improving the classifier: are those duplicates? Record pairs shown on the slide, with fields title | authors | doi | published_in | year (a dash marks an empty field):
     Pair 1
     a. Coseismic extension recorded within the damage zone of the Vado di Ferruccio Thrust Fault, Central Apennines, Italy | Harold Leah, Michele Fondriest, Alessio Lucca, Fabrizio Storti, Fabrizio Balsamo, Giulio Di Toro | 10.31223/osf.io/5y3pn | - | 2018
     b. Coseismic extension recorded within the damage zone of the Vado di Ferruccio Thrust Fault, Central Apennines, Italy | Harold Leah, Michele Fondriest, Alessio Lucca, Fabrizio Storti, Fabrizio Balsamo, Giulio Di Toro | 10.1016/j.jsg.2018.06.015 | Journal of Structural Geology | 2018
     Pair 2
     a. Determination of cefpirome concentrations in lung extracellular fluid of pigs by microdialysis and ….. | Croneberger, A.S., Kietzmann, M., Ehinger, A.M., Allan, M., Nuernberger, M.C. | 10.1111/j.1365-2885.2009.01090.x | Journal of Veterinary Pharmacology and Therapeutics | 2009
     b. Oral communications | - | 10.1111/j.1365-2885.2009.01090.x | Journal of Veterinary Pharmacology and Therapeutics | 2009
     Pair 3
     a. Médecine & Droit; The judge, the physician and the prisoner. A critical view of the “suspension de peine” for medical reason | Chassagne A, Godard-Marceau A | - | - | 2019
     b. Le juge, le médecin et le détenu. Regard critique sur la suspension de peine pour raison médicale | Chassagne A, Godard-Marceau A | 10.1016/j.meddro.2019.01.001 | Medecine et Droit | 2019
     Pair 4
     a. Hearing loss: Diagnosis and Management | John MLasak, Patrick Allen, Douglas Lewis | - | Primary Care Clinics in Office Practice | 2014
     b. Hearing loss: diagnosis and management. | John MLasak, Patrick Allen, Tim McVay, Douglas Lewis | - | Primary care | 2014
     Pair 5
     a. On the number and structure of sum-free sets in a segment of positive integers | K. G. Omelyanov,A. A. Sapozhenko | - | Discrete Mathematics and Applications | 2002
     b. On the number and structure of sum-free sets in a segment of positive integers | K. G. Omelyanov,A. A. Sapozhenko | - | Discrete Mathematics and Applications | 2003
     Pair 6
     a. Vertex-disjoint cycles containing specified edges in a bipartite graph | Guantao Chen, Hikoe Enomoto, Ken Ichi Kawarabayashi, Katsuhiro Ota… | - | Dicscrete Mathematics(Elsevier) | 2001
     b. Vertex-disjoint cycles containing specified vertices in a bipartite graph | Guantao Chen, Hikoe Enomoto, Ken Ichi Kawarabayashi, Katsuhiro Ota… | - | - | 2004
     Verdict labels on the slide: Duplicates; Duplicates BUT!; Not Duplicates; Duplicates; Not Duplicates; Not Duplicates BUT!
  10. Improving the classifier: What have we done? • Find document pairs at the decision boundary • Bootstrap with previous training set • Engineer relevant features • Tune & train model • Benchmark • Deploy • Heuristic + manual classification
  11. Scalability issues: Reduce problem cardinality • Record exclusion • Exact duplicates detection and identifier-based duplicates detection • Tuneable blocking step: MinHash-LSH, and sub-blocking if needed • Divide classifier tasks by 2 by removing equal pairs: (A,B) = (B,A) • Data normalization
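  12. Performance. F1 score: • Old Dedup: 0.98 on the old data set (1M pairs), 0.60 on the new manually annotated data set (9.2k pairs) • New Dedup: 0.98 on the old data set, 0.96 on the new manually annotated data set • Verdict: same on the old data set, much better on the new one. Compute time: • Old: >24h • New: ~13h • Verdict: much better. If 3% error rate, over 2B => 60M misclassified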
  13. Lessons learned • Monitor your systems • Data matters! − Know your data! − “More data” vs “Good data” • Engineering − Keep it as simple as possible: if it works with a simpler model, don’t use a more complex one (e.g. Random Forest vs SVM with RBF kernel vs Neural Networks) − Test extensively, especially if production code will use a different language, different libraries or different library versions − With big data, hash collision is real • As a data scientist, work in tight collaboration with the engineers implementing production-grade code, and make sure things are feasible within their tech stack
  14. Come join us in solving problems https://www.elsevier.com/about/careers/technology-careers 7,500 employees, >1,000 technologists. Thank you
  15. Don’t forget to rate and review the sessions. Search Spark + AI Summit
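Slide 5 describes Mendeley Suggest as a custom user-based collaborative filter with significance weighting. The talk does not show that implementation, so the following is only a minimal sketch of the general idea. It assumes Jaccard similarity between user libraries and a linear significance weight capped at a hypothetical threshold of 5 shared documents; the function names, the threshold and the toy libraries are all illustrative.

```python
from collections import defaultdict


def similarity(lib_a, lib_b, threshold=5):
    """Jaccard similarity between two libraries, damped by significance weighting.

    Assumption: the weight grows linearly with the number of shared documents
    and is capped at 1 once `threshold` documents are shared.
    """
    overlap = len(lib_a & lib_b)
    if overlap == 0:
        return 0.0
    jaccard = overlap / len(lib_a | lib_b)
    # Significance weighting: trust similarities backed by few shared items less.
    return jaccard * min(overlap, threshold) / threshold


def recommend(user, libraries, top_n=3, threshold=5):
    """Score documents the user does not have by the weighted votes of similar users."""
    own = libraries[user]
    scores = defaultdict(float)
    for other, lib in libraries.items():
        if other == user:
            continue
        weight = similarity(own, lib, threshold)
        if weight == 0.0:
            continue
        for doc in lib - own:
            scores[doc] += weight
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:top_n]]


# Hypothetical toy libraries (user id -> set of document ids).
libraries = {
    "u1": {"d1", "d2", "d3"},
    "u2": {"d1", "d2", "d3", "d4"},
    "u3": {"d1", "d9"},
}
print(recommend("u1", libraries))
```

In this toy data, the neighbour sharing three documents outweighs the one sharing a single document, which is the effect significance weighting is meant to produce. Time decay, impression discounting and dithering from the same slide are not modelled here.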
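Slide 11 lists MinHash-LSH as the blocking step and notes that classifier work is halved by treating (A,B) and (B,A) as the same pair. The production pipeline runs on Spark and is not shown in the talk, so the following is a single-machine sketch in plain Python under stated assumptions: character 3-gram shingles, 32 hash functions split into 16 bands, and BLAKE2b as the seeded hash. All names and parameters are illustrative.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lower-cased, whitespace-normalized string."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(len(norm) - k + 1, 1))}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=32):
    """One minimum per hash function; similar sets agree on many positions."""
    return tuple(min(seeded_hash(seed, t) for t in tokens)
                 for seed in range(num_hashes))


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return unordered candidate pairs that share at least one LSH band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, title in records.items():
        sig = minhash_signature(shingles(title), num_hashes)
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        # Sorting before combining emits each pair once, as (smaller, larger),
        # so (A, B) and (B, A) are never both sent to the classifier.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Titles taken from the examples on slide 9; the ids are hypothetical.
records = {
    "a": "Hearing loss: Diagnosis and Management",
    "b": "Hearing loss: diagnosis and management.",
    "c": "Oral communications",
}
print(candidate_pairs(records))
```

Only pairs that land in the same bucket reach the classifier, which is how blocking reduces the problem cardinality; the number of bands and rows is the tuning knob that trades recall against the number of candidate pairs.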
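Slide 12 reports F1 scores and notes that a 3% error rate over 2B items means 60M misclassifications. The arithmetic can be checked with a short sketch. The confusion counts used for the F1 example are hypothetical and chosen only to land on 0.96; the talk does not give precision or recall.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def expected_misclassified(n_items, error_rate):
    """Expected number of wrong decisions at a given error rate."""
    return round(n_items * error_rate)


# 3% of 2 billion, as on the slide.
print(expected_misclassified(2_000_000_000, 0.03))

# Hypothetical counts: 96 true positives, 4 false positives, 4 false negatives.
print(f1_score(tp=96, fp=4, fn=4))
```

At this scale even a small error rate translates into tens of millions of wrong decisions, which is why the drop to 0.60 on the manually annotated set mattered.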
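Slide 13 warns that with big data, hash collision is real. A rough way to see why is the birthday approximation, sketched below for 2 billion keys. The choice of 32-bit and 64-bit hash widths is an assumption for illustration; the talk does not say which hash sizes were involved.

```python
import math


def expected_colliding_pairs(n_items, hash_bits):
    """Expected number of colliding pairs for a uniform hash: C(n, 2) / 2^bits."""
    return n_items * (n_items - 1) / 2 / 2 ** hash_bits


def collision_probability(n_items, hash_bits):
    """Birthday approximation of P(at least one collision): 1 - exp(-expected pairs)."""
    return -math.expm1(-expected_colliding_pairs(n_items, hash_bits))


n = 2_000_000_000  # roughly the catalogue size quoted in the talk
for bits in (32, 64):
    print(bits, expected_colliding_pairs(n, bits), collision_probability(n, bits))
```

Under these assumptions a 32-bit hash yields hundreds of millions of expected colliding pairs, while a 64-bit hash still leaves roughly a one-in-ten chance of at least one collision, so collisions need handling rather than ignoring.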
