



Presented by Ralph LeVan, Senior Research Scientist, OCLC Research, as a lightning talk at Code4Lib Midwest at the University of Chicago Regenstein Library on 14 July 2016.

Published in: Education


  1. This is for ELM Ralph LeVan Sr. Research Scientist 7/14/2016 Code4Lib Midwest AutoSuggest
  2. Goals
     • Return records at keystroke speeds
     • Run on an underpowered Unix box
  3. Result
     • Precalculate a response record for every possible legitimate keystroke combination
     • Load those records into a Pears database and expose it via SRW
     • Client JavaScript takes keystrokes and turns them into queries to an AutoSuggest servlet
     • The thin gateway servlet takes queries, turns them into SRW requests, and passes through the record returned
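The gateway step described in slide 3 can be sketched as a function that maps the keystrokes typed so far to a single exact-match search request. This is a minimal sketch, not the servlet's actual code: the base URL, the index name `local.key`, and the lowercasing of keystrokes are all assumptions made for illustration. Only the `operation`, `version`, `query`, and `maximumRecords` parameter names come from the SRU flavour of SRW.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, keystrokes: str, index: str = "local.key") -> str:
    """Turn the keystrokes typed so far into one exact-match lookup URL.

    base_url and index are hypothetical; the real gateway's values are not
    given in the slides.
    """
    # Assumption: record keys are stored lowercased with single spaces.
    key = " ".join(keystrokes.lower().split())
    # Escape characters that would end the quoted CQL term early.
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'{index} exact "{escaped}"',
        "maximumRecords": "1",  # one precalculated record answers the query
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("http://example.org/autosuggest", "An"))
```

Because every answer is precalculated, the gateway never ranks or merges anything at request time; it only looks up one key.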
  4. How are the records precalculated?
     • For each source record, a relevance score is calculated
       – For VIAF, that’s a value in the record
     • Names are extracted from the record
       – The names are ranked
       – The best name gets the score of the record and subsequent names get a reduced score
       – For each name, a tuple is generated containing the name, the recordID of the source record, the score for the name and any other data extracted from the record
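The tuple generation in slide 4 might look like the following sketch. The slides say only that subsequent names get "a reduced score"; the `penalty` factor of 0.5, the field names, and the `extra` string are assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class NameTuple(NamedTuple):
    name: str
    record_id: str
    score: float
    extra: str  # any other data extracted from the record


def make_tuples(record_id: str, record_score: float, ranked_names: List[str],
                extra: str = "", penalty: float = 0.5) -> List[NameTuple]:
    """Build one tuple per name; ranked_names must be ordered best name first."""
    tuples = []
    for position, name in enumerate(ranked_names):
        # The best name keeps the record's score; the rest are reduced.
        # The fixed penalty factor is an assumption, not the talk's formula.
        score = record_score if position == 0 else record_score * penalty
        tuples.append(NameTuple(name, record_id, score, extra))
    return tuples


if __name__ == "__main__":
    for t in make_tuples("r1", 100.0, ["Shakespeare, William", "Shakspere, William"]):
        print(t)
```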
  5. How are the records precalculated?
     • The tuples are sorted
     • A process reads in all the names that start with the same letter
     • The first two terms are compared and a top-10 list is started for each set of letters in common
       – E.g. Andrew and Anthony each go into the top-10 list for A and AN
       – AutoSuggest records are generated for the singletons Andrew and Anthony. The full name is the key for these records.
  6. How are the records precalculated?
     • The next term is compared to the one that preceded it
       – E.g. Anthony and Astrid are compared
       – Astrid is added to the top-10 list for A
       – An AutoSuggest record is written for the AN list
         • The key for the record is AN
         • Each of the names (and its associated data) is included in the record
       – An AutoSuggest record is generated for the singleton Astrid
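The pass over the sorted tuples described in slides 5 and 6 can be sketched as a generator that compares each name with the one before it. This is an illustrative reconstruction under stated assumptions, not the production code: tuples are taken to be `(name, recordID, score)`, keys are lowercased, and a name that is an exact prefix of another name (or an exact duplicate) is not handled specially, so its singleton key could collide with a list key.

```python
import heapq
from itertools import count


def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def autosuggest_records(sorted_tuples, limit=10):
    """Yield (key, names) records from (name, record_id, score) tuples sorted by name."""
    heaps = []     # heaps[i] is the bounded list for the shared prefix of length i + 1
    seq = count()  # tie-breaker so heap entries never compare the tuples themselves
    previous, prev_key = None, ""

    def push(heap, item):
        heapq.heappush(heap, (item[2], next(seq), item))
        if len(heap) > limit:
            heapq.heappop(heap)  # drop the lowest-scoring entry

    def ranked(heap):
        return [item for _, _, item in sorted(heap, reverse=True)]

    for current in sorted_tuples:
        key = current[0].lower()
        shared = common_prefix_len(prev_key, key) if previous is not None else 0
        # Lists for prefixes the new name does not share are finished: write them.
        while len(heaps) > shared:
            yield prev_key[:len(heaps)], ranked(heaps.pop())
        # Start a list for each newly shared prefix, seeded with the previous name.
        while len(heaps) < shared:
            heap = []
            push(heap, previous)
            heaps.append(heap)
        for heap in heaps:
            push(heap, current)
        yield key, [current]  # singleton record keyed by the full name
        previous, prev_key = current, key
    while heaps:
        yield prev_key[:len(heaps)], ranked(heaps.pop())


if __name__ == "__main__":
    names = [("Andrew", "r1", 5), ("Anthony", "r2", 7), ("Astrid", "r3", 3)]
    for key, entries in autosuggest_records(names):
        print(key, [name for name, _, _ in entries])
```

With the slide's example, the AN record is written as soon as Astrid arrives, and the A record is written only when the input runs out. Only the lists for prefixes of the current name are held in memory at any time.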
  7. Top-10 is complicated
     • The naïve assumption is that the 10 names with the highest score would be in the list
     • But then all the variations on Shakespeare that start with S would be in the S record
     • So, before a candidate name is added to the top-10 list, it is checked to see if there is a higher-ranking name with the same recordID
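The recordID check from slide 7 can be sketched as below. The slides describe only the case where a higher-ranking name for the same record is already listed; replacing a lower-ranking name when the candidate outranks it is an assumption added here so that the list stays at one name per record.

```python
def add_candidate(top, candidate, limit=10):
    """Return a new top list with candidate considered for inclusion.

    top is a list of (name, record_id, score) tuples in descending score order.
    """
    _, record_id, score = candidate
    for i, (_, other_id, other_score) in enumerate(top):
        if other_id == record_id:
            if other_score >= score:
                # A higher-ranking name for this record is already listed.
                return top
            # Assumption: the candidate replaces the lower-ranking name.
            top = top[:i] + top[i + 1:]
            break
    top = sorted(top + [candidate], key=lambda t: t[2], reverse=True)
    return top[:limit]


if __name__ == "__main__":
    top = []
    for cand in [("Shakespeare, William", "r1", 90),
                 ("Shakspere, William", "r1", 45),
                 ("Shaw, George Bernard", "r2", 80)]:
        top = add_candidate(top, cand)
    print([name for name, _, _ in top])
```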
  8. It’s not really that easy
     • All the names that start with A won’t fit into memory
     • We do all of this work in Hadoop
     • We partition the tuple input on the first 5 letters in common
     • Process as described before, but write the shorter fragments (less than 5 letters) to a separate directory
     • Combine those lists to produce unified lists (and records)
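The partition-then-combine idea from slide 8 can be illustrated in plain Python. This is a stand-in for the Hadoop job, not Hadoop code: the function names are hypothetical, records are assumed to be `(key, names)` pairs with `(name, recordID, score)` entries, and the combine step here simply merges and re-ranks partial lists that share a key.

```python
from collections import defaultdict

PARTITION_LEN = 5  # the slides partition on the first 5 letters


def partition_key(name):
    """Key that decides which partition a name's tuple is sent to."""
    return name.lower()[:PARTITION_LEN]


def split_records(records):
    """Separate records with short keys, which other partitions may also produce."""
    complete, fragments = [], []
    for key, names in records:
        target = fragments if len(key) < PARTITION_LEN else complete
        target.append((key, names))
    return complete, fragments


def combine_fragments(fragments, limit=10):
    """Merge partial lists that share a key into one ranked list per key."""
    merged = defaultdict(list)
    for key, names in fragments:
        merged[key].extend(names)
    return {key: sorted(names, key=lambda t: t[2], reverse=True)[:limit]
            for key, names in merged.items()}


if __name__ == "__main__":
    part_1 = [("an", [("Andrew", "r1", 5)]), ("andrew", [("Andrew", "r1", 5)])]
    part_2 = [("an", [("Anthony", "r2", 7)]), ("anthony", [("Anthony", "r2", 7)])]
    complete, fragments = split_records(part_1 + part_2)
    print(combine_fragments(fragments))
```

Records whose keys have five or more letters can only come from one partition, so they are final as written. Shorter keys such as AN can receive names from many partitions, which is why they need the separate combine pass.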
  9. Loaded into Pears
     • All these generated records are loaded into Pears
     • Lots and lots of records
       – The latest AutoSuggest database for VIAF has 341 million records in it
       – VIAF itself only has 31 million records
  10. Thank You! Ralph LeVan
      ©2014 OCLC. This work is licensed under a Creative Commons Attribution 3.0 Unported License. Suggested attribution: “This work uses content from [presentation title] © OCLC, used under a Creative Commons Attribution license:”