Presented by Ralph LeVan, Senior Research Scientist, OCLC Research, as a lightning talk at Code4LibMidWest at the University of Chicago Regenstein Library on 14 July 2016.
3. Result
• Precalculate a response record for every
possible legitimate keystroke combination
• Load those records into a Pears database and
expose via SRW
• Client JavaScript takes keystrokes and turns
them into queries to an AutoSuggest servlet
• The thin gateway servlet takes each query, turns
it into an SRW request and passes the returned
record through to the client
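The gateway step can be pictured as a simple translation from keystrokes to a search URL. The following is a minimal sketch, not the actual servlet: the base URL, the index name `local.key`, and the upper-casing of the keystrokes are assumptions made for illustration. Only the standard SRW/SRU `searchRetrieve` parameters are taken as given.

```python
from urllib.parse import urlencode


def srw_url(keystrokes,
            base_url="http://example.org/autosuggest-db",  # hypothetical endpoint
            index="local.key"):                            # hypothetical index name
    """Build a searchRetrieve URL asking for the one record keyed by the keystrokes."""
    # Escape backslashes and double quotes so the term is a valid quoted CQL string.
    term = keystrokes.upper().replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'{index} exact "{term}"',
        "maximumRecords": "1",  # each key maps to a single precalculated record
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(srw_url("an"))
```

Because every response is precalculated, the gateway needs no ranking logic of its own: it asks for exactly one record and hands it back.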
4. How are the records precalculated?
• For each source record, a relevance score is
calculated
– For VIAF, that’s a value in the record
• Names are extracted from the record.
– The names are ranked
– The best name gets the score of the record and
subsequent names get a reduced score
– For each name, a tuple is generated containing the
name, the recordID of the source record, the score for
the name and any other data extracted from the
record
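The tuple generation described above might look roughly like the sketch below. The record layout (a dict with `id`, `score`, `names`, and `extra`) and the `decay` factor are assumptions for illustration; the talk only says that later names get a reduced score, not how the reduction is computed.

```python
from typing import NamedTuple


class NameTuple(NamedTuple):
    name: str
    record_id: str
    score: float
    extra: dict


def make_tuples(record, decay=0.9):
    """Turn one source record into one tuple per name.

    `record["names"]` is assumed to be ranked already, best name first.
    `decay` is a hypothetical reduction factor applied once per rank.
    """
    tuples = []
    for rank, name in enumerate(record["names"]):
        # The best name (rank 0) keeps the record's score; later names get less.
        score = record["score"] * decay ** rank
        tuples.append(NameTuple(name, record["id"], score, record.get("extra", {})))
    return tuples


if __name__ == "__main__":
    rec = {"id": "viaf-1", "score": 100.0,
           "names": ["Shakespeare, William", "Shakspere, William"]}
    for t in make_tuples(rec):
        print(t)
```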
5. How are the records precalculated?
• The tuples are sorted
• A process reads in all the names that start with
the same letter.
• The first two terms are compared and a top-10
list is started for each set of letters in common
– E.g. Andrew and Anthony each go into the top-10 list
for A and AN.
– AutoSuggest records are generated for the singletons
Andrew and Anthony. The full name is the key for
these records.
6. How are the records precalculated?
• The next term is compared to the one that
preceded it
– E.g. Anthony and Astrid are compared
– Astrid is added to the top-10 list for A
– An AutoSuggest record is written for the AN list,
since no later name in the sorted input can start with AN
• The key for the record is AN
• Each of the names (and its associated data) is included in the
record
– An AutoSuggest record is generated for the singleton
Astrid
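The streaming comparison of neighbouring names can be sketched as follows. This is an illustration under stated assumptions, not the production code: keys are upper-cased, a list is kept for every prefix of the current name (including the full name, which plays the role of the singleton record), and the function names are hypothetical. The recordID check on the top-10 list is left out here.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_records(sorted_tuples, top_n=10):
    """Yield (key, entries) pairs from (name, record_id, score) tuples.

    The input must be sorted by upper-cased name. Only the lists for
    prefixes of the previous name are held in memory at any time.
    """
    open_lists = {}
    prev = ""
    for name, record_id, score in sorted_tuples:
        key = name.upper()
        shared = common_prefix_len(prev, key)
        # Prefixes of the previous name that the new name does not share
        # can never occur again in sorted input, so write them out now.
        for length in range(len(prev), shared, -1):
            prefix = prev[:length]
            yield prefix, open_lists.pop(prefix)
        # Add the new name to the list of every one of its prefixes.
        for length in range(1, len(key) + 1):
            entries = open_lists.setdefault(key[:length], [])
            entries.append((name, record_id, score))
            entries.sort(key=lambda e: e[2], reverse=True)
            del entries[top_n:]
        prev = key
    # End of input: flush whatever is still open.
    for length in range(len(prev), 0, -1):
        yield prev[:length], open_lists.pop(prev[:length])


if __name__ == "__main__":
    data = [("Andrew", "r1", 5), ("Anthony", "r2", 3), ("Astrid", "r3", 4)]
    for key, entries in build_records(data):
        print(key, [e[0] for e in entries])
```

With the Andrew, Anthony, Astrid example, the AN list is written as soon as Astrid arrives, while the A list stays open until the input runs out of names starting with A.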
7. Top-10 is complicated
• The naïve assumption is that the 10 names with
the highest score would be in the list
• But then all the variations on Shakespeare that start
with S would crowd into the S record.
• So, before a candidate name is added to the top-10
list, it is checked to see if a higher-ranking name
with the same recordID is already there
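The recordID check can be sketched as a small insertion routine. The function name and the entry layout `(name, record_id, score)` are assumptions for illustration. Replacing a lower-ranking name from the same record when a better one arrives is also an assumption; the talk only describes rejecting the candidate when a higher-ranking name is already present.

```python
def add_candidate(top, candidate, top_n=10):
    """Return a new top list with `candidate` considered.

    `top` holds (name, record_id, score) entries, at most one per record_id.
    """
    _, record_id, score = candidate
    for i, (_, rid, existing_score) in enumerate(top):
        if rid == record_id:
            if existing_score >= score:
                # A name at least as good from the same record is already listed.
                return top
            # Assumption: a better name for the same record replaces the old one.
            top = top[:i] + top[i + 1:]
            break
    ranked = sorted(top + [candidate], key=lambda e: e[2], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    top = []
    for cand in [("Shakespeare, William", "r1", 100),
                 ("Shakspere, William", "r1", 90),
                 ("Sappho", "r2", 80)]:
        top = add_candidate(top, cand)
    print([name for name, _, _ in top])
```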
8. It’s not really that easy
• All the names that start with A won’t fit into
memory.
• We do all of this work in Hadoop
• We partition the tuple input on the first 5 letters
in common
• Process each partition as described before, but write
the lists for shorter fragments (fewer than 5 letters)
to a separate directory
• Combine those lists to produce unified lists (and
records)
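The partition-and-combine idea can be illustrated in plain Python, without Hadoop. This is a sketch under assumptions: the function names are hypothetical, each partition is processed with a simple in-memory pass rather than the streaming comparison, and the recordID check is omitted. The point it shows is that prefixes of 5 or more letters are complete within one partition, while shorter prefixes yield partial lists that have to be merged.

```python
import heapq
from collections import defaultdict

PARTITION_LEN = 5  # partition on the first 5 letters, as in the talk


def partition(tuples):
    """Group (name, record_id, score) tuples by their first 5 letters."""
    parts = defaultdict(list)
    for t in tuples:
        parts[t[0].upper()[:PARTITION_LEN]].append(t)
    return parts


def prefix_lists(part, top_n=10):
    """Top-n list for every prefix seen within one partition."""
    lists = defaultdict(list)
    for name, record_id, score in part:
        key = name.upper()
        for length in range(1, len(key) + 1):
            lists[key[:length]].append((name, record_id, score))
    return {p: heapq.nlargest(top_n, es, key=lambda e: e[2])
            for p, es in lists.items()}


def combine(per_partition_lists, top_n=10):
    """Merge partial short-prefix lists; long prefixes are already final."""
    final, short = {}, defaultdict(list)
    for lists in per_partition_lists:
        for prefix, entries in lists.items():
            if len(prefix) < PARTITION_LEN:
                short[prefix].extend(entries)  # the "separate directory"
            else:
                final[prefix] = entries
    for prefix, entries in short.items():
        final[prefix] = heapq.nlargest(top_n, entries, key=lambda e: e[2])
    return final


if __name__ == "__main__":
    data = [("Andrew", "r1", 5), ("Anthony", "r2", 3),
            ("Astrid", "r3", 4), ("Andrea", "r4", 6)]
    parts = partition(data)
    records = combine(prefix_lists(p) for p in parts.values())
    print([e[0] for e in records["A"]])
```

Merging the partial top-10 lists is safe because any name in the overall top 10 for a short prefix must also be in the top 10 of its own partition.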
9. Loaded into Pears
• All these generated records are loaded into
Pears
• Lots and lots of records
– The latest AutoSuggest database for VIAF has 341
million records in it.
– VIAF itself only has 31 million records