AutoSuggest

This is for ELM
Ralph LeVan
Sr. Research Scientist
7/14/2016
Code4Lib Midwest

Goals
• Return records at keystroke speeds
• Run on an underpowered Unix box

Result
• Precalculate a response record for every
possible legitimate keystroke combination
• Load those records into a Pears database and
expose via SRW
• Client javascript takes keystrokes and turns
them into queries to an AutoSuggest servlet
• The thin gateway servlet takes queries, turns
them into SRW requests and passes through the
record returned
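The gateway step above can be sketched as a function that turns the keystrokes typed so far into a single-record lookup. This is a minimal sketch, not the actual servlet: it assumes the SRW server also accepts SRU-style URL requests, and the base URL, the index name `local.key`, and the lowercase normalization are all hypothetical.

```python
from urllib.parse import urlencode


def build_request_url(base_url: str, keystrokes: str, index: str = "local.key") -> str:
    """Build an SRU-style searchRetrieve URL for the keystrokes typed so far.

    Assumptions (not from the slides): base_url and index are placeholders,
    and keys are stored lowercased.
    """
    key = keystrokes.strip().lower()
    # Escape backslashes and double quotes so the key is a valid CQL quoted string.
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'{index} exact "{escaped}"',
        # One precalculated record answers each keystroke combination.
        "maximumRecords": "1",
    }
    return f"{base_url}?{urlencode(params)}"
```

Because every answer is precalculated, the gateway only needs an exact-match lookup on the key; no ranking happens at query time.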

How are the records precalculated?
• For each source record, a relevance score is
calculated
– For VIAF, that’s a value in the record
• Names are extracted from the record.
– The names are ranked
– The best name gets the score of the record and
subsequent names get a reduced score
– For each name, a tuple is generated containing the
name, the recordID of the source record, the score for
the name and any other data extracted from the
record
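The tuple generation above might look roughly like the following. The slides do not say how the score is reduced for subsequent names, so the multiplicative `decay` factor is an assumption for illustration, as are all the names in the sketch.

```python
from typing import NamedTuple


class NameTuple(NamedTuple):
    name: str
    record_id: str
    score: float
    extra: dict


def make_tuples(record_id, record_score, ranked_names, extra=None, decay=0.9):
    """Emit one tuple per name; ranked_names must be ordered best-first.

    The best name gets the record's score. Each later name gets a reduced
    score; the decay factor is hypothetical, not the method from the slides.
    """
    tuples = []
    score = record_score
    for name in ranked_names:
        tuples.append(NameTuple(name, record_id, score, extra or {}))
        score *= decay
    return tuples
```

For example, `make_tuples("rec1", 100.0, ["Shakespeare, William", "Shakspere, William"], decay=0.5)` gives the first name a score of 100.0 and the second a score of 50.0.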

• The tuples are sorted
• A process reads in all the names that start with
the same letter.
• The first two terms are compared and a top-10
list is started for each set of letters in common
– E.g. Andrew and Anthony each go into the top-10 list
for A and AN.
– AutoSuggest records are generated for the singletons
Andrew and Anthony. The full name is the key for
these records.

• The next term is compared to the one that
preceded it
– E.g. Anthony and Astrid are compared
– Astrid is added to the top-10 list for A
– An AutoSuggest record is written for the AN list
• The key for the record is AN
• Each of the names (and its associated data) is included in the
record
– An AutoSuggest record is generated for the singleton
Astrid
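The compare-with-the-previous-term walk can be sketched as a single pass over the sorted names. This is an illustrative sketch, not the production code: it assumes names are already normalized (lowercased here), that a record is wanted for every prefix and not only the ones named in the example, and it ignores the recordID check that the top-10 list also needs.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_records(sorted_tuples, top_n=10):
    """Yield (key, entries) pairs from (name, score) tuples sorted by name.

    open_lists[i] holds the best entries seen so far for the prefix of
    length i + 1 of the previous name. A list is written out as soon as the
    next name no longer shares that prefix.
    """
    open_lists = []
    prev = ""
    for name, score in sorted_tuples:
        common = common_prefix_len(prev, name)
        # Prefixes longer than the shared part are finished: emit them.
        while len(open_lists) > common:
            key = prev[:len(open_lists)]
            yield key, open_lists.pop()
        # Start a list for each new prefix, up to the full name.
        while len(open_lists) < len(name):
            open_lists.append([])
        for entries in open_lists:
            entries.append((score, name))
            entries.sort(key=lambda e: (-e[0], e[1]))
            del entries[top_n:]  # keep only the best top_n
        prev = name
    # End of input: flush whatever is still open.
    while open_lists:
        key = prev[:len(open_lists)]
        yield key, open_lists.pop()
```

Run over Andrew, Anthony and Astrid, the AN record is emitted when Astrid arrives, and the A record only at the end of the input, matching the walk-through above.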

Top-10 is complicated
• The naïve assumption is that the 10 names with
the highest score would be in the list
• But, all the variations on Shakespeare that start
with S would be in the S record.
• So, a candidate name for the top-10 list is
checked to see if there is a higher ranking name
with the same recordID before it is added
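The recordID check can be sketched as follows. The names and the tuple layout are hypothetical. Replacing a lower-scoring name when a better name for the same record turns up later is an assumption added for completeness; the slides only describe skipping a candidate that is outranked.

```python
def add_candidate(top, candidate, top_n=10):
    """Add (score, name, record_id) to a best-first list, one name per record."""
    score, name, record_id = candidate
    for i, (existing_score, _existing_name, existing_id) in enumerate(top):
        if existing_id == record_id:
            if existing_score >= score:
                # A higher-ranking name for this record is already listed.
                return top
            # Assumption: a better name for the same record replaces the old one.
            del top[i]
            break
    top.append(candidate)
    top.sort(key=lambda e: -e[0])
    del top[top_n:]
    return top
```

With this check, the S list holds at most one form of Shakespeare, leaving room for nine other records.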

It’s not really that easy
• All the names that start with A won’t fit into
memory.
• We do all of this work in Hadoop
• We partition the tuple input on the first 5 letters
in common
• Process as described before, but write the
shorter fragments (fewer than 5 letters) to a
separate directory
• Combine those lists to produce unified lists (and
records)
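The partition-and-combine idea can be illustrated in plain Python, without the Hadoop API. This is a sketch under assumptions: records are dictionaries mapping a key to a best-first list of (score, name) entries, and all function names are hypothetical. A key of 5 or more letters is complete within its partition; a shorter key may appear in many partitions, so its partial lists have to be merged.

```python
from collections import defaultdict

PARTITION_LEN = 5  # the slides partition on the first 5 letters


def partition_key(name: str) -> str:
    """All names sharing their first 5 letters go to the same partition."""
    return name[:PARTITION_LEN]


def split_records(records):
    """Separate complete records from short fragments that need merging."""
    final, fragments = {}, {}
    for key, entries in records.items():
        target = final if len(key) >= PARTITION_LEN else fragments
        target[key] = entries
    return final, fragments


def combine_fragments(fragment_sets, top_n=10):
    """Merge the partial lists for each short key across all partitions."""
    merged = defaultdict(list)
    for fragments in fragment_sets:
        for key, entries in fragments.items():
            merged[key].extend(entries)
    return {
        key: sorted(entries, key=lambda e: (-e[0], e[1]))[:top_n]
        for key, entries in merged.items()
    }
```

Merging works because each partition keeps its own top 10 for a short key, so the overall top 10 for that key must come from the union of those partial lists.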

Loaded into Pears
• All these generated records are loaded into
Pears
• Lots and lots of records
– The latest AutoSuggest database for VIAF has 341
million records in it.
– VIAF itself only has 31M records

Thank You!
©2014 OCLC. This work is licensed under a Creative Commons Attribution 3.0 Unported License. Suggested attribution: “This
work uses content from [presentation title] © OCLC, used under a Creative Commons Attribution license:
http://creativecommons.org/licenses/by/3.0/”
Ralph LeVan
levan@oclc.org
