WHAT’S IN A NAME?
PHONETIC ALGORITHMS FOR SEARCH AND SIMILARITY
Mercedes Coyle
@benzobot
Data Infrastructure Engineer
WHAT I’M GOING TO COVER TODAY
• Search - how does it work?
• Phonetic Algorithms
• Use cases for Phonetic Algorithms
WHEN WE THINK OF SEARCH…
HOW DOES GOOGLE SEARCH WORK?
• Web crawling on a very large
scale!
• Document rank (importance)
and similarity
• Text analysis
image credit: flickr.com/photos/rserrano/
HOW DOES GOOGLE SEARCH WORK?
• Obligatory hand-wavy “Big Data” comment here
image credit: twitter.com/wtrsld/status/424364245648564226
DATABASE SEARCH
image credit: Mercedes Coyle
*SQL
• Comparison search: LIKE operator
• SELECT * FROM table WHERE word LIKE '%and%'
*SQL
• Comparison search: LIKE operator
• basically a wildcard character search
• only returns data that contains the search string;
does not account for misspelling
• can be expensive on large datasets
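A minimal sketch of the LIKE limitation, using Python's built-in sqlite3 module. The `people` table, its rows, and the `like_search` helper are made up for illustration; the point is only that a wildcard match finds names containing the search string and nothing that merely sounds like it.

```python
import sqlite3

# In-memory database with a few hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Sean",), ("Shawn",), ("Shaun",), ("Seana",)],
)


def like_search(fragment):
    """Return names containing `fragment`, using a wildcard LIKE pattern."""
    pattern = f"%{fragment}%"
    rows = conn.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [name for (name,) in rows]


print(like_search("Sean"))   # ['Sean', 'Seana']
print(like_search("Shawn"))  # ['Shawn']
```

Searching for "Sean" never finds "Shawn" or "Shaun", even though all three are pronounced the same way.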
ELASTICSEARCH - TOKENIZATION
• Used in full-text search against a corpus of text
• “The quick brown fox jumped over the lazy dog”
• the, quick, brown, fox, jump, over, lazy, dog
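A toy sketch of the tokenization above. This is not how Elasticsearch implements its analyzers; the `tokenize` function and its crude "strip a trailing -ed" rule are stand-ins, chosen only to reproduce the token list on the slide.

```python
import re


def tokenize(text):
    """Lowercase, split into words, crudely stem, and drop repeated tokens."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Crude stand-in for a real stemmer: "jumped" becomes "jump".
        if word.endswith("ed") and len(word) > 4:
            word = word[:-2]
        if word not in tokens:
            tokens.append(word)
    return tokens


print(tokenize("The quick brown fox jumped over the lazy dog"))
# ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
```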
PROBLEM: TEXT-BASED SEARCHES DON’T ALWAYS WORK WELL WITH NAMES
• Wildcard searches return too many results
• Typos or misspelled names don’t return correct results
• e.g. “Shawn” vs “Sean”
WHAT IS A PHONEME?
• In language, the smallest unit of sound that distinguishes one word from another
• Covers both vowel and consonant sounds, and may be written as a single letter or a letter combination
ENGLISH PHONEMES
HOW DO WE TRANSLATE PHONEMES TO CODE?
image credit: demoons.com/2010/09/first-animation-test.html
PHONETIC ALGORITHMS
• A method of hashing words and names based on
sounds (phonemes).
PHONETIC ALGORITHM TYPES
• Soundex
• NYSIIS
• Metaphone and Double Metaphone
• Match Rating, Daitch-Mokotoff Soundex, Kölner
Phonetik, Caverphone…
SOUNDEX
• Designed in the early 1900s (patented in 1918) to index names for the US Census
• Built in to MySQL; available in PostgreSQL through the fuzzystrmatch module
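SOUNDEX ALGORITHM
Mercedes → MERCEDES (uppercase the name)
MERCEDES → M0620302 (keep the first letter, replace each remaining letter with its digit)
{ 0 : ['A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'], 1 : ['B', 'F', 'P', 'V'], 2 : ['C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'], 3 : ['D', 'T'], 4 : ['L'], 5 : ['M', 'N'], 6 : ['R'] }
M0620302 → M6232 (drop the zeros)
M6232 → M623 (truncate to one letter plus three digits)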
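The steps above can be sketched in Python roughly as follows. This is a simplified version built from the table on the slide, with one addition not shown there: adjacent letters with the same digit are collapsed before the zeros are dropped. It treats H and W like vowels, so it can differ from the census Soundex rules on some names (Ashcraft, for example). The names `GROUPS`, `LETTER_TO_DIGIT`, and `soundex` are illustrative.

```python
# Digit groups from the slide; index in the list is the Soundex digit.
GROUPS = ["AEIOUHWY", "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
LETTER_TO_DIGIT = {
    letter: str(digit) for digit, group in enumerate(GROUPS) for letter in group
}


def soundex(name):
    """Simplified Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in name.upper() if c in LETTER_TO_DIGIT]
    if not letters:
        return ""
    digits = [LETTER_TO_DIGIT[c] for c in letters]

    # Collapse runs of the same digit (assumption: not shown on the slide).
    collapsed = [digits[0]]
    for digit in digits[1:]:
        if digit != collapsed[-1]:
            collapsed.append(digit)

    # Skip the first letter's own digit, then drop the zeros.
    tail = [d for d in collapsed[1:] if d != "0"]

    # Pad short codes with zeros and truncate to four characters.
    return (letters[0] + "".join(tail) + "000")[:4]


print(soundex("Mercedes"))                # M623
print(soundex("Shawn"), soundex("Sean"))  # S500 S500
```

"Shawn" and "Sean" hash to the same code, which is exactly the match that plain text search misses.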
SOUNDEX LIMITATIONS
• Most implementations work
for English Language only
• First letter retention causes
no match on some similar
names
SOUNDEX LIMITATIONS
• The PostgreSQL Soundex implementation does not work well with multibyte encodings such as UTF-8
http://www.postgresql.org/docs/9.4/static/fuzzystrmatch.html
NYSIIS
• Developed in 1970, part of New York State
Identification and Intelligence System
• Slightly improved functionality over Soundex
NYSIIS ALGORITHM
• MERCEDES
• MARCADAS (vowels become A)
• MARCADA (trailing S dropped)
• MARCAD (trailing A dropped)
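A heavily simplified sketch of just the three transformations shown above. This is not a full NYSIIS implementation: the real algorithm has many more rules (prefix and suffix rewrites, consonant substitutions), so this function, named `nysiis_slide_steps` here for illustration, will give wrong codes for many names. The six-character limit is the traditional NYSIIS code length.

```python
def nysiis_slide_steps(name, max_len=6):
    """Return the intermediate strings for the subset of NYSIIS steps on the slide."""
    word = name.upper()
    if not word:
        return []
    steps = [word]

    # Keep the first letter; every later vowel becomes A.
    word = word[0] + "".join("A" if c in "AEIOU" else c for c in word[1:])
    steps.append(word)

    # Drop a trailing S.
    if len(word) > 1 and word.endswith("S"):
        word = word[:-1]
        steps.append(word)

    # Drop a trailing A.
    if len(word) > 1 and word.endswith("A"):
        word = word[:-1]
        steps.append(word)

    # Truncate to the maximum code length, if that changes anything.
    code = word[:max_len]
    if code != word:
        steps.append(code)
    return steps


print(nysiis_slide_steps("Mercedes"))
# ['MERCEDES', 'MARCADAS', 'MARCADA', 'MARCAD']
```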
METAPHONE
• Developed in 1990 by Lawrence Philips
• Improved accuracy over Soundex and NYSIIS
• Double Metaphone produces two codes, a primary and an alternate, for each name or word
METAPHONE
• Metaphone and Double Metaphone were improved
upon in Metaphone 3, which is unfortunately closed
source.
PHONETIC ALGORITHMS IN PRACTICE
• Use cases for Phonetic Algorithms
• Example uses in Databases
PHONETIC ALGORITHMS IN PRACTICE
• Phonetic algorithms are useful for searching by name
or word, and tolerate some misspelling.
PHONETIC ALGORITHMS IN PRACTICE
• Store the phonetic hash of a name in fields/columns in
your db for indexing and querying
{ "_id" : ObjectId("53e13a73cbcc7a0a6e3078e5"),
"first_name" : "Arya", "last_name" : "Stark",
"n_first_name" : "AR", "n_last_name" : "STARC",
"report" : "lost_item", "item" : "ID Card",
"timestamp" : 1407269491, "report_id" : 50642 }
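A sketch of the same store-the-hash idea with Python's built-in sqlite3 module rather than a document store. Assumptions to note: the `reports` table, the sample rows other than Arya Stark, and the `phonetic_key` and `find_by_last_name` helpers are all invented for illustration, and the key is a simplified Soundex instead of the NYSIIS hashes in the document above.

```python
import sqlite3

# Simplified Soundex digit groups; index in the list is the digit.
GROUPS = ["AEIOUHWY", "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
DIGIT = {letter: str(i) for i, group in enumerate(GROUPS) for letter in group}


def phonetic_key(name):
    """Simplified Soundex key: first letter plus three digits."""
    letters = [c for c in name.upper() if c in DIGIT]
    if not letters:
        return ""
    digits = [DIGIT[c] for c in letters]
    # Keep a digit only if it differs from the previous one and is not zero.
    kept = [d for prev, d in zip(digits, digits[1:]) if d != prev and d != "0"]
    return (letters[0] + "".join(kept) + "000")[:4]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports "
    "(first_name TEXT, last_name TEXT, last_name_key TEXT, item TEXT)"
)
# Index the hash column so lookups by sound are plain equality matches.
conn.execute("CREATE INDEX idx_last_name_key ON reports (last_name_key)")

people = [("Arya", "Stark", "ID Card"), ("Aria", "Starck", "Keys"), ("Jon", "Snow", "Cloak")]
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [(first, last, phonetic_key(last), item) for first, last, item in people],
)


def find_by_last_name(name):
    """Hash the search term the same way, then match on the stored key."""
    rows = conn.execute(
        "SELECT first_name, last_name FROM reports "
        "WHERE last_name_key = ? ORDER BY last_name",
        (phonetic_key(name),),
    )
    return rows.fetchall()


print(find_by_last_name("Starke"))  # [('Aria', 'Starck'), ('Arya', 'Stark')]
```

The hash is computed once at write time, so a misspelled search term costs one hash and one indexed equality lookup.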
PHONETIC SEARCH WITH ELASTICSEARCH
• Elasticsearch supports phonetic matching, in many different languages, through its phonetic analysis plugin
• Store words/names as documents; the phonetic analyzer hashes them at index time and hashes the search terms at query time
GET /my_index/_analyze?analyzer=dbl_metaphone
text: Smith Smythe
returns: SM0 and XMT for both names
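For the `_analyze` call above to work, the index needs an analyzer named `dbl_metaphone`. A sketch of index settings that would define one, written as a Python dict and printed as the JSON body for `PUT /my_index`. It assumes the phonetic analysis plugin is installed and follows the pattern in the Elasticsearch guide listed under Resources; check the documentation for your version before relying on it.

```python
import json

# Settings body for creating the index (assumes the phonetic analysis plugin).
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Token filter that replaces each token with its Double Metaphone codes.
                "dbl_metaphone": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                }
            },
            "analyzer": {
                # Analyzer name matches the one used in the _analyze request.
                "dbl_metaphone": {
                    "tokenizer": "standard",
                    "filter": "dbl_metaphone",
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```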
PHONETIC SEARCH USING ELASTICSEARCH
• As a Developer, I really like using Elasticsearch!
• But as a System Administrator, I have battle scars.
PHONETIC ALGORITHMS FOR NON-ENGLISH LANGUAGES
Grab a linguist and write one?
image credit: flickr.com/photos/opacity
RESOURCES
• Libraries
• clj-fuzzy: yomguithereal.github.io/clj-fuzzy/
• python soundex: pypi.python.org/pypi/soundex/1.1.3
• python fuzzy: pypi.python.org/pypi/Fuzzy
• elasticsearch phonetic matching: https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html
• http://aspell.net/metaphone/dmetaph.cpp
• Reading:
• http://doughellmann.com/2012/03/03/using-fuzzy-matching-to-search-by-sound-with-python.html
• Fluency, Jen Foehner Wells - http://www.jenniferfoehnerwells.com/fluency.html
THANKS FOR LISTENING!
QUESTIONS?
Mercedes Coyle
@benzobot
image credit: Mercedes Coyle

Phonetic algorithms os_bridge_2015
