A וֶרֶד By Any Other Name
Matching Names in Hebrew and Other Languages
Gil Irizarry, VP Engineering
Fiona Hasanaj, Senior Software Engineer
BASIS TECHNOLOGY
Speakers
Gil Irizarry, VP Engineering
Fiona Hasanaj, Senior Software Engineer
What is a name?
What's Montague? it is nor hand, nor foot,
Nor arm, nor face, nor any other part
Belonging to a man. O, be some other name!
What's in a name? that which we call a rose
By any other name would smell as sweet;
⎯ Romeo and Juliet, Act 2 Scene 2
Public domain image https://commons.wikimedia.org/wiki/File:Romeo_and_Juliet_(detail)_by_Frank_Dicksee.png
A rose by any other name
rose
‫ורד‬
‫وردة‬
장미
一朵玫瑰
バラ
роза
trëndafil
https://commons.wikimedia.org/wiki/File:Rose_on_a_table.jpg Creative Commons License
John by any other name
John, Jan, Johan, Johann, Johannes, Hannes, Hans, Gjon, Gjin, ዮሐንስ (Yoḥännǝs), يحيى (Yaḥyā, Qurʾānic), يوحنا (Yūḥannā, Biblical) or حنّا (Henna or
Hanna), ‫ܝܘܚܢܢ‬ (Yuḥanon), ‫ܚܢܐ‬ (Henna or Hanna), ‫ܐܝܘܢ‬ (Ewan), Chuan, Հովհաննես (Hovhannes), Օհաննես (Ohannes), Յովհաննէս
(Hovhannēs), Xuan, Manez, Ganix, Joanes, Iban, Ян (Yan), Янка (Yanka), Янэк (Yanek), Ясь (Yas'), Іван (Ivan), ইয়ািহয়া (Iyahiya), য়াহয়া (Yahya), Ivan,
Jahija, Yann, Yannig, Иван (Ivan), Йоан (Yoan), Янко (Yanko), Яне (Yane), Joan, 约翰, 約翰, Yuēhàn, ⲓⲱϩⲁⲛⲛⲏⲥ (Iohannes), ⲓⲱⲁ (Ioa), Jowan,
Ghjuvanni, Ivo, Ive, Ivica, Ivano, Ivanko, Janko, Ivek, Honza, Hanuš, Jens, Yohanes, Han, Hannes, Jannes, Wannes, Sjeng, Guiàn, Zvan, Ian, Johnny,
Jack, Shawn, Sean, Shaun, Shane, Shani, Jaan, Juhan, Juho, Janno, Jukk, Jaanus, Hannes, Johano, Huan, Jann, Janus, Jenis, Jóannes, Jónar,
Jógvan, Hannis, Hanus, Jone, Ioane, Juan, Hannes, Hannu, Jani, Janne, Joni, Juha, Juho, Juhani, Jonne, Juntti (archaic), Jean, Jehan (outdated),
Xoán, Xan, იოანე (Ioane), ივანე (Ivane), იოვანე (Iovane), ვანო (Vano), ივა (Iva), Hannes, Ιωάννης (Ioannis), Γιάννης (Yiannis, sometimes Giannis),
Huã, Keoni, ʻIoane, יוחנן (Yôḥānān) Johanan, János, Jancsi (moniker), Hannes, Yohana, Yuhanna, Ayan, యోహాను Yohanu, Iwan, Yahya, Yan, Yaya,
Yuan, Luan, Eóin, Gianni, Giannino, Gionino, Giovanni, Ivano, Ivo, Vanni, Nino, Vannino, ヨハネ (Yohane), ジョハン (Johan), Жақия (Zhaqiya, Yahya),
Шоқан (Shoqan), Жакыя (Jakyya, Yahya), Жакан (Jakan), 요한 (Yohan), Juang, Yohanis, Iohannes, Ioannes, Jānis, Janis, Jancis, Janka, Jans,
Jāns, Jānuss, Jonass, Žans, Žanis, Džons, Džonijs, Džanni, Džovanni, Ians, Džeks, Šeins, Johans, Hanss, Ansis, Johaness, Johanness, Johanāns,
Haness, Hanness, Ivans, Aivans, Aivens, Aiens, Jonas, Giuàn, Јован (Jovan), Јованче (Jovanče), Иван (Ivan), Јане (Jane), യോഹന്നാൻ
(Yōhannān), ഉലഹന്നാൻ (Ulahannan), ലോനപ്പൻ (Lonappan), നയിനാ൯ (Nainan, Ninan), Ġwanni, Hōne, Jon, (Yohannan), یحیی (Yahya), Gioann,
Janek, João, Ivo, Ivã, Ioan, Ionuț, Ionel, Ionică, Nelu, Iancu, Иван (Ivan), Иоанн (Ioann, Hebrew form), Ян (Yan), Ioane, Juons, Giuanni, Jock, Iain,
Eòin, Seathan, Euan/Ewan, Јован (Jovan), Иван (Ivan), Јанко (Janko), Јовица (Jovica), Ивица (Ivica), Ивко (Ivko), Giuvanni, Giuanni, Juwam,
Yohan, Janez, Ivo, Janko, Anže, Anžej, Jon, Nuño, Hannes, য়াহয়া (Yahya), ܝܘܚܢܢ (Yuḥanon), ܚܢܐ (Ḥanna), ܐܝܘܢ (Ewan), யோவான் (Yovaan),
Sione, Yahya, Yuhanna, Іван (Ivan), Іванко (Ivanko), Ян (Jan), Dương, Giăng, Gioan, Evan, Ianto, Ieuan, Ifan, Ioan, Siôn
https://en.wikipedia.org/wiki/John_(given_name)
An easy name-matching challenge
A harder name-matching challenge
Overcoming the name-matching challenge
Hidden Markov Models
https://en.wikipedia.org/wiki/Hidden_Markov_model#/media/File:HMMGraph.svg Public Domain image
Other name-matching challenges
Using Vector Similarity for Name Matching
Matching Hebrew Names
Hebrew String Normalization
● Keep these characters:
○ All letters, digits, split characters (hyphens, periods, commas, etc.), whitespace, and symbols
○ U+05B0 through U+05BB (Hebrew vowels)
○ U+05BC, U+05BF, U+05C1, and U+05C2 (Hebrew consonant modifiers)
○ U+05F3 (geresh, a punctuation mark also used as a consonant modifier)
● Map some Hebrew punctuation to common ASCII fallbacks:
○ U+05BE (HEBREW PUNCTUATION MAQAF) to -
○ U+05F3 (HEBREW PUNCTUATION GERESH) to '
○ U+2019 (RIGHT SINGLE QUOTATION MARK) to '
● We do not normalize Hebrew final letters:
○ ג'ף (Jef): word-final final pe
○ קאמפ (Kamp): word-final non-final pe
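The normalization rules above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the speakers' implementation: the function name `normalize_hebrew` is hypothetical, Unicode general categories are used as a stand-in for "letters, digits, split characters, whitespace, symbols", and anything not explicitly kept (for example cantillation marks or directional control characters) is assumed to be dropped.

```python
import unicodedata

# Combining marks that the rules say to keep:
# U+05B0..U+05BB (vowels) plus U+05BC, U+05BF, U+05C1, U+05C2 (consonant modifiers).
KEEP_MARKS = set(range(0x05B0, 0x05BC)) | {0x05BC, 0x05BF, 0x05C1, 0x05C2}

# Hebrew and typographic punctuation mapped to ASCII fallbacks.
PUNCT_MAP = {0x05BE: "-", 0x05F3: "'", 0x2019: "'"}


def normalize_hebrew(text):
    """Apply the keep/map rules character by character (assumed behaviour)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in PUNCT_MAP:
            out.append(PUNCT_MAP[cp])
            continue
        major = unicodedata.category(ch)[0]  # L, N, P, S, Z, M, C
        if major in "LNPSZ" or ch.isspace() or cp in KEEP_MARKS:
            out.append(ch)
        # Assumption: every other character is dropped.
    return "".join(out)
```

Final letters pass through untouched, so a word-final final pe (U+05E3) and a word-final non-final pe (U+05E4) remain distinguishable after normalization, which is what lets later stages tell "Jef" from "Kamp".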
Hebrew Vocalization
● Process of vocalization:
○ Dictionary Lookup
○ Statistical Model Vocalization
○ Rule-based Vocalization Checker
[Diagram labels: Input, Vocalization Dictionary, Statistical Model Vocalizer, Vocalization Checker, Vocalized Output, Output]
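The three listed stages could be chained as in the sketch below. Everything here is hypothetical: the one-entry dictionary, the `model_vocalize` stub standing in for a trained statistical model, the single checker rule (a vocalization may add points but must not change the underlying letters), and the fallback to the unvocalized input when the check fails. Only the ordering of dictionary, model, and checker comes from the slide.

```python
import unicodedata

# Hypothetical dictionary: unvocalized form -> vocalized form.
VOCALIZATION_DICT = {
    "\u05D5\u05E8\u05D3": "\u05D5\u05B6\u05E8\u05B6\u05D3",  # vered ("rose")
}


def strip_points(text):
    """Remove nonspacing marks (vowel points and similar) from a string."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def model_vocalize(word):
    """Placeholder for the statistical model; it returns the word unchanged."""
    return word


def passes_check(word, candidate):
    """Assumed rule: stripping the points must give back the input letters."""
    return strip_points(candidate) == word


def vocalize(word):
    """Dictionary lookup first, then the model, then the rule-based check."""
    candidate = VOCALIZATION_DICT.get(word)
    if candidate is None:
        candidate = model_vocalize(word)
    # Assumption: fall back to the unvocalized input if the check fails.
    return candidate if passes_check(word, candidate) else word
```

Putting the dictionary first keeps known names deterministic and leaves the model to handle only unseen words.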
Hebrew Transliteration
● FOLK transliteration scheme example:
○ עדי שמיר ⟹ Adi Shamir
● ISO 259-2:1994 transliteration scheme example:
○ עדי שמיר ⟹ ʿadiy Šamiyr
● ICU (UNGEGN, United Nations Group of Experts on Geographical Names) transliteration scheme example:
○ עדי שמיר ⟹ ʻàdiy Şá̌miyr
● Statistical model example for names of foreign origin:
○ רוזלינד פרנקלין ⟹ Rosalind Franklin
Hebrew Transliteration - FOLK
● Many-to-one mapping from the source table.
● Gathered valid onsets and valid codas; allowed “נג” (ng) as a coda, which occurs in English loanwords.
● Bet, kaf, and pe each have two pronunciations, a stop and a fricative, and are romanized accordingly.
● Complexity with shva. (continued on next two slides)
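The stop/fricative point for bet, kaf, and pe can be illustrated with a small sketch over vocalized text, where a dagesh (U+05BC) on the letter marks the stop. The function name `romanize_bkp` and the chosen Latin outputs (b/v, k/kh, p/f) are assumptions for illustration; the actual FOLK table is not shown in the slides, and this sketch does not handle shva or unvocalized input.

```python
import unicodedata

DAGESH = "\u05BC"

# (stop, fricative) romanizations. The Latin letters are assumed, not sourced.
BKP = {
    "\u05D1": ("b", "v"),   # bet
    "\u05DB": ("k", "kh"),  # kaf
    "\u05DA": ("k", "kh"),  # final kaf
    "\u05E4": ("p", "f"),   # pe
    "\u05E3": ("p", "f"),   # final pe
}


def romanize_bkp(vocalized):
    """Return the romanization of each bet/kaf/pe found in a vocalized string."""
    result = []
    for i, ch in enumerate(vocalized):
        if ch not in BKP:
            continue
        # Collect the combining marks attached to this letter.
        marks = ""
        j = i + 1
        while j < len(vocalized) and unicodedata.category(vocalized[j]).startswith("M"):
            marks += vocalized[j]
            j += 1
        stop, fricative = BKP[ch]
        result.append(stop if DAGESH in marks else fricative)
    return result
```

For example, a bet carrying a dagesh comes out as "b", while a bare bet after a vowel comes out as "v".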
Hebrew Transliteration - FOLK
Hebrew Transliteration - FOLK
Hebrew Name Matching
● Statistical model trained on Hebrew-English PERSON names (used for matching all entity
types)
● Levenshtein edit distance
● Initial and initialism matching
● Embedding matching and entity resolution for ORGs
● Overrides matching
● Gender Model
● Frequency Model
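Two of the components listed above, Levenshtein edit distance and initial matching, are simple enough to sketch directly. The helper names (`levenshtein`, `edit_similarity`, `initial_matches`) and the length-normalized score are assumptions for illustration; the slides do not say how the distance is turned into a score or combined with the other models.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed scoring: 1.0 for identical strings, falling toward 0.0."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def initial_matches(token, name):
    """True if token is a single-letter initial (optionally with '.') of name."""
    letter = token.rstrip(".")
    return len(letter) == 1 and name[:1].lower() == letter.lower()
```

With these helpers, `levenshtein("Shamir", "Samir")` is 1, and `initial_matches("A.", "Adi")` is true while a full token such as "Adi" is not treated as an initial.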
Hebrew Name Matching - Statistical Model
Hebrew Name Matching - Embedding Model
Hebrew Name Matching - Entity Resolution
Hebrew Name Matching - Frequency Model
Hebrew Name Matching - Gender Model
Hebrew Name Matching - Overrides
Hebrew Name Matching - Stopwords
