Fuzzy Matching with Apache Spark

FUZZY MATCHING
WITH APACHE SPARK
ABHIJIT NAYAK

APPLICATIONS
GOLDEN CUSTOMER
Identity resolution of a customer across all source systems
DEDUPLICATION
Deduplicating reference data
RECORD LINKAGE
Linking records that share no key by matching on a pattern
INDUSTRY-SPECIFIC USE CASES
• Computational biology (DNA sequencing)
• Fraud detection

FUZZY MATCHING METHODS
PHONETIC ALGORITHM
Indexes words by their pronunciation
• Most were developed for the English language
• Most take a single string as input and return a code
• Commonly used in spell checkers
• Can produce too many false positives
SIMILARITY METRIC
A score that measures how similar two strings are
• Compares two strings to give a proximity score
• Most algorithms require two strings as inputs
• Commonly used in data integrity solutions and deduplication
• Computationally fast

PHONETIC ALGORITHMS
SOUNDEX
HOW IT WORKS?
• The first letter is retained
• Vowels and the letters w, y, h are ignored
• If two consecutive letters are the same (or share the same digit), only one is considered
• Every other letter is replaced by a digit
• The result is truncated, or padded with zeros, to a four-character code
ENCODER (Soundex)
B P F V            1
C S K G J Q X Z    2
D T                3
L                  4
M N                5
R                  6
SAMPLE
• Robert = R163
• Rupert = R163
• Rubin = R150
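The Soundex rules can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `soundex` and the table name `SOUNDEX_CODES` are chosen here, and the handling of h and w (they do not separate two letters with the same digit) follows the common American Soundex convention, which is an assumption beyond what the slide states.

```python
# Digit groups from the Soundex encoder table.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return a four-character Soundex code (illustrative sketch)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")  # vowels, h, w, y have no digit
        if code and code != prev:
            digits.append(code)
        if c not in "HW":  # assumption: h and w do not reset the previous digit
            prev = code
    # Pad with zeros, then truncate to the first letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


for name in ("Robert", "Rupert", "Rubin"):
    print(name, soundex(name))
```

Because the code depends on a single string, it can be computed once per record and then used as a grouping key.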
REFINED SOUNDEX
HOW IT WORKS?
• The first letter is retained
• Letters are divided into more groups
• Here y, h and w are ignored
• The length of the result is not truncated
ENCODER (Refined Soundex)
B P      1
F V      2
C S K    3
G J      4
Q X Z    5
D T      6
L        7
M N      8
R        9
SAMPLE
• Robert = R901096
• Rupert = R901096
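A small Python sketch of Refined Soundex follows. To reproduce the sample code for Robert and Rupert, it assumes a variant in which vowels and h, w, y are not dropped but mapped to 0, and in which the first letter is both kept and encoded. That behaviour is inferred from the sample, not stated on the slide, and the names `REFINED_CODES` and `refined_soundex` are illustrative.

```python
# Groups from the Refined Soundex encoder table.
# Assumption: vowels and h, w, y map to "0" instead of being removed.
REFINED_CODES = {
    **dict.fromkeys("AEIOUHWY", "0"),
    **dict.fromkeys("BP", "1"),
    **dict.fromkeys("FV", "2"),
    **dict.fromkeys("CSK", "3"),
    **dict.fromkeys("GJ", "4"),
    **dict.fromkeys("QXZ", "5"),
    **dict.fromkeys("DT", "6"),
    "L": "7",
    **dict.fromkeys("MN", "8"),
    "R": "9",
}


def refined_soundex(name: str) -> str:
    """Return an untruncated Refined Soundex code (illustrative sketch)."""
    letters = [c for c in name.upper() if c in REFINED_CODES]
    if not letters:
        return ""
    out = [letters[0]]  # the first letter is retained as-is
    last = None
    for c in letters:  # the first letter is also encoded
        code = REFINED_CODES[c]
        if code != last:  # collapse runs of the same code
            out.append(code)
        last = code
    return "".join(out)


print(refined_soundex("Robert"), refined_soundex("Rupert"))
```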

JACCARD DISTANCE
HOW IT WORKS?
• The Jaccard distance comes from set theory: it measures how dissimilar two sets are
• It is a pairwise comparison of two strings (which may differ in length)
• First, it converts both strings to sets of characters
• It then counts the common characters and applies the formula
  distance = 1 - (common characters / all distinct characters)
• Example: Compare("abhijit", "abhilash")
• Common letters = 4 (a, b, h, i); all distinct letters = 8 (a, b, h, i, j, t, l, s)
• Distance = 1 - 4/8 = 0.5
     [0] [1] [2] [3] [4] [5] [6] [7]
s1    a   b   h   i   j   i   t
s2    a   b   h   i   l   a   s   h
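As a rough illustration, the set-based definition |A ∩ B| / |A ∪ B| can be written directly with Python sets. The sketch works on sets of distinct characters, so a repeated letter counts once; the function names are chosen here, and treating two empty strings as identical is an assumption.

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Share of distinct characters that the two strings have in common."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(s1: str, s2: str) -> float:
    """Jaccard distance = 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(s1, s2)


print(jaccard_distance("abhijit", "abhilash"))
```

Character sets ignore letter order; tokenising into words or n-grams before building the sets is a common alternative when order matters.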

HAMMING DISTANCE
• The Hamming distance between two strings of equal length is the number of positions
at which the corresponding symbols are different.
• Pairwise comparison of two strings
• Allows only substitution
EXAMPLE
• Distance between "karolin" and "kathrin" is 3
• Distance between "karolin" and "kerstin" is 3
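The definition translates almost word for word into Python. In this minimal sketch, raising an error for strings of unequal length is a design choice made here, since the distance is only defined for equal lengths.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance("karolin", "kathrin"))
print(hamming_distance("karolin", "kerstin"))
```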

LEVENSHTEIN EDIT DISTANCE
HOW IT WORKS?
• The Levenshtein distance between two words is the
minimum number of single-character edits (insertions,
deletions or substitutions) required to change one
word into the other
EXAMPLE
• Transforming Akshay -> Akshata takes two edits
• Akshay → Akshat (substitution of "t" for "y")
• Akshat → Akshata (insertion of "a" at the end)
          a    k    s    h    a    t    a
    [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]
a   [1]   0    1    2    3    4    5    6
k   [2]   1    0    1    2    3    4    5
s   [3]   2    1    0    1    2    3    4
h   [4]   3    2    1    0    1    2    3
a   [5]   4    3    2    1    0    1    2
y   [6]   5    4    3    2    1    1    2
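The edit-distance matrix can be filled row by row with dynamic programming. The following is a minimal sketch that keeps only two rows in memory; the function name `levenshtein` is chosen for illustration, and the comparison is case-sensitive, so inputs are lowercased in the usage line.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("akshay", "akshata"))
```

When the data already lives in a Spark DataFrame, the built-in `pyspark.sql.functions.levenshtein` (and `soundex`) column functions are likely a better fit than a Python UDF.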

FUZZY MATCHING
USING SPARK

MIRROR MATRIX
        NAME1  NAME2  NAME3  NAME4  NAME5  NAME6  NAME7
NAME1            X      X      X      X      X      X
NAME2                   X      X      X      X      X
NAME3                          X      X      X      X
NAME4                                 X      X      X
NAME5                                        X      X
NAME6                                               X
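One reading of the matrix is that each pair of names needs to be scored only once, so only the upper triangle is compared. A minimal sketch of generating those pairs in plain Python is shown here; the helper name `candidate_pairs` is hypothetical, and in Spark the same effect is usually obtained with a self-join filtered on something like `a.id < b.id`, which is an assumption about the implementation rather than something the slide shows.

```python
from itertools import combinations


def candidate_pairs(names):
    """Return each unordered pair once (the upper triangle of the matrix)."""
    return list(combinations(names, 2))


names = [f"NAME{i}" for i in range(1, 8)]
pairs = candidate_pairs(names)
print(len(pairs))  # n * (n - 1) / 2 pairs for n names
```

The number of comparisons still grows quadratically with the number of names, which is why a blocking key such as a phonetic code is often used to limit the candidates.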

MATCH PAIRS to SETS
Let's say we have these match pairs:
• Name #1 -> Name #2
• Name #2 -> Name #3
• Name #2 -> Name #4
• Names #1, #2, #3 and #4 then become one set
Use GraphX to convert the pairs to sets
• Vertex = a name
• Edge = match score between two names
[Figure: graph of four names (1 = Abhi, 2 = Abhijit Nayak, 3 = Abhi Nayak, 4 = Abhijit Kayak) with edges scored 0.3, 0.9, 0.4 and 0.75, shown next to the resulting match set of the same four names]
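Turning match pairs into sets amounts to finding the connected components of the match graph. The sketch below uses a small union-find structure in plain Python as a stand-in for GraphX, so it is not the deck's implementation; the function name `match_sets` is hypothetical, and it assumes the pairs have already been filtered by a score threshold.

```python
def match_sets(pairs):
    """Group matched pairs into sets (connected components) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Match pairs from the example: #1-#2, #2-#3, #2-#4.
print(match_sets([(1, 2), (2, 3), (2, 4)]))
```

At Spark scale, GraphX offers `connectedComponents()` on a graph, which labels every vertex with the id of its component and plays the same role.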
