FUZZY MATCHING
WITH APACHE SPARK
ABHIJIT NAYAK
FUZZY MATCHING APPLICATIONS
GOLDEN CUSTOMER
Identity resolution of a customer across all source systems
DEDUPLICATION
De-duplicating reference data
RECORD LINKAGE
Linking records that share no keys by matching on a pattern
INDUSTRY-SPECIFIC USE CASES
• Used in computational biology for DNA sequencing
• Fraud detection
FUZZY MATCHING METHODS
PHONETIC ALGORITHM
Indexes words by their pronunciation
• Most of them were developed for the English language
• Most algorithms take only one string as input and return a code
• Commonly used in spell checkers
• Can produce too many false positives
SIMILARITY METRIC
A score that measures the similarity of two strings
• Compares two strings to give a proximity score
• Most algorithms require two strings as inputs
• Commonly used in data integrity solutions and deduplication
• Computationally fast
PHONETIC ALGORITHMS
SOUNDEX
HOW IT WORKS?
• The first letter is retained
• Vowels and the letters w, y, h are ignored
• If two consecutive letters are the same (or share a code), only one is encoded
• Every other letter is replaced with a digit from the encoder table
• The result is truncated, or padded with zeros, to a four-character code
ENCODER (Soundex)
Letters            Code
B P F V            1
C S K G J Q X Z    2
D T                3
L                  4
M N                5
R                  6
SAMPLE
• Robert = R163
• Rupert = R163
• Rubin = R150
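The Soundex steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the code from the demo: the function name `soundex` and the `SOUNDEX_CODES` table are made up for this sketch, and it follows the common American Soundex convention in which h and w do not separate two letters with the same code.

```python
# Encoder table from the slide: each letter group maps to one digit.
SOUNDEX_CODES = {}
for group, digit in [("BPFV", "1"), ("CSKGJQXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex code (hypothetical helper)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]                      # the first letter is retained
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")      # vowels, w, y, h have no code
        if code and code != previous:         # skip repeats of the same code
            encoded += code
        if ch not in "HW":                    # assumption: h and w do not reset
            previous = code
    return (encoded + "000")[:4]              # pad with zeros, then truncate


print(soundex("Robert"), soundex("Rupert"), soundex("Rubin"))
```

Because Robert and Rupert share a code, a phonetic match treats them as candidates for the same name, which is also why this method can return false positives.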
REFINED SOUNDEX
HOW IT WORKS?
• The first letter is retained
• Letters are divided into more groups than in Soundex
• Here y, h and w are ignored
• Vowels are coded as 0 rather than dropped, as the sample shows
• The length of the result is not truncated
ENCODER (Refined Soundex)
Letters    Code
B P        1
F V        2
C S K      3
G J        4
Q X Z      5
D T        6
L          7
M N        8
R          9
SAMPLE
• Robert = R901096
• Rupert = R901096
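A rough Python sketch of Refined Soundex, written to reproduce the sample codes on the slide. The names `refined_soundex` and `REFINED_CODES` are hypothetical. Two choices here are assumptions: vowels are coded as 0 (which the sample R901096 implies), and h, w, y are skipped as the slide states; other implementations may instead code h, w, y as 0.

```python
# Encoder table from the slide, plus 0 for vowels (assumption, see lead-in).
REFINED_CODES = {}
for group, digit in [("AEIOU", "0"), ("BP", "1"), ("FV", "2"), ("CSK", "3"),
                     ("GJ", "4"), ("QXZ", "5"), ("DT", "6"), ("L", "7"),
                     ("MN", "8"), ("R", "9")]:
    for letter in group:
        REFINED_CODES[letter] = digit


def refined_soundex(name: str) -> str:
    """Return an untruncated Refined Soundex code (hypothetical helper)."""
    # Letters without a code (h, w, y and non-letters) are skipped.
    letters = [c for c in name.upper() if c in REFINED_CODES]
    if not letters:
        return ""
    encoded = letters[0]          # the first letter is retained ...
    previous = None
    for ch in letters:            # ... and is also encoded as a digit
        code = REFINED_CODES[ch]
        if code != previous:      # collapse runs of the same code
            encoded += code
        previous = code
    return encoded                # no truncation, unlike Soundex


print(refined_soundex("Robert"), refined_soundex("Rupert"))
```

Since the code is not cut to four characters, longer names keep more detail, which is where the reduction in false positives comes from.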
JACCARD DISTANCE
HOW IT WORKS?
• The Jaccard distance comes from set theory. It measures how dissimilar two sets are
• It is a pairwise comparison of two strings (which can differ in length)
• First, it splits both strings into characters
• It then counts the common characters and applies the formula
• Example: compare ("abhijit", "abhilash")
• Common letters = 4 (a, b, h, i); all letters = 7 + 8 = 15
• Formula = 1 - (15 - 4)/15 = 1 - 0.733 = 0.2667
• Note: the textbook set-based Jaccard distance divides by the number of distinct characters across both strings instead, which gives 1 - 4/8 = 0.5 for this pair
      [0] [1] [2] [3] [4] [5] [6] [7]
s1     a   b   h   i   j   i   t
s2     a   b   h   i   l   a   s   h
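To make the arithmetic concrete, here is a small Python sketch that computes both the slide's figure and the textbook set-based Jaccard distance for the same pair. The function names `slide_score` and `jaccard_distance` are hypothetical, and treating "all letters" as the combined length of both strings is an assumption read off the slide's 15.

```python
def jaccard_distance(s1: str, s2: str) -> float:
    """Textbook Jaccard distance on character sets: 1 - |A & B| / |A | B|."""
    a, b = set(s1.lower()), set(s2.lower())
    union = a | b
    if not union:                 # two empty strings are identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


def slide_score(s1: str, s2: str) -> float:
    """The slide's formula: 1 - (all - common) / all.

    'all' is assumed to be len(s1) + len(s2), which matches the 15 on the slide.
    """
    total = len(s1) + len(s2)
    if total == 0:
        return 0.0
    common = len(set(s1.lower()) & set(s2.lower()))
    return 1.0 - (total - common) / total


print(round(slide_score("abhijit", "abhilash"), 4))   # the slide's formula
print(jaccard_distance("abhijit", "abhilash"))        # set-based distance
```

The two numbers differ because the slide divides by every character in both strings, while the set-based form divides by the distinct characters only.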
HAMMING DISTANCE
• The Hamming distance between two strings of equal length is the number of positions
at which the corresponding symbols are different.
• Pairwise comparison of two strings
• Allows only substitution
EXAMPLE
• The distance between "karolin" and "kathrin" is 3
• The distance between "karolin" and "kerstin" is 3
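A short Python sketch of the definition above; the function name `hamming_distance` is chosen here for illustration. It rejects strings of unequal length, since the Hamming distance is only defined for equal lengths.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    # Each mismatching position corresponds to one substitution.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance("karolin", "kathrin"))
print(hamming_distance("karolin", "kerstin"))
```

The equal-length restriction is the main practical limit for name matching, which is what the Levenshtein distance relaxes.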
LEVENSHTEIN EDIT DISTANCE
HOW IT WORKS?
• The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other
EXAMPLE
• Transforming Akshay -> Akshata takes two edits
• Akshay → Akshat (substitution of "t" for "y")
• Akshat → Akshata (insertion of "a" at the end)
             a    k    s    h    a    t    a
      [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]
a     [1]   0    1    2    3    4    5    6
k     [2]   1    0    1    2    3    4    5
s     [3]   2    1    0    1    2    3    4
h     [4]   3    2    1    0    1    2    3
a     [5]   4    3    2    1    0    1    2
y     [6]   5    4    3    2    1    1    2
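The matrix can be filled in with the standard dynamic-programming recurrence. The Python below is a minimal sketch with hypothetical function names (`levenshtein_matrix`, `levenshtein`); it is not the code used in the demo.

```python
def levenshtein_matrix(source: str, target: str) -> list:
    """Build the full edit-distance matrix; rows follow source, columns target."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                       # delete i characters to reach ""
    for j in range(cols):
        d[0][j] = j                       # insert j characters starting from ""
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d


def levenshtein(source: str, target: str) -> int:
    """The distance is the bottom-right cell of the matrix."""
    return levenshtein_matrix(source, target)[-1][-1]


for row in levenshtein_matrix("akshay", "akshata"):
    print(row)
```

In Spark itself there is no need to hand-roll this: Spark SQL ships a built-in `levenshtein` column function (and a `soundex` one), so a sketch like this is mainly useful for understanding the matrix.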
FUZZY MATCHING USING SPARK
DEMO
MIRROR MATRIX
        NAME1  NAME2  NAME3  NAME4  NAME5  NAME6  NAME7
NAME1          X      X      X      X      X      X
NAME2                 X      X      X      X      X
NAME3                        X      X      X      X
NAME4                               X      X      X
NAME5                                      X      X
NAME6                                             X
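The X cells are exactly the unordered pairs of distinct names. A plain-Python sketch of that idea follows; the function name `candidate_pairs` is hypothetical, and this stands in for the Spark code shown in the demo, which is not reproduced in the slides.

```python
from itertools import combinations


def candidate_pairs(names):
    """Return each unordered pair of distinct names exactly once.

    Equivalent to taking the Cartesian product and keeping only the cells
    above the diagonal of the mirror matrix.
    """
    unique = sorted(set(names))            # drop repeated names first
    return list(combinations(unique, 2))   # (a, b) with a < b, never (a, a)


names = ["NAME%d" % i for i in range(1, 8)]
pairs = candidate_pairs(names)
print(len(pairs))    # 6 + 5 + 4 + 3 + 2 + 1 comparisons
print(pairs[:3])
```

On a Spark DataFrame, one way to get the same effect is a self cross join followed by a filter that keeps rows where the left name sorts before the right name. Either way, n names yield n(n-1)/2 comparisons instead of n squared.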
MATCH PAIRS TO SETS
Let's say we have match pairs
• Name #1 -> Name #2
• Name #2 -> Name #3
• Name #2 -> Name #4
• Names #1, #2, #3, #4 become a set
Use GraphX to convert the pairs to sets
• Vertices = names
• Edges = match score between two names
[Figure: graph with four vertices, numbered 1 to 4 and labelled Abhi, Abhijit Nayak, Abhi Nayak and Abhijit Kayak; edges labelled Score = 0.3, Score = 0.9, Score = 0.4 and Score = 0.75]
[Figure: Match Set containing vertices 1, 2, 3 and 4]
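Turning match pairs into sets amounts to finding the connected components of the graph. The sketch below does this in plain Python with a union-find structure; it is a stand-in for the GraphX step, not the demo code, and the function name `match_sets` is hypothetical. In GraphX the equivalent operation is the graph's `connectedComponents()`.

```python
def match_sets(pairs):
    """Group names into sets so that every matched pair ends up together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


pairs = [("Name #1", "Name #2"), ("Name #2", "Name #3"), ("Name #2", "Name #4")]
print(match_sets(pairs))
```

In practice the pairs would first be filtered by a score threshold, so that only sufficiently strong matches create edges; the threshold value is a tuning choice the slides do not specify.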
Questions?


Editor's Notes

  • #2 Hi everyone, thank you for attending this breakout session. I am Abhijit Nayak, currently a manager at Deloitte Australia, working in the consulting team for the analytics and information management department. I am sure that, just like me, all of you have learnt a lot from these breakout sessions and the other networking forums. For the next 50 minutes I will talk about fuzzy matching algorithms and demonstrate how to perform fuzzy matching in Apache Spark.
  • #4 Some of the applications of fuzzy matching are: #1 identity resolution, #2 deduplication, #3 record linkage, #4 DNA sequencing, etc.
  • #5 tbc
  • #6 Let us first discuss phonetic algorithms. There are quite a few phonetic algorithms available, but today I will talk about two: Soundex and Refined Soundex.
    Soundex: Soundex encoding is a fast encoding technique in which we encode the string. (Read the slide, then give an example.)
    Robert: R, being the first character, is retained; vowels and w, y, h are skipped; B = 1, R = 6, T = 3.
    Rupert (sounds like "Rooo-pert"): R, being the first character, is retained; vowels and w, y, h are skipped; P = 1, R = 6, T = 3.
    Rubin: R, being the first character, is retained; vowels and w, y, h are skipped; B = 1, N = 5; append a zero to make it four characters.
    Refined Soundex: Refined Soundex is a better version of Soundex. As the encoder in the top right corner shows, the letters in Soundex category "2" are split into three categories (3, 4 and 5), and those in category "1" into two. Another fundamental difference is that the length of the result is not truncated. This gives higher accuracy and reduces false positives.
    Robert: R, the first letter, is retained; then R = 9, O = 0, B = 1, E = 0, R = 9, T = 6.
    Rupert: R, the first letter, is retained; then R = 9, U = 0, P = 1, E = 0, R = 9, T = 6.
  • #7 The first similarity metric we will cover is the Jaccard distance. The Jaccard distance uses set theory to calculate a similarity score between two strings. It first converts the strings to characters, and after that it is basically a substitution into the formula. The formula is really simple: it subtracts the number of common letters from the total number of characters, then divides the result by the total number of characters. Let us take "Abhijit" and "Abhilash" as an example. The common characters are "abhi", so the common count is 4. All the letters put together come to 15. The interpretation: the smaller the distance, the closer the match.
  • #8 The second similarity metric I am going to talk about is the Hamming distance. The Hamming distance is again a pairwise comparison algorithm, and both strings must be of equal length. It compares one string to the other and reports the number of substitutions required to convert string 2 into string 1. Interpretation: the higher the score, the more substitutions are required. Example: read from the slide.
  • #9 The third similarity algorithm is the Levenshtein edit distance. This is an enhancement of the Hamming distance: substitutions, additions and deletions are all allowed. Example: explain the formula and the matrix.
  • #12 After selecting a list of names, name pairs are created using a mirror matrix concept: create a Cartesian product and filter out one side of the matrix. Pairs that combine a name with itself are also skipped. We achieve this by … (show code)
  • #13 Once scores are determined and the