This document provides an overview of matching concepts and introduces Ascential QualityStage. It covers variations and errors in data, and parsing, cleansing, and standardization. It explains the QualityStage architecture, with its Designer client and QualityStage server components, and outlines QualityStage procedures for standardization, matching, and de-duplication. Matching techniques such as phonetic coding, n-gram matching, and scoring are also summarized.
Quality Stage Standardization & Matching Training Edit007.ppt
1. Matching Concepts and an Introduction to Ascential QualityStage
Training Material
BI Practice, Chennai
2. Objective of the Training
Matching concepts:
• Matching
• Variations & errors
• Parsing, cleansing & standardization
QualityStage:
• Tool architecture – Designer client, QualityStage (QS) server
• Standardization – '.cls', '.pat' and '.tbl' rule files
3. Matching
Terminology:
• Name & address matching
• De-duplication, unduplication, merge-purge
• Customer ID (UID) generation
• Householding
When?
• Data warehouse projects
• Application integration
• Business mergers
• Data acquisition
Drivers:
• Absence of a persistent identifying key between the data sources
• Absence of a global standard for representation between the data sources
4. Matching Example
Society Of St. Vincent De Paul
The Scty Of Saint Vncnt De Pau
St Vincent De Paul Society
Sosiety Of Saint Vincent Dpl

Clymer Atty At Law Brian
Brian I Clymer Attorney At Law

Arizona Dept Of Agricutlture
Dept Of Agri Arizona
Arizona State Dept Of Agri
Az Agri Dept

• A fact can be represented in multiple standard forms
• Standards change over time
• Errors and variations occur during data capture & processing
• In practice, multiple standard forms/formats are used for data capture, processing and storage

Duplicate detection:
• Database consolidation
• Application consolidation
Query:
• Removing felons from voter lists
• List processing
8. Variation & Errors
Errors may include non-standard variations, additional words, missing words, or unknown data.
• Synonyms & nicknames
• Prefix & suffix variations
• Abbreviations & acronyms
• Anglicization & foreign versions of names
• Spelling, typing & phonetic errors
• Initials, inconsistently abbreviated names
• Transposition (word-sequence variations)
• Truncation & missing words
• Extra words
• Format, character & convention variations
Examples:
Dr John Doe Med. Doctor / Dr John Doe MD
Saint Louis University / St. Louis Univ.
Tata Consultancy Services Inc / TCS Incorporated
University of South Florida / South-Florida University (USF)
ABC CO Attn: Mr. Clark / The ABC CO, City of New York
Bill Clinton / William Clinton
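Phonetic coding is one way to absorb the spelling, typing and phonetic errors listed above: names that sound alike are reduced to the same short code before comparison. Below is a minimal sketch of classic American Soundex, shown purely as an illustration; it is not necessarily the phonetic routine QualityStage itself applies.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter + three digits."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":          # h and w are skipped and do not break a run
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code             # vowels (code == "") reset the run
    return (result + "000")[:4]
```

With this, "Smith" and "Smyth" both encode to S530, so the phonetic spelling variation disappears before matching.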
9. Matching Algorithm
Input File → Parsing, Cleansing & Standardization → Candidate Selection (against Reference Records) → Matching & Scoring → Apply Threshold/Cutoff → Match / No Match
The steps are driven by cleansing rules, matching rules, and a cutoff score.
• Reduce variations & errors through cleansing & standardization
• Retain the differentiators & remove the noise
• Filter dissimilar records and match the similar records
• Use fuzzy matching to handle unresolved variations & errors
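The fuzzy-matching step can be sketched with n-gram (q-gram) comparison, one of the techniques mentioned in this deck's summary: two strings are similar when they share many short substrings, which tolerates unresolved spelling variations. This is a hedged illustration of the idea, not the tool's built-in comparator.

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams, padded with '#' so boundary characters count."""
    s = f"#{s.lower()}#"
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over shared q-grams: 1.0 identical, 0.0 disjoint."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    shared = sum((ga & gb).values())
    return 2 * shared / (sum(ga.values()) + sum(gb.values()))
```

For example, "Vincent" and the truncated "Vncnt" from the earlier slide still share several bigrams, so they score well above zero while unrelated strings score near it.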
10. Parsing, Cleansing & Standardization
Raw Input → Lexical Analysis (Tokenization) → Contextual Parsing → Output

"Dr John Doe Jr PhD"
Tokens: Dr|John|Doe|Jr|PhD
Token types: Prefix|First|alpha|Gen|Suffix → Prefix|First|Last|Gen|Suffix
Output: Prefix = Dr., First Name = John, Last Name = Doe, Generation = Jr., Suffix = PhD

"123 Main Street Suite 101"
Tokens: 123|Main|Street|Suite|101
Token types: NNN|alpha|Type|Unit|NNN → Hsn|Street|Type|Unit|Unit#
Output: House Number = 123, Street Name = Main, Street Type = St, Unit Type = Ste, Unit Number = 101
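The address example above can be sketched in code. The token classes, the two lookup tables, and the single hard-coded pattern are simplifications chosen to reproduce this one slide; a real rule set (the '.cls'/'.pat' files) carries classification tables and many pattern-action rules.

```python
# Illustrative classification tables (a real '.cls' file is far larger).
STREET_TYPES = {"street": "St", "st": "St", "avenue": "Av", "av": "Av"}
UNIT_TYPES = {"suite": "Ste", "ste": "Ste", "apt": "Apt"}

def classify(token: str) -> str:
    """Assign a token type: NNN (numeric), Type, Unit, or alpha."""
    t = token.lower()
    if token.isdigit():
        return "NNN"
    if t in STREET_TYPES:
        return "Type"
    if t in UNIT_TYPES:
        return "Unit"
    return "alpha"

def parse_address(text: str):
    """Tokenize, build the pattern, and break it into attributes."""
    tokens = text.split()
    pattern = "|".join(classify(t) for t in tokens)
    if pattern == "NNN|alpha|Type|Unit|NNN":   # Hsn|Street|Type|Unit|Unit#
        return {"House Number": tokens[0],
                "Street Name": tokens[1],
                "Street Type": STREET_TYPES[tokens[2].lower()],
                "Unit Type": UNIT_TYPES[tokens[3].lower()],
                "Unit Number": tokens[4]}
    return None  # no pattern rule fired
```

Note that the stored values are the standard forms ("St", "Ste"), not the raw tokens, which is what makes later matching reliable.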
11. Parsing, Cleansing & Standardization
Parsing: understanding the parts to build a structure, and breaking the structure into meaningful parts.
• Identify the character set (code page)
• Translate the code page
• Identify delimiters, operators, punctuation, allowable characters and special characters; ignore the rest
• Parse text into tokens
• Assign token types and build a pattern (sentence structure)
• Break the pattern into individual attributes (based on context)
• Store the standard form for each parsed attribute
Rules development – identify words, then assign word types and standard values; define patterns and parsing rules. Guided by:
• The 80-20 rule
• Frequencies
• Context & data placement
12. Candidate Selection
Candidate selection is the process of identifying likely matching records.
Parsed & standardized output → derive candidate key → candidate selection.
Input record:
Dr. John Doe Jr. PhD
123 Main St Ste 101
Cottonwood, CA 92626
Candidate key: 926-Ma-Do-Jo
Candidates selected:
John Doe 123 Main St Cottonwood CA 92626
Jones Donald 123 Maple Av 101 Cottonwood CA 92626
Joseph Don 456 Main Ln Cottonwood CA 92626
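Deriving the key "926-Ma-Do-Jo" from the sample record can be sketched directly (field names are assumptions for the sketch): 3 bytes of ZIP, then 2 bytes each of street, last, and first name.

```python
# Sketch of candidate-key derivation for the example above.
def candidate_key(rec: dict) -> str:
    return "-".join([rec["zip"][:3], rec["street"][:2],
                     rec["last"][:2], rec["first"][:2]])

rec = {"first": "John", "last": "Doe", "street": "Main", "zip": "92626"}
print(candidate_key(rec))  # 926-Ma-Do-Jo
```

Records sharing this key land in the same candidate cluster, which is why "Joseph Don 456 Main Ln" can still surface as a candidate while records in other ZIP areas never get compared.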
13. Candidate Keys
A candidate key is a derivative of the entity/record to be matched; it forms clusters of similar records.
Small keys form large clusters (general); large keys form small clusters (restrictive).
Other names: candidate code, blocking key, window key.
Design considerations:
Use multiple blocking keys (to handle missing values and variations).
Balance performance (candidate-set size) against miss rate (quality) and hit rate (matching).
Use of a candidate key decreases the cost of matching by reducing the number of records being compared, resulting in higher throughput and performance.
Data skew will cause large clusters.
Matching – O(n); de-duplication – O(n²).
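A quick back-of-the-envelope count shows why blocking matters for the O(n²) de-duplication case: with k evenly sized blocks, each record is only compared within its own block.

```python
# Pair counts with and without blocking, for n records and k blocks.
n, k = 1_000_000, 1000
all_pairs = n * (n - 1) // 2                     # brute-force de-dup
per_block = (n // k) * ((n // k) - 1) // 2       # pairs inside one block
blocked_pairs = k * per_block                    # pairs across all blocks
print(all_pairs)                    # 499999500000
print(blocked_pairs)                # 499500000
print(all_pairs // blocked_pairs)   # 1001 (roughly 1000x fewer comparisons)
```

The even-block assumption is the idealized case; data skew (one huge ZIP code, say) pushes the count back toward the brute-force figure, which is the "large clusters" caveat above.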
14. Matching & Scoring
Parsed & standardized output is scored against the candidates.
Input: Dr. John Doe Jr. PhD 123 Main St Ste 101 Cottonwood, CA 92626
Rule: if score >= 95 then Match, otherwise No-Match.
John Doe 123 Main St Cottonwood CA 92626 → YYYY--YYY = 100 → match
Jones Donald 123 Maple Av 101 Cottonwood CA 92626 → xxYx--YYY = 70 → no-match
Joseph Don 456 Main Ln Cottonwood CA 92626 → xxxY--YYY = 70 → no-match
(Y = field agrees, x = field disagrees.)
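Field-wise scoring against the cutoff of 95 can be sketched as a weighted sum (the field list and weights below are illustrative, not QualityStage's actual values):

```python
# Each agreeing field contributes its weight; the total is the score.
WEIGHTS = {"first": 20, "last": 30, "house": 15, "street": 15, "zip": 20}

def score(a: dict, b: dict) -> int:
    return sum(w for f, w in WEIGHTS.items() if a.get(f) == b.get(f))

inp = {"first": "john", "last": "doe", "house": "123",
       "street": "main", "zip": "92626"}
cand = {"first": "john", "last": "doe", "house": "123",
        "street": "main", "zip": "92626"}
print(score(inp, cand), score(inp, cand) >= 95)  # 100 True
```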
15. Matching Functions
Exact matching
Phonetic matching: Soundex, NYSIIS (New York State Identification and Intelligence System)
Edit distance
Prefix, suffix & initial matching
Acronym matching
String matching (exact, approximate)
Interval matching (numeric data)
Date matching (exact, difference)
… other tool-specific algorithms
Word, field and string matching apply to data such as: Name, Address, URL, E-mail ID, SSN, Phone, Date, Number.
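Of these, Soundex is compact enough to sketch. The version below assumes the standard American Soundex rules (first letter kept; consonants mapped to digit classes; vowels reset the previous code; h and w do not separate identical codes):

```python
# Standard American Soundex sketch: name -> letter + 3 digits.
def soundex(name: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                  # h/w do not separate identical codes
        if ch in "aeiouy":
            prev = ""                 # vowels reset the previous code
            continue
        code = codes.get(ch, "")
        if code and code != prev:     # collapse adjacent identical codes
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that sound alike hash to the same code ("Robert" and "Rupert" both give R163), so phonetic matching survives the spelling and typing errors listed on slide 8.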
16. Deterministic & Probabilistic Matching
Are these two records a match?
WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
Deterministic decision tables: fields are evaluated for degree of match and a letter grade is assigned; the grades form a "match pattern" which is looked up in a table to determine whether the pair Matches, Fails, or is Suspect.
B B A A B D B A = BBAABDBA
Probabilistic linkage: fields are evaluated for degree of match and a weight is assigned which represents the "informational content" contributed by those values; the weights are summed to derive a total score that measures the statistical probability of a match.
+9 +2 +14 +5 +4 -1 +5 +11 = +49
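The per-field weights in probabilistic linkage are conventionally derived from m- and u-probabilities (Fellegi–Sunter style): m = P(field agrees | records match), u = P(field agrees | records do not match). The probabilities below are illustrative:

```python
# Agreement weight = log2(m/u); disagreement weight = log2((1-m)/(1-u)).
# A rare, reliable field (high m, low u) earns a large positive weight.
from math import log2

def weights(m: float, u: float):
    return log2(m / u), log2((1 - m) / (1 - u))

agree, disagree = weights(m=0.9, u=0.1)
print(round(agree, 2), round(disagree, 2))  # 3.17 -3.17
```

This is why a matching SSN contributes far more weight than a matching gender code: u is tiny for SSN (unrelated people rarely share one) but close to 0.5 for gender.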
17. Introduction to Ascential QualityStage
Training material, BI Practice, Chennai.
18. QualityStage Procedures (Stages)
Utility procedures:
FFC – file format converter (delimited to fixed and vice versa).
GTF – code page translation, data types, column derivation, etc.
SLC – column and row filtering.
SORT – reorder data files.
UNI – inner, left, right & full outer joins on flat files.
PGM – run command-line programs from within the procedure.
Analysis/investigation procedures:
CLP – field domain frequency distribution.
PRS – parse (space-delimited) free-form text into words for analysis.
NMA – name abbreviation key generation for matching.
Character, word & pattern investigation/analysis.
Standardization: parsing, cleansing & standardization of Name, Address, Phone, URL, E-mail ID and other domains.
Matching: de-duping and reference matching.
Survivorship: cross-population of fields within a duplicate group.
19. QualityStage Project
A QualityStage project (QS-Project-A) contains jobs, stages, files, and rules (parsing, cleansing & standardization rules; matching rules; survivorship rules).
Example project flow:
Job-1: Raw Data → Standardization (STAN) → standardized output, with a Filter (SLC) separating good records from bad-stan records (non-standard data).
Job-2: standardized data → De-Dup (UNDUP) → duplicate groups; Reference Match (GEOREF) against the Ref File → Matched (old) and No-Match (new) records; New ID Assignment for the new records; Collect (UNIX) gathers the streams into the Output.
Job-3: add the new records to the Ref. File.
20. QualityStage Tool Architecture
QS Designer Client (Windows PC): the QS developer creates, reads, updates and deletes (CRUD) projects, jobs, stages, files, and parsing/cleansing/standardization, matching and survivorship rules; designs are stored locally (x.mdb) and can be exported/imported as .imf files.
QS Server (Unix or Windows): holds the project work area used for deployment of jobs & rules, and runs deployed jobs.
Interaction: the client deploys the job, tells the server to run it, and the job status is reported back to the client.
21. QualityStage Standardization Procedure
The Standardization stage reads the raw input (INREC, x bytes) and appends the parsed fields defined by the dictionary (DCT, d bytes), so OUTREC = (d + x) bytes.
Rule-set files for the US name domain:
USNAME.PRC
USNAME.DCT – dictionary: layout of the parsed fields.
USNAME.CLS – word classification table.
USNAME.PAT – pattern rules.
USNAME.UCL – user-defined pattern rules.
USNAMEIP.TBL, USNAMEIT.TBL, USNAMEMF.TBL, USNAMEUP.TBL, USNAMEUT.TBL, USFIRSTN.TBL, USGENDER.TBL – lookup tables for special processing.
22. Word Classification
Word class: 1 byte. User-defined classes use A–Z; there are also implicit types.
NULL type: the numeric zero '0' nullifies the token, so the token does not participate in pattern parsing.
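The classification step ahead of pattern parsing can be sketched as a table lookup (the table and class letters below are illustrative, not the contents of USNAME.CLS):

```python
# Sketch of word classification: each token maps to a one-byte class;
# class '0' (NULL) drops the token from the pattern entirely.
CLASSES = {"DR": "P", "MR": "P", "JR": "G", "PHD": "S", "THE": "0"}

def pattern(tokens):
    out = []
    for t in tokens:
        cls = CLASSES.get(t.upper(), "+")   # '+' = unclassified word
        if cls == "0":
            continue                        # NULL class nullifies the token
        out.append(cls)
    return "|".join(out)

print(pattern(["The", "Dr", "John", "Doe", "Jr"]))  # P|+|+|G
```

Note how "The" vanishes from the pattern: that is the NULL-type behavior described above.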
23. Word Classification file (USNAME.CLS)
24. Dictionary file (USNAME.DCT)
FORMAT SORT=N
;-------------------------------------------------------------------------------
; USNAME Dictionary File
;-------------------------------------------------------------------------------
; Business Intelligence Fields
;-------------------------------------------------------------------------------
NT C 1 S NameType ;0001-0001
GC C 1 S GenderCode ;0002-0002
NP C 20 S NamePrefix ;0003-0022
FN C 25 S FirstName ;0023-0047
MN C 25 S MiddleName ;0048-0072
LN C 50 S PrimaryName ;0073-0122
NG C 10 S NameGeneration ;0123-0132
NS C 20 S NameSuffix ;0133-0152
AN C 50 S AdditionalNameInformation ;0153-0202
;-------------------------------------------------------------------------------
; Matching Fields
;-------------------------------------------------------------------------------
MF C 25 S MatchFirstName ;0203-0227
NF C 8 X NYSIISofMatchFirstName ;0228-0235
SF C 4 Z RSoundexofMatchFirstName ;0236-0239
ML C 50 S MatchPrimaryName ;0240-0289
HK C 10 S HashKeyofMatchPrimaryName ;0290-0299
PK C 20 S PackedKeyofMatchPrimaryName ;0300-0319
NW C 1 S NumberofMatchPrimaryWords ;0320-0320
W1 C 15 S MatchPrimaryWord1 ;0321-0335
W2 C 15 S MatchPrimaryWord2 ;0336-0350
W3 C 15 S MatchPrimaryWord3 ;0351-0365
W4 C 15 S MatchPrimaryWord4 ;0366-0380
W5 C 15 S MatchPrimaryWord5 ;0381-0395
N1 C 8 X NYSIISofMatchPrimaryWord1 ;0396-0403
S1 C 4 Z RSoundexofMatchPrimaryWord1 ;0404-0407
N2 C 8 X NYSIISofMatchPrimaryWord2 ;0408-0415
S2 C 4 Z RSoundexofMatchPrimaryWord2 ;0416-0419
;-------------------------------------------------------------------------------
; Reporting Fields
;-------------------------------------------------------------------------------
UP C 30 S UnhandledPattern ;0420-0449
UD C 100 S UnhandledData ;0450-0549
IP C 30 S InputPattern ;0550-0579
ED C 25 S ExceptionData ;0580-0604
UO C 2 S UserOverrideFlag ;0605-0606
Each dictionary (USNAME.DCT) line defines one field: the field name (e.g. NT), data type (C = character), length in bytes, a null-handling flag (S = space as null, Z = zero as null, X = both as null), a display name, and a comment giving the field's byte offsets (e.g. ;0001-0001).
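Since the dictionary fully determines the fixed-width output layout, a record can be sliced back into named fields directly from it. A sketch (assuming whitespace-separated name, type, length, null flag and display name per line, as in the listing above):

```python
# Parse DCT-style lines and compute each field's byte offsets, then
# slice a fixed-width record into named fields.
def parse_dict(lines):
    fields, offset = [], 0
    for line in lines:
        line = line.split(";")[0].strip()   # drop the offset comment
        if not line or line.startswith("FORMAT"):
            continue
        name, _type, length, _nulls, display = line.split()[:5]
        fields.append((display, offset, offset + int(length)))
        offset += int(length)
    return fields

dct = ["NT C 1 S NameType ;0001-0001",
       "GC C 1 S GenderCode ;0002-0002",
       "NP C 20 S NamePrefix ;0003-0022"]
record = "PM" + "Dr".ljust(20)              # a 22-byte sample record
for display, start, end in parse_dict(dct):
    print(display, "=", record[start:end].rstrip())
```

Running this prints NameType = P, GenderCode = M, NamePrefix = Dr, matching the cumulative offsets shown in the listing's comments.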
26. QualityStage Matching Procedure
UNDUP – de-duplication of one file.
Inputs: standardized output; matching rules (*.MAT).
Outputs: Dup Groups (record groups of 2 or more records) and Residuals (singleton records).
27. QualityStage Matching Procedure (contd.)
GEOMATCH – one-to-many matching: one record from File-A can match many records in File-B.
Inputs: standardized output (A); reference file (B); matching rules (*.MAT).
Outputs: Matches (A→B) – record groups of 2 or more records; Dup Groups (B) – record groups of 2 or more records; Residuals (A) and Residuals (B) – singleton records.
29. Multi-Pass Matching
A match application may use up to 7 passes, each with its own blocking key and matching fields.
Pass-1 blocking key: 3 bytes of ZIP + 2 bytes of street name + 2 bytes of last name + 2 bytes of first name.
Pass-1 matching fields: first name, middle name, last name, house number, street name, ZIP.
Pass-2 blocking key: 2 bytes of state code + 3 bytes of city name + 3 bytes of street name + 2 bytes of last name + 2 bytes of first name.
Pass-2 matching fields: first name, middle name, last name, house number, street name, city.
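Multi-pass candidate selection can be sketched as a union of pairs over several blocking keys (illustrative keys, not the pass definitions above): a pair missed by one key, say because of a ZIP error, can still be caught by another.

```python
# Union candidate pairs across several blocking-key passes.
def multi_pass_pairs(records, key_funcs):
    pairs = set()
    for key in key_funcs:
        blocks = {}
        for i, r in enumerate(records):
            blocks.setdefault(key(r), []).append(i)
        for ids in blocks.values():            # pair up within each block
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    pairs.add((ids[a], ids[b]))
    return pairs

recs = [{"zip": "92626", "last": "Doe", "first": "John"},
        {"zip": "92626", "last": "Doe", "first": "Jon"},
        {"zip": "92000", "last": "Doe", "first": "John"}]  # bad ZIP
pass1 = lambda r: r["zip"][:3] + r["last"][:2]             # ZIP-based key
pass2 = lambda r: r["last"][:2] + r["first"][:2]           # name-based key
print(sorted(multi_pass_pairs(recs, [pass1, pass2])))
```

Pass-1 alone never pairs record 2 with the others (its ZIP differs), but pass-2's name-based key recovers it, so all three pairs are produced.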
30. Match Pass Parameters (work in progress)
Per pass and variable type (VarType): M-prob, U-prob, agreement weight, disagreement weight.
To be completed…