Matching Concepts and an Introduction to
Ascential QualityStage
Training Material
BI Practice, Chennai
Objective of the Training
Matching
Variations & Errors
Parsing, Cleansing & Standardization
Tool architecture – Designer-client, QS-server
Standardization – ‘.cls’, ‘.pat’, ‘.tbl’
Matching
Concepts
QualityStage
Matching
Name & Address Matching
De-duplication, Unduplication,
Merge-Purge, Customer ID
generation (UID)
Householding
Data warehouse projects
Application integration
Business mergers
Data acquisition
When?
Terminology
Absence of a persistent
identifying key between the
data sources.
Absence of a global standard
for representation between the
data sources.
Drivers
Matching Example
Society Of St. Vincent De Paul
The Scty Of Saint Vncnt De Pau
St Vincent De Paul Society
Sosiety Of Saint Vincent Dpl
Clymer Atty At Law Brian
Brian I Clymer Attorney At Law
Arizona Dept Of Agricutlture
Dept Of Agri Arizona
Arizona State Dept Of Agri
Az Agri Dept
A fact can be represented in multiple standard forms
Standards change over time
Errors and variations occur during data capture & processing
In practice, multiple standard forms and formats are used for data
capture, processing and storage
Duplicate detection
• Database consolidation
• Application consolidation
Query
• Removing felons from the voters' list.
• List processing
Variation & Errors
Errors may include non-standard variations, additional
words, missing words, or unknown data.
Synonyms & nicknames
Prefix & suffix variations
Abbreviation & Acronyms
Anglicization & foreign versions of
names
Spelling, typing & phonetic errors
Initials, inconsistently abbreviated
names
Transposition (Word sequence
variations)
Truncation & missing words
Extra words
Format, character & convention variations
Dr John Doe Med. Doctor
Dr John Doe MD
Saint Louis University
St. Louis Univ.
Tata Consultancy Services Inc
TCS Incorporated
University of south Florida
South-Florida University(USF)
ABC CO Attn: Mr. Clark
The ABC CO, City of New York
Bill Clinton
William Clinton
Parsing,
Cleansing &
Standardization
Candidate
selection
Matching &
Scoring
Apply Threshold/
Cutoff
Reference
Records
No Match
Match
Matching rules
Cutoff
Cleansing rules
Matching Algorithm
Input
File
Reduce variations & errors
through Cleansing &
Standardization
Retain the differentiators &
remove the noise
Filter dissimilar records and
match the similar records
Use fuzzy matching to handle
unresolved variations & errors
Raw Input
  Name:    “Dr John Doe Jr PhD”
  Address: “123 Main Street Suite 101”
Tokenization
  Name:    Dr | John | Doe | Jr | PhD
  Address: 123 | Main | Street | Suite | 101
Lexical Analysis
  Name:    Prefix | First | alpha | Gen | Suffix
  Address: NNN | alpha | Type | Unit | NNN
Contextual Parsing
  Name:    Prefix | First | Last | Gen | Suffix
  Address: Hsn | Street | Type | Unit | Unit#
Output
  Prefix = Dr.
  First Name = John
  Last Name = Doe
  Generation = Jr.
  Suffix = PhD
  House Number = 123
  Street Name = Main
  Street Type = St
  Unit Type = Ste
  Unit Number = 101
Parsing, Cleansing & Standardization
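The tokenization and lexical analysis steps on this slide can be sketched in Python. This is only an illustration, not how QualityStage implements them: the CLASSIFICATION table, the class names and the helper functions are assumptions standing in for a real word classification table.

```python
import re

# Hypothetical mini classification table: token -> (class, standard form).
CLASSIFICATION = {
    "DR": ("Prefix", "Dr."),
    "JOHN": ("First", "John"),
    "JR": ("Gen", "Jr."),
    "PHD": ("Suffix", "PhD"),
    "STREET": ("Type", "St"),
    "SUITE": ("Unit", "Ste"),
}

def tokenize(text):
    """Split on whitespace, commas and periods, dropping empty tokens."""
    return [t for t in re.split(r"[\s,.]+", text) if t]

def classify(token):
    """Return (class, standard form); digits are 'NNN', unknown words 'alpha'."""
    if token.isdigit():
        return ("NNN", token)
    return CLASSIFICATION.get(token.upper(), ("alpha", token))

def pattern(text):
    """Build the token-type pattern for a free-form string."""
    return " | ".join(classify(t)[0] for t in tokenize(text))

def standard_forms(text):
    """Return the standard form of every token."""
    return [classify(t)[1] for t in tokenize(text)]

print(pattern("Dr John Doe Jr PhD"))
print(pattern("123 Main Street Suite 101"))
print(standard_forms("123 Main Street Suite 101"))
```

Contextual parsing (deciding that the unclassified "alpha" after a first name is the last name) is left out here; it depends on pattern rules such as those in the pattern file.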
Identify Character set (Code page)
Translate Code page
Identify delimiters, operators, punctuations, allowable
characters and special-characters. Ignore the rest.
Parse text into tokens
Assign token types and build a pattern (sentence structure)
Break the pattern into individual attributes (based on context)
Store the standard form for each parsed attribute.
Parsing: “Understanding the parts to build a structure and
breaking the structure into meaningful parts”.
Parsing, Cleansing & Standardization
Rules Development
- Identify words and assign word types and standard values.
- Define patterns and parsing rules.
• 80-20 rule
• Frequencies
• Context & data placement
Derive
Candidate
Key
Parsed &
Stan. Output
Candidate
Selection
Dr. John Doe Jr. PhD
123 Main St Ste 101
Cottonwood, CA 92626
926-Ma-Do-Jo
John Doe 123 Main St Cottonwood CA 92626
Jones Donald 123 Maple Av 101 Cottonwood CA 92626
Joseph Don 456 Main Ln Cottonwood CA 92626
Dr. John Doe Jr. PhD
123 Main St Ste 101
Cottonwood, CA 92626
Candidate Selection
Candidate Selection is the process of identifying likely
matching records.
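The key 926-Ma-Do-Jo on this slide appears to be built from the first 3 characters of the zip code and the first 2 characters of the street name, last name and first name. A minimal Python sketch under that assumption follows; the record layout (dictionaries with these field names) is hypothetical.

```python
from collections import defaultdict

def candidate_key(rec):
    """Derive a blocking key: 3 chars of zip, 2 each of street, last, first."""
    return "-".join([rec["zip"][:3], rec["street"][:2],
                     rec["last"][:2], rec["first"][:2]])

def block(records):
    """Group records that share the same candidate key."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[candidate_key(rec)].append(rec)
    return dict(clusters)

# Records taken from the slide, already parsed and standardized.
input_rec = {"first": "John", "last": "Doe", "street": "Main", "zip": "92626"}
reference = [
    {"first": "John", "last": "Doe", "street": "Main", "zip": "92626"},
    {"first": "Jones", "last": "Donald", "street": "Maple", "zip": "92626"},
    {"first": "Joseph", "last": "Don", "street": "Main", "zip": "92626"},
]

clusters = block(reference)
candidates = clusters.get(candidate_key(input_rec), [])
print(candidate_key(input_rec), len(candidates))
```

All three reference records share the key, so all three become candidates even though only one is a true match; the scoring step is what separates them.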
Derivative of the entity/record to be matched
Forms Clusters of similar records
Small keys form Large Clusters (General)
Large keys form small clusters (restrictive)
Other names
- Candidate Code
- Blocking Key
- Window Key
Design considerations
- Multiple blocking keys (handle missing values, variations)
- Balance between performance (candidate set size), miss-rate (quality) and hit-rate (matching)
Using a candidate key decreases the cost of
matching by reducing the number of records
compared, which results in higher throughput
and performance.
Data skew will cause large clusters.
Candidate Keys
Matching – O(n)
De-duplication – O(n^2)
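The effect of candidate keys on the cost of de-duplication can be illustrated by counting pairwise comparisons. The sketch below is simple arithmetic; the record counts and cluster sizes are made-up figures chosen only to show the effect of blocking and of data skew.

```python
def pairs_without_blocking(n):
    """Every record is compared with every other record: n(n-1)/2 pairs."""
    return n * (n - 1) // 2

def pairs_with_blocking(cluster_sizes):
    """Records are compared only within their own cluster."""
    return sum(s * (s - 1) // 2 for s in cluster_sizes)

n = 1000
even_clusters = [10] * 100            # 100 clusters of 10 records each
skewed_clusters = [910] + [1] * 90    # one huge cluster caused by data skew

print(pairs_without_blocking(n))             # 499500
print(pairs_with_blocking(even_clusters))    # 4500
print(pairs_with_blocking(skewed_clusters))  # 413595
```

With evenly sized clusters the comparison count drops by about two orders of magnitude, while the skewed case gives back most of the saving.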
Candidates
Matching &
Scoring
Dr. John Doe Jr. PhD 123 Main St Ste 101 Cottonwood, CA 92626
If Score >= 95 then Match
Otherwise No-Match
YYYY--YYY = 100
xxYx--YYY = 70
xxxY--YYY = 70
John Doe 123 Main St Cottonwood CA 92626
Jones Donald 123 Maple Av 101 Cottonwood CA 92626
Joseph Don 456 Main Ln Cottonwood CA 92626
Matching & Scoring
Parsed &
Stan. Output
match
no-match
no-match
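The agreement patterns and scores on this slide can be reproduced with a small Python sketch. The slide does not say which fields the nine pattern positions stand for or how they are weighted, so the field list, the two unscored positions and the WEIGHTS below are assumptions chosen only so that the numbers come out as shown (100, 70, 70).

```python
FIELDS = ["first", "last", "house", "street", "type", "unit",
          "city", "state", "zip"]
# Hypothetical weights; "type" and "unit" are not scored and show as "-".
WEIGHTS = {"first": 10, "last": 10, "house": 10, "street": 10,
           "city": 20, "state": 20, "zip": 20}
CUTOFF = 95  # from the slide: score >= 95 means match

def make(first, last, house, street, type_, unit=""):
    """Build a standardized record; city, state and zip are the slide's."""
    return dict(first=first, last=last, house=house, street=street,
                type=type_, unit=unit, city="Cottonwood", state="CA",
                zip="92626")

def score(a, b):
    """Return (agreement pattern, total score, decision) for two records."""
    pattern, total = "", 0
    for field in FIELDS:
        if field not in WEIGHTS:
            pattern += "-"
        elif a[field] == b[field]:
            pattern += "Y"
            total += WEIGHTS[field]
        else:
            pattern += "x"
    return pattern, total, "match" if total >= CUTOFF else "no-match"

INPUT = make("John", "Doe", "123", "Main", "St", "101")
CANDIDATES = [
    make("John", "Doe", "123", "Main", "St"),
    make("Jones", "Donald", "123", "Maple", "Av", "101"),
    make("Joseph", "Don", "456", "Main", "Ln"),
]
RESULTS = [score(INPUT, c) for c in CANDIDATES]
for result in RESULTS:
    print(result)
```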
Exact matching
Phonetic matching
Soundex
NYSIIS (New York State Identification and
Intelligence System)
Edit Distance
Prefix, Suffix & Initial matching
Acronym matching
String matching (exact, approx)
Interval matching (numeric data)
Date matching (exact, diff)
… other tool specific algorithms
Matching Functions
Word
Matching
Field
Matching
String
Matching
Name
Address
URL, E-mail ID
SSN
Phone
Date
Number
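Two of the functions listed above, Soundex and edit distance, are easy to sketch in Python. These are the textbook American Soundex and Levenshtein distance, written here for illustration; they are not the tool's own implementations, and NYSIIS is not covered.

```python
SOUNDEX_CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(name):
    """American Soundex: first letter plus three digits."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    code = name[0]
    prev = SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate equal codes
            prev = digit
    return (code + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(soundex("WILLIAM"), soundex("WILLAIM"))
print(edit_distance("WILLIAM", "WILLAIM"), edit_distance("MAIN", "MAINE"))
```

A phonetic code treats the transposed spelling WILLAIM as identical to WILLIAM, while edit distance reports how far apart the two strings are; the two kinds of function complement each other.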
Deterministic & Probabilistic Matching
WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
Deterministic Decision Tables: Fields are evaluated for degree-of-match
and a letter grade is assigned; the grades form a “match pattern”
which is looked up in a table to determine if the pair Matches, Fails, or is
Suspect.
Are these two records a match?
B B A A B D B A = BBAABDBA
+9 +2 +14 +5 +4 -1 +5 +11 = +49
Probabilistic Linkage: Fields are evaluated for degree-of-match and a
weight is assigned which represents the “informational content”
contributed by those values; the weights are summed to derive a total
score that measures the statistical probability of a match.
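The two approaches can be contrasted in a short Python sketch that uses the grades and weights from the slide. The decision-table entry and the two cutoff values are hypothetical; the slide does not give them.

```python
GRADES = ["B", "B", "A", "A", "B", "D", "B", "A"]   # from the slide
WEIGHTS = [9, 2, 14, 5, 4, -1, 5, 11]               # from the slide

# Hypothetical decision table: match pattern -> outcome.
DECISION_TABLE = {"BBAABDBA": "Suspect"}

def deterministic(grades, table, default="Fail"):
    """Look the concatenated grade pattern up in a decision table."""
    return table.get("".join(grades), default)

def probabilistic(weights, match_cutoff, clerical_cutoff):
    """Sum the field weights and compare the total with two cutoffs."""
    total = sum(weights)
    if total >= match_cutoff:
        return total, "Match"
    if total >= clerical_cutoff:
        return total, "Suspect"
    return total, "Fail"

print(deterministic(GRADES, DECISION_TABLE))
print(probabilistic(WEIGHTS, match_cutoff=40, clerical_cutoff=20))
```

The deterministic approach needs a table entry for every pattern that should not fail, whereas the probabilistic approach needs only per-field weights and cutoffs.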
Introduction to
Ascential QualityStage
Name
Address
Phone
URL, E-mail ID
Others
FFC - File format converter (delimited to fixed and vice versa).
GTF - Code page, data type, column derivation etc.
SLC - Column and row filtering.
SORT - Reorder data files.
UNI - Inner join, left, right and full outer join on flat files.
QS
Procedures
CLP - Field domain frequency distribution.
PRS - Parse (space-delimited) free-form text into words for analysis.
NMA - Name abbreviation key generation for matching.
PGM - Run command-line programs from within the procedure.
Char, Word & Pattern
investigation/analysis.
Parsing
Cleansing &
Standardization
De-duping,
Reference Matching
Cross population of
fields within a
duplicate group.
Utility
Procedures
Standardization
Matching
Survivorship
Analysis,
Investigation
procedures
QualityStage Procedures (Stages)
Standardization (STAN)
Ref. Match
(GEOREF)
De-Dup
(UNDUP)
New ID
Assignment
Filter
(SLC)
Std
Out
Bad
Stan
Good
Ref
File
No
Match
Match
Dup
Groups
Collect
(UNIX)
New
Recs
Output
Raw
Data
No
Match
(New)
Matched
(Old)
Non
Standard
Data
Standardized
Data
Raw
Data
Job-1
Job-2
Matching
Rules
Stan.
Rules
QS-Project-A
Job-3 Add to Ref.
File
Matching rules
Survivorship rules
Parsing,
Cleansing &
Standardization
Rules
Jobs
Stages
Files
QualityStage Project
Create
Read
Update
Delete
Submit
Job
Tell server to Run Job
Job status reported to client
CRUD
QS Designer
Client
QS Server
QS
Developer
x.mdb
Deploy Job
Run Job
Project
.imf
Export
Import
Export
Import CRUD
QS
Job
Server
Windows
PC
Projects
Jobs
Stages
Files
Parsing,
Cleansing &
Standardization
Rules
Work area used
for Deployment
of Jobs & Rules
Matching rules
Survivorship rules
QualityStage Tool Architecture
Unix or
Windows
OUTREC = (d + x) bytes
QualityStage Standardization Procedure
Standardization
Raw
Input
USNAME.PRC
USNAMEIP.TBL
USNAMEIT.TBL
USNAMEMF.TBL
USNAMEUP.TBL
USNAMEUT.TBL
USFIRSTN.TBL
USGENDER.TBL
USNAME.DCT
USNAME.CLS
USNAME.PAT
USNAME.UCL
Standardized
Output
INREC – x bytes INREC – x bytes
DCT – d bytes
DCT – d bytes
Layout of the
parsed fields
Word
Classification
Table
Pattern Rules
User defined
Pattern Rules
Lookup tables
for special
processing
Word Classification
Word Class
1 byte
User defined
(A-Z)
Implicit types
• Numeric Zero ‘0’ will
nullify the token.
• The token will not
participate in the
pattern parsing.
NULL Type
Word Classification file (USNAME.CLS)
FORMAT SORT=N
;-------------------------------------------------------------------------------
; USNAME Dictionary File
;-------------------------------------------------------------------------------
; Business Intelligence Fields
;-------------------------------------------------------------------------------
NT C 1 S NameType ;0001-0001
GC C 1 S GenderCode ;0002-0002
NP C 20 S NamePrefix ;0003-0022
FN C 25 S FirstName ;0023-0047
MN C 25 S MiddleName ;0048-0072
LN C 50 S PrimaryName ;0073-0122
NG C 10 S NameGeneration ;0123-0132
NS C 20 S NameSuffix ;0133-0152
AN C 50 S AdditionalNameInformation ;0153-0202
;-------------------------------------------------------------------------------
; Matching Fields
;-------------------------------------------------------------------------------
MF C 25 S MatchFirstName ;0203-0227
NF C 8 X NYSIISofMatchFirstName ;0228-0235
SF C 4 Z RSoundexofMatchFirstName ;0236-0239
ML C 50 S MatchPrimaryName ;0240-0289
HK C 10 S HashKeyofMatchPrimaryName ;0290-0299
PK C 20 S PackedKeyofMatchPrimaryName ;0300-0319
NW C 1 S NumberofMatchPrimaryWords ;0320-0320
W1 C 15 S MatchPrimaryWord1 ;0321-0335
W2 C 15 S MatchPrimaryWord2 ;0336-0350
W3 C 15 S MatchPrimaryWord3 ;0351-0365
W4 C 15 S MatchPrimaryWord4 ;0366-0380
W5 C 15 S MatchPrimaryWord5 ;0381-0395
N1 C 8 X NYSIISofMatchPrimaryWord1 ;0396-0403
S1 C 4 Z RSoundexofMatchPrimaryWord1 ;0404-0407
N2 C 8 X NYSIISofMatchPrimaryWord2 ;0408-0415
S2 C 4 Z RSoundexofMatchPrimaryWord2 ;0416-0419
;-------------------------------------------------------------------------------
; Reporting Fields
;-------------------------------------------------------------------------------
UP C 30 S UnhandledPattern ;0420-0449
UD C 100 S UnhandledData ;0450-0549
IP C 30 S InputPattern ;0550-0579
ED C 25 S ExceptionData ;0580-0604
UO C 2 S UserOverrideFlag ;0605-0606
Dictionary file (USNAME.DCT)
Field Name
{NT}
Data type
Char
Length
50 bytes
Nulls
Space as null
Zero as null
Both as null
Display Name
Comment
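The dictionary file describes a fixed-width output layout, so the byte positions in the trailing comments can be recomputed from the field lengths. The Python sketch below parses a few lines of the listing above and slices a record accordingly. The parsing logic and the sample record are illustrative assumptions, not part of QualityStage.

```python
DCT_SAMPLE = """\
FORMAT SORT=N
; Business Intelligence Fields
NT C 1 S NameType ;0001-0001
GC C 1 S GenderCode ;0002-0002
NP C 20 S NamePrefix ;0003-0022
FN C 25 S FirstName ;0023-0047
"""

def parse_dct(text):
    """Parse field lines and compute 1-based start/end byte positions."""
    fields, start = [], 1
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";") or line.startswith("FORMAT"):
            continue
        spec, _, comment = line.partition(";")
        name, dtype, length, nulls, display = spec.split()[:5]
        end = start + int(length) - 1
        fields.append({"name": name, "type": dtype, "length": int(length),
                       "nulls": nulls, "display": display,
                       "start": start, "end": end,
                       "comment": comment.strip()})
        start = end + 1
    return fields

def parse_record(record, fields):
    """Slice a fixed-width record into a dict keyed by field name."""
    return {f["name"]: record[f["start"] - 1:f["end"]].rstrip()
            for f in fields}

FIELDS = parse_dct(DCT_SAMPLE)
# Hypothetical standardized record: 1 + 1 + 20 + 25 bytes.
RECORD = "I" + "M" + "DR".ljust(20) + "JOHN".ljust(25)
print(parse_record(RECORD, FIELDS))
```

The computed positions agree with the position comments in the listing, which is a quick consistency check when fields are added or resized.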
Pattern file (USNAME.PAT)
PRAGMA_START
SEPLIST " ~`!@#$%^&*()_-+={}[]|:;"'<>,.?/"
STRIPLIST " ~`!@#$^*()_+={}[]|:;"<>?"
PRAGMA_END
POST_START
NYSIIS {MF} {NF}
RSOUNDEX {MF} {SF}
NYSIIS {W1} {N1}
RSOUNDEX {W1} {S1}
NYSIIS {W2} {N2}
RSOUNDEX {W2} {S2}
POST_END
P | W | S | $ | [ {LN} = "" ]
COPY_A [1] {NP}
COPY [2] {LN}
COPY_A [3] {NS}
EXIT
&
CALL Handle_Common_Patterns
EXIT
SUB Handle_Common_Patterns
P | F | I | + | $ | [ {FN} = "" & {MN} = "" & {LN} = "" ]
COPY_A [1] {NP}
COPY [2] {FN}
COPY [3] {MN}
COPY [4] {LN}
CALL Post_Process
RETURN
F | I | I | + | $ | [ {FN} = "" & {MN} = "" & {LN} = "" ]
COPY [1] {FN}
COPY [2] tempn
CONCAT " " tempn
CONCAT [3] tempn
COPY tempn {MN}
COPY [4] {LN}
CALL Post_Process
RETURN
END_SUB
Separator
Characters
Characters to
be removed
Phonetic
Codes
Pattern
Action
Subroutine
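As one reading of the second rule in the subroutine above (pattern F | I | I | +), the actions copy the first token to {FN}, join the two initials with a space into {MN}, and copy the fourth token to {LN}. The Python sketch below mirrors that reading only; the function name and token representation are hypothetical, and the rule's condition that the target fields are still empty, as well as the Post_Process call, are omitted.

```python
def apply_f_i_i_plus(tokens):
    """Apply the 'F | I | I | +' rule to a list of (class, value) tokens.

    Returns the populated dictionary fields, or None if the token
    classes do not match the pattern.
    """
    classes = [cls for cls, _ in tokens]
    if classes != ["F", "I", "I", "+"]:
        return None
    values = [value for _, value in tokens]
    return {
        "FN": values[0],                    # COPY [1] {FN}
        "MN": values[1] + " " + values[2],  # COPY/CONCAT into tempn, then {MN}
        "LN": values[3],                    # COPY [4] {LN}
    }

# Hypothetical classified input for a name such as "JOHN A B SMITH".
tokens = [("F", "JOHN"), ("I", "A"), ("I", "B"), ("+", "SMITH")]
print(apply_f_i_i_plus(tokens))
```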
QualityStage Matching Procedure
UNDUP
Standardized
Output
Matching rules
(*.MAT)
Dup Groups
Residuals
Record groups
(2 or more recs)
Singleton records
De-duplication of
one file
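Splitting a de-duplicated file into duplicate groups and residuals can be sketched with a union-find structure in Python. This assumes matches are grouped transitively (if A matches B and B matches C, all three form one group), which is an assumption of the sketch and may differ from how UNDUP builds its groups.

```python
def group_duplicates(n_records, matched_pairs):
    """Return (dup_groups, residuals) for records numbered 0..n_records-1."""
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the group representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)

    dup_groups = [g for g in groups.values() if len(g) >= 2]
    residuals = [g[0] for g in groups.values() if len(g) == 1]
    return dup_groups, residuals

# Hypothetical match results for five records.
print(group_duplicates(5, [(0, 1), (1, 2)]))
```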
QualityStage Matching Procedure (contd.)
GEOMATCH
Standardized
Output
(A)
Matching rules
(*.MAT)
Matches
(A->B)
Residuals
(A)
Singleton records
Reference File
(B)
Dup Groups
(B)
Record groups
(2 or more recs)
Record groups
(2 or more recs)
Residuals
(B)
One-to-many matching:
one record from File-A
can match many records
in File-B.
Matching Functions
Pass-1
Blocking-Key
3 bytes of Zip
2 Bytes of Street Name
2 Bytes of Last Name
2 Bytes of First Name
Matching Fields
First name
Middle name
Last name
House number
Street name
Zip
Pass-2
Blocking-Key
2 bytes of State code
3 Bytes of City name
3 Bytes of Street name
2 Bytes of Last name
2 Bytes of First name
Matching Fields
First name
Middle name
Last name
House number
Street name
City
Multi-Pass Matching
Maximum of 7 Passes
per match application
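The two blocking keys on this slide can be combined in a Python sketch of multi-pass candidate generation. A second pass lets records that miss each other in pass 1 (for example because the zip code is missing) still be compared. The record layout, the sample records and the choice to skip a record in a pass when any key component is empty are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

# (field, number of leading characters) per pass, as listed on the slide.
PASSES = [
    [("zip", 3), ("street", 2), ("last", 2), ("first", 2)],
    [("state", 2), ("city", 3), ("street", 3), ("last", 2), ("first", 2)],
]

def blocking_key(rec, spec):
    """Build the key for one pass, or None if a component is missing."""
    parts = [rec.get(field, "")[:n] for field, n in spec]
    if any(not p for p in parts):
        return None
    return "|".join(parts).upper()

def candidate_pairs(records, passes=PASSES):
    """Union of within-cluster record pairs (by index) over all passes."""
    pairs = set()
    for spec in passes:
        clusters = defaultdict(list)
        for idx, rec in enumerate(records):
            key = blocking_key(rec, spec)
            if key is not None:
                clusters[key].append(idx)
        for members in clusters.values():
            pairs.update(combinations(members, 2))
    return pairs

# Hypothetical records; the second one has no zip code.
RECORDS = [
    {"first": "John", "last": "Doe", "street": "Main",
     "city": "Cottonwood", "state": "CA", "zip": "92626"},
    {"first": "John", "last": "Doe", "street": "Main",
     "city": "Cottonwood", "state": "CA", "zip": ""},
    {"first": "Jane", "last": "Roe", "street": "Oak",
     "city": "Redding", "state": "CA", "zip": "96001"},
]
print(candidate_pairs(RECORDS, PASSES[:1]))  # pass 1 only
print(candidate_pairs(RECORDS))              # both passes
```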
Pass
VarType
M-prob
U-prob
Agreement weight
Disagreement weight
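In standard probabilistic record linkage (the Fellegi-Sunter model), the agreement and disagreement weights of a field are derived from its m- and u-probabilities: m is the probability that the field agrees when the two records truly match, u the probability that it agrees when they do not. A minimal Python sketch of those formulas follows; the sample probabilities are made up, and the slide itself does not state how the tool computes its weights.

```python
import math

def agreement_weight(m, u):
    """Weight added when the field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (usually negative) when it disagrees: log2((1-m)/(1-u))."""
    return math.log2((1 - m) / (1 - u))

# Hypothetical probabilities for two fields.
for field, m, u in [("last name", 0.9, 0.01), ("gender", 0.9, 0.5)]:
    print(field,
          round(agreement_weight(m, u), 2),
          round(disagreement_weight(m, u), 2))
```

A field whose values rarely agree by chance (low u) earns a large agreement weight, which is why agreement on a last name counts for more than agreement on gender.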
Work in progress…
To be completed…

More Related Content

Similar to Quality StageStandardization & Matching Training Edit007.ppt

Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...
Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...
Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...
Paolo Missier
 
How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
New York City College of Technology Computer Systems Technology Colloquium
 
Sound Data Quality for CRM
Sound Data Quality for CRMSound Data Quality for CRM
Sound Data Quality for CRM
Divya Malik
 
Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
OllieShoresna
 
Search Engines
Search EnginesSearch Engines
Search Engines
butest
 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008
Peiling Wang
 

Similar to Quality StageStandardization & Matching Training Edit007.ppt (20)

Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Solomon
SolomonSolomon
Solomon
 
Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...
Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...
Invited talk @Aberdeen, '07: Modelling and computing the quality of informati...
 
Address Standard
Address StandardAddress Standard
Address Standard
 
How We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad GuysHow We Use Functional Programming to Find the Bad Guys
How We Use Functional Programming to Find the Bad Guys
 
Provinance in scientific workflows in e science
Provinance in scientific workflows in e scienceProvinance in scientific workflows in e science
Provinance in scientific workflows in e science
 
Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07
 
Data Recognition Corporation
Data Recognition CorporationData Recognition Corporation
Data Recognition Corporation
 
Data imputation for unstructured dataset
Data imputation for unstructured datasetData imputation for unstructured dataset
Data imputation for unstructured dataset
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
2008 Data Mining Analysis
2008 Data Mining Analysis2008 Data Mining Analysis
2008 Data Mining Analysis
 
Sound Data Quality for CRM
Sound Data Quality for CRMSound Data Quality for CRM
Sound Data Quality for CRM
 
Training MS Access 2007
Training MS Access 2007Training MS Access 2007
Training MS Access 2007
 
Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Rdbms
RdbmsRdbms
Rdbms
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc
 

Recently uploaded (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 

Quality StageStandardization & Matching Training Edit007.ppt

  • 1. Matching Concepts and an Introduction to Ascential QualityStage Training Material BI Practice, Chennai Matching Concepts and an Introduction to Ascential QualityStage
  • 2. Matching Concepts and an Introduction to Ascential QualityStage Objective of the Training Matching Variations & Errors Parsing, Cleaning & Standardization Tool architecture – Designer-client, QS-server Standardization – ‘.cls’, ‘.pat’, ‘.tbl’ Matching Concepts QualityStage
  • 3. Matching Concepts and an Introduction to Ascential QualityStage Matching Name & Address Matching De-duplication, Unduplication, Merge-Purge, Customer ID generation (UID) Householding Data warehouse projects Application integration Business mergers Data acquisition When? Terminology Absence of a persistent identifying key between the data sources. Absence of a global standard for representation between the data sources. Drivers
  • 4. Matching Concepts and an Introduction to Ascential QualityStage Matching Example Society Of St. Vincent De Paul The Scty Of Saint Vncnt De Pau St Vincent De Paul Society Sosiety Of Saint Vincent Dpl Clymer Atty At Law Brian Brian I Clymer Attorney At Law Arizona Dept Of Agricutlture Dept Of Agri Arizona Arizona State Dept Of Agri Az Agri Dept A fact can be represented in multiple standard forms Standards change over time Errors and variations occur during data capture & processing In practice multiple standards forms/formats are used for data capture, processing and storage Duplicate detection • Database consolidation • Application consolidation Query • Removing felons from voters list. • List processing
  • 5. Matching Concepts and an Introduction to Ascential QualityStage
  • 6. Matching Concepts and an Introduction to Ascential QualityStage
  • 7. Matching Concepts and an Introduction to Ascential QualityStage
  • 8. Matching Concepts and an Introduction to Ascential QualityStage Variation & Errors Errors may include non-standard variations, additional words, missing words, or unknown data. Synonyms & nicknames Prefix & suffix variations Abbreviation & Acronyms Anglicization & foreign versions of names Spelling, typing & phonetic error Initials, inconsistently abbreviated names Transposition (Word sequence variations) Truncation & missing words Extra words format, character & convention variations Dr John Doe Med. Doctor Dr John Doe MD Saint Louis University St. Louis Univ. Tata Consultancy Services Inc TCS Incorporated University of south Florida South-Florida University(USF) ABC CO Attn: Mr. Clark The ABC CO, City of New York Bill Clinton William Clinton
  • 9. Matching Concepts and an Introduction to Ascential QualityStage Parsing, Cleansing & Standardization Candidate selection Matching & Scoring Apply Threshold/ Cutoff Reference Records No Match Match Matchin g rules Cutoff Cleansin g rules Matching Algorithm Input File Reduce variations & errors through Cleansing & Standardization Retain the differentiators & remove the noise Filter dissimilar records and match the similar records Use fuzzy matching to handle unresolved variations & errors
  • 10. Matching Concepts and an Introduction to Ascential QualityStage Raw Input Lexical Analysis Contextual Parsing “Dr John Doe Jr PhD” Tokenization “123 Main Street Suite 101” Dr |John|Doe |Jr |PhD 123 |Main|Street|Suite|101 Prefix|First|alpha|Gen|Suffix NNN |alpha|Type|Unit|NNN Prefix|First|Last|Gen|Suffix Prefix = Dr. First Name = John Last Name = Doe Generation = Jr. Suffix = PhD Output Hsn |Street|Type|Unit|Unit# House Number = 123 Street Name = Main Street Type = St Unit Type = Ste Unit Number = 101 Parsing, Cleansing & Standardization
  • 11. Matching Concepts and an Introduction to Ascential QualityStage Identify Character set (Code page) Translate Code page Identify delimiters, operators, punctuations, allowable characters and special-characters. Ignore the rest. Parse text into tokens Assign token types and build a pattern (sentence structure) Break the pattern into individual attributes (based on context) Store the standard form for each parsed attribute. Parsing: Understanding the parts to build a structure and breaking the structure into meaningful parts”. Parsing, Cleansing & Standardization Rules Development  Identify Words and assign Word-Types and standard values.  Define patterns and parsing rules. • 80-20 rule • Frequencies • Context & Data Placement
  • 12. Matching Concepts and an Introduction to Ascential QualityStage Derive Candidate Key Parsed & Stan. Output Candidate Selection Dr. John Doe Jr. PhD 123 Main St Ste 101 Cottonwood, CA 92626 926-Ma-Do-Jo John Doe 123 Main St Cottonwood CA 92626 Jones Donald 123 Maple Av 101 Cottonwood CA 92626 Joseph Don 456 Main Ln Cottonwood CA 92626 Dr. John Doe Jr. PhD 123 Main St Ste 101 Cottonwood, CA 92626 Candidate Selection Candidate Selection is the processes of identifying likely matching records.
  • 13. Matching Concepts and an Introduction to Ascential QualityStage Derivative of the entity/record to be matched Forms Clusters of similar records Small keys form Large Clusters (General) Large keys form small clusters (restrictive) Other names  Candidate Code  Blocking Key  Window Key Design considerations  Multiple blocking keys (handle Missing values, Variations)  Balance between Performance (Candidate set size), miss-rate (Quality) and hit-rate (Matching) Use of Candidate-Key decreases the cost of matching by reducing records being matched, this results in higher throughput and performance. Data skew will cause large clusters . Candidate Keys Matching – O(n) De-duplication – O(n2)
  • 14. Matching Concepts and an Introduction to Ascential QualityStage Candidates Matching & Scoring Dr. John Doe Jr. PhD 123 Main St Ste 101 Cottonwood, CA 92626 If Score >= 95 then Match Otherwise No-Match YYYY--YYY = 100 xxYx--YYY = 70 xxxY--YYY = 70 John Doe 123 Main St Cottonwood CA 92626 Jones Donald 123 Maple Av 101 Cottonwood CA 92626 Joseph Don 456 Main Ln Cottonwood CA 92626 Matching & Scoring Parsed & Stan. Output match no-match no-match
  • 15. Matching Concepts and an Introduction to Ascential QualityStage Exact matching Phonetic matching Soundex NYSIIS (New York State Identification and Intelligence System) Edit Distance Prefix, Suffix & Initial matching Acronym matching String matching (exact, approx) Interval matching (numeric data) Date matching (exact, diff) … other tool specific algorithms Matching Functions Word Matching Field Matching String Matching Name Address URL, E-mail ID SSN Phone Date Number
  • 16. Matching Concepts and an Introduction to Ascential QualityStage Deterministic & Probabilistic Matching WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62 WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62 Deterministic Decisions Tables: Fields are evaluated for degree-of- match and a letter grade assigned; the grades form a “match pattern” which is looked-up in a table to determine if the pair Matches, Fails, or is Suspect Are these two records a match? B B A A B D B A = BBAABDBA +9 +2 +14 +5 +4 -1 +5 +11 = +49 Probabilistic Linkage: Fields are evaluated for degree-of-match and a weight assigned which represents the “informational content” contributed by those values; the weights are summed to derived a total score that measures the statistical probability of a match
  • 17. Introduction to Ascential QualityStage. Training Material, BI Practice, Chennai
  • 18. QualityStage Procedures (Stages). Data handled: Name, Address, Phone, URL, E-mail ID, Others. Analysis and investigation procedures: character, word & pattern investigation/analysis. Standardization: parsing, cleansing & standardization. Matching: de-duping, reference matching. Survivorship: cross-population of fields within a duplicate group. Utility procedures: FFC - file format converter (delimited to fixed and vice versa); GTF - code page, data type, column derivation etc.; SLC - column and row filtering; SORT - reorder data files; UNI - inner join and left, right & full outer join on flat files. QS procedures: CLP - field domain frequency distribution; PRS - parse (space-delimited) free-form text into words for analysis; NMA - name abbreviation key generation for matching; PGM - run command-line programs from within the procedure.
  • 19. QualityStage Project (diagram). Contents of a project (QS-Project-A): jobs (Job-1, Job-2, Job-3), stages, files, and rules (parsing, cleansing & standardization rules; matching rules; survivorship rules). Stages shown: Standardization (STAN), Filter (SLC), Ref. Match (GEOREF), De-Dup (UNDUP), New ID Assignment, Collect (UNIX). Data shown: Raw Data, Stan. Rules, Std Out, Good Stan (standardized data), Bad Stan (non-standard data), Matching Rules, Ref File, Match / Matched (Old), No Match (New), Dup Groups, New Recs, Output, Add to Ref. File.
  • 20. QualityStage Tool Architecture (diagram). QS Designer Client (Windows PC): the QS developer creates, reads, updates and deletes (CRUD) projects, jobs, stages, files and rules (parsing, cleansing & standardization rules; matching rules; survivorship rules), held in x.mdb; projects are exported and imported as .imf files. QS Server / QS Job Server (Unix or Windows): work area used for deployment of jobs and rules. Interactions: Deploy Job; Submit Job (tell the server to run the job); Run Job; job status reported to the client.
  • 21. QualityStage Standardization Procedure. Raw Input (INREC: x bytes) goes through Standardization to give Standardized Output (DCT: d bytes, followed by INREC: x bytes), so OUTREC = (d + x) bytes. Rule set files: USNAME.PRC; USNAME.DCT (layout of the parsed fields); USNAME.CLS (word classification table); USNAME.PAT (pattern rules); USNAME.UCL (user-defined pattern rules); lookup tables for special processing: USNAMEIP.TBL, USNAMEIT.TBL, USNAMEMF.TBL, USNAMEUP.TBL, USNAMEUT.TBL, USFIRSTN.TBL, USGENDER.TBL.
  • 22. Word Classification. Word class: 1 byte. User-defined classes (A-Z). Implicit types. NULL type: a numeric zero ‘0’ will nullify the token; the token will not participate in the pattern parsing.
  • 23. Word Classification file (USNAME.CLS)
  • 24. Dictionary file (USNAME.DCT)
    FORMAT SORT=N
    ;-------------------------------------------------------------------------------
    ; USNAME Dictionary File
    ;-------------------------------------------------------------------------------
    ; Business Intelligence Fields
    ;-------------------------------------------------------------------------------
    NT C   1 S NameType                    ;0001-0001
    GC C   1 S GenderCode                  ;0002-0002
    NP C  20 S NamePrefix                  ;0003-0022
    FN C  25 S FirstName                   ;0023-0047
    MN C  25 S MiddleName                  ;0048-0072
    LN C  50 S PrimaryName                 ;0073-0122
    NG C  10 S NameGeneration              ;0123-0132
    NS C  20 S NameSuffix                  ;0133-0152
    AN C  50 S AdditionalNameInformation   ;0153-0202
    ;-------------------------------------------------------------------------------
    ; Matching Fields
    ;-------------------------------------------------------------------------------
    MF C  25 S MatchFirstName              ;0203-0227
    NF C   8 X NYSIISofMatchFirstName      ;0228-0235
    SF C   4 Z RSoundexofMatchFirstName    ;0236-0239
    ML C  50 S MatchPrimaryName            ;0240-0289
    HK C  10 S HashKeyofMatchPrimaryName   ;0290-0299
    PK C  20 S PackedKeyofMatchPrimaryName ;0300-0319
    NW C   1 S NumberofMatchPrimaryWords   ;0320-0320
    W1 C  15 S MatchPrimaryWord1           ;0321-0335
    W2 C  15 S MatchPrimaryWord2           ;0336-0350
    W3 C  15 S MatchPrimaryWord3           ;0351-0365
    W4 C  15 S MatchPrimaryWord4           ;0366-0380
    W5 C  15 S MatchPrimaryWord5           ;0381-0395
    N1 C   8 X NYSIISofMatchPrimaryWord1   ;0396-0403
    S1 C   4 Z RSoundexofMatchPrimaryWord1 ;0404-0407
    N2 C   8 X NYSIISofMatchPrimaryWord2   ;0408-0415
    S2 C   4 Z RSoundexofMatchPrimaryWord2 ;0416-0419
    ;-------------------------------------------------------------------------------
    ; Reporting Fields
    ;-------------------------------------------------------------------------------
    UP C  30 S UnhandledPattern            ;0420-0449
    UD C 100 S UnhandledData               ;0450-0549
    IP C  30 S InputPattern                ;0550-0579
    ED C  25 S ExceptionData               ;0580-0604
    UO C   2 S UserOverrideFlag            ;0605-0606
    Columns of each entry: field name (for example {NT}); data type (Char); length (for example 50 bytes); nulls (space as null, zero as null, or both as null); display name; comment.
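The dictionary file describes a fixed-width record layout, so the start and end columns in its comments follow directly from the field lengths. The sketch below recomputes those offsets for the first six fields and slices a record with them. The helper names (`offsets`, `pack`, `unpack`) and the sample values are made up for illustration and are not part of QualityStage.

```python
# (field name, length) for the first six dictionary entries on the slide.
LAYOUT = [("NT", 1), ("GC", 1), ("NP", 20), ("FN", 25), ("MN", 25), ("LN", 50)]


def offsets(layout):
    """Return {field: (start, end)} using 1-based inclusive columns."""
    result, pos = {}, 1
    for name, length in layout:
        result[name] = (pos, pos + length - 1)
        pos += length
    return result


def pack(values, layout):
    """Build a fixed-width record, space-padding or truncating each field."""
    return "".join(values.get(name, "").ljust(length)[:length]
                   for name, length in layout)


def unpack(record, layout):
    """Slice a fixed-width record back into a dict of trimmed fields."""
    return {name: record[start - 1:end].rstrip()
            for name, (start, end) in offsets(layout).items()}


sample = {"NT": "I", "GC": "M", "NP": "DR", "FN": "JOHN", "LN": "DOE"}
record = pack(sample, LAYOUT)
print(offsets(LAYOUT)["LN"])   # same columns as the ;0073-0122 comment
print(unpack(record, LAYOUT))
```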
  • 25. Pattern file (USNAME.PAT)
    PRAGMA_START
    SEPLIST " ~`!@#$%^&*()_-+={}[]|:;"'<>,.?/"
    STRIPLIST " ~`!@#$^*()_+={}[]|:;"<>?"
    PRAGMA_END
    POST_START
    NYSIIS {MF} {NF}
    RSOUNDEX {MF} {SF}
    NYSIIS {W1} {N1}
    RSOUNDEX {W1} {S1}
    NYSIIS {W2} {N2}
    RSOUNDEX {W2} {S2}
    POST_END
    P | W | S | $ | [ {LN} = "" ]
    COPY_A [1] {NP}
    COPY [2] {LN}
    COPY_A [3] {NS}
    EXIT
    &
    CALL Handle_Common_Patterns
    EXIT
    SUB Handle_Common_Patterns
    P | F | I | + | $ | [ {FN} = "" & {MN} = "" & {LN} = "" ]
    COPY_A [1] {NP}
    COPY [2] {FN}
    COPY [3] {MN}
    COPY [4] {LN}
    CALL Post_Process
    RETURN
    F | I | I | + | $ | [ {FN} = "" & {MN} = "" & {LN} = "" ]
    COPY [1] {FN}
    COPY [2] tempn
    CONCAT " " tempn
    CONCAT [3] tempn
    COPY tempn {MN}
    COPY [4] {LN}
    CALL Post_Process
    RETURN
    END_SUB
    Callouts on the slide: separator characters (SEPLIST); characters to be removed (STRIPLIST); phonetic codes (NYSIIS, RSOUNDEX); pattern; action; subroutine.
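A rough sketch of what a pattern rule such as `P | F | I | + | $` does: each input token is replaced by a one-character class, the classes form an input pattern, and a matching rule copies tokens into dictionary fields. The tiny classification table, the treatment of single letters as class `I`, the use of `+` for unclassified words and `$` for end of input are assumptions inferred from the slide; the real classification file and pattern-action language have many more features.

```python
# Hypothetical classification table; a real .CLS file is far larger.
CLASSES = {"DR": "P", "MR": "P", "MRS": "P", "JOHN": "F", "BRIAN": "F", "JR": "S"}


def classify(token):
    if token in CLASSES:
        return CLASSES[token]
    if len(token) == 1 and token.isalpha():
        return "I"   # single letter, treated as an initial (assumption)
    return "+"       # unclassified word (assumption)


def input_pattern(text, strip_chars=".,"):
    """Tokenize on whitespace after blanking out the stripped characters."""
    cleaned = "".join(" " if c in strip_chars else c for c in text.upper())
    tokens = cleaned.split()
    pattern = " | ".join([classify(t) for t in tokens] + ["$"])
    return tokens, pattern


def apply_rule(tokens, pattern):
    """Mimic the copy actions of the rule 'P | F | I | + | $' from the slide."""
    if pattern == "P | F | I | + | $":
        return {"NP": tokens[0], "FN": tokens[1], "MN": tokens[2], "LN": tokens[3]}
    return {}        # unhandled pattern in this sketch


tokens, pattern = input_pattern("Dr. John I Doe")
print(pattern)
print(apply_rule(tokens, pattern))
```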
  • 26. QualityStage Matching Procedure: UNDUP (de-duplication of one file). Inputs: Standardized Output; Matching rules (*.MAT). Outputs: Dup Groups (record groups of 2 or more records); Residuals (singleton records).
  • 27. QualityStage Matching Procedure, contd.: GEOMATCH. Inputs: Standardized Output (A); Reference File (B); Matching rules (*.MAT). Outputs: Matches (A->B) (record groups of 2 or more records); Dup Groups (B) (record groups of 2 or more records); Residuals (A) (singleton records); Residuals (B). One-to-many matching: one record from File-A can match many records in File-B.
  • 28. Matching Functions
  • 29. Multi-Pass Matching (maximum of 7 passes per match application). Pass-1 blocking key: 3 bytes of Zip, 2 bytes of Street name, 2 bytes of Last name, 2 bytes of First name; matching fields: First name, Middle name, Last name, House number, Street name, Zip. Pass-2 blocking key: 2 bytes of State code, 3 bytes of City name, 3 bytes of Street name, 2 bytes of Last name, 2 bytes of First name; matching fields: First name, Middle name, Last name, House number, Street name, City.
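The two blocking keys from this slide can be sketched as below, with the candidate pairs of both passes combined. The field names and sample records are invented for illustration, and the way the tool actually hands unmatched records from one pass to the next is not modelled. The point of the sketch is that a record with a missing ZIP is missed by pass 1 but found by pass 2.

```python
from collections import defaultdict
from itertools import combinations


def pass1_key(r):
    # 3 bytes of ZIP, 2 of street name, 2 of last name, 2 of first name.
    return (r["zip"][:3] + r["street"][:2] + r["last"][:2] + r["first"][:2]).upper()


def pass2_key(r):
    # 2 bytes of state, 3 of city, 3 of street name, 2 of last, 2 of first.
    return (r["state"][:2] + r["city"][:3] + r["street"][:3]
            + r["last"][:2] + r["first"][:2]).upper()


def pairs_for(records, key_fn):
    """Return the set of index pairs that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)
    return {p for members in blocks.values() for p in combinations(members, 2)}


# Illustrative records: record 1 lacks a ZIP, record 2 lacks a city.
base = {"first": "John", "last": "Doe", "street": "Main",
        "zip": "92626", "state": "CA", "city": "Cottonwood"}
people = [base, {**base, "zip": ""}, {**base, "city": ""}]

pass1 = pairs_for(people, pass1_key)
pass2 = pairs_for(people, pass2_key)
print(pass1, pass2, pass1 | pass2)
```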
  • 30. Table columns: Pass | VarType | M-prob | U-prob | Agreement weight | Disagreement weight. Work in progress… To be completed…
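The slide is unfinished, so the following is only a sketch based on the standard Fellegi-Sunter formulation of probabilistic record linkage, assuming that is what the column headings refer to. The m-probability is the chance that a field agrees when the two records really are a match; the u-probability is the chance that it agrees when they are not. The sample values 0.9 and 0.1 are illustrative.

```python
import math


def agreement_weight(m, u):
    """Weight added when the field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when the field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


m_prob, u_prob = 0.9, 0.1  # illustrative values only
agree = agreement_weight(m_prob, u_prob)
disagree = disagreement_weight(m_prob, u_prob)
print(f"agreement {agree:+.2f}, disagreement {disagree:+.2f}")
```

Summing these per-field weights over all compared fields gives a composite score of the kind shown on the probabilistic linkage slide.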