Matching Criteria Overview
By
Hugh Knight
Samaritan’s Purse
OCC Data Associate
January 2016
Table of Contents
INTRODUCTION
Percentage Variation Spectrum
Indexing / Sorting and a Blocking Key
Dataset Cursory Consideration
Other conceptual and practical concerns (extracts from Australian Attorney General Website)
My Match Key Schematic
INTRODUCTION
As you move from data entry work to detail-oriented matching work, you need a set of
rules or factors that give you a consistent, standardized framework for your matching.
Included in this Word document are a number of concepts from the
Australian Attorney General’s website and the book Data Matching by
Peter Christen. I found them very useful in laying out data matching
concepts that will ensure data consistency and standard practices in your
ongoing matching work.
Percentage Variation Spectrum
How much leeway for error do I allow as I begin this process? I use the 75-100% range
to allow for some variation in first and last names due to human error. Please see
the My Match Key Schematic at the end of this document for more details.
Reasons for variations in names:
Abbreviations
Child’s limited thinking
Country Language Nomenclature [1]
Flipped (Reverse) Names in fields
Name inconsistency (English vs. Native language)
Nickname versus Real Name
Indexing / Sorting and a Blocking Key
Indexing (sorting) on a blocking key (e.g. zip code or last name) quickly eliminates
records that cannot match, so that only records sharing the same key need to be
compared when matching a dataset.
Menu: Home, Sort and Filter, Custom Filter
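As a rough illustration of the blocking idea outside a spreadsheet, the sketch below groups records by a blocking key and only pairs up records that fall in the same block. The field names (`zip`, `last_name`), the sample records and the choice of key (zip code plus first letter of the last name) are assumptions made for this example, not part of the One Stop or Teacher Match workbooks.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: zip code plus the first letter of the last name.
    return (record["zip"], record["last_name"][:1].upper())


def candidate_pairs(records):
    # Group records by blocking key, then pair records only within a block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up sample records for illustration only.
records = [
    {"id": 1, "last_name": "Lopez", "zip": "28607"},
    {"id": 2, "last_name": "Lopes", "zip": "28607"},
    {"id": 3, "last_name": "Smith", "zip": "28607"},
    {"id": 4, "last_name": "Lopez", "zip": "30301"},
]

pairs = candidate_pairs(records)
print([(a["id"], b["id"]) for a, b in pairs])
```

With four records there are six possible pairs, but only records 1 and 2 share a block here, so only that pair is left for detailed comparison. The trade-off is that a typo in the blocking field itself puts a true match in a different block.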
Dataset Cursory Consideration
As you look at your dataset you may notice some similarities. These are
noteworthy as you begin your matching:
1. Phonetic [2] similarity – sounds the same
2. Character shape – looks the same
3. Numerical similarity [3] – values are exact matches
Birthday and date variations are another issue for discussion and consideration.
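Phonetic similarity (item 1 above) can be sketched with Soundex, which the footnote mentions. The code below is a minimal version of the commonly described American Soundex rules, written for illustration; it is not necessarily the same as the downloaded file the footnote refers to.

```python
# Letter-to-digit table for American Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                      # a vowel resets prev to ""
    return (result + "000")[:4]          # pad or truncate to four characters


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that receive the same code "sound the same" in the Soundex sense and can be treated as candidate matches for closer review.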
[1] https://en.wikipedia.org/wiki/Nomenclature
[2] Soundex . . . Downloaded file on G Drive
[3] Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Christen, Springer, page 70
Other conceptual and practical concerns
(extracts from Australian Attorney General Website)
Standardizing may involve the removal of non-alphabetic characters like hyphens, spaces
and apostrophes to produce a “standard format”. As an example, in instances where
“OConnor” would normally not match with “O’Connor”, standardizing would result in a
record in each file with the value “OConnor”, which would then produce a match.
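A minimal sketch of this kind of standardizing is shown below. The function name `standardize` is made up for the example, and keeping only alphabetic characters is one simple reading of the "standard format" described above.

```python
def standardize(name):
    # Keep letters only; hyphens, spaces and apostrophes are dropped.
    return "".join(ch for ch in name if ch.isalpha())


print(standardize("O’Connor"))     # OConnor
print(standardize("Smith-Jones"))  # SmithJones
print(standardize("O'Connor") == standardize("OConnor"))  # True
```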
Include a control group [4]: The use of a control group of records can assist in the
development of data matching applications and in interpreting the results of data
matching activity. By including a control group with known characteristics in the data
passing through the data matching application and observing the results, the effectiveness
of the application can be reviewed and refined.
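One possible way to use a control group is sketched below: record pairs whose true status is already known are passed through a matcher, and the outcomes are tallied. The matcher (`exact_match`), the sample pairs and the outcome labels are all hypothetical and only illustrate the review step.

```python
def exact_match(a, b):
    # Hypothetical matcher under review: case-insensitive exact comparison.
    return a.lower() == b.lower()


def evaluate(matcher, control_group):
    # Compare the matcher's decision with the known answer for each pair.
    counts = {"true_match": 0, "false_match": 0,
              "missed_match": 0, "true_non_match": 0}
    for a, b, is_match in control_group:
        predicted = matcher(a, b)
        if predicted and is_match:
            counts["true_match"] += 1
        elif predicted and not is_match:
            counts["false_match"] += 1
        elif is_match:
            counts["missed_match"] += 1
        else:
            counts["true_non_match"] += 1
    return counts


# (value in file A, value in file B, known to be the same person?)
control_group = [
    ("Lopez", "lopez", True),
    ("O'Connor", "OConnor", True),   # same person, punctuation differs
    ("Smith", "Smyth", False),
    ("Lee", "LEE", False),           # same name, different people
]

print(evaluate(exact_match, control_group))
```

A missed match such as the O'Connor pair points to a refinement (here, standardizing before comparing), which is the kind of review the control group is meant to support.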
Use name, date of birth, address in the algorithm [5] design
In designing identity data
matching algorithms and applications, designers should consider the use of
name(s), date of birth and address, as using multiple aspects of record detail in
compared data enables greater flexibility in determining what constitutes a match.
Consideration may also need to be given to the use of the sex field, although many
agencies consider the susceptibility to miscoding of this value may negate its overall
usefulness.
Ensure the use of a flexible matching algorithm
Name matching should optimally employ orthographic [6], linguistic or phonetic (or any
combination thereof) fuzzy logic pattern matching. . . . Whether a matching solution has
been developed in-house or is a commercial product, developers will need to determine
what constitutes a match.
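As a small sketch of what "flexible" matching with an explicit match boundary can look like, the example below uses Python's standard `difflib.SequenceMatcher` as a stand-in for an orthographic (spelling-based) comparison. The 0.75 threshold is an illustrative value chosen for this example, not a rule taken from the guidelines.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Ratio between 0.0 (nothing in common) and 1.0 (identical), ignoring case.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a, b, threshold=0.75):
    # The threshold is the boundary the developer has to decide on.
    return similarity(a, b) >= threshold


print(is_match("Knight", "Knigth"))  # transposed letters still match
print(is_match("Knight", "Smith"))   # clearly different names do not
```

Raising the threshold reduces false matches but misses more misspelled names; lowering it does the opposite. A phonetic comparison can be combined with this one when spelling alone is too strict.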
Agencies (Organizations) will also need to decide on the degree of field value correlation
they are willing to accept in the matching process as constituting a match. If two records
have largely consistent, but not exact, field values in those areas being compared (e.g.
name, date of birth, address), the developer, in conjunction with business analysts, will
have to establish the boundary between acceptable difference and unacceptable difference,
a decision that will also need to take into account the risks posed by the various options.
[4] Control group: follows the exact methodology of all other surveys, but there is no
intervention event. (courtesy of Michael Cardy)
[5] Algorithm: A set of logic rules determined during the design phase of a data matching
application. The “blueprint” used to turn logic rules into computer instructions that detail
what steps to perform in what order
[6] Orthographic: A principle used in data matching where correct or accepted spelling
and characters are used to determine the results
Combine human involvement in the analysis of data matching results when flexible
matching has been employed
One of the efficiencies deliverable with the use of data
matching is the ability to automate particular actions or activities depending on the results
obtained. Such automated “cause and effect”, or “lights-out”, systems are based on the
perceived accuracy (or believability) of the results obtained and the low risk involved in
automating subsequent business activity. . . . Human evaluation of results not only
confirms the validity of any matching that has taken place but the analysis and
evaluation involved provides recursive advice for improved data matching.
Fields may also contain invalid or nonsensical values. For example, dates of birth may
contain zero-filled values, which can have a direct effect on the ratio of non-matches
obtained. Efforts should be made to identify and quantify the prevalence of such
characteristics. Knowing the preponderance of various data anomalies and characteristics
would assist in better understanding the data matching results obtained and more
correctly interpreting their significance.
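A preliminary check along these lines could be sketched as follows. The date format (`YYYY-MM-DD`), the category names and the sample values are assumptions made for the example; the point is only to count how common each kind of anomaly is before matching starts.

```python
from collections import Counter
from datetime import datetime


def dob_status(dob):
    # Classify one date-of-birth value; the expected format is an assumption.
    if dob is None or dob.strip() == "":
        return "missing"
    digits = [ch for ch in dob if ch.isdigit()]
    if digits and all(d == "0" for d in digits):
        return "zero-filled"
    try:
        datetime.strptime(dob.strip(), "%Y-%m-%d")
    except ValueError:
        return "invalid"
    return "valid"


# Made-up sample values for illustration only.
dobs = ["2010-05-17", "0000-00-00", "", "2010-02-30", "00000000"]
profile = Counter(dob_status(d) for d in dobs)
print(profile)
```

Reporting these counts alongside the match results makes it easier to tell a genuine non-match from one caused by an unusable date of birth.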
This is illustrated in the following two scenarios:
1. Failure to match is due to the fact that there exists no record for that identity in
the other databases.
2. A record exists for the same identity in the other databases, but there is a failure to
match because the date of birth for one record is zero-filled.
If, for example, an aim of a data matching exercise was to
determine which identities in a particular database exhibit higher identity risk by not
appearing in other databases, the inclusion of records from both of the above scenarios in
the same category of output skews any real understanding of the problem. A preliminary
analysis of data quality can help place subsequent results into context. Invalid, missing,
duplicate or otherwise “incorrect” values can be identified prior to matching. [7]
[7] https://www.ag.gov.au/RightsAndProtections/IdentitySecurity/Documents/Data%20matching%20better%20practice%20guidelines%20%5BPDF%20775KB%5D.pdf
My Match Key Schematic
This showcases the weighted values on the demographic fields in the One Stop and
Teacher Match workbooks. (Ctrl + Click) Image below:
Explanation:
1. Listed all demographic fields common to the One Stop and Teacher Match
workbooks
2. Set a priority for each field (1-7)
3. Set a numerical weight for each filter (0.5-3)
4. Set a Criterion Strength Point and % schematic (best to worst outcomes).
Walked through differing scenarios in which one or more fields were missing, with the
corresponding %
5. Created a Matching Legend for clarity in matching fields
6. Color-coded the % for ease of use
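The schematic itself is an image, so its actual fields, weights and percentages are not reproduced here. The sketch below only illustrates the general idea of a weighted match percentage: each field carries a weight in the 0.5-3 range, agreeing fields contribute their weight, and a missing field contributes nothing. The field names, the weights, the formula and the sample records are all assumptions made for this example.

```python
# Hypothetical weights in the 0.5-3 range; they sum to 10 for easy arithmetic.
WEIGHTS = {
    "last_name": 3.0,
    "first_name": 3.0,
    "date_of_birth": 2.0,
    "zip": 1.5,
    "gender": 0.5,
}


def match_percent(a, b, weights=WEIGHTS):
    # Sum the weights of fields that are present and equal in both records.
    total = sum(weights.values())
    agreed = sum(w for field, w in weights.items()
                 if a.get(field) and a.get(field) == b.get(field))
    return 100.0 * agreed / total


# Made-up records for illustration only.
rec_a = {"last_name": "Lopez", "first_name": "Ana",
         "date_of_birth": "2008-03-14", "zip": "28607", "gender": "F"}
rec_b = dict(rec_a, zip="")              # same person, zip code missing
rec_c = dict(rec_b, first_name="Anna")   # zip missing and first name differs

print(match_percent(rec_a, rec_b))  # 85.0
print(match_percent(rec_a, rec_c))  # 55.0
```

Under these assumed weights, the pair with one missing field still falls inside the 75-100% range, while the pair that also disagrees on first name falls below it and would need an approximate name comparison or manual review.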