This document provides an overview of concepts for matching criteria to ensure consistent data matching. It discusses using indexing/sorting and a blocking key like zip code to eliminate non-matches. Name variations that could impact matching are described, like abbreviations or flipped names. The importance of considering data quality issues and establishing a percentage variation spectrum is also covered. A schematic is presented showing weighted values assigned to demographic fields to determine match strength.
Table of Contents
INTRODUCTION
Percentage Variation Spectrum
Indexing / Sorting and a Blocking Key
Dataset Cursory Consideration
Other conceptual and practical concerns (extracts from Australian Attorney General Website)
My Match Key Schematic
INTRODUCTION
As you consider moving from data entry work to detail-oriented matching work, you need
a set of rules or factors that give you a consistent, standardized framework for your
matching.
Included in this Word document are a number of concepts from the
Australian Attorney General’s website and the book Data Matching by
Peter Christen that I found very useful for outlining data matching
concepts that will ensure data consistency and standard practices in your
ongoing matching work.
Percentage Variation Spectrum
How much leeway for error should I allow as I begin this process? I accept a match
strength in the 75-100% range, which allows for some variation in first and last names
due to human error. Please see the My Match Key Schematic at the end of this document
for more details.
Reasons for variations in names:
Abbreviations
Child’s limited thinking
Country Language Nomenclature¹
Flipped (Reverse) Names in fields
Name inconsistency (English vs. Native language)
Nickname versus Real Name
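A minimal Python sketch of how a few of these variations (nicknames, flipped names, small
typing errors) might be tolerated when comparing two names. The nickname table, the
function names and the use of the standard library's difflib similarity ratio are
assumptions made for illustration, not the method used in the workbooks; the 0.75 cutoff
mirrors the lower end of the 75-100% spectrum described above.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def canonical(name: str) -> str:
    """Lower-case a name, trim whitespace and expand a known nickname."""
    cleaned = name.strip().lower()
    return NICKNAMES.get(cleaned, cleaned)


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 as computed by difflib."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()


def name_score(first_a: str, last_a: str, first_b: str, last_b: str) -> float:
    """Average first/last name similarity, also trying the flipped order."""
    straight = (similarity(first_a, first_b) + similarity(last_a, last_b)) / 2
    flipped = (similarity(first_a, last_b) + similarity(last_a, first_b)) / 2
    return max(straight, flipped)


def is_candidate_match(first_a, last_a, first_b, last_b, threshold=0.75):
    """True when the name score reaches the (assumed) 75% cutoff."""
    return name_score(first_a, last_a, first_b, last_b) >= threshold
```

With this sketch, "Bill Smith" and "William Smith" score 1.0, as does a record whose first
and last names were entered in the wrong fields. Abbreviations and native-language
spellings would need their own lookup tables.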
Indexing / Sorting and a Blocking Key
Indexing (sorting) a dataset on a blocking key, e.g. zip code or last name, is a quick
way to eliminate records that cannot match before any detailed comparison begins.
Menu: Home, Sort and Filter, Custom Filter
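The same idea can be sketched outside a spreadsheet. The following Python example is an
illustration only: the record layout, field names and sample values are hypothetical. It
groups records by a blocking key and generates comparison pairs only within each group.

```python
from collections import defaultdict
from itertools import combinations


def block_by(records, key):
    """Group records by the value of the blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(records, key):
    """Yield only the record pairs that share a blocking-key value."""
    for block in block_by(records, key).values():
        yield from combinations(block, 2)


# Hypothetical sample data.
records = [
    {"id": 1, "last": "Smith", "zip": "30301"},
    {"id": 2, "last": "Smyth", "zip": "30301"},
    {"id": 3, "last": "Jones", "zip": "98101"},
    {"id": 4, "last": "Smith", "zip": "98101"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records, "zip")]
print(pairs)  # [(1, 2), (3, 4)]
```

Only 2 of the 6 possible pairs are compared. The trade-off is that a record with a wrong
or missing zip code is never compared with its true match, so the blocking key should be
a field that is entered reliably.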
Dataset Cursory Consideration
As you look at your dataset you may notice some similarities between records. These are
noteworthy as you begin your matching:
1. Phonetic similarity² – sounds the same
2. Character shape – looks the same
3. Numerical similarity³ – values are exact matches
Birthday and Date variations are another issue for discussion and consideration.
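Phonetic similarity (item 1 above) is what Soundex, mentioned in footnote 2, is designed
to capture. Below is a minimal sketch of the standard American Soundex rules in Python.
It is not the downloaded file referred to in the footnote, and the function name is an
assumption.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if there are no letters."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that receive the same code sound alike under these rules and can be treated as
candidates for a closer look rather than as confirmed matches.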
¹ https://en.wikipedia.org/wiki/Nomenclature
² Soundex . . . Downloaded file on G Drive
³ Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution,
and Duplicate Detection, Springer, page 70
Other conceptual and practical concerns
(extracts from Australian Attorney General Website)
Standardizing may involve the removal of non-alphabetic characters like hyphens, spaces
and apostrophes to produce a “standard format”. As an example, in instances where
“OConnor” would normally not match with “O’Connor”, standardizing would result in a
record in each file with the value “OConnor”, which would then produce matches.
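The standardizing step described above can be sketched in a few lines of Python. The
function name is an assumption, and this version keeps letter case as in the “OConnor”
example; whether to also fold case is a separate decision.

```python
def standardize(name: str) -> str:
    """Drop every character that is not a letter (hyphens, spaces, apostrophes, digits)."""
    return "".join(ch for ch in name if ch.isalpha())


print(standardize("O’Connor"))     # OConnor
print(standardize("Smith-Jones"))  # SmithJones
```

Because str.isalpha() accepts accented letters, a name such as “Muñoz” is left intact
rather than losing its non-English characters.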
Include a control group⁴: The use of a control group of records can assist in the
development of data matching applications and in interpreting the results of data
matching activity. By including a control group with known characteristics in the data
passing through the data matching application and observing the results, the effectiveness
of the application can be reviewed and refined.
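One way to picture a control group in practice is a small set of record pairs whose true
match status is already known. The Python sketch below is illustrative only: the function
names, the deliberately strict exact-match rule and the sample pairs are all assumptions.
It runs a matcher over such pairs and counts where it agrees and disagrees with the known
answers.

```python
def evaluate(matcher, control_pairs):
    """Count matcher outcomes over (value_a, value_b, is_true_match) triples."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for a, b, is_true_match in control_pairs:
        predicted = matcher(a, b)
        if predicted and is_true_match:
            counts["true_pos"] += 1
        elif predicted and not is_true_match:
            counts["false_pos"] += 1
        elif not predicted and is_true_match:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts


def exact_match(a: str, b: str) -> bool:
    """A deliberately strict matcher: equal after lower-casing."""
    return a.lower() == b.lower()


# Hypothetical control group with known outcomes.
control_pairs = [
    ("OConnor", "O'Connor", True),
    ("Smith", "Smith", True),
    ("Smith", "Smyth", False),
    ("Jones", "Brown", False),
]

counts = evaluate(exact_match, control_pairs)
print(counts)  # {'true_pos': 1, 'false_pos': 0, 'false_neg': 1, 'true_neg': 2}
```

The single false negative here is the “OConnor” / “O'Connor” pair, which points directly
at the standardizing step as the refinement to try next.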
Use name, date of birth, address in the algorithm⁵ design: In designing identity data
matching algorithms and applications, designers should consider the use of
name(s), date of birth and address, as using multiple aspects of record detail in
compared data enables greater flexibility in determining what constitutes a match.
Consideration may also need to be given to the use of the sex field, although many
agencies consider the susceptibility to miscoding of this value may negate its overall
usefulness.
Ensure the use of a flexible matching algorithm: Name matching should optimally employ
orthographic⁶, linguistic or phonetic (or any combination thereof) fuzzy logic pattern
matching. . . . Whether a matching solution has been developed in-house or is a
commercial product, developers will need to determine what constitutes a match.
Agencies (Organizations) will also need to decide on the degree of field value correlation
they are willing to accept in the matching process as constituting a match. If two records
have largely consistent, but not exact, field values in those areas being compared (e.g.
name, date of birth, address), the developer, in conjunction with business analysts, will
have to establish the boundary between acceptable difference and unacceptable difference,
a decision that will also need to take into account the risks posed by the various options.
⁴ Control group: follows the exact methodology of all other surveys, but there is no
intervention event. (courtesy of Michael Cardy)
⁵ Algorithm: A set of logic rules determined during the design phase of a data matching
application. The “blueprint” used to turn logic rules into computer instructions that detail
what steps to perform in what order
⁶ Orthographic: A principle used in data matching where correct or accepted spelling
and characters are used to determine the results
Combine human involvement in the analysis of data matching results when flexible
matching has been employed: One of the efficiencies deliverable with the use of data
matching is the ability to automate particular actions or activities depending on the results
obtained. Such automated “cause and effect”, or “lights-out”, systems are based on the
perceived accuracy (or believability) of the results obtained and the low risk involved in
automating subsequent business activity. . . . Human evaluation of results not only
confirms the validity of any matching that has taken place but the analysis and
evaluation involved provides recursive advice for improved data matching.
Fields may also contain invalid or nonsensical values. For example, dates of birth may
contain zero-filled values, which can have a direct effect on the ratio of non-matches
obtained. Efforts should be made to identify and quantify the prevalence of such
characteristics. Knowing the preponderance of various data anomalies and characteristics
would assist in better understanding the data matching results obtained and more
correctly interpreting their significance.
This is illustrated in the following two scenarios:
1. failure to match is due to the fact that there exists no record for that identity in the
other databases
2. a record exists for the same identity in the other databases but there is a failure to
match because the date of birth for one record is zero-filled.
If, for example, an aim of a data matching exercise was to determine which identities in
a particular database exhibit higher identity risk by not appearing in other databases,
the inclusion of records from both of the above scenarios in the same category of output
skews any real understanding of the problem. A preliminary analysis of data quality can
help place subsequent results into context. Invalid, missing, duplicate or otherwise
“incorrect” values can be identified prior to matching.⁷
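A preliminary data quality check of the kind described above might look like the
following Python sketch. The field name "dob", the record layout and the assumption that
dates are stored as text are all hypothetical; the aim is only to show how zero-filled
and missing dates of birth can be counted before matching starts.

```python
def profile_dates(records, field="dob"):
    """Count missing, zero-filled and remaining values in a text date field."""
    summary = {"missing": 0, "zero_filled": 0, "usable": 0}
    for record in records:
        value = (record.get(field) or "").strip()
        digits = [ch for ch in value if ch.isdigit()]
        if not value:
            summary["missing"] += 1
        elif digits and all(d == "0" for d in digits):
            summary["zero_filled"] += 1
        else:
            # Not missing or zero-filled; this does not prove the date is valid.
            summary["usable"] += 1
    return summary


# Hypothetical sample data.
sample = [
    {"dob": "1990-04-12"},
    {"dob": "0000-00-00"},
    {"dob": "00/00/0000"},
    {"dob": ""},
    {"dob": None},
    {},
]

print(profile_dates(sample))  # {'missing': 3, 'zero_filled': 2, 'usable': 1}
```

Reporting these counts alongside the match results makes it possible to separate "no
record exists" from "a record exists but its date of birth is unusable".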
⁷ https://www.ag.gov.au/RightsAndProtections/IdentitySecurity/Documents/Data%20matching%20better%20practice%20guidelines%20%5BPDF%20775KB%5D.pdf
My Match Key Schematic
This showcases the weighted values on the demographic fields in the One Stop and
Teacher Match workbooks. (Ctrl + Click) Image below:
Explanation:
1. Listed all demographic fields common to the One Stop and Teacher Match
workbooks
2. Set a priority for each field (1-7)
3. Set a numerical weight for each filter (0.5-3)
4. Set a Criterion Strength Point and % schematic (best to worst outcomes).
Walked through differing scenarios in which one or more fields were missing, with the
corresponding %
5. Created a Matching Legend for clarity in the matching fields
6. Color-coded the % for ease of use
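The weighting idea behind the schematic can be sketched in Python as follows. The field
names and individual weights are hypothetical stand-ins (the actual values are in the
schematic image and are not reproduced here); only the 0.5-3 weight range, the seven
fields and the 75% cutoff come from this document. A field earns its weight only when it
is present in both records and equal, so a missing field lowers the percentage.

```python
# Hypothetical fields and weights within the 0.5-3 range.
WEIGHTS = {
    "last_name": 3.0,
    "first_name": 2.5,
    "date_of_birth": 2.5,
    "address": 2.0,
    "zip_code": 1.5,
    "phone": 1.0,
    "sex": 0.5,
}


def match_strength(a: dict, b: dict, weights=WEIGHTS) -> float:
    """Percentage of the total weight earned by fields present in both records and equal."""
    total = sum(weights.values())
    earned = 0.0
    for field, weight in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a and value_b and value_a == value_b:
            earned += weight
    return 100.0 * earned / total


# Hypothetical records: identical except that the second has no phone number.
one_stop = {"last_name": "Smith", "first_name": "Ann", "date_of_birth": "1990-04-12",
            "address": "1 Main St", "zip_code": "30301", "phone": "5550100", "sex": "F"}
teacher = dict(one_stop, phone=None)

strength = match_strength(one_stop, teacher)
print(round(strength, 1))  # 92.3
print(strength >= 75.0)    # True: still inside the 75-100% spectrum
```

Giving the highest weights to the fields judged most reliable means that losing a
low-priority field leaves the score well inside the acceptable range, while a mismatch on
a high-priority field pulls it down quickly.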