This document provides an overview of matching concepts and introduces Ascential QualityStage. It covers variations and errors in data, and parsing, cleansing, and standardization. It explains the QualityStage architecture, with its Designer client and QualityStage server components, and outlines QualityStage procedures for standardization, matching, and de-duplication. Matching techniques such as phonetic coding, n-gram matching, and scoring are also summarized.
Quality Stage Standardization & Matching Training Edit007.ppt
1. Matching Concepts and an Introduction to Ascential QualityStage
Training Material
BI Practice, Chennai
2. Objective of the Training
Matching concepts:
• Matching
• Variations & errors
• Parsing, cleansing & standardization
QualityStage:
• Tool architecture – Designer client, QualityStage (QS) server
• Standardization – '.cls', '.pat' and '.tbl' rule files
3. Matching
Terminology:
• Name & address matching
• De-duplication, unduplication, merge-purge
• Customer ID (UID) generation
• Householding
When?
• Data warehouse projects
• Application integration
• Business mergers
• Data acquisition
Drivers:
• Absence of a persistent identifying key between the data sources
• Absence of a global standard for representation between the data sources
4. Matching Example
Society Of St. Vincent De Paul
The Scty Of Saint Vncnt De Pau
St Vincent De Paul Society
Sosiety Of Saint Vincent Dpl

Clymer Atty At Law Brian
Brian I Clymer Attorney At Law

Arizona Dept Of Agricutlture
Dept Of Agri Arizona
Arizona State Dept Of Agri
Az Agri Dept

• A fact can be represented in multiple standard forms
• Standards change over time
• Errors and variations occur during data capture & processing
• In practice, multiple standard forms/formats are used for data capture, processing and storage

Duplicate detection:
• Database consolidation
• Application consolidation
Query:
• Removing felons from voter lists
• List processing
8. Variation & Errors
Errors may include non-standard variations, additional words, missing words, or unknown data.
• Synonyms & nicknames
• Prefix & suffix variations
• Abbreviations & acronyms
• Anglicization & foreign versions of names
• Spelling, typing & phonetic errors
• Initials, inconsistently abbreviated names
• Transposition (word-sequence variations)
• Truncation & missing words
• Extra words
• Format, character & convention variations
Examples:
Dr John Doe Med. Doctor / Dr John Doe MD
Saint Louis University / St. Louis Univ.
Tata Consultancy Services Inc / TCS Incorporated
University of South Florida / South-Florida University (USF)
ABC CO Attn: Mr. Clark / The ABC CO, City of New York
Bill Clinton / William Clinton
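Phonetic coding is one way to absorb the spelling, typing and phonetic errors listed above: names that sound alike are reduced to the same short code before comparison. Below is a minimal sketch of classic American Soundex, shown purely as an illustration; it is not necessarily the phonetic routine QualityStage itself applies.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter + three digits."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":          # h and w are skipped and do not break a run
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code             # vowels (code == "") reset the run
    return (result + "000")[:4]
```

With this, "Smith" and "Smyth" both encode to S530, so the phonetic spelling variation disappears before matching.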
9. Matching Algorithm
Input File → Parsing, Cleansing & Standardization → Candidate Selection (against Reference Records) → Matching & Scoring → Apply Threshold/Cutoff → Match / No Match
The steps are driven by cleansing rules, matching rules, and a cutoff score.
• Reduce variations & errors through cleansing & standardization
• Retain the differentiators & remove the noise
• Filter dissimilar records and match the similar records
• Use fuzzy matching to handle unresolved variations & errors
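The fuzzy-matching step can be sketched with n-gram (q-gram) comparison, one of the techniques mentioned in this deck's summary: two strings are similar when they share many short substrings, which tolerates unresolved spelling variations. This is a hedged illustration of the idea, not the tool's built-in comparator.

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams, padded with '#' so boundary characters count."""
    s = f"#{s.lower()}#"
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over shared q-grams: 1.0 identical, 0.0 disjoint."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    shared = sum((ga & gb).values())
    return 2 * shared / (sum(ga.values()) + sum(gb.values()))
```

For example, "Vincent" and the truncated "Vncnt" from the earlier slide still share several bigrams, so they score well above zero while unrelated strings score near it.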
10. Parsing, Cleansing & Standardization
Raw Input → Lexical Analysis (Tokenization) → Contextual Parsing → Output

"Dr John Doe Jr PhD"
Tokens: Dr|John|Doe|Jr|PhD
Token types: Prefix|First|alpha|Gen|Suffix → Prefix|First|Last|Gen|Suffix
Output: Prefix = Dr., First Name = John, Last Name = Doe, Generation = Jr., Suffix = PhD

"123 Main Street Suite 101"
Tokens: 123|Main|Street|Suite|101
Token types: NNN|alpha|Type|Unit|NNN → Hsn|Street|Type|Unit|Unit#
Output: House Number = 123, Street Name = Main, Street Type = St, Unit Type = Ste, Unit Number = 101
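The address example above can be sketched in code. The token classes, the two lookup tables, and the single hard-coded pattern are simplifications chosen to reproduce this one slide; a real rule set (the '.cls'/'.pat' files) carries classification tables and many pattern-action rules.

```python
# Illustrative classification tables (a real '.cls' file is far larger).
STREET_TYPES = {"street": "St", "st": "St", "avenue": "Av", "av": "Av"}
UNIT_TYPES = {"suite": "Ste", "ste": "Ste", "apt": "Apt"}

def classify(token: str) -> str:
    """Assign a token type: NNN (numeric), Type, Unit, or alpha."""
    t = token.lower()
    if token.isdigit():
        return "NNN"
    if t in STREET_TYPES:
        return "Type"
    if t in UNIT_TYPES:
        return "Unit"
    return "alpha"

def parse_address(text: str):
    """Tokenize, build the pattern, and break it into attributes."""
    tokens = text.split()
    pattern = "|".join(classify(t) for t in tokens)
    if pattern == "NNN|alpha|Type|Unit|NNN":   # Hsn|Street|Type|Unit|Unit#
        return {"House Number": tokens[0],
                "Street Name": tokens[1],
                "Street Type": STREET_TYPES[tokens[2].lower()],
                "Unit Type": UNIT_TYPES[tokens[3].lower()],
                "Unit Number": tokens[4]}
    return None  # no pattern rule fired
```

Note that the stored values are the standard forms ("St", "Ste"), not the raw tokens, which is what makes later matching reliable.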
11. Parsing, Cleansing & Standardization
Parsing: understanding the parts to build a structure, and breaking the structure into meaningful parts.
• Identify the character set (code page)
• Translate the code page
• Identify delimiters, operators, punctuation, allowable characters and special characters; ignore the rest
• Parse text into tokens
• Assign token types and build a pattern (sentence structure)
• Break the pattern into individual attributes (based on context)
• Store the standard form for each parsed attribute
Rules development – identify words, then assign word types and standard values; define patterns and parsing rules. Guided by:
• The 80-20 rule
• Frequencies
• Context & data placement
12. Candidate Selection
Candidate selection is the process of identifying likely matching records.
Parsed & standardized output → derive candidate key → candidate selection.
Input record:
Dr. John Doe Jr. PhD
123 Main St Ste 101
Cottonwood, CA 92626
Candidate key: 926-Ma-Do-Jo
Candidates selected:
John Doe 123 Main St Cottonwood CA 92626
Jones Donald 123 Maple Av 101 Cottonwood CA 92626
Joseph Don 456 Main Ln Cottonwood CA 92626
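Deriving the key "926-Ma-Do-Jo" from the sample record can be sketched directly (field names are assumptions for the sketch): 3 bytes of ZIP, then 2 bytes each of street, last, and first name.

```python
# Sketch of candidate-key derivation for the example above.
def candidate_key(rec: dict) -> str:
    return "-".join([rec["zip"][:3], rec["street"][:2],
                     rec["last"][:2], rec["first"][:2]])

rec = {"first": "John", "last": "Doe", "street": "Main", "zip": "92626"}
print(candidate_key(rec))  # 926-Ma-Do-Jo
```

Records sharing this key land in the same candidate cluster, which is why "Joseph Don 456 Main Ln" can still surface as a candidate while records in other ZIP areas never get compared.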
13. Candidate Keys
A candidate key is a derivative of the entity/record to be matched; it forms clusters of similar records.
Small keys form large clusters (general); large keys form small clusters (restrictive).
Other names: candidate code, blocking key, window key.
Design considerations:
Use multiple blocking keys (to handle missing values and variations).
Balance performance (candidate-set size) against miss rate (quality) and hit rate (matching).
Use of a candidate key decreases the cost of matching by reducing the number of records being compared, resulting in higher throughput and performance.
Data skew will cause large clusters.
Matching – O(n); de-duplication – O(n²).
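A quick back-of-the-envelope count shows why blocking matters for the O(n²) de-duplication case: with k evenly sized blocks, each record is only compared within its own block.

```python
# Pair counts with and without blocking, for n records and k blocks.
n, k = 1_000_000, 1000
all_pairs = n * (n - 1) // 2                     # brute-force de-dup
per_block = (n // k) * ((n // k) - 1) // 2       # pairs inside one block
blocked_pairs = k * per_block                    # pairs across all blocks
print(all_pairs)                    # 499999500000
print(blocked_pairs)                # 499500000
print(all_pairs // blocked_pairs)   # 1001 (roughly 1000x fewer comparisons)
```

The even-block assumption is the idealized case; data skew (one huge ZIP code, say) pushes the count back toward the brute-force figure, which is the "large clusters" caveat above.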
14. Matching & Scoring
Parsed & standardized output is scored against the candidates.
Input: Dr. John Doe Jr. PhD 123 Main St Ste 101 Cottonwood, CA 92626
Rule: if score >= 95 then Match, otherwise No-Match.
John Doe 123 Main St Cottonwood CA 92626 → YYYY--YYY = 100 → match
Jones Donald 123 Maple Av 101 Cottonwood CA 92626 → xxYx--YYY = 70 → no-match
Joseph Don 456 Main Ln Cottonwood CA 92626 → xxxY--YYY = 70 → no-match
(Y = field agrees, x = field disagrees.)
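Field-wise scoring against the cutoff of 95 can be sketched as a weighted sum (the field list and weights below are illustrative, not QualityStage's actual values):

```python
# Each agreeing field contributes its weight; the total is the score.
WEIGHTS = {"first": 20, "last": 30, "house": 15, "street": 15, "zip": 20}

def score(a: dict, b: dict) -> int:
    return sum(w for f, w in WEIGHTS.items() if a.get(f) == b.get(f))

inp = {"first": "john", "last": "doe", "house": "123",
       "street": "main", "zip": "92626"}
cand = {"first": "john", "last": "doe", "house": "123",
        "street": "main", "zip": "92626"}
print(score(inp, cand), score(inp, cand) >= 95)  # 100 True
```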
15. Matching Functions
Exact matching
Phonetic matching: Soundex, NYSIIS (New York State Identification and Intelligence System)
Edit distance
Prefix, suffix & initial matching
Acronym matching
String matching (exact, approximate)
Interval matching (numeric data)
Date matching (exact, difference)
… other tool-specific algorithms
Word, field and string matching apply to data such as: Name, Address, URL, E-mail ID, SSN, Phone, Date, Number.
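Of these, Soundex is compact enough to sketch. The version below assumes the standard American Soundex rules (first letter kept; consonants mapped to digit classes; vowels reset the previous code; h and w do not separate identical codes):

```python
# Standard American Soundex sketch: name -> letter + 3 digits.
def soundex(name: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                  # h/w do not separate identical codes
        if ch in "aeiouy":
            prev = ""                 # vowels reset the previous code
            continue
        code = codes.get(ch, "")
        if code and code != prev:     # collapse adjacent identical codes
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that sound alike hash to the same code ("Robert" and "Rupert" both give R163), so phonetic matching survives the spelling and typing errors listed on slide 8.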
16. Deterministic & Probabilistic Matching
Are these two records a match?
WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
Deterministic decision tables: fields are evaluated for degree of match and a letter grade is assigned; the grades form a "match pattern" which is looked up in a table to determine whether the pair Matches, Fails, or is Suspect.
B B A A B D B A = BBAABDBA
Probabilistic linkage: fields are evaluated for degree of match and a weight is assigned which represents the "informational content" contributed by those values; the weights are summed to derive a total score that measures the statistical probability of a match.
+9 +2 +14 +5 +4 -1 +5 +11 = +49
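The per-field weights in probabilistic linkage are conventionally derived from m- and u-probabilities (Fellegi–Sunter style): m = P(field agrees | records match), u = P(field agrees | records do not match). The probabilities below are illustrative:

```python
# Agreement weight = log2(m/u); disagreement weight = log2((1-m)/(1-u)).
# A rare, reliable field (high m, low u) earns a large positive weight.
from math import log2

def weights(m: float, u: float):
    return log2(m / u), log2((1 - m) / (1 - u))

agree, disagree = weights(m=0.9, u=0.1)
print(round(agree, 2), round(disagree, 2))  # 3.17 -3.17
```

This is why a matching SSN contributes far more weight than a matching gender code: u is tiny for SSN (unrelated people rarely share one) but close to 0.5 for gender.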
17. Introduction to Ascential QualityStage
Training material, BI Practice, Chennai.
18. QualityStage Procedures (Stages)
Utility procedures:
FFC – file format converter (delimited to fixed and vice versa).
GTF – code page translation, data types, column derivation, etc.
SLC – column and row filtering.
SORT – reorder data files.
UNI – inner, left, right & full outer joins on flat files.
PGM – run command-line programs from within the procedure.
Analysis/investigation procedures:
CLP – field domain frequency distribution.
PRS – parse (space-delimited) free-form text into words for analysis.
NMA – name abbreviation key generation for matching.
Character, word & pattern investigation/analysis.
Standardization: parsing, cleansing & standardization of Name, Address, Phone, URL, E-mail ID and other domains.
Matching: de-duping and reference matching.
Survivorship: cross-population of fields within a duplicate group.
19. QualityStage Project
A QualityStage project (QS-Project-A) contains jobs, stages, files, and rules (parsing, cleansing & standardization rules; matching rules; survivorship rules).
Example project flow:
Job-1: Raw Data → Standardization (STAN) → standardized output, with a Filter (SLC) separating good records from bad-stan records (non-standard data).
Job-2: standardized data → De-Dup (UNDUP) → duplicate groups; Reference Match (GEOREF) against the Ref File → Matched (old) and No-Match (new) records; New ID Assignment for the new records; Collect (UNIX) gathers the streams into the Output.
Job-3: add the new records to the Ref. File.
20. QualityStage Tool Architecture
QS Designer Client (Windows PC): the QS developer creates, reads, updates and deletes (CRUD) projects, jobs, stages, files, and parsing/cleansing/standardization, matching and survivorship rules; designs are stored locally (x.mdb) and can be exported/imported as .imf files.
QS Server (Unix or Windows): holds the project work area used for deployment of jobs & rules, and runs deployed jobs.
Interaction: the client deploys the job, tells the server to run it, and the job status is reported back to the client.
21. QualityStage Standardization Procedure
The Standardization stage reads the raw input (INREC, x bytes) and appends the parsed fields defined by the dictionary (DCT, d bytes), so OUTREC = (d + x) bytes.
Rule-set files for the US name domain:
USNAME.PRC
USNAME.DCT – dictionary: layout of the parsed fields.
USNAME.CLS – word classification table.
USNAME.PAT – pattern rules.
USNAME.UCL – user-defined pattern rules.
USNAMEIP.TBL, USNAMEIT.TBL, USNAMEMF.TBL, USNAMEUP.TBL, USNAMEUT.TBL, USFIRSTN.TBL, USGENDER.TBL – lookup tables for special processing.
22. Word Classification
Word class: 1 byte. User-defined classes use A–Z; there are also implicit types.
NULL type: the numeric zero '0' nullifies the token, so the token does not participate in pattern parsing.
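The classification step ahead of pattern parsing can be sketched as a table lookup (the table and class letters below are illustrative, not the contents of USNAME.CLS):

```python
# Sketch of word classification: each token maps to a one-byte class;
# class '0' (NULL) drops the token from the pattern entirely.
CLASSES = {"DR": "P", "MR": "P", "JR": "G", "PHD": "S", "THE": "0"}

def pattern(tokens):
    out = []
    for t in tokens:
        cls = CLASSES.get(t.upper(), "+")   # '+' = unclassified word
        if cls == "0":
            continue                        # NULL class nullifies the token
        out.append(cls)
    return "|".join(out)

print(pattern(["The", "Dr", "John", "Doe", "Jr"]))  # P|+|+|G
```

Note how "The" vanishes from the pattern: that is the NULL-type behavior described above.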
23. Word Classification file (USNAME.CLS)
24. Dictionary file (USNAME.DCT)
FORMAT SORT=N
;-------------------------------------------------------------------------------
; USNAME Dictionary File
;-------------------------------------------------------------------------------
; Business Intelligence Fields
;-------------------------------------------------------------------------------
NT C 1 S NameType ;0001-0001
GC C 1 S GenderCode ;0002-0002
NP C 20 S NamePrefix ;0003-0022
FN C 25 S FirstName ;0023-0047
MN C 25 S MiddleName ;0048-0072
LN C 50 S PrimaryName ;0073-0122
NG C 10 S NameGeneration ;0123-0132
NS C 20 S NameSuffix ;0133-0152
AN C 50 S AdditionalNameInformation ;0153-0202
;-------------------------------------------------------------------------------
; Matching Fields
;-------------------------------------------------------------------------------
MF C 25 S MatchFirstName ;0203-0227
NF C 8 X NYSIISofMatchFirstName ;0228-0235
SF C 4 Z RSoundexofMatchFirstName ;0236-0239
ML C 50 S MatchPrimaryName ;0240-0289
HK C 10 S HashKeyofMatchPrimaryName ;0290-0299
PK C 20 S PackedKeyofMatchPrimaryName ;0300-0319
NW C 1 S NumberofMatchPrimaryWords ;0320-0320
W1 C 15 S MatchPrimaryWord1 ;0321-0335
W2 C 15 S MatchPrimaryWord2 ;0336-0350
W3 C 15 S MatchPrimaryWord3 ;0351-0365
W4 C 15 S MatchPrimaryWord4 ;0366-0380
W5 C 15 S MatchPrimaryWord5 ;0381-0395
N1 C 8 X NYSIISofMatchPrimaryWord1 ;0396-0403
S1 C 4 Z RSoundexofMatchPrimaryWord1 ;0404-0407
N2 C 8 X NYSIISofMatchPrimaryWord2 ;0408-0415
S2 C 4 Z RSoundexofMatchPrimaryWord2 ;0416-0419
;-------------------------------------------------------------------------------
; Reporting Fields
;-------------------------------------------------------------------------------
UP C 30 S UnhandledPattern ;0420-0449
UD C 100 S UnhandledData ;0450-0549
IP C 30 S InputPattern ;0550-0579
ED C 25 S ExceptionData ;0580-0604
UO C 2 S UserOverrideFlag ;0605-0606
Each dictionary (USNAME.DCT) line defines one field: the field name (e.g. NT), data type (C = character), length in bytes, a null-handling flag (S = space as null, Z = zero as null, X = both as null), a display name, and a comment giving the field's byte offsets (e.g. ;0001-0001).
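Since the dictionary fully determines the fixed-width output layout, a record can be sliced back into named fields directly from it. A sketch (assuming whitespace-separated name, type, length, null flag and display name per line, as in the listing above):

```python
# Parse DCT-style lines and compute each field's byte offsets, then
# slice a fixed-width record into named fields.
def parse_dict(lines):
    fields, offset = [], 0
    for line in lines:
        line = line.split(";")[0].strip()   # drop the offset comment
        if not line or line.startswith("FORMAT"):
            continue
        name, _type, length, _nulls, display = line.split()[:5]
        fields.append((display, offset, offset + int(length)))
        offset += int(length)
    return fields

dct = ["NT C 1 S NameType ;0001-0001",
       "GC C 1 S GenderCode ;0002-0002",
       "NP C 20 S NamePrefix ;0003-0022"]
record = "PM" + "Dr".ljust(20)              # a 22-byte sample record
for display, start, end in parse_dict(dct):
    print(display, "=", record[start:end].rstrip())
```

Running this prints NameType = P, GenderCode = M, NamePrefix = Dr, matching the cumulative offsets shown in the listing's comments.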
26. QualityStage Matching Procedure
UNDUP – de-duplication of one file.
Inputs: standardized output; matching rules (*.MAT).
Outputs: Dup Groups (record groups of 2 or more records) and Residuals (singleton records).
27. QualityStage Matching Procedure (contd.)
GEOMATCH – one-to-many matching: one record from File-A can match many records in File-B.
Inputs: standardized output (A); reference file (B); matching rules (*.MAT).
Outputs: Matches (A→B) – record groups of 2 or more records; Dup Groups (B) – record groups of 2 or more records; Residuals (A) and Residuals (B) – singleton records.
29. Multi-Pass Matching
A match application may use up to 7 passes, each with its own blocking key and matching fields.
Pass-1 blocking key: 3 bytes of ZIP + 2 bytes of street name + 2 bytes of last name + 2 bytes of first name.
Pass-1 matching fields: first name, middle name, last name, house number, street name, ZIP.
Pass-2 blocking key: 2 bytes of state code + 3 bytes of city name + 3 bytes of street name + 2 bytes of last name + 2 bytes of first name.
Pass-2 matching fields: first name, middle name, last name, house number, street name, city.
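Multi-pass candidate selection can be sketched as a union of pairs over several blocking keys (illustrative keys, not the pass definitions above): a pair missed by one key, say because of a ZIP error, can still be caught by another.

```python
# Union candidate pairs across several blocking-key passes.
def multi_pass_pairs(records, key_funcs):
    pairs = set()
    for key in key_funcs:
        blocks = {}
        for i, r in enumerate(records):
            blocks.setdefault(key(r), []).append(i)
        for ids in blocks.values():            # pair up within each block
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    pairs.add((ids[a], ids[b]))
    return pairs

recs = [{"zip": "92626", "last": "Doe", "first": "John"},
        {"zip": "92626", "last": "Doe", "first": "Jon"},
        {"zip": "92000", "last": "Doe", "first": "John"}]  # bad ZIP
pass1 = lambda r: r["zip"][:3] + r["last"][:2]             # ZIP-based key
pass2 = lambda r: r["last"][:2] + r["first"][:2]           # name-based key
print(sorted(multi_pass_pairs(recs, [pass1, pass2])))
```

Pass-1 alone never pairs record 2 with the others (its ZIP differs), but pass-2's name-based key recovers it, so all three pairs are produced.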
30. Match Pass Parameters (work in progress)
Per pass and variable type (VarType): M-prob, U-prob, agreement weight, disagreement weight.
To be completed…