On Client and Transaction Identification and Matching Problems

Veljko Pejović
Authors: Veljko Pejović, Emil Varga and Marko Stanković.

Published in: Technology, Business

  1. On Client and Transaction Identification and Matching Problems. Veljko Pejović, veljkoveljko@gmail.com. Coauthors: Emil Varga, Marko Stanković.
  2. Presentation Outline: Introduction; Data Input and Identification Problems; Known Solutions; Damerau Edit Distance Algorithm Modifications; Algorithm Application, Weight Factors Determination; Algorithm Regionalization; Evaluation; Conclusion and Future Work Guidelines.
  3. Introduction. Data input problems: unnecessary repetition of a character (Jaack), character permutation (Jakc), character omission (Jck), and initials, abbreviations, etc. (J.W.). Identification problems: attribute comparison, weight factor determination for each attribute pair, and correlations between attributes that must be taken into account.
  4. Identification Problems. Identification criteria: similarity of corresponding fields brings us closer to identifying the entity. Identification threshold: the similarity probability above which the client is considered identified (the higher threshold). Similarity threshold: the similarity probability above which two entities can be claimed to be similar (the lower threshold).
  5. Known Solutions. LCS approach: finds the longest common subsequence of two strings. Example: for 'GCTAT' and 'CGATTA', a longest common subsequence is 'GTT'. Ratcliff/Obershelp algorithm: returns the similarity of two strings as a percentage.
  6. Known Solutions. Edit distance approach: the edit distance is the difference between two strings, measured by the operations needed to transform one into the other; every operation has a cost. Algorithms: Levenshtein, with 3 basic operations (insertion, deletion, substitution); Damerau edit distance, which adds a fourth operation, character transposition.
  7. Damerau Edit Distance Algorithm Modifications. The algorithm is adjusted to the given problem by addressing these key problems. Unnecessary repetition of a character: lower cost of the insertion operation. Use of initials: only the starting letters are compared. Separator omission: separators are ignored. Use of abbreviations: an abbreviation dictionary (data mining).
  8. Algorithm Application, Weight Factors Determination. Table Clients: Name, Surname, Personal ID Number, City, Street, Apt. No., Zip Code, Date of Birth, Client ID (as a primary key). Table Transactions: Name, Surname, Personal ID Number, City, Street, Apt. No., Zip Code, Date of Birth, Transaction ID (as a primary key), Internal Transaction Number, Type of Transaction, Amount, Account No.
  9. Algorithm Application, Weight Factors Determination. Table Result: Client ID, Transaction ID, Probability for Name, Probability for Surname, Probability for Personal ID Number, Probability for City, Probability for Street, Probability for Apt. No., Probability for Zip Code, Probability for Date of Birth, Total Probability Result.
  10. Algorithm Application, Weight Factors Determination. Corresponding attributes in the two tables (Clients and Transactions) are compared iteratively, for every pair of attributes. Each calculated similarity probability is stored in table Result.
  11. Algorithm Application, Weight Factors Determination. Weight factors should be well determined. The leaves represent the probability of similarity of two attributes, in [-100%, 100%]. The branches represent weight factors, in [0, 1].
  12. Algorithm Application, Weight Factors Determination. Certain attributes correlate. Data redundancy. Dictionary table. Total probability calculation:
      $$ r = \begin{cases} pid, & pid > I \\ nad, & nad > I \\ pid \cdot q + nad \cdot (1 - q), & pid > 0 \land nad > 0 \\ 0, & \text{otherwise} \end{cases} $$
      $$ q = \begin{cases} \dfrac{pid}{nad}, & pid > nad \\ \dfrac{nad}{pid}, & nad > pid \end{cases} $$
  13. Algorithm Application, Weight Factors Determination. Thresholds: identification threshold ~ 94%; similarity threshold ~ 54%. Results above the similarity threshold are stored in table Result.
  14. Algorithm Regionalization. Common names/surnames: the more common a name pair is, the less influence it has on total similarity; weight factors are adjustable. Characteristic suffixes, infixes and prefixes (-ić, -Van-, Mc-): these are ignored during the matching phase. Different alphabets: alphabet “leveling” (ћирилица, ćirilica, cirilica…).
  15. Evaluation. Competitive solution: based on a simple LCS algorithm. Test vectors, example: “Z. Mihajlović, Sremska 33, Bgf, 11000” and “Zoran Mihailović, Sremska 33, Beograd 11000”. Result evaluation.
  16. Conclusion and Future Work Guidelines. Main strong points of the proposed solution: based on a well-developed and well-examined algorithm; adjusted to one particular problem; dynamic reliability improvement; flexibility; regionalization.
  17. Conclusion and Future Work Guidelines. Possible improvements: automatic database update after the identification process; coding an address to an “address code”; mapping the standard key settings on different keyboard layouts; dynamic change of the identification and similarity thresholds, adjusted to the users’ expectations; verification of the system in a “real world” setting.
  18. Thank You! Comments and questions, please!
  19. On Client and Transaction Identification and Matching Problems. Veljko Pejović, veljkoveljko@gmail.com. Coauthors: Emil Varga, Marko Stanković. Comments and questions, please!
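
Example for slide 5 (LCS approach). The sketch below computes the length of a longest common subsequence by dynamic programming and checks the slide's 'GCTAT' / 'CGATTA' example. The normalization in `lcs_similarity` is an assumption made for illustration, not something the slides specify. Python's standard `difflib.SequenceMatcher` is used as a stand-in for Ratcliff/Obershelp; it is closely related to that algorithm but not guaranteed to be identical to it.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalization to [0, 1]: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


print(lcs_length("GCTAT", "CGATTA"))  # 3, e.g. 'GTT'
# Gestalt-style similarity ratio from the standard library.
print(SequenceMatcher(None, "Mihajlović", "Mihailović").ratio())
```

Several different subsequences of length 3 exist for this pair ('GTT' is one of them), which is why the function returns only the length.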
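
Example for slides 6 and 7 (Damerau edit distance with a cheaper insertion for repeated characters). This is a minimal sketch, not the authors' implementation: it uses the restricted (optimal string alignment) variant of the Damerau distance, and the function name, the cost parameters and the value 0.5 for `repeat_ins` are all assumptions chosen for illustration.

```python
def damerau_distance(ref: str, typed: str, *, ins: float = 1.0, dele: float = 1.0,
                     sub: float = 1.0, swap: float = 1.0,
                     repeat_ins: float = 0.5) -> float:
    """Cost of turning ref into typed (restricted Damerau-Levenshtein)."""
    n, m = len(ref), len(typed)

    def ins_cost(j: int) -> float:
        # Inserting typed[j - 1] is cheaper when it repeats the previous typed character.
        if j >= 2 and typed[j - 1] == typed[j - 2]:
            return repeat_ins
        return ins

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(j)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = ref[i - 1] == typed[j - 1]
            best = min(
                d[i - 1][j] + dele,                       # deletion
                d[i][j - 1] + ins_cost(j),                # insertion
                d[i - 1][j - 1] + (0.0 if same else sub),  # match or substitution
            )
            if (i > 1 and j > 1 and ref[i - 1] == typed[j - 2]
                    and ref[i - 2] == typed[j - 1]):
                best = min(best, d[i - 2][j - 2] + swap)  # adjacent transposition
            d[i][j] = best
    return d[n][m]


for typed in ("Jaack", "Jakc", "Jck"):
    print(typed, damerau_distance("Jack", typed))
```

With these assumed costs, the repeated character in 'Jaack' is penalized less than the transposition in 'Jakc' or the omission in 'Jck'. The discount applies only to insertions, following the slide, so the distance is not symmetric in its two arguments.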
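
Example for slides 7 and 14 (separators, initials, alphabet leveling). The sketch below shows one possible preprocessing step, assuming Serbian Cyrillic input as in the slide's ћирилица / ćirilica / cirilica example. All function names are hypothetical, the mapping of "đ" to "dj" is an assumed convention, and the exact-equality check for full names is only a placeholder where an edit distance comparison would normally go.

```python
import unicodedata

# Serbian Cyrillic to Latin; some letters map to two Latin characters.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def level_alphabet(text: str) -> str:
    """Lowercase, transliterate Cyrillic to Latin, then drop diacritics."""
    latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())
    latin = latin.replace("đ", "dj")  # "đ" has no Unicode decomposition
    decomposed = unicodedata.normalize("NFD", latin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_separators(text: str) -> str:
    """Ignore separators: keep letters and digits only."""
    return "".join(ch for ch in text if ch.isalnum())


def first_names_match(a: str, b: str) -> bool:
    """If either side is an initial, compare the starting letters only."""
    a_core = strip_separators(level_alphabet(a))
    b_core = strip_separators(level_alphabet(b))
    if not a_core or not b_core:
        return False
    if len(a_core) == 1 or len(b_core) == 1:
        return a_core[0] == b_core[0]
    return a_core == b_core  # placeholder for an edit distance comparison


print(level_alphabet("ћирилица"), level_alphabet("Ćirilica"))
print(first_names_match("Z.", "Zoran"))
```

Leveling all three spellings to the same ASCII form before matching keeps the edit distance from counting alphabet differences as typing errors.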
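
Example for slides 11 and 13 (weight factors and thresholds). The sketch below combines per-attribute similarities with weight factors and classifies the total against the two thresholds from slide 13. It uses a plain weighted average, which is a simplification swapped in for illustration: it is not the piecewise total probability formula of slide 12 and does not model attribute correlations. The attribute names and weight values are hypothetical.

```python
from typing import Mapping

IDENTIFICATION_THRESHOLD = 0.94  # approximate value from slide 13
SIMILARITY_THRESHOLD = 0.54      # approximate value from slide 13


def total_similarity(similarities: Mapping[str, float],
                     weights: Mapping[str, float]) -> float:
    """Weighted average of similarities in [-1, 1] with weights in [0, 1]."""
    weight_sum = sum(weights[name] for name in similarities)
    if weight_sum == 0:
        return 0.0
    weighted = sum(weights[name] * value for name, value in similarities.items())
    return weighted / weight_sum


def classify(total: float) -> str:
    """Map a total similarity to the outcome implied by the two thresholds."""
    if total > IDENTIFICATION_THRESHOLD:
        return "identified"
    if total > SIMILARITY_THRESHOLD:
        return "similar"
    return "no match"


# Hypothetical weights and similarities for one client/transaction pair.
weights = {"name": 0.5, "surname": 0.8, "personal_id": 1.0, "city": 0.3}
similarities = {"name": 0.9, "surname": 0.95, "personal_id": 1.0, "city": 0.6}
total = total_similarity(similarities, weights)
print(round(total, 3), classify(total))
```

Only pairs classified as "similar" or "identified" would be written to table Result, since slide 13 stores results above the similarity threshold.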
