G-Link_Probabilistic Record Linkage System_PVER Conf_May2011


Published on

May 2011 Personal Validation and Entity Resolution Conference. Presenter: Antoine Chevrette, System Engineering Division, Statistics Canada

Published in: Technology
  • This menu allows access to all system options, such as: project creation; rule, frequency weight and special SQL script import/export; data importation and visualization; frequency weight creation; initial pairs creation; rules management; group creation and mapping.

    1. G-Link: A Probabilistic Record Linkage System. Antoine Chevrette, System Engineering Division, Statistics Canada
    2. Agenda
       • Background: early days of record linkage
       • Motivation for building G-Link
       • G-Link design objectives
       • System overview
       • Software installation
       • What’s in the future?
       07/09/11 Statistics Canada • Statistique Canada
    3. Theory of Record Linkage
       • Ivan Fellegi & Alan Sunter
           • “A Theory for Record Linkage” (1969)
       • Still widely regarded as both pivotal and definitive
       • Implemented in Statistics Canada’s linkage software
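As a rough illustration of the Fellegi-Sunter approach named on this slide, the sketch below computes per-field agreement and disagreement weights from m- and u-probabilities and sums them for a record pair. The field names and probability values are made up for illustration, and nothing here is taken from G-Link itself.

```python
import math

def field_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one field.

    m: probability the field agrees given the pair is a true match
    u: probability the field agrees given the pair is not a match
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Hypothetical m/u probabilities, chosen only for illustration.
FIELDS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
}

def pair_weight(agreements):
    """Sum field weights; `agreements` maps field name -> True/False."""
    total = 0.0
    for name, (m, u) in FIELDS.items():
        agree, disagree = field_weights(m, u)
        total += agree if agreements[name] else disagree
    return total
```

A pair that agrees on every field gets a higher total weight than one that disagrees on some field, which is what lets thresholds separate likely links from likely non-links.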
    4. Linkage Systems at Statistics Canada
       • Ted Hill (SDD) and Martha Fair (Health) produced the “Generalized Iterative Record Linkage System” (GIRLS)
           • First released as a mainframe-only product (GIRLS) ca. 1980
           • Re-engineered for Unix servers ca. 1990 (renamed GRLS)
       • Larger linkages became practical over time
       • Functionality and ease of use encouraged wider application
    5. Why Replace GRLS?
       • GRLS is fully functional and very popular, but:
           • Requires the use of a Unix-based server
           • Requires connection with the Oracle DBMS
       • Potential applications saw the architecture as a barrier
       • GRLS software was aging & required significant updates
    6. G-Link Design Objectives
       • Operable on all Windows desktops
       • Available for both Windows & Unix servers
       • No third-party software dependencies
       • No additional licensing fees
       • Full GRLS work-alike functionality
       • Processing speed comparable to GRLS
       • Extensible
       • Easy to use
    7. G-LINK Overview
       • G-LINK introduction through:
           • Menu options
           • The following screens:
               • Project creation
               • Data importation
               • Data analysis
               • Pairs creation
               • Index creation
               • Rules creation
               • Graph and pairs distribution weights
               • Pairs review
               • Group creation and mapping
               • Data exportation
               • Batch functionality
       • Installation instructions
    8. G-LINK overview
    9. Project Creation
       • External or internal linkage
           • Internal: e.g. find duplicate records in an address file
           • External: e.g. link a cancer database with a death database
       • Information taken from a configuration file (for server mode only)
       • Project protected by a username and password
    10. Data Importation
        • You can see the first 100 observations from the SAS file
        • Once the importation is complete, you can create derived columns based on NYSIIS and Soundex
        • Definitions for the columns to import
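To make the idea of a Soundex-derived column concrete, here is a minimal sketch of the standard American Soundex code. It is an illustration only: G-Link's own implementation may differ in details, the function name `soundex` is an assumption, and NYSIIS is not shown.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of `name` ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so equal codes across a vowel are kept
    return (result + "000")[:4]

# Derive a phonetic column from a hypothetical list of surnames.
surnames = ["Robert", "Rupert", "Tymczak"]
derived = [(s, soundex(s)) for s in surnames]
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, so a derived column like this can be used to bring differently spelled names together before comparison.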
    11. Data analysis
        • Obtain the frequency of each field value
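The frequency count described on this slide can be sketched in a few lines. This is a minimal illustration, assuming records are held as dictionaries; the function name `value_frequencies` and the sample data are hypothetical and not part of G-Link.

```python
from collections import Counter

def value_frequencies(records, field):
    """Map each value of `field` to (count, relative frequency)."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.items()}

# Hypothetical input records.
records = [
    {"surname": "SMITH"},
    {"surname": "SMITH"},
    {"surname": "SMITH"},
    {"surname": "CHEVRETTE"},
]
freqs = value_frequencies(records, "surname")
```

Agreement on a rare value is stronger evidence of a match than agreement on a common one, which is why frequency tables of this kind are a usual input to frequency-based weights.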
    12. Pairs Creation
        • Create pairs interactively
        • Experienced users can write SQL statements directly
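As a sketch of what an SQL statement for pair creation might look like, the example below builds candidate pairs for an internal linkage by joining a table to itself on a blocking key. SQLite is used here only as a convenient stand-in; the table name `persons`, its columns, and the sample rows are all assumptions, not G-Link's actual schema or DBMS.

```python
import sqlite3

def create_pairs(conn):
    """Build candidate pairs whose records share the same surname code."""
    conn.execute("DROP TABLE IF EXISTS pairs")
    conn.execute(
        """
        CREATE TABLE pairs AS
        SELECT a.id AS id_a, b.id AS id_b
        FROM persons AS a
        JOIN persons AS b
          ON a.surname_code = b.surname_code  -- blocking key
         AND a.id < b.id                      -- skip self-pairs and mirror pairs
        """
    )
    query = "SELECT id_a, id_b FROM pairs ORDER BY id_a, id_b"
    return conn.execute(query).fetchall()

# Hypothetical data: four records, three of which share a surname code.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons (id INTEGER PRIMARY KEY, surname_code TEXT, birth_year INTEGER)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?)",
    [(1, "S530", 1970), (2, "S530", 1970), (3, "J520", 1981), (4, "S530", 1969)],
)
pairs = create_pairs(conn)
```

Restricting the join to a blocking key keeps the number of candidate pairs far below the full cross product, at the cost of missing true matches whose keys differ.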
    13. Rule Creation
        • 3-level character rule
    14. Rule Creation
        • 3-level character matrix rule
    15. Rule Creation
        • 2-level date rule
    16. Rule Creation
        • Numerical condition rule
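The rule slides above name multi-level character and date rules without spelling out the levels. The sketch below shows one plausible reading, in which a comparison returns an outcome level rather than a plain yes/no. The level definitions (exact, same prefix, disagreement), the `prefix_len` parameter, and the function names are assumptions for illustration and may not match G-Link's rules.

```python
import datetime

def character_rule(a, b, prefix_len=4):
    """Three-level outcome: 2 = exact agreement, 1 = partial, 0 = disagreement."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return 0  # assumption: a missing value is treated as disagreement
    if a == b:
        return 2
    if a[:prefix_len] == b[:prefix_len]:
        return 1  # assumption: a shared prefix counts as partial agreement
    return 0

def date_rule(a, b):
    """Two-level outcome: 1 = dates agree, 0 = dates differ."""
    return 1 if a == b else 0

levels = (
    character_rule("Chevrette", "chevrette "),
    character_rule("Chevrette", "Chevret"),
    character_rule("Chevrette", "Fellegi"),
)
```

Each outcome level would then carry its own weight, so a partial agreement contributes less to the pair's total weight than an exact one.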
    17. User Rules
        • Type must be custom
        • Outcome set by users (used in the user rule psql)
        • Include fields from your input tables
    18. Pairs weight distribution graph
        • You can choose the range selection
        • Minimum and maximum weight + the threshold values
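The threshold values on this slide are typically used to split pairs into links, possible links, and non-links. The following is a minimal sketch under that assumption; the function names, the boundary conventions, and the sample weights are illustrative and not taken from G-Link.

```python
def classify_pair(weight, lower, upper):
    """Classify one pair weight against a lower and an upper threshold."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "possible"  # between the thresholds: candidate for manual review

def summarize(weights, lower, upper):
    """Count how many pair weights fall into each class."""
    counts = {"link": 0, "possible": 0, "non-link": 0}
    for w in weights:
        counts[classify_pair(w, lower, upper)] += 1
    return counts

# Hypothetical pair weights and thresholds.
summary = summarize([-5.0, 0.0, 3.0, 12.5], lower=0.0, upper=10.0)
```

Moving the thresholds closer together shrinks the review workload but raises the risk of false links or missed links, which is why inspecting the weight distribution before fixing them is useful.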
    19. Pairs revision
        • Special criteria in order to revise groups of pairs
        • Rules outcome level
        • Manual update
    20. Group creation and mapping
        • Mapping screen
        • Group creation screen
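One common way to form groups from linked pairs is to treat the links as edges of a graph and take its connected components. The sketch below does this with a small union-find structure. It is an assumption about how grouping could work, not a description of G-Link's method, and `build_groups` is a hypothetical name.

```python
def build_groups(pairs):
    """Return sorted groups of record ids connected by linked pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical linked pairs: 1-2 and 2-3 chain into one group.
groups = build_groups([(1, 2), (2, 3), (4, 5)])
```

Because grouping is transitive, a single false link can merge two otherwise unrelated groups, so groups are usually reviewed after they are built.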
    21. Data Exportation
        • Export to flat files or SAS files
    22. Batch
        • Set a G-Link project as batch.
        • Run from the command line; embedded script with timed execution.
    23. How to install G-LINK
        • G-LINK is installed using an .exe file on a Windows machine.
        • G-LINK can be installed locally or in server mode
            • You should use the client-server mode when:
                • Performance is important (option of using multiple CPUs)
                • Data confidentiality is required.
        [Diagram labels: Interface Logical Processing (DBMS) Local Processing (DBMS) Server]
    24. G-Link: The Future?
        • Product will continue to evolve:
            • Faster processing
            • Enhanced pre-processing and post-processing
            • Enhanced fuzzy matching
        • Possibility of “record-at-a-time” linkages:
            • For interactive applications (capture, un-duplication)
            • Potential for embedded processing
    25. Contact: