Ws2001 sessione8 cibella_tuoto


RELAIS, a powerful instrument to support public statistics.
Cibella N., Tuoto T.

Published in: Technology, Education
  • The record linkage techniques are a multidisciplinary set of methods and practices
  • RELAIS has been implemented in Java and R and has a database architecture (MySQL).

    1. Nicoletta Cibella, Tiziana Tuoto
       Istituto Nazionale di Statistica (ISTAT), Direzione centrale per le tecnologie e il supporto metodologico (DCMT)
       RELAIS, a powerful instrument to support public statistics
       (Italian title: RELAIS, un valido strumento di supporto alla statistica pubblica)
    2. Outline
         • The record linkage problem
         • RELAIS: a powerful instrument to support public statistics
         • Add-ons in version 2.0
         • Future research projects and conclusions
       Nicoletta Cibella, VSP, April 2011
    3. The problem
         • Record linkage techniques are a multidisciplinary set of methods that aim to accurately recognize the same real-world entity at the individual (micro) level, even when it is stored differently in sources of various types.
         • Examples of applications (in official statistics):
             • data integration
             • updating and de-duplicating a source
             • improving the quality of a data source
             • measuring population size by capture-recapture
             • estimating the risk of re-identification in public-use microdata
         • Also known as: Object Identification, Record Matching, …
    4. Record linkage in Istat
         • Wide use of record linkage in different production processes: first experiences date back to the 1980s
         • Common practice is to develop ad hoc linkage procedures for each project, mostly via deterministic techniques
         • Little awareness of linkage errors in subsequent analyses of linked data
         • Only a few official experiences with the probabilistic approach
         • A decade of studies on the Fellegi-Sunter methodology with the EM algorithm
    5. Possible solutions for record linkage
         • A very fragmented picture, not only in Istat.
         • Different approaches to record linkage: exact RL, deterministic RL, probabilistic RL (Fellegi and Sunter theory), Bayesian RL, machine learning, knowledge representation, …
         • No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…)
         • Several software packages and tools have been proposed, based on different approaches, free or commercial.
    6. RELAIS, brief history
         • RELAIS 1.0 released in February 2008 on the Istat website, with a probabilistic model based on F-S theory, EM estimation and a file-based architecture
         • Enriched experience in data integration as coordinator of the ESSnet
             • Common nature of the problems and needs of NSIs in data integration projects
             • Profitable experiences in cooperation with NSIs, also in sharing the same software tools (NTTS 2009)
         • Istat working group, with several collaborations and training courses on probabilistic record linkage
    7. RELAIS: a solution
         • RELAIS (REcord Linkage At IStat) main idea:
             • decompose a complex RL project into its constituent phases
             • dynamically choose the most appropriate technique for each phase, depending on application and data requirements, not only on the practitioner's skill
         • RELAIS is set up as an open source project, a winning choice for sharing techniques and software.
    8. Step 1: Decompose RL into phases
         1. Pre-processing of the input files
         2. Creation and reduction of the search space of candidate link pairs
         3. Choice of the matching variables
         4. Choice of the comparison function
         5. Choice of the decision model
         6. Identification of unique links
         7. RL evaluation
    9. Step 2: Choose the most appropriate techniques
    10. Step 3: Build ad hoc RL workflows
        [Figure: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each assembled by picking techniques from the pool available for each phase]
         • Preprocessing: Normalization, UpperLowerCase
         • Search Space Reduction: Blocking, SNM
         • Comparison Function: Edit Distance, Jaro, Equality
         • Decision Model: Probabilistic, Deterministic
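
The idea of assembling a workflow from interchangeable modules, one per phase, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`normalize`, `blocking`, `edit_similarity`, `deterministic`, `run_workflow`), the sample records and the 0.7 threshold are all hypothetical and do not reproduce the RELAIS code, which is written in Java and R.

```python
from itertools import product

def normalize(record):
    """Preprocessing (hypothetical): trim and upper-case every string field."""
    return {k: v.strip().upper() if isinstance(v, str) else v
            for k, v in record.items()}

def blocking(key):
    """Search space reduction: keep only pairs that agree on the blocking key."""
    def reduce_space(a, b):
        return [(x, y) for x, y in product(a, b) if x[key] == y[key]]
    return reduce_space

def equality(s, t):
    """Comparison function: exact agreement."""
    return 1.0 if s == t else 0.0

def edit_similarity(s, t):
    """Comparison function: 1 - Levenshtein distance / length of the longer string."""
    if not s and not t:
        return 1.0
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(s), len(t))

def deterministic(threshold):
    """Decision model: link when every comparison score reaches the threshold."""
    def decide(scores):
        return all(score >= threshold for score in scores)
    return decide

def run_workflow(a, b, preprocess, reduce_space, comparisons, decide):
    """Chain the phases: preprocess, reduce the search space, compare, decide."""
    a = [preprocess(r) for r in a]
    b = [preprocess(r) for r in b]
    links = []
    for x, y in reduce_space(a, b):
        scores = [f(x[var], y[var]) for var, f in comparisons.items()]
        if decide(scores):
            links.append((x["id"], y["id"]))
    return links

# Invented sample data, for illustration only.
file_a = [{"id": 1, "surname": " Rossi", "city": "Roma"},
          {"id": 2, "surname": "Bianchi", "city": "Milano"}]
file_b = [{"id": 1, "surname": "ROSSI", "city": "roma"},
          {"id": 2, "surname": "Bianki", "city": "MILANO"},
          {"id": 3, "surname": "Verdi", "city": "Roma"}]

# Two workflows that differ only in the comparison function and threshold.
strict = run_workflow(file_a, file_b, normalize, blocking("city"),
                      {"surname": equality}, deterministic(1.0))
tolerant = run_workflow(file_a, file_b, normalize, blocking("city"),
                        {"surname": edit_similarity}, deterministic(0.7))
print(strict)    # [(1, 1)]
print(tolerant)  # [(1, 1), (2, 2)]
```

Swapping one module (here the comparison function) changes the result without touching the other phases, which is the point of the modular design.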
    11. Main features of RELAIS
         • Modular structure: each phase is planned as a “module” of the toolkit, with an explicit interface to the other modules
         • Top-down design: “modules” (phases) of the record linkage process can be omitted and/or iterated
         • Open source project:
             • Java, R, MySQL
    12. RELAIS and open source
         • EUPL: European Union Public Licence
         • Adopting the open-source philosophy and moving beyond ad hoc approaches has been a winning choice
         • Sharing experiences and solutions with the NSIs of Spain, the UK, Tunisia, Brazil, …
         • On-the-job training in the UK in January 2011 and in Latvia in July
         • Thanks to the modular approach and the open-source model, adding new techniques to the pool already available is easy
    13. RELAIS 2.0 in June 2009
         • Reading of input files in text format;
         • Creation of the search space of candidate pairs by means of the “cross product”, “blocking” or “sorted neighborhood” method;
         • Data profiling to guide the choice of matching and blocking variables;
         • Choice of matching variables;
         • Set of comparison functions (several string distances);
         • Probabilistic record linkage: estimation of the F-S model parameters via the EM algorithm;
         • Deterministic record linkage: both exact and rule-based;
         • Reduction from N:M to 1:1 matching solutions with optimal or greedy methods.
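
To make the probabilistic step concrete, here is a minimal sketch of EM estimation for a Fellegi-Sunter model. It assumes binary comparison vectors and conditional independence between matching variables. The names (`fs_em`, `match_weight`, `pattern_likelihood`), the starting values and the synthetic data are assumptions made for illustration, not the RELAIS implementation.

```python
import math
from collections import Counter
from itertools import product

def pattern_likelihood(gamma, probs):
    """P(gamma | class) when each variable agrees independently with prob q."""
    out = 1.0
    for agree, q in zip(gamma, probs):
        out *= q if agree else 1.0 - q
    return out

def clamp(q, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) so logs and ratios stay finite."""
    return min(max(q, eps), 1.0 - eps)

def fs_em(patterns, p=0.1, m_init=0.9, u_init=0.1, n_iter=200):
    """Estimate p = P(match), m_k = P(agree_k | match), u_k = P(agree_k | non-match)."""
    counts = Counter(patterns)
    total = sum(counts.values())
    n_vars = len(next(iter(counts)))
    m, u = [m_init] * n_vars, [u_init] * n_vars
    for _ in range(n_iter):
        # E-step: posterior probability that a pair with pattern g is a match.
        post = {}
        for g in counts:
            pm = p * pattern_likelihood(g, m)
            pu = (1.0 - p) * pattern_likelihood(g, u)
            post[g] = pm / (pm + pu)
        # M-step: re-estimate the parameters from the weighted pattern counts.
        match_mass = sum(n * post[g] for g, n in counts.items())
        m = [clamp(sum(n * post[g] * g[k] for g, n in counts.items()) / match_mass)
             for k in range(n_vars)]
        u = [clamp(sum(n * (1.0 - post[g]) * g[k] for g, n in counts.items())
                   / (total - match_mass))
             for k in range(n_vars)]
        p = clamp(match_mass / total)
    return p, m, u

def match_weight(gamma, m, u):
    """Log-likelihood ratio log P(gamma | match) / P(gamma | non-match)."""
    return sum(math.log(mk / uk) if agree else math.log((1 - mk) / (1 - uk))
               for agree, mk, uk in zip(gamma, m, u))

# Synthetic comparison vectors built from made-up "true" parameters.
true_p, true_m, true_u = 0.1, [0.9, 0.85, 0.8], [0.1, 0.2, 0.05]
patterns = []
for gamma in product((0, 1), repeat=3):
    prob = (true_p * pattern_likelihood(gamma, true_m)
            + (1 - true_p) * pattern_likelihood(gamma, true_u))
    patterns.extend([gamma] * round(10_000 * prob))

p_hat, m_hat, u_hat = fs_em(patterns)
print(p_hat, m_hat, u_hat)
print(match_weight((1, 1, 1), m_hat, u_hat), match_weight((0, 0, 0), m_hat, u_hat))
```

In the Fellegi-Sunter decision rule, pairs whose weight exceeds an upper threshold are declared links, pairs below a lower threshold are declared non-links, and the pairs in between are left for clerical review.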
    14. Add-ons in RELAIS 2.0
         • Relational database architecture: optimizes performance when managing large amounts of data throughout the record linkage project (input, intermediate phases and output).
         • Two modes for processing blocks: (a) step-by-step execution, when blocks are few or during the exploratory phase; (b) one-shot execution, to handle a large number of blocks (at the suggestion of the Spanish NSI).
         • Explicit management of the output and residual files, to iterate several processes, and back-up management.
    15. RELAIS 2.1 in May 2010
         • RELAIS 2.1 is already available on the OSOR and Istat websites.
         • Relational database support: data can be read from Oracle or MySQL databases.
         • New default input values for the parameter estimation of the probabilistic model, and a new definition of the candidate pairs for the optimal 1:1 reduction.
         • More than one variable can be used for search space reduction with the sorted neighborhood method.
         • Minor bugs have been fixed.
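
Search space reduction by the sorted neighborhood method with a multi-variable sorting key can be sketched as follows. The function name `sorted_neighborhood`, the record layout and the window size are assumptions chosen for illustration; they are not taken from RELAIS.

```python
def sorted_neighborhood(records_a, records_b, key_vars, window):
    """Return candidate (id_a, id_b) pairs that fall within one sliding window.

    Records from both files are pooled and sorted on a composite key built
    from key_vars. Only records at most window - 1 positions apart in the
    sorted list, and coming from different files, become candidate pairs.
    """
    tagged = [("A", r) for r in records_a] + [("B", r) for r in records_b]
    tagged.sort(key=lambda t: tuple(str(t[1][v]).upper() for v in key_vars))
    pairs = set()
    for i, (src_i, rec_i) in enumerate(tagged):
        for src_j, rec_j in tagged[i + 1 : i + window]:
            if src_i != src_j:
                a, b = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
                pairs.add((a["id"], b["id"]))
    return pairs

# Invented sample data, for illustration only.
file_a = [{"id": 1, "surname": "Rossi", "birth_year": 1970},
          {"id": 2, "surname": "Bianchi", "birth_year": 1982},
          {"id": 3, "surname": "Verdi", "birth_year": 1990}]
file_b = [{"id": 10, "surname": "Rossi", "birth_year": 1971},
          {"id": 11, "surname": "Bianchi", "birth_year": 1982},
          {"id": 12, "surname": "Neri", "birth_year": 1965}]

candidates = sorted_neighborhood(file_a, file_b, ["surname", "birth_year"], window=2)
print(sorted(candidates))  # [(1, 10), (1, 12), (2, 11), (3, 10)]
```

With a window of 2, only 4 of the 9 pairs in the cross product are kept as candidates. A larger window trades more comparisons for a lower risk of missing true matches whose keys sort far apart.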
    16. A glance at RELAIS 2.1
    17. RELAIS 2.2 in May 2011
         • Explicit application for de-duplication
         • Nested blocking methods
         • Probabilities can be set by the user
         • Improved GUI functionality for output management and user interaction (manual review)
         • Summary output of linkage results
         • Batch execution
         • Interfaces for clerical review
    18. Next challenges
         • Censuses and post-census surveys (Population and Agriculture): integration of population registers with auxiliary registers to address population register under-coverage; de-duplication, including duplicates caused by multi-channel responses; post-enumeration survey
         • Longitudinal study of legally resident foreign nationals
         • Integration of ICT enterprises
    19. Future research projects
         • Preprocessing (character conversions, schema reconciliation, standardization, etc.);
         • Modification of the probabilistic approach:
             • Non-binary comparison vectors
             • Allowing interactions between matching variables
             • Bayesian approach
         • Graphical analysis of the model fit
    20. Thanks and invitation to cooperate
        RELAIS contacts:
         • Computer scientists:
             • Monica Scannapieco, e-mail: [email_address]
             • Laura Tosco, e-mail: [email_address]
             • Luca Valentino, e-mail: [email_address]
         • Statisticians:
             • Nicoletta Cibella, e-mail: [email_address]
             • Tiziana Tuoto, e-mail: [email_address]