sisvsp2012_ sessione14_ cardinaleschi_spinelli



    1. Administrative archives for Eurostat statistics (Archivi amministrativi per le statistiche Eurostat)
       Stefania Cardinaleschi, Vincenzo Spinelli
       Istat – Istituto Nazionale di Statistica
       SIS VSP, 19-20 April 2012
       Session: Statistical use of administrative archives
    2. Outline
       The context
       • Definition of the Structure of Earnings Survey (SES)
       • Process flow
       Focus on the Education public sector
       • Data integration by record linkage
       • Estimation of local units
    3. The Context
       LAbour MArket Statistics (LAMAS)
       The Council Regulation (EC) No 530/1999 “…needs information on the level and composition of labour costs and on the structure and distribution of earnings in order to assess the economic development in the Member States…”
       • Labour Cost Survey (LCS): “The statistics on the level and composition of labour costs”
       • Structure of Earnings Survey (SES): “The statistics on the structure and distribution of earnings”
       The objective of this legislation is to provide accurate and harmonised data on earnings in EU Member States and other countries for policy-making and research purposes (gender pay gap, file MFR, ...).
    4. SES - outline
       Objective
       • to provide accurate and harmonised microdata on earnings for scientific purposes
       • to monitor the structure and distribution of earnings, taking into account job-related factors
       • to provide information on several individual characteristics of employees, such as gender, age, occupation, education, length of service and others
       Coverage
       • reporting units are enterprises with 10+ employees; results relate to local units
       • sections C-K of NACE Rev. 1, plus sections M-O of NACE Rev. 1 from 2006
       • private sector, plus public sector from 2006
    5. Definition of SES by Eurostat
       Structure of the Survey SES 2010
       Private sector and S13 list (excluding Public Administration and Education)
       • direct survey through a specific questionnaire
       • firms chosen from the official list of firms (ASIA 2009)
       Public sector
       • estimates through administrative data on all institutions
       • Education
       • they cover 11% of total employment
    6. Private sector and S13 List
       Two-stage sampling design: a sample of employees in a sample of enterprises.
       First stage: the enterprises
       • enterprises with 10-249 employees: stratified sample by economic activity, size and geographical location
       • enterprises with more than 249 employees: census
       Second stage: employees (October) belonging to the chosen units. The enterprises had two options:
       1. simple random sample
       2. they could be given a list of the VAT codes of the employees to interview
    7. SES sample design – private sector
       Second stage of sampling: number of employees interviewed, by size of the enterprise to which they belong

       Enterprise size (employees)    Number of employees interviewed
       10-19                          all
       20-49                          20
       50-99                          25
       100-249                        35
       250-499                        40
       500-999                        50
       1000-1999                      60
       2000-3999                      65
       4000-7499                      75
       7500-9999                      100
       >10000                         200
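The second-stage rule in the table above can be read as a simple lookup followed by a simple random draw. The following is only a sketch of that reading, not the procedure Istat actually ran: the names `SECOND_STAGE_QUOTAS`, `employees_to_interview` and `draw_employees` are hypothetical, and treating an enterprise with exactly 10,000 employees like the ">10000" class is an assumption.

```python
import random

# (smallest size, largest size, employees to interview); None means "all".
# Values copied from the slide's table.
SECOND_STAGE_QUOTAS = [
    (10, 19, None), (20, 49, 20), (50, 99, 25), (100, 249, 35),
    (250, 499, 40), (500, 999, 50), (1000, 1999, 60),
    (2000, 3999, 65), (4000, 7499, 75), (7500, 9999, 100),
]


def employees_to_interview(enterprise_size):
    """Return how many employees to interview for an enterprise of this size."""
    if enterprise_size < 10:
        raise ValueError("SES covers enterprises with 10+ employees")
    for smallest, largest, quota in SECOND_STAGE_QUOTAS:
        if smallest <= enterprise_size <= largest:
            return enterprise_size if quota is None else quota
    return 200  # assumption: 10,000 employees and above


def draw_employees(employee_ids, seed=None):
    """Option 1 of the second stage: a simple random sample without replacement."""
    ids = list(employee_ids)
    rng = random.Random(seed)
    return rng.sample(ids, employees_to_interview(len(ids)))
```

For example, `draw_employees(range(300), seed=1)` returns 40 distinct identifiers, because an enterprise with 300 employees falls in the 250-499 class.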
    8. SES Education - public sector
       Education - Public sector: estimates based on data derived from the integration of administrative, fiscal and statistical sources.
       Administrative and fiscal data:
       • 770 Form, Tax Register, by MEF (2010)
       • Payroll dataset by MEF/Service Personale Tesoro
       • List of teaching and non-teaching employees by Ministero dell’Istruzione, dell’Università e della Ricerca (2010-2011)
       Statistical surveys (2010):
       • Eu-Silc, panel survey (Statistics on Income and Living Conditions)
       • Labour force survey
    9. The Context: process flow (Education)
       [Flow diagram. Steps in reading order: Data acquisition (School Employment: 770, Payroll + List (MIUR)); Estimation of Census; Integration with survey data (Eu-silc, LFS); Next steps (sampling, checking, ...)]
    10. The Context: process flow
        [Flow diagram. Elements: Census; Integration with survey data (Eu-silc, LFS); sampling; SES - Education; checking. Variables listed under the headings "eusilc" and "fl": isco_2, isco_3; manag_2, manag_3; isced_2, isced_3; anz_3; Part-time full time_2, Part-time full time_3; tipo_2, tipo_3; cittad_2, cittad_3; ore_2, ore_3; orestra_3; bonus_2, bonus_3; RLM_2, RLM_3]
    11. Data integration in SES
        The context: archives coming from heterogeneous sources.
        Objective: to assign to the statistical units (employees) in Census R85 some features available in LFS data.
        Problem: the two sources (SES and LFS) do not use the same key fields to identify their statistical units.
    12. Data integration in SES
        Warning: Eu-silc can be considered a “special case” of LFS for the integration problem and, for this reason, it is not mentioned further in this presentation.
        Choice: the key field in LFS is the “personal code”, which is valid only in that context, while Census R85 is based on the “fiscal code”, a well-defined key for physical persons in administrative and fiscal archives. This is why we want to define a mapping from “personal code” to “fiscal code” and not vice versa.
    13. Data integration in SES
        How can we integrate these archives?
        [Diagram: an LFS record (personal code, birth date, sex, name & surname) next to the Census (R85)]
        Personal code (LFS) and fiscal code (R85) cannot be compared directly!
    14. Data integration in SES
        Hypotheses:
        • The Census is error-free and must be considered the benchmark for the LFS archives.
        • In the LFS archives the personal data are affected by random errors; systematic errors (or bias) must be corrected outside this context.
        • The errors are not uniformly distributed over the personal data fields of LFS: errors in the (name, surname) fields are more likely than in birth date/place or gender.
        Consequence:
        • We define and assign a “fiscal code” to each personal code (statistical unit) in LFS.
    15. Data integration in SES
        “Naïve” solution to the matching problem
        begin
          • normalization step for personal data in the LFS archive
          • definition of the fiscal code on the normalized personal data
          • <the archives from LFS and R85 can be joined by fiscal code>
        end
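The three steps of the naïve solution can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The Italian fiscal code has 16 characters; the sketch builds only the first 11 (surname, name, birth year, month letter, day with +40 for women) and omits the 4-character birthplace code and the final check character, so the join is done on that 11-character prefix. The function names, the record field names and the rule "accept a match only if exactly one Census code shares the prefix" are all hypothetical.

```python
import unicodedata
from datetime import date

MONTH_LETTERS = "ABCDEHLMPRST"  # January..December in the fiscal code
VOWELS = set("AEIOU")


def normalize(text):
    """Normalization step: upper-case, strip accents, keep only A-Z."""
    decomposed = unicodedata.normalize("NFKD", text.upper())
    return "".join(ch for ch in decomposed if "A" <= ch <= "Z")


def _three_letters(word, is_first_name):
    """Consonants first, then vowels, padded with X, cut to three letters."""
    consonants = [c for c in word if c not in VOWELS]
    vowels = [c for c in word if c in VOWELS]
    if is_first_name and len(consonants) >= 4:
        # First names with 4+ consonants use the 1st, 3rd and 4th.
        consonants = [consonants[0], consonants[2], consonants[3]]
    return "".join(consonants + vowels + ["X", "X", "X"])[:3]


def partial_fiscal_code(surname, name, birth, sex):
    """First 11 characters of the fiscal code (no birthplace, no check char)."""
    day = birth.day + (40 if sex.upper() == "F" else 0)
    return (
        _three_letters(normalize(surname), False)
        + _three_letters(normalize(name), True)
        + f"{birth.year % 100:02d}"
        + MONTH_LETTERS[birth.month - 1]
        + f"{day:02d}"
    )


def link(lfs_records, census_fiscal_codes):
    """Map each LFS personal code to a Census fiscal code when unambiguous."""
    index = {}
    for code in census_fiscal_codes:
        index.setdefault(code[:11], []).append(code)
    mapping = {}
    for rec in lfs_records:
        key = partial_fiscal_code(rec["surname"], rec["name"], rec["birth"], rec["sex"])
        candidates = index.get(key, [])
        if len(candidates) == 1:  # hypothetical rule: skip ambiguous prefixes
            mapping[rec["personal_code"]] = candidates[0]
    return mapping
```

With the textbook example, `partial_fiscal_code("Rossi", "Mario", date(1980, 1, 1), "M")` gives `"RSSMRA80A01"`, which is the prefix of `RSSMRA80A01H501U`. Because the normalization step strips case, spaces, apostrophes and accents, spellings such as `" rossi "` or `"Nicolò"` produce the same key as their clean forms.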
    16. Data integration in SES
        Results
        LFS personal data (reference year 2010): 92,129 records.
        Before the normalization step: error rate 27.8%; 66,481 records can be matched in the fiscal archives (i.e. Modello 770/2010).
        After the normalization step: error rate 18.5% (-9.3 percentage points).
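The figures on this slide are consistent with reading the error rate as the share of LFS records that cannot be matched. That reading is an assumption, and the helper name `error_rate` is hypothetical; a quick check:

```python
def error_rate(total_records, matched_records):
    """Share of records that could not be matched (assumed definition)."""
    return 1 - matched_records / total_records


# Before normalization: 66,481 of 92,129 LFS records are matched.
before_pct = round(error_rate(92_129, 66_481) * 100, 1)

# After normalization the slide reports 18.5%; the change is in percentage points.
after_pct = 18.5
change_pp = round(after_pct - before_pct, 1)

print(before_pct, change_pp)
```

The computed rate before normalization rounds to the 27.8% shown on the slide, and the difference from 18.5% is the reported -9.3 points.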
    17. Estimation of local units
        [Diagram: MIUR schools 1 to 5 grouped into Local unit 1 and Local unit 2]
        But: what if there are multi-level local units?
    18. Estimation of local units
        Local units structures
        [Table with columns: local unit, school addresses]
        A constraint must hold in this list: every school must be “linked” to one and only one local unit.
        In other words, every school must belong to a cluster having a unique “center of mass” (a school itself).
    19. Estimation of local units
        From local units to graphs
        Local unit = connected component
        [Graph figure with school codes as vertices: BIEE002018, BIEE002029, AGEE00101V, AGEE001042, BIEE002007, BIEE00203A, AGEE00100T, BIEE00206D, BIEE00205C]
        This list can be seen as a graph G: the vertices are the codes of the schools and the (oriented) edges are the pairs in each row of the list.
    20. Estimation of local units
        Result: the local units in R85 are the connected components of G that are (oriented) trees with one root.
        Search for connected components in G: there are many algorithms to compute the connected components of a graph in linear time, using either breadth-first search or depth-first search.
        [Figure label: local units]
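The result above can be sketched with a breadth-first search. This is a minimal illustration, not the authors' code. It assumes each row of the list is a pair (local unit code, school code), so an edge points from the "center of mass" to a linked school; the function names are hypothetical, and the edges in the usage example are invented, since the slides show the school codes but not which pairs are linked.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Connected components of G, ignoring edge direction (BFS, linear time)."""
    neighbours = defaultdict(set)
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            vertex = queue.popleft()
            component.append(vertex)
            for other in neighbours[vertex]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    return components


def is_rooted_tree(component, edges):
    """True if the component is an oriented tree with exactly one root.

    A connected component with n vertices is such a tree when it has
    n - 1 edges, one vertex with no incoming edge (the root) and every
    other vertex with exactly one incoming edge.
    """
    members = set(component)
    inside = [(p, c) for p, c in edges if p in members and c in members]
    incoming = {vertex: 0 for vertex in members}
    for _, child in inside:
        incoming[child] += 1
    roots = [vertex for vertex, count in incoming.items() if count == 0]
    return (
        len(inside) == len(members) - 1
        and len(roots) == 1
        and all(count <= 1 for count in incoming.values())
    )


# Invented edges over school codes that appear in the slides.
schools = ["BIEE002007", "BIEE002018", "BIEE002029", "AGEE00100T", "AGEE00101V"]
links = [
    ("BIEE002007", "BIEE002018"),
    ("BIEE002007", "BIEE002029"),
    ("AGEE00100T", "AGEE00101V"),
]
local_units = [c for c in connected_components(schools, links) if is_rooted_tree(c, links)]
```

A singleton school with no edges counts as a tree whose root is the school itself. A school listed under two different local units gets two incoming edges, so its component fails the check, which is the "one and only one local unit" constraint.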
    21. Estimation of local units
        Inputs
        • There are 36,923 schools (reference year 2010), i.e. vertices in G.
        • There are 24,031 (oriented) edges in G.
        Before the clustering algorithm
        • We get 11,892 connected components, of which 3,126 are singletons.
        • The average size of these components is 3, while the largest component has 21 vertices.
        • There are 512 components with fewer than 10 employees.
        Results
        • We considered 11,380 local units (11,892 - 512, i.e. 35,152 schools) in SES 2010.