The De-identification of Clinical Data

A comprehensive presentation on why the de-identification of clinical information is necessary for secondary uses and how to do it effectively.

The De-identification of Clinical Data Presentation Transcript

  • 1. De-identifying Clinical Data Khaled El Emam, CHEO RI & uOttawa
  • 2. www.ehealthinformation.ca Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 3. Secondary Use/Disclosure [Diagram: individuals, custodian, agent, and recipient, connected by flows labeled collection, use, and disclosure]
  • 4. Data Flows • Mandatory disclosures • Uses by an agent for secondary purposes • Permitted discretionary disclosures for secondary purposes • Other disclosures for secondary purposes
  • 5. Obtaining Consent - I • Sometimes it is not possible or practical to obtain consent: – Making contact to obtain consent may reveal the individual’s condition to others against their wishes – The size of the population may be too large to obtain consent from everyone – Many patients may have relocated or died
  • 6. Obtaining Consent - II – There may be a lack of existing or continuing relationship with the patients – There is a risk of inflicting psychological, social or other harm by contacting individuals or their families in delicate circumstances – It would be difficult to contact individuals through advertisements and other public notices
  • 7. Impact of Obtaining Consent • In the case where explicit consent is used, consenters and non-consenters differ on: – age, sex, race, marital status, educational level, socioeconomic status, health status, mortality, lifestyle factors, functioning • The consent rate for express consent varied from 16% to 93%
  • 8. Limiting Principles • Do not collect, use, or disclose PHI if other information will serve the purpose • For example, even if it is easier to disclose a whole record, that should not be done if lesser information will reasonably satisfy the purpose • De-identification would be one element in limiting the amount of PHI that is collected/used/disclosed
  • 9. Breaches • In many large research hospitals and hospital networks it is simply not possible to control and manage all of the databases and data sets that are created, used, and disclosed for research • Breach frequency and severity are growing • De-identification provides one way to manage the risks, however
  • 10. Trust • Patients change their behavior if they perceive a threat to privacy • This can have a negative impact on the quality of the data that is used for research
  • 11. [No text transcribed for this slide]
  • 12. Deloitte Survey (2007) • N=827 respondents in North America • 43% reported more than 10 privacy breaches within the last 12 months in their organizations • Over 85% reported at least one privacy breach • Over 63% reported multiple privacy breaches requiring notification • Breaches involving 1000+ records were reported by 34% of respondents
  • 13. Verizon Study • Based on forensic engagements conducted by Verizon • Breaches resulting from external sources: 73% • Caused by insiders: 18% • Implicated business partners: 39% • The median number of records involved in an insider breach was 10 times that of an external breach • Biggest causes are errors and hackers
  • 14. [No text transcribed for this slide]
  • 15. HIMSS Leadership Survey • Survey of healthcare IT executives, n=307 • Conducted in the 2007-2008 timeframe • 24% of respondents reported that they have had a security breach in their organization in the last 12 months • 16% of respondents reported that they have had a security breach in their organization in the last 6 months • Half indicated that an internal security breach is a concern to their organizations
  • 16. HIMSS Analytics Report • IT executives and security officers at healthcare institutions; n=263 • Half of respondents are concerned with internal inadvertent access to patient data • 13% indicated that their organization has had a security breach in the last 12 months • 80% of these were internal breaches
  • 17. Medical Record Breaches 2008 • For all of 2008 (datalossdb.org) • 83 breaches involving medical records (14% of total) • Approx. 7.2 million records involved in these breaches (21.5% of all records)
  • 18. Does this Happen Here? • Do you know of any cases where computer equipment was stolen from a hospital? Did this equipment contain personal health information? • Do you know of any cases where memory sticks with data on them were lost? • Does anyone email data to their Hotmail or Gmail accounts so that they can access it from home or while travelling? • Do people still share passwords?
  • 19. Known Data Leaks • PHI on second hand computers • Leaks through peer-to-peer file sharing networks • PowerPoint files on the Internet • Password protected files sent by email
  • 20. Identity Theft • William Ernst Black (Edmonton 1999) • The creation of identity packages using information about dead children who were living in one jurisdiction but died in another ($37k for each identity package) • Example: drug smuggler was caught with these identity packages • Example: American getting free medical care in Canada
  • 21. Patient Concerns • There is evidence (from surveys) that the general public has changed their behavior to adjust for perceived privacy risks with respect to their PHI: – 15% to 17% of US adults – 11% to 13% of Canadian adults • There is also evidence that vulnerable populations exhibit similar behaviors (e.g., adolescents, people with HIV or at high risk for HIV, those undergoing genetic testing, mental health patients and battered women)
  • 22. Behavior Change - I • Going to another doctor • Paying out of pocket when insured to avoid disclosure • Not seeking care to avoid disclosure to an employer or to not be seen entering a clinic by other members of the community • Giving inaccurate or incomplete information on medical history • Asking a doctor not to record a health problem or record a less serious or embarrassing one
  • 23. Behavior Change - II • 87% of US physicians reported that a patient had asked them not to include certain information in their record • 78% of US physicians reported that they have withheld information due to privacy concerns
  • 24. S
  • 25. Asymmetry Principle - I • Trust is hard to gain but easy to lose: – Negative events/news carry more weight than positive ones (negativity bias); they are more diagnostic – Avoiding loss: people weight negative information more heavily in an effort to avoid loss – Sources of negative information appear more credible (positive information seems self-serving)
  • 26. Asymmetry Principle - II – People interpret information according to their prior beliefs: if they have negative prior beliefs then negative events will reinforce that and positive events will have little impact – Undecided individuals tend to be affected more by negative information – People with positive prior beliefs may feel betrayed by negative information/events
  • 27. Canadian Public - 2007 [Bar chart: percent answering 5-7 on a 7-point scale, for Total, BC, Alberta, Prairies, Ont, Que, Atlantic, and Territories; bar values range from 34 to 46] Question: In your opinion, how safe and secure is the health information which EXISTS about you?
  • 28. Canadian Public - 2003 [Bar chart on a 0 to 100 scale with categories Agree (5-7), Neither (4), Disagree (1-3), and DK/NR] Statement: I really worry that my personal health information might be used for other purposes in the future which have little to do with my health
  • 29. How not to De-identify • Just removing the name and address information is not enough • It is quite easy to re-identify individuals from the other data that is left • There are a number of public real life examples of re-identification actually happening
  • 30. Example Data With PHI
  • 31. Types of Variables • Identifying variables: variables that can directly identify a patient • Quasi-identifiers: variables that can indirectly identify a patient • Sensitive variables: sensitive clinical information that the patient would not want to be known beyond the circle of care
  • 32. De-identified Data?
  • 33. Examples of Re-identification
  • 34. Examples of Re-identification
  • 35. Examples of Re-identification
  • 36. Examples of Re-identification
  • 37. User #4417749 • “tea for good health” • “numb fingers”, “hand tremors” • “dry mouth” • “60 single men” • “dog that urinates on everything” • “landscapers in Lilburn, Ga” • “homes sold in shadow lake subdivision gwinnett county georgia”
  • 38. Thelma Arnold • 62 year old widow living in Lilburn, Ga, re-identified by the New York Times • She has three dogs
  • 39. What Happened Next? • Maureen Govern, CTO of AOL “resigns” • Abdur Chowdhury, AOL researcher who released the data was fired • Abdur’s boss in the research department was fired • Big embarrassment for AOL
  • 40. Examples of Re-identification
  • 41. Examples of Re-identification
  • 42. Examples of Re-identification
  • 43. Uniqueness in the US Population • Studies show that between 63% and 87% of the US population is unique on their date of birth + ZIP code + gender • Uniqueness makes it quite easy to re-identify individuals using a variety of techniques
  • 44. Uniqueness in Canadian Population [Line chart: Percent Uniques (0% to 100%) against Number of Characters in Postal Code (1 to 6), with series PC, PC + Gender, PC + DoB, and PC + DoB + Gender]
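The uniqueness measure on slides 43 and 44 can be illustrated with a small sketch. It counts how many records are the only ones with their combination of quasi-identifier values, then repeats the count after shortening the postal code. The records are invented toy data, and the helper names `percent_unique` and `truncate_postal_code` are hypothetical, not taken from the presentation.

```python
from collections import Counter


def percent_unique(records, keys):
    """Fraction of records whose combination of `keys` values occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


def truncate_postal_code(records, n_chars):
    """Copy the records, keeping only the first n_chars characters of the postal code."""
    return [{**r, "pc": r["pc"].replace(" ", "")[:n_chars]} for r in records]


# Invented toy records (assumption: not real data).
records = [
    {"dob": "1957-03-12", "gender": "M", "pc": "K0J 1P0"},
    {"dob": "1957-03-12", "gender": "M", "pc": "K0J 1T0"},
    {"dob": "1968-12-09", "gender": "F", "pc": "K0J 1P0"},
    {"dob": "1968-12-09", "gender": "F", "pc": "K0J 2A0"},
]

keys = ["dob", "gender", "pc"]
full = percent_unique(truncate_postal_code(records, 6), keys)
coarse = percent_unique(truncate_postal_code(records, 3), keys)
print(full, coarse)
```

In this toy data every record is unique on the full six-character postal code, and none is unique once only the first three characters are kept.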
  • 45. Example • This example shows the risk of re-identification using just demographics
  • 46. Types of Disclosure • Identity Disclosure: being able to determine the identity associated with a record • Attribute Disclosure: discovering something new about an individual known to be in the database
  • 47. Disclosure and Invasion-of-Privacy • An important first criterion is deciding on the sensitivity of the data and the potential for harm to the patients from a secondary use/disclosure • If the invasion-of-privacy is deemed low then there may not be a need to de-identify the data
  • 48. Invasion-of-Privacy - I • The personal information in the Data is highly detailed • The information in the Data is of a highly sensitive and personal nature • The information in the Data comes from a highly sensitive context • Many people would be affected if there was a Data breach or the Data was processed inappropriately by the recipient/agent
  • 49. Invasion-of-Privacy - II • If there was a Data breach or the Data was processed inappropriately by the recipient/agent that may cause direct and quantifiable damages and measurable injury to the patients • If the recipient/agent is located in a different jurisdiction, there is a possibility, for practical purposes, that the data sharing agreement will be difficult to enforce
  • 50. Invasion-of-Privacy – Consent - I • There is a provision in the relevant legislation permitting the disclosure/use of the Data without the consent of the patients • The Data was unsolicited or given freely or voluntarily by the patients with little expectation of it being maintained in total confidence
  • 51. Invasion-of-Privacy – Consent - II • The patients have provided express consent that their Data can be disclosed for this secondary Purpose when it was originally collected or at some point since then • The custodian has consulted well-defined groups or communities regarding the disclosure of the Data and had a positive response
  • 52. Invasion-of-Privacy – Consent - III • A strategy for informing/notifying the public about potential disclosures for the recipient’s secondary Purpose was in place when the data was collected or since then • Obtaining consent from the individuals at this point is inappropriate or impractical
  • 53. Identity Disclosure • Three common types: – Prosecutor risk – Journalist risk – Rareness • All three are concerned with the risk of re-identifying a single individual
  • 54. Prosecutor vs. Journalist • If all of the following is true then prosecutor risk is relevant: – The data represents the whole population such that everyone is known to be in it or the sampling fraction is very high – If not the whole population, it is possible for an intruder to know that a particular person has a record in the data • Patient may self-reveal • Data collection method is revealing • Otherwise journalist risk is relevant
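The choice between the two risk types on slide 54 can be written as a small decision function. This is one reading of the slide, not a rule stated verbatim: because the second condition begins "if not the whole population", the sketch treats the conditions as alternative ways an intruder could know that a person is in the data. The function and parameter names are hypothetical.

```python
def relevant_identity_risk(whole_population, very_high_sampling_fraction,
                           intruder_can_know_membership):
    """Return which re-identification risk applies under one reading of slide 54.

    Assumption: any route by which the intruder can know that a specific person
    has a record in the data makes prosecutor risk relevant.
    """
    if (whole_population or very_high_sampling_fraction
            or intruder_can_know_membership):
        return "prosecutor"
    return "journalist"


# A registry covering everyone with a condition: membership is known.
registry = relevant_identity_risk(True, False, False)
# A small sample where patients have self-revealed their participation.
self_revealed = relevant_identity_risk(False, False, True)
# A small anonymous sample where nobody can tell who was included.
anonymous_sample = relevant_identity_risk(False, False, False)
print(registry, self_revealed, anonymous_sample)
```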
  • 55. Prosecutor Risk - I • The intruder has background information about a specific individual known to be in the database • The amount of background information will depend on the intruder • The intruder is attempting to find the record belonging to that individual in the database
  • 56. Prosecutor Risk - II • Examples of intruders: – Neighbor – Ex-spouse – Employer – Relative
  • 57. Example
      Date of Birth   Gender   Postal Code   Diagnosis
      12/03/1957      M        K0J 1P0       …
      01/07/1978      M        K0J 1P0       …
      09/12/1968      F        K0J 1P0       …
      17/08/1987      F        K0J 1P0       …
      25/02/1974      F        K0J 1T0       …
      23/05/1985      M        K0J 1T0       …
      14/03/1965      F        K0J 2A0       …
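A minimal sketch of how prosecutor risk could be quantified for the example table on slide 57. It assumes a common convention that the slides do not state: the chance of correctly matching a known individual is one divided by the number of records that share that individual's quasi-identifier values. The function name `match_probabilities` is hypothetical.

```python
from collections import Counter

# Quasi-identifier columns from the example table: (date of birth, gender, postal code).
rows = [
    ("12/03/1957", "M", "K0J 1P0"),
    ("01/07/1978", "M", "K0J 1P0"),
    ("09/12/1968", "F", "K0J 1P0"),
    ("17/08/1987", "F", "K0J 1P0"),
    ("25/02/1974", "F", "K0J 1T0"),
    ("23/05/1985", "M", "K0J 1T0"),
    ("14/03/1965", "F", "K0J 2A0"),
]


def match_probabilities(rows, key):
    """For each row, 1 / (number of rows sharing the same key value)."""
    sizes = Counter(key(r) for r in rows)
    return [1 / sizes[key(r)] for r in rows]


# Intruder knows date of birth, gender, and postal code.
with_dob = match_probabilities(rows, key=lambda r: r)
# Intruder knows only gender and postal code.
without_dob = match_probabilities(rows, key=lambda r: (r[1], r[2]))
print(max(with_dob), without_dob)
```

With the date of birth included, every record in the table is unique. Without it, the four K0J 1P0 records fall into two groups of two, while the remaining three records stay unique.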
  • 58. Selecting Variables – Prosecutor - I • In the best case assumption, a neighbor would know: – Address and telephone information about the VIP – Household and dwelling information (number of children, value of property, type of property) – Key dates (births, deaths, weddings) – Visible characteristics: gender, race, ethnicity, language spoken at home, weight, height, physical disabilities – Profession
  • 59. Selecting Variables – Prosecutor - II • What would an ex-spouse know: – The same things that a neighbor would know – Basic medical history (allergies, chronic diseases) – Income, years of schooling • All of these variables would be considered quasi-identifiers if they appear in the database
  • 60. Journalist Risk • The journalist is not looking for a specific person; re-identifying any person will do • The journalist has access to a database that s/he can use for matching • This is called an identification database
  • 61. Journalist Matching Example [Diagram: a Medical Database (clinical and lab data) and an Identification DB (name, address, telephone no.) linked through shared Quasi-Identifiers: DoB, initials, gender, postal code]
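The matching on slide 61 can be sketched as a join on the quasi-identifiers. The sketch assumes that a medical record counts as re-identified only when exactly one person in the identification database shares its quasi-identifier values. All records, field names, and the `link` helper are invented for illustration.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("dob", "initials", "gender", "postal_code")


def link(medical_db, identification_db, keys=QUASI_IDENTIFIERS):
    """Pair each medical record with its identity when the match is unambiguous."""
    index = defaultdict(list)
    for person in identification_db:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for record in medical_db:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate: re-identified
            matches.append((record, candidates[0]))
    return matches


# Invented toy data (assumption: not real people or results).
medical_db = [
    {"dob": "1957-03-12", "initials": "AB", "gender": "M",
     "postal_code": "K0J 1P0", "lab": "result 1"},
    {"dob": "1968-12-09", "initials": "CD", "gender": "F",
     "postal_code": "K0J 1P0", "lab": "result 2"},
]
identification_db = [
    {"name": "Person A", "dob": "1957-03-12", "initials": "AB",
     "gender": "M", "postal_code": "K0J 1P0"},
    {"name": "Person B", "dob": "1968-12-09", "initials": "CD",
     "gender": "F", "postal_code": "K0J 1P0"},
    {"name": "Person C", "dob": "1968-12-09", "initials": "CD",
     "gender": "F", "postal_code": "K0J 1P0"},
]

matches = link(medical_db, identification_db)
print([(identity["name"], record["lab"]) for record, identity in matches])
```

The second medical record is left unmatched here because two people in the identification database share its quasi-identifier values.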
  • 62. Assessing Journalist Risk • In general, we want to know how rare the quasi-identifier values would be in the population (e.g., homeowners/professionals/civil servants in the geographic area of interest) • If the combination is not rare then there is small journalist risk
  • 63. Selecting Variables – Journalist - I • Depends on what information can be obtained in an identification database • For an external intruder, likely variables are those available in public registries: – Key dates (birth, death, marriage) – Profession – Home address and telephone number – Type of dwelling – Gender, ethnicity, race – Income if a highly paid public servant
  • 64. Selecting Variables – Journalist - II • Assume that an internal intruder would be able to get all relevant administrative data: – Key dates (birth, death, admission, discharge, visit) – Gender, address, telephone number
  • 65. Inference of Variables - I • Even though a particular quasi-identifier may not be known to the intruder (prosecutor risk), available in an identification database (journalist), or available in the disclosed database (all three risks), it may be possible to infer it from other variables • Variables that can be inferred should be treated as quasi-identifiers
  • 66. Inference of Variables - II • Inferred variables should be added to the disclosed database if they are not there because they may be used in a re-identification attack, and you want to take them into account during risk assessment
  • 67. Inference Examples • Gender, ethnicity, religious origin from name • Age from graduation date • Profession from payer of insurance claim (e.g., civil servants have a single health insurer) • Age and gender from a diagnostic or lab code (e.g., mammogram or PSA test)
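As a rough illustration of slides 65 to 67, the sketch below adds an inferred gender column to records that only carry a test code, so that the inferred variable can be included in a risk assessment. The test codes, the lookup table, and the function name are hypothetical, and such inferences are only likely, not certain.

```python
# Hypothetical lookup: tests that are strongly associated with one gender.
LIKELY_GENDER_BY_TEST = {"MAMMOGRAM": "F", "PSA": "M"}


def add_inferred_gender(records):
    """Return copies of the records with gender filled in from the test code when missing."""
    completed = []
    for record in records:
        inferred = LIKELY_GENDER_BY_TEST.get(record.get("test"))
        completed.append({**record, "gender": record.get("gender") or inferred})
    return completed


# Invented toy records; "CBC" stands for a test that reveals nothing about gender.
records = [{"test": "PSA"}, {"test": "MAMMOGRAM"}, {"test": "CBC"}]
completed = add_inferred_gender(records)
print([r["gender"] for r in completed])
```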
  • 68. Rareness • If individuals are rare on the quasi-identifiers, then they are at higher prosecutor and journalist re-identification risk • If an individual has a rare and visible characteristic/feature, then that also makes them easier to re-identify (e.g., put an ad on the radio)
  • 69. Attribute Disclosure • Arises if there is very little variation on sensitive variables • The data set can represent a whole population or some subset • The intruder learns something new about a person without actually finding which record belongs to them
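Attribute disclosure as described on slide 69 can be checked with a short sketch: group the records by their quasi-identifiers and flag groups in which everyone shares the same sensitive value. Anyone known to belong to such a group has that value revealed even though no single record is pinned to them. The data, field names, and the `homogeneous_groups` helper are invented for illustration.

```python
from collections import defaultdict


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Map each quasi-identifier group with a single sensitive value to that value."""
    values = defaultdict(set)
    for record in records:
        group = tuple(record[q] for q in quasi_identifiers)
        values[group].add(record[sensitive])
    return {group: next(iter(v)) for group, v in values.items() if len(v) == 1}


# Invented toy records (assumption: not real data).
records = [
    {"age_band": "30-39", "gender": "F", "diagnosis": "condition X"},
    {"age_band": "30-39", "gender": "F", "diagnosis": "condition X"},
    {"age_band": "40-49", "gender": "M", "diagnosis": "condition X"},
    {"age_band": "40-49", "gender": "M", "diagnosis": "condition Y"},
]

at_risk = homogeneous_groups(records, ["age_band", "gender"], "diagnosis")
print(at_risk)
```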
  • 70. A Pragmatic Approach • It is important to ensure that the quasi-identifiers are plausible for the data and the recipients of the data • If you select many quasi-identifiers then that will by definition increase the re-identification risk • Ideally, each selected quasi-identifier should be associated with a realistic re-identification scenario
  • 71. Constructing an Identification DB • This may be a single physical database or a join of multiple sources together to construct a virtual database • It will have the quasi-identifiers as well as identity information, but will not have the sensitive information (e.g., clinical or financial details) • The sources may be public and free, public and for a fee, or fully commercial
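The "join of multiple sources" on slide 71 might look like the following sketch, which merges rows from several sources into one virtual identification database. It assumes, purely for illustration, that the sources can be joined on an exact name; real linkage would be messier. The source contents and the `build_identification_db` helper are invented.

```python
def build_identification_db(*sources, key="name"):
    """Merge rows from several sources into one record per value of `key`."""
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Invented toy sources (assumption: not real registry contents).
membership_list = [{"name": "Person A", "profession": "physician"}]
white_pages = [{"name": "Person A", "postal_code": "K0J 1P0"}]
property_registry = [
    {"name": "Person A", "dob": "1957-03-12"},
    {"name": "Person B", "dob": "1968-12-09"},
]

identification_db = build_identification_db(
    membership_list, white_pages, property_registry
)
print(identification_db)
```

People who appear in only some sources end up with partial records, which is consistent with success rates below 100% for individual fields.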
  • 72. Examples of Identification DBs - I • These are databases or sources (Canada): – Obituaries: available from newspapers and funeral homes; there are obituary aggregator sites that make this simple – PPSR: Personal Property Security Registration; contains information on loans secured by property (e.g., cars) – Land Registry: information on house ownership
  • 73. Examples of Identification DBs - II – Membership Lists: provide comprehensive listings of professionals (e.g., doctors, lawyers, civil servants) – Salary Disclosure Reports: provided by governments for those earning higher than a certain threshold – White Pages: public telephone directory – Job Sites: CVs posted in public and closed job web sites – Donations: Disclosures of donations to political parties (include address)
  • 74. Voter Lists - I • Cannot legally be used for purposes outside of an election (in Canada) • But, a charity allegedly supporting a terrorist group (Tamil Tigers) was found by the RCMP to have Canadian voter lists • Volunteers do not necessarily destroy or dispose of the lists after an election (and in many cases do not sign anything before they get them)
  • 75. Voter Lists - II • It is not expensive (or difficult) to become a candidate in an election and get the voter list: – Alberta: $500 – BC: $100 – NB: $100 (+ nominated by 25 electors) – Ontario: $100 – Quebec: $0 (+ nominated by 100 electors) • Canadian voter lists do not contain the DoB (yet)
  • 76. Economics of Identification DBs • Some data sources have a fee for each individual record/search • This makes the cost of creating an identification database quite high • This may impose a large economic burden on an intruder and act as a deterrent from creating identification databases
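The economic argument on slide 76 is simple arithmetic, sketched below. The fee, the number of individuals, and the number of searches per individual are hypothetical values chosen only to show how quickly per-search fees add up; the presentation gives no actual prices.

```python
def identification_db_cost(n_individuals, fee_per_search, searches_per_individual=1):
    """Total cost of building an identification database from pay-per-search sources."""
    return n_individuals * searches_per_individual * fee_per_search


# Hypothetical numbers: 10,000 people, two paid registries, $8 per search.
cost = identification_db_cost(
    n_individuals=10_000, fee_per_search=8.0, searches_per_individual=2
)
print(cost)
```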
  • 77. Internal Identification Databases • An internal intruder may have access to administrative databases that can act as Identification DB • For example, in a hospital an internal intruder may ha e int de ma have access to all admissions; this is not sensitive data so is less protected but has enough p g demographics that it can be good as an identification database • Thi puts i t This t internal i t d l intruders at a huge th advantage Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 78. Internal Access • An internal intruder can get access to such an administrative database: – had access in a previous position but that access was not revoked – people in the organization share access credentials, so the intruder can use someone else’s credentials to get the administrative database – has access as part of his/her job and there are no audit trails – internal systems are not well protected because internal people are trusted and intruder knows how to break-in the system to get the data break in Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 79. Public Registries • In the following slides I will explain how to create identification databases from public registries in Canada Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 80. Professional Groups - I • We can construct identification databases for specific professional groups [Diagram: sources are Membership Lists, PPSR, White Pages]
  • 81. Professional Groups - II • College of Physicians and Surgeons of Ontario • Law Society of Upper Canada • Professional Engineers Ontario • College of Occupational Therapists • College of Physical Therapists • Public servants (eg, GEDS) • …
  • 82. What is the success rate? • Ability to get home postal codes (source: PPSR and telephone directory): CPSO 60%, LSUC 45% • Ability to get practice/firm postal codes (source: CPSO/LSUC): CPSO 100%, LSUC 100% • Ability to get date of birth (source: PPSR): CPSO 40%, LSUC 45% • Ability to get gender (source: CPSO/genderizing LSUC): CPSO 100%, LSUC 100% • Ability to get initials (source: CPSO/LSUC): CPSO 100%, LSUC 100%
  • 83. What is the success rate by gender? • MALE: – Ability to get home postal codes (source: PPSR and telephone directory): CPSO 63%, LSUC 48% – Ability to get date of birth (source: PPSR): CPSO 45%, LSUC 48% • FEMALE: – Ability to get home postal codes (source: PPSR and telephone directory): CPSO 49%, LSUC 40% – Ability to get date of birth (source: PPSR): CPSO 29%, LSUC 40%
  • 84. Homeowners • We can construct identification databases for specific postal codes [Diagram: sources are Canada Post, Land Registry, PPSR, White Pages]
  • 85. What is the success rate? • Ability to get initials: Ottawa 93%, Toronto 100% • Ability to get DoB: Ottawa 33%, Toronto 40% • Ability to get telephone number: Ottawa 80%, Toronto 50% • Ability to get gender: Ottawa 87%, Toronto 95%
  • 86. Re-id Risk for Homeowners • The number of households per postal code is quite small (Ott: 15; To: 20) • The individuals (homeowners) were unique on common combinations of quasi-identifiers (eg, gender and DoB) • For these individuals re-identification risk is very high
  • 87. Civil Servants - I • GEDS is on the Internet: Government Electronic Directory Services • There are 386,630 individuals in the federal government (159,652 in Ontario and 28,046 in Alberta) • GEDS has approx. 170,000 entries • Incomplete because: organizations can opt out, some individuals need to opt in, and some employees and orgs are exempted (eg, CSIS, DND)
  • 88. Civil Servants - II • We selected a sample of 40 individuals in health care related federal departments in Ontario • Able to get home address for 50%, home telephone number for 40%, gender for 100%, DoB for 22.5% • Provincial governments have similar sources
  • 89. Re-identification Threshold • There is a spectrum of re-identification risk • When does the probability of re-identification become so high that the information is deemed identifiable? • Canadian privacy law tends not to be precise about this • Gordon case: serious possibility test
  • 90. Canadian Definitions - I (Privacy Law: Definition) • Ontario PHIPA: “Identifying information” means information that identifies an individual or for which it is reasonably foreseeable in the circumstances that it could be utilized, either alone or with other information, to identify an individual. • Nfld PPHI: “Identifying information” means information that identifies an individual or for which it is reasonably foreseeable in the circumstances that it could be utilized either alone or together with other information to identify an individual. • Sask THIPA: “De-identified personal health information” means personal health information from which any information that may reasonably be expected to identify an individual has been removed.
  • 91. Canadian Definitions - II (Privacy Law: Definition) • Alberta HIA: “Individually identifying” means that the identity of the individual who is the subject of the information can be readily ascertained from the information; “nonidentifying” means that the identity of the individual who is the subject of the information cannot be readily ascertained from the information. • NB PPIA: “Identifiable individual” means an individual can be identified by the contents of the information because the information includes the individual’s name, makes the individual’s identity obvious, or is likely in the circumstances to be combined with other information that includes the individual’s name or makes the individual’s identity obvious.
  • 92. Re-identification Risk Spectrum
  • 93. Re-identification Threshold • Privacy legislation treats the threshold in two ways: – Discretionary/permitted disclosures and uses = threshold can be anywhere along the spectrum – Only de-identified information without consent = information is identifiable or not; there is no spectrum • Any systematic approach to dealing with thresholds must cover both
  • 94. Threshold Precedents - I • We will use healthcare precedents as an indication of the risk that society has agreed to take: – The largest probability of re-identification that is used in any policy or guideline document in Canada or the US is 0.33 – If the probability is > 0.33 then the information would certainly be considered identifiable
  • 95. Threshold Precedents - II – The most common probability of re-identification used in disclosure control of health data is 0.2 (cell size of 5) – It makes sense that a value of 0.2 would be used as a “default” risk • Below 0.33 there are many degrees of de-identification
  • 96. Example • The choice of threshold has a significant impact on risk assessment results
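To make the effect of the threshold concrete, here is a minimal Python sketch. It assumes that the re-identification probability of a record is 1 divided by the number of records sharing its quasi-identifier values, which is consistent with the slide's pairing of 0.2 with a cell size of 5. The function name `records_at_risk`, the field names, and the toy records are illustrative assumptions, not taken from the presentation.

```python
from collections import Counter

def records_at_risk(records, quasi_identifiers, threshold):
    """Return the records whose re-identification probability exceeds threshold.

    Assumption: probability = 1 / (size of the group of records that share
    the same values on the quasi-identifiers).
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [r for r, k in zip(records, keys) if 1.0 / sizes[k] > threshold]

# Made-up toy data: groups of size 5, 4 and 1 on (gender, year of birth).
records = (
    [{"gender": "F", "yob": 1970}] * 5
    + [{"gender": "M", "yob": 1980}] * 4
    + [{"gender": "F", "yob": 1990}]
)

for threshold in (0.2, 0.33):
    flagged = records_at_risk(records, ["gender", "yob"], threshold)
    print(threshold, len(flagged))
```

With the stricter 0.2 threshold the group of four (probability 0.25) and the unique record are flagged; at 0.33 only the unique record is. The same data set can therefore look acceptable or risky depending on the threshold chosen.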
  • 97. De-identification Techniques [Diagram: identifying variables and quasi-identifying variables; techniques shown: Analytics, Heuristics, Randomization, Coding, Suppression; labels D1, D2, D3]
  • 98. Examples of Analytics • Table aggregation – disclose only summary tables • Generalization • Record or variable suppression • Geographic aggregation • Sub-sampling • Adding noise
  • 99. Common De-identification Heuristic • If a geographic area has a small population, then: – Suppress all data from that area – Aggregate the geographic area • Applied for a variety of data sets, including public health data sets • For many applications this heuristic results in significant loss of data or imperils analysis
  • 100. Examples • HIPAA: 20k rule • Census Bureau: 100k rule • Statistics Canada: 70k rule • British Census: 120k rule
  • 101. The Problem • Such generic rules ignore the specific variables that are included in a data set • A smaller cutoff should be used if few variables are in a data set • A larger cutoff should be used if many variables are in a data set
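The population-cutoff heuristic can be sketched in a few lines of Python. This is only an illustration of the generic rule, not of the presenter's own models: the function name `apply_cutoff`, the area labels, and the population figures are made up, and the sketch only flags areas (whether they are then suppressed or aggregated is left open).

```python
def apply_cutoff(populations, cutoff):
    """Flag areas whose population is below the cutoff.

    Returns the flagged area names and the share of the total population
    living in those areas (that is, the data affected by the rule).
    """
    flagged = sorted(area for area, pop in populations.items() if pop < cutoff)
    total = sum(populations.values())
    affected = sum(populations[area] for area in flagged)
    return flagged, affected / total

# Hypothetical areas and populations, for illustration only.
populations = {
    "area_1": 15_000,
    "area_2": 45_000,
    "area_3": 80_000,
    "area_4": 150_000,
}

for cutoff in (20_000, 70_000, 100_000):
    flagged, share = apply_cutoff(populations, cutoff)
    print(f"cutoff {cutoff}: {flagged} ({share:.1%} of population)")
```

Note that the rule looks only at population size. It flags the same areas whether the data set contains two quasi-identifiers or twenty, which is the limitation described above.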
  • 102. Automation - I
  • 103. Automation - II
  • 104. [Table by province] Each entry gives FSA / Pop for, in order: 20,000 Cutoff; 70,000 Cutoff; 100,000 Cutoff; Our GAPS Models • Alberta: 55% / 84%; 38% / 71%; 1.4% / 5%; 0 / 0 • British Columbia: 68% / 87%; 46% / 70%; 1.1% / 4%; 0 / 0 • Manitoba: 59% / 88%; 39% / 68%; 0 / 0; 0 / 0 • New Brunswick: 20% / 51%; 4.5% / 19%; 0 / 0; 0 / 0 • Newfoundland: 55% / 83%; 30% / 62%; 0 / 0; 0 / 0 • Nova Scotia: 47% / 82%; 16% / 43%; 0 / 0; 0 / 0 • Ontario: 69% / 91%; 49% / 76%; 1.4% / 5%; 0.2% / 1% • PEI: 57% / 90%; 43% / 79%; 0 / 0; 0 / 0 • Quebec: 59% / 84%; 36% / 63%; 1% / 5%; 0.25% / 0 • Saskatchewan: 60% / 93%; 49% / 84%; 2% / 7%; 0 / 2%
  • 105. Risk Methodology • De-identification by itself is not sufficient: – Using low thresholds results in rapid data quality deterioration – Using high thresholds is perceived as too risky – We want to create incentives for the data recipients to improve their security and privacy practices • Methodology allows you to select and justify a threshold
  • 106. Managing Re-identification Risk [Diagram: Risk Exposure linked by + and - arrows to Amount of De-identification, Mitigating Controls, Invasion-of-Privacy, and Motives & Capacity]
  • 107. The Tradeoffs [2x2 grid: Ability to Re-identify the Data (Low, High) by Mitigating Controls (Low, High)] • Low controls, low ability: balanced • Low controls, high ability: dangerous • High controls, low ability: conservative • High controls, high ability: balanced • Annotations on the grid: higher cost burden on data recipient; lower data quality
  • 108. Steps in Risk Methodology • The methodology has two steps to evaluate the overall risks • First we determine the probability of a re-identification attempt • Then we determine the re-identification risk to use
  • 109. Determining Pr Re-identification Attempts
  • 110. Determining Risk Threshold to Use
  • 111. Implementation of Methodology • An important component of this methodology is the ability to audit the data recipient/agent receiving the data • Update audits are performed regularly • Data sharing agreements are put in place for external recipients and external agents (internal ones usually covered by employment agreements) • The elements in the security maturity profile are part of the data sharing agreement
  • 112. Compliance Audits • The audits use a publicly available checklist • Audit results would be generally accepted so that recipients do not need to get audited repeatedly for different disclosures • Intended to be rapid (one or two days on-site) and cheap ($1k to $2k)
  • 113. Example - Pharmacy Data • Request to CHEO for prescription data from a commercial data broker • Concern that this data could potentially identify patients • We performed a study to evaluate re-identification risk and come up with an anonymous version of the data
  • 114. Prescription Records Example [Two-column slide, columns interleaved in the transcript] • First column: – Patient age in days – Patient gender – Forward Sortation Area – Admission date – Discharge date – Diagnosis – Dispensed drug • Second column: – Gender – Length of stay in days – Quarter and year of admission – Patient’s region (first character of the postal code) – Patient’s age in weeks – Diagnosis – Dispensed drug • Regular third party privacy/security audits • Breach notification protocols must be in place • Restrictions on further distribution of raw data • Data destruction provisions
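The kind of generalization this slide lists can be sketched as a small Python function. The mapping used here is an assumption read off the field names on the slide (age in days to age in weeks, admission date to quarter and year, admission and discharge dates to length of stay, Forward Sortation Area to its first character); the dictionary keys, the function name `generalize`, and the sample record are hypothetical.

```python
from datetime import date

def generalize(record):
    """Coarsen one prescription record (assumed mapping, see lead-in)."""
    admitted = record["admission_date"]
    discharged = record["discharge_date"]
    quarter = (admitted.month - 1) // 3 + 1
    return {
        "gender": record["gender"],
        # Exact dates are replaced by a duration and a coarse period.
        "length_of_stay_days": (discharged - admitted).days,
        "admission_quarter": f"{admitted.year}-Q{quarter}",
        # Only the first character of the postal code is kept.
        "region": record["fsa"][0],
        # Age in days is truncated to whole weeks.
        "age_weeks": record["age_days"] // 7,
        "diagnosis": record["diagnosis"],
        "dispensed_drug": record["dispensed_drug"],
    }

# Made-up record for illustration only.
example = {
    "age_days": 400,
    "gender": "F",
    "fsa": "K1H",
    "admission_date": date(2008, 5, 10),
    "discharge_date": date(2008, 5, 14),
    "diagnosis": "example diagnosis",
    "dispensed_drug": "example drug",
}

print(generalize(example))
```

Each transformation keeps the analytic value of a field (a duration, a season, a region, an approximate age) while reducing how many distinct values it can take, which is what lowers the chance that a record is unique.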
  • 115. An Example Deployment
  • 116. An Example Deployment
  • 117. An Example Deployment