Pseudonymised Linkage of Cancer Registry Data in England

Pseudonymised Matching:
Robustly Linking Molecular and
Prescription Data to Cancer
Registry Data in England
Brian Shand, Fiona McRonald, Katherine Henson, Cong Chen
(Public Health England)

Overview
• Motivation: matching patients between data feeds is
challenging
• The OpenPseudonymiser approach to pseudonymisation
with one-way hash functions
• Extending OpenPseudonymiser with encrypted
demographics
• Results: linkage of national prescription data, BRCA
mutation screening data
• Conclusion

Motivation – information needs
• Cancer registry data is extremely sensitive, and challenging to link:
• The English cancer registration service (NCRAS) cannot reveal who has
cancer to external providers
• External providers cannot give identifiable data for patients without
cancer – NCRAS can however hold data on patients with (suspicion of)
cancer
• This makes sensitive feeds without a cancer marker difficult to access, e.g.
national prescription data, BRCA molecular screening data
• screening for mutations in BRCA1 or BRCA2 genes identifies people
with increased risk of developing breast and/or ovarian cancer.
• 50%-65% of women with a BRCA1 mutation develop breast cancer by
age 70, and 35%-46% ovarian cancer.
• if patients develop cancer later, the mutation data would add value

Key idea
• We want to pseudonymise cancer registry
data and another data source in the same
way:
• If the same patient is in both data sources,
they will get the same pseudo-id.
• Demographics / sensitive fields can be
encrypted, so that only a trusted party –
who also knows the linkage demographics
– can decrypt them.
• Non-demographic fields are generally not
disclosive, and do not need to be
encrypted (at least within our secure
cancer database).

Useful concepts
• Hashing
• Irreversible scrambling algorithm
• Secret salt
• Information making hashing context-specific
• Reversible encryption

Illustrative slide

Hashes and OpenPseudonymiser
• We start with the OpenPseudonymiser approach, which uses SHA-256 to
generate pseudonyms for each patient:
• SHA-256 is a one-way hash function (and cryptographically secure)
• given x, it’s straightforward to compute y = sha256(x), but
• given y, it is computationally infeasible to reconstruct x, other than by
trying all possible inputs by brute force.
• The pseudonyms are secure, if the salt is secret, and “long enough” (e.g.
256 bits of random data).
• Replace each patient identifier with a pseudonym, derived from the NHS
number (national healthcare identifier)
• researchers can link their datasets, without sharing patient demographics
• pseudonym = sha256(NHS number + salt)
E.g. sha256('1234567881' +
'ab00ec62fa2ad275b08471cbfc76cb8580f92283f3663baff0ea7d83aee57e19')
= '778aebfe72aefcf391d0096333bf325837981ba60ba8a5921be37789307321d3'
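The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the NHS number and salt are concatenated as plain strings and encoded as UTF-8 before hashing; the real OpenPseudonymiser tool may format its input differently, so the digests it produces need not match this sketch. The function name `make_pseudonym` and the randomly generated salt are hypothetical.

```python
import hashlib
import secrets


def make_pseudonym(nhs_number: str, salt: str) -> str:
    """Return the hex SHA-256 digest of the NHS number followed by the salt."""
    # Assumption: plain string concatenation, UTF-8 encoded.
    return hashlib.sha256((nhs_number + salt).encode("utf-8")).hexdigest()


# Hypothetical salt: 256 bits of random data, hex-encoded (64 characters).
salt = secrets.token_hex(32)

p1 = make_pseudonym("1234567881", salt)
p2 = make_pseudonym("1234567881", salt)
p3 = make_pseudonym("1234567881", secrets.token_hex(32))

print(len(p1))   # 64 hex characters, i.e. 256 bits
print(p1 == p2)  # same NHS number and same salt: same pseudonym
print(p1 == p3)  # different salt: unrelated pseudonym
```

Because the output depends on every character of the salt, two organisations can only produce matching pseudonyms if they share exactly the same salt.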

OpenPseudonymiser
• Research teams use the same salt as a shared secret.
• Patients with the same NHS number will be given the same pseudonym
778aebf…d3 <=> 778aebf…d3
• Without knowing the salt, the pseudonyms are non-identifiable.
• Ordinary researchers cannot access the salt: only a trusted linkage
function can use it, and the secrecy of this is contractually agreed.
(If the salt is known, a brute force attack could be possible.)
• OpenPseudonymiser only protects the key demographics (NHS number);
the clinical data is treated as non-identifiable
• Patients must match exactly by NHS number (or whatever demographics
tuple is used for matching purposes, e.g. postcode + date of birth).
OpenPseudonymiser does not support complex patient matching
(e.g. NHS number + surname + month and year of birth)
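The exact-match linkage described on this slide might look like the following sketch. All names, the record layout, and the salt value are hypothetical, and the example keeps only one record per patient for brevity. It shows that two datasets pseudonymised with the same salt can be joined on the pseudonym alone, and that an NHS number differing by a single digit does not match at all.

```python
import hashlib


def make_pseudonym(key_fields: str, salt: str) -> str:
    """Hex SHA-256 of the matching key followed by the salt (assumed format)."""
    return hashlib.sha256((key_fields + salt).encode("utf-8")).hexdigest()


def pseudonymise(records, salt):
    """Map pseudonym -> payload, dropping the NHS number itself."""
    return {make_pseudonym(r["nhs_number"], salt): r["payload"] for r in records}


def link(left, right):
    """Exact-match join: keep only pseudonyms present in both datasets."""
    return {p: (left[p], right[p]) for p in left.keys() & right.keys()}


# Hypothetical shared secret; in practice this would be long random data.
SHARED_SALT = "hypothetical-shared-secret"

registry = [
    {"nhs_number": "1234567881", "payload": "tumour record A"},
    {"nhs_number": "9999999999", "payload": "tumour record B"},
]
prescriptions = [
    {"nhs_number": "1234567881", "payload": "prescription X"},
    {"nhs_number": "1234567880", "payload": "prescription Y"},  # one digit off
]

linked = link(pseudonymise(registry, SHARED_SALT),
              pseudonymise(prescriptions, SHARED_SALT))
print(list(linked.values()))  # [('tumour record A', 'prescription X')]
```

The near-miss record is silently dropped, which is the limitation that motivates fuzzy matching on additional demographics.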

This is the top half of the slide

Extending OpenPseudonymiser
• We have extended OpenPseudonymiser-like pseudonymisation to support
fuzzy patient matching, and clinical data encryption.
• As in OpenPseudonymiser, pseudonyms identify possible matches, i.e.
records in which the registry has a legitimate interest.
• We use the plaintext linkage demographics to generate a secondary
encryption key:
• per-record encryption keys are used for additional demographics, and
clinical data
• keys combine patient pseudonym, random key, and additional salt
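A per-record key of this kind could be derived roughly as follows. The slides do not say how the three inputs are combined, so the use of HMAC-SHA-256 here is an assumption made for the sketch, as are the function name `derive_record_key`, the separator, and both salt values. The encryption step itself is omitted: the derived 256-bit key would be handed to an authenticated cipher (for example AES-GCM) from a vetted cryptography library.

```python
import hashlib
import hmac
import secrets


def derive_record_key(pseudonym: str, random_key: str, key_salt: str) -> bytes:
    """Combine pseudonym, per-record random key and a second salt into 32 bytes.

    Assumption: HMAC-SHA-256 keyed with the second salt; the original
    scheme may combine these inputs differently.
    """
    message = (pseudonym + "|" + random_key).encode("utf-8")
    return hmac.new(key_salt.encode("utf-8"), message, hashlib.sha256).digest()


# Pseudonym computed from the plaintext linkage demographics (hypothetical salt).
pseudonym = hashlib.sha256(b"1234567881" + b"hypothetical-salt-1").hexdigest()

# Random value generated once per record and stored beside the ciphertext.
random_key = secrets.token_hex(32)

key = derive_record_key(pseudonym, random_key, "hypothetical-salt-2")
print(len(key))  # 32 bytes, a suitable length for a 256-bit cipher key
```

A party that holds the second salt and can recompute the pseudonym from its own demographics can rebuild the key; anyone lacking either input cannot.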

Extending OpenPseudonymiser 2
• The cancer registry keeps an isolated database of pseudonymised data and
keys, to match registry patients against.
• Where the core demographics match, the remaining demographics will
be unpacked, and used for fuzzy patient matching.
• If the demographics match score is high enough, the clinical data will be
unpacked and released to the ENCORE cancer registration database.
• No access to identifiable data for patients not suspected to have cancer
• The pseudonymised dataset itself can also be used for baseline
comparisons, e.g. to compare how often a particular prescription drug was
dispensed to lung cancer patients, vs the overall population.
• By including patient age as a derived, non-disclosive field in the
pseudonymised data, baseline comparisons can be age standardised.
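The gating step (decrypt the remaining demographics, score the match, release clinical data only above a threshold) can be sketched as below. The fields compared, the weights, the threshold, and all names are illustrative assumptions; the slides do not specify the scoring method actually used by the registry.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Demographics:
    surname: str
    date_of_birth: str  # ISO format, YYYY-MM-DD
    postcode: str


def normalise_postcode(postcode: str) -> str:
    """Ignore case and spacing when comparing postcodes."""
    return postcode.replace(" ", "").upper()


def match_score(registry: Demographics, incoming: Demographics) -> float:
    """Weighted similarity between 0 and 1; the weights are illustrative only."""
    surname_sim = SequenceMatcher(
        None, registry.surname.lower(), incoming.surname.lower()
    ).ratio()
    dob_match = 1.0 if registry.date_of_birth == incoming.date_of_birth else 0.0
    postcode_match = 1.0 if (
        normalise_postcode(registry.postcode) == normalise_postcode(incoming.postcode)
    ) else 0.0
    return 0.4 * surname_sim + 0.4 * dob_match + 0.2 * postcode_match


THRESHOLD = 0.8  # hypothetical cut-off


def release_clinical_data(registry: Demographics, incoming: Demographics,
                          clinical_data: str) -> Optional[str]:
    """Return the clinical data only if the demographics match well enough."""
    if match_score(registry, incoming) >= THRESHOLD:
        return clinical_data
    return None


on_registry = Demographics("Smith", "1950-01-31", "AB1 2CD")
misspelt = Demographics("Smyth", "1950-01-31", "ab12cd")
stranger = Demographics("Jones", "1972-06-15", "ZZ9 9ZZ")

print(release_clinical_data(on_registry, misspelt, "BRCA1 variant"))  # released
print(release_clinical_data(on_registry, stranger, "BRCA1 variant"))  # None
```

A misspelt surname still clears the threshold because date of birth and postcode agree, whereas a record that merely shares a pseudonym by mistake would be held back.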

This is the full slide

Applications in PHE
• Public Health England has access to pseudonymised national prescription
data feeds from NHS Business Services Authority, and BRCA and other
genetic mutation screening data. These have been linked to the cancer
registry. Decrypted birthdates help validate NHS number matches.
• Four months of prescription data (332 million prescriptions, 29 million
people) matched 1.6 million cancer patients: 88% of living cancer
patients had a prescription record.
• We now have 47 months of prescription data linked to the cancer registry
• Non-disclosive fields need not be pseudonymised, so the pseudonymised
dataset allows baseline comparisons against the cancer-linked cohort. For
BRCA screening data, this identified nearly 1,300 unique variants from 7,000
screening patients, and an overall variant detection rate of about 25%. In
prescription data, cancer patients were compared with age-matched
controls.

Conclusion
• Linking data from external sources to the cancer registry creates a powerful
resource to better understand patient experience over their lifetimes.
• Pseudonymised matching can help to unlock data sources which include
people without cancer.
• We have done this for prescribing and screening data.
• cong.chen@phe.gov.uk
• ncrasenquiries@phe.gov.uk
