The document discusses the challenges of anonymizing clinical study reports to comply with increasing privacy regulations while maintaining transparency. It proposes a human-in-the-loop analytics approach using natural language processing and iterative risk scoring to transform structured and unstructured data in clinical studies. By leveraging domain expertise and user feedback, the solution aims to balance privacy, transparency and regulatory compliance for anonymizing sensitive clinical data.
1. A human-in-the-loop, analytics-driven approach for anonymization & redaction of clinical data submissions
Ganes Kesari
August 23, ’22
2. Ganes Kesari
Co-founder & Chief Decision Scientist
“Simplify Data Science for all”
100+ Clients
Solve business problems using insights
and stories on a low-code platform
@kesaritweets
/gkesari
3. Ø New regulatory requirements such as EMA Policy 0070 and Health Canada PRCI
Ø Specific data privacy guidelines for all reports published in the public domain
Ø These regulations also call for a level of data transparency
Ø They prescribe sufficient data granularity to help the scientific community
Anonymizing Clinical Study Reports (CSR) with the right balance of Privacy
and Transparency is challenging
Privacy
Transparency
4. Anonymization of CSRs is a three-fold problem
Human Errors
Ø Time consuming & cumbersome process with many complex steps
Ø Error prone and requires multiple manual reviews
Unstructured Content
Ø No plug & play, off-the-shelf Named Entity Recognition models
Ø Requires pharma domain-specific entities
Regulatory Constraints
Ø Complex and rapidly evolving regulations
Ø High quality bar with stringent re-identification risk thresholds
5. 1. Regulatory Constraints: Increasing regulations drive up compliance costs and the likely penalties for breaches
Typical annual spend of $2-3 Mn on external vendors
Typical cost of a healthcare breach was $9.2 Mn per incident
84% increase in healthcare data breaches, impacting 45 million people
Tech advances & public data raise re-identification risks
IBM Report - Cost of a Data Breach
6. 2. Human Errors: Clinical teams go through a long and cumbersome process for CSR anonymization
Time-consuming processes with cycle times up to 45 days for each summary document
25+ complex steps in achieving anonymization using different clinical trial management systems
Higher potential for error with data flowing across multiple internal systems, databases and emails for reviews and approvals
7. Anonymization
Data anonymization is the process of transforming information by removing or encrypting sensitive data (PII or PHI) in order to protect data subjects’ privacy and confidentiality
NLP
Process of transforming and understanding human language to identify meaningful patterns and new insights
3. Unstructured Content: Advanced analytics and Natural Language Processing are needed to extract PII entities with high accuracy
Anonymization Techniques:
Ø Character Masking
Ø Pseudonymisation
Ø Generalization
Ø Swapping
Ø Data perturbation, etc.
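The techniques listed above can be illustrated with a minimal Python sketch. This is not the implementation behind the slides: every function name, the `SUBJ-` token prefix, the ID format and the default parameters are assumptions chosen for illustration, and only the standard library is used.

```python
import hashlib
import hmac
import random


def mask_characters(value: str, visible: int = 2, mask_char: str = "*") -> str:
    """Character masking: keep the first `visible` characters, mask the rest."""
    return value[:visible] + mask_char * max(len(value) - visible, 0)


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Pseudonymisation: replace an identifier with a keyed, repeatable token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "SUBJ-" + digest[:10]  # "SUBJ-" prefix is a hypothetical convention


def generalise_age(age: int, band: int = 10) -> str:
    """Generalization: report an age band instead of the exact age."""
    lower = (age // band) * band
    return f"{lower}-{lower + band - 1}"


def swap_column(values: list, seed: int = 0) -> list:
    """Swapping: shuffle one attribute across records so the values stay
    realistic but are no longer tied to their original record."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def perturb(value: float, max_shift: float, seed: int = 0) -> float:
    """Data perturbation: add bounded random noise to a numeric value."""
    return value + random.Random(seed).uniform(-max_shift, max_shift)


print(mask_characters("1002-0457"))  # 10*******
print(generalise_age(47))            # 40-49
```

Which technique fits a given field is a judgement call: pseudonymisation keeps records linkable across tables, while generalization and perturbation trade data granularity for lower re-identification risk.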
NLP Techniques:
Ø Information Retrieval
Ø Natural Language Processing
Ø Information extraction (NER: Named Entity Recognition)
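To make the entity-extraction step concrete, here is a small rule-based sketch in Python. It is a stand-in for illustration only: the slides describe custom-trained, domain-specific models, whereas this example uses plain regular expressions. The entity labels and the patterns (subject ID format, date format, age phrase) are hypothetical and would differ between sponsors and documents.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    label: str
    start: int
    end: int
    text: str


# Hypothetical patterns; real CSRs need sponsor-specific formats and a trained model.
PATTERNS = {
    "SUBJECT_ID": re.compile(r"\b\d{4}-\d{4}\b"),
    "DATE": re.compile(
        r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"
    ),
    "AGE": re.compile(r"\b\d{1,3}-year-old\b"),
}


def find_entities(text: str) -> list[Entity]:
    """Return every pattern match as an Entity, ordered by position in the text."""
    found = [
        Entity(label, m.start(), m.end(), m.group())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each entity span with its label, working from the end of the
    text so that earlier character offsets stay valid."""
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:e.start] + f"[{e.label}]" + text[e.end:]
    return text


sentence = "Subject 1002-0457, a 63-year-old male, withdrew on 14 Mar 2019."
print(redact(sentence, find_entities(sentence)))
# Subject [SUBJECT_ID], a [AGE] male, withdrew on [DATE].
```

The sketch assumes that patterns never overlap; a production recognizer would need to resolve overlapping spans and cover free-text identifiers that fixed patterns cannot capture.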
8. The Solution: A measured approach which balances human validation
and judgement with analytics and automation
Ø User-centered solution design
Ø Collaborative workflows with user feedback
Ø Leveraged open-source tech
Ø Custom algorithm training for better domain understanding
Ø Domain experts helped tailor algorithms for unstructured data
Ø Strong & scalable solution capabilities based on past experience
Ø Regulatory & research community helped us understand required quality thresholds
Ø Iterative optimization until the desired EMA and Health Canada controls were met
Human-in-the-Loop
Advanced Analytics
Regulatory Compliance
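The iterative optimization against re-identification thresholds mentioned on this slide can be sketched as follows. This is an assumption-laden illustration, not the algorithm used in the solution: it measures risk as 1 divided by the size of the smallest group of records sharing the same quasi-identifiers (a k-anonymity-style maximum risk), and it only widens age bands. The field names, the band widths and the threshold value are hypothetical; the actual thresholds come from the regulators' guidance.

```python
from collections import Counter


def age_band(age: int, width: int) -> str:
    """Generalise an exact age into a band of the given width."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def max_risk(keys: list) -> float:
    """Highest per-record risk: 1 / size of the smallest equivalence class."""
    counts = Counter(keys)
    return 1.0 / min(counts.values())


def optimise(records: list, threshold: float, widths=(1, 5, 10, 20, 50)):
    """Try progressively coarser age bands until the maximum risk is at or
    below the threshold. Returns (width, risk), or (None, last_risk) if no
    candidate width is coarse enough. Assumes `records` is non-empty."""
    risk = 1.0
    for width in widths:
        keys = [(age_band(r["age"], width), r["sex"]) for r in records]
        risk = max_risk(keys)
        if risk <= threshold:
            return width, risk
    return None, risk


# Toy data; real inputs would be the structured trial data.
records = [
    {"age": 41, "sex": "F"}, {"age": 44, "sex": "F"},
    {"age": 46, "sex": "M"}, {"age": 48, "sex": "M"},
    {"age": 52, "sex": "F"}, {"age": 57, "sex": "F"},
]
print(optimise(records, threshold=0.5))  # (10, 0.5)
```

Stopping at the first width that meets the threshold keeps as much granularity as possible, which is the privacy versus transparency trade-off the deck describes.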
9. Unstructured Data Transformation
The Anonymization Solution handles structured and unstructured data
with iterative risk scoring to ensure compliance
CSR documents
Reference population (data on similar trials)
Parsing CSR docs
Entity recognition
Sampling for users to validate
Recall calculation
Structured data transformation
Iterative risk scoring and optimization algorithm
Final risk-adjusted CSR document
User Input
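The sampling and recall calculation stages can be illustrated with a short Python sketch. The page-level sampling unit, the function names and the review counts are assumptions made for the example. Recall is computed in the standard way, as confirmed detections divided by confirmed detections plus entities the reviewer found that the model missed.

```python
import random


def draw_sample(page_ids, sample_size: int, seed: int = 0) -> list:
    """Pick a reproducible random sample of pages for human validation."""
    rng = random.Random(seed)
    return sorted(rng.sample(page_ids, min(sample_size, len(page_ids))))


def recall(reviewed: list):
    """reviewed: one (confirmed, missed) pair per reviewed page, where
    `confirmed` counts model detections the reviewer accepted and `missed`
    counts entities the reviewer found that the model did not flag.
    Returns None when the sample contains no entities at all."""
    true_positives = sum(confirmed for confirmed, _ in reviewed)
    false_negatives = sum(missed for _, missed in reviewed)
    total = true_positives + false_negatives
    if total == 0:
        return None
    return true_positives / total


pages_to_review = draw_sample(list(range(1, 101)), sample_size=10)
# Hypothetical reviewer counts for three of the sampled pages.
reviewed = [(18, 2), (25, 0), (12, 3)]
print(round(recall(reviewed), 3))  # 0.917
```

If the estimated recall falls below the required quality threshold, the missed entities can feed back into model training before the next iteration, which is where the user input in the pipeline comes in.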
10. What did we learn from implementing such solutions for clients?
Be prepared to tackle a variety of input data sources that differ in document structure, style, and entities
Typical CSRs contain 100+ tables and figures, which need to be treated as independent problems
There is a paucity of research on re-identification risk and patient privacy in the pharma clinical space
11. Where are we headed? Solutions must be geared for more attacks,
tightening regulations, and regional variations
Data breaches have become easier
Global regulations are evolving & norms are being tightened
Region-specific variations of regulations are emerging