Hit-Miss Model for Duplicate Detection-WHO Drug Safety Database_PVER Conf_May2011

PowerPoint presentation from the May 2011 Person Validation and Entity Resolution Conference. Presenter: Andrew Bate

  • Proactive, REMS: Studies that evaluate effectiveness of risk mitigation activities (e.g., label changes). Quantitative aspects
  • False 'false positive'? – Confirmed non-duplicates, diff doctors same hospital!
  • Dentist!
  • Dentist!

    1. A Hit-Miss Model for Duplicate Detection in the WHO Drug Safety Database
        Andrew Bate, Senior Director, Analytics Team Lead, Epidemiology, Worldwide Safety Strategy
        Person Validation and Entity Resolution Conference, Washington DC, May 23, 2011
    2. Acknowledgements
        • This research was wholly funded by the WHO Collaborating Centre for International Drug Monitoring
        • I was at the time an employee of the WHO Centre
        • This presentation is my current opinion of the completed research
        • Co-authors Niklas Norén and Roland Orre played an instrumental role in this research; Niklas Norén developed many of these slides
        • For more information please contact: [email_address]
    3. Overview
        • Background
            • Post-marketing safety surveillance
            • WHO Programme for International Drug Monitoring
            • The problem of duplicate reports
        • Method for detecting duplicate reports
        • Results
        • Concluding remarks
    4. The WHO International Drug Monitoring Programme, 2004
    5. The WHO International Drug Monitoring Programme
        • Aims to discover suspected adverse drug reactions (ADRs) that were not identified in clinical trials, once drugs are on the market
        • Collects reports from healthcare professionals and consumers internationally on suspected ADR incidents in clinical practice
        • Run by the WHO Collaborating Centre, Sweden
        • Analysis is based on a combination of quantitative methods for exploratory data analysis and expert clinical review
    6. Spontaneous reporting limitations
        • Reports often carry limited clinical information, and satisfactory secondary case evaluation is not always possible
        • Not all ADRs that occur will be recognized as drug induced by a healthcare professional
        • Even those that are suspected will not necessarily be reported
        • Suspicion can mistakenly rest on the drug, resulting in coincidental spontaneous ADR case reports
        • Control information is not collected as part of spontaneous reporting systems, the drug use is not known, and there is no direct information on disease incidence
        Ref: Bate et al. 2008, FCP
    7. The WHO database of suspected drug side effects
        • Strengths
            • Database size (>6 million case reports, 200+ fields), now growing by more than 1 million reports per year
            • International coverage since 1967
            • Reporting of all marketed drugs from 100+ countries
        • Spontaneous reporting remains the data primarily used for post-marketing identification of suspected ADRs
    8. Quantitative signal detection
        • Detect potential signals for further investigation that are not readily recognisable on a single case report nor otherwise readily apparent at case entry
        • Enhance rather than replace other methods of signal detection
            • Clinical review remains critical
        • Methods assume independence between reports
    9. Duplicate case reports
        • Unlinked case reports related to the same ADR incident: ‘duplicates’
        • Duplication may be due to:
            • Different reporting sources (health professionals, national authorities, different companies) having provided separate case reports related to the same incident
            • Mistakes in linking follow-up case reports to the earlier database records
    10. Problem extent
        • Case report duplication is one of the most important data quality problems in post-marketing drug safety data, and therefore limits ADR identification capability
        • There was no published research on methods for automated duplicate detection in this type of data
        • No studies on how common duplicates really are (studies on vaccine ADR data suggest 5%, but for specific case series, rates around 20% have been reported)
    11. Data extraction
        • Hundreds of possible record fields for each case report (administrative information, incident information, patient information), but most case reports carry little information
        • Anonymised data (but patient age and gender may be available)
        • The following record fields are used: age, gender, country, date, drug substances, ADRs, outcome
        • Note: no free-text fields are involved, as they are rarely filled in on the anonymised reports entered into the WHO database
    12. Duplicate case report characteristics
        • Typically more similar than other record pairs
        • Sometimes VERY different
        • Great variety of discrepancies – no “safe” record fields
        • Missing data can complicate things
    13. Example: impact of missing data
        • Consider the following two case reports:
        • Likely duplicates?
        • Identical case reports but too little information for the evidence to be considered strong!
    14. The hit-miss model (Copas & Hilton, 1990)
        • Compare the probability of a certain matching event under the assumption that the two records are related, to the same probability under the assumption that they are independent
        • Under the additional assumption of independence between record fields, the weights for the different record fields can be added to provide an overall match score
        • The hit-miss model provides a model for P(x,y), the probability of different matching events between related records
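The comparison on this slide can be written as a per-field log-likelihood ratio that is summed into a match score. The slides give no formulas, so the symbols $W_k$ and $S$, the field index $k$, and the base-2 logarithm are notational assumptions made here for illustration.

```latex
W_k(x_k, y_k) = \log_2 \frac{P(x_k, y_k \mid \text{related})}{P(x_k)\, P(y_k)},
\qquad
S(x, y) = \sum_{k} W_k(x_k, y_k)
```

A positive $W_k$ means the observed pair of values is more probable for related records than for independent ones. The sum over $k$ is only valid under the assumed independence between record fields.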
    15. Hit-miss model weights
        • Matches receive positive weights (greater rewards for matches on rare events)
        • Mismatches receive negative weights (greater penalties for mismatches in record fields with few errors in the training data)
        • Record fields for which at least one of the records has missing data receive weight 0
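A minimal sketch of how such weights might be computed for a single-valued categorical field. It assumes that each record's value is independently a hit (the true value, probability b), a miss (a random value drawn from the overall frequencies, probability a), or blank (probability c), with a + b + c = 1. The closed forms below follow from that assumption and are not necessarily the exact parameterisation used in the talk. The function names, field names, and all numbers are hypothetical.

```python
import math

def field_weight(x, y, beta, a, c):
    """Log2 likelihood-ratio weight for one categorical record field.

    x, y : observed values, or None when the field is blank
    beta : dict mapping value -> relative frequency in the whole data set
    a    : assumed per-record miss probability
    c    : assumed per-record blank probability (hit probability b = 1 - a - c)
    """
    if x is None or y is None:
        return 0.0  # a blank is equally likely for related and unrelated pairs
    b = 1.0 - a - c
    # d: probability that at least one of the two values is a miss,
    # given that both values are present
    d = a * (2 * b + a) / (1 - c) ** 2
    if x == y:
        # Always >= 0, and larger for rare values (small beta[x])
        return math.log2((1 - d) / beta[x] + d)
    # Always < 0, and more negative when misses are rare (small d)
    return math.log2(d)

def match_score(rec1, rec2, fields):
    """Sum the field weights, assuming independence between record fields."""
    total = 0.0
    for name, (beta, a, c) in fields.items():
        total += field_weight(rec1.get(name), rec2.get(name), beta, a, c)
    return total

# Hypothetical frequencies and error rates, for illustration only
fields = {
    "gender": ({"F": 0.6, "M": 0.4}, 0.05, 0.10),
    "drug": ({"common_drug": 0.2, "rare_drug": 0.001, "other": 0.799}, 0.05, 0.10),
}
r1 = {"gender": "F", "drug": "rare_drug"}
r2 = {"gender": "F", "drug": "rare_drug"}
r3 = {"gender": "M", "drug": None}

print(match_score(r1, r2, fields))  # two matches, one on a rare value
print(match_score(r1, r3, fields))  # one mismatch, one blank
```

Fields that can hold several values per report, such as drug substances and ADRs, would need a different treatment from this single-valued sketch.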
    16. Properties
        • Accounts for both the level of agreement and the amount of information
        • Imposes no strict criteria that a record pair must fulfil in order to be highlighted
        • Allows the threshold for manual review to be adjusted based on the available resources
        • Robust with respect to small amounts of training data
    17. Fitting standard hit-miss models to the WHO database
        • Model fitting is based on simple parameter estimation
        • The probability for different values and for missing data in each record field can be estimated based on the data set as a whole
        • The probability for a miss in a given record field needs to be estimated based on labelled duplicates (38 pairs available for the WHO database)
    18. Extending the hit-miss mixture model
        • A generalisation of the standard hit-miss model to numerical record fields
        • In addition to hits, misses and blanks, the hit-miss mixture model includes deviations
        • Motivation: many types of errors in numerical record fields are likely to lead to small differences compared to the true value, rather than to random values
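The idea of a deviation component can be illustrated with a simplified, pair-level sketch for an integer-valued field such as age in years. It assumes that the difference d between two related records is exactly 0 with probability p_hit, a rounded normal deviation with probability p_dev, and distributed like the difference of two unrelated records with probability p_miss. This is an assumed simplification for illustration, not the authors' parameterisation; all names and numbers are hypothetical.

```python
import math

def rounded_normal_pmf(d, sigma):
    """P(round(Z) == d) for Z ~ N(0, sigma^2) and an integer difference d."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / (sigma * math.sqrt(2.0))))
    return cdf(d + 0.5) - cdf(d - 0.5)

def age_difference_background(d, n_ages=100):
    """P(difference == d) for two independent ages uniform on 0..n_ages-1.

    A stand-in for the empirical difference distribution of unrelated records.
    """
    return max(n_ages - abs(d), 0) / n_ages ** 2

def mixture_weight(d, p_hit, p_dev, p_miss, sigma, background):
    """Log2 likelihood ratio for an integer difference d (None if a value is blank).

    Assumes p_hit + p_dev + p_miss == 1 and background(d) > 0.
    """
    if d is None:
        return 0.0  # blanks carry no evidence
    f = background(d)
    related = p_miss * f + p_dev * rounded_normal_pmf(d, sigma)
    if d == 0:
        related += p_hit  # exact agreement
    return math.log2(related / f)

# Hypothetical mixture proportions and deviation spread
params = dict(p_hit=0.80, p_dev=0.15, p_miss=0.05, sigma=1.0,
              background=age_difference_background)
for diff in (0, 1, 30):
    print(diff, mixture_weight(diff, **params))
```

With these assumed values, a one-year age difference still earns a positive weight, while a thirty-year difference is penalised about as heavily as a plain mismatch. This is the near-match behaviour the slide motivates.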
    19. Evaluation on Norwegian data
        • The last Norwegian batch of reports from 2004 included 19 confirmed duplicates
        • We used the hit-miss model to highlight suspected duplicates in this batch of 1559 case reports
        • The match score threshold for likely duplicates was set to 37.5 (based on an assumed 5% duplicates in the data set and in order to achieve an estimated rate of false alarms of below 0.05)
    20. Results
        • 17 record pairs were highlighted as suspected duplicates
        • 12 of these were confirmed duplicates, 5 were not
            • 5 false positives
            • 7 false negatives
            • 63% recall
            • 71% precision
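The recall and precision figures follow directly from the counts on this slide and the previous one, treating the 19 confirmed duplicates as record pairs (12 found plus 7 missed). A short check, with a hypothetical helper name:

```python
def precision_recall(true_positives, flagged, actual_duplicates):
    """Precision = TP / flagged pairs; recall = TP / all confirmed duplicate pairs."""
    precision = true_positives / flagged
    recall = true_positives / actual_duplicates
    return precision, recall

# Counts taken from the slides: 17 pairs flagged, 12 confirmed, 19 known duplicates
precision, recall = precision_recall(true_positives=12, flagged=17, actual_duplicates=19)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# prints: precision = 71%, recall = 63%
```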
    21. Top scoring record pair
        • The highest match score in the study is for an alleged false positive
            • Only near-matches on age and date, no matching ADR terms
            • BUT 6 matching drug substances (not commonly co-prescribed)
            • ... and ADR terms are semantically close
        Andrew Bate, UMC
    22. Follow-up
        • The Norwegian centre informed us that:
            • The top scoring record pair does relate to a set of confirmed duplicates (submitted by different doctors in the same hospital)
            • One of the other 'false positives' corresponds to a pair of suspected but as yet unconfirmed duplicates
    23. Duplicates?
        • Cluster of 3 reports highlighted in a specific country
        • Onset date: 16th Dec 2003
        • Age: 8, 18 and 29
        • All female
        • All had one drug listed and one AE listed – both drug and AE quite rarely reported
    24. Duplicates?
        • No – confirmed by reporting country
        • But: all were reported by the same dentist
        • Clearly these reports are not completely independent
        • Analysis methods treat all reports as equally important and weight them equally
            • Can use the duplicate detection algorithm to down-weight very similar reports that are less likely to be ‘independent’
    25. References
        • Copas J, Hilton F. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A, 1990. 153:287-320.
        • Norén GN, Orre R, Bate A. A hit-miss model for duplicate detection in the WHO drug safety database. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005. (Awarded Best Application Paper at the SIGKDD annual meeting, Chicago 2005.)
        • Norén GN, Hopstadius J, Bate A, Star K, Edwards IR. Temporal Pattern Discovery in Electronic Patient Records. Data Mining and Knowledge Discovery, 2010. 20(3):361-387.
    26. Conclusions
        • The extended hit-miss model has several beneficial theoretical properties for this application
        • Its overall performance on duplicate detection in real-world post-marketing drug safety data makes it very useful in practice
        • The hit-miss mixture model's capability to account for near-matches on age and date is important in real-world applications
