PPTX slides - PowerPoint Presentation


  1. 2010 HDWA Annual Conference
      Data Warehousing – Adding Value to Healthcare
      Pathology Reports Information Extraction: An OHNLP and UMLS Powered Approach
      Naveen Ashish, Research Associate Professor
      September 14th 2010
      HDWA 2010 Durham, NC
  2. Agenda
      • Introduce institution, research group, and project
      • Outline automated information extraction problem
      • Solution
          – Using open frameworks
          – Open ontology resources
      • Current Status
      • Domain experts engagement
      • Conclusions
      All Rights Reserved, Duke Medicine 2007 HDWA 2010 Durham, NC
  3. University of California, Irvine Medical Center
      University of California, Irvine Medical Center is a 422-bed tertiary teaching hospital with a commitment to education, research and quality patient care. UCI Medical Center is a Magnet Designated facility with a Level 1 Trauma Center, Burn Center and Level II Neonatal Care Center.
      • Not-for-Profit
      • # Employees
      • # ER Visits
      • # Admissions
  4. Data Warehouse Profile
      • UCI Clinical Informatics Team
          – Director
          – Informatics Solutions Architect
          – Principal Statistician/Advisor
          – Informatics Outreach Architect (future)
          – Clinical Practice Engineer
          – Clinical Research Informatics Lead
          – Business Intelligence Developer (2)
          – Clinical Informatics Specialist
          – NLP Specialist
  5. Project Team
      • Supported by UCI Medical Center Clinical Informatics Department
      • Collaboration between UCI Medical Center (Clinical Informatics) and Calit2/Computer Science
      • Members
          – Naveen Ashish (NLP and CS Researcher)
          – Lisa Dahm (Director, Biomedical Informatics)
          – Charles Boicey (Informatics Architect)
  6. Vision (diagram): UCI QUP Quest Text Reports Analysis
  7. (UCI) Pathology Report
  8. Reports
      • Pathology reports
          – Free text but “semi-structured” as well
          – Nuggets of information in the text
  9. What do we want to ask?
      • Sample (retrieval) “questions”
          – Surgical Pathology
              • Patients with a surgical pathology report containing undifferentiated lymphoepithelioma-like gastric carcinoma.
              • Patients with a surgical pathology report containing spindle cell carcinoma of the breast, grade 3, margin(s) positive, node(s) positive.
          – Discharge Note
              • Patients with a discharge note containing a diagnosis of cerebrovascular accident and diabetes mellitus type II discharged in stable condition to home.
              • Female patients with a discharge diagnosis of Ewing sarcoma, hypertension and obesity discharged in stable condition to home.
  10. In Text
      • Thus we need
          – Sections and sub-sections
          – Associations
          – Terms
          – Dimensions
          – …
      • Sample report text:
          FINAL DIAGNOSIS AFTER MICROSCOPY: LUNG, LEFT LOWER LOBE, WEDGE RESECTION: POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN SIZE: 1.5 CM STAPLED RESECTION MARGIN: NEGATIVE 5 NECROSIS EXTENSIVE FIBROSIS IS NOT PRESENT PLEASE SEE COMMENT
          FINAL DIAGNOSIS AFTER MICROSCOPY: A. DEEP TRICEPS MARGIN, EXCISION: POSITIVE FOR SARCOMA B. LATERAL SUPERIOR MARGIN, EXCISION: POSITIVE FOR SARCOMA
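The need for sections and sub-sections can be illustrated with a small regular-expression sketch over the second sample excerpt. This is only an illustration, not the project's code: the line breaks in `REPORT`, the two patterns, and the `segment` function are assumptions made for the example.

```python
import re

# Excerpt from the slide; the line breaks are assumed for illustration.
REPORT = """FINAL DIAGNOSIS AFTER MICROSCOPY:
A. DEEP TRICEPS MARGIN, EXCISION: POSITIVE FOR SARCOMA
B. LATERAL SUPERIOR MARGIN, EXCISION: POSITIVE FOR SARCOMA
"""

# Hypothetical pattern: a heading is an upper-case line ending in a colon.
HEADING = re.compile(r"^([A-Z][A-Z ,]+):\s*$")
# Hypothetical pattern: a sub-section is "<letter>. <specimen>: <finding>".
SUBSECTION = re.compile(r"^([A-Z])\.\s+(.+?):\s+(.+)$")


def segment(report):
    """Split a report into headings, each with its labelled sub-sections."""
    sections = []
    current = None
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            current = {"heading": heading.group(1), "parts": []}
            sections.append(current)
            continue
        part = SUBSECTION.match(line)
        if part and current is not None:
            current["parts"].append(
                {
                    "label": part.group(1),
                    "specimen": part.group(2),
                    "finding": part.group(3),
                }
            )
    return sections


sections = segment(REPORT)
```

Each sub-section keeps the specimen and its finding together, which is the kind of association the retrieval questions rely on.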
  11. System (diagram): OHNLP; UCI-QUP Application (Rules, Code); Database (warehouse); GUI, Tableau, i2b2; Unstructured; Structured; Analysis
  12. Related Work
      • Computerized Extraction of Information on the Quality of Diabetes Care from Free Text in Electronic Patient Records of General Practitioners. Jaco Voorham, Petra Denig. JAMIA 2007;14:349-354. doi:10.1197/jamia.M2128
      • Application of information technology: MedEx: a medication information extraction system for clinical narratives. Hua Xu, Shane P Stenner, Son Doan, Kevin B Johnson, Lemuel R Waitman, Joshua C Denny. JAMIA 2010;17:19-24. doi:10.1197/jamia.M3378
      • Identifying Smokers with a Medical Extraction System. Cheryl Clark, Kathleen Good, Lesley Jezierny, Melissa Macpherson, Brian Wilson, Urszula Chajewska. JAMIA 2008;15:36-39. doi:10.1197/jamia.M2442
      • Automated evaluation of electronic discharge notes to assess quality of care for cardiovascular diseases using Medical Language Extraction and Encoding System (MedLEE). Jung-Hsien Chiang, Jou-Wei Lin, Chen-Wei Yang. JAMIA 2010;17:245-252. doi:10.1136/jamia.2009.000182
      • Using Regular Expressions to Abstract Blood Pressure and Treatment Intensification Information from the Text of Physician Notes. Alexander Turchin, Nikheel S Kolatkar, Richard W Grant, Eric C Makhni, Merri L Pendergrass, Jonathan S Einbinder. JAMIA 2006;13:691-695. doi:10.1197/jamia.M2078
      • Natural Language Processing Framework to Assess Clinical Conditions. Henry Ware, Charles J Mullett, V Jagannathan. JAMIA 2009;16:585-589. doi:10.1197/jamia.M3091
      • A General Natural-language Text Processor for Clinical Radiology. Carol Friedman, Philip O Alderson, John H M Austin, James J Cimino, Stephen B Johnson. JAMIA 1994;1:161-174. doi:10.1136/jamia.1994.95236146
      • Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon. Yang Huang, Henry J Lowe, Dan Klein, Russell J Cucina. JAMIA 2005;12:275-285. doi:10.1197/jamia.M1695
      • Automated Encoding of Clinical Documents Based on Natural Language Processing. Carol Friedman, Lyudmila Shagina, Yves Lussier, George Hripcsak. JAMIA 2004;11:392-402. doi:10.1197/jamia.M1552
      • Description of a Rule-based System for the i2b2 Challenge in Natural Language Processing for Clinical Data. Lois C Childs, Robert Enelow, Lone Simonsen, Norris H Heintzelman, Kimberly M Kowalski, Robert J Taylor. JAMIA 2009;16:571-575. doi:10.1197/jamia.M3083
      • Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries. Genevieve B Melton, George Hripcsak. JAMIA 2005;12:448-457. doi:10.1197/jamia.M1794
  13. Related Work
      • Documents: discharge summaries, patient notes, EMR sections, path or radiology reports, …
      • Processing: identify noun phrases, section (headings), numerical values, negations, …; extract blood pressure, medications, …
      • Analysis: quality of care, smoker status, adverse events, other diagnoses, …
  14. Systems
      • Columbia
          – Carol Friedman et al.
          – MedLEE
      • “Black art”
          – Systems from Defense, Intelligence etc. companies
      • Open Software and Tools
          – Medical Informatics
              • OHNLP: Open Health Natural Language Processing
              • IBM, Mayo Clinic, (NCI)
          – General
              • UIMA, GATE
              • Variety of lexical tools, named-entity recognizers, parsers etc.
              • XAR
      • Links
          – http://zellig.cpmc.columbia.edu/medlee/
          – http://incubator.apache.org/uima/
          – http://gate.ac.uk/
          – http://nlp.stanford.edu/software/lex-parser.shtml
  15. Extraction Techniques
      • What do we employ to achieve automated extraction?
      • Broad paradigms
          – Rule driven (expert)
          – Machine-learning based (trained)
          – Combined (most recent systems)
      • Multiple levels
          – Semi-structured data extraction
          – Named entity extraction
              • POS tagging, NE identification
              • Ontology driven (domain terms)
          – “Deep” relation level extraction
              • Associations
              • Natural Language Parsing
  16. NL Parse Illustration
      • Sentence: “Tissue between the two surgical clips contains foci of ductal carcinoma in situ within a papilloma, as well as within ducts.”
      • Phrase-structure parse:
          (ROOT (S (NP (NP (NNP Tissue)) (PP (IN between) (NP (DT the) (CD two) (JJ surgical) (NNS clips)))) (VP (VBZ contains) (NP (NP (NNS foci)) (PP (IN of) (NP (NP (JJ ductal) (NN carcinoma)) (ADJP (FW in) (FW situ))))) (PP (PP (IN within) (NP (DT a) (NN papilloma))) (, ,) (CONJP (RB as) (RB well) (IN as)) (PP (IN within) (NP (NNS ducts))))) (. .)))
      • Typed dependencies:
          nsubj(contains-7, Tissue-1)
          det(clips-6, the-3)
          num(clips-6, two-4)
          amod(clips-6, surgical-5)
          prep_between(Tissue-1, clips-6)
          dobj(contains-7, foci-8)
          amod(carcinoma-11, ductal-10)
          prep_of(foci-8, carcinoma-11)
          amod(carcinoma-11, in-12)
          dep(in-12, situ-13)
          det(papilloma-16, a-15)
          prep_within(contains-7, papilloma-16)
          prep_within(contains-7, ducts-22)
          conj_and(papilloma-16, ducts-22)
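To show how typed dependencies like those on this slide can be turned into associations, here is a minimal sketch that reads the dependency lines into tuples and looks up what is attached to a given word. The regular expression and the helper names `parse_dependencies` and `dependents` are assumptions for illustration; they are not part of the parser or of the project.

```python
import re

# The typed dependencies from the slide, one per line.
DEPENDENCIES = """nsubj(contains-7, Tissue-1)
det(clips-6, the-3)
num(clips-6, two-4)
amod(clips-6, surgical-5)
prep_between(Tissue-1, clips-6)
dobj(contains-7, foci-8)
amod(carcinoma-11, ductal-10)
prep_of(foci-8, carcinoma-11)
amod(carcinoma-11, in-12)
dep(in-12, situ-13)
det(papilloma-16, a-15)
prep_within(contains-7, papilloma-16)
prep_within(contains-7, ducts-22)
conj_and(papilloma-16, ducts-22)
"""

# relation(governor-index, dependent-index)
DEP = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text):
    """Return (relation, governor, governor_index, dependent, dependent_index) tuples."""
    return [
        (rel, gov, int(gov_i), dep, int(dep_i))
        for rel, gov, gov_i, dep, dep_i in DEP.findall(text)
    ]


def dependents(deps, governor, relation):
    """Words attached to `governor` through `relation`, in sentence order."""
    return [dep for rel, gov, _, dep, _ in deps if rel == relation and gov == governor]


deps = parse_dependencies(DEPENDENCIES)
finding = dependents(deps, "foci", "prep_of")            # what the foci consist of
locations = dependents(deps, "contains", "prep_within")  # where they were found
```

Following `prep_of` from "foci" reaches "carcinoma", and following `prep_within` from "contains" reaches both "papilloma" and "ducts", so the finding can be linked to its two locations.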
  17. MedLEE Illustration
  18. OHNLP
      • OHNLP
          – Open Health Natural Language Processing Consortium
              • IBM and Mayo Clinic are founding partners
              • caBIG/NCI supported
          – Open-source consortium promoting the use of UIMA
      • Features
          – Built upon Apache UIMA
              • Annotators, Pipelines
          – Medical domain
              • MedKAT/P (IBM): pathology reports extraction
              • cTAKES (Mayo Clinic): clinical data
  19. Rationale for OHNLP
      • Based on UIMA
          – Open source
          – Community of developers
      • OHNLP itself
          – NCI
              • IBM, Mayo
          – MedKAT/P and cTAKES
          – Two-way benefits
              • Adopt
              • Contribute back
  20. MedKAT Annotations
  21. “Programming” UIMA
      • OHNLP based on UIMA
      • UIMA composed of “Analysis Engines”
          – Primitive
          – Aggregate
      • Diagram: an aggregate engine built from
          – Primitive Engine (section headings)
          – Primitive Engine (numerical)
          – Primitive Engine (dict terms)
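The primitive/aggregate idea can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the UIMA API (UIMA engines are Java components configured through descriptors and share a common analysis structure); `Annotation`, `regex_engine`, `aggregate`, and the three patterns are hypothetical names and values chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed span over the document text."""
    kind: str
    begin: int
    end: int
    text: str


def regex_engine(kind, pattern):
    """Build a 'primitive' engine that annotates every match of one pattern."""
    compiled = re.compile(pattern)

    def engine(text):
        return [
            Annotation(kind, m.start(), m.end(), m.group(0))
            for m in compiled.finditer(text)
        ]

    return engine


def aggregate(*engines):
    """Build an 'aggregate' engine that runs its parts and merges their output."""

    def engine(text):
        found = []
        for part in engines:
            found.extend(part(text))
        return sorted(found, key=lambda a: (a.begin, a.end))

    return engine


# Three primitive engines mirroring the slide's diagram (patterns are assumptions).
section_headings = regex_engine("heading", r"[A-Z][A-Z ]+(?=:)")
numerical = regex_engine("number", r"\d+(?:\.\d+)?")
dict_terms = regex_engine("term", r"ADENOCARCINOMA|SARCOMA")

pipeline = aggregate(section_headings, numerical, dict_terms)
annotations = pipeline("SIZE: 1.5 CM ADENOCARCINOMA")
```

Because an aggregate has the same call shape as a primitive, aggregates can themselves be nested inside larger pipelines, which is the property the slide's diagram points at.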
  22. Descriptors, Resources
  23. Analysis Engines
      • Developing “UCI-QUP”
          – UCI Quest UIMA Pipeline
      • Analysis Engines
          – Recognize sections and sub-sections
              • Regular expressions
          – Significant terms
              • Medical terms
          – Existing dictionary in MedKAT/P
              • Useful, not complete
          – Integrate additional terminology
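Dictionary-driven term spotting, as listed above, can be sketched as a longest-match lookup over tokens. The tiny `DICTIONARY` below is a stand-in invented for the example; it is not the MedKAT/P dictionary or any UMLS content, and `spot_terms` is a hypothetical helper.

```python
import re

# Hypothetical mini-dictionary; a real one would come from a terminology resource.
DICTIONARY = {"ductal carcinoma in situ", "carcinoma", "papilloma", "sarcoma"}
MAX_WORDS = max(len(term.split()) for term in DICTIONARY)


def spot_terms(text):
    """Return (term, begin, end) for dictionary terms, preferring the longest match."""
    tokens = [(m.group(0).lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at token i first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if candidate in DICTIONARY:
                hits.append((candidate, tokens[i][1], tokens[i + n - 1][2]))
                i += n
                break
        else:
            i += 1
    return hits


sentence = (
    "Tissue between the two surgical clips contains foci of ductal carcinoma "
    "in situ within a papilloma, as well as within ducts."
)
terms = [term for term, _, _ in spot_terms(sentence)]
```

Preferring the longest match reports "ductal carcinoma in situ" once rather than also reporting the shorter "carcinoma" inside it.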
  24. AE Terms
      • Good resources
          – NCI Thesaurus
              • Cancer related
              • > 500,000 terms/concepts
          – NCI Metathesaurus
              • Several million concepts
      • Developed
          – Converter
              • NCI Thesaurus → UIMA Dictionary Resource
          – Application
          – Database
              • MySQL
  25. Architecture (diagram): Pathology Reports (Unstructured); OHNLP; Extracted Data (Structured); UCI Quest UIMA Pipeline; Knowledge Sources
  26. UMLS and Metathesaurus
  27. UMLS
      • UMLS
          – Obtained system from NLM
          – Installed successfully on informatics-nlp
          – Features
              • Browse concepts and relationships
              • Flat files
              • DB import
          – Being integrated into UCI-QUP
  28. Our Contribution to OHNLP
      • We indeed adopted
          – Framework
          – Relevant “resources”
      • Contribute to overall OHNLP effort
          – Specific analysis engines
              • Sections and sub-sections in pathology reports
              • Significant items
              • Dictionary terms (UMLS integration)
              • …
      • Contribute as a project back to OHNLP
  29. Database Schema
  30. Demo
  31. SQL Queries
      • Example (possible) queries
          SELECT reportid FROM collection
          WHERE (sectioncontent LIKE '%carcinoma%') AND (heading LIKE '%tumor%')
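A query of this shape can be tried end to end with a small sketch. The table and column names (`collection`, `reportid`, `heading`, `sectioncontent`) come from the slide's query; everything else is an assumption for illustration: the rows are invented, the search term is spelled "carcinoma", and an in-memory SQLite database stands in for the project's database.

```python
import sqlite3

# In-memory stand-in for the warehouse table named in the slide's query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection (reportid INTEGER, heading TEXT, sectioncontent TEXT)"
)

# Invented sample rows, loosely based on text shown elsewhere in the slides.
rows = [
    (1, "TUMOR TYPE", "POORLY DIFFERENTIATED ADENOCARCINOMA OF PULMONARY ORIGIN"),
    (2, "TUMOR SIZE", "1.5 CM"),
    (3, "FINAL DIAGNOSIS", "SPINDLE CELL CARCINOMA OF THE BREAST"),
]
conn.executemany("INSERT INTO collection VALUES (?, ?, ?)", rows)


def reports_matching(connection, content_term, heading_term):
    """Report ids whose section content and heading contain the given terms."""
    sql = (
        "SELECT DISTINCT reportid FROM collection "
        "WHERE sectioncontent LIKE ? AND heading LIKE ? "
        "ORDER BY reportid"
    )
    cursor = connection.execute(sql, (f"%{content_term}%", f"%{heading_term}%"))
    return [row[0] for row in cursor]


matches = reports_matching(conn, "carcinoma", "tumor")
```

Passing the search terms as parameters keeps user input out of the SQL text. SQLite's `LIKE` ignores ASCII case by default; in other databases, case sensitivity of `LIKE` depends on the collation in use.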
  32. Interfaces
      • i2b2
      • Tableau
  33. Guide For Fields
      • College of American Pathologists (CAP)
          – Detailed protocols
      • Specimen (Note A)
          ___ Partial breast
          ___ Total breast (including nipple and skin)
          ___ Other (specify): ____________________________
          ___ Not specified
      • Procedure (Note A)
          ___ Excision without wire-guided localization
          ___ Excision with wire-guided localization
          ___ Total mastectomy (including nipple and skin)
          ___ Other (specify): ____________________________
          ___ Not specified
      • Lymph Node Sampling (select all that apply) (Note B)
          ___ No lymph nodes present
          ___ Sentinel lymph node(s)
          ___ Axillary dissection (partial or complete dissection)
          ___ Lymph nodes present within the breast specimen (ie, intramammary lymph nodes)
          ___ Other lymph nodes (eg, supraclavicular or location not identified)
              Specify location, if provided: _________________________
      • Specimen Integrity
          ___ Single intact specimen (margins can be evaluated)
          ___ Multiple designated specimens (eg, main excisions and identified margins)
          ___ Fragmented (margins cannot be evaluated with certainty)
          ___ Other (specify): __________________________________
      • Specimen Size (for excisions less than total mastectomy)
  34. Implications
      • Multiple specific extraction and distillation techniques
      • Section, sub-section segmentation
      • Term spotting
      • Associations
      • Negation (Absence) and Assertion (Presence)
      • Dimensions
      • Expressions
      • Full NL Parse where required
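The negation versus assertion distinction can be illustrated with a deliberately crude cue-word sketch. The cue list and the `assertion` function are assumptions for the example, not the project's method; a sentence-level check like this would misfire when a negation cue refers to a different term in the same sentence.

```python
import re

# Hypothetical cue words signalling absence anywhere in the same sentence.
NEGATION_CUES = re.compile(r"\b(?:no|not|without|negative for|absent)\b", re.IGNORECASE)


def assertion(term, sentence):
    """Return 'absent', 'present', or None when the term is not mentioned."""
    if term.lower() not in sentence.lower():
        return None
    return "absent" if NEGATION_CUES.search(sentence) else "present"


fibrosis = assertion("fibrosis", "EXTENSIVE FIBROSIS IS NOT PRESENT")
sarcoma = assertion("sarcoma", "DEEP TRICEPS MARGIN, EXCISION: POSITIVE FOR SARCOMA")
unmentioned = assertion("sarcoma", "SIZE: 1.5 CM")
```

A more careful version would restrict each cue to a window of tokens around the term, which is where the associations and full parse listed above become useful.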
  35. Current Status
      • System first version
          – Creation of database for data warehouse
              • QUEST “compliant”
          – Metathesaurus integration
          – Retrieval
              • SQL and UI
              • Tableau
              • i2b2
          – Star schema
  36. Presentation Content Continued
      • Direction
          – Demonstrate value to researchers
          – CTSA Investigators
      • Lessons learned
          – Open source frameworks very useful!
              • Reuse external solutions, resources
              • Our solutions can be adopted
          – Approach appears scalable
          – Domain expert engagement essential
  37. Content
      • What went well
          – UIMA and MedKAT choice
          – UMLS integration
      • What would you do differently
          – Project is in early stage
          – Technical and framework choices seem right
          – Will learn more as we engage domain experts
      • What will provide value to investigators?
  38. Summary
      • Comprehensive approach to detailed information extraction from Pathology reports
      • Exploiting open source and programmable frameworks (UIMA)
      • Integration of UMLS
      • Contribution of pipeline
      • Engagement of domain experts
  39. Presenter(s) Contact Information
      • Contact information
          – Naveen Ashish
          – ashish@ics.uci.edu
          – http://www.ics.uci.edu/~ashish
