MLconf NYC Chang Wang

  1. Medical Relation Extraction with Manifold Models
     Chang Wang, IBM T. J. Watson Research Center
     © 2014 IBM Corporation
  2. Adapt IBM Watson to Different Domains
     • Healthcare: diagnostic/treatment assistance, evidence-based insights, collaborative medicine
     • Financial Services: investment and retirement planning, institutional trading and decision support
     • Contact Center: call center and tech support services, enterprise knowledge management, consumer insight
     • Government: public safety, improved information sharing, security, fraud and abuse prevention
  3. Main Topic of This Talk
     This talk is about how we built a semantic relation extraction system for the medical domain.
     An example of a semantic relation:
     “What is the most common manifestation of MEN-1 (Multiple Endocrine Neoplasia type 1)?”
     The question asks for a Symptom_of relation.
  4. Motivation: How Relation Extraction is Used in Question Answering
     1. Candidate Answer Generation:
        a. Detect relations in the question;
        b. Use the relation for knowledge base lookup (with the UMLS KB, DBpedia, FreeBase, etc.).
     2. Passage Scoring:
        Question: “What is the most common manifestation of MEN-1 (Multiple Endocrine Neoplasia type 1)?”
        (question focus: “manifestation”; Symptom_of relation)
        Passage: “Hyperparathyroidism is the most common sign of MEN-1.”
        (candidate answer: “Hyperparathyroidism”; Symptom_of relation)
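The candidate answer generation step above can be sketched in a few lines. This is an illustrative toy, not the Watson pipeline: the relation names, the dictionary-shaped KB, and its contents are my own assumptions.

```python
# Hypothetical sketch of relation-based candidate answer generation.
# A relation KB as a mapping: (relation_name, argument) -> set of related terms.
# The entries below are illustrative examples, not real KB content.
RELATION_KB = {
    ("symptom_of", "MEN-1"): {"hyperparathyroidism", "pituitary adenoma"},
    ("treats", "hyperparathyroidism"): {"parathyroidectomy"},
}

def generate_candidates(detected_relations):
    """Given (relation, argument) pairs detected in a question,
    look each one up in the KB and pool the results as candidate answers."""
    candidates = set()
    for relation, arg in detected_relations:
        candidates |= RELATION_KB.get((relation, arg), set())
    return candidates
```

Detecting `("symptom_of", "MEN-1")` in the MEN-1 question would then surface "hyperparathyroidism" as a candidate for downstream scoring.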
  5. Motivation: How Relation Extraction is Used in Question Answering
     3. Knowledge Base (KB) Construction:
        a. Most existing KBs are manually built or extracted from semi-structured sources, and thus have low coverage;
        b. Medical knowledge is growing and changing extremely quickly.
     Our medical corpus contains 80M sentences (11 GB of plain text) from Wikipedia, books, PubMed, etc.
  6. Identify the Key Medical Relations
     From an analysis of 5,000 Doctor's Dilemma questions from the American College of Physicians, and from the literature (Demner-Fushman and Lin, 2007), we decided to focus on 7 key relations. These relations cover more than 50% of those 5,000 clinical questions.
  7. Collect Training Data: Distant Supervision + Human Labeling
     This resulted in roughly 800 positive and 13,000 negative labeled examples for each relation, plus a huge number of unlabeled examples.
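The distant supervision part of this collection step can be sketched as follows. The idea is standard: a sentence mentioning a term pair that a seed KB lists for a relation is taken as a (noisy) positive example, and other pairs become negatives. The seed tuple and the function shape below are illustrative assumptions, not the actual labeling setup.

```python
# Hypothetical seed KB of known (relation, arg_1, arg_2) facts.
SEED_KB = {("symptom_of", "hyperparathyroidism", "MEN-1")}

def distant_label(sentence_terms, relation):
    """Label every ordered term pair in a sentence against the seed KB:
    1 if the KB lists the pair for this relation (noisy positive), else 0."""
    examples = []
    for a in sentence_terms:
        for b in sentence_terms:
            if a == b:
                continue
            label = 1 if (relation, a, b) in SEED_KB else 0
            examples.append((a, b, label))
    return examples
```

These distant labels are noisy, which is exactly why the method later attaches a confidence weight to every label rather than trusting them all equally.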
  8. Technical Challenges & Design Goals
     1. Real-data challenge:
        – In a typical relation extraction task, entities are manually labeled.
          • For example, in the i2b2 relation extraction task, entities are given, and each has one of 3 concepts: “treatment”, “problem”, and “test”.
        – In real applications, entities have to be automatically detected.
          • In our application, entities are associated with multiple concepts from a list of 2.7M concepts (which can be further grouped into ~130 types).
     2. Relation detectors need to be fast:
        – We need to consider all term pairs for each sentence in our corpus (80M sentences).
        – Design choice: use linear classifiers.
     3. Relation detectors need to be accurate:
        – The number of training examples is not sufficient.
        – Labels from crowdsourcing and distant supervision are not 100% reliable.
        – Design choice: utilize unlabeled data and take label confidence into consideration.
  9. Method (1): Parsing
     The most popular tool for parsing medical text is MetaMap (Aronson, 2001).
     We used Medical ESG instead:
     – An adaptation of ESG (English Slot Grammar; McCord, Murdock, and Boguraev, 2012) to the medical domain;
     – Produces results similar to MetaMap's;
     – Runs about 10 times faster.
  10. Method (2): Feature Extraction
  11. Method (3): Cost Function to Minimize
      Construct a linear mapping f: x_i → f(x_i) that minimizes

        C(f) = Σ_i α_i (f(x_i) − y_i)² + μ Σ_{i,j} W_{i,j} (f(x_i) − f(x_j))²

      where α_i is the label weight, x_i is the i-th example, μ is a weight scalar, f is the mapping function, y_i is the label of x_i, and W_{i,j} is the similarity of x_i and x_j.
      [Illustration: positive, negative, and unlabeled examples connected by similarity edges]
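Assuming the linear form f(x) = wᵀx (consistent with the “linear mapping” on this slide), the cost can be written in matrix form and minimized in closed form. The notation A = diag(α_i) and the graph Laplacian L = D − W are my own, not the slide's:

```latex
% Stack the examples as rows of X and let A = \mathrm{diag}(\alpha_i).
% With the graph Laplacian L = D - W, where D_{ii} = \sum_j W_{i,j}, the
% similarity term satisfies \sum_{i,j} W_{i,j}\,(f(x_i)-f(x_j))^2 = 2\, f^\top L f
% (the constant factor 2 can be absorbed into \mu). Then, with f(x) = w^\top x,
C(w) = (Xw - y)^\top A\, (Xw - y) + \mu\, w^\top X^\top L X\, w,
% and setting \nabla_w C = 0 gives the closed-form minimizer
w = \left(X^\top A X + \mu\, X^\top L X\right)^{-1} X^\top A\, y.
```

This is the standard manifold (Laplacian) regularization construction; the label weights α_i simply downweight unreliable labels, and rows with α_i = 0 act as unlabeled data that only influence w through the similarity term.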
  12. Method (4): Algorithm
      [Slide shows the algorithm and its notation]
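The closed-form solution the talk refers to can be sketched as Laplacian-regularized least squares. The function name, the small ridge term for numerical stability, and the toy data in the usage note are my own assumptions, not the talk's actual implementation.

```python
import numpy as np

def manifold_fit(X, y, alpha, W, mu, ridge=1e-8):
    """Solve w = (X^T A X + mu * X^T L X + ridge*I)^{-1} X^T A y, where
    A = diag(alpha) weights the labels (alpha_i = 0 for unlabeled rows)
    and L = D - W is the graph Laplacian of the similarity matrix W."""
    A = np.diag(alpha)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    d = X.shape[1]
    M = X.T @ A @ X + mu * (X.T @ L @ X) + ridge * np.eye(d)
    return np.linalg.solve(M, X.T @ A @ y)
```

For example, with two labeled points and one unlabeled point tied by similarity to the positive one, the unlabeled point's prediction is pulled toward the positive side, even though its label weight is zero. Apply time is just a dot product wᵀx, which is why the detector is as fast as linear regression.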
  13. Method (5): Advantages
      – A closed-form solution;
      – As fast as a linear regression classifier at apply time;
      – Associates labels with weights (useful for crowdsourcing and distant supervision);
      – Makes use of unlabeled data.
  14. Experiment (1): 5-Fold Cross Validation
      – On average, each relation has 800 positive and 13,000 negative examples;
      – For the manifold models, 2,500–5,000 extra unlabeled examples are used;
      – We report the average F1 score across all 5 folds;
      – We compare against SVM with a tree kernel (Collins and Duffy, 2001), SVM with a linear kernel (Schölkopf and Smola, 2002), linear regression, and SemRep (Rindflesch and Fiszman, 2003).
  15. Experiment (2): Knowledge Base Construction
      – Applied the relation detectors to our medical corpus of 80M sentences (11 GB of text);
      – This resulted in 3.4M entries in the format (relation_name, arg_1, arg_2, confidence);
      – The whole process took 8 hours on 16 four-core machines.
      Evaluation: a candidate answer generation experiment comparing the new KB against the UMLS relation KB (the most popular medical KB):
      – 742 Doctor's Dilemma questions from the American College of Physicians;
      – Detect relations in the question;
      – Generate candidate answers using relation-based KB lookup;
      – For each question, generate up to k answers, with k = 20, 50, 3000.
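The KB construction step above amounts to scoring every detected term pair and keeping those that clear a confidence threshold. A minimal sketch, where the threshold value and the scored-pair input format are illustrative assumptions:

```python
# Hypothetical sketch of turning detector scores into KB entries of the
# form (relation_name, arg_1, arg_2, confidence). The 0.5 threshold is
# an illustrative choice, not the value used in the actual system.
def build_kb_entries(scored_pairs, threshold=0.5):
    """Keep only term pairs whose detector confidence clears the threshold."""
    return [
        (rel, a, b, round(score, 3))
        for rel, a, b, score in scored_pairs
        if score >= threshold
    ]
```

Storing the confidence alongside each triple lets downstream question answering trade precision against recall at lookup time instead of at construction time.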
  16. Conclusions
      From the perspective of relation extraction applications, we:
      – Identified 7 key relations that can facilitate clinical decision making;
      – Built a system that can directly extract relations from medical text.
      From the perspective of relation extraction methodologies, we built a manifold-model-based relation extraction system that:
      • Has a closed-form solution;
      • Is fast;
      • Utilizes unlabeled data;
      • Takes label weights into consideration;
      • Also works for other domains.
      For more detail, see “Relation Extraction with Manifold Models”, ACL 2014.
  17. References
      [1] A. Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.
      [2] M. Collins and N. Duffy. 2001. Convolution kernels for natural language.
      [3] D. Demner-Fushman and J. Lin. 2007. Answering clinical questions with knowledge-based and statistical techniques.
      [4] D. Lindberg, B. Humphreys, and A. McCray. 1993. The Unified Medical Language System.
      [5] M. McCord, J. W. Murdock, and B. K. Boguraev. 2012. Deep parsing in Watson.
      [6] Ö. Uzuner, B. R. South, S. Shen, and S. L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text.
      [7] B. Schölkopf and A. J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
