MLconf NYC Chang Wang
- 1. © 2014 IBM Corporation
Medical Relation Extraction with
Manifold Models
Chang Wang, IBM T. J. Watson Research Center
- 2.
Adapt IBM Watson to Different Domains
Healthcare: diagnostic/treatment assistance, evidence-based insights, collaborative medicine
Financial Services: investment and retirement planning, institutional trading and decision support
Contact Center: call center and tech support services, enterprise knowledge management, consumer insight
Government: public safety, improved information sharing, security, fraud and abuse prevention
- 3.
Main Topic of This Talk
This talk is about how we built a semantic relation extraction system for the
medical domain.
A semantic relation example:
What is the most common manifestation of MEN-1 (Multiple Endocrine Neoplasia type 1)?
(Symptom_of relation)
- 4.
Motivation: How Relation Extraction is Used in Question Answering
– 1, Candidate Answer Generation:
• a, Detect relations in the question;
• b, Use the relation for knowledge base lookup (with the UMLS KB, DBpedia, Freebase, etc.);
– 2, Passage Scoring: match the relation detected in the question against the relation detected in a supporting passage.
• Question: What is the most common manifestation of MEN-1 (Multiple Endocrine Neoplasia type 1)? (question focus; Symptom_of relation)
• Passage: Hyperparathyroidism is the most common sign of MEN-1. (candidate answer: Hyperparathyroidism; Symptom_of relation)
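A toy version of relation-based passage scoring, purely illustrative (the real scorer uses much richer features than a tuple match):

```python
# Toy passage scorer: a passage gets credit when it contains the same relation,
# with the same known argument, as the question. Purely illustrative.

def score_passage(question_rel, passage_rels):
    """question_rel and each passage relation are (relation_name, arg) tuples;
    return 1.0 on an exact match, 0.5 on a relation-name-only match, else 0.0."""
    best = 0.0
    for rel in passage_rels:
        if rel == question_rel:
            best = max(best, 1.0)
        elif rel[0] == question_rel[0]:
            best = max(best, 0.5)
    return best

q = ("symptom_of", "MEN-1")
print(score_passage(q, [("symptom_of", "MEN-1")]))  # 1.0
```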
- 5.
Motivation: How Relation Extraction is Used in Question Answering
– 3, Knowledge Base (KB) Construction:
• a, Most existing KBs are manually built or extracted from semi-structured sources, and thus have
low coverage;
• b, Medical knowledge is growing and changing extremely quickly;
Our medical corpus contains 80M sentences
(11 GB of plain text) coming from Wikipedia,
books, PubMed, etc.
- 6.
Identify the Key Medical Relations
From an analysis of 5,000 Doctor's Dilemma questions from the American College of
Physicians and a reading of the literature (Demner-Fushman and Lin, 2007), we decided
to focus on 7 key relations.
These relations cover >50% of those 5,000 clinical questions.
- 7.
Collect Training Data: Distant Supervision + Human Labeling
This resulted in ~800 positive and ~13,000 negative labeled examples for each
relation, plus a huge amount of unlabeled examples.
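The distant-supervision step can be sketched as follows; the mini KB and sentences here are illustrative stand-ins, not the actual data used in the system:

```python
# Distant-supervision sketch: a sentence mentioning both arguments of a known
# KB pair is taken as a (noisy) positive example for that relation; other
# sentences become unlabeled candidates. KB entries and sentences are made up.

KB = {("hyperparathyroidism", "MEN-1"): "symptom_of"}

corpus = [
    "Hyperparathyroidism is the most common sign of MEN-1.",
    "MEN-1 is inherited in an autosomal dominant pattern.",
]

def distant_label(sentence, term_pair):
    """Return the KB relation name if both terms appear in the sentence."""
    s = sentence.lower()
    if all(t.lower() in s for t in term_pair):
        return KB.get(term_pair)   # noisy positive: KB relation name
    return None                    # otherwise: unlabeled candidate

labels = [distant_label(s, ("hyperparathyroidism", "MEN-1")) for s in corpus]
print(labels)  # ['symptom_of', None]
```

Because such labels are noisy, the model below attaches a confidence weight to each label rather than trusting them all equally.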
- 8.
Technical Challenges & Design Goals:
1, Real data challenge:
– In typical relation extraction tasks, entities are manually labeled.
• For example, in the i2b2 relation extraction task, entities are given, each assigned one of 3
concepts: “treatment”, “problem”, and “test”.
– In real applications, entities have to be automatically detected.
• In our application, entities are associated with multiple concepts from a list of 2.7M
concepts (can be further grouped into ~130 types).
2, Relation detectors need to be fast:
– Need to consider all the term pairs for each sentence in our corpus (80M sentences).
– Use linear classifiers.
3, Relation detectors need to be accurate:
– The number of training examples is not sufficient.
– Labels from crowdsourcing and distant supervision are not 100% reliable.
– Utilize unlabeled data.
– Take label confidence into consideration.
- 9.
Method (1): Parsing
The most popular tool for parsing medical text is MetaMap (Aronson, 2001).
We used Medical ESG instead:
– An adaptation of ESG [English Slot Grammar] (McCord, Murdock, and Boguraev, 2012) to the
medical domain;
– Similar results to MetaMap;
– 10 times faster.
- 10.
Method (2): Feature Extraction
- 11.
Method (3): Cost Function to Minimize
Construct a linear mapping f: x_i → f(x_i) that minimizes C(f):
C(f) = Σ_i α_i (f(x_i) − y_i)² + µ Σ_{i,j} W_{i,j} (f(x_i) − f(x_j))²
α_i: label weight, x_i: the ith example, µ: weight scalar,
f: mapping function, y_i: label of x_i, W_{i,j}: similarity of x_i and x_j.
Illustration: positive, negative, and unlabeled examples connected by similarity edges; the second term pulls similar examples toward similar scores.
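Assuming the standard manifold-regularization form of C(f) with a linear map f(x) = x·β (a reconstruction from the notation above, not necessarily the slide's exact formula), the cost can be computed as:

```python
import numpy as np

def manifold_cost(beta, X, y, alpha, W, mu):
    """C(f) for a linear map f(x) = x @ beta:
    sum_i alpha_i * (f(x_i) - y_i)^2              (weighted fit to labels)
    + mu * sum_{i,j} W_ij * (f(x_i) - f(x_j))^2   (smoothness on the manifold)
    Unlabeled examples get alpha_i = 0, so only the smoothness term sees them.
    """
    f = X @ beta
    fit = np.sum(alpha * (f - y) ** 2)
    smooth = mu * np.sum(W * (f[:, None] - f[None, :]) ** 2)
    return fit + smooth
```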
- 12.
Method (4): Algorithm
[The algorithm pseudocode and its notation table appear as a figure on this slide.]
- 13.
Method (5): Advantages
A closed-form solution;
As fast as a linear regression classifier at apply time;
Associates labels with weights;
– Useful for crowdsourcing and distant supervision;
Makes use of unlabeled data.
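The closed-form solution can be sketched as follows, assuming a standard manifold-regularized least-squares cost with graph Laplacian L = D − W (a reconstruction under those assumptions, not the paper's verbatim algorithm):

```python
import numpy as np

def fit_manifold_model(X, y, alpha, W, mu):
    """Closed-form minimizer of
    sum_i alpha_i (x_i @ beta - y_i)^2 + mu * sum_{i,j} W_ij (f_i - f_j)^2.
    Using sum_{i,j} W_ij (f_i - f_j)^2 = 2 f^T L f with L = D - W, setting the
    gradient to zero gives:
    beta = (X^T A X + 2*mu * X^T L X)^{-1} X^T A y, where A = diag(alpha).
    """
    A = np.diag(alpha)
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian
    G = X.T @ A @ X + 2 * mu * X.T @ L @ X
    return np.linalg.solve(G, X.T @ A @ y)
```

With W = 0 this reduces to weighted least squares, which matches the claim that applying the model is as cheap as linear regression: prediction is a single dot product per example.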
- 14.
Experiment (1): 5-Fold Cross Validation
On average, each relation has 800 positive examples and 13,000 negative examples;
For manifold models, 2,500-5,000 extra unlabeled examples are used.
We report the average F1 score over all 5 folds;
Compared against SVM + tree kernel (Collins and Duffy, 2001), SVM + linear kernel (Schölkopf and
Smola, 2002), linear regression, and SemRep (Rindflesch and Fiszman, 2003).
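The evaluation loop can be sketched as a plain 5-fold split with an averaged F1; the classifier is passed in as a callable, and all names here are illustrative:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from true/predicted 0-1 label arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def five_fold_f1(X, y, train_and_predict):
    """Average F1 over 5 folds; train_and_predict(X_train, y_train, X_test)
    returns predicted labels for X_test."""
    folds = np.array_split(np.arange(len(y)), 5)
    scores = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        y_pred = train_and_predict(X[train], y[train], X[test])
        scores.append(f1_score(y[test], y_pred))
    return float(np.mean(scores))
```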
- 15.
Experiment (2): Knowledge Base Construction
Applied the relation detectors to our medical corpus with 80M sentences (11G text);
Resulted in 3.4M entries in the format of (relation_name, arg_1, arg_2, confidence);
The whole process took 8 hours on 16 four-core machines;
Evaluation
– A candidate answer generation experiment comparing the new KB against the UMLS relation
KB (the most popular medical KB);
– 742 Doctor's Dilemma questions from the American College of Physicians;
– Detect relations in the question;
– Generate candidate answers using the relation based KB lookup;
– For each question, generate up to k answers: k=20, 50, 3000;
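Relation-based KB lookup can be sketched as follows; only the (relation_name, arg_1, arg_2, confidence) entry format comes from the slide, and the entries themselves are illustrative:

```python
# Candidate answer generation by relation-based KB lookup, a sketch.
# Entries follow the (relation_name, arg_1, arg_2, confidence) format;
# the contents are made up for illustration.

kb = [
    ("symptom_of", "hyperparathyroidism", "MEN-1", 0.92),
    ("symptom_of", "pituitary adenoma", "MEN-1", 0.71),
    ("treats", "cisplatin", "ovarian cancer", 0.88),
]

def candidate_answers(relation, known_arg, k):
    """Given a relation detected in the question and its known argument,
    return up to k candidate answers, highest confidence first."""
    hits = [(a1, c) for r, a1, a2, c in kb
            if r == relation and a2 == known_arg]
    hits.sort(key=lambda x: -x[1])
    return [a for a, _ in hits[:k]]

print(candidate_answers("symptom_of", "MEN-1", 20))
# ['hyperparathyroidism', 'pituitary adenoma']
```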
- 16.
Conclusions:
From the perspective of relation extraction applications,
– Identified 7 key relations that can facilitate clinical decision making
– Built a system that can directly extract relations from medical text
From the perspective of relation extraction methodologies,
– A manifold model based relation extraction system
• Closed-form solution
• Fast
• Utilizes unlabeled data
• Takes label weight into consideration
• Also works for other domains
For more detail, see “Relation Extraction with Manifold Models”, ACL-2014.
- 17.
References:
[1] A. Aronson. 2001. Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap
program.
[2] M. Collins and N. Duffy. 2001. Convolution kernels for natural language.
[3] D. Demner-Fushman and J. Lin. 2007. Answering clinical questions with knowledge-based and
statistical techniques.
[4] D. Lindberg, B. Humphreys, and A. McCray. 1993. The Unified Medical Language System.
[5] M. McCord, J. W. Murdock, and B. K. Boguraev. 2012. Deep parsing in Watson.
[6] Ö. Uzuner, B. R. South, S. Shen, and S. L. DuVall. 2011. 2010 i2b2/VA challenge on concepts,
assertions, and relations in clinical text.
[7] B. Schölkopf and A. J. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization,
Optimization, and Beyond.