Knowledge-driven Implicit Information Extraction

Sujan Perera's Dissertation Defense: Friday, August 12, 2016
Ph.D. Committee: Drs. Amit Sheth, Advisor; T.K. Prasad, Michael Raymer, and Pablo Mendes (IBM Research)
Video: https://youtu.be/pbjJ1zb8ayY

ABSTRACT:

Natural language is a powerful tool developed by humans over hundreds of thousands of years. The extensive usage, flexibility of the language, creativity of the human beings, and social, cultural, and economic changes that have taken place in daily life have added new constructs, styles, and features to the language. One such feature of the language is its ability to express ideas, opinions, and facts in an implicit manner. This is a feature that is used extensively in day to day communications in situations such as: 1) expressing sarcasm, 2) when trying to recall forgotten things, 3) when required to convey descriptive information, 4) when emphasizing the features of an entity, and 5) when communicating a common understanding.

Consider the tweet 'New Sandra Bullock astronaut lost in space movie looks absolutely terrifying' and the text snippet extracted from a clinical narrative 'He is suffering from nausea and severe headaches. Dolasetron was prescribed.' The tweet has an implicit mention of the entity Gravity, and the clinical text snippet has an implicit mention of the relationship between the medication Dolasetron and the clinical condition nausea. Such implicit references to entities and relationships are common in daily communication and add unique value to conversations. However, extracting implicit constructs has not received enough attention. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from tweets.

This dissertation demonstrates manifestations of implicit constructs in text, studies their characteristics, and develops a solution that is capable of extracting implicit factual information from text. The developed solution starts by acquiring relevant knowledge to solve the implicit information extraction problem. The relevant knowledge includes domain knowledge, contextual knowledge, and linguistic knowledge. The acquired knowledge can take different syntactic forms, such as text snippets, structured knowledge represented in standard knowledge representation languages like the Resource Description Framework (RDF), or custom formats. Hence, the acquired knowledge is processed to create models that can be understood by machines. Such models provide the infrastructure to perform the implicit information extraction of interest.

This dissertation focuses on three different use cases of implicit information and demonstrates the applicability of the developed solution in these use cases. They are:
- implicit entity linking in clinical narratives,
- implicit entity linking in Twitter,
- implicit relationship extraction from clinical narratives.

Knowledge-driven Implicit Information Extraction

  1. 1. 1 Knowledge-driven Implicit Information Extraction Sujan Perera Dissertation Committee: Drs. Amit P. Sheth (advisor), Krishnaprasad Thirunarayan, Michael Raymer, Pablo N. Mendes (IBM Research) Ph.D. Dissertation Defense
  2. 2. 2 Information Extraction • More than 70% of data in organizations exists in unstructured form1 • Extraction of structured information from unstructured data is a fundamental task “All home medications although his insulin dose (nph 20 qPM) was halved (--> NPH 10 qPM) on the floor, and his sugars were running in the 150s-250s range.” [figure: knowledge-base fragment linking Insulin, Cisapride, Diabetes Mellitus, Hyperglycemia, Proinsulin, Porcine Insulin, and Insulin Glulisine with is-a relations] 1 https://en.wikipedia.org/wiki/Unstructured_data
  3. 3. 3 Information Extraction • Almost exclusively focused on explicit information “Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.”
  4. 4. 4 Information Extraction • Almost exclusively focused on explicit information Named Entity Recognition Relationship Extraction Entity Linking “Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.” Person Person C0018795 C0015672 C0008031
  5. 5. 5 Information Extraction • Misses the implicit information “Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.” Person Person C0018795 C0015672 C0008031 No shortness of breath edema Named Entity Recognition Relationship Extraction Entity Linking Implicit information extraction
  6. 6. 6 Thesis Statement Implicit factual information in unstructured text can be efficiently extracted by bridging syntactic and semantic gaps in natural language usage and augmenting information extraction techniques with relevant domain knowledge.
  7. 7. 7 • Express sarcasm/sentiment • “I'm striving to be positive in what I say on Twitter. So I'll refrain from making a comment about the latest Michael Bay movie.” • Provide descriptive information • “small fluid adjacent to the gallbladder with gallstones which may represent inflammation” • Emphasize features of the entity • “Mason Evans 12 year long shoot won big in golden globe” • Communicate the common understanding • “He is suffering from nausea and severe headaches. Dolasetron was prescribed.” • Stylistic Preferences • “Democratic candidate Bernie Sanders … The Vermont senator …” Credit: http://bit.ly/2b9Bnjk
  8. 8. 8 Significance • Volume • 20% movie references and 40% book references in tweets • 35% edema and 40% shortness of breath references in clinical narratives • Value Explicit Information Computer Assisted Coding 30-day Readmission Prediction Sentiment Analysis Structured Information
  9. 9. 9 Significance • Volume • 20% movie references and 40% book references in tweets • 35% edema and 40% shortness of breath references in clinical narratives • Value Ignoring implicit information in text would adversely affect downstream applications Explicit Information Implicit Information Computer Assisted Coding 30-day Readmission Prediction Sentiment Analysis Structured Information
  10. 10. 10 Role of Knowledge New Sandra Bullock astronaut lost in space movie looks absolutely terrifying The patient showed accumulation of fluid in his extremities, but respirations were unlabored and there were no use of accessory muscles. Edema Accumulation of an excessive amount of watery fluid in cells or intercellular tissues Shortness of breath Labored or difficult breathing associated with a variety of disorders Sandra Bullock Gravity Knowledge Bases Image credits: http://bit.ly/2b5HPDQ and Icon made by Freepik from www.flaticon.com Credit: http://bit.ly/2bi34FG Credit: http://bit.ly/1x3sack Credit: http://bit.ly/2b9CejW Credit: http://bit.ly/2aXM97v
  11. 11. 11 Knowledge Acquisition Knowledge Modeling Detecting Implicit Information Information Extraction Implicit Information Extraction
  12. 12. 12 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  13. 13. 13 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  14. 14. 14 Sentence Entity “small fluid adjacent to the gallbladder with gallstones which may represent inflammation.” Cholecystitis “His tip of the appendix is inflamed.” Appendicitis “The respirations were unlabored and there were no use of accessory muscles.” Shortness of breath (NEG) Implicit Entities in Clinical Documents • One should know the physiological observations that characterize a particular entity • Negations are embedded in the phrases indicating entities • “Patient denies shortness of breath” • “The respirations were unlabored”
  15. 15. 15 Knowledge Acquisition • Unified Medical Language System – integrates many health and biomedical vocabularies • Linguistic Knowledge – WordNet • Synonyms/antonyms • Syntactic variations of the same term [figure: UMLS table columns — CUI, AUI, STR; CUI, TUI; CUI, STR, DEF, SAB] Definitions for shortness of breath: A disorder characterized by an uncomfortable sensation of difficulty breathing; Difficult or labored breathing; Labored or difficult breathing associated with a variety of disorders, indicating inadequate ventilation or low blood oxygen or a subjective experience of breathing discomfort
  16. 16. 16 Knowledge Modeling • Each entity has multiple definitions • Each definition is processed to create an entity indicator • The representative power of a term (r1) is calculated with a measure inspired by TF-IDF • The collection of entity indicators constitutes the entity model [figure: definition1–3 mapped to Entity Indicator1–3, which together form the Entity Model] Example — definition “A disorder characterized by an uncomfortable sensation of difficulty breathing” → indicator (uncomfortable, r1), (sensation, r2), (difficulty, r3), (breathing, r4); definition “Difficult or labored breathing” → indicator (difficult, r5), (labored, r6), (breathing, r4)
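A minimal Python sketch of this modeling step, assuming the UMLS definitions are already available as plain strings; the IDF-style weight and the helper names (tokenize, build_entity_model) are illustrative assumptions, not the dissertation's exact TF-IDF-inspired measure.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokens; a stand-in for the real preprocessing pipeline."""
    return re.findall(r"[a-z]+", text.lower())

def build_entity_model(definitions_by_entity):
    """Turn each definition into an 'entity indicator': a list of
    (term, representative_power) pairs. The weight below is a simple
    IDF-style score over definitions, standing in for the TF-IDF-inspired
    measure mentioned on the slide."""
    all_defs = [d for defs in definitions_by_entity.values() for d in defs]
    df = defaultdict(int)                  # document frequency per term
    for d in all_defs:
        for term in set(tokenize(d)):
            df[term] += 1

    n_defs = len(all_defs)
    models = {}
    for entity, definitions in definitions_by_entity.items():
        indicators = []
        for d in definitions:
            indicator = [(t, math.log(n_defs / df[t])) for t in set(tokenize(d))]
            indicators.append(indicator)
        models[entity] = indicators        # the collection of indicators is the entity model
    return models

if __name__ == "__main__":
    umls_defs = {  # toy definitions; the real ones come from UMLS
        "shortness of breath": [
            "A disorder characterized by an uncomfortable sensation of difficulty breathing",
            "Difficult or labored breathing",
        ],
    }
    for entity, indicators in build_entity_model(umls_defs).items():
        print(entity, indicators[0][:4])
```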
  17. 17. 17 Detecting Sentences with Implicit Entities • Sentences that contain an entity-representative term but not the entity name may have an implicit mention of the entity. “However, Mr. Smith is comfortably breathing in room air.” Candidate sentence for shortness of breath
  18. 18. 18 Information Extraction – Entity Linking • The similarity between the entity model and the pruned sentence is measured to annotate it with a positive or negative label • We developed a semantic similarity measure that handles synonyms and antonyms [figure: candidate sentence compared against Indicator1–3 of the entity model, yielding sim1–sim3]
  19. 19. 19 Information Extraction – Entity Linking [figure: candidate-sentence terms ct1–ct4 matched against entity-indicator terms et5–et7 via WordNet; if a term pair are antonyms the similarity is -1, otherwise the maximum WordNet similarity is used] The indicator-level score is the weighted similarity $\frac{\sum_{et} sim(et) \cdot rp(et)}{\sum_{et} rp(et)}$; a score above t1 yields a positive annotation and a score below t2 a negative annotation.
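A hedged sketch of how such an indicator-level score could be computed with NLTK's WordNet interface; path_similarity and the threshold values t1/t2 are assumptions standing in for the dissertation's tuned similarity measure and thresholds.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def term_similarity(ct, et):
    """If ct is an antonym of et return -1, otherwise the maximum WordNet
    path similarity over their synsets (0 if no synsets are found)."""
    ct_synsets, et_synsets = wn.synsets(ct), wn.synsets(et)
    antonyms = {a.name() for s in et_synsets for l in s.lemmas() for a in l.antonyms()}
    if ct in antonyms:
        return -1.0
    sims = [s1.path_similarity(s2) or 0.0 for s1 in ct_synsets for s2 in et_synsets]
    return max(sims, default=0.0)

def annotate(candidate_terms, entity_indicator, t1=0.6, t2=0.3):
    """Weighted similarity between a pruned candidate sentence and one entity
    indicator: sum(sim * rp) / sum(rp), then thresholded into positive/negative.
    t1 and t2 are illustrative values, not the tuned thresholds."""
    num, den = 0.0, 0.0
    for et, rp in entity_indicator:
        sim = max(term_similarity(ct, et) for ct in candidate_terms)
        num += sim * rp
        den += rp
    score = num / den if den else 0.0
    if score > t1:
        return "positive", score
    if score < t2:
        return "negative", score
    return "abstain", score

# usage: annotate(["comfortably", "breathing", "room", "air"],
#                 [("difficulty", 1.2), ("breathing", 0.8), ("labored", 1.0)])
```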
  20. 20. 20 Evaluation • Re-annotated the SemEval-2014 task 7 dataset for implicit entities • Entities are selected considering the frequency of appearance and with expert feedback • 857 sentences selected for 8 entities • Annotated by three domain experts • Annotation agreement 0.58
     Entity | Positive Annotations | Negative Annotations
     Shortness of Breath | 93 | 94
     Edema | 115 | 35
     Syncope | 96 | 92
     Cholecystitis | 78 | 36
     Gastrointestinal Gas | 18 | 14
     Colitis | 12 | 11
     Cellulitis | 8 | 2
     Fasciitis | 7 | 3
  21. 21. 21 Annotation Performance
     Algorithm | Positive P / R / F1 | Negative P / R / F1
     Our | 0.66 / 0.87 / 0.75 | 0.73 / 0.73 / 0.73
     MCS | 0.50 / 0.93 / 0.65 | 0.31 / 0.76 / 0.44
     SVM | 0.73 / 0.82 / 0.77 | 0.66 / 0.67 / 0.67
     Adding the similarity value as a feature for the supervised algorithm:
     SVM+MCS | 0.73 / 0.82 / 0.77 | 0.66 / 0.66 / 0.66
     SVM+Our | 0.77 / 0.85 / 0.81 | 0.72 / 0.75 / 0.73
     • Baselines • MCS algorithm (Mihalcea 2006) • SVM (trained on n-grams) • Our algorithm outperforms the selected baselines in the negative category. • SVM is able to leverage the supervision to beat our algorithm in the positive category.
  22. 22. 22 Similarity as a Feature to Supervised Algorithm • Added similarity value of unsupervised algorithms as a feature to the SVM. Positive Annotations Negative Annotations
  23. 23. 23 Annotation Performance – A Study with Confidence • Each annotation has a confidence ranging from 1 to 5 • Low confidence reflects incomplete or ambiguous information • Annotation performance increases as the confidence increases • The negative class shows a significant increase
  24. 24. 24 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  25. 25. 25 • Use diverse characteristics of the entity – “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying” – “ISRO sends probe to Mars for less money than it takes Hollywood to send a woman to space.” – “oh yeah there is that new space movie coming out that looks terrifying i am going to go see it” • Use time-sensitive phrases [timeline: Furious 7, Gravity, The Martian – Fall 2013, April 2014, Fall 2015 – “space movie”, “fastest movie to earn $1 billion”, “Paul Walker's last movie”] Tweets with Implicit Entities Credit: http://bit.ly/2bkePJ6
  26. 26. 26 • Use diverse characteristics of the entity – “… Richard Linklater movie …” – “… Ellar Coltrane on his 12-year movie …” – “… 12-year long movie shoot …” – “… Mason Evan's childhood movie …” • Use time-sensitive phrases [timeline: Furious 7, Gravity, The Martian – Fall 2013, April 2014, Fall 2015 – “space movie”, “fastest movie to earn $1 billion”, “Paul Walker's last movie”] Tweets with Implicit Entities Credit: http://bit.ly/2bk8xdp
  27. 27. 27 Knowledge Acquisition • Acquiring factual knowledge • Source – DBpedia • Not all factual knowledge is important – a movie has ‘starring’ and ‘director’ as well as ‘billed’ and ‘license’ • Rank the relationships based on joint probability with the entity type • Values of the top-k relationships and the value of rdfs:comment are obtained • Acquiring contextual knowledge • Source – contemporary tweets • We collect 1000 tweets with explicit mentions of the entity • Number of views of the entity’s Wikipedia page within the last t days
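The DBpedia acquisition step could look roughly like the SPARQLWrapper sketch below; the hand-picked properties (dbo:starring, dbo:director) merely stand in for the top-k relationships ranked by joint probability with the entity type, which is not shown here.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def factual_knowledge(entity_uri):
    """Fetch the values of a few properties plus rdfs:comment for one entity
    from DBpedia. The property list is an illustrative stand-in for the
    top-k relationships selected by the ranking step."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?p ?o WHERE {{
          VALUES ?p {{ dbo:starring dbo:director rdfs:comment }}
          <{entity_uri}> ?p ?o .
          FILTER (!isLiteral(?o) || lang(?o) = "" || lang(?o) = "en")
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["p"]["value"], r["o"]["value"]) for r in rows]

if __name__ == "__main__":
    # Gravity (2013) is the implicit entity behind the Sandra Bullock tweet
    for p, o in factual_knowledge("http://dbpedia.org/resource/Gravity_(2013_film)"):
        print(p, "->", o)
```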
  28. 28. 28 Knowledge Acquisition Wikipedia page titles and anchor texts Contemporary tweets Generate semantic cues Factual knowledge Clean tweets Generate n-grams • Need to extract meaningful phrases from acquired knowledge • Meaningful phrases = Wikipedia titles + anchor texts • Matching n-grams are added to semantic cues • Non-matching n-grams are added to semantic cues after removing stop words
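A rough illustration of the cue-generation idea on the previous slide, assuming the Wikipedia page titles and anchor texts are already loaded into a phrase set; the greedy longest-match strategy and max_n=3 are illustrative choices rather than the dissertation's exact procedure.

```python
import re

def ngrams(tokens, max_n=3):
    """All contiguous n-grams up to length max_n, longest first."""
    return [" ".join(tokens[i:i + n])
            for n in range(max_n, 0, -1)
            for i in range(len(tokens) - n + 1)]

def semantic_cues(text, wiki_phrases, stop_words):
    """Keep n-grams that match Wikipedia page titles / anchor texts as phrases;
    for the remaining tokens, keep their non-stop-word unigrams."""
    tokens = re.findall(r"\w+", text.lower())
    cues, used = set(), set()
    for gram in ngrams(tokens):
        gram_tokens = gram.split()
        if gram in wiki_phrases and not used.intersection(gram_tokens):
            cues.add(gram)
            used.update(gram_tokens)
    cues.update(t for t in tokens if t not in used and t not in stop_words)
    return cues

# usage with toy data:
# semantic_cues("new sandra bullock astronaut lost in space movie",
#               wiki_phrases={"sandra bullock", "lost in space"},
#               stop_words={"new", "in"})
```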
  29. 29. 29 Knowledge Modeling – Entity Model Network • A property graph reflecting the topical relationships between entities [figure: entity nodes Gravity, Interstellar, The Martian linked to factual and contextual cues such as Sandra Bullock, Alfonso Cuarón, Christopher Nolan, Matt Damon, Mars orbiter mission, woman in space, astronaut] • $specificity(c_j) = \frac{|N|}{|N_{c_j}|}$, where $N$ is the total number of entities and $N_{c_j}$ the number of entities adjacent to cue $c_j$ • $frequency$ = total number of times the phrase appears in the tweets • $temporal\ salience$ = number of Wikipedia page views
  30. 30. 30 Detecting Tweets with Implicit Entities • Tweets are filtered with keywords – movie, film, book, novel • Applied a simple annotation technique – dictionary matching • Tweets that are not annotated with an entity of the types we are looking for are considered to have implicit entity mentions [figure: filtering pipeline – keywords, entity dictionary, tweet annotation]
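A toy sketch of this detection step under the stated keyword filter and dictionary matching; simple substring matching stands in for the actual annotator.

```python
def has_implicit_candidate(tweet, keywords, entity_dictionary):
    """A tweet passes the keyword filter but gets no dictionary match for an
    entity of the target type, so it is treated as a candidate for an
    implicit entity mention."""
    text = tweet.lower()
    if not any(k in text for k in keywords):
        return False                      # not about movies/books at all
    explicit = any(name.lower() in text for name in entity_dictionary)
    return not explicit                   # no explicit mention -> implicit candidate

tweets = [
    "New Sandra Bullock astronaut lost in space movie looks absolutely terrifying",
    "Gravity is the best movie I have seen this year",
]
keywords = {"movie", "film", "book", "novel"}
movie_titles = {"Gravity", "Interstellar", "The Martian"}
candidates = [t for t in tweets if has_implicit_candidate(t, keywords, movie_titles)]
print(candidates)   # only the first tweet remains
```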
  31. 31. 31 Information Extraction – Entity Linking • Two-step process • Step 1: Candidate selection and filtering • Objective – prune the search space to reduce the number of EMN entities to be considered in the disambiguation step • Step 2: Disambiguation • Objective – sort the selected candidate entities to place the implicitly mentioned entity in the top position
  32. 32. 32 Entity Linking - Candidate selection and filtering [figure: entity model network with entity nodes m1–m8 and cue nodes c1–c9, partitioned into factual and contextual knowledge] “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
  33. 33. 33 Entity Linking - Candidate selection and filtering [figure: entity model network with cue nodes c2, c5, c7, c8 highlighted for the tweet] “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
  34. 34. 34 Entity Linking - Candidate selection and filtering [figure: sub-network induced by cues c2, c5, c7, c8 and their adjacent entity nodes m1–m7] “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
  35. 35. 35 Entity Linking - Candidate selection and filtering [figure: candidate entities scored against the matching cues in the sub-network] Candidates are ranked by $score(m_i) = \sum_{c_j \in \mathbb{C}} specificity(c_j) \cdot frequency(c_j, m_i)$, where $\mathbb{C}$ is the set of matching cues. “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
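A small sketch of this candidate scoring, using the specificity and frequency definitions from slide 29; the data structures (cue_to_entities, cue_frequency) are assumptions about how the entity model network might be stored.

```python
from collections import defaultdict

def candidate_scores(matched_cues, cue_to_entities, cue_frequency, top_k=25):
    """Score each candidate entity m_i as the sum over matched cues c_j of
    specificity(c_j) * frequency(c_j, m_i), then keep the top-k candidates.
    specificity(c_j) = |N| / |N_{c_j}|, with N the set of all entities in the
    network and N_{c_j} the entities adjacent to cue c_j."""
    n_entities = len({e for ents in cue_to_entities.values() for e in ents})
    scores = defaultdict(float)
    for cue in matched_cues:
        adjacent = cue_to_entities.get(cue, ())
        if not adjacent:
            continue
        specificity = n_entities / len(adjacent)
        for entity in adjacent:
            scores[entity] += specificity * cue_frequency.get((cue, entity), 1)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# toy usage: cues matched in the ISRO tweet against a tiny entity model network
cue_to_entities = {"woman in space": {"Gravity"},
                   "space": {"Gravity", "Interstellar", "The Martian"}}
cue_frequency = {("woman in space", "Gravity"): 12, ("space", "Gravity"): 40}
print(candidate_scores({"woman in space", "space"}, cue_to_entities, cue_frequency))
```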
  36. 36. 36 Entity Linking - Disambiguation • Formulated as a ranking problem • SVMrank to rank candidates • Features: similarity between the candidate entity and the tweet, and temporal salience of the candidate entity • Feature vector $x_1, x_2, \ldots, x_n$ with cue features $x_j = specificity(c_j) \cdot frequency(c_j, e_i)$ and normalized temporal salience $\frac{temporal\ salience(e_i)}{\sum_{e \in E_c} temporal\ salience(e)}$, where $E_c$ is the selected candidate set
  37. 37. 37 Evaluation Dataset
     Entity Type | Annotation | Tweets | Entities
     Movie | Explicit | 391 | 107
     Movie | Implicit | 207 | 54
     Movie | NULL | 117 | 0
     Book | Explicit | 200 | 24
     Book | Implicit | 190 | 53
     Book | NULL | 70 | 0
     • Tweets were collected in August 2014 using keywords • The tweets were manually annotated with the DBpedia URL of the entities • The tweets annotated with NULL have neither an explicit nor an implicit mention of an entity
  38. 38. 38 Entity Model Network Creation • 15,000 tweets for movies and books in July 2014 • 617 movies and 102 books • The most recent 1,000 tweets per entity to build its contextual knowledge • The May 2014 version of DBpedia used to extract factual knowledge • Temporal salience is obtained for July 2014 [figure: entity model network with entity nodes m1–m7 and cue nodes c1–c9 drawn from factual and contextual knowledge]
  39. 39. 39 Evaluation - Implicit Entity Linking • How many tweets had the correct entity within the selected candidate set (top-25)? • How many entities were correctly linked by our disambiguation approach? • Importance of contextual knowledge
     Entity Type | Candidate Selection Recall | Disambiguation Accuracy
     Movie | 90.33% | 60.97%
     Book | 94.73% | 61.05%
     Step | Entity Type | Without Contextual Knowledge | With Contextual Knowledge
     Candidate Selection Recall | Movie | 77.29% | 90.33%
     Candidate Selection Recall | Book | 76.84% | 94.73%
     Disambiguation Accuracy | Movie | 51.7% | 60.97%
     Disambiguation Accuracy | Book | 50.0% | 61.05%
  40. 40. 40 Qualitative Error Analysis
     Error: Lack of contextual knowledge | Tweet: ‘That Movie Where Shailene Woodley Has Her First Nude Scene? The Trailer Is RIGHT HERE!: No one can say Shailene Woodley isn't brave!’ | Entity: White Bird in a Blizzard
     Error: Novel entities | Tweet: ‘”hey, what's wrawng widdis goose?" RT @TIME: Mark Wahlberg could be starring in a movie about the BP oil spill http://ti.me/1oZh55V' | Entity: Deepwater Horizon
     Error: Cold start of entities | Tweet: ‘Video: George R.R. Martin's Children's Book Gets Re-release http://bit.ly/1qNNH5r’ | Entity: The Ice Dragon
     Error: Multiple implicit entity mentions | Tweet: ‘That moment when you realize that hazel grace and Augustus are brother and sister in one movie and in love battling cancer in another’ | Entities: Divergent, The Fault in Our Stars
  41. 41. 41 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  42. 42. 42 Implicit Relationships in Clinical Narratives atrial fibrillation hypertension diabetes chest pain weight gain headache lisinopril warfarin insulin atenolol medication disease symptom is_treated_with has_symptom
  43. 43. 43 • Implicit relationships: • Exist between symptoms, disorders, medications, and procedures • Can be established by leveraging domain knowledge • The existing knowledge bases fall short in eliciting relationships • Data + Knowledge can help to elicit such implicit relationships efficiently Implicit Relationships in Clinical Narratives
  44. 44. 44 A Scenario Atrial fibrillation Hypertension Diabetes Fatigue Syncope Weight loss Chest pain Discomfort in chest Dizzy Shortness of Breath Nausea Vomiting Headache Cough Weight gain
  45. 45. 45 A Scenario Atrial fibrillation Hypertension Diabetes Fatigue Syncope Weight loss Chest pain Discomfort in chest Dizzy Shortness of Breath Nausea Vomiting Headache Cough Weight gain Atrial fibrillation Hypertension Diabetes Chest pain Weight gain Discomfort in chest Cough Headache Edema Shortness of Breath The knowledge base does not know about edema, so edema could be a symptom of any disorder in the document. Observed Disorders Observed Symptoms
  46. 46. 46 Knowledge Acquisition • Hierarchical knowledge and non-hierarchical knowledge • Hierarchical knowledge retrieved from UMLS; non-hierarchical knowledge extracted from web resources (www.nlm.nih.gov, www.en.wikipedia.org, www.webmd.com, www.mayoclinic.com, www.clevelandclinic.org, www.healthline.org) plus feedback from a domain expert
     MRHIER: CUI | AUI | PAUI | PTR
     C0013404 | A0052186 | A0111363 | A0434168.A2367943. …
     C0013604 | A0052723 | A0135504 | A0434168.A2367943
     MRCONSO: CUI | AUI | SAB | STR
     C0013404 | A0052186 | MSH | Shortness of breath
     C0013604 | A0052723 | MSH | Edema
  47. 47. 47 Knowledge Modeling [figure: classes of disorders (Hypertension with subclasses Diastolic, Pulmonary, and Renal Hypertension; Pulmonary Hypertension with subclasses Episodic and Solitary Pulmonary Hypertension) and classes of symptoms (Breathing Problems with subclass Shortness of Breath), linked by rdfs:subclassOf; instances of disorders (e.g. Asthma, Hypertension) and instances of symptoms (e.g. Shortness of Breath) linked to their classes by rdf:type and to each other by is_symptom_of]
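One way this class/instance modeling could be expressed with rdflib; the example.org namespace and the concrete triples below are illustrative, not the dissertation's actual UMLS-derived vocabulary.

```python
from rdflib import Graph, Namespace, RDF, RDFS

# EX is a made-up namespace for this sketch
EX = Namespace("http://example.org/clinical/")
g = Graph()

# hierarchical knowledge (UMLS-style) as rdfs:subClassOf
g.add((EX.PulmonaryHypertension, RDFS.subClassOf, EX.Hypertension))
g.add((EX.EpisodicPulmonaryHypertension, RDFS.subClassOf, EX.PulmonaryHypertension))
g.add((EX.ShortnessOfBreath, RDFS.subClassOf, EX.BreathingProblems))

# an instance typed to its class
g.add((EX.patient_sob_1, RDF.type, EX.ShortnessOfBreath))

# non-hierarchical knowledge from web resources plus expert feedback
g.add((EX.ShortnessOfBreath, EX.is_symptom_of, EX.Asthma))
g.add((EX.ShortnessOfBreath, EX.is_symptom_of, EX.PulmonaryHypertension))

# all disorder classes reachable from Hypertension via the hierarchy
subclasses = set(g.transitive_subjects(RDFS.subClassOf, EX.Hypertension))
print(sorted(str(s) for s in subclasses))
```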
  48. 48. 48 Detecting Unexplained Symptoms • Clinical documents were semantically annotated for entities using cTAKES • Known relationships are populated • Unexplained symptoms were detected Modeled Knowledge Credit:http://bit.ly/2aMWVAd
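A minimal sketch of the unexplained-symptom check, assuming cTAKES has already normalized the document's disorders and symptoms and the known relationships are available as a dictionary.

```python
def unexplained_symptoms(doc_disorders, doc_symptoms, known_relations):
    """A symptom mentioned in the document is 'unexplained' if none of the
    disorders it is known to accompany (per the knowledge base) appears in
    the same document."""
    unexplained = []
    for symptom in doc_symptoms:
        known_disorders = known_relations.get(symptom, set())
        if not known_disorders & set(doc_disorders):
            unexplained.append(symptom)
    return unexplained

# toy example mirroring the scenario slide: the knowledge base knows nothing about edema
known_relations = {
    "shortness of breath": {"atrial fibrillation", "asthma"},
    "chest pain": {"atrial fibrillation"},
}
print(unexplained_symptoms(
    doc_disorders=["atrial fibrillation", "hypertension", "diabetes"],
    doc_symptoms=["chest pain", "edema"],
    known_relations=known_relations))     # -> ['edema']
```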
  49. 49. 49 Information Extraction – Unknown Relationships • A naïve method would assume a relationship between the unexplained symptom and all disorders in the clinical narrative • Can we leverage the knowledge we have about the symptom to find the most plausible disorders? • Intuition: a symptom is most likely to be shared by similar disorders
  50. 50. 50 Information Extraction – Unknown Relationships – Step 1: All co-occurring disorders are candidates [figure: symptom S linked to candidate disorders D1–D5]
  51. 51. 51 Information Extraction – Unknown Relationships – Step 2: Find the known disorders of the symptom [figure: known disorders D6 and D7 added alongside candidates D1–D5]
  52. 52. 52 Information Extraction – Unknown Relationships – Step 3: Collect more knowledge about the known relationships [figure: D6 and D7 expanded with related disorders such as D2, D4, D8, D10, D11, D12, D14]
  53. 53. 53 Information Extraction – Unknown Relationships – Step 4: Compare the co-occurring disorders with the collected knowledge [figure: candidates D1–D5 matched against the expanded set]
  54. 54. 54 Information Extraction – Unknown Relationships – Step 5: Eliminate non-matching candidate disorders [figure: only D2 and D4 remain linked to S]. We are left with the most plausible disorders for the unexplained symptom; if this scenario occurs frequently, it increases the confidence in this relationship. (A sketch of these five steps follows below.)
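A compact sketch of steps 1–5 above; the similar_disorders map is an assumption about how the extra knowledge collected in step 3 is represented.

```python
def plausible_disorders(cooccurring, symptom, known_relations, similar_disorders):
    """Steps 1-5 from the preceding slides:
    1. all disorders co-occurring with the unexplained symptom are candidates,
    2. look up the disorders already known to explain the symptom,
    3. expand each known disorder with disorders similar to it,
    4. compare the candidates against this expanded set,
    5. keep only the matching candidates as the most plausible disorders."""
    candidates = set(cooccurring)                      # step 1
    known = known_relations.get(symptom, set())        # step 2
    expanded = set(known)
    for d in known:                                    # step 3
        expanded |= similar_disorders.get(d, set())
    return candidates & expanded                       # steps 4-5

# toy usage with the D1..D5 style labels used on the slides
print(plausible_disorders(
    cooccurring={"D1", "D2", "D3", "D4", "D5"},
    symptom="S",
    known_relations={"S": {"D6", "D7"}},
    similar_disorders={"D6": {"D2", "D8", "D10"}, "D7": {"D4", "D11", "D12"}}))
# -> {'D2', 'D4'}
```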
  55. 55. 55 Evaluation • A corpus of 1,500 electronic medical records was used • Annotated with cTAKES, and the most frequent entities were selected • UMLS semantic types were used to categorize disorders and symptoms • Initial knowledge base - 86 disorders, 42 symptoms, 255 disorder-symptom relationships
  56. 56. 56 Evaluation – Relationship Prediction • There were 29 distinct unexplained symptoms • Precision of the questions generated • 1st iteration - 105 correct from 142 (73.94%) • 2nd iteration - 20 correct from 29 (68.96%) • 3rd iteration - 4 correct from 9 (44.44%)
     Top 5 unexplained symptoms: Edema (910 unexplained instances), Syncope (336), Systolic Murmur (168), Tachycardia (143), Angina (136)
     Top 5 disorders co-occurring with edema: Hypertension (647 co-occurrences), Hyperlipidemia (641), Claudication (454), Coronary atherosclerosis (395), Coronary artery disease (242)
  57. 57. 57 Evaluation – Increment in Explainability
     Knowledge base | Number of unexplained relationships | Increment in explainability
     Initial knowledge base | 2251 | 0%
     After 1st iteration | 878 | 60.99%
     After 2nd iteration | 806 | 64.19%
  58. 58. 58 Summary • Implicit information is a frequent occurrence in text, and ignoring it would adversely affect downstream applications. • Linguistic and world knowledge play an important role in decoding implicit information. • This dissertation demonstrated the characteristics of implicit information and developed a solution to capture factual implicit constructs. [pipeline: Knowledge Acquisition → Knowledge Modeling → Detecting Implicit Information → Information Extraction]
  59. 59. 59 Contributions • Identify and demonstrate the value of implicit information. • Study the characteristics of the implicit information manifestation. • Demonstrate the value of knowledge in extracting factual implicit information. - Linguistic - Domain - Contextual • Developed a framework for factual implicit information extraction. • Demonstrated the usage of the framework to solve three implicit information extraction problems.
  60. 60. 60 Graduate Life@Kno.e.sis Journal Publications: • Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Semantics Driven Approach for Knowledge Acquisition from EMRs, IEEE Journal of Biomedical and Health Informatics. • Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen and Amit Sheth. I just wanted to tell you that loperamide WILL WORK': A Web- Based Study of Extra-Medical Use of Loperamide. Conference Publications: • Sujan Perera, Pablo Mendes, Adarsh Alex, Amit Sheth, Krishnaprasad Thirunarayan, Implicit Entity Linking in Tweets, ESWC 2016 • Sujan Perera, Pablo Mendes, Amit Sheth, Krishnaprasad Thirunarayan, Adarsh Alex, Christopher Heid, Greg Mott, Implicit Entity Recognition in Clinical Documents, *SEM 2015 • Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Data Driven Knowledge Acquisition Method for Domain Knowledge Enrichment in the Healthcare, BIBM 2012 • Menasha Thilakaratne, Ruvan Weerasinghe, Sujan Perera, Knowledge-driven Approach to Predict Personality Traits by Leveraging Social Media Data, WI 2016 Workshop and Posters: • Sujan Perera, Amit Sheth, Krishnaprasad Thirunarayan, Challenges in Understanding Clinical Notes: Why NLP Engines Fall Short and Where Background Knowledge Can Help, DARE 2013 • Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen, Amit Sheth. A Web-Based Study of Self-Treatment of Opioid Withdrawal Symptoms with Loperamide, CPDD 2012 Internships: • ezDI Summer 2012 • IBM Watson Summer 2014 and 2015 Awards and grants: • George Thomas Graduate Fellowship • NSF travel grants: BIBM and ICHI PC Committee: • DARE (2013), EKAW (2014, 2016), ISWC 2015, IJCAI 2016 External Reviewer: • ISWC, ESWC, IJSWIS, IEEE Intelligent Systems, Applied Ontology, ODBASE Proposal Contributions: • eDrugTrends (NIH R01) • Healthcare Outcome Prediction (NSF-SCH) Mentoring: • Adarsh Alex (MSc) • Menasha Tilakaratne (BSc)
  61. 61. 61 Thank You Mentors Collaborators
  62. 62. 62 Coffee Mates and Colleagues Thank You Funding • ezDI • George Thomas Fellowship • NSF: CNS 1513721 Context- Aware Harassment Detection on Social Media
