Slide 1: Knowledge Acquisition in a System
Christopher Thomas
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, Dayton, OH
topher@knoesis.org

Slide 2: Circle of knowledge in a System
Knowledge Enabled Information and Services Science

Slide 3: Dissertation Overview
• Conceptual Knowledge: Ontologies, LoD
  – Knowledge Representation [IJSWIS, CR, FLSW]
  – Ontology design [WWW, FOIS]
  – Knowledge merging / Ontology alignment [AAAI, WebSem2, SWSWPC]
• Textual Information: Wikipedia, Web
  – Information Quality [WI2]
  – Social processes for content creation [CHB]
• Social processes for knowledge validation [IHI, WebSci, CHB]
• Doozer++: Taxonomy extraction, Relationship/Fact extraction [IHI, WebSem1, IEEE-IC, WebSci, WI1]

Slide 4: Talk Contents
• What is knowledge?
• How do we turn propositions/beliefs into knowledge?
• How do we acquire information?

Slide 5: Talk outline
• Motivation
• Knowledge Acquisition (KA) Overview
• KA in a loosely connected system
  – Doozer++
  – Automatic formal domain model creation
  – Information Extraction: Top-Down, Bottom-Up
  – Information Validation “in use”
• Conclusion
Slide 6: Larger Context of automated KA
• Increasing significance of the knowledge economy
  – “Knowledge Workers” spend 38% of their time searching for information (McDermott, 2005)
  – Vital to get a quick and still comprehensive understanding of a field through pertinent concepts/entities and relations/interactions
• Increased demand for formally available knowledge in semantic models
  – Filtering, browsing, annotation, reasoning
McDermott, M. “Knowledge Workers: How can you gauge their effectiveness?” Leadership Excellence, Vol. 22.10, October 2005.

Slide 7: Motivating Scenario
• Learn about a new subject
  – E.g. gain a quick overview of a current or historical event
• Use a formal representation of the gained overview to filter information
  – Facilitate in-depth exploration
• Use the formalized information and the user interaction to create knowledge from information

Slide 8: Motivating Scenario
• Google: India
• Brief description – demographic, geographic information, etc.

Slide 9: Motivating Scenario
• Google: India
• Regular Web results

Slide 10: Motivating Scenario
• Clicking on a link to the Wikipedia entry shows that there have been conflicts with Pakistan over the region of Kashmir → investigate more

Slide 11: Motivating Scenario
• Google: India Pakistan Kashmir
• Only Web results and news
• So far, search engines only display facts about entities, not relationships or larger contexts

Slide 12: Motivating Scenario
• Beneficial to get an overview “at a glance” of a domain
• Automated approach to creating knowledge models for focused areas of interest
• Create models around an incomplete or rudimentary keyword description and “anticipate” the user’s intentions wrt. the full context
Slide 13: Motivating Scenario
• Doozer++: india pakistan kashmir
• Important concepts and relationships describing the context

Slide 14: Motivating Scenario
• Filtered IR using concepts in the model
• Concepts and relationships that contributed to clicked results gain support
• User can explicitly approve content

Slide 15: Circle of Knowledge (Example)

Slide 16: Motivating Scenario
• On-demand creation of domain knowledge improves individual comprehension of an event
• Formal models are easy to use in information filtering
• Validated information → Knowledge
  – Can be given back to the community to improve the overall amount of formal knowledge available on the Web
  – E.g. “unknown” to DBPedia that the region of Kashmir belongs to both India and Pakistan

Slide 17: Importance of Model creation
• Models support the individual user or knowledge worker, but also groups or systems
  – More efficient communication through small, shared, agreeable conceptualizations (people ↔ people, people ↔ system, system ↔ system)
  – Classify or filter pertinent and topical information using models
  – Model-assisted searching and faceted or exploratory browsing using relationships
  – Reuse of validated knowledge

Slide 18: Domain Knowledge Models
• Scientific applications
  – In-depth description of concepts
  – Narrow field
  – People ↔ system, system ↔ system: annotation, reasoning
  ⇒ Absolute correctness necessary (as far as possible)
• General applications
  – Broad coverage of the field
  – Context – how does the new information fit in?
  – People ↔ people, people ↔ system: individual domain comprehension, filtering, annotation
  ⇒ Relative correctness sufficient

Slide 19: Model Creation Resources
• Large models are available as reference
  – DBPedia, YAGO, UMLS, MeSH, GO …
  – Too big to be efficiently and effectively usable
• Prior knowledge required to find pertinent resources
• Other information is available in great abundance, but unformalized
  – Tacit expert knowledge
  – Scientific databases
  – Free text: peer-reviewed journals and proceedings, general Web content
Slide 20: Epistemological Considerations
• Knowledge – ensure epistemological soundness of automated knowledge acquisition
• Reference – ensure that nodes in the models refer to real-world concepts/entities

Slide 21: Knowledge
• Functional Definition
  – Knowledge = “Know-How”
  – Practical, but weak; includes “Actionable Information”
• Categorical Definition
  – Knowledge = justified true belief
  – S knows that p iff (i) p is true; (ii) S believes that p; (iii) S is justified in believing that p

Slide 22: Belief and Justification
• Belief
  – Statements held by the system
• Justification
  – Trusted sources
  – Extraction algorithms: Bayesian, deductive or inductive reasoning; macro-reading algorithms → wisdom of the crowds
  – Validation

Slide 23: Truth assessment of a statement
• Is truth correspondence?
  – “A” is true iff A (a true statement corresponds to an actual state of affairs) – but the system has no access to the actual state of affairs
• Is truth coherence?
  – Does the statement fit into the system of other statements?
• Is truth consensus?
  – Agreement of correctness amongst a group
⇒ In the cyclical model, achieve a high degree of certainty by allowing constant validation

Slide 24: Domain Model – Reference
• Model of a domain conceptually split
  – Domain Definition
    • Concepts identified by URIs (classes, entities, relationship types) → ensures reference
    • Remains static – necessity
    • Rigid designators (Kripke)
  – Domain Description
    • Relationships describe concepts
    • Subject to change – possibility
    • Definite descriptions (Russell)

Slide 25: Domain Definition
• Top-down concept identification
• Achieved through
  – Manual creation based on consensus in a group
  – Extraction from community-created or peer-reviewed conceptualizations (Wikipedia, MeSH, UMLS Semantic Network)

Slide 26: Domain Description
• Possible to do top-down extraction of the domain description, e.g. from DBPedia
• Problem: formal concept descriptions are sparse
  – On average, DBPedia has less than 2 object properties per entity
• Extract descriptions (facts) bottom-up
  – Available in text, DBs, etc.
  – Domain-specific molecular structure extractors (GlycO)
  – Domain-independent IE techniques (Doozer++)

Slide 27: Knowledge Acquisition Approaches
• KA in a tightly connected system
  – GlycO: domain-specific BioChemistry ontology
    • Manual domain definition and description
    • Partial automatic domain description
    • Domain-specific automatic validation; manual validation for false negatives
• KA in a loosely connected system
  – Doozer++: general domain-model creation framework
    • Automatic domain definition, top-down concept extraction
    • Automatic domain description, bottom-up fact extraction: extraction from trusted sources, with a trusted extraction and validation procedure
    • Domain-independent community-based validation

Slide 28: Knowledge Acquisition Approaches (comparison)
• Definition – Traditional Knowledge Engineering: top-down; Extraction Approach: bottom-up; GlycO: top-down knowledge engineering; Doozer++: top-down (conceptually, by extraction from corpus)
• Description – Traditional Knowledge Engineering: top-down; Extraction Approach: bottom-up; GlycO: bottom-up, restricted by the top-down definition; Doozer++: bottom-up, restricted by the top-down definition
• Verification – Traditional Knowledge Engineering: manual; Extraction Approach: manual; GlycO: automatic for correctness, exceptions added manually; Doozer++: community-based validation
Slide 29: KA on the Web – Vision
• Web searches, browsing sessions or classification tasks can be seen as creating an implicit domain model
  – World view, concept coverage, facts
• Make models explicit and reusable using formal descriptions (RDF, OWL)
• Validate the contained information and share it with the community
⇒ Increase the system’s knowledge by “doing what you do”: search, browse, click, communicate

Slide 30: KA in a Loosely Connected System
• Domain model creation to gradually increase the overall knowledge of the system
  – User-interest driven
  – Incentive to evaluate
• Sources: Linked Data, free text, Wikipedia, the Web
• Doozer++ pipeline: Domain Definition (top-down concept extraction) → Domain Description (pattern-based fact extraction) → Validation
• Evaluation in use (Scooner): semantic browsing and retrieval; domain-independent, community-based

Slide 31: Domain Definition Requirements
• Identify concepts, concept labels (denotations) and concept hierarchy
• Challenge: define narrow boundaries for a domain while at the same time ensuring broad conceptual coverage within the domain

Slide 32: Domain definition – conceptual
• Expand and Reduce approach
  – Start with “high recall” methods
    • Exploration – full-text search
    • Exploitation – graph-similarity method, category growth
    • “What could be in the domain?”
  – End with “high precision” methods
    • Apply restrictions on the concepts found
    • Remove terms and categories that fall outside the dense areas of the model graph
    • “What should be in the domain?”

Slide 33: Domain Description – Classifier
• Concept-aware
  – Use concepts and concept labels from the domain definition step
• Fact extraction as classification of concept pairs into relationship types
  – f_class: C × C → R
  – R_{S,O} = {R | p(R,S,O) > ε}
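The classification rule R_{S,O} = {R | p(R,S,O) > ε} can be sketched in a few lines of Python. The relationship names and scores below are hypothetical stand-ins for the probabilistic model p(R,S,O) described on the following slides.

```python
def classify_pair(scores, epsilon=0.5):
    """Return all relationship types R with p(R, S, O) > epsilon.

    `scores` maps relationship name -> p(R, S, O) for one concept pair.
    """
    return {rel for rel, p in scores.items() if p > epsilon}

# Hypothetical scores for one subject/object pair:
scores = {"almaMater": 0.71, "birthPlace": 0.12, "employer": 0.55}
print(sorted(classify_pair(scores, epsilon=0.5)))  # ['almaMater', 'employer']
```

Note that a pair can be assigned several relationship types at once; the threshold ε trades precision against recall, as in the evaluation slides later.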
Slide 34: Domain Description
• Combined language model and semantic classification model
• Language model: surface-pattern based
  – Pattern manifestations of relationships as features
  – Open to any corpus, language independent
  – Less computational overhead than NLP
• Semantic classification model
  – Learned or assigned concept labels
  – Semantic types to aid classification

Slide 35: Domain Description – Implementation
• Probabilistic vector-space model
  – Each relationship is defined by vectors of pattern probabilities and domain/range probabilities
  – Each concept is grounded by its semantic types and manifested by its labels and their probabilities of identifying the concept
  – Sparse pattern representation (density ~2%)
  – White-box, easily verifiable
  – Inherently parallel

Slide 36: Terminology
Symbol | Meaning | Example
S, O | Subject and Object concepts (semantic) | Kelly_Miller_(scientist), Howard_University
L_S, L_O | Subject and Object labels | “Kelly Miller”, “Howard University”
P_{LS,LO} | Phrase instantiating the pattern | Kelly Miller graduated from Howard University
P | Pattern | <Subject> graduated from <Object>
T_S, T_O | Semantic type of Subject or Object | Person, Educational_Institution
R | Relationship | almaMater, birthPlace
Slide 37: Probabilistic Classifier
[Diagram annotations: labels taken from a lexicon or learned from the corpus; semantic types asserted in the ontology or linked from linked data; patterns learned from free text.]

Slide 38: Probabilistic Classifier
How is Barack Obama related to Columbia University? → p(R, Barack_Obama, Columbia_University)
Sentence in corpus: “Obama graduated in 1983 from Columbia University with a degree in political science and international relations.”
(Regular classification requires multiple examples.)

Slide 39: Probabilistic Classifier
“Obama graduated in 1983 from Columbia University”
p(almaMater, Barack_Obama, Columbia_University) =
  p(almaMater | “<Subject> graduated in 1983 from <Object>”)
  × p(Barack_Obama | “Obama”)
  × p(Columbia_University | “Columbia University”)
  × p(almaMater | domain(Person))
  × p(almaMater | range(Academic_Institution))
= 0.9 × 0.95 × 0.95 × 0.9 × 0.97 = 0.70909425
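The slide combines the pattern, label, and type evidence by plain multiplication, treating the factors as independent. A minimal sketch (the function name is mine):

```python
def p_triple(p_rel_given_pattern, p_subj_given_label, p_obj_given_label,
             p_rel_given_domain, p_rel_given_range):
    """Combine the five factors from the slide by multiplication;
    the factors are treated as independent."""
    return (p_rel_given_pattern * p_subj_given_label * p_obj_given_label
            * p_rel_given_domain * p_rel_given_range)

# The numbers from the slide:
p = p_triple(0.9, 0.95, 0.95, 0.9, 0.97)
print(round(p, 8))  # 0.70909425
```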
Slide 40: Pattern Generalization
• Problem: low recall in pattern-based IE
• Substitute terms with wildcards
  – No POS tagging, hence only “*” wildcards
• Mirrors shortest paths through parse trees
  <Subject> graduated in 1983 from <Object>
  <Subject> * in 1983 from <Object>
  <Subject> graduated * 1983 from <Object>
  <Subject> * * 1983 from <Object>
  <Subject> graduated in * from <Object>
  <Subject> * in * from <Object>
  <Subject> graduated * * from <Object>
  <Subject> * * * from <Object>
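One way to reproduce the eight variants on this slide is to enumerate all subsets of the substitutable words. Following the example, the placeholders and the word adjacent to <Object> ("from") are kept fixed; that last restriction is inferred from the example, not stated on the slide.

```python
from itertools import combinations

def generalize(pattern):
    """Generate all '*'-wildcard variants of a surface pattern.

    Placeholders and the word right before <Object> are never replaced
    (mirroring the slide's example)."""
    words = pattern.split()
    fixed = {i for i, w in enumerate(words) if w in ("<Subject>", "<Object>")}
    fixed.add(words.index("<Object>") - 1)  # keep "from" fixed
    slots = [i for i in range(len(words)) if i not in fixed]
    variants = []
    for k in range(len(slots) + 1):          # choose how many words to wildcard
        for combo in combinations(slots, k):
            v = list(words)
            for i in combo:
                v[i] = "*"
            variants.append(" ".join(v))
    return variants

pats = generalize("<Subject> graduated in 1983 from <Object>")
print(len(pats))  # 8
```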
Slide 41: Learning p(R|P)
• Distantly supervised training
• Collect pattern frequencies for training examples
  – Fact triples <S, R, O>, e.g. from Linked Data (DBPedia, UMLS)
  – Manifestations of facts in text in the form of patterns (corpus e.g. Web, Wikipedia, MedLine)
• For relationship R_i, aggregate pattern vectors representing <*, R_i, *>

Slide 42: Learning p(R|P) – naïve
• For each vector R_i containing pattern frequencies for relationship R_i, compute the number of occurrences of pattern P_j with terms denoting some pair <S, O> ∈ R_i, normalized by all pattern occurrences for R_i

Slide 43: Learning p(R|P) – naïve
• Uniform distribution of relationships assumed
  – As the number of relationship types grows, the prior of each type goes towards 0
  – Normalize the probabilities over the column vector to get p(R_i|P_j)
• Vector space representation
  – Relationship–pattern matrix: R2P_ij = p(R_i|P_j)
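The two normalization steps above — first p(P|R) within each relationship's pattern vector, then p(R|P) down each pattern column under a uniform prior — can be sketched as follows. The counts are toy data, not from the dissertation's corpora.

```python
from collections import defaultdict

def learn_r2p(pattern_counts):
    """pattern_counts[rel][pat] = frequency of pattern `pat` with
    training pairs of relationship `rel`.
    Returns R2P[rel][pat] = p(rel | pat)."""
    # Step 1: p(pat | rel) -- normalize each relationship's pattern vector.
    p_pat_given_rel = {
        rel: {pat: c / sum(pats.values()) for pat, c in pats.items()}
        for rel, pats in pattern_counts.items()
    }
    # Step 2: p(rel | pat) -- normalize each pattern's column across
    # relationships (uniform relationship prior cancels out).
    col_sum = defaultdict(float)
    for pats in p_pat_given_rel.values():
        for pat, p in pats.items():
            col_sum[pat] += p
    return {
        rel: {pat: p / col_sum[pat] for pat, p in pats.items()}
        for rel, pats in p_pat_given_rel.items()
    }

counts = {
    "almaMater": {"<S> graduated from <O>": 8, "<S> attended <O>": 2},
    "employer":  {"<S> worked at <O>": 9, "<S> attended <O>": 1},
}
r2p = learn_r2p(counts)
```

In this toy example "<S> graduated from <O>" occurs only with almaMater, so p(almaMater | pattern) = 1.0, while the shared pattern "<S> attended <O>" splits its probability mass between the two relationships.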
Slide 44: Problem: Relationship Similarities
• Extensional similarity
  – Semantically different relationships can share Subject–Object pairs in training data
• Intensional similarity
  – Overlap and entailment of relationship types
  – Types should not be seen as discrete, e.g. physical_part_of entails part_of
• A priori unknown which types overlap unless a formal description is available
  – Semantically similar types compete for the same patterns

Slide 45: Relationship similarities – Pertinence
• Measure similarity between pattern vectors as an approximation of intensional similarity

Slide 46: Pertinence for Relationships
• Do not punish the occurrence of the same pattern with relationship types that are intensionally similar but extensionally dissimilar
• Reduce the impact of extensionally similar relations

Slide 47: Pertinence Example
Pattern: <Subject> in the right <Object>
Relationship | p(R|P)
biological_process_has_associated_location | 0.968371381
disease_has_associated_anatomic_site | 0.880452774
part_of | 0.622532958
has_finding_site | 0.561041318
has_location | 0.537424451
has_direct_procedure_site | 0.363832078
Sum: 3.933654958 (Note: this never causes p(R,S,O) > 1)
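The slides give the intuition behind Pertinence but not its formula. The sketch below is one plausible reading, not the dissertation's actual definition: a pattern's weight for a relationship is discounted only by its occurrences with relationships whose pattern vectors are *dissimilar* (low cosine), so intensionally similar relationships do not punish each other for sharing a pattern.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse pattern vectors (dicts)."""
    dot = sum(u[p] * v.get(p, 0.0) for p in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pertinence(rel, pat, vectors):
    """Weight of pattern `pat` for relationship `rel`, discounted only
    by competition from relationships with dissimilar pattern vectors.
    (Assumed formulation -- the slides do not state the exact formula.)"""
    own = vectors[rel].get(pat, 0.0)
    competition = sum(
        (1.0 - cosine(vectors[rel], vec)) * vec.get(pat, 0.0)
        for other, vec in vectors.items() if other != rel
    )
    return own / (own + competition) if own + competition else 0.0
```

With identical pattern vectors (cosine 1) the shared pattern is not punished at all; a dissimilar relationship sharing the same pattern pulls the weight below 1.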
Slide 48: Similarities between relationships
[Figure]

Slide 49: Pertinence evaluation
[Precision–recall chart comparing classification with Pertinence vs. without Pertinence]

Slide 50: Fact extraction evaluation – DBPedia
• 60% training set, 40% testing; DBPedia Infobox fact corpus, Wikipedia text corpus
• [Chart: precision/recall vs. confidence threshold]
• Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard; averaged over 107 relation types

Slide 51: Sample results (DBPedia)
Subject :: Object | Suggested relationship | Extracted ranks 1–3 (relation; confidence)
Howard Pawley :: Gary Filmon | successor | after 0.799; office 0.768; after 0.686
Mulan :: Tarzan | after | nextSingle 0.603; followedBy 0.533; after 0.416
Species Deceases :: Midnight Oil | artist | producer 0.761; artist 0.719; genre 0.467
The Crystal City :: Orson Scott Card | author | artist 0.625; author 0.617; writer 0.583
Horatio Allen :: William Maxwell | before | predecessor 0.629; before 0.475
Basdeo Panday :: Trinidad & Tobago | birthplace | nationality; birthplace; deathPlace (0.658; 0.658; 0.330)
Bob Nystrom :: Stockholm | birthplace | cityOfBirth 0.677; birthplace 0.513
Beccles railway station :: Suffolk | borough | district 0.772; friend 0.770; borough 0.749

Slide 52: Fact extraction evaluation – UMLS
• 60% training set, 40% testing; UMLS fact corpus, MedLine text corpus
• [Chart: precision/recall vs. confidence threshold]
• Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard; averaged over ~100 relation types

Slide 53: Sample results (UMLS)
Subject :: Object | Suggested relationship | Extracted rank 1
Teeth :: poisoning, fluoride | finding_site_of | finding_site_of
polyps :: polyp of cervix nos (disorder) | associated_with | associated_with
neck of uterus :: polyp of cervix nos (disorder) | location_of | finding_site_of
benign neoplasms :: polyp of colon | related_to | associated_with
brain ischemia :: brain | has_finding_site | location_of
gastrointestinal tract :: polyp of colon | location_of | is_primary_anatomic_site_of_disease
gamete structure (cell structure) :: polyvesicular vitelline tumor | is_normal_cell_origin_of_disease | is_normal_cell_origin_of_disease

Slide 54: Comparison – DBPedia corpus
• Mintz: extraction of 102 relationship types from Freebase; Doozer++: 107 from DBPedia
• [Precision–recall chart: Mintz-POS, Mintz-NLP, Doozer++ (R), Doozer++ (P)]
• (R) recall-oriented, using pattern generalization; (P) precision-oriented, no generalization
M. Mintz, S. Bills, R. Snow, and D. Jurafsky. “Distant supervision for relation extraction without labeled data.” In ACL, 2009.
Slide 55: Evaluate Ad-Hoc Model Creation
• On-demand creation of models
Domain | Query | Number of Concepts | Precision (Domain Definition)
Semantic Web | “Semantic Web” OWL ontologies RDF | 143 | 0.98
Harry Potter | “Harry Potter” dumbledore gryffindor slytherin | 134 | 0.98
Beatles | Beatles “John Lennon” “Paul McCartney” song | 250 | 0.99
India–Pakistan Relations | India Pakistan Kashmir | 129 | 0.99
US Financial crisis – TARP | tarp “financial crisis” “toxic assets” | 146 | 0.93
German Chancellors | German chancellors “Angela Merkel” “Helmut Kohl” | 124 | 0.91

Slide 56: Ad-Hoc Model Creation – Evaluation
[Figure]

Slide 57: Ad-Hoc Model Creation – Evaluation
• Relative recall: recall wrt. possible extraction, i.e. the maximum number of extracted facts marks 100% recall

Slide 58: Related Work
[Diagram positioning Mintz, SOFIE and Turney relative to “surface patterns only”]

Slide 59: Main Differences
• Surface patterns only
• Only positive training examples
• Pertinence measure for semantic similarity
• Concept-aware: start with defined concepts
• Include background knowledge in probabilistic classification instead of rule-based reasoning

Slide 60: Related work
• Pattern-based fact extraction
  – E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In JCDL, 2000.
  – F. M. Suchanek, M. Sozio, and G. Weikum. SOFIE: A Self-Organizing Framework for Information Extraction. In WWW, 2009.
  – T. M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka, and R. Wang. Populating the Semantic Web by Macro-Reading Internet Text. In ISWC, 2009.
  – M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching the world wide web of facts – step one: the one-million fact extraction challenge. In AAAI, 2006.
Slide 61: Related work
• Relationship–pattern computations
  – P. D. Turney and P. Pantel. From Frequency to Meaning: Vector Space Models of Semantics. …

Slide 62: Summary – Fact extraction
• Pattern-based fact extraction with generalization and Pertinence achieves competitive precision …

Slide 63: Application and Knowledge Validation
• Example: domain model as a basis for research in … 18 M…

Slide 64: Domain Definition – Extracted Hierarchy
• A hierarchy extracted for a cognitive science domain model. The keyword description …

Slide 65: Domain Description: Connect Concepts

Slide 66: Expert Evaluation of Facts in the Model
[Chart: fraction of facts per rating category]

Slide 67: Extractor Confidence vs. Correctness
• Analysis shows that the highest-quality extractions have the highest confidence, but al…

Slide 68: Extractor Confidence vs. Correctness
• Many facts deemed interesting were extracted based on highly specialized patterns i…

Slide 69: Sources of Errors
• Extracted relationship too specific or formally incorrect but metaphorically correct
  – <Interped…

Slide 70: Validation
• Extracted statements need to be validated to be considered knowledge
  – Explicit validation, e.g. thumbs up …

Slide 71: Explicit Validation
• Certainty of reference
  – I.e. we know exactly which statement was validated
• Validator cred…

Slide 72: Implicit Validation
• Find indications of correctness or incorrectness based on the way the users interact with the pr…

Slide 73: Implicit Validation
• Examples of implicit community validation
  – Games with a purpose (L. von Ahn)
  – Google search…

Slide 74: Implicit Validation
• A fact is browsed very often by different users
  – The fact is interesting to many users
  – Th…
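The browsing signal above can be sketched as a distinct-user support count; the data model (a log of user/fact pairs) is my assumption, not the dissertation's implementation.

```python
from collections import Counter

def implicit_support(browse_log):
    """Count, per fact, how many *distinct* users browsed it.

    `browse_log` is an iterable of (user_id, fact) pairs; repeated
    views by the same user do not add support."""
    distinct = {(user, fact) for user, fact in browse_log}
    return Counter(fact for _, fact in distinct)

log = [
    ("u1", ("Kashmir", "partOf", "India")),
    ("u2", ("Kashmir", "partOf", "India")),
    ("u1", ("Kashmir", "partOf", "India")),  # repeat view by the same user
    ("u3", ("Kashmir", "partOf", "Pakistan")),
]
support = implicit_support(log)
print(support[("Kashmir", "partOf", "India")])  # 2
```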
Slide 75: Validation “through use”
• Enter search terms → choose entity of interest → browse…

Slide 76: Validation “through use”
• Find another interesting fact
• Fact …

Slide 77: Validation “through use”
• Path suggests …

Slide 78: Browsed Facts Examples

Slide 79: Related work
• Evaluation and Use
  – E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating …

Slide 80: Summary – Knowledge Acquisition
• The model actually reflects what the user is interested in at the point of creation…

Slide 81: Future Directions
• Active Learning to improve classification
  – Easy in a tightly connected system (e.g. NELL)
  – Feedbac…

Slide 82: Contributions
• Conceptual Knowledge: Ontologies, LoD …

Slide 83: Journal/Conference Publications
[WebSem] C. Thomas, P. Mehra, A. Sheth, W. Wang, G. Weikum. Automatic domain model cre…

Slide 84: Journal/Conference Publications
[WI2] C. Thomas and A. Sheth. Semantic Convergence of Wikipedia Articles. In Proceeding…

Slide 85: Publications
[CHB] C. Thomas and A. Sheth. Web Wisdom – An Essay on How Web 2.0 and Semantic Web can foster a Global K…

Slide 86: Other Publications – Workshop Publications
[SWLS] A. Sheth, W. York, C. Thomas, M. Nagarajan, J. Miller, K. Kochut, S. Sa…

Slide 87: Research and Collaborations
• Research
• Collaborations
  – Complex Carbohydrate R…

Slide 88: Thank you!
Shaojun Wang, Amit, Pascal, Pankaj, Gerhard, …

Slide 89: Thank you
PhD thesis defense of Christopher Thomas

Christopher Thomas defended his thesis on "Knowledge Acquisition in a System".
Video can be found at: http://www.youtube.com/watch?v=NeQomGsJvDk
Slide notes:
• Shared vocabulary
• To get the probability of seeing a relationship when given a concept pair, we average over all occurrences of phrases that contain labels for the concept pair, take into account the probabilities that the term pair actually denotes the concept pair and, if available, whether the types of subject and object are likely to occur with that relationship.
• Show how pattern probabilities and background knowledge interact
• Shortcoming: patterns are seen as independent, even though they would have been in the same path through a parse tree
• Pertinence has most influence in high-recall regions. Intuitively, as the threshold is increased, patterns that are highly indicative of specific relationships contribute more to the classification, and thus the advantage of the pertinence method is slightly diminished.
• Doozer (R) – recall-oriented, generalized; Doozer (P) – precision-oriented, not generalized
• None of the facts were previously found in UMLS
• It is important to know how correct information was extracted. The probabilistic classifier easily allows for analysis of the patterns that underlie an extraction. The slide shows how extraction quality measures up against extraction confidence.
    1. 1. Knowledge Acquisition in a System Christopher ThomasOhio Center of Excellence in Knowledge-enabled Computing - Kno.e.sis, Wright State University Dayton, OH topher@knoesis.org
    2. 2. Circle of knowledge in a System Knowledge Enabled Information and Services Science 2
    3. 3. Dissertation Overview Conceptual Knowledge: Ontologies, LoD Knowledge Representation [IJSWIS, CR, FLSW] Ontology design [WWW, FOIS]Knowledge merging/Ontology alignment[AAAI, WebSem2, Textual Information:SWSWPC] Wikipedia, Web Information Quality[WI2] Social processes for content creation [CHB] Social processes Doozer++: for knowledge Taxonomy extraction validation Relationship/Fact [IHI,WebSci, CHB] extraction [IHI, WebSem1, IEEE- IC, WebSci, WI1] Knowledge Enabled Information and Services Science 3
    4. 4. Talk Contents What is knowledge?How do we turnpropositions/beliefs intoknowledge? How do we acquire information? Knowledge Enabled Information and Services Science 4
    5. 5. Talk outline • Motivation • Knowledge Acquisition (KA) Overview • KA in a loosely connected system – Doozer++ – Automatic formal domain model creation – Information Extraction • Top-Down • Bottom-Up – Information Validation “in use” • Conclusion Knowledge Enabled Information and Services Science 5
    6. 6. Larger Context of automated KA • Increasing significance of knowledge economy – “Knowledge Workers” spend 38% of their time searching for information (McDermott, 2005) – Vital to get a quick and still comprehensive understanding of a field through pertinent concepts/entities and relations/interactions • Increased demand for formally available knowledge in semantic models – Filtering, browsing, annotation, reasoningMcdermott, M. "Knowledge Workers: How can you gauge their effectiveness." Leadership Excellence. Vol. 22.10. October 2005 Knowledge Enabled Information and Services Science
    7. 7. Motivating Scenario • Learn about a new subject – E.g. gain a quick overview over a current or historical event • Use a formal representation of the gained overview to filter information – Facilitate in-depth exploration • Use the formalized information and the user interaction to create knowledge from information Knowledge Enabled Information and Services Science 7
    8. 8. Motivating Scenario • Google: India • Brief description – demographic-, geographic information, etc. Knowledge Enabled Information and Services Science 8
    9. 9. Motivating Scenario • Google: India • Regular Web results Knowledge Enabled Information and Services Science 9
    10. 10. Motivating Scenario • Clicking on a link to the Wikipedia entry shows that there have been conflicts with Pakistan over the region of Kashmir  Investigate more Knowledge Enabled Information and Services Science 10
    11. 11. Motivating Scenario • Google: India Pakistan Kashmir • Only Web results and news So far, search engines only display facts about entities, not relationships or larger contexts Knowledge Enabled Information and Services Science 11
    12. 12. Motivating Scenario • Beneficial to get an overview “at a glance” over a domain. • Automated approach to creating knowledge models for focused areas of interest • Create models around an incomplete or rudimentary keyword description and “anticipate” user‟s intentions wrt. the full context Knowledge Enabled Information and Services Science 12
    13. 13. Motivating Scenario Doozer++: india pakistan kashmir • Important concepts and relationships describing the context Knowledge Enabled Information and Services Science 13
    14. 14. Motivating Scenario• Filtered IR using concepts in the model• Concepts and relationships that contributed to clicked results gain support• User can explicitly approve content Knowledge Enabled Information and Services Science 14
    15. 15. Circle of Knowledge (Example) Knowledge Enabled Information and Services Science 15
    16. 16. Motivating Scenario • On-demand creation of domain knowledge improves individual comprehension of an event • Formal models are easy to use in information filtering • Validated information  Knowledge – Can be given back to the community to improve the overall amount of formal knowledge available on the Web – E.g. “Unknown” to DBPedia that the region of Kashmir belongs to both India and Pakistan Knowledge Enabled Information and Services Science 16
    17. 17. Importance of Model creation • Models support individual user or know- ledge worker, but also groups or system – More efficient communication through small, shared, agreeable conceptualizations • People  people • People  system • System  system – Classify or filter pertinent and topical information using models – Model-assisted searching and faceted or exploratory browsing using relationships – Reuse of validated knowledge Knowledge Enabled Information and Services Science
    18. 18. Domain Knowledge Models• Scientific applications – In-depth description of concepts – Narrow field – People  system, system  system • Annotation, reasoning ⇒Absolute correctness necessary (as far as possible)• General applications – Broad coverage of the field – Context – how does the new information fit in? – People  people, people  system • Individual domain comprehension, filtering, annotation ⇒Relative correctness sufficient Knowledge Enabled Information and Services Science 18
    19. 19. Model Creation Resources • Large models are available as reference – DBPedia, YAGO, UMLS, MeSH, GO … – Too big to be efficiently and effectively usable • Prior knowledge required to find pertinent resources • Other information is available in great abundance, but unformalized – Tacit expert knowledge – Scientific databases – Free text • peer reviewed journals and proceedings • General Web content Knowledge Enabled Information and Services Science 19
    20. 20. Epistemological Considerations • Knowledge – Ensure epistemological soundness of automated knowledge acquisition • Reference – Ensure that nodes in the models refer to real- world concepts/entities Knowledge Enabled Information and Services Science 20
    21. 21. Knowledge • Functional Definition – Knowledge = “Know-How” – Practical, but weak, Includes “Actionable Information” • Categorical Definition – Knowledge = Justified true belief – S knows that p iff i. p is true; ii. S believes that p; iii. S is justified in believing that p. Knowledge Enabled Information and Services Science
    22. 22. Belief and Justification • Belief – Statements held by the system • Justification – Trusted sources – Extraction algorithms • Bayesian, deductive or inductive reasoning • Macro-Reading algorithms  Wisdom of the crowds – Validation Knowledge Enabled Information and Services Science 22
    23. 23. Truth assessment of a statement • Is truth correspondence? – “A” is true No Access iff A (a true statement corresponds to an actual state of affairs) • Is truth coherence? – Does the statement fit into the system of other statements? • Is truth consensus? – agreement of correctness amongst a group ⇒In the cyclical model, achieve high degree of certainty by allowing constant validation Knowledge Enabled Information and Services Science
    24. 24. Domain Model – Reference • Model of a domain conceptually split – Domain Definition Concepts identified by URIs (classes, entities, relationship types)  ensures reference Remains static – necessity Rigid designators (Kripke) – Domain Description Relationships describe concepts Subject to change – possibility Definite descriptions (Russell) Knowledge Enabled Information and Services Science
    25. 25. Domain Definition • Top-down concept identification • Achieved through – Manual creation based on consensus in a group – Extraction from community-created or peer- reviewed conceptualization • Wikipedia • MeSH or UMLS Semantic Network Knowledge Enabled Information and Services Science
    26. 26. Domain Description • Possible to do top-down extraction of the domain description, e.g. from DBPedia • Problem: Formal concept descriptions are sparse – On average, DBPedia has less than 2 object properties per entity • Extract descriptions (facts) bottom-up – Available in text, DBs, etc. – Domain-specific molecular structure extractors (GlycO) – Domain independent IE techniques (Doozer++) Knowledge Enabled Information and Services Science
    27. 27. Knowledge Acquisition Approaches • KA in a tightly connected system – GlycO: domain-specific BioChemistry ontology • Manual domain definition and description • Partial automatic domain description • Domain-specific automatic validation • Manual validation for false negatives • KA in a loosely connected system – Doozer++: general domain-model creation framework • Automatic domain definition, top-down concept extraction • Automatic domain description, bottom-up fact extraction – Extraction from trusted sources – A trusted extraction and validation procedure • Domain-independent community-based validation Knowledge Enabled Information and Services Science
    28. 28. Knowledge Acquisition Approaches Knowledge Traditional GlycO Doozer++ Engineering Extraction Approach Approach Definition Top-Down Bottom-up Top-Down Top-Down Knowledge Conceptually, by Engineering extraction from Top-Down corpus Description Top-Down Bottom-up Bottom-up, Bottom-up, restricted by Top- restricted by down definition Top-down definition Verification Manual Manual Correctness: Community- automatic: based validation Exceptions: added manually Knowledge Enabled Information and Services Science
    29. 29. KA on the Web - Vision • Web searches, browsing sessions or classification task can be seen as creating an implicit domain model – World view, Concept coverage, Facts • Make models explicit and reusable using formal descriptions (RDF, OWL) • Validate the contained information and share with the community  Increase system‟s knowledge by “doing what you do”: Search, browse, click, communicate Knowledge Enabled Information and Services Science 29
    30. 30. KA in a Loosely Connected SystemDomain Model creationto gradually increase •Linked Dataoverall knowledge ofthe system • Free text• User-interest driven • Wikipedia• Incentive to • Web evaluate Domain Definition Validation Doozer++ScoonerEvaluation in Use: – Domain Definition:Semantic browsing Top-down conceptand retrieval, extractionDomain-independent, – Domain Description:Community-based Domain Description Pattern-based fact extraction Knowledge Enabled Information and Services Science 30
    31. 31. Domain Definition Requirements• Identify concepts, concept labels (denotations) and concept hierarchy• Challenge: define narrow boundaries for a domain while at the same time ensuring broad conceptual coverage within the domain Knowledge Enabled Information and Services Science 31
    32. 32. Domain definition - conceptual • Expand and Reduce approach – Start with „high recall‟ methods • Exploration – Full text search • Exploitation – Graph-Similarity Method • Category growth • “What could be in the domain?” – End with “high precision” methods • Apply restrictions on the concepts found • Remove terms and categories that fall outside the dense areas of the model graph • “What should be in the domain?” Knowledge Enabled Information and Services Science 32
    33. 33. Domain Description - Classifier • Concept-aware – Use concepts and concept labels from the domain definition step • Fact extraction as classification of concept pairs into relationship types – fclass: C C R – RS,O = {R | p(R,S,O) > ε} Knowledge Enabled Information and Services Science
    34. 34. Domain Description • Combined Language model and Semantic classification model • Language model: Surface-pattern – based – Pattern manifestations of relationships as features – Open to any corpus, language independent – Less computational overhead than NLP • Semantic Classification Model – Learned or assigned concept labels – Semantic types to aid classification Knowledge Enabled Information and Services Science
    35. 35. Domain Description - Implementation • Probabilistic Vector-space model – Each relationship is defined by vectors of • Pattern probabilities • Domain/range probabilities – Each concept is grounded by its semantic types and manifested by it‟s labels and their probabilities of identifying the concept – Sparse pattern representation (density ~2%) – White-box, easily verifiable – Inherently parallel Knowledge Enabled Information and Services Science
    36. 36. TerminologySymbol Meaning ExampleS, O Subject and Kelly_Miller_(scientist) Object concepts Howard_University (semantic)LS,LO Subject and “Kelly Miller” Object labels “Howard University”PLS,LO Phrase Kelly Miller graduated from Howard University instantiating the patternP Pattern <Subject> graduated from <Object>TS,TO Semantic type of Person Subject or Object Educational_InstitutionR relationship almaMater birthPlace Knowledge Enabled Information and Services Science 36
    37. 37. Probabilistic Classifier Semantic types. Labels taken Asserted in from Lexicon Ontology or or linked learned from corpus linked data Patterns learned from free text Knowledge Enabled Information and Services Science 37
    38. 38. Probabilistic ClassifierHow is Barack Obama related to Columbia University? p(R, Barack_Obama, Columbia_University) Sentence in corpus: Obama graduated in 1983 from Columbia University with a degree in political science and international relations. (Regular classification requires multiple examples) Knowledge Enabled Information and Services Science 38
    39. 39. Probabilistic Classifier Obama graduated in 1983 from Columbia University p(almaMater ,Barack_Obama, Columbia_University) = p(almaMater | “<Subject> graduated in 1983 from <Object>”) * p(Barack_Obama | ”Obama”) * p(Columbia_University | ”Columbia University”) * p(almaMater | domain(person)) * p(almaMater | range(academic_institution)) p(almaMater , Barack_Obama, Columbia_University) = 0.9 * 0.95 * 0.95 * 0.9 * 0.97 p(almaMater, Barack_Obama, Columbia_University) = 0.70909425 Knowledge Enabled Information and Services Science 39
    40. 40. Pattern Generalization • Problem: Low recall in pattern-based IE • Substitute terms with wild cards – No POS tagging, hence only “*” wild cards • Mirrors shortest paths through parse trees <Subject> graduated in 1983 from <Object> <Subject> * in 1983 from <Object> <Subject> graduated * 1983 from <Object> <Subject> * * 1983 from <Object> <Subject> graduated in * from <Object> <Subject> * in * from <Object> <Subject> graduated * * from <Object> <Subject> * * * from <Object> Knowledge Enabled Information and Services Science 40
    41. 41. Learning p(R|P) • Distantly Supervised Training • Collect pattern frequencies for training examples – Fact triples <S, R, O> e.g. from Linked Data (DBPedia, UMLS) – Manifestations of facts in text in the form of patterns (corpus e.g. Web, Wikipedia, MedLine) • For relationship Ri, aggregate pattern vectors representing <*, Ri, *> Knowledge Enabled Information and Services Science 41
    42. 42. Learning p(R|P) – naïve • For each vector Ri containing pattern frequencies for relationship Ri, compute • #Patternj that occur with terms denoting each <S, O> Ri in normalized by all pattern occurrences for Ri Knowledge Enabled Information and Services Science 42
    43. 43. Learning p(R|P) – naïve • Uniform distribution of relationships assumed – As the number of relationship types grows), the prior of each type goes towards 0. – normalize the probabilities over the column vector to get p(Ri|Pj) • Vector space representation – Relationship-pattern matrix – R2Pij = p(Ri|Pj) Knowledge Enabled Information and Services Science 43
    44. 44. Problem: Relationship Similarities • Extensional similarity – Semantically different relationships can share Subject-Object pairs in training data • Intensional similarity – Overlap and entailment of relationship types • Types should not be seen as discrete – E,g, physical_part_of part_of • Apriori unknown which types overlap unless formal description available – Semantically similar types compete for the same patterns Knowledge Enabled Information and Services Science 44
    45. 45. Relationship similarities Pertinence Measure similarity between pattern vectors as approximation of intensional similarity Knowledge Enabled Information and Services Science 45
    46. 46. Pertinence for Relationships Do not punish the occurrence of the same pattern with relationship types that are intensionally similar, but extensionally dissimilar Reduce impact of extensionally similar relations Knowledge Enabled Information and Services Science 46
    47. 47. Pertinence Example Pattern: <Subject> in the right <Object> Relationship p(R|P) biological_process_has_associated_location 0.968371381 disease_has_associated_anatomic_site 0.880452774 part_of 0.622532958 has_finding_site 0.561041318 has_location 0.537424451 has_direct_procedure_site 0.363832078 Sum: 3.933654958 Note: This never causes p(R,S,O) > 1 Knowledge Enabled Information and Services Science 47
    48. 48. Similarities between relationships Knowledge Enabled Information and Services Science 48
    49. 49. Pertinence evaluation 0.8 0.7 0.6 0.5Precision 0.4 Pertinence 0.3 No Pertinence 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 Recall Knowledge Enabled Information and Services Science 49
    50. 50. Fact extraction evaluation - DBPedia60% training set, 40% testing, DBPedia Infobox fact corpus, Wikipedia text corpus Precision / Recall Strict evaluation: Only 1st ranked extracted relation is compared to gold- standard. Averaged over 107 Confidence Threshold relation types. Knowledge Enabled Information and Services Science 50
    51. 51. Sample results (DBPedia) suggested Extracted Rank 1 Subject :: Object Relationship (Rel;Confidence) Rank 2 Rank 3 Howard Pawley :: successor; after; office; after Gary Filmon 0.799 0.768 0.686 nextSingle; followedBy; after; Mulan :: Tarzan after 0.603 0.533 0.416 Species Deceases:: producer; artist; genre; artist Midnight Oil 0.761 0.719 0.467 The Crystal City :: artist; author; writer; author Orson Scott Card 0.625 0.617 0.583 Horatio Allen :: before predecessor;0.629 before;0.475 William Maxwell Basdeo Panday :: birthplace; nationality; birthplace deathPlace;0.658 Trinidad &Tobago 0.658 0.330 Bob Nystrom :: birthplace cityOfBirth;0.677 birthplace;0.513 Stockholm Beccles railway borough; friend; borough district;0.772 station :: Suffolk 0.770 0.749 Knowledge Enabled Information and Services Science 51
    52. 52. Fact extraction evaluation - UMLS60% training set, 40% testing, UMLS fact corpus, MedLine text corpus Precision / Recall Strict evaluation: Only 1st ranked extracted relation is compared to gold- standard. Averaged over Confidence Threshold ~100 relation types. Knowledge Enabled Information and Services Science 52
    53. 53. Sample results (UMLS) Subject :: Object suggested Relationship Extracted Rank 1 Teeth::poisoning, fluoride finding_site_of finding_site_of 768 polyps::polyp of cervix nos associated_with associated_with (disorder) neck of uterus::polyp of cervix nos location_of finding_site_of (disorder) benign neoplasms::polyp of colon related_to associated_with brain ischemia::brain has_finding_site location_of is_primary_anatomic_ gastrointestinal tract::polyp of colon location_of site_of_disease gamete structure (cell is_normal_cell_origin_ is_normal_cell_ structure)::polyvesicular vitelline of_disease origin_of_disease tumor Knowledge Enabled Information and Services Science 53
    54. Comparison - DBPedia corpus. Precision vs. recall curves for Mintz-POS, Mintz-NLP, Doozer++ (R), and Doozer++ (P). Mintz: extraction of 102 relationship types from Freebase; Doozer++: 107 from DBPedia. (R) = recall-oriented, using pattern generalization; (P) = precision-oriented, no generalization. M. Mintz, S. Bills, R. Snow, and D. Jurafsky, "Distant supervision for relation extraction without labeled data," ACL 2009.
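The distant-supervision setup underlying this comparison (Mintz et al., and the KB-seeded training used here) can be sketched roughly as follows; the corpus and knowledge-base contents are toy assumptions:

```python
# Distant-supervision sketch: sentences that mention a (subject, object)
# pair found in the knowledge base are labeled with the KB relation,
# producing training data without any manual annotation.

kb = {("Mulan", "Tarzan"): "followedBy"}  # toy knowledge base

sentences = [
    ("Mulan", "Tarzan", "Mulan was followed by Tarzan in 1999."),
    ("Mulan", "China", "Mulan is set in China."),  # no KB fact -> unlabeled
]

def distant_label(sentences, kb):
    """Pair each sentence with the KB relation of its entity pair, if any."""
    training = []
    for subj, obj, text in sentences:
        rel = kb.get((subj, obj))
        if rel is not None:
            training.append((text, rel))
    return training

data = distant_label(sentences, kb)
# -> [("Mulan was followed by Tarzan in 1999.", "followedBy")]
```

The labeled sentences then supply the surface patterns from which a relation classifier is trained.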
    55. Evaluate Ad-Hoc Model Creation: on-demand creation of models. Domain | Query (domain definition) | Number of concepts | Precision:
    Semantic Web | "Semantic Web" OWL ontologies RDF | 143 | 0.98
    Harry Potter | "Harry Potter" dumbledore gryffindor slytherin | 134 | 0.98
    Beatles | Beatles "John Lennon" "Paul McCartney" song | 250 | 0.99
    India-Pakistan Relations | India Pakistan Kashmir | 129 | 0.99
    US Financial crisis - TARP | tarp "financial crisis" "toxic assets" | 146 | 0.93
    German Chancellors | German chancellors "Angela Merkel" "Helmut Kohl" | 124 | 0.91
    56. Ad-Hoc Model Creation - Evaluation
    57. Ad-Hoc Model Creation - Evaluation. Relative recall, i.e., recall with respect to the possible extractions: the maximum number of extracted facts marks 100% recall.
    58. Related Work. Overlap of approaches (Venn diagram): Mintz, SOFIE, and Turney use surface patterns only.
    59. Main Differences
    • Surface patterns only
    • Only positive training examples
    • Pertinence measure for semantic similarity
    • Concept-aware: start with defined concepts
    • Include background knowledge in probabilistic classification instead of rule-based reasoning
    60. Related work: pattern-based fact extraction
    • E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. JCDL 2000.
    • F. M. Suchanek, M. Sozio, and G. Weikum. SOFIE: A Self-Organizing Framework for Information Extraction. WWW 2009.
    • T. M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka, and R. Wang. Populating the Semantic Web by Macro-Reading Internet Text. ISWC 2009.
    • M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. AAAI 2006.
    61. Related work: relationship-pattern computations
    • P. D. Turney and P. Pantel. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 2010.
    • P. D. Turney. Expressing implicit semantic relations without supervision. ACL 2006.
    62. Summary: Fact extraction
    • Pattern-based fact extraction with generalization and Pertinence achieves competitive precision and recall while being computationally feasible for large-scale extraction; the Pertinence computation can also serve as a preprocessing step for other ML techniques.
    • Different types of background knowledge are incorporated into one statistical framework, combining a language model and a semantic model.
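The pattern generalization mentioned above can be illustrated with a small sketch; the wildcard scheme here is an assumption for illustration, not the exact generalization used by Doozer++:

```python
# Pattern-generalization sketch: lexical surface patterns between a
# subject and object mention are relaxed by replacing low-information
# tokens with wildcards, so rare patterns match more contexts (the
# recall-oriented (R) configuration on the comparison slide).

STOPWORDS = {"the", "a", "an", "of", "in", "by"}

def generalize(pattern):
    """Replace stopword tokens with a '*' wildcard."""
    return tuple("*" if tok in STOPWORDS else tok for tok in pattern)

def matches(pattern, tokens):
    """Token-for-token match; '*' matches any single token."""
    return len(pattern) == len(tokens) and all(
        p == "*" or p == t for p, t in zip(pattern, tokens))

pat = generalize(("was", "born", "in", "the", "city"))
# pat == ("was", "born", "*", "*", "city")
```

The generalized pattern now also matches, e.g., "was born in a city", trading some precision for recall, which is exactly the (R) vs. (P) trade-off shown earlier.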
    63. Application and Knowledge Validation. Example: a domain model as the basis for research in the area of human cognitive performance. Sources: 18 million MedLine publications/abstracts, the UMLS Metathesaurus, and Wikipedia. Doozer++ provides hierarchy extraction and pattern-based fact extraction; Scooner provides semantic browsing and retrieval (evaluation in use).
    64. Domain Definition - Extracted Hierarchy. A hierarchy extracted for a cognitive science domain model. The keyword description given to the system was a collection of terms relevant to human performance and cognition.
    65. Domain Description: Connect Concepts
    66. Expert Evaluation of Facts in the Model. Histogram of expert scores (1-9), showing the fraction of facts in each bin and cumulative incorrect, correct, and interesting curves. Score bands: 1-2 = information that is overall incorrect; 3-4 = information that is somewhat correct; 5-6 = correct general information; 7-9 = correct information that is not commonly known.
    67. Extractor Confidence vs. Correctness. Analysis shows that the highest-quality extractions have the highest confidence, but some incorrectly extracted facts also have high confidence: high-quality patterns as well as some noise patterns have high indicative power.
    68. Extractor Confidence vs. Correctness. Many facts deemed interesting were extracted based on highly specialized patterns in the long tail of the frequency distribution; noisy patterns also tend to occupy this space.
    69. Sources of Errors
    • Extracted relationship too specific, or formally incorrect but metaphorically correct: <Interpeduncular_Cistern disease_has_associated_anatomic_site Cerebral_peduncle> is incorrect because the Interpeduncular Cistern is not a disease; however, it does have the associated anatomic site Cerebral peduncle.
    • Incorrect directionality: <Pituitary_Gland sends_output_to Supraoptic_nucleus> should be <Supraoptic_nucleus sends_output_to Pituitary_Gland>. Direction is often expressed in the surrounding context rather than in the immediate pattern.
    70. Validation
    • Extracted statements need to be validated to be considered knowledge
      – Explicit validation, e.g. thumbs up/down
      – Implicit validation, e.g. by analyzing click streams
    71. Explicit Validation
    • Certainty of reference: we know exactly which statement was validated
    • Validator credentials can be obtained, e.g. a small community of experts may evaluate
    • Extra work: explicit validation is a task that is consciously performed
    72. Implicit Validation
    • Find indications of correctness or incorrectness in the way users interact with the presented information
      – Every action taken on a piece of information is recorded and analyzed
      – The cumulative behavior of the users indicates which propositions are correct or interesting
    73. Implicit Validation
    • Examples of implicit community validation: games with a purpose (L. von Ahn), Google search rankings
    • Scooner semantic browser: browse literature along facts in a model; browsing trails suggest correct extraction
    74. Implicit Validation
    • A fact is browsed very often by different users: the fact is interesting to many users, or it is surprising and interesting but may be incorrect.
    • A user follows a trail of multiple fact-triples through a variety of documents: the facts that were browsed have a high probability of being correct, and support is added to the triples.
      – If the trail was longer than suggested by a small-world phenomenon, initial triples may have been incorrect but led to interesting ones. For this reason, only the last k triples of the trail should garner support, or support should increase toward the last k triples.
      – The last triple in the trail may have been incorrect and led to browsing results that caused the user to stop browsing. For this reason, the last triple of the trail should be treated with caution.
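The last-k support rule described on this slide can be sketched as follows; the function and parameter names are illustrative assumptions, not Scooner's actual implementation:

```python
# Trail-support sketch: when a user follows a trail of fact-triples,
# only the final k triples accrue validation support, and the very
# last triple is held out because it may be what ended the session.

from collections import Counter

def credit_trail(trail, support, k=3):
    """Add support to the last k browsed triples, excluding the final one."""
    confirmed = trail[:-1][-k:]  # drop the last triple, keep up to k before it
    for triple in confirmed:
        support[triple] += 1
    return support

support = Counter()
trail = [("a", "r1", "b"), ("b", "r2", "c"),
         ("c", "r3", "d"), ("d", "r4", "e")]
credit_trail(trail, support, k=2)
# ("b","r2","c") and ("c","r3","d") gain support; the initial and the
# final triple do not.
```

Accumulating these counts over many users' trails gives the cumulative, implicit validation signal the slide describes.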
    75. Validation "through use". Workflow: enter search terms, choose the entity of interest, browse extracted facts, and choose relevant literature that supports the fact.
    76. Validation "through use". Find another interesting fact; fact trails are recorded.
    77. Validation "through use". The path suggests that at least the first 2 triples are factually correct.
    78. Browsed Facts Examples
    79. Related work: evaluation and use
    • E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of SIGIR '06, 2006.
    • A. Das, M. Datar, A. Garg, and S. Rajaram. Google News Personalization: Scalable Online Collaborative Filtering. In Proceedings of the 16th International Conference on World Wide Web. ACM, 2007.
    80. Summary: Knowledge Acquisition
    • The model reflects what the user is interested in at the point of creation, which creates willingness to help validate facts; applications allow for implicit and explicit evaluation.
    • Validated statements can be merged with existing knowledge: automated acquisition is completed, and individual-driven KA improves the overall system.
    • R. Kavuluru, C. Thomas et al. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012.
    • A. Sheth, C. Thomas, P. Mehra. Continuous Semantics to Analyze Real-Time Data. IEEE Internet Computing, Nov./Dec. 2010.
    • C. Thomas et al. Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010.
    • C. Thomas et al. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. WI 2008.
    81. Future Directions
    • Active learning to improve classification: easy in a tightly connected system (e.g. NELL); a feedback mechanism is needed for loosely connected systems
    • Improve depth of classification: augment the domain description with concept hierarchies learned from text (e.g. Navigli)
    • Knowledge management for background knowledge: belief updates, model evolution
    82. Contributions. Diagram labels: Conceptual Knowledge (Ontologies, LoD); Textual Information (Wikipedia, Web).
    • Knowledge Representation [IJSWIS, CR, FLSW]
    • Ontology design [WWW, FOIS]
    • Knowledge merging / Ontology alignment [AAAI, WebSem2, SWSWPC]
    • Information Quality [WI2]
    • Social processes for content creation [CHB]
    • Social processes for knowledge validation [IHI, WebSci, CHB]
    • Taxonomy extraction [WI1, WebSci, WebSem1]
    • Event modeling [IEEE-IC]
    • Relationship/Fact/Event extraction [IHI, WebSem1, IEEE-IC, WebSci]
    83. Journal/Conference Publications
    [WebSem] C. Thomas, P. Mehra, A. Sheth, W. Wang, G. Weikum. Automatic domain model creation using pattern-based fact extraction. Submitted to Journal of Web Semantics.
    [IHI] R. Kavuluru, C. Thomas, A. Sheth, V. Chan, W. Wang, A. Smith, A. Sato and A. Walters. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT International Health Informatics Symposium, January 28-30, 2012.
    [IEEE-IC] A. Sheth, C. Thomas, P. Mehra. Continuous Semantics to Analyze Real-Time Data. IEEE Internet Computing, vol. 14, no. 6, pp. 84-89, Nov./Dec. 2010, doi:10.1109/MIC.2010.137.
    [WebSci] C. Thomas, W. Wang, P. Mehra and A. Sheth. What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010.
    [WI1] C. Thomas, P. Mehra, R. Brooks, and A. Sheth. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 1:496-502, 2008.
    84. Journal/Conference Publications
    [WI2] C. Thomas and A. Sheth. Semantic Convergence of Wikipedia Articles. In Proceedings of the 2007 IEEE/WIC International Conference on Web Intelligence, pages 600-606, Washington, DC, USA, November 2007. IEEE Computer Society.
    [WWW] S. S. Sahoo, C. Thomas, A. Sheth, W. S. York, and S. Tartir. Knowledge Modeling and its Application in Life Sciences: A Tale of two Ontologies. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 317-326, New York, NY, USA, 2006. ACM Press.
    [FOIS] C. Thomas, A. Sheth, and W. York. Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain. In Proceedings of the Fourth International Conference on Formal Ontology in Information Systems (FOIS 2006), pages 115-127, Amsterdam (NL), 2006. IOS Press.
    [AAAI] P. Doshi and C. Thomas. Inexact matching of ontology graphs using expectation-maximization. In AAAI '06: Proceedings of the 21st National Conference on Artificial Intelligence, pages 1277-1282. AAAI Press, 2006.
    85. Publications
    [CHB] C. Thomas and A. Sheth. Web Wisdom - An Essay on How Web 2.0 and Semantic Web can foster a Global Knowledge Society. Computers in Human Behavior, Elsevier.
    [WebSem2] P. Doshi, R. Kolli, and C. Thomas. Inexact matching of ontology graphs using expectation-maximization. Web Semantics: Science, Services and Agents on the World Wide Web, 7(2):90-106, 2009.
    [IJWGS] V. Kashyap, C. Ramakrishnan, C. Thomas, and A. Sheth. TaxaMiner: an experimentation framework for automated taxonomy bootstrapping. International Journal of Web and Grid Services, 1(2):240-266, 2005.
    [IJSWIS] A. P. Sheth, C. Ramakrishnan, and C. Thomas. Semantics for the semantic web: The implicit, the formal and the powerful. Int. J. Semantic Web Inf. Syst., 1(1):1-18, 2005.
    [CR] S. Sahoo, C. Thomas, A. Sheth, C. Henson, and W. York. GLYDE - an expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340(18):2802-2807, 2005.
    86. Other Publications
    Workshop Publications
    [SWLS] A. Sheth, W. York, C. Thomas, M. Nagarajan, J. Miller, K. Kochut, S. Sahoo, and X. Yi. Semantic Web technology in support of Bioinformatics for Glycan Expression. In W3C Workshop on Semantic Web for Life Sciences, pages 27-28, 2004.
    [SWSWPC] N. Oldham, C. Thomas, A. Sheth, and K. Verma. METEOR-S Web Service Annotation Framework with Machine Learning Classification. Semantic Web Services and Web Process Composition, pages 137-146, 2005, Springer.
    Book Chapters
    [FLSW] C. Thomas and A. Sheth. On the expressiveness of the languages for the semantic web - making a case for a little more. Fuzzy Logic and the Semantic Web, pages 3-20, 2006.
    Patent
    [PAT] P. Mehra, R. Brooks and C. Thomas. Ontology Creation by Reference to a Knowledge Corpus. Pub. No. US 2010/0280989 A1.
    87. Research, Collaborations, Proposals, and Tools
    • Research: KR Center; domain model extraction / IE
    • Collaborations: Complex Carbohydrate Research Center at UGA; HP Labs Palo Alto; Human Performance Directorate, AFRL
    • Proposals: HP Incubation & Innovation grant for Doozer++; AFRL grant largely based on Doozer++; NSF proposal submitted with "very good" reviews
    • Tools and Ontologies: GlycO; GlycoViz; Doozer++; Scooner
    88. Thank you! Shaojun Wang, Amit Sheth, Pascal Hitzler, Pankaj Mehra, Gerhard Weikum. Thanks to all Kno.e.sis Center Members - Past and Present.
    89. Thank you