PhD thesis defense of Christopher Thomas
Christopher Thomas defended his thesis on "Knowledge Acquisition in a System".
Video can be found at: http://www.youtube.com/watch?v=NeQomGsJvDk

Speaker notes

  • Shared vocabulary
  • To get the probability of seeing a relationship for a given concept pair, we average over all occurrences of phrases that contain labels for the concept pair, taking into account the probabilities that the term pair actually denotes the concept pair and, if available, whether the types of subject and object are likely to occur with that relationship.
  • Show how pattern probabilities and background knowledge interact
  • Shortcoming: patterns are treated as independent, even though they may have been on the same path through a parse tree.
  • Pertinence has most influence in high-recall regions. Intuitively, as the threshold is increased, patterns that are highly indicative of specific relationships contribute more to the classification and thus the advantage of the pertinence method is slightly diminished.
  • Doozer(R) – recall-oriented, generalized; Doozer(P) – precision-oriented, not generalized
  • None of the facts were previously found in UMLS
  • It is important to know how correct information was extracted. The probabilistic classifier easily allows for analysis of the patterns that were underlying the extraction. The slide shows how extraction quality measures up against extraction confidence.

PhD thesis defense of Christopher Thomas Presentation Transcript

  • 1. Knowledge Acquisition in a System. Christopher Thomas, Ohio Center of Excellence in Knowledge-enabled Computing - Kno.e.sis, Wright State University, Dayton, OH. topher@knoesis.org
  • 2. Circle of knowledge in a System Knowledge Enabled Information and Services Science 2
  • 3. Dissertation Overview (diagram): Conceptual Knowledge (Ontologies, LoD); Textual Information (Wikipedia, Web); Knowledge Representation [IJSWIS, CR, FLSW]; Ontology design [WWW, FOIS]; Knowledge merging/Ontology alignment [AAAI, WebSem2, SWSWPC]; Information Quality [WI2]; Social processes for content creation [CHB]; Social processes for knowledge validation [IHI, WebSci, CHB]; Doozer++: Taxonomy extraction and Relationship/Fact extraction [IHI, WebSem1, IEEE-IC, WebSci, WI1]. Knowledge Enabled Information and Services Science 3
  • 4. Talk Contents: What is knowledge? How do we turn propositions/beliefs into knowledge? How do we acquire information? Knowledge Enabled Information and Services Science 4
  • 5. Talk outline • Motivation • Knowledge Acquisition (KA) Overview • KA in a loosely connected system – Doozer++ – Automatic formal domain model creation – Information Extraction • Top-Down • Bottom-Up – Information Validation “in use” • Conclusion Knowledge Enabled Information and Services Science 5
  • 6. Larger Context of automated KA • Increasing significance of knowledge economy – "Knowledge Workers" spend 38% of their time searching for information (McDermott, 2005) – Vital to get a quick and still comprehensive understanding of a field through pertinent concepts/entities and relations/interactions • Increased demand for formally available knowledge in semantic models – Filtering, browsing, annotation, reasoning. McDermott, M. "Knowledge Workers: How can you gauge their effectiveness." Leadership Excellence, Vol. 22.10, October 2005. Knowledge Enabled Information and Services Science
  • 7. Motivating Scenario • Learn about a new subject – E.g. gain a quick overview of a current or historical event • Use a formal representation of the gained overview to filter information – Facilitate in-depth exploration • Use the formalized information and the user interaction to create knowledge from information Knowledge Enabled Information and Services Science 7
  • 8. Motivating Scenario • Google: India • Brief description – demographic and geographic information, etc. Knowledge Enabled Information and Services Science 8
  • 9. Motivating Scenario • Google: India • Regular Web results Knowledge Enabled Information and Services Science 9
  • 10. Motivating Scenario • Clicking on a link to the Wikipedia entry shows that there have been conflicts with Pakistan over the region of Kashmir ⇒ Investigate more Knowledge Enabled Information and Services Science 10
  • 11. Motivating Scenario • Google: India Pakistan Kashmir • Only Web results and news So far, search engines only display facts about entities, not relationships or larger contexts Knowledge Enabled Information and Services Science 11
  • 12. Motivating Scenario • Beneficial to get an overview "at a glance" of a domain. • Automated approach to creating knowledge models for focused areas of interest • Create models around an incomplete or rudimentary keyword description and "anticipate" the user's intentions wrt. the full context Knowledge Enabled Information and Services Science 12
  • 13. Motivating Scenario Doozer++: india pakistan kashmir • Important concepts and relationships describing the context Knowledge Enabled Information and Services Science 13
  • 14. Motivating Scenario • Filtered IR using concepts in the model • Concepts and relationships that contributed to clicked results gain support • User can explicitly approve content Knowledge Enabled Information and Services Science 14
  • 15. Circle of Knowledge (Example) Knowledge Enabled Information and Services Science 15
  • 16. Motivating Scenario • On-demand creation of domain knowledge improves individual comprehension of an event • Formal models are easy to use in information filtering • Validated information ⇒ Knowledge – Can be given back to the community to improve the overall amount of formal knowledge available on the Web – E.g. "Unknown" to DBPedia that the region of Kashmir belongs to both India and Pakistan Knowledge Enabled Information and Services Science 16
  • 17. Importance of Model creation • Models support individual user or knowledge worker, but also groups or system – More efficient communication through small, shared, agreeable conceptualizations • People ↔ people • People ↔ system • System ↔ system – Classify or filter pertinent and topical information using models – Model-assisted searching and faceted or exploratory browsing using relationships – Reuse of validated knowledge Knowledge Enabled Information and Services Science
  • 18. Domain Knowledge Models • Scientific applications – In-depth description of concepts – Narrow field – People ↔ system, system ↔ system • Annotation, reasoning ⇒ Absolute correctness necessary (as far as possible) • General applications – Broad coverage of the field – Context – how does the new information fit in? – People ↔ people, people ↔ system • Individual domain comprehension, filtering, annotation ⇒ Relative correctness sufficient Knowledge Enabled Information and Services Science 18
  • 19. Model Creation Resources • Large models are available as reference – DBPedia, YAGO, UMLS, MeSH, GO … – Too big to be efficiently and effectively usable • Prior knowledge required to find pertinent resources • Other information is available in great abundance, but unformalized – Tacit expert knowledge – Scientific databases – Free text • peer reviewed journals and proceedings • General Web content Knowledge Enabled Information and Services Science 19
  • 20. Epistemological Considerations • Knowledge – Ensure epistemological soundness of automated knowledge acquisition • Reference – Ensure that nodes in the models refer to real-world concepts/entities Knowledge Enabled Information and Services Science 20
  • 21. Knowledge • Functional Definition – Knowledge = "Know-How" – Practical but weak; includes "Actionable Information" • Categorical Definition – Knowledge = Justified true belief – S knows that p iff i. p is true; ii. S believes that p; iii. S is justified in believing that p. Knowledge Enabled Information and Services Science
  • 22. Belief and Justification • Belief – Statements held by the system • Justification – Trusted sources – Extraction algorithms • Bayesian, deductive or inductive reasoning • Macro-Reading algorithms ⇒ Wisdom of the crowds – Validation Knowledge Enabled Information and Services Science 22
  • 23. Truth assessment of a statement • Is truth correspondence? – "A" is true iff A (a true statement corresponds to an actual state of affairs) – but there is no direct access to the state of affairs • Is truth coherence? – Does the statement fit into the system of other statements? • Is truth consensus? – agreement of correctness amongst a group ⇒ In the cyclical model, achieve high degree of certainty by allowing constant validation Knowledge Enabled Information and Services Science
  • 24. Domain Model – Reference • Model of a domain conceptually split – Domain Definition: concepts identified by URIs (classes, entities, relationship types) ⇒ ensures reference; remains static – necessity; rigid designators (Kripke) – Domain Description: relationships describe concepts; subject to change – possibility; definite descriptions (Russell) Knowledge Enabled Information and Services Science
  • 25. Domain Definition • Top-down concept identification • Achieved through – Manual creation based on consensus in a group – Extraction from community-created or peer- reviewed conceptualization • Wikipedia • MeSH or UMLS Semantic Network Knowledge Enabled Information and Services Science
  • 26. Domain Description • Possible to do top-down extraction of the domain description, e.g. from DBPedia • Problem: Formal concept descriptions are sparse – On average, DBPedia has less than 2 object properties per entity • Extract descriptions (facts) bottom-up – Available in text, DBs, etc. – Domain-specific molecular structure extractors (GlycO) – Domain independent IE techniques (Doozer++) Knowledge Enabled Information and Services Science
  • 27. Knowledge Acquisition Approaches • KA in a tightly connected system – GlycO: domain-specific BioChemistry ontology • Manual domain definition and description • Partial automatic domain description • Domain-specific automatic validation • Manual validation for false negatives • KA in a loosely connected system – Doozer++: general domain-model creation framework • Automatic domain definition, top-down concept extraction • Automatic domain description, bottom-up fact extraction – Extraction from trusted sources – A trusted extraction and validation procedure • Domain-independent community-based validation Knowledge Enabled Information and Services Science
  • 28. Knowledge Acquisition Approaches
    Definition – Knowledge Engineering approach: Top-Down (knowledge engineering); Traditional Extraction approach: Bottom-up (conceptually, by extraction from corpus); GlycO: Top-Down; Doozer++: Top-Down
    Description – Knowledge Engineering approach: Top-Down; Traditional Extraction approach: Bottom-up; GlycO: Bottom-up, restricted by Top-down definition; Doozer++: Bottom-up, restricted by Top-down definition
    Verification – Knowledge Engineering approach: Manual; Traditional Extraction approach: Manual; GlycO: Correctness: automatic, Exceptions: added manually; Doozer++: Community-based validation
    Knowledge Enabled Information and Services Science
  • 29. KA on the Web - Vision • Web searches, browsing sessions or classification tasks can be seen as creating an implicit domain model – World view, Concept coverage, Facts • Make models explicit and reusable using formal descriptions (RDF, OWL) • Validate the contained information and share with the community ⇒ Increase the system's knowledge by "doing what you do": Search, browse, click, communicate Knowledge Enabled Information and Services Science 29
  • 30. KA in a Loosely Connected System • Domain Model creation to gradually increase overall knowledge of the system • User-interest driven • Incentive to evaluate • Sources: Linked Data, free text, Wikipedia, Web • Doozer++ – Domain Definition: top-down concept extraction – Domain Description: pattern-based fact extraction • Scooner – Evaluation in Use: semantic browsing and retrieval, domain-independent, community-based validation Knowledge Enabled Information and Services Science 30
  • 31. Domain Definition Requirements • Identify concepts, concept labels (denotations) and concept hierarchy • Challenge: define narrow boundaries for a domain while at the same time ensuring broad conceptual coverage within the domain Knowledge Enabled Information and Services Science 31
  • 32. Domain definition - conceptual • Expand and Reduce approach – Start with 'high recall' methods • Exploration – Full text search • Exploitation – Graph-Similarity Method • Category growth • "What could be in the domain?" – End with "high precision" methods • Apply restrictions on the concepts found • Remove terms and categories that fall outside the dense areas of the model graph • "What should be in the domain?" Knowledge Enabled Information and Services Science 32
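    The Expand and Reduce loop of slide 32 can be sketched roughly as follows. This is only an illustration of the idea, not the Doozer++ implementation; search, related_categories, graph_density and the density threshold are hypothetical placeholders.

    # Rough sketch of the Expand and Reduce idea (slide 32), not the actual
    # Doozer++ code. `search`, `related_categories` and `graph_density` are
    # hypothetical callables supplied by the caller.
    def expand_and_reduce(seed_terms, search, related_categories, graph_density,
                          density_threshold=0.3):
        # Expand ("What could be in the domain?"): high-recall steps.
        candidates = set()
        for term in seed_terms:
            candidates.update(search(term))                  # exploration: full-text search
        for concept in list(candidates):
            candidates.update(related_categories(concept))   # exploitation: category growth

        # Reduce ("What should be in the domain?"): high-precision restriction.
        # Keep only concepts that lie in dense regions of the candidate graph.
        return {c for c in candidates
                if graph_density(c, candidates) >= density_threshold}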
  • 33. Domain Description - Classifier • Concept-aware – Use concepts and concept labels from the domain definition step • Fact extraction as classification of concept pairs into relationship types – fclass: C × C → R – R_{S,O} = {R | p(R, S, O) > ε} Knowledge Enabled Information and Services Science
  • 34. Domain Description • Combined Language model and Semantic classification model • Language model: Surface-pattern – based – Pattern manifestations of relationships as features – Open to any corpus, language independent – Less computational overhead than NLP • Semantic Classification Model – Learned or assigned concept labels – Semantic types to aid classification Knowledge Enabled Information and Services Science
  • 35. Domain Description - Implementation • Probabilistic Vector-space model – Each relationship is defined by vectors of • Pattern probabilities • Domain/range probabilities – Each concept is grounded by its semantic types and manifested by its labels and their probabilities of identifying the concept – Sparse pattern representation (density ~2%) – White-box, easily verifiable – Inherently parallel Knowledge Enabled Information and Services Science
  • 36. Terminology
    S, O – Subject and Object concepts (semantic) – Example: Kelly_Miller_(scientist), Howard_University
    LS, LO – Subject and Object labels – Example: "Kelly Miller", "Howard University"
    P_{LS,LO} – Phrase instantiating the pattern – Example: Kelly Miller graduated from Howard University
    P – Pattern – Example: <Subject> graduated from <Object>
    TS, TO – Semantic type of Subject or Object – Example: Person, Educational_Institution
    R – relationship – Example: almaMater, birthPlace
    Knowledge Enabled Information and Services Science 36
  • 37. Probabilistic Classifier – Semantic types: asserted in ontology or linked data. Labels: taken from lexicon or learned from corpus. Patterns: learned from free text. Knowledge Enabled Information and Services Science 37
  • 38. Probabilistic Classifier – How is Barack Obama related to Columbia University? p(R, Barack_Obama, Columbia_University) Sentence in corpus: Obama graduated in 1983 from Columbia University with a degree in political science and international relations. (Regular classification requires multiple examples) Knowledge Enabled Information and Services Science 38
  • 39. Probabilistic Classifier – Obama graduated in 1983 from Columbia University. p(almaMater, Barack_Obama, Columbia_University) = p(almaMater | "<Subject> graduated in 1983 from <Object>") * p(Barack_Obama | "Obama") * p(Columbia_University | "Columbia University") * p(almaMater | domain(person)) * p(almaMater | range(academic_institution)) = 0.9 * 0.95 * 0.95 * 0.9 * 0.97 = 0.70909425 Knowledge Enabled Information and Services Science 39
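    A minimal sketch of the factorization shown on slide 39. The probability tables below are toy stand-ins filled with the numbers from the slide; the lookup structure and function names are assumptions, not the actual Doozer++ code.

    # Toy sketch of the slide-39 factorization: p(R, S, O) is the product of the
    # pattern probability, the label-to-concept probabilities and the domain/range
    # (semantic type) probabilities. Values are the ones shown on the slide.
    p_rel_given_pattern = {("almaMater", "<Subject> graduated in 1983 from <Object>"): 0.9}
    p_concept_given_label = {("Barack_Obama", "Obama"): 0.95,
                             ("Columbia_University", "Columbia University"): 0.95}
    p_rel_given_domain = {("almaMater", "person"): 0.9}
    p_rel_given_range = {("almaMater", "academic_institution"): 0.97}

    def score(rel, pattern, subj, subj_label, subj_type, obj, obj_label, obj_type):
        return (p_rel_given_pattern.get((rel, pattern), 0.0)
                * p_concept_given_label.get((subj, subj_label), 0.0)
                * p_concept_given_label.get((obj, obj_label), 0.0)
                * p_rel_given_domain.get((rel, subj_type), 0.0)
                * p_rel_given_range.get((rel, obj_type), 0.0))

    p = score("almaMater", "<Subject> graduated in 1983 from <Object>",
              "Barack_Obama", "Obama", "person",
              "Columbia_University", "Columbia University", "academic_institution")
    print(round(p, 8))  # 0.70909425, as on the slide

    # Slide 33: keep every relationship whose score exceeds a threshold epsilon,
    # i.e. R_{S,O} = {R | p(R, S, O) > epsilon}.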
  • 40. Pattern Generalization • Problem: Low recall in pattern-based IE • Substitute terms with wild cards – No POS tagging, hence only “*” wild cards • Mirrors shortest paths through parse trees <Subject> graduated in 1983 from <Object> <Subject> * in 1983 from <Object> <Subject> graduated * 1983 from <Object> <Subject> * * 1983 from <Object> <Subject> graduated in * from <Object> <Subject> * in * from <Object> <Subject> graduated * * from <Object> <Subject> * * * from <Object> Knowledge Enabled Information and Services Science 40
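    The wildcard generalization of slide 40 can be sketched as below. In the slide's eight examples the token immediately before <Object> ("from") is never replaced, so the sketch keeps that token fixed; that detail is an assumption, since the exact rule is not stated.

    from itertools import product

    # Sketch of the slide-40 generalization: optionally replace each inner token
    # of a surface pattern with "*" (no POS tagging, so only plain wildcards).
    # Assumption: the token right before <Object> stays fixed, as in the slide.
    def generalize(pattern):
        tokens = pattern.split()
        middle = tokens[1:-2]          # tokens between <Subject> and the fixed anchor
        variants = []
        for mask in product([False, True], repeat=len(middle)):
            gen = ["*" if wild else tok for tok, wild in zip(middle, mask)]
            variants.append(" ".join([tokens[0]] + gen + tokens[-2:]))
        return variants

    for v in generalize("<Subject> graduated in 1983 from <Object>"):
        print(v)  # yields the 8 variants listed on the slide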
  • 41. Learning p(R|P) • Distantly Supervised Training • Collect pattern frequencies for training examples – Fact triples <S, R, O> e.g. from Linked Data (DBPedia, UMLS) – Manifestations of facts in text in the form of patterns (corpus e.g. Web, Wikipedia, MedLine) • For relationship Ri, aggregate pattern vectors representing <*, Ri, *> Knowledge Enabled Information and Services Science 41
  • 42. Learning p(R|P) – naïve • For each vector Ri containing pattern frequencies for relationship Ri, compute • the number of occurrences of Pattern_j with terms denoting each <S, O> pair in Ri, normalized by all pattern occurrences for Ri Knowledge Enabled Information and Services Science 42
  • 43. Learning p(R|P) – naïve • Uniform distribution of relationships assumed – As the number of relationship types grows, the prior of each type goes towards 0. – Normalize the probabilities over the column vector to get p(Ri|Pj) • Vector space representation – Relationship-pattern matrix – R2P_ij = p(Ri|Pj) Knowledge Enabled Information and Services Science 43
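    A compact sketch of the naive estimate from slides 42-43: count pattern occurrences per relationship from distantly supervised training facts, normalize per relationship, then normalize each pattern's column under a uniform prior. The input format and function name are assumptions.

    from collections import Counter, defaultdict

    # Sketch of the naive p(R|P) estimate (slides 42-43). `observations` is a
    # hypothetical list of (relationship, pattern) pairs harvested by distant
    # supervision from training facts <S, R, O> and their textual manifestations.
    def learn_r2p(observations):
        counts = defaultdict(Counter)                 # counts[R][P] = frequency
        for rel, pattern in observations:
            counts[rel][pattern] += 1

        # p(P | R): normalize each relationship's pattern vector.
        p_pattern_given_rel = {
            rel: {p: c / sum(cnt.values()) for p, c in cnt.items()}
            for rel, cnt in counts.items()
        }

        # p(R | P): with a uniform prior over relationship types, normalize
        # over each pattern's column (Bayes with equal priors).
        patterns = {p for cnt in counts.values() for p in cnt}
        r2p = defaultdict(dict)                       # r2p[R][P] = p(R | P)
        for pattern in patterns:
            col = {rel: vec.get(pattern, 0.0) for rel, vec in p_pattern_given_rel.items()}
            col_sum = sum(col.values())
            for rel, val in col.items():
                if col_sum > 0:
                    r2p[rel][pattern] = val / col_sum
        return r2p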
  • 44. Problem: Relationship Similarities • Extensional similarity – Semantically different relationships can share Subject-Object pairs in training data • Intensional similarity – Overlap and entailment of relationship types • Types should not be seen as discrete – E.g. physical_part_of ⇒ part_of • A priori unknown which types overlap unless formal description available – Semantically similar types compete for the same patterns Knowledge Enabled Information and Services Science 44
  • 45. Relationship similarities – Pertinence: measure similarity between pattern vectors as an approximation of intensional similarity Knowledge Enabled Information and Services Science 45
  • 46. Pertinence for Relationships Do not punish the occurrence of the same pattern with relationship types that are intensionally similar, but extensionally dissimilar Reduce impact of extensionally similar relations Knowledge Enabled Information and Services Science 46
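    Slides 45-46 describe pertinence in terms of similarity between relationship pattern vectors; the slides do not give the exact re-weighting formula, so the sketch below only shows the pairwise cosine similarities that such a re-weighting could start from, using r2p-style vectors.

    import math

    # Sketch for slides 45-46: approximate intensional similarity between
    # relationship types by the cosine similarity of their pattern vectors
    # (dicts mapping pattern -> weight). The actual pertinence re-weighting
    # formula is not shown on the slides, so only this building block is given.
    def cosine(u, v):
        dot = sum(u[p] * v[p] for p in set(u) & set(v))
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def relationship_similarities(pattern_vectors):
        rels = list(pattern_vectors)
        return {(a, b): cosine(pattern_vectors[a], pattern_vectors[b])
                for a in rels for b in rels if a != b}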
  • 47. Pertinence Example – Pattern: <Subject> in the right <Object>
    biological_process_has_associated_location: p(R|P) = 0.968371381
    disease_has_associated_anatomic_site: p(R|P) = 0.880452774
    part_of: p(R|P) = 0.622532958
    has_finding_site: p(R|P) = 0.561041318
    has_location: p(R|P) = 0.537424451
    has_direct_procedure_site: p(R|P) = 0.363832078
    Sum: 3.933654958. Note: This never causes p(R,S,O) > 1. Knowledge Enabled Information and Services Science 47
  • 48. Similarities between relationships Knowledge Enabled Information and Services Science 48
  • 49. Pertinence evaluation [precision-recall chart comparing classification with and without Pertinence] Knowledge Enabled Information and Services Science 49
  • 50. Fact extraction evaluation - DBPedia: 60% training set, 40% testing, DBPedia Infobox fact corpus, Wikipedia text corpus. [Precision/recall vs. confidence threshold chart] Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard. Averaged over 107 relation types. Knowledge Enabled Information and Services Science 50
  • 51. Sample results (DBPedia) – Subject :: Object | suggested Relationship | extracted relations (Rel; Confidence), ranks 1-3:
    Howard Pawley :: Gary Filmon | after | successor; 0.799 | after; 0.768 | office; 0.686
    Mulan :: Tarzan | after | nextSingle; 0.603 | followedBy; 0.533 | after; 0.416
    Species Deceases :: Midnight Oil | artist | producer; 0.761 | artist; 0.719 | genre; 0.467
    The Crystal City :: Orson Scott Card | author | artist; 0.625 | author; 0.617 | writer; 0.583
    Horatio Allen :: William Maxwell | before | predecessor; 0.629 | before; 0.475
    Basdeo Panday :: Trinidad & Tobago | birthplace | birthplace; 0.658 | deathPlace; 0.658 | nationality; 0.330
    Bob Nystrom :: Stockholm | birthplace | cityOfBirth; 0.677 | birthplace; 0.513
    Beccles railway station :: Suffolk | borough | district; 0.772 | borough; 0.770 | friend; 0.749
    Knowledge Enabled Information and Services Science 51
  • 52. Fact extraction evaluation - UMLS: 60% training set, 40% testing, UMLS fact corpus, MedLine text corpus. [Precision/recall vs. confidence threshold chart] Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard. Averaged over ~100 relation types. Knowledge Enabled Information and Services Science 52
  • 53. Sample results (UMLS) – Subject :: Object | suggested Relationship | extracted Rank 1:
    Teeth :: poisoning, fluoride | finding_site_of | finding_site_of 768
    polyps :: polyp of cervix nos (disorder) | associated_with | associated_with
    neck of uterus :: polyp of cervix nos (disorder) | location_of | finding_site_of
    benign neoplasms :: polyp of colon | related_to | associated_with
    brain ischemia :: brain | has_finding_site | location_of
    gastrointestinal tract :: polyp of colon | location_of | is_primary_anatomic_site_of_disease
    gamete structure (cell structure) :: polyvesicular vitelline tumor | is_normal_cell_origin_of_disease | is_normal_cell_origin_of_disease
    Knowledge Enabled Information and Services Science 53
  • 54. Comparison – DBPedia corpus. [Precision-recall chart comparing Mintz-POS, Mintz-NLP, Doozer++ (R) and Doozer++ (P)] Mintz: extraction of 102 relationship types from Freebase; Doozer: 107 from DBPedia. (R) Recall-oriented, using pattern generalization; (P) Precision-oriented, no generalization. M. Mintz, S. Bills, R. Snow, and D. Jurafsky, "Distant supervision for relation extraction without labeled data," in ACL 2009. Knowledge Enabled Information and Services Science 54
  • 55. Evaluate Ad-Hoc Model Creation • On demand creation of models – Domain | Query | Number of Concepts | Precision (Domain Definition):
    Semantic Web | "Semantic Web" OWL ontologies RDF | 143 | 0.98
    Harry Potter | "Harry Potter" dumbledore gryffindor slytherin | 134 | 0.98
    Beatles | Beatles "John Lennon" "Paul McCartney" song | 250 | 0.99
    India-Pakistan Relations | India Pakistan Kashmir | 129 | 0.99
    US Financial crisis - TARP | tarp "financial crisis" "toxic assets" | 146 | 0.93
    German Chancellors | German chancellors "Angela Merkel" "Helmut Kohl" | 124 | 0.91
    Knowledge Enabled Information and Services Science 55
  • 56. Ad-Hoc Model Creation - Evaluation Knowledge Enabled Information and Services Science 56
  • 57. Ad-Hoc Model Creation - Evaluation: Relative recall wrt. possible extraction, i.e. the maximum number of extracted facts marks 100% recall Knowledge Enabled Information and Services Science 57
  • 58. Related Work [diagram positioning Mintz, SOFIE and Turney relative to surface-patterns-only approaches] Knowledge Enabled Information and Services Science 58
  • 59. Main Differences • Surface-patterns only • Only positive training examples • Pertinence measure for semantic similarity • Concept-aware: start with defined concepts • Include background knowledge in probabilistic classification instead of rule- based reasoning Knowledge Enabled Information and Services Science 59
  • 60. Related work • Pattern-based fact extraction – E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In JCDL, 2000. – Suchanek, Fabian M., Mauro Sozio, and Gerhard Weikum. SOFIE: A Self-Organizing Framework for Information Extraction. WWW 2009. – T. M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka, and R. Wang. Populating the Semantic Web by Macro-Reading Internet Text. ISWC 2009. – M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In AAAI 2006. Knowledge Enabled Information and Services Science 60
  • 61. Related work • Relationship-pattern computations – P. D. Turney and P. Pantel. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 2010. – P. D. Turney. Expressing implicit semantic relations without supervision. In ACL 2006 Knowledge Enabled Information and Services Science 61
  • 62. Summary Fact extraction • Pattern-based fact extraction with generalization and Pertinence achieves competitive precision and recall while being computationally feasible for large-scale extraction – Pertinence computation can also be a preprocessing step for other ML techniques • Different types of background knowledge incorporated into one statistical framework – Combined Language model and Semantic model Knowledge Enabled Information and Services Science 62
  • 63. Application and Knowledge Validation – Example: domain model as a basis for research in the area of human cognitive performance. Sources: 18 million MedLine publications/abstracts, UMLS Metathesaurus, Wikipedia. Doozer++: hierarchy extraction, pattern-based fact extraction. Scooner: semantic browsing and retrieval – evaluation in use. Knowledge Enabled Information and Services Science 63
  • 64. Domain Definition – Extracted Hierarchy. A hierarchy extracted for a cognitive science domain model. The keyword description given to the system was a collection of terms relevant to human performance and cognition. Knowledge Enabled Information and Services Science 64
  • 65. Domain Description: Connect Concepts Knowledge Enabled Information and Services Science 65
  • 66. Expert Evaluation of Facts in the Model [chart: fraction per score bin and cumulative incorrect/correct/interesting fractions over scores 1-9] Score 1-2: information that is overall incorrect; 3-4: information that is somewhat correct; 5-6: correct general information; 7-9: correct information that is not commonly known. Knowledge Enabled Information and Services Science 66
  • 67. Extractor Confidence vs. Correctness • Analysis shows that the highest quality extractions have the highest confidence, but some incorrectly extracted facts also have high confidence. High-quality patterns as well as some noise patterns have high indicative power. Knowledge Enabled Information and Services Science 67
  • 68. Extractor Confidence vs. Correctness • Many facts deemed interesting were extracted based on highly specialized patterns in the long tail of the frequency distribution. • Noisy patterns also tend to occupy this space Knowledge Enabled Information and Services Science 68
  • 69. Sources of Errors • Extracted relationship too specific or formally incorrect but metaphorically correct. – <Interpeduncular_Cistern, disease_has_associated_anatomic_site, Cerebral_peduncle> is incorrect: • Interpeduncular Cistern is not a disease. However, it does have the associated anatomic site Cerebral peduncle. • Incorrect directionality – <Pituitary_Gland, sends_output_to, Supraoptic_nucleus> should be <Supraoptic_nucleus, sends_output_to, Pituitary_Gland> • Direction in text often expressed in the context rather than the immediate pattern Knowledge Enabled Information and Services Science 69
  • 70. Validation • Extracted statements need to be validated to be considered knowledge – Explicit validation, e.g. thumbs up/down – Implicit validation, e.g. by analyzing click streams Knowledge Enabled Information and Services Science 70
  • 71. Explicit Validation • Certainty of reference – I.e. we know exactly which statement was validated • Validator credentials can be obtained – E.g. a small community of experts may evaluate • Extra work – Explicit validation is a task that is consciously performed Knowledge Enabled Information and Services Science 71
  • 72. Implicit Validation • Find indications of correctness or incorrectness based on the way the users interact with the presented information – Every action taken on a piece of information is recorded and analyzed – The cumulative behavior of the users gives an indication of which propositions are correct or interesting Knowledge Enabled Information and Services Science 72
  • 73. Implicit Validation • Examples for implicit community-validation – Games with a purpose (L. von Ahn) – Google search rankings • Scooner semantic browser – Browse literature along facts in a model – Browsing trails suggest correct extraction Knowledge Enabled Information and Services Science 73
  • 74. Implicit Validation • A fact is browsed very often by different users. – The fact is interesting to many users. – The fact is surprising and interesting, but may be incorrect. • A user follows a trail of multiple fact-triples through a variety of documents. – The facts that were browsed have a high probability of being correct and support is added to the triples. – If the trail was longer than suggested by a small-world phenomenon, initial triples may have been incorrect, but led to interesting ones. For this reason, only the last k triples of the trail should garner support, or the support should increase for the last k triples in the trail. – The last triple in the trail may have been incorrect and led to browsing results that caused the user to stop browsing. For this reason, the last triple of the trail should be treated with caution. Knowledge Enabled Information and Services Science 74
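    A possible bookkeeping sketch for the trail heuristics on slide 74. The support increments, the value of k and the example trail are hypothetical; the slide only states that the last k triples should garner support and that the final triple should be treated cautiously.

    from collections import defaultdict

    # Hedged sketch of the slide-74 trail heuristics: only the last k browsed
    # triples gain support, and the very last triple gets a smaller, cautious
    # increment. k and the increments are made-up illustration values.
    def update_support(support, trail, k=3, step=1.0, caution=0.25):
        """support: dict mapping (S, R, O) triples to accumulated support.
        trail: list of (S, R, O) triples in the order they were browsed."""
        window = trail[-k:]
        for i, triple in enumerate(window):
            is_last = (i == len(window) - 1)
            support[triple] += caution if is_last else step
        return support

    # Toy usage with invented triples:
    support = defaultdict(float)
    trail = [("brain_ischemia", "has_finding_site", "brain"),
             ("brain", "location_of", "stroke"),
             ("stroke", "associated_with", "hypertension")]
    update_support(support, trail)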
  • 75. Validation "through use" – Enter search terms ⇒ Choose entity of interest ⇒ Browse extracted facts ⇒ Choose relevant literature that supports the fact Knowledge Enabled Information and Services Science 75
  • 76. Validation “through use” Find another interesting fact Fact trails are recorded Knowledge Enabled Information and Services Science 76
  • 77. Validation “through use” Path suggests that at least the first 2 triples are factually correct Knowledge Enabled Information and Services Science 77
  • 78. Browsed Facts Examples Knowledge Enabled Information and Services Science 78
  • 79. Related work • Evaluation and Use – E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06, page 19, 2006. – A. Das, M. Datar, A. Garg, and S. Rajaram. Google News Personalization: Scalable Online Collaborative Filtering. In Proceedings of the 16th international conference on World Wide Web, page 280. ACM, 2007. Knowledge Enabled Information and Services Science 79
  • 80. Summary Knowledge Acquisition • The model actually reflects what the user is interested in at the point of creation ⇒ Willingness to help validate facts – Applications allow for implicit and explicit evaluation • Validated statements can be merged with existing knowledge ⇒ Automated acquisition completed ⇒ Individual-driven KA improved the overall system • R. Kavuluru, C. Thomas et al. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 • Amit Sheth, Christopher Thomas, Pankaj Mehra. Continuous Semantics to Analyze Real-Time Data. IEEE IC, Nov./Dec. 2010 • C. Thomas et al. Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010. • C. Thomas et al. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. WIC 2008. Knowledge Enabled Information and Services Science 80
  • 81. Future Directions • Active Learning to improve classification – Easy in tightly connected system (e.g. NELL) – Feedback mechanism for loosely connected systems • Improve depth of classification – Augment Domain Description with learned concept hierarchies from text (e.g. Navigli) • Knowledge management for background knowledge – Belief updates – Model evolution Knowledge Enabled Information and Services Science 81
  • 82. Contributions (diagram): Conceptual Knowledge (Ontologies, LoD); Textual Information (Wikipedia, Web); Knowledge Representation [IJSWIS, CR, FLSW]; Ontology design [WWW, FOIS]; Knowledge merging/Ontology alignment [AAAI, WebSem2, SWSWPC]; Information Quality [WI2]; Social processes for content creation [CHB]; Social processes for knowledge validation [IHI, WebSci, CHB]; Taxonomy extraction [WI1, WebSci, WebSem1]; Event modeling [IEEE-IC]; Relationship/Fact/Event extraction [IHI, WebSem1, IEEE-IC, WebSci]. Knowledge Enabled Information and Services Science 82
  • 83. Journal/Conference Publications [WebSem] C. Thomas, P. Mehra, A. Sheth, W. Wang, G. Weikum. Automatic domain model creation using pattern-based fact extraction. Submitted to Journal of Web Semantics. [IHI] R. Kavuluru, C. Thomas, A. Sheth, V. Chan, W. Wang, A. Smith, A. Sato and A. Walters. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT International Health Informatics Symposium, January 28-30, 2012. [IEEE-IC] Amit Sheth, Christopher Thomas, Pankaj Mehra. Continuous Semantics to Analyze Real-Time Data. IEEE Internet Computing, vol. 14, no. 6, pp. 84-89, Nov./Dec. 2010, doi:10.1109/MIC.2010.137 [WebSci] C. Thomas, W. Wang, P. Mehra and A. Sheth. What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010. [WI1] C. Thomas, P. Mehra, R. Brooks, and A. Sheth. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, 1:496–502, 2008. Knowledge Enabled Information and Services Science 83
  • 84. Journal/Conference Publications [WI2] C. Thomas and A. Sheth. Semantic Convergence of Wikipedia Articles. In Proceedings of the 2007 IEEE/WIC International Conference on Web Intelligence, pages 600–606, Washington, DC, USA, November 2007. IEEE Computer Society. [WWW] S. S. Sahoo, C. Thomas, A. Sheth, W. S. York, and S. Tartir. Knowledge Modeling and its Application in Life Sciences: A Tale of two Ontologies. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 317–326, New York, NY, USA, 2006. ACM Press. [FOIS] C. Thomas, A. Sheth, and W. York. Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain. In Proceedings of the Fourth International Conference on Formal Ontology in Information Systems (FOIS 2006), pages 115–127, Amsterdam (NL), 2006. IOS Press. [AAAI] P. Doshi and C. Thomas. Inexact matching of ontology graphs using expectation-maximization. In AAAI '06: Proceedings of the 21st national conference on Artificial intelligence, pages 1277–1282. AAAI Press, 2006. Knowledge Enabled Information and Services Science 84
  • 85. Publications [CHB] C. Thomas and A. Sheth. Web Wisdom - An Essay on How Web 2.0 and Semantic Web can foster a Global Knowledge Society. Computers in Human Behavior, Elsevier. [WebSem2] P. Doshi, R. Kolli, and C. Thomas. Inexact matching of ontology graphs using expectation-maximization. Web Semantics: Science, Services and Agents on the World Wide Web, 7(2):90–106, 2009. [IJWGS] V. Kashyap, C. Ramakrishnan, C. Thomas, and A. Sheth. TaxaMiner: an experimentation framework for automated taxonomy bootstrapping. International Journal of Web and Grid Services, 1(2):240–266, 2005. [IJSWIS] A. P. Sheth, C. Ramakrishnan, and C. Thomas. Semantics for the semantic web: The implicit, the formal and the powerful. Int. J. Semantic Web Inf. Syst., 1(1):1–18, 2005. [CR] S. Sahoo, C. Thomas, A. Sheth, C. Henson, and W. York. GLYDE - an expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340(18):2802–2807, 2005. Knowledge Enabled Information and Services Science 85
  • 86. Other Publications Workshop Publications [SWLS] A. Sheth, W. York, C. Thomas, M. Nagarajan, J. Miller, K. Kochut, S. Sahoo, and X. Yi. Semantic Web technology in support of Bioinformatics for Glycan Expression. In W3C Workshop on Semantic Web for Life Sciences, pages 27–28, 2004. [SWSWPC] N. Oldham, C. Thomas, A. Sheth, and K. Verma. METEOR-S Web Service Annotation Framework with Machine Learning Classification. Semantic Web Services and Web Process Composition, pages 137–146, 2005, Springer. Book Chapters [FLSW] C. Thomas and A. Sheth. On the expressiveness of the languages for the semantic web - making a case for a little more. Fuzzy Logic and the Semantic Web, pages 3–20, 2006. Patent [PAT] P. Mehra, R. Brooks and C. Thomas. ONTOLOGY CREATION BY REFERENCE TO A KNOWLEDGE CORPUS. Pub.No. US 2010/0280989 A1 Knowledge Enabled Information and Services Science 86
  • 87. • Research – KR – Domain model extraction / IE • Collaborations – Complex Carbohydrate Research Center at UGA – HP Labs Palo Alto – Human Performance Directorate, AFRL • Proposals – HP Incubation & Innovation grant for Doozer++ – AFRL grant largely based on Doozer++ – NSF proposal submitted with "very good" reviews • Tools and Ontologies – GlycO – GlycoViz – Doozer++ – Scooner Knowledge Enabled Information and Services Science 87
  • 88. Thank you! Shaojun Wang, Amit Sheth, Pascal Hitzler, Pankaj Mehra, Gerhard Weikum. Thanks to all Kno.e.sis Center Members – Past and Present Knowledge Enabled Information and Services Science 88
  • 89. Thank youKnowledge Enabled Information and Services Science 89