Knowledge Acquisition in a System: Automatic Creation of Conceptual Domain Models
1. Knowledge Acquisition in a System
Christopher Thomas
Ohio Center of Excellence in Knowledge-enabled Computing -
Kno.e.sis,
Wright State University
Dayton, OH
topher@knoesis.org
2. Circle of knowledge in a System
Knowledge Enabled Information and Services Science 2
3. Dissertation Overview
Conceptual Knowledge: Ontologies, LoD
• Knowledge Representation [IJSWIS, CR, FLSW]
• Ontology design [WWW, FOIS]
• Knowledge merging / Ontology alignment [AAAI, WebSem2, SWSWPC]
Textual Information: Wikipedia, Web
• Information Quality [WI2]
• Social processes for content creation [CHB]
• Social processes for knowledge validation [IHI, WebSci, CHB]
Doozer++:
• Taxonomy extraction
• Relationship/Fact extraction [IHI, WebSem1, IEEE-IC, WebSci, WI1]
4. Talk Contents
What is knowledge?
How do we turn propositions/beliefs into knowledge?
How do we acquire information?
5. Talk outline
• Motivation
• Knowledge Acquisition (KA) Overview
• KA in a loosely connected system – Doozer++
– Automatic formal domain model creation
– Information Extraction
• Top-Down
• Bottom-Up
– Information Validation “in use”
• Conclusion
6. Larger Context of automated KA
• Increasing significance of knowledge
economy
– “Knowledge Workers” spend 38% of their time
searching for information (McDermott, 2005)
– Vital to get a quick and still comprehensive
understanding of a field through pertinent
concepts/entities and relations/interactions
• Increased demand for formally available
knowledge in semantic models
– Filtering, browsing, annotation, reasoning
McDermott, M. "Knowledge Workers: How can you gauge their effectiveness." Leadership Excellence. Vol. 22.10. October 2005
7. Motivating Scenario
• Learn about a new subject
– E.g. gain a quick overview of a current or
historical event
• Use a formal representation of the gained
overview to filter information
– Facilitate in-depth exploration
• Use the formalized information and the
user interaction to create knowledge from
information
8. Motivating Scenario
• Google: India
• Brief description: demographic and geographic information, etc.
9. Motivating Scenario
• Google: India
• Regular Web results
10. Motivating Scenario
• Clicking on a link to the Wikipedia entry
shows that there have been conflicts with
Pakistan over the region of Kashmir
Investigate more
11. Motivating Scenario
• Google: India Pakistan Kashmir
• Only Web results and news
So far, search engines only display facts about entities, not
relationships or larger contexts
12. Motivating Scenario
• Beneficial to get an overview “at a glance”
of a domain.
• Automated approach to creating knowledge
models for focused areas of interest
• Create models around an incomplete or
rudimentary keyword description and
“anticipate” the user's intentions with respect to the full
context
13. Motivating Scenario
Doozer++: india pakistan kashmir
• Important concepts and relationships
describing the context
14. Motivating Scenario
• Filtered IR using
concepts in the
model
• Concepts and
relationships that
contributed to
clicked results gain
support
• User can explicitly
approve content
15. Circle of Knowledge (Example)
16. Motivating Scenario
• On-demand creation of domain knowledge
improves individual comprehension of an
event
• Formal models are easy to use in
information filtering
• Validated information becomes knowledge
– Can be given back to the community to
improve the overall amount of formal
knowledge available on the Web
– E.g. “Unknown” to DBPedia that the region of
Kashmir belongs to both India and Pakistan
17. Importance of Model creation
• Models support the individual user or knowledge
worker, but also groups or systems
– More efficient communication through small,
shared, agreeable conceptualizations
• People to people
• People to system
• System to system
– Classify or filter pertinent and topical
information using models
– Model-assisted searching and faceted or
exploratory browsing using relationships
– Reuse of validated knowledge
18. Domain Knowledge Models
• Scientific applications
– In-depth description of concepts
– Narrow field
– People to system, system to system
• Annotation, reasoning
⇒ Absolute correctness necessary (as far as possible)
• General applications
– Broad coverage of the field
– Context – how does the new information fit in?
– People to people, people to system
• Individual domain comprehension, filtering, annotation
⇒ Relative correctness sufficient
19. Model Creation Resources
• Large models are available as reference
– DBPedia, YAGO, UMLS, MeSH, GO …
– Too big to be efficiently and effectively usable
• Prior knowledge required to find pertinent resources
• Other information is available in great
abundance, but unformalized
– Tacit expert knowledge
– Scientific databases
– Free text
• peer reviewed journals and proceedings
• General Web content
20. Epistemological Considerations
• Knowledge
– Ensure epistemological soundness of
automated knowledge acquisition
• Reference
– Ensure that nodes in the models refer to real-
world concepts/entities
21. Knowledge
• Functional Definition
– Knowledge = “Know-How”
– Practical, but weak,
Includes “Actionable Information”
• Categorical Definition
– Knowledge = Justified true belief
– S knows that p iff
i. p is true;
ii. S believes that p;
iii. S is justified in believing that p.
22. Belief and Justification
• Belief
– Statements held by the system
• Justification
– Trusted sources
– Extraction algorithms
• Bayesian, deductive or inductive reasoning
• Macro-Reading algorithms, wisdom of the crowds
– Validation
23. Truth assessment of a statement
• Is truth correspondence?
– “A” is true iff A (a true statement corresponds
to an actual state of affairs)
– [Slide callout: No Access]
• Is truth coherence?
– Does the statement fit into the system of other
statements?
• Is truth consensus?
– agreement of correctness amongst a group
⇒ In the cyclical model, achieve high degree
of certainty by allowing constant validation
24. Domain Model – Reference
• Model of a domain conceptually split
– Domain Definition
• Identifying concepts by URIs (classes, entities,
relationship types) ensures reference
• Remains static: necessity
• Rigid designators (Kripke)
– Domain Description
• Relationships describe concepts
• Subject to change: possibility
• Definite descriptions (Russell)
25. Domain Definition
• Top-down concept identification
• Achieved through
– Manual creation based on consensus in a
group
– Extraction from community-created or peer-
reviewed conceptualization
• Wikipedia
• MeSH or UMLS Semantic Network
26. Domain Description
• Possible to do top-down extraction of the
domain description, e.g. from DBPedia
• Problem: Formal concept descriptions are
sparse
– On average, DBPedia has less than 2 object
properties per entity
• Extract descriptions (facts) bottom-up
– Available in text, DBs, etc.
– Domain-specific molecular structure extractors
(GlycO)
– Domain independent IE techniques (Doozer++)
27. Knowledge Acquisition Approaches
• KA in a tightly connected system
– GlycO: domain-specific BioChemistry ontology
• Manual domain definition and description
• Partial automatic domain description
• Domain-specific automatic validation
• Manual validation for false negatives
• KA in a loosely connected system
– Doozer++: general domain-model creation framework
• Automatic domain definition, top-down concept extraction
• Automatic domain description, bottom-up fact extraction
– Extraction from trusted sources
– A trusted extraction and validation procedure
• Domain-independent community-based validation
28. Knowledge Acquisition Approaches
Aspect | Knowledge Engineering Approach | Traditional Extraction Approach | GlycO | Doozer++
Definition | Top-Down | Bottom-up | Top-Down Knowledge Engineering | Top-Down; conceptually, by extraction from Top-Down corpus
Description | Top-Down | Bottom-up | Bottom-up, restricted by Top-down definition | Bottom-up, restricted by Top-down definition
Verification | Manual | Manual | Correctness: automatic; Exceptions: added manually | Community-based validation
29. KA on the Web - Vision
• Web searches, browsing sessions or
classification tasks can be seen as creating
an implicit domain model
– World view, Concept coverage, Facts
• Make models explicit and reusable using
formal descriptions (RDF, OWL)
• Validate the contained information and
share with the community
Increase the system's knowledge by
“doing what you do”: Search, browse,
click, communicate
30. KA in a Loosely Connected System
Domain Model creation to gradually increase overall knowledge of the system
• User-interest driven
• Incentive to evaluate
Inputs
• Linked Data
• Free text
• Wikipedia
• Web
Doozer++
– Domain Definition: Top-down concept extraction
– Domain Description: Pattern-based fact extraction
Scooner
– Evaluation in Use: Semantic browsing and retrieval
– Domain-independent, community-based
Cycle stages: Domain Definition, Domain Description, Validation
31. Domain Definition Requirements
• Identify concepts, concept
labels (denotations) and
concept hierarchy
• Challenge: define narrow
boundaries for a domain while
at the same time ensuring
broad conceptual coverage
within the domain
32. Domain definition - conceptual
• Expand and Reduce approach
– Start with “high recall” methods
• Exploration – Full text search
• Exploitation – Graph-Similarity Method
• Category growth
• “What could be in the domain?”
– End with “high precision” methods
• Apply restrictions on the concepts found
• Remove terms and categories that fall outside the
dense areas of the model graph
• “What should be in the domain?”
33. Domain Description - Classifier
• Concept-aware
– Use concepts and concept labels from the
domain definition step
• Fact extraction as classification of
concept pairs into relationship types
– $f_{class}: C \times C \rightarrow R$
– $R_{S,O} = \{R \mid p(R, S, O) > \varepsilon\}$
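The thresholded set of relationship types for a concept pair can be sketched as follows. This is a minimal illustration only: the function name `candidate_relationships`, the score dictionary, and the value chosen for ε are hypothetical and do not come from the slides.

```python
def candidate_relationships(scores, epsilon):
    """Return (relationship, score) pairs with p(R, S, O) > epsilon, best first."""
    kept = [(rel, p) for rel, p in scores.items() if p > epsilon]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical classifier scores p(R, S, O) for one (Subject, Object) pair.
scores = {"almaMater": 0.71, "employer": 0.22, "birthPlace": 0.03}

# Keep every relationship type whose score exceeds the threshold epsilon.
selected = candidate_relationships(scores, epsilon=0.2)
```

Raising ε trades recall for precision, since fewer relationship types survive the cut.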
34. Domain Description
• Combined Language model and Semantic
classification model
• Language model: surface-pattern-based
– Pattern manifestations of relationships as
features
– Open to any corpus, language independent
– Less computational overhead than NLP
• Semantic Classification Model
– Learned or assigned concept labels
– Semantic types to aid classification
35. Domain Description - Implementation
• Probabilistic Vector-space model
– Each relationship is defined by vectors of
• Pattern probabilities
• Domain/range probabilities
– Each concept is grounded by its semantic
types and manifested by its labels and their
probabilities of identifying the concept
– Sparse pattern representation (density ~2%)
– White-box, easily verifiable
– Inherently parallel
36. Terminology
Symbol | Meaning | Example
S, O | Subject and Object concepts (semantic) | Kelly_Miller_(scientist), Howard_University
L_S, L_O | Subject and Object labels | “Kelly Miller”, “Howard University”
P_{L_S,L_O} | Phrase instantiating the pattern | Kelly Miller graduated from Howard University
P | Pattern | <Subject> graduated from <Object>
T_S, T_O | Semantic type of Subject or Object | Person, Educational_Institution
R | Relationship | almaMater, birthPlace
37. Probabilistic Classifier
[Diagram of the classifier inputs]
• Semantic types: asserted in ontology or learned from linked data
• Labels: taken from lexicon or linked corpus
• Patterns: learned from free text
38. Probabilistic Classifier
How is Barack Obama related to Columbia University?
p(R, Barack_Obama, Columbia_University)
Sentence in corpus:
Obama graduated in 1983 from Columbia University
with a degree in political science and international
relations.
(Regular classification requires multiple examples)
39. Probabilistic Classifier
Obama graduated in 1983 from Columbia University
p(almaMater ,Barack_Obama, Columbia_University) =
p(almaMater | “<Subject> graduated in 1983 from <Object>”) *
p(Barack_Obama | ”Obama”) *
p(Columbia_University | ”Columbia University”) *
p(almaMater | domain(person)) *
p(almaMater | range(academic_institution))
p(almaMater , Barack_Obama, Columbia_University)
= 0.9 * 0.95 * 0.95 * 0.9 * 0.97
p(almaMater, Barack_Obama, Columbia_University) = 0.70909425
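The arithmetic on this slide can be checked with a few lines of Python. The factor values are the ones shown above; the dictionary keys are only descriptive labels chosen for this sketch.

```python
import math

# Factors of p(almaMater, Barack_Obama, Columbia_University) as given on the slide.
factors = {
    "p(almaMater | pattern)": 0.9,
    "p(Barack_Obama | 'Obama')": 0.95,
    "p(Columbia_University | 'Columbia University')": 0.95,
    "p(almaMater | domain(person))": 0.9,
    "p(almaMater | range(academic_institution))": 0.97,
}

# The joint score is the product of the pattern, label, and type factors.
joint = math.prod(factors.values())
print(f"{joint:.8f}")
```

Because every factor is at most 1, any weak component (an ambiguous label, an unlikely type) lowers the overall score.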
40. Pattern Generalization
• Problem: Low recall in pattern-based IE
• Substitute terms with wild cards
– No POS tagging, hence only “*” wild cards
• Mirrors shortest paths through parse trees
<Subject> graduated in 1983 from <Object>
<Subject> * in 1983 from <Object>
<Subject> graduated * 1983 from <Object>
<Subject> * * 1983 from <Object>
<Subject> graduated in * from <Object>
<Subject> * in * from <Object>
<Subject> graduated * * from <Object>
<Subject> * * * from <Object>
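The eight variants listed above can be generated mechanically. The sketch below is one way to do it, not necessarily how Doozer++ does: it assumes that every token except the last one ("from") may be replaced by a wild card, which is inferred from the list on the slide. The function names `generalize` and `render` and the parameter `fixed_tail` are hypothetical.

```python
from itertools import product


def generalize(tokens, fixed_tail=1):
    """Return all variants of tokens with any subset of the leading tokens
    replaced by '*'. The last fixed_tail tokens are never replaced
    (an assumption made to match the slide's list)."""
    split = len(tokens) - fixed_tail
    head, tail = tokens[:split], tokens[split:]
    variants = []
    for mask in product([False, True], repeat=len(head)):
        variant = ["*" if wild else tok for tok, wild in zip(head, mask)]
        variants.append(variant + tail)
    return variants


def render(tokens):
    """Wrap the token sequence in the subject and object placeholders."""
    return "<Subject> " + " ".join(tokens) + " <Object>"


patterns = [render(v) for v in generalize("graduated in 1983 from".split())]
for p in patterns:
    print(p)
```

The number of variants doubles with each substitutable token, so long patterns may need a cap on pattern length or on the number of wild cards.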
41. Learning p(R|P)
• Distantly Supervised Training
• Collect pattern frequencies for training
examples
– Fact triples <S, R, O> e.g. from Linked Data
(DBPedia, UMLS)
– Manifestations of facts in text in the form of
patterns (corpus e.g. Web, Wikipedia, MedLine)
• For relationship Ri, aggregate pattern
vectors representing <*, Ri, *>
42. Learning p(R|P) – naïve
• For each vector R_i containing pattern
frequencies for relationship R_i, compute
• the number of occurrences of Pattern_j with terms denoting each
<S, O> in R_i, normalized by all pattern
occurrences for R_i:
$p(P_j \mid R_i) = \frac{\#(P_j, R_i)}{\sum_k \#(P_k, R_i)}$
43. Learning p(R|P) – naïve
• Uniform distribution of relationships assumed
– As the number of relationship types grows, the
prior of each type goes towards 0.
– normalize the probabilities over the column
vector to get p(Ri|Pj)
• Vector space representation
– Relationship-pattern matrix
– $R2P_{ij} = p(R_i \mid P_j)$
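The naive estimate can be sketched in plain Python. This is an illustration under stated assumptions: the input is taken to be a list of (relationship, pattern) observations collected by distant supervision, the function name `learn_r2p` and the toy observations are hypothetical, and the uniform prior over relationship types is the one assumed on the slide.

```python
from collections import Counter, defaultdict


def learn_r2p(observations):
    """Estimate p(R_i | P_j) from (relationship, pattern) observations."""
    counts = defaultdict(Counter)
    for rel, pattern in observations:
        counts[rel][pattern] += 1

    # Row step: p(P_j | R_i), pattern counts normalized per relationship.
    p_pattern_given_rel = {}
    for rel, row in counts.items():
        total = sum(row.values())
        p_pattern_given_rel[rel] = {pat: n / total for pat, n in row.items()}

    # Column step: with a uniform prior over relationships, normalizing each
    # pattern column yields p(R_i | P_j).
    column_totals = Counter()
    for row in p_pattern_given_rel.values():
        for pat, p in row.items():
            column_totals[pat] += p

    return {
        rel: {pat: p / column_totals[pat] for pat, p in row.items()}
        for rel, row in p_pattern_given_rel.items()
    }


# Toy observations (hypothetical, not taken from the evaluation corpora).
observations = (
    [("almaMater", "graduated from")] * 3
    + [("almaMater", "studied at")]
    + [("employer", "works at"), ("employer", "studied at")]
)
r2p = learn_r2p(observations)
```

In this toy data "studied at" is shared by both relationship types, so its column is split between them, while "graduated from" points to almaMater alone.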
44. Problem: Relationship Similarities
• Extensional similarity
– Semantically different relationships can share
Subject-Object pairs in training data
• Intensional similarity
– Overlap and entailment of relationship types
• Types should not be seen as discrete
– E.g. physical_part_of entails part_of
• A priori unknown which types overlap unless a formal
description is available
– Semantically similar types compete for the
same patterns
45. Relationship similarities
Pertinence Measure
similarity between pattern vectors as approximation
of intensional similarity
46. Pertinence for Relationships
Do not punish the occurrence of the same pattern
with relationship types that are intensionally
similar, but extensionally dissimilar
Reduce impact of extensionally similar relations
47. Pertinence Example
Pattern: <Subject> in the right <Object>
Relationship p(R|P)
biological_process_has_associated_location 0.968371381
disease_has_associated_anatomic_site 0.880452774
part_of 0.622532958
has_finding_site 0.561041318
has_location 0.537424451
has_direct_procedure_site 0.363832078
Sum: 3.933654958
Note: This never causes p(R,S,O) > 1
49. Pertinence evaluation
[Figure: precision (y-axis, 0 to 0.8) vs. recall (x-axis, 0 to 0.5), comparing Pertinence with No Pertinence]
50. Fact extraction evaluation - DBPedia
60% training set, 40% testing, DBPedia Infobox fact corpus, Wikipedia text corpus
[Figure: precision and recall plotted against confidence threshold]
Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard.
Averaged over 107 relation types.
51. Sample results (DBPedia)
Subject :: Object | Suggested Relationship | Extracted Rank 1 (Rel; Confidence) | Rank 2 | Rank 3
Howard Pawley :: Gary Filmon | after | successor; 0.799 | after; 0.768 | office; 0.686
Mulan :: Tarzan | after | nextSingle; 0.603 | followedBy; 0.533 | after; 0.416
Species Deceases :: Midnight Oil | artist | producer; 0.761 | artist; 0.719 | genre; 0.467
The Crystal City :: Orson Scott Card | author | artist; 0.625 | author; 0.617 | writer; 0.583
Horatio Allen :: William Maxwell | before | predecessor; 0.629 | before; 0.475 |
Basdeo Panday :: Trinidad & Tobago | birthplace | deathPlace; 0.658 | birthplace; 0.658 | nationality; 0.330
Bob Nystrom :: Stockholm | birthplace | cityOfBirth; 0.677 | birthplace; 0.513 |
Beccles railway station :: Suffolk | borough | district; 0.772 | borough; 0.770 | friend; 0.749
52. Fact extraction evaluation - UMLS
60% training set, 40% testing, UMLS fact corpus, MedLine text corpus
[Figure: precision and recall plotted against confidence threshold]
Strict evaluation: only the 1st-ranked extracted relation is compared to the gold standard.
Averaged over ~100 relation types.
53. Sample results (UMLS)
Subject :: Object | Suggested Relationship | Extracted Rank 1
Teeth :: poisoning, fluoride | finding_site_of | finding_site_of
768 polyps :: polyp of cervix nos (disorder) | associated_with | associated_with
neck of uterus :: polyp of cervix nos (disorder) | location_of | finding_site_of
benign neoplasms :: polyp of colon | related_to | associated_with
brain ischemia :: brain | has_finding_site | location_of
gastrointestinal tract :: polyp of colon | is_primary_anatomic_site_of_disease | location_of
gamete structure (cell structure) :: polyvesicular vitelline tumor | is_normal_cell_origin_of_disease | is_normal_cell_origin_of_disease
54. Comparison – DBPedia corpus
[Figure: precision (y-axis) vs. recall (x-axis) for Mintz-POS, Mintz-NLP, Doozer++ (R), and Doozer++ (P)]
• Mintz: extraction of 102 relationship types from Freebase
• Doozer: 107 from DBPedia
• (R) Recall-oriented, using pattern generalization
• (P) Precision-oriented, no generalization
M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in ACL 2009.
55. Evaluate Ad-Hoc Model Creation
• On-demand creation of models
Domain | Query | Number of Concepts | Precision (Domain Definition)
Semantic Web | “Semantic Web” OWL ontologies RDF | 143 | 0.98
Harry Potter | “Harry Potter” dumbledore gryffindor slytherin | 134 | 0.98
Beatles | Beatles "John Lennon" "Paul McCartney" song | 250 | 0.99
India-Pakistan Relations | India Pakistan Kashmir | 129 | 0.99
US Financial crisis - TARP | tarp "financial crisis" "toxic assets" | 146 | 0.93
German Chancellors | German chancellors "Angela Merkel" "Helmut Kohl" | 124 | 0.91
56. Ad-Hoc Model Creation - Evaluation
57. Ad-Hoc Model Creation - Evaluation
[Figure: relative recall]
Recall with respect to possible extraction, i.e. the maximum number of extracted facts marks 100% recall.
58. Related Work
[Diagram comparing related work: Mintz, SOFIE, Turney; one region labeled “Surface patterns only”]
59. Main Differences
• Surface-patterns only
• Only positive training examples
• Pertinence measure for semantic similarity
• Concept-aware: start with defined concepts
• Include background knowledge in
probabilistic classification instead of rule-
based reasoning
60. Related work
• Pattern-based fact extraction
– E. Agichtein and L. Gravano. Snowball: Extracting
relations from large plain-text collections. In JCDL,
2000.
– Suchanek, Fabian M., Mauro Sozio, and Gerhard
Weikum. SOFIE: A Self-Organizing Framework for
Information Extraction. WWW 2009.
– T. M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka,
and R. Wang. Populating the Semantic Web by Macro-
Reading Internet Text. ISWC 2009.
– M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain.
Organizing and searching the world wide web of facts-
step one: the one-million fact extraction challenge. In
AAAI 2006.
61. Related work
• Relationship-pattern computations
– P. D. Turney and P. Pantel. From Frequency to
Meaning: Vector Space Models of Semantics. Journal
of Artificial Intelligence Research, 37, 2010.
– P. D. Turney. Expressing implicit semantic relations
without supervision. In ACL 2006
62. Summary Fact extraction
• Pattern-based fact extraction with
generalization and Pertinence achieves
competitive precision and recall while being
computationally feasible for large-scale
extraction
– Pertinence computation can also be a
preprocessing step for other ML techniques
• Different types of background knowledge
incorporated into one statistical framework
– Combined Language model and Semantic
model
63. Application and Knowledge Validation
Example: Domain model as a basis for research in the area of human cognitive performance.
Inputs
• 18 Million MedLine publications/abstracts
• UMLS Metathesaurus
• Wikipedia
Doozer++
– Hierarchy extraction
– Pattern-based fact extraction
Scooner
– Semantic browsing and retrieval
– Evaluation in Use
64. Domain Definition – Extracted Hierarchy
A hierarchy extracted for a cognitive science domain model.
The keyword description given to the system was a collection of terms relevant
to human performance and cognition.
66. Expert Evaluation of Facts in the Model
[Figure: expert scores from 1 to 9 (x-axis) vs. fraction (y-axis); series: fraction in bin, cumulative incorrect, cumulative correct, cumulative interesting]
Score ranges:
• 1-2: Information that is overall incorrect
• 3-4: Information that is somewhat correct
• 5-6: Correct general information
• 7-9: Correct information not commonly known
67. Extractor Confidence vs. Correctness
• Analysis shows that highest quality extractions have the
highest confidence, but also incorrectly extracted facts have
high confidence
High-quality patterns as well as some noise-patterns have
high indicative power.
68. Extractor Confidence vs. Correctness
• Many facts deemed interesting were extracted based on
highly specialized patterns in the long tail of the frequency
distribution.
• Noisy patterns also tend to occupy this space
69. Sources of Errors
• Extracted relationship too specific or formally
incorrect but metaphorically correct.
– <Interpeduncular_Cistern disease_has_associated_anatomic_site Cerebral_peduncle> is incorrect.
• Interpeduncular Cistern is not a disease. However, it does have
the associated anatomic site Cerebral peduncle.
• Incorrect directionality
– <Pituitary_Gland sends_output_to Supraoptic_nucleus> should be <Supraoptic_nucleus sends_output_to Pituitary_Gland>
• Direction in text often expressed in the context rather than the
immediate pattern
70. Validation
• Extracted statements need to be validated
to be considered knowledge
– Explicit validation, e.g. thumbs up/down
– Implicit validation, e.g. by analyzing click streams
71. Explicit Validation
• Certainty of reference
– I.e. we know exactly which statement was
validated
• Validator credentials can be obtained
– E.g. a small community of experts may evaluate
• Extra work
– Explicit validation is a task that is consciously
performed
72. Implicit Validation
• Find indications of correctness or
incorrectness based on the way the users
interact with the presented information
– Every action taken on a piece of information is
recorded and analyzed
– The cumulative behavior of the users gives an
indication of which propositions are correct or
interesting
73. Implicit Validation
• Examples for implicit community-validation
– Games with a purpose (L. von Ahn)
– Google search rankings
• Scooner semantic browser
– Browse literature along facts in a model
– Browsing trails suggest correct extraction
74. Implicit Validation
• A fact is browsed very often by different users.
– The fact is interesting to many users.
– The fact is surprising and interesting, but may be incorrect.
• A user follows a trail of multiple fact-triples through
a variety of documents.
– The facts that were browsed have a high probability of being correct and support is
added to the triples.
– If the trail was longer than suggested by a small-world phenomenon, initial triples
may have been incorrect, but led to interesting ones. For this reason, only the last
k triples of the trail should garner support or the support should increase for the
last k triples in the trail.
– The last triple in the trail may have been incorrect and led to browsing results that
caused the user to stop browsing. For this reason, the last triple of the trail should
be treated with caution.
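The trail heuristics above can be sketched as a simple support update. Everything specific here is an assumption made for illustration: the function name `add_trail_support`, the default k, and the reduced weight for the final triple are hypothetical choices, not values reported in the talk.

```python
from collections import Counter


def add_trail_support(support, trail, k=3, last_weight=0.5):
    """Credit only the last k triples of a browsing trail.

    The final triple receives a reduced weight because it may have caused
    the user to stop browsing (both weights are illustrative assumptions)."""
    if k <= 0 or not trail:
        return support
    recent = trail[-k:]
    for i, triple in enumerate(recent):
        is_last = i == len(recent) - 1
        support[triple] += last_weight if is_last else 1.0
    return support


# Hypothetical trail of four fact triples followed by one user.
trail = [
    ("A", "part_of", "B"),
    ("B", "has_location", "C"),
    ("C", "associated_with", "D"),
    ("D", "related_to", "E"),
]
support = add_trail_support(Counter(), trail, k=3)
```

With k=3 the first triple of this trail gains no support, the two middle triples gain full support, and the last one gains half.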
75. Validation “through use”
• Enter search terms
• Choose entity of interest
• Browse extracted facts
• Choose relevant literature that supports the fact
76. Validation “through use”
Find another
interesting fact
Fact trails are
recorded
77. Validation “through use”
Path suggests
that at least the
first 2 triples are
factually correct
79. Related work
• Evaluation and Use
– E. Agichtein, E. Brill, and S. Dumais. Improving web
search ranking by incorporating user behavior
information. Proceedings of the 29th annual
international ACM SIGIR conference on Research and
development in information retrieval - SIGIR '06, page
19, 2006.
– A. Das, M. Datar, A. Garg, and S. Rajaram. Google
News Personalization: Scalable Online Collaborative
Filtering. In Proceedings of the 16th international
conference on World Wide Web, page 280. ACM,
2007.
80. Summary Knowledge Acquisition
• The model actually reflects what the user is
interested in at the point of creation
Willingness to help validate facts
– Applications allow for implicit and explicit
evaluation
• Validated Statements can be merged with
existing knowledge
Automated acquisition completed
Individual-driven KA improved overall system
• R. Kavuluru, C. Thomas et al. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused
Bioscience Domains. IHI 2012
• Amit Sheth, Christopher Thomas, Pankaj Mehra, 'Continuous Semantics to Analyze Real-Time Data', IEEE IC, Nov./Dec. 2010
• C. Thomas et al. Improving Linked Open Data through On-Demand Model Creation. Web Science Conference, 2010.
• C. Thomas, et al. Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction. WIC 2008.
81. Future Directions
• Active Learning to improve classification
– Easy in tightly connected system (e.g. NELL)
– Feedback mechanism for loosely connected
systems
• Improve depth of classification
– Augment Domain Description with learned
concept hierarchies from text (e.g. Navigli)
• Knowledge management for background
knowledge
– Belief updates
– Model evolution
82. Contributions
Conceptual Knowledge: Ontologies, LoD
• Knowledge Representation [IJSWIS, CR, FLSW]
• Ontology design [WWW, FOIS]
• Knowledge merging / Ontology alignment [AAAI, WebSem2, SWSWPC]
Textual Information: Wikipedia, Web
• Information Quality [WI2]
• Social processes for content creation [CHB]
• Social processes for knowledge validation [IHI, WebSci, CHB]
• Taxonomy extraction [WI1, WebSci, WebSem1]
• Event modeling [IEEE-IC]
• Relationship/Fact/Event extraction [IHI, WebSem1, IEEE-IC, WebSci]
83. Journal/Conference Publications
[WebSem1] C. Thomas, P. Mehra, A. Sheth, W. Wang, G. Weikum. Automatic
domain model creation using pattern-based fact extraction. Submitted to
Journal of Web Semantics.
[IHI] R. Kavuluru, C. Thomas, A. Sheth, V. Chan, W. Wang, A. Smith, A. Sato and
A. Walters. An Up-to-date Knowledge-Based Literature Search and
Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd
ACM SIGHIT International Health Informatics Symposium, January 28-30,
2012.
[IEEE-IC] Amit Sheth, Christopher Thomas, Pankaj Mehra, 'Continuous
Semantics to Analyze Real-Time Data', IEEE Internet Computing, vol. 14, no.
6, pp. 84-89, Nov./Dec. 2010, doi:10.1109/MIC.2010.137
[WebSci] C. Thomas, W. Wang, P. Mehra and A. Sheth. What Goes Around
Comes Around: Improving Linked Open Data through On-Demand Model
Creation. Web Science Conference, 2010.
[WI1] C. Thomas, P. Mehra, R. Brooks, and A. Sheth. Growing Fields of Interest
- Using an Expand and Reduce Strategy for Domain Model Extraction. Web
Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International
Conference on, 1:496–502, 2008.
84. Journal/Conference Publications
[WI2] C. Thomas and A. Sheth. Semantic Convergence of Wikipedia Articles. In
Proceedings of the 2007 IEEE/WIC International Conference on Web
Intelligence, pages 600–606, Washington, DC, USA, November 2007. IEEE
Computer Society.
[WWW] S. S. Sahoo, C. Thomas, A. Sheth, W. S. York, and S. Tartir. Knowledge
Modeling and its Application in Life Sciences: A Tale of two Ontologies. In
WWW '06: Proceedings of the 15th international conference on World Wide
Web, pages 317–326, New York, NY, USA, 2006. ACM Press.
[FOIS] C. Thomas, A. Sheth, and W. York. Modular Ontology Design Using
Canonical Building Blocks in the Biochemistry Domain. In Proceeding of the
2006 conference on Formal Ontology in Information Systems: Proceedings of
the Fourth International Conference (FOIS 2006), pages 115–127,
Amsterdam (NL), 2006. IOS Press.
[AAAI] P. Doshi and C. Thomas. Inexact matching of ontology graphs using
expectation-maximization. In AAAI'06: proceedings of the 21st national
conference on Artificial intelligence, pages 1277–1282. AAAI Press, 2006.
85. Publications
[CHB] C. Thomas and A. Sheth. Web Wisdom - An Essay on How Web 2.0 and
Semantic Web can foster a Global Knowledge Society. Computers in Human
Behavior, Elsevier.
[WebSem2] P. Doshi, R. Kolli, and C. Thomas. Inexact matching of ontology
graphs using expectation-maximization. Web Semantics: Science, Services
and Agents on the World Wide Web, 7(2):90–106, 2009.
[IJWGS] V. Kashyap, C. Ramakrishnan, C. Thomas, and A. Sheth. Taxaminer:
an experimentation framework for automated taxonomy bootstrapping.
International Journal of Web and Grid Services, 1(2):240–266, 2005.
[IJSWIS] A. P. Sheth, C. Ramakrishnan, and C. Thomas. Semantics for the
semantic web: The implicit, the formal and the powerful. Int. J. Semantic Web
Inf. Syst., 1(1):1–18, 2005.
[CR] S. Sahoo, C. Thomas, A. Sheth, C. Henson, and W. York. GLYDE - an
expressive XML standard for the representation of glycan structure.
Carbohydrate research, 340(18):2802–2807, 2005.
86. Other Publications
Workshop Publications
[SWLS] A. Sheth, W. York, C. Thomas, M. Nagarajan, J. Miller, K. Kochut, S.
Sahoo, and X. Yi. Semantic Web technology in support of Bioinformatics for
Glycan Expression. In W3C Workshop on Semantic Web for Life Sciences,
pages 27–28, 2004.
[SWSWPC] N. Oldham, C. Thomas, A. Sheth, and K. Verma. METEOR-S Web
Service Annotation Framework with Machine Learning Classification.
Semantic Web Services and Web Process Composition, pages 137–146,
2005, Springer.
Book Chapters
[FLSW] C. Thomas and A. Sheth. On the expressiveness of the languages for
the semantic web - making a case for a little more. Fuzzy Logic and the
Semantic Web, pages 3–20, 2006.
Patent
[PAT] P. Mehra, R. Brooks and C. Thomas. ONTOLOGY CREATION BY
REFERENCE TO A KNOWLEDGE CORPUS. Pub.No. US 2010/0280989 A1
87.
• Research
– KR
– Domain model extraction / IE
• Collaborations
– Complex Carbohydrate Research Center at UGA
– HP Labs Palo Alto
– Human Performance Directorate, AFRL
• Proposals
– HP Incubation & Innovation grant for Doozer++
– AFRL grant largely based on Doozer++
– NSF proposal submitted with “very good” reviews
• Tools and Ontologies
– GlycO
– GlycoViz
– Doozer++
– Scooner
88. Thank you!
Shaojun Wang, Amit Sheth, Pascal Hitzler, Pankaj Mehra, Gerhard Weikum
Thanks to all Kno.e.sis Center Members, Past and Present
To get the probability of seeing a relationship given a concept pair, we average over all occurrences of phrases that contain labels for the concept pair. We take into account the probabilities that the term pair actually denotes the concept pair and, if available, whether the types of subject and object are likely to occur with that relationship.
Show how pattern probabilities and background knowledge interact
Shortcoming: patterns are treated as independent, even though they would have been on the same path through a parse tree.
Pertinence has most influence in high-recall regions. Intuitively, as the threshold is increased, patterns that are highly indicative of specific relationships contribute more to the classification and thus the advantage of the pertinence method is slightly diminished.
Doozer(R): recall-oriented, generalized. Doozer(P): precision-oriented, not generalized.
None of the facts were previously found in UMLS
It is important to know how correct information was extracted. The probabilistic classifier easily allows for analysis of the patterns that were underlying the extraction. The slide shows how extraction quality measures up against extraction confidence.