Concepts in web ontologies help machines to understand data through the meanings they hold. Furthermore, learning the contexts and topics of web documents has also helped in better semantic-oriented structuring and retrieval of data on the web. In this short paper we present a novel approach for domain-independent open learning of the domain concepts, context, and topic of any given web document. Our approach is based on a computational version of the Construction-Integration (CI) model of text comprehension. Our proposed system mimics the way humans learn the meanings of textual units and identifies domain concepts, contexts, and topics in the form of semantic networks. We apply our system to a number of web documents with a range of topics and domains. The resulting semantic networks provide quantitative and qualitative insights into the nature of the given web documents.
1. Using Text Comprehension Model for
Learning Concepts, Context, and Topic
of Web Content
11th International Conference on Semantic Computing
IEEE ICSC 2017 - San Diego, California, USA
Jan 30-Feb 1, 2017
Ismael Ali, Naser Al Madi, Austin Melton
Department of Computer Science
Kent State University
2. Outline
• Text Comprehension
• System Architecture and Workflow
• Semantic Learning
– Semantic Network Construction
– Mathematical Foundation
– Domain Concept Learning
– Topic Learning
– Context Learning
• Experimental Design
• Evaluation Strategy
• Results
• Conclusion and Future Work
3. Abstract
• Role of learning semantics, including concepts, contexts, and
topics, from web documents:
– semantic-based structuring and retrieval
• We present a novel approach for domain-independent
semantic learning.
• Our approach uses a computational version of the
Construction-Integration (CI) model of text comprehension.
4. Text Comprehension
• Comprehension is a cognitive-based learning process
• Comprehension produces mental representations:
– perceptual
– verbal
– semantic
• The CI model simulates the incremental and dynamic task of
comprehending text, and it leads to the construction of a
semantic network (SN)
5. CI as a Cognitive Model of Text Comprehension
• [Figure from: Cathleen Wharton and Walter Kintsch, 1991, ACM SIGART Bulletin, showing the Surface Model, the Text-Base Model, and the Situation Model]
• Situation Model:
– time of acquisition
– recognizing main concepts
– integrating them with background knowledge
6. System Architecture and Workflow
• Preprocessing using Stanford CoreNLP:
1. Text tokenization
2. Lemmatization
3. Sentence splitting (to get the Surface Model)
4. Part-of-speech tagging
5. Anaphora resolution
• Running the computational CI model to produce a weighted semantic network
• Analysis and filtering of the weighted semantic networks
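The preprocessing stage above can be sketched as follows. This is a minimal stand-in, not the authors' pipeline: the slide uses Stanford CoreNLP for lemmatization, POS tagging, and anaphora resolution, which are not reproduced here; the splitting and tokenization rules below are simplifying assumptions.

```python
import re

def preprocess(text):
    """Simplified stand-in for the CoreNLP preprocessing stage:
    sentence splitting and tokenization only (steps 1 and 3).
    Lemmatization, POS tagging, and anaphora resolution would come
    from Stanford CoreNLP itself and are omitted in this sketch."""
    # Naive sentence splitting on terminal punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Naive tokenization: lowercase alphabetic tokens only.
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences]

# Each sentence becomes one reading episode (see slide 7).
episodes = preprocess("Knowledge is a familiarity. Awareness or "
                      "understanding of something. Such as facts.")
```

Each inner list then feeds the CI model as one episode.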
7. Semantic Network Construction
• Sentences are presented as single units of time (a reading
episode)
• Example: “Knowledge is a familiarity. Awareness or understanding of
something. Such as facts.”
• [Fig. 2: Sample concept network after running the CI model; the legend distinguishes recognized vs. neglected concepts and recognized vs. neglected associations]
8. Semantic Network Construction
• “Knowledge is a familiarity. Awareness or understanding of something. Such
as facts.”
• Episodes {e1, e2, ..., ei} are background knowledge for episode ei+1
• Weights on edges represent the semantic association strength
• [Fig. 2: Sample concept network after running the CI model]
1. The concept recognition threshold (S) is 7 for Fig. 2
– s(“something”) = 6
– e1 + e2 < S
– s(“Awareness”) = 12
– e3 + e4 > S
2. The association recognition threshold (I) is 5 for Fig. 2
– i(“Knowledge”, “facts”) < I
– i(“Knowledge”, “Awareness”) > I
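The threshold test on this slide can be sketched as below. The strengths for "Knowledge" and "facts" and the association values are hypothetical placeholders; only s("something") = 6, s("Awareness") = 12, S = 7, and I = 5 come from the slide.

```python
# Recognition thresholds from slide 8.
S, I = 7, 5

# Activation strengths after running the CI model; values for
# "Knowledge" and "facts" are made up for illustration.
concept_strength = {"Knowledge": 14, "Awareness": 12, "facts": 9, "something": 6}
assoc_strength = {("Knowledge", "Awareness"): 8, ("Knowledge", "facts"): 3}

# A concept is recognized when its strength exceeds S; an association
# when its strength exceeds I and both endpoints are recognized.
recognized_concepts = {c for c, s in concept_strength.items() if s > S}
recognized_assocs = {pair for pair, w in assoc_strength.items()
                     if w > I and set(pair) <= recognized_concepts}
```

Here "something" (6 < S) and the ("Knowledge", "facts") association (3 < I) end up neglected, matching the slide's example.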
9. Semantic Network Construction:
Semantic Association Graph
1. An associative matrix is generated from the Text-base model
2. Each sentence forms an Individual Concept Network (ICN)
3. All ICN graphs are combined to create the Base Semantic Network (BSN)
• The associative matrix has one row and one column per concept (C1, C2, ..., Cn); cell (Ci, Cj) holds the sentence ID of the first episode in which Ci and Cj co-occur
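The associative matrix of slide 9 can be sketched as a dictionary keyed by concept pairs; the example episodes are illustrative, and the pair ordering is an implementation choice, not from the slides.

```python
from itertools import combinations

def associative_matrix(episodes):
    """Cell (ci, cj) holds the ID of the first sentence (episode)
    in which the two concepts co-occur, as on slide 9."""
    first_cooccurrence = {}
    for sent_id, concepts in enumerate(episodes, start=1):
        for ci, cj in combinations(sorted(set(concepts)), 2):
            # setdefault keeps the earliest episode only.
            first_cooccurrence.setdefault((ci, cj), sent_id)
    return first_cooccurrence

# Toy episodes from the slide-7 example text (content words only).
episodes = [["Knowledge", "familiarity"],
            ["Awareness", "understanding", "something"],
            ["facts"]]
matrix = associative_matrix(episodes)
```

A sparse dictionary avoids allocating the full n-by-n matrix when most concept pairs never co-occur.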
10. Semantic Network Construction:
Mathematical Foundation
• Finding weights and thresholds:
4. The BSN shows which concepts and associations were recognized and which were neglected
6. The BSN semantic network is represented as a set of inequalities:
– the inequalities set upper and lower bounds for the concept (S) and association (I) recognition thresholds
– linear programming finds suitable values for all variables to satisfy the inequalities
7. Find values for the variable vector X that satisfy the inequalities by minimizing:
min f·X subject to A·X ≤ B, LB ≤ X ≤ UB
Where:
– f is the linear objective function
– A is the left-hand side of the inequalities
– B is the right-hand side of the inequalities
– LB is the lower bound of the solution
– UB is the upper bound of the solution
• The resulting variable vector contains weights for nodes and associations, along with individual threshold (S) and (I) values for recognizing concepts and associations.
11. Domain Concept Learning
• The variable vector is used to construct the semantic network Gi = (Ci, Ei)
• Then concept filtering is performed to learn the domain concepts
• The domain concepts for web document di are the concepts in a subgraph G*i of
its semantic network Gi:
– G*i = (C*i, E*i), where C*i ⊂ Ci and E*i ⊂ Ei
• Filtering mechanisms:
(1) statistical-based filtering: mean threshold and median threshold
(2) positive-based filtering: suggested for the proposed cognitive-based
semantic learning approach
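The three filtering mechanisms can be sketched as below. The concept weights are hypothetical, and reading positive-based filtering as "keep every concept with positive weight" is an assumption about the paper's definition.

```python
from statistics import mean, median

# Hypothetical CI weights for concepts learned from one document.
weights = {"ecology": 15.0, "organism": 11.0, "environment": 9.0,
           "the": 2.0, "such": 1.0}

def filter_concepts(weights, threshold):
    """Keep the concepts whose weight exceeds the threshold."""
    return {c for c, w in weights.items() if w > threshold}

# (1) statistical-based filtering: mean and median thresholds.
mean_filtered = filter_concepts(weights, mean(weights.values()))
median_filtered = filter_concepts(weights, median(weights.values()))
# (2) positive-based filtering (assumed: keep any positive weight).
positive_filtered = filter_concepts(weights, 0.0)
```

The stricter the threshold, the smaller the resulting domain-concept set C*i.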
12. Topic Learning
• For each domain concept ci ∈ C*i in dj, calculate the Topic Identification
Weight (Tiw) from:
– CIw(ci): the weight calculated by the computational CI model
– Eigenvector(ci): the eigenvector centrality of ci, computed as a
function of the centralities of its neighbors
– e(ci): the episode in which the given concept ci first appeared
• Topic identification:
– The topic concept of di is the concept with the highest Tiw weight
– It is the most influential node in the semantic network G*i of the domain
concept set
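The eigenvector-centrality ingredient of Tiw can be sketched with power iteration; the toy network and weights are invented, and the slide does not give the exact formula combining CIw(ci), Eigenvector(ci), and e(ci), so only the centrality part is shown.

```python
import numpy as np

def eigenvector_centrality(adj, iters=100):
    """Power iteration: each node's centrality is proportional to the
    sum of its neighbors' centralities (the slide-12 definition)."""
    n = adj.shape[0]
    m = adj + np.eye(n)   # diagonal shift guarantees convergence
    x = np.ones(n)
    for _ in range(iters):
        x = m @ x
        x = x / np.linalg.norm(x)
    return x

# Toy weighted semantic network over three domain concepts.
concepts = ["knowledge", "awareness", "facts"]
adj = np.array([[0.0, 8.0, 3.0],
                [8.0, 0.0, 0.0],
                [3.0, 0.0, 0.0]])

cent = eigenvector_centrality(adj)
# Using centrality alone as a proxy for Tiw in this sketch:
topic = concepts[int(np.argmax(cent))]
```

In the full method the highest Tiw, not the highest centrality alone, selects the topic concept.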
13. Context Learning
• The context of di is the set of all nearest neighbors (nodes at
distance k = 1) of the topic concept
• Thus the context includes:
– the concepts most semantically associated with the topic concept
– a normal distribution of concepts selected from
different sections of the text
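Collecting the distance-1 neighborhood can be sketched directly over an edge list; the edges and topic below are illustrative, not from the paper's data.

```python
# Hypothetical edge list of the filtered semantic network G*_i.
edges = {("ecology", "organism"), ("ecology", "environment"),
         ("organism", "species")}
topic = "ecology"

# Context = every node adjacent to the topic concept (k = 1).
context = {c for e in edges for c in e if topic in e and c != topic}
```

Nodes like "species", two hops from the topic, are excluded from the context.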
14. Experimental Design
• A diverse set of ten randomly selected web documents
from Wikipedia:
– astronomy, brain, cognition, ecology, knowledge, law,
literacy, robotics, virus, and tennis
• Tests the openness (domain independence) property
of our approach in learning the semantics of web content
15. Evaluation Strategies
• Results of the filtering mechanisms are evaluated by a human-judgment strategy [4]:
1. A set of seven human judges (domain experts) was selected at KSU
2. The judges were asked to evaluate the list(s) of all potential concepts learned
by the CI model for each web document
3. They were then asked to identify whether each concept belonged to the given domain or not
4. Next, the domain concepts identified by the experts were compared against the
domain concepts identified by each concept filtering strategy
5. Then the quality of each concept filtering strategy was evaluated
• The evaluation was performed using the binary evaluation measures from IR: Precision, Recall,
and F1
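The binary IR measures can be sketched as set operations; the judged and learned concept sets below are invented examples, not results from the paper.

```python
# Hypothetical sets: concepts the experts judged as in-domain,
# and concepts one filtering strategy learned.
judged = {"ecology", "organism", "environment", "species"}
learned = {"ecology", "organism", "habitat"}

tp = len(judged & learned)          # true positives
precision = tp / len(learned)       # fraction of learned concepts correct
recall = tp / len(judged)           # fraction of judged concepts found
f1 = 2 * precision * recall / (precision + recall)
```

With these toy sets, precision is 2/3 and recall is 1/2; F1 is their harmonic mean.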
17. Context and Topic Analysis
• [Figures: the topic concept and the context learned for the Ecology web document]
18. Conclusion and Future Work
• We investigated a novel approach for open learning of the concepts,
contexts, and topics of web content.
• Our approach is based on the Construction-Integration (CI) model of text
comprehension, which mimics the way humans learn the semantic
components of a web document.
• We also highlighted the use of cognitive-science results in learning
semantics from web content.
• Our work is a step toward our future research on cognition-based and open:
– Ontology Learning
– Ontology Selection