The document proposes an approach to automatically extract conceptual taxonomies from text using multiple cooperating techniques. Key aspects of the approach include identifying relevant concepts, generalizing similar concepts, and performing reasoning by concept association. Preliminary experiments show promise, but extensions are needed to improve concept descriptions, representation of relations, and similarity measures. Future work is outlined to address limitations and refine the approach.
The document describes a method for learning incoherent dictionaries using iterative projections and rotations (IPR). It begins with background on dictionary learning models and algorithms, as well as previous work on learning incoherent dictionaries. The IPR algorithm constructs Grassmannian frames, which have minimal mutual coherence, by iteratively projecting the dictionary's Gram matrix onto constraint sets and then applying a rotation step. Numerical experiments show that dictionaries learned with IPR achieve lower mutual coherence than existing methods while still performing well for sparse approximation.
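As a rough illustration of the quantity IPR drives down: mutual coherence is the largest absolute inner product between distinct unit-norm atoms, and the Welch bound is the floor that Grassmannian frames attain. A minimal numpy sketch (random dictionary, illustrative sizes, not the IPR algorithm itself):

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between distinct unit-norm columns of D."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize atoms
    G = Dn.T @ Dn                                      # Gram matrix
    np.fill_diagonal(G, 0.0)                           # ignore self-products
    return np.abs(G).max()

rng = np.random.default_rng(0)
d, N = 16, 32                       # 16-dim signals, 32 atoms (overcomplete)
D = rng.standard_normal((d, N))
mu = mutual_coherence(D)

# Welch bound: a lower limit on coherence for ANY unit-norm frame of this shape.
welch = np.sqrt((N - d) / (d * (N - 1)))
print(mu >= welch)  # True; equality would require an equiangular tight frame
```

IPR's projection step operates on exactly this Gram matrix, shrinking its off-diagonal entries toward the Welch bound before rotating back to a feasible dictionary.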
The document proposes a mixed approach using existing natural language processing techniques and novel techniques to automatically construct conceptual taxonomies from text. Key steps include identifying relevant concepts and attributes from text, clustering similar concepts, computing relevance weights for concepts, and generalizing concepts using WordNet. Preliminary results suggest the approach shows promise for extending and improving automatic taxonomy construction.
Annotating Rhetorical and Argumentative Structures in Mathematical Knowledge - Christoph Lange
This document summarizes the work Christoph Lange did at DERI from April to October 2008. It discusses Lange's background in mathematical knowledge management and his project using semantic web technologies like ontologies and annotation. At DERI, Lange learned about engineering ontologies for scientific documents and user interfaces for annotating and browsing knowledge. He expanded his ontologies to model the rhetorical and document structures of mathematical texts.
This document discusses inducing concepts in web ontologies through terminological decision trees (TDTs). It introduces TDTs, which extend first-order logical decision trees to allow description logic (DL) concept descriptions as node tests. The document outlines how to induce TDTs, classify with them, and convert them into concept descriptions, so as to learn concepts expressed in standard DL-based semantic web representations from examples. It evaluates the approach on benchmark datasets and concludes that TDTs provide an effective means for automated concept learning in ontologies.
The spread and abundance of electronic documents requires automatic techniques for extracting useful information from the text they contain. The availability of conceptual taxonomies can be of great help, but manually building them is a complex and costly task. Building on previous work, we propose a technique to automatically extract conceptual graphs from text and reason with them. Since automated learning of taxonomies needs to be robust with respect to missing or partial knowledge and flexible with respect to noise, this work proposes a way to deal with both problems. Poor data and sparse concepts are tackled by finding generalizations among disjoint pieces of knowledge; noise is handled by introducing soft rather than hard relationships among concepts and applying a probabilistic inferential setting. In particular, we propose to reason on the extracted graph using different kinds of relationships among concepts, where each arc is associated with a number representing its likelihood among all possible worlds, and to bridge disjoint portions of knowledge by finding generalizations among distant concepts.
This document summarizes a talk on learning read-constant polynomials of constant degree modulo composites. It discusses representing Boolean functions using polynomials over finite commutative rings, with the function value determined by whether the polynomial's value falls in a designated accepting set. The goal of the research is to show that the class of Boolean functions computed by polynomials of constant degree over a fixed ring can be exactly learned in deterministic polynomial time using membership queries.
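As a toy illustration of this representation (the ring, polynomial, and accepting set below are invented for the example, not taken from the talk): a degree-2 polynomial over Z_6 whose accepting set makes it compute the Boolean function a AND b AND NOT c.

```python
m = 6                # work in the ring Z_6 (a composite modulus)
accepting = {1, 3}   # f(x) = 1 iff p(x) mod 6 lies in this set (hypothetical choice)

def p(x):
    # Read-constant, degree 2: each variable occurs a bounded number of times.
    return (3 * x[0] * x[1] + 2 * x[2]) % m

def f(x):
    # The Boolean function is defined by membership in the accepting set.
    return int(p(x) in accepting)

# Enumerate the truth table over {0,1}^3.
print([f((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1)])
```

Only the input (1, 1, 0) gives p = 3, which lies in the accepting set, so the truth table is that of a AND b AND NOT c; the learning problem is to recover such a polynomial from membership queries alone.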
An introduction to compositional models in distributional semantics - Andre Freitas
The document provides an overview of compositional distributional semantic models, which aim to develop principled and effective semantic models for real-world language use. It discusses using large corpora to extract distributional representations of word meanings and developing compositional models that combine these representations according to syntactic structure. Both additive and multiplicative mixture models as well as function-based models are described. Challenges including lack of training data and computational complexity are also outlined.
This document summarizes topic models, which are probabilistic models used to uncover the underlying semantic structure of document collections. It introduces latent Dirichlet allocation (LDA), the simplest topic model, which models documents as mixtures of topics, where each topic is a distribution over words. LDA assumes documents exhibit multiple topics in different proportions. It describes the graphical model representation of LDA and the probabilistic generative process that is assumed to have produced the observed document-word data. Figures from applying LDA to a collection of Science articles are shown to illustrate the automatically discovered topics.
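The generative process LDA assumes can be sketched in a few lines; the hyperparameter values and corpus sizes below are illustrative, not from the document:

```python
import numpy as np

rng = np.random.default_rng(1)
n_topics, vocab_size, doc_len = 3, 8, 20
alpha, eta = 0.5, 0.1  # symmetric Dirichlet hyperparameters (illustrative values)

# Each topic is a distribution over the vocabulary.
topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

def generate_document():
    # Per-document topic proportions: documents mix topics in different amounts.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)      # draw a topic assignment
        w = rng.choice(vocab_size, p=topics[z])  # draw a word from that topic
        words.append(int(w))
    return theta, words

theta, doc = generate_document()
print(len(doc), round(float(theta.sum()), 6))
```

Inference in LDA runs this process in reverse: given only the observed words, it recovers the posterior over the hidden topic assignments, proportions, and topic-word distributions.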
The document discusses formalizing the concept of integration in finite terms. It begins by explaining that to "do" an integral means to find a formula F(x) such that the derivative of F is the integrand. It then discusses formalizing the concept of a "formula" using differential fields - starting with the field of rational functions and adding algebraic elements, logarithms, and exponentials one by one. It explains that to prove an integral cannot be done in finite terms, it establishes an algebraic condition for a function to have a primitive expressible in terms of elementary functions, and shows the integrand does not satisfy this condition.
Sources
Technicalities
Exponentials and logarithms
a is an exponential
This document presents a method for measuring the semantic similarity of short texts using both corpus-based and knowledge-based measures of word semantic similarity. It combines word-to-word similarity scores with word specificity measures to determine the overall semantic similarity between two text segments. The method is evaluated on a paraphrase recognition task and is shown to outperform methods based only on simple lexical matching, resulting in up to a 13% reduction in error rate.
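The word-to-word combination scheme can be sketched as follows; the `word_sim` table and idf weights are toy stand-ins for the corpus- and knowledge-based measures and real specificity scores the method uses:

```python
def text_similarity(t1, t2, word_sim, idf):
    """Directional idf-weighted best-match score, averaged in both directions."""
    def directed(src, dst):
        num = sum(max(word_sim(w, v) for v in dst) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

# Toy stand-ins: a hypothetical similarity table and idf weights.
pairs = {("car", "automobile"): 0.9}
def word_sim(a, b):
    if a == b:
        return 1.0
    return pairs.get((a, b), pairs.get((b, a), 0.0))

idf = {"car": 2.0, "automobile": 2.0, "red": 1.0, "a": 0.1}
s = text_similarity(["a", "red", "car"], ["a", "red", "automobile"], word_sim, idf)
print(round(s, 3))
```

The idf weighting is what lets specific words ("car") dominate the score while near-ubiquitous ones ("a") contribute almost nothing, which is the key difference from simple lexical overlap.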
This document provides lecture notes for an advanced artificial intelligence course covering description logic and business rules. It includes sections on description logic concepts like concept expressions, axioms, disjunctions, and negations. It also discusses description logic knowledge bases containing TBoxes and ABoxes. Reasoning services like concept satisfiability, subsumption, and instance checking are explained. Common description logics are also introduced along with extensions like cardinality restrictions.
This paper presents a new exemplar-based approach for word sense disambiguation (WSD) that integrates multiple knowledge sources. The authors' WSD system, called LEXAS, was tested on two datasets. On a common dataset involving the noun "interest", LEXAS achieved 87.4% accuracy, higher than previous work. LEXAS was also tested on a large dataset of 192,800 sense-tagged words, performing better than the most frequent sense heuristic on highly ambiguous words. This represents the largest test of a WSD system to date.
Extending the knowledge level of cognitive architectures with Conceptual Spaces - Antonio Lieto
Extending the knowledge level of cognitive architectures with Conceptual Spaces (+ a case study with Dual-PECCS: a hybrid knowledge representation system for common sense reasoning). Talk given in Stockholm, September 2016.
Introduction to Distributional Semantics - Andre Freitas
This document provides an introduction to distributional semantics. It discusses how distributional semantic models (DSMs) represent word meanings as vectors based on their linguistic contexts in large corpora. This distributional hypothesis states that words that appear in similar contexts tend to have similar meanings. The document outlines how DSMs are built, important parameters like context type and weighting, and examples like latent semantic analysis. It also discusses how DSMs can support applications like semantic search. Finally, it introduces how compositional semantics explores representing the meanings of phrases and sentences compositionally based on the meanings of their parts.
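A minimal sketch of the distributional idea: represent each word by its co-occurrence counts with context words, and compare the resulting vectors by cosine similarity. The counts below are invented for illustration; real DSMs derive them from large corpora and apply weighting schemes such as PPMI.

```python
import numpy as np

# Toy co-occurrence counts: one vector per target word,
# dimensions = context words ["drink", "pet", "fast"].
counts = {
    "coffee": np.array([10.0, 0.0, 1.0]),
    "tea":    np.array([8.0, 0.0, 0.0]),
    "dog":    np.array([1.0, 9.0, 2.0]),
}

def cosine(u, v):
    """Cosine similarity: angle-based, insensitive to raw frequency scale."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words sharing contexts ("coffee", "tea") score higher than those that don't.
print(cosine(counts["coffee"], counts["tea"]) > cosine(counts["coffee"], counts["dog"]))
```

This is the distributional hypothesis in executable form: "coffee" and "tea" end up close because they co-occur with the same contexts, not because of any hand-coded definition.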
The document presents an overview of multistrategy learning, which aims to develop learning systems that integrate multiple inferential and computational strategies, such as empirical induction, explanation-based learning, deduction, and genetic algorithms. It describes representative multistrategy learning systems and their applications in domains like knowledge acquisition, planning, scheduling, and decision making. The systems are able to learn from a combination of examples, background knowledge, and inferences to develop more comprehensive models than single strategy learning approaches.
ConNeKTion: A Tool for Exploiting Conceptual Graphs Automatically Learned fro... - University of Bari (Italy)
Studying, understanding and exploiting the content of a digital library, and extracting useful information from it, require automatic techniques that can effectively support the users. To this aim, a relevant role can be played by concept taxonomies. Unfortunately, the availability of such resources is limited, and their manual building and maintenance are costly and error-prone. This work presents ConNeKTion, a tool for conceptual graph learning and exploitation. It allows users to learn conceptual graphs from plain text and to enrich them by finding concept generalizations. The resulting graph can be used for several purposes: finding relationships between concepts (if any), filtering the concepts from a particular perspective, keyword extraction, and information retrieval. A suitable control panel is provided for the user to comfortably carry out these activities.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether the same word occurring in different documents carries the same meaning or is a homonym. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter out homonyms from the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
RuleML2015: The Herbrand Manifesto - Thinking Inside the Box (RuleML)
The traditional semantics for First Order Logic (sometimes called Tarskian semantics) is based on the notion of interpretations of constants. Herbrand semantics is an alternative semantics based directly on truth assignments for ground sentences rather than interpretations of constants. Herbrand semantics is simpler and more intuitive than Tarskian semantics; consequently, it is easier to teach and learn. Moreover, it is more expressive. For example, while it is not possible to finitely axiomatize integer arithmetic with Tarskian semantics, this can be done easily with Herbrand semantics. The downside is the loss of some common logical properties, such as compactness and completeness. However, there is no loss of inferential power: anything that can be proved according to Tarskian semantics can also be proved according to Herbrand semantics. In this presentation, we define Herbrand semantics; we look at the implications for research on logic, rule systems, and automated reasoning; and we assess the potential for popularizing logic.
Paper presentation for the final course Advanced Concepts in Machine Learning.
The paper is "Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data":
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
The document discusses constructive description logics and provides three options for constructing description logics constructively:
1) Translating description logic syntax into intuitionistic first-order logic (IFOL) to obtain the logic IALC.
2) Translating description logic syntax into intuitionistic modal logic (IK) to obtain the logic iALC.
3) Translating description logic syntax into constructive modal logic (CK) to obtain the logic cALC.
The talk outlines the translation approaches and discusses some pros and cons of the different constructive description logics, but notes that the work is preliminary and more criteria are needed to identify the best constructive system(s).
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Semantics - L. Thorne McCarty
Slides for my talk at the 15th International Conference on Artificial Intelligence and Law (ICAIL 2015), June 11, 2015.
The full ICAIL 2015 paper is available on ResearchGate at bit.ly/1qCnLJq.
A survey on parallel corpora alignment - andrefsantos
This document provides a survey of methods for aligning parallel text corpora. It discusses the historical background of using parallel texts in language processing from the 1950s onward. Key early methods are described, including ones based on sentence length, lexical mapping between words, and identifying cognates. The document also evaluates major efforts to create benchmark datasets and evaluate system performance against gold standard alignments. It surveys the evolution of various alignment techniques and lists some relevant tools and projects in the field.
This document discusses computing the grounded extension of infinite argumentation frameworks (AFRA): frameworks with a finite set of arguments but an infinite set of attacks, represented by a regular language. It presents the dfa+ representation, which encodes an AFRA as a deterministic finite automaton in which argument states correspond to arguments and attack states correspond to subsets of attacks. It introduces concepts like splitting attack states when they have multiple incoming symbols. The goal is to use this representation to compute the grounded extension of an infinite AFRA.
This document discusses using hybrid logics to model contexts in textual inference logic (TIL). It considers representing contexts as modal operators or nominal operators (@) from hybrid logic. Specifically, it explores using intuitionistic hybrid logic (IHL) to represent temporal contexts with @ operators and other contexts as modal boxes. However, it notes that semanticists may not view contexts as modalities. It also considers experiments building a constructive hybrid logic and using a hybrid logic with only nominals and satisfaction for distributed reasoning, but leaves the experiments for future work.
- The document discusses whether the truth predicate Tr in Friedman-Sheard's truth theory FS can be considered a logical connective.
- It raises the problem that Tr violates the "HARMONY" between its introduction and elimination rules, as FS is ω-inconsistent based on McGee's theorem.
- The source of the problem is that deflationism allows asserting infinite conjunctions of sentences using Tr, while also insisting that Tr not involve ontological changes, as logical connectives do not.
This document summarizes a technical report on an empirical study of design pattern evolution in 39 open source Java projects. The study analyzed a total of 428 software releases from the projects to identify 10 common design patterns. A total of 27,855 instances of the patterns were found. The data collected includes the number of each pattern identified in each project release. This dataset will be further analyzed to understand how design patterns evolve over the lifetime of a software system.
A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation - University of Bari (Italy)
This document proposes a run length smoothing-based algorithm called RLSO for segmenting non-Manhattan document layouts. RLSO is a variant of the Run Length Smoothing Algorithm (RLSA) that uses the OR logical operator instead of AND to group connected components. Like RLSA, RLSO requires setting thresholds but these are based on different criteria. The document also presents a technique for automatically assessing the run length thresholds needed for RLSO based on the distribution of spacing in each individual document.
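A minimal sketch of the OR-based smoothing idea, assuming a binary image with 1 for ink: white runs shorter than a threshold are filled along rows and along columns, and the two results are combined with OR rather than RLSA's AND (the thresholds here are arbitrary, not the automatically assessed ones the document describes):

```python
import numpy as np

def smooth_runs(line, threshold):
    """Fill white (0) runs shorter than `threshold` lying between black (1) pixels."""
    out = line.copy()
    ones = np.flatnonzero(line)
    for a, b in zip(ones[:-1], ones[1:]):
        if b - a - 1 < threshold:   # length of the white gap between two inked pixels
            out[a:b] = 1
    return out

def rlso(img, h_thr, v_thr):
    h = np.array([smooth_runs(row, h_thr) for row in img])      # row-wise smoothing
    v = np.array([smooth_runs(col, v_thr) for col in img.T]).T  # column-wise smoothing
    return h | v   # OR instead of RLSA's AND

img = np.array([[1, 0, 0, 1, 0],
                [0, 0, 0, 0, 0],
                [1, 0, 0, 0, 1]], dtype=np.uint8)
out = rlso(img, h_thr=3, v_thr=2)
print(out[0].tolist())
```

Using OR means a pixel is kept whenever either direction bridges it, which is what lets connected components form even when text lines are not axis-aligned as a Manhattan layout assumes.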
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel... - University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, improving retrieval performance must go beyond simple lexical interpretation of user queries and pass through an understanding of their semantic content and aims. Any digital library would benefit enormously from effective Information Retrieval techniques to offer its users. This paper proposes an approach to Information Retrieval based on matching the domain of discourse of the query against that of the documents in the repository. The association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest continuing to extend and improve the approach.
The document discusses formalizing the concept of integration in finite terms. It begins by explaining that to "do" an integral means to find a formula F(x) such that the derivative of F is the integrand. It then discusses formalizing the concept of a "formula" using differential fields - starting with the field of rational functions and adding algebraic elements, logarithms, and exponentials one by one. It explains that to prove an integral cannot be done in finite terms, it establishes an algebraic condition for a function to have a primitive expressible in terms of elementary functions, and shows the integrand does not satisfy this condition.
Sources
Technicalities
Exponentials and logarithms
a is an exponential
This document presents a method for measuring the semantic similarity of short texts using both corpus-based and knowledge-based measures of word semantic similarity. It combines word-to-word similarity scores with word specificity measures to determine the overall semantic similarity between two text segments. The method is evaluated on a paraphrase recognition task and is shown to outperform methods based only on simple lexical matching, resulting in up to a 13% reduction in error rate.
This document provides lecture notes for an advanced artificial intelligence course covering description logic and business rules. It includes sections on description logic concepts like concept expressions, axioms, disjunctions, and negations. It also discusses description logic knowledge bases containing TBoxes and ABoxes. Reasoning services like concept satisfiability, subsumption, and instance checking are explained. Common description logics are also introduced along with extensions like cardinality restrictions.
This paper presents a new exemplar-based approach for word sense disambiguation (WSD) that integrates multiple knowledge sources. The authors' WSD system, called LEXAS, was tested on two datasets. On a common dataset involving the noun "interest", LEXAS achieved 87.4% accuracy, higher than previous work. LEXAS was also tested on a large dataset of 192,800 sense-tagged words, performing better than the most frequent sense heuristic on highly ambiguous words. This represents the largest test of a WSD system to date.
Extending the knowledge level of cognitive architectures with Conceptual Spac...Antonio Lieto
Extending the knowledge level of cognitive architectures with Conceptual Spaces (+ a case study with Dual-PECCS: a hybrid knowledge representation system for common sense reasoning). Talk given at Stockholm, September 2016.
Introduction to Distributional SemanticsAndre Freitas
This document provides an introduction to distributional semantics. It discusses how distributional semantic models (DSMs) represent word meanings as vectors based on their linguistic contexts in large corpora. This distributional hypothesis states that words that appear in similar contexts tend to have similar meanings. The document outlines how DSMs are built, important parameters like context type and weighting, and examples like latent semantic analysis. It also discusses how DSMs can support applications like semantic search. Finally, it introduces how compositional semantics explores representing the meanings of phrases and sentences compositionally based on the meanings of their parts.
The document presents an overview of multistrategy learning, which aims to develop learning systems that integrate multiple inferential and computational strategies, such as empirical induction, explanation-based learning, deduction, and genetic algorithms. It describes representative multistrategy learning systems and their applications in domains like knowledge acquisition, planning, scheduling, and decision making. The systems are able to learn from a combination of examples, background knowledge, and inferences to develop more comprehensive models than single strategy learning approaches.
ConNeKTion: A Tool for Exploiting Conceptual Graphs Automatically Learned fro...University of Bari (Italy)
Studying, understanding and exploiting the content of a digital library, and extracting useful information thereof, require automatic techniques that can effectively support the users. To this aim, a relevant role can be played by concept taxonomies. Unfortunately, the availability of such a kind of resources is limited, and their manual building and maintenance are costly and error-prone. This work presents ConNeKTion, a tool for conceptual graph learning and exploitation. It allows to learn conceptual graphs from plain text and to enrich them by finding concept generalizations. The resulting graph can be used for several purposes: finding relationships between concepts (if any), filtering the concepts from a particular perspective, keyword extraction and information retrieval. A suitable control panel is provided for the user to comfortably carry out these activities.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether same words contained in different
documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers
RuleML2015 The Herbrand Manifesto - Thinking Inside the Box RuleML
The traditional semantics for First Order Logic (sometimes called Tarskian semantics) is based on the notion of interpretations of constants. Herbrand semantics is an alternative semantics based directly on truth assignments for ground sentences rather than interpretations of constants. Herbrand semantics is simpler and more intuitive than Tarskian semantics; and, consequently, it is easier to teach and learn. Moreover, it is more expressive. For example, while it is not possible to finitely axiomatize integer arithmetic with Tarskian semantics, this can be done easily with Herbrand Semantics. The downside is a loss of some common logical properties, such as compactness and completeness. However, there is no loss of inferential power. Anything that can be proved according to Tarskian semantics can also be proved according to Herbrand semantics. In this presentation, we define Herbrand semantics; we look at the implications for research on logic and rules systems and automated reasoning; and and we assess the potential for popularizing logic.
Paper presentation for the final course Advanced Concept in Machine Learning.
The paper is @Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data"
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
ConNeKTion: A Tool for Exploiting Conceptual Graphs Automatically Learned fro...University of Bari (Italy)
Studying, understanding and exploiting the content of a digital library, and extracting useful information thereof, require automatic techniques that can effectively support the users. To this aim, a relevant role can be played by concept taxonomies. Unfortunately, the availability of such a kind of resources is limited, and their manual building and maintenance are costly and error-prone. This work presents ConNeKTion, a tool for conceptual graph learning and exploitation. It allows to learn conceptual graphs from plain text and to enrich them by finding concept generalizations. The resulting graph can be used for several purposes: finding relationships between concepts (if any), filtering the concepts from a particular perspective, keyword extraction and information retrieval. A suitable control panel is provided for the user to comfortably carry out these activities.
The document discusses constructive description logics and provides three options for constructing description logics constructively:
1) Translating description logic syntax into intuitionistic first-order logic (IFOL) to obtain the logic IALC.
2) Translating description logic syntax into intuitionistic modal logic (IK) to obtain the logic iALC.
3) Translating description logic syntax into constructive modal logic (CK) to obtain the logic cALC.
The talk outlines the translation approaches and discusses some pros and cons of the different constructive description logics, but notes that the work is preliminary and more criteria are needed to identify the best constructive system(s).
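For option 1, the flavour of the translation can be sketched with the standard mapping of description logic concepts into first-order formulas; the clauses below are the classical ones (the intuitionistic variant reinterprets the connectives over IFOL's semantics):

```latex
\begin{align*}
  \pi_x(A)           &= A(x)\\
  \pi_x(\neg C)      &= \neg\,\pi_x(C)\\
  \pi_x(C \sqcap D)  &= \pi_x(C) \wedge \pi_x(D)\\
  \pi_x(C \sqcup D)  &= \pi_x(C) \vee \pi_x(D)\\
  \pi_x(\exists R.C) &= \exists y\,\bigl(R(x,y) \wedge \pi_y(C)\bigr)\\
  \pi_x(\forall R.C) &= \forall y\,\bigl(R(x,y) \rightarrow \pi_y(C)\bigr)
\end{align*}
```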
How to Ground A Language for Legal Discourse In a Prototypical Perceptual Sem...L. Thorne McCarty
Slides for my talk at the 15th International Conference on Artificial Intelligence and Law (ICAIL 2015), June 11, 2015.
The full ICAIL 2015 paper is available on ResearchGate at bit.ly/1qCnLJq.
A survey on parallel corpora alignment andrefsantos
This document provides a survey of methods for aligning parallel text corpora. It discusses the historical background of using parallel texts in language processing from the 1950s onward. Key early methods are described, including ones based on sentence length, lexical mapping between words, and identifying cognates. The document also evaluates major efforts to create benchmark datasets and evaluate system performance against gold standard alignments. It surveys the evolution of various alignment techniques and lists some relevant tools and projects in the field.
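The early length-based methods mentioned above can be sketched as a dynamic program that pairs sentences of proportional length. This toy version (a crude stand-in for the Gale-Church probabilistic cost model; the skip penalty is an invented constant) allows only 1-1, 1-0 and 0-1 alignments:

```python
def align_by_length(src, tgt, skip_cost=50):
    """Align two sentence lists by minimising character-length differences.

    Returns a list of (i, j) pairs for 1-1 alignments; skipped sentences
    pay a fixed penalty.
    """
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n and cost[i][j] + skip_cost < cost[i + 1][j]:   # skip source
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + skip_cost, (i, j, "skip")
            if j < m and cost[i][j] + skip_cost < cost[i][j + 1]:   # skip target
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + skip_cost, (i, j, "skip")
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))

print(align_by_length(["aaaa", "bb", "cccccc"], ["xxxx", "yy", "zzzzzz"]))
# -> [(0, 0), (1, 1), (2, 2)]
```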
This document discusses computing the grounded extension of infinite argumentation frameworks (AFRA). It defines AFRA as frameworks with a finite set of arguments but an infinite set of attacks, represented by a regular language. It presents the dfa+ representation, which encodes an AFRA as a deterministic finite automaton such that argument states correspond to arguments and attack states correspond to subsets of attacks. It introduces concepts like splitting attack states when they have multiple incoming symbols. The goal is to use this representation to compute the grounded extension of infinite AFRA.
This document discusses using hybrid logics to model contexts in textual inference logic (TIL). It considers representing contexts as modal operators or nominal operators (@) from hybrid logic. Specifically, it explores using intuitionistic hybrid logic (IHL) to represent temporal contexts with @ operators and other contexts as modal boxes. However, it notes that semanticists may not view contexts as modalities. It also considers experiments building a constructive hybrid logic and using a hybrid logic with only nominals and satisfaction for distributed reasoning, but leaves the experiments for future work.
- The document discusses whether the truth predicate Tr in Friedman-Sheard's truth theory FS can be considered a logical connective.
- It raises the problem that Tr violates the "HARMONY" between its introduction and elimination rules, as FS is ω-inconsistent based on McGee's theorem.
- The source of the problem is that deflationism allows asserting infinite conjunctions of sentences using Tr, while also insisting that Tr not involve ontological changes, as logical connectives do not.
This document summarizes a technical report on an empirical study of design pattern evolution in 39 open source Java projects. The study analyzed a total of 428 software releases from the projects to identify 10 common design patterns. A total of 27,855 instances of the patterns were found. The data collected includes the number of each pattern identified in each project release. This dataset will be further analyzed to understand how design patterns evolve over the lifetime of a software system.
A Run Length Smoothing-Based Algorithm for Non-Manhattan Document SegmentationUniversity of Bari (Italy)
This document proposes a run length smoothing-based algorithm called RLSO for segmenting non-Manhattan document layouts. RLSO is a variant of the Run Length Smoothing Algorithm (RLSA) that uses the OR logical operator instead of AND to group connected components. Like RLSA, RLSO requires setting thresholds but these are based on different criteria. The document also presents a technique for automatically assessing the run length thresholds needed for RLSO based on the distribution of spacing in each individual document.
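The core smoothing step can be written in a few lines. A hedged toy version (assuming a binary image as nested lists with 1 = black): white runs shorter than a threshold are filled along rows and along columns, and the two results are combined with OR, as RLSO prescribes.

```python
def smooth_runs(line, threshold):
    """Fill runs of white (0) pixels shorter than threshold with black (1)."""
    out, i, n = line[:], 0, len(line)
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            if j - i < threshold and i > 0 and j < n:  # run bounded by black pixels
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out

def rlso(image, h_threshold, v_threshold):
    """Run Length Smoothing with OR: horizontal OR vertical smoothing."""
    horiz = [smooth_runs(row, h_threshold) for row in image]
    cols = [smooth_runs(list(col), v_threshold) for col in zip(*image)]
    vert = [list(row) for row in zip(*cols)]
    return [[h | v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]

page = [[1, 0, 0, 1],
        [0, 0, 0, 0],
        [1, 0, 0, 1]]
print(rlso(page, 3, 3)[0])  # -> [1, 1, 1, 1]
```

Using OR instead of AND means a pixel turns black if either direction smooths it, which is what lets RLSO merge components in non-Manhattan layouts.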
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support the users in understanding their content and extracting useful information. To this aim, improving retrieval performance must necessarily go beyond a simple lexical interpretation of the user queries, and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would take enormous advantage from the availability of effective Information Retrieval techniques to offer its users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest continuing to extend and improve the approach.
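The idea of matching query and documents by domain of discourse can be sketched crudely with Jaccard overlap of WordNet-Domains-style labels. The domain annotations below are invented, and the paper's actual similarity assessment technique is more elaborate than plain Jaccard:

```python
def domain_similarity(query_domains, doc_domains):
    """Jaccard overlap between the domain labels of a query and a document."""
    q, d = set(query_domains), set(doc_domains)
    return len(q & d) / len(q | d) if q | d else 0.0

def rank_documents(query_domains, docs):
    """docs: dict name -> domain labels; returns names, best match first."""
    return sorted(docs,
                  key=lambda name: domain_similarity(query_domains, docs[name]),
                  reverse=True)

docs = {  # hypothetical domain annotations
    "d1": ["medicine", "biology"],
    "d2": ["sport", "play"],
    "d3": ["medicine", "pharmacy", "biology"],
}
print(rank_documents(["medicine", "biology"], docs))  # -> ['d1', 'd3', 'd2']
```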
Recognising the Social Attitude in Natural Interaction with Pedagogical AgentsUniversity of Bari (Italy)
Pedagogical Conversational Agents (PCAs) have the advantage of offering students not only task-oriented support but also the possibility to interact with the computer medium at a social level. This form of intelligence is particularly important when the character is employed in an educational setting. This paper reports our initial results on the recognition of users' social response to a pedagogical agent from the linguistic, acoustic and gestural analysis of the student's communicative act.
Recognising the Social Attitude in Natural Interaction with Pedagogical AgentsUniversity of Bari (Italy)
This document describes research on recognizing social attitudes in interactions between students and pedagogical conversational agents. The researchers developed a framework that analyzes linguistic, acoustic, and gesture cues from students to recognize social responses like openness/warmth and closedness/distance. They collected a multimodal corpus of student-agent interactions and annotated the data. A dynamic Bayesian network integrates the social attitude cues to model how the student's attitude evolves during the dialog.
A Run Length Smoothing-Based Algorithm for Non-Manhattan Document SegmentationUniversity of Bari (Italy)
Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A famous approach proposed in the literature for layout analysis is RLSA. Here we consider a variant of RLSA, called RLSO (short for “Run Length Smoothing with OR”), that exploits the OR logical operator instead of the AND and is particularly suited to the identification of frames in non-Manhattan layouts. Like RLSA, RLSO is based on thresholds, but on different criteria than those used in RLSA. Since setting such thresholds is a hard and unnatural task for (even expert) users, and no single threshold can fit all documents, we developed a technique to automatically define such thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, which cover a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by a uniform text font size.
Cooperating Techniques for Extracting Conceptual Taxonomies from TextFulvio Rotella
The document proposes a mixed approach using existing natural language processing techniques and novel techniques to automatically construct conceptual taxonomies from text. It identifies relevant concepts from text using keyword extraction, clustering, and computing relevance weights. It then generalizes similar concepts using WordNet to group concepts and disambiguate word senses. Preliminary evaluations show promising initial results.
An Approach To Assess The Existence Of A Proposed Intervention In Essay-Argum...Heather Strinden
This paper presents an approach to automatically assess the existence of argument components like thesis, arguments, and intervention proposals in essay texts. The methodology involves corpus annotation, feature selection, and training and testing models. Argumentation mining features are extracted from essays, including the number of theses, arguments, and other components. Machine learning classifiers like gradient boosted trees and support vector machines are trained on the features to predict scores. The results show that argumentation mining features can improve automatic essay scoring compared to the usual lexical and statistical features. This helps evaluate student essays at a larger scale.
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...Subhajit Sahu
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is two dimensional Weisfeiler-Leman
- DeepWL is an unrestricted version of WL that runs in polynomial time.
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
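The first bullet point above refers to one-dimensional Weisfeiler-Leman, i.e. colour refinement. A minimal sketch of the refinement loop, which repeatedly relabels each vertex by its own colour together with the multiset of its neighbours' colours:

```python
def wl1_colours(adj, rounds=None):
    """1-WL colour refinement on an undirected graph.

    adj: dict vertex -> list of neighbours.
    Returns the sorted multiset of stable colours. The relabelling is
    deterministic, so different multisets imply non-isomorphic graphs
    (the converse fails: 1-WL cannot separate all graphs).
    """
    colours = {v: 0 for v in adj}
    rounds = rounds if rounds is not None else len(adj)
    for _ in range(rounds):
        signatures = {v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
                      for v in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new = {v: palette[signatures[v]] for v in adj}
        if new == colours:   # refinement has stabilised
            break
        colours = new
    return sorted(colours.values())

# A triangle and a 3-vertex path get different colour multisets
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl1_colours(triangle) != wl1_colours(path))  # True
```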
The document discusses Lin Ma's PhD research on analyzing presuppositions in natural language requirements. Presuppositions are implicit commitments in language that simplify communication but can cause misunderstanding if not made explicit. The research aims to automatically detect presuppositions triggered by definite descriptions in requirements and identify which are not explicitly stated. It will use natural language processing techniques and knowledge sources to classify definite descriptions and analyze how presuppositions project in requirements texts.
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...ijaia
Chinese discourse coherence modeling remains a challenging task in the Natural Language Processing field. Existing approaches mostly focus on feature engineering, adopting sophisticated features to capture the logical, syntactic or semantic relationships across sentences within a text. In this paper, we present an entity-driven recursive deep model for Chinese discourse coherence evaluation, based on a current English discourse coherence neural network model. Specifically, to overcome the current model's shortcomings in identifying entity (noun) overlap across sentences, our combined model successfully incorporates the entity information into the recursive neural network framework. Evaluation results on both sentence ordering and machine translation coherence rating tasks show the effectiveness of the proposed model, which significantly outperforms the existing strong baseline.
Analogy is one of the most studied representatives of a family of non-classical forms of reasoning working across different domains, usually taken to play a crucial role in creative thought and problem-solving. In the first part of the talk, I will briefly introduce general principles of computational analogy models (relying on a generalization-based approach to analogy-making). We will then have a closer look at Heuristic-Driven Theory Projection (HDTP) as an example of a theoretical framework and implemented system: HDTP computes analogical relations and inferences for domains which are represented using many-sorted first-order logic languages, applying a restricted form of higher-order anti-unification to find shared structural elements common to both domains. The presentation of the framework will be followed by a few reflections on the "cognitive plausibility" of the approach, motivated by theoretical complexity and tractability considerations.
In the second part of the talk I will discuss an application of HDTP to modeling essential parts of concept blending processes, a current "hot topic" in Cognitive Science. Here, I will sketch an analogy-inspired formal account of concept blending, developed in the European FP7-funded Concept Invention Theory (COINVENT) project, combining HDTP with mechanisms from Case-Based Reasoning.
FCA-MERGE: Bottom-Up Merging of Ontologiesalemarrena
The document describes a new bottom-up method called FCA-MERGE for merging ontologies. It extracts instances from documents for each ontology to generate formal contexts. It then merges the contexts and computes a concept lattice using techniques from Formal Concept Analysis. This lattice provides a structural description of the merging process. The final merged ontology is then generated from the lattice with human guidance. FCA-MERGE circumvents the problem of finding instances classified in both ontologies by extracting instances from relevant documents.
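The Formal Concept Analysis step at the heart of FCA-MERGE can be illustrated on a tiny formal context. A hedged, exponential-time sketch (the context below is invented) that enumerates all formal concepts, i.e. closed (extent, intent) pairs:

```python
from itertools import combinations

def formal_concepts(context):
    """Enumerate the formal concepts of a context (toy version).

    context: dict mapping each object to its set of attributes.
    Every subset of objects is closed under the two FCA derivation
    operators; the resulting (extent, intent) pairs, ordered by
    extent inclusion, form the concept lattice.
    """
    objects = list(context)
    all_attrs = frozenset().union(*(frozenset(a) for a in context.values()))
    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            attr_sets = [frozenset(context[o]) for o in subset]
            intent = frozenset.intersection(*attr_sets) if attr_sets else all_attrs
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, intent))
    return concepts

# Invented toy context, e.g. instances extracted from documents
ctx = {
    "lion": {"animal", "predator"},
    "deer": {"animal"},
    "oak":  {"plant"},
}
print(len(formal_concepts(ctx)))  # -> 5
```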
This document discusses topic extraction for domain ontology. It describes domain ontology as a collection of vocabularies and conceptualization of a given domain. The purpose of topic extraction is to identify relevant concepts in documents, obtain domain-specific terms, classify documents, and identify key concepts and relationships for an ontology. The project stages include obtaining domain knowledge, preprocessing documents, and applying either K-Means clustering or Latent Dirichlet Allocation to extract topics. K-Means partitions data into clusters while LDA represents documents as mixtures over topics characterized by word distributions.
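The K-Means option above can be sketched in a few lines of plain Python. The 2-d "document vectors" are invented stand-ins; a real pipeline would cluster TF-IDF vectors of the preprocessed documents:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; returns a cluster index for every point."""
    centroids = [list(p) for p in points[:k]]  # naive initialisation
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign

# Two obvious groups of toy document vectors
pts = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
print(kmeans(pts, 2))  # -> [0, 0, 1, 1]
```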
This document proposes online inference algorithms for topic models as an alternative to traditional batch algorithms. It introduces two related online algorithms: incremental Gibbs samplers and particle filters. These algorithms update estimates of topics incrementally as each new document is observed, making them suitable for applications where the document collection grows over time. The algorithms are evaluated in comparison to existing batch algorithms to analyze their runtime and performance.
Lean Logic for Lean Times: Varieties of Natural LogicValeria de Paiva
This document discusses using logic to analyze natural language text. It proposes a Knowledge Inference Management Language (KIML) that represents text as concepts, roles, and contexts. KIML aims to model quantification, propositional attitudes, and inference in a way that corresponds to natural language semantics. The document also discusses using contextual constructive description logics and connexive logic to model textual entailment relationships.
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONSsipij
In this paper, we present a set of spatial relations between concepts describing an ontological model for a
new process of character recognition. Our main idea is based on the construction of the domain ontology
modelling the Latin script. This ontology is composed by a set of concepts and a set of relations. The
concepts represent the graphemes extracted by segmenting the manipulated document and the relations are
of two types, is-a relations and spatial relations. In this paper we are interested by description of second
type of relations and their implementation by java code.
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
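The generative story described above can be written out directly. A hedged sketch (topic count, vocabulary and Dirichlet parameters are invented) that samples one document the way LDA assumes it was produced:

```python
import random

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution via normalised Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, items):
    """Draw one item according to the given probabilities."""
    r, acc = random.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r < acc:
            return item
    return items[-1]

def generate_document(topics, alpha, length):
    """LDA's generative process for a single document.

    topics: list of (word -> probability) dicts, one per latent topic.
    alpha: Dirichlet prior over the per-document topic distribution.
    """
    theta = sample_dirichlet(alpha)          # per-document topic mixture
    words = []
    for _ in range(length):
        z = sample_categorical(theta, range(len(topics)))  # pick a topic
        vocab = list(topics[z])
        words.append(sample_categorical([topics[z][w] for w in vocab], vocab))
    return words

# Two invented topics over a tiny vocabulary
topics = [{"gene": 0.6, "dna": 0.4}, {"ball": 0.7, "goal": 0.3}]
print(generate_document(topics, alpha=[0.5, 0.5], length=6))
```

Inference (e.g. by Gibbs sampling or variational methods) runs this story in reverse, recovering topics and mixtures from observed words.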
1) The document discusses a system called MaLTe (Machine Learning from Text) that aims to extract knowledge from technical expository texts using both natural language processing and machine learning techniques.
2) MaLTe will process texts containing narratives and examples, and output a representation of the knowledge in the form of Horn clauses. Some user interaction will be required during the translation process.
3) The document outlines several challenges in applying machine learning and natural language processing to knowledge extraction from real-world texts, including their logical structure and examples. It provides an example from a tax guide to illustrate these challenges.
The document discusses the history and development of ontologies. It begins with definitions of key terms like ontology, vocabulary, and taxonomy. It then provides a brief history of ontologies dating back to ancient Greek philosophers. The document also discusses how ontologies are used in computer science to formally represent domain knowledge. It provides examples of ontologies in fields like medicine, commerce, and the semantic web. Finally, it discusses best practices for building ontologies, such as reusing existing terms and collaborating with domain experts and end users.
Cerutti--Knowledge Representation and Reasoning (postgrad seminar @ Universit...Federico Cerutti
This document provides an overview of knowledge representation and reasoning. It discusses several key concepts, including knowledge, representation, and reasoning. It also describes different approaches to knowledge representation and reasoning, such as classical logic, description logics, and non-monotonic logics. The document uses examples to illustrate concepts like first-order logic, description logics syntax and semantics, and the semantic web.
Method for ontology generation from concept maps in shallow domainsLuigi Ceccaroni
This document presents a method for generating OWL ontologies from concept maps in shallow domains. It involves a 5-phase process: 1) disambiguating concept senses using WordNet, 2) initially coding classes, 3) identifying subclass relations, 4) identifying instance relations, and 5) identifying property relations. The method was implemented in a Java application and validated using Protege and an OWL validator. It demonstrates the close relationship between concept maps and ontologies by interpreting concept maps as structured text to semantically infer OWL coding.
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
The document presents a novel fuzzy clustering algorithm called FRECCA that clusters sentences from multi-documents to discover new information. FRECCA uses fuzzy relational eigenvector centrality to calculate page rank scores for sentences within clusters, treating the scores as likelihoods. It uses expectation maximization to optimize cluster membership values and mixing coefficients without a parameterized likelihood function. An evaluation shows FRECCA achieves superior performance to other clustering algorithms on a quotations dataset, identifying overlapping clusters of semantically related sentences.
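The eigenvector-centrality ingredient of FRECCA can be sketched as plain power iteration over a sentence-similarity matrix. The matrix below is invented, and FRECCA additionally weights the scores with fuzzy cluster memberships inside an EM loop, which this sketch omits:

```python
def pagerank(sim, damping=0.85, iters=50):
    """Power iteration on a row-normalised sentence-similarity matrix."""
    n = len(sim)
    rows = []
    for row in sim:
        s = sum(row)
        rows.append([v / s for v in row] if s else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n +
                damping * sum(rank[j] * rows[j][i] for j in range(n))
                for i in range(n)]
    return rank

sim = [[0, 1, 1],   # sentence 0 is similar to both others
       [1, 0, 0],
       [1, 0, 0]]
r = pagerank(sim)
print(r[0] > r[1])  # True: sentence 0 is the most central
```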
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAijistjournal
Ontologies have been applied to many applications in recent years, especially the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. The purpose of a domain-specific ontology is to get rid of conceptual and terminological confusion. It accomplishes this by specifying a set of generic concepts that characterizes the domain, as well as their definitions and interrelationships. This paper describes some algorithms for identifying semantic relations and constructing an Information Technology ontology, while extracting the concepts and objects from different sources. The ontology is constructed based on three main resources: ACM, Wikipedia and unstructured files from the ACM Digital Library. Our algorithms combine Natural Language Processing and Machine Learning. We use Natural Language Processing tools, such as OpenNLP and the Stanford Lexical Dependency Parser, to explore sentences. We then extract these sentences based on English patterns in order to build a training set. We use a random sample among 245 categories of ACM to evaluate our results. The results show that our system yields superior performance.
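The pattern-based extraction step can be illustrated with a classic Hearst-style lexico-syntactic pattern for is-a relations. A regex sketch (the "such as" pattern is the textbook example, not necessarily one of the authors' patterns, and real systems combine several patterns with a parser):

```python
import re

def hearst_pairs(sentence):
    """Extract (hypernym, hyponym) pairs with the classic 'such as' pattern."""
    pairs = []
    m = re.search(r"(\w+) such as ([\w, ]+)", sentence)
    if m:
        hyper, rest = m.group(1), m.group(2)
        for hypo in re.split(r", | and ", rest):
            if hypo:
                pairs.append((hyper, hypo.strip()))
    return pairs

print(hearst_pairs("languages such as Java, Python and Haskell"))
# -> [('languages', 'Java'), ('languages', 'Python'), ('languages', 'Haskell')]
```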
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
1. Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica
Cooperating Techniques for
Extracting Conceptual Taxonomies from Text
S. Ferilli, F. Leuzzi, F. Rotella
L.A.C.A.M.
http://lacam.di.uniba.it:8000
AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence
Workshop on Mining Complex Patterns (MCP 2011)
Palermo, Italy, September 17, 2011
2. Overview
1. Introduction & Objectives
2. Extraction of knowledge from text
3. Knowledge representation formalism
4. Identification of relevant concepts
5. Generalization of similar concepts
6. Reasoning ‘by association’
7. Conclusions & Future works
Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 2
3. Introduction
The spread of electronic documents and document repositories has generated the need for automatic techniques to understand and handle document content, in order to help users satisfy their information needs.
Full Text Understanding is not trivial, due to:
1. the intrinsic ambiguity of natural language;
2. the huge amount of common-sense and conceptual background knowledge required.
Lexical and/or conceptual taxonomies help face these problems, but building them manually is very costly and error-prone.
4. Introduction
This lack is a strong motivation for the automatic construction of conceptual networks by mining large amounts of natural-language documents.
However, even assuming a correct knowledge representation, we are still far from simulating human abilities.
5. Objectives
1. Definition of a representation formalism for knowledge
extracted from natural language texts
2. Extraction of concepts and relevance assessment
3. Generalization of concepts having similar descriptions
4. Definition of a kind of reasoning by concept association that
looks for possible indirect connections between two
identified concepts
6. Extraction of knowledge from text
Knowledge is extracted by processing each sentence separately:
Stanford Parser [1] → Stanford Dependencies [2]
The final output of the Stanford Dependencies is a typed syntactic structure for each sentence.
7. Knowledge representation formalism
Among all the grammatical roles played by words in a sentence, only subject, verb and complement are considered.
In the final conceptual graph, subjects and complements represent concepts, while verbs express the relations between them:
⟨subject, verb, complement⟩
8. Identification of relevant concepts
A mix of several techniques is brought to cooperation for identifying relevant concepts:
● Hub Words [3]: high-frequency words whose relevance is computed as
W(t) = α·w₀ + β·n + γ·Σᵢ w(tᵢ)
where w₀ is the initial weight, n the number of relationships, and w(tᵢ) the tf*idf weight of the i-th word related to t.
● Keyword extraction techniques from single documents.
● EM clustering provided by Weka [4], based on Euclidean distance.
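The Hub Words weighting above can be sketched in a few lines of code; the function name and the example weights below are illustrative, not taken from [3].

```python
# Hub Words relevance (a sketch): W(t) = alpha*w0 + beta*n + gamma*sum(w(t_i))
# w0: initial weight of t; n: number of relationships of t;
# related_weights: tf*idf weights w(t_i) of the words related to t.
def hub_word_weight(w0, related_weights, alpha=0.4, beta=0.3, gamma=0.3):
    n = len(related_weights)  # number of relationships
    return alpha * w0 + beta * n + gamma * sum(related_weights)

# Example with hypothetical weights:
w = hub_word_weight(w0=0.5, related_weights=[0.2, 0.1, 0.3])
```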
9. Identification of relevant concepts
Inspired by the Hub Words approach, we have defined a Relevance Weight made of five components A, B, C, D, E:

W(c̄) = α·w(c̄)/max_c w(c) + β·e(c̄)/max_c e(c) + γ·(Σ_{(c,c̄)} w(c))/e(c̄) + δ·(d_M − d(c̄))/d_M + ε·k(c̄)/max_c k(c)

where α + β + γ + δ + ε = 1.
Nodes in the network are ranked by decreasing Relevance Weight.
A suitable cut-point in the ranking is determined by choosing the first item such that:

W(c_k) − W(c_{k+1}) ≥ p · max_{i=0,…,n−1} ( W(c_i) − W(c_{i+1}) )

where p ∈ [0, 1].
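The Relevance Weight and the cut-point rule can be sketched as follows; the function signatures and the toy values are illustrative assumptions, not from the paper.

```python
def relevance_weight(c, w, e, neigh_w_sum, d, d_max, k,
                     alpha, beta, gamma, delta, epsilon):
    """Relevance Weight of node c as a weighted sum of five components.
    w, e, k: dicts mapping each node to its initial weight, edge count and
    keyword-extraction score; neigh_w_sum: sum of the initial weights of
    c's neighbours; d: distance of c from the cluster centre; d_max: the
    maximum such distance in the cluster."""
    A = alpha * w[c] / max(w.values())
    B = beta * e[c] / max(e.values())
    C = gamma * neigh_w_sum / e[c]          # average neighbour weight
    D = delta * (d_max - d) / d_max         # closeness to cluster centre
    E = epsilon * k[c] / max(k.values())
    return A + B + C + D + E

def cut_point(ranked_weights, p):
    """First index i such that W(c_i) - W(c_{i+1}) >= p * (maximum gap)."""
    gaps = [ranked_weights[i] - ranked_weights[i + 1]
            for i in range(len(ranked_weights) - 1)]
    max_gap = max(gaps)
    for i, g in enumerate(gaps):
        if g >= p * max_gap:
            return i
    return len(ranked_weights) - 1
```

With p = 1.0 the cut falls exactly at the largest gap in the ranking; smaller p values cut earlier.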
10. Identification of relevant concepts
Relevance Weight in detail
Definition of the Initial Weight
The whole set of ⟨subject, verb, complement⟩ triples is represented in a Concepts × Attributes matrix V, recalling the classical Terms × Documents Vector Space Model.
Resembling tf*idf, each entry is computed as:

( fᵢ,ⱼ / Σₖ fₖ,ⱼ ) · log( |A| / |{ j : cᵢ ∈ aⱼ }| )

Therefore component A is:

α · w(c̄) / max_c w(c)

where w(c) is the initial weight assigned to node c, computed according to the above tf*idf schema.
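The tf*idf-style initial weight can be sketched as below; parameter names are illustrative, with the column frequencies and counts standing in for the matrix V.

```python
import math

def tf_idf(f_ij, column_freqs, n_attributes, n_attrs_containing):
    """tf*idf-like initial weight for concept c_i w.r.t. attribute a_j:
    (f_ij / sum_k f_kj) * log(|A| / |{j : c_i in a_j}|).
    column_freqs: all frequencies f_kj in column j;
    n_attributes: |A|; n_attrs_containing: |{j : c_i in a_j}|."""
    tf = f_ij / sum(column_freqs)
    idf = math.log(n_attributes / n_attrs_containing)
    return tf * idf
```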
11. Identification of relevant concepts
Relevance Weight in detail
Connections Number
Component B considers the number of connections (edges) e(c̄) in which c̄ is involved:

β · e(c̄) / max_c e(c)

Neighborhood Weight Summary
Component C takes into account the average initial weight of all neighbors of c̄:

γ · ( Σ_{(c,c̄)} w(c) ) / e(c̄)
12. Identification of relevant concepts
Relevance Weight in detail
Inverse Distance from Center
Component D represents the closeness of c̄ to the center of its cluster:

δ · ( d_M − d(c̄) ) / d_M

KE Influence
Component E takes into account the outcome of three Keyword Extraction (KE) techniques, suitably weighted:

ε · k(c̄) / max_c k(c)

where:

k(c̄) = ς·k_co-occurrences(c̄) + η·k_synset(c̄) + θ·k_mvn(c̄)
13. Identification of relevant concepts
Relevance Weight in detail
● KE based on χ²:
k_co-occurrences = ς · co-occurrences / max_cluster χ²
● KE based on WordNet synsets:
k_synset = η · kw_synset / max( kw_synset )
● KE by means of the Multivariate Normal Distribution (MVN):
k_mvn = θ · kw_mvn / max( kw_mvn )
14. Identification of relevant concepts
Evaluations

Test #   α     β     γ     δ     ε     p
1        0.10  0.10  0.30  0.25  0.25  1.0
2        0.20  0.15  0.15  0.25  0.25  0.7
3        0.15  0.25  0.30  0.15  0.15  1.0

Test #  Concept     A        B      C       D      E      W
1       network     0.100    0.100  0.021   0.178  0.250  0.649
        access      0.001    0.001  0.154   0.239  0.250  0.646
        subset      6.32E-4  0.001  0.150   0.239  0.250  0.641
2       network     0.200    0.150  0.0105  0.178  0.250  0.789
3       network     0.150    0.250  0.021   0.146  0.150  0.717
        user        0.127    0.195  0.022   0.146  0.150  0.641
        number      0.113    0.187  0.022   0.146  0.150  0.619
        individual  0.103    0.174  0.020   0.146  0.150  0.594
15. Generalization of similar concepts
Pairwise clustering
Each concept is described by a binary vector that represents the presence or absence (1 or 0, respectively) of a ⟨subject, complement⟩ relation between the involved concepts. The Hamming distance between two such vectors provides a similarity evaluation between the corresponding concepts.
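The pairwise comparison of binary description vectors can be sketched as follows; the normalization and the threshold values are illustrative assumptions.

```python
def hamming_distance(v1, v2):
    """Normalized Hamming distance between two binary description vectors:
    fraction of positions in which the vectors disagree."""
    assert len(v1) == len(v2)
    return sum(a != b for a, b in zip(v1, v2)) / len(v1)

# Concept pairs whose distance falls below a threshold are candidates
# for generalization (cf. the thresholds 0.001 and 0.0001 used later).
def similar(v1, v2, threshold=0.001):
    return hamming_distance(v1, v2) <= threshold
```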
16. Generalization of similar concepts
WordNet
WordNet¹ is an external resource with some useful properties:
1. it is a lexical taxonomy;
2. each concept is described as a set of synonyms (synset);
3. synsets are interlinked by means of conceptual-semantic and lexical relations.
We focus on hypernymy, the relation that links the current synset to more general ones.
1. http://wordnet.princeton.edu/
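Generalization via hypernymy can be sketched on a toy taxonomy; the dict below is a hypothetical stand-in for WordNet's hypernym links (in practice WordNet would be queried through a library such as NLTK), and the function names are illustrative.

```python
# Toy stand-in for WordNet hypernymy: each synset maps to its hypernym.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "cat": "feline",
    "feline": "carnivore",
}

def hypernym_chain(synset):
    """Walk the hypernymy relation up to the most general synset."""
    chain = [synset]
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain

def least_common_hypernym(s1, s2):
    """First shared ancestor of two synsets: the natural candidate
    when generalizing two similar concepts into one."""
    ancestors = set(hypernym_chain(s1))
    for s in hypernym_chain(s2):
        if s in ancestors:
            return s
    return None
```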
17. Generalization of similar concepts
Taxonomical similarity function
● More general: provides a similarity value on the basis of common relations, without focusing on the specific path.
● More specific: provides a similarity value on the basis of common relations, relying on the specific path.
18. Generalization of similar concepts
WSD Domain Driven
One Domain per Discourse assumption: many uses of a word in a coherent portion of text tend to share the same domain.
Prevalent domain individuation:
1. extraction of all synsets for each term;
2. extraction of all domains for each synset;
3. choice of the prevalent-domain synset.
19. Generalization of similar concepts
Evaluations
Two toy experiments have been performed, with Hamming distance thresholds of 0.001 and 0.0001 respectively, while the taxonomical similarity function threshold has been kept equal to 0.4.
20. Reasoning ‘by association’
Breadth-First Search
Given two nodes (concepts), a Breadth-First Search starts from both: each side expands its frontier towards the other, until the two frontiers meet at common nodes. The connecting path is then restored by going backward to the two roots in both directions.
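The bidirectional search above can be sketched as follows, assuming the conceptual graph is given as an undirected adjacency dict; function names are illustrative.

```python
from collections import deque

def bidirectional_bfs(graph, start, goal):
    """Expand BFS frontiers from both concepts until they meet, then
    rebuild the connecting path from the meeting node outwards."""
    if start == goal:
        return [start]
    parents_s, parents_g = {start: None}, {goal: None}
    frontier_s, frontier_g = deque([start]), deque([goal])
    while frontier_s and frontier_g:
        meet = _expand(graph, frontier_s, parents_s, parents_g)
        if meet is None:
            meet = _expand(graph, frontier_g, parents_g, parents_s)
        if meet is not None:
            # path start -> meet, then meet -> goal
            return _path(parents_s, meet)[::-1] + _path(parents_g, meet)[1:]
    return None  # the two concepts are not connected

def _expand(graph, frontier, parents, other_parents):
    """Advance one frontier by one level; return a meeting node, if any."""
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                frontier.append(nxt)
            if nxt in other_parents:
                return nxt
    return None

def _path(parents, node):
    """Follow parent links from node back to the root of its search."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path
```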
21. Reasoning ‘by association’
Evaluations
The table below shows a sample of possible outcomes.
E.g., an interpretation of case 5 can be:
“the adults write about freedom and use platform, that is
recognized as a technology, as well as the internet”.
22. Conclusions
This work proposes an approach to automatically extract conceptual taxonomies from natural language texts.
It mixes different techniques in order to:
● identify relevant terms/concepts in text;
● generalize similar concepts;
● perform some kind of reasoning “by association”.
Preliminary experiments show that this approach can be viable, although extensions and refinements are needed.
A reliable outcome might help users understand the text content, and machines automatically perform some kind of reasoning on the taxonomy.
23. Future works
1. Extending the knowledge representation formalism to express negation.
2. Defining a strategy to make a better choice of the weights in the Relevance Weight computation.
3. Enriching the adjacency matrix to improve concept descriptions.
4. Exploring alternatives to the One Domain per Discourse (ODD) assumption, to overcome its limits.
5. Extending the taxonomical similarity measures, which currently take into account only the hypernym relation, with other relations for a more accurate similarity.
6. Defining a strategy to prefer one verb rather than keeping all of them in the reasoning ‘by association’ phase.
24. References
[1] Dan Klein and Christopher D. Manning. Fast exact
inference with a factored model for natural language parsing.
In Advances in Neural Information Processing Systems,
volume 15. MIT Press, 2003.
[2] Marie-Catherine de Marneffe, Bill MacCartney, and
Christopher D. Manning. Generating typed dependency parses
from phrase structure trees. In LREC, 2006.
[3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing
an ontology based on hub words. In ISMIS’03, pages 93–97,
2003.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann,
and I.H. Witten. The WEKA data mining software: an update.
SIGKDD Explorations, 11(1):10–18, 2009.