Type Vector Representations from Text: An Empirical Analysis. Presented at the Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies (DL4KGS), held in conjunction with ESWC 2018 in June 2018 in Crete, Greece.
Keywords: Type Embeddings, Ontology Matching, Type Similarity
Entity linking involves recognizing entity mentions in text and linking them to entries in a knowledge base. It includes three main steps: mention detection to identify linkable phrases, candidate selection to generate potential entities for each mention, and disambiguation to select the best entity for each mention based on context. Disambiguation approaches consider local compatibility of mentions and entities as well as global coherence across all entity linking decisions. Collective approaches aim to jointly optimize all entity annotations for a document.
The document provides an introduction to Probabilistic Latent Semantic Analysis (PLSA). It discusses how PLSA improves on previous Latent Semantic Analysis methods by incorporating a probabilistic framework. PLSA models documents as mixtures of topics and allows words to have multiple meanings. The parameters of the PLSA model, including the topic distributions and word-topic distributions, are estimated using an expectation-maximization algorithm to find the parameters that best explain the observed word-document co-occurrence data.
This document discusses machine learning techniques for modeling document collections. It introduces topic models, which represent documents as mixtures of topics and topics as mixtures of words. Topic models provide dimensionality reduction and allow semantic-based browsing of document collections. Variational inference methods are described for approximating the posterior distribution in topic models like LDA and correlated topic models.
Gleaning Types for Literals in RDF with Application to Entity Summarization, by Kalpa Gunaratna
ESWC 2016 talk about how to compute types (ontology classes) for literals and add semantics to them, making them richer, and then utilize them in an entity summarization use case.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
Introduction to Automated Text Analyses in the Political Sciences, by Christian Rauh
This document provides an introduction and overview of automated text analysis methods for political science research. It discusses the promises and pitfalls of automated analysis and outlines some common text analysis approaches, including corpus construction, dictionary-based analysis, text scaling, and briefly touches on topic modeling and machine learning. The document uses debates around climate change at the United Nations as a running example to demonstrate how these various methods can be applied to a research question and corpus of documents. It emphasizes that automated analyses require validation and should augment rather than replace human interpretation of texts.
This is a short presentation that explains the famous TextRank paper, which used graphs to produce summaries and document indices (keywords).
Link to paper: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
The document discusses the Julia programming language. It highlights that Julia bridges the gap between computer science and computational science by allowing for both data abstraction and high performance. Julia uses multiple dispatch as its core programming paradigm, which allows functions to have different implementations depending on the types of their arguments. This enables Julia to perform efficiently on a wide range of technical computing tasks.
This document presents the Duet model for document ranking. The Duet model uses a combination of local and distributed representations of text to perform both exact and inexact matching of queries to documents. The local model operates on a term interaction matrix to model exact matches, while the distributed model projects text into an embedding space for inexact matching. Results show the Duet model, which combines these approaches, outperforms models using only local or distributed representations. The Duet model benefits from training on large datasets and can effectively handle queries containing rare terms or needing semantic matching.
This document contains a questionnaire aimed at understanding how RDF data and ontologies should be visualized in the Pundit tool. It asks about the respondents' backgrounds and experiences with ontologies, RDF, and visualization tools. It also asks how the respondents would ideally like to see ontologies and RDF data visualized, and how visualization could be integrated into Pundit to make it more useful. The questionnaire covers topics such as how ontologies are created and used, the visualization of schemas versus instances, the features of current tools, and how visualization could complement annotation and ontology work.
The document discusses processing Boolean queries in an information retrieval system using an inverted index. It describes the steps to process a simple conjunctive query by locating terms in the dictionary, retrieving their postings lists, and intersecting the lists. More complex queries involving OR and NOT operators are also processed in a similar way. The document also discusses optimizing query processing by considering the order of accessing postings lists.
V. Malykh presents an approach for creating robust word vectors for the Russian language that does not rely on a predefined vocabulary or word co-occurrence matrices. The approach uses an LSTM neural network and BME representations of words at the character level to learn word embeddings. Experiments on Russian corpora for paraphrase identification and plagiarism detection show the approach outperforms standard word2vec models, especially in noisy conditions with character substitutions, additions, and deletions.
Entity Linking in Queries: Tasks and Evaluation, by Faegheh Hasibi
The document discusses entity linking in queries for tasks like ad-hoc document retrieval and query understanding. It describes three main tasks for entity linking in queries: entity linking, semantic mapping, and interpretation finding. Semantic mapping allows mentions to overlap and links mentions to multiple entities, while interpretation finding returns sets of semantically related entity sets where mentions do not overlap within a set. The document also discusses test collections, evaluation metrics, and methods like mention detection, candidate entity ranking, and greedy interpretation finding for linking entity mentions in queries to knowledge bases.
Formal and Computational Representations
The Semantics of First-Order Logic
Event Representations
Description Logics & the Web Ontology Language
Compositionality
Lambda calculus
Corpus-based approaches:
Latent Semantic Analysis
Topic models
Distributional Semantics
The document describes a Russian paraphrase corpus created by the authors. It contains over 8000 sentence pairs annotated as precise, loose, or non-paraphrases using crowdsourcing. The corpus was collected from news headlines and aims to capture the most important events. The authors evaluate different models for classifying sentence pairs and find that combining linguistic features improves performance over individual feature types. Graphs built from the corpus can reveal connected events more completely than human annotations alone.
This document discusses natural language processing and language models. It begins by explaining that natural language processing aims to give computers the ability to process human language in order to perform tasks like dialogue systems, machine translation, and question answering. It then discusses how language models assign probabilities to strings of text to determine if they are valid sentences. Specifically, it covers n-gram models which use the previous n words to predict the next, and how smoothing techniques are used to handle uncommon words. The document provides an overview of key concepts in natural language processing and language modeling.
This document discusses building an inverted index to efficiently support information retrieval on large document collections. It describes tokenizing documents, building a dictionary of normalized terms, and creating postings lists that map each term to the documents it appears in. Inverted indexes allow skipping linear scanning and support flexible queries by indexing term locations. The document also covers calculating precision and recall to measure system effectiveness.
This document summarizes a paper on using simple lexical overlap features with support vector machines (SVMs) for Russian paraphrase identification. It introduces paraphrase identification and various paraphrase corpora. It then describes a knowledge-lean approach using only tokenization, lowercasing, and overlap features like union and intersection size as inputs to linear and RBF kernel SVMs. The method achieves competitive results on English, Turkish, and Russian paraphrase identification tasks.
Detecting and Describing Historical Periods in a Large Corpora, by Traian Rebedea
Many historic periods (or events) are remembered by slogans, expressions or words that are strongly linked to them. Educated people are also able to determine whether a particular word or expression is related to a specific period in human history. The present paper aims to establish correlations between significant historic periods (or events) and the texts written in that period. In order to achieve this, we have developed a system that automatically links words (and topics discovered using Latent Dirichlet Allocation) to periods of time in recent history. For this analysis to be relevant and conclusive, it must be undertaken on a representative set of texts written throughout history. To this end, instead of relying on manually selected texts, the Google Books Ngram corpus has been chosen as a basis for the analysis. Although it provides only word n-gram statistics for the texts written in a given year, the resulting time series can be used to provide insights about the most important periods and events in recent history, by automatically linking them with specific keywords or even LDA topics.
Exploiting Entity Linking in Queries For Entity Retrieval, by Faegheh Hasibi
Slides for the ICTIR 2016 paper: "Exploiting Entity Linking in Queries For Entity Retrieval"
The premise of entity retrieval is to better answer search queries by returning specific entities instead of documents. Many queries mention particular entities; recognizing and linking them to the corresponding entry in a knowledge base is known as the task of entity linking in queries. In this paper we make a first attempt at bringing together these two, i.e., leveraging entity annotations of queries in the entity retrieval model. We introduce a new probabilistic component and show how it can be applied on top of any term-based entity retrieval model that can be emulated in the Markov Random Field framework, including language models, sequential dependence models, as well as their fielded variations. Using a standard entity retrieval test collection, we show that our extension brings consistent improvements over all baseline methods, including the current state-of-the-art. We further show that our extension is robust against parameter settings.
Topic modeling using big data analytics can analyze large datasets. It involves installing Hadoop on multiple nodes for distributed processing, preprocessing data into a desired format, and using modeling tools to parallelize computation and select algorithms. Topic modeling identifies patterns in corpora to develop new ways to search, browse, and summarize large text archives. Tools like Mallet use algorithms like LDA and PLSI to achieve topic modeling on Hadoop, applying it to analyze news articles, search engine rankings, genetic and image data, and more.
The document describes a character-level convolutional neural network approach for sentence paraphrase detection. It evaluates standard and non-standard models using word and character embeddings as inputs to the CNN. The standard model using character embeddings achieved the best results, obtaining an accuracy of 72.74% and F1 score of 78.8%, outperforming the standard word-based model and non-standard model. The document discusses related work applying CNNs to other NLP tasks and analyzes the results.
Two Approaches to Factor Time into Word and Entity Representations Learned from Text, by Federico Bianchi
Time is a crucial factor when dealing with distributional models of language and knowledge. For example, tracking word meaning shift and entity evolution can have several applications, and time may sneak into similarity as computed with these models in a way that may be difficult to control. In this presentation, we discuss two novel approaches to factor time into word and knowledge representations learned from text: explicit, with representations of temporal references (e.g., years, days, etc.), and implicit, with time-dependent representations of words and entities (e.g., amazon_1975 vs. amazon_2012). Finally, as this is an emerging field of research, we discuss several open topics in this research domain.
FBK, Trento, 10/5/2019
The generation of digital content has undergone a great increase in recent years due to the development of new technologies that allow the creation of content quickly and easily. A further step in this evolution is the generation of content by automatic systems without human intervention. Thus, for decades, models for Natural Language Generation (NLG) have been developed that allow the transformation of content into the form of narratives. At present, there are several systems that enable generation in text format. In this paper we present the Narrative system, which allows the generation of text narratives from different sources, and which are indistinguishable to users from those made by a human being.
Automatic Annotation Approach of Events in News Articles, by Joaquin Hamad
The document describes an approach for automatically annotating events in news articles. It involves four main steps: 1) preprocessing the text through segmentation into sentences and identifying named entities, 2) using a classifier to filter out non-event sentences, 3) grouping similar event sentences into clusters, 4) generating a summary of the annotated events. The approach uses natural language processing and machine learning techniques like decision trees for classification.
Topic Modeling for Information Retrieval and Word Sense Disambiguation Tasks, by Leonardo Di Donato
Experimental work on the use of topic modeling for the implementation and improvement of some common tasks in Information Retrieval and Word Sense Disambiguation.
It first describes the scenario, the pre-processing pipeline, and the framework used, followed by a discussion of different hyperparameter configurations for the LDA algorithm.
The work then deals with the retrieval of relevant documents, mainly through two different approaches: inferring the topic distribution of the held-out document (or query) and comparing it to retrieve similar documents from the collection, or through an approach driven by probabilistic querying. The last part of this work is devoted to the word sense disambiguation task.
Using Machine Learning to Build a Semi-Intelligent Bot, by ecij
Nowadays, real-time systems and intelligent systems offer more and more control interfaces based on voice recognition or human language recognition. Robots and drones will soon be mainly controlled by voice. Other robots will integrate bots to interact with their users; this can be useful both in industry and entertainment. At first, researchers focused on "ontology reasoning". Given all the technical constraints brought by the treatment of ontologies, an interesting solution has emerged in recent years: the construction of a model based on machine learning to connect human language to a knowledge base (based, for example, on RDF). We present in this paper our contribution to building a bot that could be used on real-time systems and drones/robots, using recent machine learning technologies.
Analysing Business Models for Cross-Border E-Services Provided by the Chambers of Commerce, by Brandi Gonzales
This document analyzes business models for cross-border e-services provided by Chambers of Commerce. It discusses definitions of business models from literature and how ontologies have been used as conceptual tools. The document then describes applying a business model ontology to analyze the business model design process for a new organization involving cooperation between Chambers of Commerce from Italy, Romania, Poland and Bulgaria, who have different needs, legal constraints and cultures.
We present a framework that combines machine-learnt classifiers and taxonomies of topics to enable a more conceptual analysis of a corpus than can be accomplished using Vector Space Models and Latent Dirichlet Allocation based topic models, which represent documents purely in terms of words. Given a corpus and a taxonomy of topics, we learn a classifier per topic and annotate each document with the topics covered by it. The distribution of topics in the corpus can then be visualized as a function of the attributes of the documents. We apply this framework to the US State of the Union and presidential election speeches to observe how topics such as jobs and employment have evolved from being relatively unimportant to being the most discussed topic. We show that our framework is better than Vector Space Models and a Latent Dirichlet Allocation based topic model for performing certain kinds of analysis.
1) The document discusses using blogs and other structured web data to develop linguistic corpora for research. It argues that structured web data provides large amounts of naturally occurring language data in various genres and languages.
2) Examples are given of how blog data in particular is well-structured with metadata like authorship, dates, and semantics. This structured data can be extracted and analyzed to study linguistic patterns and variation across different authors, registers, and languages.
3) One research example analyzed the distribution of future tense expressions ("will" vs. "be going to") in three English language blogs and found patterns relating to subject type that confirm theoretical assumptions.
SWSN UNIT-3.pptx, by gowthamnaidu0986
Ontology engineering involves constructing ontologies through various methods. It begins with defining the scope and evaluating existing ontologies for reuse. Terms are enumerated and organized in a taxonomy with defined properties, facets, and instances. The ontology is checked for anomalies and refined iteratively. Popular tools for ontology development include Protege and WebOnto. Methodologies like METHONTOLOGY and On-To-Knowledge provide processes for building ontologies from scratch or reusing existing ones. Ontology sharing requires mapping between ontologies to allow interoperability, and libraries exist for storing and accessing ontologies.
SSONDE is a framework for calculating semantic similarity between ontology instances represented as linked data. It provides an asymmetric similarity score that emphasizes containment relationships between instances. SSONDE operates at the application layer and assumes integration steps like ontology alignment have already occurred. It has been applied to compare researchers based on publications and interests, and habitats based on hosted species. The framework supports configurable similarity contexts and caching to optimize performance on large linked datasets.
Navigation Through Citation Network Based on Content Similarity Using Cosine Similarity, by Salam Shah
The rate of scientific literature has increased in the past few decades; new topics and information are added in the form of articles, papers, text documents, web logs, and patents. The growth of information at this rapid rate has caused tremendous additions to current and past knowledge; during this process, new topics emerged, some topics split into many sub-topics, while other topics merged to form a single topic. The manual selection and search of a topic in such a huge amount of information has been found to be an expensive and labor-intensive task. For the emerging need of an automatic process to locate, organize, connect, and make associations among these sources, researchers have proposed different techniques that automatically extract components of the information presented in various formats and organize or structure them. The targeted data to be processed for component extraction might be in the form of text, video or audio. Different algorithms structure the information, group similar information into clusters, and weight it on the basis of its importance. The organized, structured and weighted data is then compared with other structures to find similarity using various algorithms. Semantic patterns can be found by employing visualization techniques that show similarity or relations between topics over time or related to a specific event. In this paper, we propose a model based on the Cosine Similarity Algorithm for citation networks, which answers questions such as how to connect documents with the help of citation and content similarity, and how to visualize and navigate through the documents.
This document is a thesis that proposes using word embeddings to improve information retrieval by addressing term mismatch issues. It discusses word2vec, a technique for learning word embeddings from large text corpora that capture semantic relationships between words. The thesis proposes two approaches: 1) incorporating word embedding similarities into a probabilistic language model for retrieval and 2) a vector space model. Due to time constraints, only the first approach is implemented, which integrates word embeddings into ALMasri and Chevallet's probabilistic language model. Experiments are conducted to evaluate the impact of using semantic features from word embeddings on retrieval effectiveness.
This document describes a named entity recognition system for Romanian developed using linguistic rule-based techniques and resources. The system has two main modules: named entity identification, which marks named entity candidates in text, and named entity classification, which classifies candidates into categories like Person, Organization, Place, etc. Evaluation shows the system achieves promising results, performing comparably or better than existing Romanian named entity recognition systems, especially for identifying Person entities.
Invited Talk: Early Detection of Research Topics, by Angelo Salatino
Slides of my talk at Chan Zuckerberg Initiative (Meta)
Abstract:
The ability to promptly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. While the literature describes several approaches which aim to identify the emergence of new research topics early in their lifecycle, these rely on the assumption that the topic in question is already associated with a number of publications and consistently referred to by a community of researchers. Hence, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this paper, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the ‘parents’ of the new topic. These initial findings (i) confirm our hypothesis that it is possible in principle to detect the emergence of a new topic at the embryonic stage, (ii) provide new empirical evidence supporting relevant theories in Philosophy of Science, and also (iii) suggest that new topics tend to emerge in an environment in which weakly interconnected research areas begin to cross-fertilise.
This project aimed to develop models for named entity recognition from online news articles. Two models were created: a Maximum Entropy model using traditional NLP techniques and a Deep Neural Network model using pre-trained word embeddings. Both models achieved similar accuracy levels of around 93.5-93.8%, but have different requirements and limitations. The Maximum Entropy model performance improved with additional complex features but training time increased substantially. The Deep Learning model accuracy plateaued after 6 epochs of training, with the word embedding dimension having little effect beyond a certain point.
Text Segmentation for Online Subjective Examination Using Machine Learning, by IRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine similarity between texts. The goal is to implement natural language processing using text segmentation. It reviews related work applying various machine learning methods like K-NN, support vector machines, and decision trees to tasks like text categorization and clustering.
Real-time Generation of Topic Maps from Speech Streams, by tmra
Topic Maps are a means for representing sophisticated semantic indexes of any information collection. The creation of Topic Maps is based on a theoretical foundation which is introduced in this paper. Moreover, the Observation Principle, the result of a deep investigation of the Subject Equality Decision Chain, is discussed, as well as the Semantic Talk system, which generates sophisticated, conceptual indexes of speech streams in real time. This paper describes how these indexes are created, how they are represented as Topic Maps, and how they can be used for integration purposes.
The document summarizes techniques for identifying themes in qualitative research. It explains that themes are abstract constructs that link expressions and can come in various shapes. Themes can come from the data, the investigator's prior understanding, characteristics of the phenomenon, definitions, common constructs, and personal experiences. It outlines several techniques for identifying themes, including repetitions, indigenous typologies/categories, metaphors and analogies, transitions, similarities and differences, linguistic connectors, missing data, and theory-related material, along with processing techniques like cutting and sorting, multi-dimensional scaling, words and keywords in context, and word co-occurrence. The document evaluates the different techniques based on the type of data, required expertise, labor required, and the number and types of themes.
Type Vector Representations from Text. DL4KGS@ESWC 2018
1. Type Vector Representations from Text: An Empirical Analysis
Federico Bianchi, Mauricio Soto, Matteo Palmonari and Vincenzo Cutrona
Department of Informatics, Systems and Communications, University of Milano-Bicocca
federico.bianchi@disco.unimib.it
Workshop on Deep Learning for Knowledge Graphs and Semantic Technologies (DL4KGS), co-located with ESWC 18, June 2018, Crete, Greece
2. Outline
● Knowledge Graphs
● Scope of this Paper and State-of-the-art
● T2V: Type to Vector
● Experiments
4. Knowledge Graphs
● Structured representations of knowledge
● Entities are classified using types (i.e., concepts)
● Types are organized in sub-type graphs
[Figure: example knowledge graph in which entities such as A.S. Roma, Kostas Manolas, Garry Kasparov and Real Madrid are connected by relations (e.g., team) and classified by types such as SoccerPlayer, SoccerClub, ChessPlayer, Athlete, SportsClub, Organisation, Person and Thing]
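To make the structure concrete, here is a minimal sketch (in Python, with illustrative data structures rather than any specific triple-store API) of the example graph: typed entities, one relation, and a sub-type hierarchy.

```python
# Illustrative encoding of the example knowledge graph:
# typed entities plus a sub-type hierarchy (child -> parent).

# Entity -> set of types (an entity can have several types)
entity_types = {
    "Kostas_Manolas": {"SoccerPlayer"},
    "A.S._Roma": {"SoccerClub"},
    "Real_Madrid": {"SoccerClub"},
    "Garry_Kasparov": {"ChessPlayer"},
}

# Relations between entities (subject, predicate, object)
triples = [("Kostas_Manolas", "team", "A.S._Roma")]

# Sub-type graph: each type points to its parent, up to Thing
type_parent = {
    "SoccerPlayer": "Athlete",
    "ChessPlayer": "Athlete",
    "Athlete": "Person",
    "SoccerClub": "SportsClub",
    "SportsClub": "Organisation",
    "Organisation": "Thing",
    "Person": "Thing",
}

def ancestors(t):
    """All super-types of a type, walking child -> parent links."""
    out = []
    while t in type_parent:
        t = type_parent[t]
        out.append(t)
    return out

print(ancestors("SoccerPlayer"))  # ['Athlete', 'Person', 'Thing']
```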
6. Scope of this Paper
● Propose an approach to learn representations of types by considering text as a different source of information
○ Distributional semantics
○ Embeddings of types in a vector space
○ Mapping to a word2vec learning problem
● Main intuition: building a type similarity measure that encodes relatedness between types (beyond ontological similarity)
● Empirical evaluation of the properties of text-based type representations
○ Focus on similarity (relatedness vs ontological similarity)
7. Vector Representations of Types
Types represented in a vector space:
● Easy and fast evaluation of similarity
[Figure: the types SoccerClub and Person depicted as numeric vectors, e.g. (2, 5, 6, 2, 6) and (4, 2, 12, 5, 2)]
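With types as vectors, similarity evaluation reduces to a cheap vector operation such as cosine similarity. A minimal sketch, using the two toy vectors from the figure (illustrative numbers, not learned embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two type vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

soccer_club = np.array([2.0, 5.0, 6.0, 2.0, 6.0])
person = np.array([4.0, 2.0, 12.0, 5.0, 2.0])

print(cosine_similarity(soccer_club, person))
```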
8. Embeddings for Representing Ontologies
● [Jayawardana+, 2017]
○ Instance-based approach for building word embedding vectors of the instances in a custom ontology (legal domain)
○ Embeddings used to predict the best representative vector for each ontology type (cluster-based approach)
○ Conclusion: type vectors are aggregations of entity embeddings
● [Smaili+, 2018]
○ Distributional-hypothesis-based embeddings for ontological representation
○ Textual document generated by considering axioms in an ontology as sentences of a text
○ Conclusion: uses the structure of the ontology
9. ESWC, Crete, 4th June 2018
State-of-the-Art on Ontological Similarity
● [Rada+, 1989] (path)
○ Shortest path length between concepts
○ Equal-path problem: two concept pairs with the same path length share the same semantic similarity
● [Wu&Palmer, 1994] (wup)
○ Considers the depth of the concepts, based on the Least Common Subsumer (i.e., the first common ancestor)
○ Equal-depth problem: concepts at the same hierarchical level share the same similarity
● [Zhu&Iglesias, 2017] (wpath)
○ Weighted path length to evaluate the similarity between concepts
○ Exploits the statistical Information Content (IC) along with the topology
○ IC is computed on text corpora and assigns higher values to more specific concepts
Limitations:
● Topologically distant concepts may be highly related (e.g., SoccerPlayer and SoccerClub)
● Not all sibling pairs are similar in the same way (e.g., is a SoccerPlayer equally similar to a Wrestler and to a BasketballPlayer?)
10. ESWC, Crete, 4th June 2018
Similarity vs. Relatedness
Semantic Similarity
● Resemblance to a general conceptual term (e.g., Settlement and Town)
● Suffers from the equal-path and equal-depth problems
● Measures based on the ontology topology: path, wup (Least Common Subsumer), wpath (Information Content)
Relatedness
● Existence of connections (e.g., SoccerPlayer and SoccerClub)
● Oblivious to the ontology structure
● Measures based on corpus co-occurrence: word embeddings built on the distributional hypothesis (e.g., word2vec)
11. ESWC, Crete, 4th June 2018
Outline
● Knowledge Graphs
● Scope of this Paper and State-of-the-art
● T2V: Type to Vector
● Experiments
12. ESWC, Crete, 4th June 2018
Word2Vec [Mikolov+, 2013]
Well-known algorithm for learning word representations from an input corpus.
Distributional hypothesis: similar words appear in similar contexts (word-word co-occurrence).
Type to Vector (T2V): generate distributed representations of types based on type-type co-occurrence.
[Figure: vectors for cat, black, eats, dog; similar words correspond to similar vectors.]
Example sentences:
“The big black cat eats its food.”
“My little black cat sleeps all day.”
“Sometimes my cat eats too much!”
Two hyperparameters:
● Desired embedding size
● Length of the context window
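To make the two hyperparameters concrete, here is a small training sketch using the gensim library (gensim is our choice for illustration; the slide only names word2vec). The toy corpus reuses the example sentences above:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    "the big black cat eats its food".split(),
    "my little black cat sleeps all day".split(),
    "sometimes my cat eats too much".split(),
]

# The two hyperparameters from the slide:
#   vector_size = desired embedding size, window = length of the context window.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Similar words end up with similar vectors.
print(model.wv.most_similar("cat", topn=3))
```

Note that `vector_size` is the gensim 4.x parameter name; older versions call it `size`.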
13. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
[Figure: the same abstract shown before and after entity annotation.]
Part of our approach to learning representations of typed entities:
- Bianchi & Palmonari. Joint Learning of Entity and Type Embeddings for Analogical Reasoning with Entities. NL4AI 2017.
- Bianchi et al. Towards Encoding Time in Text-Based Entity Embeddings. ISWC 2018 (to appear).
14. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
Find entities in text
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.”
● Entities are found with a Named Entity Linking service
15. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
Find entities in text
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.” → Rome Italy Rome Lazio
● Entities are found with a Named Entity Linking service
● Words are removed
16. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
Find entities in text → replace entities with minimal types
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.” → Rome Italy Rome Lazio → City Country City Administrative_Region
● Entities are found with a Named Entity Linking service
● Words are removed
● Entities are replaced with their minimal (most specific) type
17. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
Find entities in text → replace entities with minimal types
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.” → Rome Italy Rome Lazio → City Country City Administrative_Region
● Entities are found with a Named Entity Linking service
● Words are removed
● Entities are replaced with their minimal (most specific) type
● The document, now containing sequences of types, is fed to word2vec (a preprocessing sketch follows below)
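A minimal sketch of the preprocessing just described, assuming we already have entity annotations (e.g., from a Named Entity Linking service such as DBpedia Spotlight) and an entity-to-minimal-type lookup; the data structures and names below are illustrative, not the authors' pipeline code:

```python
# Assumed lookup: entity -> its minimal (most specific) type.
minimal_type = {
    "Rome": "City",
    "Italy": "Country",
    "Lazio": "Administrative_Region",
}

def to_type_sequence(tokens):
    # Keep only tokens that were linked to an entity, and replace each
    # entity with its minimal type; plain words are dropped.
    return [minimal_type[t] for t in tokens if t in minimal_type]

tokens = "Rome is the capital of Italy and Rome is in Lazio".split()
print(to_type_sequence(tokens))
# ['City', 'Country', 'City', 'Administrative_Region']
```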
18. ESWC, Crete, 4th June 2018
T2V: Word2Vec on Annotated Text
Generate type vectors
“Rome is the capital of Italy and a special comune (named Comune di Roma Capitale). Rome also serves as the capital of the Lazio region.” → Rome Italy Rome Lazio → City Country City Administrative_Region → word2vec
[Figure: resulting vectors for the types City, Country, and Administrative_Region.]
Similarity can be computed with cosine similarity (see the query sketch below)
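Putting the pieces together: the type sequences are ordinary word2vec training data, and similarity queries come for free. Again a hedged sketch with gensim and a toy corpus; the tiny corpus is illustrative, whereas the paper trains on all annotated DBpedia abstracts:

```python
from gensim.models import Word2Vec

# Type "sentences" produced by the preprocessing step (toy data).
type_corpus = [
    ["City", "Country", "City", "Administrative_Region"],
    ["SoccerPlayer", "SoccerClub", "City"],
]

t2v = Word2Vec(type_corpus, vector_size=100, window=5, min_count=1)

# wv.similarity returns the cosine similarity of the two type vectors.
print(t2v.wv.similarity("City", "Country"))
```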
19. ESWC, Crete, 4th June 2018
Outline
● Knowledge Graphs
● Scope of this Paper and State-of-the-art
● T2V: Type to Vector
● Experiments
20. ESWC, Crete, 4th June 2018
Empirical Evaluation of T2V Representations
Objective: analyze the properties of the T2V representations, with a focus on similarity
Corpus for T2V training: DBpedia 2016-04 abstracts annotated with DBpedia Spotlight
Experiments:
1) Analogical reasoning (standard method of evaluation for word embeddings)
2) Correlation with topological measures
3) Similarity and depth (depth problem)
4) Similarity and siblings (siblings similarity problem)
5) Type matching (similarity between different categorization systems)
21. ESWC, Crete, 4th June 2018
1) Analogical Reasoning
Hypothesis
T2V can support analogical reasoning as word2vec does
Dataset
868 reasonably objective analogies about sports (e.g., sportPlayer : sportTeam)
Methodology
● Tested two T2V models, with embedding sizes of 100 and 200 and a context window of 5
● Word2vec answers with the list of points closest to the result of the analogical operation
● An answer is counted as correct in the top-k setting (k = 1, 2, 5) if it appears in the top k of the ranked list
Example
“Who is the equivalent, in a BasketballTeam, of a RugbyPlayer that plays in a RugbyTeam?”
RugbyPlayer : RugbyTeam :: ? : BasketballTeam
Analogical operation (sketched in code below): v(dbo:RugbyPlayer) - v(dbo:RugbyTeam) + v(dbo:BasketballTeam) ≈ v(dbo:BasketballPlayer)
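A sketch of the top-k analogy check, assuming a trained T2V model `t2v` (a gensim Word2Vec model whose vocabulary contains the type labels); the function name is ours:

```python
def analogy_in_top_k(model, a, b, c, expected, k=5):
    # Rank types by closeness to v(a) - v(b) + v(c) and check whether
    # the expected answer appears among the first k candidates.
    candidates = model.wv.most_similar(positive=[a, c], negative=[b], topn=k)
    return any(t == expected for t, _ in candidates)

# RugbyPlayer : RugbyTeam :: ? : BasketballTeam  ->  BasketballPlayer
# analogy_in_top_k(t2v, "RugbyPlayer", "RugbyTeam", "BasketballTeam",
#                  expected="BasketballPlayer", k=5)
```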
22. ESWC, Crete, 4th June 2018
1) Analogical Reasoning
Model         P@1    P@2    P@5
T2V (200,5)   0.50   0.85   0.98   ← model used for the next experiments
T2V (100,5)   0.47   0.76   0.93
Outcome
● The correct answer is often found within the first 5 positions
● Linguistic regularities are preserved in T2V as well
23. ESWC, Crete, 4th June 2018
2) T2V vs Topological Measures: Correlation
Hypothesis
T2V similarity is orthogonal to topological similarity
Dataset
~15,000 pairs of DBpedia types
Methodology
Pearson correlation coefficient between T2V similarity and well-known topological measures (see the snippet below):

        path   wup    wpath   T2V
path    1.00   0.87   0.94    0.30
wup            1.00   0.93    0.33
wpath                 1.00    0.36
T2V                           1.00

Outcome
T2V similarity and topological similarity are not strongly correlated
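Computing the correlation itself is a one-liner with scipy; the scores below are invented placeholders standing in for the ~15,000 per-pair similarity values:

```python
from scipy.stats import pearsonr

# Placeholder similarity scores for the same type pairs under two measures.
wpath_scores = [0.90, 0.40, 0.75, 0.20, 0.55]
t2v_scores = [0.60, 0.85, 0.30, 0.45, 0.50]

r, p_value = pearsonr(wpath_scores, t2v_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```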
24. ESWC, Crete, 4th June 2018
2) T2V vs Topological Measures: Insights
State of the Art
Based on the topology of the ontology
Ex. dbo:Settlement and dbo:Town (high similarity)
Ex. dbo:SoccerPlayer and dbo:SoccerClub (low similarity)
Ex. dbo:Wrestler and dbo:SoccerPlayer (high similarity, siblings)
T2V
Captures the co-occurrences of types in text
Ex. dbo:Settlement and dbo:Town (high similarity)
Ex. dbo:SoccerPlayer and dbo:SoccerClub (high similarity)
Ex. dbo:Wrestler and dbo:SoccerPlayer (low similarity, siblings)
26. ESWC, Crete, 4th June 2018
3) Similarity vs. Depth
Hypothesis
Sibling types are pairwise more similar when they are more specific (as observed with topological similarity):
sim(dbo:BasketballPlayer, dbo:SoccerPlayer) > sim(dbo:Person, dbo:Organisation)
Dataset
DBpedia ontology
Methodology
● Children Information Distribution (CID): the average pairwise similarity between the children of a type p (a computation sketch follows below)
● Plot CID against relative depth (depth relative to the length of the type's path)
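A hedged sketch of the CID computation, assuming a trained T2V model `t2v` and a list of a type's children; this is our reading of the definition above, not the authors' code:

```python
from itertools import combinations

def cid(model, children):
    # Average pairwise T2V similarity over all unordered pairs of children.
    pairs = list(combinations(children, 2))
    return sum(model.wv.similarity(a, b) for a, b in pairs) / len(pairs)

# Deep, specific siblings vs. shallow ones (illustrative calls):
# cid(t2v, ["BasketballPlayer", "SoccerPlayer", "RugbyPlayer"])
# cid(t2v, ["Person", "Organisation", "Place"])
```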
27. ESWC, Crete, 4th June 2018
3) T2V CID vs. Relative Depth
Outcome
● On average, CID increases with depth
[Figure: plot of CID vs. relative depth; the curve drops where CID(dbo:Thing) > CID(dbo:Agent).]
28. ESWC, Crete, 4th June 2018
4) Siblings’ Similarity
Hypothesis
The pairwise similarity within a set of siblings changes from pair to pair
Dataset
31 sibling types from the DBpedia ontology. For each type we selected its most similar and its least similar sibling according to T2V similarity (e.g., SoccerPlayer => most similar: RugbyPlayer; least similar: ChessPlayer)
Methodology
We asked 5 users (knowledgeable about the Semantic Web) to answer questions like the following:
“Do you think a SoccerPlayer is more similar to a RugbyPlayer or to a ChessPlayer?”
Potential Biases
● Low number of participants
● Questions were selected using T2V
29. ESWC, Crete, 4th June 2018
4) Siblings’ Similarity
Outcome
● Agreement among users, measured with Gwet's AC1 [Gwet, 2008], is 0.9 (high agreement)
● Given an input type, users tend to choose the type that T2V also returns as the most similar
Examples
Is a dbo:Writer more similar to a dbo:Philosopher or to a dbo:BusinessPerson?
Is a dbo:President more similar to a dbo:PrimeMinister or to a dbo:Mayor?
Most challenging question for users:
“Is a dbo:Skyscraper more similar to a dbo:Hospital or to a dbo:Museum?”
30. ESWC, Crete, 4th June 2018
5) Type Matching
Hypothesis
T2V can be used for ontology matching, provided that two different ontologies classify a common set of instances
Methodology
● Learn representations of types from the different ontologies in a shared vector space (100 dimensions, window of 5)
● Each entity is replaced with a type from one of the two ontologies, chosen at random (a corpus-construction sketch follows below)
Dataset
● DBpedia 2016-04 and Wikidata 2016-06 (instance-of statements)
[Figure: a single vector space in which types of different ontologies co-exist, e.g., City (Ontology 1) near Country and Region (Ontology 2).]
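A sketch of the shared-corpus construction under our reading of the methodology; the lookup tables and identifiers are illustrative stand-ins for the full DBpedia and Wikidata type assignments:

```python
import random

# Assumed per-entity type lookups for the two ontologies (toy data).
dbpedia_type = {"Rome": "dbo:City", "Italy": "dbo:Country"}
wikidata_type = {"Rome": "Q515", "Italy": "Q6256"}

def shared_type_sequence(entities, rng=random):
    # For each entity, emit the type from one ontology chosen at random,
    # so types from both ontologies co-occur in one training corpus.
    return [
        (dbpedia_type if rng.random() < 0.5 else wikidata_type)[e]
        for e in entities
        if e in dbpedia_type and e in wikidata_type
    ]

print(shared_type_sequence(["Rome", "Italy", "Rome"]))
```

Training word2vec on such sequences places DBpedia and Wikidata types in the same space, so cross-ontology matches can be read off nearest-neighbour queries.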
31. ESWC, Crete, 4th June 2018
5) Type Matching
Wikidata (label)                                DBpedia                  Sim
Q4498974 (ice hockey team)                      HockeyTeam               0.99
Q5107 (continent)                               Continent                0.99
Q17374546* (Australian rules football club)     AustralianFootballTeam   0.99
Q3001412* (horse race)                          HorseRace                0.98
Q4022 (river)                                   River                    0.98
Q46970 (airline)                                Airline                  0.98
Q18127 (record label)                           RecordLabel              0.98
Q13027888* (baseball team)                      BaseballTeam             0.98
Q11424 (film)                                   Film                     0.98
Q1075* (color)                                  Colour                   0.98
Q17156793* (American football team)             AmericanFootballTeam     0.95
Q3146899* (diocese of the Catholic Church)      Diocese                  0.93
Q7944* (earthquake)                             Earthquake               0.91
* not declared equivalent in DBpedia
Outcome
● The types with the highest similarity are equivalent classes in the two ontologies (thanks to their use in text)
● We found equivalent types that are not declared as equivalent in DBpedia
32. Conclusions and Future Work
Conclusions:
● T2V similarity injects relatedness into type similarity measures (learned from human-written text corpora)
● T2V exhibits some desired properties (depth behaviour, sibling discrimination)
● T2V supports analogical reasoning
● T2V can support ontology matching
Future Work:
● Combine T2V similarity and topological similarities into one measure
● Study the relation between the sub-type relation and the vector representations
● Support ontology matching tasks
● Compare with other methods for vector-based type representations
33. Thank You
Workshop on Deep Learning for Knowledge Graphs
and Semantic Technologies (DL4KGS)
Co-located with ESWC 18, June 2018, Crete, Greece
Code and models are publicly available (see the paper for details)
Mail to: federico.bianchi@disco.unimib.it
34. ESWC, Crete, 4th June 2018
References
Bianchi, F., & Palmonari, M. (2017). Joint learning of entity and type embeddings for analogical reasoning with entities. In Proceedings of the NL4AI Workshop, co-located with the International Conference of the Italian Association for Artificial Intelligence (AI*IA).
Bianchi, F., Palmonari, M., & Nozza, D. (2018). Towards encoding time in text-based entity embeddings. In International Semantic Web Conference (to appear).
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48.
Jayawardana, V., Lakmal, D., de Silva, N., Perera, A. S., Sugathadasa, K., & Ayesha, B. (2017). Deriving a representative vector for ontology classes with instance word vector embeddings. In INTECH (pp. 79–84).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 17–30.
Smaili, F. Z., Gao, X., & Hoehndorf, R. (2018). Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations. arXiv preprint arXiv:1802.00864.
Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 133–138).
Zhu, G., & Iglesias, C. A. (2017). Computing semantic similarity of concepts in knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 29(1), 72–85.