Learning Multilingual Semantic Parsers for Question Answering over Linked Data - A comparison of neural and probabilistic graphical model architectures
This document summarizes a PhD dissertation defense talk on learning multilingual semantic parsers for question answering over linked data. It compares neural and probabilistic graphical model architectures for semantic parsing, which maps natural language to formal meaning representations. The talk introduces dependency parse tree-based approaches, evaluates different model architectures, and addresses the challenges of building multilingual question answering systems over structured knowledge bases.
Applications of Word Vectors in Text Retrieval and Classification (shakimov)
Applications of word vectors (word2vec, BERT, etc.) to problems such as text retrieval and the classification of textual documents for tasks such as sentiment analysis and spam detection.
Framester: A Wide Coverage Linguistic Linked Data Hub (Mehwish Alam)
Framester is a linguistic linked data hub that aims to improve coverage of FrameNet by extending mappings between FrameNet and other resources like WordNet and BabelNet. Framester represents over 40 million triples linking linguistic and factual resources and aligning frames, roles, and types to foundational ontologies. It provides a word frame disambiguation service and was evaluated on annotated corpora, showing improved performance over previous approaches.
Interactive Knowledge Discovery over Web of Data (Mehwish Alam)
This document describes research on classifying and exploring data from the Web of Data. It discusses building a classification structure over RDF data by classifying triples based on RDF Schema and creating views through SPARQL queries. This structure can then be used for data completion and interactive knowledge discovery through data analysis and visualization. Formal concept analysis and pattern structures are introduced as techniques for dealing with complex data types from the Web of Data like graphs and linked data. Range minimum queries are also proposed as a way to compute the lowest common ancestor for structured attribute sets in the pattern structures.
This document discusses cross-language information retrieval (CLIR). It presents the goals of allowing users to query for domain-specific information in their native language and presenting relevant search results in the target language. It describes the key components of CLIR including bilingual corpus extraction from multiple sources, corpus indexing, querying and string matching. Preliminary evaluation results of sample queries are provided, along with conclusions that machine translation based CLIR is often more useful than the proposed method and that future work could focus on automated evaluation and fuzzy matching.
This document discusses cross-language information retrieval (CLIR). It defines CLIR as retrieving information written in a language different from the user's query language. It describes approaches to CLIR such as dictionary-based query translation and pseudo-relevance feedback. Dictionary-based query translation uses bilingual dictionaries but requires disambiguation due to ambiguity. Pseudo-relevance feedback assumes top documents are relevant and selects terms from them to expand the query. The document also discusses using parallel corpora to estimate cross-lingual relevance models and evaluate CLIR using conferences like TREC and CLEF.
The document describes cross-language information retrieval (CLIR) and summarizes an English-Chinese information retrieval system called ECIRS. ECIRS allows users to input queries in English and retrieves relevant Chinese documents through translation. It includes dictionaries, document indexes, and a Chinese search engine. Screenshots show the user interface where a user can enter an English keyword, view its Chinese translation, and see search results in Chinese.
The document discusses test-driven quality assessment of RDF data. It proposes a methodology called the Test-driven Quality Assessment Methodology (TDQAM) where test cases are generated automatically from the RDF schema to validate data constraints. Test cases are written as SPARQL queries and can check for issues like a person having a birthdate after a deathdate. Pattern-based test generators analyze the schema to instantiate test cases. The methodology provides a unified way to validate RDF data against different schema languages to improve data quality.
This document provides a quick tour of data mining. It begins with an overview of the evolution of data management techniques from manual record keeping to modern big data and data science. It then discusses what data mining is, focusing on algorithms for discovering patterns in existing data. Various examples of data mining applications are also presented, as well as the origins of data mining in fields such as machine learning and databases. Finally, an overview of the key steps in the knowledge discovery process is given, including data preprocessing, data mining, and pattern evaluation.
Natural Language Processing in R (rNLP) (fridolin.wild)
The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials
Named entity recognition (NER) with NLTK (Janu Jahnavi)
https://www.learntek.org/blog/named-entity-recognition-ner-with-nltk/
Learntek is a global online training provider offering courses on Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing, and other IT and management topics.
The document discusses cross-language information retrieval (CLIR). It notes that while there are over 6,000 languages, 80% of websites are in English, creating a need for CLIR. CLIR aims to retrieve relevant documents in languages different from the query language. It is an important area as it allows for global information exchange and knowledge sharing, with applications in national security, access to foreign patents and medical information. CLIR draws on multiple disciplines including information retrieval, natural language processing and machine translation.
R is a free software environment for statistical analysis and graphics. This document discusses using R for text mining, including preprocessing text data through transformations like stemming, stopword removal, and part-of-speech tagging. It also demonstrates building term document matrices and classifying text with k-nearest neighbors (KNN) algorithms. Specifically, it shows classifying speeches from Obama and Romney with over 90% accuracy using KNN classification in R.
Text mining and natural language processing techniques can be used to extract useful information from text data. Common text mining tasks include text categorization to classify documents into predefined categories, document clustering to group similar documents without predefined categories, and keyword-based association analysis to find frequent patterns and relationships between keywords in a collection of documents. Text classification algorithms such as support vector machines, k-nearest neighbors, naive Bayes, and neural networks can be applied to categorize documents based on their contents.
Text mining and natural language processing techniques can be used to extract useful information from text data. Common text mining tasks include text categorization to classify documents into predefined categories, document clustering to group similar documents without predefined categories, and keyword-based association analysis to find frequently co-occurring terms. Text classification algorithms such as support vector machines, naive Bayes classifiers, and neural networks are often applied to categorize documents into topics. The vector space model is commonly used to represent documents as vectors of term weights to enable similarity comparisons between documents.
Text mining and natural language processing techniques can be used to extract useful information from text data. Common text mining tasks include text categorization to classify documents into predefined categories, document clustering to group similar documents without predefined categories, and keyword-based association analysis to find frequently co-occurring terms. Text classification algorithms such as support vector machines, naive Bayes classifiers, and neural networks are often applied to categorize documents. The vector space model is commonly used to represent documents as vectors of term weights.
TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from th... (JohannWanja)
Deciding which RDF vocabulary terms to use when modeling data as Linked Open Data (LOD) is far from trivial. We propose "TermPicker" as a novel approach enabling vocabulary reuse by recommending vocabulary terms based on various features of a term. These features include the term’s popularity, whether it is from an already used vocabulary, and the so-called schema-level pattern (SLP) feature that exploits which terms other data providers on the LOD cloud use to describe their data. We apply Learning To Rank to establish a ranking model for vocabulary terms based on the utilized features. The results show that using the SLP-feature improves the recommendation quality by 29% to 36% considering the Mean Average Precision and the Mean Reciprocal Rank at the first five positions compared to recommendations based on solely the term’s popularity and whether it is from an already used vocabulary.
RDF2Vec: RDF Graph Embeddings for Data Mining (Petar Ristoski)
Linked Open Data has been recognized as a valuable source for background information in data mining. However, most data mining tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. Our evaluation shows that such vector representations outperform existing techniques for the propositionalization of RDF graphs on a variety of different predictive machine learning tasks, and that feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.
DS2014: Feature selection in hierarchical feature spaces (Petar Ristoski)
The document describes a proposed approach called SHSEL for hierarchical feature selection in machine learning. SHSEL exploits the hierarchical structure of feature spaces, where more specific features imply more general ones. It initially selects ranges of similar features in each branch based on relevance similarity. It then prunes the set further by selecting only the most relevant remaining features. The authors evaluate SHSEL on real and synthetic datasets compared to other feature selection methods, finding it achieves comparable or improved accuracy while significantly reducing the feature space.
This document discusses rules and the Semantic Web Rule Language (SWRL). It defines rules as a means of representing knowledge similar to if-then statements. SWRL combines OWL and rule-based languages by allowing users to write rules that can refer to OWL classes, properties, individuals and datatypes. SWRL has an abstract and XML syntax and supports built-in predicates for manipulating data types. Rules provide more expressivity than RDFS and OWL in some cases, such as defining application behaviors, but rule-based reasoning is less performant so they should not be overused when RDFS/OWL suffice.
This document provides an overview of SPARQL 1.0, the W3C recommendation for querying RDF data. It describes the main components of SPARQL queries including graph patterns used to match subgraphs, basic graph patterns using triple patterns, and optional, union, and constraint graph patterns. It provides examples of SPARQL queries and describes how variables, blank nodes, and filter expressions are used in constraints on query solutions.
Information Extraction from the Web - Algorithms and Tools (Benjamin Habegger)
This document provides an overview of algorithms and tools for information extraction from the web. It discusses document representations, approaches like wrappers that can extract semi-structured data from websites, and algorithms such as Wien, Stalker, DIPRE and IERel that learn wrappers. It also presents tools like WetDL for describing workflows and WebSource for executing them to extract and transform web data. Finally, it discusses applications of information extraction like semantic search engines and linking extracted data to schemas for data integration.
An overview of existing solutions for link discovery, looking into some state-of-the-art algorithms for the rapid execution of link discovery tasks, focusing on algorithms that guarantee result completeness.
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
This document provides an overview of hands-on tasks for a link discovery tutorial using the Limes framework. It describes a test dataset, and three tasks: 1) executing a provided Limes configuration to detect duplicate authors, 2) creating a configuration to find similar publications based on keywords, and 3) using the Limes GUI.
Performance of graph query languages:
Analysis of the performance of graph query languages: a comparative study of Cypher, Gremlin and native access in Neo4j
Enriching the semantic web tutorial session 1 (Tobias Wunner)
The document discusses challenges and opportunities in natural language processing for the multilingual semantic web. It provides examples of how content on the web and semantic web exhibits linguistic variations within and across languages. It also summarizes several NLP applications like information extraction and natural language generation that utilize ontologies, and notes that these applications require domain and multilingual adaptation of lexicons and extraction rules. The document argues that efficient adaptation and sharing of linguistic resources between ontology-based NLP applications is needed.
This document provides an overview of deep learning for information retrieval. It begins with background on the speaker and discusses how the data landscape is changing with increasing amounts of diverse data types. It then introduces neural networks and how deep learning can learn hierarchical representations from data. Key aspects of deep learning that help with natural language processing tasks like word embeddings and modeling compositionality are discussed. Several influential papers that advanced word embeddings and recursive neural networks are also summarized.
The web of interlinked data and knowledge stripped (Sören Auer)
Linked Data approaches can help solve enterprise information integration (EII) challenges by complementing text on web pages with structured, linked open data from different sources. This allows for intelligently combining, integrating, and joining structured information across heterogeneous systems. A distributed, iterative, bottom-up integration approach using Linked Data may help solve the EII problem in large companies by taking a pay-as-you-go approach.
Similar to Learning Multilingual Semantic Parsers for Question Answering over Linked Data - A comparison of neural and probabilistic graphical model architectures
A system called a natural language interface, which transforms a user's natural language question into a SPARQL query.
Find related papers here: https://sites.google.com/site/fadhlinams81/publication
The document provides an overview of knowledge graphs and the metaphactory knowledge graph platform. It defines knowledge graphs as semantic descriptions of entities and relationships using formal knowledge representation languages like RDF, RDFS and OWL. It discusses how knowledge graphs can power intelligent applications and gives examples like Google Knowledge Graph, Wikidata, and knowledge graphs in cultural heritage and life sciences. It also provides an introduction to key standards like SKOS, SPARQL, and Linked Data principles. Finally, it describes the main features and architecture of the metaphactory platform for creating and utilizing enterprise knowledge graphs.
Schema-agnostic queries over large-schema databases: a distributional semanti... (Andre Freitas)
This document provides an overview and summary of André Freitas' PhD thesis defense presentation on schema-agnostic queries for large schema databases using distributional semantics. The presentation motivates the need for schema-agnostic queries due to the rise of very large and dynamic database schemas. It proposes using distributional semantics to provide an accurate, comprehensive and low maintenance approach to cope with semantic heterogeneity in schema-agnostic queries. The key aspects of the approach include semantic pivoting to reduce semantic complexity, distributional semantic models to enable semantic matching, and a hybrid distributional-relational semantic model called τ-Space to support the development of a schema-agnostic query mechanism.
DODDLE-OWL: A Domain Ontology Construction Tool with OWL (Takeshi Morita)
In this paper, we propose a domain ontology construction tool with OWL. The advantage of our tool is focusing the quality refinement phase of ontology construction. Through interactive support for refining the initial ontology, OWL-Lite level ontology, which consists of taxonomic relationships (defined as classes) and non-taxonomic relationships (defined as properties), is constructed effectively. The tool also provides semi-automatic generation of the initial ontology using domain specific documents and general ontologies.
Identifying Topics in Social Media Posts using DBpedia (Óscar Muñoz García)
This document discusses a method for identifying topics in social media posts using DBpedia. It begins with an introduction that outlines the task of topic identification, applications for social media, and challenges with short, misspelled texts. It then reviews related work exploiting Wikipedia and DBpedia for tasks like text categorization. The method section describes the process of part-of-speech tagging, context selection, disambiguation against DBpedia, and language filtering. An evaluation on 10,000 Spanish posts finds high coverage rates and precision varying by channel from 59-89%. The conclusions discuss achieving good coverage while noting precision depends on the channel and no single context approach works best across all channels.
Understanding Natural Language Queries over Relational Databases (Ashis Kumar Chanda)
This document summarizes a proposed method for understanding natural language queries over relational databases. The method aims to accept natural language questions from users and help them retrieve results from a database using SQL. It works by transforming the natural language into a query tree, verifying the transformation interactively, and translating the query tree into an SQL statement. The method is evaluated based on effectiveness and usability based on a dataset from Microsoft Academic Search, showing improved results over the website. However, criticisms note that users need domain knowledge, natural language variety is not discussed in depth, and experimental results do not fully support the claims.
Neural Text Embeddings for Information Retrieval (WSDM 2017) (Bhaskar Mitra)
The document describes a tutorial on using neural networks for information retrieval. It discusses an agenda for the tutorial that includes fundamentals of IR, word embeddings, using word embeddings for IR, deep neural networks, and applications of neural networks to IR problems. It provides context on the increasing use of neural methods in IR applications and research.
Scalable Cross-lingual Document Similarity through Language-specific Concept ... (Carlos Badenes-Olmedo)
This document proposes an unsupervised algorithm to relate similar documents in multilingual corpora without requiring translations. It represents documents as distributions over topics derived from language-specific concept hierarchies (like WordNet synsets). Documents in different languages are aligned in a single representation space based on shared synsets from their main topics. The algorithm is evaluated on document classification and retrieval tasks using legislative text corpora in English, Spanish and French, showing it performs comparably to supervised methods while not requiring parallel or comparable training data.
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
PhD thesis defense.
This manuscript describes a methodology designed and implemented to recommend vocabularies based on the content of a given website. The goal of the proposed approach is to generate vocabularies by reusing existing schemas. The automatic recommendation helps websites become self-described web entities in the Web of Data, understandable by both humans and machines. The implemented approach is wrapped within a broader methodology for turning a website into a machine-understandable node using technologies developed in the scope of the Semantic Web vision. Transforming a website into a machine-understandable entity is the first step required on the website's side to narrow the gap with web agents and enable structured content consumption without the need to implement an Application Programming Interface (API) providing read-write functionality. The motivation of the thesis stems from the fact that, in most cases, the data provided via an API is already presented on the corresponding website.
LSI latent (by HATOUM Saria and DONGO ESCALANTE Irvin Franco) (rchbeir)
This document introduces latent semantic indexing (LSI), a technique for information retrieval that overcomes some limitations of the vector space model. LSI represents documents and queries in a semantic space of concepts derived from word co-occurrence patterns in the original text. It uses singular value decomposition to project documents and queries into a concept space of lower dimensionality than the original word space. This addresses problems with synonymy and polysemy. An example shows how LSI can retrieve a document based on conceptual similarity rather than direct word matching. Advantages of LSI include capturing synonymy and polysemy, while disadvantages include increased storage and computational requirements.
The document discusses using Linked Open Data from DBpedia to help with Unicode localization interoperability (ULI). DBpedia extracts structured data from Wikipedia and makes it available as Linked Data. It describes how ULI aims to standardize localization data exchange between tools. DBpedia data on abbreviations in over 100 languages was extracted and evaluated, finding it could help improve text segmentation precision and recall. The extracted data is being considered for inclusion in the Common Locale Data Repository (CLDR) to further standardization efforts.
The document discusses using semantics to understand social media data. It covers using both implicit and explicit semantics. For implicit semantics, it discusses topic models like Latent Dirichlet Allocation (LDA) that can be used to extract topics from unlabeled text data by modeling each document as a mixture of topics. For explicit semantics, it discusses representing social media data, conversations, and user behavior using ontologies. The tutorial provides an overview of using semantics to better understand social media information through techniques like topic modeling, semantic representation, and knowledge extraction.
Keynote @ SEMANTICS 2017 (Amsterdam, September 2017) on convergences between NLP and KE in the era of the Semantic Web, with a focus on semantic relation extraction from text.
Learning Multilingual Semantic Parsers for Question Answering over Linked Data - A comparison of neural and probabilistic graphical model architectures
1. Learning Multilingual Semantic Parsers for Question Answering over Linked Data
A comparison of neural and probabilistic graphical model architectures
PhD Dissertation Defense Talk
March 2019
Sherzod Hakimov
Semantic Computing Group, CITEC
Bielefeld University
7.
What is Semantic Parsing?
Give me the route to Jahnplatz
• mapping a natural language sentence to a detailed meaning representation
8.
What is Semantic Parsing?
Give me the route to Jahnplatz
route($LOC, "Jahnplatz")
route(StartLocation, EndLocation)
• mapping a natural language sentence to a detailed meaning representation
9.
What is Semantic Parsing?
Give me the route to Jahnplatz
route($LOC, "Jahnplatz")
• mapping a natural language sentence to a detailed meaning representation
• the meaning representation can be modelled using a formal language
10.
What is Semantic Parsing?
Give me the route to Jahnplatz
route($LOC, "Jahnplatz")
• mapping a natural language sentence to a detailed meaning representation
• the meaning representation can be modelled using a formal language, e.g. lambda calculus
• an ontology with properties, classes, entities, etc. (route, create_calendar_event, set_alarm)
• supports automated execution or reasoning
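As a worked illustration (notation mine, not from the slides): in lambda calculus the request above can be assigned the term λy.route($LOC, y); applying it to the constant Jahnplatz, (λy.route($LOC, y))(Jahnplatz), beta-reduces to route($LOC, Jahnplatz), i.e. exactly the representation route($LOC, "Jahnplatz") shown on the slide.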
11.
Why do we need Semantic Parsers?
Give me the route to Jahnplatz -> Query: route($LOC, "Jahnplatz") -> Knowledge Base -> Answer

12.
Why do we need Semantic Parsers?
Give me the route to Jahnplatz -> Query: route($LOC, "Jahnplatz") -> Knowledge Base -> Answer
14. Motivation
• Building semantic parsers with application on Question Answering
• Building multilingual solutions that can be applied to multiple languages
Which German politicians were born in Bielefeld?
Which metal has a liquid form?
Welche deutschen Politiker wurden in Bielefeld geboren?
Welches Metall hat eine flüssige Form?
¿Qué políticos alemanes nacieron en Bielefeld?
¿Qué metal tiene una forma líquida?
15. Motivation
• Building semantic parsers with application on Question Answering
• Building multilingual solutions that can be extended for other languages
• Comparison and evaluation of different model architectures
16. Motivation
• Building semantic parsers with application on Question Answering
• Building multilingual solutions that can be extended for other languages
• Comparison and evaluation of different model architectures
• Highlight the challenges of building Question Answering systems
17. DBpedia
• based on structured content from Wikipedia
• more than 130 languages supported
• 760 classes, 1105 object & 1622 data type properties
• ca. 9 million resources
21. Question Answering on RDF Data
Natural Language: Dan Brown is the author of Inferno
Triple: dbr:Inferno_(novel) dbo:author dbr:Dan_Brown
22. Question Answering on RDF Data
Natural Language: Dan Brown is the author of Inferno
Triple: dbr:Inferno_(novel) dbo:author dbr:Dan_Brown
Question format
Natural Language: Who is the author of Inferno?
23. Question Answering on RDF Data
Natural Language: Dan Brown is the author of Inferno
Triple: dbr:Inferno_(novel) dbo:author dbr:Dan_Brown
Question format
Natural Language: Who is the author of Inferno?
SPARQL Query: SELECT ?x WHERE {dbr:Inferno_(novel) dbo:author ?x}
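A minimal sketch (mine, not from the talk) of executing such a query: the snippet below sends the SPARQL query above to the public DBpedia endpoint via the Python SPARQLWrapper library. The endpoint URL and the use of the full IRI (which avoids escaping the parentheses in a prefixed name) are choices of this sketch.

from SPARQLWrapper import SPARQLWrapper, JSON

# query the public DBpedia endpoint (assumed reachable)
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?x WHERE { <http://dbpedia.org/resource/Inferno_(novel)> dbo:author ?x }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["x"]["value"])  # expected: http://dbpedia.org/resource/Dan_Brown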
25. Research Questions
How to map natural language phrases into knowledge base entries for multiple languages? Which linguistic resources can be used?
Who is the author of Inferno? / Who wrote Inferno? / Who is the writer of Inferno?
-> Triple: dbr:Inferno_(novel) dbo:author dbr:Dan_Brown
-> SELECT ?x WHERE {dbr:Inferno_(novel) dbo:author ?x}
26. Research Questions
How to map natural language phrases into knowledge base entries for multiple languages? Which linguistic resources can be used?
Who is the author of Inferno? / Who wrote Inferno? / Who is the writer of Inferno?
-> Triple: dbr:Inferno_(novel) dbo:author dbr:Dan_Brown
-> SELECT ?x WHERE {dbr:Inferno_(novel) dbo:author ?x}
Lexical Gap: write -> dbo:author
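One common way to bridge such a lexical gap is to compare distributed word representations. The toy sketch below is my illustration only: the vectors are made up and the property-label lexicon is an assumption, not the thesis lexicon; in practice the vectors would come from word2vec/GloVe-style embeddings.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy embeddings standing in for pre-trained word vectors
vectors = {
    "write":  np.array([0.9, 0.1, 0.3]),
    "author": np.array([0.8, 0.2, 0.4]),  # label of dbo:author
    "spouse": np.array([0.1, 0.9, 0.2]),  # label of dbo:spouse
}
candidates = {"dbo:author": "author", "dbo:spouse": "spouse"}

# rank candidate properties by similarity between the question verb and the label
scores = {uri: cosine(vectors["write"], vectors[label])
          for uri, label in candidates.items()}
print(max(scores, key=scores.get))  # dbo:author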
27. Research Questions
How to disambiguate URIs when multiple candidates are retrieved from mapping natural language tokens into knowledge base entries?
When was Inferno released?
Candidates: dbr:Inferno_(2016_film), dbr:Inferno_(novel)
SELECT ?x WHERE {dbr:Inferno_(novel) dbo:releaseDate ?x}
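To make the ambiguity concrete, the toy ranker below (my illustration; the similarity values, popularity counts, and weights are invented) scores the two candidate URIs. A popularity prior alone would favour the film, while the gold query above uses the novel, which is exactly why context-sensitive disambiguation is needed.

# (URI, string similarity of label to the mention "Inferno", popularity prior)
candidates = [
    ("dbr:Inferno_(novel)",     1.0, 1200),
    ("dbr:Inferno_(2016_film)", 1.0, 3400),
]

def score(similarity, popularity, w_sim=0.7, w_pop=0.3, max_pop=5000.0):
    # linear combination of label match and popularity; weights are invented
    return w_sim * similarity + w_pop * (popularity / max_pop)

for uri, sim, pop in sorted(candidates, key=lambda c: score(c[1], c[2]), reverse=True):
    print(uri, round(score(sim, pop), 3))
# prints the film first: popularity alone cannot pick the novel here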
28. Research Questions
How to use syntactic information of a natural language question together with semantic representations of entries in a knowledge base?
Who wrote Inferno?
Dependency parse: Who (PRON) <-nsubj- wrote (VERB) -dobj-> Inferno (PROPN)
Triple: dbr:Inferno_(novel) dbo:author dbr:Dan_Brown
SELECT ?x WHERE { dbr:Inferno_(novel) dbo:author ?x }
29. Research Questions
RQ4: What are the advantages and the disadvantages of a multilingual QA system vs. a monolingual system built for each language?
Who is the author of Inferno? (English)
Wer ist der Autor von Inferno? (German)
¿Quién es el autor de Inferno? (Spanish)
all map to: SELECT ?x WHERE { dbr:Inferno_(novel) dbo:author ?x } (answer: dbr:Dan_Brown)
30. Research Questions
RQ5: What effort is required to adapt our QA pipelines to another language?
Who is the author of Inferno? (English)
Qui est l'auteur de Inferno? (French)
Infernoning muallifi kim? (Uzbek)
all map to: SELECT ?x WHERE { dbr:Inferno_(novel) dbo:author ?x } (answer: dbr:Dan_Brown)
33. Preliminaries
• Logical Form: DUDES, a formalism for specifying meaning representations for dependency tree structures
• Semantic Composition: acquiring the meaning representation using the syntax of the question
35. Logical Form
• DUDES: Dependency-based Underspecified Discourse Representation Structures (Cimiano [1])
• a formalism for specifying meaning representations
• flexible semantic composition w.r.t. the order of application
• built on semantic dependencies, e.g. suitable for working with dependency-based syntactic analysis
[1] Cimiano, P. (2009). "Flexible semantic composition with DUDES". In: Proceedings of the Eighth International Conference on Computational Semantics, pp. 272-276. Association for Computational Linguistics.
36. DUDES
v : the main variable
vs : the projection variables
l : the label of the main DRS
drs : a DRS (Discourse Representation Structure)
slots : a set of semantic dependencies
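As a rough illustration of this data structure, here is a schematic Python sketch with deliberately simplified types; it is not the dissertation's implementation:

from dataclasses import dataclass, field

@dataclass
class DRS:
    # Simplified Discourse Representation Structure: variables plus conditions
    variables: set
    conditions: list          # e.g. [("dbo:author", "x", "y")]

@dataclass
class Dudes:
    # Schematic rendering of the DUDES tuple described above
    v: str                    # main variable
    vs: list                  # projection variables
    l: int                    # label of the main DRS
    drs: DRS                  # the main DRS
    slots: list = field(default_factory=list)   # semantic dependencies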
38. Semantic Composition with DUDES
Input: a natural language question and its dependency parse tree, e.g. "Who created Wikipedia?"
Output: a meaning representation grounded in a given domain, e.g. dbr:Wikipedia dbo:author ?x
39. Semantic Composition with DUDES
Each node gets a pair of assignments (DUDES type + knowledge base ID), provided by an oracle.
51. Dependency parse tree-based Semantic Parsing Approach
• multilingual semantic parsing approach: English, German & Spanish [1]
• uses language-independent dependency parse trees from Universal Dependencies
• combines different types of lexical information: DBpedia Ontology labels, the M-ATOLL [2] lexicon & word embeddings
[1] Hakimov S, Jebbara S, Cimiano P. "AMUSE: Multilingual Semantic Parsing for Question Answering over Linked Data". ISWC 2017
[2] Walter S, Unger C, Cimiano P. "M-ATOLL: A Framework for the Lexicalization of Ontologies in Multiple Languages". ISWC 2014
[3] Hakimov S, Walter S, Unger C, Cimiano P. "Applying semantic parsing to question answering over linked data: Addressing the lexical gap". NLDB 2015
55. Inference
• Metropolis-Hastings sampling: exploring a huge search space (ca. 10 million resources, 2000 properties)
• Linking to Knowledge Base (L2KB)
  • objective: compare the set of predicted URIs to the expected set of URIs
• Query Construction (QC)
  • objective: compare the constructed query to the expected query
Input: initial state
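As a rough sketch of the sampling loop (assuming a symmetric proposal function and strictly positive scores; propose and score are hypothetical placeholders for the state-mutation and scoring steps described on these slides):

import random

def metropolis_hastings(initial_state, propose, score, steps=1000):
    # Explore the state space: better states are always accepted,
    # worse ones with probability proportional to the score ratio
    state, best = initial_state, initial_state
    for _ in range(steps):
        candidate = propose(state)
        accept = min(1.0, score(candidate) / max(score(state), 1e-9))
        if random.random() < accept:
            state = candidate
            if score(state) > score(best):
                best = state
    return best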
57. L2KB Sampling
Explore the edges and assign knowledge base IDs based on the lemmas of the nodes
Check the triple pattern: ?x dbo:author dbr:Wikipedia (Slot 2) vs. dbr:Wikipedia dbo:author ?x (Slot 1)
Inverted index: ontology labels, the M-ATOLL lexicon & word embeddings
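A toy version of such an inverted index (the entries shown are hypothetical; in the approach it is populated from ontology labels, the M-ATOLL lexicon and embedding-based entries):

from collections import defaultdict

inverted_index = defaultdict(set)
inverted_index["author"].add("dbo:author")       # from an ontology label
inverted_index["write"].add("dbo:author")        # from a lexicon entry such as "written by"
inverted_index["wikipedia"].add("dbr:Wikipedia")

def candidates(lemma):
    # Return all KB IDs whose lexical entries match a node's lemma
    return inverted_index.get(lemma, set())

print(candidates("write"))   # {'dbo:author'}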
66. Evaluation
Dataset: Question Answering over Linked Data (QALD), 6th challenge
Languages: English, German, Spanish, Italian, French, Dutch, Romanian, Farsi
350 questions for training, 100 for testing
Unger, Christina, Axel-Cyrille Ngonga Ngomo, and Elena Cabrio (2016). "6th open challenge on question answering over linked data (QALD-6)". In: Semantic Web Evaluation Challenge.
68. Evaluation
DBP: lexicon from DBpedia Ontology labels & WordNet
M-ATOLL: lexicon induced by M-ATOLL (Walter et al. 2014)
Embed: lexicon added using pre-trained word embeddings (Mikolov et al. 2013)
Dict: manually defined lexicon
Walter, Sebastian, Christina Unger, and Philipp Cimiano. "M-ATOLL: A Framework for the Lexicalization of Ontologies in Multiple Languages". ISWC 2014
Mikolov, Tomas et al. "Distributed representations of words and phrases and their compositionality". NIPS 2013
72. Outline
• SimpleQuestions dataset: 74k samples, Freebase data
• Question: "Who wrote Mildred Pierced?"
• Fact: (mildred_pierced, book.written_work.author, stuart_kaminsky)
• Answer pattern: (mildred_pierced, book.written_work.author, ?x)
• Systematic comparison of different model architectures
Hakimov S, Jebbara S, Cimiano P. "Evaluating Architectural Choices for Deep Learning Approaches for Question Answering over Knowledge Bases". ICSC 2019
73. Named Entity Recognition
• Used by all models to predict the entity span
• Character & word embeddings
• Trained using weak supervision: an inference is counted as correct if the expected entity has been found
75. Architectures
Model 1: BiLSTM-Softmax, Model 2: BiLSTM-KB, Model 3: BiLSTM-Binary, Model 4: fastText [1]
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov. "Bag of Tricks for Efficient Text Classification". 2016. arXiv.
82. Discussion
• Manual Effort
• Syntax and Semantics
• Multilinguality
• Cross-domain Transferability
• Training Data Size and Search Space
83. Discussion

CCG-based (Chapter 6)
  • Manual effort: CCG combination rules, manual lexicon
  • Syntax & semantics: learned in tandem; CCG for syntax, lambda calculus for semantics
  • Multilinguality: manual effort is required
  • Cross-domain transferability: manual effort is required
  • Training data & number of KB IDs: 600 training instances, 750 entities

Dependency-based (Chapter 7)
  • Manual effort: feature templates
  • Syntax & semantics: syntax is given; DUDES as the semantic formalism
  • Multilinguality: an adaptable solution
  • Cross-domain transferability: a dependency parser is required, e.g. for the biomedical domain
  • Training data & number of KB IDs: 300 training instances, <= 10 mil. entities, >= 2000 predicates

BiLSTM-Softmax (Chapter 8)
  • Manual effort: none
  • Syntax & semantics: word & character embeddings for lexical & contextual information; semantics is limited to a single predicate and a subject entity
  • Multilinguality: an adaptable solution (only word & character embeddings)
  • Cross-domain transferability: an adaptable solution (only word & character embeddings)
  • Training data & number of KB IDs: >= 75K instances, <= 2 mil. entities
85. Research Questions
RQ1: How to map natural language phrases into knowledge base entries for multiple languages? Which linguistic resources can be used?
• ontology lexicalisations, e.g. M-ATOLL (Walter et al. 2014)
• ontology labels, e.g. DBpedia labels
• dictionaries
• WordNet synsets
• lexicon induced from contextual embeddings of words
86. Research Questions
RQ2: How to disambiguate URIs when multiple candidates are retrieved from mapping natural language tokens into knowledge base entries?
• supervised models with a disambiguation objective
  • CCG-based model: uses lexical and syntactic information as features
  • Dependency tree-based model: syntactic dependencies between words, lexical similarity, ontology restrictions
  • Neural network-based model: ranking objective over predicates
87. Research Questions
RQ3: How to use syntactic information of a natural language question together with semantic representations of entries in a knowledge base?
• semantic parsing with bottom-up composition
  • CCG-based model: learns the syntax and semantics together
  • Dependency tree-based model: learns to compose semantics based on dependency trees
88. Research Questions
RQ4: What are the advantages and the disadvantages of a multilingual QA system vs. a monolingual system built for each language?
• Advantages
  • multilingual: broader coverage
  • monolingual: higher performance, e.g. Xser (Xu et al. 2014), 0.7 F1 on QALD-4
• Disadvantages
  • multilingual: lower performance, e.g. AMUSE, 0.3 F1 on QALD-6
  • monolingual: requires expertise, e.g. CCG rules, lexicon
89. Research Questions
RQ5: What effort is required to adapt our QA pipelines to another language?
• CCG-based model: grammar rules and a manually defined lexicon; language-specific
• Dependency parse tree-based model: a dependency parse tree generator and a lexicon
• Neural network-based model: depends on the training data
90. Conclusion
• Addressed the lexical gap for QA systems
• Incorporated ontology lexicalisations to reduce the lexical gap
• Used Universal Dependencies to build a language-independent QA pipeline
• Multilingual semantic parsing for Question Answering
• Evaluated different QA models under the same conditions
• Highlighted the importance of the building blocks of a pipeline for a fair comparison
92. GENLEX
Example sentence: Barack Obama is married to Michelle Obama
[1] Zettlemoyer, Luke S. and Michael Collins (2005). "Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars". In: 21st Conference on Uncertainty in Artificial Intelligence
[2] Hakimov, Sherzod et al. (2015). "Applying semantic parsing to question answering over linked data: Addressing the lexical gap". In: International Conference on Applications of Natural Language to Information Systems
96. Lexicon
During sampling, compute the cosine similarity between question words and the ontology labels of properties
Vectors of multi-word labels are summed, e.g. V(population) + V(total)
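A minimal sketch of this similarity computation (emb is assumed to map words to pre-trained vectors, e.g. word2vec; all names are illustrative):

import numpy as np

def phrase_vector(words, emb):
    # Multi-word labels are represented by summing their word vectors,
    # e.g. V(population) + V(total)
    return np.sum([emb[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# score = cosine(phrase_vector(["population", "total"], emb), emb["inhabitants"])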
103. Semantic Composition
• recursively computing the meaning of each node from the meanings of its child nodes
• building the meaning representation bottom-up
ComposeSemantics rendered as Python (lexical_meaning and combine are placeholders for the formalism-specific lookup and composition, here DUDES):

def compose_semantics(node):
    # Terminal node (a word): return its atomic lexical meaning
    if not node.children:
        return lexical_meaning(node.word)
    # Recursively compute the meaning representation (MR) of each child subtree
    child_mrs = [compose_semantics(child) for child in node.children]
    # Combine the children's MRs into an MR for the overall parse tree
    return combine(node, child_mrs)
108. Model Representation
Observed variables: the dependency parse tree
Hidden variables: KB IDs, slots, DUDES types
• States can be ranked by
  • objective score: comparison to the ground truth
  • model score: computed using feature weights
• Training procedure
  • switch between the model & objective scores after every iteration
111. Model 1: BiLSTM-Softmax
• softmax layer predicts the predicates seen during training
• encoding layer: word & character embeddings
• BiLSTM: two LSTM layers (forward and backward)
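A compact PyTorch sketch of this architecture (sizes are illustrative and the character-embedding channel is omitted for brevity; this is not the exact configuration from the thesis):

import torch
import torch.nn as nn

class BiLSTMSoftmax(nn.Module):
    def __init__(self, vocab_size, num_predicates, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_predicates)

    def forward(self, token_ids):                # (batch, seq_len)
        x = self.embed(token_ids)                # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.bilstm(x)             # final hidden states
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat forward & backward
        return self.out(h)                       # logits over predicates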
112. Model 2: BiLSTM-KB
• learns embeddings of the predicates in the KB
• encoding layer: word & character embeddings
• BiLSTM: two LSTM layers (forward and backward)
• output layer computes the cosine similarity to all predicate embeddings and chooses the closest
113. Model 3: BiLSTM-Binary
• encoding layer: encodes the input question with word & character embeddings
• encoding layer: encodes the input predicate with word & character embeddings
• output layer: binary decision, i.e. does the predicate match the question?
114. Model 4: fastText
• document classification tool developed by Facebook*
• uses word & character n-gram embeddings
• softmax layer predicts the expected predicate
* http://fasttext.cc
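Using the fastText library for this setup could look as follows (the training-file name and label format are assumptions following fastText's __label__ convention):

import fasttext

# Each line of the file: "__label__<predicate> <question>", e.g.
# "__label__book.written_work.author who wrote mildred pierced"
model = fasttext.train_supervised(input="questions.train")

labels, probs = model.predict("who wrote mildred pierced")
print(labels[0], probs[0])   # most likely predicate and its probability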
116. Generative vs. Discriminative Models
Generative models compute the joint probability distribution p(x, y)
• HMM: y_t depends on y_{t-1}, and x_t depends on y_t, i.e. the model describes how the output label y_t generates the input vector x_t
• factorization: p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t)
Discriminative models compute the conditional probability distribution p(y | x) directly
• CRF: does not make these independence assumptions and models how the feature vector x is assigned the label y_t
118. Manual Effort
• CCG-based model
  • define CCG grammar rules and a hand-crafted lexicon for domain-independent phrases
• Dependency parse tree-based model
  • feature functions
• Neural network-based model (BiLSTM-Softmax)
  • not required
119. Syntax and Semantics
• CCG-based model
  • syntax and semantics are learned in tandem
  • CCG for syntax and lambda calculus for semantics
  • syntax guides the semantics of the sentence
• Dependency parse tree-based model
  • syntax is given and the semantics is learned
  • DUDES as the semantic formalism; syntax is based on dependency trees from Universal Dependencies
• Neural network-based model (BiLSTM-Softmax)
  • syntactic information is learned implicitly, e.g. word and character embeddings provide contextual information
  • semantics is limited to a single subject and predicate, a simpler task
120. Multilinguality
• CCG-based model
  • CCG grammar rules need to be extended
• Dependency parse tree-based model
  • a multilingual solution
• Neural network-based model (BiLSTM-Softmax)
  • can be adapted to other languages, e.g. words & characters as features
121. Cross-domain Transferability
• CCG-based model
  • manual effort is required: CCG rules, lexicon
• Dependency parse tree-based model
  • requires dependency parse trees for the target domain, e.g. the biomedical domain
• Neural network-based model (BiLSTM-Softmax)
  • can be easily adapted
122. Training Data Size and Search Space
• CCG-based model
  • 600 training instances, 750 entities (see the Discussion table)
• Dependency parse tree-based model
  • 300 training instances, up to 10 million entities, 2000+ predicates (see the Discussion table)
• Neural network-based model (BiLSTM-Softmax)
  • heavily depends on the data: 75K+ instances