Slide deck for a SIGMOD 2017 tutorial.
ABSTRACT:
The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing, and progress in the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying areas where database systems and natural language processing systems overlap. This tutorial explores two of the areas of overlap most relevant to the database community: (1) managing natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, research opportunities and challenges covering both areas.
A system, called a natural language interface, that transforms a user's natural language question into a SPARQL query.
find related papers here https://sites.google.com/site/fadhlinams81/publication
The document summarizes the history and impact of the Semantic Web. It discusses how the Semantic Web was originally envisioned as a way to make information on the web more machine-readable through semantic annotations. While early work showed promise, widespread adoption lagged behind expectations. Key impacts included positive but limited effects on web search through knowledge graphs, the rise of centralized social networks rather than distributed semantic social media, and limited use in e-commerce. Ongoing work continues on standards and applications while addressing challenges around centralization.
RDF and other linked data standards — how to make use of big localization data (Dave Lewis)
The standards and interoperability challenges of using the Resource Description Framework for data resources in linked data. Based on work from CNGL (www.cngl.ie), the FALCON project (www.falcon-project.eu) and the LIDER project (www.lider-project.eu).
This document discusses various techniques for question answering and relation extraction in natural language processing. It provides an overview of question answering systems and approaches, including examples like START, Ask Jeeves and Siri. It also discusses using search engines for question answering, relation extraction from questions, and common evaluation metrics for question answering systems like accuracy and mean reciprocal rank.
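The mean reciprocal rank metric mentioned above rewards systems that place a correct answer near the top of their ranked output. A minimal sketch in Python; the questions and ranked answer lists are invented for illustration:

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """Average of 1/rank of the first correct answer per question
    (contributes 0 when no correct answer is returned)."""
    total = 0.0
    for ranking, gold in zip(ranked_answers, gold_answers):
        for rank, answer in enumerate(ranking, start=1):
            if answer == gold:
                total += 1.0 / rank
                break
    return total / len(gold_answers)

# Three questions: gold answer found at rank 1, rank 2, and not at all.
rankings = [["Paris", "Lyon"], ["Berlin", "Madrid"], ["Rome"]]
gold = ["Paris", "Madrid", "Oslo"]
print(mean_reciprocal_rank(rankings, gold))  # (1 + 0.5 + 0) / 3 = 0.5
```

Accuracy, by contrast, would credit only the first example, since it counts a question as answered only when the top-ranked answer is correct.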
Development of Semantic Web based Disaster Management System (NIT Durgapur)
A Semantic Web model for disaster management that structures data so that any information needed during an emergency is readily available.
The Semantic Web meets the Code of Federal Regulations (tbruce)
Semantic Web and natural-language-processing techniques meet the Code of Federal Regulations. Presentation from CALICON12 by the Legal Information Institute. Work on definition extraction, linked data publishing, search enhancement, vocabulary discovery.
Joint presentation with Nuria Casellas.
Semantic search: from document retrieval to virtual assistants (Peter Mika)
This document summarizes a presentation on semantic search given by Peter Mika, a senior research scientist at Yahoo Labs. It discusses the history and goals of semantic search, including improving query understanding and bridging the semantic gap. It also describes Yahoo's research into semantic search applications for web search, including enhancing search results, entity retrieval and recommendations, and question answering. Semantic representations of queries and documents are key to these applications.
Libraries around the world have a long tradition of maintaining authority files to assure the consistent presentation and indexing of names. As library authority files have become available online, the authority data has become accessible -- and many have been published as Linked Open Data (LOD) -- but names in one library authority file typically had no link to corresponding records for persons and organizations in other library authority files. After a successful experiment in matching the Library of Congress/NACO authority file with the German National Library's authority file, an online system called the Virtual International Authority File was developed to facilitate sharing by ingesting, matching, and displaying the relations between records in multiple authority files.
The Virtual International Authority File (VIAF) has grown from three source files in 2007 to more than two dozen files today. The system harvests authority records, enhances them with bibliographic information and brings them together into clusters when it is confident the records describe the same identity. Although the most visible part of VIAF is an HTML interface, the API beneath it supports a linked data view of VIAF with URIs representing the identities themselves, not just URIs for the clusters. It supports names for persons, corporations, geographic entities, works, and expressions. With English, French, German and Spanish interfaces (and a Japanese interface in progress), the system is used around the world, handling over a million queries per day.
Speaker
Thomas Hickey is Chief Scientist at OCLC where he helped found OCLC Research. Current interests include metadata creation and editing systems, authority control, parallel systems for bibliographic processing, and information retrieval and display. In addition to implementing VIAF, his group looks into exploring Web access to metadata, identification of FRBR works and expressions in WorldCat, the algorithmic creation of authorities, and the characterization of collections. He has an undergraduate degree in Physics and a Ph.D. in Library and Information Science.
Question Answering - Application and Challenges (Jens Lehmann)
This document provides an overview of question answering applications and challenges. It defines question answering as receiving natural language questions and providing concise answers. Recent developments in question answering systems are discussed, including IBM Watson. Challenges for question answering over semantic data are explored, such as lexical gaps, ambiguity, granularity, and alternative resources. Large-scale linguistic resources and machine learning approaches for question answering are also covered. Applications of question answering technologies are examined.
This document provides guidance on how to effectively search the internet like an expert. It outlines the main types of internet searches, including web directories that are carefully reviewed and organized by humans versus search engines that are compiled by machine crawlers with minimal human oversight. It details useful search terms and parts of a URL. The document then explains how to perform effective searches on Google, including using keywords, capitalization, phrase searches, and excluding terms. It provides examples of advanced search functions like searching by title, URL, or text. Finally, it emphasizes the importance of evaluating information and provides the C.A.R.P. test to assess reliability.
Metadata Training for Staff and Librarians for the New Data Environment (Diane Hillmann)
The document summarizes a training program on metadata and structured data. It discusses the goals of offering participatory training for libraries, challenges with current webinar-style training, and an overview of the planned training which includes topics like the definition of metadata, different data types, identifiers, and graph-based data modeling. It also includes sample exercises for participants and asks for feedback on the overall training plan.
This document discusses the evolution of the web from a network of documents to a network of linked data. It begins by describing the original web of documents, which organized information in silos and had implicit semantics. The document then introduces the concept of the semantic web and linked data, which structures information as interconnected data using explicit semantics. It provides examples of how linked data can be represented using RDF triples and describes the principles of linked data for publishing and connecting data on the web. Finally, it discusses characteristics and examples of linked data applications.
This document discusses the evolution of the web from a web of documents to a web of linked data. It outlines the principles of linked data, which involve using URIs to identify things and linking those URIs to other URIs so that machines can discover more data. RDF is introduced as a standard data model for publishing linked data on the web using triples. Examples of linked data applications and datasets are provided to illustrate how linked data allows the web to function as a global database.
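The triple model described above can be illustrated without any RDF library: triples are just (subject, predicate, object) statements, and a SPARQL query is, at heart, pattern matching over them. A minimal sketch in plain Python, with invented `ex:` identifiers standing in for full URIs:

```python
# Triples as (subject, predicate, object); the "ex:" prefix and the
# facts themselves are abbreviations invented for illustration.
triples = [
    ("ex:Tim_Berners-Lee", "ex:invented", "ex:World_Wide_Web"),
    ("ex:Tim_Berners-Lee", "ex:bornIn", "ex:London"),
    ("ex:London", "ex:locatedIn", "ex:England"),
]

def match(pattern, store):
    """Return triples matching a pattern; None acts as a wildcard,
    roughly like a SPARQL variable."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What did Tim Berners-Lee invent?"
# ~ SELECT ?o WHERE { ex:Tim_Berners-Lee ex:invented ?o }
for _, _, obj in match(("ex:Tim_Berners-Lee", "ex:invented", None), triples):
    print(obj)  # ex:World_Wide_Web
```

Because every dataset shares this one shape, merging two linked data sources is simply concatenating their triple lists, which is what lets the web function as a global database.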
Slides for the iDB summer school (Sapporo, Japan) http://db-event.jpn.org/idb2013/
Typically, Web mining approaches have focused on enhancing or learning about users' information-seeking behavior, from query log analysis and click-through usage to employing the web graph structure for ranking, detecting spam, or finding web page duplicates. Lately, there is a trend toward mining web content semantics and dynamics in order to enhance search capabilities, either by providing direct answers to users or by enabling advanced interfaces or capabilities. In this tutorial we will look into different ways of mining textual information from Web archives, with a particular focus on how to extract and disambiguate entities and how to put them to use in various search scenarios. Further, we will discuss how web dynamics affect information access and how to exploit them in a search context.
The document discusses Semantic Web technologies including XML, DOM, RDF, and ontologies. It provides an overview of how these layers work together, from the basic levels of Unicode and URIs, to XML which enables data sharing and transport, to RDF triplets that represent relationships between resources, to ontologies that define classes and connect related items, and finally to higher levels of logic, digital signatures, and trust. The goal of the Semantic Web is to make data on the web more intelligible to computers and enable more sophisticated question answering about relationships between different entities.
From the Semantic Web to the Web of Data: ten years of linking up (Davide Palmisano)
This document discusses the concepts and technologies behind the Semantic Web. It describes how RDF, RDF Schema, and OWL allow structured data and relationships to be represented and shared across the web. It also discusses tools for working with semantic data in Java, such as Jena, Sesame, and Any23 for extracting and working with RDF. The document provides examples of representing data and relationships in RDF and querying semantic data with SPARQL.
The document discusses a webinar presented by NISO and DCMI on Schema.org and Linked Data. The webinar provides an overview of Schema.org and Linked Data, examines the advantages and challenges of using RDF and Linked Data, looks at Schema.org in more detail, and discusses how Schema.org and Linked Data can be combined. The goals of the webinar are to illustrate the different design choices for identifying entities and describing structured data, integrating vocabularies, and incentives for publishing accurate data, as well as to help guide adoption of Schema.org and Linked Data approaches.
The document "From queries to answers in the Web" discusses:
- How web search has evolved from primarily returning links to now attempting to directly answer queries.
- Future trends in search include more personalized, social, contextual and anticipatory search capabilities.
- Semantic search aims to understand user intent and resources using semantic models to improve matching and ranking.
This document discusses techniques for improving record linkage quality when dealing with data that exhibits high diversity in attribute values. It begins by motivating the challenges of linking temporal records where attribute values may evolve over time, as well as records belonging to the same group where different sources may provide different local values. The document then presents an approach using disagreement and agreement decay to incorporate how attribute values change over time into the record linkage process. It also describes a temporal clustering algorithm that links records in time order while considering continuity of attribute histories. Experimental results on real datasets demonstrate improved recall over baselines. Finally, the document outlines ongoing work on linking records of the same group where different local values may be provided.
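The decay idea above can be sketched loosely: a disagreement on an attribute should count heavily against a match when two records are contemporaneous, but less so as the time gap grows, since the value may simply have evolved. The snippet below is a toy interpretation of that intuition, not the paper's actual model; the half-life parameter, decay shape, and sample records are all invented:

```python
def disagreement_penalty(years_apart, half_life=5.0):
    """Penalty for two records disagreeing on an attribute: full penalty
    (1.0) when the records are contemporaneous, decaying toward 0 as the
    time gap grows.  The exponential shape and half-life are assumptions."""
    return 0.5 ** (years_apart / half_life)

def record_similarity(r1, r2, years_apart):
    """Toy similarity over shared attributes with temporal decay."""
    score, n = 0.0, 0
    for attr in set(r1) & set(r2):
        n += 1
        if r1[attr] == r2[attr]:
            score += 1.0
        else:
            score += 1.0 - disagreement_penalty(years_apart)
    return score / n if n else 0.0

a = {"name": "X. Wang", "affiliation": "MSR"}
b = {"name": "X. Wang", "affiliation": "Google"}
print(record_similarity(a, b, years_apart=0))   # strict: 0.5
print(record_similarity(a, b, years_apart=10))  # lenient: 0.875
```

With a ten-year gap the differing affiliation barely hurts the score, so the records can still link into one temporal cluster; with no gap the same disagreement pulls the score down sharply.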
This document discusses text mining and provides an outline of the topic. It defines text mining as the analysis of natural language text data and explains why it is useful given the large amount of unstructured data. The document then describes the basic text mining process, which includes steps like filtering, segmentation, stemming, eliminating excessive words, and clustering. Several applications of text mining are mentioned like call centers, anti-spam, and market intelligence. Challenges of text mining like dealing with unstructured data and large collections of documents are also outlined.
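The preprocessing steps named above (filtering, segmentation, stop-word elimination, stemming) can be sketched in a few lines of plain Python. The stop-word list and the crude suffix-stripping "stemmer" are deliberately toy-sized stand-ins for real components such as the Porter stemmer:

```python
import re

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}  # tiny illustrative list

def preprocess(text):
    """Filtering, segmentation, stop-word removal, and naive stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())        # filter + segment
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop excessive words
    stemmed = []
    for t in tokens:                                    # crude suffix stripping
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The miners are mining the mines"))
# ['miner', 'are', 'min', 'mine']
```

The resulting token lists are what downstream steps like clustering operate on, typically after being turned into term-frequency vectors.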
A QA system takes in a natural language question, analyzes it to understand the type of question and information sought, searches structured and unstructured data sources for relevant information, and generates a natural language answer. It consists of modules for question analysis, information retrieval from knowledge bases and documents, answer generation, and response formatting. The goal is to delegate more interpretation work to machines so users can get direct answers to complex questions over heterogeneous data.
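The four modules can be wired together as a toy pipeline. Everything below (the answer-type heuristic, the one-fact knowledge base, the lookup logic) is an invented stand-in chosen to show the control flow, not any real system's components:

```python
def analyze_question(question):
    """Question analysis: guess the expected answer type."""
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("when"):
        return "DATE"
    return "THING"

KNOWLEDGE_BASE = {  # hypothetical structured source
    ("invented", "telephone"): ("Alexander Graham Bell", "PERSON"),
}

def retrieve(question):
    """Information retrieval: find facts whose key terms appear in the question."""
    q = question.lower()
    return [fact for key, fact in KNOWLEDGE_BASE.items()
            if all(word in q for word in key)]

def answer(question):
    """Pipeline: analyze -> retrieve -> generate -> format."""
    wanted = analyze_question(question)
    for text, answer_type in retrieve(question):
        if answer_type == wanted:
            return f"{text}."  # answer generation + response formatting
    return "I don't know."

print(answer("Who invented the telephone?"))  # Alexander Graham Bell.
```

Real systems replace each stub with heavy machinery (parsers, index lookups over documents and knowledge bases, learned rankers), but the division of labor between the modules is the same.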
Natural Language Processing, Techniques, Current Trends and Applications in I... (RajkiranVeluri)
The document discusses natural language processing (NLP) techniques, current trends, and applications in industry. It covers common NLP techniques like morphology, syntax, semantics, and pragmatics. It also discusses word embeddings like Word2Vec and contextual embeddings like BERT. Finally, it discusses applications of NLP in healthcare like analyzing clinical notes and brand monitoring through sentiment analysis of user reviews.
This document provides an overview of data management best practices for graduate students presented in a workshop. It discusses what constitutes research data, the importance of managing data, how to create a data management plan, file naming conventions, metadata, data storage and backup strategies, and archiving options. The workshop covers topics like using a structured folder system, creating codebooks and documentation to describe data, and ensuring long-term access and preservation of research data. University librarians are available to help students with all aspects of responsible data management.
This document discusses the challenges and opportunities presented by the increasing volume and complexity of biological data. It outlines four main areas: 1) Developing methods to efficiently store, access, and analyze large datasets; 2) Broadening our understanding of gene function beyond a small number of well-studied genes; 3) Accelerating research through improved sharing of data, results, and methods; and 4) Leveraging exploratory analysis of integrated datasets to generate new insights. The author advocates for lossy data compression, streaming analysis, preprint sharing, improved metadata collection, and incentivizing open data practices.
Enterprise Search Share Point2009 Best Practices Final (Marianne Sweeny)
This presentation examines features and benefits of Microsoft Office SharePoint Server (MOSS) 2007 enterprise search. It contains configuration guidance, code snippets, and tips and tricks.
These slides were presented at the "graph databases in life sciences" workshop. There is an accompanying Neo4j guide that walks you through importing data into Neo4j using web services from a number of databases at EMBL-EBI.
https://github.com/simonjupp/importing-lifesci-data-into-neo4j
The document discusses scaling web data at low cost. It begins by presenting Javier D. Fernández and providing context about his work in semantic web, open data, big data management, and databases. It then discusses techniques for compressing and querying large RDF datasets at low cost using binary RDF formats like HDT. Examples of applications using these techniques include compressing and sharing datasets, fast SPARQL querying, and embedding systems. It also discusses efforts to enable web-scale querying through projects like LOD-a-lot that integrate billions of triples for federated querying.
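One core idea behind binary RDF formats like HDT is dictionary encoding: each distinct URI or literal is stored once, and triples shrink to tuples of integer IDs that compress well and can be scanned quickly. The sketch below shows only that idea in plain Python; it is not the actual HDT format, and the `dbr:`/`dbo:` terms are just sample data:

```python
def encode(triples):
    """Dictionary-encode triples: each distinct term is stored once and
    triples become compact integer ID tuples."""
    dictionary, ids, encoded = [], {}, []
    for triple in triples:
        row = []
        for term in triple:
            if term not in ids:
                ids[term] = len(dictionary)
                dictionary.append(term)
            row.append(ids[term])
        encoded.append(tuple(row))
    return dictionary, encoded

def decode(dictionary, encoded):
    """Recover the original string triples from IDs."""
    return [tuple(dictionary[i] for i in row) for row in encoded]

triples = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Hamburg", "dbo:country", "dbr:Germany"),
]
dictionary, compact = encode(triples)
print(compact)  # [(0, 1, 2), (3, 1, 2)]
assert decode(dictionary, compact) == triples
```

Repeated predicates and objects cost one dictionary entry no matter how many triples use them, which is why datasets with billions of triples, as in LOD-a-lot, become tractable on modest hardware.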
This presentation was provided by Ted Koppel ofAuto-Graphics, Inc, Ed Riding of SirsiDynix, Andrew K. Pace of OCLC, and John Mark Ockerbloom of The University of Pennsylvania, during the NISO webinar "Library Systems & Interoperability: Breaking Down Silos," held on June 10, 2009.
Introduction to apache spark and machine learningAwoyemi Ezekiel
This document provides an introduction to Apache Spark and machine learning. It discusses what Apache Spark is, how it compares to other big data frameworks, and the Spark program lifecycle. It also defines what big data is and where it comes from. Additionally, it discusses data science goals of deriving knowledge from big data efficiently and intelligently, and provides examples of machine learning applications. Finally, it includes two coding examples - one involving text analysis on Shakespeare's works, and another involving movie recommendations from movie rating data.
Case study of Rujhaan.com (A social news app )Rahul Jain
Rujhaan.com is a news aggregation app that collects trending news and social media discussions from topics of interest to users. It uses various technologies including a crawler to collect data from social media, Apache Solr for search, MongoDB for storage, Redis for caching, and machine learning techniques like classification and clustering. The presenter discussed the technical architecture and challenges of building Rujhaan.com to provide fast, personalized news content to over 16,000 monthly users while scaling to growing traffic levels.
1. The document introduces databases and their history, from early data storage and retrieval to modern database management systems.
2. It discusses Edgar Codd's invention of the relational database model in 1970 which changed the field by separating data from application code for easier modification and generalization.
3. The document outlines what a database management system does, including managing large amounts of data, supporting efficient and concurrent access, and providing security.
Open data is a crucial prerequisite for inventing and disseminating the innovative practices needed for agricultural development. To be usable, data must not just be open in principle—i.e., covered by licenses that allow re-use. Data must also be published in a technical form that allows it to be integrated into a wide range of applications. The webinar will be of interest to any institution seeking ways to publish and curate data in the Linked Data cloud.
This webinar describes the technical solutions adopted by a widely diverse global network of agricultural research institutes for publishing research results. The talk focuses on AGRIS, a central and widely-used resource linking agricultural datasets for easy consumption, and AgriDrupal, an adaptation of the popular, open-source content management system Drupal optimized for producing and consuming linked datasets.
Agricultural research institutes in developing countries share many of the constraints faced by libraries and other documentation centers, and not just in developing countries: institutions are expected to expose their information on the Web in a re-usable form with shoestring budgets and with technical staff working in local languages and continually lured by higher-paying work in the private sector. Technical solutions must be easy to adopt and freely available.
Research Data Curation _ Grad Humanities ClassAaron Collie
This document discusses best practices for research data curation and management. It covers topics such as data storage, file organization, documentation, sharing, and archiving. Effective data management practices include making backups in multiple locations, using logical file naming conventions and organization schemes, documenting projects, processes, and data, publishing and sharing data when appropriate, and archiving data for long-term preservation and access. Proper data management ensures that valuable research data is organized, preserved, and accessible to enable future research and verification of results.
This document provides an overview of natural language processing (NLP). It discusses how NLP systems have achieved shallow matching to understand language but still have fundamental limitations in deep understanding that requires context and linguistic structure. It also describes technologies like speech recognition, text-to-speech, question answering and machine translation. It notes that while text data may seem superficial, language is complex with many levels of structure and meaning. Corpus-based statistical methods are presented as one approach in NLP.
This tutorial discusses phrases and their use in natural language processing tasks. It defines phrases as word combinations that can express ideas not obvious from individual words alone. Unsupervised methods like mutual information and supervised techniques are used to learn good phrases from text. Phrases are useful for tasks such as named entity recognition, sentiment analysis, and solving analogies. Current research focuses on evaluating learned phrases and developing unsupervised phrase learning models.
The document provides an overview of unit 2.4 which introduces students to basic concepts in bioinformatics and databases. The objectives are to understand relational databases, major online biological databases, and how to extract data from databases. It also discusses challenges with large genomic data sets and how bioinformatics can help make sense of such data through databases, algorithms, and computational approaches.
This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
Similar to Natural Language Data Management and Interfaces: Recent Development and Open Challenges (20)
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
hematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Phenomics assisted breeding in crop improvementIshaGoswami9
As the population is increasing and will reach about 9 billion upto 2050. Also due to climate change, it is difficult to meet the food requirement of such a large population. Facing the challenges presented by resource shortages, climate
change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional
genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding
the complex characteristics of multiple gene, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can
be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus,
high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes
during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology,
and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxRASHMI M G
Abnormal or anomalous secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Exposé invité Journées Nationales du GDR GPL 2024
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes
Natural Language Data Management and Interfaces: Recent Development and Open Challenges
1. Natural Language Data
Management and Interfaces
Recent Development and Open Challenges
Davood Rafiei
University of Alberta
Yunyao Li
IBM Research - Almaden
Chicago
2017
2. “If we are to satisfy the needs of
casual users of data bases, we
must break through the barriers
that presently prevent these users
from freely employing their native
languages"
Ted Codd, 1974
3. Employing Native Languages
•As data for describing things and relationships
• Otherwise a huge volume of data will end up outside
databases
•As an interface to databases
• Otherwise we limit database use to professionals
6. Outline of Part I
•The ubiquity of natural language data
•A few areas of application
•Challenges
•Areas of progress
•Querying natural language text
•Transforming natural language text
•Integration
9. Corporate Data
Merrill Lynch rule
“unstructured data
comprises the vast
majority of data found
in an organization.
Some estimates run as
high as 80%.”
Unstructured data
10. Scientific Literature
Impact of less invasive treatments including sclerotherapy with a new agent and
hemorrhoidopexy for prolapsing internal hemorrhoids.
Tokunaga Y, Sasaki H. (Int Surg. 2013)
Abstract
Conventional hemorrhoidectomy is applied for the treatment of prolapsing internal
hemorrhoids. Recently, less-invasive treatments such as sclerotherapy using aluminum
potassium sulphate/tannic acid (ALTA) and a procedure for prolapse and hemorrhoids (PPH)
have been introduced. We compared the results of sclerotherapy with ALTA and an improved
type of PPH03 with those of hemorrhoidectomy. Between January 2006 and March 2009, we
performed hemorrhoidectomy in 464 patients, ALTA in 940 patients, and PPH in 148 patients
with second- and third-degree internal hemorrhoids according to the Goligher's classification.
The volume of ALTA injected into a hemorrhoid was 7.3 ± 2.2 (mean ± SD) mL. The duration
of the operation was significantly shorter in ALTA (13 ± 2 minutes) than in hemorrhoidectomy
(43 ± 5 minutes) or PPH (32 ± 12 minutes). Postoperative pain, requiring intravenous pain
medications, occurred in 65 cases (14%) in hemorrhoidectomy, in 16 cases (1.7%) in ALTA,
and in 1 case (0.7%) in PPH. The disappearance rates of prolapse were 100% in
hemorrhoidectomy, 96% in ALTA, and 98.6% in PPH. ALTA can be performed on an
outpatient basis without any severe pain or complication, and PPH is a useful alternative
treatment with less pain. Less-invasive treatments are beneficial when performed with care to
avoid complications.
Treatment
No. of patients tried on
Duration
11. News Articles
April 25, 2017 12:48 pm
Loonie hits 14-month low as softwood lumber duties expected to impact jobs
By Ross Marowits The Canadian Press
MONTREAL – The loonie hit a 14-month low on Tuesday at 73.60 cents, the lowest
level since February 2016.
The U.S. Commerce Department levied countervailing duties ranging between 3.02
and 24.12 per cent on five large Canadian producers and 19.88 per cent for all other
firms effective May 1. The duties will be retroactive 90 days for J.D. Irving and
producers other than Canfor, West Fraser, Resolute Forest Products and Tolko.
Anti-dumping duties to be announced June 23 could raise the total to as much as 30
to 35 per cent.
Dias anticipates that 25,000 jobs will eventually be hit, including 10,000
direct jobs and 15,000 indirect ones tied to the sector.
Event
Triggering event
Following events expected
12. Wikipedia
•42 million pages
•Only 2.4 million infobox triplets
•Lots of data not in infobox
Obama was hired in Chicago as director of the Developing Communities
Project, a church-based community organization originally comprising
eight Catholic parishes in Roseland, West Pullman, and Riverdale on
Chicago's South Side.
…
In 1991, Obama accepted a two-year position as Visiting Law and
Government Fellow at the University of Chicago Law School to work on
his first book.
…
From April to October 1992, Obama directed Illinois's Project Vote, a
voter registration campaign…
13. Community QA
•Services such as Yahoo answers, Stack
Overflow, AnswerBag, …
•Data: question and answer pairs
•Want answers to new queries
Q: How to fix auto terminate mac terminal
Two StackOverflow pages returned by Google
- osx - How do I get a Mac “.command” file to automatically quit
after running a shell script?
- OSX - How to auto Close Terminal window after the “exit”
command executed.
16. Challenge – Lack of Schema
treatment                patientCnt  duration (min)  painCases  disappearanceRate (%)
sclerotherapy with ALTA  940         13±2            16         96
PPH03                    148         32±12           1          98.6
hemorrhoidectomy         464         43±5            65         100
•The scientific article shown earlier contains
structured data (as shown) but hard to query
due to the lack of schema
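Once a schema is imposed, the same facts become directly queryable. A minimal sketch using SQLite (the table layout and column names are our own illustration; the values are transcribed from the abstract shown earlier):

```python
import sqlite3

# Hypothetical schema for the facts buried in the abstract's prose.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE treatment
               (name TEXT, patientCnt INTEGER, duration TEXT,
                painCases INTEGER, disappearanceRate REAL)""")
con.executemany("INSERT INTO treatment VALUES (?,?,?,?,?)", [
    ("sclerotherapy with ALTA", 940, "13±2", 16, 96.0),
    ("PPH03", 148, "32±12", 1, 98.6),
    ("hemorrhoidectomy", 464, "43±5", 65, 100.0),
])
# Which treatment had the fewest postoperative pain cases?
row = con.execute(
    "SELECT name FROM treatment ORDER BY painCases LIMIT 1").fetchone()
print(row[0])  # PPH03
```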
17. Challenge - Opacity of References
•Anaphora
• “Joe did not interrupt Sue because he was polite”
• “the lion bit the gazelle, because it had sharp teeth”
•Ambiguity of ids
• Does “john” in article A refer to the same “john” in
article B?
•Variations due to spatiotemporal differences
• “police chief” is ambiguous without a
spatiotemporal anchor
18. Challenge - Richness of Semantics
•Semantic relations
•crow ⊆ bird; bird ∩ nonbird = {};
bird ∪ nonbird = U
•Pragmatics
•The meaning depends on the context
•E.g. “Sherlock saw the man with binoculars”
•Textual entailment
•“every dog danced” ⟼ “every poodle moved”
19. Challenge - Correctness of Data
•Incorrect or sarcastic
•“Vladimir Putin is the president of the US’’
•Correct at some point in time (but not now)
•“Barack Obama is the president of the US”
•Correct now
•“Donald Trump is the president of the US”
•Always correct
•“Barack Obama is born in Hawaii”
•“Earth rotates around the sun”
21. System Architecture
[Architecture diagram: natural language text flows along two paths. A
Transform path (Information Extraction, Entity Resolution, Enrichment),
guided by a Knowledge Base and a Domain schema, produces Structured Data
held in a DBMS (queried with SQL) and an RDF store (queried with SPARQL).
A Text store with a Text System supports natural language text queries.
An Integrate step combines the two sides to support rich queries.]
26. Boolean Queries
•TREC legal track 2006-2012
•Retrieve documents as evidence in civil
litigation
•Default search in Quicklaw and Westlaw
•E.g.
( (memory w/2 loss) OR amnesia OR Alzheimer! OR
dementia) AND (lawsuit! OR litig! OR case OR
(tort w/2 claim!) OR complaint OR allegation!)
from TREC 09
Legal track
memory /2 loss
memory /s loss
Note: due to the many variants in natural language,
Boolean queries can become extremely complex
27. Boolean Queries (Cont.)
•Not much use of the grammar
•Except ordering and term distance
•Research issues
•Optimization
• Selectivity estimation for boolean queries
[Chen et al., PODS 2000]
• String selectivity estimation [Jagadish et al.,
PODS 1999], [Chaudhuri et al., ICDE 2004]
•Query evaluation [Broder et al., CIKM 2003]
28. PAT Expressions
[Salminen & Tompa, Acta Linguistica Hungarica 1994]
•A set-at-a-time algebra for text
•Text normalization
•Delimiters mapped to blank, lowercasing, etc.
•Searches make less use of grammar
•Lexical: e.g. “joe”, “bo”..“jo”
•Position: e.g. [20], shift.2 “2010”..“2017”
• The last two characters of the matches
•Frequency: e.g. signif.2 “computer”
• The two most significant terms that start with
“computer”, such as “computer systems”
29. Mind your Grammar [Gonnet and Tompa, VLDB 1987]
•Schema expressed
as a grammar
•Studied in the context
of Oxford English
Dictionary
Word Pos_tag Pr_brit Pr_us Plurals …
Man-trap n
30. Grammar-based Data
•The grammar (when known) allows data
to be represented and retrieved
•Compared to relational data
•Grammar ~ table schema
•Parsed strings (p-strings) ~ table instance
31. Grammar-based Data
(another context)
•Data wrapped in text and html formatting
•Many ecommerce sites with back-end rel.
data
•Grammar often simple
•Schema finding ~ grammar induction
•Input: (a) html pages with wrapped data, (b)
sample/tagged tuples
•Output: a grammar (or a wrapper)
32. Grammar Induction
•Challenge: Regular grammars cannot be
learned from positive samples only [Gold,
Inf. Cont. 1967]
• Many web pages use grammars that are
identifiable in the limit (e.g. [Crescenzi & Mecca,
J. ACM 2004])
•With natural language text
• Context free production rules exist for good
subsets
• Not deterministic (multiple derivations per input)
• The rules are usually complex, less uniform, and
may be ambiguous
33. Text Pattern Queries
•Text modeled as “a sequence of tokens”
•Data wrapped in text patterns
•<name> was born in <year>
•Also referred to as surface text patterns
[Ravichandran and Hovy, ACL 2002]
•Queries ~ text patterns
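The surface-pattern idea can be illustrated with a few lines of Python: instantiate "<name> was born in <year>" as a regular expression and harvest (name, year) pairs from raw text. The sentence and the capitalized-name heuristic below are our own toy assumptions:

```python
import re

# Harvest (name, year) pairs matching the surface pattern
# "<name> was born in <year>" from free text.
text = "Alan Turing was born in 1912. Grace Hopper was born in 1906."
pattern = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\d{4})")
pairs = pattern.findall(text)
print(pairs)  # [('Alan Turing', '1912'), ('Grace Hopper', '1906')]
```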
35. DeWild [Li & Rafiei, SIGIR 2006, CIKM 2009]
•Queries match short text (instead of a whole page)
•Result ranking
•To improve “precision at k”
•Query rewritings
DeWild Query: % is a car manufacturer
36. Rewriting Rules
•Hyponym patterns [Hearst, 1992]
• X such as Y
• X including Y
• Y and other X
•Morphological patterns
• X invents Y
• Y is invented by X
•Specific patterns
• X discovers Y
• X finds Y
• X stumbles upon Y
37. Rewriting Rules in DeWild
# nopos
(.+),? such as (.+)
such (.+) as (.+)
(.+),? especially (.+)
(.+),? including (.+)
->
$1 such as $2 && noun(,$1)
such $1 as $2 && noun(,$1)
$1, especially $2 && noun(,$1)
$1, including $2 && noun(,$1)
$2, and other $1 && noun(,$1)
$2, or other $1 && noun(,$1)
$2, a $1 && noun($1,)
$2 is a $1 && noun($1,)
#pos
N<([^<>]+)>N,? V<(\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<is (\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<are (\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<was (\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<were (\w+)>V by N<([^<>]+)>N
->
$3 $2 $1 && verb($2,,,)
$3 $2 $1 && verb(,$2,,)
$3 $2 $1 && verb(,,$2,)
$3 will $2 $1 && verb($2,,,)
$3 is going to $2 $1 && verb($2,,,)
$1 is $2 by $3 && verb(,,,$2)
$1 was $2 by $3 && verb(,,,$2)
$1 are $2 by $3 && verb(,,,$2)
noun(country, countries) verb(go, goes, went, gone)
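A simplified sketch of applying one hyponym rule from the table above: the pattern "(.+),? such as (.+)" binds a category and its instances in free text, which is how a query like "% is a car manufacturer" can be answered. The sentence and the capitalized-instance heuristic are toy assumptions of our own:

```python
import re

# Match one Hearst-style rule, "(.+),? such as (.+)", against text and
# bind the category (group 1) and the instance list (group 2).
sentence = "car manufacturers such as Toyota and Honda reported record sales"
m = re.search(r"(.+?),? such as ([A-Z]\w+(?: and [A-Z]\w+)*)", sentence)
category, instances = m.group(1), m.group(2).split(" and ")
print(category)   # car manufacturers
print(instances)  # ['Toyota', 'Honda']
```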
38. Queries in DeWild
•Text patterns with some wild cards
•E.g
•% is the prime minister of Canada
•% invented the light bulb
•% invented %
•% is a summer *blockbuster*
39. Indexing for Text Pattern Queries
•Method 1: Inverted index
34,480,000 -> …, <2,1,[10]>, …
is -> <1,5,[4,16,35,58,89]>, …. <2,1,[9]>, …
population -> … <2,1,[8]> <3,1,[10]>, …
Canada -> … <2,1,[7]>, …
Query: Canada population is %
posting format: <docId, tf, [offset list]>
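Method 1 can be sketched in a few lines: build a word-level inverted index (term -> {docId: [offsets]}) and answer the pattern query by joining posting lists on adjacent offsets. The document below is a toy, chosen to mirror the offsets on the slide:

```python
# Toy corpus: docId -> token list (hypothetical document).
docs = {2: "even though Canada population is 34,480,000 people".split()}

# Build the inverted index: term -> {docId: [offsets]}.
index = {}
for doc_id, tokens in docs.items():
    for off, tok in enumerate(tokens):
        index.setdefault(tok, {}).setdefault(doc_id, []).append(off)

def answer(query):
    """Join posting lists on adjacent offsets; '%' matches any token.
    Assumes the pattern does not start with '%'."""
    terms = query.split()
    hits = []
    for doc_id, offs in index.get(terms[0], {}).items():
        for start in offs:
            if all(t == "%" or start + k in index.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.extend(docs[doc_id][start + k]
                            for k, t in enumerate(terms) if t == "%")
    return hits

print(answer("Canada population is %"))  # ['34,480,000']
```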
40. Indexing for Text Pattern Queries (Cont.)
•Method 2: Neighbor index
[Cafarella & Etzioni, WWW 2005]
34,480,000 -> …, <2,1,[(10,is,-)]>, …
is -> …. <2,1,[(9,population,34,480,000)]>, …
population -> … <2,1,[(8,Canada,is)]>, …
Canada -> … <2,1,[(7,though,population)]>, …
Problems: (1) long posting lists, e.g. for “is”, “and”, …
(2) join costs: (#query terms − 1) joins over lists of size |post_list(term_i)|
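The neighbor-index idea can be sketched on a toy sentence: each posting carries the offset plus its left and right neighboring words, so a two-term pattern such as "is %" is answered from the posting list of a single term, with no join:

```python
# Build a neighbor index: term -> [(offset, left word, right word)].
tokens = "even though Canada population is 34,480,000 people".split()
neighbor = {}
for i, tok in enumerate(tokens):
    left = tokens[i - 1] if i > 0 else "-"
    right = tokens[i + 1] if i + 1 < len(tokens) else "-"
    neighbor.setdefault(tok, []).append((i, left, right))

# "is %": read the right neighbor straight off the postings of "is".
print([right for (_, _, right) in neighbor["is"]])  # ['34,480,000']
```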
41. Indexing for Text Pattern Queries (Cont.)
•Method 3: Word Permuterm Index (WPI)
[Chubak & Rafiei, CIKM 2010]
•Based on Permuterm index [Garfield, JAIS 1976]
•Burrows-Wheeler transformation of text [Burrows
& Wheeler, 1994]
•Structures to maintain the alphabet and to
access ranks
42. •E.g. three sentences (lexicographically sorted)
T = $ Rome is a city $ Rome is the capital of Italy $ countries such
as Italy $ ~
•BW-transform
• Find all word-level rotations of T
• Sort rotations
• The vector of the last elements is BW-transform
Word-level Burrows-Wheeler
transformation
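The three steps above can be sketched directly on the running example T. Python's default token comparison happens to reproduce the sorted order of rotations shown on the next slide:

```python
# Word-level BWT: form all word-level rotations of T, sort them,
# and take the last token of each sorted rotation.
T = ("$ Rome is a city $ Rome is the capital of Italy "
     "$ countries such as Italy $ ~").split()
rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
L = [rot[-1] for rot in rotations]
print(" ".join(L))
# ~ city Italy Italy of as $ $ is such the a $ Rome Rome capital countries is $
```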
43. $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~
$ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city
$ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy
$ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy
Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of
Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as
Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $
Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $
a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is
as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such
capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the
city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a
countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $
is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome
is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome
of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital
such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries
the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is
~ $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $
BW-transformation: the last column of the 19 sorted rotations above is L.
44. Traversing L backwards
Prev(i) = Count(L[i]) + Rank_{L[i]}(L, i)
• Count(w): number of elements smaller than w in L
• Rank_w(L, i): occurrences of w in the range L[1..i]
L = ~ city Italy Italy of as $ $ is such the a $ Rome Rome capital countries is $
Prev(8) = Count($) + Rank_$(L, 8) = 0 + 2 = 2
→ the second $ is preceded by “city” in T
Prev(10) = Count(such) + Rank_such(L, 10) = 16 + 1 = 17
→ “such” is preceded by “countries” in T
T = $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~
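The slide's construction and Prev() can be reproduced directly (a naive sketch: a real WPI uses succinct rank structures rather than the linear scans below):

```python
def word_bwt(text):
    """Word-level Burrows-Wheeler transform: sort all word-level
    rotations of the text and keep the last word of each."""
    w = text.split()
    n = len(w)
    rotations = sorted(tuple(w[i:] + w[:i]) for i in range(n))
    return [r[-1] for r in rotations]

def prev_index(L, i):
    """Prev(i) = Count(L[i]) + Rank_{L[i]}(L, i), 1-indexed as on the
    slide: Count(w) counts words in L smaller than w, Rank_w(L, i)
    counts occurrences of w in L[1..i]."""
    w = L[i - 1]
    count = sum(1 for x in L if x < w)
    rank = L[:i].count(w)
    return count + rank

T = "$ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~"
L = word_bwt(T)
print(prev_index(L, 8), L[prev_index(L, 8) - 1])    # 2 city
print(prev_index(L, 10), L[prev_index(L, 10) - 1])  # 17 countries
```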
45. Tree Pattern Queries
•Text is often modeled as a set of ordered, node-labeled trees
•The order usually corresponds to the order of the words in a sentence
•Queries
•Navigational axes: XPath style queries
• E.g. find sentences that include `dog’ as a subject
•Boolean queries
• E.g. Find sentences that contain any of the words w1, w2 or
w3.
•Quantifiers and implications
•Subtree searches
47. Approaches
•Literature on general tree matching
•E.g. ATreeGrep [Shasha et al., PODS 2002]
•Often do not exploit properties of
Syntactically-Annotated Tree (SAT)
• E.g. distinct labels on nodes
•Querying SATs
•Work from the NLP community
• E.g. TGrep2, CorpusSearch, Lpath
•Scan-based, inefficient
•Indexing unique subtrees
48. Indexing Unique Subtrees
[Chubak & Rafiei, PVLDB 2012]
•Keys: unique subtrees of up to a certain size
•Posting lists: structural info. of keys
•Evaluation strategy: break queries into
subtrees, fetch lists and join
•Syntactically annotated trees
•Abundant frequent patterns → small number of keys
•Small average branching factor → small number of postings
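The key/posting idea can be sketched for subtrees of sizes 1 and 2, assuming trees are (label, children) tuples; key shapes for size 3 would be added analogously.

```python
from collections import defaultdict

def index_tree(tree_id, tree, index, path=()):
    """Keys are node labels (size-1 subtrees) and parent/child label
    pairs (size-2 subtrees); postings carry (tree id, node path)."""
    label, children = tree
    index[label].append((tree_id, path))
    for i, child in enumerate(children):
        index[(label, child[0])].append((tree_id, path))
        index_tree(tree_id, child, index, path + (i,))
    return index

# A(B, C(D)) as a nested tuple
t = ("A", [("B", []), ("C", [("D", [])])])
idx = index_tree(1, t, defaultdict(list))
print(idx[("C", "D")])  # [(1, (1,))]
```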
49. Example Subtrees
[Figure: a sample tree with nodes labeled A, B, C, D, and all of its unique subtrees of size 1, size 2 and size 3]
50. Subtree Coding
•Filter-based
• Store only tid for each unique subtree in the
posting list
• No other structural information
•Subtree interval coding
• Store pre, post and order values in a pre-order traversal (for containment relationships) and level (for parent-child relationships)
•Root split coding
• Optimize the storage for subtree interval coding
51. Query Decomposition
[Figure: an example query tree is decomposed into a query cover of smaller subtrees]
•Want an optimal cover to reduce the join cost
•Guarantee an optimal cover for filter-based and
subtree interval coding
• For subtrees of size 6 or less
•Bound the number of joins in a root split cover
52. System Architecture
[Architecture diagram: natural language text is enriched, entity-resolved and run through information extraction, then transformed into an RDF store or a text store; it is integrated with structured data in a DBMS, guided by a knowledge base and a domain schema; the system supports SQL, SPARQL, natural language text queries and other rich queries]
54. Transforming Natural Language Data
•Transformation to a meaning representation (aka semantic parsing) such as
•RDF triples
•Other forms of logical predicates
55. Integrating Natural Language Data
•Tight integration
•Text is maintained by a relational system
•Loose integration
•Text is maintained by a text system
57. Challenges
(with logical inference in general)
•Detecting that
•A crow is a bird,
•A bird is an animal
•Crows can fly but pigs cannot
•Attending an organization relates to education
•A person has a mother and a father but can have many children
•Many more
58. Progress
•Brachman & Levesque, Knowledge
representation & reasoning, 2000.
•RTE entailment challenge
•Since 2005
•Knowledge bases and resources such as Freebase, WordNet, YAGO, DBpedia, …
•Shallow semantic parsers
59. Mapping to DCS Trees [Tian et al., ACL 2014]
•Dependency-based compositional
semantics (DCS) trees [Liang et al., ACL 2011]
•Similar to (and generated from) dependency
parse trees
[DCS tree: love, with a subj edge to Mary and an obj edge to dog]
F1 = love ∩ (Mary[subj] × W[obj])
F2 = animal ∩ π_obj(F1)
F3 = have ∩ (John[subj] × F2[obj])
Does John have an animal that Mary loves?
DCS tree node ~ table
Subtree ~ rel. algebra exp.
60. Logical Inference on DCS
•Some of the axioms
•(R ⊂ S & S ⊂ T) ⇒ R ⊂ T
•R ⊂ S ⇒ πA(R) ⊂ πA(S)
•W ≠ ∅
•Inference ~ deriving new relations using
the tables and the axioms
•Performance on inference problems
•Comparable to systems in FraCaS and
Pascal RTE
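As an illustration of the first axiom, a tiny fixpoint loop can derive new subset facts from known ones (a sketch only, not the authors' inference engine; the facts are hypothetical):

```python
def transitive_closure(facts):
    """Repeatedly apply (R ⊂ S and S ⊂ T) ⇒ R ⊂ T until fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for r, s in list(facts):
            for s2, t in list(facts):
                if s == s2 and (r, t) not in facts:
                    facts.add((r, t))
                    changed = True
    return facts

derived = transitive_closure({("dog", "animal"), ("poodle", "dog")})
print(("poodle", "animal") in derived)  # True
```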
61. Addressing Knowledge Shortage
•Treat DCS tree fragments as paraphrase
candidates
•Establish paraphrases based on
distributional similarity (as in [Lewis &
Steedman, TACL 2013] and others)
[Example: DCS tree fragments for “blame” and “cause” over “tropical storm Debby” and “loss of life”/“death”, aligned as paraphrases]
62. Semantic Parsing using Freebase
[Berant et al., EMNLP 2013]
•Transform questions to freebase derivations
•Learn the mapping from a large collection of
question-answer pairs
63. Approach
•15 million triples (text phrases) from ClueWeb09 mapped to Freebase predicates
• Dates are normalized and text phrases are
lemmatized
• Unary predicates are extracted
• E.g. city(Chicago) from (Chicago, “is a city in”, Illinois)
• 6,299 such unary predicates
• Entity types are checked when there is ambiguity
• E.g. (BarackObama, 1961) is added to “born in” [person,date]
and not to “born in” [person,location]
• 55,081 typed binary predicates
64. Two-Step Mapping
•Alignment
•Map each phrase to a set of logical forms
•Bridging
•Establish a relation between multiple
predicates in a sentence
•E.g. Marriage.Spouse.TomCruise and 2006
will form Marriage.(Spouse.TomCruise ∩
startDate.2006)
The transformation helps to answer questions using Freebase
65. Storage and Querying of Triples
•RDF stores
•Native: Apache Jena TDB, Virtuoso,
Algebraix, 4store, GraphDB, …
•Relational-backed: Jena SDB, C-store, …
•Semantic reasoners
•Open source: Apache Jena, and many more
•A list at Manchester U.
• http://owl.cs.manchester.ac.uk/tools/list-of-reasoners/
67. Challenges
•Structure in text
•Often not known in advance
•Sometimes subjective
•Optimization and plan generation
•Difficult given limited statistics, cost estimates and join dependencies
•Interaction with other systems (e.g. IE,
NER)
•Adds another layer of abstraction
68. Integration Schemes
• Tight integration
• A Rel. Approach to Querying Text [Chu et al., VLDB 2007]
• Loose integration
• Join queries with external text sources [Chaudhuri et al., SIGMOD Record 1995]
• Optimizing SQL queries over text databases [Jain et al., ICDE 2008]
69. A Rel. Approach to Querying Text
[Chu et al., VLDB 2007]
•Each document is stored in a wide table
•Attributes are added as discovered
•Two tables
•Attribute catalog
•Records (one row per document)
•Attributes
•Two documents can have different attributes
•Multiple attributes in a doc can have the same
name
•Only non-null values are stored
71. Operators
•Extract
•Extract desired entities and relationships
•Integrate
•Suggest mappings between attributes
•Cluster
•Group documents into one or more clusters
Operator interaction:
Integrate(address, sent-to) – Extract(city, street, zipcode)
72. Loose Integration of Text
[Chaudhuri et al., SIGMOD Record 1995]
•Documents stored in a text system
•Relational view of documents
[Diagram: the relational database system searches, retrieves and joins against a text system (“mercury”) through a relational view with attributes docid, title, author, abstract, …]
73. Integration Techniques
•Tuple substitution
•Nested loop with the db tuple as the outer
relation
SELECT p.member, p.name, m.docid
FROM projects p, mercury m
WHERE p.sponsor=‘NSF’ AND p.name in m.title
AND p.member in m.author
74. Integration Techniques -- Cont.
•Semi-join
•Suppose the text system can take k terms
•For n members, send n/k queries of the form
(m1 OR m2 OR … OR mk) to the text system
•Probing
•Select a set of terms (how?) from project title
and check their mentions in the text system
•Keep a list of terms (or assignments) that
return empty
•Probing with tuple substitution
•Maintain a cache
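The semi-join batching above can be sketched directly (the member names are made up):

```python
def semi_join_queries(members, k):
    """Batch n member names into ceil(n/k) disjunctive queries of at
    most k terms each, for a text system that accepts k terms per query."""
    return [" OR ".join(members[i:i + k]) for i in range(0, len(members), k)]

members = ["Smith", "Jones", "Chen", "Patel", "Garcia"]
print(semi_join_queries(members, 2))
# ['Smith OR Jones', 'Chen OR Patel', 'Garcia']
```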
75. SQL Queries over Text Databases
[Jain et al., ICDE 2008]
•Information Extraction (IE) modules over
text
•headquarter(company, location)
•ceoOf(company, ceo)
•Relational view of text
•A set of full outer joins over IE modules
•e.g. companies =headquarter ⋈ ceoOf ⋈ …
•SQL queries over relational views
•Want to improve upon “extract-then-query”
76. Problem
•Given a SQL query
•Find execution strategies that meet some
efficiency and quality constraints
•In terms of runtime, precision, recall, …
•On-the-fly IE from text
SELECT company, ceo, location
FROM companies
WHERE location=‘Chicago’
77. Retrieval Strategies
•scan
•Process all documents
•const
•Process documents that contain query
keywords
•promD
•Only process the promising documents for
each IE system (using IE specific keywords)
•promC
•AND the predicates of const and promD
const: chicago
promD: headquarter OR (based AND shares)
promC: chicago AND (headquarter OR (based AND shares))
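The strategies reduce to document filters; a sketch using the slide's keyword predicates (scan simply processes every document; the sample documents are made up):

```python
def const_pred(doc):
    """const: documents containing the query keyword."""
    return "chicago" in doc

def promD_pred(doc):
    """promD: documents matching the IE-specific keyword rule."""
    return "headquarter" in doc or ("based" in doc and "shares" in doc)

def promC_pred(doc):
    """promC: the conjunction of the const and promD predicates."""
    return const_pred(doc) and promD_pred(doc)

docs = ["acme, based in chicago, sold shares",
        "the headquarter moved to austin",
        "weather report for chicago"]
print([d for d in docs if promC_pred(d)])
# ['acme, based in chicago, sold shares']
```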
78. Selecting an Execution Plan
•Stats estimated for each strategy
•# of matching docs, e.g. docs(E, promC, D)
•Retrieval time, e.g. rTime(E, scan, D)
•Cost estimation
•Stratified sampling (with one stratum for PD and another stratum for D−PD)
•For const use both strata
•For promC & promD use PD only
[Diagram: the strategies scan, const, promD and promC over the document set D and its promising subset PD]
80. Anatomy of an NLIDB
[Architecture diagram: the NLQ goes through query understanding, yielding an interpretation; query translation turns the interpretation into queries against the data store; a feedback-generation component (optional) drives user interactions; domain knowledge informs both steps]
81. Query Understanding – Scope of Natural Language Support
[Spectrum: from controlled NLQs to ad-hoc NLQs; grammar complexity, vocabulary complexity, ambiguity, parser errors and query naturalness all grow toward the ad-hoc end]
82. Query Understanding – Stateless and Stateful
•Stateless NLQs: the NLQ engine processes each query against the databases independently; each query must be fully specified.
•Stateful NLQs: the NLQ engine also consults the query history; each query can be partially specified and is processed with regard to previous queries.
83. Query Understanding – Parser Error Handling
Parsers make mistakes.
• News: accuracy of a dependency parser = ~90% [Andor et al., 2016]
• Questions: ~80% [Judge et al., 2006]
Different approaches:
• Ignore: do nothing
• Auto-correction: detect and correct certain parser mistakes (query reformulation, parse tree correction)
• Interactive correction
84. Query Translation – Bridging the Semantic Gaps
• Vocabulary gap
“Bill Clinton” vs. “William Jefferson Clinton”
“IBM” vs. “International Business Machines Corporation”
• Leaky abstraction
• Mismatch between abstraction (e.g. data schema/domain ontology) and user assumptions
“top executives” vs. “persons with title CEO, CFO, CIO, etc.”
• Ambiguity in user queries
• Underspecified queries
“Watson movie” → “Watson” as actor/actress (e.g. Emma Watson), or “Watson” as a movie character (e.g. Dr. Watson in the movie “Holmes and Watson”), …
85. Query Translation – Query Construction
• Approaches
• Machine learning
• Construct formal queries from NLQ interpretations with
deterministic algorithms
• Query
• Formal query languages (e.g. XQuery / SQL)
• Intermediate language independent of underlying data stores
• The same intermediate query for different data stores
87. PRECISE [Popescu et al., 2003, 2004]
• Controlled NLQs based on semantic tractability
[Architecture diagram: dependency parsing feeds a matcher (with lexicon and semantic override); the query generator and equivalence checker produce queries for the RDBMS; feedback generation drives user interactions]
88. PRECISE [Popescu et al., 2003, 2004]
• Semantic Tractability
Database element: a relation, attribute, or value
Token: a set of word stems that matches a database element
Syntactic marker: a term from a fixed set of database-independent terms that make no semantic contribution to the interpretation of the NLQ
Semantically tractable sentence: given a set of database elements E, a sentence S is semantically tractable when its complete tokenization satisfies the following conditions:
• Every token matches a unique database element in E
• Every attribute token attaches to a unique value token
• Every relation token attaches to either an attribute token or a value token
89. PRECISE [Popescu et al., 2003, 2004]
• Explicitly correct parsing errors:
• Preposition attachment
• Preposition ellipsis
What are flights from Boston to Chicago on Monday?
[Parse tree of the sentence, showing how the prepositional phrases attach]
90. PRECISE [Popescu et al., 2003, 2004]
• Explicitly correct parsing errors:
• Preposition attachment
• Preposition ellipsis
What are flights from Boston to Chicago on Monday?
What are flights from Boston to Chicago Monday?
[Parse trees of both sentences; in the second, the preposition before “Monday” is elided]
91. PRECISE [Popescu et al., 2003,2004]
• Mapping parse tree nodes based on lexicon built from database
92. PRECISE [Popescu et al., 2003,2004]
• Addressing ambiguities through lexicon + semantic tractability
• Maximum-flow solution
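PRECISE's max-flow formulation can be illustrated with a small augmenting-path bipartite matcher (a sketch, not the paper's implementation; the token-to-element lexicon below is hypothetical):

```python
def tractable_matching(tokens, candidates):
    """Augmenting-path bipartite matching of query tokens to database
    elements; a complete matching means every token maps to a unique
    element, one of the semantic tractability conditions. `candidates`
    maps each token to the elements it may denote."""
    match = {}  # element -> token

    def augment(tok, seen):
        for el in candidates.get(tok, ()):
            if el not in seen:
                seen.add(el)
                if el not in match or augment(match[el], seen):
                    match[el] = tok
                    return True
        return False

    complete = all(augment(tok, set()) for tok in tokens)
    return complete, {tok: el for el, tok in match.items()}

ok, m = tractable_matching(
    ["systems", "analyst", "jobs"],
    {"systems": ["Area.name", "Job.title"],
     "analyst": ["Job.title"],
     "jobs": ["Job"]})
print(ok, m)
```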
93. PRECISE [Popescu et al., 2003, 2004]
• Addressing ambiguities through lexicon + semantic tractability + user input
NLQ: What are the systems analyst jobs in Austin?
Interpretation 1: Job title = “systems analyst”
Interpretation 2: Area = “systems”; Job title = “analyst”
94. PRECISE [Popescu et al., 2003, 2004]
• 1-to-many translation from interpretations to SQL based on all possible join-paths
NLQ: What are the HP jobs on Unix in a small town?
Interpretations: Job.Description ← What, Job.Company ← ‘HP’, Job.Platform ← ‘Unix’, City.Size ← ‘small’
DB schema: Job(JobID, Description, Company, Platform), City(CityID, Name, State, Size)
SELECT DISTINCT Job.Description
FROM Job, City
WHERE Job.Company = ‘HP’
AND Job.Platform = ‘Unix’
AND Job.JobID = City.CityID
95. PRECISE [Popescu et al., 2003, 2004]
• 1-to-many translation from interpretations to SQL based on all possible join-paths
NLQ: What are the HP jobs on Unix in a small town?
Interpretations: Job.Description ← What, Job.Company ← ‘HP’, Job.Platform ← ‘Unix’, City.Size ← ‘small’
DB schema: Job(JobID, Description, Company, Platform), City(CityID, Name, State, Size), WorkLocation(JobID, CityID), PostLocation(JobID, CityID)
SELECT DISTINCT Job.Description
FROM Job, WorkLocation, City
WHERE Job.Company = ‘HP’
AND Job.Platform = ‘Unix’
AND Job.JobID = WorkLocation.JobID
AND WorkLocation.CityID = City.CityID
SELECT DISTINCT Job.Description
FROM Job, PostLocation, City
WHERE Job.Company = ‘HP’
AND Job.Platform = ‘Unix’
AND Job.JobID = PostLocation.JobID
AND PostLocation.CityID = City.CityID
96. NLPQC [Stratica et al., 2005]
• Controlled NLQs based on predefined rule templates
• No query history
[Architecture diagram: a preprocessor builds semantic rules from the schema and rule templates; at runtime, the Link Parser parses the question, semantic analysis produces an interpretation, and query translation sends queries to the RDBMS]
97. NLPQC [Stratica et al., 2005]
• Build mapping rules for table names and attributes
• Automatically generated using WordNet
• Curated by system administrator
Table name: resource
…
Synonyms: 3 senses of resource
Sense 1: resource
Sense 2: resource
Sense 3: resource, resourcefulness, imagination
Hypernyms: 3 senses of resource
…
Hyponyms: 3 senses of resource
…
…
accept/reject/add
Databases
98. NLPQC [Stratica et al., 2005]
• Mapping parse tree node to data schema and value based on mapping
rules
NLQ: Who is the author of book Algorithms
Mappings: Table name: resource | Table name: resource | resource.default_attribute
99. NLPQC [Stratica et al., 2005]
• Mapping parse tree node to data schema and value based on pre-defined
mapping rules
• Mapping parse trees to SQL statements based on pre-defined rule templates
NLQ: Who is the author of book Algorithms
Mappings: Table name: resource | Table name: resource | resource.default_attribute
Rule template:
SELECT author.name FROM author, resource, resource_author
WHERE resource.title = “Algorithms”
AND resource_author.resource_id = resource.resource_id
AND resource_author.author_id = author.author_id
100. NLPQC [Stratica et al., 2005]
• No explicit ambiguity handling → left to the mapping rules and rule templates
• No parsing error handling → assumes parses are correct
101. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Controlled NLQs based on a pre-defined controlled grammar
[Architecture diagram: a dependency parser feeds a classifier (using classification tables) and a validator (using the controlled grammar); the validated parse tree goes to query translation (using translation patterns and the query history), producing schema-free XQuery over XML DBs; a message generator reports errors and warnings; a domain adapter and knowledge extractor bring in domain knowledge]
102. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Classify parse tree nodes into different types based on classification tables
• Token: a word/phrase that can be mapped into an XQuery component
• Constructs in FLWOR expressions
• Marker: a word/phrase that cannot be mapped into an XQuery component
• Connecting tokens, modifying tokens, pronouns, stopwords
NLQ: What are the states that share a watershed with California?
Classified parse tree:
What are [CMT]
state [NT]
the [MM]
that [MM]
share [CM]
watershed [NT]
a [MM]
with [CM]
California [VT]
103. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Expand scope of NLQ support via domain adaptation
NLQ: What are the states that share a watershed with California?
What are [CMT]
state[NT]
the [MM]
that [MM]
Classified parse tree
share [CM]
watershed [NT]
a [MM]
with [CM]
California [VT]
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
California [VT]
Updated classified parse tree with domain knowledge
where [MM]
river [NT] river [NT]
a [MM] of [CM]
104. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Validate classified parse tree + term expansion + insert implicit nodes
NLQ: What are the states that share a watershed with California?
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
California [VT]
Updated classified parse tree with domain knowledge
where [MM]
river [NT] river [NT]
a [MM] of [CM]
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
Updated classified parse tree post validation
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
Implicit
node
Term expansion to
bridge
terminology gap
105. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (1) Variable binding
NLQ: What are the states that share a watershed with California?
$v1
*
$v2
$v1
*
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
Updated classified parse tree post validation
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
$v3
$v4
106. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (2) Pattern Mapping
NLQ: What are the states that share a watershed with California?
$v1
*
$v2
$v1
*
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
Updated classified parse tree post validation
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
$v3
$v4
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
where $v2 = $v3
where $v4 = “CA”
XQuery fragments
107. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (3) Nesting and grouping
NLQ: What are the states that share a watershed with California?
$v1
*
$v2
$v1
*
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
Updated classified parse tree post validation
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
$v3
$v4
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
where $v2 = $v3
where $v4 = “CA”
XQuery fragments
No aggregation function/qualifier
No nesting/grouping
108. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (3) Nesting and grouping
NLQ: Find all the states whose number of rivers is the same as the number of rivers in California.
$v1
*
$v2
$v1
*
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
$v3
$v4
the number of [FT] the number of [FT]$cv1
$cv2
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “CA”
XQuery fragments
Aggregation function
Nesting and grouping based on $v2
and $v3
109. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (4) Construction full query
NLQ: Find all the states whose number of rivers is the same as the number of rivers in California.
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “CA”
XQuery fragments
for $v1 in doc(“geo.xml”)//state,
$v4 in doc(“geo.xml”)//state
let $vars1 := {
for $v2 in doc(“geo.xml”)//river,
$v5 in doc(“geo.xml”)//state
where mqf($v2,$v5)
and $v5 = $v1
return $v2}
let $vars2 := {
for $v3 in doc(“geo.xml”)//river,
$v6 in doc(“geo.xml”)//state
where mqf($v3,$v6)
and $v6 = $v4
return $v3}
where count($vars1) = count($vars2)
and $v4 = “CA”
return $v1
110. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Support partially specified follow-up queries
• Detect topic switch to refresh query context
NLQ: How about with Texas?
How about [SM]
Validated parse tree
with [CM]
TX [VT]
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “CA”
Query context
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “TX”
Updated query context
(“How about” acts as a substitution marker; only the value in the query context is updated, “CA” → “TX”)
111. NaLIX [Li et al., 2007a, 2007b, 2007c]
• Handle ambiguity
• Ambiguity in terms → user feedback
e.g. “California” can be the name of a state, as well as a city
• Ambiguity in join-paths → leverage schema-free XQuery to find the optimal join-path
e.g. there can be multiple ways for a river to be related to a state
• Error handling
• Does not handle parser errors explicitly
• Interactive UI to encourage NLQ input understandable by the system
112. FREyA [Damljanovic et al., 2013, 2014]
• Supports ad-hoc NLQs, including ill-formed queries
• Direct ontology lookup + parse tree mapping → a certain level of robustness
[Architecture diagram: syntactic parsing and ontology-based lookup produce POCs and OCs; syntactic mapping (with mapping rules) and consolidation resolve them; ambiguous OCs/POCs trigger feedback-generation dialogs; triple generation and query generation emit SPARQL queries against the ontology]
113. FREyA [Damljanovic et al., 2013, 2014]
• Parse tree mapping based on pre-defined heuristic rules
Finds POCs (Potential Ontology Concepts)
• Direct ontology lookup
Finds OCs (Ontology Concepts)
NLQ: What is the highest point of the state bordering Mississippi?
POCs: “the highest point”, “the state”, “Mississippi”
OCs: geo:isHighestPointOf (PROPERTY), geo:State (CLASS), geo:border (PROPERTY), geo:mississippi (INSTANCE)
114. FREyA [Damljanovic et al., 2013, 2014]
• Consolidate POCs and OCs
• If span(POC) ⊆ span(OC) → merge POC and OC
NLQ: What is the highest point of the state bordering Mississippi?
POCs: “the highest point”, “state”, “Mississippi”
OCs: geo:isHighestPointOf (PROPERTY), geo:State (CLASS), geo:border (PROPERTY), geo:mississippi (INSTANCE)
Consolidated OCs: geo:isHighestPointOf (PROPERTY), geo:State (CLASS), geo:border (PROPERTY), geo:mississippi (INSTANCE)
115. FREyA [Damljanovic et al., 2013, 2014]
• Consolidate POCs and OCs
• If span(POC) ⊆ span(OC) → merge POC and OC
• Otherwise, provide suggestions and ask for user feedback
NLQ: Return the population of California
POC: “population”
OC: geo:california (INSTANCE)
Suggestions ranked based on string similarity (Monge-Elkan + Soundex):
1. state population  2. state population density  3. has low point, …
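A sketch of the suggestion-ranking step; FREyA combines Monge-Elkan and Soundex similarity, for which difflib's ratio serves here as a simple stand-in:

```python
from difflib import SequenceMatcher

def rank_suggestions(poc_text, candidates):
    """Order candidate ontology concepts by string similarity to the
    text of an unresolved POC."""
    return sorted(candidates,
                  key=lambda c: SequenceMatcher(None, poc_text, c).ratio(),
                  reverse=True)

ranked = rank_suggestions("population",
                          ["state population", "state population density",
                           "has low point"])
print(ranked[0])  # state population
```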
119. FREyA
[Damljanovic et al., 2013,2014]
• Determine return type
• Result of a SPARQL query is a graph
• Identify answer type to decide the result display
NLQ: Show lakes in Minnesota.
120. FREyA [Damljanovic et al., 2013, 2014]
• Handle ambiguities via user interactions
• Provide suggestions
• Leverage reinforcement learning to improve the ranking of suggestions
• No parser error handling
121. NaLIR [Li and Jagadish, 2014]
• Controlled NLQs based on a predefined grammar
• No query history
[Architecture diagram: a dependency parser feeds the parse tree node mapper and parse tree structure adjuster, both consulting the data index & schema graph; candidate mappings and candidate query trees go through the interactive communicator for user choices; the query tree translator sends queries to the RDBMS]
122. NaLIR [Li and Jagadish, 2014]
• Mapping parse tree nodes to data schema and values based on WUP similarity [Wu and Palmer, 1994]
• Explicitly request user input on ambiguous mappings and interpretations
123. NaLIR [Li and Jagadish, 2014]
• Automatically adjust the parse tree structure into a valid parse tree
[Example: a linear parse tree over ROOT, return, author, paper, more, Bob, VLDB, after, 2000 is adjusted so that “Bob”, “VLDB” and “after 2000” attach under the right nodes]
124. NaLIR [Li and Jagadish, 2014]
• Automatically adjust the parse tree structure into a valid parse tree
• Further rewrite the parse tree into one that is semantically reasonable
[Example: the adjusted parse tree is rewritten by inserting implicit “number of” nodes over “paper”, so the comparison is between the paper counts of the author and of Bob]
125. NaLIR [Li and Jagadish, 2014]
• 1-1 translation from the query tree to SQL
126. Learning NLQ → SQL [Palakurthi et al., 2015]
• Ad-hoc NLQ queries with explicit attribute mentions
• Implicit restriction imposed by the capability of the system itself
[Architecture diagram: in the training phase, Conditional Random Fields are trained on training data; at runtime, the Stanford Parser and the trained attribute classifier produce classified attributes, and query translation uses the entity-relationship schema to send queries to the RDBMS]
127. Learning NLQ → SQL [Palakurthi et al., 2015]
• Explicit attributes: attributes mentioned explicitly in the NLQ
NLQ: List all the grades of all the students in Mathematics
Explicit attributes: grade and student
Implicit attribute: course_name (identified by the classifier)
128. Learning NLQ → SQL [Palakurthi et al., 2015]
• Learn to map explicit attributes in the NLQ to SQL clauses
Features (extracted from training data):
• Token-based: e.g. isSymbol
• Grammatical: POS tags and grammatical relations
• Contextual: tokens preceding or following the current token
• Other: isAttribute, presence of other attributes, trigger words (e.g. “each”)
129. Learning NLQ → SQL [Palakurthi et al., 2015]
• Learn to map explicit attributes in the NLQ to SQL clauses
NLQ: Who are the professors teaching more than 2 courses?
[Attribute mentions in the NLQ are labeled with target SQL clauses: FROM, GROUP BY, HAVING]
130. Learning NLQ → SQL [Palakurthi et al., 2015]
• Construct full SQL queries
• Attribute-clause mapping
• Identify joins based on the ER diagram
• Add missing implicit attributes via concept identification [Srirampur et al., 2014]
NLQ: Who are the professors teaching more than 2 courses?
SELECT professor_name
FROM COURSES, TEACH, PROFESSOR
WHERE course_id = course_teach_id
AND prof_teach_id = prof_id
GROUP BY professor_name
HAVING COUNT(course_name) > 2
(joins identified based on the ER schema)
131. Learning NLQ → SQL [Palakurthi et al., 2015]
• No parsing error handling
• No explicit ambiguity handling
NLQ: What length is the Mississippi?
Implicit attribute: State (wrongly identified)
132. NL2CM [Amsterdamer et al., 2015]
[Architecture diagram: the Stanford Parser feeds an IX Detector (IX = Individual Expression; uses IX patterns and vocabularies), which emits OASIS-QL triples, and a General Query Generator (backed by an ontology), which emits SPARQL triples; after query verification and feedback-generation interactions, the query generator sends the formal query to the OASIS crowd mining engine]
• Controlled NLQs based on predefined types (e.g. no “why” questions)
• Query verification with feedback
• No query history
133. NL2CM [Amsterdamer et al., 2015]
• Map the parse tree with Individual Expression (IX) patterns and vocabularies
• Lexical individuality: individual terms convey a certain meaning
• Participant individuality: participants or agents in the text that are relative to the person addressed by the request
• Syntactic individuality: certain syntactic constructs in a sentence
NLQ: What are the most interesting places near Forest Hotel, Buffalo that we should visit?
134. NL2CM [Amsterdamer et al., 2015]
• Map the parse tree with Individual Expression (IX) patterns and vocabularies
• Lexical individuality: individual terms convey a certain meaning
• Participant individuality: participants or agents in the text that are relative to the person addressed by the request
• Syntactic individuality: certain syntactic constructs in a sentence
NLQ: What are the most interesting places near Forest Hotel, Buffalo that we should visit?
IXs detected: $x interesting (via opinion lexicon), [] visit $x
135. NL2CM [Amsterdamer et al., 2015]
• Map the parse tree with Individual Expression (IX) patterns and vocabularies
• Process the general parts of the query with the FREyA system
• Interact with the user to resolve ambiguities
NLQ: What are the most interesting places near Forest Hotel, Buffalo that we should visit?
IXs: $x interesting (via opinion lexicon), [] visit $x
General parts: $x near Forest_Hotel,_Buffalo,_NY; $x instanceOf Place (via user interaction)
136. NL2CM [Amsterdamer et al., 2015]
• No parsing error handling
• Return error for partially interpretable queries
• SPARQL + OASIS-QL triples → a complete OASIS-QL query
$x interesting
[] visit $x
$x near Forest Hotel,_Buffalo,_NY
$x instanceOf Place
SELECT VARIABLES
WHERE
{$x instanceOf Place.
$x near Forest_Hotel,_Buffalo,_NY}
SATISFYING
{$x hasLabel “interesting”}
ORDER BY DESC(SUPPORT)
LIMIT 5
AND
{ [ ] visit $x}
WITH SUPPORT THRESHOLD = 0.1
137. NL2CM [Amsterdamer et al., 2015]
• Handling ambiguity via user input
138. ATHENA [Saha et al., 2016]
• Permits ad-hoc queries
• No explicit constraints on NLQs
• Implicit limit on the expressivity of NLQs from the query expressivity the system supports (e.g. no nested queries with more than 1 level)
• No query history
[Architecture diagram: the NLQ engine, using a domain ontology and a translation index, produces OQL with NL explanations; query translation generates SQL queries with NL explanations over the databases; either the top-ranked or the user-selected SQL query is executed]
139. ATHENA [Saha et al., 2016]
• Annotate the NLQ into evidences → no explicit parsing
• Handle ambiguity based on the translation index and domain ontology
Translation index over data values (from the databases):
“Alibaba”, “Alibaba Inc”, “Alibaba Incorporated”, “Alibaba Holding”, … → Company.name: Alibaba Inc | Company.name: Alibaba Holding Inc. | Company.name: Alibaba Capital Partners | …
Translation index over metadata (from the domain ontology):
“Investments”, “investment” → PersonalInvestment | InstitutionalInvestment | …
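The lookup can be sketched as a dictionary from indexed phrases to candidate interpretations (entries abbreviated from the slide; the naive substring matching and the `annotate` helper are illustrative assumptions, not ATHENA's implementation):

```python
# Toy translation index: phrase -> candidate data values / metadata
translation_index = {
    "alibaba": ["Company.name: Alibaba Inc.",
                "Company.name: Alibaba Holding Inc.",
                "Company.name: Alibaba Capital Partners"],
    "investments": ["PersonalInvestment", "InstitutionalInvestment"],
}

def annotate(nlq):
    """Return the evidence set: every indexed phrase found in the NLQ,
    together with its candidate interpretations."""
    q = nlq.lower()
    return {p: cands for p, cands in translation_index.items() if p in q}

ev = annotate("Show me restricted stock investments in Alibaba since 2012")
print(sorted(ev))  # ['alibaba', 'investments']
```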
141. ATHENA [Saha et al., 2016]
NLQ: Show me restricted stock investments in Alibaba since 2012 by year
Evidence:
• “investments” (metadata) → PersonalInvestment, InstitutionalInvestment, VCInvestment, …
• “restricted stock” (indexed value) → Holding.type, Transaction.type, InstitutionalInvestment.type, …
• “Alibaba” (indexed value) → Company.name: Alibaba Inc., Company.name: Alibaba Holding Inc., …
• “since 2012”, “year” (time range) → Transaction.reported_year, Transaction.purchase_year, InstitutionalInvestment.reported_year, …
[Interpretation trees: ontology fragments connecting InstitutionalInvestment (is-a Investment) with its type, reported_year and investee to Company via investedIn — directly in one tree, and through unionOf Security issuedBy in the other]
142. ATHENA [Saha et al., 2016]
• Ontology Query Language (OQL)
• Intermediate language over domain ontologies
• Separate query semantics from underlying data stores
• Support common OLAP-style queries
143. ATHENA [Saha et al., 2016]
• 1-1 translation from interpretation tree to OQL
• 1-1 translation from OQL to SQL per relational schema
SELECT SUM(oInstitutionalInvestment.amount), oInstitutionalInvestment.reported_year
FROM InstitutionalInvestment oInstitutionalInvestment,
     InvesteeCompany oInvesteeCompany
WHERE oInstitutionalInvestment.type = ‘restricted_stock’
  AND oInstitutionalInvestment.reported_year >= ‘2012’
  AND oInstitutionalInvestment.reported_year <= Inf
  AND oInvesteeCompany.name IN (‘Alibaba Holdings Ltd.’, ‘Alibaba Inc.’, ‘Alibaba Capital Partners’)
  AND oInstitutionalInvestment.isa.investedIn.unionOf_Security.issuedBy = oInvesteeCompany
GROUP BY oInstitutionalInvestment.reported_year
The OQL query is translated per relational schema: Database 1 → SQL Statement 1, Database 2 → SQL Statement 2, Database 3 → SQL Statement 3, …
144. NLIDBs Summary
Systems | Scope of NLQ Support (Controlled / Ad-hoc*) | Capability (Fixed / Self-improving) | State (Stateless / Stateful) | Parsing Error Handling (Auto-correction / Interactive-correction)
PRECISE
NLPQC
NaLIX
FREyA
NaLIR
NL2CM
ML2SQL
ATHENA — Parsing Error Handling: N/A
* Implicit limitation by system capability
146. Relationship to Semantic Parsing
[Diagram] NLIDB pipeline: NLQ → Query Understanding (with domain knowledge) → interpretations → Query Translation → queries → Data store → query results. Semantic parsing pipeline: NL sentence → Semantic Parser (an ML model learned from training data) → semantic parsing results. Semantic parsing can be used to build NLIDBs.
147. Relationship to Question Answering
[Diagram] NLIDB: NLQ → Query Understanding (with domain knowledge) → interpretations → Query Translation → database queries → Data store → query results. Question Answering: NLQ → Query Understanding (with domain knowledge) → interpretations → Query Translation → document search queries → Document Collection → top results. The two pipelines use similar techniques.
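The parallel drawn above — a shared understand→translate→execute pipeline where only the translator and backend differ — can be sketched with pluggable components (all names and data here are illustrative, not from any of the systems discussed):

```python
# Shared pipeline skeleton: query understanding, then a pluggable
# translator and executor distinguish NLIDB from QA.
def pipeline(nlq, translate, execute):
    interpretation = nlq.lower().split()   # stand-in for query understanding
    query = translate(interpretation)
    return execute(query)

# NLIDB flavor: translate the interpretation to a (toy) SQL string.
sql = pipeline("List cities",
               lambda ix: "SELECT * FROM " + ix[-1],
               lambda q: q)

# QA flavor: translate to a keyword search over a document collection.
docs = ["cities of Canada", "rivers of Europe"]
hits = pipeline("List cities",
                lambda ix: ix[-1],
                lambda kw: [d for d in docs if kw in d])

print(sql)   # SELECT * FROM cities
print(hits)  # ['cities of Canada']
```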
149. Querying Natural Language Data -
Review
•Covered
•Boolean queries
•Grammar-based schema and searches
•Text pattern queries
•Tree pattern queries
• Developments beyond
•Keyword searches as input
•Documents as output
150. Querying Natural Language Data –
Challenges & Opportunities
• Grammar-based schemas
• Promising direction
• Challenges
• Queries w/o knowing the schema
• Many table schemas!
• Overlap and equivalence relationships
• Promising developments
• Paraphrasing relationships between text phrases, tree patterns,
DCS trees, etc.
• Development of resources (e.g. KBs) and shallow semantic
parsers to understand semantics
• Self-improving systems
151. Integrating & Transforming Natural
Language Data - Review
•Covered
•Transformations on text
•Loose and tight integration
•More work on
•Loose integration
•Optimizing query plans
152. Integrating & Transforming Natural Language
Data – Challenges & Opportunities
• Challenges
• Lack of schema, opacity of references, richness of
semantics and correctness of data
• Much to draw inspiration from
• Work on transforming text
• Size and scope of resources for understanding text
• Progress in shallow semantic parsing
• Other areas such as translation and speech recognition
• Opportunities
• Lots of demand for relevant tools
• Natural language text carries more structure than plain text (a
sequence of tokens)
• Strong ties to deductive databases
153. NLIDB: Ideal and Reality
Systems | Scope of NLQ Support (Controlled / Ad-hoc) | Capability (Fixed / Self-improving) | State (Stateless / Stateful) | Parsing Error Handling (Auto-correction / Interactive-correction)
PRECISE *
NLPQC
NaLIX *
FREyA *
NaLIR *
NL2CM
ML2SQL *
ATHENA * — Parsing Error Handling: N/A
Ideal
NLIDB
* Supported at limited extent
154. NLIDB: Ideal and Reality – Cont.
Systems | Ambiguity Handling (Automatic / Interactive) | Query Construction (Rule-based / Machine-learning) | Target Language
PRECISE * | SQL
NLPQC | SQL
NaLIX * | (Schema-free) XQuery
FREyA | SPARQL
NaLIR | SQL
NL2CM | OASIS-QL
ML2SQL * | SQL
ATHENA * | OQL
Ideal
NLIDB
Polystore language
* Supported at limited extent
155. NLIDB: Open Challenges
[Diagram] NLQ → Query Understanding → interpretations → Query Translation → queries → Data store (plus a Document Collection to transform & integrate); Feedback Generation drives user interactions; domain knowledge feeds understanding and translation.
• Query understanding: support ad-hoc NLQs with complex semantics; better handle parser errors; automatically bridge terminology gaps; automatically identify and resolve ambiguity; multilingual/crosslingual support
• Feedback generation: effectively communicate limitations to users; engage the user at the right moment; multi-modal interaction
• Query translation and data stores: polystore support (structured + (un-/semi-)structured data); construct complex queries
• Domain knowledge: construct with minimal development effort
• System-wide: self-improving; personalization; conversational
156. Natural Language DM & Interfaces:
Opportunities
[Diagram] Opportunities lie at the intersection of Database, Human-Computer Interaction, Natural Language Processing, and Machine Learning.
157. References
• [Agichtein and Gravano, 2003] Agichtein, E. and Gravano, L. (2003). Querying text databases for efficient
information extraction. In Proc. of the ICDE Conference, pages 113–124, Bangalore, India.
• [Agrawal et al., 2008] Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. (2008). Scalable ad-hoc entity
extraction from text collections. PVLDB, 1(1):945–957.
• [Amsterdamer et al., 2015] Amsterdamer, Y., Kukliansky, A., and Milo, T. (2015). A natural language interface for
querying general and individual knowledge. PVLDB, 8(12):1430–1441.
• [Andor et al., 2016] Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and
Collins, M. (2016). Globally normalized transition-based neural networks. CoRR, abs/1603.06042.
• [Berant et al., 2013] Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on freebase from
question-answer pairs. In Proc. of the EMNLP Conference, volume 2, page 6.
• [Bertino et al., 2012] Bertino, E., Ooi, B. C., Sacks-Davis, R., Tan, K.-L., Zobel, J., Shidlovsky, B., and
Andronico, D. (2012). Indexing techniques for advanced database systems, volume 8. Springer Science &
Business Media.
• [Broder et al., 2003] Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. (2003). Efficient query
evaluation using a two-level retrieval process. In Proc. of the CIKM Conf., pages 426–434. ACM.
• [Cafarella and Etzioni, 2005] Cafarella, M. J. and Etzioni, O. (2005). A search engine for natural language
applications. In Proc. of the WWW conference, pages 442–452. ACM.
• [Cafarella et al., 2007] Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. (2007). Structured querying of web text
data: A technical challenge. In Proc. of the CIDR Conference, pages 225–234, Asilomar, CA.
• [Cai et al., 2005] Cai, G., Wang, H., and MacEachren, A. M. (2005). Natural conversational interfaces to geospatial databases. Transactions in GIS, 9(2).
• [Chang and Manning, 2014] Chang, A. X. and Manning, C. D. (2014). TokensRegex: Defining cascaded regular expressions over tokens. Technical Report CSTR-2014-02, Department of Computer Science, Stanford University.
158. References – Cont.
• [Chaudhuri et al., 1995] Chaudhuri, S., Dayal, U., and Yan, T. W. (1995). Join queries with external text sources:
Execution and optimization techniques. In ACM SIGMOD Record, pages 410–422, San Jose, California.
• [Chaudhuri et al., 2004] Chaudhuri, S., Ganti, V., and Gravano, L. (2004). Selectivity estimation for string
predicates: Overcoming the underestimation problem. In Proc. of the ICDE Conf., pages 227–238. IEEE.
• [Chen et al., 2000] Chen, Z., Koudas, N., Korn, F., and Muthukrishnan, S. (2000). Selectivity estimation for
boolean queries. In Proc. of the PODS Conf., pages 216–225. ACM.
• [Chu et al., 2007] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. (2007). A relational approach to
incrementally extracting and querying structure in unstructured data. In Proc. of the VLDB Conference.
• [Chubak and Rafiei, 2010] Chubak, P. and Rafiei, D. (2010). Index Structures for Efficiently Searching Natural
Language Text. In Proc. of the CIKM Conference.
• [Chubak and Rafiei, 2012] Chubak, P. and Rafiei, D. (2012). Efficient indexing and querying over syntactically
annotated trees. PVLDB, 5(11):1316–1327.
• [Codd, 1974] Codd, E. (1974). Seven steps to rendezvous with the casual user. In IFIP Working Conference
Data Base Management, pages 179–200.
• [Ferrucci, 2012] Ferrucci, D. A. (2012). Introduction to “This is Watson”. IBM Journal of Research and
Development, 56(3):1.
• [Gonnet and Tompa, 1987] Gonnet, G. H. and Tompa, F. W. (1987). Mind your grammar: a new approach to
modelling text. In Proc. of the VLDB Conference, pages 339–346, Brighton, England.
• [Gyssens et al., 1989] Gyssens, M., Paredaens, J., and Gucht, D. V. (1989). A grammar-based approach
towards unifying hierarchical data models (extended abstract). In Proc. of the SIGMOD Conference, pages
263–272, Portland, Oregon.
159. References – Cont.
• [Jagadish et al., 1999] Jagadish, H., Ng, R. T., and Srivastava, D. (1999). Substring selectivity estimation. In
Proc. of the PODS Conf., pages 249–260. ACM.
• [Jain et al., 2008] Jain, A., Doan, A., and Gravano, L. (2008). Optimizing SQL queries over text databases. In
Proc. of the ICDE Conference, pages 636–645, Cancun, Mexico.
• [Kaoudi and Manolescu, 2015] Kaoudi, Z. and Manolescu, I. (2015). Rdf in the clouds: a survey. The VLDB
Journal, 24(1):67–91.
• [Lewis and Steedman, 2013] Lewis, M. and Steedman, M. (2013). Combining distributional and logical
semantics. Transactions of the Association for Computational Linguistics, 1:179–192.
• [Li and Jagadish, 2014] Li, F. and Jagadish, H. V. (2014). Constructing an interactive natural language interface
for relational databases. PVLDB, 8(1):73–84.
• [Li et al., 2007] Li, Y., Yang, H., and Jagadish, H. V. (2007). Nalix: A generic natural language search
environment for XML data. ACM Trans. Database Systems, 32(4).
• [Liang et al., 2011] Liang, P., Jordan, M. I., and Klein, D. (2011). Learning dependency-based compositional
semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies-Volume 1, pages 590–599. Association for Computational Linguistics.
• [Lin and Pantel, 2001] Lin, D. and Pantel, P. (2001). Dirt - discovery of inference rules from text. In Proc. of the
KDD Conference, pages 323–328.
• [Rafiei and Li, 2009] Rafiei, D. and Li, H. (2009). Data extraction from the web using wild card queries. In Proc.
of the CIKM Conference, pages 1939–1942.
• [Ravichandran and Hovy, 2002] Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a
question answering system. In Proc. of the ACL Conference.
160. References – Cont.
• [Popescu et al., 2004] Popescu, A.-M., Armanasu, A., Etzioni, O., Ko, D., and Yates, A. (2004). Modern natural
language interfaces to databases: Composing statistical parsing with semantic tractability. In Proc. of the COLING Conference.
• [Saha et al., 2016] Saha, D., Floratou, A., Sankaranarayanan, K., Minhas, U. F., Mittal, A. R., and Özcan, F.
(2016). ATHENA: An ontology-driven system for natural language querying over relational data stores. PVLDB,
9(12):1209–1220.
• [Salminen and Tompa, 1994] Salminen, A. and Tompa, F. (1994). PAT expressions: an algebra for text search.
Acta Linguistica Hungarica, 41(1):277–306.
• [Stratica et al., 2005] Stratica, N., Kosseim, L., and Desai, B. C. (2005). Using semantic templates for a natural
language interface to the CINDI virtual library. Data and Knowledge Engineering, 55(1):4–19.
• [Suchanek and Preda, 2014] Suchanek, F. M. and Preda, N. (2014). Semantic culturomics. PVLDB, 7(12):1215–1218.
• [Tague et al., 1991] Tague, J., Salminen, A., and McClellan, C. (1991). A complete model for information
retrieval systems. In Proc. of the SIGIR Conference, pages 14–20, Chicago, Illinois.
• [Tian et al., 2014] Tian, R., Miyao, Y., and Matsuzaki, T. (2014). Logical inference on dependency-based
compositional semantics. In Proc. of the ACL Conference, pages 79–89.
• [Wu and Palmer, 1994] Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In Proc. of the ACL Conference.
• [Valenzuela-Escarcega et al., 2016] Valenzuela-Escarcega, M. A., Hahn-Powell, G., and Surdeanu, M. (2016).
Odin’s runes: A rule language for information extraction. In Proc. of the Language Resources and Evaluation
Conference (LREC).
• [Xu, 2014] Xu, W. (2014). Data-driven approaches for paraphrasing across language variations. PhD thesis,
New York University.
161. Relevant Tutorials
• Semantic parsing
• Percy Liang: “Natural Language Understanding: Foundations and State-
of-the-Art”, ICML 2015.
• Information extraction
• Laura Chiticariu, Yunyao Li, Sriram Raghavan, Frederick Reiss:
“Enterprise information extraction: recent developments and open
challenges.” SIGMOD 2010
• Entity resolution
• Lise Getoor and Ashwin Machanavajjhala: “Entity Resolution for Big
Data”, KDD 2013
Editor's Notes
RDF store: jena, MarkLogic, …
Chen: selectivity of boolean queries
Jagadish, chaudhuri: selectivity of strings
Signif.2 “computer”
The paper appeared in VLDB 1987
roadrunner assumes prefix markup encoding of text
Hyponym [specific] – hypernym [general]
Garfield, permuterm index, 1976
Compressed permuterm index, P. Ferragina, R. Venturini, 2007
A hash table supports access to the alphabet; a wavelet tree supports rank
TGrep2, CorpusSearch: load the corpus to main memory and scan
Mss<6 is based on a bin-packing approximation that is optimal for bin sizes less than 6.
Tropical storm Debby is blamed for death
Tropical storm Debby has caused loss of life
On WebQuestions, a run on training examples sends 600K Sparql queries to freebase.
Result of operators can be stored back in the wide table
Const: applicable for queries with selection conditions
GeoDialogue
Potential Ontology Concepts (POCs) are derived from the syntactic parse tree, and refer to question terms which could be linked to an ontology concept. The syntactic parse tree is generated by the Stanford Parser [12]. Several heuristic rules identify POCs: for example, each NP (noun phrase) or NN (noun) is identified as a POC, and if a noun phrase contains adjectives, these are considered POCs as well. Next, the algorithm iterates through the list of POCs, attempting to map them to OCs.
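The POC heuristics above can be sketched as a tree walk. The snippet below is a toy version, using nested tuples as a stand-in for a Stanford Parser constituency tree (the tree, labels, and helper are illustrative, not FREyA's actual implementation):

```python
# Collect POC candidates from a constituency parse: every NP or NN node,
# plus adjectives (JJ) that occur inside a noun phrase.
def find_pocs(tree, pocs=None):
    if pocs is None:
        pocs = []
    label, children = tree
    if label in ("NP", "NN"):
        pocs.append(label)                 # rule 1: NP/NN nodes are POCs
    if label == "NP":
        for child in children:
            if isinstance(child, tuple) and child[0] == "JJ":
                pocs.append("JJ")          # rule 2: adjectives inside an NP
    for child in children:
        if isinstance(child, tuple):       # skip word leaves (strings)
            find_pocs(child, pocs)
    return pocs

# Parse of the fragment "largest city" as a nested-tuple tree.
tree = ("S", [("NP", [("JJ", ["largest"]), ("NN", ["city"])])])
labels = find_pocs(tree)
print(labels)  # ['NP', 'JJ', 'NN']
```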
Resources to interpret/understand semantics:
Shallow semantic parsers and resources such as FrameNet, Universal Dependencies, and Abstract Meaning Representation (AMR)