Information sources such as relational databases, spreadsheets, XML, JSON, and Web APIs contain a tremendous amount of structured data that can be leveraged to build and augment knowledge graphs. However, they rarely provide a semantic model to describe their contents. Semantic models of data sources represent the implicit meaning of the data by specifying the concepts and the relationships within the data. Such models are the key ingredients to automatically publish the data into knowledge graphs. Manually modeling the semantics of data sources requires significant effort and expertise, and although desirable, building these models automatically is a challenging problem. Most of the related work focuses on semantic annotation of the data fields (source attributes). However, constructing a semantic model that explicitly describes the relationships between the attributes in addition to their semantic types is critical.
We present a novel approach that exploits the knowledge from a domain ontology and the semantic models of previously modeled sources to automatically learn a rich semantic model for a new source. This model represents the semantics of the new source in terms of the concepts and relationships defined by the domain ontology. Given some sample data from the new source, we leverage the knowledge in the domain ontology and the known semantic models to construct a weighted graph that represents the space of plausible semantic models for the new source. Then, we compute the top k candidate semantic models and suggest to the user a ranked list of the semantic models for the new source. The approach takes into account user corrections to learn more accurate semantic models on future data sources. Our evaluation shows that our method generates expressive semantic models for data sources and services with minimal user input. These precise models make it possible to automatically integrate the data across sources and provide rich support for source discovery and service composition. They also make it possible to automatically publish semantic data into knowledge graphs.
A Scalable Approach to Learn Semantic Models of Structured Sources (Mohsen Taheriyan)
Semantic models of data sources describe the meaning of the data in terms of the concepts and relationships defined by a domain ontology. Building such models is an important step toward integrating data from different sources, where we need to provide the user with a unified view of the underlying sources. In this paper, we present a scalable approach to automatically learn semantic models of a structured data source by exploiting the knowledge of previously modeled sources. Our evaluation shows that the approach generates expressive semantic models with minimal user input, and that it is scalable to large ontologies and data sources with many attributes.
A Graph-Based Approach to Learn Semantic Descriptions of Data Sources (Mohsen Taheriyan)
The document presents a new graph-based approach to learning semantic descriptions of data sources. It constructs a graph from existing source models in a domain and leverages the relationships in this graph to generate candidate semantic models for a new source, by mapping its semantic types to the graph and computing minimal subgraphs. This allows the approach to exploit common patterns across source models to automate the learning of semantic descriptions.
Selectivity Estimation for SPARQL Triple Patterns with Shape Expressions (Abdullah Abbas)
We optimize the evaluation of conjunctive SPARQL queries on big RDF graphs by taking advantage of ShEx schema constraints. Our optimization is based on computing ranks for query triple patterns, which indicate their order of execution. We first define a set of well-formed ShEx schemas that possess interesting characteristics for SPARQL query optimization. We then define our optimization method by exploiting information extracted from a ShEx schema. Our experiments show the advantages of applying the optimization on top of an existing state-of-the-art query evaluation system.
The document discusses spatial query languages and SQL. It begins by introducing learning objectives about understanding query languages, using SQL, extending SQL for spatial data, and trends in query languages. It then provides examples of using SQL to query spatial data by making use of the Open Geodata Interchange Standard (OGIS) spatial data types and operations within SQL queries.
Neo4j is a powerful and expressive tool for storing, querying, and manipulating data. However, modeling data as graphs is quite different from modeling data under a relational database. In this talk, Michael Hunger will cover modeling business domains using graphs and show how they can be persisted and queried in Neo4j. We'll contrast this approach with the relational model and discuss the impact on complexity, flexibility, and performance.
Metric Learning for Music Discovery with Source and Target Playlists (Ying-Shu Kuo)
Playlist generation for music exploration: define sets of source songs and target songs, then derive a playlist through metric learning and boundary constraints.
https://github.com/hank5925/mlmdstp
The document discusses applying structural operational semantics (SOS) to model-driven engineering (MDE). It introduces SOS and outlines how it can be used to formally define the semantics of domain-specific languages (DSLs) in MDE. The document presents an example of using SOS to define the semantics of a simple expression language to demonstrate how SOS rules specify the transition relation between language states. The goal is to make SOS useful for MDE by providing a semantic language and tool to simulate models defined using DSLs.
Lect 1. Introduction to Programming Languages (Varun Garg)
A programming language is a set of rules that allows humans to communicate instructions to computers. There are many programming languages because they have evolved over time as better ways to design them have been developed. Programming languages can be categorized by generation or by programming paradigm, such as imperative, object-oriented, logic-based, and functional. Characteristics like writability, readability, reliability, and maintainability are important qualities for programming languages.
A Graph-based Approach to Learn Semantic Descriptions of Data Sources (Pedro Szekely)
The document presents a graph-based approach to learn semantic descriptions of data sources. It describes how known source models are used to construct a graph and generate candidate semantic models for a new source, by mapping learned semantic types and computing minimal trees. Multiple candidate models may be generated from different mappings to the graph. The candidates are then ranked to identify the best semantic description for the new source.
Knowledge Graphs - The Power of Graph-Based Search (Neo4j)
1) Knowledge graphs are graphs that are enriched with data over time, resulting in graphs that capture more detail and context about real world entities and their relationships. This allows the information in the graph to be meaningfully searched.
2) In Neo4j, knowledge graphs are built by connecting diverse data across an enterprise using nodes, relationships, and properties. Tools like natural language processing and graph algorithms further enrich the data.
3) Cypher is Neo4j's graph query language that allows users to search for graph patterns and return relevant data and paths. This reveals why certain information was returned based on the context and structure of the knowledge graph.
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u... (Craig Knoblock)
Over the last few years we have been building domain-specific knowledge graphs for a variety of real-world problems, including creating virtual museums, combating human trafficking, identifying illegal arms sales, and predicting cyber attacks. We have developed a variety of techniques to construct such knowledge graphs, including techniques for extracting data from online sources, aligning the data to a domain ontology, and linking the data across sources. In this talk I will present these techniques and describe our experience in applying Semantic Web technologies to build knowledge graphs for real-world problems.
This document discusses web-scale semantic search and knowledge graphs. It introduces the concept of semantic search, which deals with understanding the meaning of queries, terms, documents and results. This is achieved by linking text to unambiguous concepts or entities. The document then discusses knowledge graphs, which define entities, attributes, types, relations and more, and form the backbone of semantic search. It also covers tasks involved in semantic search like information extraction, entity linking, query understanding and result ranking.
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz (Ioan Toma)
This document describes the LDBC Social Network Benchmark (SNB). The SNB was created to benchmark graph databases and RDF stores using a social network scenario. It includes a data generator that produces synthetic social network data with correlations between attributes. It also defines workloads of interactive, business intelligence, and graph analytics queries. Systems are evaluated based on metrics like query response times. The SNB aims to uncover performance choke points for different database systems and drive improvements in graph processing capabilities.
How the Web can change social science research (including yours) (Frank van Harmelen)
A presentation for a group of PhD students from the Leibniz Institutes (section B, social sciences) to discuss how they could use the Web, and even better the Web of Data, as an instrument in their research.
What Business Innovators Need to Know about Content Analytics (Seth Grimes)
The document discusses content analytics and smart content. It notes that content is exploding and becoming more social, mobile, and streaming. Content analytics involves using various technologies like natural language processing, text mining, and machine learning to extract information and insights from large amounts of unstructured content. The goal is to move from simply finding documents to extracting knowledge and relationships to help drive virtuous cycles and findability. However, the technologies are still imperfect so expectations need to be realistic.
Recent Trends in Semantic Search Technologies (Thanh Tran)
The document discusses semantic search and provides examples of innovative semantic search applications. It describes Peter Mika as a senior research scientist at Yahoo who leads semantic search research. Thanh Tran is introduced as the CEO of Semsolute, a semantic search technologies company. The agenda outlines why semantic search is important given the rise of semantic data on the web. It then defines semantic search and discusses different types of semantic models that can be used. Examples of semantic search applications presented include entity search, factual search, relational search, semantic auto-completion, results aggregation, and conversational search. Technological components like query interpretation, ranking, aggregation, and presentation are also outlined.
This document summarizes a presentation on semantic search given by Peter Mika from Yahoo! Research, Spain and Thanh Tran from Semsolute, Germany. It discusses why semantic search is needed to address complex queries, describes what semantic search is and how it uses semantic models, and provides examples of innovative semantic search applications such as entity search, relational search, and conversational search. It also outlines some of the main technological building blocks used in semantic search systems, including entity recognition, ranking, aggregation, and knowledge graph construction and exploration techniques.
Reflected Intelligence: Lucene/Solr as a self-learning data system (Trey Grainger)
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?
In this presentation, you'll learn how to do just that: how to evolve Lucene/Solr implementations into self-learning data systems which are able to accept user queries, deliver relevance-ranked results, and automatically learn from your users' subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.
Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
Semantics for Big Data Integration and Analysis (Craig Knoblock)
Much of the focus on big data has been on the problem of processing very large sources. There is an equally hard problem of how to normalize, integrate, and transform the data from many sources into the format required to run large-scale analysis and visualization tools. We have previously developed an approach to semi-automatically mapping diverse sources into a shared domain ontology so that they can be quickly combined. In this paper we describe our approach to building and executing integration and restructuring plans to support analysis and visualization tools on very large and diverse datasets.
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente... (Lucidworks)
Trey Grainger gave a presentation about using Lucene/Solr as a self-learning data system through the concept of "reflected intelligence". The presentation covered topics like basic keyword search, taxonomies/entity extraction, query intent, and relevancy tuning. It proposed that by leveraging previous user data and interactions, new data and interactions could be better interpreted to continuously improve the system.
Visualising data: Seeing is Believing - CS Forum 2012 (Richard Ingram)
When patterns and connections are revealed between numbers, content and people that might otherwise be too abstract or scattered to be grasped, we’re able to make better sense of where we are, what it might mean and what needs to be done.
The document discusses methods of automated knowledge extraction from web documents. It describes how knowledge extraction systems work by defining output structures, preprocessing input text, applying extraction methods, and generating output. The key steps involve identifying named entities, their relationships, and addressing issues like ambiguity and accuracy challenges. While automated extraction helps analyze vast online information, limitations regarding accuracy, efficiency and dependency on external resources exist. Combining automated techniques with human oversight may help improve knowledge extraction.
Sherlock: a deep learning approach to semantic data type detection (mayank272369)
The document describes Sherlock, a deep learning model for detecting the semantic type of data columns. It was trained on over 686,000 columns labeled with one of 78 semantic types, which were characterized using 1,588 features describing textual and statistical properties. Sherlock achieved an F1 score of 0.89 for semantic type detection, outperforming machine learning baselines, rule-based matching approaches, and crowdsourced annotations. The authors open source the data, code, and trained model to support future benchmarking of semantic type detection methods.
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014 (James Powell)
The Internet represents the connections among computers and devices, the world wide web is a network of interconnected documents, and the semantic web is the closest thing we have today to a network of interconnected facts. Noticeably absent from these global networks is any sort of open, formal representation for an online global social network. Each users' online presence, and its immediate social network, are isolated and typically only available within the confines of the social networking site that hosts it. Discovery across explicit online social networks and implicit social networks such as those that can be inferred from co-authorship relationships and affiliations is, for all practical purposes, impossible. And yet there are practical and non-nefarious reasons why an organization might be interested in exploring portions of such a network. Outreach is one such interest. Los Alamos National Laboratory (LANL) prototyped EgoSystem to harvest and explore the professional social networks of post doctoral students. The project's goal is to enlist past students and other Lab alumni as ambassadors and advocates for LANL's ongoing mission. During this talk we will discuss the various technologies that support the EgoSystem and demonstrate some of its capabilities.
This document provides an introduction and overview of data mining and the data mining process. It discusses different types of data like transactional data, temporal data, spatial data, and unstructured data. It also covers common data mining tasks like classification, clustering, association rule mining and frequent pattern mining. Additionally, it discusses related fields like statistics, machine learning, databases and visualization and how they differ from data mining. Finally, it provides examples of different data mining models and tasks.
1. The document provides an introduction to data mining, describing what data mining is and the data mining process.
2. It discusses different types of data like transactional data, temporal data, spatial data, and unstructured data. Common data mining tasks are also introduced such as classification, clustering, and frequent pattern mining.
3. The document serves as a high-level overview of key concepts in data mining, the data mining process, different types of data commonly analyzed, and some popular data mining algorithms and tasks.
Wikipedia is the largest user-generated knowledge base. We propose a structured query mechanism, entity-relationship query, for searching entities in the Wikipedia corpus by their properties and inter-relationships. An entity-relationship query consists of an arbitrary number of predicates on desired entities. The semantics of each predicate is specified with keywords. Entity-relationship query searches entities directly over text rather than pre-extracted structured data stores. This characteristic brings two benefits: (1) query semantics can be intuitively expressed by keywords; (2) it avoids the information loss that happens during extraction. We present a ranking framework for general entity-relationship queries and a position-based Bounded Cumulative Model for accurate ranking of query answers. Experiments on INEX benchmark queries and our own crafted queries show the effectiveness and accuracy of our ranking method.
2. Motivation
How to express the intended meaning of data? Explicit semantics is missing in many of the structured sources.

    name           date      city      state  workplace
  1 Fred Collins   Oct 1959  Seattle   WA     Microsoft
  2 Tina Peterson  May 1980  New York  NY     Google

[Figure: annotations over the table highlight its ambiguity: Employee? CEO? Person? Organization? Birth date? Death date? Employment date? Birth city? Work city?]
3. Semantic Model
Map the source to the classes & properties in an ontology.

[Figure: the source table from the previous slide alongside a domain ontology. The ontology contains the classes Person, Organization, Place, City, State, and Event; data properties such as name, birthdate, phone, title, startDate, and postalCode; and object properties such as bornIn, worksFor, livesIn, ceo, location, organizer, nearby, state, and isPartOf. The legend distinguishes object properties, data properties, and subClassOf links.]
4. Semantic Types

[Figure: each column of the source is labeled with a semantic type: name -> (Person, name), date -> (Person, birthdate), city -> (City, name), state -> (State, name), workplace -> (Organization, name).]
5. Relationships

[Figure: the semantic types are connected with object properties from the ontology: Person bornIn City, Person worksFor Organization, and City state State.]
7. Outline
• Semi-automatically building semantic models
• Learning semantic models from known models
• Inferring semantic relations from LOD
• Related Work
• Discussion & Future Work
9. Approach [Knoblock et al, ESWC 2012]

[Figure: pipeline. Inputs: sample data and the domain ontology. Steps: learn semantic types, construct a graph, extract relationships (Steiner tree).]

Implemented in Karma: http://www.isi.edu/integration/karma (@KarmaSemWeb)
10. Example
Source: the people table (name, date, city, state, workplace) from the Motivation slide.
Goal: find a semantic model for the source (map the source to the ontology).

[Figure: the source table and the domain ontology, with the legend for object properties, data properties, and subClassOf links.]
15. Semantic Labeling Approach
• Each semantic type: a label; each data column: a document
• Textual data
  – Compute TF/IDF vectors for documents
  – Compare documents using cosine similarity between TF/IDF vectors
• Numeric data
  – Use statistical hypothesis testing
  – Intuition: the distribution of values in different semantic types is different, e.g., temperature vs. population
• Return top-k suggestions based on the confidence scores
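To make the textual case concrete, here is a minimal sketch assuming scikit-learn; the training columns and type labels are illustrative, not the ones used in the talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One "document" per known semantic type: the concatenated values of all
# training columns labeled with that type (illustrative data).
training_docs = {
    "<Person, name>": "Fred Collins Tina Peterson",
    "<City, name>": "Seattle New York",
    "<Organization, name>": "Microsoft Google",
}

def suggest_types(column_values, k=2):
    """Return the top-k semantic types for a new column, with scores."""
    labels = list(training_docs)
    corpus = [training_docs[l] for l in labels] + [" ".join(column_values)]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    scores = cosine_similarity(tfidf[-1:], tfidf[:-1])[0]
    return sorted(zip(labels, scores), key=lambda p: -p[1])[:k]

print(suggest_types(["Seattle", "Boston"]))  # <City, name> ranks first
```

The numeric case would replace TF/IDF with a distribution test, for example a Kolmogorov-Smirnov test via scipy.stats.ks_2samp.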
16. Construct a Graph
Construct a graph from the semantic types and the ontology.

[Figure: the five semantic types learned for the source, (Person, name), (Person, birthdate), (City, name), (State, name), and (Organization, name), become the starting nodes of the graph.]
20. Refining the Model
Impose constraints on the Steiner tree algorithm:
– Change the weight of user-selected links to ε
– Add the source and target of each selected link to the Steiner nodes

[Figure: the refined model; the slide's example involves the date attribute.]
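As a rough sketch of how these constraints could be wired into a graph library (networkx here; the epsilon value and the function name are assumptions, not Karma's actual code):

```python
import networkx as nx

EPSILON = 1e-6  # assumed small constant; stands in for the slide's ε

def confirm_link(graph, steiner_nodes, source, target):
    """Apply a user-confirmed link before recomputing the Steiner tree."""
    graph[source][target]["weight"] = EPSILON   # selected link becomes nearly free
    steiner_nodes.update({source, target})      # its endpoints must stay in the tree
```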
22. Evaluation

Evaluation dataset                     EDM
# sources                              29
# classes in the ontologies            119
# properties in the ontologies         351
# nodes in the gold standard models    473
# links in the gold standard models    444

• Measured the user effort in Karma to model the sources
• Started with no training data
• User actions
  – Assign/change semantic types
  – Change relationships
23. User Effort

source  columns  Choose Type  Change Link  Time (min)
s1      7        7            1            3
s2      12       5            2            6
s3      4        0            0            2
s4      17       5            7            8
s5      14       4            6            7
s6      18       4            4            7
s7      14       1            4            6
s8      6        0            4            3
…       …        …            …            …
s29     9        2            1            3
Total   329      56           96           128

Avg. time per source: 4.4 minutes
Avg. # user actions per source: 152/29 = 5.2
24. Limitation
• This approach does not learn the changes the user makes to relationships
• The user has to go through the refinement process each time
26. Main Idea
Sources in the same domain often have similar data. Exploit knowledge of known semantic models to hypothesize a semantic model for a new source.
27. Example
Domain: Museum Data
Domain ontologies: EDM, SKOS, FOAF, AAC, ElementsGr2, ORE, DCTerms
Source: Dallas Museum of Art, dma(title, creationDate, name, type)

28. Example
Same domain and ontologies.
Source: National Portrait Gallery, npg(name, artist, year, image)

29. Example
Same domain and ontologies.
Source: Detroit Institute of Art, dia(title, credit, classification, name, imageURL)
Goal: automatically suggest a semantic model for dia
30. Approach [Taheriyan et al, ISWC 2013, ICSC 2014, JWS 2015]

[Figure: pipeline. Inputs: sample data, the domain ontology, and the known semantic models. Steps: learn semantic types, construct a graph, generate candidate models, rank results.]

Implemented in Karma
31. Approach (this slide highlights step 1)

Input:
• Sample data from the new source (S)
• Domain ontologies (O)
• Known semantic models

Steps:
1. Learn semantic types for attributes(S)
2. Construct graph G = (V, E)
3. Generate mappings between attributes(S) and V
4. Generate and rank semantic models

Output: a ranked set of semantic models for S
32. Learn Semantic Types
• Learn the semantic types for each attribute from its data
• Pick the top k semantic types according to their confidence values

dia(title, credit, classification, name, imageURL)

title:          <aac:CulturalHeritageObject, dcterms:title> 0.49;  <aac:CulturalHeritageObject, rdfs:label> 0.28
credit:         <aac:CulturalHeritageObject, dcterms:provenance> 0.83;  <aac:Person, ElementsGr2:note> 0.06
classification: <skos:Concept, skos:prefLabel> 0.58;  <skos:Concept, rdfs:label> 0.41
name:           <aac:Person, foaf:name> 0.65;  <foaf:Person, foaf:name> 0.32
imageURL:       <foaf:Document, uri> 0.47;  <edm:WebResource, uri> 0.40
33. Approach (step 2: construct graph G = (V, E); same pipeline as slide 31)
34. Build Graph G: Add Known Models
• Annotate (tag) nodes and links with the list of supporting models
• Adjust weights based on the number of supporting models

[Figure: the dma model is added to the graph: a CulturalHeritageObject with title and creationDate, linked to an aac:Person (name) and, via hasType, to a Concept (prefLabel); every node and link is tagged dma and the links carry weight 1.]
35. Build Graph G: Add Known Models (continued)

[Figure: the npg model is merged into the same graph, adding a sitter link to an aac:Person (name) and an EuropeanaAggregation connected to a WebResource (image uri) via aggregatedCHO and hasView. Links supported by both dma and npg drop to weight 0.5; links from a single model keep weight 1.]
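A minimal sketch of this bookkeeping, assuming networkx and the reciprocal weighting the figures suggest (weight 1 for one supporting model, 0.5 for two); Karma's actual weighting function may differ:

```python
import networkx as nx

def add_known_model(graph, model_id, triples):
    """Merge one known model, given as (subject, link, object) triples,
    into the alignment graph, tagging links and updating weights."""
    for s, p, o in triples:
        if graph.has_edge(s, o, key=p):
            graph[s][o][p]["models"].add(model_id)
        else:
            graph.add_edge(s, o, key=p, models={model_id})
        # more supporting models -> cheaper (more popular) link
        graph[s][o][p]["weight"] = 1.0 / len(graph[s][o][p]["models"])

g = nx.MultiDiGraph()
add_known_model(g, "dma", [("CulturalHeritageObject", "creator", "aac:Person")])
add_known_model(g, "npg", [("CulturalHeritageObject", "creator", "aac:Person")])
# the shared creator link now carries weight 0.5 and tags {dma, npg}
```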
36. Build Graph G: Add Semantic Types

Candidate semantic types for dia:
title:          <CulturalHeritageObject, title>;  <CulturalHeritageObject, label>
credit:         <CulturalHeritageObject, provenance>;  <Person, note>
classification: <Concept, prefLabel>;  <Concept, label>
name:           <aac:Person, name>;  <foaf:Person, name>
imageURL:       <Document, uri>;  <WebResource, uri>

[Figure: the graph built from the dma and npg models, into which these candidate types will be inserted.]
37. Build Graph G: Add Semantic Types (continued)

[Figure: candidate types that already match nodes in the graph, such as title and label on CulturalHeritageObject, are attached first.]
38. Build Graph G: Add Semantic Types (continued)

[Figure: the remaining candidate types are added, introducing new nodes such as foaf:Person and Document and new data links for provenance, note, prefLabel, label, name, and uri.]
39. Build Graph G: Expand with Paths from Ontology
• Assign a high weight to the links coming from the ontology

[Figure: ontology-only links (weight 1000), such as Aggregation aggregates CulturalHeritageObject and WebResource, subClassOf, page, and isShownAt, are added to the graph.]
40. Approach (step 3: generate mappings between attributes(S) and V; same pipeline as slide 31)
41. Map Source Attributes to the Graph

[Figure: the candidate semantic types of each dia attribute (listed on slide 36) are located among the nodes of the graph.]

42. Map Source Attributes to the Graph (continued)

[Figure: one combination of candidates, one node per attribute, is highlighted as a mapping.]

43. Map Source Attributes to the Graph (continued)

[Figure: a different combination of candidates yields a second mapping; in general, each combination of candidate types produces one mapping of the attributes to graph nodes.]
44. Approach (step 4: generate and rank semantic models; same pipeline as slide 31)
45. Generate Semantic Models
• Compute a Steiner tree for each mapping
  – A minimal tree connecting the nodes of the mapping
  – A customization of the BANKS algorithm [Bhalotia et al., 2002]
• Our algorithm considers both coherence and popularity
• Each tree is a candidate model
• Rank the models based on coherence and cost
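For intuition, here is a sketch of the candidate-generation step using the generic Steiner-tree approximation in networkx; the talk's algorithm is a customized BANKS that additionally accounts for coherence and popularity, so treat this only as the baseline idea. The node names and weights follow the running example:

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

g = nx.Graph()
g.add_edge("CulturalHeritageObject", "aac:Person", weight=0.5)            # creator
g.add_edge("CulturalHeritageObject", "Concept", weight=1.0)               # hasType
g.add_edge("EuropeanaAggregation", "CulturalHeritageObject", weight=1.0)  # aggregatedCHO
g.add_edge("EuropeanaAggregation", "WebResource", weight=1.0)             # hasView

# One mapping of source attributes to graph nodes gives one terminal set;
# the Steiner tree connecting the terminals is one candidate model.
terminals = ["aac:Person", "Concept", "WebResource"]
candidate = steiner_tree(g, terminals, weight="weight")
print(sorted(candidate.edges()))
```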
49. What Is Coherence?

[Figure: three known models over Person, Place, and Event. In s1, an Event has an organizer Person and a location Place; in s2 and s3, a Person is bornIn a Place. In the combined graph, organizer and location (supported only by s1) carry weight 1, while bornIn (supported by s2 and s3) carries weight 0.5. For a new source with semantic types Person, Event, and Place, the top 3 Steiner trees mix these links; the tree using organizer and location together is not the minimal model, but it is more coherent because both links come from the same known model.]
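One plausible way to score coherence, assuming each link is tagged with the set of known models that support it, is the largest fraction of a tree's links attributable to a single known model; the talk's exact definition may differ. A short sketch:

```python
from collections import Counter

def coherence(tree_links):
    """tree_links: list of (source, link, target, supporting_models)."""
    counts = Counter()
    for _, _, _, models in tree_links:
        counts.update(models)
    return max(counts.values()) / len(tree_links) if tree_links else 0.0

tree = [
    ("Event", "organizer", "Person", {"s1"}),
    ("Event", "location", "Place", {"s1"}),
]
print(coherence(tree))  # 1.0: every link comes from the same known model s1
```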
50. Example Mapping

[Figure: one mapping of the dia attributes highlighted in the full graph: title and credit on CulturalHeritageObject, classification on Concept, name on aac:Person, and imageURL on WebResource.]

51. Steiner Tree

[Figure: the Steiner tree computed for that mapping connects CulturalHeritageObject (title, provenance) to aac:Person (name) via creator, to Concept (prefLabel) via hasType, and to WebResource (uri) through the EuropeanaAggregation's aggregatedCHO and hasView links.]
52. Final Model in Karma
Domain: Museum Data
Domain ontologies: EDM, SKOS, FOAF, AAC, ElementsGr2, ORE, DCTerms
Source: Detroit Institute of Art, dia(title, credit, classification, name, imageURL)

[Figure: the final learned model as shown in the Karma interface.]
53. Evaluation

Evaluation dataset                     EDM   CRM
# sources                              29    29
# classes in the ontologies            119   147
# properties in the ontologies         351   409
# nodes in the gold standard models    473   812
# links in the gold standard models    444   785
57. Accuracy

Compute precision and recall between the learned model (sm') and the correct model (sm), where rel(m) is the set of relationship triples in model m:

precision = |rel(sm) ∩ rel(sm')| / |rel(sm')|   (how many of the learned relationships are correct?)
recall    = |rel(sm) ∩ rel(sm')| / |rel(sm)|    (how many of the correct relationships are learned?)

Example:
correct model: <Artwork, creator, Person>, <Artwork, location, Museum>
learned model: <Museum, founder, Person>, <Artwork, location, Museum>
Precision: 0.5, Recall: 0.5
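These formulas reduce to a few lines of set arithmetic; the sketch below reproduces the slide's example, with rel(.) represented as a set of (subject, link, object) triples:

```python
def precision_recall(correct, learned):
    """Precision and recall of a learned model's relationship triples."""
    correct, learned = set(correct), set(learned)
    overlap = correct & learned
    return len(overlap) / len(learned), len(overlap) / len(correct)

correct = {("Artwork", "creator", "Person"), ("Artwork", "location", "Museum")}
learned = {("Museum", "founder", "Person"), ("Artwork", "location", "Museum")}
print(precision_recall(correct, learned))  # (0.5, 0.5)
```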
58. Experiment 1: correct semantic types are given

[Figure: precision and recall versus the number of known semantic models (0 to 28), one plot per evaluation dataset.]
59. Experiment 2
Learn the semantic types and pick the top k candidates.
[Plots (two panels): precision and recall for k=1 and k=4 vs. the number of known semantic models (0 to 28); y-axis from 0.0 to 0.8.]
62. Idea
• There is a huge amount of linked data available in many domains (RDF format)
• Use LOD when there is no known semantic model
• Exploit the relationships between instances
63. Approach
[Taheriyan et al., COLD 2015]
Pipeline: Sample Data → Learn Semantic Types → Construct a Graph → Generate Candidate Models → Rank Results, drawing on the Domain Ontology and on patterns extracted from the Linked Open Data.
66. Pattern Templates
• Many possible templates for patterns
– Example: patterns for classes C1, C2, C3
• Consider only tree patterns
• Limit the length of the patterns
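A sketch of how length-1 patterns (class–property–class links with usage counts) might be mined from a SPARQL endpoint using the standard SPARQLWrapper client; the endpoint URL is a placeholder and the query shape is an assumption, not the talk's actual extraction code:

```python
# A sketch of mining length-1 patterns (c1 --p--> c2, with counts) from a
# SPARQL endpoint over the linked data. Endpoint URL is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?c1 ?p ?c2 (COUNT(*) AS ?n) WHERE {
  ?x ?p ?y .
  ?x a ?c1 .
  ?y a ?c2 .
} GROUP BY ?c1 ?p ?c2 ORDER BY DESC(?n) LIMIT 100
"""

sparql = SPARQLWrapper("http://example.org/sparql")  # placeholder endpoint
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["c1"]["value"], row["p"]["value"], row["c2"]["value"], row["n"]["value"])
```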
67. Evaluation
• Linked data: 3,398,350 triples published by the Smithsonian American Art Museum
• Correct semantic types are given
• Extracted patterns of length 1 and 2

Evaluation dataset                    CRM
# sources                             29
# classes in the ontologies           147
# properties in the ontologies        409
# nodes in the gold standard models   812
# links in the gold standard models   785

background knowledge                            precision   recall   time (s)
domain ontology                                 0.07        0.05     0.17
domain ontology + patterns of length 1          0.65        0.55     0.75
domain ontology + patterns of length 1 and 2    0.78        0.70     0.46
69. Related Work
• Mapping databases and spreadsheets to ontologies
– Mapping languages and tools (D2R, R2RML)
– String similarity between column names and ontology terms
• Understand semantics of Web tables
– Use column headers and cell values to find the labels and relations from a database of labels and relations populated from the Web
• Exploit Linked Open Data (LOD)
– Link the values to the entities in LOD to find the types of the values and their relationships
• Learn semantic definitions of online information sources
– Learns logical rules from known sources
– Only learns descriptions that are conjunctive combinations of known descriptions
71. Discussion
• Contributions
– Semi-automatically model the relationships
– Learn semantic models from previous models
– Infer semantic relationships from LOD
• Provide explicit semantics for a large portion of the LOD
• Help to publish consistent RDF data
• Applications
– VIVO
– Smithsonian American Art Museum
– DIG for DARPA’s Memex project
72. Future Work
• Improve the quality of semantic labeling
– Use LOD to learn semantic types
• Extract longer patterns from LOD
• Publish linked data
– Transform the data to a common vocabulary
– Link entities across different datasets
Editor's Notes
Key ingredient to automate: source discovery, data integration, and publishing knowledge graphs
In the second step, instead of selecting the MST ((v1, v4), (v1, v2), (v2, v3)), if we select another MST such as ((v1, v4), (v1, v2), (v1, v3)), we would get a different Steiner tree as the result.
MST = Minimum Spanning Tree
The exact approximation ratio is 2 * (1 - 1/l), where l is the number of leaves in the optimal tree.
L. Kou, G. Markowsky, and L. Berman, "A Fast Algorithm for Steiner Trees," Acta Informatica, Volume 15, Number 2, pp. 141-145, DOI: 10.1007/BF00288961. Paper link: http://www.springerlink.com/content/v7w86m72174k7267/
Out of the 56 manual assignments, 19 were for specifying semantic types that the system had never seen before.
0.46 action per column
http://www.metmuseum.org/collection/the-collection-online/search?deptids=1&pg=1&ft=french&od=on&ao=on&noqs=true
http://americanart.si.edu/collections/search/artwork/results/index.cfm?rows=10&fq=online_media_type%3A%22Images%22&q=Search+by+Artist%2C+Work%2C+or+Keyword&page=1&start=0&x=62&y=5
http://museum.dma.org:9090/emuseum/view/objects/asitem/2031/0/title-desc?t:state:flow=ff495581-e0d2-4334-bec9-19988f9eeb20
http://www.mfah.org/art/100-highlights/commemorative-head-king/
Leverage relationships in known semantic models to hypothesize relationships for new sources
Semantic Type: <class_uri, property_uri>
Cosine similarity between TF/IDF vectors (for textual data) and statistical hypothesis testing (for numeric data)
Keyword Searching and Browsing in Databases using BANKS
Dataset                                    ds_edm   ds_crm
# data sources                             29       29
# classes in the domain ontologies         119      147
# properties in the domain ontologies      351      409
# nodes in the gold-standard models        473      812
# data nodes in the gold-standard models   331      418
# class nodes in the gold-standard models  142      394
# links in the gold-standard models        444      785
Indianapolis Museum of Art
Dallas Museum of Art
Mapping databases and spreadsheets to ontologies
Mapping languages: D2R [Bizer, 2003], D2RQ [Bizer and Seaborne, 2004], R2RML [Das et al., 2012]
Tools: RDOTE [Vavliakis et al., 2010], RDF123 [Han et al., 2008], XLWrap [Langegger and Wöß, 2009]
String similarity between column names and ontology terms [Polfliet and Ichise, 2010]
Understand semantics of Web tables
Use column headers and cell values to find the labels and relations from a database of labels and relations populated from the Web [Wang et al., 2012] [Limaye et al., 2010] [Venetis et al., 2011]
Exploit Linked Open Data (LOD)
Link the values to the entities in LOD to find the types of the values and their relationships [Muñoz et al., 2013] [Mulwad et al., 2013]
Semantic annotation of Web services
Languages: SAWSDL [Farrell and Lausen, 2007]
Tools: SWEET [Maleshkova et al., 2009]
Annotate input and output parameters [Heß et al., 2003] [Lerman et al., 2006] [Saquicela et al., 2011]
Learn semantic definitions of online information sources [Carman, Knoblock, 2007]
Learns LAV rules from known sources
Only learns descriptions that are conjunctive combinations of known descriptions
Learn semantic models of structured data sources [Taheriyan et al., 2013, 2014, 2015]
Learns complete semantic models from previously modeled sources
A possible direction for future work is to perform user evaluations to measure the quality of the models produced using learning algorithms.
It is much harder for users to model a source from scratch
A large portion of the data in the LOD cloud is published directly from existing databases using tools such as D2R. This conversion process uses the structure of the data as it is organized in the database. There is often no explicit semantic description of the contents of a source and it requires a significant effort if one wants to do more than simply convert a database into RDF
Often, there are multiple correct ways to model the same type of data. For example, users can use Dublin Core and FOAF to model the creator relationship between a person and an object (dcterms:creator and foaf:maker). A community is better served when all the data with the same semantics is modeled using the same classes and properties. Our work encourages consistency because our learning algorithms bias the selection of classes and properties towards those used more frequently in existing models.
linked to DBpedia, the Getty Union List of Artist Names (ULAN)® and the NY Times Linked Data
An et al. [An et al., 2007] generate declarative mapping expressions between two tables with different schemas starting from element correspondences. They create a graph from the conceptual model (CM) of each schema and then suggest plausible mappings by exploring low-cost Steiner trees that connect those nodes in the CM graph that have attributes participating in element correspondences.
Known: Table 1, CM graph 1, marked nodes (nodes in CM1 corresponding to columns), s-tree1: a semantic tree that expresses the correct semantics of Table 1 by connecting its marked nodes in CM1
Known: Table 2, CM graph 2, marked nodes (nodes in CM2 corresponding to columns), s-tree2: a semantic tree that expresses the correct semantics of Table 2 by connecting its marked nodes in CM2
Goal: Find a mapping from Table 1 to Table 2 (from a subgraph of CM1, called CSG1, to a subgraph of CM2, called CSG2)
Method: If CSG2 is known (e.g., it is s-tree2), find the Steiner tree in CM1 connecting the marked nodes, preferring the edges from s-tree1
Strong assumption: we know the semantics of each table (the s-trees)
Use-case example: Table 1 has 10 columns with a large s-tree, and Table 2 has only 3 columns. This approach finds a minimal tree in CM1 (with maximum overlap with s-tree1) that connects the marked nodes of Table 1 corresponding to the marked nodes of Table 2.