1) Knowledge graphs are a representation of knowledge that is useful for modern AI systems. They describe entities and their relationships in a way that facilitates logical reasoning and access to large-scale factual knowledge.
2) Knowledge graphs can be created through manual effort, but recent approaches aim to construct them automatically at large scale, for example by extracting knowledge graphs from Wikipedia and other wikis.
3) Knowledge graphs are useful ingredients for AI, supporting natural language processing, automated reasoning, and machine learning approaches through knowledge graph embeddings that represent entities as vectors. However, the meaning carried by the individual dimensions is lost, which presents challenges.
This presentation shows approaches for knowledge graph construction from Wikipedia and other Wikis that go beyond the "one entity per page" paradigm. We see CaLiGraph, which extracts entities from categories and listings, as well as DBkWik, which extracts and integrates information from thousands of Wikis.
Using knowledge graphs in data mining typically requires a propositional, i.e., vector-shaped representation of entities. RDF2vec is an example for generating such vectors from knowledge graphs, relying on random walks for extracting pseudo-sentences from a graph, and utilizing word2vec for creating embedding vectors from those pseudo-sentences. In this talk, I will give insights into the idea of RDF2vec, possible application areas, and recently developed variants incorporating different walk strategies and training variations.
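To make that pipeline concrete, the following is a minimal, illustrative sketch of the RDF2vec idea on a handful of made-up triples, using gensim's word2vec: the triples, walk depth, walk count, and hyperparameters are arbitrary choices for illustration, not the settings of the actual RDF2vec implementation.

```python
# Minimal sketch of the RDF2vec idea: random walks over a triple set,
# then word2vec over the resulting pseudo-sentences (toy data, illustrative only).
import random
from gensim.models import Word2Vec

triples = [
    ("The_Downward_Spiral", "artist", "Nine_Inch_Nails"),
    ("Nine_Inch_Nails", "member", "Trent_Reznor"),
    ("The_Downward_Spiral", "genre", "Industrial_Rock"),
]

# adjacency: subject -> list of (predicate, object)
adj = {}
for s, p, o in triples:
    adj.setdefault(s, []).append((p, o))

def random_walk(start, depth=4):
    """One pseudo-sentence: alternating entity and predicate tokens."""
    walk, node = [start], start
    for _ in range(depth):
        neighbours = adj.get(node, [])
        if not neighbours:
            break
        p, o = random.choice(neighbours)
        walk += [p, o]
        node = o
    return walk

walks = [random_walk(e) for e in adj for _ in range(10)]
model = Word2Vec(sentences=walks, vector_size=50, window=5, min_count=1, sg=1)
print(model.wv["The_Downward_Spiral"][:5])  # embedding vector for an entity
```

In a real setting, the walks would be extracted from a large knowledge graph such as DBpedia, and the resulting vectors would be fed into a downstream learning task.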
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati... (Heiko Paulheim)
Knowledge Graphs are often used as a symbolic representation mechanism for representing knowledge in data intensive applications, both for integrating corporate knowledge as well as for providing general, cross-domain knowledge in public knowledge graphs such as Wikidata. As such, they have been identified as a useful way of injecting background knowledge in data analysis processes. To fully harness the potential of knowledge graphs, latent representations of entities in the graphs, so-called knowledge graph embeddings, show superior performance, but sacrifice one central advantage of knowledge graphs, i.e., the explicit symbolic knowledge representation. In this talk, I will shed some light on the usage of knowledge graphs and embeddings in data analysis, and give an outlook on research directions which aim at combining the best of both worlds.
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block (Heiko Paulheim)
Starting with Cyc in the 1980s, the collection of general knowledge in machine-interpretable form has been considered a valuable ingredient in intelligent and knowledge-intensive applications. Notable contributions in the field include the Wikipedia-based datasets DBpedia and YAGO, as well as the collaborative knowledge base Wikidata. Since Google coined the term in 2012, such collections are most often referred to as knowledge graphs. Besides such open knowledge graphs, many companies have started using corporate knowledge graphs as a means of information representation.
In this talk, I will look at two ongoing projects related to the extraction of knowledge graphs from Wikipedia and other Wikis. The first new dataset, CaLiGraph, aims at the generation of explicit formal definitions from categories, and the extraction of new instances from list pages. In its current release, CaLiGraph contains 200k axioms defining classes, and more than 7M typed instances. In the second part, I will look at the transfer of the DBpedia approach to a multitude of arbitrary Wikis. The first such prototype, DBkWik, extracts data from Fandom, a Wiki farm hosting more than 400k different Wikis on various topics. Unlike DBpedia, which relies on a large user base for crowdsourcing an explicit schema and extraction rules, as well as on the "one-page-per-entity" assumption, DBkWik has to address various challenges in the fields of schema learning and data integration. In its current release, DBkWik contains more than 11M entities, and has been found to be highly complementary to DBpedia.
Machine Learning with and for Semantic Web Knowledge Graphs (Heiko Paulheim)
Large-scale cross-domain knowledge graphs, such as DBpedia or Wikidata, are some of the most popular and widely used datasets of the Semantic Web. In this paper, we introduce some of the most popular knowledge graphs on the Semantic Web. We discuss how machine learning is used to improve those knowledge graphs, and how they can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.
Knowledge Graphs, such as DBpedia, YAGO, or Wikidata, are valuable resources for building intelligent applications like data analytics tools or recommender systems. Understanding what is in those knowledge graphs is a crucial prerequisite for selecting a Knowledge Graph for a task at hand. Hence, Knowledge Graph profiling - i.e., quantifying the structure and contents of knowledge graphs, as well as their differences - is essential for fully utilizing the power of Knowledge Graphs. In this paper, I will discuss methods for Knowledge Graph profiling, depict crucial differences between the big, well-known Knowledge Graphs, like DBpedia, YAGO, and Wikidata, and take a glance at current developments of new, complementary Knowledge Graphs such as DBkWik and WebIsALOD.
How are Knowledge Graphs created?
What is inside public Knowledge Graphs?
Addressing typical problems in Knowledge Graphs (errors, incompleteness)
New Knowledge Graphs: WebIsALOD, DBkWik
RDF2vec is a method for creating embedding vectors for entities in knowledge graphs. In this talk, I introduce the basic idea of RDF2vec, as well as the latest extensions and developments, such as the use of different walk strategies, the order-aware flavour of RDF2vec, RDF2vec for dynamic knowledge graphs, and more.
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph (Heiko Paulheim)
From a bird's eye view, the DBpedia Extraction Framework takes a MediaWiki dump as input and turns it into a knowledge graph. In this talk, I discuss the creation of the DBkWik knowledge graph by applying the DBpedia Extraction Framework to thousands of Wikis.
The original Semantic Web vision foresees describing entities in a way that the meaning can be interpreted both by machines and humans. Following that idea, large-scale knowledge graphs capturing a significant portion of knowledge have been developed. In the recent past, vector space embeddings of Semantic Web knowledge graphs - i.e., projections of a knowledge graph into a lower-dimensional, numerical feature space (a.k.a. latent feature space) - have been shown to yield superior performance in many tasks, including relation prediction, recommender systems, or the enrichment of predictive data mining tasks. At the same time, those projections describe an entity as a numerical vector, without any semantics attached to the dimensions. Thus, embeddings are as far from the original Semantic Web vision as can be. As a consequence, the results achieved with embeddings - as impressive as they are in terms of quantitative performance - are most often not interpretable, and it is hard to obtain a justification for a prediction, e.g., an explanation why an item has been suggested by a recommender system. In this paper, we make a claim for semantic embeddings and discuss possible ideas towards their construction.
Data-driven Joint Debugging of the DBpedia Mappings and Ontology (Heiko Paulheim)
DBpedia is a large-scale, cross-domain knowledge graph extracted from Wikipedia. For the extraction, crowd-sourced mappings from Wikipedia infoboxes to the DBpedia ontology are utilized. In this process, different problems may arise: users may create wrong and/or inconsistent mappings, use the ontology in an unforeseen way, or change the ontology without considering all possible consequences. In this paper, we present a data-driven approach to discover problems in mappings as well as in the ontology and its usage in a joint, data-driven process. We show both quantitative and qualitative results about the problems identified, and derive proposals for altering mappings and refactoring the DBpedia ontology.
Knowledge graphs are used in various applications and have been widely analyzed. A question that is not very well researched is: what is the price of their production? In this paper, we propose ways to estimate the cost of those knowledge graphs. We show that the cost of manually curating a triple is between $2 and $6, and that automatically created knowledge graphs are cheaper by a factor of 15 to 150 (i.e., 1c to 15c per statement). Furthermore, we advocate for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
Fast Approximate A-box Consistency Checking using Machine Learning (Heiko Paulheim)
Ontology reasoning is typically a computationally intensive operation. While soundness and completeness of results is required in some use cases, for many others, a sensible trade-off between computation effort and correctness of results makes more sense. In this paper, we show that it is possible to approximate a central task in reasoning, i.e., A-box consistency checking, by training a machine learning model which approximates the behavior of a reasoner for a specific ontology. On four different datasets, we show that such learned models consistently achieve an accuracy above 95% at less than 2% of the runtime of a reasoner, using a decision tree with no more than 20 inner nodes. For example, this allows for validating 293M Microdata documents against the schema.org ontology in less than 90 minutes, compared to 18 days required by a state-of-the-art ontology reasoner.
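As a rough sketch of the idea described above (not the paper's actual features, datasets, or reasoner), approximating consistency verdicts with a small decision tree could look as follows; the feature vectors and labels below are synthetic stand-ins for A-box descriptions and reasoner output.

```python
# Sketch: approximate a reasoner's A-box consistency verdicts with a small decision tree.
# X stands in for feature vectors describing A-boxes (e.g., class/property usage counts),
# y for consistent/inconsistent labels obtained by running a real reasoner on a sample.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(5000, 20))          # synthetic feature vectors
y = (X[:, 0] + X[:, 3] > 9).astype(int)           # synthetic "inconsistent" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_leaf_nodes=21)   # keeps the tree at ~20 inner nodes
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```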
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data (Sören Auer)
Over the past 4 years, the Semantic Web activity has gained momentum with the widespread publishing of structured data as RDF. The Linked Data paradigm has therefore evolved from a practical research idea into a very promising candidate for addressing one of the biggest challenges of computer science: the exploitation of the Web as a platform for data and information integration. To translate this initial success into a world-scale reality, a number of research challenges need to be addressed: the performance gap between relational and RDF data management has to be closed, coherence and quality of data published on the Web have to be improved, provenance and trust on the Linked Data Web must be established, and generally the entrance barrier for data publishers and users has to be lowered. This tutorial will discuss approaches for tackling these challenges. As an example of a successful Linked Data project we will present DBpedia, which leverages Wikipedia by extracting structured information and by making this information freely accessible on the Web. The tutorial will also outline some recent advances in DBpedia, such as the mappings Wiki, DBpedia Live, as well as the recently launched DBpedia benchmark.
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top (Heiko Paulheim)
Large knowledge bases, such as DBpedia, are most often created heuristically due to scalability issues. In the building process, both random as well as systematic errors may occur. In this paper, we focus on finding systematic errors, or anti-patterns, in DBpedia. We show that by aligning the DBpedia ontology to the foundational ontology DOLCE-Zero, and by combining reasoning and clustering of the reasoning results, errors affecting millions of statements can be identified at a minimal workload for the knowledge base designer.
Knowledge graph embeddings are a mechanism that projects each entity in a knowledge graph to a point in a continuous vector space. It is commonly assumed that those approaches project two entities close to each other if they are similar and/or related. In this talk, I take a closer look at the roles of similarity and relatedness with respect to knowledge graph embeddings, and discuss how the well-known embedding mechanism RDF2vec can be tailored towards focusing on similarity, relatedness, or both.
Linguistic Linked Open Data, Challenges, Approaches, Future Work (Sebastian Hellmann)
Hellmann keynote TKE (2016), Challenges, Approaches and Future Work for Linguistic Linked Open Data (LLOD)
While the Linguistic Linked Open Data (LLOD) Cloud (http://linguistic-lod.org/) has evolved beyond expectations - thanks to the effort of a vibrant community - overall progress has to be seen under a more scrutinizing light.
Initial challenges which have been formulated by Christian Chiarcos, Sebastian Nordhoff and me as early as 2011[1][2] have been discussed extensively in the LDL, MLODE and NLP & DBpedia workshop series and in several W3C community groups. In particular, the LIDER FP7 project (http://www.lider-project.eu/) - originally conceived to tackle these challenges and build a Linguistic Linked Open Data Cloud - rather gave them more shape and uncovered that there is yet quite a long road ahead to solve problems such as proper metadata, contextualisation of knowledge, data quality, hosting, open licensing and provenance, timely updated network links, knowledge integration and interoperability on the largest possible scale - the Web.
The invited talk attempts to give a full account of these abovementioned challenges and presents and critically evaluates pertinent efforts and approaches including evolving standards such as the NLP Interchange Format (NIF)[3][4], DataID[5], SHACL[6], lemon[7] and the LIDER guidelines[8] as well as practical services such as LingHub[9], LODVader[10], RDFUnit[11] (just to mention a few).
As a glimmer of hope, the talk will conclude with the recent efforts of the DBpedia community to coordinate the creation of a public data infrastructure for a large, multilingual, semantic knowledge graph, which is, of course, not a panacea or golden hammer, but a potential step in the right direction to bridge the gap between language and knowledge.
________________
[1] Towards a Linguistic Linked Open Data cloud : The Open Linguistics Working Group (http://www.atala.org/IMG/pdf/Chiarcos-TAL52-3.pdf ) Christian Chiarcos, Sebastian Hellmann, and Sebastian Nordhoff. TAL 52(3):245 - 275 (2011)
[2] Linked Data in Linguistics. Representing Language Data and Metadata (http://www.springer.com/computer/ai/book/978-3-642-28248-5 ) Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann (Eds.). Springer, Heidelberg, (2012)
[3] http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core
[4] https://www.w3.org/community/ld4lt/
[5] http://wiki.dbpedia.org/projects/dbpedia-dataid
[6] http://w3c.github.io/data-shapes/shacl/
[7] https://www.w3.org/2016/05/ontolex/
[8] http://www.lider-project.eu/guidelines
[9] http://linghub.lider-project.eu/
[10] http://lodvader.aksw.org/
[11] http://aksw.org/Projects/RDFUnit
Linked Open Data Publications through Wikidata & Persistent Identification... (PACKED vzw)
In order for museums to truly reap the benefits of publishing their collections online in a sustainable way, PACKED vzw presents the results of its Linked open data project as a best practice guide for the Flemish heritage sector.
Observations on Annotations – From Computational Linguistics and the World Wi... (Georg Rehm)
Georg Rehm. Observations on Annotations – From Computational Linguistics and the World Wide Web to Artificial Intelligence and back again. Annotation in Scholarly Editions and Research: Function – Differentiation – Systematization, University of Wuppertal, Germany. February 20-22, 2019. Invited keynote talk.
A summary of DBpedia's History and a detailed analysis of challenges and solutions.
We show how the Linked Data Cloud evolved around DBpedia and also what problems we and other data projects encountered. We included a section on the new solutions that will lead DBpedia into a bright future.
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspective (Peter Löwe)
Digital audiovisual content has become an important communication channel in Science. The TIB|AV-Portal for audiovisual scientific-technical information meets the requirements to preserve such content and to provide innovative services for search and retrieval. Quality checked audiovisual content from Open Source Geoinformatics communities is constantly being acquired for the portal as a part of TIB's mission to preserve relevant content in applied computer sciences for science, industry, and the general public.
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ... (Heiko Paulheim)
In the past years, sophisticated methods for extracting knowledge graphs from Wikipedia, like DBpedia, YAGO, and CaLiGraph, have been developed. In this talk, I revisit some of these methods and examine if and how they can be replaced by prompting a large language model like ChatGPT.
Weakly Supervised Learning for Fake News Detection on Twitter (Heiko Paulheim)
The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward, binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this noisy, inaccurate dataset, it is possible to detect fake news with an F1 score of up to 0.9.
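A minimal sketch of the weak-supervision idea follows (toy data, not the paper's pipeline or features): tweets are labeled by the trustworthiness of their source, a text classifier is trained on those noisy labels, and the same model is then applied to the actual fake/non-fake target.

```python
# Sketch of weak supervision: label training tweets by source trustworthiness,
# then reuse the trained classifier for fake/non-fake prediction (toy data only).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["official report confirms new figures",
          "SHOCKING: you won't believe this cure",
          "study published in peer-reviewed journal",
          "they don't want you to know the truth"]
source_label = [0, 1, 0, 1]   # 0 = trustworthy source, 1 = untrustworthy source (noisy proxy labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, source_label)

# At prediction time, the same model is applied to the actual target: fake vs. non-fake tweets.
print(clf.predict(["you won't believe what happened next"]))
```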
Combining Ontology Matchers via Anomaly Detection (Heiko Paulheim)
In ontology alignment, there is no single best performing matching algorithm for every matching problem. Thus, most modern matching systems combine several base matchers and aggregate their results into a final alignment. This combination is often based on simple voting or averaging, or uses existing matching problems for learning a combination policy in a supervised setting. In this paper, we present the COMMAND matching system, an unsupervised method for combining base matchers, which uses anomaly detection to produce an alignment from the results delivered by several base matchers. The basic idea of our approach is that in a large set of potential mapping candidates, the scarce actual mappings should be visible as anomalies against the majority of non-mappings. The approach is evaluated on different OAEI datasets and shows a competitive performance with state-of-the-art systems.
Gathering Alternative Surface Forms for DBpedia Entities (Heiko Paulheim)
Wikipedia is often used as a source of surface forms, or alternative reference strings for an entity, required for entity linking, disambiguation, or coreference resolution tasks. Surface forms have been extracted in a number of works from Wikipedia labels, redirects, disambiguations, and anchor texts of internal Wikipedia links, which we complement with anchor texts of external Wikipedia links from the Common Crawl web corpus. We tackle the problem of the quality of Wikipedia-based surface forms, which has not been raised before. We create a gold standard for the dataset quality evaluation, which reveals the surprisingly low precision of the Wikipedia-based surface forms. We propose filtering approaches that allow boosting the precision from 75% to 85% for a random entity subset, and from 45% to more than 65% for the subset of popular entities. The filtered surface form dataset as well as the gold standard are made publicly available.
Mining the Web of Linked Data with RapidMiner (Heiko Paulheim)
Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.
Data Mining with Background Knowledge from the Web - Introducing the RapidMin... (Heiko Paulheim)
Many data mining problems can be solved better if more background knowledge is added: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper, we introduce the RapidMiner Linked Open Data Extension, which can extend a dataset at hand with additional attributes drawn from the Linked Open Data (LOD) cloud, a large collection of publicly available datasets on various topics. The extension contains operators for linking local data to open data in the LOD cloud, and for augmenting it with additional attributes. In a case study, we show that the prediction error of car fuel consumption can be reduced by 50% by adding additional attributes, e.g., describing the automobile layout and the car body configuration, from Linked Open Data.
Detecting Incorrect Numerical Data in DBpedia (Heiko Paulheim)
DBpedia is a central hub of Linked Open Data (LOD). Being based on crowd-sourced contents and heuristic extraction methods, it is not free of errors. In this paper, we study the application of unsupervised numerical outlier detection methods to DBpedia, using the Interquartile Range (IQR), Kernel Density Estimation (KDE), and various dispersion estimators, combined with different semantic grouping methods. Our approach reaches 87% precision, and has led to the identification of 11 systematic errors in the DBpedia extraction framework.
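To illustrate the simplest of these detectors, here is a small IQR-based outlier check on made-up numeric property values for one semantic group; the actual approach combines several detectors and grouping strategies.

```python
# Sketch of IQR-based outlier detection for a numeric property,
# applied to values of entities in one semantic group; values are made up.
import numpy as np

values = np.array([8.0, 12.0, 10.5, 9.8, 11.2, 950.0, 10.1])  # one implausible value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # -> [950.]
```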
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection (Heiko Paulheim)
Links between datasets are an essential ingredient of Linked Open Data. Since the manual creation of links is expensive at large-scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a higher dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54, and an area under the ROC curve of up to 0.86.
Extending DBpedia with Wikipedia List Pages (Heiko Paulheim)
Thanks to its wide coverage and general-purpose ontology, DBpedia is a prominent dataset in the Linked Open Data cloud. DBpedia's content is harvested from Wikipedia's infoboxes, based on manually created mappings. In this paper, we explore the use of a promising source of knowledge for extending DBpedia, i.e., Wikipedia's list pages. We discuss how a combination of frequent pattern mining and natural language processing (NLP) methods can be leveraged in order to extend both the DBpedia ontology, as well as the instance information in DBpedia. We provide an illustrative example to show the potential impact of our approach and discuss its main challenges.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation (a small CSR illustration follows the notes below).
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
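For readers unfamiliar with the CSR layout referenced in the notes above, the following small sketch illustrates the data structure and the multiply primitive, using scipy as a stand-in for the OpenMP/CUDA code actually benchmarked there; the example graph is made up.

```python
# Illustration of the Compressed Sparse Row (CSR) layout used as the adjacency
# structure in the notes above (scipy stands in for the C++/CUDA implementation).
import numpy as np
from scipy.sparse import csr_matrix

# adjacency matrix of a 4-node graph (row i lists the out-neighbours of node i)
dense = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 0, 0]])
A = csr_matrix(dense)
print(A.indptr)    # row offsets, e.g. [0 2 3 5 5]
print(A.indices)   # column indices of the non-zeros, row by row
x = np.ones(4)
print(A @ x)       # sparse matrix-vector multiply, the "multiply" primitive above
```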
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. AI Ingredients
"OK, Google, when will the final season of Money Heist be on Netflix?"
"The fifth season of Money Heist will be released on September 3rd and December 3rd."
3. AI Ingredients
"Are there any other series by the same creator?"
"Álex Pina has also created White Lines, The Pier, and Locked Up."
4. AI Ingredients
● What does an AI system like Google Assistant need?
– Speech recognition, interpretation, and synthesis
– A knowledge base
– Logical reasoning
– …
● ...there are many other ingredients to AI
– e.g., machine learning, computer vision, ...
5. AI Ingredients
● Four components of AI required to pass a Turing test [1]:
– Natural language processing
– Knowledge representation
– Automated reasoning
– Machine learning
[1] Russell, Norvig: Artificial Intelligence, A Modern Approach
6. It’s an Unequal Field
[1] Google Trends, 2021
7. Human Intelligence Ingredients
● System 1 (think: 2+2)
– Fast
– Intuitive
– Unconscious
– Prone to biases
● System 2 (think: 342+735)
– Slow
– Explicit
– Conscious
– Tedious (hence: lazy)
[1] Kahneman: Thinking, Fast and Slow
8. Fast and Slow AI
● Kahneman for AI [1]
– System 1: ML, statistics, heuristics
– System 2: explicit reasoning, knowledge representation, explanations
● Neuro-symbolic or hybrid AI uses both components
[1] Booch et al. (AAAI 2021): Thinking Fast and Slow in AI
9. Knowledge Graphs for AI
[Figure: knowledge graph fragment answering "OK, Google, when will the final season of Money Heist be on Netflix?" – the series is connected via "has part" edges to its seasons, which carry "release date" values 2020-04-03 and 2021-09-03]
10. Knowledge Graphs for AI
[Figure: the same knowledge graph fragment extended with "creator" and "cast" edges, used to answer the follow-up question "Are there any other series by the same creator?"]
11. AIs on the Shoulders of Giants
● Current knowledge graphs [1]
– Open data
– Millions of entities
– Billions of facts
● Facilitate AIs' access to
– Large-scale factual knowledge (note: not common sense knowledge)
– e.g., for explanations
[1] Heist et al. (2021): Knowledge Graphs on the Web – An Overview
12. Knowledge What?
• Knowledge Graphs on the Web
• Everybody talks about them, but what is a Knowledge Graph?
Journal paper review (Natasha Noy, Google, June 2015):
"Please define what a knowledge graph is – and what it is not."
13. Knowledge Graphs for AI
● Approaches since the 80s
– Cyc (and OpenCyc)
– DBpedia & YAGO
– Wikidata
– Linked Open Data Cloud
14. Knowledge What?
• Working definition [1]: a Knowledge Graph
– mainly describes instances and their relations in the world
• Unlike an ontology
• Unlike, e.g., WordNet
– Defines possible classes and relations in a schema or ontology
• i.e., we know the types of things that are in our graphs
– Has a flexible schema
• Unlike a relational database
– Covers various domains
• Unlike, e.g., Geonames
[1] Paulheim (2017): Knowledge Graph Refinement – A Survey of Approaches and Evaluation Methods
16. Knowledge What?
● Google uses the knowledge graph...
– for augmenting and improving search results
– for integrating data from various sources
● Some numbers [1]
– >5 billion entities
– >500 billion facts (i.e., edges)
[1] https://blog.google/products/search/about-knowledge-graph-and-knowledge-panels/
17. A Bit of History
• Cyc (started by Douglas Lenat in 1984)
– Encyclopedic collection of knowledge
– Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world's knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
18. A Bit of Business
● Does that scale?
– A few back-of-the-envelope calculations [1]
● Cyc contains...
– 21M statements and rules (roughly: "edges")
– $120M development costs
→ $5.71 per statement
● Google's Knowledge Graph
– 500 billion statements
– $2.571 trillion
● (that's ~15 times Google's net revenue in 2020)
[1] Paulheim (2018): How much is a Triple? Estimating the Cost of Knowledge Graph Creation
19. Crowdsourcing Knowledge Graphs
● Freebase (launched 2007)
– Collaborative editing (like Wikipedia)
– Acquired by Google in 2010
– Shut down in 2016
● Wikidata (launched 2012)
– Free, collaborative
– Collects data from different sources
– Today: one of the largest publicly available,
free knowledge graphs
20. The Business Side of Crowdsourcing Knowledge Graphs
● Freebase: created by laymen
– Assumption: adding a statement to Freebase equals adding a sentence to Wikipedia
• English Wikipedia up to April 2011: 41M working hours [1]
• size in April 2011: 3.6M pages, avg. 36.4 sentences each
• Using US minimum wage: $2.25 per sentence
→ $2.25 per statement
● Total cost of creating Freebase: $6.75B
– Acquired by Google for $60-$300M
[1] Geiger, Halfaker (2013): Using edit sessions to measure participation in Wikipedia
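The slide's numbers can be reproduced with a short back-of-the-envelope calculation; note that the hourly wage ($7.25, the 2011 US federal minimum) and the total number of Freebase statements (roughly 3 billion) are assumptions not stated on the slide.

```python
# Back-of-the-envelope reconstruction of the slide's Freebase estimate.
# Assumptions not on the slide: US minimum wage of $7.25/hour (2011),
# and roughly 3 billion statements in Freebase.
hours = 41e6                        # working hours spent on English Wikipedia up to April 2011
wage = 7.25                         # assumed US minimum wage, $/hour
sentences = 3.6e6 * 36.4            # pages * average sentences per page
cost_per_sentence = hours * wage / sentences
print(round(cost_per_sentence, 2))  # ~2.27, rounded on the slide to $2.25
print(2.25 * 3e9)                   # ~6.75e9, i.e. the slide's $6.75B total
```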
21. Towards Automatic Knowledge Graph Construction
● Modern AI needs Massive Amounts of Knowledge
● Manual/crowdsourced creation
– Costly
– Does not work at scale
"OK, Google, when will the final season of Money Heist be on Netflix?"
22. Creating Knowledge Graphs from Wikipedia
● Why start from scratch?
– If we already have (semi-)structured knowledge
at our fingertips
● Structured knowledge in Wikipedia
– Infoboxes (cf. Google’s Knowledge Panels)
– Categories
23. Turning Wikipedia into a Knowledge Graph
● First Observation:
– Many Wikipedia pages are about an entity
– For example: people, places, organizations, works…
24. Turning Wikipedia into a Knowledge Graph
● Further Observations:
– Articles are interlinked
– Some links have explicit meaning
– There are also numbers and dates
25. Turning Wikipedia into a Knowledge Graph
● Putting the Pieces Together
[Figure: example graph – The_Downward_Spiral has artist Nine_Inch_Nails and was released 1994-03-08; Trent_Reznor is a member of Nine_Inch_Nails and producer of The_Downward_Spiral]
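Written out as RDF, the figure's fragment corresponds to a handful of triples; the namespaces below are illustrative placeholders rather than the actual DBpedia URIs, and the edge directions are guessed from the figure.

```python
# The figure's graph fragment written out as RDF triples (illustrative namespaces,
# not the exact DBpedia vocabulary).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/resource/")
P = Namespace("http://example.org/property/")

g = Graph()
g.add((EX.The_Downward_Spiral, P.artist, EX.Nine_Inch_Nails))
g.add((EX.The_Downward_Spiral, P.released, Literal("1994-03-08", datatype=XSD.date)))
g.add((EX.Nine_Inch_Nails, P.member, EX.Trent_Reznor))
g.add((EX.The_Downward_Spiral, P.producer, EX.Trent_Reznor))

print(g.serialize(format="turtle"))
```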
26. Knowledge Graphs based on Wikipedia
● DBpedia: launched 2007
– Mapping infoboxes to node classes (e.g., „Person“, „Album“)
– Mapping infobox keys to edge labels (e.g., „artist“, „member“)
– Crowd-sourced mappings
● YAGO: launched 2008
– Using article categories in Wikipedia as classes
– Mapping infobox keys to edge labels
– Expert-created mappings
– Also contains temporal facts
27. Again: A Bit of Business
● DBpedia: 4.9M LOC, 2.2M LOC for mappings
– software project development: ~37 LOC per hour (Devanbu et al., 1996)
– we use German PhD salaries as a cost estimate
→ 1.85c per statement
● We save by a factor of >100!
28. How Big is Big Enough?
● DBpedia and YAGO
– Constrained by the size (i.e., number of entries) of Wikipedia
– Currently ~6M
● Commonly used recommender system benchmarks have a coverage of… [1]
– ...85% for movies
– ...63% for music artists
– ...31% for books
https://grouplens.org/datasets/
[1] Di Noia et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016
30. Exploiting More Structure in Wikipedia
● Listings and categories also are structures
● They commonly share…
– a type (e.g., musician, book, …) and/or
– a common relation
● member of the same band
● book by the same author
● actor playing in the same film
… e.g., to
● the entity that represents the page
● ...or an entity mentioned somewhere
31. Exploiting More Structure in Wikipedia
● CaLiGraph [1]
– Extracts entities from listings
– Derives definitions from categories and list titles
● e.g., „Death Metal Bands“ → genre = Death_Metal
● 15M entities
– incl. 8M from listings
[1] Heist, Paulheim: Information Extraction from Co-Occurring Similar Entities. In: The Web Conference, 2021
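As a toy illustration of the kind of pattern behind such definitions (not CaLiGraph's actual parsing, which relies on far more sophisticated linguistic and statistical analysis), a category name like "Death Metal Bands" can be turned into a type and relation assertion:

```python
# Toy illustration of deriving an axiom from a category name;
# the real CaLiGraph pipeline works quite differently.
def axiom_from_category(category_name):
    # "Death Metal Bands" -> (type=Band, genre=Death_Metal)
    tokens = category_name.split()
    if tokens[-1].lower() == "bands":
        genre = "_".join(tokens[:-1])
        return {"type": "Band", "genre": genre}
    return None

print(axiom_from_category("Death Metal Bands"))
# -> {'type': 'Band', 'genre': 'Death_Metal'}
```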
33. Beyond Wikipedia
● Regarding DBpedia and YAGO as a black box
– Input: a copy of Wikipedia
– Output: a knowledge graph
● If we have that black box
– Can’t we input any Wiki?
Magic ;-)
34. Beyond Wikipedia
● There are thousands of Wikis
– Plus farms that host thousands themselves
● One of the largest farms: Fandom
35. Beyond Wikipedia
● Integration of Information from Multiple Wikis
● Challenges:
– Duplicate detection
– Few conventions
– Contradictions
[1] Hertling, Paulheim (2020): DBkWik: Extracting and Integrating Knowledge from Thousands of Wikis. Knowledge and Information Systems 62(6): 2169-2190
36. The Story so Far
● We’ve come from AI building blocks:
– Natural language processing
– Knowledge representation
– Automated reasoning
– Machine learning
● How do we put the blocks together?
37. Using Knowledge Graphs as an Ingredient in AI
● Automated Reasoning
– The combination of reasoning and knowledge graphs has a long tradition
– Think of rules on the knowledge graph
– Example: artists on metal albums are metal artists
<Y artist X>, <Y genre Z> → <X genre Z>
[Figure: The_Downward_Spiral has artist Nine_Inch_Nails and genre Metal; the rule infers genre Metal for Nine_Inch_Nails]
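The semantics of the slide's rule can be illustrated on a toy triple set in a few lines; this is only meant to show what the rule infers, not how a real reasoner works.

```python
# Applying the rule <Y artist X>, <Y genre Z> -> <X genre Z> to a toy triple set.
triples = {
    ("The_Downward_Spiral", "artist", "Nine_Inch_Nails"),
    ("The_Downward_Spiral", "genre", "Metal"),
}

inferred = {
    (x, "genre", z)
    for (y1, p1, x) in triples if p1 == "artist"
    for (y2, p2, z) in triples if p2 == "genre" and y1 == y2
}
print(inferred)  # {('Nine_Inch_Nails', 'genre', 'Metal')}
```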
38. Using Knowledge Graphs as an Ingredient in AI
● Knowledge Graphs are graphs
– hence the name ;-)
● Most learning tools are tabular
39. Using Knowledge Graphs as an Ingredient in AI
● How to create tabular representations of entities in knowledge graphs?
– Easy: data values (e.g., release date)
– Easy: edges with single occurrences (e.g., birth place)
– Complex: edges with multiple occurrences (e.g., starring)
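One classical propositionalization strategy, sketched on made-up data: single-valued properties become ordinary columns, while multi-valued properties such as "starring" are unrolled into binary indicator features.

```python
# Single-valued edges become columns, multi-valued edges become
# binary indicator features (toy data, one of several possible strategies).
import pandas as pd

entities = {
    "Film_A": {"release": 1994, "starring": {"Actor_1", "Actor_2"}},
    "Film_B": {"release": 2001, "starring": {"Actor_2"}},
}

rows = []
for name, props in entities.items():
    row = {"entity": name, "release": props["release"]}
    for actor in props["starring"]:
        row[f"starring_{actor}"] = 1
    rows.append(row)

print(pd.DataFrame(rows).fillna(0))
```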
40. Hybrid AI with Knowledge Graphs
● Graphs to vectors!
– Representation learning, aka embeddings
● Approaches (not limited to)
– Language modeling adaptations (RDF2vec, KGlove, …)
– Tensor factorization (RESCAL, DistMult, ...)
– Link prediction (TransE and its descendants)
– Graph Neural Networks (e.g., GCN)
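As a pointer to what the link-prediction family does, the TransE intuition ("head plus relation should land near tail") can be written in two lines of numpy; the vectors below are random placeholders, and a real model would be trained, e.g., with a margin loss over corrupted triples.

```python
# TransE scoring intuition: a triple (h, r, t) is plausible
# if head + relation is close to tail in the embedding space (random toy vectors).
import numpy as np

rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=50) for _ in range(3))

score = -np.linalg.norm(h + r - t)   # higher (less negative) = more plausible
print(score)
```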
41. Knowledge Graph Embeddings
● A recent hype trend
– Each node (and edge) in the graph is represented as a point
– Similar nodes are close in that space
42. Knowledge Graph Embeddings
● What do we win?
– Each entity is a numeric vector
– Learning tools can be used easily
● What do we lose?
– Dimensions do not carry meaning anymore
43. Quo Vadis?
● Knowledge Graphs are also consumable for humans
– (think: explainable AI)
– but vectors are not!
● We are missing an important building block
– in Kahneman's terms: we forged system 2 into a new system 1 instead
– Holy grail: interpretable embeddings
44. Summary
● AI Ingredients
– AIs need knowledge
– e.g., conversational agents need to know about entities in the world
● Knowledge Graphs
– One representation paradigm for such knowledge
– There are plenty of freely available KGs
– Can be used for explainable AI