From a bird's eye view, the DBpedia Extraction Framework takes a MediaWiki dump as input and turns it into a knowledge graph. In this talk, I discuss the creation of the DBkWik knowledge graph by applying the DBpedia Extraction Framework to thousands of Wikis.
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block (Heiko Paulheim)
Starting with Cyc in the 1980s, the collection of general knowledge in machine-interpretable form has been considered a valuable ingredient in intelligent and knowledge-intensive applications. Notable contributions in the field include the Wikipedia-based datasets DBpedia and YAGO, as well as the collaborative knowledge base Wikidata. Since Google coined the term in 2012, such datasets have most often been referred to as knowledge graphs. Besides such open knowledge graphs, many companies have started using corporate knowledge graphs as a means of information representation.
In this talk, I will look at two ongoing projects related to the extraction of knowledge graphs from Wikipedia and other Wikis. The first new dataset, CaLiGraph, aims at the generation of explicit formal definitions from categories, and the extraction of new instances from list pages. In its current release, CaLiGraph contains 200k axioms defining classes, and more than 7M typed instances. In the second part, I will look at the transfer of the DBpedia approach to a multitude of arbitrary Wikis. The first such prototype, DBkWik, extracts data from Fandom, a Wiki farm hosting more than 400k different Wikis on various topics. Unlike DBpedia, which relies on a larger user base for crowdsourcing an explicit schema and extraction rules, and on the "one-page-per-entity" assumption, DBkWik has to address various challenges in the fields of schema learning and data integration. In its current release, DBkWik contains more than 11M entities, and has been found to be highly complementary to DBpedia.
Machine Learning with and for Semantic Web Knowledge Graphs (Heiko Paulheim)
Large-scale cross-domain knowledge graphs, such as DBpedia or Wikidata, are some of the most popular and widely used datasets of the Semantic Web. In this paper, we introduce some of the most popular knowledge graphs on the Semantic Web. We discuss how machine learning is used to improve those knowledge graphs, and how they can be exploited as background knowledge in popular machine learning tasks, such as recommender systems.
Knowledge Graphs, such as DBpedia, YAGO, or Wikidata, are valuable resources for building intelligent applications like data analytics tools or recommender systems. Understanding what is in those knowledge graphs is a crucial prerequisite for selecting a Knowledge Graph for a task at hand. Hence, Knowledge Graph profiling - i.e., quantifying the structure and contents of knowledge graphs, as well as their differences - is essential for fully utilizing the power of Knowledge Graphs. In this paper, I will discuss methods for Knowledge Graph profiling, depict crucial differences of the big, well-known Knowledge Graphs, like DBpedia, YAGO, and Wikidata, and throw a glance at current developments of new, complementary Knowledge Graphs such as DBkWik and WebIsALOD.
Using knowledge graphs in data mining typically requires a propositional, i.e., vector-shaped representation of entities. RDF2vec is one approach for generating such vectors from knowledge graphs, relying on random walks for extracting pseudo-sentences from a graph, and utilizing word2vec for creating embedding vectors from those pseudo-sentences. In this talk, I will give insights into the idea of RDF2vec, possible application areas, and recently developed variants incorporating different walk strategies and training variations.
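As a rough illustration of that pipeline, the following sketch generates random walks over a toy graph and feeds the resulting pseudo-sentences to word2vec. The graph, entity names, and hyperparameters are purely illustrative and are not those of the published RDF2vec models.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Toy knowledge graph as a directed, edge-labeled graph (illustrative data).
kg = nx.MultiDiGraph()
kg.add_edge("Mannheim", "Germany", label="locatedIn")
kg.add_edge("Heiko_Paulheim", "Mannheim", label="worksIn")
kg.add_edge("Germany", "Europe", label="partOf")

def random_walk(graph, start, depth=4):
    """One random walk, alternating entities and edge labels (a 'pseudo-sentence')."""
    walk, node = [start], start
    for _ in range(depth):
        edges = list(graph.out_edges(node, data="label"))
        if not edges:
            break
        _, nxt, label = random.choice(edges)
        walk.extend([label, nxt])
        node = nxt
    return walk

sentences = [random_walk(kg, n) for n in kg.nodes for _ in range(10)]

# word2vec over the pseudo-sentences yields one embedding vector per entity/property.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=20)
print(model.wv["Mannheim"][:5])
```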
How are Knowledge Graphs created?
What is inside public Knowledge Graphs?
Addressing typical problems in Knowledge Graphs (errors, incompleteness)
New Knowledge Graphs: WebIsALOD, DBkWik
The original Semantic Web vision foresees describing entities in a way that the meaning can be interpreted both by machines and humans. Following that idea, large-scale knowledge graphs capturing a significant portion of knowledge have been developed. In the recent past, vector space embeddings of Semantic Web knowledge graphs - i.e., projections of a knowledge graph into a lower-dimensional, numerical feature space (a.k.a. latent feature space) - have been shown to yield superior performance in many tasks, including relation prediction, recommender systems, or the enrichment of predictive data mining tasks. At the same time, those projections describe an entity as a numerical vector, without any semantics attached to the dimensions. Thus, embeddings are as far from the original Semantic Web vision as can be. As a consequence, the results achieved with embeddings - as impressive as they are in terms of quantitative performance - are most often not interpretable, and it is hard to obtain a justification for a prediction, e.g., an explanation why an item has been suggested by a recommender system. In this paper, we make a claim for semantic embeddings and discuss possible ideas towards their construction.
This presentation shows approaches for knowledge graph construction from Wikipedia and other Wikis that go beyond the "one entity per page" paradigm. We look at CaLiGraph, which extracts entities from categories and listings, as well as DBkWik, which extracts and integrates information from thousands of Wikis.
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems (Heiko Paulheim)
AI is not just about machine learning, it also requires knowledge about the world. In this talk, I give an introduction on knowledge graphs, how they are built at scale, and how they are used in modern AI systems.
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati... (Heiko Paulheim)
Knowledge Graphs are often used as a symbolic representation mechanism for representing knowledge in data intensive applications, both for integrating corporate knowledge as well as for providing general, cross-domain knowledge in public knowledge graphs such as Wikidata. As such, they have been identified as a useful way of injecting background knowledge in data analysis processes. To fully harness the potential of knowledge graphs, latent representations of entities in the graphs, so called knowledge graph embeddings, show superior performance, but sacrifice one central advantage of knowledge graphs, i.e., the explicit symbolic knowledge representations. In this talk, I will shed some light on the usage of knowledge graphs and embeddings in data analysis, and give an outlook on research directions which aim at combining the best of both worlds.
Knowledge graphs are used in various applications and have been widely analyzed. A question that is not very well researched is: what is the price of their production? In this paper, we propose ways to estimate the cost of those knowledge graphs. We show that the cost of manually curating a triple is between $2 and $6, and that automatically created knowledge graphs are cheaper by a factor of 15 to 150 (i.e., 1c to 15c per statement). Furthermore, we advocate for taking cost into account as an evaluation metric, showing the correspondence between cost per triple and semantic validity as an example.
Data-driven Joint Debugging of the DBpedia Mappings and Ontology (Heiko Paulheim)
DBpedia is a large-scale, cross-domain knowledge graph extracted from Wikipedia. For the extraction, crowd-sourced mappings from Wikipedia infoboxes to the DBpedia ontology are utilized. In this process, different problems may arise: users may create wrong and/or inconsistent mappings, use the ontology in an unforeseen way, or change the ontology without considering all possible consequences. In this paper, we present a data-driven approach to discover problems in mappings as well as in the ontology and its usage in a joint, data-driven process. We show both quantitative and qualitative results about the problems identified, and derive proposals for altering mappings and refactoring the DBpedia ontology.
RDF2vec is a method for creating embedding vectors for entities in knowledge graphs. In this talk, I introduce the basic idea of RDF2vec, as well as the latest extensions and developments, like the use of different walk strategies, the order-aware flavour of RDF2vec, RDF2vec for dynamic knowledge graphs, and more.
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top (Heiko Paulheim)
Large knowledge bases, such as DBpedia, are most often created heuristically due to scalability issues. In the building process, both random as well as systematic errors may occur. In this paper, we focus on finding systematic errors, or anti-patterns, in DBpedia. We show that by aligning the DBpedia ontology to the foundational ontology DOLCE-Zero, and by combining reasoning and clustering of the reasoning results, errors affecting millions of statements can be identified at a minimal workload for the knowledge base designer.
Researcher Pod: Scholarly Communication Using the Decentralized Web (Herbert Van de Sompel)
The presentation provides an overview of the motivation and direction of the Mellon-funded Researcher Pod project that investigates technical aspects of scholarly communication in a decentralized web setting.
Presentation about reference rot given at the Complexity Science Hub in Vienna, November 2021.
Links to web resources frequently break (link rot), and linked content can change at unpredictable rates (content drift). These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information.
This presentation will report on research that assessed the extent of these problems for links to web resources in scholarly literature, by using three vast corpora of publications and a range of public web archives. It will also describe the Robust Link approach that offers a proactive, uniform, and machine-actionable way to combat link rot and content drift. Finally, it will introduce the Robustify web service and API that was devised to generate links that remain functional over time, paying special attention to challenges related to deploying infrastructure that is required to be long lasting.
Revised presentation given at SCOCA in Piketon, Ohio, in September 2013. Includes KnowItNow24x7 as well as information on efficient uses of Google and Wikipedia (updated slides from previous presentations).
Linked data in the German National Library at the OCLC IFLA round table 2013 (Lars G. Svensson)
A presentation about the current state of the linked data activities in the German National Library held at the OCLC Linked Data Round table during the WLIC 2013 in Singapore
Presentation during the 2016 American Library Association (ALA) Annual Conference in Orlando (Florida), given at the ALCTS Program "Linked Data - Globally Connecting Libraries, Archives, and Museums", Sponsor: ALCTS International Relations Committee, Co-Sponsor: Linked Library Data Interest Group
Knowledge graph embeddings are a mechanism that projects each entity in a knowledge graph to a point in a continuous vector space. It is commonly assumed that those approaches project two entities closely to each other if they are similar and/or related. In this talk, I give a closer look at the roles of similarity and relatedness with respect to knowledge graph embeddings, and discuss how the well-known embedding mechanism RDF2vec can be tailored towards focusing on similarity, relatedness, or both.
Fast Approximate A-box Consistency Checking using Machine Learning (Heiko Paulheim)
Ontology reasoning is typically a computationally intensive operation. While soundness and completeness of results is required in some use cases, for many others, a sensible trade-off between computation efforts and correctness of results makes more sense. In this paper, we show that it is possible to approximate a central task in reasoning, i.e., A-box consistency checking, by training a machine learning model which approximates the behavior of a reasoner for a specific ontology. On four different datasets, we show that such learned models consistently achieve an accuracy above 95% at less than 2% of the runtime of a reasoner, using a decision tree with no more than 20 inner nodes. For example, this allows for validating 293M Microdata documents against the schema.org ontology in less than 90 minutes, compared to 18 days required by a state-of-the-art ontology reasoner.
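To make the idea concrete, here is a minimal, self-contained sketch of such a training setup: simple features describing each A-box, labels obtained from a full reasoner on a training sample, and a small decision tree as the approximation. The features, labels, and resulting accuracy are purely illustrative and not taken from the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative setup: each A-box (e.g., one Microdata document) is described by
# which classes/properties it uses; the label is the consistency verdict that a
# full reasoner produced on a training sample.
aboxes = [
    {"uses_schema:Person": 1, "uses_schema:birthDate": 1},
    {"uses_schema:Event": 1, "uses_schema:birthDate": 1},
    {"uses_schema:Person": 1, "uses_schema:startDate": 1},
    {"uses_schema:Event": 1, "uses_schema:startDate": 1},
] * 50
labels = [1, 0, 0, 1] * 50   # 1 = consistent, 0 = inconsistent (from the reasoner)

X = DictVectorizer(sparse=False).fit_transform(aboxes)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

# A small decision tree approximates the reasoner's behaviour for this ontology.
clf = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0).fit(X_train, y_train)
print("approximation accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```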
These slides accompany the first part of a Digital Arts and Humanities sponsored workshop that Vinayak Das Gupta and I gave in Trinity College Dublin on 27 May 2015. The workshop, entitled 'Data-mining the Semantic Web and spatially visualising the results', introduced the participants to the concepts and technologies of Linked Open Data, the Semantic Web, RDF, SPARQL, GeoJSON and Leaflet.js. These slides cover the data-mining of online cultural heritage resources.
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ... (Heiko Paulheim)
In the past years, sophisticated methods for extracting knowledge graphs from Wikipedia, like DBpedia, YAGO, and CaLiGraph, have been developed. In this talk, I revisit some of these methods and examine if and how they can be replaced by prompting a large language model like ChatGPT.
This event extends the reach of the Open Education Conference -- Beyond Content -- taking place in Vancouver 16-18 October, 2012
The Open Education Remixathon will kick off with a round robin to describe each Open Educational Resource and the envisioned enhancements.
See the full description and participate in the conversation in SCoPE: http://scope.bccampus.ca/mod/forum/view.php?id=9009
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
web 2.0, library systems and the library system (lisld)
The Web 2.0 environment is characterized by concentration and diffusion. Library services are not well matched to this environment: they are fragmented and difficult to mobilize in user workflows. This presentation analyzes this situation and suggests some directions.
Catherine Parker (University of Huddersfield) – “The Game of Open Access: mak... (ARLGSW)
Presentation from the 6th CILIP ARLG-SW Discover Academic Research and Training Support Conference (DARTS6). Dartington Hall, Totnes, Thursday 24th – Friday 25th May 2018
Contributing to the global commons: Repositories and Wikimedia (Nick Sheppard)
There is huge potential for universities and their libraries to leverage Wikimedia in order to expose research outputs and collections. Wikimedia comprises sixteen projects in total, including Wikipedia, Wikimedia Commons and Wikidata. At the University of Leeds, the Research Data Management Service have successfully run a project that focuses on linking research data with the Wikimedia suite of tools via a series of ‘editathons’, in order to increase the visibility of research data and enable reuse on Wikipedia and elsewhere. The project - "Manage it locally to share it globally: RDM and Wikimedia Commons" - was the winning submission to a competition launched in May 2018 and sponsored by SPARC Europe, Jisc and the University of Cambridge, called the "Data Management Engagement Award", which aimed to address cultural challenges involved in promoting effective research data practices.
The project has served as a springboard to further explore Wikimedia strategically, both at the University of Leeds and across the White Rose Consortium. For example we are collaborating on a new project looking at Wikipedia citations of research from York, Sheffield and Leeds, and the proportion of these that are open access. The long term goal might be to establish a "Wikimedian in Residence" across the consortium. In this talk, we will present the project's outputs - including a toolkit that will enable other institutions to apply the same methodology. In addition we will explore the potential of Wikidata to link up repositories and other data silos in a manner that enables reuse and increases impact.
Weakly Supervised Learning for Fake News Detection on Twitter (Heiko Paulheim)
The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this noisy, inaccurately labeled dataset, it is possible to detect fake news with an F1 score of up to 0.9.
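A minimal sketch of this weak-supervision setup is shown below; the example tweets, source labels, and classifier choice are invented for illustration and do not reproduce the paper's corpus or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative weak supervision: label tweets by the trustworthiness of their
# source (not by whether the individual tweet is actually fake), then train a
# text classifier on these noisy labels.
tweets = [
    ("Breaking: celebrity X secretly replaced by clone", "untrustworthy_source"),
    ("Miracle cure hidden by doctors, share before deleted!", "untrustworthy_source"),
    ("Government releases quarterly unemployment figures", "trustworthy_source"),
    ("New study published in Nature on coral bleaching", "trustworthy_source"),
] * 25

texts = [t for t, _ in tweets]
weak_labels = [1 if src == "untrustworthy_source" else 0 for _, src in tweets]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, weak_labels)

# At prediction time, the classifier is applied to the actual target:
# classifying individual tweets as fake vs. non-fake.
print(clf.predict(["Scientists confirm the moon is made of cheese"]))
```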
Combining Ontology Matchers via Anomaly Detection (Heiko Paulheim)
In ontology alignment, there is no single best performing matching algorithm for every matching problem. Thus, most modern matching systems combine several base matchers and aggregate their results into a final alignment. This combination is often based on simple voting or averaging, or uses existing matching problems for learning a combination policy in a supervised setting. In this paper, we present the COMMAND matching system, an unsupervised method for combining base matchers, which uses anomaly detection to produce an alignment from the results delivered by several base matchers. The basic idea of our approach is that in a large set of potential mapping candidates, the scarce actual mappings should be visible as anomalies against the majority of non-mappings. The approach is evaluated on different OAEI datasets and shows a competitive performance with state-of-the-art systems.
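The following sketch illustrates the underlying intuition rather than the actual COMMAND system: each candidate correspondence is represented as a vector of base-matcher scores, and an off-the-shelf anomaly detector flags the rare, high-scoring candidates as likely mappings. The data and the choice of IsolationForest are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Each candidate correspondence is a vector of confidence scores from several
# base matchers (e.g., string similarity, structural similarity, ...).
# True mappings are scarce and should stand out as anomalies against the bulk
# of non-mappings.
rng = np.random.default_rng(0)
non_mappings = rng.uniform(0.0, 0.4, size=(980, 3))   # mostly low scores
true_mappings = rng.uniform(0.7, 1.0, size=(20, 3))    # scarce, high scores
candidates = np.vstack([non_mappings, true_mappings])

detector = IsolationForest(contamination=0.02, random_state=0).fit(candidates)
is_anomaly = detector.predict(candidates) == -1         # -1 marks anomalies

print("candidates selected as mappings:", is_anomaly.sum())
```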
Gathering Alternative Surface Forms for DBpedia Entities (Heiko Paulheim)
Wikipedia is often used as a source of surface forms, or alternative reference strings for an entity, required for entity linking, disambiguation or coreference resolution tasks. Surface forms have been extracted in a number of works from Wikipedia labels, redirects, disambiguations and anchor texts of internal Wikipedia links, which we complement with anchor texts of external Wikipedia links from the Common Crawl web corpus. We tackle the problem of the quality of Wikipedia-based surface forms, which has not been raised before. We create a gold standard for the dataset quality evaluation, which reveals the surprisingly low precision of the Wikipedia-based surface forms. We propose filtering approaches that allowed boosting the precision from 75% to 85% for a random entity subset, and from 45% to more than 65% for the subset of popular entities. The filtered surface form dataset as well as the gold standard are made publicly available.
Mining the Web of Linked Data with RapidMiner (Heiko Paulheim)
Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, allowing for using it in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.
Data Mining with Background Knowledge from the Web - Introducing the RapidMin... (Heiko Paulheim)
Many data mining problems can be solved better if more background knowledge is added: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper, we introduce the RapidMiner Linked Open Data Extension, which can extend a dataset at hand with additional attributes drawn from the Linked Open Data (LOD) cloud, a large collection of publicly available datasets on various topics. The extension contains operators for linking local data to open data in the LOD cloud, and for augmenting it with additional attributes. In a case study, we show that the prediction error of car fuel consumption can be reduced by 50% by adding additional attributes, e.g., describing the automobile layout and the car body configuration, from Linked Open Data.
Detecting Incorrect Numerical Data in DBpedia (Heiko Paulheim)
DBpedia is a central hub of Linked Open Data (LOD). Being based on crowd-sourced contents and heuristic extraction methods, it is not free of errors. In this paper, we study the application of unsupervised numerical outlier detection methods to DBpedia, using Interquartile Range (IQR), Kernel Density Estimation (KDE), and various dispersion estimators, combined with different semantic grouping methods. Our approach reaches 87% precision, and has led to the identification of 11 systematic errors in the DBpedia extraction framework.
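For readers unfamiliar with IQR-based outlier detection, the following sketch shows the core idea on invented numbers for a single group of values; the paper additionally combines this with KDE, other dispersion estimators, and semantic grouping.

```python
import numpy as np

# Illustrative: population values (in thousands) of entities grouped under the
# same class/property, with one erroneous value from an extraction glitch.
values = np.array([81.3, 83.0, 84.7, 79.9, 82.5, 8470.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("suspicious values:", outliers)   # -> [8470.]
```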
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection (Heiko Paulheim)
Links between datasets are an essential ingredient of Linked Open Data. Since the manual creation of links is expensive at large-scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a higher dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54, and an area under the ROC curve of up to 0.86.
Extending DBpedia with Wikipedia List Pages (Heiko Paulheim)
Thanks to its wide coverage and general-purpose ontology, DBpedia is a prominent dataset in the Linked Open Data cloud. DBpedia's content is harvested from Wikipedia's infoboxes, based on manually created mappings. In this paper, we explore the use of a promising source of knowledge for extending DBpedia, i.e., Wikipedia's list pages. We discuss how a combination of frequent pattern mining and natural language processing (NLP) methods can be leveraged in order to extend both the DBpedia ontology, as well as the instance information in DBpedia. We provide an illustrative example to show the potential impact of our approach and discuss its main challenges.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
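The decomposition that Levelwise PageRank builds on can be sketched in a few lines. The snippet below only shows the SCC condensation and the topological (levelwise) processing order on a random graph; the per-block rank computation and the dead-end handling described in the report are omitted, and the graph is illustrative.

```python
import networkx as nx

# Condense the input graph into strongly connected components, then process the
# resulting block-graph in topological order, so each block only depends on
# blocks that were already processed.
G = nx.gnp_random_graph(200, 0.03, directed=True, seed=42)

condensed = nx.condensation(G)                 # DAG of strongly connected components
order = list(nx.topological_sort(condensed))   # levelwise / blockwise processing order

block_of = {}
for level, comp in enumerate(order):
    # In a real implementation, PageRank would be computed per block here,
    # reusing the finished ranks of predecessor blocks; we only record the order.
    for v in condensed.nodes[comp]["members"]:
        block_of[v] = level

print("number of SCC blocks:", condensed.number_of_nodes())
```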
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
1. From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
Heiko Paulheim
2. A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
3. An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
4. What if…?
• What if we applied the DBpedia EF to every MediaWiki?
• According to WikiApiary, there are thousands...
7. A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
8. DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
9. Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
10. DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
11. Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor → mywiki:actor a owl:Class
• Every infobox key becomes a property:
|role = Harry’s mother → mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
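As a sketch of this raw extraction, the snippet below turns a single parsed infobox into triples with rdflib. The mywiki namespace, entity name, and helper function are illustrative and are not part of the actual DBpedia Extraction Framework code.

```python
from rdflib import Graph, Literal, Namespace, RDF, OWL

# Hypothetical namespace for a single extracted Wiki (one namespace per Wiki).
MYWIKI = Namespace("http://dbkwik.webdatacommons.org/mywiki/")

def extract_infobox(graph, entity_name, template_name, key_values):
    """Raw extraction: the infobox template becomes a class,
    every infobox key becomes a property, values become literals."""
    entity = MYWIKI[entity_name]
    cls = MYWIKI[template_name.replace(" ", "_")]
    graph.add((cls, RDF.type, OWL.Class))         # {{infobox actor -> mywiki:actor a owl:Class
    graph.add((entity, RDF.type, cls))
    for key, value in key_values.items():
        prop = MYWIKI[key]
        graph.add((prop, RDF.type, RDF.Property)) # |role = ... -> mywiki:role a rdf:Property
        graph.add((entity, prop, Literal(value)))
    return graph

g = Graph()
extract_infobox(g, "Lily_Potter", "actor", {"role": "Harry's mother"})
print(g.serialize(format="turtle"))
```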
15. Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score      Internal Linking    Linking to DBpedia
Classes       .979                .898
Properties    .836                .865
Instances     .879                .657
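The doc2vec-based instance matching mentioned on this slide can be illustrated as follows; the page texts, URIs, and hyperparameters are invented, and this is only a sketch of the idea, not the DBkWik matching pipeline.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Toy corpus: plain text of Wiki pages, tagged by page URI (illustrative data).
pages = {
    "memory-alpha:Kai":    "The Kai is the religious leader of the Bajoran people ...",
    "memory-beta:Kai":     "Kai, title of the Bajoran spiritual leader ...",
    "dbpedia:Kai_(title)": "Kai is a fictional religious title in Star Trek ...",
}
corpus = [TaggedDocument(simple_preprocess(text), [uri]) for uri, text in pages.items()]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Candidate instance matches: pairs whose page vectors are most similar.
for uri in pages:
    print(uri, model.dv.most_similar(uri, topn=1))
```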
16. Gold Standard DBkWik 1.1
• Schema alignment: manual
• Instance alignment: crowd-sourced
– Using 3x3 Wikis from 3 different topics
– Asking crowdworkers to identify similar pages
– Search was allowed and encouraged
17. Gold Standard DBkWik 1.1
• Crowdsourcing results
– High inter-rater agreement (Fleiss’ Kappa: 0.8762)
– Most mappings are trivial, though
• Possible bias in gold standard
– We pre-selected matching Wikis!
18. Results: Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
19. Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
● e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations of actual distributions
• Result: ~100k new instance types
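A minimal sketch of how association rule mining can surface subsumption axioms like Artist(x) → Person(x); the toy type assertions and thresholds are illustrative, not the actual DBkWik data or settings.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy type assertions: one row per instance, one boolean column per class.
types = pd.DataFrame(
    [
        {"Artist": True,  "Person": True,  "Film": False},
        {"Artist": True,  "Person": True,  "Film": False},
        {"Artist": False, "Person": True,  "Film": False},
        {"Artist": False, "Person": False, "Film": True},
    ]
)

itemsets = apriori(types, min_support=0.25, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.9)

# Rules of the form {Artist} -> {Person} with high confidence suggest
# the subsumption axiom Artist ⊑ Person.
print(rules[["antecedents", "consequents", "support", "confidence"]])
```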
23. DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
24. DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M| → |T| = P * |M|
– R = |T| / |O| → |T| = R * |O|
– → |O| = |M| * P / R
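As a tiny worked example of this estimate (all numbers invented, not the actual DBkWik 1.1 figures):

```python
# Worked example of the overlap estimate |O| = |M| * P / R
# (the numbers are illustrative, not the actual DBkWik 1.1 figures).
M = 450_000   # size of the (incomplete, partly wrong) mapping
P = 0.90      # estimated precision of the mapping
R = 0.80      # estimated recall of the mapping

O = M * P / R # estimated true overlap between DBkWik and DBpedia
print(f"Estimated overlap: {O:,.0f} instances")
```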
25. DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik are not in DBpedia
– 90% of all entities in DBpedia are not in DBkWik
26. Towards Improving Interlinking
• Strategy: ask the experts
– new Knowledge Graph track at OAEI 2018
– seven systems provided results
• Results:
– it is hard to beat the string baseline
– many matching systems rely on explicit, deep ontologies
● but we have just shallow schemas
• Possible reasons:
– the problem is too difficult?
– the gold standard is too trivial?
– the ontology lacks formality
27. Towards Improving Interlinking
• Currently, embedding-based methods are on the rise
– e.g., Azmy et al.: “Matching Entities Across Different Knowledge Graphs with Graph Embeddings”, 2019
– require large-scale training data
28. Towards Improving Interlinking
• Overcoming issues of first gold standard
– include non-trivial matches
– include non-matches
29. Towards Improving Interlinking
• Includes trivial and non-trivial matches
– i.e., task gets more demanding
• Low inter-rater agreement: Fleiss’ Kappa 0.02
30. Towards Improving Interlinking
• Exploiting Wiki Interlinks
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
[Diagram: interlinks between wiki 1 and wiki 2, connecting pages such as Kai, Meressa, and Star Trek]
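A minimal sketch (not the DBkWik code) of harvesting such interwiki and interlanguage links from raw wikitext with regular expressions; the wikitext is the snippet shown above.

```python
import re

wikitext = """
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
"""

# [[xx:Title]] interlanguage links
interlanguage = re.findall(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]", wikitext)
# {{Wikipedia|Target|Label}} style templates pointing to Wikipedia
wikipedia_links = re.findall(r"\{\{Wikipedia\|([^}|]+)(?:\|[^}]*)?\}\}", wikitext)

print(interlanguage)    # [('de', 'Kai'), ('nl', 'Kai'), ('pl', 'Kai')]
print(wikipedia_links)  # ['Bajoran#Kai']
```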
31. Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Pipeline diagram: Source, SimpleWikiParser, AST, LinkExtractor, NifExtractor, NewNifExtractor, Page, Graph, HTML, Destination]
32. Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
33. Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
35. Further Open Challenges
• More detailed profiling
– e.g., do we reduce or increase bias?
• Task-based evaluation
– Does it improve, e.g., recommender systems?
• Fusion policies
– Identify outdated Wikis
36. Contributors
• DBkWik contributors (past, present, and future)
Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch
37. From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
Heiko Paulheim