Information in Wikipedia can be edited independently in over 300 languages, so the same subject is often described differently depending on the language edition. To compare information between editions, one usually needs to understand each of the languages involved. We work on solutions that help to automate this process, leveraging machine learning and artificial intelligence algorithms. The crucial component, however, is the assessment of article quality; therefore we need to know how to define and extract different quality measures. This presentation briefly introduces some of the recent activities of the Department of Information Systems at Poznań University of Economics and Business related to the quality assessment of multilingual content in Wikipedia. In particular, we demonstrate some approaches for assessing the reliability of sources in Wikipedia articles. Such solutions can help to enrich various language editions of Wikipedia and other knowledge bases with information of better quality.
Slides from the December 2020 Wikimedia Research Showcase: https://www.youtube.com/watch?v=v9Wcc-TeaEY
2. Languages of the world in 2020
• 7,117 languages are spoken.
• 2,926 languages are endangered.
• Just 23 languages account for more than ½ of the world's population.
• Wikipedia articles have been created in 314 languages.
[Chart: The top 10 most spoken languages (in millions), native speakers vs. total number of speakers — English, Mandarin Chinese, Hindi, Spanish, French, Standard Arabic, Bengali, Russian, Portuguese, Indonesian]
Source: ethnologue.com, meta.wikimedia.org
3. Motivation – enrichment of multilingual information
Quality matters!
Source: Lewoniewski, W. (2018). The method of comparing and enriching information in multilingual wikis based on the analysis of their quality. PhD thesis.
4. Quality in Multilingual Wikipedia
• Wikipedia can be edited in each language independently
• the same subject can be described differently
• a user usually needs to understand those languages
• Information quality depends on the language of Wikipedia
• Each language edition defines its own rules and standards
• Standards may change over time
• Reliable sources are important
• Assessment of the same source depends on the language edition of Wikipedia
• Reliability of the same source may change over time
5. Related works
• Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.
• Lewoniewski, W., Węcel, K., Abramowicz, W. (2019). Multilingual ranking of Wikipedia articles with quality and popularity assessment in different topics. Computers, 8(3), 60.
• Lewoniewski, W. (2019). Measures for quality assessment of articles and infoboxes in multilingual Wikipedia. In International Conference on Business Information Systems (pp. 619-633). Springer, Cham.
• Lewoniewski, W. (2018). The method of comparing and enriching information in multilingual wikis based on the analysis of their quality. PhD thesis.
• Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Relative quality and popularity evaluation of multilingual Wikipedia articles. Informatics, 4(4), 43.
• Lewoniewski, W. (2017). Enrichment of information in multilingual Wikipedia based on quality analysis. In International Conference on Business Information Systems (pp. 216-227). Springer, Cham.
• Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Analysis of references across Wikipedia languages. In International Conference on Information and Software Technologies (pp. 561-573). Springer, Cham.
6. Quality classes in Wikipedia languages
Source: Lewoniewski, W. (2017). Enrichment of information in multilingual Wikipedia based on quality analysis. In International Conference on Business Information Systems (pp. 216-227). Springer, Cham.
8. Significance of measures depending on language
Source: Węcel, K., Lewoniewski, W. (2015). Modelling the quality of attributes in Wikipedia infoboxes.
9. Significance of measures depending on language (2)
Source: Lewoniewski, W. (2018). The method of comparing and enriching information in multilingual wikis based on the analysis of their quality. PhD thesis.
10. Distribution of measures in quality classes
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2017). Relative quality and popularity evaluation of multilingual Wikipedia articles. Informatics, 4(4), 43.
11. Normalized measures average (NMA)
$NMA = \frac{1}{c}\sum_{i=1}^{c} m_i$
where $m_i$ is a normalized measure and $c$ is the number of measures.

Normalization of each measure $m_i$ was conducted according to the following rule:
• if the value of a given feature in a given language exceeded the threshold of the median value of the best articles in the same language version, it was set to 100 points;
• otherwise, its value was linearly scaled to reflect the relation of the value to the median value.
For example, if the median for the number of references in Polish Wikipedia was 97:
• any article with a larger number of references would score 100 for this feature;
• an article with 59 references would score proportionally 60.82 (59/97) points after normalization.

Article quality score – a synthetic quality measure
Additionally, we need to take into account the number of quality flaw templates (QFT) to measure the quality score:
$QualityScore = NMA \cdot (1 - 5\% \cdot QFT)$
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2019). Multilingual ranking of Wikipedia articles with quality and popularity assessment in different topics. Computers, 8(3), 60.
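A minimal sketch in Python of how the normalization rule, the NMA, and the QualityScore above fit together. The measure names and median values below are hypothetical illustrations, not WikiRank's actual feature set or code:

```python
# A minimal sketch (assumptions, not the authors' code) of the normalization
# rule, the NMA, and the QualityScore described above. Measure names and
# median values are hypothetical illustrations.

def normalize(value: float, best_median: float) -> float:
    """Scale a raw measure to 0-100 against the median of the best articles."""
    if value >= best_median:
        return 100.0                        # exceeds the threshold: full score
    return 100.0 * value / best_median      # otherwise scale linearly

def nma(measures: dict[str, float], best_medians: dict[str, float]) -> float:
    """Normalized measures average: mean of the c normalized measures."""
    normalized = [normalize(v, best_medians[name]) for name, v in measures.items()]
    return sum(normalized) / len(normalized)

def quality_score(measures: dict[str, float],
                  best_medians: dict[str, float], qft: int) -> float:
    """QualityScore = NMA * (1 - 5% * QFT), where QFT counts quality flaw templates."""
    return nma(measures, best_medians) * (1 - 0.05 * qft)

# The slide's example: with a best-article median of 97 references on Polish
# Wikipedia, an article with 59 references scores 59/97*100 = 60.82 points.
measures = {"references": 59, "length": 14000}       # hypothetical raw values
best_medians = {"references": 97, "length": 14000}
print(round(quality_score(measures, best_medians, qft=1), 2))  # (60.82+100)/2 * 0.95 = 76.39
```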
12. Quality score – an example of implementation
Implementation of the quality score on WikiRank.net
13. Wikipedia references
Over 200 million references in 55 languages.
[Chart: Number of references (in millions, log scale 1–64) per language edition, total vs. unique references]
The calculation is based on Wikimedia dumps as of March 2020 using complex extraction of references. More languages: Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.
14. Wikipedia references – complex extraction
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.
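As a rough illustration of what extracting references from wikitext involves, the sketch below pulls `<ref>...</ref>` contents out of raw wikitext with regular expressions and scans them for DOIs. The actual complex extraction described in the cited paper handles far more (citation templates, named and list-defined references, identifier variants); the sample wikitext here is invented:

```python
# A minimal sketch (not the authors' pipeline) of extracting references
# and special identifiers from raw wikitext. A real extraction must also
# handle citation templates, named/reused refs, and list-defined refs.
import re

wikitext = """
Example sentence.<ref>{{cite journal |doi=10.3390/info11050263
 |title=Modeling Popularity and Reliability of Sources}}</ref>
Another one.<ref>{{cite web |url=https://example.org |publisher=Example}}</ref>
"""

# Capture the content of every <ref>...</ref> pair (skips self-closing refs).
refs = re.findall(r"<ref[^>/]*>(.*?)</ref>", wikitext, flags=re.DOTALL)

doi_pattern = re.compile(r"10\.\d{4,9}/[^\s|}]+")   # crude DOI matcher
for i, ref in enumerate(refs, 1):
    dois = doi_pattern.findall(ref)
    print(f"ref {i}: {len(dois)} DOI(s) found: {dois}")
```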
15. Templates in references on English Wikipedia
The most commonly used names in the publisher parameter of citation templates:
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.
16. Wikipedia references with special identifiers
[Chart: References with special identifiers (in percent, 0–9%) for de, en, es, fr, it, ja, nl, pl, pt, ru, sv, uk, zh; series: DOI, unique DOI, ISBN, unique ISBN, ISSN, unique ISSN, PMID, unique PMID]
The calculation is based on Wikimedia dumps as of March 2020 using complex extraction of references. More languages and identifiers: Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.
17. Popularity and reliability models
• F model – based on frequency (F) of source usage.
• P model – based on cumulative pageviews (P) of the article in which the source appears.
• PR model – based on cumulative pageviews (P) of the article in which the source appears, divided by the number of references (R) in this article.
• PL model – based on cumulative pageviews (P) of the article in which the source appears, divided by article length (L).
• Models Pm, PmR, PmL are modified versions using the median of daily pageviews.
• Models A, AR, AL use the number of authors.
Source: Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.
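A minimal sketch of how the F, P, PR, and PL models could be scored per source; the article records and field names below are assumptions for illustration, not the authors' implementation:

```python
# A minimal sketch (assumptions, not the authors' implementation) of the
# F, P, PR and PL popularity/reliability models described above. The input
# records and field names are hypothetical.
from collections import defaultdict

articles = [
    # each article: cumulative pageviews, length, number of references,
    # and the domains of the sources it cites
    {"pageviews": 120_000, "length": 45_000, "refs": 97,
     "sources": ["nytimes.com", "bbc.co.uk"]},
    {"pageviews": 8_000, "length": 9_000, "refs": 12,
     "sources": ["nytimes.com"]},
]

scores = {model: defaultdict(float) for model in ("F", "P", "PR", "PL")}
for art in articles:
    for src in art["sources"]:
        scores["F"][src] += 1                                  # frequency of usage
        scores["P"][src] += art["pageviews"]                   # cumulative pageviews
        scores["PR"][src] += art["pageviews"] / art["refs"]    # pageviews per reference
        scores["PL"][src] += art["pageviews"] / art["length"]  # pageviews per length unit

for model, per_source in scores.items():
    ranking = sorted(per_source, key=per_source.get, reverse=True)
    print(model, ranking)
```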
18. Publishers in references on English Wikipedia
Position in rankings of publishers in English Wikipedia depending on popularity and reliability model, February 2020. Source: own calculation based on Wikimedia dumps using complex extraction and using only values from the publisher parameter of citation templates in references. More publishers: Lewoniewski, W., Węcel, K., Abramowicz, W. (2020). Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information, 11(5), 263.

Position in the ranking depending on model:
Source                          F    P    PR   PL   Pm   PmR  PmL  A    AR   AL
AllMusic                        8    28   8    8    26   8    9    14   6    7
BBC                             3    4    3    3    5    4    3    2    3    3
BBC News                        10   5    7    5    6    7    5    7    8    8
BBC Sport                       4    11   15   12   16   17   13   5    7    5
Cambridge University Press      5    3    2    2    3    2    2    3    4    4
CBS Interactive                 20   9    10   7    9    10   8    12   15   10
CNN                             22   2    9    6    2    9    6    6    16   12
ESPN                            13   8    17   14   8    19   16   10   13   14
IGN                             32   37   29   24   34   29   23   22   17   15
National Park Service           7    94   38   48   89   47   58   60   12   11
Official Charts Company         16   30   24   18   31   26   18   18   21   18
Oxford University Press         2    1    1    1    1    1    1    1    2    1
Routledge                       6    6    4    4    4    3    4    4    5    6
Springer                        19   12   6    9    10   5    7    16   10   13
United States Census Bureau     1    27   5    11   24   6    12   9    1    2
University of California Press  24   16   19   19   12   18   17   15   19   22
Yale University Press           9    13   28   28   13   28   29   21   33   28
19. BestRef – analysis of sources on Wikipedia
Implementation of the popularity and reliability models on BestRef.net
20. BestRef – analysis of sources on Wikipedia (2)
Source: BestRef.net
Position in the ranking depending on model (F, PR, AR), per language edition:

nytimes.com
Lang.  F    PR   AR
all    7    3    5
ar     6    11   15
de     10   10   11
es     24   10   20
fr     14   18   18
it     12   11   13
ja     25   46   49
nl     23   17   23
pl     31   25   38
pt     10   11   22
ru     11   14   18
sv     46   23   32
uk     41   40   41
zh     13   27   32

spiegel.de
Lang.  F    PR   AR
all    125  47   51
ar     270  423  415
de     2    1    2
es     413  457  584
fr     401  471  392
it     329  359  387
ja     692  992  899
nl     117  113  135
pl     534  392  421
pt     501  497  543
ru     355  333  324
sv     269  250  216
uk     342  501  340
zh     469  787  689

lemonde.fr
Lang.  F    PR   AR
all    111  67   76
ar     343  556  418
de     341  393  355
es     280  398  506
fr     5    1    3
it     315  275  328
ja     701  1381 1244
nl     203  345  342
pl     730  689  675
pt     438  514  597
ru     622  753  875
sv     663  1023 630
uk     771  1522 1013
zh     526  1339 1128

elpais.com
Lang.  F    PR   AR
all    100  56   62
ar     265  852  400
de     312  445  346
es     21   3    3
fr     109  248  199
it     248  281  256
ja     630  2096 1236
nl     351  620  435
pl     577  1018 785
pt     102  67   101
ru     635  832  741
sv     577  963  758
uk     720  1537 916
zh     371  1051 729
22. Further applications
• Quality models can help to enrich various language editions of Wikipedia and other knowledge bases with information of better quality.
• Some of the approaches are planned to be implemented on global.dbpedia.org
[Figure: GlobalFactSync data flow. Source: commons.wikimedia.org/wiki/File:GFS.png]