This document discusses measuring library catalogs and record validation. It begins with an introduction to MARC format and examples of MARC records. It then covers validating individual records and generating summaries of validation errors. Other validation options and viewing/filtering records are described. Methods for calculating completeness, clustering records, indexing with Solr, and finding problems with facets are also summarized. The document concludes with discussions of using MARC data in digital humanities, reproducibility, available catalogs to measure, and future work.
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020) – Péter Király
This document summarizes a presentation about developing a Dataverse repository for the SSHOC project. The SSHOC project aims to create an open science cloud for the social sciences and humanities. As part of this, it is developing a Dataverse repository to allow researchers to share and publish datasets according to FAIR principles. The presentation provides an overview of the SSHOC project, introduces Dataverse as a repository software, demonstrates its functionality, and discusses requirements for developing a domain-specific Dataverse for several research communities.
Validating 126 million MARC records (DATeCH 2019) – Péter Király
This document summarizes Péter Király's presentation on validating 126 million MARC records. It discusses ingesting MARC records from various libraries, measuring the records for issues using a validator, and aggregating the results. Common issues found include invalid field codes, values and subfields. The results are analyzed and reported through a web interface to help identify problems and improve record quality.
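The kind of per-record check described above — flagging undefined fields and subfields — can be sketched as follows. The schema here is a deliberately tiny, illustrative subset, not the full MARC21 standard, and the record layout is a simplified stand-in for real MARC structures.

```python
# Minimal sketch of per-record MARC validation: a record is a list of
# (tag, subfields) pairs; checks flag tags or subfield codes that are
# missing from the (illustrative) schema below.
MARC_SCHEMA = {
    "245": {"a", "b", "c"},   # Title Statement (subset of real subfields)
    "100": {"a", "d"},        # Main Entry - Personal Name (subset)
}

def validate_record(record):
    """Return a list of issue strings for one record."""
    issues = []
    for tag, subfields in record:
        if tag not in MARC_SCHEMA:
            issues.append(f"undefined field: {tag}")
            continue
        for code in subfields:
            if code not in MARC_SCHEMA[tag]:
                issues.append(f"undefined subfield: {tag}${code}")
    return issues

record = [
    ("245", {"a": "Example title", "x": "oops"}),  # $x is not defined for 245
    ("999", {"a": "local field"}),                 # 999 is not in the schema
]
print(validate_record(record))
```

A real validator additionally checks indicators, fixed-position values in the Leader and 008, and repeatability rules.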
Measuring Metadata Quality (doctoral defense 2019) – Péter Király
This document summarizes Péter Király's presentation on measuring metadata quality. The presentation discusses the importance of metadata quality and proposes using structural metrics like completeness, multilinguality, and issue detection to approximate overall metadata quality. It presents a framework for flexible and scalable metadata quality assessment that cultural heritage institutions could implement to measure metadata, generate reports, and improve records. The framework is being used to evaluate metadata in Europeana and could help address challenges of assessing metadata quality at large scales with limited resources.
Empirical evaluation of library catalogues (SWIB 2019) – Péter Király
This document summarizes a presentation on empirically evaluating library catalogues using MARC records. It describes ingesting MARC records, measuring them for quality issues, aggregating the results, and working with experts to improve record quality. The tool can validate records, analyze completeness, classifications, authorities, and more. It produces reports on issues and provides links to explore problematic fields and values.
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020) – Péter Király
The document discusses the Dataverse installation in Göttingen, Germany. It is run by the Göttingen eResearch Alliance which includes the University of Göttingen and several research institutes. The Dataverse, called Göttingen Research Online (GRO.data), serves as a general repository for research data from across the Göttingen campus. It has been customized to the local IT infrastructure and collaborates with other data initiatives both within Göttingen and beyond. Future plans include further integration with other services and assessing data quality.
Incubating Göttingen Cultural Analytics Alliance (SUB 2021) – Péter Király
This document proposes collaborating to form the Göttingen Cultural Analytics Alliance between several Göttingen institutions to analyze digitized cultural heritage data using computational methods. It discusses potential areas for collaboration including developing new metadata services, conducting joint research projects, improving education offerings, and coordinating open source software development. Establishing this network could help connect experts, pursue funding opportunities, and strengthen partnerships with other cultural heritage organizations internationally.
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021) – Péter Király
This document summarizes a presentation about developing a quality assessment tool for MARC21 catalogues. The tool allows users to:
1) Ingest MARC21 records
2) Measure the records against definitions in the MARC21 standard
3) Aggregate results and generate reports
4) Evaluate results with cataloging experts to improve record quality
The presentation demonstrates the tool and its ability to identify issues in fields, provide definitions, search terms, and eventually link terms to controlled vocabularies through cooperation with other projects.
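The aggregation step (step 3 above) can be sketched as a simple frequency count over per-record issue lists. The issue strings below are hypothetical examples, not output of the actual tool.

```python
from collections import Counter

# Sketch of aggregating per-record validation results into a report of
# the most frequent problems across a whole catalogue.
def aggregate_issues(per_record_issues):
    counts = Counter()
    for issues in per_record_issues:
        counts.update(issues)
    return counts

reports = [
    ["undefined subfield: 245$x"],
    ["undefined subfield: 245$x", "undefined field: 999"],
    [],                                 # a clean record contributes nothing
]
for issue, freq in aggregate_issues(reports).most_common():
    print(f"{freq:5d}  {issue}")
```

Sorting by frequency lets cataloguing experts triage the most common problems first, which is the point of step 4.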
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021) – Péter Király
1) The document discusses data quality management and introduces metrics for assessing metadata quality. It provides examples of structural issues found in metadata records and outlines a proposed framework for measuring metadata quality.
2) A key hypothesis is that measuring structural elements can approximate metadata record quality. An organizational proposal suggests forming a metadata quality committee, and a technical proposal is to create a generic tool to measure metadata quality across different schemas.
3) The document demonstrates metadata quality dashboards and encourages cooperation on related open source projects and research on measuring metadata quality.
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022) – Péter Király
The document describes a Metadata Quality Assessment Framework (MQAF) API that can validate JSON, XML, CSV, and MARC data against SHACL-like constraints. The MQAF API implements a subset of SHACL tests to validate data elements, including tests for data types, lengths, patterns, logical rules and more. It provides a Java API and configuration files to define validation rules for different data formats and schemas in an abstracted way.
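Such SHACL-like, configuration-driven checks can be sketched in a few lines. The rule names and record layout below are illustrative, not the actual MQAF configuration syntax.

```python
import re

# Sketch of SHACL-like validation driven by a declarative rule set, in the
# spirit of the MQAF configuration files described above. Each rule mimics a
# SHACL constraint: minCount, minLength, pattern.
RULES = {
    "title": {"minCount": 1, "minLength": 3},
    "year":  {"minCount": 1, "pattern": r"^\d{4}$"},
    "isbn":  {"minCount": 0, "pattern": r"^[\dX-]+$"},
}

def check(record):
    """Validate one record (dict of field name -> list of values)."""
    problems = []
    for field, rule in RULES.items():
        values = record.get(field, [])
        if len(values) < rule.get("minCount", 0):
            problems.append(f"{field}: too few values")
        for v in values:
            if "minLength" in rule and len(v) < rule["minLength"]:
                problems.append(f"{field}: value too short: {v!r}")
            if "pattern" in rule and not re.match(rule["pattern"], v):
                problems.append(f"{field}: pattern mismatch: {v!r}")
    return problems

print(check({"title": ["Faust"], "year": ["18o8"]}))  # '18o8' fails the pattern
```

Keeping the rules in data rather than code is what lets one engine serve JSON, XML, CSV and MARC alike: only the element-addressing layer differs per format.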
FRBR a book history perspective (Bibliodata WG 2022) – Péter Király
This document discusses applying FRBR and related bibliographic models from a book history perspective. It identifies several issues with modeling complex bibliographic relationships, including identifier problems when the same work is referenced in different ways, cardinality issues when modeling ownership relationships, and granularity issues in modeling different levels of bibliographic information. It also discusses technological challenges in applying these models, such as adapting them for different metadata schemas and handling uncertainties. The document suggests greater involvement in standardization, publishing research data in linked open data formats, and sharing data between researchers and heritage organizations to help address these issues.
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022) – Péter Király
This document summarizes a presentation about sustainable research data management using Dataverse. It discusses the Göttingen eResearch Alliance which manages Göttingen Research Online (GRO), including the GRO.data repository for publishing research data. GRO.data uses Dataverse to provide an open repository for research institutions in Göttingen. The presentation provides information on using GRO.data, testing sessions held to try the system, collaborations with other Dataverse groups, contributions to the Dataverse community, and plans for an eResearch Lab and improving metadata quality assessment.
Understanding, extracting and enhancing catalogue data (CE Book history works... – Péter Király
This document discusses understanding, extracting, and enhancing catalogue data from library records. It describes analyzing publication date and place information, normalizing place names and dates, and linking place names to geographic coordinates. Tables show results of applying these techniques to records from the Austrian National Library, Hungarian Academy of Sciences, and Polish National Library. The document provides references to related analyses, code repositories, and contact information.
Measuring cultural heritage metadata quality (Semantics 2017) – Péter Király
This document discusses measuring the quality of cultural heritage metadata. It proposes a generic "Metadata Quality Assurance Framework" tool to measure metadata quality across different schemas. The tool would measure completeness, availability, and other dimensions using structural analysis and by mapping metadata elements to discovery functions. It would provide customizable and scalable quality reports to help data curators improve metadata. The document outlines technical requirements and modules for an open source tool to systematically measure metadata at the record and aggregate level.
Measuring Metadata Quality in Europeana (ADOCHS 2017) – Péter Király
This document discusses measuring metadata quality for records in Europeana. It proposes establishing a Europeana Data Quality Committee and developing a "Metadata Quality Assurance Framework" tool to measure metadata quality across Europeana's large collection. Key metrics would include completeness, field cardinality, uniqueness, multilinguality and conformance to requirements. The tool would provide customizable quality measurements, reports, and recommendations to help improve metadata quality.
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018) – Péter Király
This document discusses evaluating data quality in Europeana by developing metrics for multilinguality. It identifies processes that contribute to multilinguality in metadata and proposes dimensions like completeness, consistency, conformity and accessibility to quantify multilinguality. Results of applying these metrics to Europeana data are presented, including the number of languages, language-tagged literals and their distribution. A demo of the analysis is also provided. Future work includes embedding the metrics into Europeana's workflow and further evaluation.
Researching metadata quality (ORKG 2018) – Péter Király
This document discusses metadata quality and metrics for evaluating metadata. It defines metadata as structured information that describes something else. Metadata quality is described as fulfillment of specifications and goals. General metrics for metadata quality include completeness, accuracy, consistency, objectiveness, appropriateness, and correctness. For linked data, additional dimensions and metrics are proposed such as accessibility, intrinsic qualities, contextual relevance, and representational properties. Good metrics are said to be clear, realistic, measurable, discriminating, and universal. The document discusses using RDFUnit, SHACL and ShEx for evaluating linked data and using clustering algorithms like K-means to analyze metadata quality.
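Clustering records by their quality metrics, as mentioned above, can be sketched with a tiny pure-Python K-means. The data points are invented for illustration (e.g. completeness and accuracy scores in [0, 1]); a real analysis would use measured metric vectors.

```python
# Tiny K-means sketch: cluster records by quality-metric vectors so that
# groups of similar-quality records (and outliers) become visible.
def kmeans(points, k, iterations=20):
    centroids = points[:k]          # deterministic init: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to the nearest centroid (squared Euclidean)
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # move each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# two obviously low-quality and two high-quality records
points = [(0.1, 0.2), (0.15, 0.1), (0.9, 0.8), (0.95, 0.9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

In practice one would use a library implementation (e.g. scikit-learn or Spark MLlib) with proper initialization; the sketch only shows the mechanics.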
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018) – Péter Király
This document discusses metadata quality in cultural heritage institutions. It provides examples of common metadata issues such as inconsistent date formats, non-informative titles, multilinguality problems, and copy-and-paste cataloging. It also discusses metrics for measuring metadata quality, such as completeness, accuracy, and consistency. Additionally, it proposes using a "Metadata Quality Assurance Framework" tool to measure metadata quality at large scale, generate reports for data curators, and help improve metadata quality over time.
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018) – Péter Király
The document discusses measuring metadata quality in Europeana. It proposes using metrics like completeness to assess metadata records on a scale from good to bad. It suggests developing a Metadata Quality Assessment Framework tool to measure structural elements and functional requirements to approximate metadata quality. The tool would generate reports and be adaptable, scalable, and open source. It would involve ingesting metadata via APIs, analyzing it using Hadoop and Spark, and presenting results through a web interface.
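The completeness metric itself is simple: the share of schema fields a record fills in, optionally restricted to a mandatory subset. The field list below is illustrative, not Europeana's actual schema.

```python
# Sketch of a completeness metric over a flat record.
SCHEMA = ["title", "creator", "date", "subject", "rights"]
MANDATORY = ["title", "rights"]

def completeness(record, fields):
    """Fraction of the given fields that have a non-empty value."""
    filled = sum(1 for f in fields if record.get(f))
    return filled / len(fields)

record = {"title": "Codex", "rights": "CC BY", "date": ""}  # empty date does not count
print(completeness(record, SCHEMA))      # overall completeness
print(completeness(record, MANDATORY))   # mandatory-field completeness
```

Richer variants weight fields by how much they support discovery functions (search, identification, display), which is the "functional requirements" angle mentioned above.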
This document discusses measuring library catalogs and introduces the MARC (MAchine-Readable Cataloging) format. It provides an example MARC record and explains the positional fields in the Leader and 008 fields. It also covers MARC data fields, different MARC versions, and addressing MARC elements using MARCspec. The second part discusses validating MARC records, including validating individual records, getting a summary of errors, and specifying the MARC version and output format. It also covers processing a subset of records and fixing ALEPHSEQ placeholders.
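Addressing MARC elements with MARCspec can be sketched for the basic tag-plus-subfield form such as `245$a`. Only that form is handled here; real MARCspec also covers indicators, character positions and ranges, and the record structure below is a simplified stand-in.

```python
# Sketch of resolving a MARCspec-like path ("245$a") against a record
# represented as a list of (tag, [(subfield_code, value), ...]) pairs.
def resolve(record, spec):
    tag, _, code = spec.partition("$")
    values = []
    for field_tag, subfields in record:
        if field_tag != tag:
            continue
        if code:
            values.extend(v for c, v in subfields if c == code)
        else:                       # bare tag: all subfield values
            values.extend(v for _, v in subfields)
    return values

record = [
    ("245", [("a", "Metadata quality"), ("c", "Péter Király")]),
    ("650", [("a", "Cataloging")]),
]
print(resolve(record, "245$a"))  # ['Metadata quality']
```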
This document provides an introduction to the Shapes Constraint Language (SHACL) for validating RDF data against a set of rules defined in a shape graph. It demonstrates how to use SHACL to validate data using the shaclvalidate.sh tool and explains common SHACL constraints for defining minimum and maximum counts, expected data types, allowed values, and more. Core SHACL concepts covered include shapes, properties, constraints, and validating RDF data.
Measuring Metadata Quality (ELAG, 2018) – Péter Király
This document discusses measuring metadata quality by analyzing structural elements of metadata records. It proposes using generic metrics like completeness, uniqueness and data type conformance to approximate a record's quality. Measuring requirements driven by specific user tasks and discovery scenarios is also suggested. The goals are to improve metadata, ensure reliable system functions, and propagate best practices. A batch processing API and further human analysis are presented as next steps.
Measuring completeness as metadata quality metric in Europeana (DH 2017) – Péter Király
This document summarizes a presentation about measuring metadata completeness as a quality metric for records in Europeana, a digital library of over 53 million items from European cultural heritage institutions. The presentation proposes developing a Metadata Quality Assurance Framework tool to measure completeness at the overall, collection, and record level based on structural elements and support of functional requirements. Metrics would help identify records needing improvement and support improving metadata quality in Europeana.
Nothing is created, nothing is lost, everything changes (ELAG, 2017) – Péter Király
This document discusses measuring and visualizing data quality in Europeana. It proposes establishing a Europeana Data Quality Committee to analyze metadata quality, develop metrics and problem definitions. Metrics could measure completeness, availability, licensing and other dimensions. Problems like duplicate titles and descriptions are defined. A flexible tool is proposed to measure metadata quality across schemas through APIs and reporting. The results would help improve metadata and documentation.
Towards an extensible measurement of metadata quality (DATeCH 2017) – Péter Király
This document discusses measuring metadata quality by analyzing structural elements of metadata records. It proposes that by measuring properties like field cardinality, uniqueness, multilinguality, and presence of non-informative values, the quality of metadata records can be predicted. The document outlines various metrics that could be measured at the record, collection, and overall dataset level. It also describes how measurements could be aggregated and visualized to identify outliers and opportunities for improvement.
Stiller & Király, Multilinguality of Metadata – Péter Király
1. The document discusses measuring the multilingual degree of metadata in Europeana, a platform for cultural heritage materials.
2. It proposes a "multilingual score" to quantify the multilinguality of metadata based on factors like number of languages, language tags, and literals per language.
3. It describes implementing systems to automatically calculate multilingual scores from Europeana metadata and visualize the results.
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana's... – Péter Király
1. The document discusses measuring the multilingual degree of metadata in Europeana, a platform providing access to over 54 million digital cultural heritage objects from over 50 languages.
2. It presents a multilingual score for metadata based on factors like presence of language tags, number of languages per field, and links to multilingual vocabularies.
3. The score is implemented by processing Europeana metadata using techniques like Apache Spark and visualized through APIs and tools to analyze the distribution of languages and identify areas for improvement.
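A multilinguality measure of the kind described in the two items above can be sketched as follows. The scoring here (average number of distinct language tags per field) is illustrative only, not the published Europeana scoring scheme, and the record layout is an assumption.

```python
# Sketch of a simple multilinguality measure: count distinct language tags
# among a field's literals; score the record as the average over fields.
def field_score(literals):
    """literals: list of (language_tag_or_None, value) pairs."""
    languages = {lang for lang, _ in literals if lang}
    return len(languages)

def record_score(record):
    if not record:
        return 0.0
    return sum(field_score(lits) for lits in record.values()) / len(record)

record = {
    "title":   [("en", "The Raven"), ("de", "Der Rabe")],
    "subject": [("en", "Poetry"), (None, "untagged literal")],
}
print(record_score(record))  # average number of languages per field
```

The published metrics also reward links to multilingual vocabularies (where one URI carries labels in many languages), which a literal-counting sketch like this cannot see.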
Global Situational Awareness of A.I. and where it's headed – Vikram Sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Build applications with generative AI on Google Cloud – Márton Kodok
We will explore Vertex AI Model Garden powered experiences and the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to:
- execute prompts in text and chat
- cover multimodal use cases with image prompts
- fine-tune and distill models to improve knowledge domains
- run function calls with foundation models to optimize them for specific tasks
At the end of the session, developers will understand how to innovate with generative AI and develop apps following current industry trends.
Towards an extensible measurement of metadata quality (DATeCH 2017)Péter Király
This document discusses measuring metadata quality by analyzing structural elements of metadata records. It proposes that by measuring properties like field cardinality, uniqueness, multilinguality, and presence of non-informative values, the quality of metadata records can be predicted. The document outlines various metrics that could be measured at the record, collection, and overall dataset level. It also describes how measurements could be aggregated and visualized to identify outliers and opportunities for improvement.
Stiller & Király, Multilinguality of MetadataPéter Király
1. The document discusses measuring the multilingual degree of metadata in Europeana, a platform for cultural heritage materials.
2. It proposes a "multilingual score" to quantify the multilinguality of metadata based on factors like number of languages, language tags, and literals per language.
3. It describes implementing systems to automatically calculate multilingual scores from Europeana metadata and visualize the results.
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...Péter Király
1. The document discusses measuring the multilingual degree of metadata in Europeana, a platform providing access to over 54 million digital cultural heritage objects from over 50 languages.
2. It presents a multilingual score for metadata based on factors like presence of language tags, number of languages per field, and links to multilingual vocabularies.
3. The score is implemented by processing Europeana metadata using techniques like Apache Spark and visualized through APIs and tools to analyze the distribution of languages and identify areas for improvement.
1. Measuring library catalogs
ADOCHS meeting
Royal Library, Brussels, 2017-11-21.
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
2. Part I. Introduction to MARC
❏ MAchine Readable Catalog
❏ format and semantic specification
❏ comes from the age of punch cards - information compression
❏ invented in the early 1960s
❏ even the lapidary “MARC must die” article* celebrated its 15th anniversary
last month, but MARC is still alive
❏ „There are only two kinds of people who believe themselves able to read a
MARC record without referring to a stack of manuals: a handful of our top
catalogers and those on serious drugs.”
* by Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
3. an example
LEADER 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
250 $aNeubearb. 2003$bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter,$c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
4. Positional fields - Leader
00928nam a2200265 c 4500
segmented by position: 00928 | n | a | m | ' ' | a | 2 | 2 | 00265 | ' ' | c | ' ' | 4 | 5 | 0 | 0
❏ LDR/0-4 Record length: ‘00928’ - is a number padding with 0-s (max. value: 99999)
❏ LDR/5 Record status: ‘n’ - is a dictionary term, means “new”
❏ LDR/6 Type of record: ‘a’ - is a dictionary term, means “Language material”
❏ LDR/7 Bibliographic level: ‘m’ - means “Monograph/Item”
❏ ...
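The fixed positions above can be read with plain substring arithmetic. A minimal sketch (not the project's API; class and method names are illustrative):

```java
public class LeaderSketch {
    // The Leader is a fixed-length, 24-character string; each position has a defined meaning.
    public static String recordLength(String leader)     { return leader.substring(0, 5); } // LDR/0-4
    public static char recordStatus(String leader)       { return leader.charAt(5); }       // LDR/5
    public static char typeOfRecord(String leader)       { return leader.charAt(6); }       // LDR/6
    public static char bibliographicLevel(String leader) { return leader.charAt(7); }       // LDR/7
}
```

For the Leader above, recordLength returns "00928", recordStatus 'n' ("new"), typeOfRecord 'a' ("Language material"), bibliographicLevel 'm' ("Monograph/Item").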
5. Record type
Type of record | Bibliographic level | type
---------------|---------------------|---------------------
a              | a, c, d or m        | Books
a              | b, i or s           | Continuing Resources
t              |                     | Books
c, d, i or j   |                     | Music
e or f         |                     | Maps
g, k, o or r   |                     | Visual Materials
m              |                     | Computer Files
p              |                     | Mixed Materials
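This mapping can be implemented as a simple lookup on Leader/6 (type of record) and Leader/7 (bibliographic level). A hedged sketch with illustrative names, not the project's implementation:

```java
public class RecordTypeSketch {
    // derive the record type from Leader/6 (type) and Leader/7 (bibliographic level)
    public static String of(char type, char level) {
        if (type == 'a' && "acdm".indexOf(level) >= 0) return "Books";
        if (type == 'a' && "bis".indexOf(level) >= 0)  return "Continuing Resources";
        if (type == 't')                               return "Books";
        if ("cdij".indexOf(type) >= 0)                 return "Music";
        if ("ef".indexOf(type) >= 0)                   return "Maps";
        if ("gkor".indexOf(type) >= 0)                 return "Visual Materials";
        if (type == 'm')                               return "Computer Files";
        if (type == 'p')                               return "Mixed Materials";
        return "Unknown";
    }
}
```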
6. Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
structure of the 40 positions:
❏ 00-17: common for all types (part I)
❏ 18-34: type-specific part
❏ 35-39: common for all types (part II)
7. Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
position maps by record type:
aaaaaabccccddddeeefffgh   All materials
IIIIjkLLLLmnopqr          Books
ijklmnOOOpqrs             Continuing Resources
iijklmNNNNNNOOp           Music
IIIIjjklmnOO              Maps
Iiijklmn                  Visual Materials
ijkl                      Computer Files
i                         Mixed Materials
legend: lower case = distinct units, upper case = repeatable units, blank = undefined position
which map applies depends on the record type (calculated from Leader values)
8. Datafields
A data field (repeatable/non-repeatable) consists of:
❏ Indicator1, Indicator2: always 1 char long dictionary terms
❏ Subfield1, …, Subfieldn (repeatable/non-repeatable), each with
  ❏ a code
  ❏ a value, which may be
    ❏ free text
    ❏ a dictionary term
    ❏ fixed format (e.g. yymmdd)
    ❏ fixed format + dictionary terms (d7i2)
    ❏ fixed positions + dictionary terms
9. Versions
❏ Changes of the standard
❏ No versioning
❏ New, deleted and changed elements every year
❏ Localized versions
❏ Introducing new fields
❏ Overwriting existing fields
❏ Mixing localized versions
❏ No notion about the localization
❏ 50+ localizations (international, national, consortial)
10. Handling versions (020, ISBN)
// standard subfields: code, label, cardinality (NR = non-repeatable, R = repeatable)
setSubfieldsWithCardinality(
  "a", "International Standard Book Number", "NR",
  "c", "Terms of availability", "NR",
  "q", "Qualifying information", "R",
  ...
);
// subfields that were valid in earlier editions of the standard
setHistoricalSubfields(
  "b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
// subfields defined only in a localized version (here: Deutsche Nationalbibliothek)
putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
  new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R") // "ISBN with hyphens"
));
11. Addressing elements - MARCspec
XML: XPath – W3C standard
JSON: JSONPath – by Stefan Gössner (http://goessner.net/articles/JsonPath/)
MARC: MARCspec – by Carsten Klee (Zeitschriftendatenbank, Berlin)
❏ 260 – a field
❏ 245^2 – the second indicator of a field
❏ 700[0] – the first instance of a field
❏ 245$c – a subfield
❏ 245$b{007/0=a|007/0=t} – subfield ‘b’ of field ‘245’, if the character at
position ‘0’ of field 007 equals ‘a’ OR ‘t’
❏ 020$c{$q=paperback} – subfield ‘c’ if subfield ‘q’ equals ‘paperback’
http://marcspec.github.io/MARCspec/marc-spec.html
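For illustration, the simplest MARCspec forms ("260", "245$c") can be resolved against a toy record model; a real implementation must also handle indicators, field repetition, character positions and conditions. The class and method names below are made up for this sketch:

```java
import java.util.Map;

public class MarcSpecToy {
    // toy record model: field tag -> (subfield code -> value); ignores repetition and indicators
    public static String resolve(Map<String, Map<String, String>> record, String spec) {
        String[] parts = spec.split("\\$", 2);       // "245$c" -> ["245", "c"]
        Map<String, String> field = record.get(parts[0]);
        if (field == null) return null;              // field not present in the record
        return parts.length == 1 ? field.toString()  // whole field (toy rendering)
                                 : field.get(parts[1]);
    }
}
```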
12. Part II.
record validation
and quality assurance
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
13. validating individual records
./validator [file]
001999999 852 undefined subfield L
https://www.loc.gov/...
002000005 035 undefined subfield 9
https://www.loc.gov/...
002000005 852 undefined subfield L
https://www.loc.gov/...
002000005 852 undefined subfield L
https://www.loc.gov/...
002000008 035 undefined subfield 9
https://www.loc.gov/…
16. viewing/filtering/selecting records
Displaying a record with a given ID
./formatter --id "002032820" [file]
Displaying records matching a query
./formatter --search '245$c=Shakespeare' [file]
Retrieving given elements
./formatter --selector '245$c' [file]
17. calculating Thompson-Traill completeness
Thompson and Traill (2017): Leveraging Python to improve ebook metadata selection, ingest, and management. Code4Lib Journal 38, http://journal.code4lib.org/articles/12828
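Thompson and Traill score ebook records by awarding points for the presence and count of fields they rate as valuable (ISBNs, authors, subject headings, table of contents and so on). A rough sketch of the idea; the category handling here is illustrative, not their exact rubric (some of their real categories cap the points per category):

```java
import java.util.List;
import java.util.Map;

public class TTSketch {
    // sum the occurrence counts of the rated tags found in the record
    public static int score(Map<String, Integer> tagCounts, List<String> ratedTags) {
        int score = 0;
        for (String tag : ratedTags) {
            score += tagCounts.getOrDefault(tag, 0);
        }
        return score;
    }
}
```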
19. K-means clustering
Spark (Scala)
Increasing the number of clusters decreases the distance from the centroids;
after a point this gain is not so big (“elbow effect”), at least in theory.
Observed clusters:
❏ a big number of low quality records
❏ small clusters with ‘in between’ quality records
❏ the acceptable average
❏ clusters with good quality records
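The elbow criterion compares the within-cluster sum of squares (WSS) as the number of clusters grows; when adding another cluster no longer reduces WSS much, that number is chosen. A minimal one-dimensional illustration (the real analysis ran on record feature vectors in Spark):

```java
public class ElbowSketch {
    // within-cluster sum of squared distances: each point is charged to its nearest centroid
    public static double wss(double[] points, double[] centroids) {
        double sum = 0.0;
        for (double p : points) {
            double best = Double.MAX_VALUE;
            for (double c : centroids) {
                double d = (p - c) * (p - c);
                if (d < best) best = d;
            }
            sum += best;
        }
        return sum;
    }
}
```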
20. Indexing with Solr
"marc-tags" format
"100a_ss": "Jung-Baek, Myong Ja",
"100ind1_ss": "Surname",
"245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
"human-readable" format
"MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"MainPersonalName_type_ss": "Surname",
"Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
"mixed" format
"100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"100ind1_MainPersonalName_type_ss": "Surname",
"245a_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
How to name the fields?
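The three schemes above differ only in which parts of the name they keep: the MARC tag plus subfield code, the semantic label path, or both. A sketch of the name construction (illustrative helper, not the tool's code; "_ss" is commonly a multivalued string dynamic field in Solr schemas):

```java
public class SolrFieldNameSketch {
    public static String marcTags(String tag, String code) {
        return tag + code + "_ss";              // e.g. 245c_ss
    }
    public static String humanReadable(String path) {
        return path + "_ss";                    // e.g. Title_responsibilityStatement_ss
    }
    public static String mixed(String tag, String code, String path) {
        return tag + code + "_" + path + "_ss"; // e.g. 245c_Title_responsibilityStatement_ss
    }
}
```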
23. Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
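Facet values like these publisher-name variants can be grouped for review with a crude normalization key. This is only an illustration of the idea, not part of the project:

```java
public class PublisherKeySketch {
    // lowercase, unify the conjunctions "und" / "u." / "et" with "&", collapse whitespace
    public static String key(String name) {
        return name.toLowerCase()
                   .replaceAll("\\b(und|et)\\b|\\bu\\.", "&")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}
```

Note that such a key groups mechanical variants only; genuine typos ("Bandenhoed", "Reprecht") still need fuzzy matching or human review.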
29. reproducibility of science
❏ reaching users (first one: Gent)
❏ making usage easy (downloadable binaries, helper scripts, documentation)
❏ distribution via Maven Central
❏ continuous integration (Travis CI)
❏ code coverage report
❏ list of freely reusable library catalogs
❏ licensing (GPL-3.0)
30. available catalogs to measure
❏ Library of Congress
❏ Harvard University Library
❏ Columbia University Library
❏ Deutsche Nationalbibliothek
❏ Universiteitsbibliotheek Gent
❏ Bibliotheksservice-Zentrum Baden-Württemberg
❏ Bibliotheksverbund Bayern
❏ University of Michigan Library
❏ Toronto Public Library
❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB)
❏ Répertoire International des Sources Musicales
❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich)
❏ British Library
❏ Talis
https://github.com/pkiraly/metadata-qa-marc#datasources
31. Future work
❏ implementing more validation rules
❏ visual dashboard
❏ communication with catalogers
❏ writing articles/dissertation
32. Authority entries
Responsibility statement:
Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en
Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent
(vormgeving).
Authority entries:
❏ Herr Seele
❏ Coussement, Toon
❏ Claes, Peter
❏ Van Sande, Hera
33. everything else
… at least regarding this project
https://github.com/pkiraly/metadata-qa-marc
https://twitter.com/kiru
peter.kiraly@gwdg.de