The European Holocaust Research Infrastructure (EHRI) is a large-scale EU project that involves 23 institutions and archives working on Holocaust studies from Europe, Israel and the US. In its first phase (2011-2015) it aggregated archival descriptions and materials on a large scale and built a Virtual Research Environment (portal) for Holocaust researchers, based on a graph database.
In its second phase (2015-2019), EHRI2 seeks to enhance the gathered materials using semantic approaches: enrichment, coreferencing, interlinking. Semantic integration involves four of the 14 EHRI2 work packages and helps integrate databases, free text, and metadata to interconnect historical entities (people, organizations, places, historic events) and create networks. We will present some of the EHRI2 technical work, including critical issues we have encountered.
WP10 (EAD) converts archival descriptions from various formats to standard EAD XML; transports EADs using OAI-PMH or ResourceSync; ingests EADs into the EHRI database; and enables use cases such as synchronization and coreferencing of textual Access Points to proper thesaurus references.
WP11 (Authorities and Standards) consolidates and enlarges the EHRI authorities to make the indexing and retrieval of information more effective. It addresses Access Points in ingested EADs (normalization of Unicode, spelling, punctuation; deduplication; clustering; coreferencing to authority control); Subjects (deployment of a Thesaurus Management System in support of the EHRI Thesaurus Editorial Board); Places (coreferencing to Geonames); Camps and Ghettos (integrating data with Wikidata); Persons and Corporate Bodies (using USHMM HSV and VIAF); semantic (conceptual) search, including hierarchical query expansion; interconnectivity of archival descriptions; permanent URLs; metadata quality; EAD RelaxNG and Schematron schemas and validation, etc.
WP13 (Data Infrastructures) builds up domain knowledge bases from institutional databases by using deduplication, semantic data integration, semantic text analysis. It provides the foundation for research use cases on Jewish Social Networks and their impact on the chance of survival.
WP14 (Digital Historiography Research) works on semantic text analysis (semantic enrichment), text similarity (e.g. clustering based on neural networks, LDA, etc.), and geo-mapping. It develops Digital Historiography researcher tools, including prosopographical approaches.
"Semantic Integration Is What You Do Before The Deep Learning". dev.bg Machine Learning seminar, 13 May 2019.
It's well known that 80\% of the effort of a data scientist is spent on data preparation. Semantic integration is arguably the best way to spend this effort more efficiently and to reuse it between tasks, projects and organizations. Knowledge Graphs (KG) and Linked Open Data (LOD) have become very popular recently. They are used by Google, Amazon, Bing, Samsung, Springer Nature, Microsoft Academic, AirBnb… and any large enterprise that would like to have a holistic (360 degree) view of its business. The Semantic Web (web 3.0) is a way to build a Giant Global Graph, just like the normal web is a Global Web of Documents. IEEE already talks about Big Data Semantics. We review the topic of KGs and their applicability to Machine Learning.
The Galleries, Libraries, Archives and Museums (GLAM) sector deals with complex and varied data. Integrating that data, especially across institutions, has always been a challenge. Semantic data integration is the best approach to deal with such challenges. Linked Open Data (LOD) enable large-scale Digital Humanities (DH) research, collaboration and aggregation, allowing DH researchers to make connections between (and make sense of) the multitude of digitized Cultural Heritage (CH) available on the web. An upsurge of interest in semtech and LOD has swept the CH and DH communities. An active Linked Open Data for Libraries, Archives and Museums (LODLAM) community exists, CH data is published as LOD, and international collaborations have emerged. The value of LOD is especially high in the GLAM sector, since culture by its very nature is cross-border and interlinked. We present interesting LODLAM projects, datasets, and ontologies, as well as Ontotext's experience in this domain. An extended paper on these topics is also available. It has 77 pages, 67 figures, detailed info about CH content and XML standards, Wikidata and global authority control.
This document provides an overview of linked open data (LOD), ontologies, and cultural heritage projects and datasets. It begins with definitions of semantic technologies, ontologies, and LOD. It then discusses several key ontologies used for cultural heritage data, including CIDOC CRM, Schema.org, and Wikidata. The document outlines several LOD projects focused on cultural heritage, including Europeana, ResearchSpace, American Art Collaborative, and ConservationSpace. It provides examples of semantic searches and annotations using these ontologies and projects. In summary, the document introduces the concepts of LOD and relevant ontologies, and highlights several major cultural heritage projects that apply these semantic technologies.
Invited report at Digital Presentation and Preservation of Cultural and Scientific Heritage (DIPP 2018), Burgas, Bulgaria. Report: http://dipp.math.bas.bg/images/2018/019-050_32_11-iDiPP2018-34.pdf
Ontotext provides concise summaries in 3 sentences or less that provide the high level and essential information from the document.
Ontotext is a software company that works on semantic technologies and knowledge graphs for cultural heritage and digital humanities projects. They have developed applications like ResearchSpace for museums and ConservationSpace for conservation specialists. They also contribute linked open data to projects like Europeana, Wikidata, and DBpedia and provide consulting services to institutions like museums and national aggregators.
How GLAMs can use Wikipedia/Wikidata to make their collections globally accessible across languages.
Europeana Food and Drink content providers workshop, Athens, 18 May 2015
Wikidata is being considered as a target for Europeana's semantic strategy to enrich metadata and link cultural heritage objects. Wikidata provides multilingual information on entities like people, organizations, places and concepts from Wikipedia. It has potential to help solve Europeana's challenges around data diversity, quality and interlinking due to its connections to other vocabularies and ability to perform automatic enrichment. Europeana aims to semantically enrich its data and link objects to Wikidata entities to provide additional context. Content providers are also encouraged to contribute information to Wikidata to further enhance the semantic knowledge graph.
Europeana Food and Drink annual meeting, 20 Mar 2015, Athens, Greece. Full report: http://vladimiralexiev.github.io/pubs/Europeana-Food-and-Drink-Classification-Scheme-(D2.2).pdf
Information School, University of Washington, 2014-05-21: INFX 598 - Introducing Linked Data: concepts, methods and tools. Guest lecture (Module 9) "Doing Business with Semantic Technologies": Introduction to Ontotext and some of its products, clients and projects.
Also see video:https://voicethread.com/myvoice/#thread/5784646/29625471/31274564
Triplestores and inference, applications in Finance, text-mining. Projects and solutions for financial media and publishers.
Keystone Industrial Panel, ISWC 2014, Riva del Garda, 18 Oct 2014.
Thanks to Atanas Kiryakov for this presentation, I just cut it to size.
Full version of http://www.slideshare.net/valexiev1/gvp-lodcidocshort. Same is available on http://vladimiralexiev.github.io/pres/20140905-CIDOC-GVP/index.html
CIDOC Congress, Dresden, Germany
2014-09-05: International Terminology Working Group: full version.
2014-09-09: Getty special session: short version
CIDOC Congress, Dresden, Germany
2014-09-05: International Terminology Working Group: full version (http://vladimiralexiev.github.io/pres/20140905-CIDOC-GVP/index.html)
2014-09-09: Getty special session: short version (http://VladimirAlexiev.github.io/pres/20140905-CIDOC-GVP/GVP-LOD-CIDOC-short.pdf)
This document discusses semantic technologies for cultural heritage. It introduces Ontotext Corp, which develops semantic technology, and some of their projects involving cultural heritage data. These include the ResearchSpace project with the British Museum, projects involving Europeana like Bulgariana and Europeana Creative, and publishing Getty vocabularies as linked open data.
Nikola Ikonomov, Boyan Simeonov, Jana Parvanova and Vladimir Alexiev. In Digital Presentation and Preservation of Cultural and Scientific Heritage (DiPP 2013), Veliko Tarnovo, Bulgaria, Sep 2013.
Nikola Ikonomov, Boyan Simeonov, Jana Parvanova and Vladimir Alexiev. In Digital Presentation and Preservation of Cultural and Scientific Heritage (DiPP 2013), Veliko Tarnovo, Bulgaria, Sep 2013
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...Vladimir Alexiev, PhD, PMP
Vladimir Alexiev, Dimitar Manov, Jana Parvanova and Svetoslav Petrov. In proceedings of workshop Practical Experiences with CIDOC CRM and its Extensions (CRMEX 2013) at TPDL 2013, 26 Sep 2013, Valetta, Malta
Semantic Archive Integration for Holocaust Research: the EHRI Research Infrastructure
1. Semantic Archive Integration for Holocaust Research: the EHRI Research Infrastructure
Vladimir Alexiev & Ivelina Nikolova
Ontotext Corp
DSDH 2017, Venice, 30 June 2017
2. Ontotext
• Started in 2000 as a Semantic Web pioneer
− Started as an innovation lab within Sirma Group (listed as SKK), the biggest Bulgarian software house
− Spun off and took VC investment in 2008
− 65 staff, HQ in Bulgaria, reps in Canada, UK, Germany and USA
− Over 400 person-years invested in R&D
• Multiple innovation & technology awards:
− Washington Post, BBC, FT, BAIT, etc.
• Member of multiple industry bodies:
− W3C, EDMC, ODI, LDBC, STI, DBPedia Foundation
4. Ontotext Innovation & Consulting
• 37 EC research projects
− Bulgaria's most successful participant in the Framework Programmes
− 5-7 active at any time
• Semantic technology
− Semantic databases
− Semantic enrichment
− Multilingual translation
− Media, news, sentiments, rumors, diversity
− Life Sciences
− Data and enrichment marketplaces
− Commercial data and applications
− eGovernment
− LOD curriculum (education)
− Cultural heritage
5. EHRI Technical Work
• Main tech partners
− KCL: portal development and operation, graph database
− YV: thesauri and authorities
− DANS: data architecture, EAD mapping, EAD transport
− INRIA: standardization: EAD XSD, Schematron, TEI ODD, URLs
− ONTO: EAD conversion, ingest, TMS, authorities, semantic enrichment, semantic search, LOD
• EHRI2 technical work packages
− EAD (WP10) - converts archival descriptions from various formats to standard EAD XML
− Authorities & Standards (WP11) - consolidates and enlarges EHRI authorities
− Data infrastructures (WP13) - deduplication, semantic data integration, semantic text analysis.
− Digital Historiography Research (WP14) - researcher tools based on semantic analysis, including Prosopographical approaches.
6. EHRI Research Use Cases
• Names and Networks: Chances of Survival during the Holocaust. Most Jews needed the support of other people to survive; they found aiders inside their personal and group networks. Investigate the networks in which European Jews operated during their persecution in the Second World War, and improve understanding of the various chances of survival that persecuted Jews all over Europe had.
• In Search of a Better Life and a Safe Haven: Tracing the Paths of Jewish Refugees (1933-1945). Map and better understand the different migration trajectories and determine the factors that played a role in the migration movements of migrants or forcefully deported Jews.
• People on the Move: Revisiting Events and Narratives of the European Refugee Crisis (1930s-1950s). Investigate migration movements of European refugees. The International Tracing Service (ITS), an EHRI partner, has been a key actor in managing this twentieth-century migration crisis and has hence built the most important archive documenting the lives of refugees and displaced persons.
7. EHRI Research Use Cases
• Between Decision Making and Improvisation: Tracing and Explaining Patterns of Prisoners’ Transfers through the Concentration Camp System. Investigate how the SS in the years 1942-1945, via Hollerith Departments and punch-card technology, managed the transport of inmates through the concentration camp system that covered the whole of occupied Europe.
• Archives and Machine Learning. Investigate how digital methods might support archivists in the creation of interoperable and consistent descriptions of sources (metadata) and in the linking of sources. Estimate metadata quality using data mining and machine learning approaches.
• Networked Reading. What form do historical sources need to take in order to be processed using digital methods? What kind of infrastructure do we need to extract meaning from them? How can we apply digital methods in such a way that the results are verifiable as well as reproducible?
8. EHRI Document Blog
• Provides inspiration for:
− Research questions
− Research techniques
• Our task:
− Enable such tools at scale
− Help researchers use them
• Soliciting other interesting examples!
10. Controlled Access Points
• Some stats from end-2015
− 111k document units, 79k (70%) have access points
− number of access points: 686k (8.7x per object)
− 17% of a.p. instances have an object (link), 83% are just text
− Red represents errors, e.g. link=corporateBodyAccess should point to object=historicalAgent (or NONE), not to cvocConcept (concepts are not corporate bodies); a simple type-consistency check is sketched below
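Such mistyped links can be caught automatically. A minimal sketch in Python, assuming a mapping from access point type to expected target object type (only the corporateBodyAccess → historicalAgent pairing comes from this slide; the other entries are illustrative assumptions):

```python
# Minimal sketch (not EHRI production code): flag access point links whose
# target object type does not match the access point type.
from typing import Optional

EXPECTED_TARGET = {
    "corporateBodyAccess": {"historicalAgent"},  # from the slide
    "personAccess": {"historicalAgent"},         # assumption
    "subjectAccess": {"cvocConcept"},            # assumption
    "placeAccess": {"cvocConcept"},              # assumption
}

def check_access_point(ap_type: str, target_type: Optional[str]) -> str:
    """Return 'ok', 'unlinked' (plain text, no link) or 'error'."""
    if target_type is None:
        return "unlinked"   # the 83% of access point instances that are just text
    if target_type in EXPECTED_TARGET.get(ap_type, set()):
        return "ok"
    return "error"

print(check_access_point("corporateBodyAccess", "cvocConcept"))  # error
print(check_access_point("corporateBodyAccess", None))           # unlinked
```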
11. Controlled Access Point Problems
• 23 ways of spelling Łódź across institutions (a normalization sketch follows the list)
− Some of them are mapped to EHRI1 access point objects, others are plain text
Łódź (Ghetto)
Łódź (Poland)
Łódź (Poland),.
Łódz (Poland)
Łódź ehri-ghettos-513 cvocConcept
Łódź jc-places-place-iti-48 cvocConcept
Łódź terezin-places-place-iti-48 cvocConcept
Łódź (Poland)
Łódź – Łódź – Polen Łódź – Łódź – Poland
Lodz
Lodz terezin-places-place-iti-412 cvocConcept
Lodz (Polen)
Lodz - Poland
Lodz Ghetto
Lodz ghetto
Lodz ghetto, Poland
Lodz, Poland, Eastern Europe,
Lodz,Ghetto,Poland
Lodz,Lodz,Lodz,Poland
Lodz,,גטוPoland (note: "גטו" means Ghetto)
Lodž
Lodž jc-places-place-iti-48 cvocConcept
Lodž terezin-places-place-iti-48 cvocConcept
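A minimal normalization sketch (not the EHRI pipeline) showing how such spelling variants can be clustered by a key built from Unicode normalization, diacritic stripping, lower-casing and punctuation removal; note that Ł has no Unicode decomposition and needs an explicit mapping:

```python
# Minimal sketch: cluster access point strings by a normalization key.
import re
import unicodedata
from collections import defaultdict

def norm_key(s: str) -> str:
    s = re.sub(r"\(.*?\)", " ", s)             # drop qualifiers like "(Poland)"
    s = s.replace("Ł", "L").replace("ł", "l")  # Ł has no NFKD decomposition
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))  # strip diacritics
    s = re.sub(r"[^\w\s]", " ", s)             # drop punctuation
    return " ".join(s.lower().split())

variants = ["Łódź (Ghetto)", "Łódź (Poland)", "Łódz (Poland)", "Lodz",
            "Lodz Ghetto", "Lodz - Poland", "Lodž", "Lodz,Ghetto,Poland"]
clusters = defaultdict(list)
for v in variants:
    clusters[norm_key(v)].append(v)
for key, members in sorted(clusters.items()):
    print(key, "<-", members)
# Most bare place-name variants collapse onto the single key "lodz";
# the remaining keys differ only by the Ghetto/Poland qualifier words.
```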
12. Controlled Access Point Problems
• E.g. leading spaces, weird punctuation, local IDs
− YV uses <> as a marker for “missing place component”
Amtsgericht Welun
*Frankfurt/Main
.SUCHY BARBARA, DR
o(Außenpolitik)
<> Friesland <> The Netherlands placeAccess
Sofia Sofia <> Bulgaria placeAccess
<> <> <> Bulgaria placeAccess
(Kleinmann Ferdinand
(Pervomaisk (Olviopol, Bogopol, Golta
""She'erith Hapletah""
-9 Hitler als Staatsmann und Politiker
1 (Kirchen, Pazifisten, Sozialisten, Sonstige allg.)
1(Fürstenenteignung
• Empty a.p. or only an ID
-
.
1
100.
104. ID
13. Controlled Access Point Problems
• Mis-represented (mis-typed) access points:
1941 subjectAccess # it's just year
1942 placeAccess # it's just year
5(dt. Orte in alphabet. Folge) subjectAccess # System of Arrangement
A-Z subjectAccess # System of Arrangement
8f(United Nations Educational, Scientific & Cultural Organization [UNESCO]) subjectAccess # corporateAccess
8g(Organisation f. Ernährung u. Landwirtschaft [FAO]) subjectAccess # corporateAccess
915 (Balta, Chersson) subjectAccess # placeAccess
915 (Feodosia) subjectAccess # placeAccess
93. Koblenz subjectAccess # placeAccess
8. Florian Geyer subjectAccess # corporateAccess: 8th SS Cavalry Division Florian Geyer
• Compound (pre-coordinated) access points. Split up and match separately (see the sketch after this list)
Abortion--Poland--Oswiecim. subjectAccess # but there's also placeAccess
Abortion--Poland--Warsaw. subjectAccess # but there's also placeAccess
Abortion--Poland. subjectAccess # but there's also placeAccess
Abortion. subjectAccess # no relation between this and above 3
• Compound a.p. that should have been entered as separate a.p. entries
Aachen, Arnsberg, Baden, Bayern, Berlin, Bremen, Düsseldorf subjectAccess
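A minimal sketch of the atomization step for the pre-coordinated strings above (assuming the "--" separator shown in the examples; re-typing the resulting parts as subject vs place is a separate step):

def atomize(access_point: str):
    """Split a pre-coordinated access point into candidate atomic terms."""
    parts = [p.strip(" .") for p in access_point.split("--")]
    return [p for p in parts if p]

print(atomize("Abortion--Poland--Oswiecim."))                              # ['Abortion', 'Poland', 'Oswiecim']
print([p.strip() for p in "Aachen, Arnsberg, Baden, Bayern".split(",")])   # the comma-separated case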
#13
14. Building Up Authorities
• "Controlled" Access Points? Not at all, this is a misnomer
• Negative impact on:
− Search (user can't know what to search for
− Indexing (when creating a link on the portal, indexers not sure what to select)
• Need to harmonize at the portal and build up authorities.
− Unicode normalization
− Removing punctuation and other cleanup
− Splitting up compounds (atomization)
− Heuristics to treat a.p. type: can't trust it completely, but shouldn't ignore it
• Status per kind
− Places: well advanced; Geonames used as the reference database (80k out of its 10M entries)
− Concepts: the EHRI Thesaurus is made sustainable through a TMS and an Editorial Board, but it is still small (2-5k)
− Ghettos & Camps: matching to Wikidata to bring in more data and harmonize
− Persons: evaluating VIAF and USHMM databases for coverage
− Organizations: should start by harvesting corporateAccess across providers
#14
15. Authorities-Related Tasks
• Consolidation and enlargement of EHRI authorities
− To make indexing and retrieval of information more effective.
• Access Points in ingested EADs
− Normalization of Unicode, spelling differences, punctuation; deduplication; clustering; coreferencing to global
authorities where available
• Enlarging and integrating disparate EHRI thesauri
− Subjects: deployment of a Thesaurus Management System, editorial board, deciding whether "candidate"
concepts harvested from access points should be added to authorities
− Places: coreferencing to Geonames
− Camps and Ghettos: integrating data from Wikidata
− Persons, Corporate Bodies: starting
• Implementation of semantic (conceptual) search
− Including hierarchical query expansion
− Aiding indexing for better interconnectivity of archival descriptions
• Use a mix of automatic and manual approaches
#15
16. Geo-Referencing Service
• Uses Geonames to find place references
− In text (oral history), access points, local databases (USHMM Places)
• Based on Ontotext commercial application pipelines
• Specially tailored matching guidelines. E.g. not a place:
− University of [Chicago], [Tel Aviv] University, Veterinary Institute of [Alma-Ata]
− The interview was given to the [United States] Holocaust Memorial Museum on Oct. 30, 1992
− in the Theater de [Champs Elysee]
− [Washington] Post, [Washington] Monthly: publication
− USS [America]: ship
• Remove place types to increase recall and avoid false matches (see the sketch at the end of this slide):
− Kreis * (District in German)
− *falu (village in Hungarian)
− Ghetto (numerous villages in Italy)
− Canton (Swiss/French region, matches old name of Guangzhou, China)
• Prefer bigger cities, use the place hierarchy for context
• Hybrid application based on Machine Learning and refinement rules
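A toy sketch of the place-type guideline above (one plausible reading of the rule; the term lists are illustrative, not the production ones):

# generic place-type words stripped from mentions before gazetteer lookup
TYPE_PREFIXES = ("Kreis ",)     # 'Kreis Wetzlar' -> 'Wetzlar'
TYPE_SUFFIXES = ("falu",)       # Hungarian village suffix
# gazetteer entries whose name is itself a generic word are skipped
GENERIC_NAMES = {"Ghetto", "Canton"}

def clean_mention(mention: str) -> str:
    for p in TYPE_PREFIXES:
        if mention.startswith(p):
            mention = mention[len(p):]
    for s in TYPE_SUFFIXES:
        if mention.endswith(s):
            mention = mention[: -len(s)]
    return mention.strip()

def candidates(mention: str, gazetteer: dict):
    """gazetteer: plain name -> list of Geonames-like records."""
    name = clean_mention(mention)
    return [] if name in GENERIC_NAMES else gazetteer.get(name, [])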
#16
17. Geo-Referencing Service
• Sophisticated disambiguation mechanisms were developed, based on the following place
characteristics:
− place names in various languages: Oświęcim, Освенцим, Auschwitz
− synonym labels - Auschwitz-Birkenau, Birkenau
− place hierarchy
▪ Moscow (Russia): as opposed to the 23 places of the same name in the US, and a few more in
other countries
▪ Alexandrovka, Lviv, Ukraine: "Alexandrovka" is a very popular village name. So although this
doesn't lead to a single disambiguated place, it helps to reduce the set of possible instances from
about 70 to about 20.
− co-occurrence statistics based on Gold Standard over news corpora
− for populated places, we give priority to bigger places (a ranking sketch follows)
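A toy ranking sketch combining two of these signals, population and hierarchy context (field names loosely follow Geonames; the weights are invented for illustration):

def rank_candidates(candidates, context_places):
    """context_places: country/region names already seen in the same document."""
    def score(c):
        s = min(c.get("population", 0), 10_000_000) / 10_000_000    # prefer bigger places
        if c.get("country") in context_places or c.get("admin1") in context_places:
            s += 1.0                                                 # prefer candidates whose parents occur in context
        return s
    return sorted(candidates, key=score, reverse=True)

moscow = [
    {"name": "Moscow", "population": 12_000_000, "country": "Russia", "admin1": "Moscow"},
    {"name": "Moscow", "population": 25_000, "country": "United States", "admin1": "Idaho"},
]
print(rank_candidates(moscow, {"Russia"})[0]["country"])             # Russia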
#17
18. Geo-Referencing Service
• Geo-referenced place names are useful for various purposes:
− Geo-mapping of textual materials, as shown above.
− Other geographic visualizations,
e.g. of detention/imprisonment vs liberation
− The place hierarchy can be used to extract records related to a particular territory,
e.g. "Archival descriptions mentioning Ukraine" should find all records mentioning a place in
Ukraine
− Coordinates can be used to map places, and compute distance between places
− Places are an important characteristic to consider when deduplicating person records.
Certain probabilistic inferences can be made based on place hierarchy and proximity.
• Evaluation on 500 access points:
Type            Precision   Recall   F-measure
placeAccess     0.97        0.86     0.91
subjectAccess   0.88        0.90     0.89
#18
19. Geo-Referencing Service
Map of places extracted from a USHMM Oral History interview. Courtesy Tobias Blanke, King's College London, 2016.
#19
20. Geographic Challenges
• Geonames often includes historic place names
− Even historic countries (e.g. Czechoslovakia)
• Some challenges of Geonames for historical geography
− The Nazis renamed places,
e.g. Oświęcim → Auschwitz, Brzezinka → Birkenau.
− The Nazis established new administrative districts,
e.g. Reichskommissariat Ostland included Estonia, Latvia, Lithuania, the northeastern part of Poland
and the western part of the Byelorussian SSR
− Historic processes changed borders, place names and administrative subordination,
e.g. Wilno (Vilna) was part of Poland until 1939, when the USSR transferred it to Lithuania; it became the
Lithuanian capital Vilnius.
− The Geonames place hierarchy reflects modern geography,
e.g. "Wilna--Poland" is disambiguated as two places, "Vilnius, Lithuania" and "Poland"
• Fixes?
− Local Geonames fixes (e.g. North America < Americas the megacontinent, not the Cuban village)
− Considering local Geonames additions (e.g. Czechoslovakia the parent of Czech Republic and Slovakia)
− Considering other sources of historical geography (e.g. Spatial History Project)
#20
21. Subjects/Concepts: VocBench TMS
• VocBench: Best open source TMS
− Runs directly over semantic data (ONTO GraphDB); partnership with developer (UniRoma2)
− Used by EC OPOCE (EuroVoc), UN FAO (AgroVoc)
− E.g. multilingual labels
#21
23. Ghettos & Camps
• EHRI1 published Ghetto & Camp data, but it was poor
− E.g. camp 2030: only a label "Maly Trostinec"
• Wikidata knows (a lookup sketch follows this list):
− names and Wikipedia links in the following languages: Беларуская, Беларуская (тарашкевіца), Čeština,
Dansk, Deutsch, Español, Français, Frysk, Italiano, עברית, Nederlands, Norsk bokmål, Polski, Português, Русский,
Српски / srpski, Suomi, Svenska, Українська, 中文
− additional aliases, e.g. Vernichtungslager Maly Trostinez, KZ Maly Trostinez, Blagowschtschina
− country: Belarus
− location: 53°51'3"N, 27°42'17"E
− Authority IDs: Geonames, VIAF, Freebase
• Wikipedia knows a lot more, but the info is not structured
− names: Maly Trostinets, Maly Trastsianiets, Trasciane, Малы Трасцянец, Maly Tras’tsyanyets, Малый
Тростенец, Maly Trostinez, Maly Trostenez, Maly Trostinec, Klein Trostenez
− Location, Nazi admin district, date established, victim countries, victim places of origin
− killing grounds: Blagovshchina (Благовщина) forest, Shashkovka (Шашковка) forest
− known victims, perpetrators (and their fate)
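A minimal sketch of pulling such structured data from the public Wikidata API (wbsearchentities and wbgetentities are real endpoints; the first search hit is taken naively here and would need manual verification in practice):

import requests

API = "https://www.wikidata.org/w/api.php"

# find a candidate Wikidata item by label search
hit = requests.get(API, params={
    "action": "wbsearchentities", "search": "Maly Trostenets",
    "language": "en", "format": "json"}).json()["search"][0]

# fetch its labels, aliases and claims (P625 = coordinate location, P17 = country)
entity = requests.get(API, params={
    "action": "wbgetentities", "ids": hit["id"], "format": "json"}).json()["entities"][hit["id"]]

print(sorted(entity["labels"]))        # language codes for which a label exists
print(entity["claims"].get("P625"))    # coordinate claim, if present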
#23
24. Ghettos & Camps: Wikidata
• EHRI2 decided to coreference to Wikidata
− So we can get lots more labels, coordinates, place/location…
− And international collaboration! E.g. Wikidata.GLAM Facebook Group
− Camps and Ghettos on Wikidata: before (15 Mar 2017) and now (30 Jun 2017):
#24
25. Semantic Annotation (Text Enrichment)
• Entity extraction from free text
• GATE toolkit (University of Sheffield; ONTO is the second largest contributor)
• Large KB gazetteers
• Disambiguation is very important
• Relies on the existence of a KB with rich features (e.g. relations between entities)
• Enables semantic search and powerful research techniques
• Relation extraction can extract events & relations between people
• Will try this for the Jewish Council texts
#25
26. Jewish Social Networks
• Tasks
− Deduplication/coreferencing of person records (partial data!)
− Reconstruct family and acquaintance relations
− Do social network analysis
− Test hypotheses, e.g. on chances of survival
• Person Data Sources
− USHMM Persons (HSV) database (3.2M+1M names, 1.2M places, 2.5M dates...):
Coverage of Persons in Interviews is 15%
− Dutch Jewish Digital Monument website (100k's of names): we may not get access to it
− CDEC person database (10k's of names): upcoming
− Dutch Jewish Council proceedings: textual, not structured records; upcoming
− Virtual International Authority File (VIAF) (15M): only "famous" people, potential poor coverage of Holocaust
− Wikidata (2.5M): potential poor coverage of Holocaust domain
#26
27. Jewish Social Networks
• USHMM Survivors and Victims (persons) includes:
− 3.2M person records (public part), 3M in a private/secured part
− 1M additional names (392k patronymic, 233k mother's, 143k maiden, 105k father's, 68k spouse),
− 102k family relations to head of household,
− Identification numbers (folder/page/record/line, 147k prisoner identification, 142k age, 74k convoy, 49k depot,
53k transport identification, 42k number of people in transport)
− Historic dates (2.2M birth, 254k death, 78k arrival, 74k convoy, 55k departure, 50k transport, 19k liberation, 14k
arrest, 10k marriage, 5.3k birth of spouse),
− Historic places (871k birth, 436k wartime, 359k residence, 215k death, 85k registered, 74k convoy destination).
− Categorical or descriptive values (501k occupation, 371k nationality, 16k religion, 116k marital status, 103k
Holocaust fate, 53k ethnicity).
• Data came from 5-15k lists.
− Some persons have 5-6-7 records
#27
28. Reconstructing Families
• Example person record with Additional names:
− personId 123456: firstName “John”, lastName “Smith”, gender Male, nameSpouseMaiden “Matienzo”,
nameSpouse “Maria Smith”, dateMarriage 1921-01-05, nameChild “Mike Smith”, nameSibling “Jack Jones”.
• We can convert "additional names" into a network of "stub" persons and
events, and try to match to other person records
− The spouse (Maria) has gender Female and her maiden name is Matienzo
− They were married on the same date (in the same event)
− The child (Mike) has the person and his spouse respectively as Father and Mother
− The child’s Birth date was likely after the Marriage date
− The sibling (Jack) is the child’s uncle or aunt
• The next figure uses CIDOC CRM to represent persons and events, and AgRelOn
to represent person relations; a simplified expansion sketch follows
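A minimal sketch of this expansion for the example record above (field and relation names are illustrative; the actual representation uses CIDOC CRM and AgRelOn, as noted):

def expand_stubs(rec):
    """Create stub persons and relation tuples from a USHMM-style person record."""
    pid, stubs, rels = rec["personId"], [], []
    if rec.get("nameSpouse"):
        stubs.append({"id": f"{pid}-spouse", "name": rec["nameSpouse"],
                      "maidenName": rec.get("nameSpouseMaiden"),
                      "gender": "Female" if rec["gender"] == "Male" else "Male"})
        rels.append((pid, "hasSpouse", f"{pid}-spouse", rec.get("dateMarriage")))
    if rec.get("nameChild"):
        stubs.append({"id": f"{pid}-child", "name": rec["nameChild"]})
        rels.append((pid, "hasChild", f"{pid}-child", None))
        if rec.get("nameSpouse"):
            rels.append((f"{pid}-spouse", "hasChild", f"{pid}-child", None))
    if rec.get("nameSibling"):
        stubs.append({"id": f"{pid}-sibling", "name": rec["nameSibling"]})
        rels.append((pid, "hasSibling", f"{pid}-sibling", None))
    return stubs, rels

rec = {"personId": 123456, "firstName": "John", "lastName": "Smith", "gender": "Male",
       "nameSpouse": "Maria Smith", "nameSpouseMaiden": "Matienzo", "dateMarriage": "1921-01-05",
       "nameChild": "Mike Smith", "nameSibling": "Jack Jones"}
stubs, rels = expand_stubs(rec)    # 3 stub persons, 4 relations (the uncle/aunt link can be derived)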
#28
30. Person/Family Identification: Judgements
• Gold Standard with human judgement about possible matches
− Using name search and historic reasoning
− Difficult because of incomplete info
• Then we train a Machine Learning model for matching
• 5.1k lists (sources of USHMM persons)
− Some correspond to coherent dated historic events.
− Can provide probable info about each person in the list
− Copy "default values" (eg "child", "Polish") and dates & places to person
− Provide evidence the persons listed in that source may have known each other
#30
31. Example of Historic Reasoning
• Sof'ia Alkhazova (Latin, Cyrillic) was married to MOISEI CHERNIAKOVSKII
• But this can only be deduced from the source document, which has extra info missing from the
structured list:
− Evacuated from Odessa to Urgench (region Harezemsk, Uzbekistan): same place as husband
− Had a son named Boris Moiseev Chernyakovskiy: same patronymic & family name
#31
32. Historical and Place Reasoning
• After coreferencing to Geonames, we can use place hierarchy and proximity (based on coordinates)
to make likely inferences.
• E.g. OH interview RG-50.661.0001:
− "My Grandfather owned a bank in partnership with a cousin, Leibela Mandell"
− "The family remained in Poland until the second world war"
− Mentions Premishlan (Przemyslany in Polish)
• Search for Lejbela (First name, Soundex) and Mandel (Last name, exact) finds Lejb Ela MANDEL.
• List Initial Registration of Lublin's Jews - October 1939 and January 1940
− "listing of the male heads of households appears to have survived in the Lublin Judenrat files".
− Represents coherent historic event (context): place=Lublin, dates=1939-10 to 1940-01, gender=male.
• To the untrained ear the name Lejbela seems female
− But we learn from the interview that Leib is male: "second oldest son was Leib"
• Przemyslany is now Peremyshlyany, Lviv Oblast, Ukraine (not Poland).
− But Google Maps shows the distance from Lublin to Peremyshlyany is 263 km (a great-circle check follows below)
• So it is very likely that Lejb Ela MANDEL is the cousin mentioned in the interview.
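A minimal sketch of the distance check from Geonames coordinates using the haversine formula (coordinates below are approximate and typed in by hand; the 263 km above is a road distance, so the great-circle figure is smaller):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Lublin (approx. 51.25 N, 22.57 E) vs Peremyshlyany (approx. 49.67 N, 24.56 E)
print(round(haversine_km(51.25, 22.57, 49.67, 24.56)))    # about 225 km as the crow flies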
#32
33. Place Hierarchy for Query Expansion
• Query Expansion means finding items (persons, docs)
− Indexed by any sub-term (e.g. Amsterdam)
− When the user searches for a super-term (e.g. Netherlands)
• We can find all person records related to any place in a certain country (see the expansion sketch below).
• This is useful to make sub-selections for particular research purposes
− E.g. NIOD wants to investigate networks of Dutch Jews.
• USHMM has a list of 17k Dutch Jews
• But we believe we can extract a lot more Netherlands-related persons,
perhaps 100k
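A minimal sketch of the hierarchical expansion (the toy parent-to-children table stands in for the Geonames hierarchy):

def expand(place, children):
    """Return the place plus everything below it in the hierarchy."""
    seen, stack = set(), [place]
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(children.get(p, ()))
    return seen

children = {"Netherlands": ["North Holland", "Drenthe"],
            "North Holland": ["Amsterdam"],
            "Drenthe": ["Westerbork"]}
places = expand("Netherlands", children)

persons = [{"name": "A", "place": "Amsterdam"}, {"name": "B", "place": "Lublin"}]
print([p["name"] for p in persons if p["place"] in places])    # ['A']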
#33
34. Example of Semantic Faceted Search
• Hierarchical Facets: Food & Drink Topic, Geography
#34
35. Person Deduplication
• Data preprocessing
− Person Name Normalization, Daitch-Mokotoff encoding, Beider-Morse Phonetic (eg Maria Iszak -> MR ASSK)
− Statistical and rule-based models for assigning person gender,
e.g. Maria is "female" with probability 0.999 according to the linear classifier and 0.9 according to the rule-based model
− Linking USHMM Places to Geonames: proximity indicates likelihood
• Statistical model for automatic labeling of duplicate records
• Clustering of USHMM person records
• Uses a semantic database, e.g. Maria Izsak in ONTO GraphDB vs USHMM HSV
#35
36. Names Normalization
• Convert the name to lowercase.
• Remove all digits (0-9) and dashes (hyphen, en dash, em dash) from the name.
• Transliterate all symbols to Latin.
• Convert the name into Unicode NFKD normal form.
• Remove apostrophes (“'”, “’”, “`”) and all combining characters: spacing modifier letters,
combining diacritical marks, combining diacritical marks supplement, combining diacritical
marks for symbols, combining half marks.
• Replace punctuation characters (One of !"#$%&'()*+,-./:;<=>?@[]^_`{|}~) with spaces.
• Replace sequences of spaces with a single space.
• Remove any character different from space or lower case Latin letter.
• Remove all preceding or trailing white spaces from the name.
• Convert the name to title case.
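These steps can be sketched in Python roughly as follows (the transliteration step is delegated to the third-party unidecode package here, which is an assumption; the production code may use a different transliterator):

import re
import string
import unicodedata
from unidecode import unidecode    # assumed transliteration helper

def normalize_name(name: str) -> str:
    s = name.lower()
    s = re.sub(r"[0-9\-\u2013\u2014]", "", s)                      # digits and dashes
    s = unidecode(s)                                               # transliterate to Latin
    s = unicodedata.normalize("NFKD", s)
    s = s.replace("'", "").replace("\u2019", "").replace("`", "")  # apostrophes
    s = "".join(c for c in s if not unicodedata.combining(c))      # combining marks
    s = re.sub("[" + re.escape(string.punctuation) + "]", " ", s)  # punctuation -> spaces
    s = re.sub(r"\s+", " ", s)                                     # collapse spaces
    s = re.sub(r"[^a-z ]", "", s)                                  # keep lower-case Latin and space
    return s.strip().title()

print(normalize_name("  IZSÁK,  Mária  "))    # 'Izsak Maria'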
#36
37. ML Model for Person Gender
• Features to train an ML model for automatic labeling of person gender (a toy sketch follows below)
− Name suffixes with various length (1, 2 and 3)
− First name
− First and last name
− Nationality
• Rules
− If there is another person with the same first name and known gender assign the
same gender to the person with unknown gender
− Same for the whole name
− If there are multiple persons with different genders do not assign a gender
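A toy version of the statistical part with scikit-learn, using suffix and name features (the training rows are fabricated purely to show the shapes involved; the real model is trained on the USHMM records):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(first, last, nationality):
    return {"suf1": first[-1:], "suf2": first[-2:], "suf3": first[-3:],
            "first": first, "full": f"{first} {last}", "nat": nationality}

train = [("maria", "izsak", "hungarian", "F"), ("jan", "kowalski", "polish", "M"),
         ("anna", "nowak", "polish", "F"), ("moisei", "cherniakovskii", "soviet", "M")]
X = [features(f, l, n) for f, l, n, _ in train]
y = [g for *_, g in train]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([features("sofia", "alkhazova", "soviet")]))    # most likely ['F'] on this toy data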
#37
38. Statistical Model for Person Matching
• Lexical similarities of names (person first and last name, mother's name)
− Jaro-Winkler Similarity
− Levenshtein Similarity
− Beider Morse Phonetic Codes Matching
• Birth dates similarity (approximate match)
• Birth place similarity: several features
− Lexical similarity
− Linked Geonames instances match or not
− Hierarchical structure of Geonames in order to predict the country where data is missing
• Genders Match
− Male, Female
• Person Types Match
− Victim, Survivor, Rescuer, Liberator, Witness, Relative, Veteran
• Occupations Match
• Nationalities Match
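A minimal sketch of one pairwise feature vector (the jellyfish package is assumed for Jaro-Winkler and Levenshtein, using the function names of its recent releases; date and place features are reduced to simple checks):

import jellyfish    # assumed; provides jaro_winkler_similarity and levenshtein_distance

def lev_sim(a: str, b: str) -> float:
    return 1 - jellyfish.levenshtein_distance(a, b) / max(len(a), len(b), 1)

def pair_features(p1: dict, p2: dict) -> dict:
    """One input row for the matching classifier, for a candidate record pair."""
    return {
        "first_jw": jellyfish.jaro_winkler_similarity(p1["first"], p2["first"]),
        "last_jw": jellyfish.jaro_winkler_similarity(p1["last"], p2["last"]),
        "last_lev": lev_sim(p1["last"], p2["last"]),
        "birth_year_close": abs(p1["birth_year"] - p2["birth_year"]) <= 1,
        "same_geoname": p1.get("geoname") == p2.get("geoname"),
        "gender_match": p1["gender"] == p2["gender"],
        "type_match": p1["type"] == p2["type"],
    }

a = {"first": "vaclav", "last": "zizka", "birth_year": 1903, "gender": "M", "type": "victim", "geoname": "toy-1"}
b = {"first": "vaclav", "last": "ruzicka", "birth_year": 1903, "gender": "M", "type": "victim", "geoname": "toy-1"}
print(pair_features(a, b))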
#38
39. Person Clustering
• Grouping objects so that those in a group are more similar to each other than to those in other groups
• Each cluster is hypothesized to represent a single real person, comprising one or more person records
• Example cluster:
− VÁCLAV ŽIŽKA, born 20 Jan 1903
− VÁCLAV ŽIŽKA, born 20 Jan 1903
− VÁCLAV RŮŽIČKA, born 28 Jan 1903
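A minimal sketch of turning pairwise match decisions into clusters with union-find (the match list is assumed to come from the classifier of the previous slide):

def cluster(record_ids, matched_pairs):
    """Every connected component of matched pairs becomes one cluster."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

print(cluster(["zizka-1", "zizka-2", "ruzicka-1"], [("zizka-1", "zizka-2")]))
# [['zizka-1', 'zizka-2'], ['ruzicka-1']]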
#39
40. EAD Archival Descriptions
• Make EAD aggregation more sustainable
− Convert from various formats (XML, tabular, JSON) to standard EAD
− XML Validation: schema (XSD) and extra rules (Schematron)
− Preview as HTML, show errors integrated in the preview
− Enable transport and incremental update (synchronization)
− Provide "self-service" functionality so archival institutions can initiate the process and validate results
• Info gathering, process
− See e.g. USHMM info (google doc) Table of Contents on the right
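A minimal sketch of the validation step with lxml (XMLSchema and isoschematron.Schematron are real lxml classes; the file names are placeholders for whatever the EHRI profile provides):

from lxml import etree, isoschematron

ead = etree.parse("example-ead.xml")                        # placeholder file names
xsd = etree.XMLSchema(etree.parse("ead.xsd"))
sch = isoschematron.Schematron(etree.parse("ehri-extra-rules.sch"))

print({
    "xsd_valid": xsd.validate(ead),
    "xsd_errors": [str(e) for e in xsd.error_log],
    "schematron_valid": sch.validate(ead),
})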
#40
42. EAD Convertor
• Universal conversion configured with a Google Sheet
• Specific conversions also supported
#42
43. EAD Convertor Configuration
• Analysis & conceptual mapping
• Physical Mapping (XPaths everywhere)
#43
field oral-history % docs % kind map to agreed mapping notes
accession_number 14013 92.7% 9559 99.4% sng did/unitid with @label="accession_number" 1
accession_number_addl 579 3.8% 187 1.9% arr did/unitid with @label="former_accession_number" 1 previously used ids
acq_credit 9156 60.6% 3345 34.8% arr acqinfo 1
arrangement 3 0.0% 2454 25.5% sng arr arrangement 1
assoc_parent_title 12539 82.9% 1091 11.3% sng arr did/unittitle @label="alternative" 1 only use when different from collection_name
target-path target-node source-node value
/ ead //doc
/ead/eadheader/ eadid ./str[@name="id"] text()
/ead/archdesc/did/ unitid if (./str[@name="accession_number"]/text() != ./str[@name="id"]/text()) then ./str[@name="id"] else () text()
/ead/archdesc/did/ unitid ./str[@name="accession_number"] attribute label {"accession_number"}, text()
44. Conclusion
• We presented technical work in the EHRI project that is centered around
the use of semantic and NLP technologies to help archivists and researchers
working in the Holocaust domain.
• By semantic interlinking of information coming from a number of archives,
structured domain databases (where available) and Linked Open Data, we
can start building more complete histories of people involved in the
Holocaust, their networks, and the events, places and times they were
involved with.
#44