This document discusses efforts to digitize and analyze Coptic manuscripts through annotation, search capabilities, and visualization tools. It introduces Caroline Schroeder and Amir Zeldes, who lead a collaboration called Coptic SCRIPTORIUM. Their work involves annotating Coptic texts through normalization, tokenization, part-of-speech tagging, and other layers. They use ANNIS to allow searching across these annotation layers and segmentations. The document also discusses challenges around representing the multi-layered data and approaches for generic versus dedicated visualizations.
This document discusses abstract and concrete syntaxes in computing languages and how GRDDL (Gleaning Resource Descriptions from Dialects of Languages) addresses some issues. It provides examples of different concrete syntaxes for arithmetic and RDF and explains that GRDDL allows authors to express RDF using XML without having to learn a specific RDF syntax or vocabulary by defining mappings from XML to RDF's abstract syntax. The document outlines some open questions about how GRDDL will handle transformation pipelines, URI resolution, validation, and its base rule.
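To make the mechanism the abstract describes concrete, here is a hedged, minimal Python sketch of what a GRDDL-style transform does: it gleans RDF triples from a domain-specific XML dialect. The XML vocabulary, subject URI, and element-to-predicate mapping below are all illustrative assumptions (real GRDDL transforms are typically written in XSLT and referenced from the document).

```python
# Illustrative sketch only: glean RDF triples from a custom XML dialect,
# the way a GRDDL transformation maps concrete XML syntax to RDF's
# abstract syntax. All URIs and the mapping are hypothetical examples.
import xml.etree.ElementTree as ET

XML_DOC = """<book xmlns="http://example.org/books#">
  <title>Gleaning RDF</title>
  <author>A. Author</author>
</book>"""

# The kind of rule a GRDDL transformation encodes: XML local names
# mapped to RDF predicate URIs (here, Dublin Core terms).
MAPPING = {
    "title": "http://purl.org/dc/terms/title",
    "author": "http://purl.org/dc/terms/creator",
}

def glean_triples(xml_text, subject="http://example.org/book/1"):
    """Return (subject, predicate, object) triples gleaned from the XML."""
    root = ET.fromstring(xml_text)
    triples = []
    for child in root:
        local = child.tag.split("}")[-1]  # strip the XML namespace prefix
        if local in MAPPING:
            triples.append((subject, MAPPING[local], child.text))
    return triples

for s, p, o in glean_triples(XML_DOC):
    print(f'<{s}> <{p}> "{o}" .')
```

The point of the pattern is that the author only writes the XML dialect; the mapping, applied once, yields triples any RDF consumer can use.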
Shared Canvas presentation at the LIBER conference (Matthieu Bonicel)
Presentation for the LIBER manuscripts group conference in Paris, May 2012
SharedCanvas is a data model for interoperability across digital manuscript tools and repositories, promoted by the Digital Manuscripts Technical Council, led by Stanford University and funded by the Andrew W. Mellon Foundation
Europeana Regia presentation at the eChallenges 2011 conference (Europeana Regia)
The document summarizes a digital collaborative library project between five European partners to digitize and provide access to over 800 rare manuscripts from the Middle Ages and Renaissance. The objectives were to build a multilingual metadata repository, produce high-quality digitized content for Europeana, and make the textual and image content available to both scholars and the general public. Challenges included producing and aggregating metadata in six languages from different formats and systems, and addressing the needs of different types of users. The digitized collections and metadata would be made accessible through each partner's local digital library and aggregated in Europeana through a central portal.
Creating a Digital Archive of Indian Christian Manuscripts and Books (Leonard Fernandes)
Pilot Project of Digitization of Books and Manuscripts in Goa under the British Library's Endangered Archives Programme. For more details, contact Leonard Fernandes at leonard.fernandes@gmail.com
The document discusses how digitizing manuscripts can help turn them into cultural heritage by enabling scholarly work like modeling, aggregation, and annotation. It provides examples of projects that have developed tools and standards to publish digitized collections as linked open data, including Europeana and DM2E. The goal is to advance beyond simply emulating manuscripts and instead use semantic technologies to facilitate new digital humanities research through contextualization, reasoning over triple sets, and generating digital heuristics.
Culture Untapped: inspirational content & fresh ideas for your games (Milena Popova)
Games are often brain- and resource-intensive projects. Why not save precious time and exploit untapped, powerful sources of inspiration and material? Discover Europeana, a digital platform for culture giving access to over 43 million records of great thematic and media variety, coming from 3300 heritage organizations and available in 31 languages.
This presentation shows how this huge database can help the game-creation process with fresh ideas and “building blocks” of diverse and high-quality digital content. Game developers will look at inspiring content picks, learn more about technical tools and services to access and use the digital material, and see some real-life examples of creative re-use of cultural content in educational and tourism games.
The document proposes a "Canvas paradigm" to represent manuscript pages using annotations across different repositories. It allows bringing together images, text, and commentary without all being in one place. Initial experiments had students use tools like T-PEN and DM to transcribe and annotate pages from BNF hosted on Stanford servers. Next steps include extracting and sharing student work in new displays and projects.
Digital Manuscripts Toolkit, using IIIF and JavaScript. Monica Messaggi Kaya (Future Insights)
FOWA London 2015
Monica is part of the DMT project at the Bodleian Libraries (University of Oxford), which aims to create a toolkit using the IIIF standard (http://iiif.io) for images, a server solution (to store images of manuscripts and metadata), and a client solution using JavaScript to build an authoring tool for editing a manuscript manifest and its metadata. She works specifically on the authoring tool and on the challenges that different types of manifests present for the developer. You will get a glimpse of the whole picture, and then she taps into the libraries used, the choices made, collaboration experiences, and lessons learned so far.
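As a hedged illustration of the kind of document such an authoring tool edits, the sketch below assembles a minimal IIIF Presentation 2.x manifest as plain JSON. The base URI, labels, and dimensions are all assumptions for illustration; a real manifest carries image annotations and service blocks as well.

```python
# Minimal, illustrative IIIF Presentation 2.x manifest builder.
# All URIs, labels, and page dimensions below are hypothetical.
import json

def make_manifest(base, label, canvases):
    """Assemble a bare-bones manifest: one sequence, one canvas per page.

    `canvases` is a list of (label, width, height) tuples.
    """
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": base + "/manifest",
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{
            "@type": "sc:Sequence",
            "canvases": [{
                "@id": f"{base}/canvas/{i}",
                "@type": "sc:Canvas",
                "label": lbl,
                "width": w,
                "height": h,
            } for i, (lbl, w, h) in enumerate(canvases, 1)],
        }],
    }

manifest = make_manifest(
    "https://example.org/iiif/ms-1", "MS. Example 1",
    [("fol. 1r", 2000, 3000), ("fol. 1v", 2000, 3000)])
print(json.dumps(manifest, indent=2))
```

One practical source of the developer challenges mentioned above is exactly this nesting: viewers and editors must cope with manifests whose sequence/canvas structure varies from collection to collection.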
XVIII Jornada de Gestión de la Información de SEDIC. Análisis de impacto en r... (SEDIC)
XVIII Jornada de Gestión de la Información de SEDIC: Employment & professional development. Held on Thursday, 10 November at the BNE. Impact analysis on social networks
Digitization of Documentary Heritage Collections in Indic Language: Comparativ... (Anup Kumar Das)
Presented by Dr. Anup Kumar Das at the International Conference on the Memory of the World in the Digital Age: Digitization and Preservation, 26-28 September 2012, Vancouver, British Columbia, Canada
The European Union has agreed to an oil embargo against Russia in response to the invasion of Ukraine. The embargo will ban most imports of Russian oil into the EU and will be phased in over the next six months. Some EU countries still depend heavily on Russian oil and have been granted a temporary exemption, but all EU member states are expected to stop importing Russian oil by the end of 2022.
Presentation of Europeana Regia at "The Message of the Old Book in the New En..." (Europeana Regia)
In March, Europeana Regia was presented in Paris at the international seminar “The Message of the Old Book in the New Environment”, organized by the Finnish Research Library Association and the Institut Finlandais en France during the 2011 Paris Book Fair (18-19 March 2011).
Following a general overview of the project, this presentation focused mainly on the development of the Europeana Regia website, where it will be possible to consult the manuscripts in the context of their historical collections through a multilingual interface.
The document discusses the goals and methods of manuscript digitization at the Indira Gandhi National Centre for the Arts (IGNCA). The goals are dissemination of manuscripts to make them accessible, preservation of the content by preventing deterioration, and enabling new research opportunities. The document provides recommendations for scanning parameters, file formats, and storage locations to ensure high quality preservation while allowing various modes of access, from screen viewing to printing. It also describes IGNCA's efforts to digitize over 25,000 sheets of Russian manuscripts and 150 microfilm rolls of Sanskrit and Persian manuscripts.
This document discusses Biblissima, a project that aims to interconnect data about medieval manuscripts from various French libraries and research institutions on the semantic web. It describes Biblissima's data, which includes information on manuscripts, texts, people, places, and more from over 40 databases. The challenges of integrating this heterogeneous data are discussed. Biblissima addresses these challenges through data alignment, cleaning, and publishing the data as RDF linked data using vocabularies like FRBRoo. This allows the data to be interlinked, enriched, and shared to increase visibility and usability for both humans and machines.
Expanding Horizons - Ideas into Practice. Martyn Wade. Twin Cities Conference: Innovation into Practise - New Service Concepts, Helsinki and Turku, Finland, 13-16 May 2009
Medieval Music Manuscript Exploration, Baylor Libraries (Baylor University)
The document discusses the Jennings Collection at Baylor University, a treasure of the Baylor Libraries that was donated by Mrs. J.W. Jennings. It contains the Roxy Harriette Grove Papers from The Texas Collection. The document provides information on viewing medieval manuscripts in the collection, including things to observe like clefs, staves, neumes, colors, text, and formats.
The process of book publishing starts with manuscript acquisition. This slide examines the process of acquiring and assessing manuscripts, as well as the decision to publish or reject a manuscript.
This document discusses the challenges of creating an interoperable framework for presenting digital manuscripts. It notes that current repositories exist in silos, preventing access and sharing across systems. The goal is to break down these silos by separating data from applications, sharing data models and programming interfaces, and enabling tools and repositories to interact. A proposed solution involves using a "canvas" approach and linked data technologies to align multiple representations and allow annotations to be shared across repositories. Mellon Foundation funding supported numerous digitization projects, but those systems lacked ways to share data with each other.
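The canvas-plus-annotation idea described above can be sketched as a Web Annotation that attaches a transcription to a region of a canvas hosted elsewhere. Everything below is an illustrative assumption (placeholder URIs, an invented canvas identifier); the point is only the shape of the data, which lets a transcription in one repository target an image surface in another.

```python
# Illustrative sketch: a W3C Web Annotation linking a textual body to a
# rectangular region of a canvas. All URIs are hypothetical placeholders.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "supplementing",  # the text supplements the image
    "body": {
        "type": "TextualBody",
        "value": "In principio erat verbum",
        "format": "text/plain",
    },
    "target": {
        # The canvas may live in a different repository than the body.
        "source": "https://repo-a.example.org/canvas/f1r",
        "selector": {
            "type": "FragmentSelector",
            "value": "xywh=120,80,900,60",  # region: x, y, width, height
        },
    },
}

print(json.dumps(annotation, indent=2))
```

Because both body and target are addressed by URI, the annotation itself can be stored, harvested, and redisplayed by any third system without moving the image or the text.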
The Library as a Digital Research Infrastructure: Digital Initiatives and Dig... (lorna_hughes)
Memory institutions have built up expertise and taken the lead in all aspects of digital humanities, especially the development and implementation of digital methods for the capture, analysis and dissemination of archives and special collections, including manuscripts. In recent years, these initiatives have become embedded into Digital Humanities Initiatives, Centres and Programmes within research libraries, adding value to the existing relationships between libraries and scholarly initiatives. These activities have fostered the development of new projects that bring into collaboration the skills and expertise of academics, librarians, and digital humanists, making the Library increasingly a “digital research infrastructure”. This presentation will discuss these developments based on the experience of the Research Programme in Digital Collections at the National Library of Wales, specifically discussing some recent experimentation with new methods for manuscript digitization and dissemination, including hyperspectral digitization of the Library’s Chaucer manuscripts. The presentation will also discuss the wider embedding of this work within the European Digital Humanities context, through collaborations with the ESF Research Network Programme NeDiMAH (Network for Digital Methods in the Arts and Humanities).
Treasures of the National Library of Myanmar (Mya OO)
The National Library of Myanmar maintains valuable traditional manuscripts more than 200 years old. These manuscripts showcase the status of Myanmar literature in the past and the development of Myanmar traditional arts and crafts. They include not only literary text but also add value to the literature through decorative arts such as beautiful handwriting and pictures, and very fine decorative media and covers.
Fitt Toolbox Best Practice Cluster Collaboration Final (FITT)
KREATEK is an online collaboration platform for cluster managers and stakeholders to share information, best practices, and network. The platform includes news, documents, profiles, discussion forums, and a private workspace for each cluster initiative. It was created by MFG to professionalize cluster management and currently has over 120 members after six months. While the platform provides benefits, ongoing maintenance and community engagement are needed for success.
Presentation for a workshop on cataloging medieval manuscripts with Debra Cashion, Sheila Bair and Sue Steuer, which was held at the Rare Book and Manuscript Section (RBMS) of the Association of College and Research Libraries (ACRL) in Minneapolis, MN on June 27, 2013.
Semantic Interoperability - grafi della conoscenza (Giorgia Lodi)
This document summarizes Giorgia Lodi's presentation on meaningful data and semantic interoperability in the Italian public sector. Lodi discusses issues with data quality such as missing values, semantics mismatches, and use of strings instead of codes. She argues that adopting semantic web standards like RDF, OWL and SPARQL can help address these issues by linking data together and representing it semantically. Ontologies and knowledge graphs can be used to represent domain knowledge and infer new facts. Tools like FRED can generate knowledge graphs from unstructured text. Overall, Lodi argues that semantic web technologies have the potential to improve data interoperability and quality in the public sector, though challenges remain.
1) The document discusses using blogs and other structured web data to develop linguistic corpora for research. It argues that structured web data provides large amounts of naturally occurring language data in various genres and languages.
2) Examples are given of how blog data in particular is well-structured with metadata like authorship, dates, and semantics. This structured data can be extracted and analyzed to study linguistic patterns and variation across different authors, registers, and languages.
3) One research example analyzed the distribution of future tense expressions ("will" vs. "be going to") in three English language blogs and found patterns relating to subject type that confirm theoretical assumptions.
Linked Open Europeana aims to leverage Europe's cultural heritage through linked open data. It presents Europeana, an online library of European culture. Europeana started in 2005 and now contains over 15 million objects. Linked open data extends the web by adding semantics through RDF triples and linking disparate data sources. Europeana's data model EDM aims to publish cultural heritage data as linked open data to enable novel uses like knowledge generation by combining data with the rest of the semantic web. For Europeana to fully realize its potential, its data needs to be openly available as linked open data under an open license.
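The linked-open-data mechanics described in the summary above can be sketched with a few hand-written triples: an object description that links out to external authority and dataset URIs, serialized as N-Triples. All identifiers below are illustrative assumptions, not real Europeana or VIAF records.

```python
# Illustrative sketch of linked open data: a cultural-heritage record
# whose triples point at external URIs, so disparate datasets connect.
# Every URI here is a hypothetical placeholder.
ITEM = "http://data.example.org/item/42"

triples = [
    # A literal value local to this record.
    (ITEM, "http://purl.org/dc/elements/1.1/title", '"Book of Hours"'),
    # A link out to an external authority record (invented identifier).
    (ITEM, "http://purl.org/dc/elements/1.1/creator",
     "<http://viaf.example.org/person/12345>"),
    # An owl:sameAs link asserting identity with a record elsewhere.
    (ITEM, "http://www.w3.org/2002/07/owl#sameAs",
     "<http://other.example.org/ms/abc>"),
]

def to_ntriples(ts):
    """Serialize (s, p, o) tuples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> {o} ." for s, p, o in ts)

print(to_ntriples(triples))
```

The "novel uses" the abstract mentions come from following those outbound links: once the creator URI resolves to an authority record, a consumer can pull in birth dates, name variants, or other works without that data ever living in the original record.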
This document summarizes Sebastian Hellmann's PhD thesis on integrating natural language processing (NLP) data, tools, and applications with RDF and OWL. The thesis proposes creating datasets in RDF to facilitate data integration and linking. It describes converting Wiktionary and the Wortschatz corpus to RDF to create a linguistic linked data web. Standardized formats like POWLA are discussed for representing corpora on the web. The thesis also covers knowledge acquisition from resources like the Tiger Corpus Navigator and ontology learning from text using techniques like LExO.
The presentation describes the eLanguage Project, an effort by the Linguistic Society of America (LSA) to advance open-access electronic publishing of academic papers in linguistics. The presentation was held on 5 November 2007 at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany. It compares eLanguage with the World Atlas of Language Structures (WALS), an extremely successful resource in language typology that has been developed at the Institute.
Keynote @ SEMANTICS 2017 (Amsterdam, September 2017) about convergences between NLP and KE in the era of the semantic web, with a focus on semantic relation extraction from text.
This document discusses the evolution of natural language processing (NLP) and knowledge engineering (KE) and their convergence, especially with the rise of deep learning and the semantic web. It outlines how NLP and KE have moved from early ambitions of full language understanding and problem solving to more practical, layered approaches focused on specific tasks. The semantic web provides standards and architectures that benefit both NLP and KE by enabling semantic annotation, linking of data, and use of knowledge sources. Deep learning allows NLP to learn representations from large corpora and benefit from semantic resources. Relation extraction and ontology learning from text are examples of the convergence. Challenges remain around contextual language, knowledge assertion, and industrial applications.
The Datalift Project aims to publish and interconnect government open data. It develops tools and methodologies to transform raw datasets into interconnected semantic data. The project's first phase focuses on opening data by developing an infrastructure to ease publication. The second phase will validate the platform by publishing real datasets. The goal of Datalift is to move data from its raw published state to being fully interconnected on the Semantic Web.
This document provides an overview of the SHEBANQ project, which provides tools for querying annotated Hebrew text data. It describes the data sources and contributors that have built up the underlying text corpus over many years. It also outlines the steps taken to make this data and related tools more accessible, including developing a website, depositing data in archives, running demonstration projects, and integrating the data and tools into broader research environments through additional projects and publications. The goal has been to facilitate wider use of this linguistic resource and foster more digital humanities and data science work based on its contents.
The document summarizes the SESAM4 project, which aims to lower barriers for small and medium companies to exploit semantic systems. The project developed open source software, best practices, and tools based on semantic technologies and linked open data. It had 10 partners and was funded for 3 years to work on topics like ontology development, content management integration, and demonstrator applications in tourism.
Semantic Web and Linked Data for cultural heritage materials - Approaches in ... – Antoine Isaac
The document discusses using semantic web technologies like linked data and the Europeana Data Model (EDM) to improve access to cultural heritage materials by enabling semantic search and exploiting relationships between concepts, objects, and vocabularies. EDM aims to preserve original metadata while allowing for interoperability by using standards like Dublin Core, SKOS, and OAI ORE. Linked data approaches can ease getting and publishing data across cultural heritage datasets by direct access to RDF descriptions via URIs.
Big Data and Natural Language Processing – Michel Bruley
Natural Language Processing (NLP) is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Linked Open Europeana: Semantics for the Citizen – Stefan Gradmann
The document discusses Linked Open Data and how it relates to Europeana and the Semantic Web. It describes how the Europeana Data Model (EDM) aims to make Europeana's data part of Linked Open Data by preserving original metadata while allowing for interoperability. EDM uses standards like SKOS, DCMI, and OAI ORE. The document argues that fully implementing EDM and making public data available as Linked Open Data could enable new uses of the data for citizens, including tourists planning cultural activities, teachers finding educational resources, and politicians analyzing cultural funding and contributions.
The Harmony collaboration led to a metadata model and query implementation that supports resource discovery over multimedia metadata descriptions. Key aspects include extending Dublin Core for multimedia, modeling complex cases like versioning and alternate formats, and proposing a common approach. The ABC model was implemented to address metadata challenges and allows querying across namespaces. Lessons learned include issues with project scope and coordination across distributed teams.
The document discusses how linking open data and semantics can benefit digital humanities research using Europeana. It proposes fully implementing the Europeana Data Model to represent cultural heritage objects as linked open data. This would connect objects across domains and with external datasets like DBpedia. Combining this enriched semantic data with tools like SwickyNotes could facilitate new forms of digital scholarship through semantic exploration, context discovery, and knowledge generation.
The document discusses cooperative translation between contributors with different language skills and technical skills. It describes the objectives of the translation exercise, which include translating a course brochure and creating an online terminology glossary. It also outlines the roles of human contributors and technical tools involved, such as online dictionaries, machine translation, and blogs for collaboration.
This document discusses 3 use cases for linked data in higher education, including projects in the UK and Australia. It also describes David Flanders' background working with linked data at organizations like JISC and ANDS, and several linked data projects he has worked on including Open Bibliography, LOCAH, and developing ANDS vocabularies. The document raises the idea of using URIs instead of human terms as metadata for research data to enable machines to better understand and compare the data.
This document discusses the benefits of open source and free/libre/open source software (FLOSS) for libraries. It outlines several ways that libraries can make use of FLOSS, including for content management systems, learning management systems, integrated library systems, digital collections, virtual reference, and citation management. Specific FLOSS options are mentioned for each of these areas. The document also notes some potential pros and cons of using FLOSS.
Prof. M. Thaller (Universität Köln) - Toward a reference curriculum in Digita... – infoclio.ch
The document discusses developing a reference curriculum for Digital Humanities degree programs. It analyzes the types of existing DH programs, including their scope and content. There is consensus that modeling and formalization should be core components, though disagreements exist around specific skills and standards. The document proposes three levels of DH training: introductory skills for all humanities disciplines, discipline-specific skills to train digital humanists, and skills for curating digital humanities content as a profession. Developing a common curriculum could facilitate international collaboration and student/faculty exchanges between programs.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing – Lifeng (Aaron) Han
Invited presentation at the NLP lab of Soochow University about my NLP journey and the ADAPT Centre. The NLP part covers machine translation evaluation, quality estimation, multiword expression identification, named entity recognition, word segmentation, treebanks, and parsing.
Similar to [DCSB] Amir Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data"
Throughout the last decade, network analysis has become an increasingly popular method in archaeological research, but the complexity of the archaeological record poses a fundamental challenge. Datasets can comprise hundreds or thousands of entities as well as several types of objects, demanding special caution in the design of such studies. An appropriate way of storing and querying data is therefore a crucial first step, and graph databases are especially well suited for this purpose. Storing data as nodes and edges introduces relationship-based thinking already in the early stages of data preparation and acquisition. For archaeological use cases, the CIDOC CRM suggests itself as the ontology after which to model the structure of the database. The talk will present a mapping of the CIDOC CRM to the model of a graph database containing Late Bronze Age elite graves and explore the possibilities of graph databases for archaeological network analysis in further detail.
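The node-and-edge storage described above can be sketched in miniature. The CRM classes and properties used here (E22, E25, E53, P53) are genuine CIDOC CRM identifiers, but the grave data, node names and helper function are hypothetical illustrations, not the project's actual database:

```python
# A minimal property-graph sketch: nodes carry a CIDOC CRM class,
# (source, CRM property, target) triples play the role of typed edges,
# as a graph database would store them. All data below is invented.

nodes = {
    "grave_1": {"crm": "E25 Man-Made Feature", "label": "Elite grave 1"},
    "sword_1": {"crm": "E22 Man-Made Object", "label": "Bronze sword"},
    "site_A":  {"crm": "E53 Place", "label": "Cemetery A"},
}

edges = [
    ("sword_1", "P53 has former or current location", "grave_1"),
    ("grave_1", "P53 has former or current location", "site_A"),
]

def neighbours(node, prop=None):
    """Query step: follow outgoing edges, optionally filtered by CRM property."""
    return [t for s, p, t in edges if s == node and (prop is None or p == prop)]

print(neighbours("sword_1"))  # → ['grave_1']
```

A real graph database (e.g. one queried with a path language) generalises exactly this operation: chained `neighbours` calls become path traversals, which is what makes network analysis over CRM-modelled finds natural.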
In recent years, several web services have emerged that manage and make accessible place gazetteers for archaeology and the historical sciences. By using semantic technologies, these gazetteers act as linked-data hubs connecting multiple datasets of varying thematic focus and different structural properties. Just as important as the geo-spatial properties of research objects are their temporal classifications. In this talk we describe a time gazetteer web service that assumes a role similar to that of place gazetteers, but for temporal concepts and cultural periods.
The detection of textual variants is a crucial step in Classical Philology. It represents both the first stage of collation and the preliminary phase for recognising quotation and text reuse in the indirect tradition. As digital tools can improve the mechanical stage of textual comparison, the interaction between automated process and traditional philological methods is in this case very promising.
iAligner performs pairwise, intra-language, syntax-based automatic alignment on Ancient Greek, Latin and English, and it is now being tested on other languages. Texts are aligned at line or sentence level, at any length chosen by the user. They are then converted to vectors of single tokens, and pairwise alignment is performed with the Needleman-Wunsch algorithm. Additional language-dependent criteria can be established by the user for further refinement, according to the purpose of the alignment: non-alphabetical characters and diacritics can be ignored, the alignment can be set as case-sensitive, and the Levenshtein distance metric can be applied to adjust the tolerance threshold.
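The core alignment step the abstract describes can be sketched as a plain Needleman-Wunsch implementation over token vectors. The scoring values and the demonstration lines are arbitrary choices for illustration; this is not iAligner's actual code:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment of two token sequences (Needleman-Wunsch).
    Returns a list of (token_a, token_b) pairs, with None marking a gap."""
    n, m = len(a), len(b)
    # Fill the dynamic-programming score matrix.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out.append((a[i - 1], None)); i -= 1
        else:
            out.append((None, b[j - 1])); j -= 1
    return out[::-1]

# Two variant readings of the same Latin line align token by token:
print(needleman_wunsch("arma virumque cano".split(),
                       "arma uirumque cano".split()))
# → [('arma', 'arma'), ('virumque', 'uirumque'), ('cano', 'cano')]
```

The user-configurable criteria mentioned above (ignoring diacritics, Levenshtein tolerance) would slot into the `match`/`mismatch` decision, replacing exact equality with a normalised or distance-based comparison.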
Religion was woven into the fabric of Roman society, its visibility ranging from monumental temples to the practice of festival activity. Religious processions, in particular, were carefully choreographed rituals that linked disparate spaces and people together within the cityscape. Despite their acknowledged regularity within the Roman world, our understanding of religious processional movement remains extremely limited. Studies concerning Triumphal, funerary, and circus processions dominate current scholarship due to their greater documentation in the ancient literary sources. These processions, however, formed only a fraction of Roman processional activity. Recent years have seen an increase in scholarship interested in different aspects of processions and movement within the cityscape. In light of this, a reconsideration of the degree to which we can study processions within the archaeological record is warranted. Since the record of the performance of processions was held primarily in the memories of those who took part or heard about them, studying them is challenging. Adopting a theoretical and computer-based approach, this paper offers a critical analysis of the relationship between a procession’s movement patterns and its engagement with the urban environment.
A few years ago, scholars of Greek and Latin literature called for a “cyberinfrastructure” that would facilitate a new generation of digital collections - an infrastructure that uses linked open data approaches to organize the myriad of web resources related to classical studies. Already such frameworks are being built on the basis of existing claves, digital transcriptions of texts, and other tools that comprise standards in the fields of Greek and Latin.
A key claim made by Hero of Alexandria in his work Περί αὐτοματοποιητικῆς (On the making of the Automata, hereafter Automata) is that he has improved upon previously described automata, making them more feasible and more easily reproduced in practice. A three-year, Leverhulme-funded project is testing Hero’s devices and his claims. Working from a fresh analysis of the Greek text, the two automata described by Hero are being built, initially in the computer-aided design (CAD) package SolidWorks, and then in the physical world. A primary objective is to determine to what extent the Automata is a technical treatise, exaggeration/self-aggrandisement and/or a jeu d’esprit.
The new method of solid 3d modelling presented in this study allows new statistical perspectives for archaeological, geophysical and geochemical records in a 3D GIS environment. The micro-scale analysis investigates archaeological excavation trenches of the West Porticus in Ostia.
Maya writing is a semi-deciphered logographic-syllabic system with approximately 10,000 text carriers discovered in sites throughout Mexico, Guatemala, Belize, and Honduras (300 B.C. to A.D. 1500). It is one of the most significant writing traditions of the ancient world. As a graphic manifestation of language, writing mediates human thought, communication, and cultural knowledge in the form of texts. Deciphering a script allows ideas, values, conceptions, and beliefs to be reconstructed, and thus permits insight into the memory of past communities. In order to achieve this, the writing system and the spoken language that underlies it must be known. For Classic Mayan, this breakthrough in decipherment has already been achieved; however, in spite of great progress made in recent decades, some 40% of the script’s more than 800 signs remain unreadable even today. One reason for this situation is their lack of systematic attestation. Even in cases in which the signs are legible, texts may still elude understanding, because the Classic Mayan language itself has not survived; instead, it can only be reconstructed through comparison of the 30 Mayan languages documented since the European conquest and still spoken today. However, much pre-Hispanic Mayan cultural vocabulary has been lost in the aftermath of European colonization. Consequently, comprehensive documentation and decipherment of the approximately 10,000 extant hieroglyphic texts, reconstruction of the language that they record, and documentation of that language in a dictionary are necessary prerequisites for acquiring a deeper understanding of Classic Maya culture, history, religion, and society.
The interpretation of archaeological surface survey data is not straightforward. The aim of this paper is to critically evaluate the interpretative potential of the surface survey record in terms, on the one hand, of demography and settlement patterns and, on the other, of consumption and changing social patterns of commodity distribution and access, using the microregional ceramic dataset collected during fieldwork in the region of Thugga (Tunisian High Tell). By analyzing rural surface pottery assemblages against settlement patterns and topography, I will show the application of a spatial and quantitative approach to the survey record and discuss its potential and risks. At the macroregional scale, consumption patterns will be considered in a comparative perspective among urban and rural settlements as well as coastal sites and the rural hinterland of the Roman province Africa Proconsularis. The reconstruction of a geography of consumption allows a ceramic view on the economic development of the Roman province and on its integration into inter-regional and long-distance markets.
In the last few years, we have attempted to reconstruct Roman transport conditions by modelling travel costs and times with the help of GIS and network analysis applications. The main geographical focus of this project was the NE of Hispania. It was necessary to devote a significant effort to the gathering, documentation, analysis and digitisation of Roman communications with high precision. With the aim of using this methodology in a much broader geographic frame, the entire Iberian Peninsula, Italy and Britain were analysed with less detailed transport networks, which allows us to discover very interesting patterns. The results of such applications provide us with new information to understand the distribution of commodities, product competition and problems of stagnation in ancient economies such as that of Ancient Rome.
The field of linguistics has been of great importance in the digital humanities since their beginnings. Whether the abstract structures of information technology were a particularly natural fit for linguistic structures, whether the sheer volumes of data made the use of computers attractive, or whether entirely different factors were at play, may remain an open question. In any case, one insight gained from computational linguistics was that data could be collected not only with the aim of producing an electronic reference work; through systematic processing and presentation of the data, specific questions, mostly quantitative but also structurally oriented, could be posed, whose results might point researchers toward previously unrecognized phenomena.
While historical research has long attended to the visual aspects of human perception, the investigation of acoustic matters has recently attracted increased scientific interest as well. The concrete speech situation in premodern times, before electro-acoustic amplification, has so far remained largely unresearched. In a large-scale study, the department of ancient history, in close collaboration with acousticians of the Fraunhofer Institute as well as other research disciplines, attempts to adequately reconstruct and simulate ancient historical speech situations.
In a first stage of the project on Theban witnesses in Demotic documents, we illustrated social network analysis and data visualisation as a technique for identifying and disambiguating historic actors in a large dataset. This next phase will present you with an example of how historical research can evolve after having used the identification method.
Throughout Greek and Roman history, naval warfare played a prominent role. Gaining, exerting and contesting sea power was an important characteristic of many a conflict from the Archaic period right down to Late Antiquity; indeed, from the Persian to the Punic wars, contesting control of the sea was often at the very centre of the conflict. Yet despite its importance, naval power in general and naval action in particular are extremely poorly understood, and even the most basic questions regarding an ancient naval action – what could and what did actually happen – remain to this day mostly unanswered.
Philology is the aggregate of those practices by which we exploit the linguistic record to engage cultural perspectives that are distant from us in time, space, and/or perspective. Whether we are exploiting post-colonial theory, corpus linguistics, or some aspect of the cognitive and brain sciences, we are practicing philology. In the 21st century, we confront the challenge of managing interactions across boundaries of space, language, and culture that are unprecedented in speed and complexity, with each point on the globe now able to interact with any other point in real time. We must think in terms of a World Literature – as Goethe suggested almost two centuries ago – and to do so we must articulate a new philology, one that exploits every possibility of new digital media. Ultimately, we need to establish a sustainable set of evolving cultures – a dynamic Global Culture that provides a voice for many different cultures within it. The field of Altertumswissenschaft has an opportunity to play a fundamental role in this larger process, but realizing that opportunity requires a reexamination of what we do, why we do it and for whom.
The study of intertextuality in Classical poetry often presents itself as a specialized case of text-reuse detection: commentaries and other close readings of a work concern themselves with the identification and exegesis of phrases borrowed from earlier texts. Yet it has long been understood that larger-scale, structural parallelisms can also exist between texts (Genette 1997), and that these can provide the context necessary to establish an allusive or intertextual link between two phrases (Wills 1996). Automatic detection of intertextuality must take into account features at various scales: from individual phonemes to larger syntactic units and type scenes.
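At the smallest of the scales mentioned above, phrase-level reuse detection can be approximated by intersecting word n-grams between two texts. A minimal sketch; the function names and the example lines are illustrative only, not any project's actual pipeline:

```python
# Hypothetical sketch: candidate phrase-level allusions as shared word bigrams,
# one of the several feature scales the abstract mentions.
def ngrams(tokens, n=2):
    """All contiguous n-token sequences in a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_phrases(text_a, text_b, n=2):
    """n-grams occurring in both texts: candidates for borrowed phrases."""
    return ngrams(text_a.lower().split(), n) & ngrams(text_b.lower().split(), n)

print(shared_phrases("arma virumque cano troiae qui primus ab oris",
                     "ille ego qui quondam arma virumque cano"))
# → {('arma', 'virumque'), ('virumque', 'cano')}
```

Larger-scale structural parallels of the kind Genette and Wills describe need features beyond surface n-grams (syntactic units, type scenes), but the same intersection logic applies once those features are extracted.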
*ABSTRACT*
This seminar will revolve around two Reflectance Transformation Imaging (RTI) projects based at the University of Cologne on ancient Greek texts. The first deals with the Herculaneum Papyri. Preserved through carbonisation when Mount Vesuvius erupted in 79 CE, these papyri constitute the largest surviving ancient library in the world. For over two centuries scholars have sought to unroll and read the c.1800 papyrus scrolls found in the Villa dei Papyri. Recent infrared RTI has resulted in a major leap forward for revealing further writings and providing vital information about the physical structure of the rolls. The second project, “Magica Levantina”, aims to create an edition of Greek magical texts from Cyprus and the ancient Near East. Over 300 texts, dating from c.100-600 CE and comprised mainly of curses and some protective spells, are incised on various metals and gypsum. Material properties, writing technique and poor condition present challenges to legibility that are successfully tackled through the use of visible spectrum RTI.
Several themes arise from the case studies presented. The conventional use of the digital image as a resource for interpreting past written meaning will be contrasted with a more active concept of the digital image as constitutive of both past reconstructions and the interpretive process. This latter concept will be developed to argue for greater reflexivity in image data use and increased epistemological awareness of the role of the digital image — whether employed for research on the Classical world or the ancient world more generally.
Further information: http://hdl.handle.net/11858/00-1780-0000-0024-BF5F-2
ABSTRACT
How does a researcher or analyst determine whether two records refer to the same person or are related in some other way, and whether other related information refers to both people equally? Starting with three large datasets from the classical world: the Lexicon of Greek Personal Names, an Oxford-based corpus of persons mentioned in ancient Greek texts; Trismegistos, a Leuven-run database of names and persons from Egyptian papyri; Prosopographia Imperii Romani, a series of printed books listing senators and other elites from the first three centuries of the Roman Empire, SNAP:DRGN aims to create a lightweight model to bring this prosopographic and onomastic data together.
Web and Linked data technologies offer ways to model and share this information; linking from references in primary texts to, and between, authoritative lists of persons and names. The SNAP project looks to the many prosopographies and onomastica that already exist, initially within the restricted domain of Greco-Roman antiquity, for whom the same questions of identity and provenance apply and asks whether combining these approaches will allow us to create a shared resource for classical scholars who wish to disambiguate their data.
SNAP:DRGN is an AHRC-funded project exploring the interlinking data collections of persons (prosopographies), names (onomastica) and person-like entities managed in heterogeneous systems and formats. This paper will explore the background to, and results of, the work.
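The lightweight linking idea can be sketched with plain same-as assertions between person records held in separate systems. The record URIs and fields below are invented for illustration and are not actual LGPN or Trismegistos identifiers:

```python
# Two hypothetical person records from separate prosopographic datasets;
# they stay in their home systems rather than being merged.
lgpn = {"lgpn:P1": {"name": "Apollonios", "place": "Athens"}}
trismegistos = {"tm:P9": {"name": "Apollonios", "source": "papyrus"}}

# An asserted identity link (in RDF terms, an owl:sameAs-style statement).
same_as = [("lgpn:P1", "tm:P9")]

def co_referents(uri):
    """All record URIs directly linked to the given record."""
    return ([b for a, b in same_as if a == uri]
            + [a for a, b in same_as if b == uri])

print(co_referents("lgpn:P1"))  # → ['tm:P9']
```

Keeping identity as explicit, retractable links rather than merged records is what lets heterogeneous prosopographies remain authoritative for their own data while still being queried together.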
http://hdl.handle.net/11858/00-1780-0000-0024-5F70-E
ABSTRACT
The study of the Roman economy is populated by a large number of sometimes conflicting descriptive models. These models are rarely formally compared, and many remain untested due to the limited use of formal hypothesis-testing methods in Roman studies and the significant data requirements to enable their use. This paper illustrates how broad patterns in large archaeological datasets allow for aspects of these models to be tested, and suggests agent-based network modelling as a particularly fruitful approach for the study of the Roman economy.
As an example, this paper presents the Market Economy and Roman Ceramics Redistribution agent-based network model (MERCURY, after the Roman god of commerce). It represents the structure of social networks between traders that act as the channels for the flow of commercial information and goods. MERCURY was created to formally represent and compare two descriptive models of the functioning of the Roman trade system (Peter Bang's Roman bazaar (2008) and Peter Temin's (2013) Roman market economy) and how these give rise to differences in the distribution patterns of Roman tablewares. The results of experiments using MERCURY are subsequently compared to archaeologically observed tableware distribution patterns. The results suggest that, contrary to Bang's hypothesis, limited availability of reliable commercial information from different markets is unlikely to give rise to the large differences in the wideness of tableware distributions observed in the archaeological record. This paper concludes that the study of the Roman economy would very much benefit from embracing computational modelling approaches because (i) it forces scholars to consider the comparability of descriptive models, (ii) it allows comparison of simulated outputs with archaeologically observed outputs, and (iii) it allows us to map out the grey zone between extreme hypotheses and refocus our descriptive models away from hypotheses that do not compare favourably with the archaeological record.
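To make the agent-based approach concrete, here is a deliberately tiny toy simulation, not the MERCURY model itself: traders on a hypothetical three-node network pass wares to random neighbours, and the "wideness" of a ware's distribution is then counted as the number of sites where it ends up present:

```python
import random

# Toy sketch only: all network structure and parameters are invented.
random.seed(0)
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}   # hypothetical trade links
stock = {"A": ["ware1"] * 5, "B": [], "C": []}      # ware1 produced at site A

for _ in range(20):                                  # trading rounds
    trader = random.choice(list(links))
    if stock[trader]:                                # trade only if goods held
        partner = random.choice(links[trader])
        stock[partner].append(stock[trader].pop())

# "Wideness" of distribution: number of sites where the ware is present.
print(sum(1 for s in stock.values() if s))
```

A model like MERCURY varies parameters such as the reliability and reach of commercial information, runs many such simulations, and compares the resulting distribution widths against the archaeological tableware record; the toy above only illustrates the mechanism of goods flowing along network channels.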
http://hdl.handle.net/11858/00-1780-0000-0024-5022-5
ABSTRACT
The talk will present an ongoing research project (the Γ-project from now on) aiming at designing a dynamic grammar of Ancient Greek. Like many other languages, Ancient Greek is characterized by a complex interplay between its rich morphological features, its wide range of semantic roles and its diverse syntactic functions. The nodes where these three types of features intersect are commonly known as grammar rules. This means that grammatical rules, in sharp contrast to their static presentation in grammar books as well as in online grammars, can be regarded as the result of many-to-many relationships. To secure its dynamic structure, the Γ-project is constructed around these many-to-many relationships. Exploring these relations, students will acquire Greek language skills while also gaining a more profound knowledge of language structures. Hence, the Γ-grammar will be a novel instrument for learning and understanding ancient languages. As the technology of the Γ-grammar will be available under a Creative Commons license, a similar application for other (ancient or modern) languages would be conceivable.
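The many-to-many structure of grammar rules can be sketched as triples linking a morphological form, a semantic role and a syntactic function; the same form participates in several rules. All example values below are hypothetical and not taken from the Γ-project:

```python
# A "rule" as a node linking form, semantic role and syntactic function.
# The same morphological form (e.g. the dative) recurs across rules,
# which is the many-to-many relationship the abstract describes.
rules = [
    ("dative",   "recipient", "indirect object"),
    ("dative",   "possessor", "predicate complement"),
    ("genitive", "possessor", "attribute"),
]

def roles_of(form):
    """All semantic roles a morphological form can express."""
    return sorted({role for f, role, _ in rules if f == form})

def forms_for(role):
    """All morphological forms that can express a semantic role."""
    return sorted({f for f, r, _ in rules if r == role})

print(roles_of("dative"))     # → ['possessor', 'recipient']
print(forms_for("possessor"))  # → ['dative', 'genitive']
```

Traversing the rule set from either end (form to role, or role to form) is what makes such a grammar "dynamic" compared with the static tables of a printed grammar.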
Full abstract: http://hdl.handle.net/11858/00-1780-0000-0024-412A-0
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels.
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur naturally.
How to Manage Your Lost Opportunities in Odoo 17 CRM (Celine George)
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM.
[DCSB] Amir Zeldes (HU Berlin), "Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data"
1. Towards Digital Coptic
Searching and Visualizing
Coptic Manuscript Data
Caroline T. Schroeder,
University of the Pacific
cschroeder@pacific.edu
Amir Zeldes,
Humboldt-Universität zu Berlin
amir.zeldes@rz.hu-berlin.de
Berlin Digital Classicist Seminar, 14.1.2014
2. Plan
Introduction
Coptic data
Annotations so far: normalizing, tokenizing and tagging
Search architecture
Searching through multiple segmentations: ANNIS
Dealing with corpus formats: TEI, SaltNPepper
Visualization
Dedicated visualizations
A reusable generic approach
Conclusion and outlook
Schroeder & Zeldes / Towards Digital Coptic
Berlin, 14.1.2014
1/37
3. Who are these people?
Prof. Caroline T. Schroeder –
Religious and Classical Studies /
Humanities Center Director
University of the Pacific
Dr. Amir Zeldes –
Korpuslinguistik /
SFB 632 Information Structure
(from March: eHumanities group KOMeT)
Humboldt-Universität zu Berlin
Cooperation Coptic SCRIPTORIUM, established at the 2012
NEH summer institute on "Text in a Digital Age" (Tufts):
http://coptic.pacific.edu/
4. Why Coptic?
Last stage of the Ancient Egyptian language (starting 2nd century)
Mediterranean in the 1st millennium
Hellenistic period
Unique language
Longest continuous documentation
Contact language (with Greek)
Religious significance
Early Christianity
Rise of monasticism
Gnosticism
...
[Map: Coptic dialects]
5. The data
Lots of material (thanks to the Egyptian desert)
Relatively little online, nothing like Greek and Latin
(Perseus)
Lots of things you may want are not available:
New Testament (online, not normalized/lemmatized/annotated)
Old Testament
The Rule of St. Pachomius
Works of Shenoute of Atripe
Apophthegmata patrum
...
But some have been digitized at some point!
6. A word about the texts in this talk
So far we've concentrated on Shenoute's sermon Abraham our
Father
"As for us, brethren, let us live by the truth so that we are upstanding in
all our works, and so that the prophets, apostles and all the saints might
dwell among us, ..."
Apophthegmata Patrum (sayings of the desert fathers)
"They said about the blessed Sarah the virgin that she spent sixty years
living at the top of the river and she never set foot outside to see the
river."
New Testament, esp. Gospel of Mark
see http://coptic.pacific.edu/ for corpora and tools
7. Getting from raw text to annotated corpora
Making the data searchable starts
with:
Encoding manuscripts (Epidoc TEI)
Segmentation of "word forms"
Normalization
Segmentation of morphemes
Part-of-speech tagging
More annotations...
Brief recap: Detailed talk in Leipzig
last month (slides on my page)
8. Normalization
Automatic normalization, manual correction
handling of known diacritics, abbreviations
closed, growing list of known variants
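The normalization step described above (strip known diacritics, then consult a list of known variants and abbreviations) can be sketched as Unicode decomposition plus a lookup; the variant entry below is illustrative, not the project's actual closed list.

```python
import unicodedata

def normalize(form, variants):
    """Strip combining diacritics, then apply a lookup of known variants."""
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return variants.get(stripped, stripped)

# Illustrative entry: overlined nomen sacrum abbreviation expanded to "Christ"
variants = {"ⲭⲥ": "ⲭⲣⲓⲥⲧⲟⲥ"}
print(normalize("ⲭ̅ⲥ̅", variants))  # ⲭⲣⲓⲥⲧⲟⲥ
```

In practice the automatic pass produces candidates and manual correction handles the rest, as the slide notes.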
9. Tokenization
Identifying morphemes non-trivial (agglutinative language,
different conventions; we follow Layton 2004)
ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ
'Since I became a monk'
since-that-PAST-1sg-do-monk
ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ
'he who made us keep the ceremony'
REL-PAST-3sgM-CAUS-1pl-do-the-observance
Word level segmentation: manual (no scriptio continua)
Morph segmentation: automatic (accuracy: 84% - 94%)
ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ`
of-a-son of-Abraham
→ ⲛ | ⲟⲩ | ϣⲏⲣⲉ | ⲛ | ⲁⲃⲣⲁϩⲁⲙ
  of | a | son | of | Abraham
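One simple way to sketch automatic morph segmentation of a bound group is a greedy longest-match against a morph lexicon; the transliterated lexicon below is invented for readability and is not the project's actual segmenter or data.

```python
def segment(bound_group, lexicon):
    """Split a bound group into known morphs, preferring the longest match."""
    morphs, i = [], 0
    while i < len(bound_group):
        for j in range(len(bound_group), i, -1):  # try longest candidate first
            if bound_group[i:j] in lexicon:
                morphs.append(bound_group[i:j])
                i = j
                break
        else:
            return None  # not segmentable with this lexicon
    return morphs

# Transliteration of ⲛ-ⲟⲩ-ϣⲏⲣⲉ 'of-a-son' with an illustrative lexicon
lexicon = {"n", "ou", "shere", "abraham"}
print(segment("noushere", lexicon))  # ['n', 'ou', 'shere']
```

A greedy matcher of this kind has no way to resolve ambiguity, which is one reason reported accuracy varies (84%–94%) and manual correction remains part of the pipeline.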
10. Part-of-speech tagging
POS tagging using TreeTagger (Schmid 1994) and a lexicon from the
CMCL project (courtesy of Prof. Tito Orlandi)
Two tag sets:
fine grained (45 tags) and coarse (22 tags)
(see http://coptic.pacific.edu/ for documentation)
Interannotator agreement: 94.19% agreement, kappa = 93.67
(considers chance agreement, cf. Artstein & Poesio 2008)
Accuracy:
In domain, 10-fold cross-validation: 94.04% (fine)
Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)
Main difficulties: open classes (N/V),
disambiguating homonyms (ⲉ can have 6 different tags!)
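Chance-corrected agreement of the kind cited above (cf. Artstein & Poesio 2008) can be sketched as Cohen's kappa; the tag sequences below are toy data, not the project's annotations.

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Observed tag agreement corrected for expected chance agreement."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators' POS tags for the same six tokens (toy data)
a = ["N", "V", "N", "PREP", "V", "N"]
b = ["N", "V", "N", "PREP", "N", "N"]
print(round(cohens_kappa(a, b), 3))  # 0.714
```

Because kappa discounts agreement expected by chance, it is always at or below raw agreement, matching the pattern reported on the slide (94.19% agreement vs. kappa 93.67).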
11. Further annotations
Many other layers are done manually:
Translation
Language of origin
Coreference
Entity tagging (people, places...)
Parallel alignment (with Greek)
Syntax trees (very preliminary tests)
12. Representing data – how to look at all this stuff?
We now have a lot of data to represent:
Diplomatic transcriptions (including character rendering!)
Normalization
Segmentation into words, morphemes, sometimes letters
Annotations
How do we encode this data for search and
visualization?
13. The first challenge: minimal units
Minimal units, or tokens, are critical for searching:
Find all words preceding the word "God"
Give me any mentions of Saint Paphnutius, ±10 words
Search for the glosses father and son within 20 words
Two problems:
The concept of words is complex in Coptic
Annotations overlap parts of words:
individual letters, line breaks...
tokens are smaller than words!
ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ
ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ
Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ
he sAid "it's been e
ight years" –
The old man told him
14. Solution: segmentation layers in ANNIS
We use the open source ANNIS platform as a search
interface (Zeldes et al. 2009)
Any annotation layer can be defined as a segmentation
defining alternative views on:
Adjacency
(in words, morphemes, etc.)
Proximity
(in words, morphemes, etc.)
Context size
(in words, morphemes, etc.)
But which segmentation layer do you want to see?
Remember, diplomatic and normalized layers don't match
Any segmentation layer is usable as "base text"
17. Searching with AQL
(see http://www.sfb632.uni-potsdam.de/annis/ )
Basic principle of ANNIS Query Language (AQL):
search for some annotations (#1, #2, #3...)
stipulate relationships between them (operators)
Example: verbs of Greek origin
pos="V" &
source_lang="Greek" &
#1 _=_ #2
(_=_ is the identical coverage operator)
Example hits: "The head bandit repented", "I have faith in God"
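The semantics of _=_ can be sketched as an intersection over token ranges: two annotations match when they cover exactly the same tokens. The layer names follow the query above, but the span representation and data are invented for illustration; ANNIS evaluates this in its own backend.

```python
# Annotations as (token_range, value) pairs per layer -- illustrative data
spans = {
    "pos":         [((0, 0), "N"), ((3, 3), "V"), ((5, 5), "V")],
    "source_lang": [((3, 3), "Greek"), ((8, 8), "Greek")],
}

def identical_coverage(layer1, value1, layer2, value2, spans):
    """Token ranges carrying value1 on layer1 AND value2 on layer2 (#1 _=_ #2)."""
    hits1 = {rng for rng, val in spans[layer1] if val == value1}
    hits2 = {rng for rng, val in spans[layer2] if val == value2}
    return sorted(hits1 & hits2)

# Verbs of Greek origin: pos="V" & source_lang="Greek" & #1 _=_ #2
print(identical_coverage("pos", "V", "source_lang", "Greek", spans))  # [(3, 3)]
```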
18. Referencing segmentations
There are many operators
. (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...
> (dominance), -> (pointing relation), >@l (left child)...
...
Possible to use segmentations in queries:
#1 . #2
- one followed by two
#1 .word #2
- two is the next word after one
#1 .norm,1,10 #2
- within 1 to 10 norm units
...
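Measuring proximity in a chosen segmentation layer (as in #1 .norm,1,10 #2 above) can be sketched by mapping each unit of a layer to the token range it covers and counting units between two tokens; the layers and ranges below are toy data.

```python
# Each layer lists (start_token, end_token) ranges, one per unit -- toy data
layers = {
    "word":  [(0, 2), (3, 5), (6, 6)],
    "morph": [(0, 0), (1, 2), (3, 3), (4, 5), (6, 6)],
}

def unit_distance(layer, tok_a, tok_b, layers):
    """How many units apart two tokens are, measured on the given layer."""
    def unit_of(tok):
        for i, (start, end) in enumerate(layers[layer]):
            if start <= tok <= end:
                return i
        raise ValueError(f"token {tok} not covered on layer {layer!r}")
    return abs(unit_of(tok_b) - unit_of(tok_a))

print(unit_distance("word", 1, 6, layers))   # 2: two word units apart
print(unit_distance("morph", 1, 6, layers))  # 3: three morph units apart
```

The same token pair yields different distances on different layers, which is exactly why AQL lets a query name its segmentation.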
19. Adding metadata
Metadata is like any other constraint, with the meta:: prefix
Can use regular expressions and negation
pos!="V" & source_lang="Greek" &
#1 _=_ #2 & meta::msName=/MONB.*/
For metadata names and values we use TEI/EpiDoc as
a guideline
More information on AQL:
http://www.sfb632.uni-potsdam.de/annis/
20. Architecture and formats
Different formats are suitable for different parts of the
data
TEI ideal for manuscript structure, metadata
Linguistic formats for computational corpus linguistics:
tagging, parsing, coreference
Convert and merge data using SaltNPepper
(Zipser & Romary 2010)
21. SaltNPepper (Zipser & Romary 2010)
Metamodel Salt for
multiformat conversion
Work on extending
TEI support: 2014-15
Salt as internal representation
in ANNIS
22. How can we view the data?
Even if we can query everything at once:
people who are indirect objects of the verb "show" aligned
with Greek neuters...
Can we also look at everything at once?
Excerpt from a Salt graph view of two words:
23. Breaking it down
Different annotations require different visualizations
Two conflicting requirements:
Ideal representation for each layer (syntax -> trees)
Stay generic and minimize the number of visualizations
How can we avoid programming new visualizations
with each new annotation layer?
24. Generic versus dedicated
For some purposes, dedicated visualizations cannot be
avoided
Special interactive functionality
Special layout algorithms
For other purposes, we can reuse visualizations by
making them flexible and configurable
Need to take segmentations into account
25. Some dedicated examples
Syntax trees
Coreference view (interactive)
26. Taking segmentations into account
Visualizations must be configurable to be aware of different
base texts
Syntax tree is based on normalized "word"-internal morphs
Sometimes one syntactic unit has multiple tokens
Example: "... came upon a band of bandits and found them drinking. [...]"
(in the manuscript, "bandits" is even split across a line break: ban|dits)
27. Reusing dedicated visualizers?
In some cases, some creative uses can be found for
existing visualizations
Using the coreference visualizer for parallel alignment:
apophthegmata patrum
28. Generic visualizations
Two main generic visualizers:
Annotation grid:
just mark borders of annotations
good for flat information
HTML visualizer:
generates HTML elements based
on annotations
defined using two simple stylesheets
can look like (almost) anything
29. Multiple grids
All annotations in one grid can lead to visual overload
Often better to separate groups of annotations:
30. The HTML visualizer
Any specific visualization is configured by two style sheets:
a config file and a CSS file
norm.config:

p      p
word   span; style="word"
norm   span; style="norm"       value
trans  t:title; style="trans"   value

norm.css:

div.htmlvis {
  font-family: Antinoou, sans-serif;
  width: 500px;
  white-space: normal !important;
}
.trans:hover { color: red }
.word:after { content: " "; }
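The idea behind such a config — wrap each annotation layer's content in a configured HTML element — can be sketched as a tiny renderer; the element names, classes and data only loosely mirror the example and are illustrative, not the visualizer's actual implementation.

```python
def render(words, elem="span", word_cls="word", norm_cls="norm"):
    """Wrap each word in a span, and each contained norm unit in a nested span."""
    out = []
    for norms in words:
        inner = "".join(f'<{elem} class="{norm_cls}">{n}</{elem}>' for n in norms)
        out.append(f'<{elem} class="{word_cls}">{inner}</{elem}>')
    return "".join(out)

# "Abraham" (one norm unit) and "our father" (two norm units in one word)
html = render([["ⲁⲃⲣⲁϩⲁⲙ"], ["ⲡⲉⲛ", "ⲉⲓⲱⲧ"]])
print(html)
```

Because the element and class names come in as parameters rather than being hard-coded, the same renderer can serve different configs — the reuse argument the slides make for the HTML visualizer.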
31. Result
Abraham our Father
<p>
  <t class="translation"
     title="Abraham our father wished to have children with Sarah.">
    <span class="word">
      <span class="norm">ⲁⲃⲣⲁϩⲁⲙ</span>
    </span>
    <span class="word">
      <span class="norm">ⲡⲉⲛ</span>
      <span class="norm">ⲉⲓⲱⲧ</span>
    </span>
  </t>
  ...
</p>
32. Reusing the HTML visualizer
dipl.config:

tok      span
lb       div; style="line"
pb       table:title; style="pb"   value
pb       tr
cb       td; style="cb"            value
hi_rend  hi_rend:rend              value
34. Aggregate visualizations
Latest version of ANNIS offers basic frequency analysis
Open question: How much more should we build?
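A frequency breakdown of the kind ANNIS offers can be sketched as a counter over one annotation layer; the tag list below is toy data.

```python
from collections import Counter

# POS tags for a short stretch of text (toy data)
pos_tags = ["N", "V", "N", "PREP", "N", "V", "ART", "N"]

freq = Counter(pos_tags)
for tag, count in freq.most_common():
    print(f"{tag}\t{count}")
```

The same counting generalizes to any layer — lemmas, language of origin, entity types — which is why a single generic frequency module covers many analyses.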
35. Aggregate visualizations
Other visualizations are currently done e.g. in R:
[Word cloud, produced in R: Egyptian vs. Greek vocabulary in Gospel of Mark 1 and the Apophthegmata Patrum]
36. Conclusion
Annotation projects should not be limited by corpus
architectures:
annotate whatever you want, however often you want
link anything to anything
Why annotate all of these things in the corpus?
(and not just in a separate spreadsheet)
Plots of just the verbs? POS tagging
Highlight, search and link proper names and place-names? Entity tagging
Collapse inflected variants? Lemmatization
Collapse prominent referents? Coreference annotation
Dispersion of any of the above, alignment ... and much more
37. Conclusion
Anything can be made queryable with more layers:
typical constructions and objects of verbs?
Greek vs. native verbs -> add language of origin layer
Translation behavior -> add alignment layer
...
Fitting visualization facilities
should be easy to re-use
optimized to the task, display relevant portions of information
for many purposes, they must be sensitive to segmentations
38. Outlook
This March: BMBF funded young researcher group on
eHumanities at HU Berlin
KOMeT:
KOrpuslinguistische Methoden für ePhilologie mit TEI
Focus on marrying TEI resources with computational linguistics methods
and formats
Developing NLP tools, search and visualization for ancient world textual
resources
Pilot phase (2014, approved): Coptic
Main phase (2015-2019, pending): Other languages as well
Currently looking for a student assistant (60h/month)
Stay tuned for more!
40. References
Artstein, Ron & Massimo Poesio (2008), Inter-Coder Agreement for
Computational Linguistics. Computational Linguistics 34(4), 556–596.
Layton, Bentley (2004), A Coptic Grammar. Second Edition, Revised and
Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.
Schmid, Helmut (1994), Probabilistic Part-of-Speech Tagging Using Decision
Trees. In: Proceedings of the Conference on New Methods in Language
Processing. Manchester, UK, 44–49. Available at: http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.
Zeldes, Amir, Julia Ritz, Anke Lüdeling & Christian Chiarcos (2009), ANNIS: A
Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus
Linguistics 2009. Liverpool, UK.
Zipser, Florian & Laurent Romary (2010), A Model Oriented Approach to the
Mapping of Annotation Formats using Standards. In: Proceedings of the
Workshop on Language Resource and Language Technology Standards,
LREC-2010. Valletta, Malta, 7–18.