Data mining or data science is the process of applying computational and algorithmic methods to large datasets.
Text mining is a collection of methods used to extract information not from “formalised database records” but from “unstructured textual data”.
2015.05.19 Tom De Nies - TinCan2PROV: Exposing Interoperable Provenance of ... (tdenies)
This document describes a mapping between the Experience API (xAPI) format for logging learning experiences and the W3C PROV standard for representing provenance. It outlines how xAPI statements can be converted to JSON-LD and then mapped to equivalent PROV concepts to make the learning logs interoperable. The mapping was tested on real xAPI statement data and allows logging of learning processes in a way that is machine-interpretable and can be queried and analyzed at scale. Going forward, the mapping will be used and tested in educational projects and systems to start leveraging the power of linked data for learning analytics.
Scalable Cross-lingual Document Similarity through Language-specific Concept ... (Carlos Badenes-Olmedo)
This document proposes an unsupervised algorithm to relate similar documents in multilingual corpora without requiring translations. It represents documents as distributions over topics derived from language-specific concept hierarchies (like WordNet synsets). Documents in different languages are aligned in a single representation space based on shared synsets from their main topics. The algorithm is evaluated on document classification and retrieval tasks using legislative text corpora in English, Spanish and French, showing it performs comparably to supervised methods while not requiring parallel or comparable training data.
Repositories are systems to safely store and publish digital objects and their descriptive metadata. Repositories mainly serve their data through web interfaces which are primarily oriented towards human consumption. They either hide their data behind non-generic interfaces or do not publish them at all in a way a computer can process easily. At the same time, the data stored in repositories are particularly suited for use in the Semantic Web, as metadata are already available. They do not have to be generated or entered manually for publication as Linked Data. In my talk I will present a concept of how metadata and digital objects stored in repositories can be woven into the Linked (Open) Data Cloud, and which characteristics of repositories have to be considered while doing so. One problem it targets is the use of existing metadata to present Linked Data. The concept can be applied to almost every repository software. At the end of my talk I will present an implementation for DSpace, one of the most widely used software solutions for repositories. With this implementation, every institution using DSpace should be able to export their repository content as Linked Data.
This document discusses librAIry, a text mining framework that can distribute text mining tasks across multiple sources and retrieve similar documents from a large corpus. LibrAIry uses standards like OAI-PMH, Linked Data Principles and AMQP to harvest and process textual data from sources like repositories, newspapers and PDFs. It provides modules for tasks like tokenization, annotation and topic modeling. LibrAIry has been used in real world applications to analyze patents and recommend books. Future work aims to improve resource definition and use resource URIs in routing keys.
This document provides an overview of the SHEBANQ project, which provides tools for querying annotated Hebrew text data. It describes the data sources and contributors that have built up the underlying text corpus over many years. It also outlines the steps taken to make this data and related tools more accessible, including developing a website, depositing data in archives, running demonstration projects, and integrating the data and tools into broader research environments through additional projects and publications. The goal has been to facilitate wider use of this linguistic resource and foster more digital humanities and data science work based on its contents.
An initial analysis of topic-based similarity among scientific documents base... (Oscar Corcho)
This document analyzes the representativeness of different parts of scientific documents, including abstracts and sections related to the approach, outcome, and background. It finds that summaries created from the approach, outcome, or background better represent the full document and related documents than abstracts, based on measures of internal and external representativeness. Future work will use probabilistic topic models better suited to short texts.
Semantic Web and Linked Data for cultural heritage materials - Approaches in ... (Antoine Isaac)
The document discusses using semantic web technologies like linked data and the Europeana Data Model (EDM) to improve access to cultural heritage materials by enabling semantic search and exploiting relationships between concepts, objects, and vocabularies. EDM aims to preserve original metadata while allowing for interoperability by using standards like Dublin Core, SKOS, and OAI ORE. Linked data approaches can ease getting and publishing data across cultural heritage datasets by direct access to RDF descriptions via URIs.
This document provides an introduction to text analytics and natural language processing techniques. It discusses bag-of-words models, term frequency-inverse document frequency (TF-IDF), vector space models, distance measures, document clustering, word embeddings using word2vec, and recurrent neural networks. The agenda covers traditional "frequentist" text analysis methods as well as deep learning techniques for semantic analysis. Hands-on examples in Python are provided to illustrate document clustering, creating word embeddings, and generating text with recurrent neural networks.
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences (Shawn Day)
This document provides an overview of digital tools, trends and methodologies for the social sciences and humanities. It discusses defining digital humanities and gives examples of digital projects and resources. A case study is presented on exploring the lives of 19th century Ontario farmers through digitizing and analyzing journal entries. The document encourages thinking about how digital approaches can inform research and lists upcoming seminars on digital topics.
HyperMembrane Structures for Open Source Cognitive Computing (Jack Park)
Open source "cognitive computing" systems, specifically OpenSherlock; describes a HyperMembrane structure, a kind of information fabric, for machine reading, literature-based discovery, deep question answering. Platform is open source, uses ElasticSearch, topic maps, JSON, link-grammar parsing, and qualitative process models.
Reviewing literature through digital technologies (HRDC, GJU Hisar)
This document discusses exploring literature through digital technologies and skills. It notes that literature review is an essential part of the research process. It then discusses problems with traditional literature collection methods and how digital skills can help address these problems by allowing remote access and organization of literature. Various digital tools, techniques, and resources for conducting literature searches are outlined, including search engines, databases, collaborative platforms, and organizing tools like literature review matrices.
Digital Humanities is a term that elicits both excitement and scorn in scholarly circles, and there is still a great deal of discussion as to whether it is a field of inquiry, a set of research methods, or simply a new perspective on arts and humanities research. This workshop will provide a brief survey of how the evolving theory and practice of using contemporary technology and technology-assisted research methods are impacting scholarship in the arts and humanities.
The class outline covers an introduction to unstructured data analysis, word-level analysis using the vector space model and TF-IDF, beyond-word-level analysis using natural language processing, and a text mining demonstration in R using Twitter data. The document provides background on text mining, defining what it is and its typical tasks, and discusses the features of text data and methods for acquiring texts. It covers word-level analysis methods like the vector space model and TF-IDF and their applications, then discusses the limitations of word-level analysis and how natural language processing can help. Finally, it demonstrates Twitter mining in R.
The document summarizes efforts to support digital humanities research through collaboration at various institutions. It describes projects at Wheaton College involving students encoding a text using TEI XML under faculty supervision. It also discusses initiatives at the University of Vermont and Brown University to provide infrastructure and expertise for digital scholarship through partnerships between libraries, academic technology groups, and faculty researchers.
Faculty center DH talk 2 S2016: pedagogical provocations (Jennifer Dellner)
This document discusses digital humanities (DH) pedagogy and contrasts it with traditional "ed tech" approaches. It argues that DH is local and contextual, involving specific configurations of tools, faculty, and students based on an institution's strengths and mission. DH emphasizes hands-on learning through making and production, using tools like programming, audio/video creation, and mapping in project-based ways. Examples provided include open-access textbook projects, rewriting Wikipedia, and digital mapping and narrative projects. The document advocates for DH approaches that encourage exploration, distraction, and making over purely delivering content.
Laurel Stvan, Associate Professor of Linguistics, UT Arlington, presentation for “Using Digital Humanities Research Tools in the Classroom” at UT Dallas 2/27/13
This webinar will explain what text-mining is and why it is important to text-mine research papers. We will consider real-world use-cases and applications and discuss barriers to wider adoption of text-mining.
We will also provide practical advice on how to start text-mining research papers, such as where to obtain data, how to access relevant APIs and highlight some of the tools that are available.
ICRH Winter Institute Strand 4 Day 1 - Building Narratives with Digital Objects (Shawn Day)
This document provides an agenda and overview for a workshop on using metadata and Omeka to develop digital narratives. The workshop will include introductions to metadata and Omeka, as well as hands-on sessions adding objects to Omeka collections and building narratives. It reviews the value of metadata for discovery and organization. It also summarizes the basics of different metadata fields like title, creator, date. The objective is to help participants evaluate how digital tools like Omeka could be useful for their research.
A talk to be given in the "Session on Editorial Innovation in OA Publishing" at http://www.oaspa.org/coasp/sessions.php on Aug 23, 2010 in Prague. Also available from http://www.cse.unsw.edu.au/~rvg/COASP/slides.pdf .
Babar: Knowledge Recognition, Extraction and Representation (Pierre de Lacaze)
Babar is a research project in the field of Artificial Intelligence. It aims to bridge Neural AI and Symbolic AI, and as such it is implemented in three different programming languages: Clojure, Python and CLOS.
The Clojure component (Clobar) implements the graphical user interface to Babar. Examples of the Clojure Hiccup library and of interfacing Clojure to Javascript will be presented. The Python module (Pybar) implements the web crawling and scraping and the Neural Networks aspect of Babar. The Word Embedding and LSTM (Long Short-Term Memory) components of Pybar will be described in detail. Finally, the Common Lisp module (Lispbar) implements the Symbolic AI aspect of Babar. The latter includes an English Language Parser and Semantic Networks implemented as an in-memory Hypergraph.
We will present each of these components and target individual aspects with code examples. Specifically, we will first present the web development and Neural Networks components. Then the English Language Parser will be examined in detail. We will also present the knowledge extraction aspect and bridge this with the Neural Network component.
Ultimately we will argue that what can be termed "Neural AI" and "Symbolic AI" are not at odds with each other but rather complement each other. In summary, Artificial Intelligence is not a question of "brain" or "mind", but rather a question of "brain" and "mind".
A talk given at the annual Computer Science for High School Teachers event at Victoria University of Wellington. I presented on some basics of the World Wide Web and why it's worth to preserve it, our work on non-expert tools to populate semantically enriched content, a current project to identify NZ native birds based on their calls that involves citizen science and contemporary deep learning using TensorFlow, a project that investigates the impact of online citizen science on the development of science capabilities of primary school children, and my collaboration with Adam Grener from the School of English, Film, Theater and Media Studies at VUW with whom I am working on computational tools for the literature studies.
This presentation was provided by Anne Washington of the University of Houston during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
This document discusses topic extraction for domain ontology. It describes domain ontology as a collection of vocabularies and conceptualization of a given domain. The purpose of topic extraction is to identify relevant concepts in documents, obtain domain-specific terms, classify documents, and identify key concepts and relationships for an ontology. The project stages include obtaining domain knowledge, preprocessing documents, and applying either K-Means clustering or Latent Dirichlet Allocation to extract topics. K-Means partitions data into clusters while LDA represents documents as mixtures over topics characterized by word distributions.
Presentation by Kristina Hettne at the 'Focus on Open Science' conference in Kaunas 2019 explaining how Leiden University translates best practices to the level of faculties, institutes, individual researchers.
The Abnormal Hieratic Global Portal aims to:
- Bring together published texts, i.e. transcriptions, transliterations and translations
- Teach the study of Abnormal Hieratic with papyri
- Discuss and annotate texts
- Create a name book and dictionary to help new papyri be deciphered
By Ben Companjen, 27th June 2019
This document provides information about open science and opportunities for researchers at Leiden University. It discusses how open science aims to increase research quality, collaboration, and transparency. The document outlines practical steps researchers can take to engage in open science, such as publishing pre-prints and open access articles. Benefits of open science include expanding professional networks, increasing the impact and visibility of research, and opening new career opportunities in areas like data science. The document promotes engaging with the university's Centre for Digital Scholarship for training and support on open science practices.
This document discusses making research data Findable, Accessible, Interoperable, and Reusable (FAIR). It recommends planning for FAIR data management by creating a data management plan. The four steps to making data more FAIR are: 1) Put data in a repository, 2) Decide on data access conditions, 3) Describe data using metadata, and 4) Choose an appropriate license. Making data FAIR can increase exposure and reuse of data, help comply with funder requirements, and allow others to verify and build upon research findings.
Many of the Internet’s image-based resources are locked up in silos, with access restricted to bespoke, locally built applications.
By using IIIF we aim:
1. To give scholars an unprecedented level of uniform and rich access to image-based resources hosted around the world.
2. To define a set of common application programming interfaces that support interoperability between image repositories.
3. To develop, cultivate and document shared technologies, such as image servers and web clients, that provide a world-class user experience in viewing, comparing, manipulating and annotating images.
Presentation by Laurents Sesink on the International Image Interoperability Framework (IIIF) and its application for the storage, presentation, and annotation of digitized North Korean Posters
Mart van Duijn and Laurents Sesink gave this presentation at the 2017 LIBER conference. It deals with the challenges of curating born-digital materials at Leiden University Libraries.
This document provides a high level overview of a reference architecture for research data management at Leiden University. It describes the architecture across multiple layers including an organization layer, process layer, functional layer, technical layer, and solutions layer. Key elements that are discussed include drivers and goals for open science, principles like FAIR data, architecture building blocks, and potential solution building blocks and how they map to requirements. The overall intent is to define a reference architecture that supports open science and improves reuse of research data over both short and long term.
Presentation by Fieke Schoots and Laurents Sesink held for the Research Data Alliance in Barcelona about the services for research data management provided to researchers at Leiden University.
'Preservation', by Laurents Sesink, given at a knowledge exchange session with subject librarians at Leiden University Libraries, September 2017. Topic of the session: online academic collaboration through virtual research environments.
Presentation at the Open Repositories 2017 Conference by Saskia van Bergen and Laurents Sesink on the new repository infrastructure that will be used to preserve and present the digital collections of Leiden University Libraries.
This document discusses research support at Leiden University. It describes the university's efforts to establish a Centre for Digital Scholarship within the university libraries to support open science practices like open access, data management, and data science. The centre aims to provide services across the entire research lifecycle, from the initial idea phase through publication. It will work with other expertise centers and administrative units to create a "one-stop-shop" for research support and facilitate digital scholarship practices. Implementing a comprehensive research data management program and developing shared research facilities and services are important goals. Stakeholder involvement, international cooperation, and building skills in areas like data stewardship will be key to success.
The Centre for Digital Scholarship aims to support academics in the transition to a more interactive academic environment.
Laurents Sesink presented an overview of the Centre's ambitions and activities at the Academy of Korean Studies, 2017.
The document discusses the International Image Interoperability Framework (IIIF). It describes IIIF as a set of common APIs that allow images and image-based resources hosted in different repositories to be accessed and displayed interoperably. It outlines the benefits of IIIF for users, such as fast delivery of zoomable images and ability to annotate and compare images across repositories. It then provides details on the key IIIF APIs - the Image API for retrieving images, and the Presentation API for describing image-based objects and their structure.
Ben Companjen, Peter Verhaar and Laurents Sesink, all from the Centre for Digital Scholarship, jointly present an elaborate overview of the ins and outs of text and data mining and the services provided by Leiden University Libraries.
Laurents Sesink from the Centre for Digital Scholarship explores the possibilities for sustainable storage and access for special collections within the new repository infrastructure at Leiden University Libraries.
Held at KITLV, Royal Netherlands Institute of Southeast Asian and Caribbean Studies, 2016.
Fieke Schoots from the Centre for Digital Scholarship provides, in close collaboration with colleagues from other university libraries (UKB), an overview of the policies that publishers increasingly implement regarding the data underlying publications.
Held at the seminar ‘The Making of Research Data Management Policy’, Wageningen 2016.
Introduction by Mieneke van der Salm on the Leiden ORCID project, held at the Persistent Identifier festival PIDapalooza: how to make sure that all Leiden researchers will acquire their own Open Researcher and Contributor Identifier (ORCID), https://orcid.org/
1. Discover the world at Leiden University
Data Science Workshop
Dr. Peter Verhaar, Maastricht, 2 April 2019
2. Discover the world at Leiden University
Background
◻ Unprecedented growth in the volume of digital data
◻ Combined with the growing sophistication of algorithms and tools
3. Data Science
□ Data mining or data science is the process of applying computational and algorithmic methods to large datasets.
□ Text mining is a collection of methods used to extract information not from “formalised database records” but from “unstructured textual data” (Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1)
[Image: Érik Desmazière, illustration for the cover of La biblioteca de Babel, 1941]
4. Discover the world at Leiden University
Centre for Digital Scholarship
◻ Located within Leiden University Libraries
◻ Staffed by subject librarians and software developers
◻ Builds on existing services and existing expertise
◻ Focus on Open Access, Research Data Management, Digital Preservation and Text and Data Mining
5. Support for TDM within the library?
◻ Central knowledge base
◻ Fostering interdisciplinary collaboration
◻ Clarifying terms and conditions of licences, and negotiations with publishers
◻ Digital preservation
◻ Continuation of the traditional role of libraries: providing access to texts
[Image: the old Public Library of Cincinnati, now demolished]
6. Discover the world at Leiden University
Building expertise on TDM
◻ Literature review; courses on Data Science and on Machine Learning; online tutorials (R packages, Mallet, OpenNLP, packages in Python: nltk, textmining, matplotlib, gensim)
◻ Involvement in an MA course on Text and Data Mining
◻ Interviews with scholars who have expertise
◻ Internal research projects and pilots with researchers
7. Discover the world at Leiden University
Biographic research on Van Gogh
◻ Signs of mental decline in the correspondence of Vincent van Gogh
◻ Average length of sentences and type-token ratios
8. Discover the world at Leiden University
TDM Workshops
◻ Full-day workshop with an explanation of the basics of Python
◻ Explanation of a range of algorithms which can be used to analyse texts
◻ Experiments based on the research questions of participants
9. Discover the world at Leiden University
◻ Educational programme aimed at librarians
◻ Aim is to ensure that librarians can talk about technology on a basic level
◻ Courses are developed in collaboration with the National Library (KB) and VU Amsterdam, under the name “DH Clinics”
10. Discover the world at Leiden University
Workshop outline
◻ Python and Jupyter Notebook
◻ Data acquisition: Web Scraping, APIs, Linked Open Data
◻ Data analysis and enrichment: Pandas, CSV, TDM, tokenization, POS tagging, lemmatization
◻ Data visualisation: Matplotlib
11. Discover the world at Leiden University
◻ Python is a widely used programming language
◻ Developed by Guido van Rossum
◻ Advocates code readability and simplicity
◻ Programming style ought to be ‘pythonic’
12. Discover the world at Leiden University
Algorithm: Word2Vec; Topic Modelling (LDA)
Programming language: Python; Java; Perl
Tool: Voyant; TaPoR
(Image: http://www.rapidtables.com/)
13. Variables
□ Variables have a name: any combination of alphanumerical characters with an underscore
□ Variables can be assigned a value with a specific data type:
keyword = "Elzevier"
number = 10
□ Examples of variable types include string (a sequence of characters), integer (whole numbers) and floating point numbers
14. Strings
□ Can be created with single quotes and with double quotes:
author = 'Douglas Adams'
title = "The Hitchhiker's guide to the galaxy"
□ You can then "escape characters" in your string to add basic formatting:
"\n" new line
"\t" tab
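A minimal sketch of these escape sequences in action, reusing the example strings above:

# "\n" starts a new line, "\t" inserts a tab
author = 'Douglas Adams'
title = "The Hitchhiker's guide to the galaxy"
print(author + "\n\t" + title)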
15. Mathematical operators
□ The following mathematical operators can be used:
+ addition
- subtraction
/ division
* multiplication
□ For example:
sum = 5 + 6
product = 5 * 6
16. Boolean operators
□ Boolean operators compare values:
> greater than
< less than
== equal to
□ Expressions result in a ‘Boolean value’: true or false
a = 5
b = 8
print( a > b )
18. Jupyter Notebook
□ Open source application which can be used to create documents containing both code and documentation
□ Such documents can be opened in a browser
□ It offers support for a variety of programming languages, including Python, Julia and R
□ It includes “kernels” or computational engines which can run the code directly
19. Opening Jupyter Notebook
□ Open Anaconda Navigator and select Jupyter Notebook > Launch
□ OR navigate to the directory that contains your files in the Command Prompt and type in:
jupyter notebook
Jupyter can then be opened in a web browser (e.g. Google Chrome) via the address localhost:8888
□ Jupyter initially opens the dashboard: a directory displaying all your files
20. Opening Jupyter Notebook
□ Jupyter notebooks can also be opened in Microsoft Azure: https://notebooks.azure.com/
□ Create a new project and import a GitHub repository
□ The notebooks for this workshop can be downloaded from: https://github.com/peterverhaar/MaastrichtDataScience
21. Discover the world at Leiden University
Algorithm
Define the number to be guessed
Ask the user to type in a number
WHILE the given number IS NOT the number to be guessed:
    IF the given number is higher: print “LOWER”
    ELSE: print “HIGHER”
    Ask the user to type in a number
Print: “Number is correct”
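A minimal Python sketch of this guessing algorithm (the variable names and the 1-100 range are my own choices):

import random

number_to_guess = random.randint(1, 100)    # define the number to be guessed
guess = int(input('Type in a number: '))    # ask the user to type in a number
while guess != number_to_guess:
    if guess > number_to_guess:
        print('LOWER')
    else:
        print('HIGHER')
    guess = int(input('Type in a number: '))
print('Number is correct')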
24. Discover the world at Leiden University
Data Acquisition
◻ Direct downloads of data objects (e.g. full text in UTF-8 from Delpher or Project Gutenberg)
◻ Downloading data:
  ◻ Downloads of data via Application Programming Interfaces (APIs)
  ◻ Webscraping (via BeautifulSoup; see the sketch below)
  ◻ Downloads of CSV files from data repositories such as Kaggle, figShare, DANS EASY
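A minimal webscraping sketch along these lines, assuming the requests and beautifulsoup4 packages are installed; the URL is an illustrative Project Gutenberg address, not one from the slides:

import requests
from bs4 import BeautifulSoup

# Download an HTML page and extract the text of all paragraph elements
response = requests.get('https://www.gutenberg.org/files/2641/2641-h/2641-h.htm')
soup = BeautifulSoup(response.text, 'html.parser')
for paragraph in soup.find_all('p'):
    print(paragraph.get_text())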
25. API
□ An Application Programming Interface is a technology which can be used to make specific functions of an application, or specific data sets, available to external services
[Diagram: the user sends a request (plus key) to the API, which passes it to the service; results are returned as XML / JSON]
26. □ Some APIs are open; in other cases an API key is needed
□ Data may be delivered in different formats: JSON, XML
□ Actions such as create, read, update and delete are technically possible, but options are usually limited to reading data
□ Texts and images
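A minimal sketch of such a request with the requests library; the endpoint, parameters and key below are hypothetical:

import requests

# Send a request (plus key) and parse the JSON response into a Python dict
response = requests.get(
    'https://api.example.org/records',
    params={'query': 'Van Gogh', 'format': 'json', 'key': 'YOUR_API_KEY'}
)
data = response.json()
print(data)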
28. Discover the world at Leiden University
Tokenisation
□ A process in which texts are divided into smaller units (e.g. paragraphs, sentences, words)
□ Token counts reflect the total number of words; types are the unique words in a text
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
Tokens: 36
Types: 13
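A small sketch that reproduces these counts; note that the exact numbers depend on tokenisation choices such as case-folding and punctuation handling:

text = ('It was the best of times, it was the worst of times, '
        'it was the age of wisdom, it was the age of foolishness, '
        'it was the epoch of belief, it was the epoch of incredulity')

# Lowercase the words and strip the commas before splitting on whitespace
tokens = [w.lower() for w in text.replace(',', ' ').split()]
types = set(tokens)
print('Tokens:', len(tokens))   # 36
print('Types:', len(types))     # 13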
29. Research based on vocabulary
□ Segmentation or tokenisation
□ Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th century)
□ “Soft mark-up”
30. Frequency lists
□ ‘Bag of words’ model: original word order is ignored
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
the 6
it 6
of 6
was 6
epoch 2
age 2
times 2
foolishness 1
wisdom 1
belief 1
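A sketch of how such a frequency list can be built with a ‘bag of words’ approach, using only the standard library:

from collections import Counter
import re

text = ('It was the best of times, it was the worst of times, '
        'it was the age of wisdom, it was the age of foolishness, '
        'it was the epoch of belief, it was the epoch of incredulity')

# Word order is ignored; only the counts are kept
frequencies = Counter(re.findall(r'[a-zA-Z]+', text.lower()))
for word, count in frequencies.most_common():
    print(word, count)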
33. Authorship attribution
□ Suggesting an author for texts whose authorship is disputed
John Burrows, ‘Never Say Always Again: Reflections on the Numbers Game’
34.
35. Type-token ratio
□ Total number of types divided by the number of tokens
□ Gives an indication of the lexical diversity of a text
□ Peter Garrard, Textual Pathology
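As a minimal sketch, the ratio can be computed directly from a token list:

def type_token_ratio(tokens):
    # Number of unique words (types) divided by the total number of words (tokens)
    return len(set(tokens)) / len(tokens)

words = ['it', 'was', 'the', 'best', 'of', 'times',
         'it', 'was', 'the', 'worst', 'of', 'times']
print(type_token_ratio(words))   # 7 types / 12 tokens = 0.58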
37. Python NLTK modules
□ NLTK modules contain text corpora, lexical resources, and “a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning”
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
38. novel = open( "ARoomWithAView.txt" , encoding = 'utf-8' )
fullText = novel.read()
sentences = sent_tokenize(fullText)
for sent in sentences:
    words = word_tokenize(sent)
    tags = nltk.pos_tag(words)
    for t in tags:
        print( t[0] + " => " + t[1] + "\n")
39. The => DT
Signora => NNP
had => VBD
no => DT
business => NN
to => TO
do => VB
it => PRP
said => VBD
Miss => NNP
Bartlett => NNP
40.
41. □ Stemming: converting an inflected verb form into its stem
□ Algorithms based on the removal of suffixes
□ Lemmatisation: relating an inflected verb form to its lemma (dictionary form)
□ Tags are commonly based on the Penn Treebank Tag Set
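A minimal NLTK sketch of the difference, assuming the WordNet data has been downloaded via nltk.download('wordnet'):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))                   # 'studi': suffix removal
print(lemmatizer.lemmatize('studies', pos='v'))  # 'study': dictionary form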
43. Regular expressions
□ A pattern which represents a specific sequence of characters
□ To work with regular expressions in Python, you need to import the ‘re’ module:
import re
□ Regex can be used in the search() method:
if re.search( r"Florence" , line ):
    print( line )
44. Regular expressions
□ Simplest regular expression: a simple sequence of characters
Example:
'sun'
Also matches: disunited, sunk, Sunday, asunder
' sun '
Does NOT match:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.
45. Character classes
.    Any character
\w   Any alphanumerical character: alphabetical characters, numbers and underscore
\d   Any digit
\s   White space: space, tab, newline
[..] Any of the characters supplied within square brackets, e.g. [A-Za-z]
46. Examples
'\d{4}'
Matches: 1234, 2013, 1066
'[a-zA-Z]+'
Matches any word that consists of alphabetical characters only
Does not FULLY match: e-mail, catch22, can't
47. Quantifiers
{n,m} Pattern must occur at least n times, at most m times
{n,}  At least n times
{n}   Exactly n times
?     is the same as {0,1}
+     is the same as {1,}
*     is the same as {0,}
49. Anchors
Do not match characters, but locations within strings.
\b Word boundaries
^  Start of a line
$  End of a line
50. Discover the world at Leiden University
Example
Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK 31-7/7-8-1751]; w. 1719-36)
51. Discover the world at Leiden University
Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK 31-7/7-8-1751]; w. 1719-36)

parts = re.split( '[;]' , data["biography"] )
for p in parts:
    if re.search( '^[*]' , p.strip() ):
        p = re.sub( '^\*' , '' , p )
        if re.search( '\d{4}' , p ):
            match = re.search( '(\d{2,4})' , p )
            data['dob'] = match.group(1)
    elif re.search( '^[†]' , p.strip() ):
        data['dod'] = p.strip()
52. Discover the world at Leiden University
<person>
<firstName>Pieter Jansz van der</firstName>
<lastName>Aa</lastName>
<dob>1697</dob><dod>1751</dod>
<pob>Leiden</pob>
<professional-start>1719</professional-start>
<professional-end>1736</professional-end>
<profession>boekverkoper</profession>
…
</person>
53. Readability metrics
□ Indication of the readability of the text, often based on the average number of words per sentence, or the average number of syllables per word
□ Examples include the Flesch-Kincaid test, Gunning-Fog index, Coleman-Liau index
□ Flesch-Kincaid is often used in the US educational system and roughly indicates the number of years of formal education
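A sketch of the standard Flesch-Kincaid grade level formula; the syllable count is approximated here by counting vowel groups, so the results are indicative only:

import re

def count_syllables(word):
    # Rough approximation: one syllable per group of consecutive vowels
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'[a-zA-Z]+', text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_kincaid_grade('It was the best of times. It was the worst of times.'))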
55. Pandas
□ A Python module developed for data science
□ Available for Python 2.7 and higher
□ Many methods for reading the contents of data sets in a wide range of formats such as CSV, TSV or MS Excel
56. Data frames
□ The data in CSV files can be made available via the read_csv() method
□ This method converts the CSV file into a so-called data frame
□ A data frame consists of rows and columns
□ The data type of the columns is Series
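A minimal sketch of these steps; the file name and column name are hypothetical:

import pandas as pd

# read_csv() converts the CSV file into a data frame of rows and columns
df = pd.read_csv('novels.csv')
print(df.head())          # the first rows of the data frame
print(type(df['title']))  # a single column has the data type Series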
62. Type-token ratio
□ The higher the number, the higher the vocabulary diversity
□ If the number is (relatively) low, there is a high level of repetition
□ The length of the text has an impact on the type-token ratio
63. Correlation
□ A statistical formula that measures the degree to which variables are related
□ Expressed as a numerical value ranging from -1 to +1
□ A negative correlation means that values for one variable go down when the values for the other go up
□ Source: http://www.stat.yale.edu/
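A sketch of computing a (Pearson) correlation with pandas; the values below are invented for two of the measures discussed earlier:

import pandas as pd

df = pd.DataFrame({
    'sentence_length': [12.1, 15.3, 14.2, 10.8, 16.0],
    'type_token_ratio': [0.52, 0.47, 0.49, 0.55, 0.44],
})

# The result ranges from -1 (negative correlation) to +1 (positive correlation)
print(df['sentence_length'].corr(df['type_token_ratio']))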