A discussion of Text and Data Mining in science and at Springer Nature in particular. As presented at the Frankfurt Book Fair 2018 by Markus Kaindl, Senior Manager Semantic Data, Springer Nature.
A Corpus of Chinese Comic Books: Database, Metadata, and Visual Object Recogn... (Matthias Arnold)
The document summarizes a project to create a corpus of digitized Chinese comic books from the 1950s-1970s. It discusses the history and achievements of the project, including digitizing over 1,250 comic books, creating metadata records, and providing online access. It also describes experiments with automatic object detection in the comics using computer vision techniques. Finally, it outlines the new system being developed, including an IIIF image server, use of the Mirador annotation tool, and linking to external authorities and standards to improve interoperability.
Keynote: Exploring and Exploiting Official Publications (maartenmarx)
This document discusses requirements and opportunities for opening up official documents like parliamentary proceedings. It argues that the value lies not in individual documents but in the relationships between documents over time. A political n-gram viewer application is proposed that would allow exploration of topics and language used by different political parties over decades. However, linking documents and extracting needed metadata like speaker affiliations is challenging and existing linked open data is not reliable enough. Official documents need to be self-describing and use shared standards and controlled vocabularies to be truly open and interoperable.
Using semantic web technologies for exploratory OLAP: a survey (ieeepondy)
This paper surveys how semantic web technologies can be applied to exploratory online analytical processing (OLAP). It characterizes traditional DW/OLAP environments and introduces relevant semantic web concepts. It then describes how multidimensional models relate to semantic web formalisms and how semantic web reasoning can be used on multidimensional models. Next, it surveys how semantic web technologies can be used for data modeling, annotation, and extract, transform, and load processes for exploratory OLAP. The paper identifies open challenges for using semantic web technologies to support intelligent multidimensional querying and providing context to data warehouses.
This presentation provides an outlook on what we anticipate for the structured data hub: creating linkable datasets, enhancing the use of provenance, adding quality flags to data, answering new questions and, finally, borrowing from and contributing to public sources such as DBpedia.
(Big) bibliographic data @ ScaDS project meeting, 2015-06-12 (Felix Lohmeier)
The document discusses big bibliographic data from UB Leipzig and SLUB Dresden libraries. It notes that libraries are becoming data hubs and describes the libraries' metadata including resources like books, journals, and accessibility information. The libraries are working together on projects like finc and d:swarm to process and integrate metadata, link authority files, and discover resources through a unified search interface. Challenges include scaling the graph database d:swarm to handle large metadata volumes for data integration and enrichment.
TMA Solutions provides business intelligence, big data, and analytics services including data warehouse design and implementation, data collection and analysis in real-time, and data visualization. Their services also include standard and custom reporting, data analytics and forecasting, analyzing structured and unstructured data, and data migration. They have skills in Microsoft BI tools, machine learning algorithms, and verification methods.
Evolution of motion picture digitization at the National Library of Medicine (John Rees)
(1) The National Library of Medicine has been digitizing its historical audiovisual collection to improve access and preservation, starting with a pilot of 11 films in 2010. (2) Their goals are discovery, access, and preservation, though preservation was initially secondary, and they focus on formats that satisfy most use cases. (3) They are moving to prioritize digitization as a preservation function and modernizing practices, contracting with outside experts to develop best practices and increase throughput while maintaining quality control.
The document discusses the Digital Public Library of America (DPLA), which aggregates metadata from cultural heritage institutions to make their digital collections more discoverable. It describes DPLA as a portal for discovery, a platform to build upon, and a strong public option. DPLA gets funding from private foundations and public agencies. It went live in 2013 and allows users to explore collections through time and place or curated exhibits. Cultural institutions contribute content through hubs. DPLA's API allows innovative apps to access millions of items. The goal is to maximize discovery and use of collections from libraries, archives and museums.
When it comes to your information literacy instruction, do students get it? If they get it, do they use it? If they don’t use it, do they lose it, and when? Come to this workshop to hear about longitudinal assessment of information literacy. We will discuss a case study for assessing information literacy across courses. Librarians will discuss how to chain assessment together to better understand their educational impact and rightsize instruction to needs.
Data mining for causal inference: Effect of recommendations on Amazon.com (Amit Sharma)
As an increasing amount of daily activity, ranging from what we purchase to whom we talk to, shifts to online platforms, it is only natural to ask how those platforms impact our behavior. Take, for instance, online recommendation systems: how much activity do recommendations actually cause over and above what would have happened in their absence? Without doing randomized experiments, which may be costly or infeasible, estimating the impact of such systems is non-trivial. In this talk, I will argue that careful data mining can help in answering relevant causal questions in a more general way than traditional observational approaches.
Taking recommender systems as an example domain, I will show that data mining can be used to augment popular techniques such as instrumental variables, by searching for large and sudden shocks in time series data. Applying this method to system logs for Amazon's "People who bought this also bought" recommendations, we are able to analyze over 4,000 unique products that experience such shocks. This leads to a more accurate estimate of the impact of the recommender system: at least 75% of recommendation click-throughs would likely occur in their absence, questioning popular industry estimates based on observed click-through rates.
Finally, this shock-based approach can be generalized to derive a data-driven identification strategy for finding natural experiments in time series data. This method too reveals a similar overestimate for the impact of recommendation systems.
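The talk's search for "large and sudden shocks" in time series can be illustrated with a minimal sketch. The function, data, window size, and threshold below are all illustrative assumptions, not the method or values from the talk:

```python
# Hypothetical sketch: flag "shocks" (large, sudden jumps) in a daily
# activity series, in the spirit of searching for natural experiments.
from statistics import mean, stdev

def find_shocks(series, window=7, k=3.0):
    """Return indices where a value jumps more than k standard
    deviations above the mean of the trailing window."""
    shocks = []
    for i in range(window, len(series)):
        prior = series[i - window:i]
        mu, sigma = mean(prior), stdev(prior)
        if sigma > 0 and series[i] > mu + k * sigma:
            shocks.append(i)
    return shocks

visits = [100, 102, 98, 101, 99, 103, 100, 480, 470, 460]
print(find_shocks(visits))  # [7] -- only the onset of the spike qualifies
```

Note that only the first spiked day is flagged: once the spike enters the trailing window, the baseline variance grows and subsequent high values no longer stand out, which is one reason a real identification strategy needs more care than this sketch.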
This presentation provides an overview of data mining, including its definition, importance, challenges, and techniques. It begins by defining data mining as identifying patterns in large volumes of data. The presenter notes the importance of data mining is understanding large amounts of data for decision making. Some challenges of data mining include extracting, transforming, and loading data, as well as analyzing and presenting results. The presentation then covers specific techniques like association rule learning, classification, clustering, and sequence analysis and provides examples of each. It concludes by listing additional resources for learning more about data mining.
Immutable Infrastructure: Rise of the Machine Images (C4Media)
Video and slides synchronized; mp3 and slide download available at http://bit.ly/1WlpXHF.
Axel Fontaine looks at what Immutable Infrastructure is and how it affects scaling, logging, sessions, configuration, service discovery and more. He also looks at how containers and machine images compare and why some things people took for granted may not be necessary anymore. Filmed at qconlondon.com.
Axel Fontaine is the founder and CEO of Boxfuse. Axel is also the creator and project lead of Flyway, the open source tool that makes database migration easy. He is a Continuous Delivery and Immutable Infrastructure expert, a Java Champion, a JavaOne Rockstar and a regular speaker at various large international conferences.
The document discusses various techniques for data preprocessing including data cleaning, integration, transformation, reduction, discretization, and concept hierarchy generation. Specifically, it covers filling missing values, handling noisy data, data normalization, aggregation, attribute selection, clustering, sampling and entropy-based discretization to reduce data size while retaining important information.
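One preprocessing step mentioned above, data normalization, is simple enough to sketch. This is a generic min-max normalization, not code from the document:

```python
# Illustrative min-max normalization of a numeric attribute to [0, 1].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min] * len(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

print(min_max_normalize([10, 20, 30, 40]))  # first value maps to 0.0, last to 1.0
```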
Who cares about yesterday's news? Use cases and requirements for newspaper digitization. Presentation held at IFLA News Media Conference 2016, 20-22 April, Hamburg, Germany.
The document discusses the future of libraries and opportunities and challenges for the publishing industry. It describes trends that will impact libraries such as increased use of technology, online learning, and student expectations. Libraries will provide more digital resources through mobile access and cloud computing. Open educational resources and user-generated content will become more common. Libraries will need to address security issues and protect privacy while maintaining intellectual freedom.
This document describes a project report submitted by Apoorv Mehta, Maitray Thaker, and Shail Shah to Gujarat Technological University in fulfillment of their Bachelor of Engineering degree in Information Technology. The report details their project on improving road traffic safety by mining accident data and developing a decision tree to classify injury severity using R and Hadoop. The project analyzed a large accident dataset, too large for typical databases or analysis software, by implementing distributed processing with Hadoop and connecting the results to the statistical language R for analysis and visualization. This generated a decision tree that could help traffic engineers optimize road safety and help government agencies allocate medical resources.
The document provides an overview of data mining concepts and techniques. It introduces data mining, describing it as the process of discovering interesting patterns or knowledge from large amounts of data. It discusses why data mining is necessary due to the explosive growth of data and how it relates to other fields like machine learning, statistics, and database technology. Additionally, it covers different types of data that can be mined, functionalities of data mining like classification and prediction, and classifications of data mining systems.
Chapter 8.4, Data Mining: Concepts and Techniques, 2nd Ed slides, Han & Kamber (error007)
This document discusses mining sequence patterns in biological data. It begins with an overview of DNA structure and the central dogma of biology by which DNA is transcribed into RNA and translated into protein. It then describes several lab tools that can be used to determine biological data, such as DNA sequencers, mass spectrometry, and microarrays. The document concludes by noting that biological data mining can provide insights into biological processes and gain knowledge from abundant biological data sources.
This document discusses the evolution of database technology and data mining. It provides a brief history of databases from the 1960s to the 2010s and their purposes over time. It then discusses the motivation for data mining, noting the explosion in data collection and need to extract useful knowledge from large databases. The rest of the document defines data mining, outlines the basic process, discusses common techniques like classification and clustering, and provides examples of data mining applications in industries like telecommunications, finance, and retail.
This document provides an introduction to data mining and machine learning. It discusses how data mining can extract hidden patterns from large datasets. The document covers common data mining tasks like classification, regression, and clustering. It also describes different algorithms for classification including decision trees, naive Bayes classifiers, and k-nearest neighbors. Regression is also introduced as predicting real-valued outputs. The document uses examples to illustrate key concepts in data mining.
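One of the classification algorithms the introduction lists, k-nearest neighbors, fits in a short sketch. The training points, query, and k below are illustrative:

```python
# Minimal k-nearest-neighbors classifier (one algorithm named above).
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; return the majority
    label among the k points nearest to the query."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # "a" -- all three nearest neighbors are "a"
```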
Chapter 7, Data Mining: Concepts and Techniques, 2nd Ed slides, Han & Kamber (error007)
The document describes chapter 7 of the book "Data Mining: Concepts and Techniques" which covers cluster analysis. The chapter discusses what cluster analysis is, different types of data that can be analyzed, major clustering methods like partitioning, hierarchical, and density-based methods. It also covers measuring cluster quality, requirements for clustering in data mining, and how to calculate similarity and dissimilarity between data objects.
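The dissimilarity calculation the chapter covers is the core of partitioning methods such as k-means: each object is assigned to its nearest centroid. A minimal sketch of that assignment step, with made-up points:

```python
# Assign each point to its nearest centroid by Euclidean dissimilarity,
# the core step of partitioning clustering methods such as k-means.
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centroids = [(1.0, 1.5), (8.5, 8.0)]
print(assign_clusters(points, centroids))  # [0, 0, 1, 1]
```

A full k-means loop would alternate this assignment step with recomputing each centroid as the mean of its assigned points until the assignments stop changing.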
Look for events that are anomalous given their context, such as:
- Time of day (e.g. activity at 3am)
- Source/destination (e.g. traffic from unknown IP)
- Associated events (e.g. login without subsequent activity)
- Normal volume patterns (e.g. spike in requests)
Analyze events in context to identify deviations from normal patterns.
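The contextual checks above can be sketched as a simple baseline comparison. The hour-of-day baselines, threshold k, and function below are illustrative assumptions, not a production detector:

```python
# Sketch of one contextual check listed above: flag an event count
# that deviates sharply from that hour's historical norm.
from statistics import mean, stdev

# baseline[h] = historical event counts previously observed at hour h
baseline = {3: [2, 1, 3, 2, 2], 14: [120, 130, 125, 128, 122]}

def is_anomalous(hour, count, k=3.0):
    """True if count is more than k standard deviations from the
    hour's historical mean (floor sigma at 1 to avoid zero-variance noise)."""
    history = baseline.get(hour)
    if not history:
        return True  # no context for this hour: treat as suspicious
    mu, sigma = mean(history), stdev(history)
    return abs(count - mu) > k * max(sigma, 1.0)

print(is_anomalous(3, 40))    # True: heavy activity at 3am
print(is_anomalous(14, 126))  # False: normal mid-afternoon volume
```

The same pattern extends to the other contexts listed (source/destination, associated events): build a baseline keyed on the context attribute, then flag events that fall outside it.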
Multimodal Perspectives for Digitised Historical Newspapers (cneudecker)
This document discusses challenges and opportunities in analyzing digitized historical newspapers. It describes several projects aimed at improving OCR accuracy using deep learning models, extracting structural information using computer vision and heuristics, and establishing standards for metadata and evaluation. Key challenges include the need for more granular and representative ground truth newspaper data, methods that combine machine learning and domain knowledge, and community efforts around shared tasks, seminars, and an atlas of digitized newspapers to advance interdisciplinary research. The overall goal is to make cultural heritage collections more accessible online through improved digitization and analysis of newspapers.
Widening the limits of cognitive reception with online digital library graph ... (Marton Nemeth)
This document discusses using semantic web technologies like linked data and RDF to improve information retrieval from digital library collections. It provides examples of semantic implementations at libraries like Europeana, the French National Library, and the German National Library. Key points covered include linking diverse data sources to facilitate discovery, creating semantic search interfaces, and addressing challenges of referencing vocabularies and evaluating semantic datasets and user experiences. The research plan proposes comparing new semantic OPACs to traditional interfaces and developing a methodology for evaluating the user experience of semantic library systems.
The Europeana Newspapers Project aims to aggregate and refine over 18 million digitized newspaper pages for Europeana and The European Library. It will perform optical character recognition (OCR) and named entity recognition to convert images to searchable text. The 17-partner consortium, representing 12 countries, will survey existing newspaper collections, develop best practices for digitization workflows, and build a content browser for searching and accessing newspaper pages. The project seeks to improve access to and reuse of historical newspapers in Europe.
Günter Mühlberger (University of Innsbruck, AT): The READ project. Objectives, tasks and partner organisations
co:op-READ-Convention Marburg
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival Collections.
With a special focus on biographical data in archives
Hessian State Archives Marburg Friedrichsplatz 15, D - 35037 Marburg
19-21 January 2016
The Europeana Newspapers Project aims to aggregate and refine over 18 million digitized newspaper pages for Europeana and The European Library. It will perform optical character recognition and article segmentation to convert images to searchable text. The project involves 17 partners from 12 countries who will provide newspaper content and refinements. It seeks to improve access to historical newspapers, establish best practices for digitization, and increase usage of Europeana's newspaper collections.
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012 (lljohnston)
Big Data challenges in developing repositories include:
- Collections like web archives and historic newspapers contain billions of files and grow quickly, requiring constant processing and large-scale infrastructure.
- Researchers want to analyze entire collections using algorithms and computational methods rather than accessing individual items.
- Repository services need to support self-serve access, full-text search of entire collections, and APIs to enable computational research methods.
- Ingesting and providing access to collections measured in petabytes and containing highly diverse content and metadata requires normalization and standardization.
"Big Picture" Backgrounder for Crowdsourcing Ground-Truth - FactMiners, PRImA... (Jim Salmons)
This is a compilation of the four short "silent Ignite Talk" slideshows submitted with our recent unfunded Knight News Challenge entry. Our current Knight Prototype Fund entry focuses on the "People" (Part 4) component of our larger applied research agenda.
This document summarizes Europeana Newspapers, an EU project from 2012-2015 to digitize historical newspapers. It discusses:
1) The project digitized over 1,000 newspaper titles from 1618-2016 containing 3.3 million issues from 12 countries and 40 languages totaling 120 TB of data.
2) The data was processed using OCR and OLR to extract text and metadata, and is available through various portals and downloads under open licenses.
3) The document outlines the tools used for preprocessing, OCR, OLR, and named entity recognition developed through the project.
4) Future plans are discussed to migrate the data to new Europeana collections, improve search/brows
The European Newspapers Project aims to aggregate and refine over 18 million digitized newspaper pages from European libraries to provide to Europeana and The European Library. The project consortium includes 17 national libraries and university libraries from 12 countries. The project will perform optical character recognition (OCR) on 10 million pages and OCR with article segmentation on 2 million pages. It will also conduct named entity recognition. The project seeks to improve access to digitized European newspaper collections through enhanced search capabilities of full newspaper texts. Dissemination activities will help increase awareness and usage of Europeana.
This document summarizes a study that analyzed descriptive metadata from digitized historical newspaper collections. Key points:
- Over 850,000 newspaper pages from 6 French titles spanning 1814-1945 were analyzed. Optical layout recognition was used to extract descriptive metadata.
- Data mining and visualization techniques were applied to the metadata to gain insights into the history of newspapers and press. Trends in layout, illustrations, word count and more over time were examined.
- The analyses revealed information like periods of structural changes in newspapers, the impact of events like wars on content volume, and outliers that indicated special issues like illustrated supplements. The automated analyses of descriptive metadata provided new perspectives on historical newspaper collections.
Data Mining Newspapers Metadata
1. DATA MINING HISTORICAL
NEWSPAPERS METADATA
Old News Teaches History
Jean-Philippe Moreux
Bibliothèque nationale de France,
Digitization Department
IFLA News Media Section,
Hamburg, April 2016
2. A True Story about the Researchers' Needs
• How can we help a historian working on the creation and
development of Stock Market quotes in French newspapers?
(1800-1870)
3. A True Story about the Researchers' Needs
• Obviously, he had to query the digital library catalog.
catalog search
4. A True Story about the Researchers' Needs
• Moreover, he needed a text retrieval functionality.
text retrieval
catalog search
The basics
5. A True Story about the Researchers' Needs
• But is it enough? Could we do better?
text retrieval
catalog search
+ Corpora builder
+ Predefined qualitative
& easy-to-use corpora
+ Advanced query
on document structure
and layout (to spot Stock
Market regions)
6. The True Story (cont'd): an Unhappy Ending
“Stock Market quotes in French Newspapers (1801-1870)”
PhD in Communication and Information Science (P.-C. Langlais)
• The creation of his corpus was very painful:
1. The historian had to script the DL to extract OCR and metadata
from multiple newspaper titles.
2. Then he had to refine/structure his text corpora.
More than 100 Python scripts were needed!
Historians generally prefer to focus on research, not on writing scripts…
8. How to Satisfy Scientists’ Needs?
Let’s try to address this question, regarding the heritage daily
corpus enriched during the Europeana Newspapers project:
• Feed the DL with enriched digital documents?
• Give end-users access to quantitative metadata describing
documents structure and layout?
• Give end-users an ad hoc corpora builder functionality?
Plan
1. The Europeana Newspapers test bed
2. Building a quantitative metadata dataset
3. Data mining and data visualization use-cases
9. Enriching Digital Documents
Europeana Newspapers
project (2012-2015): 11.5M
OCR'ed pages, 2M OLR'ed
pages from 14 European
libraries
What is OLR?
• Identification of structural
elements, including
separation of articles
and sections.
• Classification of types of
content (ads, offers,
obituaries…)
• The Europeana Newspapers project has enriched and aggregated
millions of heritage newspaper pages with advanced refinement
techniques like Optical Layout Recognition and Named Entity
Recognition.
10. Document Analysis Techniques like OLR
Produce Quantitative Metadata
The good news is that OCR and OLR files are full of interesting
objects tagged in the XML:
• OCR (ALTO) is a source for quantitative metadata: number of words,
illustrations & tables, paper format…
• OLR (METS) is also a valuable source, for high-level informational objects:
• number of articles, titles, etc.
• identification of sections (groups of articles)
• content type classification (ads, judicial reviews, stock market…)
Huge amount of valuable data
for historians!
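The counting described above can be sketched in a few lines of Python. This is a minimal sketch: the tiny inline sample stands in for a real ALTO file, and only standard ALTO element names are assumed (`String` for one word, `Illustration` for a graphic, `Page` attributes for the paper format).

```python
# Count quantitative metadata objects in one ALTO (OCR) page.
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"

# Tiny stand-in for a real ALTO page (illustrative content only).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page WIDTH="4800" HEIGHT="6500">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Cours"/>
            <String CONTENT="de"/>
            <String CONTENT="la"/>
            <String CONTENT="Bourse"/>
          </TextLine>
        </TextBlock>
        <Illustration WIDTH="900" HEIGHT="600"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

def count_page_objects(alto_xml: str) -> dict:
    """Return word/illustration counts and the page format for one ALTO page."""
    root = ET.fromstring(alto_xml)
    page = root.find(f".//{ALTO_NS}Page")
    return {
        "words": len(root.findall(f".//{ALTO_NS}String")),
        "illustrations": len(root.findall(f".//{ALTO_NS}Illustration")),
        "page_width": int(page.get("WIDTH")),
        "page_height": int(page.get("HEIGHT")),
    }

print(count_page_objects(SAMPLE_ALTO))
```

The same pattern applies to METS/OLR files, counting `article` or section elements instead.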
11. How to Build such Datasets?
• We have to count the number of objects in each page of the
collection. Straightforward with XSLT, Java, Python, Perl, etc.
• We have to package and deliver these datasets to end-users.
Europeana Newspapers
project / BnF: 880,000
OLR'ed pages from the BnF
newspapers collection,
6 titles, 1814-1944
Pros:
• Give users light derived datasets, not TB of XML files!
• It's not rocket science.
• It's fast (2-3 h/title with an optimized non-XML parsing script)
No cons!
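Packaging the per-page counts as a light derived dataset could look like the sketch below; the `pages` records and their field names are hypothetical stand-ins for the counts extracted from the ALTO/METS files of one title.

```python
# Serialize per-page counts as a small CSV dataset, instead of
# shipping terabytes of raw XML to end-users.
import csv
import io

# Hypothetical per-page counts for one title (illustrative values).
pages = [
    {"title": "JDPL", "date": "1914-08-02", "page": 1, "words": 4120, "illustrations": 0},
    {"title": "JDPL", "date": "1914-08-02", "page": 2, "words": 5230, "illustrations": 2},
]

def to_csv(rows: list[dict]) -> str:
    """Write a list of homogeneous dicts as CSV text (header + one row per page)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(pages))
```

The same rows can be dumped as JSON or XML with the corresponding standard-library modules, matching the three delivery formats of the dataset.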
12. Who are the End-Users of the BnF Dataset?
• The EN-BnF dataset includes 5.5M values (150K issues, 880K pages)
• 7 metadata fields at issue level, 5 at page level
• XML, JSON or CSV formats
Researchers (Digital Humanities, History of the Press, Information Science)
Digital Curators & Mediators: insights on the collections
Digitization Program Managers: statistics on digitized content
17. Engaging new Audiences with Dataviz
An interactive chart of the word density reveals breaks
due to changes in layout & paper format, outlier issues…
tools
Journal des débats politiques et littéraires, 1814-1944, 45,334 issues displayed
Go beyond keyword spotting and page flipping!
Some users would like to play with those charts!
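The breaks such a chart makes visible can also be detected programmatically. A sketch on hand-made yearly word-density averages; the figures and the 25% threshold are illustrative assumptions, not the Journal des débats values:

```python
# Flag years where the average word density jumps against the previous
# year, which typically signals a change of layout or paper format.
def find_breaks(yearly_avg: dict[int, float], threshold: float = 0.25) -> list[int]:
    """Return years whose average moved by more than `threshold` vs the year before."""
    years = sorted(yearly_avg)
    breaks = []
    for prev, cur in zip(years, years[1:]):
        change = abs(yearly_avg[cur] - yearly_avg[prev]) / yearly_avg[prev]
        if change > threshold:
            breaks.append(cur)
    return breaks

# Illustrative yearly averages (words per issue).
density = {1826: 9800, 1827: 9900, 1828: 14200,  # larger paper format from 1828
           1829: 14100, 1830: 14000}
print(find_breaks(density))  # -> [1828]
```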
18. Requesting the Dataset
These datasets can be requested with dedicated tools
(statistical environments, NoSQL or XML databases...)
• Image search solution used by the Gallica Mediation Service:
an XQuery HTTP API identifies "graphical" pages, i.e. pages that are
both poor in words and contain illustrations.
tools
http://localhost:8984/rest?run=findIllustratedPages.xq&toDate=1920-01-01&toPage=1
"As a digital mediator, searching for illustrations
in our 12M-page collection is a nightmare…"
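The logic behind that XQuery API can be mirrored in a few lines of Python over the derived dataset. The 1,000-word threshold and the page records below are assumptions for illustration, not the actual API parameters:

```python
# A page is "graphical" when it is both word-poor and contains
# at least one illustration.
def is_graphical(page: dict, max_words: int = 1000) -> bool:
    return page["illustrations"] >= 1 and page["words"] < max_words

# Hypothetical page records from the derived dataset.
pages = [
    {"id": "p1", "words": 4200, "illustrations": 0},  # dense text page
    {"id": "p2", "words": 350, "illustrations": 3},   # picture-heavy page
    {"id": "p3", "words": 300, "illustrations": 0},   # short but not illustrated
]
print([p["id"] for p in pages if is_graphical(p)])  # -> ['p2']
```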
19. Requesting the Dataset
• Looking for WW1-censored front pages with BaseX: XQueries
can be written to dig into the data and find specific types of content, e.g.
the front pages censored during the Great War, which have a slightly
smaller word count than the front-page average.
tools
Is it effective?
• Recall: 45%
• Precision: 68%
(Based on a ground truth established on the
Journal des Débats front pages for 1915)
These figures show the limits of a statistical approach applied to a
word-based metric biased by layout singularities. Good enough for
mediation, though: see the Gallica blog post.
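The evaluation can be reproduced in miniature. A sketch only: the word counts, the 0.8 ratio, and the ground-truth set below are invented for illustration, not the Journal des Débats data:

```python
# Heuristic: flag front pages whose word count falls clearly below the
# average, then score it against a hand-made ground truth.
from statistics import mean

def flag_censored(front_pages: list[dict], ratio: float = 0.8) -> set:
    """Return ids of front pages with a word count below `ratio` * average."""
    avg = mean(p["words"] for p in front_pages)
    return {p["id"] for p in front_pages if p["words"] < ratio * avg}

def precision_recall(predicted: set, actual: set) -> tuple[float, float]:
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

# Illustrative front-page word counts; issues 2 and 4 are "known" censored.
pages = [{"id": i, "words": w} for i, w in
         enumerate([5200, 5100, 3600, 5300, 3900, 5150])]
truth = {2, 4}
print(precision_recall(flag_censored(pages), truth))  # -> (1.0, 0.5)
```

Note how issue 4 slips under the threshold in this toy run: exactly the kind of miss a word-based metric produces when layout varies.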
21. Perspectives
• Apply the same data mining process to the other Europeana
Newspapers OLR'ed datasets to produce more datasets.
Apply it to the ongoing BnF newspaper digitization program.
• Automatically build the quantitative metadata datasets.
• Experiment on other types of materials with a temporal dimension
(e.g. long-running magazines and reviews, early printed books).
• @BnF: Assess the opportunity of setting up a data mining framework
targeting DH researchers ("Corpus" BnF research project, 2016-2018):
Corpora builder? API? OCR dumps? Derived datasets? Remote
processing?...
22. Conclusion
• Quantitative metadata are relevant for all DLs’ users: scientists,
general public, institutions’ employees.
• OLR enrichment provides a rich source of information for researchers.
Such data, possibly crossed with the OCRed text, usually provide a
fertile ground for research hypotheses.
• Only basic data mining & dataviz methods and tools are needed to
use such datasets:
• Basic scripting: XSL, Python, Perl, JavaScript…
• Statistical applications: Excel, OpenOffice, R…
• Ready-to-use chart & timeline APIs: Highcharts,
Google Charts, timeline.knightlab.com, Sigmajs.org…
• Easy-to-use NoSQL or XML databases: BaseX, MongoDB…
25. Final Thought: Advanced Search in Newspapers?
• Feeding the search engine with layout and structural metadata will
allow users to perform advanced mixed queries:
text retrieval
catalog search layout MD
structural MD
? illustrated articles
in Judicial review section
from 1914 to 1916
where title contains
“caillaux” or “calmette”
? articles with table
in Le Matin
where title contains
“metal prices”
and body contains “gold”
Trove Advanced Search
http://trove.nla.gov.au
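A mixed query of this kind could be expressed programmatically as a filter over article records carrying layout and structural metadata alongside the text. The records and field names below are hypothetical:

```python
# Combine structural metadata (section), layout metadata (illustrated),
# a date range, and plain text retrieval in one query.
from datetime import date

# Hypothetical article records (illustrative content only).
articles = [
    {"section": "Judicial review", "illustrated": True,
     "date": date(1914, 7, 21), "title": "Le procès Caillaux"},
    {"section": "Stock market", "illustrated": False,
     "date": date(1915, 3, 2), "title": "Cours des métaux"},
]

def mixed_query(items, section, illustrated, start, end, title_contains):
    return [a for a in items
            if a["section"] == section
            and a["illustrated"] == illustrated
            and start <= a["date"] <= end
            and title_contains.lower() in a["title"].lower()]

hits = mixed_query(articles, "Judicial review", True,
                   date(1914, 1, 1), date(1916, 12, 31), "caillaux")
print([a["title"] for a in hits])  # -> ['Le procès Caillaux']
```

In a production search engine the same predicates would become facets and index fields rather than an in-memory filter.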
26. Is it Working for Books too?
• Books' OCR also contains meaningful layout information:
tables, maps, ornaments, drop caps…
text retrieval
catalog search layout MD
structural MD
? pages illustrated with a map
in XIXth books
where text contains “Mars”
? illustrated pages
in XIXth books
where text contains “Mars”
maps
photos,
drawings,
diagrams…
27. Final Thought: Advanced Search in Newspapers?
• Adding a pinch of semantic flavor to get closer to natural language query:
text retrieval
catalog search layout MD
structural MD
I’m looking for illustrated articles on front page in Trial topic
from 1914 to 1916 which contain NE.person “Henriette Caillaux”
or “Gaston Calmette”
Named Entities
Recognition
Topic Modelling
Historical Events
Recognition
Themes
Classification
28. Final Thought: Advanced Search in Newspapers?
RetroNews Advanced Search
http://www.retronews.fr
Faceted search: dates, NE, themes, events, topics…
29. Thank you for your attention!
• The dataset (CSV, XML, JSON) and charts are publicly available. Just
play with them! (no language barrier: not a single word of French inside)
http://altomator.github.io/EN-data_mining
Thanks to all the EN partners!