This document summarizes the BnF's approach to providing access to its digital data and encouraging reuse through new services and use cases. It discusses exposing data through APIs, datasets, and web services using protocols like OAI-PMH, SRU, SPARQL, and IIIF. It provides examples of projects like NewsEye and GallicaPix that leverage BnF data. It also outlines the general workflow for working with BnF digital content, including selecting required metadata, identifying access methods, extracting resources, and building applications by aggregating, processing, and enriching the data.
5. How?
APIs (Application Programming Interfaces): allow developers to write programs that communicate with each other
Datasets: collections of ready-to-use or on-demand data/documents
Web services: allow machines to communicate on the web, using web protocols (HTTP)
Temporality: synchronous / asynchronous
11. NewsEye H2020 project
• Article Separation
• HTR (OCR++)
• Named Entity Recognition…
https://www.newseye.eu/
• French dataset: 60k issues delivered as metadata + OCR only (no images)
• The partners can ingest images for processing at page level or document level (manifest.json)
• The project DL (digital library) can handle IIIF (Fedora)
Pros: no more hard drives (HDs) to ship!
Cons: downloads can be long and painful for DL servers…
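To make the manifest-based exchange concrete, here is a minimal Python sketch that walks a document's IIIF manifest and downloads every page image; on full newspaper issues this is exactly the long download mentioned above. It assumes the IIIF Presentation 2.x manifest layout served by Gallica and reuses the example ark that appears later in these slides.

```python
# Minimal sketch: download all page images of one Gallica document from its
# IIIF manifest (assumes the IIIF Presentation 2.x layout used by Gallica;
# the ark below is the example document used later in these slides).
import pathlib
import requests

ARK = "bpt6k5738219s"
manifest_url = f"https://gallica.bnf.fr/iiif/ark:/12148/{ARK}/manifest.json"
manifest = requests.get(manifest_url).json()

out = pathlib.Path(ARK)
out.mkdir(exist_ok=True)

for i, canvas in enumerate(manifest["sequences"][0]["canvases"], start=1):
    image_url = canvas["images"][0]["resource"]["@id"]   # full-size page image
    img = requests.get(image_url, timeout=120)
    (out / f"page_{i:04d}.jpg").write_bytes(img.content)
    print("saved", image_url)
```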
12. • Proof of Concept on image search for digital libraries (topic: WW1)
• Automatic extraction of content from BnF digital collections (IIIF, Gallica, SRU, OAI-PMH, SPARQL)
• Visual content enrichment thanks to deep learning approaches
Image Search
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
GallicaPix
13. • Automatic face/genre recognition with deep learning (L’Excelsior, 1910-1920)
• Data analysis, data visualisation
Image Search for DH
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
14. How to work with digital content
Detailed example: GallicaPix, a web app on Image Search (WW1)
OAI-PMH
SRU
Linked Data
IIIF
15. I. Select the required metadata/documents
II. Identify the ways to access these data
III. Extract the resources (asynchronous mode) or use them in real time (synchronous mode)
IV. Build the application / analyse the data / …
How to work with BnF digital content
GallicaPix: block diagram
16. 1. How to find documents related to WW1?
1.1 With OAI-PMH (Open Archives Initiative - Protocol for Metadata Harvesting)
Block diagram legend:
A: Gallica OAI repository
B: BnF Catalog repository
C: Europeana repository
3: GallicaPix (back-end)
4: GallicaPix (front-end)
1: Machine-to-machine queries
2: Results: list of document metadata
17. • List the Gallica « sets » in the OAI repository:
http://oai.bnf.fr/oai2/OAIHandler?verb=ListSets
• Harvest the WW1 set (« gallica:corpus:1418 »)
http://oai.bnf.fr/oai2/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=gallica:corpus:1418
…
Drawbacks:
• No search criteria
• The sets must have been created by the OAI owner
Let’s do it!
1.1 With OAI-PMH
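As a rough illustration of this "Let's do it" step, here is a minimal Python sketch (using only the requests library and the standard library XML parser) that lists the Gallica OAI sets and harvests the WW1 set shown above, following resumption tokens; the response structure assumed is the standard OAI-PMH 2.0 one.

```python
# Minimal OAI-PMH harvesting sketch (assumes the oai.bnf.fr endpoint and the
# "gallica:corpus:1418" set shown above; error handling omitted).
import requests
import xml.etree.ElementTree as ET

OAI = "http://oai.bnf.fr/oai2/OAIHandler"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

def list_sets():
    xml = requests.get(OAI, params={"verb": "ListSets"}).text
    root = ET.fromstring(xml)
    return [(s.findtext("oai:setSpec", namespaces=NS),
             s.findtext("oai:setName", namespaces=NS))
            for s in root.iter("{http://www.openarchives.org/OAI/2.0/}set")]

def harvest(set_spec):
    """Yield (identifier, title) pairs, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec}
    while True:
        root = ET.fromstring(requests.get(OAI, params=params).text)
        for rec in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
            ident = rec.findtext(".//oai:identifier", namespaces=NS)
            title = rec.findtext(".//dc:title", namespaces=NS)
            yield ident, title
        token = root.findtext(".//oai:resumptionToken", namespaces=NS)
        if not token:
            break
        params = {"verb": "ListRecords", "resumptionToken": token}

if __name__ == "__main__":
    print(list_sets()[:5])
    for i, (ident, title) in enumerate(harvest("gallica:corpus:1418")):
        print(ident, title)
        if i >= 10:
            break
```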
18. a) Search in Gallica (keyword search or advanced form)
b) Copy the query segment from the URL
1. How to find documents related to WW1? (cont'd)
1.2 With the SRU protocol (Search/Retrieve via URL)
19. c) Paste the Gallica query into the SRU query
https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=(dc.subject all "Guerre mondiale 1914-1918") and (dc.type all "image") and (gallicapublication_date>="1914/01/01") and (gallicapublication_date<="1918/01/01")&maximumRecords=100
d) Extract the metadata from the XML result list (-> coding)
1. How to find documents related to WW1? (cont'd)
1.2 With the SRU protocol (Search/Retrieve via URL)
[Screenshot of the SRU XML response: 13483 matching records]
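To show what step d) might look like, here is a minimal Python sketch that sends the SRU query above and extracts a few Dublin Core fields from the XML result list; the namespaces and paging parameters are the standard SRU 1.2 ones, and the exact record layout should be checked against a real response.

```python
# Minimal SRU client sketch for the Gallica query shown above
# (standard SRU 1.2 parameters; tune startRecord/maximumRecords for paging).
import requests
import xml.etree.ElementTree as ET

SRU = "https://gallica.bnf.fr/SRU"
NS = {"srw": "http://www.loc.gov/zing/srw/",
      "dc": "http://purl.org/dc/elements/1.1/"}

query = ('(dc.subject all "Guerre mondiale 1914-1918") and (dc.type all "image") '
         'and (gallicapublication_date>="1914/01/01") '
         'and (gallicapublication_date<="1918/01/01")')

def search(cql, start=1, rows=50):
    params = {"version": "1.2", "operation": "searchRetrieve",
              "query": cql, "startRecord": start, "maximumRecords": rows}
    root = ET.fromstring(requests.get(SRU, params=params).content)
    total = int(root.findtext("srw:numberOfRecords", default="0", namespaces=NS))
    records = []
    for rec in root.iter("{http://www.loc.gov/zing/srw/}recordData"):
        records.append({
            "identifier": rec.findtext(".//dc:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "date": rec.findtext(".//dc:date", namespaces=NS),
        })
    return total, records

total, recs = search(query)
print(total, "records; first title:", recs[0]["title"] if recs else None)
```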
20. • All the documents about the WW1 theme:
For humans (HTML format):
http://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/#documents
For machines (RDF/XML, N3…, JSON-LD formats):
https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.xml
https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.n3
1. How to find documents related to WW1? (cont'd)
1.4 With data.bnf.fr: semantic search on Linked Data
21. • Authors related to WW1:
https://data.bnf.fr/fr/linked-authors/11939093
• Documents on Verdun:
https://data.bnf.fr/fr/15265210/verdun__meuse__france_
https://data.bnf.fr/fr/15265210/verdun__meuse__france_/rdf.xml
1. How to find documents related to WW1? (cont'd)
1.4 With data.bnf.fr: semantic search on Linked Data
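A small sketch of consuming the machine-readable form: fetch the RDF/XML description of the WW1 subject shown above with rdflib and inspect the graph. The exact vocabulary used by data.bnf.fr is not detailed in these slides, so the code only loads the graph and prints a few triples to explore the model.

```python
# Sketch: load the machine-readable description of the WW1 subject heading
# from data.bnf.fr and inspect it (the vocabulary used by data.bnf.fr is not
# detailed here; print a few triples to see how to query it further).
import rdflib

URL = "https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.xml"

g = rdflib.Graph()
g.parse(URL, format="xml")

print(len(g), "triples")
for s, p, o in list(g)[:10]:
    print(s, p, o)
```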
22. Block diagram legend:
3: GallicaPix (back-end)
4: GallicaPix (front-end)
1: Human/machine queries
2: Results: metadata
2. How to work with the documents?
From the results list (2):
a) Get the document metadata
b) Store this metadata locally (3)
c) Get the documents (if needed) and store them (3)
d) Build services on top of the local database (4)
23. Store the data?
In a document-oriented database (NoSQL):
• XML databases: BaseX, eXist…
• JSON databases: MongoDB
• Graph-oriented databases
Or in any other place…
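As one concrete option, harvested XML records can be pushed into a BaseX database through its REST interface (the GallicaPix demo URLs earlier in the deck, on port 8984 under /rest, suggest a BaseX back end). The server address, database name and credentials below are placeholders for a local BaseX instance with the REST service enabled.

```python
# Sketch: store a harvested XML record in a BaseX database over its REST API.
# localhost:8984, the "ww1" database and the credentials are placeholders for
# a local BaseX server with the REST service enabled.
import requests

BASEX = "http://localhost:8984/rest"
AUTH = ("admin", "admin")                     # placeholder credentials
DB = "ww1"                                    # placeholder database name

record_xml = """<record>
  <identifier>ark:/12148/bpt6k5738219s</identifier>
  <title>Example harvested record</title>
</record>"""

requests.put(f"{BASEX}/{DB}", auth=AUTH)                # create the database
resp = requests.put(f"{BASEX}/{DB}/bpt6k5738219s.xml",  # add one resource
                    data=record_xml.encode("utf-8"),
                    headers={"Content-Type": "application/xml"},
                    auth=AUTH)
print(resp.status_code)
```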
24. a) Get the document metadata
a.1) With OAI-PMH:
http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=ark:/12148/bpt6k5738219s
a.2) With the Gallica Document API:
https://gallica.bnf.fr/services/OAIRecord?ark=bpt6k5738219s
a.3) With IIIF:
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k5738219s/manifest.json
Let’s do it!
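A quick sketch of options a.2 and a.3 for the same document: fetch the record from the Gallica Document API and the IIIF manifest, then pull out a title and a page count. The XML parsing is deliberately generic, since the exact OAIRecord response layout is not shown on the slide; the manifest access assumes the IIIF Presentation 2.x layout served by Gallica.

```python
# Sketch: fetch a document's metadata via the Gallica Document API (a.2) and
# its IIIF manifest (a.3); the XML parsing is kept generic because the exact
# OAIRecord response layout is not detailed in these slides.
import requests
import xml.etree.ElementTree as ET

ARK = "bpt6k5738219s"

# a.2) Gallica Document API
xml = requests.get("https://gallica.bnf.fr/services/OAIRecord",
                   params={"ark": ARK}).content
root = ET.fromstring(xml)
titles = [el.text for el in root.iter() if el.tag.split("}")[-1] == "title"]
print("title (Document API):", titles[0] if titles else None)

# a.3) IIIF manifest (Presentation 2.x layout served by Gallica)
manifest = requests.get(
    f"https://gallica.bnf.fr/iiif/ark:/12148/{ARK}/manifest.json").json()
print("label (IIIF):", manifest.get("label"))
print("pages:", len(manifest["sequences"][0]["canvases"]))
```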
31. d) Build services
Classification of genres:
• build a reference dataset
• train a model (CNN)
1. Leverage the metadata (SRU)
2. Download the images (IIIF)
3. Train the model
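As an illustration of steps 1 and 2 (step 3, training the CNN, is left to whatever deep learning framework you use), here is a minimal sketch that takes a list of arks coming from an SRU query and downloads a fixed-height thumbnail of the first page of each document via the IIIF Image API, following the URL pattern visible in the Watson example on the next slide. The ark list and output folder are placeholders.

```python
# Sketch of steps 1-2: from a list of arks (output of the SRU step) download
# IIIF thumbnails of the first page of each document to build a reference
# dataset. The Image API pattern (region/size/rotation/quality.format) follows
# the Gallica URL shown on the next slide; the ark list is a placeholder.
import pathlib
import requests

arks = ["bpt6k5738219s"]            # placeholder: list produced by the SRU step
dataset = pathlib.Path("dataset/unlabelled")
dataset.mkdir(parents=True, exist_ok=True)

for ark in arks:
    # first page (f1), full region, height 512 px, no rotation, native quality
    url = f"https://gallica.bnf.fr/iiif/ark:/12148/{ark}/f1/full/,512/0/native.jpg"
    r = requests.get(url, timeout=60)
    if r.ok:
        (dataset / f"{ark}.jpg").write_bytes(r.content)
        print("downloaded", ark)
```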
32. d) Build services
Visual Indexation example: IBM Watson
Pros: no need to handle the image files (size, rotation, crop), no local storage
Cons: server-intensive, slow, can time out
curl -X POST -u "apikey:****" --form "url=https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9604090x/f1/22,781,4334,4751/,700/0/native.jpg" "https://gateway.watsonplatform.net/visual-recognition/api/v3/classify?version=2018-03-19"
34. IIIF in the Visual Indexing Workflow
[Workflow diagram: Selection → Segmentation → Indexation → QA → Use, fed by the access layer (search, APIs, datasets), with IIIF used between the steps]
IIIF makes prototyping and training of models easier, but it can be inefficient for processing large datasets.
First step in the data lifecycle: making the data available
Which data is concerned: the data and metadata produced by the institution
Open data / opening up public data: metadata (2014)
Conditions for reuse of Gallica content
In what forms / how should they be made available? Technical provision
Synchronous querying:
API: an application programming interface that lets two machines talk to each other through one or more standardized protocols
Web services: APIs that use web protocols
Example: smartphone applications that use public and private data to build a service
Asynchronous querying
Data
Datasets whose compilation is in itself an added value
Libraries have a long tradition of exchanging data.
The situation in 2016: several APIs opened by the BnF
the historical Z39.50, a good example of an API that is not a web service
the SRU protocol on the general catalogue, the web version of Z39.50
heavily used APIs such as the OAI repositories
a service created specifically for data dissemination: data.bnf.fr and its SPARQL endpoint
APIs first created for internal use, sometimes accessible from outside: the Gallica SRU
IIIF
But access and documentation (where it existed) were scattered, a reflection of this history, and so were the uses
Keep in mind the diversity of audiences: professional users, developers…
A turning point: the 2016 hackathon >
document things officially, take responsibility for opening up the IIIF and Gallica SRU web services
creation of a wiki on the GitHub platform
For the 2017 hackathon:
Gather the existing documentation
Also the datasets, in particular those produced in research projects: link with the CORPUS project
Image corpora
Metadata dumps
URL lists from the web legal deposit
Statistics
Presentation of the APIs and datasets portal:
The primary purpose of this portal is to centralize the documentation on the APIs and datasets. This description is organized around:
A technical fact sheet
A point of contact (still under consideration)
Sample queries: the classic way of presenting APIs
Request and response formats
Related APIs and datasets: an editorial line; the added value of presenting them on a portal lies in the uses opened up by cross-referencing the datasets and the APIs
Not just a description: we have also laid the groundwork for editorial work around these datasets
Cross-cutting pages on essential notions such as identifiers
News about new datasets or web services, and service news (the switch to HTTPS or the disaster recovery plan)
Links with other projects such as Gallica Studio
Links with other places where the data is described for other audiences, such as professional users of the data
Literary webography: a literary monitoring website dedicated to fiction, connected to 200 blogs, Wikipedia, YouTube, podcasts, the BnF, VIAF and local bookshops.
An interesting example, because Bibliosurf uses the BnF web services in two ways: it uses BnF data directly (translator's name, original title, series), and it uses BnF data as a pivot, thanks to identifiers (ISNI, VIAF, Wikidata), to retrieve information from other databases (Wikipedia).
Enrichment of book descriptions via the general catalogue SRU: translator's name, original title and series, each time a record is displayed, with a 24-hour cache.
As long as an author has no ISNI, whenever a visitor displays the author's record on Bibliosurf, an ISBN query is sent to the BnF SRU to retrieve the ISNI from the UNIMARC record. If an ISNI is found, it is added to the Bibliosurf database.
Wikidata identifier: a SPARQL query on data.bnf.fr will be run periodically to retrieve these identifiers; they are then used to query Wikidata and retrieve the URLs of the authors' websites.
VIAF: a SPARQL query on data.bnf.fr will be run periodically to retrieve these identifiers; they are then used to query VIAF and retrieve the Wikipedia links not yet referenced. Bibliosurf then uses the Wikipedia API to display the authors' photos and biographies.
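As a rough sketch of the periodic query described here, the snippet below asks the data.bnf.fr SPARQL endpoint (mentioned earlier in these notes) for the external links of an author resource and keeps the Wikidata and VIAF ones. The endpoint URL, the placeholder author URI and the reliance on owl:sameAs are assumptions to be checked against the service documentation.

```python
# Sketch: ask the data.bnf.fr SPARQL endpoint for the external links of an
# author resource and keep the Wikidata / VIAF ones. The endpoint URL, the
# author URI and the owl:sameAs predicate are assumptions to verify.
import requests

ENDPOINT = "https://data.bnf.fr/sparql"
AUTHOR = "http://data.bnf.fr/ark:/12148/cbXXXXXXXXX#about"  # placeholder: substitute a real author ark

query = f"""
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?same WHERE {{ <{AUTHOR}> owl:sameAs ?same }}
"""

resp = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
links = [b["same"]["value"] for b in resp.json()["results"]["bindings"]]
print("Wikidata:", [u for u in links if "wikidata.org" in u])
print("VIAF:", [u for u in links if "viaf.org" in u])
```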
There are 3,485 authors referenced on Bibliosurf: 2,954 have an ISNI, 1,830 a Wikidata identifier, 2,751 a VIAF identifier, 1,955 a Wikipedia link, and 412 a personal website.