2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...datacite
Michael Witt presented on the Purdue University Research Repository (PURR) at the DataCite summer meeting. PURR is a collaborative effort between Purdue University Libraries, Office of the Vice President for Research, and Information Technology. It provides researchers a space to store, share, and publish research data, with librarian support for data management plans and curation. PURR aims to encourage citation of datasets by assigning identifiers, displaying licenses, providing citation examples, and exposing structured citations. It is built on open source HUBzero software and has over 1,000 registered researchers sharing data across 200 projects.
Scalable Identifiers for Natural History CollectionsJohn Kunze
This document summarizes Scalable Identifiers for Natural History Collections. It discusses how the California Digital Library (CDL) supports the University of California system through various services including the UC Curation Center (UC3). The document then discusses how EZID, a service provided by CDL, offers precise, persistent identifiers like DOIs for datasets to give credit to data producers and link data to publications. Finally, it provides an overview of identifier generation, registration, resolution and some technical dimensions and emerging features of EZID.
This presentation was provided by
Priscilla Caplan of The Florida Center for Library Automation and Jeremy York of The University of Michigan Library, during the NISO Webinar "What It Takes To Make It Last: E-Resources Preservation" held on February 10, 2011.
Slides from a webinar presentation organised by ALCTS -A division of the American Library Association - February 19th 2020. http://www.ala.org/alcts/confevents/upcoming/webinar/021920
The return on investment for academic libraries is chiefly tied to access, usage, and impact. Without accurate, consistent, and quality metadata on the one hand, and an easy-to-use and effective discovery service on the other, these valuable resources may remain invisible and inaccessible to users. In this webinar, four overarching metadata principles, namely metadata enriching, linking, openness, and filtering, are presented. In addition, presenters will examine how these ideas help shape the metadata creation and discovery services at Solent University—focusing on the implementation of RDA and FRBR as well as the use of subject authority headings and authority controls.
http://kulibrarians.g.hatena.ne.jp/kulibrarians/20170222
Presentation by Cuna Ekmekcioglu (The University of Edinburgh)
- Creating and Managing Digital Research Data in Creative Arts: An overview (2016)
CC BY-NC-SA 4.0
Research data catalogues and data interoperability in life sciencesBlue BRIDGE
Presentation by Rafael C Jimenez, ELIXIR CTO
This presentation gives an overview of data catalogues in the life sciences and describe different approaches of data interoperability and federation. It also explains the relationship and differences among ELIXIR registries, data repositories, data archives and knowledge-bases. The presentation introduces few ideas for discussion about how to facilitate data interoperability in the European Open Science Cloud.
This document provides an overview of the Dataverse Network Project, which is a repository for research data hosted at Harvard University. It allows researchers to deposit, share, and organize their data in a curated network. Key features include long-term preservation of data and metadata, access and sharing capabilities, and archiving best practices to promote data access and reproducibility. Researchers can create individual dataverses to organize their studies and deposit data through a web interface or via software installation. The network supports various file types and formats and provides data citation and version control.
2013 DataCite Summer Meeting - Purdue University Research Repository (PURR) (...datacite
Michael Witt presented on the Purdue University Research Repository (PURR) at the DataCite summer meeting. PURR is a collaborative effort between Purdue University Libraries, Office of the Vice President for Research, and Information Technology. It provides researchers a space to store, share, and publish research data, with librarian support for data management plans and curation. PURR aims to encourage citation of datasets by assigning identifiers, displaying licenses, providing citation examples, and exposing structured citations. It is built on open source HUBzero software and has over 1,000 registered researchers sharing data across 200 projects.
Scalable Identifiers for Natural History CollectionsJohn Kunze
This document summarizes Scalable Identifiers for Natural History Collections. It discusses how the California Digital Library (CDL) supports the University of California system through various services including the UC Curation Center (UC3). The document then discusses how EZID, a service provided by CDL, offers precise, persistent identifiers like DOIs for datasets to give credit to data producers and link data to publications. Finally, it provides an overview of identifier generation, registration, resolution and some technical dimensions and emerging features of EZID.
This presentation was provided by
Priscilla Caplan of The Florida Center for Library Automation and Jeremy York of The University of Michigan Library, during the NISO Webinar "What It Takes To Make It Last: E-Resources Preservation" held on February 10, 2011.
Slides from a webinar presentation organised by ALCTS -A division of the American Library Association - February 19th 2020. http://www.ala.org/alcts/confevents/upcoming/webinar/021920
The return on investment for academic libraries is chiefly tied to access, usage, and impact. Without accurate, consistent, and quality metadata on the one hand, and an easy-to-use and effective discovery service on the other, these valuable resources may remain invisible and inaccessible to users. In this webinar, four overarching metadata principles, namely metadata enriching, linking, openness, and filtering, are presented. In addition, presenters will examine how these ideas help shape the metadata creation and discovery services at Solent University—focusing on the implementation of RDA and FRBR as well as the use of subject authority headings and authority controls.
http://kulibrarians.g.hatena.ne.jp/kulibrarians/20170222
Presentation by Cuna Ekmekcioglu (The University of Edinburgh)
- Creating and Managing Digital Research Data in Creative Arts: An overview (2016)
CC BY-NC-SA 4.0
Research data catalogues and data interoperability in life sciencesBlue BRIDGE
Presentation by Rafael C Jimenez, ELIXIR CTO
This presentation gives an overview of data catalogues in the life sciences and describe different approaches of data interoperability and federation. It also explains the relationship and differences among ELIXIR registries, data repositories, data archives and knowledge-bases. The presentation introduces few ideas for discussion about how to facilitate data interoperability in the European Open Science Cloud.
This document provides an overview of the Dataverse Network Project, which is a repository for research data hosted at Harvard University. It allows researchers to deposit, share, and organize their data in a curated network. Key features include long-term preservation of data and metadata, access and sharing capabilities, and archiving best practices to promote data access and reproducibility. Researchers can create individual dataverses to organize their studies and deposit data through a web interface or via software installation. The network supports various file types and formats and provides data citation and version control.
Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012sherif user group
RDM Rose by Carmen O'Dell and Barbara Sen, (University of Sheffield). Presentation at Demystifying Research Data: don’t be scared be prepared: A joint JIBS/RLUK event, Tuesday 17th July 17th July 2012, Brunei Gallery at SOAS (School of Oriental and African Studies), London.
The document summarizes the experimental project of registering Digital Object Identifiers (DOIs) for research data at the Japan Link Center (JaLC). The project aims to establish workflows for registering DOIs for research data and test the registration of data DOIs. It involves 9 research projects and 14 organizations registering and integrating DOIs for their data through the JaLC system. The project addresses several issues in registering DOIs for dynamic research data, such as data lifecycles, granularity, persistence, and handling changes over time.
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...John Kunze
The ARK Alliance has provided persistent identifiers (ARKs) for over 850 institutions over 20 years, assigning over 8.2 billion ARKs. ARKs allow for long-term preservation and access to digital resources by providing unbroken links even if web addresses (URLs) change. The ARK resolver system makes ARKs actionable by translating them to current web addresses. Major adopters of ARKs include libraries, archives, museums, and research institutions for purposes like genealogical records, published works, scientific data, and cultural heritage collections.
The document discusses Japan Link Center's (JaLC) experiment to register DOIs for research data. The experiment aims to establish workflows for registering DOIs for research data using JaLC's system. It involves 9 projects with 14 organizations testing DOI registration for research data. The document outlines several issues in registering DOIs for data, including operations flow, persistent access, granularity, dynamics of data, and quantity of data. It also provides examples of how projects can involve multiple institutions and how data lifecycles differ from literature.
What do you want to discover today? / Janet Aucock, University of St AndrewsCIGScotland
Overview of resource discovery in libraries today. Presented at the CIG Scotland seminar 'Resource Discovery : from catalogues to discovery services' at the National Library of Scotland, Edinburgh, 21st March 2018
This document summarizes a webinar on metadata for managing scientific research data. The webinar covered why metadata is important for scientific data management, definitions of data and metadata, selected metadata standards including Dublin Core, Darwin Core and FGDC, challenges in generating metadata and opportunities to address these challenges, and advice for getting started with metadata. The webinar emphasized that metadata standards provide guidelines not strict rules, and encouraged participants to keep metadata simple while aiming to facilitate reuse of data.
This document outlines the IT strategic plan of the MAE Documentation Centre and Museum of Performing Arts from 2008-2013. The plan aimed to modernize the center's technology by moving from various disconnected databases and folders to a single, open source system for managing and disseminating its collections online. By 2013, the center had implemented the Hydra Project to create a new unified data model and system called "Escena Digital" for describing, preserving, and sharing its archive and museum holdings digitally. The new system allows for more efficient management and growth of the center's digital collections.
As bibliotecas do mundo conectadas. A um mundo conectado!OCLC LAC
The document discusses OCLC (Online Computer Library Center) and its role in connecting libraries around the world through shared resources, metadata, management tools, and end-user services. It highlights four strategic areas of OCLC's services and describes how libraries can work together through OCLC to explore trends, share data/resources, and amplify the impact of libraries. Specific topics covered include the internet of things, massive open online courses, and connected collections. The end of the document includes a questionnaire about the event.
The document describes the BNE's project to create linked data from its bibliographic and authority records. It involved selecting relevant MARC data about Miguel de Cervantes and related authors, mapping the data to FRBR, FRAD and other ontologies, transforming the data into RDF, and linking the data to other datasets like VIAF. The goals were to test applying IFLA models at scale, and create a unified authority system for Spanish libraries. Key activities included analyzing the MARC data, developing mapping and transformation tools, and publishing the linked data on the web.
This document outlines the agenda and topics that will be covered in a digitization workshop for community heritage organizations. The session will cover how to plan a digitization project by setting standards, assessing needed resources, and caring for original and digital materials. It will also discuss providing access online, including issues around rights, metadata, and sharing images. Key topics include setting an appropriate level of technical standards based on the project goals and constraints, ensuring backups and long-term preservation of digital files, and creating descriptive metadata to enable discovery and use of digitized collections.
Since the early days of e-resource management, holdings maintenance for electronic resources has been a very time consuming and manual process. While the emergence of electronic resource management systems (ERMS) has improved this process to a significant extent, holdings maintenance tasks remain labor intensive due to the increased volume of electronic content to manage, as well as issues related to metadata quality. To ameliorate many of the problems associated with managing electronic resources, and in recognition of a need for greater accuracy and efficiency, some knowledgebase providers are beginning to offer libraries options to automate holdings maintenance for electronic resources. In 2014, OCLC developed a service to provide automated holdings management for a select group of content providers. Within the WorldCat knowledge base system, library specific holdings for e-book and e-serial collections can be managed within the knowledge base without the need for library staff to manually intervene. At the University of Toronto Libraries, we decided to take OCLC's automated holdings management service for a test-drive. For three vendor packages, we conducted an on-going comparison between the library's holdings list and the title listing supplied by the automated service. This presentation will outline the results of this investigation, highlighting the benefits and drawbacks of automated holdings maintenance. The talk will also provide a vision of what the automated holdings management service could look like in the future.
Speaker: Marlene van Ballegooie, Metadata Librarian, University of Toronto
Building an institutional repository using dspaceBharat Chaudhari
This document provides an introduction to institutional repositories and DSpace. It discusses what an institutional repository is, the types of content it contains from a university community, and important elements like being institutionally defined, scholarly, cumulative, open and interoperable. It covers implementing a repository by developing policies, metadata, permissions, and submission guidelines. The roles and software required are also outlined, with DSpace being the most commonly used software. Metadata standards like Dublin Core are explained.
Next Steps for IMLS's National Digital PlatformTrevor Owens
This document summarizes projects funded by the Institute of Museum and Library Services (IMLS) related to developing a National Digital Platform. It describes 7 projects improving open source digital library software tools and communities, 4 projects focused on scaling up shared services, 2 applied research projects related to collections at scale, and 3 projects aimed at improving access for all and inclusion. It provides brief descriptions and links to more information for each of the 20 projects. The overall goal is to expand the digital capability and capacity of libraries across the United States by prioritizing promising digital tools and services.
The presentation of different Slovenian Labour Force Survey microdata, accompanying metadata and materials, and modes of access. From the Fourth DwB Training Course in Ljubljana.
Transparent Licenses: Making user rights clear (OLA Super Conference 2015)Hong (Jenny) Jing
Recent changes to Canada’s Copyright Act have propelled copyright and licensed use into the spotlight at colleges and universities in Canada. This session will look at Queen’s and University of Toronto libraries’ experience implementing a licensing permissions workflow using OCUL Usage Rights database (OUR). The systems will be covered are: 360 Link, Summon, Voyager OPAC, Endeca. We will explain how to implement the license links with and without using API.
Delivered by Peter Burnhill, Director of EDINA, at the PRELIDA Consolidation and Dissemination workshop on 17/18 October 2014 (http://prelida.eu/consolidation-workshop).
Summary: The web changes over time, and significant reference rot inevitably occurs. Web archiving delivers only a 50% chance of success. So in addition to the original URI, the link should be augmented with temporal context to increase robustness.
Research Data Management in GLAM: Managing Data for Cultural HeritageSarah Anna Stewart
Presentation given at the 'Open Science Infrastructures for Big Cultural Data' - Advanced International Masterclass in Plovdiv, Bulgaria. Dec. 13-15, 2018
Presentation - First International Library Staff Exchange Week, ZagrebIva Vrkic
Librarians at the Faculty of Science in Zagreb provide information literacy courses for graduate students and scholars. Topics covered include using plagiarism detection software, changes in scientific publishing, and copyright issues. Plans exist to expand offerings to include workshops for freshmen. Librarians look to colleagues at the University of Zagreb for inspiration on developing robust education programs.
Libraries at Harvard and Oxford offer diverse information literacy instruction through workshops, seminars, and online/hybrid courses. Common topics are using library resources, research skills like literature reviews, data management, reference management software, and open scholarship issues. Both institutions dedicate over 50% of instruction to online formats, with the remainder split between in-person and hybrid
February 18 2014 NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Capacity Building: Leveraging existing library networks to take on research data
Heidi Imker, Director of the Research Data Service, University of Illinois at Urbana-Champaign
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUCourtney McDonald
The document summarizes Indiana University's implementation of the Blacklight discovery layer across its eight campuses to provide a shared interface for its online catalog (IUCAT) while allowing for flexibility across campuses. Key points include: IU has a complex data environment with diverse collections across eight campuses previously only served by a one-size-fits-all interface; in 2011 IU selected Blacklight over VuFind as its discovery layer due to flexibility and development community; implementation began in summer 2011 with a public beta in fall 2012 and full transition in May 2013; campus-specific views and call number browsing were customized; and future work includes enhanced customization, transition to Kuali OLE, and improving browse functions.
Carmen O'Dell and Barbara Sen JIBS-RLUK event July 2012sherif user group
RDM Rose by Carmen O'Dell and Barbara Sen, (University of Sheffield). Presentation at Demystifying Research Data: don’t be scared be prepared: A joint JIBS/RLUK event, Tuesday 17th July 17th July 2012, Brunei Gallery at SOAS (School of Oriental and African Studies), London.
The document summarizes the experimental project of registering Digital Object Identifiers (DOIs) for research data at the Japan Link Center (JaLC). The project aims to establish workflows for registering DOIs for research data and test the registration of data DOIs. It involves 9 research projects and 14 organizations registering and integrating DOIs for their data through the JaLC system. The project addresses several issues in registering DOIs for dynamic research data, such as data lifecycles, granularity, persistence, and handling changes over time.
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...John Kunze
The ARK Alliance has provided persistent identifiers (ARKs) for over 850 institutions over 20 years, assigning over 8.2 billion ARKs. ARKs allow for long-term preservation and access to digital resources by providing unbroken links even if web addresses (URLs) change. The ARK resolver system makes ARKs actionable by translating them to current web addresses. Major adopters of ARKs include libraries, archives, museums, and research institutions for purposes like genealogical records, published works, scientific data, and cultural heritage collections.
The document discusses Japan Link Center's (JaLC) experiment to register DOIs for research data. The experiment aims to establish workflows for registering DOIs for research data using JaLC's system. It involves 9 projects with 14 organizations testing DOI registration for research data. The document outlines several issues in registering DOIs for data, including operations flow, persistent access, granularity, dynamics of data, and quantity of data. It also provides examples of how projects can involve multiple institutions and how data lifecycles differ from literature.
What do you want to discover today? / Janet Aucock, University of St AndrewsCIGScotland
Overview of resource discovery in libraries today. Presented at the CIG Scotland seminar 'Resource Discovery : from catalogues to discovery services' at the National Library of Scotland, Edinburgh, 21st March 2018
This document summarizes a webinar on metadata for managing scientific research data. The webinar covered why metadata is important for scientific data management, definitions of data and metadata, selected metadata standards including Dublin Core, Darwin Core and FGDC, challenges in generating metadata and opportunities to address these challenges, and advice for getting started with metadata. The webinar emphasized that metadata standards provide guidelines not strict rules, and encouraged participants to keep metadata simple while aiming to facilitate reuse of data.
This document outlines the IT strategic plan of the MAE Documentation Centre and Museum of Performing Arts from 2008-2013. The plan aimed to modernize the center's technology by moving from various disconnected databases and folders to a single, open source system for managing and disseminating its collections online. By 2013, the center had implemented the Hydra Project to create a new unified data model and system called "Escena Digital" for describing, preserving, and sharing its archive and museum holdings digitally. The new system allows for more efficient management and growth of the center's digital collections.
As bibliotecas do mundo conectadas. A um mundo conectado!OCLC LAC
The document discusses OCLC (Online Computer Library Center) and its role in connecting libraries around the world through shared resources, metadata, management tools, and end-user services. It highlights four strategic areas of OCLC's services and describes how libraries can work together through OCLC to explore trends, share data/resources, and amplify the impact of libraries. Specific topics covered include the internet of things, massive open online courses, and connected collections. The end of the document includes a questionnaire about the event.
The document describes the BNE's project to create linked data from its bibliographic and authority records. It involved selecting relevant MARC data about Miguel de Cervantes and related authors, mapping the data to FRBR, FRAD and other ontologies, transforming the data into RDF, and linking the data to other datasets like VIAF. The goals were to test applying IFLA models at scale, and create a unified authority system for Spanish libraries. Key activities included analyzing the MARC data, developing mapping and transformation tools, and publishing the linked data on the web.
This document outlines the agenda and topics that will be covered in a digitization workshop for community heritage organizations. The session will cover how to plan a digitization project by setting standards, assessing needed resources, and caring for original and digital materials. It will also discuss providing access online, including issues around rights, metadata, and sharing images. Key topics include setting an appropriate level of technical standards based on the project goals and constraints, ensuring backups and long-term preservation of digital files, and creating descriptive metadata to enable discovery and use of digitized collections.
Since the early days of e-resource management, holdings maintenance for electronic resources has been a very time consuming and manual process. While the emergence of electronic resource management systems (ERMS) has improved this process to a significant extent, holdings maintenance tasks remain labor intensive due to the increased volume of electronic content to manage, as well as issues related to metadata quality. To ameliorate many of the problems associated with managing electronic resources, and in recognition of a need for greater accuracy and efficiency, some knowledgebase providers are beginning to offer libraries options to automate holdings maintenance for electronic resources. In 2014, OCLC developed a service to provide automated holdings management for a select group of content providers. Within the WorldCat knowledge base system, library specific holdings for e-book and e-serial collections can be managed within the knowledge base without the need for library staff to manually intervene. At the University of Toronto Libraries, we decided to take OCLC's automated holdings management service for a test-drive. For three vendor packages, we conducted an on-going comparison between the library's holdings list and the title listing supplied by the automated service. This presentation will outline the results of this investigation, highlighting the benefits and drawbacks of automated holdings maintenance. The talk will also provide a vision of what the automated holdings management service could look like in the future.
Speaker: Marlene van Ballegooie, Metadata Librarian, University of Toronto
Building an institutional repository using dspaceBharat Chaudhari
This document provides an introduction to institutional repositories and DSpace. It discusses what an institutional repository is, the types of content it contains from a university community, and important elements like being institutionally defined, scholarly, cumulative, open and interoperable. It covers implementing a repository by developing policies, metadata, permissions, and submission guidelines. The roles and software required are also outlined, with DSpace being the most commonly used software. Metadata standards like Dublin Core are explained.
Next Steps for IMLS's National Digital PlatformTrevor Owens
This document summarizes projects funded by the Institute of Museum and Library Services (IMLS) related to developing a National Digital Platform. It describes 7 projects improving open source digital library software tools and communities, 4 projects focused on scaling up shared services, 2 applied research projects related to collections at scale, and 3 projects aimed at improving access for all and inclusion. It provides brief descriptions and links to more information for each of the 20 projects. The overall goal is to expand the digital capability and capacity of libraries across the United States by prioritizing promising digital tools and services.
The presentation of different Slovenian Labour Force Survey microdata, accompanying metadata and materials, and modes of access. From the Fourth DwB Training Course in Ljubljana.
Transparent Licenses: Making user rights clear (OLA Super Conference 2015)Hong (Jenny) Jing
Recent changes to Canada’s Copyright Act have propelled copyright and licensed use into the spotlight at colleges and universities in Canada. This session will look at Queen’s and University of Toronto libraries’ experience implementing a licensing permissions workflow using OCUL Usage Rights database (OUR). The systems will be covered are: 360 Link, Summon, Voyager OPAC, Endeca. We will explain how to implement the license links with and without using API.
Delivered by Peter Burnhill, Director of EDINA, at the PRELIDA Consolidation and Dissemination workshop on 17/18 October 2014 (http://prelida.eu/consolidation-workshop).
Summary: The web changes over time, and significant reference rot inevitably occurs. Web archiving delivers only a 50% chance of success. So in addition to the original URI, the link should be augmented with temporal context to increase robustness.
Research Data Management in GLAM: Managing Data for Cultural HeritageSarah Anna Stewart
Presentation given at the 'Open Science Infrastructures for Big Cultural Data' - Advanced International Masterclass in Plovdiv, Bulgaria. Dec. 13-15, 2018
Presentation - First International Library Staff Exchange Week, ZagrebIva Vrkic
Librarians at the Faculty of Science in Zagreb provide information literacy courses for graduate students and scholars. Topics covered include using plagiarism detection software, changes in scientific publishing, and copyright issues. Plans exist to expand offerings to include workshops for freshmen. Librarians look to colleagues at the University of Zagreb for inspiration on developing robust education programs.
Libraries at Harvard and Oxford offer diverse information literacy instruction through workshops, seminars, and online/hybrid courses. Common topics are using library resources, research skills like literature reviews, data management, reference management software, and open scholarship issues. Both institutions dedicate over 50% of instruction to online formats, with the remainder split between in-person and hybrid
February 18 2014 NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Capacity Building: Leveraging existing library networks to take on research data
Heidi Imker, Director of the Research Data Service, University of Illinois at Urbana-Champaign
One Discovery Layer, Eight Front Doors: Implementing Blacklight @ IUCourtney McDonald
The document summarizes Indiana University's implementation of the Blacklight discovery layer across its eight campuses to provide a shared interface for its online catalog (IUCAT) while allowing for flexibility across campuses. Key points include: IU has a complex data environment with diverse collections across eight campuses previously only served by a one-size-fits-all interface; in 2011 IU selected Blacklight over VuFind as its discovery layer due to flexibility and development community; implementation began in summer 2011 with a public beta in fall 2012 and full transition in May 2013; campus-specific views and call number browsing were customized; and future work includes enhanced customization, transition to Kuali OLE, and improving browse functions.
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyPRELIDA Project
This document discusses reference rot in linked data and proposes remedies. It defines reference rot as occurring when links to web resources no longer point to the original content. Empirical evidence from analyses of journal articles and e-theses shows that over one third of references experience rot. Proposed remedies include a Hiberlink plug-in to enable proactive archiving, augmenting links with temporal context using the Missing Link approach, and a HiberActive system for repositories to actively archive references. The goal is to increase the chances of accessing referenced content over time by embedding archiving solutions into existing authoring and publishing workflows.
At Utah State University, a pilot project is under development to evaluate the benefits of tracking data sets and faculty publications using the online catalog and the Library’s institutional repository.
With federal mandates to make publications and data open, universities look for solutions to track compliance. At Utah State University, the Sponsored Programs Office follows up with researchers to determine where data has been or will be deposited, per the terms of their grant.
Interested in making this publicly discoverable, the Library, Sponsored Programs, and Research Office are working together to pilot a project that enables the creation of publicly accessible MARC and Dublin Core records for data deposited by USU faculty. This project aims to make data sets, as well as publications, visible in research portals such as WorldCat, as well through Google searches.
This presentation will describe the project and anticipated benefits, as well as outline the roles of the cataloging staff and data librarian, and the involvement of the Research Office.
Plale HathiTrust El Colegio de Mexico May2014Beth Plale
The document discusses HathiTrust, a digital library consortium, and its research center (HTRC). HTRC enables computational analysis of the HathiTrust collection through tools and a secure computing framework called the Data Capsule. The Data Capsule allows researchers to perform computational analysis on the entire HathiTrust collection, including copyrighted works, while preventing data from being leaked. Examples of research conducted through HTRC include identifying the gender of authors using name analysis and using topic modeling to locate philosophical arguments in texts.
About the Webinar
The library and cultural institution communities have generally accepted the vision of moving to a Linked Data environment that will align and integrate their resources with those of the greater Semantic Web. But moving from vision to implementation is not easy or well-understood. A number of institutions have begun the needed infrastructure and tools development with pilot projects to provide structured data in support of discovery and navigation services for their collections and resources.
Join NISO for this webinar where speakers will highlight actual Linked Data projects within their institutions—from envisioning the model to implementation and lessons learned—and present their thoughts on how linked data benefits research, scholarly communications, and publishing.
Speakers:
Jon Voss - Strategic Partnerships Director, We Are What We Do
LODLAM + Historypin: A Collaborative Global Community
Matt Miller - Front End Developer, NYPL Labs at the New York Public Library
The Linked Jazz Project: Revealing the Relationships of the Jazz Community
Cory Lampert - Head, Digital Collections , UNLV University Libraries
Silvia Southwick - Digital Collections Metadata Librarian, UNLV University Libraries
Linked Data Demystified: The UNLV Linked Data Project
This document summarizes a workshop on open science and open data for librarians. The workshop covered introducing open science and open data, how data can inform the library profession and support research, tools and applications for working with data, and developing a data strategy for libraries. It discussed stakeholders in research data, why librarians are important data partners, the role of librarians in advocating for open data and managing repositories. The workshop also covered data skills needed by librarians and introducing trusted data repositories.
Delivered by Peter Burnhill at CNI Fall 2014 Membership Meeting, December 8-9, 2014
Washington, DC. This is about ensuring that online serial content, whether issued in parts or changes over time via a website, continues to be available for scholarship. The central take home message is that we all have a lot still to do.
RDAP13 John Kunze: The Data Management EcosystemASIS&T
John Kunze, University of California, Curation Center
California Digital Library (CDL)
The Data Management Ecosystem
Panel: Partnerships between institutional repositories, domain repositories, and publishers
Research Data Access & Preservation Summit 2013
Baltimore, MD April 4, 2013 #rdap13
The document discusses the need for an ecosystem to better manage research data through its entire lifecycle, from creation to publication to sharing and reuse. It proposes that libraries can play a key role in this ecosystem by providing services like curation repositories, identifiers, metadata, and tools to help researchers publish, share, and get credit for their data. The goal is to improve data discovery, access, attribution, and incentivize data sharing to make research data as integral to the scholarly record as journal articles.
Open Science, Open Data: towards a new transparent and reproducible ecosystemLIBER Europe
Presented at the Preforma Open Source Workshop 8 April 2016
As a library membership organization, LIBER works on addressing Open Science barriers. Standardisation of file formats can really help in overcoming some of these barriers: it enables us to process and preserve data in a controlled way, it helps ensure that outputs are really open and accessible in the long term and it improves interoperability of new tools and services. Making sure data is stored in a controlled way and can be (re) used today and in the future is an important element in Open Science. We see this as not only a technical challenge but also a social one: awareness, trust and community building is needed in order to ensure uptake of these standards. Libraries therefore have a valuable role to play in the development of good research data management throughout all phases of the Open Data lifecycle.
Staffing Research Data Services at University of EdinburghRobin Rice
Invited remote talk for Georg-August University of Göttingen workshop: RDM costs and efforts on 28 May in Göttingen. Organised by the project Göttingen Research Data Exploratory (GRAcE).
NISO Two Day Virtual Conference:
Using the Web as an E-Content Distribution Platform:
Challenges and Opportunities
Oct 21-22, 2014
John Mark Ockerbloom, Digital Library Architect and Planner, University of Pennsylvania
NISO access related projects (presented at the Charleston conference 2016)Christine Stohn
Presentation by Pascal Calarco (University of Windsor), Christine Stohn (Ex Libris/ProQuest), John G. Dove (Paloma Associates), covering NISO D2D work, ResourceSync, KBART and KBART automation, ODI (Open Discovery Initiative), Link origin tracking, ALI (Access and License Indicators), and a discussion around improvements and challenges for open access discovery
Similar to Provenance in Databases and Scientific Workflows: Part I (20)
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburg.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program PAF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
Research Seminar Talk (online) at KRR@UP (Uni Potsdam) on Dec 6, 2023, loosely based on a paper with the same title at the 7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3)
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Bertram Ludäscher
7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3) at
AIxIA 2023: 22nd International Conference of the Italian Association for Artificial Intelligence.
Presentation of a paper by Bertram Ludäscher, Shawn Bowers, and Yilin Xia, given virtually on November 9, 2023.
[Flashback] Integration of Active and Deductive Database RulesBertram Ludäscher
Slides of my PhD defense at the University of Freiburg, 1998.
Statelog and similar state-oriented extensions of Datalog have seen renewed interest subsequently, e.g., see
[Hel10] Hellerstein, J.M., 2010. The declarative imperative: experiences and conjectures in distributed logic. ACM SIGMOD Record, 39(1), pp.5-19.
[AMC+11]
Alvaro, P., Marczak, W.R., Conway, N., Hellerstein, J.M., Maier, D. and Sears, R., 2011. Dedalus: Datalog in time and space. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers (pp. 262-281). Springer
[Flashback] Statelog: Integration of Active & Deductive Database RulesBertram Ludäscher
This document discusses Statelog, which integrates active and deductive database rules. Statelog allows both active rules, which trigger actions and modify the database, and deductive rules, which derive new facts. It defines the semantics of different types of rules and how they interact. Statelog guarantees termination of rule evaluation at both compile-time and runtime through techniques like state-stratification and delta-monotonicity. It can express complex temporal queries and supports features like nested transactions.
Answering More Questions with Provenance and Query PatternsBertram Ludäscher
This document discusses using provenance information to improve transparency and reproducibility in research. It begins by asking questions about the input data, methods, and parameter settings used in a study in order to assess its reliability. It then provides examples of how workflow systems can capture provenance at both the design level (prospective provenance) and runtime level (retrospective provenance). These include a Kepler workflow that simulates X-ray data collection and provenance traces captured by DataONE. The document argues that provenance is a critical link between workflow modeling and runtime traces that can increase trust in research findings.
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Bertram Ludäscher
Keynote at CLIR Workshop (Webinar): Torward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
By Michael Gryk and Bertram Ludäscher. Presented at 2020 JCDL-SIGCM Workshop, August 1, 2020.
ABSTRACT. Conceptual models can serve multiple purposes: communication of information between stakeholders, information abstraction and generalization, and information organization for archival and retrieval. An ongoing research question is how to formally define the fit-for-purpose of a conceptual model as well as to define metrics or tests to determine whether a given model faithfully supports a designated purpose.
This paper summarizes preliminary investigations in this area by presenting toy problems along with different conceptual models for the system under study. It is argued that the different models are adequate in supporting a sophisticated query and yet they adopt different normalization schemes and will differ in expressiveness depending on the implied purpose of the models. As the subtitle suggests, this work is intended to be primarily exploratory as to the constraints a formal system would require in defining the “usefulness”, “expressiveness” and “equivalence” of conceptual models.
From Workflows to Transparent Research Objects and Reproducible Science TalesBertram Ludäscher
The document discusses prospective and retrospective provenance in scientific workflows. Prospective provenance involves modeling the workflow design, while retrospective provenance records the workflow execution. The YesWorkflow and noWorkflow tools demonstrate these two types of provenance. YesWorkflow annotates scripts to recreate a workflow model from the script, while noWorkflow records step-by-step runtime logs. Combining both approaches provides a more complete view of a workflow's provenance. Maintaining provenance is important for reproducibility and understanding the origins of scientific results.
From Research Objects to Reproducible Science TalesBertram Ludäscher
University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
PWE: Datalog & ASP for the Rest of Us discusses using Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics like using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines to combine them with Python ecosystems and allow interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
Deductive Databases & Logic Programs: Back to the Future!
Colloquium talk on the occasion of the retirement of Prof. Dr. Georg Lausen, May 10th, 2019, Universität Freiburg, Germany
Dissecting Reproducibility: A case study with ecological niche models in th...Bertram Ludäscher
1) The document describes a workshop on research synthesis and reproducibility.
2) It discusses challenges with reproducibility in science and proposes provenance and conceptual tools like PRIMAD to help address these challenges.
3) The document presents a case study where an intern was able to reproduce results from a 2006 ecological niche modeling paper using the Whole Tale environment and MaxEnt software, demonstrating computational reproducibility.
Incremental Recomputation: Those who cannot remember the past are condemned ...Bertram Ludäscher
Talk given at "Problems and techniques for Incremental Re-computation: provenance and beyond".
A workshop co-organized with Provenance Week 2018
King's College London, 12th and 13th July, 2018
Organizers: Paolo Missier (Newcastle University), Tanu Malik (DePaul University), Jacek Cala (Newcastle University)
Abstract: Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsBertram Ludäscher
Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won a the IPAW best paper award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage'' relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives''. In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
An ontology-driven framework for data transformation in scientific workflowsBertram Ludäscher
Presentation given by Bertram at the Data Integration in the Life Sciences (DILS) Workshop in Leipzig, Germany, 2004.
Reference:
Bowers, Shawn, and Bertram Ludäscher. "An ontology-driven framework for data transformation in scientific workflows." In International Workshop on Data Integration in the Life Sciences (DILS), pp. 1-16. Springer, 2004.
So this isn't new -- but still relevant :-)
ABSTRACT. Ecologists spend considerable effort integrating heterogeneous data for statistical analyses and simulations, for example, to run and test predictive models. Our research is focused on reducing this effort by providing data integration and transformation tools, allowing researchers to focus on “real science,” that is, discovering new knowledge through analysis and modeling. This paper defines a generic framework for transforming heterogeneous data within scientific workflows. Our approach relies on a formalized ontology, which serves as a simple, unstructured global schema. In the framework, inputs and outputs of services within scientific workflows can have structural types and separate seman- tic types (expressions of the target ontology). In addition, a registration mapping can be defined to relate input and output structural types to their corresponding semantic types. Using registration mappings, ap- propriate data transformations can then be generated for each desired service composition. Here, we describe our proposed framework and an initial implementation for services that consume and produce XML data.
The document describes the Whole Tale platform, which aims to facilitate reproducibility in computational research. Whole Tale allows researchers to package computational narratives, data, code, and provenance information into "tales" that can be shared and re-executed. Key features of Whole Tale include running interactive notebooks, versioning and sharing tales, and integrating provenance tracking tools to provide transparency into computational workflows. The speaker demonstrates several example tales and discusses upcoming Whole Tale features and applications in different domains like archaeology, astronomy, and materials science.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
2. Welcome & Bem-Vindo!
• Boa tarde e bem-vindo ao tutorial sobre
proveniência!
• Welcome to the Tutorial on: Provenance in
Databases and Scientific Workflows
– Proveniência em bancos (bases) de dados e fluxos de
trabalho (workflows) científicos
• Desculpas ...
– Back to English (my 2nd language)
• Feel free to interrupt and ask questions!
• (You can also ask questions in German or Spanish ...)
Provenance @ SBBD'16
4. • Part I: Provenance in Scientific Workflows
– Alta Vista: Provenance everywhere!
– Provenance & Scientific Workflows
– Provenance Models and Standards (not so much)
– Provenance Tools
• Example & Demo: YesWorkflow
• Part II: Provenance in Databases
– Foundations of provenance in databases
– Why-, How-, and Why-Not provenance
Outline of the Tutorial:
A “Tour de Provenance”
Provenance @ SBBD'16
6. Provenance - Proveniência
• Oxford English Dictionary
– The place of origin or earliest known history of something:
• an orange rug of Iranian provenance
– The beginning of something’s existence; its origin:
• they try to understand the whole universe, its provenance and fate
– A record of ownership of a work of art or an antique, used as a
guide to authenticity or quality:
• the manuscript has a distinguished provenance
• What is the origin (provenance!) of “provenance” ?
Provenance @ SBBD'16
7. The Many Faces of Provenance
• What are those?
• Cosmology
• Geology, Stratigraphy
• Phylogeny
– the Tree of Life
• Genealogy
– your family: literally
• Academic Pedigree
– “Doktorvater” (Doktor-Mutter?)
• Etymology
• Chain of custody
– of art(ifacts)
• Yes: all about origins and history …
Provenance @ SBBD'16
11. 2nd Stop: Liberal Arts & Sciences
• Can you “see provenance” in this image?
• Grand Canyon’s rock layers are a record of the early geologic history of North America.
The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more
recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
Provenance @ SBBD'16
13. Computational Provenance
• Origin and processing history of an artifact
– usually: data (products), figures, ...
– sometimes: workflow (and script) evolution …
• Different sub-communities:
– Provenance in (scientific) workflows (Tutorial Part I)
– Provenance in databases (Tutorial Part II)
– Wait, there is more:
• ... programming languages, systems/security, …
Provenance @ SBBD'16
14. Why should you care about provenance?
• It’s an important problem:
– reproducibility crisis, transparency, data sharing, …
• There are (still) many deeply technical and practical challenges:
– Efficient capture, management, use of provenance
– Models, semantics, query languages
– Provenance .. for others? Or provenance for self!
– Interdisciplinary work; cross-fertilization: databases, workflows,
programming languages, security, …, various scientific communities
(bioinformatics, ...)
• You have a head start here!
– Marta Mattoso, Daniel de Oliveira, Vanessa Braganholo, Juliana Freire, ...
(e.g. SBBD proceedings ..)
• … oh, and it’s also a fun topic ...
Provenance @ SBBD'16
17. Use Provenance for
Transparency, Reproducibility
• What input data went into
this study?
• What methods were used?
• … with what parameter
settings, calibrations, …?
• Can we trust the data and
methods?
§ Provenance (lineage): track origin and processing history of data è
trust, data quality ~ audit trail for attribution, credit
§ Discovery of data, methodologies, experiments
Provenance @ SBBD'16
20. Provenance today: Important but hard
èmany research projects,
groups conduct R&D on
provenance methods,
tools, …
Example:
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This report is the result of a three-year
analytical effort by a team of over 300
experts, overseen by a broadly
constituted Federal Advisory Committee
of 60 members. It was developed from
information and analyses gathered in
over 70 workshops and listening sessions
held across the country.”
Provenance @ SBBD'16
25. Provenance in Action: Benefits & Impact
A DataONE search (here: “grass”) yields different packages with provenance
Provenance @ SBBD'16
26. DataONE: Support for Provenance
Yaxing’s script with
inputs & output
products
Christopher’s
YesWorkflow
model
Christopher using
Yaxing’s outputs as
inputs for his script
Christopher’s results
can be traced back all
the way to Yaxing’s
input
Provenance @ SBBD'16
27. REWIND: From Provenance to Reproducible Science …
Capturing provenance is crucial for
transparency, interpretation, debugging, …
=> repeatable experiments,
=> reproducible science
=> need workflow-system agnostic model
Provenance @ SBBD'16
29. … 3rd stop: Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
è traceable data- and wf-evolution
è Reproducible Science
Trident
Workbench
VisTrails
Es war einmal …
Provenance @ SBBD'16
30. 10 Essential Functions
of a Scientific Workflow System1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently – in
parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what happened during workflow execution: retrospective provenance.
7. Reveal and query provenance – how workflow products were derived from inputs
via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services
themselves.
These functions (not just dataflow & actors) distinguish scientific workflow automation
from general (scientific) software development.
Src: Tim McPhillipsProvenance @ SBBD'16
34. Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with
Random Sampling in a Kepler workflow, ICCS-WS, 2012
Provenance @ SBBD'16
35. Kepler Workflows & Decision Making
(Kruger Natl. Park, South Africa)
SANParks Matt Jones, NCEAS @ UC Santa Barbara
Provenance @ SBBD'16
39. So what is “provenance” (sensu W3C) ?
• Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
• Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is a record that describes the people, institutions, entities,
and activities, involved in producing, influencing, or delivering a piece
of data or a thing in the world
Provenance @ SBBD'16
49. ProvONE: PROV for scientific workflows
(Transfer station to any of several other “standard extensions”)
“Trace-Land” (retrospective provenance)
“Data-Land”
Yang Cao1
, Christopher Jones2
, Víctor Cuevas-
Vicenttín3
, Matthew B. Jones2
, Bertram
Ludäscher1
, Timothy McPhillips1
, Paolo
Missier4
, Christopher Schwalm5
, Peter
Slaughter2
, Dave Vieglais6
, Lauren Walker2
,
Yaxing Wei7
1
University of Illinois, Urbana-Champaign, 2
National Center for
Ecological Analysis and Synthesis, UCSB, 3
Universidad Popular
Autónoma del Estado de Puebla, Mexico, 4
School of Computing
Science,
Newcastle University, UK, 5
Woods Hole Research Center, Falmouth,
MA, 6
University of Kansas, Lawrence, 7
Environmental Sciences
Division, Oak Ridge National Lab, TN
Also: A. Marinho, L. Murta, C.Werner, V.Braganholo, S. Serra da Cruz,
E.Ogasawara, M. Mattoso. “ProvManager: A Provenance Management
System for Scientific Workflows.” Concurrency and Computation:
Practice and Experience 24, no. 13 (2012): 1513–1530
…
“Workflow-Land” (prospective prov.)
Provenance @ SBBD'16
50. Provenance Sleuth or Engineer?
• Scientists are Provenance (i.e., Natural History) Sleuths
• {Computational, Computer, Information}-Scientists should
(also) be Provenance Engineers
– Ensure your “Data Tree of Life” (data provenance) is correct!
– What is the origin and processing history of your data?
• With great provenance come great questions!
– “We store everything!”
– Huh? Yes, provenance is the answer… (yawn..)
– But what is the question??
• Engineer’s Stance:
– What questions do you want to answer?
– Let’s find out what observables we need to capture, what query
language we should use, how we do that efficiently (later), …
Provenance @ SBBD'16
51. Drilling down into “Trace-Land”:
From MoC to MoP via Observables
• Model of Computation MoC
– specification/algorithm to compute Outputs = MoC(Wf,Params,Inputs)
– a director or scheduler implements MoC
– gives rise to formal notions of
• computation (aka run) R
– Formalisms to define M?
• Model of Provenance MoP
– associate with a MoC a “default” MoP (= MoC ± Δ)
– the MoP is a “trimmed” MoC
• T = R – I + M
– Trace = Run – Ignored-observables + Modeled-observables
• Observables (of a MoC / MoP)
– functional observables (may influence output o)
• token rate, notions of firing, …
– non-functional observables (not part of M, do not influence o)
• token timestamp, size, … (unless the MoC cares about those)
Provenance @ SBBD'16
52. M. Anand, S. Bowers,
et al., SSDBM’09
From MoCs to Models of Provenance (MoPs)
Provenance @ SBBD'16
53. Fine-grained, Data & MoC-aware MoP
M. Anand, S. Bowers,
et al., SSDBM’09
Provenance @ SBBD'16
54. Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most prov sensu WF work use this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– Most prov sensu DB work use this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2à t3,t4
– … or semantic annotations (sem.types)
f
A
q
t1
t2
t3
t4
X1
X2
Y1
Y2
Provenance @ SBBD'16
55. ✔ Provenance capture (Matlab, R, Python, … scientific workflow systems)
✔ Uploading, sharing, linking provenance through various provenance tools
✗ Tools for scientists to exploit (≠ capture, share, link) provenance for their own
day-to-day work.
è Prime the provenance pump and increase provenance generation
è Scientists accelerate their work via new, active uses of provenance.
But … how to prime the provenance pump??
Must support “Provenance for Self” !
Provenance
for Self?!
Provenance
for Others
Provenance @ SBBD'16
58. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late
13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-
temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm
estimates joint information in tree-rings and a climate signal to identify “best” tree-ring
chronologies for climate reconstructing.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
Provenance @ SBBD'16
61. YesWorkflow.org
• YesWorkflow (YW)
– Started as a grass-roots effort (Kurator, SKOPE, ..)
– … meeting the scientists/users where they R!
• R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
• => Reveal the workflow model/abstraction
… that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction)...
– Then YW-recon: ASAP (reconstructing runtime Provenance)
Provenance @ SBBD'16
62. YW (prospective) and
YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy J
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model ó Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
Provenance @ SBBD'16
68. Figure 4: Process workflow view of an A↵ymetrix analysis script (in R).
4 YesWorkflow Examples
In the following we show YesWorkflow views extracted from real-world scientific use cases.
The scripts were annoted with YW tags by scientists and script authors, using a very
modest training and mark-up e↵ort.1
Due to lack of space, the actual MATLAB and R
scripts with their YW markup are not included here. However, they are all available
from the yw-idcc-15 repository on the YW GitHub site [Yes15].
Gene Expression Microarray Data Analysis
• [Normalize]
– Normalization of data across microarray datasets
• [SelectDEGs]
– Selection of differentially expressed genes between conditions
• [GO Analysis]
– determination of gene ontology statistics for the resulting datasets
• [MakeHeatmap]
– creation of a heatmap of the differentially expressed genes.
Tyler Kolisnik, Mark Bieda
Provenance @ SBBD'16
71. YW (prospective) and
YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy J
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model ó Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
Provenance @ SBBD'16
85. simulate_data_collection
collect_data_set
sample_id energy frame_number raw_image
calculate_strategy
accepted_sample num_imagesenergies
load_screening_results
sample_namesample_quality
transform_images
corrected_image
sample_spreadsheet
calibration_image
sample_score_cutoff data_redundancy
cassette_id
Subgraph
resulting from
lineage query
on YW workflow
model
What is the lineage of
corrected_image?
Provenance @ SBBD'16
87. We’re off to see the Wizard of Prov ...
We're off to see the Wizard,
The wonderful Wizard of Prov!
--
We hear he is a wiz of a wiz
If ever a wiz there was.
--
If ever, oh ever, a wiz there was,
The Wizard of Prov is one because,
Because, because, because, because, because,
Because of the wonderful things he does.
• Enrich YW conceptual view
with NW Python provenance!
• Get the best of both worlds!
• How hard can it be to bridge
YW and NW …
(cf. TaPP’15 prototype)
Provenance @ SBBD'16
91. Secret Reproducible Sauce
• Combining provenance information from
noWorkflow and YesWorkflow
• Using all the good stuff:
– make, docker, Prolog, SQL, Graphviz
• Open source
– github.com/yesworkflow-org/yw-noworkflow
– github.com/gems-uff/yin-yang-demo
• Have a closer look at the demo!
Provenance @ SBBD'16