Accelerating Data-driven Discovery in Energy Science - Ian Foster
A talk given at the US Department of Energy, covering our work on research data management and analysis. Three themes:
(1) Eliminate data friction (use of SaaS for research data management)
(2) Liberate scientific data (research on data extraction, organization, publication)
(3) Create discovery engines at DOE facilities (services that organize data + computation)
WoSC19: Serverless Workflows for Indexing Large Scientific Data - University of Chicago
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function as a service models to enable scalable metadata extraction by orchestrating the execution of many, short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near to the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
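As a rough illustration of the pattern the abstract describes (many short-running extractors fanned out over a file collection), the sketch below routes files to type-specific extractor functions and runs them concurrently. It is not Xtract's actual code: the extractors, file types, and dispatcher are hypothetical, and a local process pool stands in for the function-as-a-service layer.

```python
import concurrent.futures
import json
import pathlib

# Toy extractors: each returns a small metadata dictionary for one file type.
def extract_csv(path):
    header = pathlib.Path(path).read_text().splitlines()[0]
    return {"type": "tabular", "columns": header.split(",")}

def extract_json(path):
    return {"type": "json", "keys": list(json.loads(pathlib.Path(path).read_text()).keys())}

# Hypothetical mapping from file suffix to extractor function.
EXTRACTORS = {".csv": extract_csv, ".json": extract_json}

def extract_all(paths):
    """Fan out one short-running extractor per file and collect the metadata."""
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futures = {pool.submit(EXTRACTORS[pathlib.Path(p).suffix], p): p
                   for p in paths if pathlib.Path(p).suffix in EXTRACTORS}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

if __name__ == "__main__":
    print(extract_all(["samples/run1.csv", "samples/config.json"]))  # hypothetical paths
```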
Scientific Workflow Systems for accessible, reproducible research - Peter van Heusden
Presentation for eResearch Africa 2013 on using scientific workflow management systems to compose and enact analysis workflows in bioinformatics (or science in general).
The Materials Data Facility: A Distributed Model for the Materials Data Commu... - Ben Blaiszik
Presentation given at the UIUC Workshop on Materials Computation: data science and multiscale modeling. Materials Data Facility data publication, discovery, Globus, and associated python and REST interfaces are discussed. Video available soon.
In this talk we describe how the Fourth Paradigm for Data-Intensive Research is providing a framework for us to develop tools, technologies and platforms to support actionable science. We discuss applications that take advantage of cloud computing, particularly Microsoft Azure, to realise the potential for turning data into decisions, knowledge and understanding. http://www.fourthpardigm.org and http://www.azure4research.com
Semantic Publishing and Nanopublications - Tobias Kuhn
Making available and archiving scientific results is for the most part still considered the task of classical publishing companies, despite the fact that classical forms of publishing centered around printed narrative articles no longer seem well-suited in the digital age. In particular, there exist currently no efficient, reliable, and agreed-upon methods for publishing scientific datasets, which have become increasingly important for science. In this talk, I outline how we can design scientific data publishing as a Web-based bottom-up process, without top-down control of central authorities. I present a protocol and a server network to decentrally store and archive data in the form of nanopublications, a format based on Semantic Web techniques to represent scientific data with formal semantics. Such nanopublications can be made verifiable and immutable by applying cryptographic methods with identifiers called Trusty URIs. I show how this approach allows researchers to produce, publish, retrieve, address, verify, and recombine datasets and their individual nanopublications in a reliable and trustworthy manner, and I discuss how the current small network can grow to handle the large amounts of structured data that modern science is producing and consuming.
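To make the idea of verifiable, immutable identifiers concrete, here is a minimal sketch of a hash-based identifier for a serialized nanopublication: any change to the content changes the identifier, so the artifact can be verified against its URI. This is only an illustration of the principle; the actual Trusty URI specification canonicalizes the RDF content before hashing, and the example namespace and content below are made up.

```python
import base64
import hashlib

def trusty_like_suffix(serialized_nanopub: str) -> str:
    """Derive a hash-based identifier suffix from serialized content, so the
    identifier doubles as a checksum (illustrative, not the real Trusty URI algorithm)."""
    digest = hashlib.sha256(serialized_nanopub.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

nanopub = "ex:assertion { ex:drugX ex:interactsWith ex:drugY . }"  # toy content
print("http://example.org/np/" + trusty_like_suffix(nanopub))
```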
Findable Accessible Interoperable Reusable < data | models | SOPs | samples | articles | * >. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a bunch of projects and initiatives in the Life Sciences. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVel), and Pharmacology (open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What are the morals we can draw from these stories?
These slides were presented at AGU 2018 by Tanu Malik from DePaul University, in a session convened by Dr. Ian Foster, director of the Data Science and Learning division at Argonne National Laboratory.
Research Objects: more than the sum of the parts - Carole Goble
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
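A schematic sketch of the packaging idea follows: a manifest that aggregates the interlinked artifacts of a study and records their relationships and provenance. The field names and structure here are illustrative only, not the Research Object specification.

```python
import json

# Schematic only: the point is that data, code, workflows, and their relations
# are packaged and described together rather than scattered across systems.
research_object = {
    "id": "ro-example-001",
    "aggregates": [
        {"id": "data/tree_rings.csv", "type": "Dataset"},
        {"id": "code/reconstruct.R", "type": "Software"},
        {"id": "workflow/pipeline.cwl", "type": "Workflow"},
        {"id": "paper/preprint.pdf", "type": "Article"},
    ],
    "annotations": [
        {"about": "data/tree_rings.csv", "wasGeneratedBy": "workflow/pipeline.cwl"},
        {"about": "paper/preprint.pdf", "cites": "data/tree_rings.csv"},
    ],
}
print(json.dumps(research_object, indent=2))
```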
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Scott Edmunds talk on Big Data Publishing at the "What Bioinformaticians need to know about digital publishing beyond the PDF" workshop at ISMB 2013, July 22nd 2013
Jean-Claude Bradley presents on "Open Notebook Science and other Science2.0 Approaches to Communicate Research" at the University of Pennsylvania Library.
Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.
Using Neo4j for exploring the research graph connections made by RD-Switchboard - amiraryani
In this talk, Jingbo Wang (NCI) and Amir Aryani (ANDS) have presented the Neo4j queries that can help data managers to explore the connections between datasets, researchers, grants, and publications using the graph model and Research Data Switchboard. In addition, they have discussed a paper on "Graph connections made by RD-Switchboard using NCI’s metadata", presented in the Reproducible Open Science workshop in Hannover September 2016.
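A minimal sketch of the kind of query the talk describes, using the Neo4j Python driver to traverse links between datasets, grants, and publications. The node labels, relationship types, and connection details are assumptions for illustration; the actual Research Data Switchboard graph schema and the queries presented in the talk may differ.

```python
from neo4j import GraphDatabase

# Hypothetical schema: Dataset and Publication nodes linked to a shared Grant.
QUERY = """
MATCH (d:Dataset)-[:FUNDED_BY]->(g:Grant)<-[:FUNDED_BY]-(p:Publication)
RETURN d.title AS dataset, g.title AS grant, p.title AS publication
LIMIT 25
"""

def related_records(uri="bolt://localhost:7687", auth=("neo4j", "password")):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        return [record.data() for record in session.run(QUERY)]

if __name__ == "__main__":
    for row in related_records():
        print(row)
```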
Data Tribology: Overcoming Data Friction with Cloud Automation - Ian Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
With the rise of Web 2.0, software built on open APIs has appeared, freeing users from manually searching and sifting through information on the web. This article examines WeboNaver (Webometrics Tool for Naver), an API-based search tool built on Naver, one of the most influential search engines in Korea. The software automatically collects large amounts of data by category and can easily distinguish between different types of information on the web, which was not possible before; researchers can thereby make data management, processing, and analysis more accurate and efficient within a specified timeframe. The paper illustrates how to use WeboNaver and verifies its usability and reliability through several case studies: the web visibility of the 292 members of the 18th Korean National Assembly was analyzed, as was the web visibility of terms related to H1N1 influenza.
TMS workshop on machine learning in materials science: Intro to deep learning... - BrianDeCost
This presentation is intended as a high-level introduction to deep learning and its applications in materials science. The intended audience is materials scientists and engineers.
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
Next-Generation Search Engines for Information Retrieval - Waqas Tariq
In recent years, there have been significant advances in scientific data management and retrieval techniques, particularly in standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. To integrate these pieces together, a data archive and associated metadata should be generated. These data should be stored in a format that is locatable, retrievable, and understandable; more importantly, it should be a form that will continue to be accessible as technology changes, such as XML. New search technologies are being implemented around these protocols, which make searching easy, fast, and robust. One such system is Mercury, a metadata harvesting, data discovery, and access system built for researchers to search for, share, and obtain spatiotemporal data used across a range of climate and ecological sciences.
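To make concrete why a self-describing format such as XML keeps records locatable and understandable, here is a small sketch that parses a spatiotemporal metadata record. The element names and values are made up for illustration; they are not Mercury's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record with temporal and spatial coverage elements.
record = """
<metadata>
  <title>Soil moisture, Walker Branch Watershed</title>
  <temporalCoverage start="1995-01-01" end="2005-12-31"/>
  <spatialCoverage north="35.97" south="35.95" east="-84.28" west="-84.31"/>
</metadata>
"""

root = ET.fromstring(record)
title = root.findtext("title")
when = root.find("temporalCoverage").attrib
where = root.find("spatialCoverage").attrib
print(title, "|", when["start"], "to", when["end"], "|", where)
```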
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how,
In practice the exchange, reuse and reproduction of scientific experiments is hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed just as second-class artifacts tethered to publications, but the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects. The term has become widespread. However. What is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship, sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
Open Data (and Software, and other Research Artefacts) - A proper management - Oscar Corcho
Presentation at the event "Let's do it together: How to implement Open Science Practices in Research Projects" (29/11/2019), organised by Universidad Politécnica de Madrid, where we discuss the need to take into account not only open access and open research data, but also all the other artefacts that result from our research processes.
Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture - Xiaogang (Marshall) Ma
A presentation reviewing technical trends in data management, publication, and citation, and methodologies for data interoperability, research provenance, and semantic eScience.
Slides describing Force11 work and the background of several of the speakers, used for talks at the University of Lethbridge, Carnegie Mellon, and internally at Elsevier.
Data Communities - reusable data in and outside your organization - Paul Groth
Data is critical both to the running of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use; it can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I talk about which practices are a good place to start for helping others reuse your data. I put this in the context of the notion of data communities, which organizations can use to help foster the use of data both within the organization and externally.
Research Objects for improved sharing and reproducibility - Oscar Corcho
Presentation about the usage of Research Objects to improve scientific experiment sharing and reproducibility, given at the Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology (July 2015)
Talk at the World Science Festival at Columbia, June 2, 2017: session on Big Data and Physics: http://www.worldsciencefestival.com/programs/big-data-future-physics/
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A... - BigData_Europe
Slides for the keynote talk at Big Data Europe workshop no. 3 on 11 September 2017 in Amsterdam, co-located with the SEMANTiCS2017 conference, by Ron Dekker, Director of CESSDA: "European Open Science Agenda: where we are and where we are going?"
Jonathan Tedds Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The ... - Jonathan Tedds
http://dlab.berkeley.edu/event/open-research-challenge-peer-review-and-publication-research-data
A talk by Dr. Jonathan Tedds, Senior Research Fellow, D2K Data to Knowledge, Dept of Health Sciences, University of Leicester.
PI: #BRISSKit www.brisskit.le.ac.uk
PI: #PREPARDE www.le.ac.uk/projects/preparde
The Peer REview for Publication & Accreditation of Research data in the Earth sciences (PREPARDE) project seeks to capture the processes and procedures required to publish a scientific dataset, ranging from ingestion into a data repository, through to formal publication in a data journal. It will also address key issues arising in the data publication paradigm, namely, how does one peer-review a dataset, what criteria are needed for a repository to be considered objectively trustworthy, and how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community.
I will discuss this and alternative approaches to research data management and publishing through examples in astronomy, biomedical and interdisciplinary research including the arts and humanities. Who can help in the long tail of research if lacking established data centers, archives or adequate institutional support? How much can we transfer from the so called “big data” sciences to other settings and where does the institution fit in with all this? What about software?
Publishing research data brings a wide and differing range of challenges for all involved, whatever the discipline. In PREPARDE we also considered the pre and post publication peer review paradigm, as implemented in the F1000 Research Publishing Model for the life sciences. Finally, in an era of truly international research how might we coordinate the many institutional, regional, national and international initiatives – has the time come for an international Research Data Alliance?
Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13 - DataDryad
Presentation by Elena Zudilova-Seinstra on Elsevier's work on data and the article of the future and open data given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
As the volume and complexity of data from myriad Earth-observing platforms, both remote sensing and in situ, increases, so does the demand for access to both the data and the information products derived from them. The audience is no longer restricted to an investigator team with specialist science credentials: non-specialist users, from scientists in other disciplines and the science-literate public to teachers, the general public, and decision makers, want access. What prevents them from accessing these resources? It is the very complexity of specialist-developed data formats, data set organizations, and specialist terminology. What can be done in response? We must shift the burden from the user to the data provider. To achieve this, our data infrastructures will likely need greater internal code and data structure complexity in order to achieve (relatively) simpler end-user complexity. Evidence from numerous technical and consumer markets supports this scenario. We will cover the elements of modern data environments, what the new use cases are, and how we can respond to them.
Enabling better science - Results and vision of the OpenAIRE infrastructure a... - Paolo Manghi
Enabling better science: presentation on the results and vision of the OpenAIRE infrastructure and RDA Publishing Data Services Working Group in this direction.
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a... - dkNET
Presenter: Chen Li, PhD. Professor, Department of Computer Science, University of California Irvine
Abstract
Many data analytics projects have collaborators with complementary backgrounds, including biologists, bioinformaticians, computer scientists, and AI/ML experts. Many of them have limited experience with coding, setting up a computing infrastructure, and using ML models. Existing tools and services, such as email attachments, GitHub, and Google Drive, are inefficient for sharing data and analyses. In this talk, we present an open source system called Texera that provides a cloud computing platform for collaborators to share data and analyses as workflows. After seven years of development, the system has a rich set of powerful features, such as shared editing, shared execution, version control, commenting, debugging, user-defined functions in multiple languages (e.g., Python, R, Java), and support for state-of-the-art AI/ML techniques. Its backend parallel engine enables scalable computation on large data sets using computing clusters. We will show a demo of the system, and present our vision, supported by a recent NIH award, dkNET (NIDDK Information Network, https://dknet.org), to serve the diabetes, endocrinology, and metabolic diseases research communities through the FAIR sharing of data and knowledge.
Resource link: https://github.com/Texera/texera
Upcoming webinars schedule: https://dknet.org/about/webinar
Reconciling Conflicting Data Curation Actions: Transparency Through Argument... - Bertram Ludäscher
Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburgh.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program P_AF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
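As a small stand-in for the logic-programming machinery the abstract mentions, the sketch below computes the grounded extension of a toy argumentation framework by iterating the characteristic function to its least fixpoint; this mirrors the "uncontroversial updates are accepted, unjustified ones are rejected" behavior. The update names are hypothetical, and this is not the paper's actual encoding or tooling.

```python
def grounded_extension(arguments, attacks):
    """Iterate the characteristic function F(S) = {a : S defends a}
    from the empty set to its least fixpoint (the grounded extension)."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted = set()
    while True:
        defended = {a for a in arguments
                    if all(any((d, b) in attacks for d in accepted)
                           for b in attackers[a])}
        if defended == accepted:
            return accepted
        accepted = defended

# Toy example with hypothetical update names: u1 attacks u2, and u2 attacks u3.
args = {"u1", "u2", "u3"}
atts = {("u1", "u2"), ("u2", "u3")}
print(grounded_extension(args, atts))  # {'u1', 'u3'}: u2 is rejected, u1 defends u3
```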
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion - Bertram Ludäscher
Research Seminar Talk (online) at KRR@UP (Uni Potsdam) on Dec 6, 2023, loosely based on a paper with the same title at the 7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3)
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion! - Bertram Ludäscher
7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3) at
AIxIA 2023: 22nd International Conference of the Italian Association for Artificial Intelligence.
Presentation of a paper by Bertram Ludäscher, Shawn Bowers, and Yilin Xia, given virtually on November 9, 2023.
[Flashback] Integration of Active and Deductive Database Rules - Bertram Ludäscher
Slides of my PhD defense at the University of Freiburg, 1998.
Statelog and similar state-oriented extensions of Datalog have seen renewed interest subsequently, e.g., see
[Hel10] Hellerstein, J.M., 2010. The declarative imperative: experiences and conjectures in distributed logic. ACM SIGMOD Record, 39(1), pp.5-19.
[AMC+11] Alvaro, P., Marczak, W.R., Conway, N., Hellerstein, J.M., Maier, D. and Sears, R., 2011. Dedalus: Datalog in time and space. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers (pp. 262-281). Springer.
Computational Reproducibility vs. Transparency: Is It FAIR Enough? - Bertram Ludäscher
Keynote at CLIR Workshop (Webinar): Toward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
By Michael Gryk and Bertram Ludäscher. Presented at 2020 JCDL-SIGCM Workshop, August 1, 2020.
ABSTRACT. Conceptual models can serve multiple purposes: communication of information between stakeholders, information abstraction and generalization, and information organization for archival and retrieval. An ongoing research question is how to formally define the fit-for-purpose of a conceptual model as well as to define metrics or tests to determine whether a given model faithfully supports a designated purpose.
This paper summarizes preliminary investigations in this area by presenting toy problems along with different conceptual models for the system under study. It is argued that the different models are adequate in supporting a sophisticated query and yet they adopt different normalization schemes and will differ in expressiveness depending on the implied purpose of the models. As the subtitle suggests, this work is intended to be primarily exploratory as to the constraints a formal system would require in defining the “usefulness”, “expressiveness” and “equivalence” of conceptual models.
From Research Objects to Reproducible Science Tales - Bertram Ludäscher
University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us - Bertram Ludäscher
Sahil Gupta, Bertram Ludäscher, Jessica Yi-Yun Cheng.
Datalog 2.0: 3rd Workshop on the Resurgence of Datalog in Academia & Industry. Philadelphia Logic Week. June 3-7, 2019 Philadelphia.
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise - Bertram Ludäscher
Deductive Databases & Logic Programs: Back to the Future!
Colloquium talk on the occasion of the retirement of Prof. Dr. Georg Lausen, May 10th, 2019, Universität Freiburg, Germany
Dissecting Reproducibility: A case study with ecological niche models in th... - Bertram Ludäscher
Bertram Ludäscher and Santiago Núñez-Corrales.
Presentation at "Research Synthesis in the Hierarchy of Hypotheses" Workshop, October 10-12, 2018.
Schloss Herrenhausen, Hannover, Germany.
Incremental Recomputation: Those who cannot remember the past are condemned ... - Bertram Ludäscher
Talk given at "Problems and techniques for Incremental Re-computation: provenance and beyond".
A workshop co-organized with Provenance Week 2018
King's College London, 12th and 13th July, 2018
Organizers: Paolo Missier (Newcastle University), Tanu Malik (DePaul University), Jacek Cala (Newcastle University)
Abstract: Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations - Bertram Ludäscher
Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won the IPAW Best Paper Award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage'' relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives''. In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
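To illustrate the core idea of inferring fine-grained lineage from a trace plus dependency annotations, here is a small sketch in the spirit of the abstract. The step names, ports, trace format, and annotation format are hypothetical; the paper's actual rules and its answer-set-programming implementation are not reproduced here.

```python
# Hypothetical trace: each invocation records the step and the concrete data
# items bound to each input/output port.
trace = [
    {"step": "align", "inputs": {"reads": "r1", "ref": "g1"}, "outputs": {"bam": "b1"}},
    {"step": "call",  "inputs": {"bam": "b1"},                "outputs": {"vcf": "v1"}},
]

# Hypothetical annotations: for each step, which output ports depend on which
# input ports. Without such annotations one would assume all outputs depend on
# all inputs, which is what produces lineage "false positives".
deps = {
    "align": {"bam": ["reads", "ref"]},
    "call":  {"vcf": ["bam"]},
}

def direct_parents(item):
    """Data items that a given item directly depends on, per the annotations."""
    parents = set()
    for inv in trace:
        for out_port, out_item in inv["outputs"].items():
            if out_item == item:
                for in_port in deps[inv["step"]].get(out_port, inv["inputs"]):
                    parents.add(inv["inputs"][in_port])
    return parents

def lineage(item):
    """Transitive closure of direct dependencies (fine-grained lineage)."""
    seen, stack = set(), [item]
    while stack:
        for p in direct_parents(stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

print(lineage("v1"))  # {'b1', 'r1', 'g1'}
```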
An ontology-driven framework for data transformation in scientific workflows - Bertram Ludäscher
Presentation given by Bertram at the Data Integration in the Life Sciences (DILS) Workshop in Leipzig, Germany, 2004.
Reference:
Bowers, Shawn, and Bertram Ludäscher. "An ontology-driven framework for data transformation in scientific workflows." In International Workshop on Data Integration in the Life Sciences (DILS), pp. 1-16. Springer, 2004.
So this isn't new -- but still relevant :-)
ABSTRACT. Ecologists spend considerable effort integrating heterogeneous data for statistical analyses and simulations, for example, to run and test predictive models. Our research is focused on reducing this effort by providing data integration and transformation tools, allowing researchers to focus on “real science,” that is, discovering new knowledge through analysis and modeling. This paper defines a generic framework for transforming heterogeneous data within scientific workflows. Our approach relies on a formalized ontology, which serves as a simple, unstructured global schema. In the framework, inputs and outputs of services within scientific workflows can have structural types and separate semantic types (expressions of the target ontology). In addition, a registration mapping can be defined to relate input and output structural types to their corresponding semantic types. Using registration mappings, appropriate data transformations can then be generated for each desired service composition. Here, we describe our proposed framework and an initial implementation for services that consume and produce XML data.
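A minimal sketch of the registration-mapping idea follows: two services use different structural field names, each field is registered against a shared semantic type, and a transformation between the services can then be generated by matching semantic types. All names (fields, semantic types, records) are made up for illustration and the real framework operates on XML and a formal ontology rather than flat dictionaries.

```python
# Hypothetical registration mappings: structural field -> semantic type
# (expressions of a shared ontology).
producer_map = {"sp": "SpeciesName", "n_obs": "ObservationCount"}
consumer_map = {"taxon": "SpeciesName", "count": "ObservationCount"}

def transform(record, source_map, target_map):
    """Carry each value across via its semantic type instead of hand-coded field maps."""
    by_semantics = {source_map[k]: v for k, v in record.items() if k in source_map}
    return {field: by_semantics[sem] for field, sem in target_map.items()
            if sem in by_semantics}

print(transform({"sp": "Quercus alba", "n_obs": 12}, producer_map, consumer_map))
# {'taxon': 'Quercus alba', 'count': 12}
```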
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on representations such as Compressed Sparse Row (CSR), an adjacency-list based graph representation.
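A minimal sketch of CSR construction from an edge list, to make the representation concrete: adjacency lists are packed into one flat neighbors array, with an offsets array marking where each vertex's neighbors begin. This is illustrative only and not taken from the report itself.

```python
def to_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) from a directed edge list."""
    degree = [0] * num_vertices
    for u, _ in edges:
        degree[u] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
    return offsets, neighbors

offsets, neighbors = to_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(offsets)    # [0, 2, 3, 4, 4]
print(neighbors)  # [1, 2, 2, 3]
```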
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Machine learning and optimization techniques for electrical drives.pptx
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure Pathways
1. Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure Pathways
Bertram Ludäscher, Victoria Stodden, Matt Turk
Kyle Chard (U Chicago), Niall Gaffney (TACC), Matt Jones (UCSB), Jarek Nabrzyski (Notre Dame), Kandace Turner (NCSA)
CIRSS Seminar
September 2, 2016
2. Problems Facing Data Researchers
The workflow for data research is fragmented:
● Data comes from many sources and is "integrated the old-fashioned way", e.g. via chains of email
● Researchers use a collection of cloud services, copying data from Dropbox and Box to local storage, with distributed directory structures to organize (and provide discovery of) data
● Actions taken on data are not recorded (custom scripts, some version of a community-developed and -supported codebase)
● Publication of final data as prescribed by a Data Management Plan (hopefully with a DOI), with a link in publications, gives no reproducibility
3. Whole Tale ~ Whole Story (research → publication) ~ Long Tail of Science (Lil’Data/MPC) + Big Data/HPC
4. Whole Tale (WT) Big Picture
• WT will leverage & contribute to existing CI and tools to support the whole science story (= run-to-publication cycle), and provide access to big CI & HPC for long-tail researchers.
➡ Integrated tools to simplify usage and promote best practices.
• NSF CC*DNI DIBBS:
– 5 institutions, 5 years ($5M total)
– Cooperative Agreement
5. WT Project Org & Working Groups
Whole Tale: Merging Science & CI Pathways ... through Working Groups!
Working Groups driving use cases and adoption
Working Groups to provide key components
7. Joining data from different environments to enable new research:
• Streamline gathering, integrating, and analyzing environmental data needed to build up a fuller picture of the paleoclimate.
• Enables access and interrogation of data from DataONE, iPlant, and the Long Term Ecological Research Network (LTER), leveraging Globus Online, Brown Dog's data tilling services, RStudio, and XSEDE resources.
• Enables access to the Digital Archaeological Record (tDAR), both through a native API and via a tDAR member node in DataONE.
The CI Side of this Science Pathways (Archaeology)
8. Reproducible Science Example: Paleoclimate Reconstruction
Science paper (OA) uses:
• open source code: R, PaleoCAR, ...
• multiple tree-ring databases
• HPC resources
• Example WG goal: a reproducibility study using
– the YesWorkflow toolkit: workflow & provenance from code (see the annotated sketch below)
– Jupyter notebook
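To illustrate the YesWorkflow approach referenced above, here is a hedged sketch of a script carrying YesWorkflow-style comment annotations from which the toolkit can reconstruct a workflow and provenance view. The script, its ports, and the file paths are hypothetical; this is not PaleoCAR code or an official YesWorkflow example.

```python
# Hypothetical script with YesWorkflow-style comment tags; the YW toolkit reads
# such tags to recover a prospective workflow/provenance view of the script.
# @begin paleoclimate_reconstruction
# @in tree_ring_db @uri file:data/tree_rings.csv
# @out reconstruction @uri file:results/reconstruction.csv

import csv

# @begin load_rings
# @in tree_ring_db
# @out rings
with open("data/tree_rings.csv") as f:
    rings = [row for row in csv.DictReader(f)]
# @end load_rings

# @begin reconstruct
# @in rings
# @out reconstruction
with open("results/reconstruction.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "estimate"])
    for row in rings:
        writer.writerow([row["year"], float(row["width"]) * 1.0])  # placeholder model
# @end reconstruct

# @end paleoclimate_reconstruction
```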
9. Science Pathways: Astronomy
• Researcher A uses university credentials to access large cosmological simulation outputs from Blue Waters published into WT, does analysis using Whole Tale services in a Jupyter Notebook, and creates a new result. With the publication, the user creates a DOI linking the data and the source code used to generate it, tied to the original input data, and references this in his reviewed and published research paper.
• Another researcher finds the DOI and is able to access the data and analysis, and then compares model output with new observations from the Hobby-Eberly Telescope Dark Energy Experiment on TACC systems. Results are shared with the original author and a new DOI is created for these results.
10. Science Pathways: Astronomy
• Enabling direct analysis and collaborative research on simulation outputs stored in Whole Tale enabled repositories via user-supplied Python scripts.
• YT (yt-project.org) will provide advanced, customizable analysis and visualization, leveraging Jupyter to provide the scripting support.
• Federation will allow jobs to move to data or vice versa where appropriate.
11. The Whole Tale's Approach
WT will integrate well-established CI components, creating a simple and unified environment to use, share, and publish data and workflows:
1. Unified authentication via Globus Auth
2. Abstracted storage layer with a unified namespace
3. Python and R APIs integrated with Jupyter Notebook environments
4. Ingest and publication service linking data, computations, and scholarly articles
5. OwnCloud desktop integration for a "Dropbox-like" interface
6. Event system to react to changes (e.g., new data published)
7. Data dashboard to ease data management and service interactions
• Capture the full workflow via notebooks, scripts, and applications, to be published along with data and research publications
12. WT Dashboard
• Web-based interface to enable "live articles" and research repeatability by enabling the execution of research methods on data using NDS Labs Docker containers and notebooks (Jupyter).
• Research methods: provided support for running Python scripts
• Research data: interfaced with the NDS Labs Python API, connecting the desktop to the NDS Labs data storage mechanisms (iRODS, Dropbox, Google Drive, SciDrive, and local file integration)
• Provides a Docker "diff" tarball for downloading research run results
Demonstrated @ SC 2014: http://ndspilot.com/nds/ndspilot1080p.mp4
13. High-level Milestones, Phases (Base, Share, Integrate, Reproduce, Operation)
• Ingest data from HTTP, Globus, and DataONE
• Store data in a private cloud-based home directory
• Move and manage data in iRODS
• Interact with data using Jupyter
• Manage data across ownCloud & iRODS
• Authenticate using ORCID
• Interact with data through a suite of frontends
• Automatically extract key metadata
• Search and manage distributed data from within frontends
• Operate on remote data as if it were local (including using OAI-ORE)
• Utilize a single identity across services
• Discover and share frontends through a global repository
• Integrate data and workflows with publications
• Issue, resolve, and track identifiers for distributed data
• Discover data using federated and distributed queries
• Track provenance across services
• Organize data collections via user-defined namespaces
14. Working with the NDS and Others
WT is an NDS Pilot Project:
– Will develop and integrate the system as components in NDS Labs
– Work with communities to create powerful environments tailored for the communities' needs
WT will work with NDS to provide status and train other users:
– NDS Meeting Hackathons
– Cooperation and collaboration with other projects in NDS
WT will collaborate with other projects (DIBBs, BDHub, RDA, ...)
➡ Get involved! (e.g., Working Groups, Hackathons, Summer Internship) wholetale.org
16. WT "Philosophy"
• Why a platform?
• Computation is near-ubiquitous in research, yet we have few best practices or dissemination standards.
• And it's complex! Small changes in a computational implementation or in the data can have a dramatic impact on the result.
17. Some Bold Assertions
• Software is used as a tool of discovery in nearly all research today.
• When software is a key part of the discovery process, it should be subject to the same philosophy of transparency as any method.
• Software is an integral and inseparable component of the computational infrastructure in which most research takes place.
• Computational research is embedded in a social structure which includes many stakeholders.
19. Dissemination is Incomplete
• Publications are missing details that are necessary to understand and verify published findings...
• => credibility crisis in computational science
• How to reproduce the result? What's needed?
20. Computational Reproducibility
Traditionally two branches to the scientific method:
• Branch 1 (deductive): mathematics, formal logic
• Branch 2 (empirical): statistical analysis of controlled experiments
Now, new branches due to technological changes?
• Branch 3, 4? (computational): large-scale simulations / data-driven computational science
21. The Ubiquity of Error
The central motivation for the scientific method is to root out error:
• Deductive branch: the well-defined concept of the proof
• Empirical branch: the machinery of hypothesis testing, appropriate statistical methods, structured communication of methods and protocols
Claim: Computation presents only a potential third/fourth branch of the scientific method, until the development of comparable standards.
22. Whole Tale?
Proposed solution:
• Capture computational steps / provide a compute environment
• Provide unique identifiers to data/code/workflows associated with results
• Provide links to embed in the publication for discoverability
• Preserve digital scholarly objects
23. So it looks pretty simple...
• What about big data?
• Complex codes?
• Reuse and bug fixes?
• Meta-analysis?
• Working with external groups, such as publishers?
• Incentives? What if they don't come?
• Allocating resources? Sustainability models?
• What does citation mean and how are contributions to be rewarded?
24. Incompleteness
• "I ran all the stuff and it's still the wrong answer!"
• "I got a different result, using your code and data!"
• "Your code doesn't work!"
• "Where's all the documentation? I can't figure this thing out."