Requirements for reproducibility in computational chemistry publications include making available the data, the code or algorithms, and the results of the study. Authors should provide all data necessary to understand and assess their conclusions. Source code, or a detailed algorithm description, should also be included to allow independent reproduction of the work. Finally, publications must contain the actual results of applying the method, not merely a description of them. Adopting these standards of transparency helps ensure others can evaluate and build upon published research claims.
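As a purely illustrative sketch (not from the talk), one lightweight way to meet these requirements is to ship a small script alongside the results that records the exact environment and a checksum of the outputs, so an independent rerun can be checked against the original. The function name `reproducibility_record` and the example result fields are my own assumptions, not anything prescribed by the talk:

```python
import hashlib
import json
import platform
import sys


def reproducibility_record(results: dict) -> dict:
    """Bundle computed results with environment details and a checksum,
    so a reader rerunning the study can verify they got the same output."""
    # Canonical JSON (sorted keys) makes the checksum deterministic.
    payload = json.dumps(results, sort_keys=True).encode("utf-8")
    return {
        "results": results,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


# Hypothetical result values, for illustration only.
record = reproducibility_record({"mean_score": 0.42, "n_compounds": 100})
print(record["sha256"])
```

A record like this could be published as supplementary material next to the raw data and source code.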
Adversarial Analytics - 2013 Strata & Hadoop World Talk (Robert Grossman)
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
Themes and objectives:
To position FAIR as a key enabler to automate and accelerate R&D process workflows
FAIR Implementation within the context of a use case
Grounded in precise outcomes (e.g. faster and bigger science / more reuse of data to enhance value / increased ability to share data for collaboration and partnership)
To make data actionable through FAIR interoperability
Speakers:
Mathew Woodwark, Head of Data Infrastructure and Tools, Data Science & AI, AstraZeneca
Erik Schultes, International Science Coordinator, GO-FAIR
Georges Heiter, Founder & CEO, Databiology
Open interoperability standards, tools and services at EMBL-EBI (Pistoia Alliance)
In this webinar Dr Henriette Harmse from EMBL-EBI presents how they are using their ontology services at EMBL-EBI to scale up the annotation of data and deliver added value through ontologies and semantics to their users.
A presentation I gave at the 2018 Molecular Med Tri-Con in San Francisco, February 2018. It addresses the general challenge of biomedical data management, some of the things to consider when evaluating solutions in this space, and concludes with a brief summary of some of the available tools and platforms.
Our regular Introduction to Data Management (DM) workshop (90-minutes). Covers very basic DM topics and concepts. Audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs (Paul Groth)
A look at how the thinking about Web Data and the sources of semantics can help drive decisions on combining latent and explicit knowledge. Examples from Elsevier and lots of pointers to related work.
PA webinar on benefits & costs of FAIR implementation in life sciences (Pistoia Alliance)
The slides from the Pistoia Alliance Debates Webinar, in which a panel of experts from technology providers and the biopharma industry shared their views on the "Benefits and costs of FAIR implementation for the life science industry".
DataONE Education Module 01: Why Data Management? (DataONE)
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license; attribution and citation requested.
Sources of Change in Modern Knowledge Organization Systems (Paul Groth)
Talk covering how knowledge graphs are making us rethink how change occurs in Knowledge Organization Systems. Based on https://arxiv.org/abs/1611.00217
The Roots: Linked data and the foundations of successful Agriculture Data (Paul Groth)
Some thoughts on successful data for the agricultural domain. Keynote at Linked Open Data in Agriculture
MACS-G20 Workshop in Berlin, September 27th and 28th, 2017 https://www.ktbl.de/inhalte/themen/ueber-uns/projekte/macs-g20-loda/lod/
In this talk we describe how the Fourth Paradigm for Data-Intensive Research is providing a framework for us to develop tools, technologies and platforms to support actionable science. We discuss applications that take advantage of cloud computing, particularly Microsoft Azure, to realise the potential for turning data into decisions, knowledge and understanding. http://www.fourthpardigm.org and http://www.azure4research.com
Being Reproducible: SSBSS Summer School 2017 (Carole Goble)
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Being FAIR: Enabling Reproducible Data Science (Carole Goble)
Talk presented at Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018, http://earlydetectionresearch.com/ in the Data Science session
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o... (Carole Goble)
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle if not practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics view point, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from one where results are post-hoc "made reproducible", to pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
DataFAIRy bioassays pilot -- lessons learned and future outlook (Isabella Feierberg)
We describe a precompetitive collaboration that makes public life science data FAIR and annotated with detailed, high quality metadata, at a shared cost. A data model based on public ontologies was defined to address the participants' business questions. This slide deck was presented at the Cambridge Cheminformatics meeting on June 2, 2021.
Systematic review article and Meta-analysis: Main steps for Successful writin... (Pubrica)
A review article is a piece of writing that gives a complete and systematic summary of results available in a certain field while also allowing the reader to perceive the subject from a different viewpoint.
UDM (Unified Data Model) - Enabling Exchange of Comprehensive Reaction Inform... (Frederik van den Broek)
Slides from my talk at the ACS CINF Symposium on Chemical Nomenclature & Representation on 26 August 2019 in San Diego.
Abstract:
The first edition of the Beilstein Handbook of Organic Chemistry was published nearly 140 years ago. Electronic laboratory notebooks have been in use in chemistry for almost 20 years. And the life science industry still doesn't have a well-defined way of capturing and exchanging information about chemical reactions and relies on imprecise or vendor-specific data formats. Without a common language and structure to describe experiments, data integration is unnecessarily expensive and a significant part of published data has not been readily available for processing or analysis.
The Unified Data Model (UDM) project team aims to improve the situation. UDM is a collective effort of vendors and life science organizations to create an open, extendable and freely available reference model and data format for exchanging experimental information about compound synthesis and testing. Run under the umbrella of the Pistoia Alliance, the project team has published two releases of the UDM data format, and the model is expected to continue to be improved as demand dictates, in concert with the Pistoia Alliance FAIR data implementation community.
OpenTox - an open community and framework supporting predictive toxicology an... (Barry Hardy)
Presented at ACS Boston 2015 at a Session on the growing impact of Open Science chaired by Andy Lang and Tony Williams dedicated to the work, memory and legacy of JC Bradley and the work we carry forward!
One important goal of OpenTox is to support the development of an Open Standards-based predictive toxicology framework that provides a unified access to toxicological data and models. OpenTox supports the development of tools for the integration of data, for the generation and validation of in silico models for toxic effects, libraries for the development and integration of modelling algorithms, and scientifically sound validation and reporting routines.
The OpenTox Application Programming Interface (API) is an important open standards development for software development purposes. It provides a specification against which development of global interoperable toxicology resources by the broader community can be carried out. The use of OpenTox API-compliant web services to communicate instructions between linked resources with URI addresses supports the use of a wide variety of commands to carry out operations such as data integration, algorithm use, model building and validation. The OpenTox Framework currently includes, with its APIs, services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, reporting, investigations, studies, assays, and authentication and authorisation, which may be combined into multiple applications satisfying a variety of different user needs. As OpenTox creates a semantic web for toxicology, it should be an ideal framework for incorporating toxicology data, ontology and modelling developments, thus supporting both a mechanistic framework for toxicology and best practices in statistical analysis and computational modelling.
In this presentation I will review the recent OpenTox-based development of applications including the ToxBank data infrastructure supporting integrated analysis across biochemical, functional and omics datasets supporting the safety assessment goals of the SEURAT-1 program which aims to develop alternatives to animal testing.
Finally, I will provide an overview of the working group activities of the newly formed OpenTox Association which aim to progress the development of open source, data, standards and tools in this area.
Lecture for a course at NTNU, 27th January 2021
CC-BY 4.0 Dag Endresen https://orcid.org/0000-0002-2352-5497
See also http://bit.ly/biodiversityinformatics
https://www.gbif.no/events/2021/lecture-ntnu-gbif.html
GSmith Springer Nature Data policies and practices: HKU Open Data and Data Pu... (GrahamSmith646206)
Supporting research data across Springer Nature: joining up policy and practice. Slides from Graham Smith (Research Data Manager, Springer Nature) at HKU Open Data and Data Publishing Seminar, 25th October 2021.
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how,
In practice the exchange, reuse and reproduction of scientific experiments is hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not "finished": codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed as second-class artifacts tethered to publications; they are the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects, and the term has become widespread. However: what is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship, sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i... (Dr. Haxel Consult)
Most scientific journals request that the complete set of research data be published simultaneously with the peer-reviewed paper. Publication of the research data is usually carried out as so-called "Supplementary Material" attached to the original paper, or on a research data repository. Both forms have in common that the data is usually published unstructured and not in a uniform, machine-processable format. This makes its further use in AI or data-mining tools unnecessarily difficult or even impossible. A concept is presented in which the data is digitally recorded, following the FAIR data principles, as part of the publication process. This digital capture makes the data available to the scientific community for easy use in data mining and AI tools. The data in the repository contains links to the publication to document its origin. The concept is applicable to preprints, peer-reviewed papers, diploma and doctoral theses, and is particularly suitable for open access publications. Moreover, the presentation highlights corresponding activities that have recently appeared in scientific publications.
Research Data Sharing: A Basic Framework (Paul Groth)
Some thoughts on thinking about data sharing. Prepared for the 2016 LERU Doctoral Summer School - Data Stewardship for Scientific Discovery and Innovation.
http://www.dtls.nl/fair-data/fair-data-training/leru-summer-school/
Is that a scientific report or just some cool pictures from the lab? Reproducibility and computational chemistry
1. Is that a scientific report or just some cool pictures from the lab? Reproducibility and computational chemistry
Gregory Landrum Ph.D.
NIBR IT, Novartis Institutes for BioMedical Research, Basel
2013 CADD Gordon Conference, Mount Snow VT
23 July, 2013
3. Publishing…
Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct. Mathematics papers are expected to contain a proof complete enough to allow knowledgeable readers to fill in any details. Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.
Mesirov, J. P. Accessible Reproducible Research. Science 327, 415–416 (2010).
4. Outline
§ Reproducibility?
§ Requirements for reproducibility of published research
§ Practical aspects
Landrum, G. A. & Stiefl, N. Is that a scientific publication or an advertisement? Reproducibility, source code and data in the computational chemistry literature. Future Medicinal Chemistry 4, 1885–1887 (2012).
6. Reproducibility
An author’s central obligation is to present an accurate and complete account of the research performed, absolutely avoiding deception, including the data collected or used, as well as an objective discussion of the significance of the research. Data are defined as information collected or used in generating research conclusions. The research report and the data collected should contain sufficient detail and reference to public sources of information to permit a trained professional to reproduce the experimental observations.
ACS “Ethical Guidelines to Publication of Chemical Research”
7. Reproducibility
Experimental reproducibility is the coin of the scientific realm. The extent to
which measurements or observations agree when performed by different
individuals defines this important tenet of the scientific method. The formal
essence of experimental reproducibility was born of the philosophy of logical
positivism or logical empiricism, which purports to gain knowledge of the world
through the use of formal logic linked to observation. A key principle of logical
positivism is verificationism, which holds that every truth is verifiable by
experience. In this rational context, truth is defined by reproducible experience,
and unbiased scientific observation and determinism are its underpinnings.
…
The assumption that objectively true scientific observations must be reproducible
is implicit, yet direct tests of reproducibility are rarely found in the published
literature. This lack of published evidence of reproducibility stems from the
limited appeal of studies reproducing earlier work to most funding bodies and to
most editors. Furthermore, many readers of scientific journals— especially of
higher-impact journals—assume that if a study is of sufficient quality to pass the
scrutiny of rigorous reviewers, it must be true; this assumption is based on the
inferred equivalence of reproducibility and truth described above.
Loscalzo, J. Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation 125, 1211–1214 (2012).
11. A great start
(1) Wherever possible, source code should be provided for new computational methods. The
source code can be a reference implementation of a method or algorithm and does not need to include a
graphical interface. If it is not possible to release the source code for a new method, authors should
provide a sufficient justification. Reviewers and editors will then consider this explanation. Any paper that
does not comply with the reproducibility guidelines will include this explanation when published. In cases
where it is not possible to release code due to intellectual property or other limitations, an executable
version of the new method should be readily accessible. Commercial products should provide time limited
licenses to facilitate evaluation and comparison of published methods.
(2) Any chemical structures and data mentioned in the paper should be made available in a
commonly used (SDF or SMILES) format. Distribution of data in pdf format is not sufficient.
(3) Any publications that employ commercial or open-source software should include scripts
or parameter files as well as data files that will enable others to easily reproduce the work.
(4) A clear easy to follow description of any new method should be a key criterion during the
review process. Wherever possible, a paper should contain a simple worked example that
demonstrates the application of the method. Parameter values and intermediate results for example
compounds should be included as part of the supporting material.
(5) Reviewers should put particular emphasis on the reproducibility of the method described
in a manuscript. Each reviewer should evaluate the description of the method, as well as the presence
of associated code, data, or executables, to ensure that the results can be independently reproduced.
Walters, W. P. Modeling, Informatics, and the Quest for Reproducibility. J.
Chem. Inf. Model. (2013). doi:10.1021/ci400197w
13. Requirements for Reproducibility:
Data
As a condition of publication, authors must agree to make available all data
necessary to understand and assess the conclusions of the manuscript to
any reader of Science. Data must be included in the body of the paper or in
the supplementary materials, where they can be viewed free of charge by all
visitors to the site. Certain types of data must be deposited in an approved
online database, including DNA and protein sequences, microarray data,
crystal structures, and climate records.
http://www.sciencemag.org/site/feature/contribinfo/faq/index.xhtml#data_faq
14. Requirements for Reproducibility:
Data
§ This is a no-brainer, right?
§ Unless it’s completely unprocessed (or the processing is part of the
detailed method description/code), it’s better to include the actual data
§ “Ligands from PDB structures X, Y, and Z” probably not good enough
§ For sources like ChEMBL, a version number and SQL to grab the data
are probably adequate
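To make the ChEMBL bullet concrete, here is a minimal sketch of the kind of provenance record one might include in the supporting information. The table and column names follow the public ChEMBL relational schema, but the release number, file path, and activity filter are illustrative placeholders, not part of the talk.

```python
# Sketch: record the exact provenance of a ChEMBL-derived data set.
# Table/column names follow the public ChEMBL relational schema;
# the release number and the IC50 filter are hypothetical examples.

CHEMBL_RELEASE = "ChEMBL_17"  # state the exact release used

QUERY = """
SELECT md.chembl_id,
       cs.canonical_smiles,
       act.standard_value,
       act.standard_units
FROM molecule_dictionary md
JOIN compound_structures cs ON cs.molregno = md.molregno
JOIN activities act ON act.molregno = md.molregno
WHERE act.standard_type = 'IC50'
  AND act.standard_units = 'nM'
"""

def provenance_record(release, query):
    """Return the text one would paste into the supporting information."""
    return f"Data extracted from {release} using:\n{query.strip()}"

print(provenance_record(CHEMBL_RELEASE, QUERY))
```

With the release pinned and the SQL published, anyone with the same ChEMBL dump can regenerate exactly the same data set.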
15. Requirements for Reproducibility:
Data
Goodman, L., Lawrence, R. & Ashley, K. Data-set visibility: Cite links to data in reference lists. Nature 492, 356 (2012).
A huge amount of work goes into creating data sets. It is crucial that these data,
big or small, should be more prominently linked to their associated research
articles as standard practice.
To achieve this, data can be cited directly in a publication's reference section using
a permanent identifier such as a digital object identifier (DOI; see, for example,
go.nature.com/vnyidi and go.nature.com/zdfbcl). So far, however, only very few
journals do this.
Publishers, funders, researchers and institutions all need to recognize that data
sets constitute a valuable scholarly resource. Authors should be credited for these
career-making contributions. Enhanced data-set visibility would also benefit
referees and readers by raising standards of data analysis, promoting more
detailed review, encouraging data curation and boosting reproducibility and data
reuse.
16. Requirements for Reproducibility:
Data
§ What about chemical structures?
• a table with drawings of molecules?
• names instead of structures?
§ Why not include the structures in a machine-readable format?
This expanded use of electronic resources offers an excellent opportunity to make chemical
information more accessible and user-friendly to readers of scientific papers.
To take advantage of these opportunities, we have developed several online features that expand
the usefulness of chemical compound information for Nature Chemical Biology readers … In all
original research papers, compounds that are relevant to the background or results of the paper
are assigned a bolded, Arabic numeral that serves as a unique identifier for the compound. Each
numerical abbreviation in the HTML and PDF versions of the article is linked to a Compound Data
page, which shows the structure and the IUPAC or common name of the chemical compound.
From there, readers can download a ChemDraw file of the compound…To provide readers with
rapid access to all of the chemical compounds discussed in an article, we feature a Compound
Data Index page, which is accessible from the Compound Data page, the table of contents entry
for the paper, and the navigation tools on the right side of the Nature Chemical Biology website.
http://www.nature.com/nchembio/journal/v3/n6/full/nchembio0607-297.htm
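In practice, providing structures in a machine-readable format can be as simple as depositing a SMILES file alongside the paper. A dependency-free sketch (the compounds here are illustrative stand-ins, keyed to the bolded identifiers a journal like Nature Chemical Biology uses):

```python
import csv

# Illustrative compounds; in a real paper these would be the actual
# structures, each keyed to its bolded identifier in the text.
compounds = [
    ("1", "CC(=O)Oc1ccccc1C(=O)O"),            # aspirin
    ("2", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"),     # caffeine
]

def write_smiles_file(path, records):
    """Write (identifier, SMILES) pairs as a standard .smi file:
    one 'SMILES<TAB>name' line per compound."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for ident, smiles in records:
            writer.writerow([smiles, ident])

write_smiles_file("compounds.smi", compounds)
```

A file like this costs the authors minutes to produce and saves every reader the error-prone step of redrawing structures from figures.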
18. Requirements for Reproducibility:
Chemical Data
From Nature Chemistry
Huigens, R. W., et al. A ring-distortion strategy to construct stereochemically complex and
structurally diverse compounds from natural products. Nature Chemistry 5:195-202 (2013).
doi:10.1038/nchem.1549
19. It’s not always easy
Data Sets. For this study we arbitrarily chose 18 Merck data sets
shown in Table 1. These include a mix of on-target data sets and
ADME data sets. Some data sets are so large (>100,000) that we
randomly selected a smaller subset of compounds (50,000) to
expedite the study. It is useful to use proprietary data sets for two
reasons:
1. We wanted data sets which are realistically large and have a
realistic level of noise but are not as noisy as high-throughput
data sets.
2. Time-splitting requires dates of testing, and these are almost
impossible to find in public domain data sets.
Chen, B., Sheridan, R. P., Hornak, V. & Voigt, J. H. Comparison of Random
Forest and Pipeline Pilot Naïve Bayes in Prospective QSAR Predictions. J.
Chem. Inf. Model. 52, 792–803 (2012).
21. Requirements for Reproducibility:
Code
Stahl, M. & Bajorath, J. Computational Medicinal Chemistry. J. Med.
Chem. 54, 1-2 (2011).
Computational methods must be described in sufficient
detail for the reader to reproduce the results.
22. Requirements for Reproducibility:
Code
Ince, D. C., Hatton, L. & Graham-Cumming, J. The case for open
computer programs. Nature 482, 485–488 (2012).
We argue that, with some exceptions, anything less
than the release of source programs is intolerable for
results that depend on computation. The vagaries of
hardware, software and natural language will always
ensure that exact reproducibility remains uncertain, but
withholding code increases the chances that efforts to
reproduce results will fail.
23. Requirements for Reproducibility:
Code
Data and materials availability All data necessary to understand, assess,
and extend the conclusions of the manuscript must be available to any
reader of Science. All computer codes involved in the creation or
analysis of data must also be available to any reader of Science.
After publication, all reasonable requests for data and materials must be
fulfilled. Any restrictions on the availability of data, codes, or materials,
including fees and original data obtained from other sources (Materials
Transfer Agreements), must be disclosed to the editors upon submission.
http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml#dataavail
24. Requirements for Reproducibility:
Code
An inherent principle of publication is that others should be able to
replicate and build upon the authors' published claims. Therefore, a
condition of publication in a Nature journal is that authors are required to
make materials, data and associated protocols promptly available to
readers without undue qualifications. Any restrictions on the availability of
materials or information must be disclosed to the editors at the time of
submission. Any restrictions must also be disclosed in the submitted
manuscript, including details of how readers can obtain materials and
information. If materials are to be distributed by a for-profit company, this
must be stated in the paper.
http://www.nature.com/authors/policies/availability.html
In the meantime, researchers must, when they are arranging the
commercialization of their work, bear in mind the implications that these
deals may have on their freedom to publish to the standards that the
community is entitled to expect.
http://www.nature.com/nature/journal/v442/n7098/full/442001a.html
25. Requirements for Reproducibility:
Code
§ “Black box” code sharing: installing the software on a publicly
accessible server, or providing executables for people to test
§ Does this help with reproducibility?
§ Doesn’t demonstrate that the implementation corresponds to the
algorithm description
§ Not cut and dried.
26. The Recomputation Manifesto
From Ian Gent, University of St. Andrews
1. Computational experiments should be recomputable for all time
2. Recomputation of recomputable experiments should be very easy
3. It should be easier to make experiments recomputable than not to
4. Tools and repositories can help recomputation become standard
5. The only way to ensure recomputability is to provide virtual
machines
6. Runtime performance is a secondary issue
http://www.software.ac.uk/blog/2013-07-09-recomputation-manifesto
http://arxiv.org/pdf/1304.3674v1.pdf
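Short of shipping a full virtual machine, a first step toward recomputability is to capture the exact software environment with every run. A stdlib-only sketch (the output file name is an arbitrary choice):

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot():
    """Collect interpreter, OS, and installed-package versions so a
    published result can state exactly what it was computed with."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {d.metadata["Name"]: d.version
                     for d in metadata.distributions()
                     if d.metadata["Name"]},
    }

# Save alongside the results so others can reconstruct the setup.
with open("environment.json", "w") as fh:
    json.dump(environment_snapshot(), fh, indent=2)
```

This doesn't guarantee bit-for-bit recomputation the way a VM image does, but it documents the dependencies a re-run would need.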
29. Requirements for Reproducibility:
Results
§ Including the actual results is even more of a no-brainer, right?
Homology Models of Human All-Trans Retinoic Acid Metabolizing Enzymes
CYP26B1 and CYP26B1 Spliced Variant
Homology models of CYP26B1 (cytochrome P450RAI2) and CYP26B1 spliced variant were
derived using the crystal structure of cyanobacterial CYP120A1 as template for the model building.
The quality of the homology models generated were carefully evaluated, and the natural substrate
all-trans-retinoic acid (atRA), several tetralone-derived retinoic acid metabolizing blocking agents
(RAMBAs), and a well-known potent inhibitor of CYP26B1 (R115866) were docked into the
homology model of full-length cytochrome P450 26B1. The results show that in the model of the
full-length CYP26B1, the protein is capable of distinguishing between the natural substrate (atRA),
R115866, and the tetralone derivatives. The spliced variant of CYP26B1 model displays a reduced
affinity for atRA compared to the full-length enzyme, in accordance with recently described
experimental information.
This paper, presenting two new homology models, does not
include either model.
Unfortunately I didn’t have to search long to find this example
31. How are we doing?
§ Survey of recent publications:
• Everything in JCIM vol 52 #10
• Everything in JCAMD vol 26 #10
• Journal of Cheminformatics from July 2012-Nov 4 2012
§ Big differences between journals
§ Plenty of room for improvement
§ Analysis is presence/absence of full results
Journal    Type of paper  Count  Full Data  Partial Data  Missing Data  Code?
JCIM       Method            13          6             3             4      1
JCIM       Non-method        16         10             3             3      0
JCAMD      Method             3          3             0             0      0
JCAMD      Non-method         4          0             3             1      0
JChemInf   Method            12          7             3             3      8
JChemInf   Non-method         3          0             0             0      0
32. Practical considerations
§ Where to put the data and code?
• Supplementary material
• Code-sharing sites (sourceforge.net, google code, github)
• Data sharing: Zenodo/Labarchives.com
• A hybrid: Figshare
§ Considerations:
• It needs to still be there 5+ years from now
• Having a solid connection to the original paper is good
• Others have to actually be able to do something with it
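One concrete way to address "it needs to still be there 5+ years from now" is to publish a checksum of the deposited archive, so readers can verify that what they download later is byte-for-byte what the authors deposited. A minimal stdlib sketch (the archive name is hypothetical):

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks so
    large supplementary archives don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical archive name; in practice this would be the actual
# supplementary-material file deposited with the paper:
# print(sha256_of_file("supporting_data.tar.gz"))
```

Printing the digest in the supporting information makes silent corruption or substitution of the hosted file detectable.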
34. Some stuff to look at
§ vagrant (virtual box configuration and provisioning):
http://www.vagrantup.com/
§ openshift (cloud-based application deployment):
https://www.openshift.com/
§ wakari (ipython in the cloud): https://wakari.io/
35. Tools for reproducible research
Knime
§ Open-source workflow tool
§ Strong data manipulation and mining capabilities
§ Data and results can be stored with the workflow.
36. Tools for reproducible research
IPython notebook
§ Python session running in a browser
• Tab completion
• Access to docstrings
§ Text formatting options available for including discussion or capturing
mathematics (access to LaTeX for formatting math)
§ Captures all data transformations and displays output
§ Tight integration with matplotlib
39. Here’s a cool picture from my lab.
… and here’s how you can make it too.
40. A great start
(1) Wherever possible, source code should be provided for new computational methods. The
source code can be a reference implementation of a method or algorithm and does not need to include a
graphical interface. If it is not possible to release the source code for a new method, authors should
provide a sufficient justification. Reviewers and editors will then consider this explanation. Any paper that
does not comply with the reproducibility guidelines will include this explanation when published. In cases
where it is not possible to release code due to intellectual property or other limitations, an executable
version of the new method should be readily accessible. Commercial products should provide time limited
licenses to facilitate evaluation and comparison of published methods.
(2) Any chemical structures and data mentioned in the paper should be made available in a
commonly used (SDF or SMILES) format. Distribution of data in pdf format is not sufficient.
(3) Any publications that employ commercial or open-source software should include scripts
or parameter files as well as data files that will enable others to easily reproduce the work.
(4) A clear easy to follow description of any new method should be a key criterion during the
review process. Wherever possible, a paper should contain a simple worked example that
demonstrates the application of the method. Parameter values and intermediate results for example
compounds should be included as part of the supporting material.
(5) Reviewers should put particular emphasis on the reproducibility of the method described
in a manuscript. Each reviewer should evaluate the description of the method, as well as the presence
of associated code, data, or executables, to ensure that the results can be independently reproduced.
Walters, W. P. Modeling, Informatics, and the Quest for Reproducibility. J.
Chem. Inf. Model. (2013). doi:10.1021/ci400197w
42. Pat’s not completely off the hook
Walters, W. P. Modeling, Informatics, and the Quest for Reproducibility. J.
Chem. Inf. Model. (2013). doi:10.1021/ci400197w
43. Pat’s not completely off the hook
Walters, W. P. Modeling, Informatics, and the Quest for Reproducibility. J.
Chem. Inf. Model. (2013). doi:10.1021/ci400197w
No data
No code
No algorithm description
Results only as a figure
45. Perhaps the biggest barrier to reproducible research
is the lack of a deeply ingrained culture that simply
requires reproducibility for all scientific claims.
Peng, R. D. Reproducible Research in Computational Science.
Science 334, 1226–1227 (2011).