Christian Frech presented on using Docker and Anduril to create reproducible bioinformatics pipelines. Anduril is a pipeline framework that aims to make pipelines modular, bundled with their execution environment, and runnable on clusters. It uses a proprietary scripting language but can embed other languages. Frech demonstrated an RNA-seq analysis pipeline built in Anduril that generated QC plots, differential expression results, and network and enrichment analyses. While adoption of Anduril has been limited by its scripting language, Docker can be used to containerize components and make pipelines fully reproducible and portable.
Reproducible bioinformatics pipelines with Docker and Anduril
1. Reproducible Bioinformatics Pipelines with Docker & Anduril
Christian Frech, PhD
Bioinformatician at Children's Cancer Research Institute, Vienna
CeMM Special Seminar
September 25, 2015
2. Why care about reproducible pipelines in bioinformatics?
For your (future) self:
- Quickly re-run an analysis with different parameters/tools
- Best documentation of how results have been produced
For others:
- Allow others to easily reproduce your findings (“reproducibility crisis”)*
- Code re-use between projects and colleagues
*) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998
3. Obstacles to computational reproducibility
- Software/script not available (even upon request)
- Black box: code (or even a virtual machine) is available, but no documentation on how to run it
- Dependency hell: software and documentation are available, but (too) difficult to get running
- Code rot: code breaks over time due to software updates
- 404 Not Found: unstable URLs, e.g. links to lab homepages
Go figure…
4. Computational pipelines to the rescue
In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs.
Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script
Simple automation via (Bash/R/Python/Perl) scripting has its limitations:
- No error checking
- No partial execution
- No parallelization
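The naive scripting approach described above can be sketched as a plain Bash script. The bioinformatics tools are replaced by `echo` stand-ins here (tool invocations and file names are illustrative, not from the talk), which makes the weaknesses visible: if a middle step fails, everything downstream must be re-run by hand, and independent samples cannot run in parallel.

```shell
#!/usr/bin/env bash
# Naive linear pipeline: each step writes a file the next step reads.
# Real tools (bwa, gatk, snpEff) are replaced by echo stand-ins here.
set -euo pipefail             # minimal error handling: abort on first failure

echo "raw reads"  > sample.fastq
echo "alignment"  > sample.bam      # would be: bwa mem ref.fa sample.fastq > sample.bam
echo "variants"   > sample.vcf      # would be: gatk HaplotypeCaller -I sample.bam ...
echo "annotated"  > sample.ann.vcf  # would be: snpEff ann hg19 sample.vcf > sample.ann.vcf
cat sample.ann.vcf
```

Note that `set -e` only aborts the run; there is no partial re-execution (completed steps are redone from scratch on every run) and no parallelization, which is exactly what pipeline frameworks add on top.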
5. No shortage of pipeline frameworks
Script-based: GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
GUI-based: Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
Various commercial solutions for more standardized workflows (e.g. RNA-seq), geared toward biologists without programming skills (“point-and-click”)
See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/
6. Personal wish list for a pipeline framework
- Script-based (maximum flexibility, minimum overhead)
- Powerful scripting language
- Cluster integration (preferably via slurm)
- Modular (allow code re-use b/w projects and colleagues)
- Component library for frequent tasks (e.g. join two CSV files)
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment
7. What’s wrong with good ol’ GNU make?
PRO:
- Available on all Linux platforms
- Stood the test of time (developed in the 1970s)
- Rapid development (Bash scripting + target rules)
- Multi-threading (-j parameter)
CON:
- No cluster support
- Arcane syntax, cryptic pattern rules
- Half-baked multi-output rules
- No type checking (everything is a generic file)
- Difficult to modularize (code re-use)
- Rebuild not triggered by recipe change
- No reporting
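Make's central idea, re-running a step only when its input is newer than its output, is easy to state in plain shell. A minimal sketch (the `step` helper and file names are illustrative):

```shell
#!/usr/bin/env bash
# make-style dependency checking in plain shell: run a command only if its
# output is missing or older than its input ("-nt" means "newer than").
set -euo pipefail

step() {                       # step <input> <output> <command...>
  local in=$1 out=$2; shift 2
  if [ ! -e "$out" ] || [ "$in" -nt "$out" ]; then
    "$@" < "$in" > "$out"      # stale or missing: (re)run the command
  else
    echo "skip: $out is up to date"
  fi
}

echo "chr1 gene_a" > input.txt
step input.txt sorted.txt sort   # first call: runs sort
step input.txt sorted.txt sort   # second call: output is current, skipped
```

Make generalizes this with declarative rules and `-j` parallelism, but the staleness check above is the core. Note that it keys on file timestamps only, which is why a changed recipe does not trigger a rebuild, one of the cons listed above.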
9. Anduril
Developed since 2008 at Biomedicum Systems Biology Laboratory,
Helsinki, Finland
http://research.med.helsinki.fi/gsb/hautaniemi/
Built for scientific data analysis with focus on bioinformatics
Proprietary workflow scripting language “Anduril script”
Possibility to embed native code (Bash/R/Python/Perl)
Version 2 will switch to Scala
Open source & free
Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
No widespread adoption (yet?)
10. Anduril features
Script-based (maximum flexibility, less overhead)
Expressive scripting language
Cluster integration (preferably via slurm)
Modular to allow code re-use (between projects and colleagues)
Ready-made component library for frequent analysis steps
Reporting (HTML, PDF) to share results
Free & open-source
Bundle scripts/data with execution environment
(Checklist graphic: every item ticked except “Bundle scripts/data with execution environment”)
12. Embedding native R code in Anduril script
Convert UCSC to Ensembl chromosome names in a CSV file containing column ‘chrom’:
ensembl = REvaluate(
    table1 = ucsc,
    script = StringInput(content=
        '''
        table.out <- table1
        table.out$chrom <- gsub("^chr", "", table.out$chrom)
        '''
    )
)
Also supports inlining of Bash, Python, Java, and Perl scripts
13. Anduril features
Script-based (maximum flexibility, less overhead)
Expressive scripting language
Cluster integration (preferably via slurm)
Modular to allow code re-use (between projects and colleagues)
Ready-made component library for frequent analysis steps
Reporting (HTML, PDF) to share results
Free & open-source
Bundle scripts/data with execution environment
(Checklist graphic: with Docker, can the last item be ticked as well?)
14. “Lightweight” virtualization technology for Unix-based systems
Processes run in isolated namespaces (“containers”), but share same kernel
Like VMs: containers portable between systems -> reproducibility!
Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization
(Diagram: VM vs. container architecture)
15. How to bundle workflow with execution environment?
Solution 1: a single container holding Anduril, the workflow, and all components (Components 1-3)
Pro: single container, easy to maintain
Con: VM-like approach; huge, monolithic container, difficult to share (against the Docker philosophy)
Solution 2: Anduril and the workflow outside, with each component in its own container (Containers A, B, C)
Pro: completely modularized, easy to re-use/share workflow components
Con: “container hell”?
16. Hybrid solution
Anduril and the workflow run in a master container; project- and user-specific components are installed in the master container, while shared components are installed in common containers (e.g. a container “RNA-seq” holding Components 1-3), launched from within the master container (“Docker inside Docker”)
Pro: workflow completely containerized (= portable); only shared components in common containers
Con: still some overhead for container maintenance (but greatly reduced)
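One way to sketch such a master container is a Dockerfile along these lines (base image, package names, and paths are illustrative assumptions, not the actual setup):

```dockerfile
# Hypothetical master container bundling Anduril and the workflow.
# Base image, packages, and paths are illustrative, not the actual setup.
FROM ubuntu:14.04

# Java runtime for Anduril plus a Docker client, so shared component
# containers can be launched from inside ("Docker inside Docker")
RUN apt-get update && apt-get install -y default-jre docker.io

# Anduril itself plus project- and user-specific components and scripts
COPY anduril/ /opt/anduril/
COPY workflow/ /opt/workflow/

ENTRYPOINT ["/opt/anduril/bin/anduril"]
```

Mounting the host’s Docker socket at run time is one common way to let the master container start the shared component containers.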
18. So, Anduril is great… but
Proprietary scripting language
Biggest hurdle for widespread adoption IMO
Will likely improve with version 2 (which uses Scala)
Documentation opaque for beginners
WANTED: Simple step-by-step guide to build your first Anduril workflow
High upfront investment to get going (because of the above)
In-lining Bash/R/Perl/Python should be simpler
Currently too much clutter when using “BashEvaluate” and the like
Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
Will improve with fluency in the workflow scripting language
35. GSEA heat map summarizing all comparisons
Rows = enriched gene sets
Columns = comparisons
Value = normalized enrichment score (NES)
Red = enriched for up-regulated genes
Blue = enriched for down-regulated genes
* = significant (FDR < 0.05)
** = highly significant (FDR < 0.01)
36. Future developments
Push new Anduril components to public repository
(needs some refactoring, documentation, test cases)
Help on Anduril2 manuscript
Port custom Makefiles to Anduril (ongoing)
Cloud deployment of dockerized workflow
Couple slurm to AWS EC2
Automatic spin-up of Docker-enabled AMIs serving as compute nodes
37. In the (not so) distant future …
$ docker pull cfrech/frech2015_et_al
$ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output
$ evince output/figure1.pdf
38. Further reading
Discussion thread on Docker & Anduril
https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw
39. Acknowledgements
Marko Laakso (Significo)
Sirku Kaarinen (Significo)
Kristian Ovaska (Valuemotive)
Pekka Lehti (Valuemotive)
Ville Rantanen (University of
Helsinki, Hautaniemi lab)
Nuno Andrade (CCRI)
Andreas Heitger (CCRI)