4. Data and processing requirements are diverse:
–Functional Genomics (ChIP-seq, ATAC-seq, RNA-seq)
–Single-cell Genomics
–Clinical Genomics and Precision Medicine
–Proteomics
–Machine Learning and Image Analysis
Research groups have different scientific questions and data types
We are not big enough to run every analysis for the university, and researchers want
to learn to run their own analyses
Bioinformatics @ UTSW
5. Goals of the BICF
Initial data analysis for common assays
Standardized quality metrics and output as a starting point for further
analysis
Provide easily reproducible workflows that the community can cite
Easy-to-use interface for a diverse population (postdocs, PhD students,
clinicians)
Deployable pipelines for community use
6. Today we’re going to cover:
Principles of Pipeline Development
Astrocyte: Web Interface for Nextflow Pipelines
Migrating to Cloud: Azure
Outline
7. Principles of Pipeline Development
1. Steps have defined input and output
2. Defined computational resource allocation
3. Parallelization across samples
4. Serial steps executed only if input criteria are met
5. Workflow reproducibility
6. Common steps shared between different analyses
7. Visualization of workflow output to aid researchers in understanding complex data
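Principles 1 to 4 can be sketched in a few lines of Python. This is an illustration only, not the BICF Nextflow code: the step names (trim_reads, align_reads), the Sample type, the resource numbers, and the MIN_READS threshold are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Sample:
    name: str
    read_count: int


@dataclass(frozen=True)
class Resources:
    cpus: int
    memory_gb: int


# Principle 2: every step declares its resources up front (hypothetical values).
STEP_RESOURCES = {
    "trim_reads": Resources(cpus=2, memory_gb=4),
    "align_reads": Resources(cpus=8, memory_gb=32),
}

MIN_READS = 1000  # hypothetical input criterion for the alignment step


def trim_reads(sample: Sample) -> Sample:
    """Principle 1: defined input (raw sample) and output (trimmed sample).

    Trimming is simulated by dropping 10% of the reads.
    """
    return Sample(sample.name, sample.read_count * 9 // 10)


def align_reads(sample: Sample) -> str:
    """Defined input (trimmed sample) and output (name of an alignment file)."""
    return f"{sample.name}.bam"


def run_sample(sample: Sample) -> Optional[str]:
    """Principle 4: the serial step runs only if its input criteria are met."""
    trimmed = trim_reads(sample)
    if trimmed.read_count < MIN_READS:
        return None  # too few reads, so alignment is skipped
    return align_reads(trimmed)


def run_pipeline(samples: List[Sample]) -> Dict[str, Optional[str]]:
    """Principle 3: samples are independent, so they are processed in parallel."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        outputs = list(pool.map(run_sample, samples))
    return {sample.name: output for sample, output in zip(samples, outputs)}
```

Calling run_pipeline([Sample("a", 5000), Sample("b", 500)]) aligns sample a and skips sample b, whose trimmed read count falls below the threshold.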
19. Workflow reproducibility: CI
• Unit Tests
• Process specific logic
• Integration Tests
• End-to-end run with multiple real datasets
• Validate each process output
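A unit test for process-specific logic might look like the sketch below. It assumes a hypothetical helper, find_duplicate_ids, standing in for the kind of check a design-file process could perform; the slides do not show the actual tests or the test framework used.

```python
import unittest


def find_duplicate_ids(sample_ids):
    """Hypothetical process logic: return sample IDs that occur more than once."""
    seen, duplicates = set(), set()
    for sample_id in sample_ids:
        if sample_id in seen:
            duplicates.add(sample_id)
        seen.add(sample_id)
    return sorted(duplicates)


class TestFindDuplicateIds(unittest.TestCase):
    def test_unique_ids_pass(self):
        # A clean design file yields no duplicates.
        self.assertEqual(find_duplicate_ids(["s1", "s2"]), [])

    def test_duplicates_reported(self):
        # A repeated ID is reported exactly once.
        self.assertEqual(find_duplicate_ids(["s1", "s2", "s1"]), ["s1"])
```

Saved as a module, the tests can be run with python -m unittest. An integration test would instead launch the whole workflow on real datasets and check each process output.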
24. Workflow reproducibility: Versioning
Example: 2.1.3 (MAJOR.MINOR.PATCH)
MAJOR: Increase when you make changes that break the parameter structure of previous versions
MINOR: Increase when you add functionality
PATCH: Increase when you make small bug fixes
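The bumping rules can be expressed as a small function. This is a sketch under one assumption not stated on the slide: lower-order fields reset to zero when a higher-order field increases, as in the Semantic Versioning convention. The name bump_version is hypothetical.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":  # breaks the parameter structure of previous versions
        return f"{major + 1}.0.0"
    if change == "minor":  # adds functionality
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # small bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For example, a bug fix to version 2.1.3 gives 2.1.4, new functionality gives 2.2.0, and a change that breaks the parameter structure gives 3.0.0.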
28. Common steps shared between different analyses
• Each process is a self-contained script in Bash, R, Perl,
or Python
• Each process has its own Docker or Singularity image
• Plan to use Nextflow Modules to share steps (e.g.
ChIP-seq and ATAC-seq)
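The idea of sharing one step between two analyses can be sketched in Python. This is not Nextflow Modules syntax; the function names, the read representation (strings such as "chr1:100"), and the mitochondrial filter are hypothetical and only illustrate writing a common step once and reusing it.

```python
def remove_duplicates(reads):
    """Shared step: drop duplicate reads while preserving their order."""
    return list(dict.fromkeys(reads))


def chipseq_pipeline(reads):
    """ChIP-seq sketch: reuses the shared deduplication step unchanged."""
    return remove_duplicates(reads)


def atacseq_pipeline(reads):
    """ATAC-seq sketch: same shared step, plus an assay-specific filter."""
    deduplicated = remove_duplicates(reads)
    # Hypothetical assay-specific step: discard mitochondrial reads.
    return [read for read in deduplicated if not read.startswith("chrM")]
```

A fix to remove_duplicates then reaches both pipelines at once, which is the benefit expected from sharing steps as modules.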
34. Astrocyte: Web Interface for Nextflow Pipelines
Allows groups to give easy access to their analysis pipelines via the web
Standardized Workflows
Simple Web Forms
Online documentation &
results visualization*
Django web framework
Workflows run on the HPC cluster without the developer or user needing cluster knowledge
42. Azure: CycleCloud
Use the Azure CLI to set up the programs that you want installed by default on nodes
–We chose Nextflow and Singularity
–Works remarkably similarly to building Docker/Singularity images, but in Bash
All steps are run on the VM that you have set up, not on the master node of the HPC cluster
Write the installation instructions in a file
Can only be tested once the machine is started
Best practice: one image per program, as with Docker/Singularity
https://docs.microsoft.com/en-us/azure/cyclecloud/tutorials/modify-cluster-template
https://docs.microsoft.com/en-us/azure/cyclecloud/tutorials/deploy-custom-application
44.
From the Azure CLI:
Enter the required information
Set up an SSH key pair
Deploy CycleCloud following the instructions:
–https://docs.microsoft.com/en-us/azure/cyclecloud/quickstart-install-cyclecloud
45. Azure: Setup Cluster
We used SLURM
–Note: SLURM has unique setup requirements not present in some of the other options
–Counts CPUs as machines
–100 machines means
Set up any cost alerts
Add the programs that you want to have available
–These will be installed on node creation each time
46. Azure: Future Work
Set up images for compute nodes, rather than installing applications at node creation
–Speed up the process of node launches
–Consistency in the node programs
Further refinement of HTC/HPC nodes
–Customizations for each pipeline
Streamline data transfer from current system to CycleCloud