
Cloud Native Analysis Platform for NGS analysis


Cloud Native Analysis Platform optimized for user-friendly transfer of large data sets from Dropbox to cloud infrastructure for data processing and analysis. It is particularly tailored for easy Next Generation Sequencing (NGS) FASTQ file transfer for rapid exome, RNA-Seq, small RNA-Seq, and amplicon analysis.

Published in: Technology

  1. CCCB RNA-Seq DGE Analysis Made Easy
     Center for Cancer Computational Biology (SM822)
     Bioinformatics Team
     Homepage: Twitter: @CCCBseq
  2. So why are we here...
     You have RNA-Seq data generated, but...
     ○ uploading to the Galaxy public server for analysis takes forever
     ○ my bioinformaticians cannot process it today
     ○ sequence alignment is taking forever
     ○ I want to make additional differential expression contrasts
     ○ formatting DGE results for GSEA analysis somehow doesn’t work
     ○ I am the bioinformatician and don’t have the time to process all this data (for others and for free)
     ○ bioinformatics core services can be expensive and take time
     ○ The Cancer Genomics Cloud, while powerful, requires a good understanding of Amazon or Google cloud systems to manage projects and pay for computing costs
  3. CCCB Cloud System can help
     Fast
     ○ Scalable infrastructure with virtually no computing resource limitation
     ○ Minimal queue time to get data analyzed
     Secure
     ○ Google Cloud Platform (GCP) is covered by the Google-DFCI BAA to ensure HIPAA-compliant security
     Convenient
     ○ Simplified large data upload and download through parallelized, direct cloud-to-cloud transfer between Dropbox and GCP, reducing data transfer time from hours to minutes
     ○ As with other cloud platforms, users are set up to pay for overhead and computing time, but without a steep learning curve to request a project or manage payment
  4. RNA-Seq DGE analysis should not be difficult
     Most RNA-Seq data can be aligned and quantified using the same settings for initial DGE analysis.
     Once the data transfer problem is solved, the technical bottleneck is often gathering enough computing power and setting up a proper analysis environment.
     Workflow: FASTQ files ---> Align ---> Quantify ---> DGE ---> Clustering ---> Func. Enrichment
  5. Please use your Gmail account to log in and upload the FASTQ data files
  6. CCCB Cloud System - authentication
     1. Use an Incognito/Private browser session
     2. Sign in with the provided Google account
        - address
        - DFCI G Suite email
  7. CCCB Cloud System - analysis setup
     3. Click on ‘Upload files’ on the analysis homepage
        - All analysis projects associated with your email
        - Projects created on your behalf by CCCB
        - Status messages; click on next steps
  8. CCCB Cloud System - analysis setup
     4. Choose your reference genome
     5. Edit the project name to something meaningful
  9. CCCB Cloud System - file uploads
     6. Upload files
        a. Dropbox (preferred method)
           - Log in again to Dropbox
           - Select files and upload
        b. From local computer
           - File chooser
           - Drag/drop interface
           - Slow transfer through HTTPS
     File naming instructions
     - Email notification when transfer is complete
  10. CCCB Cloud System - file uploads
      7. After receiving the email (if using Dropbox), refresh the page. Uploaded files will be visible.
  11. CCCB Cloud System - Assign Sample Name
      7. Set sample names
         - Sample names are inferred from sequencing file names
         - You can create new samples or remove existing ones
         - Drag/drop files to the proper sample
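The slide does not say how the platform infers sample names, so the following is only a minimal sketch of one common approach. It assumes a hypothetical naming convention of `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`, optionally with Illumina-style `_S#`, `_L###` and `_###` parts; the pattern and the function name `group_by_sample` are illustrative, not the CCCB implementation.

```python
import re
from collections import defaultdict

# Hypothetical file-name convention (an assumption, not taken from the slides):
#   <sample>[_S<n>][_L<lane>]_R<1|2>[_<chunk>].fastq[.gz]
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)(?:_S\d+)?(?:_L\d{3})?_R(?P<read>[12])(?:_\d{3})?"
    r"\.(?:fastq|fq)(?:\.gz)?$"
)


def group_by_sample(filenames):
    """Group FASTQ file names by inferred sample name.

    Returns (samples, unmatched): a dict mapping each sample name to its
    sorted files, and a list of names that did not fit the pattern and
    would need to be assigned by hand.
    """
    samples = defaultdict(list)
    unmatched = []
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:
            samples[match.group("sample")].append(name)
        else:
            unmatched.append(name)
    return {s: sorted(files) for s, files in samples.items()}, unmatched


if __name__ == "__main__":
    files = [
        "ctrl_1_R1.fastq.gz",
        "ctrl_1_R2.fastq.gz",
        "KO_A_S3_L001_R2_001.fastq.gz",
        "notes.txt",
    ]
    grouped, leftover = group_by_sample(files)
    print(grouped)
    print(leftover)
```

Files that do not match are returned separately rather than guessed at, which mirrors the manual drag/drop correction step described on the slide.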
  12. CCCB Cloud System - Align and Quantify
  13. RNA-Seq DGE Analysis Under the Hood
      - Parallelized:
        - Alignment (STAR aligner) ---> BAM files
        - Sort, primary-alignment filtering, duplicate evaluation (Samtools, Picard)
        - Quantification (featureCounts)
      - Merging:
        - Overall “raw” (not normalized) count matrix
        - Differential expression testing with DESeq2
        - Plots/figures
      Diagram labels: Master, Sample 1, Sample 2, Sample N
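The merging step can be illustrated with a small sketch. It assumes the per-sample quantification results have already been parsed into plain dictionaries of gene to read count; the function name `merge_counts` and the input layout are hypothetical and are not taken from the CCCB pipeline.

```python
def merge_counts(per_sample):
    """Merge per-sample gene counts into one raw (not normalized) matrix.

    per_sample: dict mapping sample name -> dict mapping gene -> raw count.
    Returns (header, rows). Genes missing from a sample are given a count
    of 0, which is an assumption made for this sketch.
    """
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    header = ["gene"] + samples
    rows = [
        [gene] + [per_sample[sample].get(gene, 0) for sample in samples]
        for gene in genes
    ]
    return header, rows


if __name__ == "__main__":
    counts = {
        "S2": {"geneA": 5, "geneB": 0},
        "S1": {"geneA": 3, "geneC": 7},
    }
    header, rows = merge_counts(counts)
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
```

Keeping the matrix raw at this stage matters because DESeq2 expects unnormalized integer counts as input and performs its own normalization.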
  14. Alignment is a Computationally Intensive Process
      Running on local computing
      ● Requires knowledge of Unix and high-performance computing
      ● Requires powerful computing infrastructure (e.g. a 64-bit machine with 30+ GB RAM)
      ● Requires the ability to write scripts and programs
      ● Requires an understanding of how to run alignment programs
      Running on public web servers
      ● Wait time for most public web servers, such as Galaxy and Genboree, increases with the number of users
      ● Most of them use the HTTPS protocol and allow only one FASTQ file upload at a time
      ● The Cancer Genomics Cloud requires a good understanding of Amazon or Google cloud systems to set up projects and payment
  15. Typical RNA-Seq DGE Experimental Design
      It is difficult to estimate the minimum number of biological replicates required, but a typical rule of thumb is:
      ● 3+ for cell lines
      ● 5+ for inbred lines of model organisms
      ● As many samples as possible for human studies
      A single RNA-Seq experiment usually has 6 to 20+ samples. On a public web server, the wait time for upload, run time, and download increases linearly with the number of samples, with the added risk of a broken connection.
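A back-of-the-envelope calculation shows why one-file-at-a-time transfer scales linearly while parallel transfer does not. All numbers below (file size, bandwidth, worker count) are made-up inputs, and the parallel model assumes each stream keeps its full bandwidth, which real networks may not deliver.

```python
import math


def serial_hours(n_files, gb_per_file, mbps):
    """Hours to move n_files one after another over a single stream."""
    seconds = n_files * gb_per_file * 8000 / mbps  # 1 GB = 8000 megabits
    return seconds / 3600


def parallel_hours(n_files, gb_per_file, mbps_per_stream, workers):
    """Hours when up to `workers` files move at once.

    Assumes every stream sustains mbps_per_stream independently
    (an idealization used only for this sketch).
    """
    waves = math.ceil(n_files / workers)
    seconds = waves * gb_per_file * 8000 / mbps_per_stream
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical experiment: 20 files of 2 GB each over 100 Mbit/s streams.
    print(round(serial_hours(20, 2, 100), 2))
    print(round(parallel_hours(20, 2, 100, workers=20), 2))
```

Doubling the number of files doubles the serial estimate, whereas the parallel estimate only grows when the file count exceeds the number of workers.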
  16. CCCB Cloud Infrastructure
      Diagram labels: Users, CCCB Bioinformatics, CCCB Sequencing
  17. Data Upload Application
      “Download 50 fastq files!”
      Pulls raw data from Dropbox and pushes it into Google Storage buckets
  18. Scaling Application
      “Align N samples”
      Independent nodes/images
      - Each node needs a large amount of data (e.g. index files for aligners)
      - Pre-built images minimize data transfer
      - Communication about status
      Pulls raw data from, and pushes processed data to, Google Storage buckets
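The "Align N samples" pattern, with independent workers that each report a status, can be sketched locally with a thread pool. This is only an analogy for the cloud nodes described on the slide: `align_and_quantify` is a hypothetical placeholder that does no real alignment, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def align_and_quantify(sample):
    """Placeholder for the per-sample work done on one independent node.

    A real worker would pull raw data from storage, run the aligner and
    quantifier, and push results back; this stub only returns a status.
    """
    return {"sample": sample, "status": "done"}


def run_all(samples, max_workers=4):
    """Process every sample independently and collect the status records.

    pool.map returns results in the same order as the input samples.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(align_and_quantify, samples))


if __name__ == "__main__":
    for record in run_all(["Sample 1", "Sample 2", "Sample 3"]):
        print(record["sample"], record["status"])
```

Because the samples share no state, adding workers shortens the total run time without changing the per-sample code.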
  19. Task management for data download
      “Transfer these 50 FASTQ files (>2 GB each) to my Partners Dropbox!”
      Application
  20. Fast download for output files using Dropbox
      Save output by direct download or Dropbox transfer:
      - Authenticated: only those logged in as your Google user can access files
      - Direct transfer to Dropbox storage for fast data transfer and backup
      - Email notification after transfer is complete
      - A master directory called “cccb_transfers/” will be created in Dropbox and organized by project
  21. Straightforward differential analysis
      - Available processed samples
      - Human-readable contrast name
      - Thresholds used for creating heatmaps and volcano plots
      - Drag/drop samples into contrast groups
      - Can rename groups
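The thresholds mentioned above typically act as a filter on the differential expression table. The sketch below assumes rows carrying the DESeq2 result columns `log2FoldChange` and `padj`; the cutoff values and the function name `significant_genes` are illustrative assumptions, since the slides do not state the platform's defaults.

```python
def significant_genes(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return gene names passing both thresholds.

    results: iterable of dicts with keys "gene", "log2FoldChange", "padj".
    Rows whose padj is None (reported as NA by DESeq2) are skipped.
    The default cutoffs are hypothetical, not the platform's settings.
    """
    hits = []
    for row in results:
        padj = row["padj"]
        if padj is None:
            continue
        if abs(row["log2FoldChange"]) >= lfc_cutoff and padj <= padj_cutoff:
            hits.append(row["gene"])
    return hits


if __name__ == "__main__":
    table = [
        {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
        {"gene": "geneB", "log2FoldChange": -1.5, "padj": 0.04},
        {"gene": "geneC", "log2FoldChange": 0.2, "padj": 0.0001},
        {"gene": "geneD", "log2FoldChange": 3.0, "padj": None},
    ]
    print(significant_genes(table))
```

The same two cutoffs would draw the vertical and horizontal guide lines on a volcano plot and select the genes shown in a heatmap.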
  22. Standard RNA-Seq DGE Output
      - Custom report
      - Basic figures
      - Output files: raw counts, normalized counts, differential expression results
      - Files for GSEA analysis
  23. Gene Set Enrichment Analysis
      Broad Institute GSEA
      The CCCB Cloud Platform DGE analysis results include support files, the normalized count matrix file and groups.cls, that can be imported directly into Broad Gene Set Enrichment Analysis (GSEA) using MSigDB.
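For readers unfamiliar with the `.cls` format, here is a minimal sketch of how a categorical class file in the layout documented by the Broad GSEA data formats guide can be produced. The slides do not show the contents of the platform's `groups.cls`, so the group labels and the function name `make_cls` are assumptions.

```python
def make_cls(labels):
    """Build the text of a categorical GSEA .cls file.

    labels: one class label per sample, in the same order as the sample
    columns of the expression matrix. Labels must not contain spaces.
    Line 1: <number of samples> <number of classes> 1
    Line 2: # followed by the class names
    Line 3: the class label of each sample
    """
    classes = list(dict.fromkeys(labels))  # unique labels, first-seen order
    lines = [
        f"{len(labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(labels),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-group contrast with two samples per group.
    print(make_cls(["ctrl", "ctrl", "ko", "ko"]), end="")
```

The label order on the third line has to match the column order of the count matrix, otherwise GSEA assigns samples to the wrong phenotype.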
  24. RNA-Seq Data Visualization
      Multi-experiment viewer (WebMEV): directly import the raw count matrix from the CCCB Cloud Platform to do more advanced analyses, including:
      - Clustering (hierarchical, k-means, PCA, etc.)
      - GO enrichment and pathway enrichment analyses
  25. Backup Slides
  26. For more information on Pipeline Services
  27. Pricing Structure for RNA-Seq DGE
      - DFCI/BWH: $18 per sample
      - External Academia: $24 per sample
      - Industry: Inquire
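As a quick worked example of the per-sample pricing, the sketch below multiplies the listed rates by a sample count. Only the two published rates are included; the function name `project_cost` and the 12-sample experiment are hypothetical, and industry pricing is left out because the slide gives no figure.

```python
# Per-sample prices in US dollars, as listed on the slide.
PRICE_PER_SAMPLE = {
    "DFCI/BWH": 18,
    "External Academia": 24,
}


def project_cost(n_samples, affiliation):
    """Return the total analysis cost in dollars for one project.

    Raises KeyError for affiliations without a listed price
    (for example industry, which is quoted on inquiry).
    """
    return n_samples * PRICE_PER_SAMPLE[affiliation]


if __name__ == "__main__":
    # Hypothetical 12-sample experiment.
    print(project_cost(12, "DFCI/BWH"))
    print(project_cost(12, "External Academia"))
```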
  28. CCCB Cloud Platform Road Map
      - GATK v3 (Live) / v4 (May): germline mutation calling for DNA-Seq
      - Mutect2 (April): somatic mutation calling for tumor/normal paired DNA-Seq
      - Small non-coding RNA (April): mapping and quantification of small non-coding RNA classes (miRNA, piRNA, tRNA, snoRNA)
      - Transcript Isoform (May): novel transcript isoform identification and quantification
  29. Important accounts and where to get them
      DFCI G Suite Account (or just Google Account)
      Google accounts linked with organization emails are preferred, even though any Google account can be used. Members of the DFCI community should request a DFCI Google account through the Research Computing website:
      Partners Dropbox
      Any Dropbox account will work with our systems. Partners Health provides virtually unlimited encrypted storage on Dropbox Business for all Partners community members (anyone with email) for free. Information is available here:
      Agilent CrossLab (a.k.a. iLab Solutions)
      Like most cores and centers around DFCI, we use iLab to track all of our projects. A free account can be requested at
  30. Request Project through iLab For more info:
  31. Request iLab Account and Project For more info: CCCB
  32. Request iLab Account and Project For more info: Analysis Pipeline
  33. Moving Beyond Excel: Data Wrangling with R
      This introductory course is designed for investigators looking to improve their data analysis skills and move beyond Excel. Participants will be introduced to the R language and its basic capabilities for data processing, motivated by practical examples with high-throughput sequencing data such as differential expression or variant analyses. No prior experience with R (or programming in general) is necessary.
      Topics include:
      ● Introduction to R and the command line
      ● The power and ease of programming for consistent, reproducible research
      ● Reading and writing formatted datasets
      ● Filtering
      ● Data “cleaning”
      ● Data merging
      ● (If time permits) Basic plotting