CCCB Web Services on Cloud Platform
Center for Cancer Computational Biology (SM822)
Bioinformatics Team
Homepage: https://cccb.dfci.harvard.edu/
Twitter: @CCCBseq
Introduction to CCCB
One of the Institutional Strategic Core Centers set up by DFCI to provide broadly accessible genomic technology and computational analysis capability to accelerate genomic research.
Aim: enable a large number of bench scientists to ask their own specific questions in different domains by providing:
● NGS Sequencing Services (NextSeq, MiSeq) and consultation
● Customized bioinformatics analysis research
● Education workshops in genomic technology and data analysis
● Genomic data analysis Infrastructure
Problems with NGS data for experimentalists
Have sequencing data generated but...
○ don’t know where to securely store it long term
○ uploading to GenePattern or Galaxy for analysis is taking forever
○ my bioinformaticians cannot process it today
○ want to make additional differential expression contrasts
○ my exome data is taking forever to run
○ don’t know how to work with variant data
○ my thousand exomes are crushing my bioinformaticians’ HPC server
○ … etc.
Easily accessible cloud computing resources will help.
Advantages of Using Cloud Systems
Secure
○ Google Cloud Platform (GCP) is covered by the Google-DFCI BAA (Business Associate Agreement) to ensure HIPAA compliance
○ All data can be encrypted with SSL/TLS during transfer
Fast
○ Scalable infrastructure with virtually no computing resource limitation
○ Minimal wait time to get data analyzed
Convenient
○ Ability to connect directly to Dropbox
○ Simplifies large data upload and download
○ Partners’ Dropbox Business can be used as a personal storage solution for secure, long-term data archiving (20 petabytes total)
Important resources and where to get them
DFCI GCP and G Suite Account
Google accounts linked to organization emails are preferred, although any Google account can be used. Members of the DFCI community should request a DFCI Google account (user@mail.dfci.harvard.edu) through the Research Computing website:
http://rc.dfci.harvard.edu/contact-research-computing
Partners Dropbox
Any Dropbox account will work with our systems. Partners Health provides virtually unlimited encrypted storage on Dropbox Business, free of charge, for all Partners community members (anyone with a partners.org email). Information is available here:
https://rc.partners.org/kb/article/2750
Agilent CrossLab (a.k.a iLab Solutions)
Like most cores and centers at DFCI, we use iLab to track all of our projects.
A free account can be requested at https://dfci.ilab.agilent.com/account/login
DFCI Virtual Private Cloud and Partners Dropbox
[Diagram components: Users; CCCB Bioinformatics; CCCB Sequencing]
CCCB Cloud Infrastructure for NGS Analysis
[Diagram components: Users (Local Drive; Dropbox, unlimited space via Partners); CCCB via DFCI Private Cloud (Analysis Portal, Delivery Portal); CCCB Sequencing. Connections labeled: Upload, Download, Direct data transfer]
CCCB Data Analysis and Visualization Infrastructure
[Diagram components: Users (Local Drive; Dropbox, unlimited space via Partners); CCCB via DFCI GCP (Analysis Portal, GATK Analysis, RNASeq Analysis, Variant Viewer, WebMeV). Connections labeled: Upload, Download, Web Access, Direct data transfer. One element is marked "Under construction"]
DFCI G Suite Account
Request a DFCI G Suite account: http://rc.dfci.harvard.edu/contact-research-computing
Dropbox Business at Partners Healthcare
https://rc.partners.org/kb/article/2750
To use Partners Dropbox Business with virtually unlimited storage, simply change your work Dropbox email to your Partners email.
We recommend accessing Dropbox in a completely new browser session, in Incognito or Private mode, to avoid authentication conflicts.
If you have issues connecting to Partners Dropbox, please contact Partners IT at:
https://rc.partners.org/kb/article/2750
Request iLab Account and Project
For more info: http://cccb.dfci.harvard.edu/project-request
[Screenshots: iLab request pages for CCCB and the Analysis Pipeline service. Request form fields: Analysis Pipeline; Number of Samples; Google Account; Additional Comments on the project]
RNA-Seq Analysis Platform Demo
RNA-Seq: What’s happening?
- Parallelized:
  - Alignment (STAR aligner) ---> BAM files
  - Sort, primary-alignment filtering, duplicate evaluation (Samtools, Picard)
  - Quantification (featureCounts)
- Merging:
  - Overall “raw” (not normalized) count matrix
  - Differential expression testing with DESeq2
  - Plots/figures
[Diagram: Master, with Sample 1, Sample 2, ... Sample N]
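The parallel-then-merge layout above can be illustrated with a small sketch. This is not the CCCB implementation: `quantify_sample` is a hypothetical stand-in for the per-sample stage (alignment, filtering, counting), the gene and sample names are made up, and a thread pool stands in for whatever cloud workers the platform actually uses. It only shows how per-sample counts might be gathered into one raw count matrix.

```python
from concurrent.futures import ThreadPoolExecutor


def quantify_sample(sample):
    """Hypothetical stand-in for the per-sample stage (align, filter, count).

    Here it simply passes through counts that are already known.
    """
    name, gene_counts = sample
    return name, dict(gene_counts)


def merge_counts(per_sample):
    """Merge {sample: {gene: count}} into a genes-by-samples raw count matrix."""
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    # Genes missing from a sample are filled with a count of zero.
    rows = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows


def run_pipeline(inputs, workers=4):
    """Process every sample independently, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(quantify_sample, inputs.items()))
    return merge_counts(results)


if __name__ == "__main__":
    demo = {
        "S1": {"GENE_A": 10, "GENE_B": 0},
        "S2": {"GENE_A": 3, "GENE_C": 7},
    }
    genes, samples, rows = run_pipeline(demo)
    print("gene", *samples, sep="\t")
    for gene, row in zip(genes, rows):
        print(gene, *row, sep="\t")
```

Because each sample is handled independently, adding samples adds workers rather than wall-clock time, and only the merge step has to see all samples at once.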
CCCB Cloud System- Authentication
Sign-in with Google
- Google handles authentication/credentials to establish your identity with our platform
- Any Gmail address works
- Google DFCI mail (first_last@mail.dfci.harvard.edu) preferred
CCCB Cloud System- analysis setup
Analysis homepage
- All analysis projects associated with your email
- Projects created on your behalf by CCCB
- Status messages, next steps
CCCB Cloud System- analysis setup
- After selecting a project, choose your reference genome
- Edit the project name to something meaningful
CCCB Cloud System- file uploads
- Upload methods:
- Dropbox*
- From local computer
- File chooser
- Drag/drop interface
* Preferred: fastest and most reliable.
- Currently supports upload of FASTQ-format and BAM files (see the file naming instructions).
- Email notification when the transfer is complete.
CCCB Cloud System- file uploads
After receiving the email (if using Dropbox), refresh the page.
Uploaded files are then visible.
CCCB Cloud System- sample annotation
Sample names are inferred from sequencing file names. You can create new samples or remove existing ones.
- Drag/drop files to the proper sample
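As an illustration of inferring sample names from file names, here is a minimal sketch. The naming convention is an assumption: the slides refer to file naming instructions without reproducing them, so the Illumina-style pattern below (`<sample>_S1_L001_R1_001.fastq.gz`, with the `_S`, `_L` and `_001` parts optional) and the function name `infer_samples` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) convention: <sample>[_S<n>][_L<lane>]_R<1|2>[_001].fastq[.gz]
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)(?:_S\d+)?(?:_L\d{3})?_R(?P<read>[12])(?:_001)?"
    r"\.(?:fastq|fq)(?:\.gz)?$"
)


def infer_samples(filenames):
    """Group FASTQ file names by inferred sample name.

    Returns (samples, unmatched): a dict of sample name to file list, and
    the files that did not fit the assumed pattern and need manual assignment.
    """
    samples = defaultdict(list)
    unmatched = []
    for name in sorted(filenames):
        match = FASTQ_PATTERN.match(name)
        if match:
            samples[match.group("sample")].append(name)
        else:
            unmatched.append(name)
    return dict(samples), unmatched


if __name__ == "__main__":
    files = [
        "tumor1_S1_L001_R1_001.fastq.gz",
        "tumor1_S1_L001_R2_001.fastq.gz",
        "normal_R1.fastq",
        "notes.txt",
    ]
    samples, unmatched = infer_samples(files)
    print(samples)
    print(unmatched)
```

Files that fail the pattern are reported rather than guessed at, which mirrors the manual drag/drop correction step in the interface.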
CCCB Cloud System- summary
Start analysis
CCCB Cloud System- RNA-Seq analysis
If the analysis type is RNA-Seq, alignment and quantification proceed. You receive an email upon completion.
- Scalable: work is done in parallel, so run time is approximately the same for 5 or 5000 samples.
[Screenshot labels: Status message; Next steps; Download output files from alignment (BAM)]
Differential expression contrasts
- Processed samples available
- Human-readable contrast name
- Thresholds used for creating heatmaps and volcano plots
- Drag/drop samples into contrast groups
- Can rename groups
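To show what thresholds for heatmaps and volcano plots typically do, the sketch below labels genes from a differential expression table. The column names `log2FoldChange` and `padj` follow the DESeq2 results table; the default cutoffs (adjusted p-value 0.05, absolute log2 fold change 1.0) and the function name are assumptions, since the slides do not state the platform's defaults.

```python
def classify_genes(results, padj_max=0.05, lfc_min=1.0):
    """Label each gene as 'up', 'down', or 'not significant'.

    results: iterable of dicts with keys 'gene', 'log2FoldChange', 'padj'.
    padj may be None (DESeq2 reports NA for some genes); such genes are
    treated as not significant.
    """
    labels = {}
    for row in results:
        padj = row["padj"]
        lfc = row["log2FoldChange"]
        if padj is None or padj > padj_max or abs(lfc) < lfc_min:
            labels[row["gene"]] = "not significant"
        elif lfc > 0:
            labels[row["gene"]] = "up"
        else:
            labels[row["gene"]] = "down"
    return labels


if __name__ == "__main__":
    table = [
        {"gene": "A", "log2FoldChange": 2.5, "padj": 0.001},
        {"gene": "B", "log2FoldChange": -1.8, "padj": 0.01},
        {"gene": "C", "log2FoldChange": 0.3, "padj": 0.0001},
        {"gene": "D", "log2FoldChange": 3.0, "padj": 0.2},
    ]
    for gene, label in classify_genes(table).items():
        print(gene, label)
```

Genes labeled "up" or "down" are the ones that would be highlighted on a volcano plot or kept for a heatmap under the chosen cutoffs.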
Downloading output files
Save output by direct download or Dropbox transfer
- Authenticated: only those logged in as your Google user can access files
RNA-Seq analysis-- output
- Custom report
- Basic figures
- Output files: raw counts, normalized counts, differential expression results
More advanced analysis
Broad Institute GSEA (http://software.broadinstitute.org/gsea/)
The CCCB Cloud Platform DGE analysis results include support files, the normalized count matrix file and groups.cls, that can be imported directly into Broad Gene Set Enrichment Analysis (GSEA) and used with MSigDB gene sets.
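For readers unfamiliar with the groups.cls file, the sketch below writes a categorical class file in the GSEA CLS layout: a first line with the number of samples, the number of classes and the constant 1; a second line starting with `#` that names the classes; and a third line with one label per sample, in the same order as the expression matrix columns. How the platform itself generates groups.cls is not described in the slides, so `make_cls` and the labels are illustrative only.

```python
def make_cls(labels):
    """Build the text of a categorical GSEA .cls file.

    labels: one class label per sample, in expression-matrix column order.
    Labels must not contain whitespace, because the format is space-delimited.
    """
    if not labels:
        raise ValueError("at least one sample label is required")
    for label in labels:
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"invalid class label: {label!r}")
    # Classes are listed in order of first appearance in the sample labels.
    classes = list(dict.fromkeys(labels))
    header = f"{len(labels)} {len(classes)} 1"
    return "\n".join([header, "# " + " ".join(classes), " ".join(labels)]) + "\n"


if __name__ == "__main__":
    print(make_cls(["treated", "treated", "control", "control"]), end="")
```

The sample order in the class file has to match the column order of the normalized count matrix, otherwise GSEA assigns samples to the wrong phenotype.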
More advanced analysis
Multi-experiment viewer (WebMEV): http://mev.tm4.org
Import the raw count matrix from the CCCB Cloud Platform directly to run more advanced analyses, including:
- Clustering and dimensionality reduction (hierarchical, k-means, PCA, etc.)
- GO enrichment, pathway enrichment analyses
Variant Analysis Pipeline
Upload process: same as for the RNA-Seq pipeline above.
Variant Analysis Pipeline
[Diagram: Align reads ---> Base recalibration ---> HaplotypeCaller (four parallel jobs shown) ---> Merge VCFs]
Variant Analysis Pipeline
Alignment of reads with ‘bwa mem’.
● FASTQ input: single or paired reads
● Read group IDs (RGIDs) will be assigned for FASTQ files
○ Lanes will be assumed merged
● BAM files will be realigned to hg19
○ Requires RGIDs in the BAM
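As a rough illustration of attaching a read group at alignment time, the sketch below builds a `bwa mem` command line with the `-R` option, which takes a complete `@RG` header line in which the two-character sequence `\t` is converted to a TAB. The choice of read group fields, the use of the sample name as the ID, and the function names are assumptions; the slides do not say how the platform assigns RGIDs. The command is only constructed, not executed.

```python
def read_group_header(sample, rgid=None, platform="ILLUMINA", library=None):
    """Return an @RG header string suitable for `bwa mem -R`.

    Fields are joined with a literal backslash-t, which bwa converts to TABs.
    Using the sample name as the default ID and library is an assumption.
    """
    fields = [
        ("ID", rgid or sample),
        ("SM", sample),
        ("PL", platform),
        ("LB", library or sample),
    ]
    for _, value in fields:
        if any(ch in value for ch in "\t\n "):
            raise ValueError(f"whitespace is not allowed in read group value: {value!r}")
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)


def bwa_mem_command(reference, fastqs, sample):
    """Build (but do not run) a bwa mem command for one or two FASTQ files."""
    if not 1 <= len(fastqs) <= 2:
        raise ValueError("expected one (single-end) or two (paired-end) FASTQ files")
    return ["bwa", "mem", "-R", read_group_header(sample), reference, *fastqs]


if __name__ == "__main__":
    cmd = bwa_mem_command("hg19.fa", ["a_R1.fastq.gz", "a_R2.fastq.gz"], "a")
    print(" ".join(cmd))
```

With a single read group per sample, as in this sketch, any per-lane distinction is lost, which is consistent with lanes being assumed merged for FASTQ input.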
Variant Analysis Pipeline
GATK base recalibration.
● For pipelines from FASTQ
○ Loss of lane information
● For pipelines from BAM
○ Retains lane information
Variant Analysis Pipeline
Variant calling with GATK HaplotypeCaller
● SNPs and Indels
● Default parameters
● Parallelized between chromosomes
Variant Analysis Pipeline
Merge chromosome-specific VCFs into a single-sample VCF with Picard Tools.
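The idea behind the merge step can be sketched in a few lines. The pipeline uses Picard Tools for this; the function below is not Picard and skips the validation a real tool performs. It is a simplified stand-in, with hypothetical names, showing that per-chromosome results are concatenated in reference order under a single header.

```python
def merge_vcfs(per_chrom_vcfs, chrom_order):
    """Concatenate per-chromosome VCF texts into one VCF text.

    per_chrom_vcfs: dict mapping chromosome name to the VCF text for it.
    chrom_order: chromosome names in reference order.
    The header is taken from the first chromosome present; this sketch
    assumes all chunks share the same header and sample columns.
    """
    header = None
    records = []
    for chrom in chrom_order:
        text = per_chrom_vcfs.get(chrom)
        if text is None:
            continue  # no calls were produced for this chromosome
        lines = text.splitlines()
        if header is None:
            header = [line for line in lines if line.startswith("#")]
        records.extend(line for line in lines if line and not line.startswith("#"))
    return "\n".join((header or []) + records) + "\n"


if __name__ == "__main__":
    head = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    chunks = {
        "chr2": head + "chr2\t200\t.\tC\tT\t60\tPASS\t.\n",
        "chr1": head + "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    }
    print(merge_vcfs(chunks, ["chr1", "chr2"]), end="")
```

Since each chromosome is called independently, the per-chromosome jobs can finish in any order; only the merge needs to impose the reference ordering.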
Variant Analysis Pipeline Output
- Sorted BAM file
  - For FASTQ pipeline: input FASTQ reads aligned to hg19
  - For BAM pipeline: input BAM realigned to hg19
- VCF file
Variant Visualization
- DNARails VCF data visualization web app
- https://variant-viz.tm4.org/
- Graphical interface
- Filtering of variants
- VEP annotated VCF
Analysis of Large Exome Cohorts
● The web app for exome analysis pipelines is suitable for up to 20 samples
○ Same for DNARails visualization
● Larger data sets (20 to 1000+ samples) create new issues
○ Data transfer
○ Cross-sample analysis
● Ensuring samples match the data
● Different software is needed to analyze large data sets
● Larger data sets offer better means of variant filtration
● A custom project with us can provide a suitable analysis pipeline
Additional Slides
Steps to request analysis service
1. Log into iLab at https://dfci.ilab.agilent.com/
2. Find Center for Cancer Computational Biology in iLab
3. Submit analysis pipeline request with:
○ Number of samples
○ Google Account of the project owner
4. CCCB personnel will review the iLab project and provide a quote
5. Once a grant or PO number is provided, the analysis project will be created and an email will be sent to the provided Google account
6. Follow the link in the email to start analysis
Unsynchronize Dropbox directory
Sequencing data can be too big to synchronize onto a personal computer, so it is a good idea to unsynchronize the delivery folder (cccb-transfers) with the following steps:
1. Go to Dropbox ‘Preferences’
2. Select ‘Account’
3. Under ‘Selective Sync’ select ‘Change Settings…’
4. Un-check ‘cccb-transfers’
RNA-Seq workflow

Cloud Native Analysis Platform for NGS analysis


Editor's Notes

  • #7 Doug Sainato doug@onixnet.com
    Covered under the BAA:
    ● Google Compute Engine
    ● Google Cloud Storage
    ● Google BigQuery
    ● Google Cloud SQL
    ● Cloud Dataproc
    ● Genomics
    ● Container Engine
    ● Container Registry
    ● Cloud Dataflow
    ● Cloud Bigtable
    ● or any other Google Cloud Platform product or service specifically listed at https://cloud.google.com/security/compliance as covered by the Google Cloud Platform BAA
    NOT covered under the BAA:
    ● Google App Engine
    ● Cloud Functions
    ● Cloud Datalab
    ● Cloud Pub/Sub
    ● Machine Learning, including Cloud Machine Learning Platform and the ML APIs
    ● Networking services outside of Compute Engine, i.e. Cloud DNS and Cloud CDN
  • #17 General DGE pipeline
    ● Works for most good RNA-Seq data
    ● Good first-pass analysis
    ● DESeq2 is conservative (edgeR, VOOM, cuffdiff)