Step 4: Non-Picard Burrows-Wheeler Aligner (BWA)
Map/Align FastQ reads to reference genome (Ref)
Step 3: SamToFastq
Converts uBAM to FastQ data. Extracts read sequences and
base quality scores from the input uBAM file and writes them to
the output file in Sanger FastQ format. Reduces adapter
sequence base quality scores to low numbers (to prevent them
from contributing to subsequent alignments).
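The SamToFastq behaviour described above can be sketched as follows. This is a minimal illustration, not Picard's implementation: the function name, the `adapter_start` parameter, and the replacement quality of 2 are assumptions made for the example. It shows Sanger (Phred+33) quality encoding and the lowering of base qualities over the adapter portion of a read.

```python
def to_sanger_fastq(name, bases, quals, adapter_start=None, clipped_qual=2):
    """Format one read as a four-line Sanger FastQ record.

    quals is a list of integer Phred scores. adapter_start (hypothetical
    parameter) is the 0-based index where adapter sequence begins; from
    that index onward the qualities are lowered to clipped_qual.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    if adapter_start is not None:
        kept = quals[:adapter_start]
        quals = kept + [clipped_qual] * (len(quals) - len(kept))
    # Sanger FastQ stores each Phred score as the character chr(score + 33).
    qual_string = "".join(chr(q + 33) for q in quals)
    return f"@{name}\n{bases}\n+\n{qual_string}\n"


# Made-up read: 8 bases at Q30, adapter assumed to start at index 5.
record = to_sanger_fastq("read1", "ACGTACGT", [30] * 8, adapter_start=5)
print(record, end="")
```

Lowering the qualities keeps read lengths intact while letting a quality-aware aligner discount the adapter bases.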
Aggregated, aligned, deduped, and cleaned per-sample BAM files
Dye-labeled dideoxynucleotides generate variable
length DNA fragments
Base call data (BCL)
Read data organized by lane, cycle, tile, direction etc.
Pre-Picard: Generation of raw sequence data
Multiplex - Pool libraries and samples
(helps prevent lane-specific artifacts)
Multiplex
Libraries
Prepare multiple libraries per sample.
Libraries have unique barcodes
embedded in adapter sequences
Sequencer
Flow cell lanes
Step 2: IlluminaBaseCallsToSam
Create an unmapped BAM/SAM file (uBAM) from
Illumina base-call data (BCL). Unlike FastQ,
SAM can store run-specific metadata in its header
(e.g., RG, LB).
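To make the header metadata concrete, the sketch below formats a SAM read-group (`@RG`) header line. The tags `ID`, `SM`, `LB`, `PU`, and `PL` come from the SAM specification; the function name and all example values are hypothetical and not taken from the poster.

```python
def read_group_header(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build one tab-separated @RG header line for a SAM/BAM file."""
    fields = [
        ("ID", rg_id),          # read group identifier
        ("SM", sample),         # sample name
        ("LB", library),        # library name
        ("PU", platform_unit),  # platform unit, e.g. flow cell, lane, barcode
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Hypothetical values for one library sequenced on one lane.
line = read_group_header("FC1.1", "sampleA", "libA1", "FC1.1.ACGTACGT")
print(line)
```

A FastQ record has no place for these fields, which is why the pipeline carries reads as uBAM.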
QC: CollectQualityYieldMetrics - Determines the numbers of
reads that pass quality filters
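Counting reads that pass quality filters can be sketched with the SAM flag field alone. Bit 0x200 is defined in the SAM specification as "not passing filters"; the function and dictionary key names below are illustrative assumptions, and the real tool reports more than this one count.

```python
QC_FAIL = 0x200  # SAM flag bit: read fails platform/vendor quality checks


def count_pass_filter(flags):
    """Count total reads and reads whose QC-fail bit is not set."""
    total = len(flags)
    passing = sum(1 for flag in flags if not flag & QC_FAIL)
    return {"total_reads": total, "pf_reads": passing}


# Four made-up unmapped paired reads; the third has the QC-fail bit set.
example_flags = [77, 141, 77 | QC_FAIL, 141]
print(count_pass_filter(example_flags))
```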
Step 6: Aggregation - Collects the data for each
sample across different lanes into a single BAM file per
sample. MarkDuplicates is carried out a second
time because libraries containing duplicates
can be spread across flow cells.
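Aggregation can be pictured as grouping the per-lane files by sample before they are merged. The sketch below only does the grouping; the tuple layout and file names are hypothetical, and the actual merging of BAM records is not shown.

```python
from collections import defaultdict


def group_by_sample(lane_files):
    """Group (sample, lane, path) tuples into a dict of sample -> paths."""
    per_sample = defaultdict(list)
    # Sorting gives a stable, lane-ordered list of inputs for each sample.
    for sample, lane, path in sorted(lane_files):
        per_sample[sample].append(path)
    return dict(per_sample)


# Hypothetical per-lane outputs for two samples on two lanes.
files = [
    ("sampleA", 1, "sampleA.lane1.bam"),
    ("sampleB", 1, "sampleB.lane1.bam"),
    ("sampleA", 2, "sampleA.lane2.bam"),
    ("sampleB", 2, "sampleB.lane2.bam"),
]
for sample, paths in group_by_sample(files).items():
    print(sample, paths)
```

Each resulting list would be the input set for one per-sample merge, followed by the second duplicate-marking pass.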
Step 5: MarkDuplicates
Detects and flags duplication artifacts (removal is optional)
Determines the numbers of:
-Paired/unpaired reads
-Mapped/unmapped reads
-Duplicates (PCR and Optical)
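A heavily simplified sketch of duplicate detection follows. It assumes single-end reads and treats reads that share chromosome, start position, and strand as copies of one fragment, keeping the copy with the highest summed base quality. The record layout and function name are hypothetical. Optical-duplicate detection, mate pairs, and clipping are ignored. The sketch sets the SAM duplicate flag (0x400) on the extra copies instead of deleting them; dropping them afterwards would be a one-line filter.

```python
from collections import defaultdict

DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def mark_duplicates(reads):
    """Flag all but the best-quality read at each (chrom, pos, strand).

    reads is a list of dicts with keys chrom, pos, strand, quals, flag.
    Flags are updated in place; the number of flagged reads is returned.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    n_duplicates = 0
    for group in groups.values():
        best = max(group, key=lambda r: sum(r["quals"]))
        for read in group:
            if read is not best:
                read["flag"] |= DUPLICATE
                n_duplicates += 1
    return n_duplicates


# Made-up reads: "a" and "b" start at the same place on the same strand.
reads = [
    {"name": "a", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [30, 30], "flag": 0},
    {"name": "b", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [40, 40], "flag": 0},
    {"name": "c", "chrom": "chr1", "pos": 250, "strand": "-", "quals": [35, 35], "flag": 16},
]
print(mark_duplicates(reads), [r["flag"] for r in reads])
```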
QC:
ValidateSamFile
CollectMultipleMetrics:
-MeanQualityByCycle
-QualityScoreDistribution
-CollectAlignmentSummaryMetrics
-CollectInsertSizeMetrics
Non-Picard VerifyBamID (Contamination check)
CheckFingerprint
CalculateHsMetrics (Exomes)
CollectGcBiasMetrics (WGS)
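As one example of what these QC metrics compute, the sketch below averages base quality at each sequencing cycle, in the spirit of MeanQualityByCycle. It is an illustration under simple assumptions (reads given as lists of Phred scores, cycle equal to position in the read), not the tool's actual implementation.

```python
def mean_quality_by_cycle(quality_lists):
    """Return the mean Phred score at each cycle across all reads.

    Reads may differ in length; each cycle is averaged over the reads
    that are long enough to have a base at that cycle.
    """
    sums, counts = [], []
    for quals in quality_lists:
        for cycle, q in enumerate(quals):
            if cycle == len(sums):
                sums.append(0)
                counts.append(0)
            sums[cycle] += q
            counts[cycle] += 1
    return [s / c for s, c in zip(sums, counts)]


# Two made-up reads of different lengths.
print(mean_quality_by_cycle([[30, 20, 10], [40, 20]]))
```

A steady decline toward later cycles is expected; a sudden drop at one cycle would point to a run-level problem.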
Abstract
Picard is a publicly available software suite of 77 tools for the manipulation and analysis of
high-throughput sequencing (HTS) data. These manipulations include file conversion, data transformation, data
analysis, and production of an array of quality control (QC) metrics. Data inputs include Illumina base call
format (BCL), FastQ, SAM/BAM, VCF/BCF, and interval files. QC metrics tools validate and troubleshoot data
at virtually every step of analysis, from library preparation through variant calling to
genotype assignment.
The Broad Genomics Platform (GP) uses these tools, together with the Burrows-Wheeler Aligner (BWA) and the
Genome Analysis Toolkit (GATK), in its human whole-genome and whole-exome sequencing (WGS, WES) analysis
pipelines to genotype and call germline variants. Broad’s GP processes ~4000 exomes and ~1000 genomes per month.
This production pipeline takes raw Illumina BCL data, demultiplexes it per sample per lane, aligns the reads to a
reference sequence, merges the alignments with platform metadata, and aggregates them to obtain
sorted BAM files per sample. These processed BAM files are input directly into GATK for germline variant
calling and into Firehose for somatic variant calling.
We present tools used in GP’s human WGS analysis pipeline. Our example data is from eight flow cell lanes
that represent four multiplexed samples (8 files). Files are processed per sample per lane from steps 1 to 5,
then aggregated and processed per sample from steps 6 to 8. The eight steps and their associated tools are
outlined, including tools that calculate quality control metrics.
Step 1: ExtractIlluminaBarcodes
Determines the barcode for each read in
an Illumina lane (demultiplex).
QC: CollectIlluminaBasecallingMetrics - Produces
per-lane barcode base-call metrics
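The demultiplexing in Step 1 can be sketched as nearest-barcode matching by Hamming distance. This is a simplified illustration: the function names, the thresholds `max_mismatches` and `min_delta`, and the example barcodes are assumptions for the sketch rather than the tool's documented defaults.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, barcodes, max_mismatches=1, min_delta=1):
    """Return the expected barcode that best matches observed, or None.

    A match needs at most max_mismatches differences and must beat the
    runner-up barcode by at least min_delta mismatches.
    """
    if not barcodes:
        return None
    scored = sorted((hamming(observed, bc), bc) for bc in barcodes)
    best_mm, best_bc = scored[0]
    second_mm = scored[1][0] if len(scored) > 1 else len(observed) + 1
    if best_mm <= max_mismatches and second_mm - best_mm >= min_delta:
        return best_bc
    return None


# Hypothetical barcodes for two libraries pooled in one lane.
expected = ["ACGTACGT", "TGCATGCA"]
print(assign_barcode("ACGTACGA", expected))  # one mismatch to the first barcode
print(assign_barcode("GGGGGGGG", expected))  # too far from both
```

Reads that return `None` would be counted as unmatched, which is the kind of tally a per-lane barcode metric reports.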
Step 7: GATK Pre-processing
Step 7a: Indel Realignment
Step 7b: Base Recalibration
Step 8: Input BAM files for variant discovery
-GATK for germline SNPs/Indels
-Firehose for somatic variants
Unmapped BAM Files (uBAM): grayed out areas indicate missing alignment data
Raw Mapped/Aligned SAM Files: grayed out areas indicate missing metadata
Picard Tools for HTS Data Pre-processing
Communications Team, Data Science & Data Engineering (DSDE), Broad Institute
Mapped/Aligned SAM Files
Step 4: MergeBamAlignment
MergeBamAlignment merges defined information from the aligned
SAM with that of the uBAM to conserve read data, and generates
additional meta information. The resulting BAM is coordinate sorted,
indexed, and clean.
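The merge described above can be sketched as a join on read name followed by a coordinate sort. The record layout and function name are hypothetical, reads are assumed to be single-end with unique names, and chromosomes are sorted as plain strings, whereas a real BAM follows the order of the reference sequence dictionary.

```python
def merge_alignment(unmapped_reads, aligned_reads):
    """Combine uBAM metadata with aligner output, then coordinate sort.

    unmapped_reads: list of dicts with at least a "name" key plus metadata.
    aligned_reads: list of dicts with keys name, chrom, pos.
    Reads without an alignment keep chrom and pos as None and sort last.
    """
    aligned_by_name = {rec["name"]: rec for rec in aligned_reads}
    merged = []
    for read in unmapped_reads:
        hit = aligned_by_name.get(read["name"], {})
        merged.append({**read, "chrom": hit.get("chrom"), "pos": hit.get("pos")})
    merged.sort(key=lambda r: (r["chrom"] is None, r["chrom"] or "", r["pos"] or 0))
    return merged


# Made-up input: three reads carry a read group; the aligner placed two of them.
ubam = [{"name": "r1", "rg": "A"}, {"name": "r2", "rg": "A"}, {"name": "r3", "rg": "A"}]
sam = [{"name": "r1", "chrom": "chr1", "pos": 500}, {"name": "r3", "chrom": "chr1", "pos": 100}]
for rec in merge_alignment(ubam, sam):
    print(rec)
```

Starting from the uBAM records means that reads the aligner dropped, and metadata the aligner discarded, are both preserved in the output.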
