Pipeline Scripting for the Parallel Alignment
of Genomic Short Sequence Reads
Adam Bradley, Kevin Drees, Paul Keim, Jeffrey Foster
Northern Arizona University Center for Microbial Genetics and Genomics
Methods
• A Python pipeline script was developed to feed a reference fasta file and paired-end Illumina fastq.gz read files into the alignment programs
• The script locates the read files and the reference sequence file in the working directory using regular expressions
• Open-source programs integrated into the pipeline:
  • bwa-0.7.5a: index the reference fasta and produce the raw alignment
  • picard-tools-1.83: sort reads in the alignment by reference coordinate, collect alignment statistics, remove duplicate reads, and index the alignment relative to the reference
  • samtools-0.1.19: remove reads mapping to more than one locus and run base alignment quality (BAQ) analysis
• Commands are submitted as jobs to pbs_server
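The file-discovery step described under Methods could look roughly like the sketch below. The poster does not give the actual regular expressions or the file-naming convention, so the patterns `READ_RE` and `REF_RE` and the function `find_inputs` are illustrative assumptions (sample name followed by `_R1`/`_R2`, with optional Illumina lane and chunk fields).

```python
import re
from collections import defaultdict

# Assumed naming convention (not from the poster):
#   <sample>[_S<n>][_L<lane>]_R<1|2>[_<chunk>].fastq[.gz]
READ_RE = re.compile(
    r"^(?P<sample>.+?)(?:_S\d+)?(?:_L\d{3})?_R(?P<mate>[12])(?:_\d{3})?\.fastq(?:\.gz)?$"
)
# Assumed reference extensions.
REF_RE = re.compile(r"\.(?:fasta|fa|fna)$", re.IGNORECASE)


def find_inputs(filenames):
    """Return (reference, {sample: (read1, read2)}) from a list of file names."""
    references = sorted(f for f in filenames if REF_RE.search(f))
    if len(references) != 1:
        raise ValueError(f"expected exactly one reference fasta, found {len(references)}")

    # Group read files by sample name, keyed by mate number ("1" or "2").
    mates = defaultdict(dict)
    for name in filenames:
        match = READ_RE.match(name)
        if match:
            mates[match.group("sample")][match.group("mate")] = name

    pairs = {}
    for sample, found in sorted(mates.items()):
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate file")
        pairs[sample] = (found["1"], found["2"])
    return references[0], pairs
```

In a working directory this would typically be called as `find_inputs(os.listdir("."))`. Raising an error on an unpaired read file is a design choice for the sketch; a real pipeline might prefer to log a warning and skip the sample.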
Introduction
• Single nucleotide polymorphisms (SNPs) in whole-genome sequences can be used to produce high-resolution phylogenetic trees, which illustrate the genetic relationships between organisms.
• Illumina next-generation sequencing (NGS) produces paired short sequence reads (100-250 bp) that must be assembled before such genomic analyses can be performed.
• To find SNPs in whole-genome sequences, sequence reads can be aligned to a complete reference genome.
• Several programs are involved in producing an alignment. Running them manually is cumbersome, particularly when aligning multiple samples.
• Our goal is to connect the alignment programs into a “pipeline,” which will reduce user effort and allow for parallel processing.
Abstract
The recent advent of next-generation sequencing (NGS) technologies has allowed for the rapid and accurate sequencing of organisms' entire genomes, from small viral genomes to large animal genomes. Analysis of single nucleotide polymorphisms (SNPs), or point mutations, across genomes allows geneticists to determine evolutionary relationships between organisms. However, NGS sequencers produce billions of short sequence "reads" that must be reassembled into their original contiguous sequence before being used in a SNP analysis. One method of doing so is alignment, in which reads are mapped to a completed reference sequence. Alignments often involve several programs, and post-processing is usually needed to make the output compatible with downstream software. Furthermore, SNP analyses often include large numbers of samples. We developed a script in the Python programming language that connects the programs used in sequence alignment into a single process. The script can also process data from multiple samples in parallel, saving time and making efficient use of available computing resources.
Discussion
• The pipeline currently indexes the reference and produces raw alignments of multiple samples in parallel with bwa
• Downstream processes are in progress
• Future versions will incorporate:
  • average alignment depth of coverage and warning flags indicating poor alignment (Genome Analysis Toolkit DiagnoseTargets)
  • percentage of the reference covered by aligned reads
  • editing of bam alignment headers to include more information about the sequencing run
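The two planned coverage metrics (average depth of coverage and percentage of the reference covered) can be illustrated with a small calculation. The poster names GATK DiagnoseTargets for this purpose; the sketch below instead assumes tab-separated per-position depth lines of the form produced by `samtools depth` (reference name, 1-based position, depth), with uncovered positions possibly omitted. The function name `coverage_summary` is hypothetical.

```python
def coverage_summary(depth_lines, reference_length):
    """Return (mean depth, percent of reference covered).

    depth_lines: iterable of "<ref>\t<position>\t<depth>" strings.
    Positions absent from the input are treated as depth 0, so the
    reference length must be supplied separately.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")

    covered_positions = 0
    total_depth = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        if depth > 0:
            covered_positions += 1
            total_depth += depth

    mean_depth = total_depth / reference_length
    percent_covered = 100.0 * covered_positions / reference_length
    return mean_depth, percent_covered
```

A warning flag for poor alignment could then be as simple as comparing these two numbers against thresholds chosen for the organism and sequencing run; the poster does not state what thresholds the authors had in mind.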
Acknowledgements
Funding was provided by the National Institutes of Health Bridges to Baccalaureate program and the National Cancer Institute Native American Cancer Prevention program.
Results
Design Specifications
• Input reference file in fasta format
• Input paired-end Illumina read files in fastq or gzipped fastq.gz format
• Output alignment must be in bam format and have the following characteristics:
  • reads sorted by reference coordinate
  • perfectly duplicated reads removed
  • reads mapping to more than one locus on the reference removed
• An alignment index file in bai format must be output for use by downstream alignment viewers and SNP callers
• Output must include mapping statistics and read insert size statistics
• Reads from multiple samples must be aligned in parallel
• Subprocesses must be submitted as pbs jobs to a batch queuing server
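The requirements that samples be aligned in parallel and that subprocesses be submitted as pbs jobs could be met by wrapping each command in a small job script and handing it to `qsub`, one job per sample. This is a minimal sketch, not the authors' code: the function names `pbs_script` and `submit`, the default resource requests, and the use of `qsub` reading the script from standard input are assumptions about a typical PBS/Torque setup.

```python
import subprocess


def pbs_script(job_name, command, cores=4, walltime="04:00:00"):
    """Build the text of a PBS job script that runs one shell command.

    The resource defaults are placeholders; suitable values depend on the cluster.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",            # job name shown in the queue
        f"#PBS -l nodes=1:ppn={cores}",   # one node, `cores` processors
        f"#PBS -l walltime={walltime}",   # maximum run time
        "#PBS -j oe",                     # merge stderr into stdout
        'cd "$PBS_O_WORKDIR"',            # run from the submission directory
        command,
    ]
    return "\n".join(lines) + "\n"


def submit(script):
    """Send a job script to qsub on stdin and return the job id it prints."""
    result = subprocess.run(
        ["qsub"], input=script, text=True, capture_output=True, check=True
    )
    return result.stdout.strip()
```

Submitting one such job per sample lets the batch scheduler run the alignments concurrently, so the Python script itself does not need threads or worker processes.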
Figure 2. Alignment of Brucella abortus 10-1086 to the Brucella abortus 2308 reference, displayed with Tablet 1.13.05.17
Figure 1. Pipeline flow diagram. Inputs: reference.fasta; Illumina paired-end .fastq read files. Steps shown: bwa index, bwa mem, picard SortSam, picard MarkDuplicates, samtools view, samtools calmd, picard BuildBamIndex, picard CollectInsertSizeMetrics, picard CollectAlignmentSummaryMetrics. Outputs: bam alignment, bai index, metrics files.
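The steps named in the flow diagram could be assembled into per-sample command lines along the following lines. The step order, intermediate file names, the `picard_dir` location, and the specific options are assumptions for illustration; in particular, filtering on mapping quality with `samtools view -q 1` is one common way to drop reads that map to more than one locus, and the poster does not say which filter the authors used.

```python
import shlex


def index_command(ref):
    """Command that indexes the reference once, before any sample is aligned."""
    return f"bwa index {shlex.quote(ref)}"


def pipeline_commands(sample, ref, read1, read2,
                      picard_dir="picard-tools-1.83", threads=4):
    """Return shell commands for one sample, in an assumed order."""
    q = shlex.quote

    def picard(tool, *args):
        # picard-tools 1.x ships one jar per tool (assumed install layout).
        jar = f"{picard_dir}/{tool}.jar"
        return " ".join(["java", "-jar", q(jar), *(q(a) for a in args)])

    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    unique_bam = f"{sample}.unique.bam"
    final_bam = f"{sample}.bam"

    return [
        # Raw alignment of the paired reads.
        f"bwa mem -t {threads} {q(ref)} {q(read1)} {q(read2)} > {q(sam)}",
        # Sort by reference coordinate (also converts SAM to BAM).
        picard("SortSam", f"INPUT={sam}", f"OUTPUT={sorted_bam}",
               "SORT_ORDER=coordinate"),
        # Remove duplicate reads.
        picard("MarkDuplicates", f"INPUT={sorted_bam}", f"OUTPUT={dedup_bam}",
               f"METRICS_FILE={sample}.dup_metrics.txt", "REMOVE_DUPLICATES=true"),
        # Assumed filter: keep reads with mapping quality of at least 1.
        f"samtools view -b -q 1 {q(dedup_bam)} > {q(unique_bam)}",
        # BAQ computation via calmd (-r), written as BAM (-b).
        f"samtools calmd -b -r {q(unique_bam)} {q(ref)} > {q(final_bam)}",
        # Index and metrics on the final alignment.
        picard("BuildBamIndex", f"INPUT={final_bam}"),
        picard("CollectAlignmentSummaryMetrics", f"INPUT={final_bam}",
               f"OUTPUT={sample}.alignment_metrics.txt",
               f"REFERENCE_SEQUENCE={ref}"),
        picard("CollectInsertSizeMetrics", f"INPUT={final_bam}",
               f"OUTPUT={sample}.insert_metrics.txt",
               f"HISTOGRAM_FILE={sample}.insert_histogram.pdf"),
    ]
```

Returning plain command strings keeps this step independent of how the commands are executed, so the same list can be run locally for testing or wrapped in batch jobs on a cluster.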
