Centralizing sequence analysis

The first steps of analysing sequencing data (2GS/NGS) have entered a transitional period: on one hand, most analysis steps can be automated and standardized (pipelines); on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour-intensive.
I propose a centralized system within CSIRO that is flexible enough to cater for different analyses, yet generic enough to efficiently distribute the labour-intensive maintenance and extension work amongst the user community.


Transcript

  • 1. Centralizing Sequence Analysis: Taking the grind out of the analysis pipelines
    Denis C. Bauer, CSIRO Mathematics, Informatics and Statistics

    The first steps of analysing sequencing data (2GS/NGS) have entered a transitional period: on one hand, most analysis steps can be automated and standardized (pipelines); on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour-intensive. I propose a centralized system within CSIRO that is flexible enough to cater for different analyses, yet generic enough to efficiently distribute the labour-intensive maintenance and extension work amongst the user community.

    Crowd-sourcing, not Wheel-reinvention
    Academic tools will remain the methods of choice for cutting-edge data analysis [1]; however, most do not comply with even very basic software-development practice (poor documentation, lack of legacy support), which makes set-up and maintenance time-consuming. A similar issue applies to reference data sets, which need to be downloaded and often filtered and converted into a usable format. Summarizing quality control and data yield in a meaningful way remains a labour-intensive expert task. Rather than individually battling these issues, a more efficient way would be to have a centralized system that is collectively maintained by the researchers who use it. Benefits would be:
    • Sharing modular methods/scripts for data analysis and summary
    • Ensuring consistency and reproducibility by keeping scripts separate from data
    • Benchmarking quality amongst other datasets within CSIRO
    • Enabling collaborative knowledge gain
    • Making developers' expert knowledge available to users by enforcing scripts to have a self-contained quality control stage (see the module sketch after the framework figures below)

    Figure 2: [Examples of applications that can be shared] A quality-control overview, highly informative performance plots and a system to browse data in real time with references to other data sets are all examples of labour-intensive (set-up/maintenance) tasks whose results would benefit other users.

    Big picture: flexible yet low-maintenance framework
    The framework combines a project server (ca. 200 GB of processed data per project plus 57 GB of external reference data), shared scripts (>35 external programs and >41 custom scripts totalling 4,197 lines of code) and documentation (wiki pages, task logs and project summary cards, viewed in a web browser). A user call passes a Config.txt to trigger.sh, which drives the pipeline modules (pbs1.sh/eval1/data1, pbs2.sh/eval2/data2, ...) on the cluster. The Application pillar covers quality control, visualization, hypothesis generation and processing/analysis, using software such as BWA, GATK, samtools, the IGV genome browser, custom scripts and Galaxy; the Data pillar holds raw data, processed data and external genomic resources (genomes, annotation, etc.) from external service providers; the Backup-and-Statistics pillar provides version control, Rsync backup, RStudio and data warehousing, keeping log files and code alongside the data. Results are surfaced as project cards (web) and a Summary.html on the project servers (//cherax, //cluster-vm, //fsnsw3_syd/Bioinfo, //???).

    Figure 1: [Pipeline framework] Project information is kept separate from scripts/programs; a 'config' file defines how the pipeline is invoked to produce the data and analysis steps relevant for this particular project. Each module is dual-functioning: data generation ('armed') and quality control ('verify').
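    The config-driven, dual-function design in Figure 1 can be pictured as a thin driver script. The sketch below is illustrative only: the file layout, the config format and the armed/verify calling convention are assumptions drawn from the figure labels, not the actual CSIRO scripts.

        #!/usr/bin/env bash
        # trigger.sh -- illustrative driver only; names and layout are assumptions,
        # not the actual CSIRO implementation.
        #
        # config.txt lives with the project (separate from the scripts) and lists
        # one module per line, e.g.:
        #   modules/pbs1.sh    # alignment (BWA)
        #   modules/pbs2.sh    # variant calling (GATK)
        #
        # Each module is dual-functioning and is called as:
        #   <module> armed  <project_dir>   # generate data
        #   <module> verify <project_dir>   # self-contained quality control
        set -euo pipefail

        project_dir=${1:?usage: trigger.sh <project_dir>}
        config="$project_dir/config.txt"
        log="$project_dir/pipeline.log"

        while read -r module _; do
            # skip blank lines and comment lines
            [[ -z "$module" || "$module" == \#* ]] && continue

            echo "[$(date)] armed:  $module" | tee -a "$log"
            bash "$module" armed "$project_dir" </dev/null    # data generation

            echo "[$(date)] verify: $module" | tee -a "$log"
            # the QC stage ships inside the module, so it runs on every call
            if ! bash "$module" verify "$project_dir" </dev/null; then
                echo "[$(date)] QC failed in $module; stopping" | tee -a "$log"
                exit 1
            fi
        done < "$config"

    Because the 'verify' step lives in the same module as the 'armed' step, the developer's quality checks run automatically for every user, which is the enforcement mechanism proposed in the benefits list above.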
    Figure 3: [Framework overview] Elements from the three main pillars (Apps, Data, Backup) are overarched by a documentation server, which displays current states and annotates changes.
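    Under the same illustrative assumptions, a matching dual-function module might look as follows; the module name is hypothetical and the alignment command is a placeholder (a real module would call BWA, samtools, etc.):

        #!/usr/bin/env bash
        # modules/pbs1.sh -- illustrative dual-function module (hypothetical, not
        # the actual CSIRO script). Called by trigger.sh as:
        #   pbs1.sh armed  <project_dir>   or   pbs1.sh verify <project_dir>
        set -euo pipefail

        mode=${1:?usage: pbs1.sh armed|verify <project_dir>}
        project_dir=${2:?missing project dir}

        case "$mode" in
          armed)
              # data-generation stage; a real module would run the aligner here,
              # e.g.: bwa mem ref.fa "$project_dir/reads.fq" > "$project_dir/aln.sam"
              echo "aligning reads for $project_dir"
              ;;
          verify)
              # self-contained QC stage written by the module's developer, so the
              # expert checks travel with the code that produced the data
              [[ -s "$project_dir/aln.sam" ]] || { echo "no alignments produced"; exit 1; }
              ;;
          *)
              echo "unknown mode: $mode" >&2; exit 2
              ;;
        esac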

    Utilizing international efforts
    There are several international attempts to automate and standardize NGS data analysis. Investigating which efforts are beneficial to CSIRO is likely to be more successful as a group effort than by each individual alone:
    • Bpipe (http://code.google.com/p/bpipe/): an effort for streamlining pipeline calls and re-calls
    • ISAtools (http://isatab.sourceforge.net/tools.html): metadata annotation and documentation
    • BioStore (http://www.seqan-biostore.de/wp/): a C++ framework for developing and sharing sequencing-analysis programs based on solid algorithmic foundations and template-based interfaces
    • Nectar (http://nectar.org.au/): an Australia-wide effort for cloud computing and large data storage with an emphasis on NGS (Mike Pheasant)

    FOR FURTHER INFORMATION
    Denis Bauer, e: denis.bauer@csiro.au, w: www.csiro.au/CMIS

    REFERENCES
    [1] Bauer, Denis. Variant calling comparison: CASAVA 1.8 and GATK. Available from Nature Precedings (2011).