Are You Feeling Lucky? Tools For Recycling Publicly Available Next Generation High Throughput Garbage - Rupert Shuttleworth

  • 1. sraget: Finding peer-reviewed data in the NCBI Sequence Read Archive
    Rupert Shuttleworth, Catherine Suter, Mark Cowley, Richard Buckland

    As of November 2013, the NCBI Sequence Read Archive (SRA) contains over three million gigabytes of DNA and RNA sequencing files. Each file has a publication date, which appears to be the date that the file was made available on the SRA. However, we were interested in whether any of the data had actually been used to write a paper published in a journal. In particular, we thought this might be a proxy for "good quality" data. So we developed a program that automatically checks various webpages and identifies the SRA data that has corresponding journal publications.

    Workflow (flowchart): SRA query results -> Collect URLs for sequencing data -> Look for evidence of publications -> Output URLs for published sequencing data

    We performed dozens of different queries on the SRA and found that typically less than 50% of the query results had been published. For the RNA-seq query shown below, only 39% of the matching data had been published.

    [Figure: bar chart of Total Experiments and Published Experiments for A. thaliana, C. elegans, D. melanogaster, H. sapiens and M. musculus; vertical axis from 0 to 12000.]

    Query: ("instrument illumina hiseq 2500"[Properties] OR "instrument illumina hiseq 2000"[Properties] OR "instrument illumina genome analyzer iix"[Properties] OR "instrument illumina genome analyzer ii"[Properties] OR "instrument illumina genome analyzer"[Properties] OR "instrument 454 gs flx"[Properties] OR "instrument 454 gs flx titanium"[Properties]) AND ("strategy rna seq"[Properties] OR "strategy mirna seq"[Properties]) AND ("rna data"[Filter]) AND ("2010"[Publication Date] : "2013"[Publication Date])

    Download: www.github.com/optimuscoprime/sraget

    autoadapt: Automatic quality control for FASTQ sequencing files
    Rupert Shuttleworth, Catherine Suter, Mark Cowley, Richard Buckland

    As of November 2013, the NCBI Sequence Read Archive contains over three million gigabytes of publicly available DNA and RNA sequencing files. However, a wide variety of sequencing adaptors and primers may be contaminating each file, and these sequences normally need to be removed before any further analysis. We developed a tool that automatically detects which adaptors and primers are present in a FASTQ file and removes those sequences from the file. It also detects the quality score encoding type used and removes low quality sequences.

    Workflow (flowchart): Run FastQC to detect encoding type and contamination from primers and adaptors -> Split FASTQ file up into smaller pieces for parallel processing -> Run cutadapt on each piece to remove contaminants and low quality sequences -> Merge small FASTQ files back together and output a high quality, contamination free FASTQ file

    Trim Galore! is a popular tool for automatically removing contaminants from FASTQ files. We downloaded a sample of 116 RNA-seq FASTQ files from the NCBI Sequence Read Archive and found that Trim Galore! was able to remove contaminants in 85 of the 116 files (73%), whereas our tool was able to remove contaminants in 105 of the 116 files (91%). In addition, our tool is multi-threaded, so depending on the number of CPUs available and the speed of the hard drive, it can be much quicker to run than Trim Galore!

    [Figure: bar chart comparing autoadapt and Trim Galore! by the share of de-contaminated versus still-contaminated FASTQ files; horizontal axis from 0% to 100%.]

    Query: the same SRA query as listed in the sraget section above.

    Download: www.github.com/optimuscoprime/autoadapt

    Acknowledgements
    This project is a collaboration between the School of Computer Science and Engineering, the Victor Chang Cardiac Research Institute and the Garvan Institute of Medical Research. The project is also supported by a generous research grant from Amazon Web Services.
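    The autoadapt description mentions detecting the quality score encoding type of a FASTQ file. The poster delegates this step to FastQC, so the following is only a minimal Python sketch of the general idea, not the method used by FastQC or autoadapt. It assumes four-line FASTQ records, that Phred+33 quality characters never exceed 'J' (ASCII 74), and that Phred+64 characters never fall below ';' (ASCII 59). The function name detect_phred_offset and its max_records parameter are hypothetical.

```python
def detect_phred_offset(fastq_lines, max_records=10000):
    """Guess the quality offset (33 or 64) of a FASTQ stream.

    Returns 33, 64, or None when the sampled quality characters
    fit both encodings (or no quality lines were seen).
    Assumes each record is exactly four lines long.
    """
    lowest = None
    highest = None
    for index, line in enumerate(fastq_lines):
        if index // 4 >= max_records:
            break  # sampled enough records
        if index % 4 != 3:
            continue  # only the fourth line of a record holds qualities
        quality = line.rstrip("\r\n")
        if not quality:
            continue
        codes = [ord(char) for char in quality]
        low, high = min(codes), max(codes)
        lowest = low if lowest is None else min(lowest, low)
        highest = high if highest is None else max(highest, high)

    if lowest is None:
        return None  # no quality lines found
    if lowest < 59:
        return 33  # characters below ';' cannot occur in Phred+64 data
    if highest > 74:
        return 64  # characters above 'J' are assumed not to occur in Phred+33 data
    return None  # ambiguous: every character is valid in both encodings


if __name__ == "__main__":
    record = ["@read1\n", "ACGT\n", "+\n", "II#!\n"]
    print(detect_phred_offset(record))
```

    In practice the function could be given an open file handle, for example `with open(path) as handle: detect_phred_offset(handle)`, where path is whatever FASTQ file is being checked. Returning None for the ambiguous case leaves the decision to the caller rather than guessing.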