NGS Data Preprocessing

6,651 views

Published on

NGS Data Preprocessing -
Javier Santoyo -
Massive sequencing data analysis workshop -
Granada 2011

Published in: Technology, Business
1 Comment
3 Likes
Statistics
Notes
No Downloads
Views
Total views
6,651
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
254
Comments
1
Likes
3
Embeds 0
No embeds

No notes for slide

NGS Data Preprocessing

  1. 1. BabelomicsNGS data Preprocessing Granada, June 2011 Javier Santoyo-Lopez jsantoyo@cipf.es http://bioinfo.cipf.es Genomics Department Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)
  2. 2. History of DNA Sequencing Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998) Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998) 1870 Miescher: Discovers DNA 1940 Avery: Proposes DNA as ‘Genetic Material’ Efficiency 1953 Watson & Crick: Double Helix Structure of DNA (bp/person/year) 1 1965 Holley: Sequences Yeast tRNAAla 15 1970 Wu: Sequences λ Cohesive End DNA 150 Sanger: Dideoxy Chain Termination 1977 Gilbert: Chemical Degradation 1,500 1980 Messing: M13 Cloning 15,000 25,000 1986 Hood et al.: Partial Automation 50,000 1990 Cycle Sequencing Improved Sequencing Enzymes 200,000 Improved Fluorescent Detection Schemes 50,000,000 2002 Next Generation Sequencing Improved enzymes and chemistry 100,000,000,000 2008 Improved image processing
  3. 3. Sequence Databases Trend EMBL database growth (March 2011)
  4. 4. From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
  5. 5. NGS platforms comparison06/14/11 Source http://www.clcngs.com/2008/12/ngs-platforms-overview/
  6. 6. Adapted from John McPherson, OICRNext-gen sequencers Adapted from John McPherson, OICR AB SOLiDv3 200Gb, 50 bp reads 100 Gb Illumina HiSeq bases per machine run 150Gb, 100bp reads 10 Gb 454 GS FLX Titanium 1 Gb 0.4-1 Gb, 100-500 bp reads 100 Mb 10 Mb ABI capillary sequencer (0.04-0.08 Mb, 450-800 bp reads 1 Mb 10 bp 100 bp 1,000 bp read length
  7. 7. Many Gbs of Sequences and...• Data management becomes a challenge. – Moving data across file systems takes time (several hundred Gbs)• What structure has the data? – Different sequencers output different files, but – There are some data formats that are being accepted widely (e.g. FastQ format)• Raw sequence data formats – SFF – Fasta, csfasta – Qual file – Fastq
  8. 8. Fasta & Fastq formats• FastA format (everybody knows about it) – Header line starts with “>” followed by a sequence ID – Sequence (string of nt).• FastQ format – First is the sequence (like Fasta but starting with “@”) – Then “+” and sequence ID (optional) and in the following line are QVs encoded as single byte ASCII codes • Different quality encode variants• Nearly all downstream analysis take FastQ as input sequence
  9. 9. Sequence to Variation Workflow Raw Data FastQ IGV BWA/ BWASW SAM SAM Tools Samtools BCF BCF mPileup Filter Tools Raw Tools GFF Pileup BAM VCF VCF
  10. 10. Sequence to Variation Workflow Raw FastX FastQ Data Tookit FastQ BWA/ IGV BWASW BWA/ BWASW SAM SAM Tools Samtools BCF BCF mPileup Filter Tools Raw Tools GFF Pileup BAM VCF VCF
  11. 11. Why Quality Control and Preprocessing? Sequencer output:  Reads + quality  Is the quality of my sequenced data OK?
  12. 12. Why Quality Control and Preprocessing? Sequencer output:  Reads + quality  Is the quality of my sequenced data OK? If something is wrong can I fix it?
  13. 13. Why Quality Control and Preprocessing? Sequencer output:  Reads + quality  Is the quality of my sequenced data OK? If something is wrong can I fix it? Problem:  HUGE files...
  14. 14. Why Quality Control and Preprocessing? Sequencer output:  Reads + quality  Is the quality of my sequenced data OK? If something is wrong can I fix it? Problem:  HUGE files... How do they look? @HWUSI­EAS460:2:1:368:1089#0/1 TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG +HWUSI­EAS460:2:1:368:1089#0/1 aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB @HWUSI­EAS460:2:1:368:528#0/1 CTATTATAATATGACCGACCAGCTAGATCTACAGTC +HWUSI­EAS460:2:1:368:528#0/1 abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_ Files are flat files and are big... tens of Gbs (please... dont use MS word to see or edit them)
  15. 15. Sequence QualityPer base Position Good data  Consistent  High quality along the read * The central red line is the median value * The yellow box represents the inter-quartile range (25-75%) * The upper and lower whiskers represent the 10% and 90% points * The blue line represents the mean quality
  16. 16. Sequence QualityPer base Position Bad data  High variance  Quality decrease with length
  17. 17. Per Sequence Quality Distribution Good data  Most are high-quality sequences
  18. 18. Per Sequence Quality Distribution Bad data  Not uniform distributionLow Quality Reads
  19. 19. Nucleotide Content per position Good data  Smooth over length  Organism dependent (GC)
  20. 20. Nucleotide Content per position Bad data  Sequence position bias
  21. 21. GC Distribution Good data  Fits with the expected  Organism dependent
  22. 22. Per sequenceGC Distribution Bad data  It does not fit with expected  Organism dependent Library contamination?
  23. 23. Per baseGC Distribution Good data  No variation across read sequence
  24. 24. Per baseGC Distribution Bad data  Variation across read sequence
  25. 25. Per baseN content Its not good if there are N bias per base position
  26. 26. Duplicated Sequences I dont expect high number of duplicated sequences:  PCR artifact?
  27. 27. Distribution Length This is descriptive. Some sequencers output sequences of different length (e.g. 454)
  28. 28. Overrepresented SequencesQuestion: If you obtain the exact same sequence too many times Do you have a problem?Answer: Sometimes!Examples PCR primers (Illumina)  GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT  CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
  29. 29. K-mer Content  Helps to detect problems  Adapters?
  30. 30. Practical: FastQC and Fastx-toolkit Use FastQC to see your starting state. Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success! Hints: Try trimming, clipping and quality filtering. Go to the tutorial and try the exercises...

×