SlideShare a Scribd company logo
1 of 12
Download to read offline
Quality Control of NGS Data
Solutions
Surya Saha ss2489@cornell.edu
BTI PGRP Summer Internship Program 2014
Slides: Aureliano Bombarely ab782@cornell.edu
Goal:
Learn the use of read evaluation programs keeping
attention in relevant parameters such as qscore and
length distributions and reads duplications.
Data:
(Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools:
tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look of the files)
mv (command line, change the name of the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (gui, to calculate several stats for each file)
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 2
Exercise 1:
1. Untar and Unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. Raw data will be found in two dirs: breaker and
immature_fruit. Print the first 10 lines for the files:
SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq
and SRR404336_ch4.fq.
Question 1.1: Do these files have fastq format?
3. Change the extension of the .fq files to .fastq
4. Type ‘fastqc’ to start the FastQC program. Load the four
sequence files in the program.
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 3
Solution 1:
1. Untar and Unzip the file: ch4_demo_dataset.tar.gz
tar -zxvf ch4_demo_dataset.tar.gz
or in two steps:
gunzip ch4_demo_dataset.tar.gz
tar -xvf ch4_demo_dataset.tar
2. Raw data will be found in two dirs: breaker and immature_fruit.
Print the first 10 lines for the files: SRR404331_ch4.fq,
SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq.
head SRR404331_ch4.fq; head SRR404333_ch4.fq; head SRR404334_ch4.fq; head
SRR404336_ch4.fq;
Question 1.1: Do these files have fastq format? Yes
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 4
Solution 1:
• Change the extension of the .fq files to .fastq
mv SRR404331_ch4.fq SRR404331_ch4.fastq
mv SRR404333_ch4.fq SRR404333_ch4.fastq
mv SRR404334_ch4.fq SRR404334_ch4.fastq
mv SRR404336_ch4.fq SRR404336_ch4.fastq
• Count number of sequences in each fastq file using
commands you learnt earlier
grep –c ‘^+$’ SRR404331_ch4.fastq
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 5
Solution 1:
• Convert the fastq files to fasta
fastq_to_fasta –I SRR404331_ch4.fastq –o SRR404331_ch4.fasta –Q33
-n
The –Q33 flag is to denote Sanger Phred 33 encoding. It expects Illumina Phred+64 by
default. See http://seqanswers.com/forums/showthread.php?t=7596
-n tells it not to remove any sequences from the file. It removes any reads containing N’s by
default
• Now count the number of sequences in fasta file and see
if the number of sequences has changed.
grep –c ‘^>’ SRR404331_ch4.fasta
It may be different if you did not use -n parameter. Not using -n will remove sequences with
any ‘N’ bases
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 6
Solutions 2:
Question 2.2: How many sequences there are per file in
FastQC?
SRR404331_ch4.fastq = 762,365
SRR404333_ch4.fastq = 744,048
SRR404334_ch4.fastq = 592,123
SRR404336_ch4.fastq = 880,982
Question 2.3: Which is the length range for these reads?
53 in most of the cases, except in SRR404336 where is 54
Question 2.4: Which is the qscore range for these reads? Which
one looks best quality-wise?
The range goes from 2 in some cases such as SRR404334 to a maximum of 44. SRR404331 has
consistent quality over entire length with least 3’ end drop-off.
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 7
Solutions 1:
Question 2.5: Do these datasets have read
overrepresentation?
In some cases such as SRR404331. A Blast search of the top overrepresented sequence
reveals that it is the gene ASR1 from tomato, so it is not a contamination and probably it has
some biological relevance.
Question 2.6: Looking into the kmer content, do you think that
the samples have an adaptor?
No, it is possible to see some structure at the 5’ extreme, but it starts at the position 7 and
doesn’t affect to all the sequences.
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 8
Goal:
Trim the low quality ends of the reads and remove
the short reads.
Data:
(Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (gui, to calculate several stats for each file)
Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 9
Exercise 3:
• Download the file: adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCo
rpoica/adapters1.fa
• Run the read processing program over each of the
datasets using a min. qscore of 30 and a min. length of 40
bp.
fastq-mcf -q 30 -l 40 -o SRR404331_ch4.q30l40.fastq
/home/bioinfo/Downloads/adapters1.fa SRR404331_ch4.fastq
Remember to specify the right path for the adapter1.fa and the input file
Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 10
Exercise 2:
•Type ‘fastqc’ to start the FastQC program. Load the
four sequence files in the program. Compare the
results with the previous datasets.
Before fastq-mcf processing
880,982 reads
After fastq-mcf processing
840,759 reads
SRR404336
Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 11
Need Help??
7/8/2014 BTI PGRP Summer Internship Program 2014 12

More Related Content

What's hot

Imgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorialImgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorialDeanna Church
 
Aug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformaticsAug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformaticsGenomeInABottle
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...GenomeInABottle
 
NBIS RNA-seq course
NBIS RNA-seq courseNBIS RNA-seq course
NBIS RNA-seq coursePhil Ewels
 
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)Phil Ewels
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15palvaro
 
Adding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessAdding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessEnis Afgan
 
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)James Clause
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationPaul Groth
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierCrai Macdonald
 
ELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-coreELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-corePhil Ewels
 

What's hot (15)

Imgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorialImgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorial
 
Crash course in R and BioConductor
Crash course in R and BioConductorCrash course in R and BioConductor
Crash course in R and BioConductor
 
Aug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformaticsAug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformatics
 
3rd presentation
3rd presentation3rd presentation
3rd presentation
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
 
NBIS RNA-seq course
NBIS RNA-seq courseNBIS RNA-seq course
NBIS RNA-seq course
 
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15
 
Adding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessAdding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation Process
 
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
 
XSEDE15_PhastaGateway
XSEDE15_PhastaGatewayXSEDE15_PhastaGateway
XSEDE15_PhastaGateway
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computation
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
 
ELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-coreELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-core
 

Viewers also liked

Sequencing 2016
Sequencing 2016Sequencing 2016
Sequencing 2016Surya Saha
 
Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesNext-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesJan Aerts
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Sebastian Schmeier
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialThomas Keane
 

Viewers also liked (7)

Sequencing 2016
Sequencing 2016Sequencing 2016
Sequencing 2016
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesNext-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologies
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 

Similar to Quality Control of NGS Data Solutions

Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013Shen Lu
 
Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...Ivo Jimenez
 
Quality Control of Sequencing Data
Quality Control of Sequencing Data Quality Control of Sequencing Data
Quality Control of Sequencing Data Surya Saha
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorHoffman Lab
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRANRevolution Analytics
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 
Performance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4jPerformance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4jCraig Dickson
 
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...NETWAYS
 
White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...Perforce
 
Automatic test packet generation
Automatic test packet generationAutomatic test packet generation
Automatic test packet generationtusharjadhav2611
 
Instructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping codeInstructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping codeImperial College, London
 
Practical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedRob Vesse
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonSimon Frid
 
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...MITRE ATT&CK
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWork-Bench
 
Tracer Evaluation
Tracer EvaluationTracer Evaluation
Tracer EvaluationQiao Han
 

Similar to Quality Control of NGS Data Solutions (20)

Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013
 
Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...
 
Quality Control of Sequencing Data
Quality Control of Sequencing Data Quality Control of Sequencing Data
Quality Control of Sequencing Data
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processor
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
BioMake BOSC 2004
BioMake BOSC 2004BioMake BOSC 2004
BioMake BOSC 2004
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 
Performance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4jPerformance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4j
 
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
 
White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...
 
Automatic test packet generation
Automatic test packet generationAutomatic test packet generation
Automatic test packet generation
 
Instructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping codeInstructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping code
 
Practical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking Revisited
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Optimizing and Profiling Golang Rest Api
Optimizing and Profiling Golang Rest ApiOptimizing and Profiling Golang Rest Api
Optimizing and Profiling Golang Rest Api
 
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
 
Tracer Evaluation
Tracer EvaluationTracer Evaluation
Tracer Evaluation
 

More from Surya Saha

An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...Surya Saha
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomesSurya Saha
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Surya Saha
 
Updates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meetingUpdates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meetingSurya Saha
 
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meetingUpdates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meetingSurya Saha
 
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant DiseasesAgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant DiseasesSurya Saha
 
Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Surya Saha
 
Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...Surya Saha
 
Sequencing 2017
Sequencing 2017Sequencing 2017
Sequencing 2017Surya Saha
 
Community resources for all y’all Omics
Community resources for all y’all OmicsCommunity resources for all y’all Omics
Community resources for all y’all OmicsSurya Saha
 
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
 CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis... CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...Surya Saha
 
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...Surya Saha
 
Tomato Genome Build SL3.0
Tomato Genome Build SL3.0Tomato Genome Build SL3.0
Tomato Genome Build SL3.0Surya Saha
 
Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015Surya Saha
 
Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015Surya Saha
 
Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…Surya Saha
 
Sequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN PlatformSequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN PlatformSurya Saha
 
ICAR Soybean Indore 2014
ICAR Soybean Indore 2014ICAR Soybean Indore 2014
ICAR Soybean Indore 2014Surya Saha
 
Sequencing: The Next Generation
Sequencing: The Next GenerationSequencing: The Next Generation
Sequencing: The Next GenerationSurya Saha
 

More from Surya Saha (20)

An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomes
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
 
Updates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meetingUpdates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meeting
 
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meetingUpdates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting
 
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant DiseasesAgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
 
Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...
 
Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...
 
Sequencing 2017
Sequencing 2017Sequencing 2017
Sequencing 2017
 
Community resources for all y’all Omics
Community resources for all y’all OmicsCommunity resources for all y’all Omics
Community resources for all y’all Omics
 
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
 CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis... CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
 
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
 
Tomato Genome Build SL3.0
Tomato Genome Build SL3.0Tomato Genome Build SL3.0
Tomato Genome Build SL3.0
 
Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015
 
Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015
 
Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…
 
Sequencing
SequencingSequencing
Sequencing
 
Sequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN PlatformSequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN Platform
 
ICAR Soybean Indore 2014
ICAR Soybean Indore 2014ICAR Soybean Indore 2014
ICAR Soybean Indore 2014
 
Sequencing: The Next Generation
Sequencing: The Next GenerationSequencing: The Next Generation
Sequencing: The Next Generation
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Quality Control of NGS Data Solutions

  • 1. Quality Control of NGS Data Solutions Surya Saha ss2489@cornell.edu BTI PGRP Summer Internship Program 2014 Slides: Aureliano Bombarely ab782@cornell.edu
  • 2. Goal: Learn the use of read evaluation programs keeping attention in relevant parameters such as qscore and length distributions and reads duplications. Data: (Illumina data for two tomato ripening stages) /home/bioinfo/Data/ch4_demo_dataset.tar.gz Tools: tar -zxvf (command line, untar and unzip the files) head (command line, take a quick look of the files) mv (command line, change the name of the files) grep (command line, find/count patterns in files) FASTX toolkit (command line, process fasta/fastq) FastQC (gui, to calculate several stats for each file) Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 2
  • 3. Exercise 1: 1. Untar and Unzip the file: /home/bioinfo/Data/ch4_demo_dataset.tar.gz 2. Raw data will be found in two dirs: breaker and immature_fruit. Print the first 10 lines for the files: SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq. Question 1.1: Do these files have fastq format? 3. Change the extension of the .fq files to .fastq 4. Type ‘fastqc’ to start the FastQC program. Load the four sequence files in the program. Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 3
  • 4. Solution 1: 1. Untar and Unzip the file: ch4_demo_dataset.tar.gz tar -zxvf ch4_demo_dataset.tar.gz or in two steps: gunzip ch4_demo_dataset.tar.gz tar -xvf ch4_demo_dataset.tar 2. Raw data will be found in two dirs: breaker and immature_fruit. Print the first 10 lines for the files: SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq. head SRR404331_ch4.fq; head SRR404333_ch4.fq; head SRR404334_ch4.fq; head SRR404336_ch4.fq; Question 1.1: Do these files have fastq format? Yes Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 4
  • 5. Solution 1: • Change the extension of the .fq files to .fastq mv SRR404331_ch4.fq SRR404331_ch4.fastq mv SRR404333_ch4.fq SRR404333_ch4.fastq mv SRR404334_ch4.fq SRR404334_ch4.fastq mv SRR404336_ch4.fq SRR404336_ch4.fastq • Count number of sequences in each fastq file using commands you learnt earlier grep –c ‘^+$’ SRR404331_ch4.fastq Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 5
  • 6. Solution 1: • Convert the fastq files to fasta fastq_to_fasta –I SRR404331_ch4.fastq –o SRR404331_ch4.fasta –Q33 -n The –Q33 flag is to denote Sanger Phred 33 encoding. It expects Illumina Phred+64 by default. See http://seqanswers.com/forums/showthread.php?t=7596 -n tells it not to remove any sequences from the file. It removes any reads containing N’s by default • Now count the number of sequences in fasta file and see if the number of sequences has changed. grep –c ‘^>’ SRR404331_ch4.fasta It may be different if you did not use -n parameter. Not using -n will remove sequences with any ‘N’ bases Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 6
  • 7. Solutions 2: Question 2.2: How many sequences there are per file in FastQC? SRR404331_ch4.fastq = 762,365 SRR404333_ch4.fastq = 744,048 SRR404334_ch4.fastq = 592,123 SRR404336_ch4.fastq = 880,982 Question 2.3: Which is the length range for these reads? 53 in most of the cases, except in SRR404336 where is 54 Question 2.4: Which is the qscore range for these reads? Which one looks best quality-wise? The range goes from 2 in some cases such as SRR404334 to a maximum of 44. SRR404331 has consistent quality over entire length with least 3’ end drop-off. Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 7
  • 8. Solutions 1: Question 2.5: Do these datasets have read overrepresentation? In some cases such as SRR404331. A Blast search of the top overrepresented sequence reveals that it is the gene ASR1 from tomato, so it is not a contamination and probably it has some biological relevance. Question 2.6: Looking into the kmer content, do you think that the samples have an adaptor? No, it is possible to see some structure at the 5’ extreme, but it starts at the position 7 and doesn’t affect to all the sequences. Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 8
  • 9. Goal: Trim the low quality ends of the reads and remove the short reads. Data: (Illumina data for two tomato ripening stages) ch4_demo_dataset.tar.gz Tools: fastq-mcf (command line tool to process reads) FastQC (gui, to calculate several stats for each file) Preprocessing 7/8/2014 BTI PGRP Summer Internship Program 2014 9
  • 10. Exercise 3: • Download the file: adapters1.fa from ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCo rpoica/adapters1.fa • Run the read processing program over each of the datasets using a min. qscore of 30 and a min. length of 40 bp. fastq-mcf -q 30 -l 40 -o SRR404331_ch4.q30l40.fastq /home/bioinfo/Downloads/adapters1.fa SRR404331_ch4.fastq Remember to specify the right path for the adapter1.fa and the input file Preprocessing 7/8/2014 BTI PGRP Summer Internship Program 2014 10
  • 11. Exercise 2: •Type ‘fastqc’ to start the FastQC program. Load the four sequence files in the program. Compare the results with the previous datasets. Before fastq-mcf processing 880,982 reads After fastq-mcf processing 840,759 reads SRR404336 Preprocessing 7/8/2014 BTI PGRP Summer Internship Program 2014 11
  • 12. Need Help?? 7/8/2014 BTI PGRP Summer Internship Program 2014 12