What is Phred/Phrap/Consed?
Phred/Phrap/Consed is a software package, distributed
worldwide, for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each
individual base;
c. Identifying and masking vector and repeat
sequences;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
Workflow (diagram, from starting material to final sequence):
Whole genome or BAC/cosmid clone
-> DNA fragmentation (sonic disruption, nebulization)
-> Small fragments (1.0 - 2.0 kb)
-> Clone library (pUC18)
-> DNA sequencing (random clones)
-> Partial assembly (contigs)
-> Finishing (quality, coverage of both strands, gap filling)
-> Final consensus sequence
Trace File
High quality region
- no ambiguities (Ns)
- no noise
- peaks very well spaced
Trace File
Medium quality region – some ambiguities (Ns)
Trace File
Poor quality region – low confidence
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by a poor-quality template, a bad matrix, or a low signal-to-noise
ratio
Trace File
Poor quality region – low confidence
Poor quality read:
- many ambiguities (Ns)
- noise
- caused by homopolymeric region/polymerase slippage
Sudden drop artifact:
- good quality region is followed by a sudden drop of signal
- caused by secondary structure
Sequence Assembly
The phred software reads DNA sequencing trace files, calls bases, and assigns a
quality value to each base.
The quality value is a log-transformed error probability: Q = -10 log10(Pe), where Q
and Pe are, respectively, the quality value and the error probability of a
particular base call.
Phred can use the quality values to perform sequence trimming.
Phred works well with trace files from most manufacturers' sequencing machines.
The program was developed by Drs. Phil Green and Brent Ewing, and is copyrighted
by the University of Washington.
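The relation between quality value and error probability can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and are not part of phred.

```python
import math


def error_probability(q: float) -> float:
    """Error probability Pe for a quality value Q, from Q = -10 log10(Pe)."""
    return 10 ** (-q / 10)


def quality_value(pe: float) -> float:
    """Quality value Q for an error probability Pe (0 < Pe <= 1)."""
    if not 0 < pe <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(pe)
```

For example, error_probability(20) gives an error probability of about 0.01, i.e. roughly 1 wrong call in 100.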
Phred generates highly accurate, base-specific quality scores
Quality scores range from 4 to about 60, with higher values
corresponding to higher quality
Phred quality score    Probability that the base is called wrong    Accuracy of the base call
10                     1 in 10                                      90%
20                     1 in 100                                     99%
30                     1 in 1,000                                   99.9%
40                     1 in 10,000                                  99.99%
50                     1 in 100,000                                 99.999%
Ideal tool to assess the quality of sequences
The most commonly used method is to count the bases with a quality
score of 20 and above (sometimes called "high quality bases"); the
resulting number is often called the "Phred20 score"
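Counting high quality bases is a one-line operation once the quality values of a read are available as a list of integers. The sketch below assumes that representation; the helper names are hypothetical. It also adds the expected number of errors in the read, which follows directly from Q = -10 log10(Pe).

```python
def phred20_score(qualities, threshold=20):
    """Number of bases whose quality value is at or above the threshold."""
    return sum(1 for q in qualities if q >= threshold)


def expected_errors(qualities):
    """Sum of per-base error probabilities, i.e. the expected number of wrong calls."""
    return sum(10 ** (-q / 10) for q in qualities)
```

A read with qualities [4, 19, 20, 35, 50] has three bases at Q20 or above.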
Conversion of phd files into FASTA files: the phd2fasta script
Features:
- Phred creates single-sequence files containing the sequence
itself plus the quality assignments (phd files)
- The input file for the cross_match and phrap programs is a multiple
sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two
multiple sequence FASTA format files, containing the sequence
information and the basecall quality information, respectively
- The phredPhrap script automatically executes phd2fasta before
running cross_match and phrap!
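The conversion described above can be sketched in Python. This is not the phd2fasta script itself; it assumes the usual phd layout, in which the read name follows BEGIN_SEQUENCE and each line between BEGIN_DNA and END_DNA holds a base, its quality value, and a peak position. The function names are hypothetical.

```python
def parse_phd(text):
    """Return (name, bases, qualities) from the text of one phd file."""
    name = None
    bases, quals = [], []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(maxsplit=1)[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()  # base, quality, peak position
            bases.append(fields[0])
            quals.append(int(fields[1]))
    return name, "".join(bases), quals


def to_fasta_and_qual(records):
    """Build the two multi-sequence outputs: sequences and quality values."""
    fasta, qual = [], []
    for name, bases, quals in records:
        fasta.append(f">{name}\n{bases}\n")
        qual.append(f">{name}\n{' '.join(map(str, quals))}\n")
    return "".join(fasta), "".join(qual)
```

Both outputs use the same record names, so a sequence and its quality values can be matched up by the downstream programs.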
Phrap
Phragment Assembly Program or… Phil’s Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data
Key Features:
a. Uses the entire read content – no need for trimming.
b. Combines user-supplied data (e.g. Repbase) with internally computed data –
better accuracy of assembly in the presence of repeats.
c. The contig sequence is a mosaic of the highest
quality parts of the reads – it’s not a consensus!
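The "mosaic of the highest quality parts" idea can be illustrated with a toy example. The sketch below is a deliberate simplification and not phrap's algorithm: it assumes the reads are already aligned to the same length with '-' as padding, and it picks the best base column by column, whereas phrap works with longer stretches of reads.

```python
def mosaic_sequence(aligned_reads):
    """Pick, at each column, the base from the read with the highest quality.

    aligned_reads: list of (sequence, qualities) pairs of equal length;
    '-' marks positions a read does not cover (its quality is ignored).
    Returns the chosen sequence and the quality of each chosen base.
    """
    length = len(aligned_reads[0][0])
    chosen, chosen_quals = [], []
    for i in range(length):
        best_base, best_q = "N", -1
        for seq, quals in aligned_reads:
            if seq[i] != "-" and quals[i] > best_q:
                best_base, best_q = seq[i], quals[i]
        chosen.append(best_base)
        chosen_quals.append(max(best_q, 0))
    return "".join(chosen), chosen_quals
```

Note that a low quality base that disagrees with a high quality one is simply ignored; no vote is taken among the reads.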
Phrap is a program for assembling shotgun DNA sequence data.
Accurate consensus sequences. Phrap uses Phred's quality scores to determine
highly accurate consensus sequences. Phrap examines all individual sequences at a
given position, and generally uses the highest quality sequence to build the
consensus - similar to the way scientists would correct consensus sequences during
"contig editing".
Consensus quality estimates. Phrap uses the quality information of individual
sequences to estimate the quality of the consensus sequence. In addition, Phrap
uses available information about sequencing chemistry (dye terminator or dye
primer) and confirmation by "other strand" reads in estimating the consensus
quality. This often allows scientists to ignore random errors, and to focus finishing
efforts exclusively on regions where the data quality is insufficient. Consensus
quality estimates can also be very helpful in mutation detection by DNA
sequencing.
Ability to assemble very large projects. Phrap has been used routinely to assemble
bacterial genomes sequenced by the "shotgun" approach, where each project
contained tens of thousands of reads. Smaller bacterial genomes (2 million bases or
less) could often be assembled in less than three hours.
Improved identification and handling of repeats. Phrap uses quality scores to
estimate whether discrepancies between two overlapping sequences are more
likely to arise from random errors, or from different copies of a repeated
sequence. For repeats with 95 to 98% identity (like human Alu sequences) and
high quality sequence data, this typically yields correct assemblies.
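The reasoning behind this use of quality scores can be sketched as follows. This is an illustration only and not phrap's scoring scheme: it assumes two reads already aligned base by base, independent errors, and an arbitrary cutoff of Q30. A mismatch where both calls have high quality is unlikely to be a random error, so it hints at two different repeat copies.

```python
def error_probability(q):
    """Error probability for a quality value, from Q = -10 log10(Pe)."""
    return 10 ** (-q / 10)


def prob_any_call_wrong(q1, q2):
    """Probability that at least one of two base calls is wrong (independent errors)."""
    return 1 - (1 - error_probability(q1)) * (1 - error_probability(q2))


def high_quality_discrepancies(seq_a, quals_a, seq_b, quals_b, min_q=30):
    """Positions where two aligned reads disagree and both calls have quality >= min_q."""
    return [
        i
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and quals_a[i] >= min_q and quals_b[i] >= min_q
    ]
```

With two Q40 calls, the chance that either is wrong is about 2 in 10,000, so several such discrepancies in one overlap are hard to explain by sequencing errors alone.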
Fast assemblies. Assemblies of cosmid- to BAC sized projects with several hundred
to two thousand reads typically take only minutes to complete on high-powered
workstations or personal computers.
Cross_match: Fast DNA Sequence Comparisons and Vector Screening
Identification of overlaps between contig ends after assembly with Phrap
Identification of potential repeat sequences in assemblies.
Generation of error summaries and lists after completion of sequencing projects.
Estimation of vector contamination in newly created libraries.
Consed/Autofinish is a tool for viewing, editing, and finishing
sequence assemblies created with phrap. Finishing capabilities
include allowing the user to pick primers and templates, suggesting
additional sequencing reactions to perform, and facilitating checking
the accuracy of the assembly using digest and forward/reverse pair
information.
See the consed page for additional information.
References:
Gordon, David. "Viewing and Editing Assembled Sequences Using
Consed", in Current Protocols in Bioinformatics, A. D. Baxevanis and
D. B. Davison, eds, New York: John Wiley & Co., 2004, 11.2.1-11.2.43.
Consed windows (figures from Gordon D et al. Genome Res. 1998;8:195-202,
Cold Spring Harbor Laboratory Press):
- Aligned reads window.
- Navigation window.
- Trace window.
- Compare contigs window, indicating an alignment selected to investigate a contig match
indicated in phrapview.
AG-ICB-USP
Phrap/Phred/Consed pipeline stages, from input to finishing:
Input
    chromatogram files
Quality (confidence) values assignment: Phred
    phd files - *.phd
Conversion - phd to fasta: phd2fasta.pl
    nucleotide sequences - seq.fasta
    quality values - seq.fasta.qual
Vector screening and masking: Cross_Match (local alignment program) x vector.seq
    screened/masked file - seq.fasta.screen
    quality values - seq.fasta.screen.qual
Assembly: Phrap
    assembled contigs - seq.fasta.screen.contigs
    assembly file - seq.fasta.screen.ace#
Assembly viewing/editing: Consed
Finishing: Autofinish and manual finishing
Phred/Phrap/Consed Pipeline
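The file naming in this pipeline follows a simple pattern: each stage appends a suffix to the name produced by the previous one. The sketch below only derives the file names listed on the slide; it does not run any of the programs. Reading the "#" after ".ace" as a numeric version suffix (for example ".ace.1") is an assumption, and the function name is hypothetical.

```python
def pipeline_files(prefix="seq.fasta", ace_version=1):
    """File names produced at each stage, following the slide's naming pattern."""
    screened = prefix + ".screen"
    return {
        # phd2fasta: sequences and quality values collected from the phd files
        "phd2fasta": [prefix, prefix + ".qual"],
        # cross_match against vector.seq: masked sequences and their qualities
        "cross_match": [screened, screened + ".qual"],
        # phrap: assembled contigs and the assembly (ace) file read by consed
        # (assumption: '#' on the slide stands for a numeric version)
        "phrap": [screened + ".contigs", f"{screened}.ace.{ace_version}"],
    }
```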
Finishing Problems
DNA sequencing problems
a. High GC content – genomes with a high GC content are more
prone to generate artifacts such as compressions, sudden drops, and bad quality
regions. Try to use Dye Primer instead of Dye Terminator, change chemistry,
add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP,
etc.
b. Palindromic regions – lead to strong secondary structures causing
sudden drops. Try to use deaza-dGTP instead of dGTP, or amplify the problematic
region by PCR and sequence the product.
c. Homopolymeric regions – can reduce DNA synthesis efficiency for
some chemistries. Try to use Dye Primer instead of Dye Terminator, or change
chemistry (dRhodamine instead of BigDye).
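Regions likely to cause these problems can be flagged in a sequence that is already known. The following is a minimal sketch with illustrative helper names: GC content, the longest homopolymer run, and a check for a perfect palindrome (a sequence equal to its own reverse complement). What counts as "too high" or "too long" depends on the chemistry and is left to the user.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of the sequence."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in "GC" for b in counted) / len(counted)


def longest_homopolymer(seq):
    """Return (base, length) of the longest run of a single repeated base."""
    best_base, best_len = "", 0
    run_base, run_len = "", 0
    for b in seq.upper():
        if b == run_base:
            run_len += 1
        else:
            run_base, run_len = b, 1
        if run_len > best_len:
            best_base, best_len = run_base, run_len
    return best_base, best_len


def is_palindromic(seq):
    """True if the sequence equals its own reverse complement."""
    seq = seq.upper()
    reverse_complement = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return seq == reverse_complement
```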
DNA assembly problems
a. High repeat content – highly repeated elements reduce the accuracy of DNA
assembly. Identify the repeat unit, screen it with Cross_Match or RepeatMasker
and mask it. Try to assemble again and add the repetitive region only at the end.
Map the repetitive region using restriction enzymes to estimate its size and
number of repeat units.
b. High AT content – some highly biased genomes (e.g. Plasmodium falciparum;
plastid genomes) can pose a problem for assembly programs. Very difficult to
solve. Try to determine a restriction map and associate mapping with DNA
sequencing data.
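Masking a repeat unit means replacing it with a placeholder character so that it cannot seed false overlaps during assembly. The sketch below is a simplified stand-in for that step: it masks exact copies of the unit only, whereas the real screening tools find approximate matches by alignment. The function name and the use of "X" as the mask character here are choices made for this illustration.

```python
def mask_repeat(seq, unit, mask_char="X"):
    """Replace every exact occurrence of unit (overlapping ones included) with mask_char."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    masked = list(seq)
    start = seq.find(unit)
    while start != -1:
        for i in range(start, start + len(unit)):
            masked[i] = mask_char
        # Continue one position later so overlapping copies are also found.
        start = seq.find(unit, start + 1)
    return "".join(masked)
```

The masked sequence keeps its original length, so base positions and quality values still line up.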
Staden Package
The Staden Package was developed by Rodger Staden's group at the
MRC Cambridge
The main components are:
pregap4 - base calling with Phred, end clipping, and vector
trimming.
trev - trace viewing and editing
gap4 - sequence assembly, contig editing, and finishing
gap5 - assembly visualisation, editing and finishing of NGS
data
Spin - DNA and protein sequence analysis
Staden Package
PreGap
Pregap is used to process raw
traces.
It is used to mask unwanted
sequence, such as bits of
vector and poor quality
sequence.
Gap
Gap is the Genome Assembly
Program – the bit which
actually assembles individual
fragments into long contigs.
It allows you to edit the
assembly, referring back to
the starting traces where they
CUFFLINKS
Cufflinks assembles transcripts, estimates their abundances,
and tests for differential expression and regulation in RNA-Seq
samples. It can identify novel transcripts in your sequencing data by
examining their alignments to the genome.
Cufflinks is usually run after mapping reads to the genome
with its sister tool TopHat.
Submission of sequences
BankIt, a WWW-based submission tool with wizards to guide the submission
process
Sequin, NCBI's stand-alone submission tool with wizards to guide the submission
process, is available by FTP for use on Mac, PC, and UNIX platforms.
tbl2asn, a command-line program, automates the creation of sequence records for
submission to GenBank using many of the same functions as Sequin. It is used
primarily for submission of complete genomes and large batches of sequences and
is available by FTP for use on Mac, PC, and UNIX platforms.
Submission Portal, a unified system for multiple submission types. Currently only
16S ribosomal RNA from uncultured bacteria/archaea can be submitted with the
GenBank component of this tool. This will be expanded in the future to include
other types of GenBank submissions. Genome and Transcriptome Assemblies can be
submitted through the WGS and TSA portals, respectively.
Barcode Submission Tool, a WWW-based tool for the submission of sequences and
trace read data for Barcode of Life projects based on the COI gene.
BankIt, Submission Portal and Barcode Submission Tool entries are
automatically submitted to GenBank. Submissions made with Sequin or
tbl2asn must be mailed to gb-sub@ncbi.nlm.nih.gov.
Large files which may be truncated during mailing with conventional mail
tools should be submitted directly using Sequin MacroSend.
Submissions of Raw Sequence Reads
Reads of Sanger-style sequencing can be submitted to the Trace
Archive.
Runs of next-generation sequencing, for example from 454 or
Illumina, can be submitted to the Sequence Read Archive (SRA).
You should use BankIt if:
You have a single sequence, a simple set of sequences (for
example: 16S rRNA, matK, ITS/rRNA, amoE, tefB, cytb, or
COI sets), or a small batch of different sequences
You prefer to use a web-based submission tool
The feature annotation for your sequences is not
complicated
You do not require advanced sequence analysis tools
Sequin
You should use tbl2asn if:
Your sequence has a lot of annotation
You are submitting a large batch of sequences
You have Whole Genome Shotgun (WGS) submissions
You have complete genome submissions
You are submitting FLIC sequences
Sequence formats
NCBI
GenBank
EMBL
Stanford University

Sequence assembly

  • 1. What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a worldwide distributed package for: a. Trace file (chromatograms) reading; b. Quality (confidence) assignment to each individual base; c. Vector and repeat sequences identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing. Whole genome BAC/cosmid clone final consensus sequence Finishing quality both stands coverage gap filling Partial Assembly contigs DNA sequencing random clones Clone library pUC18 Small fragments 1.0 - 2.0 kb DNA fragmentation sonic disruption nebulization Whole genome BAC/cosmid clone
  • 3. Trace File. High quality region: no ambiguities (Ns), no noise, peaks very well spaced.
  • 4. Trace File Medium quality region – some ambiguities (Ns)
  • 5. Trace File. Poor quality region (low confidence): some ambiguities (Ns); bad noise (notice the baseline); overlapping peaks. Can be caused by a bad quality template, a bad matrix, or a low signal-to-noise ratio.
  • 6. Trace File. Poor quality region (low confidence). Poor quality read: many ambiguities (Ns); noise; caused by a homopolymeric region/polymerase slippage.
  • 7. Sudden drop artifact: - good quality region is followed by a sudden drop of signal - caused by secondary structure
  • 8. Sequence Assembly. The phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each base. The quality value is a log-transformed error probability, specifically Q = -10 log10(Pe), where Q and Pe are respectively the quality value and error probability of a particular base call. Phred can use the quality values to perform sequence trimming. Phred works well with trace files from most manufacturers' sequencing machines. The program was developed by Drs. Phil Green and Brent Ewing, and is copyrighted by the University of Washington.
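The relation Q = -10 log10(Pe) on this slide can be checked numerically. The following is a minimal Python sketch, not part of phred itself; the function names `error_probability` and `quality_value` are made up for illustration.

```python
import math


def error_probability(q: float) -> float:
    """Error probability Pe implied by a Phred quality value Q."""
    return 10 ** (-q / 10)


def quality_value(pe: float) -> float:
    """Phred quality value Q = -10 * log10(Pe) for an error probability Pe."""
    if not 0 < pe <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(pe)


if __name__ == "__main__":
    # Print Q next to the error probability it stands for.
    for q in (10, 20, 30, 40, 50):
        print(q, error_probability(q))
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability, which is why Q = 20 means 1 error in 100 and Q = 30 means 1 in 1,000.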
  • 9. Phred generates highly accurate, base-specific quality scores. Quality scores range from 4 to about 60, with higher values corresponding to higher quality.
       Phred quality score | Probability that the base is called wrong | Accuracy of the base call
       10 | 1 in 10 | 90%
       20 | 1 in 100 | 99%
       30 | 1 in 1,000 | 99.9%
       40 | 1 in 10,000 | 99.99%
       50 | 1 in 100,000 | 99.999%
     Ideal tool to assess the quality of sequences. The most commonly used method is to count the bases with a quality score of 20 and above (sometimes called "high quality bases"); the resulting number is often called the "Phred20 score".
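The "Phred20 score" described on this slide is just a count of bases at or above a quality threshold. A minimal Python sketch follows, assuming the per-base qualities are already available as a list of integers; the function name `phred20_score` is illustrative only.

```python
def phred20_score(qualities, threshold=20):
    """Count the bases whose quality value is at or above the threshold."""
    return sum(1 for q in qualities if q >= threshold)


if __name__ == "__main__":
    read_qualities = [4, 19, 20, 35, 60]  # made-up example values
    print(phred20_score(read_qualities))
```

Keeping the threshold as a parameter makes it easy to report other cut-offs (for example Q30) with the same function.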
  • 10. Conversion of phd files into FASTA files: the phd2fasta script. Features: - Phred creates single-sequence files containing the sequence itself plus the quality assignments (phd files). - The input file for the cross_match and phrap programs is a multiple-sequence file in FASTA format. - A Perl script named phd2fasta converts the phd files into two multiple-sequence FASTA-format files, containing the sequence information and the base-call quality information respectively. - The phredPhrap script automatically executes phd2fasta before running cross_match and phrap.
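To make the conversion on this slide concrete, here is a small Python sketch of the idea. It is not the phd2fasta Perl script. It assumes the usual phd layout, in which the lines between BEGIN_DNA and END_DNA hold a base, its quality, and its trace position; the example record and the function names are invented for illustration.

```python
EXAMPLE_PHD = """\
BEGIN_SEQUENCE read1

BEGIN_COMMENT
CHROMAT_FILE: read1
END_COMMENT

BEGIN_DNA
a 30 5
c 25 17
g 8 29
END_DNA

END_SEQUENCE
"""


def parse_phd(text):
    """Return (read name, sequence, quality list) from the text of one phd file."""
    name, bases, quals, in_dna = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split()[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()  # base, quality, trace position
            bases.append(fields[0])
            quals.append(int(fields[1]))
    return name, "".join(bases), quals


def to_fasta_and_qual(name, seq, quals):
    """Build one FASTA record and the matching quality record."""
    fasta = f">{name}\n{seq}\n"
    qual = f">{name}\n{' '.join(str(q) for q in quals)}\n"
    return fasta, qual


if __name__ == "__main__":
    fasta, qual = to_fasta_and_qual(*parse_phd(EXAMPLE_PHD))
    print(fasta + qual, end="")
```

Concatenating the records from many phd files would give the two multiple-sequence files (sequence and quality) that cross_match and phrap take as input.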
  • 11. Phrap: Phragment Assembly Program or… Phil's Revised Assembly Program. Phrap is a program for assembling shotgun DNA sequence data. Key Features: a. Uses the entire read content: no need for trimming. b. Combines user-supplied data (e.g. Repbase) with internally computed data: better accuracy of assembly in the presence of repeats. c. The contig sequence is a mosaic of the highest quality parts of the reads: it is not a consensus!
  • 12. Phrap is a program for assembling shotgun DNA sequence data. Accurate consensus sequences. Phrap uses Phred's quality scores to determine highly accurate consensus sequences. Phrap examines all individual sequences at a given position, and generally uses the highest quality sequence to build the consensus, similar to the way scientists would correct consensus sequences during "contig editing". Consensus quality estimates. Phrap uses the quality information of individual sequences to estimate the quality of the consensus sequence. In addition, Phrap uses available information about sequencing chemistry (dye terminator or dye primer) and confirmation by "other strand" reads in estimating the consensus quality. This often allows scientists to ignore random errors, and to focus finishing efforts exclusively on regions where the data quality is insufficient. Consensus quality estimates can also be very helpful in mutation detection by DNA sequencing. Ability to assemble very large projects. Phrap has been used routinely to assemble bacterial genomes sequenced by the "shotgun" approach, where each project contained tens of thousands of reads. Smaller bacterial genomes (2 million bases or less) could often be assembled in less than three hours.
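The idea of taking the highest quality read at each position can be illustrated with a toy example. The Python sketch below is a deliberate simplification and not Phrap's actual algorithm: it assumes the reads are already aligned to the same coordinates, uses "-" where a read gives no coverage, and picks the base with the highest quality column by column. The function name `mosaic_consensus` is hypothetical.

```python
def mosaic_consensus(reads):
    """Pick, per column, the base from the read with the highest quality.

    reads: list of (aligned_bases, qualities) pairs of equal length,
    where '-' in aligned_bases means the read does not cover that column.
    """
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        candidates = [(quals[i], bases[i]) for bases, quals in reads if bases[i] != "-"]
        if not candidates:
            consensus.append("N")  # no read covers this column
        else:
            # On equal quality, the first read in the list wins.
            best = max(candidates, key=lambda c: c[0])
            consensus.append(best[1])
    return "".join(consensus)


if __name__ == "__main__":
    reads = [
        ("ACGT-", [30, 30, 10, 30, 0]),
        ("-CTTA", [0, 20, 40, 20, 35]),
    ]
    print(mosaic_consensus(reads))
```

In the third column the two reads disagree (G at quality 10, T at quality 40), and the higher quality call is kept, which mirrors what a person would do during manual contig editing.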
  • 13. Improved identification and handling of repeats. Phrap uses quality scores to estimate whether discrepancies between two overlapping sequences are more likely to arise from random errors, or from different copies of a repeated sequence. For repeats with 95 to 98% identity (like human Alu sequences) and high quality sequence data, this typically yields correct assemblies. Fast assemblies. Assemblies of cosmid- to BAC-sized projects with several hundred to two thousand reads typically take only minutes to complete on high-powered workstations or personal computers. Cross_match: fast DNA sequence comparisons and vector screening. Uses: identification of overlaps between contig ends after assembly with Phrap; identification of potential repeat sequences in assemblies; generation of error summaries and lists after completion of sequencing projects; estimation of vector contamination in newly created libraries.
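Why quality scores help tell sequencing errors from repeat copies can be seen with a small calculation. The Python sketch below is only an illustration under a strong assumption (independent errors in the two reads) and is not how Phrap scores discrepancies; the function name is invented. It gives the probability that both disagreeing base calls are correct, in which case the two reads most likely come from different repeat copies.

```python
def prob_both_calls_correct(q1: float, q2: float) -> float:
    """Probability that two base calls with Phred qualities q1 and q2 are both right.

    Assumes the errors in the two reads are independent.
    """
    p1 = 10 ** (-q1 / 10)  # error probability of the first call
    p2 = 10 ** (-q2 / 10)  # error probability of the second call
    return (1 - p1) * (1 - p2)


if __name__ == "__main__":
    # Two high-quality calls that disagree: a base-calling error is unlikely.
    print(prob_both_calls_correct(40, 40))
    # Two low-quality calls that disagree: a base-calling error is quite plausible.
    print(prob_both_calls_correct(10, 10))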
  • 14. Consed/Autofinish is a tool for viewing, editing, and finishing sequence assemblies created with phrap. Finishing capabilities include allowing the user to pick primers and templates, suggesting additional sequencing reactions to perform, and facilitating checking the accuracy of the assembly using digest and forward/reverse pair information. See the consed page for additional information. References: Gordon, David. "Viewing and Editing Assembled Sequences Using Consed", in Current Protocols in Bioinformatics, A. D. Baxevanis and D. B. Davison, eds, New York: John Wiley & Co., 2004, 11.2.1-11.2.43.
  • 15. Aligned reads window Gordon D et al. Genome Res. 1998;8:195-202 Cold Spring Harbor Laboratory Press
  • 16. Navigation window. Gordon D et al. Genome Res. 1998;8:195-202 Cold Spring Harbor Laboratory Press
  • 17. Trace window. Gordon D et al. Genome Res. 1998;8:195-202 Cold Spring Harbor Laboratory Press
  • 20. Compare contigs window, indicating an alignment selected to investigate a contig match indicated in phrapview. Gordon D et al. Genome Res. 1998;8:195-202 Cold Spring Harbor Laboratory Press
  • 21. Phred/Phrap/Consed Pipeline. Input: chromatogram files -> Quality (confidence) value assignment with Phred: phd files (*.phd) -> Conversion from phd to FASTA with phd2fasta.pl: nucleotide sequences (seq.fasta), quality values (seq.fasta.qual) -> Vector screening and masking with Cross_Match (local alignment program) x vector.seq: screened/masked file (seq.fasta.screen), quality values (seq.fasta.screen.qual) -> Assembly with Phrap: assembled contigs (seq.fasta.screen.contigs), assembly file (seq.fasta.screen.ace#) -> Assembly viewing/editing with Consed -> Finishing: Autofinish and manual finishing.
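The file names on the pipeline slide follow a simple pattern built on one base name. The Python sketch below only lists which files each stage is expected to leave behind, as named on the slide; it does not run any of the programs. The function name `pipeline_files` is hypothetical, and the trailing "#" on the ace file is copied from the slide, where it presumably stands for a version number.

```python
def pipeline_files(base="seq"):
    """Return (stage, expected output files) pairs in pipeline order."""
    return [
        ("Phred", ["*.phd"]),
        ("phd2fasta", [f"{base}.fasta", f"{base}.fasta.qual"]),
        ("Cross_Match", [f"{base}.fasta.screen", f"{base}.fasta.screen.qual"]),
        # '#' is kept as written on the slide (assumed to be a version placeholder).
        ("Phrap", [f"{base}.fasta.screen.contigs", f"{base}.fasta.screen.ace#"]),
    ]


if __name__ == "__main__":
    for stage, files in pipeline_files("seq"):
        print(f"{stage}: {', '.join(files)}")
```

A listing like this is handy for checking a project directory after a run: if a stage's files are missing, that is the step to look at first.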
  • 22. Finishing Problems. DNA sequencing problems: a. High GC content: genomes with a high GC content are more prone to generate artifacts such as compressions, sudden drops, and bad quality regions. Try to use Dye Primer instead of Dye Terminator, change chemistry, add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP, etc. b. Palindromic regions: lead to strong secondary structures causing sudden drops. Try to use deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product. c. Homopolymeric regions: can reduce DNA synthesis efficiency for some chemistries. Try to use Dye Primer instead of Dye Terminator, or change chemistry (dRhodamine instead of BigDye).
  • 23. DNA assembly problems: a. High repeat content: highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen it with Cross_Match or RepeatMasker, and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units. b. High AT content: some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and associate mapping with DNA sequencing data.
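Composition problems such as high GC content, high AT content, and homopolymeric regions can be spotted in a sequence before choosing a remedy. The following Python sketch is a simple illustration and not part of the Phred/Phrap/Consed package; the function names and the minimum run length of 8 are arbitrary choices for the example.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def homopolymer_runs(seq: str, min_len: int = 8):
    """Return (start index, run) for each single-base run of at least min_len."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group()) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    example = "ACGT" + "A" * 9 + "CGGCCG"  # made-up sequence
    print(gc_content(example))
    print(homopolymer_runs(example))
```

The AT content is simply one minus the GC fraction, so the same function covers both the GC-rich and the AT-rich cases mentioned on the slides.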
  • 24. Staden Package. The Staden Package was developed by Rodger Staden's group at the MRC, Cambridge. The main components are: pregap4: base calling with Phred, end clipping, and vector trimming; trev: trace viewing and editing; gap4: sequence assembly, contig editing, and finishing; gap5: assembly visualisation, editing and finishing of NGS data; Spin: DNA and protein sequence analysis.
  • 25. Staden Package. PreGap: Pregap is used to process raw traces. It is used to mask parts of the sequence such as bits of vector and poor quality sequence. Gap: Gap is the Genome Assembly Program, the part which actually assembles individual fragments into long contigs. It allows you to edit the assembly, referring back to the starting traces where they
  • 27. Cufflinks. Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It can identify novel transcripts in your sequencing data by examining their alignments to the genome. Cufflinks is usually run after mapping reads to the genome with its sister tool TopHat.
  • 28. Submission of sequences. BankIt, a WWW-based submission tool with wizards to guide the submission process. Sequin, NCBI's stand-alone submission tool with wizards to guide the submission process, is available by FTP for use on MAC, PC, and UNIX platforms. tbl2asn, a command-line program, automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences and is available by FTP for use on MAC, PC and Unix platforms. Submission Portal, a unified system for multiple submission types. Currently only 16S ribosomal RNA from uncultured bacteria/archaea can be submitted with the GenBank component of this tool. This will be expanded in the future to include other types of GenBank submissions. Genome and Transcriptome Assemblies can be submitted through the WGS and TSA portals, respectively. Barcode Submission Tool, a WWW-based tool for the submission of sequences and trace read data for Barcode of Life projects based on the COI gene.
  • 29. BankIt, Submission Portal and Barcode Submission Tool entries are automatically submitted to GenBank. Submissions made with Sequin or tbl2asn must be mailed to gb-sub@ncbi.nlm.nih.gov. Large files which may be truncated during mailing with conventional mail tools should be submitted directly using Sequin MacroSend. Submissions of Raw Sequence Reads: Reads of Sanger-style sequencing can be submitted to the Trace Archive. Runs of next-generation sequencing, for example from 454 or Illumina, can be submitted to the Sequence Read Archive (SRA).
  • 30. You should use BankIt if: you have a single sequence, a simple set of sequences (for example: 16S rRNA, matK, ITS/rRNA, amoE, tefB, cytb, or COI sets), or a small batch of different sequences; you prefer to use a web-based submission tool; the feature annotation for your sequences is not complicated; you do not require advanced sequence analysis tools.
  • 32. You should use tbl2asn if: your sequence has a lot of annotation; you are submitting a large batch of sequences; you have Whole Genome Shotgun (WGS) submissions; you have complete genome submissions; you are submitting FLIC sequences.
  • 33. Sequence formats (sequence formats.docx): NCBI GenBank, EMBL, Stanford University.

Editor's Notes

  1. Aligned reads window, shown in the "color means quality and tags" color mode. The top line gives the contig sequence, and below it are the read sequences for the top strand (right-pointing arrows) and bottom strand (left-pointing arrows). Read names are in yellow. The background gray scale indicates base quality, with the highest quality being white and the lowest quality black. Red indicates a character (such as the x or *) that disagrees with the contig sequence. (x) A base that has been masked by cross_match as being vector. (*) A pad that is inserted by phrap to align reads that have insertions and deletions. Tags are indicated by colored bars covering the bottom half of the background square for each base. The blue tag represents sequencing vector, and the orange tag indicates compressions. An edited base has a green tag attached. Gray letters on a black background indicate that phrap clipped these bases off because of low quality.
  2. Navigation window. Each line contains the contig name, the read name, the range of consensus positions, and an indication of the problem. The Go, Prev, and Next buttons cause the associated aligned reads window to scroll to the location on the currently highlighted line, the line above it, or the line below it, respectively. All items can be visited in order by repeatedly clicking Next. The Save button creates a file containing the list.
  3. Trace window. The lines in each panel above the trace chromatogram indicate the following: (con) consensus position; (rd) read position; (con) consensus bases; (edt) editable read bases; (phd) phred base calls; and (ABI) ABI base calls. The H and V scale bars change the horizontal and vertical magnification of the traces. (Scroll together Yes/No buttons) Allow the user to scroll the traces individually or locked together. (Remove) Removes this trace panel from the window. (Undo) Undoes the last edit operation on that read.
  4. Compare contigs window, indicating an alignment selected to investigate a contig match indicated in phrapview. (Top) A phrapview window showing matches between contigs as red lines. (Bottom) Consed compare contigs window showing the sequence alignment of one match from the phrapview window. The two rows of bases in lowercase are the unaligned contigs, which can be scrolled relative to each other. The two rows of bases in uppercase are the aligned contigs, which are locked together if they are scrolled. (x) A mismatch. Red cursors indicate the bases to be pinned together. (Align button) Click this to show the alignment. (Scroll Top Contig/Scroll Bottom Contig buttons) After clicking on a base, this causes the Aligned reads window to scroll to the appropriate location.