Sequence assembly

What is Phred/Phrap/Consed?
Phred/Phrap/Consed is a widely distributed package for:
a. Trace file (chromatograms) reading;
b. Quality (confidence) assignment to each
individual base;
c. Vector and repeat sequences
identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
Shotgun sequencing workflow:
a. Starting material: whole genome or BAC/cosmid clone
b. DNA fragmentation: sonic disruption or nebulization
c. Small fragments: 1.0 - 2.0 kb
d. Clone library: pUC18
e. DNA sequencing: random clones
f. Partial assembly: contigs
g. Finishing: quality, coverage of both strands, gap filling
h. Final consensus sequence of the whole genome or BAC/cosmid clone

Trace File
High quality region
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File
Medium quality region – some ambiguities (Ns)

Trace File
Poor quality region – low confidence
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by a poor-quality template, a bad matrix, or a low signal-to-noise ratio

Trace File
Poor quality region – low confidence
Poor quality read:
- many ambiguities (Ns)
- noise
- caused by homopolymeric region/polymerase slippage

Sudden drop artifact:
- good quality region is followed by a sudden drop of signal
- caused by secondary structure

Sequence Assembly
The phred software reads DNA sequencing trace files, calls bases, and assigns a
quality value to each base.
The quality value is a log-transformed error probability: Q = -10 log10(Pe),
where Q and Pe are, respectively, the quality value and the error probability of a
particular base call.
Phred can use the quality values to perform sequence trimming.
Phred works well with trace files from most manufacturers' sequencing machines.
The program was developed by Drs. Phil Green and Brent Ewing, and is copyrighted
by the University of Washington.
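The relation Q = -10 log10(Pe) can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and are not part of phred.

```python
import math


def error_probability(q):
    """Convert a quality value Q into an error probability: Pe = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def quality_value(pe):
    """Convert an error probability Pe into a quality value: Q = -10 * log10(Pe)."""
    return -10 * math.log10(pe)


# Print the error probability for a few common quality values.
for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

The two functions are inverses of each other, so a quality value of 20 corresponds to an error probability of 1 in 100.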

Phred generates highly accurate, base-specific quality scores.
Quality scores range from 4 to about 60, with higher values
corresponding to higher quality.
Phred quality score | Probability that the base is called wrong | Accuracy of the base call
10                  | 1 in 10                                   | 90%
20                  | 1 in 100                                  | 99%
30                  | 1 in 1,000                                | 99.9%
40                  | 1 in 10,000                               | 99.99%
50                  | 1 in 100,000                              | 99.999%
Ideal tool to assess the quality of sequences
The most commonly used method is to count the bases with a quality
score of 20 and above (sometimes called "high quality bases"); the
resulting number is often called the "Phred20 score"
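As an illustration of the counting method just described, here is a small Python sketch that counts the bases at or above a quality threshold. The function name and the example quality list are made up for this illustration.

```python
def phred20_score(qualities, threshold=20):
    """Count the bases whose quality value is at or above the threshold."""
    return sum(1 for q in qualities if q >= threshold)


# Hypothetical per-base quality values for one short read.
example_qualities = [4, 9, 19, 20, 27, 35, 41, 50, 12, 8]
print(phred20_score(example_qualities))
```

Changing `threshold` gives the analogous count for other cutoffs, for example 30.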

Conversion of phd files into FASTA files: the phd2fasta script
Features:
- Phred creates single-sequence files containing the sequence
itself plus the quality assignments (phd files)
- The input file for the cross_match and phrap programs is a multiple
sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two
multiple sequence FASTA format files, containing the sequence
information and the basecall quality information respectively
- The phredPhrap script automatically executes phd2fasta before
running cross_match and phrap!
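To make the conversion concrete, the sketch below parses one phd record and emits a FASTA entry and a matching quality entry. It is not the phd2fasta script itself. It assumes the usual phd layout, in which a BEGIN_DNA ... END_DNA block lists one base per line followed by its quality and peak position; the example record and all function names are made up for illustration.

```python
EXAMPLE_PHD = """BEGIN_SEQUENCE read1

BEGIN_COMMENT
CHROMAT_FILE: read1
END_COMMENT

BEGIN_DNA
t 9 6
c 12 20
a 35 31
g 40 44
END_DNA

END_SEQUENCE
"""


def parse_phd(text):
    """Return (name, sequence, qualities) from the text of one phd file."""
    name = None
    bases, quals = [], []
    in_dna = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split()[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()  # base, quality, peak position
            bases.append(fields[0])
            quals.append(int(fields[1]))
    return name, "".join(bases), quals


def to_fasta_and_qual(records):
    """Build the text of a sequence FASTA file and a quality file from parsed records."""
    fasta_lines, qual_lines = [], []
    for name, seq, quals in records:
        fasta_lines += [">" + name, seq]
        qual_lines += [">" + name, " ".join(str(q) for q in quals)]
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


fasta_text, qual_text = to_fasta_and_qual([parse_phd(EXAMPLE_PHD)])
print(fasta_text)
print(qual_text)
```

Both outputs use the same read names in the same order, which is what lets a downstream program pair each base with its quality.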

Phrap
Phragment Assembly Program or… Phil’s Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data
Key Features:
a. Uses the entire read content – no need for trimming.
b. User supplied (e.g. Repbase) + internally computed data –
better accuracy of assembly in the presence of repeats.
c. Contig sequence is constituted by a mosaic of the highest
quality parts of the reads – it’s not a consensus!
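The difference between a quality-based mosaic and a plain majority consensus can be shown with a toy example. The sketch below is a deliberate simplification and is not phrap's algorithm: it picks the highest-quality base independently in each alignment column, whereas phrap builds the contig from stretches of reads. The aligned reads and function names are invented for this illustration, and "-" marks positions a read does not cover.

```python
from collections import Counter


def mosaic_consensus(aligned_reads):
    """Per column, take the base with the highest quality value."""
    length = len(aligned_reads[0][0])
    out = []
    for col in range(length):
        best_base, best_q = "N", -1
        for seq, quals in aligned_reads:
            if seq[col] != "-" and quals[col] > best_q:
                best_base, best_q = seq[col], quals[col]
        out.append(best_base)
    return "".join(out)


def majority_consensus(aligned_reads):
    """Per column, take the most frequent base and ignore quality."""
    length = len(aligned_reads[0][0])
    out = []
    for col in range(length):
        counts = Counter(seq[col] for seq, _ in aligned_reads if seq[col] != "-")
        out.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(out)


# Three hypothetical aligned reads with per-base quality values.
READS = [
    ("ACGT-", [30, 30, 10, 30, 0]),
    ("ACTTA", [20, 20, 40, 20, 25]),
    ("ACGTA", [10, 10, 15, 10, 10]),
]
print(mosaic_consensus(READS))    # quality-driven choice
print(majority_consensus(READS))  # vote-driven choice
```

In the third column two low-quality reads say G and one high-quality read says T, so the two approaches disagree there.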

Phrap is a program for assembling shotgun DNA sequence data.
Accurate consensus sequences. Phrap uses Phred's quality scores to determine
highly accurate consensus sequences. Phrap examines all individual sequences at a
given position, and generally uses the highest quality sequence to build the
consensus - similar to the way scientists would correct consensus sequences during
"contig editing".
Consensus quality estimates. Phrap uses the quality information of individual
sequences to estimate the quality of the consensus sequence. In addition, Phrap
uses available information about sequencing chemistry (dye terminator or dye
primer) and confirmation by "other strand" reads in estimating the consensus
quality. This often allows scientists to ignore random errors, and to focus finishing
efforts exclusively onto regions where the data quality is insufficient. Consensus
quality estimates can also be very helpful in mutation detection by DNA
sequencing.
Ability to assemble very large projects. Phrap has been used routinely to assemble
bacterial genomes sequenced by the "shotgun" approach, where each project
contained tens of thousands of reads. Smaller bacterial genomes (2 million bases or
less) could often be assembled in less than three hours.

Improved identification and handling of repeats. Phrap uses quality scores to
estimate whether discrepancies between two overlapping sequences are more
likely to arise from random errors, or from different copies of a repeated
sequence. For repeats with 95 to 98% identity (like human Alu sequences) and
high quality sequence data, this typically yields correct assemblies.
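The reasoning behind using quality scores at a discrepancy can be sketched with simple probabilities. This is only an illustration and is not phrap's actual scoring scheme. Assuming the two base calls err independently, the chance that at least one of them is wrong is 1 - (1 - p1)(1 - p2), where p1 and p2 are the error probabilities implied by the two quality values. The function names are made up.

```python
def error_probability(q):
    """Error probability implied by a phred quality value."""
    return 10 ** (-q / 10)


def prob_any_call_wrong(q1, q2):
    """Probability that at least one of two independent base calls is wrong."""
    p1, p2 = error_probability(q1), error_probability(q2)
    return 1 - (1 - p1) * (1 - p2)


# A mismatch between two low-quality bases is easily explained by base-calling error;
# a mismatch between two high-quality bases is not.
print(prob_any_call_wrong(10, 10))
print(prob_any_call_wrong(40, 40))
```

When this probability is very small, a random error is an unlikely explanation for the mismatch, which points instead to two different copies of a repeat.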
Fast assemblies. Assemblies of cosmid- to BAC-sized projects with several hundred
to two thousand reads typically take only minutes to complete on high-powered
workstations or personal computers.

Cross_match: Fast DNA Sequence Comparisons and Vector Screening
- Identification of overlaps between contig ends after assembly with Phrap.
- Identification of potential repeat sequences in assemblies.
- Generation of error summaries and lists after completion of sequencing projects.
- Estimation of vector contamination in newly created libraries.
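Vector screening can be pictured as replacing the vector-derived part of a read with X characters so that it does not take part in assembly. The sketch below is a toy version under strong assumptions: it uses exact matches of a fixed word length on the forward strand only, whereas cross_match performs real local alignments. The vector fragment, the read, and the function name are all invented for this example.

```python
def mask_vector(read, vector, min_match=12):
    """Replace every stretch of the read that exactly matches the vector with X."""
    read_u, vector_u = read.upper(), vector.upper()
    # All words of length min_match that occur in the vector.
    words = {vector_u[i:i + min_match] for i in range(len(vector_u) - min_match + 1)}
    masked = list(read)
    for i in range(len(read_u) - min_match + 1):
        if read_u[i:i + min_match] in words:
            for j in range(i, i + min_match):
                masked[j] = "X"
    return "".join(masked)


# Made-up vector fragment and a read that starts with 15 bases of it.
VECTOR = "GGATCCTCTAGAGTCGACCTGCAGGCATGC"
READ = "GGATCCTCTAGAGTC" + "ACGTTTAACCGGTTAA"
print(mask_vector(READ, VECTOR))
```

Masking keeps the read length unchanged, so the base positions still line up with the quality values.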

Consed/Autofinish is a tool for viewing, editing, and finishing
sequence assemblies created with phrap. Finishing capabilities
include allowing the user to pick primers and templates, suggesting
additional sequencing reactions to perform, and facilitating checking
the accuracy of the assembly using digest and forward/reverse pair
information.
See the consed page for additional information.
References:
Gordon, David. "Viewing and Editing Assembled Sequences Using
Consed", in Current Protocols in Bioinformatics, A. D. Baxevanis and
D. B. Davison, eds. New York: John Wiley & Co., 2004, 11.2.1-11.2.43.

Aligned reads window
Gordon D et al. Genome Res. 1998;8:195-202
Cold Spring Harbor Laboratory Press

Navigation window.

Trace window.

Compare contigs window, indicating an alignment selected to investigate a contig match
indicated in phrapview.

Phred/Phrap/Consed Pipeline
a. Input: chromatogram files
b. Quality (confidence) values assignment: Phred
   phd files - *.phd
c. Conversion - phd to fasta: phd2fasta.pl
   nucleotide sequences - seq.fasta
   quality values - seq.fasta.qual
d. Vector screening and masking: Cross_Match (local alignment program) x vector.seq
   screened/masked file - seq.fasta.screen
   quality values - seq.fasta.screen.qual
e. Assembly: Phrap
   assembled contigs - seq.fasta.screen.contigs
   assembly file - seq.fasta.screen.ace#
f. Assembly viewing/editing: Consed
g. Finishing: Autofinish and manual finishing
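The file names in this pipeline follow a regular pattern, which makes it easy to check how far a run has progressed. The Python sketch below is a hypothetical helper and is not part of the package. It lists the outputs named above for each stage and reports the first stage whose outputs are missing. It assumes the "#" after ".ace" stands for a numeric suffix, so any file name beginning with the ".ace" name is accepted.

```python
def expected_outputs(base="seq.fasta"):
    """Output files per stage, following the naming shown in the pipeline."""
    return [
        ("phd2fasta", [base, base + ".qual"]),
        ("cross_match", [base + ".screen", base + ".screen.qual"]),
        ("phrap", [base + ".screen.contigs", base + ".screen.ace"]),
    ]


def first_incomplete_stage(existing_files, base="seq.fasta"):
    """Return the first stage with a missing output, or None if all are present."""
    existing = set(existing_files)
    for stage, outputs in expected_outputs(base):
        for name in outputs:
            # Assumption: the ace file may carry a suffix, so match it by prefix.
            found = any(
                f == name or (name.endswith(".ace") and f.startswith(name))
                for f in existing
            )
            if not found:
                return stage
    return None


print(first_incomplete_stage(["seq.fasta", "seq.fasta.qual"]))
```

In practice the list of existing files could come from a directory listing of the project folder.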

Finishing Problems
DNA sequencing problems
a. High GC content – genomes with a high GC content are more
prone to generate artifacts such as compressions, sudden drops, and bad quality
regions. Try to use Dye Primer instead of Dye Terminator, change chemistry,
add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP,
etc.
b. Palindromic regions – lead to strong secondary structures that cause
sudden drops. Try to use deaza-dGTP instead of dGTP, or amplify the problematic
region by PCR and sequence the product.
c. Homopolymeric regions – can reduce DNA synthesis efficiency for
some chemistries. Try to use Dye Primer instead of Dye Terminator, or change
chemistry (dRhodamine instead of BigDye).

DNA assembly problems
a. High repeat content – highly repeated elements reduce the accuracy of DNA
assembly. Identify the repeat unit, screen it with Cross_Match or RepeatMasker
and mask it. Try to assemble again and add the repetitive region only at the end.
Map the repetitive region using restriction enzymes to estimate its size and
number of repeat units.
b. High AT content – some highly biased genomes (e.g. Plasmodium falciparum,
plastid genomes) can pose a problem for assembly programs. Very difficult to
solve. Try to determine a restriction map and associate mapping with DNA
sequencing data.
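Base composition problems such as high GC or high AT content can be spotted before sequencing or assembly by scanning a sequence in windows. The following is a minimal Python sketch; the function names, window size, and step are arbitrary choices for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in "GC" for b in counted) / len(counted)


def gc_windows(seq, window=100, step=50):
    """Return (start, GC fraction) for windows along the sequence."""
    last_start = max(len(seq) - window + 1, 1)
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, last_start, step)]


# Small made-up sequence with a GC-rich half and an AT-rich half.
print(gc_windows("GGGGCCCCAAAATTTT", window=8, step=8))
```

Windows with extreme values mark regions where the sequencing or assembly problems listed above are more likely.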

Staden Package
The Staden Package was developed by Rodger Staden's group at the
MRC, Cambridge.
The main components are:
- pregap4 - base calling with Phred, end clipping, and vector trimming
- trev - trace viewing and editing
- gap4 - sequence assembly, contig editing, and finishing
- gap5 - assembly visualisation, editing and finishing of NGS data
- Spin - DNA and protein sequence analysis

Staden Package
PreGap
Pregap is used to process raw traces.
It is used to mask unwanted sequence, such as bits of vector and poor quality
sequence.
Gap
Gap is the Genome Assembly Program – the bit which actually assembles individual
fragments into long contigs.
It allows you to edit the assembly, referring back to the starting traces.

Cufflinks
Cufflinks assembles transcripts, estimates their abundances,
and tests for differential expression and regulation in RNA-Seq samples.
It can identify novel transcripts in your sequencing data by
examining their alignments to the genome.
Cufflinks is usually run after mapping reads to the genome
with its sister tool TopHat.

Submission of sequences
BankIt, a WWW-based submission tool with wizards to guide the submission
process
Sequin, NCBI's stand-alone submission tool with wizards to guide the submission
process, is available by FTP for use on Mac, PC, and UNIX platforms.
tbl2asn, a command-line program, automates the creation of sequence records for
submission to GenBank using many of the same functions as Sequin. It is used
primarily for submission of complete genomes and large batches of sequences and
is available by FTP for use on Mac, PC, and UNIX platforms.
Submission Portal, a unified system for multiple submission types. Currently only
16S ribosomal RNA from uncultured bacteria/archaea can be submitted with the
GenBank component of this tool. This will be expanded in the future to include
other types of GenBank submissions. Genome and Transcriptome Assemblies can be
submitted through the WGS and TSA portals, respectively.
Barcode Submission Tool, a WWW-based tool for the submission of sequences and
trace read data for Barcode of Life projects based on the COI gene.

BankIt, Submission Portal and Barcode Submission Tool entries are
automatically submitted to GenBank. Submissions made with Sequin or
tbl2asn must be mailed to gb-sub@ncbi.nlm.nih.gov.
Large files which may be truncated during mailing with conventional mail
tools should be submitted directly using Sequin MacroSend.
Submissions of Raw Sequence Reads
Reads of Sanger-style sequencing can be submitted to the Trace
Archive.
Runs of next-generation sequencing, for example from 454 or
Illumina, can be submitted to the Sequence Read Archive (SRA).

You should use BankIt if:
You have a single sequence, a simple set of sequences (for
example: 16S rRNA, matK, ITS/rRNA, amoE, tefB, cytb, or
COI sets), or a small batch of different sequences
You prefer to use a web-based submission tool
The feature annotation for your sequences is not
complicated
You do not require advanced sequence analysis tools

You should use tbl2asn if:
Your sequence has a lot of annotation
You are submitting a large batch of sequences
You have Whole Genome Shotgun (WGS) submissions
You have complete genome submissions
You are submitting FLIC sequences

Sequence formats
NCBI
GenBank
EMBL
Stanford University
