Computational Biology, Part 6
Sequence Database Searching
PUSHPENDRA TRIPATHI
Sequence Analysis Tasks
⇒ Given a query sequence, search for similar sequences in a database
Global or Local? Both local and global alignment methods may be applied to database scanning, but local alignment methods are more useful since they do not assume that the query protein and database sequence are of similar length.
Efficient database searching methods
Dynamic programming requires order N²L computations (where N is the size of the query sequence and L is the size of the database)
Given the size of databases, more efficient methods are needed
“Hit and extend” sequence searching
Problem: Too many calculations “wasted” by comparing regions that have nothing in common
Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical
Basic method: Look for similar regions only near short stretches that match exactly
“Hit and extend” sequence searching
We define a word (or k-tuple) size that is the minimum number of exact “letter” matches that must occur before we do any further comparison or alignment
How do we find all of the occurrences of matching words between a sequence and a database?
Could scan sequence a word at a time, but this is order L (size of database)
Word searching - hashing Solution: Use a precomputed table that lists where in the database each possible word occurs Generation of the table is of order L (size of database) but use of the table is of order N (size of query sequence) The computer science term for this approach is hashing
Hashing
Hashing table of size 10
Hashing function H(x) = x mod 10
Applet: http://www.engin.umd.umich.edu/CIS/course.des/cis
Insertion & Search
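The table of size 10 with H(x) = x mod 10 can be sketched in a few lines of Python. This is only an illustration: the slide does not say how collisions are handled, so chaining (a list per slot) is an assumption here, and the class name and keys are hypothetical.

```python
class ChainedHashTable:
    """Toy hash table using H(x) = x mod size.

    Collisions are resolved by chaining (assumption, not from the slide).
    """

    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def h(self, x):
        # Hashing function from the slide: H(x) = x mod 10 when size is 10
        return x % self.size

    def insert(self, x):
        bucket = self.buckets[self.h(x)]
        if x not in bucket:
            bucket.append(x)

    def search(self, x):
        # Only one bucket has to be examined, not the whole table
        return x in self.buckets[self.h(x)]


table = ChainedHashTable(size=10)
for key in (12, 25, 32, 7):  # illustrative keys
    table.insert(key)
```

Keys 12 and 32 both hash to slot 2, so they share one bucket; a search for 42 looks only at that bucket and reports that 42 is absent.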
Demonstration: Hashing algorithm for sequence searching (Demonstration A10)
Author: R.F. Murphy, Feb. 6, 1995 (revised Feb. 15, 1996)

This demonstration takes a piece of database sequence, calculates hash values for each ktuple, builds a hash table (listing the positions in the database of the occurrence of each hash value), and uses a simplified version of the hash table to find the positions in the database sequence of the first occurrence of each ktuple in a query sequence.

Hashing: database sequence
This section converts each base to a number from 0 to 3 and combines those numbers three at a time to form an integer from 0 to 63 that is unique for each three base sequence. Each three base sequence is called a "ktuple."

 i   seq(i) as char   seq(i) as int   hash value
 1   a                0               6
 2   c                1               27
 3   g                2               47
 4   t                3               63
 5   t                3               63
 6   t                3               60
 7   t                3               48
 8   a                0               0
 9   a                0               0
10   a                0               1
11   a                0               6
12   c                1               24
13   g                2               33
14   a                0               4
15   c                1               17
16   a                0               5
17   c                1
18   c                1

Hash table for the database sequence

hash value   ktuple   pos1   pos2   pos3   first hit
0            aaa      8      9             8
1            aac      10                   10
2            aag                           not found
3            aat                           not found
4            aca      14                   14
5            acc      16                   16
6            acg      1      11            1
7            act                           not found
8            aga                           not found
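The demonstration can be reproduced with a short Python sketch. The base-to-number mapping (a=0, c=1, g=2, t=3), the ktuple size of 3, and the database sequence come from the demonstration; the function names and the query string "aacg" are illustrative assumptions.

```python
BASE_TO_INT = {"a": 0, "c": 1, "g": 2, "t": 3}
K = 3  # ktuple size used in the demonstration


def hash_ktuple(ktuple):
    """Combine base codes as base-4 digits: a 3-base ktuple maps to 0..63."""
    value = 0
    for base in ktuple:
        value = value * 4 + BASE_TO_INT[base]
    return value


def build_hash_table(database, k=K):
    """Map each hash value to the 1-based positions where its ktuple starts."""
    table = {}
    for i in range(len(database) - k + 1):
        table.setdefault(hash_ktuple(database[i:i + k]), []).append(i + 1)
    return table


def first_hits(query, table, k=K):
    """For each ktuple in the query, report its first database position (or None)."""
    hits = []
    for i in range(len(query) - k + 1):
        ktuple = query[i:i + k]
        positions = table.get(hash_ktuple(ktuple))
        hits.append((ktuple, positions[0] if positions else None))
    return hits


database = "acgttttaaaacgacacc"  # the 18-base sequence from the demonstration
table = build_hash_table(database)
query_hits = first_hits("aacg", table)  # hypothetical query
```

Building the table costs one pass over the database; each query ktuple is then resolved with a single lookup, which is the order-L versus order-N trade-off described on the hashing slide.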
FASTA
Heavily used for searching databases until advent of BLAST (see below)
Inputs: k (word or k-tuple) size, similarity matrix
Compares query sequence pairwise with each sequence in the database
FASTA method
The initial step in the algorithm is to identify all exact matches of length k (k-tuples) or greater between the two sequences.
FASTA method
1. Find diagonals (paired pieces from each sequence without gaps) that have the highest density of common words
2. Rescore these using a scoring (similarity) matrix and trim ends that do not contribute to the highest score
Result: partial alignments without gaps
Reported as the “init1” score
FASTA method
3. Join regions together, including penalties for gaps
Result: unoptimized alignment with gaps
Reported as the “initn” score
4. Use dynamic programming in a band 32 residues wide around the best “initn” score
Result: optimized alignment with gaps
Reported as the “opt” score
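Step 1 of the method (finding the diagonals richest in common words) might be sketched as below. This is a simplified illustration rather than the FASTA implementation: it only counts shared k-tuples per diagonal, and the function names and example sequences are assumptions.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Index every k-tuple of seq by its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_word_counts(query, target, k):
    """Count shared k-tuples on each diagonal d = j - i.

    i is the word start in the query, j the word start in the target.
    Matches on the same diagonal can be aligned without gaps.
    """
    index = word_positions(target, k)
    counts = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            counts[j - i] += 1
    return counts


query = "ACGTACGT"        # illustrative sequences
target = "TTACGTACGTTT"
counts = diagonal_word_counts(query, target, k=2)
best_diagonal, best_count = counts.most_common(1)[0]
```

Here the query occurs in the target at an offset of 2, so diagonal 2 collects the most word matches. Rescoring with a similarity matrix (step 2) would then be applied only to the top diagonals.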
Comments on FASTA Larger k-tuple increases speed since fewer “hits” are found but it also decreases sensitivity for finding similar but not identical sequences since exact matches of this length are required
Limitations of FASTA
FASTA can miss significant similarity since
For proteins, similar sequences do not have to share identical residues
Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even with k-tuple size of 1 since no amino acid matches
Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly but there is only a match with k-tuple size of 1
Limitations of FASTA
FASTA can miss significant similarity since
For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not
GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with k-tuple size of 3 or higher
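The examples on the two limitation slides can be checked with a small sketch that lists the words two sequences have in common. The helper name is hypothetical, and the peptides are written in one-letter code (Asp-Lys-Val = DKV, Glu-Arg-Ile = ERI, and so on).

```python
def shared_words(a, b, k):
    """Return the set of k-tuples that occur in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b


# Protein examples from the slide, in one-letter code
no_hit = shared_words("DKV", "ERI", 1)           # no residue in common
gly_k1 = shared_words("GDGKG", "GEGRG", 1)       # only Gly matches
gly_k2 = shared_words("GDGKG", "GEGRG", 2)       # nothing at k = 2

# Nucleic acid example: same peptide, different third codon positions
rna_1 = "GGUUCUACGAAG"
rna_2 = "GGCUCCACAAAA"
wobble_k3 = shared_words(rna_1, rna_2, 3)
wobble_k2 = shared_words(rna_1, rna_2, 2)
```

With k = 3 the two coding sequences share no word at all, even though they encode the same peptide, while k = 2 does find common words.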
BLAST (Basic Local Alignment Search Tool)
Goal: find sequences from database similar to query sequence
Previous tools use either
a direct, theoretically sound but computationally slow approach that examines all possible alignments of query with database (dynamic programming), or
an indirect, heuristic but computationally fast approach that finds similar sequences by first finding identical stretches (FASTP, FASTA)
BLAST (Basic Local Alignment Search Tool)
BLAST combines the best of both by using a theoretically sound method that searches for similar sequences directly but is computationally fast
Reference: S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410 (1990)
BLAST basics Need similarity measure, as in dynamic programming - use PAM-120 for proteins Define maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences (in FASTA terms, highest init1 diagonal)
BLAST basics Define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments
BLAST basics
Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T
Key concept: Seems similar to FASTA, but we are searching for words that score at least T rather than words that match exactly
BLAST method for proteins
1. Compile a list of words which give a score of at least T when paired with the query sequence.
Example using PAM-120 for query sequence ACDE (w=4, T=17):
        A   C   D   E
ACDE = +3  +9  +5  +5 = 22
try all possibilities:
AAAA = +3  -3   0   0 =  0  no good
AAAC = +3  -3   0  -7 = -7  no good
...too slow, try directed change
Generating word list
        A   C   D   E
ACDE = +3  +9  +5  +5 = 22
change 1st pos. to all acceptable substitutions
gCDE =  1   9   5   5 = 20  ok (=pCDE, sCDE, tCDE)
nCDE =  0   9   5   5 = 19  ok (=dCDE, eCDE, nCDE, vCDE)
iCDE = -1   9   5   5 = 18  ok (=qCDE)
kCDE = -2   9   5   5 = 17  ok (=mCDE)
change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13
change 3rd pos. in combination with first position
gCnE =  1   9   2   5 = 17  ok
continue - use recursion
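The "directed change" idea, where a substitution is abandoned as soon as the remaining positions cannot bring the total back up to T, can be sketched as a recursive search with pruning. This is an illustrative sketch, not the BLAST source: the function names are hypothetical, and a toy nucleotide scoring function (+5 match, -4 mismatch) stands in for PAM-120, which is not reproduced here.

```python
def neighborhood(query_word, alphabet, score, T):
    """List all words whose pairwise score against query_word is at least T.

    score(q, a) gives the score of aligning query letter q with letter a.
    A branch is pruned when even the best possible remaining positions
    cannot reach T.
    """
    n = len(query_word)
    # suffix_best[i] = best achievable total score over positions i..n-1
    suffix_best = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best_here = max(score(query_word[i], a) for a in alphabet)
        suffix_best[i] = suffix_best[i + 1] + best_here

    results = []

    def extend(prefix, running):
        i = len(prefix)
        if i == n:
            results.append((prefix, running))
            return
        for a in alphabet:
            s = running + score(query_word[i], a)
            if s + suffix_best[i + 1] >= T:  # still able to reach T?
                extend(prefix + a, s)

    extend("", 0)
    return results


def toy_score(q, a):
    # Toy scores (assumption): +5 for a match, -4 for a mismatch
    return 5 if q == a else -4


words = neighborhood("ACG", "ACGT", toy_score, T=6)
```

With these toy scores the list holds the exact word (score 15) and the nine single-substitution words (score 6). Words with two substitutions are never enumerated because the pruning test fails after the second mismatch.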
Generating word list For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence
BLAST method for proteins
2. Scan the database for hits with the compiled list of words. Two approaches:
Use index of all possible words (for w=4, need array of size 20^4 = 160,000). Can compress this index using pointers to save space.
Use finite state machine (actually used). Calculate a state transition table that tells what state to go to based on the next character in the sequence
3a. Extend hits to form HSPs (high-scoring segment pairs)
BLAST method for proteins
3b. BLAST2 or gapped BLAST uses an approach similar to FASTA to combine hits before trying to extend them as in 3a.
4. Compare the score for each HSP to a threshold S to decide whether to keep it
5. Proceed to estimating statistical significance (see below)
BLAST Method for DNA
1. Make list of all contiguous w-mers in the query sequence (often w=12)
2. Compress database by packing 4 nucleotides into a single byte (use auxiliary table to tell you where sequences start and stop within the compressed database) -- doesn't allow for unspecified bases (wildcards)
BLAST Method for DNA
3. Compress the w-mers from the query sequence the same way.
4. Search the compressed database for matches with the compressed w-mers
Since all frames of the query sequence are considered separately, any match of length w>=11 must contain a match of length 8 that lies on a byte boundary of one of the w-mers from the query sequence. Thus can scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.
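Packing four nucleotides into one byte uses 2 bits per base. The sketch below is only an illustration of that idea: the bit assignment (A=0, C=1, G=2, T=3), padding of a trailing partial group with A, and the function name are assumptions, and the byte-aligned search covers just one frame of one query word.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base (assumed assignment)


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte.

    Returns (packed_bytes, original_length). A trailing partial group is
    padded with 'A'; the returned length says where the real data ends.
    Wildcards such as 'N' cannot be represented and raise ValueError.
    """
    for base in seq:
        if base not in CODE:
            raise ValueError(f"cannot pack base {base!r}")
    padded = seq + "A" * (-len(seq) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for base in padded[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out), len(seq)


packed_db, db_length = pack("TTTTACGTACGTGGGG")  # illustrative database
packed_word, _ = pack("ACGTACGT")                # 8-mer: exactly two bytes
byte_offset = packed_db.find(packed_word)        # compares a byte at a time
```

A byte offset of 1 corresponds to nucleotide position 4 in the unpacked database. Matches that do not start on a byte boundary are the reason the query frames have to be considered separately.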
BLAST Method for DNA Problem: if query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence) there will be many hits with "uninteresting" regions.
BLAST Method for DNA Solution: During compression of the database, tabulate frequencies of all 8-tuples. Make a list of those occurring very frequently (much more frequently than expected by chance). Remove these words from the query list of w-mers before searching database. Remove words matching a sublibrary of repeated sequences (but report the matches to that sublibrary when done).
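The frequency filter could look roughly like the following sketch. Several details are assumptions, since the slide does not specify them: "expected by chance" is taken as a uniform base composition, the `fold` cutoff is an arbitrary illustrative parameter, and a query w-mer is dropped when it contains any over-represented k-tuple.

```python
from collections import Counter


def frequent_words(database, k=8, fold=10.0):
    """Return k-tuples seen more than `fold` times the count expected by chance.

    Expected count assumes all 4**k words are equally likely (assumption).
    """
    counts = Counter(database[i:i + k] for i in range(len(database) - k + 1))
    expected = sum(counts.values()) / 4 ** k
    return {word for word, c in counts.items() if c > fold * expected}


def filter_query_words(query, frequent, w=12, k=8):
    """Keep only the query w-mers that contain no over-represented k-tuple."""
    kept = []
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        if not any(word[j:j + k] in frequent for j in range(w - k + 1)):
            kept.append(word)
    return kept


# Small illustrative run with short words so the result is easy to follow
database = "AT" * 50 + "ACGTGCA"
common = frequent_words(database, k=2, fold=5.0)
kept = filter_query_words("ATATCGCG", common, w=4, k=2)
```

In this toy database the AT-rich repeat makes "AT" and "TA" over-represented, so every query word overlapping that stretch is removed before the search.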
BLAST Statistical significance A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability
BLAST Statistical significance
From the Karlin-Altschul formulation, the expected value (mean) of the HSP scores between a query and a set of random sequences is
$u \approx \frac{\log_e(Kmn)}{\lambda}$ or, equivalently, $u \approx \frac{\ln(Kmn)}{\lambda}$
BLAST Statistical significance
BLAST uses a correction to this formulation that takes into account the effective sequence lengths of the query and the database sequences
$u = \frac{\ln(K m' n')}{\lambda}$
BLAST Statistical significance
The corrected lengths are given by
$m' = m - \frac{\ln(Kmn)}{H}$
$n' = n - \frac{\ln(Kmn)}{H}$
with
$H = \frac{\ln(Kmn)}{l}$
where l is the average length of the alignment that can be achieved between random sequences of length m and n
BLAST Statistical significance
Given u, we can calculate the probability p of observing a score S between a query sequence and a given database sequence that is equal to or greater than x
$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$
BLAST Statistical significance
Lastly, we have to consider that we are searching many database sequences and can expect even a relatively rare score to occur with high chance given enough comparisons
For a database of D sequences, this is
$E \approx 1 - e^{-p(S \ge x)\, D}$
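The formulas on the statistical significance slides can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the values of K, λ, m, n and D used in the example are made up for illustration, not taken from any real scoring system.

```python
import math


def expected_score(K, lam, m, n):
    """u = ln(K m n) / lambda."""
    return math.log(K * m * n) / lam


def effective_lengths(K, m, n, H):
    """m' = m - ln(Kmn)/H and n' = n - ln(Kmn)/H."""
    correction = math.log(K * m * n) / H
    return m - correction, n - correction


def p_value(x, u, lam):
    """p(S >= x) = 1 - exp(-exp(-lambda (x - u)))."""
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t
    return -math.expm1(-math.exp(-lam * (x - u)))


def database_value(p, D):
    """E ~ 1 - exp(-p D) for a search against D database sequences."""
    return -math.expm1(-p * D)


# Illustrative parameter values only (assumptions)
K, lam, m, n, D = 0.1, 0.3, 100, 1000, 50000
u = expected_score(K, lam, m, n)
p_at_u = p_value(u, u, lam)            # score equal to the expected value
p_high = p_value(u + 30, u, lam)       # score well above the expected value
E_high = database_value(p_high, D)
```

A score equal to u has probability 1 - 1/e of being reached by chance, and the probability falls off roughly exponentially as x rises above u. Multiplying by the database size D shows how a rare pairwise score can still be likely across many comparisons.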
Summary of Database Search Methods

Authors (Program)           Description
Needleman & Wunsch          full alignment
Wilbur & Lipman             match k-tuple - form diag - NW
Lipman & Pearson (FASTP)    k-tuple - diag - rescore
Pearson & Lipman (FASTA)    FASTP - join diags - NW
Altschul et al (BLAST)      word match list - statistics
Reading for next class Paper by Grundy and Bailey