Database Searching
Database Searching
• Keyword Searching Database entries,
that contain a keyword specified.
(IRX, SRS, ….)
• Sequence Searching Similar sequences,
and their alignment with the
query sequence.
(FastA, Blast, ….)
Sequence Searching
• The comparison of a sequence against a database ….
• The comparison of a pattern, a profile, or a HMM against a
database ….
In order to find similar sequences
In order to find weaker similarities.
Search
strategies:
Sequence Searching
• Sensitive searching using rigorous searching algorithms:
• Extremely fast searching using heuristic searching algorithms:
- Smith - Waterman search: accurate comparison
computing time !
special hardware (Bioccelerator)
- Approximations in comparison: insertions and
deletions are widely ignored
reduced computing time!
Search
algorithms:
Heuristic Methods
• FastA (Pearson and Lipman)
• Blast / Blast2 (Altschul)
Sequence Searching
FastA (Pearson and Lipman)
Sequence Searching
1. Search for short identities of a fixed length (word size).
2. The score of each diagonal is determined.
3. The ten highest scoring regions (diagonals) are
rescored using scoring tables (initial regions).
Highest score is called init1.
4. Adjacent initial diagonals are connected.
Highest score is called initn.
5. For sequences with an initn score greater than a given
threshold an opt-score, a z-score, and an E() value are
calculated.
6. Sequences with an E() value smaller than a given
threshold are listed
Search for short identities of a fixed length
FastA
Example:
Step 1
Database
sequence
Query sequence
Word size:
DNA: 6
Protein: 2
FastA (Pearson and Lipman)
Sequence Searching
1. Search for short identities of a fixed length (word size).
2. The score of each diagonal is determined.
3. The ten highest scoring regions (diagonals) are
rescored using scoring tables (initial regions).
Highest score is called init1.
4. Adjacent initial diagonals are connected.
Highest score is called initn.
5. For sequences with an initn score greater than a given
threshold an opt-score, a z-score, and an E() value are
calculated.
6. Sequences with an E() value smaller than a given
threshold are listed
FastA
Example:
Step 2
Score = 60
Scoring of diagonals
DNA:
Match: 5
Mismatch: - 4
Protein:
Scoring Table
Database
sequence
Query sequence
FastA (Pearson and Lipman)
Sequence Searching
1. Search for short identities of a fixed length (word size).
2. The score of each diagonal is determined.
3. The ten highest scoring regions (diagonals) are
rescored using scoring tables (initial regions).
Highest score is called init1.
4. Adjacent initial diagonals are connected.
Highest score is called initn.
5. For sequences with an initn score greater than a given
threshold an opt-score, a z-score, and an E() value are
calculated.
6. Sequences with an E() value smaller than a given
threshold are listed
FastA
Example:
Step 3
Score > 60 (INIT1)
Scoring of diagonals
DNA:
Match: 5
Mismatch: - 4
Protein:
Scoring Table
Database
sequence
Query sequence
FastA (Pearson and Lipman)
Sequence Searching
1. Search for short identities of a fixed length (word size).
2. The score of each diagonal is determined.
3. The ten highest scoring regions (diagonals) are
rescored using scoring tables (initial regions).
Highest score is called init1.
4. Adjacent initial diagonals are connected.
Highest score is called initn.
5. For sequences with an initn score greater than a given
threshold an opt-score, a z-score, and an E() value are
calculated.
6. Sequences with an E() value smaller than a given
threshold are listed
Connection of neighbouring diagonals
FastA
Example:
Step 4
INITN = score + score - “joining penalty”
Database
sequence
Query sequence
yellow
green
FastA (Pearson and Lipman)
Sequence Searching
1. Search for short identities of a fixed length (word size).
2. The score of each diagonal is determined.
3. The ten highest scoring regions (diagonals) are
rescored using scoring tables (initial regions).
Highest score is called init1.
4. Adjacent initial diagonals are connected.
Highest score is called initn.
5. For sequences with an initn score greater than a given
threshold an opt-score, a z-score, and an E() value are
calculated.
6. Sequences with an E() value smaller than a given
threshold are listed
Score calculation
Opt-score: Smith-Waterman score
Z-score: normalization with respect to database
sequence length
E() value expectation value of a score
FastA
Step 5
FastA (Pearson and Lipman)
Sequence Searching
1. Search for short identities of a fixed length (word size).
2. The score of each diagonal is determined.
3. The ten highest scoring regions (diagonals) are
rescored using scoring tables (initial regions).
Highest score is called init1.
4. Adjacent initial diagonals are connected.
Highest score is called initn.
5. For sequences with an initn score greater than a given
threshold an opt-score, a z-score, and an E() value are
calculated.
6. Sequences with an E() value smaller than a given
threshold are listed
There is a mathematical theory that allows us to compute the
probability of occurrence of a certain score when searching a
random database of a given size.One can compute an
expectation value, which tells us how often a certain score
would occur simply by chance.
Blast
statistics
FastA output: part I
FastA
Example:
FastA
Results sorted and z-values calculated from opt score
1770 scores saved that exceeded 107
4614416 optimizations performed
Joining threshold: 47, optimization threshold: 32, opt. width: 16
The best scores are: init1 initn opt z-sc E(5219455)
EMORG:CHPHET01 Begin: 1 End:162
! M37322 P.hybrida chloroplast rpS19 810 810 810 614.0 5e-25
EMORG:CHPHETIR Begin:31 End:183 Strand: -
! M35955 P.hybrida chloroplast rps19' 410 410 699 531.8 1.7e-20
EMORG:SNCPJLB Begin: 2 End:150
! Z71250 S.nigrum chloroplast JLB reg 457 457 659 499.2 6.8e-19
EMORG:NPCPJLB Begin: 2 End:151
! Z71235 N.palmeri chloroplast JLB re 642 642 659 501.5 7e-19
EMORG:NBCPJLB Begin: 2 End:158
! Z71226 N.bigelovii chloroplast JLB 472 472 644 485.5 2.7e-18
EMORG:STCPJLB Begin: 2 End:149
! Z71248 S.tuberosum chloroplast JLB 452 452 641 485.4 3.7e-17
FastA output: part II
FastA
Example:
FastA
SCORES Init1: 472 Initn: 472 Opt: 644 z-score: 485.5 E(): 2.7e-18
91.7% identity in 157 bp overlap
10 20 30 40 50 59
chphet01.em_ TCTTCGTCGCCATAGTAAATAGGAGAGGAAATCGAATTTAATTCTTCG-TTTTACAAAAA
||||||||||||||| |||| ||||| ||||||||| |||||||||||
NBCPJLB GTAGTAAATAGGAGAGAAAATTGAATTAAATTCTTCGTTTTTACAAAAA
10 20 30 40
60 70 80 90 100 110
chphet01.em_ CTT------ACAAAAAAAAAAAAAATAGGAGTAAGCTTGTGACACGTTCACTAAAAAAAA
||| | |||||||||||||||||||||||||||||||||||||||||||||||||
NBCPJLB CTTTACAAAAAAAAAAAAAAAAAAATAGGAGTAAGCTTGTGACACGTTCACTAAAAAAAA
50 60 70 80 90 100
120 130 140 150 160
chphet01.em_ ACCCCTTTGTAGCCAATCATTTATTAAATAAAATTGATAAGCTTAACAC
| |||||||||||||||||||||||||| ||||||||||||||||||||
NBCPJLB ATCCCTTTGTAGCCAATCATTTATTAAAAAAAATTGATAAGCTTAACACAAAAGCAGAAA
110 120 130 140 150 160
FastA programs:
searches for similarity between a query sequence and
any group of sequences (DNA and Protein).
compares a peptide sequence against a set of nucleotid
sequences.
compares a nucleotide sequence against a protein
database taking frameshifts into account.
compares a peptide sequence against a nucleotide
sequence database taking frameshifts into account.
FastA
TFastX
FastX
TFastA
FastA
Blast
( 'Basic Local Alignment Search Tool' ) ( S. Altschul)
1. Search for short regions of a given length that score at or above
a certain threshold ( initial hits ).
2. Initial hits are extended until score drops off by a certain amount
from its maximum.
– The region with the maximum score is called
“ high scoring segment pair “ ( HSP ).
3. All HSPs with a score as high as or higher than a cutoff-score
are displayed. This cutoff score corresponds to a predefined
expectation value.
Sequence Searching
Blast
Sequence Searching
Word size Threshold
DNA 11 -
Protein 3 11 (blosum62)
essential parameters
BlastP
Blast
Example:
Step1
Word size: 3
Threshold: 11
Extension: 22
BLOSUM 62
A W T P S
H W T C S
-2 11 5 -5 4
14
Initial hit
Blast
( 'Basic Local Alignment Search Tool' ) ( S. Altschul)
1. Search for short regions of a given length that score at or above
a certain threshold ( initial hits ).
2. Initial hits are extended until score drops off by a certain amount
from its maximum.
– The region with the maximum score is called
“ high scoring segment pair “ ( HSP ).
3. All HSPs with a score as high as or higher than a cutoff-score
are displayed. This cutoff score corresponds to a predefined
expectation value.
Sequence Searching
Blast
Example:
Step 2
Word size: 3
Threshold: 11
Extension: 22
BLOSUM 62
A W T P S
H W T C S
-2 11 5 -5 4
14
9
13
Initial hit
BlastP
Blast
Example:
Step 2
Initial hit
Region of maximum score
Stop extension, if score of total region less than
maximum score minus extension parameter
BlastP
Blast
Example:
Step1 +2
W H R E A W T P S N D C L H Y
W C A R H W T C S Q N A M H Y
11 -3 -1 0 -2 11 5 -5 4 0 1 0 2 8
7
Initial hit
14
+7 +17
Database
sequence
Query sequence
HSP = 38
Evaluation of final HSP
Blast
( 'Basic Local Alignment Search Tool' ) ( S. Altschul)
1. Search for short regions of a given length that score at or above
a certain threshold ( initial hits ).
2. Initial hits are extended until score drops off by a certain amount
from its maximum.
– The region with the maximum score is called
“ high scoring segment pair “ ( HSP ).
3. All HSPs with a score as high as or higher than a cutoff-score
are displayed. This cutoff score corresponds to a predefined
expectation value.
Sequence Searching
Blast
There is a mathematical theory that allows us to compute the
probability of occurrence of a certain score when searching a random
database of a given size.
One can compute an expectation value, which tells us how
often a certain score would occur simply by chance.
Blast uses an expectation value of 10 as a threshold.
Blast displays all HSPs scoring as high as or higher
than the score corresponding to this expectation value.
Blast
statistics
BlastN output: part I
Blast
Example:
Smallest Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
>>>empri:HSCHIKER M37818 Human keratin (psi-K-alpha) p... 1115 4.8e-252 5
>>>empri:HSKERUV X05803 Human radiated keratinocyte m... 718 6.0e-106 2
>>>empri:HSEPKER J00124 Homo sapiens 50 kDa type I ep... 687 2.4e-105 4
>>>empri:HSKERELP X62571 H.sapiens mRNA for keratin-re... 718 3.1e-105 2
>>>emrod:MMKTEPIA M13806 Mouse keratin (epidermal) typ... 691 5.3e-101 2
>>>emrod:MMKTEPIC M13805 Mouse type I epidermal kerati... 700 1.1e-96 2
>>>empri:HSCYTOK17 Z19574 H.sapiens gene for cytokerati... 651 1.8e-90 3
>>>gb_pr_only:S79867 S79867 type I keratin 16 [human, epi... 709 1.5e-87 2
>>>empri:S72493 S72493 keratin=keratin 16 homolog [h... 693 1.3e-86 2
>>>empri:HSKERP2 M22928 Human keratin pseudogene, exo... 624 6.4e-86 3
...
>>>emvrl:HESH1ULTR L24958 Pseudorabies virus long termi... 125 0.99994 1
>>>emvrl:HESH1SEQ M81222 Herpesvirus-pseudorabies viru... 125 0.99995 1
>>>emvrl:HEHSSLLT M57505 Pseudorabies virus ORF1, ORF2... 125 0.99995 1
BlastN output: part II
Blast
Example:
Score = 1115 (308.1 bits) Expect = 4.8e-252 Sum P(5) = 4.8e-252
Identities = 223/223 (100%) Positives = 223/223 (100%) Strand = Plus/Plus
Query: 276 AGAAAGCATCGCTGGAGGGCAGCCTGGTGGAGACGGAGGTGTGTTACAGGACCCAGCTGG 335
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1325 AGAAAGCATCGCTGGAGGGCAGCCTGGTGGAGACGGAGGTGTGTTACAGGACCCAGCTGG 1384
...
Query: 456 AGATCGCCACCTACAGCCGCTTGCTAGAGGTTGAGGACGCCCG 498
|||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1505 AGATCGCCACCTACAGCCGCTTGCTAGAGGTTGAGGACGCCCG 1547
Score = 810 (223.8 bits) Expect = 4.8e-252 Sum P(5) = 4.8e-252
Identities = 162/162 (100%) Positives = 162/162 (100%) Strand = Plus/Plus
Query: 1 GAAATGAACGCCCTTTGAGGTCAGGTGGACGAGGATGTCAGTGTGAAGATGGACACTGTG 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 195 GAAATGAACGCCCTTTGAGGTCAGGTGGACGAGGATGTCAGTGTGAAGATGGACACTGTG 254
...
Blast2
Blast
Advantages of the new Blast generation:
• Speed
• Gapped alignments
Break/Exercises?
Blast2 - extension -
Reduces number of extensions.
Threshold for initial hits has to be reduced
to achieve same level of sensitivity.
Blast
Database
sequence
Query sequence
Initial hits are extended, if
1. there are two on the same diagonal.
2. their distance is < A.
Replacement of single-hit by two-hit method:
Introduction of gapped alignments
Blast2 - extension -
Blast
Database
sequence
Query sequence
Blast2 - extension -
Due to gapped alignments, only one HSP
out of a group needs to be found!
Blast
Threshold of initial hits can be increased.
Programs use pre-computed statistical parameters for
predefined substitution scores.
Only limited number of substitution matrices available.
Gap penalties are only slightly changeable.
Blast
Due to gapped alignments, no statistical treatment possible
for arbitrary substitution scores.
Blast2 - statistics -
Family of Blast programs
BlastN(2) - fast comparison of a nucleotide sequence
to a DNA sequence database
TBlastN(2) - fast comparison of a peptide sequence
to a DNA sequence database
BlastP(2) - fast comparison of a peptide sequence
to a protein sequence database
BlastX(2) - comparison of nucleotide sequence
to protein sequence database
TBlastX(2) - compares a six-frame translation of a nucleotide
query sequence against a six-frame translation of a
nucleotide sequence database
Blast
Gaps in Blast2
BlastN2, BlastP2 - fully gapped alignments
TBlastN2, BlastX2 - in-frame gapped alignments
TBlastX2 - no gaps
Blast
Nucleotide sequences: Fasta much more sensitive
BlastN(2) much faster
Blast locates multiple hits between the query and ONE database sequence
BlastN(2) cannot be used for short query sequences
For proteins:
Blast programs are faster while producing the same result as Fasta or even
Ssearch (Smith-Waterman)
Comparison between FASTA and BLAST
Database Searching
Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
>>>swissprot:25A1_RAT Q05961 rattus norvegicus (rat). (... 1915 1.8e-263 1
>>>swissprot:25A1_MOUSE P11928 mus musculus (mouse). (2'-... 1583 8.6e-222 2
>>>swissprot:25A2_HUMAN P04820 homo sapiens (human). (2'-... 782 9.2e-179 3
>>>swissprot:25A1_HUMAN P00973 homo sapiens (human). (2'-... 782 6.0e-173 2
>>>swissprot:25A2_MOUSE P29080 mus musculus (mouse). (2'-... 782 6.0e-173 2
>>>swissprot:25A6_HUMAN P29728 homo sapiens (human). 69/7... 386 1.8e-118 4
>>>swissprot:25A3_MOUSE P29081 mus musculus (mouse). (2'-... 785 5.7e-106 1
>>>swissprot:TR14_HUMAN Q15646 homo sapiens (human). thyr... 161 1.5e-16 2
>>>swissprot:YKH2_YEAST P36085 saccharomyces cerevisiae (... 66 0.71 1
BlastP output: part I
Example:
BlastP BlastP2
Sequence Searching
Query sequence: rat adenylate cyclase
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
>>>swissprot:25A1_RAT Q05961 rattus norvegicus (rat). (2'-5')ol... 750 0.0
>>>swissprot:25A1_MOUSE P11928 mus musculus (mouse). (2'-5')oli... 629 e-180
>>>swissprot:25A2_MOUSE P29080 mus musculus (mouse). (2'-5')oli... 495 e-140
>>>swissprot:25A2_HUMAN P04820 homo sapiens (human). (2'-5')oli... 495 e-140
>>>swissprot:25A1_HUMAN P00973 homo sapiens (human). (2'-5')oli... 495 e-140
>>>swissprot:25A3_MOUSE P29081 mus musculus (mouse). (2'-5')oli... 492 e-139
>>>swissprot:25A6_HUMAN P29728 homo sapiens (human). 69/71 kd (... 350 2e-96
>>>swissprot:TR14_HUMAN Q15646 homo sapiens (human). thyroid re... 77 4e-14
>>>swissprot:RN14_YEAST P25298 saccharomyces cerevisiae (baker'... 32 1.5
BlastP2 output: part I
Example:
BlastP BlastP2
Sequence Searching
Query sequence: rat adenylate cyclase
BlastP output: part II
Example:
BlastP BlastP2
Sequence Searching
25A6_HUMAN
BlastP reports 11 HSPs
Score = 386 (178.8 bits) Expect = 1.8e-118, Sum P(4) = 1.8e-118
Identities = 68/121 (56%) Positives = 94/121 (77%)
Query: 228 LPPQYALELLTVYAWERGNGITEFNTAQGFRTILELVTKYQQLRIYWTKYYDFQHPDVSK 287
LPP+YALELLT+YAWE+G+G+ +F+TA+GFRT+LELVT+YQQL I+W Y+F+ V K
Sbjct: 562 LPPKYALELLTIYAWEQGSGVPDFDTAEGFRTVLELVTQYQQLGIFWKVNYNFEDETVRK 621
...
Score = 260 (120.4 bits) Expect = 1.8e-118Sum P(4) = 1.8e-118
Identities = 54/118 (45%) Positives = 73/118 (61%)
Query: 58 KVVKGGSSGKGTTLKGKSDADLVVFLNNFTSFEDQLNRRGEFIKEIKKQLYEVQREKHFR 117
++V+GGS+ KGT LK SDADLVVF N+ S+ Q N R + +KEI +QL REK
Sbjct: 389 QIVRGGSTAKGTALKTGSDADLVVFHNSLKSYTSQKNERHKIVKEIHEQLKAFWREKEEE 448
Score = 222 ...
Score = 174 ...
Score = 142 ...
Score = 121
Score = 98
Score = 69
Score = 47
Score = 35
Score = 35 ...
BlastP2 output: part II
Example:
BlastP BlastP2
Sequence Searching
BlastP2 reports 2 HSPs
25A6_HUMAN
Score = 350 bits (888) Expect = 2e-96
Identities = 178/348 (51%) Positives = 237/348 (67%) Gaps = 8/348 (2%)
Query: 5 LRSTPSWKLDKFIEVYLLPNTSFRDDVKSAINVLCDFLKERCFRDTVHPVRVSKVVKGGS 64
L +TP LDKFI+ +L PN F + + SA+N++ FLKE CFR + +++ V+GGS
Sbjct: 339 LFTTPGHLLDKFIKEFLQPNKCFLEQIDSAVNIIRTFLKENCFRQSTAKIQI---VRGGS 395
...
Query: 301 LDPADPTGNVAGGNQEGWRRLASEARLWLQYPCFMNRGGSPVSSWEVP 348
LDP +PTG+V GG++ W L EA++ L PCF + G+P+ W+VP
Sbjct: 635 LDPGEPTGDVGGGDRWCWHLLDKEAKVRLSSPCFKDGTGNPIPPWKVP 682
Score = 214 bits (540) Expect = 2e-55
Identities = 136/349 (38%) Positives = 187/349 (52%) Gaps = 21/349 (6%)
Query: 2 EQELRSTPSWKLDKFIEVYLLPNTSFRDDVKSAINVLCDFLKERCFRDTVHPVRVSKVVK 61
E +L S P+ KL FI+ YL P + + +N +CD C P+ V V
Sbjct: 4 ESQLSSVPAQKLGWFIQEYLKPYEECQTLIDEMVNTICDV----CRNPEQFPL-VQGVAI 58
...
Query: 299 VILDPADPTGNVAGGNQEGWRRLASEARLWLQYPCFMNRGGSPVSSWEV 347
VILDP DPT NV+ G++ W+ L EA+ WL P N P SW V
Sbjct: 289 VILDPVDPTNNVS-GDKICWQWLKKEAQTWLTSPNLDNE--LPAPSWNV 334
BlastP: 11 HSPs
Example:
BlastP BlastP2
Sequence Searching
25A1_RAT
Query sequence
25A6_HUMAN
Example:
BlastN BlastN2 FastA SW
BlastN output
Sequence Searching
Query sequence: cDNA HUMAN Keratin against HONEST
Smallest Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
>>>emhum1:HSCHIKER M37818 Human keratin (psi-K-alpha) ps... 1115 8.5e-252 5
>>>emhum1:HSKERUV X05803 Human radiated keratinocyte mR... 718 1.0e-105 2
>>>emhum1:HSEPKER J00124 Homo sapiens 50 kDa type I epi... 687 4.2e-105 4
>>>emhum1:HSKERELP X62571 H.sapiens mRNA for keratin-rel... 718 5.4e-105 2
>>>emrod:MMKTEPIA M13806 Mouse keratin (epidermal) type... 691 9.3e-101 2
>>>emrod:MMKTEPIC M13805 Mouse type I epidermal keratin... 700 2.0e-96 2
>>>emhum1:HSCYTOK17 Z19574 H.sapiens gene for cytokeratin... 651 3.2e-90 3
>>>emhum2:S79867 S79867 type I keratin 16 [human, epid... 709 2.7e-87 2
>>>emhum2:S72493 S72493 keratin=keratin 16 homolog [hu... 693 2.3e-86 2
>>>emhum1:HSKERP2 M22928 Human keratin pseudogene, exon... 624 1.1e-85 3
>>>emhum1:HSKERC15 X07696 Human mRNA for cytokeratin 15.... 560 3.3e-72 2
>>>emhum2:HSTYPIKER X90763 H.sapiens mRNA for type I kera... 527 2.1e-56 2
>>>emhum1:HS1K14 M28646 Homo sapiens epidermal keratin... 745 3.2e-52 1
>>>emhum2:HSPK161 M37817 Human keratin psi-K#16-1 pseud... 480 6.2e-49 2
>>>emhum1:HSKERI6 M21755 Human pot. pseudo-keratin K16 ... 651 6.6e-46 1
>>>emhum1:HSKER16A6 M28437 Human keratin type 16 gene, ex... 642 3.8e-45 1
Example:
BlastN BlastN2 FastA SW
BlastN2 output
Sequence Searching
Query sequence: cDNA HUMAN Keratin against HONEST
Sequences producing significant alignments: Score E
(bits) Value
>>>emhum1:HSCHIKER M37818 Human keratin (psi-K-alpha) pseudogen... 442 e-122
>>>emhum1:HS1K14 M28646 Homo sapiens epidermal keratin, 3' end.. 103 5e-20
>>>emhum1:HSEPKER J00124 Homo sapiens 50 kDa type I epidermal k.. 103 5e-20
>>>emmam:BTKERVII X01461 Bovine mRNA 3'-region for epidermal ke.. 96 1e-17
>>>emrod:MMKTEPIB J02644 Mouse type I epidermal keratin mRNA, 3.. 96 1e-17
>>>emrod:MMKTEPIA M13806 Mouse keratin (epidermal) type I mRNA,.. 88 3e-15
>>>emhum1:HSKERI6 M21755 Human pot. pseudo-keratin K16 type I, .. 84 4e-14
>>>emhum1:HSCYTOK17 Z19574 H.sapiens gene for cytokeratin 17. 6/93. 78 3e-12
>>>emhum1:HSKERELP X62571 H.sapiens mRNA for keratin-related pr... 78 3e-12
>>>emhum1:HSKERUV X05803 Human radiated keratinocyte mRNA 266 (.. 78 3e-12
>>>emrod:MMKTEPIC M13805 Mouse type I epidermal keratin mRNA, c.. 76 1e-11
>>>emhum1:HSKER16A6 M28437 Human keratin type 16 gene, exon 6 .... 76 1e-11
>>>emhum2:S79867 S79867 type I keratin 16 [human, epidermal ker. 76 1e-11
>>>emhum2:HSPK161 M37817 Human keratin psi-K#16-1 pseudogene, 3 .
Example:
BlastN BlastN2 FastA SW
FastA output
Sequence Searching
Query sequence: cDNA HUMAN Keratin against HONEST
The best scores are: init1 initn opt..
emhum1:hschiker M37818 Human keratin (psi-K-alpha) pseud... 892 1924 912
emhum1:hsepker J00124 Homo sapiens 50 kDa type I epiderm.. 559 1101 564
emrod:mmktepia M13806 Mouse keratin (epidermal) type I m.. 559 1082 1109
emhum1:hskerelp X62571 H.sapiens mRNA for keratin-relate... 521 1015 1116
emhum1:hskeruv X05803 Human radiated keratinocyte mRNA 2.. 521 1015 1116
emrod:mmktepic M13805 Mouse type I epidermal keratin mRN.. 503 1005 1054
emhum1:hskerp2 M22928 Human keratin pseudogene, exons 2-.. 447 993 511
emhum1:hscytok17 Z19574 H.sapiens gene for cytokeratin 1... 461 977 543
emhum2:s79867 S79867 type I keratin 16 [human, epidermal. 503 890 976
emhum2:s72493 S72493 keratin=keratin 16 homolog [human, . 496 883 965
emhum1:hscytk X52426 H.sapiens mRNA for cytokeratin 13. . 499 753 776
emhum1:hsker13c X14640 Human mRNA for keratin 13. 1/94 492 746 769
emhum1:hskerc15 X07696 Human mRNA for cytokeratin 15. 9/93 391 735 832
emhum1:hs1k14 M28646 Homo sapiens epidermal keratin, 3' . 601 641 647
Fasta only detects the longest exon!
Example:
BlastN BlastN2 FastA SW
SW output
Sequence Searching
Query sequence: cDNA HUMAN Keratin against HONEST
Sequence Strd Orig ZScore EScore Len ! Documentation
emhum1:HSKERUV + 280.40 1079.43 0.000000e2147483647 956 ! X05803
Human radiated keratinocyte mRNA 266 (keratin-related protein). 7/95
emhum1:HSKERELP + 280.40 1076.40 0.000000e2147483647 1512 ! X62571
H.sapiens mRNA for keratin-related protein. 11/92
em_ro:MMKTEPIA + 278.80 1071.56 0.000000e2147483647 1224 ! M13806
Mouse keratin (epidermal) type I mRNA, clone p52, partial 4/90
em_ro:MMKTEPIC + 265.20 1021.34 0.000000e2147483647 801 ! M13805
Mouse type I epidermal keratin mRNA, clone pkSCC-50, partial cds.4/90
emhum2:S79867 + 248.40 952.05 0.000000e2147483647 1422 ! S79867
type I keratin 16 [epidermal keratinocytes, mRNA Partial, 1422 nt]. 2
emhum2:S72493 + 245.80 944.40 0.000000e2147483647 976 ! S72493
keratin [human, tracheobronchial epithelial cells, mRNA P
emhum1:HSCHIKER + 223.00 840.34 0.000000e2147483647 9697 ! M37818
Human keratin (psi-K-alpha) pseudogene, exons 4,5,6,7,8,
SW detects only the longest exon!
Example:
BlastX BlastX2 Framesearch
Plus Strand HSPs:
Score = 215 (98.9 bits) Expect = 1.6e-44 Sum P(3) = 1.6e-44
Identities = 42/45 (93%) Positives = 45/45 (100%) Frame = +3
Query: 3 EIKKAIKEESEGKMKGILGYSEDDVVSTDFVGDNRSSIFDAKAGL 137
EIKKAIKEESEGK+KGILGY+EDDVVSTDFVGDNRSSIFDAKAG+
Sbjct: 261 EIKKAIKEESEGKLKGILGYTEDDVVSTDFVGDNRSSIFDAKAGI 305
Score = 114 (52.4 bits) Expect = 1.6e-44 Sum P(3) = 1.6e-44
Identities = 20/21 (95%) Positives = 21/21 (100%) Frame = +1
Query: 139 IALSDKFVKLVSWYDNEWGYT 201
IALSDKFVKLVSWYDNEWGY+
Sbjct: 305 IALSDKFVKLVSWYDNEWGYS 325
Score = 67 (30.8 bits) Expect = 1.6e-44 Sum P(3) = 1.6e-44
Identities = 14/15 (93%) Positives = 15/15 (100%) Frame = +3
Query: 198 HSSRVVDLIVHMSKA 242
+SSRVVDLIVHMSKA
Sbjct: 324 YSSRVVDLIVHMSKA 338
BlastX output
Sequence Searching
Genbank: atts0012 EST against Swissprot
G3PC_ARATH
Example:
BlastX BlastX2 Framesearch
>>>> swissprot:G3PC_ARATH P25858 arabidopsis thaliana (mouse-ear cress).
glyceraldehyde 3-phosphate dehydrogenase, cytosolic (ec 1.2.1.12). 5/92
Length = 338
Score = 104 bits (257) Expect = 4e-23
Identities = 59/80 (73%) Positives = 65/80 (80%)
Query: 3 EIKKAIKEESEGKMKGILGYSEDDVVSTDFVGDNRSSIFDAKAGLHCIERQVCEVGVMVR 182
EIKKAIKEESEGK+KGILGY+EDDVVSTDFVGDNRSSIFDAKAG+ ++ V V
Sbjct: 261 EIKKAIKEESEGKLKGILGYTEDDVVSTDFVGDNRSSIFDAKAGIALSDKFVKLVS-WYD 319
Query: 183 QRMGLHSSRVVDLIVHMSKA 242
G +SSRVVDLIVHMSKA
Sbjct: 320 NEWG-YSSRVVDLIVHMSKA 338
BlastX2 output
Sequence Searching
Genbank: atts0012 EST against Swissprot
G3PC_ARATH
Sequence Searching
• The comparison of a sequence against a database ….
• The comparison of a pattern, a profile, or a HMM against a
database ….
In order to find similar sequences
In order to find weaker similarities.
Search
strategies:
Pattern searching
Database Searching
Pattern language
Fixed symbols: A
Alternatives: (A,B,C)
Exclusions: ~(D,E,F)
Repetitions: (A,B){n1,n2}
Pattern searching
Database Searching
Example:
Pattern: G(D,E)~PX{0,4}(D,E){1,2}W
Fitting sequences: GDSTRLDDW
GEAEW
Pattern searching
Database Searching
Nucleic acid pattern search
Protein pattern search
Protein pattern search after translation
Programs available in EMBOSS
fuzznuc
fuzztrans
fuzzpro
Profile Searching
Database Searching
• Suited for finding and aligning distantly related proteins.
• Input is not a single sequence, but a profile generated from a
multiple alignment of similar sequences.
• The profile reflects the likelihoods of all amino acids occurring
at every position in the alignment.
• Position specific gap weights.
What is missing in a Profile
Domain 1 (active binding site) ATGTCGTCGTCG
Domain 2 (never found, inactive) ATGTGGTCGTCG
Domain 3 (never found, inactive) ATGTCATCGTCG
Domain 4 (active) ATGTGATCGTCG
Gene prediction
Profile is A 200001000000
based only on G 002011002002
active domains ! T 020200200200
C 000010020020
Domain 1: Score = 22 ATGTCGTCGTCG
Domain 2: Score = 22 ATGTGGTCGTCG
Domain 3: Score = 22 ATGTCATCGTCG
Domain 4: Score = 22 ATGTGATCGTCG
How is a Markov Model created ?
Gene prediction
Markov Model is based on active domains only !
...
If a G is found at position 3, P(T)4=1.0
If a T is found at position 4, P(C)5=0.5, P(G)5=0.5
If a C is found at position 5, P(A)6=1.0, P(G)6=0.5
If a G is found at position 5, P(A)6=0.5, P(G)6=0.5
If a G is found at position 6, P(T)7=1.0
If an A is found at position 6, P(T)7=1.0
...
Domain 1 (active binding site) ATGTCGTCGTCG
Domain 2 (never found, inactive) ATGTGGTCGTCG
Domain 3 (never found, inactive) ATGTCATCGTCG
Domain 4 (active) ATGTGATCGTCG
Gene prediction
In contrast to patterns and profiles, Markov Models take into
account the information about neighboring residues.
Markov Models are probabilistic, statistical models, they
define a score function for each residue in a chain.
In contrast to patterns and profiles, MMs are also better suited
to describe deletions and insertions.
C G
C
-
T
A
P=0.6
P=0.1
P=0.2
P=0.09
P=0.01
Gene prediction
In Hidden Markov Models, additional information is used.
The complexity of the MM increases and does not allow easy
understanding of why a residue is predicted at a certain position.
Hence the underlying model remains “hidden”.
P=0.6
P=0.1
P=0.2
P=0.09
P=0.01
If C was proceeded by
a promoter
P=0.1
P=0.9
P=0
P=0
P=0
G
C
-
T
A
C
First order Markov Model
Gene prediction
Fifth order Markov Model
Hidden Markov Models
Database Searching
• Hidden Markov models are suited for finding and aligning
distantly related proteins.
• A hidden Markov model is a more general description of a
multiple alignment than a profile.
• For database searching the input is not a single sequence,
but a hidden Markov model generated from a multiple
alignment of similar sequences.
Hidden Markov Models
Database Searching
- generates a Hidden Markov Model (HMM)
- Smith-Waterman search using a HMM against a
sequence database
- scans a sequence against a database of HMMs
HMMerbuild
HMMersearch
HMMerscan
Pronouncing HMMER
It's "hammer": as in, a more precise mining tool than a
BLAST.
http://hmmer.wustl.edu/
PSIBlast
Database Searching
• Iterative Blast2 program starting with a single protein sequence.
• Automatically generates profile from search results.
• Blast2 algorithm extended to work with profiles.
stripped down but ultra-fast version of iterative profile HMM searches
Position Specific Iterated
PHI-BLAST (Pattern-Hit Initiated BLAST)
combines matching of regular expressions
with local alignments surrounding the match
helps answer the question:
What other protein sequences both contain an
occurrence of P
and are homologous to S in the vicinity of the
pattern occurrences?
Detecting weak protein homologies
Database Searching
Example:
A Comparison of different methods
Human hemoglobin alpha (BlastP, FastA, SW, PSIBlast)
Profile of the alignment of 7 hemoglobin and myoglobin sequences
(ProfileSearch)
HMM of the alignment of 7 hemoglobin and myoglobin sequences
(HMMSearch)
Input:
Database:
10.000 sequences including 14 globin sequences
BlastP
Database Searching
Example:
HBAC_ANGAN P80726 HEMOGLOBIN ALPHA, CATHODIC CHA... 255 1.6e-46 2
HBAA_ANGAN P80945 HEMOGLOBIN ALPHA, ANODIC CHAIN... 218 6.4e-41 2
HBBC_ANGAN P80727 HEMOGLOBIN BETA, CATHODIC CHAI... 188 5.6e-37 3
HBBA_ANGAN P80946 HEMOGLOBIN BETA, ANODIC CHAIN.... 116 7.1e-17 2
MYG_HORSE P02188 MYOGLOBIN. 2/97 72 7.8e-09 2
PTB_PIG Q29099 POLYPYRIMIDINE TRACT-BINDING P... 59 0.45 1
DLDH_MYCPN P75393 LIPOAMIDE DEHYDROGENASE COMPON... 56 0.79 1
XERC_SALTY P55888 INTEGRASE/RECOMBINASE XERC. 2/97 39 0.87 3
SYI_CAEEL Q21926 PROBABLE ISOLEUCYL-TRNA SYNTHE... 49 0.993 2
LEUA_MICAE P94907 2-ISOPROPYLMALATE SYNTHASE (EC... 40 0.994 3
TPIS_RHIET P96985 TRIOSEPHOSPHATE ISOMERASE (EC ... 43 0.995 2
HJA3_METJA Q58655 PROBABLE ARCHAEAL HISTONE 3. 2/97 49 0.999 1
CMTG_PSEPU Q51983 4-HYDROXY-2-OXOVALERATE ALDOLA... 45 0.999 2
FastA
Database Searching
Example:
hbac_angan P80726 HEMOGLOBIN ALPHA, CATHODIC CHAIN... 263 370 381
hbaa_angan P80945 HEMOGLOBIN ALPHA, ANODIC CHAIN. ... 226 331 351
hbbc_angan P80727 HEMOGLOBIN BETA, CATHODIC CHAIN.... 187 231 300
hbba_angan P80946 HEMOGLOBIN BETA, ANODIC CHAIN. 2/97 120 149 238
syl_syny3 P73274 LEUCYL-TRNA SYNTHETASE (EC 6.1.1... 42 66 57
ymb3_yeast Q04228 HYPOTHETICAL 66.8 KD PROTEIN IN ... 40 66 42
rdh1_bovin Q27979 11-CIS RETINOL DEHYDROGENASE (EC... 54 65 65
amd_mouse P97467 PEPTIDYL-GLYCINE ALPHA-AMIDATING .. 36 64 41
faf_mouse P70398 PROBABLE UBIQUITIN CARBOXYL-TERMI.. 45 63 45
ycby_haein P44524 HYPOTHETICAL PROTEIN HI0116/15. ... 49 61 56
myg_horse P02188 MYOGLOBIN 2/97 40 61 172
ptga_brela Q45298 PTS SYSTEM, GLUCOSE-SPECIFIC IIA... 49 60 69
e6ap_human Q05086 ONCOGENIC PROTEIN-ASSOCIATED PRO... 44 60 44
mauf_thive Q56463 METHYLAMINE UTILIZATION PROTEIN ... 48 59 48
SW
Database Searching
Example:
1 HBAC_ANGA P80726 HEMOGLOBIN ALPHA, CATHODIC CHAIN. 2/97 .0000 378.0
2 HBAA_ANGA P80945 HEMOGLOBIN ALPHA, ANODIC CHAIN. 2/97 .0000 348.0
3 HBBC_ANGA P80727 HEMOGLOBIN BETA, CATHODIC CHAIN. 2/97 .0000 296.0
4 HBBA_ANGA P80946 HEMOGLOBIN BETA, ANODIC CHAIN. 2/97 .0000 231.0
5 MYG_HORSE P02188 MYOGLOBIN. 2/97 .0000 172.0
6 Y168_HAEI P43958 HYPOTHETICAL PROTEIN HI0168/69. 2/97 .0001 80.0
7 GLBI_CHIT Q23763 GLOBIN CTT-VIIB-8 PRECURSOR. 2/97 .0004 73.0
8 HMPA_ERWC Q47266 FLAVOHEMOPROTEIN (HAEMOGLOBIN-LIKE PROTEIN) .0006 75.0
9 GLBZ_CHIT Q23761 GLOBIN CTT-Z PRECURSOR (HBZ). 2/97 .0007 70.0
10 MAOX_MYCT P71880 PUTATIVE MALATE OXIDOREDUCTASE (NAD) (EC 1 .0011 75.0
11 YCHM_ECOL P40877 HYPOTHETICAL 58.4 KD PROTEIN IN PTH-PRSA IN .0014 73.0
12 GLBK_CHIT Q23762 GLOBIN CTT-VIIB-10 PRECURSOR. 2/97 .0014 67.0
13 FCPC_MACP Q40299 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN .0018 67.0
25 GLB1_CHIT P02221 GLOBIN CTT-I/CTT-IA PRECURSOR (ERYTHROCRUOR .0037 62.0
35 GLB6_CHIT P02224 GLOBIN CTT-VI PRECURSOR. 2/97 .0047 61.0
.
.
.
.
PSIBlast
Results from round 1
Score E
Sequences producing significant alignments: (bits) Value
>>swnew:HBAC_ANGAN P80726 HEMOGLOBIN CATHODIC, ALPHA CHAIN. 1... 141 2e-35
>>swnew:HBAA_ANGAN P80945 HEMOGLOBIN ANODIC, ALPHA CHAIN. 10/... 125 2e-30
>>swnew:HBBC_ANGAN P80727 HEMOGLOBIN CATHODIC, BETA CHAIN. 11... 102 1e-23
>>swnew:HBBA_ANGAN P80946 HEMOGLOBIN ANODIC, BETA CHAIN. 11/1997 83 7e-18
>>swnew:MYG_HORSE P02188 MYOGLOBIN. 10/2000 48 4e-07
Results from round 2
>>swnew:PYRC_AERPE Q9yfi5 DIHYDROOROTASE (EC 3.5.2.3) (DHOASE... 24 5.7
>>swnew:GLB_PAREP P80721 GLOBIN-3 (MYOGLOBIN). 7/1998 24 5.7
>>swnew:Y373_HUMAN O15078 HYPOTHETICAL PROTEIN KIAA0373. 10/2000 24 7.5
Database Searching
Example:
ProfileSearch
Database Searching
Example:
1 SWNEW:HBBC_ANGAN + 33.39 51.16 146 ! P80727 HEMOGLOBIN BETA, CATHODIC
2 SWNEW:HBBA_ANGAN + 29.72 46.88 147 ! P80946 HEMOGLOBIN BETA, ANODIC C
3 SWNEW:HBAC_ANGAN + 28.88 45.16 142 ! P80726 HEMOGLOBIN ALPHA, CATHODI
4 SWNEW:HBAA_ANGAN + 27.36 43.34 142 ! P80945 HEMOGLOBIN ALPHA, ANODIC
5 SWNEW:MYG_HORSE + 27.02 44.37 153 ! P02188 MYOGLOBIN. 2/97
6 SWNEW:GLBI_CHITH + 11.95 26.38 161 ! Q23763 GLOBIN CTT-VIIB-8 PRECURS
7 SWNEW:GLB1_CHITH + 11.66 25.81 158 ! P02221 GLOBIN CTT-I/CTT-IA PRECU
8 SWNEW:GLBK_CHITH + 11.65 26.01 161 ! Q23762 GLOBIN CTT-VIIB-10 PRECUR
9 SWNEW:GLBZ_CHITH + 11.35 25.75 163 ! Q23761 GLOBIN CTT-Z PRECURSOR (H
10 SWNEW:GLB_PAREP + 10.99 24.20 147 ! P80721 GLOBIN (MYOGLOBIN). 2/97
11 SWNEW:GLB_ISOHY + 10.98 24.26 148 ! P80722 GLOBIN (MYOGLOBIN). 2/97
12 SWNEW:GLBW_CHITH + 8.27 21.63 159 ! Q23760 GLOBIN CTT-W PRECURSOR (H
13 SWNEW:GLB6_CHITH + 8.20 21.72 162 ! P02224 GLOBIN CTT-VI PRECURSOR.
14 SWNEW:Y395_METJA + 5.36 17.94 158 ! Q57838 HYPOTHETICAL PROTEIN MJ03
15 SWNEW:YM34_YEAST + 4.86 16.93 150 ! Q03818 HYPOTHETICAL 17.2 KD PROT
16 SWNEW:YA74_METJA + 4.71 14.66 112 ! Q58474 HYPOTHETICAL PROTEIN MJ10
17 SWNEW:Y267_MYCPN + 4.48 14.53 114 ! P75397 HYPOTHETICAL PROTEIN MG26
50 SWNEW:HMPA_ERWCH FLAVOHEMOPROTEIN
.
.
.
HMMerSearch
Database Searching
Example:
1 172.07 1 153 10 165 MYG_HORSE
2 135.21 4 145 12 158 HBBC_ANGAN
3 113.30 4 147 12 159 HBBA_ANGAN
4 111.99 1 142 10 159 HBAC_ANGAN
5 96.50 1 142 10 159 HBAA_ANGAN
6 33.70 8 150 1 147 GLBI_CHITH
7 31.65 8 150 1 147 GLBK_CHITH
8 26.41 8 157 1 154 GLBZ_CHITH
9 23.70 7 154 1 154 GLB1_CHITH
10 16.57 7 149 1 147 GLB6_CHITH
11 12.42 26 147 32 157 GLB_ISOHY
12 11.24 31 90 30 90 GLBW_CHITH
13 10.06 22 145 28 156 GLB_PAREP
14 8.16 75 126 114 165 SMF1_YEAST
15 6.51 347 434 76 165 AEFA_ECOLI
16 3.64 2 83 84 165 CAPP_AMAHP
17 3.60 72 96 1 25 PHZA_PSEAR
18 3.36 30 142 49 165 HMPA_ERWCH
19 2.58 36 73 1 39 ATPN_CAEEL
20 2.06 91 166 89 165 YFBM_ECOLI
Benchmarks
Database Searching
Protein query against protein database
(Swissprot 38+: 85661 entries, 18-05-2000)
Seq.Length BlastP BlastP2 FastA SSEARCH SW(Biocc)
250 20.6 8.3 49 379 22.6
500 37.2 12.6 67 787 35.1
1000 80 16.0 88 1672 59.8
2000 194 26.6 116 3499 105
4000 308 62.1 195 8628 321
time in seconds
1 hour = 3600 s.
1 day = 86400 s.
Database Searching
DNA query against nucleic acid database
(NR 610 000 entries, 18-05-2000)
Benchmarks
Seq.Length BlastP BlastP2 FastA SSEARCH SW(Biocc)
200 40 39 1314 32970 1448
1000 48 62 2247 - 5254
5000 130 112 2748 - 24465
25000 473 459 10866 - 120562
time in seconds
1 hour = 3600 s.
1 day = 86400 s.
Database Searching
ProfileSearch:
28 sequences of length 720 against
Swissprot (35)
FrameSearch:
EST query (286 bp) against Swissprot (35)
CVX Biocc.
8380 97
6300 140
Benchmarks
time in seconds
1 hour = 3600 s.
Sequence Masking
Database Searching
DUST
XBlast
RepeatMasker
SEG
XNU
- masking of locally biased regions, BlastN2
- masking of repetitive elements
- masking of repetitive elements
- masking of locally biased regions
- masking of locally biased regions
DNA
Protein
Sequence Masking
Database Searching
Example:
High Smallest Sum
Sequences producing High-scoring Segment Pairs: Score Probability P(N) N
emhum1:HS367641 U36764 Human TGF-beta receptor interac.. 470 1.6e-53
3
emhum2:HSU39067 U39067 Human translation initiation fa.. 461 1.1e-52
3
emhum1:AC004161 Ac004161 Homo sapiens BAC clone RG208K.. 249 1.4e-23
3
empln:AT36765 U36765 Arabidopsis thaliana TGF-beta r.. 189 2.3e-05
1
eminv:DMU90930 U90930 Drosophila melanogaster TRIP-1 .. 189 2.3e-05
1
emfun:SPD187 D89187 Schizosaccharomyces pombe mRNA,.. 150 0.045
1
emfun:SPSUM1 Y09529 S.pombe mRNA for SUM1 protein. .. 150 0.046
1
emnew:SPSUM1 Y09529 S.pombe mRNA for SUM1 protein. .. 150 0.046
1
BlastN result against HONEST with masking
Database Searching
Example:
WARNING: Descriptions of 30,108 database sequences
were not reported due to the limiting value
of parameter -LIST=500.
BlastN result against HONEST without masking
Sequence Masking
Sequence Masking
Database Searching
Example:
XBlast
1 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
51 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
101 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
151 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
201 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
251 xxxxxxxxxx TAAAAAAxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
301 xxxxxxAGAG ATCCAGCTTC TTAAGCCTCA AACTCAAATT CGAAGTACTG
351 TGGGTCGAAG TAATGGATAC GGACGTAACC ATCTTCGCCG CCGCTGCTGT
401 ACTCTTGCCA TCAGGATGGA AGGCAACACT GTTGATAGGT CCAAAGTGAC
451 CTTGAACTTC CAAAACTTCT TCAAAGGCCA AATGGAAGAA CTGGCCTCAA
501 ATTGCCAANC TGGTGGAGGT TGTGGTAAAN CATGGTTCCT GACACCGCCA
551 AGACCAAATG GNATAGTTGG G
Homo sapiens cDNA clone
DNA versus protein sequence comparison
E(DNA) E(protein)
MUSGLUTA 0 0
MAMGLUTRA 10
-11
0
MMGLUT 1.1 10
-7
HUMGSTB 14 10
-6
Distantly related sequences
(Pearson,CABIOS 13, 4 (1997)
Database Searching
• Always compare proteins if the genes encode proteins
• Matches that are >50 % identical in a 20-40 amino acid region
frequently occur by chance
• Most sequences with E<0.01 are homologous but distantly
related sequences do not share significant homology
• Use “transitive” method!

Introduction_to_Bioinformatics_Database_Search.ppt

  • 1.
  • 2.
    Database Searching • KeywordSearching Database entries, that contain a keyword specified. (IRX, SRS, ….) • Sequence Searching Similar sequences, and their alignment with the query sequence. (FastA, Blast, ….)
  • 3.
    Sequence Searching • Thecomparison of a sequence against a database …. • The comparison of a pattern, a profile, or a HMM against a database …. In order to find similar sequences In order to find weaker similarities. Search strategies:
  • 4.
    Sequence Searching • Sensitivesearching using rigorous searching algorithms: • Extremely fast searching using heuristic searching algorithms: - Smith - Waterman search: accurate comparison computing time ! special hardware (Bioccelerator) - Approximations in comparison: insertions and deletions are widely ignored reduced computing time! Search algorithms:
  • 5.
    Heuristic Methods • FastA(Pearson and Lipman) • Blast / Blast2 (Altschul) Sequence Searching
  • 6.
    FastA (Pearson andLipman) Sequence Searching 1. Search for short identities of a fixed length (word size). 2. The score of each diagonal is determined. 3. The ten highest scoring regions (diagonals) are rescored using scoring tables (initial regions). Highest score is called init1. 4. Adjacent initial diagonals are connected. Highest score is called initn. 5. For sequences with an initn score greater than a given threshold an opt-score, a z-score, and an E() value are calculated. 6. Sequences with an E() value smaller than a given threshold are listed
  • 7.
    Search for shortidentities of a fixed length FastA Example: Step 1 Database sequence Query sequence Word size: DNA: 6 Protein: 2
  • 8.
    FastA (Pearson andLipman) Sequence Searching 1. Search for short identities of a fixed length (word size). 2. The score of each diagonal is determined. 3. The ten highest scoring regions (diagonals) are rescored using scoring tables (initial regions). Highest score is called init1. 4. Adjacent initial diagonals are connected. Highest score is called initn. 5. For sequences with an initn score greater than a given threshold an opt-score, a z-score, and an E() value are calculated. 6. Sequences with an E() value smaller than a given threshold are listed
  • 9.
    FastA Example: Step 2 Score =60 Scoring of diagonals DNA: Match: 5 Mismatch: - 4 Protein: Scoring Table Database sequence Query sequence
  • 10.
    FastA (Pearson andLipman) Sequence Searching 1. Search for short identities of a fixed length (word size). 2. The score of each diagonal is determined. 3. The ten highest scoring regions (diagonals) are rescored using scoring tables (initial regions). Highest score is called init1. 4. Adjacent initial diagonals are connected. Highest score is called initn. 5. For sequences with an initn score greater than a given threshold an opt-score, a z-score, and an E() value are calculated. 6. Sequences with an E() value smaller than a given threshold are listed
  • 11.
    FastA Example: Step 3 Score >60 (INIT1) Scoring of diagonals DNA: Match: 5 Mismatch: - 4 Protein: Scoring Table Database sequence Query sequence
  • 12.
    FastA (Pearson andLipman) Sequence Searching 1. Search for short identities of a fixed length (word size). 2. The score of each diagonal is determined. 3. The ten highest scoring regions (diagonals) are rescored using scoring tables (initial regions). Highest score is called init1. 4. Adjacent initial diagonals are connected. Highest score is called initn. 5. For sequences with an initn score greater than a given threshold an opt-score, a z-score, and an E() value are calculated. 6. Sequences with an E() value smaller than a given threshold are listed
  • 13.
    Connection of neighbouringdiagonals FastA Example: Step 4 INITN = score + score - “joining penalty” Database sequence Query sequence yellow green
  • 14.
    FastA (Pearson andLipman) Sequence Searching 1. Search for short identities of a fixed length (word size). 2. The score of each diagonal is determined. 3. The ten highest scoring regions (diagonals) are rescored using scoring tables (initial regions). Highest score is called init1. 4. Adjacent initial diagonals are connected. Highest score is called initn. 5. For sequences with an initn score greater than a given threshold an opt-score, a z-score, and an E() value are calculated. 6. Sequences with an E() value smaller than a given threshold are listed
  • 15.
    Score calculation Opt-score: Smith-Watermanscore Z-score: normalization with respect to database sequence length E() value expectation value of a score FastA Step 5
  • 16.
    FastA (Pearson andLipman) Sequence Searching 1. Search for short identities of a fixed length (word size). 2. The score of each diagonal is determined. 3. The ten highest scoring regions (diagonals) are rescored using scoring tables (initial regions). Highest score is called init1. 4. Adjacent initial diagonals are connected. Highest score is called initn. 5. For sequences with an initn score greater than a given threshold an opt-score, a z-score, and an E() value are calculated. 6. Sequences with an E() value smaller than a given threshold are listed
  • 17.
    There is amathematical theory that allows us to compute the probability of occurrence of a certain score when searching a random database of a given size.One can compute an expectation value, which tells us how often a certain score would occur simply by chance. Blast statistics
  • 18.
    FastA output: partI FastA Example: FastA Results sorted and z-values calculated from opt score 1770 scores saved that exceeded 107 4614416 optimizations performed Joining threshold: 47, optimization threshold: 32, opt. width: 16 The best scores are: init1 initn opt z-sc E(5219455) EMORG:CHPHET01 Begin: 1 End:162 ! M37322 P.hybrida chloroplast rpS19 810 810 810 614.0 5e-25 EMORG:CHPHETIR Begin:31 End:183 Strand: - ! M35955 P.hybrida chloroplast rps19' 410 410 699 531.8 1.7e-20 EMORG:SNCPJLB Begin: 2 End:150 ! Z71250 S.nigrum chloroplast JLB reg 457 457 659 499.2 6.8e-19 EMORG:NPCPJLB Begin: 2 End:151 ! Z71235 N.palmeri chloroplast JLB re 642 642 659 501.5 7e-19 EMORG:NBCPJLB Begin: 2 End:158 ! Z71226 N.bigelovii chloroplast JLB 472 472 644 485.5 2.7e-18 EMORG:STCPJLB Begin: 2 End:149 ! Z71248 S.tuberosum chloroplast JLB 452 452 641 485.4 3.7e-17
  • 19.
    FastA output: partII FastA Example: FastA SCORES Init1: 472 Initn: 472 Opt: 644 z-score: 485.5 E(): 2.7e-18 91.7% identity in 157 bp overlap 10 20 30 40 50 59 chphet01.em_ TCTTCGTCGCCATAGTAAATAGGAGAGGAAATCGAATTTAATTCTTCG-TTTTACAAAAA ||||||||||||||| |||| ||||| ||||||||| ||||||||||| NBCPJLB GTAGTAAATAGGAGAGAAAATTGAATTAAATTCTTCGTTTTTACAAAAA 10 20 30 40 60 70 80 90 100 110 chphet01.em_ CTT------ACAAAAAAAAAAAAAATAGGAGTAAGCTTGTGACACGTTCACTAAAAAAAA ||| | ||||||||||||||||||||||||||||||||||||||||||||||||| NBCPJLB CTTTACAAAAAAAAAAAAAAAAAAATAGGAGTAAGCTTGTGACACGTTCACTAAAAAAAA 50 60 70 80 90 100 120 130 140 150 160 chphet01.em_ ACCCCTTTGTAGCCAATCATTTATTAAATAAAATTGATAAGCTTAACAC | |||||||||||||||||||||||||| |||||||||||||||||||| NBCPJLB ATCCCTTTGTAGCCAATCATTTATTAAAAAAAATTGATAAGCTTAACACAAAAGCAGAAA 110 120 130 140 150 160
  • 20.
    FastA programs: searches forsimilarity between a query sequence and any group of sequences (DNA and Protein). compares a peptide sequence against a set of nucleotid sequences. compares a nucleotide sequence against a protein database taking frameshifts into account. compares a peptide sequence against a nucleotide sequence database taking frameshifts into account. FastA TFastX FastX TFastA FastA
  • 22.
    Blast ( 'Basic LocalAlignment Search Tool' ) ( S. Altschul) 1. Search for short regions of a given length that score at or above a certain threshold ( initial hits ). 2. Initial hits are extended until score drops off by a certain amount from its maximum. – The region with the maximum score is called “ high scoring segment pair “ ( HSP ). 3. All HSPs with a score as high as or higher than a cutoff-score are displayed. This cutoff score corresponds to a predefined expectation value. Sequence Searching
  • 23.
    Blast Sequence Searching Word sizeThreshold DNA 11 - Protein 3 11 (blosum62) essential parameters
  • 24.
    BlastP Blast Example: Step1 Word size: 3 Threshold:11 Extension: 22 BLOSUM 62 A W T P S H W T C S -2 11 5 -5 4 14 Initial hit
  • 25.
    Blast ( 'Basic LocalAlignment Search Tool' ) ( S. Altschul) 1. Search for short regions of a given length that score at or above a certain threshold ( initial hits ). 2. Initial hits are extended until score drops off by a certain amount from its maximum. – The region with the maximum score is called “ high scoring segment pair “ ( HSP ). 3. All HSPs with a score as high as or higher than a cutoff-score are displayed. This cutoff score corresponds to a predefined expectation value. Sequence Searching
  • 26.
    Blast Example: Step 2 Word size:3 Threshold: 11 Extension: 22 BLOSUM 62 A W T P S H W T C S -2 11 5 -5 4 14 9 13 Initial hit BlastP
  • 27.
    Blast Example: Step 2 Initial hit Regionof maximum score Stop extension, if score of total region less than maximum score minus extension parameter BlastP
  • 28.
    Blast Example: Step1 +2 W HR E A W T P S N D C L H Y W C A R H W T C S Q N A M H Y 11 -3 -1 0 -2 11 5 -5 4 0 1 0 2 8 7 Initial hit 14 +7 +17 Database sequence Query sequence HSP = 38 Evaluation of final HSP
  • 29.
    Blast ( 'Basic LocalAlignment Search Tool' ) ( S. Altschul) 1. Search for short regions of a given length that score at or above a certain threshold ( initial hits ). 2. Initial hits are extended until score drops off by a certain amount from its maximum. – The region with the maximum score is called “ high scoring segment pair “ ( HSP ). 3. All HSPs with a score as high as or higher than a cutoff-score are displayed. This cutoff score corresponds to a predefined expectation value. Sequence Searching
  • 30.
    Blast There is amathematical theory that allows us to compute the probability of occurrence of a certain score when searching a random database of a given size. One can compute an expectation value, which tells us how often a certain score would occur simply by chance. Blast uses an expectation value of 10 as a threshold. Blast displays all HSPs scoring as high as or higher than the score corresponding to this expectation value. Blast statistics
  • 31.
    BlastN output: partI Blast Example: Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N >>>empri:HSCHIKER M37818 Human keratin (psi-K-alpha) p... 1115 4.8e-252 5 >>>empri:HSKERUV X05803 Human radiated keratinocyte m... 718 6.0e-106 2 >>>empri:HSEPKER J00124 Homo sapiens 50 kDa type I ep... 687 2.4e-105 4 >>>empri:HSKERELP X62571 H.sapiens mRNA for keratin-re... 718 3.1e-105 2 >>>emrod:MMKTEPIA M13806 Mouse keratin (epidermal) typ... 691 5.3e-101 2 >>>emrod:MMKTEPIC M13805 Mouse type I epidermal kerati... 700 1.1e-96 2 >>>empri:HSCYTOK17 Z19574 H.sapiens gene for cytokerati... 651 1.8e-90 3 >>>gb_pr_only:S79867 S79867 type I keratin 16 [human, epi... 709 1.5e-87 2 >>>empri:S72493 S72493 keratin=keratin 16 homolog [h... 693 1.3e-86 2 >>>empri:HSKERP2 M22928 Human keratin pseudogene, exo... 624 6.4e-86 3 ... >>>emvrl:HESH1ULTR L24958 Pseudorabies virus long termi... 125 0.99994 1 >>>emvrl:HESH1SEQ M81222 Herpesvirus-pseudorabies viru... 125 0.99995 1 >>>emvrl:HEHSSLLT M57505 Pseudorabies virus ORF1, ORF2... 125 0.99995 1
  • 32.
    BlastN output: partII Blast Example: Score = 1115 (308.1 bits) Expect = 4.8e-252 Sum P(5) = 4.8e-252 Identities = 223/223 (100%) Positives = 223/223 (100%) Strand = Plus/Plus Query: 276 AGAAAGCATCGCTGGAGGGCAGCCTGGTGGAGACGGAGGTGTGTTACAGGACCCAGCTGG 335 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1325 AGAAAGCATCGCTGGAGGGCAGCCTGGTGGAGACGGAGGTGTGTTACAGGACCCAGCTGG 1384 ... Query: 456 AGATCGCCACCTACAGCCGCTTGCTAGAGGTTGAGGACGCCCG 498 ||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1505 AGATCGCCACCTACAGCCGCTTGCTAGAGGTTGAGGACGCCCG 1547 Score = 810 (223.8 bits) Expect = 4.8e-252 Sum P(5) = 4.8e-252 Identities = 162/162 (100%) Positives = 162/162 (100%) Strand = Plus/Plus Query: 1 GAAATGAACGCCCTTTGAGGTCAGGTGGACGAGGATGTCAGTGTGAAGATGGACACTGTG 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 195 GAAATGAACGCCCTTTGAGGTCAGGTGGACGAGGATGTCAGTGTGAAGATGGACACTGTG 254 ...
  • 33.
    Blast2 Blast Advantages of thenew Blast generation: • Speed • Gapped alignments
  • 34.
  • 35.
    Blast2 - extension- Reduces number of extensions. Threshold for initial hits has to be reduced to achieve same level of sensitivity. Blast Database sequence Query sequence Initial hits are extended, if 1. there are two on the same diagonal. 2. their distance is < A. Replacement of single-hit by two-hit method:
  • 36.
    Introduction of gappedalignments Blast2 - extension - Blast Database sequence Query sequence
  • 37.
    Blast2 - extension- Due to gapped alignments, only one HSP out of a group needs to be found! Blast Threshold of initial hits can be increased.
  • 38.
    Programs use pre-computedstatistical parameters for predefined substitution scores. Only limited number of substitution matrices available. Gap penalties are only slightly changeable. Blast Due to gapped alignments, no statistical treatment possible for arbitrary substitution scores. Blast2 - statistics -
  • 39.
    Family of Blastprograms BlastN(2) - fast comparison of a nucleotide sequence to a DNA sequence database TBlastN(2) - fast comparison of a peptide sequence to a DNA sequence database BlastP(2) - fast comparison of a peptide sequence to a protein sequence database BlastX(2) - comparison of nucleotide sequence to protein sequence database TBlastX(2) - compares a six-frame translation of a nucleotide query sequence against a six-frame translation of a nucleotide sequence database Blast
  • 40.
    Gaps in Blast2 BlastN2,BlastP2 - fully gapped alignments TBlastN2, BlastX2 - in-frame gapped alignments TBlastX2 - no gaps Blast
  • 41.
    Nucleotide sequences: Fastamuch more sensitive BlastN(2) much faster Blast locates multiple hits between the query and ONE database sequence BlastN(2) cannot be used for short query sequences For proteins: Blast programs are faster while producing the same result as Fasta or even Ssearch (Smith-Waterman) Comparison between FASTA and BLAST Database Searching
  • 42.
    Sum High Probability Sequences producingHigh-scoring Segment Pairs: Score P(N) N >>>swissprot:25A1_RAT Q05961 rattus norvegicus (rat). (... 1915 1.8e-263 1 >>>swissprot:25A1_MOUSE P11928 mus musculus (mouse). (2'-... 1583 8.6e-222 2 >>>swissprot:25A2_HUMAN P04820 homo sapiens (human). (2'-... 782 9.2e-179 3 >>>swissprot:25A1_HUMAN P00973 homo sapiens (human). (2'-... 782 6.0e-173 2 >>>swissprot:25A2_MOUSE P29080 mus musculus (mouse). (2'-... 782 6.0e-173 2 >>>swissprot:25A6_HUMAN P29728 homo sapiens (human). 69/7... 386 1.8e-118 4 >>>swissprot:25A3_MOUSE P29081 mus musculus (mouse). (2'-... 785 5.7e-106 1 >>>swissprot:TR14_HUMAN Q15646 homo sapiens (human). thyr... 161 1.5e-16 2 >>>swissprot:YKH2_YEAST P36085 saccharomyces cerevisiae (... 66 0.71 1 BlastP output: part I Example: BlastP BlastP2 Sequence Searching Query sequence: rat adenylate cyclase
  • 43.
    Searching..................................................done Score E Sequences producingsignificant alignments: (bits) Value >>>swissprot:25A1_RAT Q05961 rattus norvegicus (rat). (2'-5')ol... 750 0.0 >>>swissprot:25A1_MOUSE P11928 mus musculus (mouse). (2'-5')oli... 629 e-180 >>>swissprot:25A2_MOUSE P29080 mus musculus (mouse). (2'-5')oli... 495 e-140 >>>swissprot:25A2_HUMAN P04820 homo sapiens (human). (2'-5')oli... 495 e-140 >>>swissprot:25A1_HUMAN P00973 homo sapiens (human). (2'-5')oli... 495 e-140 >>>swissprot:25A3_MOUSE P29081 mus musculus (mouse). (2'-5')oli... 492 e-139 >>>swissprot:25A6_HUMAN P29728 homo sapiens (human). 69/71 kd (... 350 2e-96 >>>swissprot:TR14_HUMAN Q15646 homo sapiens (human). thyroid re... 77 4e-14 >>>swissprot:RN14_YEAST P25298 saccharomyces cerevisiae (baker'... 32 1.5 BlastP2 output: part I Example: BlastP BlastP2 Sequence Searching Query sequence: rat adenylate cyclase
  • 44.
    BlastP output: partII Example: BlastP BlastP2 Sequence Searching 25A6_HUMAN BlastP reports 11 HSPs Score = 386 (178.8 bits) Expect = 1.8e-118, Sum P(4) = 1.8e-118 Identities = 68/121 (56%) Positives = 94/121 (77%) Query: 228 LPPQYALELLTVYAWERGNGITEFNTAQGFRTILELVTKYQQLRIYWTKYYDFQHPDVSK 287 LPP+YALELLT+YAWE+G+G+ +F+TA+GFRT+LELVT+YQQL I+W Y+F+ V K Sbjct: 562 LPPKYALELLTIYAWEQGSGVPDFDTAEGFRTVLELVTQYQQLGIFWKVNYNFEDETVRK 621 ... Score = 260 (120.4 bits) Expect = 1.8e-118Sum P(4) = 1.8e-118 Identities = 54/118 (45%) Positives = 73/118 (61%) Query: 58 KVVKGGSSGKGTTLKGKSDADLVVFLNNFTSFEDQLNRRGEFIKEIKKQLYEVQREKHFR 117 ++V+GGS+ KGT LK SDADLVVF N+ S+ Q N R + +KEI +QL REK Sbjct: 389 QIVRGGSTAKGTALKTGSDADLVVFHNSLKSYTSQKNERHKIVKEIHEQLKAFWREKEEE 448 Score = 222 ... Score = 174 ... Score = 142 ... Score = 121 Score = 98 Score = 69 Score = 47 Score = 35 Score = 35 ...
  • 45.
    BlastP2 output: partII Example: BlastP BlastP2 Sequence Searching BlastP2 reports 2 HSPs 25A6_HUMAN Score = 350 bits (888) Expect = 2e-96 Identities = 178/348 (51%) Positives = 237/348 (67%) Gaps = 8/348 (2%) Query: 5 LRSTPSWKLDKFIEVYLLPNTSFRDDVKSAINVLCDFLKERCFRDTVHPVRVSKVVKGGS 64 L +TP LDKFI+ +L PN F + + SA+N++ FLKE CFR + +++ V+GGS Sbjct: 339 LFTTPGHLLDKFIKEFLQPNKCFLEQIDSAVNIIRTFLKENCFRQSTAKIQI---VRGGS 395 ... Query: 301 LDPADPTGNVAGGNQEGWRRLASEARLWLQYPCFMNRGGSPVSSWEVP 348 LDP +PTG+V GG++ W L EA++ L PCF + G+P+ W+VP Sbjct: 635 LDPGEPTGDVGGGDRWCWHLLDKEAKVRLSSPCFKDGTGNPIPPWKVP 682 Score = 214 bits (540) Expect = 2e-55 Identities = 136/349 (38%) Positives = 187/349 (52%) Gaps = 21/349 (6%) Query: 2 EQELRSTPSWKLDKFIEVYLLPNTSFRDDVKSAINVLCDFLKERCFRDTVHPVRVSKVVK 61 E +L S P+ KL FI+ YL P + + +N +CD C P+ V V Sbjct: 4 ESQLSSVPAQKLGWFIQEYLKPYEECQTLIDEMVNTICDV----CRNPEQFPL-VQGVAI 58 ... Query: 299 VILDPADPTGNVAGGNQEGWRRLASEARLWLQYPCFMNRGGSPVSSWEV 347 VILDP DPT NV+ G++ W+ L EA+ WL P N P SW V Sbjct: 289 VILDPVDPTNNVS-GDKICWQWLKKEAQTWLTSPNLDNE--LPAPSWNV 334
  • 46.
    BlastP: 11 HSPs Example: BlastPBlastP2 Sequence Searching 25A1_RAT Query sequence 25A6_HUMAN
  • 47.
    Example: BlastN BlastN2 FastASW BlastN output Sequence Searching Query sequence: cDNA HUMAN Keratin against HONEST Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N >>>emhum1:HSCHIKER M37818 Human keratin (psi-K-alpha) ps... 1115 8.5e-252 5 >>>emhum1:HSKERUV X05803 Human radiated keratinocyte mR... 718 1.0e-105 2 >>>emhum1:HSEPKER J00124 Homo sapiens 50 kDa type I epi... 687 4.2e-105 4 >>>emhum1:HSKERELP X62571 H.sapiens mRNA for keratin-rel... 718 5.4e-105 2 >>>emrod:MMKTEPIA M13806 Mouse keratin (epidermal) type... 691 9.3e-101 2 >>>emrod:MMKTEPIC M13805 Mouse type I epidermal keratin... 700 2.0e-96 2 >>>emhum1:HSCYTOK17 Z19574 H.sapiens gene for cytokeratin... 651 3.2e-90 3 >>>emhum2:S79867 S79867 type I keratin 16 [human, epid... 709 2.7e-87 2 >>>emhum2:S72493 S72493 keratin=keratin 16 homolog [hu... 693 2.3e-86 2 >>>emhum1:HSKERP2 M22928 Human keratin pseudogene, exon... 624 1.1e-85 3 >>>emhum1:HSKERC15 X07696 Human mRNA for cytokeratin 15.... 560 3.3e-72 2 >>>emhum2:HSTYPIKER X90763 H.sapiens mRNA for type I kera... 527 2.1e-56 2 >>>emhum1:HS1K14 M28646 Homo sapiens epidermal keratin... 745 3.2e-52 1 >>>emhum2:HSPK161 M37817 Human keratin psi-K#16-1 pseud... 480 6.2e-49 2 >>>emhum1:HSKERI6 M21755 Human pot. pseudo-keratin K16 ... 651 6.6e-46 1 >>>emhum1:HSKER16A6 M28437 Human keratin type 16 gene, ex... 642 3.8e-45 1
  • 48.
    Example: BlastN BlastN2 FastASW BlastN2 output Sequence Searching Query sequence: cDNA HUMAN Keratin against HONEST Sequences producing significant alignments: Score E (bits) Value >>>emhum1:HSCHIKER M37818 Human keratin (psi-K-alpha) pseudogen... 442 e-122 >>>emhum1:HS1K14 M28646 Homo sapiens epidermal keratin, 3' end.. 103 5e-20 >>>emhum1:HSEPKER J00124 Homo sapiens 50 kDa type I epidermal k.. 103 5e-20 >>>emmam:BTKERVII X01461 Bovine mRNA 3'-region for epidermal ke.. 96 1e-17 >>>emrod:MMKTEPIB J02644 Mouse type I epidermal keratin mRNA, 3.. 96 1e-17 >>>emrod:MMKTEPIA M13806 Mouse keratin (epidermal) type I mRNA,.. 88 3e-15 >>>emhum1:HSKERI6 M21755 Human pot. pseudo-keratin K16 type I, .. 84 4e-14 >>>emhum1:HSCYTOK17 Z19574 H.sapiens gene for cytokeratin 17. 6/93. 78 3e-12 >>>emhum1:HSKERELP X62571 H.sapiens mRNA for keratin-related pr... 78 3e-12 >>>emhum1:HSKERUV X05803 Human radiated keratinocyte mRNA 266 (.. 78 3e-12 >>>emrod:MMKTEPIC M13805 Mouse type I epidermal keratin mRNA, c.. 76 1e-11 >>>emhum1:HSKER16A6 M28437 Human keratin type 16 gene, exon 6 .... 76 1e-11 >>>emhum2:S79867 S79867 type I keratin 16 [human, epidermal ker. 76 1e-11 >>>emhum2:HSPK161 M37817 Human keratin psi-K#16-1 pseudogene, 3 .
  • 49.
    Example: BlastN BlastN2 FastASW FastA output Sequence Searching Query sequence: cDNA HUMAN Keratin against HONEST The best scores are: init1 initn opt.. emhum1:hschiker M37818 Human keratin (psi-K-alpha) pseud... 892 1924 912 emhum1:hsepker J00124 Homo sapiens 50 kDa type I epiderm.. 559 1101 564 emrod:mmktepia M13806 Mouse keratin (epidermal) type I m.. 559 1082 1109 emhum1:hskerelp X62571 H.sapiens mRNA for keratin-relate... 521 1015 1116 emhum1:hskeruv X05803 Human radiated keratinocyte mRNA 2.. 521 1015 1116 emrod:mmktepic M13805 Mouse type I epidermal keratin mRN.. 503 1005 1054 emhum1:hskerp2 M22928 Human keratin pseudogene, exons 2-.. 447 993 511 emhum1:hscytok17 Z19574 H.sapiens gene for cytokeratin 1... 461 977 543 emhum2:s79867 S79867 type I keratin 16 [human, epidermal. 503 890 976 emhum2:s72493 S72493 keratin=keratin 16 homolog [human, . 496 883 965 emhum1:hscytk X52426 H.sapiens mRNA for cytokeratin 13. . 499 753 776 emhum1:hsker13c X14640 Human mRNA for keratin 13. 1/94 492 746 769 emhum1:hskerc15 X07696 Human mRNA for cytokeratin 15. 9/93 391 735 832 emhum1:hs1k14 M28646 Homo sapiens epidermal keratin, 3' . 601 641 647 Fasta only detects the longest exon!
  • 50.
    Example: BlastN BlastN2 FastASW SW output Sequence Searching Query sequence: cDNA HUMAN Keratin against HONEST Sequence Strd Orig ZScore EScore Len ! Documentation emhum1:HSKERUV + 280.40 1079.43 0.000000e2147483647 956 ! X05803 Human radiated keratinocyte mRNA 266 (keratin-related protein). 7/95 emhum1:HSKERELP + 280.40 1076.40 0.000000e2147483647 1512 ! X62571 H.sapiens mRNA for keratin-related protein. 11/92 em_ro:MMKTEPIA + 278.80 1071.56 0.000000e2147483647 1224 ! M13806 Mouse keratin (epidermal) type I mRNA, clone p52, partial 4/90 em_ro:MMKTEPIC + 265.20 1021.34 0.000000e2147483647 801 ! M13805 Mouse type I epidermal keratin mRNA, clone pkSCC-50, partial cds.4/90 emhum2:S79867 + 248.40 952.05 0.000000e2147483647 1422 ! S79867 type I keratin 16 [epidermal keratinocytes, mRNA Partial, 1422 nt]. 2 emhum2:S72493 + 245.80 944.40 0.000000e2147483647 976 ! S72493 keratin [human, tracheobronchial epithelial cells, mRNA P emhum1:HSCHIKER + 223.00 840.34 0.000000e2147483647 9697 ! M37818 Human keratin (psi-K-alpha) pseudogene, exons 4,5,6,7,8, SW detects only the longest exon!
  • 51.
    Example: BlastX BlastX2 Framesearch PlusStrand HSPs: Score = 215 (98.9 bits) Expect = 1.6e-44 Sum P(3) = 1.6e-44 Identities = 42/45 (93%) Positives = 45/45 (100%) Frame = +3 Query: 3 EIKKAIKEESEGKMKGILGYSEDDVVSTDFVGDNRSSIFDAKAGL 137 EIKKAIKEESEGK+KGILGY+EDDVVSTDFVGDNRSSIFDAKAG+ Sbjct: 261 EIKKAIKEESEGKLKGILGYTEDDVVSTDFVGDNRSSIFDAKAGI 305 Score = 114 (52.4 bits) Expect = 1.6e-44 Sum P(3) = 1.6e-44 Identities = 20/21 (95%) Positives = 21/21 (100%) Frame = +1 Query: 139 IALSDKFVKLVSWYDNEWGYT 201 IALSDKFVKLVSWYDNEWGY+ Sbjct: 305 IALSDKFVKLVSWYDNEWGYS 325 Score = 67 (30.8 bits) Expect = 1.6e-44 Sum P(3) = 1.6e-44 Identities = 14/15 (93%) Positives = 15/15 (100%) Frame = +3 Query: 198 HSSRVVDLIVHMSKA 242 +SSRVVDLIVHMSKA Sbjct: 324 YSSRVVDLIVHMSKA 338 BlastX output Sequence Searching Genbank: atts0012 EST against Swissprot G3PC_ARATH
  • 52.
    Example: BlastX BlastX2 Framesearch >>>>swissprot:G3PC_ARATH P25858 arabidopsis thaliana (mouse-ear cress). glyceraldehyde 3-phosphate dehydrogenase, cytosolic (ec 1.2.1.12). 5/92 Length = 338 Score = 104 bits (257) Expect = 4e-23 Identities = 59/80 (73%) Positives = 65/80 (80%) Query: 3 EIKKAIKEESEGKMKGILGYSEDDVVSTDFVGDNRSSIFDAKAGLHCIERQVCEVGVMVR 182 EIKKAIKEESEGK+KGILGY+EDDVVSTDFVGDNRSSIFDAKAG+ ++ V V Sbjct: 261 EIKKAIKEESEGKLKGILGYTEDDVVSTDFVGDNRSSIFDAKAGIALSDKFVKLVS-WYD 319 Query: 183 QRMGLHSSRVVDLIVHMSKA 242 G +SSRVVDLIVHMSKA Sbjct: 320 NEWG-YSSRVVDLIVHMSKA 338 BlastX2 output Sequence Searching Genbank: atts0012 EST against Swissprot G3PC_ARATH
  • 53.
    Sequence Searching • Thecomparison of a sequence against a database …. • The comparison of a pattern, a profile, or a HMM against a database …. In order to find similar sequences In order to find weaker similarities. Search strategies:
  • 54.
    Pattern searching Database Searching Patternlanguage Fixed symbols: A Alternatives: (A,B,C) Exclusions: ~(D,E,F) Repetitions: (A,B){n1,n2}
  • 55.
    Pattern searching Database Searching Example: Pattern:G(D,E)~PX{0,4}(D,E){1,2}W Fitting sequences: GDSTRLDDW GEAEW
  • 56.
    Pattern searching Database Searching Nucleicacid pattern search Protein pattern search Protein pattern search after translation Programs available in EMBOSS fuzznuc fuzztrans fuzzpro
  • 57.
    Profile Searching Database Searching •Suited for finding and aligning distantly related proteins. • Input is not a single sequence, but a profile generated from a multiple alignment of similar sequences. • The profile reflects the likelihoods of all amino acids occurring at every position in the alignment. • Position specific gap weights.
  • 58.
    What is missingin a Profile Domain 1 (active binding site) ATGTCGTCGTCG Domain 2 (never found, inactive) ATGTGGTCGTCG Domain 3 (never found, inactive) ATGTCATCGTCG Domain 4 (active) ATGTGATCGTCG Gene prediction Profile is A 200001000000 based only on G 002011002002 active domains ! T 020200200200 C 000010020020 Domain 1: Score = 22 ATGTCGTCGTCG Domain 2: Score = 22 ATGTGGTCGTCG Domain 3: Score = 22 ATGTCATCGTCG Domain 4: Score = 22 ATGTGATCGTCG
  • 59.
    How is aMarkov Model created ? Gene prediction Markov Model is based on active domains only ! ... If a G is found at position 3, P(T)4=1.0 If a T is found at position 4, P(C)5=0.5, P(G)5=0.5 If a C is found at position 5, P(A)6=1.0, P(G)6=0.5 If a G is found at position 5, P(A)6=0.5, P(G)6=0.5 If a G is found at position 6, P(T)7=1.0 If an A is found at position 6, P(T)7=1.0 ... Domain 1 (active binding site) ATGTCGTCGTCG Domain 2 (never found, inactive) ATGTGGTCGTCG Domain 3 (never found, inactive) ATGTCATCGTCG Domain 4 (active) ATGTGATCGTCG
  • 60.
    Gene prediction In contrastto patterns and profiles, Markov Models take into account the information about neighboring residues. Markov Models are probabilistic, statistical models, they define a score function for each residue in a chain. In contrast to patterns and profiles, MMs are also better suited to describe deletions and insertions. C G C - T A P=0.6 P=0.1 P=0.2 P=0.09 P=0.01
  • 61.
    Gene prediction In HiddenMarkov Models, additional information is used. The complexity of the MM increases and does not allow easy understanding of why a residue is predicted at a certain position. Hence the underlying model remains “hidden”. P=0.6 P=0.1 P=0.2 P=0.09 P=0.01 If C was proceeded by a promoter P=0.1 P=0.9 P=0 P=0 P=0 G C - T A C
  • 62.
    First order MarkovModel Gene prediction Fifth order Markov Model
  • 63.
    Hidden Markov Models DatabaseSearching • Hidden Markov models are suited for finding and aligning distantly related proteins. • A hidden Markov model is a more general description of a multiple alignment than a profile. • For database searching the input is not a single sequence, but a hidden Markov model generated from a multiple alignment of similar sequences.
  • 64.
    Hidden Markov Models DatabaseSearching - generates a Hidden Markov Model (HMM) - Smith-Waterman search using a HMM against a sequence database - scans a sequence against a database of HMMs HMMerbuild HMMersearch HMMerscan Pronouncing HMMER It's "hammer": as in, a more precise mining tool than a BLAST. http://hmmer.wustl.edu/
  • 65.
    PSIBlast Database Searching • IterativeBlast2 program starting with a single protein sequence. • Automatically generates profile from search results. • Blast2 algorithm extended to work with profiles. stripped down but ultra-fast version of iterative profile HMM searches Position Specific Iterated
  • 66.
    PHI-BLAST (Pattern-Hit InitiatedBLAST) combines matching of regular expressions with local alignments surrounding the match helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?
  • 67.
    Detecting weak proteinhomologies Database Searching Example: A Comparison of different methods Human hemoglobin alpha (BlastP, FastA, SW, PSIBlast) Profile of the alignment of 7 hemoglobin and myoglobin sequences (ProfileSearch) HMM of the alignment of 7 hemoglobin and myoglobin sequences (HMMSearch) Input: Database: 10.000 sequences including 14 globin sequences
  • 68.
    BlastP Database Searching Example: HBAC_ANGAN P80726HEMOGLOBIN ALPHA, CATHODIC CHA... 255 1.6e-46 2 HBAA_ANGAN P80945 HEMOGLOBIN ALPHA, ANODIC CHAIN... 218 6.4e-41 2 HBBC_ANGAN P80727 HEMOGLOBIN BETA, CATHODIC CHAI... 188 5.6e-37 3 HBBA_ANGAN P80946 HEMOGLOBIN BETA, ANODIC CHAIN.... 116 7.1e-17 2 MYG_HORSE P02188 MYOGLOBIN. 2/97 72 7.8e-09 2 PTB_PIG Q29099 POLYPYRIMIDINE TRACT-BINDING P... 59 0.45 1 DLDH_MYCPN P75393 LIPOAMIDE DEHYDROGENASE COMPON... 56 0.79 1 XERC_SALTY P55888 INTEGRASE/RECOMBINASE XERC. 2/97 39 0.87 3 SYI_CAEEL Q21926 PROBABLE ISOLEUCYL-TRNA SYNTHE... 49 0.993 2 LEUA_MICAE P94907 2-ISOPROPYLMALATE SYNTHASE (EC... 40 0.994 3 TPIS_RHIET P96985 TRIOSEPHOSPHATE ISOMERASE (EC ... 43 0.995 2 HJA3_METJA Q58655 PROBABLE ARCHAEAL HISTONE 3. 2/97 49 0.999 1 CMTG_PSEPU Q51983 4-HYDROXY-2-OXOVALERATE ALDOLA... 45 0.999 2
  • 69.
    FastA Database Searching Example: hbac_angan P80726HEMOGLOBIN ALPHA, CATHODIC CHAIN... 263 370 381 hbaa_angan P80945 HEMOGLOBIN ALPHA, ANODIC CHAIN. ... 226 331 351 hbbc_angan P80727 HEMOGLOBIN BETA, CATHODIC CHAIN.... 187 231 300 hbba_angan P80946 HEMOGLOBIN BETA, ANODIC CHAIN. 2/97 120 149 238 syl_syny3 P73274 LEUCYL-TRNA SYNTHETASE (EC 6.1.1... 42 66 57 ymb3_yeast Q04228 HYPOTHETICAL 66.8 KD PROTEIN IN ... 40 66 42 rdh1_bovin Q27979 11-CIS RETINOL DEHYDROGENASE (EC... 54 65 65 amd_mouse P97467 PEPTIDYL-GLYCINE ALPHA-AMIDATING .. 36 64 41 faf_mouse P70398 PROBABLE UBIQUITIN CARBOXYL-TERMI.. 45 63 45 ycby_haein P44524 HYPOTHETICAL PROTEIN HI0116/15. ... 49 61 56 myg_horse P02188 MYOGLOBIN 2/97 40 61 172 ptga_brela Q45298 PTS SYSTEM, GLUCOSE-SPECIFIC IIA... 49 60 69 e6ap_human Q05086 ONCOGENIC PROTEIN-ASSOCIATED PRO... 44 60 44 mauf_thive Q56463 METHYLAMINE UTILIZATION PROTEIN ... 48 59 48
  • 70.
    SW Database Searching Example: 1 HBAC_ANGAP80726 HEMOGLOBIN ALPHA, CATHODIC CHAIN. 2/97 .0000 378.0 2 HBAA_ANGA P80945 HEMOGLOBIN ALPHA, ANODIC CHAIN. 2/97 .0000 348.0 3 HBBC_ANGA P80727 HEMOGLOBIN BETA, CATHODIC CHAIN. 2/97 .0000 296.0 4 HBBA_ANGA P80946 HEMOGLOBIN BETA, ANODIC CHAIN. 2/97 .0000 231.0 5 MYG_HORSE P02188 MYOGLOBIN. 2/97 .0000 172.0 6 Y168_HAEI P43958 HYPOTHETICAL PROTEIN HI0168/69. 2/97 .0001 80.0 7 GLBI_CHIT Q23763 GLOBIN CTT-VIIB-8 PRECURSOR. 2/97 .0004 73.0 8 HMPA_ERWC Q47266 FLAVOHEMOPROTEIN (HAEMOGLOBIN-LIKE PROTEIN) .0006 75.0 9 GLBZ_CHIT Q23761 GLOBIN CTT-Z PRECURSOR (HBZ). 2/97 .0007 70.0 10 MAOX_MYCT P71880 PUTATIVE MALATE OXIDOREDUCTASE (NAD) (EC 1 .0011 75.0 11 YCHM_ECOL P40877 HYPOTHETICAL 58.4 KD PROTEIN IN PTH-PRSA IN .0014 73.0 12 GLBK_CHIT Q23762 GLOBIN CTT-VIIB-10 PRECURSOR. 2/97 .0014 67.0 13 FCPC_MACP Q40299 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN .0018 67.0 25 GLB1_CHIT P02221 GLOBIN CTT-I/CTT-IA PRECURSOR (ERYTHROCRUOR .0037 62.0 35 GLB6_CHIT P02224 GLOBIN CTT-VI PRECURSOR. 2/97 .0047 61.0 . . . .
  • 71.
    PSIBlast Results from round1 Score E Sequences producing significant alignments: (bits) Value >>swnew:HBAC_ANGAN P80726 HEMOGLOBIN CATHODIC, ALPHA CHAIN. 1... 141 2e-35 >>swnew:HBAA_ANGAN P80945 HEMOGLOBIN ANODIC, ALPHA CHAIN. 10/... 125 2e-30 >>swnew:HBBC_ANGAN P80727 HEMOGLOBIN CATHODIC, BETA CHAIN. 11... 102 1e-23 >>swnew:HBBA_ANGAN P80946 HEMOGLOBIN ANODIC, BETA CHAIN. 11/1997 83 7e-18 >>swnew:MYG_HORSE P02188 MYOGLOBIN. 10/2000 48 4e-07 Results from round 2 >>swnew:PYRC_AERPE Q9yfi5 DIHYDROOROTASE (EC 3.5.2.3) (DHOASE... 24 5.7 >>swnew:GLB_PAREP P80721 GLOBIN-3 (MYOGLOBIN). 7/1998 24 5.7 >>swnew:Y373_HUMAN O15078 HYPOTHETICAL PROTEIN KIAA0373. 10/2000 24 7.5 Database Searching Example:
  • 72.
    ProfileSearch Database Searching Example: 1 SWNEW:HBBC_ANGAN+ 33.39 51.16 146 ! P80727 HEMOGLOBIN BETA, CATHODIC 2 SWNEW:HBBA_ANGAN + 29.72 46.88 147 ! P80946 HEMOGLOBIN BETA, ANODIC C 3 SWNEW:HBAC_ANGAN + 28.88 45.16 142 ! P80726 HEMOGLOBIN ALPHA, CATHODI 4 SWNEW:HBAA_ANGAN + 27.36 43.34 142 ! P80945 HEMOGLOBIN ALPHA, ANODIC 5 SWNEW:MYG_HORSE + 27.02 44.37 153 ! P02188 MYOGLOBIN. 2/97 6 SWNEW:GLBI_CHITH + 11.95 26.38 161 ! Q23763 GLOBIN CTT-VIIB-8 PRECURS 7 SWNEW:GLB1_CHITH + 11.66 25.81 158 ! P02221 GLOBIN CTT-I/CTT-IA PRECU 8 SWNEW:GLBK_CHITH + 11.65 26.01 161 ! Q23762 GLOBIN CTT-VIIB-10 PRECUR 9 SWNEW:GLBZ_CHITH + 11.35 25.75 163 ! Q23761 GLOBIN CTT-Z PRECURSOR (H 10 SWNEW:GLB_PAREP + 10.99 24.20 147 ! P80721 GLOBIN (MYOGLOBIN). 2/97 11 SWNEW:GLB_ISOHY + 10.98 24.26 148 ! P80722 GLOBIN (MYOGLOBIN). 2/97 12 SWNEW:GLBW_CHITH + 8.27 21.63 159 ! Q23760 GLOBIN CTT-W PRECURSOR (H 13 SWNEW:GLB6_CHITH + 8.20 21.72 162 ! P02224 GLOBIN CTT-VI PRECURSOR. 14 SWNEW:Y395_METJA + 5.36 17.94 158 ! Q57838 HYPOTHETICAL PROTEIN MJ03 15 SWNEW:YM34_YEAST + 4.86 16.93 150 ! Q03818 HYPOTHETICAL 17.2 KD PROT 16 SWNEW:YA74_METJA + 4.71 14.66 112 ! Q58474 HYPOTHETICAL PROTEIN MJ10 17 SWNEW:Y267_MYCPN + 4.48 14.53 114 ! P75397 HYPOTHETICAL PROTEIN MG26 50 SWNEW:HMPA_ERWCH FLAVOHEMOPROTEIN . . .
  • 73.
    HMMerSearch Database Searching Example: 1 172.071 153 10 165 MYG_HORSE 2 135.21 4 145 12 158 HBBC_ANGAN 3 113.30 4 147 12 159 HBBA_ANGAN 4 111.99 1 142 10 159 HBAC_ANGAN 5 96.50 1 142 10 159 HBAA_ANGAN 6 33.70 8 150 1 147 GLBI_CHITH 7 31.65 8 150 1 147 GLBK_CHITH 8 26.41 8 157 1 154 GLBZ_CHITH 9 23.70 7 154 1 154 GLB1_CHITH 10 16.57 7 149 1 147 GLB6_CHITH 11 12.42 26 147 32 157 GLB_ISOHY 12 11.24 31 90 30 90 GLBW_CHITH 13 10.06 22 145 28 156 GLB_PAREP 14 8.16 75 126 114 165 SMF1_YEAST 15 6.51 347 434 76 165 AEFA_ECOLI 16 3.64 2 83 84 165 CAPP_AMAHP 17 3.60 72 96 1 25 PHZA_PSEAR 18 3.36 30 142 49 165 HMPA_ERWCH 19 2.58 36 73 1 39 ATPN_CAEEL 20 2.06 91 166 89 165 YFBM_ECOLI
  • 74.
    Benchmarks Database Searching Protein queryagainst protein database (Swissprot 38+: 85661 entries, 18-05-2000) Seq.Length BlastP BlastP2 FastA SSEARCH SW(Biocc) 250 20.6 8.3 49 379 22.6 500 37.2 12.6 67 787 35.1 1000 80 16.0 88 1672 59.8 2000 194 26.6 116 3499 105 4000 308 62.1 195 8628 321 time in seconds 1 hour = 3600 s. 1 day = 86400 s.
  • 75.
    Database Searching DNA queryagainst nucleic acid database (NR 610 000 entries, 18-05-2000) Benchmarks Seq.Length BlastP BlastP2 FastA SSEARCH SW(Biocc) 200 40 39 1314 32970 1448 1000 48 62 2247 - 5254 5000 130 112 2748 - 24465 25000 473 459 10866 - 120562 time in seconds 1 hour = 3600 s. 1 day = 86400 s.
  • 76.
    Database Searching ProfileSearch: 28 sequencesof length 720 against Swissprot (35) FrameSearch: EST query (286 bp) against Swissprot (35) CVX Biocc. 8380 97 6300 140 Benchmarks time in seconds 1 hour = 3600 s.
  • 77.
    Sequence Masking Database Searching DUST XBlast RepeatMasker SEG XNU -masking of locally biased regions, BlastN2 - masking of repetitive elements - masking of repetitive elements - masking of locally biased regions - masking of locally biased regions DNA Protein
  • 78.
    Sequence Masking Database Searching Example: HighSmallest Sum Sequences producing High-scoring Segment Pairs: Score Probability P(N) N emhum1:HS367641 U36764 Human TGF-beta receptor interac.. 470 1.6e-53 3 emhum2:HSU39067 U39067 Human translation initiation fa.. 461 1.1e-52 3 emhum1:AC004161 Ac004161 Homo sapiens BAC clone RG208K.. 249 1.4e-23 3 empln:AT36765 U36765 Arabidopsis thaliana TGF-beta r.. 189 2.3e-05 1 eminv:DMU90930 U90930 Drosophila melanogaster TRIP-1 .. 189 2.3e-05 1 emfun:SPD187 D89187 Schizosaccharomyces pombe mRNA,.. 150 0.045 1 emfun:SPSUM1 Y09529 S.pombe mRNA for SUM1 protein. .. 150 0.046 1 emnew:SPSUM1 Y09529 S.pombe mRNA for SUM1 protein. .. 150 0.046 1 BlastN result against HONEST with masking
  • 79.
    Database Searching Example: WARNING: Descriptionsof 30,108 database sequences were not reported due to the limiting value of parameter -LIST=500. BlastN result against HONEST without masking Sequence Masking
  • 80.
    Sequence Masking Database Searching Example: XBlast 1xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx 51 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx 101 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx 151 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx 201 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx 251 xxxxxxxxxx TAAAAAAxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx 301 xxxxxxAGAG ATCCAGCTTC TTAAGCCTCA AACTCAAATT CGAAGTACTG 351 TGGGTCGAAG TAATGGATAC GGACGTAACC ATCTTCGCCG CCGCTGCTGT 401 ACTCTTGCCA TCAGGATGGA AGGCAACACT GTTGATAGGT CCAAAGTGAC 451 CTTGAACTTC CAAAACTTCT TCAAAGGCCA AATGGAAGAA CTGGCCTCAA 501 ATTGCCAANC TGGTGGAGGT TGTGGTAAAN CATGGTTCCT GACACCGCCA 551 AGACCAAATG GNATAGTTGG G Homo sapiens cDNA clone
  • 81.
    DNA versus proteinsequence comparison E(DNA) E(protein) MUSGLUTA 0 0 MAMGLUTRA 10 -11 0 MMGLUT 1.1 10 -7 HUMGSTB 14 10 -6
  • 82.
    Distantly related sequences (Pearson,CABIOS13, 4 (1997) Database Searching • Always compare proteins if the genes encode proteins • Matches that are >50 % identical in a 20-40 amino acid region frequently occur by chance • Most sequences with E<0.01 are homologous but distantly related sequences do not share significant homology • Use “transitive” method!

Editor's Notes

  • #1 Database searching course part of the HUSAR advanced course. Time: 2 * 45 minutes with a break at an appropriate point (before Blast2-part). Make cross-references to pairwise and multiple alignment parts. But do not repeat this theory.
  • #2 What we want to achieve with database searching is to find sequences that are similar to our query. Databases can be searched by keyword or by sequence. Keyword searches check the database for the appearance of the keyword in the entries. This is a simple ‘string‘ search. Programs doing this are SRS or IRX. Sequence searches look for sequences that are similar to the query sequence and show the alignment between the query and the hit. This asks for a pairwise sequence alignment program.
  • #3 Two strategies can be followed when comparing a sequence to a database. One can compare the whole sequence against a database, in order to find similar sequences. One can compare a pattern, profile or hidden Markov model derived from the sequence against a database, in order to detect weaker similarities. This often reflects the more biological situation in which sequences only share homologous domains or regions.
  • #4 Sequence searching is in principle a repeated pairwise alignment. The SW-algorithm looks for local alignments and is therefore ideal for db-searching. The disadvantage is that it is too slow (12*106 entries in EMBL) Therefore we need heuristic methods.
  • #5 Heuristic algorithms available are: FastA, Pearson, University of Virginia Blast, Altschul, NCBI
  • #6 We will go through the FastA algorithm in a step-by-step manner. The first thing FastA does is to look for short identities of a fixed length (word size). There has to be an EXACT match! -- In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison.
  • #7 A dotplot is built up of the query sequence against every single database sequence. Every time the wordsize condition is fulfilled a dot appears (in the middle of the wordsize) For DNA the wordsize is usually 6, for proteins it is 2. A diagonal of several dots means that upon shifting by one base/residue the condition is still fulfilled. Example: diagonal of seven spots: 6+1+1+1+1+1+1=12 matches. The wordsize has influence on the running time. The smaller it is the longer it takes, especially for DNA.
  • #8 In the second step the score of each diagonal is determined -- In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison.
  • #9 The score for each diagonal is determined, using a scoring table. For DNA this is a simple one, for proteins more complex tables are used, containing score values for ecery possible amino acid pair. When talking about DNA the score of the indicatied diagonal would be : 6 * 5 = 30 for the first dot + 5 for every next dot = 60.
  • #10 Then the ten highest scoring diagonals are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. The highest score is called init1
  • #11 The highest scoring diagonals are rescored and extended. Now also mismatches are scored. And other matsches below the wordsize of 6 that might be on the diagonal are also taken into account. The score of each diagonal is calculated. The best score is called init1.
  • #12 Now the program checks to see if some of these initial highest-scoring diagonals can be joined together. The highest scoring joint construction is initn
  • #13 Connection of diagonals of course means the introduction of a gap. The best of all the combined diagonal scores is called initn. In the example: green + yellow - some penalty for joining.
  • #14 Points 1-4 are performed for every sequence in the database. So you have a whole collection of initn scores. All that have an initn score above a certain threshold are are taken. The program sets the threshold. There is a parameter for it, but it is not easy to use this. With the selected sequences a SW alignment is performed and an opt-score, z-score and E() value is calculated.
  • #15 Opt-score: Optimized score: SW score calculated from BLOSUM/PAM matrices. This score is length dependent. Z-score: normalization according to length. E() value: Erwartungswert, includes statistics. How often do I find sequence x in this database by chance. E=10 => no significance E=10-48 => very likely a real similarity (homology).
  • #16 Sequences with an E() value smaller than a given threshold are listed. The standard is: E() < 2.
  • #17 Go through sheet. Blast displays the sequences with a score higher than the threshold and an expectation value lower than the threshold.
  • #18 Here is an example of a fastA output. Showing the values discussed in the sheet before (init1, initn, opt-score, z-score and E() value). N.B: Explain the notation: 5e-25. A higher E() value does not always mean biologically irrelevant.
  • #19 Here we see the alignments. We see the original three hits, two gaps. The smaller gap was closed, the second one was too big. Draw situation on blackboard! | x | x | x | x | x | x 472 init1 / initn | x | x | x | x | x | x | x |_______________________________________
  • #20 Some programs in HUSAR that use the FastA algorithm.
  • #22 Now we will look at the details of the Blast algorithm Blast stands for: Basic Local Alignment Search Tool In the first step the Blast algorithm looks for short stretches of a given length that score above a certain threshold (window stringency). These are called initial hits.
  • #23 Here are the essential parameters of Blast. Of course the threshold for proteins is dependent on the comparison table used (here: blosum62).
  • #24 This shows one possible alignment of two sequences. All other possible alignments of these two sequences are also checked! The initial hit with wordsize 3 has a score of 14, which is above the threshold.
  • #25 Initial hits are now extended. The region with the maximum score is called an HSP.
  • #26 This shows how the extension works. The alignment gets extended until the score drops off by a certain amount from its maximum. 14 - 5 = 9 ; 9 > 14 - 22 = 8, so extension continues. 9 + 4 = 13 ; 13 > 9 - 22 = -13, so extension continues.
  • #27 Extension is stopped by either the extension rule, or the end of the sequence. What is taken as the alignment? When the extension rule applies, the alignment is already bad! What is taken is the maximal score. This is not necessarily the last score; mostly not.
  • #28 So if we look at our example again we see that the maximum score is 38, obtained by this region. This is called the high scoring segment pair (HSP). It corresponds to a diagonal in the dotplot (without gaps!!).
  • #29 All HSPs with a score as high as or higher than the cutoff score are displayed. The cutoff score corresponds to a predefined expectation value.
  • #30 Go through sheet. Blast displays the sequences with a score higher than the threshold and an expectation value lower than the threshold.
  • #31 An example of a blast output showing the parameters discussed before. N gives the number of HSPs within one sequence that were significant (differnce to FastA!!). The high score is the highest score of one of the N hits in the sequence. P(N) is the probability (Wahrscheinlichkeit). This is just an other way of expressing the expectation value, if the size of the database is known.
  • #32 Here we see the alignment output for two of the hits in the first sequence of the list (N=5). Note that there are no gaps!
  • #33 The main problems of Blast are the fact that it is not fast enough and that it does not deal with gaps.
  • #35 80-85 procent of the calculation time in blast is consumed by initial hit extension. So, in Blast2 extension is only done if two initial hits are on one diagonal with a distans <= A. This reduces the number of extensions, but sensitivity gets lost. So in fact the threshold should be brought down, which puts you back to the beginning again. But the introduction of gaps gets us out of this impasse.
  • #36 How can we deal with gaps? FastA does a SW with the best hits. In Blast this would take too much time in comparison to the rest of the program. Therefore the SW matrix is only calculated around the HSP. Starting from a fixed point in the HSP, a pair of amino acids or bases within the highest scoring region of the HSP, two 'clouds' of the SW matrix are calculated.
  • #37 In this way 'lost HSPs' can be picked up againg by the SW / gap search. Therefore only one of the two diagonals shown has to be found. The other one will be picked up by the SW / gap search.
  • #38 Read slide. Due to the gapped alignments no real statistics can be performed with Blast2. Precomputed statistical parameters are used.
  • #39 Blast programs available in HUSAR. Go through slide.
  • #40 Go through sheet.
  • #41 Go through slide.
  • #42 Now some examples of comparisons between different programs will be presented. First BlastP will be compared to BlastP2. Searching with rat adenylate cyclase we find a hit for Human sequence: 25A6_HUMAN.
  • #43 The same sequence is found with BlastP2
  • #44 The BlastP output reparts 11 HSPs
  • #45 BlastP2 reports only two HSPs.
  • #46 Looking at the 'dotplot' we see that BlastP picked up all the eleven single diagonals. Whereas BlastP2 combined seven and four diagonals into two gapped algnments. The gap between these two diagonals was to big to be bridged.
  • #47 Comparison of BlastN, BlastN2, FastA and SW. The query sequence is Human Keratin cDNA and the database is HONEST (= GeAll minus ESTs, STSs, HTGs, GSSs). BlastN picks up HSCIKER M37818. N=5, so all exons are picked up.
  • #48 BlastN2 also pickes up all 5 exons.
  • #49 FastA only detects the longest exon.
  • #50 SW also only detects the longest exon.
  • #51 With BlastX you will have to compbine yourself. And sometimes the different hits are not nicely grouped together like here, but interspersed with many other hits.
  • #52 In BlastX2 gapped alignments mix things up. The middle part is now completely unclear.
  • #53 The second strategy for sequence searching is comparing a pattern, profile or hidden Markov model of the sequence against a database, in order to detect weaker similarities. N.B: This part can be presented relatively quick.
  • #54 In pattern searching a certain language is used to define patterns in a sequence. Explain table.
  • #55 This is an example of a pattern. Sequences fitting the pattern should contain: a G, a D or an E, not a P, 0-4 residues of any kind, a D or an E and 1-2 Ws. Both sequences below fit to the pattern.
  • #56 Programs in HUSAR for pattern searching. Read slide.
  • #57 Profile searching: read slide. Explain on blackboard how a prfile is generated from a multiple alignment. The search algorithm is a modified Smith Waterman. The score of a certain pair is multplied by the relative occurance of a residue in the multiple alignment.
  • #58 Lets presume we have a sequence which can bind, a transcription factor: domain 1 is thus ‚active‘ If we make a single point mutation, we end up with either domain 2 or 3, which can not bind a transcription factor anymore, and which are thus inactive (therefore grey) However, if we have both these point mutations, we get reactivation, the transcription factor can bind again, and domain 4 is active. Situations like this can be found in many cases. to great a profile, only active domains are used (black). Inactive domains are ignored (grey). We end up with the profile as shown in the middle panel. Explain the numbers. Now we use this profile to calculate a score for the 4 domains. Bad surprise ! Although domain 2 and 3 are completely inactive, they receive the same score as domain 1 and 4 ! Conclusion: a profile is not sufficient to describe this situation. How can we solve this ? To build a profile just count the occurence of nucleotides in active domains.
  • #59 We can create a better ‚profile‘, named a markov model. Again, the markov model is only based on the active, black domains. Now explain the frequecies depicted in the yellow box. This is sometimes difficult for the audience, careful ! Be sure that you understand this completely It´s always: If a nucleotide is found at a certain position, then the probability for the next position being filled in by a certain nucleotide is ...
  • #60 This is the conclusion of the preceeding five slides.
  • #61 If the markov model has identified a promoter, the application will know that an exon starting with an ATG has to be found next. Naturally, the program will not start looking for a poly adenylation site at this point, that wouldn‘t make any sense. This knowledge is used to choose a different set of frequencies, namely those specific for finding a start codon.
  • #62  first order - 1 relation 5th order - 5 relationen up to 18th order, computing power becomes limiting if we increase the order. model can use preceeding, succeeding or surrounding residues. In theory, the relationship between a base and five bases 100 residues upstream could also be used (don‘t know any examples for this though). The developers try several models naturally to figure out what is the best method.
  • #63 HMM: read slide.
  • #64 Programs in HUSAR for HMM searching. Read slide.
  • #65 Psi Blast: read slide. Psi Blast starts with a single protein sequence, creates a profile from the results, and searches a database with this profile. Then it creates a new profile and runs another round. Until no more new sequences are found.
  • #67 A comparison of different methods for the detection of weak protein homologies. Read slide. The database used was Swissnew.
  • #68 BlastP finds four hemoglobins and one myoglobin.
  • #69 FatA also finds four hemoglobins and one myoglobin. The same ones as BlastP.
  • #70 SW finds 11 out from 14 globins. But two of them are far down in the list.
  • #71 Psiblast finds five globins in the first round, and one more in the next round.
  • #72 ProfileSearch finds all 14 globins, but one is far down in the list.
  • #73 HMMerSearch finds all 14 and all in the top of the list.
  • #74 Now some benchmarks for different programs. 'Protein to protein'. Read slide. Ssearch is the SW of the GCG package.
  • #75 Some more benchmarks. 'Nucleic acid to nucleic acid'.
  • #76 More benchmarks. Read slide.
  • #77 Sequence masking is important for database searching in order to get rid of repeats and regions of low complexity that would otherwise cause loads of hits. Blast uses DUST and SEG. Read slide.
  • #78 Output with masking.
  • #79 Output without masking. There is no output because of >30.00 relevant hits!
  • #80 An example of o masked sequence.
  • #81 Is it better to compare sequences on the DNA level or on the protein level? Here are some E() values. Conclusion is that comparing on the protein level is better.
  • #82 Read slide.