# 50120140502014

Published in: Technology

### 50120140502014

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 2, February (2014), pp. 130-139 © IAEME

GENERIC APPROACH OF PATTERN MATCHING OF AMINO ACID SEQUENCES USING MATCHING POLICY & PATTERN POLICY

A. K. Payra¹, S. Saha¹

¹ Dept. of Computer Science & Engg, Dr. Sudhir Chandra Sur Degree Engineering College, DumDum, Kolkata

ABSTRACT

Pattern matching is widely used in applications such as image, audio and video processing and bioinformatics. Several pattern matching algorithms already exist, such as Naive, Boyer-Moore (BM) and Knuth-Morris-Pratt (KMP), along with hybrid approaches built on them. To reduce complexity and to bring a new idea to pattern matching, this paper introduces the concepts of Matching Policy and Pattern Policy.

Keywords: ASCII, Pre-align, MP, PP, Heap, Success Ratio.

I. INTRODUCTION

Pattern matching has been studied extensively, and both its computation and its analysis are important. Pattern matching algorithms have been applied in many computer applications and industries, for example in information retrieval, information security, and searching for nucleotide or amino acid sequence patterns in biological sequence databases. The pattern matching problem can be defined as finding one or, more often, all occurrences of a given pattern P = p0 p1 … pm−1 of length m in a text T = t0 t1 … tn−1 of length n, where both are built over a finite alphabet Σ of size σ.

The rest of the paper is organized as follows. Section II reviews several efficient algorithms used in practice. Section III describes the proposed algorithms in detail. Section IV discusses the experimental results together with the complexity analysis, and Section V concludes the paper.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 5, Issue 2, February (2014), pp. 130-139 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2014): 4.4012 (Calculated by GISI) www.jifactor.com IJCET © I A E M E
II. PREVIOUS WORK

String pattern matching means detecting the position of a substring in a given string. Although there are many string matching algorithms, only some of the major, well-known ones are discussed here. Pattern matching algorithms can be categorized as single-pattern or multiple-pattern, based on their functionality.

The naïve approach [15] simply tests all possible placements of the pattern P[1 . . m] relative to the text T[1 . . n]. Specifically, we try the shifts s = 0, 1, …, n − m successively and, for each shift s, compare T[s + 1 . . s + m] to P[1 . . m].

    NAÏVE_STRING_MATCHER (T, P)
    n ← length [T]
    m ← length [P]
    for s ← 0 to n − m do
        if P[1 . . m] = T[s + 1 . . s + m]
            then report valid shift s

The naïve string-matching procedure can be interpreted as sliding the pattern P[1 . . m] over the text T[1 . . n] and noting for which shifts all of the characters in the pattern match the corresponding characters in the text. The naïve algorithm does not work well when many matching characters are followed by a mismatching character, a weakness that both the BM and KMP algorithms overcome.

The Boyer-Moore (BM) algorithm [1] uses two heuristics, bad character and good suffix, to reduce the number of comparisons. The Quick Search (QS) algorithm [2] scans the characters of the window in any order and computes its shifts from the occurrence shift of the character that immediately follows the window.

In the KMP algorithm [3], the prefix function (Π) for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘p’. In other words, it avoids backtracking on the string ‘S’.
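The two classical procedures described above can be sketched in Python. This is a minimal illustration, not code from the paper: the function names `naive_matcher` and `prefix_function` are hypothetical, and the indexing is 0-based rather than the 1-based indexing of the pseudocode.

```python
from typing import List


def naive_matcher(text: str, pattern: str) -> List[int]:
    """Return every shift s at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # shifts 0 .. n - m
        if text[s:s + m] == pattern:    # compare the whole window
            shifts.append(s)
    return shifts


def prefix_function(pattern: str) -> List[int]:
    """KMP prefix function: pi[q] is the length of the longest proper
    prefix of pattern[:q + 1] that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0                               # length of the current matched prefix
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]               # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


if __name__ == "__main__":
    print(naive_matcher("PHLPPCSPPKQGK", "PKQ"))
    print(prefix_function("ababaca"))
```

The prefix table is what lets KMP resume a partial match after a mismatch instead of restarting the window, which is the backtracking that the naïve loop cannot avoid.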
With the string ‘S’, the pattern ‘p’ and the prefix function ‘Π’ as inputs, the occurrence of ‘p’ in ‘S’ is detected, and the number of shifts of ‘p’ after which the occurrence is found is returned. In the same spirit, this paper compares the ASCII values of the pattern and the text in order to reduce complexity and, to maximize the success rate of matching, designs a Pattern Policy (PP), which is embedded in the Matching Policy (MP) of the algorithm.

PRESENT WORK

Motivation: Many approaches [4, 5] to matching over amino acid sequences were discussed in the previous section. A study of the literature shows that very few assessments have been made on the basis of the ASCII values of the pattern, which can give the maximum heuristic value by skipping comparisons. This observation prompted the present work.

Dataset: Data were collected only from serine-phosphorylated peptides of length 13 (i.e., 13-mers centered at serine) from the Phospho.ELM database [6], which are experimentally determined to be substrates of different kinases.
PROPOSED METHOD

The basis of the algorithm is quite similar to the naïve method, but it is superior to it. Our algorithm consists of two major segments, given below.

    Algorithm Matching Policy
    // Consider, amino acid sequence or text is T with length Tlen
    // and pattern P with length Plen.
    // Stext is the probable matching sub-text of the text (T).
    // i is an integer positional variable of the probable sub-text (Stext) matching.
    // S and S1 are the ASCII sums of the pattern (P) and the sub-text (Stext) respectively.
    read T, P.
    Tlen := length(T).
    Plen := length(P).
    S := sum(ASCII(P)).
    for i := 0 to (Tlen - Plen) step +1 do {
        j := i; r := 0;
        if (T[i] = P[0]) then {
            while (j ≤ Plen + i - 1) do {
                Stext[r++] := T[j++];
            }
            S1 := sum(ASCII(Stext));
            if (S1 = S) then compare(P, Stext).
            else skip.
        }
    }

    Algorithm Pattern Policy
    // Consider, amino acid sequence or text is T with length Tlen.
    // A0, …, An: the n distinguishable amino acids contained in the protein sequence.
    // On is an integer array storing the occurrences of the amino acids.
    read T.
    On := Occurrence(T(A0, A1, …, An)).
    Heapsort(On).
    Generate pattern using descending order of amino acid occurrences (On).

PROPOSED METHOD WITH EXAMPLE

The proposed algorithms (Matching Policy and Pattern Policy) are illustrated with the examples below, followed by several questions, with their answers, that may arise in the mind of the reader while going through the algorithm.

Input: A pre-aligned phosphorylated dataset of 13-mers and a pattern of length 3 are considered.

Text (T): P H L P P C S P P K Q G K

Fig.1 Sample text
Pattern (P): P K Q

Fig.2 Sample pattern

Matching Policy (MP)

The pattern is studied and features are extracted from it: its length (Plen), its ASCII sum (S) and its starting character (P[0]). Here, Plen = 3, S = 236 and P[0] = ’P’.

The next step is to find the probable positions (i) of the pattern (P) in the text (T), that is, the positions where T[i] = P[0]. So, i = {0, 3, 4, 7, 8} (shaded gray in Fig.3).

| T | P | H | L | P | P | C | S | P | P | K | Q | G | K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |

Fig.3 Possible index of P in the text

At each probable position, a sub-text (Stext) of the text (T) is taken whose length (Slen) equals the pattern length (Plen). So, Slen = 3 and Stext = {PHL} for i = 0.

Table-I. Selected Sub-text

| Instance | i | Stext |
|---|---|---|
| 0 | 0 | P H L |
| 1 | 3 | P P C |
| 2 | 4 | P C S |
| 3 | 7 | P P K |
| 4 | 8 | P K Q |

Next, the ASCII sum (S1) of Stext is calculated; for i = 0, S1 = 228. If S ≠ S1, it is not required to compare P and Stext; otherwise the individual characters are compared.

Table-II. ASCII table for Stext

| Instance | i | Stext | S1 | Character-wise comparison |
|---|---|---|---|---|
| 0 | 0 | PHL | 228 | X (228 ≠ 236) |
| 1 | 3 | PPC | 227 | X |
| 2 | 4 | PCS | 230 | X |
| 3 | 7 | PPK | 235 | X |
| 4 | 8 | PKQ | 236 | Required |
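The Matching Policy steps walked through above can be sketched in Python as follows. This is a hedged illustration under stated assumptions rather than the authors' implementation: the names `ascii_sum`, `candidates` and `matching_policy` are hypothetical, and the loop runs over the L = Tlen − Plen + 1 valid start positions.

```python
from typing import List, Tuple


def ascii_sum(s: str) -> int:
    """Sum of the ASCII codes of the characters in s."""
    return sum(ord(c) for c in s)


def candidates(text: str, pattern: str) -> List[Tuple[int, str, int]]:
    """Rows (i, Stext, S1) for every position where text[i] == pattern[0]."""
    plen = len(pattern)
    rows = []
    if plen == 0:
        return rows
    for i in range(len(text) - plen + 1):
        if text[i] == pattern[0]:           # probable position
            stext = text[i:i + plen]        # sub-text of pattern length
            rows.append((i, stext, ascii_sum(stext)))
    return rows


def matching_policy(text: str, pattern: str) -> List[int]:
    """Positions where pattern occurs, using the ASCII sum as a cheap filter."""
    target = ascii_sum(pattern)             # S in the pseudocode
    matches = []
    for i, stext, s1 in candidates(text, pattern):
        if s1 != target:                    # sums differ: skip the comparison
            continue
        if stext == pattern:                # sums agree: compare characters
            matches.append(i)
    return matches


if __name__ == "__main__":
    for row in candidates("PHLPPCSPPKQGK", "PKQ"):
        print(row)
    print(matching_policy("PHLPPCSPPKQGK", "PKQ"))
```

Run on the sample text and pattern, `candidates` yields the same five rows as Table-II, and only the last one passes the sum filter.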
The individual characters of Stext and P are compared to find whether the two strings are equal.

P H L P P C S P P K Q G K

Fig.4: Stext matching with sample pattern

Here, P matches Stext. So the pattern is present in the text, at position i = 8.

The critical case arises when the ASCII sums of Stext and the pattern are the same but the two strings differ. For example:

Table-III: ASCII value of Stext and pattern are same

| Pattern (P) | Sub-text (Stext) |
|---|---|
| PKQ | PMO |
| | PQK |
| | PPL |
| | PKQ |

This is where the character-wise comparison between Stext and P pays off, as shown next:

… P M O …

Fig.5: Sequential matching

On a mismatching character the comparison is abandoned, and the search moves on to the next probable position in the text. This makes the approach efficient and fast.

The next question is how long this continues. The answer is simple and is derived below.

P H L P P C S P P K P K

Fig.6: Terminating condition

For a text (T) of length Tlen and a pattern of length Plen, the above steps are repeated L times:

L = Tlen − Plen + 1

Terminating condition: once i exceeds Tlen − Plen, there is no chance of the pattern being present in the text.
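The collision case of Table-III and the iteration count L can be checked with a short Python sketch. The helper names `first_mismatch` and `window_count` are hypothetical and introduced only for illustration.

```python
from typing import Optional


def ascii_sum(s: str) -> int:
    """Sum of the ASCII codes of the characters in s."""
    return sum(ord(c) for c in s)


def first_mismatch(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if a == b.
    Assumes a and b have the same length."""
    for k, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return k
    return None


def window_count(tlen: int, plen: int) -> int:
    """L = Tlen - Plen + 1 start positions (0 if the pattern is longer)."""
    return max(tlen - plen + 1, 0)


if __name__ == "__main__":
    pattern = "PKQ"
    for stext in ["PMO", "PQK", "PPL", "PKQ"]:
        # Every sub-text has the same ASCII sum as the pattern, so only
        # the character-wise comparison can tell them apart.
        print(stext, ascii_sum(stext), first_mismatch(stext, pattern))
    print(window_count(13, 3))
```

The ASCII sum is therefore only a filter: equal sums never confirm a match by themselves, which is why the character-wise step cannot be dropped.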
In the matching policy, the pattern can be entered by the user or selected by different policies to find the best match. Here, the concept of a pattern policy is introduced, as discussed below.

Pattern Policy (PP)

The pattern for a sequence is selected on the basis of the amino acid occurrences (Oa) in that particular sequence. Heap sort is used to arrange the Oa values in descending order and to resolve collisions. An example is given next.

How does heap sort work?

Fig.7: Heap sort algorithm
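A minimal Python sketch of the Pattern Policy described above, using the standard-library `heapq` module as the heap. The function name `pattern_policy` and its `length` parameter are hypothetical. The source does not say how residues with equal counts are ordered, so the alphabetical tie-break used here is an assumption, and it may order tied residues differently from the authors' heap sort.

```python
import heapq
from collections import Counter


def pattern_policy(sequence: str, length: int) -> str:
    """Build a pattern from the `length` most frequent residues of sequence."""
    counts = Counter(sequence)                      # occurrences Oa
    # heapq is a min-heap, so counts are negated to pop the largest first.
    # ASSUMPTION: ties are broken alphabetically by residue letter.
    heap = [(-count, residue) for residue, count in counts.items()]
    heapq.heapify(heap)
    picked = []
    while heap and len(picked) < length:
        _, residue = heapq.heappop(heap)
        picked.append(residue)
    return "".join(picked)


if __name__ == "__main__":
    # Sample 13-mer used only as illustrative input.
    print(pattern_policy("SSVPTPSPLGPLA", 3))
```

Popping from the heap one element at a time is in effect a partial heap sort, so only as many residues as the pattern length need to be extracted.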
How does heap sort work in the pattern policy?

A selected pattern can vary in size, depending on requirements. Here, one pattern (P) is considered for each sequence (T). Heap sort is applied to the Oa values obtained for the amino acids and the pattern (P) is generated, as tabulated below.

T: SSVPTPSPLGPLA

Table IV: Pattern Selection

| Sequence (T) | Occurrence (Oa) | After sorting | Selected pattern (P) |
|---|---|---|---|
| SSVPTPSPLGPLA | T->1, G->1, V->1, A->1, P->4, S->3, L->2 | P 4, S 3, L 2, T 1 | PSLT |

IV. RESULT & EXPLANATION

Patterns (P) of different lengths (l) are applied over the pre-aligned dataset (T). The matching policy performs better when the pattern policy is executed together with it. The results obtained are tabulated below.

l = 1, Pattern length 1

Table-V. Pattern length 1

| Patterns | Match using MP | Success using MP | Match using MP+PP | Success using MP+PP |
|---|---|---|---|---|
| P, R, Q, E, A, T, S, V, K | 229 | 1.0601 | 81 | 3.375 |

Let n be the number of sequences and m the number of possible patterns, and let x1 and x2 be the number of matches using MP and MP+PP respectively. The success (S') using the MP method is then computed from x1, n and m. In MP+PP, however, every sequence has only one pattern, so n sequences give at most n distinct patterns.

l = 2, Pattern length 2

Table-VI: Pattern length 2

| Patterns | Match using MP | Success using MP | Match using MP+PP | Success using MP+PP |
|---|---|---|---|---|
| PS, PA, ED, AS, ST, VE, SP, … | 69 | 0.136 | 14 | 0.5833 |
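The formula for S' is not legible in the text above. One reading that reproduces the reported values is S' = x1 / (n × m) for MP and S' = x2 / n for MP+PP. Both formulas, and the values n = 24 sequences and m = 9 single-residue patterns used below, are assumptions inferred from the numbers in Tables V and VI; the source states none of them. The function names are hypothetical.

```python
def success_mp(matches: int, n_sequences: int, n_patterns: int) -> float:
    """ASSUMED reading: matches per (sequence, pattern) pair under MP."""
    return matches / (n_sequences * n_patterns)


def success_mp_pp(matches: int, n_sequences: int) -> float:
    """ASSUMED reading: matches per sequence under MP+PP,
    where each sequence contributes exactly one pattern."""
    return matches / n_sequences


if __name__ == "__main__":
    n = 24   # inferred, not stated in the source
    m = 9    # the nine length-1 patterns listed in Table-V
    print(round(success_mp(229, n, m), 4))   # compare with 1.0601 in Table-V
    print(success_mp_pp(81, n))              # compare with 3.375 in Table-V
    print(round(success_mp_pp(14, n), 4))    # compare with 0.5833 in Table-VI
```

If the authors used a different normalisation, the functions change but the comparison between MP and MP+PP columns would be carried out the same way.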
l = 3, Pattern length 3

Table-VII. Pattern length 3

| Patterns | Match using MP | Success using MP | Match using MP+PP | Success using MP+PP |
|---|---|---|---|---|
| PSF, PAQ, EED, ASV, STG, VED, SPE, … | 6 | 0.0104 | 2 | 0.084 |

l = 4, Pattern length 4

Table-VIII: Pattern length 4

| Patterns | Match using MP | Success using MP | Match using MP+PP | Success using MP+PP |
|---|---|---|---|---|
| PPSF, PPAQ, AASV, SSEE, RRTF, … | 0 | 0.00 | 0 | 0.00 |

The results for the different pattern lengths show that MP and PP working together produce the best outcomes. The success rate is represented below in bar chart form.

The complexity of any pattern matching algorithm is important, so the complexity analysis is given below.

Let the lengths of the sequence and the pattern be LS and LP respectively, and let N be the total number of sequences to be tested. The complexity of finding the probable positions in one sequence is O(LS). If Pp is the average number of probable positions at which a pattern may appear in a sequence, then the complexity is:

• For N sequences: Searching ≤ O(N × LS).
• The compared sub-sequence has the same length as the pattern, so: Comparison ≤ O(Pp × (LP)² × N).
• For N sequences: Pattern Policy ≤ O(N × LS × log(LS)).
• Where 0 ≤ LP ≤ LS.

The approach is thus simple, with low time complexity, and robust in use. The success ratio obtained with this algorithm is represented graphically in Fig.8.
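The three bounds listed above can be combined into a single expression. This is a sketch that assumes the searching, comparison and pattern-policy phases run one after another, which the source does not state explicitly; the symbols N, LS, LP and Pp are those defined in the analysis.

```latex
T_{\text{total}}
  \;\le\; O(N \cdot L_S) \;+\; O(P_p \cdot L_P^{2} \cdot N) \;+\; O(N \cdot L_S \log L_S)
  \;=\; O\!\bigl(N \,( L_S \log L_S + P_p \, L_P^{2} )\bigr),
  \qquad 0 \le L_P \le L_S .
```

The searching term N·LS is absorbed by the pattern-policy term N·LS·log(LS), so under this assumption the total is governed by the sort in the Pattern Policy and by the character comparisons at the probable positions.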
Fig.8: Success Ratio of the pattern matching

V. CONCLUSION

These approaches bring a new concept with a high success ratio and low time complexity. As the concept is simple, it can work on its own or be combined with other existing approaches to improve performance. If the matching policy is designed to work bi-directionally, these algorithms will be even faster.

REFERENCES

[1] R. S. Boyer, J. S. Moore, “A fast string searching algorithm”, Communications of the ACM, 20(10):762-772, 1977.
[2] D. M. Sunday, “A very fast substring search algorithm”, Communications of the ACM, 33(8):132-142, 1990.
[3] Tang Va-ling, “KMP algorithm in the calculation of next array”, Computer Technology and Development [J], 19(6):98-101, 2009.
[4] Cai, Nie, Huang, “A Fast Hybrid Pattern Matching Algorithm for Biological Sequences”, IEEE, 2009.
[5] Lu, Bao, Feng, “Hybrid pattern-matching algorithm based on BM KMP algorithm”, IEEE, 2010.
[6] http://bio.classcloud.org/f-motif/
[7] R. A. Baeza-Yates, M. Régnier, “Average running time of the Boyer-Moore-Horspool algorithm”, Theoretical Computer Science, 92(1):19-31, 1992.
[8] A. Yao, “The complexity of pattern matching for a random string”, SIAM Journal on Computing, 8(3):368-387, 1979.
[9] B. Watson, “A new regular grammar pattern matching algorithm”, In Proc. 4th Annual European Symposium, LNCS 1136, pages 364-377, 1996.
[10] E. Ukkonen, “Finding approximate patterns in strings”, Journal of Algorithms, 6(1-3):132-137, 1985.
[11] G. Navarro and R. Baeza-Yates, “A hybrid indexing method for approximate string matching”, Journal of Discrete Algorithms, 1(1):205-239, 2000.
[12] G. Navarro and M. Raffinot, “Fast regular expression search”, In Proc. 3rd Workshop on Algorithm Engineering, LNCS 1668, pages 199-213, 1999.
[13] G. Navarro and M. Raffinot, “Compact DFA representation for fast regular expression search”, In Proc. 5th Workshop on Algorithm Engineering (WAE'01), LNCS 2141, pages 1-12, 2001.
[14] G. Myers and W. Miller, “Approximate matching of regular expressions”, Bulletin of Mathematical Biology, 51:7-37, 1989.
[15] S. Roy, P. Suryanarayan, “The relation …..convolution/relation”, IETEJE, vol-51, 2010.
[16] G. Myers, “A four russians algorithm for regular expression pattern matching”, Journal of the ACM, 39(2):430-448, 1992.
[17] G. Myers, “A fast bit-vector algorithm for approximate string matching based on dynamic programming”, Journal of the ACM, 46(3):395-415, 1999.
[18] Anjan Kumar Payra…, “GENERIC APPROACH FOR PREDICTING UNANNOTATED PROTEIN PAIR FUNCTION USING PROTEIN”, INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Journal Impact Factor (2013): 6.1302 (Calculated by GISI).
[19] Anjan Kumar Payra, “FUNCTION PREDICTION USING CLUSTER ANALYSIS OF UNANNOTATED ALIGN SEQUENCES”, INTERNATIONAL JOURNAL OF CURRENT RESEARCH AND REVIEW (IJCRR), ISSN 2231-2196 (Print), ISSN 0975-5241 (Online), Journal IC Value: 4.18 (2013).