Tag SNP Selection Using Quine-McCluskey Optimization Method



International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976 – 6480 (Print), ISSN 0976 – 6499 (Online), Volume 4, Issue 5, July – August (2013), pp. 74-81, © IAEME: www.iaeme.com/ijaret.asp, Journal Impact Factor (2013): 5.8376 (Calculated by GISI), www.jifactor.com

TAG SNP SELECTION USING QUINE-McCLUSKEY OPTIMIZATION METHOD

Moitree Basu (1), Pradipta Deb (2)
(1, 2) Tata Consultancy Services Limited

ABSTRACT

Because of its high cost, genotyping a large number of SNPs is a persistent challenge for genome-wide disease association studies. However, the variation in a population can be captured by a very small set of SNPs, known as "tagging" SNPs, "tag" SNPs, or "informative" SNPs. A tag SNP set is a small subset of SNPs, selected from the full set, that carries the information of the whole set. Recent research has focused on retrieving such tag SNP sets. In this paper, we present an efficient Quine-McCluskey method for finding tagging SNPs. Most established methods for finding tag SNPs are confined to localization; the method presented here is not limited to localization and can also discard redundant SNPs from the whole set using a feature selection (sliding window) method. Experimental results show that the number of tag SNPs selected by the proposed method is feasible and effective.

Keywords: Quine-McCluskey, prime implicants, essential prime implicants, minterm, maxterm, DNF, CNF, haplotype, tag SNP

1. INTRODUCTION

Demand for technologies that deliver fast, inexpensive, and accurate genome information from large DNA sequences has existed since the start of DNA analysis. Scientists have always needed a small, efficient dataset on which to do their research. For this reason, computer scientists began to look for optimization algorithms that select representatives out of the whole dataset, i.e., the tag SNPs that represent most of the characteristics of the other SNPs.
An example of a SNP is the alteration of the DNA segment AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a "T". On average, SNPs occur in the human population more than one percent of the time. Because only about three to five percent of a person's DNA sequence codes for the production of proteins, most SNPs are found outside of "coding sequences". SNPs found within a coding sequence are of particular interest to researchers because they are more likely to alter the biological function of a protein. Because of recent advances in technology, coupled with the unique ability of these genetic variations to facilitate gene identification, there has been a recent flurry of SNP discovery and detection.

A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium (the non-random association of alleles at two or more loci). Tag SNPs make it possible to identify genetic variation without genotyping every SNP in a chromosomal region. They are useful in whole-genome SNP association studies, in which hundreds of thousands of SNPs across the entire genome are genotyped.

In recent years, SNPs have become the preferred marker for association studies of genetic diseases or traits. A set of linked SNPs on one chromosome is called a haplotype. Recent studies have shown that the patterns of linkage disequilibrium (LD) [9] observed in human populations have a block-like structure. Chromosome recombination takes place only at certain low-LD regions called recombination hotspots. The high-LD region between these hotspots is often referred to as a "haplotype block". Within a haplotype block, little or no recombination occurs, and the SNPs in the block tend to be inherited together. Because of the low haplotype diversity within a block, the information carried by these SNPs is highly redundant. Thus, a small subset of SNPs ("tag SNPs") is sufficient to work with.

The problem of finding tag SNPs is therefore one of minimization: reducing a huge number of SNPs to a smaller set [10], [11], so that the large set of SNPs can be represented by the small one (the tag SNPs). Here we present a variant of the Quine-McCluskey method that can be applied to find the tag SNP set from a haplotype block of SNPs.

2. BACKGROUND & RELATED WORK

Currently, a wide variety of criteria for tag selection are available, but there is no consensus on them [1]. Broadly, the informativeness of a tag SNP set is measured by two criteria: how well the untagged SNPs can be predicted (a dependency criterion), and what proportion of haplotype diversity can be explained [2], [3]. We propose an approach for tag selection based on minterm and prime implicant selection, aiming to satisfy both objectives.

2.1. Quine-McCluskey Method for Binary Minimization of Data

The following terms are used throughout the paper and are defined here for the convenience of the reader.

2.1.1. Literal: A variable or its complement/negation (x or x′).
2.1.2. Minterm: A product of literals in which each variable appears exactly once, in either true or complemented form, i.e., a normal product term consisting of n literals for an n-variable function.
2.1.3. Maxterm: A sum of literals in which each variable appears exactly once, in either true or complemented form, i.e., a normal sum term consisting of n literals.
2.1.4. DNF Form (sum-of-products): The disjunctive normal form is a sum of minterms of the variables.
2.1.5. CNF Form (product-of-sums): The conjunctive normal form is a product of maxterms of the variables.
2.1.6. Prime Implicant: A prime implicant of a function is a product term that cannot be combined with another term to eliminate a variable for further simplification.
2.1.7. Essential Prime Implicant: A prime implicant that covers an output of the function which is not covered by any combination of the other prime implicants is called an essential prime implicant.

The Quine-McCluskey (Q-M) method [5], [6] minimizes a logical expression realizing a given Boolean function. Because it lends itself to implementation as a computer algorithm, it remains useful even though it was introduced more than 55 years ago. The method uses the following three basic simplification laws:

a) x + x' = 1 (Complement)
b) x + x = x (Idempotent)
c) x (y + z) = xy + xz (Distributive)

This method is also known as the tabulation method because it gives deterministic steps to check the minimum form of a function, based on the selection of essential prime implicants using a table. The steps fall broadly into the following three categories:

2.2. Find the Prime Implicants

In this step, we replace the literals with 0 and 1 and generate a table. Initially, the number of rows in the table equals the total number of minterms of the original unsimplified function. If two terms differ in only one bit, like 101 and 111, i.e., one variable appears in both forms (the variable and its negation), then we can use the complement law to merge them. Iteratively, we compare all terms and generate the prime implicants.

2.3. Find the Essential Prime Implicants

Using the prime implicants from the previous step, we generate a table to find the essential prime implicants. Some prime implicants are redundant and may be omitted; but a prime implicant that is the only one covering some minterm cannot be omitted, because no other prime implicant can take its place.

2.4. Find Other Prime Implicants

The essential prime implicants do not necessarily cover all the minterms. In that case, we consider other prime implicants to make sure that all minterms have been covered.
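The three steps above can be illustrated with a short Python sketch. It is a minimal textbook-style rendering of Q-M, not the authors' implementation: the function names (`combine`, `prime_implicants`, `covers`, `essential_prime_implicants`) are hypothetical, terms are held as strings, and a merged bit is marked with `-`.

```python
from itertools import combinations


def combine(a, b):
    """Merge two terms that differ in exactly one fixed bit, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]  # complement law: x + x' = 1
    return None


def prime_implicants(minterms, n_vars):
    """Step 2.2: merge terms repeatedly; terms that never merge are prime."""
    terms = {format(m, f"0{n_vars}b") for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= terms - used  # unmerged terms are prime implicants
        terms = merged
    return primes


def covers(implicant, minterm, n_vars):
    """True if the implicant matches the minterm on every fixed bit."""
    bits = format(minterm, f"0{n_vars}b")
    return all(p in ("-", b) for p, b in zip(implicant, bits))


def essential_prime_implicants(primes, minterms, n_vars):
    """Step 2.3: a prime implicant is essential if it alone covers a minterm."""
    essential = set()
    for m in minterms:
        covering = [p for p in primes if covers(p, m, n_vars)]
        if len(covering) == 1:
            essential.add(covering[0])
    return essential


if __name__ == "__main__":
    minterms = [0, 1, 2, 3, 7]  # toy 3-variable function, chosen for illustration
    primes = prime_implicants(minterms, 3)
    print(sorted(primes))
    print(sorted(essential_prime_implicants(primes, minterms, 3)))
```

When the essential prime implicants leave some minterms uncovered (step 2.4), further prime implicants have to be chosen to complete the cover; the sketch stops before that step.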
In general, the Q-M method provides a better approach to function simplification than the K-map, but Boolean minimization is still an NP-hard problem, and the method becomes impractical for large input sizes due to its exponential complexity [4].

3. PROPOSED FRAMEWORK

3.1. Problem Formulation

A haplotype represents the allele information of contiguous SNPs on one chromosome, while a genotype represents the combined allele information of the SNPs on a pair of chromosomes. For a bi-allelic SNP, each haplotype can be represented by a binary string. Let a haplotype h have m SNPs. We can then represent h as h = {h1, h2, …, hm}, hi ∈ {0, 1}.

Figure 1: A sample (N × K) haplotype block
In the same way, a genotype g has m SNPs and we can represent g as g = {g1, g2, …, gm}, gi ∈ {0, 1, 2}. The genotype at each SNP is one of {0/0, 0/1, 1/0, 1/1}, where 0 and 1 stand for the major homozygote {0/0} and the minor homozygote {1/1} respectively, and 2 stands for the heterozygotes {0/1} and {1/0}. Our goal is to determine a minimum set of tag SNPs T = {t1, t2, …, tk}, consisting of SNPs selected from the haplotypes, with minimum error. In order to achieve this goal, we need to find the minimum number of tag SNPs that will have a strong correlation among themselves. The major processes involved are minterm selection, prime implicant selection, essential prime implicant selection, and tag SNP prediction. In the next section we introduce the Quine-McCluskey method used for tag SNP selection.

3.2. Quine Algorithm

Assume we are given a haplotype block containing N SNPs and K haplotype patterns. This block is denoted by an N × K binary matrix M (see Figure 1). Define M[i, j] ∈ {0, 1} for each i ∈ [1, N] and j ∈ [1, K], where 0 and 1 represent the major and minor alleles, respectively.

Input: An N × K matrix M, window size (W)
Output: The minimum subset of SNPs (tag SNP set).

3.2.1. quine-tagSNP (M, W)
1. Read the haplotype set from the file into a 2-D array.
2. For i = 0 to K
2.1. Using the sliding window method, divide the matrix into a small 2-D array 'T' having N rows and W columns. For each such small matrix, apply the Quine-McCluskey minimization by calling quine (T).
End for

3.2.2. quine (T)
1. For i = 0 to N
1.1. Remove all duplicate rows from the array T.
1.2. Store the decimal equivalent of each row of T in a 1-D array, say 'D'.
1.3. Using T, apply the Quine-McCluskey procedure to find the minterms.
1.4. From those minterms, find the prime implicants for this array T.
2. End for

3.2.3. primeSelection ()
1. Reorder the rows of the array T in increasing order of the number of 1s.
2. Group together the rows that have the same number of 1s.
3. Compare the rows from two groups whose numbers of 1s differ by one.
4. If the two rows being compared differ in exactly one position (0 versus 1 only), mark that position with a symbol other than 0 and 1, say '9', and construct a new row from them. For example:
11101 (row 1, group X say)
10101 (row 2, group Y say)
19101 (new row)
5. Add the newly generated rows to a new array, then repeat from step 1 until no new rows can be formed. For example:
1 9 9 1
9 9 9 1
No new rows can be formed from these two rows.
6. The rows from which no new rows can be formed are considered prime implicants.
7. At each stage, we identify such rows and store them in a different array, say 'P'.
8. For each row in 'P', all the binary numbers that the row can represent are stored in an array. For example, if 1091 is a row, it expands to 1001 = 9 and 1011 = 11, so 9 and 11 will be considered as prime implicants.

3.2.4. Essential_Prime Selection (P)
1. For each window of SNPs, we have a set of prime implicants stored in 'P'.
2. For each window, the essential prime implicants are chosen by removing duplicates from this array.
3. The values of the essential prime implicants are stored in a separate array, say 'SNP_Array'.

3.2.5. SNP Selection (SNP_Array)
1. After the essential prime implicants for a single window of SNPs are stored, the tag SNPs are selected.
2. Choose the unique numbers from the array.
3. Select as tag SNPs those rows whose decimal representation matches one of these numbers.
4. Store the result in a global array, say 'tag_SNP_array'.

3.2.6. Tag_SNP_Selection (tag_SNP_array)
A single SNP can be tagged for several windows of SNPs. To obtain the optimized list of tag SNPs, we therefore remove the duplicates from this array.

4. EXPERIMENTAL RESULTS

4.1 Source of Data

For this paper we used the Phase I / rel#16a data files from http://hapmap.ncbi.nlm.nih.gov/. These data were downloaded for ENCODE region ENr321.
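Looking back at the procedure of Section 3.2, the window split of 3.2.1, the duplicate removal of 3.2.2, and the row merging and expansion of 3.2.3 can be sketched in Python as follows. This is one possible reading under stated assumptions, not the authors' code: the function names are hypothetical, rows are held as strings over the characters 0, 1, and 9, and the window is assumed to advance one column at a time because the paper does not state the stride.

```python
def sliding_windows(matrix, w, step=1):
    """Yield N x w sub-matrices of a list-of-strings matrix (3.2.1).

    `step` is an assumption; the paper does not specify the stride.
    """
    k = len(matrix[0])
    for start in range(0, k - w + 1, step):
        yield [row[start:start + w] for row in matrix]


def distinct_rows(window):
    """Map each distinct binary row to (decimal value, first row index) (3.2.2)."""
    seen = {}
    for idx, row in enumerate(window):
        seen.setdefault(row, (int(row, 2), idx))
    return seen


def merge_rows(a, b):
    """Merge two rows differing in one 0/1 position, marking it with '9' (3.2.3, step 4)."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "9" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "9" + a[i + 1:]
    return None  # rows cannot be merged


def expand_row(row):
    """List the decimal values a row stands for, trying 0 and 1 at each '9' (3.2.3, step 8)."""
    values = [""]
    for ch in row:
        options = "01" if ch == "9" else ch
        values = [v + o for v in values for o in options]
    return sorted(int(v, 2) for v in values)


if __name__ == "__main__":
    print(merge_rows("11101", "10101"))  # the merge example from step 4
    print(merge_rows("1991", "9991"))    # the non-mergeable pair from step 5
    print(expand_row("1091"))            # the expansion example from step 8
```

Holding rows as strings keeps the '9' marker of the paper visible; a bitmask representation would be faster but harder to compare against the worked examples.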
4.2 Results

Table 1: LD value chart of different datasets

The above chart shows the comparison of correlation values. For our paper, we selected 3 data sets of sizes (103 × 774), (120 × 618), and (93 × 550), where the first value in each pair is the number of rows (SNPs) and the second is the number of columns (haplotypes). The LD value [7], [8] for each data set is very close to the range (0 ≤ LD ≤ 0.3), i.e., high correlation, which means the tag SNPs are highly correlated with the rest of the SNPs in the data. It is also notable that the number of tag SNPs for the data sets is fairly low, which is desirable because that small amount of data can then represent the whole dataset more efficiently.

Figure 2: R² value comparison between 3 datasets

Table 2 shows the variation of the tag SNP set with the window size. The chart indicates that the larger the window size, the better the result. But the window size cannot be extended much further, because the greater the window size, the lower the possibility of finding prime implicants using the Quine method. In our case, we chose window sizes in the range 3-6.

Table 2: Different numbers of tag SNPs chosen for different window sizes (Window 4 and Window 5)
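The LD and R² values in this section are reported without a formula. As an assumption, the sketch below uses the standard r² measure for two bi-allelic SNPs, r² = D² / (pA (1 − pA) pB (1 − pB)) with D = pAB − pA pB, computed from two rows of the haplotype matrix. The function name `r_squared` is hypothetical, and the paper does not confirm that its values were computed this way.

```python
def r_squared(snp_a, snp_b):
    """Pairwise r^2 between two bi-allelic SNPs given as equal-length 0/1 lists.

    Each list holds the allele of one SNP across the K haplotypes.
    """
    n = len(snp_a)
    p_a = sum(snp_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(x & y for x, y in zip(snp_a, snp_b)) / n  # both carry allele 1
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP: r^2 is undefined, 0.0 is a convention here
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # identical SNPs
    print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent SNPs
```

Under this measure, a value near 1 means one SNP predicts the other almost perfectly, and a value near 0 means the two SNPs are close to independent.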
Figure 3: LD value comparison between different window sizes for the same dataset

Figure 4: Window size 4, the sequence numbers of tag SNPs

Figure 5: Window size 5, the sequence numbers of tag SNPs

In this paper, we presented a method for finding tag SNPs using the Quine-McCluskey Boolean minimization procedure together with a sliding window. Our study indicates that this method is well suited to tag SNP selection, yielding a decent amount of correlation between the SNP set and the tag SNP set. The method also provides a good amount of minimization if a proper window size is chosen for the dataset.

5. CONCLUSIONS

In this paper, we proposed a new natural measure for evaluating the prediction accuracy of a set of tag SNPs, and used it to develop a new method for tag SNP selection. The proposed method is based on a novel algorithm that predicts the values of the tag SNPs. Our experimental results and theoretical analysis show that this algorithm is not only efficient but also finds considerably better solutions. One future direction for this method is to devise a procedure through which the window size giving the optimal result for any given set of haplotypes can be calculated.
6. REFERENCES

[1] Halldórsson BV, Bafna V, Lippert R, Schwartz R, Vega FM, Clark AG, Istrail S: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 2004:1633-1640.
[2] Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium.
[3] Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nat Genet 2001, 29(2):229-232.
[4] Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. The MIT Press; 2001.
[5] McCluskey E: Minimization of Boolean Functions. Bell System Technical Journal 1956, 35:1417–1444.
[6] Tomaszewski SP, Celik IU, Antoniou GE: WWW-based Boolean function minimization. International Journal of Applied Mathematics and Computer Science 2003, 13(4):577-584.
[7] Zhang K, Qin ZS, Liu JS, Chen T, Waterman MS, Sun F: Haplotype block partition and tag SNP selection using genotype data and their applications to association studies. Genome Research 2004, 14:908-916.
[8] Zhao JH, Lissarrague S, Essioux L, Sham PC: GENECOUNTING: haplotype analysis with missing genotypes. Bioinformatics 2002, 18:1694-1695.
[9] Devlin B, Risch N: A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 1995, 29:311–322.
[10] Halldórsson B, Bafna V, Lippert R, Schwartz R, de la Vega F, Clark A, Istrail S: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 2004, 14:1633-1640.
[11] Meng Z, Zaykin D, Xu C, Wagner M, Ehm M: Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet. 2003, 73:115–130.