International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN
0976 – 6480 (Print), ISSN 0976 – 6499 (Online), Volume 4, Issue 5, July – August (2013), pp. 74-81, © IAEME
TAG SNP SELECTION USING
QUINE-McCLUSKEY OPTIMIZATION METHOD
Moitree Basu¹, Pradipta Deb²
¹,² Tata Consultancy Services Limited
ABSTRACT
Because of its high cost, genotyping a large number of SNPs is a persistent challenge for major
genome-wide disease association studies. However, the variation in a population can be captured
by a very small set of SNPs, known as “tagging” SNPs, “tag” SNPs or “informative” SNPs. Tag SNPs
are a small set of SNPs, selected from the large set, that carry the information of the whole set.
Recent research has focused on retrieving this tag SNP set. In this paper, we present an efficient
Quine-McCluskey method for finding such tagging SNPs. Most established methods for finding tag
SNPs are confined to localization; the method presented in this paper is not limited to localization
and can also discard redundant SNPs from the whole set using a feature selection (sliding window)
method. Experimental results show that the number of tag SNPs selected by our proposed method is
feasible and effective.
Keywords: Quine-McCluskey, prime implicants, essential prime implicants, minterm, maxterm,
DNF, CNF, haplotype, tag SNP
1. INTRODUCTION
Since the earliest days of DNA analysis there has been demand for technologies that deliver fast,
inexpensive and accurate genome information from very large DNA sequences. Researchers need a
small, efficient dataset to work on. For this reason, computer scientists have looked for
optimization algorithms that pick representatives out of the whole dataset, i.e. tag SNPs that
capture most of the characteristics of the other SNPs. An example of a SNP is the alteration of the
DNA segment AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a "T".
A variation is generally counted as a SNP when it occurs in more than one percent of the human
population. Because only about three to five percent of a person's
DNA sequence codes for the production of proteins, most SNPs are found outside of "coding
sequences". SNPs found within a coding sequence are of particular interest to researchers because
they are more likely to alter the biological function of a protein. Because of the recent advances in
technology, coupled with the unique ability of these genetic variations to facilitate gene
identification, there has been a recent flurry of SNP discovery and detection.
A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome
with high linkage disequilibrium (the non-random association of alleles at two or more loci). With
tag SNPs, it is possible to identify genetic variation without genotyping every SNP in a
chromosomal region. Tag SNPs are useful in whole-genome SNP association studies in which hundreds
of thousands of SNPs across the entire genome are genotyped.
In recent years, Single Nucleotide Polymorphisms (SNPs) have become the preferred marker
for association studies of genetic diseases or traits. A set of linked SNPs on one chromosome is
called a haplotype. Recent studies have shown that the patterns of Linkage Disequilibrium (LD) [9]
observed in human populations have a block-like structure. Chromosome recombination takes place
only at some low-LD regions called recombination hotspots. The high-LD region between
these hotspots is often referred to as a "haplotype block". Within a haplotype block there is little
or no recombination, and the SNPs in the block tend to be inherited together. Because haplotype
diversity within a block is low, the information carried by these SNPs is highly redundant.
Thus, a small subset of SNPs ("tag SNPs") is sufficient to work with.
The problem of finding tag SNPs is therefore similar to minimizing a huge set of SNPs into a
small one [10], [11], so that the small set (the tag SNPs) can represent the whole. Here we
present a variant of the Quine-McCluskey method that can be applied to find the tag
SNP set from a block (haplotype block) of SNPs.
2. BACKGROUND & RELATED WORK
Currently, a wide variety of criteria for tag selection are available, but there is no consensus
on them [1]. Broadly, the informativeness of a tag SNP set is measured by two criteria: how well the
untagged SNPs can be predicted (a dependency criterion), and what proportion of haplotype
diversity can be explained [2], [3]. We propose an approach for tag selection based on minterm and
prime implicant selection, aiming to satisfy both objectives.
2.1. Quine-McCluskey Method for Binary Minimization of Data
The following terms are used throughout the paper and are defined here for the reader's
convenience.
2.1.1. Literal: A variable or its complement/negation ( x or x′ ).
2.1.2. Minterm: A product of literals in which each variable appears exactly once, in either true or
complemented form, i.e., a normal product term consisting of n literals for an n-variable function.
2.1.3. Maxterm: A sum of literals in which each variable appears exactly once, in either true or
complemented form, i.e., a normal sum term consisting of n literals.
2.1.4. DNF Form (sum-of-products): The disjunctive normal form is a sum of minterms of the
variables.
2.1.5. CNF Form (product-of-sums): The conjunctive normal form is a product of maxterms of the
variables.
2.1.6. Prime Implicant: A prime implicant of a function is a product term that cannot be combined
with another term to eliminate a variable for further simplification.
2.1.7. Essential Prime Implicant: A prime implicant that covers an output of the function
which is not covered by any combination of the other prime implicants is called an essential prime
implicant.
The Quine-McCluskey (Q-M) method [5], [6] minimizes a logical expression realizing a given
Boolean function. It lends itself well to implementation as a computer algorithm, which keeps it
useful today even though it was introduced more than 55 years ago. The method uses the following
three basic simplification laws:
a) x + x' = 1 (Complement)
b) x + x = x (Idempotent)
c) x ( y + z) = xy + xz (Distributive)
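As a worked instance of how these laws justify merging two terms that differ in a single variable (x and y are generic symbols chosen for illustration, and the last step of the first line also uses the identity x · 1 = x, which is not in the list above):

```latex
\begin{aligned}
xy + xy' &= x(y + y') = x \cdot 1 = x \\
xy + xy' + x'y &= (xy + xy') + (xy + x'y) = x + y
\end{aligned}
```

The second line relies on the idempotent law: the term xy is used twice, once in each merge.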
This method is also known as the tabulation method because it gives deterministic steps to find
the minimum form of a function by selecting essential prime implicants from a table. The steps
fall broadly into the following three categories:
2.2. Find the Prime Implicants
In this step, we write the literals as 0 and 1 and generate a table. Initially, the
number of rows in the table equals the total number of minterms of the original unsimplified
function. If two terms differ in only one bit, like 101 and 111, i.e. one variable appears in
both forms (the variable and its negation), then we can apply the complement law and eliminate that
variable. We compare all terms iteratively in this way and generate the prime implicants.
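The merging step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `combine` and `prime_implicants` and the use of "-" to mark an eliminated variable are assumptions made for this example.

```python
from __future__ import annotations

from itertools import combinations
from typing import Optional


def combine(a: str, b: str) -> Optional[str]:
    """Merge two terms that differ in exactly one 0/1 position.

    The eliminated variable is marked with '-' (an illustrative choice).
    Returns None when the terms cannot be merged.
    """
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    if "-" in (a[i], b[i]):
        # Already-eliminated positions must line up in both terms.
        return None
    return a[:i] + "-" + a[i + 1:]


def prime_implicants(minterms: list[str]) -> set[str]:
    """Merge terms round by round; a term that never merges is prime."""
    current = set(minterms)
    primes: set[str] = set()
    while current:
        merged: set[str] = set()
        used: set[str] = set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= current - used
        current = merged
    return primes


if __name__ == "__main__":
    print(combine("101", "111"))
    print(prime_implicants(["001", "011", "101", "111"]))
```

The pairwise comparison is quadratic in the number of terms per round; grouping terms by their count of 1s reduces the number of comparisons but does not change the result.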
2.3. Find the Essential Prime Implicants
Using the prime implicants from the previous step, we build a table to find the essential prime
implicants. Some prime implicants are redundant and may be omitted, but if a minterm is covered
by only one prime implicant, that prime implicant cannot be omitted and no other prime implicant
can take its place.
2.4. Find Other Prime Implicants
The essential prime implicants do not necessarily cover all the minterms. In that case, we
add other prime implicants to make sure that all minterms are covered.
In general, the Q-M method is better suited to function simplification than the K-map,
but the underlying minimization problem is NP-hard, and the method becomes impractical for
large input sizes because of its exponential complexity [4].
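Steps 2.3 and 2.4 can be sketched as a covering table in Python. The names `covers` and `select_cover` are hypothetical, "-" again marks an eliminated variable, and the greedy choice used for the remaining minterms is a swapped-in heuristic: it always yields a valid cover but is not guaranteed to be the minimum one.

```python
from __future__ import annotations


def covers(implicant: str, minterm: str) -> bool:
    """True when every fixed position of the implicant matches the minterm."""
    return all(p in ("-", m) for p, m in zip(implicant, minterm))


def select_cover(primes: list[str], minterms: list[str]) -> tuple[set[str], set[str]]:
    """Return (essential prime implicants, full chosen cover)."""
    # Covering table: for each minterm, the prime implicants that cover it.
    table = {m: [p for p in primes if covers(p, m)] for m in minterms}
    # A minterm covered by exactly one prime implicant makes it essential.
    essential = {ps[0] for ps in table.values() if len(ps) == 1}
    chosen = set(essential)
    uncovered = {m for m in minterms if not any(covers(p, m) for p in chosen)}
    # Greedy heuristic (an assumption of this sketch) for what is left.
    while uncovered:
        candidates = sorted(set(primes) - chosen)
        if not candidates:
            raise ValueError("prime implicants do not cover all minterms")
        best = max(candidates, key=lambda p: sum(covers(p, m) for m in uncovered))
        if not any(covers(best, m) for m in uncovered):
            raise ValueError("prime implicants do not cover all minterms")
        chosen.add(best)
        uncovered = {m for m in uncovered if not covers(best, m)}
    return essential, chosen


if __name__ == "__main__":
    ess, cover = select_cover(["00-", "0-1", "-11"], ["000", "001", "011", "111"])
    print(sorted(ess), sorted(cover))
```

In the example, 000 is covered only by 00- and 111 only by -11, so both are essential and together already cover every minterm; 0-1 is redundant.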
3. PROPOSED FRAMEWORK
3.1. Problem Formulation
A haplotype represents the allele information of contiguous SNPs on one chromosome, while
a genotype represents the combined allele information of the SNPs on a pair of chromosomes. For a
bi-allelic SNP, each haplotype can be represented by a binary string. Let a haplotype h have m SNPs.
We can then represent h as h = {h1, h2, …, hm}, hi ∈ {0, 1}.
Figure 1: A sample (N × K) haplotype block
In the same way, a genotype g has m SNPs and can be represented as g = {g1, g2, …, gm},
gi ∈ {0, 1, 2}. At each SNP the pair of alleles is one of {0/0, 0/1, 1/0, 1/1}; the codes 0 and 1
stand for the major homozygote {0/0} and the minor homozygote {1/1} respectively, and 2 stands for
the heterozygotes {0/1} and {1/0}.
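The encoding above can be illustrated with a short Python sketch that derives a genotype from a pair of haplotypes. The function name `genotype` is hypothetical and chosen only for this example.

```python
from __future__ import annotations


def genotype(h1: list[int], h2: list[int]) -> list[int]:
    """Combine two binary haplotypes into a genotype over {0, 1, 2}.

    0 = major homozygote (0/0), 1 = minor homozygote (1/1),
    2 = heterozygote (0/1 or 1/0).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


if __name__ == "__main__":
    print(genotype([0, 1, 0, 1], [0, 1, 1, 0]))
```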
Our goal is to determine a minimum set of tag SNPs
T = {t1, t2, …, tk}, consisting of SNPs selected from the haplotypes with a minimum error. To
achieve this goal, we need to find the minimum number of tag SNPs that have a strong
correlation with the remaining SNPs. The major processes involved are the minterm selection
algorithm, prime implicant selection, essential prime implicant selection and the tag SNP prediction
algorithm. In the next section we introduce the Quine-McCluskey method used for the tag SNP
selection.
3.2. Quine Algorithm
Assume we are given a haplotype block containing N SNPs and K haplotype patterns. This
block is denoted by an N × K binary matrix M (see Figure 1).
Define M[i, j] ∈ {0, 1} for each i ∈ [1, N] and j ∈ [1, K], where 0 and 1 represent the major and
minor alleles, respectively.
Input: An N × K matrix M, window size (W)
Output: The minimum subset of SNPs (tag SNP set).
3.2.1. quine-tagSNP ( M , W )
1. Read the haplotype set from the file into a 2-D array.
2. For i = 0 to K
2.1. Using the sliding window method, take from the matrix a small 2-D array ‘T’
with ‘N’ rows and ‘W’ columns.
2.2. Apply the Quine-McCluskey minimization to this small matrix by calling the
corresponding method quine ( T ).
End for
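The windowing in step 2 can be sketched in Python as below. The function name `sliding_windows` and the `step` parameter are assumptions: the text does not state how far the window advances each time, so a step of one column is used here as a default.

```python
from __future__ import annotations

from typing import Iterator


def sliding_windows(
    M: list[list[int]], W: int, step: int = 1
) -> Iterator[tuple[int, list[list[int]]]]:
    """Yield (start column, T) where T keeps all N rows and W columns of M."""
    if W <= 0 or step <= 0:
        raise ValueError("W and step must be positive")
    K = len(M[0]) if M else 0
    for start in range(0, K - W + 1, step):
        yield start, [row[start:start + W] for row in M]


if __name__ == "__main__":
    M = [[0, 1, 1, 0],
         [1, 1, 0, 0]]
    for start, T in sliding_windows(M, W=3):
        print(start, T)
```

Passing `step=W` would give non-overlapping windows instead; which of the two the method intends is not specified in the text.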
3.2.2. quine ( T )
1. For i = 0 to N
1.1. Remove all duplicate rows from the array T.
1.2. Store the decimal equivalent of each row of array T in a 1-D array, say ‘D’.
1.3. Using T, apply the Quine-McCluskey procedure to find the minterms.
1.4. From those minterms, find the prime implicants for this array T.
2. End for
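Steps 1.1 and 1.2 can be sketched together in Python. The function name `unique_rows_as_decimals` is hypothetical, and returning a mapping from each decimal value to the first row that produced it is a design choice of this sketch, made so that the original row index is not lost when duplicates are dropped.

```python
from __future__ import annotations


def unique_rows_as_decimals(T: list[list[int]]) -> dict[int, int]:
    """Map the decimal value of each distinct row to its first row index."""
    first_index: dict[int, int] = {}
    for idx, row in enumerate(T):
        value = int("".join(str(bit) for bit in row), 2)
        # setdefault keeps the first occurrence and ignores duplicates.
        first_index.setdefault(value, idx)
    return first_index


if __name__ == "__main__":
    T = [[1, 0, 1],
         [0, 1, 1],
         [1, 0, 1]]
    print(unique_rows_as_decimals(T))
```

The keys of the returned dictionary play the role of the array ‘D’.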
3.2.3. primeSelection ()
1. Reorder the rows of the array T in increasing order of the number of 1s.
2. Group together the rows that have the same number of 1s.
3. Compare rows from two groups whose numbers of 1s differ by one.
4. If the two rows being compared differ in exactly one position (a 0 against a 1), mark
that position with a number other than 0 and 1, say ‘9’, and construct a new row
from them.
For e.g.: 11101 ( row 1) ( group X say)
10101 ( row 2) ( group Y say)
19101 ( new row)
5. Add the newly generated rows to a new array, then repeat from step 1 until
no new rows can be formed.
For e.g.:
1 9 9 1
9 9 9 1
-----------------------------------------------------------------
No new rows can be formed from these two rows.
6. The rows from which no new rows can be formed are considered prime
implicants.
7. At each stage, identify such rows and store them in a different array,
say ‘P’.
8. For each row in ‘P’, store all the numbers it can represent in an array.
For e.g.:
if 1091 is a row, then it represents (1001 = 9, 1011 = 11).
So, 9 & 11 will be considered as prime implicants.
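Step 8 can be sketched in Python as follows. The function name `expand` is hypothetical; the marker ‘9’ follows the convention of step 4. In the row 1091 the marked position is the third one, so filling it with 0 and with 1 gives the patterns 1001 and 1011.

```python
from __future__ import annotations

from itertools import product


def expand(row: str, marker: str = "9") -> list[int]:
    """List the decimal values of all 0/1 patterns a marked row stands for."""
    slots = [i for i, ch in enumerate(row) if ch == marker]
    values = []
    for bits in product("01", repeat=len(slots)):
        chars = list(row)
        for i, bit in zip(slots, bits):
            chars[i] = bit
        values.append(int("".join(chars), 2))
    return sorted(values)


if __name__ == "__main__":
    print(expand("1091"))
    print(expand("1991"))
```

A row with k marked positions expands to 2^k numbers, so the expansion grows quickly for heavily merged rows.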
3.2.4. Essential_Prime Selection (P)
1. For each window of SNPs, we have a set of prime implicants stored in ‘P’.
2. For each window, remove the duplicates from this array to obtain the essential
prime implicants.
3. Store the values of the essential prime implicants in a separate array, say ‘SNP_Array’.
3.2.5. SNP Selection (SNP_Array)
1. After the essential prime implicants for a single window of SNPs are stored, the tag SNPs
are selected.
2. Choose the unique numbers from the array.
3. Select as tag SNPs those rows whose decimal representation matches one of these
numbers.
4. Store the result in a global array, say ‘tag_SNP_array’.
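Steps 2 and 3 can be sketched in Python as below. The function name `select_tag_rows` is hypothetical, and keeping only the first row when several rows of the window share the same pattern is an assumption of this sketch; the text does not say which of the duplicate rows is retained.

```python
from __future__ import annotations


def select_tag_rows(T: list[list[int]], essential_numbers: list[int]) -> list[int]:
    """Return indices of rows in T whose decimal value is an essential number."""
    wanted = set(essential_numbers)  # step 2: unique numbers only
    seen: set[int] = set()
    tags: list[int] = []
    for idx, row in enumerate(T):
        value = int("".join(str(bit) for bit in row), 2)
        if value in wanted and value not in seen:
            tags.append(idx)
            seen.add(value)
    return tags


if __name__ == "__main__":
    T = [[1, 0, 0, 1],
         [1, 0, 1, 1],
         [0, 0, 0, 0],
         [1, 0, 0, 1]]
    print(select_tag_rows(T, [9, 11, 9]))
```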
3.2.6. Tag_SNP_Selection ( tag_SNP_array)
A single SNP can be selected as a tag in several windows of SNPs. To obtain the final
list of tag SNPs, we therefore remove the duplicates from this array.
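This final de-duplication can be sketched in Python as a set union over the per-window results. The function name `merge_tags` and the representation of each window's result as a list of SNP row indices are assumptions made for this example.

```python
from __future__ import annotations


def merge_tags(per_window_tags: list[list[int]]) -> list[int]:
    """Union the tag SNP indices found in each window, without duplicates."""
    return sorted(set().union(*per_window_tags))


if __name__ == "__main__":
    print(merge_tags([[0, 3], [3, 5], [0]]))
```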
4. EXPERIMENTAL RESULTS
4.1 Source of Data
For this paper we used the Phase I / rel#16a data files from http://hapmap.ncbi.nlm.nih.gov/.
The data were downloaded for ENCODE region ENr321.
4.2 Results
Table 1: LD Value chart of different datasets
The above chart shows the comparison of correlation values. For this paper, we
selected 3 data sets of size (103 × 774), (120 × 618) and (93 × 550), where the first value in each
pair is the number of rows (SNPs) and the second value is the number of columns (haplotypes). The
LD value [7], [8] for each data set is very close to the range (0 ≤ LD ≤ 0.3), i.e., high correlation,
which means the tag SNPs are highly correlated with the rest of the SNPs in the data. It is also
notable that the number of tag SNPs for each data set is fairly low, which is desirable because a
small amount of data can then represent the whole dataset efficiently.
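The text does not spell out the formula behind its LD values. As an illustration only, the sketch below computes r², one commonly used pairwise LD measure, between two SNPs observed over the same haplotypes; the function name `r_squared` and the choice of r² are assumptions and may differ from the measure actually used for Table 1.

```python
from __future__ import annotations


def r_squared(a: list[int], b: list[int]) -> float:
    """Pairwise r^2 between two bi-allelic SNPs coded 0/1 over the same haplotypes."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("SNP vectors must be non-empty and of equal length")
    p_a = sum(a) / n                      # frequency of allele 1 at SNP a
    p_b = sum(b) / n                      # frequency of allele 1 at SNP b
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                        # a monomorphic SNP carries no LD signal
    d = p_ab - p_a * p_b                  # disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
    print(r_squared([0, 1, 0, 1], [0, 0, 1, 1]))
```

Under this measure, identical SNP columns score 1 and statistically independent columns score 0.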
Figure 2: R² value comparison between 3 datasets
Table 2 shows how the tag SNP set varies with the window size. The
chart indicates that the larger the window size, the better the result. However, the window size
cannot be extended much further, because the larger the window, the lower the chance of finding
prime implicants using the Quine method. In our case, we chose window sizes in the range 3-6.
Table 2: Different number of Tag SNPs chosen due to different window size (Window-4 and
Window-5)
Figure 3: LD Value comparison between different window size for same dataset
Figure 4: Window size - 4, the sequence numbers of Tag SNP
Figure 5: Window size - 5, the sequence numbers of Tag SNP
In this paper, we presented a method for finding tag SNPs using the Boolean minimization
procedure of Quine-McCluskey together with a sliding window. Our study indicates that this method
is well suited to tag SNP selection, giving a decent amount of correlation between the SNP set and
the tag SNP set. The method also provides a good amount of minimization if a proper window size is
chosen for the dataset.
5. CONCLUSIONS
In this paper, we proposed a new natural measure for evaluating the prediction accuracy of a
set of tag SNPs, and used it to develop a new method for tag SNP selection. The proposed method is
based on a novel algorithm that predicts the values of the tag SNPs. Our experimental results and
theoretical analysis show that this algorithm is not only efficient but also finds considerably
better solutions. One future direction for this method is to devise a procedure that calculates
the window size giving the optimal result for any given set of haplotypes.
6. REFERENCES
[1] Halldórsson BV, Bafna V, Lippert R, Schwartz R, Vega FM, Clark AG, Istrail S: Optimal
haplotype block-free selection of tagging SNPs for genome-wide association studies.
Genome Research 2004:1633-1640.
[2] Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally
informative set of single-nucleotide polymorphisms for association analyses using linkage
disequilibrium.
[3] Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype
structure in the human genome. Nat Genet 2001, 29(2):229-232.
[4] Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to algorithms The MIT Press;
2001.
[5] McCluskey E. Minimization of Boolean Functions. Bell System Technical
Journal. 1956;35:1417–1444.
[6] Tomaszewski, S. P., Celik, I. U., Antoniou, G. E., "WWW-based Boolean function
minimization", International Journal of Applied Mathematics and
Computer Science, vol. 13, part 4, pages 577-584, 2003.
[7] Zhang K, Qin ZS, Liu JS, Chen T, Waterman MS, Sun F: Haplotype block partition and tag
SNP selection using genotype data and their applications to association studies. Genome
Research 2004, 14:908-916.
[8] Zhao JH, Lissarrague S, Essioux L, Sham PC: GENECOUNTING: haplotype analysis with
missing genotypes. Bioinformatics 2002, 18:1694-1695.
[9] B. Devlin, and N. Risch. A comparison of linkage disequilibrium measures for fine-scale
mapping. Genomics.29, 311–322, 1995.
[10] B. Halldórsson, V. Bafna, R. Lippert, R. Schwartz, F. de la Vega, A. Clark, and S. Istrail.
Optimal haplotype block-free selection of tagging snps for genome-wide association studies.
Genome research.14, 1633-1640, 2004.
[11] Z. Meng, D. Zaykin, C. Xu, M. Wagner, and M. Ehm. Selection of genetic markers for
association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet. 73:
115–130, 2003.