Final Project - CS6243
Transcription Factor DNA Binding Prediction

Team Members: Badri Sampath α, Iffat Sharmin Chowdhury α, Prosunjit Biswas α, Tahmina Ahmed α
α Department of Computer Science, University of Texas at San Antonio.

1. Defining the Scope of the Project:
In this project, we are given a number of labeled DNA sequences (labeled p or n) and a number of unlabeled DNA sequences, which we have to label using a model built from the labeled sequences. The problem is therefore to build a binary classifier from the given training DNA sequences and apply it to label the unlabeled DNA sequences.

1.1 Challenges of the Project:
In a conventional classification problem, there are a number of different attributes that can readily be used to build the classifier. In this project, we are given only sequences and labels, so part of the work is to find a way to generate meaningful attributes.

Fig. 1: Overall scope of the project.

2. K-mer Based Approach:
In the K-mer approach, we have generated all possible combinations of DNA characters for a specified length K. The approach is shown in detail in Figure 2, and its important steps are discussed in the following paragraphs.

Fig. 2: Overall K-mer based process.

After generating the K-mers, we counted their frequencies in three different ways: i) strict matching, ii) matching with mismatches, and iii) matching based on regular expressions.

In order to build an optimum model, we have tuned different parameters of the model. Some of the parameters and their impact on the classifier are shown in Table I.

3. PWM Based Approach:
We have used a motif-finding tool named MEME to generate a specified number of motifs of specified minimum and maximum length, and the Motif Alignment and Search Tool MAST to get the E-value (bounded to 100) for each sequence. We have derived scores from these E-values by subtracting the E-value from 100, so that a lower E-value gives a higher score and the sequences can be ordered accordingly. We
have used these scores, one per motif, as attributes of the sequences and fed them to different classifiers. Table II gives a synopsis of the parameters and their impact on the model.

Table I: Synopsis of the parameters and their effect in the K-mer model building process.

K-mer Value            | Classifier Selection    | String Match                                 | Mismatch                                 | Regular Expression
5 (best)               | Logistic (best)         | When applied (performs best)                 | When not applied (performs best)         | Not significant
4 (reasonably good)    | SMO (good)              | When not applied (performs relatively worse) | When applied (performs relatively worse) |
6 (comparatively weak) | J48 (comparatively bad) |                                              |                                          |

Table II: Synopsis of the parameters for the PWM approach and their effect on the model.

No. of Motifs | No. of Sites a Motif Appears | Min / Max Length of Motif | Classifier
10            | 18                           | 6-15                      | J48 (best)
8             | 20                           | 5-16                      | Logistic (moderate)
5             | 10                           | 6-15                      | Naïve Bayes (comparatively bad)

4. Combining K-mer & PWM Approaches:
In order to obtain a better model, we have combined the K-mer and PWM approaches using the best known parameters for each. We found a reasonable improvement for the combined approach when applying it to the training data.

5. Some Difficulties and Limitations of Our Work:
Tuning the parameters for the classifier was the most challenging part of the project. We think we have done reasonable experimentation in choosing the parameters, given the limited timeline.

6. Acknowledgement:
At the end of the project, we would like to thank Dr. Ruan for assigning us such a challenging project. It gave us good working knowledge of practical machine learning and data mining. Working in the group was also a nice experience and an opportunity for knowledge sharing.

References:
[1] "MEME Suite", available at: http://meme.sdsc.edu/meme/meme-download.html
[2] "Weka", available at: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
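
Illustrative sketch of K-mer counting:
The K-mer feature generation described in Section 2 can be illustrated with a short Python sketch. It is not the code used in the project. The function names (all_kmers, kmer_counts, feature_vector), the ACGT alphabet, the overlapping sliding window, and the use of Hamming distance for "matching with mismatches" are all assumptions made for illustration. Regular-expression matching is left out because the report does not describe the patterns used.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet


def all_kmers(k):
    """Return every possible K-mer of length k over the alphabet, in a fixed order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_counts(sequence, k, max_mismatches=0):
    """Count K-mer occurrences in a sequence using overlapping windows.

    max_mismatches=0 gives strict matching; a larger value also counts
    windows that differ from the K-mer in at most that many positions
    (an assumed reading of "matching with mismatch").
    """
    kmers = all_kmers(k)
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if max_mismatches == 0:
            if window in counts:  # skips windows with non-ACGT characters
                counts[window] += 1
        else:
            for kmer in kmers:
                if hamming(kmer, window) <= max_mismatches:
                    counts[kmer] += 1
    return counts


def feature_vector(sequence, k, max_mismatches=0):
    """Turn a sequence into a fixed-length attribute vector of K-mer counts."""
    counts = kmer_counts(sequence, k, max_mismatches)
    return [counts[kmer] for kmer in all_kmers(k)]
```

Under these assumptions, each sequence becomes a vector of 4^K counts (1024 attributes for K = 5), which can be written out together with the p/n label and passed to a classifier.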