Transcription Factor DNA Binding Prediction
Published in: Technology, Education
  • 1. Final Project – CS6243
    Transcription Factor DNA Binding Prediction
    Team Members: Badri Sampath α, Iffat Sharmin Chowdhury α, Prosunjit Biswas α, Tahmina Ahmed α
    α Department of Computer Science, University of Texas at San Antonio
  • 2. 1. Defining the Scope of the Project:
    In this project, we were given a number of labeled DNA sequences (labeled p or n) and a number of unlabeled DNA sequences, which we have to label using a model built from the labeled sequences. The scope of the problem is therefore to build a binary classifier from the given training DNA sequences and apply it to label the unlabeled DNA sequences.

    1.1 Challenges of the Project:
    In a conventional classification problem, there are a number of attributes that can readily be used to build the classifier. In this project, we are given only sequences and labels, so part of the work is to find a way of generating meaningful attributes.

    Fig. 1: Overall scope of the project.

    2. K-mer Based Approach:
    In the K-mer approach, we generated all possible combinations of DNA characters for a specified length K. The K-mer approach is shown in detail in Figure 2. The important steps of the K-mer approach are discussed in the following paragraphs.

    Fig. 2: Overall K-mer based process.

    After generating the K-mers, we followed different approaches to count their frequencies: i) strict matching, ii) matching with mismatch, and iii) matching based on regular expressions.

    In order to build an optimum model, we tuned different parameters of the model. Some of the parameters and their impact on the classifier are shown in Table I.

    3. PWM Based Approach:
    We used a motif-finding tool named MEME [1] to generate a specified number of motifs of specified minimum and maximum length, and the Motif Alignment and Search Tool MAST [2] to get the E-value (bounded to 100) for each sequence. We derived scores from these E-values by subtracting the E-value from 100, so that the sequences can be ordered according to their E-value. We
  • 3. have used these scores, specific to each motif, as attributes of the sequences and fed them to different classifiers. Table II gives a synopsis of the parameters and their impact on the model.

    Table I: Synopsis of the parameters and their effect in the K-mer model building process.

    K-mer Value             Classifier Selection      String Match                                  Mismatch                                  Regular Expression
    5 (best)                Logistic (best)           When applied (performs best)                  When not applied (performs best)          Not significant
    4 (reasonably good)     SMO (good)                When not applied (performs relatively worse)  When applied (performs relatively worse)
    6 (comparatively weak)  J48 (comparatively bad)

    Table II: Synopsis of the parameters for the PWM approach and their effect in the model.

    No. of Motifs   No. of Sites a Motif Appears   Min / Max Length of Motif   Classifier
    10              18                             6-15                        J48 (best)
    8               20                             5-16                        Logistic (moderate)
    5               10                             6-15                        Naïve Bayes (comparatively bad)

    4. Combining K-mer & PWM Approach:
    In order to obtain a better model, we combined the K-mer and PWM approaches, each with its best known parameters. We found a reasonable improvement with the combined approach when applying it to the training data.

    5. Some Difficulties and Limitations of Our Work:
    Tuning the parameters for the classifier was the most challenging part of the project. We think we have done reasonable experimentation in choosing the parameters, given the limited timeline.

    6. Acknowledgement:
    At the end of the project, we would like to thank Dr. Ruan for assigning us such a challenging project. It gave us good working knowledge of practical machine learning and data mining. Working in a group was also a nice experience and an opportunity to share knowledge.

    References:
    [1-2] “MEME Suite”, available at
    [3] “Weka”, available at:
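
    The K-mer counting described in Section 2 could be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function names, the A/C/G/T alphabet, and the use of Hamming distance to model "matching with mismatch" are all assumptions made for the example, and the regular-expression variant is not shown.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; the report does not list it explicitly


def all_kmers(k):
    """Enumerate every possible k-mer over the DNA alphabet (4**k strings)."""
    return ["".join(chars) for chars in product(ALPHABET, repeat=k)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_counts(sequence, k, max_mismatch=0):
    """For each k-mer, count the windows of `sequence` within `max_mismatch` substitutions.

    max_mismatch=0 corresponds to strict matching.
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = {}
    for kmer in all_kmers(k):
        counts[kmer] = sum(hamming(kmer, w) <= max_mismatch for w in windows)
    return counts


def feature_vector(sequence, k, max_mismatch=0):
    """Fixed-order list of counts, usable as one attribute row for a classifier."""
    counts = kmer_counts(sequence, k, max_mismatch)
    return [counts[kmer] for kmer in all_kmers(k)]
```

    With K = 5 this yields 1024 attributes per sequence. Each labeled sequence's vector, together with its p/n label, would form one training instance for a classifier such as those compared in Table I.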
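
    The score derivation in Section 3 (subtracting the bounded E-value from 100) can be illustrated with a small sketch. It does not call MEME or MAST. The input layout (a dictionary of per-motif E-values for each sequence), the function names, and the default used for a missing E-value are hypothetical choices made only for this example.

```python
E_VALUE_CAP = 100.0  # the E-value bound stated in the report


def evalue_to_score(e_value, cap=E_VALUE_CAP):
    """Map an E-value to a score where larger means a better motif match."""
    bounded = min(max(e_value, 0.0), cap)  # clamp into [0, cap]
    return cap - bounded


def score_rows(evalues_by_sequence, motif_ids):
    """Build one attribute row per sequence: one score per motif, in `motif_ids` order.

    A motif with no E-value for a sequence is treated as the cap (score 0);
    that default is an assumption of this sketch.
    """
    rows = {}
    for seq_id, per_motif in evalues_by_sequence.items():
        rows[seq_id] = [
            evalue_to_score(per_motif.get(motif, E_VALUE_CAP)) for motif in motif_ids
        ]
    return rows
```

    Because a low E-value indicates a stronger match, the subtraction turns it into a score that increases with match quality, which keeps the ordering of the sequences while giving the classifier a "higher is better" attribute per motif.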