Final Project – CS6243
Transcription Factor DNA Binding Prediction




                    Team Members:
                    Badri Sampath α

                    Iffat Sharmin Chowdhury α

                    Prosunjit Biswas α

                    Tahmina Ahmed α
            α
                Department of Computer Science

            University of Texas at San Antonio.
1. Defining the Scope of the Project:

In this project, we are given a number of labeled DNA sequences (the labels are p and n) and a number of
unlabeled DNA sequences, which we have to label using a model built from the labeled
sequences. The scope of the problem is therefore to build a binary classifier from the given
training DNA sequences and apply it to label the unlabeled DNA sequences.

        1.1 Challenges of the Project:

In a conventional classification problem, there are a number of attributes that we can readily use to
build the classifier. In this project, we are given only sequences and labels. So part of the work for this
project is to find a way to generate meaningful attributes.




                                 Fig. 1: Overall scope of the project.

    2. K-mer Based Approach:

In the K-mer approach, we generated all possible combinations of DNA characters for a
specified length K. The K-mer approach is shown in detail in Figure 2. The important steps of the
K-mer approach are discussed in the following paragraphs.
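
The enumeration step can be sketched as follows. This is a minimal illustration, not the code used in the project; the four-letter alphabet ACGT and the function name all_kmers are assumptions, and the report does not say how ambiguous bases (such as N) were handled.

```python
from itertools import product

# Assumed alphabet: the report only says "DNA characters".
DNA_ALPHABET = "ACGT"


def all_kmers(k):
    """Return every length-k string over the DNA alphabet, in lexicographic order."""
    return ["".join(chars) for chars in product(DNA_ALPHABET, repeat=k)]


kmers = all_kmers(5)
print(len(kmers))  # 4**5 = 1024 candidate attributes for K = 5
print(kmers[:3])   # ['AAAAA', 'AAAAC', 'AAAAG']
```

The number of attributes grows as 4^K, so each sequence has 256 attributes for K = 4, 1024 for K = 5, and 4096 for K = 6.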




                                 Fig. 2: Overall K-mer based process.

After generating the K-mers, we counted their frequencies in each sequence using three
different approaches: i) strict matching, ii) matching with mismatches, and iii) matching based
on regular expressions.
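
The three counting strategies might look like the sketch below. It rests on assumptions the report does not confirm: occurrences are counted with overlap, a mismatch means a single-character substitution (Hamming distance), and the function names and the max_mismatches parameter are hypothetical.

```python
import re


def count_strict(sequence, kmer):
    """Count overlapping exact occurrences of kmer in sequence."""
    k = len(kmer)
    return sum(sequence[i:i + k] == kmer for i in range(len(sequence) - k + 1))


def count_with_mismatch(sequence, kmer, max_mismatches=1):
    """Count overlapping windows that differ from kmer in at most max_mismatches positions."""
    k = len(kmer)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, kmer))
        if mismatches <= max_mismatches:
            total += 1
    return total


def count_regex(sequence, pattern):
    """Count start positions where pattern matches; the lookahead lets matches overlap."""
    return len(re.findall(f"(?=(?:{pattern}))", sequence))


seq = "ACGTACCAAGGT"
print(count_strict(seq, "ACG"))            # 1
print(count_with_mismatch(seq, "ACG", 1))  # 4 (ACG, ACC, AAG, AGG)
print(count_regex(seq, "AC[GC]"))          # 2 (ACG, ACC)
```

Allowing mismatches makes each count less sparse but also less specific, which may be one reason strict matching performed best in our experiments.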

In order to build an optimal model, we tuned different parameters of the model. Some of
the parameters and their impact on the classifier are shown in Table I.

    3. PWM Based Approach:

We used the motif-finding tool MEME [1] to generate a specified number of motifs of a
specified minimum and maximum length, and the motif alignment and search tool MAST [2] to get the
E-value (bounded to 100) for each sequence. We derived a score from each E-value by
subtracting it from 100, so that a sequence with a lower E-value gets a higher score. We
used these scores, one per motif, as attributes of the sequences and fed them to
different classifiers. Table II gives a synopsis of the parameters and their impact on the model.
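
The score derivation can be sketched as below. This is only an illustration: the dictionary layout, the motif identifiers, and the choice to treat a motif with no reported E-value as having the capped E-value of 100 are assumptions, not details taken from the MAST output or from our scripts.

```python
E_VALUE_CAP = 100.0  # the bound on E-values stated in the text


def evalue_to_score(e_value, cap=E_VALUE_CAP):
    """Cap the E-value at `cap`, then subtract it from `cap` so lower E-values score higher."""
    return cap - min(e_value, cap)


def motif_features(evalues, motif_ids, cap=E_VALUE_CAP):
    """Build one attribute per motif; a missing motif is assumed to have E-value `cap`."""
    return [evalue_to_score(evalues.get(m, cap), cap) for m in motif_ids]


# Hypothetical E-values for a single sequence, keyed by a made-up motif identifier.
example = {"motif_1": 0.5, "motif_3": 250.0}
print(motif_features(example, ["motif_1", "motif_2", "motif_3"]))  # [99.5, 0.0, 0.0]
```

With this mapping every score lies between 0 and 100, and a score of 0 means the motif gave no useful evidence for that sequence.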

Table I: Synopsis of the parameters and their effect in the K-mer model building process.

  K-mer Value             Classifier Selection       String Match                                Mismatch                                 Regular Expression
  5 (best)                Logistic (best)            When applied (performs best)                When not applied (performs best)         Not significant
  4 (reasonably good)     SMO (good)                 When not applied (performs relatively worse) When applied (performs relatively worse)
  6 (comparatively bad)   J48 (comparatively weak)



Table II: Synopsis of the parameters for the PWM approach and their effect on the model.

  No. of Motifs    No. of Sites a Motif Appears    Min / Max Length of Motif    Classifier
  10               18                              6-15                         J48 (best)
  8                20                              5-16                         Logistic (moderate)
  5                10                              6-15                         Naïve Bayes (comparatively bad)



   4. Combining K-mer & PWM approach:

In order to obtain a better model, we combined the K-mer and PWM approaches, using the
best known parameters of each. The combined approach showed a reasonable improvement when
applied to the training data.
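
One plausible reading of "combined" is that the two attribute sets are concatenated per sequence before being passed to a classifier. The sketch below assumes exactly that and writes the result in Weka's ARFF text format; the report does not state how the attributes were merged or which input format was used, and the relation name, attribute names, and values are made up for illustration.

```python
def combine_features(kmer_counts, pwm_scores):
    """Concatenate one sequence's K-mer counts and PWM scores into a single attribute row."""
    return list(kmer_counts) + list(pwm_scores)


def to_arff(relation, attribute_names, rows, labels, classes=("p", "n")):
    """Render numeric attribute rows and their class labels as ARFF text."""
    lines = [f"@relation {relation}"]
    lines += [f"@attribute {name} numeric" for name in attribute_names]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines.append("@data")
    for row, label in zip(rows, labels):
        lines.append(",".join(str(value) for value in row) + f",{label}")
    return "\n".join(lines)


# Hypothetical values: three K-mer counts and two motif scores for one sequence labeled p.
row = combine_features([2, 0, 1], [99.5, 0.0])
names = ["kmer_AAAAA", "kmer_AAAAC", "kmer_AAAAG", "motif_1", "motif_2"]
print(to_arff("tf_binding", names, [row], ["p"]))
```

Because K-mer counts and motif scores are on different scales, a scale-sensitive classifier may need the attributes normalized first; whether that matters depends on the classifier chosen.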

   5. Some Difficulties and Limitations of our Work:

Tuning the parameters for the classifiers was the most challenging part of the project. We think
we have done reasonable experimentation for choosing the parameters given the limited timeline.

   6. Acknowledgement:

We would like to thank Dr. Ruan for assigning us such a challenging
project. It gave us good working knowledge of practical machine learning and data mining.
Working in the group was also a nice experience and an opportunity to share knowledge.

References:

[1-2] “MEME Suite”, available at: http://meme.sdsc.edu/meme/meme-download.html
[3] “Weka”, available at: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
