Motivation
• Label unlabeled DNA sequences with a model built from labeled DNA sequences, and gain experience with some real-world machine learning problems.
Approaches
• K-mer based
  - Fixed-length K-mer
  - K-mer with mismatches
  - Using regular expressions
• PWM based
  - MEME and MAST
• Combined model
  - Unite both models
K-mer Approach Based on Regular Expressions
Motivation
- 2-mers are the K-mers that occur most often in the sequences, so the emphasis is mostly on 2-mers.
Strategy
- For any two 2-mers X and Y, generate the regular expressions X(.*)Y and Y(.*)X.
- Use these regular expressions as candidate attributes.
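The strategy above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the names `pair_patterns` and `regex_features`, the ACGT alphabet, the restriction to pairs of distinct 2-mers, and the 0/1 (match or no match) encoding are all assumptions.

```python
import re
from itertools import combinations, product

ALPHABET = "ACGT"  # assumed DNA alphabet


def pair_patterns(k=2):
    """Build X(.*)Y and Y(.*)X for every pair of distinct k-mers X, Y."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    patterns = []
    for x, y in combinations(kmers, 2):
        patterns.append(f"{x}(.*){y}")
        patterns.append(f"{y}(.*){x}")
    return patterns


def regex_features(sequence, patterns):
    """One 0/1 attribute per pattern: 1 if the pattern matches anywhere."""
    return [1 if re.search(p, sequence) else 0 for p in patterns]


patterns = pair_patterns()
features = regex_features("ACGTTGCA", patterns)
```

With 16 possible 2-mers this yields 240 candidate attributes per sequence, which is why a later attribute selection step matters.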
Classifier Selection
Fig: Nine classifiers applied to the TF data sets
Algorithms are numbered as follows: (1) Logistic (2) SMO (3) NaiveBayes (4) BayesianLogisticRegression (5) KStar (6) Bagging (7) LogitBoost (8) RandomForest (9) J48
Summary
* 9 classifiers were applied to 10 data sets; 3 of the data sets are shown
* choosing a single best classifier is not a trivial task
* the same classifier behaves differently on different data sets
Change in Accuracy due to Different Classifiers
Fig: Performance of different classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_3 data set
Fig: Performance of different classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_5 data set
Summary
* the choice of classifier has a large effect on accuracy
* classifiers must be chosen with care
Change in Accuracy due to Different K-mer Lengths
Fig: Performance of different K-mer lengths (4-mer, 5-mer, 6-mer) on the TF_3 data set
Summary
* K-mer length also affects accuracy
* finding the single best length is not trivial
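To make the role of the K-mer length concrete, here is a small Python sketch of fixed-length K-mer counting. It assumes overlapping windows and raw counts as attribute values; the function names and the toy sequence are illustrative and are not taken from the original scripts.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, vocabulary, k):
    """Attribute vector: the count of each vocabulary k-mer in the sequence."""
    counts = kmer_counts(sequence, k)
    return [counts[kmer] for kmer in vocabulary]


# The same sequence yields a different attribute space for each length.
for k in (4, 5, 6):
    print(k, len(kmer_counts("ACGTACGTAC", k)))
```

Larger k gives more specific but sparser attributes, which is one plausible reason the best length differs between data sets.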
Attribute Space Selection
Fig: Performance for different numbers of selected k-mers on the TF_4 data set
Summary
* the number of attributes considered also affects accuracy
* accuracy increases as more attributes are considered, but beyond a saturation point it decreases
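The slides do not say how k-mers were ranked before the top ones were kept. The sketch below uses one simple, assumed criterion (the gap between the fraction of positive and negative sequences that contain a k-mer) only to show where the "number of attributes" knob `n` enters; `top_kmers` is a hypothetical helper.

```python
from collections import Counter


def top_kmers(positives, negatives, k, n):
    """Keep the n k-mers whose presence rate differs most between classes."""

    def presence(sequences):
        # Number of sequences in which each k-mer occurs at least once.
        seen = Counter()
        for s in sequences:
            seen.update({s[i:i + k] for i in range(len(s) - k + 1)})
        return seen

    pos, neg = presence(positives), presence(negatives)

    def gap(kmer):
        return abs(pos[kmer] / len(positives) - neg[kmer] / len(negatives))

    candidates = sorted(set(pos) | set(neg))
    return sorted(candidates, key=gap, reverse=True)[:n]


selected = top_kmers(["AAAA", "AAAT"], ["CCCC", "CCCG"], k=2, n=2)
```

Sweeping `n` and re-evaluating the classifier each time would be one way to look for the saturation point mentioned above.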
PWM-based Analysis of Accuracy (TF_1 data set)
Fig: J48, minW 6 - maxW 15, no. of sites 10
Fig: J48, minW 6 - maxW 15, no. of motifs 5
Summary
* accuracy increases with more motifs when the no. of sites is fixed
* accuracy increases with more sites when the no. of motifs is fixed
* what happens when we increase both?
PWM-based Analysis
Fig: Accuracy varies with the no. of motifs and the no. of sites
* 1st bar: no. of sites
* 2nd bar: no. of motifs
* 3rd bar: accuracy
* accuracy decreases when both the no. of motifs and the no. of sites are increased
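For readers unfamiliar with PWMs, the following Python sketch shows how a position weight matrix can score a sequence. It is not MEME or MAST and does not reproduce their scoring: the matrix values are made up, the uniform background frequency is an assumption, and `window_score` and `best_match` are hypothetical names.

```python
import math

# Hypothetical 3-column PWM: probability of each base at each motif position.
PWM = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1],
    "G": [0.1, 0.1, 0.7],
    "T": [0.1, 0.1, 0.1],
}
BACKGROUND = 0.25  # assumed uniform base frequency


def window_score(window, pwm=PWM):
    """Log-odds score of one window whose length equals the motif width."""
    return sum(math.log2(pwm[base][i] / BACKGROUND)
               for i, base in enumerate(window))


def best_match(sequence, pwm=PWM):
    """Return (score, start) of the best-scoring window in the sequence.

    The sequence must be at least as long as the motif.
    """
    width = len(next(iter(pwm.values())))
    scored = [(window_score(sequence[i:i + width], pwm), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)


score, start = best_match("TTACGTT")
```

A per-motif best score like this could serve as one attribute per motif, so more motifs means more attributes for the classifier.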
Extra Work for TF_20
Fig: Flow diagram of building the new model for TF-20 (box labels: K-mer + PWM; Sequences identified by both models; Sequences labeled differently; Biased 2-mer Model; Newly identified Sequences; The New Model for TF-20)
Summary
* we did some extra work for TF_20
AUC Based on the Feedback (bonus model)
Fig: AUC of the 10 data sets based on the last submission
* accuracy improved over the first submission
* the PWM model did not give good results
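As a reminder of what the AUC figures measure, here is a short Python sketch of the pairwise definition: the probability that a randomly chosen positive sequence receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are toy values, not results from the submissions.

```python
def auc(labels, scores):
    """Area under the ROC curve from the pairwise definition (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

This quadratic version is fine for small data sets; sorting by score gives the same value faster on large ones.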
Participation
Name | Background Study | Working with Tools | Working with Models | Parameter Tuning | Automation
Badri Sampath | DNA, RNA, protein, motif | AlignAce, MEME, MAST | PWM | K-mer | Arff writer, MAST output writer
Iffat Sharmin Chowdhury | Protein, Motif, Transcription | Weka, AlignAce, ScanAce | K-mer | PWM | Script for FASTA, Weka
Prosunjit Biswas | DNA, Transcription | MEME, MAST | K-mer RE, K-mer | PWM | Script for new model
Tahmina Ahmed | MEME, MAST | MEME, MAST | PWM, MEME, MAST | K-mer, PWM | Script for Weka