# SVM DBETH



1. **Why bacterial exotoxin identification?**
   - Exotoxins are a major cause of disease, producing the symptoms and lesions seen during infection, so studying their mechanisms is important for fighting them.
   - These toxins are species-specific, so species-specific information is needed.
   - Exotoxins in particular, though completely neutralized in vivo, are only partially inhibited in vitro, implying that they are also regulated by environmental signals; studying the properties through which they interact with the environment therefore becomes important.
   - Most bacteria become resistant to antibiotics through mutation or genetic recombination, which requires the identification of new sequences.
   - Further, inactivated exotoxins (toxoids) that still retain their antigenic properties can be used against certain diseases.
2. **Support Vector Machine?**
   - Introduced by Vapnik in 1992.
   - A set of related supervised learning methods that analyze and recognize patterns.
   - Used for classification and regression analysis.
   - A non-probabilistic binary linear classifier.
   - Based on statistical learning and optimization theories.
   - Can handle multiple, continuous as well as categorical, data.
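As a minimal illustration of the classifier described above, the sketch below trains a linear SVM on toy 2-D data using scikit-learn's `SVC`, which wraps LIBSVM (the library used later in these slides). The data and parameters are hypothetical, not from the study.

```python
import numpy as np
from sklearn.svm import SVC

# Toy instance-label pairs: two clusters, labels in {1, -1} as in the slides
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # non-probabilistic binary linear classifier
clf.fit(X, y)

print(clf.predict([[0.0, 0.1], [1.0, 0.9]]))  # -> [-1  1]
```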
3. **Principle**
   - Examples are represented as points in space.
   - They are mapped so that examples of separate categories are divided by a gap as wide as possible.
   - The SVM constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space.
   - The hyperplane is placed at maximum distance from the nearest data points of either class.
4. **Working**
   - Given a training set of instance-label pairs (xi, yi), i = 1, …, n, where xi ∈ Rⁿ and yi ∈ {1, −1}.
   - Maximize the margin from the nearest data points of either class: points on the margin satisfy yi(wᵀxi + b) = 1, giving a margin m = 1/‖w‖ on either side of the separating hyperplane wᵀx + b = 0.
   - The original problem in a finite-dimensional space may not be linearly separable, so the data are mapped to a higher-dimensional space.
   - A kernel function is introduced to make computations in the higher-dimensional space easier.
5. **Optimization problem**
   - The SVM requires the solution of the following optimization problem:
     - min over w, b, ξ of (1/2)wᵀw + C Σi ξi,
     - subject to yi(wᵀφ(xi) + b) ≥ 1 − ξi, ξi ≥ 0, where
       - φ is the function mapping from input space to feature space,
       - C > 0 is the penalty parameter of the error term,
       - ξi are the error (slack) terms introduced.
   - The dual of this optimization problem, found using Lagrange's theorem, depends only on inner products between the support vectors and the new vector x whose class is to be determined.
   - The kernel function, given by K(x, z) = φ(x) · φ(z), lets the SVM learn in the high-dimensional feature space without having to explicitly calculate φ(x).
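The kernel identity K(x, z) = φ(x) · φ(z) can be checked numerically for a kernel whose feature map is known in closed form. The sketch below does this for the degree-2 homogeneous polynomial kernel on R², with φ(x) = (x1², √2·x1·x2, x2²); this specific kernel and the data points are illustrative choices, not taken from the slides.

```python
import math

def phi(x):
    # Explicit feature map for K(x, z) = (x . z)^2 on R^2
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def K(x, z):
    # Kernel evaluated directly in input space -- no mapping needed
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = K(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))

assert abs(lhs - rhs) < 1e-9  # same value, without computing phi explicitly
print(lhs)  # -> 16.0
```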
6. **Kernel function**
   - A valid kernel function must satisfy Mercer's theorem, which requires the corresponding kernel matrix to be symmetric positive semi-definite (zᵀKz ≥ 0).
   - Commonly used kernel functions:
     - linear: K(xi, xj) = xiᵀxj
     - polynomial: K(xi, xj) = (γ xiᵀxj + r)ᵈ, γ > 0
     - radial basis function (RBF): K(xi, xj) = exp(−γ‖xi − xj‖²), γ > 0
     - sigmoid: K(xi, xj) = tanh(γ xiᵀxj + r)
   - The effectiveness of the SVM depends on the selection of the kernel, the kernel parameters, and the soft-margin parameter C.
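Mercer's condition can be verified numerically for a concrete kernel matrix. The sketch below builds an RBF kernel matrix for a few toy points and confirms it is symmetric positive semi-definite; the data and the value of γ are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(X)

assert np.allclose(K, K.T)                # symmetric
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-10)          # positive semi-definite (z^T K z >= 0)
print("Mercer condition holds for this matrix")
```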
7. **Data collection**
   - To model an SVM that classifies human pathogenic bacterial toxins from non-toxins, two major datasets were compiled: one of bacterial toxins and one of non-toxins.
   - 294 bacterial toxin sequences were taken from the Bacterial Toxin Database at http://www.hpppi.iicb.res.in/btox, which contains representative protein sequences, in FASTA format, from 24 different genera of human pathogenic bacteria.
   - This database was created by evaluating and processing over 4750 toxin sequences from 24 different genera, retrieved from NCBI (www.ncbi.nlm.nih.gov), to remove redundancies and obtain the representatives.
8. **Data collection (contd.)**
   - Next, 2940 non-toxin sequences were manually assembled from NCBI by selecting protein sequences significant to metabolic and other processes, then removing sequences with more than 90% sequence identity using CD-HIT.
   - Of the 294 toxin (positive) and 2940 non-toxin (negative) sequences, 44 toxins and 440 non-toxins were set apart for testing, leaving the remaining 250 toxins and 2500 non-toxins as training feature vectors.
9. **Feature extraction**
   - Twelve physicochemical properties were employed to describe each protein, including hydrophobicity, contact features, absolute entropy, hydration potential, isoelectric point, net charge, normalised flexibility parameters, relative mutability, side-chain orientational preference, occurrence frequency, pKa (RCOOH), and polarity.
   - The ith feature in the feature vector of the jth protein sequence, for i = 1, 2, …, 12, is given by Fj(i) = Σk (prpk(i) × Nk) / N, where
     - prpk(i) is the ith property of the kth amino acid, ∀ k = 1, 2, …, 20,
     - Nk is the number of residues of the kth amino acid in the sequence,
     - N is the length of the sequence.
   - Dipeptide and tripeptide compositions were also used; to reduce the dimensionality of the feature space, amino acids were grouped according to their properties into 11 groups, including FWY, R, K, DE, H, M, QN, ST, C, and AGILVP.
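The composition-weighted feature formula Fj(i) = Σk (prpk(i) × Nk) / N amounts to a residue-count-weighted average of one property scale over the sequence. A minimal sketch, using a hypothetical property table that stands in for the twelve scales used in the study:

```python
from collections import Counter

# Hypothetical property scale for a few residues (illustrative values only,
# not any of the actual twelve physicochemical scales)
prp = {"A": 1.8, "G": -0.4, "L": 3.8, "K": -3.9}

def feature_value(seq, prp):
    counts = Counter(seq)                  # N_k: count of each amino acid
    n = len(seq)                           # N: sequence length
    return sum(prp.get(aa, 0.0) * c for aa, c in counts.items()) / n

# (2*1.8 - 0.4 + 3.8 - 3.9) / 5 = 0.62
print(round(feature_value("AAGLK", prp), 3))  # -> 0.62
```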
10. **LIBSVM tools**
    - `svm-train`: prepares models (classifiers) trained from training sets.
    - `svm-predict`: predicts the class of test or experimental samples.
    - Steps followed before applying the svm-train module:
      - `checkdata.py` (from the tools folder of the package) to check that the data instances are in an acceptable format.
      - `subset.py` (tools folder) to split the data instances into 80% training and 20% testing subsets.
      - Scaling of the data using `svm-scale`.
      - `grid.py` (tools folder) to select optimal values for the kernel parameter γ and the penalty parameter C.
    - The values of γ and C were incremented stepwise (step 1) through powers of 2 from −11 to +3 for γ and powers of 2 from −9 to +5 for C, using grid.py, which used 5-fold cross-validation accuracy to select the optimal parameter set.
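The stepwise powers-of-2 search over γ and C scored by 5-fold cross-validation can be sketched as below. This uses scikit-learn's `SVC` (a LIBSVM wrapper) and `cross_val_score` in place of `grid.py`, and hypothetical toy data in place of the real feature vectors.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical toy dataset with a simple underlying rule
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

best_g, best_C, best_acc = None, None, -1.0
for g in [2.0 ** p for p in range(-11, 4)]:        # gamma: 2^-11 .. 2^3
    for C in [2.0 ** p for p in range(-9, 6)]:     # C:     2^-9  .. 2^5
        acc = cross_val_score(SVC(kernel="rbf", gamma=g, C=C),
                              X, y, cv=5).mean()   # 5-fold CV accuracy
        if acc > best_acc:
            best_g, best_C, best_acc = g, C, acc

print(f"best gamma={best_g:g}, C={best_C:g}, cv-accuracy={best_acc:.3f}")
```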
11. **Feature selection and performance evaluation**
    - LIBSVM also provides a tool, `fselect.py`, to remove possibly redundant features from the original feature set.
    - `fselect.py` ranks the features by assigning each an F-score value; the higher the value, the more significant the feature is in predicting the classes.
    - Performance evaluation:
      - Accuracy = (TP + TN)/(TP + TN + FP + FN)
      - Balanced accuracy, BAC = (Specificity + Sensitivity)/2, where
        - Specificity = TN/(TN + FP)
        - Sensitivity = TP/(TP + FN)
      - AUC: area under the curve of sensitivity against (1 − specificity)
      - Matthews correlation coefficient [1]: MCC = (TP×TN − FP×FN) / √((TN + FN)(TN + FP)(TP + FP)(TP + FN))
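All of these metrics can be computed directly from confusion-matrix counts. A short sketch, using illustrative counts rather than the study's results:

```python
import math

# Hypothetical confusion-matrix counts (not the study's results)
TP, TN, FP, FN = 40, 420, 20, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)        # recall on the positive (toxin) class
specificity = TN / (TN + FP)        # recall on the negative (non-toxin) class
bac         = (sensitivity + specificity) / 2
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TN + FN) * (TN + FP) * (TP + FP) * (TP + FN))

print(f"accuracy={accuracy:.3f} BAC={bac:.3f} MCC={mcc:.3f}")
# -> accuracy=0.939 BAC=0.877 MCC=0.697
```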
12. **Results**
    - 92.27% average accuracy and 0.998 area under the curve (AUC) were obtained when all 298 features were utilized, whereas
    - 91.16% accuracy and 0.94 AUC were achieved with an optimized set of 114 features (supplementary file 2).
    - Much higher accuracies were achieved (98.13% and 97.92% for 298 and 114 features, respectively) when a completely separate test set of 39 toxins and 390 non-toxins (1:10 ratio) was used.

    **Conclusion**
    - The top features can be studied to identify the important functionalities of the toxic proteins.
    - The method is effective in identifying bacterial toxins while not being computationally intensive.
13. **Thank You**