Transductive Support Vector Classification for RNA Related Biological Abstracts Blake Adams Graduate Student Department of Computer Science Advisor:  Dr. Muhammad A. Rahman
Overview Statistical Learning Theory Support Vector Machines Linear Separability Project Motivation/Concept Expectations Program Design / Algorithm Implementation Results Demonstration Acknowledgements Questions & Answers
Statistical Learning Theory   Introduced by  Vladimir Vapnik  and  Alexey Chervonenkis 4 major areas Theory of consistency of learning processes  What are the necessary conditions for consistency of a learning process?  Nonasymptotic theory of the rate of convergence of learning processes  How fast is the rate of convergence of the learning process?  Theory of controlling the generalization ability of learning processes  How can one control the rate of convergence (generalization) of the learning process?  Theory of constructing learning machines  How can one construct algorithms that can control the generalization ability?* *This concept introduced the support vector machine.
Support Vector Machines The Support Vector Machine is a classification technique that has received heavy attention due to its classification precision. It has been especially successful in text categorization, and it is also showing good results in image recognition tasks such as face and fingerprint identification.
Precision Through SVM Support Vector Machines apply supervised statistical learning theory. The technique works with high-dimensional data and avoids the pitfalls of local minima. It represents decision boundaries using a subset of the training examples known as support vectors.
Structural Risk Minimization Support vector machines are based on the  Structural Risk Minimization principle .  The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error.  The true error of h is the probability that h will make an error on an unseen and randomly selected test example.
How does SVM work? It uses training data to create a set of plotted points that can be mapped out and used to predict the status of future information. It finds a separating hyperplane H that divides positive and negative examples with the maximum margin.
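The separation idea above can be sketched in Java (an illustrative snippet, not the project's code): once training has produced a weight vector w and bias b, classifying a new example reduces to checking which side of the hyperplane w·x + b = 0 it falls on.

```java
// Minimal sketch of the linear SVM decision rule. The weight vector and bias
// here are hypothetical values standing in for what training would produce.
public class LinearDecision {
    // Returns +1 if x lies on the positive side of the hyperplane, -1 otherwise.
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score >= 0 ? +1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -1.0};   // hypothetical learned weight vector
        double b = 0.0;             // hypothetical bias
        System.out.println(classify(w, b, new double[]{2.0, 1.0})); // prints 1
        System.out.println(classify(w, b, new double[]{1.0, 2.0})); // prints -1
    }
}
```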
Linearly Separable Data Linearly Separable Datasets    and  ● Hyper-plane of separation Decision boundaries
Consequences of hyper-plane selection Classifiers that produce decision boundaries with small margins are more susceptible to model overfitting, so the margin of the decision boundary is maximized for best success.
Research Motivation Keyword searches have become the norm for finding information in large bodies of documents, but such searches often prove to be highly imprecise. Example: PubMed is a service of the National Library of Medicine that includes over 15 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. PubMed includes links to full-text articles and other related resources, and the site adds new citations on a daily basis. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed&term=mRNA A search on the expressions “RNA” and “Ribosomal Nucleic Acid” at PubMed earlier in the semester (geared towards finding articles SPECIFICALLY about RNA research) yielded a success rate of 38% based on a review of the first 50 abstracts. What can be done to improve the precision of searches in such large bodies of documents?
Expectations It is reasonable to presume that Support Vector Machines can yield statistically significant improvements over this success rate at minimal cost to the user. If a user could request a body of documents on a keyword from such a database and then read a small subset of abstracts, identifying positive and negative examples, a Support Vector Machine could be used to key in on the articles the user is interested in reading.
Learning Technique One of the shortcomings of “traditional” SVM is its reliance on “inductive learning” (the process of reaching a general conclusion from specific examples). While this method is highly effective when applied properly, it often requires MANY examples (at least hundreds, preferably thousands) in order to return good results.
Learning Technique Transductive Learning Implemented by Thorsten Joachims, author of SVM-Light, who is well known for his work with Support Vector Machines and text categorization and highly regarded as one of the foremost authorities on the subject. “Transductive learning” takes into account a specific set of data and attempts to minimize error for that specific set based on a minimal number of training examples.
Implementing Support Vector Machines Fortunately, Joachims’ SVM-Light already successfully implements transductive learning, so in this project it was not my task to reinvent the wheel of Support Vector Machines, but to develop a set of software tools to convert sets of articles into training and testing data that can be read and learned by SVM-Light. http://svmlight.joachims.org/
Using SVM Light Requires specifically formatted input. Things needed from the SVM-Light user: feature set, scoring method, training data file, testing data file. Things SVM-Light generates: model file, predictions file.
Feature Set Features can be defined in many ways. Single words: ANY word that appears in more than one document relating to a particular subject (bag-of-words approach). Terms: topic specific, e.g. ribosomal nucleic acid, translation, interference, genetic. Combination: any method including both concepts, such as a weighting scheme that gives more weight to ‘words’ that also classify as ‘terms’.
Scoring Method Every ‘feature’ needs a corresponding value that represents that particular feature’s impact in the given example. A popular method of feature scoring (and the one implemented in this project) is Term Frequency × Inverse Document Frequency: TF × log(N/DF), where TF = total number of times the term occurs in the document, N = total number of documents in the corpus, and DF = total number of documents in the corpus that contain the term.
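The scoring formula above translates directly into Java. This sketch uses the natural logarithm, since the slide's "Log" does not specify a base; the class and method names are illustrative, not from the project's code.

```java
// Sketch of the TF-IDF score described above (natural log assumed).
public class TfIdf {
    // tf: occurrences of the term in this document
    // n:  total documents in the corpus
    // df: documents in the corpus that contain the term
    static double score(int tf, int n, int df) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A term appearing 3 times in one document, present in 10 of 400
        // corpus documents, scores 3 * ln(400/10), roughly 11.07:
        System.out.println(score(3, 400, 10));
    }
}
```

Note that a term appearing in every document scores 0 regardless of TF, which is exactly the behavior wanted from the inverse-document-frequency factor.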
Generating Training Sets for Transductive Learning File format: <expected outcome> <feature>:<value> <feature>:<value> <feature>:<value> … Feature/value pairs must be listed in increasing order of feature id. Labeled examples are marked +1 or -1; unlabeled examples, which transductive learning will classify, are marked 0. Thus a completed training file might look something like:
+1 1:2.8473 2:3.8324 9:5.423 19:1.003
+1 4:1.11 9:5.423 11:0.012
+1 1:2.84734 15:10.9213 21:9.343 44:7.7231
-1 1:8.8473 2:3.8324 9:5.423 19:1.003
-1 4:5.13 9:5.423 11:0.012 28:19.6548
-1 1:2.84734 15:10.9213 21:9.343 44:7.7231
0 1:0.8473 9:3.84 19:5.423 29:1.00
0 4:1.11 9:5.423 11:0.012
0 2:2.84734 15:10.9213 21:9.343 22:7.7231
0 1:4.5473 2:6.7324 9:5.42
0 4:1.11 9:5.423 11:0.012
0 1:2.84734 15:10.9213 21:9.343 44:7.7231
0 19:5.1864 24:7.215 28:12.123
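Example lines in this format can be generated mechanically. The Java sketch below (illustrative names, not the project's actual code) builds one line, using a TreeMap so that the feature/value pairs come out in increasing feature-id order as the format requires.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: emit one SVM-Light example line from a label and a feature map.
public class SvmLightLine {
    static String formatExample(int label, Map<Integer, Double> features) {
        // "+1" for positives; "-1" and "0" print naturally via toString.
        StringBuilder sb = new StringBuilder(label > 0 ? "+1" : Integer.toString(label));
        // TreeMap iterates keys in ascending order, satisfying the id ordering rule.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> f = new TreeMap<>();
        f.put(9, 5.423);
        f.put(1, 2.8473);
        f.put(19, 1.003);
        System.out.println(formatExample(1, f)); // prints: +1 1:2.8473 9:5.423 19:1.003
    }
}
```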
Generating Test Sets for Transductive Learning Since transductive learning works with a pre-established set, the test file has exactly the same format. The only difference is that the “expected outcomes” originally set to “0” are now set to their actual values, so that the system can test how well it predicted the outcomes.
Key Experiments How well can SVM Light transductively: Distinguish between abstracts that ARE and ARE NOT about RNA. Distinguish between abstracts that ARE and ARE NOT about each of the following types of RNA: messenger RNA, ribosomal RNA, transfer RNA, small nuclear RNA. Glean abstracts about a specific type of RNA from a body of abstracts that are all RNA-centric.
Implementation In order to implement this project, two key elements were collected. First, a corpus of abstracts was collected and categorized manually from the PubMed database. These articles fell into the following categories:
40 abstracts specific to RNA research AND containing the term RNA.
40 abstracts not specific to RNA research AND containing the term RNA.
40 abstracts specific to messenger RNA research AND containing the term mRNA.
40 abstracts not specific to messenger RNA research AND containing the term mRNA.
40 abstracts specific to transfer RNA research AND containing the term tRNA.
40 abstracts not specific to transfer RNA research AND containing the term tRNA.
40 abstracts specific to ribosomal RNA research AND containing the term rRNA.
40 abstracts not specific to ribosomal RNA research AND containing the term rRNA.
40 abstracts specific to small nuclear RNA research AND containing the term snRNA.
40 abstracts not specific to small nuclear RNA research AND containing the term snRNA.
This resulted in a GRAND TOTAL corpus of 400 abstracts.
Implementation Term Dictionary A dictionary of terms specific to each topic was developed based on those terms’ relevance to the particular set of abstracts (e.g., messenger RNA abstracts would correspond to a dictionary that contained the term ‘messenger’, but small nuclear RNA abstracts would not).
Implementation Pre-processing Prior to conversion from abstract to feature/value sets, a limited amount of pre-processing was applied: All special characters were removed from the abstracts, including parentheses, commas, periods, apostrophes, colons, semicolons, dashes, etc. Articles were sorted into an arrangement of all positive abstracts followed by all negative abstracts. This was done to facilitate identification of positive and negative abstracts by the software tools during generation of training and testing sets. It should be mentioned that the order of feature sets has no bearing on learning.
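The special-character stripping described above might look like the following in Java. The exact regex is an assumption; the slides only list which characters were removed.

```java
// Sketch of the pre-processing step: keep letters, digits, and spaces;
// replace everything else with a space and collapse runs of whitespace.
public class Preprocess {
    static String clean(String abstractText) {
        return abstractText.replaceAll("[^A-Za-z0-9 ]", " ")
                           .replaceAll("\\s+", " ")
                           .trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("RNA-centric (e.g., mRNA; tRNA)."));
        // prints: RNA centric e g mRNA tRNA
    }
}
```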
Implementation Software Development package: Java. Two major data structures: A TreeMap containing ‘Term’ objects that represent every term feature in every document in the corpus. A Term object holds the actual term, a unique term id, a document id, a term frequency score, and a document frequency score; objects are keyed by an id that combines the term id with the document id. A TreeMap containing ‘TermDF’ objects that represent individual features as they enter the program; TermDF objects contain the actual term, the document frequency score, and, as a placeholder, the last document id that contained the term. Once each map is constructed, the TermDF map is cross-referenced with the Term map to set the DF score to the appropriate value in each Term.
Algorithm/Implementation The set of tools for this project was developed in Java. Programmatic implementation included the following steps:
Read a set of abstracts and a “term dictionary” from file.
Tokenize the abstracts and test every tokenized word against the term dictionary. If a token exists in the term dictionary, it must be searched for in the “current corpus set of terms” to see if it has already been added.
If the feature is not found in the current set, it is added, assigned an id, and its TF and DF are initialized to 1.
If the feature is found in the current set, the ‘current document’ count is checked to determine whether the term is an initial occurrence in a new document or a subsequent occurrence in the current document.
If it is an initial occurrence in a new document, DF is incremented and TF for the term in the current article is set to one.
If it is a subsequent occurrence, DF remains unchanged and TF for the current article is incremented by one.
Once EVERY term for EVERY abstract is accounted for, the TF-IDF for each term can be calculated.
Finally, the data structure is reorganized to arrange the feature ids in increasing order within each feature set, as SVM-Light requires.
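The TF/DF bookkeeping in the steps above can be sketched with plain TreeMaps. This is a simplification of the project's Term/TermDF objects, and it omits the term-dictionary filtering that the real tool performs first.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of per-document term frequency (TF) and corpus-wide
// document frequency (DF) counting, as described in the algorithm steps.
public class TermCounter {
    // term -> (docId -> term frequency in that document)
    final Map<String, Map<Integer, Integer>> tf = new TreeMap<>();
    // term -> number of documents containing the term
    final Map<String, Integer> df = new TreeMap<>();

    void addDocument(int docId, String[] tokens) {
        for (String token : tokens) {
            Map<Integer, Integer> perDoc =
                    tf.computeIfAbsent(token, t -> new TreeMap<>());
            if (!perDoc.containsKey(docId)) {
                // Initial occurrence in a new document: increment DF, set TF to 1.
                df.merge(token, 1, Integer::sum);
                perDoc.put(docId, 1);
            } else {
                // Subsequent occurrence in the same document: increment TF only.
                perDoc.merge(docId, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        TermCounter tc = new TermCounter();
        tc.addDocument(0, new String[]{"rna", "ribosomal", "rna"});
        tc.addDocument(1, new String[]{"rna", "translation"});
        System.out.println(tc.tf.get("rna").get(0)); // prints 2
        System.out.println(tc.df.get("rna"));        // prints 2
    }
}
```

Once every document has been added, the TF-IDF for any (term, document) pair follows directly from these two maps.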
Mapping a Feature (flowchart)
Tokenized word → Is it in the keyword list?
No → discard.
Yes → Is it in the Term map?
No → assign an id and docId, set termFreq to 1; then, if it is not in the TermDocFreq map, assign a featureId, set docFreq to 1, and record the lastDocId.
Yes → Is the word already counted in the current document?
Yes → increment termFreq.
No → increment docFreq and set termFreq to 1 for the new document.
Implementation So THIS:
All organisms sense and respond to conditions that stress their homeostatic mechanisms Here we review current studies showing that the nucleolus long regarded as a mere ribosome producing factory plays a key role in monitoring and responding to cellular stress After exposure to extra or intracellular stress cells rapidly down regulate the synthesis of ribosomal RNA …
Becomes THIS:
1 2:5.619650483261998 5:0.17733401528291545 6:1.869439496743316 12:0.7184649885442351 13:3.0441924140622225 25:2.772588722239781 50:3.283414346005772 51:6.566828692011544
Implementation All experiments were conducted in sets of 80 abstracts, with 5 positive and 5 negative training examples. Additionally, 35 positive and 35 negative examples were included but left unlabeled during the training phase, allowing the program to decide where each feature set should fall in order to minimize error.
Results and Conclusions The outcome of every experiment far outpaced the highest expectations of the researchers. Following examination of misclassified abstracts from the first set of results, some abstracts were found to have been misclassified by the manual classifiers; correcting these errors led to even better results. It is the belief of the researchers that incorporating such a system into a database like PubMed would allow users to query on a keyword and then use the support vector machine to narrow the results to the specific information they are looking for, allowing users to make the most of their research time.
Demonstration
Future Work Future work in this project will address Precision in feature selection Web Interface to tie application to real results
Acknowledgements Dr. Muhammad A. Rahman – Assistant Professor, University of West Georgia Dr. Goran Nenadic – Assistant Professor, University of Manchester Thorsten Joachims – Assistant Professor, Cornell University The Department of Computer Science – University of West Georgia
Questions ?
