BioWeka

Introduction and overview for the BioWeka project

Published in: Technology

  1. 1. BioWeka: Extending the Weka framework for Bioinformatics Martin Szugat ( [email_address] ) http://www.bioweka.org
  2. 2. What is BioWeka? <ul><li>An extension to the Weka data mining framework for bioinformatics </li></ul><ul><li>A framework for additional data mining tools in bioinformatics </li></ul><ul><li>An open source project for the Weka/Bioinformatics community from the Ludwig-Maximilians-University, Munich ( www.bio.ifi.lmu.de ) </li></ul>
  3. 3. Agenda <ul><li>Introduction: Data mining & Biology </li></ul><ul><li>Motivation: Comparability & Interoperability </li></ul><ul><li>Solution: Extensibility & Standardization </li></ul><ul><li>Foundation: Weka & Co. </li></ul><ul><li>Implementation: Guidelines & Patterns </li></ul><ul><li>Application: BioWeka & Eclat </li></ul><ul><li>Conclusion: Prototypes & Experiments </li></ul>
  4. 4. KD, DM and ML <ul><li>Knowledge Discovery (KD): Process of finding unknown patterns in known data </li></ul><ul><li>Data Mining (DM): Step in the KD process </li></ul><ul><ul><li>Descriptive: Clustering, Association Rules </li></ul></ul><ul><ul><li>Predictive: Classification, Regression </li></ul></ul><ul><li>Machine Learning (ML): Methods for DM </li></ul><ul><ul><li>Unsupervised → Descriptive Mining </li></ul></ul><ul><ul><li>Supervised → Predictive Mining </li></ul></ul>
  5. 5. Knowledge Discovery Process: Data Selection and Preparation → Transformation and Reduction → Data Mining → Evaluation and Visualization
  6. 6. Mining biological data <ul><li>Clustering of gene expression data </li></ul><ul><ul><li>Standard formats (CSV) and algorithms ( k -Means) </li></ul></ul><ul><li>Classification of sequences </li></ul><ul><ul><li>DNA/RNA: coding/non-coding, species </li></ul></ul><ul><ul><li>Proteins: function, structure or localization </li></ul></ul><ul><li>Data mining on strings not well supported </li></ul><ul><ul><li>“extraordinary” data formats, e.g. FASTA </li></ul></ul><ul><ul><li>Feature extraction: string → numbers </li></ul></ul><ul><ul><li>Classification based on alignment scores </li></ul></ul>
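To make the "extraordinary data formats" point concrete, here is a minimal sketch of reading FASTA records into plain strings, which is the step that has to happen before any string → numbers feature extraction. It is a library-independent illustration in Python, not BioWeka's loader; the function name `parse_fasta` and the example records are assumptions made for this sketch.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each '>' header line."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a generated name if the header is empty.
            current = parts[0] if parts else f"unnamed_{len(records)}"
            records[current] = ""
        elif current is not None:
            # Sequence data may be wrapped over several lines.
            records[current] += line
    return records


# Made-up example input, not taken from any real data set.
example = """>seq1 example record
ATGGCC
TTAA
>seq2
GGGCCC
""".splitlines()

records = parse_fasta(example)
lengths = {name: len(seq) for name, seq in records.items()}
```

Once the sequences are plain strings, a feature extractor can turn each one into a numeric vector, for example its length or its symbol counts.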
  7. 7. Knowledge Discovery Experiment: Data Selection and Preparation → Transformation and Reduction → Data Mining → Evaluation and Visualization
  8. 8. Comparability & Interoperability <ul><li>Comparison with regard to performance </li></ul><ul><ul><li>Find best combination of model, parameters and transformation </li></ul></ul><ul><ul><li>New method/data set vs. old method/data set </li></ul></ul><ul><li>Technical problems: </li></ul><ul><ul><li>Different data formats: conversions </li></ul></ul><ul><ul><li>Different software interfaces: mappings, wrappers </li></ul></ul>
  9. 9. Intermediate data format [Diagram labels: Format A, Format B, Format C; Mediator; Tool X, Tool Y, Tool Z]
  10. 10. Unified interfaces [Diagram: Loader (FastaLoader, CSVLoader, …) → Data → Filter 1 … Filter n (SymbolCounter, SymbolAnalyzer, …) → Classifier (Alignment, SVM, …)] → Extendable and Customizable Execution Pipeline
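The loader, filter and classifier roles named on this slide can be sketched as three plain function types chained into one pipeline. This is a language-neutral illustration in Python, not the Weka or BioWeka interfaces; `run_pipeline`, `toy_loader`, `gc_content` and `toy_classifier` are hypothetical names invented for this sketch.

```python
from typing import Callable, Dict, List

Instance = Dict[str, object]            # one record: attribute name -> value
Loader = Callable[[], List[Instance]]
Filter = Callable[[Instance], Instance]
Classifier = Callable[[Instance], str]


def run_pipeline(loader: Loader, filters: List[Filter],
                 classifier: Classifier) -> List[str]:
    """Load the data, apply each filter in order, then classify each instance."""
    instances = loader()
    for apply_filter in filters:
        instances = [apply_filter(inst) for inst in instances]
    return [classifier(inst) for inst in instances]


def toy_loader() -> List[Instance]:
    # Stand-in for a file loader; returns two made-up DNA sequences.
    return [{"sequence": "GGCCGC"}, {"sequence": "ATATTA"}]


def gc_content(inst: Instance) -> Instance:
    # Feature-extraction filter: adds the fraction of G and C symbols.
    seq = str(inst["sequence"])
    gc = sum(1 for base in seq if base in "GC") / len(seq)
    return {**inst, "gc": gc}


def toy_classifier(inst: Instance) -> str:
    # Trivial threshold rule standing in for a trained model.
    return "gc-rich" if float(inst["gc"]) > 0.5 else "at-rich"


predictions = run_pipeline(toy_loader, [gc_content], toy_classifier)
```

Because every stage only depends on the shared instance type, any loader, filter or classifier can be swapped without touching the others, which is the extensibility argument the slide makes.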
  11. 11. Foundation: Weka <ul><li>Fortunately such a solution already exists: Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) </li></ul><ul><li>Open Source Software (GPL) written in Java </li></ul><ul><li>Intermediate data format: ARFF </li></ul><ul><li>Extendable: Classifier , Loader , Saver , Filter , … </li></ul><ul><li>Extensive: > 70 Classifiers , > 40 Filters , … </li></ul><ul><li>Interfaces: API, CLI, GUI </li></ul>
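As an illustration of ARFF as the intermediate format, the sketch below writes a tiny data set in ARFF's basic layout: a `@relation` line, one `@attribute` line per column, and a `@data` section with comma-separated rows. The helper name `to_arff`, the relation name and the values are assumptions for this example; quoting of special characters and sparse data are deliberately left out.

```python
from typing import List, Sequence, Tuple


def to_arff(relation: str,
            attributes: List[Tuple[str, str]],
            rows: List[Sequence[object]]) -> str:
    """Render rows as a minimal ARFF document (no quoting or escaping)."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # kind is e.g. "numeric", "string" or a nominal spec like "{a,b}".
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Toy data; the numbers are invented for the example.
arff_text = to_arff(
    "toy",
    [("gc", "numeric"), ("class", "{barley,blumeria}")],
    [(0.62, "barley"), (0.48, "blumeria")],
)
```

Any tool that can read or write this one format can exchange data with every other tool in the pipeline, so only one converter per foreign format is needed.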
  12. 12. Weka Explorer
  13. 13. Weka Experimenter
  14. 14. Tools & Libraries <ul><li>BioJava: Sequence handling </li></ul><ul><li>JAligner: Smith-Waterman-Algorithm </li></ul><ul><li>FoldRec: Secondary Structure Element Alignment </li></ul><ul><li>BioJava-Ext: Needleman-Wunsch-Algorithm </li></ul><ul><li>BLAST, PSI-BLAST </li></ul><ul><li>AAindex: Amino acid properties </li></ul><ul><li>InterProScan: Sequence Patterns </li></ul>
  15. 15. Implementation Guidelines <ul><li>Software for Users </li></ul><ul><ul><li>Provide an easy-to-use GUI </li></ul></ul><ul><ul><li>Extend existing software well-known to the user </li></ul></ul><ul><ul><li>Make it extensible for additional software </li></ul></ul><ul><li>Software for Developers </li></ul><ul><ul><li>Provide a well-documented API </li></ul></ul><ul><ul><li>Define interfaces for external extensions </li></ul></ul><ul><ul><li>Offer abstract base classes </li></ul></ul><ul><ul><li>Implement at least a simple class for each interface </li></ul></ul>
  16. 16. Implementation Patterns
  17. 17. BioWeka 0.4 <ul><li>Weka extensions: </li></ul><ul><li>Converters : Load & Save foreign file formats </li></ul><ul><li>Filters : Feature extraction & transformation </li></ul><ul><li>Classifiers : Alignment-based ~, ECLAT </li></ul><ul><li>BioWeka specific: </li></ul><ul><li>Normalizers : Normalize feature vectors </li></ul><ul><li>Evaluators : Turn scores into likelihoods </li></ul><ul><li>… </li></ul>
  18. 18. Converters <ul><li>Sequence file formats: </li></ul><ul><ul><li>FASTA, EMBL, SwissProt, GenBank </li></ul></ul><ul><ul><li>Mappers map sequences & annotations into attributes </li></ul></ul><ul><li>XML-based file formats: </li></ul><ul><ul><li>InterProScan, ProML, MAGE-ML </li></ul></ul><ul><ul><li>Based on XSL stylesheets </li></ul></ul><ul><li>Gene expression data formats: </li></ul><ul><ul><li>TAV, MEV, Stanford, Spot, … </li></ul></ul><ul><ul><li>Customizable CSV loader </li></ul></ul>
  19. 19. Sequence filters <ul><li>Feature extraction: </li></ul><ul><ul><li>Sequence properties: e.g. AAindex </li></ul></ul><ul><ul><li>Sequence composition: Codons, amino acids, etc. </li></ul></ul><ul><ul><li>Attribute normalization: counts → frequencies </li></ul></ul><ul><li>Transformation: </li></ul><ul><ul><li>Translation pipelines: e.g. DNA to RNA to AA, reverse complement, stop codon termination </li></ul></ul><ul><ul><li>Frame shifter: e.g. generate open reading frames </li></ul></ul>
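The feature-extraction and transformation steps listed on this slide can be sketched in a few lines: reverse complement, generation of the six reading frames, and codon counts turned into frequencies. This is a library-independent Python illustration, not the BioWeka filters; the function names and the `pseudo_count` parameter are assumptions for this sketch.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

COMPLEMENT = str.maketrans("ACGT", "TGCA")
CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def reading_frames(dna: str) -> List[str]:
    """Three offsets on the forward strand, then three on the reverse strand."""
    strands = (dna, reverse_complement(dna))
    return [strand[offset:] for strand in strands for offset in range(3)]


def codon_frequencies(dna: str, pseudo_count: float = 0.0) -> Dict[str, float]:
    """Count non-overlapping codons from position 0 and normalize to frequencies."""
    counts = Counter(dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Only the 64 unambiguous codons are kept; others (e.g. with N) are ignored.
    smoothed = {codon: counts.get(codon, 0) + pseudo_count for codon in CODONS}
    total = sum(smoothed.values())
    if total == 0:
        return {codon: 0.0 for codon in CODONS}
    return {codon: value / total for codon, value in smoothed.items()}


freqs = codon_frequencies("ATGATGCCC")
smoothed_freqs = codon_frequencies("ATGATGCCC", pseudo_count=1.0)
```

A pseudo count keeps every frequency strictly positive, which matters for short sequences in which most of the 64 codons never occur.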
  20. 20. Universal filters <ul><li>MultipleFilter : Build filter pipelines ( → FilteredClassifier + Trainable filters) </li></ul><ul><li>Normalize : Normalization over a set of instances </li></ul><ul><li>MergeSets : Merge two or more ARFF files </li></ul><ul><li>Save : Export data set in a foreign file format </li></ul><ul><li>SetClass : Set class attribute ( → Experimenter) </li></ul>
  21. 21. Alignment-based Classification <ul><li>Alignment methods (sequence → score): </li></ul><ul><ul><li>1 vs. 1: Local, global, secondary structure element ~ </li></ul></ul><ul><ul><li>1 vs. m : BLAST, PSI-BLAST (WU-BLAST, etc.) </li></ul></ul><ul><li>Score evaluation (score → class probability): </li></ul><ul><ul><li>Linear evaluators: Sum, max, average </li></ul></ul><ul><ul><li>Ranked evaluators: SimpleRankEvaluator </li></ul></ul><ul><ul><li>Meta evaluators: SimpleTransformingScoreEvaluator </li></ul></ul>
  22. 22. Precomputed Alignments <ul><li>Precompute alignment scores → Try out different evaluation schemes </li></ul><ul><li>AlignmentScorer filter: </li></ul><ul><ul><li>n sequences → n x n scoring matrix </li></ul></ul><ul><ul><li>Symmetric alignment: O(n^2/2) </li></ul></ul><ul><li>AlignmentScoreClassifier : based on evaluator </li></ul><ul><li>Other: NN, SVM, … </li></ul>
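The score → class probability step can be illustrated with the three linear evaluators named on the slide (sum, max, average). The sketch below groups a query's alignment scores by the class of the training sequence they came from, combines each group, and rescales the results so they sum to one. The rescaling rule, the assumption of non-negative scores, and the name `evaluate_scores` are choices made for this example and are not taken from BioWeka.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Sequence


def evaluate_scores(scores: Sequence[float],
                    labels: Sequence[str],
                    combine: Callable[[List[float]], float]) -> Dict[str, float]:
    """Combine per-class alignment scores and rescale them to sum to 1."""
    per_class: Dict[str, List[float]] = defaultdict(list)
    for score, label in zip(scores, labels):
        per_class[label].append(score)
    combined = {label: combine(values) for label, values in per_class.items()}
    total = sum(combined.values())
    if total <= 0:
        # No usable evidence: fall back to a uniform distribution.
        return {label: 1.0 / len(combined) for label in combined}
    return {label: value / total for label, value in combined.items()}


# Made-up scores of one query against four labeled training sequences.
scores = [30, 10, 20, 40]
labels = ["barley", "barley", "blumeria", "blumeria"]

by_sum = evaluate_scores(scores, labels, sum)
by_max = evaluate_scores(scores, labels, max)
by_avg = evaluate_scores(scores, labels, mean)
```

Because the evaluator is just a parameter, different schemes can be compared on the same scores without recomputing any alignment.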
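The saving from symmetry can be seen in a short sketch: only the upper triangle of the n x n matrix is computed and each score is mirrored, so the scoring function is called n(n+1)/2 times (including the diagonal) instead of n^2 times. The `toy_score` function below just counts matching positions and stands in for a real alignment algorithm; both names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def score_matrix(seqs: Sequence[str],
                 score: Callable[[str, str], float]
                 ) -> Tuple[List[List[float]], int]:
    """Fill a symmetric n x n matrix, computing each unordered pair once."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    calls = 0
    for i in range(n):
        for j in range(i, n):
            value = score(seqs[i], seqs[j])
            calls += 1
            matrix[i][j] = value
            matrix[j][i] = value  # mirror instead of recomputing
    return matrix, calls


def toy_score(a: str, b: str) -> float:
    # Placeholder for an alignment score: number of identical positions.
    return float(sum(x == y for x, y in zip(a, b)))


sequences = ["ACGT", "ACGA", "TCGT", "AAAA"]
matrix, calls = score_matrix(sequences, toy_score)
```

With the matrix stored once, each evaluation scheme or downstream classifier can reuse the same scores.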
  23. 23. BioWeka distribution <ul><li>Documentation: Readme, Changelogs, API </li></ul><ul><li>Libraries: BioWeka, BioJava, JAligner, … </li></ul><ul><li>Source code: library & tests </li></ul><ul><li>Data: AAindex database, Substitution matrices </li></ul><ul><li>Stylesheets: ProML, MAGE-ML, InterProScan </li></ul><ul><li>Patches: converter.pl (InterProScan) </li></ul>
  24. 24. BioWeka 0.4.1 <ul><li>Batch scripts for Linux and Windows </li></ul><ul><li>Integration of LibSVM via WLSVM (GPL) </li></ul><ul><li>Integration of Weka-CG (GPL) </li></ul><ul><ul><li>Multifactor Dimensionality Reduction (MDR)-Filter </li></ul></ul><ul><li>More than 50 Weka components </li></ul><ul><li>247 Java classes with about 12800 lines of code </li></ul><ul><ul><li>Majority are interfaces and abstract base classes </li></ul></ul><ul><ul><li>→ Extensibility </li></ul></ul>
  25. 25. Application: ECLAT <ul><li>Friedel et al.: Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. </li></ul><ul><li>Eclat: LibSVM & Codon frequencies </li></ul><ul><li>Reimplementation using standard Weka & BioWeka components </li></ul><ul><li>Evaluation on the barley-blumeria set (1315/1902) </li></ul>
  26. 26. Training Eclat [Diagram labels: Sequences; Codon frequencies; Normalization; SVM; Frames; Factors 1; Model 1; Codon frequencies; Normalization; SVM; Factors 2; Model 2]
  27. 27. Evaluating Eclat [Diagram labels: Sequence; Frames; Normalization; SVM; Factors 2; Model 2; Codon frequencies; Normalization; SVM; Factors 1; Model 1; Correct Frame]
  28. 28. EclatClassifier/Filter [Diagram labels: Filtered Classifier; classifiers: SMO, Random Forest, Naive Bayes, JRip, J4.8; Multiple Filter; filters: Translate, Terminator, Symbol Counter, Sum Normalizer, Remove/Copy, Normalize, MinMax Normalizer; parameters: newMin = -1.0, newMax = 1.0, pseudoCount = 1.0, alphabet = DNA, symbolWidth = 3.0]
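The `newMin = -1.0` and `newMax = 1.0` parameters on this slide suggest a min-max rescaling of each attribute over the whole set of instances. A minimal sketch of that rescaling follows; it is a plain Python illustration, not the BioWeka normalizer, and the handling of constant columns (mapped to the midpoint of the target range) is an assumption of this example.

```python
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]],
                      new_min: float = -1.0,
                      new_max: float = 1.0) -> List[List[float]]:
    """Rescale each column linearly so its minimum maps to new_min, maximum to new_max."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            if high == low:
                # Constant column: no spread to rescale, use the midpoint.
                scaled.append((new_min + new_max) / 2)
            else:
                scaled.append(new_min + (value - low) * (new_max - new_min) / (high - low))
        normalized.append(scaled)
    return normalized


# Toy feature vectors (e.g. two codon frequencies per sequence); values invented.
features = [[0.1, 0.5], [0.3, 0.5], [0.2, 0.5]]
scaled_features = min_max_normalize(features)
```

The minima and maxima are learned from the training instances, which is why such a filter has to be trainable: the same factors must later be applied to unseen test data.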
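  29. 29. Classifier Comparison <table><tr><th>Accuracy [%] (SD)</th><th>SVM 1)</th><th>Random Forest</th><th>Naïve Bayes</th><th>JRip</th><th>J4.8</th></tr><tr><td>Eclat</td><td>92.9 (0.6)</td><td>88.9 (1.0)</td><td>87.8 (0.9)</td><td>85.9 (1.8)</td><td>82.2 (1.6)</td></tr><tr><td>BioWeka</td><td>93.1 (0.7)</td><td>88.2 (1.0)</td><td>87.1 (1.0)</td><td>84.8 (1.1)</td><td>81.5 (1.3)</td></tr><tr><td>Deviation</td><td>0.2</td><td>0.7</td><td>0.7</td><td>1.1</td><td>0.7</td></tr></table> 1) LibSVM vs. SMO. 10 x Sampling with Stratification (2:1) → 10 x 3-fold CV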
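The evaluation protocol mentioned here, repeated stratified 2:1 sampling with accuracy reported as mean and standard deviation, can be sketched as follows. This is an illustrative Python version with hypothetical function names and toy labels; it does not reproduce the Eclat or Weka sampling code, and the accuracies passed to `summarize` are invented.

```python
import random
from collections import defaultdict
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[str],
                     train_fraction: float = 2 / 3,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split indices so each class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train += indices[:cut]
        test += indices[cut:]
    return sorted(train), sorted(test)


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs."""
    return mean(accuracies), stdev(accuracies)


# Toy labels, not the real barley-blumeria data set.
toy_labels = ["barley"] * 6 + ["blumeria"] * 9
train_idx, test_idx = stratified_split(toy_labels, seed=1)
avg, sd = summarize([0.90, 0.92, 0.94])
```

Repeating the split with different seeds and summarizing the per-run accuracies gives figures of the "accuracy (SD)" form used in the table.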
  30. 30. EclatFrameFinder/Classifier EclatFrame Classifier EclatFrame Finder EclatFrame Shifter Frame Shifter Eclat Classifier
  31. 31. Frame Prediction <ul><li>10 x Sampling (2:1) with Stratification </li></ul><ul><li>Half of the sequences in the correct frame, half in an incorrect frame (randomly chosen) </li></ul><ul><li>All frames incorrect or multiple frames correct </li></ul><ul><li>→ Eclat: Hyperplane margin, BioWeka: Random </li></ul><table><tr><th>Implementation</th><th>Accuracy</th><th>Standard deviation</th></tr><tr><td>Eclat</td><td>97.7</td><td>0.4</td></tr><tr><td>BioWeka</td><td>95.1</td><td>0.6</td></tr></table>
  32. 32. Eclat Eclat EclatFrame Finder Eclat Classifier
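  33. 33. Species discrimination <ul><li>Species discrimination & frame prediction </li></ul><ul><li>10 x 10-fold Cross-Validation </li></ul><table><tr><th>Implementation</th><th>Accuracy</th><th>Standard deviation</th></tr><tr><td>Eclat</td><td>93.1</td><td>na</td></tr><tr><td>BioWeka/SMO</td><td>91.3</td><td>1.0</td></tr><tr><td>BioWeka/LibSVM</td><td>92.0</td><td>1.4 (10-CV)</td></tr></table>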
  34. 34. BioWeka’s Eclat <ul><li>Complete solution: 647 lines of code </li></ul><ul><li>Integrated in the Weka workbench </li></ul><ul><li>Reusability: EclatFilter , EclatFrameFinder , … </li></ul><ul><li>Configurability: Configure Filter / Classifier </li></ul><ul><li>Extensibility: Replace Filter / Classifier </li></ul><ul><li>Runtime performance (10-CV) </li></ul><ul><ul><li>BioWeka 23 min. vs. Eclat 5 min. </li></ul></ul>
  35. 35. Prototyping & Experiments <ul><li>Evaluating standard procedures without writing a single line of code using BioWeka </li></ul><ul><li>BioWeka is good for rapid application development → Prototypes </li></ul><ul><li>Experimenting with different data sets, filters, classifiers, etc. is easy within Weka </li></ul><ul><li>Runtime performance is the weak point of (Bio)Weka </li></ul>
  36. 36. Web & Download Statistics <ul><li>Project site: sourceforge.net/projects/bioweka </li></ul><ul><ul><li>SourceForge → Open Source Project (GNU GPL) </li></ul></ul><ul><ul><li>Forums, mailing lists, bug tracker, CVS, … </li></ul></ul><ul><ul><li>Downloads of BioWeka 0.4: > 110 (12/07/2005) </li></ul></ul><ul><li>Web site: www.bioweka.org </li></ul><ul><ul><li>MediaWiki → Open Content (GNU FDL) </li></ul></ul><ul><ul><li>Project description, documentation, news, … </li></ul></ul><ul><ul><li>Main page hits: > 900 (12/07/2005) </li></ul></ul>
  37. 37. Acknowledgements <ul><li>LMU members: </li></ul><ul><li>Ralf Zimmer </li></ul><ul><li>Jan Gewehr </li></ul><ul><li>Caroline Friedel </li></ul><ul><li>Other: </li></ul><ul><li>Weka contributors </li></ul><ul><li>Mark Schreiber (BioJava) </li></ul><ul><li>Andreas Dräger ( NeedlemanWunsch ) </li></ul><ul><li>Ahmed Moustafa (JAligner) </li></ul><ul><li>Joe White (MAGE-ML) </li></ul><ul><li>Many more … </li></ul>
  38. 38. Thanks for your attention! <ul><li>Questions? </li></ul><ul><li>http://www.bioweka.org </li></ul>
