30th Annual International IEEE EMBS Conference
Vancouver, British Columbia, Canada, August 20-24, 2008

Intelligible machine learning with malibu

Robert E. Langlois and Hui Lu

This work is partially supported by the NIH.
R. E. Langlois is with the Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA ezra@uic.edu
H. Lu is with the Faculty of the Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA huilu@uic.edu
978-1-4244-1815-2/08/$25.00 ©2008 IEEE.

Abstract: malibu is an open-source machine learning workbench developed in C/C++ for high-performance real-world applications, namely bioinformatics and medical informatics. It leverages third-party machine learning implementations for more robust, bug-free software. This workbench handles several well-studied supervised machine learning problems including classification, regression, importance-weighted classification and multiple-instance learning. The malibu interface was designed to create reproducible experiments, ideally run in a remote and/or command-line environment. The software can be found at: http://proteomics.bioengr.uic.edu/malibu/index.html.

I. INTRODUCTION

Recently, open-source software has matured into a solution able to handle complex real-world applications. One such application entails developing implementations for the substantial number of machine learning algorithms available for numerous problem domains, e.g. bioinformatics problems including protein annotation and microarray analysis, among others. Indeed, many such algorithms remain unused due to unavailable or poor implementations. Moreover, many researchers in the scientific community recognize the need for peer-reviewed open-source machine learning software[1]. The advantages of open-source machine learning tools include:

1) Reproducibility and transparency
2) Uncovering problems in current algorithms
3) Building on existing resources
4) Access to machine learning tools

One of the fundamental benefits of open-source machine learning software is that it facilitates the reproducibility of experimental results. While such experiments are relatively easy to reproduce compared to those in other fields, in practice they often are not. At the same time, the pressure to publish remains constant, leading to unintentional (or intentional) “cheating”. Having access to an implementation of a machine learning algorithm (and ideally the associated dataset, e.g. from the UCI repository[2]) could potentially eliminate such cheating or, at a minimum, its ill effects. Likewise, tendering the source of an algorithm allows the community at large to more quickly discover problems in current algorithms on both the conceptual and concrete levels. Similarly, it enables others to build increasingly intricate systems on top of available source code. This could simply mean a better user interface on an existing project or a more powerful protocol to solve a problem in a specific domain. Finally, open-source tools give the wider scientific community full access to machine learning tools that have found application in a wide range of fields. Such access permits both use of the tool and the ability to extend the tool to fit a specific need.

A machine learning workbench provides at a minimum five services beyond the standard machine learning tool:

1) Learning algorithms
2) Learning evaluation
3) Dataset preprocessing
4) Interface unification
5) Extensible bindings

In essence, a machine learning workbench provides a unified interface to a number of learning algorithms and, ideally, handles more than one type of machine learning problem. For example, a supervised machine learning workbench should support a number of classifiers including decision trees and support vector machines as well as algorithms for other problems such as calibration methods for probabilistic regression. Likewise, a workbench should provide stock tools to handle common tasks such as dataset preprocessing or learning evaluation. For a supervised machine learning workbench, these stock tools should include metrics to measure performance, algorithms to perform cross-validation and tools to perform discretization or normalization. Finally, a workbench should be extensible, providing the ability to add or support new algorithms as well as tying into some scripting language.

A number of machine learning workbenches have been developed in programming languages including C/C++, Java, Python and Matlab. Indeed, the programming language characterizes fundamental properties of the corresponding software. That is, tools based in C/C++ are more difficult to develop yet utilize computing resources more efficiently. Java-, Matlab- and Python-based tools are easier to develop and deploy but require an interpreter and garbage collector. Java- and Delphi-based tools support a rich set of libraries enabling complex graphical user interfaces. Finally, Matlab-based tools enjoy a rich set of statistical and optimization routines, providing the ability to quickly prototype many learning algorithms. Using these languages, a number of workbenches have been developed. One of the most popular is WEKA[3], a Java-based workbench that supports a large number of supervised and unsupervised machine learning algorithms. This workbench has been extended and modified by a number of projects, most notably by RapidMiner[4], a workbench focused on fast prototyping and data visualization. Likewise, a number of workbenches have been developed in C/C++ including Shogun[5], Elefant[6], MLC++[7], Orange[8] and Torch[9]. Shogun, Orange and Elefant support Python bindings, enabling efficient machine learning workflows. Matlab has also proven an excellent platform for machine learning with its own considerable statistical and machine learning libraries; it has been further extended by Spider[10] to better handle a large number of machine learning problems including supervised, unsupervised and semi-supervised learning. There has been considerable effort in developing additional open-source machine learning software. To this end, most available workbenches can be found in a peer-reviewed machine learning software repository (http://mloss.org).

The applications of such machine learning software range from facial recognition to medical diagnosis. One clinical application of machine learning is the identification of cancerous tumors using data collected by some imaging modality, e.g. microscopic analysis of cells[11]. Specifically, a machine learning algorithm can segment an image into regions where one may contain a cancerous tumor. A later algorithm can learn features within these regions (i.e. shape of the possible tumor, texture of its edges, level of contrast) to distinguish benign and cancerous tissue. In more recent work, machine learning has found great success in the arena of brain-computer interfaces[12]. Such devices have a number of applications ranging from clinical monitoring of arousal to investigating the workings of the human brain.

In this work, we introduce a new machine learning workbench for bioinformatics tasks. This workbench has been applied to a number of problems ranging from function prediction, e.g. prediction of DNA-binding residues[13], DNA-binding proteins[14], [15] and membrane-binding proteins[16], [15], to structure prediction, e.g. protein folds[17].

II. LEARNING WITH malibu

malibu is an open-source machine learning workbench written in C/C++ and geared toward supervised learning. The basic design of malibu comprises a hierarchy of C++ template classes that both wrap and extend a core set of classification algorithms. By utilizing proven C++ template meta-programming techniques used in the Boost Libraries (http://www.boost.org) and the Matrix Template Library (http://www.osl.iu.edu/research/mtl/), malibu provides an efficient yet extensible library of algorithms. The core classifiers comprise both third-party tools, e.g. LIBSVM[18], and native implementations[19].

A. Learning algorithms

The malibu workbench currently supports a number of supervised learning problems including classification, meta-classification, importance-weighted classification, regression and multiple-instance learning. A supervised learning problem comprises a set of labeled training examples with the goal of predicting the label on an unseen (and possibly unlabeled) example.

In (binary) classification, the algorithm learns a model from labeled training examples where the label belongs to one of two discrete classes. malibu incorporates a number of third-party and built-in algorithms to handle classification. The third-party classifiers include LIBSVM[18], Cover Tree kNN[20], INDTree[21] and C4.5[22]. The built-in classifiers include the Willow[19] decision tree and ADTree[23]. malibu also supports a number of (binary) meta-classifiers that construct ensembles of classifiers to improve performance, including Bagging[24], Subagging[25], AdaBoost[26], Confidence-rated AdaBoost[26], Gentle AdaBoost[27] and, for the tree-based classifiers, Random Forests[28].

In importance-weighted classification, the algorithm learns a model from training examples labeled by their relative importance such that a prediction will be biased toward more important training examples. One popular variant is called cost-sensitive classification, where examples are weighted based on their class label. malibu supports both implicit and explicit weighting for each algorithm, where implicit weighting is supported by LIBSVM, kNN and Willow. Furthermore, an explicit method utilizes the Costing wrapper[29] to make any classifier importance-weighted.

In regression, the algorithm learns a real-valued output from training examples labeled with a real label. One special case of regression is probabilistic regression, where the learning algorithm assigns a probability to an example as belonging to a particular class. Similar to importance-weighted classification, malibu supports both implicit and explicit regression. That is, learning algorithms such as LIBSVM, kNN and Willow support regression implicitly. For binary classifiers, malibu also includes explicit wrappers to extend a classifier to handle probabilistic regression. These wrappers include sigmoid correction[30], isotonic regression[31] and probing[32].

In multiple-instance learning (MIL), examples are grouped into bags where the bag, not an individual example, has a label. A bag is positive if at least one instance in the bag is positive; otherwise the bag is negative. In malibu any binary classifier can be extended to multiple-instance learning by viewing this problem as binary classification with positive class noise; all parameters are selected by estimating bag-level (not instance-level) performance. malibu also supports extending a weak classifier to a multiple-instance learner through the AdaBoost.C2MIL wrapper[19].

B. Learning evaluation

Evaluating the performance of a learning algorithm is important both to select the best model and to estimate the performance on an unseen testing dataset. The performance of an algorithm is measured as follows:

    for each partition do
        Train algorithm on one partition
        Evaluate on other partition
    end for

Learning algorithm performance is usually measured by metrics and/or graphs. A single metric reflects some question about the performance of a learning algorithm whereas a graph reflects a series of questions. malibu supports a number of threshold metrics from a tabulated contingency table, which estimate the performance for every problem except regression; it also supports a number of regression and ranking metrics. Likewise, malibu supports a number of graphs including the receiver operating characteristic curve, the cost curve[33], the precision/recall curve, lift curves and reliability diagrams.

Note that malibu provides automated model selection for every learning algorithm using the previously described evaluation metrics and the dataset partitioning algorithms introduced in the next section.

C. Dataset preprocessing

Preprocessing a dataset is a critical step for many machine learning algorithms, e.g. normalization of attributes for distance-based methods such as SVM. Moreover, preprocessing also includes algorithms that partition the dataset for model evaluation. malibu comprises a number of algorithms to transform a dataset into an appropriate format such as normalization for distance-based methods, nominal-to-binary conversion for distance-based methods, and discretization to speed up sorting-based methods. Likewise, malibu includes partitioning methods such as cross-validation, bootstrapping, holdout and progressive validation[34]. Each of these methods has various advantages and disadvantages. Holdout requires a large amount of data but it is the best understood. For smaller datasets, cross-validation, progressive validation and bootstrapping are more appropriate, with cross-validation being the most widely used method.

D. Interface unification

The interface to a machine learning algorithm includes setting parameters, reading datasets, outputting results and writing models. Setting parameters in malibu can be accomplished using either command-line arguments or a configuration file, where a subgroup of arguments can be written to and read from a file. The parameter system also supports implicit configuration files depending on the name of the learning algorithm, where command-line parameters override configuration files which, in turn, override implicit configuration files. The dataset format supported by malibu is a standard tab/comma/space-delimited file in which every example is delimited by line separators. Indeed, the format allows changes in class position, existence of a header, index of the bag label or number of prefixing labels.

When a model is applied to a test set, a malibu learning algorithm writes predictions to the standard output. It also outputs statistics describing a training and/or testing set as well as a copy of the configuration file. Finally, malibu supports writing out the models of learning algorithms in the ASCII format.

E. Extensible bindings

A workbench may interface with (or bind) another software tool through two mechanisms: tight-binding and loose-binding. In tight-binding, the workbench makes a function call to some library; in loose-binding, the workbench writes out a file in a format supported by another tool. Currently malibu supports loose-binding to web browsers, LaTeX, GNUPLOT (http://www.gnuplot.info/) and Graphviz (http://www.graphviz.org/). That is, the metrics describing the performance of a learning algorithm can be written out in both the HTML and LaTeX formats. Similarly, the performance can also be written out as a plot in the GNUPLOT format. Finally, the models describing the tree-based learning algorithms can be written out as graphs in the Graphviz DOT format.

III. CONCLUSIONS AND FUTURE WORKS

A. Conclusions

The maturity of open-source software in conjunction with the present need for robust implementations of machine learning algorithms has given rise to significant efforts in developing large-scale workbenches. However, no single workbench is comprehensive in its coverage of machine learning algorithms, nor does every workbench provide an optimal set of features. malibu is a high-performance machine learning workbench developed to extend classifiers to handle classification as well as other problem domains, namely regression, importance-weighted classification and multiple-instance learning. It also satisfies the basic criteria of a workbench by providing a unified user interface, dataset preprocessing algorithms, learning algorithms and bindings to other tools to facilitate learning.

The primary contribution of the malibu workbench is improved usability for a more computer-scientist-oriented user group. That is, malibu is written in ANSI C++ and has been extensively tested in Windows and Unix-like environments. Because it is distributed as binary files rather than interpreted code, malibu does not require the user to learn how to use a Java interpreter (e.g. how to increase available memory) or a Matlab interpreter (e.g. how to program in Matlab). It supports a number of dataset formats, removing from the user the burden of creating scripts to format a dataset. Similarly, it provides a number of standard model selection and evaluation algorithms often missing from third-party code (e.g. CoverTree). malibu also provides a configuration file, which allows users to modify arguments in an environment that provides additional information about each command. Finally, malibu provides bindings for third-party tools to generate graphs and plots. Another contribution includes implementation of new algorithms (e.g. AdaBoost.C2MIL) as well as extension of any algorithm to new problem domains (e.g. classifiers to multiple-instance learning).

B. Future Works

At the same time, malibu (like most available software) is a work in progress. One direction of development is to scale the workbench up to distributed computing. That is, model selection and validation can be distributed via the message passing interface (MPI) to multiple CPUs and machines. Another direction will focus on developing stronger bindings between key software packages. A scripting language such as Python is better suited to selecting objects, extracting features and tying in other applications. A final direction will be to assemble more classifiers including Naïve Bayes and logistic regression as well as more learning strategies such as multi-class classification, multi-part learning and structured prediction.

IV. ACKNOWLEDGMENTS

This work is partially supported by NIH P01 AI060915 (H.L.). R.E.L. acknowledges the support from NIH training grant T32 HL 07692: Cellular Signaling in Cardiovascular System (P.I. John Solaro).

REFERENCES

[1] S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson, “The need for open source software in machine learning,” Journal of Machine Learning Research, vol. 8, pp. 2443–2466, Oct 2007.
[2] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005, http://www.cs.waikato.ac.nz/ml/weka/.
[4] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 12, Philadelphia, USA, 2006.
[5] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, July 2006, http://www.shogun-toolbox.org/.
[6] K. Gawande, C. Webers, A. J. Smola, and S. Vishwanathan, “Elefant: A python machine learning toolbox,” in SciPy Conference, 2007.
[7] R. Kohavi, D. Sommerfield, and J. Dougherty, “Data mining using MLC++, a machine learning library in C++,” in International Conference on Tools with Artificial Intelligence, vol. 8. Toulouse, France: IEEE Computer Society, 1996, p. 234, http://www.sgi.com/tech/mlc/.
[8] J. Demšar, B. Zupan, G. Leban, and T. Curk, “Orange: From experimental machine learning to interactive data mining,” in Knowledge Discovery in Databases: PKDD 2004, ser. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 2004, vol. 3202, pp. 537–539.
[9] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: A modular machine learning software library,” IDIAP Research Institute, Tech. Rep. IDIAP-RR 02-46, 2002, http://www.torch.ch/.
[10] J. Weston, A. Elisseeff, G. Bakır, and F. Sinz, “SPIDER: Object oriented machine learning library,” 2003, http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html.
[11] J. Mohr and K. Obermayer, “A topographic support vector machine: Classification using local label configurations,” in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 929–936.
[12] K.-R. Müller, M. Tangermann, G. Dornhege, M. Krauledat, G. Curio, and B. Blankertz, “Machine learning for real-time single-trial EEG-analysis: From brain-computer interfacing to mental state monitoring,” J. Neurosci. Methods, vol. 167, no. 1, pp. 82–90, 2008.
[13] N. Bhardwaj and H. Lu, “Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions,” FEBS Letters, vol. 581, no. 5, pp. 1058–1066, 2007.
[14] N. Bhardwaj, R. E. Langlois, G. Zhao, and H. Lu, “Kernel-based machine learning protocol for predicting DNA-binding proteins,” Nucleic Acids Research, vol. 33, no. 20, pp. 6486–6493, 2005.
[15] R. Langlois, M. Carson, N. Bhardwaj, and H. Lu, “Learning to translate sequence and structure to function: Identifying DNA binding and membrane binding proteins,” Annals of Biomedical Engineering, vol. 35, no. 6, pp. 1043–1052, 2007.
[16] N. Bhardwaj, R. V. Stahelin, R. E. Langlois, W. Cho, and H. Lu, “Structural bioinformatics prediction of membrane-binding proteins,” Journal of Molecular Biology, vol. 359, no. 2, pp. 486–495, 2006.
[17] R. E. Langlois, A. Diec, O. Perisic, Y. Dai, and H. Lu, “Improved protein fold assignment using support vector machines,” International Journal of Bioinformatics Research and Applications, vol. 1, no. 3, pp. 319–334, 2006.
[18] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[19] R. E. Langlois, “Machine learning in bioinformatics: Algorithms, implementations and applications,” Ph.D. Thesis, University of Illinois at Chicago, Chicago, IL, USA, 2008.
[20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in International Conference on Machine Learning, vol. 148. Pittsburgh, Pennsylvania: ACM, 2006, pp. 97–104.
[21] W. Buntine, “Learning classification trees,” Statistics and Computing, vol. 2, no. 2, pp. 63–73, 1992.
[22] J. R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
[23] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in International Conference on Machine Learning, vol. 16, Bled, Slovenia, 1999.
[24] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[25] P. Bühlmann, “Bagging, subagging and bragging for improving some prediction algorithms,” in Recent Advances and Trends in Nonparametric Statistics, M. G. Akritas and D. N. Politis, Eds. North Holland: Elsevier, 2003, pp. 19–34.
[26] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[27] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[28] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[29] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in IEEE International Conference on Data Mining, vol. 3, Melbourne, Florida, 2003, p. 435.
[30] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers, P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, Eds. Boston: MIT Press, 1999, pp. 61–74.
[31] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” in Special Interest Group on Knowledge Discovery and Data Mining, vol. 8. Edmonton, Alberta, Canada: ACM Press, 2002, pp. 694–699.
[32] J. Langford and B. Zadrozny, “Estimating class membership probabilities using classifier learners,” in International Workshop on Artificial Intelligence and Statistics, vol. 10, Barbados, 2005.
[33] C. Drummond and R. C. Holte, “Cost curves: An improved method for visualizing classifier performance,” Machine Learning, vol. 65, no. 1, pp. 95–130, 2006.
[34] A. Blum, A. Kalai, and J. Langford, “Beating the hold-out: Bounds for k-fold and progressive cross-validation,” in COLT: Computational Learning Theory, vol. 12. Santa Cruz, California: ACM, 1999, pp. 203–208.
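
As an illustration of the partition-based evaluation loop in Section II-B (train on one partition, evaluate on the other), the following is a minimal Python sketch that assumes k-fold cross-validation as the partitioning method. It is not malibu code and does not reflect malibu's interface: the names k_fold_partitions, train_majority and cross_validate, and the majority-class learner, are hypothetical stand-ins chosen only to keep the example self-contained.

```python
from collections import Counter
from typing import Callable, Iterator, List, Sequence, Tuple

# An example is (feature vector, class label); this layout is an assumption.
Example = Tuple[Sequence[float], int]
Model = Callable[[Sequence[float]], int]


def k_fold_partitions(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, test indices) for each of k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for m, fold in enumerate(folds) if m != held_out for j in fold]
        yield train, test


def train_majority(examples: Sequence[Example]) -> Model:
    """Hypothetical stand-in learner: always predict the most common label."""
    majority = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda features: majority


def cross_validate(data: Sequence[Example], k: int,
                   train: Callable[[Sequence[Example]], Model]) -> float:
    """Return accuracy pooled over all held-out folds."""
    correct = 0
    for train_idx, test_idx in k_fold_partitions(len(data), k):
        model = train([data[i] for i in train_idx])   # train on one partition
        correct += sum(model(data[i][0]) == data[i][1] for i in test_idx)  # evaluate on the other
    return correct / len(data)
```

Any learner with the same train signature could be passed in place of train_majority; pooling correct predictions over folds is one common convention, and averaging per-fold accuracy is another.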