Machine Learning: Theory, Applications, Experiences



A Workshop for Women in Machine Learning
October 4, 2006
San Diego, CA
Workshop Organization

Organizers:
  Lisa Wainer, University College London
  Hanna Wallach, University of Cambridge
  Jennifer Wortman, University of Pennsylvania

Faculty advisor:
  Amy Greenwald, Brown University

Additional reviewers:
  Maria-Florina Balcan
  Melissa Carroll
  Kimberley Ferguson
  Katherine Heller
  Julia Hockenmaier
  Rebecca Hutchinson
  Kristina Klinkner
  Bethany Leffler
  Ozgur Simsek
  Alicia Peregrin Wolfe
  Elena Zheleva

Thanks to our generous sponsors.
Schedule

October 3, 2006

19:30  Workshop dinner

October 4, 2006

08:45  Registration and poster set-up
09:00  Welcome
09:15  Invited talk: A General Class of No-Regret Learning Algorithms and
       Game-Theoretic Equilibria
       Amy Greenwald, Brown University
09:45  On a Theory of Learning with Similarity Functions
       Maria-Florina Balcan, Carnegie Mellon University
10:00  Matrix Tile Analysis
       Inmar Givoni, University of Toronto
10:15  Towards Bayesian Black Box Learning Systems
       Jo-Anne Ting, University of Southern California
10:30  Coffee break
10:45  Invited talk: Clustering High-Dimensional Data
       Jennifer Dy, Northeastern University
11:15  Efficient Bayesian Algorithms for Clustering
       Katherine Heller, Gatsby Unit, University College London
11:30  Hidden Process Models
       Rebecca Hutchinson, Carnegie Mellon University
11:45  Invited talk: Recent advances in near-neighbor learning
       Maya Gupta, University of Washington
12:15  Spotlight talks:
       Correcting sample selection bias by unlabeled data
       Jiayuan Huang, University of Waterloo
       Decision Tree Methods for Finding Reusable MDP Homomorphisms
       Alicia Peregrin Wolfe, University of Massachusetts, Amherst
       Evaluating a Reputation-based Spam Classification System
       Elena Zheleva, University of Maryland, College Park
       Improving Robot Navigation Through Self-Supervised Online Learning
       Ellie Lin, Carnegie Mellon University
12:30  Lunch
13:00  Poster session 1
13:45  Invited talk: SRL: Statistical Relational Learning
       Lise Getoor, University of Maryland, College Park
14:15  Generalized statistical methods for fraud detection
       Cecile Levasseur, University of California, San Diego
14:30  Kernels for the Predictive Regression of Physical, Chemical and
       Biological Properties of Small Molecules
       Chloe-Agathe Azencott, University of California, Irvine
14:45  Invited talk: Modeling and Learning User Preferences for Sets of Objects
       Marie desJardins, University of Maryland, Baltimore County
15:15  Coffee break
15:30  Efficient Exploration with Latent Structure
       Bethany Leffler, Rutgers University
15:45  Efficient Model Learning for Dialog Management
       Finale Doshi, MIT
16:00  Transfer in the context of Reinforcement Learning
       Soumi Ray, University of Maryland, Baltimore County
16:15  Spotlight talks:
       Simultaneous Team Assignment and Behavior Recognition from
       Spatio-temporal Agent Traces
       Gita Sukthankar, Carnegie Mellon University
       An Online Learning System for the Prediction of Electricity
       Distribution Feeder Failures
       Hila Becker, Columbia University
       Classification of fMRI Images: An Approach Using Viola-Jones Features
       Melissa K Carroll, Princeton University
       Fast Online Classification with Support Vector Machines
       Seyda Ertekin, Penn State University
16:30  Poster session 2
17:15  Open discussion
17:45  Closing remarks and poster take-down
18:00  End of workshop
Invited Talks

A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria
Amy Greenwald, Brown University

No-regret learning algorithms have attracted a great deal of attention in the game-theoretic and machine learning communities. Whereas rational agents act so as to maximize their expected utilities, no-regret learners are boundedly rational agents that act so as to minimize their "regret". In this talk, we discuss the behavior of no-regret learning algorithms in repeated games. Specifically, we introduce a general class of algorithms called no-Φ-regret learning, which includes common variants of no-regret learning such as no-external-regret and no-internal-regret learning. Analogously, we introduce a class of game-theoretic equilibria called Φ-equilibria. We show that no-Φ-regret learning algorithms converge to Φ-equilibria. In particular, no-external-regret learning converges to minimax equilibrium in zero-sum games, and no-internal-regret learning converges to correlated equilibrium in general-sum games. Although our class of no-regret algorithms is quite extensive, no algorithm in this class learns Nash equilibrium.

Speaker biography: Dr. Amy Greenwald is an assistant professor of computer science at Brown University in Providence, Rhode Island. Her primary research area is the study of economic interactions among computational agents. Her primary methodologies are game-theoretic analysis and simulation. Her work is applicable in areas ranging from dynamic pricing to autonomous bidding to transportation planning and scheduling. She was awarded a Sloan Fellowship in 2006; she was nominated for the 2002 Presidential Early Career Award for Scientists and Engineers (PECASE); and she was named one of the Computing Research Association's Digital Government Fellows in 2001. Before joining the faculty at Brown, Dr. Greenwald was employed by IBM's T.J. Watson Research Center, where she researched Information Economies.
Her paper entitled "Shopbots and Pricebots" (joint work with Jeff Kephart) was named Best Paper at IBM Research in 2000.

Clustering High-Dimensional Data
Jennifer Dy, Northeastern University

Creating effective algorithms for unsupervised learning is important because vast amounts of data preclude humans from manually labeling the categories of each instance. In addition, human labeling is expensive and subjective. Therefore, the majority of existing data is unsupervised (unlabeled). The goal of unsupervised learning, or cluster analysis, is to group "similar" objects together. "Similarity" is typically defined by a metric or a probability model. These measures are highly dependent on the features representing the data. Many clustering algorithms assume that relevant features have been determined by domain experts, but not all features are important. Moreover, many clustering algorithms fail when dealing with high dimensions. We present two approaches for dealing with clustering in high-dimensional spaces:

1. Feature selection for clustering, through Gaussian mixtures and the maximum likelihood and scatter separability criteria; and
2. Hierarchical feature transformation and clustering, through automated hierarchical mixtures of probabilistic principal component analyzers.
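The scatter-separability idea behind the first approach can be sketched in a few lines. Below, the full matrix-valued criterion is reduced to a per-feature ratio of between-cluster to within-cluster scatter, and the tiny dataset and cluster labels are invented for illustration (in practice the labels would come from, e.g., a fitted Gaussian mixture):

```python
# Toy sketch: rank features by a 1-D scatter-separability score.
# A feature that spreads the clusters apart relative to their internal
# spread gets a high score; a noise feature gets a score near zero.

def separability(points, labels, feature):
    """Between-cluster over within-cluster scatter for one feature."""
    overall = sum(p[feature] for p in points) / len(points)
    between = within = 0.0
    for c in set(labels):
        members = [p[feature] for p, l in zip(points, labels) if l == c]
        mean_c = sum(members) / len(members)
        between += len(members) * (mean_c - overall) ** 2
        within += sum((x - mean_c) ** 2 for x in members)
    return between / within if within > 0 else float("inf")

# Feature 0 separates the two clusters; feature 1 is pure noise.
points = [(0.0, 5.1), (0.2, 4.9), (0.1, 5.0), (5.0, 5.0), (5.2, 4.8), (5.1, 5.2)]
labels = [0, 0, 0, 1, 1, 1]
ranked = sorted(range(2), key=lambda f: -separability(points, labels, f))
print(ranked)  # the informative feature ranks first
```

In the full criterion, feature subsets rather than individual features are scored, and the clustering and the selected features are optimized jointly.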
Speaker biography: Dr. Jennifer G. Dy has been an assistant professor in the Department of Electrical and Computer Engineering at Northeastern University, Boston, MA, since 2002. She obtained her MS and PhD in 1997 and 2001, respectively, from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, and her BS degree in 1993 from the Department of Electrical Engineering, University of the Philippines. She received an NSF CAREER award in 2004. She has been an editorial board member for the journal Machine Learning since 2004, and was publications chair for the International Conference on Machine Learning in 2004. Her research interests include machine learning, data mining, statistical pattern recognition, and computer vision.

Recent advances in near-neighbor learning
Maya R. Gupta, University of Washington

Recent advances in nearest-neighbor learning are shown for adaptive neighborhood definitions, neighborhood weighting, and estimation given nearest neighbors. In particular, it is shown that weights that solve linear interpolation equations minimize the first-order learning error, and this is coupled with the principle of maximum entropy to create a flexible weighting approach. Different approaches to adaptive neighborhoods are contrasted, the focus being on neighborhoods that form a convex hull around the test point. Standard weighted nearest-neighbor estimation is shown to maximize likelihood, and it is shown that minimizing expected Bregman divergence instead leads to optimal solutions in terms of expected misclassification cost. Applications may include the testing of pipeline integrity, custom color enhancements, and estimation for color management.

Speaker biography: Maya Gupta completed her Ph.D. in Electrical Engineering in 2003 at Stanford University as a National Science Foundation Graduate Fellow. Her undergraduate studies led to a BS in Electrical Engineering and a BA in Economics from Rice University in 1997.
From 1999-2003 she worked for Ricoh's California Research Center as a color image processing research engineer. In the fall of 2003, she joined the EE faculty of the University of Washington as an Assistant Professor, where she also serves as an Adjunct Assistant Professor of Applied Mathematics. More information about her research is available at her group's webpage.

Modeling and Learning User Preferences for Sets of Objects
Marie desJardins, University of Maryland, Baltimore County

Most work on preference learning has focused on pairwise preferences or rankings over individual items. In many application domains, however, when a set of items is presented together, the individual items can interact in ways that increase (via complementarity) or decrease (via redundancy or incompatibility) the quality of the set as a whole. In this talk, I will describe the DD-PREF language that we have developed for specifying set-based preferences. One problem with such a language is that it may be difficult for users to explicitly specify their preferences quantitatively. Therefore, we have also developed an approach for learning these preferences. Our learning method takes as input a collection of positive examples: one or more sets that have been identified by a user as desirable. Kernel density estimation is used to estimate the value function for individual items, and the desired set diversity is estimated from the average set diversity observed in the collection. Since this is a new learning problem, I will also describe our new evaluation methodology and give experimental results of the learning method on two data collections: synthetic blocks-world data and a new real-world music data collection. Joint work with Eric Eaton and Kiri L. Wagstaff.

Speaker biography: Dr. Marie desJardins is an assistant professor in the Department of Computer Science and Electrical Engineering at the University of Maryland, Baltimore County. Prior to joining the faculty in 2001, Dr. desJardins was a senior computer scientist at SRI International in Menlo Park, California. Her research is in artificial intelligence, focusing on the areas of machine learning, multi-agent systems, planning, interactive AI techniques, information management, reasoning with uncertainty, and decision theory.

SRL: Statistical Relational Learning
Lise Getoor, University of Maryland, College Park

A key challenge for machine learning is mining richly structured datasets describing objects, their properties, and links among the objects. We would like to be able to learn models that can capture both the underlying uncertainty and the logical relationships in the domain. Links among the objects may exhibit certain patterns, which can be helpful for many practical inference tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations, and epidemiological records. Statistical Relational Learning (SRL) is a newly emerging research area that attempts to represent, reason, and learn in domains with complex relational and rich probabilistic structure. In this talk, I'll begin with a short SRL overview.
Then, I'll describe some of my group's recent work, including our work on entity resolution in relational domains. Joint work with Indrajit Bhattacharya, Mustafa Bilgic, Louis Licamele, and Prithviraj Sen.

Speaker biography: Prof. Lise Getoor is an assistant professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning, and representing uncertainty in structured and semi-structured data. Her work in these areas has been supported by NSF, NGA, KDD, ARL, and DARPA. In June 2006, she co-organized the fourth in a series of successful workshops on statistical relational learning. She has published numerous articles in machine learning, data mining, database, and AI forums. She is a member of the AAAI Executive Council, is on the editorial boards of the Machine Learning journal and JAIR, and has served on numerous program committees, including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.
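A toy example can illustrate why the link patterns discussed in the SRL talk are useful for inference: a node's label can often be predicted from the labels of its neighbors. The graph, labels, and majority-vote rule below are invented for illustration and are only a crude stand-in for the statistical relational models the talk describes:

```python
# Toy link-based classification: predict an unlabeled node's class by a
# majority vote over the labels of its neighbors in the graph.

from collections import Counter

edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
         "d": ["c", "e"], "e": ["d"]}
labels = {"a": "ml", "b": "ml", "c": None, "d": "db", "e": "db"}

def predict(node):
    """Majority vote over the labeled neighbors of `node`."""
    votes = Counter(labels[n] for n in edges[node] if labels[n] is not None)
    return votes.most_common(1)[0][0] if votes else None

print(predict("c"))  # two "ml" neighbors outvote one "db" neighbor
```

Real SRL models go well beyond this: they propagate uncertainty through the whole graph rather than making one local vote per node.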
Talks

On a Theory of Learning with Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University

Kernel functions have become an extremely popular tool in machine learning. They have an attractive theory that describes a kernel function as being good for a given learning problem if the data is separable by a large margin in a (possibly very high-dimensional) implicit space defined by the kernel. This theory, however, has a bit of a disconnect with the intuition of a good kernel as a good similarity function. In this work we develop an alternative theory of learning with similarity functions more generally (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semi-definite. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition. In this way, we provide the first steps towards a theory of kernels that describes the effectiveness of a given kernel function in terms of natural similarity-based properties. Joint work with Avrim Blum.

Matrix Tile Analysis
Inmar Givoni, University of Toronto

Many tasks require finding groups of elements in a matrix of numbers, symbols, or class likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques, including PCA, ICA, sparse matrix factorization, and plaid analysis. These techniques are not appropriate when addition and multiplication of matrix elements are not sensibly defined. More directly, methods like bi-clustering can be used to classify matrix elements, but these methods make the overly restrictive assumption that the class of each element is a function of a row class and a column class.
We introduce a general computational problem, "matrix tile analysis" (MTA), which consists of decomposing a matrix into a set of non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows and columns. MTA does not require an algebra for combining tiles, but must search over an exponential number of discrete combinations of tile assignments. We describe a loopy BP (sum-product) algorithm and an ICM algorithm for performing MTA. We compare the effectiveness of these methods to PCA and the plaid method on hundreds of randomly generated tasks. Using double-gene-knockout data, we show that MTA finds groups of interacting yeast genes that have biologically related functions. Joint work with Vincent Cheung and Brendan J. Frey.

Towards Bayesian Black Box Learning Systems
Jo-Anne Ting, University of Southern California

A long-standing dream of machine learning is to create black box learning systems that can operate autonomously in home, research, and industrial applications. While it is well understood that a universal black box may not be possible, significant progress can be made in specific domains. In particular, we address learning problems in sensor-rich and data-rich environments, as provided by autonomous vehicles, surveillance systems, and biological or robotic systems. In these scenarios, the input data has hundreds or thousands of dimensions and is used to make predictions (often in real time), resulting in a learning system that learns to "understand" the environment. The goal of machine learning in this domain is to devise algorithms that can efficiently deal with very high-dimensional data, usually contaminated by noise, redundancy, and irrelevant dimensions. These algorithms must learn nonlinear functions, potentially in an incremental and real-time fashion, for robust classification and regression. In order to achieve black box quality, manual tuning parameters (e.g., as in gradient descent or structure selection) need to be minimized or, ideally, avoided. Bayesian inference, when combined with approximation methods to reduce computational complexity, suggests a promising route to achieve our goals, since it offers a principled way to eliminate open parameters.

In past work, we have started to create a toolbox of methods to achieve our goal of black box learning. In (Ting et al., NIPS 2005), we introduced a Bayesian approach to linear regression. The novelty of this algorithm comes from a Bayesian and EM-like formulation of linear regression that robustly performs automatic feature detection in the inputs in a computationally efficient way. We applied this algorithm to the analysis of neuroscientific data, i.e., the problem of predicting electromyographic (EMG) activity in the arm muscles of a monkey from the spiking activity of neurons in the primary motor and premotor cortex. The algorithm achieves results that are faster by orders of magnitude and of higher quality than previously applied methods.

More recently, we introduced a variational Bayesian regression algorithm that is able to perform optimal prediction given noise-contaminated input and output data (Ting, D'Souza & Schaal, ICML 2006). Traditional linear regression algorithms produce biased estimates when input noise is present and suffer numerically when the data contains irrelevant and/or redundant inputs.
Our algorithm is able to effectively handle datasets with both characteristics. On a system identification task for a robot dynamics model, we achieved 10 to 70% better results than traditional approaches. Current work focuses on developing a Bayesian version of nonlinear function approximation with locally weighted regression. The challenge is to determine the size of the neighborhood of data that should contribute to the local regression model, a typical bias-variance trade-off problem. Preliminary results indicate that a full Bayesian treatment of this problem can achieve impressively robust function approximation performance without the need for tuning meta-parameters. We are also interested in extending this locally linear Bayesian model to an online setting, in the spirit of dynamic Bayesian networks, to offer a parameter-free alternative to incremental learning. Joint work with Aaron D'Souza, Stefan Schaal, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato, Peter Strick, Michael Mistry, Jan Peters, and Jun Nakanishi. This work will also be in Poster Session 1.

Efficient Bayesian Algorithms for Clustering
Katherine Ann Heller, Gatsby Unit, University College London

One of the most important goals of unsupervised learning is to discover meaningful clusters in data. Many different types of clustering methods are commonly used in machine learning, including spectral, hierarchical, and mixture modeling. Our work takes a model-based Bayesian approach to defining a cluster and evaluates cluster membership in this paradigm. We use marginal likelihoods to compare different cluster models, and hence determine which data points belong to which clusters. If we have models with conjugate priors, these marginal likelihoods can be computed extremely efficiently.

Using this clustering framework in conjunction with non-parametric Bayesian methods, we have proposed a new way of performing hierarchical clustering. Our Bayesian Hierarchical Clustering (BHC) algorithm takes a more principled approach to the problem than traditional algorithms (e.g., allowing for model comparisons and the prediction of new data points) without sacrificing efficiency. BHC can also be interpreted as performing approximate inference in Dirichlet Process Mixtures (DPMs), and provides a combinatorial lower bound on the marginal likelihood of a DPM. We have also explored the task of "clustering on demand" for information retrieval. Given a query consisting of a few examples of some concept, we have proposed a method that returns other items belonging to the concept exemplified by the query. We do this by ranking all items using a Bayesian relevance criterion based on marginal likelihoods, and returning the items with the highest scores. In the case of binary data, all scores can be computed with a single matrix-vector product. We can also use this method as the basis for an image retrieval system. In our most recent work, this framework has served as inspiration for a new approach to automated analogical reasoning. Joint work with Zoubin Ghahramani and Ricardo Silva.

Hidden Process Models
Rebecca Hutchinson, Carnegie Mellon University

We introduce the Hidden Process Model (HPM), a probabilistic model for multivariate time series data. HPMs assume the data is generated by a system of partially observed, linearly additive processes that overlap in space and time. While we present a general formalism for any domain with similar modeling assumptions, HPMs are motivated by our interest in studying cognitive processes in the brain, given a time series of functional magnetic resonance imaging (fMRI) data.
We use HPMs to model fMRI data by assuming there is an unobserved series of hidden, overlapping cognitive processes in the brain that probabilistically generate the observed fMRI time series. Consider, for example, a study in which subjects in the scanner repeatedly view a picture, read a sentence, and indicate whether the sentence correctly describes the picture. It is natural to think of the observed fMRI sequence as arising from a set of hidden cognitive processes in the subject’s brain, which we would like to track. To do this, we use HPMs to learn the probabilistic time series response signature for each type of cognitive process, and to estimate the onset time of each instantiated cognitive process occurring throughout the experiment.

There are significant challenges to this learning task in the fMRI domain. The first is that fMRI data is high dimensional and sparse. A typical fMRI dataset measures approximately 10,000 brain locations over 15-20 minutes (features), with only a few dozen trials (training examples). A second challenge is due to the nature of the fMRI signal: it is a highly noisy measurement of an indirect and temporally blurred neural correlate called the hemodynamic response. The hemodynamic response to a short burst of less than a second of neural activity lasts for 10-12 seconds. This temporal blurring in fMRI makes it problematic to model the time series as a first-order Markov process. In short, our problem is to learn the parameters and timing of potentially overlapping, partially observed responses to cognitive processes in the brain using many features and a small number of noisy training examples.

The modeling assumptions that HPMs make to deal with the challenges of the fMRI domain are: 1) the latent time series is modeled at the level of processes rather than individual time points; 2) processes are general descriptions that can be instantiated many times over the course of the time series; 3) we can use prior knowledge of the form “process instance X occurs somewhere inside the time interval [a, b].” HPMs could apply to any domain in which these assumptions are valid. HPMs address a key open question in fMRI analysis: how can one learn the response signatures of overlapping cognitive processes with unknown timing? There is no competing method to HPMs available in the fMRI community. In our ICML paper, we give the HPM formalism, inference and learning algorithms, and experimental results on real and synthetic fMRI datasets. Joint work with Tom Mitchell and Indrayana Rustandi. This work will also be in Poster Session 1.

Generalized statistical methods for fraud detection
Cecile Levasseur, University of California, San Diego

Many important risk assessment applications depend on the ability to accurately detect the occurrence of key events given a large data set of observations. For example, this problem arises in drug discovery (“Do the molecular descriptors associated with known drugs suggest that a new, candidate drug will have low toxicity and high effectiveness?”) and credit card fraud detection (“Given the data for a large set of credit card users, does the usage pattern of this particular card indicate that it might have been stolen?”). In many of these domains, little or no a priori knowledge exists regarding the true sources of any causal relationships that may occur between variables of interest. In these situations, meaningful information regarding the circumstances of the key events must be extracted from the data itself, a problem that can be viewed as an important application of data-driven pattern recognition or detection.
The problem of unsupervised data-driven detection or prediction is one of relating descriptors of a large unlabeled database of “objects” to measured properties of these objects, and then using these empirically determined relationships to infer or detect the properties of new objects. This work considers measured object properties that are non-Gaussian (and comprised of continuous and discrete data), very noisy, and highly nonlinearly related. Data comprised of measurements of such disparate properties are said to be hybrid or of mixed type. As a consequence, the resulting detection problem is very difficult. The difficulties are further compounded because the descriptor space is of high dimension. While many domains lack accurate labels in their databases, others, like credit card fraud, exhibit tagged data. Therefore, the problem of supervised data-driven detection, one relating to a labelled database of objects, is also examined. In addition, by utilizing tagged data, a performance benchmark can be set, enabling meaningful comparisons of supervised and unsupervised approaches.

Statistical approaches to fraud detection are mostly based on modelling the data according to their statistical properties and using this information to estimate whether or not a new object comes from the same distribution. The statistical modelling approach proposed here is a generalization and amalgamation of techniques from classical linear statistics (logistic regression, principal component analysis, and generalized linear models) into a framework referred to as generalized linear statistics (GLS). It is based on the use of exponential family distributions to model the various types (continuous and discrete) of data measurements. A key aspect is that the natural parameter of the exponential family distributions is constrained to a lower-dimensional subspace, to model the belief that the intrinsic dimensionality of the data is smaller than the dimensionality of the observation space. The proposed constrained statistical modelling is a nonlinear methodology that exploits the split that occurs for exponential family distributions between the data space and the parameter space as soon as one leaves the domain of purely Gaussian random variables. Although the problem is nonlinear, it can be solved by using classical linear statistical tools applied to data that has been mapped into the parameter space, which still has a natural, flat Euclidean structure. This approach provides an effective way to exploit tractably parameterized latent-variable exponential-family probability models for data-driven learning of model parameters and features, which in turn are useful for the development of effective fraud detection algorithms.

The fraud detection techniques proposed here operate in the parameter space rather than in the data space, as has been done in more classical approaches. In the case of a low level of contamination of the data by fraudulent points, a single lower-dimensional subspace is learned by applying the GLS-based statistical modelling to a training set. A new data point is projected to its image on the lower-dimensional subspace, and fraud detection is performed by comparing its distance from the training set mean-image to a threshold. An example is presented showing that there are domains for which classical linear techniques, such as principal component analysis, used in the data space perform far from optimally compared to the proposed parameter space techniques. For cases of data with roughly as many fraudulent as non-fraudulent points, an unsupervised analogue of the linear Fisher discriminant is proposed. The GLS-based framework enables unsupervised learning of a lower-dimensional subspace in the parameter space that separates fraudulent from non-fraudulent data.
Fraud detection is then performed as in the previous case. In both cases, an ROC curve is generated to assess the performance of the proposed fraud detection methods. Joint work with Kenneth Kreutz-Delgado and Uwe Mayer.

Kernels for the Predictive Regression of Physical, Chemical and Biological Properties of Small Molecules
Chloe-Agathe Azencott, University of California, Irvine

Small molecules, i.e., molecules composed of a few hundred atoms, play a fundamental role in biology, chemistry, and pharmacology. Their uses range from the design of new drugs to the better understanding of biological systems; however, establishing their physical, chemical, and biological properties through physical experimentation can be very costly. It is therefore essential to develop efficient computational methods to predict these properties. Kernel methods, and among them support vector machines, appear particularly appropriate for chemical data, for they involve similarity measures that allow one to embed the data in a high-dimensional feature space where linear methods can be used. Machine learning spectral kernels can be derived from various descriptions of the molecules; we study representations whose dimensionality ranges from 1 to 4, thus obtaining 1D, 2D, 2.5D, 3D, and 4D kernels. Using cross-validation and redundancy reduction techniques on various datasets of small and medium size from the literature, we test the kernels for the prediction of boiling points, melting points, aqueous solubility, and octanol/water partition coefficient, and compare them against state-of-the-art results. Spectral kernels derived from the rich and reliable two-dimensional representation of the molecules outperform the other methods on most of the datasets. They seem to be the method of choice, given their simplicity, computational efficiency, and prediction accuracy.

Efficient Exploration with Latent Structure
Bethany Leffler, Rutgers University

Developing robot control using a reinforcement-learning (RL) approach involves a number of technical challenges. In our work, we address the problem of learning an action model. Classical RL approaches assume Markov decision process (MDP) environments, which do not support the critical idea of generalization between states. For an agent to learn the results of its actions for each state, it would have to visit each state and perform each action in that state at least once. In a robot setting, however, it is unrealistic to assume there will be sufficient time to learn about every state of the environment independently, so richer models of environmental dynamics are needed.

Our technique for developing such a model is to assume that each state is not unique. In most environments, there will be states that have the same transition dynamics. By developing models in which similar states have similar dynamics, it becomes possible for a learner to reuse its experience in one state to more quickly learn the dynamics of other parts of the environment. However, this also introduces an additional challenge: determining which states are similar.

To evaluate the viability of this approach, we constructed an experiment using a four-wheeled Lego Mindstorms robot as the agent. The state space consisted of discretized vehicle locations with a hidden variable of slope (flat or incline), which correlated directly with the action model. The agent had to learn which throttling action to perform in each state to maintain a target speed. In this scenario, the actions did not affect the transitions between states. To determine similarity between states, the agent executed a selected action several times in each of the vehicle locations. The outcomes of these actions were used to hierarchically cluster the states.
Once the states were clustered, the agent then started learning an action model for each state cluster. The advantage of this approach over one that learns a separate action model for each state is that information gathered in several different states can be pooled together. In common environments, there are many more states than state types; therefore, learning based on clusters drastically reduces learning time. In fact, we were able to prove a worst-case learning-time result that formalizes and validates this claim. If the environment does not have many similar states, or if the clustering algorithm groups the states incorrectly, then the benefit of this approach will be minimized. Even in this worst case, however, it is important to note that this algorithm is no more costly than exploring each state individually. Some limitations of this algorithm arise when states have semi-similar action models. For instance, if two states behave similarly when one action is performed, but not for all actions, it is possible that the agent would learn incorrectly when following our proposed algorithm. In most robotic environments, however, using our algorithm will greatly reduce the time taken by the agent to determine its action model in all states, thereby increasing the efficiency of the robot. Joint work with Michael L. Littman, Alexander L. Strehl, and Thomas Walsh.
Efficient Model Learning for Dialog Management Finale Doshi, MIT Intelligent planning algorithms such as the Partially Observable Markov Decision Process (POMDP) have succeeded in dialog management applications because of their robustness to the inherent uncertainty of human interaction. Like all dialog planning systems, however, POMDPs require an accurate model of the user (such as the user's possible states and what the user might say). POMDPs are generally specified using a large probabilistic model with many parameters; these parameters are difficult to specify from domain knowledge, and gathering enough data to estimate them accurately a priori is expensive. In this paper, we take a Bayesian approach to learning the user model while simultaneously solving the dialog management problem. First we show that the policy that maximizes the expected reward is the solution of the POMDP taken with the expected values of the parameters. We update the parameter distributions after each interaction, and incrementally update the previous POMDP solution. The update process has a relatively small computational cost, and we test various heuristics to focus computation in circumstances where it is most likely to improve the dialog. We demonstrate a robust dialog manager that learns from interaction data, outperforming a hand-coded model both in simulation and in a robotic wheelchair application. Joint work with Nicholas Roy. Transfer in the context of Reinforcement Learning Soumi Ray, University of Maryland, Baltimore County We are investigating the problem of transferring knowledge learned in one domain to another related domain. Transfer of knowledge from simple domains to more complex domains can reduce the total training time in the complex domains. We perform transfer in the context of reinforcement learning. In the past, knowledge transfer has been accomplished between domains with the same state and action spaces.
Work has also been done where the state and action spaces of the two domains are different but a mapping has been provided by humans. We are trying to automate the mapping from the old domain to the new domain when the state and action spaces are different. We have two domains D1 and D2, with corresponding state spaces S1 and S2 and action spaces A1 and A2, where |S1| = |S2| and |A1| = |A2|. Our goal is to transfer a policy learned in D1 to D2 so as to speed learning in D2. We first run Q-learning in D1 to produce Q-table Q1. Then we train for a limited time in D2 and generate Q2. The test bed we have used is a 16x16 grid world with four actions: North, South, East and West. In the first domain we trained for 500 iterations and in the second domain for 20 iterations. Our goal is to find the mapping between the state spaces S1 and S2 and the action spaces A1 and A2; the two approaches we have used are as follows. In the first approach we compute the difference between matrices Q1 and Q2 and greedily find a mapping that minimizes this difference. With this mapping we can transfer the Q-values from the completely trained domain D1 to the partially trained domain D2 to speed up learning in D2. We find that it takes fewer steps to learn completely in the second domain when the Q-values are transferred than when learning from scratch. Our second approach finds the mapping that assigns the states with the highest Q-values in domain one to the states with the highest Q-values in domain two. This approach is an improvement over the first approach. It takes many fewer steps to learn in
the second domain using transfer. We are also interested in finding the mapping when S1 and A1 are subsets of S2 and A2 respectively, i.e. |S1| < |S2| and |A1| < |A2|. This can be handled by allowing a single state/action in S1/A1 to map to multiple states/actions in S2/A2. Joint work with Tim Oates. This work will also be in Poster Session 2.
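The first (greedy, difference-minimizing) transfer approach above can be sketched in miniature. This is a simplification, not the authors' implementation: each state is reduced to a single Q-value, and each D2 state is greedily paired with the unused D1 state whose value differs least; the paired Q-values are then copied across. State names and values are illustrative.

```python
def greedy_state_mapping(q1, q2):
    """Greedily pair each D2 state with the closest unused D1 state."""
    mapping, used = {}, set()
    for s2, v2 in q2.items():
        best = min((s1 for s1 in q1 if s1 not in used),
                   key=lambda s1: abs(q1[s1] - v2))
        mapping[s2] = best
        used.add(best)
    return mapping

def transfer(q1, q2):
    """Overwrite the partially trained Q2 with values transferred from Q1."""
    m = greedy_state_mapping(q1, q2)
    return {s2: q1[m[s2]] for s2 in q2}

q1 = {"a": 10.0, "b": 5.0, "c": 1.0}   # fully trained in D1
q2 = {"x": 9.0, "y": 0.5, "z": 5.5}    # briefly trained in D2
q_new = transfer(q1, q2)
print(q_new)  # {'x': 10.0, 'y': 1.0, 'z': 5.0}
```

Learning in D2 then resumes from the transferred values rather than from scratch.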
Spotlights (Session 1) Correcting sample selection bias by unlabeled data Jiayuan Huang, University of Waterloo The default assumption in many learning scenarios is that training and test data are independently and identically drawn from the same distribution. When the distributions of the training and test sets do not match, we face the problem commonly referred to as sample selection bias or covariate shift. This problem occurs in many real-world applications, including surveys, sociology, biology and economics. It is not hard to see that, given a skewed selection of the training data, it is difficult to derive a good model that makes accurate predictions on the general target population, as the training set might not be representative of the complete population from which the test set is drawn. The predictions are thus biased, potentially increasing the errors. Although there exists previous work addressing this problem, sample selection bias is typically ignored in standard estimation algorithms. In this work, we utilize the availability of unlabeled data to direct a sample selection de-biasing procedure for various learning methods. Unlike most previous algorithms, which first recover the sampling distributions and then make appropriate corrections based on the distribution estimate, our method infers the re-sampling weights directly by matching the training and test distributions in feature space in a non-parametric manner. We do not require the estimation of biased densities or selection probabilities, or any assumption that the class probabilities are known. Because the matching is done in feature space, the method can handle high-dimensional data. Our experimental results on many benchmark datasets demonstrate that our method works well in practice.
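The re-weighting idea above can be illustrated in a deliberately simplified form. The paper matches training and test distributions non-parametrically in feature space; the sketch below is NOT that method, but conveys the underlying intuition in one dimension by weighting each training point by a histogram estimate of the test/train density ratio. All data, bin edges and variable names are made up.

```python
def density_ratio_weights(train, test, edges):
    """Weight w(x) ~ p_test(bin of x) / p_train(bin of x)."""
    def hist(xs):
        counts = [0] * (len(edges) - 1)
        for x in xs:
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        return [c / len(xs) for c in counts]

    p_tr, p_te = hist(train), hist(test)
    weights = []
    for x in train:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                weights.append(p_te[i] / p_tr[i] if p_tr[i] > 0 else 0.0)
                break
    return weights

train = [0.1, 0.2, 0.3, 0.9]   # biased: over-samples the low region
test = [0.1, 0.6, 0.7, 0.8]    # test mass sits higher
w = density_ratio_weights(train, test, edges=[0.0, 0.5, 1.0])
print(w)  # low-region points get weight 1/3, the high-region point gets 3.0
```

Training a weighted learner with these weights then counteracts the selection bias; the paper's kernel-based matching achieves the same effect without explicit density estimation.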
The method also shows good performance in tumor diagnosis using microarrays, suggesting that it will be a valuable tool for cross-platform microarray classification. Joint work with Alex Smola, Arthur Gretton, Karsten Borgwardt and Bernhard Schölkopf. Decision Tree Methods for Finding Reusable MDP Homomorphisms Alicia Peregrin Wolfe, University of Massachusetts, Amherst State abstraction is a useful tool for agents interacting with complex environments. Good state abstractions are compact, reusable, and easy to learn from sample data. This paper combines and extends two existing classes of state abstraction methods to achieve these criteria. The first class of methods searches for MDP homomorphisms (Ravindran 2004), which produce models of reward and transition probabilities in an abstract state space. The second class of methods, like the UTree algorithm (McCallum 1995), learns compact models of the value function quickly from sample data. Models based on MDP homomorphisms can easily be extended such that they are usable across tasks with similar reward functions; however, value-based methods like UTree cannot be extended in this fashion. We present results showing that a new, combined algorithm fulfills all three criteria: the resulting models are compact, can be learned quickly from sample data, and can be used across a class of reward functions. Joint work with Andrew Barto.
Evaluating a Reputation-based Spam Classification System Elena Zheleva, University of Maryland, College Park Over the past several years, spam has been a growing problem for the Internet community. It interferes with valid e-mail and burdens both e-mail users and ISPs. While there are various successful automated e-mail filtering approaches that aim at reducing the amount of spam, there are still many challenges to overcome. Reactive spam filtering approaches classify a piece of e-mail as spam if it has been reported as such by a large volume of e-mail users. Unfortunately, by the time the system responds by blocking the message or automatically placing it in future recipients' spam folders, the spam campaign has already affected many users. The challenge we consider is whether we can reduce the response time by recognizing a spam campaign at an earlier stage, thus reducing the cost that users and systems incur. Specifically, we evaluate the predictive power of a reputation-based spam filtering system, which uses feedback only from trustworthy e-mail users. In a reputation-based or trust-based spam filtering system, the system identifies a set of users who report spam reliably and trusts their spam reports more than those of other users. A message coming into the system is classified as spam if enough reliable users report it. This automatic spam filtering approach is vulnerable to malicious users when any anonymous person can subscribe and unsubscribe to the e-mail service, as is the case with most free e-mail providers such as AOL, Hotmail and Yahoo. We show how to overcome this problem in this work. There are two well-known open-source projects which operate in this framework: Vipul's Razor and Distributed Checksum Clearinghouse.
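The core classification rule described above (flag a message once enough trusted users report it, ignoring reports from unknown users) can be sketched in a few lines. User names and the threshold are illustrative only; the paper's contribution lies in maintaining the trusted set over time, which is omitted here.

```python
def is_spam(reports, trusted, threshold):
    """reports: set of user ids that reported the message as spam;
    trusted: set of user ids the system currently considers reliable.
    Only reports from trusted users count toward the threshold."""
    return len(reports & trusted) >= threshold

trusted = {"alice", "bob", "carol"}
print(is_spam({"alice", "mallory"}, trusted, threshold=2))         # False
print(is_spam({"alice", "bob", "mallory"}, trusted, threshold=2))  # True
```

Restricting the vote to trusted reporters is what makes the rule robust to anonymous malicious users, at the cost of needing to keep the trusted set accurate.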
Unfortunately, their reputation systems work only as part of their commercially available software counterparts and, due to trade secrets, it is not clear how design characteristics such as the reputation definition and metrics affect system performance. More importantly, the spam reports they receive are mostly from authorized users (such as business-partner company employees), which reduces the risk of abuse by anonymous users. The effectiveness of a reputation-based spam filtering system rests on the following properties: 1) automatic maintenance of a reliable user set over time, 2) timely and accurate recognition of a spam campaign, and 3) a set of guarantees on the system's vulnerability. In our work, we present results from simulating a reputation-based spam filtering system over a period of time. The evaluation dataset includes all the spam reports received during that period for a particular free e-mail provider. We show how our algorithms effectively reduce spam campaign response time while minimizing system vulnerability. Joint work with Lise Getoor and Alek Kolcz. Improving Robot Navigation Through Self-Supervised Online Learning Ellie Lin, Carnegie Mellon University In mobile robotics, there are often features that, while potentially powerful for improving navigation, prove difficult to profit from because they generalize poorly to novel situations. Overhead imagery data, for instance, has the potential to greatly enhance autonomous robot navigation in complex outdoor environments. In practice, reliable and effective automated interpretation of imagery from diverse terrain, environmental conditions, and sensor varieties proves challenging. Similarly, fixed techniques that successfully interpret on-board sensor data across many environments begin to fail past short ranges, as the
density and accuracy necessary for such computation quickly degrade, and the features that can be computed from distant data are very domain-specific. We introduce an online, probabilistic model that learns to use these scope-limited features by leveraging other features that, while perhaps otherwise more limited, generalize reliably. We apply our approach to provide an efficient, self-supervised learning method that accurately predicts traversal costs over large areas from overhead data. We present results from field testing on board a robot operating over large distances in off-road environments. Additionally, we show how our algorithm can be used offline with overhead data to produce a priori traversal cost maps and to detect misalignments between overhead data and estimated vehicle positions. This approach can significantly improve the versatility of many unmanned ground vehicles by allowing them to traverse highly varied terrains with increased performance. Joint work with B. Sofman, J. Bagnell, N. Vandapel and A. Stentz.
Spotlights (Session 2) Simultaneous Team Assignment and Behavior Recognition from Spatio-temporal Agent Traces Gita Sukthankar, Carnegie Mellon University This research addresses the problem of activity recognition for physically embodied agent teams. We define team activity recognition as the process of identifying team behaviors from traces of agent positions over time; in many physical domains, military or athletic, coordinated team behaviors create distinctive spatio-temporal patterns that can be used to identify low-level action sequences. We focus on the novel problem of recovering agent-to-team assignments for complex team tasks where team composition, the mapping of agents into teams, changes over time. Without a priori knowledge of current team assignments, the behavior recognition problem is challenging, since behaviors are characterized by the aggregate motion of the entire team and cannot generally be determined by observing the movements of a single agent in isolation. To handle this problem, we introduce a new algorithm, Simultaneous Team Assignment and Behavior Recognition (STABR), that generates behavior annotations from spatio-temporal agent traces. STABR leverages information from the spatial relationships of the team members to create sets of potential team assignments at selected time steps. These spatial relationships are efficiently discovered using a randomized search technique, RANSAC, to generate potential team assignment hypotheses. Sequences of team assignment hypotheses are then evaluated using dynamic programming to derive a parsimonious explanation for the entire observed spatio-temporal trace. To prune the number of hypotheses, potential team assignments are fitted to a parameterized team behavior model; poorly fitting hypotheses are eliminated before the dynamic programming phase.
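The RANSAC-style randomized hypothesis generation that STABR builds on is sketched below on a toy problem. This is illustrative only: the paper fits agent positions to formation templates, whereas the sketch fits 2D points to a line, but the loop structure (sample a minimal subset, form a hypothesis, count how much data supports it, keep the best) is the same.

```python
import random

def ransac_line(points, n_iters=200, tol=0.1, seed=0):
    """Return the line (slope, intercept) supported by the most inliers."""
    rng = random.Random(seed)
    best, best_inliers = None, -1
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)  # minimal sample
        if x1 == x2:
            continue
        m = (y2 - y1) / (x2 - x1)                   # hypothesis
        b = y1 - m * x1
        inliers = sum(1 for x, y in points
                      if abs(y - (m * x + b)) <= tol)  # support
        if inliers > best_inliers:
            best, best_inliers = (m, b), inliers
    return best, best_inliers

pts = [(x, 2 * x + 1) for x in range(5)] + [(1, 10), (3, -4)]  # line + outliers
(m, b), support = ransac_line(pts)
print(m, b, support)
```

In STABR the "minimal sample" is a candidate subset of agents, the "hypothesis" is a team assignment matched against a formation template, and the support comes from a spatio-temporal behavior detector.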
The proposed approach is able to perform accurate team behavior recognition without exhaustive search over the partition set of potential team assignments, as demonstrated on several scenarios of simulated military maneuvers. STABR does not simply assume that agents within a certain proximity should be assigned to the same team; instead it relies on matching static snapshots of agent positions against a database of team formation templates to produce a candidate pool of agent-to-team assignments. This candidate pool is verified by running a local spatio-temporal behavior detector; the intuition is that the aggregate agent movement for an incorrect team assignment will generally fail to match any behavior model. STABR significantly outperforms agglomerative clustering on the agent-to-team assignment problem for traces with dynamic agent composition (95% accuracy). The scenarios presented here illustrate the operation of STABR in environments that lack the external cues used by other multi-agent plan recognition approaches, such as landmarks, cleanly clustered agent teams, and extensive domain knowledge. We believe that when such cues are available they can be directly incorporated into STABR, both to improve accuracy and to prune hypotheses. STABR provides a principled framework for reasoning about dynamic team assignments in spatial domains. Joint work with Katia Sycara.
An Online Learning System for the Prediction of Electricity Distribution Feeder Failures Hila Becker, Columbia University We are using machine learning techniques to construct a failure-susceptibility ranking of feeder cables that supply electricity to the boroughs of New York City. The electricity system is inherently dynamic, and thus our failure-susceptibility ranking system must adapt to the latest conditions in real time, updating its ranking accordingly. The feeders have a significant failure rate, and many resources are devoted to monitoring, maintenance and repair of feeders. The ability to predict failures allows a shift from reactive to proactive maintenance, thus reducing costs. The feature set for each feeder includes a mixture of static data (e.g. age and composition of each feeder section) and dynamic data (e.g. electrical load data for a feeder and its transformers). The values of the dynamic features are captured at the time of training and therefore lead to different models depending on the time and day at which each model is trained. Previously, a framework was designed to train models using a new variant of boosting called Martingale Boosting, as well as Support Vector Machines. In this framework, however, an engineer had to decide whether to use the most recent data to build a new model or to use the latest existing model for future predictions. To avoid the need for human intervention, we have developed an “online” system that determines which model to use by monitoring the past performance of previously trained models. In our new framework, we treat each batch-trained model as an expert, and use a measurement of its performance as the basis for rewarding or penalizing its quality score. We measure performance as a normalized average rank of failures. For example, in a ranking of 50 items with actual failures ranked #4 and #20, the performance is: 1 – (4 + 20) / (2*50) = 0.76.
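The performance measure above can be computed directly; the short sketch below reproduces the worked example from the text (function and variable names are ours, not from the paper).

```python
def rank_performance(failure_ranks, n_items):
    """Normalized average rank of failures: higher is better.

    Computed as 1 - (sum of failure ranks) / (num_failures * n_items),
    so failures ranked near the top of the list score close to 1.
    """
    return 1 - sum(failure_ranks) / (len(failure_ranks) * n_items)

# The example from the abstract: 50 items, failures ranked #4 and #20.
perf = rank_performance([4, 20], 50)
print(perf)  # 0.76
```

A perfect ranking (failures at the very top) approaches 1, while failures buried at the bottom push the score toward 0.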
Our approach builds on the notion of learning from expert advice as formulated in the continuous version of the Weighted Majority algorithm. Since each model is analogous to an expert, and our system runs live, gathering new data and generating new models, we must keep adding new experts to the existing ensemble throughout the algorithm's execution. To avoid having to monitor an ever-increasing set of experts, we drop poorly performing experts after each prediction. We had to address the following key issues in our solution: (1) how often, and with what weight, do we add new experts, and (2) which experts do we drop. Our simulations suggest that initializing new models with the median of all current models' weights works best. To drop experts we use a combination of the age of the model and its past performance. Finally, to make predictions we use a weighted average of the top-scoring experts. Our system is currently deployed and being tested by New York City's electricity distribution company. Results are highly encouraging, with 75% of the failures in the summer of 2005 being ranked in the top 26%, and 75% of failures in 2006 being ranked in the top 36%. Joint work with Marta Arias. Classification of fMRI Images: An Approach Using Viola-Jones Features Melissa K. Carroll, Princeton University There has been growing interest in using Functional Magnetic Resonance Imaging (fMRI) for “mind reading,” particularly in applying machine learning methods to classifying fMRI brain images based on the subject's instantaneous cognitive state. For instance, Haxby et al. (2001) perform fMRI scans while subjects are viewing images of one of seven classes of
objects, with the goal of discriminating the brain images based on the class of image being viewed at the time. Most machine learning approaches used to date for fMRI classification have treated individual voxels as features and ignored the spatial correlation between voxels (Norman et al., 2006). We present a novel method for searching this feature space to generate features that capture spatial information, derived from the Viola and Jones (2001) algorithm for 2D object detection, and apply it to 2D representations of the images. In this method, features are computed corresponding to absolute and relative intensities over regions of varying size and shape, and used by AdaBoost (Schapire and Singer, 1999) to generate a classifier. Figure 1 shows examples of these features overlaid on an actual 2D representation of the 3D fMRI image. Mean intensities in white regions are subtracted from mean intensities in gray regions to compute each feature, and the features are combined to form the feature vector. One-, two-, three- and four-rectangle features of all 100 size combinations between 1x1 and 10x10 are computed for all positions in the image. As Figure 2 shows, including richer features than the standard one-pixel features can result in improved classification of the Haxby et al. dataset. One potential limitation of the method is that the large feature set it produces conflicts with computational limitations; however, Figure 2 shows that even selecting a small random subset of the richer features can increase classification accuracy by 5% or more, although performance varies across subjects. In addition, the performance of this subset of features can be used to target subsequent feature selection. Future work is needed to develop reliable and valid methods for rating feature importance.
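The rectangle features described above are sums of intensities over image regions, and in the Viola-Jones scheme each rectangle sum costs only four lookups into a precomputed integral image. A minimal sketch (plain nested lists stand in for real fMRI slices):

```python
def integral_image(img):
    """ii[r][c] = sum of img over all rows < r and cols < c."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum over a rectangle via four integral-image lookups."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = rect_sum(ii, 0, 0, 3, 3)                            # whole image
feat = rect_sum(ii, 0, 0, 3, 1) - rect_sum(ii, 0, 2, 3, 1)  # two-rectangle feature
print(total, feat)  # 45 -6
```

The two-rectangle feature above (left column minus right column) is the kind of white-minus-gray difference the abstract describes; scanning all sizes and positions yields the full feature set fed to AdaBoost.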
Finally, Figure 3 shows that confusion among predicted classes occurs most often between classes that are most similar and for which previous classifiers have encountered difficulty, e.g. male faces and female faces. This target-space similarity structure could be exploited in future work to improve classification. 1. J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, (293) 2425-2429. 2. K. A. Norman, S. M. Polyn, G. J. Detre and J. V. Haxby. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, in press. 3. R. E. Schapire and Y. Singer. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3): 297-336. 4. P. Viola and M. Jones. (2001). Rapid object detection using a boosted cascade of simple features. CVPR 2001. Joint work with Kenneth A. Norman, James V. Haxby and Robert E. Schapire. Fast Online Classification with Support Vector Machines Seyda Ertekin, Penn State University In recent years, we have witnessed a significant increase in the amount of data in digital format, due to the widespread use of computers and advances in storage systems. As the volume of digital information increases, people need more effective tools to better find, filter and manage these resources. Classification, the assignment of instances (e.g. pictures, text documents, emails, Web sites) to one or more predefined categories
based on their content, is an important component in many information organization and management tasks. Support Vector Machines (SVMs) are a popular machine learning algorithm for classification problems due to their theoretical foundation and good generalization performance. However, SVMs have not yet seen widespread adoption in the communities working with very large datasets, due to the high computational cost of solving the quadratic programming (QP) problem in the training phase. This research presents an online SVM learning algorithm, LASVM, which yields classification accuracy rates comparable to state-of-the-art SVM solvers but requires fewer computational resources: LASVM tolerates a much smaller main memory and has a much faster training phase. We also show that not all examples in the training set are equally informative. We present methods to select the most informative examples and exploit them to reduce the computational requirements of the learning algorithm, using the properties of active learning algorithms to select informative examples efficiently from very large-scale training sets. We will also show the benefits of using a non-convex loss function in SVMs for faster speeds and lower computational requirements. Joint work with Leon Bottou, Antoine Bordes and Jason Weston.
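One common way to operationalize "most informative examples" for a margin-based classifier (not necessarily LASVM's exact criterion, which the abstract does not spell out) is to prefer examples closest to the decision boundary, i.e. with the smallest absolute decision score. A toy sketch with made-up scores:

```python
def most_informative(scored_examples, k):
    """Return the k example ids with the smallest absolute decision score.

    scored_examples: list of (example_id, decision_score) pairs, where the
    score is the signed distance-like output of a margin classifier.
    """
    return [ex for ex, _ in
            sorted(scored_examples, key=lambda p: abs(p[1]))[:k]]

scored = [("e1", 2.3), ("e2", -0.1), ("e3", 0.8), ("e4", -1.7)]
print(most_informative(scored, 2))  # ['e2', 'e3']
```

Training preferentially on such near-boundary examples is the active-learning intuition the abstract invokes: confidently classified points far from the margin add little to the solution.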
Posters (Session 1) Using Decision Trees for Gabor-based Texture Classification of Tissues in Computed Tomography Alia Bashir, DePaul University This research is aimed at developing an automated imaging system for classification of tissues in CT images. Classification of tissues in CT scans using shape or gray-level information is challenging due to the changing shape of organs in a stack of images and the gray-level intensity overlap in soft tissues. However, healthy organs are expected to have a consistent texture within tissues across slices. Given a large enough set of normal-tissue images and a good set of texture features, machine learning techniques can be applied to create an automatic classifier. Previous work by one of the authors explored texture descriptors based on wavelets, ridgelets, and curvelets for the classification of tissues from normal chest and abdomen CT scans. These texture descriptors were able to classify tissues with an accuracy range of 85-98%, with curvelet-based texture descriptors performing best. In this paper we bridge the gap to perfect accuracy by focusing on texture features based on a bank of Gabor filters. The approach consists of three steps: convolution of the regions of interest with a bank of 32 Gabor filters (4 frequencies and 8 orientations), extraction of two Gabor texture features per filter (mean and standard deviation), and creation of a classifier that automatically identifies the various tissues. The data set consists of 2D DICOM images from five normal chest and abdomen CT studies from Northwestern Medical Hospital. The following regions of interest were segmented out and labeled by an expert radiologist: liver, spleen, kidney, aorta, trabecular bone, lung, muscle, IP fat, and SQ fat, for a total of 1112 images. For each image, the feature vector consists of the mean and standard deviation of the 32 filtered images, totaling 64 descriptors.
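The feature-assembly step described above (mean and standard deviation of each filtered image, concatenated into one descriptor) can be sketched as follows. The Gabor convolution itself is omitted; the small nested lists below are stand-ins for real filter responses, and with 32 filters this would yield the 64-value descriptor from the abstract.

```python
def texture_descriptor(filtered_images):
    """Concatenate (mean, std) of each filtered image into one vector."""
    features = []
    for img in filtered_images:
        pixels = [p for row in img for p in row]
        mean = sum(pixels) / len(pixels)
        var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
        features.extend([mean, var ** 0.5])
    return features

responses = [[[1, 3], [1, 3]],   # stand-in response of "filter 1"
             [[2, 2], [2, 2]]]   # stand-in response of "filter 2"
desc = texture_descriptor(responses)
print(desc)  # [2.0, 1.0, 2.0, 0.0]
```

Each region of interest thus becomes a fixed-length vector suitable for the decision-tree classifier described next.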
The classification step is carried out using a Classification and Regression Tree (CART) classifier. A decision tree predicts the class of an object (tissue) from the values of predictor variables (texture descriptors) and generates a set of decision rules, which are then used for the classification of each region of interest. Both cross-validation and a random split of the data set into a training set (~65%) and a testing set (~35%) were applied, but no significant difference was observed. The optimal tree had a depth of 20, with the parent node value set at 10 and the child node value set at 1. To evaluate the performance of each classifier, specificity, sensitivity, precision, and accuracy rates are calculated from each misclassification matrix. Results show that this set of texture features is able to perfectly classify the 9 regions of interest. The Gabor filters' ability to isolate features at different scales and directions allows for a multi-resolution analysis of texture, essential when dealing with the at-times very subtle differences in the texture of tissues in CT scans. Given the strong performance in the classification of healthy tissues, we plan to apply Gabor texture features to the classification of abnormal tissues. Joint work with Julie Hasemann and Lucia Dettori. VOGUE: A Novel Variable Order-Gap State Machine for Modeling Sequences Bouchra Bouqata, Rensselaer Polytechnic Institute (RPI) In this paper we present VOGUE, a new state machine that combines two separate techniques for modeling long-range dependencies in sequential data: data mining and data modeling. VOGUE relies on a novel Variable-Gap Sequence mining method (VGS) to mine frequent patterns with different lengths and gaps between elements. It then uses these mined sequences to build the state machine. We applied VOGUE to the task of
protein sequence classification on real data from the PROSITE protein families. We show that VOGUE yields significantly better scores than higher-order Hidden Markov Models. Moreover, we show that VOGUE's classification sensitivity outperforms that of HMMER, a state-of-the-art method for protein classification. Joint work with Christopher Carothers, Boleslaw K. Szymanski and Mohammed J. Zaki. GroZi: a Grocery Shopping Assistant for the Blind Carolina Galleguillos, UC San Diego Grocery shopping is a common activity that people all over the world perform on a regular basis. Unfortunately, grocery stores and supermarkets are still largely inaccessible to people with visual impairments, as they are generally viewed as "high cost" customers. We propose to develop a computer-vision-based grocery shopping assistant, built around a handheld device with haptic feedback, that can detect different products inside a store, thereby increasing the autonomy of blind (or low-vision) people in grocery shopping. Our solution makes use of new computer vision techniques for the visual recognition of specific products inside a store, as specified in advance on a shopping list. These techniques can avail of complementary resources such as RFID, barcode scanning, and sighted guides. We also present a challenging new dataset of images consisting of different categories of grocery products that can be used for object recognition studies. Use of the system consists of the creation of a shopping list followed by in-store navigation. To support shopping-list creation, we will develop a website, accessible to visually impaired people, that stores data and images of different products. The website will be augmented with new image templates from the community of users who shop with the device, in addition to images of the same product taken in different stores by different users.
This will increase the system's ability to recognize products whose appearance changes for seasonal or promotional reasons. The navigational task includes finding the correct aisle for the products (based on text detection and character recognition), avoiding obstacles, finding products and checking out. A typical grocery store carries around 30,000 items, so recognizing a single object is a nontrivial task. Assuming a shopping list is generally less than 1/1000th of this amount (i.e., fewer than 30 items), the recognition can be constrained to two phases: detection of objects on a possibly cluttered shelf, and verification of the detected objects against the shopping list. For this task, we intend to use state-of-the-art object recognition algorithms and develop new approaches for fast identification. Applications of Kernel Minimum Enclosing Ball Cristina Garcia C., Universidad Central de Venezuela The minimum enclosing ball (MEB) is a well-studied problem in computational geometry. In this work we describe a generalization of a simple approximate MEB construction, introduced by M. Badoiu and K. L. Clarkson, to a feature-space MEB using the kernel trick. The simplicity of the methodology is itself surprising: the MEB algorithm is based only on geometrical information extracted from a sample of data points, and just two parameters need to be tuned, the constant of the kernel and the tolerance on the radius of the approximation. The applicability of the method is demonstrated on anomaly detection and on less traditional scenarios such as 3D object modeling and path planning. Results are
encouraging and show that even an approximate feature-space MEB is able to induce topology-preserving mappings on noisy data of arbitrary dimension as efficiently as other machine learning approaches. Joint work with Jose Ali Moreno.

Classification With Cumular Trees
Claudia Henry, Antilles-Guyane

The accurate combination of decision trees and linear separators has been shown to provide some of the best off-the-shelf classifiers. We describe a new type of such combination, which we call Cumular (Cumulative Linear) Trees. Cumular Trees are midway between Oblique Decision Trees and Alternating Decision Trees: more expressive than the former, and simpler than the latter. We provide an induction algorithm for Cumular Trees which is, as we show, a boosting algorithm in the original sense. Experiments against AdaBoost, C4.5 and OC1 show very good results, especially when dealing with noisy data. Joint work with Richard Nock and Franck Nielsen.

Transient Memory in Reinforcement Learning: Why Forgetting Can be Good for You
Anna Koop, University of Alberta

The vast majority of work in machine learning is concerned with algorithms that converge to a single solution. It is not clear that this is always the most appropriate aim. Consider a sailor adapting to the ship's motion. She may learn two conditional models: one for walking when at sea, and another for walking when on land. She may, when memory resources are limited, learn a best-on-average policy that settles on a compromise among all situations she has encountered. A more flexible approach might be to quickly adapt the walking policy to new situations, rather than seeking one final solution or set of solutions. We explore two cases of transient memory. In the first case, the rate at which individual parameters change is controlled by meta-parameters.
These meta-parameters allow the agent to ignore irrelevant or random features, to converge where features are consistent throughout its experience, and otherwise to adapt quickly to changes in the environment. This approach requires no commitment to the number of parameter sets necessary in a given environment, but makes the best use of available resources. In the second case, a single solution is stored in long-term parameters, but this solution is used only as the starting point for learning about a specific situation. This is currently being applied to the game of Go. At the beginning of a game, the agent's value function parameters are initialized according to the long-term memory. During the course of a game these parameters are updated by simulating, from each state, thousands of self-play games. The short-term parameters learned in this way are used both for action selection and as the starting point for learning on the next turn, after the opponent has moved. Actual game-play moves are used to update both the short- and long-term memory. At the end of the game, the short-term memory is forgotten and the value function parameters are initialized to the long-term values. This allows the agent to store general knowledge in long-term memory while adapting quickly to the specific situations encountered in the current game.
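The first case above, per-parameter learning rates controlled by meta-parameters, can be sketched along the lines of Sutton's IDBD algorithm for linear prediction. This is an illustrative sketch, not the author's implementation; the constants and names are assumptions:

```python
import numpy as np

def idbd_step(w, h, beta, x, y, meta_rate=0.01):
    """One IDBD-style update (after Sutton, 1992): each weight w[i] carries
    its own log step-size beta[i], adapted online by a single meta-rate.
    Features that consistently reduce error earn larger step-sizes; noisy
    or irrelevant ones are damped."""
    delta = y - w @ x                          # prediction error
    beta = beta + meta_rate * delta * x * h    # meta-learning of log step-sizes
    alpha = np.exp(beta)                       # per-parameter step-sizes
    w = w + alpha * delta * x                  # LMS update, one rate per weight
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x  # trace
    return w, h, beta
```

On a toy target such as y = 2*x0 with an irrelevant second feature, the weight on x0 converges near 2 while the irrelevant weight stays near zero, with no hand-tuned global step-size.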
Predicting Task-Specific Webpages for Revisiting
A. Twinkle E. Lettkeman, Oregon State University

Most web browsers track the history of all pages visited, with the intuition that users are likely to want to return to pages they have previously accessed. However, the history viewers in web browsers are ineffective for most users because of the overwhelming glut of webpages that appear in the history. Not only does the history represent a potentially confusing interleaving of many of a user's different tasks, but it also includes many webpages that would provide minimal or no utility to the user if revisited. This paper reports on a technique that dramatically reduces web browsing histories down to pages that are relevant to the user's current task context and have a high likelihood of being desirable to revisit. We briefly describe how the TaskTracer system maintains an awareness of a user's tasks and semi-automatically segments the web browsing history by task. We then present a technique that predicts whether webpages previously visited on a task will be of future value to the user and are worth displaying in the history user interface. Our approach uses a combination of heuristics and machine learning to evaluate the content of a page and interactions with the page, learning a predictive model of webpage relevance for each user task. We show the results of an empirical evaluation of this technique based on user data. This approach could be applied to systems that track webpage resources, to predict the future value of those resources and to lower the user's cost of finding and reusing webpages. Our findings suggest that prediction of webpages is highly user- and task-specific, and that the choice of prediction algorithms is not obvious. In future work we aim to refine the features used to predict revisitability.
We will analyze the effect of better text feature extraction in conjunction with user interest indicators such as reading time, scrolling behavior, and text selection. Preliminary analysis indicates that applying these refinements may increase the accuracy of our prediction models. Joint work with Simone Stumpf, Jed Irvine and Jonathan Herlocker.

Hyper-parameters auto-setting using regularization path for SVM
Gaëlle Loosli, INSA de Rouen

Support Vector Machines are now very popular for classification tasks. However, their use by neophyte users is still hampered by the need to supply values for control parameters in order to get the best attainable results. Given clean data, SVM users must make three main choices: the type of kernel, its bandwidth, and the regularization parameter. It would be convenient to provide users with a push-button SVM able to auto-set these to the best possible values. This paper presents a new method that approaches this goal. Given the importance of this problem for reaping all the potential benefits of SVMs, much research has been dedicated to helping set the parameters. Most of it relies either on outer measures, such as cross-validation, to guide the selection, or on measures embedded in the learning method itself. In place of empirical approaches to setting the control parameters, regularization paths have been proposed and widely studied in recent years, since they provide a smart and fast way to access all the optimal solutions of a problem across all compromises between bias and variance (in regression) or between bias and regularity (in classification). For instance, in the case of classification tasks, as studied in this paper, soft-margin SVMs deal with non-separable problems thanks to slack variables that are parametrized by a slack trade-off (usually noted C; it is the regularization parameter).
Within the usual formulation of the soft-margin SVM, this trade-off takes values between 0 (random) and infinity (hard margins). The nu-SVM technique reformulates the SVM problem so that C is replaced by a parameter nu taking values in [0,1]. This normalized parameter has a more intuitive meaning: it represents the minimal proportion
of points in the solution and the maximal proportion of misclassified points. However, having the whole regularization path is not enough: the end user still needs to retrieve from it the best values for the regularization parameters. Instead of selecting these values by k-fold cross-validation, leave-one-out, or other approximations, we propose to include the leave-one-out estimator inside the regularization path in order to have an idea of the generalization error at each step. We explain why this is less expensive than selecting the best parameter a posteriori, and give a method to stop learning before reaching the end of the path, to save useless effort. Contrary to what is usually done for regularization paths, our method does not start with all points as support vectors. In doing so, we avoid computing the whole Gram matrix at the first step. Then, since the proposed method stops on the path, this extreme non-sparse solution is never reached and thus the whole Gram matrix is never required. One of the main advantages of this is that the setting can be used for large databases.

The Influence of Ranker Quality on Rank Aggregation Algorithms
Brandeis Marshall, Rensselaer Polytechnic Institute

The rank aggregation problem has been studied extensively in recent years, with a focus on how to combine several different rankers to obtain a consensus aggregate ranker. We study the rank aggregation problem from a different perspective: how the individual input rankers impact the performance of the aggregate ranker. We develop a general statistical framework based on a model of how the individual rankers depend on the ground-truth ranker. Within this framework, one can study the performance of different aggregation methods. The individual rankers, which are the inputs to the rank aggregation algorithm, are statistical perturbations of the ground-truth ranker.
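One concrete way to realize statistical perturbations of a ground-truth ranker, purely as an illustration and not the paper's actual noise model, is to apply random adjacent transpositions to the true ranking and measure the drift with Kendall-tau distance:

```python
import itertools
import random

def kendall_tau(r1, r2):
    """Kendall-tau distance: the number of item pairs ordered differently
    by the two rankings (lists of the same items)."""
    pos = {item: i for i, item in enumerate(r2)}
    return sum(1 for a, b in itertools.combinations(r1, 2) if pos[a] > pos[b])

def noisy_ranker(truth, n_swaps, rng):
    """Perturb the ground-truth ranking with random adjacent transpositions;
    n_swaps plays the role of the ranker's noise level (each swap moves the
    ranking at most one unit of Kendall-tau distance from the truth)."""
    r = list(truth)
    for _ in range(n_swaps):
        i = rng.randrange(len(r) - 1)
        r[i], r[i + 1] = r[i + 1], r[i]
    return r
```

An aggregator can then be evaluated by generating several such rankers at varying noise levels and comparing the aggregate's Kendall-tau distance to the truth.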
With rigorous experimental evaluation, we study how the noise level and misinformation of the rankers affect the performance of the aggregate ranker. We introduce and study a novel Kendall-tau rank aggregator and a simple aggregator called PrOpt, which we compare to other well-known rank aggregation algorithms such as average, median and Markov chain aggregators. Our results show that the relative performance of aggregators varies considerably depending on how the input rankers relate to the ground truth. Joint work with Sibel Adali and Malik Magdon-Ismail.

Learning for Route Planning under Uncertainty
Evdokia Nikolova, Massachusetts Institute of Technology

We present new complexity results and efficient algorithms for optimal route planning in the presence of uncertainty. We employ a decision-theoretic framework for defining the optimal route: for a given source S and destination T in the graph, we seek an S-T path of lowest expected cost, where the edge travel times are random variables and the cost is a nonlinear function of total travel time. Although this is a natural model for route planning on real-world road networks, results are sparse due to the analytic difficulty of finding closed-form expressions for the expected cost, as well as the computational and combinatorial difficulty of efficiently finding an optimal path that minimizes the expected cost. We identify a family of appropriate cost models and travel time distributions that are closed under convolution and physically valid. We obtain hardness results for routing problems with a given start time and cost functions with a global minimum, in a variety of deterministic and stochastic settings. In general the global cost is not separable into edge costs, precluding classic shortest-path approaches. However, using partial minimization
techniques, we exhibit an efficient solution via dynamic programming with low polynomial complexity. We then consider an important special case of the problem, in which the goal is to maximize the probability that the path length does not exceed a given threshold value (deadline). We give a surprising exact n^Θ(log n) algorithm for the case of normally distributed edge lengths, based on quasi-convex maximization. We then prove average and smoothed polynomial bounds for this algorithm, which also translate to average and smoothed bounds for the parametric shortest path problem, and extend to a more general non-convex optimization setting. We also consider a number of other edge-length distributions, giving a range of exact and approximation schemes. Our offline algorithms can be adapted to give online learning algorithms via the Kalai-Vempala approach of converting an offline optimization solution into an efficient online one. Joint work with Matthew Brand, David Karger, Jonathan Kelner and Michael Mitzenmacher.

A Neurocomputational Model of Impaired Imitation
Biljana Petreska, Ecole Polytechnique Federale de Lausanne

This abstract addresses the question of human imitation through convergent evidence from neuroscience, using tools from machine learning. In particular, we consider a deficit in imitation of meaningless gestures (i.e., hand postures relative to the head) following callosal brain lesion (i.e., disconnected hemispheres). We base our work on the rationale that looking at how imitation is impaired in apraxic patients can unveil its underlying neural principles. We ground the functional architecture and information flow of our model in brain imaging studies. Finally, findings from monkey neurophysiology studies drive the choice of implementation of our processing modules.
Our neurocomputational model of visuo-motor imitation is based on self-organizing maps receiving sensory input (i.e., visual, tactile or proprioceptive) with associated activities [1]. We train the connections between the maps with anti-Hebbian learning to account for the transformations required to translate the observation of the visual stimulus to be imitated into the corresponding tactile and proprioceptive information that will guide the imitative gesture. Patterns of impairment of the model, realized by adding uncertainty to the transfer of information between the networks, reproduce the deficits found in a clinical examination of visuo-motor imitation of meaningless gestures [2]. The model makes hypotheses about the type of representation used and the neural mechanisms underlying human visuo-motor imitation. It also helps us better understand the occurrence and nature of imitation errors in patients with brain lesions.

[1] B. Petreska and A. G. Billard. A Neurocomputational Model of an Imitation Deficit following Brain Lesion. In Proceedings of the 16th International Conference on Artificial Neural Networks (ICANN 2006), Athens, Greece. To appear.
[2] G. Goldenberg, K. Laimgruber, and J. Hermsdörfer. Imitation of gestures by disconnected hemispheres. Neuropsychologia, 39:1432–1443, 2001.

Joint work with A. G. Billard.
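The self-organizing maps at the core of such models follow the standard Kohonen update, which can be sketched as below. The grid size, decay schedule and constants are illustrative assumptions, not details of the model in [1]:

```python
import numpy as np

def som_step(weights, grid, x, t, sigma0=2.0, eta0=0.5, tau=500.0):
    """One self-organizing map update: find the best-matching unit (BMU)
    for input x, then pull each unit toward x in proportion to a Gaussian
    neighborhood around the BMU on the map grid. Learning rate and
    neighborhood width both decay with time t."""
    bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)  # squared grid distances to BMU
    sigma = sigma0 * np.exp(-t / tau)             # shrinking neighborhood
    eta = eta0 * np.exp(-t / tau)                 # decaying learning rate
    nbh = np.exp(-d2 / (2.0 * sigma ** 2))
    return weights + eta * nbh[:, None] * (x - weights)
```

Repeatedly presenting sensory inputs this way organizes the map so that nearby units respond to similar stimuli, the property the model relies on for its sensory representations.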
Bayesian Estimation for Autonomous Object Manipulation Based on Tactile Sensors
Anya Petrovskaya, Stanford University

We consider the problem of autonomously estimating the position and orientation of an object from tactile data. When initial uncertainty is high, estimating all six parameters precisely is computationally expensive. We propose an efficient Bayesian approach that is able to estimate all six parameters in both unimodal and multimodal scenarios. The approach is termed Scaling Series sampling, as it estimates the solution region by samples. It performs the search using a series of successive refinements, gradually scaling the precision from low to high. Our approach can be applied to a wide range of manipulation tasks. We demonstrate its portability on two applications: (1) manipulating a box and (2) grasping a door handle. Joint work with Oussama Khatib, Sebastian Thrun and Andrew Y. Ng.

Therapist Robot Behavior Adaptation for Post-stroke Rehabilitation Therapy
Adriana Tapus, University of Southern California

Research into Human-Robot Interaction (HRI) for socially assistive applications is in its infancy. Socially assistive robotics, which focuses on social rather than physical interaction between the robot and the human user, has the potential to enhance the quality of life for large populations of users. Post-stroke rehabilitation is one of the largest potential application domains, since stroke is a dominant cause of severe disability in the growing ageing population. In the US alone, over 750,000 people suffer a new stroke each year, with the majority sustaining some permanent loss of movement [Institute06]. This loss of function, termed "learned disuse", can improve with rehabilitation therapy during the critical post-stroke period. One of the most important elements of any rehabilitation program is carefully directed, well-focused and repetitive practice of exercises, which can be passive or active.
Our work focuses on hands-off therapist robots that assist, encourage, and socially interact with patients during their active exercises. Our previous research demonstrated, through real-world experiments with stroke patients [Tapus06b, Eriksson05, Gockley06], that the physical embodiment (including shared physical context and physical movement of the robot), the encouragements, and the monitoring play key roles in patient compliance with rehabilitation exercises. In the current work we investigate the role of the robot's personality in the hands-off therapy process. We focus on the relationship between the level of extroversion/introversion (as defined in the Eysenck model of personality [Eysenck91]) of the robot and the user, addressing the following research questions:

1. How should we model the behavior and encouragement of the therapist robot as a function of the personality of the user and the number of exercises performed?
2. Is there a relationship between the extroversion-introversion personality spectrum of the Eysenck model and a challenge-based vs. nurturing style of patient encouragement?

To date, little research into human-robot personality matching has been performed. Some of our recent results showed a preference for personality matching between users and socially assistive robots [Tapus06a]. Our therapist robot behavior adaptation system monitors the number of exercises per minute performed by the patient, indicating the level of engagement and/or fatigue, and changes the robot's behavior in order to maximize this level. The socially assistive therapist robot (see Figure 1) is equipped with a basis set of behaviors that explicitly express its desires and intentions in a physical and verbal way that is observable to the user/patient. These behaviors involve the control of
physical distance, gestural expression, and verbal expression (tone and content). The number of exercises per minute is therefore used as a reward that the system seeks to maximize. Hands-off robot post-stroke rehabilitation therapy holds great promise for improving patient compliance in the recovery program. Our work aims toward developing and testing a model of compatibility between human and robot personality in the assistive context, based on the PEN theory of personality, and toward building a customized therapy protocol. Examining and answering these issues will begin to address the role of assistive robot personality in enhancing patient compliance.

[Eriksson05] Eriksson, J., Matarić, M. J., and Winstein, C. "Hands-off assistive robotics for post-stroke arm rehabilitation", In Proceedings of the International Conference on Rehabilitation Robotics (ICORR-05), Chicago, Illinois, June 2005.
[Eysenck91] Eysenck, H. J. "Dimensions of personality: 16, 5 or 3? Criteria for a taxonomic paradigm", Personality and Individual Differences, vol. 12, pp. 773-790, 1991.
[Gockley06] Gockley, R., and Matarić, M. J. "Encouraging Physical Therapy Compliance with a Hands-Off Mobile Robot", In Proceedings of the First International Conference on Human Robot Interaction (HRI-06), Salt Lake City, Utah, March 2006.
[Institute06] "Post-Stroke Rehabilitation Fact Sheet", National Institute of Neurological Disorders and Stroke, January 2006.
[Tapus06a] Tapus, A. and Matarić, M. J. "User Personality Matching with Hands-Off Robot for Post-Stroke Rehabilitation Therapy", In Proceedings of the 10th International Symposium on Experimental Robotics (ISER), Rio de Janeiro, Brazil, July 2006.
[Tapus06b] Tapus, A. and Matarić, M. J. "Towards Socially Assistive Robotics", International Journal of the Robotics Society of Japan (JRSJ), 24(5), pp. 576-578, July 2006.

Joint work with Maja J. Matarić.
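A behavior-adaptation loop of this kind, choosing an encouragement style to maximize exercises per minute, can be caricatured as a two-armed bandit. The style names, reward function and epsilon-greedy rule below are illustrative assumptions, not the system described above:

```python
import random

def adapt_behavior(styles, reward_fn, steps=200, eps=0.1, rng=None):
    """Epsilon-greedy adaptation: each encouragement style is a bandit arm,
    the observed exercises-per-minute is its reward, and the robot keeps a
    running mean estimate q[style] of each style's effectiveness."""
    rng = rng or random.Random()
    pulls = {s: 0 for s in styles}
    q = {s: 0.0 for s in styles}
    for _ in range(steps):
        if rng.random() < eps:
            style = rng.choice(styles)      # occasionally explore
        else:
            style = max(styles, key=q.get)  # otherwise exploit best estimate
        reward = reward_fn(style)           # e.g. observed exercises/minute
        pulls[style] += 1
        q[style] += (reward - q[style]) / pulls[style]  # incremental mean
    return q
```

With patient-dependent rewards, such a loop drifts toward whichever style (challenge-based or nurturing) that particular patient responds to best, which is the adaptive behavior the abstract argues for.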
Learning How To Teach
Cynthia Taylor, University of California, San Diego

The goal of the RUBI project is to develop a social robot (RUBI) that can interact with children and teach them in an autonomous manner. As part of the project we are currently focusing on the problem of teaching 18-24 month old children skills targeted by the California Department of Education as appropriate for this age group. In particular we are focusing on teaching the children to identify objects, shapes and colors. We have seven RFID-tagged stuffed toys, in the shapes of common objects like a slice of watermelon or a waffle. RUBI says the name of an object and shows a picture of it on her touch screen, and the children hand her a toy, which she identifies as correct or incorrect. She keeps track of the right and wrong answers for each toy. RUBI has a touch screen on her stomach that she can use to play short videos and play games with the children. By recording when the children touch her stomach, the screen also provides important information about whether or not the children are engaged. She has two Apple iSight cameras for eyes, and runs machine learning software that lets her detect both faces and smiles. The smile detection lets her gauge people's moods during social interaction, and respond accordingly. She has an RFID reader in her right hand, letting her
identify RFID-tagged toys. The machine learning aspect of this problem is how to use the information from her perceptual primitives to teach the materials in an effective manner. After each question and answer, RUBI has to decide whether to continue playing her current learning game or switch to another activity, and, if she continues, what question to ask next. She also has to decide what to do when she asks a question and does not get an answer for a long period of time. Unlike many standard AI problems like chess, RUBI works in continuous time, with no discrete turns. We are approaching the problem from the point of view of control theory. Exact solutions to the optimal teaching problem exist for some simple models of learning, such as the Atkinson and Bower learning model. We plan to find approximate solutions to this control problem using reinforcement learning methods. We will complement formal and computational analysis with an ethnographic study of how teachers teach children the same task. Our focus will be on understanding both the timing and the sources of information they use to adapt their teaching strategies. Joint work with Paul Ruvolo, Ian Fasel and Javier R. Movellan.

Strategies for improving face recognition in video using machine learning methods
Deborah Thomas, University of Notre Dame

Surveillance cameras are a common feature in many stores and public places, and there are many applications for face recognition from video streams in law enforcement. However, while face recognition from high-quality still images has been very successful, face recognition from video is a relatively new area with huge room for improvement. Furthermore, when using video as our data, we can exploit the fact that there are multiple frames to choose from to improve recognition performance.
So, instead of representing subjects using a single high-quality image, they can be represented using a set of frames chosen from a video clip. We want to select frames for an individual that are as distinct as possible: this allows for diversity in the training space, thereby improving the generalization capacity of the learned face recognition classifier. In this work, we consider two different approaches. The commonality between the two is Principal Component Analysis (PCA). Given the high dimensionality of the data, PCA is often warranted, not only to reduce the dimensionality but also to construct mode-independent dimensions. In our first approach, we use a nearest neighbor algorithm with the Mahalanobis Cosine (MahCosine) distance measure. A pair of images in which the faces differ from each other in pose and expression will have a larger MahCosine distance between them, so we can use this as a measure of difference between frames. In the second approach, we project the images into PCA space and then use k-means clustering to group all the frames from one subject, picking one image per cluster to make up the representation set. Here again, images that are similar to each other will fall in the same cluster, while more distinct images will fall in different clusters. In addition to the difference between frames, we also incorporate a quality metric of the face when picking frames, which yields a higher recognition rate. We demonstrate our approach using two different datasets. First, we compare our approach to that used by Lee et al. in 2003 (Video-based Face Recognition Using Appearance Manifolds) and 2005 (Visual Tracking and Recognition using Probabilistic Appearance Manifolds), who represent their subjects with appearance manifolds and use planes in PCA space for the different poses. We show that our approach performs