3. Main areas in Machine Learning
#1 Supervised learning – assumes a teacher exists to label/annotate data
#2 Unsupervised learning – no need for a teacher, try to learn relationships automatically
#3 Reinforcement learning – biologically plausible, try to learn from reward/punishment stimuli/feedback
16. Training a neural network
Sigmoid function
The learnt hypothesis is represented by the weights that interconnect each neuron
The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set
Gradient descent
[Figure: network with input layer (menopausal status, ultrasound score, CA125), hidden layer and output layer (1/0); error surface E(w) plotted over weights w1 and w2]
Hello everybody, today I’m going to briefly cover some of the topics that I was taught at the Pattern Recognition summer school earlier this summer. The talk is at a very high level – so if something needs more detail then please stop me and I will try to explain it or refer to the handouts I got from the summer school. I want this talk to be very informal – so stop me at any time. I have tried to give an overview of the field to highlight how all the different areas of machine learning fit together.
You’ve probably heard it all before – it has become a bit of a cliché, but it is a multidisciplinary field, which is why I like it. The computing age has caused an exponential explosion of data from all sources – medicine, finance, industry – and with it a strong need for algorithms to interpret this data. Machine learning is the ideal solution. Minsky and Papert’s book on the limitations of the Perceptron really dented interest in the field of machine learning, but the resurgence was spurred by the development of the neural network back propagation algorithm, which got around the limitations of the perceptron.
There are three main areas in machine learning: Supervised learning – where we assume a teacher exists to label our data. This is by far the most well studied area of machine learning, as lots of problems/data can be analysed using this framework. Unsupervised learning – this is where we have no labels provided by a teacher – we try to cluster the data into “natural” relationships. Reinforcement learning – this is the most biologically plausible area of machine learning; it is similar to supervised learning except that instead of telling the learner the correct answer/label we just punish or reward it accordingly.
For this first section I will discuss some algorithms and problems in supervised learning.
As I said earlier this is a very well studied area of machine learning – and borrows lots of its techniques from statistics and mathematics. The main concept in supervised learning is the idea of a training and test set. We learn from a training set, and validate our learning process by checking against a test set. The main sub-areas of research are pattern recognition and regression, where the labels are discrete and continuous respectively. Most studies analyse these problems under the assumption that the data is i.i.d. Another interesting sub-area is time series analysis – this is very popular in finance and signal processing – I only have a crude knowledge of this area. But the main difference is that we break free from the i.i.d. assumption and recognise some temporal dependence in the data – e.g. using Markov or stationarity assumptions.
Going into more detail about how we formalise the data – we commonly break it into object and label pairs. The label is the quantity that we want to predict in our learning task, and we will use our training data to learn the relationship between objects and labels. For example, in a cancer screening pattern recognition problem we could use the labels normal, benign and malignant.
Here I have given an example of some training and test data. Here is a training set of images taken from the USPS (US Postal Service): we have handwritten digit scanned images, and then we have the respective label telling us which digit each represents. We have new test images where the label is either withheld or not known. We hope that our learning algorithm will learn the relationship from the training data and predict new unseen data. We consider n of these training and test examples drawn from an unknown joint distribution.
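To make this setup concrete, here is a rough Python sketch of the training/test idea. Scikit-learn and its small built-in 8x8 digits dataset are just convenient stand-ins I have picked for the USPS images, and the classifier choice is arbitrary:

```python
# Minimal sketch of the supervised learning setup: learn from (object, label)
# training pairs, then check the learnt relationship on held-out test examples.
# scikit-learn and the digits dataset are illustrative assumptions only.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)        # x: flattened digit image, y: digit 0-9

# Split into a training set (teacher provides labels) and a test set (labels withheld)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=2000)    # any supervised learner would do here
clf.fit(X_train, y_train)                  # learn the object -> label relationship
print("test accuracy:", clf.score(X_test, y_test))   # accuracy on unseen examples
```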
There are lots of algorithms/solutions out there: Support Vector Machines (SVM), Nearest Neighbours, Decision Trees, Neural Networks, Multivariate Statistics, Bayesian algorithms, Logic programming.
This has a huge following of devout disciples. The underlying technique is very simple – related to the perceptron, a linear classifier that separates data of two classes into halfspaces. Vapnik has a very detailed theoretical justification (PAC theory, empirical risk minimisation) of the technique. Great practical applications – bioinformatics, financial analysis, text document analysis. The main concept in SVM is to keep the decision rule simple so as not to overfit the data. If the data is not linearly separable (often the case in real-life data!) we use a kernel to map it into another feature space where it is separable. This is where we can plug our domain knowledge into the SVM – which is why it is so popular; the focus is kernel design.
The hot topics in SVM are: kernel design – this is vital; there is a lot of theory which formalises the properties a kernel must have, but in practice this is crucial and requires a lot of thought – and applying the kernel technique to other learning algorithms.
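Here is a hedged little sketch of the kernel idea – the two-rings dataset and the use of scikit-learn's SVC are my own illustrative choices. The same SVM code fails with a linear kernel but works once an RBF kernel maps the data into a feature space where it is separable:

```python
# Swapping the kernel changes the feature space in which the maximum-margin
# separator lives; everything else about the SVM stays the same.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original input space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)        # C controls the slack/complexity trade-off
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))
```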
Born in the 60s – probably the simplest of all algorithms to understand. Decision rule: classify a new test example by finding the closest neighbouring example in the training set and predicting the same label as that neighbour. Lots of theory justifying its convergence properties. A very lazy technique, and not very fast – it has to search the training set for each test example.
Examples are viewed in Euclidean space, so it can be very sensitive to feature scaling. A hot topic is finding computationally efficient ways to search for the nearest neighbour example.
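A small sketch of both points – scikit-learn, the wine dataset and the 1-NN setting are assumptions on my part. Rescaling the features changes the Euclidean distances (and hence the decision rule), and a KD-tree is one computationally efficient way to search for the nearest neighbour:

```python
# Feature scaling changes the Euclidean distances a nearest neighbour classifier
# relies on; a KD-tree makes the neighbour search faster than brute force.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)          # features on very different scales
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier(n_neighbors=1, algorithm="kd_tree")
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=1, algorithm="kd_tree"))

for name, clf in [("unscaled", raw), ("scaled", scaled)]:
    clf.fit(X_train, y_train)
    print(name, "1-NN test accuracy:", clf.score(X_test, y_test))
```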
Many different varieties: C4.5, CART, ID3… The algorithms build classification rules using a tree of if-then statements. The tree is constructed using Minimum Description Length (MDL) principles (it tries to make the tree as simple as possible).
Instability – minor changes to the training data make huge changes to the decision tree. The user can visualise/interpret the hypothesis directly and can find interesting classification rules. Problems with continuous real-valued attributes, which must be discretised. Large AI following, and widely used in industry.
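To show the "tree of if-then statements" view, here is a minimal sketch – scikit-learn and the iris data are my own illustrative choices – that fits a small tree and prints it as human-readable rules:

```python
# Fit a deliberately small decision tree and print it as nested if-then rules;
# each root-to-leaf path is an interpretable classification rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # keep the tree simple
tree.fit(data.data, data.target)

print(export_text(tree, feature_names=list(data.feature_names)))
```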
This can be considered a fine art – in practice it can be a bit ad hoc. Very flexible; learning is a gradient descent process (back propagation). Training neural networks involves a lot of design choices: what network structure, how many hidden layers, how to encode the data (values must be in [0,1]), using momentum to speed up convergence, using weight decay to keep the network simple.
The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set. The learnt hypothesis is represented by the weights that interconnect each neuron.
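A minimal NumPy sketch of this idea, on made-up data with an illustrative learning rate and weight-decay term: the hypothesis is nothing more than the weight vector w, and training is gradient descent on the error E(w):

```python
# Train a single sigmoid unit by gradient descent on the squared error E(w).
# The synthetic data, learning rate and decay constant are illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))                 # inputs encoded into [0, 1]
y = (X @ np.array([2.0, -1.0, 0.5]) > 0.6).astype(float)   # synthetic labels

w = rng.normal(scale=0.1, size=3)                    # the hypothesis lives in these weights
eta, decay = 0.5, 1e-3                               # learning rate and weight decay

for epoch in range(1000):
    out = sigmoid(X @ w)                             # forward pass
    grad = X.T @ ((out - y) * out * (1 - out)) / len(y) + decay * w   # dE/dw plus decay
    w -= eta * grad                                  # gradient descent step

print("final weights:", w)
print("training error E(w):", 0.5 * np.mean((sigmoid(X @ w) - y) ** 2))
```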
Try to model the interrelationships between variables probabilistically. Expert/domain knowledge can be modelled directly in the classifier as prior belief in certain events. The basic axioms of probability theory are used to extract probabilistic estimates.
Lots of different algorithms – Relevance Vector Machine (RVM), Naïve Bayes, Simple Bayes, Bayesian Belief Networks (BBN)… Has a large following – especially at Microsoft Research.
Tractability – to find solutions we need numerical approximations or computational shortcuts. Can model causal relationships between variables. Need lots of data to estimate the probabilities from observed training data frequencies.
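As a small illustration of the probabilistic view – scikit-learn's Gaussian Naive Bayes and the breast cancer dataset are just my stand-ins – the classifier estimates class priors and per-feature likelihoods from the observed training data and combines them with Bayes' rule to give P(class | x):

```python
# Naive Bayes: estimate priors and per-feature likelihoods from the training
# data, then combine them with Bayes' rule to get class probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
print("P(class | x) for one test example:", nb.predict_proba(X_test[:1]))
```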
Feature selection/extraction – using Principal Component Analysis, wavelets, Canonical Correlation, Factor Analysis, Independent Component Analysis. Imputation – what to do with missing features? Visualisation – make the hypothesis human readable/interpretable. Meta learning – how to add functionality to existing algorithms, or combine the predictions of many classifiers (Boosting, Bagging, Confidence and Probability Machines).
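For example, here is a quick sketch of feature extraction with Principal Component Analysis – scikit-learn and the digits data are assumptions on my part – projecting the 64 raw pixel features onto a handful of directions of maximum variance before learning:

```python
# PCA feature extraction: keep only the leading directions of variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)        # 64 raw pixel features per image
pca = PCA(n_components=10)                 # keep the 10 leading principal components
X_reduced = pca.fit_transform(X)

print("original shape:", X.shape, "-> reduced shape:", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```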
How to incorporate domain knowledge into a learner – in SVM it is kernel design, in Nearest Neighbours it is the distance metric. The trade-off between complexity (accuracy on the training set) and generalisation (accuracy on the test set) appears under many different guises in each learning algorithm: in SVM it is the slack variables, in Neural Networks the weight decay or network structure, in Nearest Neighbours the number of neighbours analysed, and so on. Pre-processing of the data – normalising, standardising, discretising. How to test – leave-one-out, cross validation, stratification, online, offline.
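On the testing question, here is a short sketch of stratified cross validation and leave-one-out – scikit-learn and the 5-NN classifier are illustrative choices only:

```python
# Two of the testing strategies mentioned above: stratified 10-fold cross
# validation and leave-one-out, both estimating generalisation accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold stratified CV accuracy:", cross_val_score(clf, X, y, cv=cv).mean())
print("leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```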
No need for a teacher/supervisor. Mainly clustering – trying to group objects into sensible clusters – and novelty detection – finding strange examples in the data. (Give the story about the insurance company that invested lots of money in subgrouping to identify safe subgroups.)
For clustering: EM algorithm, K-Means, Self Organising Maps (SOM). For novelty detection: 1-Class SVM, support vector regression, Neural Networks.
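A minimal clustering sketch, with scikit-learn and a synthetic blob dataset as my illustrative choices: K-Means groups the unlabelled points into k clusters without any teacher involved:

```python
# K-Means clustering on unlabelled data: the true labels are never used.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels discarded
km = KMeans(n_clusters=3, n_init=10, random_state=0)
assignments = km.fit_predict(X)

print("cluster sizes:", [int((assignments == c).sum()) for c in range(3)])
print("cluster centres:\n", km.cluster_centers_)
```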
Very useful for extracting information from data. Used in medicine to identify disease subtypes, to cluster web documents automatically, and to identify customer target groups in business. Not much publicly available data to test algorithms with.
Most biologically plausible – feedback is given through reward/punishment stimuli. A field with a lot of theory but in need of real-life applications (other than playing backgammon). It also encompasses the large field of Evolutionary Computing. Applications are more open ended. Getting closer to what the public consider AI.
Techniques use dynamic programming to search for an optimal strategy; the algorithms search to maximise their reward. Q-Learning (Chris Watkins, next door) is the most well known technique. The only successful applications are to games and toy problems – there is a lack of real-life applications, and very few researchers in this field.
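Here is a toy Q-learning sketch on a five-state corridor – all the parameters and the environment itself are my own illustrative choices. The agent only ever receives a reward signal for reaching the right end, yet the standard update Q(s,a) ← Q(s,a) + α(r + γ max Q(s',·) − Q(s,a)) recovers the optimal "move right" strategy:

```python
# Toy Q-learning on a 5-state corridor: reward 1 only for reaching the right end.
import numpy as np

n_states = 5
actions = (-1, +1)                         # move left / move right
Q = np.zeros((n_states, len(actions)))
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:               # rightmost state is the terminal goal
        # epsilon-greedy action choice, breaking ties randomly
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(len(actions)))
        else:
            a = int(Q[s].argmax())
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0       # reward only at the goal
        # Q-learning update: bootstrap from the best action in the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy policy:", ["right" if Q[s].argmax() == 1 else "left" for s in range(n_states - 1)])
```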
Inspired by the process of biological evolution. Essentially an optimisation technique – the problem is encoded as a chromosome, and we find new/better solutions to the problem by sexual reproduction (crossover) and mutation.
How to encode the problem is very important. Setting the mutation/crossover rates is very ad hoc. Very computationally/memory intensive. Not much theory can be developed – frowned upon by machine learning theorists.
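To make the encoding and the ad hoc rates concrete, here is a toy genetic-algorithm sketch – the OneMax problem, plus mutation/crossover rates I have simply picked by hand: candidate solutions are bit-string chromosomes, and new candidates are produced by crossover and mutation:

```python
# Toy genetic algorithm for OneMax: maximise the number of 1-bits in a chromosome.
import random

random.seed(0)
LENGTH, POP, GENERATIONS = 30, 40, 60
MUTATION_RATE, CROSSOVER_RATE = 0.02, 0.9       # chosen by hand, as is typical

def fitness(chrom):                              # maximise the number of 1-bits
    return sum(chrom)

def crossover(a, b):                             # single-point recombination
    if random.random() < CROSSOVER_RATE:
        point = random.randint(1, LENGTH - 1)
        return a[:point] + b[point:]
    return a[:]

def mutate(chrom):                               # flip each bit with small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

def pick(population):                            # tournament selection of a parent
    return max(random.sample(population, 3), key=fitness)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for gen in range(GENERATIONS):
    population = [mutate(crossover(pick(population), pick(population))) for _ in range(POP)]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "out of", LENGTH)
```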