This project aims to determine the housing prices of California properties for new sellers and also for buyers to estimate the profitability of the deal using various regression models.
Below are the details of the models implemented and their performance scores:
Linear Regression: RMSE 68321.7051304
Decision Tree Regressor: RMSE 70269.5738668
Random Forest Regressor: RMSE 52909.1080535
Support Vector Regressor: RMSE 110914.791356
Random Forest Regressor with fine-tuned hyperparameters: RMSE 49261.2835608
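RMSE, the metric used to score the models above, is the square root of the mean squared difference between actual and predicted prices. A minimal sketch of how it can be computed (the function name and sample prices are illustrative, not the project's data):

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted prices."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Toy prices, not the project's data:
print(round(rmse([250000, 180000, 320000], [240000, 200000, 310000]), 2))  # → 14142.14
```

Lower RMSE means predictions closer to actual sale prices, which is why the tuned Random Forest scores best above.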
House Price Estimates Based on Machine Learning Algorithm - ijtsrd
Housing prices increase every year, necessitating a long-term housing price strategy. Predicting a home's price assists a developer in determining a home's purchase price, and a consumer in determining the best time to buy. The sale price of real estate in major cities depends on the specific circumstances. Housing prices change from day to day and are sometimes set arbitrarily rather than based on estimates. Predicting real estate prices from real factors is a key element of our analysis. We base our test on all of the simple metrics that are taken into account when deciding significance. In this research we use a linear regression pipeline, and the result is not the output of a single model but a weighted combination of various techniques, to give the most accurate results. There are fifteen features in the data collection. In this research, an effort has been made to build a forecasting model that determines the price based on the variables that influence it. The results show lower error and higher accuracy than when individual algorithms are used. Jakir Khan | Dr. Ganesh D, "House Price Estimates Based on Machine Learning Algorithm", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42367.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/42367/house-price-estimates-based-on-machine-learning-algorithm/jakir-khan
Predicting Moscow Real Estate Prices with Azure Machine Learning - Leo Salemann
With only three months' instruction, a five-person team uses Azure Machine Learning Studio to predict Moscow real estate prices based on property descriptors, macroeconomic indicators, and geospatial data.
House Price Prediction: An AI Approach - Nahian Ahmed
Suppose you have a house and you want to sell it. Through the House Price Prediction project you can predict its price from previous sale history.
We make this prediction using machine learning.
A presentation on predicting house prices. It also covers the basics of machine learning and the regression algorithm used to predict those prices.
ABSTRACT
House Price Index is commonly used to estimate changes in housing prices. Since housing price is strongly correlated with other factors such as location, area, and population, predicting an individual housing price requires information beyond the index itself. A considerable number of papers have adopted traditional machine learning approaches to predict housing prices accurately, but they rarely examine the performance of individual models and neglect the less popular yet complex ones. As a result, to explore the various impacts of features on prediction methods, this paper applies both traditional and advanced machine learning approaches to investigate the differences among several advanced models. This paper also comprehensively validates multiple techniques in regression model implementation and provides an optimistic result for housing price prediction.
INTRODUCTION
House price prediction is a great project for learning and applying machine learning algorithms. The basic idea behind this project is that we train the machine with a machine learning algorithm on the data set.
In this busy world it is very difficult to find a house that matches our needs and budget, and it becomes even more difficult in metropolitan cities like Mumbai, Kolkata, and Delhi. This project uses data for the city of Mumbai to train and test the machine so that it becomes capable of predicting house prices. A machine learning algorithm makes it easy to estimate the price of a house from its location, area, number of bedrooms, and other features.
In this project, the Random Forest Regression, Linear Regression, and Decision Tree machine learning algorithms have been used to compare their efficiency. Based on this comparison we determine which algorithm is best suited for predicting house prices in Mumbai.
CONCLUSION AND FUTURE SCOPE
The accuracy of the designed model depends on the dataset selected: the better the dataset, the better the accuracy. The best-suited model is Random Forest. The approach can be applied to the dataset of any city for house price prediction. The project can be enhanced with a UI through which users can predict prices in an easier, more interactive way. In this busy world it will be of immense use when searching for a house near one's workplace.
DATASET LINK
https://www.kaggle.com/
Prediction of house price using multiple regression - vinovk
- Constructed a mathematical model using Multiple Regression to estimate the Selling price of the house based on a set of predictor variables.
- SAS was used for Variable profiling, data transformations, data preparation, regression modeling, fitting data, model diagnostics, and outlier detection.
The main objective of this paper is to recognize and predict handwritten digits from 0 to 9, where a data set of 5000 MNIST examples was given as input. Since every person has a different style of writing digits, humans can recognize them easily, but for computers it is a comparatively difficult task; here a neural network approach is used, in which the machine learns on its own by gaining experience, and accuracy increases with the experience it gains. The dataset was trained using a feed-forward neural network algorithm. The overall system accuracy obtained was 95.7%. Jyoti Shinde | Chaitali Rajput | Prof. Mrunal Shidore | Prof. Milind Rane, "Handwritten Digit Recognition", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2, Issue-2, February 2018. URL: http://www.ijtsrd.com/papers/ijtsrd8384.pdf http://www.ijtsrd.com/engineering/electronics-and-communication-engineering/8384/handwritten-digit-recognition/jyoti-shinde
Prediction of Diamond Prices Using Multivariate Regression - MohitMhapuskar
The prices of precious diamonds are primarily determined by a combination of the four Cs: Carat, Color, Cut, and Clarity. Our team used SAS to implement feature selection and multivariate regression to create a regression model that allows us to predict diamond prices from those intrinsic characteristics. Our model achieved an accuracy of 94%.
Activation Functions and Training Algorithms for Deep Neural Networks - Gayatri Khanvilkar
Training a deep neural network is a difficult task. Deep neural networks are trained with the help of training algorithms and activation functions. This is an overview of the activation functions and training algorithms used for deep neural networks, and it includes a brief comparative study of both.
HEALTH PREDICTION ANALYSIS USING DATA MINING - Ashish Salve
Data mining techniques are used for a variety of applications. In the healthcare industry, data mining plays an important role in predicting diseases. Detecting a disease normally requires a number of tests on the patient, but with data mining techniques the number of tests can be reduced, which saves time and improves performance. This report analyses data mining techniques that can be used for predicting different types of diseases, and reviews research papers that mainly concentrate on predicting various diseases.
.NET Fest 2017. Igor Kochetov. Classification of performance testing results... - NETFest
In this talk we discuss basic algorithms and application areas of Machine Learning (ML), then walk through a practical example of building a system that classifies performance measurement results collected in Unity via the internal Performance Test Framework, in order to find performance regressions or unstable tests. We also try to work out criteria for evaluating the performance of ML algorithms, and ways to debug them.
CSSC × GDSC: Intro to Machine Learning!
Aaron Shah and Manav Bhojak on October 5, 2023
🤖 Join us for an exciting ML Workshop! 🚀 Dive into the world of Machine Learning, where we'll unravel the mysteries of CNNs, RNNs, Transformers, and more. 🤯
Get ready to embark on a journey of discovery! We'll begin with an easy-to-follow introduction to the fascinating realm of ML. 📚
🛠️ In our hands-on session, we'll walk you through setting up your environment. No tech hurdles here! 🌐
🔍 Then, we'll get down to the nitty-gritty, guiding you through our starter code for a thrilling hands-on example. Together, we'll explore the power of ML in action! 💡
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Slides covered during Analytics Boot Camp conducted with the help of IBM, Venturesity. Special credits to Kumar Rishabh (Google) and Srinivas Nv Gannavarapu (IBM)
How to Win Machine Learning Competitions? - HackerEarth
This presentation was given by Marios Michailidis (a.k.a. Kazanova), current Kaggle rank #3, to help the community learn machine learning better. It comprises useful ML tips and techniques for performing better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
Production-Ready BIG ML Workflows - from zero to hero - Daniel Marcous
Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero, is about the work process you need to take in order to have a production ready workflow up and running.
Covering :
* Small - Medium experimentation (R)
* Big data implementation (Spark Mllib /+ pipeline)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
V2.0 OpenPOWER AI Virtual University - deep learning and AI introduction - Ganesan Narayanasamy
OpenPOWER AI Virtual University focuses on bringing together industry, government, and academic expertise to connect and help shape the AI future.
https://www.youtube.com/channel/UCYLtbUp0AH0ZAv5mNut1Kcg
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computations and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be calculated easily; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
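As a baseline for the optimizations discussed above, a minimal sketch of standard (monolithic) power-iteration PageRank, in which every vertex is processed each iteration; the function name, graph, and parameters are illustrative:

```python
def pagerank(graph, d=0.85, tol=1e-10, max_iter=100):
    """graph: dict mapping each vertex to its list of out-neighbours.
    Dangling vertices spread their rank uniformly over all vertices."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in graph if not graph[v])
        # Teleport term plus the dangling vertices' redistributed rank:
        new = {v: (1 - d) / n + d * dangling / n for v in graph}
        for v, out in graph.items():
            for w in out:
                new[w] += d * rank[v] / len(out)
        if sum(abs(new[v] - rank[v]) for v in graph) < tol:
            rank = new
            break
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The convergence-skipping and component-ordering techniques above all aim to avoid the full per-vertex sweep that this baseline performs in every iteration.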
2. What is it?
● Grew out of work in Artificial Intelligence (AI)
● 1959, Arthur Samuel – Machine Learning:
● „Field of study that gives computers the ability to learn without being explicitly programmed.”
● 1998, Tom Mitchell – Well-posed learning problem:
● „A computer program is said to 'learn' from experience 'E' with respect to some task 'T' and some performance measure 'P', if its performance on 'T', as measured by 'P', improves with experience 'E'.”
3. What is it?
● Example: Email program
– 'E' (experience) – watches you label emails as spam/not spam
– 'T' (task) – classifies emails as spam/not spam
– 'P' (performance) – fraction of emails correctly classified as spam/not spam
4. What is it?
● Solves complicated, underspecified problems
● Some problems can't be solved directly by software
● Instead of writing a program for each problem:
● Collect samples of correct input->output
● Use algorithm to create a program to do the same
● Program handles new cases (other than those in
the training data), retrain if new data
● Massive amounts of data + computation is
cheaper than developing software
http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
5. Problems for Machine Learning
● Pattern recognition
● Objects in real scenes
● Computer vision – facial identities / expressions
● Speech recognition
– Sample sounds
– Partition phonemes
– Decoding – extract meaning, NLP
● Natural language
6. Problems for Machine Learning
● Recognizing anomalies
● Unusual sequences
– Credit / phone fraud
– SPAM / HAM
● Sensor readings
– Power plant operation and health
– Detect when actions are required
7. Problems for Machine Learning
● Prediction
● Stock price movements (time sequence)
● Currency exchange rates
● Risk analytics
● Sentiment analysis
● Click throughs (web traffic)
● Preferences
– Netflix, Amazon, Pandora, web ad targeting, etc.
8. Problems for Machine Learning
● Information Retrieval (database mining)
● Genomics
● News/Twitter data feeds
● Archived data
● Web clicks
● Medical records
● Find similar, summarize groups of material
9. Learning - Supervised
● Predict output given the input, train using inputs
with known outputs
● Regression – target is a real number, goal is to
be 'close'
● Classification – target is a class label: binary
(yes/no) or multi-class (one of many)
10. Learning – Unsupervised
● Older texts explicitly exclude this from being
learning!
● Discover good internal representation of input
● Difficult to determine what the goal is
● Create a representation that can be used in
subsequent supervised learning?
● Dimensionality reduction (PCA) can be used for
compression or to simplify analysis
● Provide an economical high dimensional
representation (binary features, real features –
single largest parameter)
11. Learning – Reinforcement
● Select action to maximize payoff
● Maximize expected sum of future rewards
● Not every action results in a payoff
● Apply discounting to minimize effect of far future on
present decisions
● Difficult – payoffs are delayed, critical decision
points unknown, scalar payoff contains little
information
12. Learning – Reinforcement
● Planning
● Choice of actions by anticipating outcomes
● Actions and planning can be interleaved
(incomplete knowledge)
– Warehouse, dock management, Route
planning/replanning
● Multiple simultaneous agents planning
independently
– Emergency responders
– http://www.aiai.ed.ac.uk/project/i-globe/resources/2007-03-06-Iglobe/2007-03-06-Iglobe-Demo.avi
13. Learning – Data
● Training data [ ~60% - 80% ]
● Inputs (with correct response for supervised)
● Validation data [ ~20% ]
● Converge by training on multiple sets of data,
improving each time
● Test data [ ~10% - 20% ]
● Not used until training and validation are complete –
measure performance with this data set
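The 60–80% / ~20% / 10–20% partition above can be sketched as a shuffle followed by two cuts; the function name, ratios, and seed are illustrative:

```python
import random

def partition(data, train=0.7, valid=0.2, seed=42):
    """Shuffle, then split into training / validation / test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    i = int(n * train)
    j = i + int(n * valid)
    return rows[:i], rows[i:j], rows[j:]

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # → 70 20 10
```

The fixed seed makes the split reproducible; the test portion is everything left after the training and validation cuts.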
14. Learning – Data
● Partition randomly
● Time series data use random subsequences
● Training and test data should be from same
population
● If feature selection or model tuning required
(e.g. PCA parameter mapping) then the tuning
must be done for each training set
15. Learning – Training
● One iteration for each set of input data in the
training data set
● Start with random parameters
● Randomize input data during training
● Calculate model parameters for each input
● Use previous parameter values to calculate
next values using new training input
16. Learning – Bias and Variance
● Bias – algorithm errors
● High bias – underfit
● More training data does not help
● Variance – sensitivity to fluctuations in data
● High variance – overfit
● More training data likely to help
● Irreducible error - noise
18. Learning – (Cross) Validation
● Validation
● Holdout data for tuning model with new data
● Evaluate model using holdout as test set
● Cross validation
● generating models with different holdouts to avoid
overfitting
● n-fold - divide data into n chunks and train n times,
treating a different chunk as the holdout each time
(leave-one-out – same with chunk size of 1)
● Random subsampling – approaches leave-p-out
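The n-fold scheme above (divide into n chunks, hold a different chunk out each time) can be sketched as follows; the function name is illustrative:

```python
def kfold_indices(n, k):
    """Yield (train, holdout) index lists for k-fold cross validation:
    each chunk serves as the holdout exactly once."""
    idx = list(range(n))
    size = n // k
    for f in range(k):
        # Last fold absorbs any remainder so every index is used.
        hold = idx[f * size:(f + 1) * size] if f < k - 1 else idx[f * size:]
        held = set(hold)
        train = [i for i in idx if i not in held]
        yield train, hold

for train, hold in kfold_indices(10, 5):
    print(hold)  # each pair of indices is held out exactly once
```

Setting the chunk size to 1 (k = n) gives the leave-one-out variant mentioned above.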
19. Learning - Improvements
● Things to do when the error is too high
● Get more training data (high variance)
● Try smaller sets of features (high variance)
● Try getting additional features (high bias)
● Add polynomial features (high bias)
● Decrease smoothing parameter λ (high bias)
● Increase smoothing parameter λ (high variance)
20. Learning – Testing
● Reserve set of data [~10% - 20% ]
● Evaluate model performance with the test set
● Make no further model changes
● Performance evaluation
● Supervised learning – compare predictions with
known results
● Predictions of unsupervised model when results
can be known – even if not used in training
22. Training – Gradient Descent
● Linear cost function
● Well behaved
● Single global minimum, easily reached
23. Training – Gradient Descent
● Complex cost functions
● Not well behaved
● Global minimum, many local minima
24. Training – Gradient Descent
● Convergence speed and stability controlled by
slope parameter α
● Low α: slow but stable convergence ● High α: fast but may overshoot or diverge
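The effect of the slope parameter α can be seen in a minimal 1-D sketch (the function and values are chosen for illustration only):

```python
def gradient_descent(grad, x0, alpha, steps=100):
    """Minimise a 1-D function from its gradient; alpha is the step size."""
    x = x0
    for _ in range(steps):
        x -= alpha * grad(x)
    return x

# f(x) = (x - 3)^2, gradient 2(x - 3), minimum at x = 3.
grad = lambda x: 2 * (x - 3)
print(gradient_descent(grad, x0=0.0, alpha=0.1))  # converges to ~3
print(gradient_descent(grad, x0=0.0, alpha=1.1))  # step too large: diverges
```

With a small α each step shrinks the distance to the minimum; once α is large enough that a step overshoots by more than the current distance, the iterates oscillate with growing amplitude.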
25. Training – k-means
● Classify data into k different groups
● Start with k random points
● Group data with the closest point
● Move the points to the centroid of the data for that
point
● Terminate when the points no longer move (or
move only a small amount)
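The assign-then-recentre loop above can be sketched on 1-D data (the data and starting centres are illustrative):

```python
def kmeans(points, centers, iters=20):
    """Plain k-means on 1-D data: assign each point to the nearest
    centre, then move each centre to the centroid of its group."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:  # centres no longer move: terminate
            break
        centers = new
    return centers

data = [1.0, 1.5, 0.5, 9.0, 9.5, 8.5]
print(kmeans(data, centers=[0.0, 10.0]))  # → [1.0, 9.0]
```

In practice the starting points are chosen at random (as the slide says), and an empty group keeps its old centre.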
27. Training – k-nn
● k nearest neighbors determine classification of
each element in data
● Skewed data can result in homogenous result
● Use weighting to avoid this
● Training – store the training data
● For each data point to be predicted
● Locate the nearest k other points
– Use any consistent distance metric – l-p norms (euclidean, manhattan distances, maximum single direction)
● Assign the majority class of those nearest points
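The locate-then-vote procedure above can be sketched on 1-D points (the training data and labels are made up for illustration):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs. Classify query by majority
    vote of its k nearest neighbours (absolute distance on 1-D points)."""
    nearest = sorted(train, key=lambda t: abs(t[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(1.0, "cheap"), (1.5, "cheap"), (2.0, "cheap"),
         (8.0, "costly"), (9.0, "costly")]
print(knn_predict(train, query=1.8))  # → cheap
```

"Training" here really is just storing the data, as the slide notes; distance-weighted votes (not shown) are one way to handle the skewed-data problem mentioned above.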
30. Regression
● Single / Multiple variable
● Linear / Logistic
● Regularization (smoothing) – helps to avoid
overfitting
31. Regression – Equations
● Linear regression
hypothesis function
● Logistic regression
hypothesis function
● Regularized linear
regression cost
function
● Regularized logistic
regression cost
function
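The formulas behind these four names did not survive extraction; in the usual notation they are h(x) = θᵀx (linear), h(x) = 1/(1 + e^(−θᵀx)) (logistic), with the regularised costs adding a (λ/2m)·Σθⱼ² penalty to the fit term. A minimal sketch under those assumptions (names illustrative):

```python
import math

def h_linear(theta, x):
    """Linear-regression hypothesis: h(x) = theta . x
    (x carries a leading 1 so theta[0] is the intercept)."""
    return sum(t * xi for t, xi in zip(theta, x))

def h_logistic(theta, x):
    """Logistic-regression hypothesis: sigmoid of the linear hypothesis."""
    return 1.0 / (1.0 + math.exp(-h_linear(theta, x)))

def cost_linear_reg(theta, X, y, lam):
    """Regularised linear-regression cost:
    J = (1/2m)*sum((h - y)^2) + (lam/2m)*sum(theta_j^2) for j >= 1
    (the intercept theta[0] is conventionally not penalised)."""
    m = len(X)
    fit = sum((h_linear(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)
    penalty = lam * sum(t * t for t in theta[1:]) / (2 * m)
    return fit + penalty
```

The regularised logistic cost has the same λ penalty attached to the log-loss instead of the squared error.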
32. Neural Networks - Representation
● Nodes – compared to neurons, many inputs,
one output
● Transfer characteristic – logistic function
● Input from left, output to right
● Layers
– Input layer, driven by numeric input values
– Output layer, provides numeric output values (or
thresholded for classification output)
– Hidden layers between input and output – no discernable
meaning for their values
34. Neural Networks – Learning
● Learns using gradient descent
● Forward propagation – start at inputs, derive
parameters of next stage
● Backward propagation – start at outputs, adjust
parameters to produce desired output
35. Neural Networks - Learning
● OCR training set
● what does the number '2' look like when
handwritten?
36. Neural Networks - Learning
● Neural Network parameters are not simply
interpretable
37. Support Vector Machines
● Supervised learning classification and
regression algorithm
● Cocktail Party Problem
● Many speakers, many sensors (microphones)
● Classify source from the inputs
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
38. Principal Component Analysis
● Unsupervised learning
● Finds basis vectors for data
● Largest is the 'principal' component
● Center each attribute on mean for visualization,
not for prediction models
● Normalized to same range to provide
comparable contributions from each factor
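The centring and range-normalisation step described above can be sketched column-wise (the function name and data are illustrative):

```python
def normalise(columns):
    """Centre each attribute on its mean and scale by its range so
    every factor contributes comparably before PCA."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        span = max(col) - min(col) or 1.0  # guard against constant columns
        out.append([(v - mean) / span for v in col])
    return out

cols = normalise([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(cols)  # both columns become [-0.5, 0.0, 0.5]
```

After this step the two attributes, despite differing by a factor of ten originally, contribute equally to the basis vectors PCA finds.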
47. Classification - Performance
● Receiver Operating Characteristic (ROC)
● Location of classification performance
● Perfect predictions indicated in upper left corner
● Up and to the left means better
● Diagonal from lower left to upper right indicates
performance equivalent to random guessing
49. Classification - Performance
● Area Under the Curve (AUC)
● ROC chart with curves applied
● Classifications based on thresholds for continuous
random variables
● Curve is parametric plot with the threshold as the
varying parameter
● AUC is a scalar summary of predictive value
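Given the parametric (FPR, TPR) points traced out by sweeping the threshold, the AUC summary can be sketched with the trapezoid rule (function name illustrative):

```python
def auc(roc_points):
    """Area under an ROC curve given (fpr, tpr) points, by the trapezoid rule."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Random guessing: the diagonal gives AUC = 0.5.
print(auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))  # → 0.5
```

A perfect classifier's curve hugs the upper-left corner and yields AUC = 1.0, matching the "up and to the left is better" reading of the ROC chart above.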
51. Natural Language Processing
● Text processing
● Modeling
● Generative models – generate observed data from
hidden parameters
– N-gram, Naive Bayes, HSMM, CFG
● Discriminative models – estimate probability of
hidden parameters from observed data
– Regressions, maximum entropy, conditional random
fields, support vector machines, neural networks
52. NLP - Language Modeling
● Probability of sequences of words (fragments,
sentences)
● Markov assumption
● Product of each element probability conditional on
small preceding sequence
– N-grams – bigrams: single preceding word; trigrams: two preceding words
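Under the Markov assumption above, a bigram model scores a sequence as the product of each word's probability given only the preceding word, estimated from counts. A minimal sketch (the counts and sentence are made up for illustration):

```python
def bigram_prob(sentence, unigram_counts, bigram_counts):
    """Bigram (Markov) probability of a word sequence:
    P(w1..wn) ~= product of count(w_{i-1}, w_i) / count(w_{i-1})."""
    p = 1.0
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts.get((prev, cur), 0) / unigram_counts.get(prev, 1)
    return p

unigram_counts = {"the": 2, "house": 1}
bigram_counts = {("the", "house"): 1, ("house", "sold"): 1}
print(bigram_prob("the house sold", unigram_counts, bigram_counts))  # → 0.5
```

Real models add smoothing so that unseen bigrams do not zero out the whole product; this sketch omits it for clarity.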
53. NLP - Information Extraction
● Find and understand relevant parts of texts
● Gather information from many sources
● Produce structured representation
● Relations, knowledge base
● Resource Description Framework (RDF)
● Retrieval
● Finding unstructured material in a large collection
● Web/email search, knowledge bases, legal data,
health data, etc.