- A high-level overview of artificial intelligence
- The importance of predictions across different domains of life
- Big (text) data
- Competition as a discovery process
- Domain-general learning
- Computer vision and natural language processing
- Elements of a machine learning system
- A hierarchy of problem classes
- Data collection
- The purpose of a model
- Logistic loss function
- Likelihood, log likelihood and maximum likelihood
- Ockham's Razor
- Intelligence as sequence prediction
- Building blocks of neural networks: neurons, weights and layers
- Logistic regression as a neural network
- Sigmoid function
- A look at backpropagation
- Gradient descent
- Convolutional neural networks
- Max-pooling
- Deep neural networks
3. TO LIVE IS TO PREDICT
Biology
- Food: edible or poisonous?
- Fight/flight/freeze
- Position within the hierarchy
- Mating choice
Forecasting
- Financial markets
- Betting markets
- Economic forecasts
- Business plans
- Election results
- Sports betting
Modern life
- Career/job choice
- Moving
- Medical interventions
- How to compete with machines?
• Effective decision-making requires accurate predictions.
• Humans and other species have evolved adaptations to cope with uncertainty:
• Memory: data storage linked to the ability to generate predictions
• Mental time travel: the ability to project oneself into the future
• Automated processes associated with certain emotions
4. BIG TEXT DATA
- 16,000 words spoken per person per day
- 100 trillion words spoken by humanity per day
- 28 million papers (1980-2012)
- 130 million books indexed by Google
- 1 billion websites on the World Wide Web
- 500 million videos hosted on YouTube
5. COMPETITION AS A DISCOVERY PROCESS
• Machine learning is organized around competitions.
• A dataset is split up into two parts:
• a training set and a test set
• Competitors submit models trained on the first set.
• Open source ensures perfect replicability.
• Models are then evaluated based on the second set.
• Competition winners tend to dominate the discourse …
• … until the next improvement is published.
• “State of the art”: best competition performances
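To make the protocol concrete, here is a minimal sketch of the train/test split, assuming scikit-learn and one of its bundled datasets; the 80/20 split and the choice of logistic regression are illustrative, not a prescribed setup.

```python
# A minimal sketch of the competition protocol, assuming scikit-learn is available.
# The dataset, the 80/20 split and the model choice are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# split the dataset into a training set and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)   # competitors train only on the training set
model.fit(X_train, y_train)

# models are evaluated (and ranked) on the test set
print(accuracy_score(y_test, model.predict(X_test)))
```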
6. DOMAIN-GENERAL LEARNING: COMPUTER VISION 1/2
• Domain-general learning strategies
• Example: convolutional neural networks
Rapid progress in image classification
• LeCun et al. (1998): a CNN trained on the MNIST dataset (60,000 small images of digits, 10 classes) achieves an error rate of 1-2% when tested on 10,000 images
• Krizhevsky et al. (2012): a deep CNN trained on 1.2 million high-resolution images and 1,000 classes achieves a 17%
top-5 error rate when tested on 150,000 images
Skin cancer classification
• Esteva et al. (2016): Deep CNN, trained on 130,000 clinical images
• Human-level performance when tested against 21 dermatologists on two binary classification tasks
• keratinocyte carcinomas: the most common skin cancer
• melanomas: deadliest skin cancer
7. DOMAIN GENERALITY: COMPUTER VISION 2/2
• Is there something special about skin cancer classification?
• Litjens et al. (2017) summarizes the use of deep learning for medical image analysis:
• at least 300 papers, most published in 2016
• CNNs were already being applied in the 1990s:
• Lo et al. (1995): CNN trained to recognize lung nodules in x-rays
• In most cases, the only input to the learning algorithms is a set of pairs.
• Each pair consists of an image and a label:
• Malignant vs. benign
• Stage 1/2/3/4
8. DOMAIN GENERALITY: NATURAL LANGUAGE PROCESSING
Text classification
• Kim (2014): a shallow CNN outperforms state-of-the-art (SOTA) results in sentence classification tasks
• Conneau et al. (2017): very deep CNNs improve upon SOTA results in short text classification tasks
Sequence labeling
• Map a sequence of words to a sequence of tags:
• Strubell et al. (2017): A new CNN variant achieves almost SOTA results, but 10-20X faster
Machine translation
• Kalchbrenner et al. (2017): SOTA performance on an English-German translation benchmark
Example (sequence labeling): Tim/B-Name Cook/I-Name is/O Chief/B-Title Executive/I-Title Officer/I-Title of/O Apple/B-Org ./O
10. The basic workflow in a machine learning project:
WORKFLOW
1. PROBLEM FORMULATION: Can you describe the problem in terms of existing solutions?
2. DATA COLLECTION: How can you obtain a large, high-quality dataset?
3. MODEL TRAINING & SELECTION: What is a good model? How do you measure success?
4. IMPROVE (when needed): error analysis, more data, “better” models
11. PROBLEM FORMULATION
TYPE | OUTPUT | COMMENT | APPLICATIONS
Multi-class classification | probability distribution | can be thought of as a special case of sequence prediction | topic classification, very good to very bad, object classification, staging
Binary classification | probability | a special case of multi-class classification | yes/no, positive/negative, present/absent, similar/dissimilar
Sequence prediction | sequence of probability distributions | at the core of intelligence (artificial and biological) | sequence labeling, machine translation, speech synthesis, image segmentation
Clustering | cluster membership | another special case: the number of clusters is a hyperparameter | segmentation: customers, images; detection: communities, anomalies
Reinforcement learning | sequence of actions | sequence predictions in an active environment: actions affect observations | robotics, driverless cars, conversational agents, game playing
12. DATA COLLECTION
• A data set D is a collection of n data points d_i.
• Each data point d_i = (x_i, t_i) in D is a pair consisting of features x_i and a target t_i.
• Easy problems require at least hundreds or thousands of data points.
• Harder problems require millions of data points.
• Occasionally, data will be provided or is available in existing databases.
• Usually, a data collection strategy has to be devised and implemented.
Supervised
• Humans manually label the inputs with appropriate targets.
• “This is a cat. That’s a dog. This is another cat.”
Weakly supervised
• Minimal human intervention
• Download all images with the hashtags #cat, #dog
• Synthetic data
Unsupervised
• No human supervision
• Learn potentially relevant patterns from gigantic datasets
13. WHAT IS A MODEL, ANYWAY?
• Using parameters θ, a model f generates a prediction y from an input x:
• f(x, θ) = y
• Parameters allow the model to “weight the evidence”.
• Example: a simple binary classification problem
• Does a given article from a news archive focus on politics? Yes or no?
• Consider the parameters (weights) for the following words: “election”, “soccer”, “the”
• Which of these parameters will be positive, and which will be negative?
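A minimal sketch of f(x, θ) = y for this example follows; the words, weights, and bias are illustrative assumptions, not values learned from a news archive.

```python
# A minimal sketch of f(x, theta) = y for the "politics article?" example.
# The words, weights and bias are illustrative assumptions, not learned values.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = {"election": 2.0,   # evidence for politics: positive weight
         "soccer": -1.5,    # evidence against politics: negative weight
         "the": 0.0}        # uninformative word: weight near zero
bias = -0.5

def f(text, theta, bias):
    words = text.lower().split()
    z = bias + sum(theta.get(w, 0.0) for w in words)
    return sigmoid(z)       # predicted probability that the article is about politics

print(f("The election dominated the news", theta, bias))   # high probability
print(f("The soccer season starts today", theta, bias))    # low probability
```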
15. LOSS FUNCTION: LOGISTIC LOSS
Loss
- Some predictions are better than others.
- The deviation of the prediction p from the target t is referred to as loss.
- Synonyms: cost, error, empirical risk
- The loss is calculated through a loss function.
- Logistic loss is one of the most important loss functions.
Logistic loss
- The target can be either 1 (“did occur”) or 0 (“did not occur”).
- If the target equals 1: -log(prediction)
- If the target equals 0: -log(1 - prediction)
- loss(prediction, target) = -[target · log(prediction) + (1 - target) · log(1 - prediction)]
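A minimal sketch of the logistic loss as a Python function; it uses base-10 logarithms so the numbers match the worked examples on the next slide, whereas ML libraries typically use natural logarithms.

```python
# A minimal sketch of the logistic loss. Base-10 logarithms are used here so the
# numbers match the worked examples on the next slide; libraries typically use ln.
import math

def logistic_loss(prediction, target, log=math.log10):
    return -(target * log(prediction) + (1 - target) * log(1 - prediction))

print(logistic_loss(0.9, 1))   # ~0.046: good prediction, small loss
print(logistic_loss(0.1, 1))   # 1.0: bad prediction, large loss
print(logistic_loss(0.4, 0))   # ~0.222: mediocre prediction, moderate loss
```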
16. LOGISTIC LOSS: EXAMPLES
loss(prediction, target) = -[target · log(prediction) + (1 - target) · log(1 - prediction)]
(The examples below use base-10 logarithms.)
Good prediction
- Target: 1
- Prediction: 90%
- Loss: -log(0.9) ≈ 0.046
- This is a good prediction. Consequently, the loss is small.
Bad prediction
- Target: 1
- Prediction: 10%
- Loss: -log(0.1) = 1
- This prediction is inaccurate and the loss, therefore, is high.
Mediocre prediction
- Target: 0
- Prediction: 40%
- Loss: -log(1 - 0.4) ≈ 0.222
- This loss is a function of the counter-probability of 60%.
17. WHY LOGISTIC LOSS?
• The likelihood function returns the probability of the data for a given parameter.
• Likelihood: L(parameters | data) = P(data | parameters) = ∏_{i=1..n} P(data point_i | parameters)
• Coin flip example: L(p_h = 0.5 | HT) = P(HT | p_h = 0.5) = 0.25
• In practice, it is convenient to use the log likelihood:
• Log likelihood: log L(parameters | data) = log ∏_{i=1..n} P(data point_i | parameters) = ∑_{i=1..n} log P(data point_i | parameters)
• Using the log likelihood helps prevent underflow problems.
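A minimal sketch of the underflow problem: multiplying many small probabilities underflows to zero, while summing their logarithms stays numerically stable (the 200 data points and the per-point probability of 0.01 are illustrative).

```python
# A minimal sketch of the underflow problem the log likelihood avoids.
# The 200 data points and the per-point probability of 0.01 are illustrative.
import math

probs = [0.01] * 200                              # P(data point_i | parameters) for 200 points
likelihood = math.prod(probs)                     # product of many small numbers
log_likelihood = sum(math.log(p) for p in probs)  # sum of logs instead

print(likelihood)       # 0.0 -- the true value 1e-400 underflows in floating point
print(log_likelihood)   # about -921.0, i.e. 200 * log(0.01), perfectly representable
```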
18. MAXIMUM LIKELIHOOD PRINCIPLE
• The maximum likelihood principle tells us to select the parameters θ* that maximize the probability of the data:
  θ* = arg max_θ ∑_{i=1..n} log P(data point_i | θ)
• Maximizing f(x) is equivalent to minimizing -f(x):
  θ* = arg min_θ [ -∑_{i=1..n} log P(data point_i | θ) ]
• For a random variable with two outcomes, the logistic loss is the negative log likelihood.
• Thus, minimizing the logistic loss is equivalent to the maximum likelihood approach.
20. NUMBER COMPLETION TASK
3, 9, 27, 81, ?
What is the next number in this sequence?
Simple solution
- f(x) = 3^x
- f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
- f(5) = 243
More complex solution
- f(x) = -15 + 32x - 18x² + 4x³
- f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
- But: f(5) = 195
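A minimal sketch of both solutions, assuming NumPy: fitting a degree-3 polynomial to the four given points recovers the complex solution, while 3^x gives the simple one.

```python
# A minimal sketch of both solutions, assuming NumPy: a cubic fitted to the four
# given points reproduces them exactly but continues differently than f(x) = 3^x.
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([3, 9, 27, 81])

cubic = np.poly1d(np.polyfit(x, y, deg=3))   # recovers -15 + 32x - 18x^2 + 4x^3

print(np.round(cubic(x)))   # [ 3.  9. 27. 81.] -- fits the data perfectly
print(round(cubic(5)))      # 195: the "complex" continuation
print(3 ** 5)               # 243: the "simple" continuation
```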
21. SCOTUS’S RAZOR
• Problem: There is an infinite number of solutions to any
sequence prediction problem.
• Most, if not all, machine learning problems are
sequence prediction problems.
• One solution: Ockham’s Razor
• Prefer the simplest theory consistent with the data
• First clear formulation by 13th century theologian
Duns Scotus
• Today: Don’t use a fancy machine learning model
when a simple model works just fine.
22. WHY OCKHAM’S RAZOR?
• It works.
• Successful applications in ML, science, business, design and other fields
• Fast & cheap
• Simpler models tend to be faster models and consume fewer resources.
• The Schmidhuber/Hutter argument:
• The Great Programmer implements all possible universes with program lengths from 1 to N.
• Program B is a functional copy of program A if both lead to the same result but with different code.
• Shorter programs have more functional copies than longer programs.
• Simple program: print(“Hello world!”)
• Functional copy: const message = “Hello world!”; print(message)
24. BUILDING BLOCKS: NEURONS, WEIGHTS AND LAYERS
• A neuron is the basic processing unit:
• Accepts input, processes input, sends output
• Neurons are connected to other neurons.
• Connections are weighted.
• A layer is a group of neurons.
• Every neural network has an input layer and an output layer.
• Hidden layer:
• Any layer between input and output
• Shallow: ~1-5 hidden layers
• Deep: dozens or hundreds of layers
[Diagram: an input layer with neurons x1, ..., xn connected through weights w1, ..., wn to an output neuron that computes f(w1x1 + ... + wnxn)]
26. LOGISTIC REGRESSION
• Baseline model for binary classification tasks:
• f(x) = s(w1x1 + ... + wnxn + b)
• Weigh the evidence
• Add a bias
• The sigmoid function s “squashes” the input to a number between 0 and 1.
• Can be formulated as a neural net:
• The input x1, ..., xn corresponds to neurons in the first layer.
• The bias corresponds to an additional neuron with a connection weight of 1.
• The output neuron applies the sigmoid function.
[Diagram: input neurons x1, ..., xn plus a bias neuron b (connection weight 1), connected through weights w1, ..., wn to an output neuron that computes s(w1x1 + ... + wnxn + b)]
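A minimal sketch of logistic regression written as a one-layer neural network, assuming NumPy; the input, weights, and bias below are illustrative, not learned values.

```python
# A minimal sketch of logistic regression as a one-layer neural network, assuming
# NumPy; the input, weights and bias below are illustrative, not learned values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 3.0])     # input layer: n = 3 feature neurons
w = np.array([0.8, -1.2, 0.3])    # connection weights w1, ..., wn
b = -0.5                          # bias neuron, connected with weight 1

y = sigmoid(np.dot(w, x) + b)     # the output neuron "squashes" the weighted evidence
print(y)                          # a probability between 0 and 1
```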
27. THE SIGMOID FUNCTION
The sigmoid function s(x) is one of the most frequently used functions in machine learning:
  s(x) = 1 / (1 + e^(-x))
Desirable properties:
• “Squashes” any input into the range between 0 and 1
• The derivative is easy to compute:
  ds/dx = s(x)(1 - s(x))
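A minimal numeric check of the derivative formula, comparing a finite-difference estimate with s(x)(1 - s(x)) at an arbitrary point.

```python
# A minimal numeric check of ds/dx = s(x)(1 - s(x)): a finite-difference estimate
# of the derivative agrees with the closed-form expression.
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (s(x + h) - s(x - h)) / (2 * h)   # central finite difference
analytic = s(x) * (1 - s(x))                # the easy-to-compute derivative
print(numeric, analytic)                    # both are about 0.2217
```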
28. A GLIMPSE AT BACKPROPAGATION
• Model parameters are initialized randomly.
• The term “training” refers to the (iterative) optimization of parameters.
• Almost all neural nets are trained with the backpropagation algorithm.
Backpropagation algorithm (n repetitions)
Go through each instance:
1. Forward propagation: Compute the prediction and the loss.
2. Backward propagation: For each parameter, compute the derivative w.r.t. the loss.
• Positive derivative: small increase in parameter => increase in loss
• Negative derivative: small increase in parameter => decrease in loss
3. Update: Use the derivative to apply an update rule.
• Simple rule: new value = old value - learning rate × derivative
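A minimal sketch of this training loop for logistic regression with the logistic loss, assuming NumPy and a small synthetic dataset; for this model, the derivative of the loss with respect to each weight works out to (prediction - target) · input.

```python
# A minimal sketch of the forward/backward/update loop for logistic regression with
# the logistic loss, assuming NumPy and a synthetic dataset; for this model the
# derivative of the loss w.r.t. each weight is (prediction - target) * input.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 instances, 3 features
t = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic binary targets

w = rng.normal(size=3) * 0.01                 # parameters are initialized randomly
b = 0.0
learning_rate = 0.1

for _ in range(100):                          # n repetitions
    for x_i, t_i in zip(X, t):                # go through each instance
        y_i = 1.0 / (1.0 + np.exp(-(np.dot(w, x_i) + b)))      # 1. forward propagation
        dw, db = (y_i - t_i) * x_i, (y_i - t_i)                 # 2. backward propagation
        w, b = w - learning_rate * dw, b - learning_rate * db   # 3. update

print(w, b)   # the learned weights emphasize the first two features
```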
39. CONVOLUTIONAL LAYER
• CNNs use a repeated sequence of layers:
• A convolutional layer, followed by a pooling layer
• A convolutional layer consists of filters:
• A window moves through the output of the
previous layer
• Similar to how we read: from left to right, and
then downwards
• The purpose of a filter is to detect the
presence of a particular feature:
• Basic geometric shapes
• Lines, circles, edges
• Characteristic colors
• Blue sky, green grass
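A minimal sketch of a single filter, assuming NumPy: a 3x3 window slides over a toy 6x6 image (left to right, then downwards) and responds where a vertical edge is present; the image and kernel are illustrative.

```python
# A minimal sketch of a single convolutional filter, assuming NumPy: a 3x3 window
# slides over a toy 6x6 image and responds where a vertical edge is present.
import numpy as np

image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # dark left half, bright right half: a vertical edge

kernel = np.array([[-1, 0, 1],          # a classic vertical-edge detector
                   [-1, 0, 1],
                   [-1, 0, 1]])

out = np.zeros((4, 4))                  # valid positions of a 3x3 window on a 6x6 image
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)                              # large responses where the window covers the edge
```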
40. 96 low-level features learned by a convolution layer
Source: CS231n Convolutional Neural Networks for Visual Recognition
41. MAX-POOLING LAYER
• A max-pooling layer performs a
reduction operation:
• A window moves through the subregions of
the previous output.
• For each subregion, the maximum value is
extracted.
• A max-pooling layer with a stride of 2
reduces a 4x4 matrix to a 2x2 matrix.
• Intuition: It doesn’t really matter where
exactly a feature is located.
• Fewer entries => faster computation
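A minimal sketch of 2x2 max-pooling with a stride of 2, assuming NumPy: each subregion of a 4x4 input is reduced to its maximum value, yielding a 2x2 output.

```python
# A minimal sketch of 2x2 max-pooling with a stride of 2, assuming NumPy:
# a 4x4 input is reduced to a 2x2 output by keeping each subregion's maximum.
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 8, 5],
              [1, 1, 3, 7]])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pooled[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()

print(pooled)   # [[6. 2.]
                #  [2. 8.]]
```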
42. DEEP CONVOLUTIONAL NEURAL NETWORKS
• Deep neural nets are characterized by repeated blocks of layers.
• Ex.: a series of convolution/max-pooling operations
• Some ML fields (though not all) are dominated by deep nets.
• Theory lags behind applications.
• Intuition: hierarchical models for hierarchical data
• Simple example: traffic sign recognition
• Lines and circles form digits.
• Digits form numbers.
• A speed limit sign is composed of a red circle,
a white circle and a number.
44. SUMMARY
• The growth of machine learning is fueled by:
1. the importance of predictions
2. the low cost of data acquisition and processing
3. the domain generality of learning algorithms.
• The essential tasks in a machine learning project are to
1. formulate the problem
2. collect a large and relevant dataset
3. train, test and improve appropriate models.
• Neural networks form a class of powerful models trained by backpropagation:
• Building blocks: neurons, connections, layers
• Convolutional neural networks:
• high predictive accuracy and computational efficiency