- A high-level overview of artificial intelligence
- The importance of predictions across different domains of life
- Big (text) data
- Competition as a discovery process
- Domain-general learning
- Computer vision and natural language processing
- Elements of a machine learning system
- A hierarchy of problem classes
- Data collection
- The purpose of a model
- Logistic loss function
- Likelihood, log likelihood and maximum likelihood
- Ockham's Razor
- Intelligence as sequence prediction
- Building blocks of neural networks: neurons, weights and layers
- Logistic regression as a neural network
- Sigmoid function
- A look at backpropagation
- Gradient descent
- Convolutional neural networks
- Max-pooling
- Deep neural networks
3. TO LIVE IS TO PREDICT
Biology
- Food: edible or poisonous?
- Fight/flight/freeze
- Position within the hierarchy
- Mating choice
Forecasting
- Financial markets
- Betting markets
- Economic forecasts
- Business plans
- Election results
- Sports betting
Modern life
- Career/job choice
- Moving
- Medical interventions
- How to compete with machines?
• Effective decision-making requires accurate predictions.
• Humans and other species have evolved adaptations to cope with uncertainty:
• Memory: data storage linked to the ability to generate predictions
• Mental time travel: the ability to project oneself into the future
• Automated processes associated with certain emotions
4. BIG TEXT DATA
- 16,000 words spoken per person per day
- 100 trillion words spoken by humanity per day
- 28 million papers (1980-2012)
- 130 million books indexed by Google
- 1 billion websites on the World Wide Web
- 500 million videos hosted on YouTube
5. COMPETITION AS A DISCOVERY PROCESS
• Machine learning is organized around competitions.
• A dataset is split up into two parts:
• a training set and a test set
• Competitors submit models trained on the first set.
• Open source ensures perfect replicability.
• Models are then evaluated based on the second set.
• Competition winners tend to dominate the discourse …
• … until the next improvement is published.
• “State of the art”: best competition performances
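To make the protocol concrete, here is a minimal sketch of the train/test split, assuming scikit-learn and one of its bundled datasets; the 80/20 split and the choice of logistic regression are illustrative, not a prescribed setup.

```python
# A minimal sketch of the competition protocol, assuming scikit-learn is available.
# The dataset, the 80/20 split and the model choice are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# split the dataset into a training set and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)   # competitors train only on the training set
model.fit(X_train, y_train)

# models are evaluated (and ranked) on the test set
print(accuracy_score(y_test, model.predict(X_test)))
```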
6. DOMAIN-GENERAL LEARNING: COMPUTER VISION 1/2
• Domain-general learning strategies
• Example: convolutional neural networks
Rapid progress in image classification
• LeCun et al. (1998): a CNN trained on the MNIST dataset (60,000 small images of digits, 10 classes) achieves an error rate of 1-2% when tested on 10,000 images
• Krizhevsky et al. (2012): a deep CNN trained on 1.2 million high-resolution images and 1,000 classes achieves a 17%
top-5 error rate when tested on 150,000 images
Skin cancer classification
• Esteva et al. (2016): Deep CNN, trained on 130,000 clinical images
• Human-level performance when tested against 21 dermatologists on two binary classification tasks
• keratinocyte carcinomas: the most common skin cancer
• melanomas: deadliest skin cancer
7. DOMAIN GENERALITY: COMPUTER VISION 2/2
• Is there something special about skin cancer classification?
• Litjens et al. (2017) summarizes the use of deep learning for medical image analysis:
• at least 300 papers, most published in 2016
• CNNs were already being applied in the 1990s:
• Lo et al. (1995): CNN trained to recognize lung nodules in x-rays
• In most cases, the only input to the learning algorithms is a set of pairs.
• Each pair consists of an image and a label:
• Malignant vs. benign
• Stage 1/2/3/4
8. DOMAIN GENERALITY: NATURAL LANGUAGE PROCESSING
Text classification
• Kim (2014): a shallow CNN outperforms state-of-the-art (SOTA) results in sentence classification tasks
• Conneau et al. (2017): very deep CNNs improve upon SOTA results in short text classification tasks
Sequence labeling
• Map a sequence of words to a sequence of tags:
• Strubell et al. (2017): A new CNN variant achieves almost SOTA results, but 10-20X faster
Machine translation
• Kalchbrenner et al. (2017): SOTA performance on an English-German translation benchmark
Example (sequence labeling): Tim/B-Name Cook/I-Name is/O Chief/B-Title Executive/I-Title Officer/I-Title of/O Apple/B-Org ./O
10. The basic workflow in a machine learning project:
WORKFLOW
1. PROBLEM FORMULATION: Can you describe the problem in terms of existing solutions?
2. DATA COLLECTION: How can you obtain a large, high-quality dataset?
3. MODEL TRAINING & SELECTION: What is a good model? How do you measure success?
4. IMPROVE (when needed): error analysis, more data, “better” models
11. PROBLEM FORMULATION
TYPE | OUTPUT | COMMENT | APPLICATIONS
Multi-class classification | probability distribution | can be thought of as a special case of sequence prediction | topic classification, very good to very bad, object classification, staging
Binary classification | probability | a special case of multi-class classification | yes/no, positive/negative, present/absent, similar/dissimilar
Sequence prediction | sequence of probability distributions | at the core of intelligence (artificial and biological) | sequence labeling, machine translation, speech synthesis, image segmentation
Clustering | cluster membership | another special case: the number of clusters is a hyperparameter | segmentation: customers, images; detection: communities, anomalies
Reinforcement learning | sequence of actions | sequence predictions in an active environment: actions affect observations | robotics, driverless cars, conversational agents, game playing
12. DATA COLLECTION
• A data set D is a collection of n data points d_i.
• Each data point d_i = (x_i, t_i) in D is a pair consisting of features x_i and a target t_i.
• Easy problems require at least hundreds or thousands of data points.
• Harder problems require millions of data points.
• Occasionally, data will be provided or is available in existing databases.
• Usually, a data collection strategy has to be devised and implemented.
Supervised
• Humans manually label the inputs with appropriate targets.
• “This is a cat. That’s a dog. This is another cat.”
Weakly supervised
• Minimal human intervention
• Download all images with the hashtags #cat, #dog
• Synthetic data
Unsupervised
• No human supervision
• Learn potentially relevant patterns from gigantic datasets
13. WHAT IS A MODEL, ANYWAY?
• Using parameters θ, a model f generates a prediction y from an input x:
• f(x, θ) = y
• Parameters allow the model to “weight the evidence”.
• Example: a simple binary classification problem
• Does a given article from a news archive focus on politics? Yes or no?
• Consider the parameters (weights) for the following words: “election”, “soccer”, “the”
• Which of these parameters will be positive, and which will be negative?
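A minimal sketch of f(x, θ) = y for this example follows; the words, weights, and bias are illustrative assumptions, not values learned from a news archive.

```python
# A minimal sketch of f(x, theta) = y for the "politics article?" example.
# The words, weights and bias are illustrative assumptions, not learned values.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = {"election": 2.0,   # evidence for politics: positive weight
         "soccer": -1.5,    # evidence against politics: negative weight
         "the": 0.0}        # uninformative word: weight near zero
bias = -0.5

def f(text, theta, bias):
    words = text.lower().split()
    z = bias + sum(theta.get(w, 0.0) for w in words)
    return sigmoid(z)       # predicted probability that the article is about politics

print(f("The election dominated the news", theta, bias))   # high probability
print(f("The soccer season starts today", theta, bias))    # low probability
```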
15. LOSS FUNCTION: LOGISTIC LOSS
Loss
- Some predictions are better than others.
- The deviation of the prediction p from the target t is referred to as loss.
- Synonyms: cost, error, empirical risk
- The loss is calculated through a loss function.
- Logistic loss is one of the most important loss functions.
Logistic loss
- The target can be either 1 (“did occur”) or 0 (“did not occur”).
- If the target equals 1: -log(prediction)
- If the target equals 0: -log(1 - prediction)
- loss(prediction, target) = -[target · log(prediction) + (1 - target) · log(1 - prediction)]
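A minimal sketch of the logistic loss as a Python function; it uses base-10 logarithms so the numbers match the worked examples on the next slide, whereas ML libraries typically use natural logarithms.

```python
# A minimal sketch of the logistic loss. Base-10 logarithms are used here so the
# numbers match the worked examples on the next slide; libraries typically use ln.
import math

def logistic_loss(prediction, target, log=math.log10):
    return -(target * log(prediction) + (1 - target) * log(1 - prediction))

print(logistic_loss(0.9, 1))   # ~0.046: good prediction, small loss
print(logistic_loss(0.1, 1))   # 1.0: bad prediction, large loss
print(logistic_loss(0.4, 0))   # ~0.222: mediocre prediction, moderate loss
```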
16. LOGISTIC LOSS: EXAMPLES
loss(prediction, target) = -[target · log(prediction) + (1 - target) · log(1 - prediction)]
(The examples below use base-10 logarithms.)
Good prediction
- Target: 1
- Prediction: 90%
- Loss: -log(0.9) ≈ 0.046
- This is a good prediction. Consequently, the loss is small.
Bad prediction
- Target: 1
- Prediction: 10%
- Loss: -log(0.1) = 1
- This prediction is inaccurate and the loss, therefore, is high.
Mediocre prediction
- Target: 0
- Prediction: 40%
- Loss: -log(1 - 0.4) ≈ 0.222
- This loss is a function of the counter-probability of 60%.
17. WHY LOGISTIC LOSS?
• The likelihood function returns the probability of the data for a given parameter.
• Likelihood: L(parameters | data) = P(data | parameters) = ∏_{i=1..n} P(data point_i | parameters)
• Coin flip example: L(p_h = 0.5 | HT) = P(HT | p_h = 0.5) = 0.25
• In practice, it is convenient to use the log likelihood:
• Log likelihood: log L(parameters | data) = log ∏_{i=1..n} P(data point_i | parameters) = ∑_{i=1..n} log P(data point_i | parameters)
• Using the log likelihood helps prevent underflow problems.
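A minimal sketch of the underflow problem: multiplying many small probabilities underflows to zero, while summing their logarithms stays numerically stable (the 200 data points and the per-point probability of 0.01 are illustrative).

```python
# A minimal sketch of the underflow problem the log likelihood avoids.
# The 200 data points and the per-point probability of 0.01 are illustrative.
import math

probs = [0.01] * 200                              # P(data point_i | parameters) for 200 points
likelihood = math.prod(probs)                     # product of many small numbers
log_likelihood = sum(math.log(p) for p in probs)  # sum of logs instead

print(likelihood)       # 0.0 -- the true value 1e-400 underflows in floating point
print(log_likelihood)   # about -921.0, i.e. 200 * log(0.01), perfectly representable
```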
18. MAXIMUM LIKELIHOOD PRINCIPLE
• The maximum likelihood principle tells us to select the parameters θ* that maximize the probability of the data:
  θ* = arg max_θ ∑_{i=1..n} log P(data point_i | θ)
• Maximizing f(x) is equivalent to minimizing -f(x):
  θ* = arg min_θ [ -∑_{i=1..n} log P(data point_i | θ) ]
• For a random variable with two outcomes, the logistic loss is the negative log likelihood.
• Thus, minimizing the logistic loss is equivalent to the maximum likelihood approach.
20. NUMBER COMPLETION TASK
3, 9, 27, 81, ?
What is the next number in this sequence?
Simple solution
- f(x) = 3^x
- f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
- f(5) = 243
More complex solution
- f(x) = -15 + 32x - 18x² + 4x³
- f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
- But: f(5) = 195
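A minimal sketch of both solutions, assuming NumPy: fitting a degree-3 polynomial to the four given points recovers the complex solution, while 3^x gives the simple one.

```python
# A minimal sketch of both solutions, assuming NumPy: a cubic fitted to the four
# given points reproduces them exactly but continues differently than f(x) = 3^x.
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([3, 9, 27, 81])

cubic = np.poly1d(np.polyfit(x, y, deg=3))   # recovers -15 + 32x - 18x^2 + 4x^3

print(np.round(cubic(x)))   # [ 3.  9. 27. 81.] -- fits the data perfectly
print(round(cubic(5)))      # 195: the "complex" continuation
print(3 ** 5)               # 243: the "simple" continuation
```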
21. SCOTUS’S RAZOR
• Problem: There is an infinite number of solutions to any
sequence prediction problem.
• Most, if not all, machine learning problems are
sequence prediction problems.
• One solution: Ockham’s Razor
• Prefer the simplest theory consistent with the data
• First clear formulation by 13th century theologian
Duns Scotus
• Today: Don’t use a fancy machine learning model
when a simple model works just fine.
22. WHY OCKHAM’S RAZOR?
• It works.
• Successful applications in ML, science, business, design and other fields
• Fast & cheap
• Simpler models tend to be faster models and consume fewer resources.
• The Schmidhuber/Hutter argument:
• The Great Programmer implements all possible universes with program lengths from 1 to N.
• Program B is a functional copy of program A if both lead to the same result but with different code.
• Shorter programs have more functional copies than longer programs.
• Simple program: print(“Hello world!”)
• Functional copy: const message = “Hello world!”; print(message)
24. BUILDING BLOCKS: NEURONS, WEIGHTS AND LAYERS
• A neuron is the basic processing unit:
• Accepts input, processes input, sends output
• Neurons are connected to other neurons.
• Connections are weighted.
• A layer is a group of neurons.
• Every neural network has an input layer and an output layer.
• Hidden layer:
• Any layer between input and output
• Shallow: ~1-5 hidden layers
• Deep: dozens or hundreds of layers
[Diagram: an input layer with neurons x1, ..., xn connected through weights w1, ..., wn to an output neuron that computes f(w1x1 + ... + wnxn)]
26. LOGISTIC REGRESSION
• Baseline model for binary classification tasks:
• f(x) = s(w1x1 + ... + wnxn + b)
• Weigh the evidence
• Add a bias
• The sigmoid function s “squashes” the input to a number between 0 and 1.
• Can be formulated as a neural net:
• The input x1, ..., xn corresponds to neurons in the first layer.
• The bias corresponds to an additional neuron with a connection weight of 1.
• The output neuron applies the sigmoid function.
[Diagram: input neurons x1, ..., xn plus a bias neuron b (connection weight 1), connected through weights w1, ..., wn to an output neuron that computes s(w1x1 + ... + wnxn + b)]
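A minimal sketch of logistic regression written as a one-layer neural network, assuming NumPy; the input, weights, and bias below are illustrative, not learned values.

```python
# A minimal sketch of logistic regression as a one-layer neural network, assuming
# NumPy; the input, weights and bias below are illustrative, not learned values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 3.0])     # input layer: n = 3 feature neurons
w = np.array([0.8, -1.2, 0.3])    # connection weights w1, ..., wn
b = -0.5                          # bias neuron, connected with weight 1

y = sigmoid(np.dot(w, x) + b)     # the output neuron "squashes" the weighted evidence
print(y)                          # a probability between 0 and 1
```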
27. THE SIGMOID FUNCTION
The sigmoid function s(x) is one of the most frequently used functions in machine learning:
  s(x) = 1 / (1 + e^(-x))
Desirable properties:
• “Squashes” any input into the range between 0 and 1
• The derivative is easy to compute:
  ds/dx = s(x)(1 - s(x))
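A minimal numeric check of the derivative formula, comparing a finite-difference estimate with s(x)(1 - s(x)) at an arbitrary point.

```python
# A minimal numeric check of ds/dx = s(x)(1 - s(x)): a finite-difference estimate
# of the derivative agrees with the closed-form expression.
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (s(x + h) - s(x - h)) / (2 * h)   # central finite difference
analytic = s(x) * (1 - s(x))                # the easy-to-compute derivative
print(numeric, analytic)                    # both are about 0.2217
```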
28. A GLIMPSE AT BACKPROPAGATION
• Model parameters are initialized randomly.
• The term “training” refers to the (iterative) optimization of parameters.
• Almost all neural nets are trained with the backpropagation algorithm.
Backpropagation algorithm (n repetitions)
Go through each instance:
1. Forward propagation: Compute the prediction and the loss.
2. Backward propagation: For each parameter, compute the derivative w.r.t. the loss.
• Positive derivative: small increase in parameter => increase in loss
• Negative derivative: small increase in parameter => decrease in loss
3. Update: Use the derivative to apply an update rule.
• Simple rule: new value = old value - learning rate × derivative
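A minimal sketch of this training loop for logistic regression with the logistic loss, assuming NumPy and a small synthetic dataset; for this model, the derivative of the loss with respect to each weight works out to (prediction - target) · input.

```python
# A minimal sketch of the forward/backward/update loop for logistic regression with
# the logistic loss, assuming NumPy and a synthetic dataset; for this model the
# derivative of the loss w.r.t. each weight is (prediction - target) * input.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 instances, 3 features
t = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic binary targets

w = rng.normal(size=3) * 0.01                 # parameters are initialized randomly
b = 0.0
learning_rate = 0.1

for _ in range(100):                          # n repetitions
    for x_i, t_i in zip(X, t):                # go through each instance
        y_i = 1.0 / (1.0 + np.exp(-(np.dot(w, x_i) + b)))      # 1. forward propagation
        dw, db = (y_i - t_i) * x_i, (y_i - t_i)                 # 2. backward propagation
        w, b = w - learning_rate * dw, b - learning_rate * db   # 3. update

print(w, b)   # the learned weights emphasize the first two features
```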
39. CONVOLUTIONAL LAYER
• CNNs use a repeated sequence of layers:
• A convolutional layer, followed by a pooling layer
• A convolutional layer consists of filters:
• A window moves through the output of the
previous layer
• Similar to how we read: from left to right, and
then downwards
• The purpose of a filter is to detect the
presence of a particular feature:
• Basic geometric shapes
• Lines, circles, edges
• Characteristic colors
• Blue sky, green grass
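A minimal sketch of a single filter, assuming NumPy: a 3x3 window slides over a toy 6x6 image (left to right, then downwards) and responds where a vertical edge is present; the image and kernel are illustrative.

```python
# A minimal sketch of a single convolutional filter, assuming NumPy: a 3x3 window
# slides over a toy 6x6 image and responds where a vertical edge is present.
import numpy as np

image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # dark left half, bright right half: a vertical edge

kernel = np.array([[-1, 0, 1],          # a classic vertical-edge detector
                   [-1, 0, 1],
                   [-1, 0, 1]])

out = np.zeros((4, 4))                  # valid positions of a 3x3 window on a 6x6 image
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)                              # large responses where the window covers the edge
```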
40. 96 low-level features learned by a convolution layer
Source: CS231n Convolutional Neural Networks for Visual Recognition
41. MAX-POOLING LAYER
• A max-pooling layer performs a
reduction operation:
• A window moves through the subregions of
the previous output.
• For each subregion, the maximum value is
extracted.
• A max-pooling layer with a stride of 2
reduces a 4x4 matrix to a 2x2 matrix.
• Intuition: It doesn’t really matter where
exactly a feature is located.
• Fewer entries => faster computation
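A minimal sketch of 2x2 max-pooling with a stride of 2, assuming NumPy: each subregion of a 4x4 input is reduced to its maximum value, yielding a 2x2 output.

```python
# A minimal sketch of 2x2 max-pooling with a stride of 2, assuming NumPy:
# a 4x4 input is reduced to a 2x2 output by keeping each subregion's maximum.
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 8, 5],
              [1, 1, 3, 7]])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pooled[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()

print(pooled)   # [[6. 2.]
                #  [2. 8.]]
```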
42. DEEP CONVOLUTIONAL NEURAL NETWORKS
• Deep neural nets are characterized by repeated blocks of layers.
• Ex.: a series of convolution/max-pooling operations
• Some ML fields (though not all) are dominated by deep nets.
• Theory lags behind applications.
• Intuition: hierarchical models for hierarchical data
• Simple example: traffic sign recognition
• Lines and circles form digits.
• Digits form numbers.
• A speed limit sign is composed of a red circle,
a white circle and a number.
44. SUMMARY
• The growth of machine learning is fueled by:
1. the importance of predictions
2. the low cost of data acquisition and processing
3. the domain generality of learning algorithms.
• The essential tasks in a machine learning project are to
1. formulate the problem
2. collect a large and relevant dataset
3. train, test and improve appropriate models.
• Neural networks form a class of powerful models trained by backpropagation:
• Building blocks: neurons, connections, layers
• Convolutional neural networks:
• high predictive accuracy and computational efficiency