DEEP LEARNING PRIMER
Maurizio Caló Caligaris
Presented at NYU Center For Genomics
A First Principles Approach
CS 229 Machine Learning
ABOUT ME
ufldl.stanford.edu/?people
THE RISE OF DEEP LEARNING
GOOGLE TRENDS
THE RISE OF DEEP LEARNING
“We will move from a mobile-first world to an AI-first world”
- Sundar Pichai, Google letter to shareholders
THIS TALK
Demystify deep learning. Provide a simple way of approaching the subject at a high level, from the ground up.
GOAL
THIS TALK
Accessible to people with little or no background in machine learning
INTENDED AUDIENCE
(experts in the field can hopefully learn something too)
TALK OUTLINE
• Preliminaries
• Machine learning
• Neural networks
• Bias / variance tradeoff
• The case for deep learning
• Why now
• State-of-the-art + trends
• FAQS
The ability of computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts.
DEEP LEARNING: BASIC DEFINITION
Without the need for human operators to formally
specify all the knowledge the computer needs
SUBSET OF MACHINE LEARNING
http://www.deeplearningbook.org/
MACHINE LEARNING is a type of ARTIFICIAL INTELLIGENCE that gives computers the ability to learn without being explicitly programmed.
MACHINE LEARNING: BASIC DEFINITION
http://www.deeplearningbook.org/
MACHINE LEARNING
“JUST X → Y”
Approximate a mapping f from input X to output Y, based on some sample data.
A SIMPLE WAY TO THINK ABOUT
SUPERVISED LEARNING
MACHINE LEARNING
“JUST X → Y”
A SIMPLE WAY TO THINK ABOUT
SUPERVISED LEARNING
Problem | X | Y
Housing price prediction | size (sq. ft), location | price (e.g., $35,000)
Spam detection | email | spam / not spam
Product recommendations | product and user features | P(purchase)
Loan approval | loan application | will they pay? (0 or 1)
Preventive maintenance | sensor readings from planes / hard disks | is it about to fail?
http://cs229.stanford.edu/notes/cs229-notes1.pdf
A SIMPLE X → Y: REGRESSION
Input (the features) and output are both real numbers.
http://cs229.stanford.edu/notes/cs229-notes1.pdf
A SIMPLE X → Y: LINEAR REGRESSION
Predict $\hat{y} = w^\top x$, where the weights $w$ are chosen so as to minimize the sum of squared errors, $\sum_i (w^\top x^{(i)} - y^{(i)})^2$ (plus a regularization term).
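To make the formula concrete, here is a minimal sketch of regularized least squares (ridge regression) in Python with numpy; the housing numbers and the regularization strength are made up purely for illustration.

```python
import numpy as np

# Toy data: predict housing price from size (sq. ft) and a location score.
# All numbers are made up for illustration.
X = np.array([[1400.0, 3.0],
              [2100.0, 4.5],
              [800.0, 2.0],
              [3000.0, 5.0]])
y = np.array([250_000.0, 420_000.0, 130_000.0, 560_000.0])

# Prepend a column of ones so the intercept is learned as just another weight.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Ridge: minimize sum of squared errors + lam * ||w||^2,
# which has the closed form w = (Xb^T Xb + lam * I)^{-1} Xb^T y.
lam = 1e-3
I = np.eye(Xb.shape[1])
I[0, 0] = 0.0  # conventionally, don't regularize the intercept
w = np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)

y_hat = Xb @ w  # predict: y_hat = w^T x
print("weights:", w)
print("sum of squared errors:", np.sum((y_hat - y) ** 2))
```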
A SIMPLE X → Y: (BINARY) CLASSIFICATION
THE OUTPUT IS EITHER 0 OR 1
[Diagram: input features such as presentation of fetus, presence of uterine scar, placenta previa, and maternal disease feed into a single prediction: c-section (0) or natural birth (1).]
http://www.deeplearningbook.org/
http://cs229.stanford.edu/notes/cs229-notes1.pdf
A SIMPLE X → Y: LOGISTIC REGRESSION
Input: a feature vector $x$. Multiply each feature by some weight, then map the result smoothly to (0, 1):
$h_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$
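A minimal sketch of logistic regression trained by gradient descent on made-up data, showing the two steps above: multiply each feature by a weight, then squash the sum smoothly into (0, 1).

```python
import numpy as np

def sigmoid(z):
    # Maps any real number smoothly into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: two made-up features -> label 0 (c-section) or 1 (natural birth).
X = np.array([[0.2, 1.0], [0.9, 0.1], [0.4, 0.8], [0.8, 0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)           # weighted sum of features, squashed to (0, 1)
    grad_w = X.T @ (p - y) / len(y)  # gradient of the average log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print("predicted P(y = 1):", sigmoid(X @ w + b).round(2))
```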
X → Y: NON-LINEAR RELATIONSHIPS
What if the sample data looks like this?
NEURAL NETWORKS
Can learn non-linear relationships in data (universality theorem).
Trained using back-propagation.
https://medium.com/@ageitgey/machine-learning-is-fun-part-2-a26a10b68df3
http://neuralnetworksanddeeplearning.com/chap4.html
NEURAL NETWORKS
Each unit computes $h_{W,b}(x) = f(W^\top x + b)$, or in vector form $a = f(Wx + b)$, where $f$ is a nonlinearity such as the sigmoid $f(z) = \frac{1}{1 + e^{-z}}$.
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
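As a sketch of both claims (non-linear relationships, trained by back-propagation), here is a tiny one-hidden-layer network fit to XOR, a relationship no linear model can capture; the sizes, seed, and learning rate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic non-linear relationship.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# One hidden layer: h = tanh(W1 x + b1), y_hat = sigmoid(W2 h + b2)
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: back-propagate the log-loss gradient
    dlogits = (p - y) / len(X)
    dW2 = h.T @ dlogits; db2 = dlogits.sum(0)
    dh = (dlogits @ W2.T) * (1.0 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dh; db1 = dh.sum(0)
    # Gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(p.round(2))  # should approach [[0], [1], [1], [0]]
```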
PERFECTLY FITTING SAMPLE DATA: GOOD OR BAD?
ASIDE: BIAS-VARIANCE TRADE-OFF
https://www.quora.com/What-is-the-best-way-to-explain-the-bias-variance-trade-off-in-laymens-terms
Want our models to generalize to data we haven’t seen.
(One of the most important ideas in machine learning.)
http://cs229.stanford.edu/notes/cs229-notes4.pdf
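A small demonstration of the trade-off: fitting polynomials of increasing degree to noisy synthetic data (all made up here) drives the training error toward zero while the error on held-out points grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying trend.
x = np.sort(rng.uniform(0.0, 1.0, 20))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=20)
x_train, y_train = x[::2], y[::2]  # half the points for fitting
x_val, y_val = x[1::2], y[1::2]    # the rest held out

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # A high degree fits the training points almost perfectly but generalizes badly.
    print(f"degree {degree}: train MSE {train_mse:.4f}, validation MSE {val_mse:.4f}")
```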
ASIDE: MODEL GENERALIZABILITY: TRAINING AND VALIDATION SETS
Want our models to generalize to data we haven’t seen.
Leave out some of the data to evaluate performance on (called the “validation” set).
k-fold cross-validation: do this over multiple rounds, each time leaving out a different subset of the sample data.
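A minimal sketch of the k-fold splitting logic (indices only; fitting and scoring a model on each round is omitted):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    # Shuffle the indices once, then split them into k roughly equal folds.
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

# Each round, one fold is held out as the validation set;
# the model is trained on the remaining k-1 folds.
n, k = 100, 5
for i, val_idx in enumerate(k_fold_indices(n, k)):
    train_idx = np.setdiff1d(np.arange(n), val_idx)
    print(f"round {i}: train on {len(train_idx)} examples, validate on {len(val_idx)}")
```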
MACHINE LEARNING: JUST CURVE FITTING?
Except we don’t call it that. We wouldn’t get any funding if we did. Machine learning sounds cooler and impresses people.
MACHINE LEARNING: JUST CURVE FITTING?
Finding good feature representations is often the biggest challenge! The performance of ML algorithms depends heavily on the presentation of the data they are given.
JOKING ASIDE
FEATURE REPRESENTATION matters. A LOT.
A SIMPLE CHALLENGE
Compute CXMCXI times II (in Roman numerals)
A SIMPLE CHALLENGE
Compute 111,111 times 2
(way easier)
ANOTHER CHALLENGE
Which of these contains a human face?
http://neuralnetworksanddeeplearning.com/chap1.html#toward_deep_learning
ANOTHER CHALLENGE
Does this contain a human face?
http://neuralnetworksanddeeplearning.com/chap1.html#toward_deep_learning
WHAT WE SEE VS. WHAT COMPUTERS SEE
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721
CHOICE OF FEATURE REPRESENTATION MATTERS. A LOT.
[Scatter plot: raw pixel values, pixel 1 vs. pixel 2, with cesarean delivery (+) and natural birth (−) examples thoroughly intermingled.]
Individual pixels in an MRI are not correlated with the desired outcome.
CHOICE OF FEATURE REPRESENTATION MATTERS
[Scatter plot: the same examples in a derived feature space, feat 1 vs. feat 2, where the two classes separate cleanly.]
We want feature representations under which the classes become separable. With raw MRI input, a doctor must tell the system which features matter: presence of uterine scar, presentation of fetus, placenta previa, maternal disease, primiparity, twins, etc.
FEATURE ENGINEERING:
COMPUTER VISION
SIFT
HoG
TEXTONS
RIFT
GLOH
FEATURE ENGINEERING:
NATURAL LANGUAGE PROCESSING
PARSER FEATURES
STEMMING
ONTOLOGIES
PART OF SPEECH
ANAPHORA
FEATURE ENGINEERING
• Time-consuming
• Domain and task-specific
• Requires human experts
• Lots of trial and error
TRADITIONAL APPROACH
THE CASE FOR DEEP LEARNING: AUTOMATIC FEATURE REPRESENTATION
Automatically learns feature representations from data (in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts).
Without the need for human operators to formally
specify all the knowledge the computer needs
THE CASE FOR DEEP LEARNING: INTUITIVE EXPLANATION
Note: not a realistic approach (just to develop intuition)
http://neuralnetworksanddeeplearning.com/chap1.html#toward_deep_learning
THE CASE FOR DEEP LEARNING: INTUITIVE EXPLANATION
Split problems into subproblems (a hierarchy of concepts).
http://neuralnetworksanddeeplearning.com/chap1.html#toward_deep_learning
THE CASE FOR DEEP LEARNING: EXAMPLE OUTPUT OF A DEEP NETWORK
http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf
THE CASE FOR DEEP LEARNING: LEARNING FROM UNLABELED DATA
Labeled data is expensive, whereas vast amounts of unlabeled data are freely available on the web.
Train a network where output = input.
Learn to see the world the way human babies do (exploration, as opposed to learning from labeled examples).
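“Train a network where output = input” describes an autoencoder. Here is a minimal linear autoencoder sketch on synthetic data: squeezing the input through a narrow hidden layer forces the network to learn a compact feature representation without any labels. All sizes, rates, and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up unlabeled data with hidden low-dimensional structure:
# the last 4 columns are linear combinations of the first 4.
X = rng.normal(size=(200, 8))
X[:, 4:] = X[:, :4] @ rng.normal(size=(4, 4))

# Autoencoder: encode 8 inputs down to 4 hidden units, decode back to 8.
W_enc = rng.normal(scale=0.1, size=(8, 4))
W_dec = rng.normal(scale=0.1, size=(4, 8))

def recon_mse():
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

print("reconstruction MSE before:", round(recon_mse(), 4))
lr = 0.01
for _ in range(2000):
    H = X @ W_enc        # encode: the learned feature representation
    err = H @ W_dec - X  # decode, then compare the output to the input itself
    # Gradient steps on the per-example squared reconstruction error
    g_dec = (H.T @ err) / len(X)
    g_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
print("reconstruction MSE after: ", round(recon_mse(), 4))
```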
GOOGLE BRAIN
THE CASE FOR DEEP LEARNING: LEARNING FROM UNLABELED DATA
1 billion connections
10 million 200x200 px YouTube images (unlabeled)
1,000 machines (16,000 cores)
https://static.googleusercontent.com/media/research.google.com/en//archive/unsupervised_icml2012.pdf
Trained to predict itself (input = output)
GOOGLE BRAIN: RESULTS
THE CASE FOR DEEP LEARNING: LEARNING FROM UNLABELED DATA
The computer learned the concepts of “cat” and “person” without being explicitly told.
THE CASE FOR DEEP LEARNING: STATE-OF-THE-ART RESULTS ON STANDARD BENCHMARKS
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
THE CASE FOR DEEP LEARNING: STATE-OF-THE-ART RESULTS ON STANDARD BENCHMARKS
Speech recognition (TIMIT dataset)
Performance stagnated in the 2000s. The introduction of deep learning (2009) resulted in sudden improvements; some error rates were halved.
THE CASE FOR DEEP LEARNING: REMARKABLE PERFORMANCE
Andrew Ng, NIPS 2016
THE CASE FOR DEEP LEARNING: REMARKABLE PERFORMANCE (CAVEAT)
The gains come given enough data; with small datasets, deep networks do not necessarily outperform shallow algorithms.
Andrew Ng, NIPS 2016
WHY IS DEEP LEARNING TAKING OFF NOW?
WHY NOW
THE FUEL: scale of data
THE ENGINE: computing infrastructure
https://www.quora.com/What-does-Andrew-Ng-think-about-Deep-Learning
THE ENGINE
• GPUs
• Computation distributed across several machines
• Software frameworks: Theano, Torch, PyLearn2, Caffe, TensorFlow
POSITIVE FEEDBACK LOOP
Lots of computational power → greater incentive to acquire more data → greater incentive to build bigger/faster networks
https://www.quora.com/What-does-Andrew-Ng-think-about-Deep-Learning
POSITIVE FEEDBACK LOOP
Efficient computing infrastructure → faster experiments (e.g. 1 day instead of 1 week) → speeds up innovation
https://www.quora.com/What-does-Andrew-Ng-think-about-Deep-Learning
THE CASE FOR DEEP LEARNING: BIAS-VARIANCE NOT AS MUCH OF A TRADE-OFF
Traditionally, reducing bias increases variance and vice versa. With deep learning there is at least one possible action item in each case: if the model underfits (high bias), train a bigger model; if it overfits (high variance), get more data.
Andrew Ng, NIPS 2016
THE CASE FOR DEEP LEARNING: RICH, COMPLEX INPUTS & OUTPUTS
Andrew Ng, NIPS 2016
THE CASE FOR DEEP LEARNING: RICH, COMPLEX INPUTS & OUTPUTS
Image super-resolution: downsampled image → natural, detailed version
THE CASE FOR DEEP LEARNING: RICH, COMPLEX INPUTS & OUTPUTS
Image-to-image translation
https://phillipi.github.io/pix2pix/
THE CASE FOR DEEP LEARNING: GENERALIZING ACROSS TASKS
Joint Many-Task (JMT) model: state-of-the-art results on multiple tasks from a single model:
- Chunking
- Dependency parsing
- Semantic relatedness
- Textual entailment
https://metamind.io/research/multiple-different-natural-language-processing-tasks-in-a-single-deep-model/
THE CASE FOR DEEP LEARNING: GENERALIZING ACROSS TASKS
https://blog.openai.com/unsupervised-sentiment-neuron/
THE CASE FOR DEEP LEARNING: GENERALIZING ACROSS MODALITIES
Generates sentence descriptions from images
Multimodal Recurrent Neural Network (Karpathy, 2014)
THE CASE FOR DEEP LEARNING: SINGLE LEARNING ALGORITHM HYPOTHESIS
Evidence from neuroscience: ferrets can learn to “see” with the auditory cortex if their brains are rewired to send visual signals to that area, i.e. the mammalian brain may use a single algorithm for many different tasks.
http://www.deeplearningbook.org/contents/intro.html
THE CASE FOR DEEP LEARNING: SINGLE LEARNING ALGORITHM HYPOTHESIS
Machine learning research is becoming less fragmented: NLP, VISION, MOTION PLANNING, and SPEECH increasingly share the same methods.
http://www.deeplearningbook.org/contents/intro.html
THE CASE FOR DEEP LEARNING: TOWARDS END-TO-END LEARNING?
Andrew Ng, NIPS 2016
THE CASE FOR DEEP LEARNING: TOWARDS END-TO-END LEARNING?
Andrew Ng, NIPS 2016
SUMMARY
• Machine learning: “just X → Y”
• Choice of feature representation matters
• Hand-engineering features is hard!
• Deep learning
  • Intuitive explanation of how deep networks can learn hierarchical feature representations, and why it works
  • No need for humans to formally specify knowledge
  • Works remarkably well in practice, due to:
    • Scale of computation
    • Scale of data
  • Can be successfully applied to an increasingly wide variety of complex tasks
QUESTIONS
CONTACT
Maurizio Caló Caligaris
cs.stanford.edu/~maurizio
maurizio@cs.stanford.edu
