NEURAL NETWORKS AND DEEP
LEARNING
ASIM JALIS
GALVANIZE
INTRO
ASIM JALIS
Galvanize/Zipfian, Data
Engineering
Cloudera, Microsoft,
Salesforce
MS in Computer Science
from University of
Virginia
GALVANIZE PROGRAMS
Program                      Duration
Data Science Immersive       12 weeks
Data Engineering Immersive   12 weeks
Web Developer Immersive      6 months
Galvanize U                  1 year
TALK OVERVIEW
WHAT IS THIS TALK ABOUT?
Using Neural Networks
and Deep Learning
To recognize images
By the end of the class
you will be able to
create your own deep
learning systems
HOW MANY PEOPLE HERE HAVE
USED NEURAL NETWORKS?
HOW MANY PEOPLE HERE HAVE
USED MACHINE LEARNING?
HOW MANY PEOPLE HERE HAVE
USED PYTHON?
DEEP LEARNING
WHAT IS MACHINE LEARNING?
Self-driving cars
Voice recognition
Facial recognition
HISTORY OF DEEP LEARNING
HISTORY OF MACHINE LEARNING
Input    Features  Algorithm  Output
Machine  Human     Human      Machine
Machine  Human     Machine    Machine
Machine  Machine   Machine    Machine
FEATURE EXTRACTION
Traditionally, data scientists had to define features by hand
Deep learning systems are able to extract features
themselves
DEEP LEARNING MILESTONES
Years Theme
1980s Backpropagation invented, enabling multi-layer neural networks
2000s SVMs, Random Forests and other classifiers overtook NNs
2010s Deep Learning reignited interest in NNs
IMAGENET
AlexNet, submitted to the ImageNet ILSVRC challenge in
2012, is partly responsible for the renaissance.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used
Deep Learning techniques.
They combined this with GPUs and some other techniques.
The result was a neural network that could classify images
of cats and dogs.
It had an error rate of 16%, compared to 26% for the runner-up.
Ilya Sutskever, Alex Krizhevsky, Geoffrey Hinton
INDEED.COM/SALARY
MACHINE LEARNING
MACHINE LEARNING AND DEEP
LEARNING
Deep Learning fits inside Machine Learning
Deep Learning is a Machine Learning technique
They share techniques for evaluating and optimizing models
WHAT IS MACHINE LEARNING?
Inputs: vectors or points in a high-dimensional space
Outputs: Either binary vectors or continuous vectors
Machine Learning finds the relationship between them
Uses statistical techniques
SUPERVISED VS UNSUPERVISED
Supervised: Data needs to be labeled
Unsupervised: Data does not need to be labeled
TECHNIQUES
Classification
Regression
Clustering
Recommendations
Anomaly detection
CLASSIFICATION EXAMPLE:
EMAIL SPAM DETECTION
CLASSIFICATION EXAMPLE:
EMAIL SPAM DETECTION
Start with large collection of emails, labeled spam/not-spam
Convert email text into vectors of 0s and 1s: 1 if a word
occurs, 0 if it does not
These are called inputs or features
Split data set into training set (70%) and test set (30%)
Use algorithm like Random Forest to build model
Evaluate model by running it on test set and capturing
success rate
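A minimal sketch of this workflow using scikit-learn; the few placeholder emails and labels here are hypothetical stand-ins for a real labeled collection:

    # Spam-detection workflow sketch (placeholder data, not a real corpus).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    emails = ["win a free prize now", "meeting at 3pm tomorrow",
              "claim your free reward", "quarterly report attached"]
    labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

    # Convert email text into 0/1 word-occurrence vectors (the features)
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(emails)

    # Split into training set (70%) and test set (30%)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42)

    # Build a Random Forest model, then evaluate it on the test set
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    print("success rate:", model.score(X_test, y_test))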
CLASSIFICATION ALGORITHMS
Neural Networks
Random Forest
Support Vector Machines (SVM)
Decision Trees
Logistic Regression
Naive Bayes
CHOOSING AN ALGORITHM
Evaluate different models on data
Look at the relative success rates
Use rules of thumb: some algorithms work better on some
kinds of data
CLASSIFICATION EXAMPLES
Is this tumor benign or cancerous?
Is this lead profitable or not?
Who will win the presidential elections?
CLASSIFICATION: POP QUIZ
Is classification supervised or unsupervised learning?
Supervised because you have to label the data.
CLUSTERING EXAMPLE: LOCATE
CELL PHONE TOWERS
Start with GPS
coordinates of all cell
phone users
Represent data as
vectors
Locate towers in biggest
clusters
CLUSTERING EXAMPLE: T-SHIRTS
What size should a t-shirt be?
Everyone’s real t-shirt
size is different
Lay out all sizes and
cluster
Target large clusters
with XS, S, M, L, XL
CLUSTERING: POP QUIZ
Is clustering supervised or unsupervised?
Unsupervised because no labeling is required
RECOMMENDATIONS EXAMPLE:
AMAZON
Model looks at user
ratings of books
Viewing a book triggers an implicit rating
Recommends new books to the user
RECOMMENDATION: POP QUIZ
Are recommendation systems supervised or unsupervised?
Unsupervised
REGRESSION
Like classification
Output is continuous instead of one of k choices
REGRESSION EXAMPLES
How many units of product will sell next month?
What will a student score on the SAT?
What is the market price of this house?
How long before this engine needs repair?
REGRESSION EXAMPLE:
AIRCRAFT PART FAILURE
Cessna collects data
from airplane sensors
Predict when part needs
to be replaced
Ship part to customer’s
service airport
REGRESSION: QUIZ
Is regression supervised or unsupervised?
Supervised
ANOMALY DETECTION EXAMPLE:
CREDIT CARD FRAUD
Train model on good
transactions
Anomalous activity
indicates fraud
Can pass transaction
down to human for
investigation
ANOMALY DETECTION EXAMPLE:
NETWORK INTRUSION
Train model on network
login activity
Anomalous activity
indicates threat
Can initiate alerts and
lockdown procedures
ANOMALY DETECTION: QUIZ
Is anomaly detection supervised or unsupervised?
Unsupervised because we only train on normal data
FEATURE EXTRACTION
Converting data to feature vectors
Natural Language Processing
Principal Component Analysis
Auto-Encoders
FEATURE EXTRACTION: QUIZ
Is feature extraction supervised or unsupervised?
Unsupervised
MACHINE LEARNING WORKFLOW
DEEP LEARNING USED FOR
Feature Extraction
Classification
Regression
HISTORY OF MACHINE LEARNING
Input    Features  Algorithm  Output
Machine  Human     Human      Machine
Machine  Human     Machine    Machine
Machine  Machine   Machine    Machine
DEEP LEARNING FRAMEWORKS
DEEP LEARNING FRAMEWORKS
TensorFlow: NN library from Google
Theano: Low-level GPU-enabled tensor library
Torch7: NN library, uses Lua for binding, used by Facebook
and Google
Caffe: NN library from the Berkeley Vision and Learning Center (BVLC)
Nervana: Fast GPU-based machines optimized for deep
learning
DEEP LEARNING FRAMEWORKS
Keras, Lasagne, Blocks: NN libraries that make Theano
easier to use
CUDA: Programming model for using GPUs in general-purpose programming
cuDNN: NN library by Nvidia based on CUDA, can be used
with Torch7, Caffe
Chainer: NN library that uses CUDA
DEEP LEARNING PROGRAMMING
LANGUAGES
All the frameworks support Python
Except Torch7, which uses Lua as its binding language
TENSORFLOW
TensorFlow originally
developed by Google
Brain Team
Allows using GPUs for
deep learning
algorithms
Single processor version
released in 2015
Multiple processor
version released in
March 2016
KERAS
Supports Theano and TensorFlow as back-ends
Provides deep learning
API on top of TensorFlow
TensorFlow provides
low-level matrix
operations
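For flavor, a minimal Keras model sketch; the layer sizes and the 100-dimensional input are arbitrary illustrative choices:

    # Minimal Keras sketch: a two-layer network for 10-class classification.
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=100))  # hidden layer
    model.add(Dense(10, activation='softmax'))              # output layer
    model.compile(optimizer='sgd', loss='categorical_crossentropy',
                  metrics=['accuracy'])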
TENSORFLOW: GEOFFREY
HINTON, JEFF DEAN
KERAS: FRANCOIS CHOLLET
NEURAL NETWORKS
WHAT IS A NEURON?
Receives signals at its synapses
When triggered, sends a signal along its axon
MATHEMATICAL NEURON
Mathematical abstraction, inspired by biological neuron
Either on or off, based on the sum of its inputs
MATHEMATICAL FUNCTION
Neuron is a mathematical function
Adds up (weighted) inputs and applies sigmoid (or other
function)
This determines if it fires or not
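A sketch of such a neuron in numpy (the input and weight values are arbitrary):

    # A single mathematical neuron: weighted sum of inputs plus bias,
    # passed through a sigmoid to decide how strongly it "fires".
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(inputs, weights, bias):
        return sigmoid(np.dot(weights, inputs) + bias)

    print(neuron(np.array([0.5, 0.9]), np.array([0.4, -0.2]), 0.1))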
WHAT ARE NEURAL NETWORKS?
Biologically inspired machine learning algorithm
Mathematical neurons arranged in layers
Accumulate signals from the previous layer
Fire when signal reaches threshold
NEURAL NETWORKS
NEURON INCOMING
Each neuron receives
signals from neurons in
previous layer
Signal affected by
weight
Some are more
important than others
Bias is the base signal
that the neuron receives
NEURON OUTGOING
Each neuron sends its
signal to the neurons in
the next layer
Signals affected by
weight
LAYERED NETWORK
Each layer looks at features identified by previous layer
US ELECTIONS
ELECTIONS
Consider the elections
This is a gated system
A way to aggregate
different views
HIGHEST LEVEL: STATES
NEXT LEVEL: COUNTIES
ELECTIONS
Is this a Neural Network?
How many layers does it
have?
NEURON LAYERS
The nomination is the
last layer, layer N
States are layer N-1
Counties are layer N-2
Districts are layer N-3
Individuals are layer N-4
Individual brains have
even more layers
GRADIENT DESCENT
TRAINING: HOW DO WE
IMPROVE?
Calculate error from desired goal
Increase weights of neurons that voted right
Decrease weights of neurons that voted wrong
This will reduce error
GRADIENT DESCENT
This algorithm is called gradient descent
Think of error as function of weights
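A one-dimensional sketch: treat the error as a function of a single weight w, here a toy error E(w) = (w - 3)^2 with its minimum at w = 3:

    # Gradient descent on a toy error function E(w) = (w - 3)^2.
    # The derivative is dE/dw = 2 * (w - 3).
    w = 0.0
    learning_rate = 0.1
    for step in range(100):
        gradient = 2 * (w - 3)
        w -= learning_rate * gradient   # step opposite the gradient
    print(w)  # converges toward 3.0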
FEED FORWARD
Also called forward
propagation or forward
prop
Initialize inputs
Calculate activation of
each layer
Calculate activation of
output layer
BACK PROPAGATION
Use forward prop to
calculate the error
Error is function of all
network weights
Adjust weights using
gradient descent
Repeat with next record
Keep going over training
set until convergence
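A compact numpy sketch of this loop for a tiny 2-3-1 network; the XOR data, layer sizes, and learning rate are illustrative choices, not from the slides:

    # Forward prop + back prop for a tiny one-hidden-layer network.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy training set: XOR, a classic non-linearly-separable problem
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # 2 inputs -> 3 hidden
    W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # 3 hidden -> 1 output
    lr = 1.0

    for epoch in range(5000):
        # Forward prop: calculate the activation of each layer
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Back prop: error is a function of all the network weights
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # Gradient descent: adjust every weight against its gradient
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

    print(out.round(2))  # approaches [[0], [1], [1], [0]] as training converges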
HOW DO YOU FIND THE MINIMUM
IN AN N-DIMENSIONAL SPACE?
Take a step in the steepest direction.
The steepest direction is the gradient: the vector of partial
derivatives of the error with respect to each weight.
PUTTING ALL THIS TOGETHER
Use forward prop to
activate
Use back prop to train
Then use forward prop
to test
TYPES OF NEURONS
SIGMOID
TANH
RELU
BENEFITS OF RELU
Popular
Accelerates convergence
by 6x (Krizhevsky et al.)
The operation is faster since
it is linear, not exponential
Can die by going to zero
Pro: Sparse activations
Con: Network can die
LEAKY RELU
Pro: Does not die
Con: Activations are not sparse
SOFTMAX
Final layer of network
used for classification
Turns output into
probability distribution
Normalizes output of
neurons to sum to 1
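The neuron types above, sketched in numpy:

    # The activation functions discussed above.
    import numpy as np

    def sigmoid(z):                      # squashes to (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):                         # squashes to (-1, 1)
        return np.tanh(z)

    def relu(z):                         # linear above 0, zero (dead) below
        return np.maximum(0, z)

    def leaky_relu(z, alpha=0.01):       # small slope below 0, never dies
        return np.where(z > 0, z, alpha * z)

    def softmax(z):                      # normalizes outputs to sum to 1
        e = np.exp(z - np.max(z))        # subtract max for numerical stability
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # a probability distribution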
HYPERPARAMETER TUNING
PROBLEM: OIL EXPLORATION
Drilling holes is
expensive
We want to find the
biggest oilfield without
wasting money on duds
Where should we place
our next oil derrick?
PROBLEM: NEURAL NETWORKS
Testing
hyperparameters is
expensive
We have an N-dimensional grid of parameters
How can we quickly zero
in on the best
combination of
hyperparameters?
HYPERPARAMETER EXAMPLE
How many layers should
we have?
How many neurons
should we have in
hidden layers?
Should we use Sigmoid,
Tanh, or ReLU?
How should we initialize
the weights?
ALGORITHMS
Grid
Random
Bayesian Optimization
GRID
Systematically search
entire grid
Remember best found
so far
RANDOM
Randomly search the grid
Remember the best found so far
Bergstra and Bengio’s result and Alice Zheng’s
explanation (see References)
60 random samples get you within the top 5% of grid search
with 95% probability
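A sketch of random search; the grid values are arbitrary, and build_and_score is a hypothetical placeholder for training and evaluating one model:

    # Random hyperparameter search sketch.
    import random

    grid = {
        'layers':     [1, 2, 3, 4],
        'neurons':    [32, 64, 128, 256],
        'activation': ['sigmoid', 'tanh', 'relu'],
    }

    def build_and_score(params):      # placeholder objective function
        return random.random()

    best_score, best_params = -1.0, None
    for _ in range(60):               # 60 samples per Bergstra and Bengio
        params = {name: random.choice(values) for name, values in grid.items()}
        score = build_and_score(params)
        if score > best_score:        # remember the best found so far
            best_score, best_params = score, params

    print(best_params, best_score)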
BAYESIAN OPTIMIZATION
Balance between
explore and exploit
Exploit: test spots within
explored perimeter
Explore: test new spots
in random locations
Balance the trade-off
SIGOPT
YC-backed SF startup
Founded by Scott Clark
Raised $2M
Sells cloud-based
proprietary variant of
Bayesian Optimization
BAYESIAN OPTIMIZATION PRIMER
Bayesian Optimization Primer by Ian Dewancker, Michael
McCourt, Scott Clark
See References
OPEN SOURCE VARIANTS
Open source alternatives:
Spearmint
Hyperopt
SMAC
MOE
PRODUCTION
DEPLOYING
Phases: training,
deployment
Training phase runs on
back-end servers
Optimize hyperparameters
on the back-end
Deploy model to front-end
servers, browsers, devices
Front-end only uses
forward prop and is fast
SERIALIZING/DESERIALIZING
MODEL
Back-end: Serialize model + weights
Front-end: Deserialize model + weights
HDF5
Keras serializes model architecture to JSON
Keras serializes weights to HDF5
HDF5 is a serialization format for hierarchical data
APIs for C++, Python, Java, etc.
https://www.hdfgroup.org
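Roughly, the round trip in Keras; the tiny untrained model and file names here are illustrative only:

    # Serialize on the back-end: JSON for architecture, HDF5 for weights.
    from keras.models import Sequential, model_from_json
    from keras.layers import Dense

    model = Sequential()  # stand-in for an already-trained model
    model.add(Dense(10, activation='softmax', input_dim=100))
    with open('model.json', 'w') as f:
        f.write(model.to_json())          # architecture -> JSON
    model.save_weights('weights.h5')      # weights -> HDF5

    # Deserialize on the front-end: forward prop only from here.
    with open('model.json') as f:
        restored = model_from_json(f.read())
    restored.load_weights('weights.h5')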
DEPLOYMENT EXAMPLE: CANCER
DETECTION
Rhobota.com’s cancer-detecting iPhone app
Developed by Bryan Shaw after his son’s illness
Model built on back-end,
deployed on iPhone
iPhone detects retinal
cancer
DEEP LEARNING
WHAT IS DEEP LEARNING?
Deep Learning is a learning method that can train systems
with more than 2 or 3 non-linear hidden layers.
WHAT IS DEEP LEARNING?
Machine learning techniques which enable unsupervised
feature learning and pattern analysis/classification.
The essence of deep learning is to compute
representations of the data.
Higher-level features are defined from lower-level ones.
HOW IS DEEP LEARNING
DIFFERENT FROM REGULAR
NEURAL NETWORKS?
Training neural networks requires applying gradient
descent on millions of dimensions.
This is intractable for large networks.
Deep learning places constraints on neural networks.
This allows them to be solvable iteratively.
The constraints are generic.
AUTO-ENCODERS
WHAT ARE AUTO-ENCODERS?
An auto-encoder is a learning algorithm
It applies backpropagation and sets the target values to
be equal to its inputs
In other words, it trains itself to do the identity
transformation
WHY DOES IT DO THIS?
Auto-encoder places constraints on itself
E.g. it restricts the number of hidden neurons
This allows it to find a good representation of the data
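A minimal Keras sketch of such a constrained auto-encoder; the 784-dimensional input (e.g. 28x28 pixels) and 32-neuron bottleneck are illustrative choices:

    # Auto-encoder sketch: the small hidden layer is the constraint
    # that forces a compressed representation of the input.
    from keras.models import Sequential
    from keras.layers import Dense

    autoencoder = Sequential()
    autoencoder.add(Dense(32, activation='relu', input_dim=784))   # encoder
    autoencoder.add(Dense(784, activation='sigmoid'))              # decoder
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

    # Note the targets equal the inputs: the identity transformation.
    # autoencoder.fit(X, X, epochs=10)   # X: rows of 784 values in [0, 1]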
IS THE AUTO-ENCODER
SUPERVISED OR UNSUPERVISED?
It is unsupervised.
The data is unlabeled.
WHAT ARE CONVOLUTIONAL
NEURAL NETWORKS?
Feedforward neural networks
Connection pattern inspired by visual cortex
CONVOLUTIONAL NEURAL
NETWORKS
CNNS
The convolutional layer’s parameters are a set of
learnable filters
Every filter is small along width and height
During the forward pass, each filter slides across the width
and height of the input, producing a 2-dimensional
activation map
As we slide across the input we compute the dot product
between the filter and the input
CNNS
Intuitively, the network learns filters that activate when
they see a specific type of feature anywhere in the input
In this way it creates translation invariance
CONVNET EXAMPLE
Zero-Padding: the boundaries are padded with 0s
Stride: how much the filter moves in the convolution
Parameter sharing: all filters share the same parameters
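A naive numpy sketch of a single filter's convolution, showing zero-padding and stride; the 5x5 image and all-ones filter are arbitrary:

    # 2-D convolution: slide the filter across the input, computing
    # the dot product at each position to build a 2-D activation map.
    import numpy as np

    def conv2d(image, kernel, stride=1, pad=1):
        image = np.pad(image, pad)                     # zero-padding
        kh, kw = kernel.shape
        oh = (image.shape[0] - kh) // stride + 1
        ow = (image.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))                       # activation map
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)     # dot product
        return out

    image = np.arange(25.0).reshape(5, 5)
    kernel = np.ones((3, 3))                           # one shared filter
    print(conv2d(image, kernel, stride=2, pad=1).shape)  # (3, 3)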
CONVNET EXAMPLE
From http://cs231n.github.io/convolutional-networks/
WHAT IS A POOLING LAYER?
The pooling layer reduces the resolution of the image
further
It tiles the output area with a 2x2 mask and takes the
maximum activation value in each tile
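A numpy sketch of that 2x2 max-pooling operation on an arbitrary 4x4 input:

    # Max pooling: keep the maximum activation in each 2x2 tile.
    import numpy as np

    def max_pool_2x2(x):
        h, w = x.shape[0] // 2, x.shape[1] // 2
        return x[:h*2, :w*2].reshape(h, 2, w, 2).max(axis=(1, 3))

    a = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 5],
                  [0, 1, 8, 2],
                  [3, 2, 4, 6]])
    print(max_pool_2x2(a))   # [[4 5] [3 8]]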
REVIEW
keras/examples/mnist_cnn.py
Recognizes hand-written digits
By combining different layers
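A condensed sketch in the spirit of that example, using current Keras layer names; filter counts and layer sizes are simplified from the original script:

    # conv -> pool -> dense -> softmax over the 10 digit classes.
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu',
                     input_shape=(28, 28, 1)))       # learnable filters
    model.add(MaxPooling2D(pool_size=(2, 2)))        # reduce resolution
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(10, activation='softmax'))       # digit probabilities
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])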
RECURRENT NEURAL NETWORKS
RNNS
RNNs capture patterns
in time series data
Constrained by shared
weights across neurons
Each neuron observes a
different time step
LSTMS
Long Short Term Memory networks
RNNs cannot handle long time lags between events
LSTMs can pick up patterns separated by big lags
Used for speech recognition
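A minimal Keras LSTM sketch for sequence classification; the sequence length (50 steps) and feature count (16) are arbitrary assumptions:

    # LSTM over a time series, ending in a single binary label.
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(64, input_shape=(50, 16)))   # weights shared across time
    model.add(Dense(1, activation='sigmoid'))   # one label per sequence
    model.compile(optimizer='adam', loss='binary_crossentropy')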
RNN EFFECTIVENESS
Andrej Karpathy uses
LSTMs to generate text
Generates Shakespeare,
Linux Kernel code,
mathematical proofs.
See
http://karpathy.github.io/
RNN INTERNALS
LSTM INTERNALS
CONCLUSION
REFERENCES
"Bayesian Optimization Primer" by Dewancker et al. (http://sigopt.com)
"Random Search for Hyper-Parameter Optimization" by Bergstra and Bengio (http://jmlr.org)
"Evaluating Machine Learning Models" by Alice Zheng (http://www.oreilly.com)
REFERENCES
"Dropout" by Hinton et al. (http://cs.utoronto.edu)
"Understanding LSTM Networks" by Chris Olah (http://colah.github.io)
"Multi-scale Deep Learning for Gesture Detection and Localization" by Neverova et al. (http://uoguelph.ca)
"The Unreasonable Effectiveness of Recurrent Neural Networks" by Karpathy (http://karpathy.github.io)
QUESTIONS
