MLCC #4 Neural Network
Presented by Ofa

2018.7.18
Agenda
• Introduction to NN

• Backpropagation

• Training Neural Networks

• Multi-Class NN

• Embeddings

• ML Engineering
Introduction to Artificial Neural
Network
What is ANN?
• First we may need to think about what is INTELLIGENCE?
Intelligence
The Octopus https://goo.gl/eUS7nS
What is ANN?
What is ANN?
Non-linear Problems
Linear solver + linear solver = still a linear solver!!!
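A quick numpy check of that claim (the shapes and random values here are arbitrary): two stacked linear layers with no activation collapse into a single linear layer with merged weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # first linear layer
W2 = rng.normal(size=(4, 2))   # second linear layer
x = rng.normal(size=(5, 3))    # a batch of 5 inputs

two_layers = (x @ W1) @ W2     # stacked linear layers...
one_layer = x @ (W1 @ W2)      # ...equal one layer with merged weights

assert np.allclose(two_layers, one_layer)
print("still linear")
```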
Non-linear Problems
Real Case: TCM price
Price from store (per qian, 錢)
Price from origin (per jin, 斤)
ReLU
adding a non-linear activation function => a non-linear model
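ReLU itself is a one-liner; a minimal sketch:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: element-wise max(0, x)."""
    return np.maximum(0.0, x)

# The kink at zero is what makes the resulting model non-linear:
print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```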
Activation Functions
More activation functions:
https://www.tensorflow.org/api_guides/python/nn
Playground - with 1 hidden layer node
Linear activation Sigmoid activation
ReLU activation
Playground - with 2 hidden layer nodes
Linear activation Sigmoid activation
ReLU activation
#sometimes converges to a different result
Playground - challenge 0.177 loss
First trial Remove empty nodes
#L2 regularization is required
Playground - initialization
First trial Second trial
#DIY
Playground - Spiral
First trial Second trial
You can still reach good performance by tuning only the
parameters, rather than doing feature engineering
Playground - Spiral
First trial Second trial
It is a trade-off between more features and more computing
power ($$$)
Programming Exercise
OK… it’s the number of steps and the batch size that matter…
Backpropagation
Backpropagation
• How data flows through the graph.

• How dynamic programming lets us
avoid computing exponentially many
paths through the graph.
Backpropagation
• Update weights according to
the error

• w_ij = w_ij - a * dE/dw_ij
Backpropagation
• Starting from the output layer!

• E = 1/2 (y_output - y_target)^2, so dE/dy_output = y_output - y_target
Backpropagation
• Go backward for each node

• dE/dw_ij = dx_j/dw_ij * dE/dx_j

           = y_i * dE/dx_j

• # because x_j = Σ_i y_i * w_ij, so dx_j/dw_ij = y_i

For example, for w46 (the weight from node 4 into node 6):

dE/dw46 = dx6/dw46 * dE/dx6

        = y4 * dE/dx6
–Trust me, it’s too complicated to understand
“What on earth are you talking about?” (哩喜勒勒公三小, Taiwanese)
Example
Input Layer Hidden Layer Output Layer
Bias
yh1 = f(0.3825) = 0.5944
To compute the sigmoid, you can use:
https://goo.gl/Jiuw2p
Try to compute yh2, o1, o2 by yourself
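Instead of the calculator link, the logistic sigmoid is easy to code directly; it reproduces the slide’s values (to the slide’s precision):

```python
import math

def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.3825))  # ~0.5944  (yh1)
print(sigmoid(1.106))   # ~0.7513  (o1)
print(sigmoid(1.225))   # ~0.7729  (o2)
```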
Reference
Example
Input Layer Hidden Layer Output Layer
Bias
So we get:
yh1= 0.5944
yh2 = 0.5968
o1 = f(1.106) = 0.7513
o2 = f(1.225) = 0.7729
Example
Input Layer Hidden Layer Output Layer
Bias
Then we can update the weights:
OutputO1 = 0.75
OutputO2 = 0.773
Etotal = EO1 + EO2
= 1/2*(0.01 - 0.75)^2 +
1/2*(0.99 - 0.773)^2
≈ 0.2974
w5new = w5old - a*dEtotal/dw5
Example
Then we can get:
w5new = w5old - a*dEtotal/dw5
so, w5new = 0.4 - 0.5 * 0.082
= 0.359
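The same update as a couple of lines of Python (the learning rate a = 0.5 and the gradient 0.082 are the slide’s values):

```python
a = 0.5                  # learning rate
dE_dw5 = 0.082           # dEtotal/dw5 from the slide
w5_old = 0.4

w5_new = w5_old - a * dE_dw5
print(round(w5_new, 3))  # 0.359
```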
Example
Then we can get:
w5new = 0.359
w6new = 0.4
w7new = 0.51
w8new = 0.56
Next, we need to update the first
layer, i.e., w1~w4
Input Layer Hidden Layer Output Layer
Bias
Example
Input Layer Hidden Layer Output Layer
Bias
#reusing the error terms we already computed for the output layer
Then we can also propagate the error back to get the gradients for w1~w4:
Example
Input Layer Hidden Layer Output Layer
Bias
Example
Input Layer Hidden Layer Output Layer
Bias
Then we can get:
w1new = 0.1497
w2new = 0.1995
w3new = 0.2497
w4new = 0.2995
Brief Summary
• You can do this update for all the weights; just remember to update
them all together instead of one by one. 

• That means you should always compute each update from the old weights; do not
mix old and new values within one step.

• But in practice, just calling nn.train() is the best way to do it!
Training Neural Nets
• Things to note:

• The loss and activations must be differentiable so we can learn from the gradients.

• Gradients can vanish or explode as layers are added; remedies include ReLUs,
tuning the learning rate, and batch normalization.

• Lower-layer gradients can shrink toward zero, which makes training slow;
using ReLU helps prevent this (vanishing gradients).

• If weights grow too large, they can make lower-layer gradients explode; use
batch normalization to avoid this.

• ReLU layers can die: lower the learning rate
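A toy numeric illustration of the vanishing-gradient point above: backprop multiplies one activation derivative per layer, the sigmoid’s derivative is at most 0.25, while ReLU’s is 1 on its active side (the depth of 10 is arbitrary):

```python
layers = 10

# Upper bound on the gradient factor contributed by the activations alone:
sigmoid_chain = 0.25 ** layers   # sigmoid'(x) <= 0.25 everywhere
relu_chain = 1.0 ** layers       # relu'(x) = 1 where the unit is active

print(sigmoid_chain)  # ~9.5e-07: the lower layers barely learn
print(relu_chain)     # 1.0: the signal survives
```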
Dropout Regularization
• Randomly drop out units in
the network for a single
gradient step.

• The rate is controlled from 0.0 to
1.0; a rate of 1.0 drops out all nodes,
and the network learns nothing!

• This regularization mechanism has
helped make deep learning practical
in recent years
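A minimal numpy sketch of (“inverted”) dropout for one gradient step; the function name and shapes are illustrative:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each unit with probability `rate`; rescale survivors so the
    expected activation is unchanged (inverted dropout). rate must be
    < 1.0; rate=1.0 would drop every node and nothing would be learned."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
h = np.ones((2, 8))                    # pretend hidden-layer activations
print(dropout(h, rate=0.5, rng=rng))   # entries: 0.0 (dropped) or 2.0 (kept, rescaled)
```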
Programming Exercises: Normalization
This way is much simpler than the solution…
Programming Exercises: Optimizer
AdagradOptimizer:

Automatically reduces the effective learning rate for each parameter

RMSE=122.29 / 124.10

AdamOptimizer:

Adaptive Moment Estimation,
computes adaptive learning rates for
each parameter.

RMSE= 67.67 / 67.48
Reference:
http://ruder.io/optimizing-gradient-descent/
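The heart of Adagrad’s per-parameter adaptation fits in a few lines of numpy; this is the textbook update rule, not TensorFlow’s internals, and the values are illustrative:

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """Divide each parameter's step by the root of its accumulated
    squared gradients, so its effective learning rate shrinks over time."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for _ in range(3):
    w, accum = adagrad_step(w, np.array([1.0, 0.1]), accum)

# Despite 10x different gradients, both parameters take similar steps,
# because each step is scaled by that parameter's own gradient history:
print(w)
```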
Programming Exercises: Normalization+
You can pass the normalization into
the function options, which makes it
simpler

z_score, RMSE: 71.54 / 70.39

binary_threshold(0.5), RMSE:
115.78 / 116.41

clip(0.1, 0.8), RMSE: 115.77 / 116.33

log_normalize: fails with a math error (log of non-positive values)
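Hand-rolled versions of those transforms make the results, and the math error, easy to see (the array values are made up; the exercise’s real helpers live in the notebook):

```python
import numpy as np

x = np.array([-2.0, 0.0, 1.0, 5.0])

z_score = (x - x.mean()) / x.std()
binary_threshold = (x > 0.5).astype(float)   # threshold at 0.5
clipped = np.clip(x, 0.1, 0.8)

print(z_score)
print(binary_threshold)  # [0. 0. 1. 1.]
print(clipped)           # [0.1 0.1 0.8 0.8]
# log_normalize is the odd one out: np.log(x) on values <= 0
# is exactly the "math error" above.
```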
Multi-Class NN
Multi-Class NN
One-class NN Multi-Class NN
See Food
The ‘See Food’ app from Silicon Valley really happened, and it was also a lie
“Meal Snap”
See Food
• Multi-class, single label: this
is a hotdog, an octopus, or a
banana

• => softmax (use candidate
sampling when there are many classes)

• Multi-class, multi-label: this
picture contains hotdog,
cucumber, tomato, and onion

• => one logistic regression
(sigmoid) output per class
Softmax
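The two output-layer options above, sketched in numpy (the three logits are toy values for hotdog / octopus / banana):

```python
import numpy as np

def softmax(logits):
    """Single-label: a probability distribution over mutually exclusive classes."""
    z = np.exp(logits - logits.max())   # shift for numerical stability
    return z / z.sum()

def per_class_sigmoid(logits):
    """Multi-label: one independent probability per class."""
    return 1.0 / (1.0 + np.exp(-logits))

logits = np.array([2.0, 0.5, -1.0])
print(softmax(logits))            # sums to 1.0: pick exactly one class
print(per_class_sigmoid(logits))  # each in (0, 1): any subset of classes
```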
AWS Sagemaker
Ref: https://goo.gl/3HMkPR
Embeddings
Collaborative Filtering
Step 1. Preprocessing: build a dict of
all movies
Step 2. Encode each user’s behavior
as a sparse representation
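Those two steps in plain Python (the movie titles and the user’s history are made up for illustration):

```python
# Step 1. Preprocessing: build a dict mapping every movie to an index.
movies = ["Shrek", "The Matrix", "Spirited Away", "Amelie"]
vocab = {title: i for i, title in enumerate(movies)}

# Step 2. Encode one user's behavior as a sparse (multi-hot) vector.
watched = ["Shrek", "Amelie"]
sparse = [0] * len(vocab)
for title in watched:
    sparse[vocab[title]] = 1

print(sparse)  # [1, 0, 0, 1]
```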
Embeddings
Embeddings
• Embed the data into a
d-dimensional space,
mapping items to low-
dimensional real vectors 

• the number of dimensions is
usually determined
empirically
hidden layers
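Mechanically, an embedding layer is just a learned matrix, and looking up an item’s vector is a row index; the sizes below are illustrative:

```python
import numpy as np

vocab_size, d = 4, 3                           # 4 items, d chosen empirically
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d))   # learned during training

item_id = 2
vector = embedding[item_id]                    # the item's low-dimensional real vector
print(vector.shape)  # (3,)
```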
Example
PCA
Reference:
[1] https://goo.gl/XetAUb (right)
[2] https://goo.gl/HctuRj (left, including python examples)
Word2Vec
Ray Hsueh gave an awesome talk last week!!!
ML Engineering
Production ML Systems
What we’ve learned so far…
Static vs Dynamic Training
• Static - Trained offline. For data that do not change much over time.

• Pros: easy to build and test; batch train, then test and iterate until good

• Cons: inputs still require monitoring; the model easily grows stale

• Dynamic - Trained online. 

• Pros: data can be fed in continuously, with updated versions regularly synced
out. Uses progressive validation rather than batch training & testing. Adapts to changes.

• Cons: needs monitoring, plus model-rollback & data-quarantine capabilities
Static vs Dynamic Inference
• Static - Inference offline (batch prediction). For data that do not change much over time.

• Pros: much lower computational cost

• Cons: you need all the data at hand; update latency can be very long

• Dynamic - Inference online. 

• Pros: can predict on the newest data

• Cons: latency is higher, and you need budget to bring it down
Data Dependencies
• Feature and data changes have a huge impact on the model

• Unit tests for data?

• Reliability: what happens if an input signal disappears?

• Versioning: do features change over time?

• Necessity: how useful is a feature relative to its computational cost?

• Correlations: are features tied together, or can they be teased apart?

• Feedback loops: could my inputs be affected by my own outputs?
Real World Examples
Cancer Prediction
• Hospitals that specialize in cancer treatment made the model overfit: the
hospital a record came from effectively revealed the label

• => label leakage, which works just like cheating
Real World Guidelines
• Keep the very first model extremely simple

• Focus on data pipeline correctness

• Use a simple, observable metric for training & evaluation

• Own and monitor your input features

• Treat your model configuration as code: review it, check it in

• Write down the results of all experiments, especially “failures”
Good Bye!

More Related Content

Similar to Mlcc #4

backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
Akash Goel
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
Venkata Reddy Konasani
 
ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...
ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...
ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...Daniel Katz
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
Anuj Gupta
 
CPP10 - Debugging
CPP10 - DebuggingCPP10 - Debugging
CPP10 - Debugging
Michael Heron
 
Algorithm-RepetitionSentinellNestedLoop_Solution.pptx
Algorithm-RepetitionSentinellNestedLoop_Solution.pptxAlgorithm-RepetitionSentinellNestedLoop_Solution.pptx
Algorithm-RepetitionSentinellNestedLoop_Solution.pptx
AliaaAqilah3
 
Case Study of the Unexplained
Case Study of the UnexplainedCase Study of the Unexplained
Case Study of the Unexplainedshannomc
 
Dutch PHP Conference 2013: Distilled
Dutch PHP Conference 2013: DistilledDutch PHP Conference 2013: Distilled
Dutch PHP Conference 2013: Distilled
Zumba Fitness - Technology Team
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
Kien Le
 
Performance tuning the Spring Pet Clinic sample application
Performance tuning the Spring Pet Clinic sample applicationPerformance tuning the Spring Pet Clinic sample application
Performance tuning the Spring Pet Clinic sample application
Julien Dubois
 
Automated Scaling of Microservice Stacks for JavaEE Applications
Automated Scaling of Microservice Stacks for JavaEE ApplicationsAutomated Scaling of Microservice Stacks for JavaEE Applications
Automated Scaling of Microservice Stacks for JavaEE Applications
Jelastic Multi-Cloud PaaS
 
How EVERFI Moved from No Automation to Continuous Test Generation in 9 Months
How EVERFI Moved from No Automation to Continuous Test Generation in 9 MonthsHow EVERFI Moved from No Automation to Continuous Test Generation in 9 Months
How EVERFI Moved from No Automation to Continuous Test Generation in 9 Months
Applitools
 
Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Tim Bunce
 
Soft quality & standards
Soft quality & standardsSoft quality & standards
Soft quality & standards
Prince Bhanwra
 
Soft quality & standards
Soft quality & standardsSoft quality & standards
Soft quality & standards
Prince Bhanwra
 
Introduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner'sIntroduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner's
Vidyasagar Bhargava
 
Agile Experiments in Machine Learning
Agile Experiments in Machine LearningAgile Experiments in Machine Learning
Agile Experiments in Machine Learning
mathias-brandewinder
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis Introduction
Te-Yen Liu
 
Introduction to c first week slides
Introduction to c first week slidesIntroduction to c first week slides
Introduction to c first week slides
luqman bawany
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
Mahyuddin8
 

Similar to Mlcc #4 (20)

backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...
ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...
ICPSR - Complex Systems Models in the Social Sciences - Lab Session 6 - Profe...
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 
CPP10 - Debugging
CPP10 - DebuggingCPP10 - Debugging
CPP10 - Debugging
 
Algorithm-RepetitionSentinellNestedLoop_Solution.pptx
Algorithm-RepetitionSentinellNestedLoop_Solution.pptxAlgorithm-RepetitionSentinellNestedLoop_Solution.pptx
Algorithm-RepetitionSentinellNestedLoop_Solution.pptx
 
Case Study of the Unexplained
Case Study of the UnexplainedCase Study of the Unexplained
Case Study of the Unexplained
 
Dutch PHP Conference 2013: Distilled
Dutch PHP Conference 2013: DistilledDutch PHP Conference 2013: Distilled
Dutch PHP Conference 2013: Distilled
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Performance tuning the Spring Pet Clinic sample application
Performance tuning the Spring Pet Clinic sample applicationPerformance tuning the Spring Pet Clinic sample application
Performance tuning the Spring Pet Clinic sample application
 
Automated Scaling of Microservice Stacks for JavaEE Applications
Automated Scaling of Microservice Stacks for JavaEE ApplicationsAutomated Scaling of Microservice Stacks for JavaEE Applications
Automated Scaling of Microservice Stacks for JavaEE Applications
 
How EVERFI Moved from No Automation to Continuous Test Generation in 9 Months
How EVERFI Moved from No Automation to Continuous Test Generation in 9 MonthsHow EVERFI Moved from No Automation to Continuous Test Generation in 9 Months
How EVERFI Moved from No Automation to Continuous Test Generation in 9 Months
 
Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)
 
Soft quality & standards
Soft quality & standardsSoft quality & standards
Soft quality & standards
 
Soft quality & standards
Soft quality & standardsSoft quality & standards
Soft quality & standards
 
Introduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner'sIntroduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner's
 
Agile Experiments in Machine Learning
Agile Experiments in Machine LearningAgile Experiments in Machine Learning
Agile Experiments in Machine Learning
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis Introduction
 
Introduction to c first week slides
Introduction to c first week slidesIntroduction to c first week slides
Introduction to c first week slides
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 

More from Chung-Hsiang Ofa Hsueh

Secret weapons for startups
Secret weapons for startupsSecret weapons for startups
Secret weapons for startups
Chung-Hsiang Ofa Hsueh
 
Head first latex
Head first latexHead first latex
Head first latex
Chung-Hsiang Ofa Hsueh
 
MLCC #2
MLCC #2MLCC #2
2018.06.03.the hard thing about hard things for hpx-kh
2018.06.03.the hard thing about hard things for hpx-kh2018.06.03.the hard thing about hard things for hpx-kh
2018.06.03.the hard thing about hard things for hpx-kh
Chung-Hsiang Ofa Hsueh
 
YC Startup School 2016 Info sharing@inbetween international
YC Startup School 2016 Info sharing@inbetween internationalYC Startup School 2016 Info sharing@inbetween international
YC Startup School 2016 Info sharing@inbetween international
Chung-Hsiang Ofa Hsueh
 
2016.7.19 汽車駭客手冊
2016.7.19 汽車駭客手冊2016.7.19 汽車駭客手冊
2016.7.19 汽車駭客手冊
Chung-Hsiang Ofa Hsueh
 
2016.6.17 TEIL group meeting
2016.6.17 TEIL group meeting2016.6.17 TEIL group meeting
2016.6.17 TEIL group meeting
Chung-Hsiang Ofa Hsueh
 
Introduction of Silicon Valley Innovation Safari
Introduction of Silicon Valley Innovation SafariIntroduction of Silicon Valley Innovation Safari
Introduction of Silicon Valley Innovation Safari
Chung-Hsiang Ofa Hsueh
 
Ec x fintech
Ec x fintechEc x fintech
2016.3.22 從車庫的舊pc到百萬台伺服器
2016.3.22 從車庫的舊pc到百萬台伺服器2016.3.22 從車庫的舊pc到百萬台伺服器
2016.3.22 從車庫的舊pc到百萬台伺服器
Chung-Hsiang Ofa Hsueh
 
2015.6.29 以色列新創背包攻略本
2015.6.29 以色列新創背包攻略本2015.6.29 以色列新創背包攻略本
2015.6.29 以色列新創背包攻略本
Chung-Hsiang Ofa Hsueh
 
2015.11.21 Scrum:用一半的時間做兩倍的事
2015.11.21 Scrum:用一半的時間做兩倍的事2015.11.21 Scrum:用一半的時間做兩倍的事
2015.11.21 Scrum:用一半的時間做兩倍的事
Chung-Hsiang Ofa Hsueh
 
2015.10.31 淺談矽谷的fintech趨勢
2015.10.31 淺談矽谷的fintech趨勢2015.10.31 淺談矽谷的fintech趨勢
2015.10.31 淺談矽谷的fintech趨勢
Chung-Hsiang Ofa Hsueh
 
Pretotype it
Pretotype itPretotype it
2015.9.2 矽谷與以色列的祕密醬汁
2015.9.2 矽谷與以色列的祕密醬汁2015.9.2 矽谷與以色列的祕密醬汁
2015.9.2 矽谷與以色列的祕密醬汁
Chung-Hsiang Ofa Hsueh
 
2015.1.5 os.server.keyterms
2015.1.5 os.server.keyterms2015.1.5 os.server.keyterms
2015.1.5 os.server.keyterms
Chung-Hsiang Ofa Hsueh
 
2015.06.16 why silicon valley matters
2015.06.16 why silicon valley matters2015.06.16 why silicon valley matters
2015.06.16 why silicon valley matters
Chung-Hsiang Ofa Hsueh
 
2015.3.12 the root of lisp
2015.3.12 the root of lisp2015.3.12 the root of lisp
2015.3.12 the root of lisp
Chung-Hsiang Ofa Hsueh
 
2015.4.10 守護程序ii 自由之戰
2015.4.10 守護程序ii 自由之戰2015.4.10 守護程序ii 自由之戰
2015.4.10 守護程序ii 自由之戰
Chung-Hsiang Ofa Hsueh
 
2015.4.7 startup nation
2015.4.7 startup nation2015.4.7 startup nation
2015.4.7 startup nation
Chung-Hsiang Ofa Hsueh
 

More from Chung-Hsiang Ofa Hsueh (20)

Secret weapons for startups
Secret weapons for startupsSecret weapons for startups
Secret weapons for startups
 
Head first latex
Head first latexHead first latex
Head first latex
 
MLCC #2
MLCC #2MLCC #2
MLCC #2
 
2018.06.03.the hard thing about hard things for hpx-kh
2018.06.03.the hard thing about hard things for hpx-kh2018.06.03.the hard thing about hard things for hpx-kh
2018.06.03.the hard thing about hard things for hpx-kh
 
YC Startup School 2016 Info sharing@inbetween international
YC Startup School 2016 Info sharing@inbetween internationalYC Startup School 2016 Info sharing@inbetween international
YC Startup School 2016 Info sharing@inbetween international
 
2016.7.19 汽車駭客手冊
2016.7.19 汽車駭客手冊2016.7.19 汽車駭客手冊
2016.7.19 汽車駭客手冊
 
2016.6.17 TEIL group meeting
2016.6.17 TEIL group meeting2016.6.17 TEIL group meeting
2016.6.17 TEIL group meeting
 
Introduction of Silicon Valley Innovation Safari
Introduction of Silicon Valley Innovation SafariIntroduction of Silicon Valley Innovation Safari
Introduction of Silicon Valley Innovation Safari
 
Ec x fintech
Ec x fintechEc x fintech
Ec x fintech
 
2016.3.22 從車庫的舊pc到百萬台伺服器
2016.3.22 從車庫的舊pc到百萬台伺服器2016.3.22 從車庫的舊pc到百萬台伺服器
2016.3.22 從車庫的舊pc到百萬台伺服器
 
2015.6.29 以色列新創背包攻略本
2015.6.29 以色列新創背包攻略本2015.6.29 以色列新創背包攻略本
2015.6.29 以色列新創背包攻略本
 
2015.11.21 Scrum:用一半的時間做兩倍的事
2015.11.21 Scrum:用一半的時間做兩倍的事2015.11.21 Scrum:用一半的時間做兩倍的事
2015.11.21 Scrum:用一半的時間做兩倍的事
 
2015.10.31 淺談矽谷的fintech趨勢
2015.10.31 淺談矽谷的fintech趨勢2015.10.31 淺談矽谷的fintech趨勢
2015.10.31 淺談矽谷的fintech趨勢
 
Pretotype it
Pretotype itPretotype it
Pretotype it
 
2015.9.2 矽谷與以色列的祕密醬汁
2015.9.2 矽谷與以色列的祕密醬汁2015.9.2 矽谷與以色列的祕密醬汁
2015.9.2 矽谷與以色列的祕密醬汁
 
2015.1.5 os.server.keyterms
2015.1.5 os.server.keyterms2015.1.5 os.server.keyterms
2015.1.5 os.server.keyterms
 
2015.06.16 why silicon valley matters
2015.06.16 why silicon valley matters2015.06.16 why silicon valley matters
2015.06.16 why silicon valley matters
 
2015.3.12 the root of lisp
2015.3.12 the root of lisp2015.3.12 the root of lisp
2015.3.12 the root of lisp
 
2015.4.10 守護程序ii 自由之戰
2015.4.10 守護程序ii 自由之戰2015.4.10 守護程序ii 自由之戰
2015.4.10 守護程序ii 自由之戰
 
2015.4.7 startup nation
2015.4.7 startup nation2015.4.7 startup nation
2015.4.7 startup nation
 

Recently uploaded

社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 

Recently uploaded (20)

社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 

Mlcc #4

  • 1. MLCC #4 Neural Network Presented by Ofa 2018.7.18
  • 2. Agenda • Introduction to NN • Backpropogation • Training Neural Networks • Multi-Class NN • Embeddings • ML Engineering
  • 4. What is ANN? • First we may need to think about what is INTELLIGENCE?
  • 9. Non-linear Problems Linear solver + linear solver = linear solver!!!
  • 10. Non-linear Problems Real Case: TCM price Price from store(每錢) Price from origin(每⽄斤)
  • 11. ReLU non-linear function => nonlinear model
  • 12. Activation Functions More activation functions: https://www.tensorflow.org/api_guides/python/nn
  • 13. Playground - with 1 hidden layer node Linear activation Sigmoid activation ReLu activation
  • 14. Playground - with 2 hidden layer nodes Linear activation Sigmoid activation ReLu activation #sometimes shows another result
  • 15. Playground - challenge 0.177 loss First trial Remove empty nodes #L2 regularization is required
  • 16. Playground - initialization First trial Second trial #DIY
  • 17. Playground - Spiral First trial Second trial You can still only tuning the parameters to reach a good performance rather than doing feature engineering
  • 18. Playground - Spiral First trial Second trial It is a choice between more features and more computing power($$$)
  • 19. Programming Exercise OK.. it’s steps and batch size that matter…
  • 21. Backpropogation • How data flows through the graph. • How dynamic programming lets us avoid computing exponentially many paths through the graph.
  • 22. Backpropogation • Update weights according to the error • wij = wij - a*dE/dwij
  • 23. Backpropogation • Starting from the output layer! • d(1/2(youtput - y target)^2) = youtput - y target
  • 24. Backpropogation • Go backward for each node • dE/dwij = dxj/dwij *dE/dxj = yi *dE/dxj #cuzxj = yi *wij dE/dw46 = dx6/dw46 *dE/dx6 = y4 *dE/dx6
  • 25. –Trust me, it’s too complicated to understand “哩喜勒勒公三⼩小”
  • 26. Example Input Layer Hidden Layer Output Layer Bias = f(0.3825) = 0.5944 To compute sigmoid, you can use : https://goo.gl/Jiuw2p Try to compute yh2, o1, o2 by yourself Reference
  • 27. Example Input Layer Hidden Layer Output Layer Bias So we get: yh1= 0.5944 yh2 = 0.5968 o1 = f(1.106) = 0.7513 o2 = f(1.225) = 0.7729
  • 28. Example Input Layer Hidden Layer Output Layer Bias Then we can update weights: OutputO1 = 0.75 OutputO2 = 0.773 Etotal = EO1 + EO2 = 1/2*(0.01-0.75)^2 + 1/ 2*(.099 - 0.773)^2 = 0.74 w5new = w5old - a*dEtotal/dw5
  • 29. Example Then we can get: w5new = w5old - a*dEtotal/dw5 so, w5new = 0.4 - 0.5 * 0.082 = 0.359
  • 30. Example Then we can get: w5new = 0.359 w6new = 0.4 w7new = 0.51 w8new = 0.56 Next, we need to update the first layer, ie. w1~w4 Input Layer Hidden Layer Output Layer Bias
  • 31. Example Input Layer Hidden Layer Output Layer Bias #We’ve already computed in the previous layer Then we can also get : #
  • 32. Example Input Layer Hidden Layer Output Layer Bias
  • 33. Example Input Layer Hidden Layer Output Layer Bias Then we can get: w1new = 0.1497 w2new = 0.1995 w3new = 0.2497 w4new = 0.2995
  • 34. Brief Summary • You can do the update for all weights, just remember you need to update all the weight together instead of one-by-one. • That means you should always update the weights using old data, do not mix the old ones and new ones. • But, I think just call nn.train() would be the best way to do it!
  • 35. Training Neural Nets • Thing to note: • Gradients should be differentiable so we can learn from it. • Gradients can vanish and explode: additional layer, ReLUs / learning rate, batch normalization • Lower level gradient may go closer to zero that makes training slow. Use ReLU can prevent it. • If weights are too large, they may make lower level gradients explode. Use batch normalization to avoid it. • ReLU layers can die: learning rate
• 36. Dropout Regularization • Randomly drop out units in the network for a single gradient step. • The dropout rate ranges from 0.0 to 1.0; a rate of 1.0 drops out every node, so the model learns nothing! • This mechanism is one reason deep learning has become so useful in recent years.
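The mechanism can be sketched in a few lines of NumPy. This is the common "inverted dropout" formulation, not code from the course; the rescaling by 1/(1 - rate) keeps the expected activation unchanged:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, rate):
    """Zero out a `rate` fraction of units and rescale the survivors."""
    if rate >= 1.0:
        return np.zeros_like(x)        # rate 1.0: drop everything, learn nothing
    mask = (rng.random(x.shape) >= rate).astype(x.dtype)
    return x * mask / (1.0 - rate)     # inverted-dropout rescaling

out = dropout(np.ones(10), 0.5)        # survivors of rate 0.5 are rescaled to 2.0
```

At inference time no units are dropped, and because of the rescaling no extra correction is needed.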
  • 37. Programming Exercises: Normalization This way is much simpler than the solution…
• 38. Programming Exercises: Optimizer AdagradOptimizer: automatically reduces the learning rate per parameter over time. RMSE = 122.29 / 124.10 AdamOptimizer: Adaptive Moment Estimation; computes adaptive learning rates for each parameter. RMSE = 67.67 / 67.48 Reference: http://ruder.io/optimizing-gradient-descent/
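Adagrad's "automatically reduce the learning rate" can be written out directly: each parameter divides its step by the square root of its accumulated squared gradients, so frequently-updated parameters take ever smaller steps. A sketch on a toy quadratic (the function and hyperparameters are made up for illustration, not from the exercise):

```python
import math

# Minimize f(w) = (w - 3)^2 with the Adagrad update rule
w, lr, eps = 0.0, 1.0, 1e-8
g_accum = 0.0

for _ in range(500):
    g = 2 * (w - 3)                            # gradient of f at w
    g_accum += g * g                           # accumulate squared gradients
    w -= lr * g / (math.sqrt(g_accum) + eps)   # per-parameter shrinking step

# w has converged close to the minimizer, 3
```

Adam layers two extra ideas on top of this: an exponential moving average of the gradient (momentum) and of its square, with bias correction, which is why it often converges faster in the exercise.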
• 39. Programming Exercises: Normalization+ You can pass the normalization into the function options, which makes it simpler. z_score, RMSE: 71.54 / 70.39 binary_threshold(0.5), RMSE: 115.78 / 116.41 clip(0.1, 0.8), RMSE: 115.77 / 116.33 log_normalize??? (math error)
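The normalizations compared above are simple element-wise transforms. A sketch of what each one does (the function names mirror the exercise, but these implementations are my own):

```python
import numpy as np

def z_score(x):
    return (x - x.mean()) / x.std()          # zero mean, unit variance

def binary_threshold(x, threshold):
    return (x > threshold).astype(float)     # 0/1 indicator

def clip(x, lo, hi):
    return np.clip(x, lo, hi)                # cap extreme values

def log_normalize(x):
    # Undefined for zero or negative inputs -- the likely source of
    # the "math error" noted on the slide.
    return np.log(x)

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
```

Thresholding and clipping throw away most of the feature's information here, which is consistent with their much worse RMSE in the exercise.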
  • 41. Multi-Class NN One-class NN Multi-Class NN
  • 42. See Food The ‘See Food’ app from Silicon Valley really happened, and it was also a lie “Meal Snap”
• 43. See Food • Multi-class, single-label: this is a hotdog, an octopus, or a banana • => softmax (with candidate sampling to keep training cheap) • Multi-class, multi-label: this picture contains a hotdog, cucumber, tomato, and onion • => one logistic regression per class
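The two output strategies differ only in the final layer. A minimal NumPy sketch (the logits are made-up scores for three classes):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes

# Single label: softmax makes the classes compete; probabilities sum to 1
def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

single_label = softmax(logits)

# Multi-label: one independent sigmoid per class, each its own yes/no question
multi_label = 1 / (1 + np.exp(-logits))
```

With softmax, raising one class's probability necessarily lowers the others; with per-class sigmoids, the picture can be "hotdog" and "tomato" at the same time.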
• 47. Collaborative Filtering Step 1. Preprocessing: build a dict of all movies Step 2. Encode the user behavior into a sparse representation
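The two steps can be sketched directly (the movie titles and the watch history are made-up examples):

```python
# Step 1: build a vocabulary dict mapping each movie to a column index
vocab = {"Shrek": 0, "The Incredibles": 1, "The Dark Knight": 2, "Memento": 3}

# Step 2: encode one user's behavior as a sparse representation --
# just the indices of the movies they watched
watched = ["Shrek", "The Dark Knight"]
sparse = sorted(vocab[m] for m in watched)   # [0, 2]

# Equivalent dense multi-hot vector, shown for comparison; with a real
# catalog of millions of movies, only the sparse form is practical
dense = [1.0 if i in sparse else 0.0 for i in range(len(vocab))]
```

The sparse form stores only the handful of watched indices instead of one slot per movie in the catalog, which is what makes the representation scale.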
• 49. Embeddings • Embed the data into a d-dimensional space, mapping items to low-dimensional real vectors • the number of dimensions is usually determined empirically • embeddings can be learned as hidden layers of the network
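An embedding layer is just a learned matrix: looking up item i means selecting row i, which is the same as multiplying a one-hot vector by the matrix. A sketch with a random (untrained) table; the vocabulary size and d = 3 are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 3                      # d is usually chosen empirically
E = rng.normal(size=(vocab_size, d))      # embedding table, learned during training

# Embedding lookup for item 2 = selecting row 2
vec = E[2]

# Equivalently: a one-hot vector times the embedding matrix
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0
assert np.allclose(one_hot @ E, vec)
```

This equivalence is why an embedding can be treated as an ordinary hidden layer: it is a linear layer whose input happens to be sparse.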
  • 51. PCA Reference: [1] https://goo.gl/XetAUb (right) [2] https://goo.gl/HctuRj (left, including python examples)
• 52. Word2Vec Ray Hsueh gave an awesome talk on this last week!!!
  • 54. Production ML Systems What we’ve learned so far…
• 55. Static vs Dynamic Training • Static - trained offline. For data that do not change much over time. • Pros: easy to build and test; batch train, then test and iterate until good • Cons: requires monitoring inputs; easy to let the model grow stale • Dynamic - trained online. • Pros: keep feeding in data and regularly sync out an updated version; use progressive validation rather than batch training & testing; adapts to changes • Cons: needs monitoring, model rollback & data quarantine capabilities
• 56. Static vs Dynamic Inference • Static - inference offline. For data that do not change much over time. • Pros: much lower computational cost • Cons: you need all the data at hand, and update latency can be very long • Dynamic - inference online. • Pros: can make predictions on the newest data • Cons: serving latency is higher, and you need budget to deal with it
• 57. Data Dependencies • Feature and data changes have a huge impact on the model • Unit tests for data? • Reliability: what if the input data disappears? • Versioning: does the feature change over time? • Necessity: how useful is the feature relative to its computational cost? • Correlations: are features tied together, or can they be teased apart? • Feedback loops: could my input be impacted by my own output?
• 59. Cancer Prediction • Hospitals specializing in cancer treatment make the model overfit: the hospital name effectively gives away the answer • => label leakage, just like a cheat
• 60. Real World Guidelines • Keep the very first model extremely simple • Focus on data pipeline correctness • Use a simple, observable metric for training & evaluation • Own and monitor your input features • Treat your model configuration as code: review it, check it in • Write down the results of all experiments, especially "failures"
  • 61. Good Bye! Machine Learning Practica Check out these real-world case studies of how Google uses machine learning in its products, with video and hands-on coding exercises: • Image Classification: See how Google developed the image classification model powering search in Google Photos, and then build your own image classifier. • More Machine Learning Practica coming soon! Other Machine Learning Resources • Deep Learning: Advanced machine learning course on neural networks, with extensive coverage of image and text models • Rules of ML: Best practices for machine learning engineering • TensorFlow.js: WebGL-accelerated, browser-based JavaScript library for training and deploying ML models