Machine Learning
4 Dummies - Part 1
Dori Waldman - Big Data Lead
Michael Winer - Data Science Lead
Where to start ?
● Machine Learning Fundamentals.
○ Quick intro to machine learning.
○ Linear Regression - Regression (spark- scala )
○ Logistic Regression - Classification (spark - scala )
● Basic code examples of a neural network.
○ Neural network example using numpy
○ Deep Learning using Keras ( TensorFlow Backend )
** Our focus is not to find the best model, but to explain the ML building blocks and combine theory with practice.
** We will use the same problem and solve it with ML (Scala) and DL (Python).
Session 1
Agenda
● More ML Algorithms :
○ Decision Tree
○ Random Forest
○ ALS Recommendation
○ ...
● Image Classification Using Deep Learning.
○ Theory behind convolutional neural network.
○ Convolutional neural network example.
● Go to production.
Next
Session
ML
FUNDAMENTALS
What Is Machine Learning ?
● Let’s say we would like to know if
tomorrow will be a good day to play
outside or not.
● We have some data about the past.
● We would like a model that predicts
whether we will play the next day given
outlook, temperature, humidity and wind.
Classic approach - Rule Based ( Deterministic )
The rules:
def will_play(outlook, temperature, windy):
    # deterministic, hand-written rules
    if outlook == 'sunny' and temperature != 'hot':
        return 'yes'
    if outlook == 'rainy' and windy:
        return 'no'
    # anything the rules do not cover stays undecided
There may be surprises..
● It's all about statistics: there is no deterministic answer.
● Instead of rules, the model needs to find the correct weight per feature.
● The model (in most cases) examines many examples and gets feedback on the
outcome; the feedback is used to adjust the model on each iteration.
● In each iteration the model tries to reduce the loss/error.
ML Approach ( Probabilistic )
ML - It Is all about weights
Supervised Vs Unsupervised Learning
Unsupervised
● There is no correct answer.
● Goal is to find underlying relations in the data
(split books into categories)
When to use unsupervised learning:
● Clustering: discover inherent groups , such
as grouping customers by purchasing
behavior .
● Recommendation: discover hidden rules,
such as people that buy X will tend to buy Y .
Supervised
● For each example we have the correct
answer/label.
● We use the correct label during model training
and evaluation.
When to use Supervised learning:
● Classification: output is a category such as
‘male’ or ‘female’
● Regression: output is a numeric value (house
price)
● Regression (How much?):
○ Predict house prices according to last year's sales data
● Classification (Which class does it belong to?):
○ Yes/No questions
● Clustering (split data into related groups):
○ Find hidden correlations in the data
○ Split books into groups
○ Principal component analysis
● Recommendation:
○ Item similarity ( hammer and nail are similar )
○ User similarity ( people with the same taste as me )
● Deep Learning ( neural networks ):
○ In addition to classic prediction it also supports image analysis (CNN), LSTM, GANs
○ Reinforcement learning
ML Types:
ML Hint:
Machine Learning Use Cases
Machine Learning Pipeline
● Convert data
● Clean data
● Feature selection
● Split data (train/test)
● Model selection
● Model tuning
● Deploy model
● Monitor model impact/feedback
● Update model every hour/day
Let’s Learn!
Linear
Regression
● We are looking for a linear function ( y = w0 + w1·x1 + w2·x2 + … + wn·xn ) that will be closest to most
of the data points.
● The distance between the line and a point is the “error” between the correct result and the
predicted value.
Linear Regression - Definition
● Continuous prediction is needed, e.g., income estimation, size, price.
● The relationship between the variables is linear. ( Old apartment example )
● Computational efficiency is an important parameter.
● There is no dependency between features. ( see also Covariance )
Linear Regression - when to use
Advantages of using Linear Regression
● Easy to explain
● High performance ( Regressions are considered robust )
Drawbacks of using Linear Regression
● Sensitive to outliers.
● Not suitable for nonlinear data
Linear Regression - Pros/Cons
Which model works better ?
Linear Regression - Measure the model
Model with lower error
(RMSE in many cases)
Links to Measurement discussion
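A minimal PySpark sketch of comparing models by RMSE ( the deck's code is Scala Spark; the Python API is analogous, and the column names are assumptions ):

from pyspark.ml.evaluation import RegressionEvaluator

# 'predictions' is the DataFrame produced by model.transform(test_df)
evaluator = RegressionEvaluator(labelCol="SalePrice",
                                predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions)  # lower is better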
Predict house prices for next year based on last year's prices.
Let’s Code - House Data
Taken from KAGGLE ( Hosts many data competitions ):
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Let’s Code - House Data
https://www.slideshare.net/HadoopSummit/mleap-release-spark-ml-pipelines
Handle string input
One Hot
Encoder
https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding
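A minimal PySpark sketch of one-hot encoding a string column ( Spark 3 API; "Neighborhood" and df are assumptions ):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# first map the string category to a numeric index, then to a one-hot vector
indexer = StringIndexer(inputCol="Neighborhood", outputCol="neighborhood_idx")
encoder = OneHotEncoder(inputCols=["neighborhood_idx"], outputCols=["neighborhood_vec"])
df_indexed = indexer.fit(df).transform(df)
df_encoded = encoder.fit(df_indexed).transform(df_indexed)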
Let’s code
Let’s Code - House Data
Let’s Code - House Data
● maxIter : max number of iterations
● intercept : a numeric addition to the regression line
● regParam : regularization multiplier for overfitting prevention
● elasticNetParam : L1 vs. L2 ( regularization method )
● Standardization : scales numbers based on ( xi - avg(x) ) / sd(x)
○ normalizes the features ( see the sketch below )
Linear Regression - Tune
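A minimal PySpark sketch wiring these knobs together ( the deck's code is Scala Spark; parameter values and column names here are illustrative assumptions ):

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(
    featuresCol="features", labelCol="SalePrice",  # assumed column names
    maxIter=100,           # max number of iterations
    fitIntercept=True,     # numeric addition to the regression line
    regParam=0.3,          # regularization multiplier
    elasticNetParam=0.5,   # 0.0 = pure L2, 1.0 = pure L1
    standardization=True)  # ( xi - avg(x) ) / sd(x)
model = lr.fit(train_df)
predictions = model.transform(test_df)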
Common Challenges & Techniques
● Overfitting means we trained the model on the data too well.
In some cases, the model remembers the values and does not learn the
pattern behind the data.
● Overfitting is more likely with nonlinear models that have more complexity.
Overfitting
● Techniques to avoid overfitting : reduce #features, regularization,
train-test-validation split, limit the algorithm's complexity, early stopping, dropout...
Overfitting : detect & avoid
Cross validation - can be very useful when the dataset is small
Hyper Parameter Tuning
● ML algorithms have hyperparameters.
● It is hard to guess the best combination of hyperparameters.
● It is recommended to search the hyperparameter
options for the best score.
● Recommended method : a parameter grid ( see the sketch below ).
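A minimal PySpark sketch of a parameter grid with cross-validation ( grid values are assumptions; lr is the linear regression estimator from the earlier sketch ):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 0.3])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="SalePrice", metricName="rmse"),
                    numFolds=5)  # k-fold cross validation, useful when the dataset is small
cv_model = cv.fit(train_df)      # keeps the model with the best average RMSE across folds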
● Until now : in each iteration the model tries to reduce the error
( like RMSE in linear regression ).
● Regularization is an addition to the model's loss function that
punishes high coefficients and weights.
Regularization
Why use it ?
Weights example : W1 = 0.2 , W2 = 0.4 , W3 = 0.8
W3 will contribute the most to the model's complexity.
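A sketch of the penalized loss, using the example weights above :
L1 : Loss = Error + λ · ( |W1| + |W2| + |W3| )
L2 : Loss = Error + λ · ( W1² + W2² + W3² )
With these weights the L2 penalty is 0.2² + 0.4² + 0.8² = 0.84 , and W3 alone contributes 0.64 of it.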
Regularization rate - What to choose?
● A high value makes the model simpler ( underfit danger )
● A low value keeps the model complex ( overfit danger )
Regularization Method - Which one to choose?
● L2 : weights will be centered at 0, small, and normally distributed.
Good for preventing overfitting.
● L1 : sets some of the weights to 0 to reduce model
complexity. Good for feature selection & evaluation.
Regularization
How to use?
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
● Feature selection / reduction :
○ Removing noisy features reduces computation and may
return more accurate results (PCA)
○ The big question : which features should stay and which should go
● Data cleaning & pre-processing ( most important ! ) :
○ Normalize data ( feature scales differ : #rooms vs. house price )
○ Transform strings in order to handle categorical data
○ Handle unbalanced data
○ Generate more features ( cross features, like location x size )
● Algorithm selection & evaluation :
○ Model selection & analysis
○ Accuracy check and tuning of the selected model
○ Avoid under/overfitting of the model
Other
Challenges
Data Preparation Trick : Bucketize
By using the bucketing technique we might reduce RMSE significantly.
Bucketing means arranging the data into groups such as ( see the sketch below ):
● Group 1 : 1-3 rooms
● Group 2 : 4-6 rooms
● Group 3 : 7+ rooms
Taken from Google course (house price prediction)
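A minimal PySpark sketch of these buckets ( column names are assumptions ):

from pyspark.ml.feature import Bucketizer

# splits define half-open intervals : [0,4) -> 1-3 rooms , [4,7) -> 4-6 , [7,inf) -> 7+
bucketizer = Bucketizer(splits=[0.0, 4.0, 7.0, float("inf")],
                        inputCol="rooms", outputCol="rooms_bucket")
df_bucketed = bucketizer.transform(df)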
Know your data : Outliers Inspection
Know your data - Perform analysis on your data
Logistic
Regression
● Probability Estimator.
● Binary Logistic regression predicts the probability that an observation falls into
one of two categories (Classification).
● Examples : Male or Female , Yes/No.
Logistic Regression
Logistic Regression - Definition
http://www.saedsayad.com/logistic_regression.htm
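As a sketch, logistic regression passes the linear combination through the sigmoid function to get a probability :
p = 1 / ( 1 + e^−(w0 + w1·x1 + … + wn·xn) )
If p ≥ threshold we predict one class, otherwise the other.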
● Classification is needed. ( Male / Female , Yes / No )
● The regression is Robust & Easy to explain
● There is low dependency between features ( Covariance ) .
Logistic Regression
Logistic Regression - when to use
Logistic Regression : Measure the model - Confusion Matrix
The confusion matrix consists of 4 values :
TP : the model correctly predicts the positive class.
TN : the model correctly predicts the negative class.
FP : the model predicts the positive class and is wrong.
FN : the model predicts the negative class and is wrong.
If the goal of linear regression was to predict a continuous number, like a house price
based on historical data, the goal of logistic regression is to predict which group (A/B)
each input belongs to, for example whether it is Male or Female.
The question is not what the groups are, but how well the model distinguishes
between them : when it predicts that the input is A it really is A according to
the label (TP), and when it predicts B (not A) it really is B (TN).
If you want to know what A (TP) and B (TN) mean, you need to check the
data : for example, if we have 100 matching predictions for Male, then group A (TP) is Male.
Logistic Regression : understand confusion matrix
● Accuracy : (TP+TN) / (TP+TN+FP+FN)
- The most common measure. Example : if 90% of patients are labeled “healthy”, we can predict
that a patient is healthy with 90% accuracy without any ML; only a model above 90% accuracy is doing something.
● Precision : % of positive predictions that are correct → TP / (TP+FP)
- A higher threshold will increase precision. This measure is important whenever you need to be
certain when you decide on a ‘Yes’ ( hiring example ).
● Recall : % of actual positives that were identified → TP / (TP+FN)
- Sometimes we need a lower threshold for positives, e.g., it is
better to send a healthy person to treatment
than to miss a sick person.
https://developers.google.com/machine-learning/crash-course/classification/accuracy
https://developers.google.com/machine-learning/crash-course/classification/prediction-bias
Logistic Regression : How to measure
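A tiny Python sketch of these formulas ( the counts are made up ):

def confusion_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # how many predicted positives are correct
    recall = tp / (tp + fn)      # how many actual positives were found
    return accuracy, precision, recall

# e.g. confusion_metrics(40, 50, 5, 5) -> accuracy 0.90 , precision ~0.89 , recall ~0.89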
Logistic Regression :
How to set the threshold - ROC Curve
Please see the link on our website.
Let’s code
We are going to predict whether the house has an air conditioner or not.
Let’s Code - House Data
● threshold : probability threshold for the yes/no decision
● maxIter : max number of iterations
● regParam : regularization multiplier for overfitting prevention
● elasticNetParam : L1 vs. L2 ( regularization method )
● Standardization : scales numbers based on ( xi - avg(x) ) / sd(x)
Logistic Regression
Logistic Regression - Tune
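A minimal PySpark sketch for the air-conditioner classifier ( the deck's code is Scala Spark; column names and values are assumptions ):

from pyspark.ml.classification import LogisticRegression

log_reg = LogisticRegression(
    featuresCol="features", labelCol="has_ac",  # assumed label column
    threshold=0.5,         # probability cutoff for the yes/no decision
    maxIter=100,
    regParam=0.3,
    elasticNetParam=0.0,   # pure L2 regularization
    standardization=True)
model = log_reg.fit(train_df)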
Deep
Learning
Machine & Deep Learning Landscape
• Deep Learning is more complicated
• Deep Learning works on “special”
tasks like image recognition and NLP
• Deep Learning does not require manual
feature selection. However, it is still
recommended to understand your data.
Deep Learning Agenda
● Theoretical Explanation of Artificial Neural Network ( Deep Learning )
● Deep Learning Steps:
○ Predict - Feed Forward
○ Calculate Errors
○ Back Propagation - Fix weights using the calculated errors.
● Code Review - Numpy
● Code Review - Keras
● Artificial neural networks consist of nodes and weights. These components
are used to extract the information from the features.
● Each layer of nodes generates an output based on the output of the previous
layer, using an activation function (chosen by us).
● Simplified : deep learning is an artificial neural network with several layers.
Deep Learning : Artificial Neural Network
https://www.quora.com/What-is-the-difference-between-Neural-Networks-and-Deep-Learning
The magic behind deep learning is based on matrix multiplication :
● First matrix is the inputs
● Second matrix is the weights.
Deep Learning- Math behind
https://www.mathsisfun.com/algebra/matrix-multiplying.html
● First, multiply the input nodes by their weights.
● After that, apply the activation function in the 2nd layer of nodes.
● Collect the output from all nodes in the layer and move to the next one.
Deep Learning - Math behind
*** The sigmoid function squashes values into the range 0-1 ( see the numpy sketch below )
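A minimal numpy sketch of one feed-forward step ( the values and layer sizes are made up ):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

X = np.array([[0.5, 0.1, 0.9]])       # 1 example , 3 input features
W1 = np.random.rand(3, 4)             # weights : 3 inputs -> 4 hidden nodes
hidden = sigmoid(X.dot(W1))           # matrix multiplication , then activation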
Deep Learning Steps
1) Initialize the network with random weights.
(Where the arrows are in the picture )
2) Forward propagation :
● Sum Inputs ( values * weights )
● Perform activation function
3) Back propagation :
● Calculate the error in the output layer
● Calculate each step's contribution to the error
● Fix each step's weights using gradient descent
4) Repeat steps 2-3 till convergence
Make A Prediction : Feed Forward Recap
● Feed forward starts by assigning random values to all weights.
● Then we multiply the inputs by the weights using matrix multiplication.
● In each node there is an ‘activation function’ that calculates the output,
which will be the input for the next layer.
● In the output layer, we get ‘predictions’ that we can compare to the real
results and calculate the error.
Make A Prediction : Activation Functions Examples
[ Figure : example activation functions - one suited to yes/no questions, one recommended for hidden layers ]
Backpropagation :
Calculate the error ( e.g., Root Mean Square Error )
● For each data point in the output layer ( y = result , y′ = prediction ) :
○ Error( y , y′ ) = ( y − y′ )²
● Total error for N data points :
○ RMSE( y , y′ , N ) = sqrt( (1/N) · Σ Error )
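The same calculation as a numpy sketch :

import numpy as np

def rmse(y, y_pred):
    # square the per-point errors , average over N , take the root
    return np.sqrt(np.mean((y - y_pred) ** 2))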
Backpropagation : Gradient Descent Concept
● In general, a gradient, calculated from f(x), determines the change of y with respect to a change in x. This can be used to find the direction to the minimum point of a function.
● In the deep learning context, we calculate the partial derivative with respect to the error and search for the direction with the minimum error for each weight ( ‘x’ in the graph ).
https://plus.maths.org/content/making-grade
Gradient Descent - Deeper explanation
Backpropagation : Gradient Descent problems
● Our goal is to aim for the lowest point of the error function. ( low y )
● Below is an example of how classic gradient descent decides the direction in which
a weight (x) should change, depending on our position.
● As you can see, it's not always that easy...
● The learning rate (LR) is a simple yet important multiplier on the pace of weight changes.
Recommended values : try 0.1 - 0.0001 ( always check other options! ).
● The LR can determine whether your weights converge to a minimum or not ( see the toy sketch below ).
Backpropagation : Adjust The weights- Learning Rate
http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
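A toy Python sketch of the idea : gradient descent on a single weight for the made-up error curve f(w) = (w − 3)² :

w, lr = 0.0, 0.1          # initial weight and learning rate
for _ in range(50):
    grad = 2 * (w - 3)    # derivative of the error with respect to the weight
    w -= lr * grad        # step against the gradient
print(w)                  # converges toward 3, the minimum; a too-large lr overshoots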
● As you can see, gradient descent could lead the weights into a local minimum.
● Therefore, besides adjusting the learning rate, we also recommend trying :
weight initialization techniques , momentum , learning rate decay , dropout ...
Local Minimum Problem: What to do?
A full round of an artificial neural network using numpy
Let’s Code
Keras code example
Sales Price
Air Conditioner
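A minimal Keras sketch in the spirit of this example ( layer sizes and the stand-in data are assumptions, not the deck's actual code ):

import numpy as np
from tensorflow import keras

# stand-in data : 100 houses , 8 features , binary air-conditioner label
X_train = np.random.rand(100, 8).astype("float32")
y_train = np.random.randint(0, 2, size=(100, 1))

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(8,)),  # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                  # probability of "yes"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.2)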
https://inneractive-ondemand.bitbucket.io
Visit us
