DNN Guidelines
Naitik (https://www.linkedin.com/in/naitikshukla/)
General guidelines for working with DNNs and how to get started.
Training data
A few measures one can take to get better training data:
Get your hands on as large a dataset as possible (DNNs are quite data-hungry:
more is better)
Remove any training sample with corrupted data (short texts, highly distorted
images, spurious output labels, features with lots of null values, etc.)
Data Augmentation - create new examples (in case of images - rescale, add noise,
etc.)
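As a rough sketch of image augmentation, the snippet below uses Keras' ImageDataGenerator; the dummy data and the exact parameter values are illustrative assumptions, not recommendations:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,        # rescale pixel values to [0, 1]
    rotation_range=15,        # random rotations up to 15 degrees
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    zoom_range=0.1,           # random zoom in/out
    horizontal_flip=True)     # random horizontal flips

x_train = np.random.rand(8, 64, 64, 3)           # dummy images (N, H, W, C)
y_train = np.random.randint(0, 2, size=(8,))     # dummy labels
x_batch, y_batch = next(augmenter.flow(x_train, y_train, batch_size=4))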
Choose appropriate activation functions
Activation functions introduce the much-desired non-linearity into the model. For
years, the sigmoid has been the preferred choice of activation function. However,
a sigmoid function is inherently cursed by these two drawbacks:
1. Saturation of the sigmoid at the tails (further causing the vanishing gradient problem).
2. Sigmoids are not zero-centered.
A better alternative is the tanh function; mathematically, tanh is just a rescaled and
shifted sigmoid:
tanh(x) = 2*sigmoid(2x) - 1
tanh can still suffer from the vanishing gradient problem, but the good news is
that tanh is zero-centered.
Hence, using tanh as the activation function will result in faster convergence.
Other alternatives are ReLU, SoftSign, etc.
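A quick numpy sketch of these activations, including a check of the tanh/sigmoid relationship above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softsign(x):
    return x / (1.0 + np.abs(x))

x = np.linspace(-5, 5, 11)
# tanh is a rescaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)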
Number of Hidden Units and Layers
Keeping a larger number of hidden units than the optimal number is generally a safe
bet, since any regularization method will take care of superfluous units (at least to some
extent).
On the other hand, with fewer hidden units than the optimal number, there is a higher
chance of underfitting the model.
Selecting the optimal number of layers is relatively straightforward.
As Yoshua Bengio mentioned on Quora: "You just keep on adding
layers, until the test error doesn't improve anymore". ;)
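A hedged sketch of that advice using Keras with dummy data; the layer widths, depths, and epoch counts here are illustrative, not tuned:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Dummy data for illustration; replace with your own train/validation split.
x_train, y_train = np.random.rand(256, 20), np.random.randint(0, 2, size=256)
x_val, y_val = np.random.rand(64, 20), np.random.randint(0, 2, size=64)

def build_model(n_hidden_layers, n_units, input_dim):
    model = Sequential()
    model.add(Dense(n_units, activation="tanh", input_shape=(input_dim,)))
    for _ in range(n_hidden_layers - 1):
        model.add(Dense(n_units, activation="tanh"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

best_loss, best_depth = float("inf"), 0
for depth in range(1, 6):                        # keep adding layers...
    model = build_model(depth, n_units=128, input_dim=x_train.shape[1])
    model.fit(x_train, y_train, epochs=10, verbose=0)
    loss = model.evaluate(x_val, y_val, verbose=0)
    if loss < best_loss:
        best_loss, best_depth = loss, depth
    else:
        break                                    # ...until the error stops improving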
Weight Initialization
Always initialize the weights with small random numbers to break the symmetry
between different units.
To initialize weights that are evenly distributed, a uniform distribution is probably
one of the best choices.
Furthermore, as shown in the paper (Glorot and Bengio, 2010), units with more incoming
connections (fan_in) should have relatively smaller weights.
Thanks to all these thorough experiments, we now have a tested formula that we can
directly use for weight initialization, i.e.:
weights drawn from ~ Uniform(-r, r)
where r = sqrt(6 / (fan_in + fan_out)) for tanh activations, and
r = 4 * sqrt(6 / (fan_in + fan_out)) for sigmoid activations,
where fan_in is the size of the previous layer and fan_out is the size of the next layer.
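A minimal numpy sketch of this initialization (the function name glorot_uniform is mine, not from the paper):

import numpy as np

def glorot_uniform(fan_in, fan_out, for_sigmoid=False):
    # r = sqrt(6 / (fan_in + fan_out)); 4x wider range for sigmoid units
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if for_sigmoid:
        r *= 4.0
    return np.random.uniform(-r, r, size=(fan_in, fan_out))

W = glorot_uniform(fan_in=256, fan_out=128)   # weight matrix for a 256 -> 128 layer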
Learning Rates
This is probably one of the most important hyperparameters governing the learning
process.
Set the learning rate too small and your model might take ages to converge; make it too
large and, within the first few training examples, your loss might shoot up to the sky.
The optimal learning rate depends on the specific task; 0.01 is a commonly used starting point.
One possible alternative:
Gradually decrease the learning rate, after each epoch or after a few thousand examples.
Although this might help in faster training, it requires another manual decision about
the new learning rates.
These kinds of strategies were quite common a few years back. Generally, the learning rate
can be halved after each epoch, as in the sketch below.
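A minimal sketch of such a schedule, assuming a Keras-style training loop (halving the rate each epoch, as mentioned above):

from tensorflow.keras.callbacks import LearningRateScheduler

def halve_each_epoch(epoch, lr):
    # keep the initial rate for the first epoch, then halve it every epoch
    return lr if epoch == 0 else lr * 0.5

schedule = LearningRateScheduler(halve_each_epoch, verbose=1)
# model.fit(x_train, y_train, epochs=10, callbacks=[schedule])   # model/data assumed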
Better Alternative:
We have better momentum-based methods to change the learning rate based on the
curvature of the error function.
It might also help to set different learning rates for individual parameters in the model,
since some parameters might be learning at a relatively slower or faster rate.
Advanced Alternative:
There has been a good amount of research on optimization methods, resulting in adaptive
learning rates.
We have numerous options, from the good old Momentum Method to Adagrad, Adam,
RMSProp, etc.
Methods like Adagrad or Adam effectively save us from manually choosing an initial
learning rate.
Hyperparameter Tuning: Shun Grid Search - Embrace Random Search
Grid Search has been prevalent in classical machine learning. However, Grid Search is
not at all efficient at finding optimal hyperparameters for DNNs, primarily because of
the time a DNN takes to try out each hyperparameter combination. As the number of
hyperparameters keeps increasing, the computation required for Grid Search also
increases exponentially.
There are two ways to go about it:
1. Based on your prior experience, you can manually tune some common
hyperparameters like learning rate, number of layers, etc.
2. Instead of Grid Search, use Random Search/Random Sampling for
choosing optimal hyperparameters. It is also possible to add some prior
knowledge to further decrease the search space (e.g. the learning rate shouldn't be too
large or too small); see the sketch after this list.
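A minimal random-search sketch; train_and_evaluate is a hypothetical stub standing in for your actual training/validation routine, and the sampling ranges encode the kind of prior knowledge mentioned in point 2:

import random

def sample_hyperparameters():
    # Prior knowledge narrows the ranges (e.g. learning rate not too large or small).
    return {
        "learning_rate": 10 ** random.uniform(-4, -2),
        "n_layers": random.randint(2, 6),
        "n_units": random.choice([64, 128, 256, 512]),
        "dropout": random.uniform(0.2, 0.5),
    }

def train_and_evaluate(params):
    # Hypothetical stub: train a model with `params` and return a validation score.
    return random.random()

best_score, best_params = float("-inf"), None
for _ in range(20):                              # 20 random trials
    params = sample_hyperparameters()
    score = train_and_evaluate(params)
    if score > best_score:
        best_score, best_params = score, params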
Learning Methods
Good old Stochastic Gradient Descent might not be as efficient for DNNs.
There has been a lot of research into developing more flexible optimization algorithms,
e.g. Adagrad, Adam, AdaDelta, RMSProp, etc.
In addition to providing adaptive learning rates, these sophisticated methods also
use different rates for different model parameters, which generally results in
a smoother convergence.
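In Keras, switching between these optimizers is a one-line change; a sketch (the learning rates shown are illustrative defaults):

from tensorflow.keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adadelta

optimizer = Adam(learning_rate=0.001)                 # adaptive, per-parameter rates
# optimizer = RMSprop(learning_rate=0.001)
# optimizer = Adagrad()
# optimizer = Adadelta()
# optimizer = SGD(learning_rate=0.01, momentum=0.9)   # classic momentum baseline

# model.compile(optimizer=optimizer, loss="categorical_crossentropy")   # model assumed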
Best Practice:
Keep the dimensions of weights as powers of 2.
Memory management is still done at the byte level, so it's always good to keep the
sizes of your parameters as 64, 128, 512, 1024 (all powers of 2). This might help with
sharding the matrices, weights, etc.
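For instance, a stack of Dense layers sized as powers of 2 (a sketch only; the particular widths are arbitrary):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(512, activation="relu", input_shape=(1024,)),  # 1024 -> 512
    Dense(128, activation="relu"),                       # 512  -> 128
    Dense(64, activation="relu"),                        # 128  -> 64
    Dense(1, activation="sigmoid"),
])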
Unsupervised Pretraining
It doesn't matter whether you are working with NLP, Computer Vision, Speech
Recognition, etc.: Unsupervised Pretraining always helps the training of your
supervised or other unsupervised models.
For example, you can use the ImageNet dataset to pretrain your model in an unsupervised
manner and then fine-tune it for a 2-class supervised classification task, as sketched below.
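A hedged sketch of that idea: pretrain an autoencoder on unlabeled data, then reuse its encoder for a 2-class classifier (shapes and layer sizes are illustrative assumptions):

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

inputs = Input(shape=(784,))
encoded = Dense(128, activation="relu")(inputs)          # encoder
decoded = Dense(784, activation="sigmoid")(encoded)      # decoder

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10)   # unlabeled data assumed

# Reuse the pretrained encoder as the feature extractor for supervised training.
classifier = Model(inputs, Dense(1, activation="sigmoid")(encoded))
classifier.compile(optimizer="adam", loss="binary_crossentropy")
# classifier.fit(x_labeled, y_labeled, epochs=10)        # labeled data assumed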
Mini-Batch vs. Stochastic Learning
The major objective of training a model is to learn appropriate parameters that result in
an optimal mapping from inputs to outputs.
Stochastic:
When employing a stochastic learning approach, the weight gradients are tuned after
each training sample, introducing noise into the gradients (hence the word 'stochastic').
This has a very desirable effect; i.e., with the introduction of noise during training,
the model becomes less prone to overfitting.
However, stochastic learning might effectively waste a large portion of the computation
power of today's machines. If we are capable of computing matrix-matrix multiplications,
why should we limit ourselves to iterating through multiplications of individual pairs
of vectors?
That said, when the model receives its training data as a stream (online learning),
resorting to Stochastic Learning is a good option.
Mini-Batch:
For greater throughput/faster learning, it’s recommended to use mini-batches instead of
stochastic learning.
Selecting an appropriate batch size is equally important, so that we can still retain
some noise (by not using a huge batch) and, at the same time, use the computation
power of machines more effectively.
Commonly, a batch of 16 to 128 examples is a good choice (a power of 2).
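A plain numpy sketch of iterating over mini-batches (the batch size and dummy data are illustrative):

import numpy as np

def minibatches(x, y, batch_size=64, shuffle=True):
    indices = np.arange(len(x))
    if shuffle:
        np.random.shuffle(indices)               # reshuffle once per epoch
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]

x = np.random.rand(1000, 20)                     # dummy features
y = np.random.randint(0, 2, size=1000)           # dummy labels
for x_batch, y_batch in minibatches(x, y, batch_size=64):
    pass  # compute gradients on the mini-batch and update the weights here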
Dropout for Regularization
Considering the millions of parameters to be learned, regularization becomes an imperative
requisite to prevent overfitting in DNNs.
You can keep on using L1/L2 regularization as well, but Dropout is preferable for checking
overfitting in DNNs.
If the model is less complex, a dropout rate of 0.2 might also suffice; otherwise, the default
value of 0.5 is a good choice.
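A short Keras sketch of Dropout as regularization, using the rates discussed above (the architecture itself is arbitrary):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation="relu", input_shape=(100,)),
    Dropout(0.5),                  # default choice for a complex model
    Dense(128, activation="relu"),
    Dropout(0.2),                  # a smaller rate may suffice for simpler models
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")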
References:
1. Practical Recommendations for Gradient-Based Training of Deep Architectures (Yoshua Bengio)
2. How to train your Deep Neural Network (Rishabh Shukla)
3. Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al.)