Optimization Algorithms for Machine Learning

I have been learning through Andrew Ng's Deep Learning specialization on Coursera. I have completed the 1st of the 5 courses in the specialization  (Neural Networks and Deep Learning).

I am onto the 2nd one which is Improving Deep Learning. This one is a very interesting course which goes deep into Hyper-parameter tuning, Regularization and Optimization techniques.

1. What are Optimization Algorithms?

They enable you to train your neural network much faster since Applied Machine Learning is a bery empirical process these algorithms help reach optimized results efficiently.

Let's start looking into Optimization Algorithms with a more sophisticated version of Gradient Descent.

1.1 Batch v/s Mini-Batch Gradient Descent

In general, Gradient Descent goes over the entire set of training examples (#m), and takes one step towards the global minima. This is also called Batch Gradient Descent. This is rather a little inefficient because it requires us to go through all the training examples before taking a tiny step towards the minima.

How about we use a smaller chunk/sample of the training set to take a step at a time? This is nothing but Mini-Batch Gradient Descent. This means that we divide the input training set (X) and the target set (Y) into small batches - called mini-batch and go through each batch to take a step towards the minima at once. This significantly improves the speed at which Gradient Descent converges.

To make this even faster, how about we take a Gradient Descent step for each training example? Let's see what the implications would be in the image below.

Source: Deep Learning Specialization - Programming Assignment (Optimization)

  1. On the left, we have Stochastic Gradient Descent (where m=1 per step) we take a Gradient Descent step for each example and on the right is Batch Gradient Descent (1 step per entire training set).
  2. SGD seems to be quite noisy, at the same time it is much faster but may not converge to a minimum.
  3. Typically, to get the best out of both worlds we use Mini-batch gradient descent (MGD) which looks at smaller number of training set examples at once to help (usually power of 2 - 2^6 etc.). 
  4. Mini-batch Gradient Descent is relatively more stable than Stochastic Gradient Descent (SGD) but does have oscillations as gradient steps are being taken in the direction of a sample of the training set and not the entire set as in BGD.

1.2 How do we reduce oscillations when using Mini-batch Gradient Descent (MGD)?

There are a variety of techniques available to reduce oscillations, the involve using previous gradient steps as reference to ensure we are going in the correct direction of the minimum. These techniques typically involve using averaging over a window to ensure smoother steps to reduce oscillations.

 The image above shows the output of Batch Gradient Descent. As you can see the cost output is very noisy. When we use techniques like exponential weight averaging like Adam (Adaptive Momentum Estimation), the cost output becomes smoother as seen below.

1.3 Learning Rate Decay

This is another important technique used to optimize learning algorithms. As we move close to the global minimum we want our algorithm to take small steps and not overshoot or oscillate around the minimum. This is typically done through different ways to reduce the learning rate per epoch.
The simplest way to reduce learning rate would be to set it to half of it's previous value every epoch.

Source: Andrew Ng's Coursera - Machine Learning course

Note: An epoch is one pass through the entire training set.

I will be explaining Adam in another post and keep sharing my learnings and findings as I go along the course, my solutions to the programming assignments will be on my github.


Post a Comment