Train your deep model faster and sharper — two novel techniques

Written by harshsayshi | Published 2017/06/26
Tech Story Tags: deep-learning | machine-learning | artificial-intelligence | neural-networks | data-science


Deep neural networks have a huge number of learnable parameters that are used to make inferences. This often poses a problem in two ways: the model may not make very accurate predictions, and it can take a long time to train. This post talks about increasing accuracy while also reducing training time using two very novel techniques.

EDIT:

This article was awarded second most popular blog post of August 2017 by KDnuggets.

The original papers can be found here (Snapshot Ensembles) and here (FreezeOut).

This article assumes some familiarity with neural networks, including aspects like SGD, minima, optimisation, etc.

How this article is structured

I will be talking about two different papers that aim to do different things. Note that even though there are two different ideas, they are not mutually exclusive and can be used simultaneously.

This is a long post, but it is divided into two self-contained sections that can be read independently.

1. Snapshot Ensembling — M models for the cost of 1

Regular Ensemble Models

Ensemble models are a group of models that work collectively to produce a prediction. The idea is simple: train several models with different hyperparameters, and average the predictions from all of them. This technique gives a great boost in accuracy because it does not rely on a single model for the prediction. Most winning entries in high-profile machine learning competitions have used ensembles.
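To make the idea concrete, here is a minimal sketch (my own, not from any paper) of averaging the predictions of several trained PyTorch models; `trained_models` and `x_batch` are assumed placeholder names:

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax outputs of several independently trained models."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(x), dim=1))
    # The ensemble prediction is the mean of the individual models' predictions.
    return torch.stack(probs).mean(dim=0)

# Example usage (hypothetical names):
# y_pred = ensemble_predict(trained_models, x_batch).argmax(dim=1)
```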

So what’s the problem?

Training N different models requires N times the time needed to train a single model. Most people who don't have the luxury of multiple GPUs will have to wait a long time before they can test these models, which makes experimenting much slower.

SGD Mechanics

Before I tell you about the 'novel' approach, you must first understand the nature of Stochastic Gradient Descent (SGD). SGD is greedy: it always moves in the direction of steepest descent. However, one very crucial parameter governs its behaviour: the learning rate.

If the learning rate is too high, SGD will ignore very narrow crevices (minima) and take large steps, much like a tank that is not affected by a pothole in the road.

On the other hand, if the learning rate is small, SGD will fall into one of these local minima and be unable to come out of it. It is, however, possible to bring SGD back out of a local minimum by increasing the learning rate.

The trick?

The authors of the paper exploit this controllable property of SGD: it can be made to fall into and climb back out of local minima. Different local minima may have very similar error rates, but the mistakes they make will differ from one another.

They have included a very useful diagram that explains this concept:

Figure 1.0: Left: standard SGD trying to find the best local minimum. Right: SGD is made to fall into a local minimum, then brought back out, and the process is repeated. This way you get three local minima (labelled 1, 2, 3), each with a similar error rate but different error characteristics.

What is being ensembled a.k.a snapshot?

The authors use the fact that different local minima have different 'viewpoints' on their predictions to create multiple models. Every time SGD reaches a local minimum, a snapshot of the model is saved, and that snapshot becomes part of the final ensemble of networks.

Cyclic Cosine Annealing

Instead of manually trying to figure out when to dive into a local minimum or when to jump out of it, the authors used a function to automate this process.

They used Learning Rate Annealing with the following function:

Equation 1.0: α(t) = (α₀ / 2) · ( cos( π · mod(t − 1, ⌈T/M⌉) / ⌈T/M⌉ ) + 1 )

Simplified

The formula may look complicated, but it's quite simple: within each cycle, the learning rate decreases monotonically. α here is the new learning rate and α₀ is the initial learning rate. T is the total number of training iterations you want to use (the number of mini-batch updates, i.e. iterations per epoch × number of epochs). M is the number of snapshots you want in your ensemble.
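As a rough sketch (my own code and parameter names, not the authors'), the schedule can be written as a small Python function:

```python
import math

def snapshot_lr(alpha0, t, T, M):
    """Cyclic cosine annealing, as I read the Snapshot Ensembles formula.

    alpha0: initial learning rate
    t:      current training iteration (1-indexed)
    T:      total number of training iterations
    M:      number of snapshots (cycles) in the ensemble
    """
    cycle_len = math.ceil(T / M)
    # Position within the current cycle, in [0, 1): the rate restarts at alpha0
    # at the start of every cycle and decays towards zero at its end.
    pos = ((t - 1) % cycle_len) / cycle_len
    return alpha0 / 2 * (math.cos(math.pi * pos) + 1)
```

At the start of every cycle the function returns α₀, and it decays smoothly to almost zero just before the next snapshot is taken.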

Figure 1.1: M = 6 with a budget of 300 epochs. The vertical dotted lines indicate a model snapshot. After 300 epochs, a total of 6 models have been added to the ensemble.

Notice how the loss falls rapidly just before each snapshot; this is because the learning rate is annealed continuously towards zero. After a snapshot is taken, the learning rate is reset to its initial value (the authors used 0.1), which kicks the gradient path out of the local minimum, and the search for a new local minimum begins again.
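One way such a training loop might look is sketched below, under the assumption of a `train_step` helper that performs one mini-batch update (none of these names come from the paper):

```python
import copy
import math

def train_with_snapshots(model, optimizer, train_step, T, M, alpha0=0.1):
    """Sketch of a snapshot-ensembling loop: anneal, snapshot, restart."""
    snapshots = []
    cycle_len = math.ceil(T / M)
    for t in range(1, T + 1):
        # Cyclic cosine annealing: the rate restarts at alpha0 after every snapshot.
        pos = ((t - 1) % cycle_len) / cycle_len
        lr = alpha0 / 2 * (math.cos(math.pi * pos) + 1)
        for group in optimizer.param_groups:
            group['lr'] = lr
        train_step(model)
        # The learning rate is near zero at the end of a cycle: take a snapshot.
        if t % cycle_len == 0 or t == T:
            snapshots.append(copy.deepcopy(model))
    return snapshots
```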

Show me the numbers

I have included the numbers the authors used to demonstrate the effectiveness of their method.

Figure 1.2: Error rates (%) on CIFAR-10, CIFAR-100, SVHN and Tiny ImageNet. Blue indicates the authors' work, and bold indicates the best error rate for that category.

Conclusion

This is a useful strategy to get a marginal boost in accuracy at no additional training cost. The paper also discusses how varying parameters such as M and T affects performance.

2. FreezeOut — Training Acceleration by Progressively Freezing Layers

The authors of this paper propose a method to increase training speed by progressively freezing layers. They experiment with a few different ways of freezing the layers and demonstrate the training speed-up with little (or no) effect on accuracy.

What does Freezing a Layer mean?

Freezing a layer prevents its weights from being modified. This technique is often used in transfer learning, where the base model (trained on some other dataset) is frozen.

How does freezing affect the speed of the model?

If you don't want to modify the weights of a layer, the backward pass through that layer can be skipped entirely, resulting in a significant speed boost. For example, if half your model is frozen, a training iteration will take roughly half the time compared to a fully trainable model.
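In PyTorch, for instance, freezing boils down to setting `requires_grad = False` on a layer's parameters. Here is a sketch with a made-up toy model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Freeze the first Linear layer: no gradients are computed for its weights.
# Because it is the earliest layer, backpropagation can stop before reaching it.
for param in model[0].parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)
```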

On the other hand, the layers still need to be trained, so if you freeze them too early, the model will make inaccurate predictions.

What is the ‘novel’ approach?

The authors demonstrate a way to freeze the layers one by one, as early as possible, resulting in fewer and fewer backward passes, which in turn lowers training time.

At first, the entire model is trainable (exactly like a regular model). After a few iterations the first layer is frozen, and the rest of the model continues to train. After another few iterations, the next layer is frozen, and so on.
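A minimal sketch of such a progressive-freezing schedule follows (my own illustration; the linear spacing of the freeze points and the 0.5 starting fraction are assumptions, not values taken from the paper's tables):

```python
def freeze_schedule(num_layers, total_iters, first_frac=0.5):
    """Iteration at which each layer gets frozen; earlier layers freeze sooner.

    The first layer freezes after `first_frac` of training, the last layer
    only at the very end, with the rest spaced linearly in between.
    """
    denom = max(num_layers - 1, 1)
    return [int(total_iters * (first_frac + (1 - first_frac) * i / denom))
            for i in range(num_layers)]

def apply_freezing(layers, freeze_iters, t):
    """Freeze every layer whose scheduled freeze iteration has passed."""
    for layer, t_i in zip(layers, freeze_iters):
        if t >= t_i:
            for p in layer.parameters():
                p.requires_grad = False
```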

Learning Rate Annealing

The authors used learning rate annealing to govern the learning rate of the model. The notable difference is that they annealed the learning rate layer by layer rather than for the model as a whole. They used the following equation:

Equation 2.0: α_i(t) = (α_i(0) / 2) · (1 + cos(π · t / t_i)), where α is the learning rate, t is the iteration number, and i denotes the ith layer of the model.

Equation 2.0 Explanation

The subscript i denotes the ith layer, so α_i is the learning rate for the ith layer. Similarly, t_i is the iteration at which the ith layer stops training, i.e. the total number of iterations that layer is trained for, while t is the current training iteration of the whole model.
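A rough sketch of how this per-layer annealing could be implemented with PyTorch parameter groups (one group per layer); `init_lrs` and `freeze_iters` are assumed to come from a schedule like the one sketched earlier:

```python
import math
import torch

def layerwise_optimizer(layers, init_lrs):
    """One parameter group per layer, so each layer can have its own learning rate."""
    groups = [{'params': layer.parameters(), 'lr': lr0}
              for layer, lr0 in zip(layers, init_lrs)]
    return torch.optim.SGD(groups, momentum=0.9)

def anneal_layerwise(optimizer, init_lrs, freeze_iters, t):
    """Cosine-anneal layer i's rate so that it reaches zero at freeze_iters[i]."""
    for group, lr0, t_i in zip(optimizer.param_groups, init_lrs, freeze_iters):
        progress = min(t / t_i, 1.0)          # stays at 1.0 once the layer is frozen
        group['lr'] = 0.5 * lr0 * (1 + math.cos(math.pi * progress))
```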

Equation 2.1: α_i(0), the initial learning rate for the ith layer.

The authors experimented with different values for Equation 2.1

Initial learning rate for Equation 2.1

The authors tried scaling each layer's initial learning rate to compensate for the fact that the layers are not all trained for the same amount of time.

Remember that because the first layer of the model is stopped first, it would otherwise be trained for the least amount of time. To remedy that, they scaled the learning rate for each layer.

Figure 2.0: the scaled initial learning rate for the ith layer.

The scaling was done to ensure all the layers' weights moved an equal distance in weight space, i.e. the layers that were trained the longest (the later layers) had a lower learning rate.

The authors also experimented with cubic scaling, where the value of t_i is replaced by its own cube.
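Here is how I read the scaling rule, written out as a sketch. Dividing the base rate by each layer's training fraction is my interpretation of "longer-trained layers get a lower initial rate", so treat the exact form as an assumption:

```python
def scaled_initial_lrs(base_lr, freeze_iters, total_iters, cubic=False):
    """Per-layer initial learning rates, scaled by how long each layer trains.

    freeze_iters[i] is the iteration at which layer i stops training.
    With cubic=True the training fractions are cubed first, mirroring the
    cubic-scaling variant described above (again, my interpretation).
    """
    fractions = [t_i / total_iters for t_i in freeze_iters]
    if cubic:
        fractions = [f ** 3 for f in fractions]
    # Layers that train longer (a larger fraction) start with a smaller rate.
    return [base_lr / f for f in fractions]
```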

Figure 2.1: Performance vs. error on DenseNet.

The authors include more benchmarks; their method achieves a training speed-up of about 20% with only a 3% drop in accuracy, and about 15% with no drop in accuracy.

Their method does not work very well for models that do not use skip connections (such as VGG-16): neither accuracy nor speed-ups were noticeably different in such networks.

My Bonus Trick

The authors progressively stop each layer from being trained and then no longer compute backward passes for it. However, they seem to have missed the opportunity to exploit precomputed layer activations. By precomputing, you can avoid the forward pass through the frozen layers as well.

What is precomputation?

This is a trick used in transfer learning. This is the general workflow.

  1. Freeze the layers you don’t want to modify
  2. Calculate the activations of the last frozen layer (for your entire dataset)
  3. Save those activations to disk
  4. Use those activations as the input of your trainable layers

Since the layers are frozen progressively, the remaining trainable layers can be seen as a standalone (smaller) model that simply takes as input whatever the last frozen layer outputs. This can be done over and over again as each layer is frozen.
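A minimal sketch of precomputation in PyTorch (my own code; `frozen_part` is whatever prefix of the network has been frozen so far, and `loader` is an ordinary DataLoader):

```python
import torch

@torch.no_grad()
def precompute_activations(frozen_part, loader, device='cpu'):
    """Run the frozen layers once over the whole dataset and cache their outputs."""
    features, labels = [], []
    frozen_part.eval()
    for x, y in loader:
        features.append(frozen_part(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(features), torch.cat(labels)

# The cached activations can then be wrapped in a TensorDataset and used to
# train only the remaining (still-trainable) layers:
# feats, ys = precompute_activations(frozen_part, loader)
# head_loader = torch.utils.data.DataLoader(
#     torch.utils.data.TensorDataset(feats, ys), batch_size=128, shuffle=True)
```

Note that this assumes the frozen layers see the same inputs every epoch, i.e. there is no random data augmentation applied before them.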

Doing this along with FreezeOut will result in a further substantial reduction in training time while not affecting other metrics (like accuracy) in any way.

Conclusion

I covered two very recent and novel techniques (plus half a trick of my own) that improve accuracy and lower training time by fine-tuning learning rates. By also adding precomputation wherever possible, a further significant speed boost is possible.

P.S. (Also stands for Please Share)

If you notice any errors or have any doubts, please comment about them. I will update my post or try to explain better.

Also, if you liked my article, please recommend it by pressing on the ❤. It lets me know I was of help to you.

