ULMFiT In-Depth Blog Post
This post is for the APaperADay challenge organized by NurtureAI. The paper on which this post is based is available here.
ULMFiT is short for Universal Language Model Fine-tuning. The paper, Universal Language Model Fine-tuning for Text Classification, addresses the issue of transfer learning in the language domain and was authored by Jeremy Howard and Sebastian Ruder.
Deep learning models require large datasets for training, and as a result are rarely trained from scratch. Transfer learning is the de-facto approach when working with deep neural networks. At least, that is the case for convolutional neural networks. Sequence models, which are used for Natural Language Processing tasks, have not had much success when it comes to utilizing transfer learning.
The authors discuss their success with an approach to inductive transfer learning for text classification (a category of NLP tasks), which will save practitioners the time and effort of training models from scratch. This differs from past research, which has mostly focused on transductive transfer.
With inductive transfer, the focus has been on fine-tuning pre-trained word embeddings, which only targets a model’s first layer. This technique is used in most state-of-the-art models. A few recent approaches do things like concatenating embeddings from other tasks with the inputs at different layers, but they still train the main model from scratch, which limits their usefulness.
Anyone who has seen Jeremy demonstrate working with pre-trained CNN models will recognize the approach that has filtered into this paper. With pre-trained CNN models, Jeremy starts by unfreezing the final layer and fine-tuning it, then gradually works backwards through the earlier layers.
The authors begin by highlighting related work on applying inductive transfer to NLP, and the absence of recorded successes. They argue that a lack of knowledge about how to train these models, rather than any fundamental limitation, has been responsible for the lack of wide adoption. They also show that the fine-tuning approaches used for computer vision problems cannot be applied directly to the NLP domain.
The authors apply ULMFiT to a 3-layer LSTM and evaluate it on six text classification tasks. ULMFiT was able to produce robust results using only 10% of the training data, and even less in some instances, and it reduced the error by 18-24% on the majority of the datasets, outperforming the state-of-the-art models it was benchmarked against. The paper proposes three techniques: discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. The pre-trained models and code are made available to the public.
Discriminative fine-tuning means using a different learning rate for each layer. The authors found it best to choose the learning rate for the last layer first, and then to work backwards towards the earlier layers. They even found a good default ratio: the learning rate of each earlier layer should be `1/2.6 * learning rate` of the layer above it.
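As a rough sketch of how this could be set up in PyTorch (the toy model, layer grouping, and base learning rate below are illustrative assumptions, not the authors’ actual code):

```python
import torch.nn as nn
import torch.optim as optim

# Toy stand-in for a 3-layer network (illustrative, not the ULMFiT LSTM).
model = nn.Sequential(
    nn.Linear(400, 1150),   # "layer 1" (earliest)
    nn.Linear(1150, 1150),  # "layer 2"
    nn.Linear(1150, 400),   # "layer 3" (last)
)

# Discriminative fine-tuning: choose a learning rate for the last layer,
# then divide by 2.6 for each earlier layer.
base_lr = 0.01  # assumed value for the last layer
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (2.6 ** depth)}
    for depth, layer in enumerate(reversed(list(model)))  # last layer first
]

optimizer = optim.SGD(param_groups, lr=base_lr)
```

Each parameter group then steps with its own learning rate, so the earlier layers, which capture more general information, change more slowly than the later ones.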
If you have seen Jeremy’s fast.ai course, you will be familiar with cosine annealing of the learning rate for quick convergence. For NLP, the authors found this is not the best way to achieve that behavior. Instead, they recommend slanted triangular learning rates, which increase the learning rate linearly and then linearly decay it according to an update schedule specified in the paper. They found that a short increase followed by a long decay period gives good performance for NLP.
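The schedule is simple enough to write down directly. Here is a sketch of the slanted triangular schedule as a function of the iteration number, using the default values suggested in the paper (`cut_frac=0.1`, `ratio=32`, `eta_max=0.01`):

```python
def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t out of T total.

    A short linear warm-up over the first cut_frac of training, followed
    by a long linear decay; ratio bounds how much smaller the lowest
    learning rate is compared to eta_max.
    """
    cut = int(T * cut_frac)  # iteration at which the peak occurs
    if t < cut:
        p = t / cut  # linear increase up to the peak
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio
```

At each iteration you would set every parameter group’s learning rate to this value (scaled per layer if combined with discriminative fine-tuning).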
The authors augment the pre-trained language model with two linear blocks. Each block uses batch normalization and dropout, with a ReLU activation for the intermediate layer and a softmax activation at the final layer that outputs a probability distribution over the target classes. The parameters in these task-specific classifier layers are the only ones learned from scratch.
To retain information that would be lost by looking only at the last hidden state, the hidden state of the last time step is concatenated with the max-pooled and mean-pooled representations of the hidden states across time steps (concat pooling).
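Putting the two previous paragraphs together, a minimal sketch of the classifier head could look like the following (the hidden size, intermediate size, dropout probabilities, and exact layer ordering are illustrative assumptions; the authors’ released code is the reference):

```python
import torch
import torch.nn as nn

class ConcatPoolClassifier(nn.Module):
    """Concat pooling followed by the two linear blocks described above."""

    def __init__(self, hidden_dim=400, inter_dim=50, n_classes=2, p=0.2):
        super().__init__()
        # Input is [h_T, maxpool(H), meanpool(H)] -> 3 * hidden_dim features.
        self.head = nn.Sequential(
            nn.BatchNorm1d(3 * hidden_dim),
            nn.Dropout(p),
            nn.Linear(3 * hidden_dim, inter_dim),
            nn.ReLU(),
            nn.BatchNorm1d(inter_dim),
            nn.Dropout(p),
            nn.Linear(inter_dim, n_classes),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from the LM encoder
        h_T = hidden_states[:, -1]                    # last time step
        max_pool = hidden_states.max(dim=1).values    # max over time
        mean_pool = hidden_states.mean(dim=1)         # mean over time
        x = torch.cat([h_T, max_pool, mean_pool], dim=1)
        return torch.log_softmax(self.head(x), dim=-1)
```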
Gradual unfreezing strikes a balance between aggressive fine-tuning, which leads to catastrophic forgetting, and overly cautious fine-tuning, which leads to slow convergence. First the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch. Then the next frozen layer is unfrozen and training is repeated. This continues until convergence is reached.
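A sketch of what that schedule might look like in code, assuming layer groups ordered from earliest to last and a placeholder `train_one_epoch` function:

```python
def gradual_unfreeze_fit(layer_groups, train_one_epoch, n_epochs):
    """Unfreeze one layer group per epoch, starting from the last.

    layer_groups: list of nn.Modules ordered from earliest to last.
    train_one_epoch: callable that runs one epoch over the unfrozen params.
    """
    # Start with every layer group frozen.
    for group in layer_groups:
        for param in group.parameters():
            param.requires_grad = False

    for epoch in range(n_epochs):
        # Unfreeze the next group, working backwards from the last layer.
        if epoch < len(layer_groups):
            for param in layer_groups[-(epoch + 1)].parameters():
                param.requires_grad = True
        train_one_epoch()  # fine-tune all currently unfrozen layers
```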
I found the paper a very interesting read, and if you work with NLP models you will want to review it, along with the models and code the authors have made available. I recommend reading the paper here.