Universal Language Model Fine-tuning for Text Classification
The field of Deep Learning was made possible by the combination of computing power, cheap storage, and large amounts of data. Even so, training and tuning deep neural networks from scratch still takes considerable time. In the field of computer vision, success has been found in transfer learning, where models trained on one image classification task are re-trained for use on a different dataset.
Unfortunately, transfer learning has not been as successful when applied to language models. One difference between language modeling and computer vision lies in the architectures employed: vision problems are typically solved with convolutional neural networks, while language problems rely on recurrent sequence models such as LSTMs and GRUs.
The authors of this paper introduce ULMFiT (Universal Language Model Fine-Tuning), which is a transfer learning method that can be applied to tasks in the domain of language modeling.
Inductive transfer via pre-trained word embeddings, a simple technique that only targets a model's first layer, has been both successful and impactful in modern state-of-the-art models. However, the main task model is still trained from scratch, and the pre-trained embeddings are treated as fixed parameters, which limits their usefulness.
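To make this concrete, here is a minimal PyTorch sketch of that earlier style of inductive transfer: pre-trained word vectors are loaded into the first layer and frozen, while everything above them is trained from scratch. The vocabulary size, dimensions, and the `pretrained_vectors` tensor are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative pre-trained word vectors (e.g., GloVe or word2vec), shape (vocab_size, emb_dim).
vocab_size, emb_dim, hidden_dim, num_classes = 10_000, 300, 256, 2
pretrained_vectors = torch.randn(vocab_size, emb_dim)  # stand-in for real vectors

class EmbeddingTransferClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # The only transferred knowledge: a frozen first layer of pre-trained embeddings.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        # Everything above the embedding layer is still trained from scratch.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, emb_dim)
        _, (hidden, _) = self.lstm(embedded)      # final hidden state of the LSTM
        return self.classifier(hidden[-1])        # (batch, num_classes)

model = EmbeddingTransferClassifier()
# Only the parameters that are not frozen receive gradient updates.
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```

ULMFiT's contribution is to transfer far more than this single layer: an entire pre-trained language model is fine-tuned for the downstream task.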
Compared to computer vision models, language models are shallower and therefore require a different approach to fine-tuning their parameters.
ULMFiT uses the same hyper-parameters across tasks, with no additions other than tuned dropout (to prevent overfitting), and still outperforms highly engineered models and existing transfer learning approaches. This was demonstrated on six different text classification tasks.
ULMFiT has been successfully used to train models with only 10% of the training data required by other approaches, and in some cases with only 1%. It has also reduced classification error by up to 24% on some datasets.
The authors have made pre-trained ULMFiT models publicly available in order to encourage wide adoption.
Universal, in relation to ULMFiT, refers to the fact that a single architecture and training process can be applied to tasks with varying document sizes, numbers, and label types, without requiring custom feature engineering or preprocessing.
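As a rough illustration of this "single architecture, many tasks" idea, below is a sketch of the three-stage ULMFiT workflow using the fastai library, in which ULMFiT is implemented. The dataset path, learning rates, and epoch counts are placeholder assumptions, and the exact API may differ between fastai versions.

```python
from fastai.text.all import *

path = Path('data/my_text_dataset')  # hypothetical dataset folder

# Stage 2: fine-tune a general-domain pre-trained AWD-LSTM language model on the
# target corpus (stage 1, the general-domain pre-training, ships with the library).
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fit_one_cycle(1, 2e-2)
lm_learn.save_encoder('finetuned_encoder')

# Stage 3: reuse the same encoder for the downstream classification task.
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clas_learn.load_encoder('finetuned_encoder')
clas_learn.fit_one_cycle(1, 2e-2)
```

The same architecture and training procedure are reused regardless of the downstream dataset; only the data loaders change.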
One of the hallmarks of ULMFiT is discriminative fine-tuning, in which each layer of the network is trained with its own learning rate. The learning rate is first chosen for the last layer, and each lower layer is then given a smaller rate (the paper suggests dividing by a factor of 2.6 per layer), working back towards the first layer.
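Below is a minimal PyTorch sketch of discriminative fine-tuning: each layer gets its own optimizer parameter group, and the last layer's learning rate is divided by 2.6 for every layer below it. The three-layer model and the base learning rate are illustrative assumptions, not the AWD-LSTM used in the paper.

```python
import torch.nn as nn
import torch.optim as optim

# Illustrative three-layer model; in ULMFiT this would be the AWD-LSTM layer stack.
model = nn.Sequential(
    nn.Linear(128, 64),   # lowest layer
    nn.Linear(64, 32),    # middle layer
    nn.Linear(32, 2),     # last (highest) layer
)

base_lr = 1e-3  # learning rate chosen for the last layer (assumed value)

# One parameter group per layer, walking backwards from the last layer and
# dividing the learning rate by 2.6 at every step down the stack.
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (2.6 ** depth)}
    for depth, layer in enumerate(reversed(list(model.children())))
]

optimizer = optim.SGD(param_groups, lr=base_lr)  # per-group lr overrides the default

for group in optimizer.param_groups:
    print(f"learning rate: {group['lr']:.6f}")  # 0.001000, 0.000385, 0.000148
```

The per-layer learning rates capture the idea that higher layers hold more task-specific information and can safely be updated faster than the more general lower layers.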
The paper is available here.