The difficulty of training large-scale neural models effectively and efficiently has long been a thorn in the side of deep learning practitioners. Winton’s recent contribution to TensorFlow helps researchers apply weight normalization to address this problem.
One interest of Winton’s Natural Language Patterns team is building language models based on recurrent neural networks (RNNs). These models are typically trained on large numbers of examples and require multiple days of training time. Training them is known to be difficult and often demands serious trade-offs, as well as dexterous, if not laborious, tuning of the training algorithm. We have found that the weight normalization technique developed at OpenAI is a promising way to tackle this problem. Since it has recently become part of our default model configuration at Winton, we have contributed our implementation to the open-source machine learning framework TensorFlow.
The Neural Network Training Problem
As datasets increase in size and complexity, the need for more powerful models has become paramount. This demand has only been fuelled by recent technological advances that have provided researchers with almost unlimited computational resources. Much hope lies in large-scale neural models, given their potential to solve a large variety of tasks. The catch is that they are notoriously difficult to train.
Recent innovations in algorithms for training neural networks have lagged behind innovations in model architectures, which has arguably slowed the development of the field of deep learning as a whole. Despite recent theoretical advances and the immense computing power at our disposal, developing better optimization algorithms remains an active area of research. Indeed, Ian Goodfellow notes in his deep learning book:
It is quite common to invest days to months of time on hundreds of machines to solve even a single instance of the neural network training problem.

Neural models, and data-driven models in general, generate outputs by combining their inputs with a set of parameters stored in the model. The models must first be “trained” by tuning their internal parameters to fit a set of given input-output pairs. Training is performed in an iterative manner: each of the model’s parameters is tweaked by a small amount, the effect on the output is observed, and the parameters are then updated based on that observation.
Model parameters are refined in this manner until the model’s outputs are sufficiently close to the desired outputs. This iterative training procedure is formulated mathematically as minimization of a loss function representing the discrepancy between the outputs of the model and the desired outputs.
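The iterative procedure described above can be sketched with a one-parameter toy model. This is a minimal illustration of loss minimization by gradient descent, not Winton’s actual training code; all numbers are invented for the example:

```python
# Fit a single parameter w so that the model output w * x matches a target y,
# by repeatedly nudging w against the gradient of a squared-error loss.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    return 2 * x * (w * x - y)  # d(loss)/dw

x, y = 2.0, 6.0   # one input-output pair; the ideal parameter is w = 3
w = 0.0           # initial guess
lr = 0.05         # learning rate: the "small amount" by which we tweak w

for _ in range(100):
    w -= lr * grad(w, x, y)

print(round(w, 4))  # converges to 3.0
```

Real training works the same way, but over millions of parameters and examples at once, which is what makes the conditioning of the problem matter so much.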
One of the issues that makes optimization of neural models difficult is the problem of “ill-conditioning”. An optimization problem is said to be ill-conditioned if the quantity to be optimized (the loss function above) exhibits extremely different sensitivities across its different tunable parameters. As an analogy, consider a (toy) model that takes as inputs a set of country-specific market indices and attempts to predict a global market indicator such as the MSCI World index.
In this particular example, it is conceivable that the world market index is affected strongly by the market index of the United States but much less by that of Peru. The model’s output would then be extremely sensitive to the parameters associated with the United States input, while those associated with Peru would remain largely inconsequential. For such ill-conditioned problems, optimization algorithms tend to make a large number of spurious updates to the model parameters without any significant improvement in the quality of the model, thereby increasing the training time.
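The effect can be demonstrated numerically. In the sketch below (hypothetical coefficients, not real market data), the loss is far more sensitive to one parameter than to the other, so a single learning rate cannot suit both directions: it must be kept small to avoid divergence along the sensitive direction, which leaves the insensitive direction crawling.

```python
# Toy ill-conditioned loss: L(w) = 100 * w_us**2 + 0.01 * w_peru**2.
# The "US" direction is 10,000 times more sensitive than the "Peru" direction.

def grad(w_us, w_peru):
    return 200 * w_us, 0.02 * w_peru  # partial derivatives of L

w_us, w_peru = 1.0, 1.0
lr = 0.009  # must stay below 0.01, or the w_us updates diverge

for _ in range(1000):
    g_us, g_peru = grad(w_us, w_peru)
    w_us -= lr * g_us
    w_peru -= lr * g_peru

# w_us is pinned near its optimum almost immediately, while w_peru has
# barely moved after 1000 steps: many updates, little overall progress.
print(abs(w_us) < 1e-6, w_peru > 0.8)  # True True
```

This is exactly the pattern described above: the optimizer spends its iterations servicing the sensitive parameter while the insensitive one improves negligibly.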
Recent years have seen a considerable amount of focus on developing models which do not exhibit the aforementioned problem of ill-conditioning. The well-known batch normalization and layer normalization methods in this category have been extremely successful, especially on image recognition problems.
Weight normalization developed at OpenAI is the most recent technique in this line of research. As the name suggests, it normalizes the parameter vectors in a model to unit-norm and introduces a separate scalar variable controlling the length of those vectors. The resulting optimization problem with the new parameters is shown by the authors to be better conditioned than the original one. The key insight behind their technique is to treat the direction and magnitude of the parameter vectors as separate variables. This provides the optimization algorithm with the flexibility to rotate the parameter vectors without affecting their respective magnitudes and vice versa.
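In symbols, each weight vector w is rewritten as w = g · v / ||v||, where the scalar g carries the magnitude and the vector v carries only the direction. The following sketch shows the reparameterization in plain Python; it illustrates the idea only and is not the TensorFlow implementation:

```python
import math

def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||:
    # g controls the length of w, v controls only its direction.
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

v = [3.0, 4.0]   # direction parameters (||v|| = 5)
g = 2.0          # magnitude parameter
w = weight_norm(v, g)

print(w)  # [1.2, 1.6]
print(math.sqrt(sum(x * x for x in w)))  # ||w|| equals g, i.e. 2.0

# Rescaling v leaves w unchanged: only the direction of v matters,
# which is what decouples the two variables for the optimizer.
print(weight_norm([30.0, 40.0], g))  # also [1.2, 1.6]
```

Because updates to v can only rotate w while updates to g can only rescale it, the optimizer no longer has to trade one off against the other, which is the source of the improved conditioning.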
Weight normalization applied to RNN-based language models produces impressive empirical results in our experiments.
Contribution to Open Source
In light of this result, we have come to the conclusion that weight normalization is an important ingredient in the efficient training of large-scale RNN models. Our implementation is now available in the latest release of TensorFlow, v1.6, as the WeightNormLSTMCell class. It is a drop-in replacement for the LSTMCell class currently widely used to implement recurrent architectures. To use it in your existing TensorFlow code, replace

cell = tf.contrib.rnn.LSTMCell(num_units)

with

cell = tf.contrib.rnn.WeightNormLSTMCell(num_units)

This switches your model from a vanilla LSTM to a weight-normalized LSTM. All additional arguments accepted by the LSTMCell constructor are available identically in WeightNormLSTMCell.
In addition to addressing the problem of ill-conditioning, we believe that suitable reparameterization of neural models can be more generally useful. Winton’s Natural Language Patterns team is currently exploring simple reparameterization of existing models for text classification, with encouraging initial results. In pursuing this line of research, we hope to discover more reparameterization benefits, similar to what we have found with weight normalization.