The Conference on Neural Information Processing Systems (NIPS) is a leading machine learning event that Winton has supported every year since 2010. This time around, 5,900 participants – 60% more than last year – descended on Barcelona to share their expertise in fields such as deep learning, learning theory, computer vision, large-scale learning and optimisation. All are areas of cutting-edge research that could have exciting applications in the worlds of technology, finance and beyond.

There were 570 accepted papers split across ‘Modelling’ (that is, proposing new models), ‘Theory’ (establishing properties of existing models/techniques), ‘Algorithm’ (proposing new training algorithms or improvements over existing ones) and ‘Application’ (where subcategories included text, time series and images). Here, Winton researchers review their highlight papers.

## Modelling

Generative Adversarial Nets

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio

Generative Adversarial Nets (GANs) use adversarial training, in which two models are trained simultaneously: 1) a generative model that captures the data distribution and 2) a discriminative model that tries to differentiate between samples generated by the first model and the training data. The analogy often given is that the generator is learning to be an impostor and is trying to “fool” the discriminator into believing that the generated samples are the actual training samples.

Image samples generated by this framework were shown to be “sharper” than those obtained using other techniques that tend to “average out” multiple modes of the underlying distribution. The success of GANs has been – in part – attributed to the fact that instead of optimising likelihood, they minimise the Jensen-Shannon divergence between the data distribution and the generator's distribution, which incorporates some mode-seeking behaviour and hence reduces the “averaging out” effect.
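For reference, the game and its link to the Jensen-Shannon divergence can be written out as follows. The notation is taken from the original GAN paper rather than from this review: $D$ is the discriminator, $G$ the generator, $p_z$ the noise prior and $p_g$ the distribution of generated samples.

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$

For a fixed generator, the optimal discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$, and substituting it back gives

$$
\max_D V(D, G) = -\log 4 + 2\,\mathrm{JSD}\left(p_{\text{data}} \,\|\, p_g\right)
$$

so, against an ideal discriminator, the generator is driving the Jensen-Shannon divergence towards zero.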

The results are promising (see here, for example), but the difficulty in training a GAN was raised by several attendees at the event. The consensus seems to be that since the underlying problem in training these models is not a classical optimisation – but rather a minimax game – the nature of the loss surface is not understood very well. Hence, at the moment, efficiently training these models boils down to experience and several “tips and tricks”.

Energy-based Generative Adversarial Network

J. Zhao, M. Mathieu, Y. LeCun

In most versions, GANs are formulated using log probabilities of the discriminator and generator outputs. This paper instead interprets the discriminator as an energy function that is trained to assign low energies to regions near the data manifold and higher energies elsewhere (in particular, to generated samples). The output, in this case, is just a contrastive function and not a normalized probability density function.

The lack of normalization can be seen as a drawback since the outputs are no longer interpretable as probabilities, but the approach does provide more flexibility in the choice of loss functions and discriminator architectures. For example, the authors build the discriminator as an autoencoder trained with a hinge loss, instead of the classifier used in most previous GAN models.
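A minimal sketch of what such losses might look like, assuming the energy is the autoencoder's reconstruction error and the hinge is a margin on the energies of generated samples. The toy linear autoencoder, the function names and the margin value are illustrative assumptions, not the authors' code.

```python
import numpy as np

def reconstruction_energy(x, w_enc, w_dec):
    """Energy of each sample: reconstruction error of a toy linear autoencoder."""
    recon = (x @ w_enc) @ w_dec
    return np.linalg.norm(recon - x, axis=1)

def ebgan_losses(energy_real, energy_fake, margin):
    """Hinge-style losses on energies (assumed form, for illustration)."""
    # Discriminator: lower the energy of real samples and raise the energy
    # of generated samples until it reaches `margin`.
    loss_d = np.mean(energy_real + np.maximum(0.0, margin - energy_fake))
    # Generator: produce samples that receive low energy.
    loss_g = np.mean(energy_fake)
    return loss_d, loss_g

rng = np.random.default_rng(0)
x_real = rng.normal(size=(8, 4))
x_fake = rng.normal(size=(8, 4))  # stand-in for generator output
w_enc = 0.1 * rng.normal(size=(4, 2))
w_dec = 0.1 * rng.normal(size=(2, 4))

e_real = reconstruction_energy(x_real, w_enc, w_dec)
e_fake = reconstruction_energy(x_fake, w_enc, w_dec)
loss_d, loss_g = ebgan_losses(e_real, e_fake, margin=10.0)
```

Note that the energies are unbounded non-negative numbers, not probabilities, which is the loss of interpretability mentioned above.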

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel

This paper is an information-theoretic extension of GANs that turns them into unsupervised representation learners. The main idea is straightforward but powerful. The input of the generator is augmented with a set of hidden variables (latent codes) that hypothesize some structure over the training data. Learning is then performed by maximizing the mutual information between the latent codes and the generator output.
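In symbols, the idea can be summarised roughly as below. The notation is an assumption borrowed from the InfoGAN paper, not from this review: $c$ denotes the latent codes, $z$ the remaining noise, $V(D, G)$ the usual GAN objective and $\lambda$ a weighting hyperparameter.

$$
\min_G \max_D \; V_I(D, G) = V(D, G) - \lambda \, I\left(c;\, G(z, c)\right)
$$

The mutual information term $I(c; G(z, c))$ cannot be computed directly, so in practice it is replaced by a variational lower bound estimated with an auxiliary network.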

The idea has been used in several applications, especially document clustering, but when combined with deep architectures, it seems to learn surprising aspects of the data. For example, InfoGANs are able to learn – in a completely unsupervised fashion – concepts of brightness, rotation and width in images, as well as hair styles and emotions in human faces!

## Algorithm

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

T. Salimans, D.P. Kingma

Several neural network architectures are known to have an ‘ill-conditioned Hessian’, which has been a key factor in limiting their use for decades. Weight normalization is one of several conditioning techniques developed recently; batch normalization is another.

The key idea is to reparametrize the network weights w in terms of a direction vector v and a scalar magnitude g, so that w = g v / ||v||, and to perform stochastic gradient descent on the two new parameters directly. Because magnitude and direction are decoupled, when we see a drop in the magnitude of g (vanishing gradient), we still have a healthy gradient due to the direction v, and the training progresses forward; hopefully to better conditioned regions of the weight space.
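The reparametrization can be sketched in a few lines of NumPy. The quadratic toy loss, the target vector, the learning rate and the function names are assumptions chosen for illustration; the gradient formulas follow from applying the chain rule to w = g v / ||v||.

```python
import numpy as np

def weight_norm(v, g):
    """Effective weights: magnitude g times the unit direction v / ||v||."""
    return g * v / np.linalg.norm(v)

def weight_norm_grads(v, g, grad_w):
    """Convert a gradient with respect to w into gradients for g and v."""
    norm_v = np.linalg.norm(v)
    grad_g = grad_w @ v / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v
    return grad_g, grad_v

def toy_loss(w, target):
    """Hypothetical quadratic loss, used only to have something to minimise."""
    return 0.5 * np.sum((w - target) ** 2)

target = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 0.1, -0.2])
g = 1.0
learning_rate = 0.1

for _ in range(200):
    w = weight_norm(v, g)
    grad_w = w - target  # gradient of toy_loss with respect to w
    grad_g, grad_v = weight_norm_grads(v, g, grad_w)
    g -= learning_rate * grad_g
    v -= learning_rate * grad_v
```

One property worth noticing is that the gradient for v is always orthogonal to v, so updates to the direction never fight with updates to the magnitude.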

Layer Normalization

J.L. Ba, J.R. Kiros, G.E. Hinton

Just like the previous paper on weight normalization, this work presents another technique to condition and speed up neural network learning. Inspired by batch normalization, which normalizes the summed inputs to a neuron using the mean and variance computed over a mini-batch, layer normalization instead computes these statistics over all the summed inputs to the neurons in a layer, for a single training case. This paper shows that this technique is invariant to feature shifting and scaling. It also improves upon the vanilla LSTM training times by an order of magnitude on the “Attentive Reader” question-answering task.
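A minimal NumPy sketch of the normalization step, assuming a learned per-unit gain and bias and a small constant `eps` for numerical stability (all three names are illustrative). Each row is one training case and the statistics are taken across the hidden units of that row, so no mini-batch is needed.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalise each row (one training case) across its hidden units."""
    mean = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return gain * (a - mean) / np.sqrt(var + eps) + bias

# Summed inputs to a layer of 4 hidden units for 2 training cases.
a = np.array([[1.0, 2.0, 4.0, 7.0],
              [0.0, -1.0, 3.0, 2.0]])
gain = np.ones(4)
bias = np.zeros(4)

normalised = layer_norm(a, gain, bias)

# Rescale and shift each training case by its own constants.
rescaled = a * np.array([[2.0], [5.0]]) + np.array([[1.0], [-3.0]])
normalised_again = layer_norm(rescaled, gain, bias)
```

In this sketch `normalised` and `normalised_again` agree up to the small effect of `eps`: shifting or rescaling all the summed inputs of one training case leaves the output unchanged.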

## Application

### Time Series:

Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA

A. Hyvarinen, H. Morioka

This paper uses neural networks to identify discriminative features from non-stationary time series data. The key idea is to chunk the time series into segments and train a network to discriminate between them. The features learned in the hidden layers are then likely to be informative about the non-stationarity of the original time series. The paper also makes a few connections to solving the ICA identification problem (under certain simplifying conditions), which might be of interest.
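The data preparation behind this idea can be sketched as follows. The helper name, the equal-length segmentation and the synthetic series are assumptions made for illustration; the review does not say how the paper chooses its segments.

```python
import numpy as np

def make_tcl_dataset(series, n_segments):
    """Label every time point of a (T, d) series with the index of its segment."""
    series = np.asarray(series)
    seg_len = series.shape[0] // n_segments
    usable = seg_len * n_segments  # drop any leftover points at the end
    x = series[:usable]
    y = np.repeat(np.arange(n_segments), seg_len)
    return x, y

# Synthetic non-stationary series: the noise scale changes every 10 steps.
rng = np.random.default_rng(0)
scales = np.repeat(np.linspace(0.5, 3.0, 10), 10)[:, None]
series = rng.normal(size=(100, 2)) * scales

x, y = make_tcl_dataset(series, n_segments=10)
```

A classifier would then be trained to predict `y` from `x`, and the activations of its last hidden layer would be used as the extracted features.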

### Text:

Dual Learning for Machine Translation

Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.Y. Liu, W.Y. Ma

Machine translation using neural networks requires millions of sentence pairs for training, which in turn demands extensive and costly human labelling. This paper observes that machine translation tasks always occur in a primal-dual fashion, and it presents a learning framework that takes advantage of this structure.

Much like in typical GAN-style learning, two models are trained simultaneously – one translating from language A to B and the other from B to A – and they teach each other in a closed loop. This alleviates the problem of hand-labelling millions of examples for training. Some labelled data is needed to warm start the learning process, but the paper claims that the process works with just 10% of the bilingual labelled data. The basic idea of dual learning is interesting as many AI tasks can be represented under this framework; for example, speech recognition versus text-to-speech, and search versus keyword extraction.
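The closed loop can be pictured with a toy round trip. Everything here is a stand-in: the word-by-word dictionary "translators" replace neural models, and the word-overlap score is a hypothetical proxy for the feedback signal, which this review does not specify.

```python
def make_word_translator(dictionary):
    """Toy word-by-word translator standing in for a neural model."""
    def translate(sentence):
        return " ".join(dictionary.get(word, "<unk>") for word in sentence.split())
    return translate

def round_trip_score(sentence, translate_ab, translate_ba):
    """Translate A -> B -> A and measure how much of the original survives."""
    mid = translate_ab(sentence)
    back = translate_ba(mid)
    original = sentence.split()
    matches = sum(o == b for o, b in zip(original, back.split()))
    return mid, back, matches / len(original)

a_to_b = make_word_translator({"the": "le", "cat": "chat", "sleeps": "dort"})
# The reverse model is deliberately missing the word "dort".
b_to_a = make_word_translator({"le": "the", "chat": "cat"})

mid, back, score = round_trip_score("the cat sleeps", a_to_b, b_to_a)
```

No labelled sentence pair is needed to compute `score`: a monolingual sentence is enough, and a low score tells both models that something in the loop needs to improve.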

### Image:

Unsupervised Learning of Spoken Language with Visual Context

D. Harwath, A. Torralba, J.R. Glass

This paper is motivated by the fact that humans learn to speak (aided by visual cues) before they read or write. The work attempts to replicate a similar learning model, where a network is provided with spoken captions and image pairs with the aim of associating sound waves with image pixels. Two networks in parallel encode the images and the spoken captions, and the final cost function is based on the dot product of the outputs of the two networks – simply trying to learn a high similarity score between corresponding sounds and images.
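A minimal sketch of a dot-product similarity objective of this kind. The margin ranking loss, the embedding size and the function names are assumptions; the review only states that the cost is based on the dot product of the two network outputs.

```python
import numpy as np

def similarity_matrix(image_emb, audio_emb):
    """S[i, j] is the dot product of image i with spoken caption j."""
    return image_emb @ audio_emb.T

def margin_ranking_loss(S, margin=1.0):
    """Hinge loss pushing matched pairs (the diagonal) above mismatched ones."""
    n = S.shape[0]
    pos = np.diag(S)
    # For each image, compare its true caption with every other caption.
    loss_caption = np.maximum(0.0, margin + S - pos[:, None])
    # For each caption, compare its true image with every other image.
    loss_image = np.maximum(0.0, margin + S - pos[None, :])
    off_diag = ~np.eye(n, dtype=bool)
    return (loss_caption[off_diag].sum() + loss_image[off_diag].sum()) / n

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 16))  # stand-in for the image network output
audio_emb = rng.normal(size=(5, 16))  # stand-in for the audio network output

S = similarity_matrix(image_emb, audio_emb)
loss = margin_ranking_loss(S)
```

The loss reaches zero once every matched image and caption pair scores at least `margin` higher than any mismatched pair in the batch.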

As a nice by-product, the features encoded in the hidden layers of the audio processing network are also analysed. The features learned were informative enough to cluster certain words in just a two-dimensional subspace (obtained using t-SNE dimensionality reduction) of the learned features.

## Theory

Matrix Completion has no Spurious Local Minimum (Award Talk)

R. Ge, J.D. Lee, T. Ma

(Positive semi-definite) matrix completion is at the heart of several applications – collaborative filtering, system identification, global positioning etc. Surprisingly, even though this was known to be a non-convex problem, many algorithms showed fast convergence from random initializations.

This paper explains these observations by proving that commonly used objective functions for matrix completion have no spurious local minima. Precisely, the work shows that any local minimum for the Frobenius norm objective is also a global minimum with the optimum function value equal to 0. This result explains the success of many applications in the recent past.
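A Frobenius norm objective of the kind described can be written as below. The notation is an assumption for illustration: $M$ is the unknown positive semi-definite matrix of rank $r$, $\Omega$ the set of observed entries, $P_\Omega$ the projection that zeroes out unobserved entries and $X \in \mathbb{R}^{d \times r}$ the factor being optimised. The paper's exact objective and conditions may include additional terms.

$$
f(X) = \frac{1}{2} \left\| P_\Omega\left(M - X X^\top\right) \right\|_F^2 = \frac{1}{2} \sum_{(i,j) \in \Omega} \left(M_{ij} - (X X^\top)_{ij}\right)^2
$$

If $M = Z Z^\top$ for some $Z \in \mathbb{R}^{d \times r}$, then $f(Z) = 0$, which is why the global optimum has value 0 even though $f$ is non-convex in $X$.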

Deep Learning without Poor Local Minima

K. Kawaguchi

This paper proves a conjecture published in 1989: for the squared loss function of deep linear neural networks, the objective has no poor local minima. That is, every local minimum is a global minimum, and every critical point that is not a global minimum is a saddle point. For shallow networks with three layers, the Hessian at every saddle point has at least one negative eigenvalue, so a gradient-based algorithm will be able to escape from it; deeper networks can also have saddle points whose Hessian has no negative eigenvalue.
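The smallest possible example makes this concrete: a linear network with two scalar weights a and b, fitting the single data point x = 1, y = 1. This toy problem is an illustrative assumption and does not come from the paper.

```python
import numpy as np

def loss(a, b):
    """Squared loss of the linear network y_hat = a * b * x at x = 1, y = 1."""
    return (a * b - 1.0) ** 2

def gradient(a, b):
    residual = a * b - 1.0
    return np.array([2.0 * b * residual, 2.0 * a * residual])

def hessian(a, b):
    cross = 4.0 * a * b - 2.0
    return np.array([[2.0 * b**2, cross],
                     [cross, 2.0 * a**2]])

# The origin is a critical point that is not a global minimum.
saddle_eigs = np.linalg.eigvalsh(hessian(0.0, 0.0))

# Any point with a * b = 1 is a global minimum with zero loss.
minimum_eigs = np.linalg.eigvalsh(hessian(1.0, 1.0))
```

The gradient vanishes only at the origin or where a * b = 1. At the origin the Hessian has eigenvalues -2 and 2, so it is a saddle with a descent direction, while at (1, 1) the eigenvalues are 0 and 4 and the loss is zero.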

For deep non-linear networks, the same statements were proved, but only under a given set of assumptions. The main consequence of these results is that research into training algorithms for deep neural networks needs to focus on escaping saddle points, rather than local minima.

## Excitement and promise in the field

NIPS 2016 highlighted the current excitement in neural information processing. We have only touched the tip of the iceberg of what is being pursued in the field, but it is exciting to see the work going into predictive (unsupervised) learning. Given that the majority of data are unlabelled, the ability of a machine to learn from data without supervision is essential for developing artificial intelligence. It is the “cake” of intelligence, as LeCun puts it, and supervised and reinforcement learning are the icing and cherry on top. Research into GANs, InfoGANs and dual learning for machine translation, amongst other areas, shows promising progress in this regard.