*In the blog post below, our NIPS 2017 conference attendees – John, Marco and Simon, three senior data scientists at Winton – share their diary entries from the first four days of last week’s event and highlight some of the most interesting ideas and papers presented.*

## Beginner’s guide to NIPS

The NIPS conference, originally a forum purely for academics, now brings together the top practitioners and theorists in the machine learning community, both from academia and the private sector. The biggest tech companies were all sponsors of NIPS 2017, with booths in the exhibitor area. The conference also attracted a significant number of companies in the financial sector. Winton, attending for the eighth consecutive year, again hosted its own booth, which proved a popular destination for those seeking out employment and partnership opportunities, as well as for participants intrigued by our data visualisations. These were built with D3.js and ran on a large 4K wall-mounted touch screen. Other passers-by wanted to learn more about our research and technology, and to pick up some highly sought-after Winton-branded swag! Away from Winton’s own booth, one of the more interesting stands was IBM’s, which sported a life-size model of their quantum computer.

Held this year at the Long Beach Convention Center in California from December 4 to December 9, NIPS 2017 sold out fast: registrations had reached a record of just under 6,000 more than 50 days before the end of the early-bird discount price window, according to an AI researcher at Facebook. The total number of participants was ultimately close to 8,000.

Highlights of this year’s conference included sessions on: the convergence of deep learning and Bayesian techniques; new thinking about hierarchical clustering algorithms; and new approaches for solving imperfect-information games.

*Interest in NIPS Has Grown*

## Some terminology from NIPS 2017

To help readers less familiar with some of the terms used in artificial intelligence and computer science, here is a brief introduction to three important terms at this year’s NIPS:

- **Inductive bias**: the vast majority of machine learning applications are now trained “end-to-end”, which means they receive raw data rather than handcrafted features. However, depending on the application domain, it is important to appropriately calibrate the inductive bias, or a priori set of assumptions, embedded in the design of the algorithm’s architecture (eg, locality and translation invariance in images). This idea has driven, for example, the extension of deep learning to non-Euclidean domains such as graphs and manifolds, by inducing the correct form of bias (eg, invariance to ordering in graphs).
- **Meta-learning**, or learning to learn, which seems to be among the hottest topics, especially in the context of reinforcement learning. The key idea is to train a model that is able to adapt quickly when faced with new, unseen data. A few impressive results of meta-learning applied to robotics were demonstrated at NIPS.
- **Optimization**: the link between machine learning and optimization is clearly strong and has been explored along different dimensions, including: i) understanding how to properly train GANs; ii) proposing concrete algorithms for distributed stochastic gradient descent; and iii) exploring the theoretical foundations of well-established and widely used heuristic algorithms, such as hierarchical clustering.
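The images example above is the canonical one: building translation equivariance into the architecture, rather than asking the model to learn it from data, is an inductive bias. A minimal numpy sketch (a toy illustration, not any particular framework’s API):

```python
import numpy as np

def conv1d_valid(x, k):
    """1-D 'valid' cross-correlation: the building block of CNNs."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=16)   # a raw signal
k = rng.normal(size=3)    # a filter (would be learned)

y = conv1d_valid(x, k)
y_shifted = conv1d_valid(np.roll(x, 2), k)

# Shifting the input shifts the output: convolution "bakes in"
# translation equivariance, so it need not be learned from data.
print(np.allclose(np.roll(y, 2)[2:], y_shifted[2:]))  # True away from borders
```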

## Day 1

### Tutorial: deep learning practices and trends (N. de Freitas, S. Reed, O. Vinyals)

Monday 4 December opened with a tutorial on deep learning practices and trends. The first section of the tutorial provided an overview of deep learning methods, covering: i) the different modalities for which it has proved successful – images, audio and text – and how the relative benchmarks have been revised in recent years; ii) the most influential architectures: Convolutional Neural Networks (CNN) and its variants, recurrent language models, and attention; iii) the choice of the loss function and its optimization.

The second part of the tutorial covered trends currently addressed by the machine learning community, including, for example:

- **Autoregressive models**, which learn a recurrent representation over time and/or space, and have proved particularly effective at generating realistic audio speech (eg, WaveNets)
- **Domain alignment**, which encompasses unsupervised or weakly supervised methods that automatically match inputs from different domains (eg, faces to cartoon avatars, or unsupervised neural machine translation)
- **Graph-based methods**, which learn models represented by graphs using, eg, message passing neural networks (MPNNs)
- **Program induction**, ie, methods to automatically generate source code that reproduces the observed input-output pairs, and possibly generalizes to new inputs
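The autoregressive idea – generate one step at a time, conditioning on everything generated so far – can be sketched with a toy linear model (an illustrative stand-in for WaveNet-style deep networks; the weights here are arbitrary, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy autoregressive model: each sample depends on the previous k outputs,
# x_t = w . x_{t-k:t} + noise.  WaveNet plays the same game with a deep CNN.
k = 3
w = np.array([0.5, 0.3, 0.1])    # stand-in for learned parameters

x = list(rng.normal(size=k))     # seed context
for _ in range(100):
    nxt = w @ np.array(x[-k:]) + 0.1 * rng.normal()
    x.append(nxt)                # feed each output back in as input

print(len(x))  # 103 samples, each conditioned on its predecessors
```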

### Tutorial: Geometric deep learning on graphs (M. Bronstein, J. Bruna, X. Bresson, Y. LeCun)

A second tutorial addressed one of the trends above, namely geometric deep learning on graphs and manifolds. Starting from the key ideas underpinning the success of CNNs, the authors illustrated how to address non-Euclidean domains. In particular, the commonly-used operators of linear filtering and pooling are generalized to graphs and manifolds, providing both a spectral-domain and a spatial-domain representation.

Applications of these Graph Neural Networks (GraphNN) can be now found in computer graphics and computer vision, but also in biomedicine, recommendation systems and particle physics.
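A minimal, untrained sketch of one message-passing round (with random weights standing in for learned parameters) shows the kind of inductive bias these models build in:

```python
import numpy as np

# A 4-node graph given by its adjacency matrix (undirected).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 8))        # node features
W_msg = rng.normal(size=(8, 8))    # message weights (would be learned)
W_upd = rng.normal(size=(16, 8))   # update weights (would be learned)

def mpnn_round(A, H, W_msg, W_upd):
    """One round: aggregate neighbour messages, then update each node."""
    messages = A @ (H @ W_msg)     # sum over neighbours (order-invariant)
    return np.tanh(np.hstack([H, messages]) @ W_upd)

H1 = mpnn_round(A, H, W_msg, W_upd)

# Summing over neighbours makes the layer equivariant to node relabelling --
# the "correct" form of bias for graphs, as noted above.
perm = np.array([3, 2, 1, 0])
H1_perm = mpnn_round(A[perm][:, perm], H[perm], W_msg, W_upd)
print(np.allclose(H1[perm], H1_perm))  # True
```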

### Tutorial: Reinforcement learning for the people and/or by the people (E. Brunskill)

This tutorial tackled the subject of how reinforcement learning systems and humans interact, and the challenges and opportunities that arise as a result. Reinforcement learning has made huge advances in recent years. But in contrast to AlphaGo Zero, which is able to learn by playing itself millions of times, systems that interact with humans don’t have this luxury of scale, and they face additional constraints on the range of actions that can be taken: making a wrong decision in a game of Go doesn’t carry the same risk as making a wrong decision when, for example, an AI robot is performing surgery. Dr Brunskill discussed several techniques that help with these kinds of problems, including methods for sample-efficient transfer learning and safe exploration techniques that limit the policy search space to safe policies.

In the second half of the tutorial, we discussed how human-in-the-loop systems use humans to aid the learning process, including by: specifying the reward function; giving demonstrations from which the reinforcement learning algorithm can learn about human interaction as part of the online learning phase; providing rewards for good decisions; and advice to guide the algorithm as it learns. It was clear that there is huge potential for augmenting reinforcement learning with human guidance; the challenge is how to most effectively combine human intelligence with the reinforcement algorithm.

### Plenary talk: Powering the next 100 years (J. Platt, Google)

How much energy would be needed to power the human population by 2100? If every individual were to consume as much as the average American citizen today, this would amount to 0.2 yottajoules. Fossil fuels are clearly not an option, due to their impact on CO2 emissions. At the same time, the economics of renewable energy is such that it would be a viable alternative for only up to 40% of the energy budget. The answer, according to the speaker, is then to be found in zero-carbon technologies, such as fusion energy. Although the efficiency of these technologies is still low today, the use of machine learning and optimization techniques developed in recent years – such as those implemented in TensorFlow – might make them economically viable in the years to come.
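A rough back-of-the-envelope check reproduces the order of magnitude. Assuming roughly American per-capita consumption (about 10 kW of continuous power) for about 10 billion people over the remaining ~80 years of the century (all three figures are our own illustrative assumptions, not numbers from the talk):

```latex
E \;\approx\;
\underbrace{10^{4}\,\mathrm{W}}_{\text{per capita}}
\times
\underbrace{10^{10}}_{\text{people}}
\times
\underbrace{80\,\mathrm{yr} \approx 2.5\times 10^{9}\,\mathrm{s}}_{\text{to }2100}
\;\approx\; 2.5\times 10^{23}\,\mathrm{J}
\;\approx\; 0.25\,\mathrm{YJ}
```

which is consistent with the 0.2 YJ figure quoted in the talk.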

## Day 2

### Test-of-time award

The “test-of-time” award went to a single, highly influential paper published 10 years ago at NIPS: Random Features for Large-Scale Kernel Machines by A. Rahimi and B. Recht.
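The paper’s central idea – approximate an expensive kernel with random cosine features so that fast linear methods can be used – can be sketched in a few lines of numpy (a minimal illustration, not the authors’ code; the bandwidth and feature count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, n_features=2000, gamma=0.5, rng=rng):
    """Map X so that z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(5, 3))
Z = random_fourier_features(X)

# The inner products of the random features approximate the exact kernel.
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
K_approx = Z @ Z.T
print(np.abs(K_exact - K_approx).max())  # small, shrinks as n_features grows
```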

Ali Rahimi followed his acceptance of the award with a presentation in which he compared the current status of machine learning to alchemy. He acknowledged the value of alchemy in human history, as a way of (incidentally) making important discoveries (e.g., metallurgy, dyes, drugs), without having a deep understanding of the underlying phenomena.

Similarly, some of the breakthroughs in the field of deep learning, such as for example batch normalisation, have a tremendous impact in practical applications, but still lack theoretical explanations of why they work so well.

As such, he urged the NIPS research community to turn ML from alchemy to electricity, by focusing on simple problems and first principles, for which a theory can be proved.

An animated debate then followed on Reddit between Rahimi and Yann LeCun, Facebook’s director of AI research, who thoroughly disagreed with the characterisation of ML as alchemy.

### Invited talk: The Trouble with Bias (K. Crawford)

This talk addressed the issue of how machine learning models can (unintentionally) incorporate societal bias when they learn from data that has not been carefully selected. For example, an image search of “CEO” will return pictures of lots of older white males. There have been many other infamous examples of algorithms learning prejudices, whether sexist, racist, homophobic, and so on. The speaker hoped to inspire the community to take more responsibility for their work and think about the societal implications of the models they build. She hoped we might see broader collaborations with researchers in humanities, who have been considering these issues for some time. The speaker also noted that it’s not always easy to know what is right – whether to learn models that are descriptive of society as it currently is or to consider building models that are more reflective of how we hope things might be in future – and who gets to decide. Either way she felt people needed to be more aware of the issues and have a dialogue on these questions.

### Sessions

**Optimization**: Despite the widespread use of optimizers like Adam and stochastic gradient descent (SGD) with momentum, this is an active area of research that is a long way from being “solved”. There continue to be advances in both understanding and algorithms, and we could see significant breakthroughs over the coming years.

Bayesian Optimization with Gradients, by J. Wu, M. Poloczek, A. Wilson, P. Frazier, stood out among the presentations in the optimization session: there is currently a resurgence of interest in Bayesian techniques, which had somewhat fallen out of favor after the rise of deep learning.
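For readers unfamiliar with the machinery, here is a minimal sketch of plain, derivative-free Bayesian optimization – the family that the paper extends with gradient information – on a toy 1-D objective. The kernel, length-scale and objective are illustrative choices, not anything from the paper:

```python
import math
import numpy as np

def rbf(a, b, ls=0.3):
    """RBF kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None]) ** 2 / ls**2)

def gp_posterior(x_obs, y_obs, x_star, noise=1e-6):
    """Gaussian-process posterior mean and std at x_star."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_star)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y_obs
    var = 1.0 - np.einsum('ij,ij->j', Ks, sol)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimisation: how much we expect to beat the best value."""
    z = (best - mu) / sigma
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    phi = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sigma * phi

f = lambda x: np.sin(3 * x) + x**2       # black-box objective to minimise
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = f(x_obs)
grid = np.linspace(-1, 1, 201)

for _ in range(10):                      # the BO loop: model, then acquire
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

print(x_obs[np.argmin(y_obs)])  # best x found
```

The paper’s contribution, roughly, is to fold observed gradients of `f` into the Gaussian-process model as well, which this sketch does not do.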

Safe and Nested Subgame Solving for Imperfect-Information Games, by N. Brown, T. Sandholm – a very cool talk on machines learning to play poker, which received a “best paper” award. Unlike chess and Go, poker is a game of imperfect information. This makes it strictly harder to have computers learn how to play effectively, but by applying some new subgame-solving techniques the authors developed a computer poker player that can now beat the world’s top human players. Crossing this milestone so soon was as much of a surprise to the authors as everyone else.

*Poker Face: Machines Beat Humans at Texas Hold’Em*

Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search, by B. Moseley, J. Wang: This talk was interesting because hierarchical clustering is used in a large number of applications, but until very recently little or no work had been done on providing an objective function by which to evaluate the performance of different hierarchical clustering algorithms. This work uses the dual of Dasgupta’s framework to establish lower and upper bounds on the performance of some common hierarchical clustering algorithms. Hopefully, this is the start of more work to formalise the evaluation of hierarchical clustering.
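Dasgupta’s objective, which the paper’s analysis builds on, is easy to state in code: each pair of points pays its similarity times the size of the smallest cluster that contains both, so good trees keep similar points together until late. A toy illustration (our own sketch; normalisation conventions vary across papers):

```python
import numpy as np

def leaves(tree):
    """Leaves of a nested-tuple binary tree; a leaf is an int."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def dasgupta_cost(tree, W):
    """At each split, every cut edge pays the split's cluster size."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    L, R = leaves(left), leaves(right)
    cut = sum(W[i, j] for i in L for j in R)
    return (len(L) + len(R)) * cut + dasgupta_cost(left, W) + dasgupta_cost(right, W)

# Two tight pairs: {0,1} similar, {2,3} similar, cross-pairs dissimilar.
W = np.array([[0, 1, .1, .1],
              [1, 0, .1, .1],
              [.1, .1, 0, 1],
              [.1, .1, 1, 0]])

good = ((0, 1), (2, 3))   # cuts only dissimilar pairs at the top
bad = ((0, 2), (1, 3))    # cuts the similar pairs at the top
print(dasgupta_cost(good, W), dasgupta_cost(bad, W))  # good < bad
```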

**Deep Learning Applications** included several interesting talks that showed new developments in the application of deep learning. Deep Hyperspherical Learning seems to be one technique that has proved useful and is relatively easy to implement within existing deep learning frameworks. Bayesian Deep Learning was again the subject of a talk in this session. There was another talk, with a cool demo, that used priors to sharpen blurry images much better than previous state-of-the-art approaches.

## Day 3

### Invited talk: Deep learning for robotics (P. Abbeel)

Pieter Abbeel surveyed the challenges to scaling up deep reinforcement learning and, among other things, presented some cool results on learning to learn. A video of Pieter’s talk is available here. Basically, the idea was to make learning itself the thing they want the robot to learn. Pieter’s team did this by rewarding robots (usually virtual ones, for ease of implementation) that learn quickly in new situations they haven’t seen before.

He also showcased a pretty significant result in transfer learning: his team trained a robot on a diverse range of synthetic worlds (that were diverse but not at all realistic-looking) so that the robot learned to generalize across environments. It was then able to perform tasks in the real world despite never having been trained on real world images.

### Sessions

Attention Is All You Need, by A. Vaswani et al: builds on previous results showing that convolutional networks with attention can rival recurrent networks on problems with non-local dependencies (eg, machine translation), with the advantage that they are much faster, as they can be parallelized.
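The paper’s core operation, scaled dot-product attention, is compact enough to sketch in numpy. This is an untrained toy, with random matrices standing in for learned projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # all position pairs at once: parallelizable
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 6, 4                          # sequence length, model dimension
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (6, 4): one output per position, mixing all positions
```

Unlike a recurrent network, nothing here is sequential: every position attends to every other in a single matrix product, which is where the speed advantage comes from.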

Hindsight experience replay, by M. Andrychowicz et al: fascinating work on revising the reward function after the fact, so that you always learn something from your reinforcement learning trials:

“One ability humans have, unlike the current generation of model-free RL (reinforcement learning) algorithms, is to learn almost as much from achieving an undesired outcome as from the desired one. Imagine that you are learning how to play hockey and are trying to shoot a puck into a net. You hit the puck but it misses the net on the right side. The conclusion drawn by a standard RL algorithm in such a situation would be that the performed sequence of actions does not lead to a successful shot, and little (if anything) would be learned. It is however possible to draw another conclusion, namely that this sequence of actions would be successful if the net had been placed further to the right.”
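The idea in the quote can be made concrete with a toy bit-flipping task (a hypothetical minimal environment of our own, not the paper’s experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy goal-reaching task: the state is a bit vector, each action flips
# one bit, and the (sparse) reward is 1 only if we end up at the goal.
n_bits = 5
goal = rng.integers(0, 2, size=n_bits)

state = rng.integers(0, 2, size=n_bits)
trajectory = []
for _ in range(n_bits):
    action = rng.integers(n_bits)    # a (bad) random policy
    state = state.copy()
    state[action] ^= 1
    trajectory.append((action, state.copy()))

reward = float(np.array_equal(state, goal))

# Hindsight experience replay: store the same trajectory again, but pretend
# the goal was the state we actually reached.  The relabelled reward is 1,
# so even a failed episode teaches the agent how to reach *some* goal.
hindsight_goal = trajectory[-1][1]
hindsight_reward = float(np.array_equal(state, hindsight_goal))
print(reward, hindsight_reward)  # the relabelled reward is always 1.0
```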

## Day 4

### Invited talk: Learning state representations

The day started with an interesting presentation about the brain’s mysterious orbitofrontal cortex. A series of clever experiments (on humans and rats) provides strong evidence that it is used to build state representations. There is a good chance that these new insights from neuroscience could inspire new techniques in machine learning.

### Sessions

- Deep Sets, by M. Zaheer et al: Traditional machine learning has focused on ordered sequences and fixed length vectors, but there are plenty of examples of real world applications where the input data is more naturally represented as a set. This paper proposes an architecture that addresses this and demonstrates its successful application to a number of tasks. There was palpable excitement about this paper because it points to a future where one can use an architecture more suitable for a given problem, rather than having to try and reframe the problem in terms that current architectures are appropriate for.
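The recipe behind this kind of architecture – embed each element, sum the embeddings, then read out – can be sketched with untrained random weights (a toy illustration of the sum-pooling idea, not the paper’s model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deep-Sets-style network: f(X) = rho( sum_i phi(x_i) ).  Summing over
# elements makes the output independent of their order, so the input is
# treated as a set rather than a sequence.
d_in, d_hid, d_out = 3, 16, 2
W_phi = rng.normal(size=(d_in, d_hid))   # would be learned
W_rho = rng.normal(size=(d_hid, d_out))  # would be learned

def deep_set(X):
    phi = np.tanh(X @ W_phi)        # per-element embedding
    pooled = phi.sum(axis=0)        # order-invariant pooling
    return np.tanh(pooled @ W_rho)  # set-level readout

X = rng.normal(size=(7, d_in))      # a "set" of 7 elements
shuffled = X[rng.permutation(7)]

print(np.allclose(deep_set(X), deep_set(shuffled)))  # True: order is ignored
```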

### Workshop: deep reinforcement learning

The deep reinforcement learning workshop started strong, with a presentation by DeepMind on their triumphs in Go and, announced in the last few days, on chess and shogi (Japanese chess). As the media has reported, DeepMind were able to go from a program that knew nothing more than the rules of chess to superhuman play in just four hours of self-play! When you think how many hundreds of years of collective effort it took humanity to get to an inferior level, it is remarkable. As one of the later speakers pointed out, part of the reason self-play is so effective is because of the micro-processing efficiency gains described by Moore’s Law and the vast amount of computing power available – power that through self-play can be turned into virtually limitless data. Exciting times!