by Rory Waite
25 January, 2018 - 7 minute read

Deep learning methods are increasingly being used to solve supervised problems, and their success has been one driver of the recent surge of interest in artificial intelligence.

In supervised problems, a machine learning algorithm predicts a label for some input data based on previously seen pairs of data and labels. Consider speech recognition, for example, where the algorithm must be trained on many hours of recordings that have been paired with transcripts provided by a human. Labelled datasets of this kind are fundamental to AI research, and they are costly to produce.

These datasets are often accompanied by a challenge, whereby researchers complete a task that can be measured objectively. Examples of challenges include, among many others, the ImageNet challenge, the Netflix prize, and the Workshop on Machine Translation. These challenges enable researchers to publicise breakthroughs in machine learning that may have been underreported in the academic literature. In fact, the rapid adoption of deep learning methods is mainly due to their strong performance in a wide variety of challenges and evaluations.

A number of Winton employees recently took part in such a challenge after completing the fast.ai deep learning course. The challenge was set by Winton’s Natural Language Patterns team to put the employees’ new skills to the test.

One of the NLP team’s research interests is metaphor, which we believe is an understudied problem in the computational linguistics community. A major obstacle to metaphor research is the scarcity of datasets with which to train an AI. By putting together a metaphor dataset, we could give our students a novel challenge and also contribute to computational linguistics research.

To find a suitable corpus on which to base our dataset, we turned to linguistics research. A study by Jian–Shiung Shie from the Ursuline College of Languages notes that the New York Times edits its headlines for an international audience. The study found that the edited international headlines often contained fewer metaphors. For example, the metaphorical headline in the US edition:

Fuel Lines Of Tumors Are New Target

is rewritten in the international edition as:

Trying To Kill Cancer Cells, Again

Shie compiled his set of headlines using print editions of the New York Times, and then systematically annotated them. The headlines were split into two classes: those that contained one or more metaphors, and those that did not.

To create our dataset we reproduced the original study on a much larger scale, which required a large number of human annotators to label the headlines. We are fortunate at Winton to have access to Hivemind, a high–quality data science platform offering a combination of software tools and skilled human contributors, which in this case acted as annotators. In the table below, we compare the original study with the Hivemind reproduction.

| | Compiled Headline Pairs | Labelled Headline Pairs | Number of Annotators | Metaphor in US Headline Only | Metaphor in International Headline Only | Metaphor in Both Headlines |
| --- | --- | --- | --- | --- | --- | --- |
| Shie | 605 | 525 | 2 | 87 | 47 | 95 |
| Hivemind | 15,120 | 1,572 | 42 | 396 | 173 | 786 |

Our reproduction does seem to confirm the results of the original study, with some caveats. We were able to scrape many more headlines and use many more annotators, but we only managed to annotate around three times as many headlines as the original study. We also see that Hivemind’s annotators were much more inclined to label headlines as containing metaphor, hence the proportionally larger counts across the metaphor columns.
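Both observations can be read straight off the table; the following quick calculation (plain Python, using only the counts above) makes them explicit.

```python
# Counts taken directly from the table above.
studies = {
    "Shie":     {"us_only": 87,  "intl_only": 47,  "both": 95,  "labelled": 525},
    "Hivemind": {"us_only": 396, "intl_only": 173, "both": 786, "labelled": 1572},
}

for name, s in studies.items():
    metaphor_rate = (s["us_only"] + s["intl_only"] + s["both"]) / s["labelled"]
    print(f"{name}: metaphor in US headline only {s['us_only'] / s['labelled']:.0%}, "
          f"international only {s['intl_only'] / s['labelled']:.0%}, "
          f"any metaphor {metaphor_rate:.0%}")
```

In both studies, metaphor is more often confined to the US headline than to the international one, while the Hivemind annotators labelled a far larger share of pairs as containing any metaphor at all (roughly 86% versus 44%).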

The discrepancies in the reproduction are due to the subjectivity inherent in metaphor annotation. The main problem is that over time, novel and apt metaphors become conventionalised. Take the following headline:

Bahrain Violently Ousts Protesters From Square Placing Ally US in a Bind

The phrase “in a Bind” is suggestive of metaphor, as it is not possible to physically bind a country. However, this expression is in such common use that to many readers it has lost any connection to its original metaphorical domain. Over time, words or phrases that were once metaphorical become part of the English lexicon.

The Shie study was performed by two experienced linguists working closely together. For our larger reproduction, we supplied a detailed annotation protocol that used a dictionary to determine word sense. We found that even though the Hivemind workers were diligent in applying the protocol, there was still a large variance in the labels. For this reason we used a minimum of 10 workers per headline, which reduced the throughput of our annotation process. This low throughput is the reason why we were only able to annotate three times as many headlines as Shie, even though we had many more workers available.

Let us look at the Hivemind results in more detail. We compute an accuracy metric by comparing the labels from each worker with the majority vote, and plot the variance as error bars. We can see that the accuracy is relatively low, at approximately 88%, with a large variance.
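The Hivemind pipeline itself is not shown here, but the metric is straightforward to compute. Below is a minimal sketch in Python, assuming the raw annotations are available as (headline, worker, label) tuples; the names and data layout are illustrative rather than Hivemind’s actual API.

```python
from collections import Counter, defaultdict

def worker_accuracies(annotations):
    """Per-worker accuracy measured against the per-headline majority vote.

    `annotations` is an iterable of (headline_id, worker_id, label) tuples,
    where label is True if the worker marked the headline as metaphorical.
    """
    # Collect all labels given to each headline.
    votes = defaultdict(list)
    for headline_id, _, label in annotations:
        votes[headline_id].append(label)

    # Majority label per headline (ties are broken arbitrarily).
    majority = {h: Counter(ls).most_common(1)[0][0] for h, ls in votes.items()}

    # Score each worker against the majority labels.
    correct, total = defaultdict(int), defaultdict(int)
    for headline_id, worker_id, label in annotations:
        total[worker_id] += 1
        correct[worker_id] += int(label == majority[headline_id])
    return {w: correct[w] / total[w] for w in total}
```

The figure quoted above then corresponds to the average of these per-worker scores, with their spread giving the error bars.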

The difficulty in labelling metaphors raises the question: can an algorithm classify headlines that contain a metaphor? This is the challenge that we issued to the students.

Four students entered the challenge. Interestingly, they picked two different architectures: a convolutional neural network (CNN) and a long short–term memory (LSTM) recurrent neural network. Both neural models build a vector space representation of each word – also known as a “word embedding” – and then use these representations as the input to a logistic regression. The difference between these architectures boils down to how they treat word sequences.

Consider the literal phrase from our previous example: “Bahrain Violently Ousts Protesters”. If we were to switch the subject and object of the verb “ousts”, we would have a metaphorical expression: “Protesters oust the Bahraini Government”. Modelling sequence has been a difficult problem to solve in NLP because of the expressive power of natural language: the model has to account for a very large number of possible phrases and expressions. It is often very difficult to improve upon bag–of–words (BOW) models, which ignore sequence and treat each word independently.
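For reference, a bag-of-words classifier of the kind described here takes only a few lines with scikit-learn. This is an illustrative baseline, not one of the students’ entries; `headlines` and `labels` are assumed to hold the annotated dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# headlines: list of headline strings; labels: 1 if judged metaphorical, else 0.
bow_model = make_pipeline(
    CountVectorizer(),       # word order is discarded entirely
    LogisticRegression(),    # logistic regression over the word counts
)
print(cross_val_score(bow_model, headlines, labels, cv=5, scoring="accuracy").mean())
```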

The CNN architecture improves upon a BOW model by considering a fixed–length window over word vectors. This window is slid over the sentence, allowing the model to capture local sequence. The LSTM models the global word sequence by computing a context vector at each word. This context vector is computed from the word’s embedding and the preceding word’s context vector, which is itself computed from the context vector of the word before it. Thus, the context vector for the last word of the sentence encodes the global word sequence through a recursive computation of context vectors.
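The students’ exact models are not reproduced here, but both architectures, as described, can be sketched in a few lines of Keras. The hyperparameters (vocabulary size, embedding dimension, window width, hidden size) are illustrative assumptions rather than the students’ settings.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D, LSTM

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 20  # assumed values

# CNN: a fixed-length window (kernel_size) slides over the word embeddings,
# modelling local sequence; max pooling then summarises the whole headline.
cnn = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    Conv1D(filters=64, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),  # the logistic regression on top
])

# LSTM: each word's context vector is computed from its embedding and the
# preceding context vector; the final state encodes the global sequence.
lstm = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])

for model in (cnn, lstm):
    model.build(input_shape=(None, MAX_LEN))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

In both cases the final sigmoid layer is the logistic regression mentioned earlier; the two sketches differ only in how they turn a sequence of word embeddings into its input.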

We plot the results next to the Hivemind results and see that all the models perform well, getting close to human–level performance. The LSTM outperforms all the CNN models, which implies that being able to model global sequence is important.

A new relationship appears, however, when we plot the systems with respect to the number of parameters used. Model architecture seems to be less important than the number of parameters used in the model.
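Parameter counts are easy to read off once the models are built; continuing the Keras sketch above (the `cnn` and `lstm` names come from that sketch):

```python
# Compare model sizes directly, e.g. to relate accuracy to parameter count.
for name, model in (("CNN", cnn), ("LSTM", lstm)):
    print(name, "parameters:", model.count_params())
```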

Thanks to our challenge participants, we have shown that it is possible to get close to human level performance when using machine learning techniques for classifying metaphor. Although this challenge was designed as an educational exercise, we can make observations that warrant further research. We have seen how difficult it is to use crowdsourcing to build datasets for metaphor. The resulting labels are noisy, which will cause problems when training a supervised algorithm on the dataset. In other words: garbage in, garbage out.

Addressing these problems lays out interesting avenues for future work. There are two approaches that could solve the noisy data problem. The first is to introduce a linguistic bias into the annotation protocol: a clearer description of metaphor should remove ambiguity during labelling. The second is to use machine learning techniques to reduce the variance in the labels; possible methods include latent–variable models, Bayesian optimisation, and active learning.

Finally, we should further investigate the role of sequence in the classification models, as the results suggest that models with a weak sequence assumption may be able to match the performance of strong sequential models when given enough parameters.