Here we introduce Hivemind and explain how the platform enables the generation of pristine, structured data both quickly and cost-effectively.
The data described here were used in a recent competition for Winton employees to detect metaphors in newspaper headlines using machine learning. But Hivemind is a unique and flexible data science platform that allows companies to solve the toughest data problems.
Hivemind breaks down complex projects into simple microtasks that can be distributed to large groups of human contributors. Since each human contributor takes time to complete their work, there is a strong incentive to use as few of them as possible while maintaining sufficiently high-quality output - the motivation, in other words, for the experiment described in this article.
Identifying metaphors is hard enough for humans, let alone computers being taught to spot them. Determining what counts as metaphorical is a highly subjective exercise: both words and phrases can be figurative by degrees, and metaphors can over time become accepted descriptions of a specific set of circumstances.
The Hivemind contributors’ task was to annotate whether a given headline contained a metaphor. In each case, Hivemind’s workflow presented definitions of individual words and asked multiple contributors to identify metaphors independently.
Below are two snapshots of this classification task taken from the Hivemind platform. Human contributors’ tasks were split into two parts: first, they were asked to scroll over each word in a given headline to see whether the machine-derived definition was accurate, with any corrections designed to improve the underlying algorithm; second, they had to pass judgment on whether the headline did in fact contain a metaphor.
The main difficulty with this line of work was that the contributors tended to rely on their intuition, and hence produced different answers. Before working on a large number of headlines, we experimented on a small set and evaluated the answers from Hivemind.
Twenty contributors were involved in the experiment, and each annotated 200 headlines with a yes or no answer to the question of whether they contained a metaphor. We then analysed those 4,000 answers to construct a strategy for identifying metaphor in tens of thousands of paragraphs.
Our aim was to see whether by using a smaller number of contributors we could still guarantee high-quality results.
To do this, we used a probabilistic model developed by researchers at the Machine Perception Laboratory at the University of California, San Diego. This Generative model of Labels, Abilities, and Difficulties – or GLAD, for short – simultaneously infers the difficulty of the tasks, the accuracy of the contributors, and the results of the tasks.
The algorithm assigns each task a binary value Z, the result that we want to infer. The difficulty of a task is represented by a positive value 1/β, which is correlated with the ambiguity of the task. Each contributor is assigned a score α, which can be negative (likely to disagree with the result) or positive (likely to agree with the result). The algorithm uses an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of these three parameters - Z, α, and 1/β - for all the tasks and contributors.
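In GLAD, the probability that contributor i labels task j correctly is modelled as a sigmoid of the product of the contributor's score and the task's inverse difficulty, so a skilled contributor (large α) on an easy task (large β) is almost always right. As a minimal sketch - not the reference implementation - the E-step for a single task can be written as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def posterior_z(votes, alpha, beta, prior=0.5):
    """E-step of GLAD for one task: P(Z = 1 | observed labels).

    votes : list of (contributor_id, label) pairs, label in {0, 1}
    alpha : dict mapping contributor_id -> accuracy score (negative = adversarial)
    beta  : inverse difficulty of this task (the task's difficulty is 1/beta)
    """
    log1 = np.log(prior)        # accumulates log P(labels, Z = 1)
    log0 = np.log(1.0 - prior)  # accumulates log P(labels, Z = 0)
    for i, label in votes:
        # GLAD's core assumption: P(contributor i labels correctly) = sigmoid(alpha_i * beta_j)
        p = sigmoid(alpha[i] * beta)
        log1 += np.log(p) if label == 1 else np.log(1.0 - p)
        log0 += np.log(p) if label == 0 else np.log(1.0 - p)
    # Normalise in log space for numerical stability
    return np.exp(log1 - np.logaddexp(log0, log1))
```

For example, three reliable contributors (α = 2.0) unanimously voting "metaphor" on an easy task (β = 1.5) yield a posterior very close to 1. A full implementation alternates this E-step with an M-step that re-estimates α and β by maximising the expected log-likelihood.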
The paper also describes an interesting refinement: the priors for α and 1/β can encode prior knowledge about the tasks and the contributors. For example, when we know the contributors are generally good, we can choose a prior for α that places very low probability on negative values.
For this experiment, we adapted our input data to work with an open-source implementation of GLAD. To analyse the data, we computed the accuracy of the majority vote from a subset of contributors with respect to the majority vote of the full set. We selected the subset using three methods:
- Use a subset of the answers from the contributors with highest α
- Use a subset of the answers from the contributors with lowest α
- Use a random subset of the answers
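The comparison itself is straightforward to reproduce. The sketch below uses simulated answers (the real annotations are not reproduced here, and ordering contributors by a known error rate stands in for the α scores GLAD would assign); it computes how often each subset's majority vote agrees with the full-group majority:

```python
import numpy as np

rng = np.random.default_rng(42)
n_contributors, n_tasks = 20, 200

# Simulated stand-in for the real annotations: a hidden truth plus
# per-contributor noise; contributor 0 is the most accurate.
truth = rng.integers(0, 2, size=n_tasks)
error_rate = np.linspace(0.1, 0.4, n_contributors)
flips = rng.random((n_contributors, n_tasks)) < error_rate[:, None]
answers = np.where(flips, 1 - truth, truth)

def majority(rows):
    # Ties (possible with an even-sized subset) resolve to 0 here.
    return (answers[rows].mean(axis=0) > 0.5).astype(int)

full_vote = majority(np.arange(n_contributors))

def agreement(rows):
    """Fraction of tasks on which the subset's majority matches the full majority."""
    return (majority(rows) == full_vote).mean()

k = 10
highest = agreement(np.arange(k))                                   # best-contributor subset
lowest = agreement(np.arange(n_contributors - k, n_contributors))   # worst-contributor subset
random_pick = agreement(rng.choice(n_contributors, size=k, replace=False))
```

With real data, the subsets would be chosen by the α scores GLAD assigns rather than by the simulated error rates.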
Figure 3 shows the agreement ratios, and confirms that, for the task of assessing the 200 headlines, a far smaller group drawn from our original 20 contributors produced close to the same majority vote.
Overall, the graph shows that when using 10 contributors instead of 20, we can reproduce the same majority for 85% of the tasks. The value remains at 80% when we use only the four contributors with the highest accuracy scores generated by GLAD. When using nearer to 20 contributors, the method does not provide a clear distinction between contributors with high and low scores; it may be that with only a small number of bad contributors, there is little discernible effect.
Divide and Rule
The general idea of the strategy was to divide the contributors into two groups on the Hivemind platform with user-defined contributor qualifications: METAPHOR-GOOD and METAPHOR-AVERAGE. Following a method proposed by Welinder and Perona at the California Institute of Technology to prioritise the best annotators for labelling, we then sent tasks to the Hivemind platform in batches of n (e.g. 1,000) tasks, requiring m answers for each task to ensure accuracy.
Contributors with good scores had permission to work on all the tasks; those with lower scores could work on only p% (e.g. 20%) of them. These parameters were controlled by the Hivemind platform’s task qualifications. After each batch of n tasks, we used the GLAD algorithm to recalculate the contributor accuracy scores and the tasks’ results. Based on the updated scores, we reassigned the q% (e.g. 15%) of contributors with the lowest scores to the METAPHOR-AVERAGE group, and the rest to the METAPHOR-GOOD group.
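The regrouping step after each batch can be expressed in a few lines. The sketch below is illustrative rather than the platform's own code; the qualification names are the two groups described above:

```python
def assign_groups(alpha_scores, q=0.15):
    """Split contributors into the two qualification groups after a batch.

    alpha_scores : dict mapping contributor id -> GLAD accuracy score
    q            : fraction of contributors to demote (e.g. 0.15)

    The bottom q fraction by score go to METAPHOR-AVERAGE; the rest to
    METAPHOR-GOOD. (Illustrative sketch, not Hivemind platform code.)
    """
    ranked = sorted(alpha_scores, key=alpha_scores.get)  # ascending by score
    n_average = max(1, round(q * len(ranked)))
    return {
        "METAPHOR-AVERAGE": set(ranked[:n_average]),
        "METAPHOR-GOOD": set(ranked[n_average:]),
    }
```

In production this regrouping would run after each batch of n tasks, with the updated groups fed back to the platform's qualification settings before the next batch is dispatched.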
The Hivemind platform’s API makes it trivial to manage the process of sending the tasks and getting back the answers. Figure 4 below shows one way of creating a task with only a few lines of code.
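As a rough illustration of what such a call might look like - the endpoint URL, field names, and qualification label below are hypothetical assumptions, not the real Hivemind API schema - a task could be created over a REST-style interface like this:

```python
import json
import urllib.request

# Hypothetical endpoint - a placeholder, not the real Hivemind API.
HIVEMIND_ENDPOINT = "https://hivemind.example.com/api/tasks"

def build_task_payload(headline, answers_required=4):
    """Build a task asking whether a headline contains a metaphor.
    All field names here are assumptions made for illustration."""
    return {
        "instruction": "Does this headline contain a metaphor?",
        "data": {"headline": headline},
        "qualification": "METAPHOR-GOOD",
        "answers_required": answers_required,
    }

def create_task(api_key, headline):
    """POST the task to the (hypothetical) endpoint and return the parsed response."""
    request = urllib.request.Request(
        HIVEMIND_ENDPOINT,
        data=json.dumps(build_task_payload(headline)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```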
The current strategy is based mainly on the contributor accuracy scores generated by the GLAD algorithm, and it leaves much scope for further research and improvement.
Another direction for research is to design an algorithm for other types of tasks including non-metaphor tasks and tasks with more than two outcomes.
Further analysis would also be useful for tasks for which we use Hivemind’s link to Amazon Mechanical Turk, a crowdsourcing platform that provides access to a much larger pool of anonymous contributors.