by Sam Bott
26 September, 2017 - 6 minute read

Accuracy and timeliness are two of the vital characteristics we require of the datasets we use for research and, ultimately, Winton’s investment strategies. Yet our approach to collecting, cleaning and adding context to data has changed over time. When Winton was founded 20 years ago, we used far fewer data sources and processed many of them manually. But as we grew, so did our thirst for data. And as we began to harvest data from a growing number of sources, we began to introduce automated processes to download, parse and add the datasets into our substantial relational database.

This approach worked well for many years. There was a single place at Winton to find data, which arrived in a reasonably timely manner. It was not without problems, however: the process of ingesting the data was customised and rigid; introducing new datasets remained an excessively manual procedure; and automated loaders would need to be updated because of inconsistencies in the downloaded files or a change in file format.

These shortcomings were soon exposed by the emergence of what some refer to as “Big Data” or the “Data Age” - terms that describe big improvements in computational power, data capture and data storage. As a result, the number of data sources we were interested in continued to grow at an exponential rate, and we needed to create a new process in order to scale up our data collection. The requirements were as follows:

The Approach

To meet these needs, Winton designed an in-house system consisting of event-driven microservices that could be grouped roughly into three categories: downloading, processing and offloading.

Our system first downloads large datasets that vendors have made available. Generally, each vendor provides all their data at once, which means that from Winton’s perspective the process resembles scheduled batch processing. As soon as the newly-arrived raw files are available for the next stage of the pipeline, an event is fired that triggers a stream-processing system. This in turn parses, tracks and stores event-based data sources that can be used to construct time series. When a downstream process requires data, the relevant range of a time series can be built rapidly and streamed to it.

Downloading

For each data source, a service polls for and downloads all new or modified files we are interested in. This data is stored in a filesystem that preserves all information about when it was downloaded as well as any previous versions of the files. We have designed it to be very flexible and to run almost entirely from configuration. And we provide a selection of options such as the protocol/method to use for polling and downloading, inclusive and exclusive filtering, and post-processing options, such as AES decryption.
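To make the idea of a configuration-driven downloader concrete, here is a minimal sketch of how inclusive and exclusive filtering might be expressed. The `SourceConfig` class, its field names and the glob-pattern approach are all assumptions made for illustration; they are not Winton's actual configuration format.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical per-source download settings; every field name is illustrative."""
    name: str
    protocol: str              # e.g. "sftp" or "https"
    include: tuple = ("*",)    # glob patterns a file must match
    exclude: tuple = ()        # glob patterns that veto a file
    post_process: tuple = ()   # e.g. ("decrypt_aes",)


def select_files(config, remote_files):
    """Return the remote file names that pass the include and exclude filters."""
    selected = []
    for file_name in remote_files:
        included = any(fnmatch(file_name, p) for p in config.include)
        excluded = any(fnmatch(file_name, p) for p in config.exclude)
        if included and not excluded:
            selected.append(file_name)
    return selected


config = SourceConfig(
    name="example_vendor",
    protocol="sftp",
    include=("prices_*.csv",),
    exclude=("*_test.csv",),
    post_process=("decrypt_aes",),
)
listing = ["prices_2017.csv", "prices_2017_test.csv", "readme.txt"]
print(select_files(config, listing))
```

With a structure along these lines, adding a new source would mean writing a new configuration entry rather than new code.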

A single filesystem for our downloaded data has proved very useful, and in some cases it is only these raw files that are ever required. Once a new file has arrived or an existing one has been updated, that event is written to a Kafka topic that starts the next layer of our ingestion process.

Typically, the vendors of our data are attempting to create a single, best view of a dataset. If they were to record a price of 112.356 USD when in fact it was 112.358 USD, they would later correct this figure, sometimes within seconds. At other times, however, the correction could arrive days or weeks later. Capturing a change like this is essential if we are to re-run historical simulations using exactly the information we had at the time. We can transfer this challenge - that is, the mutability of the underlying data - to the next stage of the process by re-ingesting the same file each time it changes.
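One way to picture a store that keeps every version of a file and announces each change is sketched below. It assumes changes are detected by comparing content hashes, and it uses a plain Python list as a stand-in for the Kafka topic; the class, method and field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


class VersionedFileStore:
    """Illustrative store that never overwrites: each change becomes a new version."""

    def __init__(self):
        self.versions = {}  # path -> list of (downloaded_at, sha256, content)
        self.events = []    # stand-in for a Kafka topic

    def save(self, path, content, downloaded_at=None):
        """Store content as a new version if it differs from the latest one."""
        downloaded_at = downloaded_at or datetime.now(timezone.utc)
        digest = hashlib.sha256(content).hexdigest()
        history = self.versions.setdefault(path, [])
        if history and history[-1][1] == digest:
            return False  # unchanged: nothing to re-ingest
        kind = "updated" if history else "created"
        history.append((downloaded_at, digest, content))
        self.events.append({
            "path": path,
            "kind": kind,
            "version": len(history),
            "downloaded_at": downloaded_at.isoformat(),
        })
        return True


store = VersionedFileStore()
store.save("vendor/prices.csv", b"112.356")  # first arrival
store.save("vendor/prices.csv", b"112.356")  # identical content, no event
store.save("vendor/prices.csv", b"112.358")  # vendor correction
print([event["kind"] for event in store.events])
```

Because the earlier version is retained alongside its download time, the corrected file can be re-ingested without losing the record of what was known before the correction.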

Processing

For the processing of the raw files, we opted for a microservice architecture using a Kafka message bus and Akka-based services. Each service is responsible for a clearly defined role in the process:

It is in this layer of our approach that we deal with the difficulties of presenting an immutable view of the data. The key to how this layer works is treating each new piece of information we receive as an event. We can then use the event sourcing pattern to store each of these events in a way that lets us construct the series swiftly by replaying them at a later point. Storing the changes to our time series as an event stream means that, for any given record in any data series, we can retrieve all associated events. We can also layer any applicable update or correction on top of the initial record. By discarding any event that arrived after a given point in time, we are able to reproduce our view of the time series at that point. This approach means the API we provide to this data can be idempotent and, as such, the accuracy of our downstream processes is not time-dependent and they can achieve exactly-once semantics.
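The replay described above can be sketched in a few lines. This is a simplified illustration of event sourcing with a point-in-time cut-off, not the production implementation: the `Event` fields and the `build_series` function are assumed names, and the prices reuse the correction example from earlier in the article.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    key: str               # identifies the record, e.g. an observation date
    value: float
    received_at: datetime  # when we learned of this value


def build_series(events, as_of):
    """Replay events received on or before `as_of`; later events for a key win."""
    series = {}
    for event in sorted(events, key=lambda e: e.received_at):
        if event.received_at <= as_of:
            series[event.key] = event.value
    return dict(sorted(series.items()))


events = [
    Event("2017-09-01", 112.356, datetime(2017, 9, 1, 18, 0)),
    Event("2017-09-04", 113.10, datetime(2017, 9, 4, 18, 0)),
    Event("2017-09-01", 112.358, datetime(2017, 9, 12, 9, 0)),  # late correction
]

before = build_series(events, as_of=datetime(2017, 9, 5))
after = build_series(events, as_of=datetime(2017, 9, 30))
print(before["2017-09-01"], after["2017-09-01"])
```

Calling `build_series` with the same `as_of` always gives the same answer, however many corrections arrive afterwards, which is what allows a simulation to be re-run against the data as it stood on a given day.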

Our API returns time series as a gRPC stream, where data is constructed from its events on-the-fly as it leaves our service. This allows us to start returning data from an API call almost instantly, rather than having to wait for processing on large datasets to complete before it can be used downstream.
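The benefit of building the series on the fly can be illustrated with a Python generator standing in for the gRPC stream. The sketch assumes a hypothetical event store that returns events as `(key, value, received_at)` tuples already ordered by key and then by arrival time, with integers used as simplified timestamps.

```python
from itertools import groupby
from operator import itemgetter


def stream_series(ordered_events, as_of):
    """Yield (key, value) points one at a time.

    Assumes `ordered_events` is sorted by key, then by arrival time, as a
    hypothetical event store might return them.
    """
    for key, group in groupby(ordered_events, key=itemgetter(0)):
        latest = None
        for _, value, received_at in group:
            if received_at <= as_of:
                latest = value  # a later event for the same key overrides
        if latest is not None:
            yield key, latest


ordered_events = [
    ("2017-09-01", 112.356, 1),
    ("2017-09-01", 112.358, 12),  # correction received later
    ("2017-09-04", 113.10, 4),
]

stream = stream_series(iter(ordered_events), as_of=30)
first = next(stream)  # available before the rest of the input is read
rest = list(stream)
print(first, rest)
```

The consumer receives the first point as soon as the events for that key have been read, rather than waiting for the whole series to be assembled.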

Offloading

Although the API mentioned above is available for direct use, it is usually called by the third layer of our data-ingestion pipeline.

Another microservice subscribes to one of the Kafka topics from the previous layer and listens for events indicating that a time series may have been added to. This service loads the modified time series into our central “Time Series Store”, a proprietary columnar database that allows efficient access to versioned time series. Once loaded, the series exist in an environment conducive to research simulations, data-quality tests, automated (or in some cases manual) sign-off procedures, and, ultimately, the generation of trading signals.
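A rough sketch of this offloading step follows. The in-memory `TimeSeriesStore`, the `offload` function, the event shape and the series identifiers are all hypothetical stand-ins; the real store is a proprietary columnar database and the real events arrive over Kafka.

```python
class TimeSeriesStore:
    """In-memory stand-in for a versioned time series store (hypothetical)."""

    def __init__(self):
        self.versions = {}  # series_id -> list of snapshots, oldest first

    def load(self, series_id, points):
        self.versions.setdefault(series_id, []).append(list(points))

    def latest(self, series_id):
        return self.versions[series_id][-1]


def offload(change_events, fetch_series, store):
    """Load each series named in `change_events` into `store`, once per batch."""
    modified = []
    for event in change_events:
        if event["series_id"] not in modified:
            modified.append(event["series_id"])
    for series_id in modified:
        store.load(series_id, fetch_series(series_id))
    return modified


# A dictionary plays the role of the time series API from the processing layer.
api = {
    "ABC.close": [("2017-09-01", 112.358)],
    "XYZ.close": [("2017-09-01", 54.2)],
}
batch = [
    {"series_id": "ABC.close"},
    {"series_id": "XYZ.close"},
    {"series_id": "ABC.close"},  # repeated notification for the same series
]

store = TimeSeriesStore()
loaded = offload(batch, api.__getitem__, store)
print(loaded)
```

Collapsing repeated notifications within a batch is a design choice made for this sketch, so that each modified series is fetched and loaded only once.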

The Result

It is still comparatively early days for Winton’s new approach to data ingestion, but it is proving to be robust and scalable. We have seen a huge improvement in the ease with which we can add new sources for downloading, in the timeliness and thus availability of the data, and in the sturdiness of the new system.