Introducing the Spreads library

TL;DR: This post covers the background of, and my thoughts on, my number-crunching library Spreads, which stands for Series and Panels for Real-time and Exploratory Analysis of Data Streams. The library and its readme are on GitHub: https://github.com/Spreads/Spreads.

While I worked at Goldman Sachs as a research analyst, I got used to an internal declarative time series processing language. It was so cool that I could build the MSCI index rebalancing methodology in several lines of code. It was probably Slang or some wrapper around it. After I left the firm, I could not stand the other existing tools and wanted the same functionality as an end user. I had been far away from the IT department, had no idea about the implementation, never uploaded any code to a German or any other file hosting service, and wrote my own implementation. (This paragraph is name-dropping for search engines; there is no affiliation or any relation other than design inspiration.)

Data series could be modeled as IEnumerable or IObservable sequences. Existing libraries such as LINQ and Rx provided rich functionality and I used them initially from code. However, most data series, e.g. time series, are navigable sequences of key-value mappings. This nature could not be fully leveraged by those libraries.
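To make the distinction concrete, here is a minimal Python sketch of what a navigable key-value sequence offers beyond plain forward iteration. It is not Spreads code: the class `NavigableSeries` and its method names are hypothetical and chosen only for illustration, and it assumes keys are unique and comparable.

```python
import bisect


class NavigableSeries:
    """Hypothetical illustration: a sorted key-value series with navigation."""

    def __init__(self, items):
        # Sort once by key; assumes keys are unique and comparable.
        pairs = sorted(items, key=lambda kv: kv[0])
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def at_or_before(self, key):
        """Return the pair with the largest key <= `key`, or None."""
        i = bisect.bisect_right(self._keys, key) - 1
        if i < 0:
            return None
        return self._keys[i], self._values[i]

    def next_after(self, key):
        """Return the pair with the smallest key > `key`, or None."""
        i = bisect.bisect_right(self._keys, key)
        if i == len(self._keys):
            return None
        return self._keys[i], self._values[i]

    def previous_before(self, key):
        """Return the pair with the largest key < `key`, or None."""
        i = bisect.bisect_left(self._keys, key) - 1
        if i < 0:
            return None
        return self._keys[i], self._values[i]
```

A plain enumerable or observable only answers "what comes next?". A navigable series can also answer "what was the value at or before this key?" in logarithmic time, which is the kind of question time series code asks constantly.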

In 2013, the Deedle library was open sourced. I learned F# with it and contributed a couple of pull requests. And in a weird chain of events, I almost achieved but then ruined my American dream, indirectly due to my experience with Deedle. At my current work, we used it for rapid development. However, due to its internal complexity, immutable design, performance and memory consumption, we couldn’t continue our development with it. When we moved from a prototype on small data to high-volume time series, Windows/Chunks caused OutOfMemory exceptions because of eager intermediate evaluation of each window/chunk. Windowing was one of the most frequently used features, and we quickly realized that to avoid allocations we needed lazy windows and other lazy LINQ-like calculations that leverage the nature of data series.
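The difference between eager and lazy windows can be sketched in a few lines of Python. This is an illustration of the general technique only, not how Spreads or Deedle implement windowing; the function names `lazy_windows` and `moving_average` are hypothetical. An eager version would build a list of all windows first. The lazy version below keeps a single bounded buffer and yields one window at a time.

```python
from collections import deque


def lazy_windows(pairs, size):
    """Yield sliding windows of `size` consecutive (key, value) pairs, one at a time."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Bounded buffer: the oldest pair is dropped automatically when full.
    buffer = deque(maxlen=size)
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) == size:
            # Only this one window is alive; earlier ones can be collected.
            yield tuple(buffer)


def moving_average(pairs, size):
    """Lazily compute a moving average keyed by the last key of each window."""
    for window in lazy_windows(pairs, size):
        last_key = window[-1][0]
        yield last_key, sum(value for _, value in window) / size
```

Memory use here is proportional to the window size rather than to the number of windows, so it also works on an unbounded stream. The sketch still allocates a small tuple per window; a tuned implementation would avoid even that.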

For a year before that, I had been working on the Spreads library for complex event processing, but could not get the design and performance right. Then I re-watched videos by Rich Hickey (The Value of Values, especially the part about place-oriented programming) and Erik Meijer (Duality and the End of Reactive, and the one with the poor little mutable bear that was publicly torn apart) and finalized the design in my head. Series are mutable properties of some identities, e.g. the price of a financial instrument, the temperature in San Francisco, etc. An identity as a whole is immutable as an object reference (e.g. a person's name) until its death (a security is delisted, the earth is destroyed). Series are mutable as objects but immutable as data. Every key-value pair inside a series, when the series is a property of an identity, is immutable, and new pairs are appended at the end of the series. In the real world, however, we could have observation errors, and even trades on exchanges could be canceled, so mutability is inevitable, as Erik showed with the bear, and even historical data is mutable.
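One way to picture this design is a container that grows by appending and treats overwriting an existing observation as a rare, explicit correction. The Python sketch below is only a reading of the idea in the paragraph above, not the Spreads API; the class `AppendOnlySeries` and its methods are hypothetical.

```python
import bisect


class AppendOnlySeries:
    """Hypothetical illustration: append at the end, correct only explicitly."""

    def __init__(self):
        self._keys = []
        self._values = []

    def append(self, key, value):
        """Normal path: a new observation must come after the last key."""
        if self._keys and key <= self._keys[-1]:
            raise ValueError("key must be greater than the last key")
        self._keys.append(key)
        self._values.append(value)

    def correct(self, key, value):
        """Exceptional path: fix an observation error at an existing key."""
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            raise KeyError(key)
        self._values[i] = value

    def items(self):
        """Return a snapshot of all pairs in key order."""
        return list(zip(self._keys, self._values))
```

The object itself changes over time, while each appended pair is meant to stay as it is; `correct` exists because real data sometimes has to be fixed after the fact.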

After the core functionality of Spreads was ready, we ran performance tests and found that the Spreads implementation was pretty fast, with low memory consumption and few allocations. Additionally, the Spreads library was designed from the start to support real-time streaming data, whereas Deedle does not and will not support streaming data due to fundamental design decisions (immutability, Pandas/R-like data structures). We migrated to Spreads for backtesting and optimization, implemented live streaming of data and built our entire data processing pipeline on Spreads - from strategy backtesting to trading. As of version 0.1+, the existing functionality is fast and stable enough for use in our production (where we also test, debug and fix it - the current status is alpha).

While developing Spreads, I became a big fan of mechanical sympathy. I watched all the talks by Martin Thompson and was highly impressed (the best talk by Martin and Todd Montgomery is this one; I am listening to it again while writing this post, and 33:50+ is my favorite part and 100% to the point). The main data structure in Spreads - SortedMap + its cursors - was already a kind of Disruptor, and I was definitely thinking about it while developing Spreads. But Martin opened my eyes to the fact that “Fast systems are all alike; every slow system is slow in its own way”. All his talks combined finally helped me to stop worrying about functional fundamentalism and start loving performance above any purism.

Another talk helped me to realize how happy I am using .NET, where value types in arrays are contiguous in memory, unsafe is built-in and won’t be removed in the next version, native calls do not require special builds with special method names, tooling is great, TPL is awesome, and F# is the best language in the world even for imperative programming! I have never bought the argument that MSFT is evil and Linux is religiously better. All these cries, when they sound religious, are from cheapskates who could not afford MSFT’s products in school - but if a product is better than others and has support, it must cost money. I have Windows on a MacBook because it is better for end users who do anything other than web browsing (e.g. Excel). I do know that Windows Server is (or used to be) not real-time in the strict sense, but nowadays FPGAs are being commoditized. Strict real-time is required for cases where lives depend on it or for “true HFT”, and such cases will always require custom code (hopefully in Rust in the future) inside custom hardware. Since milliseconds are no longer “true HFT”, we could rule out this use case for managed languages. For all other cases, Spreads is fast enough (it can process tens of millions of data items per second per thread) and will be even faster. (My personal challenge is to make it no slower than existing commercial systems!) I am happy to pay more to AWS or to use a Windows server in the office just because it is so easy to connect via RDP, and I am lazy and prefer a GUI over a console. But as I write this, some smart people are working on making .NET a first-class citizen on Linux, and it is already present there via Mono. Poor JVM users do not have async/await and the goodies I mentioned at the beginning of this rant, but this is not a reason to suffer and rewrite a copy of Spreads for the JVM (however, this will very likely happen). The Spreads library is as cross-platform as .NET, for which current developments promise a bright future!

I tried to make Spreads as mechanically sympathetic and fast as possible, while keeping a nice, simple public API. Today I have released the Spreads library. The name Spreads stands for Series and Panels for Real-time and Exploratory Analysis of Data Streams.

I will not copy the readme here. Please go to the Spreads repository, download the code or the NuGet package, create an issue, and submit a pull request if you can. Everything is up for grabs!