Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or ``what if?'' analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay.
We introduce an execution engine that produces serial-equivalent output while accelerating throughput with pipelining and distributed parallelism. This is made possible by optimizing for high throughput rather than the traditional stream processing goal of low latency, and by aggressive sharing of versioned state, a technique we term Multi-Versioned Parallel Streaming (MVPS).
In experiments we see that this engine, which we call ReStream, performs as well as batch processing and more than an order of magnitude better than a single-threaded implementation.