This document summarizes an online presentation about online learning with structured streaming in Spark. The key points are:
- Online learning updates model parameters for each data point as it arrives, unlike batch learning which sees the full dataset before updating.
- Structured streaming in Spark provides a single API for batch, streaming, and machine learning workloads. It offers exactly-once guarantees and understands external event time.
- Streaming machine learning on structured streaming works by having a stateful aggregation query that picks up the last trained model and performs a distributed update and merge on each trigger interval. This allows modeling streaming data in a fault-tolerant way.