This document compares Spark Streaming and Google DataFlow for processing streaming data from Google Pub/Sub and writing to BigQuery. Some key differences are:
- DataFlow provides richer streaming concepts such as event-time processing, watermarks, and session windows, while Spark Streaming only supports fixed and sliding windows based on processing time.
- DataFlow handles fault tolerance entirely, whereas Spark Streaming requires developers to implement checkpointing and recovery themselves.
- Code reuse between batch and streaming is limited with Spark Streaming, whereas the DataFlow SDK uses the same APIs for both.
- DataFlow integrates better with Google Cloud services and has richer monitoring while Spark Streaming requires more integration work for deployment and operations.
- Local testing is easier
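
The windowing difference in the first bullet can be illustrated with a small sketch in plain Python. It uses neither the Spark nor the DataFlow API; the event tuples, the function names `fixed_windows` and `session_windows`, and the 10-second and 3-second parameters are all assumptions made up for illustration. It shows how one late-arriving record lands in a different fixed window depending on whether windows are keyed by event time or by processing (arrival) time, and how session windows group records by gaps in event time.

```python
from collections import defaultdict

# Hypothetical events: (event_time_s, arrival_time_s, value).
EVENTS = [
    (1, 2, "a"),
    (4, 5, "b"),
    (8, 21, "c"),   # produced at t=8 but delivered late, at t=21
    (12, 13, "d"),
]

def fixed_windows(events, size_s, time_index):
    """Group values into fixed windows of size_s seconds, keyed by window start.

    time_index=0 windows by event time, time_index=1 by arrival time.
    """
    windows = defaultdict(list)
    for event in events:
        start = (event[time_index] // size_s) * size_s
        windows[start].append(event[2])
    return dict(windows)

def session_windows(events, gap_s):
    """Group values into sessions based on event time.

    A new session starts when the gap between consecutive event times
    exceeds gap_s seconds.
    """
    sessions = []
    last_time = None
    for event_time, _, value in sorted(events):
        if last_time is None or event_time - last_time > gap_s:
            sessions.append([])
        sessions[-1].append(value)
        last_time = event_time
    return sessions

by_event_time = fixed_windows(EVENTS, 10, time_index=0)
by_processing_time = fixed_windows(EVENTS, 10, time_index=1)
sessions = session_windows(EVENTS, gap_s=3)

print(by_event_time)
print(by_processing_time)
print(sessions)
```

With event-time windows, the late record `"c"` is still counted in the window starting at 0, where it was produced. With processing-time windows it falls into the window starting at 20, which is the kind of skew that event-time processing and watermarks are meant to address.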