Does your big data streaming pipeline have a hole in its pocket? Streaming involves gathering data, processing it, and delivering the results to the intended destinations in real time. A glitch at any stage can cause data loss unless the products in the pipeline provide the necessary guarantees and are configured properly to deliver on them.
Real-time stream processing brings unique challenges with respect to data handling guarantees and fault tolerance, and each streaming product takes its own approach to tackling them. When assembling a streaming pipeline, understanding this critical topic is essential to selecting and configuring the individual components properly. Determining whether records are missing from your data lake can be an expensive exercise, and tracking down the cause to prevent it from recurring can be harder still.
To help you build reliable streaming pipelines, this talk will give you a better understanding of the problems inherent in real-time streaming, the kinds of guarantees that matter, and how they are handled in popular open source products such as Storm, Flink, Kafka, the Hive Streaming APIs, and Flume.
Roshan Naik, Senior MTS, Hortonworks