The 2nd Big Data All Hands Meeting (BDAHM) is held in conjunction with the 2nd Smart Data Innovation Conference (SDIC).
This talk, "Efficiently Handling Streams from Millions of Sensors", introduces three recent research works on massive sensor data streams in the age of the Internet of Things.
We present the three publications and show how, taken together, they make it possible to handle potentially billions of IoT sensors.
1) Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sports analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data at maximum frequency to all applications. Therefore, we need to tailor data streams to the demand of applications. We contribute a technique that optimizes communication costs while maintaining the desired accuracy. Our technique schedules reads across large numbers of sensors based on the data demands of many concurrent queries. We introduce user-defined sampling functions that define the data demand of queries and facilitate various adaptive sampling techniques, which decrease the amount of transferred data. Moreover, we share sensor reads and data transfers among queries. Our experiments with real-world data show that our approach saves up to 87% in data transmissions.
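As a rough illustration of demand-driven sampling with shared reads, the following Python sketch lets each query supply a sampling function and serves all queries that are due at the same tick with a single physical read. It is a simplified toy and not the published technique: the function signature, the integer tick model, and all names (`fixed_rate`, `adaptive`, `schedule_shared_reads`) are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (time of last read, value of last read) -> time of next desired read.
SamplingFn = Callable[[int, float], int]


def fixed_rate(period: int) -> SamplingFn:
    """Query that wants one read every `period` ticks."""
    return lambda t, value: t + period


def adaptive(threshold: float, fast: int, slow: int) -> SamplingFn:
    """Query that samples faster while the last value exceeds `threshold`."""
    return lambda t, value: t + (fast if value > threshold else slow)


def schedule_shared_reads(
    sensor: Callable[[int], float],
    queries: Dict[str, SamplingFn],
    horizon: int,
) -> List[Tuple[int, float, List[str]]]:
    """Read the sensor only when some query is due; share each read among all due queries."""
    next_due = {name: 0 for name in queries}
    reads = []
    while next_due:
        t = min(next_due.values())
        if t > horizon:
            break
        value = sensor(t)  # one physical read and one transfer
        served = sorted(name for name, due in next_due.items() if due <= t)
        for name in served:
            nxt = queries[name](t, value)
            if nxt <= t:
                raise ValueError(f"sampling function of {name!r} must move forward in time")
            next_due[name] = nxt
        reads.append((t, value, served))
    return reads


if __name__ == "__main__":
    shared = schedule_shared_reads(lambda t: float(t), {"a": fixed_rate(2), "b": fixed_rate(3)}, 6)
    demanded = sum(len(served) for _, _, served in shared)
    print(f"{len(shared)} physical reads serve {demanded} requested reads")
```

In this toy setup, two queries with periods of 2 and 3 ticks request seven reads up to tick 6, but only five physical reads are needed because the reads at ticks 0 and 6 are shared.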
2) We present Cutty, an innovative technique for the efficient aggregation of user-defined windows (UDWs) over data streams. While the aggregation of periodic sliding and tumbling windows has been studied extensively, little to no work has been done on optimizing the aggregation of common, non-periodic windows. Typical examples of non-periodic windows are punctuation windows and sessions, which can implement complex business logic. Cutty performs aggregate sharing for data stream windows that are declared as user-defined functions (UDFs) and can contain arbitrary business logic. Cutty outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs.
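The general idea of aggregate sharing can be sketched with slicing: the stream is cut at every window boundary, each slice is aggregated once, and every window combines the partial results of the slices it covers. The Python sketch below is a generic, offline illustration for sums and is not Cutty's algorithm; it assumes that the windows are already known as half-open intervals `[start, end)`, and the name `shared_window_sums` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple


def shared_window_sums(
    events: Sequence[Tuple[int, float]],
    windows: Sequence[Tuple[int, int]],
) -> List[float]:
    """Sum event values per window, aggregating each event exactly once.

    events:  (timestamp, value) pairs in any order
    windows: half-open intervals [start, end), possibly overlapping and non-periodic
    """
    # Every window start or end becomes a slice boundary.
    edges = sorted({edge for window in windows for edge in window})
    partials = [0.0] * max(len(edges) - 1, 0)

    # Each event updates exactly one slice, regardless of how many windows overlap it.
    for timestamp, value in events:
        i = bisect_right(edges, timestamp) - 1
        if 0 <= i < len(partials):
            partials[i] += value

    # Each window combines the precomputed partial sums of the slices it spans.
    results = []
    for start, end in windows:
        lo, hi = bisect_left(edges, start), bisect_left(edges, end)
        results.append(sum(partials[lo:hi]))
    return results


if __name__ == "__main__":
    events = [(1, 10.0), (2, 20.0), (5, 30.0), (7, 40.0), (9, 50.0)]
    windows = [(0, 6), (2, 8), (5, 10)]
    print(shared_window_sums(events, windows))
```

The sketch only covers sums over windows that are known in advance. A streaming system has to discover window boundaries on the fly from the user-defined window logic, which this example leaves out.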
3) We present I², an interactive development environment for real-time analysis pipelines, which is based on Apache Flink and Apache Zeppelin. The sheer amount of available streaming data frequently makes it impossible to visualize all data points at the same time. I² coordinates running cluster applications and the corresponding visualizations such that only the currently depicted data points are processed in Flink and transferred to the front end. We show how Flink jobs can adapt to changed visualization properties at runtime to allow interactive data exploration on high-bandwidth data streams. Moreover, we present a data reduction technique that minimizes data transfer while providing loss-free time-series plots.
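One well-known way to reduce time-series data without changing the rendered line chart is to keep, for every pixel column, only the first, last, minimum, and maximum point (the idea behind M4 aggregation). The Python sketch below illustrates that idea; whether I² uses exactly this reduction is not stated in this abstract, and the function name `reduce_for_plot` and its parameters are assumptions for the example.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def reduce_for_plot(
    points: Sequence[Point],
    t_start: float,
    t_end: float,
    width_px: int,
) -> List[Point]:
    """Keep at most four points (first, last, min, max) per pixel column.

    points must be sorted by timestamp; only points in [t_start, t_end) are kept.
    """
    span = t_end - t_start
    columns: Dict[int, List[Point]] = {}
    for point in points:
        t, v = point
        if not (t_start <= t < t_end):
            continue  # outside the currently displayed time range
        col = min(int((t - t_start) * width_px / span), width_px - 1)
        if col not in columns:
            columns[col] = [point, point, point, point]  # first, last, min, max
            continue
        kept = columns[col]
        kept[1] = point  # input is time-ordered, so the newest point is the last one
        if v < kept[2][1]:
            kept[2] = point
        if v > kept[3][1]:
            kept[3] = point

    reduced: List[Point] = []
    for col in sorted(columns):
        # Drop duplicates (e.g. first == min) and restore time order within the column.
        reduced.extend(sorted(set(columns[col])))
    return reduced


if __name__ == "__main__":
    series = [(float(t), float((t * 7) % 10)) for t in range(100)]
    reduced = reduce_for_plot(series, 0.0, 100.0, width_px=4)
    print(f"{len(series)} points reduced to {len(reduced)}")
```

The number of transferred points then depends on the width of the plot in pixels rather than on the rate of the underlying stream, which is what makes such a reduction attractive for interactive front ends.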