Real-time Big Data streaming integration for sensor networks

  • Basically, storing data before analyzing it no longer works. Visualize businesses struggling with an exponentially growing fire-hose of streaming data: GPS sensor data, eCommerce logs, web access logs, security logs, business transactions, and so on. Store-then-analyze no longer works because businesses increasingly need to process and digest this information in ever shorter time windows in order to strike while the iron is hot. Today people try to perform these analyses using data warehousing technology, and they do so because they want to use the SQL query language, which has become the lingua franca of data analysis. But data warehouses can't keep up with the exploding data volume. Yes, they can load the data, but there is a considerable delay before the information is cleaned, aggregated, and ready for delivering answers as “cooked data”. And there is an even greater problem: data warehouses operate on snapshots of data. They can never be up to date, and they provide no visibility into rates of change, which is critical for so many decisions these days. Steering your business from snapshots of data is like driving your car on the freeway using only the rear view mirror, only worse, because the stale snapshots you steer by could be hours out of date.
  • SQLstream provides the solution. We analyze live data while still in flight, providing a real-time stream of answers and eliminating the latency. We do this by performing the cleaning, aggregation, and query processing continuously and in real-time on the live data stream. This offloads the cleaning and aggregation from the data warehouse. SQLstream keeps the systems updated with fully cooked data. Historical analysis is now faster and finally delivers up-to-the-moment results. SQLstream is the perfect complement to the data warehouse: The data warehouse queries the past. SQLstream queries the future. The data warehouse operates on historical snapshots. SQLstream operates on live data. The data warehouse provides static results. SQLstream provides real-time visibility into data still in flight. SQLstream adds real-time alerting and reporting based not only on thresholds, but also on complex correlations over multiple streams and multiple time or record windows. We are not only talking about generating alerts. Services can now react and respond in real-time to complex analyses on live service data.
  • First is Hadoop, with its phased approach and a lot of superscalar execution over 64 MB chunks of data. When each phase is complete, and all the records are assembled and sorted from the chunks of records fed from upstream, the next phase can commence. Next is the Relational Streaming model. What you have is a Directed Acyclic Graph of relational operators operating on streams of tuples synchronized around (normally) timestamps. This is very similar to an electronic logic circuit. The relational operators are the logic gates. The tuples are the binary data signals. Both propagate answers as soon as the results are available, with minimal latency. Both utilize dataflow execution and time synchronization where appropriate, and both offer pipelining parallelism and superscalar parallelism.
  • For one customer, SQLstream monitors the vehicular traffic on their road systems by processing vehicle and road sensor data, and in real-time determining congestion, instantaneous average speeds, and much more. We monitor roads down to a granularity of 10-meter road segments.
  • Querying at a scale of multiple millions of events per second. Consider a massive-scale MMO game with many millions of scores streaming in per second and fed into a cloud of Relational Stream processors.
  • This is an example of a game scoring analytic operation. The idea is to provide a real-time running score that decays over time. The players are only as good as their recent successes! What you have to do is to maintain a sum of scores with a full weighting for the more recent scores, and a decreasing weighting for the scores achieved several weeks ago. The data are processed through many such analytical queries, and a sustained throughput of millions of records per second is maintained with an Amazon EC2 cloud-based architecture. SQLstream is demonstrating this at its stand outside this auditorium for anyone interested in seeing it live. One processing node is capable of processing a quarter of a million records per second on a high-end PC. It scales linearly as you add additional nodes.
  • Streaming requires, well, streaming data records. These records are streamed out continuously to capture some kind of event or transaction, and often carry a timestamp or other similar key. We call this type of data S3 Data, for Sensor, System and Service Data. The Sensor Data category includes vehicular sensor data, GPS data, transportation data, and engine data, as well as machine-to-machine networks, smart energy, manufacturing sensors and so on. The System Data category, which some call Machine Data, covers log files emitted from applications and servers, and can be used for real-time security, fraud and compliance detection, as well as cloud computing monitoring and service level monitoring, to name a few. Finally, the Service Data category includes all manner of service data, ranging from SMS text messages and Call Detail Records for telecom billing and fraud, to real-time pricing and promotion for eCommerce, and the Active Internet: context-dependent content pushed based on the Internet interactions and activities of, say, your buddies within your social network.
  • Relational Streaming is a great complement to Hadoop, providing a lot of synergy. In your Hadoop cluster, you should install a Relational Stream Processor on each Hadoop node. The Relational Streaming system can then perform Continuous ETL, Data Cleaning and Aggregation on the incoming data streams and can populate HDFS in a massively parallel streaming architecture. Hadoop can be used to perform massively parallel batch or historical processing on the data, such as sorting and ranking all of the data. The data can then be re-streamed through the Relational Streaming nodes as part of a Map-Reduce job where stream processing is required (such as time-series analytics, where records are processed in the context of other records). Also, individual streaming operations can enhance records mid-stream using external queries over historical data for real-time enhancements, such as changing social security numbers into names and addresses. Hadoop queries the past, running processing periodically over stored data with batch jobs on a massive scale. Relational Streaming queries the future, running processing continuously over real-time streaming data, also on a massive scale but with finer-grained parallelism.
  • Let's look at the broader Data Management space and see how Relational Streaming fits in with the other quadrants. Streaming is a high-level declarative paradigm. In data warehousing, SQL is used as a declarative language to periodically query the past. Relational Streaming can use SQL to query the future, facing into the incoming record streams with continuous queries. It is the mirror image of, and perfect complement to, the DBMS. It is also the declarative mirror image of the procedural paradigm of messaging middleware. In messaging middleware, you subscribe to topics of messages and process the records with procedural logic in Java or C++ before republishing the records to other message streams. In Relational Streaming you subscribe to relational queries where the application logic is represented declaratively. Both are Pub-Sub models. The Relational Streaming form is massively parallel and, as a high-level declarative paradigm, much less expensive to build and maintain. Incidentally, Hadoop provides for procedural operations over batch processing with a massively scalable architecture, and belongs in the quadrant representing both the mirror image of databases and of messaging middleware.
  • Finally, Streaming, and in particular Relational Streaming, is the next Big Data frontier, as it supports some important new capabilities vital for emerging Big Data applications. First, Streaming Views of Data: any view of any data, in real-time and running all the time. Next, Real-time Reaction: allowing you to harness massive volumes of S3 data and to react and adapt in real-time. Finally, Massively Parallel execution: allowing you to exploit fine-grained parallelism on a massive scale across multiple servers and cores, using pipelining and superscalar operations. We all know well how to process historical data; we all are already doing that. Perhaps it is now time for us all to start querying Future Data?
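
A minimal sketch, in plain Python rather than streaming SQL, of the decaying score described in the game-scoring notes above: points from the latest week count in full, and each earlier week counts half as much as the week after it, matching the 1, 0.5, 0.25 and 0.125 weights in the slide 15 query. The function name, the (timestamp, points) input shape and the whole-week bucketing are assumptions made for illustration; this is not SQLstream's implementation.

```python
from datetime import datetime, timedelta

# Weights per week of age (newest week first), as in the slide 15 query.
WEEK_WEIGHTS = (1.0, 0.5, 0.25, 0.125)


def decayed_score(events, now):
    """Return the decayed score for one player or service.

    events: iterable of (timestamp, points) pairs (hypothetical input shape).
    Events older than 28 days contribute nothing.
    Assumption: an event exactly 7 days old falls into the second week;
    a SQL RANGE window may treat that boundary differently.
    """
    total = 0.0
    for ts, points in events:
        age = now - ts
        if age < timedelta(0):
            continue  # ignore events stamped in the future
        week = age // timedelta(days=7)  # 0 = latest week, 1 = week before, ...
        if week < len(WEEK_WEIGHTS):
            total += WEEK_WEIGHTS[week] * points
    return total
```

Unlike the continuous query, this function recomputes the score from a stored list on each call; a streaming engine would instead maintain the four windowed sums incrementally as records arrive.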

Transcript

  • 1. Real-time Control in a Big Data World. Sensors Expo, 2012. Presenter: Damian Black, SQLstream CEO. Copyright © 2012 – Proprietary and Confidential Information of SQLstream Inc.
  • 2. Agenda » What is Streaming Big Data? » The “Sensor Internet” – a Real-time connected world » Architectures for processing Real-time/Fast Big Data » Sharing and Reusing data with Relational Streaming » Case studies and Examples » Relational Streaming and Hadoop » Mapping out the data management space » Conclusions
  • 3. Real-time Big Data. First, what is a Streaming Big Data Platform? » Stream any data in, immediately stream out real-time answers. » Continuously analyze and process massive data volumes. » React in real-time to each and every new record. And what then is Relational Streaming? » A Streaming Big Data paradigm for processing data streams. » Familiar relational expressions with automatic optimization. » Queries executed continuously on a massively parallel scale.
  • 4. A real-time connected world of sensors » Technology Drivers » GPS enabled devices » Low cost wireless sensors » Ultra low power sensors » Business & Environmental Drivers » Congestion Reduction » Smart Energy & Environment Monitoring » V2V, V2I and Smart Transportation » M2M » RFIDs & the 'Internet of Things'
  • 5. Today's operational platforms – far from real-time. Poorly integrated operational platforms based on traditional store-and-process technology. Massive volumes of streaming data (Service, System, Sensors). Exponential growth.
  • 6. Analyze Streaming Data in Flight. Massive volumes of streaming data (Service, System, Sensors; exponential growth) feed the SQLstream s-Server. Respond to real-time analysis. Historical data used for predictive real-time analytics. Existing operational systems and data warehouses kept up to date in real-time with continuous ETL. Real-time alerts and visibility with continuously streaming results.
  • 7. Streaming Data Processing – Achieving Scalability » Fine grained dataflow: pipelined & superscalar parallel processing » Reuse of analytics and data streams across nodes » Avoid transactional bottlenecks – fine-grained streaming dataflow » SQL as a parallel dataflow language – standard, familiar, proven
  • 8. Streaming Data Processing & Windows. Overview of real-time processing pipelines: data sources such as log files, sensors and API feeds are turned into streaming data feeds. Example: Continuous query for real-time alerts »
      CREATE VIEW sla_fulfilled AS
      SELECT STREAM *
      FROM orders OVER sla
      JOIN shipments ON orders.id = shipments.orderid
      WHERE city = 'New York'
      WINDOW sla AS (RANGE INTERVAL '1' HOUR PRECEDING)
  • 9. Case Study: Traffic Analytics from GPS Data
  • 10. Case Study: Traffic Analytics from GPS Data. Objective: Accurate and reliable Journey Time information with dynamic updating of alternative routes, identifying 'worse than usual' events and predictive incident detection. Diagram labels: one GPS event per vehicle per second; road segment GIS database; historical trend data; 10 meter road segments.
  • 11. SQL as an API – Simplifying Analytics » Example: Compute Average speed across any subset of the road network over rolling time windows from GPS events
  • 12. Case Study: Real-time Seismic Event Detection
  • 13. Input Signal Data (blue) and Detected Quakes (red)
  • 14. The 'Sensor Internet' for Services » Many sensors streaming data over the Internet in real-time. » Streaming analytics maintained over varying time windows. » Aggregated and continuously sorted: streaming “order by”. (Diagram: many streams feeding a cluster of servers.)
  • 15. Streaming SQL: Decaying Service Monitor Scoring
      CREATE OR REPLACE PUMP "SONG_SCORE_PUMP" STOPPED AS
      INSERT INTO "SERVICE_SCORE" ("serviceId", "SCORE")
      SELECT STREAM "SERVICE_ID" AS "serviceId",
        SUM("POINTS") OVER "LAST_WEEK"
        + ((SUM("POINTS") OVER "LAST_2_WEEKS" - SUM("POINTS") OVER "LAST_WEEK") * 0.5)
        + ((SUM("POINTS") OVER "LAST_3_WEEKS" - SUM("POINTS") OVER "LAST_2_WEEKS") * 0.25)
        + ((SUM("POINTS") OVER "LAST_4_WEEKS" - SUM("POINTS") OVER "LAST_3_WEEKS") * 0.125)
        AS "SCORE"
      FROM "SERVICE_SCORES"
      WINDOW "LAST_WEEK" AS (PARTITION BY "SERVICE_ID" RANGE INTERVAL '7' DAY PRECEDING),
        "LAST_2_WEEKS" AS (PARTITION BY "SERVICE_ID" RANGE INTERVAL '14' DAY PRECEDING),
        "LAST_3_WEEKS" AS (PARTITION BY "SERVICE_ID" RANGE INTERVAL '21' DAY PRECEDING),
        "LAST_4_WEEKS" AS (PARTITION BY "SERVICE_ID" RANGE INTERVAL '28' DAY PRECEDING);
    » Millions of events per second » Real-time service scoring » Amazon EC2
  • 16. Streaming SQL – Change in Rate of Service Errors
      SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
      FROM (
        SELECT STREAM
          ROWTIME, url, "numErrorsLastMinute",
          AVG("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING)
            AS "avgErrorsPerMinute",
          STDDEV("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING)
            AS "stdDevErrorsPerMinute"
        FROM "ServiceRequestsPerMinute") AS S
      WHERE S."numErrorsLastMinute" > S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";
    » Millions of records per second » Real-time Bollinger Bands » Amazon EC2
  • 17. Use Cases for S3 Data (Sensor x System x Service) » Sensor Data: » Location, Power, Temperature, Pressure, Speed, … » GPS and Mobile Devices, RFID » System Data: » Log files, Device records, SNMP MIBs » Service Data: » Usage log files, transactions, Internet, other » Industries & Applications: » Energy, Mining, Transportation, Manufacturing, Logistics, etc » Performance, Security, Compliance, and Fraud Monitoring » Error and Service Level Monitoring » Usage, Metering and SCADA
  • 18. Comparison of Big Data Processing Platforms
      Hadoop: petabytes of stored data; batch processing; historical queries; high latency. Hadoop style: data chunking, coarse-grained dataflow.
      Streaming: millions of events per sec; stream processing; continuous queries; low latency. Relational Streaming: DAGs of fine-grained dataflow.
  • 19. Relational Streaming overlaying Hadoop » Relational Stream Processors co-located with Hadoop Servers to stream/re-stream local data » Combination performs Real-time and Historical processing: » Querying the future – Continuous ETL and Analytics (parallel pipelines) » Querying the past – Hadoop batch jobs on stored tuples (parallel batches) » Re-streaming and Re-querying (for example, scenario & sensitivity analyses). Diagram: a combined Hadoop & Relational Streaming Server, with relational operators (Select, Project, Join, Agg, Order, Group) alongside Hadoop stages (Split, Map, Combine, Sort, Reduce).
  • 20. Relational Streaming: A new data management quadrant
      High-level Declarative Language & Operation: Data Warehouses (historical analysis, periodic batches) and Relational Streaming (continuous analysis, real-time processing).
      Low-level Procedural Language & Operation: Hadoop Big Data (historical analysis, periodic batches) and Messaging Middleware (continuous analysis, real-time processing).
  • 21. Query the Future. Real-time Analysis: Process, analyze, and react – all in real-time. Parallel Processing: Parallel processing made easy, auto-optimized, massive scale. RT Data Integration: Continuous, real-time data integration: » Give each app the view of data and format it needs » Share all your data in real-time with all your apps » Perform Continuous ETL and Data Integration. Relational Streaming – the Next Wave of Big Data.
  • 22. Real-time Control in a Big Data World. Sensors Expo, 2012. Presenter: Damian Black, SQLstream CEO.
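
The error-rate query on slide 16 flags a URL when its latest per-minute error count exceeds the windowed mean plus two standard deviations, the "Bollinger Band" test named on that slide. The Python sketch below mimics that rule outside any streaming engine. The class name ErrorRateMonitor, the count-based window of the last 10 samples (the slide uses a time-based RANGE window), and the use of the population standard deviation are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class ErrorRateMonitor:
    """Flags per-URL error counts that exceed mean + num_std * stddev."""

    def __init__(self, window=10, num_std=2.0):
        self.num_std = num_std
        # One bounded history per URL, playing the role of PARTITION BY url.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, url, errors_last_minute):
        """Record one per-minute count; return True if it is an outlier.

        As in a SQL window ending at the current row, the new sample is
        part of the window its statistics are computed over.
        """
        samples = self.history[url]
        samples.append(errors_last_minute)
        threshold = mean(samples) + self.num_std * pstdev(samples)
        return errors_last_minute > threshold
```

Because the newest sample is included in its own window, a spike can only exceed the two-sigma threshold once the window holds several quieter samples; with very short histories the monitor stays silent.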