Real-time streaming Big Data with Relational Streaming

SQLstream CEO, Damian Black, presented on real-time streaming Big Data at GlueCON 2012, both as a real-time alternative to Hadoop for Big Data and as a complement that adds real-time responses and streaming integration to existing Hadoop installations.

Speaker notes
  • A comparison of the techniques that databases, Hadoop HDFS and Relational Streaming use to achieve scalability. They all process records in the form of tuples. They all store the tuples over multiple servers in a clustered, networked environment. They all provide the illusion of a single collection of tuples, even though each stores the underlying tuples in different physical sub-collections across the various servers, with redundant copies in the case of Hadoop and Relational Streaming for resiliency. Databases provide the abstraction of a single fat table, physically stored as many tables across many servers. Hadoop provides the abstraction of a single fat file of tuples, stored as many physical files across many servers. Relational Streaming does the same with fat streams. The key difference is in the way the systems handle changes to the data during processing. Databases allow updates in situ – transactions – with all of the headache that entails, which effectively limits their scalability to perhaps many tens of servers. Hadoop and Relational Streaming always create new tuples from old, and that technique eliminates any shared state and allows scaling to thousands of servers (a minimal sketch of this idea appears after these notes). Relational Streaming differs from Hadoop in that it can start streaming out new answer tuples as soon as it starts reading the first input tuples, in a massively parallel declarative execution graph, whereas Hadoop has to wait for the entire collection to be processed.
  • You can visualize the two styles above. First Hadoop, with its phased approach and a lot of superscalar execution of 64MB chunks of data: when each phase is complete, and all the records are assembled and sorted from the chunks of records feeding upstream, the next phase can commence. Next is the Relational Streaming model: a Directed Acyclic Graph of relational operators operating on streams of tuples, synchronized around (normally) timestamps. This is very similar to an electronic logic circuit. The relational operators are the logic gates; the tuples are the binary data signals. Both propagate answers as soon as the results are available, with minimal latency. Both use dataflow execution and time synchronization where appropriate, and both offer pipelining and superscalar parallelism (a small dataflow sketch appears after these notes).
  • So visualize businesses struggling with an exponentially growing fire-hose of streaming data – GPS sensor data, eCommerce logs, web access logs, security logs, business transactions, and so on.
  • Storing the data before analyzing it no longer works, because businesses increasingly need to process and digest this information in ever shorter time windows in order to strike while the iron is hot.
  • SQLstream analyzes live data while still in flight, providing a real-time stream of answers and eliminating the latency. We do this by performing the cleaning, aggregation and query processing continuously and in real time on the live data stream. This offloads the cleaning and aggregation from the data warehouse. SQLstream keeps the systems updated with fully cooked data, so historical analysis is now faster and finally delivers up-to-the-moment results. SQLstream is the perfect complement to the data warehouse:
    The data warehouse queries the past. SQLstream queries the future.
    The data warehouse operates on historical snapshots. SQLstream operates on live data.
    The data warehouse provides static results. SQLstream provides real-time visibility into data still in flight.
    SQLstream adds real-time alerting and reporting based not only on thresholds, but also on complex correlations over multiple streams and multiple time or record windows. For example:
    Detecting a gang member in the same location as another gang member.
    Presenting an ad that corresponds to a user's real-time browsing behavior.
    Predicting a dangerous traffic situation – sudden deceleration of vehicles and compression-wave velocity.
    We are not only talking about generating alerts. Services can now react and respond in real time to complex analyses on live service data. In our examples:
    Not only detecting gang members too close to one another, but also alerting the authorities.
    Not only selecting an ad that matches a user's interests, but providing a discount based on real-time inventory reduction or sales rates.
    Not only detecting a pending traffic jam, but dynamically changing freeway speed limits to avoid accidents.
    Businesses perform data mining either by hand or using sophisticated data mining tools, operating on data held in the data warehouse. But they always back-test the inferred rules – for, say, predicting fraud, identifying a potential buyer, or predicting a traffic jam – by writing SQL queries to run against the historical data. The results of the queries show whether A, B and C are correlated with the predicted outcome D with a confidence factor of, say, >95%. The beauty of the SQLstream approach is that we can reuse those back-testing SQL queries to perform the same analysis, but now predictively instead of forensically, running them against the live streaming data. When they detect A, B and C, they generate an output on D with a 95% confidence factor. It might represent fraud, a good prospect for a product, or a potential traffic jam ahead. You get the idea. (A sketch of reusing a back-tested rule on a live stream appears after these notes.)
  • Getting raw data into a data warehouse in "cooked" form is the job of ETL (Extract, Transform, Load) tools. However, today's ETL tools operate as a sequence of sequential processing stages, with each stage having to complete before the next stage can start. The way this works is that raw records are first collected and poured into staging tables. Then the records are processed, aggregated and cleaned. Then they are ready for querying, and reports are finally generated and delivered to awaiting users, often the next day. In contrast, SQLstream provides a continuous, overlapped, parallel-processing alternative in which records continuously stream in, are directed into cleaning and aggregation pipelines, and are then piped immediately into sets of awaiting analytics queries that stream out results for immediate delivery to awaiting apps and users. So we have minimal end-to-end latency and maximum throughput, non-stop, with massive scale and parallel processing (see the staged-versus-continuous ETL sketch after these notes).
  • We are one of only two closed-source solutions within Mozilla. We power their real-time analytics – see our "Powered by SQLstream" logo on the bottom right of their web display, or search YouTube for "Mozilla Glow". SQLstream processes all of the log files from Mozilla's download servers in real time: parsing the files, streaming the data, mapping IP addresses to longitude and latitude, finding the nearest town, city or village, and performing a range of analytics on the streams to feed a Hadoop cluster (HBase), which displays historical information complemented by SQLstream's real-time analytics.
  • Relational Streaming is a great complement to Hadoop, providing a lot of synergy. In your Hadoop cluster, you should install a Relational Stream Processor on each Hadoop node. The Relational Streaming system can then perform continuous ETL, data cleaning and aggregation on the incoming data streams, and can populate HDFS in a massively parallel streaming architecture. Hadoop can be used to perform massively parallel batch or historical processing on the data, such as sorting and ranking all of the data. The data can then be re-streamed through the Relational Streaming nodes as part of a MapReduce job where stream processing is required (such as time-series analytics – records processed in the context of other records). Also, individual streaming operations can enhance records mid-stream using external queries over historical data for real-time enrichment, such as turning social security numbers into names and addresses (a small enrichment sketch appears after these notes). Hadoop queries the past – running processing periodically over stored data – with batch jobs on a massive scale. Relational Streaming queries the future – running processing continuously over real-time streaming data – also on a massive scale, but with finer-grained parallelism.
  • Relational Streaming requires streaming data records. These records are continuously streamed out to capture some kind of event or transaction, and often carry a timestamp or other similar key. We call this type of data S3 Data – for Sensor, System and Service data. The Sensor Data category includes vehicular sensor data, GPS data, transportation data, engine data, machine-to-machine networks, smart energy, manufacturing sensors and so on. The System Data category – some call this Machine Data – covers log files emitted from applications and servers, and can be used for real-time security, fraud and compliance detection, as well as cloud computing monitoring and service-level monitoring, to name a few. Finally, the Service Data category includes all manner of service data, ranging from SMS text messages and Call Detail Records for telecom billing and fraud, to real-time pricing and promotion for eCommerce, and the Active Internet – context-dependent content pushed based on the Internet interactions and activities of, say, your buddies within your social network.
  • Let's look at the broader data management space and see how Relational Streaming fits in with the other quadrants. Relational Streaming is a high-level declarative paradigm. In data warehousing, SQL is used as a declarative language to periodically query the past. Relational Streaming can use SQL to query the future, facing into the incoming record streams with continuous queries – it is the mirror image and perfect complement of the DBMS. It is also the declarative mirror image of the procedural paradigm of messaging middleware. In messaging middleware, you subscribe to topics of messages and process the records with procedural logic in Java or C++ before republishing the records to other message streams. In Relational Streaming, you subscribe to relational queries, where the application logic is represented declaratively (a small comparison sketch appears after these notes). Both are pub-sub models, but the Relational Streaming form is massively parallel and much less expensive to build and maintain – a high-level declarative paradigm. Incidentally, Hadoop provides procedural operations over batch processing with a massively scalable architecture, and belongs in the quadrant that is the mirror image of both databases and messaging middleware.
  • Any questions?
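
A minimal Python sketch (not from the deck; the data and function names are invented) of the "new tuples from old" point in the scaling-comparison note above: because each partition only derives fresh tuples and never updates shared state, partitions can be processed on separate servers and their outputs simply unioned.

    # Sketch: deriving new tuples from old, with no shared mutable state.
    # Each partition could live on a different server; nothing is updated in place.

    from itertools import chain

    def derive(partition):
        """Produce new tuples from old ones (here: tag each score with a band)."""
        return [(player, points, "high" if points >= 100 else "low")
                for player, points in partition]

    # Two partitions of the same logical "fat" collection, e.g. held on two servers.
    partition_a = [("alice", 120), ("bob", 80)]
    partition_b = [("carol", 150), ("dave", 40)]

    # No coordination is needed: the result is just the union of per-partition results.
    results = list(chain(derive(partition_a), derive(partition_b)))
    print(results)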
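
A minimal sketch of the dataflow style described in the DAG note above, using Python generators as stand-in relational operators (illustrative only; SQLstream expresses the same thing declaratively in SQL): each operator passes tuples downstream as soon as they arrive, so answers flow out without waiting for a phase to finish.

    # Sketch: a tiny dataflow pipeline of chained "operators" over an unbounded stream.
    # Each stage pulls tuples from upstream and pushes results downstream immediately,
    # rather than waiting for a whole batch to complete.

    import itertools, random, time

    def source():
        """Unbounded stream of (timestamp, url, errors_last_minute) tuples."""
        for i in itertools.count():
            yield (time.time(), f"/page/{i % 3}", random.randint(0, 10))

    def filter_errors(stream, threshold):
        """Relational 'WHERE': pass through tuples whose error count exceeds a threshold."""
        return (t for t in stream if t[2] > threshold)

    def project(stream):
        """Relational 'SELECT': keep only the columns we care about."""
        return ((url, errors) for _ts, url, errors in stream)

    pipeline = project(filter_errors(source(), threshold=7))

    # Results stream out as soon as they are produced; here we just take the first five.
    for row in itertools.islice(pipeline, 5):
        print(row)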
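
A hedged sketch of the back-testing idea in the data-mining note above; the predicate, records and confidence figure are placeholders. The same rule that was validated against historical data is applied unchanged to a live stream to raise predictive alerts.

    # Sketch: the same rule is back-tested on history, then run against live records.

    def rule(record):
        """Placeholder predicate: conditions A, B and C expressed over one record."""
        return record["a"] and record["b"] and record["c"]

    def back_test(history):
        """Estimate how often the rule's firing coincided with outcome D."""
        fired = [r for r in history if rule(r)]
        if not fired:
            return 0.0
        return sum(r["d"] for r in fired) / len(fired)

    def monitor(live_stream, confidence):
        """Apply the identical rule to live records and emit predictive alerts."""
        for record in live_stream:
            if rule(record):
                yield {"alert": "D expected", "confidence": confidence, "record": record}

    history = [{"a": True, "b": True, "c": True, "d": True},
               {"a": True, "b": False, "c": True, "d": False}]
    confidence = back_test(history)          # e.g. 1.0 on this toy history
    live = iter([{"a": True, "b": True, "c": True, "d": None}])
    print(confidence, list(monitor(live, confidence)))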
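
A small illustrative contrast for the ETL note above (nothing here is SQLstream's actual API): the staged version produces nothing until every stage has run over the whole batch, while the continuous version emits each cooked record as soon as it has passed through the same stages.

    # Sketch: staged ETL versus continuous (pipelined) ETL over the same records.

    def clean(r):      return {**r, "value": max(r["value"], 0)}
    def aggregate(r):  return {**r, "doubled": r["value"] * 2}

    def staged_etl(records):
        """Each stage must finish over the full batch before the next stage starts."""
        staged     = list(records)                    # collect into a staging area
        cleaned    = [clean(r) for r in staged]       # then clean everything
        aggregated = [aggregate(r) for r in cleaned]  # then aggregate everything
        return aggregated                             # results only available at the end

    def continuous_etl(records):
        """Each record flows through all stages immediately; results stream out."""
        for r in records:
            yield aggregate(clean(r))

    records = [{"id": i, "value": i - 1} for i in range(3)]
    print(staged_etl(records))                  # everything arrives at once, at the end
    print(next(continuous_etl(iter(records))))  # first cooked record is available immediately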
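
A hedged sketch of the mid-stream enrichment mentioned in the Hadoop-overlay note above; the lookup table and field names are invented for illustration. External or historical reference data is consulted as each record flows past.

    # Sketch: enriching a stream of records mid-flight from an external lookup.

    # Hypothetical reference data that would normally live in a database or HDFS.
    CUSTOMER_LOOKUP = {
        "123-45-6789": {"name": "A. Example", "address": "1 Main St"},
    }

    def enrich(stream, lookup):
        """Replace an identifier with descriptive fields as each record flows past."""
        for record in stream:
            details = lookup.get(record["ssn"], {"name": "unknown", "address": "unknown"})
            yield {**{k: v for k, v in record.items() if k != "ssn"}, **details}

    incoming = iter([{"ssn": "123-45-6789", "amount": 42.0}])
    for enriched in enrich(incoming, CUSTOMER_LOOKUP):
        print(enriched)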
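
A rough sketch of the contrast drawn in the quadrant note above between procedural messaging middleware and declarative streaming subscriptions (the query text is illustrative, modeled on the style of slide 15, and not an exact SQLstream statement).

    # Sketch: procedural message handling versus a declarative continuous query.

    # Procedural middleware style: subscribe to a topic, then hand-code the logic
    # for every message in application code, and republish results yourself.
    def on_message(record, publish):
        if record["errorsLastMinute"] > 10:       # hand-rolled filtering logic
            publish("alerts", record)             # hand-rolled republishing

    # Declarative streaming style: the same intent stated once as a continuous
    # query; the stream processor plans, optimizes and parallelizes it.
    # (Illustrative query text only, not a verified SQLstream statement.)
    CONTINUOUS_QUERY = """
    SELECT STREAM ROWTIME, url, "errorsLastMinute"
    FROM "ErrorsPerMinute"
    WHERE "errorsLastMinute" > 10
    """

    on_message({"url": "/checkout", "errorsLastMinute": 12},
               publish=lambda topic, rec: print(topic, rec))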

Transcript

  • 1. Relational Streaming: Massively Parallel Processing using Real-Time Hadoop Overlay
    Damian Black, CEO, SQLstream
    Real-time Big Data through Relational Streaming
  • 2. Real-time Big Data through Relational Streaming
    First, what is a Streaming Big Data Platform?
    - Stream any data in, immediately stream out real-time answers.
    - Continuously analyze and process massive data volumes.
    - React in real-time to each and every new record.
    And what then is Relational Streaming?
    - A Streaming Big Data paradigm for processing tuples.
    - Familiar relational expressions with automatic optimization.
    - Relational queries executed continuously on a massively parallel scale.
  • 3. Comparison of Techniques for Big Data Scaling
    Storing tuples: Databases and Warehouses appear as a single Fat Table; Hadoop and HDFS appear as a single Fat File; Relational Streaming appears as a single Fat Stream.
    Processing tuples: in databases and warehouses, tuples are updated; in Hadoop and in Relational Streaming, old tuples are left unchanged.
    Scaling across servers: databases are limited, due to shared state propagation; Hadoop and Relational Streaming are ~unlimited, due to no shared state.
  • 4. Parallel Processing done Hadoop style
    Historical queries: finite tuple sets are mapped into finite tuple sets.
    Independent chunks: data needs to be broken into independent chunks.
    Procedural, phased: a procedural, step-wise process is used.
    For example, great for sorting many years' gaming scores under different keys.
  • 5. Parallel Processing with Relational Streaming
    Continuous queries (not historical queries): infinite tuple streams are mapped to infinite tuple streams, not finite sets to finite sets.
    Ordered streams (not independent chunks): data are processed in the context of streams, with no need to break the data into independent chunks.
    Declarative, parallel (not procedural, phased): declarative, fine-grained parallel processing instead of a procedural, step-wise process.
    For example, great for giving the real-time leaderboard over a rolling minute.
  • 6. The key to massive parallelism is Dataflow Execution
    Data transformation: Hadoop batching turns old sets into new sets, file by file; Relational Streaming turns old streams into new streams, tuple by tuple.
    Data distribution: hash partitioning in both.
    Language level: procedural, hand-crafted for performance versus declarative, automatically optimized.
    Change lifecycle: static "edit-build-run" versus dynamic "change-on-the-fly".
    Parallelism: coarse-grained and superscalar versus fine-grained, both pipelining and superscalar.
    To express dataflow you need a dataflow language. SQL is the only credible, widely-used dataflow language.
  • 7. Tuple Processing: Hadoop versus Relational Streaming
    Hadoop style: data chunking, coarse-grained dataflow.
    Relational Streaming: DAGs of fine-grained dataflow.
  • 8. Exploding S3 Data (Service x System x Sensor)
    Massive volumes of streaming data – Service, System, Sensors – growing exponentially.
  • 9. Hadoop is massively scalable – but far from real-time
    Operating with stale data and static results: steering through the "rear view mirror".
    Massive, exponentially growing volumes of streaming data (Service, System, Sensors) feed Hadoop/HBase: historical reports, batch processing, stale data.
    Data lag due to periodic data cleaning and aggregation delays.
  • 10. Analyze Data in Flight
    Data mining is used for predictive analysis; the service responds to real-time analytics.
    Massive, exponentially growing volumes of streaming data (Service, System, Sensors) flow through SQLstream s-Server into Hadoop/HBase: historical reports, batch processing, real-time data.
    Hadoop HBase is now kept continuously up to date, with real-time alerts and visibility and continuously streaming results, instead of a data lag due to periodic cleaning and aggregation delays.
  • 11. Parallel Pipelined Execution
    Hadoop Map Reduce: Collect » Process » Clean » Aggregate » Analyze » Deliver.
    Relational Streaming approach:
    » Continuous execution: pipelined Clean, Aggregate, Analyze, Query
    » Streams out answers immediately
    » Intelligently populates the Hadoop data store…
    » And it is massively scalable and real-time!
  • 12. Application Example – Mozilla Glow
    » Mozilla Firefox 4 – Real-time Download Monitor
    » Continuous processing of download requests
    » Real-time integration with Hadoop and HBase
  • 13. Application Example – MMO Multiplayer Scoring
    » Many MMO servers streaming game action in real-time.
    » Streaming analytics maintained over varying time windows.
    » Aggregated and continuously sorted: streaming "order by".
    (Diagram: many game servers, each emitting a stream into the platform.)
  • 14. Streaming SQL – MMO Multiplayer Scoring
    CREATE OR REPLACE PUMP "SONG_SCORE_PUMP" STOPPED AS
    INSERT INTO "S_SONG_SCORE" ("songId", "SCORE")
    SELECT STREAM "SONG_ID" AS "songId",
           SUM("POINTS") OVER "LAST_WEEK"
         + ((SUM("POINTS") OVER "LAST_2_WEEKS" - SUM("POINTS") OVER "LAST_WEEK") * 0.5)
         + ((SUM("POINTS") OVER "LAST_3_WEEKS" - SUM("POINTS") OVER "LAST_2_WEEKS") * 0.25)
         + ((SUM("POINTS") OVER "LAST_4_WEEKS" - SUM("POINTS") OVER "LAST_3_WEEKS") * 0.125) AS "SCORE"
    FROM "S_SONG_SCORE_CHANGE"
    WINDOW "LAST_WEEK"    AS (PARTITION BY "SONG_ID" RANGE INTERVAL 7 DAY PRECEDING),
           "LAST_2_WEEKS" AS (PARTITION BY "SONG_ID" RANGE INTERVAL 14 DAY PRECEDING),
           "LAST_3_WEEKS" AS (PARTITION BY "SONG_ID" RANGE INTERVAL 21 DAY PRECEDING),
           "LAST_4_WEEKS" AS (PARTITION BY "SONG_ID" RANGE INTERVAL 28 DAY PRECEDING);
    » Millions of score records per second
    » Real-time Game Scoring
    » Amazon EC2
    (A worked example of this decayed score appears after the transcript.)
  • 15. Streaming SQL – Change in Rate of Web Errors
    SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
    FROM (
      SELECT STREAM ROWTIME, url, "numErrorsLastMinute",
             AVG("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING)
               AS "avgErrorsPerMinute",
             STDDEV("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING)
               AS "stdDevErrorsPerMinute"
      FROM "HttpRequestsPerMinute") AS S
    WHERE S."numErrorsLastMinute" > S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";
    » Millions of records per second
    » Real-time Bollinger Bands
    » Amazon EC2
    (A sketch of this two-standard-deviation test appears after the transcript.)
  • 16. Relational Streaming overlaying Hadoop
    » Relational Stream Processors
    » Co-located with Hadoop servers to stream/re-stream local data
    » Combination performs real-time and historical processing:
    » Querying the future – Continuous ETL and Analytics (parallel pipelines)
    » Querying the past – Hadoop batch jobs on stored tuples (parallel batches)
    » Re-streaming and re-querying (for example, scenario & sensitivity analyses)
    (Diagram: a combined Hadoop & Relational Streaming server, with relational operators – Select, Project, Join, Agg, Order, Group – alongside Hadoop stages – Split, Map, Combine, Sort, Reduce.)
  • 17. Use Cases for S3 Data
    » Sensor Data: GPS, speed, bearing, power, temp, pressure, light, vibration, medical
    » System Data: log files from systems, devices and terminals
    » Service Data: log files and service access records from apps, web, networks
    » Applications:
      Active Internet for advertising, social networking, eCommerce (context-dependent pricing, promotion and pushed content)
      Predictive analytics for service, cloud and app monitoring
      Real-time data integration & analysis for billing, fraud, compliance
  • 18. Relational Streaming: a new data management quadrant
    High-level, declarative language & operation: Data Warehouses (historical analysis, periodic batches) and Relational Streaming (continuous analysis, real-time processing).
    Low-level, procedural language & operation: Hadoop Big Data and Messaging Middleware.
  • 19. Relational Streaming Solves Three Big Problems
    Real-time Analysis: process, analyze, and react – all in real-time.
    Parallel Processing: parallel processing made easy, auto-optimized, massive scale.
    Data Integration: continuous, real-time data integration:
    • Give each app the view of data and format it needs
    • Share all your data in real-time with all your apps
    • Perform Continuous ETL and Data Integration
    Relational Streaming – the Next Wave of Big Data.
  • 20. SQLstream: Query the Future®. The Future of Query. Thanks! Any questions?
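
A worked example of the decayed score computed by the pump on slide 14, with made-up weekly sums. Because the four windows are cumulative (7, 14, 21 and 28 days), the differences isolate each successive week, so points are weighted 1, 0.5, 0.25 and 0.125 as they age.

    # Worked example of the score from slide 14, spelled out in Python.
    # The windows are cumulative, so the differences isolate the points
    # earned in each successive week.

    points_last_week    = 100   # made-up weekly sums, per song
    points_last_2_weeks = 160   # cumulative over 14 days
    points_last_3_weeks = 200   # cumulative over 21 days
    points_last_4_weeks = 220   # cumulative over 28 days

    score = (points_last_week
             + (points_last_2_weeks - points_last_week) * 0.5
             + (points_last_3_weeks - points_last_2_weeks) * 0.25
             + (points_last_4_weeks - points_last_3_weeks) * 0.125)

    print(score)  # 100 + 0.5*60 + 0.25*40 + 0.125*20 = 142.5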
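
A small sketch of the test behind the query on slide 15, with invented per-minute counts: a minute is flagged when its error count exceeds the recent mean by more than two standard deviations – the "real-time Bollinger Bands" idea.

    # Sketch of the two-standard-deviation test from slide 15, over a rolling window.

    from statistics import mean, stdev

    errors_per_minute = [3, 4, 2, 5, 3, 4, 15]   # invented counts; the last minute spikes

    window  = errors_per_minute[:-1]             # the preceding minutes
    current = errors_per_minute[-1]

    avg = mean(window)
    sd  = stdev(window)

    if current > avg + 2 * sd:
        print(f"alert: {current} errors vs mean {avg:.1f} (stddev {sd:.1f})")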