SQLstream Structure 2012: Back to the future - dataflow comes of age

Dataflow is a technique for parallel computing that emerged from research in the 1970s. It is based on graph-based execution models in which data flows along the arcs of a graph and is processed at the nodes. It was decades ahead of its time, arriving in an era when hardware was expensive and the mainstream had no real-world need for massively parallel, low-latency computing architectures. However, dataflow as an architecture has now found its place and time, with the emergence of Big Data volumes, real-time low-latency requirements, commodity hardware and low-cost storage. Dataflow is driving the architectures of today's real-time Big Data solutions.
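
As a rough illustration of the execution model described above, the following Python sketch pushes values along the arcs of a small graph and processes them at the nodes. The `Node` class, its method names and the example graph are hypothetical and chosen only for this illustration; they are not taken from SQLstream or from any historical dataflow machine.

```python
from typing import Callable, List


class Node:
    """A dataflow node: applies a function to each arriving value and forwards the result."""

    def __init__(self, func: Callable[[int], int]) -> None:
        self.func = func
        self.outputs: List["Node"] = []  # outgoing arcs
        self.seen: List[int] = []        # results produced at this node, kept for inspection

    def connect(self, downstream: "Node") -> "Node":
        """Add an arc from this node to `downstream` and return `downstream`."""
        self.outputs.append(downstream)
        return downstream

    def push(self, value: int) -> None:
        """Process one value and send the new result along every outgoing arc."""
        result = self.func(value)        # a new value is created; the input is not modified
        self.seen.append(result)
        for node in self.outputs:
            node.push(result)


# Hypothetical graph: source -> double -> (add_one, square)
source = Node(lambda x: x)
double = source.connect(Node(lambda x: x * 2))
add_one = double.connect(Node(lambda x: x + 1))
square = double.connect(Node(lambda x: x * x))

for value in [1, 2, 3]:
    source.push(value)

print(add_one.seen, square.seen)
```

Each node in this sketch only creates new values from the ones it receives, which is the property that lets independent branches of a real dataflow graph run in parallel. The sketch itself runs sequentially.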

Published in: Technology

  • A comparison of the techniques that databases, Hadoop HDFS and Relational Streaming use to achieve scalability. They all process records in the form of tuples. They all store the tuples over multiple servers in a clustered, networked environment. They all provide the illusion of a single collection of tuples, even though each stores the underlying tuples in different physical sub-collections over the various servers, with redundant copies in the case of Hadoop and Relational Streaming for resiliency. Databases provide the abstraction of a single fat table, physically stored as many tables across many servers. Hadoop provides the abstraction of a single fat file of tuples, stored as many physical files across many servers. Relational Streaming does the same with fat streams. The key difference is in the way the systems handle changes to the data through data processing. Databases allow updates in situ (transactions), with all of the headaches that entails, which effectively limits their scalability to perhaps many tens of servers. Hadoop and Relational Streaming always create new tuples from old, a technique that eliminates any shared state and allows scaling to thousands of servers. Relational Streaming differs from Hadoop in that it can start streaming out new answer tuples as soon as it starts reading the first input tuples in a massively parallel declarative execution graph, whereas Hadoop has to wait for the entire collection to be processed.
  • You can visualize the two styles above. First is Hadoop, with its phased approach and a lot of superscalar execution over 64 MB chunks of data. When each phase is complete, and all the records from the chunks feeding upstream have been assembled and sorted, the next phase can commence. Next is the Relational Streaming model. What you have is a Directed Acyclic Graph of relational operators operating on streams of tuples synchronized (normally) around timestamps. This is very similar to an electronic logic circuit. The relational operators are the logic gates. The tuples are the binary data signals. Both propagate answers as soon as the results are available, with minimal latency. Both use dataflow execution and time synchronization where appropriate, and both offer pipelining parallelism and superscalar parallelism.
  • Relational Streaming is a great complement to Hadoop, providing a lot of synergy. In your Hadoop cluster, you should install a Relational Stream Processor on each Hadoop node. The Relational Streaming system can then perform Continuous ETL, Data Cleaning and Aggregation on the incoming data streams and can populate HDFS in a massively parallel streaming architecture. Hadoop can be used to perform massively parallel batch or historical processing on the data, such as sorting and ranking all of the data. The data can then be re-streamed through the Relational Streaming nodes as part of a Map-Reduce job where stream processing is required (such as time-series analytics, where records are processed in the context of other records). Also, individual streaming operations can enhance records mid-stream using external queries over historical data for real-time enhancements, such as changing social security numbers into names and addresses. Hadoop queries the past – running processing periodically over stored data – with batch jobs on a massive scale. Relational Streaming queries the future – running processing continuously over real-time streaming data – also on a massive scale but with finer-grained parallelism.
  • We are one of only two closed-source solutions within Mozilla. We power their real-time analytics. See our “Powered by SQLstream” logo on the bottom right of their web display. Search YouTube for “Mozilla Glow”. SQLstream processes all of the log files from the Mozilla download servers in real-time, parsing the files, streaming the data, mapping IP addresses to longitude and latitude, finding the nearest town, city or village and performing a range of analytics on the streams to feed a Hadoop cluster (HBase) for displaying historical information, complemented by SQLstream’s real-time analytics.
  • Let’s look at the broader Data Management space and see how Relational Streaming fits in with the other quadrants. Relational Streaming is a high-level declarative paradigm. In data warehousing, SQL is used as a declarative language to periodically query the past. Relational Streaming can use SQL to query the future, facing into the incoming record streams with continuous queries. It is the mirror image of, and perfect complement to, the DBMS. It is also the declarative mirror image of the procedural paradigm of messaging middleware. In messaging middleware, you subscribe to topics of messages and process the records with procedural logic in Java or C++ before republishing the records to other message streams. In Relational Streaming, you subscribe to relational queries in which the application logic is represented declaratively. Both are Pub-Sub models. The Relational Streaming form is massively parallel and, as a high-level declarative paradigm, much less expensive to build and maintain. Incidentally, Hadoop provides procedural operations over batch processing with a massively scalable architecture, and belongs in the quadrant that mirrors both databases and messaging middleware.
  • Relational Streaming is the next Big Data frontier because it supports some important new capabilities vital for emerging Big Data applications. First, Streaming Views of Data – any view of any data, in real-time and running all the time. Next, Real-time Reaction – allowing you to harness massive volumes of S3 data and to react and adapt in real-time. Finally, Massively Parallel execution – allowing you to exploit fine-grained parallelism on a massive scale across multiple servers and cores, using pipelining and superscalar operations. We all know well how to process historical data; we are all already doing that. Perhaps it is now time for us all to start querying Future Data? Thanks for your time and attention.
  • Any questions?
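
The contrast the notes draw between phased batch processing and streaming can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(timestamp, key, amount)` and the function names `batch_totals` and `streaming_totals` are hypothetical, and the code runs on a single thread rather than across a cluster.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (timestamp, key, amount)
Record = Tuple[int, str, int]


def batch_totals(records: List[Record]) -> Dict[str, int]:
    """Batch style: the answer is available only after the whole collection is read."""
    totals: Dict[str, int] = {}
    for _, key, amount in records:
        totals[key] = totals.get(key, 0) + amount
    return totals


def streaming_totals(records: Iterable[Record]) -> Iterator[Record]:
    """Streaming style: emit a new answer tuple for every input tuple as it arrives."""
    totals: Dict[str, int] = {}
    for timestamp, key, amount in records:
        totals[key] = totals.get(key, 0) + amount
        # A new tuple is created; the input tuple is left unaltered.
        yield (timestamp, key, totals[key])


records: List[Record] = [(1, "a", 10), (2, "b", 5), (3, "a", 7)]

stream = streaming_totals(iter(records))
first_answer = next(stream)       # available after reading a single input tuple
remaining_answers = list(stream)  # the rest follow as the input is consumed
final_totals = batch_totals(records)

print(first_answer, remaining_answers, final_totals)
```

Both functions build new tuples from old ones and never update the inputs in place. The difference is when answers become available: the generator yields its first result after one input, while the batch function returns only at the end.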

    1. Back to the Future: Dataflow Finally Comes of Age. Damian Black, CEO, SQLstream. Real-time Big Data with Relational Streaming Dataflow Technology. Copyright © 2012 – Proprietary and Confidential Information of SQLstream Inc.
    2. Brief History of Dataflow
       What is Dataflow?
         • Parallel processing model invented in the 70s
         • Graph-based execution, without destructive updates
         • Data flow along arcs to nodes, are combined, and flow along output arcs
       What happened to Dataflow?
         • A number of experimental parallel computers designed and built
         • Transputer and Occam were literally decades ahead of their time
         • Due for a resurgence thanks to inexpensive multi-core servers & SQL
       What is Relational Streaming?
         • A dataflow paradigm for processing Streaming Big Data tuples
         • Familiar relational expressions with automatic optimization
         • Relational queries executed continuously on a massively parallel scale
    3. Dataflow Graph: Pipelined and Superscalar Processing. Relational Streaming: DAGs of fine-grained dataflow.
    4. Comparison of Techniques for Dataflow Scaling
                              Hadoop and HDFS                 Relational Streaming
       Data Distribution      Fat File                        Fat Stream
       Dataflow Enablement    Generate new tuples from old,   Generate new tuples from old,
                              leaving old tuples unaltered    leaving old tuples unaltered
    5. Dataflow: Hadoop versus Relational Streaming. Hadoop style: data chunking, coarse-grained dataflow. Relational Streaming: DAGs of fine-grained dataflow.
    6. Parallel Dataflow Execution
       [Diagram labels: Collect, Process, Clean, Aggregate, Analyze, Deliver]
       » Hadoop Map Reduce
       Relational Streaming Approach:
       » Continuous Parallel Dataflow Execution
       » Real-time Answers Immediately
       » Intelligently populate data store: Hadoop or Data Warehouse
       Low Latency
    7. Relational Streaming synergies with Hadoop
       » Relational Stream Processors co-located with Hadoop Servers
       » Stream/re-stream into and from local data stores in parallel
       » Combination performs Real-time and Historical processing:
         » Querying the future – Continuous ETL and Analytics (parallel pipelines)
         » Querying the past – Hadoop batch jobs on stored tuples (parallel batches)
       [Diagram: each “Hadoop & Relational Streaming Server” runs relational operators (Select, Project, Join, Agg, Order, Group) alongside Hadoop stages (Split, Map, Combine, Sort, Reduce)]
    8. Application Example – Google: “YouTube Mozilla Glow”
       » Mozilla Firefox 4 – Real-time Download Monitor
       » Continuous processing of download requests
       » Real-time integration with Hadoop and HBase
    9. Cloud Monitoring – Detecting Service Error Spikes
       SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
       FROM (
         SELECT STREAM ROWTIME, url, "numErrorsLastMinute",
           AVG("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS "avgErrorsPerMinute",
           STDDEV("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' MINUTE PRECEDING) AS "stdDevErrorsPerMinute"
         FROM "ServiceRequestsPerMinute") AS S
       WHERE S."numErrorsLastMinute" > S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";
       » Millions of records per second
       » Real-time Bollinger Bands
       » Amazon EC2
       [Diagram: streams fanning out across multiple servers]
    10. A New Streaming Data Management Quadrant
        Vertical axis: High-level Declarative Language & Operation (top) to Low-level Procedural Language & Operation (bottom)
        Horizontal axis: Batched Big Data (left) to Real-time Big Data (right)
        • Data Warehouses (declarative, batched): Historical analysis, Periodic batches
        • Relational Streaming (declarative, real-time): Continuous analysis, Real-time processing
        • Hadoop Big Data (procedural, batched)
        • Messaging Middleware (procedural, real-time)
    11. Benefits of Real-time “Big Dataflow” with Relational Streaming
        1. Real-time Integration: Continuous, real-time data integration
        2. Real-time Analysis: Process, analyze, and react – all in real-time
        3. RT Parallel Processing: Made easy, auto-optimized, massive scale
        Dataflow finally comes of age. Relational Streaming. The Next Wave of Big Data.
        Confidential and Trade Secret SQLstream Inc. © 2012
    12. Query the Future® The Future of Query. Thanks! Any questions?
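
The error-spike query on the Cloud Monitoring slide can be approximated in plain Python to show what the windowed comparison computes. This is a sketch under assumptions, not SQLstream's implementation: row times are taken to be seconds, the window is assumed to include the current row, and the population standard deviation (`statistics.pstdev`) is used, which may differ from the `STDDEV` of the SQL dialect. The function name `error_spikes` and the sample rows are hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev
from typing import Deque, Dict, Iterable, Iterator, Tuple

# Hypothetical row layout: (rowtime_seconds, url, num_errors_last_minute)
Row = Tuple[float, str, int]


def error_spikes(
    rows: Iterable[Row],
    window_seconds: float = 60.0,
    num_std: float = 2.0,
) -> Iterator[Row]:
    """Yield rows whose error count exceeds avg + num_std * stddev over the
    preceding window for the same url (the window includes the current row)."""
    windows: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)
    for rowtime, url, errors in rows:
        window = windows[url]                      # PARTITION BY url
        window.append((rowtime, errors))
        # Evict rows older than the window (RANGE INTERVAL '1' MINUTE PRECEDING).
        while rowtime - window[0][0] > window_seconds:
            window.popleft()
        values = [count for _, count in window]
        if errors > mean(values) + num_std * pstdev(values):
            yield (rowtime, url, errors)


# Made-up sample data for illustration only.
sample_rows = [
    (0, "/a", 1), (10, "/a", 1), (20, "/a", 1),
    (30, "/a", 1), (40, "/a", 1), (50, "/a", 50),
    (55, "/b", 3),
]
spikes = list(error_spikes(sample_rows))
print(spikes)
```

Because the window includes the current row, a value can only exceed the mean by more than two standard deviations once the window holds several rows. Like the continuous query, the generator emits each flagged row as soon as it is seen, without waiting for the input to end.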
