Storm: a distributed, fault-tolerant, real-time computation

Usage Rights

© All Rights Reserved

  • Real-time streaming computation has applications in machine learning, data analytics, and systems integration.
  • Hadoop uses batch processing. Before Storm, building real-time systems meant wiring together queues and workers by hand, which was: 1. Tedious — you had to decide where to deploy workers, where to send messages, and how to deploy queues. 2. Brittle — there was no fault tolerance. 3. Hard to scale — for high throughput you must partition the data and manage how it moves around; when a partition fails, you have to reconfigure the other workers.
  • 1. Real-time: Storm can process messages and update databases continuously, for example by continuously querying a database and streaming the results to clients. 2. Fault-tolerant: if faults occur during a computation, Storm reassigns tasks, ensuring the computation can run forever. 3. Extremely robust: Storm clusters are easier to manage than Hadoop clusters, which makes for a painless user experience. 4. Scalable: Storm handles massive numbers of messages per second; to scale, you just add machines and increase the parallelism settings of the topology.
  • 1. Where Hadoop has MapReduce jobs, Storm has topologies. A MapReduce job eventually finishes, but a Storm topology processes messages forever until you kill it. 2. Nimbus is a daemon similar to the master node's JobTracker in Hadoop: it distributes code around the cluster, assigns tasks, and monitors for failures. 3. Each worker node runs a daemon called the Supervisor, which starts and stops worker processes based on the work assigned to it. 4. Nimbus and the Supervisors are stateless; all state is kept in ZooKeeper or on local disk. You can kill Nimbus or a Supervisor and it will start back up as if nothing happened. This provides the cluster's stability.
  • Each node in a topology contains processing logic, and links between nodes indicate how data is passed around between them. Each task corresponds to one thread of execution, and the number of tasks can be less than or equal to the number of threads. Workers: topologies execute across one or more worker processes. Each worker process is a separate JVM and executes a subset of all the tasks of the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, each worker executes 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
  • A tuple contains a list of values. Storm provides primitives for transforming a stream into a new stream in a distributed and reliable way; for example, you might transform a stream of tweets into a stream of trending topics. Tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays, and you can define your own serializers so that custom types can be used natively within tuples. Every stream is given an id when it is declared.
  • The basic primitives Storm provides for stream transformations are "spouts" and "bolts". Spouts and bolts expose interfaces that you implement with your application-specific logic; for example, a spout might connect to the Twitter API and emit a stream of tweets, and spouts are easily integrated with new queuing systems. Spouts can be reliable or unreliable; reliable spouts implement ack and fail. Bolts: a complex stream transformation may require multiple bolts, and a bolt can emit multiple streams. A topology runs forever, or until you kill it. Storm automatically reassigns any failed tasks and guarantees there will be no data loss, even if machines go down and messages are dropped.
  • Part of defining a topology is specifying, for each bolt, which streams it should receive as input. Spouts and bolts execute as many tasks in parallel across the cluster. Shuffle grouping: tuples are randomly distributed across the bolt's tasks such that each task is guaranteed to receive an equal number of tuples. Fields grouping: the stream is partitioned by the fields specified in the grouping; for example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task. Global grouping: the entire stream goes to a single one of the bolt's tasks.
  • These methods take as input a user-specified id, an object containing the processing logic, and the amount of parallelism you want for the node. The last parameter, the parallelism, is optional; it indicates how many threads should execute that component across the cluster.
  • TestWordSpout in this topology emits a random word from the list ["nathan", "mike", "jackson", "golda", "bertels"] as a 1-tuple every 100 ms.
  • The prepare method provides the output collector used for emitting tuples. The execute method receives a tuple from one of the bolt's inputs and acknowledges it to prevent data loss. The cleanup method is called when a bolt is shut down and should release any resources that were opened. The declareOutputFields method declares that the ExclamationBolt emits 1-tuples with one field called "word". The getComponentConfiguration method lets you configure various aspects of how the component runs.
  • Before discussing algorithms, consider the constraints under which we work when dealing with streams. First, streams often deliver elements very rapidly; we must process elements in real time, or we lose the opportunity to process them at all, short of going to archival storage. It is therefore often important that a stream-processing algorithm execute in main memory, without access to secondary storage or with only rare accesses to it. Moreover, even when streams are "slow", as in the sensor-data example of Section 4.1.2, there may be many such streams; even if each stream by itself can be processed using a small amount of main memory, the requirements of all the streams together can easily exceed the available main memory. Thus, many problems about streaming data would be easy to solve with enough memory, but become rather hard and require the invention of new techniques in order to execute at a realistic rate on a machine of realistic size. Two generalizations about stream algorithms are worth bearing in mind: (1) it is often much more efficient to get an approximate answer to our problem than an exact solution; (2) as in Chapter 3, a variety of techniques related to hashing turn out to be useful. Generally, these techniques introduce useful randomness into the algorithm's behavior, in order to produce an approximate answer that is very close to the true result.
  • For query languages, we can use StreamSQL.
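The worker/task arithmetic in the notes above (300 tasks over 50 workers gives 6 tasks per worker) can be sketched with a simple round-robin assignment. This is only an illustrative stand-in for Storm's actual scheduler; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: spread numTasks evenly across numWorkers, round-robin.
// Storm's real scheduler is more sophisticated; this only shows the arithmetic.
public class TaskSpreadSketch {
    // Returns, for each worker, the list of task ids it would execute.
    public static List<List<Integer>> assign(int numTasks, int numWorkers) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) workers.add(new ArrayList<>());
        for (int t = 0; t < numTasks; t++) {
            workers.get(t % numWorkers).add(t); // round-robin placement
        }
        return workers;
    }

    public static void main(String[] args) {
        List<List<Integer>> workers = assign(300, 50);
        System.out.println("tasks on worker 0: " + workers.get(0).size()); // 6
    }
}
```

With 300 tasks and 50 workers, every worker ends up with exactly 6 tasks, matching the example in the notes.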
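The fields-grouping behavior described in the notes (tuples with the same "user-id" always reach the same task) can be modeled by hashing the grouping field modulo the number of tasks. This is a minimal sketch of the idea, not Storm's actual grouping implementation; the class name is hypothetical.

```java
import java.util.List;

// Sketch of a fields grouping: hash the grouping field and take it modulo
// the number of bolt tasks, so equal field values always map to one task.
public class FieldsGroupingSketch {
    // Pick the target task index for a tuple keyed by fieldValue.
    public static int taskFor(String fieldValue, int numTasks) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        List<String> userIds = List.of("user-1", "user-2", "user-1", "user-3");
        for (String id : userIds) {
            System.out.println(id + " -> task " + taskFor(id, 4));
        }
    }
}
```

Because the mapping is a pure function of the field value, repeated tuples for "user-1" are guaranteed to land on the same task, which is exactly the property fields grouping provides.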
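The ExclamationBolt described in the notes, chained twice as in the sample topology (words → exclaim1 → exclaim2), can be simulated without Storm. The tiny Bolt interface below only mimics the shape of the execute/emit flow; it is not the real IRichBolt API, and the names are hypothetical.

```java
import java.util.function.Consumer;

// Self-contained simulation of the ExclamationBolt idea: each bolt appends
// "!!!" to the incoming word, and chaining two of them turns "mike" into
// "mike!!!!!!". This mimics, but is not, Storm's IRichBolt interface.
public class ExclamationSketch {
    interface Bolt {
        void execute(String tuple, Consumer<String> collector);
    }

    static class ExclamationBolt implements Bolt {
        @Override
        public void execute(String tuple, Consumer<String> collector) {
            collector.accept(tuple + "!!!"); // emit the transformed tuple
        }
    }

    // Run a word through exclaim1 then exclaim2, as in the sample topology.
    public static String run(String word) {
        final String[] out = new String[1];
        Bolt exclaim1 = new ExclamationBolt();
        Bolt exclaim2 = new ExclamationBolt();
        exclaim1.execute(word, mid -> exclaim2.execute(mid, result -> out[0] = result));
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(run("mike")); // mike!!!!!!
    }
}
```

The nested collectors stand in for the stream links between the two bolts: exclaim1's output stream is exclaim2's input stream.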

Presentation Transcript

  • STORM: Distributed and Fault-Tolerant Real-Time Computation. By: Nitin Guleria, nitin.guleria@mail.utoronto.ca
  • Rationale: Hadoop scales, but offers no real-time data processing; batch processing means stale data. Before Storm: message queues and workers — 1. Tedious. 2. Brittle. 3. Hard to scale.
  • Why Storm: Real-time. Fault-tolerant. Extremely robust. Scalable (processed 1,000,000 messages per second on a 10-node cluster).
  • Storm Cluster: coordinates everything.
  • Key Concepts: Topology, Tasks, Tuple, Stream, Spout, Bolt. A topology is a graph of computation; tasks are the processes that execute the spouts and bolts. (Diagram: a simple topology — spout, streams of tuples, bolts.)
  • Key Concepts: Tuples and Streams. Tuple: an ordered list of elements. Stream: an unbounded sequence of tuples.
  • Key Concepts: Spouts and Bolts. Spout: the source of a stream; deals with queues, web logs, API calls, event data. Bolts: process input streams and create new streams; apply functions/transforms such as filters, aggregations, and streaming joins; can produce multiple streams.
  • Key Concepts: Stream Groupings. Stream partitioning among the bolt's tasks.
  • A simple topology: words → exclaim1 → exclaim2, with shuffle groupings on both links; "mike" becomes "mike!!!" and then "mike!!!!!!".
  • Implementation of a Spout: the object implements the IRichSpout interface; the nextTuple() method is part of TestWordSpout.
  • Implementation of a Bolt: implements the IRichBolt interface. The prepare method saves the OutputCollector as a variable. The execute method receives a tuple and appends an exclamation. cleanup prevents resource leaks on bolt shutdown. declareOutputFields declares that the bolt emits a tuple with a field named "word".
  • Conclusion: Storm is a promising tool. It has a clean and elegant design, excellent documentation for a young open-source tool, and is a great replacement for Hadoop for real-time computation.
  • Thank You
  • Sources: "Storm: The Real-Time Layer", GlueCon 2012, Dan Lynn (dan@fullcontact.com); http://storm.incubator.apache.org/documentation/Tutorial.html, Nathan Marz; "Streams Processing with Storm", Mariusz Gil.
  • Questions: What are the major issues with real-time stream processing, and how can they be solved? Specify algorithms or techniques. Are there any query languages for real-time stream processing?
  • Answers: One strategy for dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data, using sampling and filtering to extract the relevant subset. A second approach is to maintain a sliding window of the most recently arrived data. For querying: StreamSQL.
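The first answer above (summaries via sampling) is commonly implemented with reservoir sampling, which keeps a fixed-size uniform sample of an unbounded stream in O(k) memory. A minimal sketch, with hypothetical names and a fixed seed for reproducibility:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling: maintain a uniform random sample of size k over a
// stream of unknown length, using memory proportional to k only.
public class ReservoirSketch {
    public static List<Integer> sample(Iterable<Integer> stream, int k, long seed) {
        List<Integer> reservoir = new ArrayList<>();
        Random rnd = new Random(seed);
        int seen = 0;
        for (int x : stream) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(x);               // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen);      // element survives with prob. k/seen
                if (j < k) reservoir.set(j, x);
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> stream = new ArrayList<>();
        for (int i = 0; i < 1000; i++) stream.add(i);
        System.out.println(sample(stream, 5, 42L)); // 5 elements, drawn uniformly
    }
}
```

Queries are then answered against the reservoir instead of the full stream, trading exactness for bounded memory, in line with the "approximate answer" generalization in the notes.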
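The second answer above (a sliding window of the most recent data) can be sketched with a bounded deque that evicts the oldest element on overflow. The class and method names are illustrative, not from any particular library:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window: retain only the N most recent elements and
// answer queries (here, a running sum) over that window alone.
public class SlidingWindowSketch {
    private final int capacity;
    private final Deque<Integer> window = new ArrayDeque<>();

    public SlidingWindowSketch(int capacity) { this.capacity = capacity; }

    public void add(int x) {
        if (window.size() == capacity) window.removeFirst(); // evict the oldest
        window.addLast(x);
    }

    public int size() { return window.size(); }

    public int sum() {
        int s = 0;
        for (int x : window) s += x;
        return s;
    }

    public static void main(String[] args) {
        SlidingWindowSketch w = new SlidingWindowSketch(3);
        for (int i = 1; i <= 5; i++) w.add(i); // window now holds 3, 4, 5
        System.out.println(w.sum()); // 12
    }
}
```

Real systems often use time-based rather than count-based windows, but the principle is the same: old data falls out of scope, so memory stays bounded no matter how long the stream runs.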