Real-Time Streaming with Apache Spark Streaming and Apache Storm

Real-Time Streaming with Apache Spark Streaming and Apache Storm. Description and comparison of both systems.

Published in: Software

  1. 1. Real-Time Streaming with Apache Spark Streaming and Apache Storm Spark Meetup, 27.04.2015. Zagreb, Davorin Vukelić
  2. 2. Agenda • Real-Time Streaming • Apache Storm • Apache Spark Streaming • Demo • Conclusion
  3. 3. Real-Time Streaming • Continuous processing, aggregation and analysis of data as it is created • Systems are modeled as a directed acyclic graph (DAG) • Data is reduced before being centralized • Messages are processed one at a time • Structured, semi-structured and unstructured data is collected from different sources • A directed acyclic graph is a graphical representation of a chain of tasks • It displays the order of task execution • It models data flowing through transformations
  4. 4. Real-Time Streaming • Actions in real time: • Monitoring trends • Communication • Recommendation • Searching • Expected response time is from 500 ms to one minute • Data messages flow through a chain of processors until the result reaches its final destination • To guarantee reliable data processing, processing must be restarted in case of failure • Delivery semantics (message guarantees): • At most once: messages may be lost but are never redelivered. • At least once: messages are never lost but may be redelivered. • Exactly once: messages are never lost and never redelivered (perfect message delivery).
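
The difference between at-least-once and exactly-once delivery can be illustrated without any streaming framework. The sketch below is plain Java; `DedupingCounter` and the message IDs are hypothetical names chosen for illustration and are not part of Storm or Spark. A source that redelivers after a failure gives at-least-once delivery; discarding already-seen IDs on the consumer side makes the result look as if every message had arrived exactly once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical consumer: counts messages and ignores redelivered IDs. */
final class DedupingCounter {
    private final Set<String> seenIds = new HashSet<>();
    private int rawDeliveries = 0;   // what an at-least-once consumer observes
    private int uniqueMessages = 0;  // what an exactly-once result looks like

    void onMessage(String messageId) {
        rawDeliveries++;
        // Set.add returns false when the ID is already present (a redelivery).
        if (seenIds.add(messageId)) {
            uniqueMessages++;
        }
    }

    int rawDeliveries() { return rawDeliveries; }

    int uniqueMessages() { return uniqueMessages; }

    /** Feeds a whole delivery sequence through a fresh counter. */
    static DedupingCounter consume(List<String> deliveries) {
        DedupingCounter counter = new DedupingCounter();
        for (String id : deliveries) {
            counter.onMessage(id);
        }
        return counter;
    }
}
```

Keeping every ID in memory is only workable for a sketch; a real system would bound this set or rely on transactional writes.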
  5. 5. Apache Storm • Distributed real-time computation system • The processing logic is written against the Storm API • It must be deployed to a Storm cluster to run continuously • A Storm cluster consists of several different daemons running on separate nodes (servers) • Event-stream processing: a one-at-a-time processing model
  6. 6. Storm • Main abstractions in Storm: • Spout • source node of the streams in a computation • implements the code that reads data from: o Twitter API, web crawlers, Facebook API o queueing brokers: Kafka, Kestrel, RabbitMQ • Bolt • node that implements the logic of a computation step • functions, filters, streaming joins, streaming aggregations, storing to and looking up from databases and filesystems • Topology • network of spouts and bolts • runs indefinitely once deployed • each edge in the network represents a bolt subscribing to the output of a spout or another bolt • results can go to a relational database, filesystem, NoSQL database, another topology or a queueing system • Tuple • data message model for communication between nodes in a topology • an immutable list of objects of any type
  7. 7. Storm Tuple • Storm needs to know how to serialize and deserialize objects when they are passed between tasks • It uses Kryo for serialization • Custom types need a registered custom serializer • By default Storm can serialize: • primitive types • strings • byte arrays • ArrayList • HashMap • HashSet • Clojure collection types
  8. 8. Storm Topology
       // Wire one spout and two bolts into a word-count topology.
       TopologyBuilder builder = new TopologyBuilder();
       builder.setSpout("reader", new WordReader(), 4);            // 4 executors
       builder.setBolt("normalizer", new WordNormalizer(), 2)      // 2 executors
              .shuffleGrouping("reader")
              .setNumTasks(2);
       builder.setBolt("counter", new WordCounter(), 2)
              .fieldsGrouping("normalizer", new Fields("word"));   // same word -> same task

       Config conf = new Config();
       conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 8);

       // LocalCluster runs the topology in-process, which is convenient for testing.
       LocalCluster cluster = new LocalCluster();
       cluster.submitTopology("Toplogie", conf, builder.createTopology());
  9. 9. Storm • Sharding (figure: Karthik Ramasamy, Udacity: Real-Time Analytics with Apache Storm)
  10. 10. Storm • Scaling (figure: Karthik Ramasamy, Udacity: Real-Time Analytics with Apache Storm)
  11. 11. Storm Grouping • Defines how data is exchanged between nodes in a topology • Used when a bolt has several parallel instances (tasks) • Shuffle • Each tuple emitted by the source goes to a randomly chosen bolt instance • Ensures that each bolt instance receives an equal share of the tuples • Fields • Routes tuples to bolt instances based on the values of selected fields in the tuple • Tuples with the same field value are always sent to the same bolt instance • Partial Key • Also routes tuples by selected fields in the tuple • The load for each key is balanced between two downstream bolt instances, which gives better resource utilization • Use when the incoming data is skewed • All • Replicates each tuple to all instances of the receiving bolt • Custom • Create your own stream grouping • Global • All instances of the source send tuples to a single target instance • Specifically, the task with the lowest ID
  12. 12. Storm Grouping • Local or shuffle grouping • If the target bolt has one or more instances (tasks) in the same worker process, tuples are shuffled to just those in-process tasks • Otherwise it acts like a normal shuffle grouping • Direct grouping • The producer of the tuple decides which instance of the consuming bolt receives it • The target task ID must be specified; task IDs can be obtained from the TopologyContext or from the return value of OutputCollector.emit
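
The contrast between fields grouping and shuffle grouping comes down to how the target task index is chosen. The following is a minimal plain-Java sketch of that idea; `GroupingSketch` and its methods are hypothetical and do not reproduce Storm's internal implementation (Storm's shuffle grouping, for instance, balances load more evenly than a plain random pick).

```java
import java.util.List;
import java.util.Random;

/** Hypothetical helper that picks a target task index for a tuple. */
final class GroupingSketch {
    private final Random random = new Random();

    /** Fields grouping: equal field values always map to the same task. */
    static int fieldsGrouping(List<Object> groupingFields, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingFields.hashCode(), numTasks);
    }

    /** Shuffle grouping: any task may be chosen, independent of the content. */
    int shuffleGrouping(int numTasks) {
        return random.nextInt(numTasks);
    }
}
```

Because the fields-grouping index depends only on the field values, a counter bolt can keep per-word state locally without coordinating with its sibling tasks.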
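  13. 13. Storm Spout
       // Package names are for Storm 1.0+; releases before 1.0 used backtype.storm.*
       import java.io.BufferedReader;
       import java.io.FileNotFoundException;
       import java.io.FileReader;
       import java.util.Map;
       import org.apache.storm.spout.SpoutOutputCollector;
       import org.apache.storm.task.TopologyContext;
       import org.apache.storm.topology.IRichSpout;
       import org.apache.storm.topology.OutputFieldsDeclarer;
       import org.apache.storm.tuple.Fields;
       import org.apache.storm.tuple.Values;

       public class WordReader implements IRichSpout {
           private SpoutOutputCollector collector;
           private FileReader fileReader;
           private boolean completed = false;
           private TopologyContext context;

           // Legacy method from early Storm releases; harmless if the interface no longer declares it.
           public boolean isDistributed() { return false; }

           public void ack(Object msgId) { System.out.println("OK:" + msgId); }
           public void fail(Object msgId) { System.out.println("FAIL:" + msgId); }
           public void close() { }

           // Emits every line of the file once; later calls return immediately.
           public void nextTuple() {
               if (completed) {
                   return;
               }
               String str;
               BufferedReader reader = new BufferedReader(fileReader);
               try {
                   while ((str = reader.readLine()) != null) {
                       // The line itself doubles as the message ID used by ack/fail.
                       this.collector.emit(new Values(str), str);
                   }
               } catch (Exception e) {
                   throw new RuntimeException("Error reading tuple", e);
               } finally {
                   completed = true;
               }
           }

           public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
               this.context = context;
               this.collector = collector;
               try {
                   this.fileReader = new FileReader(conf.get("wordsFile").toString());
               } catch (FileNotFoundException e) {
                   throw new RuntimeException("Error reading file [" + conf.get("wordsFile") + "]");
               }
           }

           public void declareOutputFields(OutputFieldsDeclarer declarer) {
               declarer.declare(new Fields("line"));
           }

           public void activate() { }
           public void deactivate() { }
           public Map<String, Object> getComponentConfiguration() { return null; }
       }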
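  14. 14. Storm Bolt
       // Package names are for Storm 1.0+; releases before 1.0 used backtype.storm.*
       import java.util.Map;
       import org.apache.storm.task.OutputCollector;
       import org.apache.storm.task.TopologyContext;
       import org.apache.storm.topology.IRichBolt;
       import org.apache.storm.topology.OutputFieldsDeclarer;
       import org.apache.storm.tuple.Fields;
       import org.apache.storm.tuple.Tuple;
       import org.apache.storm.tuple.Values;

       public class WordNormalizer implements IRichBolt {
           private OutputCollector collector;

           public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
               this.collector = collector;
           }

           // Splits a line into lower-cased, trimmed words and emits one tuple per word.
           public void execute(Tuple input) {
               String sentence = input.getString(0);
               String[] words = sentence.split(" ");
               for (String word : words) {
                   word = word.trim();
                   if (!word.isEmpty()) {
                       word = word.toLowerCase();
                       // Unanchored emit; collector.emit(input, new Values(word)) would anchor it.
                       collector.emit(new Values(word));
                   }
               }
               collector.ack(input);
           }

           public void declareOutputFields(OutputFieldsDeclarer declarer) {
               declarer.declare(new Fields("word"));
           }

           public void cleanup() { }
           public Map<String, Object> getComponentConfiguration() { return null; }
       }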
  15. 15. Storm demo
  16. 16. Storm demo
  17. 17. Storm demo
  18. 18. Storm - cluster • Master node: runs the Nimbus daemon • distributes code around the cluster • assigns tasks to each worker node • monitors for failures • Worker nodes: run the Supervisor daemon • each executes a portion of a topology • ZooKeeper • keeps the state of all nodes in the cluster • A supervisor's computing resources can be partitioned into multiple worker slots • Within a worker slot, Storm can spawn multiple threads, referred to as executors (figure: Fan Jiang, Enabling Site-Aware Scheduling for Apache Storm in ExoGENI)
  19. 19. Storm - cluster • Tasks • Each spout or bolt executes as many tasks across the cluster • By default each task corresponds to one thread of execution (one task per executor) • Tasks are instances of spouts and bolts whose nextTuple() and execute() methods are called by executor threads • Stream groupings define how tuples are sent from one set of tasks to another • ComponentConfigurationDeclarer: .setNumTasks(#) sets the total number of tasks for a component, which are spread over its executors • Workers • Topologies execute across one or more worker processes • Each worker process is a JVM and executes a subset of all the tasks for the topology • Each worker runs executors for one specific topology • For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, each worker executes 6 tasks • Config: setNumWorkers • Executors • Each executor runs one or more tasks of the same component • Executors are Java threads running within a worker JVM process; multiple tasks can be assigned to a single executor • TopologyBuilder: setSpout(,,#) sets the number of executors • TopologyBuilder: setBolt(,,#) sets the number of executors • The number of tasks for a component stays the same throughout the lifetime of a topology, but the number of executors (threads) can change over time
  20. 20. Storm - cluster
       builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
       builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2)
              .setNumTasks(4)
              .shuffleGrouping(SENTENCE_SPOUT_ID);
       builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
              .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
       (source: P. Taylor Goetz, Brian O'Neill: Storm Blueprints: Patterns for Distributed Real-time Computation)
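
To make the executor and task counts of this configuration concrete, here is a small bookkeeping sketch in plain Java. `ParallelismSketch` and the component IDs are hypothetical; the numbers come from the snippet above, under the assumption that a component without setNumTasks runs one task per executor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bookkeeping for the parallelism settings of a topology. */
final class ParallelismSketch {
    // component id -> {executors, tasks}
    private final Map<String, int[]> components = new LinkedHashMap<>();

    /** Without setNumTasks, assume one task per executor. */
    ParallelismSketch add(String id, int executors) {
        return add(id, executors, executors);
    }

    ParallelismSketch add(String id, int executors, int tasks) {
        components.put(id, new int[] {executors, tasks});
        return this;
    }

    int totalExecutors() {
        int sum = 0;
        for (int[] c : components.values()) sum += c[0];
        return sum;
    }

    int totalTasks() {
        int sum = 0;
        for (int[] c : components.values()) sum += c[1];
        return sum;
    }

    /** Average number of tasks run by each executor thread of a component. */
    double tasksPerExecutor(String id) {
        int[] c = components.get(id);
        return (double) c[1] / c[0];
    }

    /** The configuration from the slide: 2 + 2 + 4 executors, 2 + 4 + 4 tasks. */
    static ParallelismSketch slideExample() {
        return new ParallelismSketch()
                .add("sentence-spout", 2)
                .add("split-bolt", 2, 4)
                .add("count-bolt", 4);
    }
}
```

With these settings the topology runs 8 executor threads and 10 tasks, and each split-bolt executor serves two tasks.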
  21. 21. Storm • Scalable: • processes very high message throughput with very low latency • benchmark: one million 100-byte messages per second per node (node configuration: 2x Intel E5645 @ 2.4 GHz, 24 GB memory) • Fault-tolerant: • automatically restarts workers that die; if a node dies, its workers are restarted on another node • failure is expected and embraced • daemons restart as if nothing happened • their state is stored in ZooKeeper, including what they were doing before they died • Guaranteed data processing (reliable): • tracks the lineage of each tuple as it makes its way through the topology • messages are replayed only when there are failures • anchoring specifies a link in the tuple tree; it is done at the same time a new tuple is emitted • at least once by default
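
Storm's documentation describes tracking a tuple tree with a single 64-bit value: the ID of every tuple is XORed in once when it is emitted and once when it is acked, so the value returns to zero when the whole tree has been processed. The sketch below shows only that arithmetic in plain Java; `AckTrackerSketch` is a hypothetical class, and the small fixed IDs stand in for the random 64-bit IDs a real system would use.

```java
/** Hypothetical tracker for the tuple tree of one spout tuple (XOR idea only). */
final class AckTrackerSketch {
    private long ackVal = 0L;

    /** Called when a tuple with this ID is emitted (anchored) into the tree. */
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    /** Called when the tuple with this ID has been acked by a bolt. */
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    /**
     * True once every emitted ID has also been acked: each ID was XORed in
     * twice, which cancels it out. Only meaningful after the first emit.
     */
    boolean fullyProcessed() {
        return ackVal == 0L;
    }
}
```

The appeal of this scheme is constant memory per spout tuple, no matter how many downstream tuples the tree contains.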
  22. 22. Storm • Language: • the core of Storm is a Thrift definition for defining and submitting topologies, so topologies can be defined and submitted from any language • Open source: • large and growing ecosystem of libraries and tools to use with Storm • spouts that integrate with queueing systems such as JMS, Kafka and Redis pub/sub • helper bolts for integrating with databases such as MongoDB, RDBMSs, Cassandra and HBase, and with filesystems such as HDFS • Transactional: • can provide exactly-once messaging semantics for any computation
  23. 23. Spark Streaming • Open source data streaming and processing engine • Built around speed, ease of use and sophisticated analytics • Extension of the core Spark API • Scalable, high-throughput, fault-tolerant processing of live data streams • Batch processing means processing data en masse; micro-batching is batch processing where the batch size is orders of magnitude smaller • Runs on: • Hadoop YARN • Mesos • Spark standalone cluster • EC2 • Reads data from: • Kafka • Flume • ZeroMQ • TCP sockets • Twitter • Kinesis • HDFS
  24. 24. Spark Streaming • Stores data to: • HDFS • Databases • Dashboards • Spark's machine learning • Spark's graph processing algorithms • The input data stream is split into mini-batches, and transformations are performed on single mini-batches or on groups of mini-batches (figure: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark: Lightning-Fast Big Data Analysis)
  25. 25. Spark Streaming • Abstractions in Spark Streaming: • DStream • Discretized stream • a sequence of data arriving over time • a continuous series of RDDs • RDD • Resilient Distributed Dataset • collection of objects spread across a cluster • partitioned and distributed • partitions are recomputed on failure • kept in RAM or on disk • The Spark Streaming context works with small RDDs • Each RDD in a DStream contains data from a certain interval • Every small RDD supports the same transformations as a regular RDD • PairDStream • DStream of key-value pairs, which provides extra methods such as reduceByKey and join • StreamingContext API
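  26. 26. Spark Streaming
       // Local run with 2 threads: one for the receiver, one for processing.
       SparkConf conf = new SparkConf().setAppName("twitter-stream").setMaster("local[2]");
       // Batch interval of 10 seconds.
       JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
       jssc.checkpoint("/home/cloudera/Desktop/spark_tweets");
       // ... input DStreams, transformations and output operations are defined here ...
       jssc.start();
       jssc.awaitTermination();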
  27. 27. Spark Streaming • Operations • are applied to each RDD in a DStream • Transformations • create a new DStream • Output operations • write data to other systems • run periodically on each time step, producing output in batches • Checkpointing • a streaming application must be resilient to system failures and JVM crashes • checkpointing lets it recover from failures • types of data that are checkpointed: • Metadata checkpointing: configuration, DStream operations, incomplete batches • Data checkpointing: saving the generated RDDs to reliable storage
  28. 28. Spark Streaming - Input • Built-in streaming sources: • multiple receivers can be created to receive multiple data streams simultaneously • Basic sources: • File systems • reads data from files on any file system compatible with the HDFS API • monitors a directory and processes files created in it • files must be moved into the directory, not continuously appended • Socket connections • the data stream is separated into time intervals • Akka actors • Queue of RDDs as a stream • each RDD pushed into the queue • is treated as a batch of data in the DStream • Advanced sources: • require external non-Spark libraries • Custom receiver: • implemented by developers • Kafka • Flume • Kinesis • Twitter • ZeroMQ • MQTT
  29. 29. Spark Streaming - Input
       // "twitter" is assumed to be an already configured twitter4j.Twitter instance.
       Authorization auth = twitter.getAuthorization();
       final String[] filters = { "#KCA", "#kca" };
       JavaDStream<Status> tweets = TwitterUtils.createStream(jssc, auth, filters);
       // Keep only the text of each tweet.
       JavaDStream<String> statuses = tweets.map(new Function<Status, String>() {
           public String call(Status status) {
               return status.getText();
           }
       });
  30. 30. Spark Streaming - Transformations • Transformations on DStreams • Stateless transformations • process each batch separately • do not depend on the data of previous batches • transform() applies any arbitrary RDD-to-RDD function to the DStream
       Transformation: Meaning
       map(func): Return a new DStream by passing each element of the source DStream through a function func.
       flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
       filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
       repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
  31. 31. Spark Streaming - Transformations
       Transformation: Meaning
       count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
       reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
       reduceByKey(): Combine values with the same key in each batch. Requires a JavaPairDStream.
       groupByKey(): Group values with the same key in each batch.
  32. 32. Spark Streaming - Transformations
  33. 33. Spark Streaming - Transformations
       // Split each status into words. Spark 1.x signature: call returns an Iterable
       // (Spark 2.x changed it to an Iterator).
       JavaDStream<String> words = statuses.flatMap(new FlatMapFunction<String, String>() {
           public Iterable<String> call(String in) {
               return Arrays.asList(in.split(" "));
           }
       });
       // Map each word to a (word, 1) pair.
       JavaPairDStream<String, Integer> pairs = words.mapToPair(
           new PairFunction<String, String, Integer>() {
               public Tuple2<String, Integer> call(String in) throws Exception {
                   return new Tuple2<String, Integer>(in, 1);
               }
           });
       // Sum the counts per word within each batch.
       JavaPairDStream<String, Integer> counts = pairs.reduceByKey(
           new Function2<Integer, Integer, Integer>() {
               public Integer call(Integer a, Integer b) {
                   return a + b;
               }
           });
  34. 34. Spark Streaming - Transformations • Transformations on DStreams • Stateful transformations • Data from previous batches is used to generate the results for a new batch • Track state across time • Checkpointing must be enabled (for fault tolerance) • updateStateByKey(): • Returns a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. • Sliding windows: • transformations over a sliding window of data • Parameters: • window length: the duration of the window • sliding interval: the interval at which the window operation is performed
  35. 35. Spark Streaming - Transformations
       Transformation: Meaning
       window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
       countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
       reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
       reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
       reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enter the sliding window, and "inverse reducing" the old data that leave the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
       countByValueAndWindow(windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
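
The incremental variant of reduceByKeyAndWindow can be illustrated outside Spark with a windowed sum: the entering batch is added (the reduce function) and the leaving batch is subtracted (the inverse reduce function). `SlidingWindowSum` below is a hypothetical plain-Java class, a sketch of the idea for one key with a slide interval of one batch, not Spark's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window sum over per-batch values, updated incrementally. */
final class SlidingWindowSum {
    private final int windowLengthInBatches;
    private final Deque<Integer> batches = new ArrayDeque<>();
    private int windowSum = 0;

    SlidingWindowSum(int windowLengthInBatches) {
        this.windowLengthInBatches = windowLengthInBatches;
    }

    /** Adds the newest batch value and returns the sum over the current window. */
    int slide(int newBatchValue) {
        batches.addLast(newBatchValue);
        windowSum += newBatchValue;               // "reduce" the entering data
        if (batches.size() > windowLengthInBatches) {
            windowSum -= batches.removeFirst();   // "inverse reduce" the leaving data
        }
        return windowSum;
    }
}
```

Each slide touches only two batch values regardless of the window length, which is why the inverse-function variant is cheaper for long windows.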
  36. 36. Spark Streaming - Transformations
  37. 37. Spark Streaming - Transformations
       // Optional here is com.google.common.base.Optional (Guava), as used by the Spark 1.x Java API.
       Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
           new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
               public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
                   Integer newSum = state.or(0);   // previous count, or 0 for a new key
                   for (int i : values) {
                       newSum += i;
                   }
                   return Optional.of(newSum);
               }
           };
       JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(updateFunction);
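
What updateStateByKey does from batch to batch can be mimicked with an ordinary map. The sketch below is plain Java using java.util.Optional; `StateByKeySketch` is a hypothetical class that imitates the semantics on a single machine and is not how Spark stores state. As in Spark, the update function is called for every key that has state, even when the batch brings no new values for it, and returning an empty Optional drops the key.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

/** Hypothetical single-machine imitation of per-key state across batches. */
final class StateByKeySketch {
    private final Map<String, Integer> state = new HashMap<>();

    /** Applies the update function to every key with state or with new values. */
    void processBatch(Map<String, List<Integer>> batch,
                      BiFunction<List<Integer>, Optional<Integer>, Optional<Integer>> update) {
        Set<String> keys = new HashSet<>(state.keySet());
        keys.addAll(batch.keySet());
        for (String key : keys) {
            List<Integer> newValues = batch.getOrDefault(key, Collections.<Integer>emptyList());
            Optional<Integer> previous = Optional.ofNullable(state.get(key));
            Optional<Integer> next = update.apply(newValues, previous);
            if (next.isPresent()) {
                state.put(key, next.get());
            } else {
                state.remove(key);   // an empty result removes the key's state
            }
        }
    }

    Integer get(String key) {
        return state.get(key);
    }

    /** Same logic as the running-count update function on the slide. */
    static Optional<Integer> runningSum(List<Integer> values, Optional<Integer> previous) {
        int sum = previous.orElse(0);
        for (int v : values) {
            sum += v;
        }
        return Optional.of(sum);
    }
}
```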
  38. 38. Spark Streaming – Transformations join • Works on JavaPairDStream • Combine data from multiple DStreams with the transformations: • join() • leftOuterJoin() • rightOuterJoin() • fullOuterJoin() • Merge the contents of two different DStreams: • union() • Stream-stream joins • Stream-dataset joins
  39. 39. Spark Streaming - Output • Output operations store the final transformed data in an external database, a file system or on screen
       Output operation: Meaning
       print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application.
       saveAsTextFiles(prefix, [suffix]): Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
       saveAsObjectFiles(prefix, [suffix]): Save this DStream's contents as a SequenceFile of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
       saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as a Hadoop file. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
       foreachRDD(func): The most generic output operator; it applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
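  40. 40. Spark Streaming - Output
       // foreachRDD replaces the deprecated foreach alias of the Spark 1.x API.
       sortedCounts.foreachRDD(new Function<JavaPairRDD<Integer, String>, Void>() {
           public Void call(JavaPairRDD<Integer, String> rdd) {
               // Runs on the driver: collect() pulls the whole RDD into driver memory.
               Jedis jedis = new Jedis("#.#.#.#");
               for (Tuple2<Integer, String> t : rdd.collect()) {
                   // Publish "word|count" to the Redis channel "spark_words".
                   jedis.publish("spark_words", t._2 + "|" + Integer.toString(t._1));
               }
               return null;
           }
       });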
  41. 41. Spark Streaming demo
  42. 42. Spark Streaming - Output
       // Convert keys and values to Hadoop Writable types before saving.
       JavaPairDStream<Text, IntWritable> writableDStream = runningCounts.mapToPair(
           new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
               public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> e) {
                   return new Tuple2<Text, IntWritable>(new Text(e._1()), new IntWritable(e._2()));
               }
           });
       // The value type must match the IntWritable values written above.
       class OutFormat extends TextOutputFormat<Text, IntWritable> {}
       writableDStream.saveAsHadoopFiles("hdfs://#.#.#.#/user/hdfs/tweets_spark/", "",
           Text.class, IntWritable.class, OutFormat.class);
  43. 43. Spark Streaming Parallelism • Increasing the number of receivers • create multiple input DStreams • use union to merge them • Explicitly repartitioning received data • repartition the input stream • DStream.repartition • Increasing parallelism in aggregation • the level of parallelism can be specified • applies to operations that reduce the dataset
  44. 44. Spark Streaming • Scalable • High-throughput • Fault-tolerant • Guarantees data processing (Reliable): • exactly once
  45. 45. Spark Streaming • MLlib • streaming machine learning algorithms that can learn from the streaming data and apply the model to it at the same time • Streaming Linear Regression • Streaming KMeans • DataFrame • create a SQLContext using the SparkContext • Declare JavaRow • apply the model online on streaming data
  46. 46. Storm vs Spark - use case • realtime analytics • online machine learning • continuous computation • distributed RPC (Remote Procedure Call) • ETL • Look for trends that can indicate a problem. • Alert or provide automated corrections • Provide an interface to visualize • Current data • Historical data
  47. 47. Storm vs Spark approach • Storm: • tends to be driven by creating classes and implementing interfaces • has the advantage of broader language support (code can be written in R or any other language not natively supported by Spark) • the DAG is natural to the processing model, and the tuple is the natural interface for data passed between nodes • excels at computing transformations as data is ingested, with sub-second latencies • Spark: • has more of a “functional” flavor, where working with the API is driven by chaining successive method calls that invoke primitive operations • tuples can feel awkward in Java, but they bring the benefit of compile-time type checking • can use an existing Hadoop or Mesos cluster • micro-batching trivially gives stateful computation, which makes windowing an easy task • Neither approach is better or worse
  48. 48. Storm vs Spark
       Aspect: Storm vs Spark
       Processing model: Event-streaming vs Micro-batching / batch (Spark Core)
       Delivery guarantees: At most once / at least once vs Exactly once
       Latency: Sub-second vs Seconds
       Language options: Java, Clojure, Scala, Python, Ruby vs Java, Scala, Python
       Development: Use another tool for batch vs Batching and streaming are very similar
  49. 49. Storm vs Spark - recommendation • Storm: • Latency < 1 second (500 ms) • Real time: • Analytics • Budgeting • ML • Spark: • ETL • iterative machine learning • interactive analytics • interactive queries • batch processing • graph processing
  50. 50. Storm vs Spark • Storm: • Lower-level API • No concept of look-back aggregations (sliding windows) • combine batch with streaming • Spark: • < 1 TB size of cluster • latency from 500 milliseconds to 1 second (micro-batching incurs a latency cost) • streaming inputs are replicated in memory
  51. 51. Storm vs Spark • Jonathan Leibiusky, Gabriel Eisbruch, Dario Simonassi: Getting Started with Storm: Continuous Streaming Computation with Twitter's Cluster Technology • Quinton Anderson: Storm Real-time Processing Cookbook: Efficiently Process Unbounded Streams of Data in Real Time • Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark: Lightning-Fast Big Data Analysis • Apache Spark Streaming Programming Guide • Apache Storm • P. Taylor Goetz, Brian O'Neill: Storm Blueprints: Patterns for Distributed Real-time Computation