
Apache Storm Tutorial

Presentation of Apache Storm Tutorial for the Data Mining Class
A.Y. 2016-2017
Sapienza University of Rome

Published in: Technology

Apache Storm Tutorial

  1. INTRODUCTION TO APACHE STORM - Sapienza University of Rome - Data Mining Class - A.Y. 2016-2017
  2. Team 2: Riccardo Di Stefano (https://it.linkedin.com/in/riccardo-di-stefano-439a11134), Roberto Gaudenzi (https://it.linkedin.com/in/roberto-gaudenzi-4b0422116), Davide Mazza (https://it.linkedin.com/in/davide-mazza-33a9b291), Lorenzo Rutigliano (https://it.linkedin.com/in/lorenzo-rutigliano-00a007135/it), Sara Veterini (https://it.linkedin.com/in/sara-veterini-667684116), Federico Croce (https://it.linkedin.com/in/federico-croce-921a19134/it). Apache Storm - Sapienza University of Rome - Data Mining Class - A.Y. 2016-2017
  3. Contacts and Links: https://github.com/davidemazza/ApacheStorm, http://www.slideshare.net/DavideMazza6/apache-storm-tutorial, apachestormtutorial@gmail.com
  4. Introduction Apache Storm is a free and open source, distributed, fault-tolerant, real-time computation system that makes it easy to process unbounded streams of data. > Use cases: financial applications, network monitoring, social network analysis, online machine learning, etc. > It differs from traditional batch systems, which store data first and process it afterwards.
  5. Companies
  6. Stream: an unbounded sequence of tuples. Tuple: the core unit of data, a named list of values.
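To make the two definitions above concrete, here is a minimal Python sketch. It is not Storm code: a tuple is modeled as a dictionary that maps field names to values, and a stream as a generator that never terminates. The field names `student` and `grade` and the function `grade_stream` are assumptions made up for this illustration.

```python
import itertools
import random


def grade_stream(seed=0):
    """Yield an unbounded sequence of tuples (named lists of values)."""
    rng = random.Random(seed)
    for i in itertools.count():
        # Each tuple has two named fields: "student" and "grade".
        yield {"student": f"s{i}", "grade": rng.randint(18, 30)}


# A stream has no end, so a consumer only ever looks at a finite prefix.
first_three = list(itertools.islice(grade_stream(), 3))
```

Because the generator never stops, calling `list(grade_stream())` directly would not return; `itertools.islice` takes a bounded slice instead.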
  7. Topologies An application is defined in Storm through a Topology, which describes its logic as a DAG of operators and streams. Spouts: the sources of data streams. They usually read data from external sources (e.g. the Twitter API) or from disk and emit it into the topology. Bolts: process input streams and (possibly) produce output streams. They represent the application logic.
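The DAG structure described above can be sketched in plain Python, without the Storm API. The component names below are hypothetical (they anticipate the grade-average example later in the deck), and the dictionary layout is an assumption chosen for brevity.

```python
from graphlib import TopologicalSorter

# Each component maps to the set of components whose streams it consumes.
topology = {
    "grades-spout": set(),
    "summation-bolt": {"grades-spout"},
    "counter-bolt": {"grades-spout"},
    "average-bolt": {"summation-bolt", "counter-bolt"},
}

# static_order() raises graphlib.CycleError if the graph is not a DAG.
order = list(TopologicalSorter(topology).static_order())

# Spouts are the components with no upstream input.
spouts = [name for name, inputs in topology.items() if not inputs]
```

The topological order lists every component after all the components it reads from, which is the sense in which data flows through the topology.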
  8. Architecture Two kinds of nodes in a Storm cluster: ➢ The Master node runs a daemon called “Nimbus”, to which topologies are submitted. It is responsible for scheduling, job orchestration, and monitoring for failures. ➢ Each Worker (Slave) node runs a daemon called “Supervisor”, which can run one or more worker processes in which applications are executed. Coordination between these two entities is done through ZooKeeper, which is mainly used to maintain state, because Nimbus and the Supervisors are stateless.
  9. Architecture Three entities are involved in running a topology: ➢ Worker Process: one or more per cluster; each one belongs to exactly one topology (for design reasons related to fault tolerance and isolation). ➢ Executor: a thread in the Worker process. It runs one or more tasks for the same component (spout or bolt). ➢ Task: a component replica. Therefore Workers provide inter-topology parallelism, Executors intra-topology parallelism, and Tasks intra-component parallelism. (Diagram: a worker process containing executors, each running one or more tasks.)
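As a rough illustration of the executor and task relationship above, the following Python sketch spreads the tasks of one component over its executors. The round-robin policy and the function name `assign_tasks` are assumptions for this example; they are not taken from Storm's scheduler.

```python
def assign_tasks(num_tasks, num_executors):
    """Spread task ids of one component over executors as evenly as possible."""
    if not 1 <= num_executors <= num_tasks:
        raise ValueError("need 1 <= executors <= tasks")
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        # Round-robin: an assumed policy, used here only for illustration.
        executors[task_id % num_executors].append(task_id)
    return executors


# Four replicas (tasks) of one bolt, run by two threads (executors).
layout = assign_tasks(num_tasks=4, num_executors=2)
```

Each inner list is the set of tasks one executor thread runs; all of them belong to the same component, as the slide states.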
  10. Simple Example
  11. Example We will show how to compute the average of a stream of grades using a simple Storm topology. We will use: ➢ one spout; ➢ two bolts that work in parallel; ➢ another bolt in which the previous two converge.
  12. Spout This represents the spout. Its job is to read a stream of numbers. Our stream represents grades, so the values are between 18 and 30.
  13. Bolt This represents the bolt. We can distinguish three different bolts in our example: 1. SummationBolt: computes the sum of the numbers; 2. CounterBolt: counts the numbers; 3. AverageBolt: computes the average.
  14. Topology (diagram: Stream)
  15. Topology (diagram: Stream)
  16. Topology (diagram: Stream)
  17. Topology (diagram: Stream)
  18. Topology (diagram: Stream)
  19. Topology (diagram: Stream)
  20. Topology (diagram: Stream, Output)
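The grade-average topology from the example slides can be simulated in a few lines of Python. This is a sketch of the data flow only, not the code from the slides: the bolts are plain generators, and pairing sums with counts through `zip` assumes that both bolts emit in lockstep, which a real distributed topology does not guarantee.

```python
def summation_bolt(grades):
    """Emit the running sum after each incoming grade."""
    total = 0
    for grade in grades:
        total += grade
        yield total


def counter_bolt(grades):
    """Emit the running count after each incoming grade."""
    count = 0
    for _ in grades:
        count += 1
        yield count


def average_bolt(sums, counts):
    """Combine the two upstream streams into a running average."""
    for total, count in zip(sums, counts):
        yield total / count


# The spout is replaced by a fixed list of grades between 18 and 30.
grades = [18, 30, 24, 28]
averages = list(average_bolt(summation_bolt(grades), counter_bolt(grades)))
```

With these four grades the running averages are 18.0, 24.0, 24.0 and 25.0.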
  21. Trident
  22. Trident ➢ A high-level abstraction on top of Storm ➢ Uses spouts and bolts that Trident generates automatically before execution ➢ Provides functions, filters, joins, grouping, and aggregation ➢ Processes streams as a series of batches
  23. Topology ➢ Receives an input stream from a spout ➢ Performs an ordered sequence of operations (filter, aggregation, grouping, etc.) on the stream
  24. Tuples & Spout ➢ A TridentTuple is a named list of values. ➢ The TridentTuple interface is the data model of a Trident topology. ➢ A TridentSpout is similar to a Storm spout, with additional options. ➢ Many sample implementations of Trident spouts are available.
  25. Example of Spout
  26. Operations ➢ Filter ➢ Function ➢ Aggregation ➢ Grouping ➢ Merging and Joining
  27. Operations: Filter ➢ An object used to perform input validation. ➢ Gets a subset of the Trident tuple fields as input ➢ Returns either true or false ➢ True → the tuple is kept in the output stream ➢ False → the tuple is removed from the stream
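The filter semantics above can be sketched in plain Python. This models the behavior only and does not use the Trident API; `apply_filter`, the predicate, and the field names are hypothetical.

```python
def apply_filter(stream, input_fields, is_keep):
    """Keep a tuple only if the predicate returns True for the selected fields."""
    for tup in stream:
        selected = [tup[field] for field in input_fields]
        if is_keep(*selected):
            yield tup  # True: the tuple stays in the output stream


tuples = [
    {"student": "a", "grade": 25},
    {"student": "b", "grade": 12},  # invalid: below the allowed range
    {"student": "c", "grade": 30},
]

# Input validation: the predicate sees only the "grade" field.
valid = list(apply_filter(tuples, ["grade"], lambda g: 18 <= g <= 30))
```

Note that the predicate receives only the selected subset of fields, while the tuple that is kept or dropped is the whole one.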
  28. Operations: Function ➢ An object used to perform a simple operation on a single Trident tuple. ➢ Takes a subset of the Trident tuple fields as input ➢ Emits zero or more tuples, whose new fields are appended to the input tuple.
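A function differs from a filter in that it can add fields and emit several outputs per input. The sketch below models that behavior in plain Python under the assumption that emitted fields are appended to the input tuple; `apply_function` and `split_words` are names invented for this example, not Trident classes.

```python
def apply_function(stream, input_fields, output_fields, fn):
    """For each tuple, emit zero or more tuples extended with new fields."""
    for tup in stream:
        selected = [tup[field] for field in input_fields]
        for values in fn(*selected):
            # The new fields are appended to a copy of the input tuple.
            yield {**tup, **dict(zip(output_fields, values))}


def split_words(sentence):
    """Emit one value list per word; emits nothing for an empty sentence."""
    for word in sentence.split():
        yield (word,)


sentences = [{"sentence": "storm is fast"}, {"sentence": ""}]
words = list(apply_function(sentences, ["sentence"], ["word"], split_words))
```

The empty sentence produces no output at all, which illustrates the "zero or more" part of the definition.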
  29. Operations: Aggregation An object used to perform aggregation operations on an input batch, partition, or stream. ➢ Aggregate → aggregates each batch of Trident tuples in isolation ➢ PartitionAggregate → aggregates each partition instead of the entire batch of Trident tuples ➢ PersistentAggregate → aggregates over all Trident tuples across all batches
  30. Operations: Aggregation
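The difference between the three aggregation scopes is easiest to see with a count. The following Python sketch is a model of the scopes only, not Trident code: a stream is a list of batches, a batch is a list of partitions, and the three helper functions are hypothetical names.

```python
# Two batches, each split into two partitions of values.
batches = [
    [[1, 2], [3]],      # batch 0
    [[4], [5, 6, 7]],   # batch 1
]


def partition_aggregate(batch):
    """Count each partition of one batch separately."""
    return [len(partition) for partition in batch]


def aggregate(batch):
    """Count one whole batch, in isolation from other batches."""
    return sum(len(partition) for partition in batch)


def persistent_aggregate(all_batches):
    """Keep a running count across all batches seen so far."""
    running = 0
    totals = []
    for batch in all_batches:
        running += aggregate(batch)
        totals.append(running)
    return totals
```

For batch 0, `partition_aggregate` gives `[2, 1]` and `aggregate` gives `3`; `persistent_aggregate(batches)` gives `[3, 7]`, because the count carries over from one batch to the next.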
  31. Operations: Grouping ➢ A built-in operation, called through the groupBy method ➢ Repartitions the stream by doing a partitionBy on the specified fields ➢ Groups together the tuples whose group fields are equal
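Grouping by a field can be sketched in plain Python as follows. This models only the "equal group fields end up together" behavior, not the repartitioning across machines; `group_by` and the sample fields `lang` and `id` are assumptions for illustration.

```python
from collections import defaultdict


def group_by(batch, field):
    """Group together the tuples whose value for `field` is equal."""
    groups = defaultdict(list)
    for tup in batch:
        groups[tup[field]].append(tup)
    return dict(groups)


batch = [
    {"lang": "en", "id": 1},
    {"lang": "it", "id": 2},
    {"lang": "en", "id": 3},
]

groups = group_by(batch, "lang")
# A per-group aggregation (here a count) typically follows a grouping.
counts = {key: len(members) for key, members in groups.items()}
```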
  32. Operations: Merging and Joining ➢ Merging combines one or more streams into a single stream ➢ Joining uses Trident tuple fields from both sides to match and join two streams.
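The two operations above can be contrasted with a short Python sketch on finite batches. It is a simplification under stated assumptions: `merge` concatenates its inputs (a real stream merge interleaves them as tuples arrive), and `join` is an inner join on one field from each side. All names and sample records are hypothetical.

```python
import itertools


def merge(*streams):
    """Combine several streams with the same fields into one."""
    return list(itertools.chain(*streams))


def join(left, right, left_field, right_field):
    """Inner join: pair tuples whose join fields are equal."""
    index = {}
    for r in right:
        index.setdefault(r[right_field], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_field], [])]


students = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
grades = [{"student_id": 1, "grade": 30}, {"student_id": 3, "grade": 18}]

joined = join(students, grades, "id", "student_id")
merged = merge(students, students)
```

Only the student with `id` 1 has a matching grade, so `joined` holds a single combined tuple, while `merged` simply contains all four input tuples.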
  33. State Maintenance ➢ State information can be stored in the topology itself ➢ If any tuple fails during processing, the failed tuple is retried. ➢ If the tuple failed before updating the state → retrying the tuple leaves the state stable. ➢ If the tuple failed after updating the state → retrying the same tuple applies the update a second time and makes the state unstable (for example, a counter is incremented twice).
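The retry problem above, and one common way around it, can be sketched in Python. The idea shown is to store a transaction id next to the state and skip an update whose id was already applied. This is a simplified model, not Trident's State API, and `CountState` with its methods is hypothetical.

```python
class CountState:
    """A counter that remembers which transaction id it applied last."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def naive_update(self, increment):
        # No bookkeeping: a retried update is applied again.
        self.count += increment

    def idempotent_update(self, txid, increment):
        # Skip the update if this transaction was already applied.
        if txid == self.last_txid:
            return
        self.count += increment
        self.last_txid = txid


naive = CountState()
naive.naive_update(5)
naive.naive_update(5)  # retry after a failure that happened post-update

safe = CountState()
safe.idempotent_update(txid=1, increment=5)
safe.idempotent_update(txid=1, increment=5)  # same retry, now ignored
```

After the retry the naive counter holds 10 instead of 5, while the counter that checks the transaction id still holds 5.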
  34. When to use Trident? Exactly-once processing is difficult to achieve with plain Storm. Trident is useful for those use cases where you require exactly-once processing.
  35. Trident Example
  36. Trident Demo: Twitter Languages Which are the most used languages on Twitter? The code is built on top of Trident and gets a stream of tweets using the twitter4J library. For each tweet, the language is extracted. A hashmap of counters is maintained and periodically published in a tweet by the code itself.
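The counting logic of the demo can be sketched in Python without Twitter access or Trident. This is not the demo's code: tweets are plain dictionaries with an assumed `lang` field, "periodically" is modeled as every `publish_every` tweets, and `publish` is a hypothetical callback standing in for posting a tweet.

```python
from collections import Counter


def count_languages(tweets, publish_every, publish):
    """Count tweets per language and publish the counters periodically."""
    counters = Counter()
    for seen, tweet in enumerate(tweets, start=1):
        counters[tweet["lang"]] += 1
        if seen % publish_every == 0:
            # Publish a snapshot, most used language first.
            publish(counters.most_common())
    return counters


published = []
tweets = [{"lang": code} for code in ["en", "it", "en", "es", "en", "it"]]
totals = count_languages(tweets, publish_every=3, publish=published.append)
```

Here the snapshots are collected in a list; in the demo the equivalent step would format the counters and send them out as a tweet.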
  37. Trident example setup To set up your Twitter application: ● go to https://apps.twitter.com/ and create a new app ● fill in the form, leaving the callback URL empty ● after creating the app, go to "keys and access tokens" ● copy the consumer key and consumer secret ● select "create my access tokens" if no tokens are present, then copy the access token and access token secret ● open the project TwitterTridentExample in Eclipse, open the file twitter4j.properties in the project, and paste in your info. Now you are ready!
  38. Homework
  39. Homework: see the folder “Homework” at https://github.com/davidemazza/ApacheStorm
  40. Thanks!
