Introduction to Storm
 

Speaker Notes

  • Even average enterprises can now process and make sense of big data
  • Variety – the various types of data. Velocity – how fast the data is processed. Volume – how much data there is.
  • Keeps running if a component dies; self-healing
  • Stream processing – read tuples, do some processing, update a database, and drop the tuples; move data from an operational DB into BI, or process log files (ETL processing). Continuous computation – ask Storm a really expensive computation query online, for example, how many events have arrived since last week; trending topics or most popular articles.
  • A graph of spouts and bolts connected by streams
  • Number of worker processes per topology. You can change the number of workers and/or the number of executors for components using the "storm rebalance" command. The following command changes the number of workers for the "demo" topology to 3, the number of executors for the "myspout" component to 5, and the number of executors for the "mybolt" component to 1: storm rebalance demo -n 3 -e myspout=5 -e mybolt=1. The number of executor threads can be changed after the topology has been started, but the number of tasks of a topology is static. So one reason for having 2+ tasks per executor thread is the flexibility to expand/scale up the topology with "storm rebalance" later without taking the topology offline. For instance, imagine you start with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. You could run the topology at the anticipated parallelism level of 25 machines on the 15 initial boxes (slower than 25 boxes, of course), then rebalance once the additional 10 boxes are integrated to make full use of all 25 without any downtime. Another reason to run 2+ tasks per executor is (primarily functional) testing: if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside everything else, you can still run 30 tasks (here: 15 per executor) to see whether code such as a custom Storm grouping works as expected.
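In code, those parallelism knobs look roughly like this (a sketch; MySpout and MyBolt are placeholder components, while the ids "demo", "myspout", and "mybolt" match the rebalance example):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // 2 executors now, but 4 tasks total: "storm rebalance" can later
        // raise myspout to 4 executors without restarting the topology
        builder.setSpout("myspout", new MySpout(), 2).setNumTasks(4);
        builder.setBolt("mybolt", new MyBolt(), 2)
               .shuffleGrouping("myspout");

        Config conf = new Config();
        conf.setNumWorkers(3); // worker processes for this topology

        StormSubmitter.submitTopology("demo", conf, builder.createTopology());
    }
}
```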
  • Question
  • Submitter – uploads the topology JAR with its dependencies to the Nimbus inbox. Nimbus – makes task assignments and starts the topology.
  • Storm considers a tuple coming off a spout fully processed when every message in its tree has been processed. A tuple is considered failed when its tree of messages is not fully processed within a configurable timeout; the default is 30 seconds.
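The timeout is set per topology via the configuration; a minimal fragment:

```java
import backtype.storm.Config;

Config conf = new Config();
// fail tuples whose tree is not fully acked within 30s (the default)
conf.setMessageTimeoutSecs(30);
```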
  • For example, a MongoDB _id
  • There are two things you have to do as a user to benefit from Storm's reliability capabilities. First, you need to tell Storm whenever you create a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm's API provides a concise way of doing both of these tasks. Specifying a link in the tuple tree is called anchoring.
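In bolt code the two steps look roughly like this (a sketch of a bolt's execute method; the sentence-splitting logic is just an illustration):

```java
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
        // 1. Anchoring: emit the new tuple as a child of the input tuple,
        //    adding a link to the tuple tree
        collector.emit(tuple, new Values(word));
    }
    // 2. Finishing: tell Storm this input tuple has been fully processed
    collector.ack(tuple);
}
```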
  • Second, you need to tell Storm when you have finished processing an individual tuple.

Introduction to Storm: Presentation Transcript

  • 1 Introduction to Storm – a distributed, real-time, fault-tolerant framework. Eugene Dvorkin, Coding Architect, WebMD. edvorkin@gmail.com #edvorkin eugenedvorkin.com
  • 2 Big Data “Big Data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction”
  • 3 Big Data: Variety, Velocity, Volume
  • 4 Enablers of Big Data Map/Reduce frameworks – Hadoop Scalable storage – HDFS, NoSQL databases Cheap computing power – Cloud computing
  • 5 Why Real Time? Better end-user experience – e.g., view an ad, see the counter move. Operational intelligence – low-latency analysis, real-time dashboards. Event response – rule engine, personalization, predictions. Scalable analysis – example: trend analysis to recommend 'hot' articles.
  • 6 Requirements Doing scalable real-time processing requires a framework that is: fast; scalable through process parallelization and distribution; fault-tolerant; guarantees data processing; easy to learn, code, and operate; robust.
  • 7 Storm • Storm – an open-source, distributed, real-time computation system. • Developed by Nathan Marz at BackType, acquired by Twitter.
  • 8 Storm Fast Scalable by process parallelization and distribution Fault-tolerant Guarantees data processing Runs on JVM Easy to learn, code and operate Supports development in multiple languages
  • 9 Hadoop vs. Storm – Storm for real-time processing. Storm is to real-time computation what Hadoop is to batch computation.
  • 10 Storm Use cases
  • 11 Storm Use Cases “Storm powers a wide variety of Twitter systems, ranging in applications from discovery, real-time analytics, personalization, search, revenue optimization, and many more.” “Storm empowers stream/micro-batch processing of user events, content feeds, and application logs” - Yahoo “ETL – move data from MongoDB to BI”
  • 12 Storm Abstractions
  • 13 Storm cluster
  • 14 Storm Abstractions Tuples, Streams, Spouts, Bolts and Topologies
  • 15 Tuples [“Colonoscopy”, 14106] • Storm Data structure • List of elements
  • 16 Stream Unbounded sequence of tuples [“Colonoscopy”, 14091][“Cancer”,42651] [“Oncology”, 14417]
  • 17 Spout Reads from a source of data – queues, web logs, API calls, databases – and emits streams of tuples
  • 18 Bolts Process tuples and create new streams
  • 19 Bolts Apply functions /transforms Calculate and aggregate data (word count!) Access DB, API , etc. Filter data Process tuples and create new streams
  • 20 Topology
  • 21 Storm is Easy to Code How do you write Storm components?
  • 22 Topology Example
  • 23 How to create a spout
  • 24 How to create a spout
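The code on these slides doesn't survive in the transcript; a minimal spout along the same lines (a sketch — TweetSpout and fetchNextTweet are placeholders, not the presenter's original code):

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class TweetSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector; // called once when the spout starts
    }

    @Override
    public void nextTuple() {
        // placeholder for a real source (queue, API, log file, ...)
        String tweet = fetchNextTweet();
        if (tweet != null) {
            collector.emit(new Values(tweet));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet"));
    }

    private String fetchNextTweet() {
        return null; // stub for illustration
    }
}
```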
  • 25 Spouts Available on GitHub Integration with Redis, Kafka, MongoDB, Amazon SQS, JMS, and others is readily available
  • 26 How to Create a Bolt
  • 27 HashTagFilterBolt
  • 28 HashTagCountBolt
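The HashTagCountBolt code is likewise lost in the transcript; one plausible reconstruction (a sketch, not the presenter's original):

```java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class HashTagCountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String hashtag = tuple.getString(0);
        Integer count = counts.get(hashtag);
        count = (count == null) ? 1 : count + 1;
        counts.put(hashtag, count);
        // BaseBasicBolt anchors and acks automatically
        collector.emit(new Values(hashtag, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag", "count"));
    }
}
```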
  • 29 Creating Topology
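Wiring the components together (a sketch; TweetSpout is a placeholder for whatever spout the slides used, while the bolt names come from the previous slides):

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweet-spout", new TweetSpout());
builder.setBolt("hashtag-filter", new HashTagFilterBolt())
       .shuffleGrouping("tweet-spout");
builder.setBolt("hashtag-count", new HashTagCountBolt())
       .fieldsGrouping("hashtag-filter", new Fields("hashtag"));

// run in-process for development; use StormSubmitter for a real cluster
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("HashTagCountTopology", new Config(),
                       builder.createTopology());
```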
  • 30 Problem What about parallel processing?
  • 31 Topology Example
  • 32 Topology Example
  • 33 Topology Example
  • 34 Parallelism Storm Scalability - Parallelism
  • 35 Storm cluster
  • 36 Storm Parallelism
  • 37 Storm rebalance > storm rebalance demo -n 3 -e myspout=5 -e mybolt=1
  • 38 Creating Cluster Topology >storm jar HashTagTopology.jar org.javameetup.topology.HashTagCountTopology
  • 39 Stream groupings Shuffle grouping: Tuples are randomly distributed across the bolt's tasks Fields grouping: The stream is partitioned by the fields specified in the grouping Custom grouping
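Fields grouping works by hashing the grouping fields, so equal values always land on the same task. A self-contained illustration of that routing idea (not Storm's actual implementation):

```java
public class FieldsGroupingDemo {
    // Route a tuple to a task using only the grouping field's hash,
    // so tuples with equal field values always go to the same task.
    static int chooseTask(String groupingValue, int numTasks) {
        return Math.abs(groupingValue.hashCode()) % numTasks;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        String[] hashtags = {"cancer", "physician", "cancer"};
        for (String tag : hashtags) {
            System.out.println(tag + " -> task " + chooseTask(tag, numTasks));
        }
        // both "cancer" tuples are routed to the same task number
    }
}
```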
  • 40 Stream groupings
  • 41 Demo
  • 42 Deployment Storm Deployment
  • 43 Storm deployment
  • 44 Storm deployment Out-of-the-box configuration is suitable for production One-click deploy to EC2 with the storm-deploy project Once deployed, easy to operate – designed to be robust The Storm daemons, Nimbus and the Supervisors, are stateless and fail-fast Useful UI
  • 45 Storm UI
  • 46 Storm UI
  • 47 Storm is Fault-Tolerant
  • 48 Normal operations
  • 49 Nimbus down • Processing will continue, but topology lifecycle operations and the reassignment facility are lost. • Run Nimbus under process supervision.
  • 50 Worker node down • Nimbus will reassign tasks to other machines
  • 51 Supervisor goes down Processing will still continue, but task assignments will no longer be synchronized to that node
  • 52 Worker process down • The Supervisor will restart the worker process and processing will continue
  • 53 Guaranteeing message processing
  • 54 Guaranteed Message Processing “Tuple tree”
  • 55 Reliability API When emitting a tuple, the Spout provides a "message id" that will be used to identify the tuple later.
  • 56 Reliability API- Anchoring
  • 57 Reliability API – finishing processing
  • 58 Spout - Reliability API
  • 59 Reliability API
  • 60 Reliability API
  • 61 Advanced Topics - Trident Trident is a high-level abstraction for doing real-time computing on top of Storm.
  • 62 Trident – higher-level constructs Joins Aggregations Grouping Functions Filters Consistent, exactly-once semantics
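A Trident version of the hashtag count, using the constructs listed above (a sketch; tweetSpout and ExtractHashTags are placeholders):

```java
import backtype.storm.tuple.Fields;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

TridentTopology topology = new TridentTopology();
TridentState hashTagCounts =
    topology.newStream("tweets", tweetSpout)              // placeholder spout
            .each(new Fields("tweet"),
                  new ExtractHashTags(),                  // placeholder function
                  new Fields("hashtag"))
            .groupBy(new Fields("hashtag"))
            .persistentAggregate(new MemoryMapState.Factory(), // in-memory state for the demo
                                 new Count(),
                                 new Fields("count"));
```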
  • 63 Example [Physicians, 79] [Oncology, 78] [Cancer, 237] …
  • 64 Example
  • 65 Example
  • 66 Example
  • 67 Example
  • 68 Example
  • 69 Example
  • 70 Example
  • 71 Example
  • 72 Example [Physicians, 79] [Oncology, 78] [Cancer, 237] …
  • 73 Demo
  • 74 DRPC Server
  • 75 DRPC Server We want to know the aggregate count of tweets with hashtags #cancer and #Physician at this moment
  • 76 DRPC Server
  • 77 DRPC Server
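With Trident, the DRPC query side can be sketched like this (hashTagCounts stands for the persistent count state from the earlier example; the stream name "count-hashtags" is made up):

```java
import backtype.storm.LocalDRPC;
import backtype.storm.tuple.Fields;
import storm.trident.operation.builtin.MapGet;
import storm.trident.testing.Split;

LocalDRPC drpc = new LocalDRPC();
topology.newDRPCStream("count-hashtags", drpc)
        .each(new Fields("args"), new Split(), new Fields("hashtag"))
        .groupBy(new Fields("hashtag"))
        .stateQuery(hashTagCounts, new Fields("hashtag"),
                    new MapGet(), new Fields("count"));

// client side: one synchronous call returns the current counts
System.out.println(drpc.execute("count-hashtags", "cancer physician"));
```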
  • 78
  • 79 Conclusion Storm allows us to solve a wide range of business problems in real time Thriving open-source community
  • 80 Resources Storm Project wiki Storm starter project Storm contributions project Running a Multi-Node Storm cluster tutorial Implementing real-time trending topic A Hadoop Alternative: Building a real-time data pipeline with Storm Storm Use cases
  • 81 Resources (cont'd) Understanding the Parallelism of a Storm Topology Trident – high-level Storm abstraction A practical Storm's Trident API Storm online forum Project source code New York City Storm Meetup Image credits: US NASA
  • 82 Questions Eugene Dvorkin, Architect WebMD edvorkin@gmail.com Twitter: #edvorkin Introduction to Storm