Speaker notes

  • Average enterprises can now process and make sense of big data.
  • Variety: the various types of data. Velocity: how fast the data is processed. Volume: how much data there is.
  • Keeps running if a component dies; self-healing.
  • Stream processing: read tuples, do some processing, update a database, and drop the tuples; move data from an operational DB into BI, or process log files (ETL processing). Online queries: ask Storm for a really expensive computation, for example, how many events have I received since last week. Continuous computation: trending topics or most popular articles.
  • Graph of spouts and bolts connected by streams.
  • Number of worker processes per cluster. Finally, you can change the number of workers and/or the number of executors for components using the "storm rebalance" command. The following command changes the number of workers for the "demo" topology to 3, the number of executors for the "myspout" component to 5, and the number of executors for the "mybolt" component to 1: storm rebalance demo -n 3 -e myspout=5 -e mybolt=1. The number of executor threads can be changed after the topology has been started (see the storm rebalance command); the number of tasks of a topology is static. So one reason for having 2+ tasks per executor thread is to give you the flexibility to scale up the topology through storm rebalance in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. You could opt to run the topology at the anticipated parallelism level of 25 machines on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated, you can then storm rebalance the topology to make full use of all 25 boxes without any downtime. Another reason to run 2+ tasks per executor is (primarily functional) testing: if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside everything else on the machine, you can still run 30 tasks (here, 15 per executor) to see whether code such as a custom Storm grouping works as expected.
  • Question.
  • Submitter: uploads the topology JAR, with dependencies, to the Nimbus inbox. Nimbus: makes the assignment and starts the topology.
  • Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
  • For example, a MongoDB _id.
  • There are two things you have to do as a user to benefit from Storm's reliability capabilities. First, you need to tell Storm whenever you're creating a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm's API provides a concise way of doing both of these tasks. Specifying a link in the tuple tree is called anchoring.
  • Second, you need to tell Storm when you have finished processing an individual tuple.
  • Transcript of "Introduction to Storm"

1. Introduction to Storm: a distributed, real-time, fault-tolerant framework. Eugene Dvorkin, Coding Architect, WebMD. edvorkin@gmail.com, #edvorkin, eugenedvorkin.com
2. Big Data. “Big Data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction.”
3. Big Data: Velocity, Volume, Variety.
4. Enablers of Big Data: Map/Reduce frameworks (Hadoop); scalable storage (HDFS, NoSQL databases); cheap computing power (cloud computing).
5. Why Real Time? Better end-user experience (e.g., view an ad, see the counter move). Operational intelligence: low-latency analysis, real-time dashboards. Event response: rule engines, personalization, predictions. Scalable analysis (e.g., trend analysis to recommend 'hot' articles).
6. Requirements. Doing scalable real-time processing requires a framework that is fast; scalable by process parallelization and distribution; fault-tolerant; able to guarantee data processing; easy to learn, code, and operate; and robust.
7. Storm: an open-source, distributed, real-time computation system. Developed by Nathan Marz; acquired by Twitter.
8. Storm is fast; scalable by process parallelization and distribution; fault-tolerant; guarantees data processing; runs on the JVM; is easy to learn, code, and operate; and supports development in multiple languages.
9. Hadoop vs. Storm: Storm is for real-time processing. Storm is to real-time computation what Hadoop is to batch computation.
10. Storm use cases.
11. Storm Use Cases. “Storm powers a wide variety of Twitter systems, ranging in applications from discovery, real-time analytics, personalization, search, revenue optimization, and many more.” “Storm empowers stream/micro-batch processing of user events, content feeds, and application logs” (Yahoo). “ETL – move data from MongoDB to BI.”
12. Storm Abstractions.
13. Storm cluster.
14. Storm Abstractions: Tuples, Streams, Spouts, Bolts, and Topologies.
15. Tuples. [“Colonoscopy”, 14106] is a tuple: Storm's core data structure, a list of elements.
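In Storm proper, a tuple is a named list of values whose field names are declared by the component emitting the stream. As a minimal plain-Python illustration of that idea (not Storm's actual API):

```python
# A Storm tuple is an ordered list of values; the stream carrying it declares
# a field name for each position, so values can be read by name downstream.
FIELDS = ["keyword", "count"]          # field names this toy stream declares

def as_named_tuple(values):
    """Pair a tuple's values with the stream's declared field names."""
    return dict(zip(FIELDS, values))

t = as_named_tuple(["Colonoscopy", 14106])
print(t["keyword"], t["count"])        # Colonoscopy 14106
```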
16. Stream: an unbounded sequence of tuples. [“Colonoscopy”, 14091] [“Cancer”, 42651] [“Oncology”, 14417]
17. Spout: reads from a stream of data (queues, web logs, API calls, databases) and emits streams of tuples.
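The spout contract is small: repeatedly pull from a source and emit tuples. A plain-Python sketch, with an in-memory queue standing in for Kafka, a web log, or an API:

```python
from collections import deque

class ToySpout:
    """Stand-in for a Storm spout: each next_tuple() call pulls one item
    from the source and emits it as a single-field tuple. A real spout
    would read from a queue, web log, API, or database instead."""

    def __init__(self, source):
        self.source = deque(source)
        self.emitted = []              # stand-in for the outgoing stream

    def next_tuple(self):
        if self.source:                # emit at most one tuple per call
            self.emitted.append([self.source.popleft()])

spout = ToySpout(["a tweet about #cancer", "a tweet about #oncology"])
spout.next_tuple()
spout.next_tuple()
print(len(spout.emitted))              # 2
```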
18. Bolts: process tuples and create new streams.
19. Bolts apply functions/transforms; calculate and aggregate data (word count!); access DBs, APIs, etc.; filter data; and process tuples to create new streams.
20. Topology.
21. Storm is Easy to Code. How to write Storm components? Storm is easy to use.
22. Topology Example.
23. How to create a spout.
24. How to create a spout.
25. Spouts Available on GitHub: integrations with Redis, Kafka, MongoDB, Amazon SQS, JMS, and others are readily available.
26. How to Create a Bolt.
27. HashTagFilterBolt.
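The original slide shows HashTagFilterBolt as a code screenshot, so the exact implementation isn't recoverable; the sketch below is an assumed reconstruction of the logic its name implies, as plain Python rather than Storm's bolt API: take a tweet-text tuple and emit one tuple per hashtag found.

```python
import re

HASHTAG = re.compile(r"#\w+")

class HashTagFilterBolt:
    """Assumed logic: from a tweet-text tuple, emit one tuple per hashtag."""

    def execute(self, tup):
        text = tup[0]
        # a real bolt would call collector.emit() for each output tuple
        return [[tag.lower()] for tag in HASHTAG.findall(text)]

bolt = HashTagFilterBolt()
print(bolt.execute(["New #Cancer screening guidelines #Oncology"]))
# [['#cancer'], ['#oncology']]
```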
28. HashTagCountBolt.
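Likewise a screenshot in the original; the assumed logic is a rolling count keyed by hashtag, the stream-processing version of word count:

```python
from collections import Counter

class HashTagCountBolt:
    """Assumed logic: keep a running count per hashtag and emit the
    updated [hashtag, count] tuple each time one arrives."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        tag = tup[0]
        self.counts[tag] += 1
        return [tag, self.counts[tag]]   # a real bolt would emit() this

bolt = HashTagCountBolt()
bolt.execute(["#cancer"])
print(bolt.execute(["#cancer"]))         # ['#cancer', 2]
```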
29. Creating Topology.
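In Storm itself this step uses a TopologyBuilder to connect the spout and bolts with stream groupings; as a self-contained stand-in, the same spout to filter to count dataflow can be wired by hand:

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def run_topology(tweets):
    """Hand-wired stand-in for spout -> HashTagFilterBolt -> HashTagCountBolt."""
    counts = Counter()
    emitted = []
    for text in tweets:                        # spout: one tuple per tweet
        for tag in HASHTAG.findall(text):      # filter bolt: hashtag tuples
            tag = tag.lower()
            counts[tag] += 1                   # count bolt: rolling totals
            emitted.append([tag, counts[tag]])
    return emitted

out = run_topology(["#cancer news", "more on #Cancer and #oncology"])
print(out[-1])    # ['#oncology', 1]
```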
30. Problem: what about parallel processing?
31. Topology Example.
32. Topology Example.
33. Topology Example.
34. Storm Scalability: Parallelism.
35. Storm cluster.
36. Storm Parallelism.
37. Storm rebalance: > storm rebalance demo -n 3 -e myspout=5 -e mybolt=1
38. Creating Cluster Topology: > storm jar HashTagTopology.jar org.javameetup.topology.HashTagCountTopology
39. Stream groupings. Shuffle grouping: tuples are randomly distributed across the bolt's tasks. Fields grouping: the stream is partitioned by the fields specified in the grouping. Custom groupings are also possible.
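The difference between the two groupings is just the partitioning function: random for shuffle, hash-of-field for fields grouping. A plain-Python sketch of the routing decision:

```python
import random

def shuffle_grouping(tup, num_tasks, rng=random.Random()):
    """Shuffle grouping: send the tuple to a random task (load balancing)."""
    return rng.randrange(num_tasks)

def fields_grouping(tup, num_tasks, field=0):
    """Fields grouping: hash the chosen field, so every tuple with the
    same value lands on the same task (what a counting bolt needs)."""
    return hash(tup[field]) % num_tasks

# both "#cancer" tuples go to the same task under fields grouping
same = fields_grouping(["#cancer", 1], 4) == fields_grouping(["#cancer", 2], 4)
print(same)   # True
```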
40. Stream groupings.
41. Demo.
42. Storm Deployment.
43. Storm deployment.
44. Storm deployment: the out-of-the-box configuration is suitable for production. One-click deploy to EC2 with the storm-deploy project. Once deployed, it is easy to operate: designed to be robust. The Storm daemons, Nimbus and the Supervisors, are stateless and fail-fast. Useful UI.
45. Storm UI.
46. Storm UI.
47. Storm is Fault-Tolerant.
48. Normal operations.
49. Nimbus down: processing will continue, but topology lifecycle operations and the reassignment facility are lost. Run it under system supervision.
50. Worker node down: Nimbus will reassign its tasks to other machines.
51. Supervisor down: processing will still continue, but task assignments for that node are no longer synchronized.
52. Worker process down: the Supervisor will restart the worker process and processing will continue.
53. Guaranteeing message processing.
54. Guaranteed Message Processing: the “tuple tree.”
55. Reliability API: when emitting a tuple, the spout provides a “message id” that will be used to identify the tuple later.
56. Reliability API: anchoring.
57. Reliability API: finishing processing.
58. Spout: Reliability API.
59. Reliability API.
60. Reliability API.
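Under the hood, Storm's acker tracks each tuple tree with an XOR trick: every tuple id is XOR-ed in when anchored and XOR-ed out when acked, so the running value returns to zero exactly when the whole tree is processed. A deterministic toy version (Storm uses random 64-bit ids, which make accidental zeros vanishingly unlikely; a counter is used here so the example is reproducible):

```python
import itertools

class ToyAcker:
    """Tracks one spout tuple's tree. XOR in each anchored child id and
    XOR it out again on ack; the tree is complete when the value is 0."""

    _ids = itertools.count(1)      # Storm uses random 64-bit ids instead

    def __init__(self):
        self.val = 0

    def anchor(self):
        tid = next(self._ids)      # id for a newly emitted child tuple
        self.val ^= tid
        return tid

    def ack(self, tid):
        self.val ^= tid
        return self.val == 0       # True -> the spout tuple can be acked

acker = ToyAcker()
a, b = acker.anchor(), acker.anchor()
print(acker.ack(a), acker.ack(b))   # False True
```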
61. Advanced Topics: Trident. Trident is a high-level abstraction for doing real-time computing on top of Storm.
62. Trident: higher-level constructs. Joins, aggregations, grouping, functions, filters. Consistent, exactly-once semantics.
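Trident's actual API is Java (streams with groupBy, aggregators, and persistent state); what the exactly-once bullet means can be sketched in plain Python: micro-batches are folded into persistent per-key counts, and remembering which batch ids were already applied makes replays after a failure harmless. This is a simplification of Trident's real transactional state, which tracks a transaction id alongside each stored value.

```python
from collections import Counter

state = Counter()        # stands in for Trident's persistent, keyed state
applied = set()          # batch ids already folded into the state

def process_batch(batch_id, hashtags):
    """Fold one micro-batch of hashtag tuples into the counts, idempotently:
    a replayed batch (after a failure) must not double-count."""
    if batch_id in applied:
        return
    state.update(hashtags)
    applied.add(batch_id)

process_batch(1, ["cancer", "physicians", "cancer"])
process_batch(1, ["cancer", "physicians", "cancer"])   # replay: no-op
print(state["cancer"])   # 2
```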
63. Example: [Physicians: 79] [Oncology: 78] [Cancer: 237] …
64. Example.
65. Example.
66. Example.
67. Example.
68. Example.
69. Example.
70. Example.
71. Example.
72. Example: [Physicians: 79] [Oncology: 78] [Cancer: 237] …
73. Demo.
74. DRPC Server.
75. DRPC Server: we want to know the aggregate count of tweets with hashtags #cancer and #Physician at this moment.
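A DRPC client sends a function name plus arguments to the DRPC server, which pushes them through a topology and returns the result. A stand-in for the slide's query, using the counts shown on slide 63 (the lookup table here is illustrative, not live data):

```python
# current hashtag counts a topology might be maintaining (from slide 63)
counts = {"physicians": 79, "oncology": 78, "cancer": 237}

def drpc_execute(args):
    """Stand-in for the DRPC topology: sum the current counts of every
    requested hashtag and return the aggregate to the caller."""
    tags = [t.lstrip("#").lower() for t in args.split()]
    return sum(counts.get(t, 0) for t in tags)

print(drpc_execute("#cancer #physicians"))   # 316
```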
76. DRPC Server.
77. DRPC Server.
78.
79. Conclusion: Storm allows us to solve a wide range of business problems in real time, and it has a thriving open-source community.
80. Resources: Storm project wiki; Storm starter project; Storm contributions project; Running a Multi-Node Storm Cluster tutorial; Implementing a real-time trending topic; A Hadoop Alternative: Building a Real-Time Data Pipeline with Storm; Storm use cases.
81. Resources (cont'd): Understanding the Parallelism of a Storm Topology; Trident, the high-level Storm abstraction; A practical Storm's Trident API; Storm online forum; project source code; New York City Storm Meetup. Image credits: US NASA.
82. Questions. Eugene Dvorkin, Architect, WebMD. edvorkin@gmail.com, Twitter: #edvorkin. Introduction to Storm.
