Speaker notes

  • Average enterprises can now process and make sense of big data.
  • Variety: the various types of data. Velocity: how fast the data is processed. Volume: how much data there is.
  • Keeps running if a component dies; self-healing.
  • Stream processing: read tuples, do some processing, update a database, and drop the tuples; move data from an operational DB into BI, or process log files (ETL processing). Online queries: ask Storm for a really expensive computation, for example, how many events have I received since last week. Continuous computation: trending topics or most popular articles.
  • Graph of spouts and bolts connected by streams.
  • Number of worker processes per cluster. Finally, you can change the number of workers and/or the number of executors for components using the "storm rebalance" command. The following command changes the number of workers for the "demo" topology to 3, the number of executors for the "myspout" component to 5, and the number of executors for the "mybolt" component to 1: storm rebalance demo -n 3 -e myspout=5 -e mybolt=1. The number of executor threads can be changed after the topology has been started (see the storm rebalance command); the number of tasks of a topology is static. So one reason for having 2+ tasks per executor thread is to give you the flexibility to scale up the topology through storm rebalance in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. You could opt to run the topology at the anticipated parallelism level of 25 machines on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated, you can then storm rebalance the topology to make full use of all 25 boxes without any downtime. Another reason to run 2+ tasks per executor is (primarily functional) testing: if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside everything else on the machine, you can still run 30 tasks (here, 15 per executor) to see whether code such as a custom Storm grouping works as expected.
  • Question.
  • Submitter: uploads the topology JAR, with dependencies, to the Nimbus inbox. Nimbus: makes the assignment and starts the topology.
  • Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
  • For example, a MongoDB _id.
  • There are two things you have to do as a user to benefit from Storm's reliability capabilities. First, you need to tell Storm whenever you're creating a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm's API provides a concise way of doing both of these tasks. Specifying a link in the tuple tree is called anchoring.
  • Second, you need to tell Storm when you have finished processing an individual tuple.
  • Transcript of "Introduction to Storm"

1. Introduction to Storm: a distributed, real-time, fault-tolerant framework. Eugene Dvorkin, Coding Architect, WebMD. edvorkin@gmail.com, #edvorkin, eugenedvorkin.com
2. Big Data. “Big Data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction.”
3. Big Data: Velocity, Volume, Variety.
4. Enablers of Big Data: Map/Reduce frameworks (Hadoop); scalable storage (HDFS, NoSQL databases); cheap computing power (cloud computing).
5. Why Real Time? Better end-user experience (e.g., view an ad, see the counter move). Operational intelligence: low-latency analysis, real-time dashboards. Event response: rule engines, personalization, predictions. Scalable analysis (e.g., trend analysis to recommend 'hot' articles).
6. Requirements. Doing scalable real-time processing requires a framework that is fast; scalable by process parallelization and distribution; fault-tolerant; able to guarantee data processing; easy to learn, code, and operate; and robust.
7. Storm: an open-source, distributed, real-time computation system. Developed by Nathan Marz; acquired by Twitter.
8. Storm is fast; scalable by process parallelization and distribution; fault-tolerant; guarantees data processing; runs on the JVM; is easy to learn, code, and operate; and supports development in multiple languages.
9. Hadoop vs. Storm: Storm is for real-time processing. Storm is to real-time computation what Hadoop is to batch computation.
10. Storm use cases.
11. Storm Use Cases. “Storm powers a wide variety of Twitter systems, ranging in applications from discovery, real-time analytics, personalization, search, revenue optimization, and many more.” “Storm empowers stream/micro-batch processing of user events, content feeds, and application logs” (Yahoo). “ETL – move data from MongoDB to BI.”
12. Storm Abstractions.
13. Storm cluster.
14. Storm Abstractions: Tuples, Streams, Spouts, Bolts, and Topologies.
15. Tuples. [“Colonoscopy”, 14106] is a tuple: Storm's core data structure, a list of elements.
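In Storm proper, a tuple is a named list of values whose field names are declared by the component emitting the stream. As a minimal plain-Python illustration of that idea (not Storm's actual API):

```python
# A Storm tuple is an ordered list of values; the stream carrying it declares
# a field name for each position, so values can be read by name downstream.
FIELDS = ["keyword", "count"]          # field names this toy stream declares

def as_named_tuple(values):
    """Pair a tuple's values with the stream's declared field names."""
    return dict(zip(FIELDS, values))

t = as_named_tuple(["Colonoscopy", 14106])
print(t["keyword"], t["count"])        # Colonoscopy 14106
```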
16. Stream: an unbounded sequence of tuples. [“Colonoscopy”, 14091] [“Cancer”, 42651] [“Oncology”, 14417]
17. Spout: reads from a stream of data (queues, web logs, API calls, databases) and emits streams of tuples.
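The spout contract is small: repeatedly pull from a source and emit tuples. A plain-Python sketch, with an in-memory queue standing in for Kafka, a web log, or an API:

```python
from collections import deque

class ToySpout:
    """Stand-in for a Storm spout: each next_tuple() call pulls one item
    from the source and emits it as a single-field tuple. A real spout
    would read from a queue, web log, API, or database instead."""

    def __init__(self, source):
        self.source = deque(source)
        self.emitted = []              # stand-in for the outgoing stream

    def next_tuple(self):
        if self.source:                # emit at most one tuple per call
            self.emitted.append([self.source.popleft()])

spout = ToySpout(["a tweet about #cancer", "a tweet about #oncology"])
spout.next_tuple()
spout.next_tuple()
print(len(spout.emitted))              # 2
```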
18. Bolts: process tuples and create new streams.
19. Bolts apply functions/transforms; calculate and aggregate data (word count!); access DBs, APIs, etc.; filter data; and process tuples to create new streams.
20. Topology.
21. Storm is Easy to Code. How to write Storm components? Storm is easy to use.
22. Topology Example.
23. How to create a spout.
24. How to create a spout.
25. Spouts Available on GitHub: integrations with Redis, Kafka, MongoDB, Amazon SQS, JMS, and others are readily available.
26. How to Create a Bolt.
27. HashTagFilterBolt.
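The original slide shows HashTagFilterBolt as a code screenshot, so the exact implementation isn't recoverable; the sketch below is an assumed reconstruction of the logic its name implies, as plain Python rather than Storm's bolt API: take a tweet-text tuple and emit one tuple per hashtag found.

```python
import re

HASHTAG = re.compile(r"#\w+")

class HashTagFilterBolt:
    """Assumed logic: from a tweet-text tuple, emit one tuple per hashtag."""

    def execute(self, tup):
        text = tup[0]
        # a real bolt would call collector.emit() for each output tuple
        return [[tag.lower()] for tag in HASHTAG.findall(text)]

bolt = HashTagFilterBolt()
print(bolt.execute(["New #Cancer screening guidelines #Oncology"]))
# [['#cancer'], ['#oncology']]
```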
28. HashTagCountBolt.
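Likewise a screenshot in the original; the assumed logic is a rolling count keyed by hashtag, the stream-processing version of word count:

```python
from collections import Counter

class HashTagCountBolt:
    """Assumed logic: keep a running count per hashtag and emit the
    updated [hashtag, count] tuple each time one arrives."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        tag = tup[0]
        self.counts[tag] += 1
        return [tag, self.counts[tag]]   # a real bolt would emit() this

bolt = HashTagCountBolt()
bolt.execute(["#cancer"])
print(bolt.execute(["#cancer"]))         # ['#cancer', 2]
```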
29. Creating Topology.
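In Storm itself this step uses a TopologyBuilder to connect the spout and bolts with stream groupings; as a self-contained stand-in, the same spout to filter to count dataflow can be wired by hand:

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def run_topology(tweets):
    """Hand-wired stand-in for spout -> HashTagFilterBolt -> HashTagCountBolt."""
    counts = Counter()
    emitted = []
    for text in tweets:                        # spout: one tuple per tweet
        for tag in HASHTAG.findall(text):      # filter bolt: hashtag tuples
            tag = tag.lower()
            counts[tag] += 1                   # count bolt: rolling totals
            emitted.append([tag, counts[tag]])
    return emitted

out = run_topology(["#cancer news", "more on #Cancer and #oncology"])
print(out[-1])    # ['#oncology', 1]
```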
30. Problem: what about parallel processing?
31. Topology Example.
32. Topology Example.
33. Topology Example.
34. Storm Scalability: Parallelism.
35. Storm cluster.
36. Storm Parallelism.
37. Storm rebalance: > storm rebalance demo -n 3 -e myspout=5 -e mybolt=1
38. Creating Cluster Topology: > storm jar HashTagTopology.jar org.javameetup.topology.HashTagCountTopology
39. Stream groupings. Shuffle grouping: tuples are randomly distributed across the bolt's tasks. Fields grouping: the stream is partitioned by the fields specified in the grouping. Custom groupings are also possible.
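The difference between the two groupings is just the partitioning function: random for shuffle, hash-of-field for fields grouping. A plain-Python sketch of the routing decision:

```python
import random

def shuffle_grouping(tup, num_tasks, rng=random.Random()):
    """Shuffle grouping: send the tuple to a random task (load balancing)."""
    return rng.randrange(num_tasks)

def fields_grouping(tup, num_tasks, field=0):
    """Fields grouping: hash the chosen field, so every tuple with the
    same value lands on the same task (what a counting bolt needs)."""
    return hash(tup[field]) % num_tasks

# both "#cancer" tuples go to the same task under fields grouping
same = fields_grouping(["#cancer", 1], 4) == fields_grouping(["#cancer", 2], 4)
print(same)   # True
```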
40. Stream groupings.
41. Demo.
42. Storm Deployment.
43. Storm deployment.
44. Storm deployment: the out-of-the-box configuration is suitable for production. One-click deploy to EC2 with the storm-deploy project. Once deployed, it is easy to operate: designed to be robust. The Storm daemons, Nimbus and the Supervisors, are stateless and fail-fast. Useful UI.
45. Storm UI.
46. Storm UI.
47. Storm is Fault-Tolerant.
48. Normal operations.
49. Nimbus down: processing will continue, but topology lifecycle operations and the reassignment facility are lost. Run it under system supervision.
50. Worker node down: Nimbus will reassign its tasks to other machines.
51. Supervisor down: processing will still continue, but task assignments for that node are no longer synchronized.
52. Worker process down: the Supervisor will restart the worker process and processing will continue.
53. Guaranteeing message processing.
54. Guaranteed Message Processing: the “tuple tree.”
55. Reliability API: when emitting a tuple, the spout provides a “message id” that will be used to identify the tuple later.
56. Reliability API: anchoring.
57. Reliability API: finishing processing.
58. Spout: Reliability API.
59. Reliability API.
60. Reliability API.
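Under the hood, Storm's acker tracks each tuple tree with an XOR trick: every tuple id is XOR-ed in when anchored and XOR-ed out when acked, so the running value returns to zero exactly when the whole tree is processed. A deterministic toy version (Storm uses random 64-bit ids, which make accidental zeros vanishingly unlikely; a counter is used here so the example is reproducible):

```python
import itertools

class ToyAcker:
    """Tracks one spout tuple's tree. XOR in each anchored child id and
    XOR it out again on ack; the tree is complete when the value is 0."""

    _ids = itertools.count(1)      # Storm uses random 64-bit ids instead

    def __init__(self):
        self.val = 0

    def anchor(self):
        tid = next(self._ids)      # id for a newly emitted child tuple
        self.val ^= tid
        return tid

    def ack(self, tid):
        self.val ^= tid
        return self.val == 0       # True -> the spout tuple can be acked

acker = ToyAcker()
a, b = acker.anchor(), acker.anchor()
print(acker.ack(a), acker.ack(b))   # False True
```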
61. Advanced Topics: Trident. Trident is a high-level abstraction for doing real-time computing on top of Storm.
62. Trident: higher-level constructs. Joins, aggregations, grouping, functions, filters. Consistent, exactly-once semantics.
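Trident's actual API is Java (streams with groupBy, aggregators, and persistent state); what the exactly-once bullet means can be sketched in plain Python: micro-batches are folded into persistent per-key counts, and remembering which batch ids were already applied makes replays after a failure harmless. This is a simplification of Trident's real transactional state, which tracks a transaction id alongside each stored value.

```python
from collections import Counter

state = Counter()        # stands in for Trident's persistent, keyed state
applied = set()          # batch ids already folded into the state

def process_batch(batch_id, hashtags):
    """Fold one micro-batch of hashtag tuples into the counts, idempotently:
    a replayed batch (after a failure) must not double-count."""
    if batch_id in applied:
        return
    state.update(hashtags)
    applied.add(batch_id)

process_batch(1, ["cancer", "physicians", "cancer"])
process_batch(1, ["cancer", "physicians", "cancer"])   # replay: no-op
print(state["cancer"])   # 2
```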
63. Example: [Physicians: 79] [Oncology: 78] [Cancer: 237] …
64. Example.
65. Example.
66. Example.
67. Example.
68. Example.
69. Example.
70. Example.
71. Example.
72. Example: [Physicians: 79] [Oncology: 78] [Cancer: 237] …
73. Demo.
74. DRPC Server.
75. DRPC Server: we want to know the aggregate count of tweets with hashtags #cancer and #Physician at this moment.
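A DRPC client sends a function name plus arguments to the DRPC server, which pushes them through a topology and returns the result. A stand-in for the slide's query, using the counts shown on slide 63 (the lookup table here is illustrative, not live data):

```python
# current hashtag counts a topology might be maintaining (from slide 63)
counts = {"physicians": 79, "oncology": 78, "cancer": 237}

def drpc_execute(args):
    """Stand-in for the DRPC topology: sum the current counts of every
    requested hashtag and return the aggregate to the caller."""
    tags = [t.lstrip("#").lower() for t in args.split()]
    return sum(counts.get(t, 0) for t in tags)

print(drpc_execute("#cancer #physicians"))   # 316
```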
76. DRPC Server.
77. DRPC Server.
78.
79. Conclusion: Storm allows us to solve a wide range of business problems in real time, and it has a thriving open-source community.
80. Resources: Storm project wiki; Storm starter project; Storm contributions project; Running a Multi-Node Storm Cluster tutorial; Implementing a real-time trending topic; A Hadoop Alternative: Building a Real-Time Data Pipeline with Storm; Storm use cases.
81. Resources (cont'd): Understanding the Parallelism of a Storm Topology; Trident, the high-level Storm abstraction; A practical Storm's Trident API; Storm online forum; project source code; New York City Storm Meetup. Image credits: US NASA.
82. Questions. Eugene Dvorkin, Architect, WebMD. edvorkin@gmail.com, Twitter: #edvorkin. Introduction to Storm.
