At StampedeCon 2014, Scott Shaw (Hortonworks) and Kit Menke (Enterprise Holdings) presented "Storm – Streaming Data Analytics at Scale".
Storm's primary purpose is to provide real-time analytics on fast-moving data before it's stored. Use cases range from fraud detection and machine learning to ETL.
Storm has been clocked at over 1 million tuples processed per second per node. It’s fast, scalable, and language agnostic. This session provides an architecture overview as well as a real-world discussion of its use and implementation at Enterprise Holdings.
18. Overview
• Storm Terminology
• Creating a Topology
• Persisting data from Storm
• Topology Performance
• Custom Metrics
• Workers, Executors, and Tasks
• Caching within a Bolt
• Environment Setup
19. Storm Terminology
• Topologies run on your Hadoop cluster
– Uber-jar with spouts and bolts
– Runs forever
• Spouts generate streams of tuples
• Tuples are lists of values
• Bolts process tuples (and emit tuples)
[Diagram: a topology in which a spout emits streams of tuples into bolts (Bolt A, Bolt B, Bolt 1)]
28. Persisting Data
• Write to HDFS using storm-hdfs for long-term storage
• Index data in ElasticSearch or Solr for real-time dashboards
• Insert messages into a database
• Publish to a message queue
• Read from/write to HBase to influence the topology in real time
30. Custom Metrics
• New in Storm 0.9.0
• Out-of-the-box metrics, e.g. CountMetric
• Custom metric by implementing IMetric
• Register the metric on spout/bolt startup
• Set topology to consume metrics stream
32. Workers, Executors, and Tasks
• Workers
– Separate JVM
– Workers run Executors
• Executors
– Separate threads
– Executors run Tasks
• Tasks
– Your spout or bolt code
• Running more than one task per executor does not increase the level of parallelism!
Workers <= Executors <= Tasks
33. Caching inside a Bolt
• RotatingMap with Tick Tuples
• Use fieldsGrouping to ensure cache hits
34. Environment Setup
• Storm-starter project on GitHub
• Git, Eclipse, Maven
• Unit test!
• Develop locally or on a single-node Hadoop machine
• Read the source code
Real-time data integration
Analyze, clean, normalize data with low latency
Low-latency dashboards
Summing/aggregations for operational monitors, gauges and counters
Orders, revenue, call volumes, infrastructure load
Geographic location of fleets
Alerts
Quality: Detection of "never seen before" entities (customers, ads, etc.)
Security: Detection of trespass / fraud / illegal activities
Safety: patient monitoring, automotive telematics
Operations: Detection of system / network overload
Improved operations
Advertising optimization
Personalization
Fleet rerouting
A stream processing solution needs to consume explicit or implicit event models from the batch processing platform. These event models define the schemas of incoming event data, such as records of calls into the customer contact center, copies of customer order transactions, or exogenous market data. Event models also specify:
Relationships (such as causation) among the event types
Calculations (for example, formulas to compute KPIs)
Alert thresholds (for example, "if average caller wait time exceeds 45 seconds, send a yellow warning by email")
Responses (for example, "trigger an exception process if the result of a customer credit check has not been received within two hours")
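The alert-threshold rule above ("if average caller wait time exceeds 45 seconds, send a yellow warning") can be sketched as a simple check. `WaitTimeAlert`, the method names, and the alert levels here are illustrative, not from the talk's codebase:

```java
// Hypothetical sketch of one event-model alert rule from the text:
// "if average caller wait time exceeds 45 seconds, send a yellow warning".
class WaitTimeAlert {
    static final double YELLOW_THRESHOLD_SECONDS = 45.0;

    /** Returns "YELLOW" when the average wait crosses the threshold, else "OK". */
    static String evaluate(double[] waitTimesSeconds) {
        double sum = 0;
        for (double w : waitTimesSeconds) sum += w;
        double avg = waitTimesSeconds.length == 0 ? 0 : sum / waitTimesSeconds.length;
        return avg > YELLOW_THRESHOLD_SECONDS ? "YELLOW" : "OK";
    }

    public static void main(String[] args) {
        // Average of 30, 40, 70 is about 46.7 s, so this crosses the threshold.
        System.out.println(evaluate(new double[] {30, 40, 70})); // prints: YELLOW
    }
}
```

In a real topology this check would live in a bolt's execute method, with the threshold and response loaded from the event model rather than hard-coded.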
Storm was benchmarked at processing one million 100-byte messages per second per node on hardware with the following specs:
Processor: 2× Intel E5645 @ 2.4 GHz
Memory: 24 GB
Add more types of data, and add prevent-and-optimize use cases
Getting started with Storm
Reading source code most helpful
Create a simple hello world topology and run it locally
Topologies are the application you will write and deploy to your cluster where it will run forever working on streams of data.
Each topology contains spouts and bolts
Spouts bring data into your topology by generating streams of tuples, reading from an external source like a queue or something on the internet (like Twitter).
Tuples are lists of values (string, int, boolean, or custom objects which require serializers)
Bolts process the tuples emitted by the spouts and also emit tuples themselves
Creating a simple Storm topology which demonstrates guaranteed message processing.
Create a counting spout connected to an unreliable bolt connected to an output bolt
Many different options for connecting things together: shuffle grouping means tuples are randomly distributed.
Can also group by a field (fieldsGrouping) or broadcast a tuple to all tasks (allGrouping)
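Fields grouping works by hashing the chosen field, so tuples with equal values always reach the same bolt task. A minimal illustration of that routing idea (not Storm's actual implementation, which has its own hashing and scheduling):

```java
class FieldsGroupingDemo {
    /** Pick a target task index from the grouped field's value. */
    static int targetTask(Object fieldValue, int numTasks) {
        // Math.floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The same key always lands on the same task, which is what
        // makes per-key caching inside a bolt effective later on.
        System.out.println(targetTask("customer-42", 4) == targetTask("customer-42", 4));
        // prints: true
    }
}
```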
Demonstrate an error scenario by using an unreliable bolt
Simple example of a spout which counts from 0 to 9
Open is called once for each instance of your spout.
Adding numbers 0-9 to an in-memory queue
Typically you will be reading from a real message queue
nextTuple is called repeatedly to get each tuple.
Here we are emitting one int: number
The second parameter is used for reprocessing in the event of a failure
declareOutputFields for specifying which fields you are emitting in nextTuple.
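The counting spout described above can be sketched without the Storm dependency. The interface here is a simplified stand-in for Storm's spout contract, reduced to the two methods discussed (open and nextTuple); all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified stand-in for Storm's spout contract, so the sketch
// compiles without storm-core on the classpath.
interface SimpleSpout {
    void open();         // called once per spout instance
    Integer nextTuple(); // called repeatedly; null when there is nothing to emit
}

class CountingSpout implements SimpleSpout {
    private final Queue<Integer> queue = new ArrayDeque<>();

    @Override
    public void open() {
        // In a real topology this would connect to an external message queue;
        // here we preload the numbers 0-9 in memory, as in the talk.
        for (int i = 0; i < 10; i++) queue.add(i);
    }

    @Override
    public Integer nextTuple() {
        // Real Storm also takes a message id here, used to replay on failure.
        return queue.poll();
    }

    public static void main(String[] args) {
        CountingSpout spout = new CountingSpout();
        spout.open();
        Integer n;
        while ((n = spout.nextTuple()) != null) System.out.print(n + " ");
        // prints: 0 1 2 3 4 5 6 7 8 9
    }
}
```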
An example implementation of an Unreliable Bolt (because it should fail 50% of the time)
Bolts also have a prepare and declareOutputFields method.
Execute is the main method where your processing will take place.
The input tuple was generated by our spout.
50% of the time, the tuple will fail.
Calling _collector.fail on a tuple will cause it to go back to the spout’s fail method.
In this simple example, I made number the same value as the tuple but in reality this might be a queued message ID.
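The fail path can be sketched as follows. The collector is a stub standing in for Storm's OutputCollector, and the talk's random 50% failure is replaced by a deterministic "every second tuple fails" so the example is reproducible:

```java
import java.util.ArrayList;
import java.util.List;

// Stub standing in for Storm's OutputCollector, so the sketch is self-contained.
class StubCollector {
    final List<Integer> acked = new ArrayList<>();
    final List<Integer> failed = new ArrayList<>();
    void ack(int tuple)  { acked.add(tuple); }   // tuple fully processed
    void fail(int tuple) { failed.add(tuple); }  // routed back to the spout's fail method
}

class UnreliableBolt {
    private final StubCollector collector;
    private int seen = 0;

    UnreliableBolt(StubCollector collector) { this.collector = collector; }

    // In Storm this logic would live in execute(Tuple input). The talk's
    // bolt failed 50% of the time at random; here every second tuple
    // fails so the behavior is deterministic.
    void execute(int tuple) {
        if (seen++ % 2 == 1) collector.fail(tuple);
        else collector.ack(tuple);
    }

    public static void main(String[] args) {
        StubCollector c = new StubCollector();
        UnreliableBolt bolt = new UnreliableBolt(c);
        for (int i = 0; i < 10; i++) bolt.execute(i);
        System.out.println(c.acked.size() + " acked, " + c.failed.size() + " failed");
        // prints: 5 acked, 5 failed
    }
}
```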
We ended up not really needing tuple reprocessing but I believe storm-jms has this built in if you need it.
Talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing.
We are using storm-hdfs to write all messages we receive straight into HDFS.
Also indexing our data in ElasticSearch in order to have a real-time dashboard for executives.
Influence the topology in “real-time” by reading from or writing to HBase
Careful: this will be SLOW compared to how fast you need to process messages in Storm. An HBase read takes ~20 ms; that is only 50 tuples/s per task!
Using storm-hdfs to stream data to HDFS for more analytics and storage
Put hive tables over top, run trends, etc.
Time based indexes (one per day)
Kibana dashboard on top of elasticsearch indexes
size: 14.3G (28.7G)
docs: 42,051,720 (42,051,720)
It is hard to optimize!
The storm UI will help you a lot with determining where the bottleneck is in your topology, but you will need to break out your bolts.
Capacity: if this is around 1.0, the bolt is running as fast as it can and you probably need to increase its parallelism.
Here I’ve prefixed my bolts with a number so they sort nicely in the Storm UI.
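The Storm UI's capacity figure is, roughly, the fraction of the measurement window the bolt spent executing tuples. A quick way to reason about the 1.0 threshold (the numbers below are made up for illustration):

```java
class BoltCapacity {
    /**
     * Rough capacity, as in the Storm UI: executed count times average
     * execute latency, as a fraction of the window. Near 1.0 means the
     * bolt is saturated and needs more parallelism.
     */
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return executed * executeLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        // 540,000 tuples at ~1 ms each over a 10-minute window: 90% busy.
        System.out.println(capacity(540_000, 1.0, 600_000)); // prints: 0.9
    }
}
```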
Custom Metrics were added in Storm 0.9.0 and allow you to collect a lot more information than what is displayed in the Storm UI.
Comes with some metrics out of the box, like CountMetric (e.g. cache hits, number of tuples processed).
Can create custom metrics by implementing the IMetric interface.
Register your metric in your spout’s open method or bolt’s prepare method.
When creating your topology, configure a consumer. LoggingMetricsConsumer comes out of the box and just logs to the metrics.log on one of the machines.
Can create your own consumers to stream to third party monitoring apps.
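Storm's IMetric contract is a single getValueAndReset() method, sampled periodically and handed to the configured consumer. A self-contained sketch of a CountMetric-style counter mirroring that contract (the interface is redeclared locally so the sketch compiles without storm-core):

```java
// Local copy of Storm's one-method metric contract.
interface IMetric {
    Object getValueAndReset(); // sampled periodically by the metrics consumer
}

class CountMetricSketch implements IMetric {
    private long count = 0;

    void incr() { count++; }
    void incrBy(long n) { count += n; }

    @Override
    public Object getValueAndReset() {
        // Each metrics-bucket interval, the consumer reads the count
        // and the metric starts over at zero.
        long value = count;
        count = 0;
        return value;
    }

    public static void main(String[] args) {
        CountMetricSketch hits = new CountMetricSketch();
        hits.incr();
        hits.incrBy(4);
        System.out.println(hits.getValueAndReset()); // prints: 5
        System.out.println(hits.getValueAndReset()); // prints: 0
    }
}
```

In a real bolt you would register an instance of this in prepare via the topology context, and the LoggingMetricsConsumer (or your own consumer) would receive the reset values each interval.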
We've identified a bottleneck in our topology (the filter bolt) using the Storm UI and Storm's metrics.
Increasing the parallelism of the bolt might help with our throughput. If it takes twice as long as our categorize bolt, we probably need to double the number of executors.
Configure workers, executors, and tasks when creating the topology.
Worker process…
Separate JVM
Runs executors
One send/receive thread per worker
Rule of thumb: Multiple of the number of machines in your cluster
Executors
Thread spawned by worker
Runs tasks serially
Rule of thumb: Multiple of the # of workers
Task
Runs your spouts and bolts
Cannot change the number of tasks after topology has been started
Rule of thumb: Multiple of the # of executors. Typically just have 1 per executor unless you plan on adding more nodes while the topology is running
Running more than one task per executor does not increase the level of parallelism!!!
Number of workers and executors can change, number of tasks cannot
http://stackoverflow.com/questions/17257448/what-is-the-task-in-twitter-storm-parallelism
Example: Storm running on 3 nodes.
Three workers, six executors, six tasks.
Workers <= Executors <= Tasks
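The 3-node example works out as follows; a small check of the worker/executor/task arithmetic (remembering that effective parallelism is the executor count, not the task count):

```java
class ParallelismMath {
    // Storm spreads executors evenly over workers, and tasks over executors.
    static int executorsPerWorker(int executors, int workers) { return executors / workers; }
    static int tasksPerExecutor(int tasks, int executors)     { return tasks / executors; }

    public static void main(String[] args) {
        // The 3-node example from the talk: 3 workers, 6 executors, 6 tasks.
        System.out.println(executorsPerWorker(6, 3)); // prints: 2 (threads per worker JVM)
        System.out.println(tasksPerExecutor(6, 6));   // prints: 1 (task per executor)
        // Effective parallelism here is 6 (the executors), not the task count.
    }
}
```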
If HBase calls take 20 ms, we’re going to have a bottleneck in our topology so we need caching.
fieldsGrouping + caching within bolts
Group by something that will be used as the key (or part of the key) to your cache. Same Tuples will be sent to the same bolt and increase the number of cache hits.
Create a RotatingMap (a LRU cache) in your bolt
Configure your bolt to receive Tick Tuples
Tick tuples sent to your bolt in addition to normal Tuples
Check to see if the tuple you received was a tick tuple and then rotate the cache every 300s
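The idea behind RotatingMap can be sketched as generation buckets: entries go into the newest bucket, and each rotation (driven by a tick tuple) drops the oldest bucket, expiring anything still in it. This is a minimal sketch of that idea; Storm's real RotatingMap also hands expired entries to a callback, and the tick-tuple plumbing is omitted here:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;

// Simplified sketch of Storm's RotatingMap: entries live in generation
// buckets; rotate() drops the oldest bucket, so untouched entries
// expire after numBuckets rotations.
class RotatingCache<K, V> {
    private final Deque<HashMap<K, V>> buckets = new ArrayDeque<>();

    RotatingCache(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) buckets.addFirst(new HashMap<>());
    }

    void put(K key, V value) {
        buckets.peekFirst().put(key, value); // newest bucket
    }

    V get(K key) {
        for (HashMap<K, V> bucket : buckets) {
            V v = bucket.get(key);
            if (v != null) return v;
        }
        return null;
    }

    /** Called from the bolt when a tick tuple arrives (e.g. every 300 s). */
    void rotate() {
        buckets.removeLast();              // expire the oldest generation
        buckets.addFirst(new HashMap<>()); // fresh bucket for new entries
    }

    public static void main(String[] args) {
        RotatingCache<String, Integer> cache = new RotatingCache<>(3);
        cache.put("branch-17", 42);
        cache.rotate();
        cache.rotate();
        System.out.println(cache.get("branch-17")); // prints: 42 (within 3 generations)
        cache.rotate();
        System.out.println(cache.get("branch-17")); // prints: null (expired)
    }
}
```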
Possible to develop in multiple languages, but java makes the most sense for getting started
Check out the storm-starter project on github for a great working example
Use git to clone the repository, set it up in your favorite IDE (Eclipse... haha, yeah right!), and set up Maven. Use the maven-shade-plugin to build your uber-jar.
Separate projects for major functionality; try to keep as little as possible in your Storm project. Use unit testing everywhere: it will save you time when you find bugs in the topology.
You can develop locally with just Eclipse and Storm. However, you will most likely also be using a lot of other Hadoop stuff (HDFS via storm-hdfs, HBase, etc.), so it might be helpful to get a single-node machine with everything installed.
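A typical maven-shade-plugin stanza for building the uber-jar looks like the following. The version and main class are placeholders; check the storm-starter pom for the authoritative setup:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- com.example.YourTopology is a placeholder for your topology's main class -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.YourTopology</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```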