At StampedeCon 2014, Scott Shaw (Hortonworks) and Kit Menke (Enterprise Holdings) presented "Storm – Streaming Data Analytics at Scale".
Storm's primary purpose is to provide real-time analytics on fast-moving data before it's stored. Use cases range from fraud detection and machine learning to ETL.
Storm has been clocked at over 1 million tuples processed per second per node. It’s fast, scalable, and language agnostic. This session provides an architecture overview as well as a real-world discussion of its use and implementation at Enterprise Holdings.
18. Overview
• Storm Terminology
• Creating a Topology
• Persisting data from Storm
• Topology Performance
• Custom Metrics
• Workers, Executors, and Tasks
• Caching within a Bolt
• Environment Setup
19. Storm Terminology
• Topologies run on your Hadoop cluster
– Uber-jar with spouts and bolts
– Runs forever
• Spouts generate streams of tuples
• Tuples are lists of values
• Bolts process tuples (and emit tuples)
[Diagram: a topology in which a spout emits streams of tuples into bolts (Bolt A, Bolt B, Bolt 1)]
28. Persisting Data
• Write to HDFS using storm-hdfs for long-term storage
• Index data in ElasticSearch or Solr for real-time dashboards
• Insert messages into a database
• Publish to a message queue
• Read from/write to HBase to influence the topology in real time
30. Custom Metrics
• New in Storm 0.9.0
• Out-of-the-box metrics, e.g. CountMetric
• Custom metric by implementing IMetric
• Register the metric on spout/bolt startup
• Set topology to consume metrics stream
32. Workers, Executors, and Tasks
• Workers
– Separate JVM
– Workers run Executors
• Executors
– Separate threads
– Executors run Tasks
• Tasks
– Your spout or bolt code
• Running more than one task per executor does not increase the level of parallelism!
Workers <= Executors <= Tasks
33. Caching inside a Bolt
• RotatingMap with Tick Tuples
• Use fieldsGrouping to ensure cache hits
34. Environment Setup
• Storm-starter project on GitHub
• Git, Eclipse, Maven
• Unit test!
• Develop locally or on a single-node Hadoop machine
• Read the source code
Real-time data integration
Analyze, clean, normalize data with low latency
Low-latency dashboards
Summing/aggregations for operational monitors, gauges and counters
Orders, revenue, call volumes, infrastructure load
Geographic location of fleets
Alerts
Quality: Detection of "never seen before" entities (customers, ads, etc.)
Security: Detection of trespass / fraud / illegal activities
Safety: patient monitoring, automotive telematics
Operations: Detection of system / network overload
Improved operations
Advertising optimization
Personalization
Fleet rerouting
A stream processing solution needs to consume explicit or implicit event models from the batch processing platform. These event models define the schemas of incoming event data, such as records of calls into the customer contact center, copies of customer order transactions, or exogenous market data. Event models also specify:
Relationships (such as causation) among the event types
Calculations (for example, formulas to compute KPIs)
Alert thresholds (for example, "if average caller wait time exceeds 45 seconds, send a yellow warning by email")
Responses (for example, "trigger an exception process if the result of a customer credit check has not been received within two hours")
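The alert-threshold rule above ("if average caller wait time exceeds 45 seconds, send a yellow warning") can be sketched as a simple check. `WaitTimeAlert`, the method names, and the alert levels here are illustrative, not from the talk's codebase:

```java
// Hypothetical sketch of one event-model alert rule from the text:
// "if average caller wait time exceeds 45 seconds, send a yellow warning".
class WaitTimeAlert {
    static final double YELLOW_THRESHOLD_SECONDS = 45.0;

    /** Returns "YELLOW" when the average wait crosses the threshold, else "OK". */
    static String evaluate(double[] waitTimesSeconds) {
        double sum = 0;
        for (double w : waitTimesSeconds) sum += w;
        double avg = waitTimesSeconds.length == 0 ? 0 : sum / waitTimesSeconds.length;
        return avg > YELLOW_THRESHOLD_SECONDS ? "YELLOW" : "OK";
    }

    public static void main(String[] args) {
        // Average of 30, 40, 70 is about 46.7 s, so this crosses the threshold.
        System.out.println(evaluate(new double[] {30, 40, 70})); // prints: YELLOW
    }
}
```

In a real topology this check would live in a bolt's execute method, with the threshold and response loaded from the event model rather than hard-coded.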
Storm was benchmarked at processing one million 100-byte messages per second per node on hardware with the following specs:
Processor: 2× Intel E5645 @ 2.4 GHz
Memory: 24 GB
Add more types of data, and add prevent-and-optimize use cases
Getting started with Storm
Reading source code most helpful
Create a simple hello world topology and run it locally
Topologies are the application you will write and deploy to your cluster where it will run forever working on streams of data.
Each topology contains spouts and bolts
Spouts bring data into your topology by generating streams of tuples, reading from an external source like a queue or something on the internet (like Twitter).
Tuples are lists of values (string, int, boolean, or custom objects which require serializers)
Bolts process the tuples emitted by the spouts and also emit tuples themselves
Creating a simple Storm topology which demonstrates guaranteed message processing.
Create a counting spout connected to an unreliable bolt connected to an output bolt
Many different options for connecting things together: shuffle grouping means tuples are randomly distributed.
Can also group by a field (fieldsGrouping) or broadcast a tuple to all tasks (allGrouping)
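Fields grouping works by hashing the chosen field, so tuples with equal values always reach the same bolt task. A minimal illustration of that routing idea (not Storm's actual implementation, which has its own hashing and scheduling):

```java
class FieldsGroupingDemo {
    /** Pick a target task index from the grouped field's value. */
    static int targetTask(Object fieldValue, int numTasks) {
        // Math.floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The same key always lands on the same task, which is what
        // makes per-key caching inside a bolt effective later on.
        System.out.println(targetTask("customer-42", 4) == targetTask("customer-42", 4));
        // prints: true
    }
}
```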
Demonstrate an error scenario by using an unreliable bolt
Simple example of a spout which counts from 0 to 9
Open is called once for each instance of your spout.
Adding numbers 0-9 to an in-memory queue
Typically you will be reading from a real message queue
nextTuple is called repeatedly to get each tuple.
Here we are emitting one int: number
The second parameter is used for reprocessing in the event of a failure
declareOutputFields for specifying which fields you are emitting in nextTuple.
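The counting spout described above can be sketched without the Storm dependency. The interface here is a simplified stand-in for Storm's spout contract, reduced to the two methods discussed (open and nextTuple); all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified stand-in for Storm's spout contract, so the sketch
// compiles without storm-core on the classpath.
interface SimpleSpout {
    void open();         // called once per spout instance
    Integer nextTuple(); // called repeatedly; null when there is nothing to emit
}

class CountingSpout implements SimpleSpout {
    private final Queue<Integer> queue = new ArrayDeque<>();

    @Override
    public void open() {
        // In a real topology this would connect to an external message queue;
        // here we preload the numbers 0-9 in memory, as in the talk.
        for (int i = 0; i < 10; i++) queue.add(i);
    }

    @Override
    public Integer nextTuple() {
        // Real Storm also takes a message id here, used to replay on failure.
        return queue.poll();
    }

    public static void main(String[] args) {
        CountingSpout spout = new CountingSpout();
        spout.open();
        Integer n;
        while ((n = spout.nextTuple()) != null) System.out.print(n + " ");
        // prints: 0 1 2 3 4 5 6 7 8 9
    }
}
```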
An example implementation of an Unreliable Bolt (because it should fail 50% of the time)
Bolts also have a prepare and declareOutputFields method.
Execute is the main method where your processing will take place.
The input tuple was generated by our spout.
50% of the time, the tuple will fail.
Calling _collector.fail on a tuple will cause it to go back to the spout’s fail method.
In this simple example, I made number the same value as the tuple but in reality this might be a queued message ID.
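The fail path can be sketched as follows. The collector is a stub standing in for Storm's OutputCollector, and the talk's random 50% failure is replaced by a deterministic "every second tuple fails" so the example is reproducible:

```java
import java.util.ArrayList;
import java.util.List;

// Stub standing in for Storm's OutputCollector, so the sketch is self-contained.
class StubCollector {
    final List<Integer> acked = new ArrayList<>();
    final List<Integer> failed = new ArrayList<>();
    void ack(int tuple)  { acked.add(tuple); }   // tuple fully processed
    void fail(int tuple) { failed.add(tuple); }  // routed back to the spout's fail method
}

class UnreliableBolt {
    private final StubCollector collector;
    private int seen = 0;

    UnreliableBolt(StubCollector collector) { this.collector = collector; }

    // In Storm this logic would live in execute(Tuple input). The talk's
    // bolt failed 50% of the time at random; here every second tuple
    // fails so the behavior is deterministic.
    void execute(int tuple) {
        if (seen++ % 2 == 1) collector.fail(tuple);
        else collector.ack(tuple);
    }

    public static void main(String[] args) {
        StubCollector c = new StubCollector();
        UnreliableBolt bolt = new UnreliableBolt(c);
        for (int i = 0; i < 10; i++) bolt.execute(i);
        System.out.println(c.acked.size() + " acked, " + c.failed.size() + " failed");
        // prints: 5 acked, 5 failed
    }
}
```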
We ended up not really needing tuple reprocessing but I believe storm-jms has this built in if you need it.
Talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing.
We are using storm-hdfs to write all messages we receive straight into HDFS.
Also indexing our data in ElasticSearch in order to have a real-time dashboard for executives.
Influence the topology in “real-time” by reading from or writing to HBase
Careful: this will be SLOW compared to how fast you need to process messages in Storm. An HBase read takes ~20 ms; that is only 50 tuples/s per task!
Using storm-hdfs to stream data to HDFS for more analytics and storage
Put hive tables over top, run trends, etc.
Time based indexes (one per day)
Kibana dashboard on top of elasticsearch indexes
size: 14.3G (28.7G)
docs: 42,051,720 (42,051,720)
It is hard to optimize!
The storm UI will help you a lot with determining where the bottleneck is in your topology, but you will need to break out your bolts.
Capacity: if this is around 1.0, the bolt is running as fast as it can and you probably need to increase its parallelism.
Here I’ve prefixed my bolts with a number so they sort nicely in the Storm UI.
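The Storm UI's capacity figure is, roughly, the fraction of the measurement window the bolt spent executing tuples. A quick way to reason about the 1.0 threshold (the numbers below are made up for illustration):

```java
class BoltCapacity {
    /**
     * Rough capacity, as in the Storm UI: executed count times average
     * execute latency, as a fraction of the window. Near 1.0 means the
     * bolt is saturated and needs more parallelism.
     */
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return executed * executeLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        // 540,000 tuples at ~1 ms each over a 10-minute window: 90% busy.
        System.out.println(capacity(540_000, 1.0, 600_000)); // prints: 0.9
    }
}
```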
Custom Metrics were added in Storm 0.9.0 and allow you to collect a lot more information than what is displayed in the Storm UI.
Comes with some metrics out of the box, like CountMetric (e.g. cache hits, number of tuples processed).
Can create custom metrics by implementing the IMetric interface.
Register your metric in your spout’s open method or bolt’s prepare method.
When creating your topology, configure a consumer. LoggingMetricsConsumer comes out of the box and just logs to the metrics.log on one of the machines.
Can create your own consumers to stream to third party monitoring apps.
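Storm's IMetric contract is a single getValueAndReset() method, sampled periodically and handed to the configured consumer. A self-contained sketch of a CountMetric-style counter mirroring that contract (the interface is redeclared locally so the sketch compiles without storm-core):

```java
// Local copy of Storm's one-method metric contract.
interface IMetric {
    Object getValueAndReset(); // sampled periodically by the metrics consumer
}

class CountMetricSketch implements IMetric {
    private long count = 0;

    void incr() { count++; }
    void incrBy(long n) { count += n; }

    @Override
    public Object getValueAndReset() {
        // Each metrics-bucket interval, the consumer reads the count
        // and the metric starts over at zero.
        long value = count;
        count = 0;
        return value;
    }

    public static void main(String[] args) {
        CountMetricSketch hits = new CountMetricSketch();
        hits.incr();
        hits.incrBy(4);
        System.out.println(hits.getValueAndReset()); // prints: 5
        System.out.println(hits.getValueAndReset()); // prints: 0
    }
}
```

In a real bolt you would register an instance of this in prepare via the topology context, and the LoggingMetricsConsumer (or your own consumer) would receive the reset values each interval.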
We've identified a bottleneck in our topology (the filter bolt) using the Storm UI and Storm's metrics.
Increasing the parallelism of the bolt might help with our throughput. If it takes twice as long as our categorize bolt, we probably need to double the number of executors.
Configure workers, executors, and tasks when creating the topology.
Worker process…
Separate JVM
Runs executors
One send/receive thread per worker
Rule of thumb: Multiple of the number of machines in your cluster
Executors
Thread spawned by worker
Runs tasks serially
Rule of thumb: Multiple of the # of workers
Task
Runs your spouts and bolts
Cannot change the number of tasks after topology has been started
Rule of thumb: Multiple of the # of executors. Typically just have 1 per executor unless you plan on adding more nodes while the topology is running
Running more than one task per executor does not increase the level of parallelism!!!
Number of workers and executors can change, number of tasks cannot
http://stackoverflow.com/questions/17257448/what-is-the-task-in-twitter-storm-parallelism
Example: Storm running on 3 nodes.
Three workers, six executors, six tasks.
Workers <= Executors <= Tasks
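The 3-node example works out as follows; a small check of the worker/executor/task arithmetic (remembering that effective parallelism is the executor count, not the task count):

```java
class ParallelismMath {
    // Storm spreads executors evenly over workers, and tasks over executors.
    static int executorsPerWorker(int executors, int workers) { return executors / workers; }
    static int tasksPerExecutor(int tasks, int executors)     { return tasks / executors; }

    public static void main(String[] args) {
        // The 3-node example from the talk: 3 workers, 6 executors, 6 tasks.
        System.out.println(executorsPerWorker(6, 3)); // prints: 2 (threads per worker JVM)
        System.out.println(tasksPerExecutor(6, 6));   // prints: 1 (task per executor)
        // Effective parallelism here is 6 (the executors), not the task count.
    }
}
```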
If HBase calls take 20 ms, we’re going to have a bottleneck in our topology so we need caching.
fieldsGrouping + caching within bolts
Group by something that will be used as the key (or part of the key) to your cache. Same Tuples will be sent to the same bolt and increase the number of cache hits.
Create a RotatingMap (a LRU cache) in your bolt
Configure your bolt to receive Tick Tuples
Tick tuples sent to your bolt in addition to normal Tuples
Check to see if the tuple you received was a tick tuple and then rotate the cache every 300s
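The idea behind RotatingMap can be sketched as generation buckets: entries go into the newest bucket, and each rotation (driven by a tick tuple) drops the oldest bucket, expiring anything still in it. This is a minimal sketch of that idea; Storm's real RotatingMap also hands expired entries to a callback, and the tick-tuple plumbing is omitted here:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;

// Simplified sketch of Storm's RotatingMap: entries live in generation
// buckets; rotate() drops the oldest bucket, so untouched entries
// expire after numBuckets rotations.
class RotatingCache<K, V> {
    private final Deque<HashMap<K, V>> buckets = new ArrayDeque<>();

    RotatingCache(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) buckets.addFirst(new HashMap<>());
    }

    void put(K key, V value) {
        buckets.peekFirst().put(key, value); // newest bucket
    }

    V get(K key) {
        for (HashMap<K, V> bucket : buckets) {
            V v = bucket.get(key);
            if (v != null) return v;
        }
        return null;
    }

    /** Called from the bolt when a tick tuple arrives (e.g. every 300 s). */
    void rotate() {
        buckets.removeLast();              // expire the oldest generation
        buckets.addFirst(new HashMap<>()); // fresh bucket for new entries
    }

    public static void main(String[] args) {
        RotatingCache<String, Integer> cache = new RotatingCache<>(3);
        cache.put("branch-17", 42);
        cache.rotate();
        cache.rotate();
        System.out.println(cache.get("branch-17")); // prints: 42 (within 3 generations)
        cache.rotate();
        System.out.println(cache.get("branch-17")); // prints: null (expired)
    }
}
```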
Possible to develop in multiple languages, but java makes the most sense for getting started
Check out the storm-starter project on github for a great working example
Use git to clone the repository, set it up in your favorite IDE (Eclipse... haha, yeah right!), and set up Maven. Use the maven-shade-plugin to build your uber-jar.
Separate projects for major functionality; try to keep as little as possible in your Storm project. Use unit testing everywhere: it will save you time when you find bugs in the topology.
You can develop locally with just Eclipse and Storm. However, you will most likely also be using a lot of other Hadoop stuff (HDFS via storm-hdfs, HBase, etc.), so it might be helpful to get a single-node machine with everything installed.
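A typical maven-shade-plugin stanza for building the uber-jar looks like the following. The version and main class are placeholders; check the storm-starter pom for the authoritative setup:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- com.example.YourTopology is a placeholder for your topology's main class -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.YourTopology</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```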