Storm – Streaming Data Analytics at Scale - StampedeCon 2014

At StampedeCon 2014, Scott Shaw (Hortonworks) and Kit Menke (Enterprise Holdings) presented "Storm – Streaming Data Analytics at Scale"

Storm's primary purpose is to provide real-time analytics on fast-moving data before it is stored. Use cases range from fraud detection and machine learning to ETL.
Storm has been clocked at over one million tuples processed per second per node. It is fast, scalable, and language-agnostic. This session provides an architecture overview as well as a real-world discussion of Storm's use and implementation at Enterprise Holdings.


Speaker Notes

  • Real-time data integration
    – Analyze, clean, and normalize data with low latency
    Low-latency dashboards
    – Summing/aggregations for operational monitors, gauges, and counters
    – Orders, revenue, call volumes, infrastructure load
    – Geographic location of fleets
    Alerts
    – Quality: detection of "never seen before" entities (customers, ads, etc.)
    – Security: detection of trespass, fraud, or illegal activity
    – Safety: patient monitoring, automotive telematics
    – Operations: detection of system/network overload
    Improved operations
    – Advertising optimization
    – Personalization
    – Fleet rerouting
  • The stream processing solution needs to consume explicit or implicit event models from the batch processing platform. These event models define the schemas of incoming event data, such as records of calls into the customer contact center, copies of customer order transactions, or exogenous market data. Event models also specify:
    – Relationships (such as causation) among the event types
    – Calculations (for example, formulas to compute KPIs)
    – Alert thresholds (for example, "if average caller wait time exceeds 45 seconds, send a yellow warning by email")
    – Responses (for example, "trigger an exception process if the result of a customer credit check has not been received within two hours")
  • Storm was benchmarked at processing one million 100-byte messages per second per node on hardware with the following specs:
    – Processor: 2× Intel E5645 @ 2.4 GHz
    – Memory: 24 GB
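    For scale, that works out to roughly 10^6 msgs/s × 100 bytes/msg = 100 MB/s of raw message throughput per node.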
  • Add types of data, and add the prevent and optimize use cases.
  • Getting started with Storm:
    – Reading the source code is most helpful
    – Create a simple hello world topology and run it locally, as in the sketch below
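A minimal sketch of that local run, assuming the Storm 0.9.x (backtype.storm) API; CountingSpout is sketched in the notes below, and OutputBolt is a hypothetical bolt that just logs what it receives:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.utils.Utils;

public class HelloStormLocal {
    public static void main(String[] args) {
        // Wire a trivial topology: one spout feeding one bolt.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new CountingSpout());
        builder.setBolt("output", new OutputBolt()).shuffleGrouping("spout");

        Config conf = new Config();
        conf.setDebug(true); // log emitted tuples while experimenting

        // LocalCluster runs the whole topology inside one JVM -- no cluster needed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("hello-world", conf, builder.createTopology());
        Utils.sleep(10000); // let it run for ten seconds
        cluster.killTopology("hello-world");
        cluster.shutdown();
    }
}
```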
  • Topologies are the applications you write and deploy to your cluster, where they run forever, working on streams of data. Each topology contains spouts and bolts:
    – Spouts bring data into your topology by generating streams of tuples from an external source, like a queue or something on the internet (like Twitter)
    – Tuples are lists of values (string, int, boolean, or custom objects, which require serializers)
    – Bolts process the tuples emitted by the spouts and also emit tuples themselves
  • Creating a simple Storm topology which demonstrates guaranteed message processing (wired up in the sketch below):
    – Create a counting spout connected to an unreliable bolt connected to an output bolt
    – There are many different options for connecting components: shuffle grouping means tuples are randomly distributed; you can also group by a field or broadcast a tuple
    – Demonstrate an error scenario by using an unreliable bolt
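A sketch of that wiring under the same Storm 0.9.x assumptions (class names are illustrative, not the presenters' code):

```java
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;

public class CountingTopology {
    // counting spout -> unreliable bolt -> output bolt
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("counting-spout", new CountingSpout());

        // shuffleGrouping distributes tuples randomly across bolt instances.
        builder.setBolt("unreliable-bolt", new UnreliableBolt())
               .shuffleGrouping("counting-spout");

        // Alternatives: fieldsGrouping("counting-spout", new Fields("number"))
        // routes equal field values to the same instance; allGrouping
        // broadcasts every tuple to every instance.
        builder.setBolt("output-bolt", new OutputBolt())
               .shuffleGrouping("unreliable-bolt");
        return builder.createTopology();
    }
}
```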
  • A simple example of a spout which counts from 0 to 9 (sketched below):
    – open is called once for each instance of your spout; here it adds the numbers 0–9 to an in-memory queue, whereas typically you would read from a real message queue
    – nextTuple is called repeatedly to get each tuple; here we emit one int, number, and the second parameter is used for reprocessing in the event of a failure
    – declareOutputFields specifies which fields you emit in nextTuple
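A sketch of such a spout (Storm 0.9.x API; not the presenters' actual code):

```java
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class CountingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Queue<Integer> queue;

    @Override // called once per spout instance
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new LinkedList<Integer>();
        for (int i = 0; i < 10; i++) { // stand-in for a real message queue
            queue.add(i);
        }
    }

    @Override // called repeatedly by Storm to pull the next tuple
    public void nextTuple() {
        Integer number = queue.poll();
        if (number != null) {
            // The second argument is the message ID used to replay the tuple
            // on failure; here the number doubles as its own ID.
            collector.emit(new Values(number), number);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}
```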
  • An example implementation of an unreliable bolt, so called because it fails 50% of the time (sketched below):
    – Bolts also have prepare and declareOutputFields methods
    – execute is the main method where your processing takes place
    – The input tuple was generated by our spout; 50% of the time, the tuple will fail
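A sketch of the unreliable bolt under the same assumptions:

```java
import java.util.Map;
import java.util.Random;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class UnreliableBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Random random;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override // main processing method: ack or fail each input tuple
    public void execute(Tuple input) {
        if (random.nextBoolean()) {
            // Anchor the emitted tuple to the input, then ack the input.
            collector.emit(input, new Values(input.getInteger(0)));
            collector.ack(input);
        } else {
            // Failing sends the tuple's message ID back to the spout's fail().
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}
```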
  • Calling _collector.fail on a tuple will cause it to go back to the spout's fail method (see the sketch below). In this simple example, I made number the same value as the tuple's message ID, but in reality this might be a queued message ID. We ended up not really needing tuple reprocessing, but I believe storm-jms has this built in if you need it.
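In the spout, that means overriding the no-op ack/fail methods inherited from BaseRichSpout; a sketch continuing the CountingSpout above:

```java
// Inside CountingSpout: Storm hands back the message ID we emitted with.
@Override
public void fail(Object msgId) {
    // The number doubled as its own message ID, so replay is just re-queueing.
    queue.add((Integer) msgId);
}

@Override
public void ack(Object msgId) {
    // Fully processed downstream; nothing to clean up in this simple example.
}
```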
  • We've talked about bringing data into your topology and processing it; most likely you will also want to persist it somewhere for additional processing:
    – We are using storm-hdfs to write all messages we receive straight into HDFS
    – We also index our data in ElasticSearch in order to have a real-time dashboard for executives
    – You can influence the topology in "real time" by reading from or writing to HBase
    Warning: this stuff will be SLOW compared to how fast you need to process messages in Storm. An HBase read takes 20 ms; that is only 50 tuples/s!
  • Using storm-hdfs to stream data to HDFS for more analytics and storage (configured roughly as in the sketch below); put Hive tables over top, run trends, etc.
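A configuration sketch following the storm-hdfs README pattern; the namenode URL, output path, and tuning values are placeholders, not Enterprise's actual settings:

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

public class HdfsPersistence {
    public static HdfsBolt buildHdfsBolt() {
        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")                // placeholder namenode
                .withFileNameFormat(new DefaultFileNameFormat()
                        .withPath("/storm/messages/"))            // placeholder HDFS path
                .withRecordFormat(new DelimitedRecordFormat()
                        .withFieldDelimiter("|"))                 // pipe-delimited tuple fields
                .withSyncPolicy(new CountSyncPolicy(1000))        // sync to HDFS every 1000 tuples
                .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB)); // new file at 5 MB
    }
}
```

Once files land in HDFS, external Hive tables can be defined over the output directory for the trend queries mentioned above.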
  • Time-based indexes (one per day), with a Kibana dashboard on top of the ElasticSearch indexes
    – Size: 14.3 GB (28.7 GB)
    – Docs: 42,051,720
  • It is hard to optimize! The Storm UI will help you a lot with determining where the bottleneck is in your topology, but you will need to break out your bolts.
    – Capacity: if this is around 1.0, the bolt is running as fast as it can and you probably need to increase your parallelism
    – Here I've prefixed my bolts with a number so they sort nicely in the Storm UI
  • Custom metrics were added in Storm 0.9.0 and allow you to collect a lot more information than what is displayed in the Storm UI (see the sketch below):
    – Storm comes with some metrics out of the box, like CountMetric (cache hits, number of tuples processed)
    – You can create custom metrics by implementing the IMetric interface
    – Register your metric in your spout's open method or your bolt's prepare method
    – When creating your topology, configure a consumer; LoggingMetricsConsumer comes out of the box and just logs to metrics.log on one of the machines
    – You can create your own consumers to stream to third-party monitoring apps
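A sketch of both halves (a built-in CountMetric registered in prepare, plus the out-of-the-box LoggingMetricsConsumer attached when building the topology), assuming the Storm 0.9.x API:

```java
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.metric.LoggingMetricsConsumer;
import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class MeteredBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient CountMetric tupleCount;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Report this metric to the configured consumers every 60 seconds.
        this.tupleCount = context.registerMetric("tuple_count", new CountMetric(), 60);
    }

    @Override
    public void execute(Tuple input) {
        tupleCount.incr(); // count every tuple processed
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }

    // When building the topology, attach the out-of-the-box consumer, which
    // writes to metrics.log on one of the worker machines:
    public static void configureMetrics(Config conf) {
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
    }
}
```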
  • We've identified a bottleneck in our topology (the filter bolt) using the Storm UI and Storm's metrics. Increasing the parallelism of the bolt might help with our throughput: if it takes twice as long as our categorize bolt, we probably need to double the number of executors.
  • Configure workers, executors, and tasks when creating the topology (see the sketch below).
    Worker process
    – A separate JVM that runs executors, with one send/receive thread per worker
    – Rule of thumb: a multiple of the number of machines in your cluster
    Executor
    – A thread spawned by a worker; runs tasks serially
    – Rule of thumb: a multiple of the number of workers
    Task
    – Runs your spouts and bolts
    – You cannot change the number of tasks after the topology has been started
    – Rule of thumb: a multiple of the number of executors; typically just one per executor unless you plan on adding more nodes while the topology is running
    – Running more than one task per executor does not increase the level of parallelism!
    The number of workers and executors can change; the number of tasks cannot.
    http://stackoverflow.com/questions/17257448/what-is-the-task-in-twitter-storm-parallelism
    Example: Storm running on 3 nodes – three workers, six executors, six tasks (Workers <= Executors <= Tasks).
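A sketch of those knobs for the three-node example (three workers, six executors for the bolt, six tasks); CountingSpout and FilterBolt are illustrative names:

```java
import backtype.storm.Config;
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static StormTopology build(Config conf) {
        conf.setNumWorkers(3); // one worker JVM per machine on a 3-node cluster

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new CountingSpout()); // parallelism hint defaults to 1

        // Parallelism hint = 6 executor threads, each running 1 task.
        // Raise setNumTasks above the hint only if you plan to add nodes
        // later: the task count is fixed after submission, while executors
        // can be changed with `storm rebalance` up to the task count.
        builder.setBolt("filter", new FilterBolt(), 6)
               .setNumTasks(6)
               .shuffleGrouping("spout");
        return builder.createTopology();
    }
}
```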
  • If HBase calls take 20 ms, we're going to have a bottleneck in our topology, so we need caching: fieldsGrouping plus caching within bolts (sketched below).
    – Group by something that will be used as the key (or part of the key) to your cache; the same tuples will then be sent to the same bolt, increasing the number of cache hits
    – Create a RotatingMap (an LRU cache) in your bolt
    – Configure your bolt to receive tick tuples, which are sent to your bolt in addition to normal tuples
    – Check whether the tuple you received was a tick tuple, and rotate the cache every 300 s
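A sketch of the tick-tuple-plus-RotatingMap pattern, with a stubbed lookup standing in for the real 20 ms HBase read:

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import backtype.storm.utils.RotatingMap;

public class CachingBolt extends BaseRichBolt {
    private OutputCollector collector;
    private RotatingMap<String, String> cache;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // 3 buckets: entries survive 2-3 rotations before being expired.
        this.cache = new RotatingMap<String, String>(3);
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 300 seconds.
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 300);
        return conf;
    }

    private static boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public void execute(Tuple input) {
        if (isTickTuple(input)) {
            cache.rotate(); // expire the oldest bucket
            return;
        }
        // "key" matches the field used in fieldsGrouping, so equal keys
        // always land on the same bolt instance and hit its cache.
        String key = input.getStringByField("key");
        String value = cache.get(key);
        if (value == null) {
            value = slowHBaseLookup(key); // the 20 ms call we want to avoid
            cache.put(key, value);
        }
        collector.ack(input);
    }

    private String slowHBaseLookup(String key) {
        return "value-for-" + key; // stand-in for a real HBase read
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```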
  • It is possible to develop in multiple languages, but Java makes the most sense for getting started.
    – Check out the storm-starter project on GitHub for a great working example
    – Use git to clone the repository, set it up in your favorite IDE (Eclipse, haha, yeah right!), and set up Maven; use maven-shade-plugin to build your uber-jar
    – Use separate projects for major functionality, and try to keep as little as possible in your Storm project; use unit testing everywhere – it will save you time when you find bugs in the topology
    – You can develop locally with just Eclipse and Storm; however, you will most likely also be using a lot of other Hadoop stuff (HDFS – check out storm-hdfs – HBase, etc.), so it might be helpful to get a single-node machine with everything installed

Presentation Transcript

  • Stream Processing with Apache Storm – Spring 2014, Version 1.0
    Kit Menke, Lead Software Engineer, EHI
    Scott Shaw, Solutions Engineer, Hortonworks
  • © Hortonworks Inc. 2013 – Stream Processing in Hadoop
    Stream processing has emerged as a key use case, driven by new types of data:
    – Sensor/machine data
    – Server logs
    – Clickstream
    Storm with Hadoop enables new business opportunities:
    – Low-latency dashboards
    – Quality, security, safety, and operations alerts
    – Improved operations
    – Real-time data integration
    (Diagram: a multi-use data platform for batch, interactive, online, and streaming workloads – HDFS2 (redundant, reliable storage) and YARN (cluster resource management) underneath MapReduce (batch), Tez (interactive), and Apache Storm (streaming) on Hadoop 2.1.)
  • © Hortonworks Inc. 2013 – Typical stream processing workflow
    (Diagram: real-time data feeds flow into the stream processing solution, which persists data to a relational or non-relational data store for batch processing; batch feeds update the event models – pattern templates, KPIs & alerts – and results feed dashboards & applications.)
  • © Hortonworks Inc. 2013 – Stream processing is very different from batch (each factor below: real-time vs. batch)
    – Data freshness: real-time (usually < 15 min) vs. historical (usually more than 15 min old)
    – Location: primarily memory, moved to disk after processing vs. primarily disk, moved to memory for processing
    – Processing speed: sub-second to a few seconds vs. a few seconds to hours
    – Frequency: always running vs. sporadic to periodic
    – Clients (who?): automated systems only vs. human & automated systems
    – Clients (type): primarily operational systems vs. primarily analytical applications
  • © Hortonworks Inc. 2013 – Key requirements of a streaming solution
    – Data ingest: extremely high ingest rates – millions of events/second
    – Processing: ability to easily plug in different processing frameworks; guaranteed processing – at-least-once processing semantics
    – Persistence: ability to persist data to multiple relational and non-relational data stores
    – Operations: security, HA, fault tolerance & management support
  • © Hortonworks Inc. 2013 – Apache Storm: leading platform for stream processing
    An open-source, real-time event stream processing platform that provides fixed, continuous & low-latency processing for very high-frequency streaming data.
    – Highly scalable: horizontally scalable like Hadoop; e.g., a 10-node cluster can process 1M tuples per second per node
    – Fault-tolerant: automatically reassigns tasks on failed nodes
    – Guarantees processing: supports at-least-once & exactly-once processing semantics
    – Language agnostic: processing logic can be defined in any language
    – Apache project: brand, governance & a large, active community
  • © Hortonworks Inc. 2013 – The pattern driving MOST streaming use cases: monitor real-time data (sentiment, clickstream, machine/sensor, logs, geo-location) to prevent and optimize
    – Finance: prevent securities fraud and compliance violations; optimize order routing and pricing
    – Telco: prevent security breaches and network outages; optimize bandwidth allocation and customer service
    – Retail: optimize offers and pricing
    – Manufacturing: prevent machine failures; optimize the supply chain
    – Transportation: prevent driver & fleet issues; optimize routes and pricing
    – Web: prevent application failures and operational issues; optimize site content
  • © Hortonworks Inc. 2013 – Storm use cases: an IT operations view
    – Continuous processing: continuously ingest high-rate messages, process them, and update data stores
    – High-speed data aggregation: aggregate multiple data streams that emit data at extremely high rates into one central data store
    – Data filtering: filter out unwanted data on the fly before it is persisted to a data store
    – Distributed RPC response-time reduction: extremely resource-intensive (CPU, memory, or I/O) processing that would take a long time on a single machine can be parallelized with Storm to reduce response times to seconds
  • © Hortonworks Inc. 2013 – Key Constructs in Apache Storm
    – Tuples, streams, spouts, and bolts
    – Topology
    – Field grouping
    – Components and topology submission
    – Parallelism
    – Processing guarantees
  • © Hortonworks Inc. 2013 – Tuples and Streams
    – What is a tuple? The fundamental data structure in Storm: a named list of values that can be of any data type
    – What is a stream? An unbounded sequence of tuples; the core abstraction in Storm and what you "process" in Storm
  • © Hortonworks Inc. 2013 – Spouts
    – What is a spout? A generator, or source, of streams – e.g., a JMS, Twitter, log, or Kafka spout
    – You can spin up multiple instances of a spout and dynamically adjust as needed
  • © Hortonworks Inc. 2013 – Bolts
    – What is a bolt? It processes any number of input streams and produces output streams
    – Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, and alerting logic
    – You can spin up multiple instances of a bolt and dynamically adjust as needed
    Examples of bolts:
    1. HBaseBolt: persist a stream into HBase
    2. HDFSBolt: persist a stream into HDFS as Avro files using Flume
    3. MonitoringBolt: read from HBase and create alerts via email and messaging queues if given thresholds are exceeded
  • © Hortonworks Inc. 2013 – Topology
    – What is a topology? A network of spouts and bolts wired together into a workflow
    (Diagram: a Truck-Event-Processor topology, where a Kafka spout feeds streams into HBase, monitoring, HDFS, and WebSocket bolts.)
  • © Hortonworks Inc. 2013 – Storm Components and Topology Submission
    (Diagram: a topology is submitted to Nimbus, which spreads Kafka spout, HDFS bolt, HBase bolt, and monitor bolt tasks across the supervisor slave nodes, coordinated by a Zookeeper ensemble.)
    Nimbus (management server, running as the YARN app master node)
    – Similar to the job tracker
    – Distributes code around the cluster, assigns tasks, and handles failures
    Supervisors (worker nodes)
    – Similar to task trackers
    – Run bolts and spouts as "tasks"
    Zookeeper
    – Cluster coordination, Nimbus HA, and storage of cluster metrics
  • © Hortonworks Inc. 2013 – Processing Guarantees in Storm
    – At least once: achieved by replaying tuples on failure; applicable when processing does not need to be ordered and you need extremely low-latency processing
    – Exactly once: achieved with transactional topologies (now implemented using Trident); applicable when you need ordered processing – global counts, context-aware (causality-based) processing – and latency is not important
  • Implementing Storm – Kit Menke, Lead Software Engineer at Enterprise Holdings, Inc., May 2014
  • Implementing Storm – Spring 2014, Version 1.0: Real-World Scenarios
  • Overview
    – Storm Terminology
    – Creating a Topology
    – Persisting Data from Storm
    – Topology Performance
    – Custom Metrics
    – Workers, Executors, and Tasks
    – Caching within a Bolt
    – Environment Setup
  • Storm Terminology
    – Topologies run on your Hadoop cluster: an uber-jar with spouts and bolts that runs forever
    – Spouts generate streams of tuples
    – Tuples are lists of values
    – Bolts process tuples (and emit tuples)
    (Diagram: a topology in which a spout streams tuples into bolts A, B, and 1.)
  • Storm Topology Example: a counting topology (spout → unreliable bolt → output bolt) demonstrating guaranteed message processing
  • Storm Spout Example
  • Storm Bolt Example
  • Failing a Tuple
    1. Spout emits a tuple
    2. Bolt fails the tuple
    3. Spout receives the failed message ID
  • Persisting Data
    – Write to HDFS using storm-hdfs for long-term storage
  • (Screenshot: files in Hue written by storm-hdfs)
  • Persisting Data
    – Write to HDFS using storm-hdfs for long-term storage
    – Index data in ElasticSearch or Solr for real-time dashboards
  • (Screenshot: ElasticSearch + Kibana dashboard)
  • Persisting Data
    – Write to HDFS using storm-hdfs for long-term storage
    – Index data in ElasticSearch or Solr for real-time dashboards
    – Insert messages into a database or message queue
    – Read from and write to HBase to influence the topology in real time
  • Topology Performance
    – The Storm UI shows capacity
    – Break out your bolts to find bottlenecks!
  • Custom Metrics
    – New in Storm 0.9.0
    – Out-of-the-box metrics, e.g. CountMetric
    – Create a custom metric by implementing IMetric
    – Register the metric on spout/bolt startup
    – Set the topology to consume the metrics stream
  • Topology Performance: the filter bolt is our bottleneck!
  • Workers, Executors, and Tasks
    – Workers: separate JVMs that run executors
    – Executors: separate threads that run tasks
    – Tasks: your spout or bolt code
    – Running more than one task per executor does not increase the level of parallelism!
    Workers <= Executors <= Tasks
  • Caching inside a Bolt
    – Use a RotatingMap with tick tuples
    – Use fieldsGrouping to ensure cache hits
  • Environment Setup
    – storm-starter project on GitHub
    – Git, Eclipse, Maven
    – Unit test!
    – Develop locally or on a single-node Hadoop machine
    – Read the source code
  • Questions?