This document provides a comparison of Hadoop and Storm, two frameworks for batch and real-time processing of data. It outlines key differences such as Hadoop focusing on batch jobs while Storm handles continuous real-time processing topologies. Components and concepts of Storm like spouts, bolts, streams and groupings are also explained. The document details how Storm provides fault tolerance and guarantees of at-least-once processing through its use of Zookeeper, anchoring tuples, and acker tasks.
Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
In this track we will introduce the Storm framework, explain some design concepts and considerations, and show some real-world examples of how to use it to process large amounts of data in real time, in a distributed environment. We will describe how this solution can be scaled very easily as more data needs to be processed.
We will explain all you need to know to get started with Storm and some tips on how to get your Spouts, Bolts and Topologies up and running in the cloud.
Apache Storm 0.9 basic training - Verisign (Michael Noll)
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013 (Sonal Raj)
This talk briefly outlines the Storm framework and Neo4J graph database, and how to compositely use them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
Learning Stream Processing with Apache Storm (Eugene Dvorkin)
Over the last couple of years, Apache Storm has become a de-facto standard for developing real-time analytics and complex event processing applications. Storm enables developers to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be applied are fraud detection, operational intelligence, machine learning, ETL, analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
These slides are from a brief seminar that I gave in a Ph.D. exam, "Perspective in Parallel Computing" (held by Prof. Marco Danelutto), at the University of Pisa (Italy).
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic skeleton parallel frameworks.
Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how well it integrates with Apache Cassandra.
There is also a segue into an example project that uses Storm and Cassandra to implement a scalable reactive web crawler.
http://github.com/tjake/stormscraper
Storm-on-YARN: Convergence of Low-Latency and Big-Data (DataWorks Summit)
Hadoop plays a central role in enabling Yahoo! to provide personalized experiences for our users and create value for our advertisers. In this talk, we will discuss the convergence of low-latency processing and the Hadoop platform. To enable this convergence, we have developed Storm-on-YARN, which allows Storm streaming/microbatch applications and Hadoop batch applications to be hosted in a single cluster. Storm applications can leverage YARN for resource management and apply Hadoop-style security to Hadoop datasets on HDFS and HBase. In Storm-on-YARN, YARN is used to launch the Storm application master (Nimbus) and enables Nimbus to request resources for Storm workers (Supervisors). The YARN resource manager and the Storm scheduler work together to support multi-tenancy and high availability. HDFS enables Storm to achieve higher availability of Nimbus itself. We are introducing Hadoop-style security into Storm through JAAS authentication (Kerberos and Digest). Storm servers (Nimbus and DRPC) will be configured with authorization plugins for access control and audit. The security context enables Storm applications to access only authorized datasets (including those created by Hadoop applications). Yahoo! is making our contributions to Storm and YARN available as open source. We will work with industry partners to foster the convergence of low-latency processing and big data.
Apache Storm and Twitter Streaming API integration (Uday Vakalapudi)
1) Storm is a distributed, real-time computation system.
2) The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a bolt, which either persists the data in some sort of storage or passes it to another bolt. You can imagine a Storm cluster as a chain of bolt components, each making some kind of transformation on the data exposed by the spout.
1) Real-time systems must guarantee data processing.
2) They should also be horizontally scalable, meaning that just adding a few nodes improves the capacity of the cluster.
3) They should be fault-tolerant, meaning that if an error occurs or a node goes down, the system keeps working without interruption.
4) Intermediate message brokers should be avoided: they are complex and slow, because instead of sending messages directly from producers to consumers, everything has to go through a third-party broker, which moreover persists the input data to disk. This whole process adds extra time to data processing.
5) Hadoop can tolerate downtime because it is a high-latency batch system: after a few hours of downtime you still have high latency. In a real-time system, however, a few hours of downtime means you are no longer real time, so the robustness requirements are much harder. Storm satisfies all of these properties.
1) Both Hadoop and Storm are distributed, fault-tolerant systems, but Hadoop is mainly used for batch processing, whereas Storm is used for real-time computation.
2) Storm doesn't have a built-in storage system; it mainly follows a "come and get some" strategy. Hadoop, on the other hand, has HDFS as its storage file system.
1) Both Storm and Flume are used for real-time data processing, but Flume does not give you a real-time computation system. Moreover, Flume depends on its channel (message broker) component for guaranteed data processing: the channel always persists the data before sending it to the consumer. Storm has no intermediate message broker concept; it stays as lightweight as possible. Whatever business logic you want to write goes into the Bolt component of Storm.
Developing Java Streaming Applications with Apache Storm (Lester Martin)
Apache Storm, http://storm.apache.org, is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. During this presentation, a simple Java-based streaming application will be built from scratch!
Code examples can be found at https://github.com/lestermartin/streaming-exploration.
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc... (Mia Yuan Cao)
Learn how to extract real-time insight from Big Data. JReport and ScaleDB’s combined solution delivers business value by ingesting Big Data at stunning velocity (millions of rows/second), then provides powerful visualizations, filtering and data analysis that enable you to draw quick conclusions to make agile business decisions. JReport's seamless connection to ScaleDB enables technical or non-technical users to build and modify their own reports and dashboards to visualize these vast data stores. Join us to see how.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
A tutorial presentation based on storm.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a Teaching Assistant for the Cloud Computing course of Dr. Amir H. Payberah in the spring semester of 2015.
Unraveling mysteries of the Universe at CERN, with OpenStack and Hadoop (Piotr Turek)
I will talk about the challenges faced, the lessons learned, and the fun I had while reinventing the way offline data analysis is done at one of the LHC (Large Hadron Collider) experiments: a journey which took us to another land, that of the contemporary Big Data stack, and which finally married the two. Did it make any sense in the end? Come and you will find out.
Among other things you will learn:
• the why, what and how of data analysis at CERN
• why latency variability in large distributed systems matters (literally ;))
• why using C++ as a scripting language is both the best and the worst idea ever
• how to implement a reliable Hadoop cluster provisioning mechanism on OpenStack
• how to marry a huge data analysis framework written in C++, with Hadoop 2
• what is the moral of this story
A Brief introduction to Apache Storm. Talk given at the October Toronto Java User Group meeting, video available at https://www.youtube.com/watch?v=CWyH4-SOGm8
STORM
1. STORM
COMPARISON – INTRODUCTION - CONCEPTS
PRESENTATION BY KASPER MADSEN
MARCH - 2012
2. HADOOP VS STORM

Hadoop                          Storm
Batch processing                Real-time processing
Jobs run to completion          Topologies run forever
JobTracker is a SPOF*           No single point of failure
Stateful nodes                  Stateless nodes
Scalable                        Scalable
Guarantees no data loss         Guarantees no data loss
Open source                     Open source

* Hadoop 0.21 added some checkpointing
SPOF: Single Point Of Failure
3. COMPONENTS

The Nimbus daemon is comparable to the Hadoop JobTracker; it is the master.
The Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker.
A worker is spawned by the supervisor, one per port defined in the storm.yaml configuration.
A task is run as a thread in a worker.
Zookeeper* is a distributed system used to store metadata. The Nimbus and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper. Notice that all communication between Nimbus and the Supervisors is done through Zookeeper.
On a cluster with 2k+1 Zookeeper nodes, the system can recover when at most k nodes fail.
* Zookeeper is an Apache top-level project
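The quorum arithmetic behind the 2k+1 rule can be checked with a few lines of plain Java (a sketch assuming a standard majority quorum; toleratedFailures is an illustrative helper, not a Zookeeper API):

```java
// Majority quorum: an ensemble of n Zookeeper nodes stays available
// as long as a majority of nodes survives, so it tolerates (n - 1) / 2
// failures. An ensemble of 2k+1 nodes therefore tolerates exactly k.
public class QuorumSketch {
    static int toleratedFailures(int n) {
        return (n - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 5, 7}) {
            System.out.println(n + " nodes tolerate "
                    + toleratedFailures(n) + " failures");
        }
    }
}
```

This also shows why even-sized ensembles buy nothing: 4 nodes tolerate the same single failure as 3.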
4. STREAMS

A stream is an unbounded sequence of tuples.
A topology is a graph where each node is a spout or a bolt, and the edges indicate which bolts are subscribing to which streams.
• A spout is a source of a stream
• A bolt consumes a stream (and possibly emits a new one)
• An edge represents a grouping
[Diagram: one spout is the source of stream A and another the source of stream B; one bolt subscribes to A and emits C, another subscribes to A and emits D, a third subscribes to C & D, and a fourth subscribes to A & B.]
5. GROUPINGS

Each spout or bolt runs X instances in parallel (called tasks).
Groupings are used to decide which task in the subscribing bolt a tuple is sent to:
• Shuffle grouping is a random grouping
• Fields grouping groups by value, such that equal values result in the same task
• All grouping replicates to all tasks
• Global grouping makes all tuples go to one task
• None grouping makes the bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping lets the producer (the task that emits) control which consumer will receive the tuple
[Diagram: a topology whose components run 4, 3, 2, and 2 tasks respectively.]
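The "equal value, same task" property of fields grouping can be modeled by hashing the grouping field modulo the task count. This is a sketch of the idea, not Storm's actual routing code; chooseTask is an illustrative helper:

```java
// Sketch of fields grouping: route each tuple by hashing its grouping
// field and taking the result modulo the number of tasks, so tuples
// with equal field values always land on the same task.
import java.util.List;

public class FieldsGroupingSketch {
    static int chooseTask(Object fieldValue, int numTasks) {
        // Math.floorMod keeps the index non-negative even for negative hashes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 3;
        for (String word : List.of("nathan", "mike", "nathan", "golda", "mike")) {
            System.out.println(word + " -> task " + chooseTask(word, numTasks));
        }
        // Repeated words always map to the same task index.
    }
}
```

Shuffle grouping, by contrast, would pick a task at random per tuple, and global grouping would always return the same task index.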
6. EXAMPLE (topology: TestWordSpout -> ExclamationBolt -> ExclamationBolt)

TopologyBuilder builder = new TopologyBuilder();

// Create a stream called "words", run 10 tasks
builder.setSpout("words", new TestWordSpout(), 10);

// Create a stream called "exclaim1", run 3 tasks,
// subscribe to stream "words" using shuffle grouping
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
       .shuffleGrouping("words");

// Create a stream called "exclaim2", run 2 tasks,
// subscribe to stream "exclaim1" using shuffle grouping
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
       .shuffleGrouping("exclaim1");

A bolt can subscribe to an unlimited number of streams, by chaining groupings.
The source code for this example is part of the storm-starter project on GitHub.
7. EXAMPLE – 1: TestWordSpout

public void nextTuple() {
    Utils.sleep(100);
    final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
    final Random rand = new Random();
    final String word = words[rand.nextInt(words.length)];
    _collector.emit(new Values(word));
}

The TestWordSpout emits a random string from the array words every 100 milliseconds.
8. EXAMPLE – 2: ExclamationBolt

OutputCollector _collector;

// prepare is called when the bolt is created
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _collector = collector;
}

// execute is called for each tuple
public void execute(Tuple tuple) {
    _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
    _collector.ack(tuple);
}

// declareOutputFields is called when the bolt is created
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
}

declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and to specify which stream to use when outputting tuples in the emit function call.
9. FAULT TOLERANCE

Zookeeper stores metadata in a very robust way.
Nimbus and Supervisor are stateless and only need metadata from Zookeeper to work/restart.
When a node dies:
• The tasks will time out and be reassigned to other workers by Nimbus.
When a worker dies:
• The supervisor will restart the worker.
• Nimbus will reassign the worker to another supervisor if no heartbeats are sent.
• If that is not possible (no free ports), the tasks will be run on other workers in the topology. If more capacity is added to the cluster later, Storm will automatically initialize a new worker and spread out the tasks.
When Nimbus or a Supervisor dies:
• Workers will continue to run.
• Workers cannot be reassigned without Nimbus.
• Nimbus and Supervisor should be run under a process monitoring tool that restarts them automatically if they fail.
10. AT-LEAST-ONCE PROCESSING
STORM guarantees at-least-once processing of tuples.
A message id gets assigned to a tuple when it is emitted from a spout or bolt. It is 64 bits long.
The tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple.
Ack is called on spout, when tree of tuples for spout tuple is fully processed.
Fail is called on spout, if one of the tuples in the tree of tuples fails or the tree of
tuples is not fully processed within a specified timeout (default is 30 seconds).
It is possible to specify the message id when emitting a tuple. This is useful for
replaying tuples from a queue.
The ack/fail method is called when the tree of
tuples has been fully processed, or has
failed / timed out
11. AT-LEAST-ONCE PROCESSING – 2
Anchoring is used to copy the spout tuple message id(s) to the new tuples
generated. In this way, every tuple knows the message id(s) of all spout tuples.
Multi-anchoring is when multiple tuples are anchored. If the tuple tree fails, then
multiple spout tuples will be replayed. Useful for doing streaming joins and more.
Ack called from a bolt indicates the tuple has been processed as intended
Fail called from a bolt replays the spout tuple(s)
Every tuple must be acked/failed or the task will run out of memory at some point.
_collector.emit(tuple, new Values(word)); Uses anchoring
_collector.emit(new Values(word)); Does NOT use anchoring
12. AT-LEAST-ONCE PROCESSING – 3
Acker tasks track the tree of tuples for every spout tuple
• The acker task responsible for a given spout tuple is determined by modulo
on message id. Since all tuples have all spout tuple message ids, it is easy
to call the correct acker tasks.
• The acker task stores a map; the format is {spoutMsgId, {spoutTaskId, ”ack val”}}
• ”ack val” represents the state of the entire tree of tuples. It is the XOR of
all tuple message ids created and acked in the tree of tuples.
• When ”ack val” is 0, the tuple tree is fully processed.
• Since message ids are random 64-bit numbers, the chances of ”ack val”
becoming 0 by accident are extremely small.
It is important to set the number of acker tasks in the topology when
processing large amounts of tuples (defaults to 1)
13. AT-LEAST-ONCE PROCESSING – 4
Example topology:
• Spout (task 1) emits ”hey” with msgId 10 to a bolt (task 2)
• The bolt (task 2) emits ”h” (msgId 2, spoutIds: 10) to a bolt (task 3)
  and ”ey” (msgId 3, spoutIds: 10) to a bolt (task 4)
Shows what happens in the acker task for one spout tuple. Format is: {spoutMsgId, {spoutTaskId, ”ack val”}}
1. After emit ”hey”: {10, {1, 0000 XOR 1010 = 1010}}
2. After emit ”h”: {10, {1, 1010 XOR 0010 = 1000}}
3. After emit ”ey”: {10, {1, 1000 XOR 0011 = 1011}} (4-bit ids shown; in reality ids are 64 bits)
4. After ack ”hey”: {10, {1, 1011 XOR 1010 = 0001}}
5. After ack ”h”: {10, {1, 0001 XOR 0010 = 0011}}
6. After ack ”ey”: {10, {1, 0011 XOR 0011 = 0000}}
7. Since ”ack val” is 0, the spout tuple with id 10 must be fully processed. Ack is called on the spout (task 1)
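The six XOR steps above can be run as plain Java (no Storm dependency; the 4-bit ids match the slide, while real Storm uses random 64-bit ids):

```java
// Runnable version of the acker's XOR bookkeeping for one spout tuple.
// Every emit and every ack XORs the tuple's message id into "ack val";
// the value returns to 0 exactly when the whole tree has been acked.
public class AckValWalkthrough {
    static long ackVal = 0;

    static void onEmit(long msgId) { ackVal ^= msgId; }
    static void onAck(long msgId)  { ackVal ^= msgId; }

    public static void main(String[] args) {
        onEmit(0b1010); // emit "hey" (id 10) -> ack val 1010
        onEmit(0b0010); // emit "h"   (id 2)  -> ack val 1000
        onEmit(0b0011); // emit "ey"  (id 3)  -> ack val 1011
        onAck(0b1010);  // ack  "hey"         -> ack val 0001
        onAck(0b0010);  // ack  "h"           -> ack val 0011
        onAck(0b0011);  // ack  "ey"          -> ack val 0000
        System.out.println(ackVal == 0 ? "tree fully processed" : "pending");
    }
}
```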
14. AT-LEAST-ONCE PROCESSING – 5
A tuple isn't acked because the task died:
The spout tuple(s) at the root of the tree of tuples will time out and be replayed.
Acker task dies:
All the spout tuples the acker was tracking will time out and be replayed.
Spout task dies:
In this case the source that the spout talks to is responsible for replaying the
messages. For example, queues like Kestrel and RabbitMQ will place all pending
messages back on the queue when a client disconnects.
15. AT-LEAST-ONCE PROCESSING – 6
At-least-once processing might process a tuple more than once.
Example (spout at task 1, all grouping to bolt tasks 2 and 3):
1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Task 2 has now executed the same bolt twice for the same spout tuple
Consider why the all grouping is not important in this example
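The failure-and-replay sequence can be simulated in plain Java (illustrative names; no Storm dependency) to show the duplicate execution on the surviving task:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of why at-least-once can duplicate work: if one task fails before
// acking, the whole spout tuple is replayed to every subscribed task, so the
// task that already succeeded executes the same tuple a second time.
public class ReplayDemo {
    static Map<Integer, Integer> executions = new HashMap<>(); // taskId -> count

    // Returns true if every task acked; false forces a replay.
    static boolean emitToAll(String tuple, boolean task3Fails) {
        executions.merge(2, 1, Integer::sum); // task 2 executes and acks
        if (task3Fails) return false;         // task 3 dies before acking
        executions.merge(3, 1, Integer::sum); // task 3 executes and acks
        return true;
    }

    public static void main(String[] args) {
        if (!emitToAll("tuple", true)) {  // first attempt: task 3's worker dies
            emitToAll("tuple", false);    // replay after the supervisor restarts it
        }
        System.out.println(executions);   // task 2 ran twice, task 3 once
    }
}
```

This is exactly the situation transactional topologies fix by re-instantiating bolts per attempt.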
16. EXACTLY-ONCE-PROCESSING
Transactional topologies (TT) are an abstraction built on STORM primitives.
TT guarantees exactly-once processing of tuples.
Acking is optimized in TT; there is no need to do anchoring or acking manually.
Bolts execute as new instances per attempt at processing a batch.
Example (spout at task 1, all grouping to bolt tasks 2 and 3):
1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Tasks 2 and 3 initiate new bolts because of the new attempt
6. Now there is no duplicate-processing problem
17. EXACTLY-ONCE-PROCESSING – 2
For efficiency, batch processing of tuples is introduced in TT
Batch has two states: processing or committing
Many batches can be in the processing state concurrently
Only one batch can be in the committing state, and a strong ordering is imposed. That
means batch 1 will always be committed before batch 2 and so on.
Types of bolts for TT: BasicBolt, BatchBolt, BatchBolt marked as committer
BasicBolt processes one tuple at a time.
BatchBolt processes batches. finishBatch is called when all tuples of a batch have been executed.
A BatchBolt marked as committer calls finishBatch only when the batch is in the
committing state.
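The strong commit ordering can be sketched in plain Java (illustrative names, no Storm dependency): batches may finish the processing state in any order, but only the lowest-numbered pending batch may enter the committing state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of TT's commit ordering: batch N never commits before batch N-1,
// no matter in which order the batches finish processing.
public class CommitOrderDemo {
    static List<Integer> commit(int[] processingDoneOrder) {
        Set<Integer> done = new TreeSet<>();
        List<Integer> committed = new ArrayList<>();
        int next = 1; // only batch `next` may enter the committing state
        for (int batch : processingDoneOrder) {
            done.add(batch);
            while (done.contains(next)) { committed.add(next); next++; }
        }
        return committed;
    }

    public static void main(String[] args) {
        // batches finish processing out of order...
        System.out.println(commit(new int[]{2, 3, 1})); // ...but commit as [1, 2, 3]
    }
}
```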
18. EXACTLY-ONCE-PROCESSING – 3
The transactional spout has the capability to replay exact batches of tuples.
Example topology: batchbolt A → batchbolt B (committer) → batchbolt C → batchbolt D (committer)
BATCH IS IN PROCESSING STATE
Bolt A: execute method is called for all tuples received from spout
finishBatch is called when the whole batch has been received
Bolt B: execute method is called for all tuples received from bolt A
finishBatch is NOT called because batch is in processing state
Bolt C: execute method is called for all tuples received from bolt A (and B)
finishBatch is NOT called, because bolt B has not called finishBatch
Bolt D: execute method is called for all tuples received from bolt C
finishBatch is NOT called because batch is in processing state
BATCH CHANGES TO COMMITTING STATE
Bolt B: finishBatch is called
Bolt C: finishBatch is called, because we know we got all tuples from Bolt B now
Bolt D: finishBatch is called, because we know we got all tuples from Bolt C now
19. EXACTLY-ONCE-PROCESSING – 4
Transactional spout internals:
• Coordinator: a regular spout with parallelism of 1. Defines two streams: batch and commit.
• Emitters: a regular bolt with parallelism of P, subscribed to the coordinator's batch stream with an all grouping.
When a batch should enter the processing state:
• The coordinator emits a tuple with the TransactionAttempt and the metadata for that
transaction to the batch stream.
• All emitter tasks receive the tuple and begin to emit their portion of tuples for
the given batch.
When the processing phase of a batch is done (determined by acker task):
• Ack gets called on the coordinator.
When ack gets called on the coordinator and all prior transactions have committed:
• The coordinator emits a tuple with the TransactionAttempt to the commit stream.
• All bolts marked as committers subscribe to the commit stream of the
coordinator using an all grouping.
• Bolts marked as committers now know the batch is in the committing phase.
When the batch is fully processed again (determined by acker task):
• Ack gets called on the coordinator.
• The coordinator knows the batch is now committed.
20. STORM LIBRARIES
STORM uses a lot of libraries. The most prominent are:
Clojure, a Lisp dialect for the JVM. Crash-course follows
Jetty, an embedded web server. Used to host the UI of Nimbus
Kryo, a fast serialization library, used when sending tuples
Thrift, a framework for building services. Nimbus is a Thrift daemon
ZeroMQ, a very fast transport layer
Zookeeper, a distributed system for storing metadata