Storm is a fast, scalable, fault-tolerant, and easy-to-operate distributed realtime computation system. It guarantees that messages will be processed, letting you process big data streams reliably in real time. Storm was originally developed by Nathan Marz at BackType (acquired by Twitter) and is written in Java and Clojure. It uses a simple programming model and scales to large clusters, making it suitable for processing millions of events per second.
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
These slides are for a brief seminar that I gave in a Ph.D. exam, "Perspective in Parallel Computing" (held by Prof. Marco Danelutto), at the University of Pisa (Italy).
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic skeleton parallel frameworks.
A tutorial presentation based on storm.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a teaching assistant for Dr. Amir H. Payberah's Cloud Computing course in the spring semester of 2015.
Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how well it integrates with Apache Cassandra.
There is also a segue into an example project that uses Storm and Cassandra to implement a scalable, reactive web crawler.
http://github.com/tjake/stormscraper
Apache Storm and Twitter Streaming API integration - Uday Vakalapudi
1) Storm is a distributed, real-time computation system.
2) The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a bolt, which either persists the data in some form of storage or passes it on to another bolt. You can think of a Storm cluster as a chain of bolts, each applying some transformation to the data emitted by the spout.
1) Real-time systems must guarantee that data is processed.
2) They should also be horizontally scalable: adding a few nodes should be enough to improve the throughput of the cluster.
3) They should be fault-tolerant: if an error occurs or a node goes down, the system should keep working without interruption.
4) We want to get rid of intermediate message brokers, because they add complexity and latency: instead of messages going directly from producers to consumers, every message passes through a third-party broker, which typically persists the data to disk first. This whole process adds extra time to data processing.
5) For Hadoop, a few hours of downtime is tolerable: Hadoop is a high-latency (batch) system anyway, so with downtime you still just have high latency. In a real-time system, a few hours of downtime means you are no longer real time, so the robustness requirements are much harder. Storm satisfies all of these properties.
1) Both Hadoop and Storm are distributed, fault-tolerant systems, but Hadoop is mainly used for batch processing, whereas Storm is used for real-time computation.
2) Storm has no built-in storage system; it is built on a "come and get some" strategy. Hadoop, on the other hand, has HDFS as its storage file system.
1) Both Storm and Flume are used for real-time data processing, but Flume does not give you real-time computation. Moreover, Flume depends on its channel component, a message broker, for guaranteed data delivery: the channel always persists data before sending it to the consumer. Storm has no intermediate-broker concept; it stays as lightweight as possible, and whatever business logic you want to write goes into a Storm bolt.
Apache Storm 0.9 basic training (Verisign) - Michael Noll
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Learning Stream Processing with Apache StormEugene Dvorkin
Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of big data. Storm enables companies to have "fast data" alongside "big data". Some use cases where Storm can be applied are fraud detection, operational intelligence, machine learning, ETL, analytics, etc.
In this meetup, Eugene Dvorkin, Architect @ WebMD and NYC Storm User Group organizer, will teach Apache Storm and stream processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
PHP Backends for Real-Time User Interaction using Apache Storm - DECK36
Engaging users in real time is the topic of our times. Whether it's a game, a shop, or a content network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation for how to use it with PHP. With that, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams.
During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten.
In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (Tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
• What is Storm?
• Who uses Storm?
• Storm vs. Hadoop
• Storm Components
• Storm Topology
• Storm Primitives
• Why is Storm ideal for real-time processing?
2. About Me
Eiichiro Uchiumi
• A solutions architect working in emerging enterprise technologies
- Cloud transformation
- Enterprise mobility
- Information optimization (big data)
https://github.com/eiichiro
@eiichirouchiumi
http://www.facebook.com/eiichiro.uchiumi
3. What is Stream Processing?
Stream processing is a technical paradigm for processing big-volume, unbounded sequences of tuples in realtime, for example:
• Algorithmic trading
• Sensor data monitoring
• Continuous analytics
(Diagram: a source emits a stream of tuples into a stream processor.)
4. What is Storm?
Storm is a distributed realtime computation system that is:
• Fast & scalable
• Fault-tolerant
• Guaranteed to process every message
• Easy to set up & operate
• Free & open source
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
5. Conceptual View
(Diagram: two spouts feeding a graph of bolts, with tuples flowing along the edges.)
Stream: unbounded sequence of tuples
Tuple: list of name-value pairs
Spout: source of streams
Bolt: consumer of streams; does some processing and possibly emits new tuples
Topology: graph of computation, composed of spouts/bolts as the nodes and streams as the edges
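These conceptual pieces map directly onto Storm's topology-building API. The following is a minimal sketch against the Storm 0.9 (`backtype.storm`) API; the component names are illustrative, and the spout/bolt classes are assumed to be like the ones shown later in this deck (it needs the storm-core dependency to compile):

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class TopologySketch {
    public static void main(String[] args) {
        // Spouts and bolts are the nodes; stream groupings define the edges.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new RandomSentenceSpout());
        // shuffleGrouping: distribute tuples randomly across the bolt's tasks
        builder.setBolt("split", new SplitSentenceBolt())
               .shuffleGrouping("sentences");
        // fieldsGrouping: tuples with the same "word" value always go to the same task
        builder.setBolt("count", new WordCountBolt())
               .fieldsGrouping("split", new Fields("word"));
    }
}
```

Each `setBolt`/`setSpout` call adds a node to the topology graph; the grouping call on the returned declarer says which upstream stream it consumes and how tuples are partitioned.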
6. Physical View
(Diagram: a Nimbus node and a ZooKeeper ensemble coordinating several worker nodes; each worker node runs a Supervisor and N worker processes, each worker process runs N executors, and each executor runs N tasks.)
Nimbus: master daemon process, responsible for
• distributing code
• assigning tasks
• monitoring failures
ZooKeeper: stores the cluster's operational state
Supervisor: worker daemon process listening for work assigned to its node
Worker: Java process that executes a subset of a topology
Executor: Java thread spawned by a worker; runs one or more tasks of the same component
Task: a component (spout/bolt) instance that performs the actual data processing
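These daemons find each other through configuration. As a rough sketch (hostnames and ports are hypothetical, not from the deck), a `storm.yaml` on a Storm 0.9 worker node might look like:

```yaml
# ZooKeeper ensemble storing the cluster's operational state
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
# Host running the Nimbus master daemon
nimbus.host: "nimbus.example.com"
# Ports available for worker processes on this node (i.e. 4 worker slots)
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```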
7. Spout

import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random random;

    @Override
    public void open(Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        random = new Random();
    }

    @Override
    public void nextTuple() {
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        String sentence = sentences[random.nextInt(sentences.length)];
        collector.emit(new Values(sentence));
    }
8. Spout (continued)

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    @Override
    public void ack(Object msgId) {}

    @Override
    public void fail(Object msgId) {}
}
11. Starting Topology
> bin/storm jar
1. StormSubmitter uploads the topology JAR, with its dependencies, to Nimbus' inbox
2. StormSubmitter submits the topology configuration as JSON and its structure as Thrift to the Nimbus Thrift server
3. Nimbus copies the topology JAR, configuration, and structure into its local file system
4. Nimbus sets up static information for the topology in ZooKeeper
5. Nimbus makes the assignment
6. Nimbus starts the topology
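On the client side, this flow starts from a main class that builds the topology and hands it to StormSubmitter. A minimal, hypothetical sketch (topology and class names are illustrative; requires the storm-core dependency):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new RandomSentenceSpout());
        // ... bolts wired here ...
        Config conf = new Config();
        // Uploads the JAR to Nimbus' inbox, then submits the topology
        // configuration (JSON) and structure (Thrift) to the Nimbus Thrift server
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```

Packaged into a JAR and launched with `bin/storm jar`, this class is what triggers the upload and submission steps listed above.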
13. What is Storm?
Storm is a distributed realtime computation system that is:
• Fast & scalable
• Fault-tolerant
• Guaranteed to process every message
• Easy to set up & operate
• Free & open source
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
15. Parallelism
Example topology: RandomSentenceSpout -> SplitSentenceBolt -> WordCountBolt
• RandomSentenceSpout: parallelism hint = 2; number of tasks not specified, so same as the parallelism hint = 2
• SplitSentenceBolt: parallelism hint = 4; number of tasks = 8
• WordCountBolt: parallelism hint = 6; number of tasks not specified = 6
• Number of topology workers = 4
• Number of worker slots per node = 4; number of worker nodes = 2
• Number of executor threads = 2 + 4 + 6 = 12
• Number of component instances = 2 + 8 + 6 = 16
(Diagram: the 12 executor threads and 16 spout/bolt instances spread across 4 worker processes on 2 worker nodes.)
A topology can be spread out manually, without downtime, when a worker node is added.
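The numbers above correspond to settings like the following (a sketch against the Storm 0.9 API, using the class names from this deck; requires the storm-core dependency):

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // parallelism hint = 2; tasks not specified, so 2 tasks in 2 executors
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);
        // parallelism hint = 4, but 8 tasks: each executor runs 2 tasks
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .setNumTasks(8)
               .shuffleGrouping("sentences");
        // parallelism hint = 6; tasks not specified, so 6 tasks in 6 executors
        builder.setBolt("count", new WordCountBolt(), 6)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(4);  // 4 worker processes for the topology
    }
}
```

Total: 2 + 4 + 6 = 12 executor threads running 2 + 8 + 6 = 16 component instances, spread across the 4 workers.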
16. Message Passing
(Diagram: inside a worker process, each executor has its own receive queue and internal transfer queue; a receive thread routes tuples arriving from other workers to the executors, and a transfer thread drains the worker's transfer queue toward other workers.)
• Interprocess communication is mediated by ZeroMQ; transfer outside the worker is done with Kryo serialization
• Local communication is mediated by the LMAX Disruptor; transfer inside the worker is done with no serialization
17. LMAX Disruptor
A large concurrent "magic ring buffer" that can be used like a blocking queue between a producer and a consumer:
• The consumer can easily keep up with the producer by batching
• CPU-cache friendly: the ring is implemented as an array, so the entries can be preloaded
• GC safe: the entries are preallocated up front and live forever
6 million orders per second can be processed on a single thread at LMAX.
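A toy illustration of the properties above (this is not the real Disruptor, just a single-producer/single-consumer sketch): entries live in a preallocated array that is overwritten in place, and the consumer drains everything published so far in one batch.

```java
import java.util.concurrent.atomic.AtomicLong;

class ToyRingBuffer {
    private final long[] entries;                 // preallocated up front, reused forever
    private final int mask;                        // size must be a power of two
    private final AtomicLong published = new AtomicLong(-1);
    private long consumed = -1;

    ToyRingBuffer(int sizePowerOfTwo) {
        entries = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    // Single producer: claim the next slot, write it, then publish the sequence.
    void publish(long value) {
        long seq = published.get() + 1;
        entries[(int) (seq & mask)] = value;       // overwrite in place, no allocation
        published.set(seq);
    }

    // Single consumer: snapshot the published sequence once, then batch-drain.
    long drainSum() {
        long sum = 0;
        long upTo = published.get();
        while (consumed < upTo) {
            consumed++;
            sum += entries[(int) (consumed & mask)];
        }
        return sum;
    }
}

public class ToyRingBufferDemo {
    public static void main(String[] args) {
        ToyRingBuffer ring = new ToyRingBuffer(8);
        for (long i = 1; i <= 5; i++) ring.publish(i);
        System.out.println(ring.drainSum()); // 1+2+3+4+5 = 15
    }
}
```

The batching happens in `drainSum`: one read of the producer's cursor lets the consumer process every outstanding entry, which is how a slow consumer catches up without per-item coordination.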
18. What is Storm?
Storm is a distributed realtime computation system that is:
• Fast & scalable
• Fault-tolerant
• Guaranteed to process every message
• Easy to set up & operate
• Free & open source
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
19. Fault-tolerance
Cluster works normally:
• Nimbus monitors the cluster state through ZooKeeper
• The supervisor synchronizes assignments through ZooKeeper and sends its heartbeat
• The supervisor reads worker heartbeats from the local file system
• The worker sends executor heartbeats to ZooKeeper
20. Fault-tolerance
Nimbus goes down:
• The supervisor keeps synchronizing assignments and sending heartbeats; the worker keeps sending executor heartbeats
• Processing will still continue, but topology lifecycle operations and the reassignment facility are lost
21. Fault-tolerance
A worker node goes down:
• Nimbus will reassign its tasks to other machines and the processing will continue
22. Fault-tolerance
Supervisor goes down
[Diagram: the same cluster with the Supervisor down; Nimbus keeps monitoring cluster state and the Worker keeps sending executor heartbeats, but assignment synchronization and supervisor heartbeats stop]
Processing still continues, but assignments are no longer synchronized to this node
23. Fault-tolerance
Worker process goes down
[Diagram: the same cluster with the worker process down; the Supervisor notices the missing worker heartbeat in the local file system]
Supervisor will restart the worker process
and the processing will continue
24. What is Storm?
Storm is
• Fast & scalable
• Fault-tolerant
• Guarantees messages will be processed
• Easy to setup & operate
• Free & open source
distributed realtime computation system
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
25. Reliability API
public class RandomSentenceSpout extends BaseRichSpout {
  public void nextTuple() {
    ...;
    UUID msgId = getMsgId();
    collector.emit(new Values(sentence), msgId);
  }

  public void ack(Object msgId) {
    // Do something with acked message id.
  }

  public void fail(Object msgId) {
    // Do something with failed message id.
  }
}
public class SplitSentenceBolt extends BaseRichBolt {
  public void execute(Tuple input) {
    for (String s : input.getString(0).split(" ")) {
      collector.emit(input, new Values(s));
    }

    collector.ack(input);
  }
}
[Diagram: tuple tree — the spout tuple "the cow jumped over the moon" (emitted with a message id) is anchored to the outgoing word tuples "the", "cow", "jumped", "over", "the", "moon"; the bolt then sends an ack for the incoming tuple]
26. Acking Framework
[Diagram: RandomSentence Spout → SplitSentence Bolt → WordCount Bolt; the spout sends "Acker init" and the bolts send "Acker ack" / "Acker fail" messages for tuples A, B and C to the Acker implicit bolt]
For each spout tuple, the Acker implicit bolt tracks the spout tuple id, the spout task id, and a 64-bit number called the "ack val"
• Emitted tuple A: XOR tuple A id into ack val
• Emitted tuple B: XOR tuple B id into ack val
• Emitted tuple C: XOR tuple C id into ack val
• Acked tuple A: XOR tuple A id into ack val
• Acked tuple B: XOR tuple B id into ack val
• Acked tuple C: XOR tuple C id into ack val
When the ack val becomes 0, the Acker implicit bolt knows the tuple tree has been completed
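The XOR bookkeeping above can be demonstrated in a few lines of Java. The tuple ids here are made-up constants; in Storm they are random 64-bit numbers, and this is only a sketch of the trick, not Storm's actual acker code:

```java
// Sketch of the acker's XOR trick: XOR each tuple id into the ack val when
// the tuple is emitted, and again when it is acked. Since x ^ x == 0, the
// ack val returns to 0 exactly when every emitted tuple has been acked.
public class AckValDemo {
    public static void main(String[] args) {
        long tupleA = 0x1A2BL, tupleB = 0x3C4DL, tupleC = 0x5E6FL; // hypothetical tuple ids
        long ackVal = 0L;

        // Tuples emitted
        ackVal ^= tupleA;
        ackVal ^= tupleB;
        ackVal ^= tupleC;
        System.out.println(ackVal == 0); // prints "false": tree not yet complete

        // Tuples acked (order does not matter, XOR is commutative)
        ackVal ^= tupleB;
        ackVal ^= tupleC;
        ackVal ^= tupleA;
        System.out.println(ackVal == 0); // prints "true": tuple tree completed
    }
}
```

This is why the acker needs only 20 or so bytes per spout tuple regardless of how large the tuple tree grows.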
27. What is Storm?
Storm is
• Fast & scalable
• Fault-tolerant
• Guarantees messages will be processed
• Easy to setup & operate
• Free & open source
distributed realtime computation system
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
28. Cluster Setup
• Setup ZooKeeper cluster
• Install dependencies on Nimbus and worker
machines
- ZeroMQ 2.1.7 and JZMQ
- Java 6 and Python 2.6.6
- unzip
• Download and extract a Storm release to Nimbus
and worker machines
• Fill in mandatory configuration into storm.yaml
• Launch daemons under supervision using “storm”
script
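As an illustration of that mandatory configuration, a minimal storm.yaml for a 0.8.x-era cluster might look like this (all hostnames and paths are placeholders):

```yaml
# ZooKeeper ensemble the cluster coordinates through
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"

# Machine running the Nimbus daemon
nimbus.host: "nimbus.example.com"

# Local directory Nimbus/Supervisors use for jars, configs, heartbeats
storm.local.dir: "/var/storm"

# Ports on each worker machine; one worker process per listed port
supervisor.slots.ports:
  - 6700
  - 6701
```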
32. What is Storm?
Storm is
• Fast & scalable
• Fault-tolerant
• Guarantees messages will be processed
• Easy to setup & operate
• Free & open source
distributed realtime computation system
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
33. Basic Resources
• Storm is available at
- http://storm-project.net/
- https://github.com/nathanmarz/storm
under Eclipse Public License 1.0
• Get help on
- http://groups.google.com/group/storm-user
- #storm-user freenode room
• Follow
- @stormprocessor and @nathanmarz
for updates on the project
34. Many Contributions
• Community repository for modules to use Storm at
- https://github.com/nathanmarz/storm-contrib
including integration with Redis, Kafka, MongoDB,
HBase, JMS, Amazon SQS and so on
• Good articles for understanding Storm internals
- http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
- http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Good slides for understanding real-life examples
- http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
- http://www.slideshare.net/KrishnaGade2/storm-at-twitter
35. Features on Deck
• Current release: 0.8.2 as of 6/28/2013
• Work in progress (older): 0.8.3-wip3
- Some bug fixes
• Work in progress (newest): 0.9.0-wip19
- SLF4J and Logback
- Pluggable tuple serialization and blowfish encryption
- Pluggable interprocess messaging and Netty implementation
- Some bug fixes
- And more
36. Advanced Topics
• Distributed RPC
• Transactional topologies
• Trident
• Using non-JVM languages with Storm
• Unit testing
• Patterns
...Not covered in this presentation. So check these out by yourself, or come to my next session if I get the chance :)