Storm is a framework for reliably processing streaming data. It allows defining topologies composed of spouts (data sources) and bolts (processing components). Spouts emit tuples that are processed by bolts which can emit additional tuples. The document describes a topology for processing tweets in real-time to identify top hashtags and display tweets on a map. It includes spouts to fetch tweets and bolts for filtering, counting hashtags, ranking them and storing results to Redis. Storm provides reliability by tracking processing of tuples through a topology using acknowledgments.
Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
These slides are for a brief seminar that I gave in the Ph.D. exam "Perspective in Parallel Computing" (held by Prof. Marco Danelutto) at the University of Pisa, Italy.
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic skeleton parallel frameworks.
Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
In this track we will introduce the Storm framework, explain some design concepts and considerations, and show some real-world examples of how to use it to process large amounts of data in real time in a distributed environment. We will describe how this solution scales very easily as more data needs to be processed.
We will explain all you need to know to get started with Storm and some tips on how to get your Spouts, Bolts and Topologies up and running in the cloud.
A tutorial presentation based on storm.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a teaching assistant for Dr. Amir H. Payberah's Cloud Computing course in the spring semester of 2015.
Apache Storm 0.9 basic training - Verisign (Michael Noll)
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how it integrates well with Apache Cassandra.
There is also a segue into an example project that uses Storm and Cassandra to implement a scalable reactive web crawler.
http://github.com/tjake/stormscraper
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013 (Sonal Raj)
This talk briefly outlines the Storm framework and Neo4J graph database, and how to compositely use them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.
Learning Stream Processing with Apache Storm (Eugene Dvorkin)
Over the last couple of years, Apache Storm became a de-facto standard for developing real-time analytics and complex event processing applications. Storm lets developers tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data: it gives companies "Fast Data" alongside "Big Data". Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
PHP Backends for Real-Time User Interaction using Apache Storm (DECK36)
Engaging users in real time is the topic of our times. Whether it's a game, a shop, or a content network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation for how to use it with PHP. That way, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (Tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
Graphs are everywhere! Distributed graph computing with Spark GraphX (Andrea Iacono)
These are the slides for the talk given at Codemotion Milan in November 2015. The source code shown is available at https://github.com/andreaiacono/TalkGraphX .
Storm – Streaming Data Analytics at Scale - StampedeCon 2014 (StampedeCon)
At StampedeCon 2014, Scott Shaw (Hortonworks) and Kit Menke (Enterprise Holdings) presented "Storm – Streaming Data Analytics at Scale".
Storm’s primary purpose is to provide real-time analytics against fast-moving data before it's stored. The use cases range from fraud detection and machine learning to ETL.
Storm has been clocked at over 1 million tuples processed per second per node. It’s fast, scalable, and language agnostic. This session provides an architecture overview as well as a real-world discussion of its use and implementation at Enterprise Holdings.
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20) (Ankur Dave)
GraphX is a graph processing framework built into Apache Spark. This talk introduces GraphX, describes key features of its API, and gives an update on its status.
Process the Twitter stream using Storm & Redstorm with Ruby & JRuby. Full working demo, code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/
Neo4j and the Panama Papers - FooCafe June 2016 (Craig Taverner)
In May 2016, the ICIJ stunned the world by making available a download of "The Panama Papers".
This remarkable dataset represents the largest leak of offshore financial networks ever, at least 10 times bigger than the previous largest leak. Luckily for those of us keen to analyse this data, they made the data available as a ready-to-run instance of the Neo4j graph database, with all data pre-loaded. In this presentation we'll show you how to download and run this database, and how to perform various queries using the Cypher query language to gain insights into the structure of the offshore financial networks used by people and corporations around the world.
And don't worry, you won't need to know Neo4j or Cypher, or even what the Panama Papers are. We'll introduce everything before pulling out the Cypher queries!
Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016 (Adrianos Dadis)
Our presentation at the FOSSCOMM conference (17 April 2016):
Agenda:
* Big Data concepts
* Batch & Streaming processing
* NoSQL persistence
* Apache Storm and Apache Kafka
* Streaming application demo
* Considerations for Big Data applications
Event: http://fosscomm.cs.unipi.gr/index.php/event/adrianos-dadis/?lang=en
Processing large-scale graphs with Google(TM) Pregel (ArangoDB Database)
Many popular graph databases are optimized to run on a single machine, using efficient traversals to query the stored graphs. This boosts the performance of algorithms originating at a single vertex and iterating through the graph, e.g. finding shortest paths or neighbors. However, graphs are getting bigger, and traversals perform poorly if they require a large depth. If you need to distribute a large-scale graph across several machines, traversals won't be the best choice (in terms of performance) to process the graph. Therefore Google has released its Pregel framework, offering an environment to query distributed graphs; Pregel is also known as the map-reduce for graphs. In this talk I want to present the architecture and requirements of the Pregel framework and introduce you to the different mind-set required to write a Pregel algorithm. Furthermore, I will give a short introduction to three implementations of Pregel: Giraph, TinkerPop3 and ArangoDB.
By Michael Hackstein (@mchacki)
The complexity and amount of data rises. Modern graph databases are designed to handle the complexity, but still not the amount of data. When a graph hits a certain size, many dedicated graph databases reach their limits in vertical or, most commonly, horizontal scalability. In this talk I'll provide a brief overview of current approaches and their limits regarding scalability. Dealing with complex data in a complex system doesn't make things easier... but it's more fun finding a solution. Join me on my journey to handle billions of edges in a graph database.
Developing Java Streaming Applications with Apache Storm (Lester Martin)
Apache Storm, http://storm.apache.org, is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. During this presentation, a simple Java-based streaming application will be built from scratch!
Code examples can be found at https://github.com/lestermartin/streaming-exploration.
A Brief introduction to Apache Storm. Talk given at the October Toronto Java User Group meeting, video available at https://www.youtube.com/watch?v=CWyH4-SOGm8
PigSPARQL: A SPARQL Query Processing Baseline for Big Data (Alexander Schätzle)
In this paper we discuss PigSPARQL, a competitive yet easy to use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce. This additional level of abstraction makes our approach independent of the actual Hadoop version and thus ensures the compatibility to future changes of the Hadoop framework as they will be covered by the underlying Pig layer. We revisit PigSPARQL and demonstrate the performance improvement when simply switching the underlying version of Pig from 0.5.0 to 0.11.0 without any changes to PigSPARQL itself. Because of this sustainability, PigSPARQL is an attractive long-term baseline for comparing various MapReduce based SPARQL implementations which is also underpinned by its competitiveness with existing systems, e.g. HadoopRDF.
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem (Sages)
Introduction to Hadoop Map Reduce, Pig, Hive and Ambari technologies.
Workshop deck prepared and presented on September 5th 2015 by Radosław Stankiewicz.
During the day, participants also had the opportunity to go through prepared tutorials and test their analyses on a real cluster.
Real time and reliable processing with Apache Storm
1. Real time and reliable processing with Apache Storm
The code is available on:
https://github.com/andreaiacono/StormTalk
2. What is Apache Storm?
Storm is a real-time distributed computing framework for reliably processing unbounded data streams.
It was created by Nathan Marz and his team at BackType, and released as open source in 2011 (after BackType was acquired by Twitter).
3. Topology
A topology is a directed acyclic graph of computation formed by spouts and bolts.
A spout is the source of a data stream that is emitted to one or more bolts.
Each emitted data item is called a tuple and is an ordered list of values.
A bolt performs computation on the tuples it receives and emits the results to one or more bolts. If a bolt is at the end of the topology, it doesn't emit anything.
Every task (either a spout or a bolt) can have multiple instances.
4. A simple topology
We'd like to build a system that generates random numbers and writes them to a file. Here is a topology that represents it:
5. public class RandomSpout extends BaseRichSpout {
private SpoutOutputCollector spoutOutputCollector;
private Random random;
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("val"));
}
@Override
public void open(Map map, TopologyContext topologyContext,
SpoutOutputCollector spoutOutputCollector) {
this.spoutOutputCollector = spoutOutputCollector;
random = new Random();
}
@Override
public void nextTuple() {
spoutOutputCollector.emit(new Values(random.nextInt() % 100));
}
}
A simple topology: the spout
6. // no exception checking: it's a sample!
public class FileWriteBolt extends BaseBasicBolt {
private final String filename = "output.txt";
private BufferedWriter writer;
@Override
public void prepare(Map stormConf, TopologyContext context) {
super.prepare(stormConf, context);
writer = new BufferedWriter(new FileWriter(filename, true));
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
writer.write(input.getInteger(0) + "\n");
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {}
@Override
public void cleanup() {
writer.close();
}
}
A simple topology: the bolt
7. public class RandomValuesTopology {
private static final String name = RandomValuesTopology.class.getName();
public static void main(String[] args) {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("random-spout", new RandomSpout());
builder.setBolt("writer-bolt",new FileWriteBolt())
.shuffleGrouping("random-spout");
Config conf = new Config();
conf.setDebug(false);
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(name, conf, builder.createTopology());
Utils.sleep(300_000);
cluster.killTopology(name);
cluster.shutdown();
// to run it on a live cluster
// StormSubmitter.submitTopology("topology", conf, builder.createTopology());
}
}
A simple topology: the topology
8. Grouping
The path tuples take from one bolt to another is determined by the grouping. Since
we can have multiple instances of each bolt, we have to decide which instance
receives the emitted tuples.
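The idea behind fieldsGrouping can be sketched in plain Java (a simplified model, not Storm's internal routing code): the target task is picked by hashing the grouping field, so tuples with the same value always reach the same bolt instance, while shuffleGrouping simply balances tuples across instances.

```java
import java.util.Objects;

public class GroupingDemo {
    // Simplified fieldsGrouping: the hash of the grouping value picks the
    // task index, so equal values always map to the same bolt instance.
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same hashtag is always routed to the same task...
        System.out.println(fieldsGrouping("#storm", tasks)
                == fieldsGrouping("#storm", tasks)); // prints true
        // ...which is what makes per-key counting (e.g. per hashtag) correct.
    }
}
```

This is why the topology later uses fieldsGrouping on "hashtag" before counting: every occurrence of a given hashtag is guaranteed to hit the same counter instance.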
9. We want to create a webpage that shows the top-N hashtags and,
every time a new tweet containing one of them arrives, displays it
on a world map.
Twitter top-n hashtags: overview
10. public class GeoTweetSpout extends BaseRichSpout {
SpoutOutputCollector spoutOutputCollector;
TwitterStream twitterStream;
LinkedBlockingQueue<String> queue = null;
@Override
public void open(Map map, TopologyContext topologyContext,
SpoutOutputCollector spoutOutputCollector) {
this.spoutOutputCollector = spoutOutputCollector;
queue = new LinkedBlockingQueue<>(1000);
ConfigurationBuilder config = new ConfigurationBuilder()
.setOAuthConsumerKey(custkey)
.setOAuthConsumerSecret(custsecret)
.setOAuthAccessToken(accesstoken)
.setOAuthAccessTokenSecret(accesssecret);
TwitterStreamFactory streamFactory = new TwitterStreamFactory(config.build());
twitterStream = streamFactory.getInstance();
twitterStream.addListener(new GeoTwitterListener(queue));
double[][] boundingBox = {{-179d, -89d}, {179d, 89d}};
FilterQuery filterQuery = new FilterQuery().locations(boundingBox);
twitterStream.filter(filterQuery);
}
@Override
public void nextTuple() {
String msg = queue.poll();
if (msg == null) {
return;
}
String lat = MiscUtils.getLatFromMsg(msg);
String lon = MiscUtils.getLonFromMsg(msg);
String tweet = MiscUtils.getTweetFromMsg(msg);
spoutOutputCollector.emit(new Values(tweet, lat, lon));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("tweet", "lat", "lon"));
}
}
Twitter top-n hashtags: GeoTweetSpout
11. public class NoHashtagDropperBolt extends BaseBasicBolt {
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("tweet", "lat", "lon"));
}
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
Set<String> hashtags = MiscUtils.getHashtags(tuple.getString(0));
if (hashtags.size() == 0) {
return;
}
String tweet = tuple.getString(0);
String lat = tuple.getString(1);
String lon = tuple.getString(2);
collector.emit(new Values(tweet, lat, lon));
}
}
Twitter top-n hashtags: NoHashtagDropperBolt
12. Twitter top-n hashtags: GeoHashtagsFilterBolt
public class GeoHashtagsFilterBolt extends BaseBasicBolt {
private Rankings rankings;
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("lat", "lon", "hashtag", "tweet"));
}
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String componentId = tuple.getSourceComponent();
if ("total-rankings".equals(componentId)) {
rankings = (Rankings) tuple.getValue(0);
return;
}
if (rankings == null) return;
String tweet = tuple.getString(0);
for (String hashtag : MiscUtils.getHashtags(tweet)) {
for (Rankable r : rankings.getRankings()) {
String rankedHashtag = r.getObject().toString();
if (hashtag.equals(rankedHashtag)) {
String lat = tuple.getString(1);
String lon = tuple.getString(2);
collector.emit(new Values(lat, lon, hashtag, tweet));
return;
}
}
}
}
}
13. public class ToRedisTweetBolt extends BaseBasicBolt {
private RedisConnection<String, String> redis;
@Override
public void prepare(Map stormConf, TopologyContext context) {
super.prepare(stormConf, context);
RedisClient client = new RedisClient("localhost", 6379);
redis = client.connect();
}
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
// gets the tweet, its coordinates and its hashtag
String lat = tuple.getString(0);
String lon = tuple.getString(1);
String hashtag = tuple.getString(2);
String tweet = tuple.getString(3);
String message = "1|" + lat + "|" + lon + "|" + hashtag + "|" + tweet;
redis.publish("tophashtagsmap", message);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
Twitter top-n hashtags: ToRedisTweetBolt
14. public class TopHashtagMapTopology {
private static int n = 20;
public static void main(String[] args) {
GeoTweetSpout geoTweetSpout = new GeoTweetSpout();
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("geo-tweet-spout", geoTweetSpout, 4);
builder.setBolt("no-ht-dropper",new NoHashtagDropperBolt(), 4)
.shuffleGrouping("geo-tweet-spout");
builder.setBolt("parse-twt",new ParseTweetBolt(), 4)
.shuffleGrouping("no-ht-dropper");
builder.setBolt("count-ht",new CountHashtagsBolt(), 4)
.fieldsGrouping("parse-twt",new Fields("hashtag"));
builder.setBolt("inter-rankings", new IntermediateRankingsBolt(n), 4)
.fieldsGrouping("count-ht", new Fields("hashtag"));
builder.setBolt("total-rankings", new TotalRankingsBolt(n), 1)
.globalGrouping("inter-rankings");
builder.setBolt("to-redis-ht", new ToRedisTopHashtagsBolt(), 1)
.shuffleGrouping("total-rankings");
builder.setBolt("geo-hashtag-filter", new GeoHashtagsFilterBolt(), 4)
.shuffleGrouping("no-ht-dropper")
.allGrouping("total-rankings");
builder.setBolt("to-redis-tweets", new ToRedisTweetBolt(), 4)
.globalGrouping("geo-hashtag-filter");
// code to start the topology...
}
}
Twitter top-n hashtags: topology
16. Storm Cluster
Nimbus: a daemon responsible for
distributing code around the cluster,
assigning jobs to nodes, and
monitoring for failures.
Worker node: executes a subset of
a topology (spouts and/or bolts). It
runs a supervisor daemon that
listens for jobs assigned to the
machine and starts and stops
worker processes as necessary.
Zookeeper: manages all the
coordination between Nimbus and
the supervisors.
17. Worker Node
Worker process: JVM (processes a specific topology)
Executor: Thread
Task: instance of bolt/spout
Supervisor: syncing with Master Node
The number of executors can be modified at runtime;
the topology structure cannot.
18. Tuples transfer
Storm supports two different types of transfer:
● on the same JVM
● on different JVMs
For serialization, Storm tries to look up a Kryo serializer, which is
more efficient than standard Java serialization.
The network layer for transport is provided by Netty.
Also for performance reasons, the queues are implemented using
the LMAX Disruptor library, which enables efficient queuing.
19. Tuples transfer: on the same JVM
A generic task is composed of two threads and two queues.
Tasks at the start (spouts) or at the end of the topology (terminal
bolts) have only one queue.
20. Tuples transfer: on different JVMs
21. Queues failure
Since the queues follow the producer/consumer model, if the
producer supplies data at a higher rate than the consumer can
handle, the queue will overflow.
The transfer queue is more critical because it has to serve all the
tasks of the worker, so it's under more stress than the internal one.
If an overflow happens, Storm tries, but does not guarantee, to put
the overflowing tuples into a temporary queue, with the side effect
of dropping the throughput of the topology.
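The overflow situation can be illustrated with a plain bounded queue (just a sketch using ArrayBlockingQueue, not Storm's Disruptor-based queues): once the consumer falls behind, offers start being rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class OverflowDemo {
    public static void main(String[] args) {
        // A bounded queue standing in for a task's incoming queue.
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);
        int dropped = 0;
        // The producer is faster than the (absent) consumer: the queue
        // fills up and further offers are rejected.
        for (int i = 0; i < 15; i++) {
            if (!queue.offer(i)) {
                dropped++;
            }
        }
        System.out.println("queued=" + queue.size() + " dropped=" + dropped);
        // prints queued=10 dropped=5
    }
}
```

A real system has to decide what to do with those rejected elements, which is exactly the choice Storm faces with its temporary overflow queue.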
22. Reliability
Levels of delivery guarantee
● at-most-once: tuples are processed in the order they come from the spouts;
in case of failure (network problems, exceptions) they are simply dropped
● at-least-once: in case of failure, tuples are re-emitted from the spout; a tuple
can be processed more than once, and tuples can arrive out of order
● exactly-once: only available with Trident, a layer on top of Storm that
allows writing topologies with different semantics
23. Reliability for bolts
The three main concepts for achieving the at-least-once guarantee are:
● anchoring: every tuple emitted by a bolt has to be linked to the
input tuple using the emit(tuple, values) method
● acking: when a bolt successfully finishes executing a tuple, it
has to call the ack() method to notify Storm
● failing: when a bolt encounters a problem with the incoming tuple, it
has to call the fail() method
The BaseBasicBolt we saw before takes care of these automatically
(when a tuple has to fail, a FailedException must be thrown).
When the topology is more complex (expanding tuples, collapsing
tuples, joining streams), they must be managed explicitly by
extending BaseRichBolt.
24. Reliability for spouts
The ISpout interface defines, among others, these methods:
void open(Map conf,TopologyContext context,SpoutOutputCollector collector);
void close();
void nextTuple();
void ack(Object msgId);
void fail(Object msgId);
To implement a reliable spout, inside the nextTuple() method we have to call:
collector.emit(values, msgId);
and we have to handle the ack() and fail() methods accordingly.
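The usual pattern behind a reliable spout, keeping emitted tuples in a pending map keyed by msgId, dropping them on ack() and re-queueing them on fail(), can be modeled in plain Java (a simplified sketch, not an actual ISpout implementation):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Simplified model of a reliable spout: tuples stay in `pending`
// until acked; a failed msgId is put back on the input queue.
public class ReliableSpoutModel {
    private final Queue<String> input = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    void feed(String value) { input.add(value); }

    // nextTuple(): emit with a msgId and remember the tuple.
    Long nextTuple() {
        String value = input.poll();
        if (value == null) return null;
        long msgId = nextId++;
        pending.put(msgId, value);   // collector.emit(new Values(value), msgId);
        return msgId;
    }

    // ack(): Storm confirmed the whole tuple tree, forget the tuple.
    void ack(long msgId) { pending.remove(msgId); }

    // fail(): replay the tuple by queueing it again.
    void fail(long msgId) {
        String value = pending.remove(msgId);
        if (value != null) input.add(value);
    }

    int pendingCount() { return pending.size(); }
    int queuedCount() { return input.size(); }

    public static void main(String[] args) {
        ReliableSpoutModel spout = new ReliableSpoutModel();
        spout.feed("#storm tweet");
        long msgId = spout.nextTuple();   // emitted, now pending
        spout.fail(msgId);                // failed: queued again for replay
        System.out.println(spout.queuedCount()); // prints 1
    }
}
```

Note that the GeoTweetSpout shown earlier is not reliable in this sense: it emits without a msgId, so failed tweets are simply lost.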
25. Reliability
A tuple tree is the set of all the additional tuples emitted by the
subsequent bolts starting from the tuple emitted by a spout.
When all the tuples of a tree are marked as processed, Storm
considers the initial tuple emitted by the spout correctly processed.
If any tuple of the tree is not marked as processed within a timeout
(30 seconds by default) or is explicitly marked as failed, Storm will
replay the tuple starting from the spout (this means that the
operations performed by a task have to be idempotent).
26. Reliability – Step 1
This is the starting state, nothing has yet happened.
To better understand the following slides, it's important to review
this binary XOR property:
1 ^ 1 == 0
1 ^ 2 ^ 1 == 2
1 ^ 2 ^ 2 ^ 1 == 0
1 ^ 2 ^ 3 ^ 2 ^ 1 == 3
1 ^ 2 ^ 3 ^ 3 ^ 2 ^ 1 == 0
1 ^ 2 ^ 3 ^ ... ^ N ^ ... ^ 3 ^ 2 ^ 1 == N
1 ^ 2 ^ 3 ^ ... ^ N ^ N ^ ... ^ 3 ^ 2 ^ 1 == 0
(whatever the order of the operands is)
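The XOR identity above is exactly what the acker relies on: each tuple ID is XORed into the ack value twice (once when the tuple is emitted, once when it is acked), in any order, so the accumulator returns to zero exactly when every tuple in the tree has been processed. A minimal pure-Java illustration:

```java
import java.util.Random;

public class AckerXorDemo {
    public static void main(String[] args) {
        Random random = new Random();
        long ackVal = 0;
        // Random 64-bit IDs, like the tuple IDs Storm generates.
        long[] tupleIds = {random.nextLong(), random.nextLong(), random.nextLong()};
        // Each tuple ID is XORed in once when the tuple is emitted...
        for (long id : tupleIds) ackVal ^= id;
        // ...and once more when it is acked, in any order.
        ackVal ^= tupleIds[2];
        ackVal ^= tupleIds[0];
        ackVal ^= tupleIds[1];
        // Every ID appeared exactly twice, so the accumulator is zero again:
        System.out.println(ackVal == 0); // prints true
    }
}
```

This is why the acker can track an arbitrarily large tuple tree with a single 64-bit value instead of remembering every tuple individually.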
27. Reliability – Step 2
The spout has received something, so it sends a pair of values to
its acker task:
● the tuple ID of the tuple to emit to the bolt
● its task ID
The acker puts this data in a map <TupleID, [TaskID, AckVal]>,
where AckVal is initially set to the TupleID value.
The ID values are of type long, so they're 64 bits.
28. Reliability – Step 3
After notifying the acker that a new tuple was created,
the spout sends the tuple to its attached bolt (Bolt1).
29. Reliability – Step 4
Bolt1 computes the outgoing tuple according to its business
logic and notifies the acker that it's going to emit a new tuple.
The acker gets the ID of the new tuple and XORs it with the
AckVal (that contained the initial tuple ID).
30. Reliability – Step 5
Bolt1 sends the new tuple to its attached bolt (Bolt2).
31. Reliability – Step 6
After emitting the new tuple, Bolt1 has finished its work on the
incoming tuple (Tuple0), so it sends an ack to the acker to
notify it.
The acker gets the tuple ID and XORs it with the AckVal.
32. Reliability – Step 7
Bolt2 processes the incoming tuple according to its business logic
and will probably write some data to a DB, a queue or somewhere
else. Since it's a terminal bolt, it will not emit a new tuple. Since
its job is done, it can send an ack to the acker for the incoming
tuple.
The acker gets the tuple ID and XORs it with the AckVal. The value
of AckVal will be 0, so the acker knows that the starting tuple has
been successfully processed by the topology.