TUGA IT 2017
LISBON, PORTUGAL
THANK YOU TO OUR SPONSORS
PLATINUM
GOLD SILVER
PARTICIPATING COMMUNITIES
CLOUD
PRO
PT
Event processing with
Apache Storm
Nuno Caneco - Tuga IT - 20/May/2017
Nuno Caneco
Senior Software Engineer @ Talkdesk
/nunocaneco
nuno.caneco@gmail.com
/@nuno.caneco
Who am I
Stream Processing - Why?
WHY
● Data is crucial for business
● New data is always being generated
● Companies want to extract value from
data in “real-time”
USE CASES
● Fraud detection
● Sensor data aggregation
● Live monitoring
What is Storm?
Apache Storm is a free and open source distributed realtime computation system.
Storm makes it easy to reliably process unbounded streams of data, doing for realtime
processing what Hadoop did for batch processing. Storm is simple, can be used with
any programming language, and is a lot of fun to use!
Storm has many use cases: real time analytics, online machine learning, continuous
computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at
over a million tuples processed per second per node. It is scalable, fault-tolerant,
guarantees your data will be processed, and is easy to set up and operate.
http://storm.apache.org/
Under the hood
(A bit of) Architecture
● Nimbus: master node
● Zookeeper: cluster coordination
● Supervisors: the cluster nodes that run the worker processes
● Worker processes: JVM instances
Concepts
Topology
Topologies combine individual work units to be applied to input data
[Diagram: a Topology. Data enters through a Spout, flows as Tuples between Bolts A, B, C and D, and leaves the topology as data out.]
Spout
● First node of every topology
○ Collects data from the outside world
○ Injects the data into the topology to be processed
● Must implement the ISpout interface
○ BaseRichSpout is a more convenient abstract class
Bolt
● Middle or terminating node of a topology
● Implements a unit of data processing
● Each Bolt has an output stream
● Can emit one or more Tuples to other Bolts subscribed to its output stream
● Must implement the IBolt interface
○ BaseRichBolt is a more convenient abstract class
Tuple
Map-like data structure containing the data that flows between Spouts and Bolts
Data can be accessed by:
● Field index: [{0, “foo”}, {1, “bar”}]
● Key name: [{“foo_key”, “foo”}, {“bar_key”, “bar”}]
Values can be:
● Java primitive types
● String
● byte[]
● Any Serializable object
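To make the two access styles concrete, here is a minimal plain-Java sketch of a tuple-like structure. SimpleTuple and its methods are hypothetical names used for illustration; this is not Storm's Tuple class, although Storm's own interface offers comparable positional and by-field accessors.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Storm tuple: an ordered list of values
// plus the field names declared by the component that emitted it.
final class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value per field is required");
        }
        this.fields = fields;
        this.values = values;
    }

    // Access by field index.
    Object getValue(int index) {
        return values.get(index);
    }

    // Access by key (field) name.
    Object getValueByField(String name) {
        int index = fields.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(index);
    }

    public static void main(String[] args) {
        SimpleTuple t = new SimpleTuple(
                Arrays.asList("foo_key", "bar_key"),
                Arrays.<Object>asList("foo", "bar"));
        System.out.println(t.getValue(1));                // bar
        System.out.println(t.getValueByField("foo_key")); // foo
    }
}
```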
Example: Alert on monitored words
[Diagram: Message queue → [Message] → Collector Spout → {Message} → Split Sentence Bolt → {Word, MessageId} → Monitored Words Bolt → {MonitoredWord, MessageId} → Notify User Bolt and Store event on DB Bolt]
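The data flow of this example can be sketched without a cluster. The plain-Java sketch below models only the two middle steps, splitting a message into words and keeping the monitored ones. The class name, the method names and the monitored word list are assumptions made for illustration; real Bolts would extend BaseRichBolt and emit Tuples instead of returning lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the example topology's data flow (not Storm code).
final class MonitoredWordsSketch {
    // Hypothetical list of monitored words.
    static final Set<String> MONITORED = new HashSet<>(Arrays.asList("fraud", "outage"));

    // "Split Sentence Bolt": {Message} -> {Word, MessageId}
    static List<String[]> splitSentence(String messageId, String message) {
        List<String[]> out = new ArrayList<>();
        for (String word : message.split(" ")) {
            if (!word.isEmpty()) {
                out.add(new String[] {word.toLowerCase(), messageId});
            }
        }
        return out;
    }

    // "Monitored Words Bolt": keeps only the {MonitoredWord, MessageId} pairs.
    static List<String[]> monitoredWords(List<String[]> words) {
        List<String[]> out = new ArrayList<>();
        for (String[] pair : words) {
            if (MONITORED.contains(pair[0])) {
                out.add(pair);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> hits = monitoredWords(splitSentence("m-1", "Possible fraud on account"));
        for (String[] hit : hits) {
            // A "Notify User Bolt" and a "Store event on DB Bolt" would both consume this pair.
            System.out.println(hit[0] + " in message " + hit[1]); // fraud in message m-1
        }
    }
}
```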
Demo
Message Processing Guarantees
[Diagram: an error occurs during processing and the message is lost]
Acknowledging Tuples
● ack(): Tuple was processed successfully
● fail(): Tuple failed to process
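Tuples that receive neither ack() nor fail() eventually time out and are reported to the Spout as failed, so the Spout can replay them
A fail() on any Tuple fails the whole dependency tree, back up to the Spout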
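A simplified model may help picture how ack() and fail() propagate. The sketch below tracks one tuple tree: the root counts as fully processed only once every anchored Tuple has been acked, and a single fail() fails the whole tree. TupleTreeTracker is a hypothetical name, and this is not how Storm tracks tuples internally; it only mirrors the behaviour described above.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of one tuple tree (an assumption for illustration, not Storm internals).
final class TupleTreeTracker {
    enum Status { PENDING, ACKED, FAILED }

    private final Set<String> pending = new HashSet<>();
    private Status status = Status.PENDING;

    TupleTreeTracker(String rootId) {
        pending.add(rootId);
    }

    // A Bolt emitted a new Tuple anchored to a Tuple of this tree.
    void anchor(String childId) {
        if (status == Status.PENDING) {
            pending.add(childId);
        }
    }

    // The tree is acked once no Tuple is left pending.
    void ack(String tupleId) {
        if (status != Status.PENDING) {
            return;
        }
        pending.remove(tupleId);
        if (pending.isEmpty()) {
            status = Status.ACKED;
        }
    }

    // One failure anywhere fails the whole tree.
    void fail(String tupleId) {
        if (status == Status.PENDING) {
            status = Status.FAILED;
        }
    }

    Status status() {
        return status;
    }

    public static void main(String[] args) {
        TupleTreeTracker tree = new TupleTreeTracker("sentence-1");
        tree.anchor("word-1");
        tree.anchor("word-2");
        tree.ack("sentence-1");
        System.out.println(tree.status()); // PENDING
        tree.ack("word-1");
        tree.ack("word-2");
        System.out.println(tree.status()); // ACKED
    }
}
```

Note that the Bolt anchors its output Tuples before acking its input, which is the same order used by the anchoring code later in this deck.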
Acknowledge done right
Ack: Anchoring
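public class SplitSentence extends BaseRichBolt {
    private OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            _collector.emit(tuple, new Values(word)); // anchors the output tuple to the input tuple
        }
        _collector.ack(tuple); // acknowledges the input tuple
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}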
Dealing with fail()
[Diagram: two copies of the same tuple tree, a Spout feeding five Bolts. In the first, all six nodes are marked ✅. In the second, three nodes are marked ✅ and three are marked ❌.]
Beware!
Storm is designed to scale to millions of messages per second.
Its design deliberately assumes that some Tuples might be lost.
If your application needs exactly-once semantics, consider using
Trident (covered later in this talk).
Storm does not ensure
exactly-once processing
Demo
Parallelism
● Cluster node → 1+ worker processes (JVM instances)
● JVM instance → 1+ threads
● Thread → 1+ tasks
● Each instance of a Bolt or Spout is a Task
[Diagram: a cluster node with two worker processes, the threads inside each worker process, and the tasks run by each thread]
Parallelism Example
Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // 2 threads (executors)
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)   // 2 threads...
    .setNumTasks(4)                                         // ...sharing 4 tasks
    .shuffleGrouping("blue-spout");
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6) // 6 threads
    .shuffleGrouping("green-bolt");
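The numbers in this configuration can be checked with a little arithmetic. The sketch below is plain Java rather than Storm code, and ParallelismMath is a hypothetical helper. It assumes that a component's task count defaults to its thread count when setNumTasks is not called, and that threads are spread evenly over the worker processes.

```java
// Back-of-the-envelope check of the parallelism example (not Storm code).
final class ParallelismMath {
    // {threads, tasks} per component, taken from the configuration above.
    static final int[][] COMPONENTS = {
        {2, 2}, // blue-spout: 2 threads, tasks default to 2
        {2, 4}, // green-bolt: 2 threads, setNumTasks(4)
        {6, 6}, // yellow-bolt: 6 threads, tasks default to 6
    };

    // Sums one column: 0 = threads, 1 = tasks.
    static int total(int column) {
        int sum = 0;
        for (int[] component : COMPONENTS) {
            sum += component[column];
        }
        return sum;
    }

    public static void main(String[] args) {
        int workers = 2; // conf.setNumWorkers(2)
        int threads = total(0);
        int tasks = total(1);
        System.out.println("threads: " + threads);                      // threads: 10
        System.out.println("tasks: " + tasks);                          // tasks: 12
        System.out.println("threads per worker: " + threads / workers); // threads per worker: 5
    }
}
```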
Stream grouping
● Shuffle grouping: Tuples are randomly distributed across the downstream Bolt's tasks
● Fields grouping: like GROUP BY - Tuples with the same values in the grouped fields are
always delivered to the same task
● All grouping: the stream is replicated across all the Bolt's tasks. Use this grouping
with care
● Direct grouping: the producer of the Tuple must indicate which consumer will
receive the Tuple
● Custom grouping: when you go NIH ("not invented here") and implement your own
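As a rough illustration of the first two groupings, the plain-Java sketch below picks a target task index for a Tuple. GroupingSketch is a hypothetical name and the hashing shown is an assumption chosen for clarity, not Storm's exact implementation. The point is that a fields grouping is deterministic for equal field values, while a shuffle grouping is not.

```java
import java.util.Objects;
import java.util.Random;

// Illustrative task selection for two groupings (not Storm's implementation).
final class GroupingSketch {
    private static final Random RANDOM = new Random();

    // Shuffle grouping: any downstream task, chosen at random.
    static int shuffleGrouping(int numTasks) {
        return RANDOM.nextInt(numTasks);
    }

    // Fields grouping: equal grouped values always map to the same task.
    static int fieldsGrouping(int numTasks, Object... groupedValues) {
        return Math.floorMod(Objects.hash(groupedValues), numTasks);
    }

    public static void main(String[] args) {
        int first = fieldsGrouping(4, "kotlin");
        int second = fieldsGrouping(4, "kotlin");
        System.out.println(first == second); // true
        System.out.println("shuffle picked task " + shuffleGrouping(4));
    }
}
```

This is why a word count keyed on the word field needs a fields grouping: every occurrence of a given word reaches the task that holds its running count.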
Storm UI - Cluster
Storm UI - Topology
Other features: Trident
Trident is an abstraction layer to manage state across the topology
The state can be kept:
● Internally in the topology - in memory or backed by HDFS
● Externally on a Database - such as Memcached or Cassandra
Other features: Storm SQL
The Storm SQL integration allows users to run SQL queries over streaming data
in Storm.
Cool feature, but still experimental
Q&A
Questions?
Thank you
/nunocaneco
nuno.caneco@gmail.com
/@nuno.caneco
Trident: How it works
1. Tuples are processed as small batches
2. Each batch of tuples is given a unique id called the "transaction id" (txid).
a. If the batch is replayed, it is given the exact same txid.
3. State updates are ordered among batches. That is, the state updates for
batch 3 won't be applied until the state updates for batch 2 have succeeded.
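The txid rule above is what makes a replayed batch harmless, and it can be sketched in plain Java. TransactionalCounter and applyBatch are hypothetical names, not Trident's API. The sketch assumes the store keeps, next to each count, the txid of the last batch applied to that key, and skips a key whose stored txid equals the txid of the incoming batch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative word-count store with per-key txids (not Trident's API).
final class TransactionalCounter {
    // key -> {count, txid of the last batch applied to this key}
    final Map<String, long[]> store = new HashMap<>();

    void applyBatch(long txid, List<String> words) {
        // Aggregate the batch first: word -> occurrences in this batch.
        Map<String, Long> partial = new HashMap<>();
        for (String word : words) {
            partial.merge(word, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            long[] current = store.get(entry.getKey());
            if (current == null) {
                store.put(entry.getKey(), new long[] {entry.getValue(), txid});
            } else if (current[1] != txid) {
                current[0] += entry.getValue();
                current[1] = txid;
            }
            // Otherwise this batch was already applied to this key: skip it.
        }
    }

    public static void main(String[] args) {
        TransactionalCounter counter = new TransactionalCounter();
        counter.store.put("kotlin", new long[] {8, 2});
        counter.store.put("csharp", new long[] {10, 3});

        List<String> batch = Arrays.asList("kotlin", "kotlin", "csharp");
        counter.applyBatch(3, batch);
        counter.applyBatch(3, batch); // replay with the same txid changes nothing

        System.out.println(counter.store.get("kotlin")[0]); // 10
        System.out.println(counter.store.get("csharp")[0]); // 10
    }
}
```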
Trident: Transactional Spout
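Trident Store before the batch:
java => [count=5, txid=1]
kotlin => [count=8, txid=2]
csharp => [count=10, txid=3]
Batch txid=3: ["kotlin"], ["kotlin"], ["csharp"]
Trident Store after the batch:
java => [count=5, txid=1]
kotlin => [count=10, txid=3]
csharp => [count=10, txid=3]
"kotlin" += 2 (the stored txid differs from the batch txid, so the update is applied)
"csharp" += 0 (the stored txid equals the batch txid, so the update was already applied and is skipped)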
