Storm is coming
real time stream processing
Grzegorz Kolpuc
@gkolpuc
https://pl.linkedin.com/pub/grzegorz-kolpuc/55/b7/700
grzegorzkolpuc@gmail.com
Event Platform
Storm
real time stream processing
What is stream processing?
“Streaming processing” is the ideal platform to process data
streams or sensor data (usually a high ratio of event
throughput versus numbers of queries), whereas “complex
event processing” (CEP) utilizes event-by-event processing
and aggregation
http://www.infoq.com/articles/stream-processing-hadoop
Why stream processing?
Latency (batch cannot provide real-time results)
Stream Processing System
[Diagram: Collecting and Processing stages connected by a queue]
Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared. Guido Schmutz.
How to scale?
[Diagram: several Collecting and Processing instances around a queue]
Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared. Guido Schmutz.
How to scale?
[Diagram: Collecting, ProcessingA and ProcessingB instances connected by queues q1, q2, ..., qn]
Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared. Guido Schmutz.
Processing Models
Batch Processing
map-reduce
high latency
Stream Processing (ESP)
one at a time
sub-seconds latency
Micro-Batching
mix of stream and batch
processing small chunks of
data
seconds latency
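A minimal sketch of the difference between one-at-a-time and micro-batch handling, in plain Java with no Storm dependency. The class name `MicroBatchSketch` and the batch size of 3 are illustrative assumptions, not part of any framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MicroBatchSketch {
    /** Splits the input into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> microBatches(List<T> events, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < events.size(); start += batchSize) {
            int end = Math.min(start + batchSize, events.size());
            batches.add(new ArrayList<>(events.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        // Stream processing: each event is handled as soon as it arrives.
        for (int event : events) {
            System.out.println("stream -> " + event);
        }
        // Micro-batching: events are buffered and handled in small chunks,
        // which costs latency but makes per-chunk state easy to keep.
        for (List<Integer> batch : microBatches(events, 3)) {
            System.out.println("batch  -> " + batch);
        }
    }
}
```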
Message Delivery Semantics
1. At most once: messages may be lost but never redelivered
2. At least once: messages will never be lost but may be
redelivered
3. Exactly once: messages are never lost and never redelivered
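One common way to live with at-least-once delivery is to make the consumer idempotent, so that a redelivered message changes nothing. The sketch below is plain Java and only illustrates that idea; `IdempotentCounter` and the message ids are hypothetical names, and keeping every seen id in memory is a simplification.

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentCounter {
    private final Set<String> seenIds = new HashSet<>();
    private long total;

    /** Applies the message once; returns false if this id was already applied. */
    boolean apply(String messageId, long amount) {
        if (!seenIds.add(messageId)) {
            return false; // redelivered message: ignore it
        }
        total += amount;
        return true;
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        IdempotentCounter counter = new IdempotentCounter();
        counter.apply("m1", 10);
        counter.apply("m2", 5);
        counter.apply("m1", 10); // same message delivered again, e.g. after a timeout
        System.out.println("total = " + counter.total());
    }
}
```

With at-most-once delivery the third call would never happen, but "m1" or "m2" might be missing altogether. With at-least-once delivery plus deduplication the result looks like exactly-once to the reader of `total`.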
Complex solution needed?
1. scaling
2. message delivery
3. message grouping
4. message aggregation
5. cost of development and maintenance
➔ distributed real-time processing system
➔ scalable
➔ fault-tolerant
➔ simplify working with queues & workers
➔ Written in Clojure
Storm architecture
[Diagram: one Nimbus, three Zookeeper nodes, five Supervisors]
Is Nimbus a SPOF?
Nimbus and Supervisor daemons must be run under supervision
using a tool like daemontools or monit.
No worker processes are affected by the death of Nimbus or the
Supervisors.
So the answer is that Nimbus is "sort of" a SPOF
Core concepts
Tuple
core data unit (single queue message)
Stream
A stream is an unbounded sequence of tuples that is processed
and created in parallel in a distributed fashion
Spout
Spout is a source of streams in a topology
Spouts can either be reliable or unreliable
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}
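What makes a spout "reliable" is that it remembers each emitted message until it is acked and can replay it when it fails. A minimal sketch of that bookkeeping, in plain Java without the Storm API; the class `ReliableSourceSketch` and its method names only mirror `nextTuple`/`ack`/`fail` for illustration and are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ReliableSourceSketch {
    private final Deque<String> queue = new ArrayDeque<>();    // messages waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>(); // emitted but not yet acked
    private long nextId;

    void offer(String message) {
        queue.addLast(message);
    }

    /** Emits the next message and remembers it; returns its id, or -1 if nothing is queued. */
    long nextTuple() {
        String message = queue.pollFirst();
        if (message == null) {
            return -1L;
        }
        long msgId = nextId++;
        pending.put(msgId, message);
        return msgId;
    }

    /** The message was fully processed: forget it. */
    void ack(long msgId) {
        pending.remove(msgId);
    }

    /** Processing failed or timed out: put the message back for replay. */
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            queue.addLast(message);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int queuedCount() {
        return queue.size();
    }

    public static void main(String[] args) {
        ReliableSourceSketch source = new ReliableSourceSketch();
        source.offer("a");
        source.offer("b");
        long first = source.nextTuple();
        long second = source.nextTuple();
        source.ack(first);   // "a" is done
        source.fail(second); // "b" goes back to the queue
        System.out.println("pending=" + source.pendingCount() + ", queued=" + source.queuedCount());
    }
}
```

An unreliable spout would skip the `pending` map entirely and emit without a message id, trading replay for lower overhead.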
Bolt
Bolts can do anything from filtering, functions, aggregations,
joins, talking to databases, and more.
Bolts can emit more than one stream.
public interface IBolt extends Serializable {
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
    void cleanup();
    void execute(Tuple input);
}
Topology
analogous to a MapReduce job
Topology
kafka-spout
ftp-spout
processingA-bolt
merge-bolt
processingC-bolt
processingB-bolt
collecting-bolt
How to scale?
[Diagram: Collecting, ProcessingA and ProcessingB instances connected by queues q1, q2, ..., qn]
Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared. Guido Schmutz.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("ftp-spout", new FTPSpout(config), 1);
builder.setSpout("kafka-spout", new KafkaSpout(config), 4);
builder.setBolt("processingA-bolt", new ProcessingABolt())
       .shuffleGrouping("ftp-spout");
builder.setBolt("processingB-bolt", new ProcessingBBolt())
       .shuffleGrouping("kafka-spout");
builder.setBolt("processingC-bolt", new ProcessingCBolt())
       .shuffleGrouping("kafka-spout");
// A component id can be declared only once, so merge-bolt subscribes to both inputs in one declaration.
builder.setBolt("merge-bolt", new MergeBolt())
       .shuffleGrouping("processingA-bolt")
       .shuffleGrouping("processingB-bolt");
builder.setBolt("collecting-bolt", new CollectingBolt())
       .shuffleGrouping("processingC-bolt");
Stream grouping
// Alternative declarations, one per grouping type (use only one of them per bolt id).
// Shuffle grouping: default stream, then the named stream "mergers-and-acquisions"
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .shuffleGrouping("deals-kafka-spout");
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .shuffleGrouping("deals-kafka-spout", "mergers-and-acquisions");
// Fields grouping: tuples with equal field values go to the same task
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .fieldsGrouping("deals-kafka-spout", new Fields("EventId"));
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .fieldsGrouping("deals-kafka-spout", "mergers-and-acquisions",
                       new Fields("EventId", "EventType"));
// All grouping: every task receives every tuple
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .allGrouping("deals-kafka-spout");
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .allGrouping("deals-kafka-spout", "mergers-and-acquisions");
// Global grouping: the whole stream goes to a single task
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .globalGrouping("deals-kafka-spout");
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .globalGrouping("deals-kafka-spout", "mergers-and-acquisions");
// Direct grouping: the emitting task chooses the receiving task
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .directGrouping("deals-kafka-spout");
builder.setBolt("enrichment-bolt", new MnAEnrichmentBolt())
       .directGrouping("deals-kafka-spout", "mergers-and-acquisions");
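Conceptually, a fields grouping hashes the values of the grouping fields and maps the hash onto the consumer's tasks, so equal values always land on the same task. The sketch below shows that idea in plain Java; it is an illustration under that assumption, not Storm's actual implementation, and `FieldsGroupingSketch` and `taskFor` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

final class FieldsGroupingSketch {
    /** Picks a task index in [0, numTasks) from the values of the grouping fields. */
    static int taskFor(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the result non-negative even when the hash code is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String eventId : Arrays.asList("E-1", "E-2", "E-1", "E-3")) {
            List<Object> key = Arrays.<Object>asList(eventId);
            // Both "E-1" tuples print the same task index.
            System.out.println(eventId + " -> task " + taskFor(key, numTasks));
        }
    }
}
```

A shuffle grouping would instead pick the task without looking at the tuple's contents, which balances load but gives no guarantee about where equal keys end up.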
Message delivery/reliability
Storm guarantees that every spout tuple will be fully processed
by the topology. It does this by tracking the tree of tuples
triggered by every spout tuple and determining when that tree
of tuples has been successfully completed. Every topology has
a "message timeout" associated with it. If Storm fails to detect
that a spout tuple has been completed within that timeout,
then it fails the tuple and replays it later.
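Storm's documentation describes tracking a tuple tree with a single 64-bit value per spout tuple: every tuple id is XORed in once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when everything emitted has been acked. A minimal sketch of that idea in plain Java; `AckTrackerSketch` is a hypothetical name, and the small ids are chosen for readability (Storm uses random 64-bit ids, which makes an accidental zero very unlikely).

```java
final class AckTrackerSketch {
    private long ackVal; // XOR of every emitted and every acked tuple id

    /** Records that a tuple with this id was emitted into the tree. */
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    /** Records that the tuple with this id was acked. */
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    /** True when every id that was XORed in has been XORed out again. */
    boolean isComplete() {
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        AckTrackerSketch tree = new AckTrackerSketch();
        // Ids with distinct bits, so no partial combination cancels out by accident.
        tree.emitted(1L);
        tree.emitted(2L);
        tree.emitted(4L);
        tree.acked(1L);
        tree.acked(2L);
        System.out.println("after two acks: complete=" + tree.isComplete());
        tree.acked(4L);
        System.out.println("after three acks: complete=" + tree.isComplete());
    }
}
```

The appeal of this scheme is constant memory per spout tuple regardless of how large the tree grows. If the value has not reached zero when the message timeout expires, the spout tuple is failed and replayed.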
public interface IOutputCollector extends IErrorReporter {
    /**
     * Returns the task ids that received the tuples.
     */
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple);
    void emitDirect(int taskId, String streamId, Collection<Tuple> anchors, List<Object> tuple);
    void ack(Tuple input);
    void fail(Tuple input);
}
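public class JUGBolt implements IRichBolt {
    OutputCollector collector; // must be assigned in prepare(), which the slide omits

    @Override
    public void execute(Tuple input) {
        try {
            process(input);         // application logic, not shown on the slide
            collector.ack(input);   // success: the tuple is fully processed
        } catch (Exception e) {
            collector.fail(input);  // failure: the spout tuple will be replayed
        }
    }
    // prepare(), cleanup(), declareOutputFields() and getComponentConfiguration() omitted
}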
How to submit topology?
Config conf = new Config();
conf.setNumWorkers(20);
StormSubmitter.submitTopology("APR-30-JUGTopology", conf,
new JUGTopology().build());
storm jar path/to/jug-storm.jar org.jug.storm.JUGTopologyClusterSubmitter args...
How to test locally?
Config config = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("APR-30-JUGTopology", config,
new JUGTopology().build());
Scaling
Storm's usage of Zookeeper for cluster coordination
add machines
increase the parallelism
Parallelism
Worker processes
Executors (threads)
Tasks
Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");
StormSubmitter.submitTopology(
    "mytopology",
    conf,
    topologyBuilder.createTopology()
);
http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
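The numbers in the configuration above can be worked through by hand: the parallelism hints add up to 2 + 2 + 6 = 10 executors, which two workers share, and green-bolt's 4 tasks run on its 2 executors. The sketch below does that arithmetic in plain Java, assuming an even spread; the helper `spread` is hypothetical and says nothing about how Storm's scheduler actually places executors.

```java
import java.util.Arrays;

final class ParallelismSketch {
    /** Spreads items over slots as evenly as possible; earlier slots take the remainder. */
    static int[] spread(int items, int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        int[] counts = new int[slots];
        for (int i = 0; i < slots; i++) {
            counts[i] = items / slots + (i < items % slots ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        int executors = 2 + 2 + 6; // blue-spout + green-bolt + yellow-bolt hints
        System.out.println("executors per worker: " + Arrays.toString(spread(executors, 2)));
        System.out.println("green-bolt tasks per executor: " + Arrays.toString(spread(4, 2)));
    }
}
```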
storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
Trident
Trident is a high-level abstraction for doing realtime
computing on top of Storm. It allows you to seamlessly
intermix high throughput (millions of messages per second),
stateful stream processing with low latency distributed
querying
https://storm.apache.org/documentation/Trident-tutorial.html
topology.newStream("spout1", spout)
.each(new Fields("sentence"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
.parallelismHint(6);
https://storm.apache.org/documentation/Trident-tutorial.html
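For readers new to Trident, it may help to see what the word-count pipeline above computes, stripped of distribution and state management. A plain-Java sketch of the same split, group and count steps; `WordCountSketch` is a hypothetical name and the in-memory map stands in for the persistent state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    /** Splits each sentence into words and counts occurrences per word. */
    static Map<String, Long> countWords(List<String> sentences) {
        Map<String, Long> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.trim().split("\\s+")) { // the "Split" step
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);          // group by word, then count
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "the cow jumped over the moon",
                "the man went to the store");
        System.out.println(countWords(sentences));
    }
}
```

Trident adds what this sketch lacks: the stream is unbounded, the grouping is partitioned across tasks, and the counts are kept in a state backend that is updated batch by batch.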
DRPC
https://storm.apache.org/documentation/Distributed-RPC.html
Performance
Why not use Storm?
no commercial support, but...
Who uses Storm?
Other streaming frameworks
Apache Samza
Apache Flink
Spark Streaming
S4
IBM Streams
Latency
• Is the performance of the streaming application paramount?
Development Cost
• Is it desirable to have similar code bases for batch and stream processing? => lambda architecture
Message Delivery Guarantees
• Is it highly important to process every single record, or is some amount of data loss acceptable?
Process Fault Tolerance
• Is high availability a primary concern?
Choice?
Stream processing architectures
Lambda architecture
http://lambda-architecture.net/
Lambda architecture
http://pandawhale.com/post/51352/the-lambda-architecture-has-its-merits-but-alternatives-are-worth-exploring
Kappa Architecture (lightweight lambda)
http://www.kappa-architecture.com/
Q?

Editor's Notes

  1. 1. Scala group presentation 2. Marrog presentation - ZeroMQ/Protocol Buffers 3. Jacek Laskowski 4. Adam Kawa - Spotify 5. $186k 6. Storm is difficult. Why do I want to talk about Storm if I spend 90% of my time on the rest of these things? Nobody will question Hadoop; even a guy who doesn't know exactly what it is, but once shook hands with Cloudera's CEO, will burst out laughing when he hears that it "isn't suitable". When I hear nonsense such as that Spark is better because it can run on YARN... or because you can write it in Scala; when I hear that development in Storm is hard and only the chosen few can do it, because 99% of developers are blockheads and won't manage. It is exactly the opposite, and that is what I intend to show. I even heard that developers who know Storm earn $186k a year on average, so I told my manager: "now you know why I want to work with Storm".
  2. Because of latency: batch cannot provide real-time results.
  3. source -> processors -> queues-> result ->>>
  4. more processors more queues hard to maintain: apachestormvsapachespark-v1-141203182123-conversion-gate02.pdf
  5. -fail-over -must recover fast -top-layer project - discussion on scaling on jboss, hazelcast or sth else -persistent queues - state to recover after fail
  6. Batch Processing
• Familiar concept of processing data en masse
• Generally incurs high latency
(Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently
Micro-Batching
• A special case of batch processing with very small batch sizes (tiny)
• A nice mix between batching and streaming
• At cost of latency
• Gives stateful computation, making windowing an easy task
  7. (as note 6)
  8. (as note 6)
  9. (as note 6)
  10. OK, we know what stream processing is, we know how we can process the data (philosophically), and we know that messages can get lost, which may or may not be acceptable. Is a comprehensive solution needed? As things stand, there are numerous problems...
  11. Nathan's blog: a post on the history of Storm - BackType acquired by Twitter - Storm open-sourced - Nathan Marz's start-up - Taylor Goetz
  12. Right, but this was supposed to be fault tolerant, and there is only one Nimbus... ============= Nimbus, Zookeeper, Supervisor (worker, executors, tasks). Storm cluster: S1: Hadoop JobTracker structure; question: do you know what this is? S2: Storm Nimbus + Supervisor: similar? What is wrong regarding fault tolerance? S3: Is Nimbus a SPOF? S3: For the cluster... yes. S4: For a running topology... no. S4: But... it fails fast. S5: Is the Supervisor a SPOF? S5: What happens when Nimbus or a Supervisor fails?
  13. What happens when Nimbus or Supervisor daemons die? The Nimbus and Supervisor daemons are designed to be fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk). As described in Setting up a Storm cluster, the Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart like nothing happened. Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all the running jobs are lost. If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won't be reassigned to other machines when necessary (like if you lose a worker machine). So the answer is that Nimbus is "sort of" a SPOF. In practice, it's not a big deal since nothing catastrophic happens when the Nimbus daemon dies. There are plans to make Nimbus highly available in the future.
  14. stormhadoopsummit2014-140407145559-phpapp01.pdf about 20 slides
  15. core data unit (single queue message)
  16. topology
  17. topology
  18. topology
  19. topology
  20. topology
  21. topology FTP Kafka percolate alertin
  22. -fail-over -must recover fast -top-layer project - discussion on scaling on jboss, hazelcast or sth else -persistent queues - state to recover after fail
  23. Stream groupings. Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks. There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:
• Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
• Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"s may go to different tasks.
• Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but is load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed.
• All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
• Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
• None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
• Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods of OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
• Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
==================
In short:
• Shuffle grouping: random distribution.
• Fields grouping: grouped by value, so equal values go to the same task.
• All grouping: replicates to all tasks.
• Global grouping: all tuples go to one task.
• None grouping: currently the same as shuffle; eventually the bolt may run in the same thread as the bolt/spout it subscribes to.
• Direct grouping: the producer (the task that emits) controls which consumer task receives the tuple.
• Local or shuffle grouping: like shuffle, but prefers bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior.
  24. fail / ack timeout
  25. understanding of storm parallelism - more deeply rebalance
  26. obrazek - worker, executors, tasks
  27. Request-reply at Asseco; here one had to drink a lot to avoid getting dehydrated.
  28. Latency • Is performance of streaming application paramount Development Cost • Is it desired to have similar code bases for batch and stream processing => lambda architecture Message Delivery Guarantees • Is there high importance on processing every single record, or is some normal amount of data loss acceptable Process Fault Tolerance • Is high-availability of primary concern