Storm: Distributed and fault-tolerant realtime computation
Ferran Galí i Reniu
@ferrangali
19/06/2014
Published in: Engineering, Technology
Transcript of "Storm: Distributed and fault tolerant realtime computation"

  1. Storm: Distributed and fault-tolerant realtime computation
     Ferran Galí i Reniu, @ferrangali, 19/06/2014
  2. Ferran Galí i Reniu
     ● UPC - FIB
     ● Trovit
       ○ Hadoop
       ○ Lucene/Solr
       ○ Storm
  3. Big Data
     ● Too much data
       ○ Store
       ○ Compute
       ○ Analyse
     ● Distributed systems
       ○ Provide horizontal scalability
  4. Distributed Systems
     ● Hadoop
     [Diagram labels: File; HDFS, HDFS, HDFS]
  5. Distributed Systems
     ● Hadoop
     [Diagram labels: File; HDFS with MapReduce, shown three times]
  6. Distributed Systems
     ● Hadoop
       ○ Huge files
       ○ Useful for batch
       ○ High latency
       ○ No real time
  7. Storm
     “Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!”
     http://storm.incubator.apache.org/
  8. Storm
     ● Who’s using it?
  9. Storm
     ● Tuple
       ○ Ordered list of elements
       ○ Any type
     [Diagram labels: String, Integer, Serialized Object, ...]
  10. Storm
     ● Stream
       ○ Unbounded sequence of tuples
     [Diagram: a row of Tuples]
  11. Storm
     ● Spout
       ○ Source of streams
       ○ From data sources: Queues, API...
     [Diagram: a row of Tuples]
  12. Storm
     ● Bolt
       ○ Consumes streams
       ○ Does some processing (transform, join,...)
       ○ Emits streams
     [Diagram: rows of Tuples]
  13. Storm
     ● Topology
       ○ Graph of spouts & bolts
       ○ Runs forever
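The tuple, stream, spout, bolt and topology vocabulary of slides 9 to 13 can be mimicked with plain Python generators. This is only an illustrative sketch, not Storm's API: the names `line_spout`, `split_bolt`, `length_bolt` and `run_topology` are made up here, the "topology" is a linear chain rather than a general graph, and it stops when the input list ends, whereas a real topology runs forever.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def line_spout(lines: Iterable[str]) -> Iterator[Tuple]:
    """Spout: turns a data source into a stream of one-field tuples."""
    for line in lines:
        yield (line,)


def split_bolt(stream: Iterator[Tuple]) -> Iterator[Tuple]:
    """Bolt: consumes a stream of lines and emits one tuple per word."""
    for (line,) in stream:
        for word in line.split():
            yield (word,)


def length_bolt(stream: Iterator[Tuple]) -> Iterator[Tuple]:
    """Bolt: transforms (word,) into (word, length)."""
    for (word,) in stream:
        yield (word, len(word))


def run_topology(spout: Iterator[Tuple],
                 bolts: List[Callable[[Iterator[Tuple]], Iterator[Tuple]]]) -> Iterator[Tuple]:
    """Chains the spout through each bolt in order (a linear 'graph')."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return stream


result = list(run_topology(line_spout(["Storm is a", "distributed system"]),
                           [split_bolt, length_bolt]))
print(result)
```

Each tuple is an ordered list of elements of any type, so the second bolt can emit a `(str, int)` pair without any schema change upstream.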
  14. Architecture
     [Diagram labels: Nimbus (Master); Zookeeper x3 (Coordinator); Supervisor x2 (Worker), each with four Slots]
  15. Architecture
     [Diagram labels: Supervisor with four Slots; Worker process: Single JVM; Tasks - Threads]
  16. [Diagram: components with parallelism hint = 4, 1, 2, 2, 3 and 4; two Supervisors with four Slots each]
     Worker processes = 8
  17. [Diagram: components with parallelism hint = 4, 1, 2, 2, 3 and 4; two Supervisors]
     Worker processes = 8
     combined parallelism = 4 + 1 + 2 + 2 + 3 + 4 = 16
     Tasks per worker = 16 / 8 = 2
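The arithmetic on slide 17 can be written out as a tiny helper. The function name `tasks_per_worker` is an assumption made for this sketch; the numbers are the ones from the slide.

```python
def tasks_per_worker(parallelism_hints, worker_processes):
    """Returns (combined parallelism, tasks per worker process)."""
    combined = sum(parallelism_hints)
    return combined, combined / worker_processes


# Values from the slide: six components, eight worker processes.
combined, per_worker = tasks_per_worker([4, 1, 2, 2, 3, 4], 8)
print(combined, per_worker)
```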
  18. Example: Word Count
     [Diagram: File, FileSpout (parallelism hint = 2), SplitterBolt (parallelism hint = 3), CounterBolt (parallelism hint = 2); "line" tuples and "word" tuples flow between them]
  19. Example: Word Count
     [Diagram: FileSpout, SplitterBolt, CounterBolt. File text: "Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!"]
  20. Example: Word Count
     [Diagram: as slide 19, plus the line "Storm is a distributed"]
  21. Example: Word Count
     [Diagram: as slide 20, plus the line "realtime computation system. Storm provides a"]
  22. Example: Word Count
     [Diagram: as slide 21, plus the label "shuffle grouping"]
  23. Example: Word Count
     [Diagram: as slide 22, plus the words Storm, a, is, distributed, realtime, computation, system, provides, Storm, a]
  24. Example: Word Count
     [Diagram: as slide 23, with the ten words shown a second time, each counted x1]
  25. Example: Word Count
     [Diagram: as slide 24, with the labels "shuffle grouping" and "fields grouping"; eight counts are shown: x2, x1, x1, x1, x2, x1, x1, x1]
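The difference between the x1 counts of slide 24 and the x2 counts of slide 25 can be reproduced with a small simulation. This is a sketch under assumptions: two counter tasks, round-robin as a deterministic stand-in for shuffle grouping, and a CRC32 hash of the word as a stand-in for fields grouping. None of this is Storm code.

```python
import zlib
from collections import Counter
from itertools import cycle

# The two lines emitted by the FileSpout in the slides.
LINES = ["Storm is a distributed",
         "realtime computation system. Storm provides a"]


def split(line):
    """SplitterBolt stand-in: one word per token, trailing dots removed."""
    return [w.strip(".") for w in line.split()]


def count_words(words, num_tasks, choose_task):
    """CounterBolt stand-in: each task keeps its own private counts."""
    tasks = [Counter() for _ in range(num_tasks)]
    for word in words:
        tasks[choose_task(word)][word] += 1
    return tasks


words = [w for line in LINES for w in split(line)]

# Shuffle grouping (approximated by round-robin): the word is ignored.
round_robin = cycle(range(2))
shuffled = count_words(words, 2, lambda w: next(round_robin))

# Fields grouping: the same word always reaches the same task.
by_field = count_words(words, 2, lambda w: zlib.crc32(w.encode()) % 2)

print("shuffle:", shuffled)
print("fields: ", by_field)
```

With the shuffle stand-in, both occurrences of "Storm" end up in different tasks, so each task only sees a partial count. With the fields stand-in, one task owns the word and holds the complete count.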
  26. Groupings
     ● Shuffle grouping
     ● Fields grouping
     ● All grouping
     ● Global grouping
     ● Direct grouping
     ● Local or shuffle grouping
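One way to read the six groupings is as functions that, given a tuple and the list of consumer task ids, return the tasks that should receive it. The sketch below is a simplified model, not Storm's implementation: the function signatures, the CRC32 hash and the explicit `local_tasks` argument are assumptions made for illustration.

```python
import random
import zlib


def shuffle_grouping(tup, tasks, rng):
    """Random task, so load is spread evenly over time."""
    return [rng.choice(tasks)]


def fields_grouping(tup, tasks, field_indexes):
    """Tuples with equal values in the chosen fields go to the same task."""
    key = repr(tuple(tup[i] for i in field_indexes)).encode()
    return [tasks[zlib.crc32(key) % len(tasks)]]


def all_grouping(tup, tasks):
    """Every task receives a copy of the tuple."""
    return list(tasks)


def global_grouping(tup, tasks):
    """The whole stream goes to a single task (here: the lowest id)."""
    return [min(tasks)]


def direct_grouping(tup, tasks, chosen):
    """The producer names the receiving task explicitly."""
    if chosen not in tasks:
        raise ValueError("unknown task id: %r" % (chosen,))
    return [chosen]


def local_or_shuffle_grouping(tup, tasks, local_tasks, rng):
    """Prefer tasks in the same worker process, else fall back to shuffle."""
    candidates = [t for t in tasks if t in local_tasks] or list(tasks)
    return [rng.choice(candidates)]


rng = random.Random(0)
tasks = [3, 4, 5]
print(fields_grouping(("Storm", 1), tasks, [0]))
print(all_grouping(("signal",), tasks))
print(global_grouping(("total",), tasks))
```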
  27. Fault-tolerance
     [Diagram labels: Nimbus; Zookeeper x3; Supervisor x2]
  28. Fault-tolerance
     ● Worker dies
       ○ Supervisor will restart it
     ● Worker dies too many times
       ○ Nimbus will reassign it to another node
     ● Node dies
       ○ Nimbus will reassign task to another node
     ● Nimbus is not a SPOF
     ● Nimbus & Supervisors are fail-fast
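The recovery rules on slide 28 can be pictured as a small state machine. This is a toy model under loud assumptions: the slide does not say what "too many times" means, so `MAX_LOCAL_RESTARTS = 3` is a made-up threshold, and the `Cluster` class, its method names and the node names are hypothetical rather than anything from Storm.

```python
MAX_LOCAL_RESTARTS = 3  # hypothetical threshold for "too many times"


class Cluster:
    def __init__(self, nodes):
        self.alive_nodes = set(nodes)
        self.assignment = {}  # worker name -> node name
        self.restarts = {}    # worker name -> restarts on its current node

    def assign(self, worker, node):
        self.assignment[worker] = node
        self.restarts[worker] = 0

    def _reassign(self, worker):
        """Nimbus role: move the worker to another live node."""
        current = self.assignment[worker]
        candidates = sorted(self.alive_nodes - {current})
        if not candidates:
            raise RuntimeError("no live node available")
        self.assign(worker, candidates[0])
        return candidates[0]

    def worker_died(self, worker):
        """Supervisor restarts locally until the threshold is exceeded."""
        self.restarts[worker] += 1
        if self.restarts[worker] > MAX_LOCAL_RESTARTS:
            return ("reassigned", self._reassign(worker))
        return ("restarted", self.assignment[worker])

    def node_died(self, node):
        """Nimbus role: move every worker hosted on the dead node."""
        self.alive_nodes.discard(node)
        moved = {}
        for worker, host in list(self.assignment.items()):
            if host == node:
                moved[worker] = self._reassign(worker)
        return moved


cluster = Cluster(["node-a", "node-b"])
cluster.assign("w1", "node-a")
for _ in range(4):
    print(cluster.worker_died("w1"))
```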
  29. Guaranteeing message processing
     ● Through API
       ○ ack
       ○ fail
     ● Manual tuple replay
       ○ e.g: Spout emits again message with specific id
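A manual replay scheme along the lines of slide 29 might look like the sketch below: the spout remembers every emitted message under its id, forgets it on `ack`, and queues it again on `fail`. The class `ReplayingSpout` is a hypothetical illustration in Python; only the method names loosely mirror the `nextTuple`, `ack` and `fail` callbacks of Storm's spout interface.

```python
from collections import deque


class ReplayingSpout:
    def __init__(self, messages):
        # Each message gets a specific id so it can be emitted again later.
        self.queue = deque(enumerate(messages))
        self.pending = {}  # msg_id -> payload, emitted but not yet acked

    def next_tuple(self):
        """Emits the next (msg_id, payload), or None if nothing is queued."""
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """The message was processed: stop tracking it."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Processing failed: queue the same message id for replay."""
        payload = self.pending.pop(msg_id)
        self.queue.append((msg_id, payload))


spout = ReplayingSpout(["first line", "second line"])
print(spout.next_tuple())  # message 0 is emitted
spout.fail(0)              # downstream reports a failure
print(spout.next_tuple())  # message 1 is emitted
spout.ack(1)
print(spout.next_tuple())  # message 0 is emitted again
```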
  30. Guaranteeing message processing
     ● When is a message “fully processed”?
     ● Solutions
       ○ Transactional Topologies
       ○ Trident framework
     [Diagram labels: line "Storm is a distributed"; words Storm, is, distributed, a; results Ok, Fail, Ok, Ok]
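One common answer to "fully processed" is that the spout tuple and every tuple derived from it have been acked. The sketch below tracks that with a plain set of pending ids. The `TupleTree` class is an assumption made for illustration, and which word fails in the slide's diagram is not recoverable from the text, so the choice of "is" here is arbitrary.

```python
class TupleTree:
    """Tracks a spout tuple and every tuple anchored to it."""

    def __init__(self, root_id):
        self.pending = {root_id}
        self.failed = False

    def anchor(self, child_id):
        """A bolt emitted a new tuple derived from this tree."""
        self.pending.add(child_id)

    def ack(self, tuple_id):
        self.pending.discard(tuple_id)

    def fail(self, tuple_id):
        self.failed = True

    @property
    def fully_processed(self):
        return not self.failed and not self.pending


# The line from the slide is split into four word tuples.
tree = TupleTree("line:Storm is a distributed")
for word in ["Storm", "is", "distributed", "a"]:
    tree.anchor("word:" + word)
tree.ack("line:Storm is a distributed")

for word in ["Storm", "distributed", "a"]:
    tree.ack("word:" + word)
tree.fail("word:is")  # arbitrary choice of failing word

print(tree.fully_processed)  # the whole line has to be replayed
```

A single failed word marks the whole line as not fully processed, which is why replaying it can count the other three words twice unless the topology handles duplicates.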
  31. Yet another example
     [Diagram labels: TwitterSpout, SplitterBolt, CounterBolt, CommitBolt, DB; "tweet", "word" and "signal" tuples; shuffle grouping, fields grouping, all grouping]
     https://github.com/ferrangali/betabeers-storm
  32. Batch + Real time
     ● Lambda architecture
     [Diagram labels: New data, Batch layer, Serving]
     Batch layer:
     ● High latency
     ● Reprocesses all data
  33. Batch + Real time
     ● Lambda architecture
     [Diagram labels: New data, Speed layer, Batch layer, Serving]
     Speed layer:
     ● Low latency
     ● Fast & incremental algorithms
     ● Eventually overridden by batch layer
     Batch layer:
     ● High latency
     ● Reprocesses all data
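The interplay of the two layers can be sketched as a counter whose queries add a batch view and a speed view. Everything here is a simplified assumption: the `LambdaCounts` class, integer timestamps and the "batch horizon" cut-off are illustrative, not taken from the slides.

```python
from collections import Counter


class LambdaCounts:
    def __init__(self):
        self.master = []             # all raw (timestamp, key) events
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = []         # recent events not yet covered by a batch run

    def new_data(self, timestamp, key):
        """New data feeds both layers."""
        self.master.append((timestamp, key))
        self.speed_view.append((timestamp, key))  # low latency, incremental

    def run_batch(self, up_to):
        """Batch layer: high latency, reprocesses all data up to a horizon."""
        self.batch_view = Counter(k for t, k in self.master if t <= up_to)
        # Speed-layer results covered by the batch run are overridden.
        self.speed_view = [(t, k) for t, k in self.speed_view if t > up_to]

    def query(self, key):
        """Serving: batch result plus whatever only the speed layer has seen."""
        recent = sum(1 for _, k in self.speed_view if k == key)
        return self.batch_view[key] + recent


counts = LambdaCounts()
counts.new_data(1, "ad")
counts.new_data(2, "ad")
print(counts.query("ad"))  # answered by the speed layer alone
counts.run_batch(up_to=1)
print(counts.query("ad"))  # one event from each layer
```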
  34. Storm
     ● Who’s using it?
  35. Trovit
     ● 40 countries
     ● 5 verticals
     ● Hundreds of millions of ads
  36. Trovit
     ● Batch layer:
       ○ MapReduce pipeline over HDFS
     [Diagram labels: kafka, xml; HDFS; Filter, Enrich, Dedup, Index]
  37. Trovit
     ● Speed layer
       ○ Storm topology
     [Diagram labels: kafka, xml; Feeds Spout, Kafka Spout, Processor Bolt, Indexer Bolt; "ad" and "rich ad" tuples; Group by index; Commit in batch every 5 minutes]
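The "Group by index" and "Commit in batch every 5 minutes" labels suggest an indexer that buffers ads per index and flushes them periodically. The sketch below is a guess at that behaviour, not Trovit's code: the `IndexerBolt` class shape, the injected `commit` and `now` callables, and the index names are all hypothetical.

```python
from collections import defaultdict

COMMIT_INTERVAL_SECONDS = 5 * 60  # "every 5 minutes" from the slide


class IndexerBolt:
    def __init__(self, commit, now):
        self.commit = commit  # callable(index_name, list_of_ads)
        self.now = now        # callable returning the current time in seconds
        self.buffers = defaultdict(list)
        self.last_commit = now()

    def execute(self, index_name, rich_ad):
        """Buffers the ad under its index and flushes when the interval is up."""
        self.buffers[index_name].append(rich_ad)
        if self.now() - self.last_commit >= COMMIT_INTERVAL_SECONDS:
            self.flush()

    def flush(self):
        """Commits one batch per index, then starts a new interval."""
        for index_name, ads in self.buffers.items():
            self.commit(index_name, list(ads))
        self.buffers.clear()
        self.last_commit = self.now()


# Usage with a fake clock so the example does not have to wait 5 minutes.
clock = {"t": 0}
commits = []
bolt = IndexerBolt(lambda index, ads: commits.append((index, ads)),
                   lambda: clock["t"])
bolt.execute("es", "ad1")
bolt.execute("uk", "ad2")
clock["t"] = 300
bolt.execute("es", "ad3")
print(commits)
```

Injecting the clock keeps the sketch testable; a real bolt would more likely rely on a timer or a periodic signal tuple.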
  38. Trovit
     [Diagram labels: kafka, xml; HDFS; Filter, Enrich, Dedup, Index; "ad" and "rich ad" tuples; HBase, Zookeeper]
  39. Questions?
     Ferran Galí i Reniu, @ferrangali, 19/06/2014