Realtime Analytics with Storm and Hadoop

Storm + Hadoop

@nathanmarz

So many Big Data technologies...

[Logos of Big Data technologies, including Storm and Kafka]

How to make these tools work together?

Goals of a data system
• Low latency reads
• Low latency writes
• Fault-tolerant
• Scalable

What is a data system?

Query = Function(All data)

Is there a general-purpose way to compute arbitrary functions in realtime?

(What’s the title of this talk?)

Example query

Total number of pageviews to a URL over a range of time

Example query

Implementation
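
A minimal sketch of the example query written directly as a function over all data. The record layout (a list of `(url, timestamp)` pairs), the function name `pageviews`, and the sample URLs are assumptions made for illustration. Every call scans every record, so the cost grows with the size of the dataset.

```python
from datetime import datetime

# Hypothetical record layout: (url, timestamp) pairs standing in for "all data".
def pageviews(all_data, url, start, end):
    """Count pageviews of `url` with start <= timestamp < end by scanning every record."""
    return sum(1 for u, ts in all_data if u == url and start <= ts < end)

all_data = [
    ("foo.com/blog", datetime(2012, 6, 1, 10, 5)),
    ("foo.com/blog", datetime(2012, 6, 1, 11, 30)),
    ("foo.com/about", datetime(2012, 6, 1, 11, 45)),
    ("foo.com/blog", datetime(2012, 6, 2, 9, 0)),
]

# Pageviews of foo.com/blog during June 1 (the June 2 record is out of range).
count = pageviews(all_data, "foo.com/blog", datetime(2012, 6, 1), datetime(2012, 6, 2))
print(count)  # 2
```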

Too slow: “all data” is petabyte-scale

Precomputation

All data -> Query

Precomputation

All data -> Precomputed view -> Query

Example query
All data (individual pageviews) -> Precomputed view -> Query (result: 2930)
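
One way to picture the precomputed view for the pageview query is an index keyed by URL and hour bucket. The sketch below is an assumption about how such a view could be laid out, not the layout used in the talk: `build_view`, `query`, the hourly granularity, and integer timestamps in seconds are all hypothetical. Building the view touches all data once; each query then reads only one entry per hour in the range.

```python
from collections import Counter

HOUR = 3600  # assumed bucket width, in seconds

def build_view(all_data):
    """Precompute pageview counts keyed by (url, hour bucket)."""
    view = Counter()
    for url, ts in all_data:
        view[(url, ts // HOUR)] += 1
    return view

def query(view, url, start_hour, end_hour):
    """Sum the precomputed buckets for start_hour <= hour < end_hour."""
    return sum(view[(url, h)] for h in range(start_hour, end_hour))

# Hypothetical sample data: (url, timestamp in seconds).
all_data = [
    ("foo.com/blog", 10),
    ("foo.com/blog", 3700),
    ("foo.com/blog", 7300),
    ("foo.com/about", 20),
]
view = build_view(all_data)
print(query(view, "foo.com/blog", 0, 2))  # 2
```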

Precomputation

All data -> Function -> Precomputed view -> Function -> Query

Hadoop

Great at computing arbitrary functions

Expressing those functions

• Cascalog
• Scalding

Hadoop precomputation
All data -> MapReduce workflow -> Batch view #1
All data -> MapReduce workflow -> Batch view #2

Batch view database

Need a database that...
• Is batch-writable from Hadoop
• Has fast random reads

Batch view database

No random writes required!

Batch view database

Examples
• ElephantDB
• Voldemort
• Manhattan

Batch view database

• Extremely simple
• ElephantDB is only a few thousand lines of code

So we’re done, right?

Not quite...
• A batch workflow is too slow
• Views are out of date

[Timeline: data is absorbed into batch views up to a few hours ago; just the last few hours of data, up to now, are not absorbed]

Compensating for the last few hours of data

New data stream -> Storm -> Realtime view #1
New data stream -> Storm -> Realtime view #2

Realtime views
Random read / random write databases
• Cassandra
• HBase
• Riak

Application queries

Batch view, Realtime view -> Merge
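
A minimal sketch of the merge step for the pageview example, assuming both views are keyed by `(url, hour bucket)` and that the realtime view holds only data not yet absorbed into the batch view, so the two never count the same pageview twice. The function name `merged_pageviews`, the plain dicts standing in for the view databases, and the sample counts are hypothetical.

```python
def merged_pageviews(batch_view, realtime_view, url, start_hour, end_hour):
    """Answer a query by summing the batch and realtime counts for each hour bucket."""
    total = 0
    for hour in range(start_hour, end_hour):
        key = (url, hour)
        total += batch_view.get(key, 0) + realtime_view.get(key, 0)
    return total

# Hour 1 is only partly absorbed: the batch view has the older part,
# the realtime view has the rest. Hour 2 exists only in the realtime view.
batch_view = {("foo.com/blog", 0): 120, ("foo.com/blog", 1): 95}
realtime_view = {("foo.com/blog", 1): 7, ("foo.com/blog", 2): 12}

print(merged_pageviews(batch_view, realtime_view, "foo.com/blog", 0, 3))  # 234
```

With this layout, entries in the realtime view can be discarded once a later batch run has absorbed the same data.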

Precomputation

All data -> Hadoop -> Precomputed batch view -> Query
New data stream -> Storm -> Precomputed realtime view -> Query


Storm

New data stream -> Storm -> Realtime view #1, Realtime view #2

Storm
Realtime computation system
• Guarantees data will be processed
• Horizontally scalable
• Fault-tolerant
• Fast

Storm

Source streams -> Storm

Storm Cluster
• Nimbus: master node (similar to Hadoop JobTracker)
• ZooKeeper: used for cluster coordination
• Supervisors: run worker processes

Starting a topology

Killing a topology

Storm concepts
• Streams
• Spouts
• Bolts
• Topologies

Streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Unbounded sequence of tuples

Spouts

Source of streams, for example:
• Read from a Kestrel queue
• Read directly from the Twitter streaming API

Bolts
• Functions
• Filters
• Joins
• Aggregations
• Talk to databases

Stream grouping

When a tuple is emitted, which task does it go to?
• Shuffle grouping: pick a random task
• Fields grouping: mod hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
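
The four groupings can be sketched as functions that map a tuple to the list of task ids that should receive it. This is a plain-Python illustration, not Storm's API: the function names, the dict-based tuples, and the use of `zlib.crc32` as the hash are assumptions (Storm uses its own hashing), and task ids are assumed to run from 0 to `num_tasks - 1`.

```python
import random
import zlib

def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick a random task."""
    return [rng.randrange(num_tasks)]

def fields_grouping(tup, num_tasks, fields):
    """Mod hashing on a subset of tuple fields: equal field values go to the same task."""
    key = repr(tuple(tup[f] for f in fields)).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]

def all_grouping(tup, num_tasks):
    """Send to all tasks."""
    return list(range(num_tasks))

def global_grouping(tup, num_tasks):
    """Pick the task with the lowest id."""
    return [0]

first = fields_grouping({"word": "storm", "count": 1}, 4, ["word"])
second = fields_grouping({"word": "storm", "count": 2}, 4, ["word"])
print(first == second)  # True: only the "word" field decides the target task
```

Fields grouping is what makes per-key aggregation possible: every tuple with the same key lands on the same task, so that task can keep the running state for the key locally.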

Streaming word count
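
A minimal sketch of streaming word count as a plain-Python simulation rather than Storm code. The names `sentence_spout`, `split_bolt`, `CountBolt`, and `task_for` are hypothetical, the spout yields a small finite sample instead of an unbounded stream, and `zlib.crc32` stands in for the hash behind a fields grouping on the word.

```python
import zlib
from collections import Counter

def sentence_spout():
    """Stand-in for a spout: emits sentences (finite here, unbounded in practice)."""
    yield from ["the cow jumped over the moon", "the man went to the store"]

def split_bolt(sentence):
    """Function bolt: turn one sentence tuple into one tuple per word."""
    return sentence.split()

class CountBolt:
    """Aggregation bolt: each task keeps running counts for the words routed to it."""
    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1
        return word, self.counts[word]

NUM_COUNT_TASKS = 3
count_tasks = [CountBolt() for _ in range(NUM_COUNT_TASKS)]

def task_for(word):
    """Fields grouping on the word: the same word always reaches the same task."""
    return zlib.crc32(word.encode("utf-8")) % NUM_COUNT_TASKS

for sentence in sentence_spout():
    for word in split_bolt(sentence):
        count_tasks[task_for(word)].execute(word)

# Combine per-task state only to inspect the result of the simulation.
totals = Counter()
for task in count_tasks:
    totals.update(task.counts)
print(totals["the"])  # 4
```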

Precomputation

All data -> Hadoop + Storm -> Precomputed views -> Query

Precomputation

All data -> Hadoop + Storm -> Precomputed views -> Storm -> Query


Distributed RPC

Sometimes there’s very little you can precompute, and you still require a lot of on-the-fly computation


Example

Reach is the number of unique people exposed to a URL on Twitter

Reach
URL -> Tweeters -> Followers -> Distinct followers
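
The reach computation can be sketched on a single machine as follows. The function name `reach` and the two lookup tables (`tweeters_of`, `followers_of`) are hypothetical stand-ins for whatever databases hold that data; in a distributed RPC setting each step would presumably be spread across many tasks, which this sketch does not attempt to show.

```python
def reach(url, tweeters_of, followers_of):
    """Number of distinct followers of everyone who tweeted `url`."""
    distinct_followers = set()
    for tweeter in tweeters_of.get(url, ()):
        # A person following several tweeters is only counted once.
        distinct_followers.update(followers_of.get(tweeter, ()))
    return len(distinct_followers)

# Hypothetical sample data.
tweeters_of = {"foo.com/blog": ["alice", "bob"]}
followers_of = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

print(reach("foo.com/blog", tweeters_of, followers_of))  # 3
```

Little of this can be precomputed per URL because the set of tweeters and their follower lists keep changing, which is why most of the work happens at query time.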

Storm + HDFS

[Diagram: HDFS, New data, Storm, Distributed RPC]

Use an HBase-like strategy to reliably store state within Storm bolts

storm-state library: https://github.com/nathanmarz/storm-contrib/tree/master/storm-state

Missing pieces
• Getting data into Storm
• Getting data into Hadoop

Getting data into Storm
Queuing system
• Kestrel
• Kafka
• RabbitMQ

Getting data into Hadoop
• Scribe
• Flume
• Kafka

Learn more

http://manning.com/marz
