Apache Samza*
Stream Processing at LinkedIn
Chris Riccomini
11/13/2013

* Incubating

Apache Incubator Samza: Stream Processing at LinkedIn

This is the slide deck that was presented at QCon SF on November 13, 2013.

The presentation covers what Samza is, why we built it, and how it works.

Published in: Technology, Business

  1. Apache Samza*: Stream Processing at LinkedIn. Chris Riccomini, 11/13/2013. (* Incubating)
  2. Stream Processing?
  3-6. Response latency • 0 ms: Synchronous • Milliseconds to minutes • Later. Possibly much later.
  7. Newsfeed
  8. News
  9. Ad Relevance
  10. Email
  11. Search Indexing Pipeline
  12. Metrics and Monitoring
  13. Motivation
  14. Real-time Feeds • User activity • Metrics • Monitoring • Database Changes
  15. Real-time Feeds • 10+ billion writes per day • 172,000 messages per second (average) • 55+ billion messages per day to real-time consumers
  16. Stream Processing is Hard • Partitioning • State • Re-processing • Failure semantics • Joins to services or database • Non-determinism
  17. Samza Concepts & Architecture
  18. Streams • Partition 0 • Partition 1 • Partition 2
  19-24. Streams • Partition 0: 1 2 3 4 5 6 • Partition 1: 1 2 3 4 5 • Partition 2: 1 2 3 4 5 6 7 • next append
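
The Streams slides show each partition as an ordered sequence of numbered messages, with new messages appended at the end. A minimal in-memory sketch of that idea follows. It is not Samza or Kafka code: `PartitionedStream` and its methods are hypothetical names, and choosing the partition by hashing the message key is an assumption the slides do not state. Offsets start at 1 to match the numbers on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned stream; not Samza or Kafka code.
class PartitionedStream {
  private final List<List<String>> partitions = new ArrayList<>();

  PartitionedStream(int partitionCount) {
    for (int i = 0; i < partitionCount; i++) {
      partitions.add(new ArrayList<>());
    }
  }

  // Assumption: the partition is chosen by hashing the message key.
  int partitionFor(String key) {
    return Math.floorMod(key.hashCode(), partitions.size());
  }

  // Appends to the end of the key's partition and returns the new message's
  // offset within that partition (1-based, like the numbers on the slides).
  long append(String key, String message) {
    List<String> partition = partitions.get(partitionFor(key));
    partition.add(message);
    return partition.size();
  }

  // Reads the message stored at a 1-based offset of a partition.
  String read(int partition, long offset) {
    return partitions.get(partition).get((int) (offset - 1));
  }

  // Number of messages currently in a partition.
  long size(int partition) {
    return partitions.get(partition).size();
  }

  public static void main(String[] args) {
    PartitionedStream pageViews = new PartitionedStream(3);
    long first = pageViews.append("member-1", "viewed /home");
    long second = pageViews.append("member-1", "viewed /jobs");
    int p = pageViews.partitionFor("member-1");
    System.out.println("partition " + p + ", offsets " + first + " and " + second);
    System.out.println(pageViews.read(p, second));
  }
}
```

Messages with the same key land in the same partition and keep their relative order, which is what lets a single task process them sequentially.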
  25. Tasks • Partition 0
  26. Tasks • Partition 0 • Task 1
  27-36. Tasks • Partition 0
      // pageKeyViews (a Map<String, AtomicInteger>) and countStream (a SystemStream)
      // are fields that the slide does not show.
      class PageKeyViewsCounterTask implements StreamTask {
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          GenericRecord record = (GenericRecord) envelope.getMessage();
          String pageKey = record.get("page-key").toString();
          // Increment the counter for this page key, creating it on first use.
          int newCount = pageKeyViews
              .computeIfAbsent(pageKey, k -> new AtomicInteger())
              .incrementAndGet();
          // Emit the updated count to the output stream, keyed by page key.
          collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
        }
      }
  37. Tasks • Partition 0 • Task 1
  38-45. Tasks • Page Views - Partition 0: 1 2 3 4 • PageKeyViewsCounterTask • Output Count Stream: Partition 0, Partition 1
  46-54. Tasks • Page Views - Partition 0: 1 2 3 4 • PageKeyViewsCounterTask • Checkpoint Stream - Partition 1: 2 • Output Count Stream: Partition 0, Partition 1
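
The checkpoint slides show the task writing its input offset (2) to a checkpoint stream. A small sketch of what such a checkpoint buys on restart is below. `CheckpointedReader` and its methods are hypothetical names for illustration, not Samza's API, and the sketch assumes the task resumes right after the last checkpointed offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of offset checkpointing; the names are illustrative, not Samza's API.
class CheckpointedReader {
  private final List<String> partition; // input partition; index 0 holds offset 1
  private long checkpoint = 0;          // last checkpointed offset (0 = nothing yet)
  private long position = 0;            // last offset handed to the task

  CheckpointedReader(List<String> partition) {
    this.partition = partition;
  }

  boolean hasNext() {
    return position < partition.size();
  }

  // Returns the next message and advances the in-memory position.
  String next() {
    return partition.get((int) position++);
  }

  // Records the current position, as if writing it to a checkpoint stream.
  void commit() {
    checkpoint = position;
  }

  // Simulates a restart: the in-memory position is lost, the checkpoint survives.
  void restart() {
    position = checkpoint;
  }

  long checkpoint() {
    return checkpoint;
  }

  public static void main(String[] args) {
    CheckpointedReader reader =
        new CheckpointedReader(Arrays.asList("m1", "m2", "m3", "m4"));
    List<String> processed = new ArrayList<>();
    processed.add(reader.next());
    processed.add(reader.next());
    reader.commit();              // checkpoint = 2, as on the slide
    processed.add(reader.next()); // m3 is processed but never checkpointed
    reader.restart();             // the task fails and comes back
    while (reader.hasNext()) {
      processed.add(reader.next());
    }
    System.out.println(processed); // [m1, m2, m3, m3, m4]
  }
}
```

In this model a message processed after the last checkpoint is seen again after the restart, so downstream logic has to tolerate repeats.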
  55. Jobs Stream A Task 1 Task 2 Stream B Task 3
  56. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3
  57-58. Jobs AdViews Task 1 AdClicks Task 2 AdClickThroughRate Task 3
  59. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3
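
Slides 57-58 name a concrete job: it reads the AdViews and AdClicks streams and writes AdClickThroughRate. The sketch below shows the per-task logic such a job might run. The class and method names are hypothetical, and it assumes both input streams are partitioned by ad id so that one task sees every view and click for a given ad; the slides do not spell this out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the AdViews + AdClicks -> AdClickThroughRate logic.
class ClickThroughRateCounter {
  private final Map<String, Long> views = new HashMap<>();
  private final Map<String, Long> clicks = new HashMap<>();

  // Called for each message on the (assumed) AdViews stream.
  double onView(String adId) {
    views.merge(adId, 1L, Long::sum);
    return rate(adId);
  }

  // Called for each message on the (assumed) AdClicks stream.
  double onClick(String adId) {
    clicks.merge(adId, 1L, Long::sum);
    return rate(adId);
  }

  // Clicks divided by views; 0.0 until the ad has at least one view.
  double rate(String adId) {
    long v = views.getOrDefault(adId, 0L);
    long c = clicks.getOrDefault(adId, 0L);
    return v == 0 ? 0.0 : (double) c / v;
  }

  public static void main(String[] args) {
    ClickThroughRateCounter ctr = new ClickThroughRateCounter();
    for (int i = 0; i < 4; i++) {
      ctr.onView("ad-1");
    }
    // One click after four views; the returned value is what the job would emit.
    System.out.println(ctr.onClick("ad-1")); // 0.25
  }
}
```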
  60-61. Dataflow Stream A Stream B Job 1 Stream D Job 2 Stream E Job 3 Stream B Stream C
  62. YARN
  63-66. YARN • You: I want to run command X on two machines with 512M of memory. • YARN: Cool, where’s your code? • You: http://some-host/jobs/download/my.tgz • YARN: I’ve run your command on grid-node-2 and grid-node-7.
  67. YARN Host 1 Host 2 Host 3
  68. YARN Host 1 Host 2 Host 3 NM NM NM (NM = NodeManager)
  69. YARN Host 0 RM Host 1 Host 2 Host 3 NM NM NM (RM = ResourceManager)
  70-72. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  73-75. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM (AM = ApplicationMaster)
  76-77. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  78-81. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  82. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  83. Jobs Stream A Task 1 Task 2 Stream B Task 3
  84. Containers Stream A Task 1 Task 2 Stream B Task 3
  85. Containers Stream A Samza Container 1 Stream B Samza Container 2
  86. Containers Samza Container 1 Samza Container 2
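
The Containers slides group a job's three tasks into two Samza containers. The slides do not say how tasks are assigned, so the sketch below simply assumes a round-robin spread. `ContainerAssignment.assign` is a hypothetical helper, not Samza's grouping code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: spreads task ids over containers round-robin.
// The round-robin policy is an assumption for illustration only.
class ContainerAssignment {
  // Returns one list of task ids (1-based, as on the slides) per container.
  static List<List<Integer>> assign(int taskCount, int containerCount) {
    List<List<Integer>> containers = new ArrayList<>();
    for (int c = 0; c < containerCount; c++) {
      containers.add(new ArrayList<>());
    }
    for (int task = 1; task <= taskCount; task++) {
      containers.get((task - 1) % containerCount).add(task);
    }
    return containers;
  }

  public static void main(String[] args) {
    // Three tasks in two containers, as in the slides.
    System.out.println(assign(3, 2)); // [[1, 3], [2]]
  }
}
```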
  87. YARN Host 1 Samza Container 1 Host 2 Samza Container 2
  88. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Samza Container 2
  89. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Samza Container 2 Samza YARN AM
  90. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  91. YARN Host 1 Host 2 NodeManager NodeManager MapReduce Container HDFS MapReduce YARN AM MapReduce Container HDFS
  92-95. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  96. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  97. CGroups Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  98. (Not Running) Multi-Framework Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka MapReduce Container Samza YARN AM HDFS
  99. Stateful Processing
  100-102.
      SELECT col1, count(*)
      FROM stream1
      INNER JOIN stream2 ON stream1.col3 = stream2.col3
      WHERE col2 > 20
      GROUP BY col1
      ORDER BY count(*) DESC
      LIMIT 50;
  103.
      SELECT col1, count(*)
      FROM stream1
      INNER JOIN stream2 ON stream1.col3 = stream2.col3
      WHERE col2 > 20
      GROUP BY col1
      ORDER BY count(*) DESC
      LIMIT 10;
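
The query on these slides is a convenient way to see which clauses need state when evaluated over streams: the WHERE filter looks at one message at a time, while the JOIN, GROUP BY and ORDER BY all need something remembered between messages. The sketch below is a simplified illustration with hypothetical names, not how Samza evaluates anything. It assumes `col2` belongs to `stream1`, and it only joins a `stream1` row against `stream2` rows that arrived earlier.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical, simplified streaming evaluation of the query on the slides.
class StreamQuerySketch {
  // State for the JOIN: number of stream2 rows seen so far per col3 value.
  private final Map<String, Integer> stream2ByCol3 = new HashMap<>();
  // State for GROUP BY col1: running count(*) per col1 value.
  private final Map<String, Long> countsByCol1 = new HashMap<>();

  void onStream2(String col3) {
    stream2ByCol3.merge(col3, 1, Integer::sum);
  }

  void onStream1(String col1, int col2, String col3) {
    if (col2 <= 20) {
      return; // WHERE col2 > 20 needs no state
    }
    int matches = stream2ByCol3.getOrDefault(col3, 0);
    if (matches > 0) {
      // Each matching stream2 row contributes one joined row to the count.
      countsByCol1.merge(col1, (long) matches, Long::sum);
    }
  }

  // ORDER BY count(*) DESC LIMIT n over the current state.
  List<Map.Entry<String, Long>> top(int limit) {
    return countsByCol1.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(limit)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    StreamQuerySketch query = new StreamQuerySketch();
    query.onStream2("x");
    query.onStream2("y");
    query.onStream1("a", 25, "x");
    query.onStream1("a", 30, "y");
    query.onStream1("b", 50, "x");
    query.onStream1("c", 10, "x"); // filtered out by the WHERE clause
    System.out.println(query.top(10)); // [a=2, b=1]
  }
}
```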
  104. How do people do this?
  105. Remote Stores Stream A Task 1 Task 2 Task 3 Key-Value Store Stream B
  106. Remote RPC is slow • Stream: ~500k records/sec/container • DB: much less
  107. Online vs. Async
  108. No undo • Database state is non-deterministic • Can’t roll back mutations if task crashes
  109. Tables & Streams put(a, w) put(b, x) Database put(a, y) put(b, z) Time
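
The Tables & Streams slide lists four puts over time next to a database. One way to read it: a table is what you get by replaying a stream of changes in order. A minimal sketch, assuming the puts happen in the order listed on the slide; `ChangelogReplay` and `Put` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: materializing a table from a stream of put operations.
class ChangelogReplay {
  // One change record: the key written and its new value.
  static final class Put {
    final String key;
    final String value;

    Put(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  // Replays the puts in order; the latest put for each key wins.
  static Map<String, String> materialize(List<Put> changelog) {
    Map<String, String> table = new TreeMap<>();
    for (Put put : changelog) {
      table.put(put.key, put.value);
    }
    return table;
  }

  public static void main(String[] args) {
    List<Put> changelog = Arrays.asList(
        new Put("a", "w"), new Put("b", "x"), new Put("a", "y"), new Put("b", "z"));
    System.out.println(materialize(changelog)); // {a=y, b=z}
  }
}
```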
  110-111. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3
  112-123. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  124. Key-Value Store • put(table_name, key, value) • get(table_name, key) • delete(table_name, key) • range(table_name, key1, key2)
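
The Stateful Tasks slides pair each task with a changelog stream, and slide 124 lists the store operations. The sketch below puts the two together for a single table. It is a hypothetical in-memory model, not Samza's `KeyValueStore`: the `table_name` argument is dropped, a `null` value in the log stands for a delete, and `range` is assumed to include `key1` and exclude `key2`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical local store that logs every mutation to a changelog.
class LoggedKeyValueStore {
  private final NavigableMap<String, String> data = new TreeMap<>();
  private final List<String[]> changelog; // entries are {key, value}; null value = delete

  // Restores local state by replaying whatever is already in the changelog.
  LoggedKeyValueStore(List<String[]> changelog) {
    this.changelog = changelog;
    for (String[] change : changelog) {
      apply(change[0], change[1]);
    }
  }

  private void apply(String key, String value) {
    if (value == null) {
      data.remove(key);
    } else {
      data.put(key, value);
    }
  }

  void put(String key, String value) {
    changelog.add(new String[] {key, value});
    apply(key, value);
  }

  String get(String key) {
    return data.get(key);
  }

  void delete(String key) {
    changelog.add(new String[] {key, null});
    apply(key, null);
  }

  // Copy of the entries with key1 <= key < key2, in key order.
  SortedMap<String, String> range(String key1, String key2) {
    return new TreeMap<>(data.subMap(key1, key2));
  }

  public static void main(String[] args) {
    List<String[]> changelog = new ArrayList<>();
    LoggedKeyValueStore store = new LoggedKeyValueStore(changelog);
    store.put("member-1", "Ann");
    store.put("member-2", "Bob");
    store.delete("member-1");
    store.put("member-3", "Cy");

    // Simulate the task restarting elsewhere: rebuild state from the changelog.
    LoggedKeyValueStore restored = new LoggedKeyValueStore(changelog);
    System.out.println(restored.get("member-1"));               // null
    System.out.println(restored.range("member-2", "member-4")); // {member-2=Bob, member-3=Cy}
  }
}
```

Reads stay local to the task in this model, and the changelog is the only thing that has to survive a failure.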
  125-128. Stateful Stream Task
      public class SimpleStatefulTask implements StreamTask, InitableTask {
        private KeyValueStore<String, String> store;
        // Look up the task-local store by its configured name.
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
          this.store = (KeyValueStore<String, String>) context.getStore("mystore");
        }
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          GenericRecord record = (GenericRecord) envelope.getMessage();
          // GenericRecord.get returns Object, so convert to String explicitly.
          String memberId = record.get("member_id").toString();
          String name = record.get("name").toString();
          System.out.println("old name: " + store.get(memberId));
          // Overwrite the stored name for this member.
          store.put(memberId, name);
        }
      }
  129. Whew!
  130. Let’s be Friends! • We are incubating, and you can help! • Get up and running in 5 minutes: http://bit.ly/hello-samza • Grab some newbie JIRAs: http://bit.ly/samza_newbie_issues
