Apache Samza*
Stream Processing at LinkedIn
Chris Riccomini
11/13/2013

* Incubating
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/samza-linkedin

InfoQ.com: Ne...
Presented at QCon San Francisco
www.qconsf.com
Purpose of QCon
- to empower software development by facilitating the sprea...
Stream Processing?
0 ms

Response latency
0 ms

Response latency

Synchronous
0 ms

Response latency

Synchronous

Later. Possibly much later.
0 ms

Response latency
Milliseconds to minutes
Synchronous

Later. Possibly much later.
Newsfeed
News
Ad Relevance
Email
Search Indexing Pipeline
Metrics and Monitoring
Motivation
Real-time Feeds
•
•
•
•

User activity
Metrics
Monitoring
Database Changes
Real-time Feeds
• 10+ billion writes per day
• 172,000 messages per second (average)
• 55+ billion messages per day to rea...
Stream Processing is Hard
•
•
•
•
•
•

Partitioning
State
Re-processing
Failure semantics
Joins to services or database
No...
Samza Concepts
&
Architecture
Streams
Partition 0

Partition 1

Partition 2
Streams
Partition 0

1
2
3
4
5
6

Partition 1

1
2
3
4
5

Partition 2

1
2
3
4
5
6
7
Streams
Partition 0

1
2
3
4
5
6

Partition 1

1
2
3
4
5

Partition 2

1
2
3
4
5
6
7
Streams
Partition 0

1
2
3
4
5
6

Partition 1

1
2
3
4
5

Partition 2

1
2
3
4
5
6
7
Streams
Partition 0

1
2
3
4
5
6

Partition 1

1
2
3
4
5

Partition 2

1
2
3
4
5
6
7
Streams
Partition 0

1
2
3
4
5
6

Partition 1

1
2
3
4
5

Partition 2

1
2
3
4
5
6
7
Streams
Partition 0

1
2
3
4
5
6

Partition 1

1
2
3
4
5

Partition 2

1
2
3
4
5
6
7

next append
Tasks
Partition 0
Tasks
Partition 0

Task 1
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envel...
Tasks
Partition 0

Task 1
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Partition 0

Partition 1

Output Count Stream
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Partition 0

Partition 1

Output Count Stream
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Partition 0

Partition 1

Output Count Stream
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Output Count Stream

Partition 0
Partition 1
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Output Count Stream

Partition 0
Partition 1
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Output Count Stream

Partition 0
Partition 1
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Output Count Stream
Partition 0

Partition 1
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Output Count Stream
Partition 0

Partition 1
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Tasks
Page Views - Partition 0

1
2
3
4
PageKeyViews
CounterTask

Checkpoint
Stream

2
Output Count Stream

Partition 1
Pa...
Jobs
Stream A

Task 1

Task 2

Stream B

Task 3
Jobs
Stream A

Task 1

Stream B

Task 2

Stream C

Task 3
Jobs
AdViews

Task 1

AdClicks

Task 2

AdClickThroughRate

Task 3
Jobs
AdViews

Task 1

AdClicks

Task 2

AdClickThroughRate

Task 3
Jobs
Stream A

Task 1

Stream B

Task 2

Stream C

Task 3
Dataflow
Stream A

Stream B

Job 1

Stream D

Job 2

Stream E

Job 3

Stream B

Stream C
Dataflow
Stream A

Stream B

Job 1

Stream D

Job 2

Stream E

Job 3

Stream B

Stream C
YARN
YARN
You: I want to run command X on two machines with
512M of memory.
YARN
You: I want to run command X on two machines with
512M of memory.
YARN: Cool, where’s your code?
YARN
You: I want to run command X on two machines with
512M of memory.
YARN: Cool, where’s your code?
You: http://some-hos...
YARN
You: I want to run command X on two machines with
512M of memory.
YARN: Cool, where’s your code?
You: http://some-hos...
YARN

Host 1

Host 2

Host 3
YARN

Host 1

Host 2

Host 3

NM

NM

NM
YARN
Host 0
RM

Host 1

Host 2

Host 3

NM

NM

NM
YARN
Host 0
Client

RM

Host 1

Host 2

Host 3

NM

NM

NM
YARN
Host 0
Client

RM

Host 1

Host 2

Host 3

NM

NM

NM
YARN
Host 0
Client

RM

Host 1

Host 2

Host 3

NM

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
Container
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
Container
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM
YARN
Host 0
Client

Host 1
NM

RM

Host 2
AM

Host 3

NM

NM

Container
Jobs
Stream A

Task 1

Task 2

Stream B

Task 3
Containers
Stream A

Task 1

Task 2

Stream B

Task 3
Containers
Stream A

Samza Container 1

Stream B

Samza Container 2
Containers

Samza Container 1

Samza Container 2
YARN
Host 1

Samza Container 1

Host 2

Samza Container 2
YARN
Host 1

Host 2

NodeManager

NodeManager

Samza Container 1

Samza Container 2
YARN
Host 1

Host 2

NodeManager

NodeManager

Samza Container 1

Samza Container 2

Samza YARN AM
YARN
Host 1

Host 2

NodeManager

NodeManager

Samza Container 1

Kafka Broker

Samza Container 2

Samza YARN AM

Kafka Br...
YARN
Host 1

Host 2

NodeManager

NodeManager

MapReduce
Container

HDFS

MapReduce
YARN AM

MapReduce
Container

HDFS
YARN
Host 1
Stream A

NodeManager

Samza Container 1
Samza Container 1

Kafka Broker
Stream C

Samza
Container 2
YARN
Host 1
Stream A

NodeManager

Samza Container 1
Samza Container 1

Kafka Broker
Stream C

Samza
Container 2
YARN
Host 1
Stream A

NodeManager

Samza Container 1
Samza Container 1

Kafka Broker
Stream C

Samza
Container 2
YARN
Host 1
Stream A

NodeManager

Samza Container 1
Samza Container 1

Kafka Broker
Stream C

Samza
Container 2
YARN
Host 1

Host 2

NodeManager

NodeManager

Samza Container 1

Kafka Broker

Samza Container 2

Samza YARN AM

Kafka Br...
CGroups
Host 1

Host 2

NodeManager

NodeManager

Samza Container 1

Kafka Broker

Samza Container 2

Samza YARN AM

Kafka...
(Not Running) Multi-Framework
Host 1

Host 2

NodeManager

NodeManager

Samza Container 1

Kafka

MapReduce
Container

Sam...
Stateful Processing
SELECT
col1,
count(*)
FROM
stream1
INNER JOIN
stream2
ON
stream1.col3 = stream2.col3
WHERE
col2 > 20
GROUP BY
col1
ORDER B...
SELECT
col1,
count(*)
FROM
stream1
INNER JOIN
stream2
ON
stream1.col3 = stream2.col3
WHERE
col2 > 20
GROUP BY
col1
ORDER B...
SELECT
col1,
count(*)
FROM
stream1
INNER JOIN
stream2
ON
stream1.col3 = stream2.col3
WHERE
col2 > 20
GROUP BY
col1
ORDER B...
SELECT
col1,
count(*)
FROM
stream1
INNER JOIN
stream2
ON
stream1.col3 = stream2.col3
WHERE
col2 > 20
GROUP BY
col1
ORDER B...
How do people do this?
Remote Stores
Stream A

Task 1

Task 2

Task 3

Key-Value
Store
Stream B
Remote RPC is slow
• Stream: ~500k records/sec/container
• DB: << less
Online vs. Async
No undo
• Database state is non-deterministic
• Can’t roll back mutations if task crashes
Tables & Streams
put(a, w)
put(b, x)
Database

put(a, y)
put(b, z)

Time
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Stateful Tasks
Stream A

Task 1

Task 2

Stream B

Task 3

Changelog Stream
Key-Value Store
•
•
•
•

put(table_name, key, value)
get(table_name, key)
delete(table_name, key)
range(table_name, key1, ...
Stateful Stream Task
public class SimpleStatefulTask implements StreamTask, InitableTask {
private KeyValueStore<String, S...
Stateful Stream Task
public class SimpleStatefulTask implements StreamTask, InitableTask {
private KeyValueStore<String, S...
Stateful Stream Task
public class SimpleStatefulTask implements StreamTask, InitableTask {
private KeyValueStore<String, S...
Stateful Stream Task
public class SimpleStatefulTask implements StreamTask, InitableTask {
private KeyValueStore<String, S...
Whew!
Let’s be Friends!
• We are incubating, and you can help!
• Get up and running in 5 minutes
http://bit.ly/hello-samza
• Gra...
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/samzalinkedin
Samza: Real-time Stream Processing at LinkedIn
Samza: Real-time Stream Processing at LinkedIn
Upcoming SlideShare
Loading in...5
×

Samza: Real-time Stream Processing at LinkedIn

1,250

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1eGbVJv.

Chris Riccomini discusses: Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap. Filmed at qconsf.com.

Chris Riccomini is a Staff Software Engineer at LinkedIn, where he's is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including, "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.

Published in: Technology, Business
0 Comments
18 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,250
On Slideshare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
0
Comments
0
Likes
18
Embeds 0
No embeds

No notes for slide

Samza: Real-time Stream Processing at LinkedIn

  1. 1. Apache Samza* Stream Processing at LinkedIn Chris Riccomini 11/13/2013 * Incubating
  2. 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /samza-linkedin InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  3. 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Stream Processing?
  5. 5. 0 ms Response latency
  6. 6. 0 ms Response latency Synchronous
  7. 7. 0 ms Response latency Synchronous Later. Possibly much later.
  8. 8. 0 ms Response latency Milliseconds to minutes Synchronous Later. Possibly much later.
  9. 9. Newsfeed
  10. 10. News
  11. 11. Ad Relevance
  12. 12. Email
  13. 13. Search Indexing Pipeline
  14. 14. Metrics and Monitoring
  15. 15. Motivation
  16. 16. Real-time Feeds • • • • User activity Metrics Monitoring Database Changes
  17. 17. Real-time Feeds • 10+ billion writes per day • 172,000 messages per second (average) • 55+ billion messages per day to real-time consumers
  18. 18. Stream Processing is Hard • • • • • • Partitioning State Re-processing Failure semantics Joins to services or database Non-determinism
  19. 19. Samza Concepts & Architecture
  20. 20. Streams Partition 0 Partition 1 Partition 2
  21. 21. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  22. 22. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  23. 23. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  24. 24. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  25. 25. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7
  26. 26. Streams Partition 0 1 2 3 4 5 6 Partition 1 1 2 3 4 5 Partition 2 1 2 3 4 5 6 7 next append
  27. 27. Tasks Partition 0
  28. 28. Tasks Partition 0 Task 1
  29. 29. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  30. 30. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  31. 31. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  32. 32. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  33. 33. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  34. 34. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  35. 35. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  36. 36. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  37. 37. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  38. 38. Tasks Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } }
  39. 39. Tasks Partition 0 Task 1
  40. 40. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Partition 0 Partition 1 Output Count Stream
  41. 41. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Partition 0 Partition 1 Output Count Stream
  42. 42. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Partition 0 Partition 1 Output Count Stream
  43. 43. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  44. 44. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  45. 45. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  46. 46. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  47. 47. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Output Count Stream Partition 0 Partition 1
  48. 48. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  49. 49. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  50. 50. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  51. 51. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  52. 52. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  53. 53. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  54. 54. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  55. 55. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  56. 56. Tasks Page Views - Partition 0 1 2 3 4 PageKeyViews CounterTask Checkpoint Stream 2 Output Count Stream Partition 1 Partition 0 Partition 1
  57. 57. Jobs Stream A Task 1 Task 2 Stream B Task 3
  58. 58. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3
  59. 59. Jobs AdViews Task 1 AdClicks Task 2 AdClickThroughRate Task 3
  60. 60. Jobs AdViews Task 1 AdClicks Task 2 AdClickThroughRate Task 3
  61. 61. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3
  62. 62. Dataflow Stream A Stream B Job 1 Stream D Job 2 Stream E Job 3 Stream B Stream C
  63. 63. Dataflow Stream A Stream B Job 1 Stream D Job 2 Stream E Job 3 Stream B Stream C
  64. 64. YARN
  65. 65. YARN You: I want to run command X on two machines with 512M of memory.
  66. 66. YARN You: I want to run command X on two machines with 512M of memory. YARN: Cool, where’s your code?
  67. 67. YARN You: I want to run command X on two machines with 512M of memory. YARN: Cool, where’s your code? You: http://some-host/jobs/download/my.tgz
  68. 68. YARN You: I want to run command X on two machines with 512M of memory. YARN: Cool, where’s your code? You: http://some-host/jobs/download/my.tgz YARN: I’ve run your command on grid-node-2 and grid-node-7.
  69. 69. YARN Host 1 Host 2 Host 3
  70. 70. YARN Host 1 Host 2 Host 3 NM NM NM
  71. 71. YARN Host 0 RM Host 1 Host 2 Host 3 NM NM NM
  72. 72. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  73. 73. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  74. 74. YARN Host 0 Client RM Host 1 Host 2 Host 3 NM NM NM
  75. 75. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  76. 76. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  77. 77. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  78. 78. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  79. 79. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  80. 80. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  81. 81. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  82. 82. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  83. 83. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM
  84. 84. YARN Host 0 Client Host 1 NM RM Host 2 AM Host 3 NM NM Container
  85. 85. Jobs Stream A Task 1 Task 2 Stream B Task 3
  86. 86. Containers Stream A Task 1 Task 2 Stream B Task 3
  87. 87. Containers Stream A Samza Container 1 Stream B Samza Container 2
  88. 88. Containers Samza Container 1 Samza Container 2
  89. 89. YARN Host 1 Samza Container 1 Host 2 Samza Container 2
  90. 90. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Samza Container 2
  91. 91. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Samza Container 2 Samza YARN AM
  92. 92. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  93. 93. YARN Host 1 Host 2 NodeManager NodeManager MapReduce Container HDFS MapReduce YARN AM MapReduce Container HDFS
  94. 94. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  95. 95. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  96. 96. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  97. 97. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2
  98. 98. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  99. 99. CGroups Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker
  100. 100. (Not Running) Multi-Framework Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka MapReduce Container Samza YARN AM HDFS
  101. 101. Stateful Processing
  102. 102. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  103. 103. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  104. 104. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;
  105. 105. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 10;
  106. 106. How do people do this?
  107. 107. Remote Stores Stream A Task 1 Task 2 Task 3 Key-Value Store Stream B
  108. 108. Remote RPC is slow • Stream: ~500k records/sec/container • DB: << less
  109. 109. Online vs. Async
  110. 110. No undo • Database state is non-deterministic • Can’t roll back mutations if task crashes
  111. 111. Tables & Streams put(a, w) put(b, x) Database put(a, y) put(b, z) Time
  112. 112. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3
  113. 113. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3
  114. 114. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  115. 115. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  116. 116. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  117. 117. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  118. 118. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  119. 119. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  120. 120. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  121. 121. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  122. 122. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  123. 123. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  124. 124. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  125. 125. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
  126. 126. Key-Value Store • • • • put(table_name, key, value) get(table_name, key) delete(table_name, key) range(table_name, key1, key2)
  127. 127. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  128. 128. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  129. 129. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  130. 130. Stateful Stream Task public class SimpleStatefulTask implements StreamTask, InitableTask { private KeyValueStore<String, String> store; public void init(Config config, TaskContext context) { this.store = context.getStore("mystore"); } public void process( IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = (GenericRecord) envelope.getMessage(); String memberId = record.get("member_id"); String name = record.get("name"); System.out.println("old name: " + store.get(memberId)); store.put(memberId, name); } }
  131. 131. Whew!
  132. 132. Let’s be Friends! • We are incubating, and you can help! • Get up and running in 5 minutes http://bit.ly/hello-samza • Grab some newbie JIRAs http://bit.ly/samza_newbie_issues
  133. 133. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/samzalinkedin

×