Samza: Real-time Stream Processing at LinkedIn

1. Apache Samza (incubating): Stream Processing at LinkedIn • Chris Riccomini • 11/13/2013

2. Watch the video with slide synchronization on InfoQ.com: http://www.infoq.com/presentations/samza-linkedin

3. Presented at QCon San Francisco (www.qconsf.com)

5. Stream Processing?

9. Response latency • 0 ms: synchronous • Milliseconds to minutes • Later. Possibly much later.

10. Newsfeed

11. News

12. Ad Relevance

13. Email

14. Search Indexing Pipeline

15. Metrics and Monitoring

16. Motivation

17. Real-time Feeds • User activity • Metrics • Monitoring • Database changes

18. Real-time Feeds • 10+ billion writes per day • 172,000 messages per second (average) • 55+ billion messages per day to real-time consumers

19. Stream Processing is Hard • Partitioning • State • Re-processing • Failure semantics • Joins to services or database • Non-determinism

20. Samza Concepts & Architecture

27. Streams • Partition 0: 1 2 3 4 5 6 • Partition 1: 1 2 3 4 5 • Partition 2: 1 2 3 4 5 6 7 • The next append goes at the end of a partition
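
The partitioned stream pictured above can be sketched in plain Java. This is an illustrative in-memory model only, not Samza or Kafka code: the class name `PartitionedStream`, the hash-based partitioner, and offsets that start at 0 are all assumptions made for the example (the slide numbers its messages from 1).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned, append-only stream.
class PartitionedStream {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStream(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumed partitioner: hash of the key modulo the partition count.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the end of the key's partition and returns the new message's offset.
    int append(String key, String message) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(message);
        return log.size() - 1;
    }

    // Reads one message by partition and offset; existing messages are never rewritten.
    String read(int partition, int offset) {
        return partitions.get(partition).get(offset);
    }

    // Offset that the next append to this partition will receive.
    int nextOffset(int partition) {
        return partitions.get(partition).size();
    }
}

class PartitionedStreamDemo {
    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(3);
        int p = stream.partitionFor("member-42");
        int first = stream.append("member-42", "viewed /home");
        int second = stream.append("member-42", "viewed /jobs");
        System.out.println("partition " + p + ", offsets " + first + " and " + second);
        System.out.println("next append offset: " + stream.nextOffset(p));
    }
}
```

Messages with the same key always land in the same partition, so their relative order is preserved within that partition; no ordering is implied across partitions.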

28. Tasks Partition 0

29. Tasks Partition 0 Task 1

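30. Tasks Partition 0

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.samza.system.*;
    import org.apache.samza.task.*;

    class PageKeyViewsCounterTask implements StreamTask {
      // The slide does not define these fields; the system and stream names are illustrative.
      private final SystemStream countStream = new SystemStream("kafka", "page-key-view-counts");
      private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        GenericRecord record = (GenericRecord) envelope.getMessage();
        String pageKey = record.get("page-key").toString();
        // Create the counter on first sight of a page key, then increment it.
        int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
        collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
      }
    }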

40. Tasks Partition 0 Task 1

41. Tasks • Page Views, Partition 0: 1 2 3 4 • PageKeyViewsCounterTask • Output Count Stream: Partition 0, Partition 1

49. Tasks • Page Views, Partition 0: 1 2 3 4 • PageKeyViewsCounterTask • Checkpoint Stream (Partition 1): 2 • Output Count Stream: Partition 0, Partition 1
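
The checkpoint stream in the diagram records how far a task has read in its input partition, so that a restarted task can pick up from that point. The following is a minimal sketch of that idea, assuming a hypothetical `CheckpointedReader` that reads from an in-memory list and keeps the checkpoint in a field. Samza itself writes checkpoints to a stream, and none of the names below are its API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader that tracks its position in one input partition.
class CheckpointedReader {
    private final List<String> partition;                       // append-only input
    private final List<String> processed = new ArrayList<>();   // every message handled, replays included
    private int checkpoint = 0;                                 // offset to resume from after a restart
    private int position = 0;                                   // next offset to read right now

    CheckpointedReader(List<String> partition) {
        this.partition = partition;
    }

    // Processes up to maxMessages messages starting at the current position.
    void poll(int maxMessages) {
        for (int i = 0; i < maxMessages && position < partition.size(); i++) {
            processed.add(partition.get(position)); // stand-in for real work
            position++;
        }
    }

    // Records the current position, as a write to the checkpoint stream would.
    void commit() {
        checkpoint = position;
    }

    // Simulates a crash and restart: the in-memory position is lost,
    // so reading resumes from the last committed checkpoint.
    void restart() {
        position = checkpoint;
    }

    int position() { return position; }
    List<String> processed() { return processed; }
}

class CheckpointedReaderDemo {
    public static void main(String[] args) {
        CheckpointedReader reader = new CheckpointedReader(Arrays.asList("m1", "m2", "m3", "m4"));
        reader.poll(2);    // handles m1, m2
        reader.commit();   // checkpoint = 2
        reader.poll(1);    // handles m3, not yet checkpointed
        reader.restart();  // crash: back to offset 2
        reader.poll(10);   // handles m3 again, then m4
        System.out.println(reader.processed());
    }
}
```

In this sketch a message read after the last checkpoint is handled a second time after the restart, which is why any state derived from those messages needs care when a task fails.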

58. Jobs Stream A Task 1 Task 2 Stream B Task 3

59. Jobs Stream A Task 1 Stream B Task 2 Stream C Task 3

60. Jobs AdViews Task 1 AdClicks Task 2 AdClickThroughRate Task 3
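
As a rough illustration of the job above, the sketch below consumes view and click events and derives a click-through rate per ad. It is plain Java with no Samza dependency; the class `ClickThroughRateJob`, its method names, and the choice to key everything by an ad ID string are assumptions for the example. It also assumes both input streams are partitioned by ad ID, so that one task sees every view and click for a given ad.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical job logic: counts views and clicks per ad and derives a rate.
class ClickThroughRateJob {
    // adId -> {views, clicks}
    private final Map<String, long[]> counts = new HashMap<>();

    // Called for each message on the (assumed) AdViews stream.
    void onAdView(String adId) {
        counts.computeIfAbsent(adId, k -> new long[2])[0]++;
    }

    // Called for each message on the (assumed) AdClicks stream.
    void onAdClick(String adId) {
        counts.computeIfAbsent(adId, k -> new long[2])[1]++;
    }

    // Returns clicks divided by views, or 0.0 when the ad has no recorded views.
    double clickThroughRate(String adId) {
        long[] c = counts.get(adId);
        if (c == null || c[0] == 0) {
            return 0.0;
        }
        return (double) c[1] / c[0];
    }
}

class ClickThroughRateDemo {
    public static void main(String[] args) {
        ClickThroughRateJob job = new ClickThroughRateJob();
        for (int i = 0; i < 4; i++) {
            job.onAdView("ad-7");
        }
        job.onAdClick("ad-7");
        System.out.println("CTR for ad-7: " + job.clickThroughRate("ad-7"));
    }
}
```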

63. Dataflow Stream A Stream B Job 1 Stream D Job 2 Stream E Job 3 Stream B Stream C

65. YARN

70. YARN
You: I want to run command X on two machines with 512M of memory.
YARN: Cool, where’s your code?
You: http://some-host/jobs/download/my.tgz
YARN: I’ve run your command on grid-node-2 and grid-node-7.

74. YARN • Client • Host 0: RM (ResourceManager) • Hosts 1, 2, 3: one NM (NodeManager) each

80. YARN • Client • Host 0: RM • Hosts 1, 2, 3: one NM each • An AM (ApplicationMaster) and a Container run on the NM hosts

87. Jobs Stream A Task 1 Task 2 Stream B Task 3

88. Containers Stream A Task 1 Task 2 Stream B Task 3

89. Containers Stream A Samza Container 1 Stream B Samza Container 2

90. Containers Samza Container 1 Samza Container 2

94. YARN • Host 1: NodeManager, Samza Container 1, Kafka Broker • Host 2: NodeManager, Samza Container 2, Samza YARN AM, Kafka Broker

95. YARN • Host 1: NodeManager, MapReduce Container, HDFS • Host 2: NodeManager, MapReduce YARN AM, MapReduce Container, HDFS

96. YARN Host 1 Stream A NodeManager Samza Container 1 Samza Container 1 Kafka Broker Stream C Samza Container 2

100. YARN Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker

101. CGroups Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka Broker Samza Container 2 Samza YARN AM Kafka Broker

102. (Not Running) Multi-Framework Host 1 Host 2 NodeManager NodeManager Samza Container 1 Kafka MapReduce Container Samza YARN AM HDFS

103. Stateful Processing

104. SELECT col1, count(*) FROM stream1 INNER JOIN stream2 ON stream1.col3 = stream2.col3 WHERE col2 > 20 GROUP BY col1 ORDER BY count(*) DESC LIMIT 50;

108. How do people do this?

109. Remote Stores Stream A Task 1 Task 2 Task 3 Key-Value Store Stream B

110. Remote RPC is slow • Stream: ~500k records/sec/container • DB: far less (<<)

111. Online vs. Async

112. No undo • Database state is non-deterministic • Can’t roll back mutations if task crashes

113. Tables & Streams • Database • Writes over time: put(a, w), put(b, x), put(a, y), put(b, z)

114. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3

116. Stateful Tasks Stream A Task 1 Task 2 Stream B Task 3 Changelog Stream
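
One way to read the changelog stream in this diagram: every write to a task's local state is also appended to a stream, and a restarted task rebuilds its state by replaying that stream in order. The sketch below illustrates the idea with the puts from slide 113. The class `ChangelogBackedStore` and its list-based changelog are hypothetical stand-ins, not Samza's storage API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local store whose writes are mirrored to a changelog.
class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>();
    private final List<String[]> changelog; // each entry is {key, value}

    // Restores local state by replaying the changelog from the beginning.
    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            local.put(entry[0], entry[1]); // later entries overwrite earlier ones
        }
    }

    // Writes locally and appends the same change to the changelog.
    void put(String key, String value) {
        local.put(key, value);
        changelog.add(new String[] {key, value});
    }

    String get(String key) {
        return local.get(key);
    }
}

class ChangelogDemo {
    public static void main(String[] args) {
        List<String[]> changelog = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        store.put("a", "w");
        store.put("b", "x");
        store.put("a", "y");
        store.put("b", "z");

        // Simulate losing the task: a new store replays the same changelog.
        ChangelogBackedStore restored = new ChangelogBackedStore(changelog);
        System.out.println("a=" + restored.get("a") + ", b=" + restored.get("b"));
    }
}
```

Reads stay local and fast, while durability comes from the stream rather than from a remote database call per message.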

128. Key-Value Store • put(table_name, key, value) • get(table_name, key) • delete(table_name, key) • range(table_name, key1, key2)
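
The four operations on this slide can be sketched with a sorted map per table. This is an in-memory illustration only: the class `TableStore` is hypothetical, keys and values are plain strings, and treating `range` as inclusive of `key1` and exclusive of `key2` is an assumption, since the slide does not say.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory store with one sorted map per table name.
class TableStore {
    private final Map<String, TreeMap<String, String>> tables = new HashMap<>();

    // Returns the table, creating it on first use.
    private TreeMap<String, String> table(String tableName) {
        return tables.computeIfAbsent(tableName, n -> new TreeMap<>());
    }

    void put(String tableName, String key, String value) {
        table(tableName).put(key, value);
    }

    // Returns null when the key is absent.
    String get(String tableName, String key) {
        return table(tableName).get(key);
    }

    void delete(String tableName, String key) {
        table(tableName).remove(key);
    }

    // Returns a sorted copy of entries with key1 <= key < key2 (assumed bounds).
    SortedMap<String, String> range(String tableName, String key1, String key2) {
        return new TreeMap<>(table(tableName).subMap(key1, key2));
    }
}

class TableStoreDemo {
    public static void main(String[] args) {
        TableStore store = new TableStore();
        store.put("members", "a", "Alice");
        store.put("members", "b", "Bob");
        store.put("members", "c", "Carol");
        System.out.println(store.range("members", "a", "c"));
        store.delete("members", "b");
        System.out.println(store.get("members", "b"));
    }
}
```

A sorted structure is what makes `range` cheap here; a hash map alone would support `put`, `get`, and `delete` but would need a full scan for range queries.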

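129. Stateful Stream Task

    import org.apache.avro.generic.GenericRecord;
    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.*;

    public class SimpleStatefulTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> store;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // Look up the store configured under the name "mystore".
        this.store = (KeyValueStore<String, String>) context.getStore("mystore");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        GenericRecord record = (GenericRecord) envelope.getMessage();
        // GenericRecord.get returns Object, so convert the fields to String explicitly.
        String memberId = record.get("member_id").toString();
        String name = record.get("name").toString();
        System.out.println("old name: " + store.get(memberId));
        store.put(memberId, name);
      }
    }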

133. Whew!

134. Let’s be Friends! • We are incubating, and you can help! • Get up and running in 5 minutes http://bit.ly/hello-samza • Grab some newbie JIRAs http://bit.ly/samza_newbie_issues
