-
1.
Apache Samza
Reliable Stream Processing Atop
Apache Kafka and Hadoop YARN
Jakob Homan
London HUG
-
2.
Who I am
• Samza for five months
• Before that Hadoop, Hive, Giraph
• Say hi: @blueboxtraveler
-
3.
Things we would like to do
(better)
-
4.
Provide timely, relevant updates to your newsfeed
-
5.
Update search results with new information as it appears
-
6.
Sculpt metrics and logs into useful shapes
-
7.
Tools?

                  RPC          Samza
Response latency  Synchronous  Milliseconds to minutes.
                               Later. Possibly much later.
-
8.
Frame(work) of reference

                  Classic Hadoop                Samza
Storage layer     HDFS                          Kafka
Execution engine  Map-Reduce                    YARN
API               map(k, v) => (k, v)           process(msg(k, v)) => msg(k, v)
                  reduce(k, list(v)) => (k, v)
-
9.
Storage layer: Kafka
-
10.
Apache Kafka
• Persistent, reliable, distributed message queue
(Shiny new logo!)
-
11.
At LinkedIn
• 10+ billion writes per day
• 172k messages per second (average)
• 55+ billion messages per day to real-time consumers
-
12.
Quick aside…
Kafka: First among (pluggable) equals
LinkedIn: Espresso and Databus
Coming soon? HDFS, ActiveMQ, Amazon SQS
-
13.
Kafka in four bullet points
• Producers send messages to brokers
• Messages are key-value pairs
• Brokers store messages in topics for consumers
• Consumers pull messages from brokers
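The four bullet points above can be sketched as a toy in-memory broker. All class and method names here are illustrative stand-ins, not Kafka's actual API:

```java
import java.util.*;

// Toy model: producers send (key, value) messages to a broker, which
// appends them to one of a topic's partitions; consumers pull messages
// back by partition and offset.
public class ToyBroker {
    // topic name -> list of partitions, each an ordered list of [key, value]
    private final Map<String, List<List<String[]>>> topics = new HashMap<>();

    public void createTopic(String topic, int partitions) {
        List<List<String[]>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) parts.add(new ArrayList<>());
        topics.put(topic, parts);
    }

    // Producer side: append the message to the partition chosen by key hash.
    public void send(String topic, String key, String value) {
        List<List<String[]>> parts = topics.get(topic);
        int p = (key.hashCode() & Integer.MAX_VALUE) % parts.size();
        parts.get(p).add(new String[] { key, value });
    }

    // Consumer side: pull the message at a given partition and offset.
    public String[] poll(String topic, int partition, int offset) {
        return topics.get(topic).get(partition).get(offset);
    }
}
```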
-
14.
A Kafka Topic

Topic: StatusUpdateEvent
Key: User ID of the user who updated the status
Value: Timestamp, new status, geolocation, etc.

Example messages (key → value):
534 → "The ref's blind!"
234 → "Car nicked!"
755 → "Very sleepy"
534 → "Nicked a car!"
-
15.
Kafka topics are partitioned.
For our purposes, hash partitioned on the key!

[Diagram: one topic split into Partition 0, Partition 1, and Partition 2, each holding an ordered sequence of (key, message contents) pairs]
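Hash partitioning on the key fits in a few lines. This sketch assumes the common default scheme — hash the key, take it modulo the partition count (class and method names are hypothetical):

```java
// A sketch of key-based hash partitioning: messages with the same key
// always map to the same partition, so one consumer (or StreamTask
// instance) sees every message for a given user ID.
public class HashPartitioner {
    private final int numPartitions;

    public HashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionFor(String key) {
        // Mask the sign bit so the result is always in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```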
-
16.
A Samza job
Input topics:
• StatusUpdateEvent
• NewConnectionEvent
• LikeUpdateEvent

Some code:
MyStreamTask implements StreamTask { …………. }

Output topics:
• NewsUpdatePost
• UpdatesPerHourMetric
-
17.
Execution engine: YARN
-
18.
What we use YARN for
• Distributing our tasks across multiple machines
• Letting us know when one has died
• Distributing a replacement
• Isolating our tasks from each other
-
19.
YARN: Execution and reliability

[Diagram: Machine 1 runs Node Manager 1, a Kafka Broker, and Samza TaskRunner: Partition 0 executing MyStreamTask:process(); Machine 2 runs Node Manager 2, a Kafka Broker, and Samza TaskRunner: Partition 1 executing MyStreamTask:process(); the Samza App Master oversees both]
-
20.
Co-partitioning of topics
[Diagram: StatusUpdateEvent, Partition 0 and NewConnectionEvent, Partition 0 both feed Samza TaskRunner: Partition 0, which runs MyStreamTask:process() and emits to NewsUpdatePost]

An instance of StreamTask is responsible for a specific partition.
-
21.
API: process()
-
22.
public interface StreamTask {
  void process(IncomingMessageEnvelope envelope, // getKey(), getMsg()
               MessageCollector collector,       // sendMsg(topic, key, value)
               TaskCoordinator coordinator);     // commit(), shutdown()
}
-
23.
Awesome feature: State
[Diagram: Samza TaskRunner: Partition 0 runs MyStreamTask:process() against a local state store]

• Generic data store interface
• Key-value out of the box
  – More soon? Bloom filter, Lucene, etc.
• Restored by Samza upon task crash
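One way to picture "restored upon task crash": Samza logs each write to the local store out to a changelog stream kept in Kafka, and a restarted task replays that log to rebuild its state. A toy sketch, with an in-memory list standing in for the changelog topic (names are illustrative):

```java
import java.util.*;

// A key-value store whose contents survive a "crash" because every put
// is also appended to a durable changelog; replaying the changelog in a
// fresh instance rebuilds the store.
public class ChangelogStore {
    private final Map<String, String> store = new HashMap<>();
    private final List<String[]> changelog;

    public ChangelogStore(List<String[]> changelog) {
        this.changelog = changelog;
        // Restore: replay every previously logged write into the fresh store.
        for (String[] entry : changelog) store.put(entry[0], entry[1]);
    }

    public void put(String key, String value) {
        changelog.add(new String[] { key, value }); // log first, then apply
        store.put(key, value);
    }

    public String get(String key) {
        return store.get(key);
    }
}
```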
-
24.
(Pseudo)code snippet: Newsfeed
• Consume StatusUpdateEvent
  – Send those updates to all your connections via the NewsUpdatePost topic
• Consume NewConnectionEvent
  – Maintain state of connections to know who to send to
-
25.
public class NewsFeed implements StreamTask {
  void process(envelope, collector, coordinator) {
    msg = envelope.getMsg();
    userId = msg.get("userId");
    if (msg.get("type") == STATUS_UPDATE) {
      foreach (conn : kvStore.get(userId)) {
        collector.send("NewsUpdatePost",
                       new Msg(conn, msg.get("newStatus")));
      }
    } else {
      newConn = msg.get("newConnection");
      connections = kvStore.get(userId);
      kvStore.put(userId, connections ++ newConn);
    }
  }
}
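The pseudocode above can be made runnable with a plain HashMap standing in for Samza's key-value store and a list collecting what would be sent to the NewsUpdatePost topic. A sketch only — as the slide notes, not how LinkedIn really does this:

```java
import java.util.*;

// Runnable version of the newsfeed logic: NewConnectionEvent grows the
// per-user connection list; StatusUpdateEvent fans the update out to
// every stored connection.
public class NewsFeedSketch {
    // Stand-in for Samza's kvStore: userId -> list of connected user IDs.
    private final Map<String, List<String>> connections = new HashMap<>();
    // Stand-in for the NewsUpdatePost output topic.
    private final List<String> newsUpdatePosts = new ArrayList<>();

    // Handle a NewConnectionEvent.
    public void addConnection(String userId, String newConnection) {
        connections.computeIfAbsent(userId, k -> new ArrayList<>())
                   .add(newConnection);
    }

    // Handle a StatusUpdateEvent: send the new status to each connection.
    public void statusUpdate(String userId, String newStatus) {
        List<String> empty = Collections.emptyList();
        for (String conn : connections.getOrDefault(userId, empty)) {
            newsUpdatePosts.add(conn + " <- " + newStatus);
        }
    }

    public List<String> posts() {
        return newsUpdatePosts;
    }
}
```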
-
26.
Current status
-
27.
Hello, Samza!
Up and running in 3 minutes
Consume Wikipedia edits live
Generate stats on those edits
Cool, eh? bit.ly/hello-samza
-
28.
samza.incubator.apache.org
bit.ly/samza_newbie_issues
-
29.
Cheers!
• Quick start: bit.ly/hello-samza
• Project homepage: samza.incubator.apache.org
• Newbie issues: bit.ly/samza_newbie_issues
• Detailed Samza and YARN talk: bit.ly/samza_and_yarn
• Twitter: @samzastream
Speaker notes:
• RPC = lots of questions, but very quick and specific. Hadoop = fewer questions, but can take a long time to ponder them.
• "Classic" Hadoop, because modern Hadoop also uses YARN and Tez. Samza leverages these existing technologies to build its own framework.
• Very much a production system, critical to LinkedIn.
• Log or topic: same term. At-least-once semantics. Messages are kept around on the order of days.
• Analogous to Map-Reduce. Input directories =
• Pretty standard use of YARN. Came along at exactly the right time for Samza. Nice not to have had to write something ourselves.
• Gives us distribution and task restart.
• Guarantees that messages partitioned on the same key will be handled by the same task. In the same way that MapReduce allows you to group on keys, co-partitioning of the tasks on the keys allows you to group on the message keys. Very useful feature.
• Also provides interfaces for windowing tasks that are called at specific intervals of time or numbers of messages. Also provides methods for initialization, configuration, etc. Checkpointing is handled behind the scenes.
• Neat feature that's unique among current streaming frameworks.
• Note: Not how LinkedIn really does this!
• One could imagine lots of Samza tasks consuming different events and publishing them to NewsUpdatePost. Another task could then rank these and output them to a key-value store so that users see the most relevant posts.
• In production at LinkedIn. In the Apache Incubator. Lots of documentation. Looking to build a new community. Newbie JIRAs.