YARN
You: I want to run command X on two machines with
512M of memory.
YARN: Cool, where’s your code?
You: http://some-host/jobs/download/my.tgz
YARN: I’ve run your command on grid-node-2 and
grid-node-7.
Let’s be Friends!
• We are incubating, and you can help!
• Get up and running in 5 minutes
http://bit.ly/hello-samza
• Grab some newbie JIRAs
http://bit.ly/samza_newbie_issues
Editor's Notes
- stream processing for us = anything asynchronous, but not batch computed.
- 25% of code is async. 50% is rpc/online. 25% is batch.
- stream processing is worst supported.
- compute top shares, pull in, scrape, entity tag
- language detection
- send emails: friend was in the news
- requirement: has to be fast, since news is trendy
- relevance pipeline
- we send relatively data-rich emails
- some emails are time sensitive (need to be sent soon)
- time sensitive
- data ingestion pattern
- other systems that follow this pattern: real-time OLAP system, and social graph system
- ecosystem at LinkedIn (some unique traits)
- hard unsolved problems in this space
- once we had all this data in Kafka, we wanted to do stuff with it.
- persistent, reliable, distributed message queue
- Kafka = first among equals, but stream systems are pluggable. Just like Hadoop with HDFS vs. S3.
- started with just a simple web service that consumes and produces Kafka messages.
- realized that there are a lot of hard problems that needed to be solved.
- reprocessing: what if my algorithm changes and I need to reprocess all events?
- non-determinism: queries to external systems, time dependencies, ordering of messages.
- open area of research
- been around for 20 years
partitioned
re-playable, ordered, fault tolerant, infinite
very heavyweight definition of a stream (vs. S4, Storm, etc.)
partition assignment happens on write
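To make "partition assignment happens on write" concrete, here is a minimal sketch in which the writer picks the partition from the message key. The function names, the member-ID keys, and the CRC32 hash are all assumptions chosen for illustration; this is not Kafka's actual partitioner.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition from a stable hash of the key (CRC32 is a stand-in)."""
    return zlib.crc32(key) % num_partitions


def write(partitions, key: bytes, value: str) -> int:
    """Append the message to the partition chosen at write time."""
    p = partition_for(key, len(partitions))
    partitions[p].append((key, value))
    return p


# Hypothetical stream with four partitions, each an ordered list.
partitions = [[] for _ in range(4)]
for member_id, event in [
    (b"member-1", "view"),
    (b"member-2", "click"),
    (b"member-1", "share"),
]:
    write(partitions, member_id, event)

# All messages for one key sit in one partition, in write order.
p1 = partition_for(b"member-1", len(partitions))
member_1_events = [v for k, v in partitions[p1] if k == b"member-1"]
```

Because the partition is fixed when the message is written, a reader of a single partition sees every message for a given key in order.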
At-least-once messaging. Duplicates are possible.
Future: exactly-once semantics.
Transparent to user. No ack'ing API.
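A toy simulation of why at-least-once delivery produces duplicates, assuming the consumer periodically checkpoints its offset and restarts from the last checkpoint after a crash. The `run` function, its parameters, and the five-message stream are hypothetical and only model the idea.

```python
def run(stream, start_offset, checkpoint_every, crash_at=None):
    """Process messages from start_offset; return (processed, last_checkpoint)."""
    processed = []
    checkpoint = start_offset
    for offset in range(start_offset, len(stream)):
        if crash_at is not None and offset == crash_at:
            # Simulated crash: work done since the last checkpoint is not recorded.
            return processed, checkpoint
        processed.append(stream[offset])
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = offset + 1  # next offset to read after a restart
    return processed, checkpoint


stream = ["a", "b", "c", "d", "e"]

# First run crashes at offset 3; the last checkpoint was taken after offset 1.
first, ckpt = run(stream, 0, checkpoint_every=2, crash_at=3)

# The restart replays from the checkpoint, so "c" is processed a second time.
second, _ = run(stream, ckpt, checkpoint_every=2)
seen = first + second
```

Nothing is lost, but `"c"` appears twice in `seen`, so the processing logic has to tolerate replays.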
connected by stream name only
fully buffered
split job tracker up
resource management, process isolation, fault tolerance, security
- group by, sum, count
- stream to stream, stream to table, table to table
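The two notes above (grouped counts and joins) can be sketched with plain Python over in-memory data. The event fields, the member table, and the helper names are assumptions made up for illustration; a real job would read from streams and keep its table in a store rather than a dict.

```python
from collections import Counter


def count_by_key(events, key):
    """Group events by one field and count them (group by + count)."""
    counts = Counter()
    for event in events:
        counts[event[key]] += 1
    return counts


def join_stream_to_table(events, table, key):
    """Enrich each event with the table row for its key (stream-to-table join)."""
    for event in events:
        row = table.get(event[key])
        if row is not None:
            yield {**event, **row}


# Hypothetical page-view stream and member table.
page_views = [
    {"member": "m1", "page": "/jobs"},
    {"member": "m2", "page": "/feed"},
    {"member": "m1", "page": "/feed"},
]
members = {"m1": {"country": "US"}, "m2": {"country": "DE"}}

views_per_member = count_by_key(page_views, "member")
enriched = list(join_stream_to_table(page_views, members, "member"))
views_per_country = count_by_key(enriched, "country")
```

The join runs first so that the count can group on a field (`country`) that exists only in the table.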