This is the slide deck that was presented at the Hadoop Users Group at LinkedIn on November 5, 2013.
The presentation covers what Samza is, why we built it, and how it works.
107. Let’s be Friends!
• We are incubating, and you can help!
• Get up and running in 5 minutes
http://bit.ly/hello-samza
• Grab some newbie JIRAs
http://bit.ly/samza_newbie_issues
Editor's Notes
- stream processing for us = anything asynchronous, but not batch computed.
- 25% of code is async. 50% is rpc/online. 25% is batch.
- stream processing is worst supported.
- compute top shares, pull in, scrape, entity tag
- language detection
- send emails: friend was in the news
- requirement: has to be fast, since news is trendy
- relevance pipeline
- we send relatively data-rich emails
- some emails are time sensitive (need to be sent soon)
- time sensitive
- data ingestion pattern
- other systems that follow this pattern: real-time OLAP system, and social graph system
- ecosystem at LinkedIn (some unique traits)
- hard unsolved problems in this space
- once we had all this data in Kafka, we wanted to do stuff with it.
- persistent, reliable, distributed message queue
- Kafka = first among equals, but stream systems are pluggable. Just like Hadoop with HDFS vs. S3.
- started with just a simple web service that consumes and produces Kafka messages.
- realized that there are a lot of hard problems that needed to be solved.
- reprocessing: what if my algorithm changes and I need to reprocess all events?
- non-determinism: queries to external systems, time dependencies, ordering of messages.
- open area of research
- been around for 20 years
partitioned
re-playable, ordered, fault tolerant, infinite; a very heavyweight definition of a stream (vs. S4, Storm, etc.)
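A toy illustration of the stream properties listed in these notes (partitioned, ordered within a partition, re-playable from an offset). This is only an in-memory sketch; `PartitionedStream` and its methods are hypothetical names, not part of Samza or Kafka.

```python
import zlib


class PartitionedStream:
    """Toy in-memory stream: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        # Messages are only ever appended, which keeps each partition ordered.
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, from_offset=0):
        # Re-playable: a consumer may re-read from any earlier offset.
        return self.partitions[partition][from_offset:]
```

Ordering is only guaranteed within a single partition here, which is why messages that must be seen in order share a key.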
At least once messaging. Duplicates are possible. Future: exactly-once semantics. Transparent to user. No ack’ing API.
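Why at-least-once delivery produces duplicates can be sketched as follows: after a failure, the consumer restarts from its last checkpoint and sees some messages a second time. The names below (`CountingTask`, `deliver`) are hypothetical and do not reflect Samza's API; the offset check is just one possible way for a task to tolerate redelivery.

```python
class CountingTask:
    """Counts messages while ignoring offsets it has already applied."""

    def __init__(self):
        self.count = 0
        self.last_applied_offset = -1

    def process(self, offset, message):
        # A redelivered message has an offset we have already applied.
        if offset <= self.last_applied_offset:
            return False
        self.count += 1
        self.last_applied_offset = offset
        return True


def deliver(task, log, checkpoint):
    # Replay every message from the checkpointed offset to the end of the log.
    for offset in range(checkpoint, len(log)):
        task.process(offset, log[offset])
```

In this sketch the count and the last applied offset live in the same object; a real task would need to persist them together for the check to survive a restart.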
connected by stream name only; fully buffered
- group by, sum, count
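A minimal sketch of the kind of aggregation named here, assuming events arrive as `(key, value)` pairs. The function name `aggregate` is hypothetical, and a real job would typically keep these maps as task state and emit results per window.

```python
from collections import defaultdict


def aggregate(events):
    """Group (key, value) events by key, tracking a count and a sum per key."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for key, value in events:
        counts[key] += 1
        sums[key] += value
    return dict(counts), dict(sums)
```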
- stream to stream, stream to table, table to table
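One of these join shapes, stream to table, might be sketched like this: table updates arrive as their own stream and are kept in a local map, and each event on the other stream is enriched by a local lookup. The tagged-tuple input format and the name `stream_table_join` are assumptions made for illustration.

```python
def stream_table_join(events):
    """Join stream events against the latest table value for the same key.

    `events` is an interleaved sequence of (kind, key, value) tuples, where
    kind is "table" for a table update and "stream" for an event to enrich.
    """
    table = {}
    joined = []
    for kind, key, value in events:
        if kind == "table":
            # Keep only the most recent value per key.
            table[key] = value
        elif key in table:
            joined.append((key, value, table[key]))
        # Stream events with no matching table row are dropped in this sketch.
    return joined
```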
- buffered sorting
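Buffered sorting could be read as holding back a bounded number of messages so that slightly out-of-order arrivals can be re-ordered before they are emitted. The sketch below assumes that reading; `buffered_sort` and its `buffer_size` parameter are hypothetical.

```python
import heapq


def buffered_sort(messages, buffer_size):
    """Yield (timestamp, payload) pairs in order, using a bounded buffer."""
    heap = []
    for timestamp, payload in messages:
        heapq.heappush(heap, (timestamp, payload))
        if len(heap) > buffer_size:
            # Buffer is full: release the oldest message seen so far.
            yield heapq.heappop(heap)
    # End of input: drain whatever is still buffered, in order.
    while heap:
        yield heapq.heappop(heap)
```

The output is fully sorted only if no message arrives more than `buffer_size` positions late; a larger buffer tolerates more disorder at the cost of latency and memory.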
UDP is an over-optimization, since most processors try to remote join, which is very slow.