2. Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
3. References
Kafka
Samza
Kafka Documentation
The Log: What every software engineer should know about real-time data's unifying abstraction
Benchmarking Apache Kafka
Samza Documentation
Questioning the Lambda Architecture
Moving faster with data streams: The rise of Samza at LinkedIn
Why local state is a fundamental primitive in stream processing
Real time insights into LinkedIn's performance using Apache Samza
6. Event
Describes what happened
Who did it?
What did they do?
What was the result?
Provides context
When did it happen?
Where did it happen?
How did they do it?
Why did they do it?
7. Event Example: Pageview
User viewed web page
User
ID: a2be9031-9465-4ecb-9302-9b962fa854ac
IP: 65.121.142.238
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Web Page
URL: https://www.mycompany.com/page.html
Context
Time: 2014-10-14T10:49:24.438-05:00
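A minimal sketch of how the pageview above might be encoded before it is sent to a log. The nested layout and the field names (`type`, `user`, `web_page`, `context`) are assumptions for illustration, not a schema from the talk; JSON is just one possible encoding.

```python
import json

# Hypothetical layout: "what happened" fields plus a "context" block.
pageview = {
    "type": "pageview",
    "user": {
        "id": "a2be9031-9465-4ecb-9302-9b962fa854ac",
        "ip": "65.121.142.238",
    },
    "web_page": {"url": "https://www.mycompany.com/page.html"},
    "context": {"time": "2014-10-14T10:49:24.438-05:00"},
}

# Encode to bytes for transport, then decode as a consumer would.
encoded = json.dumps(pageview, sort_keys=True).encode("utf-8")
decoded = json.loads(encoded.decode("utf-8"))
```

Keeping the timestamp as an ISO 8601 string with its UTC offset preserves the "when" without any consumer needing to know the producer's time zone.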
8. Event Example: Clickthrough
User clicked link
User
ID: a2be9031-9465-4ecb-9302-9b962fa854ac
IP: 65.121.142.238
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36
Link
URL: https://www.mycompany.com/product.html
Referer: https://www.othersite.com/foo.html
Context
Time: 2014-10-14T10:49:24.438-05:00
10. Event Example: User Update
User uploaded a new profile image
User
ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
Profile Image
URL: http://profile-images.s3.amazonaws.com/katy-perry.jpg
Context
Time: 2014-10-14T10:59:56.481-05:00
IP: 65.121.142.238
Using: webcam
11. Event Example: Tweet
User posted a tweet
User
ID:
Username: @zcox
Name: Zach Cox
Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr
Tweet
ID: 527152511568719872
URL: https://twitter.com/zcox/status/527152511568719872
Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp
Mentions: @apachekafka, @samzastream, @iowacodecamp
URLs: http://iowacodecamp.com/session/list#66
Context
Time: 2014-10-14T10:59:56.481-05:00
Using: Twitter for Android
Location: 41.7146365,-93.5914038
12. Event Example: HTTP Request Latency
Some measured code took some time to execute
Code
production.my-app.some-server.http.get-user-profile
Time to execute
Min: 20 msec
Max: 950 msec
Average: 190 msec
Median: 110 msec
50%: 100 msec
75%: 120 msec
95%: 150 msec
99%: 500 msec
Context
Time: 2014-10-14T11:17:01.597-05:00
13. Event Example: Runtime Exception
Some code threw a runtime exception
Some code
Stack trace: [...]
Exception
Message: HBase read timed out
Context
Time: 2014-10-14T11:21:23.749-05:00
Application: my-app
Machine: some-server.my-company.com
14. Event Example: Application Logging
Some code logged some information
[INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started
Message: Slf4jEventHandler started
Level: INFO
Time: 2014-10-14 11:25:44,750
Thread: sentry-akka.actor.default-dispatcher-2
Logger: akka.event.slf4j.Slf4jEventHandler
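A rough sketch of turning a log line like the one above into a structured event. The regular expression and the `parse_log_line` helper are assumptions written for this one line layout, not part of any logging library. The line only carries the abbreviated logger name, so that is what the parser returns.

```python
import re

# Assumed layout: [LEVEL] [TIME] [THREAD] LOGGER: MESSAGE
LOG_LINE = re.compile(
    r"^\[(?P<level>\w+)\] "
    r"\[(?P<time>[^\]]+)\] "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<logger>\S+): "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


line = (
    "[INFO] [2014-10-14 11:25:44,750] "
    "[sentry-akka.actor.default-dispatcher-2] "
    "a.e.s.Slf4jEventHandler: Slf4jEventHandler started"
)
event = parse_log_line(line)
```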
15. Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly
16. Unified Log
Events need to be sent somewhere
Events should be accessible to any program
Log provides a place for events to be sent and accessed
Kafka is a great log service
25. Log for Event Streams
Simple to send events to
Broadcasts events to all consumers
Buffers events on disk: producers and consumers decoupled
Consumers can start reading at any offset
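The properties above can be sketched with a toy in-memory log. This is only an illustration of the idea: `EventLog` and `Consumer` are hypothetical names, a Python list stands in for the on-disk buffer, and nothing here is Kafka's API.

```python
class EventLog:
    """Append-only sequence of events; an event's offset is its list index."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Producer side: add an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events=100):
        """Consumer side: return up to max_events starting at offset."""
        return self._events[offset:offset + max_events]


class Consumer:
    """Tracks its own offset, so consumers never affect each other or the producer."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        events = self.log.read(self.offset)
        self.offset += len(events)
        return events


log = EventLog()
for name in ["pageview", "clickthrough", "user-update"]:
    log.append(name)

from_start = Consumer(log)           # reads everything
from_latest = Consumer(log, offset=2)  # starts at a later offset

first_batch = from_start.poll()
late_batch = from_latest.poll()

log.append("tweet")                  # both consumers see the new event
next_for_start = from_start.poll()
next_for_latest = from_latest.poll()
```

Because each consumer owns its offset, a slow or restarted consumer simply resumes where it left off while the producer keeps appending.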
26. Kafka
Apache OSS, mainly from LinkedIn
Handles all the logs/event streams
High-throughput: millions of events/sec
High-volume: TBs to PBs of events
Low-latency: single-digit msec from producer to consumer
Scalable: topics are partitioned across cluster
Durable: topics are replicated across cluster
Available: auto failover
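To illustrate what "partitioned" means, here is a small sketch that assigns each event to a partition by hashing its key. CRC32 is a stand-in chosen for this example and is not Kafka's actual partitioner; `partition_for` and `send` are hypothetical helpers.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition index; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send(partitions, key, event):
    """Append (key, event) to the partition chosen for this key."""
    index = partition_for(key, len(partitions))
    partitions[index].append((key, event))
    return index


# Four lists stand in for four partitions of one topic.
partitions = [[] for _ in range(4)]
for key, event in [
    ("user-1", "pageview"),
    ("user-2", "pageview"),
    ("user-1", "clickthrough"),
]:
    send(partitions, key, event)

user_1_partition = partitions[partition_for("user-1", 4)]
user_1_events = [event for key, event in user_1_partition if key == "user-1"]
```

All events for one key land in one partition, so their relative order is preserved even though the topic as a whole is spread across machines.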
34. Samza
Event stream processing framework
Apache OSS, mainly from LinkedIn
Simple Java API
Scalable: runs jobs in parallel across cluster
Reliable: fault-tolerance and durability built-in
Tools for stateful stream processing
39. Aggregation
State = aggregated values (e.g. count)
Incorporate each new event into that aggregation
Output aggregated values as events to new stream
What happens if job stops?
Crash, deploy, ...
Can't lose state!
Samza handles this all for you
SELECT COUNT(*) FROM statuses;
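A minimal sketch of a counting job whose state survives a restart. It loosely imitates the idea of writing state changes to a changelog and replaying them on startup; `CountTask` is a hypothetical class, a plain list stands in for a durable changelog stream, and none of this is Samza's API.

```python
class CountTask:
    """Counts events; the running count is the task's state."""

    def __init__(self, changelog):
        # `changelog` is a list standing in for a durable changelog stream.
        self.changelog = changelog
        # On startup, restore state from the last changelog entry, if any.
        self.count = changelog[-1] if changelog else 0

    def process(self, event):
        """Fold one event into the count, record it, and emit the new value."""
        self.count += 1
        self.changelog.append(self.count)
        return self.count


changelog = []
task = CountTask(changelog)
outputs = [task.process(e) for e in ["status-1", "status-2", "status-3"]]

# Simulate a crash or deploy: the task object is gone, the changelog survives.
restarted = CountTask(changelog)
after_restart = restarted.process("status-4")
```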
41. Grouping
State = some data per group
Two Samza jobs:
Output statuses by user (map)
Count statuses per user (reduce)
Output: (user, count)
Could use as input to job that sorts by count (most active users)
SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;
SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
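The two-job pipeline can be sketched with plain generators. The function names and the shape of a status (a dict with `user_id` and `text`) are assumptions for illustration; in a real deployment each step would be a separate job reading from and writing to a stream.

```python
from collections import Counter


def statuses_by_user(statuses):
    """Map step: key each status by its user so one user's statuses stay together."""
    for status in statuses:
        yield status["user_id"], status


def count_per_user(keyed_statuses, counts):
    """Reduce step: update per-user state and emit (user, count) after each event."""
    for user_id, _status in keyed_statuses:
        counts[user_id] += 1
        yield user_id, counts[user_id]


def top_users(counts, n=5):
    """Downstream step: the n most active users, ties broken by user id."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


statuses = [
    {"user_id": "u1", "text": "first"},
    {"user_id": "u2", "text": "hello"},
    {"user_id": "u1", "text": "second"},
]

counts = Counter()  # state: one count per group
updates = list(count_per_user(statuses_by_user(statuses), counts))
most_active = top_users(counts, n=1)
```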
42. Joins
Samza job has multiple input streams
Stream-Stream join: ad impressions + ad clicks
Stream-Table join: page views + user zip code
Table-Table join: user data + user settings
Joins involving tables need DB changelog
SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
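A sketch of a stream-table join matching the query above: the users table is kept as local state built from a changelog of user records, and each status is joined against it as it arrives. The helper names and record shapes are hypothetical.

```python
def apply_user_change(users, change):
    """Table side: each changelog message is the latest version of one user row."""
    users[change["id"]] = change


def join_status(users, status):
    """Stream side: look up the status's user in local state and emit the joined row."""
    user = users.get(status["user_id"])
    if user is None:
        return None  # user not seen yet; a real job must decide how to handle this
    return {"username": user["username"], "text": status["text"]}


users = {}  # local copy of the users table, keyed by user id

apply_user_change(users, {"id": 1, "username": "zcox"})
joined = join_status(users, {"user_id": 1, "text": "hello"})

# A later change to the table affects only statuses that arrive after it.
apply_user_change(users, {"id": 1, "username": "zach"})
joined_after_rename = join_status(users, {"user_id": 1, "text": "again"})

unknown = join_status(users, {"user_id": 2, "text": "who am I"})
```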
43. What else can we compute?
Tweets per sec/min/hour (recent, not for-all-time)
Enrich tweets with weather at current location
Most active users, locations, etc
Emojis: % of tweets that contain, top emojis
Hashtags: % of tweets that contain, top #hashtags
URLs: % of tweets that contain, top domains
Photo URLs: % of tweets that contain, top domains
Text analysis: sentiment, spam
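As one example from the list above, tweets per minute can be sketched as a count per one-minute window. The function name and the choice to bucket by UTC minute are assumptions; timestamps are ISO 8601 strings like those in the event examples.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_minute(timestamps):
    """Count tweets in each one-minute window, keyed by the window's UTC start time."""
    counts = Counter()
    for ts in timestamps:
        moment = datetime.fromisoformat(ts).astimezone(timezone.utc)
        window_start = moment.replace(second=0, microsecond=0)
        counts[window_start.isoformat()] += 1
    return dict(counts)


per_minute = tweets_per_minute([
    "2014-10-14T10:59:56.481-05:00",
    "2014-10-14T10:59:58.000-05:00",
    "2014-10-14T11:00:01.250-05:00",
])
```

A streaming job would keep only the recent windows in state and emit each window's count once it closes, rather than holding counts for all time.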
46. Druid
Send it events
Druid reads from Kafka topic
That Kafka topic is a Samza output stream
Super fast time-series queries: aggregations, filters, top-n, etc
http://druid.io
47. Why?
Businesses generate and process events
Unified event log promotes data integration
Process event streams to take actions quickly