An experiment in a distributed approach to processing the real-time data generated by a large-scale social media campaign. Presented at Cambridge Geek Nights 13.
11. Valve
Subscribe to mailchimp exchange morat.[campaign].mailchimp.#
Translate to plain English for IRC
Inject into irc exchange with routing key morat.irc.[channel]
12. mailchimp-irc-valve.rb
case record['type']
when 'subscribe'
  output :irc, "'#{record['data']['merges']['FNAME']} #{record['data']['merges']['LNAME']}' has joined the list"
when 'unsubscribe'
  output :irc, "'#{record['data']['merges']['FNAME']} #{record['data']['merges']['LNAME']}' has left the list"
...
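To see the valve's branches fire outside the pipeline, here is a minimal, self-contained sketch. The record shape (a `type` field plus `data.merges.FNAME`/`LNAME`) follows Mailchimp's webhook payload; the names "Ada Lovelace" and the inline JSON are invented for illustration.

```ruby
require 'json'

# Hypothetical parsed Mailchimp webhook record, shaped like the
# payload the pump publishes (type + data.merges.FNAME/LNAME).
record = JSON.parse(<<~JSON)
  {
    "type": "subscribe",
    "data": { "merges": { "FNAME": "Ada", "LNAME": "Lovelace" } }
  }
JSON

# Same case statement as the valve, but building the IRC line
# into a variable instead of publishing it.
message =
  case record['type']
  when 'subscribe'
    "'#{record['data']['merges']['FNAME']} #{record['data']['merges']['LNAME']}' has joined the list"
  when 'unsubscribe'
    "'#{record['data']['merges']['FNAME']} #{record['data']['merges']['LNAME']}' has left the list"
  end

puts message  # => 'Ada Lovelace' has joined the list
```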
15. Where have we got to?
Pump: Mailchimp webhook (HTTP POST) > morat.[campaign].mailchimp.[type] (JSON)
Valve: morat.[campaign].mailchimp.[type] (JSON) > morat.irc.[campaign] (Text)
Sink: morat.irc.[campaign] (Text) > IRC server
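The whole topology hangs off AMQP topic routing: a binding word of `*` matches exactly one dot-separated word and `#` matches zero or more. As a sketch of how a valve's binding selects messages, here is a simplified matcher (it only supports `#` as the final pattern word, which is the only way this deck uses it; the function name and example keys are invented):

```ruby
# Simplified AMQP topic matching: '*' matches exactly one
# dot-separated word; a trailing '#' matches zero or more words.
def topic_match?(pattern, routing_key)
  pattern_parts = pattern.split('.')
  key_parts = routing_key.split('.')
  pattern_parts.each_with_index do |part, i|
    return true if part == '#'          # trailing '#' swallows the rest
    return false if key_parts[i].nil?   # key ran out of words
    return false unless part == '*' || part == key_parts[i]
  end
  pattern_parts.length == key_parts.length
end

topic_match?('morat.campaign.mailchimp.#', 'morat.campaign.mailchimp.subscribe')  # => true
topic_match?('morat.irc.*', 'morat.twitter.search.cgn')                           # => false
```

This is why a valve can subscribe broadly (`morat.[campaign].mailchimp.#`) while a sink binds narrowly to exactly the stream it renders.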
16. That’s cool, but hey, it would be great to see #campaign tweets as well...
17. twitter-search-pump.rb
TweetStream::Client.new.track(keywords.split(',')) do |status|
  keywords.split(',').each do |searchterm|
    if status.text.match(searchterm)
      searchterm.sub!(' ', '')
      searchterm.sub!('#', '')
      log.debug "Sending: #{status.user.screen_name} :: #{status.text} :: morat.twitter.search.#{searchterm}"
      broker.exchange.publish JSON.generate(status), :routing_key => "morat.twitter.search.#{searchterm}"
    end
  end
end
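One subtlety in the pump above: `String#sub!` replaces only the *first* match, so the normalisation strips the `#` and a single space from the tracked term before it becomes a routing-key word. A small stand-alone sketch (the function name and example term are invented for illustration):

```ruby
# Normalise a tracked search term into a routing-key word, as the
# pump does. Note String#sub! replaces only the first occurrence,
# so a term with several spaces would keep the later ones.
def routing_segment(searchterm)
  term = searchterm.dup  # sub! mutates, so work on a copy
  term.sub!(' ', '')
  term.sub!('#', '')
  term
end

routing_segment('#geek nights')  # => "geeknights"
```

The result becomes the final word of the routing key, e.g. `morat.twitter.search.geeknights`.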
22. Valve
Subscribe to mailchimp exchange morat.[campaign].mailchimp.#
Translate to Graphite format: [value] [timestamp]
Inject into graphite exchange with routing key based on sample window: 10sec.[campaign].mailchimp.[action].count
25. mailchimp-graphite-valve.rb
%w{ subscribe unsubscribe campaign }.each do |action|
  [ '10 sec', '1 min', '5 min', '15 min' ].each do |window|
    valve.register "SELECT count(*) from MailchimpEvent(type='#{action}').win:time_batch(#{window})", (
      Listener.new(valve) do |agent, event|
        valve.output :graphite, "#{event.get('count(*)')}", :routing_key =>
          window.delete(' ') + ".morat.#{valve.application}.mailchimp.#{action}"
      end
    )
  end
end
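The nested loops register 3 actions × 4 windows = 12 Esper queries, one metric per combination. To make the resulting routing keys concrete, here is the same key-building expression run on its own, assuming a hypothetical application name `cgn` in place of `valve.application`:

```ruby
# Enumerate the 12 routing keys the valve above emits, using the
# same window.delete(' ') + "..." expression as the Listener block.
# 'cgn' is a made-up stand-in for valve.application.
application = 'cgn'

keys = %w{ subscribe unsubscribe campaign }.flat_map do |action|
  [ '10 sec', '1 min', '5 min', '15 min' ].map do |window|
    window.delete(' ') + ".morat.#{application}.mailchimp.#{action}"
  end
end

keys.first   # => "10sec.morat.cgn.mailchimp.subscribe"
keys.length  # => 12
```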
26. Why use CEP?
# find the sum of retweets of last 5 tweets which saw more than 10 retweets
SELECT sum(retweets) from TweetEvent(retweets >= 10).win:length(5)
# find max, min and average number of retweets for a sliding 60 second window of time
SELECT max(retweets), min(retweets), avg(retweets) FROM TweetEvent.win:time(60 sec)
# compute number of retweets for all tweets in 10 second batches
SELECT sum(retweets) from TweetEvent.win:time_batch(10 sec)
# number of retweets, grouped by timezone, buffered in 10 second increments
SELECT timezone, sum(retweets) from TweetEvent.win:time_batch(10 sec) group by timezone
# compute the sum of retweets in sliding 60 second window, and emit count every 30 events
SELECT sum(retweets) from TweetEvent.win:time(60 sec) output snapshot every 30 events
# every 10 seconds, report timezones which accumulated more than 10 retweets
SELECT timezone, sum(retweets) from TweetEvent.win:time_batch(10 sec) group by timezone having sum(retweets) > 10
Courtesy @igrigorik http://www.igvita.com/2011/05/27/streamsql-event-processing-with-esper/
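To build intuition for what `win:time_batch` buys you, the batching behaviour can be emulated in a few lines of plain Ruby: bucket events into fixed windows by timestamp and aggregate each bucket. The event list here is invented sample data ([timestamp in seconds, retweet count]); Esper of course does this incrementally over a live stream rather than over an array.

```ruby
# Plain-Ruby emulation of: SELECT sum(retweets) from
# TweetEvent.win:time_batch(10 sec) — group events into 10-second
# buckets and sum retweets per bucket.
# Hypothetical events: [timestamp_in_seconds, retweets]
events = [[0, 3], [4, 5], [9, 1], [12, 7], [19, 2], [21, 4]]

batches = events.group_by { |ts, _| ts / 10 }          # bucket index = ts / 10
sums = batches.transform_values { |batch| batch.sum { |_, rt| rt } }

sums  # => {0 => 9, 1 => 9, 2 => 4}
```

The CEP engine's value is doing exactly this kind of windowed aggregation continuously, without you holding the event history yourself.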
30. Valve
Grab raw data for window from Graphite via REST
Create scatter graph using R and calculate correlation
Inject correlation into graphite exchange
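The statistic this valve asks R for is a correlation coefficient between two metric series pulled from Graphite (e.g. tweet rate vs. subscribe rate over the same window). As a sketch of the calculation itself, here is Pearson's r in plain Ruby, assuming two equal-length numeric samples; the function name and sample data are illustrative, not the deck's actual code:

```ruby
# Pearson correlation coefficient between two equal-length samples,
# the kind of statistic the R step computes for the window.
def pearson(xs, ys)
  n = xs.length.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov  = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sd_x = Math.sqrt(xs.sum { |x| (x - mean_x)**2 })
  sd_y = Math.sqrt(ys.sum { |y| (y - mean_y)**2 })
  cov / (sd_x * sd_y)
end

pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear data gives r ≈ 1.0
```

The resulting single number per window is what gets injected back into the graphite exchange as a new metric.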
35. Valve
Subscribe to twitter exchange morat.twitter.search.[keyword]
Extract adjectives using engtagger
Inject adjectives into twitter exchange with routing key morat.twitter.search.[keyword].adjectives as: [adjective] [count]
36. twitter-sentiment-valve.rb
require 'engtagger'
...
log.debug "Received tweet from #{record['user']['screen_name']} on #{routing_key}"
adjectives = @parser.add_tags(record['text']).scan(EngTagger::ADJ).map do |n|
  @parser.strip_tags(n)
end
ret = Hash.new(0)
adjectives.each do |n|
  n = @parser.stem(n)
  ret[n] += 1 unless n =~ /\A\s*\z/
end
37. Sink
Subscribe to twitter exchange morat.twitter.search.[keyword].adjectives
Use node.js and Socket.IO to send data to web client via WebSockets
Visualise with processing.js in web browser
Take the adjectives and store a running total in Redis to create long-timeline tag clouds
Pull out @replies and RTs and throw them into Neo4j, a graph database, for post-competition analysis
Hook an Arduino up to IRC to receive Mailchimp subscriptions and create a physical visualisation in the office (e.g. glow ball)