Drinking from the Firehose
Real-Time Metrics
Samantha Quiñones
@ieatkillerbees
http://samanthaquinones.com
“How would you let editors test how well
different headlines perform for the same
piece of content?”
Measuring User Behavior
• Application path
• Use patterns
• Mouse & attention tracking
Multivariate Testing
• Sort all users into groups
• 1 control group receives unaltered content
• 1 or more groups receive altered content
• Measure behavioral statistics (CTR, abandon rate, time on page, scroll
depth) for each group
State Monitoring
• Debugging
• Load Monitoring
And then…?
• Augmented intelligence for content creators
• Quality prediction
What if content could change
itself based on the weather?
Managing Big Data
How big is big?
1,300,000,000 events per DAY
~40 datapoints per EVENT
~15,000 events per SECOND
Containing ~600,000 datapoints per second
At a rate of up to 25 megabytes per second
[Diagram: Collectors → RabbitMQ farm → Hadoop]
Hadoop
• Framework for distributed storage and processing of data
• Designed to make managing very large datasets simple with…
• Well-documented, open-source, common libraries
• Optimizing for commodity hardware
Hadoop Distributed File System
• Modeled after Google File System
• Stores logical files across multiple systems
• Rack-aware
• No read-write concurrency
MapReduce
• Framework for massively parallel data processing tasks
Map
<?php
$document = "I'm a little teapot short and stout here is my handle here is my spout";

/**
 * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
 */
function map($target_word, $document) {
    return array_map(
        function ($word) use ($target_word) {
            if ($word === $target_word) {
                return 1;
            }
            return 0;
        },
        preg_split('/\s+/', $document)
    );
}

echo json_encode(map("is", $document)) . PHP_EOL;
Reduce
<?php
$data = [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0];

/**
 * Outputs: 2
 */
function reduce($data) {
    return array_reduce(
        $data,
        function ($count, $value) {
            return $count + $value;
        },
        0
    );
}

echo reduce($data) . PHP_EOL;
Hadoop Limitations
• Hadoop jobs are batched and take significant time to run
• Data may not be available for 1+ hours after collection
“How would you let editors test how well
different headlines perform for the same
piece of content?”
Consider Shelf-life
• Most articles are relevant for < 24 hours
• Interest peaks < 3 hours
Real-Time Pipelines
[Diagram: Collectors → RabbitMQ farm → Streamers]
Version 1 (PoC)
[Diagram: Streamers → Receivers → StatsD cluster → ElasticSearch]
this.visit = function (record) {
    if (record.userAgent) {
        var parser = new UAParser();
        parser.setUA(record.userAgent);
        var user_agent = parser.getResult();
        return { user_agent: user_agent };
    }
    return {};
};
Findings
• Max throughput per collector: 300 events/second
• ~70 receivers needed for prod
• StatsD key format creates data redundancy and reduces data richness
Version 1 (PoC)
[Diagram: Streamers → Receivers → StatsD cluster → ElasticSearch]
Transits & Terminals
• Transits - Short-term, in-memory, volatile storage for data with a life-span
up to a few seconds
• Terminals - Destinations for data that either store, abandon, or transmit
An efficient real-time data pathway consists
of a network of transits and terminals, where
no single node acts as both a transit and a
terminal at the same time.
StatsD
• Acts as a transit, taking data and passing it along…
• BUT
• Acts as a terminal, aggregating keys in memory and becoming a transit
after a time or buffer threshold.
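The data-richness problem with StatsD-style keys can be shown concretely. In this simplified sketch (assumed behavior, not the production key scheme), each event is flattened into independent counters, so the relationship between fields of a single record is lost once aggregation happens:

```javascript
// Sketch: StatsD-style counter aggregation. Each field of an event becomes
// its own dot-delimited counter key, so after aggregation we can no longer
// tell which field values co-occurred in one record.
function statsdAggregate(events) {
    var counters = {};
    events.forEach(function (e) {
        ['os.' + e.os, 'browser.' + e.browser].forEach(function (key) {
            counters[key] = (counters[key] || 0) + 1;
        });
    });
    return counters;
}

var counters = statsdAggregate([
    { os: 'linux', browser: 'firefox' },
    { os: 'osx', browser: 'firefox' }
]);
// We know there were 2 firefox hits and 1 linux hit, but the question
// "how many firefox-on-linux hits?" is no longer answerable.
```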
Version 2
[Diagram: Streamers → Receivers → RabbitMQ → ElasticSearch]
RabbitMQ
• Lightweight message broker
• Allows complex message routing without application-level logic
• Can buffer 90-120 seconds of traffic
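A quick back-of-the-envelope check, using the talk's own numbers, of what a 90-120 second buffer implies for broker memory:

```javascript
// At 25 MB/s of incoming traffic and ~15,000 events/s, how large does a
// 90-120 second RabbitMQ buffer get? Simple arithmetic on the stated rates.
var MB_PER_SECOND = 25;
var EVENTS_PER_SECOND = 15000;

function bufferMegabytes(seconds) {
    return MB_PER_SECOND * seconds;
}

function bufferedEvents(seconds) {
    return EVENTS_PER_SECOND * seconds;
}

var lowMB = bufferMegabytes(90);    // 2,250 MB
var highMB = bufferMegabytes(120);  // 3,000 MB
```

So absorbing a two-minute downstream stall means holding roughly 2-3 GB and ~1.8 million messages in the broker, which is why the buffer window is bounded.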
Version 2
• Eliminated eventing and improved performance
• Replaced StatsD with RabbitMQ
• Data records are kept together
• No longer works with Kibana (sadface)
while (buffer.length > 0) {
    var char = buffer.shift();
    if ('\n' === char) {
        queue.push(new Buffer(outbuf.join('')));
        outbuf = [];
        continue;
    }
    outbuf.push(char);
}
var i = 0;
var tBuf = buffer.slice();
while (i < buffer.length) {
    var char = tBuf[i++];
    if ('\n' === char) {
        queue.push(new Buffer(outbuf.join('')));
        outbuf = [];
        continue;
    }
    outbuf.push(char);
}
Findings
• Max throughput per collector: 600 events/second
• ~35 receivers needed for prod
• Micro-optimized code became increasingly brittle and hard to maintain as
custom logic was needed for every edge case
Version 2
[Diagram: Streamers → Receivers → RabbitMQ → ElasticSearch]
Need to Get Serious
• Very high throughput
• Multi-threaded worker pool with large memory buffers
• Static & dynamic optimization
• Efficient memory management for extremely volatile in-memory data
• Eliminate any processing overhead. Receiver must be a Transit
And also…
• Not GoLang (because no one on the team is familiar with it)
• Not Rust (because no one on the team wants to be familiar with it)
• Not C (because C)
mfw java :(
Why Java?
• Solid static & dynamic analysis and optimizations in the S2BC & JIT
compilers
• Clients for the stuff I needed to talk to
• Well-supported within AOL & within my team
Version 3
[Diagram: Streamers → Receivers → RabbitMQ → Processor/Router → ElasticSearch]
public class StreamReader {
    private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
    private StreamerQueue queue = new StreamerQueue();
    private StreamProcessor processor;
    private List<StreamReader.BeaconWorkerThread> workerThreads = new ArrayList<>();
    private RtStreamerClient client;

    public StreamReader(String streamerURI, AmqpClient amqpClient, String appID,
                        String tpcFltrs, String rfFltrs, String bt) {
        List<StreamerQueue> queueList = new ArrayList<>();
        this.processor = new StreamProcessor(amqpClient);
        int numThreads = 8;
        for (int i = 0; i < numThreads; ++i) {
            StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
            this.workerThreads.add(worker);
            worker.start();
        }
        queueList.add(this.queue);
        this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
    }
}
public class StreamProcessor {
    private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
    private AmqpClient amqpClient;

    public StreamProcessor(AmqpClient amqpClient) {
        this.amqpClient = amqpClient;
    }

    public void send(String data) throws Exception {
        this.amqpClient.send(data.getBytes());
        logger.debug("Sent event " + data + " to AMQP");
    }
}
[Diagram: Network input → pool of linked-list queues → network output]
Linked List Queues
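A minimal sketch of the linked-list queue shape in the diagram above. Unlike `Array.prototype.shift()`, which is O(n) per call, dequeueing from a linked list is O(1), which matters at thousands of events per second:

```javascript
// Minimal singly linked list queue (illustrative, not the production class).
// enqueue() appends at the tail, dequeue() removes from the head; both are O(1).
function LinkedListQueue() {
    this.head = null;
    this.tail = null;
    this.length = 0;
}

LinkedListQueue.prototype.enqueue = function (value) {
    var node = { value: value, next: null };
    if (this.tail) {
        this.tail.next = node;
    } else {
        this.head = node;
    }
    this.tail = node;
    this.length++;
};

LinkedListQueue.prototype.dequeue = function () {
    if (!this.head) {
        return null;
    }
    var value = this.head.value;
    this.head = this.head.next;
    if (!this.head) {
        this.tail = null;
    }
    this.length--;
    return value;
};
```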
Findings
• Max throughput per collector: 2600 events/second
• ~10 receivers needed for prod
Why ElasticSearch
• Open-source Lucene search engine
• Highly-distributed storage engine
• Clusters nicely
• Built-in aggregations like whoa
Aggregations
• Geographic Boxing & Radius Grouping
• Time-Series
• Histograms
• Min/Max/Avg Statistical Evaluation
• MapReduce (coming soon!)
• How many users viewed my post on an android tablet in portrait mode
within 10 miles of Denton, TX?
• What is the average time from start of page-load to first click for readers
on linux desktops between 3am and 5am?
• Given two sets of link texts, which has the higher CTR for a randomized
sample of readers on tablet devices?
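A question like the first one above maps almost directly onto an ElasticSearch filter set. This is a sketch only: the field names (`device.*`, `location.geopoint`) are assumptions about the schema, not the real mapping, while `term` filters and `geo_distance` are standard ElasticSearch features:

```javascript
// Sketch of an ES 1.x-era query body for: "users who viewed on an Android
// tablet in portrait mode within 10 miles of Denton, TX". Field names are
// hypothetical; the coordinates are approximately Denton, TX.
var query = {
    query: {
        filtered: {
            filter: {
                bool: {
                    must: [
                        { term: { 'device.type': 'tablet' } },
                        { term: { 'device.os': 'android' } },
                        { term: { 'device.orientation': 'portrait' } },
                        {
                            geo_distance: {
                                distance: '10mi',
                                'location.geopoint': { lat: 33.21, lon: -97.13 }
                            }
                        }
                    ]
                }
            }
        }
    }
};
```

Adding a `date_histogram` or `stats` aggregation to the same body answers the time-series and min/max/avg questions without pulling raw documents back.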
Browser to Browser in < 5
seconds
But wait…
Is that “real-time”?
Real-Time for Real
• Live analysis of data as it is collected
• Active visualization of very short-term trends in data
Potential Problems
• Small sample sizes for new datasets / small analysis windows
• Data volumes too high for end-user comprehension
• Data volumes too high for end-user hardware/network connections
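One common mitigation for the last two problems (a sketch, not necessarily what this pipeline does) is to downsample probabilistically before events reach the browser, and rescale counts by 1/rate on the client:

```javascript
// Sketch: probabilistic downsampling for end-user streams. Each event is
// forwarded with probability `rate`. The injectable `random` parameter
// exists only to make the behavior testable.
function makeSampler(rate, random) {
    random = random || Math.random;
    return function forward(event, emit) {
        if (random() < rate) {
            emit(event);
        }
    };
}

// Example: keep roughly 1% of events for the browser-facing stream.
var sample = makeSampler(0.01);
```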
Version 4
[Diagram: Streamers → Receivers → RabbitMQ → Processor/Router → ElasticSearch + Websocket server]
D3JS
• Open-source data visualization library written in JavaScript
function plot(point) {
    var points = svg.selectAll("circle")
        .data([point], function (d) {
            return d.id;
        });
    points.enter()
        .append("circle")
        .attr("cx", function (d) {
            return projection([parseInt(d.location.geopoint.lon), parseInt(d.location.geopoint.lat)])[0];
        })
        .attr("cy", function (d) {
            return projection([parseInt(d.location.geopoint.lon), parseInt(d.location.geopoint.lat)])[1];
        })
        .attr("r", function (d) { return 1; })
        .style('fill', 'red')
        .style('fill-opacity', 1)
        .style('stroke', 'red')
        .style('stroke-width', '0.5px')
        .style('stroke-opacity', 1)
        .transition()
        .duration(10000)
        .style('fill-opacity', 0)
        .style('stroke-opacity', 0)
        .attr('r', '32px')
        .remove();
}
var buffer = [];
var socket = io();
socket.on('geopoint', function(point) {
if (point.location.geopoint) {
plot(point);
}
});
By the way…
xn = x + (r * COS(2π * n / v))
yn = y + (r * SIN(2π * n / v))
where n = ordinal of the vertex,
v = number of vertices, and
x,y = center of the polygon
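As code, the vertex formulas become a short loop: cosine for the x coordinate, sine for the y coordinate, stepping the angle by 2π/v per vertex. An illustrative sketch:

```javascript
// Compute the v vertices of a regular polygon of radius r centered at (x, y).
// Vertex n sits at angle 2π·n/v; x uses cos, y uses sin.
function polygonVertices(x, y, r, v) {
    var vertices = [];
    for (var n = 0; n < v; n++) {
        var angle = (2 * Math.PI * n) / v;
        vertices.push({
            x: x + r * Math.cos(angle),
            y: y + r * Math.sin(angle)
        });
    }
    return vertices;
}

// A unit square's corners: (1,0), (0,1), (-1,0), (0,-1).
var square = polygonVertices(0, 0, 1, 4);
```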
var views = 0;
var socket = io();
socket.on('pageview', function (point) {
    views++;
});

function tick() {
    data.push(views);
    views = 0;
    path
        .attr("d", line)
        .attr("transform", null)
        .transition()
        .duration(500)
        .ease("linear")
        .attr("transform", "translate(" + x(0) + ",0)")
        .each("end", tick);
    data.shift();
}
tick();
Pageview Heartbeat
Real-Time Profiling
Receiver Layer
Receiver Buffer/Transit
Processing & Routing Layer
Processing & Routing Transit
Storage Engine End-User Consumable Queues
Layers are
• Geographically decoupled
• Capable of independent scaling
• Fully encapsulated with no cross-
layer dependencies
Interfaces
Input Stream (Java)
Routing (node.js)
Filtering (node.js)
Aggregation (PHP)
Visualization (D3JS)
MV Testing (PHP)
Languages & Tools
RabbitMQ
Hadoop
ElasticSearch
PHP
JS (node)
JS (D3)
Java
MySQL
Where are we Now?
• It took 6 months to build a rock-solid data pipeline
• Entry points from:
• User data collectors
• Application code
That was the easy part.
What’s next?
• Live debugging & runtime profiling
• Embeddable visualizations
• On-demand stream filters
• Predictive performance analysis
• Real-time sentiment analysis
???
@ieatkillerbees
http://samanthaquinones.com
https://joind.in/13742
