To understand an application’s performance, first you have to know what to measure. That’s the easy part. How do you take those measurements? Store them? Analyze them? Get them to the people who need them? Well, that’s where things get complicated, especially in the high-traffic distributed systems of the modern web! Like careful scientists, we must observe our subjects without altering them, and we must report our findings quickly so that we have the data necessary to make smart choices about the health and growth of the system.
Let’s explore the lessons learned by engineers at one of the world’s top web companies in their quest to find meaning at 5 MB/s. We’ll discuss the tools and techniques that enable the collection, indexing, and analysis of billions or more datapoints each hour, and learn how these same approaches can empower your applications and your business, no matter the scale.
7. Multivariate Testing
• Sort all users into groups
• 1 control group receives unaltered content
• 1 or more groups receive altered content
• Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
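The grouping step can be sketched as a deterministic hash of the user ID, so the same user always lands in the same group. This is only an illustration: the helper names (`fnv1a`, `assignGroup`, `ctr`), the group labels, and the FNV-1a-style hash are assumptions, not something taken from the slides.

```javascript
// Hypothetical helper: FNV-1a-style 32-bit hash over a string's UTF-16 code units.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Assign a user to one group. Hashing the test name together with the user ID
// keeps assignments stable within a test but independent across tests.
function assignGroup(userId, testName, groups) {
  const bucket = fnv1a(testName + ':' + userId) % groups.length;
  return groups[bucket];
}

// Click-through rate for one group: clicks divided by impressions.
function ctr(stats) {
  return stats.impressions === 0 ? 0 : stats.clicks / stats.impressions;
}

// Assumed group labels: one control group plus two altered variants.
const groups = ['control', 'variant-a', 'variant-b'];
console.log(assignGroup('user-42', 'headline-test', groups));
```

Hashing rather than storing assignments means any collector can compute a user's group without a lookup, at the cost of not being able to rebalance groups mid-test.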
21. Hadoop
• Framework for distributed storage and processing of data
• Designed to make managing very large datasets simple with…
• Well-documented, open-source, common libraries
• Optimizing for commodity hardware
22. Hadoop Distributed File System
• Modeled after Google File System
• Stores logical files across multiple systems
• Rack-aware
• No read-write concurrency
24. Map
<?php
$document = "I'm a little teapot short and stout here is my handle here is my spout";

/**
 * Emits 1 for each word equal to $target_word and 0 for every other word.
 * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
 */
function map($target_word, $document) {
    return array_map(
        function ($word) use ($target_word) {
            if ($word === $target_word) {
                return 1;
            }
            return 0;
        },
        // Split the document on runs of whitespace.
        preg_split('/\s+/', $document)
    );
}

echo json_encode(map("is", $document)) . PHP_EOL;
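The map step only emits per-word values; a reduce step would then combine them into one result. The sketch below pairs a JavaScript port of the map function with a summing reduce. The `reduce` function is an assumption about what the matching step could look like, not code from the slides.

```javascript
const documentText = "I'm a little teapot short and stout here is my handle here is my spout";

// Map: emit 1 for every word matching the target, 0 otherwise.
function map(targetWord, text) {
  return text.split(/\s+/).map((word) => (word === targetWord ? 1 : 0));
}

// Reduce (assumed): sum the emitted values into a single count.
function reduce(values) {
  return values.reduce((sum, value) => sum + value, 0);
}

console.log(reduce(map('is', documentText))); // prints 2
```

Because each map call depends only on its own input, documents can be mapped on separate machines and only the small emitted arrays need to travel to the reducer.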
35.
this.visit = function (record) {
    // Parse the raw user-agent string into structured fields when one is present.
    if (record.userAgent) {
        var parser = new UAParser();
        parser.setUA(record.userAgent);
        var user_agent = parser.getResult();
        return { user_agent: user_agent };
    }
    return {};
};
37. Findings
• Max throughput per collector: 300 events/second
• ~70 receivers needed for prod
• StatsD key format creates data redundancy and reduced data richness
39. Transits & Terminals
• Transits - Short-term, in-memory, volatile storage for data with a lifespan of up to a few seconds
• Terminals - Destinations for data that either store, abandon, or transmit it
40. An efficient real-time data pathway consists
of a network of transits and terminals, where
no single node acts as both a transit and a
terminal at the same time.
41. StatsD
• Acts as a transit, taking data and passing it along…
• BUT
• Acts as a terminal, aggregating keys in memory and only becoming a transit once a time or buffer threshold is reached
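The difference between the two roles can be sketched in a few lines. This is an illustration under stated assumptions, not StatsD's actual implementation: `AggregatingNode`, its key-count threshold, and the `transit` helper are all hypothetical.

```javascript
// Hypothetical aggregating node: it behaves as a terminal (holding data)
// until a threshold is reached, then forwards only the aggregated counts.
class AggregatingNode {
  constructor(maxKeys, forward) {
    this.maxKeys = maxKeys; // assumed buffer threshold: number of distinct keys
    this.forward = forward; // downstream callback
    this.counts = new Map();
  }

  receive(key) {
    this.counts.set(key, (this.counts.get(key) || 0) + 1);
    if (this.counts.size >= this.maxKeys) {
      this.flush();
    }
  }

  flush() {
    const snapshot = Object.fromEntries(this.counts);
    this.counts.clear();
    this.forward(snapshot);
  }
}

// A pure transit forwards each record immediately and keeps nothing.
function transit(forward) {
  return (record) => forward(record);
}

const flushed = [];
const node = new AggregatingNode(2, (snapshot) => flushed.push(snapshot));
node.receive('page.view');
node.receive('page.view');
node.receive('page.click');
console.log(flushed);
```

Once the node flushes, only per-key counts survive; the individual events that produced them are gone, which is one way to read the "reduced data richness" finding.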
44. RabbitMQ
• Lightweight message broker
• Allows complex message routing without application-level logic
• Can buffer 90-120 seconds of traffic
45. Version 2
• Eliminated eventing and improved performance
• Replaced StatsD with RabbitMQ
• Data records are kept together
• No longer works with Kibana (sadface)
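46.
// Variant 1: drain the buffer with shift(), which can be costly on large arrays.
while (buffer.length > 0) {
    var char = buffer.shift();
    if ('\n' === char) {
        // End of record: emit it and start a new one.
        // (Assumed reset: the original snippet never clears outbuf.)
        queue.push(Buffer.from(outbuf.join('')));
        outbuf = [];
        continue;
    }
    outbuf.push(char);
}

// Variant 2: same loop, indexing into a copy instead of calling shift().
var i = 0;
var tBuf = buffer.slice();
while (i < buffer.length) {
    var char = tBuf[i++];
    if ('\n' === char) {
        queue.push(Buffer.from(outbuf.join('')));
        outbuf = [];
        continue;
    }
    outbuf.push(char);
}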
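For comparison, newline-delimited records can also be split without any per-character work by letting Node's `Buffer.indexOf` find the delimiters. This is a sketch of an alternative, not the approach used in the slides; it assumes the input arrives as Node `Buffer` chunks, and `splitRecords` and its `carry` parameter are hypothetical names.

```javascript
// Split newline-delimited records out of a chunk.
// Returns the complete records plus any trailing partial record, which the
// caller passes back in as `carry` along with the next chunk.
function splitRecords(chunk, carry) {
  const data = carry.length > 0 ? Buffer.concat([carry, chunk]) : chunk;
  const records = [];
  let start = 0;
  let end = data.indexOf(0x0a, start); // 0x0a is the byte value of '\n'
  while (end !== -1) {
    records.push(data.subarray(start, end));
    start = end + 1;
    end = data.indexOf(0x0a, start);
  }
  return { records: records, rest: data.subarray(start) };
}

const first = splitRecords(Buffer.from('a=1\nb=2\nc='), Buffer.alloc(0));
const second = splitRecords(Buffer.from('3\n'), first.rest);
console.log(first.records.map(String), second.records.map(String));
```

`subarray` returns views onto the same memory rather than copies, so a record should be copied if it needs to outlive the chunk it came from.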
47. Findings
• Max throughput per collector: 600 events/second
• ~35 receivers needed for prod
• Micro-optimized code became increasingly brittle and hard to maintain as
custom logic was needed for every edge case
49. Need to Get Serious
• Very high throughput
• Multi-threaded worker pool with large memory buffers
• Static & dynamic optimization
• Efficient memory management for extremely volatile in-memory data
• Eliminate any processing overhead. Receiver must be a Transit
50. And also…
• Not GoLang (because no one on the team is familiar with it)
• Not Rust (because no one on the team wants to be familiar with it)
• Not C (because C)
53. Why Java?
• Solid static & dynamic analysis and optimizations in the source-to-bytecode (S2BC) & JIT compilers
• Clients for the stuff I needed to talk to
• Well-supported within AOL & within my team
63.
• How many users viewed my post on an Android tablet in portrait mode within 10 miles of Denton, TX?
• What is the average time from start of page load to first click for readers on Linux desktops between 3am and 5am?
• Given two sets of link texts, which has the higher CTR for a randomized sample of readers on tablet devices?
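As an illustration of the first question, the sketch below filters records by device attributes and by great-circle distance using the haversine formula. Everything about the record shape (`device`, `os`, `orientation`, `lat`, `lon`) is assumed for the example, and the Denton coordinates are approximate.

```javascript
const EARTH_RADIUS_MILES = 3958.8; // approximate mean Earth radius

// Great-circle distance between two latitude/longitude points, in miles.
function haversineMiles(lat1, lon1, lat2, lon2) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
}

// Approximate coordinates for Denton, TX (assumed for illustration).
const DENTON = { lat: 33.215, lon: -97.133 };

// Hypothetical record shape: { device, os, orientation, lat, lon }.
function viewedNearDenton(records, maxMiles) {
  return records.filter((r) =>
    r.device === 'tablet' &&
    r.os === 'android' &&
    r.orientation === 'portrait' &&
    haversineMiles(DENTON.lat, DENTON.lon, r.lat, r.lon) <= maxMiles);
}

const sample = [
  { device: 'tablet', os: 'android', orientation: 'portrait', lat: 33.215, lon: -97.133 },
  { device: 'tablet', os: 'android', orientation: 'portrait', lat: 32.7767, lon: -96.797 },
  { device: 'desktop', os: 'linux', orientation: 'landscape', lat: 33.215, lon: -97.133 },
];
console.log(viewedNearDenton(sample, 10).length);
```

A query like this is only answerable if each record keeps its device, orientation, and location fields together, rather than being flattened into independent counters.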
67. Real-Time for Real
• Live analysis of data as it is collected
• Active visualization of very short-term trends in data
68. Potential Problems
• Small sample sizes for new datasets / small analysis windows
• Data volumes too high for end-user comprehension
• Data volumes too high for end-user hardware/network connections
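One way to keep a live feed within what a browser and its network connection can handle is to cap how many records are forwarded per time window. The limiter below is a minimal sketch under that assumption; `createLimiter` and its parameters are hypothetical and not part of the system described in the slides.

```javascript
// Forward at most `maxPerWindow` records per window of `windowMs` milliseconds
// and count the rest as dropped. `now` is injectable so the clock can be faked.
function createLimiter(maxPerWindow, windowMs, now = Date.now) {
  let windowStart = now();
  let sent = 0;
  let dropped = 0;
  return {
    accept(record) {
      const t = now();
      if (t - windowStart >= windowMs) {
        // A new window begins: reset the per-window counter.
        windowStart = t;
        sent = 0;
      }
      if (sent < maxPerWindow) {
        sent++;
        return true;
      }
      dropped++;
      return false;
    },
    droppedCount: () => dropped,
  };
}

const limiter = createLimiter(100, 1000);
console.log(limiter.accept({ id: 1 })); // prints true
```

This keeps the first records of each window, which biases the sample toward the start of the window; picking records at random within the window would be fairer if the visualization is used for analysis rather than as a rough activity indicator.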
72.
function plot(point) {
    // Project [lon, lat] to screen coordinates; parseFloat keeps fractional degrees.
    function project(d) {
        return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)]);
    }

    var points = svg.selectAll("circle")
        .data([point], function (d) {
            return d.id;
        });

    // Draw a small red dot, then grow and fade it out over 10 seconds before removing it.
    points.enter()
        .append("circle")
        .attr("cx", function (d) { return project(d)[0]; })
        .attr("cy", function (d) { return project(d)[1]; })
        .attr("r", function (d) { return 1; })
        .style('fill', 'red')
        .style('fill-opacity', 1)
        .style('stroke', 'red')
        .style('stroke-width', '0.5px')
        .style('stroke-opacity', 1)
        .transition()
        .duration(10000)
        .style('fill-opacity', 0)
        .style('stroke-opacity', 0)
        .attr('r', '32px').remove();
}

var buffer = [];
var socket = io();

// Plot each incoming point that carries a location.
socket.on('geopoint', function (point) {
    if (point.location.geopoint) {
        plot(point);
    }
});
75. By the way…
xn = x + (r * COS(2π * n / v))
yn = y + (r * SIN(2π * n / v))
where n = ordinal of vertex,
v = number of vertices,
r = radius (distance from the center to each vertex), and
x,y = center of the polygon
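The vertices of a regular polygon can be computed directly from these definitions, using the cosine for each x coordinate and the sine for each y coordinate. A minimal sketch, with `polygonVertices` as a hypothetical function name and vertices numbered from 0:

```javascript
// Vertices of a regular polygon centered at (x, y), with radius r
// (center to vertex) and v vertices.
function polygonVertices(x, y, r, v) {
  const vertices = [];
  for (let n = 0; n < v; n++) {
    const angle = (2 * Math.PI * n) / v;
    vertices.push([x + r * Math.cos(angle), y + r * Math.sin(angle)]);
  }
  return vertices;
}

console.log(polygonVertices(0, 0, 1, 4));
```

With v = 4 and r = 1 this yields the four points of a diamond around the origin, give or take floating-point rounding in the coordinates that should be exactly zero.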