10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll
Upcoming SlideShare
Loading in...5
×
 

10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

on

  • 5,318 views

This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discussed the instrumentation by default approach taken in the AdRoll real-time bidding team, discuss the technical details of the ...

This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discussed the instrumentation by default approach taken in the AdRoll real-time bidding team, discuss the technical details of the libraries we use and lessons learned to adapt your organization to deal with the onslaught of data from instrumentation.

Statistics

Views

Total Views
5,318
Views on SlideShare
5,188
Embed Views
130

Actions

Likes
2
Downloads
18
Comments
0

5 Embeds 130

https://twitter.com 89
http://blog.troutwine.us 29
http://localhost 9
http://feedly.com 2
http://www.newsblur.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll 10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll Presentation Transcript

  • T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R MONITORING REAL-TIME B I D D I N G AT A D R O L L
  • I DO THINGS WITH/TO COMPUTERS.
  • I CARE ABOUT RELIABLE, COMPLEX AND CRITICAL SYSTEMS.
  • ADROLL
  • LESS THIS
  • MORE THIS
  • WE’RE AN ADTECH C O M PA N Y .
  • R E TA R G E T I N G , IT’S A L L A B O U T D ATA .
  • • Our customers want to show ads to people who might care to see them. • We partner with “exchanges” to enter ad-slot auctions. • These auctions are executed and finalized while consumer’s webpages load.
  • REAL-TIME BIDDING
  • the nature of the problem domain: • Low latency ( < 100ms per transaction ) • Firm real-time system • Highly concurrent ( > 30 billion transactions per day ) • Global, 24/7 operation
  • “HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS(…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.” –C ARLOS BUENO “ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
  • AHEAD OF TIME V E R I F I C AT I O N I S N O T SUFFICIENT. ! (DON’T SCRIMP ON IT, THOUGH.)
  • IGNORANCE AND COMPLEX INTERACTIONS WITH EXTERNAL SYSTEMS ARE WHY WE CAN’T H AV E N I C E T H I N G S .
  • AT SCALE, EVEN RARE EVENTS HAPPEN FREQUENTLY.
  • AT SCALE, BAD THINGS HAPPEN FA S T E R T H A N HUMANS CAN RESPOND.
  • THE RESULTS ARE RARELY PRETTY.
  • THE CAUSES ARE RARELY QUICK TO DISCOVER.
  • W H AT CAN BE DONE?
  • WE H AV E T O O B S E R V E OUR SYSTEMS WHILE THEY RUN.
  • WE CAN BE SCIENTIFIC ABOUT THIS
  • NOT ALL SYSTEMS ARE MONITORABLE, JUST AS NOT A L L A R E T E S TA B L E . ! I T ’ S A M AT T E R O F D E S I G N A N D OF CULTURE.
  • A DESIGN T H AT I S M O N I T O R A B L E TA K E S S T O C K O F I T S I N T E R N A L TOLERANCES, EXTERNAL I N T E R FA C E S A N D E X P O S E S T H E S E FOR AN O P E R AT O R .
  • AN ENGINEER IS RESPONSIBLE FOR DESIGNING AND BUILDING SYSTEMS. ! A N O P E R AT O R IS RESPONSIBLE FOR RUNNING THEM.
  • A H E A D - O F - T I M E V E R I F I C AT I O N —TESTING, TYPE-CHECKING— GIVES THE ENGINEER INTUITION ABOUT THE SYSTEM.
  • IF YOU DON’T KNOW HOW THE SYSTEM S H O U L D B E H AV E Y O U C A N ’ T S AY H O W I T SHOULDN’T OR ISN’T.
  • MONITORING GIVES THE O P E R AT O R I N F O R M AT I O N A B O U T T H E B E H AV I O R O F THE RUNNING SYSTEM.
  • T H E O P E R AT O R S A N D E N G I N E E R S ARE OFTEN THE SAME PEOPLE.
  • THERE’S A POTENTIAL FOR A POSITIVE FEEDBACK LOOP OF QUALITY HERE.
  • “WHILE THE SKILL OF REENTRY WAS EASILY HANDLED BY AUTOMATED SYSTEMS, THE PILOT’S PRIMARY FUNCTION EVOLVED TO BE A REDUNDANT SYSTEM (…) COORDINATING A VARIETY OF CONTROLS AS MUCH AS DIRECTLY CONTROLLING THE VEHICLE.” – DAV I D A . M I N D E L L D I G I TA L A P O L LO : H U M A N A N D M AC H I N E I N S PAC E F L I G H T
  • exometer
  • exometer — github.com/Feuerlabs/exometer • Created by Feuerlabs. • Responsive upstream (Ulf Wiger never sleeps?) • Metric collection, aggregation and reporting decoupled. • Static and dynamic configuration. • Very low, predictable runtime overhead.
  • I M P O R TA N T T E R M S • METRIC: a measurement • ENTRY: a receiver and aggregator of metrics • REPORTER: an entity which samples entries on a regular interval and optionally ships these samples onto a third-system • SUBSCRIPTION: the definition of the regular interval on which reporters sample entries
  • Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • D o i n g i t d y n a m i c a l l y. 1> exometer:new([a, histogram], histogram). ok ! 2> exometer:get_value([a, histogram]). {ok,[{n,0}, {mean,0}, {min,0}, {max,0}, {median,0}, {50,0}, {75,0}, {90,0}, {95,0}, {99,0}, {999,0}]} ! 3> exometer_report:add_reporter( exometer_report_tty, []). ok ! 4> exometer_report:subscribe( exometer_report_tty, [a, histogram], mean, 1000, []). ok ! exometer_report_tty: a_histogram_mean 1393627070:0 exometer_report_tty: a_histogram_mean 1393627071:0 exometer_report_tty: a_histogram_mean 1393627072:0
  • W H AT A R E W E L O O K I N G F O R ?
  • • VM killers • System performance regressions • Abnormal system behavior • Surprises
  • “ A B N O R M A L S Y S T E M B E H AV I O R ” ?
  • CASE STUDIES
  • CASE STUDY: ADROLL RTB TIMEOUT SPIKES
  • Case Study: AdRoll RTB Timeout Spikes PERIODIC BID TIMEOUTS
  • Case Study: AdRoll RTB Timeout Spikes CONSISTENT SYSTEM LOAD
  • Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
  • Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D R U N Q U E U E S P I K E S
  • Case Study: AdRoll RTB Timeout Spikes W H AT HAPPENED?
  • Case Study: AdRoll RTB Timeout Spikes • Scheduler threads were locked to CPUs • Background process comes on every 20 minutes, consumes a lot of CPU time • No cpu-shield was set up on our production systems • OS bumped a scheduler thread off its CPU, backing up its run-queue
  • CASE STUDY: EXCHANGE THROTTLING
  • Case Study: Exchange Throttling H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
  • Case Study: Exchange Throttling THE TROUGH OF THROTTLING
  • Case Study: Exchange Throttling BAD GOOD
  • Case Study: Exchange Throttling PROBLEM CONFIRMED WITH EXCHANGE
  • Case Study: Exchange Throttling • All other metrics (run-queue, CPU, network IO) were fine. • Confirmed that no changes had been made to the running systems via deployment. • Amazon data showed no network issues to our machines.
  • Case Study: Exchange Throttling W H AT HAPPENED?
  • Case Study: Exchange Throttling WE HIT AN IMPLICIT EXCHANGE LIMIT. (ARGUABLY, A BUG.)
  • LESSONS LEARNED
  • IT I S P O S S I B L E T O H AV E T O O L I T T L E I N F O R M AT I O N .
  • “(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF REACTOR 4). THEY CHERNOBYL KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.” - - SVETLANA ALEXIEVICH - VO I C E S F RO M C H E R N O BY L : T H E O R A L H I S TO RY O F A N U C L E A R D I SA S T E R
  • IT IS POSSIBLE TO COLLECT TOO M U C H I N F O R M AT I O N , O R P R E S E N T IT BADLY.
  • “SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. COMPLEX (…) ONE OF THE LESSONS OF SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.” - C HARLES PERROW - N ORMAL ACCIDENTS: LIVIN G WITH HIGH-RISK TEC HN OLOGIES
  • I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T W H AT ’ S P L A I N L Y B E F O R E Y O U R E Y E S .
  • “THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD. ONE REASON FOR THEIR IGNORANCE WAS THE IMPERFECT NATURE OF TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD BLOWN UP.” - - H E N R Y S . F. C O O P E R , J R . - X I I I : T H E A P O L LO F L I G H T T H AT FA I L E D
  • Things that don’t quite work like you’d hope.
  • Problems • Instrumenting code increases code size. • While very low, runtime impact is not zero. • Instrumentation of dependencies is up to the library author.
  • Solutions •Dtrace / systemtap hookups •Culture of instrumentation by default. •Tracing BIFs
  • W O M B AT ?
  • Instrumentation by default strategies • Application configuration callback module • Behaviors callbacks • Value-at-time functions
  • W H AT ABOUT CULTURE
  • LOOK FOR SOLUTIONS, NOT S C A P E G O AT S .
  • DON’T BE FOOLED BY OVERCONFIDENCE IN YOUR WORK.
  • DOCUMENT AND DISCUSS YOUR FA I L U R E S .
  • EVERYONE MUST WORK T O WA R D T H E SAME END.
  • KNOW SOME BASIC M AT H E M AT I C S .
  • PREFER LOOSE COUPLING WHEN POSSIBLE. ! BE EXPLICIT ABOUT TIGHT COUPLING.
  • MEASURE MUCH, P R E S E N T O N LY T H AT WHICH YOU’RE C E R TA I N T O N E E D .
  • MEASURE, DON’T GUESS.
  • BE A C C U R AT E .
  • QUESTIONS?
  • <3 @bltroutwine BRIAN@TROUTWINE.US
  • Bonus Bibliography! XIII: The Apollo Flight that Failed Henry S. F. Cooper, Jr. David A. Mindell Digital Apollo: Human and Machine in Spaceflight Charles Perrow Normal Accidents: Living with High-Risk Technologies Voices from Chernobyl: The Oral History of a Nuclear Disaster Svetlana Alexievich Real-Time Systems: Design Principles for Distributed Embedded Applications Hermann Kopetz Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety Eric Schlosser