SlideShare a Scribd company logo
1 of 89
Download to read offline
T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R

MONITORING REAL-TIME
B I D D I N G AT A D R O L L
I DO THINGS WITH/TO COMPUTERS.
I

CARE ABOUT RELIABLE, COMPLEX

AND CRITICAL SYSTEMS.
ADROLL
LESS THIS
MORE THIS
WE’RE AN
ADTECH
C O M PA N Y .
R E TA R G E T I N G ,

IT’S

A L L A B O U T D ATA .
• Our customers want to show ads to people who might

care to see them.
• We partner with “exchanges” to enter ad-slot auctions.
• These auctions are executed and finalized while

consumer’s webpages load.
REAL-TIME
BIDDING
the nature of the problem domain:
• Low latency ( < 100ms per transaction )
• Firm real-time system
• Highly concurrent ( > 30 billion transactions per day )
• Global, 24/7 operation
“HUMANS

ARE BAD AT PREDICTING THE

PERFORMANCE OF COMPLEX SYSTEMS(…).

OUR

ABILITY TO CREATE LARGE AND COMPLEX

SYSTEMS FOOLS US INTO BELIEVING THAT
WE’RE ALSO ENTITLED TO UNDERSTAND
THEM.”
–C ARLOS BUENO
“ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
AHEAD

OF TIME

V E R I F I C AT I O N I S N O T
SUFFICIENT.
!

(DON’T SCRIMP ON IT, THOUGH.)
IGNORANCE AND
COMPLEX INTERACTIONS
WITH EXTERNAL SYSTEMS
ARE WHY WE CAN’T
H AV E N I C E T H I N G S .
AT

SCALE, EVEN RARE

EVENTS HAPPEN
FREQUENTLY.
AT

SCALE, BAD

THINGS HAPPEN
FA S T E R T H A N
HUMANS CAN
RESPOND.
THE RESULTS ARE
RARELY PRETTY.
THE CAUSES ARE RARELY
QUICK TO DISCOVER.
W H AT
CAN BE
DONE?
WE

H AV E T O O B S E R V E

OUR SYSTEMS WHILE
THEY RUN.
WE CAN BE
SCIENTIFIC
ABOUT THIS
NOT ALL SYSTEMS ARE
MONITORABLE, JUST AS NOT
A L L A R E T E S TA B L E .
!

I T ’ S A M AT T E R O F D E S I G N A N D
OF CULTURE.
A DESIGN

T H AT I S M O N I T O R A B L E

TA K E S S T O C K O F I T S I N T E R N A L
TOLERANCES, EXTERNAL
I N T E R FA C E S A N D E X P O S E S T H E S E
FOR AN

O P E R AT O R .
AN ENGINEER

IS RESPONSIBLE FOR

DESIGNING AND BUILDING
SYSTEMS.
!

A N O P E R AT O R

IS RESPONSIBLE FOR

RUNNING THEM.
A H E A D - O F - T I M E V E R I F I C AT I O N
—TESTING, TYPE-CHECKING—
GIVES THE ENGINEER
INTUITION ABOUT THE SYSTEM.
IF

YOU DON’T KNOW
HOW THE SYSTEM

S H O U L D B E H AV E Y O U
C A N ’ T S AY H O W I T
SHOULDN’T OR ISN’T.
MONITORING GIVES THE
O P E R AT O R I N F O R M AT I O N
A B O U T T H E B E H AV I O R O F
THE RUNNING SYSTEM.
T H E O P E R AT O R S A N D E N G I N E E R S
ARE OFTEN THE SAME PEOPLE.
THERE’S

A POTENTIAL FOR A POSITIVE

FEEDBACK LOOP OF QUALITY HERE.
“WHILE THE SKILL OF REENTRY WAS EASILY
HANDLED BY AUTOMATED SYSTEMS, THE
PILOT’S PRIMARY FUNCTION EVOLVED TO BE
A REDUNDANT SYSTEM (…) COORDINATING
A VARIETY OF CONTROLS AS MUCH AS
DIRECTLY CONTROLLING THE VEHICLE.”
– DAV I D A . M I N D E L L
D I G I TA L A P O L LO : H U M A N A N D M AC H I N E I N S PAC E F L I G H T
exometer
exometer — github.com/Feuerlabs/exometer
• Created by Feuerlabs.
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting decoupled.
• Static and dynamic configuration.
• Very low, predictable runtime overhead.
I M P O R TA N T T E R M S
• METRIC: a measurement
• ENTRY: a receiver and aggregator of metrics
• REPORTER: an entity which samples entries on a regular

interval and optionally ships these samples onto a third-system
• SUBSCRIPTION: the definition of the regular interval on which

reporters sample entries
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_queue],
histogram},
{[erlang, gc, tot_coll], histogram},
{[erlang, gc, rec_wrd], histogram},

!

{[erlang,memory,ets], gauge},
{[erlang,memory,binary],gauge},
!

{[boodah,freq_cap,not_found],gauge},
{[boodah,freq_cap,ok],gauge},
{[boodah,freq_cap,timeout],gauge}
]},
!
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_queue],
histogram},
{[erlang, gc, tot_coll], histogram},
{[erlang, gc, rec_wrd], histogram},

!

{[erlang,memory,ets], gauge},
{[erlang,memory,binary],gauge},
!

{[boodah,freq_cap,not_found],gauge},
{[boodah,freq_cap,ok],gauge},
{[boodah,freq_cap,timeout],gauge}
]},
!
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_queue],
histogram},
{[erlang, gc, tot_coll], histogram},
{[erlang, gc, rec_wrd], histogram},

!

{[erlang,memory,ets], gauge},
{[erlang,memory,binary],gauge},
!

{[boodah,freq_cap,not_found],gauge},
{[boodah,freq_cap,ok],gauge},
{[boodah,freq_cap,timeout],gauge}
]},
!
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
!

{exometer_report_statsd,
[boodah, freq_cap, not_found], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, ok], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, timeout], one, 1000}

{exometer_report_statsd, [erlang, gc],
tot_coll, 1000},
{exometer_report_statsd, [erlang, gc],
rec_wrd, 1000},
]}

!

{exometer_report_statsd, [erlang, memory],
ets, 10000},
{exometer_report_statsd, [erlang, memory],
binary, 10000},
!
!

]}
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
!

{exometer_report_statsd,
[boodah, freq_cap, not_found], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, ok], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, timeout], one, 1000}

{exometer_report_statsd, [erlang, gc],
tot_coll, 1000},
{exometer_report_statsd, [erlang, gc],
rec_wrd, 1000},
]}

!

{exometer_report_statsd, [erlang, memory],
ets, 10000},
{exometer_report_statsd, [erlang, memory],
binary, 10000},
!
!

]}
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
!

{exometer_report_statsd,
[boodah, freq_cap, not_found], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, ok], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, timeout], one, 1000}

{exometer_report_statsd, [erlang, gc],
tot_coll, 1000},
{exometer_report_statsd, [erlang, gc],
rec_wrd, 1000},
]}

!

{exometer_report_statsd, [erlang, memory],
ets, 10000},
{exometer_report_statsd, [erlang, memory],
binary, 10000},
!
!

]}
D o i n g i t d y n a m i c a l l y.
1> exometer:new([a, histogram], histogram).
ok
!

2> exometer:get_value([a, histogram]).
{ok,[{n,0},
{mean,0},
{min,0},
{max,0},
{median,0},
{50,0},
{75,0},
{90,0},
{95,0},
{99,0},
{999,0}]}
!

3> exometer_report:add_reporter(
exometer_report_tty, []).
ok
!

4> exometer_report:subscribe(
exometer_report_tty,
[a, histogram], mean, 1000, []).
ok
!

exometer_report_tty: a_histogram_mean
1393627070:0
exometer_report_tty: a_histogram_mean
1393627071:0
exometer_report_tty: a_histogram_mean
1393627072:0
W H AT A R E W E L O O K I N G F O R ?
• VM killers
• System performance regressions
• Abnormal system behavior
• Surprises
“ A B N O R M A L S Y S T E M B E H AV I O R ” ?
CASE STUDIES
CASE STUDY:
ADROLL RTB TIMEOUT SPIKES
Case Study: AdRoll RTB Timeout Spikes

PERIODIC BID TIMEOUTS
Case Study: AdRoll RTB Timeout Spikes

CONSISTENT SYSTEM LOAD
Case Study: AdRoll RTB Timeout Spikes

C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
Case Study: AdRoll RTB Timeout Spikes

C O R R E L AT E D R U N Q U E U E S P I K E S
Case Study: AdRoll RTB Timeout Spikes

W H AT

HAPPENED?
Case Study: AdRoll RTB Timeout Spikes
• Scheduler threads were locked to CPUs
• Background process comes on every 20 minutes,

consumes a lot of CPU time
• No cpu-shield was set up on our production systems
• OS bumped a scheduler thread off its CPU, backing up its

run-queue
CASE STUDY:
EXCHANGE THROTTLING
Case Study: Exchange Throttling

H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
Case Study: Exchange Throttling

THE TROUGH OF THROTTLING
Case Study: Exchange Throttling

BAD

GOOD
Case Study: Exchange Throttling

PROBLEM CONFIRMED WITH EXCHANGE
Case Study: Exchange Throttling

• All other metrics (run-queue, CPU, network IO) were fine.
• Confirmed that no changes had been made to the

running systems via deployment.
• Amazon data showed no network issues to our machines.
Case Study: Exchange Throttling

W H AT

HAPPENED?
Case Study: Exchange Throttling

WE

HIT AN IMPLICIT EXCHANGE LIMIT.
(ARGUABLY,

A BUG.)
LESSONS LEARNED
IT

I S P O S S I B L E T O H AV E T O O L I T T L E

I N F O R M AT I O N .
“(THE

FIREFIGHTERS) TRIED TO BEAT

DOWN THE FLAMES (OF

REACTOR 4). THEY

CHERNOBYL

KICKED AT THE

BURNING GRAPHITE WITH THEIR FEET.

… THE DOCTORS KEPT TELLING THEM
THEY’D BEEN POISONED BY GAS.”
-

- SVETLANA ALEXIEVICH

-

VO I C E S F RO M C H E R N O BY L : T H E O R A L H I S TO RY O F A N U C L E A R D I SA S T E R
IT

IS POSSIBLE TO COLLECT TOO

M U C H I N F O R M AT I O N , O R P R E S E N T
IT BADLY.
“SAFETY

SYSTEMS, SUCH AS WARNING LIGHTS,

ARE NECESSARY, BUT THEY HAVE THE POTENTIAL
FOR DECEPTION.
COMPLEX

(…) ONE OF THE LESSONS OF
SYSTEMS AND (THREE MILE ISLAND) IS

THAT ANY PART OF THE SYSTEM MIGHT BE
INTERACTING WITH OTHER PARTS IN
UNANTICIPATED WAYS.”
-

C HARLES PERROW

-

N ORMAL ACCIDENTS: LIVIN G WITH HIGH-RISK TEC HN OLOGIES
I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E
W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T
W H AT ’ S P L A I N L Y B E F O R E Y O U R E Y E S .
“THE

FIRST DISASTER IN SPACE HAD OCCURRED, AND

NO ONE KNEW WHAT HAD HAPPENED.

ON

THE

GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN
SURE THAT ANYTHING HAD.

ONE

REASON FOR THEIR

IGNORANCE WAS THE IMPERFECT NATURE OF
TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT
TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD
BLOWN UP.”
-

- H E N R Y S . F. C O O P E R , J R .

-

X I I I : T H E A P O L LO F L I G H T T H AT FA I L E D
Things that don’t quite
work like you’d hope.
Problems

• Instrumenting code increases code size.
• While very low, runtime impact is not zero.
• Instrumentation of dependencies is up to the

library author.
Solutions

•Dtrace / systemtap hookups
•Culture of instrumentation by default.
•Tracing BIFs
W O M B AT ?
Instrumentation by default strategies

• Application configuration callback module
• Behaviors callbacks
• Value-at-time functions
W H AT
ABOUT
CULTURE
LOOK FOR
SOLUTIONS,
NOT
S C A P E G O AT S .
DON’T

BE FOOLED BY

OVERCONFIDENCE IN YOUR
WORK.
DOCUMENT
AND DISCUSS
YOUR
FA I L U R E S .
EVERYONE
MUST WORK
T O WA R D T H E
SAME END.
KNOW SOME BASIC
M AT H E M AT I C S .
PREFER LOOSE
COUPLING WHEN
POSSIBLE.
!

BE EXPLICIT ABOUT
TIGHT COUPLING.
MEASURE

MUCH,

P R E S E N T O N LY T H AT
WHICH YOU’RE
C E R TA I N T O N E E D .
MEASURE,
DON’T GUESS.
BE
A C C U R AT E .
QUESTIONS?
<3
@bltroutwine
BRIAN@TROUTWINE.US
Bonus Bibliography!
XIII: The Apollo Flight that Failed

Henry S. F. Cooper, Jr.
David A. Mindell

Digital Apollo: Human and Machine in Spaceflight

Charles Perrow

Normal Accidents: Living with High-Risk Technologies
Voices from Chernobyl: The Oral History of a Nuclear Disaster

Svetlana Alexievich
Real-Time Systems: Design Principles for Distributed Embedded Applications
Hermann Kopetz
Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety
Eric Schlosser

More Related Content

Similar to 10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Michał Kurzeja
 
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)Grace Yang
 
Four Architectural Patterns
Four Architectural Patterns Four Architectural Patterns
Four Architectural Patterns David Simons
 
Ember js meetup treviso liquid-fire
Ember js meetup treviso liquid-fireEmber js meetup treviso liquid-fire
Ember js meetup treviso liquid-fireWilliam Bergamo
 
4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz Kowalczewski4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz KowalczewskiPROIDEA
 
Architecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko GargentaArchitecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko Gargentajaxconf
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure Eman magdy
 
Beginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleBeginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleAri Lerner
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]New Relic
 
Faster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypesFaster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypesOSCON Byrum
 
PostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestPostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestAndrea Adami
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at ScaleDavid Simons
 
Types and Immutability: why you should care
Types and Immutability: why you should careTypes and Immutability: why you should care
Types and Immutability: why you should careJean Carlo Emer
 
Spark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxSpark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxPablo Francisco Pérez Hidalgo
 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...Edge AI and Vision Alliance
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemapsbcantrill
 
ELK Presentation Final V1
ELK Presentation Final V1ELK Presentation Final V1
ELK Presentation Final V1Jon Hammant
 

Similar to 10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll (20)

Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019
 
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
 
Four Architectural Patterns
Four Architectural Patterns Four Architectural Patterns
Four Architectural Patterns
 
Ember js meetup treviso liquid-fire
Ember js meetup treviso liquid-fireEmber js meetup treviso liquid-fire
Ember js meetup treviso liquid-fire
 
Everybody Lies
Everybody LiesEverybody Lies
Everybody Lies
 
4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz Kowalczewski4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz Kowalczewski
 
Measure to fail
Measure to failMeasure to fail
Measure to fail
 
BoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
BoSUSA18 | Bob Moesta| The 5 Skills Of An InnovatorBoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
BoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
 
Architecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko GargentaArchitecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko Gargenta
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
 
Beginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleBeginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at Google
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
 
Faster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypesFaster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypes
 
PostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestPostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit Test
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
 
Types and Immutability: why you should care
Types and Immutability: why you should careTypes and Immutability: why you should care
Types and Immutability: why you should care
 
Spark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxSpark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptx
 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
 
ELK Presentation Final V1
ELK Presentation Final V1ELK Presentation Final V1
ELK Presentation Final V1
 

More from Brian Troutwine

(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in SoftwareBrian Troutwine
 
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small StepGetting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small StepBrian Troutwine
 
The Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance ComputerThe Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance ComputerBrian Troutwine
 
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over Brian Troutwine
 
Let it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable ServicesLet it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable ServicesBrian Troutwine
 
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...Brian Troutwine
 
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...Brian Troutwine
 
Monitoring with exometer at AdRoll
Monitoring with exometer at AdRollMonitoring with exometer at AdRoll
Monitoring with exometer at AdRollBrian Troutwine
 

More from Brian Troutwine (8)

(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
 
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small StepGetting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
 
The Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance ComputerThe Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance Computer
 
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
 
Let it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable ServicesLet it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable Services
 
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
 
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
 
Monitoring with exometer at AdRoll
Monitoring with exometer at AdRollMonitoring with exometer at AdRoll
Monitoring with exometer at AdRoll
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

  • 1. T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R MONITORING REAL-TIME B I D D I N G AT A D R O L L
  • 2. I DO THINGS WITH/TO COMPUTERS.
  • 3. I CARE ABOUT RELIABLE, COMPLEX AND CRITICAL SYSTEMS.
  • 8. R E TA R G E T I N G , IT’S A L L A B O U T D ATA .
  • 9. • Our customers want to show ads to people who might care to see them. • We partner with “exchanges” to enter ad-slot auctions. • These auctions are executed and finalized while consumer’s webpages load.
  • 11. the nature of the problem domain: • Low latency ( < 100ms per transaction ) • Firm real-time system • Highly concurrent ( > 30 billion transactions per day ) • Global, 24/7 operation
  • 12. “HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS(…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.” –C ARLOS BUENO “ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
  • 13. AHEAD OF TIME V E R I F I C AT I O N I S N O T SUFFICIENT. ! (DON’T SCRIMP ON IT, THOUGH.)
  • 14. IGNORANCE AND COMPLEX INTERACTIONS WITH EXTERNAL SYSTEMS ARE WHY WE CAN’T H AV E N I C E T H I N G S .
  • 15. AT SCALE, EVEN RARE EVENTS HAPPEN FREQUENTLY.
  • 16. AT SCALE, BAD THINGS HAPPEN FA S T E R T H A N HUMANS CAN RESPOND.
  • 18. THE CAUSES ARE RARELY QUICK TO DISCOVER.
  • 19. W H AT CAN BE DONE?
  • 20. WE H AV E T O O B S E R V E OUR SYSTEMS WHILE THEY RUN.
  • 22. NOT ALL SYSTEMS ARE MONITORABLE, JUST AS NOT A L L A R E T E S TA B L E . ! I T ’ S A M AT T E R O F D E S I G N A N D OF CULTURE.
  • 23. A DESIGN T H AT I S M O N I T O R A B L E TA K E S S T O C K O F I T S I N T E R N A L TOLERANCES, EXTERNAL I N T E R FA C E S A N D E X P O S E S T H E S E FOR AN O P E R AT O R .
  • 24. AN ENGINEER IS RESPONSIBLE FOR DESIGNING AND BUILDING SYSTEMS. ! A N O P E R AT O R IS RESPONSIBLE FOR RUNNING THEM.
  • 25. A H E A D - O F - T I M E V E R I F I C AT I O N —TESTING, TYPE-CHECKING— GIVES THE ENGINEER INTUITION ABOUT THE SYSTEM.
  • 26. IF YOU DON’T KNOW HOW THE SYSTEM S H O U L D B E H AV E Y O U C A N ’ T S AY H O W I T SHOULDN’T OR ISN’T.
  • 27. MONITORING GIVES THE O P E R AT O R I N F O R M AT I O N A B O U T T H E B E H AV I O R O F THE RUNNING SYSTEM.
  • 28. T H E O P E R AT O R S A N D E N G I N E E R S ARE OFTEN THE SAME PEOPLE.
  • 29. THERE’S A POTENTIAL FOR A POSITIVE FEEDBACK LOOP OF QUALITY HERE.
  • 30. “WHILE THE SKILL OF REENTRY WAS EASILY HANDLED BY AUTOMATED SYSTEMS, THE PILOT’S PRIMARY FUNCTION EVOLVED TO BE A REDUNDANT SYSTEM (…) COORDINATING A VARIETY OF CONTROLS AS MUCH AS DIRECTLY CONTROLLING THE VEHICLE.” – DAV I D A . M I N D E L L D I G I TA L A P O L LO : H U M A N A N D M AC H I N E I N S PAC E F L I G H T
  • 32. exometer — github.com/Feuerlabs/exometer • Created by Feuerlabs. • Responsive upstream (Ulf Wiger never sleeps?) • Metric collection, aggregation and reporting decoupled. • Static and dynamic configuration. • Very low, predictable runtime overhead.
  • 33. I M P O R TA N T T E R M S • METRIC: a measurement • ENTRY: a receiver and aggregator of metrics • REPORTER: an entity which samples entries on a regular interval and optionally ships these samples onto a third-system • SUBSCRIPTION: the definition of the regular interval on which reporters sample entries
  • 34. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 35. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 36. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 37. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 38. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 39. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • 40. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • 41. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • 42. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • 43. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • 44. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • 45. D o i n g i t d y n a m i c a l l y. 1> exometer:new([a, histogram], histogram). ok ! 2> exometer:get_value([a, histogram]). {ok,[{n,0}, {mean,0}, {min,0}, {max,0}, {median,0}, {50,0}, {75,0}, {90,0}, {95,0}, {99,0}, {999,0}]} ! 3> exometer_report:add_reporter( exometer_report_tty, []). ok ! 4> exometer_report:subscribe( exometer_report_tty, [a, histogram], mean, 1000, []). ok ! exometer_report_tty: a_histogram_mean 1393627070:0 exometer_report_tty: a_histogram_mean 1393627071:0 exometer_report_tty: a_histogram_mean 1393627072:0
  • 46. W H AT A R E W E L O O K I N G F O R ?
  • 47. • VM killers • System performance regressions • Abnormal system behavior • Surprises
  • 48. “ A B N O R M A L S Y S T E M B E H AV I O R ” ?
  • 50. CASE STUDY: ADROLL RTB TIMEOUT SPIKES
  • 51. Case Study: AdRoll RTB Timeout Spikes PERIODIC BID TIMEOUTS
  • 52. Case Study: AdRoll RTB Timeout Spikes CONSISTENT SYSTEM LOAD
  • 53. Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
  • 54. Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D R U N Q U E U E S P I K E S
  • 55. Case Study: AdRoll RTB Timeout Spikes W H AT HAPPENED?
  • 56. Case Study: AdRoll RTB Timeout Spikes • Scheduler threads were locked to CPUs • Background process comes on every 20 minutes, consumes a lot of CPU time • No cpu-shield was set up on our production systems • OS bumped a scheduler thread off its CPU, backing up its run-queue
  • 58. Case Study: Exchange Throttling H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
  • 59. Case Study: Exchange Throttling THE TROUGH OF THROTTLING
  • 60. Case Study: Exchange Throttling BAD GOOD
  • 61. Case Study: Exchange Throttling PROBLEM CONFIRMED WITH EXCHANGE
  • 62. Case Study: Exchange Throttling • All other metrics (run-queue, CPU, network IO) were fine. • Confirmed that no changes had been made to the running systems via deployment. • Amazon data showed no network issues to our machines.
  • 63. Case Study: Exchange Throttling W H AT HAPPENED?
  • 64. Case Study: Exchange Throttling WE HIT AN IMPLICIT EXCHANGE LIMIT. (ARGUABLY, A BUG.)
  • 66. IT I S P O S S I B L E T O H AV E T O O L I T T L E I N F O R M AT I O N .
  • 67. “(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF REACTOR 4). THEY CHERNOBYL KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.” - - SVETLANA ALEXIEVICH - VO I C E S F RO M C H E R N O BY L : T H E O R A L H I S TO RY O F A N U C L E A R D I SA S T E R
  • 68. IT IS POSSIBLE TO COLLECT TOO M U C H I N F O R M AT I O N , O R P R E S E N T IT BADLY.
  • 69. “SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. COMPLEX (…) ONE OF THE LESSONS OF SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.” - C HARLES PERROW - N ORMAL ACCIDENTS: LIVIN G WITH HIGH-RISK TEC HN OLOGIES
  • 70. I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T W H AT ’ S P L A I N L Y B E F O R E Y O U R E Y E S .
  • 71. “THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD. ONE REASON FOR THEIR IGNORANCE WAS THE IMPERFECT NATURE OF TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD BLOWN UP.” - - H E N R Y S . F. C O O P E R , J R . - X I I I : T H E A P O L LO F L I G H T T H AT FA I L E D
  • 72. Things that don’t quite work like you’d hope.
  • 73. Problems • Instrumenting code increases code size. • While very low, runtime impact is not zero. • Instrumentation of dependencies is up to the library author.
  • 74. Solutions •Dtrace / systemtap hookups •Culture of instrumentation by default. •Tracing BIFs
  • 75. W O M B AT ?
  • 76. Instrumentation by default strategies • Application configuration callback module • Behaviors callbacks • Value-at-time functions
  • 78. LOOK FOR SOLUTIONS, NOT S C A P E G O AT S .
  • 81. EVERYONE MUST WORK T O WA R D T H E SAME END.
  • 82. KNOW SOME BASIC M AT H E M AT I C S .
  • 83. PREFER LOOSE COUPLING WHEN POSSIBLE. ! BE EXPLICIT ABOUT TIGHT COUPLING.
  • 84. MEASURE MUCH, P R E S E N T O N LY T H AT WHICH YOU’RE C E R TA I N T O N E E D .
  • 86. BE A C C U R AT E .
  • 89. Bonus Bibliography! XIII: The Apollo Flight that Failed Henry S. F. Cooper, Jr. David A. Mindell Digital Apollo: Human and Machine in Spaceflight Charles Perrow Normal Accidents: Living with High-Risk Technologies Voices from Chernobyl: The Oral History of a Nuclear Disaster Svetlana Alexievich Real-Time Systems: Design Principles for Distributed Embedded Applications Hermann Kopetz Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety Eric Schlosser