SlideShare a Scribd company logo
1 of 89
Download to read offline
T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R

MONITORING REAL-TIME
B I D D I N G AT A D R O L L
I DO THINGS WITH/TO COMPUTERS.
I

CARE ABOUT RELIABLE, COMPLEX

AND CRITICAL SYSTEMS.
ADROLL
LESS THIS
MORE THIS
WE’RE AN
ADTECH
C O M PA N Y .
R E TA R G E T I N G ,

IT’S

A L L A B O U T D ATA .
• Our customers want to show ads to people who might

care to see them.
• We partner with “exchanges” to enter ad-slot auctions.
• These auctions are executed and finalized while

consumer’s webpages load.
REAL-TIME
BIDDING
the nature of the problem domain:
• Low latency ( < 100ms per transaction )
• Firm real-time system
• Highly concurrent ( > 30 billion transactions per day )
• Global, 24/7 operation
“HUMANS

ARE BAD AT PREDICTING THE

PERFORMANCE OF COMPLEX SYSTEMS(…).

OUR

ABILITY TO CREATE LARGE AND COMPLEX

SYSTEMS FOOLS US INTO BELIEVING THAT
WE’RE ALSO ENTITLED TO UNDERSTAND
THEM.”
–C ARLOS BUENO
“ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
AHEAD

OF TIME

V E R I F I C AT I O N I S N O T
SUFFICIENT.
!

(DON’T SCRIMP ON IT, THOUGH.)
IGNORANCE AND
COMPLEX INTERACTIONS
WITH EXTERNAL SYSTEMS
ARE WHY WE CAN’T
H AV E N I C E T H I N G S .
AT

SCALE, EVEN RARE

EVENTS HAPPEN
FREQUENTLY.
AT

SCALE, BAD

THINGS HAPPEN
FA S T E R T H A N
HUMANS CAN
RESPOND.
THE RESULTS ARE
RARELY PRETTY.
THE CAUSES ARE RARELY
QUICK TO DISCOVER.
W H AT
CAN BE
DONE?
WE

H AV E T O O B S E R V E

OUR SYSTEMS WHILE
THEY RUN.
WE CAN BE
SCIENTIFIC
ABOUT THIS
NOT ALL SYSTEMS ARE
MONITORABLE, JUST AS NOT
A L L A R E T E S TA B L E .
!

I T ’ S A M AT T E R O F D E S I G N A N D
OF CULTURE.
A DESIGN

T H AT I S M O N I T O R A B L E

TA K E S S T O C K O F I T S I N T E R N A L
TOLERANCES, EXTERNAL
I N T E R FA C E S A N D E X P O S E S T H E S E
FOR AN

O P E R AT O R .
AN ENGINEER

IS RESPONSIBLE FOR

DESIGNING AND BUILDING
SYSTEMS.
!

A N O P E R AT O R

IS RESPONSIBLE FOR

RUNNING THEM.
A H E A D - O F - T I M E V E R I F I C AT I O N
—TESTING, TYPE-CHECKING—
GIVES THE ENGINEER
INTUITION ABOUT THE SYSTEM.
IF

YOU DON’T KNOW
HOW THE SYSTEM

S H O U L D B E H AV E Y O U
C A N ’ T S AY H O W I T
SHOULDN’T OR ISN’T.
MONITORING GIVES THE
O P E R AT O R I N F O R M AT I O N
A B O U T T H E B E H AV I O R O F
THE RUNNING SYSTEM.
T H E O P E R AT O R S A N D E N G I N E E R S
ARE OFTEN THE SAME PEOPLE.
THERE’S

A POTENTIAL FOR A POSITIVE

FEEDBACK LOOP OF QUALITY HERE.
“WHILE THE SKILL OF REENTRY WAS EASILY
HANDLED BY AUTOMATED SYSTEMS, THE
PILOT’S PRIMARY FUNCTION EVOLVED TO BE
A REDUNDANT SYSTEM (…) COORDINATING
A VARIETY OF CONTROLS AS MUCH AS
DIRECTLY CONTROLLING THE VEHICLE.”
– DAV I D A . M I N D E L L
D I G I TA L A P O L LO : H U M A N A N D M AC H I N E I N S PAC E F L I G H T
exometer
exometer — github.com/Feuerlabs/exometer
• Created by Feuerlabs.
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting decoupled.
• Static and dynamic configuration.
• Very low, predictable runtime overhead.
I M P O R TA N T T E R M S
• METRIC: a measurement
• ENTRY: a receiver and aggregator of metrics
• REPORTER: an entity which samples entries on a regular

interval and optionally ships these samples onto a third-system
• SUBSCRIPTION: the definition of the regular interval on which

reporters sample entries
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang, gc],
{function, erlang, statistics,
[garbage_collection], match,
{total_coll, rec_wrd, '_'}},
[]
},
!

{[erlang, statistics],
{function, erlang, statistics,
['$dp'], value, [run_queue]},
[]
},

{[boodah, freq_cap, not_found], spiral},
{[boodah, freq_cap, ok], spiral},
{[boodah, freq_cap, timeout], spiral}
]},
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_queue],
histogram},
{[erlang, gc, tot_coll], histogram},
{[erlang, gc, rec_wrd], histogram},

!

{[erlang,memory,ets], gauge},
{[erlang,memory,binary],gauge},
!

{[boodah,freq_cap,not_found],gauge},
{[boodah,freq_cap,ok],gauge},
{[boodah,freq_cap,timeout],gauge}
]},
!
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_queue],
histogram},
{[erlang, gc, tot_coll], histogram},
{[erlang, gc, rec_wrd], histogram},

!

{[erlang,memory,ets], gauge},
{[erlang,memory,binary],gauge},
!

{[boodah,freq_cap,not_found],gauge},
{[boodah,freq_cap,ok],gauge},
{[boodah,freq_cap,timeout],gauge}
]},
!
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_queue],
histogram},
{[erlang, gc, tot_coll], histogram},
{[erlang, gc, rec_wrd], histogram},

!

{[erlang,memory,ets], gauge},
{[erlang,memory,binary],gauge},
!

{[boodah,freq_cap,not_found],gauge},
{[boodah,freq_cap,ok],gauge},
{[boodah,freq_cap,timeout],gauge}
]},
!
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
!

{exometer_report_statsd,
[boodah, freq_cap, not_found], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, ok], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, timeout], one, 1000}

{exometer_report_statsd, [erlang, gc],
tot_coll, 1000},
{exometer_report_statsd, [erlang, gc],
rec_wrd, 1000},
]}

!

{exometer_report_statsd, [erlang, memory],
ets, 10000},
{exometer_report_statsd, [erlang, memory],
binary, 10000},
!
!

]}
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
!

{exometer_report_statsd,
[boodah, freq_cap, not_found], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, ok], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, timeout], one, 1000}

{exometer_report_statsd, [erlang, gc],
tot_coll, 1000},
{exometer_report_statsd, [erlang, gc],
rec_wrd, 1000},
]}

!

{exometer_report_statsd, [erlang, memory],
ets, 10000},
{exometer_report_statsd, [erlang, memory],
binary, 10000},
!
!

]}
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
!

{exometer_report_statsd,
[boodah, freq_cap, not_found], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, ok], one, 1000},
{exometer_report_statsd,
[boodah, freq_cap, timeout], one, 1000}

{exometer_report_statsd, [erlang, gc],
tot_coll, 1000},
{exometer_report_statsd, [erlang, gc],
rec_wrd, 1000},
]}

!

{exometer_report_statsd, [erlang, memory],
ets, 10000},
{exometer_report_statsd, [erlang, memory],
binary, 10000},
!
!

]}
D o i n g i t d y n a m i c a l l y.
1> exometer:new([a, histogram], histogram).
ok
!

2> exometer:get_value([a, histogram]).
{ok,[{n,0},
{mean,0},
{min,0},
{max,0},
{median,0},
{50,0},
{75,0},
{90,0},
{95,0},
{99,0},
{999,0}]}
!

3> exometer_report:add_reporter(
exometer_report_tty, []).
ok
!

4> exometer_report:subscribe(
exometer_report_tty,
[a, histogram], mean, 1000, []).
ok
!

exometer_report_tty: a_histogram_mean
1393627070:0
exometer_report_tty: a_histogram_mean
1393627071:0
exometer_report_tty: a_histogram_mean
1393627072:0
W H AT A R E W E L O O K I N G F O R ?
• VM killers
• System performance regressions
• Abnormal system behavior
• Surprises
“ A B N O R M A L S Y S T E M B E H AV I O R ” ?
CASE STUDIES
CASE STUDY:
ADROLL RTB TIMEOUT SPIKES
Case Study: AdRoll RTB Timeout Spikes

PERIODIC BID TIMEOUTS
Case Study: AdRoll RTB Timeout Spikes

CONSISTENT SYSTEM LOAD
Case Study: AdRoll RTB Timeout Spikes

C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
Case Study: AdRoll RTB Timeout Spikes

C O R R E L AT E D R U N Q U E U E S P I K E S
Case Study: AdRoll RTB Timeout Spikes

W H AT

HAPPENED?
Case Study: AdRoll RTB Timeout Spikes
• Scheduler threads were locked to CPUs
• Background process comes on every 20 minutes,

consumes a lot of CPU time
• No cpu-shield was set up on our production systems
• OS bumped a scheduler thread off its CPU, backing up its

run-queue
CASE STUDY:
EXCHANGE THROTTLING
Case Study: Exchange Throttling

H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
Case Study: Exchange Throttling

THE TROUGH OF THROTTLING
Case Study: Exchange Throttling

BAD

GOOD
Case Study: Exchange Throttling

PROBLEM CONFIRMED WITH EXCHANGE
Case Study: Exchange Throttling

• All other metrics (run-queue, CPU, network IO) were fine.
• Confirmed that no changes had been made to the

running systems via deployment.
• Amazon data showed no network issues to our machines.
Case Study: Exchange Throttling

W H AT

HAPPENED?
Case Study: Exchange Throttling

WE

HIT AN IMPLICIT EXCHANGE LIMIT.
(ARGUABLY,

A BUG.)
LESSONS LEARNED
IT

I S P O S S I B L E T O H AV E T O O L I T T L E

I N F O R M AT I O N .
“(THE

FIREFIGHTERS) TRIED TO BEAT

DOWN THE FLAMES (OF

REACTOR 4). THEY

CHERNOBYL

KICKED AT THE

BURNING GRAPHITE WITH THEIR FEET.

… THE DOCTORS KEPT TELLING THEM
THEY’D BEEN POISONED BY GAS.”
-

- SVETLANA ALEXIEVICH

-

VO I C E S F RO M C H E R N O BY L : T H E O R A L H I S TO RY O F A N U C L E A R D I SA S T E R
IT

IS POSSIBLE TO COLLECT TOO

M U C H I N F O R M AT I O N , O R P R E S E N T
IT BADLY.
“SAFETY

SYSTEMS, SUCH AS WARNING LIGHTS,

ARE NECESSARY, BUT THEY HAVE THE POTENTIAL
FOR DECEPTION.
COMPLEX

(…) ONE OF THE LESSONS OF
SYSTEMS AND (THREE MILE ISLAND) IS

THAT ANY PART OF THE SYSTEM MIGHT BE
INTERACTING WITH OTHER PARTS IN
UNANTICIPATED WAYS.”
-

C HARLES PERROW

-

N ORMAL ACCIDENTS: LIVIN G WITH HIGH-RISK TEC HN OLOGIES
I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E
W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T
W H AT ’ S P L A I N L Y B E F O R E Y O U R E Y E S .
“THE

FIRST DISASTER IN SPACE HAD OCCURRED, AND

NO ONE KNEW WHAT HAD HAPPENED.

ON

THE

GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN
SURE THAT ANYTHING HAD.

ONE

REASON FOR THEIR

IGNORANCE WAS THE IMPERFECT NATURE OF
TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT
TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD
BLOWN UP.”
-

- H E N R Y S . F. C O O P E R , J R .

-

X I I I : T H E A P O L LO F L I G H T T H AT FA I L E D
Things that don’t quite
work like you’d hope.
Problems

• Instrumenting code increases code size.
• While very low, runtime impact is not zero.
• Instrumentation of dependencies is up to the

library author.
Solutions

•Dtrace / systemtap hookups
•Culture of instrumentation by default.
•Tracing BIFs
W O M B AT ?
Instrumentation by default strategies

• Application configuration callback module
• Behaviors callbacks
• Value-at-time functions
W H AT
ABOUT
CULTURE
LOOK FOR
SOLUTIONS,
NOT
S C A P E G O AT S .
DON’T

BE FOOLED BY

OVERCONFIDENCE IN YOUR
WORK.
DOCUMENT
AND DISCUSS
YOUR
FA I L U R E S .
EVERYONE
MUST WORK
T O WA R D T H E
SAME END.
KNOW SOME BASIC
M AT H E M AT I C S .
PREFER LOOSE
COUPLING WHEN
POSSIBLE.
!

BE EXPLICIT ABOUT
TIGHT COUPLING.
MEASURE

MUCH,

P R E S E N T O N LY T H AT
WHICH YOU’RE
C E R TA I N T O N E E D .
MEASURE,
DON’T GUESS.
BE
A C C U R AT E .
QUESTIONS?
<3
@bltroutwine
BRIAN@TROUTWINE.US
Bonus Bibliography!
XIII: The Apollo Flight that Failed

Henry S. F. Cooper, Jr.
David A. Mindell

Digital Apollo: Human and Machine in Spaceflight

Charles Perrow

Normal Accidents: Living with High-Risk Technologies
Voices from Chernobyl: The Oral History of a Nuclear Disaster

Svetlana Alexievich
Real-Time Systems: Design Principles for Distributed Embedded Applications
Hermann Kopetz
Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety
Eric Schlosser

More Related Content

Similar to 10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Michał Kurzeja
 
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)Grace Yang
 
Four Architectural Patterns
Four Architectural Patterns Four Architectural Patterns
Four Architectural Patterns David Simons
 
Ember js meetup treviso liquid-fire
Ember js meetup treviso liquid-fireEmber js meetup treviso liquid-fire
Ember js meetup treviso liquid-fireWilliam Bergamo
 
4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz Kowalczewski4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz KowalczewskiPROIDEA
 
Architecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko GargentaArchitecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko Gargentajaxconf
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure Eman magdy
 
Beginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleBeginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleAri Lerner
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]New Relic
 
Faster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypesFaster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypesOSCON Byrum
 
PostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestPostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestAndrea Adami
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at ScaleDavid Simons
 
Types and Immutability: why you should care
Types and Immutability: why you should careTypes and Immutability: why you should care
Types and Immutability: why you should careJean Carlo Emer
 
Spark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxSpark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxPablo Francisco Pérez Hidalgo
 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...Edge AI and Vision Alliance
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemapsbcantrill
 
ELK Presentation Final V1
ELK Presentation Final V1ELK Presentation Final V1
ELK Presentation Final V1Jon Hammant
 
DOXLON November 2016 - ELK Stack and Beats
DOXLON November 2016 - ELK Stack and Beats DOXLON November 2016 - ELK Stack and Beats
DOXLON November 2016 - ELK Stack and Beats Outlyer
 

Similar to 10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll (20)

Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019Strangler Pattern in practice @PHPers Day 2019
Strangler Pattern in practice @PHPers Day 2019
 
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
Win-Win Search: Dual-Agent Stochastic Game in Session Search (SIGIR 2014)
 
Four Architectural Patterns
Four Architectural Patterns Four Architectural Patterns
Four Architectural Patterns
 
Ember js meetup treviso liquid-fire
Ember js meetup treviso liquid-fireEmber js meetup treviso liquid-fire
Ember js meetup treviso liquid-fire
 
Everybody Lies
Everybody LiesEverybody Lies
Everybody Lies
 
4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz Kowalczewski4Developers 2015: Measure to fail - Tomasz Kowalczewski
4Developers 2015: Measure to fail - Tomasz Kowalczewski
 
Measure to fail
Measure to failMeasure to fail
Measure to fail
 
Architecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko GargentaArchitecting Android Apps: Marko Gargenta
Architecting Android Apps: Marko Gargenta
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
 
Beginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleBeginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at Google
 
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
Creating Modern Metadata Systems with New Relic, Dow Jones [FutureStack16]
 
Faster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypesFaster! Faster! Accelerate your business with blazing prototypes
Faster! Faster! Accelerate your business with blazing prototypes
 
PostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestPostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit Test
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
 
Types and Immutability: why you should care
Types and Immutability: why you should careTypes and Immutability: why you should care
Types and Immutability: why you should care
 
Spark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxSpark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptx
 
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from..."PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
"PyTorch Deep Learning Framework: Status and Directions," a Presentation from...
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
 
ELK Presentation Final V1
ELK Presentation Final V1ELK Presentation Final V1
ELK Presentation Final V1
 
DOXLON November 2016 - ELK Stack and Beats
DOXLON November 2016 - ELK Stack and Beats DOXLON November 2016 - ELK Stack and Beats
DOXLON November 2016 - ELK Stack and Beats
 

More from Brian Troutwine

(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in SoftwareBrian Troutwine
 
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small StepGetting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small StepBrian Troutwine
 
The Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance ComputerThe Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance ComputerBrian Troutwine
 
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over Brian Troutwine
 
Let it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable ServicesLet it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable ServicesBrian Troutwine
 
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...Brian Troutwine
 
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...Brian Troutwine
 
Monitoring with exometer at AdRoll
Monitoring with exometer at AdRollMonitoring with exometer at AdRoll
Monitoring with exometer at AdRollBrian Troutwine
 

More from Brian Troutwine (8)

(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software
 
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small StepGetting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
 
The Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance ComputerThe Charming Genius of the Apollo Guidance Computer
The Charming Genius of the Apollo Guidance Computer
 
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over
 
Let it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable ServicesLet it crash! The Erlang Approach to Building Reliable Services
Let it crash! The Erlang Approach to Building Reliable Services
 
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ...
 
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlan...
 
Monitoring with exometer at AdRoll
Monitoring with exometer at AdRollMonitoring with exometer at AdRoll
Monitoring with exometer at AdRoll
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

  • 1. T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R MONITORING REAL-TIME B I D D I N G AT A D R O L L
  • 2. I DO THINGS WITH/TO COMPUTERS.
  • 3. I CARE ABOUT RELIABLE, COMPLEX AND CRITICAL SYSTEMS.
  • 8. R E TA R G E T I N G , IT’S A L L A B O U T D ATA .
  • 9. • Our customers want to show ads to people who might care to see them. • We partner with “exchanges” to enter ad-slot auctions. • These auctions are executed and finalized while consumer’s webpages load.
  • 11. the nature of the problem domain: • Low latency ( < 100ms per transaction ) • Firm real-time system • Highly concurrent ( > 30 billion transactions per day ) • Global, 24/7 operation
  • 12. “HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS(…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.” –C ARLOS BUENO “ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
  • 13. AHEAD OF TIME V E R I F I C AT I O N I S N O T SUFFICIENT. ! (DON’T SCRIMP ON IT, THOUGH.)
  • 14. IGNORANCE AND COMPLEX INTERACTIONS WITH EXTERNAL SYSTEMS ARE WHY WE CAN’T H AV E N I C E T H I N G S .
  • 15. AT SCALE, EVEN RARE EVENTS HAPPEN FREQUENTLY.
  • 16. AT SCALE, BAD THINGS HAPPEN FA S T E R T H A N HUMANS CAN RESPOND.
  • 18. THE CAUSES ARE RARELY QUICK TO DISCOVER.
  • 19. W H AT CAN BE DONE?
  • 20. WE H AV E T O O B S E R V E OUR SYSTEMS WHILE THEY RUN.
  • 22. NOT ALL SYSTEMS ARE MONITORABLE, JUST AS NOT A L L A R E T E S TA B L E . ! I T ’ S A M AT T E R O F D E S I G N A N D OF CULTURE.
  • 23. A DESIGN T H AT I S M O N I T O R A B L E TA K E S S T O C K O F I T S I N T E R N A L TOLERANCES, EXTERNAL I N T E R FA C E S A N D E X P O S E S T H E S E FOR AN O P E R AT O R .
  • 24. AN ENGINEER IS RESPONSIBLE FOR DESIGNING AND BUILDING SYSTEMS. ! A N O P E R AT O R IS RESPONSIBLE FOR RUNNING THEM.
  • 25. A H E A D - O F - T I M E V E R I F I C AT I O N —TESTING, TYPE-CHECKING— GIVES THE ENGINEER INTUITION ABOUT THE SYSTEM.
  • 26. IF YOU DON’T KNOW HOW THE SYSTEM S H O U L D B E H AV E Y O U C A N ’ T S AY H O W I T SHOULDN’T OR ISN’T.
  • 27. MONITORING GIVES THE O P E R AT O R I N F O R M AT I O N A B O U T T H E B E H AV I O R O F THE RUNNING SYSTEM.
  • 28. T H E O P E R AT O R S A N D E N G I N E E R S ARE OFTEN THE SAME PEOPLE.
  • 29. THERE’S A POTENTIAL FOR A POSITIVE FEEDBACK LOOP OF QUALITY HERE.
  • 30. “WHILE THE SKILL OF REENTRY WAS EASILY HANDLED BY AUTOMATED SYSTEMS, THE PILOT’S PRIMARY FUNCTION EVOLVED TO BE A REDUNDANT SYSTEM (…) COORDINATING A VARIETY OF CONTROLS AS MUCH AS DIRECTLY CONTROLLING THE VEHICLE.” – DAV I D A . M I N D E L L D I G I TA L A P O L LO : H U M A N A N D M AC H I N E I N S PAC E F L I G H T
  • 32. exometer — github.com/Feuerlabs/exometer • Created by Feuerlabs. • Responsive upstream (Ulf Wiger never sleeps?) • Metric collection, aggregation and reporting decoupled. • Static and dynamic configuration. • Very low, predictable runtime overhead.
  • 33. I M P O R TA N T T E R M S • METRIC: a measurement • ENTRY: a receiver and aggregator of metrics • REPORTER: an entity which samples entries on a regular interval and optionally ships these samples onto a third-system • SUBSCRIPTION: the definition of the regular interval on which reporters sample entries
  • 34. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 35. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 36. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 37. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 38. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  • 39. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • 40. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • 41. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  • 42. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • 43. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • 44. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  • 45. D o i n g i t d y n a m i c a l l y. 1> exometer:new([a, histogram], histogram). ok ! 2> exometer:get_value([a, histogram]). {ok,[{n,0}, {mean,0}, {min,0}, {max,0}, {median,0}, {50,0}, {75,0}, {90,0}, {95,0}, {99,0}, {999,0}]} ! 3> exometer_report:add_reporter( exometer_report_tty, []). ok ! 4> exometer_report:subscribe( exometer_report_tty, [a, histogram], mean, 1000, []). ok ! exometer_report_tty: a_histogram_mean 1393627070:0 exometer_report_tty: a_histogram_mean 1393627071:0 exometer_report_tty: a_histogram_mean 1393627072:0
  • 46. W H AT A R E W E L O O K I N G F O R ?
  • 47. • VM killers • System performance regressions • Abnormal system behavior • Surprises
  • 48. “ A B N O R M A L S Y S T E M B E H AV I O R ” ?
  • 50. CASE STUDY: ADROLL RTB TIMEOUT SPIKES
  • 51. Case Study: AdRoll RTB Timeout Spikes PERIODIC BID TIMEOUTS
  • 52. Case Study: AdRoll RTB Timeout Spikes CONSISTENT SYSTEM LOAD
  • 53. Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
  • 54. Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D R U N Q U E U E S P I K E S
  • 55. Case Study: AdRoll RTB Timeout Spikes W H AT HAPPENED?
  • 56. Case Study: AdRoll RTB Timeout Spikes • Scheduler threads were locked to CPUs • Background process comes on every 20 minutes, consumes a lot of CPU time • No cpu-shield was set up on our production systems • OS bumped a scheduler thread off its CPU, backing up its run-queue
  • 58. Case Study: Exchange Throttling H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
  • 59. Case Study: Exchange Throttling THE TROUGH OF THROTTLING
  • 60. Case Study: Exchange Throttling BAD GOOD
  • 61. Case Study: Exchange Throttling PROBLEM CONFIRMED WITH EXCHANGE
  • 62. Case Study: Exchange Throttling • All other metrics (run-queue, CPU, network IO) were fine. • Confirmed that no changes had been made to the running systems via deployment. • Amazon data showed no network issues to our machines.
  • 63. Case Study: Exchange Throttling W H AT HAPPENED?
  • 64. Case Study: Exchange Throttling WE HIT AN IMPLICIT EXCHANGE LIMIT. (ARGUABLY, A BUG.)
  • 66. IT I S P O S S I B L E T O H AV E T O O L I T T L E I N F O R M AT I O N .
  • 67. “(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF REACTOR 4). THEY CHERNOBYL KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.” - - SVETLANA ALEXIEVICH - VO I C E S F RO M C H E R N O BY L : T H E O R A L H I S TO RY O F A N U C L E A R D I SA S T E R
  • 68. IT IS POSSIBLE TO COLLECT TOO M U C H I N F O R M AT I O N , O R P R E S E N T IT BADLY.
  • 69. “SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. COMPLEX (…) ONE OF THE LESSONS OF SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.” - C HARLES PERROW - N ORMAL ACCIDENTS: LIVIN G WITH HIGH-RISK TEC HN OLOGIES
  • 70. I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T W H AT ’ S P L A I N L Y B E F O R E Y O U R E Y E S .
  • 71. “THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD. ONE REASON FOR THEIR IGNORANCE WAS THE IMPERFECT NATURE OF TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD BLOWN UP.” - - H E N R Y S . F. C O O P E R , J R . - X I I I : T H E A P O L LO F L I G H T T H AT FA I L E D
  • 72. Things that don’t quite work like you’d hope.
  • 73. Problems • Instrumenting code increases code size. • While very low, runtime impact is not zero. • Instrumentation of dependencies is up to the library author.
  • 74. Solutions •Dtrace / systemtap hookups •Culture of instrumentation by default. •Tracing BIFs
  • 75. W O M B AT ?
  • 76. Instrumentation by default strategies • Application configuration callback module • Behaviors callbacks • Value-at-time functions
  • 78. LOOK FOR SOLUTIONS, NOT S C A P E G O AT S .
  • 81. EVERYONE MUST WORK T O WA R D T H E SAME END.
  • 82. KNOW SOME BASIC M AT H E M AT I C S .
  • 83. PREFER LOOSE COUPLING WHEN POSSIBLE. ! BE EXPLICIT ABOUT TIGHT COUPLING.
  • 84. MEASURE MUCH, P R E S E N T O N LY T H AT WHICH YOU’RE C E R TA I N T O N E E D .
  • 86. BE A C C U R AT E .
  • 89. Bonus Bibliography! XIII: The Apollo Flight that Failed Henry S. F. Cooper, Jr. David A. Mindell Digital Apollo: Human and Machine in Spaceflight Charles Perrow Normal Accidents: Living with High-Risk Technologies Voices from Chernobyl: The Oral History of a Nuclear Disaster Svetlana Alexievich Real-Time Systems: Design Principles for Distributed Embedded Applications Hermann Kopetz Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety Eric Schlosser