Site Reliability in
the Serverless Age
Erik Peterson
CEO & Founder
CloudZero
erik@cloudzero.com | @silvexis
Serverless Boston Meetup | 9/18/2018
About Me
Erik Peterson – erik@cloudzero.com, @silvexis
• CEO and Founder of CloudZero
• I’m recovering from the application security industry,
now 100% focused on Cloud and Serverless
• Have been building systems on AWS since 2008
• Previously
• Veracode
• HP, SPI Dynamics, Sanctum
• United Nations IAEA, US Department of State,
SunTrust, Moody’s Investors
• Fun fact: I’ve lived in 6 US states and 3 countries
About CloudZero
Actionable Intelligence for Serverless Systems
• Dynamically map your environment, automatically discover
resources, relationships and data flows
• Easily identify errors and bottlenecks
• Track system costs and identify cost anomalies within
hours, not days or months
• Agentless deployment; requires only AWS data sources such as
X-Ray, CloudTrail and CloudWatch
What is Serverless?
What is Reliability?
How does Serverless affect Reliability?
How CloudZero Operates
The Future
WHAT IS
SERVERLESS?
• Event driven
• Invisible infrastructure
• Automatically scales with usage
• Fault tolerance and high availability built in
• Never pay for idle
Serverless is not
just Functions as
a Service
But FaaS is one of its most important building blocks
Serverless is a Spectrum (AWS edition)
Scale: 0% (Less Serverless), 50%, 100% (More Serverless)
Werner Vogels
CTO, Amazon Web Services
So what does
reliability even mean?
Reliability is the trustworthiness of a system’s
ability to delight the customer
Two forces
exist today
that drive
reliability
• DevOps (culture)
• Eliminate Dev and
Ops silos
• Accept failure as
normal
• Driven to achieve the
fastest feature velocity
• Measure everything
• Site Reliability
Engineering (practice)
• Automate everything
• Manage to Service
Level Objectives
• Monitor what matters
• Forecast demand and
manage capacity
• Use resources
efficiently
How does Serverless affect these forces?
Hint: Change is coming
Serverless
effect on
DevOps
• Eliminate Dev and Ops silos: REQUIRED
• Accept failure as normal: REQUIRED
• Driven to achieve the fastest feature velocity: CHANGE NOW HAPPENING
FASTER THAN YOU CAN KEEP UP WITH
• Measure everything: BUILT IN, BUT WE ARE DROWNING IN DATA
• Cost effective systems are well built systems (NEW): WAIT WHAT? WHAT ARE
COGS?
Serverless
effect on Site
Reliability
Engineering
• Automate everything: REQUIRED
• Manage to Service Level Objectives: I KNOW MY SLO, BUT WHAT
IS MY SLA?
• Monitor what matters: MY DASHBOARDS CAN’T KEEP UP
• Forecast demand and manage capacity: EVERYTHING SCALES, BUT
THERE ARE LIMITS
• Use resources efficiently: THE BILL IS WHAT?!
You can’t
“lift and shift” your way into
Serverless
This applies to your culture, technology and process
How CloudZero Does It
CloudZero’s
Culture and
Practice
• Culture
• Dev and Ops are one
• Failure is guaranteed
• Always 5 min from
production
• A well designed
system is a cost
effective one
• Practice
• Automate everything
• Ensure SLOs are
aligned with SLAs
• Dynamically monitor
(we eat our own
dogfood)
• Understand system
limits and plan
accordingly
• Track cost as a first
class operational
performance metric
Full disclosure: some of these are a work in progress
Automate Everything
• We use Serverless Framework or AWS SAM for packaging and deployment
• Serverless Framework used to be the only game in town
• SAM has made huge improvements
• Stackery.io is looking very interesting (and supports SAM)
• If you are starting from scratch today, start with SAM
• Semaphore for CI/CD
• Works so well we wrote a blog post on it
• https://www.cloudzero.com/blog/continuous-delivery-in-the-world-of-serverless
Dynamically
monitor
We are 100% eating our own
dogfood here
Understand Serverless Limits
• Scaling is built in, but Serverless systems have limits and constraints.
• You will hit them once you are in prod under heavy customer load
• It can be very hard to figure out when the limits are being hit in a large
system with many moving parts. Here are just a few examples:
AWS Lambda
• Maximum number of concurrent executions per AWS account
(1000, changeable)
• Immediate Concurrency Increase (500 or more per min, depends on
region, fixed)
API Gateway
• Integration timeout (29 sec max, fixed)
• Max payload size (10 MB, fixed)
Invocation Limits
• S3 will asynchronously call Lambda
• Lambda polls DynamoDB Streams only once per second, per shard
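As a rough illustration of planning around the limits listed above, the Python sketch below estimates steady-state concurrency as request rate times average duration (Little's law) and compares it, together with duration and payload size, against the figures from the slide. The function names, the 80% warning threshold and the reading of 10 MB as 10 * 1024 * 1024 bytes are assumptions made for illustration; the limits are the 2018 values quoted above and may differ for your account or region.

```python
# Limits as quoted on the slide (2018 figures; verify against current AWS docs).
ACCOUNT_CONCURRENCY_LIMIT = 1000                   # per AWS account, changeable
API_GATEWAY_TIMEOUT_S = 29                         # integration timeout, fixed
API_GATEWAY_MAX_PAYLOAD_BYTES = 10 * 1024 * 1024   # "10mb"; byte reading is an assumption

WARN_RATIO = 0.8  # hypothetical early-warning threshold, not from the slide


def estimated_concurrency(requests_per_second: float, avg_duration_s: float) -> float:
    """Little's law: in-flight executions = arrival rate * time in system."""
    return requests_per_second * avg_duration_s


def check_limits(requests_per_second: float, avg_duration_s: float, payload_bytes: int) -> list:
    """Return human-readable findings for each limit that is hit or close."""
    findings = []
    concurrency = estimated_concurrency(requests_per_second, avg_duration_s)
    if concurrency > ACCOUNT_CONCURRENCY_LIMIT:
        findings.append(
            f"estimated concurrency {concurrency:.0f} exceeds the account limit "
            f"of {ACCOUNT_CONCURRENCY_LIMIT}"
        )
    elif concurrency > WARN_RATIO * ACCOUNT_CONCURRENCY_LIMIT:
        findings.append(
            f"estimated concurrency {concurrency:.0f} is above "
            f"{WARN_RATIO:.0%} of the account limit"
        )
    if avg_duration_s > API_GATEWAY_TIMEOUT_S:
        findings.append(
            f"average duration {avg_duration_s}s exceeds the API Gateway "
            f"integration timeout of {API_GATEWAY_TIMEOUT_S}s"
        )
    if payload_bytes > API_GATEWAY_MAX_PAYLOAD_BYTES:
        findings.append(
            f"payload of {payload_bytes} bytes exceeds the API Gateway maximum "
            f"of {API_GATEWAY_MAX_PAYLOAD_BYTES} bytes"
        )
    return findings


if __name__ == "__main__":
    # 500 requests/s at 2.5 s each needs about 1250 concurrent executions.
    for finding in check_limits(500, 2.5, 1024):
        print(finding)
```

A check like this only covers the synchronous request path; the asynchronous S3 invocations and the per-shard DynamoDB Streams polling noted above need their own monitoring.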
Monitor your complete system cost
Don’t just watch your Lambda bill; it is just one part of
a Serverless system
System Costs Per Day
Figure labels: CloudWatch Logs; $1.79, $15, $0.89, $789!!!, $12
Lambda Cost: $1.79
Total System Cost: $818.68
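The arithmetic behind the slide can be sketched in a few lines of Python. The five daily figures are the ones shown above; only the Lambda figure is explicitly labelled in the extracted text, so the other component names here are placeholders (an assumption), as are the helper function names.

```python
from decimal import Decimal

# Daily costs from the slide. Only "Lambda" is labelled in the extracted text;
# the other keys are placeholder names, not actual service attributions.
DAILY_COSTS = {
    "Lambda": Decimal("1.79"),
    "component_b": Decimal("15"),
    "component_c": Decimal("0.89"),
    "component_d": Decimal("789"),
    "component_e": Decimal("12"),
}


def total_cost(costs: dict) -> Decimal:
    """Sum every component, not just the Lambda line item."""
    return sum(costs.values(), Decimal("0"))


def share_percent(costs: dict, name: str) -> Decimal:
    """Percentage of the total system cost attributable to one component."""
    return costs[name] / total_cost(costs) * 100


def largest_component(costs: dict) -> str:
    """Name of the component that dominates the bill."""
    return max(costs, key=costs.get)


if __name__ == "__main__":
    print(f"Total system cost per day: ${total_cost(DAILY_COSTS)}")
    print(f"Lambda share: {share_percent(DAILY_COSTS, 'Lambda'):.2f}%")
    print(f"Largest component: {largest_component(DAILY_COSTS)}")
```

The components sum to the $818.68 total on the slide, of which the Lambda bill is well under 1%.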
Serverless is going to cause a new DevOps
Tribe to emerge
Source: https://twitter.com/swardley/status/1024107922203111424
Simon Wardley
“Do not tell me the DevOps community isn't
fragmenting into two tribes ... oh, yes it is ... PnA vs
PnC ... I've added a third baseline group PnB, who
aren't DevOps but give a useful control sample.”
Survey Size: 389
Existing DevOps “Tribe” New FinDevOps “Tribe”
Existing DevOps “Tribe”
Source: https://twitter.com/swardley/status/1024107922203111424
103 respondents
New FinDevOps “Tribe”
Source: https://twitter.com/swardley/status/1024107922203111424
207 respondents
Thank You!
• Let’s continue the conversation
• erik@cloudzero.com
• @silvexis
