Site reliability in the serverless age - Serverless Boston Meetup

Site Reliability in
the Serverless Age
Erik Peterson
CEO & Founder
CloudZero
erik@cloudzero.com | @silvexis
Serverless Boston Meetup | 9/18/2018

About Me
Erik Peterson – erik@cloudzero.com, @silvexis
• CEO and Founder of CloudZero
• I’m recovering from the application security industry,
now 100% focused on Cloud and Serverless
• Have been building systems on AWS since 2008
• Previously
• Veracode
• HP, SPI Dynamics, Sanctum
• United Nations IAEA, US Department of State,
SunTrust, Moody’s Investors
• Fun fact: I’ve lived in 6 US states and 3 countries

About CloudZero
Actionable Intelligence for Serverless Systems
• Dynamically map your environment, automatically discover
resources, relationships and data flows
• Easily identify errors and bottlenecks
• Track system costs and identify cost anomalies within
hours, not days or months
• Agentless deployment, requires only AWS data sources like
X-Ray, CloudTrail, and CloudWatch

What is Serverless?
What is Reliability?
How does Serverless affect Reliability?
How CloudZero Operates
The Future

WHAT IS
SERVERLESS?
• Event driven
• Invisible infrastructure
• Automatically scales with usage
• Fault tolerance and high availability built in
• Never pay for idle

Serverless is not
just Functions as
a Service
But FaaS is one of its most important building blocks

Serverless is a Spectrum (AWS edition)
0% (Less Serverless) | 50% | 100% (More Serverless)

Werner Vogels
CTO, Amazon Web Services

So what does
reliability even mean?

Reliability is the trustworthiness of a system’s
ability to delight the customer

Two forces
exist today
that drive
reliability
• DevOps (culture)
• Eliminate Dev and
Ops silos
• Accept failure as
normal
• Driven to achieve the
fastest feature velocity
• Measure everything
• Site Reliability
Engineering (practice)
• Automate everything
• Manage to Service
Level Objectives
• Monitor what matters
• Forecast demand and
manage capacity
• Use resources
efficiently

How does Serverless affect these forces?
Hint: Change is coming

Serverless
effect on
DevOps
• Eliminate Dev and Ops silos: REQUIRED
• Accept failure as normal: REQUIRED
• Driven to achieve the fastest feature velocity: CHANGE NOW HAPPENING FASTER THAN YOU CAN KEEP UP WITH
• Measure everything: BUILT IN, BUT WE ARE DROWNING IN DATA
• NEW: Cost effective systems are well built systems: WAIT WHAT? WHAT ARE COGS?

Serverless
effect on Site
Reliability
Engineering
• Automate everything: REQUIRED
• Manage to Service Level Objectives: I KNOW MY SLO, BUT WHAT IS MY SLA?
• Monitor what matters: MY DASHBOARDS CAN’T KEEP UP
• Forecast demand and manage capacity: EVERYTHING SCALES, BUT THERE ARE LIMITS
• Use resources efficiently: THE BILL IS WHAT?!

You can’t
“lift and shift” your way into
Serverless
This applies to your culture, technology and process

CloudZero’s
Culture and
Practice
• Culture
• Dev and Ops are one
• Failure is guaranteed
• Always 5 min from
production
• A well designed
system is a cost
effective one
• Practice
• Ensure SLOs are
aligned with SLAs
• Dynamically monitor
(we eat our own
dogfood)
• Understand system
limits and plan
accordingly
• Track cost as a first
class operational
performance metric
Full disclosure: some of these are a work in progress
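To illustrate what aligning SLOs with SLAs can look like in practice, here is a minimal sketch in Python. It assumes availability-style targets expressed as fractions; the function names and the example targets are hypothetical and do not come from the talk.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    return (1.0 - availability) * days * 24 * 60


def slo_covers_sla(slo: float, sla: float) -> bool:
    """An internal SLO is aligned when it is at least as strict as the SLA."""
    return slo >= sla


# Hypothetical targets: a 99.9% customer-facing SLA and a 99.95% internal SLO.
sla_target = 0.999
slo_target = 0.9995

aligned = slo_covers_sla(slo_target, sla_target)
sla_budget = allowed_downtime_minutes(sla_target)  # roughly 43 minutes per 30 days
print(f"aligned={aligned}, SLA downtime budget={sla_budget:.1f} min")
```

Keeping the internal objective stricter than the external agreement leaves a margin in which the team can react before a customer commitment is broken.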

Automate Everything
• We use Serverless Framework or AWS SAM for packaging and deployment
• Serverless Framework used to be the only game in town
• SAM has made huge improvements
• Stackery.io is looking very interesting (and supports SAM)
• If you are starting from scratch today, start with SAM
• Semaphore for CI/CD
• Works so well we wrote a blog post on it
• https://www.cloudzero.com/blog/continuous-delivery-in-the-world-of-serverless

Dynamically
monitor
We are 100% eating our own
dogfood here

Understand Serverless Limits
• Scaling is built in, but Serverless systems have limits and constraints.
• You will hit them once you are in prod under heavy customer load
• It can be very hard to figure out when the limits are being hit in a large
system with many moving parts. Here are just a few examples:
AWS Lambda
• Maximum number of concurrent executions per AWS account (1000, changeable)
• Immediate Concurrency Increase (500 or more per min, depends on region, fixed)
API Gateway
• Integration timeout (29 sec max, fixed)
• Max payload size (10 MB, fixed)
Invocation Limits
• S3 will asynchronously call Lambda
• Lambda polls DynamoDB Streams only once per second, per shard
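One way to make limits like these visible before they are hit is to compare observed values against them. The sketch below is a minimal illustration in Python that uses the limit values quoted on this slide; the class, function, and constant names are hypothetical, the observed values are made up, and in a real system they would come from a metrics source such as CloudWatch.

```python
from dataclasses import dataclass

# Limits as quoted on the slide (2018 values; check current AWS documentation).
LAMBDA_ACCOUNT_CONCURRENCY = 1000                  # per account, changeable
API_GATEWAY_TIMEOUT_SEC = 29                       # integration timeout, fixed
API_GATEWAY_MAX_PAYLOAD_BYTES = 10 * 1024 * 1024   # 10 MB, assumed binary megabytes


@dataclass
class LimitCheck:
    name: str
    observed: float
    limit: float

    @property
    def utilization(self) -> float:
        """Fraction of the limit currently in use."""
        return self.observed / self.limit


def limits_at_risk(checks, threshold=0.8):
    """Return the names of checks at or above the utilization threshold."""
    return [c.name for c in checks if c.utilization >= threshold]


# Made-up observations for illustration only.
checks = [
    LimitCheck("lambda_concurrency", 850, LAMBDA_ACCOUNT_CONCURRENCY),
    LimitCheck("apigw_integration_sec", 12, API_GATEWAY_TIMEOUT_SEC),
    LimitCheck("apigw_payload_bytes", 2 * 1024 * 1024, API_GATEWAY_MAX_PAYLOAD_BYTES),
]
at_risk = limits_at_risk(checks)
print(at_risk)
```

The 80% threshold is an arbitrary choice for the sketch; the point is that each limit gets an explicit check rather than being discovered under production load.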

Monitor your complete system cost
Don’t just watch your Lambda bill, it is just one part of
a Serverless system
System Costs Per Day
[Chart: daily cost per service of $1.79, $15, $0.89, $789 (!!!), and $12; the only service label that survives is "CloudWatch Logs"]
Lambda Cost: $1.79
Total System Cost: $818.68
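The arithmetic behind this slide can be checked with a short Python sketch. The five dollar amounts are the ones shown on the slide. Only the Lambda figure is labeled in the source, so the other dictionary keys are hypothetical placeholders, not the services from the original chart.

```python
# Daily costs in US dollars, as shown on the slide.
# Only "lambda" is labeled in the source; the other keys are placeholders.
daily_costs = {
    "lambda": 1.79,
    "component_b": 15.00,
    "component_c": 0.89,
    "component_d": 789.00,
    "component_e": 12.00,
}


def total_cost(costs):
    """Sum all components, rounded to cents."""
    return round(sum(costs.values()), 2)


def cost_share(costs, name):
    """Fraction of the total system cost attributable to one component."""
    return costs[name] / sum(costs.values())


total = total_cost(daily_costs)
lambda_share = cost_share(daily_costs, "lambda")
print(f"total=${total:.2f}, lambda share={lambda_share:.2%}")
```

The components sum to the $818.68 total on the slide, which puts Lambda at well under 1% of the daily system cost.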

Serverless is going to cause a new DevOps
Tribe to emerge
Source: https://twitter.com/swardley/status/1024107922203111424
Simon Wardley
“Do not tell me the DevOps community isn't
fragmenting into two tribes ... oh, yes it is ... PnA vs
PnC ... I've added a third baseline group PnB, who
aren't DevOps but give a useful control sample.”
Survey Size: 389
Existing DevOps “Tribe” New FinDevOps “Tribe”

Existing DevOps “Tribe”
103 respondents

New FinDevOps “Tribe”
207 respondents

Thank You!
• Let's continue the conversation
• erik@cloudzero.com
• @silvexis
