© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Christian Beedgen
October 2015
5 Years of Building SaaS on AWS
A Story by Sumo Logic
$ whoami
Co-Founder & CTO, Sumo Logic
Cloud-based Machine Data Analytics Service
Applications, Operations, Security
Chief Architect, ArcSight
Major SIEM player in the enterprise space
Log Management for security and compliance
From Data to Decisions
DEVOPS
Streamline continuous delivery
Monitor KPIs and metrics
Accelerate troubleshooting
IT INFRASTRUCTURE AND OPERATIONS
Monitor all workloads
Troubleshoot and increase uptime
Simplify, modernize, and save costs
COMPLIANCE AND SECURITY
Automate and demonstrate compliance
Audit all systems
Think beyond rules
Cloud Analytics Platform
[Diagram: one COLLECTOR in each of Customer A Cloud, Customer A Data Center, Customer B Cloud, and Customer B Data Center]
Why SaaS?
Because enterprise software sucks™
Too much pain for the customer
Time spent running the system is not spent using the system
Expensive once you are done adding hardware and people
Disastrous for the vendor
No control over the runtime, hard to diagnose problems
Kills innovation because each release lives forever
Why AWS?
We are developers, not data center people
AWS has turned the data center into an API
As developers, we understand reuse (libraries, OSs, …)
Today’s systems require reuse on a higher level
Do you really want to care for 4,000 machines? HA? DR?
Anti-monolithic
In previous gigs, we dealt with monolithic systems
With Sumo, we knew what we needed to build, no MVP required
Get data into the system, index it, provide query function
So we had a logical breakdown immediately
And we knew it had to scale…
…not just to the biggest customer, but to all customers!
Ingestion Path
[Diagram components: Receiver, Bus, Index, Raw, CQ, S3]
Analytics Path
[Diagram components: Query, Service, CQ, S3]
Scale Today
50 TB of new incoming data per day
Double-digit PB of data under management
>2,000,000 queries/day
Thousands of instances in 4 regions globally
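To put the daily totals above into per-second terms, here is a small back-of-the-envelope calculation. It assumes decimal terabytes (10**12 bytes) and treats the slide's ">2,000,000 queries/day" as exactly 2,000,000, so the results are averages and lower bounds, not measured peak rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Average rate per second for a quantity given per day."""
    return per_day / SECONDS_PER_DAY


# Figures from the slide; decimal terabytes (10**12 bytes) are an assumption.
ingest_bytes_per_day = 50 * 10**12
queries_per_day = 2_000_000

ingest_mb_per_s = per_second(ingest_bytes_per_day) / 10**6
queries_per_s = per_second(queries_per_day)

print(f"average ingest: {ingest_mb_per_s:.0f} MB/s")
print(f"average query rate: {queries_per_s:.1f} queries/s")
```

These are averages over a full day; real traffic is bursty, so actual peaks would be higher.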
Divide & Conquer
3 to 1000s of instances!
Each box in the previous images is an application
Blast radius, bulk-heading, concern separation
Not everything will break all the time – repair engines, not plane
Not everybody will need to work on everything all the time
What We Actually Did
Compose applications from layers of modules
Whole system is Scala on top of the JVM
One Maven POM per module, one main() per application
Initially one GitHub repository per module, today just one project
Right size AWS instance for each application cluster
Each application exposes a façade
Avro over HTTP, or Avro over HornetQ, or Avro over Kafka
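One way to picture "each application exposes a façade" is a client-side class whose callers never see which transport carries the request. The sketch below is illustrative only: `IndexFacade`, `count_messages`, and `in_memory_transport` are hypothetical names, and JSON stands in for Avro so the example stays within the standard library.

```python
import json
from typing import Callable, Dict

# A transport takes an encoded request and returns an encoded response.
# In the talk this would be HTTP, HornetQ, or Kafka; here it is a function.
Transport = Callable[[bytes], bytes]


class IndexFacade:
    """Hypothetical client-side facade; callers never see the transport."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def count_messages(self, customer_id: str) -> int:
        # JSON is a stand-in for Avro encoding (assumption for this sketch).
        request = json.dumps({"op": "count", "customer": customer_id}).encode()
        response = json.loads(self._transport(request))
        return response["count"]


def in_memory_transport(store: Dict[str, int]) -> Transport:
    """Fake transport that answers from a dict instead of a remote service."""

    def handle(raw: bytes) -> bytes:
        request = json.loads(raw)
        count = store.get(request["customer"], 0)
        return json.dumps({"count": count}).encode()

    return handle


facade = IndexFacade(in_memory_transport({"customer-a": 3}))
print(facade.count_messages("customer-a"))
```

Swapping the transport function changes how bytes move without touching any caller of the façade.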
How I Actually Visualize Microservices
[Diagram: services grouped as Deployment-wide services, Ingest, Search, and Internal tools. Labels: receiver, hornetq-forge, forge, cqsplitter, search, cloud, collector, service, api, concierge, stream, katta, glass, ganglia, bill, mix, meta, config, zookeeper, appvault, org, raw, hornetq-inbound, cocoa, bloom filter, analyticscsi, cqmerger, rework, view, autoview, depman, hornetq-internal, hornetq-metadata, nrt]
2 to the power of 5 services (“32”), 170+ modules
Don’t even ask about the # of dependencies
At least 3 of each – everything is a separately scalable cluster
Service Discovery
Loose coupling in the large…
A deployment is made up of many things
Some of these things need to talk to each other
Some of these things come and go
Don’t pass in a huge list of static dependencies
Start each application with one parameter
$ bin/receiver prod.service-registry.sumologic.com
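The idea of starting each application with only the registry address can be sketched as follows. This is a minimal in-memory stand-in, not Sumo Logic's registry: `ServiceRegistry`, `start_receiver`, the addresses, and the assumption that the receiver needs to find `hornetq-inbound` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ServiceRegistry:
    """In-memory stand-in for the registry a deployment would host."""

    def __init__(self) -> None:
        self._instances: Dict[str, Set[str]] = defaultdict(set)

    def register(self, service: str, address: str) -> None:
        self._instances[service].add(address)

    def deregister(self, service: str, address: str) -> None:
        self._instances[service].discard(address)

    def lookup(self, service: str) -> List[str]:
        return sorted(self._instances[service])


def start_receiver(registry: ServiceRegistry, own_address: str) -> List[str]:
    """Start with only the registry; discover everything else at runtime."""
    registry.register("receiver", own_address)
    # Which services the receiver depends on is an assumption of this sketch.
    return registry.lookup("hornetq-inbound")


registry = ServiceRegistry()
registry.register("hornetq-inbound", "10.0.0.7:5445")
registry.register("hornetq-inbound", "10.0.0.8:5445")
brokers = start_receiver(registry, "10.0.0.21:8080")

# Instances come and go; later lookups reflect the current state.
registry.deregister("hornetq-inbound", "10.0.0.7:5445")
print(brokers, registry.lookup("hornetq-inbound"))
```

Because dependencies are looked up rather than passed in, instances can join or leave without restarting their consumers with a new static list.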
Anti-singletenant
Multi-dimensional scaling predicates multitenancy
This is a data processing platform – cost matters!
Autoscaling single tenants is too fine-grained for us
Also, efficiency… one code line “master” in deployment
Customers aren’t pets, they are cattle
Yum yum yum…
FEATURE FLAGS!!!
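With one code line deployed for every tenant, per-customer behaviour has to be switched at runtime. A minimal sketch of a per-tenant feature flag check follows; the flag names, tenant names, and the `"*"` wildcard convention are hypothetical, and a plain dict stands in for whatever table backs the flags.

```python
from typing import Dict, Set


class FeatureFlags:
    """Per-tenant feature flags; a dict stands in for the backing table."""

    def __init__(self, enabled_for: Dict[str, Set[str]]) -> None:
        self._enabled_for = enabled_for

    def is_enabled(self, flag: str, tenant: str) -> bool:
        tenants = self._enabled_for.get(flag, set())
        # "*" means the flag is on for every tenant (sketch convention).
        return "*" in tenants or tenant in tenants


flags = FeatureFlags({
    "new-query-operator": {"customer-a"},  # rolled out to one tenant
    "faster-indexing": {"*"},              # on for everyone
})


def run_query(tenant: str) -> str:
    """Same deployed code for all tenants; the flag picks the path."""
    if flags.is_enabled("new-query-operator", tenant):
        return "new code path"
    return "old code path"


print(run_query("customer-a"), "/", run_query("customer-b"))
```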
Just one typical Sumo Logic customer - 8x Variance!
Money flushed down the toilet
Load per tenant fluctuates wildly, but
aggregated system load just goes up slowly
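The effect can be illustrated with a toy calculation. The numbers below are made up for illustration (three hypothetical tenants, each with an 8x spike at a different hour); they are not Sumo Logic data. The point is only that the sum of per-tenant peaks is much larger than the peak of the combined load.

```python
# Hypothetical hourly load for three tenants; each spikes 8x at a different hour.
tenant_load = {
    "customer-a": [1, 8, 1, 1, 1, 1],
    "customer-b": [1, 1, 1, 8, 1, 1],
    "customer-c": [1, 1, 1, 1, 1, 8],
}

# Single-tenant provisioning: each tenant gets capacity for its own peak.
sum_of_peaks = sum(max(load) for load in tenant_load.values())

# Multi-tenant provisioning: capacity for the peak of the combined load.
combined = [sum(hour) for hour in zip(*tenant_load.values())]
peak_of_sum = max(combined)

print(f"capacity if provisioned per tenant: {sum_of_peaks}")
print(f"capacity if provisioned for the aggregate: {peak_of_sum}")
print(f"combined load per hour: {combined}")
```

In this toy case the shared system needs 10 units instead of 24, and the combined load varies by about 3.3x rather than 8x, because the spikes do not coincide.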
Anti-manual
We use Jenkins, of course
We still build system versions as cross-cuts and QA them
We are busy moving toward true continuous delivery
Application Groups for things that evolve together…
…and that can be deployed together
Prod, Long, Stag, Nite
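The speaker notes for this slide describe a promotion order of NITE, then STAG, then LONG, then PROD. A minimal sketch of that gate-by-gate promotion follows; the `promote` function and the `passes` callback are hypothetical and stand in for the Jenkins jobs and QA checks, which the talk does not detail.

```python
from typing import Callable, List

# Promotion order as described in the talk notes: NITE, then STAG, LONG, PROD.
ENVIRONMENTS = ["NITE", "STAG", "LONG", "PROD"]


def promote(version: str, passes: Callable[[str, str], bool]) -> List[str]:
    """Deploy a cross-cut to each environment in order; stop at the first failure.

    Returns the environments the version was deployed to, including the one
    where it failed (if any).
    """
    reached = []
    for env in ENVIRONMENTS:
        reached.append(env)
        if not passes(version, env):
            break
    return reached


# A version that passes everywhere reaches PROD.
print(promote("build-1", lambda version, env: True))
# A version that fails in STAG never reaches LONG or PROD.
print(promote("build-2", lambda version, env: env != "STAG"))
```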
dsh: Another AWS Deployment Tool
Model-driven, describe desired state, run to make it so
High performance due to parallelization
Covers all layers of the stack – AWS, OS, Sumo Logic
Easy to use and extend, scriptable CLI
Developer-friendly, Scala-based, high-level APIs
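"Describe desired state, run to make it so" can be sketched as a diff-and-apply loop. This is not dsh itself (which the slide says is Scala-based); `plan`, `apply_action`, and `converge` are hypothetical names, the model is reduced to instance counts per cluster, and applying an action only mutates a dict instead of calling AWS.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (verb, cluster, number of instances)


def plan(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Diff desired against actual instance counts per cluster."""
    actions: List[Action] = []
    for cluster in sorted(set(desired) | set(actual)):
        delta = desired.get(cluster, 0) - actual.get(cluster, 0)
        if delta > 0:
            actions.append(("launch", cluster, delta))
        elif delta < 0:
            actions.append(("terminate", cluster, -delta))
    return actions


def apply_action(action: Action, actual: Dict[str, int]) -> None:
    """Stand-in for real provisioning calls: just update the recorded state."""
    verb, cluster, count = action
    change = count if verb == "launch" else -count
    actual[cluster] = actual.get(cluster, 0) + change


def converge(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    actions = plan(desired, actual)
    # Each action touches a distinct cluster, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda action: apply_action(action, actual), actions))
    return actions


desired = {"receiver": 3, "index": 5}
actual = {"receiver": 1, "index": 5, "old": 2}
actions = converge(desired, actual)
print(actions, actual)
```

Running `plan` again after `converge` returns an empty list, which is the property that makes a model-driven tool safe to re-run.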
[Diagram: SaaS platform building blocks: Data Access Layer, Delivery, Authentication & Authorization, Metering, Monitoring, Ordering, Provisioning, Billing, Analytics, Resource Management, SaaS Application(s), Business Services, Core Platform Services, Interaction, Application, Additional Applications, Application Lifecycle Management]
[Diagram overlay of AWS services: EC2, Route53, S3, Glacier, CloudFront, DynamoDB, RDS, ElastiCache, RedShift, WorkSpaces, CloudWatch, CloudTrail, IAM, CodeDeploy, Beanstalk, CloudFormation, OpsWorks, SWF, EMR, Kinesis, SNS, Mobile Analytics, Cognito, Directory Service, CloudSearch, AppStream, SES, SQS, XCode, Data Pipeline]
3 ELBs (service, api, receiver)
EC2, obviously
RIs, dabbling with Spot
SES for alert emails to our customers
SQS for user registration from corporate website
Petabytes of S3
ElastiCache Memcache for client object caches
DynamoDB for feature flags and configuration
RDS MySQL for configuration and content objects
SimpleDB for deployment location
CloudWatch, CloudTrail
Sumo Logic!
Zuora for billing
Jenkins, GitHub
Our own automation framework – “dsh”
CloudFormation for Mesos cluster setup
Integrations
Generic S3 Collection
Amazon S3 Audit
Elastic Load Balancing
Amazon CloudFront
AWS CloudTrail
Amazon VPC Flow Logs
AWS Config
What Does the Future Hold?
Super happy to see Amazon EFS introduced
Borderline unnaturally excited about AWS KMS
Planning on using AWS Lambda as a “plugin system”
Implementing Mesos for new services
Very excited about Docker to enable better utilization
Thank You!
@raychaser

Editor's Notes

  • #4, #5 Our 3rd generation analytics platform helps customers gain instant insights into this growing pool of machine data within their complex environments. Proven machine learning analytics help customers gain deep visibility across DevOps, IT ops, and compliance and security environments. For DevOps: our service empowers DevOps teams with a simple and scalable solution for monitoring KPIs and metrics across the entire stack to deliver quality software. With pattern recognition and transaction analytics, teams spend less time troubleshooting and more time developing code, and real-time dashboards allow them to quickly collaborate on root cause analysis of bugs and fix performance issues before they impact customers. For IT Ops: Sumo Logic helps transform IT data into better customer experience and business decisions by extracting valuable information such as latencies, performance metrics, trends, and critical events tied to core systems and services. IT can monitor complex workloads and migrations for errors, warnings, performance, and availability across cloud and on-premises infrastructure stacks, and modernize their management stack with a SaaS solution designed for elastic scale with lower TCO. Compliance and Security: our service helps organizations simplify and automate compliance and security monitoring across their entire stack with predictive analytics, pre-built searches, real-time dashboards, and pre-defined reports.
  • #6 This is my personal experience from the last decade. It sucks for the customer and it sucks for the vendor.
  • #7 This is our experience from the ArcSight days. The system can’t just run on some Game Boy sitting in the corner. Big servers are required, and in our case even a big Oracle database. We just gave the customer an “installation guide” and hoped for the best.
  • #8 Not having control over the execution environment puts the developer at a severe disadvantage. “Works on my machine” is the daily reality, but how do you debug something you can’t touch? There are too many degrees of freedom for the customer to make the wrong decision: OS choice vs. available funds, storage setup and RAID levels, … Everything you do becomes instant legacy. Every release you push to customers will slow down your future velocity. You will spend all your time backporting fixes to old versions, because your big customers refuse to take the time, money, and pain to upgrade. The result is that you become a maintenance organization. Why would you do this voluntarily?
  • #14 http://microservices.io/articles/scalecube.html I wish we had actually read that in detail back then, but our intuition got us pretty close.
  • #15 http://microservices.io/articles/scalecube.html This actually worked out. We called it an “Internal SOA”. We forgot one thing though, but more about that later.
  • #16 http://microservices.io/articles/scalecube.html This was extremely hotly debated internally.
  • #17 http://microservices.io/articles/scalecube.html If one thing fails, the rest might continue to function. High cohesion, low coupling.
  • #18, #19 http://microservices.io/articles/scalecube.html With every order of magnitude of scale, something will break. You will not be able to predict what. You need to be able to just fix that, in a running system.
  • #20 With about 200 modules, code review is really hard when not in a single repo. We also use messaging heavily in the ingestion path.
  • #21 http://www.slideshare.net/alvarosanchezmariscal/stateless-authentication-for-microservices
  • #22 http://static3.jadedpixel.com/s/files/1/0010/2052/files/Puppy_dogs.jpg
  • #23 What they actually look like. So this is our version of the microservices Death Star.
  • #25, #26, #27 Because scaling is hard and has latency. Why make it harder than it has to be? Have you ever implemented a closed-loop controller? The system as a whole scales slowly. Our customers behave in unforeseen ways, but they never do so at the same time. Customers are balanced within the system all the time. In the majority of cases we don’t have to spike-scale. Our system is not batch-based, so latency really matters. http://s133.photobucket.com/user/Lurkerlake/media/cow3.png.html
  • #28, #29, #30 Assume for a second that we had to provision for this customer at the peak… Most of the time, there would be too many resources, driving up the cost of providing the service and hence the price. And it wouldn’t even be able to absorb the spike.
  • #33 The first-level job builds the “system” and deploys it to NITE, which has the latest cross-cut. If that cross-cut passes, it is pushed to STAG, where QA will tear into it. Ultimately, it is pushed to LONG, where the rest of the company gets to see the latest that survives. If nothing bad happens in LONG, it goes to PROD, usually once per week.
  • #34, #52 *Need to make sure this list is accurate*
  • #53 EFS: we thought for a long time that it would help to further decouple data from processing, but it looks too expensive right now ($0.30/GB/month). Being able to let customers manage the encryption keys is a big deal for us. We managed to get PCI certified based on what we have built, but in an ideal future we would give customers control over the keys. There are points in our product where we would like customers to add functionality (charting, query operators); we are looking into Lambda to enable this safely. With all the microservices being their own clusters, we are wasting resources that we pay for.
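To see why $0.30/GB/month looks expensive at this scale, here is a rough calculation. The 10 PB figure is an assumption: it is simply the smallest "double-digit PB" amount and does not claim that all of that data would ever move to EFS. Decimal units (1 PB = 10**6 GB) are also assumed.

```python
EFS_PRICE_PER_GB_MONTH = 0.30  # USD, the figure quoted in the note above


def monthly_cost_usd(petabytes: float, price_per_gb_month: float) -> float:
    """Monthly storage cost, using decimal units: 1 PB = 10**6 GB."""
    gigabytes = petabytes * 1_000_000
    return gigabytes * price_per_gb_month


# 10 PB is a hypothetical volume chosen for illustration.
cost = monthly_cost_usd(10, EFS_PRICE_PER_GB_MONTH)
print(f"${cost:,.0f} per month")
```

At the quoted price, every petabyte stored costs about $300,000 per month, which is consistent with the note's conclusion that EFS looked too expensive at the time.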