Scaling on AWS for the First 10 Million Users (ARC206) | AWS re:Invent 2013

ARC206 – Scaling on AWS for the First 10
Million Users
Simon Elisha, Principal Solutions Architect, Amazon Web Services
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

• ME: Simon Elisha – Principal Solutions Architect –
Amazon Web Services – @simon_elisha
• YOU: Here to learn more about scaling infrastructure on
AWS
• TODAY: About best practices and things to think about
when building for large scale

Hi, I have NO IDEA what I am doing!!

a lot of things to read

not where we want to start

Auto Scaling is a tool and a
destination. It’s not the single
thing that fixes everything.

Regions
• US West (Oregon)
• EU (Ireland)
• AWS GovCloud (US)
• Asia Pacific (Tokyo)
• US East (Virginia)
• Asia Pacific (Sydney)
• US West (N. California)
• South America (Sao Paulo)
• Asia Pacific (Singapore)

Availability Zones
[Map: Availability Zones within US West (Oregon), EU (Ireland), AWS GovCloud (US), Asia Pacific (Tokyo), US East (Virginia), Asia Pacific (Sydney), US West (N. California), South America (Sao Paulo), Asia Pacific (Singapore)]

Service Reference Model
Layers:
• Deployment & Administration
• App Services
• Compute
• Storage & Content Delivery
• Database
• Networking
• AWS Global Infrastructure

[Diagram: services placed across these layers: AWS OpsWorks, Amazon SNS, Amazon CloudSearch, Amazon SES, Amazon SWF, Amazon SQS, Amazon CloudWatch, Amazon EMR, Amazon Route 53, AWS Direct Connect, AWS CloudFormation, AWS Data Pipeline, Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon Elastic Transcoder, Amazon VPC, AWS IAM, Amazon EC2, AWS Elastic Beanstalk, Amazon S3, Amazon CloudFront, AWS Storage Gateway, Amazon Glacier]

So let’s start from day one, user one (you)

Day One, User One
• A single EC2 instance
  – With full stack on this host
    • Web app
    • Database
    • Management
    • Etc.
• A single Elastic IP
• Route 53 for DNS

[Diagram: User, Amazon Route 53, Elastic IP, EC2 instance]

“We’re gonna need a bigger box”
• Simplest approach
• Can now leverage Provisioned IOPS (PIOPS)
• High I/O instances
• High memory instances
• High CPU instances
• High storage instances
• Easy to change instance sizes
• Will eventually hit a limit

Example instance types: hi1.4xlarge, m2.4xlarge, m1.small

Day One, User One
• We could potentially get to a few hundred to a few thousand users, depending on application complexity and traffic
• No failover
• No redundancy
• Too many eggs in one basket

[Diagram: User, Amazon Route 53, Elastic IP, EC2 instance]

Day Two, User >1
First let’s separate out our single host into more than one:
• Web
• Database
  – Make use of a database service?

[Diagram: User, Amazon Route 53, Elastic IP, web instance, database instance]

Database Options

Self-managed
• Database server on Amazon EC2
  – Your choice of database running on Amazon EC2
  – Bring Your Own License (BYOL)

Fully managed
• Amazon RDS
  – Microsoft SQL Server, Oracle, or MySQL as a managed service
  – Flexible licensing: BYOL or license included
• Amazon DynamoDB
  – Managed NoSQL database service using SSD storage
  – Seamless scalability
  – Zero administration
• Amazon Redshift
  – Massively parallel, petabyte-scale data warehouse service
  – Fast, powerful, and easy to scale

But how do I choose
what DB technology I
need? SQL? NoSQL?

Blended approach
can reduce technical
debt

Start with SQL
databases where it
makes sense

Why start with SQL?
• Established and well-worn technology
• Lots of existing code, communities, books, background, tools, etc.
• You aren’t going to break SQL DBs in your first 10 million users. But you might break parts of them (hence the blended approach)
• Clear patterns to scalability

If your usage is such that you will be
generating several TB ( >5 ) of data
in the first year OR have an
incredibly data intensive workload
you might need NoSQL
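
A rough way to check yourself against the "several TB in the first year" rule of thumb is a back-of-the-envelope estimate. The sketch below is only illustrative: the function names, the 5 TB default threshold (taken from the ">5" on the slide), and the example workload are assumptions, not measurements.

```python
def first_year_terabytes(records_per_day: float, bytes_per_record: float) -> float:
    """Estimate the data generated in one year, in terabytes (10**12 bytes)."""
    return records_per_day * bytes_per_record * 365 / 1e12


def exceeds_rule_of_thumb(tb_per_year: float, threshold_tb: float = 5.0) -> bool:
    """True when the estimate is above the slide's rough ">5 TB" guideline."""
    return tb_per_year > threshold_tb


# Hypothetical workload: one million records a day at about 2 KB each.
estimate = first_year_terabytes(records_per_day=1_000_000, bytes_per_record=2_000)
print(f"{estimate:.2f} TB/year, consider NoSQL: {exceeds_rule_of_thumb(estimate)}")
```

Volume is only one of the signals; an incredibly data-intensive workload can justify NoSQL even when the estimate stays small.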

Why else might you need NoSQL?
• Super low latency applications
• Metadata-driven datasets
• Highly non-relational data
• Need schema-less data constructs*
• Massive amounts of data (again, in the TB range)
• Rapid ingest of data (thousands of records/sec)

*Need != “it’s easier to do dev without schemas”

So decide wisely.
Look for the key
points of scale.

User >100
First let’s separate out our single host into more than one:
• Web
• Database
  – Use RDS to make your life easier

[Diagram: User, Amazon Route 53, Elastic IP, web instance, RDS DB instance]

User > 1000
Next let’s address our lack of failover and redundancy:
• Elastic Load Balancing
• Another web instance
  – In another Availability Zone
• Enable Amazon RDS Multi-AZ

[Diagram: User, Amazon Route 53, Elastic Load Balancing, and two Availability Zones, each with a web instance; RDS DB instance active (Multi-AZ) in one zone and RDS DB instance standby (Multi-AZ) in the other]

Elastic Load Balancing
• Create highly scalable applications
• Distribute load across EC2 instances in multiple Availability Zones

Feature: Details
• Available: Load balance across instances in multiple Availability Zones
• Health checks: Automatically checks health of instances and takes them in or out of service
• Session stickiness: Route requests to the same instance
• Secure sockets layer: Supports SSL offload from web and application servers with flexible cipher support
• Monitoring: Publishes metrics to CloudWatch
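
To make the health-check behavior concrete, here is a minimal sketch of a balancer that rotates requests across instances and skips any that failed their last check. It is a toy model under stated assumptions, not how Elastic Load Balancing is implemented: the `LoadBalancer` class, its method names, and the instance names are all hypothetical.

```python
class LoadBalancer:
    """Toy round-robin balancer that only routes to instances passing health checks."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)  # assume everything starts in service
        self._next = 0

    def record_health_check(self, instance, passed):
        """Take an instance out of service on failure, put it back on success."""
        if passed:
            self._healthy.add(instance)
        else:
            self._healthy.discard(instance)

    def route(self):
        """Return the instance that should receive the next request."""
        in_service = [i for i in self._instances if i in self._healthy]
        if not in_service:
            raise RuntimeError("no healthy instances in service")
        instance = in_service[self._next % len(in_service)]
        self._next += 1
        return instance


# Hypothetical instances, one per Availability Zone.
lb = LoadBalancer(["web-az-a", "web-az-b"])
first, second = lb.route(), lb.route()
lb.record_health_check("web-az-a", passed=False)
after_failure = lb.route()
```

With two instances in different Availability Zones, losing one leaves the other serving traffic, which is the failover the single-instance setup lacked.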

Scaling this horizontally and vertically will get us pretty far
(tens to hundreds of thousands of users)

Users > 10,000s–100,000s
[Diagram: User, Amazon Route 53, and Elastic Load Balancing in front of web instances spread across two Availability Zones; RDS DB instance active (Multi-AZ) and RDS DB instance standby (Multi-AZ), with RDS DB instance read replicas in both Availability Zones]
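
Read replicas only help if the application sends reads to them. The sketch below shows one simple way to split traffic, assuming a hypothetical `ReplicatedDatabaseRouter` and made-up endpoint names; treating any statement that starts with SELECT as a read is a deliberate simplification.

```python
class ReplicatedDatabaseRouter:
    """Send writes to the primary and spread reads across read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def endpoint_for(self, sql):
        """Pick an endpoint for a statement; anything that is not a SELECT goes to the primary."""
        is_read = sql.lstrip().lower().startswith("select")
        if not is_read or not self.replicas:
            return self.primary
        endpoint = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return endpoint


# Hypothetical endpoint names for illustration only.
router = ReplicatedDatabaseRouter("primary-db", ["replica-1", "replica-2"])
read_one = router.endpoint_for("SELECT * FROM users WHERE id = 7")
read_two = router.endpoint_for("select count(*) from orders")
write = router.endpoint_for("INSERT INTO users (name) VALUES ('ada')")
```

Because replicas are updated asynchronously, a read that must see a write made a moment ago is safer against the primary.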

This will take us pretty far
honestly, but we care about
performance and
efficiency, so let’s clean
this up a bit

Shift Some Load Around
Let’s lighten the load on our web and database instances:
• Move static content from the web instance to Amazon S3 and CloudFront
• Move dynamic content from Elastic Load Balancing to CloudFront
• Move session/state and DB caching to ElastiCache or Amazon DynamoDB

[Diagram: User, Amazon Route 53, Amazon CloudFront, Elastic Load Balancing, Amazon S3, web instance, ElastiCache, RDS DB instance active (Multi-AZ), Amazon DynamoDB]

Check out Session: ARC309 – Dynamic Content
Acceleration: Lightning Fast Web Apps with
Amazon CloudFront and Amazon Route 53

Working with S3 – Amazon Simple Storage Service
• Object-based storage for the web
• 11 9s of durability
• Good for things like:
  – Static assets (css, js, images, videos)
  – Backups
  – Logs
  – Ingest of files for processing
• “Infinitely scalable”
• Supports fine-grained permission control
• Ties in well with CloudFront
• Ties in with Amazon EMR
• Acts as a logging endpoint for Amazon S3, CloudFront, Billing
• Supports encryption in transit and at rest
• Reduced redundancy storage is 1/3 cheaper
• Amazon Glacier for super long-term storage

Amazon CloudFront
Amazon CloudFront is a web service for scalable content delivery.
• Cache static content at the edge for faster delivery
• Helps lower load on origin infrastructure
• Dynamic and static content
• Streaming video
• Zone apex support
• Custom SSL certificates
• Low TTLs (as short as 0 seconds)
• Lower costs for origin fetches (between Amazon S3/EC2 and CloudFront)
• Optimized to work with EC2, Amazon S3, Elastic Load Balancing, and Route 53

[Chart: Volume of Data Delivered (Gbps), 0 to 80, from 8:00 AM to 9:00 PM, with server load and response time compared for three cases: No CDN, CDN for Static Content, and CDN for Static & Dynamic Content]

Amazon DynamoDB
• Provisioned throughput NoSQL database
• Fast, predictable performance
• Fully distributed, fault-tolerant architecture
• Considerations for nonuniform data

Feature: Details
• Provisioned throughput: Dial up or down provisioned read/write capacity
• Predictable performance: Average single-digit millisecond latencies from SSD-backed infrastructure
• Strong consistency: Be sure you are reading the most up-to-date values
• Fault tolerant: Data replicated across Availability Zones
• Monitoring: Integrated with CloudWatch
• Secure: Integrates with AWS Identity and Access Management (IAM)
• Elastic MapReduce: Integrates with Amazon Elastic MapReduce for complex analytics on large datasets

ElastiCache
• Hosted Memcached & Redis
  – Speaks same API as traditional open source Memcached and Redis
• Scale from one to many nodes
• Self-healing (replaces dead instance)
• Very fast (single-digit ms speeds usually)
• Local to a single AZ for Memcached, with no persistence or replication
• With Redis, can put a replica in a different AZ with persistence
• Use AWS’s Auto Discovery client to simplify clusters growing and shrinking without affecting your application
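
Moving DB caching into a cache tier usually means the cache-aside pattern: look in the cache first, fall back to the database on a miss, then store the result with a time-to-live. The sketch below uses an in-process `TTLCache` as a stand-in for a Memcached or Redis client; the class, the `get_user` helper, and the 60-second TTL are assumptions made for illustration.

```python
import time


class TTLCache:
    """In-process stand-in for a cache node: values expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_user(user_id, cache, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the database, then populate the cache."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.set(key, value, ttl_seconds)
    return value
```

Since the Memcached option has no persistence or replication, the application should treat every cached value as disposable and always be able to rebuild it from the database.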

Now that our Web tier is
much more lightweight, we
can revisit the beginning of
our talk…

Auto Scaling
Automatic resizing of compute clusters based on demand

[Diagram: Amazon CloudWatch triggers an Auto Scaling policy]

Feature: Details
• Control: Define minimum and maximum instance pool sizes and when scaling and cool down occurs.
• Integrated with Amazon CloudWatch: Use metrics gathered by CloudWatch to drive scaling.
• Instance types: Run Auto Scaling for On-Demand and Spot Instances. Compatible with VPC.

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name MyGroup \
    --launch-configuration-name MyConfig \
    --min-size 4 \
    --max-size 200 \
    --availability-zones us-west-2c
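
The minimum and maximum sizes bound whatever a scaling policy asks for. As a rough illustration, the sketch below sizes a group proportionally to average CPU and then clamps the result to the 4 and 200 limits from the command above. The proportional rule, the `desired_capacity` name, and the 60% CPU target are assumptions for this example, not a description of how Auto Scaling decides internally.

```python
import math


def desired_capacity(current, avg_cpu, target_cpu=60.0, min_size=4, max_size=200):
    """Scale the group so average CPU moves toward the target, within the group limits."""
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_size, min(max_size, wanted))


# Ten instances running hot at 90% CPU: grow the group.
print(desired_capacity(current=10, avg_cpu=90.0))
# Ten instances idling at 6% CPU: shrink, but never below the minimum.
print(desired_capacity(current=10, avg_cpu=6.0))
```

A real policy also needs a cool-down period so the group does not react again before the previous change has taken effect.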

[Chart: Typical Weekly Traffic to Amazon.com, Sunday through Saturday]

[Chart: Typical Weekly Traffic to Amazon.com against Provisioned Capacity, Sunday through Saturday]

[Chart: November Traffic to Amazon.com, annotated 76% and 24%]

Auto Scaling lets you do
this!

User >500k+
[Diagram: User, Amazon Route 53, Amazon CloudFront, Amazon S3, Elastic Load Balancing, and DynamoDB, with web instances and ElastiCache in each of two Availability Zones; active (Multi-AZ) database with a read replica in one zone, standby (Multi-AZ) with a read replica in the other]

“Give me six hours to chop down a tree and I will spend
the first four sharpening the axe.” – Abraham Lincoln

“World of Hurt” If You Are Missing These
• Metrics & alarming
• Automated builds
• Automated deployment
• Centralized logging

Check out Session: ARC306 – Lumberjacking on AWS: Cutting Through Logs to Find What Matters
– Continuous Integration and Deployment Best Practices on AWS

• Host-level metrics
• Aggregate-level metrics
• Log analysis
• External site performance

Not having proper monitoring
or metrics is like flying a
plane with an eye mask on in
a thunderstorm.
Oh and your wing is on fire.

AWS Marketplace & Partners Can Help
• Customers can find, research,
buy software
• Simple pricing aligns with EC2
usage model
• Launch in minutes
• Marketplace billing integrated
into your AWS account
• 1,000+ products across 20+
categories
Learn more at: aws.amazon.com/marketplace

Spend Your Time Wisely
Managing your infrastructure will become an
increasingly important part of your time. Use tools to
automate repetitive tasks
• Tools to manage AWS resources
• Tools to manage software on and configuration of
your instances
• Automated data analysis of logs and user actions

AWS Application Management Solutions
From higher-level services (convenience) to do it yourself (control):
• Elastic Beanstalk
• AWS OpsWorks
• AWS CloudFormation
• EC2

Host-based Configuration Management
Two big players:
– Opscode Chef
– PuppetLabs Puppet

• Both do more or less the same thing
• They have similar syntax
• Work well with tools from the previous slide
• Require some learning time
• Can’t scale easily without this kind of capability

From 500K to 1 Million Users
• Getting serious now
• Significant user base
• Plenty of attention if things go wrong
• Interesting phase for startups with funding rounds

Time to make some
radical improvements at
the web & app layers

SOA
=
Service-oriented Architecture

SOAing
Move services into their own tiers
or modules. Treat each of these
as 100% separate pieces of your
infrastructure and scale them
independently.
Amazon.com and AWS do this
extensively! It offers flexibility and
greater understanding of each
component.

Loose Coupling Sets You Free!
• The looser they're coupled, the bigger they scale
  – Use independent components
  – Design everything as a black box
  – Decouple interactions
  – Favor services with built-in redundancy and scalability over building your own

Use Amazon SQS as Buffers
[Diagram: Tight coupling: Controller A and Controller B connected directly. Loose coupling: Controller A and Controller B connected through queues (Q).]

– Controlling the Flood: Massive Message Processing with Amazon SQS & Amazon DynamoDB
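
The point of the queue is that the producer never waits on the consumer. Here is a minimal sketch of that buffering, using Python's in-process `queue.Queue` as a stand-in for an SQS queue; `enqueue_jobs`, `drain`, and the job names are hypothetical.

```python
import queue


def enqueue_jobs(q, jobs):
    """Controller A: hand work to the queue and return immediately."""
    for job in jobs:
        q.put(job)


def drain(q, handle):
    """Controller B: process whatever is waiting, at its own pace."""
    processed = 0
    while True:
        try:
            job = q.get_nowait()
        except queue.Empty:
            return processed
        handle(job)
        q.task_done()
        processed += 1


buffer = queue.Queue()
enqueue_jobs(buffer, ["job-1", "job-2", "job-3"])  # the consumer can be down at this point
handled = []
count = drain(buffer, handled.append)
```

If Controller B is slow or offline, work accumulates in the queue instead of failing in Controller A, and the queue depth becomes a natural signal for scaling the consumers.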

Loose Coupling + SOA = Winning
In the early days, if someone has a service for it already, use that instead of building it yourself.
Don’t reinvent the wheel.
Examples:
• Email
• Queuing
• Transcoding
• Search
• Databases
• Monitoring
• Metrics
• Logging

Services shown: Amazon SNS, Amazon SES, Amazon CloudSearch, Amazon SQS, Amazon SWF, Amazon Elastic Transcoder

On reinventing the wheel: If
you find yourself writing
your own queue, DNS
server, database, storage
system, monitoring tool …

Take a deep breath
and stop it. Now.

Imagine we let our users upload photos

[Diagram: User; instances; Amazon S3 bucket for ingest; Amazon S3 bucket for originals; Amazon SNS topic; three SQS queues (size for thumbnail, size image for mobile, size image for web), each with an Auto Scaling group of instances; RRS Amazon S3 bucket to serve content to CloudFront; CloudFront download distribution]
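
One reading of the photo upload diagram is a fan-out: a single notification about a new photo is copied to three queues, and each queue feeds its own pool of resize workers. The sketch below models that with in-memory stand-ins. The `Topic` class, the `process_queue` helper, the output key format, and the pixel dimensions are all hypothetical; the slide names the three variants but not their sizes.

```python
from collections import deque


class Topic:
    """Stand-in for an SNS topic: copies each message to every subscribed queue."""

    def __init__(self):
        self._queues = []

    def subscribe(self, q):
        self._queues.append(q)

    def publish(self, message):
        for q in self._queues:
            q.append(message)


# Hypothetical target sizes for the three variants named on the slide.
TARGET_SIZES = {"thumbnail": (128, 128), "mobile": (640, 480), "web": (1280, 960)}


def process_queue(variant, q):
    """Drain one queue and return the output keys a resize worker would write."""
    outputs = []
    while q:
        photo_key = q.popleft()
        width, height = TARGET_SIZES[variant]
        outputs.append(f"{variant}/{width}x{height}/{photo_key}")
    return outputs


queues = {variant: deque() for variant in TARGET_SIZES}
topic = Topic()
for q in queues.values():
    topic.subscribe(q)

topic.publish("uploads/cat.jpg")  # one upload event reaches all three queues
```

Each variant can then scale on its own queue depth, so a backlog of web-sized images does not slow down thumbnails.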

Amazon Simple Workflow Service (SWF)
• Provides an orchestration tool across your infrastructure
• Can act as a middle layer to pass messages and set up tasks
• Lets you break down individual tasks into different workers
• Lets you define logic between workers
• Lets you make a worker task from anything that can be scripted
• Includes built-in retries, timeouts, logging
• Features built-in reliability, scalability, and low cost

Your code = Deciders & Workers

[Diagram: the photo upload pipeline with SWF: User; instances; Amazon S3 bucket for ingest; Amazon S3 bucket for originals; SWF; instance running decider; three Auto Scaling groups of instances; RRS Amazon S3 bucket to serve content to CloudFront; CloudFront download distribution]

Users > 1 Million
Reaching a million and above is going to require some of
all the previous things:
• Multi-AZ
• Elastic Load Balancing between tiers
• Auto Scaling
• Service-oriented architecture
• Serving content smartly (S3/CloudFront)
• Caching off DB
• Moving state off tiers that autoscale

Users > 1 Million
[Diagram: User, Amazon Route 53, Amazon CloudFront, Elastic Load Balancing, web instances, worker instances, internal app instances, Amazon SQS, Amazon DynamoDB, ElastiCache, RDS DB instance active (Multi-AZ) with read replicas, Amazon S3, Amazon CloudWatch, Amazon SES]

From 5 to 10 Million Users
You may start to run into issues with your database around
contention on the write master.
How can you solve it?
• Federation (splitting into multiple DBs based on function)
• Sharding (splitting one data set up across multiple hosts)
• Moving some functionality to other types of DBs (NoSQL)

Database Federation
• Split up databases by function or purpose
• Harder to do cross-function queries
• Essentially delays the need for something like sharding or NoSQL until much further down the line
• Won’t help with single huge functions or tables

[Diagram: ForumsDB, UsersDB, ProductsDB]

Sharded Horizontal Scaling
• More complex at the application layer
• ORM support can help
• No practical limit on scalability
• Operational complexity and sophistication
• Shard by function or key space
• RDBMS or NoSQL

User     ShardID
002345   A
002346   B
002347   C
002348   B
002349   A

[Diagram: shards A, C, B]
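
The User to ShardID table is a directory lookup: the application asks which shard owns a user before it opens a connection. The sketch below shows that lookup with the values from the slide, plus a hash-based alternative for sharding by key space. The function names are hypothetical, and the use of SHA-256 is an assumption; any stable hash would serve.

```python
import hashlib

# Directory-based sharding, using the user-to-shard assignments from the slide.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}


def shard_for(user_id):
    """Look up the shard that owns this user; raises KeyError for unknown users."""
    return SHARD_DIRECTORY[user_id]


def hashed_shard_for(user_id, shards=("A", "B", "C")):
    """Key-space alternative: derive the shard from a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```

A directory lets you move individual users between shards at the cost of an extra lookup, while hashing needs no lookup but makes adding a shard a rebalancing exercise. That trade-off is part of the operational complexity the slide mentions.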

Shifting Functionality to NoSQL
• Similar in a sense to federation
• Again, think about the earlier points for when you need NoSQL vs SQL
• Leverage hosted services like Amazon DynamoDB
• Consider these use cases:
  – Leaderboards and scoring
  – Rapid ingest of clickstream or log data
  – Temporary data needs (cart data)
  – “Hot” tables
  – Metadata or lookup tables

From 5 to 10 Million Users
You may start to run into issues with speed and performance
of your applications
• Make sure you have monitoring, metrics, & logging in place
– If you can’t build it internally, outsource it! (third-party SaaS)

• Pay attention to what customers are saying works well vs.
what doesn’t, and use this as direction
• Try to work on squeezing as much performance out of each
service or component

• Use Multi-AZ for your infrastructure
• Make use of self-scaling services (Elastic Load Balancing, Amazon S3, Amazon SNS, Amazon SQS, Amazon SES, etc.)
• Build in redundancy at every level
• Blend SQL & NoSQL wisely
• Cache data both inside and outside your infrastructure
• Split tiers into individual services (SOA)
• Use autoscaling once you’re ready for it
• Use automation tools in your infrastructure
• Make sure you have good metrics, monitoring, and logging
tools in place
• Don’t reinvent the wheel

Putting all this together
means we should now
easily be able to handle
10+ million users!

Users > 10 Million

Iterating on top of the
patterns seen here will get
you up and over 100
million users.

Users > 10 Million
• More fine tuning of your application
• More SOA of features and functionality
• Going from Multi-AZ to multi-region
• Needing to start building custom solutions
• Deep analysis of your whole stack

– How Netflix Leverages Multiple Regions to Increase Availability

One More Thing
• A fantastic amount of FINANCIAL ENGINEERING
to do as well
• Reserved Instances
• Spot Instances
• Correct use of storage
• Scaling driven by queues
• Correct instance sizes
• Etc…
– Running Lean and Mean:
Designing Cost-Efficient
Architectures on AWS

Next steps?
Read!
• aws.amazon.com/documentation
• aws.amazon.com/architecture
• aws.amazon.com/start-ups
Listen!
• aws.amazon.com/podcast

Next steps?
Ask for help!
• forums.aws.amazon.com
• aws.amazon.com/support
• Your local account manager & solution architect

AWS re:Invent Pub Crawl
Join the AWS Startup Team this evening at the AWS Pub Crawl
When: Wednesday November 13, 5:30pm - 7:30pm
Where: Canaletto at The Venetian, 2nd Floor
Who Will Be There: Startups, The AWS Startup Team,
Startup Launch Companies and
AWS re:Invent Hackathon winners

Startup Spotlight Sessions with Dr. Werner Vogels
Thurs. Nov 14, Marcello Room 4406

SPOT 203 - Fireside Chats – Startup Founders, 1:30-2:30pm
– Eliot Horowitz, CTO of MongoDB
– Jeff Lawson, CEO of Twilio
– Valentino Volonghi, Chief Architect of AdRoll

SPOT 204 - Fireside Chats – Startup Influencers, 3:00-4:00pm
– Albert Wenger, Managing Partner at Union Square Ventures
– David Cohen, Founder and CEO of TechStars

SPOT 101 - Startup Launches, 4:15-5:15pm
– 5 companies powered by AWS launching at AWS re:Invent 2013

Please give us your feedback on this
presentation

ARC206
As a thank you, we will select prize
winners daily for completed surveys!
