Scaling on AWS for the First 10 Million Users (ARC206) | AWS re:Invent 2013
Upcoming SlideShare
Loading in...5
×
 

Scaling on AWS for the First 10 Million Users (ARC206) | AWS re:Invent 2013

on

  • 2,939 views

Cloud computing gives you a number of advantages in being able to scale on demand, easily replace whole parts of your infrastructure, and much more. As a new business looking to use the cloud, you ...

Cloud computing gives you a number of advantages in being able to scale on demand, easily replace whole parts of your infrastructure, and much more. As a new business looking to use the cloud, you inevitably ask yourself, Where do I start? Join us at this session to understand some of the common patterns and recommended areas of focus you can expect to work through while scaling an infrastructure to handle going from zero to millions of users. From leveraging highly scalable AWS services to making smart decisions on building out your application, you'll learn a number of best practices for scaling your infrastructure in the cloud. The patterns and practices reviewed in this session will get you there.

Statistics

Views

Total Views
2,939
Views on SlideShare
2,911
Embed Views
28

Actions

Likes
14
Downloads
139
Comments
0

1 Embed 28

https://twitter.com 28

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Scaling on AWS for the First 10 Million Users (ARC206) | AWS re:Invent 2013 Scaling on AWS for the First 10 Million Users (ARC206) | AWS re:Invent 2013 Presentation Transcript

    • ARC206 – Scaling on AWS for the First 10 Million Users Simon Elisha, Principal Solutions Architect, Amazon Web Services November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
    • • ME: Simon Elisha – Principal Solutions Architect – Amazon Web Services – @simon_elisha • YOU: Here to learn more about scaling infrastructure on AWS • TODAY: About best practices and things to think about when building for large scale
    • So how do we scale?
    • Hi, I have NO IDEA what I am doing!!
    • a lot of things to read
    • a lot of things to read not where we want to start
    • Auto Scaling is a tool and a destination. It’s not the single thing that fixes everything.
    • What do we need first?
    • Some basics
    • Regions US West (Oregon) EU (Ireland) AWS GovCloud (US) Asia Pacific (Tokyo) US East (Virginia) Asia Pacific (Sydney) US West (N. California) South America (Sao Paulo) Asia Pacific (Singapore)
    • Availability Zones US West (Oregon) EU (Ireland) AWS GovCloud (US) Asia Pacific (Tokyo) US East (Virginia) Asia Pacific (Sydney) US West (N. California) South America (Sao Paulo) Asia Pacific (Singapore)
    • Edge Locations
    • Service Reference Model Deployment & Administration App Services Compute Storage & Content Delivery Database Networking AWS Global Infrastructure
    • AWS OpsWorks Amazon SNS Amazon CloudSearch Amazon SES Amazon SWF Amazon SQS Amazon CloudWatch Amazon EMR Amazon Route 53 AWS Direct Connect AWS CloudFormation AWS Data Pipeline Amazon ElastiCache Amazon RDS Amazon DynamoDB Amazon RedShift App Services Amazon Elastic Transcoder Amazon VPC AWS IAM Deployment & Administration Compute Amazon EC2 AWS Elastic Beanstalk Storage & Content Delivery Database Networking AWS Global Infrastructure Amazon S3 Amazon CloudFront AWS Storage Gateway Amazon Glacier
    • So let’s start from day one, user one ( you )
    • Day One, User One • A single EC2 Instance Amazon Route 53 User – With full stack on this host • • • • Web app Database Management Etc. • A single Elastic IP • Route53 for DNS Elastic IP EC2 Instance
    • “We’re gonna need a bigger box” • • • • • • • • Simplest approach Can now leverage PIOPs High I/O instances High memory instances High CPU instances High storage instances Easy to change instance sizes Will hit an endpoint eventually hi1.4xlarge m2.4xlarge m1.small
    • “We’re gonna need a bigger box” • • • • • • • • Simplest approach Can now leverage PIOPs High I/O instances High memory instances High CPU instances High storage instances Easy to change instance sizes Will hit an endpoint eventually hi1.4xlarge m2.4xlarge m1.small
    • Day One, User One • We could potentially get to a few hundred to a few thousand depending on application complexity and traffic • No failover • No redundancy • Too many eggs in one basket Amazon Route 53 User Elastic IP EC2 Instance
    • Day One, User One: • We could potentially get to a few hundred to a few thousand depending on application complexity and traffic • No failover • No redundancy • Too many eggs in one basket Amazon Route 53 User Elastic IP EC2 Instance
    • Day Two, User >1 First let’s separate out our single host into more than one. • Web • Database – Make use of a database service? Amazon Route 53 User Elastic IP Web Instance Database Instance
    • Database Options Self-managed Database Server on Amazon EC2 Your choice of database running on Amazon EC2 Bring Your Own License (BYOL) Fully Managed Amazon RDS Amazon DynamoDB Amazon Redshift Microsoft SQL, Oracle or MySQL as a managed service Managed NoSQL database service using SSD storage Massively parallel, petabyte-scale, data warehouse service Flexible licensing – BYOL or license included Seamless scalability Fast, powerful and easy to scale Zero administration
    • But how do I choose what DB technology I need? SQL? NoSQL?
    • Not a binary decision!
    • Blended approach can reduce technical debt
    • Start with SQL databases where it makes sense
    • Why start with SQL? • Established and well worn technology • Lots of existing code, communities, books, background, tools, etc • You aren’t going to break SQL DBs in your first 10 million users. But you might break parts of it (hence blended approach) • Clear patterns to scalability
    • If your usage is such that you will be generating several TB ( >5 ) of data in the first year OR have an incredibly data intensive workload you might need NoSQL
    • Why else might you need NoSQL? • • • • • • Super low latency applications Metadata driven datasets Highly unrelational data Need schema-less data constructs* Massive amounts of data (again, in the TB range) Rapid ingest of data (thousands of records/sec) *Need != “its easier to do dev without schemas”
    • So decide wisely. Look for the key points of scale.
    • User >100 First let’s separate out our single host into more than one • Web • Database – Use RDS to make your life easier Amazon Route 53 User Elastic IP Web Instance RDS DB Instance
    • User > 1000 User Next let’s address our lack of failover and redundancy issues • Elastic Load Balancing • Another web instance Amazon Route 53 Elastic Load Balancing Web Instance Web Instance RDS DB Instance Active (Multi-AZ) RDS DB Instance Standby (Multi-AZ) Availability Zone Availability Zone – In another Availability Zone • Enable Amazon RDS multi-AZ
    • Elastic Load Balancing • Create highly scalable applications • Distribute load across EC2 instances in multiple Availability Zones Feature Available Health checks Session stickiness Secure sockets layer Monitoring Elastic Load Balancer Details Load balance across instances in multiple Availability Zones Automatically checks health of instances and takes them in or out of service Route requests to the same instance Supports SSL offload from web and application servers with flexible cipher support Publishes metrics to CloudWatch
    • Scaling this horizontally and vertically will get us pretty far (10s-100s of thousands)
    • User >10 ks–100 ks User Amazon Route 53 Elastic Load Balancing Web Instance Web Instance Web Instance RDS DB Instance RDS DB Instance Read Replica Read Replica Availability Zone Web Instance RDS DB Instance Active (Multi-AZ) Web Instance Web Instance RDS DB Instance Standby (Multi-AZ) Web Instance RDS DB Instance Read Replica Availability Zone Web Instance RDS DB Instance Read Replica
    • This will take us pretty far honestly, but we care about performance and efficiency, so let’s clean this up a bit
    • Shift Some Load Around User Let’s lighten the load on our web and database instances: • • • Move static content from the web Instance to Amazon S3 and CloudFront Move dynamic content from the Elastic Load Balancing to CloudFront Move session/state and DB caching to ElastiCache or Amazon DynamoDB Amazon Route 53 Amazon CloudFront Elastic Load Balancer Amazon S3 Web Instance ElastiCache RDS DB Instance Active (Multi-AZ) Amazon DynamoDB Availability Zone Check out Session: ARC309 – Dynamic Content Acceleration: Lightning Fast Web Apps with Amazon CloudFront and Amazon Route 53
    • Working with S3 – Amazon Simple Storage Service • • • • Object-based storage for the web 11 9s of durability Good for things like: – Static assets ( css, js, images, videos ) – Backups – Logs – Ingest of files for processing “Infinitely scalable” • • • • • • • Supports fine grained permission control Ties in well with CloudFront Ties in with Amazon EMR Acts as a logging endpoint for Amazon S3, CloudFront, Billing Supports encryption at transit and at rest Reduced redundancy 1/3 cheaper Amazon Glacier for super long term storage
    • Amazon CloudFront CDN for Static CDN for Static & Content No CDN Dynamic Content • 80 70 60 50 40 30 20 10 0 8:00 AM 9:00 AM 10:00 11:00 12:00 AM AM PM 1:00 PM 2:00 PM 3:00 PM 4:00 PM 5:00 PM 6:00 PM Server Load Response Time Server Load Response Time Server Load Cache static content at the edge for faster delivery Helps lower load on origin infrastructure Dynamic and static content Streaming video Zone apex support Custom SSL certificates Low TTLs ( as short as 0 seconds ) Lower costs for origin fetches ( between Amazon S3/EC2 and CloudFront ) Optimized to work with EC2, Amazon S3, Elastic Load Balancing, and Route53 Volume of Data Delivered (Gbps) • • • • • • • • Response Time Amazon CloudFront is a web service for scalable content delivery. 7:00 PM 8:00 PM 9:00 PM
    • Shift Some Load Around User Let’s lighten the load on our web and database instances • Move static content from the web instance to Amazon S3 and CloudFront • Move dynamic content from the Elastic Load Balancing to CloudFront • Move session/state and DB caching to ElastiCache or DynamoDB Amazon Route 53 Amazon CloudFront Elastic Load Balancing Amazon S3 Web Instance ElastiCache RDS DB Instance Active (Multi-AZ) Availability Zone Amazon DynamoDB
    • Shift Some Load Around User Let’s lighten the load on our web and database instances • Move static content from the web instance to Amazon S3 and CloudFront • Move dynamic content from the Elastic Load Balancing to CloudFront • Move session/state and DB caching to ElastiCache or Amazon DynamoDB Amazon Route 53 Amazon Cloudfront Elastic Load Balancing Amazon S3 Web Instance ElastiCache RDS DB Instance Active (Multi-AZ) Availability Zone Amazon DynamoDB
    • Amazon DynamoDB • Provisioned throughput NoSQL database • Fast, predictable performance • Fully distributed, fault-tolerant Feature Details Provisioned throughput Predictable performance Strong consistency Fault tolerant architecture • Considerations for nonuniform data Monitoring Dial up or down provisioned read/write capacity Average single-digit millisecond latencies from SSD-backed infrastructure Be sure you are reading the most up to date values Data replicated across Availability Zones Integrated to CloudWatch Secure Integrates with AWS Identity and Access Management (IAM) Elastic MapReduce Integrates with Amazon Elastic MapReduce for complex analytics on large datasets
    • ElastiCache • • • • • • • Hosted Memcached & Redis – Speaks same API as traditional open source Memcached and Redis Scale from one to many nodes Self-healing ( replaces dead instance ) Very fast ( single digit ms speeds usually ) Local to a single AZ for Memcache, with no persistence or replication With Redis can put a replica in a different AZ with persistence Use AWS’s Auto Discovery client to simplify clusters growing and shrinking without affecting your application
    • Now that our Web tier is much more lightweight, we can revisit the beginning of our talk…
    • Auto Scaling!
    • Auto Scaling Trigger autoscaling policy Amazon CloudWatch Automatic resizing of compute clusters based on demand Feature Details Control Define minimum and maximum instance pool sizes and when scaling and cool down occurs. Integrated to Amazon CloudWatch Use metrics gathered by CloudWatch to drive scaling. Instance types Run Auto Scaling for On-Demand and Spot Instances. Compatible with VPC. AWS autoscaling create-autoscaling-group — Auto Scaling-group-name MyGroup — Launch-configuration-name MyConfig — Min size 4 — Max size 200 — Availability Zones us-west-2c
    • Typical Weekly Traffic to Amazon.com Sunday Monday Tuesday Wednesday Thursday Friday Saturday
    • Typical Weekly Traffic to Amazon.com Provisioned Capacity Sunday Monday Tuesday Wednesday Thursday Friday Saturday
    • November Traffic to Amazon.com November
    • November Traffic to Amazon.com Provisioned Capacity November
    • November Traffic to Amazon.com 76% Provisioned Capacity November 24%
    • November Traffic to Amazon.com November
    • Auto Scaling lets you do this!
    • User >500k+ Amazon Route 53 User Amazon Cloudfront Elastic Load Balancing Web Instance Web Instance Web Instance Amazon S3 Web Instance Web Instance Web Instance DynamoDB RDS DB Instance RDS DB Instance Active (Multi-AZ) Read Replica Availability Zone ElastiCache RDS DB Instance RDS DB Instance Standby (Multi-AZ) Read Replica Availability Zone ElastiCache
    • A Pause to Think
    • “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” – Abraham Lincoln
    • “World of Hurt” If You Are Missing These • • • • Metrics & alarming Automated builds Automated deployment Centralized logging Check out Session: ARC306 – Lumberjacking on AWS: Cutting Through Logs to Find What Matters Check out Session: ARC307 –Continuous Integration and Deployment Best Practices on AWS
    • Host-level Metrics Aggregatelevel Metrics Log Analysis External Site Performance
    • Not having proper monitoring or metrics is like flying a plane with an eye mask on in a thunderstorm. Oh and your wing is on fire.
    • AWS Marketplace & Partners Can Help • Customers can find, research, buy software • Simple pricing aligns with EC2 usage model • Launch in minutes • Marketplace billing integrated into your AWS account • 1,000+ products across 20+ categories Learn more at: aws.amazon.com/marketplace
    • Spend Your Time Wisely Managing your infrastructure will become an increasingly important part of your time. Use tools to automate repetitive tasks • Tools to manage AWS resources • Tools to manage software on and configuration of your instances • Automated data analysis of logs and user actions
    • AWS Application Management Solutions Higher level services Elastic Beanstalk Convenience AWS OpsWorks Do it yourself AWS CloudFormation EC2 Control
    • Host-based Configuration Management Two big players – Opscode Chef – PuppetLabs Puppet • • • • • Both do more or less the same thing They have similar syntax Works well with tools from the previous slide Require some learning time Can’t scale easily without this kind of capability
    • From 500K to 1 Million Users • • • • Getting serious now Significant user base Plenty of attention if things go wrong Interesting phase for startups with funding rounds
    • Time to make some radical improvements at the web & app layers
    • SOA = Service-oriented Architecture
    • SOAing Move services into their own tiers or modules. Treat each of these as 100% separate pieces of your infrastructure and scale them independently. Amazon.com and AWS do this extensively! It offers flexibility and greater understanding of each component.
    • Loose Coupling Sets You Free! • The looser they're coupled, the bigger they scale – – – – Use independent components Design everything as a black box Decouple interactions Favor services with built in redundancy and scalability than building your own Use Amazon SQS as Buffers Tight Coupling Loose Coupling Controller A Q Controller B Q Controller A Controller B Check out Session: ARC301 – Controlling the Flood: Massive Message Processing with Amazon SQS & Amazon DynamoDB
    • Loose Coupling + SOA = Winning In the early days, if someone has a service for it already, use that instead of building it yourself Don’t reinvent the wheel Examples: • Email • Queuing • Transcoding • Search • • • • Databases Monitoring Metrics Logging Amazon SNS Amazon SES Amazon CloudSearch Amazon SQS Amazon SWF Amazon Elastic Transcoder
    • On reinventing the wheel: If you find yourself writing your own queue, DNS server, database, storage system, monitoring tool …
    • Take a deep breath and stop it. Now.
    • Back to SOA
    • Imagine we let our users upload photos
    • CloudFront Download Distribution RRS Amazon S3 Bucket to Serve Content to CloudFront Amazon S3 Bucket for Ingest Instances User SQS Queue Size for Thumbnail Autoscaling Group Instances Amazon SNS Topic Autoscaling Group SQS Queue Size Image for Mobile Instances SQS Queue Size Image for Web Autoscaling Group Amazon S3 Bucket for Originals
    • Amazon Simple Workflow Service (SWF) • • • • • • • Provides an orchestration tool across your infrastructure Can act as a middle layer to pass messages and setup tasks Lets you break down individual tasks into different workers Lets you define logic between workers Lets you make a worker task from anything that can be scripted Includes built-in retries, timeouts, logging Features built-in reliability, scalability, and low cost Your code = & Deciders Workers
    • CloudFront Download Distribution RRS Amazon S3 Bucket to Serve Content to CloudFront Amazon S3 Bucket for Ingest Instances User SQS Queue Size for Thumbnail Autoscaling Group Instances Amazon SNS Topic Autoscaling Group SQS Queue Size Image for Mobile Instances SQS Queue Size Image for Web Autoscaling Group Amazon S3 Bucket for Originals
    • CloudFront Download Distribution RRS Amazon S3 Bucket to Serve Content to CloudFront Amazon S3 Bucket for Ingest Instances User Autoscaling Group Instances Autoscaling Group SWF Instances Instance Running Decider Autoscaling Group Amazon S3 Bucket for Originals
    • Users > 1 Million Reaching a million and above is going to require some of all the previous things: • Multi-AZ • Elastic Load Balancing between tiers • Auto Scaling • Service-oriented architecture • Serving content smartly (S3/CloudFront) • Caching off DB • Moving state off tiers that autoscale
    • Users > 1 Million User Amazon Route 53 Amazon Cloudfront Elastic Load Balancer Amazon SQS Web Instance Web Instance Web Instance Web Instance Worker Instance Worker Instance Amazon DynamoDB ElastiCache RDS DB Instance RDS DB Instance Read Replica Read Replica Availability Zone RDS DB Instance Active (Multi-AZ) Amazon S3 Internal App Instance Internal App Instance Amazon CloudWatch Amazon SES
    • The next big steps
    • From 5 to 10 Million Users You may start to run into issues with your database around contention on the write master. How can you solve it? • Federation (splitting into multiple DBs based on function) • Sharding (splitting one data set up across multiple hosts) • Moving some functionality to other types of DBs (NoSQL)
    • Database Federation • Split up databases by function or purpose • Harder to do cross-function queries • Essentially delays the need for something like sharding or NoSQL until much further down the line • Won’t help with single huge functions or tables ForumsDB UsersDB ProductsDB
    • Sharded Horizontal Scaling • More complex at the application layer • ORM support can help • No practical limit on scalability • Operational complexity and sophistication • Shard by function or key space • RDBMS or NoSQL User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A A C B
    • Shifting Functionality to NoSQL • Similar in a sense to federation • Again, think about the earlier points for when you need NoSQL vs SQL • Leverage hosted services like Amazon DynamoDB • Consider these use cases: – – – – – Leaderboards and scoring Rapid ingest of clickstream or log data Temporary data needs (cart data) “Hot” tables Metadata or lookup tables Amazon DynamoDB
    • From 5 to 10 Million Users You may start to run into issues with speed and performance of your applications • Make sure you have monitoring, metrics, & logging in place – If you can’t build it internally, outsource it! (third-party SaaS) • Pay attention to what customers are saying works well vs. what doesn’t, and use this as direction • Try to work on squeezing as much performance out of each service or component
    • A quick review
    • • Use Multi-AZ for your infrastructure • Make use of self-scaling services (Elastic Load Balancing, Amazon S3, Amazon SNS, SQS, Amazon SES, etc) Build in redundancy at every level • Blend SQL & NoSQL wisely • Cache data both inside and outside your infrastructure • Split tiers into individual services (SOA) • Use autoscaling once you’re ready for it • Use automation tools in your infrastructure • Make sure you have good metrics, monitoring, and logging tools in place • Don’t reinvent the wheel
    • Putting all this together means we should now easily be able to handle 10+ million users!
    • Users > 10 Million Iterating on top of the patterns seen here will get you up and over 100 million users.
    • Users > 10 Million • • • • • More fine tuning of your application More SOA of features and functionality Going from Multi-AZ to multi-region Needing to start building custom solutions Deep analysis of your whole stack Check out Session: ARC305 – How Netflix Leverages Multiple Regions to Increase Availability
    • One More Thing • A fantastic amount of FINANCIAL ENGINEERING to do as well • Reserved Instances • Spot Instances • Correct use of storage • Scaling driven by queues • Correct instance sizes • Etc… Check out Session: ARC313 – Running Lean and Mean: Designing Cost-Efficient Architectures on AWS
    • Next steps? Read! • aws.amazon.com/documentation • aws.amazon.com/architecture • aws.amazon.com/start-ups Listen! • aws.amazon.com/podcast
    • Next steps? Ask for help! • forums.aws.amazon.com • aws.amazon.com/support • Your local account manager & solution architect
    • AWS re:Invent Pub Crawl Join the AWS Startup Team this evening at the AWS Pub Crawl When: Wednesday November 13, 5:30pm - 7:30pm Where: Canaletto at The Venetian, 2nd Floor Who Will Be There: Startups, The AWS Startup Team, Startup Launch Companies and AWS re:Invent Hackathon winners
    • Startup Spotlight Sessions with Dr. Werner Vogels Thurs. Nov 14, Marcello Room 4406 SPOT 203 - Fireside Chats – Startup Founders, 1:30-2:30pm – Eliot Horowitz, CTO of MongoDB – Jeff Lawson, CEO of Twilio – Valentino Volonghi, Chief Architect of AdRoll SPOT 204 - Fireside Chats – Startup Influencers, 3:00-4:00pm – Albert Wegner, Managing Partner at Union Square Ventures – David Cohen, Founder and CEO of TechStars SPOT 101 - Startup Launches, 4:15-5:15pm – 5 companies powered by AWS launching at AWS re:Invent 2013
    • Please give us your feedback on this presentation ARC206 As a thank you, we will select prize winners daily for completed surveys!