2011 AWS Tour Australia, Closing Keynote: How Amazon.com migrated to AWS, by Jon Jenkins
  • Hi my name is Jon Jenkins. I’ve been at Amazon for nearly 8 years. During almost all of that time I’ve worked for Amazon’s retail business. Basically when I say “retail business” it means everything that’s not AWS.
  • There’s a common misconception that the AWS services were created specifically for Amazon retail. However, nothing could be further from the truth. Amazon retail and AWS are really completely different businesses. We report into separate SVPs, sit in different buildings and operate independently. Like you, the Amazon retail business is just another customer of AWS. Today I’ve been invited here to provide you with a customer’s perspective on how Amazon’s retail web sites are using AWS to power our retail business.
  • The story I am going to tell you today begins in 1995 and ends in 2011. Over that period of time Amazon’s retail web sites have gone through dramatic changes in terms of technology and the architectures we use to build applications. At the bottom of every slide in this presentation you’ll see a timeline. As the story progresses you can follow along with the timeline to visualize what era of Amazon’s history I’ll be talking about. To understand our current approach toward our migration to the cloud it’s helpful to have a little historical context about Amazon retail’s technical history. So here we go with a whirlwind tour through Amazon’s early years.
  • Here we are in 1995 and this is the original amazon.com home page shortly after launch. Jeff B. founded Amazon in 1994 and the site launched to the public in 1995. Jeff’s basic concept was pretty simple. An internet bookstore could offer a much broader selection at a much lower price than any bricks-and-mortar bookseller.
  • Let’s move forward to 1996. This is one of the earliest architectural diagrams of the Amazon retail business. Just a note, that box that says www.amazon.com is a single web server – it doesn’t represent a fleet or group of servers. Note that this is a logical diagram. Everything pictured here ran on a single DEC Alpha box that served the web site. For example, the Amazon catalog and search indexes were built into Berkeley DBs that were pushed directly onto the web servers. The same host ran the ordering software and fulfillment systems. We had written our own custom web application server called Obidos -- named after a town on the Amazon River. Humorously, the town of Obidos is located at the narrowest part of the Amazon and engineers liked to joke that just as Obidos is the bottleneck of the river it was also the chokepoint for software on the retail web site.
  • In 1997 we added two more web servers to the fleet. We now had three Digital 4100 Unix servers. Screaming boxes for their day, they ran at 600 MHz and could hold up to four processors each.
  • In 1998, after the server room in our main office experienced a, how do you say, “water event” and the floor partially collapsed, we made a decision to move the web servers to a real data center in downtown Seattle.
  • By 1999 the original architecture was starting to show its cracks. This is an architectural diagram of a different sort drawn by one of our developers in the late 1990s. Obidos is represented by the South Park character Cartman. Like Cartman, Obidos had become bloated, ornery and difficult to deal with. More and more functionality had been piled into this core part of the platform and we were having trouble maintaining our pace of innovation.
  • By 2000 we had two distribution centers – one on the east coast and one on the west coast. It had become painfully obvious that it was a bad idea to have tight coupling between our distribution centers and our servers back in Seattle running the web site. Consequently we pushed through a project to decouple the systems powering the web site from fulfillment operations.
  • During 2001 we also migrated off the high-end 64-bit UNIX servers to more cost-conscious 32-bit x86-based Linux hosts. This marked the start of our move toward commodity servers and horizontal scalability. In 2001 we also took our first, fledgling steps toward a service oriented architecture. The first “service” at Amazon to be broken out from the main web server was our Customer Master Service that kept track of customer information. The service architecture was based on Tuxedo.
  • By 2005 Amazon had learned a lot about what a scalable web architecture should look like. This slide lists some of our takeaways at that point. Because many of the engineers who are building the AWS utility services have spent time in the retail side of the business you will notice lots of these philosophies embodied in the various AWS services.
  • From here on I’ll slow down a little bit because we’ve reached the meat of what I’m going to talk about today. In March 2006, Simple Storage Service, the first AWS utility computing service, launched in production. I assume most of you are familiar with S3 at this point so I won’t go into detail about what it is or how it works. However, it is worth mentioning that, contrary to popular myth, S3 was not built to satisfy Amazon retail’s internal use cases. It was designed to be a general purpose file store for the internet. Later in 2006, Elastic Compute Cloud launched into private beta.
  • Amazon has a strong culture of eating our own dog food so we wanted to figure out some way to start using S3 in a meaningful way as part of our retail business. We had lots of network attached storage devices and hundreds of NFS servers and since S3 is basically a file store these seemed like decent candidates to replace with S3. However, given that the amazon.com web site is the flagship of our retail business we really wanted to figure out a way to use S3 on the retail web site. But how? I mean what could we really do with just S3?
  • The answer is the widget pictured on this screen shot from the amazon.com web site. This is the IMDB Theatrical Release Information widget. In 2006 it appeared on almost every DVD and video detail page. The feature presents detailed information about the particular release of the movie that the user is purchasing. As many of you may be aware, IMDB is a wholly-owned subsidiary of Amazon. However, the businesses are run completely independently. We have different reporting structures, different technology platforms and sit in different buildings. Jeff B’s goal is to keep the businesses as independent as possible. He wants both IMDB and Amazon to innovate and operate without constraints imposed by the other. This structure posed some unique challenges that are specific to this particular widget. To better understand what I’m talking about let’s look at the architecture of how this widget is rendered on the amazon.com web site.
  • This is a fairly common model of a service oriented architecture. You’ll see similar diagrams throughout the rest of this presentation. The basic goal of SOA is to provide reusable, scalable components as services that can be accessed by multiple consumers. So the way this feature worked is that the customer comes in from the left and hits the amazon.com web server residing in the Amazon retail data center. That web server issues a service call to the IMDB service to retrieve the theatrical release information. The IMDB service is really just a thin veneer over the IMDB database that stores this content. The service returns raw data to the amazon.com web server and then the web server transforms that data into HTML and inserts it into the page that is returned to the customer. In general this is a pretty decent pattern for building web site features at Amazon. However, in this case it was problematic.
  • First, this architecture resulted in coupling between the Amazon and IMDB businesses. You see, the actual code that transforms the raw service data into HTML lives on the amazon.com web servers. That means that if the IMDB team wants to change the look and feel of the widget or the data presented in the widget they have to adhere to the Amazon release schedule. Second, there are stringent runtime latency requirements for any content appearing on the Amazon web site. In this case, the IMDB team wasn’t able to consistently meet those latency requirements for this feature. Additionally, there has to be coordination when it comes to scaling too. As the Amazon retail business grows we would need to keep IMDB in the loop so they could scale their service appropriately. Even worse, let’s say Amazon is planning to have a sale on DVDs that will cause a big spike in traffic to these types of pages. We would have to make sure that IMDB was informed in advance so their service wouldn’t collapse under the load. Third, in 2006 the IMDB and Amazon teams used different service frameworks. That meant that it was difficult to integrate the two components. Furthermore, in this architecture when there is a change to the service interfaces the client needs to update its software to account for those changes. All of this caused big problems in terms of evolving the feature over time. The solution we came up with was to use S3 as a service. Today this might seem fairly obvious. After all, there’s lots of talk nowadays about REST services and loose coupling. But back in 2006 lots of people were still building things in an RPC style. Anyway, the way it worked was that the IMDB team would insert raw HTML into an S3 bucket and at runtime the amazon.com web server would simply pull that HTML out of S3 and concatenate it into the web page.
  • Here’s a diagram comparing the new architecture to the old. At the bottom you can see that the customer traffic comes in from the left. It hits the amazon.com web server. But now we’ve built a generic S3 HTML puller component. That component basically maps a widget to an S3 bucket. The files in that bucket are named based on the ID of the product. So at run time the web server simply goes to the S3 bucket, pulls the file with the right name and concatenates it into the web page. I’ve purposely drawn the IMDB part as a black box because, frankly, it is a black box from the perspective of the Amazon web site. I have no idea how IMDB gets the content into that S3 bucket and I don’t really care. For all I know it’s a room full of monkeys manually typing in the content – it doesn’t matter.
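The puller described above can be sketched roughly as follows. This is only an illustration, not Amazon's actual component: `FakeBucket` is a hypothetical in-memory stand-in for an S3 bucket (no AWS API is called), the widget name and product ID are made up, and returning an empty string when the file is missing is an assumption about how a page might degrade.

```python
from typing import Dict, Optional


class FakeBucket:
    """Hypothetical in-memory stand-in for an S3 bucket (not the AWS API)."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}

    def put(self, key: str, body: str) -> None:
        self._objects[key] = body

    def get(self, key: str) -> Optional[str]:
        return self._objects.get(key)


class HtmlPuller:
    """Maps a widget name to a bucket and pulls pre-rendered HTML by product ID."""

    def __init__(self) -> None:
        self._widget_buckets: Dict[str, FakeBucket] = {}

    def register(self, widget: str, bucket: FakeBucket) -> None:
        # Re-registering a widget points it at a different bucket.
        self._widget_buckets[widget] = bucket

    def fetch(self, widget: str, product_id: str) -> str:
        bucket = self._widget_buckets.get(widget)
        if bucket is None:
            return ""
        # Assumption: a missing file simply renders as nothing.
        return bucket.get(product_id) or ""


def render_page(puller: HtmlPuller, product_id: str) -> str:
    """Concatenate the pre-formed widget HTML into the page; no transformation."""
    widget_html = puller.fetch("theatrical-release", product_id)
    return "<html><body><h1>Product " + product_id + "</h1>" + widget_html + "</body></html>"
```

The web server side never needs to know how the HTML got into the bucket; whoever owns the widget only has to write a file named after the product ID.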
  • So how did things work out? Here are the results from this change. First, we were able to serve pages to customers faster because S3 had a lower latency than the IMDB service. Second, IMDB doesn’t need to think about scaling at all. S3 is massively scalable and as the Amazon web site traffic picks up S3 bears the brunt of that load so we no longer need to coordinate traffic forecasts with IMDB. Third, the CPU utilization on the Amazon web servers was reduced. In the new model the web servers are simply concatenating pre-formed HTML into the page, not transforming raw service data into markup. This saves a lot of CPU and means we can serve more web pages per host. In the previous slide you’ll note that this model results in fewer runtime dependencies for the website. Specifically, where before we had both an IMDB service and database now we only have S3. Because we can use this same model to replace lots of other services with S3 we can greatly reduce the number of dependencies, which results in higher availability. The release model for the IMDB team is greatly simplified in this model. They can push new content to S3 whenever they want without any constraints imposed by the Amazon retail team. In fact, there’s a neat model for them to evolve their feature. They can simply put a new version of the feature into a new S3 bucket and we can flip which bucket the web servers are pulling the content from. If there is a problem with the new content we can instantly flip back to the old bucket. Finally, in 2006 the Amazon web site didn’t make a lot of use of AJAX. However, in retrospect this architecture set us up perfectly for AJAX features on the website. The browser can just as easily concatenate the HTML served by S3 into the web page as the web server can. This allows for a lot of flexibility in terms of how the web page is assembled without any underlying change in the storage.
  • That’s a pretty simple, albeit powerful, way that we started using AWS services on the Amazon retail web site. But now let’s jump forward to 2008 and look at something a little more complex.
  • In 2008 Amazon used several external monitoring services to measure the performance and reliability of our website so we could understand what our customers’ experience was like. There are lots of these services from different vendors but many of them turn out to be really expensive to use at scale. Additionally, since most of these services were black boxes we were never really sure what they were measuring. This is a screen shot of an internal application we built called “Client Experience Analytics”. The purpose of the application is to do external rendering of Amazon web pages in a real browser, save screenshots and metrics about the pages, and push the data into our metrics and alarming systems. Basically, it runs on an external network and provides us with a real perspective on what our customers experience when they use the amazon.com web site.
  • We knew we wanted to build an application like this, but there were several challenges. First, we knew the system would have a lot of moving parts. Rendering web pages from lots of remote sites and saving all the performance data is a complicated, workflow-based task. There are many components, each of which has to be reliable and scalable. Second, the application had to do the actual page rendering in remote data centers. When I say remote data centers I mean data centers that are not on the Amazon retail network fabric. Also, the more geographical diversity we could get in terms of these rendering agents the better. We suspected that the application would be pretty popular after we launched it and we wanted to be able to scale it up quickly and easily. Finally, we were given a development team of only two people and just a few months to produce the initial version of the software. That meant we had to find pre-built or reusable components to meet our timeline. In principle the solution was pretty simple. We would try to use as many of the AWS services as possible to avoid writing functionality ourselves.
  • This is an architecture diagram of the Client Experience Analytics application. I’m not going to walk you through every component in the application, but I do want to highlight the places where we made use of AWS services. The horizontal box at the top-center is Simple Queue Service. We push all the pages that need to be rendered into SQS. Below that the three small boxes represent our fleet of EC2 hosts that pull work out of the queue and then render the pages in a real browser – IE, Firefox, etc. Since these EC2 hosts are running in the AWS network, which is totally different from the Amazon retail network, we get real client-side performance data from the EC2 hosts. The EC2 boxes record the data they collect into three separate repositories. Screen shots of each page are pushed into S3. This allows our internal users to see the page exactly as it was rendered by the browser. Metadata about the requests and performance data is written to RDS and SDB. At the top you can see that we also pump data into CloudWatch so that we can easily produce graphs for our users. On the far right you see an orange arrow. This is our notification system where alarms are propagated. At the time we built this application Simple Notification Service didn’t yet exist. We will replace our own custom notification system with SNS in the future.
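The queue-and-worker flow described above might look something like the sketch below. It is a local illustration only: Python's standard `queue.Queue` stands in for SQS, plain dicts and lists stand in for S3 and the metrics stores, and `fake_render` is a hypothetical placeholder for driving a real browser. None of these names come from the actual application.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RenderResult:
    url: str
    load_ms: float
    screenshot: bytes


def run_worker(
    work: queue.Queue,
    render: Callable[[str], RenderResult],
    screenshots: Dict[str, bytes],
    metrics: List[dict],
) -> int:
    """Drain the work queue: render each page, store its screenshot and metrics.

    Returns the number of pages processed. Several workers could share the
    same queue, which is how the rendering fleet would scale out.
    """
    processed = 0
    while True:
        try:
            url = work.get_nowait()
        except queue.Empty:
            return processed
        result = render(url)
        screenshots[url] = result.screenshot  # stand-in for the S3 screenshot store
        metrics.append({"url": url, "load_ms": result.load_ms})  # stand-in for the metrics store
        processed += 1


def fake_render(url: str) -> RenderResult:
    """Hypothetical renderer; a real agent would load the page in a browser."""
    return RenderResult(url=url, load_ms=float(len(url)), screenshot=b"png:" + url.encode())
```

Because the workers only ever talk to the queue and the stores, adding rendering capacity in another geography is a matter of starting more workers there.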
  • So what were the results of this effort? Well, first we were able to deliver a complex application on a very short timeline with only a couple of development resources. It would have been impossible to do this without the pre-built services that AWS offers. Normally an application like this would require the negotiation of several additional co-lo agreements. I don’t know about your businesses, but at Amazon that could take many months and would require coordination with finance, tax, infrastructure, security and other departments throughout the company. But because EC2 is present in several different geographies we were able to deploy a global application effortlessly. Because the EC2 hosts are on an external network we get accurate client-side performance statistics. With traditional external monitoring solutions you can’t
  • OK, but the thing everyone is always asking about is our main web server fleet for amazon.com. What are we doing to migrate it to the cloud? One of the main benefits that people often talk about with the cloud is the ability to dynamically scale capacity up or down based on demand. The idea is that when you don’t need all your capacity you can save money by releasing it back to the cloud. And, in theory, the web server fleet should be the poster child for this dynamic capacity story. Additionally, because we are an e-commerce site with lots of credit card interaction this part of our infrastructure has to be completely PCI compliant. There is simply no way we can risk losing this certification. So let’s step forward one more year to 2009.
  • This is a typical weekly graph of traffic to the amazon.com web site. As you’d expect, there are peaks of usage during the day and troughs of usage at night. The variation from day to day is pretty consistent over the course of a week. If any of you run web sites you probably see traffic patterns very similar to this. Anyway, if the cloud can save you money by providing flexible capacity this would seem to be the ideal case for it.
  • Let me explain a little further. I spent several years of my life trying to figure out where to draw the red line on this graph. The line represents the expected maximum traffic plus a 15% buffer to account for any unexpected spikes. I ultimately got pretty good at predicting where we needed to draw the line and how much capacity amazon.com needed to purchase – at least assuming there were no unpredictable spikes in traffic due to product launches, unannounced sales or other external factors. The problem is that there’s a lot of area between that blue line and the red line. All of that area is web server capacity I’ve purchased but am not using. How much is going to waste?
  • In this slide the blue area of the graph is the percentage of the capacity we are actually using and the red area is the capacity we’ve purchased but that is going to waste due to the traffic cycle and the safety margin. You can see that during a typical week nearly 40% of the capacity we purchased was not being used. And, frankly, we did a lot better job of this than a lot of companies. It’s not uncommon for server fleets to be wasting more than 50% of their total capacity.
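The arithmetic behind the red and blue areas can be sketched as below, assuming capacity is fixed at the expected peak plus the 15% buffer mentioned earlier and that waste is the share of that capacity not used on average. The function names are hypothetical and any traffic series fed to them here would be made-up illustrative data, not Amazon's.

```python
from typing import Sequence


def provisioned_capacity(traffic: Sequence[float], buffer: float = 0.15) -> float:
    """Fixed capacity line: observed peak plus a safety buffer."""
    return max(traffic) * (1.0 + buffer)


def wasted_fraction(traffic: Sequence[float], buffer: float = 0.15) -> float:
    """Share of purchased capacity that sits idle, averaged over the period."""
    capacity = provisioned_capacity(traffic, buffer)
    average_used = sum(traffic) / len(traffic)
    return 1.0 - average_used / capacity
```

Even perfectly flat traffic wastes roughly 13% under this model, because the buffer alone is never used; the deeper the nightly troughs relative to the peak, the larger the wasted share becomes.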
  • But really the problem is worse than this. This graph shows a typical traffic pattern for the month of November on the amazon.com web site. You see, we don’t just have a daily traffic cycle. We also have an annual traffic cycle that revolves around the retail calendar, which peaks in the fourth quarter each year. As you can see in this graph amazon.com ramps way up over the course of November. Again, the red line represents the expected peak plus 15%.
  • When we calculate the area on this graph you can see that during November amazon.com was wasting about three-quarters of its available capacity. Obviously, wasting a lot of capacity is not consistent with our goal of offering customers the lowest possible prices on the items we sell. So there’s a huge business opportunity here if we can figure out a way to move the web server fleet to the cloud and scale it dynamically. But the problem is really a lot worse than what I’m making it out to be here. Depending on how long it takes to procure and provision those servers I may have to order them months in advance of when I need them in November and I have to pay for them the moment they hit my data center even if they aren’t yet serving traffic. And, of course, those hosts are still going to be sitting around after the holiday season passes even though I don’t need them any more.
  • So the problem is pretty obvious in this case. We are wasting lots of money in underutilized capacity. Additionally, unexpected spikes in load are challenging to deal with. If we can get spare capacity in time, we still have to bring up our server software on it under duress, which can lead to mistakes. Finally, scaling is often non-linear in this model. On amazon.com we tended to scale in units of racks, not individual servers. This means that if I only needed a few additional servers I would tend to scale in groups of 40 or so just to keep things simple. Furthermore, at some point I’m going to fill up all the rack positions in my existing data center and now adding one more unit of scalability is going to require me to build a brand new data center. That will cost millions of dollars and require serious lead time. [Say this next part kind of jokingly.] The solution is really simple. It even fits on a single line in a PowerPoint deck. We just need to migrate the entire web server fleet to AWS. Hmm, well, that’s easy. But we did come up with a plan for how to do it.
  • This slide is the architecture we came up with to transition the amazon.com web server fleet from what we call “classic” capacity to EC2. The customer traffic comes into the Amazon retail data center from the left and hits one of our existing production load balancers. What we did was to hook up our amazon.com data center to the AWS data center via the Virtual Private Cloud product. VPC makes AWS look just like your own data center from a networking standpoint. So the load balancer passes the request off to one of the web servers running in our EC2 clusters. You’ll note that we have web servers running in multiple availability zones – remember, you still have to architect for availability as you move to the cloud. The other nice thing about VPC is that those web servers can talk back across the VPC boundary to services and databases running in the Amazon retail data center to get any content that they need to compose the pages. Ultimately the page gets built on the web server and it is passed back across the VPC boundary to the Amazon retail data center and to the customer. I’m really proud about this next slide.
  • [If you deliver the next few lines correctly there will often be some applause from the audience in this section.] This date, November 10, 2010, is the day that we turned off the last physical web server for amazon.com in the Amazon retail data center. Since that date every single web page on the amazon.com web site has been served by our fleet of EC2 web servers. In my opinion this is a pretty remarkable accomplishment given that only a few years earlier we had a tightly coupled, monolithic, C++, Cartman architecture. I’m pleased to say that amazon.com site availability in Q4 of 2010 was the best it had ever been and we were easily able to handle several high profile product releases, big sales and a huge growth in the business overall.
  • So the results here are pretty obvious. We succeeded in moving our entire web server fleet for amazon.com – thousands of hosts – to the cloud. We are now in a position to dynamically scale our capacity up or down to meet customer demand. And we can scale up or down in units as small as a single host. I no longer worry about running out of space in a data center or having to build a new data center. I suppose someone over at AWS must worry about that sort of thing, but it’s not my problem. Finally, traffic spikes don’t cause nearly the problem that they used to. If we see an unexpected increase in load we simply provision more EC2 servers into the fleet. Of course, we can return them as soon as the traffic spike passes.
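The difference between scaling in racks and scaling in single hosts comes down to rounding up to the provisioning unit. A small sketch, assuming a rack holds 40 hosts (the "groups of 40 or so" figure from earlier) and using a hypothetical helper name:

```python
import math


def hosts_provisioned(hosts_needed: int, unit_size: int) -> int:
    """Round the requested host count up to a whole number of provisioning units."""
    if hosts_needed <= 0:
        return 0
    return math.ceil(hosts_needed / unit_size) * unit_size
```

With a unit of 40, needing 3 extra hosts means provisioning 40; with a unit of 1, it means provisioning exactly 3.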
  • We’ve moved into 2011. There is probably no piece of our infrastructure that has proven to be more problematic over the years than databases. We’ve constantly struggled to get our relational data stores to scale at a pace that can keep up with the growth of the business. So I thought it might be interesting to take a look at a somewhat novel approach we’ve implemented using AWS to deal with a database scaling issue.
  • One of the promises that Amazon makes to its customers is that you will always have the ability to review your complete order history. This screen shot shows my order history review page. I’m a bit embarrassed that I’ve only been a customer of Amazon since 1999. The old-timers at Amazon like to point out that I was pretty late to the e-commerce game. On the right in the red circle you can see that I can select any year to view the orders that I’ve placed during that year. As you might imagine, over the course of the company’s history Amazon’s retail customers have placed billions of orders. A few years ago we made an interesting discovery. Most discussion around database scaling revolves around how many transactions per second your database must process. However, in the process of trying to understand our infrastructure spending we stumbled upon the fact that there was a factor even more important than TPS -- the cumulative amount of data stored by the database. If you think about it this makes sense. As you get more and more data into your database it puts increasing memory pressure on the hardware. Reducing the accumulated data that a database host has to store can dramatically improve the ability of the data store to scale because more of the transactions can be served directly out of memory without hitting the disk.
  • This slide shows a high level view of the order retrieval service at Amazon. Obviously, it’s like pretty much every other service oriented pattern we’ve seen so far.
  • Here you can see the two most common ways that people approach scaling this type of architecture. In pattern 1 you simply buy bigger and bigger database boxes to handle the increased amount of data you need to store or transactions you need to process. In pattern 2 you shard your data across more instances to cope with the same factors. Pattern 1 gets expensive as you move into more and more exotic hardware platforms, and at some point you will hit a wall where there just isn’t a big enough server to handle the load. Pattern 2 adds complexity in terms of failure cases, replication and handling inter-server communication. Ultimately neither one of these patterns makes us very happy.
  • So the problems are pretty straightforward. First, the cumulative data stored, not just the transactions per second, has a major impact on our ability to scale. It seems like we should be able to take advantage of the fact that lots of the older order data is infrequently accessed and that customers might be willing to wait a bit longer to get that data. Second, we don’t really like any of the conventional approaches toward dealing with the challenges of scaling databases. Each carries its own pitfalls. Third, the most expensive “classic” hardware in the Amazon retail server fleet is our database boxes. Our DBAs and DB engineers require us to use high-end SCSI drives, ECC memory and other expensive components or they won’t support our applications. To the degree that we can reduce the use of this type of hardware we can save lots of money. The solution we came up with is to create a tiered-storage solution using AWS. That solution takes advantage of the fact that there are really two types of data in our order database. First there is the highly dynamic, constantly changing influx of new orders that customers are constantly checking to view their delivery dates. Then there’s the set of older orders that are immutable – the items have been delivered and too much time has passed for the customer to return the item.
  • The architecture of the solution looks like this. We denormalize and move orders from our relational order database to an S3 bucket when those orders move into a “closed”, or immutable, state.
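A rough sketch of this tiering idea follows. It is not the team's implementation: two in-memory dicts stand in for the relational order database and the S3 bucket, a self-contained JSON document is assumed as the denormalized form, the `status` and `order_id` field names are hypothetical, and encryption of the cold store is left out.

```python
import json
from typing import Dict, Optional


class TieredOrderStore:
    """Hot tier for mutable orders, cold tier for closed (immutable) ones."""

    def __init__(self) -> None:
        self.hot: Dict[str, dict] = {}  # stand-in for the relational order database
        self.cold: Dict[str, str] = {}  # stand-in for an S3 bucket: order ID -> JSON document

    def put(self, order: dict) -> None:
        self.hot[order["order_id"]] = order

    def archive_closed(self) -> int:
        """Move every closed order out of the hot tier; return how many moved."""
        closed_ids = [oid for oid, order in self.hot.items() if order["status"] == "closed"]
        for oid in closed_ids:
            # Assumption: one self-contained JSON document per order is the denormalized form.
            self.cold[oid] = json.dumps(self.hot.pop(oid), sort_keys=True)
        return len(closed_ids)

    def get(self, order_id: str) -> Optional[dict]:
        """Check the hot tier first, then fall back to the slower cold tier."""
        if order_id in self.hot:
            return self.hot[order_id]
        document = self.cold.get(order_id)
        return json.loads(document) if document is not None else None
```

The caller of `get` does not need to know which tier answered, so the order history page can keep presenting every year while the database only holds the orders that can still change.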
  • The results of this cloud implementation are pretty amazing. The team is taking a phased approach to migrating the cold orders to S3. So far they’ve moved more than 670 million orders to their encrypted S3 repository. That’s more than 4 terabytes of data. I checked with the team a few weeks ago and they predict that within the next year or so they will have in excess of 50TB in their cold order store. Removing all of this cold data from Oracle has dramatically reduced the amount of money we need to spend on the ordering database instances. Although S3 is slower than pulling these orders from the database, that performance delta is imperceptible to the customer. Finally, by reducing the footprint of these databases we can now start thinking about ways to move the remainder of the data into one of the AWS database solutions.
  • So here we sit in 2011. The applications I’ve described today are only a small fraction of the systems that the Amazon retail business has migrated to the AWS cloud. We now push all of our server logs to S3 for long term storage. We back up our databases to S3. We store our source code in the cloud. Our build systems use EC2. And the list goes on and on. Throughout our process of migrating to AWS we’ve learned a lot of lessons about how to successfully move from what we call “classic” architectures to cloud architectures. So I’d like to take a moment to step back and reflect on some of these meta-lessons. The lessons come in two groups – business lessons and technical lessons.
  • The first set of takeaways from the last five years have to do with how we run the Amazon retail business in light of the cloud. First, I spend significantly less time worrying about capacity planning than I used to. Dynamic capacity in EC2 and the bottomless pit of storage that is S3 mean that the consequences of inaccurately forecasting demand are low. This allows me to focus on features that my teams are building instead of running infrastructure. Second, I have far fewer conversations with finance. They don’t have to deal with the big cap-ex requests that I used to submit and they understand the dynamic scaling model well enough to know that it allows us to run much leaner than we used to. Certainly they pay attention to the bill we get from AWS – yes, we get a bill just like you – but overall my conversations with finance are far less contentious. I get more innovation out of my organization now that we’ve started using the cloud. The Client Experience Analytics application is a good example. If I had had to negotiate half-a-dozen co-lo deals to get that project off the ground I never would have let them do the project. Because I say “No” less often the developers are happier. One nice thing is that I get to take credit for the AWS price reductions. When the finance guy asks me why the AWS bill went down in a given month I simply make up a story about how we focused on efficiency. It is important to think about any regulations and compliance requirements that your application may have. For instance, we have to ensure that there is absolutely no chance that we will run afoul of the PCI compliance requirements because it would be devastating for the retail business if we lost that certification. The good news is that we’ve been able to build lots of compliant applications using AWS. Just be sure to work with your internal audit, legal and security teams to verify that the implementation is acceptable. Finally, a personal favorite for me is that I don’t have to worry about lease returns any more. Prior to moving to the cloud I used to have to deal with lease returns every single year. It would take a lot of time from my project managers and devs to deal with the swap out of the hardware going off lease for the new hardware that was coming in.
  • The second set of takeaways is more technical in nature. The first is that it’s a good idea to pick a couple of simple applications to migrate so you can gain some initial experience with the cloud. We chose that IMDB feature on the detail page because it was a non-critical feature that only appeared on a subset of our web site. The approach to cloud-ifying it only involved one service and the architecture was very straightforward. Second, you don’t have to migrate a component in one fell swoop. Figure out the end state that you want to get to and then come up with an incremental plan that allows you to systematically get to that end state. A good technical program manager can be a big help in this regard. As you migrate your first few applications you’ll likely discover some reusable components that will be useful for migrating future applications. In Amazon retail an example was an encryption layer that sat on top of S3. This saved time because every developer didn’t have to reinvent the wheel each time. Be on the lookout for these types of generic components and support them across your organization. You are going to be charting some new ground in terms of security as you migrate to the cloud. My experience has been that you can either engage security as partners or you can treat them as your enemy. We engaged our security team very early and involved them in our design process. The result was that they felt invested in helping us figure out ways to accomplish our objectives and they played an important role in improving the final solutions we came up with. As you’ll recall from an earlier slide, by 2005 we had come up with some basic engineering principles that we knew we wanted to follow going forward: decoupling, simplicity, service-oriented architectures, etc. Look for opportunities to migrate to AWS in a way that furthers your overall architectural agenda. It’s pretty obvious that each of the examples I presented today aligned with that core engineering agenda. And finally, understand that the cloud is not going to make up for sloppy engineering. You still need to think about availability and performance. This means understanding the dependencies for your applications, building fault-tolerant systems and learning about concepts like availability zones and redundancy models.

2011 AWS Tour Australia, Closing Keynote: How Amazon.com migrated to AWS, by Jon Jenkins: Presentation Transcript

  • AWS Cloud Tour 2011, Australia
    Closing Keynote:
    How Amazon.com
    Migrated to AWS
    Jon Jenkins
    Director, Software Development
    Amazon.com
  • amazon.com’s Journey to the Cloud
    Jon Jenkins
    jjenkin@amazon.com
    Twitter - #awstour
    AWS Cloud Tour

  • 1995 - 2011
    1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  • +
  • First real data center
  • Distribution Center Isolation
  • “We have 50 million lines of C++ code. No, it's more than that now. I don't know what it is anymore. It was 50 million last Christmas, nine months ago, and was expanding at 8 million lines a quarter. The expansion rate was increasing as well. Ouch.”
    Amazon SDE, internal blog post
    September 2004
  • Decouple
    Service Oriented Architecture
    Scale Horizontally
    Increase Speed of Execution
    Develop Iteratively
    Seek Simplicity
  • What could we do with just S3?
  • IMDB Widget Architecture
  • The Problem
    • Release process is coupled
    • Runtime latency & scale requirements
    • Service integration issues
    The Solution
    • Use S3 as a service
    • Store raw HTML for the feature in S3
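
A minimal sketch of the idea on this slide, under stated assumptions: the feature team renders the widget once and stores the raw HTML as an object, and the web server only fetches that object at page-render time. The `ObjectStore` class is an in-memory stand-in for an S3 bucket, not the AWS SDK, and the key layout, function names, and "omit the widget when missing" behavior are all hypothetical illustrations rather than Amazon's actual implementation.

```python
from typing import Dict, Optional


class ObjectStore:
    """In-memory stand-in for an S3 bucket (hypothetical; not the AWS SDK)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)


def widget_key(product_id: str) -> str:
    # Hypothetical key layout: one HTML fragment per product.
    return f"imdb-widget/{product_id}.html"


def publish_widget(store: ObjectStore, product_id: str, html: str) -> None:
    """Feature-team side: render once and store the raw HTML."""
    store.put(widget_key(product_id), html.encode("utf-8"))


def render_detail_page(store: ObjectStore, product_id: str) -> str:
    """Web-server side: fetch the fragment; leave the widget out if it is missing."""
    fragment = store.get(widget_key(product_id))
    widget = fragment.decode("utf-8") if fragment is not None else ""
    return f"<div id='detail'>{widget}</div>"
```

In this sketch the page still renders when the fragment is absent, which is one way the web server could avoid a hard dependency on the feature team's service.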
  • Before
    After
  • Results
    • Reduced page latency
    • IMDB doesn’t worry about scaling
    • Reduced web server CPU utilization
    • Improved availability through reduced dependencies
    • Simplified release model
    • AJAX readiness
  • What about a more complex case?
  • The Problem
    • The system has lots of moving parts
    • It must run in an external data center
    • It must scale up quickly
    • Development team is two people
    The Solution
    • Use as many AWS services as possible
  • Results
    • Very few dev resources required
    • Launched without having to negotiate any new datacenter co-lo presence
    • True external performance metrics
    • We can test site features in development that have not yet launched
    • The system scales horizontally to large amounts of traffic
  • What about amazon.com web servers?
  • 39%
    61%
  • 76%
    24%
  • The Problem
    • Retail web site hardware is underutilized
    • Traffic spikes require heroic effort
    • Scaling is non-linear
    The Solution
    • Migrate the entire www.amazon.com web server fleet to AWS
  • November 10, 2010
  • Results
    • All traffic for www.amazon.com is now served from AWS
    • We can dynamically scale the fleet in increments as small as a single host
    • Traffic spikes can be handled with ease
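
To make "increments as small as a single host" concrete, here is a rough sketch of how a fleet size could be derived from current traffic. The function names, the 20% headroom default, and the per-host capacity figure are assumptions for illustration only; the talk does not describe the actual scaling policy.

```python
def hosts_needed(requests_per_sec: int, per_host_capacity: int,
                 headroom_pct: int = 20, min_hosts: int = 1) -> int:
    """Return the fleet size for the current load plus a safety margin.

    All parameters are hypothetical; integer math keeps the rounding exact.
    """
    if per_host_capacity <= 0:
        raise ValueError("per_host_capacity must be positive")
    demand = requests_per_sec * (100 + headroom_pct)
    capacity = 100 * per_host_capacity
    # Ceiling division: a partially used host still counts as a whole host.
    needed = (demand + capacity - 1) // capacity
    return max(min_hosts, needed)


def scaling_step(current_hosts: int, requests_per_sec: int,
                 per_host_capacity: int) -> int:
    """Positive result: add that many hosts. Negative: release that many."""
    return hosts_needed(requests_per_sec, per_host_capacity) - current_hosts
```

With a fleet of 12 hosts at 100 requests per second each, `scaling_step(12, 1001, 100)` asks for exactly one more host, while a drop to 500 requests per second would release six.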
  • What about a DB use case?
  • Basic Order Storage Architecture
  • Basic Order Storage Architecture
    Scaling Pattern 1
    Scaling Pattern 2
  • The Problem
    • Cumulative data impacts scale
    • No database scaling pattern is ideal
    • Database infrastructure is expensive
    The Solution
    • Create a tiered storage system with AWS
  • Results
    • 670 million orders (4 TB) now stored in S3
    • We are spending way less on DB hosts
    • Sets us up for migration to RDS / SDB
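
One way to picture a tiered order store is sketched below. It assumes that orders are tiered by age, which the slides do not state explicitly; both tiers are plain dictionaries standing in for the database hosts and for S3, and the class and method names are hypothetical.

```python
from typing import Dict, Optional


class TieredOrderStore:
    """Hot tier (database stand-in) plus cold tier (S3 stand-in), both dicts here."""

    def __init__(self) -> None:
        self.hot: Dict[str, dict] = {}
        self.cold: Dict[str, dict] = {}

    def put(self, order_id: str, order: dict) -> None:
        # New orders always land in the hot tier.
        self.hot[order_id] = order

    def archive_older_than(self, cutoff_year: int) -> int:
        """Move orders placed before cutoff_year to the cold tier; return the count."""
        old_ids = [oid for oid, o in self.hot.items() if o["year"] < cutoff_year]
        for oid in old_ids:
            # Copy to the cold tier first, then drop from the hot tier.
            self.cold[oid] = self.hot[oid]
            del self.hot[oid]
        return len(old_ids)

    def get(self, order_id: str) -> Optional[dict]:
        """Check the hot tier first, then fall back to the cold tier."""
        if order_id in self.hot:
            return self.hot[order_id]
        return self.cold.get(order_id)
```

Readers of an order do not need to know which tier holds it, so the hot tier can stay small as history accumulates. For scale, 4 TB spread across 670 million orders works out to roughly 6 KB per order, assuming decimal terabytes.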
  • Lessons learned
  • Business Lessons
    • Less time spent on capacity planning
    • Fewer conversations with finance
    • More innovation
    • Happier developers
    • I get credit for AWS price reductions
    • Be sure to consider compliance issues
    • No more lease returns!
  • Technical Lessons
    • Start with simple applications
    • Iterate toward your desired end-state
    • Identify reusable components
    • Engage security early and treat them as partners
    • Migrate to the cloud in concert with your other architectural objectives
    • The cloud can’t cover up sloppy engineering
  • amazon.com’s Journey to the Cloud
    Jon Jenkins
    jjenkin@amazon.com
    Twitter - #awstour
    AWS Cloud Tour