Scalable talk notes


These are my notes from my talk "Building Scalable Websites with Perl"


Building Scalable Websites with Perl
by Perrin Harkins

Who is doing it?

First, let's establish some credit with any doubters in the audience. I shouldn't have to tell you this, but Perl runs some of the largest websites in the world. Take a look at some of the better-known examples.

Yahoo uses Perl in nearly all of their properties, in particular the personalized My Yahoo service. On the whole, Yahoo serves three billion page views per day to about 100 million unique users. Yahoo also owns Overture, the largest sponsored search company; according to their posting on the Perl jobs list, they handle "more than 10 billion transactions per month!"

Amazon, the company that pretty much defines e-commerce, uses Perl on their main site and partner sites. Amazon also operates the popular Internet Movie Database, which is built in Perl.

Ticketmaster, the largest on-line ticket retailer, is built almost entirely with Perl. So is its sister company CitySearch, which operates the most widely-used city guide sites in the US.

Nielsen NetRatings says that Yahoo, Amazon, and InterActiveCorp, which owns Ticketmaster Online and CitySearch, are all in the top 10 in terms of overall web traffic. We're talking about phenomenal numbers of users and page views here. By comparison, the sites that people frequently point to as high-traffic Perl sites are barely a drop in the bucket.

How are they doing it?

Okay, so your company probably doesn't get as much traffic as Yahoo. Still, you may be wondering: what do these really large sites do that allows them to scale so big, and is it something you could apply to your own sites? Obviously, these are all very different applications, and there is no single solution for scaling all of them. Even buying a lot of hardware isn't a magic bullet, since it just isn't feasible to buy enough computing power to prop up a slow application at these levels of traffic.
However, what you discover when you talk to people who work at these sites is that a few common techniques tend to get used by almost everyone, in one form or another. These are fundamental software techniques that have been around for ages, not some kind of newly invented Internet magic. (Feel free to refer to them as design patterns if it will raise your salary.) Today we're going to talk about a couple of these and how they apply to web development problems.

Things we won't be covering
I should also mention what we're not going to talk about.

We're not going to talk about mod_perl tuning: httpd.conf settings, reverse proxy configurations, increasing copy-on-write memory sharing, running the profiler, and so on. This stuff is very well documented in the mod_perl books and the on-line documentation. If you're serious about building a scalable site and you haven't read these resources yet, get on it!

We're not going to talk about DBI tuning. Tim Bunce has detailed slides from his talks available on CPAN, and there is more in the mod_perl documentation and books.

We're not going to talk about hardware because, well, I'm not very interested in hardware. That's for cheaters. (However, I'm willing to cut the sites I mentioned above a little slack on this...)

Caching

Caching helps performance by reducing the amount of work that needs to be done, and helps scalability by reducing the load on shared resources like databases. All of the sites I mentioned above cache like mad wherever they can. Page caching, object caching, de-normalized database tables: all of these are variations on a theme. Even if your data is so volatile that it changes every 30 seconds, if it only takes 1 second to generate, you still get to serve it from cache for the other 29.

Whole Pages

If you can possibly get away with it, cache entire HTML pages and serve them as static files. This is simply unbeatable from a performance standpoint; web servers and operating systems have been tuned to serve static files with incredible efficiency. At one company where I worked, we cached all of the non-interactive pages (i.e. the ones that people just browsing the catalog would see) as static files, and serving those pages was about ten times as fast as generating the same pages on the fly, even when all of the data needed to create them was cached in our mod_perl servers.

There are a few ways to make this happen.
One of them is simply to write out all of the possible pages on your site on a regular basis. You can write a big batch job that generates all the files for your website, probably by reading a database and then pounding the data through templates. Sometimes people write elaborate versions of this, with dependency checking and make-like functionality; see the ttree program that comes with Template Toolkit for one take on it.

However, you can also do this for a site that was not built to be pre-published. Many tools exist for spidering websites to local copies, so all you have to do is point one at your dynamic site and dump it out as static files:

    wget --mirror --convert-links --html-extension \
         --reject gif,jpg,png --no-parent http://app-server/dynamic/pages/

In reality, most sites would end up needing something more customized than this, but a simple tool like this can at least give you something to run benchmarks against.

This kind of approach is only feasible if your site is small enough to write out the whole thing on a regular basis. If your site is a front-end to a large database of some kind, you might have potentially millions of different pages to publish. There might be a few pages that get the vast majority of the hits, though, and
are thus worth caching. Rather than trying to figure out which ones to pre-publish, you can use a generate-on-demand approach. This is what most people think of when they hear talk about caching web pages.

The simplest way to do it is with a caching proxy server. If you've read the mod_perl documentation, you should be familiar with the idea of a reverse proxy, sometimes called an HTTP accelerator. It's an HTTP proxy that sits in front of your server, passing through requests for dynamic pages. You can configure it to cache those pages, and tell it how long to keep them cached by setting the Expires and Cache-Control headers during page generation.

    ProxyRequests Off
    ProxyPass /dynamic/stuff http://app-server/
    ProxyPassReverse /dynamic/stuff http://app-server/
    CacheRoot "/mnt/proxy-cache"
    CacheSize 500000
    CacheGcInterval 12
    CacheMaxExpire 36
    CacheDefaultExpire 2

These pages are not quite as fast as regular static ones -- mod_proxy checks the headers at the top of the cached file to make sure it hasn't expired before serving it -- but they are much faster than dynamic generation.

Note that this only works for pages which you can generate on the fly in a reasonable amount of time. If you have a page that takes two minutes to generate, you need to generate it before users ask for it. Of course, you can still use this approach and seed the cache with some artificial requests beforehand, which basically gives you a mix of the generate-on-demand and pre-generation approaches.

One final variation worth mentioning is intercepting the 404 error. It works like this: you set up your program as the handler for 404 "Not Found" errors on the site. When a page is requested that is not found on the file system, that triggers a 404 and sends the request over to you. You then generate the requested page and write it out to the file system, so that it will be there the next time someone comes looking for it.
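The application side of the proxy setup just described can be sketched in a few lines of mod_perl. This is an illustrative fragment only: the package name, the generate_page() helper, and the 30-second lifetime are all invented for the example.

```perl
# Hypothetical mod_perl 1.x content handler that marks its output as
# cacheable for 30 seconds, so a mod_proxy cache in front of it can
# serve the page without contacting the application.
package My::PageHandler;
use strict;
use Apache::Constants qw(OK);
use HTTP::Date qw(time2str);

sub handler {
    my $r = shift;
    my $html = generate_page($r);   # stand-in for your page generation

    $r->header_out('Expires'       => time2str(time + 30));
    $r->header_out('Cache-Control' => 'max-age=30');
    $r->content_type('text/html');
    $r->send_http_header;
    $r->print($html);
    return OK;
}

1;
```

With headers like these, the proxy will keep serving its cached copy for 30 seconds before passing another request through to the application.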
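The 404-interception scheme can be sketched the same way. Again, this is a hypothetical handler: generate_page() and the directory layout are made up for illustration.

```perl
# Hypothetical mod_perl 1.x ErrorDocument handler: generate the missing
# page, write it to the file system, then serve it with an internal
# redirect so future requests hit the static file directly.
package My::Generator;
use strict;
use Apache::Constants qw(OK);
use File::Basename qw(dirname);
use File::Path qw(mkpath);

sub handler {
    my $r = shift;

    # Apache records the originally requested URI in REDIRECT_URL
    # when it fires an ErrorDocument.
    my $uri  = $ENV{'REDIRECT_URL'};
    my $html = generate_page($uri);  # stand-in for your page generation
    my $file = $r->document_root . $uri;

    mkpath(dirname($file));
    open my $fh, '>', $file or die "can't write $file: $!";
    print $fh $html;
    close $fh;

    # Let Apache serve the new file like any other static page.
    $r->internal_redirect($uri);
    return OK;
}

1;
```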
This is the approach that Vignette StoryServer used for caching, at least in its early days. It's easy to configure an Apache server to do this:

    ErrorDocument 404 /page/generator

This makes Apache do an internal redirect to the program at /page/generator, passing information about the originally requested URL in environment variables. The program writes out the file, and then, if you're using mod_perl, you can just do an internal redirect to the newly generated page and let Apache handle it like any other file.

The upside is great performance, since the pages are served as normal static files. The downside is that you have to manage expiring these pages yourself, probably with a cron job that checks for files that are too old and deletes them. You run the risk of serving a file a little past its expiration time if the cron job doesn't run frequently enough.

In general, I think the caching proxy approach is easier to manage, but if you are using something other than mod_perl -- like FastCGI, which already separates the Perl interpreters from the web server -- there is not as much incentive to run a proxy.

Chunks of HTML or Data

Many of you were probably thinking during that last part, "That sounds great, but my web designers insisted
on putting the current user's name on every page. I can't cache the whole thing." Obviously, sites like Amazon or My Yahoo can't cache the whole page either. They can cache pieces of pages, though, and reduce page generation to little more than knitting the pieces together, like server-side includes. Yahoo uses this technique quite a bit, generating the pieces of content for the portal in advance, and building a custom template for each user, based on their preferences, that pulls in the appropriate pieces at request time.

By the way, you may be aware that PHP is being used at Yahoo now, and assumed that this meant it was replacing Perl. That's not the case. PHP is mostly being used for this sort of include-template work, replacing some older in-house solutions. The content generation that was done in Perl is still being done in Perl.

The caching built into the Mason web development framework is a good example of caching pieces. It allows you to cache arbitrary content with a key and an expiration time, and then retrieve it later:

    my $result = $m->cache->get($search_term);
    if (!defined($result)) {
        $result = run_search($search_term);
        $m->cache->set($search_term, $result, '30 min');
    }

You can cache generated HTML, or you can cache data which you've fetched from a database or elsewhere. Caching the generated HTML gives better performance, because it lets you skip more work when you get a cache hit (the HTML generation), but caching at the data level means you get to reuse the cached content when it shows up in multiple different layouts, which increases your chances of getting a cache hit.

One of the top apartment listing services on the web uses Mason's cache to store results on a commonly used search page. Since there is a fair amount of repetition in these searches, they are able to serve 55% of the search hits from cache instead of going to the database. That also frees up database resources for other things.
I created a simple plugin module for Template Toolkit that adds partial-page caching, available on CPAN as Template::Plugin::Cache. It's only really useful if you have templates that do a lot of work, fetching data and the like inside the template itself, which is generally not the best way to use Template Toolkit. When using a model-view-controller style of development, you will typically be caching data, and doing it before you get to the templates.

If you want to add caching to your application, there are several good options on CPAN. For a local cache on a single machine, I would recommend Rob Mueller's Cache::FastMmap. BerkeleyDB is about the same speed if you use the OO interface and built-in locking, but you'd have to build the cache expiration code yourself. Both of these are several times as fast as the popular Cache::FileCache module, and hundreds of times faster than any of the modules built on top of IPC::ShareLite.

    our $Cache = Cache::FastMmap->new(
        cache_size  => '500m',
        expire_time => '30m',
    );
    $Cache->set($key, $value);
    my $value = $Cache->get($key);

My only real complaint about Cache::FastMmap is that it doesn't provide a way to set different expiration times for individual items. You could add this yourself in a wrapper around Cache::FastMmap, but at that point it loses its main advantage over BerkeleyDB, which is the built-in expiration and purging functionality.
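If you did want per-item expiration times on top of Cache::FastMmap, the wrapper can be as small as storing a deadline next to each value. Here is a sketch of that idea; the package name and interface are invented, not a published module.

```perl
# Hypothetical wrapper that adds per-item expiration to Cache::FastMmap
# by storing [deadline, value] pairs and treating stale entries as misses.
package My::ExpiringCache;
use strict;
use Cache::FastMmap;

sub new {
    my ($class, @args) = @_;
    return bless { cache => Cache::FastMmap->new(@args) }, $class;
}

sub set {
    my ($self, $key, $value, $ttl) = @_;
    # A deadline of 0 means the item never expires on its own.
    $self->{cache}->set($key, [ $ttl ? time + $ttl : 0, $value ]);
}

sub get {
    my ($self, $key) = @_;
    my $entry = $self->{cache}->get($key) or return undef;
    my ($deadline, $value) = @$entry;
    return undef if $deadline && time > $deadline;  # stale: report a miss
    return $value;
}

1;
```

Expired entries are simply ignored on read; reclaiming their space is still left to Cache::FastMmap's own purging, which is exactly the trade-off mentioned above.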
For a cache that needs to be shared across a whole cluster of machines, you need something different. Memcached is a cache server that you access over the network. It keeps the cached items in RAM, but it can be scaled to large amounts of data by running it on multiple servers. Requests are automatically hashed across the available servers, spreading the data set out across all of them. It uses some recent advances, like the epoll system call in the Linux 2.6 kernel, to offer impressive scalability, and it is already in production use on at least one very large website.

    use Cache::Memcached;

    # Server addresses shown here are examples.
    my $memd = Cache::Memcached->new({
        servers => [ "10.0.0.15:11211", "10.0.0.15:11212",
                     [ "10.0.0.17:11211", 3 ] ],
        debug              => 0,
        compress_threshold => 10_000,
    });
    $memd->set($key, $value, 5*60);
    my $value = $memd->get($key);

If that sounds like more than you want to deal with, you can build something simpler with MySQL. Because MySQL has an option to use a lightweight non-transactional table type, it is a good choice for this kind of application. Just create a simple table with key, value, and expiration time columns, and use it the way you would use a hash. If you follow DBI best practices, you can get performance that beats most of the cache modules on CPAN, except the ones I mentioned here.

Job Queuing

I could go on for hours about caching, but there are other important things to cover. Let's say you run a website that sells concert tickets. That means that at a specific, publicly announced time, Madonna tickets will go on sale. That, in turn, means that a staggering number of people will all be waiting at 11am on Sunday morning with their fingers poised above the mouse button, ready to click "buy" until they get a ticket. But wait, it gets worse! In order to give people who are trying to buy tickets by phone or in person a fair shot at the action, you are only allowed to put holds on a certain number of tickets at a time, meaning that only that many people can be in the process of actually buying a ticket at once.
Does this sound like a good way to ruin your weekend? This is the sort of thing that Ticketmaster has to deal with routinely. How do you handle excessive demand for a limited resource? The same way you do in real life: you make people line up for it. Queues are a common approach for preventing overload and making efficient use of resources.

[ queue diagram ]

So, what have we accomplished with our queue? First of all, we control how many processes handle requests in parallel, so we won't overwhelm our backend systems. Second, since it hardly takes any time at all to queue a request or check its status, we keep our web server processes free to handle more users. The site stays responsive even when there are far more users sending in requests than we can actually handle at one time. Finally, we provide frequently updated status information to users, so they won't leave or try to resubmit their requests.

Queues are also useful when you have long-running jobs. For example, suppose you're building a site that compares prices on hotel rooms by making price quote requests to a bunch of remote servers and comparing
them. That could take some time, even if you send the requests in parallel. You can keep the browser from timing out by using the standard forking technique, where you fork off a process to do the work and return an "in progress" page. When the forked process finishes handling the request, it writes the results to a shared data location, like a database or a session file. Meanwhile, the page reloads itself, and until the results are available it just keeps sending back the "in progress" page. Randal Schwartz has an on-line article that demonstrates this technique.

However, this doesn't completely solve the problem. Say these jobs take 15 seconds to complete. What happens if 1000 people come in and submit jobs during those 15 seconds? You'll have 1000 new processes forked! A queue approach avoids this by just dropping the requests onto the queue and letting the already-running job processors handle them at a fixed rate.

Modules to Use

Now that you know what queues are good for, where do you get one? The Ticketmaster code is closely tied to their backend systems, so it's not open source, but there are other options. One that you can grab from CPAN is Jason May's Spread::Queue. This is built on top of the Spread toolkit for reliable multicast messaging. What Spread provides is a scalable way to send messages out across a cluster of machines and make sure they are received reliably and in order. (It actually provides other things too, but this is the part that Spread::Queue uses.)

The system consists of three parts: a client library, a queue manager, and a worker library. The client library is called from your code when you want to add a request to the queue; that sends a request to the queue manager using Spread. You define your job-processing code in a worker class. You can start as many worker processes as you like, and they can be on any machine in the cluster. They will register themselves and begin accepting jobs.
In the client process:

    use Spread::Queue::Sender;

    my $sender = Spread::Queue::Sender->new("myqueue");
    $sender->submit("myfunc", { name => "value" });
    my $response = $sender->receive;

In the worker process:

    use Spread::Queue::Worker;

    my $worker = Spread::Queue::Worker->new("myqueue");
    $worker->callbacks(
        myfunc => \&myfunc,
    );
    $SIG{INT} = \&signal_handler;
    $worker->run;

    sub myfunc {
        my ($worker, $originator, $input) = @_;
        my $result = {
            response => "I heard you!",
        };
        $worker->respond($originator, $result);
    }

The Spread::Queue system looks very attractive, but there are a few things it could use. There doesn't seem to be a way to check where a particular job is in the queue, or even to ask whether a job is done without blocking until it is. Also, the queue is not stored durably: it lives only in the memory of the queue manager process, so if that process dies, the entire state of the queue is lost. Adding these features would make a good project for someone, and that someone may be me, if I need them before anyone else does.

Where to Learn More

If some of these concepts are new to you and you want to learn more about them, the good news is that there is lots of good technical writing on these subjects. The Perl Journal, including the "best of" collection that O'Reilly has been publishing, is a good resource, and so is the "Algorithms in Perl" book. The bad news is that some of the most interesting material is written for a Java audience. My advice is that if you want to learn how to do this kind of scalable web development well, you can't stay trapped in one community or one language -- you need to see what other people are doing. I like Martin Fowler's books, because he doesn't have an agenda to push and isn't trying to sell you on a particular tool or API. Similarly, the O'Reilly sites get some good stuff; the Java content there is mostly open-source oriented, so it's much less fluffy than most Java sites.

Acknowledgements

I'd like to thank Craig McLane and Adam Sussman of Ticketmaster, and Zack Steinkamp of Yahoo, for being very generous with their time in answering my questions while I was working on this talk.