Scaling a php my sql web application


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Scaling a php my sql web application

  1. 1. Scaling a PHP MySQL Web Application, Part 1By Eli WhiteTips for scaling your PHP-MySQL Web app based on real-world experiences at Digg, TripAdvisor, and otherhigh-traffic sites.Published April 2011Creating a Web application - actually writing the core code - is often the initial focus of a project. It’s fun, it’s exciting,and it’s what you are driven to do at that time. Hopefully however, there always comes a time when your applicationstarts to take higher traffic than originally expected. This is the point where people often start thinking about how toscale their Website.Ideally, you should be thinking about how your application is going to scale from the moment you first write code.Now, that is not to say that you should spend a lot of effort in your early development process targeted at anunknown future. Who knows what will happen and if your application will ever hit the traffic levels that require ascalability effort? But hopefully the most important lesson you can learn here is to understand what you will need todo to scale in the future. By knowing this, you can do only what you need at each phase of your project without“coding yourself into a corner”, ending up in a situation where it’s hard to take the next scalability step.I’ve worked for a number of companies and projects that over time had to deal with massive levels of Web traffic.These include Digg, TripAdvisor, and the Hubble Space Telescope project. In this two-part article I will share some ofthe lessons learned, and take you step by step through a standard process of scaling your application.Performance vs. ScalabilityBefore we go much further, we should discuss the differences (and similarities) between performance and scalability.Performance, in the context of a Web application, is how fast you can serve data (pages) to the end user. Whenpeople talk about increasing the performance of their application, they are talking typically about making it take300ms instead of 500ms to generate the content.Scalability, in contrast, is the quality that enables your application to grow as your traffic grows. A scalable applicationis one that theoretically, no matter how much traffic is sent toward it, can have capacity added to handle that traffic.Scalability and performance obviously are interrelated . As you increase the performance of an application, it requiresfewer resources to scale it, making scaling easier. Similarly, you can’t really call an application scalable if it requiresone Web server per user for adequate performance, as it would be untenable for you to provide that.PHP TuningOne thing that you should look at first is any low-hanging fruit within your Web server and PHP setup. There aresome very easy things you can look into that can immediately increase performance and perhaps relieve the need toscale at this point in time completely, or at least make it easier.The first of these is installing an opcode cache. PHP is a scripting language, and therefore recompiles the code uponevery request. Installing an opcode cache into your Web server can circumvent this limitation. Think of an opcodecache as sitting between PHP and the server machine; after a PHP script is first compiled, the opcode cacheremembers the compiled version and future requests simply pull the already compiled version.There are numerous opcode caches available. Zend Server comes with one built-in and Microsoft provides one forWindows machines called ‘WinCache’. One of the most popular ones is an open source product called APC.Installing any of these products is very straightforward, and doing so will give you immediate and measurableperformance gains.The next step that you should evaluate is if you can remove the dynamic nature of any of your Web pages. OftenWeb applications will have pages that are being generated by PHP but that actually rarely change. Examples mightbe an FAQ page, or a press release. Caching the generation of these pages and serving the cached content so thatPHP doesn’t need to do any work can save many CPU cycles.There are multiple ways to approach this task. One of them is to actually pre-generate HTML pages from your PHPand let the HTML pages be directly served to the end users. This could be done as a nightly process perhaps, so thatany updates to the Web pages do in fact go live eventually, but on a delayed schedule. Implementing this can be as
  2. 2. simple as running your PHP scripts from the command line, piping their output to .html files, and then changing linksin your application.However there is an even easier approach to this that requires less effort: implementing an on-the-fly cache.Basically your entire script output is captured into a buffer saved to the filesystem, into memory/cache, or into thedatabase, etc. All future requests for that same script just read from the cached copy. Some templating systems(such as Smarty) automatically do this for you, and there are some nice drop-in packages that can handle this for youas well (such as jpcache).The simplest version of this is something fairly easy to write yourself: the following code injected at the top of yourtypical PHP page will handle this for you (obviously replace the timeout and cache directories to meet your needs).Encapsulate this into a single library function, and you could cache pages easily as needed. As this exists, it blindlycaches pages based only upon the URL and GET parameters. If the page changes based upon a session, POSTdata, or cookies, you’d have to add those into the uniqueness of the filename created.<?php$timeout = 3600; // One Hour$file = /tmp/cache/ . md5($_SERVER[REQUEST_URI]);if (file_exists($file) && (filemtime($file) + $timeout) > time()) { // Output the existing file to the user readfile($file); exit();} else { // Setup saving and let the page execute: ob_start(); register_shutdown_function(function () use ($file) { $content = ob_get_flush(); file_put_contents($file, $content); });}?>Load BalancingWhen creating Web applications, everyone starts with a single server that handles everything. That’s perfectly fine,and in fact how things should usually be done. One of the first steps you will need to take to begin scaling your Webapplication to handle more traffic is to have multiple Web servers handling the requests. PHP is especially well suitedto horizontal scaling in this manner by simply adding more Web servers as needed. This is handled via loadbalancing, which, fundamentally, is simply the concept of having a central point where all requests arrive and thenhanding the requests off to various Web servers.
  3. 3. Figure 1 Load balancingThere are numerous options for handling your load balancing operationally. For the most part they fall into threedifferent categories.The first are the software balancers. This is software that you can install on a standard machine (usually Linuxbased) and that will handle all the load balancing for you. Each software balancer also comes with its own extrafeatures such as built in page output caching, gzip compression, and more. Most of the popular options in thiscategory are actually multipurpose software / Web servers themselves that happen to have a reverse proxy modeyou can enable to do load balancing. These include Apache itself, Nginx, and Squid. There is also some smallerdedicated load balancing software such as Perlbal. You can even do a very simple version of this in your DNS serverjust by using DNS rotation so that every DNS request responds with a different IP address, although that is obviouslynot very flexible.The second category is the hardware balancers. These are physical machines that you can buy, designed to handlevery large traffic levels and with custom software built into them. Two of the better-known versions of this are theCitrix Netscaler and the F5 BigIP. The hardware solutions often provider many custom benefits, and act as firewallsand security barriers as well.The final classification is a new one referring specifically to cloud-based solutions. When hosting on cloud providersyou obviously don’t have the ability to install a hardware-based solution, restricting you to the software-only solutions.But most cloud providers also have their own built-in mechanisms for handling load balancing across their instances.In some cases these are less flexible, but are much easier to setup and maintain.In the end, all of the solutions, whether software or hardware, all typically offer many of the same features. The abilityto manipulate or cache the data as it comes across, the ability to load balance either by random selection or throughwatching health meters on the various sub machines, and much more. I would recommend that you explore whatoptions exist for your hosting environment and simply find which one serves your situation the best. There are verygood setup guides for any solution you would choose online that cover all the basics.Load Balancing PreparationAs mentioned earlier in this article, it’s important to only do what you need to but to be mentally ready to take the nextsteps. Rarely will you ever setup load balancing when you are first building your application but you should keep afew things in mind to make sure that you don’t preclude yourself from doing it later.First of all, if you are using any kind of local memory caching solution (such as the one provided by APC) you need towrite your code without assuming that you’ll only have a single cache. Once you have multiple servers, data youstore on one machine is only accessible from that machine.Similarly a common pitfall is assuming a single file system. PHP Sessions can’t be stored in the file system anymoreonce you are using multiple servers, and you need to find another solution for them (such as a database). I’ve also
  4. 4. seen code that stored uploaded files and expected to read them back later. You therefore need to assume that youshouldn’t store anything on the file system.Now with all that said, you can still use the techniques listed previously if you feel there is a strong need for the rapiddevelopment of your initial application. What’s important is that you encapsulate any ‘single machine’ dependantcode. Make sure that it’s in a single function, or at least within a class. Then, when the time comes to move to loadbalancing, there’s only one place you need to update to a new solution.MySQL Master-Slave ReplicationAt this point you’ve hopefully got your Web server setup to be scaling wonderfully. Now most Web applications findtheir next bottleneck: the database itself. MySQL is very powerful but eventually you will run into the same problemas with Web servers where one simply isn’t enough. MySQL comes with a built-in solution for this called master-slavereplication.In a master-slave setup, you will have one server that acts as the master, the true repository of the data. You thensetup another MySQL server, configuring it as the slave of the master. All actions that take place on the master arereplayed on the slave. Figure 2 Master-slave setupOnce configured in this manner, then you make sure that your code talks to the ‘proper’ database depending uponthe action you need to take. Any time you need to change data on the server (update, delete and insert commands),you connect to the Master. For read access, you connect to the slave instead. This could be as straight forward assomething like:<?php$master = new PDO(mysql:dbname=mydb;host=, $user, $pass);$master->exec(update users set posts += 1);$slave = new PDO(mysql:dbname=mydb;host=, $user, $pass);$results = $slave->query(select id from posts where user_id = 42);?>There are two benefits to taking this approach for scalability. First of all, you’ve managed to separate the load on thedatabase into two parts. So off the bat you’ve reduced the load on each server. Now that’s typically not going to bean even split; ost Web applications are very read heavy, and so the slave will be taking more of the load. But thatbrings us to the second point: isolation. At this point you’ve isolated two distinct types of load, the write and read. Thiswill allow you greater flexibility in taking the next scalability steps.Slave Lag
  5. 5. There is really only one main pitfall when doing this (other than ensuring you are always talking to the right MySQLdatabase), and that is slave lag. A database slave doesn’t instantly have the same data that the master has. Asmentioned, any commands that change the database are essentially re-played on the slaves. That means howeverthat there is a (hopefully very short) period of time after an update is issued on the master before it will be reflectedon the slave. Typically this is in the order of milliseconds but if your servers are overloaded, slaves can get muchfarther behind.The main change that this makes to your code is that you can’t write something to the master and then immediatelyattempt to read it back. This otherwise is somewhat common practice if you are using SQL math, have defaultvalues, or triggers in place. After issuing a write command your code may not know the value on the server andtherefore want to read it in to continue processing. A common example would be:<?php$master->exec(update users set posts += 1);$results = $slave->query(select posts from users);?>Even with the fastest setup possible, this will not give expected results. The update will not have been re-playedagainst the slave before the query is executed and you will get the wrong count of posts. In cases like this therefore,you need to find workarounds. In the absolute worst case you can query the master for this data, though that defeatsthe purpose of “isolation of concerns”. The better solution is to find a way to “approximate” the data. For example, inan example where you are showing a user adding a new post, just read in the number of posts first and manually addone to it before displaying the value back to the user. If it happens to be incorrect because multiple posts were beingadded at the same time, it’s only for that one Web page. The database is correct for other users reading the value.Multiple MySQL SlavesAt this point the next logical step is to expand your MySQL database capacity horizontally by adding in additionalslaves. One master database can feed any number of slave machines - within some extreme limits. While fewsystems will experience this you do have theoretical limits where the single master can’t keep up with the multitude ofslaves. Figure 3 Multiple MySQL slavesI explained previously that one of the benefits of using master-slave was the isolation of concerns. Combine this withthe fact that most Web applications are read-heavy because they add data rarely compared with the need to accessdata to generate every page. This means that usually your scalability concern is read access. That’s exactly whatadding more slaves gives you. Each additional slave you add increases your database capacity just like adding moreWeb servers does for PHP.Now this leaves you with a potential complication. How do you determine to which database to connect? When youhad just a single database that decision was easy. Even when you had a single slave it was easy. But now toproperly balance the database load you need to connect to a number of different servers.
  6. 6. There are two main solutions that people deploy in order to handle this.The first is to have a dedicated MySQLdatabase slave for each Web server that you have. This is one of the simplest solutions and commonly deployed forthat reason. You can envision this as looking something like this: Figure 4 Dedicated database slave for each Web serverIn this diagram you see a load balancer with three Web servers behind it. Each Web server is speaking to its owndatabase slave which in turn gets data from a single master. The decision for your Web servers as to which databaseto access is very simple as it’s no different than in a single-slave situation - you just need to configure each Webserver to connect to a different database as its slave.However in practice when this solution is deployed it’s often taken one step further: one single machine that acts asboth Web server and database slave. This simplifies matters greatly as connections to your slave are simply to‘localhost’ and balancing is therefore built in.There is a large drawback to taking this simple approach however, and it causes me to always shy away from it: Youend up tying the scalability of your Web servers to your MySQL database slaves. Again, isolation of concerns is a bigthing for scaling. Quite often the scaling needs of your Web servers and of your database slaves are mutuallyindependent. Your application might need the power of only three Web servers, but need 20 database slaves if it’svery database intensive. Or you might need three database slaves and 20 Web servers if it’s PHP- and logic-intensive. By tying the two together, you are forced to add both a new Web server and a new database slave at thesame time. This also means that you are adding to the maintenance overhead.Combining Web server and database slave onto the same machine also means you are over utilizing resources oneach machine. Plus you can’t acquire specific hardware that matches the needs of each service. Typically therequirements for database machines (large and fast I/O) aren’t the same as for your PHP slaves (fast CPU).The solution to this is to sever the relationship between Web servers and MySQL database slaves, and randomizeconnections between them. Literally in this case you would have each request that comes to a Web server randomly(or with an algorithm of your choice) select a different database slave to which to connect.
  7. 7. Figure 5 Randomizing the connections between Web servers and database slavesThis approach enables the ability to scale your Web servers and database slaves independently. It can even enablesmart algorithms that figure out which slave to which to connect, based upon whatever logic works best for yourapplication.There’s also another great benefit that we’ve not talked about yet: stability. Via randomized connections, it’s possiblefor a database slave to go down and for your application to not care, as it will just pick another slave to connect toinstead.Usually its your job to write code within your application that picks which database to which to connect. There aresome ways to avoid this task, such as putting all the slaves behind a load balancer and connecting to it. But havingthe logic in your PHP code gives you, as a programmer, the greatest flexibility in the future. Following is an exampleof some basic database selection code that accomplishes this task:<?phpclass DB { // Configuration information: private static $user = testUser; private static $pass = testPass; private static $config = array( write => array(mysql:dbname=MyDB;host=, read => array(mysql:dbname=MyDB;host=, mysql:dbname=MyDB;host=, mysql:dbname=MyDB;host= ); // Static method to return a database connection: public static function getConnection($server) { // First make a copy of the server array so we can modify it $servers = self::$config[$server];
  8. 8. $connection = false; // Keep trying to make a connection: while (!$connection && count($servers)) { $key = array_rand($servers); try { $connection = new PDO($servers[$key], self::$user, self::$pass); } catch (PDOException $e) {} if (!$connection) { // We couldnt connect. Remove this server: unset($servers[$key]); } } // If we never connected to any database, throw an exception: if (!$connection) { throw new Exception("Failed: {$server} database"); } return $connection; }}// Do some work$read = DB::getConnection(read);$write = DB::getConnection(write);. . .?>Of course there are numerous enhancements you may want (and should) make to the above code before it could beused in production. You probably want to log the individual database connection failures. You shouldn’t store yourconfiguration as static class variables because you can’t change your configuration without code changes. Plus, inthis setup, all servers are treated as equals. It can be worthwhile to add the idea of server ‘weighting’ so that someservers could be assigned less traffic than others. In the end you’ll most likely want to encapsulate this logic into agreater database abstraction class that gives you even more flexibility.ConclusionBy following the steps in this article you should be well on your way to a scalable architecture for your PHPapplication. In the end, there really is no single solution, no magic bullet. It’s the same reason that there isn’t just oneapplication solution or one framework. Every application will have different bottlenecks and different scaling problemsthat need solved.Part 2 of this article will discuss more advanced topics for scaling your MySQL database beyond the techniquesmentioned here.
  9. 9. Scaling a PHP MySQL Web Application, Part 2By Eli WhiteTips for scaling your PHP-MySQL Web app based on real-world experiences at Digg, TripAdvisor, and otherhigh-traffic sites. In this portion: pooling and sharding techniques.Published April 2011In Part 1 of this article, I explained the basic steps to move your application from a single server to a scalable solutionfor both PHP and MySQL. While most Web applications will never need to expand beyond the concepts presented inthat article, eventually you may find yourself in a situation where the basic steps aren’t enough.Here are some more advanced topics on how you might change your setup to support additional throughput, isolate“concerns” better, and take your scalability to the next level.MySQL Database PoolingDatabase pooling is one fairly simple technique that allows for greater “isolation of concerns.” . This concept involveshaving a group of database slaves virtually separated into multiple pools. Each pool has a specific classes of queriessent to it. Using a basic blog as an example case, we might end up with a pool distribution like this: Figure 1 Database poolingNow the obvious question at this point is: Why? Given that all the slaves in the above setup are identical, then whywould you consider doing this?The main reason is to isolate areas of high database load. Basically, you decide what queries are OK to performmore slowly. In this example you see that we’ve set aside just two slaves for batch processes. Typically if you haveany batch processes that need to run in the background you aren’t worried about how fast they run. Whether it takesfive seconds, or five minutes, doesn’t matter. So it’s OK for the batch processes to have fewer resources, therebyleaving more for the rest of the application. But perhaps more important, the kinds of queries you may typically do
  10. 10. when performing large batch processes (and the reason they are not done in real time in the first place) can be ratherlarge and resource intensive.By isolating these processes off to their own database slaves, your batch processes firing off won’t actually affect theapparent performance of your Web application. (It’s amazing how many Websites get slower just after midnight wheneveryone decides to run cron jobs to background-process data.) Isolate that pain and your Website performs better.You’ve now created selective scalability.The same concept can apply to other aspects of your application as well. If you think about a typical blog, thehomepage only shows a list of posts. Any comments to those posts are displayed only on a permalink page.Retrieving those comments for any given post can be a potentially painful process, given there can be any number ofthem. It’s further complicated if the site allows threaded comments because finding and assembling those threadscan be intense.Therein lays the benefit of database pooling. It’s usually more important to have your homepage load extremely fast.The permalink pages with comments can be slower. By the time someone is going to load the full post, they’vecommitted and are willing to wait a little longer. Therefore, if you isolate a set of slaves (four in the example above) tobe specific to queries about comments, you can leave the larger pool of ‘primary’ slaves to be used when generatinghomepages. Again you’ve isolated load and created selective scalability. Your comment and permalink pages mightbecome slower under heavy load, but homepage generation will always be fast.One way that you can apply scalability techniques to this pool model is to allow on the fly changes to your pooldistribution. If you have a particular permalink that is extremely popular for some reason, you could move slaves fromthe primary pool to the comments pool to help it out. By isolating load, you’ve managed to give yourself moreflexibility. You can add slaves to any pool, move them between pools, and in the end dial-in the performance that youneed at your current traffic level.There’s one additional benefit that you get from MySQL database pooling, which is a much higher hit rate on yourquery cache. MySQL (and most database systems) have a query cache built into them. This cache holds the resultsof recent queries. If the same query is re-executed, the cached results can be returned quickly.If you have 20 database slaves and execute the same query twice in a row, you only have a 1/20th chance of hittingthe same slave and getting a cached result. But by sending certain classes of queries to a smaller set of servers youcan drastically increase the chance of a cache hit and get greater performance.You will need to handle database pooling within your code - a natural extension of the basic load balancing code inPart 1. Let’s look at how we might extend that code to handle arbitrary database pools:<?phpclass DB { // Configuration information: private static $user = testUser; private static $pass = testPass; private static $config = array( write => array(mysql:dbname=MyDB;host=, primary => array(mysql:dbname=MyDB;host=, mysql:dbname=MyDB;host=, mysql:dbname=MyDB;host=, batch => array(mysql:dbname=MyDB;host=, comments => array(mysql:dbname=MyDB;host=, mysql:dbname=MyDB;host=, ); // Static method to return a database connection to a certainpool public static function getConnection($pool) { // Make a copy of the server array, to modify as we go: $servers = self::$config[$pool]; $connection = false;
  11. 11. // Keep trying to make a connection: while (!$connection && count($servers)) { $key = array_rand($servers); try { $connection = new PDO($servers[$key], self::$user, self::$pass); } catch (PDOException $e) {} if (!$connection) { // Couldn’t connect to this server, so remove it: unset($servers[$key]); } } // If we never connected to any database, throw an exception: if (!$connection) { throw new Exception("Failed Pool: {$pool}"); } return $connection; }}// Do something Comment related$comments = DB::getConnection(comments);. . .?>As you can see, very few changes were needed. This of course was purposeful. Knowing that pooling was desirable,the original code was formulated to be extensible. Any code designed to randomly select from a "pool" of read slavesin the first place should be able to be extended to simply understand that there are multiple pools to choose from.Of course, the comments in Part 1 apply to this code as well: You will probably want to encapsulate your logic withina greater database abstraction layer, add better error reporting, and perhaps extend the features offered.MySQL ShardingAt this point, we’ve covered all the truly easy steps towards scalability. Hopefully by this point, you’ve found solutionsthat work. Why? Because the next steps can be very painful from a coding perspective. I know this personally forthey are steps that we had to take when I worked at Digg.If you need to scale farther, it’s usually for one of a couple reasons. All of them reflect various pain points orbottlenecks in your code. It might be because your tables have gotten gigantic, with tens or hundreds of millions ofrows of data in them, and your queries are simply unable to complete quickly. It could be that your master databaseis overwhelmed with write traffic. Perhaps you have some other pain point which is similar to these situations.One possible solution is to shard your database. Sharding is the common-use term for a tactic practiced by many ofthe well known Web 2.0 sites, although many other people use the term partitioning. Partitioning is in fact a way tosplit individual tables into multiple parts within one machine. MySQL has support for some forms of this where eachpartition has the same table schema, which makes the data division transparent to PHP applications.So what is sharding? The simplest definition I can come up with is just this: “Breaking your tables or databases intopieces.”Really, that’s it. You are taking everything you have and breaking it into smaller shards of data. For example, youcould move last year’s infrequently accessed blog comments to a separate database. This allows you to scale moreeasily and to have faster response times since most queries look at less data. It can also help you scale write-throughput by breaking out into using multiple masters. This comes at a cost, though, on the programming side. Youneed to handle much more of your data logic within PHP, whether that is the need to access multiple tables to get thesame data, or to simulate joins that you can no longer do in SQL.Before we get into discussing that various forms of sharding and their benefits (plus drawbacks) there is something Ishould mention. This entire section addresses doing manual sharding and controlling it within your PHP code. Being
  12. 12. a programmer myself, I like having that control as it gives me the ultimate flexibility. Also in the past, it’s been thesolution that in the end won out. There are some forms of sharding that MySQL can do for you via features such asfederation and clustering. I do recommend that you explore those features to see if they might be able to help youwithout taking the extra PHP load. For now, let’s discuss the various forms of manual sharding.Vertical ShardingVertical sharding is the practice of moving various columns of a table into different tables. (Many people don’t eventhink of this tactic as sharding per se but simply as good-old database design via normalizing tables.) There arevarious methodologies that can help here, primarily based around moving rarely used or often-empty columns into anauxiliary table. This often keeps the important data that you reference in one smaller table. By having this smaller,narrower table, queries can execute faster and there is a greater chance that the entire table might fit into memory.A good rule of thumb often is to move any data that is never used in a WHERE clause into an auxiliary table. Theidea is that you want the primary table that you perform queries against to be as small and efficient as possible. Youcan later read the data from the auxiliary table after you know exactly which rows you need.An example might be a users table. You may have columns in that table that refer to items you want to keep butrarely use. Perhaps their full name or their address, which you keep for reference about the user but you neverdisplay on the Website nor search based upon it. Moving that data out of the primary table can therefore help.The original table might look like:A vertically sharded version of this would then look like:It should be noted that vertical sharding is usually performed on the same database server. You aren’t typicallymoving the auxiliary table to another server in an attempt to save space, but you are breaking a large table up andmaking the primary query table much more efficient. Once you’ve found the row(s) of information in Users that youare interested in, you can get the extra information - if needed - by highly efficient ID lookups on the UsersExtra.Manual Horizontal Sharding
  13. 13. Now if vertical sharding is breaking up the columns of a table into additional tables, horizontal sharding is breakingthe rows in a table into different tables. At this point, the sole goal behind doing this is to take a table that hasbecome so long to be unwieldy (millions upon millions of rows) and to break it into usable chunks.If you are doing this manually by spreading the shared tables across multiple databases, it is going to have an effecton your code base. Instead of being able to look in one location for certain data, you will need to look in multipletables. The trick with doing this in a way to have minimal impact on your code is to make sure that the methodologyused to break your table up matches how the data will be accessed in the code.There are numerous common ways to manually do horizontal sharding. One could be range-based, wherein the firstmillion row IDs are in the first table, the next million row IDs in table two, and so on. This can work fine as long as youalways know that you are going to know the ID of the row you will want, and therefore can have logic to pick the righttable. A drawback of this method is that you’ll also need to create tables as thresholds are crossed. It means eitherbuilding table creation code into your application, or always making sure that you have a "spare" table, andprocesses to alert you when it starts being used.Another common solution that obviates needing to make new tables on-the-fly is to have interlaced rows. This is theapproach, for example, of using the modulus function on your row numbers, and evenly dividing them out into a pre-set number of tables. So if you did this with three tables, the first row would be in table 1, the second row in table 2,the third in table 3, and then the fourth row would be in table 1.This methodology still requires you to know the ID in order to access the data, but now removes the concern of everneeding a new table. At the same time, it comes with a cost: If you choose to break into three tables, and eventuallythose tables get too large, you will come to a painful point of needing to redistribute your data yet again.Beyond the ID based sharding, some of the best sharding techniques cue off different data points that closely matchhow the data will be accessed. For example, if writing a typical blog or news application, you typically will alwaysknow the date of any post that you are trying to access. Most blog permalinks store the month and year within themsuch as: This means that any time that you are going to read in a post, youalready know the year and month . Therefore, you could partition the database based upon the date, making aseparate table for each year or perhaps a separate table for each year/month combination.This does mean that you need to understand the logic for other access needs as well. On the homepage of yourWebsite, you might want to list the last five posts. To do this now, you might need to access two tables (or more).You’d first attempt to read in five posts from the current year/month table, and if you didn’t find all five, start iteratingback in time, querying each table in turn until you’ve found all that you need.In this date-based setup, you do still need to make new tables on the fly, or have them pre-made and waiting. At leastin this situation, you know exactly when you will need new tables, and could even have them pre-made for the nextfew years, or a cronjob that makes them on a set schedule.Other data columns can be used for partitioning as well. Let’s explore another possibility using the data we used forhorizontal partitioning. In the case of this Users table, we might realize that in our application we will always know theusername when we go to access a specific user’s information, perhaps because they have to login first and provide itto us. Therefore, we could shard the table based upon the usernames, putting different sections of the alphabet intodifferent tables. A simple two-table shard could therefore look like:
  14. 14. This horizontal sharding can be used within the same database just to make tables more manageable. However mostoften this tactic begins to see use when you have so much data that you need to start breaking it out into multipledatabases. This could be either for storage concerns or throughput. In this case you can easily, once sharded, moveeach table onto their own master with their own slave pools, increasing not only your read throughput, but your writethroughput as well.Application-Level ShardingAt this point we’ve discussed the ideas of breaking your tables up both by rows and by columns. The next step, whichI will refer to as application-level sharding, is explicitly taking tables from your database and moving them ontodifferent servers (different master/slave setups).The sole reason for this is to expand your write throughput and to reduce overall load on the servers by isolatingqueries. This of course comes with - as all sharding does - the pain of rewriting your code to understand where thedata is stored, and how to access it now that you’ve broken it up.The interesting thing about application-level sharding is that if you’ve already implemented "pool" code (as shownpreviously), that exact same code can be used to access application shards as well. You just need to make aseparate pool for each master and make pools for each set of shard slaves. The code needs to ask for a connectionto the right pool for the shard that holds your data. The only real difference from before is that you have to be extracareful with your queries, since each pool isn’t a copy of the same data.One of the best tactics when looking to break your database up is to base it upon a natural separation of how yourapplication accesses data. Keep related tables together on the same server so that you don’t lose the ability toquickly access similar data through the same connection (and perhaps even keep the ability to do joins across thoserelated tables).For example, let’s look at this typical blog database structure and how we might shard it: Figure 3 Keeping related tables together
  15. 15. As you can see, we’ve broken this table into two application-based shards. One contains both the Users and Profilestables. The other contains the blog Entries and Comments tables. The related data is kept together. With onedatabase connection you can still read both core user information as well as their profiles. With another connectionyou can read in the blog entries and their associated comments and potentially even join between those tables ifneeded.ConclusionIn the end, all of these techniques - pooling and various forms of sharding - can be applied to your system to help itscale beyond its current limitations. They should all be used sparingly and only when needed.While setting your system up for MySQL database pooling is actually fairly easy and painless, sharding is a differentmatter, and in my opinion it should only be attempted once you have no other options available to you. Everyimplementation of sharding requires additional coding work. The logic used to conceptually break your database ortables up needs to be translated into code. You are left without the ability to do database joins or can do them only ina limited fashion. You can be left without the ability to do single queries that search through all of your data. Plus youhave to encapsulate all of your database access behind more abstraction layers that understand the logic of wherethe data actually exists now, so that you can change it in the future again if needed.For that reason, only do the parts that directly match your own application’s needs and infrastructure. It’s a step thata MySQL-powered Website may need to take when data or traffic levels rise beyond manageable sizes. But it is nota step to be taken lightly.