Top 10 Scalability Mistakes John Coggeshall
Welcome!
  • Who am I: John Coggeshall
  • Chief Technology Officer, Automotive Computer Services
  • Author, PHP 5 Unleashed
  • Speaker on PHP-related topics worldwide
  • Geek
What is Scalability?
  • Define: Scalability
      • The ability and flexibility of an application to meet the growth requirements of an organization
  • More than making a site go fast(er)
      • Scalability in human resources, for example
  • The “fastest” approach isn’t always the most scalable
      • OO is slower, but more scalable from a code maintenance and reuse standpoint
  • Failing to consider future needs during the architectural stages leads to an application API that cannot scale
The secret to scalability is the ability to design, code, and maintain your applications using the same process again and again, regardless of size
…From Traffic To Infrastructure…
“Scalability marginally impacts procedure, procedure grossly impacts scalability” - Theo Schlossnagle
You have to plan
  • Performance and resource scalability require forethought and process
      • Version control
      • Performance goals
      • Metric measuring
      • Development mailing lists
      • API documentation
  • Awareness is key
      • Think about these problems and how you will solve them as your project gets off the ground
Designing without Scalability
  • If your application does not perform, it will likely not succeed
  • What does it mean to perform?
      • 10 requests/sec?
      • 100 requests/sec?
      • 1000 requests/sec?
  • If you don’t know what it will take to meet your performance requirements, you probably won’t meet them
  • At its worst, you’ll be faced with a memorable and sometimes job-ending quote: “This will never work. You’re going to have to start all over.”
Performance Metrics
  • Response time
      • How long does it take for the server to respond to the request?
  • Resource usage
      • CPU, memory, disk I/O, network I/O
  • Throughput
      • Requests / second
      • Probably the most useful number to keep track of
Proactive vs. Reactive
  • Common scenario: Reactive
      • Write your app
      • Deploy it
      • Watch it blow up
      • Try to fix it
      • If you’re lucky, you might succeed “enough”
      • If you’re unlucky…
  • Correct approach: Proactive
      • Know your performance goals up front and make sure your application is living up to them as part of the development process
Everyone has a role in Performance
  • Architects: balance performance against other application needs
      • Interoperability
      • Security
      • Maintainability
  • Developers: you need to know how to measure and how to optimize to meet the goals
      • Web-stress tools, profilers, etc.
Designing with Scalability
  • When designing your application, you should assume it needs to scale
      • Quick and dirty prototypes are often exactly what gets to production
  • It’s easy to make sure your applications have a decent chance of scaling
      • MySQL: design assuming that someday you’ll need master/slave replication, for example
Designing with Scalability
  • Don’t write an application you’ll need three years from now; write the application you need today
  • Just think about what you might need in three years
Common Performance Blunders
  • The names have been changed to protect the innocent, as well as my wallet.
Network file systems
  • Problem: we have a server farm of 10 servers and we need to deploy our code base
      • Very common problem
  • Many people look to a technology like NFS
  • At least 90% of the time, this is a bad idea
      • NFS/GFS is really slow
      • NFS/GFS has tons of locking issues
Network file systems
  • So how do we deploy our code base?
      • You should always deploy your code base locally on the machine serving it
      • Rsync is your friend
  • What about run-time updates?
      • Accepting file uploads
      • They need to be available to all servers simultaneously
  • Solutions vary depending on needs
      • NFS may be an option for this small portion of the site
      • The database is also an option
I/O Buffers
  • I/O buffers are there for a reason: to make things faster
      • Sending 4098 bytes of data to the user when your system write blocks are 4096 bytes wastes a second write on two bytes
      • In PHP you can solve this using output buffering
  • At the system level you can also increase your TCP buffer size
      • Almost always a good idea; most distributions are very conservative here
      • Just be mindful of the amount of RAM you actually have
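A minimal sketch of the output-buffering point, assuming a page that is built from many small `echo` calls. The function name `emit_list` is hypothetical; `ob_start()` and `ob_end_flush()` are the standard PHP output-control calls.

```php
<?php
// Hypothetical helper: collect many small writes in PHP's output buffer
// and hand them on in one piece instead of one write per echo.
function emit_list(array $items): void
{
    ob_start();                      // start buffering output
    foreach ($items as $item) {
        echo '<li>', htmlspecialchars($item), "</li>\n";
    }
    ob_end_flush();                  // send the buffered output in one go
}
```

Setting `output_buffering = 4096` in php.ini has a similar effect for the whole script without any code changes.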
RAM Disks
  • RAM disks are a very nice way to improve the performance of an application, as long as you have a lot of memory lying around
  • Use RAM disks to store any sort of data you wouldn’t mind losing when the 16-year-old trips over the power cable
  • A reasonable alternative to shared memory
Bandwidth Optimization
  • You can optimize bandwidth in a few ways
  • Compression
      • mod_deflate
      • zlib.output_compression = 1 (PHP)
  • Content reduction via Tidy:

    <?php
    // Tidy options that strip markup the browser does not need
    $o = array("clean" => true,
               "drop-proprietary-attributes" => true,
               "drop-font-tags" => true,
               "drop-empty-paras" => true,
               "hide-comments" => true,
               "join-classes" => true,
               "join-styles" => true);
    $tidy = tidy_parse_file("php.html", $o);
    tidy_clean_repair($tidy);
    echo $tidy;
    ?>
Configuring PHP for Speed
    register_globals = off
    auto_globals_jit = on
    magic_quotes_gpc = off
    expose_php = off
    register_argc_argv = off
    always_populate_raw_post_data = off
    session.use_trans_sid = off
    session.auto_start = off
    session.gc_divisor = 10000
    output_buffering = 4096
Blocking calls
  • Blocking I/O can always be a problem in an application
      • E.g. attempting to open a remote URL from within your PHP scripts
  • If the resource is locked / slow / unavailable, your script hangs while it waits for a timeout
      • You might as well try to scale an application that has a sleep(30) in it
  • Very bad
Blocking calls
  • Solutions
      • Don’t use blocking calls in your application
      • At the very least, don’t use blocking calls in the heavy-load aspects of your application
      • Have out-of-process scripts responsible for pulling down data
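One way the out-of-process approach could look, as a sketch under assumptions: a cron job (or any worker outside the web request) calls `refresh_feed()` with a short timeout, and the page itself only calls `read_feed()` on a local file. Both function names and the cache-file layout are hypothetical.

```php
<?php
// Run from cron or another out-of-process worker, never from a web request.
function refresh_feed(string $url, string $cacheFile, float $timeoutSeconds = 2.0): bool
{
    // The 'timeout' option of the http stream context caps how long the fetch may block.
    $context = stream_context_create(['http' => ['timeout' => $timeoutSeconds]]);
    $data = @file_get_contents($url, false, $context);
    if ($data === false) {
        return false;               // keep serving the previous copy
    }
    // Write to a temporary file and rename it, so readers never see a partial file.
    $tmp = $cacheFile . '.tmp';
    file_put_contents($tmp, $data);
    return rename($tmp, $cacheFile);
}

// Called from the web request: a local read only, no remote I/O.
function read_feed(string $cacheFile, string $default = ''): string
{
    $data = is_readable($cacheFile) ? file_get_contents($cacheFile) : false;
    return $data === false ? $default : $data;
}
```

If the remote side is slow or down, the worker fails quietly and the site keeps serving the last good copy.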
Failing to Cache
  • Caching is one of the most important things you can do when writing a scalable application
  • A lot of people don’t realize how much they can cache
      • Rarely is a 5-second cache of any data going to affect user experience
      • Yet it will have a significant performance impact
  • Example: 1 page load = 2 queries per request
      • 2 queries * 200 requests / sec = 400 queries / second
      • 400 queries / second * 5 seconds = 2000 queries you didn’t do
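A minimal sketch of a short-lived cache like the 5-second one above. It stores results in files; the function name `cache_get`, the file layout, and the choice of files over shared memory are all assumptions made for illustration.

```php
<?php
// Hypothetical file-based cache: return the stored value if it is younger
// than $ttl seconds, otherwise call $producer and store its result.
function cache_get(string $key, int $ttl, callable $producer, string $dir)
{
    $file = $dir . '/' . md5($key) . '.cache';
    clearstatcache(true, $file);    // avoid a stale mtime within one request
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $cached = @unserialize(file_get_contents($file));
        if ($cached !== false) {    // note: a cached literal false is treated as a miss
            return $cached;
        }
    }
    $value = $producer();           // the expensive part, e.g. the two queries
    file_put_contents($file, serialize($value), LOCK_EX);
    return $value;
}
```

With a TTL of 5, the producer runs at most about once every five seconds per key, no matter how many requests arrive in between.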
Failing to Cache
  • Improving the speed of PHP can be done very easily using an opcode cache
  • PHP 6 was expected to have this ability built in to the engine; in practice, an opcode cache (OPcache) has been bundled with PHP since 5.5
Semi-Static Caching
  • If your web application has a lot of semi-static content
      • Content that could change, so it has to be stored in the DB, but almost never does
  • And you’re running on Apache
  • This design pattern is killer!
Semi-Static Caching
  • Most people in PHP would implement a page like this:
      • http://www.example.com/show_article.php?id=5
  • This script would be responsible for generating the semi-static page HTML for the browser
Semi-Static Caching
  • Instead of generating the HTML for the browser, make this script generate another PHP script that contains mostly static content
  • Keep things like personalization code, but make the actual article itself static in the file
  • Write the file to disk in a public folder under the document root
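A sketch of the generation step, under stated assumptions: `$article` stands in for a row read from the database, the function name `write_article_cache` is hypothetical, and the cookie-based greeting is only a placeholder for whatever personalization the real page keeps.

```php
<?php
// Hypothetical helper: render one article into a mostly static PHP file.
function write_article_cache(array $article, string $dir): string
{
    // Personalization stays dynamic; this part runs on every request.
    $header = '<?php $user = isset($_COOKIE["name"]) ? $_COOKIE["name"] : "guest"; ?>' . "\n"
            . '<p>Hello, <?php echo htmlspecialchars($user); ?></p>' . "\n";

    // The article itself is baked in as static HTML. It is escaped here,
    // so stored content cannot smuggle PHP tags into the generated file.
    $body = '<h1>' . htmlspecialchars($article['title']) . "</h1>\n"
          . '<div>' . htmlspecialchars($article['body']) . "</div>\n";

    $path = $dir . '/' . (int) $article['id'] . '.php';

    // Write to a temporary file, then rename, so a request never sees half a page.
    $tmp = $path . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, $header . $body);
    rename($tmp, $path);

    return $path;
}
```

When an article is edited, deleting or regenerating its file is enough to refresh the cached page.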
Semi-Static Caching
  • If you put the generated files in this directory:
      • http://www.example.com/articles/5.php
  • You can create a mod_rewrite rule such that, when the file does not exist yet, http://www.example.com/articles/5.php maps to http://www.example.com/show_article.php?id=5
  • Since show_article.php writes articles to files, once a page has been generated there are no more DB reads!
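A possible form of that rewrite rule, assuming it lives in an `.htaccess` file at the document root and that mod_rewrite is enabled. The `!-f` condition is what lets an already generated file be served directly; the exact pattern is an assumption based on the URLs above.

```apache
RewriteEngine On
# Only fall through to the generator when the cached file is missing
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^articles/([0-9]+)\.php$ /show_article.php?id=$1 [L,QSA]
```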
Semi-Static Caching
  • Simple and elegant solution
  • Allows you to keep pages “personalized”
  • Very easy to maintain
Poor database design
  • Database design is almost always the most important thing in your application
      • PHP can be used completely properly, but if you mess up the database you’re hosed anyway
  • Take the time to really think about your design
      • Read books on designing relational databases
      • Understand how indexes work, and use them
      • How much data?
Poor database design
  • For example: using MySQL MyISAM tables all the time
      • Use InnoDB instead if you can
      • Use MyISAM tables only if you plan on doing full-text searching
      • Even then, they shouldn’t be primary tables
Improperly dealing with database connections
  • Improperly using persistent database connections
      • Know your database: MySQL has a relatively light handshake process compared to Oracle
  • Using PHP to deal with database failover
      • It’s not PHP’s job, don’t do it.
Let me say that again…
  • I DO NOT CARE WHAT IT SAYS IN SOME BOOK, DO NOT USE PHP TO DETERMINE WHICH DATABASE TO CONNECT TO
Database connections
  • Bad: code that determines whether it is running in the dev environment and selects a different database in each case
  • Suicidal: code that determines whether the primary master in a MySQL setup is down and then attempts to seamlessly roll over to a hot-swap MySQL slave that you bless as master
  • GOOD: MySQL Proxy
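A sketch of what the application side might look like once routing lives outside PHP: the code connects to whatever single endpoint it is given (a proxy, for instance) and contains no environment checks or failover logic. The function name `db_dsn` and the `DB_HOST`, `DB_PORT` and `DB_NAME` settings are hypothetical.

```php
<?php
// Hypothetical helper: build one DSN from externally supplied settings.
// There is deliberately no "is this dev?" or "is the master down?" branch.
function db_dsn(array $env): string
{
    $host = isset($env['DB_HOST']) ? $env['DB_HOST'] : '127.0.0.1';
    $port = isset($env['DB_PORT']) ? (int) $env['DB_PORT'] : 3306;
    $name = isset($env['DB_NAME']) ? $env['DB_NAME'] : 'app';
    return "mysql:host={$host};port={$port};dbname={$name}";
}
```

The result can be passed straight to `new PDO(...)`; which server actually answers is decided by deployment configuration and the proxy, not by the script.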
Having your Cake and Eating it too
  • For those of us using MySQL, here’s a great replication trick from our friends at Flickr
      • InnoDB is under most circumstances considerably faster than MyISAM
      • MyISAM is considerably better suited for full-text searches
  • Trick: in master/slave replication, the slave’s table type can differ from the master’s
      • Set up a separate MyISAM full-text search farm
      • Connect to those servers when performing full-text searches
 
SQLite, Huh?
  • SQLite is a great database package for PHP that can really speed certain things up
      • It requires you to understand when and how to use it
  • SQLite is basically a flat-file embedded database
      • Crazy-fast reads, horrible writes (full database locks)
  • Answer: SQLite is a great lookup database
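A small sketch of the lookup-database idea, assuming the `pdo_sqlite` extension is available. The `country` table, its contents, and the function `country_name` are made up for illustration; an in-memory database is used here only to keep the example self-contained, where a real lookup table would live in a file that is written rarely and read often.

```php
<?php
// Build a tiny read-mostly lookup table (hypothetical data).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT NOT NULL)');

$insert = $db->prepare('INSERT INTO country (code, name) VALUES (?, ?)');
foreach ([['US', 'United States'], ['DE', 'Germany']] as $row) {
    $insert->execute($row);
}

// Reads are the cheap path: one indexed lookup, no network round trip.
function country_name(PDO $db, string $code): ?string
{
    $stmt = $db->prepare('SELECT name FROM country WHERE code = ?');
    $stmt->execute([$code]);
    $name = $stmt->fetchColumn();
    return $name === false ? null : $name;
}
```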
Keepalive Requests
  • Keepalive sounds great on paper
      • It can actually totally hose you if you aren’t careful
  • Use Keepalive if:
      • You use the same server for static and dynamic content
      • You know how to set the timeout intelligently
  • No Keepalive request should last more than 10 seconds
  • If Apache is serving 100% dynamic content, turn it off
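The corresponding Apache directives, as a sketch. The numbers are assumptions chosen to stay under the 10-second ceiling above, not recommendations from the talk.

```apache
# Mixed static/dynamic server: keep connections open, but only briefly
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100

# 100% dynamic server: use this instead of the block above
# KeepAlive Off
```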
Knowing where to Not optimize
  • Sooner or later, you (likely) will worry about optimization
      • Hopefully, you didn’t start after your application started blowing up (aka Twitter)
  • When trying to make scalability decisions, knowledge is the most important thing you can have
Knowing where to Not optimize
  • PHP has both closed-source and open-source profilers, which do an excellent job of identifying the bottlenecks in your application
  • vmstat and iostat are your friends
  • Optimize where it counts
Knowing where to Not optimize
  • Instrumentation of your applications is key to determining what matters most when optimizing
      • If you’re not logging, you’re shooting in the dark
  • White-box monitoring of your applications via tools like Zend Platform is enormously helpful in understanding what is going on
  • You can’t make good process (or business) decisions unless you understand how your web site is being used and by whom
Knowing where to Not optimize
  • Amdahl’s Law:
      • Improving code execution time by 50% when the code executes only 2% of the time will result in a 1% overall improvement
      • Improving code execution time by 10% when the code executes 80% of the time will result in an 8% overall improvement
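The two figures above are just the product of the section's share of run time and the local improvement. A small sketch that reproduces them; the function names are hypothetical.

```php
<?php
// Fraction of total run time saved when a section that accounts for
// $shareOfTime of execution becomes $localImprovement faster (both in 0..1).
function overall_time_saved(float $shareOfTime, float $localImprovement): float
{
    return $shareOfTime * $localImprovement;
}

// The same result expressed as a speedup factor (Amdahl's Law).
function overall_speedup(float $shareOfTime, float $localImprovement): float
{
    return 1.0 / (1.0 - overall_time_saved($shareOfTime, $localImprovement));
}

// overall_time_saved(0.02, 0.50) is 0.01, i.e. the 1% case above
// overall_time_saved(0.80, 0.10) is 0.08, i.e. the 8% case above
```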
Use Profilers
  • Profilers are easy to use
  • Profilers draw pretty pictures
  • Profilers are good
  • Use profilers
How a Profiler/Debugger works in PHP
  • Profilers / debuggers in PHP work remotely against the web server
Tips on using a profiler
  • When doing real performance analysis, here are a few tips to help you out:
  • Copy the raw data (function execution times) into a spreadsheet and do your analysis from there
  • Most profilers provide at least two execution figures per function call
      • A: the amount of time spent executing PHP code
      • B: the amount of time PHP spent internally
      • That means total = A + B
  • If you are spending a lot more time inside of PHP (B), you’ve got a blocking issue somewhere
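The same analysis can be done in a few lines instead of a spreadsheet. This sketch assumes the profiler export has already been turned into rows with a function name, time in PHP code (A) and internal time (B); the field names and the function `rank_by_internal_time` are hypothetical.

```php
<?php
// Add total = A + B and the internal share to each row, then sort so the
// functions with the most internal (likely blocking) time come first.
function rank_by_internal_time(array $rows): array
{
    foreach ($rows as &$row) {
        $row['total'] = $row['php'] + $row['internal'];
        $row['internal_share'] = $row['total'] > 0 ? $row['internal'] / $row['total'] : 0.0;
    }
    unset($row);
    usort($rows, function ($a, $b) {
        return $b['internal'] <=> $a['internal'];
    });
    return $rows;
}
```

A function near the top of the list with an internal share close to 1 is the kind of place to look for a blocking call.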
Something More…
  • Do not mistake something more for something better
  • Dev: “Hey, let’s build this great ORM that automatically generates its views like Ruby!”
  • Manager: “Sounds great, go to it”
  • <4 months pass>
  • Dev: “Here’s my two weeks’ notice, I quit”
  • Manager: “Okay John, you write it”
  • John: “Um, I have no idea what this guy did”
  • <2 months pass to rewrite the module in a way that we can maintain>
Something More…
  • Don’t use a sledgehammer when a tack hammer will do
  • Devs: just because your boss doesn’t know the difference doesn’t make it a good idea
      • It might seem like great job security to write code only you can maintain, but in reality all it will do is get you fired faster when they figure it out
  • Managers: know enough about the technologies to keep eager developers from leaving you holding the bag
Final Thoughts
  • Ultimately the secret of scalability is developing applications and procedures which scale both UP AND DOWN
      • You have to be able to afford to make the application to begin with
      • You have to be able to afford to make the application ten times bigger than it is
  • Without process, you will fail.
  • REMEMBER: In ANY application, there is only ever one bottleneck
  • Questions?

Apache Con 2008 Top 10 Mistakes
