Queues, Pools, Caches


Transaction processing systems are generally considered easier to scale than data warehouses. Relational databases were designed for this type of workload, and there are no esoteric hardware requirements. Mostly, it is just a matter of normalizing to the right degree and getting the indexes right. The major challenge in these systems is their extreme concurrency, which means that small temporary slowdowns can escalate to major issues very quickly.

In this presentation, Gwen Shapira will explain how application developers and DBAs can work together to build a scalable and stable OLTP system - using application queues, connection pools and strategic use of caches in different layers of the system.

Published in: Technology

  • We are a managed service AND a solution provider of elite database and System Administration skills in Oracle, MySQL and SQL Server
  • When we have hundreds of processes going as fast as possible, pile-ups will happen.
  • Key takeaway: A developer may or may not know anything about how the connection pool is configured. Prepare to dig.
  • Something causes response times to go up. Maybe we’ll figure out the cause, or maybe it was too short and fleeting to be recorded. The slowdown causes the application server to open more and more connections, adding load to the shared pool. In addition, these connections may well all be running the same query, with the application configured to retry again and again until something works. This causes heavy latch contention and, depending on memory configuration, perhaps swapping too. This puts more load on the CPU, which causes response times to climb, which means the app will open more connections…
  • Keeping in mind that OLTP workloads are typically CPU-bound, the number of concurrent users the system can support is limited by the number of cores on the database server. A database with 12 cores can typically only run 12 concurrent CPU-bound sessions.
  • Most developers are well aware that if the connection pool is too small, the database will sit idle while users are either waiting for connections or are being turned away. Since the scalability limitations of small connection pools are known, developers tend to avoid them by creating large connection pools, and increasing their size at the first hint of performance problems. However, a connection pool that is too large is a much greater risk to application scalability. Here is what the scalability of an OLTP system typically looks like. Amdahl’s law says that the scalability of the system is constrained by its serial component, as users wait for shared resources such as IO and CPU (this is the contention delay). According to the Universal Scalability Law, there is a second delay called “coherency delay”, which is the cost of maintaining data consistency in the system; it models waits on latches and mutexes. After a certain point, adding more users to the system will decrease throughput.
  • Even before throughput starts to decrease, at the point where throughput stops growing linearly, requests start to queue and response times suffer proportionally. If you check the wait events for a system that is past the point of saturation, you will see very high CPU utilization, high “log file sync” waits as a result of the CPU contention, and high waits for concurrency events such as “buffer busy waits” and “library cache latch”.
  • Oversized connection pools have to be re-established during failover events or database restarts. The larger the connection pool is, the longer the application will take to recover from a failover event, which decreases the availability of the application.
  • Load testing for the number of connections the DB can easily sustain is the key. When determining the maximum number of connections in the pool, run the test with a fixed number of users (typically the highest you expect to support) and keep increasing the number of connections in the connection pool until response times become unacceptable, the wait events go from “CPU” to concurrency events, and the database CPU utilization goes above 70%. Typically all three of these symptoms will start occurring at approximately the same time.
  • While we believe that most of the connections will be idle most of the time, we can’t be certain that this will be the case. When the database suffers a performance issue, the application or users are bound to retry running the same slow queries over and over, eventually using the entire pool and overloading the database. If your connection pool can grow at these times, it will open new connections, a resource-intensive operation as we previously noted, to a database that is already abnormally busy.
  • When running a load test is impractical, you will need to estimate the number of connections based on available data. The factors to consider are: How many cores are available on the database server? How many concurrent users or threads does the application need to support? When an application thread takes a connection from the pool, how much of the time is spent holding the connection without actually running database queries? The more time the application spends “just holding” the connection, the larger the pool will need to be to support the application workload. How much of the database workload is IO-bound? You can check IOWAIT on the database server to determine this. The more IO-bound your workload is, the more concurrent users you can run without running into concurrency contention (you will see a lot of IO contention though). “Number of cores” x 4 is a good starting point for the connection pool size: use less if the connections are heavily utilized by the application and there is little IO activity, and more if the opposite is true.
  • The remaining problem is what to do if the number of application servers is large and it is inefficient to divide the connection pool limit among the application servers. Well-architected systems usually have a separate data layer that can be deployed on a separate set of servers. This data layer should be the only component of the application allowed to open connections to the databases. In this architecture, the connections are divided between a small number of data-layer servers. This design has three great advantages. First, the data layer usually grows much more slowly than the application and rarely requires new servers to be added, which means that pools rarely require resizing. Second, application requests can be balanced between the data servers based on the remaining pool capacity. Third, if there is a need to add application-side caching to the system (such as Memcached), only the data layer needs modification.
  • There are many options for handling excess user requests. The only thing that is not an option: throwing everything at the database and letting the DB queue the excess load.
  • If service time suddenly increases, investigate. Arrival rate is monitored for capacity planning. Queue size can be calculated. An increase in queue length will immediately impact the SLA, and should be investigated and resolved. Note that bursts of longer queue size are expected, so investigate only if the queue size does not decrease within a reasonable time. High utilization rates will impact queue size. Measuring utilization will assist in capacity planning and in determining when more or fewer resources are needed.
  • Marc Fielding created a high-performance queue system with a database table and two jobs. Some prefer to implement their queues with a file, split and xargs. But if your requirements include multiple subscribers that may or may not need to retrieve the same message multiple times, may or may not want to ack messages, and may or may not want to filter specific message types, then I recommend using an existing solution.
  • Memcached is probably the cheapest way to use more low-latency RAM in your system. It is simple and scales very well.
  • It looks like Memcached doesn’t scale well. But a closer look reveals that it is the client that did not scale – I could not drive more than 60K requests per second. Not bad for a small cache at 18 cents per hour.
  • There is no downside to a Memcached cluster that is too big. However, Memcached servers can and will fail, sending more traffic to the database. Make sure that Memcached nodes are sized so that the failure of a single node will not cripple the application.
  • You use SimCache to see that with a cache size of 10G you will have a hit ratio of 95% in Memcached. Memcached has a latency of 1ms in your system. With 5% of the queries hitting the database, you expect the database CPU utilization to be around 20%, almost 100% of the DB Time on the CPU, and almost no wait time on the queue between the business and the data layers (you tested this separately when sizing your connection pool). In this case the database latency will be 5ms, so we expect the average latency for the data layer to be 0.95*1+0.05*5=1.2ms.
  • An increase in the number of cache misses will definitely mean that the database load is increasing at the same time, and can indicate that more memory is necessary. Make sure that the number of gets is higher than the number of sets. If you are setting more than getting, the cache is a waste of space. If the number of gets per item is very low, the cache may be oversized. There is no downside to an oversized cache, but you may want to use the memory for another purpose. An increase in the number of evictions can also indicate that more memory is needed. Evicted time shows the time between the last get of the item and its eviction. If this period is short, this is a good indication that a memory shortage is making the cache less effective.
  • Queues, Pools, Caches
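
The scalability curve described in the notes above (throughput that flattens and then drops as sessions are added) can be sketched with the Universal Scalability Law. This is a minimal illustration, not a sizing tool: the `alpha` (contention) and `beta` (coherency) values are made-up assumptions that would have to be fitted from a real load test, and `starting_pool_size` only encodes the "number of cores x 4" starting point from the notes.

```python
import math


def usl_throughput(n, alpha, beta, single_session_rate=1.0):
    """Throughput at n concurrent sessions under the Universal Scalability Law.

    alpha models the contention delay (waiting for shared resources);
    beta models the coherency delay (latches, mutexes).
    With beta == 0 the formula reduces to Amdahl's law.
    """
    return single_session_rate * n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_concurrency(alpha, beta):
    """Concurrency at which USL throughput peaks; beyond it, throughput falls."""
    return math.sqrt((1 - alpha) / beta)


def starting_pool_size(db_cores, multiplier=4):
    """Starting guess from the notes: cores x 4, to be adjusted by testing."""
    return db_cores * multiplier


if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # hypothetical coefficients, not measured values
    print("peak near", round(usl_peak_concurrency(alpha, beta)), "sessions")
    for n in (12, 31, 100, 5000):
        print(n, "sessions ->", round(usl_throughput(n, alpha, beta), 2))
    print("starting pool size for 12 cores:", starting_pool_size(12))
```

With these assumed coefficients the curve peaks at a few dozen sessions, which is why a pool of 5000 connections against 12 cores can hurt rather than help.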

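The queue-monitoring notes above say that queue size can be calculated from service time, arrival rate and utilization. One way to sketch that calculation is the textbook M/M/c model (Erlang C), treating each pooled connection as a server. The model choice is an assumption made for illustration: it presumes random arrivals and exponentially distributed service times, which a real OLTP workload may not follow, and the numbers in the example are invented.

```python
from math import factorial


def utilization(arrival_rate, service_time, servers):
    """Fraction of total server capacity in use (arrival rate x service time / servers)."""
    return arrival_rate * service_time / servers


def mean_queue_length(arrival_rate, service_time, servers):
    """Average number of requests waiting (not yet in service), M/M/c model."""
    offered_load = arrival_rate * service_time
    rho = offered_load / servers
    if rho >= 1:
        return float("inf")  # arrivals outpace capacity: the queue grows without bound
    tail = (offered_load ** servers / factorial(servers)) / (1 - rho)
    head = sum(offered_load ** k / factorial(k) for k in range(servers))
    p_wait = tail / (head + tail)  # Erlang C: probability that a request has to wait
    return p_wait * rho / (1 - rho)


def mean_queue_wait(arrival_rate, service_time, servers):
    """Average time spent waiting in the queue, via Little's law (L = lambda x W)."""
    return mean_queue_length(arrival_rate, service_time, servers) / arrival_rate


if __name__ == "__main__":
    # Hypothetical workload: 5ms service time, 48 connections, rising arrival rate.
    for rate in (4800, 8600, 9500):
        print(rate, "req/s",
              "utilization", round(utilization(rate, 0.005, 48), 2),
              "queue length", round(mean_queue_length(rate, 0.005, 48), 2))
```

The point of the sketch is the shape rather than the exact numbers: queue length stays near zero at moderate utilization and climbs steeply as utilization approaches 100%.
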
    1. Queues, Pools, and Caches. Presented by: Gwen Shapira, Senior Consultant
    2. About Myself • 13 years with a pager • Oracle ACE • Oak Table member • Senior consultant for Pythian • @gwenshap • http://www.pythian.com/news/author/shapira/ • shapira@pythian.com © 2009/2010 Pythian
    3. Why Pythian. Recognized Leader: • Global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and SQL Server • Work with over 150 multinational companies such as Forbes.com, Fox Sports, Nordion and Western Union to help manage their complex IT deployments. Expertise: • One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 7 Oracle ACEs/ACE Directors • Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC. Global Reach & Scalability: • 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response
    5. WHY?
    6. OLTP: High Concurrency, Low Latency, Low Variance
    8. Our mission: Use modern application design to control database concurrency to its maximum throughput, lower latency and make the system more predictable.
    9. Nobody expects modern application design!
    10. Our chief weapons are: Connection Pools • Queues • Caches • And fanatical monitoring • And ruthless capacity planning • And nice red uniforms!
    11. Connection Pools (section 1)
    12. The Problem: Opening a database connection is a high-latency operation. OLTP systems can’t afford this latency for every user request.
    13. The Solution
    14. Diagram: Application Business Layer; Application Data Layer (DataSource Interface, JNDI, DataSource, Connection Pool, JDBC Driver)
    15. New Problems (cycle diagram): Slow response time → Run out of connections → Add more! → CPU + latch contention → Slow response time
    16. 5000 connections in pool…? But only 12 cores?
    19. And that’s not all… How long does it take to open 5000 connections to the database?
    20. New Solutions: 1. Load test 2. Limit connection pool 3. ???? 4. Low latency!!!
    21. Objection! 1. But I don’t use all connections at once 2. Our connection pool grows and shrinks automatically
    22. "Dynamic connection pools are a loaded gun pointed at your system. Feeling lucky, punk?" Graham Wood, Oracle
    23. Objection! Problem: We can’t load test. Solution: Guesstimate. 1. How many cores are available? 2. How much time is spent squatting a connection without running database queries? 3. How much workload is IO-bound?
    24. Objection! Problem: We have 5000 application servers. Solution: Data Layer. 1. Separate servers running data layer 2. Fewer servers 3. Load balance based on capacity
    25. Queues (section 2)
    26. The Problem: We have more user requests than database connections
    27. What do we do? 1. “Your call is important to us…” 2. Show them static content 3. Distract them with funny cat photos 4. Prioritize them 5. Acknowledge them and handle the request later 6. Jump them to the head of line 7. Tell them how long the wait is
    28. Solution – Message Queue: Msg 1, Msg 2, Msg 3, …, Msg N
    29. Diagram: Application Business Layer; Message Queue; Application Data Layer (DataSource Interface, JNDI, DataSource, Connection Pool, JDBC Driver)
    30. New Problems. Stuff developers say about message queues: 1. It is impossible to reliably monitor queues 2. Queues are not necessary if you do proper capacity planning 3. Message queues are unnecessarily complicated.
    32. Do Monitor: 1. Service time 2. Arrival rate 3. Queue size 4. Utilization
    33. Capacity Planning. Myth: With proper capacity planning, queues are not necessary. Fact: Over-provisioning is not proper capacity planning. Fact: Queueing theory is a capacity planning tool. Fact: Introducing a few well-defined and well-understood queues will help capacity planning.
    34. Complex Systems: 1. Queues are simple 2. Enterprise message queues are complex 3. Match solution to problem requirements
    35. Caches (section 3)
    36. The Problem: OLTP scales best when the working set is cached in RAM. RDBMSs have limited memory scalability.
    37. The Solution - Memcached (diagram: several App servers and several Memcached nodes)
    38. How is it awesome? 1. Less DB access 2. Less disk access 3. Distributed 4. Simple KV store 5. “Free” memory 6. Latency and availability resilience
    39. Amazon ElastiCache: Memcached cluster of any size in Amazon cloud. Unfortunately only accessible from EC2. 9 cents per node per hour!
    40. Linear Scalability?
    41. More Numbers: 1. 0.007ms latency on my desktop 2. 2ms latency on cloud 3. 60K gets a second 4. All from the smallest possible servers at 38 cents per hour.
    42. Diagram: Application Business Layer; Message Queue; Memcached; Application Data Layer (DataSource Interface, JNDI, DataSource, Connection Pool, JDBC Driver)
    43. New Problems • Does not apply automatically • How to use it effectively? • How to monitor it? • How big?
    44. Use Case - Select
        function get_username(int userid) {
            /* first try the cache */
            name = memcached_fetch("username:" + userid);
            if (!name) {
                /* not found: request database */
                name = db_select("SELECT username FROM users WHERE userid = ?", userid);
                /* then store in cache until next get */
                memcached_add("username:" + userid, name);
            }
            return name;
        }
    45. Use Case - Update
        function update_username(int userid, string username) {
            /* first update database; bind values in placeholder order */
            result = db_execute("UPDATE users SET username = ? WHERE userid = ?", username, userid);
            if (result) {
                /* database update successful: update cache */
                memcached_set("username:" + userid, username);
            }
        }
    46. Usage Advice: 1. Use the ASH 2. More memory, fewer cores 3. DB is for durable writes 4. Warm up the cache 5. Store nulls 6. Updates are tricky 7. Backward compatible schema
    47. How Big? Cluster: As big as you can. Node: Not too big to fail.
    48. What will we gain by adding 1G cache? 1. You can’t calculate 2. Log all cache hits and misses, by key 3. Or sample 4. Run cache simulator 5. Predict avg. latency
    49. 1ms * 0.95 + 5ms * 0.05 = 1.2ms
    50. Monitor: 1. Number of items, gets, sets and misses 2. Number of evictions and eviction time 3. Low hit rate and high eviction rate? 4. Swapping 5. Average response time 6. Number of connections
    51. Reminder: 1. Use Connection Pools 2. Limit the number of connections 3. Use queues to handle the excessive load 4. Use caches to make everything faster
    52. Thank you and Q&A. To contact us: sales@pythian.com, 1-866-PYTHIAN. To follow us: http://www.pythian.com/news/ http://www.facebook.com/pages/The-Pythian-Group/ http://twitter.com/pythian http://www.linkedin.com/company/pythian
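
The select and update use cases on slides 44 and 45, the "store nulls" advice on slide 46, and the latency estimate on slide 49 can be put together in one runnable sketch. Everything named here is hypothetical: `DictCache` and `FakeDb` are in-memory stand-ins for a Memcached client and the `users` table, not real client APIs, and the `MISSING` sentinel is just one possible way to cache a "no such user" answer.

```python
MISSING = "\x00missing"  # hypothetical sentinel so that "no such user" can be cached too


class DictCache:
    """In-memory stand-in for a Memcached client (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FakeDb:
    """In-memory stand-in for the users table; counts reads to show cache hits."""

    def __init__(self, users):
        self.users = dict(users)
        self.reads = 0

    def select_username(self, userid):
        self.reads += 1
        return self.users.get(userid)

    def update_username(self, userid, username):
        if userid not in self.users:
            return False
        self.users[userid] = username
        return True


def get_username(cache, db, userid):
    """Read path: try the cache first, fall back to the database on a miss."""
    key = f"username:{userid}"
    cached = cache.get(key)
    if cached is not None:
        return None if cached == MISSING else cached
    name = db.select_username(userid)
    cache.set(key, MISSING if name is None else name)
    return name


def update_username(cache, db, userid, username):
    """Write path: update the database first, refresh the cache only on success."""
    if not db.update_username(userid, username):
        return False
    cache.set(f"username:{userid}", username)
    return True


def expected_latency_ms(hit_ratio, cache_ms, db_ms):
    """Average data-layer latency for a given hit ratio (the slide 49 arithmetic)."""
    return hit_ratio * cache_ms + (1 - hit_ratio) * db_ms


if __name__ == "__main__":
    cache, db = DictCache(), FakeDb({1: "gwen"})
    print(get_username(cache, db, 1), get_username(cache, db, 1), "db reads:", db.reads)
    print("expected latency (ms):", round(expected_latency_ms(0.95, 1, 5), 2))
```

The second `get_username` call is served from the cache, so the read counter stays at one. A real deployment would also need expiry times and a strategy for concurrent updates, which slide 46 flags as tricky and which this sketch does not address.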