[B5]memcached scalability-bag lru-deview-100
Transcript

  • 1. Enhancing the Scalability of Memcached. Rajiv Kapoor, Intel Corporation, Sep 19 2012
  • 2. Content
    •  What is Memcached
    •  Usage model
    •  Measuring performance
    •  Baseline performance & scalability
    •  Performance root cause
    •  Base transaction flow
    •  Optimization goals, design considerations
    •  Optimized transaction flow
    •  Optimization details
    •  Optimized version performance
    •  Summary
  • 3. What is Memcached?
    •  Open Source distributed memory caching system
       −  Typically serves as a cache for persistent databases
       −  In-memory key-value store for quick data access
       −  For a particular “key”, a “value” is stored/deleted/retrieved, etc.
       −  Provides a networked data caching API that is simple to use and set up
    •  Used by many companies with web-centric businesses
    •  Most common usage model: web data caching
       −  Original data resides in a persistent database
       −  Database queries are expensive
       −  Memcached caches the data to provide low-latency access
       −  Helps reduce the load on the database
    •  Computational cache
    •  Temporary object store
  • 4. Web data caching usage model
    •  Memcached tier acts as a cache for the database tier
       −  Cache is spread over several memcached servers
    •  Client requests the “value” associated with a “key”
    •  A “GET” request for “key” is sent to memcached
    •  If “key” is found
       −  Memcached returns the “value” for “key”
    •  If “key” is not found
       −  Persistent database is queried for “key”
       −  “value” from the database is returned to the client
       −  A “SET” request is sent to memcached with “key” & “value”
    •  Key-value pair stays in the cache unless
       −  It is evicted because of cache LRU policies
       −  It is explicitly removed by a “DELETE” request
    •  Typical operations
       −  GET, SET, DELETE, STATS, REPLACE, etc.
    •  Most frequent transaction is “GET”
       −  Its performance impacts the most common use cases
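The GET-miss-then-SET flow on this slide is the classic cache-aside pattern. A minimal Python sketch (memcached itself is C; the dicts and the `cached_get` name here are illustrative stand-ins, not memcached's actual API):

```python
# Illustrative cache-aside lookup. Plain dicts stand in for the memcached
# tier and the persistent database; in production these would be real
# memcached and database clients.
cache = {}                       # stands in for the memcached tier
database = {"user:42": "alice"}  # stands in for the persistent database

def cached_get(key):
    """Return the value for key, filling the cache on a miss."""
    value = cache.get(key)       # "GET" request to memcached
    if value is not None:
        return value             # cache hit: database never touched
    value = database[key]        # miss: expensive database query
    cache[key] = value           # "SET" so later GETs hit the cache
    return value

print(cached_get("user:42"))     # miss: fetched from database, then cached
print(cached_get("user:42"))     # hit: served from the cache
```

The second call never reaches the database, which is exactly how the memcached tier reduces database load.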
  • 5. Measuring performance
    •  Measure the performance of the most important transaction: “get”
    •  Best performance = max “get” Requests Per Second (RPS) under SLA
       −  SLA (Service Level Agreement): average “get” latency <= 1 ms
    •  Measurement configuration is client-server
       −  Run memcached on one or more servers
       −  Run load generator(s) on client(s) to send requests to the memcached servers
       −  Load generator keeps track of transactions and reports results
    •  Process
       −  Load generator sends “set” requests to prime the cache with key-value pairs
       −  For incremental RPS in a range, do the following until average latency > 1 ms:
          −  Send random-key “gets” for 60 secs, calculate average latency
    •  S/W and H/W configuration
       −  Open Source Memcached v1.6, base and optimized
       −  Open Source mcblaster load generator
       −  Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory
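The ramp-until-SLA-violation process above can be sketched as a simple loop. The latency model below is a made-up stand-in for a real 60-second mcblaster run; only the loop structure mirrors the slide:

```python
# Sketch of the measurement loop: ramp the offered request rate and keep
# the highest rate whose average GET latency stays under the 1 ms SLA.
def measure_avg_latency_ms(rps):
    # Toy latency model (an assumption, not measured data): latency is
    # flat under light load and blows past 1 ms near 800k RPS.
    return 0.2 + (rps / 800_000) ** 8

def max_rps_under_sla(step=50_000, sla_ms=1.0):
    best = 0
    for rps in range(step, 2_000_000, step):
        if measure_avg_latency_ms(rps) > sla_ms:
            break                # SLA violated: stop ramping
        best = rps               # highest rate still under the SLA
    return best

print(max_rps_under_sla())
```

The reported "best performance" numbers in the following slides are the real-hardware equivalent of this loop's return value.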
  • 6. Baseline performance & core scalability
    •  Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory
    •  Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
    No scalability beyond 3 cores; throughput degrades beyond 4
    Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
    Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory, Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
  • 7. Performance root cause
    •  Profile during “gets” shows lots of time spent in locks
    •  Drill-down into the code shows coarse-grained global cache locks
       −  Held for most of a thread’s execution time
    •  Removing the global locks & measuring “gets” showed substantial improvement
       −  Unsafe, done only as a proof of concept
    •  “Top” shows unbalanced CPU core utilization; possibilities are:
       −  Sub-optimal network packet handling and distribution
       −  Thread migration between cores
  • 8. Transaction flow
    •  Incoming requests from clients
    •  Libevent distributes them to memcached threads
       −  # of memcached threads = # of cores
       −  No thread affinity
    •  Threads do key hashing in parallel
    •  Hash table processing to
       −  Find a place for a new item (key-value pair)
       −  Find the location of an existing item
    •  LRU processing to maintain the cache policy
       −  Move the item to the front of the list, marking it most recently accessed
    •  A global cache lock around hash table and LRU processing
       −  Serializes all transactions on all threads
       −  This is the key bottleneck to scalability
    •  Final responses handled in parallel
  • 9. The hash table
    •  Hash table is arranged as an array of buckets
    •  Each bucket has a singly linked list as a hash chain
    •  The hashed key is used to find the bucket it belongs in
    •  The item (key-value pair) is then inserted into, or retrieved from, the hash chain of that bucket
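The bucket-array-with-chains layout described above can be sketched in a few lines. Memcached's real table is C with intrusive linked nodes; this Python sketch only mirrors the structure:

```python
# Minimal hash table: an array of buckets, each heading a singly linked
# hash chain, as described on the slide.
class Item:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

class HashTable:
    def __init__(self, nbuckets=16):
        self.buckets = [None] * nbuckets          # each slot heads a chain

    def _bucket(self, key):
        return hash(key) % len(self.buckets)      # hashed key -> bucket

    def set(self, key, value):
        b = self._bucket(key)
        node = self.buckets[b]
        while node:                               # update in place if present
            if node.key == key:
                node.value = value
                return
            node = node.next
        self.buckets[b] = Item(key, value, self.buckets[b])  # push front

    def get(self, key):
        node = self.buckets[self._bucket(key)]    # walk only this chain
        while node:
            if node.key == key:
                return node.value
            node = node.next
        return None

t = HashTable()
t.set("k1", "v1")
print(t.get("k1"))
```

A lookup touches only one bucket's chain, which is what makes per-bucket (striped) locking viable later in the deck.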
  • 10. The LRU
    •  LRU: Least Recently Used cache management scheme
       −  Cache space is finite; evict old items to make room for new ones
       −  LRU policy determines the eviction order of cache items
       −  Oldest active cache item is evicted first
    •  Uses a doubly linked list for quick manipulation
       −  Head has the most recently used item
       −  A GET removes the item from its current position and moves it to the head
       −  On eviction, the tail is checked for the oldest item
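The move-to-head-on-GET and evict-from-tail mechanics can be sketched with a small doubly linked list. Again, memcached's version lives in C; the class and method names here are illustrative:

```python
# Doubly linked LRU list as described above: head = most recently used,
# a GET unlinks the item and re-inserts it at the head, eviction pops
# the tail.
class Node:
    def __init__(self, key):
        self.key, self.prev, self.next = key, None, None

class LRUList:
    def __init__(self):
        self.head = self.tail = None

    def push_front(self, node):
        node.prev, node.next = None, self.head
        if self.head:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node

    def unlink(self, node):                  # splice node out of the list
        if node.prev: node.prev.next = node.next
        else:         self.head = node.next
        if node.next: node.next.prev = node.prev
        else:         self.tail = node.prev

    def touch(self, node):                   # what a GET does
        self.unlink(node)
        self.push_front(node)

    def evict_oldest(self):                  # what eviction does
        node = self.tail
        if node:
            self.unlink(node)
        return node

lru = LRUList()
a, b = Node("a"), Node("b")
lru.push_front(a); lru.push_front(b)         # order: b, a
lru.touch(a)                                 # a becomes most recent
print(lru.evict_oldest().key)                # "b" is now the oldest
```

Note that `touch` rewrites four pointers on every GET, which is why the next slides need a global lock around it, and why the Bag LRU later avoids this per-GET list surgery.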
  • 11. Why the global lock
    •  Linked lists are used in both the hash table & the LRU
    •  Corruption can occur if the lock is removed
       −  Example: two nearby items being removed concurrently can leave dangling pointers
       −  Higher chance of corruption in the LRU because of the doubly linked list
  • 12. Optimization goals, design considerations
    •  Goals
       −  Must scale well with larger core counts
       −  Hash distribution should have little effect on performance
          −  Same performance accessing 1 unique key or 100k unique keys
       −  Changes to the LRU must maintain/increase hit rates
          −  ~90% with the test data set
    •  Implementation considerations
       −  Any lock removal or reduction should be safe
       −  No additional data should be added to cache items
          −  Millions to billions of cache items in a fully populated instance
          −  A single extra 64-bit field would reduce usable memory considerably, leading to a reduced hit rate
       −  Focus on GETs for best performance
          −  Most memcached instances are read-dominated
          −  New design should account for this and optimize for read traffic
       −  Transaction ordering not guaranteed, just like the original implementation
  • 13. Optimized transaction flow
    Original
    •  Global lock serializes hash table and LRU operations
    Optimized
    •  Non-blocking GETs using a “Bag” LRU scheme
    •  Better parallelization for SET/DELETE with striped locks
  • 14. SET/DEL optimization - parallel hash table
    •  Uses striped locks instead of a global lock
       −  Fine-grained collection of locks instead of a single global lock
    •  Makes use of a fixed-size, shared collection of locks for the entire hash table
       −  Allows for a highly scalable hash table solution
       −  Fixed overhead
    •  Number of locks is a power of 2 so the lock can be determined quickly
       −  Bitwise-AND the bucket number with (number of locks - 1) to pick the lock
    •  Not used for GETs
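The power-of-two mask trick on this slide is a one-liner. A sketch (lock count and names are illustrative; memcached's implementation is C with pthread mutexes):

```python
import threading

# Striped-lock sketch: a fixed, power-of-two sized array of locks shared
# by all hash buckets. The lock for a bucket is picked with a bitwise AND,
# so the lock table is a fixed overhead regardless of table size, and
# buckets that map to different stripes can be updated in parallel.
N_LOCKS = 32                                  # must be a power of two
locks = [threading.Lock() for _ in range(N_LOCKS)]

def lock_for_bucket(bucket):
    # bucket & (N_LOCKS - 1) is a cheap modulo when N_LOCKS is 2^k.
    return locks[bucket & (N_LOCKS - 1)]

# Buckets 3 and 35 share a stripe (35 & 31 == 3); bucket 4 does not.
print(lock_for_bucket(3) is lock_for_bucket(35))
print(lock_for_bucket(3) is lock_for_bucket(4))
```

Two buckets that happen to share a stripe serialize against each other, but with enough stripes relative to cores that contention is rare; that is the "balance" the next slide refers to.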
  • 15. SET/DEL optimization - parallel hash table ..
    •  Each lock services a fixed number of buckets
    •  The number of locks is chosen to balance parallelism against lock maintenance overhead
    •  Multiple buckets can be manipulated in parallel
  • 16. GET optimization – removing the global lock
    •  No global lock during hash table processing for a GET
    •  With no global lock, two situations must be handled
       −  Expansion of the hash table during a GET
          −  Hash table expands if there are many more items than the buckets can handle
       −  SET/DEL of an item during a GET
    •  Handling hash table expansion during a GET
       −  If expanding, wait for it to finish before looking up the hash chain
       −  If not expanding, find the data in the hash chain and return it
    •  Handling a SET/DEL during a GET
       −  If the hash table is expanding, wait for it to finish before modifying the hash chain
       −  Modify pointers in the right order using atomic operations to ensure correct hash chain traversal for GETs
    •  A GET may still happen while an item is being modified (SET/DEL/REPLACE)
       −  Is that a problem?
       −  No, as long as traversal is correct, because operation order is not guaranteed anyway
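The "modify pointers in the right order" bullet is the heart of the lockless GET. A sketch of the publication order for inserting at a chain head: the new node is fully linked *before* the head is updated, so a concurrent reader sees either the old chain or the new one, never a half-built node. In the C code the head update would be an atomic store; plain Python assignments stand in for it here:

```python
# Order-of-publication sketch for a SET racing a lockless GET.
class ChainNode:
    def __init__(self, key, value):
        self.key, self.value, self.next = key, value, None

head = ChainNode("old", 1)        # existing chain: [old]

def set_front(key, value):
    global head
    node = ChainNode(key, value)
    node.next = head              # step 1: link the new node into the chain
    head = node                   # step 2: publish it (atomic store in C)
    # Reversing steps 1 and 2 would let a reader follow head to a node
    # whose next pointer is not set yet: a broken traversal.

set_front("new", 2)

def get(key):                     # lockless traversal, as a GET would do
    n = head
    while n:
        if n.key == key:
            return n.value
        n = n.next
    return None

print(get("new"), get("old"))
```

This mirrors the slide's point: a GET concurrent with a SET may return the pre- or post-SET view, and that is acceptable because memcached never guaranteed operation ordering anyway; only traversal correctness matters.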
  • 17. GET optimization – Parallel Bag LRU
    •  Replaces the original doubly linked list LRU
    •  Basic concept is to group items with similar timestamps into “bags”
       −  As before, no ordering is guaranteed
    •  Has all the functionality of the original LRU
    •  Re-uses the original item data structure – no additions
    •  SET into a bag uses an atomic Compare-and-Swap operation
    •  GET from a bag is lockless
    •  DEL requests do nothing to the Bag LRU
    •  LRU cleanup is delegated to a “cleaner thread”
       −  Acts like garbage collection/cleanup
       −  Evicts expired items quickly
       −  Handles item cleanup from deletes
       −  Reorders cache items based on update time
       −  Adds additional bags as needed
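The CAS-based insert into the newest bag can be sketched as the classic compare-and-swap retry loop. Python has no real CAS, so the `cas` helper below is an explicit stand-in for the C atomic compare-exchange the slide refers to; the bag and item names are illustrative:

```python
# Sketch of a CAS-based push onto a bag's singly linked item list:
# read the head, link the new item in front of it, then swap the head
# only if it has not moved in the meantime; otherwise retry.
class BagItem:
    def __init__(self, key):
        self.key, self.next = key, None

class Bag:
    def __init__(self):
        self.head = None

def cas(bag, expected, new):
    # Stand-in for an atomic compare-and-swap on bag.head (C11
    # atomic_compare_exchange in the real code).
    if bag.head is expected:
        bag.head = new
        return True
    return False

def bag_insert(bag, item):
    while True:                   # retry loop: lock-free, not wait-free
        old_head = bag.head
        item.next = old_head      # link before publishing (see slide 16)
        if cas(bag, old_head, item):
            return                # CAS succeeded: item is visible

newest = Bag()
bag_insert(newest, BagItem("a"))
bag_insert(newest, BagItem("b"))
print(newest.head.key)            # most recently inserted item
```

Because inserts only ever touch the bag's head with a single CAS, concurrent SETs never need a lock on the LRU, which is what makes GETs (which only read the list) lockless.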
  • 18. Parallel Bag LRU details – Bag array
    (Diagram: original LRU vs. Bag LRU)
    •  A list of bags in chronological order
    •  Bags have lists of items
    •  Newest bag has recently allocated or accessed items
    •  An alternate bag is used by the cleaner thread to avoid lock contention on inserts to the newest bag
    •  Bag head has pointers to the oldest and newest bags for quick access
  • 19. Parallel Bag LRU details – Bags
    •  Each bag has a singly linked list of cache items
    •  A SET causes the new item to be inserted into the “newest bag”
    •  A GET updates the item’s timestamp & bag pointer to point to the “newest bag”
    •  Evictions are handled by the cleaner thread
  • 20. Parallel Bag LRU – Cleaner thread
    •  Periodically does housekeeping on the Bag LRU
       −  Currently every 5 secs
    •  Starts cleaning from the oldest bag’s oldest item
  • 21. Optimizations - misc
    •  Used thread affinity to bind 1 memcached thread per core
    •  Configured the NIC driver to evenly distribute incoming packets over CPUs
       −  1 NIC queue per logical CPU, each affinitized to a logical CPU
    •  irqbalance and iptables services turned off
  • 22. Optimized performance & core scaling
    •  Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory
    •  Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
    Linear scaling with optimizations
    Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory, Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
  • 23. Server capacity
    •  Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory
    Overall 900% gains vs. baseline; Turbo and HT boost performance by 31%
    Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory, Intel® Turbo Boost Technology OFF/ON, Intel® Hyper-Threading Technology OFF/ON
  • 24. Efficiency and hit rate
    •  Hit rate measured with a synthetic benchmark increased slightly
       −  At ~90%, similar to that of the original version
    •  Efficiency (transactions per watt) increased by 3.4x
       −  Mostly due to much higher RPS for little increase in power
       −  Power draw would be less in a production environment
    •  Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory
    Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GbE NIC, 64 GB memory, Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology ON
  • 25. Summary
    •  Base core/thread scalability hampered by locks
       −  No throughput scaling beyond 3 cores, degradation beyond 4
    •  Lockless GETs with the Bag LRU improve scalability
       −  Linear up to the 16 cores measured
       −  No increase in average latency
       −  No loss in hit rate (~90%)
       −  Same performance for random and hot/repeated keys
    •  Striped locks parallelize hash table access for SET/DEL
    •  Bag LRU source code available on GitHub
       −  https://github.com/rajiv-kapoor/memcached/tree/bagLRU
  • 26. Thank You
  • 27. Legal Disclaimer
    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
    •  A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
    •  Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
    •  The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
    •  Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.
    •  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number.
    •  Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
    •  Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
    •  [Add any code names from previous pages] and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services, and any such use of Intel's internal code names is at the sole risk of the user.
    •  Intel, [Add words with TM or R from previous pages..ie Xeon, Core, etc] and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
    •  *Other names and brands may be claimed as the property of others.
    •  Copyright © 2012 Intel Corporation.
  • 28. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
    Notice revision #20110804
  • 29. Legal Disclaimer
    •  Built-In Security: No computer system can provide absolute security under all conditions. Built-in security features available on select Intel® Core™ processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your PC manufacturer for more details.
    •  Enhanced Intel SpeedStep® Technology - See the Processor Spec Finder at http://ark.intel.com or contact your Intel representative for more information.
    •  Intel® Hyper-Threading Technology (Intel® HT Technology) is available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support Intel HT Technology, visit http://www.intel.com/info/hyperthreading.
    •  Intel® 64 architecture requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t
    •  Intel® Turbo Boost Technology requires a system with Intel Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo
    •  Other Software Code Disclaimer: Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • 30. Risk Factors
    The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including supply constraints and other disruptions affecting customers; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. Intel is in the process of transitioning to its next generation of products on 22nm process technology, and there could be execution and timing issues associated with these changes, including product defects and errata and lower than anticipated manufacturing yields. The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. The majority of Intel’s non-marketable equity investment portfolio balance is concentrated in companies in the flash memory market segment, and declines in this market segment or changes in management’s plans with respect to Intel’s investments in this market segment could result in significant impairment charges, impacting restructuring charges as well as gains/losses on equity investments and interest and other. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent Form 10-Q, Form 10-K and earnings release. Rev. 5/4/12
  • 31. Summary
    Memcached is a popular key-value caching service used by web service delivery companies to reduce the latency of serving data to consumers and to reduce load on back-end database servers. It has a scale-out architecture that easily supports increasing throughput by simply adding more memcached servers, but at the individual server level, scaling up to higher core counts is less rewarding. In this talk we introduce optimizations that break through such scalability barriers and allow all cores in a server to be used effectively. We explain new algorithms implemented to achieve an almost 6x increase in throughput while maintaining a 1 ms average latency SLA, by utilizing concurrent data structures, a new cache replacement policy and network optimizations.
  • 32. Optimized transaction flow