Answers to the Scaling Challenge: A Case Study with Answers.com on Scaling with Memcached and MySQL

Presenters:
  • Dan Marriott, Director - Production Operations, Answers.com (danm@answers.com)
  • Joaquín Ruiz, EVP Products, Gear6

April 14, 2010

Overview

The world’s leading Q&A site

Answers.com
  • Combines the power of WikiAnswers® community-driven content with expert licensed content on ReferenceAnswers™
  • Users can ask anything and automatically receive the best available answer.

WikiAnswers
  • Community-generated social knowledge Q&A platform, leveraging wiki-based technologies; answers improve over time.

ReferenceAnswers
  • Content on millions of topics from over 250 licensed dictionaries and encyclopedias from leading publishers.

Answers.com Sites

Rank in top Web properties
  • #18 in the U.S. (02/2010) (1)
  • #31 worldwide (02/2010) (1)

Unique monthly visitors
  • 50 million in the U.S. (02/2010) (1)
  • 72 million worldwide (02/2010) (1)

WikiAnswers domain fastest growing in U.S.:
  • 2007: unique visitors grew 573% (#1 of top 1,500 Web domains) (1)
  • 2008: unique visitors grew 154% (#1 of top 200 Web domains) (1)
  • 2009: unique visitors grew 74% (#2 of top 50 Web domains) (1)

(1) Source: comScore – Hybrid Measurement Methodology (U.S. only) beginning August 2009

Answers.com Sites
  • 50MM unique monthly U.S. visitors = #18 (Feb 2010)
  • Source: comScore – Hybrid Measurement Methodology for Answers sites beginning August 2009

ReferenceAnswers

WikiAnswers: Q&A the Wiki Way

Some key infrastructure elements
  • 2 Data Centers – Active/Active
  • HP Blade Servers (BL460) and 2U Servers (DL380)
  • Fusion-io SSDs in all large MySQL DB servers
  • VMware
  • HP/LeftHand SAN
  • Hardware Load-Balancers

Software
  • Linux – mainly CentOS 5.x/64bit
  • Apache/JBOSS
  • Lucene / Solr
  • Memcached
  • MySQL 5.0.x
  • Apache-Tomcat
  • Memcached
  • MySQL 5.0.x

Secret to Web 2.0
  “Everything runs from memory in Web 2.0.” (Evan Weaver, Twitter)

Questions: How do I Scale????

Four answers to the scaling challenge from Answers.com:
  • Use Enterprise-grade PCI-SSDs in your MySQL servers
  • Enhance Memcached tier
  • Controlling DB Slave server clusters
  • Hardware Load-Balancers

Scaling Answer #1: Use Enterprise-grade SSDs in your MySQL servers

Using SSDs in your MySQL servers
  • If you have large DB read clusters:
      • Can improve DB server queries/sec by 900%, and thereby reduce the number of servers needed per cluster
      • Can improve overall app speed
  • Use PCI-based SSDs: *much* faster than SAS/SATA-based SSDs
  • Give yourself at least 50% headroom (to handle uncompressing copies locally on SSD)
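
To make the sizing claims concrete, here is a minimal sketch of the arithmetic. The 900% improvement and the 50% headroom come from the slide; the query rates, the dataset size, and the function names are hypothetical values chosen only for illustration.

```python
import math

def speedup_factor(improvement_pct: float) -> float:
    """An improvement of N% means throughput is (1 + N/100) times the baseline."""
    return 1.0 + improvement_pct / 100.0

def servers_needed(cluster_qps: float, per_server_qps: float) -> int:
    """Smallest number of servers that can carry the cluster's read load."""
    return math.ceil(cluster_qps / per_server_qps)

def ssd_capacity_gb(dataset_gb: float, headroom: float = 0.5) -> float:
    """Dataset size plus headroom (50% by default, as the slide suggests)."""
    return dataset_gb * (1.0 + headroom)

# Hypothetical workload: these numbers are assumptions, not from the talk.
baseline_qps = 2_000    # queries/sec one disk-based server can sustain
cluster_qps = 40_000    # total read load on the cluster

before = servers_needed(cluster_qps, baseline_qps)
after = servers_needed(cluster_qps, baseline_qps * speedup_factor(900))

print(f"servers before SSD: {before}, after SSD: {after}")
print(f"SSD capacity for a 200 GB dataset: {ssd_capacity_gb(200):.0f} GB")
```

A real cluster would likely keep more servers than this minimum for redundancy; the sketch only shows how a 900% gain translates into a 10x per-server throughput figure.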

Quick plug :-)

Panel: How Solid-state Technologies are Transforming MySQL Server Performance and the Datacenter Architectures
  • Sumeet Bansal (Fusion-io)
  • Ryan White (Cloudmark)
  • Dan Marriott (Answers.com)
  • Vadim Tkachenko (Percona Inc)
  • Jeremy Zawodny (craigslist.org)
Weds 5:15pm - Ballroom D

Scaling Answer #2: Enhance Memcached tier

Memcached tier – before
  • 10 x 16/32GB RAM Memcached servers per data center
  • Divided into several clusters
  • Striped instances across each cluster
  • Tens of millions of items per instance

Memcached tier – Challenges
  • No redundancy
      • Lose 25% of cache (or worse) on any server failure
      • Loss of cache = poor performance/user experience
  • Costly (OpEx: rack space, power, cooling, maintenance, admin)
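
The 25% figure follows from striping keys across a small cluster. The sketch below assumes a four-node cluster and simple hash-modulo key placement; both are assumptions for illustration, since the slides do not say how keys were distributed, and the node names are made up.

```python
import hashlib
from typing import Sequence

def node_for_key(key: str, nodes: Sequence[str]) -> str:
    """Pick a node by hashing the key (hash-modulo striping, assumed here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def fraction_lost(keys: Sequence[str], nodes: Sequence[str], failed: str) -> float:
    """Share of keys that lived on the failed node and are now cache misses."""
    lost = sum(1 for key in keys if node_for_key(key, nodes) == failed)
    return lost / len(keys)

# Hypothetical cluster of four memcached nodes and 100,000 page keys.
nodes = ["mc1", "mc2", "mc3", "mc4"]
keys = [f"page:{i}" for i in range(100_000)]

loss = fraction_lost(keys, nodes, failed="mc3")
print(f"cache lost when one of {len(nodes)} nodes fails: {loss:.1%}")
```

With four nodes the lost share comes out close to one quarter; smaller clusters lose proportionally more, which matches the "or worse" on the slide.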

Cache data critical to performance
  • Tens of millions of pages
  • Pages are dynamic – always publishing
  • Unify data on the same topic from different data sources onto one page

Multiple entries/page
  [Illustrate mult. data entries/page]

Cache data critical to performance
  • Multiple data sources (entries) per page
  • Each entry can have multiple links to other entries
  • Cache metadata:
      • List of entries for each page
      • Internal links for each entry

Lose memcached node?
  • Lose 25% of cached data (or more)
  • 2-20 additional MySQL queries to retrieve the metadata and data needed to construct a page
  • Heavy load on MySQL servers
  • Longer page load times
  • Site slows down
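
The extra queries come from the usual cache-aside pattern: each cache miss falls through to the database. The following is a rough sketch of that pattern, using a plain dict as a stand-in for memcached and a hypothetical `FakeDB` class that only counts queries; the key names and sample data are invented for the example.

```python
class FakeDB:
    """Stand-in for MySQL that counts how many queries it serves (hypothetical)."""

    def __init__(self, page_entries, entry_data):
        self.page_entries = page_entries
        self.entry_data = entry_data
        self.queries = 0

    def entries_for_page(self, page_id):
        self.queries += 1
        return list(self.page_entries[page_id])

    def load_entry(self, entry_id):
        self.queries += 1
        return self.entry_data[entry_id]

def build_page(page_id, cache, db):
    """Cache-aside: read metadata and entries from cache, fall back to the DB."""
    meta_key = f"page:{page_id}:entries"
    entry_ids = cache.get(meta_key)
    if entry_ids is None:                 # metadata miss costs one query
        entry_ids = db.entries_for_page(page_id)
        cache[meta_key] = entry_ids
    parts = []
    for entry_id in entry_ids:
        key = f"entry:{entry_id}"
        data = cache.get(key)
        if data is None:                  # each entry miss costs one more query
            data = db.load_entry(entry_id)
            cache[key] = data
        parts.append(data)
    return parts

db = FakeDB(
    page_entries={"java": ["dict-1", "wiki-7", "enc-3"]},
    entry_data={"dict-1": "definition", "wiki-7": "Q&A", "enc-3": "article"},
)
cache = {}                                # empty cache, as after a node loss

build_page("java", cache, db)
cold_queries = db.queries                 # queries needed with a cold cache
build_page("java", cache, db)
warm_queries = db.queries - cold_queries  # queries needed once the cache is warm
print(cold_queries, warm_queries)
```

A page with three entries costs four queries on a cold cache and none once warm, which is the effect the slide describes at larger scale.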

Original plan to solve memcached weakness
  • Engineering team to develop memcached wrapper
  • Mirrored memcached server clusters (2x hardware)
  • 6 man-months to develop:
      • Write to both clusters
      • Handle node failure
      • Handle cache data restore when node becomes available again
  • Also: address memcached slab issues
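
The three wrapper responsibilities listed above can be sketched roughly as follows. This is not the wrapper Answers.com planned (it was never built, per the later slides); `Node` is an in-memory stand-in for a memcached server, and every class and method name here is hypothetical.

```python
class Node:
    """In-memory stand-in for one memcached node (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def set(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)

class MirroredCache:
    """Writes to both nodes, reads from whichever node answers."""

    def __init__(self, primary, mirror):
        self.nodes = [primary, mirror]

    def set(self, key, value):
        written = 0
        for node in self.nodes:
            try:
                node.set(key, value)
                written += 1
            except ConnectionError:
                pass                      # tolerate one failed node
        return written > 0

    def get(self, key):
        for node in self.nodes:
            try:
                value = node.get(key)
            except ConnectionError:
                continue                  # fail over to the other node
            if value is not None:
                return value
        return None

    def resync(self, recovered):
        """Copy the healthy peer's contents into a node that has come back."""
        source = next(n for n in self.nodes if n is not recovered and n.up)
        recovered.data = dict(source.data)
        recovered.up = True

a, b = Node("mc-a"), Node("mc-b")
cache = MirroredCache(a, b)
cache.set("k1", "v1")
a.up = False                              # simulate node failure
a.data.clear()                            # its RAM contents are gone
cache.set("k2", "v2")                     # still succeeds via the mirror
value_during_failure = cache.get("k1")
cache.resync(a)                           # node returns and is refilled
value_after_resync = a.get("k2")
print(value_during_failure, value_after_resync)
```

Even this toy version has to deal with partial writes and resynchronization, which hints at why the real effort was estimated at six man-months.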

Alternative solution – Gear6
  • Built-in redundancy function (pair of boxes)
  • Mirroring (writes *everything* to both nodes)
  • 200GB Memcached space per box (RAM+SSD)
  • Uses standard VIP mechanism
  • Graceful failover (no impact)
  • Automatic cache-sync on node recovery
  • Auto fail-back and rebalance VIP
  • No code changes necessary

Gear6 GUI config example

Testing
  • Same blisteringly fast memcached performance
      • Use of slower SSD was undetectable
  • Tried to break it
      • Yank power, kill network ports, pull SSD drive
  • Try software upgrade
      • Seamless failover/cache-sync on node-up
  • Discovered vastly improved slab management
      • Plus ability to fine-tune instance memory usage
  • Scalable
      • Can handle 60,000+ ops/sec

Value-add with Gear6 memcached
  • Improve app reliability
  • Ensure no SPOF at memcached layer
  • Scale up our database infrastructure safely
  • Significantly reduce TCO by decommissioning 20 servers
  • Save 6 man-months of custom memcached wrapper development

Gear6 Value-add … cont’d
  • Insight into memcached performance
  • GUI helped us troubleshoot app issues
  • First class support (even at 4:00am!)
  • New in v3.0: Ability to dynamically resize memcached instances with no cache loss!!

Scaling Answer #3: Controlling DB Slave server clusters

Controlling DB Slave server clusters
  • Maintenance required sending a command to the Load Balancer to disable the slave node in the cluster
  • On DB node failure, lost thousands of queries before monitoring noticed and issued the LB command

Solution: Install lighty (lighttpd) on every DB server

Using Lighty to Control DB Slave clusters
  • Config lighty to make lightweight “healthcheck” call to local DB (SHOW STATUS LIKE 'Threads_running')
  • Can set maint. flag on any DB server
  • Have LB monitor make HTTP call to lighty (http://servername:3300/dbcheck?p=3309)
  • Works for multiple MySQL instances on same box
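
On the load-balancer side, the check amounts to building the URL for each MySQL instance and reading the reply. The sketch below assumes that the `p` query parameter carries the MySQL instance's port and that the body is the word "green" or "red"; the slides suggest both but do not state them outright, and the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

def healthcheck_url(host: str, mysql_port: int, lighty_port: int = 3300) -> str:
    """URL the load balancer would poll for one MySQL instance on a host."""
    return f"http://{host}:{lighty_port}/dbcheck?" + urlencode({"p": mysql_port})

def mysql_port_from_url(url: str) -> int:
    """What the health-check script would read back out of the request."""
    return int(parse_qs(urlparse(url).query)["p"][0])

def node_in_rotation(response_body: str) -> bool:
    """Keep the node in the cluster only when the check answers 'green'."""
    return response_body.strip().lower() == "green"

# Two MySQL instances on the same box are checked through the same lighty port.
urls = [healthcheck_url("servername", port) for port in (3309, 3310)]
for url in urls:
    print(url, "-> checks MySQL port", mysql_port_from_url(url))
```

Because the instance is selected by a query parameter, one lighty listener per server can front any number of local MySQL instances.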

Lighty DB health-check

Lighty script logic
    if maint_flag = TRUE
        return “red”
    elseif dbcall not good
        return “red”
    else
        return “green”
    end
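
The script logic above can be expressed as a small runnable function. The slides do not say how the script was written or how the maintenance flag was stored, so this sketch assumes a flag file on disk and takes the database check as a callable; `health_status`, the flag path, and the demo checks are all hypothetical.

```python
import os
import tempfile
from typing import Callable

def health_status(maint_flag_path: str, db_check: Callable[[], bool]) -> str:
    """Return 'red' when in maintenance or when the DB check fails, else 'green'."""
    if os.path.exists(maint_flag_path):   # maint_flag = TRUE
        return "red"
    try:
        healthy = db_check()              # e.g. a Threads_running query succeeds
    except Exception:
        healthy = False                   # an unreachable DB counts as not good
    return "green" if healthy else "red"

def unreachable_db() -> bool:
    raise ConnectionError("cannot connect to local MySQL")

with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, "maint.flag")
    normal = health_status(flag, lambda: True)
    db_bad = health_status(flag, lambda: False)
    db_down = health_status(flag, unreachable_db)
    open(flag, "w").close()               # operator sets the maintenance flag
    in_maint = health_status(flag, lambda: True)

print(normal, db_bad, db_down, in_maint)
```

Checking the maintenance flag first means a server can be drained from the cluster without touching the load balancer, even while its database is healthy.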

Scaling Answer #4: HW Load Balancers

Hardware Load Balancers
  • Avoid making firewall your bottleneck
  • Immense flexibility [CLI, stats, ecv]
  • Quick reconfigs, adds/changes/deletes
  • Highly secure
  • Caching, Compression, GSLB, throttle for safety

Which Hardware Load Balancers?
  • Citrix & F5
      • Leaders of the pack (Gartner Magic Quadrant)
      • Amazing performance
      • Supremely flexible
      • Advanced GSLB
      • Expensive
  • A10 Networks
      • Up and coming
      • Great performance
      • Decent feature set
      • Cost-effective (certainly for internal load-balancing)

Thank you.

Slides: http://tinyurl.com/mysqlconf2010-scaling-tips

Gear6: Perspectives on the scaling challenge


Editor's Notes

  • #43 These three are interrelated. Content is dynamic and personal. Take a step back…
      • Many more people, and not just in the USA
      • Traffic (Cisco petabyte data): measured in petabytes per month
      • Applications: dynamic and personalized
  • #44 A lot of this population is coming into US-based social networking, gaming, and media sites… distance means a latency issue.
      • Proliferation of broadband drove users to the net.
      • USA is #33 in broadband (>2Mbps); #16 in average speed.
      • Miniwatts data shows the shift to Europe and Asia in particular.
      • World internet usage is only now at 1.6b (24% according to 50x15 site).
      • Mobile is now adding rapidly to this population explosion.
      • Akamai State of the Internet report for Q1 2009 gives very insightful information on current data trends.
  • #45 Consumer is driving the traffic growth (3x of business); >40% CAGR growth (4x in 4 years)
  • #47 How do sites make the same software stack go faster?
  • #48, #49 Why we are here again: a scalable stack framework for LAMP and Java had/has to emerge in order to accelerate the stack and provide distribution services
  • #50, #51 100ms latency introduced by traversing the internet to the origin site. Typically +0.1 secs means a 2-4% DROP in revenue
  • #53 Maslow's Hierarchy Of Needs
  • #56 Increases revenue via being able to economically scale and support more members, services and transactions
      • Scalable stack framework delivered as a network service addressable by all stack components
      • Onramp information into memory ASAP
      • Memcached has emerged as the de facto protocol to load information into memory, BUT a lot more is needed
      • Memcached++
          • Focused on caching functions
          • Advanced network-based distributed services
          • Non-relational functions
  • #57 Web 2.0 driving user & traffic growth
      • Key Apps: Entertainment, Communication, Social Networking
      • Key Drivers: Mobile, Broadband, International, Consumer
      • Mobile applications – 3G+ is 22% by 2010 (mobile goes hi speed)
      • Consumer Internet Traffic 2006–2012: This category encompasses any IP traffic that crosses the Internet and is not confined to a single service provider’s network. Peer-to-peer (P2P) traffic, still the largest share of Internet traffic today, will decrease as a percentage of overall Internet traffic. Internet video streaming and downloads are beginning to take a larger share of bandwidth, and will grow to nearly 50 percent of all consumer Internet traffic in 2012.
  • #60, #61 We talked about “Universal Distro” and EC2 before.
      • Gear6 Web Cache Software requirements:
          • Certified on HP, Dell, Rackable…
          • Requires dedicated server
          • Redundant GE ports or 10 GE ports
          • 64-bit (x86-64) server with 32GB or more of RAM
          • IDE/SATA HD with 80GB+
          • Option: 2.5” SATA drive bays for Flash SSDs - Intel X25-E or Samsung
      • Gear6 Web Cache delivers:
          • High availability and consistent scalable high performance
          • 50k-200k Memcached ops per sec
          • 32-300+GB Memcached space per rack unit
          • Fault tolerance
          • Complete cache control and visibility of resources
          • Cost effective mechanism of scaling database and application tiers