
Large-scale projects development (scaling LAMP)

This 8-hour tutorial was given at various conferences, including the Percona conference (London), DevConf (Moscow), and Highload++ (Moscow).


During this tutorial we will cover various topics related to high scalability for the LAMP stack. This workshop is divided into three sections.

The first section covers the basic principles of shared-nothing architectures and horizontal scaling for the app/cache/database tiers.

Section two of this tutorial is devoted to MySQL sharding techniques, queues and a few performance-related tips and tricks.

In section three we will cover a practical approach to measuring site performance and quality, providing a "lean" support philosophy and connecting business and technology metrics.

In addition we will cover Pinba, a very useful real-time statistics server, its features and various use cases. All sections are based on real-world examples built at Badoo, one of the biggest dating sites on the Internet.



  1. Large-scale projects development. Alexey Rybak, Badoo. DevConf, 10 June 2012
  2. Who am I? • developer/manager/director roles in 2005 – …, 2004 – 2005, and others, 1999 – 2004 • this tutorial – a hobby educational project since 2006
  3. Rate yourself, please • Worked primarily on one-server or shared-hosting systems, want to learn the basics of large-scale architectures and scaling techniques • Already have several servers in production, want to know how to grow further • Know all this more or less, just want to systematize your knowledge and get answers to particular questions
  4. A few more introductory words • Technology stack – LAMP • Most problems are fundamental and stack-independent in nature • Interrupt, ask questions • Is the flipchart visible? We will have several flipchart sessions
  5. Tutorial schedule • Introduction: values & principles • Web/application and cache tiers • Databases, sharding • Queues • Lean production: measuring • Question session (min. 1 hour)
  6. 1. Introduction: values and principles
  7. Why values? • the next message is for developers • already worked on big projects? then you know this • no? please keep an open mind • some of it may sound wrong • sad but true
  8. In large-scale projects • programming as writing code matters less • system design is the key • system design is not about patterns, classes, modules, APIs … • not about any code-writing practice or code design
  9. System design • putting various components together • software and hardware • most components are "ready" • know these components • more engineering • less traditional "programming"
  10. System design • focused on business values • performance + cost of ownership • more clients (requests) with less money invested • operations with fewer resources, minimum downtime… • performance, high availability, reliability, recovery… many other buzzwords • can be painful for developers, as it's about managing unknowns
  11. Scalability: an ability to grow [Chart: income ($$$) vs. spending ($$$) – a linear curve with good performance, a linear curve with bad performance, and a non-linear curve] • scalability and performance together determine your growth • scalability is the class of the function • performance is a parameter of the function (here: the angle) • we will talk about both scalability and performance
  12. Scaling • vertical: scale up (improving hardware) • horizontal: scale out (adding boxes) • component coupling matters • the key to horizontal scaling is weak coupling between subsystems (shared nothing = weak/loose coupling)
  13. Queueing theory • just to introduce basic models • massive flows of random requests: telecommunications, call centers, supermarkets, filling (gas) stations, airports, fast food, Disneyland... and internet projects • started by A. K. Erlang, «The Theory of Probabilities and Telephone Conversations», 1909
  14. Basic model: single-server queue [Diagram: requests → queue → server → processed requests; queue overflow = failure] Characteristics: • processed requests/sec (throughput) • total processing time (latency) • failures/sec (quality) • many others. Important property: rapid non-linear performance degradation
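The "rapid non-linear performance degradation" noted on this slide can be made concrete with the textbook M/M/1 queue formula (my addition, not from the slide): average time in the system is W = 1/(μ − λ), where μ is the service rate and λ is the arrival rate. A small Python sketch:

```python
# Average latency in the single-server queue model (M/M/1 approximation):
# W = 1 / (mu - lam), where mu is the service rate (req/sec the server
# can handle) and lam is the arrival rate (req/sec coming in).
def avg_latency(lam, mu):
    if lam >= mu:
        raise ValueError("arrival rate >= service rate: queue grows without bound")
    return 1.0 / (mu - lam)

# A server that can process 100 req/sec: latency explodes near capacity.
for lam in (50, 90, 99):
    print(lam, avg_latency(lam, 100))
# 50 -> 0.02 s, 90 -> 0.1 s, 99 -> 1.0 s: rapid non-linear degradation
```

Doubling the load from 50 to 99 req/sec multiplies the latency fifty-fold, which is exactly the "important property" the slide warns about.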
  15. Multiple-server queue [Diagram: requests → one queue → N servers → processed requests] • one queue + N servers performs better than N × (queue + server) • find these models in your project – they form the basis of your architecture
  16. System design • Goal: components are coupled in the most effective way • Method: imagine it's all queues and analyze the data-processing flows • Components: high-level (software) and low-level (hardware)
  17. High-level components • your software + ready building blocks • "ready" software: web servers, application servers (can be incorporated into the web server), cache servers, database servers
  18. Each is based on • hardware: CPU, memory, disk, network • OS: Linux/UNIX parallelism
  19. Hardware: data flow limits [Diagram: CPU (< 1e-9 s) with per-core caches → memory (1e-7 – 1e-6 s) → FS cache → HDD (> 1e-3 s) → network] • sequential disk IO: ~100 MB/sec • random disk IO: ~200 req/sec • network round trip: ~1e-5 s • database IO isn't sequential • SSD rocks in random IO • random reads from another box's memory over the network are faster than reads from a local disk
  20. Hardware: conclusions • reading from another box's memory can be significantly faster than reading from the local disk • weakest link: random HDD IO (databases) • sequential bulk reads/writes are more effective • batch writes: accumulate data in memory and sync • databases use a combination of these techniques • battery-backed write cache • SSD: much faster random access
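The "batch writes: accumulate data in memory and sync" point can be sketched as a tiny write-batching buffer (all names here are illustrative, not from the tutorial):

```python
# Hypothetical write-batching sketch: accumulate rows in memory and flush
# them in one sequential bulk write instead of many small random ones.
class BatchWriter:
    def __init__(self, sink, batch_size=100):
        self.sink = sink          # callable taking a list of rows
        self.batch_size = batch_size
        self.buf = []

    def write(self, row):
        self.buf.append(row)
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink(self.buf)   # one bulk write (e.g. a multi-row INSERT)
            self.buf = []

flushes = []
w = BatchWriter(flushes.append, batch_size=3)
for i in range(7):
    w.write(i)
w.flush()  # sync the tail on shutdown
print(flushes)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven logical writes become three physical ones; with a real database sink the win comes from turning random IO into sequential IO.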
  21. Components splitting [Diagram – Section #2 (applications): incoming HTTP traffic → front-end (connection handling) → back-end (application cluster) → cache (fast memory storage); Section #3 (data): sharded databases split the disk writes; other clusters involved in request processing: queueing, jobs, analytical applications…] In the next sections we'll discuss • why this splitting is effective • how to scale the app/cache/db tiers horizontally
  22. 2. Web/applications tier
  23. Why frontend and backend? [Diagram: incoming HTTP traffic → front-end (connection handling) → back-end (application cluster)] • the C10K problem – serving 10K connections • need to know: OS parallelism, server models
  24. Linux: parallelism • processes • threads • multitasking, interrupts: context switches • the key property is how servers handle network connections
  25. Server models • process per connection • thread per connection • FSM (finite state machine)
  26. Connection handling • process-per-connection (Apache 1, Apache 2 mpm_prefork) • slow clients = many processes • thread-per-connection (Apache 2 mpm_worker) • slow clients = many threads • Keep-Alive – 90% of clients • overhead: context switches, RAM • "lightweight" servers: nginx (engine-x), lighttpd (lighty), …
  27. Server models • process per connection: CGI (fork per connection); pooling: Apache (v.1, mpm_prefork – min, max, spare), PostgreSQL + pgpool, PHP-FPM … • thread per connection: pooling: Apache (mpm_worker – min, max, spare), MySQL (thread_cache) • FSM (finite state machine): "modern" kernel interfaces: kqueue, epoll; libraries: libevent, libev; FSM + process pooling: nginx; FSM + thread pooling: memcached v1.4+
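The FSM model from this slide can be sketched with Python's selectors module (which wraps epoll/kqueue). This is an illustrative toy, not nginx's or memcached's actual code; a socketpair stands in for an accepted client connection:

```python
import selectors
import socket

# FSM-style sketch: one process, one selector, many sockets -- the
# event-loop model used by nginx and memcached (epoll/kqueue underneath).
def run_event_loop(sel, max_iterations=10):
    """Each ready socket is one state-machine step: read, 'process', reply."""
    handled = 0
    for _ in range(max_iterations):
        events = sel.select(timeout=0.1)
        if not events:
            break  # nothing ready; a real server would keep looping
        for key, _mask in events:
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                conn.sendall(data.upper())  # the "request processing" step
                handled += 1
            else:
                sel.unregister(conn)
                conn.close()
    return handled

sel = selectors.DefaultSelector()
client, server_side = socket.socketpair()  # stand-in for an accepted connection
server_side.setblocking(False)
sel.register(server_side, selectors.EVENT_READ)
client.sendall(b"ping")
run_event_loop(sel)
reply = client.recv(4096)
print(reply)  # b'PING'
```

The key property: one process serves many connections without a context switch per client, which is why slow clients are cheap for FSM servers and expensive for process/thread-per-connection ones.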
  28. Nginx • 1 master + N workers (10^3 – 10^4 connections each) • N ~ CPU cores × (blocking IO probability) • FSM • maniacal attention to speed and code quality • Keep-Alive cost: ~100 KB per active / ~250 bytes per inactive connection • logical, flexible, scalable configuration • even has embedded (stripped-down) Perl
  29. [front/back]end • what does a web server do? it executes script code and it serves the client • hey, does the cook talk to restaurant customers? • these tasks are different, so split them into frontend/backend • nginx + Apache with mod_php, mod_perl, mod_python • nginx + FastCGI (for example, php-fpm)
  30. [front/back]end [Table] Heavy-weight server (HWS): Apache with mod_php/mod_perl/mod_python, FastCGI; dynamic content; should see only «fast» clients. Light-weight server (LWS): nginx; static content, can do simple scripting (SSI, perl); handles both «fast» and «slow» clients
  31. [front/back]end: scaling [Diagram: SLB → front-ends (F) → back-ends (B)] • homogeneous tiers (easier maintenance) • round-robin balancing (weighted, WRRB) • WRRB means there's no "state" • keys to the simplest horizontal scaling: don't store any "state" on the box; weak coupling
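Weighted round-robin balancing (WRRB) from this slide can be sketched in a few lines; the backend names and weights are made up:

```python
import itertools

# Weighted round-robin sketch: expand each backend by its weight and
# cycle. Because the boxes are stateless, any backend can serve any
# request -- which is exactly what makes this tier trivially scalable.
def wrr(backends):
    """backends: list of (name, weight) pairs."""
    ring = [name for name, weight in backends for _ in range(weight)]
    return itertools.cycle(ring)

balancer = wrr([("b1", 2), ("b2", 1)])
picks = [next(balancer) for _ in range(6)]
print(picks)  # ['b1', 'b1', 'b2', 'b1', 'b1', 'b2']
```

Production balancers (nginx included) interleave the weighted picks more smoothly, but the invariant is the same: over a full cycle, b1 gets twice the traffic of b2.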
  32. Scaling linear [Chart: income vs. spending, a linear performance line]
  33. Scaling web tier • many servers – put front- and back-ends on the same box (much simpler maintenance) • don't store state on these boxes • loose coupling • any shared resource makes boxes "coupled" • share carefully • common errors: – common data via NFS (sessions, code) => local copies, sessions in memcached – heavy real-time writes into a shared db => async messages if possible – local cache => global cache
  34. nginx: load balancing
      upstream backend {
          server … weight=5;   # server addresses elided in the original
          server …;
          server unix:/tmp/backend3;
      }
      server {
          location / {
              proxy_pass http://backend;
          }
      }
  35. nginx: fastcgi
      upstream backend {
          server www1.lan:8080 weight=2;
          server www2.lan:8080;
      }
      server {
          location / {
              fastcgi_index index.phtml;
              fastcgi_param [param] [value] ...
              fastcgi_pass backend;
          }
      }
  36. Protected static files performance • static files with restricted access • you need some "logic" to check access rights • scripting is expensive: a "heavy" process for each client • X-Accel-Redirect: a "heavy" process checks the rights quickly and returns a special header with the file name • URL certificates (signed URLs): best practice, no scripting at all
  37. Caching • «memory»: 1e-9 – 1e-6 s, «network»: ~1e-4 s, «disk»: 1e-3 s and slower • 100% static content (pages, images etc.), HTML blocks, «objects» • complexity: – if-modified-since (no request) – proxy cache (cache data is stored on a web server) – object (serialized) cache (a cache storage is used) • industry standard – memcached; also popular: Redis (more than a cache) and others
  38. Local vs. global cache [Diagram: backends with local caches (LC) vs. each backend talking to all global cache servers] • memory utilization (very bad for huge clusters) • incoherence • intranet latency is small, so use a global in-memory cache
  39. Memcached • (LiveJournal -> Facebook) • shared cache server • FSM (libevent) • memory slabs, items in 2^N-sized chunks • ideal for sessions and object caches • performance tips: keep objects small, compress the rest (CPU-bound? use thresholds); multi-get; stats (get, set, hit, miss + slab info)
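The 2^N slab idea can be sketched as follows (a simplification: real memcached uses a configurable growth factor and minimum chunk size, so treat the numbers as illustrative):

```python
# Slab sketch following the slide's 2^N model: each item goes into the
# smallest chunk class that fits it, so chunk size minus item size is
# pure overhead -- which is why small, well-sized objects matter.
def slab_chunk_size(item_size, min_chunk=64):
    chunk = min_chunk
    while chunk < item_size:
        chunk *= 2
    return chunk

print(slab_chunk_size(100))   # 128  -> 28 bytes wasted
print(slab_chunk_size(3000))  # 4096 -> items just over a boundary waste ~25%+
```

This is also the reason compression thresholds pay off: shrinking an item below a slab boundary halves its memory footprint.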
  40. Scaling cache • global cache: how do you map data to a server? • server = crc32(key) % N and variations • problem when adding a new server: 100% miss (cold start) • solutions: 1. don't use complex queries, and flush the caches periodically to check that your cold start is still quick (Badoo: the cache cluster is flushed several times per year) 2. distribution tricks like Ketama • years in production: old (slow) and new (fast) boxes – run several daemons on one machine, use virtual buckets
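A "distribution trick like Ketama" is consistent hashing. Here is a simplified sketch (the vnode count and md5 choice are illustrative, not Ketama's exact algorithm) showing that adding a fourth server moves only about a quarter of the keys, instead of invalidating nearly all of them as plain crc32(key) % N does:

```python
import bisect
import hashlib

# Consistent-hashing sketch: each server gets many points on a hash ring;
# a key maps to the next point clockwise. Adding a server steals keys
# only from its ring neighbours (~1/N of the total).
class HashRing:
    def __init__(self, servers, points_per_server=100):
        self.ring = []
        for s in servers:
            for i in range(points_per_server):
                h = int(hashlib.md5(f"{s}#{i}".encode()).hexdigest(), 16)
                self.ring.append((h, s))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    def get(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        i = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[i][1]

old = HashRing(["cache1", "cache2", "cache3"])
new = HashRing(["cache1", "cache2", "cache3", "cache4"])
moved = sum(old.get(f"user{i}") != new.get(f"user{i}") for i in range(1000))
print(moved)  # roughly a quarter of the 1000 keys move, not all of them
```

Virtual buckets achieve the same goal more explicitly: the bucket→server map changes, the key→bucket formula never does.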
  41. Advanced topic (PHP-only) • can skip • will be useful for PHP developers only • covers PHP-FPM, initially developed at Badoo • 6 slides – cover or skip?
  42. PHP • use an accelerator: APC, xcache, ZPS, eAccelerator • PHP is quite hungry for memory & CPU (rough per-process footprint: C: 1M, Perl: 10M, PHP: 20M) • FCGI (fpm)
  43. PHP-FPM • PHP-FPM: PHP FastCGI process manager • server architecture close to nginx (master + N workers) • meets production requirements: non-stop live binary and configuration upgrades; see all errors; react to suspicious worker behavior (latency, mass deaths); dynamic pools (mostly useful for shared hosting)
  44. PHP-FPM: basic features • graceful reload: live binary & conf updates • the master process catches the workers' stderr – you'll see everything in the logs • slow workers are auto-traced & killed • emergency auto-reload when a massive worker crash is detected
  45. PHP-FPM: advanced features • fatal blank page: the header will NOT be 200 on fatals • fastcgi_finish_request() – send the output to the client and continue (sessions, stats etc.) • accelerated upload support (request_body_file – nginx 0.5.9+) • groups: highload-php-(en|ru)
  46. Flipchart session • questions? • Case #1: knowledge base (like Wikipedia) • Case #2: media storage (photo/video hosting, file sharing etc.)
  47. 3. Databases, sharding
  48. Imagine you are… a database • and you're doing a SELECT • a rough approximation • establish the connection, allocate resources (speed, memory-per-connection on the server side) • read the query • check the query cache (if enabled; memory, invalidation) • cont. on the next slide …
  49. SELECT (cont.) • parse the query (CPU, bind vars, stored procs) • "get data" (index lookup, buffer cache, disk reads) • "sort data" (or just read it sorted!) • in-memory, filesort, key buffer • output, clean up, close the connection…
  50. SELECT: resume • many steps and details • every step uses some "resource" • the principal feature of relational databases was that you only needed to know SQL to talk to them • bad news: we have to know much more to tune databases
  51. So, MySQL performance (1/3) • many engines – MyISAM, InnoDB, Memory (Heap); pluggable • locking: MyISAM table-level, InnoDB row-level • «manual» locks: SELECT GET_LOCK, SELECT ... FOR UPDATE • indices: B-TREE, HASH (no BITMAP) • point -> range scan -> full scan • fully matching prefix; InnoDB PK: clustering, covering indexes ("using index") • disk fragmentation
  52. MySQL performance (2/3) • MyISAM key cache, InnoDB buffer pool • dirty buffers and transaction logs: innodb_flush_log_at_trx_commit • many indexes – heavy updates • sorting: in-memory (sort buffers), filesort
  53. MySQL performance (3/3) • USE EXPLAIN • Extra: using temporary, using filesort • innodb_flush_method = O_DIRECT • ALTERs can be heavy: use many small tables instead of one big one • partitioning
  54. MySQL common practices • applications: OLAP and OLTP • OLAP – MyISAM (Infobright and other column-based engines) • OLTP – InnoDB • imagine you are the database: what operations will be executed? do you need all of them? • replace heavy operations with lighter ones • don't be afraid of denormalization • think about scaling from the very beginning
  55. Denormalization • remove an extra join • remove sorting • remove grouping • remove filtering • make materialized views • many other things … • examples: counters; trees in databases (materialized path); inverted search index
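The "materialized path" trick for trees can be sketched like this (the table is modeled as a list of dicts; column names are made up):

```python
# Materialized-path sketch: store each node's full ancestor path as a
# string column, so "give me the whole subtree" becomes one indexed
# prefix scan instead of a recursive chain of joins.
rows = [
    {"id": 1, "path": "1/",     "title": "root"},
    {"id": 2, "path": "1/2/",   "title": "child"},
    {"id": 3, "path": "1/2/3/", "title": "grandchild"},
    {"id": 4, "path": "1/4/",   "title": "other child"},
]

def subtree(rows, node_path):
    # In SQL: WHERE path LIKE '1/2/%' -- a range scan on an index on path
    return [r["id"] for r in rows if r["path"].startswith(node_path)]

print(subtree(rows, "1/2/"))  # [2, 3]
```

The denormalization cost is the usual one: moving a node means rewriting the paths of its whole subtree, a price paid rarely in exchange for cheap reads on every page view.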
  56. Other tips and tricks • multi-row operations • ON DUPLICATE KEY UPDATE • table switching (RENAME) • MEMORY tables as temporary storage • updated = updated (a no-op assignment, e.g. to keep an auto-updated timestamp unchanged)
  57. Scaling databases • we want linear scalability and easy support • many people start with replication • replication is not bad, but it's limited • the only "true" scale-out solution is sharding
  58. Scaling databases • vertical splitting: by tasks (tables) – put tables that are used together onto another box • horizontal splitting: by primary entities (users, documents) – split one table into many small ones and move them to other boxes
  59. Replication basics • single server, writes/reads << 1 • adding a new one gives more power for reads • in the beginning, ~100% growth (linear) • writes still go to the master – writes are not scaled • more servers – less efficiency • higher writes/reads factor – less efficiency • social networks, UGC – many writes
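A back-of-the-envelope sketch of why a higher writes/reads factor hurts replication (my own illustration, not a slide formula): every replica must replay all the writes, so with a write fraction w only (1 − w) of each added box actually serves reads:

```python
# Effective read capacity of a replicated cluster, in "server-equivalents":
# each of the n servers spends fraction w of its time replaying writes.
def read_capacity(n_servers, write_fraction):
    return n_servers * (1.0 - write_fraction)

# Read-mostly site: adding boxes still helps.
print(read_capacity(4, 0.05))  # ~3.8 servers' worth of reads
# UGC/social site with many writes: half of every new box is wasted.
print(read_capacity(4, 0.5))   # 2.0
```

This is the "more servers – less efficiency" curve in one line, and the reason write-heavy workloads push you toward sharding rather than more replicas.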
  60. Replication problems • close to linear only at the very beginning • copies: ineffective disk and memory (buffer pool, FS cache) utilization • MySQL particularities: serving the slaves, replication applied by a single thread, etc.
  61. [Chart: the gap (G) between income and spending – 1) bigger for heavier writes, 2) bigger for write-intensive applications]
  62. Scaling linear [Chart: income vs. spending, a linear performance line]
  63. Sharding • spread writes across all database nodes and achieve true scale-out • which attribute do you shard by? • how do you map data to a shard? • how do you keep keys unique across the whole system? • how do you query data from multiple nodes? how do you run analytical queries? • how do you re-shard? • how do you back up?
  64. Mapping data to a shard • primary attribute: user_id, document_id … • unmanaged: id -> hash % N -> server • better, virtual buckets: id -> hash % N -> bucket -> [C] -> server (user -> bucket is determined by a formula) • best, "dynamical": user -> bucket is configurable: id -> [C1] -> bucket -> [C2] -> server • configuration: C1 – "dynamical", C2 – almost static
  65. Sharding topology • two main patterns: – proxy: hides the sharding logic – coordinator: just tells you exactly where to go • proxy: harder to build from scratch, easy to write apps against • coordinator: easier to build from scratch, relatively harder to use; the architecture doesn't hide anything and provokes developers to learn the internals
  66. Dynamical mapping • ID -> {map 1} -> bucket -> {map 2} -> server • "coordinates": datacenter, server, schema, table • mapping: ID -> {bucket}; {bucket} = {server, schema, table} • 42 = {db15.dc3, Shard7, User33} • 42 = {30015, 7, 33} • almost "static" (changes rarely: re-sharding)
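The two-level mapping from this slide can be sketched as follows (the bucket count, naming scheme and the crc32 choice are all illustrative):

```python
import zlib

# Two-level mapping sketch: a stable formula maps id -> bucket, and a
# small, rarely-changing table maps bucket -> physical coordinates.
# Re-sharding only edits the second map; the formula never changes.
N_BUCKETS = 16  # fixed forever; far more buckets than initial servers

def bucket_of(user_id):
    return zlib.crc32(str(user_id).encode()) % N_BUCKETS

# map 2: bucket -> (server, schema, table) -- the part that can change
bucket_map = {b: ("db%d" % (b % 2 + 1), "Shard%d" % b, "User%d" % b)
              for b in range(N_BUCKETS)}

def locate(user_id):
    return bucket_map[bucket_of(user_id)]

server, schema, table = locate(42)
print(server, schema, table)
```

Moving a bucket to a new box is a single row update in the coordinator's table, invisible to the id -> bucket formula baked into the application.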
  67. Dynamical mapping [Diagram: the WebApp asks the Coordinator "where is #1234?", then fetches the data from the right storage node]
  68. Case #3: sharding • flipchart! • the most difficult part of the tutorial • don't hesitate to ask questions • additional questions to answer: how to query data from multiple nodes? how to run analytical queries? how to re-shard? how to back up?
  69. MySQL in Badoo (1/3) • a minus in theory – a plus in practice • they say MySQL is "stupid" • while this usually means that: – MySQL doesn't allow complex dependencies – so MySQL just doesn't dictate an ineffective architecture – it's no rocket science to build a system for millions of users and thousands of boxes on commodity servers
  70. MySQL in Badoo (2/3) • InnoDB • avoid complex queries • no FKs, triggers or stored procedures • homemade sharding, replication and upgrade automation • a virtual coordinate shard_id is mapped to the physical coordinates {serverX, dbY, tableZ}
  71. MySQL in Badoo (3/3) • no "transparent" proxies that "hide" the architecture • clients are routed dynamically • queues – MySQL (transaction-based events); Scribe and RabbitMQ have also been used • the architecture didn't change during 6 years of growth from 0 to 130M users
  72. 4. Queues
  73. Queues • if we can do something later, the client shouldn't wait • while sharding is "separation in space", queueing is "separation in time" • we will cover the basics and show how to build such a component
  74. Distributed communications • RPC = remote procedure calls • MQ = message queues • synchronous: remote services • asynchronous: queues • a bunch of ready standalone products • transaction-generated queues • standalone systems and the transactional-integrity problem
  75. RPC/MQ: concept [Diagram – RPC: synchronous, "point-to-point": the client sends a request, the server returns a result; MQ: asynchronous, "publisher-subscriber": the client publishes a message (job) to the queue, consumers process it]
  76. Database-driven MQ [Diagram: publisher -> database -> subscriber] • transactional integrity • relatively slow • mostly used for transaction-based queues • hundreds of events/sec per shard server is OK • subscribers: event dispatching
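A sketch of such a database-driven queue, using SQLite for self-containment where the tutorial uses MySQL (the table and column names are made up):

```python
import sqlite3

# Database-driven queue sketch: the event is inserted in the same
# transaction as the business change, so the queue can never disagree
# with the data it describes -- the "transactional integrity" point.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)")

# publisher: business write + event, one transaction
with db:
    db.execute("INSERT INTO queue (payload) VALUES (?)",
               ("send_welcome_email:42",))

# subscriber: grab a batch of pending events, process, mark done
def consume(db, batch=10):
    rows = db.execute(
        "SELECT id, payload FROM queue WHERE done = 0 ORDER BY id LIMIT ?",
        (batch,)).fetchall()
    for rid, _payload in rows:
        db.execute("UPDATE queue SET done = 1 WHERE id = ?", (rid,))
    db.commit()
    return [payload for _rid, payload in rows]

processed = consume(db)
print(processed)  # ['send_welcome_email:42']
```

The trade-off is exactly the slide's: hundreds of events/sec per shard server is fine, but a dedicated broker is faster when you don't need the transactional coupling.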
  77. Case #4: MySQL-based queues • flipchart! • model, event processing, failover, scaling • decentralized queues
  78. 5. Lean production: measuring
  79. Development + support = 100% [Chart: development time vs. support time – small or just-started («dynamical») projects are mostly development; tired projects are mostly support]
  80. Monitoring • server monitoring is useless for strategic analysis • good monitoring: connects "business" and "technical" values; visualizes the flows between subsystems; helps to optimize those flows; generally, helps to make the right decisions • user -> (something complex) -> servers -> monitoring • in a big system you can't "reconstruct" the flows from server monitoring
  81. "Traditional" monitoring
  82. Lean way • users make requests, that's all • latency (how long a request is processed on the server): for various apps (scripts); statistics, not just the average; the internal "structure" of a request – which subsystems process it and what their impact on the latency is • requests per second, for various subsystems
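"Statistics: not just the average" – a tiny illustration with synthetic numbers of why percentiles matter more than the mean:

```python
import statistics

# 3% slow requests barely move the mean but dominate the high
# percentiles that users actually feel.
latencies = [0.05] * 97 + [2.0] * 3  # seconds, synthetic data

def percentile(data, p):
    data = sorted(data)
    return data[min(len(data) - 1, int(len(data) * p / 100))]

mean = statistics.mean(latencies)
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
print(mean, p50, p99)  # mean ~0.11 s looks fine; p99 is a 2-second wait
```

An average-only dashboard would report a healthy ~0.11 s here while one user in a hundred waits two full seconds – which is exactly the kind of signal Pinba's per-script report tables expose.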
  83. Maintenance • latency/RPS by server (server group, datacenter …) • real-time • CPU usage by apps (scripts) • what changes with new releases
  84. PINBA • a PHP extension handles "start" and "finish" for every request • collects script_name, host, time, rusage … • sends a UDP packet on request shutdown • from your whole web cluster • a listener/server thread inside MySQL (v. 5.1.0+) • an SQL interface to all the data
  85. PINBA: client data • request: script_name, host, domain, time, rusage, peak memory, output size, timers • timers: time + "key (tag) – value" pairs • example: 0.001 sec, {group => "db::update", server => "dbs42"}
  86. PINBA: server data • SQL: "raw" data or reports • reports – separate tables, updated in real time • base reports (~10): general, by script, by host+script pairs… • tag reports: CREATE TABLE R … (ENGINE=PINBA COMMENT='report:foo,bar') • R: {script_name, foo_value, bar_value, count, time} • – many examples • 2012 – an nginx module for HTTP statuses was added
  87. Pinba: real-time monitoring [Screenshot: req/sec and average time by scripts, virtual hosts, physical servers]
  88. Request time (latency) [Screenshot]
  89. WTF? [Screenshot: a latency anomaly]
  90. Now we know: scripts, times, periods – we know where to dig
  91. A year passes, the code rots. The law: usage grows until you start refactoring
  92. Slowest requests [Screenshot]
  93. Memcached stats • traditional stats: req/sec, hit/miss, bytes read/written • stats slabs • stats items • stats cachedump
  94. Memcached: stats [Screenshot]
  95. Cachedump (1/4) 17th slab = 128K
      stats cachedump 17
      ITEM uin_search_ZHJhZ29uXzIwMDM0QGhvdG1haWwuY29t [65470 b; 1272983719 s]
      ITEM uin_search_YW5nZWw1dHJpYW5hZEBob3RtYWlsLmNvbQ== [65529 b; 1272974774 s]
      ITEM unreaded_contacts_count_55857620 [83253 b; 1272498369 s]
      ITEM antispam_gui_1676698422010-04-17 [83835 b; 1271677328 s]
      ITEM antispam_gui_1708317782010-04-15 [123400 b; 1271523593 s]
      ITEM psl_24139020 [65501 b; 1271335111 s]
      END
  96. Cachedump (2/4) • extract the group name from the cachedump • see size distributions, find anomalies • or just spot some stupid errors • or make decisions: – time to switch on compression – split objects into parts • a big object is evil for memcached
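The "extract the group name from the cachedump" step can be sketched like this (the suffix-stripping regex is a heuristic you would adapt to your own key scheme; the keys are shortened versions of the cachedump slide's):

```python
import re
from collections import defaultdict

# Strip the variable suffix (an id or base64 blob) off each cachedump key
# to get a "group" name, then look at size distributions per group.
dump = [
    ("uin_search_ZHJhZ29u", 65470),
    ("uin_search_YW5nZWw1", 65529),
    ("unreaded_contacts_count_55857620", 83253),
    ("psl_24139020", 65501),
]

def group_name(key):
    # heuristic: drop the last underscore-separated id/base64 chunk
    return re.sub(r"_[A-Za-z0-9+/=]+$", "", key)

sizes = defaultdict(list)
for key, nbytes in dump:
    sizes[group_name(key)].append(nbytes)

for group, vals in sorted(sizes.items()):
    print(group, max(vals))
```

Once keys are grouped, per-group size and access-time distributions (the next slide's point) fall out of the same loop.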
  97. Cachedump (3/4) • extract the group name from the cachedump • see the access-time distribution • you can play with lifetimes • T_lifetime >> T_access? – decrease the lifetime for this group
  98. Cachedump (4/4) • can be very slow • buggy (at least in old versions) • treat the results as statistical samples • or increase the crazy static buffer in the source code
  99. Auto debug & profiling (1/2) • how to profile the code? • Callgrind & co – good, but too much data, 99.99% of it useless • reduction of dimension: measure only the potentially slow parts (IO: disk ops; remote queries – db, memcached, C/C++ services, …) • timers in PINBA • add a summary: average time, CPU, remote queries by group • devel: always append this to the end of every page • production: can be written to logs
  100. Auto debug & profiling (2/2) • what happens between subsystems • «cost» visualization • easy to find non-trivial bugs: – no dbq -> memq with refresh – many gets instead of a multi-get (or many inserts instead of a multi-insert, et cetera) – complex inter-server transactions – many connections to one and the same server (database, …) – cache-set when the database is down or an error occurred – reading from a slave what was just written to the master – many more…
  101. What's missed • component stats: MySQL, apache, nginx… • server monitoring • client-side stats (DOM_READY, ON_LOAD) – very important • errors
  102. Spasibo! (Thank you!) • 6. Question session • please fill in the feedback form: electronic ( or paper (available at my desk). Put down your email and I'll send you this presentation • please give me your feedback, especially critical feedback