Web scale MySQL at Facebook (Domas Mituzas)



  2. Web scale MySQL @ Facebook
     Domas Mituzas
     2011-10-03
  3. Agenda
  4. Facebook
     - 800M active monthly users
     - 500M active daily users
     - 350M mobile users
     - 7M apps and websites integrated via the platform
  5. Current
  6. Setup
     - Software
       - MySQL 5.1
       - Custom Facebook patch (Launchpad: mysqlatfacebook)
       - Extra resiliency
       - Reduced operations effort
     - Hardware
       - Variety of generations
       - Many cores
       - Local storage
       - Some flash storage
  7. UDB performance numbers (from Sep. 2011)
     - Query response time: 4 ms reads, 5 ms writes
     - Network bytes sent per second: 90 GB peak
     - Queries per second: 60M peak
     - Rows read per second: 1450M peak
     - Rows changed per second: 3.5M peak
     - InnoDB page I/O per second: 8.1M peak
  8. Performance focus
     - Focus on reliable throughput in production
     - Avoid performance stalls
     - Make sure the hardware is actually used
     - Watch the 99th percentile rather than the average or median
     - Worst-offender analysis: top-N lists and histograms instead of tier averages (sketch below)
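A minimal sketch of the style of analysis this slide describes; the host names, latencies, and function names are illustrative, not Facebook tooling.

    from collections import defaultdict

    def p99(latencies_ms):
        # 99th percentile via the nearest-rank method.
        ordered = sorted(latencies_ms)
        return ordered[max(0, int(round(0.99 * len(ordered))) - 1)]

    def worst_offenders(samples, n=5):
        # Top-N hosts by p99 latency; a tier-wide average would hide them.
        by_host = defaultdict(list)
        for host, latency_ms in samples:
            by_host[host].append(latency_ms)
        return sorted(((p99(v), host) for host, v in by_host.items()),
                      reverse=True)[:n]

    # Illustrative samples: (host, query latency in ms).
    samples = [("db1", 4.0)] * 98 + [("db1", 90.0)] * 2 + [("db2", 5.0)] * 100
    print(worst_offenders(samples))  # db1's stalls surface; its mean would not

The point of ranking by p99 rather than mean is visible in the sample data: db1's average is close to db2's, but its tail is an order of magnitude worse.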
  9. Stalls
     - "Dogpiles"
     - Temporary slowdowns: even 0.1 s is huge
  10. Stall tools
      - Dogpiled (in-house)
        - Snapshot aggregation of server state at distress
        - "Time machine" view into logs before the event, too
      - Aspersa (stalk, collect)
      - Poor man's profiler (poormansprofiler.org); later iterations: apmp, hpmp, tpmp (sketch below)
      - GDB
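The poor man's profiler is in reality a shell one-liner around gdb; the Python rendition below is my paraphrase of the same idea: take one batch-mode snapshot of every thread's backtrace and count identical stack signatures, so the most common wait site bubbles to the top.

    import subprocess
    from collections import Counter

    def pmp(pid, top=10):
        # One batch-mode gdb snapshot of all threads' backtraces.
        out = subprocess.run(
            ["gdb", "-batch", "-ex", "set pagination 0",
             "-ex", "thread apply all bt", "-p", str(pid)],
            capture_output=True, text=True).stdout
        stacks, frames = Counter(), []
        for line in out.splitlines():
            if line.startswith("Thread ") and frames:
                stacks[",".join(frames)] += 1
                frames = []
            elif line.startswith("#"):
                # Keep just the function name from each frame.
                parts = line.split(" in ")
                frames.append(parts[1].split("(")[0].strip()
                              if len(parts) > 1 else "??")
        if frames:
            stacks[",".join(frames)] += 1
        # The most frequent stack signature is usually the stall site.
        for stack, count in stacks.most_common(top):
            print(count, stack)

    # pmp(mysqld_pid)  # needs permission to ptrace the server process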
  11. Stalls found
      - Table extension: global I/O mutex held
      - DROP TABLE: both SQL-layer and InnoDB global mutexes held
      - Purge contention: unnecessary dictionary lock held
      - Binlog reads: no commits can happen while old events are being read
      - Kernel mutex: O(N) and O(N^2) operations
        - Transaction creation
        - Lock creation/removal, deadlock detection
      - Background page flushing not really background
      - Many more
  12. Efficiency
      - Increasing utilization of hardware
        - Memory-to-disk ratio
      - Finding bottlenecks
        - Normally disk-bound
        - Sometimes network
        - Application or server software chokepoints
        - Rarely CPU or memory bandwidth
      - Application design
        - Biggest wins are in optimizing the workload
  13. Disk efficiency
      - Normally disk-IOPS bound
      - Allowing higher queue lengths
        - Can operate at more than 8 pending operations per disk
      - InnoDB page size
        - Needs to be adjustable per table or index for real gain
      - XFS / deadline scheduler
      - Parallelism at the MySQL layer
        - >300 IOPS on 166-rps disks (toy model below)
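A toy model of the arithmetic I read into the last bullet, on my assumptions: "166 rps" is a 10K-RPM spindle (10000/60, about 166.7 rotations per second), a serial random-read workload pays roughly one rotation per I/O and so saturates near 166 IOPS, and a deeper queue lets the drive sort requests into rotational order and serve several per revolution.

    # Toy model, not a benchmark. Assumption: a depth-N queue lets the
    # drive serve min(N, cap) reordered I/Os per rotation instead of one.
    ROTATIONS_PER_SEC = 10_000 / 60   # ~166.7 for a 10K RPM disk

    def modeled_iops(queue_depth, per_rotation_cap=4):
        return ROTATIONS_PER_SEC * min(queue_depth, per_rotation_cap)

    for depth in (1, 2, 4, 8):
        print(depth, round(modeled_iops(depth)))
    # depth 1 -> ~167 IOPS; depth 2 already clears the >300 figure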
  14. Memory efficiency
      - Compact records: Thrift compaction for objects, etc.
      - Clustered and covering index planning
      - FORCE INDEX: avoid unnecessary I/O and cached pages
      - Historical data access is costly
        - Full table scans: ETL-type queries, mysqldump, …
        - Tune InnoDB's midpoint-insertion LRU (sketch below)
      - Incremental updating, incremental binary backups
      - O_DIRECT access for data and logs
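A hedged sketch of two of these items, using the stock InnoDB-plugin variables for the midpoint-insertion LRU; the connection details, table, and index names are invented.

    import MySQLdb  # mysqlclient; any DB-API driver works the same way

    conn = MySQLdb.connect(host="127.0.0.1", user="admin", db="udb")
    cur = conn.cursor()

    # Midpoint-insertion LRU: pages read by a scan must survive 1s in
    # the "old" sublist before promotion, so ETL scans and mysqldump
    # can't evict the hot working set.
    cur.execute("SET GLOBAL innodb_old_blocks_time = 1000")
    cur.execute("SET GLOBAL innodb_old_blocks_pct = 25")

    # FORCE INDEX pins the plan to a covering index, so a bad optimizer
    # choice can't drag extra pages into the buffer pool.
    cur.execute(
        "SELECT id2, t FROM assoc FORCE INDEX (ix_id1_type) "
        "WHERE id1 = %s AND assoc_type = %s",
        (1234, 5),
    )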
  15. Pure flash
      - (Cheating)
      - Data stored directly on flash
      - Limited data size
      - Not utilizing the flash card fully
      - Still used in some cases
  16. Flashcache
      - Flash in front of disks
        - Can use slower disks
      - Write-back cache
      - Much more data storage
      - Able to utilize much more of the flash card
      - Very long warmup time
      - Open source (github.com/facebook/flashcache)
  17. MySQL 2x
      - Flash allows for large loads
      - Large performance difference from pure-disk servers
      - Many older servers still being used
      - Solution?
        - Run multiple MySQL instances per server
        - Use ports 3307, 3308, 3309, etc. (sketch below)
        - Replication prevents direct consolidation
        - Redo a lot of port assumptions in code
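A minimal sketch of what running several mysqld instances on one box entails: each needs its own port, socket, and datadir. The section names follow mysqld_multi's [mysqldN] convention; every path and value here is an assumption, not Facebook's layout.

    def instance_cnf(n, port, server_id):
        # One [mysqldN] section per instance, mysqld_multi style.
        return "\n".join([
            f"[mysqld{n}]",
            f"port      = {port}",
            f"socket    = /var/run/mysqld/mysqld{n}.sock",
            f"datadir   = /data/mysql{n}",
            f"server-id = {server_id}",
            "",
        ])

    print("\n".join(instance_cnf(i + 1, port, 101 + i)
                    for i, port in enumerate((3307, 3308, 3309))))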
  18. Application caching
      - Old: memcached
        - Cache-invalidation stampedes, refetching the full dataset on refresh, many copies
      - New: write-through caching (sketch below)
        - Incremental cache updates
        - Cache hierarchies for datacenter-local copies
        - Efficient operations for association sets
        - Common API for all use cases
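A conceptual sketch of the old-versus-new contrast; the db and cache objects and their methods are invented stand-ins, not any real API.

    class WriteThroughCache:
        # On write, update the database and then patch the cached copy
        # in place, instead of invalidating it and letting every reader
        # stampede to refetch the full dataset.
        def __init__(self, db, cache):
            self.db, self.cache = db, cache

        def add_assoc(self, key, item):
            self.db.insert(key, item)        # durable write first
            cached = self.cache.get(key)
            if cached is not None:
                cached.append(item)          # incremental update,
                self.cache.set(key, cached)  # not delete-and-refetch

        def get_assocs(self, key):
            cached = self.cache.get(key)
            if cached is None:               # miss: a single fetch refills
                cached = self.db.select_all(key)
                self.cache.set(key, cached)
            return cached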
  19. Group commit
      - Some OLTP workloads are too busy even for modern RAID cards
      - High I/O pressure increases response times
      - Durability compromises increase operational overhead
        - Dead batteries are extremely painful otherwise
      - Now in 5.1.52-fb
  20. Admission control
      - Server resources are limited
      - Per-account thread concurrency (sketch below)
        - Reduces the chance of O(N^2) blowups
        - max_connections no longer impacts server load
      - Per-application resource throttling
      - Now in 5.1.52-fb
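A toy illustration of the policy; the real feature lives inside the 5.1.52-fb server, so this client-side class only mimics the behavior.

    import threading

    class AdmissionControl:
        # Cap concurrently *running* queries per account; excess work
        # queues instead of piling onto the server, so max_connections
        # stops being a proxy for load.
        def __init__(self, max_running_per_account=10):
            self.limit = max_running_per_account
            self.slots = {}
            self.lock = threading.Lock()

        def _semaphore(self, account):
            with self.lock:
                if account not in self.slots:
                    self.slots[account] = threading.BoundedSemaphore(self.limit)
                return self.slots[account]

        def run(self, account, query_fn):
            with self._semaphore(account):  # blocks at the per-account cap
                return query_fn()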
  21. Online Schema Change
      - External PHP script, open source
      - Uses triggers for change tracking (outline below)
      - Used on 100GB+ tables
      - Dump/reload + fast index creation
      - Extendable class; may allow:
        - PK composition changes with conflict resolution
        - Indexing previously unindexed datasets
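An outline of the trigger-based flow the slide describes; the actual tool is the open-source PHP script, every object name below is invented, and the deltas-table DDL plus the UPDATE/DELETE triggers are elided for brevity.

    OSC_STEPS = [
        # 1. Shadow table with the desired new schema.
        "CREATE TABLE t_new LIKE t",
        "ALTER TABLE t_new ADD INDEX ix_new (col)",
        # 2. Triggers capture concurrent changes into a deltas table.
        "CREATE TRIGGER t_ai AFTER INSERT ON t FOR EACH ROW "
        "INSERT INTO t_deltas VALUES ('I', NEW.id)",
        # 3. Bulk copy existing rows (dump/reload + fast index creation).
        "INSERT INTO t_new SELECT * FROM t",
        # 4. Replay captured deltas, repeating until the backlog is small.
        # 5. Briefly lock, replay the final deltas, swap the tables.
        "RENAME TABLE t TO t_old, t_new TO t",
    ]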
  22. Tools
      - Table and user statistics
      - Shadows
      - Slocket
      - pmysql
      - Replication sampling
      - Client log aggregation
      - Query comments
      - Indigo (query monitor)
  23. Future
  24. Future
      - MySQL is never a solved problem
      - Always investigating better/new solutions
        - New hardware types
        - New datacenters and topologies
        - New use cases and clients
        - New neighbors to share data with
  25. Visibility
      - Never assume; use metrics to measure
      - When metrics aren't available, add them
      - Full stack
        - More InnoDB info
        - More application info
  26. Replication
      - Lag used to be a big problem and is still a bottleneck
      - Possible solutions:
        - "Better" slave prefetch (sketch below)
          - The Maatkit version has problems
          - Our own version is used successfully on some tiers
          - May be possible with InnoDB cooperation
        - Continuent parallel slave
        - Oracle parallel slave in 5.6
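A sketch of the slave-prefetch idea as I read this slide: walk the relay log slightly ahead of the single-threaded SQL thread and issue equivalent lock-free SELECTs, so the pages each event needs are already warm when it is applied. The regex and plumbing are stand-ins, not code from any of the tools named above.

    import re

    UPDATE_RE = re.compile(r"UPDATE\s+(\S+)\s+SET\s+.+?\s+WHERE\s+(.+)",
                           re.I | re.S)

    def prefetch(statement, cursor):
        # Turn one upcoming replicated UPDATE into a page-warming SELECT:
        # same index lookup and page reads, but no row locks, no changes.
        m = UPDATE_RE.match(statement)
        if m:
            table, where = m.groups()
            cursor.execute(f"SELECT 1 FROM {table} WHERE {where} LIMIT 1")

    # A real tool would walk relay-log events a few seconds ahead of the
    # SQL thread's Exec_Master_Log_Pos and call prefetch() on each one.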
  27. InnoDB compression
      - Originally planned during the 5.1 upgrade
      - Problems:
        - Replication stream cost
        - Increased log writes
        - Performance in some cases
        - Stability, monitoring, etc.
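For reference, the stock 5.1 InnoDB-plugin incantation for turning on table compression; the table name and the 8KB block size (half of the default 16KB page) are illustrative.

    # Barracuda file format + file-per-table are prerequisites for
    # ROW_FORMAT=COMPRESSED in the 5.1 InnoDB plugin.
    COMPRESSION_SETUP = [
        "SET GLOBAL innodb_file_format = 'Barracuda'",
        "SET GLOBAL innodb_file_per_table = 1",
        "ALTER TABLE assoc ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8",
    ]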