This document summarizes strategies for scaling a Magento installation. It discusses code management techniques like using good IDE tools and avoiding modifying core files. It outlines hardware profiles including networks, databases, caches and utility servers. It describes the team structure with 16 committers across 5 departments and 31 vendors. Effective communication practices and documentation are emphasized. Release processes, deployments, community participation and collaboration texts are also summarized.
4. Code Management
● Magento is big!
o Our project has over 820,000 lines of PHP
● Multi-lingual, multi-currency, multi-store
● Classes can have complex names
o *cough* Enterprise_Reward_Block_Adminhtml_Customer_Edit_Tab_Reward_History_Grid_Column_Renderer_Reason *cough*
5. Code Management (cont.)
● Configuration is driven by XML
● The dreaded EAV
● Magento Indices
● Event-Observer
7. Code Management
● NEVER modify core files
o Magento’s forum never helped
● NEVER* add files to app/code/local/Mage
o Magento was built to be modular**
● Test your code with flat catalog enabled
and disabled
● Before overwriting classes, check for events
8. Code Optimization (Quick Wins)
Caching Magento Blocks
● DIY! Event to add cache data:
core_block_abstract_to_html_before
● OR use a module
https://github.com/aligent/CacheObserver
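The DIY approach can be sketched as an observer on that event. Everything below is illustrative (module, class, block list, and lifetimes are hypothetical, not the aligent/CacheObserver code); the observer must be registered for core_block_abstract_to_html_before in a custom module's config.xml.

```php
<?php
// Hypothetical observer; register it under
// <global><events><core_block_abstract_to_html_before><observers>
// in a custom module's config.xml.
class My_CacheObserver_Model_Observer
{
    /** Cache lifetimes in seconds per block class; values illustrative. */
    protected $_lifetimes = array(
        'Mage_Cms_Block_Block' => 3600,
    );

    public function applyBlockCache(Varien_Event_Observer $observer)
    {
        $block = $observer->getEvent()->getBlock();
        foreach ($this->_lifetimes as $class => $seconds) {
            if ($block instanceof $class && !$block->hasData('cache_lifetime')) {
                $block->setData('cache_lifetime', $seconds);
                // The key must vary by store and block identity, or pages
                // will bleed cached content across stores.
                $block->setData('cache_key', get_class($block) . '_'
                    . Mage::app()->getStore()->getId() . '_'
                    . $block->getNameInLayout());
            }
        }
    }
}
```

Skip blocks that depend on the customer session; those are uncacheable for good reason.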
9. Code Optimization (Quick Wins)
Mage::getModel('catalog/product')->load($_product->getId());
● This is bad in templates and when looping
over product collections
● Load with initial data select
o used_in_product_listing attribute option
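The fix can be sketched as follows (a minimal example, not production code): pull the attributes you need in the collection's initial select rather than calling load() per product. Attributes flagged "Used in Product Listing" in the admin are what getProductAttributes() returns here.

```php
<?php
// Sketch: one select for the whole listing instead of N full loads.
$attributes = Mage::getSingleton('catalog/config')->getProductAttributes();
$collection = Mage::getResourceModel('catalog/product_collection')
    ->addAttributeToSelect($attributes)
    ->addAttributeToFilter('status', Mage_Catalog_Model_Product_Status::STATUS_ENABLED);

foreach ($collection as $product) {
    // No Mage::getModel('catalog/product')->load() here; the data
    // arrived with the collection's initial query.
    echo $product->getName(), "\n";
}
```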
10. Code Optimization
Make efficient use of Magento indices
● Example: Catalog URL Rewrites
o Includes all products by default (including products
marked as “Not Visible Individually”)
o Do you need SEO friendly URLs for products that
will never be seen???
o Reduce your index size by up to 95%
o Mage_Catalog_Model_Resource_Url::_getProducts
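One way to sketch the rewrite (class name and approach illustrative, not the production patch): override the resource model locally and drop not-visible products after the core method collects them. This assumes visibility is among the attributes the method loads; in practice the core select itself must be extended to fetch it.

```php
<?php
// Illustrative override: exclude "Not Visible Individually" products
// from the catalog URL rewrite index.
class My_Catalog_Model_Resource_Url extends Mage_Catalog_Model_Resource_Url
{
    protected function _getProducts($productIds, $storeId, $entityId, &$lastEntityId)
    {
        $products = parent::_getProducts($productIds, $storeId, $entityId, $lastEntityId);
        foreach ($products as $id => $product) {
            // Assumes visibility was added to the loaded attribute list.
            if ($product->getVisibility()
                == Mage_Catalog_Model_Product_Visibility::VISIBILITY_NOT_VISIBLE) {
                unset($products[$id]);
            }
        }
        return $products;
    }
}
```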
14. Hardware Profile (overview)
● 2 racks of hardware and dozens of servers
● Top quality available (and compatible) chipsets and memory
● Buffered DDR3; 1 channel per CPU
● 126 kW of stable, reliable, redundant, and
backed up power
● Minor kernel tweaks
15. Hardware Profile (network)
● NetScaler for load balancing
○ Vserver pools
○ Balances web, database, admin and Endeca
○ Monitors will remove downed hosts
● Redundant Network Infrastructure
○ Backplane uses LACP (link aggregation) for
redundancy, load balancing and failover
○ HA pairing of configurations
16. Hardware Profile (network)
Dynamic port forwarding for browsing:
kyle@localhost $ ssh -L 2221:127.0.0.1:2221 whitelistedhost.example.com
kyle@whitelistedhost $ ssh -D 2221 cluster.example.com
Static port forwarding for Navicat SSH tunneling (tunneling through a tunnel):
kyle@localhost $ ssh -L 2222:127.0.0.1:2222 whitelistedhost.example.com
kyle@whitelistedhost $ ssh -L 2222:127.0.0.1:22 cluster.example.com
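The dynamic-forwarding pair above can also be expressed as an ~/.ssh/config sketch (host aliases illustrative; ProxyJump needs OpenSSH 7.3+, older clients can use ProxyCommand ssh -W %h:%p whitelisted instead). It achieves the same result as the two manual hops: a SOCKS proxy on localhost:2221 for browsing.

```
Host whitelisted
    HostName whitelistedhost.example.com

Host cluster
    HostName cluster.example.com
    ProxyJump whitelisted
    DynamicForward 2221
```

With this in place, `ssh cluster` opens the tunnel in one command.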
17. Hardware Profile (web)
● Dual Intel Xeon E3-1230 @ 3.30GHz
● 32 GB RAM
● Dozens of servers
● nginx and PHP5-FPM
● 6:1 ratio of PHP processes to CPU cores
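At 6 processes per core, a dual E3-1230 host (8 physical cores) gets 48 workers. A hypothetical php5-fpm pool fragment (values illustrative, derived from the ratio above, not the production config):

```
[www]
listen = /var/run/php5-fpm.sock
pm = static
pm.max_children = 48   ; 6 processes per core x 8 cores
pm.max_requests = 500  ; recycle workers to bound memory growth
```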
18. Hardware Profile (database)
● Redundant database hosts
● MySQL 5.6 chosen for scaling capability
● tcmalloc further improves throughput
● Master/slave replication
● Standby hosts for warm failover
● Failure point: > 4,000 checkouts/hour
19. Hardware Profile (database)
● Quad Intel Xeon E7-2860
○ 10 cores + hyperthreading each, totaling 80 threads
● 128 GB of RAM
● RAID10 SSDs for data
○ writeback cache; noatime,noexec mount options
● RAID1 HDDs for OS
21. Hardware Profile (cache)
● Powering discrete instances of Redis
○ Sessions
○ Full page cache
○ Magento back end cache
○ Background processing queues
● Discrete instances are for threading, differing
memory limits, differing backup rules, and
multi-db deprecation
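Pointing Magento at discrete instances is a matter of app/etc/local.xml. A sketch (hosts, ports, and database numbers are illustrative) using Cm_RedisSession for sessions and Cm_Cache_Backend_Redis for the back-end cache:

```xml
<!-- Illustrative fragment of app/etc/local.xml -->
<config>
    <global>
        <session_save>db</session_save>
        <redis_session>
            <host>cache-host</host>
            <port>6379</port>
            <db>0</db>
        </redis_session>
        <cache>
            <backend>Cm_Cache_Backend_Redis</backend>
            <backend_options>
                <server>cache-host</server>
                <port>6380</port>
                <database>0</database>
                <compress_data>1</compress_data>
                <compression_lib>lzf</compression_lib>
            </backend_options>
        </cache>
    </global>
</config>
```

Note the separate ports: each concern gets its own Redis process.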
22. Hardware Profile (cache)
● Content is compressed with LZF
○ Compression and decompression with LZF is faster
than gzip so it’s an ideal solution
● Decreased utilization of network capacity
● Sentinel for failover (soon)
● RDB BGSAVE: prime number intervals
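The prime-number trick is set per instance in redis.conf; give each discrete instance a different prime save interval so BGSAVE forks rarely coincide. Values below are illustrative:

```
# sessions instance
save 293 100
# back-end cache instance
save 311 100
```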
24. Hardware Profile (cache)
● Quad Intel Xeon E5-2620 @ 2.00GHz
● 128 GB of RAM
● 4 bonded network interfaces
○ Prevents saturation of private network
○ 4 Gb/s
○ Bonding mode 5 (balance-tlb)
■ No special switch support
■ Nice when the colo manages the switch
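On Debian/Ubuntu the bond can be sketched in /etc/network/interfaces (interface names and addresses illustrative; requires the ifenslave package):

```
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-mode balance-tlb   # mode 5: no special switch support needed
    bond-miimon 100         # link monitoring interval in ms
    bond-slaves eth0 eth1 eth2 eth3
```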
25. Hardware Profile (utility)
● Cron and systems jobs
● Scripts
● Deploys
● Chef Server 10 for deploy and configuration
● Tests
○ Database test suite in Perl (Test::DatabaseRow)
● Backups (and copies)
26. Cluster Overview
● Production
○ Most hardware serves production
● Staging
○ Some data promoted to production nightly
● Preview{1..n}
○ Instances for testing and previewing new features,
bug fixes and design changes.
27. Production Uptime
● Aggregate hardware availability exceeds six nines (99.9999%)
● Software availability is ~99.999%
● Software, including deployments: 99.98%
● Software, including maintenance: 99.9%
● Non-recoverable human errors: 98%
29. Team Profile
● 16 committers; 8.25 FTE
● 4 Project Managers
● 5 departments
● 31 vendors
● 5 time zones
30. Team Values
● State your needs; respect others’
● Respect is given, then adjusted
● Process can always change and improve
● Work/life balance
● Mature and non-aggressive; mediate conflict
● Honesty and transparency
31. Team Mantras
● Trust (relevant) data; make things visible
● Measurable, repeatable, falsifiable
(scientific method)
● Redundancy reduces risks (if documented)
● Set expectations (timing, contents, formats)
and deliver on them
32. Team Mantras
● Automate what is repeated
● Use known patterns and
proven architectures
● Grow talent from within
● Compartmentalization of some data,
code, and knowledge
36. Effective Communication
● Group emails: avoid general questions, assign actions to people, minimize distribution lists
● Identify urgency of requests
● Use email filters
● Coach and mentor
37. Effective Communication
● Daily phone calls: only while needed
● Set an agenda; keep to a schedule
● Encourage people to skip calls or to leave early
● End the call when completed
55. Deployments
● Monday through Thursday only!
● Communication: tickets, cross references,
pull requests, QA status, and releases
● Set expectations: timings for outages,
maintenance, and degraded functionality
● Are we done, yet?
● Explain outcomes and options
56. Community Participation
● Patches submitted
o Redis
o Cm_RedisSession
o Cm_Cache_Backend_Redis
o https://github.com/magento/magento2
● Modules improved
o CacheObserver
o VF_CustomMenu
58. Collaboration Texts
● Spence, Muneera U. Collaborative Processes lecture. 13 Apr. 2006.
● Marks, Andrea. "The Role of Writing in a Design Curriculum." AIGA: Design Education (2004).
● Katzenbach, Jon R., and Douglas K. Smith. The Wisdom of Teams. HarperCollins, 2003.
59. Collaboration Texts
● Bennis, Warren, and Patricia W. Biederman. Organizing Genius. Perseus, 1997.
● Marcum, James W. After the Information Age. Peter Lang, 2006.
● https://en.wikipedia.org/wiki/Collaboration (and collaborative method)
How would you build the world’s largest, fastest, most complex Magento ecommerce store? Join three COPIOUS engineers as they share their approaches to this problem. This one-hour presentation will include the best practices, code samples, and system configurations necessary to scale Magento up to 100,000 daily orders with a catalog of 100,000 products.
Client is publicly traded, so we’re constrained by federal regulations on some details.
US retail sector; busiest periods, in order:
1. Cyber Monday
2. Pre-Christmas
3. Post-Christmas
4. Back to School
5. Spring Break
Site-wide average response time: 282 ms
Founded in early 2000s.
Native iOS and Android
Ecommerce clusters
Product configurations
Complex integrations
Marketing and content strategy
We are hiring
Business Development Director
Sr. Software Engineer / Engineering Manager
Studio Manager
DevOps Engineer
Sr. Ruby on Rails Developer
Sr. Strategist
Mobile Engineer
Keep things specific to Magento, not basic
These are more like ground rules
If you’re modifying core files, you’re doing it wrong! All too common to see Magento forum recommendations telling people to just modify app/code/core/…
Events: 406 events fired for homepage, 663 for category page, 1038 for PDP, 836 for cart
Blocks are where the rubber meets the road for Magento, the last piece in the chain of getting data to the end-user.
Many blocks are not cached (some rightly so for customer session)
For instance, Magento CMS blocks go through the rendering process for each page they are displayed on
Many modules available for this. Open source options available.
Production cache host has ~1 million keys for Back End cache.
commands per second: > 3,000
expirations per second: ~100
hit rate: ~85%
It is common to see this on product listing pages, the cart page, and checkout review
Magento has accounted for this!
Optimizing what is included in the indexes can be difficult but it can provide some big payoffs if you have a large catalog.
Rewrite Mage_Catalog_Model_Resource_Url::_getProducts
Current runtime for catalog_url index: ~30 minutes
What does this method return?
This method is called in product list blocks as well as PDPs and other small pages like the cart and each step of the checkout via collectTotals.
We don’t actually use 126 kW of power :)
Sandy Bridge: not the latest and greatest but still good
Kernel tweaks include: socket limits, shared memory limits, open file limits, larger queues for networking, and IPv4 stability/security/capacity
IPv6 ignored at nginx layer
MySQL and HTTP monitors will remove hosts from the pool that go down. Maximum period between failure and pool removal is 7 seconds.
Scripts try to recover downed instances by restarting services. Outcomes from outages are emailed to the group.
See ARP table corruption? Is it every 4 hours? Do you have Cisco switches? This is the ARP cache lifetime :)
Important NetScaler configs…
* Services: -cip ENABLED X-Forwarded-For -cltTimeout 30 -svrTimeout 120 -CKA YES
* Virtual server, port 80: -persistenceType NONE
* Virtual server, port 443: -persistenceType SSLSESSION
SKIP if time is an issue
A locked down network with no VPN means you need to get creative when working from home.
These CPUs are fast enough and a great value; not *extreme* power. $240 each
Daily load average of 0.7–1.2: sustained normal
Load average of 5: target maximum load (35% performance degradation)
Load average of 7+: “failure” load
NGINX and PHP5-FPM
We are targeting a comfortable performance level.
Ratio of PHP processes to CPU cores found through trial and error. This is the lowest process count we could deploy without socket resets under crush loads.
This quantity of PHP processes is possible with 32 GB of RAM in each web host.
Several boxes were shipped with extra/junk/mismatched RAM (1 GB sticks) and review was necessary
https://github.com/blog/1422-tcmalloc-and-mysql
MySQL 5.6 versus 5.5: https://dev.mysql.com/tech-resources/articles/mysql-5.6-rc.html
TODO What makes 5.6 scale better?
• Better linear performance and scale on systems supporting multi-processors and high CPU thread concurrency
• InnoDB has been re-factored to minimize legacy mutex contentions and bottlenecks
Better multi-processor support
We are interested in Percona and MariaDB but do not have operational capacity to use either. (discover, tune, configure, automate, etc)
Failure defined as connection timeouts and socket resets for ~3 percent of users.
It took us nearly a month to produce enough load to cause minor failures. The real/hard failure point is higher than this, but that’s the best we’ve been able to do! :P
Configs:
thread_cache_size = 512 (possibly too low!)
table_open_cache = 12288
tmp_table_size = 512M
query_cache_type = 1 (on)
query_cache_limit = 4M (supports SOAP and REST API integrations)
query_cache_size = 512M (larger than this is problematic; it’s typically ~60% full)
innodb_buffer_pool_size = 32G
innodb_log_buffer_size = 2G
innodb_log_file_size = 512M
innodb_file_per_table
Statistics:
42 TB of transmitted data
23 TB of innodb writes in 90 days
8.4 TB of innodb log churn
100% thread cache hit rate
99.996% table cache hit rate (large number of open tables possibly related to MySQL bugs #16244691 and #65384)
99.9999993% of table locks are immediate (most are nightly processes)
Innodb_buffer_pool_wait_free: 0
Innodb_log_waits: 0
85% query cache hit rate; this doesn’t really mean anything with such high churn rate of data
94.5% of temp tables are in memory (only nightly processes require disk tables)
99.97% of queries are faster than 200 ms; we’ve reached a plateau of optimization
The average row lock is a bit slow as a consequence of Magento indexing architecture and our background processing queues
Moderate rate of random and sequential reads (table scans!), but we can absorb that overhead with hardware and focus on improving PHP code
YES Lots of memory for DB cache. For things like placing orders, SSDs provide fast write speeds.
MAYBE We have just over 10,000 write IOPS capacity (and sustain ~125 IOPS).
YES Remote management, configuration, and validation of hardware RAID can be difficult; push the colocation facility to assign knowledgeable technicians.
Partition mount configurations assume power will never be lost; optimal throughput and security. (RAID controllers do not have battery backup units installed)
These CPUs are about $2,000 each!
We now know what a load average of 400 looks like! o.O (Ugly SQL that wanted a temp table of 2.4 quadrillion rows)
An Endeca query was trying to create the temp table.
SKIP this if needed
YES 3,000 commands per second; 0.7 ms average response time
MAYBE All Redis instances persist to disk with RDB. Only sessions’ RDB files are backed up off-server (Sessions point to carts! We see a lot of anonymous users.)
YES Downside of multiple instances: quadrupling of file descriptors and socket connections required some kernel ulimit tweaks
NO PHP Redis libraries not quite mature enough to support persistent connections with PHP5-FPM: https://github.com/nicolasff/phpredis/issues/70
Zend disables pconnect.
Compressing cache contents also increases storage capacity! We can afford the increased CPU overhead to improve RAM and network capability.
Network throughput went from 500 Mbit/sec to 125 Mbit/sec
Sustained disk IO went from 80% to 12% utilization
Prime number intervals on the RDB BGSAVE reduce contention over disk IO, as write activity is less likely to overlap
Several boxes were shipped with extra/junk RAM (2 GB sticks) and review was necessary
Balances transmitted data by changing the MAC address on outgoing packets
No special switch support; 32 Gbps backplane :)
Nice when the colo runs the switch
“The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave.” - Linux Foundation
CPUs are $450 each.
SKIP this if needed
Daily offsite backups; verified functional :)
Bash, Ruby, Perl, Python, PHP
Chef *really* wants to run every 30 minutes; prevent that with the `--once` argument.
Test failures catch human errors; emails sent from failures are intentionally obnoxious
Ubuntu, nginx, php-fpm and MySQL are reliable, predictable, and scriptable.
Deployments take about 10 minutes because we’re cautious about database schema and large caches take time to clear.
Magento’s architecture for indexing in 1.12 greatly constrains our uptime.
We’ve written automation, adjusted sources of authority, and standardized communication workflows to prevent most human error.
Human errors will quickly drag uptime down to 94% if response time is slow and mitigation/recovery plans are not documented.
This is slightly above the size of most effective teams; managed through limiting scope of engagements for timing and components. We also cluster around sprints and product/feature development teams.
We hire smart people who work like craftspersons. They enjoy building things that delight other people. Expertise is not required, but the ability to learn is.
Client had some turnover; we have to be careful to not be perceived as a threat.
Regions: England (vendor), East Coast US, Midwest, West Coast US, East Coast Australia.
QUICK slide
This is intended for peer groups with moderate homogeneity and similar cultural backgrounds. (Not a license for monoculture, though.)
Give full respect up front—these are your peers. If person is disrespectful or burdensome, provide suggestions, and then gently reduce respect.
Some people worked *some* long weeks; not always. Many people even took vacations!
Admitting mistakes is better for all.
QUICK slide
Scientific method
Measure everything up front and develop your questions later (Borrowed from Big Data™, NoSQL, etc)
QUICK slide
Humans make errors; machines are made for repetition.
We do not test in production! :)
Standard POSIX process signaling and Ubuntu init scripts.
Generally:
Systems engineers need to know a moderate amount about a lot.
Software engineers need to know a lot about a little.
Project managers need to know a little about a lot.
Operations engineers need to know a little about a lot.
Easy mantra/value: nourish people with free beverages and comfortable, low-distraction environments.
Feed your team! We worked many lunch hours and a few late nights. Bosses bought food and accommodated dietary needs.
Introduction to vendors
Names, titles, emails, time zones (and business hours), escalation procedures
Optimal scenario: “we speak for [client] and [vendor] deals directly with us”
Approach with a gentle demeanor (not here to take over and rule how everything goes)
Small talk provides something to relate to; people seem quite affable toward the PNW
Share: Design goals, objectives, and values
Push the vendors to deliver; this should be done by company superiors.
SKIP if needed
Trust, and verify.
Some documentation was wrong or missing.
Urgencies have varying levels and definitions; find what works!
Skype and instant message are need-to-know basis
Cell phone contact should be rare and with explicit boundaries
SKIP if needed
Phone calls are hard!
Contact lists, issues triage, process documentation, collaborative editing, task delegation, history and context
BOUNDARIES: I check in with others when I see their timestamps are outside of business hours.
Documents with sensitive info are marked CONFIDENTIAL and shared with a minimal group. Some documents are internal only.
As proof of our decent work/life balance, we see most commits are business hours in local time. (Committers are in 2 time zones and have flexible schedules)
Some late nights and weekends were had, but typically for specific sprints, maintenance, deployments, and chores.
Cowboy coding in production: help or go away
Launch day! June 28/29
The state of the codebase was compatible with release because we were making limited and deliberate changes.
Release day. Features passing UAT are accepted before release.
Some releases have pretty complex preparations. A team familiar with Git is an effective team. (It shouldn’t get in the way!)
SKIP if needed.
The network graph can be quite elegant with such a large and effective team.
SKIP if needed.
Off hours maintenance is sometimes a frenzied, guess and check process.
We mitigate these risks with supporting people available and our commit/deploy rigor for safeguards.
Some of my flurries of commits are documentation updates. I’m a risk in that I understand nearly everything; others hold me accountable by requesting documentation.
The preview environment has a Chef HTML template that lists tickets, branches, URLs, known issues, and general notes.
I’ve provided links and listings of which files determine application states and integrations. (Single sources of authority preferred)
Standardize what gets included in a pull request. “What changed” (list) and “How to test” (steps, outcomes, caveats) are my favorites.
Opportunity to teach and learn; see how others do things, and provide DRYing or refactoring advice. It’s a vulnerable moment that deserves to be uplifting and positive.
Pull requests can provide advance notice to the group at large.
Friday and weekend deployments eat up budgets and human capital by requiring people be available.
Educate people and standardize language regarding ticketing systems and GitHub flow.
Quality assurance: always include steps to reproduce and expected outcome. Define what failure is; consider releasing incremental fixes.
Do you coordinate releases by dates? Version numbers? Names? Standardize and schedule! Build routines.
Explaining if issues have been solved or when they’re expected to be fixed.
Build and grow trust by defining risks, mitigation options, rollback criteria, and recovery steps and timings.
Reid’s 2007 undergraduate thesis was a survey of modern and contemporary literature with analysis for educational settings.
Expand upon the “dropping people from email CC if they’re difficult” advice, please?
Our most common habit is taking email threads internal to determine our preferred response. We explicitly state [internal thread] in the first line when this has been done.
Some people are very knowledgeable but tend to provide advice or input past their job titles. It’s nice and well intentioned, but slows things down. Our habit has been to only ask questions of those people for subjects specifically under their job titles.
What issues have you seen with splitting MySQL reads and writes?
Checkout theoretically can experience problems with replication delay, but we haven’t seen it happen.
Slaves will stop on any foreign key errors, and we’ve seen some. Features most frequently causing that problem have been reports. We’ve disabled reports because the client uses Analytics products for business intelligence.
Database unit tests and validation tests have helped us catch human errors that would cause slaves to stop
Why use real hardware?
“Walk before we can run.”
Previous Magento partners could only optimize the site to a point where it required 64 web servers and very high IOPS. That wasn’t going to be easy on the cloud.
The current application state would probably run pretty well on the cloud, but we’re risk averse and want full control.
We’re planning to eventually get to the cloud, which should occur when this hardware is out of date. That will be an opportunity for a full investigation of a custom ecommerce product (service-oriented architecture; omnichannel integration).