CloudAustin Black Friday 2013


A 2014 CloudAustin presentation on how we prepared for and handled the Black Friday 2013 traffic surge.

Published in: Technology


  1. Black Friday 2013 • Ernest Mueller, Bazaarvoice Engineering
  2. What Is Black Friday? • The National Retail Federation writes: For some retailers, the holiday season [Nov-Dec] can represent as much as 20-40% of annual sales. • ShopperTrak says: National retail sales increased 2.7% and foot traffic decreased 14.6% when compared to the same two months last year (2012). • Black Friday (the Friday after Thanksgiving) and Cyber Monday (the Monday after that) have become big discounting and promotional events that retailers use to push holiday purchasing. • Summary: It's a big deal to many of our clients and is becoming more e-commerce-driven every year
  3. Black Friday/Cyber Monday 2013 @BV • Historically: In 2011 we served 1.52 B, and in 2012 we served 2.03 B. • Roadmap prediction: Bazaarvoice expected 2.6 B review impressions on Black Friday & Cyber Monday 2013. That's a 30% YoY growth rate. • Results: Bazaarvoice served 2.67 B review impressions on Black Friday & Cyber Monday 2013. That's a 31.4% YoY growth rate.
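The growth figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch using the rounded impression counts from the slide (in billions); the function names are illustrative and not from the presentation. Because the inputs are rounded, the computed 2013 growth lands near 31.5% rather than the slide's 31.4%.

```python
def yoy_growth(previous: float, current: float) -> float:
    """Return year-over-year growth as a fraction (0.30 means 30%)."""
    return (current - previous) / previous


def project(previous: float, growth_rate: float) -> float:
    """Project next year's volume from last year's volume and an assumed growth rate."""
    return previous * (1.0 + growth_rate)


# Review impressions over Black Friday and Cyber Monday, in billions
# (rounded values as shown on the slide).
served = {2011: 1.52, 2012: 2.03, 2013: 2.67}

predicted_2013 = project(served[2012], 0.30)            # the 30% YoY prediction
actual_growth = yoy_growth(served[2012], served[2013])  # growth actually observed

print(f"Predicted 2013: {predicted_2013:.2f} B")
print(f"Actual 2013 growth: {actual_growth:.1%}")
```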
  4. If you took all the reviews we served up to shoppers on Black Friday 2013 and printed them into paperback book form, it would take a bookshelf almost 11 miles long to hold them.
  5. Scaling Isn’t Just For Black Friday • We continuously work to scale the product – our data size doubles year over year • Architectural changes to meet the demand are constant and ongoing – there is no “maintenance mode” at scale • Your base architecture needs to be scalable • Then you have to refactor again and again
  6. [Image-only slide]
  7. Dove’s Thoughts • Upping performance and running your system at 40% instead of 80% gave a lot of insight into our second order set of bottlenecks and performance characteristics • The choice of where to place/span ASGs and other Amazon bits was a major talking point among the Amigos, and ended up being located per AZ because of our DNS/HAProxy front end • The “diagonal scaling” challenge of instance size vs number of instances vs PIOPS speed is hard and you basically just have to run tests to dial in on the minima; this changes a lot over time • Remember, with the public cloud a lot of this is black box and while that removes a lot of work from you, it adds other work and requires certain best practices to make the most of your system
  8. This Year • We started Black Friday specific work on August 12, 2013. • That's when client readiness surveys start coming in! • We've done this in previous years, but this year there was a big additional demand placed on the planning…
  9. The Old Meets The New
  10. Communicate and Coordinate • The first step is always internal communication • We create an “Internal Preparedness Statement” to provide a concise, definitive statement for Engineering, Sales, Support, and Implementation • Regular weekly prep status meetings • From the August 12 “Planning is beginning” notification till the celebratory happy hour on Dec 16, I have 1,287 emails that mention “Black Friday.” • Due to the new distributed-team challenge, we needed a person responsible for coordinating our overall Black Friday response…
  11. BV Holiday Freeze Statement • Soft Freeze: We observe a general change freeze period starting 1 November and ending 15 January. During this period, we do not introduce changes to Bazaarvoice products that are integrated with our clients' websites. We may introduce changes into back-end systems that do not impact the end-user site experience. • Hard Freeze: We only release infrastructure and configuration changes required to restore service to or prevent a service disruption to one or more of our customers. The Critical System Change periods are: • 5 days prior to and 5 days after Black Friday (24 November 2013 through 4 December 2013) • 4 days prior to and 7 days after Christmas (21 December 2013 through 1 January 2014)
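The freeze statement is effectively a small decision rule, which can be sketched as a date check. The dates below come from the statement; the function names, the inclusive treatment of the end dates, and the assumption that service-restoring changes are also allowed during the soft freeze are illustrative and not part of the original policy text.

```python
from datetime import date

# Dates are taken from the freeze statement; the names are illustrative.
SOFT_FREEZE = (date(2013, 11, 1), date(2014, 1, 15))
HARD_FREEZES = [
    (date(2013, 11, 24), date(2013, 12, 4)),  # around Black Friday
    (date(2013, 12, 21), date(2014, 1, 1)),   # around Christmas
]


def freeze_level(day: date) -> str:
    """Return 'hard', 'soft', or 'none' for a proposed change date (inclusive bounds)."""
    if any(start <= day <= end for start, end in HARD_FREEZES):
        return "hard"
    if SOFT_FREEZE[0] <= day <= SOFT_FREEZE[1]:
        return "soft"
    return "none"


def change_allowed(day: date, client_facing: bool, restores_service: bool) -> bool:
    """Apply the stated policy to a proposed change.

    Assumptions: 'restores_service' covers changes needed to restore service or
    prevent a disruption; 'client_facing' means the change touches products
    integrated with client websites.
    """
    level = freeze_level(day)
    if level == "hard":
        return restores_service
    if level == "soft":
        # Assumed: the stricter hard-freeze exception also applies here.
        return restores_service or not client_facing
    return True
```

For example, `change_allowed(date(2013, 11, 10), client_facing=False, restores_service=False)` permits a back-end change during the soft freeze, while the same change on Black Friday itself is rejected.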
  12. What Does Freeze Mean To You?
  13. Traffic Projections and Scaling Plan • Sadly, the answer isn't as simple as “Amazon, yay!” • Even they run out of resources over this period • We conduct detailed YOY traffic projections • We come up with a scaling plan to fit the projections • Leave headroom!
  14. Traffic Projection Tips • Your system has various axes of scaling within it – trend and estimate them all • We estimate incoming and outgoing reviews per day, peak requests per second on display servers, and calculate per-server acceptable capacity at each level (Tomcat, Solr, database) • Once you've done it one year, it's easier because you can apply proportional lift to current traffic • Keep an ear to the ground for environmental changes! This year retailers decided to start earlier and spike a little less on BF, so scaling came earlier than last year – but we read the news so we were prepared
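The projection approach in the two slides above (apply proportional lift to current peak traffic, divide by per-server acceptable capacity, leave headroom) can be sketched as follows. This is an illustration under stated assumptions, not the team's actual model: the function name, the tier figures, and the 40% headroom are all hypothetical.

```python
import math


def servers_needed(current_peak_rps: float, lift: float,
                   per_server_rps: float, headroom: float) -> int:
    """Estimate the server count for one tier.

    current_peak_rps: peak requests per second observed today
    lift:             expected proportional increase (0.30 means +30%)
    per_server_rps:   acceptable sustained load per server, from load tests
    headroom:         fraction of capacity to leave unused (0.40 means run at 60%)
    """
    projected_peak = current_peak_rps * (1.0 + lift)
    usable_per_server = per_server_rps * (1.0 - headroom)
    return math.ceil(projected_peak / usable_per_server)


# Hypothetical inputs for illustration only; none of these numbers come from the talk.
tiers = {
    "tomcat": {"current_peak_rps": 10_000, "per_server_rps": 500},
    "solr": {"current_peak_rps": 4_000, "per_server_rps": 250},
}

plan = {
    name: servers_needed(t["current_peak_rps"], lift=0.30,
                         per_server_rps=t["per_server_rps"], headroom=0.40)
    for name, t in tiers.items()
}
print(plan)
```

Running each tier through the same function keeps every axis of scaling on the same footing, so a change in the expected lift only has to be made in one place.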
  15. [Chart: Pageviews, UGC Impressions, and Unique Visitors, in millions (axis 0 to 1,600); labeled values 1.337 B and 1.330 B]
  16. Situational Awareness • When the clock is running, you need your monitoring, alerting, response, etc. to be highly optimized for speed. • We use a variety of monitoring types – Nagios, Zabbix, Datadog, Keynote, Pingdom • And PagerDuty of course, aka “The One Ring” • We write out runbooks for common response tasks so that we can have level 1 support people do them – or at least so that we don't screw them up! • Custom tooling is a must.
  17. System Stats Histogram [Dashboard screenshot: request rates and latencies for AWS East, AWS West, and clusters c1 and c2, with m2.xlarge instance counts of 10, 12, and 10. Values shown: 164k, 21k, 12k, and 3.4k RPS; 8210, 4330, 2340, 1240, and 1023 ms.]
  18. Demo! • formance
  19. Escalated Response • We had 3x daily (9 AM, 2 PM, 9 PM) status calls for all teams to check in • We sent overall system status and performance to the entire company daily • On-call shifts of 12 hours apiece – not fully online but not “waiting for pages” either; on-call engineers need to eyeball the system at regular intervals
  20. Test Your Plan! • Test your scaling – Amazon limits are your enemy – there's a thousand of 'em and many are hidden • Test your monitoring • Test your paging • Test your runbooks • We had two “game days” to scale up, apply load, provoke issues and execute on remediation
  21. Step 6: Profit
  22. How It Went Down • 23 teams across R&D and Support • 40 engineers participating as Black Friday representatives • 11 weeks of planning • 2 stress-testing “Game Days” • 26 round-the-clock status calls (8 “yellow” status, 18 “green”) • 35 issues examined during the period • $136,620.27 for the week in hosting costs • Zero downtime
  23. November Performance (c3)
  24. Questions?
  25. Recruiting Moment - BV:IO 2014 • Bazaarvoice’s internal tech conference and hackathon! • Last year: Alamo Drafthouse, Adrian Cockcroft (Netflix), Jason Baldridge (UT), Nick Bailey (DataStax), Peter Wang (Continuum Analytics) • This year: Norris Conference Center, Theo Schlossnagle (Circonus), Greg Brockman (Stripe CTF), Bob Metcalfe (UT) • Late-nighter hackathon to develop sweet social commerce solutions • Plus – COD: Black Ops!
  26. Register: Team Signups On Hacker League • Koderz Only