On June 27, 1923, at an altitude of about 500 feet above Rockwell Field on San Diego's North Island, two U.S. Army Air Service airplanes became linked by a hose, and one airplane refueled the other. Only seventy-five gallons of gasoline were transferred, but the event is memorable because it was a first. Aerial refueling lets an aircraft remain airborne longer, extending its range and deployment radius; it can allow take-off with a greater payload of weapons, cargo or personnel; and it can shorten the take-off roll, because the aircraft can take off at a lighter weight and refuel once airborne.

In June 2007, Shopzilla decided to embark on a project to rebuild the core of our business, our consumer-facing web sites, mid-flight. We wanted to fundamentally change the performance, maintainability and operations of our sites without impacting our on-going business. We needed to deliver incrementally, yet seamlessly integrate the new site architecture with our existing site. Today I'm going to talk about how we approached this, dive into some of the technical details of our solution, and talk about what we achieved and where we're going next.
Shopzilla is one of the largest and most comprehensive online shopping networks on the web, through our leading comparison shopping sites Bizrate.com and Shopzilla.com and the Shopzilla Publisher Program. We help shoppers find the best value for virtually anything, from thousands of retailers. Across our network we serve more than 100M impressions per day to anywhere from 20-30M unique visitors, searching as many as 8,000 times per second across more than 105M products. It's a testament to a great engineering team that we were able to incrementally evolve our architecture to support our growth:
- We grew our user base: 20% growth year-over-year
- We grew our product inventory: doubling or more every year
- We were always in a rush to market, adding features, branding and content delivery systems, always through addition, never remodeling
We had a two-tier architecture:
- High memory utilization: a lot of reference data cached at startup, resulting in processes with large memory footprints
- High latency per request: complex web pages made many calls to database and metadata stores; the more data returned, the more calls made
- Long time to first byte: there was no progressive rendering of data; model data was fully assembled in memory before rendering
- Poor hardware utilization: the memory footprint limited the number of processes, and therefore requests, that a single instance could handle
- Lack of instrumentation for understanding performance issues and request flow
- High-risk development: we served over a dozen distinct site experiences from a single codebase and single deployment; new features were fraught with danger
- Simplify the web application layer
- Decompose the site into functionally separate, individually testable, loosely coupled services
- Define performance SLAs
- Load test before every release; failure to meet an SLA is a defect (a minimal sketch of this idea follows below)
- Instrument and measure production code
- Cache where appropriate
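To make "failure to meet an SLA is a defect" concrete, here's a minimal sketch in plain Java. This is not our actual harness: the endpoint, sample count and single-threaded loop are all illustrative, and a real load test would drive many concurrent users.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;

public class SlaCheck {
    // The 650ms figure matches the server-side budget discussed later.
    private static final long SERVER_SLA_MILLIS = 650;

    public static void main(String[] args) throws Exception {
        int samples = 200;
        long[] latencies = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://qa.example.com/search?q=camera").openConnection();
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) { /* drain the full response */ }
            }
            latencies[i] = (System.nanoTime() - start) / 1000000;
        }
        Arrays.sort(latencies);
        long p95 = latencies[(int) (samples * 0.95) - 1];
        if (p95 > SERVER_SLA_MILLIS) {
            // An SLA miss fails the build, exactly like any other defect.
            throw new AssertionError("p95 " + p95 + "ms exceeds "
                    + SERVER_SLA_MILLIS + "ms SLA: this is a defect");
        }
        System.out.println("p95 = " + p95 + "ms, within SLA");
    }
}
```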
In 2007 we decided to rebuild our site. We decided to start over: we have a fundamentally different business, so we need fundamentally different software. Our design principles were pretty basic: simple is the new "clever"; performance and quality are design decisions; and you get what you measure.

We decided that we had to have continuous feedback from our users. First, it gave us a huge tool to manage risk. Since we decided to maintain the compatibility of the URL structure, we used a proxy by A10 Networks to serve up our new site infrastructure, one page at a time! Second, it allowed us to keep up a constant drumbeat of progress for the company. Momentum was key for the company, and actual, live, production launches were key for the team.

As a result we launched our first page for our first site in December of 2007. Of course, this wasn't just a page; it was the first version of the site framework as well. Over the first two quarters of 2008, we gradually released more pages and increased the percentage of traffic we exposed to the new site, until the full launch of Shopzilla on July 1st. With the release of Shopzilla in July we started the development of Bizrate. With the Bizrate release we had far fewer public releases; we were confident in our site framework, and our risk strategy shifted from proving the approach to getting Bizrate live by our holiday shopping peak. Finally, in mid-November, we shifted 100% of our US site traffic to our s2 platform. We then moved to our European properties and developed and deployed 7 brands on 2 core versions of the site.
Consider a typical page on our site; it's packed with a lot of content, designed for our users but built for bots too. We picked 1.5 seconds full page load as an aggressive number based on the size and weight of our pages. With streamed HTTP responses, we figured an approximately 650ms server-side response time would still allow a 1.5s full page load. When we started, we hadn't considered defining a separate SLA for time-to-first-byte, but we recently went back and added one once we determined that reducing it significantly gave our site the feeling of loading quicker.
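For illustration, here's a minimal sketch of how you might split time-to-first-byte from full response time when probing from the client side; the URL is just an example, and real monitoring samples continuously from multiple regions.

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class TtfbProbe {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        URLConnection conn = new URL("http://www.shopzilla.com/").openConnection();
        try (InputStream in = conn.getInputStream()) {
            in.read();                               // blocks until the first byte arrives
            long ttfb = (System.nanoTime() - start) / 1000000;
            System.out.println("time-to-first-byte: " + ttfb + "ms");
            while (in.read() != -1) { /* drain */ }  // keep reading to time the full body
        }
        long total = (System.nanoTime() - start) / 1000000;
        System.out.println("full response: " + total + "ms");
    }
}
```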
We utilized the Java Concurrency API to implement an asynchronous, concurrent service invocation framework:
- Independent services are invoked in parallel and return data destined for specific parts of the page
- Dependent service invocations may be chained
- Future results are only used during rendering of the template, so while we must render content in a specific order, we incur no blocking until the results are actually required to be rendered (sketched below)
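Here's a minimal sketch of the pattern using plain java.util.concurrent. The service calls, query and pool size are illustrative stand-ins, not our production framework.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PageAssembly {
    // Hypothetical stand-in for a remote service call.
    static String callService(String name, String query) {
        return name + " results for " + query;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        final String query = "digital camera";

        // Independent calls are submitted together and run in parallel.
        final Future<String> products = pool.submit(new Callable<String>() {
            public String call() { return callService("products", query); }
        });
        Future<String> reviews = pool.submit(new Callable<String>() {
            public String call() { return callService("reviews", query); }
        });

        // A dependent call chains off another future's result.
        Future<String> related = pool.submit(new Callable<String>() {
            public String call() throws Exception {
                return callService("related", products.get());
            }
        });

        // Rendering happens in template order; each get() blocks only when
        // that fragment is actually needed and not yet available.
        System.out.println(products.get());
        System.out.println(reviews.get());
        System.out.println(related.get());
        pool.shutdown();
    }
}
```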
Consider a portion of a typical page. I've highlighted in red those areas whose content is obtained via a service call. Some pages may request data from up to 30 sources. Parallelizing the calls helps reduce the latency of a single request, and streaming HTTP responses ensures HTML is returned to clients as it becomes available. We can visualize the server-side call stack: service calls with long dependency chains proceed in parallel with other service calls, and we can achieve a high level of concurrency since most threads are IO-bound.
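A minimal servlet-style sketch of the streaming side, illustrative rather than our actual renderer: each fragment is flushed as soon as its future resolves, so the browser can start rendering early content while slower fragments are still in flight.

```java
import java.io.PrintWriter;
import java.util.List;
import java.util.concurrent.Future;

public class StreamingRenderer {
    // Fragments arrive in template order; we never buffer the whole page.
    void render(PrintWriter out, List<Future<String>> fragments) throws Exception {
        out.println("<html><body>");
        out.flush();                         // first bytes go out immediately
        for (Future<String> fragment : fragments) {
            out.println(fragment.get());     // blocks only if this fragment is late
            out.flush();                     // ship completed HTML to the client
        }
        out.println("</body></html>");
        out.flush();
    }
}
```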
So we've built an architecture that we believe to be performant and scalable. How do you go about testing this? Here we're looking at a typical graph of a performance run, measuring response latency at the 95th percentile and throughput achieved as we scale. There are a lot of moving parts: highly concurrent requests, dozens of services, resource accesses. Our strategy is to individually performance test each service to its SLA; then the full stack is performance tested. Of course, we continually monitor our systems in production, utilizing JMX to emit all kinds of useful performance statistics. We also sample individual sessions and record the detailed performance log data used to re-create the server-side call graphs.
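As an illustration of the JMX side, here's a minimal Standard MBean sketch, split across two files as JMX requires a public interface named after the implementation class. The names and statistics are hypothetical and far simpler than what we actually emit; once registered, the attributes show up in jconsole or any JMX console.

```java
// RequestStatsMBean.java — the management interface JMX introspects by name.
public interface RequestStatsMBean {
    long getRequestCount();
    long getAverageMillis();
}
```

```java
// RequestStats.java — the implementation, registered with the platform MBean server.
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RequestStats implements RequestStatsMBean {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong totalMillis = new AtomicLong();

    public void record(long millis) {             // called as each request completes
        requests.incrementAndGet();
        totalMillis.addAndGet(millis);
    }

    public long getRequestCount() { return requests.get(); }

    public long getAverageMillis() {
        long n = requests.get();
        return n == 0 ? 0 : totalMillis.get() / n;
    }

    public static void main(String[] args) throws Exception {
        RequestStats stats = new RequestStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(stats, new ObjectName("site:type=RequestStats"));
        stats.record(120);                         // simulate one request
        System.out.println("Inspect site:type=RequestStats via jconsole");
        Thread.sleep(Long.MAX_VALUE);              // keep the JVM alive for inspection
    }
}
```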
We have some very large data sets that we need to cache, and we've moved away from proprietary caching systems. We evaluated and subsequently implemented Oracle Coherence on a number of projects. In one example we were caching data used to route incoming paid traffic to the correct experience; the whole ecosystem was slow and error-prone, and had direct financial impact. Coherence allowed us to scale our data beyond a single physical server using a distributed cache that automatically partitions our data. We implemented read-through to transparently cache new data, and we configured the eviction policy to keep enough data to satisfy all the unique requests over a 90-day period. We have no batch processes to ship gigabytes of data, no delay in publishing new data, and the cache is always on. The result is higher availability and faster publishing of new data.
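For flavor, here's a hedged sketch of the read-through pattern against the Coherence 3.x API. The cache name, loader and lookup are illustrative, not our production code, and the scheme wiring (loader class, eviction policy, cache size) actually lives in the Coherence cache configuration XML.

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.net.cache.CacheLoader;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class TrafficRoutingLoader implements CacheLoader {
    // Called by Coherence on a cache miss; hits are served from the grid.
    public Object load(Object key) {
        return lookupRouteFromDatabase((String) key);
    }

    public Map loadAll(Collection keys) {
        Map result = new HashMap();
        for (Object key : keys) {
            result.put(key, load(key));
        }
        return result;
    }

    private Object lookupRouteFromDatabase(String keyword) {
        return "experience-for-" + keyword;   // stand-in for the real backing query
    }

    public static void main(String[] args) {
        // The "traffic-routing" scheme (defined in the cache config XML) wires in
        // this loader plus an eviction policy sized to ~90 days of unique requests.
        NamedCache routes = CacheFactory.getCache("traffic-routing");
        Object experience = routes.get("digital camera"); // a miss triggers read-through
        System.out.println(experience);
    }
}
```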
Did we make any money?
- Site conversion increased 7-12%
- Sessions increased; in the UK our marketing sessions grew by 120% as Google figured out our site was fast again and we were let out of the penalty box
- We required less infrastructure
- Our uptime improved to over three nines
- Our development velocity increased
As we developed new features, we learned that performance can quickly go backwards. We added external production monitoring; this is a snapshot of what we get. We continuously monitor the performance of key pages from a number of different geographic regions and can observe a variety of metrics. We embarked on a performance refresh to improve our time to first byte and overall perceived page load experience by applying more best practices. We've continued to evolve our performance testing frameworks, adding the ability to automate full-page load regression tests on simulated real-world bandwidths.
Is performance worth the expense? Yes.
- Simplicity, quality and performance are design decisions
- You get what you measure
- You can't take your eye off the ball
What's next for Shopzilla:
- Build new web sites for new markets
- Move into mobile
- Rebuild our merchant-facing web sites
- Re-architect our inventory systems
- Rebuild our search infrastructure
Performance By Design
A look at Shopzilla's path to high performance
Tim Morrow, Senior Architect
TSSJS 2010
Deployment Approach (timeline):
- December 2007: Shopzilla first page
- March 2008: Shopzilla second page, ramping toward 50% of traffic
- July 2008: Shopzilla at 100% of traffic; Bizrate development starts
- October 2008: Bizrate at 50% of traffic
- November 2008: Bizrate at 100% of traffic