Hello! I'd like to talk to you about how Groupon rewrote its web front-end using Node.js and the challenges we faced along the way: how our business was changing, why we decided to rewrite, and how it paid off.
Before we talk about the transition, let me give you an idea of what Groupon's platform looks like and the problems that caused us to make this change.
Over the last 5 years Groupon has grown at an amazing rate. The business has changed a lot from the early days of the "Daily Deal" model, and our product offerings have diversified through a mix of acquisitions and internal growth. Much of this growth came via new services that isolated business logic, but the presentation layer all lived in our monolithic Ruby on Rails application.
A large portion of our business is overseas. We acquired the City Deals platform, which came with its own infrastructure, which meant we had to maintain two separate infrastructures.
When we first started, our customers interacted with Groupon by opening an email and visiting a web page on their desktop computers. Over the last 2 years our customers have become more engaged through our mobile application: as of Q3 2013, more than 50% of Groupon's transactions were made through it, and 9 million people downloaded the Groupon app that quarter.
Between our rapid growth, the increasing importance of mobile, and the separate international platforms, our Product Engineering team wasn't able to work the way it needed to. Our monolith prevented us from building features fast enough. Having multiple platforms impeded rolling out US features worldwide. And since our web application used a different data source than mobile, we ended up duplicating a lot of work across our front-end platforms and our API implementation.
We decided that it was time for some drastic measures. Groupon needed a new front-end platform that would allow us to fix the biggest impediments faced by our Product Development team.
Breaking our back-end into services grouped by product worked well for us in the past. Teams can own their individual service and then integrate with shared services. We wanted our front-end to work this way so teams could develop their products at their own pace with limited interference from other teams.
We had a list of requirements for the new platform, many of which were shortcomings of our old monolithic RoR application.
This architecture wasn't going to be easy to arrive at. We had a lot of questions about how we would implement some of the features of our monolithic site in a federated architecture.
We tried to redesign the entire site and revamp the front-end inside our old monolith and it failed. The monolith made it difficult to bring multiple features up to date at once.
Since this was going to be a largely green-field effort, we wanted to survey the landscape and see whether Ruby on Rails was still the right choice. Comparing platforms to each other on paper felt too artificial, so, wanting to avoid "analysis paralysis", we decided to jump in and build an application on one of the candidate platforms to evaluate it.
In September of 2012, we started work on the Subscription Flow application. It was a small application, but substantial enough that we could further vet our platform decisions. Fortunately the site didn't share the look and feel of the rest of groupon.com, the API endpoints already existed, and we didn't need to deal with logged-in users, which allowed us to bite off a smaller piece of the problem. We also got an early signal on how well the new architecture would work when we needed to overhaul the interface to implement a new design from the product team: we were able to turn the new design around in about a week with one developer, something we couldn't easily have accomplished in the old application.
In December of 2012, after 3 months of building, we finally had an application worth shipping, but we needed to wire it into our infrastructure and test that it could handle production traffic properly. We ran into a few issues during last-mile testing. When we ran a few extreme load tests, the application would hit a throughput ceiling: there was a bottleneck somewhere, but CPU and network utilization looked fine. It wasn't until I saw this tweet by @substack that I was able to figure out what was going on: we were limited on outgoing service calls from the application.
Then we ran a longevity test: send a consistent amount of traffic to the application over a weekend to see if any issues came up. The test was run against a testing infrastructure to isolate any potential production impact.
Within two hours we had a major site outage.
It turned out that our test infrastructure used the same hardware load balancer as our production infrastructure, and someone had enabled SSL termination at the load balancer for the API endpoints. The internal traffic to the SSL endpoint caused the load balancer to chew through CPU while terminating SSL, eventually maxing out the CPU and taking the site down.
It also allowed us to iterate on the next set of problems: user authentication, new service calls, more complicated routing, and the need to share the look and feel of the rest of groupon.com.
We had budgeted less than 3 months to "forklift" the site over to the new platform, but we missed that target by 3 months. A whole new set of problems popped up during the development and launch of this product, and unlike the subscription app these problems were mostly cultural. We weren't expecting teams to have difficulty adjusting to maintaining, deploying, and supporting their own applications. There was also a breakdown in trust between the platform team for the rewrite and the team responsible for the Browse application. We were already beginning to focus on rolling out this platform company-wide and were not as responsive as we should have been to the problems the team was having. They felt responsible for maintaining too many parts of the stack, and due to poor communication they ended up solving problems we had already solved.
Once we felt like the platform transition was going to work, we started investing in some of the bigger problems that we had intentionally skipped in the first two applications. One of our biggest was standardizing the look and feel of the site (header, footer, styles) across many applications. We considered a few different approaches, and decided that the best option was to have the layout come from a service that could be controlled independently.
We built the layout service to use the Groupon Interface Guidelines. All assets, markup, and functional code are frozen with semantic versioning so we can develop new layouts without breaking old experiences.
It serves up a partially rendered mustache template along with the output of any service calls required to render the template.
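As a hypothetical sketch of what consuming such a response could look like — the response shape, field names, and the toy renderer below are assumptions for illustration; an application would really use the mustache library rather than this stand-in:

```javascript
// Hypothetical shape of a layout-service response: a pre-rendered
// mustache template with only page-level slots left open, plus the
// data from the service calls the layout needed.
var layoutResponse = {
  template: '<html><body><header>{{headerHtml}}</header>' +
            '{{> pageContent}}<footer>v{{layoutVersion}}</footer></body></html>',
  data: { headerHtml: '<nav>...</nav>', layoutVersion: '2.1.0' }
};

// Minimal stand-in for a mustache renderer: fills one
// {{> pageContent}} partial, then substitutes {{name}} variables
// from the service-call data.
function renderLayout(layout, pageHtml) {
  return layout.template
    .replace('{{> pageContent}}', pageHtml)
    .replace(/\{\{(\w+)\}\}/g, function (_, key) {
      return layout.data[key] || '';
    });
}

var html = renderLayout(layoutResponse, '<main>Deal of the day</main>');
```

The application only has to fill in its own page content; everything site-wide arrives already versioned and rendered.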
The layout service can be used to coordinate site-wide experiments like the site redesign we launched in October.
We also built a routing layer into NGINX, which was already sitting in front of our legacy RoR application. From this layer we could bucket users into an experiment that routed them to the new Node.js application. We could then track how our users interacted with the new web site to make sure we didn't break the experience as we moved to the new platform.
By the middle of summer 2013 we decided that the new platform built on Node.js was working well and that we should figure out how to move the entire site over. Rather than moving over page by page, we opted to invest in a major effort to rewrite the whole site at once and launch it by September 1st.
This was a worldwide effort. Product Engineering implemented a feature freeze across the entire company for the duration of the rewrite; the goal was to reimplement the same features on the new platform. We had 150 developers working in the US, Europe, Asia, and South America, so coordination was very difficult. We set up a support rotation and helped people through a chat room, email, visits to remote offices, and video chat.
Over the next 4 months teams built out their applications and launched them into production. We missed our goal of 9/1 by a month, but by the beginning of October, all of the planned applications were launched and serving 100% of traffic on the new Node.js platform.
We haven't moved all of the pages over quite yet, but as of now we're serving 95% of all of our web page traffic with Node.js.
Our website used to have some major scalability issues during high-traffic days; we've now served record traffic numbers during this period with no outages and without elevated response times.
- Mention we had a major promotion (don't name SBUX) that would previously have strained the old Ruby site, but this time it didn't even break a sweat. Maybe put an arrow to that peak on the graph and an arrow to the (presumed) peak for today's Cyber Monday traffic.
- Mention the upcoming holiday season and expected traffic growth, and that this is just the beginning of what we can handle on our platform.
- Mention that we are now scaling it to 48 more countries (because that sounds really cool and impressive).
We have 29 I-Tier applications in production. All of our existing infrastructure has been ported over, and we have nearly full feature parity with our old monolith. Our gamble paid off. Since we switched to the new platform, we've launched new products and rolled out a site redesign in record time. We've had two of the highest-traffic days in the US ever without any outages or performance degradation. The site is faster, leaner, and our customers like using it more.
It's not a straight line. We expected to hit the same problems we had before, but we hit a whole new set of problems: we were solving for Ruby on Rails problems, but really we had new infrastructure problems.
We'll be open-sourcing Testium first, a WebDriver-based integration testing framework.
Riding The N Train: How we dismantled Groupon's Ruby on Rails Monolith
Tuesday, December 3, 2013
Sean McCullough – Engineer at Groupon
A Changing Business
Leading the Mobile Commerce Revolution
Mobile Transaction Mix
Monthly, January 2011 to September 2013 (% of transactions)
Product Engineering was Stuck
We couldn't build features fast enough
We wanted to build features world-wide
Mobile and Web weren't at feature parity
• be built on APIs for consistent contract with mobile
• be easy to hire developers
• allow for teams to work at their own pace
• allow teams to deploy their own code
• allow for global design changes
• have out-of-the-box I18n/L10n support
• be optimized for our read-heavy traffic pattern
• be small
How do we…?
• Authorize Users
• Share Sessions
• Route to different applications
• Manage distributed ops
• QA the whole site
We Tried This Before and Failed
• Rolled out a new site design in our monolith
• Too many things changed all at once
• Hard to evaluate the performance of each change
The Second App
• User authentication
• More service calls
• Complicated routing
• More traffic
• Needed to share look and feel
The Second App
• Cultural problems
• Change of workflow
• Feedback loop fell apart
6 months to launch
Maintain consistent look and feel across applications
• Distribute layout as library
• Use ESIs for top/bottom of page
• Apps are called through a “chrome” service
• Fetch templates from service
The Big Push
… or There's No Going Back
• Decided to get the whole company to move at once
• Supporting two platforms is hard –
Rip off the band aid!
• End of June 2013 - move to I-Tier by September 1st
The Big Push
… or There's No Going Back
• ~150 developers
• Global effort
• Feature freeze – A/B testing against mostly the same features
• Moving to a new platform is not a straight line
• Solving for old problems
• Solving for new problems
• Culture shift
• Streaming responses for better performance
• Better resiliency to outages… circuit breakers, brownouts
• Distributed Tracing
• Open Source
New I-Tier apps as we build new teams, products, ideas.
Latest technologies to help us drive our business.