Riding The N Train: How we dismantled Groupon's Ruby on Rails Monolith


This is a story about how Groupon's business was changing and our technology couldn't keep up. We rewrote the web site using Node.js and, in the process, changed our company and culture.

Published in: Technology

  • Hello! I'd like to talk to you about how Groupon rewrote its web front-end using Node.js and the challenges we faced along the way. I'll cover how our business was changing, the decision to rewrite, and how it paid off.
  • Before we talk about the transition, let me give you an idea of what Groupon's platform looks like and the problems that caused us to make this change.
  • Over the last 5 years Groupon has grown at an amazing rate. The business has changed a lot from the early days of the "Daily Deal" model. Our product offerings diversified through a mix of acquisitions and internal growth. Much of this growth came via new services that isolated business logic, but the presentation layer all lived in our monolithic Ruby on Rails application.
  • A large portion of our business is overseas. We acquired the City Deals platform, which had its own infrastructure, so we had to maintain two separate infrastructures.
  • When we first started, our customers interacted with Groupon by opening an email and visiting a web page on their desktop computers. Over the last two years our customers have become more engaged through our mobile application, and as of Q3 2013 more than 50% of Groupon's transactions were made through it. Nine million people downloaded the Groupon app in Q3 2013.
  • Between our rapid growth, the increasing importance of mobile, and the separate international platforms, our Product Engineering team wasn't able to work the way it needed to. Our monolith prevented us from building features fast enough. Having multiple platforms impeded rolling out US features worldwide. And since our web application used a different data source than mobile, we duplicated a lot of work across our front-end platforms and our API implementation.
  • We decided that it was time for some drastic measures. Groupon needed a new front-end platform that would allow us to fix the biggest impediments faced by our Product Development team.
  • Breaking our back-end into services grouped by product worked well for us in the past. Teams can own their individual service and then integrate with shared services. We wanted our front-end to work this way so teams could develop their products at their own pace with limited interference from other teams.
  • We had a list of requirements for the new platform, many of which addressed shortcomings of our old monolithic RoR application.
  • This architecture wasn't going to be easy to arrive at. We had a lot of questions about how we would implement some of the features of our monolithic site in a federated architecture.
  • We tried to redesign the entire site and revamp the front-end inside our old monolith and it failed. The monolith made it difficult to bring multiple features up to date at once.
  • Since this was going to be a largely green-field effort, we wanted to survey the landscape to see if Ruby on Rails was still the right choice. Comparing platforms to each other in the abstract was too artificial, so, wanting to avoid "analysis paralysis", we decided to jump in and build an application with one of the platform choices to evaluate it.
  • I was a strong advocate for using Node.js to rebuild our front-end. The admittedly artificial platform comparisons showed that Node.js could deliver the baseline performance characteristics we needed. Writing JavaScript on the server allowed us to utilize our in-house JavaScript talent, and it would be relatively easy to hire developers with JavaScript experience for front-end teams. The operational characteristics of Node were also beneficial.
  • In September of 2012, we started work on the Subscription Flow application. It was a small application, but it was substantial enough that we could further vet our platform decisions. Fortunately the application didn't share the look and feel of the rest of the site, the API endpoints already existed, and we didn't need to deal with logged-in users, all of which allowed us to focus on biting off a smaller piece of the problem. We also got an early signal on how well the new architecture would work when we needed to overhaul the interface to implement a new design from the product team. We were able to turn around the new design in about a week with one developer, something we couldn't easily have accomplished in the old application.
  • In December of 2012, after 3 months of building, we finally had an application worth shipping, but we needed to wire it up into our infrastructure and test that it could handle production traffic properly. We ran into a few issues during last-mile testing. When we ran a few extreme load tests, the application would hit a throughput ceiling. There was a bottleneck somewhere, but CPU and network utilization looked fine. It wasn't until I saw a tweet by @substack that I was able to figure out what was going on: we were limited on our outgoing service calls from the application.
  • Then we ran a longevity test: send a consistent amount of traffic to the application over a weekend to see if any issues came up. The test was run against a testing infrastructure to isolate any potential production impact.
  • Within two hours we had a major site outage.
  • Turns out that our test infrastructure used the same hardware load balancer as our production infrastructure, and someone enabled SSL termination at the load balancer for the API endpoints. The internal traffic to the SSL endpoint caused the load balancer to chew through CPU while terminating SSL, eventually maxing out the CPU and taking the site down.
  • By February 2013 we had launched the subscription application to a small percentage of production traffic and begun handing it off to the product team that would own it long-term. It was time for us to start on a second application. We decided to rewrite the Browse page next because it was a new product, it was a client-side application using Backbone.js, and the team had some experienced JavaScript developers.
  • It also allowed us to iterate on the next set of problems: user authentication, new service calls, more complicated routing, and the need to share the look and feel of the rest of the site.
  • We had budgeted less than 3 months to "forklift" the site over to the new platform, but we missed that target by 3 months. A whole new set of problems popped up during the development and launch of this product, and unlike the subscription app these problems were mostly cultural. We weren't expecting teams to have difficulty adjusting to maintaining, deploying, and supporting their own applications. There was also a breakdown in trust between the platform team running the rewrite and the team responsible for the Browse application. We were already beginning to focus on rolling out the platform company-wide and were not as responsive as we should have been to the problems the team was having. They felt responsible for maintaining too many parts of the stack, and due to poor communication they ended up solving problems we had already solved.
  • Once we felt like the platform transition was going to work, we started investing in solving some of the bigger problems that we had intentionally skipped in the first two applications. One of our biggest problems was standardizing the look and feel of the site (header, footer, styles) across many applications. We considered a few different approaches and decided that the best option was to have the layout come from a service that could be controlled independently.
  • To make sure applications were being developed with consistent UI patterns, we built an internal style guide called the Groupon Interface Guidelines. It comes with CSS, HTML examples, and some standard JavaScript libraries for common UI elements.
  • We built the layout service to use the Groupon Interface Guidelines. All assets, markup, and functional code are frozen with semantic versioning so we can develop new layouts without breaking old experiences.
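The talk doesn't show how version locking worked, so here is a hedged sketch (function name and version format are hypothetical) of the idea behind freezing layouts with semantic versioning: an app stays locked to a `major.minor` line and still picks up patch-level bug fixes.

```javascript
// Sketch (hypothetical): resolve a locked "major.minor" layout version
// to the highest published patch, so old experiences keep their frozen
// markup while bug fixes still roll forward.
function resolveLayoutVersion(published, locked) {
  // keep only versions in the locked major.minor line
  const matches = published.filter((v) => v.startsWith(locked + '.'));
  // same major.minor, so comparing the patch component is enough
  matches.sort((a, b) => Number(a.split('.')[2]) - Number(b.split('.')[2]));
  return matches[matches.length - 1]; // undefined when nothing matches
}

resolveLayoutVersion(['2.1.0', '2.1.3', '2.2.0', '3.0.0'], '2.1'); // → '2.1.3'
```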
  • It serves up a partially rendered mustache template along with the output of any service calls required to render the template.
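The exact template and payload format aren't shown in the talk, so this is a hedged sketch (placeholder and field names are hypothetical) of how an app might combine the layout service's partially rendered template and its data payload with the app's own markup:

```javascript
// Sketch (hypothetical names): merge a template from the layout service
// with the service-call output and the app's own rendered HTML.
function applyLayout(layoutTemplate, layoutData, appHtml) {
  // the app's markup fills the template's content slot; remaining
  // {{name}} placeholders come from the layout service's data payload
  const view = Object.assign({}, layoutData, { content: appHtml });
  return layoutTemplate.replace(/\{\{(\w+)\}\}/g, (_, key) =>
    key in view ? view[key] : '');
}

const tmpl = '<header>{{city}}</header>{{content}}<footer>{{year}}</footer>';
applyLayout(tmpl, { city: 'Chicago', year: '2013' }, '<main>deals</main>');
// → '<header>Chicago</header><main>deals</main><footer>2013</footer>'
```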
  • The layout service can be used to coordinate site-wide experiments like the site redesign we launched in October.
  • We also built a routing layer into NGINX, which was already sitting in front of our legacy RoR application. From this layer we could bucket users into an experiment that would route them to the new Node.js application. We could track how our users interacted with the new web site to make sure we didn't break the experience as we moved to the new platform.
  • By the middle of summer 2013 we decided that the new platform built on Node.js was working well and that we should figure out how to move the entire site over. Rather than moving over page by page, we opted to invest in a major effort to rewrite the whole site at once and launch it by September 1st.
  • This was a worldwide effort. Product Engineering implemented a feature freeze across the entire company for the duration of the rewrite. The goal was to reimplement the same features on the new platform. We had 150 developers working in the US, Europe, Asia, and South America. Coordination was very difficult. We set up a support rotation and helped people through a chat room, email, visits to remote offices, and video chat.
  • Over the next 4 months teams built out their applications and launched them into production. We missed our goal of 9/1 by a month, but by the beginning of October, all of the planned applications were launched and serving 100% of traffic on the new Node.js platform.
  • We haven't moved all of the pages over quite yet, but as of now we're serving 95% of our web page traffic with Node.
  • Our website used to have major scalability issues on high-traffic days. We've been able to serve record traffic during this period with no outages and without elevated response times. We ran a major promotion that would previously have strained the old Ruby site, and this time it didn't even break a sweat. With the upcoming holiday season and expected traffic growth, this is just the beginning of what we can handle on our platform, and we are now scaling it to 48 more countries.
  • Browser-measured page performance was 50% to 100% faster than on the old site. The deal page used to be served from the CDN and personalized with JavaScript; with the transition to Node.js we were able to move rendering back to the server and saw a drastic reduction in page load times. We can also deploy changes in minutes, instead of the hours it took with our legacy monolith.
  • We have 29 I-Tier applications in production. All of our existing infrastructure has been ported over, and we have nearly full feature parity with our old monolith. Our gamble paid off. Since we switched to the new platform, we've launched new products and rolled out a site redesign in record time. We've had two of the highest US traffic days ever without any outages or performance degradation. The site is faster, leaner, and our customers like using it more.
  • It's not a straight line. We expected to hit the same problems we had before, but we hit a whole new set of problems. We were solving for Ruby on Rails problems, but really we had new infrastructure problems.
  • We'll be open-sourcing Testium first, a WebDriver-based integration testing framework.
  • Transcript

    • 1. Riding the N(ode) Train: Dismantling the Monoliths Tuesday, December 3, 2013 Sean McCullough – Engineer at Groupon @mcculloughsean
    • 2. Part I Broken Architecture and A Changing Business
    • 3. Business in Early 2012
    • 4. Architecture in 2012
    • 5. Leading the Mobile Commerce Revolution [chart: Mobile Transaction Mix, monthly, January 2011 to September 2013, % of transactions, North America]
    • 6. Product Engineering was Stuck We couldn't build features fast enough We wanted to build features world-wide Mobile and Web weren't at feature parity
    • 7. Part II The Rewrite
    • 8. The Rewrite
    • 9. The Rewrite Should ... • be built on APIs for consistent contract with mobile • be easy to hire developers • allow for teams to work at their own pace • allow teams to deploy their own code • allow for global design changes • have out of the box I18n/L10n support • be optimized for our read-heavy traffic pattern • be small
    • 10. How do we…? • Deploy • Authorize Users • Share Sessions • Route to different applications • Manage distributed ops • QA the whole site
    • 11. We Tried This Before and Failed • Rolled out a new site design in our monolith • Too many things changed all at once • Hard to evaluate performance of each feature
    • 12. New Platform Evaluation We evaluated: • Node • MRI Ruby/Rails, MRI Ruby/Sinatra • JRuby/Rails, Sinatra • MRI Ruby + Sinatra + EM • Java/Play, Java/Vertx • Python + Twisted • PHP
    • 13. Why Node? • Vibrant community • NPM! • Easy to hire JavaScript developers • Had the minimum viable performance characteristic • Easy scaling (process model)
    • 14. The First App
    • 15. Growing Pains
    • 16. Poking Holes in our Infrastructure • Longevity Test over two days • Try to root out memory leaks • Talking only to non-production systems
    • 17. Poking Holes in our Infrastructure Within 2 hours we had a major site outage
    • 18. Poking Holes in our Infrastructure • SSL termination on our hardware load balancer caused CPU to max out at 100% • Production systems were using same LB as test and development systems
    • 19. Lessons Learned • You will run into problems with Node • You will find problems with your infrastructure • Don't panic!
    • 20. The Second App • Looking for the next page • Chose the "Browse" page • Recently Built • Built using mostly Backbone • Experienced team of JS developers
    • 21. The Second App
    • 22. The Second App New Problems: • User authentication • More service calls • Complicated routing • More traffic • Needed to share look and feel
    • 23. The Second App • Cultural problems • Change of workflow • Feedback loop fell apart 3 rewrites 6 months to launch
    • 24. Shared Layout Maintain consistent look and feel across site: • Distribute layout as library • Use ESIs for top/bottom of page • Apps are called through a "chrome service" • Fetch templates from service
    • 25. Groupon Interface Guidelines
    • 26. Layout Service • Uses semantic versioning • Roll forward with bug fixes • Stay locked on a specific version • Enable Site-Wide Experiments
    • 27. Layout Service
    • 28. Layout Service
    • 29. Routing Service
    • 30. The Big Push … or There's No Going Back • Decided to get the whole company to move at once • Supporting two platforms is hard – Rip off the band aid! • End of June 2013 – move to I-Tier by September 1st
    • 31. The Big Push … or There's No Going Back • ~150 developers • Global effort • Feature freeze – A/B testing against mostly the same features
    • 32. Part III It Worked!
    • 33. 95% Consumer Traffic On Node
    • 34. Sustained US Traffic Over 120k RPM
    • 35. Our Pages Got Faster
    • 36. It Worked!
    • 37. Success? • Moving to a new platform is not a straight line • Solving for old problems • Solving for new problems • Culture shift
    • 38. Next Steps • Streaming responses for better performance • Better resiliency to outages… circuit breakers, brownouts • Distributed Tracing • International • Open Source New I-Tier apps as we build new teams, products, ideas. Latest technologies to help us drive our business.
    • 39. Q&A