Scaling Up Lookout


Scaling Up Lookout was originally presented at Lookout's Scaling for Mobile event on July 25, 2013. R. Tyler Croy is a Senior Software Engineer at Lookout, Inc. Lookout has grown immensely in the last year: we've doubled the size of the company, added more than 80 engineers to the team, support 45+ million users, have over 1,000 machines in production, and see over 125,000 QPS and more than 2.6 billion requests per month. Our analysts use Hadoop, Hive, and MySQL to interactively manipulate multibillion-row tables. With that, there are bound to be some growing pains and lessons learned.


Scaling Up Lookout

  1. Scaling Up Lookout, R. Tyler Croy
  2. Hello everybody, welcome to Lookout! I'm excited to be up here talking about one of my favorite subjects, scaling. Not just scaling in a technical sense, but scaling *everything*. Scaling people, scaling projects, scaling services, scaling hardware; everything needs to scale up as your company grows, and I'm going to talk about what we've been doing here. First, I should talk about ->
  3. this guy
  4. Who I am. - I've spoken a lot before about continuous deployment and automation, generally via Jenkins. As part of the Jenkins community, I help run the project infrastructure and pitch in as the marketing events coordinator, cheerleader, blogger, and anything else that Kohsuke (the founder) doesn't want to do. Prior to Lookout I worked almost entirely on consumer web applications, not in a controllers-and-views sense, but rather building out backend services and APIs to help handle growth. At Lookout, I've worked a lot on the Platform and Infrastructure team, before being promoted, or demoted depending on how you look at it, to the Engineering Lead for ->
  5. OMG Serious Business
  6. The Lookout for Business team. I could easily talk for over 30 minutes about some of the challenges that building business products presents, but suffice it to say, it's chock full of tough problems to be solved. Not many companies grow to the point where they're building out multiple product lines and revenue streams, but at Lookout we've now got Consumer, Data Platform, and now Business projects underway. It's pretty exciting, but not what I want to talk about. Let's start by ->
  7. Let's travel back in time
  8. Talking about the past at Lookout. I've been here for a couple years now, so my timeline starts in ->
  9. 2011
  10. 2011 In the olden days, we did things pretty differently, in almost all aspects. I joined as the sixth member of the server engineering team, a group that has 20-30 engineers today. -> Coming in with a background in continuous deployment, the first thing that caught my eye was
  11. release process
  12. Our release process was like running a gauntlet every couple weeks, and maybe we'd ship at the end of those two weeks, maybe not. It was terribly error-prone. James ran the numbers for me at one point, and during this time period we were experiencing a "successful" deployment rate of ->
  13. 36% of deployments failed
  14. This means that 1/3 of the time, when we would try to deploy code into production, something would go wrong and we would have to roll back the deploy and find out what went wrong. Unfortunately, since it took us two or more weeks to get the release out, we had on average ->
  15. 68 commits per deployment
  16. 68 commits per deployment, so one or more commits out of 68 could have caused the failure. After a rollback, we'd have to sift through all those commits, find the bug, fix it, and then re-deploy. Because of this ->
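To put those batch sizes in perspective: isolating one bad commit among 68 by bisecting takes about seven full test runs, each of which had to go through the same slow pipeline. A back-of-the-envelope sketch, using only the figures from the slides:

```ruby
# Rough cost of isolating one bad commit among N by bisection:
# each test run halves the suspect set, so it takes ~log2(N) runs.
def bisect_steps(commit_count)
  Math.log2(commit_count).ceil
end

bisect_steps(68) # => 7 test runs to find the culprit in a failed deploy
bisect_steps(14) # => 4 runs once the batch size shrank
```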
  17. 62% of deployments slipped
  18. About 2/3rds of our deployments slipped their planned deployment dates. As an engineering organization, we couldn't tell the product owner when changes would be live for customers with *any* confidence!
  19. There were myriad reasons for these problems, including: - lack of test automation (tests existed but weren't reliably running; we were using Bitten with practically zero developer feedback) - a painful deployment process To make things more difficult, all our back-end application code was in a ->
  20. monorails
  21. monolithic Rails application. While it served its purpose as the company was bootstrapping itself, it was starting to show its age and prove challenging with more and more developers interacting with the repository.
  22. The team was at an interesting junction during this time: problems with the way things were done were readily acknowledged, but the bandwidth and buy-in to fix them were difficult to come by. I think every startup that grows from 20 to 100 people goes through this phase when it is in denial of its own growing pains. As more people joined the team, we pushed past the denial and started working on ->
  23. Scaling the Workflow
  24. Scaling the workflow. Our two-ish week release cycle was first on the chopping block; we started with what became known as ->
  25. The Burgess Challenge
  26. The Burgess Challenge. While having beers one night with James and the server team lead Dave, James asked if we could fix our release process and get from two-ish week deployments to *daily* deployments, in ->
  27. 60 days
  28. 60 days. This was right at the end of the year; with Thanksgiving and Christmas breaks coming up, we had some slack in the product pipeline, so we decided to take the project on and enter 2012 a different engineering org than the one that had left 2011. We started the process by bringing in some ->
  29. New Tools
  30. New tools, starting with ->
  31. JIRA
  32. JIRA. While I could rant about how much I hate JIRA, I think it's a better tool than Pivotal Tracker was for us. Pivotal Tracker worked well when the team and the backlog were much smaller, and less inter-dependent, than they were in late 2011. Another tool we introduced was ->
  33. Jenkins
  34. Jenkins - Talk about the amount of work just to get tests passing *consistently* in Jenkins - Big change in developer feedback on test runs compared to previously. We also moved our code from Subversion into ->
  35. Git + Gerrit
  36. Git and Gerrit, Gerrit being a fantastic Git-based code-review tool. At the time the security team was already using GitHub:Firewall for their work. We discussed at great length whether the vanilla GitHub branch, pull request, merge process would be sufficient for our needs and whether or not a "second tool" like Gerrit would provide any value. I could, and have in the past, given entire presentations on the benefits of the Gerrit-based workflow, so I'll try to condense as much as possible into this slide of our new code workflow ->
  37. describe the new workflow, comparing it to the previous SVN-based one (giant commits, loose reviews, etc.)
  38. With Jenkins in the mix, our fancy Gerrit workflow had the added value of ensuring all our commits passed tests before even entering the main tree. We were doing a much better job of consistently getting higher-quality code into the repository, but we still couldn't get it to production easily. Next on the fix-it list was ->
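The pre-tested commit idea is simple enough to sketch: Jenkins verifies each Gerrit change in isolation, and only changes whose tests pass are ever merged to the main tree. A toy model of that gate (the change IDs and structure here are illustrative, not Lookout's actual tooling):

```ruby
# Toy model of a pre-tested commit gate: Jenkins verifies every
# Gerrit change on its own, and only passing changes land on master.
Change = Struct.new(:id, :tests_pass)

def merge_queue(changes)
  changes.select(&:tests_pass).map(&:id)
end

pending = [
  Change.new("I1a2b3c", true),   # verified by Jenkins, will merge
  Change.new("I4d5e6f", false),  # failed verification, never lands
]
merge_queue(pending) # => ["I1a2b3c"]
```

The point of the gate is that a broken change can never poison the main tree, so "the build is red" stops being a shared-blame situation.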
  39. The Release Process
  40. The release process itself. At the time our release process was a mix of manual steps and Capistrano tasks. - Automation through Jenkins - Consistency with stages (no more update_faithful) We managed to change the entire engineering organization such that ->
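What "consistency with stages" buys you can be sketched as a pipeline every release walks in the same order, halting at the first failure so nothing half-deployed reaches production (the stage names are hypothetical, not our actual Capistrano tasks):

```ruby
# Every release walks the same stages in the same order; a failure
# stops the pipeline so nothing half-deployed reaches production.
STAGES = %i[build test stage_deploy smoke_test production_deploy]

def run_pipeline(stages)
  completed = []
  stages.each do |stage|
    break unless yield(stage) # the block runs one stage, true = success
    completed << stage
  end
  completed
end

# A run where the smoke test fails never touches production:
run_pipeline(STAGES) { |s| s != :smoke_test }
# => [:build, :test, :stage_deploy]
```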
  41. 2% of deployments failed
  42. 14 commits per deployment
  43. 3% of deployments slipped
  44. neat
  45. Automating Internal Tooling
  46. - Introducing OpenStack to provide developer-accessible internal VM management - Managing Jenkins build slaves via Puppet - Introduction of MI Stages
  47. OpenStack
  48. If you're going to use a pre-tested commit workflow with an active engineering organization such as ours, make sure to plan ahead and have plenty of hardware, or virtualized hardware, for Jenkins. We've started to invest in OpenStack infrastructure and the jclouds plugin for provisioning hosts to run all our jobs on. With over 100 build slaves now, we also had to make sure we had ->
  49. Automated Build Slaves
  50. Automated management of those build slaves; nobody has time to hand-craft hundreds of machines and ensure that they're consistent. Additionally, we didn't want to waste developer time playing the "it's probably the machine's fault" game every time a test failed.
  51. Per-Developer Test Instances
  52. Scaling the People
  53. Not much to say here; every company is going to be different, but you can't just ignore that there are social and cultural challenges in taking a small engineering team and growing to 100+ people.
  54. - Transition from talking about the workflow to the tech stack
  55. Scaling the Tech Stack
  56. With regard to scaling the technical stack, I'm not going to spend too much time on this, since the other people here tonight will speak to it in more detail than I probably should get into, but there are some major highlights from a server engineering standpoint. Starting with the databases ->
  57. Shard the Love
  58. - Global Derpbase woes - Moving more and more data out of non-sharded tables - Experimenting with various connection pooling mechanisms
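Moving data out of the global, non-sharded tables means every query needs a shard key to route on; the routing itself can be as simple as a modulo over the user id (a sketch for illustration, not Lookout's actual scheme):

```ruby
# Illustrative shard routing: hash the user id to one of N databases.
# SHARD_COUNT and the naming scheme are invented for this example.
SHARD_COUNT = 16

def shard_for(user_id)
  "shard_#{user_id % SHARD_COUNT}"
end

shard_for(45_000_001) # => "shard_1"
```

The hard part isn't the arithmetic; it's migrating existing rows and making sure no query path is left that needs to fan out across every shard.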
  59. Undoing Rails Katamari
  60. - Diagnosing a big ball of mud - Migrating code onto the first service (Pushcart) - Slowly extracting more and more code from monorails, a project which is ongoing
  61. Modern JavaScript
  62. I never thought this would have a big impact on scaling the technical stack, but modernizing our front-end applications has helped tremendously. The JavaScript community has changed a great deal since the company was founded; the ecosystem is much more mature, and the web in general has changed. By rebuilding front-end code as single-page JavaScript applications (read: Backbone, etc.), we are able to reduce complexity tremendously on the backend by turning everything into more-or-less JSON API services.
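Turning backend endpoints into more-or-less JSON API services means the server's job shrinks to serializing state, with no server-side templates or view logic. The endpoint and field names below are invented for illustration:

```ruby
require "json"

# With a single-page frontend, a backend endpoint just serializes
# state as JSON; the fields here are hypothetical, not a real API.
def device_status_payload(device)
  {
    id:        device[:id],
    locked:    device[:locked],
    last_seen: device[:last_seen],
  }.to_json
end

device_status_payload(id: 42, locked: false, last_seen: "2013-07-25")
# => '{"id":42,"locked":false,"last_seen":"2013-07-25"}'
```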
  63. Infinity and Beyond
  64. The future at Lookout is going to be very interesting, both technically and otherwise. On the technical side of things, we're seeing more of a ->
  65. Diversifying the technical portfolio
  66. Diversified technical portfolio. Before the year is out, we'll have services running in Java, Ruby, and even Node. To support more varied services, we're getting much more friendly ->
  67. Hello JVM
  68. with the JVM, either via JRuby or other JVM-based languages. More things are being developed for and deployed on top of the JVM, which offers some interesting opportunities to change our workflow further, with things like: - Remote debugging - Live profiling - Better parallelism
  69. With an increasingly diverse technical stack and a stratified services architecture, we're going to be faced with the technical and organizational challenges of operating ->
  70. 100 services
  71. 100 services at once. When the team which owns a service is across the office, or across the country, what does that mean for clearly expressing service dependencies, contracts, and interactions on an ongoing basis? With all these services floating around, how do we maintain our ->
  72. Institutional Knowledge
  73. Institutional knowledge amongst the engineering team. Growth means the size of our infrastructure exceeds the mental capacity of any single engineer to understand each component in detail.
  74. We're not alone in this adventure; we have much to learn from companies like Amazon or Netflix, who have traveled this path before. I wish I could say that the hard work is over, and that it's just smooth sailing and printing money from here on out, but that's not true. There's still a lot of hard work to be done, and difficult problems to tackle, as we move into a much more service-oriented, multi-product architecture. I'd like to ->
  75. Thank you
  76. Thank you for your time. If you have any questions for me, I'll be sticking around afterwards. Thank you!