Migrating big data


  1. Moving Day: Migrating your Big Data from A to B
     Justin Dow (jabba@mozilla.com)
     Corey Shields (cshields@mozilla.com)
     Laura Thomson (laura@mozilla.com)
  2. Overview
     • What is Socorro?
     • Rationale for migration
     • Build out
     • Automation
     • Smoke testing
     • Troubleshooting
     • Moving Day
     • Aftermath
  3. What is Socorro?
  4. “Socorro has a lot of moving parts”... “I prefer to think of them as dancing parts”
  5. A different type of scaling:
     • Typical webapp: scale to millions of users without degradation of response time
     • Socorro: fewer than a hundred users, terabytes of data
  6. Basic law of scale still applies: the bigger you get, the more spectacularly you fail
  7. Some numbers
     • At peak we receive 3,000 crashes per minute
     • 3 million per day
     • Median crash size: 150 KB
     • ~40 TB stored in HBase, and growing every day (rough math below)
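A quick back-of-envelope check of these figures (my arithmetic, not from the deck; it assumes the median size roughly stands in for the mean, which understates a right-skewed distribution):

```python
# Rough daily ingest implied by the numbers above.
crashes_per_day = 3_000_000
median_crash_kb = 150

# Treat this as a floor: crash sizes are usually right-skewed,
# so the true volume is likely higher than the median suggests.
daily_gb = crashes_per_day * median_crash_kb / 1024 / 1024
print(f"~{daily_gb:.0f} GB/day")  # ~429 GB/day
```

At that floor rate, ~40 TB is on the order of three months of raw crashes, not counting replication overhead.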
  8. What can we do?
     • Does betaN have more (null signature) crashes than other betas?
     • Analyze differences between crashes under Flash versions x and y
     • Detect duplicate crashes (see the sketch after this slide)
     • Detect explosive crashes
     • Find “frankeninstalls”
     • Email victims of a malware-related crash
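The deck doesn't show how the duplicate detection works; purely as an illustration of the idea, here is a sketch that keys each crash on a handful of identifying fields (the field names are hypothetical, not Socorro's actual schema):

```python
import hashlib

def crash_key(crash: dict) -> str:
    """Build a comparison key from fields that tend to match on duplicates.
    The field names here are hypothetical, not Socorro's real schema."""
    parts = (
        crash.get("signature", ""),
        crash.get("product", ""),
        crash.get("version", ""),
        crash.get("install_time", ""),
        crash.get("uptime", ""),
    )
    return hashlib.sha1("|".join(str(p) for p in parts).encode()).hexdigest()

def find_duplicates(crashes):
    """Yield (original, duplicate) pairs among crashes sharing a key."""
    seen = {}
    for crash in crashes:
        key = crash_key(crash)
        if key in seen:
            yield seen[key], crash
        else:
            seen[key] = crash
```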
  9. Rationale and planning
  10. Why?
      • Initial reason for the move was stability: lots of downtime, partly due to approaching capacity limits
      • We needed more infrastructure, which meant a new data center
      • Decided that if we were going to do it over... we’d do it right
  11. Fragility
      • Ongoing stability problems with HBase, and when it went down, everything went with it
      • Releases were nightmares, requiring manual upgrades of multiple boxes, editing of config files, and manual QA
      • Troubleshooting had to be done remotely (awful)
  12. Moving data
      • Setting up the new architecture and new machines is one thing
      • Moving such a large amount of data... with no downtime... is harder
      • The no-downtime requirement turned out to apply to collection, so we changed the code to a pluggable storage architecture: spool to disk when HBase is unavailable (sketched after this slide)
      • PostgreSQL data was easy to move
      • For HBase we originally intended to use distcp, but ran into trouble and ended up using a dirty-copy tool authored in house
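A minimal sketch of that pluggable-storage fallback, assuming a collector that writes through a primary store and spools to local disk when the write fails (class and method names are illustrative, not Socorro's actual API):

```python
import json
import os
import uuid

class DiskSpoolStorage:
    """Fallback store: spool crashes to local disk for later replay."""
    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        os.makedirs(spool_dir, exist_ok=True)

    def save(self, crash_id, crash):
        path = os.path.join(self.spool_dir, crash_id + ".json")
        with open(path, "w") as f:
            json.dump(crash, f)

class FallbackCrashStorage:
    """Write to the primary store (e.g. HBase); on any failure,
    spool to disk so collection never has to stop."""
    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def save(self, crash):
        crash_id = crash.get("uuid") or str(uuid.uuid4())
        try:
            self.primary.save(crash_id, crash)
        except Exception:
            # Primary is down or unreachable: keep the crash anyway.
            self.fallback.save(crash_id, crash)
```

The point of the design is that collection keeps accepting crashes either way; a separate job can replay the spool into the primary store once it recovers.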
  13. Planning tools
      • Bugzilla for tasks
      • Pre-flight checklist and in-flight checklist to track tasks
      • Read Atul Gawande’s The Checklist Manifesto
      • Rollback plan
      • Failure scenarios
      • Rehearsals
  14. Build out
  15. What was wrong?
      • Legacy hardware
      • Improperly managed code
      • Each server was different
      • No configuration management
      • Shared resources with other webapps
      • Vital daemons were started with “nohup ./startDaemon &”
      • Insufficient monitoring
      • One sysadmin; the rest of the team and the developers had no insight into production
      • No automated testing
  16. Automation
  17. Configuration management
      • New rule: if it wasn’t checked in and managed by Puppet, it wasn’t going on the new servers
      • No local configuration/installation of anything
      • Daemons got init scripts and proper Nagios plugins (a sample check is sketched after this slide)
      • Application configuration is done centrally, in one place
      • The staging application matches production
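The plugins themselves aren't shown in the deck; as one minimal illustration, a Nagios-style check that verifies the daemon named in a pidfile is still alive (the pidfile path is hypothetical):

```python
#!/usr/bin/env python
"""Minimal Nagios-style check: is the daemon in the pidfile alive?"""
import os
import sys

OK, CRITICAL = 0, 2  # standard Nagios exit codes (1=WARNING, 3=UNKNOWN)

PIDFILE = "/var/run/socorro-processor.pid"  # hypothetical path

def main():
    try:
        with open(PIDFILE) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        print(f"CRITICAL: pidfile {PIDFILE} missing or unreadable")
        return CRITICAL
    try:
        os.kill(pid, 0)  # signal 0 checks existence without killing
    except OSError:
        print(f"CRITICAL: process {pid} is not running")
        return CRITICAL
    print(f"OK: process {pid} is running")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```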
  18. Packages for production
      • Third-party libraries and packages are pulled in upstream
      • IT doesn’t need to know or care how a developer develops; what goes into production is a tested, polished package
      • Packages for production are built and tested by Jenkins the same way every time
      • Local patches aren’t allowed: a patch to production means a patch to the source upstream, a patch to stage, and a proper rollout to production
      • Every package is fully tested in a staging environment
  19. Smoke testing
  20. Load testing
      • Used a small portion (40 nodes) of a 512-node SeaMicro cluster
      • Simulated real traffic by submitting crashes from a test farm (a submitter is sketched after this slide)
      • Tested the entire system as a whole, under real production load
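A sketch of what such a crash submitter might look like, assuming an HTTP collector that accepts multipart POSTs; the endpoint URL, file layout, and form field name are all assumptions, not taken from the deck:

```python
import glob
import itertools
import json

import requests  # third-party HTTP client

COLLECTOR_URL = "http://crash-collector.example.com/submit"  # hypothetical

def submit_crash(meta_path, dump_path):
    """POST one captured crash (metadata fields plus minidump) to the collector."""
    with open(meta_path) as f:
        fields = json.load(f)
    with open(dump_path, "rb") as dump:
        resp = requests.post(
            COLLECTOR_URL,
            data=fields,
            files={"upload_file_minidump": dump},  # field name is an assumption
            timeout=30,
        )
    resp.raise_for_status()
    return resp.text  # collectors typically echo back a crash ID

if __name__ == "__main__":
    # Replay a corpus of previously captured crashes in a loop to generate load.
    dumps = glob.glob("testcrashes/*.dmp")
    for dump_path in itertools.cycle(dumps):
        submit_crash(dump_path.replace(".dmp", ".json"), dump_path)
```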
  21. Troubleshooting
  22. The Bleeding Edge... bleeds!
      • Moving the HBase data over: issues with distcp
      • New data center, new load balancers, new challenges
      • New hardware = new challenges with HBase
      • Discovered a misconfigured kickstart script, from which all the servers had been built
      • Used Puppet to correct the error
      • Tested various configurations
  23. Moving Day
  24. • Flew the team in
      • Migration day checklist: http://tinyurl.com/migrationday
      • Went remarkably smoothly, due largely to good co-operation between teams
  25. Aftermath
  26. • Before moving day, every possible failure scenario was discussed and planned for
      • Most failure scenarios actually happened; the new cluster failed gracefully with zero data loss
      • New data that came in during the migration had to be dealt with
      • Problems with the new HBase cluster, and various other tweaks
      • Postmortem to learn what we did right and wrong
  27. It’s all open
      • Get the code (moving to GitHub in 3-4 weeks): http://code.google.com/p/socorro
      • Read/file/fix bugs: https://bugzilla.mozilla.org/
      • Call in to the weekly meetings: https://wiki.mozilla.org/Breakpad/Status_Meetings
      • Join us in IRC: irc.mozilla.org #breakpad
  28. Questions?
  29. We’re Hiring!!
      • Help us work to improve and protect the open web
      • Offices worldwide; working from home is possible for many jobs
      • careers.mozilla.com
      • System Administrators (Web Ops, VCS, LDAP, etc.)
      • Site Reliability Engineers, QA Engineers
      • Developers, developers, developers, developers!!
