Infrastructure Migration

Slideshow from a (nonrecorded) talk I gave at the Columbus, Ohio LOPSA chapter.

Published in: Technology, Business

  1. Infrastructure Migrations
     How many infrastructure migrations have I done? I’m not sure. I stopped counting around 5. One of the benefits of working for a small company that’s growing quickly is that you get to experience a lot of new things...and moving production and office environments is one of them.
     Thursday, August 2, 12
  2. I am: Matt Simmons
     • 10+ year sysadmin
     • Small infrastructures
     • 6+ infrastructure migrations
     • http://www.standalone-sysadmin.com
     You probably know this...
  3. This is: Infrastructure Migrations
  4. 10,000ft view
     • Pre-Planning
     • Execution
     • Post-Mortem
     Like most things, 90% of the work is planning. The other 90% is lifting heavy things. There’s another 10-25% reserved for figuring out what went wrong, and determining how to make it not happen again.
  5. Considerations: Types of Migrations
     • Build in parallel
     • Move Infrastructure
     • Hybrid
     You really, really want to build in parallel. Sure, it’s expensive, but it means much, much shorter periods of downtime. Moving an infrastructure is hair-raising, because there are only a few million things that can go wrong. Most people will probably end up doing hybrid migrations, where you build some of the new infrastructure, then migrate the rest from the existing setup. Watch out for things like IP addressing issues, and check that you’ve made the correct assumptions about rack space and power requirements for the machines that are moving.
     And you don’t know scary until you’re driving a U-Haul full of servers across the Pennsylvania Turnpike in the middle of a rainstorm.
  6. Considerations:
     • Downtime Limits
     • Uptime Requirements
     • Service Window Length
     Strangely enough, downtime limits and uptime requirements aren’t the same. Figure out what your uptime requirements are according to your user base’s expectations, then figure out how much infrastructure needs to be running in order to meet them. Good luck.
     You might have a maintenance window, where downtime is planned and doesn’t count against your SLAs. If your migration can fit within this, awesome (hint: it can’t). So you need to figure out what kind of downtime you can afford, and remember to schedule notices to your customers far enough in advance that they aren’t taken by surprise.
  7. Considerations: Upstream Network Changes
     I think I could do an entire presentation where I just list all of the problems that could happen when network providers screw things up. Big ones to watch out for:
     1. Is the test and turn-up date early enough so that inevitable failures don’t impact the go-live date?
     2. Is the circuit exactly what you ordered, and is what you ordered exactly what you need?
     3. Are cross-connects in the datacenter ordered, and is the datacenter networking team working with the provider?
  8. Considerations: (Wo)man Power
     You can’t lift all of the things you own. You need friends to come help you move, right? And you usually pay them in beer and pizza for the effort. Moving infrastructures is kind of like that, except “money” typically substitutes for beer and pizza, and you want to find people who are reasonably smart, because you probably don’t own anything in your apartment that costs as much as a high-performance RAID array.
     Figure out how many people you need, then add 20% to cover the stuff you didn’t think of. Have another 10% at home ready to come in if the need arises.
  9. Considerations: How can we parallelize the work?
     If you have teams, having them all work independently but simultaneously is important, so try not to have one team waiting around on the result of another team. This is no different than removing bottlenecks from a computing infrastructure.
  10. Establishing a Plan
      Documentation shall set you free!
  11. Build a checklist
      Every good plan includes a checklist
      • What needs to be done
      • By whom?
      • Where?
      • In what order?
  12. Build a checklist
      Include all phases
      • Off site prior
      • On site prior
      • On site during
      • On site after
      • Testing
      • Signoff
      Off site things before moves are usually slow processes or long-term changes that rely on TTLs or human interaction outside of your organization.
  13. Build a checklist
      Establish Dependencies
      If item 23 relies on item 24 being done, then it’s probably in the wrong place...
      Figuring out all of these dependencies is like untangling a knot. It’s slow, it’s difficult, and when you’re done, no one seems to be as appreciative of your hard work as you are.
  14. Build a checklist
      Build in checkpoints
      Checkpoints are a great place to stop all the teams at the same time and make sure that everyone’s on the same page.
  15. Build a checklist
      Include communication up-stream
      Overcommunicate. Keep your boss informed. Keep your stakeholders informed. If you have the kind of work environment where your users care, keep them informed.
  16. Build a checklist
      Multiple Checklists
      • Per team?
      • Per location?
      • Per person?
      If you’ve got multiple teams, you are likely to need multiple checklists. Ditto if your locations are far apart. If each person’s tasks are complicated, give each person an individual checklist, too.
  17. Build a checklist
      Schedule Breaks
      Breaks are SO important. You can’t work for 8 hours without stopping to rest, physically or mentally. Put these into the schedule.
  18. Change Management Techniques
      Establish tests for complicated steps (or groups)
      Would you build a new server then put it into production without testing it? Of course not. Build tests to see if your work so far is correct. It can be as simple as “at this point, LED 7, 8, and 9 should be green, and LED 10 should be amber”.
  19. Change Management Techniques
      Establish roll-back procedures
      Things happen. Stuff doesn’t always go right. Make sure your plan includes when to roll back and what steps to take to do it.
  20. Change Management Techniques
      Establish failure guidelines
      (What happens if...)
      • ...a machine breaks?
      • ...a router doesn’t boot?
      • ...?
      Failures are inevitable. Unhandled failures are unnecessary though. Know how to tell if something has failed, and know what to do about it.
  21. Identify Goods & Services to be Purchased
      • Cables of specific lengths, connectors, label tape, velcro, rack shelves, etc
      • Servers, routers, firmwares, licenses, etc
      • Circuits, bandwidth, accounts, etc
      These kinds of steps require a lot of planning, but more planning just makes the end result better.
  22. Maintain Communications
      • Cellphones (at least one per team)
      • 2-way radios (for lack of cellular service)
      • Probably not IP phones
      Cell reception in datacenters is spotty. Using handheld 2-way radios is much more reliable. Don’t rely on your IP phone infrastructure for critical communications during network outages. Just don’t.
  23. Find Warm Bodies
      • Figure out how many people you need.
      • Add 20% for good measure
      • Have 10% standing by
  24. Establish Roles
      • Zone
      • Man to Man
      • Point Guard
      ...and so on, as required by your migration.
      Zone: “Your job is to stay at this rack, pulling things out in the order prescribed by the checklist, and to load them on the cart once removed”
      Man to Man: “Your job is to cart these servers to the truck, and once the number of servers in the truck matches the number prescribed by the checklist, to drive the truck to the new datacenter, and assist in loading the servers onto the cart for the next zone man”
      Point Guard: “Your job is to act as the communications hub, the person to verify that check points happen on schedule, and that things are correct, as well as to finalize sign-off and handoff once we’re done”
  25. Communicate the plan
      Default to being too communicative
      Have your point guard annoy people with the number of email updates.
  26. Communicate the plan
      Get clearance from the stakeholders
      Before ever starting work, make sure that everyone is on board with the migration plan, and that everyone has agreed and signed off.
  27. Communicate the plan
      Alert users multiple times
      • Well in advance (so long term projects aren’t scheduled)
      • A week before (so short-term pushes aren’t interrupted)
      • Immediately before (so last minute issues don’t compound)
  28. Communicate the plan
      Give everyone the information they need
      • Checklists
      • Plan document
      • Contact Information
      I actually got to the point where every person involved in the migration got a personalized envelope. The contents were the checklist relevant to their job, the diagrams of what the rack looked like before, what the new racks were supposed to look like, and the contact information for all of the other team members.
      ...and has signed off on it
  29. Executing the plan
      I love it when a plan comes together...
  30. Executing the plan
      Verify all goods were purchased
      Doing inventory sucks, but not having enough ethernet cables that reach to the switch sucks more...
  31. Executing the plan
      Clear personal schedules
      “oh, that was this weekend? Crap, man, I’m sorry. I have to go drink beer with my other friends and have a good weekend. Maybe next time, brah”
  32. Executing the plan
      Complete off-site checklist items
      Verify that everyone at both sites knows what’s happening, when, and is on board. Make sure the datacenter has people on hand to help who are capable of helping.
  33. Executing the plan
      Show up early
      ...because something won’t be right.
  34. Executing the plan
      Verify assigned roles
      Ask for questions
      ...and ask each person. Make sure that they know how to get ahold of you and the point guard.
  35. Executing the plan
      Step through the list
  36. Executing the plan
      Verify completeness with each team
  37. Executing the plan
      Perform on-site and off-site post-complete items
  38. Executing the plan
      Go have a beer.
      Seriously, celebrate completing the task with the team. I didn’t always get to do this, and I’m still sorry about it today.
  39. Executing the plan
      Complete post-mortem according to schedule
      During the next workweek, complete the post-mortem and identify what went wrong as well as what went right. You can’t replicate success and eliminate failure unless you identify them.
  40. Dealing with problems
      Yes, you will have problems...
  41. Dealing with problems
      Problems are inevitable (It’s not “if”, it’s “when”)
      Read “The Field Guide to Understanding Human Error” by Sidney Dekker http://amzn.to/QFpcqY
      Two big take-aways:
      1) Problems are inevitable because they are a condition of the infrastructure, and they arise from its inherent complexity.
      2) It’s not possible to eliminate all failures, but it’s desirable to minimize them, and to try to avoid repeating the same failure by improving the process and design.
      During my talk, I gave far more discussion on this topic than I’m going to give here.
  42. Dealing with problems
      • Identify & Acknowledge the problem
      • Don’t punish the reporter
      • Follow the failure guidelines
      • Roll-back if necessary & reschedule
  43. Post-mortem
      • What went wrong?
      • Why?
      • The ‘Five Whys’
      • What went right?
      • What have we learned?
  44. Infrastructure Migrations
      Thanks for your time. I hope you were able to get something out of it.
      If you have questions, feel free to contact me: @standaloneSA, standalone.sysadmin@gmail.com
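
Slide 13's point about dependencies ("if item 23 relies on item 24 being done, then it's probably in the wrong place") can be checked mechanically. What follows is only a rough sketch, not anything shown in the talk: it assumes Python 3.9+ for the standard-library graphlib module, and the checklist items and the order_checklist helper are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def order_checklist(depends_on):
    """Return checklist items in an order that respects their dependencies.

    depends_on maps each item to the set of items that must be done first.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the items that form the loop.
        raise ValueError(f"circular dependency in checklist: {err.args[1]}") from err


# Hypothetical checklist, invented for this example.
checklist = {
    "lower DNS TTLs": set(),
    "notify users": set(),
    "power down servers": {"lower DNS TTLs", "notify users"},
    "rack servers at new site": {"power down servers"},
    "update DNS records": {"rack servers at new site"},
}

ordered = order_checklist(checklist)
for step, item in enumerate(ordered, start=1):
    print(f"{step}. {item}")
```

If two items end up depending on each other, the helper raises an error instead of quietly producing a list nobody can follow, which is about the cheapest way to find the knot before moving day.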
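
The headcount rule from slides 8 and 23 (add 20%, keep another 10% standing by) is simple arithmetic, sketched here under two assumptions that the slides don't spell out: both percentages are taken from the base headcount, and fractions of a person round up. The staffing function and its parameter names are hypothetical.

```python
def ceil_div(numerator, denominator):
    """Integer division that rounds up, avoiding floating-point surprises."""
    return -(-numerator // denominator)


def staffing(base_headcount, buffer_pct=20, standby_pct=10):
    """Return (people on site, people on standby at home).

    Assumption: both percentages apply to the base headcount,
    and any fraction of a person is rounded up.
    """
    on_site = base_headcount + ceil_div(base_headcount * buffer_pct, 100)
    standby = ceil_div(base_headcount * standby_pct, 100)
    return on_site, standby


on_site, standby = staffing(10)
print(f"Bring {on_site} people, keep {standby} on call")
```

Rounding up is deliberate: an extra pair of hands is cheap compared to a stalled move.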
