VMware Forum Winnipeg - 2012

Keynote speech and presentation by Anil Sedha, along with the Postmedia team, at the VMware Forum in Winnipeg.

  • Recovery was difficult since disparate systems were in use and each one required its own recovery procedure.
  • Recovery timeline: Some critical applications like Canada.com have multiple consistency groups and tens of servers, so it was difficult to let them stay down for long. This environment powers all of our major websites.
    Planned failover timeline: When we performed a failover, server recovery depended on how quickly the SAN team could fail over the mirror volumes.
    Multiple resources involved: Since only infrastructure was under our control, the choreography of which servers come up first and which interface servers come up next required a lot of intervention.
    Operational sequence: Even when we planned for servers to come up in a specific sequence, there was always a chance that mistakes could happen.
  • We were able to showcase the value of using SRM to perform further virtualization since recovery was simplified.
  • VMware Forum Winnipeg - 2012

    1. Disaster Recovery and Failover using SRM
       • Presenters: Russ Pedneault, Anil C. Sedha, Kevin Seniuk
       • Titles, in the order listed on the slide: Technology Services Manager, Midrange Services Supervisor, Senior Technical Specialist
       • VMware Forum Winnipeg, May 15, 2012
    2. Company Overview
       • Largest publisher by circulation of paid English-language daily newspapers in Canada, representing some of the country’s oldest and best-known media brands.
       • Reaching millions of Canadians every week.
       • Engage readers and offer advertisers and marketers integrated solutions to effectively reach target audiences through a variety of print, online, digital, and mobile platforms.
       • Postmedia Network is a mobile web leader: 120 daily news media mobile sites, 80+ vertical mobile web sites, 1M monthly visitors, 9M monthly page views.
    3. IT Overview
       • Virtualization Platform: VMware vSphere 4.1 and 5.0, SRM v4.1
       • 500+ virtual servers, 250 physical servers, 3 Virtual Center servers, 4000+ desktops, 3 datacenters and 13 smaller sites
       • Server Hardware: HP, Cisco, Sun, and Alpha/VMS servers
       • EMC Clariion and VNX arrays, HP EVA arrays, Sun Storage, Data Domain VTL
       • Operating System: VMware ESXi, Windows 2003/2008, HP-UX, VMS, Red Hat Enterprise Linux, Solaris, SUSE Linux, Apple
       • Messaging: Exchange 2007, MS Office Communicator, Cisco Unified Messaging
       • Database: Oracle, MS SQL, Sybase, MySQL
    4. Virtualization/SRM Story
       • Background: IT could not recover data quickly enough, so Postmedia recovery plans were time consuming and involved special recovery procedures requiring expert knowledge.
       • Challenges: IT environment was running mostly on old physical servers and had clustering/mirroring in place.
       • The Inevitable Happens:
         - An entire datacenter goes down due to a power outage despite power protection.
         - After power was restored another outage had to be taken to perform repairs.
         - Enhanced recovery procedures were not in place at that time.
       • Resolution:
         - Deploy virtualization first strategy
         - Implement SRM with existing Storage Replication Technology
         - Upgrade SRM to run with newer Storage Replication Technology
       • Turnaround: SRM failover brings relief and a new self-confidence in the organization that data can be recovered in a very short duration with rollback capabilities.
    5. Background
       • Key Issues:
         - Recovery timeline was unacceptable for some revenue generating applications
         - Multiple resources from Application and Infrastructure teams had to be involved
         - Operational sequence for recovery was manual so mistakes could easily happen
         - Changes in application environments meant keeping up with those changes manually
         - Managing failover/recovery of remote sites
    6. Challenges
       • Physical server infrastructure does not offer the flexibility for easy failover to a secondary site.
       • Reliance on aging hardware: unsure if a server would come up after restart.
       • Many manual steps needed to make the remote site operational.
       • Required specialists to bring up the Storage environment at the remote site before the Server environment could be brought up.
       • Clustered environments presented additional challenges: Microsoft Cluster, HP-UX Cluster, Sun Cluster.
       • Push back from Application teams: "don’t touch the server running our applications".
    7. Challenges
       • A large number of application servers were running on physical hardware.
       • A great deal of effort was needed by both Application and Infrastructure teams.
       • Outages to critical applications for longer than the expected timeframe would mean revenue loss.
       • IT had never done a datacenter recovery or failover in the past.
    8. Reality Bites (Power Outage)
       • There was an unexpected power outage at one of our datacenters and all servers went offline for approximately an hour.
       • Server recovery after the power outage took further effort and quite a few hours.
       • The initial event left Postmedia IT wondering what to do, since a recovery would have taken many hours.
       • Once power was restored, the Service Provider needed a planned failover so it could repair the power infrastructure, an outage of around 8 hours.
       • After negotiation, Postmedia was given 5 days (the outage was originally scheduled for the next day) to perform the planned failover before the outage.
    9. What SRM did for us
       • Created a complete recovery process in a simple, centralized recovery plan, and automated recovery steps.
       • SRM allowed failover of the Exchange 2007 environment in minutes. Other application servers failed over in minutes as well.
       • Half of the datacenter move was accomplished quickly and within the expected timeframe.
       • The success of SRM and Virtualization gave the impetus to create further cost savings by virtualizing and retiring older servers.
    10. What SRM did for us (Contd)
       • Postmedia IT chose the approach of showcasing the benefits of virtualization instead of forcing virtualization on the business.
       • Highlighted the capabilities of SRM: failover of the Exchange 2007 environment in minutes.
       • Recovery is greatly simplified: even a non-IT individual within the organization, with the authorization and awareness of documented login procedures, can press the recovery button in case of a disaster.
    11. Lessons Learned
       • SRM recovery plans should be created based on which application consistency groups need to be failed over together.
       • Review your common outage windows based on applications.
       • Ensure you have efficient storage replication mechanisms in place that integrate with SRM.
       • Verify your Recovery Plans in advance by running a test (this does not perform an actual failover).
    12. Planned Failover - Now
       • With newer replication mechanisms available in the industry, it is easier and quicker to perform failover using SRM.
       • Postmedia moved away from traditional software-based replication to hardware appliance-based replication.
       • We now have PVR-like capabilities to roll back data to any point in time, right down to the second.
       • Our recent array upgrade required planned failovers, and we were able to fail over Exchange and other critical applications in 7-13 minutes per recovery group.
       • Tested before we failed over to ensure success.
       • Ran 3 recovery plans simultaneously for faster failover.
    13. Where we are today
       • 450+ virtual servers, 50+ ESXi hosts
       • SRM 4.1 fully implemented for all virtualized production servers
       • Replication mechanism fully integrated and automated with SRM, across a wide variety of storage-related replication products
       • Recovery of critical applications like Exchange, Citrix, and CMS takes 7-13 minutes to bring servers up at the secondary site
       • Settled on RecoverPoint appliances to perform replication since they offer PVR-like data rollback capabilities
       • The organization has adopted a “Virtualize First” strategy
       • Significant ability to meet business timelines for application recovery
       • Can recover an entire datacenter quickly and successfully
    14. Thank You!
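
The slides describe two related ideas: recovery plans should follow application consistency groups, and a manual start-up sequence is error-prone. The sketch below illustrates that idea in plain Python; it is not SRM's API or how SRM works internally. The server names, the group names, and the `build_recovery_plans` helper are all hypothetical, and the sketch assumes each server's dependencies stay inside its own consistency group.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: server -> (consistency group, servers it depends on).
# None of these names come from the presentation.
SERVERS = {
    "db01": ("web", []),
    "app01": ("web", ["db01"]),
    "web01": ("web", ["app01"]),
    "mbx01": ("mail", []),
    "cas01": ("mail", ["mbx01"]),
}


def build_recovery_plans(servers):
    """Return {group: [servers in start-up order]}, one plan per group."""
    by_group = defaultdict(dict)
    for name, (group, deps) in servers.items():
        # TopologicalSorter expects node -> set of predecessors.
        by_group[group][name] = set(deps)

    plans = {}
    for group, graph in by_group.items():
        # static_order() yields each server only after its dependencies.
        # It raises graphlib.CycleError if the dependencies are circular.
        plans[group] = list(TopologicalSorter(graph).static_order())
    return plans


plans = build_recovery_plans(SERVERS)
for group, order in plans.items():
    print(group, "->", ", ".join(order))
```

Deriving the order from declared dependencies means a change in the application environment only needs the inventory updated, rather than a hand-maintained runbook.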

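
As a rough worked example of why running 3 recovery plans simultaneously shortens the failover window, the sketch below compares sequential and parallel execution using the 7-13 minutes per recovery group quoted in the slides. The `failover_window` helper is hypothetical, and the model is idealized: it assumes parallel plans do not slow each other down, which the presentation does not state.

```python
def failover_window(minutes_per_plan, parallel):
    """Estimate wall-clock minutes to run a set of recovery plans.

    Sequential runs add up; idealized parallel runs take as long as
    the slowest plan (assumes no contention between plans).
    """
    if parallel:
        return max(minutes_per_plan)
    return sum(minutes_per_plan)


# Three recovery plans at the quoted best case (7 min) and worst case (13 min).
best_case = [7, 7, 7]
worst_case = [13, 13, 13]

sequential = (failover_window(best_case, False), failover_window(worst_case, False))
simultaneous = (failover_window(best_case, True), failover_window(worst_case, True))

print("sequential:   %d-%d minutes" % sequential)
print("simultaneous: %d-%d minutes" % simultaneous)
```

Under these assumptions, three plans run back to back would take 21-39 minutes, while running them together keeps the window at 7-13 minutes.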