Aspirus Enterprise Backup Assessment and Implementation of Avamar

Written by: Thomas Whalen, Server and Storage Infrastructure Team Leader, Aspirus Information Technology Department

Executive Summary

Since the initial implementation of Epic within the Aspirus Health System, maintaining a consistent backup process has been a recurring challenge. The largest part of this challenge was finding a combination of backup technology and storage that could handle the continual growth of data as Aspirus expanded its Epic environment in both clinical records and application modules.

In late 2009, the Aspirus Information Technology department participated in a proof of concept built around EMC’s Avamar host-based de-duplication backup grid and Networker backup management software to see what the results of pushing the Epic production data to this backup architecture would be. In the past, we had used a product from Exagrid to perform target-based de-duplication, but found that the Exagrid did not yield the performance and de-duplication rates we considered acceptable as the environment continued to grow. In front of Exagrid we also ran Symantec’s NetBackup backup management software, which proved inconsistent in performing routine backups and was plagued by various system issues that forced the IT staff to devote a large share of their attention to it just to ensure that routine backups could take place.

Once we began the proof of concept with Avamar, we determined very quickly that the de-duplication rates observed were superior to those of Exagrid’s target-based de-duplication appliance. We also felt that EMC’s Avamar technology, with its RAIN (Redundant Array of Independent Nodes) architecture, would scale to meet the long-term needs of Aspirus’ ever-increasing data growth.
Just as important, while implementing the system we never observed system issues in which backups simply failed to run. At the end of the proof of concept and the eventual implementation, we now realize an overall de-duplication rate for our Epic environment (based on routine nightly backups) of 110:1, storing an average of 900GB in total with an average nightly change rate of 1.5 to 2.5%, or roughly 8GB of daily changes. Because of the aggressive de-duplication capabilities, this equates to a significantly lower cost of ownership for securing the same amount of data than is typical when writing to tape or even to another disk-based de-duplication system. It also frees up the staff time previously dedicated to hand-holding our old backup system, which can now be reallocated to more meaningful work in IT. Lastly, the days of random missed backups appear to be a thing of the past, which assures us that our clinical and financial data will be consistently protected throughout its life cycle.

Epic Backup Architecture

Aspirus uses a number of technologies to position its Epic clinical data for backup. In the beginning we simply pulled the backups from a snapshot mounted to the Epic shadow server and then spun that data off to magnetic tape. We found that this posed a number of problems, both in Epic performance and in the performance of writing the backup. On the Epic side, using the SnapView tools from EMC on our Clariion SAN, we found that, because of the way snapshots are designed to work, initiating the backup and mounting the snapshot to the shadow server degraded the performance of the production environment. As our datasets grew larger, this performance issue became more visible to users, and the mission of the IT technical group is to ensure that we maintain the highest degree of performance 24x7.
As our environment grew, we also noticed that our backup window was getting longer and longer while we wrote 500-600GB of data off to our DLT tape array. As time progressed and the data grew, we saw the writing on the wall: DLT was not going to be the long-term solution if we wanted to keep a daily backup process intact.

At this point, we decided to use EMC SnapView clones to replicate the data from the production storage LUNs to cloned LUNs. While this is more expensive because of the duplicate storage the clone requires, mitigating the performance issues we saw with the snapshot process was a good trade-off in our opinion. The clone could also be used for other purposes, such as environment refreshes. The initial clone was created on EMC 500GB SATA drives, which were slower but offered more disk capacity.

At the same time, we moved away from our DLT tape array to a target-based disk appliance from Exagrid. This transition was a good move, as it brought faster backups and restores and also introduced target-based de-duplication. As backups began to be written to Exagrid, we started to see de-duplication rates of around 15:1.

While transitioning from snapshots to clones, we also decided to move the backup processes off the production Shadow Server. The Shadow Server was pulling double duty: it not only provided the DR shadow as part of Epic’s overall best practices, but also served as the extracting Cache database for Epic Clarity reporting, a very intensive process. To reduce the Shadow Server workload, we built a dedicated IBM AIX cloning server to which the Epic production clone is presented. This ensured that no other Epic-specific processes or services were impacted while we performed routine backups.
The clone also allows us to perform routine non-production environment refreshes for future builds, testing, validation, etc.

Visual Representation of Previous Backup System

In this design we were getting acceptable backups, but with the SATA disks and the growth of the Epic production database, we started to see limits to the value of the Exagrid storage system in both speed and de-duplication, along with more and more problems managing the Epic backups through NetBackup.

Avamar and Networker Assessment

Contrasting Avamar vs. Exagrid

The Avamar technology is a collection of servers, or nodes, arranged as a RAIN (Redundant Array of Independent Nodes) that forms a “grid” of storage resources. The grid can grow as your storage needs grow and natively supports backups across the network, along with the ability to manage NDMP backups for NAS-based storage solutions. Avamar can also replicate backups across separate grids to provide DR for your critical data.

While the Avamar and Exagrid storage architectures share similarities in function, the biggest difference is in how each handles the backup data itself. Avamar is a host-based de-duplication system: a client on the server that holds the data to be backed up interrogates that data and sends only the changed data down the wire to the Avamar grid. This results in less network traffic for your backup data. Avamar uses a patented “Commonality Factor” process, which learns patterns of data behavior, uses them to distinguish changed data from unchanged data, and thus determines its de-duplication rates. Exagrid, on the other hand, is a target-based de-duplication system in which all data is sent to the grid and lands in a high-speed disk repository.
In this repository, Exagrid compares the data at the byte level to perform its de-duplication and moves the changed data to a lower-speed, higher-capacity disk area for long-term retention and compression. This process takes place after all the data has been passed to the Exagrid and the software client has been told that the backup is complete.

One can argue the benefits of both types of technology, and in fact both are generally very good. But with Epic as our target application, we determined that sending close to 1TB down the wire nightly was a big part of our backup pains. Avamar mitigates that by using the host-based client to determine what has changed and sending only the changed data down the wire, so Avamar stores and manages just the data that has changed between backup cycles.

Using the client to manage the changed data also uncovered an issue with our cloning process. Our initial testing against the SATA-based clone of Epic production showed an unacceptable level of IOPS being pushed to the host as it interrogated the data. Our first backups on SATA ran in excess of 10 hours. After investigating the host’s performance during the backup, it was easy to see that the clone IOPS were slowing the Avamar client’s ability to interrogate the data and move the changes down the wire to the grid. Based on this, we created a new Fibre Channel-based clone running on 300GB, 15K RPM drives. The impact was very positive: our backup went from 10 hours to 6 hours on the first run, faster than any backup we had cut since we went live in 2004. After a number of days of nightly test backups, we began to see the effect of Avamar’s de-duplication process.

Avamar Backup Performance Results

The Avamar grid showed very good performance in accepting data from the host, even over a single 1 gigabit network connection.
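
A single network link can be enough because of the host-based approach contrasted with Exagrid earlier: the client decides what is new before anything is sent. The following is a minimal sketch of that idea only, not Avamar’s actual algorithm. The fixed 4096-byte chunk size, the use of SHA-256, and all function names are assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[str, bytes]]:
    """Split data into fixed-size chunks and pair each chunk with its SHA-256 digest."""
    return [
        (hashlib.sha256(data[i:i + chunk_size]).hexdigest(), data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def host_side_backup(data: bytes, grid_index: set[str]) -> int:
    """Host-based model: send only chunks the grid has not seen; return bytes sent."""
    sent = 0
    for digest, chunk in chunk_hashes(data):
        if digest not in grid_index:
            grid_index.add(digest)  # the grid now knows this chunk
            sent += len(chunk)
    return sent


def target_side_backup(data: bytes) -> int:
    """Target-based model: every byte crosses the network before de-duplication."""
    return len(data)


# Night 1: everything is new. Night 2: one chunk out of 100 has changed.
night1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(100))
night2 = night1[:CHUNK_SIZE * 99] + b"\xff" * CHUNK_SIZE

index: set[str] = set()
first_sent = host_side_backup(night1, index)
second_sent = host_side_backup(night2, index)
print(first_sent, second_sent, target_side_backup(night2))
```

In this toy run the second night sends a single chunk from the host, while the target-based model still moves the full data set across the network before any comparison happens.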
Over the course of 10 backup tests, we recorded backup timings along with the amount of changed data, and then computed de-duplication ratios from the change rate. Figure 1 illustrates those results:

Figure 1: De-Duplication Change Rate

Figure 1 shows 11 backups that were run against Aspirus Epic production data using the Avamar de-duplication grid. The chart shows that over the backup cycles, the daily change rate decreases as the host-based client “learns” the pattern of changes from day to day. This knowledge is then used to capture only the differences and send those to the grid. The chart’s left axis is the de-duplication rate as a percentage. As the daily backups were performed, the amount of data that the client determined was unchanged increased each night. The first backup showed zero de-duplication, since it was the first backup performed and Avamar saw all data as new. Backups 2 through 11 then showed a steady increase in de-duplicated data. By backup 11, the rate of de-duplication was over 90%. Given a 900GB Epic database, the backup consisted of roughly 7 to 10GB in total changes sent to the grid. The benefits are a dramatic decrease in network traffic and, over the continuum of backups, a significant decrease in the overall storage needed to keep a longer retention of Epic backups available. Based on the total amount of data divided by the amount of changed data, this works out to a de-duplication rate of approximately 110:1.

The value of this is measured in a number of ways. The largest consideration is the space required to store the same data on tape. Using DLT, even with compression, you would need 2-3 tapes per night to keep that data safe. With Avamar, the amount of storage required is 900GB plus the daily changes.
So for a week’s worth of backups, that equates to about 950GB of storage versus approximately 21 tapes holding about 4.9TB, factoring in a moderate compression ratio on the DLT tape drive.

Figure 2: Backup Time – Snapshot 1

Figure 2 tracks backup time in hours for the 11 backups we monitored. Note the dramatic drop in the time needed to perform the backup between backups 5 and 6. This is the impact of using the Fibre Channel clone instead of the SATA clone, a change that reduced the backup time by 50%. What this chart does not show is the impact of Commonality Factor as the Avamar client learns the pattern of data change between backup cycles. As of the writing of this document, a more recent capture of backup times yields a much more interesting chart that illustrates the impact of Avamar’s Commonality Factor.

Figure 3: Backup Time with Commonality Factor

Figure 3 illustrates how, over a longer period of time, Commonality Factor plays a role in reducing your backup window. As Commonality Factor learns the pattern of changed and unchanged data day by day, it uses algorithms to determine how best to scan the data on the host. This efficiency reduces the work the client needs to do to review the data, and the result is an overall reduction in backup time. From around backup 8 through backup 20 you will see a slow decline in backup time as Commonality Factor plays a larger role in how much time the client needs to spend scanning the file systems. The fact that Aspirus can now back up its entire Epic production Cache database instance in roughly 4 hours speaks volumes about the power of the Commonality Factor process compared with other backup and de-duplication technologies we’ve used in the past. This is simply the finest backup process we’ve encountered to date.
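
The storage figures quoted in this section can be reproduced with a few lines of arithmetic. This is a rough check rather than a sizing tool. It uses the approximate 900GB database size and 8GB nightly change quoted in this document, and it assumes the upper end of the 2-3 tapes per night estimate. The variable names are illustrative.

```python
DB_SIZE_GB = 900         # full Epic backup size quoted in the text
DAILY_CHANGE_GB = 8      # approximate nightly changed data quoted in the text
NIGHTS = 7
TAPES_PER_NIGHT = 3      # assumption: upper end of "2-3 tapes per night"
TAPE_WEEK_TB = 4.9       # weekly tape footprint quoted in the text

# Ratio of total data to changed data, the basis of the ~110:1 figure.
dedup_ratio = DB_SIZE_GB / DAILY_CHANGE_GB

# Avamar keeps one full copy plus each night's changes.
avamar_week_gb = DB_SIZE_GB + NIGHTS * DAILY_CHANGE_GB

# Tape writes a full copy every night.
raw_week_gb = DB_SIZE_GB * NIGHTS
tapes_week = TAPES_PER_NIGHT * NIGHTS

# Compression ratio implied by storing 7 full copies in about 4.9TB.
implied_compression = raw_week_gb / (TAPE_WEEK_TB * 1000)

print(f"de-duplication ratio: about {dedup_ratio:.0f}:1")
print(f"Avamar, one week: {avamar_week_gb} GB")
print(f"tape, one week: {tapes_week} tapes, {raw_week_gb} GB before compression")
print(f"implied tape compression: about {implied_compression:.2f}:1")
```

The week of Avamar storage comes out at 956GB, in line with the "about 950GB" figure, and the implied tape compression of roughly 1.3:1 is consistent with the "moderate compression ratio" the text mentions.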
Avamar Restore Performance Results

Using Networker as the front end to our backup and restore process posed a challenge for Epic production data restores. The reason is that Networker is designed around a more Windows-oriented restore process. Said differently, unlike backups, where Networker fires off a multi-threaded backup process (multiple avtar processes or Networker save processes), for restores it creates only a single restore, working through the file systems one at a time until all are restored. Because a traditional Epic Cache database instance is made up of multiple Epic production file systems, restoring those file systems one at a time would take a significant amount of time even for the smallest of Cache instances.

In our testing, we found that launching a separate restore process for each file system allowed Networker to leverage the horsepower of the Avamar grid and the network infrastructure to pull back every Epic file system at the same time, thus simulating a multi-threaded restore. During the course of testing 4 restore points with the EMC Networker/Avamar technology, we recorded the aggregate restore times shown in Figure 4.

Figure 4: Restores Single Instance vs. Multi-Instance

Figure 4 shows that a single-instance restore takes significantly longer, up to days to finish. With a multi-instance restore, the time to pull back your Epic production data is more palatable and results in a restore time that you can base an SLA around. The graph also shows that a multi-instance restore performs almost as well as the backup, which is contrary to the conventional 2:1 backup-to-restore baselines used in the IT industry today.
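
The multi-instance approach amounts to starting one restore job per file system and waiting for all of them to finish. The sketch below shows only that pattern. The `restore_file_system` function is a hypothetical stand-in (here it just sleeps briefly) for whatever command actually launches a restore, and the file system names follow the /epic/prdNN naming used in this document.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Eight production file systems, /epic/prd01 through /epic/prd08.
FILE_SYSTEMS = [f"/epic/prd{n:02d}" for n in range(1, 9)]


def restore_file_system(path: str) -> str:
    """Hypothetical stand-in for one restore job.

    A real job would invoke the backup client for this path; the short
    sleep only simulates work so the example stays self-contained.
    """
    time.sleep(0.01)
    return f"restored {path}"


def restore_all(paths: list[str]) -> list[str]:
    """Launch one restore per file system at the same time and wait for all."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(restore_file_system, paths))


results = restore_all(FILE_SYSTEMS)
for line in results:
    print(line)
```

With one worker per file system, the overall restore finishes when the slowest individual job does, rather than after the sum of all of them.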
But as we learned about the recovery process, an optimization concern emerged that plays a significant role in restoration.

Figure 5: Epic System File System Provisioning

As Figure 5 shows, when we dissected the restore process for our Epic production system, we found that one file system was significantly larger than any other in the production instance. Individually, all of our file system restores took about 6 hours, with the exception of /epic/prd01. The /epic/prd01 area of Cache took ~12 hours on its own and thus pushed our total recovery window to 12 hours. Considering that /epic/prd01 is 2 ½ times the size of any other /epic/prdxx file system, the restore time made sense, albeit not optimal. What we learned is that we need to be more mindful of the balance of data between file systems and keep them all similar in size so that, in a restore situation, we minimize our time to recovery across all file systems. In this case, by balancing /epic/prd01 with the remaining file systems, even if they all grow an additional 10-20%, we should be able to reduce our recovery window from 12-13 hours to approximately 6-6 ½ hours, given the restore timings we’ve already collected for the other file systems, /epic/prd02 through /epic/prd08. Aspirus will actively engage Epic to better balance the /epic/prd01 file system with the remaining Epic Cache file systems and will then revisit the recovery window, but we feel that our projected recovery times will be acceptable given the testing already completed. Also, as stated earlier, the file systems in every instance we restored passed Epic integrity tests without issue.
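
The effect of the oversized file system on the recovery window can be seen with simple arithmetic on the approximate timings above (about 6 hours for each of /epic/prd02 through /epic/prd08, about 12 hours for /epic/prd01). This is a rough model that assumes parallel restores do not slow each other down. The variable names are illustrative.

```python
# Approximate per-file-system restore times in hours, from the text above.
restore_hours = {f"/epic/prd{n:02d}": 6.0 for n in range(2, 9)}
restore_hours["/epic/prd01"] = 12.0

# Single-instance restore: file systems are restored one after another.
one_at_a_time = sum(restore_hours.values())

# Multi-instance restore: all run together, so the slowest one sets the window.
in_parallel = max(restore_hours.values())

# Window set by the other file systems alone, ignoring /epic/prd01.
without_prd01 = max(
    hours for path, hours in restore_hours.items() if path != "/epic/prd01"
)

print(f"one at a time: {one_at_a_time:.0f} hours")
print(f"in parallel:   {in_parallel:.0f} hours")
print(f"in parallel, /epic/prd01 excluded: {without_prd01:.0f} hours")
```

Restoring one file system at a time adds up to 54 hours, which matches the observation that single-instance restores run into days. In parallel the window is the 12 hours set by /epic/prd01, and the remaining file systems alone would finish in about 6 hours, which is the basis for the projected window once the data is balanced.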
Analysis Summary

Based on the findings we captured in both the backup and recovery processes, the EMC Networker and Avamar grid technology offers, from an Epic perspective, a significant improvement in the overall management of backup data. From an SLA perspective, Aspirus was able to move its Epic backup window from 12-14 hours to 4 to 4 ½ hours, with a recovery RTO of 7-8 hours, down from 24-48 hours spinning back from tape. Care must be taken to assess your Epic file systems to ensure they are balanced, and to position yourself with a host to which the Epic data can be presented. These steps are critical to the success of the implementation.

From a cost of ownership perspective, thanks to the aggressiveness of Avamar’s de-duplication technology and its use of Commonality Factor, we’ve been able to reduce the long-term size of our backups for the retention windows we feel necessary by almost 90% compared with the Exagrid. This means less money spent on continually adding capacity for all the other backups in the enterprise, and it extends our initial storage provisioning far longer than we originally anticipated. Also, because of the host-based client, less data traverses the network, which helps maintain overall network performance and makes WAN-based Networker/Avamar backups a reality rather than a wish-list item.

The Avamar grid is sold based on your data de-duplication needs, not on the total amount of backup space like other backup storage technologies. Again because of the Commonality Factor process, basing your initial determination of storage needs on CF means that the Avamar RAIN grid will generally cost less per GB and require less total storage space, thanks to the higher degrees of de-duplication achieved compared with other de-duplication systems.

Another major cost factor is client costs. With other backup technologies you must license the clients or hosts you wish to back up.
With Avamar, the clients are free for a wide array of hosts (Windows, IBM AIX, HP-UX, Linux) and include agents for Microsoft Exchange and SharePoint, Oracle, DB2, and others, which are usually high-priced accessory licenses on top of the host license itself. In many cases, this is the area where the cost of implementing a backup system becomes very expensive very quickly.

In closing, Aspirus has spent a lot of time working with the EMC Avamar/Networker backup technology and feels it was absolutely the right move to make for all the points above, but there is one final point I have yet to cover. The best part, outside of all the cool technical things about this backup environment, is that we feel our backups are safe and recoverable, and that we manage backups rather than backups managing us. Because we have a technically sound and functionally stable backup environment, we can now devote that time to other work.

The Aspirus Backup Architecture Today