1 Disclaimer
NÝHERJI ACCEPTS NO LIABILITY FOR THE CONTENT OF THIS PRESENTATION, OR THE CONSEQUENCES OF ANY ACTIONS TAKEN ON THE BASIS OF THE INFORMATION PROVIDED, UNLESS THAT INFORMATION IS SUBSEQUENTLY CONFIRMED IN WRITING.
ANY VIEWS OR OPINIONS PRESENTED IN THIS SESSION ARE SOLELY THOSE OF THE AUTHOR AND DO NOT NECESSARILY REPRESENT THOSE OF IBM OR NÝHERJI.
2 About Nyherji and Peter
Nyherji is one of Iceland's leading service providers in the field of information technology. It offers complete IT solutions, including consultancy, the provision of hardware and software, office equipment and technical service.
Pétur Eyþórsson has been the lead designer of TSM and DR planning infrastructure for all of Nyherji's TSM customers for the last 14 years, and is an amateur folk-style wrestler.
3 Our Environment
Nyherji manages roughly 50 TSM servers
TSM servers come in many shapes, protecting 5 – 5,000 TB
Main operating systems: Windows, AIX and Linux
TSM Server versions: mostly 6.3
Mostly midrange customers that have historically used the traditional disk-to-tape approach
No VTLs, XIV or any other high-end devices
Wide distribution of Storwize V7000 and V3700
4 Businesses are storing unnecessary data
Businesses are spending 20% more than they need to on backing up unnecessary data.
  – “The most common mistake businesses make is to fail to update their backup policies. It is not unusual for companies to be using backup policies that are years or even decades old, which do not discriminate between business-critical files and the personal music files of employees.” ~Gartner
An especially notorious problem in Iceland's financial institutions, for historical reasons.
5 Our History
TSM has made improvements that offer new approaches
  – TSM 6.1 (new DB2-based TSM database, introduced target deduplication)
  – TSM 6.2 (introduced source [client-side] deduplication)
  – TSM 6.3 (introduced Node Replication, FCM 3.1, and TDP for VMware)
  – IBM acquired FastBack
  – TSM 6.4 (enhancements to Node Replication and TSM Server scalability; FCM 3.2 introduced support for NetApp devices as well as Metro Mirror/Global Mirror)
Our past experience was based solely on conventional TSM disk-to-tape servers
These new technologies offered new options that show great potential
6 Our History
In November 2010, we decided to move 2 of our TSM servers to a highly deduplicated environment.
We had no prior experience with TSM deduplication, and little experience existed on the market that we could tap into.
Since then we have moved 2 other big environments to deduplication and FCM.
Adjustments have been made along the way, based on prior experience.
The purpose of this presentation is to show you how we use our TSM environments.
7 Our Environment
IBM Tivoli Storage Manager Suite for Unified Recovery license
  – Changes everything
  – No more PVU counting
  – Incentive to push for technologies like FCM, deduplication and compression
    • New challenges
6 high-performance environments (3 sites) emerged
  – Use deduplication where possible
  – FlashCopy Manager
  – TSM Node Replication
  – Block-level backup where possible
  – High utilization of client compression
8 Design Goals
RTO goal for important data: less than 1 hour
TSM Server RTO: less than 6 hours
Has to be cost-effective!
  – NL-SATA for storage pool
  – TSM deduplication
9 Pre-Dedup Environment
[Diagram: Client Data; Disk Pool; Tape Pool; Copy Pool. Flow labels: Larger files; Small files]
12 FCM VMware backup
Daily incremental, weekly full
  – 2-week cycle
2 device classes (FCM)
  – STANDARD
  – INCREMENTAL
TDP for VMware used for weekly backup, 90-day retention
  – Daily on Linux FCM management machine
Benefits of FCM for VMware
  – MUCH faster restore speed (datastores)
  – No CBT issues
  – Cheap license (Storwize)
13 FCM Naming Conventions
One FCM device class for each backup type
  – Full, Incr, Copy
TARGET_NAMING must specify a valid target naming schema
  – Difficult/impossible to manage if not structured properly
Schedules back up whole ESX clusters
14 Our TSM Dedup High Availability Solution
[Diagram: Primary Site • Secondary Site. Labels: Node Replication; Node Replication; Active data (no Oracle); Oracle; Primary; Metro Mirror]
15 Why FCM for Oracle
Restore/backup time reduced to minutes
Added workload of client deduplication and compression is not acceptable to the DBAs on production machines
  – An auxiliary/proxy machine backs up to TSM from the FCM copies; it does the deduplication and compression and sends the data to TSM
TDP for Oracle does not distinguish between active and inactive copies
  – DR problem when using active-data Node Replication
    • Solved with FCM MM (Metro Mirror) copy
16 Our Environment
Total storage of 95 TB, weekly change of 25 TB (before backup/dedup)
Intel-based servers
  – 120–140 GB RAM
  – TSM DB
    • 8 SSD DB
      – RAID-5
      – Easy Tier, 22 SAS
      – 1.7 TB SSD
        » Total available size 6 TB
    • 48 SAS 15k rpm
      – RAID-10
        » Total available size 6 TB
  – 8–12 cores
V7000 controller
  – DS3700 & DS3200
  – 3 TB NL-SAS drives
  – 170 TB usable storage
    • RAID-6
FCM for stricter (shorter) RTO needs
Node Replication for high availability
Tape for
  – Long-term storage
  – Data that does not fit in dedup storage
17 TSM 24-hour work schedule
FCM backups: 18:00
Main client backups: 18:00–02:00
Expire inventory: 02:00–03:30
Identify duplicates: 03:30–04:15
TSM DB backup: 04:30–06:00
TSM file data Node Replication: 06:00–10:00
TSM virtual node replication: 10:00–16:00
TSM database Node Replication: 14:00–18:00
TSM * replication to capture missed and new data
TSM space reclamation: 13:00–18:00 (threshold 10)
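The schedule above can be written down as data and checked for windows that run at the same time. The following is a minimal sketch, not part of the original slides: the task names are shortened from the slide, the helper names (`to_minutes`, `minutes_covered`, `overlapping_pairs`) are hypothetical, and entries without an end time (the FCM backup start and the catch-all `*` replication) are left out.

```python
from itertools import combinations

# (task, start, end) as HH:MM; a window whose end is not after its start
# is assumed to cross midnight.
SCHEDULE = [
    ("Main client backups", "18:00", "02:00"),
    ("Expire inventory", "02:00", "03:30"),
    ("Identify duplicates", "03:30", "04:15"),
    ("TSM DB backup", "04:30", "06:00"),
    ("File data node replication", "06:00", "10:00"),
    ("Virtual node replication", "10:00", "16:00"),
    ("Database node replication", "14:00", "18:00"),
    ("Space reclamation", "13:00", "18:00"),
]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def minutes_covered(start, end):
    """Return the set of minutes of the day covered by [start, end)."""
    first, last = to_minutes(start), to_minutes(end)
    if last <= first:  # window wraps past midnight
        last += 24 * 60
    return {minute % (24 * 60) for minute in range(first, last)}


def overlapping_pairs(schedule):
    """Return every pair of task names whose windows share at least one minute."""
    covered = {name: minutes_covered(start, end) for name, start, end in schedule}
    return [(a, b) for a, b in combinations(covered, 2) if covered[a] & covered[b]]


for first_task, second_task in overlapping_pairs(SCHEDULE):
    print(f"{first_task} overlaps {second_task}")
```

With the times from the slide, the only concurrent windows are the two later replication runs and space reclamation in the afternoon; the night-time tasks run back to back.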
19 What we learned
Achieving increased performance
  – Solved by engineering parallelism
    • Solved differently for different applications
20 What we learned
Protecting the dedup storage pool to tape copy pools proved problematic
[Chart: performance based on file size. * Fabricated data]
Node Replication solved this
21 Why Node Replication as High Availability
4 possible solutions
  – OS cluster (Windows Cluster, AIX HACMP)
    • Pros
      – Robust
      – Automated failover
    • Cons
      – Only OS fault tolerant
  – Traditional server-to-server copy pool virtual volumes
    • Pros
      – Robust
      – Volume failure recovery
    • Cons
      – Long RTO
      – Cumbersome and long recovery (especially dedup TSM servers)
  – TSM Node Replication
    • Pros
      – Relatively simple failover
      – Warm standby server ready to go
    • Cons
      – Young technology
      – No easy way to recover from damaged volumes
  – TSM DB2 HADR
    • Pros
      – Easy failover
      – Cold standby server ready to go
      – Can take over metadata only
    • Cons
      – No installation of the kind we proposed existed in production
22 Our Performance Design Formula
IF (X) AND (Y) ≤ 95% THEN (N) + 1
  X = CPU load
  Y = Disk response time (15 ms = 95%)
  N = Number of parallel worker threads in TSM
• Simplified formula to maximise workload in our TSM servers
  – If an idle TSM resource is detected, more threads are added
• CPU or disk response time should always be the bottleneck
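The formula above can be sketched as a small decision function. This is an illustration under stated assumptions, not the presenter's tooling: the function names are hypothetical, the comparison is read as "at or below 95%", and the disk response time is assumed to scale linearly so that 15 ms corresponds to 95%.

```python
def disk_load_pct(response_ms, ms_at_limit=15.0, limit_pct=95.0):
    """Map disk response time to a percentage.

    Assumption: linear scaling, anchored at 15 ms = 95% as on the slide.
    """
    return response_ms / ms_at_limit * limit_pct


def next_thread_count(cpu_load_pct, disk_response_ms, threads, limit_pct=95.0):
    """Return N + 1 when both CPU (X) and disk (Y) are at or below the limit, else N."""
    disk_pct = disk_load_pct(disk_response_ms, limit_pct=limit_pct)
    if cpu_load_pct <= limit_pct and disk_pct <= limit_pct:
        return threads + 1  # idle resource detected: add one worker thread
    return threads  # CPU or disk is already the bottleneck: leave N unchanged


# Example: CPU at 60% and disk at 5 ms with 10 threads suggests trying 11.
print(next_thread_count(cpu_load_pct=60.0, disk_response_ms=5.0, threads=10))
```

The slide only describes when threads are added, so the sketch never reduces N; a backing-off rule would be a separate design decision.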
24 Performance Data
• TSM Server: 1.1 TB database, 80 TB dedup pool
• 20-thread deduplication space reclamation (threshold 10)
  – Sustained 5,000–9,000 IOPS
• 17 TB of total data transfer
  – 7 TB write
  – 10 TB read
• CPU near fully utilized
TSM env 1
25 Performance numbers
1.3 TB TSM database on Storwize SSD/SAS Easy Tier
  – Max 30,000 IOPS!
  – Space reclamation moves 4.3 TB per hour (read & write)
Average DB IOPS: 8,000–12,000
TSM env 2
26 Performance
TSM dedup environment, totals over 24 hours
  – Database writes ~1x its size every day
    • 1 TB database writes 970 GB
  – Database reads ~1.5x its size every day
    • 1 TB database reads 1.5 TB
  – SSDs in RAID-5 become the bottleneck during write-intensive operations (space reclamation, W/R 50/50)
TSM env 2
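The rule of thumb above (writes of roughly 1x and reads of roughly 1.5x the database size per day) can be turned into a rough sizing estimate. The sketch below is illustrative only: the function names are hypothetical, the factors are the approximate ones from the slide, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_db_io_tb(db_size_tb, write_factor=1.0, read_factor=1.5):
    """Estimate daily database writes and reads in TB from the database size."""
    return db_size_tb * write_factor, db_size_tb * read_factor


def average_mb_per_s(tb_per_day):
    """Convert TB per day to an average MB/s, assuming decimal units."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


writes_tb, reads_tb = daily_db_io_tb(1.0)
print(f"writes: {writes_tb} TB/day, about {average_mb_per_s(writes_tb):.1f} MB/s on average")
print(f"reads:  {reads_tb} TB/day, about {average_mb_per_s(reads_tb):.1f} MB/s on average")
```

These are daily averages; the load is not evenly spread, so peak rates during write-intensive windows such as space reclamation are what the database disks actually have to sustain.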
27 What has changed
Use deduplication where we can
  – Exclusion/special treatment:
    • Very large single objects
    • Encrypted data
    • Large non-repetitive data
We can't use a storage pool hierarchy based on small-file pools anymore, because of client dedup restrictions
We assign a dedicated DISK device class storage pool for VMware control files
  – Reduces the mount point requirement
29 What we have learned so far
Our TSM dedup servers can scale up to 400 TB of managed data (pre-dedup); this is based on DB size
We can't be cheap when it comes to our TSM server hardware
  – A lot of RAM, 48 GB minimum
  – 12 cores (Intel)
    • Only put multiple TSM instances on AIX
  – Use really fast disks for your database; it's going to get hammered (5,000+ IOPS). Preferably SSD or a lot of spindles
  – We use the maximum active log size right from the start
    • We must be careful about our space reclamation workload; many threads can eat up all the log space really fast
    • Gigantic single objects (1.0 TB+) will pin the log; we must be careful about workload during that object's backup time
  – Larger databases
    • x2 if you dedup only B/A client data
    • x3 – ∞ if you dedup application data as well
      – Depends a lot on how long you plan to keep your data in the pool
  – Copy pool to tape from the dedup pool proved difficult to use; we used Node Replication instead
30 What we have learned so far
We use client deduplication if we have the backup window to do so
  – It will cause performance degradation on your backups; application backups are more affected (assuming no transport bottleneck)
  – Send all client data directly to the dedup storage pool
  – Saves a lot of work on your SATA drives
We use VMware backup when possible; ALL VMs must be at a high enough HW level to support CBT
  – If not, use FCM only on those machines
We use client compression to add more space savings
  – Must be careful about use of 3rd-party compression; it may have adverse effects on the deduplication ratio
  – We DON'T use compression if we plan on doing server-side deduplication (bad ratio)
31 What we have learned so far
Utilize client parallelism for greater speed and workload
  – B/A client: resource utilization
  – TDP SQL: multiple database backups concurrently, or stripes
  – VMware: use Vmmaxparallel; be careful not to exceed 8 on each host at the same time
  – Exchange: multiple data movers
We keep our deduplication volumes small, 12–24 GB
Run aggressive space reclamation
  – Aim for a 10% threshold (90% or less utilized)
Keep large objects in a separate storage pool (active log size)
Keep VMware CTL files in a separate (DISK device class) storage pool
TSM database backup as fast as possible
  – All log activity during the backup window will be applied at the end
  – Needs increased space in the active log because of this
32 What we have learned so far
Plan for a small tape pool for data that is not well suited to dedup storage pools
  – Application databases that change a lot (reorgs, index rebuilds, etc.)
  – Data that requires encryption
  – Very large single objects, 750 GB+
For systems with stricter RTO requirements we utilize FCM to achieve instant restore of the newest data; alternatively we utilize parallel threads to achieve the RTO goal, but there are drawbacks
Readjust the copy rate for FCM; the default setting does not always apply
In rare cases heavily utilized applications can't handle LUN quiescing
  – FCM or block-level backups can't be used
34 Chicago's World Fair 1893
• Moral of the story
• Expect more technological innovation than you can imagine in coming years
• Don't get your hopes up, IT's innovation won't solve all its problems
  – With new and improved technology, new challenges emerge