Storage Strategies NOW: Real-time Compression: Achieving storage efficiency throughout the data lifecycle
By Deni Connor, founding analyst
Patrick Corrigan, senior analyst
July 2011
For many companies, the growth in the volume of data is outpacing their ability to store and manage it effectively and efficiently. Recent studies indicate that enterprise demand for primary data storage capacity is growing at a rate of 35% to 65% annually.¹ Much of that data, as much as 80% in some organizations, consists of unstructured data -- files, spreadsheets and multiple data types (CAD drawings, engineering data, PDFs, etc.) -- that is traditionally stored on network attached storage (NAS) devices and file servers. And that unstructured data is projected to grow at a rate of over 60% this year alone.
What processes other than the storing of unstructured data are fueling this unbounded storage growth? First, the need to improve recovery time objectives (RTO) and recovery point objectives (RPO) contributes massively to storage growth: mirrors, snapshots, replicas and clones created for migration purposes all greatly increase the amount of data that must be stored. Add to that the data being replicated for disaster recovery and the data being archived for regulatory and compliance purposes. Then consider the amount of data that is copied to tape and shuttled offsite for long-term preservation.
The amount of data is cumulative, and the copies of identical data being stored, while necessary, create a storage burden. It affects not only expenditures for more storage; it also impacts storage management, LAN and WAN bandwidth and performance, backup capacity, and backup and recovery time. In a world of increasingly narrow backup windows, with data doubling every 18 months, the ability to back up more data in the same window of time is critical.
Further, while server virtualization has helped organizations control physical server sprawl, it has not materially helped alleviate the storage capacity issue. In fact, the ease of deploying new virtual servers, each of which requires storage capacity, is exacerbating the storage capacity problem, since they can be spun up on a moment's notice. According to some studies, virtualizing an environment causes a 4x growth in storage capacity. Virtualization not only has a significant impact on primary storage costs; it also creates a major impact on backup and replication systems as users scramble to protect their data assets. Reducing the amount of storage dedicated to virtual servers by 72% can result in a 3.5x decrease in the recovery time objective (RTO).
Note: The information and recommendations made by Storage Strategies NOW are based upon public information and sources and may also include personal opinions both of Storage Strategies NOW and others, all of which we believe to be accurate and reliable. As market conditions change, however, and are not within our control, the information and recommendations are made without warranty of any kind. All product names used and mentioned herein are the trademarks of their respective owners. Storage Strategies NOW, Inc. assumes no responsibility or liability for any damages whatsoever (including incidental, consequential or otherwise), caused by your use of, or reliance upon, the information and recommendations presented herein, nor for any inadvertent errors which may appear in this document. This Storage Strategies NOW White Paper was commissioned by IBM and is distributed under license from Storage Strategies NOW.
¹ Source: Wikibon
Traditional compression
When deploying traditional compression for storage optimization, users often overlook that it runs as a post-processing task: data is first written to disk and then compressed. Depending on the compression software, this is done manually (using tools such as WinZip), immediately after the write (as with Windows NTFS volume compression), or when CPU cycles are available. Since CPU power is needed for both compression and decompression, and disk space is required to hold files before compression and after decompression, these techniques do not typically resolve the storage efficiency issue.
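The transient cost is easy to see in outline. The following is a minimal Python sketch of a post-processing workflow, using the generic zlib library rather than any particular vendor's tool; the function name and flow are illustrative only:

```python
import zlib
from pathlib import Path

def post_process_compress(path: Path) -> Path:
    """Compress a file that has already been written to disk.

    Note the transient cost: the raw file occupies full capacity
    until this task runs, and both the raw and compressed copies
    exist on disk until the original is deleted.
    """
    raw = path.read_bytes()                      # full read back from disk
    compressed = zlib.compress(raw, level=6)     # CPU-intensive step
    out = path.with_suffix(path.suffix + ".z")
    out.write_bytes(compressed)                  # second copy lands on disk
    path.unlink()                                # only now is space reclaimed
    return out
```

An in-line approach, such as the IBM appliance described later, compresses before the first write, so the full-size copy never lands on disk at all.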
Deduplicating secondary data
Most current solutions for reducing storage capacity requirements focus on compressing and deduplicating secondary backup data and static archives after they are stored on NAS devices. These approaches are fine as far as they go, but they do not address the issue at the point of importance: decreasing the amount of primary storage that at some time in its lifecycle will be mirrored, replicated, cloned and backed up for data protection.
Users often look to deduplication of data using appliances, such as IBM ProtecTIER, to reduce their storage
capacity requirements. Current solutions that focus on secondary backup data only partially address the cost
of hardware acquisition, the power consumed by more storage devices and the floor space requirements of
increased storage capacity. While they reduce the requirements for power/cooling, staff resources and
licensing costs, they don’t fully remove them.
Data deduplication, depending on the method used, analyzes data and looks for files or blocks of data that are the same. When two or more files or blocks match, the system sets a pointer to a single stored copy rather than storing the data multiple times. Deduplication provides the greatest benefit where there is a significant amount of redundant data. User home directories, email systems that store a copy of a message in each recipient's mailbox, and farms of similar virtual servers, all of which typically contain multiple instances of duplicate data, are prime candidates for deduplication. Deduplication generally provides less benefit with structured data, such as SQL databases, which are typically normalized to contain minimal redundant data.
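The pointer mechanism can be sketched in a few lines of Python. This is a minimal fixed-block illustration; production systems use content-defined chunking, persistent indexes and collision handling, none of which are shown here:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks; real systems may chunk variably

def deduplicate(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into blocks and store each unique block once.

    Returns the list of block digests ("pointers") from which the
    original data can be reassembled out of the shared store.
    """
    pointers = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # duplicate blocks are stored once
        pointers.append(digest)
    return pointers

def rehydrate(pointers: list[str], store: dict[str, bytes]) -> bytes:
    """Reassemble the original data from its pointers."""
    return b"".join(store[d] for d in pointers)
```

Two identical email attachments passed through deduplicate() would share every entry in store, consuming the space of one copy plus pointers.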
Deduplication can also have a negative impact on backup and recovery performance, since data must typically be "rehydrated," or un-deduplicated, during recovery, which requires additional CPU time. Post-processing deduplication, which is done after data is backed up to disk, postpones processing until CPU cycles are available, making the effect on performance less noticeable. Post-processing deduplication, however, must use disk space to hold pending transactions, which again does not help with optimizing storage efficiency. Also, backup systems that employ deduplication are not very effective at deduplicating files that were compressed using traditional methods, which limits the value of backup deduplication.
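The last point is easy to demonstrate. In this sketch (generic zlib and SHA-256, not any specific backup product), two inputs that differ by a single byte share nearly all of their raw blocks but essentially none of their compressed blocks, because whole-file compression perturbs the entire output stream:

```python
import hashlib
import zlib

BLOCK_SIZE = 4096

def block_digests(data: bytes) -> set[str]:
    """Digest each fixed-size block, as a dedup engine would."""
    return {hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)}

# Two documents that differ by a single byte near the front.
doc_a = b"header " + b"shared payload " * 10_000
doc_b = b"Header " + b"shared payload " * 10_000

# Raw: almost every block is shared, so dedup works well.
raw_shared = block_digests(doc_a) & block_digests(doc_b)

# Compressed first: the one-byte change alters the whole
# compressed stream, so the block sets barely overlap.
zip_shared = block_digests(zlib.compress(doc_a)) & block_digests(zlib.compress(doc_b))

print(len(raw_shared), len(zip_shared))  # many shared blocks vs. few or none
```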
Both of these approaches to deduplication overlook the simple answer. Deduplicating only secondary backup data solves only a small part of the storage capacity issue. These approaches ignore the effect of data reduction on primary storage, before the data is even backed up, archived and replicated, which is where it would have the greatest effect on storage capacity.
Final thoughts on data compression and deduplication
Data compression and deduplication, which have been very effective at reducing capacity requirements for secondary backup data and at cutting hardware, cooling and floor space costs, have also been deployed to optimize primary storage, usually at the storage array itself. Using traditional compression and deduplication techniques on primary data can be problematic due to the potential negative performance impact and, especially in the case of deduplication, the effect on backup performance.
IBM Real-time Compression
Unlike traditional compression, where data is written to disk and then compressed, IBM Real-time Compression compresses data in-line, before it is written to disk. The IBM Real-time Compression technology is deployed on an STN6500 (for 1Gb networks) or an STN6800 (for 10Gb networks) appliance that sits between a network switch and a NAS array to compress primary data. By compressing data before it arrives at the array, an IBM Real-time Compression Appliance can provide a primary storage reduction of up to 80%, depending on the types of data being compressed, without impacting performance or other operations. It compresses the data, leaving the metadata (file permissions, Access Control Lists, ownership information, etc.) intact when stored on the storage array. The storage array, and not the appliance, then returns the write commit information to the application. No changes are required to servers, storage arrays, applications or downstream processes such as backup, archiving, deduplication, snapshots or replication.
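Conceptually, the write path looks like the following toy Python sketch. The class name, structure and use of zlib are illustrative assumptions, not IBM's implementation:

```python
import zlib

class InlineCompressingWriter:
    """Toy write path that compresses data before it reaches storage.

    Stands in for an in-line appliance: the application calls write(),
    and only compressed bytes ever land on the backing store.
    """

    def __init__(self, backing_store):
        self.backing_store = backing_store   # e.g., an open binary file

    def write(self, payload: bytes) -> int:
        compressed = zlib.compress(payload, level=1)  # fast, low-latency level
        written = self.backing_store.write(compressed)
        # The write is acknowledged only after the store accepts it,
        # mirroring the paper's point that the array, not the
        # appliance, returns the write commit to the application.
        return written
```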
Integral to IBM's Real-time Compression is the IBM Random Access Compression Engine (RACE). RACE, which is based on 35 patents, allows real-time, random access compression without performance degradation. IBM's RACE uses standard LZ compression algorithms, and compression is performed using random access techniques.

Read and write operations only need to access the blocks of the compressed file that must be read or written, rather than decompressing and recompressing the entire file. This technique dramatically improves both read and write operations. In addition, since less data is being written to the storage array, there is less I/O, and with less I/O also come more CPU cycles to process the given read and write requests.

Further, and perhaps most importantly, by compressing data in front of the storage array, a net increase in effective cache size is achieved. Whatever the compression ratio is for your data, that ratio carries over to your storage cache: if your data is compressible by 3:1, IBM Real-time Compression provides the equivalent of tripling the size of your storage cache. Since cache is one of the most expensive components of a storage array, and since cache tends to have the biggest impact on storage performance, the more you can increase cache, the better the performance users and applications will see.

Polycom
"We deal with the growth of data every day. Polycom has a lot of products, and all of them require multiple versions that we have to store and back up indefinitely," says Amit Bar On, IT manager for Polycom. "IBM's Real-time Compression Appliance helps us to manage the data growth more efficiently. We are now less concerned about storage capacity than we ever were before, and at the same time saving on costs."
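RACE's internals are proprietary, but the general shape of random-access compression can be sketched as follows. This is a minimal Python illustration with an arbitrary segment size, using generic zlib in place of RACE's algorithms:

```python
import zlib

SEGMENT = 32 * 1024  # independent compression segments; size is illustrative

def compress_random_access(data: bytes):
    """Compress data as independent segments plus an offset map.

    Because each segment is self-contained, a read touching one
    region decompresses only that segment, not the whole file.
    """
    segments, index = [], []
    physical = 0
    for i in range(0, len(data), SEGMENT):
        blob = zlib.compress(data[i:i + SEGMENT])
        segments.append(blob)
        index.append((i, physical, len(blob)))  # (logical off, physical off, length)
        physical += len(blob)
    return b"".join(segments), index

def read_at(packed: bytes, index, logical_offset: int) -> bytes:
    """Return the decompressed segment covering logical_offset."""
    for logical, physical, length in index:
        if logical <= logical_offset < logical + SEGMENT:
            return zlib.decompress(packed[physical:physical + length])
    raise IndexError("offset out of range")
```

The same arithmetic explains the cache benefit described above: a cache holding compressed segments effectively holds three times the logical data at a 3:1 ratio.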
IBM Real-time Compression also allows downstream operations, such as backup, deduplication and snapshots, to function optimally without the need to decompress the data before the downstream operation processes it. Because data can be effectively processed (backed up, deduplicated, mirrored, replicated, etc.) in its compressed state, both processing time and storage requirements are significantly reduced. IBM Real-time Compression is designed to optimize both primary and active secondary storage.
The net effect of IBM Real-time Compression is a reduced data footprint throughout the data lifecycle. Since data is compressed on primary storage, its benefits cascade forward, requiring fewer resources, including storage, network bandwidth, power, cooling, floor space, staffing and backup resources.

Ben-Gurion University
"In the past three years we have continued to see an exponential growth rate in data storage requirements. We've been amazed by the amount of compression that we can achieve by using the IBM Real-time Compression Appliance."

Compression Accelerator
The IBM Real-time Compression technology also includes a Web-based utility called the Compression Accelerator, which non-disruptively compresses data already stored on disk as a background task. The Compression Accelerator is a high-performance, intelligent software application running on the IBM Real-time Compression Appliance that, by policy, allows users to compress data that has already been saved to disk while that data remains online and accessible to applications and end users. The policies allow users to throttle how uncompressed data gets compressed so as not to impact existing storage performance, giving granular control over background compression tasks. The appliance's ability to transparently compress already-stored data significantly enhances and accelerates the benefit to end users and increases their ROI by freeing up to 80% of used capacity for new workloads. With the Compression Accelerator running in the background, users can reclaim an average of 20TB of existing storage capacity every 24 hours.
How Real-time Compression differs from traditional compression
With traditional compression, in order to modify a file, the file must be decompressed, edited, then recompressed into a new file. If data is inserted, all data blocks after the insertion point are either shifted or modified (see Figure 1, Compression Techniques and File Modification). This creates a negative impact on any downstream deduplication process. With IBM Real-time Compression, an edit only affects the block being edited. If new data is inserted, IBM Real-time Compression can add the new data and then use a data map to locate it without rewriting the entire file. This approach creates minimal impact on downstream deduplication and similar processes.
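Continuing the earlier segment sketch (again illustrative Python with generic zlib, not IBM's data map format), an in-place overwrite under segment-based compression touches exactly one segment, where a whole-file format would have to decompress and recompress everything:

```python
import zlib

SEGMENT = 32 * 1024  # illustrative segment size, as in the earlier sketch

def edit_segment(segments: list[bytes], seg_no: int,
                 offset_in_segment: int, new_bytes: bytes) -> None:
    """Overwrite data inside one compressed segment.

    Only the touched segment is decompressed, patched and
    recompressed; every other segment, and the file as a whole,
    is left untouched.
    """
    plain = bytearray(zlib.decompress(segments[seg_no]))
    plain[offset_in_segment:offset_in_segment + len(new_bytes)] = new_bytes
    segments[seg_no] = zlib.compress(bytes(plain))
```

Because the untouched segments are byte-for-byte identical before and after the edit, a downstream deduplication engine still recognizes them.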
Figure 1. Compression Techniques and File Modification
IBM Real-time Compression combined with deduplication
Studies have shown that the combination of IBM Real-time Compression and downstream deduplication can provide significant reductions in storage requirements beyond what each approach can achieve on its own. Because the compression appliance is transparent to the network, servers, storage devices and applications, the implementation is non-intrusive and does not require system, application or process modifications. IBM Real-time Compression works transparently with, and optimizes, IBM ProtecTIER, NetApp, EMC Data Domain, Celerra and VNX, and other storage and deduplication environments. With less primary data being written to disk, there is less data to deduplicate.
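Why the two techniques compose can be seen by joining the earlier sketches (illustrative Python only, under the assumption of independent fixed segments): identical segments always compress to identical blobs, so downstream deduplication keeps finding duplicates, and it now has a smaller stream to scan:

```python
import hashlib
import zlib

SEGMENT = 32 * 1024  # illustrative, matching the earlier sketches

def compress_then_dedup(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Compress independent segments, then deduplicate the results.

    Because each segment is compressed in isolation, identical
    segments yield identical compressed blobs, so the dedup step
    still finds the duplicates while handling less data overall.
    """
    pointers = []
    for i in range(0, len(data), SEGMENT):
        blob = zlib.compress(data[i:i + SEGMENT])
        digest = hashlib.sha256(blob).hexdigest()
        store.setdefault(digest, blob)   # shared blobs stored once
        pointers.append(digest)
    return pointers
```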
When IBM Real-time Compression was combined with IBM ProtecTIER, test results showed an 82% savings in initial storage and a 96% overall data reduction. Backup time was reduced by 71%, and lower CPU utilization and lower disk activity were seen on the ProtecTIER deduplication engine. When data was compressed with IBM Real-time Compression and then fed through an EMC Data Domain deduplication appliance, results indicated a 40% improvement in capacity, a 72% reduction in backup time and significant reductions in CPU cycles (72%), disk activity (67%) and network traffic (77%).
Benefits
To recap, use of IBM's Real-time Compression can provide as much as a 5x gain in storage efficiency and delivers these benefits:

- Reduced storage costs. With compression rates of up to 80%, the costs for storing a given amount of raw data are substantially reduced. With an average compression rate of 65%, 3TB of data can be stored on roughly 1TB of disk. This reduction in data stored applies not only to primary storage, but to backups and archives as well.
- Reduced CAPEX/OPEX. Storage hardware requirements are effectively reduced, as are costs for power, cooling, staffing, and floor space build-out and leasing.
- Transparent fit into your storage environment, requiring no changes to any of your existing processes.
- Reduced data size, meaning less LAN and WAN traffic and faster disk reads and writes, reducing data bottlenecks.
- Meeting Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs). RPOs and RTOs can more easily be met, since IBM Real-time Compression reduces both the volume of data to be restored, compared with raw, uncompressed data, and the time required to restore that data. Studies show a 3.5x decrease in RTOs.
- Improved backup and restore performance, evidenced by 6.6x faster backups.
- Lowered backup costs. Less data to back up can reduce the requirements for additional tape libraries, backup software licenses (as much as 2x fewer licenses), staffing and backup media.
- Faster replication, as much as 3.3x faster.
- Reduced data footprint throughout the data lifecycle. Since data is compressed on primary storage, its benefits cascade throughout the entire data lifecycle, requiring fewer resources, including storage and network bandwidth, and associated management costs.

Snowball VFX
"We simply couldn't create or maintain the amount of data we need without the IBM Real-time Compression solution," said Yoni Cohen, Founder, Snowball VFX. "Without data compression we would have needed twice the amount of disks and twice the amount of storage systems. With IBM Real-time Compression, we can buy a smaller storage system, but maintain the same capacity and performance as a larger, more expensive system. The IBM Real-time Compression Appliance enables us to stay competitive and continue to deliver higher quality animation and effects to our customers at a unique price point in our industry."
SSG-NOW Assessment
The addition of primary data compression capabilities is an important step in an enterprise's storage efficiency strategy. IBM Real-time Compression Appliances provide a significant reduction in primary data, which in turn affects storage capacity requirements, downstream processes such as backup and recovery, and operating expenses. By providing seamless, easy-to-deploy real-time compression appliances, IBM has brought the advantages of real-time compression to a new level of convenience for a broad range of organizations. Transparent real-time compression, particularly when processes are not impacted by additional compute time, should be considered by organizations of all sizes. To learn more about IBM Real-time Compression Appliances, go to www.ibm.com/storage/rtc.
TSL03060‐USEN‐00
Copyright © 2011, Storage Strategies NOW, Inc. All Rights Reserved.