In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying unnecessary information and removing it. The process of reducing the size of a data file is popularly referred to as data compression, although its formal name is source coding (coding done at the source of the data before it is stored or transmitted).

Compression is useful because it helps reduce resource usage, such as data storage space or transmission capacity. Because compressed data must be decompressed before it can be used, this extra processing imposes computational or other costs; compression is far from a free lunch. Data compression is subject to a space-time complexity trade-off. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed, and the option to decompress the video in full before watching it may be inconvenient or require additional storage. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (e.g., when using lossy data compression), and the computational resources required to compress and decompress the data.
In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy, and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.

This type of deduplication is different from that performed by standard file-compression algorithms, such as LZ77 and LZ78. Whereas these algorithms identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections (such as entire files or large sections of files) that are identical, in order to store only one copy of each. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
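The chunk-and-reference process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fixed-size chunks, SHA-256 digests used as the references, and an in-memory dictionary as the chunk store. The names `deduplicate`, `restore`, and `dedup_ratio` are hypothetical and do not belong to any product.

```python
import hashlib


def deduplicate(blobs, chunk_size=4096):
    """Store each distinct chunk once; describe every blob as a list of chunk digests."""
    store = {}        # digest -> the single stored copy of that chunk
    references = []   # one list of digests per input blob
    for blob in blobs:
        refs = []
        for start in range(0, len(blob), chunk_size):
            chunk = blob[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy seen
            refs.append(digest)
        references.append(refs)
    return store, references


def restore(store, refs):
    """Rebuild one blob from its list of chunk references."""
    return b"".join(store[digest] for digest in refs)


def dedup_ratio(blobs, store):
    """Logical bytes divided by stored bytes (the size of the references is ignored)."""
    logical = sum(len(blob) for blob in blobs)
    physical = sum(len(chunk) for chunk in store.values())
    return logical / physical if physical else 1.0


# Illustrative data: 100 identical copies of one 16 KB "attachment".
attachment = b"".join(i.to_bytes(4, "big") for i in range(4096))
copies = [attachment] * 100
chunk_store, chunk_refs = deduplicate(copies)
print(len(chunk_store), dedup_ratio(copies, chunk_store))
```

Because the ratio ignores the space taken by the references themselves, it slightly overstates the real saving, which is why the text says "roughly" 100 to 1.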
SQL Server 2008 provides two levels of data compression: row compression and page compression. Row compression helps store data more efficiently in a row by storing fixed-length data types in a variable-length storage format. A compressed row uses 4 bits per compressed column to store the length of the data in the column. NULL and 0 values across all data types take no additional space other than these 4 bits.

Page compression is a superset of row compression. In addition to storing data efficiently inside a row, page compression optimizes storage of multiple rows in a page by minimizing data redundancy. Page compression uses prefix compression and dictionary compression. Prefix compression looks for common patterns at the beginning of the column values in a given column across all rows on each page. Dictionary compression looks for exact value matches across all columns and rows on each page. Both dictionary and prefix compression are type-agnostic and see every column value as a bag of bytes.

The data compression feature is available in the Enterprise and Developer editions of SQL Server 2008. Databases with compressed tables or indexes cannot be restored, attached, or in any way used on other editions. To determine whether a database is using compression, query the dynamic management view (DMV) sys.dm_db_persisted_sku_features. To determine what is compressed, and how (row or page), query the data_compression_desc column in the catalog view sys.partitions.
Enabling compression only changes the physical storage format of the data that is associated with a data type but not its syntax or semantics. Application changes are not required when one or more tables are enabled for compression. The new record storage format has the following main changes:
- It reduces the metadata overhead that is associated with the record. This metadata is information about columns, their lengths and offsets. In some cases, the metadata overhead might be larger than in the old storage format.
- It uses variable-length storage format for numeric types (for example integer, decimal, and float) and the types that are based on numeric (for example datetime and money).
- It stores fixed character strings by using variable-length format by not storing the blank characters.
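To make the variable-length idea concrete, here is a rough Python sketch of how much space a row of integers might need when each column stores only the bytes its value requires plus a 4-bit length field. It is an assumption-laden illustration, not SQL Server's actual on-disk record layout; `min_bytes`, `row_size_fixed`, `row_size_variable`, and `char_bytes` are hypothetical helper names.

```python
def min_bytes(value):
    """Smallest number of bytes that holds a signed integer; 0 needs no data bytes."""
    if value == 0:
        return 0
    length = 1
    while True:
        limit = 1 << (8 * length - 1)   # 128 for one byte, 32768 for two, ...
        if -limit <= value < limit:
            return length
        length += 1


def row_size_fixed(values, width=4):
    """Bytes used when every column is a fixed-width integer (4 bytes assumed)."""
    return len(values) * width


def row_size_variable(values):
    """Bytes used with a 4-bit length per column plus only the data bytes needed.

    NULL (None) and 0 store no data bytes, only the 4-bit length field.
    """
    bits = sum(4 + 8 * (0 if v is None else min_bytes(v)) for v in values)
    return (bits + 7) // 8   # round up to whole bytes


def char_bytes(value):
    """Bytes stored for a fixed-length character value once trailing blanks are dropped."""
    return len(value.rstrip(" "))


row = [1, 0, None, 300]
print(row_size_fixed(row), row_size_variable(row), char_bytes("abc   "))
```

The sketch also shows why values that already need every byte of their data type gain nothing: the data bytes stay the same and the length field is added on top.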
For each page that is being compressed, prefix compression uses the following steps:
1. For each column, a value is identified that can be used to reduce the storage space for the values in each column.
2. A row that represents the prefix values for each column is created and stored in the compression information (CI) structure that immediately follows the page header.
3. The repeated prefix values in the column are replaced by a reference to the corresponding prefix. If the value in a row does not exactly match the selected prefix value, a partial match can still be indicated.

After prefix compression has been completed, dictionary compression is applied. Dictionary compression searches for repeated values anywhere on the page, and stores them in the CI area. Unlike prefix compression, dictionary compression is not restricted to one column. Dictionary compression can replace repeated values that occur anywhere on a page. The following illustration shows the same page after dictionary compression.
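The two page-level techniques can be sketched in Python as follows. This is a simplified model, not the storage engine's algorithm: the rule for picking the prefix value (the candidate with the largest total shared prefix length) is an assumption, the dictionary step is shown on raw values rather than on already prefix-compressed ones, and all function names are hypothetical.

```python
from collections import Counter


def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_compress(column):
    """Return (anchor, encoded); each value becomes (shared_length, remaining_suffix).

    Assumes a non-empty column. The anchor plays the role of the prefix value
    kept in the CI structure; partial matches are expressed by shared_length.
    """
    anchor = max(column, key=lambda cand: sum(shared_prefix_len(cand, v) for v in column))
    encoded = []
    for value in column:
        n = shared_prefix_len(anchor, value)
        encoded.append((n, value[n:]))
    return anchor, encoded


def prefix_decompress(anchor, encoded):
    """Rebuild the original column values from the anchor and the encoded pairs."""
    return [anchor[:n] + suffix for n, suffix in encoded]


def dictionary_compress(page):
    """Replace values that repeat anywhere on the page with an index into a dictionary."""
    counts = Counter(value for row in page for value in row)
    dictionary = [value for value, count in counts.items() if count > 1]
    index = {value: i for i, value in enumerate(dictionary)}
    encoded = [[("ref", index[v]) if v in index else ("raw", v) for v in row] for row in page]
    return dictionary, encoded


anchor, encoded_column = prefix_compress(["aaabb", "aaabcc", "abcd"])
print(anchor, encoded_column)
print(dictionary_compress([["x", "y"], ["y", "z"], ["x", "q"]]))
```

Note that the dictionary step looks across all columns of the page, while the prefix step works one column at a time, mirroring the distinction drawn in the text.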
For compressed backups, the size of the final backup file depends on how compressible the data is, and this is unknown before the backup operation finishes. Therefore, by default, when backing up a database using compression, the Database Engine uses a pre-allocation algorithm for the backup file. This algorithm pre-allocates a predefined percentage of the size of the database for the backup file. If more space is needed during the backup operation, the Database Engine grows the file. If the final size is less than the allocated space, at the end of the backup operation, the Database Engine shrinks the file to the actual final size of the backup.

To allow the backup file to grow only as needed to reach its final size, use trace flag 3042. Trace flag 3042 causes the backup operation to bypass the default backup compression pre-allocation algorithm. This trace flag is useful if you need to save on space by allocating only the actual size required for the compressed backup. However, using this trace flag might cause a slight performance penalty (a possible increase in the duration of the backup operation).
A table or a partition can have three allocation units: IN_ROW_DATA, LOB_DATA, and ROW_OVERFLOW_DATA. The data stored in LOB_DATA and ROW_OVERFLOW_DATA allocation units is not compressed. Only the data that is stored in the IN_ROW_DATA allocation unit is compressed. Use Appendix B to understand how much data is stored in each of these three allocation units.

FILESTREAM data is stored outside the database in a FILESTREAM data container on an NTFS volume. This data is not compressed.
Compressed pages are persisted as compressed on disk and stay compressed when read into memory. Data is decompressed (not the entire page, but only the data values of interest) when it meets one of the following conditions:
- It is read for filtering, sorting, or joining as part of a query response.
- It is updated by an application.

There is no in-memory, decompressed copy of the compressed page. Decompressing data consumes CPU. However, because compressed data uses fewer data pages, it also saves:
- Physical I/O: Because physical I/O is expensive from a workload perspective, reduced physical I/O often results in a bigger saving than the additional CPU cost to compress and decompress the data. Note that physical I/O is saved both because a smaller volume of data is read from or written to disk, and because more data can remain cached in buffer pool memory.
- Logical I/O (if data is in memory): Because logical I/O consumes CPU, reduced logical I/O can sometimes compensate for the CPU cost to compress and decompress the data.
Here is an example of how a customer used these measurements to decide which tables to page-compress. The customer had an OLTP database running on a server with average CPU utilization of approximately 20 percent. The large amount of available CPU, the significant amount of planned database growth, and the expense of storage provided motivation for data compression. The customer computed space savings and the S and U measurements (S is the percentage of scan operations on a table, and U is the percentage of update operations) for the largest tables in the database, and targeted tables with S greater than 75 percent and U less than 20 percent for page compression. Table 1 shows the estimated row and page savings, the values of S and U, and the decision as to whether to row or page compress.

Based on the metrics shown in Table 1, the customer decided to page-compress tables T2, T5, T6, T7, T8, and T10. All other tables in the database were row-compressed. Following this plan, the customer achieved 50 percent space savings and an approximately 10 percent increase in CPU utilization.
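The customer's rule of thumb is easy to express in code. The sketch below assumes S is the percentage of scan operations and U the percentage of update operations for a table; the thresholds come from the example above, while the `TableStats` type, the `choose_compression` function, and the sample figures are hypothetical and are not the values from Table 1.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    scan_pct: float     # S: share of operations that are scans
    update_pct: float   # U: share of operations that are updates


def choose_compression(table, min_scan=75.0, max_update=20.0):
    """Page-compress mostly-scanned, rarely-updated tables; row-compress the rest."""
    if table.scan_pct > min_scan and table.update_pct < max_update:
        return "PAGE"
    return "ROW"


# Made-up measurements, for illustration only.
candidates = [
    TableStats("A", scan_pct=90.0, update_pct=5.0),
    TableStats("B", scan_pct=90.0, update_pct=30.0),
    TableStats("C", scan_pct=50.0, update_pct=5.0),
]
for stats in candidates:
    print(stats.name, choose_compression(stats))
```

Keeping the thresholds as parameters makes it simple to test how sensitive the plan is to the cutoffs before any table is rebuilt.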
Workspace Required in the User Database
In the user database, free workspace is required for the following:
- The compressed table or index
- The mapping index, if you are compressing a heap or a clustered index with the ONLINE option set to ON and the SORT_IN_TEMPDB option set to OFF. (The recommendation is to set SORT_IN_TEMPDB to ON. The workspace requirement for the mapping index is discussed in the tempdb section.)

While a table is being compressed, both the uncompressed table and the compressed table exist together until the compression is successful and committed. After the table or the index is compressed, the uncompressed table is dropped, and the space is released to the filegroup. To estimate the size of the compressed table, use the output of sp_estimate_data_compression_savings.

Transaction Log Space
The amount of transaction log space needed depends on whether ONLINE is set to ON or OFF, and the recovery model used (full, bulk-logged, or simple).

Workspace Required in tempdb
In the tempdb database, free workspace is required if ONLINE is set to ON:
- For the mapping index, an internal structure used to map old bookmarks to new bookmarks, enabling concurrent DML transactions. This is stored in tempdb if SORT_IN_TEMPDB is set to ON.
- For the version store. This is only used if there are concurrent DML operations. Size depends upon the volume of ongoing modifications and the duration of long-running DML transactions.
Side Effects of Compressing a Table or Index
When you compress a table or an index, you should be aware of two side effects:
- Compression includes a rebuild, thus removing fragmentation from the table or index.
- When a heap is compressed, if there are any nonclustered indexes on the heap, they are rebuilt as follows:
  - With ONLINE set to OFF, the nonclustered indexes are rebuilt one by one.
  - With ONLINE set to ON, all the nonclustered indexes are rebuilt simultaneously.

You must account for the workspace required to rebuild the nonclustered indexes, because the space for the uncompressed heap is not released until the rebuild of the nonclustered indexes is complete.
* The resulting row-compressed pages can be page-compressed by running a heap rebuild with page compression.
** With page compression, all the pages in the table may not actually be page-compressed. A page is page-compressed only if the space savings on that page exceed an internally defined threshold.

Updating or Deleting Compressed Rows
All updates to the rows in a row-compressed table or partition will maintain the rows in row-compressed format. Not every update to the rows in a page-compressed table or partition will cause the column prefix and page dictionary to be recomputed. When the number of changes on a given page-compressed page exceeds an internally defined threshold, the column prefix and page dictionary are recomputed.
When an application manipulates (INSERT, UPDATE, DELETE, CREATE/REBUILD INDEX, and so on) data in a table, some supporting data structures may be created by SQL Server, which may temporarily hold a subset of the data. Some of these data structures are:
- Transaction log
- Mapping index
- Version store
- Sort pages

Whether or not these supporting structures are compressed when the data in a compressed table is manipulated depends upon the type of the data structure and the type of data compression used on the table. Table 4 summarizes the compression characteristics of the supporting data structures when a compressed table is manipulated.
In SQL Server, a columnstore index is data stored in column-wise fashion that can be used to answer a query just like data in any other type of index. A columnstore index appears as an index on a table when examining catalog views or the Object Explorer in Management Studio. The query optimizer considers the columnstore index as a data source for accessing data just like it considers other indexes when creating a query plan.

Column-wise storage allows much greater compression. There isn’t a lot of open discussion on the compression algorithm being used, but it is much more intensive than page compression. In this case we are limited to non-updateable indexes right now. This is best used for warehouse-type queries that return a lot of rows.
What data types cannot be used in a columnstore index?
The following data types cannot be used in a columnstore index: decimal or numeric with precision > 18, datetimeoffset with precision > 2, binary, varbinary, image, text, ntext, varchar(max), nvarchar(max), cursor, hierarchyid, timestamp, uniqueidentifier, sql_variant, xml.
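As a quick way to screen a column list against these restrictions, the list above can be turned into a small lookup. This is a hypothetical helper written for illustration (the name `columnstore_allows` is an assumption), it encodes only the rules stated above, and it expects the caller to supply the declared precision for decimal, numeric, and datetimeoffset columns.

```python
# Types listed above as never usable in a columnstore index.
UNSUPPORTED_TYPES = {
    "binary", "varbinary", "image", "text", "ntext",
    "varchar(max)", "nvarchar(max)", "cursor", "hierarchyid",
    "timestamp", "uniqueidentifier", "sql_variant", "xml",
}

# Types that are usable only up to the listed precision.
MAX_PRECISION = {"decimal": 18, "numeric": 18, "datetimeoffset": 2}


def columnstore_allows(type_name, precision=None):
    """Return True if the type may appear in a columnstore index per the list above."""
    name = type_name.strip().lower()
    if name in UNSUPPORTED_TYPES:
        return False
    if name in MAX_PRECISION:
        if precision is None:
            raise ValueError(f"precision is required for {name}")
        return precision <= MAX_PRECISION[name]
    return True


print(columnstore_allows("int"), columnstore_allows("decimal", 19))
```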
Accelerating Database Performance Using Compression
Accelerating Database Performance Using Compression
Joey D’Antoni
Philadelphia SQL Server Users Group
08 May 2013
About Me
- Principal Architect SQL Server, Comcast Cable
- @jdanton (Twitter)
- Joedantoni.wordpress.com
- firstname.lastname@example.org
Overview
- Compression: What Does It Really Mean?
- Deduplication: What Is That?
- What Should I Compress?
- Columnstore: How Is It Different?
Deduplication
- Specialized compression to eliminate duplicate copies of repeating data
- Real example: in VMware, you may have 10 copies of Windows running on the same physical machine. Memory blocks (common .dlls, for example) may be deduplicated.
- Backups
Benefits of Compression
- Faster performance on selects
- Less I/O is required to return data
- Better Space Utilization on Disk
- More Pages In Memory
Expenses of Compression
- Added CPU Cycles
- Slower bulk inserts and updates*
Backup Compression
- In all editions of SQL Server, starting with 2008 R2
- Always use Backup Compression (even when your storage team says no)
- Space is by default pre-allocated for estimated size of uncompressed backup
- Trace Flag 3042
How Does Compression Work?
- Storage Engine compresses and decompresses data
- No other parts of SQL Server need to understand compression
- Application code doesn’t need to change
So What To Compress
- SQL Data Compression gives a great deal of flexibility
- Can compress tables, indexes and/or partitions
- Can use different methods of compression
- How to Decide?
Space Savings
- sp_estimate_data_compression_savings

What Won’t Compress Well
- Columns with numeric or fixed-length character data types where most values require all the bytes allocated for the specific data type
- Not much repeating data
- Repeating data with non-repeating prefixes
- Data stored out of the row
- FILESTREAM data
Application Workloads
- Microsoft advises using Row Compression for EVERYTHING if you can spare 10% CPU. Don’t do this!
- Page Compression has higher overhead
- Be careful where you use page compression
What Happens with Compression
- Tables and indexes are rebuilt using ALTER TABLE … REBUILD and ALTER INDEX … REBUILD
- Requires workspace, CPU and I/O
- Same mechanism as rebuilding an index
- Free workspace required in:
  - User Database
  - Transaction Log
  - tempdb
How and When to Compress Data
- Online vs Offline
- Concurrent vs Serial
- Order of compressing: start small and work up
- SORT_IN_TEMPDB
Managing Inserts and Updates

Table organization: Heap
- Table compression setting ROW: The newly inserted row is row-compressed.
- Table compression setting PAGE: The newly inserted row is page-compressed:
  - if the new row goes to an existing page with page compression
  - if the new row is inserted through BULK INSERT with TABLOCK
  - if the new row is inserted through INSERT INTO ... (TABLOCK) SELECT ... FROM
  Otherwise, the row is row-compressed.*

Table organization: Clustered index
- Table compression setting ROW: The newly inserted row is row-compressed.
- Table compression setting PAGE: The newly inserted row is page-compressed if the new row goes to an existing page with page compression. Otherwise, it is row-compressed until the page fills up. Page compression is attempted before a page split.**