I remember when SQL Server 2008 was first released and I examined some of its features. When I came across compression, I was very impressed and thought “cool!”. Actually, I found it to be much more than that: friendly to use, easy to maintain, it saves storage space and, here is the best part, it can bring better performance (yes, query performance). But somehow I have neglected this feature, like lots of other DBAs, for no particular reason. In this post I’m going to present compression in SQL Server 2008 (available in the Enterprise edition only), explain how it works and show you a short demo.
So what exactly is compression?
Compression – the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through the use of specific encoding schemes. In other words, storing the same data in less space.
We are all familiar with compression: ZIP archives for any type of file, or using JPEG instead of BMP and MP3 instead of WAV. While some of those compression types affect the quality of the data being saved, others work on a static file that is compressed once and doesn’t change. Of course, SQL Server needs a custom solution that can compress data while it is being generated. SQL Server offers backup compression and data compression. In this post I’ll focus on data compression.
SQL Server offers two types of data compression: row compression and page compression – I’ll explain each one later on. So here are some questions regarding compression that you have probably asked yourself:
On what kind of data can compression be applied?
We can compress a table, an index or a specific partition. Data compression can be applied only during the creation or rebuild of a table, index or partition.
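For the record, here is more or less what that looks like in T-SQL. The table, index and partition names are just placeholders I made up for this example:

-- Create a new table as page compressed
CREATE TABLE dbo.Orders
(
    OrderId      INT          NOT NULL,
    OrderDate    DATETIME     NOT NULL,
    CustomerName VARCHAR(100) NOT NULL
)
WITH (DATA_COMPRESSION = PAGE);

-- Compress (or change the compression of) an existing table by rebuilding it
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);

-- Compress an existing index by rebuilding it
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Compress a single partition (assuming dbo.Orders is partitioned)
ALTER TABLE dbo.Orders REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = PAGE);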
How does it work?
Row compression – reduces the metadata overhead associated with the row. It uses a variable-length storage format for numeric types. NULL and 0 values across all data types are optimized and take no bytes.
Page compression – consists of 3 operations:
Row compression – the same as described earlier.
Prefix compression – for each column, a repeated prefix value that can reduce storage space is identified and stored in the CI (Compression Information) structure that follows the page header. For each value in the column, an attempt is made to match a partial or full prefix, and the matched portion is then replaced by a reference to the corresponding prefix. Have a look at this example from BOL. This is what an ordinary page looks like:
Now we will see the same page after prefix compression:
The compression algorithm found the longest prefix that fits each column (in the page), so each value now refers to the prefix and records how many characters it has in common with it. For example, in column 2 the longest prefix is aaaacc. The first value has the first 4 characters in common, so those 4 characters are replaced with the number 4; the second value has no characters in common, so it is prefixed with the number 0; and the last value is a perfect match, so no value is stored there at all.
Dictionary compression – occurs after prefix compression. Repeated values are searched for across the entire page and replaced by references that are stored in the CI area of the page.
If exact matches are found anywhere on the page, the matched value is written to the CI area and each occurrence is replaced with a reference to it. Remember that this doesn’t mean the replaced values were identical before compression; for instance, the value 4b appears in 2 columns, but it is not the same original value (since prefix compression changed it).
Why use this type of method to compress data?
When fetching a compressed page, we don’t want to wait a long time to read its uncompressed data, because that would mean the compression overhead is too much to bear. Microsoft needed to find a balance between a good compression method and the time it takes to compress and decompress the data. Considering the kind of data stored in tables and indexes – columns usually hold the same type of meaningful (not random) data and are sometimes sorted – prefix compression as a first step is a good idea.
How can compression bring better query performance?
It seems that when working with compression we have much more overhead: we need to compress every page, and when fetching it or looking for data inside it, we need to decompress it. So it seems that working with a compressed page takes longer. In fact it takes more CPU time, since the work of compressing and decompressing is done by the CPU. But compression saves storage space, meaning the same data can be stored in fewer pages, so fetching that data requires less I/O and hence usually less time. Today most systems are I/O bound and have plenty of free CPU, so in those kinds of systems using compression can bring better performance.
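By the way, if you are not sure whether your system is I/O bound, one rough (and certainly not conclusive) hint is the accumulated time spent waiting on data-page reads:

-- Accumulated waits on reading data pages from disk since the last restart
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC;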
What about queries that return a small result set?
Even if a query returns a small result set, it can still process a lot of data. For example, have a look at the attached script (Compression Demo): I created 2 tables with 2 million customers each, with randomly generated names based on the 20 most popular names in the US with some variations. One table is compressed (page compression) and the other is not. Both tables have indexes (one compressed, the other not).
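The attached script is the one to look at, but just to give you a feel for the setup, it goes roughly like this (the column names and sizes here are illustrative, not the exact script):

-- The uncompressed table and index
CREATE TABLE dbo.Customers
(
    CustomerId INT IDENTITY(1,1) NOT NULL,
    FirstName  VARCHAR(50)       NOT NULL,
    LastName   VARCHAR(50)       NOT NULL
);
CREATE NONCLUSTERED INDEX IX_Customers_Name ON dbo.Customers (FirstName, LastName);

-- The page compressed table and index
CREATE TABLE dbo.CustomersCompressed
(
    CustomerId INT IDENTITY(1,1) NOT NULL,
    FirstName  VARCHAR(50)       NOT NULL,
    LastName   VARCHAR(50)       NOT NULL
)
WITH (DATA_COMPRESSION = PAGE);
CREATE NONCLUSTERED INDEX IX_CustomersCompressed_Name
ON dbo.CustomersCompressed (FirstName, LastName)
WITH (DATA_COMPRESSION = PAGE);

-- ... both tables are then loaded with the same 2 million randomly generated rows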
I executed a simple query that should return several hundred results (the data is randomly generated).
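The measurement itself was along these lines (the WHERE clause below is just an illustration; the actual predicate is in the attached script):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- "CLEAN CACHE" in the output below means the buffer pool was emptied first
-- (don't do this on a production server)
DBCC DROPCLEANBUFFERS;

SELECT CustomerId, FirstName, LastName
FROM dbo.CustomersCompressed
WHERE FirstName = 'James' AND LastName LIKE 'Smi%';   -- hypothetical predicate

-- ... and then the same query a second time, now served from cache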
Examining this example and the two versions of the table and index (compressed and uncompressed), I can tell you that compression saved about 30%-40% of the uncompressed table and index size.
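You can easily check the numbers on your own copy with:

-- Reserved, data and index sizes of both tables
EXEC sp_spaceused 'dbo.Customers';
EXEC sp_spaceused 'dbo.CustomersCompressed';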
I have also gathered this data from one of the executions:
CLEAN CACHE
Table ‘CustomersCompressed’.
Scan count 6, logical reads 11942, physical reads 17, read-ahead reads 5917, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Second execution
Table ‘CustomersCompressed’.
Scan count 6, logical reads 11942, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
CPU time = 9953 ms, elapsed time = 5844 ms
CLEAN CACHE
Table ‘Customers’.
Scan count 6, logical reads 15830, physical reads 26, read-ahead reads 7847, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Second execution
Table ‘Customers’.
Scan count 6, logical reads 15834, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
As you can see, the compressed table and index saved about 25% of the logical reads!
Total execution time might be the same or slightly different, since it’s affected by your CPU and the load on your computer.
What about data already in memory?
Here is an interesting fact – when a compressed page is fetched from disk into memory, it stays compressed in memory. Is that a good thing or a bad thing? For some queries it might have a bad effect (queries that require lots of CPU), but remember that if pages are compressed in memory too, it means we can fit more pages in memory. As always, my advice to you is test, test and test again; only then will you know whether this feature helps you or hurts you.
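If you want to see how many buffer pool pages each table currently holds in memory (and therefore how much room the compressed copy really saves there), a query in this spirit can help. Treat it as a sketch – the join through sys.allocation_units is the usual technique, but adapt it to your own database:

-- Buffer pool pages currently cached per object in the current database
SELECT o.name AS table_name, COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au ON bd.allocation_unit_id = au.allocation_unit_id
JOIN sys.partitions AS p ON au.container_id = p.hobt_id
                        AND au.type IN (1, 3)   -- IN_ROW_DATA, ROW_OVERFLOW_DATA
JOIN sys.objects AS o ON p.object_id = o.object_id
WHERE bd.database_id = DB_ID()
GROUP BY o.name
ORDER BY cached_pages DESC;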
When does page compression occur?
When a new page is created in a compressed table or index, it is not automatically page compressed, but it is row compressed. As data is added to the page, all columns are evaluated for prefix and dictionary compression, and page compression occurs only if the space gained is significant. After the page is compressed, all new rows are compressed as well.
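You can actually observe this – a page compressed object in which only some of the pages ended up page compressed – through the compressed_page_count column of sys.dm_db_index_physical_stats (the table name below is just my demo table):

-- How many of the object's pages are actually page compressed
SELECT index_id, alloc_unit_type_desc, page_count, compressed_page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.CustomersCompressed'), NULL, NULL, 'DETAILED');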
How much space can page compression save?
You probably guessed – it depends! It depends on the type of data being compressed; for example, varchar tends to compress better than other data types. It depends on the values of the data being compressed; for example, an index with lots of repeated values will compress more efficiently. On the other hand, UNIQUEIDENTIFIER columns do not compress well, and encrypted columns will not be compressed at all. Of course, fragmentation and fill factor also affect the compression ratio. So it is hard to give an exact number – compression can save 10%-80% of the data size – but when used wisely on indexes and tables, my experience is that it saves roughly 30% of the original size in most cases. To evaluate compression savings you can use the sp_estimate_data_compression_savings stored procedure.
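For example, to estimate what page compression would save on a table (and all of its indexes) before actually rebuilding anything:

-- Estimate PAGE compression savings for the whole table
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'Customers',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';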
And the big question – To compress or not to compress?
The world isn’t black and white and, as always, it depends… If your database suffers from an I/O bottleneck and you have plenty of free CPU, then consider compressing your big tables and indexes. If your database suffers from a CPU bottleneck, then compression probably won’t be a good idea. If you have a big table with mainly insert operations, again, compression might not be the solution you seek. If you have plenty of memory and most of your queries are served from cache anyway, compression probably won’t help you get better performance (although it can still save storage space). But if you have a big database in which pages are constantly being read into and dropped from cache, and a big table that is used heavily by SELECT statements and contains varchar, datetime or int columns – you should definitely consider compressing that table or its indexes.
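When you start looking for candidates, it also helps to know which of your existing objects are already compressed and how big they are. A query in this spirit (a starting point, not a tuning tool) lists every user table and index partition with its current compression setting and size:

-- Current compression setting and size (in pages) of every user table/index partition
SELECT o.name AS table_name,
       i.name AS index_name,
       p.partition_number,
       p.data_compression_desc,
       ps.used_page_count
FROM sys.partitions AS p
JOIN sys.objects AS o ON p.object_id = o.object_id
JOIN sys.indexes AS i ON p.object_id = i.object_id AND p.index_id = i.index_id
JOIN sys.dm_db_partition_stats AS ps ON p.partition_id = ps.partition_id
WHERE o.is_ms_shipped = 0
ORDER BY ps.used_page_count DESC;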