# All About Compression, Part I

by Ga'ash Soffer on January 7, 1999 2:57 PM EST

Introduction

Lossless Compression Techniques

The first types of compression I am going to discuss are lossless compression techniques. These compression techniques allow the uncompressor to recreate the original data byte for byte. These types of compression are used predominately in file compression (PKZip, RAR, ARJ, LHA, etc.), though some of them are used in various forms of data compression because they are relatively easy to implement and very effective on certain data sets. Perhaps the most commonly used data compression algorithm is Huffman coding.

Huffman Coding

Huffman coding is based upon a simple philosophy: The most common data should be noted with the least number of bits, while the least common data should be noted with the most number of bits. After a statistical analysis of the file (or block) is done, the data is stored in a binary tree with the most common elements as close to the root as possible. Then, in order to find the coding sequence, if we traverse to the left of the tree in search for our character, we add a 0, if we traverse to the right of the tree, we add a 1. Below is an example of Huffman Compression at work.

String to Compress: "aabacbacdb"

Statistics:

a - 4
b - 3
c - 2
d - 1

Tree generation:

(1) start with 4 trees with 1 node in each, then merge the two smallest trees, and insert the result into the new list, repeat.

4 4 4 10
3 3 6
2 3
1

The resulting probability tree would look like the one below: (Note that this tree is always "full" (each node has 0 or 2 children))

10
/ \
6 4(a)
/ \
3 3(b)
/ \
2(c) 1(d)

Now, in order to come up with the notation for each character. Traverse the tree until you hit the character you are looking for. While traversing, every time you go to the left node, mark a '0', otherwise, mark a '1'.

Using this, finding the coding pattern is trivial:

a = 1
b = 01
c = 000
d = 001

Our string above, "aabacbacdb" would convert into '1101100001100000101'. Here we have 19bits, instead of 8*10 = 80bits in the original message. Of course we also have to store the tree for uncompression (or the code for each character) but as soon as the block compressed reaches a size of 32K or so, this extra space used is negligible.

Optimizing Huffman Compression

The main method used to optimize Huffman Compression is to combine it with LZ compression (which will be discussed next). There are; however, ways to optimize Huffman coding itself. The tree algorithm shown here (the one for generating the coding for each letter) is the most optimal; however, in order for Huffman coding to be most efficient, we must have large gaps between the amount of 1 letter as opposed to the other. The ideal statistical distribution for Huffman compression with a significant number of letters present is a division by powers of 2, i.e. 1/2 of 1 letter, 1/4 of another, etc..) Since we know what kind off statistics we want, we can choose blocks which will best reflect this statistical distribution. Let's take the data below.

aaaabbbbccccdddd

If we were to compress this with a block size of 16, we would not be able to compress the data that much, since all the data is distributed equally. If we decided to make our block size 4, assuming that storing the tree (or equivalent Huffman code for each character) is negligible (which it is in most real world cases) a block size of 4 would allow us to code each character in the above in 1 bit.

Now, what if a static block size won't work? Well, if we are doing Huffman compression alone, it is very feasible (albeit slow, or fast and difficult) to calculate what will be the most efficient block size dynamically. This means that every block will have a different size, the size which is most efficient. According to Tim Sweeney's, lead programmer of EPIC's kick a-- Unreal, response to an e-mail I fired off to him yesterday:

"A few simple attempts gained an additional 20% reduction in compressed bits,
but that was offset by having to store multiple Huffman tables -- so the net
result was about 0% reduction."

Tim Sweeney goes on to mention that:

"I think this could potentially be better
than Huffman alone, but I think the gain will be limited to maybe 20-30%.
Most kinds of data (such as text, graphics, program code) don't seem to have
much more local coherence than global coherence. LZH schemes will probably
beat this approach consistently."

This comes to the next point I will discuss: possible ways to work with dynamic block size when doing Huffman + LZ77 (LZH) compression, along with an explanation of what LZ(H) compression is.

LZ Compression and more...