If you have a scan of the 2/91 cover, please mail it to me! |
Arithmetic Coding + Statistical Modeling = Data CompressionPart 1 - Arithmetic Codingby Mark NelsonDr. Dobb's Journal February, 1991 |
This page contains my original text and figures for the article that appeared in the February, 1991 DDJ. The article was originally written to be a 2 part feature, but was cut down to 1 part. Links to Part 2 are here and at the end of the article.
Most of the data compression methods in common use today fall into one of two camps: dictionary based schemes and statistical methods. In the world of small systems, dictionary based data compression techniques seem to be more popular at this time. However, by combining arithmetic coding with powerful modeling techniques, statistical methods for data compression can actually achieve better performance. This two-part article discusses how to combine arithmetic coding with several different modeling methods to achieve some impressive compression ratios. The first part of the article details how arithmetic coding works. The second shows how to develop some effective models that can use an arithmetic coder to generate high performance compression programs.
Data compression operates in general by taking "symbols" from an input "text", processing them, and writing "codes" to a compressed file. For the purposes of this article, symbols are usually bytes, but they could just as easily be pixels, 80 bit floating point numbers, or EBCDIC characters. To be effective, a data compression scheme needs to be able to transform the compressed file back into an identical copy of the input text. Needless to say, it also helps if the compressed file is smaller than the input text!
Dictionary based compression systems operate by replacing groups of symbols in the input text with fixed length codes. A well-known example of a dictionary technique is LZW data compression. (See "LZW Data Compression" in the 10/89 issue of DDJ). LZW operates by replacing strings of essentially unlimited length with codes that usually range in size from 9 to 16 bits.
Statistical methods of data compression take a completely different approach. They operate by encoding symbols one at a time. The symbols are encoded into variable length output codes. The length of the output code varies based on the probability or frequency of the symbol. Low probability symbols are encoded using many bits, and high probability symbols are encoded using fewer bits.
In practice, the dividing line between statistical and dictionary
methods is not always so distinct. Some schemes can't be clearly
put in one camp or the other, and there are always hybrids which
use features from both techniques. However, the methods discussed
in this article use arithmetic coding to implement purely statistical
compression schemes.
It isn't enough to just be able to accurately calculate the probability of characters in a datastream, we also need a coding method that effectively takes advantage of that knowledge. Probably the best known coding method based on probability statistics is Huffman coding. D. A. Huffman published a paper in 1952 describing a method of creating a code table for a set of symbols given their probabilities. The Huffman code table was guaranteed to produce the lowest possible output bit count possible for the input stream of symbols, when using fixed length codes. Huffman called these "minimum redundancy codes", but the scheme is now universally called Huffman coding. Other fixed length coding systems, such as Shannon-Fano codes, were shown by Huffman to be non-optimal.
Huffman coding assigns an output code to each symbol, with the output codes being as short as 1 bit, or considerably longer than the input symbols, strictly depending on their probabilities. The optimal number of bits to be used for each symbol is the log base 2 of (1/p), where p is the probability of a given character. Thus, if the probability of a character is 1/256, such as would be found in a random byte stream, the optimal number of bits per character is log base 2 of 256, or 8. If the probability goes up to 1/2, the optimum number of bits needed to code the character would go down to 1.
The problem with this scheme lies in the fact that Huffman codes have to be an integral number of bits long. For example, if the probability of a character is 1/3, the optimum number of bits to code that character is around 1.6. The Huffman coding scheme has to assign either 1 or 2 bits to the code, and either choice leads to a longer compressed message than is theoretically possible.
This non-optimal coding becomes a noticeable problem when the probability of a character becomes very high. If a statistical method can be developed that can assign a 90% probability to a given character, the optimal code size would be 0.15 bits. The Huffman coding system would probably assign a 1 bit code to the symbol, which is 6 times longer than is necessary.
A second problem with Huffman coding comes about when trying to do adaptive data compression. When doing non-adaptive data compression, the compression program makes a single pass over the data to collect statistics. The data is then encoded using the statistics, which are unchanging throughout the encoding process.
In order for the decoder to decode the compressed data stream, it first needs a copy of the statistics. The encoder will typically prepend a statistics table to the compressed message, so the decoder can read it in before starting. This obviously adds a certain amount of excess baggage to the message.
When using very simple statistical models to compress the data, the encoding table tends to be rather small. For example, a simple frequency count of each character could be stored in as little as 256 bytes with fairly high accuracy. This wouldn't add significant length to any except the very smallest messages. However, in order to get better compression ratios, the statistical models by necessity have to grow in size. If the statistics for the message become too large, any improvement in compression ratios will be wiped out by added length needed to prepend them to the encoded message.
In order to bypass this problem, adaptive data compression is called for. In adaptive data compression, both the encoder and decoder start with their statistical model in the same state. Each of them process a single character at a time, and update their models after the character is read in. This is very similar to the way most dictionary based schemes such as LZW coding work. A slight amount of efficiency is lost by beginning the message with a non-optimal model, but it is usually more than made up for by not having to pass any statistics with the message.
The problem with combining adaptive modeling with Huffman coding is that rebuilding the Huffman tree is a very expensive process. For an adaptive scheme to be efficient, it is necessary to adjust the Huffman tree after every character. Algorithms to perform adaptive Huffman coding were not published until over 20 years after Huffman coding was first developed. Even now, the best adaptive Huffman coding algorithms are still relatively time and memory consuming.
It has only been in the last ten years that a respectable candidate to replace Huffman coding has been successfully demonstrated: Arithmetic coding. Arithmetic coding completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single floating point output number. The longer (and more complex) the message, the more bits are needed in the output number. It was not until recently that practical methods were found to implement this on computers with fixed sized registers.
The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to create the exact stream of symbols that went into its construction. In order to construct the output number, the symbols being encoded have to have a set probabilities assigned to them. For example, if I was going to encode the random message "BILL GATES", I would have a probability distribution that looks like this:
Character Probability --------- ----------- SPACE 1/10 A 1/10 B 1/10 E 1/10 G 1/10 I 1/10 L 2/10 S 1/10 T 1/10
Once the character probabilities are known, the individual symbols need to be assigned a range along a "probability line", which is nominally 0 to 1. It doesn't matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine character symbol set use here would look like this:
Character Probability Range --------- ----------- ----------- SPACE 1/10 0.00 - 0.10 A 1/10 0.10 - 0.20 B 1/10 0.20 - 0.30 E 1/10 0.30 - 0.40 G 1/10 0.40 - 0.50 I 1/10 0.50 - 0.60 L 2/10 0.60 - 0.80 S 1/10 0.80 - 0.90 T 1/10 0.90 - 1.00
Each character is assigned the portion of the 0-1 range that corresponds to its probability of appearance. Note also that the character "owns" everything up to, but not including the higher number. So the letter 'T' in fact has the range 0.90 - 0.9999....
The most significant portion of an arithmetic coded message belongs to the first symbol to be encoded. When encoding the message "BILL GATES", the first symbol is "B". In order for the first character to be decoded properly, the final coded message has to be a number greater than or equal to 0.20 and less than 0.30. What we do to encode this number is keep track of the range that this number could fall in. So after the first character is encoded, the low end for this range is 0.20 and the high end of the range is 0.30.
After the first character is encoded, we know that our range for our output number is now bounded by the low number and the high number. What happens during the rest of the encoding process is that each new symbol to be encoded will further restrict the possible range of the output number. The next character to be encoded, 'I', owns the range 0.50 through 0.60. If it was the first number in our message, we would set our low and high range values directly to those values. But 'I' is the second character. So what we do instead is say that 'I' owns the range that corresponds to 0.50-0.60 in the new subrange of 0.2 - 0.3. This means that the new encoded number will have to fall somewhere in the 50th to 60th percentile of the currently established range. Applying this logic will further restrict our number to the range 0.25 to 0.26.
The algorithm to accomplish this for a message of any length is is shown below:
Set low to 0.0 Set high to 1.0 While there are still input symbols do get an input symbol code_range = high - low. high = low + range*high_range(symbol) low = low + range*low_range(symbol) End of While output low
Following this process through to its natural conclusion with our chosen message looks like this:
New Character Low value High Value ------------- --------- ---------- 0.0 1.0 B 0.2 0.3 I 0.25 0.26 L 0.256 0.258 L 0.2572 0.2576 SPACE 0.25720 0.25724 G 0.257216 0.257220 A 0.2572164 0.2572168 T 0.25721676 0.2572168 E 0.257216772 0.257216776 S 0.2572167752 0.2572167756
So the final low value, 0.2572167752 will uniquely encode the message "BILL GATES" using our present encoding scheme.
Given this encoding scheme, it is relatively easy to see how the decoding process will operate. We find the first symbol in the message by seeing which symbol owns the code space that our encoded message falls in. Since the number 0.2572167752 falls between 0.2 and 0.3, we know that the first character must be "B". We then need to remove the "B" from the encoded number. Since we know the low and high ranges of B, we can remove their effects by reversing the process that put them in. First, we subtract the low value of B from the number, giving 0.0572167752. Then we divide by the range of B, which is 0.1. This gives a value of 0.572167752. We can then calculate where that lands, which is in the range of the next letter, "I".
The algorithm for decoding the incoming number looks like this:
get encoded number Do find symbol whose range straddles the encoded number output the symbol range = symbol low value - symbol high value subtract symbol low value from encoded number divide encoded number by range until no more symbols
Note that I have conveniently ignored the problem of how to decide when there are no more symbols left to decode. This can be handled by either encoding a special EOF symbol, or carrying the stream length along with the encoded message.
The decoding algorithm for the "BILL GATES" message will proceed something like this:
Encoded Number Output Symbol Low High Range -------------- ------------- --- ---- ----- 0.2572167752 B 0.2 0.3 0.1 0.572167752 I 0.5 0.6 0.1 0.72167752 L 0.6 0.8 0.2 0.6083876 L 0.6 0.8 0.2 0.041938 SPACE 0.0 0.1 0.1 0.41938 G 0.4 0.5 0.1 0.1938 A 0.2 0.3 0.1 0.938 T 0.9 1.0 0.1 0.38 E 0.3 0.4 0.1 0.8 S 0.8 0.9 0.1 0.0
In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined probability attached to that symbol. Decoding is the inverse procedure, where the range is expanded in proportion to the probability of each symbol as it is extracted.
The process of encoding and decoding a stream of symbols using arithmetic coding is not too complicated. But at first glance, it seems completely impractical. Most computers support floating point numbers of up to 80 bits or so. Does this mean you have to start over every time you finish encoding 10 or 15 symbols? Do you need a floating point processor? Can machines with different floating point formats communicate using arithmetic coding?
As it turns out, arithmetic coding is best accomplished using standard 16 bit and 32 bit integer math. No floating point math is required, nor would it help to use it. What is used instead is a an incremental transmission scheme, where fixed size integer state variables receive new bits in at the low end and shift them out the high end, forming a single number that can be as many bits long as are available on the computer's storage medium.
In the previous section, I showed how the algorithm works by keeping track of a high and low number that bracket the range of the possible output number. When the algorithm first starts up, the low number is set to 0.0, and the high number is set to 1.0. The first simplification made to work with integer math is to change the 1.0 to 0.999...., or .111... in binary.
In order to store these numbers in integer registers, we first justify them so the implied decimal point is on the left hand side of the word. Then we load as much of the initial high and low values as will fit into the word size we are working with. My implementation uses 16 bit unsigned math, so the initial value of high is 0xFFFF, and low is 0. We know that the high value continues with FFs forever, and low continues with 0s forever, so we can shift those extra bits in with impunity when they are needed.
If you imagine our "BILL GATES" example in a 5 digit register, the decimal equivalent of our setup would look like this:
HIGH: 99999 LOW: 00000
In order to find our new range numbers, we need to apply the encoding algorithm from the previous section. We first had to calculate the range between the low value and the high value. The difference between the two registers will be 100000, not 99999. This is because we assume the high register has an infinite number of 9's added on to it, so we need to increment the calculated difference. We then compute the new high value using the formula from the previous section:
high = low + high_range(symbol)
In this case the high range was .30, which gives a new value for high of 30000. Before storing the new value of high, we need to decrement it, once again because of the implied digits appended to the integer value. So the new value of high is 29999.
The calculation of low follows the same path, with a resulting new value of 20000. So now high and low look like this:
HIGH: 29999 (999...) LOW: 20000 (000...)
At this point, the most significant digits of high and low match. Due to the nature of our algorithm, high and low can continue to grow closer to one another without quite ever matching. This means that once they match in the most significant digit, that digit will never change. So we can now output that digit as the first digit of our encoded number. This is done by shifting both high and low left by one digit, and shifting in a 9 in the least significant digit of high. The equivalent operations are performed in binary in the C implementation of this algorithm.
As this process continues, high and low are continually growing closer together, then shifting digits out into the coded word. The process for our "BILL GATES" message looks like this:
HIGH LOW RANGE CUMULATIVE OUTPUT Initial state 99999 00000 100000 Encode B (0.2-0.3) 29999 20000 Shift out 2 99999 00000 100000 .2 Encode I (0.5-0.6) 59999 50000 .2 Shift out 5 99999 00000 100000 .25 Encode L (0.6-0.8) 79999 60000 20000 .25 Encode L (0.6-0.8) 75999 72000 .25 Shift out 7 59999 20000 40000 .257 Encode SPACE (0.0-0.1) 23999 20000 .257 Shift out 2 39999 00000 40000 .2572 Encode G (0.4-0.5) 19999 16000 .2572 Shift out 1 99999 60000 40000 .25721 Encode A (0.1-0.2) 67999 64000 .25721 Shift out 6 79999 40000 40000 .257216 Encode T (0.9-1.0) 79999 76000 .257216 Shift out 7 99999 60000 40000 .2572167 Encode E (0.3-0.4) 75999 72000 .2572167 Shift out 7 59999 20000 40000 .25721677 Encode S (0.8-0.9) 55999 52000 .25721677 Shift out 5 59999 20000 .257216775 Shift out 2 .2572167752 Shift out 0 .25721677520
Note that after all the letters have been accounted for, two extra digits need to be shifted out of either the high or low value to finish up the output word.
This scheme works well for incrementally encoding a message. There is enough accuracy retained during the double precision integer calculations to ensure that the message is accurately encoded. However, there is potential for a loss of precision under certain circumstances.
In the event that the encoded word has a string of 0s or 9s in it, the high and low values will slowly converge on a value, but may not see their most significant digits match immediately. For example, high and low may look like this:
High: 700004 Low: 699995
At this point, the calculated range is going to be only a single digit long, which means the output word will not have enough precision to be accurately encoded. Even worse, after a few more iterations, high and low could look like this:
High: 70000 Low: 69999
At this point, the values are permanently stuck. The range between high and low has become so small that any calculation will always return the same values. But, since the most significant digits of both words are not equal, the algorithm can't output the digit and shift. It seems like an impasse.
The way to defeat this underflow problem is to prevent things from ever getting this bad. The original algorithm said something like "If the most significant digit of high and low match, shift it out". If the two digits don't match, but are now on adjacent numbers, a second test needs to be applied. If High and low are one apart, we then test to see if the 2nd most significant digit in high is a 0, and the 2nd digit in low is a 9. If so, it means we are on the road to underflow, and need to take action.
When underflow rears its ugly head, we head it off with a slightly different shift operation. Instead of shifting the most significant digit out of the word, we just delete the 2nd digits from high and low, and shift the rest of the digits left to fill up the space. The most significant digit stays in place. We then have to set an underflow counter to remember that we threw away a digit, and we aren't quite sure whether it was going to end up as a 0 or a 9. The operation looks like this:
Before After ------ ----- High: 40344 43449 Low: 39810 38100 Underflow: 0 1
After every recalculation operation, if the most significant digits don't match up, we can check for underflow digits again. If they are present, we shift them out and increment the counter.
When the most significant digits do finally converge to a single value, we first output that value. Then, we output all of the "underflow" digits that were previously discarded. The underflow digits will be all 9s or 0s, depending on whether High and Low converged to the higher or lower value. In the C implementation of this algorithm, the underflow counter keeps track of how many ones or zeros to put out.
In the "ideal" decoding process, we had the entire input number to work with. So the algorithm has us do things like "divide the encoded number by the symbol probability." In practice, we can't perform an operation like that on a number that could be billions of bytes long. Just like the encoding process, the decoder can operate using 16 and 32 bit integers for calculations.
Instead of maintaining two numbers, high and low, the decoder has to maintain three integers. The first two, high and low, correspond exactly to the high and low values maintained by the encoder. The third number, code, contains the current bits being read in from the input bits stream. The code value will always lie in between the high and low values. As they come closer and closer to it, new shift operations will take place, and high and low will move back away from code.
The high and low values in the decoder correspond exactly to the high and low that the encoder was using. They will be updated after every symbol just as they were in the encoder, and should have exactly the same values. By performing the same comparison test on the upper digit of high and low, the decoder knows when it is time to shift a new digit into the incoming code. The same underflow tests are performed as well, in lockstep with the encoder.
In the ideal algorithm, it was possible to determine what the current encoded symbols was just by finding the symbol whose probabilities enclosed the present value of the code. In the integer math algorithm, things are somewhat more complicated. In this case, the probability scale is determined by the difference between high and low. So instead of the range being between 0.0 and 1.0, the range will be between two positive 16 bit integer counts. The current probability is determined by where the present code value falls along that range. If you were to divide (value-low) by (high-low+1), you would get the actual probability for the present symbol.
So far, this encoding process may look interesting, but it is not immediately obvious why it is an improvement over Huffman coding. The answer becomes clear when we examine a case where the probabilities are a little different. Lets examine the case where we have to encode the stream "AAAAAAA", and the probability of "A" is known to be 0.90. This means there is a 90% chance that any incoming character will be the letter "A". We set up our probability table so that the letter "A" occupies the range 0.0 to 0.9, and the End of message symbol occupies the 0.9 to 1.0 range. The encoding process then looks like this:
New Character Low value High Value ------------- --------- ---------- 0.0 1.0 A 0.0 0.9 A 0.0 0.81 A 0.0 0.729 A 0.0 0.6561 A 0.0 0.59049 A 0.0 0.531441 A 0.0 0.4782969 END 0.43046721 0.4782969
Now that we know what the range of low and high values are, all that remains is to pick a number to encode this message. The number "0.45" will make this message uniquely decode to "AAAAAAA". Those two decimal digits take slightly less than 7 bits to specify, which means that we have encoded eight symbols in less than 8 bits! An optimal Huffman coded message would have taken a minimum of 9 bits using this scheme.
To take this point even further, I set up a test that had only two symbols defined. The byte value '0' had a 16382/16383 probability, and an EOF symbol had a 1/16383 probability of occurring. I then created a test file which was filled with 100,000 '0's. After compression using this model, the output file was only 3 bytes long! The minimum size using Huffman coding would have been 12,501 bytes. This is obviously a contrived example, but it does illustrate the point that arithmetic coding is able to compress data at rates much better than 1 bit per byte when the symbol probabilities are right.
An encoding procedure written in C is shown in listing 1 and 2. It contains two parts, an initialization procedure, and the encoder routine itself. The code for the encoder as well as the decoder were first published in an article entitled "Arithmetic Coding for Data Compression" in the February 1987 issue of "Communications of the ACM", by Ian H. Witten, Radford Neal, and John Cleary, and is being published here with the author's permission. I have modified the code slightly so as to further isolate the modeling and coding portions of the program.
There are two major differences between the algorithms shown earlier and the code in listing 1-2. The first difference is in the way probabilities are transmitted. In the algorithms shown above, the probabilities were kept as a pair of floating point numbers on the 0.0 to 1.0 range. Each symbol had its own section of that range. In the programs that are shown here, a symbol has a slightly different definition. Instead of two floating point numbers, the symbol's range is defined as two integers, which are counts along a scale. The scale is also included as part of the symbol definition. So, for the "BILL GATES" example, the letter L was defined previously as the high/low pair of 0.60 and 0.80. In the code being used here, "B" would be defined as the low and high counts of 6 and 8, with the symbol scale being 10. The reason it is done this way is because it fits in well with the way we keep statistics when modeling this type of data. The two methods are equivalent, it is just that keeping integer counts cuts down on the amount of work needing to be done. Note that the character actually owns the range up to but not including the high value.
The second difference in this algorithm is that all of the comparison and shifting operations are being done in base 2, rather than base 10. The illustrations given previously were done on base 10 numbers to make the algortihms a little more comprehensible. The algorithms work properly in base 10, but masking off digits and shifting in base 10 on most computers is difficult. So instead of comparing the top two digits, we now compare the top two bits.
There are two things missing that are needed in order to use the encoding and decoding algorithms. The first is a set of bit oriented input and output routines. These are shown in listing 3 and 4, and are presented here without comment. The modeling code is responsible for keeping track of the probabilities of each character, and performing two different transformations. During the encoding process, the modeling code has to take a character to be encoded and convert it to a probability range. The probability range is defined as a low count, a high count, and a total range. During the decoding process, the modeling code has to take a count derived from the input bit stream and convert it into a character for output.
A short test program is shown Listing 3. It implements a compression/expansion program that uses the "BILL GATES" model discussed previously. Adding a few judiciously placed print statements will let you track exactly what is happening in this program.
BILL.C has its own very simple modeling unit, which has a set of fixed probabilities for each of the possible letters in the message. I added a new character, the null string terminator, so as to know when to stop decoding the message.
BILL.C encodes an arbitrarily defined input string, and writes it out to a file called TEST.CMP. It then decodes the message and prints it to the screen for verification. There are a few functions in BILL.C that are modeling functions, which haven't been discussed yet. During the encoding process, a routine called convert_int_to_symbol is called. This routine gets a given input character, and converts it to a low count, high count, and scale using the current model. Since our model for BILL.C is a set of fixed probabilities, this just means looking up the probabilities in the table. Once those are defined, the encoder can be called.
During the decoding process, there are two functions associated with modeling. In order to determine what character is waiting to be decoded on the input stream, the model needs to be interrogated to determine what the present scale is. In our example, the scale (or range of counts) is fixed at 11, since there are always 11 counts in our model. This number is passed to the decoding routine, and it returns a count for the given symbol using that scale. Then, a modeling function called convert_symbol_to_int is called. It takes the given count, and determines what character matches the count. Finally, the decoder is called again to process that character out of the input stream.
Once you successfully understand how to encode/decode symbols using arithmetic coding, you can then begin trying to create models that will give superior compression performance. Next month's concluding article discusses some of the methods you can try for adaptive compression. A fairly simple statistical modeling program is able to provide compression superior to respected programs such as PKZIP or COMPRESS.
As I mentioned previously, the article in the June 1987 "Communications of the ACM" provides the definitive overview of arithmetic coding. Most of the article is reprinted in the book "Text Compression", by Timothy C. Bell, John G. Cleary, and Ian H. Witten. This book provides an excellent overview of both statistical and dictionary based compression techniques. Another good book is "Data Compression" by James Storer.
Bell, Timothy C., Cleary, John G., Witten, Ian H, (1990) "Text Compression", Prentice Hall, Englewood NJ
Nelson, Mark (1989) "LZW Data Compression", Doctor Dobb's Journal, October, pp 29-37
Storer, J.A., (1988) "Data Compression", Computer Science Press, Rockville, MD
Witten, Ian H., Neal, Radford M., and Cleary, John G. (1987) "Arithmetic
Coding for Data Compression", Communications of the ACM, June, pp 520-540.
Listing 1 | coder.h | Constants, declarations, and prototypes needed to use the arithmetic coding routines. These declarations are for routines that need to interface with the arithmetic coding stuff in coder.c |
Listing 2 | coder.c | This file contains the code needed to accomplish arithmetic coding of a symbol. All the routines in this module need to know in order to accomplish coding is what the probabilities and scales of the symbol counts are. This information is generally passed in a SYMBOL structure. |
Listing 3 | bitio.h | This header file contains the function prototypes needed to use the bitstream i/o routines. |
Listing 4 | bitio.c | This routine contains a set of bit oriented i/o routines used for arithmetic data compression. |
Listing 5 | bill.c | This short demonstration program will use arithmetic data compression to encode and then decode a string that only uses the letters out of the phrase "BILL GATES". It uses a fixed, hardcoded table of probabilities.
Continue with Part 2: Statistical Modeling |
Return to Mark Nelson's Home Page
Copyright (c) 1996, Mark Nelson. All Rights Reserved.