Cotty's recent project (no longer compression related)

Probability estimation for PPM

W. J. Teahan

Department of Computer Science, University of Waikato, Hamilton

ABSTRACT

The state of the art in lossless text compression is the PPM data compression scheme. Two ap-proaches

to the problem of selecting the context models used in the scheme are described. One uses

an *a priori *upper bound on the lengths of the contexts, while the other method is unbounded. Several

techniques that improve the probability estimation are described, including four new methods: partial

update exclusions for the unbounded approach, deterministic scaling, recency scaling and multiple

probability estimators. Each of these methods improves the performance for both the bounded and

unbounded approaches. In addition, further savings are possible by combining the two approaches.

1 Introduction

The state of the art in lossless text compression is the PPM data compression scheme

[1, 4]. PPM, or prediction by partial matching, is an adaptive statistical modeling

technique based on blending together different length context models to predict the

next character in the input sequence. The scheme achieves greater compression than

Ziv-Lempel (LZ) dictionary based methods which are more widely used because of

their simplicity and faster execution speeds.

PPM models are based upon context models of varying length of prior characters.

The models record the frequency of characters that have followed each of the

contexts. For example, a particular context may be the letters “thei”. All the letters

that have followed this context are counted. The next time the letters “thei” occur in

the text, the counts are used to estimate the probability for the upcoming character.

The PPM technique blends together the context models for the varying lengths (such

as “hei” and “ei”) to arrive at an overall probability distribution. *Arithmetic coding*

is then used to optimally encode the character which actually does occur with respect

to this distribution.

In the next section, some of the problems associated with estimating the probability

distribution are examined. The first problem is the problem of deciding how long

the contexts should be. In light of this problem, two different variants of the

algorithm are examined. One fixes an *a priori *upper bound on the length of the

context models; the other is unbounded. The second problem, called *local order*

*estimation*, concerns that of selecting a particular context model from all the current

context models. Following that, we describe different techniques at arriving at a

probability distribution given the chosen context model. Four new methods that

improve the probability estimation for both the bounded and unbounded approaches

are also described. In the final section we give results that apply these techniques

and compare their performance, leading to an 8% improvement in compression

for both approaches over the standard PPMC implementation [1]. A further 1%

improvement is possible by combining the two approaches.

2 Probability estimation for PPM

PPM splits compression into two steps. The first step accumulates a statistical model

of the characters seen so far in the input text. As each character is encoded this model is used to generate a probability distribution over those characters which can

occur next.

PPM’s statistical model is built up from contexts of prior characters of varying

lengths. A problem arises in deciding how long the contexts should be, whether

there should be an upper bound to the length of the contexts or whether they should

be unbounded. Another problem is deciding how much weight should be given to

each context. The approach taken by PPM is a “back off” method of selecting a

particular context and then backing off or *escaping *to a shorter context if the chosen

context does not predict the upcoming character.

The problems of local order estimation, and estimating the escape probability, are the

key problems associated with probability estimation for PPM. These two problems

are discussed below. Following that, other techniques effective at improving the

probability estimation are discussed, including four new methods: partial update

exclusions for unbounded length contexts, deterministic scaling, recency scaling

and multiple probability estimators.

**Bounded vs. unbounded length contexts. **There are two approaches to the

problem of deciding how long the contexts should be for PPM. The first approach,

adopted in the original implementation [4] and later on in [6], uses an upper bound

to the length of the contexts, with the context model of longest length chosen first

to estimate the probability. This strategy is surprisingly effective at modelling text,

a maximum order of 4 or 5 is well suited to a wide range of text files.

A problem with the PPM algorithm is the rapid growth in the size of the context

models as the maximum order increases. Until recently, it was thought that there was

little gain in extending the length of the contexts, as there is a drop off in compression

as the maximum order increases beyond length 5. However, a recent approach

called PPM* [3] uses a variant of the PPM algorithm that exploits unbounded length

contexts. It does this by storing the contexts in a data structure called a *context trie*.

Each context of varying length is stored in a single trie, along with input pointers

back into the input string whenever a context becomes unique. A separate *context*

*list *is maintained with pointers back into the trie for the currently active contexts.

Experiments (described in the next section) with adapting this data structure to the

bounded approach, with the trie being limited to a certain depth, have also shown

compression superior to previously used approaches. Substantial reductions in the

trie memory size for the unbounded approach is also possible by combining non-branching

nodes of the trie into a single node. The resulting compact data structure,

which is similar to a patricia trie [7], is linear with the size of the input string.

**Local order estimation. **Local order estimation using bounded length contexts

simply involves selecting the longest context model. However for unbounded

contexts, experiments have shown that such a strategy is poor at selecting the model

which predicts the upcoming character best. An entropy-based method of choosing

the context which has the best average compression to date is also surprisingly poor

at prediction. The most effective strategy found so far for unbounded contexts is as

follows. Acontext is defined to be “deterministic” when it gives only one prediction.

Experiments conducted by [2] have shown that for such contexts the observed

frequency of the deterministic character is much higher than expected based on

a uniform prior distribution. This can be exploited by using such contexts for

prediction. The deterministic strategy involves choosing the shortest deterministic

context. **Estimating the escape probability. **Asecond problem with probability estimation

is how to encode a *novel *character, which has not been seen before in the current

context. This problem is essentially the zero frequency problem which is described

in [8]. PPM uses an escape probability to “escape” to another context model,

usually of length one shorter than the current context. For novel characters which

have never been seen before in any length model, the algorithm escapes down to a

default “order -1” context model where all possible characters are equiprobable.

Various methods have been proposed for estimating the escape probability. Two

*ad hoc *methods, methods A and B, are described in [1, 8]. Methods P, X and XC

described in [8] are based upon a Poisson process model for the appearance of novel

characters. For practical reasons, the method most widely used is method C, used

by Moffat [6] in his implementation of the algorithm PPMC. Whenever a novel

character is encountered, an escape count is incremented, and the new character’s

count is set to one. The escape probability is computed as where is the

number of unique characters, and is the total number of characters seen so far.

Method C has been found to be superior to methods A and B in practice.

Howard [5] proposed a small modification to method C. Instead of adding 1 to both

the escape count and the new character’s count, each count is incremented by 1

2 ,

hence the total weight is incremented by 1 as for the other non-novel characters.

The escape probability for method D is computed as 2 .

Experiments reported in the next section have found that method C is better suited

to the unbounded approach, whereas method D is more effective with the bounded

approach. One explanation for this is that method D has the effect of increasing

the weight given to the character’s count for deterministic contexts. This is already

achieved by the local order estimation strategy used for the unbounded approach.

Scaling of deterministic contexts, described below, also has the same effect, although

not as marked when method D is applied.

**Scaling the counts. **The frequency counts associated with each context must be

scaled downward periodically to prevent overflow in the arithmetic coder. The usual

method is to halve the counts when a certain threshold has been exceeded. This

also has the side-effect of improving the compression because of its locally adaptive

effect. However, experimental results with context tries for both the bounded

and unbounded approaches have shown that count scaling is ineffective, with the

maximum scaling threshold dictated by the requirements of the arithmetic encoder

providing the best performance.

**Exclusions. **PPM makes use of *exclusions *by excluding predictions found at a

higher order when calculating the probabilities for lower-order contexts. Exclusions

have been found to be effective for both the bounded and unbounded approaches.

*Update exclusions *do not increment the count for a predicted character if it is already

predicted by a higher-order context. The effect of update exclusions is to use the

uniqueness of a character in a particular context for prediction instead of its fre-quency.

Update exclusions have been found to improve the performance of PPMC

and other variants that apply the bounded approach by 2%. However, full update

exclusions are less effective for the unbounded approach. A new technique, called

*partial update exclusions *does however provide an improvement in performance.

Partial update exclusions involve incrementing the counts for all deterministic con-texts,

otherwise incrementing the count only for the longest non-deterministic one.

**Deterministic scaling. **Another new technique that improves the probability esti-mation

is called *deterministic scaling*. The aforementioned experiments in [2] have

shown that for deterministic contexts the deterministic character is far more likely

to occur than predicted by a uniform prior distribution. The local order estimation

method used by the unbounded approach exploits this to some extent. However,

further improvement can be achieved for both bounded and unbounded approaches

by scaling the count for the deterministic character upwards by a constant factor.

A curious side-effect with deterministic scaling is that greater improvement in

prediction is achieved when the scaling is applied with partial update exclusions

than without them. The best scaling factor increases when partial update exclusions

are applied, which was found to be the case for all files in the corpus.

**Recency scaling. **Increasing the count for the most recent character in each context

by a scaling factor also improves the prediction. Runs of the same character often

occur in many of the contexts, and this can be exploited quite effectively by applying

a recency scaling factor.

**Multiple probability estimators. **In many cases, the first context PPM selects for

encoding is non-optimal, with another context model providing better prediction.

This is especially true when the input string is a random sequence of characters.

The ideal would be to predict an order -1 model immediately. PPM’s performance

on random strings is about 9 bpc, the extra 1 bpc incurred because of the need to

escape down to the lower orders.

One technique that overcomes this problem to some extent is to have multiple

probability estimation methods, one of them specifically designed to cope with

random strings. Each of the probability estimation methods can be made to compete

for the prediction given a particular context, with an entropy-based performance

measure (such as the average codelength required to encode the characters) being

used to choose between the methods.

An addition of two extra probability estimators, one random and the other semi-random,

along with the standard PPM estimator, has been found to be effective for

the bounded approach. For each context, the codelengths for encoding the characters

using each of the three methods are accumulated. The method with the minimum

average codelength to date is then chosen to encode the upcoming character. The

random estimator predicts using an order -1 context model. The semi-random

estimator predicts using the character counts for the current context, with counts for

all other characters set to 1. PPM’s performance on random strings improves from

9 bpc down to 8.4 bpc when these three estimators are used.

Adding further estimation methods may improve the compression even more. For

example, an estimation method that has higher or lower deterministic and recency

scaling factors may be appropriate for particular contexts. However, this multiple

probability estimation method has substantial overheads, both in terms of execution

speed required to calculate the codelengths, and in terms of storage, with a floating

point number needed to store the codelength for each probability estimation method

for all nodes in the context trie.

3 Experimental Results

Experimental results of running variants of the PPM algorithm on the Calgary corpus

are shown in Tables 1, 2 and 3. The compression ratios for each file shown in the file PPMC PPM1 PPM2 PPM3 PPM2

(bpc) (bpc) (bpc) (bpc) Order (bpc)

bib 2.11 1.90 **1.86 1.86 **5 1.86

book1 2.48 2.31 **2.30 2.30 **4 2.30

book2 2.26 1.99 1.97 **1.96 **5 1.97

geo 4.78 4.74 4.71 **4.63 **2 4.53

news 2.65 2.39 2.36 **2.35 **5 2.36

obj1 3.76 3.75 3.73 **3.71 **4 3.73

obj2 2.69 2.44 2.40 **2.38 **8 2.35

paper1 2.48 2.35 **2.33 2.33 **5 2.33

paper2 2.45 2.33 **2.32 2.32 **4 2.32

pic 1.09 0.81 **0.80 0.80 **5 0.80

progc 2.49 2.39 2.37 **2.36 **5 2.37

progl 1.90 1.71 **1.68 1.68 **8 1.64

progp 1.84 1.73 **1.70 1.70 **6 1.70

trans 1.77 1.52 **1.47 1.47 **7 1.43

average 2.48 2.31 2.29 **2.28 **2.26

PPMC Standard implementation using escape method C described in [1, 6]

PPM1 Maximum order = 5, update exclusions, escape method C

PPM2 Maximum order = 5, update exclusions, escape method D

deterministic scaling factor = 3.0, recency scaling factor = 1.1

PPM3 Three probability estimators: random, semi-random and variant PPM2

PPM2 Same as variant PPM2 except maximum order varies

Table 1: Compression ratios for the Calgary corpus using bounded length contexts

tables are given in bits per character (bpc). The best ratio for each file is printed in

boldface.

Results for both the bounded and unbounded approaches yield the same figure of

2.28 bpc for the corpus overall. This represents an improvement of 8% over the

standard PPMC implementation.

Table 1 lists results for PPM using bounded length contexts. The standard PPMC

implementation is shown in column 2. Implementations labelled PPM1 to PPM3

and PPM2 in the adjacent columns use context tries with no limit on memory size.

The main features of each of the variants are summarized below the table.

PPM1 uses update exclusions with count scaling using a maximum threshold com-bined

with a maximum fixed order of 5 and method C for escaping. PPM2 uses

method D instead and also applies deterministic scaling (with the best factor found

to be 3.0) and recency scaling (best factor 1.1). The main contributor to the improve-ment

over PPM1 was applying the escape method D. However, both deterministic

scaling and recency scaling were found to improve the compression slightly over

most of the files (although not as much as that found for unbounded contexts de-scribed

below).

PPM3 uses three probability estimators, with PPM2 being one of them, and a random

and semi-random estimator being the other two. The estimator with the minimum

average codelength is selected for each context. Although this leads to only a small

improvement over PPM2, the performance on file *geo *which contains random data

has improved from 4.71 bpc down to 4.63 bpc. Also the improvement is consistently

better for the other files. file PPM* PPM*1 PPM*2 PPM*3 PPM*4

(bpc) (bpc) (bpc) (bpc) (bpc)

bib 1.91 1.90 1.90 1.87 **1.86**

book1 **2.40 2.40 **2.41 2.41 2.41

book2 2.02 2.02 2.02 2.01 **2.00**

geo 4.83 4.82 4.77 **4.76 **4.78

news 2.42 2.41 2.40 2.38 **2.37**

obj1 4.00 4.00 3.84 **3.82 **3.83

obj2 2.43 2.42 2.39 2.35 **2.31**

paper1 2.37 2.36 2.36 **2.33 2.33**

paper2 2.36 2.36 2.36 2.35 **2.34**

pic 0.85 **0.83 **0.87 0.84 0.84

progc 2.40 2.39 2.38 2.35 **2.34**

progl 1.67 1.66 1.65 1.62 **1.61**

progp 1.62 1.60 1.61 1.57 **1.55**

trans 1.45 1.41 1.46 **1.39 1.39**

average 2.34 2.33 2.32 2.29 **2.28**

PPM* Unbounded order, exclusions, escape method C

PPM*1 Unbounded order, exclusions, escape method C

deterministic scaling factor = 2.0

PPM*2 Unbounded order, partial update exclusions, escape method C

PPM*3 Unbounded order, partial update exclusions, escape method C

deterministic scaling factor = 3.0

PPM*4 Same as variant PPM*3 plus recency scaling factor = 1.1

Table 2: Compression ratios for the Calgary corpus using unbounded length contexts

A maximum fixed order of 5 produces the best results averaged over the whole

corpus. However, this is not the case when each file is compressed separately. The

best order and compression ratio found using the PPM2 variant on each file is shown

in the final two columns (under the heading PPM2 ). For example, compressing

the file *geo *with order 5 requires 4.71 bpc, and this improves to 4.53 bpc if order

2 is chosen instead. These results averaged over the whole corpus yield about a

1% improvement compared when order 5 is chosen for all files (2.26 bpc compared

with 2.29 bpc). However, these improvements are only obtainable by a brute force

method of repeatedly compressing the same file to determine which order is best.

Although this could be done in parallel discarding all output files except the smallest

when completed, such a resource-intensive scheme is not warranted for such a small

improvement. Adaptive methods for determining which order is best are currently

being investigated but results so far are inferior to the brute force approach.

Table 2 lists results using unbounded length contexts. The standard PPM* algorithm

which uses escape method C and count scaling with a maximum threshold combined

with exclusions is shown in column 2. The implementations labelled PPM*1 to

PPM*4 in the next four columns use the standard PPM*approach with a deterministic

strategy for selecting the context order as their basis. The escape method D was

found to be inferior in all variants to the escape method C. Below the table is a

summary of the main features of each of the variants.

PPM*1 applies deterministic scaling using a factor of 2.0 which was found to be best

over the range of files in the Calgary corpus. PPM*2 uses partial update exclusions.

PPM*3 combines both partial update exclusions with deterministic scaling, with

the best factor increasing to 3.0. PPM*4 improves on PPM*3 by adding a recency file PPM3 PPM*4 Combined

(bpc) (bpc) (bpc)

bib **1.86 1.86 **1.86

book1 **2.30 **2.41 2.30

book2 **1.96 **2.00 1.96

geo **4.63 **4.78 4.63

news **2.35 **2.37 2.35

obj1 **3.71 **3.83 3.71

obj2 2.38 **2.31 **2.31

paper1 **2.33 2.33 **2.33

paper2 **2.32 **2.34 2.32

pic **0.80 **0.84 0.80

progc 2.36 **2.34 **2.34

progl 1.68 **1.61 **1.61

progp 1.70 **1.55 **1.55

trans 1.47 **1.39 **1.39

average **2.28 2.28 **2.25

Table 3: Combined results

scaling factor of 1.1.

Combined results for the best bounded and unbounded approaches (PPM3 and

PPM*4) are shown in Table 3. Results for PPM3 are taken from Table 1 and for

PPM*4 from Table 2. The results in the column labelled “Combined” uses whichever

of the two is smaller, and leads to a final compression ratio of 2.25 bpc for the whole

corpus, which is a further improvement of 1% over each of the two methods taken

separately. Interestingly, neither of the two approaches performs consistently better

than the other, instead outperforming the other on about half of the files (PPM*4

performing noticeably better on program source code files such as *progl *and *progp*

for example). Astraightforwardimplementation of the combined approach would be

to compress each file using both methods in parallel, and then discard the output file

for whichever method was inferior. An adaptive implementation of such a scheme,

which selects the better method for each context as the file is being compressed

exploiting the fact that the same context trie can be used to implement both methods

has yet to be investigated.

4 Conclusions

Four new methods that improve the probability estimation for the PPM data com-pression

scheme have been described. As well, two different approaches that use

bounded and unbounded length context models have been compared. Each of the

new methods is effective at improving the performance of the prediction for both

of these approaches, leading to an 8% improvement for both approaches over the

standard implementation PPMC. The two approaches combined lead to a further

1% improvement in performance.

While these gains are not large, we are in an area where diminishing returns are

to be expected. Indeed, if one is interested in *prediction *rather than compression,

such as for speech recognition and predictive typing, then even slightly improved

prediction may well be worthwhile.

A number of conclusions can be made from this research. Compression may be improved by carefully fine-tuning the techniques even further, but we are in danger

of over-fitting the algorithm to suit the test data. Important questions have also

been raised by this research that require further investigation. For example, why are

bounded context models just as effective as unbounded models? Why is an entropy-based

method effective for selecting between multiple probability estimators, but

less effective when applied to the problem of local order estimation? Are there better

methods of blending the context models rather than the PPM method of estimating

an escape probability? A greater understanding of these questions is necessary

before substantial improvement in probability estimation is possible.

Acknowledgement

I would like to thank my supervisor Dr. John Cleary for his guidance in preparing

this paper.

References

[1] T.C. Bell, J.G. Cleary, and I.H. Witten. *Text compression*. Prentice Hall, NJ, 1990.

[2] J.G. Cleary and W.J. Teahan. Some experiments on the zero frequency problem. In

J.A. Storer and M. Cohn, editors, *Proceedings DCC’95*. IEEE Computer Society Press,

1995.

[3] J.G. Cleary, W.J. Teahan, and I.H. Witten. Unbounded length contexts for PPM. In

J.A. Storer and M. Cohn, editors, *Proceedings DCC’95*. IEEE Computer Society Press,

1995.

[4] J.G. Cleary and I.H. Witten. Data compression using adaptive coding and partial string

matching. *IEEE Transactions on Communications*, 32(4):396–402, 1984.

[5] P.G. Howard. The design and analysis of efficient lossless data compression systems.

Technical Report CS–93–28, Brown University, Providence, Rhode Island, 1993.

[6] A. Moffat. Implementing the PPM data compression scheme. *IEEE Transactions on*

*Communications*, 38(11):1917–1921, 1990.

[7] R. Sedgewick. *Algorithms*. Addison-Wesley, Reading, Massachusets, 1988.

[8] I.H. Witten and T.C. Bell. The zero-frequency problem: estimating the probabilities of

novel events in adaptive text compression. *IEEE Transactions on Informaton Theory*,

37(4):1085–1094, 1991.