Cotty's recent project (no longer compression related)
Probability estimation for PPM
W. J. Teahan
Department of Computer Science, University of Waikato, Hamilton
ABSTRACT
The state of the art in lossless text compression is the PPM data compression scheme. Two ap-proaches
to the problem of selecting the context models used in the scheme are described. One uses
an
a priori upper bound on the lengths of the contexts, while the other method is unbounded. Severaltechniques that improve the probability estimation are described, including four new methods: partial
update exclusions for the unbounded approach, deterministic scaling, recency scaling and multiple
probability estimators. Each of these methods improves the performance for both the bounded and
unbounded approaches. In addition, further savings are possible by combining the two approaches.
1 Introduction
The state of the art in lossless text compression is the PPM data compression scheme
[1, 4]. PPM, or prediction by partial matching, is an adaptive statistical modeling
technique based on blending together different length context models to predict the
next character in the input sequence. The scheme achieves greater compression than
Ziv-Lempel (LZ) dictionary based methods which are more widely used because of
their simplicity and faster execution speeds.
PPM models are based upon context models of varying length of prior characters.
The models record the frequency of characters that have followed each of the
contexts. For example, a particular context may be the letters “thei”. All the letters
that have followed this context are counted. The next time the letters “thei” occur in
the text, the counts are used to estimate the probability for the upcoming character.
The PPM technique blends together the context models for the varying lengths (such
as “hei” and “ei”) to arrive at an overall probability distribution.
Arithmetic codingis then used to optimally encode the character which actually does occur with respect
to this distribution.
In the next section, some of the problems associated with estimating the probability
distribution are examined. The first problem is the problem of deciding how long
the contexts should be. In light of this problem, two different variants of the
algorithm are examined. One fixes an
a priori upper bound on the length of thecontext models; the other is unbounded. The second problem, called
local orderestimation
, concerns that of selecting a particular context model from all the currentcontext models. Following that, we describe different techniques at arriving at a
probability distribution given the chosen context model. Four new methods that
improve the probability estimation for both the bounded and unbounded approaches
are also described. In the final section we give results that apply these techniques
and compare their performance, leading to an 8% improvement in compression
for both approaches over the standard PPMC implementation [1]. A further 1%
improvement is possible by combining the two approaches.
2 Probability estimation for PPM
PPM splits compression into two steps. The first step accumulates a statistical model
of the characters seen so far in the input text. As each character is encoded this
model is used to generate a probability distribution over those characters which canoccur next.
PPM’s statistical model is built up from contexts of prior characters of varying
lengths. A problem arises in deciding how long the contexts should be, whether
there should be an upper bound to the length of the contexts or whether they should
be unbounded. Another problem is deciding how much weight should be given to
each context. The approach taken by PPM is a “back off” method of selecting a
particular context and then backing off or
escaping to a shorter context if the chosencontext does not predict the upcoming character.
The problems of local order estimation, and estimating the escape probability, are the
key problems associated with probability estimation for PPM. These two problems
are discussed below. Following that, other techniques effective at improving the
probability estimation are discussed, including four new methods: partial update
exclusions for unbounded length contexts, deterministic scaling, recency scaling
and multiple probability estimators.
Bounded vs. unbounded length contexts.
There are two approaches to theproblem of deciding how long the contexts should be for PPM. The first approach,
adopted in the original implementation [4] and later on in [6], uses an upper bound
to the length of the contexts, with the context model of longest length chosen first
to estimate the probability. This strategy is surprisingly effective at modelling text,
a maximum order of 4 or 5 is well suited to a wide range of text files.
A problem with the PPM algorithm is the rapid growth in the size of the context
models as the maximum order increases. Until recently, it was thought that there was
little gain in extending the length of the contexts, as there is a drop off in compression
as the maximum order increases beyond length 5. However, a recent approach
called PPM* [3] uses a variant of the PPM algorithm that exploits unbounded length
contexts. It does this by storing the contexts in a data structure called a
context trie.Each context of varying length is stored in a single trie, along with input pointers
back into the input string whenever a context becomes unique. A separate
contextlist
is maintained with pointers back into the trie for the currently active contexts.Experiments (described in the next section) with adapting this data structure to the
bounded approach, with the trie being limited to a certain depth, have also shown
compression superior to previously used approaches. Substantial reductions in the
trie memory size for the unbounded approach is also possible by combining non-branching
nodes of the trie into a single node. The resulting compact data structure,
which is similar to a patricia trie [7], is linear with the size of the input string.
Local order estimation.
Local order estimation using bounded length contextssimply involves selecting the longest context model. However for unbounded
contexts, experiments have shown that such a strategy is poor at selecting the model
which predicts the upcoming character best. An entropy-based method of choosing
the context which has the best average compression to date is also surprisingly poor
at prediction. The most effective strategy found so far for unbounded contexts is as
follows. Acontext is defined to be “deterministic” when it gives only one prediction.
Experiments conducted by [2] have shown that for such contexts the observed
frequency of the deterministic character is much higher than expected based on
a uniform prior distribution. This can be exploited by using such contexts for
prediction. The deterministic strategy involves choosing the shortest deterministic
context.
Estimating the escape probability. Asecond problem with probability estimationis how to encode a
novel character, which has not been seen before in the currentcontext. This problem is essentially the zero frequency problem which is described
in [8]. PPM uses an escape probability to “escape” to another context model,
usually of length one shorter than the current context. For novel characters which
have never been seen before in any length model, the algorithm escapes down to a
default “order -1” context model where all possible characters are equiprobable.
Various methods have been proposed for estimating the escape probability. Two
ad hoc
methods, methods A and B, are described in [1, 8]. Methods P, X and XCdescribed in [8] are based upon a Poisson process model for the appearance of novel
characters. For practical reasons, the method most widely used is method C, used
by Moffat [6] in his implementation of the algorithm PPMC. Whenever a novel
character is encountered, an escape count is incremented, and the new character’s
count is set to one. The escape probability is computed as where is the
number of unique characters, and is the total number of characters seen so far.
Method C has been found to be superior to methods A and B in practice.
Howard [5] proposed a small modification to method C. Instead of adding 1 to both
the escape count and the new character’s count, each count is incremented by
12
,hence the total weight is incremented by 1 as for the other non-novel characters.
The escape probability for method D is computed as
2 .Experiments reported in the next section have found that method C is better suited
to the unbounded approach, whereas method D is more effective with the bounded
approach. One explanation for this is that method D has the effect of increasing
the weight given to the character’s count for deterministic contexts. This is already
achieved by the local order estimation strategy used for the unbounded approach.
Scaling of deterministic contexts, described below, also has the same effect, although
not as marked when method D is applied.
Scaling the counts.
The frequency counts associated with each context must bescaled downward periodically to prevent overflow in the arithmetic coder. The usual
method is to halve the counts when a certain threshold has been exceeded. This
also has the side-effect of improving the compression because of its locally adaptive
effect. However, experimental results with context tries for both the bounded
and unbounded approaches have shown that count scaling is ineffective, with the
maximum scaling threshold dictated by the requirements of the arithmetic encoder
providing the best performance.
Exclusions.
PPM makes use of exclusions by excluding predictions found at ahigher order when calculating the probabilities for lower-order contexts. Exclusions
have been found to be effective for both the bounded and unbounded approaches.
Update exclusions
do not increment the count for a predicted character if it is alreadypredicted by a higher-order context. The effect of update exclusions is to use the
uniqueness of a character in a particular context for prediction instead of its fre-quency.
Update exclusions have been found to improve the performance of PPMC
and other variants that apply the bounded approach by 2%. However, full update
exclusions are less effective for the unbounded approach. A new technique, called
partial update exclusions
does however provide an improvement in performance.Partial update exclusions involve incrementing the counts for all deterministic con-texts,
otherwise incrementing the count only for the longest non-deterministic one.
Deterministic scaling.
Another new technique that improves the probability esti-mationis called
deterministic scaling. The aforementioned experiments in [2] haveshown that for deterministic contexts the deterministic character is far more likely
to occur than predicted by a uniform prior distribution. The local order estimation
method used by the unbounded approach exploits this to some extent. However,
further improvement can be achieved for both bounded and unbounded approaches
by scaling the count for the deterministic character upwards by a constant factor.
A curious side-effect with deterministic scaling is that greater improvement in
prediction is achieved when the scaling is applied with partial update exclusions
than without them. The best scaling factor increases when partial update exclusions
are applied, which was found to be the case for all files in the corpus.
Recency scaling.
Increasing the count for the most recent character in each contextby a scaling factor also improves the prediction. Runs of the same character often
occur in many of the contexts, and this can be exploited quite effectively by applying
a recency scaling factor.
Multiple probability estimators.
In many cases, the first context PPM selects forencoding is non-optimal, with another context model providing better prediction.
This is especially true when the input string is a random sequence of characters.
The ideal would be to predict an order -1 model immediately. PPM’s performance
on random strings is about 9 bpc, the extra 1 bpc incurred because of the need to
escape down to the lower orders.
One technique that overcomes this problem to some extent is to have multiple
probability estimation methods, one of them specifically designed to cope with
random strings. Each of the probability estimation methods can be made to compete
for the prediction given a particular context, with an entropy-based performance
measure (such as the average codelength required to encode the characters) being
used to choose between the methods.
An addition of two extra probability estimators, one random and the other semi-random,
along with the standard PPM estimator, has been found to be effective for
the bounded approach. For each context, the codelengths for encoding the characters
using each of the three methods are accumulated. The method with the minimum
average codelength to date is then chosen to encode the upcoming character. The
random estimator predicts using an order -1 context model. The semi-random
estimator predicts using the character counts for the current context, with counts for
all other characters set to 1. PPM’s performance on random strings improves from
9 bpc down to 8.4 bpc when these three estimators are used.
Adding further estimation methods may improve the compression even more. For
example, an estimation method that has higher or lower deterministic and recency
scaling factors may be appropriate for particular contexts. However, this multiple
probability estimation method has substantial overheads, both in terms of execution
speed required to calculate the codelengths, and in terms of storage, with a floating
point number needed to store the codelength for each probability estimation method
for all nodes in the context trie.
3 Experimental Results
Experimental results of running variants of the PPM algorithm on the Calgary corpus
are shown in Tables 1, 2 and 3. The compression ratios for each file shown in the
file PPMC PPM1 PPM2 PPM3 PPM2(bpc) (bpc) (bpc) (bpc)
Order (bpc)bib 2.11 1.90
1.86 1.86 5 1.86book1 2.48 2.31
2.30 2.30 4 2.30book2 2.26 1.99 1.97
1.96 5 1.97geo 4.78 4.74 4.71
4.63 2 4.53news 2.65 2.39 2.36
2.35 5 2.36obj1 3.76 3.75 3.73
3.71 4 3.73obj2 2.69 2.44 2.40
2.38 8 2.35paper1 2.48 2.35
2.33 2.33 5 2.33paper2 2.45 2.33
2.32 2.32 4 2.32pic 1.09 0.81
0.80 0.80 5 0.80progc 2.49 2.39 2.37
2.36 5 2.37progl 1.90 1.71
1.68 1.68 8 1.64progp 1.84 1.73
1.70 1.70 6 1.70trans 1.77 1.52
1.47 1.47 7 1.43average 2.48 2.31 2.29
2.28 2.26PPMC
Standard implementation using escape method C described in [1, 6]PPM
1 Maximum order = 5, update exclusions, escape method CPPM
2 Maximum order = 5, update exclusions, escape method Ddeterministic scaling factor = 3.0, recency scaling factor = 1.1
PPM
3 Three probability estimators: random, semi-random and variant PPM2PPM
2 Same as variant PPM2 except maximum order variesTable 1: Compression ratios for the Calgary corpus using bounded length contexts
tables are given in bits per character (bpc). The best ratio for each file is printed in
boldface.
Results for both the bounded and unbounded approaches yield the same figure of
2.28 bpc for the corpus overall. This represents an improvement of 8% over the
standard PPMC implementation.
Table 1 lists results for PPM using bounded length contexts. The standard PPMC
implementation is shown in column 2. Implementations labelled PPM
1 to PPM3and PPM
2 in the adjacent columns use context tries with no limit on memory size.The main features of each of the variants are summarized below the table.
PPM
1 uses update exclusions with count scaling using a maximum threshold com-binedwith a maximum fixed order of 5 and method C for escaping. PPM
2 usesmethod D instead and also applies deterministic scaling (with the best factor found
to be 3.0) and recency scaling (best factor 1.1). The main contributor to the improve-ment
over PPM
1 was applying the escape method D. However, both deterministicscaling and recency scaling were found to improve the compression slightly over
most of the files (although not as much as that found for unbounded contexts de-scribed
below).
PPM
3 uses three probability estimators, with PPM2 being one of them, and a randomand semi-random estimator being the other two. The estimator with the minimum
average codelength is selected for each context. Although this leads to only a small
improvement over PPM
2, the performance on file geo which contains random datahas improved from 4.71 bpc down to 4.63 bpc. Also the improvement is consistently
better for the other files.
file PPM* PPM*1 PPM*2 PPM*3 PPM*4(bpc) (bpc) (bpc) (bpc) (bpc)
bib 1.91 1.90 1.90 1.87
1.86book1
2.40 2.40 2.41 2.41 2.41book2 2.02 2.02 2.02 2.01
2.00geo 4.83 4.82 4.77
4.76 4.78news 2.42 2.41 2.40 2.38
2.37obj1 4.00 4.00 3.84
3.82 3.83obj2 2.43 2.42 2.39 2.35
2.31paper1 2.37 2.36 2.36
2.33 2.33paper2 2.36 2.36 2.36 2.35
2.34pic 0.85
0.83 0.87 0.84 0.84progc 2.40 2.39 2.38 2.35
2.34progl 1.67 1.66 1.65 1.62
1.61progp 1.62 1.60 1.61 1.57
1.55trans 1.45 1.41 1.46
1.39 1.39average 2.34 2.33 2.32 2.29
2.28PPM*
Unbounded order, exclusions, escape method CPPM*
1 Unbounded order, exclusions, escape method Cdeterministic scaling factor = 2.0
PPM*
2 Unbounded order, partial update exclusions, escape method CPPM*
3 Unbounded order, partial update exclusions, escape method Cdeterministic scaling factor = 3.0
PPM*
4 Same as variant PPM*3 plus recency scaling factor = 1.1Table 2: Compression ratios for the Calgary corpus using unbounded length contexts
A maximum fixed order of 5 produces the best results averaged over the whole
corpus. However, this is not the case when each file is compressed separately. The
best order and compression ratio found using the PPM
2 variant on each file is shownin the final two columns (under the heading PPM
2 ). For example, compressingthe file
geo with order 5 requires 4.71 bpc, and this improves to 4.53 bpc if order2 is chosen instead. These results averaged over the whole corpus yield about a
1% improvement compared when order 5 is chosen for all files (2.26 bpc compared
with 2.29 bpc). However, these improvements are only obtainable by a brute force
method of repeatedly compressing the same file to determine which order is best.
Although this could be done in parallel discarding all output files except the smallest
when completed, such a resource-intensive scheme is not warranted for such a small
improvement. Adaptive methods for determining which order is best are currently
being investigated but results so far are inferior to the brute force approach.
Table 2 lists results using unbounded length contexts. The standard PPM* algorithm
which uses escape method C and count scaling with a maximum threshold combined
with exclusions is shown in column 2. The implementations labelled PPM*
1 toPPM*
4 in the next four columns use the standard PPM*approach with a deterministicstrategy for selecting the context order as their basis. The escape method D was
found to be inferior in all variants to the escape method C. Below the table is a
summary of the main features of each of the variants.
PPM*
1 applies deterministic scaling using a factor of 2.0 which was found to be bestover the range of files in the Calgary corpus. PPM*
2 uses partial update exclusions.PPM*
3 combines both partial update exclusions with deterministic scaling, withthe best factor increasing to 3.0. PPM*
4 improves on PPM*3 by adding a recency file PPM3 PPM*4 Combined(bpc) (bpc) (bpc)
bib
1.86 1.86 1.86book1
2.30 2.41 2.30book2
1.96 2.00 1.96geo
4.63 4.78 4.63news
2.35 2.37 2.35obj1
3.71 3.83 3.71obj2 2.38
2.31 2.31paper1
2.33 2.33 2.33paper2
2.32 2.34 2.32pic
0.80 0.84 0.80progc 2.36
2.34 2.34progl 1.68
1.61 1.61progp 1.70
1.55 1.55trans 1.47
1.39 1.39average
2.28 2.28 2.25Table 3: Combined results
scaling factor of 1.1.
Combined results for the best bounded and unbounded approaches (PPM
3 andPPM*
4) are shown in Table 3. Results for PPM3 are taken from Table 1 and forPPM*
4 from Table 2. The results in the column labelled “Combined” uses whicheverof the two is smaller, and leads to a final compression ratio of 2.25 bpc for the whole
corpus, which is a further improvement of 1% over each of the two methods taken
separately. Interestingly, neither of the two approaches performs consistently better
than the other, instead outperforming the other on about half of the files (PPM*
4performing noticeably better on program source code files such as
progl and progpfor example). Astraightforwardimplementation of the combined approach would be
to compress each file using both methods in parallel, and then discard the output file
for whichever method was inferior. An adaptive implementation of such a scheme,
which selects the better method for each context as the file is being compressed
exploiting the fact that the same context trie can be used to implement both methods
has yet to be investigated.
4 Conclusions
Four new methods that improve the probability estimation for the PPM data com-pression
scheme have been described. As well, two different approaches that use
bounded and unbounded length context models have been compared. Each of the
new methods is effective at improving the performance of the prediction for both
of these approaches, leading to an 8% improvement for both approaches over the
standard implementation PPMC. The two approaches combined lead to a further
1% improvement in performance.
While these gains are not large, we are in an area where diminishing returns are
to be expected. Indeed, if one is interested in
prediction rather than compression,such as for speech recognition and predictive typing, then even slightly improved
prediction may well be worthwhile.
A number of conclusions can be made from this research. Compression may be
improved by carefully fine-tuning the techniques even further, but we are in dangerof over-fitting the algorithm to suit the test data. Important questions have also
been raised by this research that require further investigation. For example, why are
bounded context models just as effective as unbounded models? Why is an entropy-based
method effective for selecting between multiple probability estimators, but
less effective when applied to the problem of local order estimation? Are there better
methods of blending the context models rather than the PPM method of estimating
an escape probability? A greater understanding of these questions is necessary
before substantial improvement in probability estimation is possible.
Acknowledgement
I would like to thank my supervisor Dr. John Cleary for his guidance in preparing
this paper.
References
[1] T.C. Bell, J.G. Cleary, and I.H. Witten.
Text compression. Prentice Hall, NJ, 1990.[2] J.G. Cleary and W.J. Teahan. Some experiments on the zero frequency problem. In
J.A. Storer and M. Cohn, editors,
Proceedings DCC’95. IEEE Computer Society Press,1995.
[3] J.G. Cleary, W.J. Teahan, and I.H. Witten. Unbounded length contexts for PPM. In
J.A. Storer and M. Cohn, editors,
Proceedings DCC’95. IEEE Computer Society Press,1995.
[4] J.G. Cleary and I.H. Witten. Data compression using adaptive coding and partial string
matching.
IEEE Transactions on Communications, 32(4):396–402, 1984.[5] P.G. Howard. The design and analysis of efficient lossless data compression systems.
Technical Report CS–93–28, Brown University, Providence, Rhode Island, 1993.
[6] A. Moffat. Implementing the PPM data compression scheme.
IEEE Transactions onCommunications
, 38(11):1917–1921, 1990.[7] R. Sedgewick.
Algorithms. Addison-Wesley, Reading, Massachusets, 1988.[8] I.H. Witten and T.C. Bell. The zero-frequency problem: estimating the probabilities of
novel events in adaptive text compression.
IEEE Transactions on Informaton Theory,37(4):1085–1094, 1991.