In 1987, I was asked by a magazine editor to write an article about
data compression. I wrote a manuscript and an accompanying program,
sent them to the editor, and forgot about them. The next time I heard
from him I was told that the magazine was discontinued. So I uploaded
my program, a simple implementation of the LZSS compression algorithm
(see below) to *PC-VAN*, a big Japanese BBS run by NEC. That was
May 1, 1988.

Soon a number of hobby programmers gathered and began improving on
that program. The project culminated in Kazuhiko Miki's archiver
*LArc*, which was fairly widey used in Japan. (Dr. Miki was then
a medical specialist working at a governmental office. I heard he
left office and began work on freeware/shareware promotion.)

The LZSS algorithm is based on a very simple idea. Suppose I'm going to write "compression" here. But probably I've already used that word before in this file. If I used that word 57 characters before, I might as well write "go 57 characters backward, and read 11 characters," or <57,11> for short. In general, when I've already used the string of characters among the recent 4096 characters, say, I encode the string by a <position,length> pair.

In Storer's [8] terminology, this is a sliding dictionary algorithm, analyzed first by Ziv and Lempel [14] and then by Storer and Szymanski [9], among others.

Later versions of my LZSS implementations and Miki's *LArc*
used binary search trees to make string search faster; see Bell [1].

Incidentally, there are two distinct Ziv-Lempel (LZ) methods:
sliding dictionary [14] and dynamic dictionary [15] in Storer's [8] terminology. The
LZW algorithm [12] belongs to the latter. Most
pre-*LHarc* compression tools, such as 'compress', 'ARC', and 'PKARC',
used LZW.

During the summer of 1988, I wrote another compression program,
*LZARI*. This program is based on the following observation:
Each output of LZSS is either a single character or a
<position,length> pair. A single character can be coded as an
integer between 0 and 255. As for the <length> field, if the
range of <length> is 2 to 257, say, it can be coded as an
integer between 256 and 511. Thus, I can say that there are 512 kinds
of "characters," and the "characters" 256 through 511 are accompanied
by a <position> field. These 512 "characters" can be
Huffman-coded, or better still, algebraically coded. The
<position> field can be coded in the same manner. In
*LZARI* I used an adaptive algebraic compression [13], [2] to encode the "characters,"
and static algebraic compression to encode the <position> field.
(There were several versions of *LZARI*; some of them were
slightly different from the above description.) The compression of
*LZARI* was very tight, though rather slow.

Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer,
worked very hard to make *LZARI* faster. Most importantly, he
replaced *LZARI*'s algebraic compression by dynamic Huffman
coding.

His program, *LZHUF*, was very successful. It was much faster
than my *LZARI*. As for compression ratio, Huffman cannot beat
algebraic compression, but the difference turned out to be very small.

Yoshi rewrote the compression engine of *LZHUF* in assembler,
and added a nifty user interface. His archiver, *LHarc*, soon
became the de facto standard among Japanese BBS users. After
Prof. Kenjirou Okubo, a mathematician, introduced *LHarc* to the
United States, it became world-famous. Other vendors began using
similar techniques: sliding dictionary plus statistical compressions
such as Huffman and Shannon-Fano. (I wondered why they used
Shannon-Fano rather than Huffman which is guaranteed to compress
tighter than Shannon-Fano. As it turned out, a then-popular book on
compression published in U.S. contained a wrong description and buggy
sample programs, such that Shannon-Fano outperformed (buggy) Huffman
on many files.)

Although *LHarc* was much faster than *LZARI*, we weren't quite
satisfied with its speed. Because *LHarc* was based on dynamic Huffman,
it had to update Huffman tree every time it received a character.
Yoshi and I tried other dynamic Huffman algorithms [5], [10], [11], but
improvements were not as great as we desired.

So I took a different step: replacing *LHarc*'s dynamic Huffman by a
static Huffman method.

Traditional static Huffman coding algorithm first scans the input
file to count character distribution, then builds Huffman tree and
encodes the file. In my approach, the input file is read only once.
It is first compressed by a sliding dictionary method like
*LZARI* and *LHarc*, and at the same time the distributions
of the "characters" (see above) and positions are counted. The output
of this process is stored in main memory. When the buffer in memory
is full (or the input is exhausted), the Huffman trees are
constructed, and the half-processed content of the buffer is actually
compressed and output.

In static Huffman, the Huffman tree must be stored in the compressed file. In the traditional approach this information consumes hundreds of bytes. My approach was to standardize Huffman trees so that (1) each left subtree is no deeper than its right counterpart, and (2) the leaves at the same level are sorted in ascending order. In this way the Huffman tree can be uniquely specified by the lengths of the codewords. Moreover, the resulting table is again compressed by the same Huffman algorithm.

To make the decoding program simpler, the Huffman tree is adjusted so that the codeword lengths do not exceed 16 bits. Since this adjusting is rarely needed, the algorithm is made very simple. It does not create optimal length-limited Huffman trees; see e.g. [6] for an optimal algorithm. Incidentally, my early program had a bug here, which was quickly pointed out and corrected by Yoshi.

The sliding dictionary algorithm is also improved by Yoshi using a "PATRICIA tree" data structure; see McCreight [7] and Fiala and Greene [4].

After completing my algorithm, I learned that Brent [3] also used a sliding dictionary plus Huffman coding.
His method, *SLH*, is simple and elegant, but since it doesn't
find the most recent longest match, the distribution of match position
becomes flat. This makes the second-stage Huffman compression less
efficient.

On the basis of these new algorithms, Yoshi began to rewrite his
*LHarc*, but it took him so long (remember he was a busy doctor!)
that I decided to write my own archiver. My archiver was quite
recklessly named 'ar'. (Actually I appended version numbers as in 'ar002'
for version 0.02.) I should have named it 'har' (after my name), say,
because 'ar' collides with the name of UNIX's archiver. I didn't want
my program to compete with *LHarc*, but I wanted many people to
try the algorithm, so I wrote it in pure ANSI C. This is the reason 'ar'
lacked many bells and whistles necessary for a real archiver.

Note:The version of 'ar002' most often found in the U.S. had a bug. Line 24 ofmaketbl.cshould read, of course,while (i <= 16) { weight[i] = 1U << (16 - i); i++; }Somehow the bug didn't show up when compiled by Turbo C.

Yoshi finally showed us his new archiver written in C. It was
tentatively named *LHx*. He then rewrote the main logic in
assembler. Yoshi and I wrote an article describing his new archiver,
which would be named *LH*, in the January, 1991, issue of "C
Magazine" (in Japanese). The suffix 'arc' of *LHarc* was
deliberately dropped because the people who sold *ARC* did not
want others to use the name.

Then we learned that for the new DOS 5.0, LH meaned LoadHigh, an
internal command. We decided to rename *LH* to *LHA*.

Also, I was told that the algorithm described in Fiala and Greene [4] got patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990. Actually they got three patents! The other two were: "Start, Step, Stop Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems," Application Ser. No. 07/187,699.)

Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, 4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, ceased to use 'compress' program any more because its LZW algorithm got patented.

Are algorithms patentable? (See [16].) If these patents should turn out to be taken seriously, all compression programs now in use may infringe some of these patents. (Luckily, not all claims made by those algorithm patents seems to be valid.)

The foregoing is a slight modification of what I wrote in 1991. The year 1991 was a very busy year for me. In 1992, I joined the faculty of Matsusaka University. This opportunity should have given me more free time, but as it turned out I got ever busier. I stopped hacking on my compression algorithms; so did Yoshi.

Luckily, all good things in *LHA* were taken over, and all bad
things abandoned, by the new great archiver zip and the compression
tool gzip. I admire the efforts of
Jean-loup Gailly and others.

A brief historical comment on *PKZIP*: At one time a
programmer for PK and I were in close contact. We exchanged a lot of
ideas. No wonder *PKZIP* and *LHA* are so similar.

Another historical comment: *LHICE* and *ICE* are
definitely *not* written by Yoshi (or me or anyone I know). I
think they are faked versions of *LHarc*.

**REFERENCES**

- [1]
- Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on Communications, COM-34(12):1176--1182, 1986.
- [2]
- Timothy C. Bell, John G. Cleary, and Ian H. Witten.
*Text Compression*. Prentice Hall, 1990. - [3]
- R. P. Brent. A linear algorithm for data compression. The Australian Computer Journal, 19(2):64--68, 1987.
- [4]
- Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490--505, 1989.
- [5]
- Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163--180, 1985.
- [6]
- Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, 37(3):464--473, 1990.
- [7]
- Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262--272, 1976.
- [8]
- James A. Storer.
*Data Compression: Methods and Theory*. Computer Science Press, Rockville, MD., 1988. - [9]
- James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the Association for Computing Machinery, 29(4):928--951, 1982.
- [10]
- Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the Association for Computing Machinery, 34(4):825--845, 1987.
- [11]
- Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158--167, 1989.
- [12]
- Terry A. Welch. A technique for high-performance data compression. IEEE Computer}, 17(6):8--19, 1984.
- [13]
- Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520--540, 1987.
- [14]
- Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337--343, 1977.
- [15]
- Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530--536, 1978.
- [16]
- Edward N. Zalta. Are algorithms patentable? Notices of the American Mathematical Society, 35(6):796--799, 1988.

Haruhiko Okumura,

okumura@matsusaka-u.ac.jp

Last modified: Tue Mar 17 17:02:03 1998