History of Data Compression in Japan

In 1987, I was asked by a magazine editor to write an article about data compression. I wrote a manuscript and an accompanying program, sent them to the editor, and forgot about them. The next time I heard from him, I was told that the magazine had been discontinued. So I uploaded my program, a simple implementation of the LZSS compression algorithm (see below), to PC-VAN, a big Japanese BBS run by NEC. That was May 1, 1988.

Soon a number of hobby programmers gathered and began improving on that program. The project culminated in Kazuhiko Miki's archiver LArc, which was fairly widely used in Japan. (Dr. Miki was then a medical specialist working at a governmental office. I heard he later left that office and began working on freeware/shareware promotion.)

The LZSS algorithm is based on a very simple idea. Suppose I'm going to write "compression" here. But probably I've already used that word before in this file. If I used that word 57 characters before, I might as well write "go 57 characters backward, and read 11 characters," or <57,11> for short. In general, whenever a string of characters has already appeared within the most recent 4096 characters, say, I encode the string as a <position,length> pair.
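
The following is only a rough sketch of this idea in C, with brute-force matching and made-up thresholds and output format; it is not the program described above:

    /* Rough LZSS-style encoder: brute-force greedy matching over a
     * 4096-byte window.  Only the <position,length> idea is shown; the
     * parameters and output format are purely illustrative. */
    #include <stdio.h>
    #include <string.h>

    #define WINDOW  4096   /* how far back a match may reach */
    #define MAX_LEN 18     /* longest match we bother to encode */
    #define MIN_LEN 3      /* shorter matches are cheaper as plain characters */

    static void encode(const unsigned char *buf, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            size_t best_len = 0, best_dist = 0;
            size_t start = (i > WINDOW) ? i - WINDOW : 0;

            for (size_t j = start; j < i; j++) {        /* scan the window */
                size_t len = 0;
                while (len < MAX_LEN && i + len < n
                       && buf[j + len] == buf[i + len])
                    len++;
                if (len > best_len) { best_len = len; best_dist = i - j; }
            }
            if (best_len >= MIN_LEN) {                  /* emit <position,length> */
                printf("<%zu,%zu>", best_dist, best_len);
                i += best_len;
            } else {                                    /* emit a single character */
                putchar(buf[i++]);
            }
        }
    }

    int main(void)
    {
        const char *s = "compression is useful; compression is fun";
        encode((const unsigned char *)s, strlen(s));
        putchar('\n');
        return 0;
    }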

In Storer's [8] terminology, this is a sliding dictionary algorithm, analyzed first by Ziv and Lempel [14] and then by Storer and Szymanski [9], among others.

Later versions of my LZSS implementations and Miki's LArc used binary search trees to make string search faster; see Bell [1].

Incidentally, there are two distinct Ziv-Lempel (LZ) methods: sliding dictionary [14] and dynamic dictionary [15] in Storer's [8] terminology. The LZW algorithm [12] belongs to the latter. Most pre-LHarc compression tools, such as 'compress', 'ARC', and 'PKARC', used LZW.

During the summer of 1988, I wrote another compression program, LZARI. This program is based on the following observation: Each output of LZSS is either a single character or a <position,length> pair. A single character can be coded as an integer between 0 and 255. As for the <length> field, if the range of <length> is 2 to 257, say, it can be coded as an integer between 256 and 511. Thus, I can say that there are 512 kinds of "characters," and the "characters" 256 through 511 are accompanied by a <position> field. These 512 "characters" can be Huffman-coded, or better still, arithmetic-coded. The <position> field can be coded in the same manner. In LZARI I used adaptive arithmetic compression [13], [2] to encode the "characters," and static arithmetic compression to encode the <position> field. (There were several versions of LZARI; some of them were slightly different from the above description.) The compression of LZARI was very tight, though rather slow.
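
In C, the mapping onto this single 512-symbol alphabet looks like the following sketch. The routines emit_symbol() and emit_position() are placeholders for the entropy coder and simply print their arguments; they are not taken from the LZARI source:

    /* Mapping LZSS output onto one 512-symbol alphabet: a literal byte
     * keeps its value 0..255, a match length 2..257 becomes symbol
     * 256..511 and is followed by a position.  emit_symbol() and
     * emit_position() stand in for the entropy coder. */
    #include <stdio.h>

    #define LITERALS  256
    #define MIN_MATCH 2      /* lengths 2..257 map to symbols 256..511 */

    static void emit_symbol(int sym)   { printf("symbol %d\n", sym); }
    static void emit_position(int pos) { printf("position %d\n", pos); }

    static void put_literal(unsigned char c)
    {
        emit_symbol(c);                              /* 0 .. 255 */
    }

    static void put_match(int pos, int len)
    {
        emit_symbol(LITERALS + (len - MIN_MATCH));   /* 256 .. 511 */
        emit_position(pos);                          /* coded separately */
    }

    int main(void)
    {
        put_literal('a');      /* a single character          */
        put_match(57, 11);     /* the <57,11> pair from above */
        return 0;
    }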

Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer, worked very hard to make LZARI faster. Most importantly, he replaced LZARI's arithmetic coding with dynamic Huffman coding.

His program, LZHUF, was very successful. It was much faster than my LZARI. As for compression ratio, Huffman coding cannot beat arithmetic coding, but the difference turned out to be very small.

Yoshi rewrote the compression engine of LZHUF in assembler and added a nifty user interface. His archiver, LHarc, soon became the de facto standard among Japanese BBS users. After Prof. Kenjirou Okubo, a mathematician, introduced LHarc to the United States, it became world-famous. Other vendors began using similar techniques: a sliding dictionary plus statistical compression such as Huffman or Shannon-Fano coding. (I wondered why they used Shannon-Fano rather than Huffman, which never compresses worse than Shannon-Fano. As it turned out, a then-popular book on compression published in the U.S. contained an incorrect description and buggy sample programs, such that Shannon-Fano outperformed the (buggy) Huffman code on many files.)

Although LHarc was much faster than LZARI, we weren't quite satisfied with its speed. Because LHarc was based on dynamic Huffman coding, it had to update its Huffman tree every time it received a character. Yoshi and I tried other dynamic Huffman algorithms [5], [10], [11], but the improvements were not as great as we desired.

So I took a different approach: replacing LHarc's dynamic Huffman coding with a static Huffman method.

The traditional static Huffman coding algorithm first scans the input file to count the character distribution, then builds the Huffman tree and encodes the file. In my approach, the input file is read only once. It is first compressed by a sliding dictionary method, as in LZARI and LHarc, and at the same time the distributions of the "characters" (see above) and positions are counted. The output of this process is stored in main memory. When the buffer in memory is full (or the input is exhausted), the Huffman trees are constructed, and the half-processed content of the buffer is actually compressed and output.
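
The following sketch shows the shape of this scheme (it is not the actual source of my archiver; the token layout, buffer size, and the bit-length bucketing of positions are chosen only for illustration):

    /* Block-wise scheme: LZ tokens accumulate in memory while the
     * "character" and position distributions are counted; when the
     * buffer is full (or the input ends) the block is flushed.  A real
     * flush_block() would build the Huffman trees from the counts,
     * write the code-length tables, and re-encode the buffered tokens;
     * this one only reports the counts and resets. */
    #include <stdio.h>
    #include <stddef.h>

    #define BUF_TOKENS 16384
    #define LITERALS   256

    struct token { int sym; int pos; };   /* sym < 256: literal, else match */

    static struct token buf[BUF_TOKENS];
    static size_t       nbuf;
    static unsigned     sym_freq[512];    /* distribution of the 512 "characters" */
    static unsigned     pos_freq[17];     /* positions bucketed by bit length */

    static int bit_length(int v)          /* number of significant bits in v */
    {
        int b = 0;
        while (v) { b++; v >>= 1; }
        return b;
    }

    static void flush_block(void)
    {
        printf("block of %lu tokens ready for Huffman coding\n",
               (unsigned long)nbuf);
        nbuf = 0;
        for (int i = 0; i < 512; i++) sym_freq[i] = 0;
        for (int i = 0; i < 17;  i++) pos_freq[i] = 0;
    }

    static void put_token(int sym, int pos)  /* fed by the sliding dictionary */
    {
        buf[nbuf].sym = sym;
        buf[nbuf].pos = pos;
        sym_freq[sym]++;
        if (sym >= LITERALS) pos_freq[bit_length(pos)]++;
        if (++nbuf == BUF_TOKENS) flush_block();
    }

    int main(void)
    {
        put_token('a', 0);              /* a literal                     */
        put_token(LITERALS + 9, 57);    /* the <57,11> match, symbol 265 */
        flush_block();                  /* input exhausted               */
        return 0;
    }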

In static Huffman, the Huffman tree must be stored in the compressed file. In the traditional approach this information consumes hundreds of bytes. My approach was to standardize Huffman trees so that (1) each left subtree is no deeper than its right counterpart, and (2) the leaves at the same level are sorted in ascending order. In this way the Huffman tree can be uniquely specified by the lengths of the codewords. Moreover, the resulting table is again compressed by the same Huffman algorithm.
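
Because of this standardization, the decoder can rebuild exactly the same codes from the length table alone. The sketch below uses the common canonical-code convention (codes of each length are consecutive, and shorter codes come first); the actual table layout in my programs may differ in detail:

    /* Canonical code assignment: given only the codeword length of each
     * symbol, rebuild the codes so that codes of each length are
     * consecutive and shorter codes come first.  Encoder and decoder
     * run the same routine, so only the length table is transmitted. */
    #include <stdio.h>

    #define NSYM     8
    #define MAX_BITS 16

    static void canonical_codes(const int len[], unsigned code[], int nsym)
    {
        int      count[MAX_BITS + 1] = { 0 };
        unsigned next[MAX_BITS + 1];
        unsigned c = 0;

        for (int s = 0; s < nsym; s++) count[len[s]]++;
        count[0] = 0;                           /* length 0 = unused symbol */
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            c = (c + count[bits - 1]) << 1;     /* first code of this length */
            next[bits] = c;
        }
        for (int s = 0; s < nsym; s++)          /* hand out codes in symbol order */
            code[s] = len[s] ? next[len[s]]++ : 0;
    }

    int main(void)
    {
        int      len[NSYM] = { 3, 3, 3, 3, 3, 2, 4, 4 };  /* example lengths */
        unsigned code[NSYM];

        canonical_codes(len, code, NSYM);
        for (int s = 0; s < NSYM; s++) {
            printf("symbol %d: length %d, code ", s, len[s]);
            for (int b = len[s] - 1; b >= 0; b--)
                putchar(((code[s] >> b) & 1) ? '1' : '0');
            putchar('\n');
        }
        return 0;
    }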

To make the decoding program simpler, the Huffman tree is adjusted so that the codeword lengths do not exceed 16 bits. Since this adjustment is rarely needed, the algorithm is kept very simple. It does not create optimal length-limited Huffman trees; see e.g. [6] for an optimal algorithm. Incidentally, my early program had a bug here, which was quickly pointed out and corrected by Yoshi.
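
As an illustration only (this is not the adjustment my program actually performs), a very simple, non-optimal way to enforce such a limit is to clamp the overlong codeword lengths to 16 bits and then lengthen other codes until the Kraft inequality holds again:

    /* Simple, non-optimal length limiting: clamp every overlong code to
     * 16 bits, then, while the Kraft inequality is violated, lengthen
     * the longest code that is still below the limit.  The result is
     * always a valid set of lengths (assuming at most 2^16 symbols),
     * just not an optimal one. */
    #include <stdio.h>

    #define LIMIT 16

    static void limit_lengths(int len[], int nsym)
    {
        unsigned long kraft = 0;

        for (int s = 0; s < nsym; s++) {
            if (len[s] > LIMIT) len[s] = LIMIT;     /* clamp overlong codes */
            if (len[s] > 0) kraft += 1UL << (LIMIT - len[s]);
        }
        while (kraft > (1UL << LIMIT)) {            /* Kraft inequality violated */
            int pick = -1;
            for (int s = 0; s < nsym; s++)          /* longest code below the limit */
                if (len[s] > 0 && len[s] < LIMIT &&
                    (pick < 0 || len[s] > len[pick]))
                    pick = s;
            kraft -= 1UL << (LIMIT - len[pick] - 1);  /* one more bit halves its share */
            len[pick]++;
        }
    }

    int main(void)
    {
        /* a degenerate Huffman tree (Fibonacci-like frequencies) whose
           two deepest codes exceed 16 bits */
        int len[18] = { 1, 2, 3, 4, 5, 6, 7, 8, 9,
                        10, 11, 12, 13, 14, 15, 16, 17, 17 };

        limit_lengths(len, 18);
        for (int s = 0; s < 18; s++) printf("%d ", len[s]);
        printf("\n");
        return 0;
    }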

The sliding dictionary algorithm was also improved by Yoshi using a "PATRICIA tree" data structure; see McCreight [7] and Fiala and Greene [4].

After completing my algorithm, I learned that Brent [3] had also used a sliding dictionary plus Huffman coding. His method, SLH, is simple and elegant, but since it doesn't find the most recent longest match, the distribution of match positions becomes flat. This makes the second-stage Huffman compression less efficient.

On the basis of these new algorithms, Yoshi began to rewrite his LHarc, but it took him so long (remember he was a busy doctor!) that I decided to write my own archiver. My archiver was quite recklessly named 'ar'. (Actually I appended version numbers as in 'ar002' for version 0.02.) I should have named it 'har' (after my name), say, because 'ar' collides with the name of UNIX's archiver. I didn't want my program to compete with LHarc, but I wanted many people to try the algorithm, so I wrote it in pure ANSI C. This is the reason 'ar' lacked many bells and whistles necessary for a real archiver.

Note: The version of 'ar002' most often found in the U.S. had a bug. Line 24 of maketbl.c should read, of course,
    while (i <= 16) {
        weight[i] = 1U << (16 - i);  i++;
    }
Somehow the bug didn't show up when the program was compiled with Turbo C.

Yoshi finally showed us his new archiver written in C. It was tentatively named LHx. He then rewrote the main logic in assembler. Yoshi and I wrote an article describing his new archiver, which would be named LH, for the January 1991 issue of "C Magazine" (in Japanese). The suffix 'arc' of LHarc was deliberately dropped because the people who sold ARC did not want others to use that name.

Then we learned that in the new DOS 5.0, LH meant LoadHigh, an internal command. We decided to rename LH to LHA.

Also, I was told that the algorithm described in Fiala and Greene [4] had been patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990). Actually they got three patents! The other two were "Start, Step, Stop Unary Encoding for Data Compression" (Application Ser. No. 07/187,697) and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems" (Application Ser. No. 07/187,699).

Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, U.S. Patent 4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, no longer uses the 'compress' program because its LZW algorithm is patented.

Are algorithms patentable? (See [16].) If these patents are taken seriously, all compression programs now in use may infringe some of them. (Luckily, not all claims made by those patents seem to be valid.)


The foregoing is a slight modification of what I wrote in 1991. The year 1991 was a very busy year for me. In 1992, I joined the faculty of Matsusaka University. This opportunity should have given me more free time, but as it turned out I got ever busier. I stopped hacking on my compression algorithms; so did Yoshi.

Luckily, all the good things in LHA were taken over, and all the bad things abandoned, by the great new archiver zip and the compression tool gzip. I admire the efforts of Jean-loup Gailly and others.

A brief historical comment on PKZIP: At one time a programmer for PK and I were in close contact. We exchanged a lot of ideas. No wonder PKZIP and LHA are so similar.

Another historical comment: LHICE and ICE were definitely not written by Yoshi (or by me, or by anyone I know). I think they are fake versions of LHarc.

REFERENCES

[1]
Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on Communications, COM-34(12):1176--1182, 1986.
[2]
Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, 1990.
[3]
R. P. Brent. A linear algorithm for data compression. The Australian Computer Journal, 19(2):64--68, 1987.
[4]
Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490--505, 1989.
[5]
Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163--180, 1985.
[6]
Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, 37(3):464--473, 1990.
[7]
Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262--272, 1976.
[8]
James A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD., 1988.
[9]
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the Association for Computing Machinery, 29(4):928--951, 1982.
[10]
Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the Association for Computing Machinery, 34(4):825--845, 1987.
[11]
Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158--167, 1989.
[12]
Terry A. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8--19, 1984.
[13]
Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520--540, 1987.
[14]
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337--343, 1977.
[15]
Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530--536, 1978.
[16]
Edward N. Zalta. Are algorithms patentable? Notices of the American Mathematical Society, 35(6):796--799, 1988.

Haruhiko Okumura,
okumura@matsusaka-u.ac.jp

Last modified: Tue Mar 17 17:02:03 1998