---------------------------------------------------------------
Data Compression Algorithms of LARC and LHarc
Haruhiko Okumura*
*The author is with Matsusaka University, Matsusaka, Japan, and
is the sysop of the Science SIG of PC-VAN. His address is:
12-2-404 Green Heights, 580 Nagasawa, Yokosuka 239, Japan.
---------------------------------------------------------------
1. Introduction
In the spring of 1988, I wrote a very simple data compression
program named LZSS in C language, and uploaded it to the
Science SIG (forum) of PC-VAN, Japan's biggest personal
computer network.
That program was based on Storer and Szymanski's slightly
modified version of one of Lempel and Ziv's algorithms.
Despite its simplicity, for most files its compression
outperformed the archivers then widely used.
Kazuhiko Miki rewrote my LZSS in Turbo Pascal and assembly
language, and soon developed it into a complete archiver, which
he named LARC.
The first versions of LZSS and LARC were rather slow. So I
rewrote my LZSS using a binary tree, and so did Miki.
Although LARC's encoding was slower than that of the fastest
archivers available, its decoding was quite fast, and its
algorithm was so simple that even the self-extracting files it
created (compressed files plus decoder) were usually smaller
than non-self-extracting files from other archivers.
Soon many hobby programmers joined the archiver project at the
forum. Many suggestions were made, and LARC was revised again
and again. By the summer of 1988, LARC's speed and compression
had improved so much that LARC-compressed programs were
beginning to be uploaded to many forums of PC-VAN and other
networks.
That summer I wrote another program, LZARI, which combined the
LZSS algorithm with adaptive arithmetic compression. Although
it was slower than LZSS, its compression performance was
amazing.
Miki, the author of LARC, uploaded LZARI to NIFTY-Serve,
another big information network in Japan. There,
Haruyasu Yoshizaki replaced LZARI's adaptive arithmetic coding
with a version of adaptive Huffman coding to increase speed.
Based on this algorithm, which he called LZHUF, he developed
yet another archiver, LHarc.
In what follows, I will review several of these algorithms and
provide simplified implementations in C.
2. Simple coding methods
Replacing several (usually 8 or 4) "space" characters with one
"tab" character is a very primitive method of data compression.
Another simple method is run-length coding, which encodes, for
example, the message "AAABBBBAACCCC" as "3A4B2A4C".
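As an illustration, here is a minimal run-length encoder for
this format. (The function rle_encode and the single-digit
counts are my own; a run longer than nine is simply split into
several runs.)

    #include <stdio.h>

    /* Minimal run-length encoder for the "3A4B2A4C" format above.
       Counts are single decimal digits, so runs longer than nine
       are split into several runs. */
    void rle_encode(const char *s, FILE *out)
    {
        while (*s) {
            int n = 1;
            while (s[n] == s[0] && n < 9) n++;
            fprintf(out, "%d%c", n, s[0]);
            s += n;
        }
    }

    int main(void)
    {
        rle_encode("AAABBBBAACCCC", stdout);  /* prints 3A4B2A4C */
        putchar('\n');
        return 0;
    }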
3. LZSS coding
This scheme was initiated by Ziv and Lempel [1]. A slightly
modified version was described by Storer and Szymanski [2]. An
implementation using a binary tree was proposed by Bell [3].
The algorithm is quite simple: Keep a ring buffer, which
initially contains "space" characters only. Read several
letters from the file into the buffer. Then search the buffer
for the longest string that matches the letters just read, and
send its length and position in the buffer.
If the buffer size is 4096 bytes, the position can be encoded
in 12 bits. If we represent the match length in four bits, the
(position, length) pair is two bytes long. If the longest
match is no more than two characters, then we send just one
character without encoding, and restart the process with the
next letter. We must send one extra bit each time to tell the
decoder whether we are sending a (position, length) pair or an
unencoded character.
The accompanying file LZSS.C is a version of this algorithm.
This implementation uses multiple binary trees to speed up the
search for the longest match. All the programs in this article
are written in draft-proposed ANSI C. I tested them with Turbo
C 2.0.
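To make the framing concrete, here is a sketch of one encoding
step. (put_bits and emit are my own illustrative helpers, not
taken from LZSS.C; the bits are printed as '0'/'1' text so that
the format is visible, whereas a real coder packs them into
bytes.)

    #include <stdio.h>

    #define THRESHOLD 2   /* matches this short are sent as characters */

    /* Illustrative bit writer: prints bits as text. */
    static void put_bits(unsigned code, int nbits)
    {
        while (nbits-- > 0)
            putchar(((code >> nbits) & 1) ? '1' : '0');
    }

    /* One encoding step: a character costs 1 + 8 = 9 bits, a
       (position, length) pair costs 1 + 12 + 4 = 17 bits. */
    static void emit(int match_len, int match_pos, int next_char)
    {
        if (match_len <= THRESHOLD) {
            put_bits(1, 1);                    /* flag: character */
            put_bits((unsigned)next_char, 8);
        } else {
            put_bits(0, 1);                    /* flag: pair */
            put_bits((unsigned)match_pos, 12); /* 12-bit position */
            put_bits((unsigned)(match_len - (THRESHOLD + 1)), 4);
                                               /* lengths 3..18 -> 0..15 */
        }
    }

    int main(void)
    {
        emit(1, 0, 'A');   /* too short to pay for a pair: send 'A' raw */
        putchar('\n');
        emit(5, 123, 0);   /* 5-character match at buffer position 123 */
        putchar('\n');
        return 0;
    }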
4. LZW coding
This scheme was devised by Ziv and Lempel [4], and modified by
Welch [5].
LZW coding has been adopted by most existing archivers, such as
ARC and PKZIP. The algorithm can be made relatively fast, and
is suitable for hardware implementation as well.
The algorithm can be outlined as follows: Prepare a table that
can contain several thousand items. Initially register in its
0th through 255th positions the usual 256 characters. Read
several letters from the file to be encoded, and search the
table for the longest match. Suppose the longest match is
given by the string "ABC". Send the position of "ABC" in the
table. Read the next character from the file. If it is "D",
then register a new string "ABCD" in the table, and restart the
process with the letter "D". If the table becomes full,
discard the oldest item or, preferably, the least used.
A Pascal program for this algorithm is given in Storer's book
[6].
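As a sketch of the encoder side, the following program follows
the outline above under some simplifying assumptions of my own:
a fixed 4096-entry table searched linearly, codes written out
as decimal numbers, and no replacement policy when the table
fills.

    #include <stdio.h>
    #include <string.h>

    #define TABLE_SIZE 4096

    /* Each table entry above 255 is a known string, stored as the
       code of its prefix plus its last character. */
    static int prefix[TABLE_SIZE];
    static int suffix[TABLE_SIZE];
    static int table_len;

    /* Return the code of (string coded by pfx) + c, or -1. */
    static int find(int pfx, int c)
    {
        int i;
        for (i = 256; i < table_len; i++)
            if (prefix[i] == pfx && suffix[i] == c)
                return i;
        return -1;
    }

    static void lzw_encode(const unsigned char *s, long n)
    {
        long i;
        int cur, nxt;

        table_len = 256;            /* codes 0..255: single characters */
        cur = s[0];
        for (i = 1; i < n; i++) {
            nxt = find(cur, s[i]);
            if (nxt >= 0) {         /* longest match can be extended */
                cur = nxt;
            } else {
                printf("%d ", cur); /* send position of longest match */
                if (table_len < TABLE_SIZE) {
                    prefix[table_len] = cur;   /* register new string */
                    suffix[table_len] = s[i];
                    table_len++;
                }
                cur = s[i];         /* restart with unmatched character */
            }
        }
        printf("%d\n", cur);
    }

    int main(void)
    {
        const char *msg = "ABABABA";
        lzw_encode((const unsigned char *)msg, (long)strlen(msg));
        /* prints: 65 66 256 257 66 */
        return 0;
    }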
5. Huffman coding
Classical Huffman coding was invented by Huffman [7]. A fairly
readable account is given in Sedgewick [8].
Suppose the text to be encoded is "ABABACA", with four A's, two
B's, and a C. We represent this situation as follows:
    4    2    1
    |    |    |
    A    B    C
Combine the two least frequent characters into one, resulting
in the new frequency 2 + 1 = 3:
    4      3
    |     / \
    A    B   C
Repeat the above step until all the characters combine into a
single tree:
        7
       / \
      /   3
     /   / \
    A   B   C
Start at the top ("root") of this encoding tree, and travel to
the character you want to encode. If you go left, send a "0";
otherwise send a "1". Thus, "A" is encoded by "0", "B" by
"10", "C" by "11". Algotether, "ABABACA" will be encoded into
ten bits, "0100100110".
To decode this code, the decoder must know the encoding tree,
which must be sent separately.
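The construction can be sketched as follows. This little
program builds the tree for the frequencies above (a simple
O(n^2) scan picks the two least frequent nodes; real coders use
a heap) and prints the same codes as derived in the text:

    #include <stdio.h>

    #define NSYM    3               /* A, B, C with frequencies 4, 2, 1 */
    #define MAXNODE (2 * NSYM - 1)

    static int freq[MAXNODE]  = { 4, 2, 1 };
    static int alive[MAXNODE] = { 1, 1, 1 };
    static int lchild[MAXNODE], rchild[MAXNODE];
    static const char *names  = "ABC";

    /* Index of the least frequent live node, excluding `skip`. */
    static int pick_min(int nnodes, int skip)
    {
        int i, best = -1;
        for (i = 0; i < nnodes; i++)
            if (alive[i] && i != skip && (best < 0 || freq[i] < freq[best]))
                best = i;
        return best;
    }

    /* Walk the tree, accumulating "0" for left, "1" for right. */
    static void print_codes(int node, char *buf, int depth)
    {
        if (node < NSYM) {          /* nodes 0..NSYM-1 are the leaves */
            buf[depth] = '\0';
            printf("%c: %s\n", names[node], buf);
            return;
        }
        buf[depth] = '0'; print_codes(lchild[node], buf, depth + 1);
        buf[depth] = '1'; print_codes(rchild[node], buf, depth + 1);
    }

    int main(void)
    {
        int nnodes = NSYM;
        char buf[MAXNODE];

        while (nnodes < MAXNODE) {  /* combine until one tree remains */
            int a = pick_min(nnodes, -1);       /* least frequent */
            int b = pick_min(nnodes, a);        /* second least   */
            freq[nnodes]   = freq[a] + freq[b];
            lchild[nnodes] = b;     /* less frequent node goes right,  */
            rchild[nnodes] = a;     /* which reproduces the tree above */
            alive[a] = alive[b] = 0;
            alive[nnodes] = 1;
            nnodes++;
        }
        print_codes(nnodes - 1, buf, 0);  /* prints A: 0, B: 10, C: 11 */
        return 0;
    }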
A modification of this classical Huffman coding is adaptive, or
dynamic, Huffman coding. See, e.g., Gallager [9]. In this
method, the encoder and the decoder process the first letter of
the text as if the frequency of each character in the file were
one, say. After the first letter has been processed, both
parties increment the frequency of that character by one. For
example, if the first letter is 'C', then freq['C'] becomes
two, whereas every other frequency is still one. Then both
parties modify the encoding tree accordingly. Then the second
letter will be encoded and decoded, and so on.
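The essential point is that the two sides update their
statistics identically, so their trees never diverge. The
sketch below shows only that bookkeeping; rebuilding the tree
from the frequencies after every step (as the previous listing
does once) would be correct but slow, and practical adaptive
coders such as Gallager's [9] update the tree incrementally
instead.

    #include <stdio.h>

    /* Sketch of the adaptive protocol only: encoder and decoder each
       keep a frequency table, initialized to all ones, and update it
       the same way after every character, so both always derive the
       same encoding tree.  The actual coding steps are omitted. */
    static int enc_freq[256], dec_freq[256];

    int main(void)
    {
        const char *text = "CACAO";
        int i, c;

        for (c = 0; c < 256; c++)
            enc_freq[c] = dec_freq[c] = 1;  /* "as if each were one" */

        for (i = 0; text[i] != '\0'; i++) {
            c = (unsigned char)text[i];
            /* encoder codes c with its current tree and sends it... */
            enc_freq[c]++;
            /* ...decoder decodes c with its identical tree... */
            dec_freq[c]++;
            /* ...and both adjust their trees accordingly. */
            printf("after '%c': freq['%c'] = %d on both sides\n",
                   c, c, enc_freq[c]);
        }
        return 0;
    }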
6. Arithmetic coding
The original concept of arithmetic coding was proposed by
P. Elias. An implementation in C language is described by
Witten and others [10].
Although Huffman coding is optimal if each character must be
encoded into a fixed (integer) number of bits, arithmetic
coding wins if no such restriction is made.
As an example we shall encode "AABA" using arithmetic coding.
For simplicity suppose we know beforehand that the
probabilities for "A" and "B" to appear in the text are 3/4 and
1/4, respectively.
Initially, consider an interval:
0 <= x < 1.
Since the first character is "A" whose probability is 3/4, we
shrink the interval to the lower 3/4:
0 <= x < 3/4.
The next character is "A" again, so we take the lower 3/4:
0 <= x < 9/16.
Next comes "B" whose probability is 1/4, so we take the upper
1/4:
27/64 <= x < 9/16,
because "B" is the second element in our alphabet, {A, B}. The
last character is "A" and the interval is
27/64 <= x < 135/256,
which can be written in binary notation as
0.011011 <= x < 0.10000111.
Choose from this interval any number that can be represented in
the fewest bits, say 0.1, and send the bits to the right of "0.";
in this case we send only one bit, "1". Thus we have encoded
four letters into one bit! With the Huffman coding, four
letters could not be encoded into less than four bits.
To decode the code "1", we just reverse the process: First, we
prefix the received code "1" with "0.", resulting in "0.1" in
binary notation, or 1/2. Since this
number is in the first 3/4 of the initial interval 0 <= x < 1,
the first character must be "A". Shrink the interval into the
lower 3/4. In this new interval, the number 1/2 lies in the
lower 3/4 part, so the second character is again "A", and so on.
The number of letters in the original file must be sent
separately (or a special 'EOF' character must be appended at
the end of the file).
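The interval arithmetic of this example is easy to reproduce.
In the sketch below, doubles happen to be exact because every
bound here is a dyadic rational; a real arithmetic coder works
with scaled integers and renormalizes to keep from running out
of precision.

    #include <stdio.h>

    int main(void)
    {
        double lo = 0.0, hi = 1.0;
        const char *msg = "AABA";
        int i;

        /* P(A) = 3/4, P(B) = 1/4, alphabet order {A, B}. */
        for (i = 0; msg[i] != '\0'; i++) {
            double width = hi - lo;
            if (msg[i] == 'A')
                hi = lo + width * 0.75;   /* "A": lower 3/4 */
            else
                lo = lo + width * 0.75;   /* "B": upper 1/4 */
            printf("after '%c': %.8f <= x < %.8f\n", msg[i], lo, hi);
        }
        /* Final interval is [27/64, 135/256); binary 0.1 = 1/2 lies
           inside it, so the single bit "1" encodes all of "AABA". */
        return 0;
    }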
The algorithm described above requires that both the sender and
receiver know the probability distribution for the characters.
The adaptive version of the algorithm removes this restriction
by starting from a uniform distribution, or any agreed-upon
distribution that approximates the true one, and then updating
the distribution after each character is sent and received.
7. LZARI
In each step the LZSS algorithm sends either a character or a
(position, length) pair. Among these, perhaps character "e"
appears more frequently than "x", and a pair of length 3 might
be commoner than one of length 18, say.
Thus, if we encode the more frequent in fewer bits and the less
frequent in more bits, the total length of the encoded text
will be diminished. This consideration suggests that we use
Huffman or arithmetic coding, preferably of adaptive kind,
along with LZSS.
This is easier said than done, because there are many possible
combinations. Adaptive compression must keep running statistics
of the frequency distribution, and too many distinct items make
those statistics unreliable.
What follows is not even an approximate solution to the problem
posed above, but anyway this was what I did in the summer of
1988.
I extended the character set from 256 to three hundred or so in
size, letting characters 0 through 255 be the usual 8-bit
characters, whereas character 253 + n signals that what follows
is the position of a string of length n, where n = 3, 4, ....
This extended character set is encoded with adaptive arithmetic
compression.
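In code, the mapping between match lengths and extended symbols
looks like this (the helper names are mine, and the constants
follow the description above rather than LZARI.C itself):

    #include <stdio.h>

    /* A literal byte c is sent as symbol c (0..255); a match of
       length n >= 3 is sent as symbol 253 + n, followed by the
       separately coded buffer position.  The whole symbol stream
       then goes through the adaptive arithmetic coder. */
    static int symbol_for_match(int n) { return 253 + n; }
    static int match_length(int sym)   { return sym - 253; }

    int main(void)
    {
        printf("match of length 3  -> symbol %d\n", symbol_for_match(3));
        printf("match of length 10 -> symbol %d\n", symbol_for_match(10));
        printf("symbol 260 decodes to a match of length %d\n",
               match_length(260));
        return 0;
    }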
I also observed that longest-match strings tend to be the ones
that were read relatively recently. Therefore, recent
positions should be encoded into fewer bits. Since 4096
positions are too many to encode adaptively, I fixed the
probability distribution of the positions "by hand." The
distribution function given in the accompanying LZARI.C is
rather tentative; it is not based on thorough experimentation.
In retrospect, I could have encoded the most significant six
bits adaptively, say, or perhaps adapted the parameters of the
distribution function to the running statistics by some more
ingenious method.
At any rate, the present version of LZARI treats the positions
rather separately, so that the overall compression is by no
means optimal. Furthermore, the string length threshold above
which strings are coded into (position, length) pairs is fixed,
but logically its value should change according to the coded
length of the pair we would get.
8. LZHUF
LZHUF, the algorithm of Haruyasu Yoshizaki's archiver LHarc,
replaces LZARI's adaptive arithmetic coding with adaptive
Huffman. LZHUF encodes the most significant 6 bits of the
position in its 4096-byte buffer by table lookup. More recent,
and hence more probable, positions are coded in fewer bits,
while the remaining 6 bits are sent verbatim. Because a static
Huffman code assigns each value a fixed string of bits, this
table lookup is easy to implement.
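The position coding can be sketched as follows. The code
lengths below are a made-up stand-in, chosen only to satisfy
the prefix (Kraft) condition while giving recent positions
shorter codes; the actual tables are in LZHUF.C.

    #include <stdio.h>

    /* Hypothetical code length for each value of the upper 6 bits.
       (4 codes of length 4, 12 of length 6, 48 of length 7:
       4/16 + 12/64 + 48/128 <= 1, so a prefix code exists.) */
    static int upper_len(int upper)
    {
        if (upper < 4)  return 4;   /* most recent 256 bytes */
        if (upper < 16) return 6;
        return 7;                   /* oldest positions */
    }

    static void send_position(int pos)  /* 0..4095, 0 = most recent */
    {
        int upper = pos >> 6;       /* coded by table lookup */
        int lower = pos & 0x3F;     /* sent verbatim */
        printf("position %4d: %d-bit code for %2d, then 6 raw bits"
               " for %2d\n", pos, upper_len(upper), upper, lower);
    }

    int main(void)
    {
        send_position(10);          /* very recent: 4 + 6 = 10 bits */
        send_position(2000);        /* older: 7 + 6 = 13 bits */
        return 0;
    }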
Though in theory Huffman coding cannot outdo arithmetic
compression, the difference is very slight, and LZHUF is fairly
fast.
The accompanying file LZHUF.C was written by Yoshizaki. I
translated the comments into English and made a few trivial
changes to make it conform to the ANSI C standard.
References
[1] J. Ziv and A. Lempel, IEEE Trans. IT-23, 337-343 (1977).
[2] J. A. Storer and T. G. Szymanski, J. ACM 29, 928-951
(1982).
[3] T. C. Bell, IEEE Trans. COM-34, 1176-1182 (1986).
[4] J. Ziv and A. Lempel, IEEE Trans. IT-24, 530-536 (1978).
[5] T. A. Welch, Computer, 17, No.6, 8-19 (1984).
[6] J. A. Storer, Data Compression: Methods and Theory
(Computer Science Press, 1988).
[7] D. A. Huffman, Proc. IRE 40, 1098-1101 (1952).
[8] R. Sedgewick, Algorithms, 2nd ed. (Addison-Wesley, 1988).
[9] R. G. Gallager, IEEE Trans. IT-24, 668-674 (1978).
[10] I. H. Witten, R. M. Neal, and J. G. Cleary, Commun. ACM
30, 520-540 (1987).