Baby's first compression scheme
Run-Length Encoding is super simple to put in place and requires next to nothing in terms of memory or preparation. It comes in a multitude of colors and shapes but always focuses on the same core idea: identify and shorten sequences of repeating symbols. It was often used as an early image compression technique (in ZSoft's PCX, a popular format nobody seems to remember) and was also common for shrinking tile-based 2D map data (it was used in games such as Rise of the Triad, Wolfenstein 3D, Commander Keen, and Jazz Jackrabbit to name a few.) At one point, this compression scheme was even used for video files! The image quality was reduced in order to maximize gains from the RLE algorithm, and it looked absolutely horrendous. Good times.
RLE is not fit for text or random data and will often produce negative gains with that type of input (the compressed stream is longer than the initial data it was fed, which sort of quote-unquote "defeats" the purpose a little bit if I may say so myself;) For maximum gainz, we're going to focus our efforts on this tiny 8x8 image:
Assuming each pixel takes 1 byte of data, the image data would be 64 bytes long. Now, what if we wrote a very simple encoding scheme that would take 1 byte for the run length, and a second byte for the pixel color?
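Sketched out in QuickBASIC, that naive encoder could look like this (the function name rleNaive$ is mine, and it assumes the pixel data comes in as a string of palette indices, one byte per pixel, fed one scanline at a time; runs longer than 255 pixels would need splitting, but that's a non-issue on an 8x8 image:)
''
'' naive RLE compression: always write a (length, color) pair
''
FUNCTION rleNaive$ (inData AS STRING)
DIM offset AS INTEGER, count AS INTEGER, ref AS INTEGER
DIM outData AS STRING
offset = 1
DO
' read reference byte and count how many times it repeats
ref = ASC(MID$(inData, offset, 1))
count = 0
DO WHILE (ASC(MID$(inData, offset + count, 1)) = ref)
count = count + 1
IF ((offset + count) > LEN(inData)) THEN EXIT DO
LOOP
' one byte for the run length, one byte for the color
outData = outData + CHR$(count) + CHR$(ref)
offset = offset + count
LOOP UNTIL (offset > LEN(inData))
rleNaive$ = outData
END FUNCTION
Applied to our 8x8 image, one scanline at a time, here's what comes out: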
6 bytes: 2x Blue, 5x Yellow, 1x Blue
12 bytes: 1x Blue, 2x Yellow, 1x Black, 1x Yellow, 1x Black, 2x Yellow
10 bytes: 3x Yellow, 1x Black, 1x Yellow, 1x Black, 2x Yellow
10 bytes: 1x Yellow, 1x Pink, 4x Yellow, 1x Pink, 1x Yellow
10 bytes: 3x Yellow, 1x Red, 1x Yellow, 1x Red, 2x Yellow
6 bytes: 3x Yellow, 3x Red, 2x Yellow
6 bytes: 4x Yellow, 1x Red, 3x Yellow
6 bytes: 1x Blue, 6x Yellow, 1x Blue
We get a compressed stream of 66 bytes! That's 2 bytes... worse... than the original data... Okay, maybe it's not such a good idea. In fact, it's on par with Microsoft's horrendous .BMP RLE compression scheme (and I refuse to cover such nonsense here, let's be serious.) But what went wrong exactly? The main role of Run-Length Encoding is to identify and summarize runs of repeating symbols. That's where the money's at; if our algorithm is unable to differentiate runs from isolated symbols, it's doing it wrong. Furthermore, the way we encode/signify runs has to be as tiny as possible and only used when applicable.
Now, before you go: "Mike, you chucklefuck, why are you wasting my time with this thing? It sucks!" Let me clear things up: "yes but also no." RLE is designed to shrink repeating symbols; The longer the run, the higher the gain... but we're working with a simple 8x8 image which prevents large runs from happening, thus resulting in barely any gain. We're also not allowing runs to cover multiple scanlines, which would save extra bytes. Try compressing Duke Nukem 128x90 levels (using 16-bit integers rather than 8-bit values) or 320x200 images and you'll easily notice a difference.
How TrueVision .TGA did it
When working on the TGA format, TrueVision introduced a compression scheme that used packets to describe pixel data. In a 256-color indexed image, each packet (descriptor included) was at most 129 bytes long and could either be compressed or uncompressed (raw,) depending on the packet descriptor, which contained two fields:
The seven least significant bits always provide the number of pixels to write to the target buffer, minus 1 (it is not possible to write 0 pixels, but we can write up to 128 of them,) and the most significant bit tells us whether the packet is RLE-compressed or raw. A compressed packet is only two bytes long: the packet's descriptor (one byte) followed by one value that we have to repeat multiple times. A raw packet contains the descriptor (yep, that's still one byte) followed by as many bytes of data as necessary to fill the target buffer.
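In QuickBASIC terms, pulling those two fields out of a descriptor byte would go something like this (just an illustration of the bit layout, not TrueVision's actual code:)
DIM descriptor AS INTEGER, count AS INTEGER
' example descriptor: bit 7 set (RLE,) seven low bits holding 4
descriptor = 132
' the seven low bits store the pixel count minus 1, so 5 pixels here
count = (descriptor AND 127) + 1
' bit 7 tells compressed and raw packets apart
IF (descriptor AND 128) THEN
PRINT "RLE packet: repeat the next byte"; count; "times"
ELSE
PRINT "Raw packet: copy the next"; count; "bytes as they are"
END IF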
Now let's think about that for a second. Since compressed packets are always 2 bytes long, a 2-pixel run will generate no gain... however, storing 2 pixels in a raw packet would take 3 bytes of data and produce a negative gain of 1 byte... for simplicity's sake, giving 2-pixel runs their own packet is a good idea, although some bytes can be saved here and there by allowing raw packets to absorb adjacent 2-pixel RLE packets (here's that smug asshole again for reference:)
6 bytes: (RLEx2) Blue, (RLEx5) Yellow, (RAWx1) Blue
10 bytes: (RAWx1) Blue, (RLEx2) Yellow, (RAWx3) Black-Yellow-Black, (RLEx2) Yellow
8 bytes: (RLEx3) Yellow, (RAWx3) Black-Yellow-Black, (RLEx2) Yellow
8 bytes: (RAWx2) Yellow-Pink, (RLEx4) Yellow, (RAWx2) Pink-Yellow
8 bytes: (RLEx3) Yellow, (RAWx3) Red-Yellow-Red, (RLEx2) Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
6 bytes: (RLEx4) Yellow, (RAWx1) Red, (RLEx3) Yellow
6 bytes: (RAWx1) Blue, (RLEx6) Yellow, (RAWx1) Blue
If we automatically encode runs of 2 pixels into their own packet, we get a compressed stream of 58 bytes... but if we allow raw packets to absorb neighboring 2-pixel runs, we may get the chance to merge everything into one bigger raw packet; in this case, it spares one extra byte:
6 bytes: (RLEx2) Blue, (RLEx5) Yellow, (RAWx1) Blue
9 bytes: (RAWx8) Blue-Yellow-Yellow-Black-Yellow-Black-Yellow-Yellow
8 bytes: (RLEx3) Yellow, (RAWx5) Black-Yellow-Black-Yellow-Yellow
8 bytes: (RAWx2) Yellow-Pink, (RLEx4) Yellow, (RAWx2) Pink-Yellow
8 bytes: (RLEx3) Yellow, (RAWx5) Red-Yellow-Red-Yellow-Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
6 bytes: (RLEx4) Yellow, (RAWx1) Red, (RLEx3) Yellow
6 bytes: (RAWx1) Blue, (RLEx6) Yellow, (RAWx1) Blue
We won't write code for this technique for two reasons. First, processing raw packets and writing a fast algorithm to merge packets when possible requires more effort than just counting runs; and second, the following RLE implementation offers a better compression rate and doesn't need to jump through these hoops.
How ZSoft .PCX did it
The RLE implementation used in PCX files doesn't use packets like TGA did. Instead, the decompression algorithm assumes that bytes must be copied to the target buffer as they are, unless the two most significant bits are set. In other words, if BOTH bits 7 and 6 (adding 192 to the byte value) are set, then the byte is suddenly seen as an RLE descriptor rather than raw pixel data, and the remaining 6 bits (a value of up to 63) provide the number of pixels to write to the target buffer:
Here's where it gets tricky: if we have a lonely pixel whose value is equal to or above 192, isn't the decompression algorithm going to assume it's an RLE code? Well, yes, and that's why the compressor always generates runs of 1 pixel or more with those 64 values (between 192 and 255 included.) It means that those colors will always produce two bytes of code... but it's only a negative gain when the pixel is isolated. Believe it or not, this compromise actually pays off. Back to our 8x8 image:
5 bytes: (RLEx2) Blue, (RLEx5) Yellow, Blue
8 bytes: Blue, (RLEx2) Yellow, Black, Yellow, Black, (RLEx2) Yellow
7 bytes: (RLEx3) Yellow, Black, Yellow, Black, (RLEx2) Yellow
6 bytes: Yellow, Pink, (RLEx4) Yellow, Pink, Yellow
7 bytes: (RLEx3) Yellow, Red, Yellow, Red, (RLEx2) Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
5 bytes: (RLEx4) Yellow, Red, (RLEx3) Yellow
4 bytes: Blue, (RLEx6) Yellow, Blue
Assuming that none of the color indices are equal to or greater than 192, that's only 48 bytes of information! For the sake of argument, let's assume the Red color index was equal to or greater than 192 (see the difference on lines 5 and 7, and notice how it doesn't change anything for line 6:)
5 bytes: (RLEx2) Blue, (RLEx5) Yellow, Blue
8 bytes: Blue, (RLEx2) Yellow, Black, Yellow, Black, (RLEx2) Yellow
7 bytes: (RLEx3) Yellow, Black, Yellow, Black, (RLEx2) Yellow
6 bytes: Yellow, Pink, (RLEx4) Yellow, Pink, Yellow
9 bytes: (RLEx3) Yellow, (RLEx1) Red, Yellow, (RLEx1) Red, (RLEx2) Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
6 bytes: (RLEx4) Yellow, (RLEx1) Red, (RLEx3) Yellow
4 bytes: Blue, (RLEx6) Yellow, Blue
Finally, here's the QuickBASIC shrinking routine. Read one byte from the stream and use it as a reference, then look ahead until the reference byte no longer matches the last byte read; increase a counter as you go. If the count is greater than 1 or the reference value is greater than 191, write an RLE code. If the count is 1 and the reference value is below 192, write the reference byte as is. Keep going until the whole source buffer is depleted:
''
'' PCX-style RLE compression
''
FUNCTION rleShrink$ (inData AS STRING)
DIM offset AS INTEGER, count AS INTEGER, streak AS INTEGER
DIM ref AS INTEGER, outData AS STRING
' reading offset in the source buffer
offset = 1
DO
' read reference byte
ref = ASC(MID$(inData, offset, 1))
' reset counter
count = 0
' keep reading while the reference byte is showing up
DO WHILE (ASC(MID$(inData, offset + count, 1)) = ref)
' increment counter
count = count + 1
IF ((offset + count) > LEN(inData)) THEN EXIT DO
LOOP
' advance reading offset
offset = offset + count
' run of 2 or more pixels OR ref byte is >= 192
IF ((ref > 191) OR (count > 1)) THEN
' write as many runs as necessary, at most 63 pixels at once
streak = 63
DO
IF (streak > count) THEN streak = count
outData = outData + CHR$(192 + streak) + CHR$(ref)
count = count - streak
LOOP WHILE (count)
' count is 1 AND ref byte is < 192
ELSE
' write byte as is
outData = outData + CHR$(ref)
END IF
LOOP UNTIL (offset > LEN(inData))
' return compressed data
rleShrink$ = outData
END FUNCTION
The decompression routine is even simpler... read one byte, if its two most significant bits are set (value is above 191,) read another byte and write it as many times as necessary. If either (or both) most significant bits are clear, write the byte as-is to the target buffer. Keep going until the source input is completely processed:
''
'' PCX-style RLE decompression
''
FUNCTION rleExpand$ (inData AS STRING)
DIM ref AS INTEGER, offset AS INTEGER, outData AS STRING
DO
' read next byte in source buffer
offset = offset + 1
ref = ASC(MID$(inData, offset, 1))
' byte has two most significant bits set
IF (ref > 191) THEN
' move to next byte in source buffer
offset = offset + 1
' isolate count (ref AND 63)
outData = outData + STRING$(ref AND 63, MID$(inData, offset, 1))
' write as is
ELSE
outData = outData + CHR$(ref)
END IF
LOOP UNTIL (offset = LEN(inData))
rleExpand$ = outData
END FUNCTION
Those two small functions are all it takes to support RLE compression in your programs... if you think your graphics and levels could use a little diet, give it a go.
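If you want to give them a quick spin, a round-trip test along these lines (with made-up test data, obviously) prints the before/after sizes, then -1 (true) if nothing got lost along the way:
DECLARE FUNCTION rleShrink$ (inData AS STRING)
DECLARE FUNCTION rleExpand$ (inData AS STRING)
DIM raw AS STRING, packed AS STRING
raw = STRING$(40, "A") + "B" + STRING$(200, "C")
packed = rleShrink$(raw)
PRINT "Before:"; LEN(raw); "bytes, after:"; LEN(packed); "bytes"
' -1 (true) means the round trip is lossless
PRINT (rleExpand$(packed) = raw)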
Making the most of little things
We know that a key factor in PCX's RLE performance is the location of colors in the palette: colors located in the first 192 entries can be encoded as either 1 or 2 bytes, while colors in the last 64 entries will always be encoded as 2 bytes. Therefore, if the order in which colors appear in the palette is not important to us, we should re-arrange the palette entries to maximize compression. How?
The problem is not coming from runs, since they always generate 2 bytes of data regardless of the color index. However, isolated pixels produce one extra byte when the color index is equal to or above 192. To minimize that, we have to go through the compression process and determine, for each color, how many times it appears on its own (no run.) Then, we sort the color list by decreasing "isolated count" (the color at index 0 would produce more extra bytes than the color located at index 255.) Now we can output our PCX file using the new index list.
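Here's a rough sketch of that counting pass, plus the re-ordering (the image variable and its toy content are made up, and the actual palette rewrite and pixel translation are left out:)
DIM isoCount(255) AS LONG, remap(255) AS INTEGER
DIM image AS STRING, offset AS INTEGER, count AS INTEGER, ref AS INTEGER
image = STRING$(10, 200) + CHR$(201) + CHR$(202) ' toy pixel data
FOR i% = 0 TO 255: remap(i%) = i%: NEXT i%
' tally how many times each color appears isolated (run of 1)
offset = 1
DO
ref = ASC(MID$(image, offset, 1))
count = 0
DO WHILE (ASC(MID$(image, offset + count, 1)) = ref)
count = count + 1
IF ((offset + count) > LEN(image)) THEN EXIT DO
LOOP
IF (count = 1) THEN isoCount(ref) = isoCount(ref) + 1
offset = offset + count
LOOP UNTIL (offset > LEN(image))
' sort color indices by decreasing isolated count (slow but simple)
FOR i% = 0 TO 254
FOR j% = 0 TO 254 - i%
IF (isoCount(remap(j%)) < isoCount(remap(j% + 1))) THEN
SWAP remap(j%), remap(j% + 1)
END IF
NEXT j%
NEXT i%
' remap(n) now holds the old index of the color to store in slot n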
This optimization can shave between 5 and 15 percent off files generated by Photoshop. All that to say: compression is part algorithm, part data.
Long runs for your consideration
One last thing before we move on... while PCX offers a better deal than TGA in terms of compression, its RLE scheme is often criticized for supporting runs that are half the size of TGA's... we can't rewrite history but we can write our own algorithm. Let's imagine a PCX-style RLE compression with longer streaks:
What if we assumed that a count of 63 (that is, all 6 bits are set) is actually a fragment of a longer streak unable to fit on 6 bits? We could read the following byte (all 8 bits) and use them as an extension of the count (just add the 6 bits from the RLE descriptor and the 8 bits from the extension...) and if that byte was also maxed out (equals 255,) then we'd read yet another byte and add it to the count, and so on and so forth until we stumble across a byte that isn't maxed out:
DIM count AS INTEGER, outBuffer AS STRING
DIM ascii AS INTEGER, offset AS INTEGER
count = 63 + 255 + 99
'' write extended PCX-style RLE count
IF (count) THEN
IF (count < 63) THEN
' write one-byte RLE code
outBuffer = outBuffer + CHR$(192 + count)
ELSE
' write maxed-out one-byte RLE code
outBuffer = outBuffer + CHR$(192 + 63)
count = count - 63
' write extension(s)
outBuffer = outBuffer + STRING$(count \ 255, &HFF) + CHR$(count MOD 255)
END IF
END IF
'' read extended PCX-style RLE count
offset = 1
DO
' read one byte
ascii = ASC(MID$(outBuffer, offset, 1))
offset = offset + 1
' two most significant bits set
IF (ascii > 191) THEN
' isolate count
count = (ascii AND 63)
' count is 63, read extension(s)
IF (count = 63) THEN
DO
' get one more byte
ascii = ASC(MID$(outBuffer, offset, 1))
' increase count
count = count + ascii
' advance to next byte
offset = offset + 1
' keep going while the extension is maxed-out
LOOP WHILE (ascii = 255)
END IF
' only one (or neither) most significant bits are set
ELSE
count = 1
END IF
' show repeat value
PRINT "Repeats:"; count
LOOP UNTIL (offset > LEN(outBuffer))
This technique would save 1 byte every 63 pixels on very long runs... but if the count was indeed exactly 63, then an extra byte of value 0 would be needed to terminate the long streak...
Just an idea I'm throwing out there. Foot four thots.
What it do
Following the ASCII (American Standard Code for Information Interchange) table, characters are stored as 8-bit values, offering up to 256 unique combinations. It doesn't matter how often (like "e" or "i") or rarely (like "j" or "x") those characters are used: they all weigh exactly 8 bits. Thus, the string "Don't you hate it when you have to come up with random strings for testing purpose and can't think of anything???" is stored on 113 bytes, or 904 bits.
Huffman Encoding takes another approach, creating occurrence-dependent variable-length codes for each character. In other words: the more often a character is used, the shorter its code will be, in bits. Conversely, characters that are used less often get longer codes. Of course, getting to those codes is a whole process of statistical analysis of the input, sorting, and assembling the collected data... it involves a whole data tree made of leaves and nodes and all that good stuff, then we follow a path in the tree from one point to another, and... enough talk about my last vacation in Cameroon, let's get to work.
Writing the leaf list
The word tree has been mentioned in the previous paragraph, so let's talk about that real quick. A tree is basically a bunch of branches and leaves. The spots where branches connect are "nodes" and their extremities are "leaves." If you look at a real tree and start at the root (not the actual root, but where the trunk meets the soil) and follow the trunk up, you will eventually find a node. If you keep climbing from that node, you may find more nodes until you finally reach a leaf. In Huffman Encoding, leaves are symbols (the basic building blocks of our message.) Nodes are not generated from the root (like a real tree would,) but starting from the leaves. Eventually, all leaves and nodes are connected by the root.
In other words, to be as plain and obvious as one can be, we're basically going to grow a tree in the complete opposite way nature would. Just because it's easier. Somehow. It's probably a bit fuzzy right now, but it'll become clear as we go. First stop: let's set up a data type to store our leaves, nodes, connections and all that good stuff.
TYPE huffNodeType ' OFS SZE DESCRIPTION
weight AS LONG ' 000 004 leaf/node: weight
parent AS INTEGER ' 004 002 leaf/node: parent node index
symbol AS INTEGER ' 006 002 leaf only: ASCII value
isRight AS INTEGER ' 008 002 leaf only: is right (left if NULL)
lfChild AS INTEGER ' 010 002 node only: left child index, right is lfChild+1
END TYPE ' 12 BYTES, HUFFMAN NODE STRUCTURE
DIM node(511) AS huffNodeType, nodeCount AS INTEGER ' Huffman node array & count
The first step would be to count the frequency of each character (symbol) in the source string. It can be done easily by creating an array of 256 entries (from 0 to 255) and using the ASCII value to point to the proper element in the array; we previously defined the node() array and it (seemingly) contains far more entries than necessary, so we got that covered. Now let's take the old-timey classic "Helloworld" for reference (without space so every character is visible, but also because I forgot to include that space when I wrote the do-it-by-hand explanations and there's no way in hell I'm ever rewriting that part:)
In QuickBASIC, counting character occurrences would look like this:
DECLARE SUB huffCharCount (node() AS ANY, feed AS STRING)
DIM source AS STRING
source = "Helloworld"
huffCharCount node(), source
''
'' count character occurrences
''
SUB huffCharCount (node() AS huffNodeType, feed AS STRING)
DIM ascii AS INTEGER
FOR i% = 1 TO LEN(feed)
ascii = ASC(MID$(feed, i%, 1))
node(ascii).weight = node(ascii).weight + 1
NEXT i%
END SUB
When all the input data has been processed and we're ready to compress it, we sort the list by increasing frequency, and finally "stack" the entries back to index 0 of the array, consecutively.
Feel free to take a short break and marvel in awe at the amazing routine responsible for the masterpiece above:
DECLARE FUNCTION huffCharSort% (node() AS ANY)
nodeCount = huffCharSort%(node())
''
'' sort character list from least to most common
''
FUNCTION huffCharSort% (node() AS huffNodeType)
DIM hiRef AS INTEGER, max AS INTEGER
DIM count AS INTEGER, offset AS INTEGER
' so far, we assumed the index and the ASCII value were identical,
' after sorting the array it will likely not be the case anymore.
FOR i% = 0 TO 255
node(i%).symbol = i%
NEXT i%
' sort array
count = UBOUND(node) - LBOUND(node) + 1
offset = count \ 2
DO WHILE offset
max = count - offset - 1
DO
hiRef = 0
FOR i% = 0 TO max
IF (node(i%).weight > node(i% + offset).weight) THEN
SWAP node(i%), node(i% + offset)
hiRef = i%
END IF
NEXT i%
max = hiRef - offset
LOOP WHILE hiRef
offset = offset \ 2
LOOP
' number of symbols used
FOR i% = 0 TO UBOUND(node)
IF (node(i%).weight) THEN
count = i%
EXIT FOR
END IF
NEXT i%
' clear unused slots
FOR i% = 0 TO UBOUND(node) - count
node(i%) = node(i% + count)
NEXT i%
count = (UBOUND(node) - LBOUND(node) + 1) - count
FOR i% = count TO UBOUND(node)
node(i%).weight = 0
node(i%).symbol = 0
NEXT i%
' number of unique symbols
huffCharSort% = count
END FUNCTION
Writing the Huffman tree
Currently, the node() array only contains leaves (the ending points) of the Huffman tree. The next step is to link the entries of the list two by two via parent nodes until the whole tree is built. To reiterate: currently only leaves exist in the list, but nodes will start to appear as we build them. First, we take entries 0 and 1, and create a parent node whose weight is the sum of its children's (here, we'll call it "n_dH" as it is the node connecting leaves "d" and "H".) Then, we move the newly created parent node down the list until it becomes the last entry of that weight. This way, we not only ensure that the list is still ordered by increasing frequency, but we also make sure the index of the parent node remains constant even as new nodes are inserted into the list.
Now we repeat the procedure with entries 2 and 3: create a parent node whose weight is the sum of those entries' weight, then move it down the list until it becomes the last entry of that weight... to keep the listing short and because we're only moving forward, entries we've already processed will no longer be shown in the following tables:
Halfway there... entries 4 and 5, parent node is moved to index 9:
Parent nodes are processed exactly like leaves, as shown by entries 6 and 7, new parent node moved to index 10:
Entries 8 and 9, parent node moved to index 11...
Last stretch with entries 10 and 11, and their parent at index 12!
The reading cursor now points to the last index of the list (12,) which means there are no more entries to read, and therefore the Huffman tree is complete! Note that unlike every other entry in the list (but like Batman,) entry 12 has no parent because it's the root of our tree (now that I think about it, I've never seen Batman, Oliver Twist and the root of a Huffman tree in the same place at the same time.) Since the increasing frequency is maintained through the whole process, we can also see that the more frequent a character is, the closer it is to the root.
DECLARE SUB huffMakeTree (node() AS ANY, nodeCount AS INTEGER)
huffMakeTree node(), nodeCount
''
'' build Huffman tree from ordered symbol list
''
SUB huffMakeTree (node() AS huffNodeType, nodeCount AS INTEGER)
DIM loBound AS INTEGER, hiBound AS INTEGER
DIM middle AS INTEGER, parentId AS INTEGER
DIM readOfs AS INTEGER, nodeSum AS LONG
DO
' weight of parent node
nodeSum = node(readOfs).weight + node(readOfs + 1).weight
' find the best spot to insert parent node
loBound = readOfs + 2
hiBound = nodeCount - 1
IF (loBound > hiBound) THEN
parentId = nodeCount
ELSE
DO WHILE (loBound <= hiBound)
middle = (loBound + hiBound) \ 2
SELECT CASE node(middle).weight
CASE IS < nodeSum
loBound = middle + 1
parentId = loBound
CASE IS > nodeSum
hiBound = middle - 1
parentId = middle
CASE ELSE
parentId = middle
EXIT DO
END SELECT
LOOP
END IF
' last of this weight
DO UNTIL ((parentId > nodeCount) OR (node(parentId).weight <> nodeSum))
parentId = parentId + 1
LOOP
' push nodes forward
FOR i% = nodeCount TO parentId + 1 STEP -1
node(i%) = node(i% - 1)
NEXT i%
' write parent node
node(parentId).weight = nodeSum
node(parentId).symbol = -1
node(parentId).parent = -1
node(parentId).lfChild = readOfs
' link to parent
node(readOfs).parent = parentId
node(readOfs).isRight = 0
node(readOfs + 1).parent = parentId
node(readOfs + 1).isRight = -1
' move forward
readOfs = readOfs + 2
' increase node count
nodeCount = nodeCount + 1
' keep going until there are no more nodes to link
LOOP UNTIL (readOfs = nodeCount - 1)
END SUB
Generating Huffman codes
The word "tree" has been used over ten times now, and we still haven't seen what it actually looks like. If you'd like, here's a routine to display it on screen. It's complex and sloppy, so I won't blame you if you don't get it right now; it'll become clear with the next steps, I swear. You can trust me. We'll make it. Together.
DECLARE SUB huffDrawTree (node() AS ANY, nodeCount AS INTEGER)
''
'' display Huffman tree in graphic mode 12
''
SUB huffDrawTree (node() AS huffNodeType, nodeCount AS INTEGER)
DIM nodeVisited(nodeCount - 1) AS INTEGER
DIM nodeX(nodeCount - 1) AS INTEGER
DIM nodeDepth(nodeCount - 1) AS INTEGER
DIM x AS INTEGER, nodeId AS INTEGER, deepest AS INTEGER
' enter graphic mode 12 (640x480, 16 colors)
SCREEN 12
' read leaves from left to right
nodeId = nodeCount - 1
x = 2
DO
' follow left path till leaf
DO WHILE (node(nodeId).symbol = -1)
nodeId = node(nodeId).lfChild
LOOP
' register position
nodeX(nodeId) = x
x = x + 3
' go back until we reach a node that hasn't been fully checked
DO
nodeId = node(nodeId).parent
IF (nodeId = -1) THEN EXIT DO
LOOP WHILE nodeVisited(nodeId)
IF (nodeId = -1) THEN EXIT DO
' go right once...
nodeVisited(nodeId) = -1
nodeId = node(nodeId).lfChild + 1
LOOP
' get nodes depth
FOR i% = 0 TO nodeCount - 1
nodeId = i%
DO
nodeDepth(i%) = nodeDepth(i%) + 1
nodeId = node(nodeId).parent
LOOP UNTIL (nodeId = -1)
IF (nodeDepth(i%) > deepest) THEN deepest = nodeDepth(i%)
NEXT i%
' going from lower to higher, place non-leaves in between children
FOR i% = deepest TO 0 STEP -1
FOR j% = 0 TO nodeCount - 1
IF (node(j%).symbol = -1) THEN
IF (nodeDepth(j%) = i%) THEN
nodeX(j%) = (nodeX(node(j%).lfChild) + nodeX(node(j%).lfChild + 1)) \ 2
END IF
END IF
NEXT j%
NEXT i%
' and here we go...
FOR i% = 0 TO nodeCount - 1
LOCATE nodeDepth(i%) * 3, nodeX(i%)
IF (node(i%).symbol > -1) THEN
IF (node(i%).symbol = 32) THEN
LOCATE , nodeX(i%) - 1: PRINT CHR$(34); CHR$(node(i%).symbol); CHR$(34)
ELSE
PRINT CHR$(node(i%).symbol)
END IF
ELSE
LOCATE nodeDepth(i%) * 3, nodeX(i%) - LEN(LTRIM$(STR$(i%))) \ 2: PRINT LTRIM$(STR$(i%))
END IF
IF (node(i%).parent > -1) THEN
LINE ((nodeX(i%) - 1) * 8, nodeDepth(i%) * 16 * 3 - 16)_
-((nodeX(node(i%).parent) - 1) * 8, (nodeDepth(i%) - 1) * 16 * 3), 15
END IF
NEXT i%
END SUB
And if you don't want to run the code above but still want to see what the tree looks like, I'll go ahead and assume you're a cat since cats are known to be both lazy and curious... oh yeah, and here's the image too, you furry sofa mangler:
You can see the leaves at the bottom ("d", "H", "w", "r", "l", "e", and "o" -- that's the data we started off with,) the root at index 12, and a bunch of nodes connecting the leaves to the root (6, 7, 9, 10, and 11.) Pay attention to the location of "l" compared to the other leaves: it is much closer to the root. That's because "l" appears three times in the source string, while most of those other characters appear only once or twice.
Now you may ask: "Mike, how come "o" is not closer to the root? It appears twice in the input string!" and that's an excellent question, person I just made up in my head ("person I just made up in my head" is an original character, do not steal.) One-on-one, "o" appears more often than "d"... BUT, collectively, letters "d", "H", "w", "r" and "e" are more likely to be encountered than "o". The algorithm generates balanced trees to prevent "pushing too far away" rarer characters. The construction takes into account the odds of specific characters showing up collectively by carrying their frequency to their parent node. To wrap it up in a short answer: because "balance." That being said, it should be noted that when many leaves are present, the length in bits of a character may very well exceed 8 bits.
Back to the topic. Each node offers two traveling directions: left or right, which is convenient since we can store that branching choice on a single bit! To reach "l" from the root, we must go right (to node 11,) and then left... that's only two bits of information. To reach "w" from the root, we go left (to node 10,) right (to node 7,) and finally left. That's three bits of information... the idea here is to replace the usual fixed-length 8-bit codes of the ASCII table with those Huffman codes, thus reducing the overall size of the source input. To do so, we first have to generate the Huffman codes for every leaf.
Rather than start from the root and recursively travel to each leaf, we'll simply start from each leaf and move from parent to parent until we reach the root, and then reverse the sequence (since the starting point is meant to be the root, not the leaf.) It makes sense. After all, the tree itself was built bottom-up, so why not?
Here's how it goes: we parse the node list until we find a leaf (.symbol is not -1;) from there, we check whether the node is reached by a left turn (.isRight is FALSE) or a right turn (.isRight is TRUE,) write a clear bit for left or a set bit for right, and climb to the parent; we repeat until we hit the root, then reverse the sequence so the code reads root-to-leaf:
If there's ever been a time to talk about the least significant bit (LSB) and the most significant bit (MSB,) it would be now. We have the habit of writing left to right, so when we write numbers, we tend to write the most significant digit first (on the left) and end with the least significant digit (on the right.) For example, three-hundred-and-twenty-five is written 325, with 3 being the most significant digit (going first.) If the same number was written with the least significant digit first, it would be 523. It sounds far-fetched until you realize that we pronounce 16 LSB-first ("sixteen",) but 61 MSB-first ("sixty-one".) Not all articles explicitly state which way their bits should be read, and it's all very confusing. The reason I'm pointing this out is: due to the way Huffman trees are designed, Huffman codes are prefix-free. Prefix-free means that shorter codes never appear as the starting sequence of longer codes. For example, "10" is used by the "l" leaf, and those two bits never appear as the first two bits of any other code (makes sense: for that to happen, the leaf would have to be the parent of two other elements.) It's important to know because this will allow us to write a decompression routine that uses those Huffman codes rather than the tree, if we had to.
We need to add two new things to the program: a new structure to store Huffman codes, and a lookup table filled with 31 powers of two to speed up the process of reading and writing bits. I would go for 32 powers of two, but you know how QuickBASIC is with its signed integers. To be entirely fair, the Huffman codes should be stored as strings of 8-bit values, and the lookup table would only need 2^0 thru 2^7; this would cover all the cases (including very long bit codes) but it would also increase the complexity of our program way beyond what we need right now.
TYPE huffCodeType ' OFS SZE DESCRIPTION
code AS LONG ' 000 004 symbol code (root-to-leaf)
symbol AS INTEGER ' 004 002 ASCII value
length AS INTEGER ' 006 002 length of the code, in bits
END TYPE ' 8 BYTES, HUFFMAN CODE
DIM code(255) AS huffCodeType, codeCount AS INTEGER ' Huffman code array & count
DIM SHARED lutSQR(30) AS LONG ' powers of two lookup table
FOR i% = LBOUND(lutSQR) TO UBOUND(lutSQR)
lutSQR(i%) = 2 ^ i%
NEXT i%
The following routine will trace the path from leaf to root, starting from the least significant bit, and store the result in the .code field of the huffCodeType structure. Rather than starting at code(0) and move to the next index with each new leaf, the routine stores the Huffman code at the index matching the leaf's ASCII code (the Huffman code for the leaf representing "H" is stored in code(72):)
DECLARE SUB huffMakeCodes (node() AS ANY, nodeCount AS INTEGER, code() AS ANY)
huffMakeCodes node(), nodeCount, code()
''
'' use tree to generate Huffman codes
''
SUB huffMakeCodes (node() AS huffNodeType, nodeCount AS INTEGER, code() AS huffCodeType)
DIM nodeId AS INTEGER, codeLen AS INTEGER
DIM intPath AS LONG, tmpPath AS LONG
' search through each node
FOR i% = 0 TO nodeCount - 1
' this is a leaf, rebuild code
IF (node(i%).symbol > -1) THEN
' reset variables
nodeId = i%: codeLen = 0
intPath = 0: tmpPath = 0
' writes path from leaf to root
DO
' write value
IF (node(nodeId).isRight) THEN
tmpPath = tmpPath + lutSQR(codeLen)
END IF
codeLen = codeLen + 1
' move to parent
nodeId = node(nodeId).parent
LOOP UNTIL (node(nodeId).parent = -1)
' reverse path (from root to leaf)
FOR j% = 0 TO codeLen - 1
IF (tmpPath AND lutSQR(j%)) THEN
intPath = intPath OR lutSQR((codeLen - 1) - j%)
END IF
NEXT j%
' write code
code(node(i%).symbol).symbol = node(i%).symbol
code(node(i%).symbol).length = codeLen
code(node(i%).symbol).code = intPath
END IF
NEXT i%
END SUB
Oh boy. That was a lot of code, but now we have all the pieces necessary to write our compression routine! Just one more routine, solely for debug purposes: here's a tiny function to convert numeric values (such as Huffman codes) into binary form, most significant bit first:
''
'' returns the binary form of an integer, most significant bit first
''
FUNCTION decBin$ (feed AS LONG, numBits AS INTEGER)
DIM outStr AS STRING
outStr = STRING$(numBits, "0")
FOR i% = 0 TO numBits - 1
IF (feed AND lutSQR(i%)) THEN
MID$(outStr, numBits - i%, 1) = "1"
END IF
NEXT i%
decBin$ = outStr
END FUNCTION
Experiment with that function by feeding it powers of 2, and then combinations of powers of 2... see what happens. It's fun and instructive, I swear.
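For example (expected output in the comments:)
PRINT decBin$(lutSQR(4), 8) ' 00010000
PRINT decBin$(lutSQR(4) OR lutSQR(1), 8) ' 00010010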
Shrinking data
As said previously, compression is achieved by replacing each 8-bit symbol with its Huffman code counterpart. The compression function reads the string one byte at a time, then writes each bit of the corresponding Huffman code to a temporary 8-bit buffer. When the temporary buffer is full, the byte is appended to a secondary buffer, then the temporary buffer is cleared so the bits of further Huffman codes can be written.
The most confusing part of the routine is probably the multiple reading and writing variables being used: tgBit is the writing offset (in bits) to the temporary buffer (tgVal.) j% serves as the reading offset (in bits) of the Huffman code, and i% is the reading offset (in bytes) of the source input:
DECLARE FUNCTION huffShrink$ (feed AS STRING, code() AS ANY)
DIM target AS STRING
target = huffShrink$(source, code())
PRINT "Source size, in bytes:"; LEN(source)
PRINT "Target size, in bytes:"; LEN(target)
''
'' compress string using Huffman codes
''
FUNCTION huffShrink$ (feed AS STRING, code() AS huffCodeType)
DIM tgStr AS STRING, tgBit AS INTEGER, tgVal AS LONG
DIM ascii AS INTEGER, hcVal AS LONG
' for each character in the source string
FOR i% = 1 TO LEN(feed)
' get the source ASCII value
ascii = ASC(MID$(feed, i%, 1))
' get the matching Huffman code
hcVal = code(ascii).code
' for each bit in the Huffman code
FOR j% = 0 TO code(ascii).length - 1
' if that bit on the Huffman code is set,
' write to temporary target value
IF (hcVal AND lutSQR(j%)) THEN tgVal = tgVal OR lutSQR(tgBit)
' if target bit is below 7, increment
IF (tgBit < 7) THEN
tgBit = tgBit + 1
' if target bit is 7, don't go to 8: put the temporary
' target value to temporary buffer, then reset target
' bit value to 0
ELSE
tgStr = tgStr + CHR$(tgVal)
tgVal = 0
tgBit = 0
END IF
NEXT j%
NEXT i%
' if the length of the compressed data is not a multiple of 8, the
' temporary target value hasn't been written yet... do it now.
IF (tgBit) THEN tgStr = tgStr + CHR$(tgVal)
huffShrink$ = tgStr
END FUNCTION
Before compression, "Helloworld" was 80 bits long, or 10 bytes (most significant bit first:)
Byte 0: Byte 1: Byte 2: Byte 3: Byte 4: Byte 5: Byte 6: Byte 7: Byte 8: Byte 9:
H e l l o w o r l d
01001000 01100101 01101100 01101100 01101111 01110111 01101111 01110010 01101100 01100100
But after compression, the string is only 27 bits (padded to 32 bits,) or 4 bytes. That's less than half the original length! The following values are displayed least significant bit first to match the table in "Generating Huffman codes." Notice the five padding bits on byte 3:
Byte 0: Byte 1: Byte 2: Byte 3:
00111010 10111010 11101110 00000000
H e l l o w o r l d
001 110 10 10 111 010 111 011 10 000
Expanding data
There are two ways the compressed stream can be expanded to its original form: walking through the tree as we read the compressed bit stream, or using the Huffman codes directly. Either way, we also need to know when to stop reading the compressed stream, as its last byte may (or may not) contain junk bits; to solve this issue, either store the length (in bytes) of the decompressed stream, or the number of junk bits present on the last byte of the compressed stream (since it's never going to be more than 7, we can store that info on 3 bits.)
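For the record, getting that 3-bit value out of the total bit count is a one-liner; with the 27-bit "Helloworld" stream from the previous section:
DIM bitCount AS LONG, junkBits AS INTEGER
bitCount = 27 ' bits produced by huffShrink$
' bits left unused on the last byte; never more than 7
junkBits = (8 - (bitCount AND 7)) AND 7
PRINT "Junk bits:"; junkBits ' 5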
First, let's go for the easy solution of walking down the tree until enough bytes have been translated:
DECLARE FUNCTION huffExpand$ (feed AS STRING, length AS INTEGER, node() AS ANY, nodeCount AS INTEGER)
DIM target2 AS STRING
target2 = huffExpand$(target, LEN(source), node(), nodeCount)
PRINT target2
''
'' decompress feed into a buffer that is "length" bytes long, using the tree
''
FUNCTION huffExpand$ (feed AS STRING, length AS INTEGER, node() AS huffNodeType, nodeCount AS INTEGER)
DIM buffer AS STRING, nodeId AS INTEGER
DIM scBit AS INTEGER, scOfs AS INTEGER, scVal AS INTEGER
' reserve memory for target buffer
buffer = SPACE$(length)
' for each character we need
FOR i% = 1 TO length
' return to root
nodeId = nodeCount - 1
' follow from root to leaf
DO
' ran out of bits in source buffer, read 8 more bits
IF (scBit = 0) THEN
scOfs = scOfs + 1
scVal = ASC(MID$(feed, scOfs, 1))
END IF
' go right or left
IF (scVal AND lutSQR(scBit)) THEN
nodeId = node(nodeId).lfChild + 1
ELSE
nodeId = node(nodeId).lfChild
END IF
' advance bit offset in source buffer
scBit = (scBit + 1) AND 7
' keep going until we reach a leaf
LOOP WHILE (node(nodeId).symbol = -1)
' got a leaf, write symbol to target buffer
MID$(buffer, i%, 1) = CHR$(node(nodeId).symbol)
NEXT i%
huffExpand$ = buffer
END FUNCTION
Nice.
The second option works similarly, except it compares the bits read from the input against the Huffman code list until one of the codes matches the sequence (remember the prefix-free property of Huffman codes.) There are many ways this technique can be optimized, but we'll do the bare minimum for now: sort the Huffman code list by increasing code length, then only test codes whose length matches the number of bits read from the input so far.
We need a routine to sort the code() array so we don't have to do it every time the huffExpand2$() function is called. If necessary, remember to clear the array and invoke huffMakeCodes() again before calling huffShrink$(). The routine is essentially identical to huffCharSort%(), going as far as returning the number of unique Huffman codes in the list:
DECLARE FUNCTION huffCodesBySize% (code() AS ANY)
codeCount = huffCodesBySize%(code())
''
'' sort Huffman code array by code length, returns number of unique codes
''
FUNCTION huffCodesBySize% (code() AS huffCodeType)
DIM hiRef AS INTEGER, max AS INTEGER
DIM count AS INTEGER, offset AS INTEGER
' sort array
count = UBOUND(code) - LBOUND(code) + 1
offset = count \ 2
DO WHILE offset
max = count - offset - 1
DO
hiRef = 0
FOR i% = 0 TO max
IF (code(i%).length > code(i% + offset).length) THEN
SWAP code(i%), code(i% + offset)
hiRef = i%
END IF
NEXT i%
max = hiRef - offset
LOOP WHILE hiRef
offset = offset \ 2
LOOP
' number of symbols used
FOR i% = 0 TO UBOUND(code)
IF (code(i%).length) THEN
count = i%
EXIT FOR
END IF
NEXT i%
' clear unused slots
FOR i% = 0 TO UBOUND(code) - count
code(i%) = code(i% + count)
NEXT i%
count = (UBOUND(code) - LBOUND(code) + 1) - count
FOR i% = count TO UBOUND(code)
code(i%).code = 0
code(i%).symbol = 0
code(i%).length = 0
NEXT i%
' number of unique codes
huffCodesBySize% = count
END FUNCTION
Here's the plan: we're going to read the code() array one entry at a time. In that loop, we'll read as many bits as necessary to have the same amount of bits as the currently selected entry in code(). If the temporary value we built so far matches the current code() entry, write the corresponding symbol to the output string, reset the temporary value and parse the whole list again. If the two values do not match, test the next entry, keep appending more bits if necessary, etc.
DECLARE FUNCTION huffExpand2$ (feed AS STRING, length AS INTEGER, code() AS ANY, codeCount AS INTEGER)
DIM target3 AS STRING
target3 = huffExpand2$(target, LEN(source), code(), codeCount)
PRINT target3
''
'' decompress feed into a buffer that is "length" bytes long, using Huffman codes
''
FUNCTION huffExpand2$ (feed AS STRING, length AS INTEGER, code() AS huffCodeType, codeCount AS INTEGER)
DIM scOfs AS INTEGER, scBit AS INTEGER, scVal AS INTEGER
DIM tgBit AS INTEGER, tgVal AS LONG
DIM buffer AS STRING
' reserve memory for target buffer
buffer = SPACE$(length)
' for each character we need
FOR i% = 1 TO length
' reset temporary stream buffer
tgBit = 0
tgVal = 0
' find matching code
FOR j% = 0 TO codeCount - 1
' load enough bits to temporary buffer so we can test this code
WHILE (tgBit < code(j%).length)
' read 8 bits from source buffer
IF (scBit = 0) THEN
' advance source buffer byte offset
scOfs = scOfs + 1
scVal = ASC(MID$(feed, scOfs, 1))
END IF
' source bit is set, also set target bit
IF (scVal AND lutSQR(scBit)) THEN tgVal = tgVal + lutSQR(tgBit)
' advance target buffer bit offset
tgBit = tgBit + 1
' advance source buffer (8-bit) bit offset
scBit = (scBit + 1) AND 7
WEND
' exit as soon as we get a matching sequence
IF (tgVal = code(j%).code) THEN EXIT FOR
NEXT j%
' got the code, write symbol to target buffer
MID$(buffer, i%, 1) = CHR$(code(j%).symbol)
NEXT i%
' return decompressed string
huffExpand2$ = buffer
END FUNCTION
So, what data do we need stored in our compressed file? Well, it depends on the algorithm used to decompress the stream: for huffExpand$() we need a copy of the tree somehow (either by saving it directly, or by saving the letter frequencies and rebuilding the tree from scratch.) The alternative, huffExpand2$(), requires the Huffman codes sorted by code length, which may take less space... aside from that data, we need the compressed stream (of course) as well as the length in bytes of the uncompressed data, or the number of junk bits in the last byte of compressed data...
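To make things concrete, here's one possible (completely made-up) container layout for the huffExpand2$() route, reusing the variables from the previous listings:
DIM header AS STRING
' uncompressed length (4 bytes) and number of Huffman codes (2 bytes)
header = MKL$(LEN(source)) + MKI$(codeCount)
' one record per code: symbol (1 byte,) bit length (1 byte,) code (4 bytes)
FOR i% = 0 TO codeCount - 1
header = header + CHR$(code(i%).symbol) + CHR$(code(i%).length) + MKL$(code(i%).code)
NEXT i%
' the complete file would then be header + target (the compressed stream)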
Huffman Encoding can be used with more than just 8-bit values; it could encode entire words, for instance. Of course the tree would look much bigger and we'd have to tweak our program here and there, but it's possible. That said, Huffman Encoding is not specialized, and other compression schemes may produce better output depending on the input data. However, it's still a nice way to crush a few more bytes out of already compressed streams; from personal experience, the compressed data sits somewhere between 54% and 57% of the initial length.
Curb your enthusiasm
Before you get your hopes up: this is toe-tipping territory. We'll get a working routine, but it won't be enough for large data streams and it's painfully unoptimized. That said...
LZW is particularly efficient with text and works with other types of data as well. It all starts from a simple observation: the longer the message, the more likely it is for patterns to emerge. If we could save those potential patterns as they appear in a large enough dictionary, then we could substitute them with shorter codes when they re-appear later down the line. The process itself is... a lot trickier.
Before we start, let's write a dirty "dictionary check" routine to know whether or not a string is already present, and another to initialize the first 256 entries to ASCII values (this way, we can look up each individual character in the dictionary and get its index.)
DECLARE FUNCTION dicIndex% (array() AS STRING, length AS INTEGER, match AS STRING)
DECLARE SUB dicInit (buffer() AS STRING, length AS INTEGER)
''
'' return the index of the specified string, or -1 if it's not there
''
FUNCTION dicIndex% (array() AS STRING, length AS INTEGER, match AS STRING)
FOR i% = 0 TO length - 1
IF (array(i%) = match) THEN
dicIndex% = i%
EXIT FUNCTION
END IF
NEXT i%
dicIndex% = -1
END FUNCTION
''
'' initialize dictionary
''
SUB dicInit (array() AS STRING, length AS INTEGER)
FOR i% = 0 TO 255
array(i%) = CHR$(i%)
NEXT i%: length = 256
FOR i% = length TO UBOUND(array)
array(i%) = ""
NEXT i%
END SUB
Compression
Here's how it goes: we read one byte at a time and set the temporary string (for prediction) to a combination of the active string (which is guaranteed to be in the dictionary) and the byte we just read. It's important to note that the active string is always available in the dictionary, while the temporary string may not be; so, we search the dictionary to see if the temporary string exists. If it does: copy the temporary string to the active string. If it doesn't: write the index of the active string to the output buffer, insert the temporary string into the dictionary for later, and replace the active string with the byte we just read. We're resetting the active string because we discovered a new pattern; as it is brand new, there's no chance we'll encounter a "more complete" version of it down the line, so we have to go back to a simpler string that we can grow until we (hopefully) get a match in the dictionary. If the active string is not empty after we're done parsing the input buffer, write its index to the output buffer.
DECLARE FUNCTION shrinkLZW$(feed AS STRING)
' source and target buffers
DIM source AS STRING, target AS STRING
' all purpose variable for display
DIM index AS INTEGER
source = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
target = shrinkLZW$(source)
' display compressed buffer; every code is an index in the dictionary but
' we'll show indices 0 to 255 as their initial ASCII value.
FOR i% = 1 TO LEN(target) STEP 2
index = CVI(MID$(target, i%, 2))
SELECT CASE index
CASE 0 TO 255
PRINT CHR$(index);
CASE ELSE
PRINT "["; LTRIM$(STR$(index)); "]";
END SELECT
NEXT i%
''
'' LZW compression
''
FUNCTION shrinkLZW$ (feed AS STRING)
DIM dictionary(1023) AS STRING ' dictionary
DIM dicSize AS INTEGER ' entries in the dictionary
DIM tempStr AS STRING ' temporary string (not yet registered)
DIM activeStr AS STRING ' active string (already registered)
DIM inByte AS STRING * 1 ' byte in input stream
DIM buffer AS STRING ' target buffer
' initialize dictionary
dicInit dictionary(), dicSize
' parse the whole input stream
FOR i% = 1 TO LEN(feed)
' read byte, combine active string with byte into temporary string
inByte = MID$(feed, i%, 1)
tempStr = activeStr + inByte
' temporary string is in the dictionary
IF (dicIndex%(dictionary(), dicSize, tempStr) <> -1) THEN
' copy temporary string to active string
activeStr = tempStr
' temporary string is not in the dictionary
ELSE
' append index of the active string to the output buffer
buffer = buffer + MKI$(dicIndex%(dictionary(), dicSize, activeStr))
' write temporary string to dictionary
dictionary(dicSize) = tempStr
dicSize = dicSize + 1
' replace active string by currently read byte
activeStr = inByte
END IF
NEXT i%
' done parsing the input; the active string still needs to be flushed
IF LEN(activeStr) THEN
' append index of the active string to the output buffer
buffer = buffer + MKI$(dicIndex%(dictionary(), dicSize, activeStr))
END IF
' return compressed buffer
shrinkLZW$ = buffer
END FUNCTION
Decompression
And of course, we need some decompression routine too... the cool thing about LZW is that we only have to store the compressed data stream because the dictionary is rebuilt on-the-fly during decompression.
DECLARE FUNCTION expandLZW$ (feed AS STRING)
PRINT expandLZW$(target)
''
'' LZW decompression
''
FUNCTION expandLZW$ (feed AS STRING)
DIM dictionary(1023) AS STRING ' dictionary
DIM dicSize AS INTEGER ' entries in the dictionary
DIM activeStr AS STRING ' existing string
DIM tempStr AS STRING ' predicted string
DIM inInteger AS INTEGER ' integer in input buffer
DIM buffer AS STRING ' target buffer
' initialize dictionary
dicInit dictionary(), dicSize
' get initial code
activeStr = dictionary(CVI(MID$(feed, 1, 2)))
buffer = activeStr
' parse the whole input stream
FOR i% = 3 TO LEN(feed) STEP 2
inInteger = CVI(MID$(feed, i%, 2))
' rebuild string now; it hasn't been written to the dictionary yet
IF (inInteger = dicSize) THEN
tempStr = activeStr + LEFT$(activeStr, 1)
' string already exists in the dictionary
ELSEIF (LEN(dictionary(inInteger))) THEN
tempStr = dictionary(inInteger)
END IF
' append string to target buffer
buffer = buffer + tempStr
' create new entry in dictionary
dictionary(dicSize) = activeStr + LEFT$(tempStr, 1)
dicSize = dicSize + 1
' set active string as the last string collected
activeStr = tempStr
NEXT i%
' return decompressed buffer
expandLZW$ = buffer
END FUNCTION
And that's basically how it works. "But... the compressed stream is bigger than the source!" I hear you cry. Yes, because we've been using fixed-length 16-bit codes. If we compare the number of elements in the source string (70x 1 byte) and the compressed stream (45x 2 bytes,) it's obvious something positive happened; but to actually produce shorter streams, we have to use variable-length codes (as demonstrated by Huffman Encoding.)
That's part of the problem with LZW: having the theory down is not enough (like that time we designed a horrible RLE compression scheme.) For instance, with longer input strings, the dictionary array will eventually run out of entries, and it will hog all the memory allocated to strings in no time. Since new strings are built off of existing entries, instead of writing full strings we could link entries together by index and only save the trailing byte; that would require other techniques to determine whether a string has been registered or not... but even that wouldn't be enough, as the number of entries is limited, so we'd also have to create special codes to force-reset the dictionary when it's full. And then we have to come up with variable-length bit codes...
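To give you an idea of the index-linking part, dictionary entries could be boiled down to a parent index plus a trailing byte, with full strings rebuilt on demand (a sketch of the idea, nothing more; entries 0 to 255 would get a parent of -1 and their own ASCII value as suffix:)
TYPE lzwEntryType
parent AS INTEGER ' index of the prefix entry, -1 for single-byte entries
suffix AS INTEGER ' ASCII value of the trailing byte
END TYPE
''
'' rebuild the full string attached to a dictionary index
''
FUNCTION lzwEntry$ (dict() AS lzwEntryType, index AS INTEGER)
DIM s AS STRING, n AS INTEGER
n = index
DO
s = CHR$(dict(n).suffix) + s
n = dict(n).parent
LOOP UNTIL (n = -1)
lzwEntry$ = s
END FUNCTION
Each entry costs 4 bytes no matter how long the string it represents is, and checking whether a string is registered becomes a matter of comparing (parent, suffix) pairs rather than whole strings.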
It is possible to write proper LZW compression and decompression routines in QuickBASIC (there's an amazing GIF viewer out there that's been around since the mid 90s that uses the same theory,) but for me? Right now? Meh.