Baby's first compression scheme
Run-Length Encoding is super simple to put in place and requires next to nothing in terms of memory or preparation. It comes in a multitude of colors and shapes but always focuses on the same core idea: identify and shorten sequences of repeating symbols. It was often used as an early image compression technique (in ZSoft's PCX, a popular format nobody seems to remember) and was also common for shrinking tile-based 2D map data (it was used in games such as Rise of the Triad, Wolfenstein 3D, Commander Keen, and Jazz Jackrabbit to name a few.) At one point, this compression scheme was even used for video files! The image quality was reduced in order to maximize gains from the RLE algorithm, and it looked absolutely horrendous. Good times.
RLE is not fit for text or random data and will often produce negative gains with that type of input (the compressed stream is longer than the initial data it was fed, which sort of quote-unquote "defeats" the purpose a little bit if I may say so myself;) For maximum gainz, we're going to focus our efforts on this tiny 8x8 image:
Assuming each pixel takes 1 byte of data, the image data would be 64 bytes long. Now, what if we wrote a very simple encoding scheme that would take 1 byte for the run length, and a second byte for the pixel color?
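Sketched out in QuickBASIC, that naive encoder could look like this (the function name rleNaive$ is mine, and it assumes the pixel data comes in as a string of palette indices, one byte per pixel, fed one scanline at a time; runs longer than 255 pixels would need splitting, but that's a non-issue on an 8x8 image:)
''
'' naive RLE compression: always write a (length, color) pair
''
FUNCTION rleNaive$ (inData AS STRING)
DIM offset AS INTEGER, count AS INTEGER, ref AS INTEGER
DIM outData AS STRING
offset = 1
DO
' read reference byte and count how many times it repeats
ref = ASC(MID$(inData, offset, 1))
count = 0
DO WHILE (ASC(MID$(inData, offset + count, 1)) = ref)
count = count + 1
IF ((offset + count) > LEN(inData)) THEN EXIT DO
LOOP
' one byte for the run length, one byte for the color
outData = outData + CHR$(count) + CHR$(ref)
offset = offset + count
LOOP UNTIL (offset > LEN(inData))
rleNaive$ = outData
END FUNCTION
Applied to our 8x8 image, one scanline at a time, here's what comes out: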
6 bytes: 2x Blue, 5x Yellow, 1x Blue
12 bytes: 1x Blue, 2x Yellow, 1x Black, 1x Yellow, 1x Black, 2x Yellow
10 bytes: 3x Yellow, 1x Black, 1x Yellow, 1x Black, 2x Yellow
10 bytes: 1x Yellow, 1x Pink, 4x Yellow, 1x Pink, 1x Yellow
10 bytes: 3x Yellow, 1x Red, 1x Yellow, 1x Red, 2x Yellow
6 bytes: 3x Yellow, 3x Red, 2x Yellow
6 bytes: 4x Yellow, 1x Red, 3x Yellow
6 bytes: 1x Blue, 6x Yellow, 1x Blue
We get a compressed stream of 66 bytes! That's 2 bytes... worse... than the original data... Okay, maybe it's not such a good idea. In fact, it's on par with Microsoft's horrendous .BMP RLE compression scheme (and I refuse to cover such nonsense here, let's be serious.) But what went wrong exactly? The main role of Run-Length Encoding is to identify and summarize runs of repeating symbols. That's where the money's at; if our algorithm is unable to differentiate runs from isolated symbols, it's doing it wrong. Furthermore, the way we encode/signify runs has to be as tiny as possible and only used when applicable.
Now, before you go: "Mike, you chucklefuck, why are you wasting my time with this thing? It sucks!" Let me clear things up: "yes but also no." RLE is designed to shrink repeating symbols; The longer the run, the higher the gain... but we're working with a simple 8x8 image which prevents large runs from happening, thus resulting in barely any gain. We're also not allowing runs to cover multiple scanlines, which would save extra bytes. Try compressing Duke Nukem 128x90 levels (using 16-bit integers rather than 8-bit values) or 320x200 images and you'll easily notice a difference.
How TrueVision .TGA did it
When working on the TGA format, TrueVision introduced a compression scheme that used packets to describe pixel data. In a 256-color indexed image, each packet (descriptor included) was at most 129 bytes long and could either be compressed or uncompressed (raw,) depending on the packet descriptor, which contained two fields:
The seven least significant bits always provide the number of pixels to write to the target buffer, minus 1 (it is not possible to write 0 pixels, but we can write up to 128 of them,) and the most significant bit tells us whether the packet is RLE-compressed or raw. A compressed packet is only two bytes long: the packet's descriptor (one byte) followed by one value that we have to repeat multiple times. A raw packet contains the descriptor (yep, that's still one byte) followed by as many bytes of data as necessary to fill the target buffer.
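In QuickBASIC terms, pulling those two fields out of a descriptor byte would go something like this (just an illustration of the bit layout, not TrueVision's actual code:)
DIM descriptor AS INTEGER, count AS INTEGER
' example descriptor: bit 7 set (RLE,) seven low bits holding 4
descriptor = 132
' the seven low bits store the pixel count minus 1, so 5 pixels here
count = (descriptor AND 127) + 1
' bit 7 tells compressed and raw packets apart
IF (descriptor AND 128) THEN
PRINT "RLE packet: repeat the next byte"; count; "times"
ELSE
PRINT "Raw packet: copy the next"; count; "bytes as they are"
END IF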
Now let's think about that for a second. Since compressed packets are always 2 bytes long, a 2-pixel run will generate no gain... however, storing 2 pixels in a raw packet would take 3 bytes of data and produce a negative gain of 1 byte... for simplicity's sake, giving 2-pixel runs their own packet is a good idea, although some bytes can be saved here and there by allowing raw packets to absorb adjacent 2-pixel RLE packets (here's that smug asshole again for reference:)
6 bytes: (RLEx2) Blue, (RLEx5) Yellow, (RAWx1) Blue
10 bytes: (RAWx1) Blue, (RLEx2) Yellow, (RAWx3) Black-Yellow-Black, (RLEx2) Yellow
8 bytes: (RLEx3) Yellow, (RAWx3) Black-Yellow-Black, (RLEx2) Yellow
8 bytes: (RAWx2) Yellow-Pink, (RLEx4) Yellow, (RAWx2) Pink-Yellow
8 bytes: (RLEx3) Yellow, (RAWx3) Red-Yellow-Red, (RLEx2) Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
6 bytes: (RLEx4) Yellow, (RAWx1) Red, (RLEx3) Yellow
6 bytes: (RAWx1) Blue, (RLEx6) Yellow, (RAWx1) Blue
If we automatically encode runs of 2 pixels into their own packet, we get a compressed stream of 58 bytes... but if we allow raw packets to absorb neighboring 2-pixel runs, we may get the chance to merge everything into one bigger raw packet; in this case, it spares one extra byte:
6 bytes: (RLEx2) Blue, (RLEx5) Yellow, (RAWx1) Blue
9 bytes: (RAWx8) Blue-Yellow-Yellow-Black-Yellow-Black-Yellow-Yellow
8 bytes: (RLEx3) Yellow, (RAWx5) Black-Yellow-Black-Yellow-Yellow
8 bytes: (RAWx2) Yellow-Pink, (RLEx4) Yellow, (RAWx2) Pink-Yellow
8 bytes: (RLEx3) Yellow, (RAWx5) Red-Yellow-Red-Yellow-Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
6 bytes: (RLEx4) Yellow, (RAWx1) Red, (RLEx3) Yellow
6 bytes: (RAWx1) Blue, (RLEx6) Yellow, (RAWx1) Blue
We won't write code for this technique for two reasons. First, processing raw packets and writing a fast algorithm to merge packets when possible requires more effort than just counting runs; and second, the following RLE implementation offers a better compression rate and doesn't need to jump through these hoops.
How ZSoft .PCX did it
The RLE implementation used in PCX files doesn't use packets like TGA did. Instead, the decompression algorithm assumes that bytes must be copied to the target buffer as they are, unless the two most significant bits are set. In other words, if BOTH bits 7 and 6 (adding 192 to the byte value) are set, then the byte is suddenly seen as an RLE descriptor rather than raw pixel data, and the remaining 6 bits (a value of up to 63) provide the number of pixels to write to the target buffer:
Here's where it gets tricky: if we have a lonely pixel whose value is equal to or above 192, isn't the decompression algorithm going to assume it's an RLE code? Well, yes, and that's why the compressor always generates runs of 1 pixel or more with those 64 values (between 192 and 255 included.) It means that those colors will always produce two bytes of code... but it's only a negative gain when the pixel is isolated. Believe it or not, this compromise actually pays off. Back to our 8x8 image:
5 bytes: (RLEx2) Blue, (RLEx5) Yellow, Blue
8 bytes: Blue, (RLEx2) Yellow, Black, Yellow, Black, (RLEx2) Yellow
7 bytes: (RLEx3) Yellow, Black, Yellow, Black, (RLEx2) Yellow
6 bytes: Yellow, Pink, (RLEx4) Yellow, Pink, Yellow
7 bytes: (RLEx3) Yellow, Red, Yellow, Red, (RLEx2) Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
5 bytes: (RLEx4) Yellow, Red, (RLEx3) Yellow
4 bytes: Blue, (RLEx6) Yellow, Blue
Assuming that none of the color indices are equal to or greater than 192, that's only 48 bytes of information! For the sake of argument, let's assume the Red color index was equal to or greater than 192 (see the difference on lines 5 and 7, and notice how it doesn't change anything for line 6:)
5 bytes: (RLEx2) Blue, (RLEx5) Yellow, Blue
8 bytes: Blue, (RLEx2) Yellow, Black, Yellow, Black, (RLEx2) Yellow
7 bytes: (RLEx3) Yellow, Black, Yellow, Black, (RLEx2) Yellow
6 bytes: Yellow, Pink, (RLEx4) Yellow, Pink, Yellow
9 bytes: (RLEx3) Yellow, (RLEx1) Red, Yellow, (RLEx1) Red, (RLEx2) Yellow
6 bytes: (RLEx3) Yellow, (RLEx3) Red, (RLEx2) Yellow
6 bytes: (RLEx4) Yellow, (RLEx1) Red, (RLEx3) Yellow
4 bytes: Blue, (RLEx6) Yellow, Blue
Finally, here's the QuickBASIC shrinking routine. Read one byte from the stream and use it as a reference, then look ahead until the reference byte no longer matches the last byte read; increase a counter as you go. If the count is greater than 1 or the reference value is greater than 191, write an RLE code. If the count is 1 and the reference value is below 192, write the reference byte as is. Keep going until the whole source buffer is depleted:
''
'' PCX-style RLE compression
''
FUNCTION rleShrink$ (inData AS STRING)
DIM offset AS INTEGER, count AS INTEGER, streak AS INTEGER
DIM ref AS INTEGER, outData AS STRING
' reading offset in the source buffer
offset = 1
DO
' read reference byte
ref = ASC(MID$(inData, offset, 1))
' reset counter
count = 0
' keep reading while the reference byte is showing up
DO WHILE (ASC(MID$(inData, offset + count, 1)) = ref)
' increment counter
count = count + 1
IF ((offset + count) > LEN(inData)) THEN EXIT DO
LOOP
' advance reading offset
offset = offset + count
' run of 2 or more pixels OR ref byte is >= 192
IF ((ref > 191) OR (count > 1)) THEN
' write as many runs as necessary, at most 63 pixels at once
streak = 63
DO
IF (streak > count) THEN streak = count
outData = outData + CHR$(192 + streak) + CHR$(ref)
count = count - streak
LOOP WHILE (count)
' count is 1 AND ref byte is < 192
ELSE
' write byte as is
outData = outData + CHR$(ref)
END IF
LOOP UNTIL (offset > LEN(inData))
' return compressed data
rleShrink$ = outData
END FUNCTION
The decompression routine is even simpler... read one byte, if its two most significant bits are set (value is above 191,) read another byte and write it as many times as necessary. If either (or both) most significant bits are clear, write the byte as-is to the target buffer. Keep going until the source input is completely processed:
''
'' PCX-style RLE decompression
''
FUNCTION rleExpand$ (inData AS STRING)
DIM ref AS INTEGER, offset AS INTEGER, outData AS STRING
DO
' read next byte in source buffer
offset = offset + 1
ref = ASC(MID$(inData, offset, 1))
' byte has two most significant bits set
IF (ref > 191) THEN
' move to next byte in source buffer
offset = offset + 1
' isolate count (ref AND 63)
outData = outData + STRING$(ref AND 63, MID$(inData, offset, 1))
' write as is
ELSE
outData = outData + CHR$(ref)
END IF
LOOP UNTIL (offset = LEN(inData))
rleExpand$ = outData
END FUNCTION
Those two small functions are all it takes to support RLE compression in your programs... if you think your graphics and levels could use a little diet, give it a go.
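If you want to give them a quick spin, a round-trip test along these lines (with made-up test data, obviously) prints the before/after sizes, then -1 (true) if nothing got lost along the way:
DECLARE FUNCTION rleShrink$ (inData AS STRING)
DECLARE FUNCTION rleExpand$ (inData AS STRING)
DIM raw AS STRING, packed AS STRING
raw = STRING$(40, "A") + "B" + STRING$(200, "C")
packed = rleShrink$(raw)
PRINT "Before:"; LEN(raw); "bytes, after:"; LEN(packed); "bytes"
' -1 (true) means the round trip is lossless
PRINT (rleExpand$(packed) = raw)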
Making the most of little things
We know that a key factor in PCX's RLE performance is the location of colors in the palette: colors located in the first 192 entries can be encoded as either 1 or 2 bytes, while colors in the last 64 entries will always be encoded as 2 bytes. Therefore, if the order in which colors appear in the palette is not important to us, we should re-arrange the palette entries to maximize compression. How?
The problem is not coming from runs, since they always generate 2 bytes of data regardless of the color index. However, isolated pixels produce one extra byte when the color index is equal to or above 192. To minimize that, we have to go through the compression process and determine, for each color, how many times it appears on its own (no run.) Then, we sort the color list by decreasing "isolated count" (the color at index 0 would produce more extra bytes than the color located at index 255.) Now we can output our PCX file using the new index list.
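Here's a rough sketch of that counting pass, plus the re-ordering (the image variable and its toy content are made up, and the actual palette rewrite and pixel translation are left out:)
DIM isoCount(255) AS LONG, remap(255) AS INTEGER
DIM image AS STRING, offset AS INTEGER, count AS INTEGER, ref AS INTEGER
image = STRING$(10, 200) + CHR$(201) + CHR$(202) ' toy pixel data
FOR i% = 0 TO 255: remap(i%) = i%: NEXT i%
' tally how many times each color appears isolated (run of 1)
offset = 1
DO
ref = ASC(MID$(image, offset, 1))
count = 0
DO WHILE (ASC(MID$(image, offset + count, 1)) = ref)
count = count + 1
IF ((offset + count) > LEN(image)) THEN EXIT DO
LOOP
IF (count = 1) THEN isoCount(ref) = isoCount(ref) + 1
offset = offset + count
LOOP UNTIL (offset > LEN(image))
' sort color indices by decreasing isolated count (slow but simple)
FOR i% = 0 TO 254
FOR j% = 0 TO 254 - i%
IF (isoCount(remap(j%)) < isoCount(remap(j% + 1))) THEN
SWAP remap(j%), remap(j% + 1)
END IF
NEXT j%
NEXT i%
' remap(n) now holds the old index of the color to store in slot n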
This optimization can shave between 5 and 15 percent off files generated by Photoshop. All that to say: compression is part algorithm, part data.
Long runs for your consideration
One last thing before we move on... while PCX offers a better deal than TGA in terms of compression, its RLE scheme is often criticized for supporting runs that are half the size of TGA's... we can't rewrite history but we can write our own algorithm. Let's imagine a PCX-style RLE compression with longer streaks:
What if we assumed that a count of 63 (that is, all 6 bits are set) is actually a fragment of a longer streak unable to fit on 6 bits? We could read the following byte (all 8 bits) and use them as an extension of the count (just add the 6 bits from the RLE descriptor and the 8 bits from the extension...) and if that byte was also maxed out (equals 255,) then we'd read yet another byte and add it to the count, and so on and so forth until we stumble across a byte that isn't maxed out:
DIM count AS INTEGER, outBuffer AS STRING
DIM ascii AS INTEGER, offset AS INTEGER
count = 63 + 255 + 99
'' write extended PCX-style RLE count
IF (count) THEN
IF (count < 63) THEN
' write one-byte RLE code
outBuffer = outBuffer + CHR$(192 + count)
ELSE
' write maxed-out one-byte RLE code
outBuffer = outBuffer + CHR$(192 + 63)
count = count - 63
' write extension(s)
outBuffer = outBuffer + STRING$(count \ 255, &HFF) + CHR$(count MOD 255)
END IF
END IF
'' read extended PCX-style RLE count
offset = 1
DO
' read one byte
ascii = ASC(MID$(outBuffer, offset, 1))
offset = offset + 1
' two most significant bits set
IF (ascii > 191) THEN
' isolate count
count = (ascii AND 63)
' count is 63, read extension(s)
IF (count = 63) THEN
DO
' get one more byte
ascii = ASC(MID$(outBuffer, offset, 1))
' increase count
count = count + ascii
' advance to next byte
offset = offset + 1
' keep going while the extension is maxed-out
LOOP WHILE (ascii = 255)
END IF
' only one (or neither) most significant bits are set
ELSE
count = 1
END IF
' show repeat value
PRINT "Repeats:"; count
LOOP UNTIL (offset > LEN(outBuffer))
This technique would save 1 byte every 63 pixels on very long runs... but if the count was indeed exactly 63, then an extra byte of value 0 would be needed to terminate the long streak...
Just an idea I'm throwing out there. Foot four thots.
What it do
Following the ASCII (American Standard Code for Information Interchange) table, characters are stored as 8-bit values, offering up to 256 unique combinations. It doesn't matter how often (like "e" or "i") or rarely (like "j" or "x") those characters are used: they all weigh exactly 8 bits. Thus, the string "Don't you hate it when you have to come up with random strings for testing purpose and can't think of anything???" is stored on 113 bytes, or 904 bits.
Huffman Encoding takes another approach, creating occurrence-dependent variable-length codes for each character. In other words: the more often a character is used, the shorter its code will be, in bits. Conversely, characters that are used less often get longer codes. Of course, getting to those codes is a whole process of statistical analysis of the input, sorting, and assembling the collected data... it involves a whole data tree made of leaves and nodes and all that good stuff, then we follow a path in the tree from one point to another, and... enough talk about my last vacation in Cameroon, let's get to work.
Writing the leaf list
The word tree has been mentioned in the previous paragraph, so let's talk about that real quick. A tree is basically a bunch of branches and leaves. The spots where branches connect are "nodes" and their extremities are "leaves." If you look at a real tree and start at the root (not the actual root, but where the trunk meets the soil) and follow the trunk up, you will eventually find a node. If you keep climbing from that node, you may find more nodes until you finally reach a leaf. In Huffman Encoding, leaves are symbols (the basic building blocks of our message.) Nodes are not generated from the root (like a real tree would,) but starting from the leaves. Eventually, all leaves and nodes are connected by the root.
In other words, to be as plain and obvious as one can be, we're basically going to grow a tree in the complete opposite way nature would. Just because it's easier. Somehow. It's probably a bit fuzzy right now, but it'll become clear as we go. First stop: let's set up a data type to store our leaves, nodes, connections and all that good stuff.
TYPE huffNodeType ' OFS SZE DESCRIPTION
weight AS LONG ' 000 004 leaf/node: weight
parent AS INTEGER ' 004 002 leaf/node: parent node index
symbol AS INTEGER ' 006 002 leaf only: ASCII value
isRight AS INTEGER ' 008 002 leaf only: is right (left if NULL)
lfChild AS INTEGER ' 010 002 node only: left child index, right is lfChild+1
END TYPE ' 12 BYTES, HUFFMAN NODE STRUCTURE
DIM node(511) AS huffNodeType, nodeCount AS INTEGER ' Huffman node array & count
The first step would be to count the frequency of each character (symbol) in the source string. It can be done easily by creating an array of 256 entries (from 0 to 255) and using the ASCII value to point to the proper element in the array; we previously defined the node() array and it (seemingly) contains far more entries than necessary, so we got that covered. Now let's take the old-timey classic "Helloworld" for reference (without space so every character is visible, but also because I forgot to include that space when I wrote the do-it-by-hand explanations and there's no way in hell I'm ever rewriting that part:)
In QuickBASIC, counting character occurrences would look like this:
DECLARE SUB huffCharCount (node() AS ANY, feed AS STRING)
DIM source AS STRING
source = "Helloworld"
huffCharCount node(), source
''
'' count character occurrences
''
SUB huffCharCount (node() AS huffNodeType, feed AS STRING)
DIM ascii AS INTEGER
FOR i% = 1 TO LEN(feed)
ascii = ASC(MID$(feed, i%, 1))
node(ascii).weight = node(ascii).weight + 1
NEXT i%
END SUB
When all the input data has been processed and we're ready to compress it, we sort the list by increasing frequency, and finally "stack" the entries back to index 0 of the array, consecutively.
Feel free to take a short break and marvel in awe at the amazing routine responsible for the masterpiece above:
DECLARE FUNCTION huffCharSort% (node() AS ANY)
nodeCount = huffCharSort%(node())
''
'' sort character list from least to most common
''
FUNCTION huffCharSort% (node() AS huffNodeType)
DIM hiRef AS INTEGER, max AS INTEGER
DIM count AS INTEGER, offset AS INTEGER
' so far, we assumed the index and the ASCII value were identical,
' after sorting the array it will likely not be the case anymore.
FOR i% = 0 TO 255
node(i%).symbol = i%
NEXT i%
' sort array
count = UBOUND(node) - LBOUND(node) + 1
offset = count \ 2
DO WHILE offset
max = count - offset - 1
DO
hiRef = 0
FOR i% = 0 TO max
IF (node(i%).weight > node(i% + offset).weight) THEN
SWAP node(i%), node(i% + offset)
hiRef = i%
END IF
NEXT i%
max = hiRef - offset
LOOP WHILE hiRef
offset = offset \ 2
LOOP
' number of symbols used
FOR i% = 0 TO UBOUND(node)
IF (node(i%).weight) THEN
count = i%
EXIT FOR
END IF
NEXT i%
' clear unused slots
FOR i% = 0 TO UBOUND(node) - count
node(i%) = node(i% + count)
NEXT i%
count = (UBOUND(node) - LBOUND(node) + 1) - count
FOR i% = count TO UBOUND(node)
node(i%).weight = 0
node(i%).symbol = 0
NEXT i%
' number of unique symbols
huffCharSort% = count
END FUNCTION
Writing the Huffman tree
Currently, the node() array only contains leaves (the ending points) of the Huffman tree. The next step is to link the entries of the list two by two via parent nodes until the whole tree is built. To reiterate: currently only leaves exist in the list, but nodes will start to appear as we build them. First, we take entries 0 and 1, and create a parent node whose weight is the sum of its children's (here, we'll call it "n_dH" as it is the node connecting leaves "d" and "H".) Then, we move the newly created parent node down the list until it becomes the last entry of that weight. This way, we not only ensure that the list is still ordered by increasing frequency, but we also make sure the index of the parent node remains constant even as new nodes are inserted into the list.
Now we repeat the procedure with entries 2 and 3: create a parent node whose weight is the sum of those entries' weight, then move it down the list until it becomes the last entry of that weight... to keep the listing short and because we're only moving forward, entries we've already processed will no longer be shown in the following tables:
Halfway there... entries 4 and 5, parent node is moved to index 9:
Parent nodes are processed exactly like leaves, as shown by entries 6 and 7, new parent node moved to index 10:
Entries 8 and 9, parent node moved to index 11...
Last stretch with entries 10 and 11, and their parent at index 12!
The reading cursor now points to the last index of the list (12,) which means there are no more entries to read, and therefore the Huffman tree is complete! Note that unlike every other entry in the list (but like Batman,) entry 12 has no parent because it's the root of our tree (now that I think about it, I've never seen Batman, Oliver Twist and the root of a Huffman tree in the same place at the same time.) Since the increasing frequency is maintained through the whole process, we can also see that the more frequent a character is, the closer it is to the root.
DECLARE SUB huffMakeTree (node() AS ANY, nodeCount AS INTEGER)
huffMakeTree node(), nodeCount
''
'' build Huffman tree from ordered symbol list
''
SUB huffMakeTree (node() AS huffNodeType, nodeCount AS INTEGER)
DIM loBound AS INTEGER, hiBound AS INTEGER
DIM middle AS INTEGER, parentId AS INTEGER
DIM readOfs AS INTEGER, nodeSum AS LONG
DO
' weight of parent node
nodeSum = node(readOfs).weight + node(readOfs + 1).weight
' find the best spot to insert parent node
loBound = readOfs + 2
hiBound = nodeCount - 1
IF (loBound > hiBound) THEN
parentId = nodeCount
ELSE
DO WHILE (loBound <= hiBound)
middle = (loBound + hiBound) \ 2
SELECT CASE node(middle).weight
CASE IS < nodeSum
loBound = middle + 1
parentId = loBound
CASE IS > nodeSum
hiBound = middle - 1
parentId = middle
CASE ELSE
parentId = middle
EXIT DO
END SELECT
LOOP
END IF
' last of this weight
DO UNTIL ((parentId > nodeCount) OR (node(parentId).weight <> nodeSum))
parentId = parentId + 1
LOOP
' push nodes forward
FOR i% = nodeCount TO parentId + 1 STEP -1
node(i%) = node(i% - 1)
NEXT i%
' write parent node
node(parentId).weight = nodeSum
node(parentId).symbol = -1
node(parentId).parent = -1
node(parentId).lfChild = readOfs
' link to parent
node(readOfs).parent = parentId
node(readOfs).isRight = 0
node(readOfs + 1).parent = parentId
node(readOfs + 1).isRight = -1
' move forward
readOfs = readOfs + 2
' increase node count
nodeCount = nodeCount + 1
' keep going until there are no more nodes to link
LOOP UNTIL (readOfs = nodeCount - 1)
END SUB
Generating Huffman codes
The word "tree" has been used over ten times now, and we still haven't seen what it actually looks like. If you'd like, here's a routine to display it on screen. It's complex and sloppy, so I won't blame you if you don't get it right now; it'll become clear with the next steps, I swear. You can trust me. We'll make it. Together.
DECLARE SUB huffDrawTree (node() AS ANY, nodeCount AS INTEGER)
''
'' display Huffman tree in graphic mode 12
''
SUB huffDrawTree (node() AS huffNodeType, nodeCount AS INTEGER)
DIM nodeVisited(nodeCount - 1) AS INTEGER
DIM nodeX(nodeCount - 1) AS INTEGER
DIM nodeDepth(nodeCount - 1) AS INTEGER
DIM x AS INTEGER, nodeId AS INTEGER, deepest AS INTEGER
' enter graphic mode 12 (640x480, 16 colors)
SCREEN 12
' read leaves from left to right
nodeId = nodeCount - 1
x = 2
DO
' follow left path till leaf
DO WHILE (node(nodeId).symbol = -1)
nodeId = node(nodeId).lfChild
LOOP
' register position
nodeX(nodeId) = x
x = x + 3
' go back until we reach a node that hasn't been fully checked
DO
nodeId = node(nodeId).parent
IF (nodeId = -1) THEN EXIT DO
LOOP WHILE nodeVisited(nodeId)
IF (nodeId = -1) THEN EXIT DO
' go right once...
nodeVisited(nodeId) = -1
nodeId = node(nodeId).lfChild + 1
LOOP
' get nodes depth
FOR i% = 0 TO nodeCount - 1
nodeId = i%
DO
nodeDepth(i%) = nodeDepth(i%) + 1
nodeId = node(nodeId).parent
LOOP UNTIL (nodeId = -1)
IF (nodeDepth(i%) > deepest) THEN deepest = nodeDepth(i%)
NEXT i%
' going from lower to higher, place non-leaves in between children
FOR i% = deepest TO 0 STEP -1
FOR j% = 0 TO nodeCount - 1
IF (node(j%).symbol = -1) THEN
IF (nodeDepth(j%) = i%) THEN
nodeX(j%) = (nodeX(node(j%).lfChild) + nodeX(node(j%).lfChild + 1)) \ 2
END IF
END IF
NEXT j%
NEXT i%
' and here we go...
FOR i% = 0 TO nodeCount - 1
LOCATE nodeDepth(i%) * 3, nodeX(i%)
IF (node(i%).symbol > -1) THEN
IF (node(i%).symbol = 32) THEN
LOCATE , nodeX(i%) - 1: PRINT CHR$(34); CHR$(node(i%).symbol); CHR$(34)
ELSE
PRINT CHR$(node(i%).symbol)
END IF
ELSE
LOCATE nodeDepth(i%) * 3, nodeX(i%) - LEN(LTRIM$(STR$(i%))) \ 2: PRINT LTRIM$(STR$(i%))
END IF
IF (node(i%).parent > -1) THEN
LINE ((nodeX(i%) - 1) * 8, nodeDepth(i%) * 16 * 3 - 16)_
-((nodeX(node(i%).parent) - 1) * 8, (nodeDepth(i%) - 1) * 16 * 3), 15
END IF
NEXT i%
END SUB
And if you don't want to run the code above but still want to see what the tree looks like, I'll go ahead and assume you're a cat since cats are known to be both lazy and curious... oh yeah, and here's the image too, you furry sofa mangler:
You can see the leaves at the bottom ("d", "H", "w", "r", "l", "e", and "o" -- that's the data we started off with,) the root at index 12, and a bunch of nodes connecting the leaves to the root (6, 7, 9, 10, and 11.) Pay attention to the location of "l" compared to the other leaves: it is much closer to the root. That's because "l" appears three times in the source string, while most of those other characters appear only once or twice.
Now you may ask: "Mike, how come "o" is not closer to the root? It appears twice in the input string!" and that's an excellent question, person I just made up in my head ("person I just made up in my head" is an original character, do not steal.) One-on-one, "o" appears more often than "d"... BUT, collectively, letters "d", "H", "w", "r" and "e" are more likely to be encountered than "o". The algorithm generates balanced trees to prevent "pushing too far away" rarer characters. The construction takes into account the odds of specific characters showing up collectively by carrying their frequency to their parent node. To wrap it up in a short answer: because "balance." That being said, it should be noted that when many leaves are present, the length in bits of a character may very well exceed 8 bits.
Back to the topic. Each node offers two traveling directions: left or right, which is convenient since we can store that branching choice on a single bit! To reach "l" from the root, we must go right (to node 11,) and then left... that's only two bits of information. To reach "w" from the root, we go left (to node 10,) right (to node 7,) and finally left. That's three bits of information... the idea here is to replace the usual fixed-length 8-bit codes of the ASCII table with those Huffman codes, thus reducing the overall size of the source input. To do so, we first have to generate the Huffman codes for every leaf.
Rather than start from the root and recursively travel to each leaf, we'll simply start from each leaf and move from parent to parent until we reach the root, and then reverse the sequence (since the starting point is meant to be the root, not the leaf.) It makes sense. After all, the tree itself was built bottom-up, so why not?
Here's how it goes: we parse the node list until we find a leaf (.symbol is not -1;) from there, we check whether the node is reached by a left turn (.isRight is FALSE) or a right turn (.isRight is TRUE,) write a clear bit for left or a set bit for right, and climb to the parent; we repeat until we hit the root, then reverse the sequence so the code reads root-to-leaf:
If there's ever been a time to talk about the least significant bit (LSB) and the most significant bit (MSB,) it would be now. We have the habit of writing left to right, so when we write numbers, we tend to write the most significant digit first (on the left) and end with the least significant digit (on the right.) For example, three-hundred-and-twenty-five is written 325, with 3 being the most significant digit (going first.) If the same number was written with the least significant digit first, it would be 523. It sounds far-fetched until you realize that we pronounce 16 LSB-first ("sixteen",) but 61 MSB-first ("sixty-one".) Not all articles explicitly state which way their bits should be read, and it's all very confusing. The reason I'm pointing this out is: due to the way Huffman trees are designed, Huffman codes are prefix-free. Prefix-free means that shorter codes never appear as the starting sequence of longer codes. For example, "10" is used by the "l" leaf, and those two bits never appear as the first two bits of any other code (makes sense: for that to happen, the leaf would have to be the parent of two other elements.) It's important to know because this will allow us to write a decompression routine that uses those Huffman codes rather than the tree, if we had to.
We need to add two new things to the program: a new structure to store Huffman codes, and a lookup table filled with 31 powers of two to speed up the process of reading and writing bits. I would go for 32 powers of two, but you know how QuickBASIC is with its signed integers. To be entirely fair, the Huffman codes should be stored as strings of 8-bit values, and the lookup table would only need 2^0 thru 2^7; this would cover all the cases (including very long bit codes) but it would also increase the complexity of our program way beyond what we need right now.
TYPE huffCodeType ' OFS SZE DESCRIPTION
code AS LONG ' 000 004 symbol code (root-to-leaf)
symbol AS INTEGER ' 004 002 ASCII value
length AS INTEGER ' 006 002 length of the code, in bits
END TYPE ' 8 BYTES, HUFFMAN CODE
DIM code(255) AS huffCodeType, codeCount AS INTEGER ' Huffman code array & count
DIM SHARED lutSQR(30) AS LONG ' powers of two lookup table
FOR i% = LBOUND(lutSQR) TO UBOUND(lutSQR)
lutSQR(i%) = 2 ^ i%
NEXT i%
The following routine will trace the path from leaf to root, starting from the least significant bit, and store the result in the .code field of the huffCodeType structure. Rather than starting at code(0) and move to the next index with each new leaf, the routine stores the Huffman code at the index matching the leaf's ASCII code (the Huffman code for the leaf representing "H" is stored in code(72):)
DECLARE SUB huffMakeCodes (node() AS ANY, nodeCount AS INTEGER, code() AS ANY)
huffMakeCodes node(), nodeCount, code()
''
'' use tree to generate Huffman codes
''
SUB huffMakeCodes (node() AS huffNodeType, nodeCount AS INTEGER, code() AS huffCodeType)
DIM nodeId AS INTEGER, codeLen AS INTEGER
DIM intPath AS LONG, tmpPath AS LONG
' search through each node
FOR i% = 0 TO nodeCount - 1
' this is a leaf, rebuild code
IF (node(i%).symbol > -1) THEN
' reset variables
nodeId = i%: codeLen = 0
intPath = 0: tmpPath = 0
' writes path from leaf to root
DO
' write value
IF (node(nodeId).isRight) THEN
tmpPath = tmpPath + lutSQR(codeLen)
END IF
codeLen = codeLen + 1
' move to parent
nodeId = node(nodeId).parent
LOOP UNTIL (node(nodeId).parent = -1)
' reverse path (from root to leaf)
FOR j% = 0 TO codeLen - 1
IF (tmpPath AND lutSQR(j%)) THEN
intPath = intPath OR lutSQR((codeLen - 1) - j%)
END IF
NEXT j%
' write code
code(node(i%).symbol).symbol = node(i%).symbol
code(node(i%).symbol).length = codeLen
code(node(i%).symbol).code = intPath
END IF
NEXT i%
END SUB
Oh boy. That was a lot of code, but now we have all the pieces necessary to write our compression routine! Just one more routine, solely for debug purposes: here's a tiny function to convert numeric values (such as Huffman codes) into binary form, most significant bit first:
''
'' returns the binary form of an integer, most significant bit first
''
FUNCTION decBin$ (feed AS LONG, numBits AS INTEGER)
DIM outStr AS STRING
outStr = STRING$(numBits, "0")
FOR i% = 0 TO numBits - 1
IF (feed AND lutSQR(i%)) THEN
MID$(outStr, numBits - i%, 1) = "1"
END IF
NEXT i%
decBin$ = outStr
END FUNCTION
Experiment with that function by feeding it powers of 2, and then combinations of powers of 2... see what happens. It's fun and instructive, I swear.
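For example (expected output in the comments:)
PRINT decBin$(lutSQR(4), 8) ' 00010000
PRINT decBin$(lutSQR(4) OR lutSQR(1), 8) ' 00010010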
Shrinking data
As said previously, compression is achieved by replacing each 8-bit symbol with its Huffman code counterpart. The compression function reads the string one byte at a time, then writes each bit of the corresponding Huffman code to a temporary 8-bit buffer. When the temporary buffer is full, the byte is appended to a secondary buffer, then the temporary buffer is cleared so the bits of further Huffman codes can be written.
The most confusing part of the routine is probably the multiple reading and writing variables being used: tgBit is the writing offset (in bits) to the temporary buffer (tgVal.) j% serves as the reading offset (in bits) of the Huffman code, and i% is the reading offset (in bytes) of the source input:
DECLARE FUNCTION huffShrink$ (feed AS STRING, code() AS ANY)
DIM target AS STRING
target = huffShrink$(source, code())
PRINT "Source size, in bytes:"; LEN(source)
PRINT "Target size, in bytes:"; LEN(target)
''
'' compress string using Huffman codes
''
FUNCTION huffShrink$ (feed AS STRING, code() AS huffCodeType)
DIM tgStr AS STRING, tgBit AS INTEGER, tgVal AS LONG
DIM ascii AS INTEGER, hcVal AS LONG
' for each character in the source string
FOR i% = 1 TO LEN(feed)
' get the source ASCII value
ascii = ASC(MID$(feed, i%, 1))
' get the matching Huffman code
hcVal = code(ascii).code
' for each bit in the Huffman code
FOR j% = 0 TO code(ascii).length - 1
' if that bit on the Huffman code is set,
' write to temporary target value
IF (hcVal AND lutSQR(j%)) THEN tgVal = tgVal OR lutSQR(tgBit)
' if target bit is below 7, increment
IF (tgBit < 7) THEN
tgBit = tgBit + 1
' if target bit is 7, don't go to 8: put the temporary
' target value to temporary buffer, then reset target
' bit value to 0
ELSE
tgStr = tgStr + CHR$(tgVal)
tgVal = 0
tgBit = 0
END IF
NEXT j%
NEXT i%
' if the length of the compressed data is not a multiple of 8, the
' temporary target value hasn't been written yet... do it now.
IF (tgBit) THEN tgStr = tgStr + CHR$(tgVal)
huffShrink$ = tgStr
END FUNCTION
Before compression, "Helloworld" was 80 bits long, or 10 bytes (most significant bit first:)
Byte 0: Byte 1: Byte 2: Byte 3: Byte 4: Byte 5: Byte 6: Byte 7: Byte 8: Byte 9:
H e l l o w o r l d
01001000 01100101 01101100 01101100 01101111 01110111 01101111 01110010 01101100 01100100
But after compression, the string is only 27 bits (padded to 32 bits,) or 4 bytes. That's less than half the original length! The following values are displayed least significant bit first to match the table in "Generating Huffman codes." Notice the five padding bits on byte 3:
Byte 0: Byte 1: Byte 2: Byte 3:
00111010 10111010 11101110 00000000
H e l l o w o r l d
001 110 10 10 111 010 111 011 10 000
Expanding data
There are two ways the compressed stream can be expanded to its original form: walking through the tree as we read the compressed bit stream, or using the Huffman codes directly. Either way, we also need to know when to stop reading the compressed stream, as its last byte may (or may not) contain junk bits; to solve this issue, either store the length (in bytes) of the decompressed stream, or the number of junk bits present on the last byte of the compressed stream (since it's never going to be more than 7, we can store that info on 3 bits.)
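For the record, getting that 3-bit value out of the total bit count is a one-liner; with the 27-bit "Helloworld" stream from the previous section:
DIM bitCount AS LONG, junkBits AS INTEGER
bitCount = 27 ' bits produced by huffShrink$
' bits left unused on the last byte; never more than 7
junkBits = (8 - (bitCount AND 7)) AND 7
PRINT "Junk bits:"; junkBits ' 5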
First, let's go for the easy solution of walking down the tree until enough bytes have been translated:
DECLARE FUNCTION huffExpand$ (feed AS STRING, length AS INTEGER, node() AS ANY, nodeCount AS INTEGER)
DIM target2 AS STRING
target2 = huffExpand$(target, LEN(source), node(), nodeCount)
PRINT target2
''
'' decompress feed into a buffer that is "length" bytes long, using the tree
''
FUNCTION huffExpand$ (feed AS STRING, length AS INTEGER, node() AS huffNodeType, nodeCount AS INTEGER)
DIM buffer AS STRING, nodeId AS INTEGER
DIM scBit AS INTEGER, scOfs AS INTEGER, scVal AS INTEGER
' reserve memory for target buffer
buffer = SPACE$(length)
' for each character we need
FOR i% = 1 TO length
' return to root
nodeId = nodeCount - 1
' follow from root to leaf
DO
' ran out of bits in source buffer, read 8 more bits
IF (scBit = 0) THEN
scOfs = scOfs + 1
scVal = ASC(MID$(feed, scOfs, 1))
END IF
' go right or left
IF (scVal AND lutSQR(scBit)) THEN
nodeId = node(nodeId).lfChild + 1
ELSE
nodeId = node(nodeId).lfChild
END IF
' advance bit offset in source buffer
scBit = (scBit + 1) AND 7
' keep going until we reach a leaf
LOOP WHILE (node(nodeId).symbol = -1)
' got a leaf, write symbol to target buffer
MID$(buffer, i%, 1) = CHR$(node(nodeId).symbol)
NEXT i%
huffExpand$ = buffer
END FUNCTION
Nice.
The second option works similarly, except it compares the bits read from the input against the Huffman code list until one of the codes matches the sequence (remember the prefix-free property of Huffman codes.) There are many ways this technique can be optimized, but we'll do the bare minimum for now: sort the Huffman code list by increasing code length, then only test codes whose length matches the number of bits read from the input so far.
We need a routine to sort the code() array so we don't have to do it every time the huffExpand2$() function is called. If necessary, remember to clear the array and invoke huffMakeCodes() again before calling huffShrink$(). The routine is essentially identical to huffCharSort%(), going as far as returning the number of unique Huffman codes in the list:
DECLARE FUNCTION huffCodesBySize% (code() AS ANY)
codeCount = huffCodesBySize%(code())
''
'' sort Huffman code array by code length, returns number of unique codes
''
FUNCTION huffCodesBySize% (code() AS huffCodeType)
DIM hiRef AS INTEGER, max AS INTEGER
DIM count AS INTEGER, offset AS INTEGER
' sort array
count = UBOUND(code) - LBOUND(code) + 1
offset = count \ 2
DO WHILE offset
max = count - offset - 1
DO
hiRef = 0
FOR i% = 0 TO max
IF (code(i%).length > code(i% + offset).length) THEN
SWAP code(i%), code(i% + offset)
hiRef = i%
END IF
NEXT i%
max = hiRef - offset
LOOP WHILE hiRef
offset = offset \ 2
LOOP
' number of symbols used
FOR i% = 0 TO UBOUND(code)
IF (code(i%).length) THEN
count = i%
EXIT FOR
END IF
NEXT i%
' clear unused slots
FOR i% = 0 TO UBOUND(code) - count
code(i%) = code(i% + count)
NEXT i%
count = (UBOUND(code) - LBOUND(code) + 1) - count
FOR i% = count TO UBOUND(code)
code(i%).code = 0
code(i%).symbol = 0
code(i%).length = 0
NEXT i%
' number of unique codes
huffCodesBySize% = count
END FUNCTION
Here's the plan: we're going to read the code() array one entry at a time. In that loop, we'll read as many bits as necessary to have the same amount of bits as the currently selected entry in code(). If the temporary value we built so far matches the current code() entry, write the corresponding symbol to the output string, reset the temporary value and parse the whole list again. If the two values do not match, test the next entry, keep appending more bits if necessary, etc.
DECLARE FUNCTION huffExpand2$ (feed AS STRING, length AS INTEGER, code() AS ANY, codeCount AS INTEGER)
DIM target3 AS STRING
target3 = huffExpand2$(target, LEN(source), code(), codeCount)
PRINT target3
''
'' decompress feed into a buffer that is "length" bytes long, using Huffman codes
''
FUNCTION huffExpand2$ (feed AS STRING, length AS INTEGER, code() AS huffCodeType, codeCount AS INTEGER)
DIM scOfs AS INTEGER, scBit AS INTEGER, scVal AS INTEGER
DIM tgBit AS INTEGER, tgVal AS LONG
DIM buffer AS STRING
' reserve memory for target buffer
buffer = SPACE$(length)
' for each character we need
FOR i% = 1 TO length
' reset temporary stream buffer
tgBit = 0
tgVal = 0
' find matching code
FOR j% = 0 TO codeCount - 1
' load enough bits to temporary buffer so we can test this code
WHILE (tgBit < code(j%).length)
' read 8 bits from source buffer
IF (scBit = 0) THEN
' advance source buffer byte offset
scOfs = scOfs + 1
scVal = ASC(MID$(feed, scOfs, 1))
END IF
' source bit is set, also set target bit
IF (scVal AND lutSQR(scBit)) THEN tgVal = tgVal + lutSQR(tgBit)
' advance target buffer bit offset
tgBit = tgBit + 1
' advance source buffer (8-bit) bit offset
scBit = (scBit + 1) AND 7
WEND
' exit as soon as we get a matching sequence
IF (tgVal = code(j%).code) THEN EXIT FOR
NEXT j%
' got the code, write symbol to target buffer
MID$(buffer, i%, 1) = CHR$(code(j%).symbol)
NEXT i%
' return decompressed string
huffExpand2$ = buffer
END FUNCTION
So, what data do we need stored in our compressed file? Well, it depends on the algorithm used to decompress the stream: for huffExpand$() we need a copy of the tree somehow (either by saving it directly, or by saving the letter frequencies and rebuilding the tree from scratch.) The alternative, huffExpand2$(), requires the Huffman codes sorted by code length, which may take less space... aside from that data, we need the compressed stream (of course) as well as the length in bytes of the uncompressed data, or the number of junk bits in the last byte of compressed data...
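To make things concrete, here's one possible (completely made-up) container layout for the huffExpand2$() route, reusing the variables from the previous listings:
DIM header AS STRING
' uncompressed length (4 bytes) and number of Huffman codes (2 bytes)
header = MKL$(LEN(source)) + MKI$(codeCount)
' one record per code: symbol (1 byte,) bit length (1 byte,) code (4 bytes)
FOR i% = 0 TO codeCount - 1
header = header + CHR$(code(i%).symbol) + CHR$(code(i%).length) + MKL$(code(i%).code)
NEXT i%
' the complete file would then be header + target (the compressed stream)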
Huffman Encoding can be used with more than just 8-bit values; it could encode entire words, for instance. Of course the tree would look much bigger and we'd have to tweak our program here and there, but it's possible. That said, Huffman Encoding is not specialized, and other compression schemes may produce better output depending on the input data. However, it's still a nice way to crush a few more bytes out of already compressed streams; from personal experience, the compressed data sits somewhere between 54% and 57% of the initial length.
Curb your enthusiasm
Before you get your hopes up: this is toe-tipping territory. We'll get a working routine, but it won't be enough for large data streams and it's painfully unoptimized. That said...
LZW is particularly efficient with text and works with other types of data as well. It all starts from a simple observation: the longer the message, the more likely it is for patterns to emerge. If we could save those potential patterns as they appear in a large enough dictionary, then we could substitute them with shorter codes when they re-appear later down the line. The process itself is... a lot trickier.
Before we start, let's write a dirty "dictionary check" routine to know whether or not a string is already present, and another to initialize the first 256 entries to ASCII values (this way, we can look up each individual character in the dictionary and get its index.)
DECLARE FUNCTION dicIndex% (array() AS STRING, length AS INTEGER, match AS STRING)
DECLARE SUB dicInit (buffer() AS STRING, length AS INTEGER)
''
'' return the index of the specified string, or -1 if it's not there
''
FUNCTION dicIndex% (array() AS STRING, length AS INTEGER, match AS STRING)
FOR i% = 0 TO length - 1
IF (array(i%) = match) THEN
dicIndex% = i%
EXIT FUNCTION
END IF
NEXT i%
dicIndex% = -1
END FUNCTION
''
'' initialize dictionary
''
SUB dicInit (array() AS STRING, length AS INTEGER)
FOR i% = 0 TO 255
array(i%) = CHR$(i%)
NEXT i%: length = 256
FOR i% = length TO UBOUND(array)
array(i%) = ""
NEXT i%
END SUB
Compression
Here's how it goes: we read one byte at a time and set the temporary string (for prediction) to a combination of the active string (which is guaranteed to be in the dictionary) and the byte we just read. It's important to note that the active string is always available in the dictionary, while the temporary string may not be; so, we search the dictionary to see if the temporary string exists. If it does: copy the temporary string to the active string. If it doesn't: write the index of the active string to the output buffer, insert the temporary string into the dictionary for later, and replace the active string with the byte we just read. We're resetting the active string because we discovered a new pattern; as it is brand new, there's no chance we'll encounter a "more complete" version of it down the line, so we have to go back to a simpler string that we can grow until we (hopefully) get a match in the dictionary. If the active string is not empty after we're done parsing the input buffer, write its index to the output buffer.
DECLARE FUNCTION shrinkLZW$(feed AS STRING)
' source and target buffers
DIM source AS STRING, target AS STRING
' all purpose variable for display
DIM index AS INTEGER
source = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
target = shrinkLZW$(source)
' display compressed buffer; every code is an index in the dictionary but
' we'll show indices 0 to 255 as their initial ASCII value.
FOR i% = 1 TO LEN(target) STEP 2
index = CVI(MID$(target, i%, 2))
SELECT CASE index
CASE 0 TO 255
PRINT CHR$(index);
CASE ELSE
PRINT "["; LTRIM$(STR$(index)); "]";
END SELECT
NEXT i%
''
'' LZW compression
''
FUNCTION shrinkLZW$ (feed AS STRING)
DIM dictionary(1023) AS STRING ' dictionary
DIM dicSize AS INTEGER ' entries in the dictionary
DIM tempStr AS STRING ' temporary string (not yet registered)
DIM activeStr AS STRING ' active string (already registered)
DIM inByte AS STRING * 1 ' byte in input stream
DIM buffer AS STRING ' target buffer
' initialize dictionary
dicInit dictionary(), dicSize
' parse the whole input stream
FOR i% = 1 TO LEN(feed)
' read byte, combine active string with byte into temporary string
inByte = MID$(feed, i%, 1)
tempStr = activeStr + inByte
' temporary string is in the dictionary
IF (dicIndex%(dictionary(), dicSize, tempStr) <> -1) THEN
' copy temporary string to active string
activeStr = tempStr
' temporary string is not in the dictionary
ELSE
' append index of the active string to the output buffer
buffer = buffer + MKI$(dicIndex%(dictionary(), dicSize, activeStr))
' write temporary string to dictionary
dictionary(dicSize) = tempStr
dicSize = dicSize + 1
' replace active string by currently read byte
activeStr = inByte
END IF
NEXT i%
' done parsing the input; the active string still needs to be flushed
IF LEN(activeStr) THEN
' append index of the active string to the output buffer
buffer = buffer + MKI$(dicIndex%(dictionary(), dicSize, activeStr))
END IF
' return compressed buffer
shrinkLZW$ = buffer
END FUNCTION
Decompression
And of course, we need some decompression routine too... the cool thing about LZW is that we only have to store the compressed data stream because the dictionary is rebuilt on-the-fly during decompression.
DECLARE FUNCTION expandLZW$ (feed AS STRING)
PRINT expandLZW$(target)
''
'' LZW decompression
''
FUNCTION expandLZW$ (feed AS STRING)
DIM dictionary(1023) AS STRING ' dictionary
DIM dicSize AS INTEGER ' entries in the dictionary
DIM activeStr AS STRING ' existing string
DIM tempStr AS STRING ' predicted string
DIM inInteger AS INTEGER ' integer in input buffer
DIM buffer AS STRING ' target buffer
' initialize dictionary
dicInit dictionary(), dicSize
' get initial code
activeStr = dictionary(CVI(MID$(feed, 1, 2)))
buffer = activeStr
' parse the whole input stream
FOR i% = 3 TO LEN(feed) STEP 2
inInteger = CVI(MID$(feed, i%, 2))
' rebuild string now; it hasn't been written to the dictionary yet
IF (inInteger = dicSize) THEN
tempStr = activeStr + LEFT$(activeStr, 1)
' string already exists in the dictionary
ELSEIF (LEN(dictionary(inInteger))) THEN
tempStr = dictionary(inInteger)
END IF
' append string to target buffer
buffer = buffer + tempStr
' create new entry in dictionary
dictionary(dicSize) = activeStr + LEFT$(tempStr, 1)
dicSize = dicSize + 1
' set active string as the last string collected
activeStr = tempStr
NEXT i%
' return decompressed buffer
expandLZW$ = buffer
END FUNCTION
And that's basically how it works. "But... the compressed stream is bigger than the source!" I hear you cry. Yes, because we've been using fixed-length 16-bit codes. If we compare the number of elements in the source string (70x 1 byte) and the compressed stream (45x 2 bytes,) it's obvious something positive happened; but to actually produce shorter streams, we have to use variable-length codes (as demonstrated by Huffman Encoding.)
That's part of the problem with LZW: having the theory down is not enough (like that time we designed a horrible RLE compression scheme.) For instance, with longer input strings, the dictionary array will eventually run out of entries, and it will hog all the memory allocated to strings in no time. Since new strings are built off of existing entries, instead of writing full strings we could link entries together by index and only save the trailing byte; that would require other techniques to determine whether a string has been registered or not... but even that wouldn't be enough, as the number of entries is limited, so we'd also have to create special codes to force-reset the dictionary when it's full. And then we have to come up with variable-length bit codes...
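To give you an idea of the index-linking part, dictionary entries could be boiled down to a parent index plus a trailing byte, with full strings rebuilt on demand (a sketch of the idea, nothing more; entries 0 to 255 would get a parent of -1 and their own ASCII value as suffix:)
TYPE lzwEntryType
parent AS INTEGER ' index of the prefix entry, -1 for single-byte entries
suffix AS INTEGER ' ASCII value of the trailing byte
END TYPE
''
'' rebuild the full string attached to a dictionary index
''
FUNCTION lzwEntry$ (dict() AS lzwEntryType, index AS INTEGER)
DIM s AS STRING, n AS INTEGER
n = index
DO
s = CHR$(dict(n).suffix) + s
n = dict(n).parent
LOOP UNTIL (n = -1)
lzwEntry$ = s
END FUNCTION
Each entry costs 4 bytes no matter how long the string it represents is, and checking whether a string is registered becomes a matter of comparing (parent, suffix) pairs rather than whole strings.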
It is possible to write proper LZW compression and decompression routines in QuickBASIC (there's an amazing GIF viewer out there that's been around since the mid 90s that uses the same theory,) but for me? Right now? Meh.