'=========================================================================== ' Subject: HUFFMAN COMPRESSION FOR PB/CC Date: 10-30-98 (08:11) ' Author: Don Schullian Code: PBCC ' Origin: d83@ath.forthnet.gr Packet: PBCC.ABC '=========================================================================== $if 0 ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ· PowerBASIC v3.20 ÚÄÄ´ DASoft ÇÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ· ³ ÃÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄĶ Copyright 1997 ³ DATE: 1998-10-29 ÇÄ· ³ ³ FILE NAME hufmn-cc.bas º by ÔÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÄ º º ³ ³ LIBRARY º Don Schullian, Jr. º º ³ ÔÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍͼ º º ³ º º ³ The following code and notes are passed into PUBLIC DOMAIN. º º ³ º º ³ Honorable mention of my name is requested if any of this file is used º º ³ in another publication, recreated for mass consumption or teaching. º º ³ º º ÔÍÑÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍͼ º ÔÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍͼ After much searching and testing these HUFFMAN string (de)compression functions are about the smallest, fastest I can come up with. I've done away with the "hashing" functions the routines I've found used and just created the bit masks mathmatically. They work the same and look the same so "if it looks like a duck and quacks like a duck then....". Anyhow, what Huffman compression does is counts all the incidences of byte values used in a string (file), then stores the used vales in an array and sorts that array in decending order. (highest count first) Then each character is assigned a bit mask. The bit masks are then layed into a buffer and viola! compressed data. Maybe! There is an outside chance that the data won't comperss because there are too many varied and oft used odd-ball characters but... this would be rare for text in a single language. So lets take "PETER PIPER PICKED A PECK OF PICKLED PEPPERS." as a test string. After the count, sort and bit map create you get a set of arrays that look like this "P" CharVal(01) = 80 : Count%(01) = 9 : BitCode??(1) = "00" "E" CharVal(02) = 69 : Count%(02) = 8 : BitCode??(2) = "010" " " CharVal(02) = 32 : Count%(02) = 7 : BitCode??(2) = "011" "C" CharVal(02) = 67 : Count%(02) = 3 : BitCode??(2) = "1000" "R" CharVal(02) = 82 : Count%(02) = 3 : BitCode??(2) = "1001" "I" CharVal(02) = 73 : Count%(02) = 3 : BitCode??(2) = "1010" "K" CharVal(02) = 75 : Count%(02) = 3 : BitCode??(2) = "1011" "D" CharVal(02) = 68 : Count%(02) = 2 : BitCode??(2) = "11000" "F" CharVal(02) = 70 : Count%(02) = 1 : BitCode??(2) = "11001" "A" CharVal(02) = 65 : Count%(02) = 1 : BitCode??(2) = "11010" "." CharVal(02) = 46 : Count%(02) = 1 : BitCode??(2) = "11011" "O" CharVal(02) = 79 : Count%(02) = 1 : BitCode??(2) = "111000" "L" CharVal(02) = 76 : Count%(02) = 1 : BitCode??(2) = "111001" "S" CharVal(02) = 83 : Count%(02) = 1 : BitCode??(2) = "111010" "T" CharVal(02) = 84 : Count%(02) = 1 : BitCode??(2) = "111011" Packed up, "PETER" looks like this "000101110110101001" and takes up 18 bits or 2.25 bytes vs the original 5 bytes. Unpacking requires that you keep thinking in BITS left to right (that's backwards to the computer remember). When you read in the first bit and it is OFF then you're dealing with either the first, second or third character in the list ("P", "E" or " " ). If the next bit is, again OFF then the coded letter is "P" and so on for the next two bits if it is ON. If that frist bit is ON then you have to keep checking until you get an OFF bit and adding 4 for each ON bit you encounter. When you hit the OFF bit the next two bits add to your counter to determine which character in the list you're to use. ÚÄÄÄÄÄÄ + 4 ³ÚÄÄÄÄÄ + 4 ³³ÚÄÄÄÄ + 4 ³³³ÚÄÄÄÄÄÄÄÄÄ skip this bit ³³³³ÚÄÄ + 2 ³³³³³ÚÄ + 1 "T" = "111011" total = 15 so CharVal?(15) = 84 Here's the whole word "PETER" layed out with count and logic. ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the first bit is OFF so ³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the next bit is OFF too so ============ "P" ³³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the first bits OFF so ³³³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the next bit is ON so at least 2 ³³³³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the next bit is OFF so it's a 2 ======= "E" ³³³³³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the first bit is ON so at least 4 ³³³³³³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄ the next bit is ON so +4 for 8 ³³³³³³³ÚÄÄÄÄÄÄÄÄÄÄÄÄÄ the next bit is ON so +4 for 12 ³³³³³³³³ÚÄÄÄÄÄÄÄÄÄÄÄÄ this bit is OFF so end of 4's ³³³³³³³³³ÚÄÄÄÄÄÄÄÄÄÄÄ this bit is ON so +2 for 14 ³³³³³³³³³³ÚÄÄÄÄÄÄÄÄÄÄ this bit is ON so +1 for 15 ======= "T" ³³³³³³³³³³³ÚÄÄÄÄÄÄÄÄÄ the first bits OFF so ³³³³³³³³³³³³ÚÄÄÄÄÄÄÄÄ the next bit is ON so at least 2 ³³³³³³³³³³³³³ÚÄÄÄÄÄÄÄ the next bit is OFF so it's a 2 ======= "E" ³³³³³³³³³³³³³³ÚÄÄÄÄÄÄ this first bit is ON so at least 4 ³³³³³³³³³³³³³³³ÚÄÄÄÄÄ the next bit is OFF so end of 4's ³³³³³³³³³³³³³³³³ÚÄÄÄÄ this bit is OFF so add 0 ³³³³³³³³³³³³³³³³³ÚÄÄÄ this bit is ON so +1 for 5 ============ "R" 000101110110101001 Credit is due to that unknown individual who provided the HUFFMAN.TIP file that came with one of the earlier versions of PowerBASIC. From his or her notes I was able to decifer the coding system and create these functions. And so I now pass them onto you for your use. HAND, Don Schullian d83@DASoftVSS.com $endif DECLARE FUNCTION fHuffPack (O AS STRING) AS STRING DECLARE FUNCTION fHuffUnPack (P AS STRING) AS STRING FUNCTION PBmain () LOCAL Orig AS STRING LOCAL Packed AS STRING LOCAL UnPked AS STRING CLS Orig = "PETER PIPER PICKED A PECK OF PICKLED PEPPERS." Packed = fHuffPack( Orig ) PRINT FORMAT$( LEN(Orig ), "Len = ## : >>" ); Orig ; "<<" PRINT FORMAT$( LEN(Packed), "Len = ## : >>" ); Packed; "<<" PRINT STRING$(79,45) UnPked = fHuffUnPack( Packed ) PRINT "Orig: >>"; Orig ; "<<" PRINT "UnPk: >>"; UnPked; "<<" WaitKey$ END FUNCTION '========================================================================= '== fHuffPack( InString ) '== '== PARAMS: InString = the data to be packed '== RETURNS: Packed data string '== 2 bytes = original length '== 1 byte = number of unique characters '== x bytes = unique characters '== remainder = encoded data '== NULL if packed version of InString would exceed the '== string size limitations set with STRING '========================================================================= FUNCTION fHuffPack( InString AS STRING ) AS STRING DIM Count (256) AS LONG ' count of all chars DIM CharVal(256) AS LONG ' assigned values DIM BitCode(256) AS LONG ' bit values LOCAL Packed AS STRING ' working string buffer LOCAL InLen AS LONG ' length of incoming LOCAL CharCount AS LONG ' # of used chars LOCAL I_ptr AS BYTE PTR ' ptr to incoming LOCAL P_ptr AS BYTE PTR ' ptr to working buffer LOCAL BitBuf AS LONG ' working buffer of bits LOCAL BitCount AS LONG ' current bit position LOCAL X AS LONG ' loop counter LOCAL B AS LONG ' counter LOCAL F AS LONG ' counter LOCAL P AS LONG ' counter LOCAL S AS LONG ' counter ' I_ptr = STRPTR( InString ) ' set incoming pointer InLen = LEN( InString ) ' set incoming length ' FOR X = 1 TO InLen ' count each char in INCR Count(@I_ptr) ' the incoming string INCR I_ptr ' NEXT ' I_ptr = I_ptr - InLen ' reset pointer to start ' FOR X = 1 TO 256 ' assign and count the # IF Count(X) > 0 THEN ' of used characters INCR CharCount ' CharVal(X) = X ' END IF ' NEXT ' ' gets the char values in ARRAY SORT Count(), TAGARRAY CharVal(), DESCEND ' highest first order ARRAY INSERT CharVal(0) ' (1) is now highest ' BitCode(2) = 2 ' create a set of bit BitCode(3) = 3 ' values F = 8 ' 1 = "00" FOR X = 4 TO CharCount ' 2 = "010" BitCode(X) = F + B ' 3 = "011" IF B < 3 THEN ' 4 = "1000" INCR B ' 5 = "1001" ELSE ' 6 = "1010" B = F ' 7 = "1011" SHIFT LEFT F, 1 ' 8 = "11000" F = ( F OR B ) ' 12 = "111000" B = 0 ' END IF ' NEXT ' ' B = CharCount + 3 ' count number of bytes FOR X = 1 TO InLen ' required to hold the ARRAY SCAN CharVal(1), = @I_ptr, TO P ' 3 byte header + INCR I_ptr ' number of characters + SELECT CASE P ' the packed data CASE 1 : BitCount = BitCount + 2 ' (see list above for CASE 2, 3 : BitCount = BitCount + 3 ' bit counts) CASE ELSE : BitCount = BitCount + (P \ 4) + 3 ' END SELECT ' IF BitCount => 32000 THEN ' lay off 4000 bytes B = B + 4000 ' BitCount = BitCount - 32000 ' END IF ' NEXT ' IF BitCount > 0 THEN ' clean-up leftover BitCount = BitCount + 7 ' bits + whatever is B = B + ( BitCount \ 8 ) ' needed to get a full END IF ' bit I_ptr = I_ptr - InLen ' reset incoming pointer Packed = STRING$( B, 0 ) ' create working buffer MID$( Packed, 1, 2 ) = MKI$(InLen) ' put len of original P_ptr = STRPTR( Packed ) + 2 ' set the working pointer @P_ptr = CharCount ' byte 3 = # of chars FOR X = 1 TO CharCount ' bytes 4 -> are the INCR P_ptr ' char values that @P_ptr = CharVal(X) ' correspond to the bit NEXT ' codes ' BitCount = 0 ' start processing bytes FOR X = 1 TO InLen ' ARRAY SCAN CharVal(1), = @I_ptr, TO P ' find array pos of next INCR I_ptr ' char / bump pointer SELECT CASE P ' set # of bits required CASE 1 : S = 2 ' to hold the bit-code CASE 2, 3 : S = 3 ' CASE ELSE : S = (P \ 4) + 3 ' END SELECT ' SHIFT LEFT BitBuf, S ' shift buffer # new bits BitBuf = ( BitBuf OR BitCode(P) ) ' OR new bits into buffer BitCount = BitCount + S ' incr bit counter IF BitCount => 16 THEN GOSUB fHuffPack1 ' if the buffer is full NEXT ' IF BitCount > 0 THEN ' clean-up leftover bits IF BitCount > 8 THEN GOSUB fHuffPack1 ' more than 1 byte IF BitCount > 0 THEN ' still not done SHIFT LEFT BitBuf, 8 - BitCount ' move bits to left INCR P_ptr ' next/last pos in buf @P_ptr = ( BitBuf AND 255 ) ' put it there, dude! END IF ' END IF ' ' FUNCTION = Packed ' RETURN packed data EXIT FUNCTION ' '======================================================'======================== fHuffPack1: ' remove full bytes from DO ' the bit buffer and put BitCount = BitCount - 8 ' them in the work buf ROTATE RIGHT BitBuf, BitCount ' move the left most INCR P_ptr ' bit to the right @P_ptr = ( BitBuf AND 255 ) ' stuff it into the SHIFT RIGHT BitBuf, 8 ' strip right most bits ROTATE LEFT BitBuf, BitCount + 8 ' move bits back in order LOOP UNTIL BitCount < 8 ' RETURN ' ' END FUNCTION '========================================================================= '== fHuffUnPack( Packed ) '== '== PARAMS: Packed '== SEE: fHuffPack for header details '== RETURNS: Unpacked data '== NOTE: There is NO error checking etc. as it is assumed that what '== you send in Packed was created with fHuffPack '========================================================================= FUNCTION fHuffUnPack( Packed AS STRING ) AS STRING DIM CharVal(256) AS LONG ' char values from LOCAL UnPacked AS STRING ' working buffer LOCAL InLen AS LONG ' length of incoming LOCAL OutLen AS LONG ' final length of out LOCAL I_ptr AS BYTE PTR ' pointer to incoming LOCAL O_ptr AS BYTE PTR ' pointer to outgoing LOCAL BitBuf AS LONG ' buffer to work bits LOCAL FirstBit AS LONG ' current bit position LOCAL S AS LONG ' counter LOCAL P AS LONG ' counter ' OutLen = CVI( Packed ) ' final returned len UnPacked = STRING$( OutLen + 4,0 ) ' create working buf O_ptr = STRPTR( UnPacked ) ' set working ptr I_ptr = STRPTR( Packed ) + 2 ' set incoming ptr S = @I_ptr ' # of chars used InLen = LEN( Packed ) - ( S + 2 ) ' # of packed bytes FirstBit = -1 ' starts down here ' FOR P = 1 TO S ' load char values INCR I_ptr ' into array CharVal(P) = @I_ptr ' NEXT ' ' DO ' start processing WHILE ( FirstBit < 15 ) AND ( InLen > 0 ) ' if bit buffer empty DECR InLen ' knock off a byte INCR I_ptr ' next byte in incoming FirstBit = FirstBit + 8 ' loading 8 bits SHIFT LEFT BitBuf, 8 ' make room in bit buf BitBuf = ( BitBuf OR @I_ptr ) ' stuff new byte WEND ' IF BIT( BitBuf, FirstBit ) = 0 THEN ' if the bit is 0 DECR FirstBit ' next bit position IF BIT(BitBuf,FirstBit) = 0 THEN ' if still 0 then P = 1 ' this is code "00" ELSE ' the bit is ON DECR FirstBit ' next bit position P = 2 + BIT(BitBuf,FirstBit) ' "01" or "11" END IF ' ELSE ' else the bit is 1 P = 0 ' clear code value DO ' check next bits P = P + 4 ' each ON = 4 DECR FirstBit ' LOOP UNTIL BIT(BitBuf,FirstBit) = 0 ' until we hit an OFF FOR S = 2 TO 1 STEP -1 ' skip next bit then DECR FirstBit ' IF BIT(BitBuf,FirstBit) THEN P = P + S ' add 2 and/or 1 NEXT ' END IF ' DECR FirstBit ' next bit position @O_ptr = CharVal(P) ' set the decoded char INCR O_ptr ' move working buf ptr LOOP UNTIL ( InLen = 0 ) AND ( FirstBit < 0 ) ' got to get them all ' FUNCTION = LEFT$( UnPacked, OutLen ) ' RETURN truncated ' END FUNCTION