=========================================================================== CSC 310 Lecture 13 Summary Fall 2007 =========================================================================== Topics: - LZ77 - LZ78 - LZW - Dictionary based methods versus source modelling ---- LZ77 ---- - Dictionary based method - Use a buffer of size W, out of which S symbols have already been encoded (call this "past"), and W-S are still to be encoded (call this "present"). - Find the longest prefix of the "present" that occurs as a substring in the "past". - Send: - a pointer to beginning of the match - the length of the match - one new symbol immediately following the matched prefix - Each time a symbol not seen before occurs at the beginning of the present part, the pointer sent will be null, and the length need not be transmitted as it is 0. - The pointer is an index in the buffer, so it can take on 1+S possible values. We could use a fixed length code with ceil(log(1+S)) bits. - The length is a value between 1 and W-S. Note, the value 0 only occurs with the special null pointer, and we need not send it. Thus, we could use a fixed length code with ceil(log(W-S)) bits. - The new symbol can be any of the source alphabet, so we could send it using ceil(log(I)) bits. - We could also use Huffman codes or arithmetic coding for the quantities above. - Note, there is no problem if the substring that the present is matched against starts in the past, but extends into the present. The decoder can still reconstruct the string being encoded. - In practice, S is in the order of 32KB, and the length of matches is restricted to about 256 bytes. - Decoding is extremely fast, as it only involves copying strings, possibly containing many symbols. - Encoding is more costly, as substring matching must be performed. In practice, this is achieved by using hash tables with all substrings of length 3. ---- LZ78 ---- - In LZ77, the dictionary is, implicitly, the set of all substrings appearing in the sliding "past" window. If two strings occur more than W symbols afar, the LZ77 method cannot match them. - In LZ78, the encoder and decoder construct a dictionary that doesn't "slide". Initially this dictionary contains only one special entry with index 0 and an empty string. - To encode a string of symbols, the encoder finds the largest prefix s of this string that occurs in the dictionary. It sends: - the index of the matched string s in the dictionary - one symbol c following the match - Additionally, the encoder adds s+c to the dictionary. - Each time a new symbol occurs, LZ78 sends it as index 0 (empty match) followed by that symbol. --- LZW --- - In LZW, the need for sending a new symbol is removed as follows. - The dictionary is initialized to contain each source symbol by itself. - The encoder finds the longest prefix s of the stream to encode that appears in the dictionary. Let c be the next symbol following s. - The encoder sends the code for s, adds s+c to the dictionary, and continues to encode from c. - A subtle problem seems to arise in that, at the moment the encoder adds s+c to the dictionary, c is not transmitted, so it is unclear how the decoder could know about that particular dictionary item. However, this is not the problem as the decoder can figure out what c is by looking at the beginning of the following dictionary item. ------------------------------------------------ Dictionary based methods versus source modelling ------------------------------------------------ - As opposed to arithmetic coding, dictionary based techniques do not allow for "nice" source modelling. Even though they can be proved to compress down to the entropy under some assumptions about the source, their performance (in terms of compression achieved) does not match arithmetic coding. - In terms of speed, however, dictionary based methods beat arithmetic coding, especially at decoding. - Many known compression programs are based on LZ techniques, and their widespread use testifies to their suitability in practice.