===========================================================================
CSC 310                     Lecture 13 Summary                    Fall 2007
===========================================================================

Topics:

 -  LZ77

 -  LZ78

 -  LZW

 -  Dictionary based methods versus source modelling

----
LZ77
----

 -  Dictionary based method

 -  Use a buffer of size W, out of which S symbols have already been
    encoded (call this "past"), and W-S are still to be encoded (call this
    "present").

 -  Find the longest prefix of the "present" that occurs as a substring in
    the "past".

 -  Send:

     -  a pointer to beginning of the match

     -  the length of the match

     -  one new symbol immediately following the matched prefix

 -  Each time a symbol not seen before occurs at the beginning of the
    present part, the pointer sent will be null, and the length need not
    be transmitted as it is 0.

 -  The pointer is an index in the buffer, so it can take on 1+S possible
    values. We could use a fixed length code with ceil(log(1+S)) bits.

 -  The length is a value between 1 and W-S. Note, the value 0 only occurs
    with the special null pointer, and we need not send it. Thus, we could
    use a fixed length code with ceil(log(W-S)) bits.

 -  The new symbol can be any of the source alphabet, so we could send it
    using ceil(log(I)) bits.

 -  We could also use Huffman codes or arithmetic coding for the quantities
    above.

 -  Note, there is no problem if the substring that the present is matched
    against starts in the past, but extends into the present. The decoder
    can still reconstruct the string being encoded.

 -  In practice, S is in the order of 32KB, and the length of matches is
    restricted to about 256 bytes.

 -  Decoding is extremely fast, as it only involves copying strings,
    possibly containing many symbols.

 -  Encoding is more costly, as substring matching must be performed. In
    practice, this is achieved by using hash tables with all substrings of
    length 3.

----
LZ78
----

 -  In LZ77, the dictionary is, implicitly, the set of all substrings
    appearing in the sliding "past" window. If two strings occur more than
    W symbols afar, the LZ77 method cannot match them.

 -  In LZ78, the encoder and decoder construct a dictionary that doesn't
    "slide". Initially this dictionary contains only one special entry with
    index 0 and an empty string.

 -  To encode a string of symbols, the encoder finds the largest prefix s
    of this string that occurs in the dictionary. It sends:

     -  the index of the matched string s in the dictionary

     -  one symbol c following the match

 -  Additionally, the encoder adds s+c to the dictionary.

 -  Each time a new symbol occurs, LZ78 sends it as index 0 (empty match)
    followed by that symbol.

---
LZW
---

 -  In LZW, the need for sending a new symbol is removed as follows.

 -  The dictionary is initialized to contain each source symbol by itself.

 -  The encoder finds the longest prefix s of the stream to encode that
    appears in the dictionary. Let c be the next symbol following s.

 -  The encoder sends the code for s, adds s+c to the dictionary, and
    continues to encode from c.

 -  A subtle problem seems to arise in that, at the moment the encoder adds
    s+c to the dictionary, c is not transmitted, so it is unclear how the
    decoder could know about that particular dictionary item. However, this
    is not the problem as the decoder can figure out what c is by looking
    at the beginning of the following dictionary item.

------------------------------------------------
Dictionary based methods versus source modelling
------------------------------------------------

 -  As opposed to arithmetic coding, dictionary based techniques do not
    allow for "nice" source modelling. Even though they can be proved to
    compress down to the entropy under some assumptions about the source,
    their performance (in terms of compression achieved) does not match
    arithmetic coding.

 -  In terms of speed, however, dictionary based methods beat arithmetic
    coding, especially at decoding.

 -  Many known compression programs are based on LZ techniques, and their
    widespread use testifies to their suitability in practice.