=========================================================================== CSC 310H Lecture 10 Summary Fall 2007 =========================================================================== ------ Topics ------ - Doubling the interval around 1/2. - Arithmetic Coding, take 3. - How many final bits? - Buffer decoding. - Integer implementation. - References: section 6.2, lectures 7-8 from 2006 course by prof Roweis. -------------------------------- Doubling the interval around 1/2 -------------------------------- Idea: - We do not want v-u to become small, in order to prevent floating point arithmetic errors. - In our algorithm, [u,v) might become smaller and smaller, yet always including the midpoint 1/2. - The moment [u,v) is included in [1/4, 3/4) = [0.01, 0.11), we know that if we later discover that the next bit to transmit is going to be 1, after the expansion, [u,v) will be contained in [0.0, 0.1), so the next bit will be 0. Similarly, if the next bit turns out to be 0, we know that the one following that will be a 1. - There might be several of these expansions in a row. - We use this observation as follows. We keep a counter c of "opposite" bits that we have to output immediately after the next bit. ------------------------- Arithmetic coding, take 3 ------------------------- Algo: u <- 0 v <- 1 c <- 0 for every symbol a_{i_k} in turn r <- v-u u <- u + r*sum_{j=1}^{i_k-1}{p_j} v <- v + r*p_{i_k} while (u >= 1/2) or (v <= 1/2) or ((u >= 1/4) and (v <= 3/4)) if u >= 1/2 send 1 send c 0s c <- 0 u <- 2*(u-1/2) v <- 2*(v-1/2) else if v <= 1/2 send 0 send c 1s c <- 0 u <- 2*u v <- 2*v else if ((u >= 1/4) and (v <= 3/4)) c <- c+1 u <- 2*(u-1/4) v <- 2*(v-1/4) end if end while end for send enough final bits to specify a number inside [u,v) -------------------- How many final bits? -------------------- - With the current scheme, it will always be the case that either [0, 1/2) or [1/2, 3/4) is completely contained inside [u, v). Thus, at most 2 final bits are needed to specify a number inside [u, v): either 01 or 10. --------------- Buffer decoding --------------- Idea: - We again keep an interval of interest [u,v). - We perform decoding by loading a buffer b of size k with bits from the message we receive. - We identify the next letter to output by looking where inside [u,v) the 2^{-k} interval int(b) is located. - We expand the interval [u,v) as in the encoding part, and every time we perform an expansion, we discard one bit from the buffer and we read in another input bit. - NOTE: THE BIT WE DISCARD IS DIFFERENT DEPENDING ON THE EXPANSION. ----------------------------------- Arithmetic Coding, take 3, decoding ----------------------------------- Algo: [u,v) <- [0,1) read k bits into buffer b do divide [u,v) into subintervals find i so that int(b) is a subset of the subinterval of a_i output a_i [u,v) <- subinterval corresponding to a_i while (u >= 1/2) or (v <= 1/2) or ((u >= 1/4) and (v <= 3/4)) if u >= 1/2 [u,v) <- [2*(u-1/2),2*(v-1/2)) discard leftmost bit in b (it should be a 1 anyhow) shift contents of b to the left read in next input bit in the rightmost position of b else if v <= 1/2 [u,v) <- [2*u,2*v) discard leftmost bit in b (it should be a 0 anyhow) shift contents of b to the left read in next input bit in the rightmost position of b else if (u >= 1/4) and (v <= 3/4) [u,v) <- [2*(u-1/4),2*(v-1/4)) discard 2nd leftmost bit of b (should be opposite of leftmost) shift contents of b to the left but keeping leftmost bit read in next input bit in the rightmost position of b end if end while end do Note: - Whenever there are no bits left to read in the encoding, use 0s. - Must make sure it is possible to select next symbol a_i using current buffer b of size k. ---------------------- Integer implementation ---------------------- Ideas: - We want to avoid using floating point arithmetic. Instead, use m-bit integers everywhere. How big should m be? - We can estimate probabilities with integer counts. Suppose that {a_1, .., a_I} occur with counts {f_1, .., f_I}. For all i, define F_i = sum_{j=1}^i{f_j}. Then, we can use f_i/F_I instead of p_i. - Use integers u' and v' instead of u and v with the intended meaning that [u',v') represents the interval [u'/2^m, (v'+1)/2^m). - Adding the 1 to v' allows us to represent the endpoint v=1 using only m bits. Without the +1, we would have needed m+1 bits: 2^m itself is a 1 followed by m (not m-1) 0s. - The smallest the interval [u', v') is going to get is (2^m - 1)/4. We must make sure none of the symbols get an empty subinterval. The smallest probability can be 1/F_I (if there are symbols with a count of 0, bump them up to 1 to prevent problems). Thus, we want (2^m - 1)/4 * 1/F_I >= 1 m >= log(1 + 4*F_I) - Setting m = 2 + ceil(log(F_I + 1)) is enough. --------------------------- How to terminate the string --------------------------- - We need a way to be able to tell the decoder to stop. - Use a special symbol #, that stands for end-of-transmission. - Give it a constant probability, say p_# = .1, and split the remaining probability space (1-p_#) among the other symbols.