===========================================================================
  CSC 373H             Lecture Summary for Week 13             Winter 2006
===========================================================================

------------------------
Approximation algorithms
------------------------

Weighted Set Cover: See section 11.3 in text.

- ASCII notation: "\/" for set union, "/\" for set intersection.

- Input: U (universe of elements), subsets S_1,...,S_m of U with
  nonnegative integer weights w_1,w_2,...,w_m (one weight per subset).
  Output: Cover C subset of {1,2,...,m} such that \/_{i in C} S_i = U
  and SUM_{i in C} w_i is minimum.

- Example: U = {a,b,c,d,e},
    S_1 = {a,b},   w_1 = 2,     S_2 = {a,c,d}, w_2 = 5,
    S_3 = {b,e},   w_3 = 1,     S_4 = {c,d},   w_4 = 2.
  C = {1,2} is NOT a cover because S_1 \/ S_2 != U (it misses e).
  C = {2,3} is a cover of weight w_2 + w_3 = 5+1 = 6.
  C = {1,3,4} is a cover of weight 2+1+2 = 5, which is minimum.

- Greedy algorithm (a runnable sketch appears at the end of this
  section):
      // Select sets one by one, trying to minimize weight and
      // maximize the number of new elements covered at the same time.
      C := {}   // cover
      R := U    // remaining elements (i.e., not yet covered)
      while R != {}:
          pick i such that w_i / |S_i /\ R| is minimal
              // this minimizes "weight per new element covered"
          C := C \/ {i}
          R := R - S_i
      return C

- Analysis:
  . After picking i in the main loop, for each s in S_i /\ R, let
    c_s = w_i / |S_i /\ R| (c_s is the "cost paid to cover s", used
    only in the analysis).
  . By definition, each element covered during the algorithm is
    accounted for by exactly one c_s, so
      (11.9)  SUM_{i in C} w_i = SUM_{s in U} c_s.
  . But the greedy algorithm might "overpay" for some sets, i.e.,
    SUM_{s in S_k} c_s > w_k; can we bound how much greater this can be?
  . (11.10) For all S_k, SUM_{s in S_k} c_s <= H(|S_k|) w_k
    (where H(n) = 1 + 1/2 + ... + 1/n = Theta(log n)).
    Proof: Let d = |S_k| and S_k = {s_1, s_2, ..., s_d}, in order of
    coverage by the algorithm.
    * when s_1 is first covered, the set used is at least as good as
      S_k, so c_{s_1} <= w_k/d (cost per element for S_k)
    * when s_2 is first covered, the set used is at least as good as
      S_k, so c_{s_2} <= w_k/(d-1) (cost/elem for S_k - {s_1})
    * ...
    * when s_j is first covered, the set used is at least as good as
      S_k, so c_{s_j} <= w_k/(d-j+1)
      (cost/elem for S_k - {s_1,...,s_{j-1}})
    * ...
    Total: SUM_{s in S_k} c_s <= w_k/d + w_k/(d-1) + ... + w_k/1
                               = H(d) w_k.
  . Let d* = MAX_{i=1..m} |S_i| (max size of the S_i's), C* = an
    optimum set cover, and w* = SUM_{i in C*} w_i = optimum weight.
  . (11.11) SUM_{i in C} w_i <= H(d*) w* = O(log n) w*
    (since d* <= n).
    Proof:
    * By (11.10), for each i in C*,
        w_i >= (1/H(|S_i|)) SUM_{s in S_i} c_s
            >= (1/H(d*)) SUM_{s in S_i} c_s.
    * C* is a cover, so
        SUM_{i in C*} SUM_{s in S_i} c_s >= SUM_{s in U} c_s.
    * Hence,
        w* =  SUM_{i in C*} w_i
           >= SUM_{i in C*} (1/H(d*)) SUM_{s in S_i} c_s
           >= (1/H(d*)) SUM_{s in U} c_s
           =  (1/H(d*)) SUM_{i in C} w_i      (by 11.9)
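- A runnable Python sketch of the greedy algorithm above (illustrative
  only: the function name and the (set, weight)-list input format are
  not from the notes; sets are 0-indexed here, unlike S_1..S_m above):

      def greedy_set_cover(universe, sets):
          # sets: list of (frozenset, weight) pairs.  Assumes the sets
          # together cover universe.  Returns list of chosen indices.
          C = []                # indices of chosen sets
          R = set(universe)     # elements not yet covered
          while R:
              # pick the set minimizing weight per newly covered element
              best = min((i for i, (S, w) in enumerate(sets) if S & R),
                         key=lambda i: sets[i][1] / len(sets[i][0] & R))
              C.append(best)
              R -= sets[best][0]
          return C

      # The example above, 0-indexed; here greedy happens to find the
      # minimum-weight cover:
      sets = [(frozenset("ab"), 2), (frozenset("acd"), 5),
              (frozenset("be"), 1), (frozenset("cd"), 2)]
      print(greedy_set_cover("abcde", sets))  # [2, 3, 0], weight 1+2+2 = 5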
---------------------
Randomized algorithms
---------------------

Randomization and probabilistic analysis appear in two distinct ways:
- random input: do "average-case" analysis to study behaviour on
  typical inputs
- randomized algorithm: use randomization to make random decisions
  while processing the input, e.g., randomized quicksort

Some basic probability rules: Let A, B be events.
  P(not A) = 1 - P(A).
  P(A or B) <= P(A) + P(B).
  If A and B are independent, then P(A and B) = P(A) P(B).
  If A implies B, then P(A) <= P(B).

Contention Resolution (textbook section 13.1)
- We have n processes, P_1, ..., P_n, trying to access one resource
  (e.g., they may all be trying to modify a common database).
  We divide time into discrete intervals (rounds).
  If two or more processes try to access the resource simultaneously,
  all of them get locked out during that round.
- So trying to access as often as possible doesn't work.
- If the processes have a way of communicating with each other, they
  can easily ensure that everyone waits at most n rounds before
  getting access.
- But what if they cannot communicate? The strategy: choose some
  probability p > 0. In each round, each process attempts to access
  the resource with probability p.
  - randomization breaks the symmetry in the problem
- What value of p should we choose? What happens if p is too high
  (close to 1)? What happens if p is too low (close to 0)? How should
  p depend on n?
- Success probability: the probability that the first (or any other
  fixed) process succeeds in round 1.
  - We need the first process to attempt access while all others stay
    quiet:
      P_success = p (1-p)^{n-1}
  - Using calculus, this probability is maximized when p = 1/n.
  - Let p = 1/n.  Then
      P_success = p (1-p)^{n-1} = (1/n) (1 - 1/n)^{n-1}.
    As n grows, (1 - 1/n)^{n-1} approaches 1/e, so P_success approaches
    (1/n)(1/e), which is in Theta(1/n).  Denote this probability by
    Ps(n).  Moreover, Ps(n) > 1/(n e).  Up to a constant, this is the
    best we can hope for.
- Waiting for one process to finish:
  What is the probability that some fixed process is not able to write
  after t rounds?  The probability of failing in one round is
    1 - Ps(n) < 1 - 1/(n e),
  so the probability of failing in all of t rounds is
    (1 - Ps(n))^t < (1 - 1/(n e))^t
                  = ((1 - 1/(n e))^{n e})^{t/(n e)}
                  < (1/e)^{t/(n e)}.
  If t in Theta(n), this bounds the failure probability by a constant
  less than 1.  If t in Theta(n log n), say t = c e n ln n, the bound
  becomes (1/e)^{c ln n} = 1/n^c.
- Waiting for all processes to finish:
  Use the union bound (equation 13.2 in the textbook) over all n
  processes.  We get that Theta(n log n) rounds suffice with very high
  probability.  (A small simulation of the protocol appears below.)
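- A small Python simulation of the protocol (illustrative only: the
  function name, parameters, and the choice that a process stops
  contending once it has had access are mine, not from the notes):

      import random

      def contention_resolution(n, max_rounds=10**6):
          # Simulate n processes, each attempting access with
          # probability p = 1/n in every round; a process succeeds
          # exactly when it is the only one attempting.  Returns the
          # round in which the last process succeeds (None on timeout).
          p = 1.0 / n
          pending = set(range(n))   # processes not yet successful
          for t in range(1, max_rounds + 1):
              attempts = [i for i in pending if random.random() < p]
              if len(attempts) == 1:
                  pending.remove(attempts[0])   # lone attempt: success
              if not pending:
                  return t
          return None

      # Typical runs finish within O(n log n) rounds, matching the
      # analysis above:
      print(contention_resolution(50))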
Primality Testing
- A number n is prime if its only divisors are 1 and n.
  E.g., 3, 7, 23 are prime, and 91 = 7 x 13 is not.
- Large prime numbers are needed in the execution of many cryptographic
  protocols.  E.g., the RSA protocol starts with two large primes p and
  q and relies on the hope that n = pq is hard to factor.
- We want to produce large (100s of digits) prime numbers at random,
  fast.
- Possible way: pick a big random number n and test whether n is prime.
  By theorems on the density of prime numbers, we expect to need
  O(log n) trials before finding a prime.
- We need a quick procedure to test whether n is prime.
- Algorithm 1: For all k between 2 and sqrt(n), check whether k
  divides n.  If YES for some k, n is COMPOSITE.  If NO for all k,
  n is PRIME.
  - Is this algorithm useful?  No: sqrt(n) is about 2^{(log n)/2},
    i.e., exponential in the number of digits of n.
- A property of prime numbers (Fermat's Little Theorem):
  Let p be a prime and a be any number not divisible by p.  Then
    a^{p-1} = 1 (mod p).
  E.g., 2^6 = 64 = 63 + 1 = 1 (mod 7).
  Not true for non-primes in general:
    2^14 = 16384 = 16380 + 4 = 4 (mod 15).
  Sometimes true anyway (such n are called pseudoprimes to base a):
    14^14 = (-1)^14 = 1 (mod 15).
  Some composite numbers n satisfy a^{n-1} = 1 (mod n) for ALL a with
  gcd(a,n) = 1 (Carmichael numbers), e.g., 561 = 3*11*17.
- We can use the property to test whether n is prime (the Fermat test):
    Choose a random a, 1 < a < n.
    Compute b = a^{n-1} (mod n).
    If b = 1, output PRIME.
    If not, output COMPOSITE.
- What is the time required to compute b?  Using repeated squaring,
  O(log n) multiplications mod n, i.e., polynomial in the number of
  digits of n.
- The algorithm could be wrong!
  If n is prime, we always output PRIME.
  If n is composite, we might output either PRIME or COMPOSITE.
- Rabin-Miller Strong Pseudoprime Test: uses a similar but stronger
  property.
  Theorem: the algorithm misclassifies a composite number with
  probability <= 1/4.
- Improving performance: what is the probability that the answer is
  wrong after 2 independent rounds?
  The probability of failure is < (1/4)^2 = 1/16.
- What is the probability of giving the wrong answer after k rounds?
  If the output in any round is COMPOSITE, output COMPOSITE.
  If all rounds output PRIME, output PRIME.
  If n is prime, we always succeed.  If not, the probability of
  failure is < (1/4)^k.  Take k = 0.5 log_2 n rounds for a failure
  probability of at most 1/n, since (1/4)^{0.5 log_2 n} = 1/n.
  (A sketch of such a repeated test appears below.)
- Deterministic primality testing:
  Recently, Agrawal, Kayal and Saxena [2002] found a deterministic
  algorithm for primality testing running in time polynomial in log n.
  - More complicated than the simple tests seen above.
  - Thus PRIMES is in P.
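- A minimal Python sketch of the repeated test, using the simple
  Fermat test from these notes rather than Rabin-Miller (the function
  name is made up; recall that Carmichael numbers fool the Fermat test
  for every base coprime to n, so a robust implementation would use
  Rabin-Miller instead):

      import random

      def fermat_test(n, k):
          # k independent rounds of the Fermat test.  Assumes n > 3.
          for _ in range(k):
              a = random.randrange(2, n)   # random a with 1 < a < n
              # pow(a, n-1, n) computes a^(n-1) mod n by repeated
              # squaring: O(log n) multiplications mod n
              if pow(a, n - 1, n) != 1:
                  return "COMPOSITE"       # witness found
          return "PRIME"                   # all k rounds passed

      print(fermat_test(91, 10))   # 91 = 7*13: almost surely COMPOSITE
      print(fermat_test(97, 10))   # 97 is prime: always PRIME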