===========================================================================
CSC 363H                Lecture Summary for Week 10             Summer 2007
===========================================================================

-------------------------
Hardness and Completeness
-------------------------

Suppose we have two classes of languages, C and D,
and we know that C is a subset of D. Furthermore, suppose
that we do not know whether or not C = D.

Idea: we want to identify the "hardest" problems in class D.

We say that a language L is D-hard if it has the follwing property:
    IF L is in C THEN C = D (i.e. D subset of C).

We say that a language L is D-complete if
    L is D-hard, and
    L is in D.

Note: it is not clear that any language should have this property.
However, if we manage to show that a language L is in fact D-complete,
we obtain the following:
    L is in C IFF C = D
Thus, in a sense, L is the "hardest" problem in D.
This has both theoretical and practical implications.

theoretical:
    In order to show that C =/= D, i.e. that C is a proper subset of D,
    it is enough to concentrate the efforts in showing that
    L does not belong to C.

practical:
    If very few people believe that C = D, showing that L is D-complete
    can be taken as very strong evidence that L is probably not in C.

-------------------------
Decidable vs Recognizable
-------------------------

Consider C = decidable languages, D = recognizable languages.

Define L to be Recognizable-hard if for all recognizable L', L' <=m L.

This definition agrees with the abstract definition given earlier
because, if L is decidable and L' <=m L, then L' is decidable.
Since this holds for every recognizable L', we obtain:
    If L is decidable, then Recognizable = Decidable.

Theorem: A_TM is Recognizable-complete.
Proof: easy exercise

Thus, according to the property we talked about earlier, we get
    A_TM is decidable IFF decidable = recognizable

What might seem strange at first sight is that in the computability part
of the course we managed to show that both sides of this IFF statement
are actually false. Things will not be as nice in the complexity part.

-------
P vs NP
-------

Now consider C = P and D = NP.

Define L to be NP-hard if for all L' in NP, L' <=p L.

Again, one can check that with this definition we get:
    If L is NP-hard and L is in P, then P = NP.

Define L to be NP-complete if
    L in NP, and
    L is NP-hard.

Theorem:  If A is NPc, then A in P iff P = NP.
Proof:
 -  If P = NP, then A NPc -> A in NP -> A in P.
 -  If A in P, then A NPc -> for all B in NP, B <=p A -> B in P (because
    A in P), i.e., NP subset of P so P = NP.

Corollary:  If P != NP and A is NPc, then A not in P.
So proving NP-completeness "as good as" proving not in P.

---------------
NP-completeness
---------------

 -  SAT = { F : F is a propositional formula that is satisfiable, i.e.,
                there is some assignment of truth-values to the variables
                of F that makes F true }

 -  Cook-Levin Theorem:  SAT is NPc
    SAT in NP: Given <F,c>, check that c encodes an assignment of truth
        values to the variables of F that makes F evaluate to TRUE.
        This can be done in polytime, and F is satisfiable iff
        there is some value of c that makes the verifier accept.
    SAT is NP-hard: (high-level idea only)
        For an arbitrary language A in NP, by definition, there is some NTM
        M_A that decides A in time <= C n^k for some constants C, k.
        Given an input x, we can construct a formula F_x that describes
        possible computation paths (of length at most C |x|^k) of M_A on x,
        such that F_x is satisfiable iff there is some computation path of
        M_A that accepts x.  Intuitively, we are simulating the TM model of
        computation using propositional formulas, which are similar to
        digital circuits.  Details are in the textbook and are needed to
        ensure that this can be done in polytime.

In general, to prove A is NP-hard, it's sufficient to show B <=p A for some
B known to be NP-hard: if B <=p A then for all L in NP, L <=p B (by
definition of NP-hardness for B) so L <=p A (since <=p is transitive,
something you can prove as an easy exercise).

Template for proofs of NP-completeness:  To show A is NPc, prove that
    A in NP: Describe a polytime verifier for A.
        "Given <x,c>, check c has right format and properties..."
        Argue that verifier runs in polytime and that
        x in A iff verifier accepts <x,c> for some c.
    A is NP-hard: Show B <=p A for some NP-hard B.
        "Given y, construct x as follows: ..."
        Argue that construction can be carried out in polytime
        and that y in B iff x in A
        (often by showing y in B -> x in A and x in A -> y in B).

Examples:

 -  3SAT is NPc:
    3SAT in NP because it's a special case of SAT.
    In proof of Cook-Levin, possible to construct formula F_x in CNF, so
    CNF-SAT is NP-complete; CNF-SAT <=p 3SAT is proven separately.

    Note:  Careful with directions!  Trivially, 3SAT <=p CNF-SAT (3SAT is
    special case of CNF-SAT).  But we need other direction, transforming
    instances of general problem into instances of restricted problem.

 -  VERTEX-COVER is NPc:  (see textbook)
    VERTEX-COVER (VC for short) = { <G,k> : G is a graph that contains a
    vertex cover of size k, i.e., a set C of k vertices such that each edge
    of G has at least one endpoint in C }
     .  VC in NP:  certificate = vertex cover of size k.
     .  VC is NP-hard:  3SAT <=p VC.
        Given F = (a1 \/ b1 \/ c1) /\ ... /\ (ar \/ br \/ cr), where
        ai,bi,ci in {x1,~x1,x2,~x2,...,xs,~xs}, construct G=(V,E) and k
        such that F satisfiable iff G contains vertex cover of size k, as
        follows:
            k = s + 2r
            V = { a1,b1,c1, ..., ar,br,cr, x1,~x1, ..., xs,~xs }
            E = { (xi,~xi) : 1 <= i <= s } U
                { (ai,bi),(bi,ci),(ci,ai) : 1 <= i <= r } U
                { (l,x) : l = ai or bi or ci, and x = xj or ~xj
                          corresponding to l }
        For example, if
        F = (x1 \/ ~x2 \/ ~x4) /\ (x2 \/ ~x3 \/ x1) /\ (~x3 \/ x4 \/ ~x2),
        then a1=x1, b1=~x2, c1=~x4, a2=x2, b2=~x3, c2=x1, a3=~x3, b3=x4,
        c3=~x2 so
        k = 4 + 2*3 = 10
        V = {a1,b1,c1, a2,b2,c2, a3,b3,c3, x1,~x1, x2,~x2, x3,~x3, x4,~x4}
        E = { (x1,~x1), (x2,~x2), (x3,~x3), (x4,~x4),
              (a1,b1), (b1,c1), (c1,a1), (a1,x1), (b1,~x2), (c1,~x4),
              (a2,b2), (b2,c2), (c2,a2), (a2,x2), (b2,~x3), (c2,x1),
              (a3,b3), (b3,c3), (c3,a3), (a3,~x3), (b3,x4), (c3,~x2) }

        Clearly, construction can be done in polytime (with one scan of F).

        Also, if F is satisfiable, then there is an assignment of truth
        values that make at least one literal in each clause true.  Pick a
        cover C as follows: for each variable, C contains xi or ~xi,
        whichever is true under the truth assignment; for each clause, C
        contains every literal except one that's true (pick arbitrarily if
        more than one true literal).  C contains exactly s+2r vertices and
        is a cover: all edges (xi,~xi) are covered; all edges in clause
        triangles are covered (because we picked two vertices from each
        triangle); all edges between "clauses" and "variables" are covered
        (two from inside triangle, one from true literal for that clause).

        Finally if G contains a cover C of size k=s+2r, C must contain at
        least one of xi or ~xi for each i (because of edges (xi,~xi)) and
        at least two of ai,bi,ci for each i (because of triangle), so only
        way for C to have size s+2r is to contain exactly one of xi or ~xi
        and exactly two of ai,bi,ci, for each i.  Since C covers all edges
        with only two vertices per triangle, the third vertex in each
        triangle must have its "outside" edge covered because of xi or ~xi.
        If we set literals according to choices of xi or ~xi in C, this
        will make formula F true: at least one literal will be true in each
        clause (because at least one edge from "variables" to "clauses" is
        covered by the variable in C).

 -  SUBSET-SUM is NPc:  Already known in NP.  For NP-hardness, show
    3SAT <=p SS:
    Given formula F = (a1 \/ b1 \/ c1) /\ ... /\ (ar \/ br \/ cr) where
    ai,bi,ci in {x1,~x1,...,xs,~xs}, construct numbers as follows:
     .  For j = 1,...,s,
        number xj = 1 followed by s-j 0s followed by r digits where
        k-th next digit equals 1 if xj appears in clause C_k, 0 otherwise;
        number ~xj = 1 followed by s-j 0s followed by r digits where
        k-th next digit equals 1 if ~xj appears in clause C_k, 0 otherwise.
     .  For j = 1,...,r,
        number Cj = 1 followed by r-j 0s and
        number C'j = 2 followed by r-j 0s.
     .  Target t = s 1s followed by r 4s.
    Clearly, this can be constructed in polytime.

    Example of reduction for
    F = (x1 \/ ~x2 \/ ~x4) /\ (x2 \/ ~x3 \/ x1) /\ (~x3 \/ x4 \/ ~x2):
    S = {  x1 = 1000110,
          ~x1 = 1000000,
           x2 = 0100010,
          ~x2 = 0100101,
           x3 = 0010000,
          ~x3 = 0010011,
           x4 = 0001001,
          ~x4 = 0001100,
           C1 = 0000100,
          C'1 = 0000200,
           C2 = 0000010,
          C'2 = 0000020,
           C3 = 0000001,
          C'3 = 0000002 }
            t = 1111444
    Note: slightly different from book to ensure S contains all distinct
    numbers (book's reduction constructs S with repeated numbers).

    If F is satisfiable, then there is a setting of variables such that
    each clause of F contains at least one true literal.  Consider the
    subset S' = {numbers that correspond to true literals}.  By
    construction, SUM_{x in S'} x = s 1s followed by r digits, each one of
    which is either 1, 2, or 3 (because each clause contains at least one
    true literal).  This means it is possible to add suitable numbers from
    {C1,C'1,...,Cr,C'r} so that the last r digits of the sum are equal to
    4, i.e., there is a subset S' such that SUM_{x in S'} x = t.

    If there is a subset S' of S such that SUM_{x in S'} x = t, then S'
    must contain exactly one of {xj,~xj} for j = 1,...,n, because that is
    the only way for the numbers in S' to add to the target (with a 1 in
    the first s digits).  Then, F is satisfied by setting each variable
    according to the numbers in S': for each clause j, the corresponding
    digit in the target is equal to 4 but the numbers Cj and C'j together
    only add up to 3 in that digit; this means that the selection of
    numbers in S' must include some literal with a 1 in that digit, i.e.,
    clause j contains at least one true literal.