TODAY: Approximation Algorithms for NP-Hard Problems

Given a computational problem, the steps we undertook in this class are
1. design an algorithm
2. analyze the algorithm
   2.1 Bound the running time
   2.2 Prove correctness

Our goal, broadly stated, is to design an algorithm which
- runs in polynomial time and
- is correct on all inputs.

Last week, when we introduced NP-completeness, we unfortunately saw that the polynomial-time requirement cannot always be satisfied, as the best algorithms known for NP-hard problems run in exponential time. Today, we will insist on satisfying the polynomial-time requirement (wish) for NP-hard problems, but will relax the "correctness" requirement so as to be able to deal with NP-hard problems in "real life".

Speaking of "real life": although last week we restricted our discussion to decision problems, as the theory of P, NP, and NP-completeness is stated in terms of decision problems, in actuality many of the decision problems we talked about show up as "optimization problems". As an example, consider the Traveling Salesman Problem (TSP). The decision version is:

TSP_dec: Given a graph G=(V,E) with costs on the edges and a bound B, does there exist a cyclic tour of the vertices (which starts and ends at the same vertex and visits all vertices exactly once) of total cost <= B?

The optimization version of TSP is:

TSP_opt: Given a graph G=(V,E) with costs on the edges, find a cyclic tour of the vertices (which starts and ends at the same vertex and visits all vertices exactly once) of minimum cost.

This is a "minimization" problem, as we are looking to minimize the cost of the tour. Another example is the CLIQUE problem. The decision version is:

CLIQUE_dec: Given a graph G=(V,E) and a bound B, does there exist a subset S of the vertices which forms a clique with |S| >= B?

The optimization version of CLIQUE is:
CLIQUE_opt: Given an undirected graph G=(V,E), find a subset S of the vertices of maximum size such that S forms a clique.

This is a "maximization" problem, as we are looking to maximize the size of the clique.

For the rest of today's lecture we will study what can be done for optimization versions of NP-hard problems. Indeed, can anything be done? What are possible approaches?

1. Run an exponential-time algorithm which always gives the correct solution. This will work for small input sizes only.
2. Run an algorithm which produces potentially incorrect solutions for some (or all) inputs, in return for polynomial time. This is what I call a "heuristic": a strategy for producing solutions which gives no guarantee as to their correctness (or quality, if we are solving an optimization problem).
3. Run an "approximation algorithm" which
   - always runs in polynomial time, and
   - produces a solution which is PROVABLY within a guaranteed factor of the optimal solution.

This is the approach we shall pursue today. E.g., we will design an approximation algorithm for the TSP_opt problem when the input graph and costs satisfy the triangle inequality, such that the tour produced by our algorithm will provably cost within a factor 2 of the cost of the least costly tour.

How do we measure how good an approximation algorithm A for some optimization problem is? We will use the ratio-bound measure, defined as follows. Given an optimization problem on input I of size n, we are interested in two quantities:

C_A(I) = cost of the solution produced by the approximation algorithm on input I (e.g., for TSP that is the cost of the tour produced for the weighted graph G, whereas for CLIQUE it is the size of the clique produced for graph G)

C*(I) = cost of the optimal solution for input I (e.g.,
for TSP that is the cost of the minimum-cost tour in the weighted graph G, whereas for CLIQUE it is the size of the largest clique in graph G).

The RATIO-BOUND of algorithm A is r(n) if

  max over inputs I of size n of max( C_A(I)/C*(I) , C*(I)/C_A(I) ) <= r(n)

Interpretation of this measure: For a maximization problem, we know that C*(I)/C_A(I) >= 1 for all I (as the optimal is the largest); a ratio-bound of r(n) means that even on the worst input I of size n, the approximate solution's cost C_A(I) is still larger than (or equal to) 1/r(n) times the cost of the optimal solution C*(I). For a minimization problem, we know that C_A(I)/C*(I) >= 1 for all I (as the optimal is the smallest); a ratio-bound of r(n) means that even on the worst input I of size n, the approximate solution's cost C_A(I) is still less than (or equal to) r(n) times the cost of the optimal solution C*(I).

Today, we will first show an approximation algorithm for TSP_opt which achieves a ratio-bound of 2. Our algorithm takes as input a graph G=(V,E) which is complete, and a cost function c:E->R such that for all u,v,w: c(u,v) <= c(u,w) + c(w,v) (the triangle inequality). We remark that there exists a 1.5 ratio-bound algorithm for such graphs in the literature (too complicated for class). Interestingly, for graphs where the cost function corresponds to ordinary geometric distance in the plane, there exist even better approximation algorithms: for every epsilon with 0 < epsilon < 1, the algorithms run in time polynomial in n and 1/epsilon, and achieve ratio-bound (1+epsilon). In contrast, for general graphs and cost functions, no approximation algorithm achieving any constant ratio-bound exists unless P=NP.

APPROX-TSP ALGORITHM (G,c)
--------------------------
1. Build an MST T of G.
2. Pick an arbitrary vertex and call it r (for root). Do a preorder walk of the tree T starting at r. Call L the list of vertices in the order visited by the walk.
3. Output a tour H that visits all vertices of G, starting at and returning to r, in the order prescribed by L.
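The three steps above can be sketched in Python. This is a minimal sketch under one assumption not in the notes: the input is given as 2-D points, so that Euclidean distances supply a complete graph automatically obeying the triangle inequality. The function name `approx_tsp` is ours.

```python
import math

def approx_tsp(points):
    """2-approximation for metric TSP: MST + preorder walk (APPROX-TSP sketch).

    `points` are 2-D coordinates; the complete graph with Euclidean edge
    costs satisfies the triangle inequality.
    """
    n = len(points)

    def cost(u, v):
        return math.dist(points[u], points[v])

    # Step 1: build an MST T of the complete graph (Prim's algorithm).
    children = {v: [] for v in range(n)}              # T stored as child lists
    best = {v: (cost(0, v), 0) for v in range(1, n)}  # v -> (cheapest edge, parent)
    while best:
        v = min(best, key=lambda w: best[w][0])
        _, parent = best.pop(v)
        children[parent].append(v)
        for w in best:
            if cost(v, w) < best[w][0]:
                best[w] = (cost(v, w), v)

    # Step 2: preorder walk of T starting at the root r = 0.
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))  # visit children in discovery order

    # Step 3: the tour H visits vertices in preorder and returns to r.
    tour = order + [0]
    return tour, sum(cost(tour[i], tour[i + 1]) for i in range(n))
```

Prim's algorithm here is quadratic, which is fine for a complete graph where |E| = Theta(|V|^2) anyway; the theorem below only needs the running time to be polynomial.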
THEOREM: APPROX-TSP runs in polynomial time and achieves a ratio-bound of 2 on complete input graphs which obey the triangle inequality.

PROOF: The running times for the MST construction and the preorder walk are polynomial in |E| and |V|. Thus, it remains to show that the ratio-bound is 2. For any subgraph L, let c(L) denote the value obtained by summing the costs of all the edges of L. For input graph G, let H* denote the optimal tour of G and T the MST of G. Observe that H* with one edge removed is a spanning tree of the graph; its cost is at least that of the minimum spanning tree T, and (edge costs being non-negative) at most c(H*). Namely,

  c(T) <= c(H*).

Consider a "twice around" tour of T called W, which, starting at r, traverses each edge of the tree twice, once in each direction, following the preorder walk. Clearly, c(W) = 2c(T). Note that W visits all vertices, starting and ending at r. Unfortunately, each vertex may be encountered multiple times. We can change W into a genuine traveling-salesman tour as follows. Let H be the tour obtained from W by short-cutting: repeatedly, for any u,v,w such that u->v->w appears in W and vertex v has already been visited, go directly from u to w (short-cutting v). The cost can only go down, since by the triangle inequality c(u,v) + c(v,w) >= c(u,w). Moreover, the resulting H is a traveling-salesman tour of the graph in which each vertex is visited in the same order as in the preorder walk of the tree. Putting this all together,

  c(H) <= c(W) = 2c(T) <= 2c(H*),

and as a result we establish that the ratio-bound = approx/optimal = c(H)/c(H*) <= 2. QED

Next, let us show an approximation algorithm for the set cover problem. The set cover problem: given as input a universe of elements U = {1,...,m} and sets S_1,...,S_n such that the union of the S_i's is U, find the smallest collection of indices I such that the union of the S_i for i in I is U. It has many applications. For example, U may be a set of jobs to be done, and S_i corresponds to the jobs that machine i can accomplish.
You would like to buy the smallest number of machines such that all jobs can be done. This problem is NP-complete. One can show an easy reduction from vertex cover, where the universe is the set of edges and each vertex contributes the set of edges incident to it. We show a greedy approximation algorithm which achieves a ratio-bound of O(ln m).

APPROX_SET_COVER ALGORITHM (U, S_1, ..., S_n)
Repeat until all elements are covered:
 - choose a new set S_i containing the maximum number of uncovered elements
 - add i to I
 - mark all elements of S_i as covered
Output I

THEOREM: APPROX_SET_COVER is a polynomial-time algorithm which achieves a ratio-bound of O(ln m) for the set cover problem, where m = |U|.

PROOF: On input U, S_1,...,S_n, let k denote the size of the minimum cover. Let u_i = the number of uncovered elements after the i-th iteration. Initially u_0 = m. We know that there is a cover of size k, so there must exist at least one set S_i that covers at least a 1/k fraction of the uncovered elements (otherwise, by the pigeonhole principle, even the k sets of the optimal cover could not cover all the remaining elements). The greedy algorithm chooses the set covering the most uncovered elements, so after the first choice of a set, the number of remaining uncovered elements goes down as follows:

  u_1 <= u_0 - u_0/k = u_0 (1-1/k) = m (1-1/k).

Moreover, we know that

Fact: At any point in the algorithm, there always exists a new set S_i that covers at least a 1/k fraction of the remaining uncovered elements.

Thus, we can apply the above argument repeatedly:

  u_2 <= u_1 (1-1/k) <= u_0 (1-1/k)^2 = m(1-1/k)^2
  u_3 <= u_2 (1-1/k) <= ... <= m(1-1/k)^3
  ...
  u_i <= m(1-1/k)^i
  ...

How long can this go on? Consider the largest value of g such that, after inserting all but the last set of the greedy cover, there is at least 1 more element left to cover. That is, g such that

  1 <= u_g <= m(1-1/k)^g.

Rewriting, using (1-1/k)^k <= 1/e,

  1 <= m(1-1/k)^g = m((1-1/k)^k)^{g/k} <= m(1/e)^{g/k},

and thus m >= e^{g/k} and g/k <= ln m. The number of sets chosen by the greedy algorithm is g+1, thus the ratio-bound is (g+1)/k <= ln m + 1.
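The greedy procedure can be sketched in a few lines of Python. This is our own minimal sketch; the dictionary-of-sets input format and the name `approx_set_cover` are assumptions, not from the notes.

```python
def approx_set_cover(universe, sets):
    """Greedy set cover (APPROX_SET_COVER sketch): ratio-bound O(ln m).

    `universe` is an iterable of elements; `sets` maps an index i to the
    set S_i. Returns the list of chosen indices I.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Choose the set covering the maximum number of uncovered elements.
        i = max(sets, key=lambda j: len(sets[j] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(i)
        uncovered -= sets[i]  # mark all elements of S_i as covered
    return chosen
```

The loop runs at most m times and each iteration scans all n sets, so the running time is polynomial in n and m, as the theorem requires.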
QED

So far, we have seen two approximation algorithms for two different minimization problems. The approximation for TSP had ratio-bound 2, whereas the approximation for SET-COVER had an O(ln m) ratio-bound. Indeed, optimization versions of NP-hard problems have widely different behavior with respect to approximation, even though the decision versions all seem to be "as hard" as each other. The last example will be for the CLIQUE maximization problem. We will show how to achieve a ratio-bound of O(n/log n). This seems quite bad, but it can be proved that no better than n^b can be done for any 0