TODAY: Approximation Algorithms for NP-Hard Problems
Given a computational problem, the steps we have undertaken in this class are:
1. design an algorithm
2. analyze the algorithm
2.1 Bound the running time
2.2 Prove correctness
Our goal, broadly stated,
is to design an algorithm which
- runs in polynomial time and
- is correct on all inputs.
Last week, when we introduced NP-completeness, we unfortunately saw
that the polynomial-time requirement cannot always be satisfied, as
the best known algorithms for NP-hard problems run in exponential
time.
Today, we will insist on satisfying the polynomial-time requirement
(wish) for NP-hard problems, but will relax the ``correctness''
requirement so as to be able to deal with NP-hard problems in `real
life'.
Speaking of `real life': although last week we restricted our
discussion to decision problems, since the theory of P, NP, and
NP-completeness is stated in terms of decision problems, in practice
many of the decision problems we talked about show up as `optimization
problems'.
As an example, consider the Traveling Salesman Problem (TSP).
The decision version is:
TSP_dec: Given a graph G=(V,E) with costs on the edges and a bound B,
does there exist a cyclic tour of the vertices (one which starts and
ends at the same vertex and visits all vertices exactly once) of total
cost <= B?
The optimization version of TSP is:
TSP_opt: Given a graph G=(V,E) with costs on the edges, find a cyclic
tour of the vertices (one which starts and ends at the same vertex and
visits all vertices exactly once) of minimum cost.
This is a `minimization' problem, as we are looking to minimize the
cost of the tour.
Another example is the CLIQUE problem.
The decision version is:
CLIQUE_dec: Given a graph G=(V,E) and a bound B, does there exist a
subset S of the vertices which forms a clique with |S| >= B?
The optimization version of CLIQUE is:
CLIQUE_opt: Given an undirected graph G=(V,E), find a subset S of the
vertices of maximum size such that S forms a clique.
This is a `maximization' problem, as we are looking to maximize the
size of the clique.
For the rest of today's lecture we will study what can be done for
optimization versions of NP-hard problems.
Indeed, can anything be done? What are possible approaches?
1. Run an exponential-time algorithm which always gives the correct
solution. This will work for small input sizes only.
2. Run an algorithm which produces potentially incorrect solutions for
some (or all) inputs, in return for polynomial time. This is what I
call a `heuristic': a strategy for producing solutions which gives no
guarantee as to their correctness (or quality, if we are solving an
optimization problem).
3. Run an `approximation algorithm' which
- always runs in polynomial time, and
- produces a solution which is PROVABLY within a guaranteed factor of
the optimal solution.
This is the approach we shall pursue today. E.g., we will design an
approximation algorithm for the TSP_opt problem when the input graph
and costs satisfy the triangle inequality, such that the tour produced
by our algorithm is provably within a factor of 2 of the cost of the
least costly tour.
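Approach 1 above (an exact, exponential-time algorithm) can be sketched for TSP_opt. This is a minimal illustration, not part of the lecture; it assumes the cost function is given as a Python dict on ordered vertex pairs of a complete graph.

```python
from itertools import permutations

def exact_tsp(vertices, cost):
    """Exact TSP by enumerating all (n-1)! tours: always correct,
    but exponential time, so usable for small inputs only."""
    start = vertices[0]
    best_cost, best_tour = float("inf"), None
    for perm in permutations(vertices[1:]):
        tour = (start,) + perm + (start,)
        c = sum(cost[(tour[i], tour[i + 1])] for i in range(len(tour) - 1))
        if c < best_cost:
            best_cost, best_tour = c, tour
    return best_cost, best_tour
```

For instance, on four vertices on a line with cost |u - v|, the optimal tour 0->1->2->3->0 has cost 6.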
How do we measure how good an approximation algorithm A is for some
optimization problem?
We will use the ratio-bound measure defined as follows.
Given an optimization problem on input I of size n, we are interested
in two quantities.
C_A(I) = cost of solution produced by the approximation algorithm on
input I (e.g. for TSP thats the cost of the tour produced for weighted
graph G whereas for CLIQUE it is the size of the clique produced for
graph G)
C*(I) = cost of optimal solution for input I (e.g. for TSP thats the
cost of the minimum cost tour in weighted graph G, whereas for CLIQUE
it is the size of the largest clique in graph G)
The RATIO-BOUND of algorithm A is r(n) if
MAX over inputs I of size n of max( C_A(I)/C*(I) , C*(I)/C_A(I) ) <= r(n)
The interpretation of this measure is:
For a maximization problem, we know that C*(I)/C_A(I) >= 1 for all I
(as the optimal is the largest); a ratio-bound of r(n) means that the
approximate solution C_A(I) is still larger than (or equal to) 1/r(n)
times the cost of the optimal solution C*(I).
For a minimization problem, we know that C_A(I)/C*(I) >= 1 for all I
(as the optimal is the smallest); a ratio-bound of r(n) means that
even for the worst input I of size n, the approximate solution C_A(I)
is less than (or equal to) r(n) times the cost of the optimal solution
C*(I).
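As a tiny illustration of the measure, the ratio achieved on a single input can be computed as follows (a hypothetical helper, not from the lecture):

```python
def ratio_on_input(c_alg, c_opt):
    """Ratio achieved on one input I: max(C_A(I)/C*(I), C*(I)/C_A(I)).
    Exactly one of the two ratios is >= 1, depending on whether the
    problem is a minimization or a maximization."""
    return max(c_alg / c_opt, c_opt / c_alg)
```

E.g., a tour of cost 10 when the optimum is 5, or a clique of size 5 when the maximum is 10, both give ratio 2.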
Today, we will first show an approximation algorithm for TSP_opt which
achieves a ratio-bound of 2. Our algorithm takes as input a graph
G=(V,E) which is complete, and a cost function c:E->R such that for
all u,v,w: c(u,v) <= c(u,w) + c(w,v) (the triangle inequality).
We remark that there exists a 1.5 ratio-bound algorithm for such
graphs in the literature (too complicated for class). Interestingly,
for graphs where the cost function corresponds to ordinary geometric
distance in the plane, there exist even better approximation
algorithms: for every epsilon 0 < e < 1, the algorithms run in time
polynomial in n and 1/e, and achieve ratio-bound (1+e). In contrast,
for general graphs and cost functions, no polynomial-time algorithm
can achieve any constant ratio-bound unless P=NP.
APPROX-TSP ALGORITHM (G,c)
--------------------------
1. Build MST T for G.
2. Pick an arbitrary vertex and call it r (for root).
Do a preorder walk of the tree T starting at r.
Call L the list of vertices in preorder visited in the walk.
3. Output a tour H that visits all vertices of G starting and returning
to r in the order prescribed by L.
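The three steps above can be sketched in Python. This is a minimal illustration, assuming the input is an n x n symmetric cost matrix obeying the triangle inequality; Prim's algorithm stands in for any MST routine, and the root r is fixed to vertex 0.

```python
def approx_tsp(cost):
    """APPROX-TSP sketch: MST (Prim), preorder walk, then output the tour."""
    n = len(cost)
    # Step 1: build an MST T of the complete graph, rooted at r = 0.
    parent = [0] * n
    best = [float("inf")] * n
    best[0] = 0.0
    in_tree = [False] * n
    children = [[] for _ in range(n)]
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if u != 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and cost[u][v] < best[v]:
                best[v], parent[v] = cost[u][v], u
    # Step 2: preorder walk of T starting at the root, collecting the list L.
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    # Step 3: the tour H visits vertices in preorder and returns to the root.
    tour = order + [0]
    return tour, sum(cost[tour[i]][tour[i + 1]] for i in range(n))
```

On the four corners of a unit square (Euclidean costs), this returns a tour of cost 4, which is in fact optimal, so well within the guaranteed factor of 2.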
THEOREM: APPROX-TSP runs in polynomial time and achieves a ratio-bound
of 2 on complete input graphs which obey the triangle-inequality.
PROOF: The running times of MST construction and the preorder walk are
polynomial in |E| and |V|.
Thus, it remains to show that the ratio-bound is 2. For any subgraph
L, let c(L) denote the sum of the costs of the edges of L.
For input graph G, let H* denote the optimal tour of G and T the MST
of G. Observe that H* with one edge removed is a spanning tree of the
graph, and hence its cost is at least that of the minimum spanning
tree T. Namely, c(T) <= c(H*).
Consider a "twice around" tour of T called W: starting at r, it
traverses each edge of the tree twice, once in each direction,
following the preorder walk. Clearly, c(W) = 2c(T).
Note that W visits all vertices, starting and ending at r.
Unfortunately, it may encounter each vertex multiple times.
We can change W into a real traveling-salesman tour as follows. Let H
be the tour obtained from W by short-cutting: repeatedly, for any
u->v->w in W such that vertex v has already been visited, go directly
from u to w (short-cutting v). The cost of H can only go down, since
by the triangle inequality c(u,v)+c(v,w) >= c(u,w). Moreover, the
resulting H is a traveling-salesman tour of the graph in which each
vertex is visited in the same order as in a preorder walk of the tree.
Putting this all together, c(H) <= c(W) = 2c(T) <= 2c(H*), and as a
result we establish that the ratio-bound c(H)/c(H*) <= 2.
QED
Next, let us show an approximation algorithm for the set cover
problem.
The set cover problem: given as input a universe of elements
U={1,...,m} and sets S_1,...,S_n such that the union of the S_i's is
U, find the smallest collection of indices I such that the union of
the S_i for i in I equals U.
It has many applications. For example, U may be a set of jobs to be
done, and S_i corresponds to the jobs that machine i can accomplish.
You would like to buy the smallest number of machines such that all
jobs can be done.
This problem is NP-complete. One can show an easy reduction from
vertex cover, where the universe is the set of edges and there is one
set per vertex, containing the edges incident to it.
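The reduction just described can be sketched as follows (an illustrative helper, assuming vertices are numbered 0..n-1 and edges are given as pairs):

```python
def vertex_cover_to_set_cover(n, edges):
    """Reduce vertex cover to set cover: the universe is the edge set,
    and set i consists of the edges incident to vertex i. A collection
    of sets covers all edges iff the corresponding vertices form a
    vertex cover, so covers of size k correspond exactly."""
    universe = set(edges)
    sets = [{e for e in edges if i in e} for i in range(n)]
    return universe, sets
```

On a triangle graph, for example, the set for vertex 1 is its two incident edges, and any two of the three sets form a minimum cover.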
We show a greedy approximation algorithm which achieves a ratio-bound
of O(ln m).
APPROX_SET_COVER ALGORITHM (U, S1,...,Sn)
-----------------------------------------
Repeat until all elements are covered:
- choose a new set S_i containing the maximum number of uncovered elements
- add i to I
- mark all elements of S_i as covered
Output I
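The greedy algorithm above can be sketched directly in Python (a minimal illustration, assuming the S_i are given as Python sets):

```python
def approx_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements; return the list I of chosen indices."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Greedy choice: the set with the largest number of new elements.
        i = max(range(len(sets)), key=lambda j: len(sets[j] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the sets do not cover the universe")
        chosen.append(i)
        uncovered -= sets[i]
    return chosen
```

E.g., on U={1,...,5} with sets {1,2,3}, {2,4}, {4,5}, {3,5}, greedy first takes {1,2,3} (three new elements) and then {4,5}, covering U with two sets.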
THEOREM: APPROX_SET_COVER is a polynomial-time algorithm which
achieves ratio-bound O(ln m) for the set cover problem, where m = |U|.
Proof: Consider an input U, S_1,...,S_n, and let k denote the size of
the minimum cover.
Let u_i = number of uncovered elements after the i-th iteration.
Initially u_0 = m. We know that there is a cover of size k, so there
must exist at least one set S_i that covers at least a 1/k fraction of
the uncovered elements (otherwise, by the pigeonhole principle, k sets
would not be enough). The greedy algorithm chooses such a largest set,
so after the first choice the number of remaining uncovered elements
goes down as follows:
u_1 <= u_0 - u_0/k = u_0(1-1/k) = m(1-1/k).
Moreover, the same argument gives the following Fact: at any point in
the algorithm there always exists a new set S_i that covers at least a
1/k fraction of the remaining uncovered elements.
Thus, we can apply the above argument repeatedly:
u_2 <= u_1(1-1/k) <= u_0(1-1/k)^2 = m(1-1/k)^2
u_3 <= u_2(1-1/k) <= ... <= m(1-1/k)^3
...
u_i <= m(1-1/k)^i
...
How long can this go on?
Let g be the number of iterations after which there is still at least
1 element left to cover, so that the greedy algorithm stops after
choosing one more set. That is, g satisfies 1 <= u_g <= m(1-1/k)^g.
Rewriting, 1 <= m((1-1/k)^k)^{g/k} <= m(1/e)^{g/k}, and thus
m >= e^{g/k} and g/k <= ln m.
The number of sets chosen by the greedy algorithm is g+1, thus
the ratio-bound = (g+1)/k <= ln m + 1. QED
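The key inequality used in the last step, (1-1/k)^k <= 1/e, can be checked numerically (an illustrative check, not part of the proof):

```python
import math

# Check that (1 - 1/k)^k <= 1/e for a range of k; this is what lets us
# bound u_g <= m(1-1/k)^g = m((1-1/k)^k)^(g/k) <= m * e^(-g/k).
for k in range(1, 200):
    assert (1 - 1 / k) ** k <= 1 / math.e + 1e-12
```

The quantity (1-1/k)^k increases toward 1/e from below as k grows, so the bound holds for every k >= 1.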
So far, we have seen two approximation algorithms for two different
minimization problems. The approximation for TSP had ratio-bound 2,
whereas the approximation for SET-COVER had an O(ln m) ratio-bound.
Indeed, optimization versions of NP-hard problems have widely
different behavior with respect to approximation, even though the
decision versions all seem to be "as hard" as each other.
The last example will be for the CLIQUE maximization problem. We will
show how to achieve a ratio-bound of O(n/log n). This seems quite
bad, but it can be proved that no better than n^b can be done for any
0 < b < 1, unless P=NP.