Beware. This page was up to date as of March 1995. Now I'm more interested in approximation algorithms for the TSP. Until I write up my thoughts on that field, I'll keep this page around. (I'm getting a degree doing research, not updating web pages.) Until then, visit my current research interests page.

# Research interests

I'm looking into innovative ways of parallelizing branch and bound codes for the traveling salesman problem (TSP).

Here is the outline for the parallelization research.

## Introduction

Conventional wisdom holds that parallelism should be exploited at the highest level, resulting in the coarsest grain computation and lowest synchronization costs. In the branch and bound framework, this means using one processor per subproblem node, but processing many nodes at once. Tarek Abdelrahman calls this algorithm-level parallelism. However, he has shown that naive implementation of this strategy on shared memory machines can lead to excessive contention and synchronization costs when accessing the central pool of nodes waiting to be processed.

I'm particularly interested in finding out the best way to parallelize branch and bound codes that use Held and Karp's 1-tree relaxation. These are of interest because this relaxation is the basis for the most efficient sequential algorithms known for solving symmetric TSPs. This is according to Balas and Toth's survey in The Traveling Salesman Problem, edited by Lawler et al., 1985. It is always a good idea to start one's work with the best algorithm known so far.

Branch and bound algorithms based on 1-trees are also notable because the search trees they generate are very small. For instance, Smith and Thompson report that on problems of size 42, 48, 57, and 100 their algorithm generates 7, 7, 34, and 95 branch and bound search nodes, respectively. (Most of the time is spent computing many minimum 1-trees per branch and bound node.) Compare this with the thousands of nodes generated on a problem of size 48 when using the LMSK algorithm reported by Abdelrahman in 1994.

It therefore seems worthwhile to study the profitability of using many processors per branch and bound node. Abdelrahman calls this subproblem-level parallelism. I want to gain experience with exploiting parallelism at this level, especially in using the 1-tree framework where it appears to be most promising.

## Held and Karp's 1-tree relaxation

So what is a 1-tree? Suppose we are given a graph G=(V,E) with n nodes, i.e.,|V|=n. Then a 1-tree for G is a spanning tree for (V\{1},E) together with two edges in G incident upon vertex 1.

Notice that any tour of G qualifies as a 1-tree for G. It follows then that the length of a shortest 1-tree of G is a lower bound on the length of a shortest tour of G. A minimum cost 1-tree for G can be found by computing a minimum spanning tree for (V\{1},E) and by scanning for the two shortest edges incident upon vertex 1. There are efficient algorithms for these two steps. This is the basis for the use of the 1-tree relaxation in branch and bound codes for the TSP.

## Lagrangean cost function

When used as in a branch and bound framework, it pays to have as tight a lower bounding function as possible. Pearl has shown that a linear increase in the error of the lower bounding function leads to an exponential increase in the number of subproblems generated.

Unfortunately, the length of the minimum 1-tree of a graph is on average only about 60% of the length of the shortest tour of that graph. How can we improve on this? The trick is to use Lagrange multipliers to incorporate some degree constraints. That is, we can add a cost to vertices that have either too many or too few edges.

Instead of computing the minimum 1-tree using the plain cost function c(i,j), use c(i,j)+l(i)+l(j) where l is some real-valued function on the vertices. With some rearrangement and noting that every vertex in a tour is of degree 2, we can show that for any choice of l, L(l) = min 1-tree cost (using c(i,j)+l(i)+l(j))- 2 sum_{i \in V} l(i) <= min tour cost. (Sorry, but HTML isn't TeX.)

Given the right choice of l, this Lagrangean lower bound consistently achieves 99.5% of the minimum tour cost on symmetric instances, and is the basis for the most successful algorithms for the symmetric TSP (up to 1985, according to Balas and Toth).

We use what is called a subgradient optimization method to compute the n-vector l. This is basically a numerical hill-climbing procedure in n dimensions. Some theory has been developed to aid in the choice of step sizes, though there are still some parameters that must be chosen; this is still a black art.

## Why use Prim's algorithm?

The graphs that we compute MSTs for are complete or very nearly so. In such cases, Prim's algorithm is asymptotically optimal because it scans each edge exactly once and its storage requirements can be made linear in the number of vertices. See the MILES_SPAN program in Knuth's excellent The Stanford GraphBase, especially sections 69 and 70.

As suggested by Quinn's parallel programming text, I also implemented Sollin's algorithm. It suffers from two drawbacks. First, it is quite complex; the parallel version would require much fine-grained locking and synchronization. Second, it turns out to be much slower than Prim's algorithm on the (nearly) complete graphs that we encounter. (See my research log book for timing comparisons.)

[Perhaps insert descriptions of Prim's and Sollin's algorithm.]

## Computational experience

### Minimum spanning trees

I have tested code for minimum spanning trees using Prim's algorithm targetted for KSR shared memory parallel machines. I have modified the algorithm to allow the specification that certain edges must appear in the 1-tree and that others must not. These are called fixed in and fixed out edges, respectively.

I have run experiments to determine the speedup for various problem sizes with various numbers of processors. For small problems (e.g. 51 cities), there is no speedup over a the sequential algorithm. For larger problems, we get some speedup with the best being 60% of optimal speedup for a 3038 city problem on up to 10 processors, taking 8.6 seconds.

### Branch and bound code for the TSP itself

I have completed a sequential version of the entire branch and bound algorithm with the following features:
• It uses a best-first problem selection strategy. (Thanks to Wayne Hayes for the priority queue routines.)
• It uses geometrically decreasing alpha-step parameter. The exact scheme is original to me, although Volgenant and Jonker mention such a scheme in passing.
• It uses an original set of stopping rules based on the convergence properties of the ascent. I found those of Volgenant and Jonker to be too basic (a fixed number of steps) and those of Smith and Thompson too opaque.
• It uses the branching scheme of Volgenant and Jonker.
• It uses their rule that if two edges are fixed into a vertex, then no other edges should be allowed to be incident upon that vertex in the 1-tree computed.
• It uses a fictitious upper bound: After the initial ascent computes a lower bound for the general graph, an upper bound of 1.015 times that lower bound as an upper bound for the rest of the computation, until an actual tour is found that is less than that. If no such tour is found, then the computation is restarted after the initial ascent using 1.015 times the previously tried upper bound.
It solves instances to optimality in the following times:
            Number  Ascents    Task List Len   1-Trees
Instance    Cities  Performed  Avg   Std Dev  Computed  Time (in seconds)
----------  ------  ---------  ----  -------  --------  ----
grid25         25         65    16.4    8.6     675      3.7
dantzig42      42        376    67.9   42.5    3635     31.0
hk48           48         13     8.1    3.6     242      2.8
eil51          51        126    92.1   43.6    1706     29.5
st70           70       1854  1294.4  722.8   30412    982.5
eil76          76        373   328.2  189.9    6998    260.5
kroA100       100       9473  2836.1 2004.1  161288  10078.7
eil101        101       1009   291.3   94.7   10028    713.6
lin105        105         26    21.7   11.1     722     54.7
randA50        50         29     3.5    2.6     240      3.0
randB50        50         22     4.3    2.9     257      3.0
randC50        50         23    12.7    7.8     448      5.0
randD50        50         32     6.4    6.0     518      5.9
randA100      100         80    58.7   37.0    2164     89.8
randB100      100        225   172.5   94.6    4987    194.4
randC100      100         36    29.8   17.1     877     33.7
randD100      100         19    16.3    8.6     573     23.9
randA150      150         42    37.1   20.5    1269    134.8
randB150      150         96    76.1   49.0    2872    273.1
randC150      150         46    43.1   24.7    1690    158.1
randB150      150         21    18.7   10.9     726     70.0
randA200      200         26    25.5   14.5     930    159.7
randB200      200        200   184.3  106.7    9157   1540.9
randC200      200         44    36.4   23.2    1962    332.3
randD200      200         30    29.6   16.8    1396    237.3
randA300      300         57    54.3   30.7    2677   1040.9
randB300      300         86    82.5   47.3    6333   2435.4
randC300      300         40    39.2   22.8    1760    699.4
randD300      300        113   107.7   61.1    5611   2162.2
randA1000    1000         90    89.1   51.8    7204  32579.5


Instances beginning with rand have edge lengths which were drawn from a uniform random distribution over [1,1000].

A posting to comp.theory by and email from Jay Herder (jherder at yahoo dot com) prompted an investigation into grid graphs. He was interested in approximate solutions on graphs where with embedded grids as part of a wiring problem. Instance grid25 is a 5 by 5 grid of points spaced 10 units apart. Arbitrary insertion found an optimal tour in this case, suggesting that it is a good heuristic for these kinds of graphs.

All others were taken from the TSPLIB library of instances.

The compiler was GCC 2.5.8 with no optimization under Linux 1.0.8. The machine is an Intel 486 DX2/66 with 256K 20ns cache and 16MB 70ns main memory. For comparison, this machine is about as fast as a single processor on a KSR-1, and about 3 times faster (for this task) than a SPARC IPC.

The program requires little memory: the program and data on instance eil51.tsp take up about 130 kilobytes on the 486, as reported by top(1) under Linux. Note that this fits entirely within the 256K cache, even though I exclusively use the double type for storage and manipulation of the Lagrangean multipliers and edge lengths.

Note, however, that on Euclidean instances, I do not store the entire cost matrix but instead recompute the cost function on the fly from the stored coordinates. This is done to reduce the working set size, especially for larger instances. I experimented with storing the edge costs but found that this slowed down the program by about 5% even on the relatively small 51 city problem. That is, table lookup was slower than recomputation.

## Things to do

• Design and test a parallelization of the BB algorithm.