Here is the outline for the parallelization research.

- Introduction
- Held and Karp's 1-tree relaxation
- Lagrangean cost function
- Why use Prim's algorithm?
- Computational experience
- Things to do

I'm particularly interested in finding out the best way to parallelize
branch and bound codes that use
Held and Karp's 1-tree relaxation.
These are of interest because this relaxation is the basis for the
most efficient sequential algorithms known for solving *symmetric* TSPs.
This is according to Balas and Toth's survey in The Traveling
Salesman Problem, edited by Lawler

Branch and bound algorithms based on 1-trees are also notable because the search trees they generate are very small. For instance, Smith and Thompson report that on problems of size 42, 48, 57, and 100 their algorithm generates 7, 7, 34, and 95 branch and bound search nodes, respectively. (Most of the time is spent computing many minimum 1-trees per branch and bound node.) Compare this with the thousands of nodes generated on a problem of size 48 when using the LMSK algorithm reported by Abdelrahman in 1994.

It therefore seems worthwhile to study the profitability of using many
processors per branch and bound node.
Abdelrahman calls this *subproblem-level*
parallelism.
I want to gain experience with exploiting parallelism at
this level, especially in using the
1-tree framework where it appears to be most promising.

Notice that any tour of G qualifies as a 1-tree for G.
It follows then that *the length of a shortest 1-tree of G is
a lower bound on the length of a shortest tour of G*.
A minimum cost 1-tree for G can be found by computing a minimum spanning
tree for (V\{1},E) and by scanning for the two shortest edges incident
upon vertex 1.
There are efficient algorithms for these two steps.
This is the basis for the use of the 1-tree relaxation in branch and
bound codes for the TSP.

Unfortunately, the length of the minimum 1-tree of a graph is on average only about 60% of the length of the shortest tour of that graph. How can we improve on this? The trick is to use Lagrange multipliers to incorporate some degree constraints. That is, we can add a cost to vertices that have either too many or too few edges.

Instead of computing the minimum 1-tree using the plain cost function
*c(i,j)*, use *c(i,j)+l(i)+l(j)* where *l* is some
real-valued function on the vertices. With some rearrangement and
noting that every vertex in a tour is of degree 2, we can show that
for any choice of *l*,
*L(l) = *min 1-tree cost (using *c(i,j)+l(i)+l(j)*)-
*2 sum_{i \in V} l(i) <= *min tour cost. (Sorry, but HTML isn't TeX.)

Given the right choice of *l*,
this Lagrangean lower bound consistently achieves 99.5% of the minimum
tour cost on symmetric instances, and is the basis for the most successful
algorithms for the symmetric TSP (up to 1985, according to Balas and Toth).

We use what is called a subgradient optimization method to compute the
*n*-vector *l*. This is basically a numerical hill-climbing
procedure in *n* dimensions. Some theory has been developed to
aid in the choice of step sizes, though there are still some parameters
that must be chosen; this is still a black art.

As suggested by Quinn's parallel programming text, I also implemented Sollin's algorithm. It suffers from two drawbacks. First, it is quite complex; the parallel version would require much fine-grained locking and synchronization. Second, it turns out to be much slower than Prim's algorithm on the (nearly) complete graphs that we encounter. (See my research log book for timing comparisons.)

[Perhaps insert descriptions of Prim's and Sollin's algorithm.]

I have run experiments to determine the speedup for various problem sizes with various numbers of processors. For small problems (e.g. 51 cities), there is no speedup over a the sequential algorithm. For larger problems, we get some speedup with the best being 60% of optimal speedup for a 3038 city problem on up to 10 processors, taking 8.6 seconds.

- It uses a best-first problem selection strategy. (Thanks to Wayne Hayes for the priority queue routines.)
- It uses geometrically decreasing alpha-step parameter. The exact scheme is original to me, although Volgenant and Jonker mention such a scheme in passing.
- It uses an original set of stopping rules based on the convergence properties of the ascent. I found those of Volgenant and Jonker to be too basic (a fixed number of steps) and those of Smith and Thompson too opaque.
- It uses the branching scheme of Volgenant and Jonker.
- It uses their rule that if two edges are fixed into a vertex, then no other edges should be allowed to be incident upon that vertex in the 1-tree computed.
- It uses a fictitious upper bound: After the initial ascent computes a lower bound for the general graph, an upper bound of 1.015 times that lower bound as an upper bound for the rest of the computation, until an actual tour is found that is less than that. If no such tour is found, then the computation is restarted after the initial ascent using 1.015 times the previously tried upper bound.

Number Ascents Task List Len 1-Trees Instance Cities Performed Avg Std Dev Computed Time (in seconds) ---------- ------ --------- ---- ------- -------- ---- grid25 25 65 16.4 8.6 675 3.7 dantzig42 42 376 67.9 42.5 3635 31.0 hk48 48 13 8.1 3.6 242 2.8 eil51 51 126 92.1 43.6 1706 29.5 st70 70 1854 1294.4 722.8 30412 982.5 eil76 76 373 328.2 189.9 6998 260.5 kroA100 100 9473 2836.1 2004.1 161288 10078.7 eil101 101 1009 291.3 94.7 10028 713.6 lin105 105 26 21.7 11.1 722 54.7 randA50 50 29 3.5 2.6 240 3.0 randB50 50 22 4.3 2.9 257 3.0 randC50 50 23 12.7 7.8 448 5.0 randD50 50 32 6.4 6.0 518 5.9 randA100 100 80 58.7 37.0 2164 89.8 randB100 100 225 172.5 94.6 4987 194.4 randC100 100 36 29.8 17.1 877 33.7 randD100 100 19 16.3 8.6 573 23.9 randA150 150 42 37.1 20.5 1269 134.8 randB150 150 96 76.1 49.0 2872 273.1 randC150 150 46 43.1 24.7 1690 158.1 randB150 150 21 18.7 10.9 726 70.0 randA200 200 26 25.5 14.5 930 159.7 randB200 200 200 184.3 106.7 9157 1540.9 randC200 200 44 36.4 23.2 1962 332.3 randD200 200 30 29.6 16.8 1396 237.3 randA300 300 57 54.3 30.7 2677 1040.9 randB300 300 86 82.5 47.3 6333 2435.4 randC300 300 40 39.2 22.8 1760 699.4 randD300 300 113 107.7 61.1 5611 2162.2 randA1000 1000 90 89.1 51.8 7204 32579.5

Instances beginning with `rand` have edge lengths which
were drawn from a uniform random distribution over [1,1000].

A posting to comp.theory by and email from Jay Herder (jherder at yahoo dot com)
prompted an investigation into grid graphs. He was interested in approximate
solutions on graphs where with embedded grids as part of a wiring problem.
Instance `grid25` is a 5 by 5 grid of points spaced 10 units apart.
Arbitrary
insertion found an optimal tour in this case, suggesting that
it is a good heuristic for these kinds of graphs.

All others were taken from the TSPLIB library of instances.

The compiler was GCC 2.5.8 with no optimization under Linux 1.0.8. The machine is an Intel 486 DX2/66 with 256K 20ns cache and 16MB 70ns main memory. For comparison, this machine is about as fast as a single processor on a KSR-1, and about 3 times faster (for this task) than a SPARC IPC.

The program requires little memory: the program and data on instance eil51.tsp
take up about 130 kilobytes on the 486, as reported by `top(1)`
under Linux. Note that this fits entirely within the 256K cache, even
though I exclusively use the `double` type for storage and manipulation
of the Lagrangean multipliers and edge lengths.

Note, however, that on Euclidean instances, I do *not* store
the entire cost matrix but instead recompute the cost function on the fly
from the stored coordinates. This is done to reduce the working set size,
especially for larger instances. I experimented with storing the edge
costs but found that this *slowed down*
the program by about 5% even on the
relatively small 51 city problem. That is, table lookup was slower than
recomputation.

- Design and test a parallelization of the BB algorithm.