Here is the outline for the parallelization research.
I'm particularly interested in finding the best way to parallelize branch and bound codes that use Held and Karp's 1-tree relaxation. These are of interest because this relaxation is the basis for the most efficient sequential algorithms known for solving symmetric TSPs, according to Balas and Toth's survey in The Traveling Salesman Problem, edited by Lawler et al.
Branch and bound algorithms based on 1-trees are also notable because the search trees they generate are very small. For instance, Smith and Thompson report that on problems of size 42, 48, 57, and 100 their algorithm generates 7, 7, 34, and 95 branch and bound search nodes, respectively. (Most of the time is spent computing many minimum 1-trees per branch and bound node.) Compare this with the thousands of nodes generated on a problem of size 48 when using the LMSK algorithm reported by Abdelrahman in 1994.
It therefore seems worthwhile to study the profitability of using many processors per branch and bound node. Abdelrahman calls this subproblem-level parallelism. I want to gain experience with exploiting parallelism at this level, especially in using the 1-tree framework where it appears to be most promising.
A 1-tree for G = (V,E) is a spanning tree on the vertices V\{1} together with two edges incident upon vertex 1. Notice that any tour of G qualifies as a 1-tree for G. It follows that the length of a shortest 1-tree of G is a lower bound on the length of a shortest tour of G. A minimum cost 1-tree for G can be found by computing a minimum spanning tree for (V\{1},E) and then scanning for the two shortest edges incident upon vertex 1; there are efficient algorithms for both steps. This is the basis for the use of the 1-tree relaxation in branch and bound codes for the TSP.
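To make the construction concrete, here is a minimal sketch in C. The routine names, the use of vertex 0 as the special vertex, the cost(i,j) function, and the O(n^2) Prim scan are illustrative assumptions of mine, not the code used in the experiments below; the O(n^2) scan is a reasonable choice only because our graphs are nearly complete.

    #include <float.h>
    #include <stdlib.h>

    /* Assumed interface: cost(i,j) returns the length of edge (i,j). */
    extern double cost(int i, int j);

    /* Compute the cost of a minimum 1-tree on vertices 0..n-1 (assumes n >= 3),
     * with vertex 0 playing the role of the text's "vertex 1": build a minimum
     * spanning tree on the remaining vertices with an O(n^2) Prim scan, then add
     * the two cheapest edges incident upon vertex 0.  deg[] receives each
     * vertex's degree in the resulting 1-tree. */
    double min_one_tree(int n, int *deg)
    {
        double total = 0.0;
        double *best    = malloc(n * sizeof *best);   /* cheapest edge into the tree */
        int    *parent  = malloc(n * sizeof *parent); /* other endpoint of that edge */
        char   *in_tree = calloc(n, 1);

        for (int i = 0; i < n; i++) deg[i] = 0;

        /* Prim's algorithm on vertices 1..n-1, grown from vertex 1. */
        for (int i = 2; i < n; i++) { best[i] = cost(1, i); parent[i] = 1; }
        in_tree[1] = 1;
        for (int added = 1; added < n - 1; added++) {
            int v = -1;
            for (int i = 2; i < n; i++)
                if (!in_tree[i] && (v < 0 || best[i] < best[v])) v = i;
            in_tree[v] = 1;
            total += best[v];
            deg[v]++; deg[parent[v]]++;
            for (int i = 2; i < n; i++)
                if (!in_tree[i] && cost(v, i) < best[i]) { best[i] = cost(v, i); parent[i] = v; }
        }

        /* Attach vertex 0 by its two cheapest incident edges. */
        int e1 = -1, e2 = -1;
        double c1 = DBL_MAX, c2 = DBL_MAX;
        for (int i = 1; i < n; i++) {
            double c = cost(0, i);
            if (c < c1)      { c2 = c1; e2 = e1; c1 = c; e1 = i; }
            else if (c < c2) { c2 = c; e2 = i; }
        }
        total += c1 + c2;
        deg[0] = 2; deg[e1]++; deg[e2]++;

        free(best); free(parent); free(in_tree);
        return total;
    }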
Unfortunately, the length of the minimum 1-tree of a graph is on average only about 60% of the length of the shortest tour of that graph. How can we improve on this? The trick is to use Lagrange multipliers to incorporate the degree constraints: we attach a penalty to each vertex whose degree is wrong, that is, to vertices that have either too many or too few incident edges in the 1-tree.
Instead of computing the minimum 1-tree using the plain cost function c(i,j), use c(i,j) + l(i) + l(j), where l is some real-valued function on the vertices. Since every tour is a 1-tree, and since every vertex of a tour has degree 2, a tour's length under the modified costs exceeds its true length by exactly 2 sum_{i in V} l(i); rearranging, for any choice of l,

L(l) = [minimum 1-tree cost under c(i,j) + l(i) + l(j)] - 2 sum_{i in V} l(i) <= minimum tour cost.

(Sorry, but HTML isn't TeX.)
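In code the bound is a one-liner on top of the 1-tree routine sketched earlier. Here I assume a global array l[] of multipliers and that cost(i,j) now returns the modified cost c(i,j) + l[i] + l[j]; these are assumptions of this sketch, not a description of the actual program.

    /* Assumed: l[0..n-1] holds the current multipliers, and cost(i,j) used by
     * min_one_tree() returns c(i,j) + l[i] + l[j]. */
    extern double l[];

    double lagrangean_bound(int n, int *deg)
    {
        double modified = min_one_tree(n, deg);  /* 1-tree cost under modified costs */
        double sum_l = 0.0;
        for (int i = 0; i < n; i++) sum_l += l[i];
        return modified - 2.0 * sum_l;           /* L(l) from the inequality above */
    }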
Given the right choice of l, this Lagrangean lower bound consistently achieves 99.5% of the minimum tour cost on symmetric instances, and is the basis for the most successful algorithms for the symmetric TSP (up to 1985, according to Balas and Toth).
We use what is called a subgradient optimization method to compute the n-vector l. This is basically a numerical hill-climbing procedure in n dimensions. Some theory has been developed to aid in the choice of step sizes, though there are still some parameters that must be chosen; this is still a black art.
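Continuing the sketch above (and reusing its headers and routines), one simple form of the ascent looks like the following. The step-size schedule shown here, halving the step whenever the bound fails to improve, is only one textbook choice and is not necessarily the rule my code uses; the initial step and the cutoff are exactly the kind of parameters that must be tuned by hand.

    /* Subgradient ascent on the multipliers l[0..n-1].  Each iteration computes a
     * minimum 1-tree under the current modified costs, then pushes each l[i] in
     * the direction of its degree violation deg[i] - 2. */
    double subgradient_ascent(int n, int max_iters)
    {
        int *deg = malloc(n * sizeof *deg);
        double best = -DBL_MAX;
        double t = 1.0;                       /* initial step size: a tunable parameter */

        for (int iter = 0; iter < max_iters; iter++) {
            double bound = lagrangean_bound(n, deg);
            if (bound > best) best = bound;
            else t *= 0.5;                    /* shrink the step when we fail to improve */

            int is_tour = 1;                  /* degree 2 everywhere: the 1-tree is a tour */
            for (int i = 0; i < n; i++)
                if (deg[i] != 2) { is_tour = 0; break; }
            if (is_tour || t < 1e-6) break;

            for (int i = 0; i < n; i++)       /* subgradient step on each multiplier */
                l[i] += t * (deg[i] - 2);
        }
        free(deg);
        return best;                          /* best lower bound found */
    }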
As suggested by Quinn's parallel programming text, I also implemented Sollin's algorithm. It suffers from two drawbacks. First, it is quite complex; the parallel version would require much fine-grained locking and synchronization. Second, it turns out to be much slower than Prim's algorithm on the (nearly) complete graphs that we encounter. (See my research log book for timing comparisons.)
Briefly: Prim's algorithm grows a single tree from a start vertex, repeatedly adding the cheapest edge joining the tree to a vertex outside it; with a simple array of best-known edge costs it runs in O(n^2) time, which is hard to beat on dense graphs. Sollin's algorithm (also known as Boruvka's) instead maintains a forest in which, on each pass, every component selects its cheapest outgoing edge; the selected edges merge components, and the number of components at least halves with each pass.
I have run experiments to determine the speedup for various problem sizes with various numbers of processors. For small problems (e.g. 51 cities), there is no speedup over the sequential algorithm. For larger problems we do get some speedup, the best being 60% of optimal speedup for a 3038-city problem on up to 10 processors, taking 8.6 seconds.
                     Ascents     Task List Len       1-Trees    Time
Instance    Cities   Performed   Avg       Std Dev   Computed   (seconds)
----------  ------   ---------   -------   -------   --------   ---------
grid25          25          65      16.4       8.6        675         3.7
dantzig42       42         376      67.9      42.5       3635        31.0
hk48            48          13       8.1       3.6        242         2.8
eil51           51         126      92.1      43.6       1706        29.5
st70            70        1854    1294.4     722.8      30412       982.5
eil76           76         373     328.2     189.9       6998       260.5
kroA100        100        9473    2836.1    2004.1     161288     10078.7
eil101         101        1009     291.3      94.7      10028       713.6
lin105         105          26      21.7      11.1        722        54.7
randA50         50          29       3.5       2.6        240         3.0
randB50         50          22       4.3       2.9        257         3.0
randC50         50          23      12.7       7.8        448         5.0
randD50         50          32       6.4       6.0        518         5.9
randA100       100          80      58.7      37.0       2164        89.8
randB100       100         225     172.5      94.6       4987       194.4
randC100       100          36      29.8      17.1        877        33.7
randD100       100          19      16.3       8.6        573        23.9
randA150       150          42      37.1      20.5       1269       134.8
randB150       150          96      76.1      49.0       2872       273.1
randC150       150          46      43.1      24.7       1690       158.1
randB150       150          21      18.7      10.9        726        70.0
randA200       200          26      25.5      14.5        930       159.7
randB200       200         200     184.3     106.7       9157      1540.9
randC200       200          44      36.4      23.2       1962       332.3
randD200       200          30      29.6      16.8       1396       237.3
randA300       300          57      54.3      30.7       2677      1040.9
randB300       300          86      82.5      47.3       6333      2435.4
randC300       300          40      39.2      22.8       1760       699.4
randD300       300         113     107.7      61.1       5611      2162.2
randA1000     1000          90      89.1      51.8       7204     32579.5
Instances whose names begin with rand have edge lengths drawn from a uniform random distribution over [1,1000].
A posting to comp.theory by, and email from, Jay Herder (jherder at yahoo dot com) prompted an investigation into grid graphs. He was interested in approximate solutions on graphs with embedded grids, as part of a wiring problem. Instance grid25 is a 5 by 5 grid of points spaced 10 units apart. Arbitrary insertion found an optimal tour in this case, suggesting that it is a good heuristic for these kinds of graphs.
All others were taken from the TSPLIB library of instances.
The compiler was GCC 2.5.8 with no optimization under Linux 1.0.8. The machine is an Intel 486 DX2/66 with 256K 20ns cache and 16MB 70ns main memory. For comparison, this machine is about as fast as a single processor on a KSR-1, and about 3 times faster (for this task) than a SPARC IPC.
The program requires little memory: the program and data on instance eil51.tsp take up about 130 kilobytes on the 486, as reported by top(1) under Linux. Note that this fits entirely within the 256K cache, even though I exclusively use the double type for storage and manipulation of the Lagrangean multipliers and edge lengths.
Note, however, that on Euclidean instances, I do not store the entire cost matrix but instead recompute the cost function on the fly from the stored coordinates. This is done to reduce the working set size, especially for larger instances. I experimented with storing the edge costs but found that this slowed down the program by about 5% even on the relatively small 51 city problem. That is, table lookup was slower than recomputation.
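For concreteness, the recomputation amounts to something like the following; the array names are mine, with x[] and y[] standing for the coordinates read from the TSPLIB file. (TSPLIB's EUC_2D convention actually rounds the distance to the nearest integer; the unrounded value is shown here.)

    #include <math.h>

    /* On-the-fly Euclidean edge cost from stored coordinates (illustrative only;
     * the real program's data layout may differ). */
    double euclid_cost(const double *x, const double *y, int i, int j)
    {
        double dx = x[i] - x[j];
        double dy = y[i] - y[j];
        return sqrt(dx * dx + dy * dy);
    }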