Transit Note #2

Using the Network for Load Balancing on Parallel Computers

Andre DeHon

Original Issue: April 1990

Last Updated: Wed Nov 10 22:28:18 EST 1993

Idea

For parallel computer systems we must be able to effectively assign computation tasks to the various processors. This problem remains a very open area of research. If computational tasks are poorly partitioned/scheduled across processors a number of problems can arise that will undermine the benefits of parallel computation. Bottlenecks on resources can effectively serialize computation. Hot spots in the routing network can also limit the amount of parallelism that can be utilized.

Perhaps we can do dynamic load balancing of parallel computations by taking advantage of network loading. That is, if we use the loading inherent in the network to direct computation to the portions of the parallel computer with least congestion, we should be able to get a good distribution of computation and avoid canonical hot spots.

Scheme

It's easiest to think about this in a model where computations are spawned from processors in the machine. Each task is sent to some other processor to be performed.

To get a good distribution of tasks among the processors in the computer, the spawned computation takes the path of least resistance through the network. That is, the spawning processor sends the computation to whichever processor it can most easily reach given the current network loading/congestion. This should distribute computations fairly evenly and work to minimize the congestion inherent in the computations.

Implementation

This can be implemented moderately easily on a future version of RN1 (or just about any routing component). All that is needed is a DON'T CARE routing specification. When the router gets this routing specification, it routes the connection in any direction it can. Ideally, it will route in the least congested direction, if it knows anything about the congestion further into the network. The router should then choose randomly among equally good directions, in much the same way that it selects among equivalent output ports.
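A minimal sketch of this routing decision, in Python. The `DONT_CARE` marker, the per-port congestion estimates, and the function name are all hypothetical, invented here for illustration; the sketch just shows the selection rule described above: honor an explicit port request, and for a DON'T CARE request pick among the least-congested ports, breaking ties randomly.

```python
import random

DONT_CARE = None  # hypothetical marker for a "route anywhere" request

def route(requested_port, congestion):
    """Choose an output port for one message.

    requested_port: an output port index, or DONT_CARE.
    congestion: one congestion estimate per output port (lower is
                less congested).

    For a DON'T CARE request, pick among the least-congested ports,
    breaking ties randomly -- the same way the router selects among
    equivalent output ports. The chosen port is returned, which is
    also the information the status word would need to carry back.
    """
    if requested_port is not DONT_CARE:
        return requested_port
    least = min(congestion)
    candidates = [p for p, c in enumerate(congestion) if c == least]
    return random.choice(candidates)
```

The returned port index doubles as the extra status information discussed below: the spawning processor needs to learn which direction the datum actually went.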

The status information returned by the router should then indicate which direction (if any) the datum was actually routed in addition to the information it currently returns. Providing this information will allow the spawning processor to know where the computation has been sent.

Fault-Tolerance

This should work well from a fault-tolerance perspective. The network simply won't route to any processor it cannot reach. As such, nodes which are faulty, or which are unreachable due to network faults, won't be assigned computations.

Dealing with computations that were assigned to a processor which has since become faulty is a harder issue. If each processor spawning off tasks keeps some record of which task was assigned to which processor, it might be possible to use that information to reschedule the computations that were assigned to the failed processor.

Locality

This kind of distribution should also give us a distribution of computation which matches the locality inherent in our network topology. That is, when there is no locality in the network (e.g. bidelta networks), we should get a random distribution of computations among processors, which is the most desirable situation in such a case. When there is locality in the network (e.g. tree structured networks such as [DeH90a]), the distribution should be skewed in the same way as the network's architectural locality. This makes it more likely that computations will be spawned off close to the spawning processor; the likelihood of a computation being spawned off at some distance decreases as that distance increases. This should give a close to optimal distribution of computation among processors given the inherent network locality. It will keep the communication moderately matched with the network structure as long as computations are most likely to communicate with the processor that spawned them or with other computations spawned by the same parent.
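The distance-decaying skew can be illustrated with a toy model. The `2**-distance` weighting below is purely an assumption for illustration; the real distribution would depend on the actual topology and instantaneous congestion, not on any fixed formula.

```python
import random

def spawn_target(distances):
    """Pick a destination processor with probability proportional to
    2**-distance -- an illustrative stand-in for the skew a tree
    network's locality would impose on DON'T CARE routes. Nearby
    processors are favored; the chance of landing far away falls off
    with distance.

    distances: dict mapping processor -> hop distance from spawner.
    """
    procs = list(distances)
    weights = [2.0 ** -distances[p] for p in procs]
    return random.choices(procs, weights=weights)[0]
```

Sampling this repeatedly with one nearby and one distant processor shows the nearby one receiving the large majority of spawns, which is the qualitative behavior described above.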

Distribution Based on Network Loading

The distribution of processes among processors which arises will be based entirely on the network communication patterns of the programs. That is to say, a processor which is bogged down in computation, but doing little or no communication over the network, will be quite likely to receive new computational tasks. This occurs because only the communication resource is used in determining scheduling in this scheme. Consequently, this scheme is most suitable when either the network is the limiting resource in the computation, or the network traffic to and from each node is a good approximation to the loading at each node.

It would be possible to get some information back to the network about how busy a processor is using the busy-out line described in [DeH90b]. A busy processing node would effectively be telling the network that it could deal with less network traffic because it was heavily loaded computationally. The network would then treat that processor as more congested and be more likely to avoid it.

The busy-out mechanism allows each processor to return only one bit of congestion information. Certainly, if the network could deal with more bits of information about the loading on a node, this would work even better. For instance, if the network used the analog routing techniques of [Cho90], the node could produce an analog voltage proportional to its loading to feed into the network.
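A sketch of how node-load feedback might be folded into the routing decision. The additive combination and the `load_weight` knob are assumptions made here for illustration; the note does not specify how the two signals would be combined.

```python
def effective_congestion(link_congestion, node_load, load_weight=1.0):
    """Fold a node's load report into the congestion estimate used
    for it. With a one-bit busy-out line, node_load is 0 or 1; with
    analog feedback as in [Cho90], it could be a fraction in [0, 1].
    load_weight (hypothetical) sets how strongly node load counts
    against a direction."""
    return link_congestion + load_weight * node_load

def least_loaded_direction(link_congestion, node_loads):
    """Pick the direction whose combined link + node congestion is
    lowest, so that busy processors look 'more congested' to the
    network and are more likely to be avoided."""
    scores = [effective_congestion(c, l)
              for c, l in zip(link_congestion, node_loads)]
    return scores.index(min(scores))
```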

Other Uses of Random Routing

A number of theoretical results show that worst-case congestion in routing networks can be avoided by indirecting through a random intermediate node. That is, in order to route from node A to node B, node A picks a random node, C, and routes the message to it. C then interprets the message as something it should forward and routes the data on to B. Providing the ability to route to an arbitrary network destination allows us to do this directly, and it guarantees that we pick a C which is easy to reach from the given source.
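The two-phase scheme can be sketched as follows. The function name and the `nodes` candidate set are hypothetical; with DON'T CARE routing, the network itself would choose C as the currently easiest-to-reach node rather than `random.choice` doing so.

```python
import random

def route_via_random_intermediate(source, dest, nodes):
    """Two-phase indirection: to get from source to dest, first route
    to a random intermediate C (phase 1), which then forwards the
    message on to dest (phase 2). Returns the path taken.

    nodes: the candidate intermediates (hypothetical parameter; the
    note does not fix this set)."""
    c = random.choice(nodes)
    return [source, c, dest]  # phase 1: source -> c; phase 2: c -> dest
```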

See Also...

References

Cho90
Frederic T. Chong. Analog Techniques for Adaptive Routing on Interconnection Networks. Transit Note 14, MIT Artificial Intelligence Laboratory, April 1990.

DeH90a
Andre DeHon. Fat-Tree Routing For Transit. AI Technical Report 1224, MIT Artificial Intelligence Laboratory, April 1990.

DeH90b
Andre DeHon. RN1: Things to Improve Upon. Transit Note 1, MIT Artificial Intelligence Laboratory, April 1990.

MIT Transit Project