Design and Performance of Multipath MIN Architectures
(Extended Abstract)
Frederic T. Chong and Thomas F. Knight, Jr.
Original Issue: February 1992
Last Updated: Sat Nov 6 14:45:30 EST 1993

In this paper, we discuss the use of multipath multistage interconnection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. In particular, we examine networks with properties of expansion and maximal-fanout . Unlike most previous work, we examine systems which can tolerate node failure and isolation. We describe mechanisms for fault identification and system reconfiguration. The intuitive approach to reconfiguring a faulty system is to maximize the number of processing nodes left in operation. However, our results show that synchronization requirements make it critical to eliminate nodes with poor network connections. We find that a conservative fault-propagation algorithm for reconfiguration, adapted from a work by Leighton and Maggs [LM89], performs well for all of our multipath networks. We also address some practical issues of network construction and present performance simulations based upon the MIT Transit architecture. Overall, simulation results for 1024 node systems indicate that multipath networks perform well not only in theory, but also in practice.
Large scale parallel multiprocessors appear to be the only technique
for constructing teraflop to petaflop performance machines. These
machines will require to
high performance processors to
solve difficult computational problems such as tertiary protein
conformation, weather prediction, and common sense understanding. Two
related critical issues in the design of such machines,
interconnection topology and reliability are addressed in
this paper. In thousand to million processor architectures, we can
assume that at any given time some significant fraction of the
processors, interconnection components, wiring, and power will be
faulty. The achievable performance of such systems is limited by the
latency of the communication, the process synchronization
requirements, and the performance of the processor array and
interconnect under faulty conditions.
In particular, we describe the simulated performance and design principles for the wiring of low latency multistage interconnection networks (MINs) which exhibit a high degree of fault tolerance. Such multistage networks can provide nearly non-blocking interconnect with latency limited primarily by wire propagation delay. Very large networks, constructed in the fat tree topology, use these MINs as basic building blocks [Lei85] [DeH91c], achieving interconnection bandwidth limited by physical considerations of three-dimensional wiring densities. Such networks form the wiring topology of the Thinking Machines CM-5 processor array.
The key factor in providing fault tolerance in MINs is the addition of multiple independent routing paths within the network. With independent routes, failures of components, wiring, or power in one path may leave a second path undamaged. There are many choices in the construction of such multipath networks, differing in detailed wiring topology and in the strategies used in recovery from switch or wiring failures. The comparative evaluation of these wiring techniques and recovery strategies is the subject of this paper.
In this section, we describe three networks: the randomly-wired
multibutterfly, the fully-deterministic maximal-fanout network, and
the randomized maximal-fanout network. We present an
lower time bound for a worst-case
permutation on fully-deterministic networks and show how a randomized
approach to maximal-fanout avoids this bound.
To achieve greater fault tolerance to component failures, we interwire our networks to provide many multiple paths between each pair of endpoints. To understand the concept of interwiring, it is helpful to view routing through a MIN as a sorting function through equivalence classes of fewer and fewer switches. Take, for example, a radix-2 network, where radix refers to the number of logical directions switched by each router. The switches in the first stage of the network belong to a single equivalence class. A message starts in this class and is routed to one of two classes in the second stage. In each subsequent stage, each message is routed to one of two classes, each containing half the number of switches as the classes in the current stage. Consequently, a message proceeds through the network, halving the number of possible destinations with each stage, until it has reached a class of size one at the end of the network. This last class is the message's destination.
The wiring scheme of a network must provide each switch in an
equivalence class with connections to the appropriate equivalence
classes in every logical direction. However, our multipath networks
are constructed from dilated routers. A dilated router has
more than one output in each logical direction. The examples given in
this section use a dilation-2 router, one which has two outputs in
each logical direction. Interwiring refers to the practice of
connecting these redundant outputs to different switches within the
appropriate class. Bassalygo and Pinsker [BP74] used such
interwiring to construct the first non-blocking networks of size and depth
. On-line algorithms for non-blocking
routing are described in [ALM90]. Other work on interwiring
multipath networks can be found
in [CS82] [RK84].
The first network we examine is the randomly-wired multibutterfly,
which randomly interwires within equivalence classes. Upfal first
introduced the term multibutterfly in [Upf89]. Random
interwiring gives networks the property of expansion. Formally,
an expansion property guarantees that any subset of
components from one M-input stage must connect to
at least
components in the next stage, where
. The property of expansion in multistage networks has been
shown, in theory, to possess substantial fault tolerance and
performance [Upf89] [LM89]. Our random
interwirings, shown in Figure
, are based upon the wiring
presented by Leighton and Maggs [LM89].
The other two networks we shall examine have the property of
maximal-fanout rather than that of expansion. Maximal-fanout
guarantees that the paths between any two endpoints will use as many
distinct routers as possible. The fully-deterministic network,
presented earlier in [CED92] and shown in
Figure , uses a regular butterfly topology to achieve
maximal-fanout. Instead of using every output as a separate logical
direction, the dilation of 2 in our network allows us to use pairs of
outputs for each logical direction. We use these output pairs to
create an exponential fanout-tree of paths between each pair
of endpoints. Note that the multiple paths, shown in the
maximal-fanout network of Figure
, reach 4 routers in the
second stage as compared to the 3 routers reached by paths in the
multibutterfly of Figure
.
Maximal-fanout increases fault tolerance and performance between any pair of single endpoints. However, it sacrifices expansion, which has been proven to have good fault tolerance and performance between groups of endpoints. We will compare these properties empirically in subsequent sections.
Unfortunately, the regularity of the fully-deterministic network is vulnerable to worst-case permutations. For simplicity, we will first examine a radix-2, dilation-2 network. This network uses a regular wiring which is topologically equivalent to an 4-ary butterfly. The difference is that the 4 physical outputs of each dilated router are grouped into 2 pairs of logical outputs. The multiple paths resulting from this dilation avoid the single points of congestion a regular butterfly suffers on permutations such as bit-reversal. However, we can still construct a worst-case permutation for this multipath network which results in congestion in groups of routers.
Since the network has maximal-fanout, we know that node reaches
all of these
chips in the middle stage through a
binary tree. Let us look at the
-th stage of the
network. There are
chips,
, of the fanout-tree
from
to
in this stage. Each one of these chips in
reaches
nodes on the input
side. However, the topology of the 4-ary butterfly is such that this
set of the
nodes,
, is the same for each of the
chips in
.
Consequently, if each of the nodes in
is trying
to reach a unique member of the
nodes in
, all
messages must go through the
chips
in
. Therefore, such a permutation would take at least
cycles to route.
Since, we are building radix-4, dilation-2 networks with our RN1 part,
the lower bound on our networks is somewhat larger. Since radix is
greater than dilation, maximal-fanout extends past the middle of the
network. For radix-4, dilation-2 networks of sufficient size, maximal
fanout extends roughly two-thirds into the network. This results in a
lower bound closer to . Note that all of the
preceding asymptotic analysis, while interesting for future
scalability, assumes networks larger than those we shall be examining
in detail. In fact, simulations of 1024-node, fully-deterministic
networks do not show significant worst-case behavior.
Fortunately, even the asymptotic lower bounds for the
fully-deterministic wiring can be avoided by a randomized fanout
scheme. Figure illustrates how randomly chosen
connections between the appropriate fanout classes produce
maximal-fanout for each pair of nodes. Each stage of the network is
divided into routing equivalence classes. In turn, each routing
equivalence class in stage
is divided into
fanout
classes. To wire a pair of dilated outputs of a router, select the
appropriate routing equivalence class for the pair's logical
direction. Then select fanout class numbers
and
within
that routing class, where
is the fanout class containing the
output pair. Randomly select a free input port in each of the 2
fanout classes and wire 1 port of the output pair to each. Since the
children of each port in each fanout-tree are randomly selected from
disjoint fanout classes, this process ensures that fanout-trees will
have physically distinct components. Note that this process continues
only until the last stage of maximal-fanout - where fanout classes are
of size 1.
We now describe the system architecture modeled in the simulations presented in this paper. The system is based upon an architecture under construction by the MIT Transit project.
The network simulated uses a circuit-switched routing component based upon the RN1 [MDK91], a completed custom VLSI chip, and the RN2 (tn44), a chip under design. Each chip can act either as a single 8-input, radix-4, dilation-2 router, or as two independent 4-input, radix-4, dilation-1 routers.
To aid the routing of messages, each component will have a pin dedicated to calculating flow control information according to the following blocking criterion taken from Leighton and Maggs in [LM89]. A router is blocked if it does not have at least one unused, operational output port in each logical direction which leads to a router which is not blocked. To route a message, a router attempts to choose a single output port by first looking at unused ports, and second eliminating any ports which are blocked. If no unique choice arises - all ports unused, but all unblocked or all blocked - then the router randomly decides between ports.
Each chip also incorporates a serial Test Access Port (TAP) which accesses IEEE boundary scan facilities [Com90]. These ports, in turn, are connected together in a diagnostic network (tn60) controlled by a Boundary Scan Controller (BSC) which can provide in-operation diagnostics and chip reconfiguration.
Each attempt to send a message through the network automatically receives status information from each router involved. If routing errors occur, an accumulation of this status information can be used to identify suspect chips. To test a chip for faults, the BSC first isolates the chip from the network by disabling all ports on other chips which lead to the suspect chip. The BSC can then run a test pattern through the chip and identify any failures. It then disables any of the ports which can no longer function. Finally, it re-enables any ports from other chips which lead to still-functioning ports on the suspect chip. Similarly, interchip wiring may be tested for continuity, isolation, and electrical performance.
Having described a network architecture, we turn to network loading. We shall construct a specific task, representative of shared-memory applications, by examining message length, message frequency, and application grain size.
Message traffic from shared-memory applications, taken from [CFKA90], was studied in [CED92]. It was found that traffic consisting uniformly of 24-byte messages was representative of effects observed from several barrier-synchronized applications which manipulate 16-byte cache lines and maintain cache coherency.
To derive the frequency of message loading, several ``reasonable'' parameter values were chosen. Note that our results are not overly sensitive to loading, so rough figures are adequate. The program code for each processor is assumed to be resident in local memory. Consequently, only data references will result in non-local memory references. We assumed a data cache miss rate of 15 percent. With 1 data read/write, we get 0.15 misses per processor cycle. We assumed that 50 percent of the references are to local memory, which gives us 0.075 references to non-local memory per processor cycle. With a 50 MHz processor and the RN1 part running at better than 100MHz, we have 2 router cycles per processor cycle. This gives us 0.0375 non-local memory references per router cycle. Adding an additional 10 percent to account for cache coherency messages, we end up with approximately 0.04 messages per router cycle.
Finally, we examine application grain size, or time between
synchronizations. A grain size representative of applications studied
is 10,000 cycles, or messages. Putting
it all together, our performance metric shall be the time to route the
following task: all processors in the system must each send 400
24-byte messages at a rate of 0.08 messages per active processor
cycle. Each processor can have up to 4 threads with outstanding
messages before stalling.
Note that our task explicitly models barrier synchronization. Other
styles of synchronization may involve smaller processor groups, but
may also tend to synchronize these groups more often. In any case, it
is important to model the synchronization requirements of an
application. For our purposes, barrier synchronization incorporates
an appropriate component of these requirements into our performance
metric. We shall see in Section that synchronization
plays a major role in performance degradation. This occurs when
network failure results in a small number of nodes with particularly
poor communication bandwidth.
Let us take one more look at our numbers. Since messages are 24 bytes long, we are basically running our network at 1 byte per router cycle, or 100 percent. If the message rate were any higher, the processors would just be stalled more often, and the network loading would not really change. Note that our analysis assumes low latency message handling, a concept demonstrated in the J-Machine [D +92]. If, as with many commercial and research machines, there exists a high latency for message handling, the latency induces a feedback effect which prevents full utilization of the network [Joh92]. Although cache miss and message locality numbers are open to debate, we observe that the technological trends of multiple-issue processors and wider cache lines will only increase demands on the network. However, performance results presented in this paper were also verified to be qualitatively unchanged under network loading half of that used here.
In this section, we describe two methods of system partitioning and compare their performance. Previous work [LLM90] [CED92] has concentrated upon the fault performance of multipath networks in which every endpoint could communicate with every other endpoint. In perhaps a more realistic view, we now examine the performance of systems which can tolerate network failures which isolate endpoints. Once such failures occur, a system must have some partitioning algorithm for determining live nodes with which to continue operation.
As shall be discussed in Appendix , the dominant effect of
network failures is the isolation of either node outputs into the
first stage of the network, or node inputs from the last stage. Any
partitioning algorithm must recognize such I/O-isolation of
nodes and remove them from the set of live nodes.
Before we begin, a few notes about the data presented in
figures and
: fault tolerance
and performance results presented were obtained through Monte Carlo
simulation of uniformly distributed router failures. Performance
results reflect the time to route a task, as specified in
Section
, which simulates a barrier synchronized
shared memory application. Note that we are interested in
steady-state performance - the time to route the task after fault
reconfiguration has already occurred. Lastly, the granularity of our
task is assumed to allow it to be evenly redistributed to the live
nodes.
Initially, we focused on retaining the largest number of processors
possible by only excluding I/O-isolated nodes from the set of live
nodes. However, using only the I/O-isolation criterion, many live
nodes may not be able to directly communicate with each other. To
overcome this problem, we utilized a multi-hop protocol, which
allows a message to visit intermediate processing nodes before
reaching its final destination. Such a scheme allows all nodes to
communicate with each other as long as the transitive closure of node
connections is fully connected. Figure summarizes
system tolerance and node loss for 1024-node (5-stage) systems based
on randomly-wired networks. Figure
(A) shows the
probability of network usability given no reconfiguration,
I/O-isolation, and I/O-isolation combined with the multi-hop protocol.
Figure
(B) shows the average percentage of node
loss, under the multi-hop protocol, for only the fraction of
networks which were usable in Figure
(A). In
other words, trials in which the network lost all processing nodes are
not counted. Results for maximum fanout networks were similar to
those of the multibutterfly.

Unfortunately, a small number of ``weak'' nodes, those with very poor connections to some other nodes, are still live. As results will show, synchronization requirements cause these weak nodes to significantly degrade performance.
Since weak nodes degrade performance, a more conservative criterion for choosing live nodes should result in better performance. Although fewer nodes are available to perform the task, we will no longer be limited by the speed of some of the weakest excluded nodes.
Leighton and Maggs proved in [LM89] the following theorem:
It turns out that we can use the proof of this theorem to construct an
algorithm to select live nodes. The proof uses a stricter definition
of I/O-isolation which includes nodes with inputs or outputs isolated
by switches from the entire network, not just the first or last
stages. However, due to the dominant effects of the first and last
stage in the size of networks we examined, we find it both sufficient
and practical to only look at those stages for I/O-isolation. The
resulting fault-propagation algorithm is shown in
Figure .
Figure (A) compares, for a 1024-node
randomly-wired multibutterfly system, the average node loss of the
fault-propagation algorithm to that of the multi-hop protocol.
Although fault-propagation was designed for networks with expansion,
it appears to also work well for other multipath MINs such as the
maximal-fanout networks. Average node loss for maximal-fanout
networks is basically indistinguishable from that of the
multibutterfly. Although the difference between average node loss
under the multi-hop protocol and the fault-propagation algorithm is
imperceptible for network failures of less than 15 percent, the
handful of weak nodes eliminated, coupled with the overhead of the
multi-hop protocol, make the fault propagated systems perform
substantially better than the multi-hop systems. In fact,
Figure
(B) shows that the performance curves of
the fault-propagated systems are nearly flat. For such a small
network size, this is a somewhat surprising validation of asymptotic
predictions implied by Theorem 2. Also of interest in
Figure
(B), the randomized, maximal-fanout network
performed just as well as the multibutterfly.

Several issues have arisen in our attempt to bring multipath MIN architectures closer to reality. Unlike most previous work, we have examined systems which can tolerate node failure and isolation. We have demonstrated that a simplified version of an algorithm from Leighton and Maggs [LM89] effectively determines which nodes to use in the presence of network failure. Not only is this fault-propagation algorithm effective on multibutterflies, it is also effective on other multipath networks such as maximal-fanout networks.
Of particular importance is the quality of network connections available to each processing node. In reconfiguring a faulty system, an intuitive approach is to maximize the number of live nodes. However, our performance results show that the synchronization requirements of applications make it critical to eliminate poorly connected nodes from system operation. The blocked router criterion, used in the fault-propagation algorithm, eliminates such nodes and ensures high quality connections between every pair of live nodes.
Overall, we have empirically demonstrated the high fault tolerance and performance of multibutterflies for practical systems. Although lacking the theoretical basis of multibutterflies, maximal-fanout networks appear to be equally as useful.
We would like to thank Tom Leighton for his suggestions on the analysis and randomization of maximal-fanout networks. Many of the architectural facets of Transit presented here are the work of Andre DeHon. Eran Egozy wrote the original code to generate network wirings. We would also like to thank Andre DeHon, Tom Leighton, and Ellen Spertus for their helpful comments on drafts of this paper.
Appendices
Although we are primarily concerned with the steady-state performance of the system after reconfiguration, it is important to discuss the feasibility of tolerating node failure or isolation. Several approaches are available: checkpointing, redundant processors, and evacuating state.
Checkpointing computation is a well-known method of tolerating processor loss. Secondary storage devices would be attached to some of the networks endpoints. There are various types of checkpointing involving various levels of optimism [SY85] [SW89]. However, the time overhead of checkpointing in a parallel system can be quite high.
On the other hand, we can trade some time overhead for hardware overhead by using redundant processors. We can use redundant processing nodes to shadow the computation of the processors doing the actual work [Len88] [Web90]. Actually, we need not have entirely redundant processors, we can just have redundant processes spread across all the nodes. Each node can have some ``real'' work and some redundant work, as long as all of the redundant processes are for work being done on other nodes. If a node is lost, all of its processes can be recovered from redundant processes. It is important to note, however, that there could be significant overhead in the messages necessary to support process shadowing. For both checkpointing and redundant processors, it is vital to place backup elements at network endpoints which are unlikely to all fail at once. For instance, backups should be placed in different clusters of endpoints in the fully-deterministic network.
Since the rate of node loss may be quite low, it may be more appropriate to run with as little overhead as possible, and then evacuate a node's state when absolutely necessary. Each node has two outputs and two inputs connected to the network. However, recall that the RN1 routing component is circuit switched and bidirectional. Consequently, we may regard each node as having four bidirectional channels through which we can evacuate state. Depending upon the amount of state that must be evacuated, a node may choose to begin evacuation when 2 or 3 of these bidirectional channels have failed. In truly desperate circumstances, a node could even resort to the very low bandwidth of the bit-serial diagnostic network to evacuate some state.
As discussed in Section , boundary scan capabilities
allow us to dynamically diagnose and reconfigure for both component
and wire failures. However, how can we dynamically implement the
partitioning algorithms described in Section
?
The multi-hop protocol is fairly straightforward to implement. The Boundary Scan Controller (BSC) can look at the first and last stages of the network and calculate I/O-isolation. Multiple hops can be supported by means of a forwarding packet, which is sent to an intermediate node and contains both the original message and its final destination. Each node can decide when to send a forwarding packet in one of two ways. First, a node may decide to send to an intermediate destination after a constant number of retries to the true destination. Second, to avoid some of the expense of multiple retries, each node may keep a cache of difficult nodes to reach, sending forwarding packets when a message's ultimate destination matches an address in the cache. Cache entries should be replace or flushed periodically because some nodes may be difficult to reach due to temporary congestion, not I/O-isolation. Also at issue is how to choose an intermediate node. This can be done either randomly, or with care to avoid nodes difficult to reach.
However, performance results emphasize the importance of dynamically implementing the fault-propagation algorithm. The algorithm requires the calculation of router blocking information. Fortunately, the Transit architecture has hardware support for calculating this information for purposes of flow control. However, this flow control is in use during routing while the system is in operation. We suggest three alternative methods of calculating blocking information for fault propagation. First, we can simply stop the machine and use the flow control hardware. Second, we can temporarily stop using flow control for routing, since the router will achieve reasonable performance by making random, uninformed decisions for a short time. We can use the hardware for fault propagation and then reactivate flow control. Third, if speed is not an issue, we can have the BSC gather the needed fault information and calculate fault-propagation sequentially.
Much of the fault tolerance and routing behavior of our networks is dominated by the first and last stages. Although multipath networks provide multiple paths between any two nodes, these paths can only use a large number of physically distinct routers towards the middle of the network. Near the nodes, these paths must concentrate towards their specific sources and destinations. This concentration is most severe in the first and last stages, where each node only has a small number of connections to the network. We now describe methods of coping with this limited network I/O. Many of these ideas were initially explored in [DKM91].
The wiring of the nodes to the inputs of the first-stage routers
involves a tradeoff between mean time between machine reconfiguration
and smoothness of degradation. Each node has 2 inputs into distinct
chips of the first stage of the network. This means that each
first-stage router has 8 nodes connected to its inputs. For nodes
randomly wired to the first stage, let be the mean time
between isolation of any node's inputs. Now consider a wiring in
which sets of 8 nodes are wired to 2 routers, each node connected to
both routers. The time between any node loss is decreased, but if we
lose nodes, we lose 8 at a time. To a first approximation, the mean
time to loss of a single node is also
, the mean time to loss
of each cluster of 8 is
. In general, we can cluster
processors to
routers in a regular butterfly pattern such that we
lose
processors at a time, for
,
radix, and
dilation. This results in a mean time to cluster
loss of approximately
.
The desirability of clustering depends on several factors. If the system can not tolerate any node isolation, maximal clustering will give the greatest expected time to system failure. If the system can reconfigure itself to accommodate node isolation, then reconfiguration scheme becomes an issue. If reconfiguration is dynamic, then the smoother node loss may be easier to handle. If it is static, then it may be desirable to run for as long as possible, then stop the machine and reconfigure for a large number of isolated nodes.
The amount of clustering will also affect each cluster's bandwidth into the network. However, in simulations of worst-case and uniform message traffic, the bandwidth through the rest of the network does not significantly exceed what the first stage routers could provide to each cluster.
The size of each routing equivalence classes in the last stage of the
network is one router. Unfortunately, if each
router were assigned both network connections to each node, the
network would have a single point of failure. To avoid this
difficulty, we split each
router into two
logical routers. Now each routing equivalence class has two
logical routers, which we assume are assigned to physically
distinct chips. Consequently, it takes at least two chip failures in
the last stage to isolate any node. Figures
and
illustrate this last-stage wiring.
In order to decrease wire-lengths and machine size, the Transit
architecture shall use a three-dimensional stack packaging
scheme (tn38) [DeH91b] shown in Figure .
Vertical connections are accomplished via button board connectors.
Note that routing components are arranged in columns directly above
one another. This suggests an economical method of wiring the
diagnostic network. Columns of chips can have their boundary scan
pins vertically connected. These columns can then be connected to
some number of Boundary Scan Controllers.