Transit Note #64

Design and Performance of Multipath MIN Architectures

(Extended Abstract)

Frederic T. Chong and Thomas F. Knight, Jr.

Original Issue: February 1992

Last Updated: Sat Nov 6 14:45:30 EST 1993

Abstract:

In this paper, we discuss the use of multipath multistage interconnection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. In particular, we examine networks with properties of expansion and maximal-fanout . Unlike most previous work, we examine systems which can tolerate node failure and isolation. We describe mechanisms for fault identification and system reconfiguration. The intuitive approach to reconfiguring a faulty system is to maximize the number of processing nodes left in operation. However, our results show that synchronization requirements make it critical to eliminate nodes with poor network connections. We find that a conservative fault-propagation algorithm for reconfiguration, adapted from a work by Leighton and Maggs [LM89], performs well for all of our multipath networks. We also address some practical issues of network construction and present performance simulations based upon the MIT Transit architecture. Overall, simulation results for 1024 node systems indicate that multipath networks perform well not only in theory, but also in practice.

Introduction

Large scale parallel multiprocessors appear to be the only technique for constructing teraflop to petaflop performance machines. These machines will require to high performance processors to solve difficult computational problems such as tertiary protein conformation, weather prediction, and common sense understanding. Two related critical issues in the design of such machines, interconnection topology and reliability are addressed in this paper. In thousand to million processor architectures, we can assume that at any given time some significant fraction of the processors, interconnection components, wiring, and power will be faulty. The achievable performance of such systems is limited by the latency of the communication, the process synchronization requirements, and the performance of the processor array and interconnect under faulty conditions.

In particular, we describe the simulated performance and design principles for the wiring of low latency multistage interconnection networks (MINs) which exhibit a high degree of fault tolerance. Such multistage networks can provide nearly non-blocking interconnect with latency limited primarily by wire propagation delay. Very large networks, constructed in the fat tree topology, use these MINs as basic building blocks [Lei85] [DeH91c], achieving interconnection bandwidth limited by physical considerations of three-dimensional wiring densities. Such networks form the wiring topology of the Thinking Machines CM-5 processor array.

The key factor in providing fault tolerance in MINs is the addition of multiple independent routing paths within the network. With independent routes, failures of components, wiring, or power in one path may leave a second path undamaged. There are many choices in the construction of such multipath networks, differing in detailed wiring topology and in the strategies used in recovery from switch or wiring failures. The comparative evaluation of these wiring techniques and recovery strategies is the subject of this paper.

Multipath Networks

In this section, we describe three networks: the randomly-wired multibutterfly, the fully-deterministic maximal-fanout network, and the randomized maximal-fanout network. We present an lower time bound for a worst-case permutation on fully-deterministic networks and show how a randomized approach to maximal-fanout avoids this bound.

To achieve greater fault tolerance to component failures, we interwire our networks to provide many multiple paths between each pair of endpoints. To understand the concept of interwiring, it is helpful to view routing through a MIN as a sorting function through equivalence classes of fewer and fewer switches. Take, for example, a radix-2 network, where radix refers to the number of logical directions switched by each router. The switches in the first stage of the network belong to a single equivalence class. A message starts in this class and is routed to one of two classes in the second stage. In each subsequent stage, each message is routed to one of two classes, each containing half the number of switches as the classes in the current stage. Consequently, a message proceeds through the network, halving the number of possible destinations with each stage, until it has reached a class of size one at the end of the network. This last class is the message's destination.

The wiring scheme of a network must provide each switch in an equivalence class with connections to the appropriate equivalence classes in every logical direction. However, our multipath networks are constructed from dilated routers. A dilated router has more than one output in each logical direction. The examples given in this section use a dilation-2 router, one which has two outputs in each logical direction. Interwiring refers to the practice of connecting these redundant outputs to different switches within the appropriate class. Bassalygo and Pinsker [BP74] used such interwiring to construct the first non-blocking networks of size and depth . On-line algorithms for non-blocking routing are described in [ALM90]. Other work on interwiring multipath networks can be found in [CS82] [RK84].

Randomly-Wired Multibutterfly

The first network we examine is the randomly-wired multibutterfly, which randomly interwires within equivalence classes. Upfal first introduced the term multibutterfly in [Upf89]. Random interwiring gives networks the property of expansion. Formally, an expansion property guarantees that any subset of components from one M-input stage must connect to at least components in the next stage, where . The property of expansion in multistage networks has been shown, in theory, to possess substantial fault tolerance and performance [Upf89] [LM89]. Our random interwirings, shown in Figure , are based upon the wiring presented by Leighton and Maggs [LM89].

Fully-Deterministic Maximal-Fanout

The other two networks we shall examine have the property of maximal-fanout rather than that of expansion. Maximal-fanout guarantees that the paths between any two endpoints will use as many distinct routers as possible. The fully-deterministic network, presented earlier in [CED92] and shown in Figure , uses a regular butterfly topology to achieve maximal-fanout. Instead of using every output as a separate logical direction, the dilation of 2 in our network allows us to use pairs of outputs for each logical direction. We use these output pairs to create an exponential fanout-tree of paths between each pair of endpoints. Note that the multiple paths, shown in the maximal-fanout network of Figure , reach 4 routers in the second stage as compared to the 3 routers reached by paths in the multibutterfly of Figure .

Maximal-fanout increases fault tolerance and performance between any pair of single endpoints. However, it sacrifices expansion, which has been proven to have good fault tolerance and performance between groups of endpoints. We will compare these properties empirically in subsequent sections.

Unfortunately, the regularity of the fully-deterministic network is vulnerable to worst-case permutations. For simplicity, we will first examine a radix-2, dilation-2 network. This network uses a regular wiring which is topologically equivalent to an 4-ary butterfly. The difference is that the 4 physical outputs of each dilated router are grouped into 2 pairs of logical outputs. The multiple paths resulting from this dilation avoid the single points of congestion a regular butterfly suffers on permutations such as bit-reversal. However, we can still construct a worst-case permutation for this multipath network which results in congestion in groups of routers.

Theorem 1:

There exists a permutation which takes cycles to route on a radix-2, dilation-2 fully-deterministic maximal fanout network.

Proof:

For a network where radix equals dilation, maximal-fanout extends to the middle network, where chips in the fanout-tree equal the width of the network. Let us examine the paths which from a particular node to a set of output nodes, , whose addresses begin with the bits . There are chips in the middle stage which can reach nodes in .

Since the network has maximal-fanout, we know that node reaches all of these chips in the middle stage through a binary tree. Let us look at the -th stage of the network. There are chips, , of the fanout-tree from to in this stage. Each one of these chips in reaches nodes on the input side. However, the topology of the 4-ary butterfly is such that this set of the nodes, , is the same for each of the chips in .

Consequently, if each of the nodes in is trying to reach a unique member of the nodes in , all messages must go through the chips in . Therefore, such a permutation would take at least cycles to route.

Since, we are building radix-4, dilation-2 networks with our RN1 part, the lower bound on our networks is somewhat larger. Since radix is greater than dilation, maximal-fanout extends past the middle of the network. For radix-4, dilation-2 networks of sufficient size, maximal fanout extends roughly two-thirds into the network. This results in a lower bound closer to . Note that all of the preceding asymptotic analysis, while interesting for future scalability, assumes networks larger than those we shall be examining in detail. In fact, simulations of 1024-node, fully-deterministic networks do not show significant worst-case behavior.

Randomized Maximal-Fanout

Fortunately, even the asymptotic lower bounds for the fully-deterministic wiring can be avoided by a randomized fanout scheme. Figure illustrates how randomly chosen connections between the appropriate fanout classes produce maximal-fanout for each pair of nodes. Each stage of the network is divided into routing equivalence classes. In turn, each routing equivalence class in stage is divided into fanout classes. To wire a pair of dilated outputs of a router, select the appropriate routing equivalence class for the pair's logical direction. Then select fanout class numbers and within that routing class, where is the fanout class containing the output pair. Randomly select a free input port in each of the 2 fanout classes and wire 1 port of the output pair to each. Since the children of each port in each fanout-tree are randomly selected from disjoint fanout classes, this process ensures that fanout-trees will have physically distinct components. Note that this process continues only until the last stage of maximal-fanout - where fanout classes are of size 1.

The Simulated Architecture

We now describe the system architecture modeled in the simulations presented in this paper. The system is based upon an architecture under construction by the MIT Transit project.

The network simulated uses a circuit-switched routing component based upon the RN1 [MDK91], a completed custom VLSI chip, and the RN2 (tn44), a chip under design. Each chip can act either as a single 8-input, radix-4, dilation-2 router, or as two independent 4-input, radix-4, dilation-1 routers.

To aid the routing of messages, each component will have a pin dedicated to calculating flow control information according to the following blocking criterion taken from Leighton and Maggs in [LM89]. A router is blocked if it does not have at least one unused, operational output port in each logical direction which leads to a router which is not blocked. To route a message, a router attempts to choose a single output port by first looking at unused ports, and second eliminating any ports which are blocked. If no unique choice arises - all ports unused, but all unblocked or all blocked - then the router randomly decides between ports.

Each chip also incorporates a serial Test Access Port (TAP) which accesses IEEE boundary scan facilities [Com90]. These ports, in turn, are connected together in a diagnostic network (tn60) controlled by a Boundary Scan Controller (BSC) which can provide in-operation diagnostics and chip reconfiguration.

Each attempt to send a message through the network automatically receives status information from each router involved. If routing errors occur, an accumulation of this status information can be used to identify suspect chips. To test a chip for faults, the BSC first isolates the chip from the network by disabling all ports on other chips which lead to the suspect chip. The BSC can then run a test pattern through the chip and identify any failures. It then disables any of the ports which can no longer function. Finally, it re-enables any ports from other chips which lead to still-functioning ports on the suspect chip. Similarly, interchip wiring may be tested for continuity, isolation, and electrical performance.

Network Loading

Having described a network architecture, we turn to network loading. We shall construct a specific task, representative of shared-memory applications, by examining message length, message frequency, and application grain size.

Message traffic from shared-memory applications, taken from [CFKA90], was studied in [CED92]. It was found that traffic consisting uniformly of 24-byte messages was representative of effects observed from several barrier-synchronized applications which manipulate 16-byte cache lines and maintain cache coherency.

To derive the frequency of message loading, several ``reasonable'' parameter values were chosen. Note that our results are not overly sensitive to loading, so rough figures are adequate. The program code for each processor is assumed to be resident in local memory. Consequently, only data references will result in non-local memory references. We assumed a data cache miss rate of 15 percent. With 1 data read/write, we get 0.15 misses per processor cycle. We assumed that 50 percent of the references are to local memory, which gives us 0.075 references to non-local memory per processor cycle. With a 50 MHz processor and the RN1 part running at better than 100MHz, we have 2 router cycles per processor cycle. This gives us 0.0375 non-local memory references per router cycle. Adding an additional 10 percent to account for cache coherency messages, we end up with approximately 0.04 messages per router cycle.

Finally, we examine application grain size, or time between synchronizations. A grain size representative of applications studied is 10,000 cycles, or messages. Putting it all together, our performance metric shall be the time to route the following task: all processors in the system must each send 400 24-byte messages at a rate of 0.08 messages per active processor cycle. Each processor can have up to 4 threads with outstanding messages before stalling.

Note that our task explicitly models barrier synchronization. Other styles of synchronization may involve smaller processor groups, but may also tend to synchronize these groups more often. In any case, it is important to model the synchronization requirements of an application. For our purposes, barrier synchronization incorporates an appropriate component of these requirements into our performance metric. We shall see in Section that synchronization plays a major role in performance degradation. This occurs when network failure results in a small number of nodes with particularly poor communication bandwidth.

Let us take one more look at our numbers. Since messages are 24 bytes long, we are basically running our network at 1 byte per router cycle, or 100 percent. If the message rate were any higher, the processors would just be stalled more often, and the network loading would not really change. Note that our analysis assumes low latency message handling, a concept demonstrated in the J-Machine [D +92]. If, as with many commercial and research machines, there exists a high latency for message handling, the latency induces a feedback effect which prevents full utilization of the network [Joh92]. Although cache miss and message locality numbers are open to debate, we observe that the technological trends of multiple-issue processors and wider cache lines will only increase demands on the network. However, performance results presented in this paper were also verified to be qualitatively unchanged under network loading half of that used here.

System Partitioning

In this section, we describe two methods of system partitioning and compare their performance. Previous work [LLM90] [CED92] has concentrated upon the fault performance of multipath networks in which every endpoint could communicate with every other endpoint. In perhaps a more realistic view, we now examine the performance of systems which can tolerate network failures which isolate endpoints. Once such failures occur, a system must have some partitioning algorithm for determining live nodes with which to continue operation.

As shall be discussed in Appendix , the dominant effect of network failures is the isolation of either node outputs into the first stage of the network, or node inputs from the last stage. Any partitioning algorithm must recognize such I/O-isolation of nodes and remove them from the set of live nodes.

Before we begin, a few notes about the data presented in figures and : fault tolerance and performance results presented were obtained through Monte Carlo simulation of uniformly distributed router failures. Performance results reflect the time to route a task, as specified in Section , which simulates a barrier synchronized shared memory application. Note that we are interested in steady-state performance - the time to route the task after fault reconfiguration has already occurred. Lastly, the granularity of our task is assumed to allow it to be evenly redistributed to the live nodes.

Multi-Hop Protocol

Initially, we focused on retaining the largest number of processors possible by only excluding I/O-isolated nodes from the set of live nodes. However, using only the I/O-isolation criterion, many live nodes may not be able to directly communicate with each other. To overcome this problem, we utilized a multi-hop protocol, which allows a message to visit intermediate processing nodes before reaching its final destination. Such a scheme allows all nodes to communicate with each other as long as the transitive closure of node connections is fully connected. Figure summarizes system tolerance and node loss for 1024-node (5-stage) systems based on randomly-wired networks. Figure (A) shows the probability of network usability given no reconfiguration, I/O-isolation, and I/O-isolation combined with the multi-hop protocol. Figure (B) shows the average percentage of node loss, under the multi-hop protocol, for only the fraction of networks which were usable in Figure (A). In other words, trials in which the network lost all processing nodes are not counted. Results for maximum fanout networks were similar to those of the multibutterfly.

Unfortunately, a small number of ``weak'' nodes, those with very poor connections to some other nodes, are still live. As results will show, synchronization requirements cause these weak nodes to significantly degrade performance.

The Fault-Propagation Algorithm

Since weak nodes degrade performance, a more conservative criterion for choosing live nodes should result in better performance. Although fewer nodes are available to perform the task, we will no longer be limited by the speed of some of the weakest excluded nodes.

Leighton and Maggs proved in [LM89] the following theorem:

Theorem 2:

No matter how an adversary chooses faults in an -input, -output network with expansion, there exist at least inputs and outputs between which any permutation can be routed in steps.

It turns out that we can use the proof of this theorem to construct an algorithm to select live nodes. The proof uses a stricter definition of I/O-isolation which includes nodes with inputs or outputs isolated by switches from the entire network, not just the first or last stages. However, due to the dominant effects of the first and last stage in the size of networks we examined, we find it both sufficient and practical to only look at those stages for I/O-isolation. The resulting fault-propagation algorithm is shown in Figure .

Figure (A) compares, for a 1024-node randomly-wired multibutterfly system, the average node loss of the fault-propagation algorithm to that of the multi-hop protocol. Although fault-propagation was designed for networks with expansion, it appears to also work well for other multipath MINs such as the maximal-fanout networks. Average node loss for maximal-fanout networks is basically indistinguishable from that of the multibutterfly. Although the difference between average node loss under the multi-hop protocol and the fault-propagation algorithm is imperceptible for network failures of less than 15 percent, the handful of weak nodes eliminated, coupled with the overhead of the multi-hop protocol, make the fault propagated systems perform substantially better than the multi-hop systems. In fact, Figure (B) shows that the performance curves of the fault-propagated systems are nearly flat. For such a small network size, this is a somewhat surprising validation of asymptotic predictions implied by Theorem 2. Also of interest in Figure (B), the randomized, maximal-fanout network performed just as well as the multibutterfly.

Conclusion

Several issues have arisen in our attempt to bring multipath MIN architectures closer to reality. Unlike most previous work, we have examined systems which can tolerate node failure and isolation. We have demonstrated that a simplified version of an algorithm from Leighton and Maggs [LM89] effectively determines which nodes to use in the presence of network failure. Not only is this fault-propagation algorithm effective on multibutterflies, it is also effective on other multipath networks such as maximal-fanout networks.

Of particular importance is the quality of network connections available to each processing node. In reconfiguring a faulty system, an intuitive approach is to maximize the number of live nodes. However, our performance results show that the synchronization requirements of applications make it critical to eliminate poorly connected nodes from system operation. The blocked router criterion, used in the fault-propagation algorithm, eliminates such nodes and ensures high quality connections between every pair of live nodes.

Overall, we have empirically demonstrated the high fault tolerance and performance of multibutterflies for practical systems. Although lacking the theoretical basis of multibutterflies, maximal-fanout networks appear to be equally as useful.

Acknowledgments

We would like to thank Tom Leighton for his suggestions on the analysis and randomization of maximal-fanout networks. Many of the architectural facets of Transit presented here are the work of Andre DeHon. Eran Egozy wrote the original code to generate network wirings. We would also like to thank Andre DeHon, Tom Leighton, and Ellen Spertus for their helpful comments on drafts of this paper.

Appendices

Tolerating Node Loss

Although we are primarily concerned with the steady-state performance of the system after reconfiguration, it is important to discuss the feasibility of tolerating node failure or isolation. Several approaches are available: checkpointing, redundant processors, and evacuating state.

Checkpointing computation is a well-known method of tolerating processor loss. Secondary storage devices would be attached to some of the networks endpoints. There are various types of checkpointing involving various levels of optimism [SY85] [SW89]. However, the time overhead of checkpointing in a parallel system can be quite high.

On the other hand, we can trade some time overhead for hardware overhead by using redundant processors. We can use redundant processing nodes to shadow the computation of the processors doing the actual work [Len88] [Web90]. Actually, we need not have entirely redundant processors, we can just have redundant processes spread across all the nodes. Each node can have some ``real'' work and some redundant work, as long as all of the redundant processes are for work being done on other nodes. If a node is lost, all of its processes can be recovered from redundant processes. It is important to note, however, that there could be significant overhead in the messages necessary to support process shadowing. For both checkpointing and redundant processors, it is vital to place backup elements at network endpoints which are unlikely to all fail at once. For instance, backups should be placed in different clusters of endpoints in the fully-deterministic network.

Since the rate of node loss may be quite low, it may be more appropriate to run with as little overhead as possible, and then evacuate a node's state when absolutely necessary. Each node has two outputs and two inputs connected to the network. However, recall that the RN1 routing component is circuit switched and bidirectional. Consequently, we may regard each node as having four bidirectional channels through which we can evacuate state. Depending upon the amount of state that must be evacuated, a node may choose to begin evacuation when 2 or 3 of these bidirectional channels have failed. In truly desperate circumstances, a node could even resort to the very low bandwidth of the bit-serial diagnostic network to evacuate some state.

Dynamic System Partitioning

As discussed in Section , boundary scan capabilities allow us to dynamically diagnose and reconfigure for both component and wire failures. However, how can we dynamically implement the partitioning algorithms described in Section ?

The multi-hop protocol is fairly straightforward to implement. The Boundary Scan Controller (BSC) can look at the first and last stages of the network and calculate I/O-isolation. Multiple hops can be supported by means of a forwarding packet, which is sent to an intermediate node and contains both the original message and its final destination. Each node can decide when to send a forwarding packet in one of two ways. First, a node may decide to send to an intermediate destination after a constant number of retries to the true destination. Second, to avoid some of the expense of multiple retries, each node may keep a cache of difficult nodes to reach, sending forwarding packets when a message's ultimate destination matches an address in the cache. Cache entries should be replace or flushed periodically because some nodes may be difficult to reach due to temporary congestion, not I/O-isolation. Also at issue is how to choose an intermediate node. This can be done either randomly, or with care to avoid nodes difficult to reach.

However, performance results emphasize the importance of dynamically implementing the fault-propagation algorithm. The algorithm requires the calculation of router blocking information. Fortunately, the Transit architecture has hardware support for calculating this information for purposes of flow control. However, this flow control is in use during routing while the system is in operation. We suggest three alternative methods of calculating blocking information for fault propagation. First, we can simply stop the machine and use the flow control hardware. Second, we can temporarily stop using flow control for routing, since the router will achieve reasonable performance by making random, uninformed decisions for a short time. We can use the hardware for fault propagation and then reactivate flow control. Third, if speed is not an issue, we can have the BSC gather the needed fault information and calculate fault-propagation sequentially.

Coping with Network I/O

Much of the fault tolerance and routing behavior of our networks is dominated by the first and last stages. Although multipath networks provide multiple paths between any two nodes, these paths can only use a large number of physically distinct routers towards the middle of the network. Near the nodes, these paths must concentrate towards their specific sources and destinations. This concentration is most severe in the first and last stages, where each node only has a small number of connections to the network. We now describe methods of coping with this limited network I/O. Many of these ideas were initially explored in [DKM91].

The First Stage

The wiring of the nodes to the inputs of the first-stage routers involves a tradeoff between mean time between machine reconfiguration and smoothness of degradation. Each node has 2 inputs into distinct chips of the first stage of the network. This means that each first-stage router has 8 nodes connected to its inputs. For nodes randomly wired to the first stage, let be the mean time between isolation of any node's inputs. Now consider a wiring in which sets of 8 nodes are wired to 2 routers, each node connected to both routers. The time between any node loss is decreased, but if we lose nodes, we lose 8 at a time. To a first approximation, the mean time to loss of a single node is also , the mean time to loss of each cluster of 8 is . In general, we can cluster processors to routers in a regular butterfly pattern such that we lose processors at a time, for , radix, and dilation. This results in a mean time to cluster loss of approximately .

The desirability of clustering depends on several factors. If the system can not tolerate any node isolation, maximal clustering will give the greatest expected time to system failure. If the system can reconfigure itself to accommodate node isolation, then reconfiguration scheme becomes an issue. If reconfiguration is dynamic, then the smoother node loss may be easier to handle. If it is static, then it may be desirable to run for as long as possible, then stop the machine and reconfigure for a large number of isolated nodes.

The amount of clustering will also affect each cluster's bandwidth into the network. However, in simulations of worst-case and uniform message traffic, the bandwidth through the rest of the network does not significantly exceed what the first stage routers could provide to each cluster.

The Last Stage

The size of each routing equivalence classes in the last stage of the network is one router. Unfortunately, if each router were assigned both network connections to each node, the network would have a single point of failure. To avoid this difficulty, we split each router into two logical routers. Now each routing equivalence class has two logical routers, which we assume are assigned to physically distinct chips. Consequently, it takes at least two chip failures in the last stage to isolate any node. Figures and illustrate this last-stage wiring.

Transit Packaging

In order to decrease wire-lengths and machine size, the Transit architecture shall use a three-dimensional stack packaging scheme (tn38) [DeH91b] shown in Figure . Vertical connections are accomplished via button board connectors. Note that routing components are arranged in columns directly above one another. This suggests an economical method of wiring the diagnostic network. Columns of chips can have their boundary scan pins vertically connected. These columns can then be connected to some number of Boundary Scan Controllers.

References

ALM90
S. Arora, T. Leighton, and B. Maggs. On-line algorithms for path selection in a non-blocking network. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pages 149-158, May 1990.

BP74
L. A. Bassalygo and M. S. Pinsker. Complexity of optimum nonblocking switching networks without reconnections. Problems of Information Transmission, 9:64-66, 1974.

CED92
Frederic Chong, Eran Egozy, and Andre DeHon. Fault Tolerance and Performance of Multipath Multistage Interconnection Networks. In Thomas F. Knight Jr. and John Savage, editors, Advanced Research in VLSI and Parallel Systems 1992, pages 227-242. MIT Press, March 1992. [FTP link].

CFKA90
David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache-Coherence in Large-Scale Multiprocessors. IEEE Computer, 23(6):41-58, June 1990.

Com90
IEEE Standards Committee. IEEE Standard Test Access Port and Boundary-Scan Architecture. IEEE, 345 East 47th Street, New York, NY 10017-2394, July 1990. IEEE Std 1149.1-1990.

CS82
L. Ciminiera and A. Serra. A fault-tolerant connecting network for multiprocessor systems. In Proceedings of the 1982 International Conference on Parallel Processing, pages 113-122. IEEE, August 1982.

D +92
William J. Dally et al. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, pages 23-39, April 1992.

DeH91a
Andre DeHon. MBTA: Quick Overview. Transit Note 38, MIT Artificial Intelligence Laboratory, January 1991. [tn38 HTML link] [tn38 FTP link].

DeH91b
Andre DeHon. MBTA: Wonderland Packaging. Transit Note 39, MIT Artificial Intelligence Laboratory, February 1991. [tn39 HTML link] [tn39 FTP link].

DeH91c
Andre DeHon. Practical Schemes for Fat-Tree Network Construction. In Carlo H. Sequin, editor, Advanced Research in VLSI: International Conference 1991, pages 307-322. MIT Press, March 1991. [FTP link].

DeH91d
Andre DeHon. RN2 Proposal. Transit Note 44, MIT Artificial Intelligence Laboratory, April 1991. [tn44 HTML link] [tn44 FTP link].

DeH92
Andre DeHon. Scan-Based Testability for Fault-Tolerant Architectures. Transit Note 60, MIT Artificial Intelligence Laboratory, January 1992. [tn60 HTML link] [tn60 FTP link].

DKM91
Andre DeHon, Thomas F. Knight Jr., and Henry Minsky. Fault-Tolerant Design for Multistage Routing Networks. In International Symposium on Shared Memory Multiprocessing, pages 60-71. Information Processing Society of Japan, April 1991. [FTP link].

Joh92
Kirk L. Johnson. The Impact of Communication Locality on Large-Scale Multiprocessor Performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture, Queensland, Australia, May 1992.

Lei85
Charles E. Leiserson. Fat-Trees: Universal Networks for Hardware Efficient Supercomputing. IEEE Transactions on Computers, C-34(10):892-901, October 1985.

Len88
Daniel E. Lenoski. A Highly Integrated, Fault-Tolerant Minicomputer: The Nonstop CLX. In Digest of Papers--Compcon Spring 88: Intellectual Leverage, pages 515-519, San Francisco, California, February 1988. IEEE.

LLM90
Tom Leighton, Derek Lisinski, and Bruce Maggs. Empirical Evaluation of Randomly-Wired Multistage Networks. In International Conference on Computer Design: VLSI in Computers and Processors. IEEE, 1990.

LM89
Tom Leighton and Bruce Maggs. Expanders Might Be Practical: Fast Algorithms for Routing Around Faults on Multibutterflies. In IEEE 30th Annual Symposium on Foundations of Computer Science, 1989.

MDK91
Henry Minsky, Andre DeHon, and Thomas F. Knight Jr. RN1: Low-Latency, Dilated, Crossbar Router. In Hot Chips Symposium III, 1991.

RK84
S.M. Reddy and V.P. Kumar. On fault-tolerant multistage interconnection networks. In Proceedings of the 1984 International Conference on Parallel Processing, pages 155-165. IEEE, August 1984.

SW89
A. Prasad Sistla and Jennifer L. Welch. Efficient Distributed Recovery Using Message Logging. In Proceedings of the Eighth Annual Symposium on Principles of Distributed Computing, pages 223-238, Edmonton, Alberta, Canada, August 1989.

SY85
Robert E. Strom and Shaula Yemini. Optimistic Recovery in Distributed Systems. ACM Transactions on Computer Systems, 3(3):204-226, August 1985.

Upf89
E. Upfal. An deterministic packet routing scheme. In 21st Annual ACM Symposium on Theory of Computing, pages 241-250. ACM, May 1989.

Web90
Steve Webber. The Stratus Architecture. Stratus Technical Report TR-1, Status Computer, Inc., 55 Fairbanks Blvd., Marlboro, Massachusetts 01752, 1990.

MIT Transit Project