Transit Note #41

RNP: Fault Tolerant Routing Protocol

Andre DeHon

Tom Knight

Henry Minsky

Original Issue: February 1991

Last Updated: Mon Nov 8 13:17:47 EST 1993

Abstract:

In the course of developing reliable, high-performance routing networks, we have devised a simple routing protocol for high-speed, fault-tolerant, circuit-switched routing. The simplicity of the protocol allows the hardware implementation to be fast and efficient. The routing node protocol (RNP) uses a source-responsible, random routing scheme to achieve dynamic fault-tolerance without requiring global information about the state of the network. RNP can be easily adapted to almost any data word size and switch radix. RNP allows arbitrary network size and a large degree of freedom in network topology. The protocol achieves these features while remaining simple and easily constructible. We have implemented a VLSI routing component which embodies RNP.

Introduction

RNP is a synchronous protocol for high-speed, pipelined routing of word-wide data through crossbar switching elements. Any word width, dilation, radix, and network size can be accommodated by RNP. The protocol uses a single control bit in addition to the data word to allow out of band signalling. The crossbar switching element is assumed to have multiple outputs in each logical switching direction to obtain fault-tolerance (see [DKM91]).

Protocol Overview

A data stream of an arbitrary number of words is sent into the network one word per clock cycle. The first data word is treated as a routing specification and is used for path selection. Subsequent words are pipelined through the connection, if any, opened by the leading word. When the data stream ends, the sender may signal a request for the open connection to be reversed or dropped. When each router receives a reversal request from the sender, the router returns status and checksum information concerning the open connection to the source node. Once all routers in the network are reversed, data may flow back from the destination to the source. The connection may be reversed as many times as the source and destination desire before being closed.

Network Organization

The network structure assumed in this work is indirect, circuit-switched, and pipelined. The network endpoints may queue messages coming and going through the network, but the routers in the network perform no queuing. Messages blocked in the network are discarded. This class of network encompasses both bidelta style multistage interconnection networks ( e.g. Figure ) and fat-tree style networks. [DKM91], [CED92], and [DeH91] describe the class of networks we assume in some detail.

Other Fault-Tolerant Networks

One approach predominantly used to achieve fault-tolerance in multistage routing networks is to construct a network with more switching stages than are actually required to uniquely specify the destination ([LP83], [CYH84], [Sie85], [BBN87] et. al.). The set of destination specifications that reach the same physical destination defines a class of equivalent paths. Since any of several paths can reach the destination, it is possible to choose a path which avoids any fault in the network. Most of these schemes require the processor to choose its own path through the network. These schemes almost exclusively assume each endpoint has a single input and a single output connection to the network. BBN's large Butterfly Plus computers actually implement this extra stage approach to fault-tolerance.

An alternative approach for fault-tolerance is to simply provide multiple redundant networks ([FG88], [KS83]). The endpoint chooses a network over which to make the connection. The networks route the connections independently and reconverge at the destination endpoint. This approach does provide multiple input and output connections. Again, the endpoints are responsible for choosing fault-free paths.

Kruskal and Snir also propose a network with redundant paths using switching elements similar to ours in [KS83]. They suggest the redundant outputs from a routing element all go to the same physical component. This allows them to tolerate wire failures but does not provide any tolerance for component failures.

Grondalski implemented a network with redundant connections [Gro87]. Like Kruskal and Snir's network the redundant outputs from each routing component are all connected to the same physical component. As such the network can only tolerate wire faults between components. These faults are identified only during explicit test times when test patterns are communicated between adjacent routers.

In [GBG89], Ghafoor, Bashkow, and Ghafoor describe a hypercube based fault-tolerant network structure. Their routing protocol provides simple self-routing only in the absence of faults. Faults must be detected via explicit tests before they can be avoided. When faults are present in the network, each node must know the presence of every fault and explicitly route to avoid the faults. Thus, the self-routing property is lost in the presence of faults.

In all of these networks, fault-tolerance is static. That is, faults must be explicitly identified before they can be avoided. The identification usually occurs through some test operation which is separate from the normal operation of the network. In most cases, every source must know the location of every fault in order to achieve fault-avoidance.

In contrast, RNP provides dynamic fault-avoidance. During normal network operation, the source always knows when a message fails and, to some extent, how it failed. Fault-tolerance in RNP is achieved while allowing the source nodes to remain oblivious to the functional status of particular routing components in the network.

Related Fault-Tolerant Networks

The Kappa network [KPR88] achieves fault-tolerance by utilizing dilated crossbar routers wired in a deterministic pattern. The Kappa network is effectively a particular network wiring in the class of networks [DKM91] for which RNP was developed.

Our network structure grew out of the multibutterfly work due to Upfal [Upf89] and Leighton and Maggs [LM89]. We have reduced some of their more theoretical thoughts on routing schemes to a form easily implemented in practice.

Terminology

A routing component effects routing by switching data between the wires connected to the component. A port is the portion of the routing component that connects to the wires that make up a single word-wide datapath. A forward port is a port which normally receives data from a router in a previous stage and feeds the data into the router's crossbar. A backward port is a port which normally transmits data from a router's crossbar to the forward port of a router in the subsequent stage of the network. The direction of a network connection can be reversed. When this happens, the backward port receives data sent by a forward port. An input port is a port receiving data; and output port is a port sending data. In their normal state, the forward port is an input port and the backward port is an output port. However, when a connection is reversed, the roles of the two ports are also reversed. Each router is characterized by its radix and dilation. The radix of a router defines the number of distinct directions distinguished by the routing component. Dilation refers to the number of logically equivalent backward ports in each distinct direction.

Organization

Section puts RNP into perspective by describing how it would fit into a multi-layered fault-tolerance protocol. Section describes the behavior of the protocol as seen by a single pair of routing components. Section describes how routing proceeds across a network of router and Section describes the impact of this protocol on a network of routers. Section briefly describes, RN1, a routing chip which uses RNP. Section discusses the limitations of RNP. Section goes on to describe extensions to this protocol to overcome most of the limitations described in Section .

RNP Protocol Layer

RNP fits into a layered protocol scheme, such as the ISO OSI Reference Model [DZ83] at the data-link layer. That is, RNP itself is independent of the underlying physical layer which actually takes care of raw bit transmissions. RNP is independent of the electrical and mechanical aspects of the interconnection as well as the physical medium. The protocol is optimized for use in situations where the delay between adjacent routing components is small -- i.e. where this delay is on the order of the router's clock period. RNP provides mechanisms for controlling the transmission of data packets and the direction of transmission over interconnection lines. It also provides sufficient information back to the source to allow the source to detect errors in transmissions and retry connections as appropriate. By leaving the retransmission of corrupted packets to the source, RNP allows the source to dictate the retransmission policy. Technically, this means RNP alone does not completely fulfill the role of the data-link layer as defined by the ISO OSI Reference Model. RNP must be coupled with the appropriate source to fully define the data-link layer. Since RNP provides dynamic self-routing, the protocol layer identified as the network-layer by the ISO OSI model is trivial or non-existent. All of the higher level layers in the ISO OSI network model are unaffected by RNP.

The source can simply prepend a destination address onto the sequence of words it wishes to transmit and feed them into the network followed by a word indicating whether or not it wishes an acknowledgment and reply. When the source requests a reply, the routers provide checksum and routing disposition back to the source as described in detail below. When the router replies indicate that the message was not successfully transmitted, the source can retry the connection with the policy of its choosing. With the information provided back to the source about connection failures, the source should be able to tell, over time, when there is high likelihood that a link or component in the network has failed. The source can then propagate this information back to monitoring subsystems. Such subsystems can at least note the possible failure and may be able to further diagnose the failure or even mask or repair the fault.

RNP itself is connection oriented, though there is no need for higher-level protocols to be connection oriented. Along with the appropriate support at the endpoint, RNP provides a reliable byte stream connection from end to end through the routing network.

Description

The behavior of RNP is based on the dialog protocol between each backward port of each router and its companion forward port in the following stage of routers in the network. Here we describe this protocol from the point of view of a single pair of routing components.

Signalling

Routing control signalling is performed over data transmission channels. Using simple state machines and one control bit, this signalling can occur out of band from the data allowing arbitrary data to be passed using the protocol. Table shows the encoding of various signals. The control field is the high bits of the data word. The remainder of this section explains how these control signals are used to effect routing control.

Connection States

The states of a forward-backward port pair can be described by a simple nine-state finite state machine. Figure shows this state machine. Each transition is labeled as: < event>/< result>< dir>. Where < event> is a logical expression, usually including the reception of a particular kind of control word, < result> is an output resulting from the reception, and < dir> is an arrow indicating the direction which the < result> is sent. For instance, the arrow from swallow to forward means that when a DATA word is received and the resulting output direction specified is not blocked, forward the DATA out the allocated output port in the forward direction and change to the forward state.

Router Behavior

Idle port

When a connection between routers is not in use, the connection is in an idle state. While in the idle state, the backward port on a router transmits the IDLE word to its corresponding forward port in the next stage of the network. A forward port takes the reception of an idle word to mean that it should remain in an idle state.

Route

To route a connection through a router, a ROUTE word is fed into a router forward port. The forward port recognizes the transition of the control bit from the zero, which it was receiving when the connection was idle, to a one. The router then uses the control field to determine the desired direction to route the word. The router forwards the ROUTE word through a backward port in the desired direction, if one is available, and locks down the connection so subsequent DATA words can be passed in the same direction. If no output is available, the connection is blocked.

Data

All words with their control bit set high, which are received following a ROUTE word are forwarded through the allocated backward port of the routing component to the forward port of the next router in the network.

Data Completion

When all of the data words in a message have been passed into a router, the sender has two options. The sender can either drop the connection without getting any further response, or turn the direction of the connection around for a reply.

To drop the connection, the input is given a DROP word; the reception of a DROP word causes the router to close down an open connection and free up the backward port for reuse. When the output port is freed via a DROP word, it returns to the idle state sending an IDLE or DROP word to the next router.

To turn a connection around, the input port is given a TURN word; when the router receives a TURN word, it forwards the turn out the allocated output port, if there is one, and returns two words. These two words fill the pipeline delay associated with informing the the subsequent router to turn the connection around and getting backward data from the subsequent router. The filler words differ based on whether the connection is sending data in the forward or reverse direction when the TURN is received.

In the forward direction, the router returns first a STATUS word and then a CHECKSUM word. If the connection was not blocked during the routing cycle, after sending these two words, the router will be receiving data in the reverse direction and will forward this data through. If the connection was blocked, the router forwards a DROP word following the CHECKSUM and returns to the idle state.

In the backward direction, the router simply sends two HOLD words before sending the reverse data. In order for a port to be in the backward direction, there must be a connection through the router so there will always be data to propagate following a TURN.

Checksum and Status Information

The STATUS and CHECKSUM words are a two word wide value which serves to inform the source node of the integrity of the connection made through each router in the path obtained through the network. The control field in the STATUS word informs the source about which of the logically equivalent backward ports, if any, through which the connection is routed. This information allows the source to identify the exact routers in the network through which the connection was actually routed. When a connection is blocked at some router, this serves to inform the source about where the connection became blocked. The remaining bits of the STATUS word, together with the bits in the CHECKSUM word, can be used to transmit a longitudinal checksum back to the source. This checksum is generated on all the data sent through the connection since the last ROUTE, including the ROUTE word, itself. The checksum allows the source to identify if data is corrupted by the network, as well as exactly where in the network the data became corrupted.

Making Connections

When a connection is opened through a router, there may or may not be outputs available in the desired logical output direction. If there are no available outputs, the router simply discards the remaining bits associated with the data stream. When the connection is later turned around, the router STATUS word returned by the routing node informs the source that the message was blocked at the router. When exactly one output in the desired direction is available, the router simple switches the connection through that output. When multiple paths are available, the router switches the data to a logically appropriate backward port randomly selected from those available.

This random path selection is the key to making the protocol fault-tolerant while avoiding the need for centralized information about the state of the network and keeping the routing protocol simple. If faults develop in the network, the source will detect the occurrence of a failed or damaged connection by the returned status and checksum information. The source then knows to resend the data. Since the routing node selects randomly among equivalent outputs at each stage, it is highly likely that the retry connection will take an alternate path through the network, thus avoiding the faulty component. In this manner, the source can eventually find a fault-free path through the network, provided one exists. The random selection also frees the source from knowing the actual number of redundant paths provided by the network. Random selection among equivalent available outputs is an extremely simple selection criterion to implement in silicon and can thus be implemented with little area and considerable speed. Random selection also requires no state information not already contained on the individual routing component.

Connection Grammar

Figure shows a grammar summarizing all possible dialogs over the connection between the backward port of one router and the forward port of a subsequent router. The unshaded control/data words indicate data sent forward across the connection whereas the shaded words indicate data sent across the connection in the reverse direction.

Routing Through Networks

Between routing stages, the bits of the datapath can be reordered so that routers in different stages see distinct control fields. This reordering of bits allows routing words to specify complete paths through the network.

RNP allows switches to be configured to ignore the first data word in an incoming message by setting a SWALLOW configuration bit. This option allows networks to be arbitrarily large. Every time all of the routing bits in the ROUTE word are exhausted, the ROUTE word can be discarded allowing routing to continue with the fresh routing bits in the subsequent word. In this manner, the first DATA word in the message following the original ROUTE word is promoted to be the ROUTE word after the original is exhausted.

Cascading Routers

A network is built by cascading routers which use the RNP protocol. This section describes what RNP looks like from a network endpoint and shows sample network connections.

Network Behavior

The source feeds messages into a router in the first stage of the network. The leading word is treated as the ROUTE word. At each stage, the ROUTE is either pipelined through the network or blocked by existing connections. Between router stages, the bits are rotated to present fresh routing bits to each router. When the bits are exhausted the subsequent router will be configured to swallow the ROUTE word and promote the first DATA word to be the new ROUTE word. When the entire message is fed into the network, the source will generally send a TURN word to reverse the connection. The first router in the network returns STATUS and CHECKSUM words and then forwards the data received from the second router in the network. The first reverse data from the second router in the network will be the second router's STATUS and CHECKSUM words. The source will, thus, successively receive STATUS- CHECKSUM word pairs from each router in the connection through the network. If the connection is blocked at some point, the blocked router will send a DROP following its STATUS- CHECKSUM pair; this DROP will serve to close down the connection as it propagates back to the source. If the connection was not blocked, the source will receive data from the destination following the final STATUS- CHECKSUM pair. The connection may be reversed or closed by the destination once it has completed its reply.

Examples

Figure shows a small network with a radix of two and a dilation of two which might be constructed with routers communicating via RNP (see [DKM91] for more details on the network wiring). In Figure , the possible paths from input 6 to output 16 are highlighted. For the sake of simplicity in examples, let us consider the routers involved in one of the paths between input 6 and output 16. The same protocol is obeyed between all routers so the examples easily generalize to the complete network.

Each of the following examples (Figures through ), shows several cycles of communications over one of the paths indicated in Figure . Each connection between routers is labeled with the control/data word transmitted during the cycle and annotated with the direction of data flow. Often words of the same type are subscripted so the progression of individual words can be tracked from cycle to cycle.

Opening a Connection

Figures and show how a route is opened through the network. The connection in Figure succeeds in opening a connection through the network, whereas the one in Figure is blocked at the router in the third stage.

Dropping a Connection

Figure shows an open connection being dropped in the forward direction. Dropping a connection from the reverse direction proceeds identically with the roles of the source and destination reversed. If the connection is blocked, the DROP is propagated up to the router at which the connection is blocked.

Turning a Connection (Forward)

Figure shows a successful connection being turned and backward data propagating through the network. Figure shows how a blocked connection is collapsed when reversed.

Turning a Connection (Reverse)

Turns from the reverse direction proceed basically as forward turns. Instead of sending STATUS and CHECKSUM words, routers send HOLD words when turned from the reverse direction. Figure shows a turn from the reverse direction.

RN1

RN1 is a custom CMOS routing component which utilizes RNP to provide simple high-speed switching for fault-tolerant networks. RN1 has eight forward ports and eight backward ports and uses byte-wide words. It can be configured in either of two ways, as shown in Figure . The primary configuration is a crossbar router with a dilation of two. In this configuration, all 8 input channels are logically equivalent. Alternately, the component can be configured as a pair of crossbars, each with 4 logically equivalent inputs and a dilation of one.

The initial version of RN1 can support switching rates over 50MHz. The limits to RN1's speed are not associated with RNP, and we expect future revisions to run at 100MHz to 200MHz. The RN1 routing component is described further in [Kni89], [Min91a], and [MDK91].

Limitations

RNP, as described so far, does have some limitations in terms of fault-tolerance and performance.

With checksum and status information available to inform the source about the progress and integrity of a connection, most bit errors impede performance, but are not fatal. As long as the inputs to the routing component float low when not driven, the component will almost always settle into the correct idle configuration in the case where a companion component becomes faulty in a non-graceful manner. If a backward or forward port control bit gets stuck high, a path from that router through to some network endpoint may be allocated and held indefinitely. This ``permanently allocated'' path hinders performance by tying up routing resources.

As described, RNP has no way of knowing if any of its logically equivalent outputs in a given direction are more likely to cause a blocked connection than the others. This may lead to some sub-optimal choices of routing paths. Also, a source must always transmit a message and turn the network connection around in order to discover if a connection is blocked. Consequently, routing resources can be held for the duration of a message by connections which are blocked later in the network. For long messages or large networks this blocking may degrade performance appreciably.

Backward Channel

By adding a single bit-wide channel to each connection between routers, information can be propagated in the reverse direction of the connection to allow timely information about the state of the downstream connection. When a connection is idle, this backward channel can be used to propagate Leighton-Maggs style flow control [LLM90] to inform the backward port about the likelihood that any connection can be routed through the router associated with the forward port. When choosing among logically equivalent output ports, this backward flow control information can be used to bias the selection of a backward port away from busy routers. When the router is in some non-idle state, this backward channel can be used to drop a connection from the backward direction.

This additional channel improves the fault-tolerance and performance of RNP in several ways. First, it decreases the probability that a connection is blocked while opening a connection. Additionally, if a connection is blocked, the backward channel can be used to force the connection to be dropped by propagating a drop request from the blocked router back to the source. In this manner, the connection is collapsed from the head, rather than from the tail, in a pipelined manner, quickly freeing routers for reuse by other connections through the network. If the destination endpoint recognizes that it is receiving corrupted data, either through a violation of higher level protocols or because of the invalidity of some forward checksum, it can cause the connection to be collapsed; this, too, frees the resources held by the connection for reuse. Finally, the ability to drop a connection from the receiving endpoint allows a stuck control bit fault to be handled in a more graceful manner. Propagating a backward drop will cause all routers allocated by the faulty component to relinquish the connection; this will effectively isolate the fault to the single connection which has the stuck control bit.

Summary

We have described a general, dynamic fault-tolerant routing protocol which is easily adaptable to pipelined routing components of arbitrary radix, dilation, and word-size. The protocol places no limitations on the size or organization of network built from routing components using RNP. We have specifically considered using RNP with both multibutterfly style networks [DKM91] and fat-tree networks [DeH91]. The protocol is simple allowing efficient high-speed implementation in VLSI. The RN1 routing component which we have developed demonstrates the effectiveness of the protocol.

See Also...

References

BBN87
BBN Advanced Computers Inc., Cambridge, MA. Inside the Butterfly Plus, October 1987.

CED92
Frederic Chong, Eran Egozy, and Andre DeHon. Fault Tolerance and Performance of Multipath Multistage Interconnection Networks. In Thomas F. Knight Jr. and John Savage, editors, Advanced Research in VLSI and Parallel Systems 1992, pages 227-242. MIT Press, March 1992. [FTP link].

CYH84
Chin Chi-Yuan and Kai Hwang. Connection Principles for Multipath Packet Switching Networks. In 10th Annual Symposium on Computer Architecture, pages 99-108, 1984.

DeH91
Andre DeHon. Practical Schemes for Fat-Tree Network Construction. In Carlo H. Sequin, editor, Advanced Research in VLSI: International Conference 1991, pages 307-322. MIT Press, March 1991. [FTP link].

DKM91
Andre DeHon, Thomas F. Knight Jr., and Henry Minsky. Fault-Tolerant Design for Multistage Routing Networks. In International Symposium on Shared Memory Multiprocessing, pages 60-71. Information Processing Society of Japan, April 1991. [FTP link].

DZ83
J. D. Day and J. Zimmermann. The OSI Reference Model. IEEE Transactions on Communications, 71:1334-1340, December 1983.

Ego90
Eran Egozy. Symbolic Wiring Routing Package. Transit Note 29, MIT Artificial Intelligence Laboratory, August 1990. [tn29 HTML link] [tn29 FTP link].

FG88
Peter Franaszek and Christos Georgiou. Multipath Hierarchies in Interconnection Networks. Lecture Notes in Computer Science, 297:112-123, December 1988.

GBG89
Arif Ghafoor, Theodore R. Bashkow, and Imran Ghafoor. Bisectional Fault-Tolerant Communication Architecture for Supercomputer Systems. IEEE Transactions on Computers, 38(10):1425-1446, October 1989.

Gro87
Robert Grondalski. A VLSI Chip Set for Massively Parallel Architecture. In IEEE International Solid-State Circuits Conference, pages 198-199, 1987.

Kni89
Thomas F. Knight Jr. Technologies for Low-Latency Interconnection Switches. In Symposium on parallel architecures and algorithms. ACM, June 1989.

KPR88
S.C. Kothari, G.M. Prabhu, and R. Roberts. The Kappa Network with Fault-Tolerant Destination Tag Algorithm. IEEE Transactions on Computers, 37(5):612-617, May 1988.

KS83
Clyde P. Kruskal and Marc Snir. The Performance of Multistage Interconnection Networks for Multiprocessors. IEEE Transactions on Computers, C-32(12):1091-1098, December 1983.

LLM90
Tom Leighton, Derek Lisinski, and Bruce Maggs. Empirical Evaluation of Randomly-Wired Multistage Networks. In International Conference on Computer Design: VLSI in Computers and Processors. IEEE, 1990.

LM89
Tom Leighton and Bruce Maggs. Expanders Might Be Practical: Fast Algorithms for Routing Around Faults on Multibutterflies. In IEEE 30th Annual Symposium on Foundations of Computer Science, 1989.

LP83
Duncan Lawrie and Krishnan Padmanabhan. A Class of Redundant Path Multistage Interconnection Networks. IEEE Tr. on Computers, C-32(12):1099-1108, December 1983.

MDK91
Henry Minsky, Andre DeHon, and Thomas F. Knight Jr. RN1: Low-Latency, Dilated, Crossbar Router. In Hot Chips Symposium III, 1991.

Min91a
Henry Q. Minsky. A Parallel Crossbar Routing Chip for a Shared Memory Multiprocessor. Master's thesis, MIT, 545 Technology Sq., Cambridge, MA 02139, January 1991. [FTP link].

Min91b
Henry Q. Minsky. RN1 Data Router -- B revision. Transit Note 45, MIT Artificial Intelligence Laboratory, May 1991. [tn45 HTML link] [tn45 FTP link].

Sie85
Howard Siegel. Interconnection Networks for Large-Scale Parallel Processing. Lexington Books, Lexington, MA, 1985.

Upf89
E. Upfal. An deterministic packet routing scheme. In 21st Annual ACM Symposium on Theory of Computing, pages 241-250. ACM, May 1989.

MIT Transit Project