Transit Note #73

Multipath Enhanced Transit ROuter Architecture

( METRO)

Eran Egozy

Andre DeHon Thomas Knight, Jr.

Henry Minsky Samuel Peretz

Original Issue: July, 1992

Last Updated: Fri Nov 5 13:08:41 EST 1993

This document defines the Multipath Enhanced Transit Router ( METRO) architecture, an architecture for a family high-performance network routing components for indirect, multistage routing networks.

Features

Applications

METRO is primarily designed to serve as the basic building block for reliable, high-performance routing networks in high-speed multiprocessors.

Basic Architecture

A METRO router is a dilated crossbar routing component which supports half-duplex bidirectional, pipelined, circuit-switched connections. Each METRO router is self-routing and supports dynamic message traffic. It works in conjunction with a source-responsible network interface to achieve reliable end-to-end data transmission in the presence of heavy network congestion and dynamic faults. Fault-localization and reconfiguration capabilities provided through the scan interface allow the router to perform efficiently in the presence of static faults.

A crossbar has a set of inputs and a set of outputs and can connect any of the inputs to any of the outputs with the restriction that only one input can be connected to each output at the same time. A dilated crossbar is similar to a standard crossbar with the exception that multiple outputs are considered to be equivalent; that is, a collection of outputs is presumed to eventually lead to the same destination. The number of outputs which are equivalent in a particular logical direction is referred to as the crossbar's dilation. The number of logically distinct outputs which the crossbar can switch among is referred to as the crossbar's radix.

A circuit-switched routing component establishes a connection between an input and output port and forwards the data between the output and the input in a deterministic amount of time. Notably, there is no storage of the transmitted data inside the routing component. In a network of circuit-switched routing components, a path from the source end of the connection to the destination is locked-down during the connection; the resources along the established path are not available for other connections during the time the connection is established. In a pipelined circuit-switched routing component, all the routing components in a network run synchronously from a central clock and data takes a deterministic number of clock cycles to pass through each routing component.

A crossbar is said to be self-routing if it can establish connections through itself based on signalling on its input channels. That is, rather than having some external entity set the configuration of connections in the crosspoint, the router configures itself in response to requests which arrive via the input channels. A router is said to handle dynamic message trafic when it can open and close connections as messages arrive independantly from one another at the input ports.

When connections are requested through a router, there is no guarantee that the connections can be made. As long as the dilation of the router is smaller than the number of input channels into a router, it is always possible that more connections will want to connect in a given logical direction than there are logically equivalent outputs. When this happens, some of the connections must be denied. When a connection request is rejected for this reason, it is said to be blocked. The data from a blocked connection is discarded and the source is informed that the connection was not established.

Once a connection is established through the crossbar, it can be turned. That is, the direction of data transmission can be reversed so that data flows from the original destination to the original source. This capability is useful in providing rapid replies between two nodes and is important to effecting reliable communications. METRO provides half-duplex bidirectional data transmission since it can send data in both directions, but only in one direction at a time.

Since connections can be turned around and data may flow in either direction through the crossbar, it is confusing to talk about input and output ports since any port can be either an input or an output. Instead, we will consider a set of forward ports and a set of backward ports. A forward port initiates a route and is initially an input port while a backward port is initially an output port.

Figure shows the basic topology for a METRO crossbar router. The arrows indicate the initial direction of data flow.

Fault-Tolerance Support

Faults in a network of routing components can be either static or dynamic and may be transient faults or permanent faults. While a permanent fault occurs and remains a fault, a transient fault may presist for a short period of time and then go away. Transient faults which recur with notable frequency are termed intermittent. For the purposes of this presentation, static faults are permanent or intermittent faults which have occurred at some point in the past and are known to the system as a whole. Dynamic faults are transient faults or any faults which the system has not yet detected.

With dilated routing components, static faults are easily handled by disabling any backward port connected to a fault. In this manner, the router always chooses to avoid the fault at the expense of increasing network congestion.

Since dynamic faults may occur at any time, it is necessary to detect when a connection has been corrupted by a dynamic fault. To this end, the METRO router is designed to work with a source-responsible network interface. The source opening connections through the network is assigned the responsibility for assuring that data reaches the destination successfully. Each METRO router makes this possible by providing checksum and status information back to the source. With this information, the source can determine when data is corrupted by dynamic faults. Only when the source receives valid checksums from all routers and a positive acknowledgment from the destination does it consider the connection to have completed successfully. The source will generally re-attempt any failed connections. Protocols which require this positive acknowledgment from the destination back to the source are termed end-to-end protocols and are necessary to handle faults which may appear at any point during a data transmission.

The checksum and status information returned by the routers is also useful in identifying where a fault may have occurred. The METRO architecture provides scan based access for fault-localization. This allows they system to pin-point the nature and location of the fault. It also allows transient faults to be distinguished from permanent or intermittent faults.

Since the fault-localization and reconfiguration operations depend on the functionality of the scan-path, it is critical that components be accessible via scan control. For this reason, METRO supports multiple scan paths. If one path fails, another scan path can be used to communicate with the component.

Stochastic Path Selection

When more than one backward port is available for any connection request, the choice of which backward port to use is made based on random input bits. This random path selection is the key to making the METRO router robust to dynamic faults while keeping implementations simple. When faults develop in the network, the source will detect the occurrence of a failed or damaged connection by the returned status and checksum information. The source then knows to resend the data. Since the routing compnents select randomly among equivalent outputs at each routing stage in the network, it is highly likely that the retry connection will take an alternate path through the network, thus avoiding the newly exposed fault. Retry coupled with randomization in path selection guarantees that any existing non-faulty path can be found without global information about faults in the network.

Configuring Wide Routers From Narrow

To allow wide routers to be built from routing components with narrow datapaths, METRO provides features to facilitate cascading routers. A router with an effectively wider data path can be achieved by using multiple METRO routers in parallel. METRO provides sufficient external state to allow this width-cascading.

Typical Application

In normal use, many METRO routers would be used to construct a multipath multistage network. Multiple stages allow the routers to route data between a source and a destination through a logarithmic number of routing components. In a multibutterfly style network, each stage in the routing network subdivides the set of possible destinations into a number of distinct classes determined by the radix of the routing components. Successive stages recursively subdivide the destination class until each output node from the network is uniquely identified. The dilated components give rise to multiple independent paths through the network. Figure shows a small network in this genera.

[LM89] and [Upf89] present some of the basic theory on multibutterfly networks. Network construction, utilization, and performance issues are further explored in [DKM91] [LLM90] [CED92] [CK92]. Fat-Tree networks [Lei85] [GL85] are another class of multistage, multipath networks which can be built using METRO routing components. [DeH91] and [DeH90] detail issues involved in constructing such networks.

Architecture Variables

There are several choices one can make in building a specific implementation of the METRO router which are independent of the basic architecture. Since the choice of an actual number is generally irrelevant to the subsequent discussion, variables to denote each of these choices are introduced here. Any specific implementation should, of course, make suitable choices for these values and the choices should be well documented along with the implementation. Table lists the variables. The remainder of this section explains each in slightly more detail.

Number of Scan Paths

This is the number of separate scan paths which are supported by the routing component. A minimum of one is necessary for functionallity, but we highly reccommend a minimum of two scan paths to handle any faults which may occur in the target system's scan paths. Additional scan paths provide higher bandwidth access to scan registers and a higher level of fault tolerance.

Data Channel Width

The data channel can have arbitrary width as long as it is wide enough to support the general routing header.

Maximum Dilation

flexible routing allows one to make a static trade-off between the dilation () and radix () of a component. Due to implementation limitations, there may be a maximum dilation (less than the number of outputs on the part) which a given implementation can support. The dilation is always restricted to a power of two.

Number of Forward and Backward Ports

It is generally advisable to keep the number of forward and backward ports the same ( i.e. ), but there is certainly no architectural requirement to do so. Notice that with a configurable dilation the effective radix () depends on the number of backward ports and the dilation in use; i.e. .

Random Input Bits

Architecturally, as long as the bit streams are random the number is irrelevant. To get good pseudo-randomness, it often makes sense to have several inputs.

Header Words Consumed

Since connection establishment generally takes more time through a router than data transmission once the connection is established, the architecture allows each router to consume some number of words at the head of each connection request while it is being set up. The number of words consumed should be deterministic and constant for a particular implementation.

Data Pipestages Through Router

To achieve a higher bandwidth, the number of data pipestages in the router may be increased. However, there are at least two registers in the data path for a METRO one at the forward port, and one at the backport.

Variable Turn Delay Maximum

The variable turn delay feature allows routers to cope with long wires and other situations where the round trip time between the backward port of a router and the corresponding forward port of the router in the subsequent stage (or an endpoint in the network) varies from port to port and router to router. A particular implementation will have to set some limit on the range of this variable.

Component I/O Pins

Table summarizes the i/o pins on a METRO routing component. The total number of pins required depends on the specific values chosen for various architectural variables. For example, an RN2 routing component is an part with byte-wide channels. It has four random input bits and two boundary scan paths. The total pin count comes out to 183 pins.

Feature Descriptions

Configurable Dilation

A particular METRO routing component may implement one of two dilation configuration options:

  1. static dilation configuration
  2. general, dynamic dilation configuration

Static

When a router is configured during setup, the dilation may be set to any power of two up to the implementation limit . This allows a single router design to serve in a variety of configurations.

General Dynamic

Included in each routing request is a specification of both the direction of travel and the effective component dilation. In effect, the route header specifies, for each connection request, the set of output ports which are acceptable for use in routing the particular connection. Each connection may request a dilation which is any power of two up to the implementation limit . This option allows the system to moderate path flexibility at run-time.

Data Idle

There are several cases in system use or in basic router behavior where it is necessary to keep a connection open while no data is available for transmission. DATA IDLE is a word passed through the router just like data but which is outside of the normal band of data word encodings. There are two major uses of this designated token:

  1. By the protocols being used on top of a network of METRO routing components when it is not possible to deterministicly specify when data will be available to send with a message.
  2. By the routing components themselves in cases where the number of cycles between events can vary within a given implementation.

Connection Reversal

After opening a connection through a series of routers, the source will generally want a reply back from the destination to verify that the message was received. Often the source will also want to get a response to the connection request. By sending a designated control word, TURN, the established path is reversed in a pipelined manner so that data can return from the original destination back to the original source. During the pipeline delays associated with reversing the connection, each routing component has the opportunity to inject information into the return data stream about the status of the connection and the integrity of the data transmitted; this information is useful to the source in identifying and localizing errors.

To allow arbitrary protocols to be layered on top of the basic METRO router protocol, connections may be turned back to the forward direction. Any number of data transmission reversals may occur during a single connection. It is always the prerogative of the transmitting end of the connection to signal a connection reversal.

Variable Turn Delay

The number of cycles of delay between the point where a router forwards the TURN word on to the attached router and the point where the first word of reverse data is received from the attached router will depend on the length of wire connecting the two components. That is, assuming the wire between the two components can be modeled as a number of pipeline registers, the more pipeline register associated with the wire, the more clock cycles it will require before return data arrives from the router at the other end of the wire.

In some network systems, it is generally not possible or desirable to make all the connections between routers equally long. In particular, closer routers should be able to take advantage of the fact that interchip signalling may occur faster. Since each port on a single routers connections to a different router it may be the case that the length of the wires connected to each port differ. For this reason, METRO provides the ability to define the number of pipeline stages associated with each port connection separately. During the time the router is waiting for a response from the turn by the attached routing compnent, the waiting router will send DATA IDLE words to hold the connection open until it has reverse data to forward along.

One important assumption made above is that we can model the wire between two components as a number of pipeline registers. How one assures that this assumption is satisfied is, of course, an implementation detail. With a properly series terminated point-to-point connection between routers, we do not have to worry about reflections and settling time on the wire. The wire will look, for the most part, like a time-delay. The necessary trick is then to make the time-delay approximate an integral number of clock cycles so that it does look like a number of pipeline registers. A brute force way to achieve this is to carefully control the length and electrical characteristics of the wires between components. Another method would be to use adjustable delay in the pad drivers themselves to adjust the chip-to-chip delay sufficiently to meet the assumption. See Chapter 6 of [DeH93b] for further discussion on delay adjustment technologies.

Path Reclamation -- Fast and Detailed

When a router is unable to route a connection due to the lack of availability of an appropriate backward port, the connection is blocked. There are two ways which a METRO router can handle this situation. The router could wait for a path reversal request and shut the connection down after returning information to the source about the blocked state of the connection. Alternately, the router could immediately begin propagation of a backward drop request back to the source using the backward control bit ( BCB). The earlier behavior provides the source with sufficient information to determine at exactly which router the blocking occurred and whether or not any errors occurred prior to that point. The latter behavior releases the resources held by the blocked connection rapidly while only informing the source of the routing stage in which the blocking occurred.

The METRO router allows each forward port to independently select between these two modes. The tradeoff between fast path reclamation and detailed information gathering can be handled dynamically while the router is in use. Note that the mode of path reclamation is solely determined by the configuration of the forward port on the router at which the blocking occurred. This allows the system to select portions of the network ( e.g. a particular stage in the routing network or a particular routing component) for gathering detailed information while blocking in the remainder of the network is signalled via fast reclamation for efficiency.

Scan Support

METRO integrates extensive scan support using a mostly IEEE 1149-1.1990 [Com90] compliant Test Access Port (TAP). The current IEEE 1149 TAP specification does not provide any provisions for components with multiple TAPs, so multiTAP features are non-standard and may make the scan architecture on METRO non-compliant, depending on how one interprets the standard.

MultiTAP

MultiTAP provides a mechanism for more than one TAP to coexist on a single component. Each TAP has its own IEEE 1149 standard TAP-controller, instruction register, bypass register, and i/o pins. The TAPs share all other internal registers. Conflict resolution hardware on chip guarantees that at most one TAP is accessing an internal register at any given point in time. When conflicts between TAPs do occur, the TAP most recently requesting the register is given access, and the other TAP is reset to the bypass instruction. This behavior guarantees that a good scan path can always wrest a resource from a faulty scan path. To further decrease the likelihood that a faulty scan path can interfere with a non-faulty scan path's operation, an error-detecting, sparse instruction encoding is used on the scan instruction. The multiTAP architecture is developed in greater detail in (tn60).

Configuration

The TAPs provide a convenient mechanism for setting mostly static configuration options. METRO relies heavily on the TAPs for this purpose. Section summarizes the configuration options available to an implementation of a METRO routing component.

Fault Localization and Masking

In addition to the normal boundary-scan facilities associated with an IEEE 1149 TAP, METRO provides fine-grain facilities for on-line fault-diagnosis. In particular, many of the system in which a METRO router might be used are large and it would be inconvenient or impractical to bring the entire machine down to run traditional boundary-scan style diagnostics. For this reason, METRO routers allow each port to be treated separately for purposes of testing and reconfiguration. Each port can be disabled, removing it from the set of resources in use by the system. The topology and redundancy provided by typical METRO networks allow the network to continue functioning with some resources disabled. Once a port is disabled, boundary and internal scan tests can be applied exclusively to the disabled port or ports while the rest of the router functions normally. This allows a forward-back port pair, a routing component, or some region of a network, to be isolated from normal operation for testing. Once the fault region is identified, the faulty-region can be left disabled and the rest of the system returned to service. Disabled faults are, masked, in that their faulty behavior can no longer corrupt message traffic.

This kind of reconfiguration will, of course, require a high-level coordination based-on an understanding of the basic topology of the network so that regions can be isolated in a consistent manner. Further development of these capabilities is provided in (tn60)

Router Width Cascading

In width cascading, two or more routers need to behave identically in terms of how they handle connections. METRO provides two hooks in order to allow this to occur. First, the routers receive their random bits from outside the part. These bits are used along with the connection requests to decide the assignment of connection requests to backward ports. As long as the connection requests and random bits are the identical for the set of cascaded routers, in the non-faulty case the cascaded routers will allocate identically. Only in faulty cases should the connection requests, and hence allocations, ever differ. Note that it is important that allocation of connections to backward ports be a deterministic function of connection requests and random input bits for this to work properly.

Each backward port has a pull-down signal () that indicates when the port is not in use. By connecting this signal across the cascaded routers, the backward ports can detect when they disagree about an allocation. Since disagreement can only occur as a result of some fault, as soon as the disagreement is noticed, the connection is shut down so that the fault is contained.

Note that there is still a need for end-to-end checking to assure that messages arrive in tact. Though highly improbable, there still exist some cases in which a faulty allocation may go unnoticed by the shared pull-down. The shared pull-down facility significantly limits the negative effects that can be caused by the faults which are most likely to occur.

To avoid the need for additional resources to generate the random bits, the METRO architecture requires each component to generate one random bit stream.

Configuration Options

A METRO router has many options which can be selected each time the component is used, as well as, while the component is in use. All of these options are configurable under scan control from a TAP. Table summarizes the configurable options available for any METRO router.

A METRO with does not consume any header words at all, so an additional configuration scan register ( swallow) will be needed to optionally force a consumpation of the header byte.

For a METRO component with static dilation configuration, dilation specifies the effective dilation for the component. A METRO which implements general, dynamic dilation, instead, will not have such a configuration register. Instead the dilation information must be encoded along with each route specification.

N.B. At present a single ``fast reclamation'' bit for each port serves to decide both:

One might want finer granularity, controlling acknowledgement and generation separately. At present, we expect that fine of a grained control is past the point of diminishing returns.

A particular implementation of the METRO router may have implementation dependent configuration options, as well.

Routing Protocol and Behavioral Details

METRO routers communicate using the router portion of the METRO Routing Protocol ( MRP-ROUTER). The most definitive description of MRP-ROUTER can be found in Chapter 4 of [DeH93b]. [DeH93b] also contains several examples of protocol behavior.

Instrumentation/Statistics Gathering

Instrumentation and statistic gathering support is strongly reccommended for any implementation. At present, these are all implementation specific. More will be said elsewhere about potentially interesting statistics to gather.

Scan Architecture

The details of the scan architecture are implementation specific. At a minimum, the scan architecture must provide facilities to load the configuration data described in Section . For full IEEE-1149.1 compatibility, the component should also support boundary scan operations. Each router port should be separately scannable in order to fully realize the fault localization properties introduced in Section . For testing and diagnostics, we recommend that all internal state ( e.g. finite-state machine bits, crosspoint state, and internal pipeline registers) be accessible via the scan path. Additionally, counters and event monitors used for statistics gathering should be controllable and accessible via the scan paths.

See Also...

References

CED92
Frederic Chong, Eran Egozy, and Andre DeHon. Fault Tolerance and Performance of Multipath Multistage Interconnection Networks. In Thomas F. Knight Jr. and John Savage, editors, Advanced Research in VLSI and Parallel Systems 1992, pages 227-242. MIT Press, March 1992. [FTP link].

CK92
Frederic T. Chong and Thomas F. Knight, Jr. Design and Performance of Multipath MIN Architectures. In Symposium on Parallel Architectures and Algorithms, pages 286-295, San Diego, California, June 1992. ACM. [FTP link].

Com90
IEEE Standards Committee. IEEE Standard Test Access Port and Boundary-Scan Architecture. IEEE, 345 East 47th Street, New York, NY 10017-2394, July 1990. IEEE Std 1149.1-1990.

DeH90
Andre DeHon. Fat-Tree Routing For Transit. AI Technical Report 1224, MIT Artificial Intelligence Laboratory, April 1990. [AITR-1224 FTP link].

DeH91
Andre DeHon. Practical Schemes for Fat-Tree Network Construction. In Carlo H. Sequin, editor, Advanced Research in VLSI: International Conference 1991, pages 307-322. MIT Press, March 1991. [FTP link].

DeH92a
Andre DeHon. Scan-Based Testability for Fault-Tolerant Architectures. Transit Note 60, MIT Artificial Intelligence Laboratory, January 1992. [tn60 HTML link] [tn60 FTP link].

DeH92b
Andre DeHon. Scan-Based Testability for Fault-tolerant Architectures. In Duncan M. Walker and Fabrizio Lombardi, editors, Proceedings of the IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems, pages 90-99. IEEE, IEEE Computer Society Press, 1992. [FTP link].

DeH93a
Andre DeHon. METROJR-ORBIT Datasheet. Transit Note 90, MIT Artificial Intelligence Laboratory, August 1993. [tn90 HTML link] [tn90 FTP link].

DeH93b
Andre DeHon. Robust, High-Speed Network Design for Large-Scale Multiprocessing. Master's thesis, MIT, 545 Technology Sq., Cambridge, MA 02139, February 1993. [FTP link].

DKM91
Andre DeHon, Thomas F. Knight Jr., and Henry Minsky. Fault-Tolerant Design for Multistage Routing Networks. In International Symposium on Shared Memory Multiprocessing, pages 60-71. Information Processing Society of Japan, April 1991. [FTP link].

GL85
Ronald I. Greenberg and Charles E. Leiserson. Randomized Routing on Fat-Trees. In IEEE 26th Annual Symposium on the Foundations of Computer Science. IEEE, November 1985.

Lei85
Charles E. Leiserson. Fat-Trees: Universal Networks for Hardware Efficient Supercomputing. IEEE Transactions on Computers, C-34(10):892-901, October 1985.

LLM90
Tom Leighton, Derek Lisinski, and Bruce Maggs. Empirical Evaluation of Randomly-Wired Multistage Networks. In International Conference on Computer Design: VLSI in Computers and Processors. IEEE, 1990.

LM89
Tom Leighton and Bruce Maggs. Expanders Might Be Practical: Fast Algorithms for Routing Around Faults on Multibutterflies. In IEEE 30th Annual Symposium on Foundations of Computer Science, 1989.

Min91
Henry Q. Minsky. A Parallel Crossbar Routing Chip for a Shared Memory Multiprocessor. AI Technical Report 1284, MIT Artificial Intelligence Laboratory, 545 Technology Sq., Cambridge, MA 02139, 1991. [FTP link].

Upf89
E. Upfal. An deterministic packet routing scheme. In 21st Annual ACM Symposium on Theory of Computing, pages 241-250. ACM, May 1989.

MIT Transit Project