Transit Note #44
RN2 Proposal

Andre DeHon

Original Issue: April 1991

Last Updated: Thu Nov 11 15:35:28 EST 1993

Abstract:

After gaining considerable experience from RN1, we are now ready to begin work on the next generation routing component, RN2. RN2 is largely an incremental refinement of the basic architecture embodied in RN1. RN2 includes extensive support for boundary scan manipulation and testing. RN2 incorporates the low-voltage swing drivers originally intended for RN1, a back propagating control signal for fast path collapsing and routing flow control, and additional pipelining for higher bandwidth performance.

Purpose of This Document

This document identifies the basic architecture of RN2. The description assumes that the reader is familiar with RN1 (tn26) [Min91]. In general, we focus on the differences between RN1 and RN2 rather than giving a complete description of the architecture.

RN2 Characteristics

Like RN1, RN2 is basically an 8 input crossbar routing component with byte-wide data channels. The component can be organized as either a single radix 4, dilation 2 router or a pair of independent radix 4, dilation 1 routers. RN2 uses the RNP (tn41) fault tolerant routing protocol for communication over its data ports. Unlike RN1, RN2 incorporates the backward flow control channel also described in (tn41).

RN2 differs from RN1 by the following characteristics:

Boundary scan architecture compliant with IEEE1149.1-1990 [Com90] with multi-TAP extensions (tn60)
Backward flow control allowing fast blocked-path collapsing and less oblivious routing
Dynamically adjustable, controlled impedance, low-voltage swing drivers -- controllable via the boundary scan test access port
Ability to ``turn-off'' any port
200MHz operation (targeted)
chip to chip latency of 20 ns (maybe 15 ns)

Technology

RN2 is targeted at Hewlett-Packard's 1 micron CMOS process, HP26 [HP90]. We have chosen this process because it is the highest performance CMOS technology available through MOSIS today.

Performance Goals

To achieve the desired performance goals, RN2's internal architecture will differ slightly from that of RN1. Exactly which internal architecture is used will be determined once we work through the detailed design and see how the logic decomposes.

In RN1A, the time from the active edge of the clock to the arrival of data at the output pad for an allocate cycle is 13-14ns. RN1 was implemented in a 1.2 micron process (HP34). With the technology change and some cleanup of the critical path, we hope to trim this path to under 10ns.

We hope to be able to move data from one chip to another across one to two feet of wire in less than 5ns using the low-voltage I/O pads of Knight, Simon, and DeHon. This chip-to-chip time (including I/O pad delays) of 5ns is the basis for the 200MHz operational goal. The aim is to pipeline the internal architecture such that we can place a new byte on each output port every 5ns.

Two Cycle Allocate

One strategy for achieving this pipelining is to find a place to break the allocate cycle such that the 10ns allocate cycle through the crosspoint array occurs in two separate clock cycles. Here, a datum is clocked into an input register on RN2. From there it is clocked into an intermediate register. Finally, it is clocked into an output register to drive the next routing component. Thus, there is a three clock cycle (15ns) latency from chip to chip.

This is the preferred strategy for pipelining data through RN2. However, this requires that we find an appropriate place to break the cycle, which may not be a trivial matter.

Wide Crosspoint

The alternative strategy is to place a pair of registers on the input and output pads. Byte-wide data is clocked into the chip in two (5ns) clock cycles. The 16-bit-wide data is then clocked through the crosspoint in a single two-clock-cycle period (10ns) and registered at the output register. The output register clocks the data out in two 5ns clock cycles over the byte-wide channel. This strategy takes advantage of the fact that we can transmit data chip to chip twice as fast as we can perform an allocate through the crosspoint. The strategy matches the bandwidth by doubling the size of the datapath through the crosspoint. The chip to chip latency in this case is four 5ns clock cycles (20ns).

Due to the larger latency, this is the fallback strategy. We feel confident that it is achievable should we fail to find a good place to divide the cycle. For simplicity, it may be necessary to restrict signalling events to occur on ``even'' clock cycles (i.e. whenever the signalling event ends up in the high-byte to be transmitted through the crosspoint).
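The latency and bandwidth figures quoted for the two strategies follow from simple cycle counting. The sketch below reproduces that arithmetic; the function names are hypothetical and only the 5ns cycle time and the cycle counts come from the text above.

```python
CLOCK_NS = 5  # targeted chip-to-chip cycle time (200MHz), from the text


def chip_to_chip_latency_ns(clock_cycles: int, clock_ns: int = CLOCK_NS) -> int:
    """Latency of a datum that spends `clock_cycles` cycles getting from chip to chip."""
    return clock_cycles * clock_ns


def bytes_per_cycle(datapath_bytes: int, cycles_per_transfer: int) -> float:
    """Sustained crosspoint throughput, in bytes per 5ns clock cycle."""
    return datapath_bytes / cycles_per_transfer


# Two cycle allocate: input register -> intermediate register -> output register.
two_cycle_allocate_ns = chip_to_chip_latency_ns(3)

# Wide crosspoint: two cycles to assemble 16 bits, then one two-cycle crosspoint period.
wide_crosspoint_ns = chip_to_chip_latency_ns(2 + 2)

print(two_cycle_allocate_ns, wide_crosspoint_ns)
print(bytes_per_cycle(1, 1), bytes_per_cycle(2, 2))
```

Both strategies sustain one byte per clock cycle on each port; they differ only in latency, which is why the wide crosspoint is treated as the fallback.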

Boundary Scan

RN2 will rely heavily on boundary scan logic to load mostly-static configuration bits into the part as well as for testing. The boundary scan path allows connectivity testing of the network PCBs. It will also allow static testing of the routing component. For fault tolerance in the scan TAP itself, RN2 will have two (perhaps 3?) TAPs as described in (tn60).

Each I/O pad will have registers loadable through the boundary scan mechanism to set the output drive resistance. Additionally, each I/O pad will have a boundary scan register for sampling the output waveform seen by the part.

The clock has a register that controls the non-overlap between the internally generated clock phases.

Each port has the following control bits, loadable through boundary scan logic:

swallow -- just like RN1's swallow pin. When this bit is set, the first byte of an allocate request is stripped off and the following byte is used to select an output port. (Input ports only.)
disable-port -- if this bit is set on an output port, the output port is deterministically not used. This bit is factored into the calculation of the propagated flow control. An input port with this bit set will never attempt to allocate a connection.
disable-back-control -- if this bit is set, the backward control bit on this port is disabled. This allows the routers on the ends of the network to interface to components which do not support the backward control bit.

The chip has a control bit to select between dual dilation 1 mode and dilation 2 mode. If the wide-crosspoint strategy described above is used, we may also want a slow-clock control bit. When this bit is set, data is sent one byte at a time through the crosspoint. This allows us to run the component at 100MHz without incurring unnecessary additional latency.
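As an illustration of the per-port and per-chip configuration state just described, the following sketch models the control bits as plain records. The class names, field names, and especially the bit ordering in `scan_bits` are assumptions made for the example; this note does not define the layout of the scan chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    """Per-port control bits loadable through boundary scan (names are illustrative)."""
    is_input: bool
    swallow: bool = False               # strip the first byte of an allocate request
    disable_port: bool = False          # port is never used / never allocates
    disable_back_control: bool = False  # ignore the backward control bit on this port

    def __post_init__(self):
        # The text restricts swallow to input ports.
        if self.swallow and not self.is_input:
            raise ValueError("swallow applies to input ports only")

    def scan_bits(self):
        # Hypothetical ordering; the real scan chain layout is not specified here.
        return [int(self.swallow), int(self.disable_port), int(self.disable_back_control)]


@dataclass(frozen=True)
class ChipConfig:
    """Chip-wide control bits (names are illustrative)."""
    dilation_2: bool = True    # False selects dual dilation 1 mode
    slow_clock: bool = False   # only meaningful with the wide-crosspoint strategy


edge_input = PortConfig(is_input=True, swallow=True, disable_back_control=True)
print(edge_input.scan_bits())
```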

I/O Pads

As described above, the component uses the Knight, Simon, DeHon low-voltage swing I/O pads. These pads have dynamically adjustable impedance control registers to provide properly matched series terminated drive on long wires. The pads switch between ground and one volt, making them compatible with skewed ECL voltage levels.

Backward Control Bit

Flow Control

The backward control bit performs two functions. When a port is idle, the backward control bit serves as a binary indication of the likelihood of routing a connection through the specified output port. That is, if the bit is driven in the busy direction, there is no guarantee that the connection can be successfully routed through the given output port to the desired destination; if the bit is driven in the not-busy direction, it is highly probable that the connection can be successfully routed through the port.

This flow control function is always an estimate of the state of the network. Since we are not willing to slow the network down to allow the flow control to propagate all the way through the network, it is updated in a pipelined manner on a cycle by cycle basis. Thus the information provided through this means to a router may be stale. In general, the information is pessimistic, in that it may indicate a path is unlikely through a given direction when one actually exists. As a result, each router uses this approximation to the network state in an advisory fashion. Only in the case that one output port in a given direction is busy and the other is not-busy is it actually used; in this case, the router deterministically routes the connection out the not-busy port. If both are busy, the router chooses randomly between the two available ports, just as it does when both are not-busy. The impact of getting blocked at subsequent stages is lessened by the other use of the backward control bit.
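The advisory selection rule can be sketched as follows for the two equivalent output ports of one logical direction. The function name and arguments are hypothetical, and the sketch assumes both ports are otherwise available (neither already in use nor disabled).

```python
import random


def choose_port(port_a, port_b, busy_a, busy_b, rng=random):
    """Pick between two available, equivalent output ports using the busy hints."""
    if busy_a != busy_b:
        # Exactly one port reports busy: deterministically take the not-busy one.
        return port_b if busy_a else port_a
    # Both busy or both not-busy: the hint carries no preference, so choose randomly.
    return rng.choice((port_a, port_b))


print(choose_port(0, 1, busy_a=True, busy_b=False))
print(choose_port(0, 1, busy_a=False, busy_b=True))
```

Because a busy indication is only a pessimistic estimate, the both-busy case still attempts a connection rather than refusing it.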

Path Collapsing

When the port is in use, the backward control bit is used to propagate a connection drop backward through the network. When a router cannot route a connection, it drives the backward control bit into the back-drop state. This back-drop is pipelined back through the connection, reclaiming the resources held by the connection from the point of blocking back to the source. This allows connections to be broken down without waiting for the tail of a message, thus reclaiming precious routing resources for use by subsequent connection attempts through the network. If the endpoints also implement the backward control bit signalling, this additionally allows the receiving endpoint to break down a connection when it desires. Thus, when an endpoint receives a bogus or corrupted message, it can reclaim its input port at its own discretion rather than only at the discretion of the originator. This also provides tolerance against stuck-at faults in the network.
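A minimal sketch of the collapse order, assuming the back-drop advances exactly one stage toward the source per clock cycle (the text says only that it is pipelined, so the per-cycle rate is an assumption, as are the function and stage names):

```python
def collapse_path(path):
    """Return (cycle, released_stage) pairs for a blocked connection.

    `path` lists the stages holding resources, source first; the last entry
    is the stage adjacent to the point of blocking.
    """
    # The stage nearest the blocking point releases first, the source last.
    return [(cycle, stage) for cycle, stage in enumerate(reversed(path), start=1)]


for cycle, stage in collapse_path(["source", "stage1", "stage2"]):
    print(f"cycle {cycle}: {stage} releases its port")
```

Under this assumption the resources nearest the blocking point are freed within a cycle or two, without waiting for the tail of the message.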

Issues

One issue to be decided is the level-encoding of the busy and back-drop signals. It would be nice to encode them such that disconnected backward control bits always floated to back-drop and busy. This way, if the backward control bit is floating, the port is used as little as possible. Additionally, it would be nice for back-drop and busy to be of different encodings so that neither a stuck-at-high fault nor a stuck-at-low fault will request/hold resources along a path for very long. Obviously these two conditions cannot both be met, so we will have to decide which is more significant.

Another important issue to be worked out is precisely how to deal with the backward control bit while ports are being turned around or allocated (i.e. any time the router is going through transient states). Care should probably be taken to account for the pipeline delay for the downstream router to change states.
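The conflict between the two encoding wishes can be checked by brute force. This sketch enumerates every assignment of logic levels to the floating state, the busy indication, and the back-drop indication; the variable names and the 0/1 level model are assumptions made for illustration.

```python
from itertools import product


def satisfying_encodings():
    """List (float_level, busy_level, drop_level) triples meeting both wishes."""
    results = []
    for float_level, busy_level, drop_level in product((0, 1), repeat=3):
        # Wish 1: a floating wire reads as both busy and back-drop.
        floats_safe = busy_level == float_level and drop_level == float_level
        # Wish 2: busy and back-drop use different levels, so no single
        # stuck-at fault asserts both.
        stuck_at_tolerant = busy_level != drop_level
        if floats_safe and stuck_at_tolerant:
            results.append((float_level, busy_level, drop_level))
    return results


print(satisfying_encodings())
```

The list is empty: the first wish forces busy and back-drop onto the same level, while the second requires them to differ.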

Port Deselection

When the disable-port bit is unset, the port behaves normally. When the bit is set, the port is disabled. As an input port, it will not allocate connections. As an output port, it is deterministically avoided. Disabling a port is a stronger anti-bias than setting the port's backward control bit to busy, since a connection is never allocated through a disabled port.

The ability to deselect a port is important for masking identified faults and for allowing in-operation repairs. When the status information collected by the endpoints indicates that a fault has likely occurred, the possibly faulty component can be isolated from use by disabling all the ports communicating with the identified component. The component is then deterministically avoided. The component and its interconnect can then be tested via the boundary scan mechanism. After testing, the offending fault, if any, is left masked. If only wires on a few ports are faulty, the unaffected ports can be re-enabled.
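The isolation step amounts to finding every neighbouring port wired to the suspect component and setting its disable-port bit. A sketch under an assumed representation, where each wire is a pair of (component, port) endpoints; the component names and port numbers are made up for the example:

```python
def ports_to_disable(links, suspect):
    """Return the (component, port) endpoints on neighbours that face `suspect`."""
    disable = set()
    for end_a, end_b in links:
        # Disable the far end of every wire that touches the suspect component.
        if end_a[0] == suspect and end_b[0] != suspect:
            disable.add(end_b)
        if end_b[0] == suspect and end_a[0] != suspect:
            disable.add(end_a)
    return disable


# Hypothetical wiring: r0 feeds r1 and r2, and r1 feeds r2.
links = [
    (("r0", 4), ("r1", 0)),
    (("r1", 5), ("r2", 1)),
    (("r0", 5), ("r2", 0)),
]
print(sorted(ports_to_disable(links, "r1")))
```

Re-enabling after test is the reverse operation on whichever of these ports are found to have sound wiring.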

Masking a fault in this manner should improve the faulty-performance of the network over the case where the fault has occurred but is not masked. The magnitude of this effect is still a subject for simulation.

In larger systems, it may be desirable to remove a modular portion of the network (e.g. a unit tree stack in a fat-tree network [DeH90]) while the rest of the network continues to operate. By disabling all the ports into the module, we can similarly prevent routers from attempting to make connections through the removed module.

Pin Requirements

The signal pin requirements for RN2 are summarized below. Additionally, a separate 5V or 3.3V dirty power supply will be needed to support the boundary scan interface (assuming we are going to run the boundary scan interface at TTL levels to be compatible with other boundary scan components).

Construction Approach

We will be experimenting with some new tools for synthesis and simulation during the design of RN2. Following is our current strategy for bringing an RN2 design to realization.

Move the RN1B design to a hardware design language (i.e. TCF) -- verify functionality with existing RN1B test suite
Synthesize to Actel FPGAs to generate RN1B-Actel
Add boundary scan support and other design changes that leave the basic RN1B functionality intact and verify design
Revise to RN2 functionality and update test vectors accordingly
Synthesize to Actel FPGAs to generate RN2A (Actel)
Retarget to HP26 standard cell CMOS -- initially try naive mapping, tweak as necessary to achieve desired performance
Clean up optimized design adding 1V pads to generate RN2C (CMOS)
