RN2 Proposal
Andre DeHon
Original Issue: April 1991
Last Updated: Thu Nov 11 15:35:28 EST 1993
After gaining considerable experience from RN1, we are now ready to begin work on the next generation routing component, RN2. RN2 is largely an incremental refinement of the basic architecture embodied in RN1. RN2 includes extensive support for boundary scan manipulation and testing. RN2 incorporates the low-voltage swing drivers originally intended for RN1, a back propagating control signal for fast path collapsing and routing flow control, and additional pipelining for higher bandwidth performance.
This document identifies the basic architecture of RN2. The description assumes that the reader is familiar with RN1 (tn26) [Min91]. In general, we focus on the differences between RN1 and RN2 rather than giving a complete description of the architecture.
Like RN1, RN2 is basically an 8-input crossbar routing component with byte-wide data channels. The component can be organized either as a single radix 4, dilation 2 router or as a pair of independent radix 4, dilation 1 routers. RN2 uses the RNP (tn41) fault tolerant routing protocol for communication over its data ports. Unlike RN1, RN2 does incorporate the backward flow control channel also described in (tn41).
RN2 differs from RN1 by the following characteristics:
RN2 is targeted at Hewlett-Packard's 1 micron CMOS process, HP26 [HP90]. We have chosen this process because it is the highest performance CMOS technology available through MOSIS today.
To achieve the desired performance goals, RN2's internal architecture will differ slightly from RN1. Exactly which internal architecture is used will be determined once we get into the detailed architecture and see how things decompose.
In RN1A, the time from the active clock edge to the arrival of data at the output pad for an allocate cycle is 13-14ns. RN1 was implemented in a 1.2 micron process (HP34). With the technology change and some cleanup of the critical path, we hope to trim this path to under 10ns.
We hope to be able to move data from one chip to another across one to two feet of wire in less than 5ns using the low-voltage I/O pads of Knight, Simon, and DeHon. This chip-to-chip time (including I/O pad delays) of 5ns is the basis for the 200MHz operational goal. The aim is to pipeline the internal architecture such that we can place a new byte on each output port every 5ns.
One strategy for achieving this pipelining is to find a place to break the allocate cycle such that the 10ns allocate cycle through the crosspoint array occurs in two separate clock cycles. Here, a datum is clocked into an input register on RN2. From there it is clocked into an intermediate register. Finally, it is clocked into an output register to drive the next routing component. Thus, there is a three clock cycle (15ns) latency from chip to chip.
This is the preferred strategy for pipelining data through RN2. However, this requires that we find an appropriate place to break the cycle, which may not be a trivial matter.
The alternative strategy is to place a pair of registers on the input and output pads. Byte-wide data is clocked into the chip in two (5ns) clock cycles. The 16-bit-wide data is then clocked through the crosspoint in one two-clock-cycle period (10ns) and registered at the output register. The output register clocks the data out in two 5ns clock cycles over the byte-wide channel. This strategy takes advantage of the fact that we can transmit data chip to chip twice as fast as we can perform an allocate through the crosspoint. The strategy matches the bandwidth by doubling the size of the datapath through the crosspoint. The chip to chip latency in this case is four 5ns clock cycles (20ns).
Due to the larger latency, this is the fallback strategy. We feel confident this is achievable in the case that we cannot find a good place to divide the clock cycle. For simplicity, it may be necessary to restrict signalling events to occur on "even" clock cycles (i.e. whenever the signalling event ends up in the high-byte to be transmitted through the crosspoint).
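The latency and bandwidth figures quoted for the two pipelining strategies can be checked with a small sketch. The cycle counts and widths come from the text above; the class and constant names are hypothetical, and treating the split allocate cycle as accepting one byte per clock is an assumption about how the pipelined crosspoint would behave.

```python
from dataclasses import dataclass

CLOCK_NS = 5   # chip-to-chip clock period from the text (200MHz goal)
PORT_BITS = 8  # byte-wide data channel


@dataclass(frozen=True)
class Strategy:
    name: str
    latency_cycles: int     # chip-to-chip latency, in 5ns clock cycles
    crosspoint_bits: int    # datapath width through the crosspoint
    crosspoint_cycles: int  # clock cycles between successive crosspoint transfers

    def latency_ns(self) -> int:
        return self.latency_cycles * CLOCK_NS

    def crosspoint_bits_per_ns(self) -> float:
        return self.crosspoint_bits / (self.crosspoint_cycles * CLOCK_NS)


# Preferred: input, intermediate, and output registers (allocate cycle split in two).
SPLIT_ALLOCATE = Strategy("split allocate cycle", 3, 8, 1)
# Fallback: 16-bit-wide crosspoint clocked once every two 5ns cycles.
WIDE_CROSSPOINT = Strategy("wide crosspoint", 4, 16, 2)

# Bandwidth each byte-wide port must sustain: one byte every 5ns.
PORT_BITS_PER_NS = PORT_BITS / CLOCK_NS

if __name__ == "__main__":
    for s in (SPLIT_ALLOCATE, WIDE_CROSSPOINT):
        print(f"{s.name}: {s.latency_ns()}ns latency, "
              f"{s.crosspoint_bits_per_ns()} bits/ns through the crosspoint")
```

Under these assumptions both strategies carry the same bandwidth through the crosspoint as the byte-wide port delivers; they differ only in chip-to-chip latency (15ns versus 20ns).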
RN2 will rely heavily on boundary scan logic to load mostly-static configuration bits into the part as well as for testing. Obviously, the boundary scan path allows connectivity testing of the network PCBs. It will also allow static testing of the routing component. For fault tolerance in the scan TAP itself, RN2 will have two (perhaps 3?) TAPs as described in (tn60).
Each I/O pad will have registers loadable through the boundary scan mechanism to set the output drive resistance. Additionally each I/O pad will have a boundary scan register for sampling the output waveform seen by the part.
The clock has a register that controls the non-overlap between the internally generated clock phases.
Each port has the following control bits, loadable through boundary scan logic:
The chip has a control bit to select between dual dilation 1 mode and dilation 2 mode. If the wide-crosspoint strategy described above is used, we may want a slow-clock control bit. When this bit is set, data is sent one byte at a time through the crosspoint. This allows us to run the component at 100MHz without incurring unnecessary additional latency.
As described above, the component uses the Knight, Simon, DeHon low-voltage swing I/O pads. These pads have dynamically adjustable impedance control registers to provide properly matched series terminated drive on long wires. The pads switch from ground to one-volt, thus being compatible with skewed ECL voltage levels.
The backward control bit performs two functions. When a port is idle, the backward control bit serves as a binary indication of the likelihood of routing a connection through the specified output port. That is, if the bit is driven in the busy direction, it is not guaranteed that the connection can be successfully routed through the given output port to the desired destination; if the bit is driven in the not-busy direction, it is highly probable that the connection can be successfully routed through the port.
This flow control function is always an estimate of the state of the network. Since we are not willing to slow the network down to allow the flow control to propagate all the way through the network, it is updated in a pipelined manner on a cycle by cycle basis. Thus the information provided through this means to a router may be stale. In general, the information is pessimistic, in that it may indicate a path is unlikely through a given direction when one actually exists. As a result, each router uses this approximation to the network state in an advisory fashion. Only in the case that one output port in a given direction is busy and the other is not-busy is it actually used; in this case, the router deterministically routes the connection out the not-busy port. If both are busy, just as when both are not-busy, the router chooses randomly between the two available ports. The impact of getting blocked at subsequent stages is lessened by the other use of the backward control bit.
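The advisory use of the busy indication can be sketched as follows. This is a minimal illustration, not the RN2 selection logic itself: the function name and the boolean representation of the backward control bit are hypothetical, and both output ports in the requested direction are assumed to be idle and available for allocation.

```python
import random
from typing import Optional, Sequence


def choose_output_port(busy: Sequence[bool],
                       rng: Optional[random.Random] = None) -> int:
    """Pick one of the two dilated output ports in a single direction.

    busy[i] is the (possibly stale) backward control bit seen on port i.
    Assumption: both ports are idle and available for allocation.
    """
    if len(busy) != 2:
        raise ValueError("dilation 2 routing expects exactly two candidate ports")
    rng = rng or random.Random()
    if busy[0] != busy[1]:
        # Exactly one port reports busy: deterministically take the other.
        return 1 if busy[0] else 0
    # Both busy or both not-busy: the hint does not discriminate, so pick randomly.
    return rng.randrange(2)
```

For example, `choose_output_port([True, False])` always returns port 1, while `choose_output_port([True, True])` returns either port.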
One issue to be decided is the level-encoding of the busy and back-drop signals. It would be nice to encode them such that disconnected backward control bits always floated to back-drop and busy. This way, if the backward control bit is floating, the port is used as little as possible. Additionally, it would be nice for back-drop and busy to be of different encodings so that neither a stuck-at-high fault nor a stuck-at-low fault will request/hold resources along a path for very long. Obviously these two conditions cannot both be met, so we will have to decide which is more significant.
Another important issue to be worked out is precisely how to deal with the backward control bit while ports are being turned around or allocated (i.e. any time the router is going through transient states). Care should probably be taken to account for the pipeline delay for the downstream router to change states.
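The conflict between the two encoding goals can be made explicit by enumerating the possibilities. The sketch below is only an illustration of the argument: it models each of busy, back-drop, and the floating level as one of two logic levels, and the function names are hypothetical.

```python
from itertools import product

LEVELS = ("low", "high")


def float_safe(busy: str, back_drop: str, float_level: str) -> bool:
    # A disconnected (floating) bit should read as both busy and back-drop.
    return busy == float_level and back_drop == float_level


def stuck_at_safe(busy: str, back_drop: str) -> bool:
    # busy and back-drop use different levels, so no single stuck-at fault
    # asserts the resource-requesting/holding meaning in both roles.
    return busy != back_drop


def encodings_meeting_both():
    """Return every (busy, back_drop, float_level) choice satisfying both goals."""
    return [
        (busy, back_drop, float_level)
        for busy, back_drop, float_level in product(LEVELS, repeat=3)
        if float_safe(busy, back_drop, float_level)
        and stuck_at_safe(busy, back_drop)
    ]
```

The first goal forces busy and back-drop onto the same level, while the second forces them apart, so the list of encodings meeting both is empty.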
When the disable-port bit is unset, the port behaves normally. When the bit is set, the port is disabled. As an input port, it will not allocate connections. As an output port, the component is deterministically avoided. A port being disabled is a stronger anti-bias than a port's backward control bit being set to busy since a connection is never allocated through a disabled port.
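The difference between a disabled port and a busy port can be illustrated with a short sketch. The names here are hypothetical, and returning `None` when every candidate port is disabled is an assumption about how a blocked connection would be reported; the text only states that a connection is never allocated through a disabled port.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PortState:
    disabled: bool  # disable-port bit, loaded through boundary scan
    busy: bool      # advisory backward control bit


def select_port(ports: Sequence[PortState],
                rng: Optional[random.Random] = None) -> Optional[int]:
    """Choose an output port index, or None if no port may be used."""
    rng = rng or random.Random()
    # Hard constraint: disabled ports are never candidates.
    enabled = [i for i, p in enumerate(ports) if not p.disabled]
    if not enabled:
        return None  # assumption: the connection is blocked in this direction
    # Soft preference: favour not-busy ports, but fall back to busy ones.
    not_busy = [i for i in enabled if not ports[i].busy]
    return rng.choice(not_busy if not_busy else enabled)
```

In this sketch a busy but enabled port is still chosen when its partner is disabled, whereas a disabled port is never chosen, which is the sense in which disabling is the stronger anti-bias.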
The ability to deselect a port is important for masking identified faults and for allowing in-operation repairs. When the status information collected by the endpoints indicates that a fault has likely occurred, the possibly faulty component can be isolated from use by disabling all the ports communicating with the identified component. The component is then deterministically avoided. The component and its interconnect can then be tested via the boundary scan mechanism. After testing, the offending fault, if any, is left masked. If only wires on a few ports are faulty, the unaffected ports can be re-enabled.
Masking a fault in this manner should improve the faulty-performance of the network over the case where the fault has occurred but is not masked. The magnitude of this effect is still a subject for simulation.
In larger systems, it may be desirable to remove a modular portion of the network (e.g. a unit tree stack in a fat-tree network [DeH90]) while the rest of the network continues to operate. By disabling all the ports into the module, we can similarly prevent routers from attempting to make connections through the removed module.
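Isolating a suspect component or a removed module amounts to finding every port on the remaining components that connects into it. The sketch below assumes a hypothetical description of the network wiring as a list of point-to-point links; the `Link` type, the component names, and the port numbers are illustrative only.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Link(NamedTuple):
    comp_a: str  # component on one end of the wire
    port_a: int  # port number on comp_a
    comp_b: str  # component on the other end
    port_b: int  # port number on comp_b


def ports_to_disable(links: Iterable[Link],
                     removed: Set[str]) -> Set[Tuple[str, int]]:
    """Return (component, port) pairs outside `removed` that face into it."""
    result: Set[Tuple[str, int]] = set()
    for link in links:
        a_removed = link.comp_a in removed
        b_removed = link.comp_b in removed
        if a_removed and not b_removed:
            result.add((link.comp_b, link.port_b))
        elif b_removed and not a_removed:
            result.add((link.comp_a, link.port_a))
        # Links entirely inside or entirely outside the removed set need no action.
    return result


# Hypothetical wiring: r2 is the suspect router.
EXAMPLE_LINKS = [
    Link("r0", 4, "r2", 0),
    Link("r1", 4, "r2", 1),
    Link("r2", 4, "r3", 0),
    Link("r0", 5, "r3", 1),
]
```

Calling `ports_to_disable(EXAMPLE_LINKS, {"r2"})` yields the ports on r0, r1, and r3 that are wired to r2; passing a larger set of component names covers the case of removing a whole module.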
The signal pin requirements for RN2 are summarized below. Additionally, a separate 5V or 3.3V dirty power supply will be needed to support the boundary scan interface (assuming we are going to run the boundary scan interface at TTL levels to be compatible with other boundary scan components).
We will be experimenting with some new tools for synthesis and simulation during the design of RN2. Following is our current strategy for bringing an RN2 design to realization.