Transit Note #1

RN1: Things to Improve Upon

Andre DeHon

Original Issue: April 1990

Last Update: Wed Nov 10 22:21:04 EST 1993

Introduction

There are a number of things we might want to do differently or improve upon for future versions of a routing chip. In this document, I touch upon a few such changes that we've considered while working on RN1. These suggestions range from features which are peculiar to our implementation of RN1 to more general behavioral features one might want from a routing component.

RN1 Quirks

These are quirks in RN1 that are simply artifacts of the design process. Hopefully, they will be cleaned up in future versions.

Separate Busses

The Finite State Machines (FSM's) that control RN1's behavior were originally designed to interface with a single data bus for data traveling both forward and backward through the chip. The states in the FSM were carefully crafted/optimized to allow correct behavior when the single data bus carried data in both directions through the chip. At a much later stage in the development of RN1 the internal data busses were separated in order to lower capacitive loading and increase switching speed. The FSM's were modified slightly to behave correctly with the separate busses.

Since the busses are separate both going to and from the i/o pads and in the crosspoint array, the data busses should be fully separated throughout the chip. The state machines stand to be cleaned up considerably. The FSM's should be redesigned and optimized to deal with the separate data busses.

Random Routing Computation

Currently, the calculation of the pseudo-random bits which are used to select among equivalent output ports is based on every data bit that goes through the routing component. Using all the data should make the preferred output selection quite random, but it prevents us from using multiple RN1 components (or stacks) in parallel to increase bandwidth.

Clever modification of the pseudo-random bit calculations should allow us to cascade RN1 routers in parallel. The obvious way to achieve this is to make the pseudo-random bits functions of only the routing bytes. The routing bytes will have to be the same between parallel routing components, so the random bits will end up being the same. Each routing chip in a parallel cascade will receive the same routing sequences and hence choose to route in the same direction as the routing components it is in parallel with.
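The idea can be sketched in software. The following is an illustrative model only; the LFSR width, polynomial, and state-update details are assumptions and not RN1's actual pseudo-random generator. The point is simply that if the generator state is perturbed only by routing bytes, parallel routers fed the same routing sequence stay in lockstep regardless of the data they carry.

```python
def update_random(state, word, is_routing_byte):
    """Advance a 16-bit LFSR only on routing bytes, so that parallel
    routers fed the same routing sequence stay in lockstep.
    (Illustrative sketch: the width, taps, and mixing step are
    assumptions, not RN1's actual pseudo-random hardware.)"""
    if not is_routing_byte:
        return state            # data bytes no longer perturb the state
    state ^= word & 0xFF        # mix in the routing byte
    # one step of a Fibonacci LFSR; tap positions are arbitrary here
    bit = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    return ((state << 1) | bit) & 0xFFFF

# Two "parallel" routers seeded identically track each other:
a = b = 0xACE1
for byte in (0x5A, 0x3C):                # same routing bytes on both stacks
    a = update_random(a, byte, True)
    b = update_random(b, byte, True)
assert a == b
a = update_random(a, 0x77, False)        # data bytes differ between stacks...
b = update_random(b, 0x99, False)        # ...but are ignored, so states stay equal
assert a == b
```

Basing the bits on every data byte, as RN1 does now, would make the two states diverge at the first data byte that differed between the parallel components.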

The cascade of parallel components, however, may not be very fault tolerant. Particularly, if a wire (or component) is faulty, some components in a parallel cascade will receive the data while others do not. These will thus see different routing data than their associated routing component(s). The routing components will then be out of synchronization and not route consistently with one another.

Additionally, handling this case on the i/o controller end could be quite difficult. It would not be entirely possible to simply treat the data being sent as a single wide-word since some byte of the word could be blocked due to faulty components independently of the others.

Further thought has led to other schemes for cascading routing components. See (tn4).

Turn

Turn is currently being recognized only as 0b011111111. It would make turn detection slightly faster and save a small amount of silicon to simply recognize 0b11xxxxxx as the turn byte. Since each router sees the original turn byte rotated by two bits, this does not really infringe on the useful space available for encoding routing control.
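The two checks can be compared directly. This sketch assumes the 9-bit word layout implied by the patterns above (bit 8 is the control bit, bits 7..0 are data); that layout is an assumption for illustration.

```python
# Assumed word layout: bit 8 = control bit, bits 7..0 = data
# (an assumption inferred from the 9-bit patterns in the text).
def is_turn_full(word):
    return word == 0b011111111          # current: exact match on all nine bits

def is_turn_fast(word):
    # proposed: control bit low, top two data bits set -- 0b11xxxxxx
    return (word >> 6) == 0b011

# The fast check accepts every word the full check does:
for w in range(1 << 9):
    if is_turn_full(w):
        assert is_turn_fast(w)
```

The fast check compares only three bits instead of nine, which is where the small savings in silicon and delay would come from.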

Checksum Clocking

The current checksum unit does a couple of things poorly. First, its clock is out of phase with the system clock; this results in a delayed checksum result and the inclusion of the turn byte in the checksum computation. Second, the checksum unit is clocked between the status byte and the checksum byte, changing the value of the checksum between the two. It would probably be more efficient and easier to deal with if we made the checksum clock in phase with the rest of the system, did not include the turn byte, and did not change the checksum between the status and checksum bytes. While we are at it, we might want to change the checksum to a CCITT CRC checksum.
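For reference, the CCITT CRC mentioned above is the 16-bit CRC with polynomial x^16 + x^12 + x^5 + 1 (0x1021). A bit-serial software model follows; in hardware this is a 16-bit LFSR clocked once per data bit, which fits a byte-serial datapath well. The 0xFFFF initial value is one common convention, not something the text above specifies.

```python
def crc16_ccitt(data, crc=0xFFFF):
    """Bit-serial CRC-16-CCITT (polynomial x^16 + x^12 + x^5 + 1,
    i.e. 0x1021), processed MSB-first.  The 0xFFFF seed is the common
    convention, assumed here rather than specified by the text."""
    for byte in data:
        crc ^= byte << 8                      # bring the next byte into the high bits
        for _ in range(8):                    # one LFSR step per data bit
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Standard check value for this CRC variant:
assert crc16_ccitt(b"123456789") == 0x29B1
```

Unlike a simple parity or sum, this CRC detects all burst errors shorter than 16 bits, which is attractive for serial byte streams through the network.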

Collapsing Paths

A single extra control line associated with each input/output port of RN1 could be quite useful. This could be used to collapse blocked paths through the network from the point where the block occurred back to the source. This would allow the intermediate routing components to become available for use more quickly and alleviate false blocking in the network (i.e., blocking that results when the tail of a data stream is still holding network resources after the head has already been blocked in the network).

BBN's Butterfly Plus uses this kind of a scheme for recovering resources held by blocked data streams [BBN87].

For small networks and short data streams, there wouldn't be a significant need for this kind of mechanism, but for large networks or data streams, it could offer considerable benefit.

One issue that will need to be considered in using such a scheme is that of getting the routing status information back to the source. If the path is being collapsed early, it is not possible to return status information exactly as we do currently. The fact that the path is being collapsed is enough for the i/o controller to properly implement the source-responsible protocol. However, in order to keep information on how blockage is occurring (e.g., to identify faults or network contention patterns), the information on where data was routed and exactly what blockages occurred is needed. Perhaps this same line could be used to serially transmit the essential information back to the source.

One Scheme

One possibility would be to have a synchronous collapse bit travelling back down the path setup by the connection. When a connection is blocked, this line is asserted and propagates back down the path, visiting a new router each clock cycle. As this collapse signal passes a router, it causes the router to drop its connection through the blocked path. The arrival timing of this drop signal at the source will be sufficient to inform the source of where in the network the problem occurred. It will also signal why the network connection failed, since this collapse path only indicates blocking; checksum information will be reported as before. The information we lose with this scheme is exactly which part of the network is blocked or faulty (i.e., exactly which chip the connection was routed through when it was blocked). However, if we add adaptive fault localization and routing inside the network, it may not be as important to get this information back to the source.
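The timing argument can be made concrete with a small model. This is a hypothetical simulation of the scheme, not RN1 behavior: the connection blocks at some stage, and each clock the collapse signal moves one router back toward the source, dropping that router's connection as it passes.

```python
def collapse(blocked_stage):
    """Simulate the synchronous collapse bit.  The connection is blocked at
    `blocked_stage` (1 = the router nearest the source); each clock the
    collapse signal moves one router back toward the source, dropping that
    router's connection.  Returns the cycle on which the signal reaches the
    source and the order in which routers freed their connections.
    (Hypothetical model of the scheme described above.)"""
    freed = []
    cycle = 0
    for stage in range(blocked_stage, 0, -1):   # walk back toward the source
        cycle += 1
        freed.append(stage)                     # this router drops its path
    return cycle, freed

# A connection blocked at stage 3 reaches the source after 3 cycles,
# freeing routers 3, 2, 1 along the way:
assert collapse(3) == (3, [3, 2, 1])
```

Because the arrival cycle equals the blocked stage in this model, the source learns how deep into the network the block occurred without any extra status bytes, which is exactly the "arrival timing" observation made above.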

Busy Out

A busy-out control line associated with each input/output port of RN1 might also prove useful. This could be used as a mechanism for propagating status information about network loading back through the network.

This would be an input line to each back port. When active, the back port would simply treat that output like it was in use and not direct any output to it (this would be a fairly trivial modification given the way output port allocation is implemented in RN1).

This would allow concentration onto fewer outputs than the routing component actually provides, as necessary.

It would allow a busy processor to "serialize" its stream of connections from the network. (Granted, this is of questionable benefit. However, when interacting with the use of the network to do dynamic load balancing (tn2), it could be quite useful).

A busy-out line could be used to avoid faulty directions of travel provided a mechanism existed to identify such faults.

Adding a signal like this to the back port would be sufficient to implement Leighton's scheme for avoiding unnecessary blocking in the network [Lei90]. In Leighton's scheme, an output is blocked whenever some direction that it could route to would result in blocking. This scheme prevents all blocking from occurring within the network. To use the busy-out control line in this manner, the routing component the back port was connected to would have to compute when to busy out the back port. For simple algorithms such as Leighton's, this logic would also be a trivial addition.
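The busy-out computation is simple enough to state in a few lines. This is an illustrative sketch; the port naming and the data structures are hypothetical, but the rule is the one described above: busy out a back port whenever any output a connection entering there might be routed to is unavailable.

```python
def busy_out(back_port, reachable, port_free):
    """Leighton-style conservative blocking at the network edge: busy out a
    back port whenever ANY output port a connection entering there might be
    routed toward is unavailable, so no connection can block inside the
    network.  `reachable` maps a back port to the output ports it can route
    toward; `port_free` maps an output port to its availability.
    (Illustrative sketch; the port naming is hypothetical.)"""
    return any(not port_free[p] for p in reachable[back_port])

reachable = {0: [4, 5], 1: [6, 7]}
port_free = {4: True, 5: False, 6: True, 7: True}
assert busy_out(0, reachable, port_free) is True    # port 5 busy -> busy out port 0
assert busy_out(1, reachable, port_free) is False   # all of port 1's outputs free
```

Since RN1 already tracks output-port availability for allocation, computing this predicate per back port is the "trivial addition" claimed above.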

Speeding Up the Dynamic Logic in the XPOINT

[HQM]

The precharge busses in the P1, P0, S1, S0 lines are being charged all the way up to VDD. If we use an N pullup instead of a P, we could charge up to one or more threshold drops below VDD. This would speed up the allocate ripple, at the expense of some noise margin.

A Better Use for Checksum Bits?

[HQM]

Instead of a 'backward' checksum when we do a turn, have each chip send out the status of all of the ports in the STATUS and CHECKSUM bytes. This allows you to get a better sampling of network activity, and maybe a processor could use this for dynamic traffic routing decisions...

Idle Cycles

[AMD]

In the current scheme, there is no way to idle during a transmission. That is, there is no way to say, ``Hey, I won't have the next byte in this message during this clock cycle, so send this byte through but don't treat it as part of the data.'' This would be especially useful in tolerating access latency at the remote end of a connection without complicating the next higher network protocol level. The current lack of such a mechanism is proving to be a bit of an inconvenience for MBTA. It forces us to have explicit ``in-band'' signals to indicate that the response data is going to be sent, and it forces all the data for the transaction to be ready and guaranteed to be available before the transaction can be initiated.

This DATA-IDLE could be incorporated into the current setup by encoding it as a signalling word. That is, assign it an encoding with the control bit low and some particular data field like 0xAA.

When RN1 receives this DATA-IDLE word, it passes it through just like data and remains in the same routing state. The controller on the receiving end is responsible for treating this byte as if it did not occur in the data stream. That is, the controller simply acts like it took longer to receive the next byte from the network. The controller on the sending end of a network connection is then free to insert DATA-IDLE words when it cannot feed valid data into the network on a given clock cycle.
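The receiving controller's job is then just a filter over the incoming word stream. The sketch below assumes the 9-bit word layout suggested above (bit 8 is the control bit, held low for DATA-IDLE, data field 0xAA); both the layout and the 0xAA value are taken from the proposal, not from existing RN1 behavior.

```python
DATA_IDLE = 0x0AA   # control bit (bit 8) low, data field 0xAA, as proposed above

def receive(words):
    """Receiver-side handling of DATA-IDLE: the router forwards it like any
    other data word, and the receiving controller simply drops it from the
    stream, as if the next real byte just took longer to arrive.
    (Sketch of the proposed convention; the 9-bit word layout is assumed.)"""
    return [w for w in words if w != DATA_IDLE]

# The sender inserts idles on any cycle where it has no byte ready:
stream = [0x041, 0x0AA, 0x042, 0x0AA, 0x0AA, 0x043]
assert receive(stream) == [0x041, 0x042, 0x043]
```

Note that the routers themselves need no modification under this convention; only the two endpoint controllers interpret the DATA-IDLE encoding.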

Fault Identification

[TK (text AMD)]

One issue RN1 does not directly address is that of fault identification. It uses random routing to allow it to avoid faults despite the fact that it doesn't know where and when they have occurred. It provides status information so software has a chance of identifying errors as they occur. However, the router itself cannot use this knowledge to purposefully avoid faults. In more advanced versions of the RN router, we will probably want to add explicit fault detection and avoidance. With the ability to detect when faults have occurred, we can use simple models like those proposed by Tom Leighton [Lei90] to always route around faulty components.

At the cost of twice the hardware, one way of detecting faults is to have a pair of routers ``vote'' on what should be happening and declare themselves faulty if they ever disagree. To achieve this effect, we add a MONITOR configuration pin, a FAIL output pin, and eight FAIL-IN pins. The master router has MONITOR deasserted and the monitoring router has MONITOR asserted. The master behaves in a basically standard fashion. The slave treats all of its ``outputs'' as inputs and listens to the outputs provided by the master. If the slave ever disagrees with the outputs provided by the master, it asserts FAIL. The FAIL-IN signals are used to inform a router about the state of routers to which it is connected.
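The slave's behavior reduces to a per-cycle comparison. This sketch is a hypothetical model of the voting rule only; the output encoding is invented for illustration.

```python
def monitor(master_outputs, slave_outputs):
    """Slave-side comparison in the master/monitor voting scheme: the
    monitoring router drives no outputs itself, it just compares what the
    master drove against what it would have driven, asserting FAIL on any
    disagreement.  (Sketch; the list-of-port-values encoding is
    hypothetical.)"""
    return master_outputs != slave_outputs   # FAIL asserted when they differ

assert monitor([0x12, 0x34], [0x12, 0x34]) is False   # agreement: no FAIL
assert monitor([0x12, 0x34], [0x12, 0x35]) is True    # any mismatch: FAIL
```

Because the comparison is symmetric, a FAIL indicates only that the pair disagrees, not which of the two chips is at fault; the pair is retired as a unit.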

This scheme works particularly well with out pad grid array three-dimensional stack packaging scheme. The stack can be composed of alternating master and monitor planes. With the monitor right above the master routing component, their pins are properly aligned for this style of monitoring.

Long Wires

Building large networks using either a bidelta configuration or a fat-tree configuration will necessitate long wires [DeH90a]. The current RN1 design can only cope with longer wires by increasing the clock cycle to allow signals to propagate across the long wires. However, we can do better than this uniform degradation in speed by pipelining bits on the longer wires. To achieve this pipelining, some changes will need to be made to RN1. The following suggestion for handling this pipelining is from [DeH90a].

This pipelining scheme requires different behavior from RN1 when a connection is turned around. Currently, RN1 expects to get valid data in the new direction of flow two clock cycles following the reception of a byte indicating it should turn the connection around. The addition of a single clock cycle's worth of wire delay causes this turn-around time to be increased by two clock cycles; that is, one additional clock cycle is required for the turn byte to propagate across the interconnect to the next routing component, and one additional clock cycle is required for the return data to propagate back across the connection. With no extra clock cycles of wire delay between RN1 routing components, this turn will occur in two clock cycles. With d extra clock cycles of wire delay between a pair of successive routing stages, the turn will occur in 2+2d clock cycles. In order for the turn sequence to function properly, the routing component at each end of such a long wire must be capable of dealing with the extra cycles. RN1 will need to know the number of delay cycles to expect and be able to deal with them accordingly.
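The turn-around timing follows directly from the reasoning above: each pipelined clock cycle of wire delay costs one cycle in each direction. A minimal statement of the relationship:

```python
def turn_cycles(wire_delay):
    """Clock cycles to complete a turn across a link with `wire_delay`
    extra pipeline cycles: the base two-cycle turn, plus one extra cycle
    for the turn byte to reach the far router and one for the returning
    data, per cycle of wire delay (as argued in the text above)."""
    return 2 + 2 * wire_delay

assert turn_cycles(0) == 2   # directly connected routers: two cycles, as today
assert turn_cycles(1) == 4   # one pipelined wire cycle adds two turn cycles
assert turn_cycles(3) == 8
```

The state machines at each end of the link must therefore count out `wire_delay` extra cycles in each direction before expecting valid data in the new flow direction.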

The delay size will need to be configurable for every input and output port on the routing component. With 8 input ports and 8 output ports, this makes for a total of 16 ports that need to be configured on a single RN1 component. It is necessary to be able to configure each input and output port separately to allow input and output ports from the same component to connect to interconnections of differing delay lengths. Additionally, to allow for the large range of delay values that must be accommodated in reasonably large networks, the delay length configuration will require multiple bits to specify the delay of a single input or output port. While this configuration information could be provided to the chip by configuration pins, it should be obvious that this would require the addition of quite a large number of signal pins. RN1 already has tight constraints on the number of pins that its package will support. Thus, an alternate means must be used to configure an RN1 routing component with this information.

One such alternative is to use UV programmable cells in the routing chip to store this configuration data. This would require the addition of only a few signal pins to facilitate the programming of the UV configuration cells. Cells would be programmed by initiating a program sequence and shifting the configuration data into the UV cells while the component is exposed to UV light. [Gla85] describes a technique for constructing UV programmable cells of this nature which is applicable to the one micron CMOS process in which RN1 is fabricated. This scheme has the drawback that a component cannot simply be replaced with another one off the shelf. The replacement component will first need to be configured before it can serve as a replacement.

A simple programming board can easily be built to program the configuration into a component as necessary.

With the additional configuration information provided to RN1, it must also be updated to deal appropriately with the configured delays of various lengths. This will require updating the finite state machine logic in the routing component to deal with varying delay lengths. This will be a minor change, but it will necessitate adding additional state information to the FSM's. During the additional delay cycles, some reasonable data or pattern should be sent to the component not directly connected to the long wire so that the system is guaranteed to be well behaved. After the status and checksum bytes are sent to this component, the component on what was originally the sending side of the long wire will need to send this additional data. This additional data could simply be a repetition of the status and checksum bytes.

Forced Routing

It might be interesting for diagnosing faults in the network to be able to specify an exact path through the network. This will necessitate the ability to force RN1 to route a message through a particular one of its eight output ports, thus bypassing the random path selection. Since RN1 has 8 physical output ports, this forced routing specification will require 3 bits to specify the output port at each routing component. Also, it is necessary to distinguish this forced routing specification from the regular routing specification.

A forced routing specification could look like: 0b001010rpp. That is, the control bit is low, the high five bits of the data are 01010, and the low three bits specify the desired output port. The r specifies which of the two ports in a given direction to utilize. The pp specifies the general routing direction. Note that the routing specification will have to look like this to the router being forced to route through a specific port. This means the bits will need to be pre-rotated outside of the network to offset the effects of the bit rotations in the network wiring.

This forced routing specification should be given to a router when it is normally expecting a routing byte (i.e., in the idle state). The router recognizes the pattern as a forced routing specification by the high 5 bits of the data and the low control bit. The router then sets up a connection to the specified output port, if the port is available. The router, however, does not pass its forced routing byte through to the next routing component. Instead, it implicitly swallows this forced routing byte and forwards the next byte in the sequence to the next router. In this manner, a path can be completely specified through the network using one forced routing byte for each stage in the network. The router deals with blocked connections in the same way it would normally.
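Constructing the pre-rotated bytes can be sketched as follows. The two-bit-per-stage rotation is taken from the turn-byte discussion earlier; the direction of rotation (left, here) and its per-stage sense are assumptions chosen for illustration, and would need to match the actual network wiring.

```python
def rotl8(b, n):
    """Rotate an 8-bit data field left by n bits."""
    n %= 8
    return ((b << n) | (b >> (8 - n))) & 0xFF

def forced_routing_byte(stage, r, pp):
    """Build the data field of a forced-routing word for the router at
    `stage` (0 = first router).  The high five data bits are 01010 and the
    low three are r,pp; the wiring rotates data two bits per stage, so the
    byte is pre-rotated to compensate.  (Sketch only: the rotation
    direction is an assumption, not documented RN1 wiring.)"""
    data = (0b01010 << 3) | ((r & 1) << 2) | (pp & 0b11)
    return rotl8(data, 2 * stage)   # undo `stage` worth of 2-bit rotations

# Stage 0 needs no pre-rotation; stage 1 is the same byte rotated by two:
b0 = forced_routing_byte(0, r=1, pp=0b10)
assert b0 == 0b01010110
assert forced_routing_byte(1, r=1, pp=0b10) == rotl8(b0, 2)
```

Diagnostic software at a network input would emit one such byte per stage, each pre-rotated by a different amount, to walk a connection through an exactly specified path.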

Other Things to Consider

There are certainly a number of larger issues that should be considered for future versions of the routing component. More sophisticated adaptive routing could be done than suggested above. The analog adaptive routing techniques of [Cho90] might be worth adapting to this network. When we go to building very large networks, it may be beneficial to be able to implement a packet switched routing scheme. We currently don't know how to retain the more important properties of RN1 in such a scheme; it would be desirable to be able to extend RN1's beneficial properties to a packet switching model.

References

BBN87
BBN Advanced Computers Inc., Mass. Inside the Butterfly Plus, October 1987.

Cho90
Frederic T. Chong. Analog Techniques for Adaptive Routing on Interconnection Networks. Transit Note 14, MIT Artificial Intelligence Laboratory, April 1990. [tn14 HTML link] [tn14 FTP link].

DeH90a
Andre DeHon. Fat-Tree Routing For Transit. AI Technical Report 1224, MIT Artificial Intelligence Laboratory, April 1990. [AITR-1224 FTP link].

DeH90b
Andre DeHon. Shared Randomness. Transit Note 4, MIT Artificial Intelligence Laboratory, May 1990. [tn4 HTML link] [tn4 FTP link].

DeH90c
Andre DeHon. Using the Network for Load Balancing on Parallel Computers. Transit Note 2, MIT Artificial Intelligence Laboratory, April 1990. [tn2 HTML link] [tn2 FTP link].

Gla85
L. A. Glasser. A UV Write-Enabled PROM. In Proceedings, 1985 Chapel Hill Conference on VLSI, May 1985.

Lei90
F. T. Leighton. The Role of Randomness in the Design of Parallel Architectures. In William J. Dally, editor, Advanced Research in VLSI. MIT Press, 1990. Proceedings of the Sixth MIT Conference.

MIT Transit Project