Transit Note #22

MBTA: Node Architecture Selection

Andre DeHon

Tom Simon

Original Issue: June 1990

Last Updated: Wed Nov 10 22:51:21 EST 1993

Purpose

This note summarizes several node architectures which were originally under consideration for MBTA. This may be of interest to anyone wondering why we settled on the node architecture we did. Also, it may be useful to see how the architecture might have differed if our goals were weighted differently.

On one hand, we want a generic node architecture capable of allowing fair emulation of many possible node architectures. On the other, we want to be able to run the RN1-based network at full speed in order to test performance, packaging, cooling, and reliability. Along with these goals, we would like a very general architecture which is both simple and small. See (tn17) and [DeH90e] for a more detailed explanation of the goals and purposes of MBTA.

Various Nodes

Long Run

As we attempt to settle on a node configuration for MBTA, it is worthwhile to consider what the node architecture for a real machine might look like. The figure shows one possibility for the eventual architecture of a node in a multiprocessor computer system. Main memory may or may not support the independent read/write ports shown; the way in which it supports these ports (e.g., banked memory, time-sliced access, multi-ported RAM) is abstracted away for the time being. In this configuration, it should be possible most of the time to keep the processor running, make network requests through net-out, and service two incoming requests through net-in simultaneously.

Minimum Hardware

A node with a single bus and a single processor (see figure) is the simplest MBTA node configuration possible. In this case, the single processor simulates the function of the entire node; no parallel operations occur on the node. In a single node-simulation cycle, the processor runs a time slice of all the hardware that might be on the node.
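For concreteness, a minimal sketch of such an emulation loop follows. All of the names in it (node_state, the *_step functions) are hypothetical, standing in for whatever hardware the emulation models.

    /* Minimal sketch of the single-processor emulation loop.
     * All names here (node_state, *_step functions) are hypothetical;
     * they stand in for whatever node hardware is being emulated. */
    typedef struct {
        long cycle;   /* current node-simulation cycle */
        /* ... emulated registers, memory image, network queues ... */
    } node_state;

    static void compute_processor_step(node_state *ns) { (void)ns; /* one time slice */ }
    static void memory_processor_step(node_state *ns)  { (void)ns; /* one time slice */ }
    static void net_in_step(node_state *ns)            { (void)ns; /* one time slice */ }
    static void net_out_step(node_state *ns)           { (void)ns; /* one time slice */ }

    void emulate_node(node_state *ns)
    {
        for (;;) {
            /* One node-simulation cycle: run a time slice of every
             * piece of hardware that might be on the node, serially. */
            compute_processor_step(ns);
            memory_processor_step(ns);
            net_in_step(ns);
            net_out_step(ns);
            ns->cycle++;
        }
    }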

Features

Bugs

Dual Read Ports

In order to stand a chance of using and servicing the full bandwidth of the network, the node must be able to support three concurrent operations between the node and the network. The figure shows one configuration that attempts to meet this goal. The processor simulates the compute processor and whatever other hardware might be on the node (e.g., a memory processor). The network processor simply processes network requests; in particular, it handles the rop operations described in (tn19) and [DeH90b]. The network processor runs in a tight loop, checking for rop operations arriving from the network and dispatching to service them accordingly. The memory is duplicated, which effectively provides the node with two read ports into memory. Writes, however, require access to both memory busses in order to guarantee that the two memories remain consistent.
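A sketch of the network processor's dispatch loop appears below. The ROP_* opcodes and the polling interface are invented for illustration only; the actual rop operations and their encodings are those defined in (tn19) and [DeH90b].

    /* Sketch of the network processor's tight dispatch loop.
     * ROP_* opcodes and the polling interface are hypothetical;
     * the real rop operations are defined in (tn19)/[DeH90b]. */
    enum rop_opcode { ROP_READ, ROP_WRITE, ROP_NOOP };

    typedef struct {
        enum rop_opcode op;
        unsigned long   addr;
        unsigned long   data;
    } rop_request;

    extern int  net_in_poll(rop_request *req);   /* nonzero if a rop arrived */
    extern void service_read(const rop_request *req);
    extern void service_write(const rop_request *req);

    void network_processor_main(void)
    {
        rop_request req;
        for (;;) {
            if (!net_in_poll(&req))
                continue;                 /* spin until a rop arrives */
            switch (req.op) {             /* dispatch to the handler  */
            case ROP_READ:  service_read(&req);  break;
            case ROP_WRITE: service_write(&req); break;
            default:        break;        /* ignore unknown rops      */
            }
        }
    }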

In this scheme, it is possible to take advantage of the three simultaneous network ports in a couple of ways.

As noted above, no other memory operations are possible during a write. Since the two processors have read ports into different memory banks, their instruction fetches do not interfere with each other. It should also be possible for one of the processors to reserve all of the node's memory resources in order to perform a sequence of atomic memory operations.
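In outline, the access discipline might look like the following sketch; the lock primitives and memory arrays are invented names. Reads go to a processor's own copy of memory, while a write must hold both memory busses and update both copies.

    /* Sketch of the duplicated-memory access discipline.
     * bus_lock/bus_unlock and the mem[] arrays are hypothetical. */
    extern void bus_lock(int bank);
    extern void bus_unlock(int bank);
    extern unsigned long mem0[], mem1[];   /* the two memory copies */

    /* Each processor reads only from its own copy: no interference. */
    unsigned long node_read(int my_bank, unsigned long addr)
    {
        return my_bank ? mem1[addr] : mem0[addr];
    }

    /* A write must update both copies to keep them consistent,
     * so it needs exclusive access to both memory busses. */
    void node_write(unsigned long addr, unsigned long value)
    {
        bus_lock(0);
        bus_lock(1);                 /* fixed order avoids deadlock */
        mem0[addr] = value;
        mem1[addr] = value;
        bus_unlock(1);
        bus_unlock(0);
    }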

Features

Bugs

Multiple Bank Memory

It is possible to get some of the benefits of the previous scheme with two independent banks of memory. As long as the processors store their instructions in opposite memory banks, instruction fetches will not interfere with each other. In this case, write operations do not require all of the node's memory resources. Similarly, atomic operations contained completely within a single memory bank require exclusive access to only half of the memory system. If the data is not well distributed between the two memories, however, access to the over-utilized memory bank can bottleneck node operation.
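A sketch of the corresponding discipline follows, assuming the simplest possible address-to-bank mapping (a single address bit selects the bank); the mapping rule, array names, and lock primitives are all assumptions made for this sketch.

    /* Sketch of independent-bank access. The bank-select rule
     * (one high address bit) and the lock primitives are assumptions. */
    #define BANK_BITS   20                     /* per-bank address space */
    #define BANK_OF(a)  (((a) >> BANK_BITS) & 1)
    #define OFFSET(a)   ((a) & ((1UL << BANK_BITS) - 1))

    extern void bus_lock(int bank);
    extern void bus_unlock(int bank);
    extern unsigned long bank0[], bank1[];

    /* A write now touches only one bank, not all of memory. */
    void node_write(unsigned long addr, unsigned long value)
    {
        int b = BANK_OF(addr);
        bus_lock(b);
        if (b) bank1[OFFSET(addr)] = value;
        else   bank0[OFFSET(addr)] = value;
        bus_unlock(b);
    }

    /* An atomic sequence confined to one bank reserves only half
     * of the memory system; the other bank remains available. */
    void atomic_increment(unsigned long addr)
    {
        int b = BANK_OF(addr);
        bus_lock(b);
        if (b) bank1[OFFSET(addr)]++;
        else   bank0[OFFSET(addr)]++;
        bus_unlock(b);
    }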

Features

Bugs

FIFO Intensive

We can expand upon the dual-bank configuration using FIFOs to buffer data entering and leaving the node (see figure). This makes it easy to guarantee that transmission and reception can occur at full network speed. The FIFOs also give us buffering to avoid blockage on critical resources. Additionally, they make it possible for remote operations to be performed without the full attention of the processor, which should simplify the code required to deal with network operations. The FIFOs will be costly in terms of board area and component expense, and the data and address bussing, multiplexing, and timing will probably be quite hairy in this scheme.

At the cost of additional bus and multiplexing hair, it would probably be possible to use a single FIFO in place of each pair of FIFOs shown in the figure. Generally, only a FIFO in one direction will be in use at any given time, so a single FIFO would suffice if its direction could effectively be reconfigured between operations.
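In software terms, such a reconfigurable FIFO might be modeled as in the sketch below; the structure and function names are invented for illustration, and the key constraint is that the direction may only be flipped while the FIFO is empty, i.e., between operations.

    /* Sketch of a single FIFO whose direction is reconfigured
     * between operations. All names are hypothetical. */
    #include <stddef.h>

    enum fifo_dir { FIFO_TO_NET, FIFO_FROM_NET };

    typedef struct {
        unsigned char buf[256];   /* ring buffer            */
        size_t head, tail;        /* consumer/producer ends */
        enum fifo_dir dir;        /* current direction      */
    } fifo;

    /* Reversing direction is only legal when the FIFO is idle,
     * i.e., between network operations. */
    int fifo_set_direction(fifo *f, enum fifo_dir d)
    {
        if (f->head != f->tail)
            return -1;            /* still draining: refuse */
        f->dir = d;
        return 0;
    }

    int fifo_put(fifo *f, unsigned char byte)
    {
        size_t next = (f->tail + 1) % sizeof f->buf;
        if (next == f->head)
            return -1;            /* full */
        f->buf[f->tail] = byte;
        f->tail = next;
        return 0;
    }

    int fifo_get(fifo *f, unsigned char *byte)
    {
        if (f->head == f->tail)
            return -1;            /* empty */
        *byte = f->buf[f->head];
        f->head = (f->head + 1) % sizeof f->buf;
        return 0;
    }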

Since the figure may be a bit misleading, a more detailed version of this option is shown in a second figure. The bus gates shown handle arbitration and multiplexing for access to a client bus/device.

Features

Bugs

Fast Network, Slow Node

The figure shows a compromise that, hopefully, satisfies many of the primary goals of MBTA. This is a simple node configuration with a single processor and memory. The fast memory is intended to cycle at twice the rate at which data arrives from the network. The node is capable of utilizing full network bandwidth by sending data out through net-out under processor control while simultaneously receiving data to memory through both net-in ports. The FIFO associated with the output path through net-out allows net-out to assume full responsibility for retrying network operations: the processor only fills the FIFO initially, and once the network operation is initiated, net-out has all the data it needs to retry the operation until it succeeds.
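The division of labor might look like the following sketch; the netout_* interface functions are invented names standing in for the net-out hardware.

    /* Sketch of the fill-once/retry-forever division of labor.
     * The netout_* interface is hypothetical. */
    extern void netout_fifo_write(const unsigned char *buf, int len);
    extern void netout_start(void);       /* launch the operation      */
    extern int  netout_done(void);        /* nonzero once it succeeded */

    /* The processor's only job: load the complete operation into the
     * net-out FIFO and kick it off.  net-out then owns retrying; it
     * replays the FIFO contents until the operation gets through. */
    void send_network_op(const unsigned char *packet, int len)
    {
        netout_fifo_write(packet, len);   /* fill the FIFO once */
        netout_start();
        /* The processor is now free; it may poll netout_done()
         * later, or simply go on with the emulation. */
    }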

Fair emulation is possible since there is a single stream of execution. The processor simply executes a time slice of each emulated node device during each emulation cycle.

In fast network test mode, the processor feeds write operations over the network and records appropriate statistical information. The write operations are handled directly by the net-in units, leaving the destination node's processor free to feed data into the network as well. Periodically, the processor can check the validity of the data written into memory.

When emulating nodes, it would be unfair to run the network at full speed; the network should only run at some small fraction of its real speed to be fairly matched to the simulation. However, it is important in testing reliability and cooling to be able to run the network at full speed for long periods of time. To allow these tests to coincide with emulation experiments, net-out and net-in can both send ``dummy'' data to consume the unused network bandwidth. Rather than running the network slowly, net-out and net-in are designed to service the network at full speed; however, they do not send simulated data at full speed. They send node-generated network data only intermittently and, between these times, send filler data to each other. Both endpoints agree to treat only every n-th byte as network data (where n is a parameter set at boot time). Between the valid data bytes, the sending node sends a sequence of predictable data which the receiving node can anticipate and verify (e.g., the sending node could simply send the contents of a byte-wide counter running at network speed). This allows the nodes to see network bandwidth which is properly matched to the emulated node speed. At the same time, the network can be run continuously at full speed with semi-interesting, changing information flowing through it. All of the interesting packaging, cooling, and reliability issues can thus be tested simultaneously with simulations.
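A sketch of the agreed-upon byte stream follows. Here n is the boot-time parameter from the text, and the byte-wide counter filler matches the example given above; the net_send_byte/net_recv_byte interfaces are invented names for the full-speed network port.

    /* Sketch of the dummy-data scheme: only every n-th byte carries
     * simulated network data; the rest is a predictable filler that
     * the receiver verifies.  net_send_byte/net_recv_byte are
     * hypothetical interfaces to the full-speed network port. */
    extern void          net_send_byte(unsigned char b);
    extern unsigned char net_recv_byte(void);
    extern int n;                     /* set at boot time */

    void send_stream(const unsigned char *data, long len)
    {
        unsigned char filler = 0;     /* byte-wide counter filler */
        for (long i = 0; i < len; i++) {
            net_send_byte(data[i]);               /* the valid byte   */
            for (int j = 1; j < n; j++)
                net_send_byte(filler++);          /* n-1 filler bytes */
        }
    }

    /* Returns the number of filler bytes that failed verification. */
    long recv_stream(unsigned char *data, long len)
    {
        unsigned char expect = 0;
        long errors = 0;
        for (long i = 0; i < len; i++) {
            data[i] = net_recv_byte();            /* the valid byte   */
            for (int j = 1; j < n; j++)
                if (net_recv_byte() != expect++)
                    errors++;                     /* network glitch?  */
        }
        return errors;
    }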

Features

Bugs

High Bandwidth Node Bus

We can gain all of the advantages of the previous architecture and achieve greater generality by doubling the node's bus bandwidth. Here, we double the bandwidth of the bus by making the memory twice as wide (see figure). With memory which can exchange a 64-bit datum every two network cycles, the node bus has the bandwidth to handle four devices independently accessing memory at the rate data is transferred over the network. This allows the processor, the two interfaces from the network, and the one interface to the network all to be serviced at full network bandwidth. As long as each of these components can access memory at the full network rate, there is no need for the separable busses required by the FIFO-intensive design. Also, FIFOs like those shown in the previous sections can be implemented in the flat node memory.

With a single processor, fair emulation is still achievable as described in the previous section. While performing fair emulation, dummy cycles can be used to allow the network to run at full speed, as described before. Here, in many cases there is no need for the processor to move outgoing data at all; it can simply tell the network output interface where in memory to find the outgoing data.
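The handoff might look like the sketch below, where the processor passes net-out only a descriptor for data already sitting in flat node memory; the descriptor layout and interface names are assumptions made for this sketch.

    /* Sketch of handing outgoing data to net-out by reference:
     * the processor composes the packet in flat node memory and
     * passes only its location.  Names and layout are hypothetical. */
    typedef struct {
        unsigned long addr;   /* where in node memory the data sits */
        unsigned long len;    /* how many bytes to transmit         */
    } out_descriptor;

    extern void netout_enqueue(const out_descriptor *d);

    void send_by_reference(unsigned long packet_addr, unsigned long packet_len)
    {
        out_descriptor d = { packet_addr, packet_len };
        /* net-out reads the data directly from memory at full
         * network rate; the processor never copies the bytes. */
        netout_enqueue(&d);
    }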

Full speed network testing is easier and can be more interesting in this case. Since each agent on the node bus can access memory at full speed, it is easier to keep all network ports busy simultaneously with interesting data while running at the full network rate.

This node architecture should also offer reasonably high performance if it is used directly to execute parallel programs. That is, programs can be compiled to use the hardware provided by this node directly, rather than to emulate the operation of some other hardware. When this is done, the node is capable of using the network at full speed and will probably achieve a respectable level of performance.

Features

Bugs

Acknowledgments

Most of the node considerations are the result of numerous discussions amongst ourselves, Tom Knight, Henry Minsky, and Andy Berlin.

References

DeH90a
Andre DeHon. Global Perspective. Transit Note 5, MIT Artificial Intelligence Laboratory, May 1990.

DeH90b
Andre DeHon. MBTA: Message Formats. Transit Note 21, MIT Artificial Intelligence Laboratory, June 1990.

DeH90c
Andre DeHon. MBTA: Modular Bootstrapping Transit Architecture. Transit Note 17, MIT Artificial Intelligence Laboratory, April 1990.

DeH90d
Andre DeHon. MBTA: Network Level Transactions. Transit Note 19, MIT Artificial Intelligence Laboratory, June 1990.

DeH90e
Andre DeHon. MBTA: Thoughts on Construction. Transit Note 18, MIT Artificial Intelligence Laboratory, June 1990.

DS90
Andre DeHon and Thomas Simon. MBTA: Node Architecture. Transit Note 25, MIT Artificial Intelligence Laboratory, July 1990.
