MBTA: Node Architecture Selection
Andre DeHon
Tom Simon
Original Issue: June 1990
Last Updated: Wed Nov 10 22:51:21 EST 1993
This note summarizes several node architectures which were originally under consideration for MBTA. This may be of interest to anyone wondering why we settled on the node architecture we did. Also, it may be useful to see how the architecture might have differed if our goals were weighted differently.
On one hand, we want a generic node architecture capable of fairly emulating many possible node architectures. On the other, we want to be able to run the RN1-based network at full speed to test performance, packaging, cooling, and reliability. Along with these goals, we would like an architecture that is both simple and small. See (tn17) and [DeH90e] for a more detailed explanation of the goals and purposes of MBTA.
As we attempt to settle on a node configuration for MBTA, it is worthwhile
to consider what the node architecture for a real machine might look like.
Figure shows one possibility for the eventual
architecture of a node in a multiprocessor computer system. Main memory
may or may not support the independent read/write ports shown. The way in
which it supports these ports (e.g., banked memory, time-sliced
access, multi-ported RAM) is abstracted away for the time being. In this
configuration, most of the time it should be possible to keep the processor
running, make network requests through net-out, and service two
incoming requests through net-in simultaneously.
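As a rough illustration only, the organization just described might be modeled by the following C structure; all of the type names are invented for this sketch and do not correspond to any actual design.

    /* Rough model of the eventual node organization; type names are
     * invented for this sketch. */
    struct processor;                 /* compute processor                  */
    struct memory;                    /* main memory; its independent ports */
                                      /* may be banked, time-sliced, or     */
                                      /* multi-ported                       */
    struct net_interface;             /* one network connection             */

    struct real_machine_node {
        struct processor     *cpu;
        struct memory        *main_mem;
        struct net_interface *net_out;    /* outgoing network requests      */
        struct net_interface *net_in[2];  /* two incoming request streams,  */
                                          /* serviced while the processor   */
                                          /* keeps running                  */
    };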
A node with a single bus and processor (Figure ) is the
simplest MBTA node configuration possible. In this case, the single
processor simulates the function of the entire node. No parallel
operations occur on the node. In a single node simulation cycle, the
processor runs a time slice of all the hardware that might be on the node.
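A minimal sketch of this time-sliced emulation follows; the device table and the run_slice hook are hypothetical, but the structure is just the single-processor scheme described above.

    /* Sketch of single-processor node emulation: each emulated device gets
     * one time slice per node simulation cycle.  Device names, state, and
     * the run_slice hook are hypothetical. */
    typedef struct {
        const char *name;                /* e.g. "compute processor", "net-in" */
        void (*run_slice)(void *state);  /* advance this device one time slice */
        void *state;
    } emulated_device;

    void node_simulation_cycle(emulated_device *devs, int ndevs)
    {
        /* No parallel operation: devices are stepped strictly in turn. */
        for (int i = 0; i < ndevs; i++)
            devs[i].run_slice(devs[i].state);
    }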
Features
Bugs
In order to stand a chance of using and servicing the bandwidth of the
network, the node must be able to support three concurrent operations
between the node and network. Figure shows one configuration
that attempts to meet this goal. The processor simulates the compute
processor and whatever other hardware might be on the node (e.g., a
memory processor). The network processor simply processes network
requests; in particular, it handles the rop operations
described in (tn19) and [DeH90b]. The network processor runs
in a tight loop, checking for rop operations arriving from the
network and dispatching to service them. The memory is
duplicated. This effectively provides the node with two read ports into
memory. Writes, however, require access to both memory busses in order to
guarantee that the memories remain consistent.
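A sketch of the network processor's service loop is shown below; the packet layout, opcode names, and handlers are hypothetical stand-ins (the actual rop set is the one described in (tn19) and [DeH90b]).

    /* Hypothetical sketch of the network processor's tight service loop. */
    typedef struct {
        int opcode;                  /* plus address, data, reply route, ... */
    } rop_packet;

    enum { ROP_READ, ROP_WRITE /* , other remote operations */ };

    int  net_in_poll(rop_packet *pkt);      /* nonzero when a rop has arrived */
    void service_read(const rop_packet *);
    void service_write(const rop_packet *); /* must update BOTH memory copies */
    void service_other(const rop_packet *);

    void network_processor_loop(void)
    {
        rop_packet pkt;
        for (;;) {
            if (!net_in_poll(&pkt))         /* tight loop on the network input */
                continue;
            switch (pkt.opcode) {           /* dispatch to service the rop */
            case ROP_READ:  service_read(&pkt);  break;
            case ROP_WRITE: service_write(&pkt); break;
            default:        service_other(&pkt); break;
            }
        }
    }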
In this scheme, it is possible to take advantage of the three simultaneous network ports in a couple of ways as shown below:
Features
Bugs
It is possible to get some of the benefits of the previous scheme with two independent banks of memory. As long as the processors store their instructions in opposite memory banks, instruction fetches will not interfere with each other. In this case write operations do not require all of the node's memory resources. Similarly, atomic operations completely within a single memory bank only require exclusive access to half of the memory system. If the data is not well distributed between the two memories, access to the over-utilized memory bank can bottleneck node operation.
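A minimal sketch of the two-bank organization, assuming the bank is selected by address range (the actual partitioning is a design choice not specified here):

    /* Sketch of the dual-bank memory: each access touches only one bank, so
     * a write no longer ties up the whole memory system, and an atomic
     * update confined to one bank needs exclusive access to only half of
     * it.  The partitioning rule and sizes are assumptions; addresses run
     * from 0 to 2*BANK_WORDS-1. */
    #include <stdint.h>

    #define BANK_WORDS 4096

    static uint32_t bank0[BANK_WORDS];  /* could hold one processor's code  */
    static uint32_t bank1[BANK_WORDS];  /* could hold the other's code      */

    static uint32_t *bank_of(uint32_t addr)
    {
        /* Assumed partitioning: low addresses in bank 0, high in bank 1, so
         * each processor can keep its instructions entirely in one bank. */
        return (addr < BANK_WORDS) ? bank0 : bank1;
    }

    uint32_t bank_read(uint32_t addr)
    {
        return bank_of(addr)[addr % BANK_WORDS];
    }

    void bank_write(uint32_t addr, uint32_t d)
    {
        bank_of(addr)[addr % BANK_WORDS] = d;   /* touches only one bank */
    }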
Features
Bugs
We can expand upon the dual bank configuration using FIFOs to buffer data
coming and going from the node (Figure ). This will allow us to
easily guarantee that transmission and reception can occur at full network
speeds. The FIFOs also give us buffers to avoid blockage due to critical
resources. Additionally, they make it possible for remote operations
to be performed without the full attention of the processor. This should
simplify the code required to deal with network operations. The FIFOs
will be costly in terms of board area and component expenses. The data and
address bussing, multiplexing, and timing will probably be quite hairy in
this scheme.
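For reference, each hardware FIFO behaves like the byte ring buffer sketched below; the depth is chosen arbitrarily, and in this option the buffers would be discrete FIFO parts, which is where the board-area and component cost come from.

    /* Software model of one network FIFO: a simple byte ring buffer.  A
     * fifo must start zeroed (head == tail == 0 means empty). */
    #include <stdint.h>

    #define FIFO_DEPTH 256

    typedef struct {
        uint8_t  buf[FIFO_DEPTH];
        unsigned head, tail;
    } fifo;

    int fifo_put(fifo *f, uint8_t b)     /* 0 if full: producer must wait  */
    {
        unsigned next = (f->head + 1) % FIFO_DEPTH;
        if (next == f->tail)
            return 0;
        f->buf[f->head] = b;
        f->head = next;
        return 1;
    }

    int fifo_get(fifo *f, uint8_t *b)    /* 0 if empty: nothing buffered   */
    {
        if (f->tail == f->head)
            return 0;
        *b = f->buf[f->tail];
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        return 1;
    }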
At the cost of additional bus and multiplexing hair, it would probably be
possible to use a single FIFO in place of each pair of FIFOs shown in
Figure . Generally, only one direction of a pair will be in use at any
given time, so a single FIFO would be sufficient if its direction can
effectively be reconfigured between operations.
Since Figure may be a bit misleading, a more detailed version of
this option is shown in Figure . The bus gates shown deal with
arbitration and multiplexing for access to a client bus/device.
Features
Bugs
Figure shows a compromise that, hopefully, satisfies many
of the primary goals of MBTA. This is a simple node configuration with a
single processor and memory. The fast memory is intended to cycle at twice
the rate data arrives from the network. The node is capable of utilizing
full network bandwidth by sending data out through net-out under
processor control while simultaneously receiving data to memory through
both net-in ports. The FIFO associated with the output path through
net-out allows net-out to assume full responsibility for
retrying network operations. The processor only fills the FIFO initially.
Once the network operation is initiated, net-out has all the data
available to retry the operation until it succeeds.
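The division of labor on the output path might look like the sketch below, reusing the fifo model sketched earlier; net_out_send() is a hypothetical helper standing in for the hardware.

    /* Sketch of the net-out handoff: the processor's only job is to fill
     * the FIFO once; after that, net-out alone retries the buffered
     * operation until the network accepts it. */
    int net_out_send(const fifo *f);  /* nonzero once the operation gets through */

    void processor_start_remote_op(fifo *f, const uint8_t *msg, unsigned len)
    {
        for (unsigned i = 0; i < len; i++)
            while (!fifo_put(f, msg[i]))   /* wait if momentarily full */
                ;
        /* processor involvement ends here */
    }

    void net_out_retry_loop(const fifo *f)
    {
        while (!net_out_send(f))      /* hardware resends from the FIFO ... */
            ;                         /* ... until the operation succeeds   */
    }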
Fair emulation is possible since there is a single stream of execution. The processor simply executes a time slice of each emulated node device during each emulation cycle.
In fast network test mode, the processor feeds write operations over the network and records appropriate statistical information. The write operations are handled directly by the net-in units, leaving the destination node's processor free to feed data into the network as well. Periodically, the processor can check the validity of the data written into memory.
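A sketch of the processor's role in this test mode follows; the helper names and check interval are invented for illustration.

    /* Sketch of fast network test mode: the processor streams remote writes
     * and statistics, while incoming writes from other nodes are deposited
     * in memory by the net-in units without its help, so it need only
     * verify them every so often. */
    #define CHECK_INTERVAL 100000UL

    void issue_remote_write(void);     /* push one test write out via net-out */
    void record_statistics(void);
    void verify_written_memory(void);  /* check data deposited by the net-ins */

    void fast_network_test(void)
    {
        for (unsigned long sent = 1; ; sent++) {
            issue_remote_write();
            record_statistics();
            if (sent % CHECK_INTERVAL == 0)
                verify_written_memory();
        }
    }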
When emulating nodes, it would be unfair to run the network at full speed.
The network should only be running at some small fraction of its real speed
to be fairly matched to the simulation. However, it is important in
testing reliability and cooling to be able to run the network at full speed
for long periods of time. To allow these tests to coincide with emulation
experiments, net-out and net-in can both send ``dummy'' data to
consume the unused network bandwidth. Rather than running the network
slow, net-out and net-in are designed to service the network at
full speed. However, they do not send simulated data at full speed. They
will send the node-generated network data only intermittently. Between
these times, they will send filler data to each other. Both endpoints
agree to treat only every n-th byte as network data (where n is a
parameter set at boot time). Between the valid data, the sending node
sends some predictable sequence of data which the receiving node can
anticipate and verify (e.g., the sending node could simply send the
contents of a byte-wide counter running at network speed). This allows the nodes to
see network bandwidth which is properly matched to the emulated node speed.
At the same time, the network can be run continuously at full speed with
semi-interesting, changing information running through it. All of the
interesting packaging, cooling, and reliability issues can thus be tested
simultaneously with simulations.
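Both ends of this scheme might look like the sketch below, assuming byte-wide network interfaces and taking the byte-wide counter as the filler pattern; n is the boot-time parameter from the text, and everything else is invented for illustration.

    /* Sketch of the bandwidth-matching scheme: only every n-th byte carries
     * simulated node data; the rest is a predictable filler (here, a
     * byte-wide counter) that the receiver anticipates and verifies. */
    #include <stdint.h>

    static unsigned n;                   /* set at boot time; n >= 1 */

    void report_network_error(void);     /* hypothetical error reporter */

    /* Sender: called once per network byte time.  real_data() supplies the
     * next simulated byte when this slot is a data slot. */
    uint8_t sender_next_byte(uint8_t (*real_data)(void))
    {
        static unsigned slot;
        static uint8_t  counter;         /* filler counter at network speed */
        uint8_t b = (slot == 0) ? real_data() : counter++;
        slot = (slot + 1) % n;
        return b;
    }

    /* Receiver: returns 1 and stores the byte when this slot is real data;
     * otherwise checks the filler against the anticipated counter value. */
    int receiver_accept_byte(uint8_t b, uint8_t *data_out)
    {
        static unsigned slot;
        static uint8_t  expected;
        int is_data = (slot == 0);

        if (is_data) {
            *data_out = b;
        } else {
            if (b != expected)
                report_network_error();  /* filler did not match */
            expected++;
        }
        slot = (slot + 1) % n;
        return is_data;
    }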
Features
Bugs
We can gain all of the advantages of the previous architecture (Section )
and achieve greater generality by doubling the node's bus bandwidth.
Here, we double the bandwidth of the bus by making the memory twice as
wide (Figure ). With memory which can exchange a 64-bit datum every two
network cycles, the node bus has the bandwidth to handle four devices
independently accessing memory at the rate data is being transferred over
the network. This allows the processor, two interfaces from the network,
and one interface to the network to be serviced at full network
bandwidth. As long as each of these components can access memory at the
full network rate, there is no need for the separable busses required by
the FIFO-intensive design (Section ). Also, FIFOs like those shown in
the preceding sections can be implemented in the flat node memory.
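The arithmetic behind the four-client claim can be checked as follows; the byte-wide (8-bit) network port is an assumption of this sketch, while the 64-bit width and one access per two network cycles come from the text above.

    /* Back-of-the-envelope check of the bus bandwidth claim. */
    #include <assert.h>

    int main(void)
    {
        const int mem_width_bits    = 64;   /* widened memory datum             */
        const int cycles_per_access = 2;    /* one memory access per two cycles */
        const int port_width_bits   = 8;    /* assumed byte-wide network port   */

        int bus_bits_per_cycle = mem_width_bits / cycles_per_access;    /* 32 */
        int full_rate_clients  = bus_bits_per_cycle / port_width_bits;  /*  4 */

        /* processor + two net-in interfaces + one net-out interface */
        assert(full_rate_clients == 4);
        return 0;
    }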
With a single processor, fair emulation is still achievable as described in the previous section. While performing fair emulation, dummy cycles can be used to allow the network to run at full speed as described before. Here, there is no need for the processor to move outgoing data in many cases. It can simply tell the network output interface where in memory to find outgoing data.
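Handing outgoing data to the network output interface by reference might look like the sketch below; the descriptor layout and register interface are invented for illustration.

    /* Sketch of telling net-out where in memory to find outgoing data,
     * rather than having the processor move the data itself. */
    #include <stdint.h>

    typedef struct {
        uint32_t addr;    /* where in node memory the outgoing data begins */
        uint32_t length;  /* how many bytes net-out should send            */
    } net_out_descriptor;

    void start_network_send(volatile net_out_descriptor *net_out_reg,
                            uint32_t addr, uint32_t length)
    {
        net_out_reg->addr   = addr;
        net_out_reg->length = length;   /* net-out fetches the data itself */
    }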
Full speed network testing is easier and can be more interesting in this case. Since each agent on the node bus can access memory at full speed, it is easier to keep all network ports busy simultaneously with interesting data while running at the full network rate.
This node architecture will also have reasonably high performance if it is used directly to execute parallel programs. That is, programs can be compiled to directly use the hardware provided by this node rather than emulate the operation of some other hardware. When this is done, this node architecture is capable of using the network at full speed and probably achieving a respectable level of performance.
Features
Bugs
Most of the node considerations are the result of numerous discussions amongst ourselves, Tom Knight, Henry Minsky, and Andy Berlin.