MBTA: Thoughts on Construction
Andre DeHon
Original Issue: June 1990
Last Updated: Tue Nov 9 12:52:23 EST 1993
This document should detail the construction of MBTA. It is intended to be an evolutionary document. Feel free to add ideas/concerns as they arise.
See [DeH90a] for a discussion of what MBTA is intended to be.
In designing MBTA, we should be keeping a number of goals in mind.
One of the biggest decisions we must make is which stock processor to use for the compute and I/O processors. Since the settled node architecture has only a single node processor [DS90a], this decision narrows to choosing that one processor.
This is a short list of things we were looking for in a processor:
The following is a survey of the various processors available, along with their bugs and features.
After much deliberation, we have decided to go with the C series 80960 components. We can start with the 80960CA since it is available now and replace that with the 80960CB when it becomes available. While floating point is essential for the final machine, it is not necessary for early development.
At the moment only 16MHz, 25MHz, 33MHz, and 40MHz versions of the 80960CA are available.
After much deliberation on width, organization, structure, and the like, we decided that the simplest thing to do would be to provide flat, fast, static RAM 64 bits wide. We can always emulate any other memory configuration (including hierarchical, segmented, or tagged memory) on top of this one.
The current belief is that we need to support at least 256K words per node (1 Mbyte/node). We will provide a 4M-word (16 Mbyte) address space on each node so we can expand the memory size if necessary.
The node architecture requires that we cycle the memory twice every four network cycles in full-speed testing mode [DS90a]. With the network running at 100MHz, this requires 20ns memory references, which translates into needing memory with an access time on the order of 15ns.
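The timing arithmetic above can be checked directly. A minimal sketch (the 15ns access-time target is the quoted figure, not derived here; it simply leaves margin for address setup and data hold within the 20ns cycle):

```python
# Back-of-the-envelope check of the memory timing argument.
# Assumption: "cycle the memory twice every four network cycles" means
# two complete memory references must fit in four network clock periods.

network_mhz = 100
network_cycle_ns = 1000 / network_mhz          # 10 ns per network cycle

window_ns = 4 * network_cycle_ns               # 40 ns window
refs_per_window = 2
memory_cycle_ns = window_ns / refs_per_window  # 20 ns per memory reference

print(network_cycle_ns, memory_cycle_ns)  # 10.0 20.0
```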
We want to use the most monolithic parts we can find. This will probably be the most economical option on a per-bit basis, and it will also save us board space. Right now, it looks like a fast 128K x 8 SRAM is the optimal memory component for speed and density.
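As a rough sizing sketch, the 128K x 8 parts fit the 64-bit data path neatly: eight chips side by side form one 1-Mbyte bank, so the minimum configuration is a single bank and the full 16-Mbyte address space is 16 banks. (This assumes 32-bit words, as implied by the 4M-word/16-Mbyte figure above, and ignores parity or ECC bits.)

```python
# Sketch: chip counts for a node built from 128K x 8 SRAMs.
# Assumptions: 64-bit-wide data path, 1 Mbyte minimum per node,
# 16 Mbyte maximum address space, no parity/ECC.

chip_depth = 128 * 1024                              # locations per chip
chip_width_bits = 8
bus_width_bits = 64

chips_per_bank = bus_width_bits // chip_width_bits   # 8 chips side by side
bank_bytes = chip_depth * bus_width_bits // 8        # 1 Mbyte per bank

min_banks = (1 * 1024 * 1024) // bank_bytes          # 1 bank
max_banks = (16 * 1024 * 1024) // bank_bytes         # 16 banks

print(chips_per_bank, min_banks * chips_per_bank, max_banks * chips_per_bank)
# 8 8 128
```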
Note: eventually add information on parts available survey (and pricing) -- this isn't critical, just yet.
For prototyping and building initial MBTA machines, it seems optimal to start with some form of Field Programmable Gate Arrays (FPGA's). We should be able to start with these programmable gate arrays and debug the logic and functionality. They will certainly not allow us the speed we need to match the network's targeted operational speed. With the logic and functionality well understood and debugged, we can then fabricate the logic in gate-array or standard-cell form.
Ideally, we'd like to describe the logic in one CAD tool. From this description, we generate the FPGA's. Once the programmable devices are debugged, we then change to the target technology and ``recompile'' the logic into standard-cell format.
The node architecture is cleanly divided into a few basic units as shown in [DS90a]. The logic units which must be generated are the network interface [DeH90c] and the external bus interface logic [DS90b]. Each of these can be implemented as its own component. Optimally, we would like to find programmable arrays which allow each of these components to fit in a single programmable package. However, it does not look like any of the current FPGA's have sufficient density. Hopefully, we can find an easy way to divide the logic in each of these components among several FPGA's.
Once the logic is debugged, we can move the designs to standard-cell silicon. Ideally, we would like to integrate into as large components as possible. We will probably do the standard cells in the HP26 (1.0 micron) process [HP90]. We can integrate 1V pads into the network interface component(s), allowing the integrated network interfaces to connect directly to the final version of RN1/RN2 in the network.
The 4000 series parts are supposed to be more routable, faster, denser, etc. They also have some level of JTAG boundary-scan support [Com90].
8000-gate parts are planned for their second-generation components (1232 ``logic modules'' -- where the ACT 2 logic modules may include both logic and a flip-flop in a single module). These parts will support up to 140 usable I/O pins.
Plessey hopes to expand this to 20,000, 40,000, and later 100,000 gates. Plessey has a simple path for moving from these ERA's to Plessey's gate-arrays.
TK's bits on Plessey: The basic cell is either a latch or a 2-input nand gate. Nothing else. There are 2500 of these cells in the part. They estimate 30% utilization after routing for the parts. So the ``10000'' gate part might be more like 800 gates if there are no latches. The wiring is 3 level metal, with some global bus, local bus structure. The local bus hits ten cells. Delay across the chip on the global bus is 5ns. There is also a peripheral bus running around the pads (10 signals) intended for clocks I think. There is a development system on the PC, using Viewlogic capture and simulation, with a programmer. The programmer downloads a two chip module which has the array and a ram to hold configuration info. The module has a large capacitor to hold config information for 24 hours unpowered. It has the identical pinout to the real part, and can be plugged in in place of the part in a system environment. They seemed receptive to the idea of letting us get in and muck with the intermediate formats and the detailed routing. The development system cost $10K and comes with 2 modules. The modules cost $500. The parts are $200 in 1-25, and $70 in 250-500.
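TK's utilization estimate can be sanity-checked by multiplying the quoted figures; his ``more like 800 gates'' is this same number, rounded up. (A sketch only: it assumes one usable cell corresponds to roughly one 2-input gate when no latches are used.)

```python
# Rough check of TK's Plessey ERA utilization estimate.
# Assumption: one usable cell ~ one 2-input gate if no latches are used.
total_cells = 2500
routing_utilization = 0.30            # TK's 30% post-routing estimate

usable_cells = total_cells * routing_utilization
print(usable_cells)  # 750.0  -- i.e., "more like 800 gates", not 10000
```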
We settled on using Actel FPGA's and are now set up to program the ACT 1 series components. In theory, they are in the process of shipping us a programming head to allow programming of the ACT-1280, the large ACT 2 component.
The T-Station host interface is described in [DeH90d]. We have settled on using GPIB [Com78] as the initial transport layer for implementing the T-Station interface. This decision was made as the result of a number of goals:
It may be desirable to eventually provide other T-Station implementations. As mentioned in [DeH90b] an ethernet T-Station might be interesting and would free T-Station from needing a dedicated host machine.
How do we integrate conventional packaging into our stacks?
Tony Salas is looking at this problem in his 3-D DRAM project. Hopefully, the insight and experience gained in packaging conventional DRAM parts in a three-dimensional structure will be applicable to our more general problems.
Options for consideration include:
See [DeH91].
This is largely outdated -- see [DeH91].
Current projections suggest that the 64-processor ( routing components) routing boards will be about square. At the ends of the network, this leaves us with about a square area centered around each routing component where four nodes need to connect. If we place half the nodes above the network and half below it, then two nodes need to connect to this square area on each side of the network. For this arrangement, one way to package the nodes would be in small vertical stacks which are roughly square and 3 or 4 layers tall. However, a rough cut at the design for a multiple-layer square node such as this exposed a few problems with the strategy. The amount of vertical interconnect required for a node spread over multiple layers is quite large. Also, breaking up the space into these small pieces severely limits our ability to use the available space effectively.
Alternatively, we are currently considering square nodes. Assuming we can build a node this small, the node spans 4 routers. This way, 16 node boards are needed for each cluster of 4 endpoint routers. If we split the nodes between the two sides of the network, 8 nodes reside on each side. This strategy allows us to exploit planar interconnect within a node, where things are heavily interconnected. It requires only minimal vertical interconnect, since each node requires only the 40 signal pins needed to connect to the network.
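The node-count bookkeeping here works out from numbers already quoted: four nodes attach at each endpoint router, a node board spans a cluster of four routers, and the nodes are split between the two sides of the network.

```python
# Node-count bookkeeping for the square-node arrangement.
# All figures come from the text above; nothing new is assumed.
nodes_per_router = 4        # four nodes connect at each endpoint router
routers_per_cluster = 4     # a node board spans a cluster of 4 routers

nodes_per_cluster = nodes_per_router * routers_per_cluster  # 16 node boards
nodes_per_side = nodes_per_cluster // 2                     # split above/below

print(nodes_per_cluster, nodes_per_side)  # 16 8
```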
The exact details of what we will need to package into each node will depend on the level of integration we can get away with. The figure shows a possible single-layer node. Here, we use the through vias provided by the Transit-DSPGA372 package to effect vertical interconnect. Together, four such packages provide just enough through bandwidth to satisfy the requirements of eight nodes. This allows us to use basically the same packaging scheme for the nodes as for the network. To do this effectively, we have to get all of our random logic into thin (less than 110 mil thick) packages. A quick check suggests this is doable using leadless or gull-wing surface-mount components. This configuration is tight and may be tough to route.
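The ``just enough through bandwidth'' claim can be tallied roughly: if each of the eight nodes on a side needs its 40 network signals carried through the stack, four DSPGA372 packages must each carry 80 through signals. (A sketch only: it counts just the 40 network signal pins per node and ignores power, ground, and clock distribution.)

```python
# Rough count of through-via demand in the single-layer node arrangement.
# Assumption: only the 40 network signal pins per node must pass through
# the stack; power, ground, and clocks are not counted here.
nodes = 8
signals_per_node = 40
packages = 4

total_through_signals = nodes * signals_per_node         # 320 signals
signals_per_package = total_through_signals // packages  # 80 per package

print(total_through_signals, signals_per_package)  # 320 80
```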
If we can package the 80960 in a Transit-DSPGA372 package and integrate four network interfaces into a single VLSI component which is also packaged in a Transit-DSPGA372 package, we can further cut down the component area requirement and make a single layer square node more feasible.
In either case, we will need two different flavors of node boards. Since the board is square, we can use the four possible rotations of the board to tap off four different sets of network ports for four nodes. However, we need to tap off eight different sets for each side of the stack. We can make two slightly different boards which tap off the node at different places on each rotation.