Transit Note #114

A First Generation DPGA Implementation

Edward Tau, Derrick Chen, Ian Eslick, Jeremy Brown, and Andre DeHon

Original Issue: January, 1995

Last Updated: Mon Dec 11 14:16:51 EST 1995

Abstract:

Dynamically Programmable Gate Arrays (DPGAs) represent a hybrid architecture lying between traditional FPGAs and SIMD arrays. Notably, these arrays can efficiently support computations where the function of the array elements needs to vary both among array elements during any single cycle and within any single array element over time. We describe our minimal, first generation DPGA. This DPGA uses traditional 4-LUTs for the basic array element, but backs LUT and interconnect programming cells with a 4-context memory implemented using dynamic RAM. Additionally, this DPGA supports non-intrusive background loads of non-active contexts and automatic refresh for the dynamic memory cells. We draw several lessons from this design experience which may be relevant to future DPGA and FPGA designs.

Introduction

Traditional Field-Programmable Gate Arrays (FPGAs) have a set of programmable elements which can be configured to personalize the FPGA to implement a user-defined function. Reprogrammable FPGAs implement these programmable elements as memory cells, commonly static RAMs. These cells are configured during a slow programming phase which typically takes tens of milliseconds due to limited bandwidth to off-chip memories [Xil93] [Atm94]. Once configured, the array personality is static until reloaded. Each array element, consequently, performs the same function throughout the epoch of operation for the device. Devices such as the AT6000 from Atmel can support partial reloads during operation, but the reload still takes a millisecond or more due to bandwidth limitations, and cells cannot be used during the reload operation.

Dynamically Programmable Gate Arrays (DPGAs) [BDK93] [DeH94] differ from traditional FPGAs by providing on-chip memory for multiple array personalities. The configuration memory resources are replicated to contain several configurations for the fixed computing and interconnect resources. In effect, the DPGA contains an on-chip cache of array configurations and exploits high, local on-chip bandwidth to allow reconfiguration to occur rapidly, on the order of nanoseconds instead of milliseconds. Loading a new configuration from off-chip is still limited by low off-chip bandwidth. However, the multiple contexts on the DPGA allow the array to operate on one context while other contexts are being reloaded.

In this paper, we describe our minimal DPGA implementation. This design represents our first generation effort and contains considerable room for optimization. Nonetheless, the design demonstrates the viability of DPGAs, underscores the costs and benefits of DPGAs as compared to traditional FPGAs, and highlights many of the important issues in the design of programmable arrays.

Our DPGA features a conventional 4-input lookup table (4-LUT) as the basic array element, a 4-context configuration memory implemented with dynamic RAM behind every LUT and interconnect programming bit, non-intrusive background loading of non-active contexts during operation, and automatic refresh of the dynamic memory cells.

We begin by detailing our basic DPGA architecture. We then provide highlights from our implementation, including key details on our prototype DPGA IC, and conclude by summarizing the major lessons from this effort.

Architecture

Figure depicts the basic architecture for this DPGA. Each array element is a conventional 4-input lookup table (4-LUT). Small collections of array elements, in this case 4×4 arrays, are grouped together into subarrays. These subarrays are then tiled to compose the entire array. Crossbars between subarrays serve to route inter-subarray connections. A single, 2-bit, global context identifier is distributed throughout the array to select the configuration for use. Additionally, programming lines are distributed to read and write configuration memories.

DRAM Memory

The basic memory primitive is a 4×32-bit DRAM array which provides four context configurations for both the LUT and interconnection network (see Figure ). The memory cell is a standard three-transistor DRAM cell. Notably, the context memory cells are built entirely out of NMOS devices, allowing the memory array to be packed densely by avoiding the large cost of N-well to P-well separation. The active context data is read onto a row of standard, complementary CMOS inverters which drive the LUT programming and selection logic.
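
As a point of reference, the following is a minimal behavioral sketch (not the actual circuitry) of such a 4-context, 32-bit configuration memory; the class and method names are illustrative only, and the 2-bit context identifier described above plays the role of the context_id argument.

    # Minimal behavioral sketch of a 4-context x 32-bit configuration memory.
    # Names (ContextMemory, read_active, write_background) are illustrative,
    # not taken from the prototype.

    class ContextMemory:
        """Four 32-bit configuration words; one is active at a time."""

        def __init__(self):
            self.contexts = [0] * 4  # one 32-bit word per context

        def read_active(self, context_id):
            """Configuration bits presented to the LUT/interconnect logic."""
            return self.contexts[context_id & 0x3]

        def write_background(self, context_id, active_id, word):
            """Background load: only non-active contexts may be written during operation."""
            if (context_id & 0x3) == (active_id & 0x3):
                raise ValueError("cannot write the active context while it is in use")
            self.contexts[context_id & 0x3] = word & 0xFFFFFFFF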

Array Element

The array element is a 4-LUT which includes an optional flip-flop on its output. Each array element contains a context memory array. For our prototype, this is the 4×32-bit memory described above. 16 bits provide the LUT programming, 12 configure the four 8-input multiplexors which select each input to the 4-LUT, and one selects the optional flip-flop. The remaining three memory bits are presently unused.
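
The sketch below illustrates how a 32-bit configuration word of this form could drive one array element; the particular bit ordering (LUT truth table in the low bits, then four 3-bit selector fields, then the flip-flop bit) is an assumption made only for illustration.

    # Behavioral sketch of one array element. Assumed layout of the 32-bit word:
    # bits 0-15  : 16-bit LUT truth table
    # bits 16-27 : four 3-bit input-selector fields (one per LUT input)
    # bit 28     : select the optional output flip-flop
    # bits 29-31 : unused

    class ArrayElement:
        def __init__(self):
            self.ff = 0  # optional output flip-flop

        def evaluate(self, config, candidates, clock_edge=False):
            """candidates[i] holds the 8 signals visible to LUT input i's 8:1 selector."""
            lut_bits  = config & 0xFFFF
            selectors = [(config >> (16 + 3 * i)) & 0x7 for i in range(4)]
            use_ff    = (config >> 28) & 0x1

            inputs = [candidates[i][sel] & 0x1 for i, sel in enumerate(selectors)]
            index  = inputs[0] | (inputs[1] << 1) | (inputs[2] << 2) | (inputs[3] << 3)
            combinational = (lut_bits >> index) & 0x1

            if clock_edge:
                self.ff = combinational  # register the result when clocked
            return self.ff if use_ff else combinational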

Subarrays

The subarray organizes the lowest level of the interconnect hierarchy. Each array element output is run vertically and horizontally across the entire span of the subarray. Each array element can, in turn, select as an input the output of any array element in its subarray which shares the same row or column. This topology allows a reasonably high degree of local connectivity.

This leaf topology is limited to moderately small subarrays since it ultimately does not scale. The row and column widths remain fixed regardless of array size, so the horizontal and vertical interconnect would eventually saturate the row and column channel capacity if the topology were scaled up. Additionally, the delay on the local interconnect increases with each additional element in a row or column. For small subarrays, there is adequate channel capacity to route all outputs across a row and column without increasing array element size, so the topology is feasible and desirable. Further, the additional delay for the few elements in the row or column of a small subarray is moderately small compared to the fixed delays in the array element and routing network. In general, the subarray size should be chosen carefully with these properties in mind.

Local Interconnect

In addition to the local outputs which run across each row and column, a number of non-local lines are also allocated to each row and column. The non-local lines are driven by the global interconnect. Each LUT can then pick inputs from among the lines which cross its array element. In the prototype, each row and column supports four non-local lines. Each array element can thus pick its inputs from eight non-local lines, six row and column neighbor outputs, and its own output. Each input is configured with an 8:1 selector as noted above.
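
Concretely, the set of signals that crosses one array element can be enumerated as sketched below; how these fifteen candidates are divided among the four 8:1 selectors is not spelled out here, so that mapping is left abstract. Names and indices are illustrative.

    # Signals crossing the array element at row r, column c of a 4x4 subarray:
    # 4 non-local row lines + 4 non-local column lines + 3 row neighbors
    # + 3 column neighbors + the element's own output = 15 candidates.

    def crossing_signals(r, c, row_nonlocal, col_nonlocal, outputs):
        """outputs[i][j] is the current output of array element (i, j)."""
        signals  = list(row_nonlocal[r])                           # 4 row non-local lines
        signals += list(col_nonlocal[c])                           # 4 column non-local lines
        signals += [outputs[r][j] for j in range(4) if j != c]     # 3 same-row neighbors
        signals += [outputs[i][c] for i in range(4) if i != r]     # 3 same-column neighbors
        signals.append(outputs[r][c])                              # the element's own output
        return signals                                             # 15 candidate signals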

Local Decode

Row select lines for the context memories are decoded and buffered locally from the 2-bit context identifier. A single decoder services each row of array elements in a subarray. One decoder also services the crossbar memories for four of the adjacent crossbars. In our prototype, this placed five decoders in each subarray, each servicing four array element or crossbar memory blocks for a total of 128 memory columns. Each local decoder also contains circuitry to refresh the DRAM memory on contexts which are not being actively read or written.
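
One plausible way to picture the refresh behavior is a decoder stepping through the context rows that are not currently being read or written and rewriting them in turn. The simple round-robin policy below is an assumption; the text above only states that non-active contexts are refreshed automatically.

    # Hypothetical round-robin refresh schedule over the non-active context rows.

    def next_refresh_row(active_context, last_refreshed, num_contexts=4):
        """Pick the next context row to refresh, skipping the active context."""
        row = (last_refreshed + 1) % num_contexts
        if row == active_context:
            row = (row + 1) % num_contexts
        return row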

Global Interconnect

Between adjacent subarrays, a pair of crossbars routes the outputs of one subarray into the non-local inputs of its neighbor. Note that all array element outputs are available on all four sides of the subarray. In our prototype, this means that each crossbar is a 16×8 crossbar which routes 8 of the 16 outputs to the neighboring subarray's 8 inputs on that side. Each 16×8 crossbar is backed by a 4×32-bit DRAM array to provide the four context configurations. Each crossbar output is configured by decoding 4 configuration bits to select among the 16 crossbar input signals.
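
A behavioral sketch of one such crossbar follows; the eight 4-bit select fields are assumed to be packed consecutively in the active 32-bit context word (8 outputs × 4 bits = 32 bits), though the actual field ordering is not given here.

    # Behavioral sketch of a 16x8 inter-subarray crossbar: each of the 8 outputs
    # is steered by a 4-bit field of the active 32-bit context word.

    def crossbar_route(config_word, inputs16):
        """inputs16: the 16 array-element outputs presented on one subarray side."""
        assert len(inputs16) == 16
        outputs = []
        for out in range(8):
            sel = (config_word >> (4 * out)) & 0xF   # 4-bit select for this output
            outputs.append(inputs16[sel])
        return outputs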

While the nearest-neighbor interconnect is sufficient for the 3×3 array of subarrays in the prototype, a larger array should include a richer interconnection scheme among subarrays. At present, we anticipate that a mesh-with-bypass structure with hierarchically distributed interconnect lines will be appropriate for larger arrays.

Programming

The programming port makes the entire array look like one large, 32-bit wide, synchronous memory. The programming interface was designed to support high-bandwidth data transfer from an attached processor and is suitable for applications where the array is integrated on the processor die. Any non-active context may be written during operation. Read back is provided in the prototype primarily for verification.
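
The sketch below illustrates this programming model; the flat (block, context) addressing and the class name are assumptions made for illustration, since the actual address map is not described here.

    # Sketch of the programming port: the array appears as one large, 32-bit-wide
    # synchronous memory, and only non-active contexts may be written during operation.

    class ProgrammingPort:
        def __init__(self, num_blocks, num_contexts=4):
            # one 32-bit configuration word per (memory block, context) pair
            self.words = [[0] * num_contexts for _ in range(num_blocks)]

        def write(self, block, context, active_context, word):
            """One synchronous 32-bit write into a non-active context."""
            if context == active_context:
                raise ValueError("only non-active contexts may be written during operation")
            self.words[block][context] = word & 0xFFFFFFFF

        def read(self, block, context):
            """Read-back of a configuration word, provided primarily for verification."""
            return self.words[block][context]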

Implementation

The DPGA prototype is targeted for a 1μm drawn, 0.85μm effective CMOS process with 3 metal layers and silicided polysilicon and diffusion. The basic composition and area for the prototype is summarized in Table . From Table , we see that 40% of the area used on the chip goes into supporting the programmable configuration memories. Table also shows that 80% of the area used supports the configurable network.

Layout Inefficiencies

The prototype could be packed more tightly since it has large blank areas and large areas dedicated to wire routing. A more careful co-design of the interconnect and subarray resources would eliminate much or all of the unused space between functional elements. Most of the dedicated wiring channels are associated with the local interconnect within a subarray. With careful planning, it should be possible to route all of these wires over the subarray cells in metal 2 and 3. As a result, a careful design might be 40-50% smaller than our first generation prototype.

Memory Area

From the start, we suspected that memory density would be a large determinant of array size. Table demonstrates this to be true. In order to reduce the size of the memory, we employed a 3-transistor DRAM cell design. To keep the aspect ratio on the 4×32-bit memory small, we targeted a very narrow DRAM column. Unfortunately, this emphasis on aspect ratio did not allow us to realize the most area-efficient DRAM implementation (see Table ).

One key reason for targeting a low aspect ratio was to balance the number of interconnect channels available in each array element row and column. However, with 8 interconnect signals currently crossing each side of the array element, we are far from being limited by saturated interconnect area. Instead, array element cell size is largely limited by memory area. Further, we route programming lines vertically into each array element memory. This creates an asymmetric need for interconnect channel capacity since the vertical dimension needs to support 32 signals while the horizontal dimension need only support a dozen memory select and control lines.

For future array elements we should optimize memory cell area with less concern about aspect ratio. In fact, the array element can easily be split in half with 16 bits above the fixed logic in the array element and 16 below. This rearrangement will also allow us to distribute only 16 programming lines to each array element if we load the top and bottom 16 bits separately. This revision does not sacrifice total programming bandwidth if we load the top or bottom half of a pair of adjacent array elements simultaneously.

Timing

Table summarizes the key timing estimates for the DPGA prototype at the slow-speed and nominal process points. As shown, context switches can occur on a cycle-by-cycle basis and contribute only a few nanoseconds to the operational cycle time. The minimum achievable cycle time is determined by the number of LUT delays and crossbar crossings in the critical path of a design, as sketched below.
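
A minimal form of this relation, written here with assumed symbol names rather than the original notation, is:

    % n_LUT, n_xbar : LUT evaluations and crossbar crossings in the critical path
    % t_LUT, t_xbar : the corresponding per-element delays
    % t_fixed       : fixed per-cycle overhead (clocking, context switch)
    t_{cycle} \;\ge\; n_{LUT}\, t_{LUT} \;+\; n_{xbar}\, t_{xbar} \;+\; t_{fixed}

With n_LUT = n_xbar = 1, this form is consistent with the 60-100MHz pipelined estimate discussed next.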

These estimates suggest that a heavily pipelined design which places only one level of lookup table logic and one crossbar traversal in each pipeline stage could achieve 60-100MHz operation while allowing a context switch on every cycle. Our prototype, however, does not have a suitably aggressive clocking, packaging, or I/O design to actually sustain such a high clock rate. DRAM refresh requirements force a minimum operating frequency of 5MHz.

Conclusions

We have presented the full design for a first-generation DPGA prototype. The prototype demonstrates that efficient, dynamically programmable gate arrays can be implemented which support a single cycle, array-wide context switch.

Multiple context programmable gate arrays make the most sense when moderately small amounts of additional memory area can increase the amount of logic available for use. While the configurable memory is clearly a major element of programmable array size in our 4-context prototype, the memory component does not dominate fixed logic area. With the current memory cell design, the tradeoff is worth considering, but not necessarily compelling. If the basic memory cell were much larger relative to the fixed logic, the memory area would dominate and the multiple context support would make little sense. Consequently, SRAM-based DPGAs may not be desirable in the architecture described here.

If the basic memory cell were much smaller relative to the fixed logic, the logic area would dominate making the cost of multiple contexts marginal and making it sensible to include even more on-chip contexts. DPGAs with a more aggressive dynamic memory cell look appealing. Further, DPGAs based on native DRAM cells or flash memories look very attractive in this architecture.

References

Atm94
Atmel Corporation, 2125 O'Nel Drive, San Jose, CA 95131. Configurable Logic Design and Application Book, 1994.

BCE+94
Jeremy Brown, Derrick Chen, Ian Eslick, Edward Tau, and Andre DeHon. A 1μm CMOS Dynamically Programmable Gate Array. Transit Note 112, MIT Artificial Intelligence Laboratory, November 1994.

BDK93
Michael Bolotski, Andre DeHon, and Thomas F. Knight Jr. Unifying FPGAs and SIMD Arrays. Transit Note 95, MIT Artificial Intelligence Laboratory, September 1993.

DeH94
Andre DeHon. DPGA-Coupled Microprocessors: Commodity ICs for the Early 21st Century. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, 1994.

DeH95
Andre DeHon. DPGA Utilization and Application. Transit Note 129, MIT Artificial Intelligence Laboratory, October 1995. Abridged version presented at FPGA '96.

Xil93
Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. The Programmable Logic Data Book, 1993.

Revisiting Memory Cell Area

From Table , 33% of the 82% non-blank area on the prototype was supporting context memory. Table decomposes this portion of the active die area.

As noted before, the prototype memory cells were larger than necessary due to their aspect ratio. Prototype DRAM cells occupied . In contrast, we expect a tight DRAM cell layout to occupy in this technology and a tight SRAM cell layout to occupy .

Assuming commensurate savings in proportion to memory cell area, the tight DRAM implementation would save about 7% over the current design. An SRAM implementation would be, at most, 15% larger. In practice, the SRAM implementation would probably be only 5-10% larger for a 4-context design since the refresh control circuitry would no longer be needed. Of course, as one goes to greater numbers of contexts, the relative area differences for the memory cells will provide a larger contribution to overall die size.
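
As a rough sanity check, these comparisons amount to scaling the memory portion of the die by the ratio of candidate cell area to prototype cell area; the symbols below are introduced only for this sketch.

    % f_mem         : fraction of die area devoted to context memory cells
    % A_new / A_old : ratio of the candidate cell area to the prototype cell area
    \frac{\Delta A_{die}}{A_{die}} \;\approx\; f_{mem}\left(\frac{A_{new}}{A_{old}} - 1\right)

The roughly 7% saving for a tight DRAM cell and the at-most-15% growth for an SRAM cell quoted above correspond to this kind of scaling applied with the respective cell areas.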
