Transit Note #120

Notes on Integrating Reconfigurable Logic with DRAM Arrays

Andre DeHon

Original Issue: March, 1995

Last Updated: Sat Apr 8 20:57:50 EDT 1995

This is a quick note describing the opportunities which come from integrating small amounts of reconfigurable logic into a conventional DRAM, as well as a rough architecture for such integration.

Internal DRAM Bandwidth

Conventional DRAM arrays are built out of numerous banks of DRAM (See Figure ). The memory addresses are, consequently, broken into two parts: (1) a row address which selects the address to be read within each bank and (2) a column address which selects the bank whose data will actually be read. Internally, once the row address has been read, a large number of data bits are available. For a memory with many banks, each several bits wide, the number of bits actually available at the inputs to the final selection process is the bank count times the bank width. In the most conventional architectures, all but one bank's worth of bits are then thrown away during the final column selection process. The result is a meager i/o bandwidth to and from the memory, despite a huge on-chip bandwidth.
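The two-phase read described above can be sketched as a toy model. This is only an illustration: the function name `read_word`, the list-of-lists memory layout, and the tiny 4-bank memory are assumptions made for the example, not part of any real DRAM design.

```python
def read_word(banks, row, column):
    """Model one conventional DRAM read.

    banks: list of banks; each bank is a list of words indexed by row.
    Returns the selected word and the number of bank words discarded.
    """
    # Row phase: every bank delivers the word stored at this row address.
    row_words = [bank[row] for bank in banks]
    # Column phase: the final mux keeps one bank's word and drops the rest.
    return row_words[column], len(row_words) - 1


# Hypothetical tiny memory: 4 banks, 2 rows each.
banks = [[10, 11], [20, 21], [30, 31], [40, 41]]
word, discarded = read_word(banks, row=1, column=2)
print(word, discarded)
```

Every bank does the work of a read, yet only one bank's word leaves the array.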

A 4 megabit DRAM, old by today's standards, is likely to have 512 or 1024 banks with a 4- or 8-bit data width. The DRAM throws away almost three orders of magnitude in bandwidth during the final multiplexing stage. This gap only gets larger as we move to larger capacity DRAM arrays. The DRAM cell area shrinks, allowing greater density. However, the number of i/os supported on the die's perimeter does not grow commensurately with the internal memory capacity. In fact, the i/o capacity is hardly growing at all.
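As a rough check on the "almost three orders of magnitude" figure, the arithmetic can be worked through for the bank counts and widths quoted above. The helper name `mux_loss` is made up for this sketch.

```python
import math


def mux_loss(num_banks, width_bits):
    """Bits available after a row read versus bits kept by the final mux."""
    internal_bits = num_banks * width_bits  # at the inputs to the final mux
    external_bits = width_bits              # what survives column selection
    ratio = internal_bits // external_bits  # equals the number of banks
    return internal_bits, ratio, math.log10(ratio)


for num_banks in (512, 1024):
    for width_bits in (4, 8):
        internal_bits, ratio, orders = mux_loss(num_banks, width_bits)
        print(f"{num_banks} banks x {width_bits} bits: {internal_bits} bits "
              f"internal, {ratio}x discarded, about {orders:.1f} orders")
```

The ratio depends only on the number of banks, so it grows as larger arrays add banks while the data width stays fixed.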

A number of modern twists on the original DRAM architecture have tried to improve the situation, generally by exploiting locality of memory references and increasing the i/o toggle rate.

  1. Static Column and Fast Page Mode DRAM -- These DRAMs allow the column address to be changed without repeating the row address read. This provides slightly faster access to any of the words which were read during the row address phase.
  2. SDRAM -- Synchronous DRAMs with burst modes allow a sequence of contiguous words to be rapidly pipelined out of the DRAM IC. For contiguous block accesses, this arrangement better utilizes the data read from each bank by allowing a sequence of such words to be read at once. ( e.g. [Mic93])
  3. EDRAM/CDRAM -- These DRAMs place a static RAM cache in place of the final mux selection. The cache is filled rapidly, taking advantage of the entire wide datapath from the memory banks. Subsequent accesses within the page or memory cache are serviced with SRAM-like timings. ( e.g. Ramtron EDRAM [Cor93], overview articles [Bur92] [Bur93])
  4. RAMBUS DRAM -- RAMBUS DRAMs have a very high speed i/o system and operate by sending packets of data in response to each read request [Ram93]. Like SDRAMs, they provide higher utilization of the on-chip bandwidth when the consumer needs contiguous blocks of memory.
  5. VRAM -- Video DRAMs integrate a synchronous i/o port, in addition to a traditional DRAM i/o port, to support applications, such as video display, where the entire memory is accessed sequentially. As with SDRAMs and RAMBUS DRAMs, the large amount of data read on each bank access can be rapidly pipelined off of the chip.

All of these DRAM variants take advantage of locality and exploit the fact that the external i/o can often be cycled faster than the basic memory read. When reading large, contiguous blocks, or data closely packed in memory, each internal bank read can yield many data words of interest to the application. Access bandwidth in these cases is limited entirely by the external i/o bandwidth rather than memory cycle time. This external bandwidth is still an order of magnitude or two less than the on-chip bandwidth.
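The benefit of locality can be illustrated by counting internal row reads for a stream of addresses. This is a sketch under stated assumptions: the address is taken to be `row * num_banks + column` (column in the low-order bits), and `row_reads` is a hypothetical helper, not a model of any particular part.

```python
def row_reads(addresses, num_banks, page_mode):
    """Count internal row reads needed to service a sequence of addresses."""
    reads = 0
    open_row = None
    for address in addresses:
        # Assumed address layout: high part is the row, low part the column.
        row, _column = divmod(address, num_banks)
        # Without page mode every access repeats the row read; with it,
        # accesses that stay in the currently open row reuse the data.
        if not page_mode or row != open_row:
            reads += 1
            open_row = row
    return reads


contiguous = list(range(16))
print(row_reads(contiguous, num_banks=8, page_mode=False))
print(row_reads(contiguous, num_banks=8, page_mode=True))
```

For contiguous blocks the row reads drop by roughly the number of banks, which is why the external i/o rate becomes the limit.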

Move Logic to Bandwidth

Since we cannot move the high internal bandwidth across the bandwidth limited chip i/o to service external logic, we can consider moving the required logic to the bandwidth -- by putting the logic on the die with the DRAM. This avoids the chip crossing problem and allows us to more fully exploit the internal bandwidth which our conventional DRAM architectures already contain.

The question then is "What logic do we put inside?" If we knew exactly what kind of logic we wanted, we could embed the two on the same die. Remember, however, that DRAMs are very much commodity items. Each design variant must have very widespread appeal to merit the design and production costs associated with modifying the design to contain a given piece of logic. A few applications may be able to justify their own fixed logic integration into DRAMs, but most cannot.

Alternately, we can consider placing more flexible logic on the DRAM which can be adapted to a wide variety of tasks. Here, as with the DPGA-coupled processors (tn100), we can exploit a wider range of applications to gain commodity appeal.

People have, in fact, already considered placing SIMD logic at the output of the bank memory within a DRAM array to exploit the high on-chip bandwidth while retaining some flexibility. [ESS92] introduces one such device. Cray Computer, Inc., working with the Institute for Defense Analyses, has integrated such a SIMD/memory hybrid into their Cray 3 supercomputer.

In general, we can consider using any programmable, computational array structure as flexible logic which can operate on the bits which are read, and may be written, in parallel from the memory banks. (tn95) introduced a unified computational array model which included both FPGAs and SIMD arrays, as well as DPGAs. Recalling this framework, we note that the SIMD structure allows us to perform the same operation on each of the embedded computational array elements operating on the bank memory outputs. At the same time, the SIMD array allows us to perform a different operation on the memory on each array cycle -- which may or may not be coupled to the memory cycle time. Integrated FPGA logic, on the other hand, would allow us to wire up array elements so that they performed different functions; however, those functions could not change rapidly in time. Using the DPGA hybrid would allow the integrated array to vary in function both from cycle to cycle and from cell to cell within the array.
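The distinction among the three array styles can be made concrete with a small behavioral sketch. Everything here is illustrative: the function names, the two-input one-bit cells, and the use of Python's `operator` functions as stand-ins for cell configurations are assumptions for the example, not the model of (tn95).

```python
import operator


def run_simd(op_per_cycle, inputs):
    """SIMD: one operation shared by all cells; it may change every cycle."""
    return [[op(a, b) for a, b in inputs] for op in op_per_cycle]


def run_fpga(op_per_cell, inputs, cycles):
    """FPGA: each cell has its own operation, fixed across cycles."""
    return [[op(a, b) for op, (a, b) in zip(op_per_cell, inputs)]
            for _ in range(cycles)]


def run_dpga(contexts, context_sequence, inputs):
    """DPGA: operations vary per cell, and a context select varies them per cycle."""
    return [[op(a, b) for op, (a, b) in zip(contexts[c], inputs)]
            for c in context_sequence]


inputs = [(0, 1), (1, 1)]  # one (a, b) bit pair per array cell
simd = run_simd([operator.and_, operator.xor], inputs)
fpga = run_fpga([operator.and_, operator.or_], inputs, cycles=2)
dpga = run_dpga([[operator.and_, operator.or_],
                 [operator.xor, operator.and_]], [0, 1], inputs)
print(simd, fpga, dpga)
```

Each result is a list of per-cycle rows, one entry per cell, so the three styles can be compared side by side.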

In the following section, we expand on the integration of FPGA or DPGA logic into a DRAM memory array.

Reconfigurable Logic on DRAM

In general, we can add a small block of reconfigurable logic local to each DRAM memory bank. This reconfigurable block directly consumes, and may produce, the bits which are read out of each DRAM bank. The block has associated programmable interconnect which allows it to take inputs from and feed outputs to other reconfigurable logic blocks. Since the banks are physically arrayed on the DRAM array, the reconfigurable logic blocks are also gridded.

Figure shows a DRAM bank with a reconfigurable logic block. Data from the memory bank is a local input to the reconfigurable logic. The output from the bank to the final memory selection ( e.g. the final mux in Figure ) may come directly from the bank output or from the programmable logic. Data written back into the memory may come from the programmable logic block, from new input data, or from the bank output. The programmable block is connected to other programmable logic blocks via programmable interconnect. This allows outputs from the logical elements in each programmable logic block to feed into other programmable blocks. This interconnect also allows DRAM bank output to connect to programmable logic which is not local to the DRAM bank.

There is a large amount of flexibility in the implementation of the programmable logic block. The number of array elements per block, the array element design, the amount of interconnect flexibility, and their topology can vary considerably. Figure shows a general formulation of such a logic block with a number of array elements, each with several input bits, a datapath to the DRAM bank as wide as the bank word, and bundles of lines connecting adjacent programmable blocks in the horizontal and vertical directions. If we want to be able to perform local computation and writeback from the programmable logic, we need at least as many programmable array elements as data bits. Each array element can be implemented in any of a number of ways ( e.g. a multi-input lookup table (LUT) or a piece of hardwired logic). Generally, the array elements will have optional or mandatory flip-flops latching the output values computed, especially when the logic performed on the data is pipelined.

The interconnect shown in Figure can be very general ( e.g. a full crossbar in the worst case). In practice it is likely to be much more limited. First, a full crossbar provides more flexibility than is useful if the array elements are actually LUTs (see (tn121)). Further restriction is likely to be necessary to keep interconnect size and delay down to an acceptable level. Some connections may be omitted simply because they are unnecessary. For example, Data Word Out and Data Word Writeback may be the same port since it is not likely they will be used simultaneously. Also, there is no need to interconnect the Bank Word to the Data Word Out or Data Word Writeback in the topology shown in Figure since there are direct connections which provide that effect outside of the programmable logic.

In general, the topology for interconnecting programmable logic blocks carries the same issues as interconnecting sub-arrays in a DPGA. See (tn121) for a more detailed discussion. The formulation above is a bit simplistic in that it shows a single bundle of lines in each compass direction. As detailed in (tn121), the interconnect may be hierarchical and these lines may be broken down into lines which span varying distances in the array. In these cases, while the full bundle of horizontal or vertical lines may be consumed at each programmable logic block, the full bundle will not be generated at each block -- or more properly, some of the lines will likely connect straight across each programmable logic block without being switched within the block.

The mux between the programmable logic Data Word Out and the Bank Word is most likely controlled by read logic, retaining the ability to read and write straight from memory as a conventional DRAM without using the programmable logic. A control signal into the array of memory banks can control the desired behavior. The writeback data is selected in a similar manner.

The array elements and programmable logic may have multiple context configurations ( i.e. DPGA (tn95)). In these cases, the control lines into the array will select not only whether or not to use the programmable logic, as described above, but also which context should act on a given cycle.
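The selection behavior described in the last two paragraphs can be sketched as follows. The names `selected_output`, `use_logic`, and `context`, and the two example contexts on a 4-bit word, are hypothetical and stand in for the control lines into the array.

```python
def selected_output(bank_word, use_logic, context, contexts):
    """Model the mux between the Bank Word and the programmable logic output.

    use_logic: control signal; False gives conventional DRAM behavior.
    context:   index of the active configuration (DPGA context select).
    contexts:  one function of the bank word per loaded context.
    """
    if not use_logic:
        return bank_word  # bypass: read straight from memory
    return contexts[context](bank_word)


# Two assumed contexts operating on a 4-bit bank word.
contexts = [
    lambda word: ~word & 0xF,  # context 0: bitwise invert
    lambda word: word & 0x3,   # context 1: keep the low two bits
]
plain = selected_output(0b1010, False, 0, contexts)
inverted = selected_output(0b1010, True, 0, contexts)
masked = selected_output(0b1010, True, 1, contexts)
print(plain, inverted, masked)
```

With the logic disabled the part behaves as an ordinary DRAM, which keeps it usable in systems that ignore the added control lines.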

One promising use of DPGA logic will be to perform multiple micro-cycle operations on each array memory operation. Conventional DRAMs have 50-60 ns cycle times, roughly half in row read and half in column select and readout. Well-designed DPGA logic, in today's technology, with one or two LUT and local interconnect delays can operate with a 5 ns cycle time (see, for example, (tn114)). Consequently, one can reasonably expect to cycle through several array operations for each memory operation.
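Using only the cycle times quoted above, the number of logic micro-cycles that fit in one memory operation can be estimated. The helper name is made up for this sketch, and the half-cycle case simply assumes the logic runs during only one of the two phases.

```python
def microcycles_per_memory_cycle(memory_cycle_ns, logic_cycle_ns):
    """Whole logic micro-cycles that fit within one memory cycle."""
    return memory_cycle_ns // logic_cycle_ns


for memory_cycle_ns in (50, 60):
    full = microcycles_per_memory_cycle(memory_cycle_ns, 5)
    # Assumption: only half the cycle (e.g. the column phase) is usable.
    half = microcycles_per_memory_cycle(memory_cycle_ns // 2, 5)
    print(f"{memory_cycle_ns} ns memory cycle: {full} micro-cycles, "
          f"{half} if only half the cycle is available")
```

Even under the more pessimistic assumption, several context switches fit inside each memory operation.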

The programmable interconnect will also need to be configured. It will generally be most convenient and economical to integrate programmable logic programming with the memory array datapath. In the minimal case, additional address lines specify when writes are destined to programmable configuration, and configuration data is loaded from the Data Word In associated with each bank. A more useful configuration will be to connect the configuration input to the Bank Word lines, allowing the array elements to exploit the full on-chip bandwidth of the DRAM array for configuration reloading. Of course, we still have the off-chip bottleneck to load the configurations into the DRAM to begin with, but once the data is loaded on the DRAM, parallel reloads allow the logic programming to change very rapidly.
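The difference between the two configuration paths can be estimated with a back-of-the-envelope sketch. The numbers are assumptions for illustration only: the note does not give a configuration size, so `config_bits_per_block=64` is hypothetical, and the minimal case is assumed to write one bank's block per external write.

```python
import math


def reload_cycles(num_banks, width_bits, config_bits_per_block):
    """Estimate memory operations needed to load every block's configuration."""
    words_per_block = math.ceil(config_bits_per_block / width_bits)
    # Minimal case: configuration arrives via Data Word In, one bank per write.
    via_data_word_in = num_banks * words_per_block
    # Bank Word case: every block loads from its own bank in parallel.
    via_bank_word = words_per_block
    return via_data_word_in, via_bank_word


external, internal = reload_cycles(num_banks=1024, width_bits=4,
                                   config_bits_per_block=64)
print(external, internal)
```

Under these assumptions the parallel reload is faster by a factor equal to the number of banks, mirroring the bandwidth gap at the final mux.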

As described so far, we see the coupled DRAM array as supporting a few additional address and control lines to activate and control the on-chip logic. Figure shows a potential configuration. In addition to the traditional DRAM i/os for address, data, row and column select, and read/write select, the reconfigurable logic coupled DRAM includes a line to control when the logic is used, several bits to select the active context, and a line to control the reloading of configuration data. A common clock may be used for the programmable logic microcycle and memory operation. As noted above, the memory cycle is likely to be multiple microcycle clocks.

Alternately, one could implement a small, on-chip controller to orchestrate the programmable logic. (tn122) and (tn118) ("Orchestrated DPGA/SIMD Logic" in Section 3.1) touch briefly upon such an on-chip context controller.

Of course, one can combine the integrated logic with the more novel DRAM i/o interfaces described in the Internal DRAM Bandwidth section to maximize the off-chip bandwidth while simultaneously taking advantage of the on-chip bandwidth.

References

BDK93
Michael Bolotski, Andre DeHon, and Thomas F. Knight Jr. Unifying FPGAs and SIMD Arrays. Transit Note 95, MIT Artificial Intelligence Laboratory, September 1993. [tn95 HTML link] [tn95 PS link].

Bur92
Dave Bursky. Integrated Cached DRAM Lets Data Flow at 100 MHz. Electronic Design, February 1992.

Bur93
Dave Bursky. Fast DRAMs can be Swapped for SRAM Caches. Electronic Design, pages 55-67, July 1993.

Cor93
Ramtron International Corporation. 15ns Enhanced Dynamic RAM Family Product Summary. Product Literature, 1993. Ramtron International Corporation, 1850 Ramtron Drive, Colorado Springs, CO 80921.

DeH94
Andre DeHon. DPGA-Coupled Microprocessors: Commodity ICs for the Early 21st Century. Transit Note 100, MIT Artificial Intelligence Laboratory, January 1994. [tn100 HTML link] [tn100 PS link].

DeH95a
Andre DeHon. Notes on Context Distribution. Transit Note 122, MIT Artificial Intelligence Laboratory, February 1995. [tn122 HTML link] [tn122 PS link].

DeH95b
Andre DeHon. Notes on Coupling Processors with Reconfigurable Logic. Transit Note 118, MIT Artificial Intelligence Laboratory, March 1995. [tn118 HTML link] [tn118 PS link].

DeH95c
Andre DeHon. Notes on Programmable Interconnect. Transit Note 121, MIT Artificial Intelligence Laboratory, February 1995. [tn121 HTML link] [tn121 PS link].

ESS92
Duncan Elliott, Martin Snelgrove, and Michael Stumm. Computational RAM: A Memory-SIMD Hybrid and its Application to DSP. In Proceedings of the Custom Integrated Circuits Conference, pages 30.6.1-4. IEEE, May 1992.

Mic93
Micron Semiconductor, Inc. Synchronous DRAMs. Design Line, 2(2):1-5, 1993. Micron Semiconductor, Inc., 2805 East Columbia Road, Boise, Idaho 83706-9698.

Ram93
Rambus Inc. Architectural Overview. Product Literature, 1993. Rambus Inc., 2465 Latham Street, Mountain View, CA 94040.

TEC+95
Edward Tau, Ian Eslick, Derrick Chen, Jeremy Brown, and Andre DeHon. A First Generation DPGA Implementation. Transit Note 114, MIT Artificial Intelligence Laboratory, January 1995. [tn114 HTML link] [tn114 PS link].

MIT Transit Project