Notes on Integrating Reconfigurable Logic with DRAM Arrays
Andre DeHon
Original Issue: March, 1995
Last Updated: Sat Apr 8 20:57:50 EDT 1995

This is a quick note describing the opportunities which come from integrating small amounts of reconfigurable logic into a conventional DRAM, as well as a rough architecture for such integration.
Conventional DRAM arrays are built out of numerous banks of DRAM (See
Figure ). The memory addresses are, consequently,
broken into two parts: (1) a row address which selects the address to be
read within each bank and (2) a column address which selects the bank whose
data will actually be read. Internally, once the row address has been
read, a large number of data bits are available. For a memory with many
banks, each several bits wide, the number of bits available at the inputs
to the final selection process is the number of banks times the bank
width. In the most conventional architectures, all but one bank's worth of
bits are then thrown away during the final column selection process. The
result is a meager i/o bandwidth to and from the memory, despite a huge
on-chip bandwidth.
A 4 megabit DRAM, old by today's standards, is likely to have 512 or 1024
banks, with a 4- or 8-bit data width. The DRAM throws away almost three
orders of magnitude in bandwidth during the final multiplexing stage. This
gap only gets larger as we move to larger capacity DRAM arrays. The DRAM
cell area shrinks, allowing greater density. However, the number of i/os
supported on the die's perimeter does not grow commensurately with the
internal memory capacity. In fact, the i/o capacity is hardly growing at
all.
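The size of this gap can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that one row read exposes one word from every bank and that the column select keeps exactly one bank's word; the function and field names are hypothetical and chosen only for illustration.

```python
import math


def bandwidth_gap(n_banks: int, bank_width_bits: int) -> dict:
    """Compare the bits exposed by one row read with the bits that leave the chip."""
    internal_bits = n_banks * bank_width_bits  # every bank delivers one word
    external_bits = bank_width_bits            # column select keeps one bank's word
    ratio = internal_bits // external_bits     # equals n_banks
    return {
        "internal_bits": internal_bits,
        "external_bits": external_bits,
        "ratio": ratio,
        "orders_of_magnitude": math.log10(ratio),
    }


if __name__ == "__main__":
    # Bank counts and widths quoted in the text for a 4 megabit part.
    for n_banks in (512, 1024):
        for width in (4, 8):
            gap = bandwidth_gap(n_banks, width)
            print(n_banks, width, gap["internal_bits"],
                  round(gap["orders_of_magnitude"], 2))
```

Under these assumptions the ratio is simply the number of banks, which is why 512 or 1024 banks correspond to a loss of almost three orders of magnitude.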
A number of modern twists on the original DRAM architecture have tried to improve the situation, generally by exploiting locality of memory references and increasing the i/o toggle rate.
All of these DRAM variants take advantage of locality and exploit the fact that the external i/o can often be cycled faster than the basic memory read. When reading large, contiguous blocks, or data closely packed in memory, each internal bank read can yield many data words of interest to the application. Access bandwidth in these cases is limited entirely by the external i/o bandwidth rather than memory cycle time. This external bandwidth is still an order of magnitude or two less than the on-chip bandwidth.
Since we cannot move the high internal bandwidth across the bandwidth-limited chip i/o to service external logic, we can consider moving the required logic to the bandwidth -- by putting the logic on the die with the DRAM. This avoids the chip-crossing problem and allows us to more fully exploit the internal bandwidth which our conventional DRAM architectures already contain.
The question then is "What logic do we put inside?" If we knew exactly what kind of logic we wanted, we could embed the two on the same die. Remember, however, that DRAMs are very much commodity items. Each design variant must have very widespread appeal to merit the design and production costs associated with modifying the design to contain a given piece of logic. A few applications may be able to justify their own fixed logic integration into DRAMs, but most cannot.
Alternatively, we can consider placing more flexible logic on the DRAM which can be adapted to a wide variety of tasks. Here, as with the DPGA-coupled processors (tn100), we can exploit a wider range of applications to gain commodity appeal.
People have, in fact, already considered placing SIMD logic at the output of the bank memory within a DRAM array to exploit the high on-chip bandwidth while retaining some flexibility. [ESS92] introduces one such device. Cray Computer, Inc., working with the Institute for Defense Analyses, has integrated such a SIMD/memory hybrid into their Cray 3 supercomputer.
In general, we can consider using any programmable, computational array
structure as flexible logic which can operate on the bits which are read,
and may be written, in parallel across the memory banks.
(tn95) introduced a unified computational array model which included
both FPGAs and SIMD arrays, as well as DPGAs. Recalling this framework, we
note that the SIMD structure allows us to perform the same operation on each
of the embedded computational array elements operating on the bank memory
outputs. At the same time, the SIMD array allows us to perform a different
operation to the memory on each array cycle -- which may or may not be
coupled to the memory cycle time. Integrated FPGA logic, on the other
hand, would allow us to wire up array elements so that they performed
different functions; however, those functions could not change rapidly in
time. Using the DPGA hybrid would allow the integrated array to vary in
function both from cycle to cycle and from cell to cell within the array.
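The distinction between the three styles can be sketched in a few lines. This is a toy model only, assuming one array element per bank word and 4-bit words; the function names and the example operations are hypothetical and are not taken from (tn95).

```python
from typing import Callable, List, Sequence

Op = Callable[[int], int]  # the function one array element applies to its bank word


def simd_step(words: Sequence[int], op: Op) -> List[int]:
    """SIMD: every element applies the same op; a different op may be issued next cycle."""
    return [op(w) for w in words]


def fpga_step(words: Sequence[int], cell_ops: Sequence[Op]) -> List[int]:
    """FPGA: each element has its own op, but the assignment is fixed over time."""
    if len(cell_ops) != len(words):
        raise ValueError("need exactly one op per array element")
    return [op(w) for op, w in zip(cell_ops, words)]


def dpga_step(words: Sequence[int], contexts: Sequence[Sequence[Op]],
              context: int) -> List[int]:
    """DPGA: a per-element assignment, selected per cycle from stored contexts."""
    return fpga_step(words, contexts[context])


def invert(w: int) -> int:
    return ~w & 0xF


def increment(w: int) -> int:
    return (w + 1) & 0xF


def identity(w: int) -> int:
    return w
```

In this model the SIMD array varies its operation in time but not in space, the FPGA varies it in space but not in time, and the DPGA indexes a small table of per-element assignments by the active context, so it varies in both.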
In the following section, we expand on the integration of FPGA or DPGA logic into a DRAM memory array.
In general, we can add a small block of reconfigurable logic local to each
DRAM memory bank. This reconfigurable block directly consumes, and may
produce, the bits which are read out of each DRAM bank. The block has
associated programmable interconnect which allows it to take inputs from and
feed outputs to other reconfigurable logic blocks. Since the banks are
physically arrayed on the DRAM array, the reconfigurable logic blocks are
also gridded.
Figure shows a DRAM bank with a
reconfigurable logic block. Data from the memory bank is a local
input to the reconfigurable logic. The output from the bank to the final
memory selection (e.g. the final mux in Figure )
may come directly from the bank output or from the programmable logic.
Data written back into the memory may come from the programmable logic
block, from new input data, or from the bank output. The programmable
block is connected to other programmable logic blocks via programmable
interconnect. This allows outputs from the logical elements in each
programmable logic block to feed into other programmable blocks. This
interconnect also allows DRAM bank output to connect to programmable logic
which is not local to the DRAM bank.
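The two selection points in this datapath can be written down directly. The sketch below is a behavioral model only, assuming a single control bit for the read path and a three-way selector for writeback; the names `WritebackSource`, `select_data_word_out`, and `select_writeback` are hypothetical.

```python
from enum import Enum


class WritebackSource(Enum):
    """Possible sources for the word written back into the bank."""
    LOGIC = "logic"      # output of the programmable logic block
    DATA_IN = "data_in"  # new input data arriving from off chip
    BANK = "bank"        # the word just read from the bank


def select_data_word_out(bank_word: int, logic_word: int, use_logic: bool) -> int:
    """Word presented to the final memory selection mux for this bank."""
    return logic_word if use_logic else bank_word


def select_writeback(source: WritebackSource, bank_word: int,
                     logic_word: int, data_word_in: int) -> int:
    """Word written back into the bank, chosen by the writeback selector."""
    if source is WritebackSource.LOGIC:
        return logic_word
    if source is WritebackSource.DATA_IN:
        return data_word_in
    return bank_word
```

With `use_logic` false and the writeback source set to `DATA_IN`, the model reduces to the behavior of a conventional DRAM bank.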
There is a large amount of flexibility in the implementation of the
programmable logic block. The number of array elements/block, the array
element design, the amount of interconnect flexibility, and their topology
can vary considerably. Figure shows a general
formulation of such a logic block, parameterized by the number of array
elements, the number of input bits to each element, the bit width of the
datapath to the DRAM bank, and the numbers of lines connecting adjacent
programmable blocks in the horizontal and vertical directions. If we want
to be able to perform local computation and writeback from the
programmable logic, we need at least as many programmable array elements
as data bits (i.e. the element count must be at least the datapath
width). Each array element can be implemented in any of a number of ways
(e.g. as a lookup table (LUT) over its input bits, or as a piece of
hardwired logic). Generally,
the array elements will have optional or mandatory flip-flops latching the
output values computed, especially when the logic performed on the data is
pipelined.
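As one concrete reading of the array element, a lookup table with an optional output flip-flop can be modeled as follows. This is a sketch under stated assumptions: inputs are taken least significant bit first, the flip-flop resets to 0, and the names `LutElement` and `supports_local_writeback` are hypothetical.

```python
from typing import Sequence


class LutElement:
    """A lookup-table array element with an optional output flip-flop."""

    def __init__(self, n_inputs: int, truth_table: Sequence[int],
                 registered: bool = False) -> None:
        if len(truth_table) != 2 ** n_inputs:
            raise ValueError("truth table needs one entry per input combination")
        self.n_inputs = n_inputs
        self.truth_table = [bit & 1 for bit in truth_table]
        self.registered = registered
        self.latched = 0  # assumed reset value of the flip-flop

    def lookup(self, inputs: Sequence[int]) -> int:
        """Combinational output: index the table with the input bits (LSB first)."""
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of input bits")
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position
        return self.truth_table[index]

    def step(self, inputs: Sequence[int]) -> int:
        """Output visible this cycle; a registered element lags by one cycle."""
        value = self.lookup(inputs)
        if not self.registered:
            return value
        visible, self.latched = self.latched, value
        return visible


def supports_local_writeback(n_elements: int, datapath_width: int) -> bool:
    """Local compute-and-writeback needs at least one element per data bit."""
    return n_elements >= datapath_width
```

For example, `LutElement(2, [0, 1, 1, 0])` behaves as a two-input XOR; constructing it with `registered=True` delays each result by one cycle, as in a pipelined datapath.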
The interconnect shown in Figure can be very
general (e.g. a full crossbar in the worst case). In practice it is
likely to be much more limited. First, a full crossbar provides more
flexibility than is useful if the array elements are actually LUTs (see
(tn121)). Further restriction is
likely to be necessary to keep interconnect size and delay down to an
acceptable level. Some connections may be omitted simply because they are
unnecessary. For example, Data Word Out and Data Word Writeback
may be the same port, since it is not likely they will be used
simultaneously. Also, there is no need to interconnect the Bank Word
to the Data Word Out or Data Word Writeback in the topology
shown in Figure since there are direct
connections which provide that effect outside of the programmable logic.
In general, the topology for interconnecting programmable logic blocks
carries the same issues as interconnecting sub-arrays in a DPGA. See
(tn121) for a more detailed discussion.
The formulation above is a bit simplistic in that it shows a single bundle
of lines in each compass direction. As detailed in (tn121), the interconnect may be hierarchical and these lines
may be broken down into lines which span varying distances in the array.
In these cases, while the full bundle of horizontal or vertical lines may
be consumed at each programmable logic block, the full bundle will not be
generated at each block -- or more properly, some of the lines will likely
connect straight across each programmable logic block without being
switched within the block.
The mux between the programmable logic Data Word Out and the Bank Word is most likely controlled by read logic, retaining the ability to read and write straight from memory as a conventional DRAM without using the programmable logic. A control signal into the array of memory banks can control the desired behavior. The writeback data is selected in a similar manner.
The array elements and programmable logic may have multiple context configurations (i.e. they may form a DPGA (tn95)). In these cases, the control lines into the array will select not only whether or not to use the programmable logic, as described above, but also which context should act on a given cycle.
One promising use of DPGA logic will be to perform multiple micro-cycle operations on each array memory operation. Conventional DRAMs have 50-60 ns cycle times, roughly half spent in row read and half in column select and readout. Well designed DPGA logic, in today's technology, with one or two LUT and local interconnect delays, can operate with a 5 ns cycle time (see, for example, (tn114)). Consequently, one can reasonably expect to cycle through several array operations for each memory operation.
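The number of logic micro-cycles that fit into one memory cycle follows directly from the figures quoted above. A minimal sketch, assuming whole micro-cycles only and ignoring any synchronization overhead (the function name is hypothetical):

```python
def microcycles_per_memory_cycle(memory_cycle_ns: int, logic_cycle_ns: int) -> int:
    """Whole logic micro-cycles that fit inside one memory cycle."""
    if logic_cycle_ns <= 0:
        raise ValueError("logic cycle time must be positive")
    return memory_cycle_ns // logic_cycle_ns


if __name__ == "__main__":
    # 50-60 ns DRAM cycle and 5 ns DPGA cycle, as quoted in the text.
    for memory_cycle_ns in (50, 60):
        print(memory_cycle_ns, microcycles_per_memory_cycle(memory_cycle_ns, 5))
```

With a 5 ns logic cycle this gives 10 to 12 micro-cycles per memory cycle, so "several" array operations per memory operation is a conservative expectation.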
The programmable interconnect will also need to be configured. It will generally be most convenient and economical to integrate the programming of the programmable logic with the memory array datapath. In the minimal case, additional address lines specify when writes are destined to programmable configuration, and configuration data is loaded from the Data Word In associated with each bank. A more useful configuration will be to connect the configuration input to the Bank Word lines, allowing the array elements to exploit the full on-chip bandwidth of the DRAM array for configuration reloading. Of course, we still have the off-chip bottleneck to load the configurations into the DRAM to begin with, but once the data is loaded on the DRAM, parallel reloads allow the logic programming to change very rapidly.
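The difference between the two loading paths can be estimated with a simple count of transfers. The sketch below assumes that each external write delivers one bank-width word to one block, that each block can take one bank-width word per row read from its own bank, and that all blocks reload in parallel in the second case. The configuration size per block is a purely illustrative parameter, not a figure from this note, and the function names are hypothetical.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division rounded up."""
    return -(-numerator // denominator)


def reload_writes_via_data_in(n_banks: int, bank_width_bits: int,
                              config_bits_per_block: int) -> int:
    """External writes needed when every block is loaded through Data Word In."""
    writes_per_block = ceil_div(config_bits_per_block, bank_width_bits)
    return n_banks * writes_per_block


def reload_reads_via_bank_word(bank_width_bits: int,
                               config_bits_per_block: int) -> int:
    """Row reads needed when every block reloads from its own bank in parallel."""
    return ceil_div(config_bits_per_block, bank_width_bits)


if __name__ == "__main__":
    # Illustrative assumption: 1024 banks, 8-bit banks, 1000 configuration bits per block.
    print(reload_writes_via_data_in(1024, 8, 1000))
    print(reload_reads_via_bank_word(8, 1000))
```

Under these assumptions the parallel path is faster by a factor equal to the number of banks, which mirrors the bandwidth gap between the internal array and the external i/o.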
As described so far, we see the coupled DRAM array as supporting a few
additional address and control lines to activate and control the
on-chip logic. Figure shows a potential
configuration. In addition to the traditional DRAM i/os for address, data,
row and column select, and read/write select, the reconfigurable-logic-coupled
DRAM includes a line to control when the logic is used, several
bits to select the active context, and a line to control the reloading of
configuration data. A common clock may be used for the programmable logic
microcycle and memory operation. As noted above, the memory cycle is
likely to span multiple microcycle clocks.
Alternatively, one could implement a small, on-chip controller to orchestrate the programmable logic. (tn122) and (tn118) ("Orchestrated DPGA/SIMD Logic" in Section 3.1) touch briefly upon such an on-chip context controller.
Of course, one can combine the integrated logic with the more novel DRAM
i/o interfaces described in Section to maximize the
off-chip bandwidth while simultaneously taking advantage of the on-chip
bandwidth.