Dynamically Programmable Gate Arrays:
A Step Toward Increased Computational Density
Original Issue: April, 1996
Last Updated: Mon Jul 8 11:39:35 EDT 1996
``How big is a computation?''
The design goal for ``general-purpose'' computing devices is to develop a device which can:
We know, for example, that a four-input Lookup Table (4-LUT) occupies roughly 640K$\lambda^2$ (e.g. 0.16mm$^2$ in a 1$\mu$ CMOS process) [CSA+91] [TEC+95]. Thus, we get a 4-LUT density of 1.6 4-LUTs per one million $\lambda^2$ of area.
At the same time, we notice that the descriptive density of 4-LUT designs can be much greater than the 4-LUT density just observed. That is, the LUT configuration is small compared to the network area so that an idle LUT can occupy much less space than an active one.
For illustrative purposes, let us assume that it takes 200 bits to describe the configuration for one 4-LUT, which is typical of commercial FPGA devices. A 64Mb DRAM would hold 335K such configurations. Since a typical 64Mb DRAM is 6G$\lambda^2$, we can pack 56 4-LUT descriptions per one million $\lambda^2$ of area -- or about 35$\times$ the density at which we can pack active 4-LUTs. In fact, there is good reason to believe that we can use much less than 200 bits to describe each 4-LUT computation [DeH96b], making even greater densities achievable in practice.
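The density arithmetic above is easy to check directly; the following sketch uses only the illustrative figures quoted in the text (200 bits per description, a 6G$\lambda^2$ 64Mb DRAM, and a 640K$\lambda^2$ active 4-LUT):

```python
# Configuration-density arithmetic using the text's illustrative values.
BITS_PER_DESCRIPTION = 200           # assumed bits per 4-LUT configuration
DRAM_BITS = 64 * 2**20               # 64Mb DRAM
DRAM_AREA_MLAMBDA2 = 6_000           # 6G lambda^2, in millions of lambda^2
ACTIVE_LUT_AREA_MLAMBDA2 = 0.64      # 640K lambda^2 per active 4-LUT

descriptions = DRAM_BITS // BITS_PER_DESCRIPTION   # ~335K configurations
desc_density = descriptions / DRAM_AREA_MLAMBDA2   # ~56 per million lambda^2
lut_density = 1 / ACTIVE_LUT_AREA_MLAMBDA2         # ~1.6 per million lambda^2
density_advantage = desc_density / lut_density     # ~35x

print(descriptions, desc_density, density_advantage)
```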
Returning to our original question, we see that there are two components which combine to define the requisite area for our general-purpose device: the active compute and interconnect elements, and the memory which holds the configuration descriptions.
In an ideal packing, a computation requiring $N_a$ active compute elements and $N_d$ total 4-LUT descriptions can be implemented in area:
\[ A = N_a \cdot A_{active} + N_d \cdot A_{context} \]
In practice, a perfect packing is difficult to achieve due to connectivity and dependency requirements, so more than the ideal number of configuration memories may be required.
When $N_d > N_a$, it is advantageous to associate multiple, ideally $N_d / N_a$, LUT descriptions with each active LUT. Dynamically Programmable Gate Arrays (DPGAs) do just that, packing several configuration memories with each active LUT and switching element (Figure ).
From our own experience with the DPGA [TEC+95]: the base LUT area is generally consistent with other FPGA implementations. Our context memory area is based on a DRAM cell design which is 600$\lambda^2$ per memory bit, making the memory cells 2$\times$ smaller than an SRAM implementation. A static memory implementation would thus be closer to 1200$\lambda^2$ per bit, while an aggressive DRAM cell should realize a 300$\lambda^2$ memory cell, halving the context memory area again.
In all these cases, the memory required to describe a LUT computation is at least an order of magnitude smaller than the active LUT area.
Single context devices running designs where the descriptive complexity, $N_d$, is large compared to the number of requisite parallel operations, $N_a$, are consequently much larger than they need to be, since they require more active LUTs than are necessary to support the task at the desired computational speed.
Of course, there are many issues and details associated with multicontext computing devices. In this section, we will walk through a specific design example, both to make the previous discussion more concrete and to motivate additional issues. We will take an ASCII Hex$\rightarrow$binary converter as our example.
Figure describes the basic conversion task. Assuming we care about the latency of this operation, a mapping which minimizes the critical path length using SIS [SSL+92] and Chortle [Fra92] has a path length of 3 and requires 21 4-LUTs. Figure shows the LUT mapping both in equations and circuit topology.
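The 21-LUT network itself is not reproduced here, but functionally the conversion is simple. A bitwise sketch (hypothetical, not the paper's actual LUT equations) illustrates the kind of logic being mapped:

```python
def ascii_hex_to_nibble(ch: str) -> int:
    """Convert an ASCII hex digit ('0'-'9', 'A'-'F', 'a'-'f') to its 4-bit value.

    Bit 6 of the ASCII code distinguishes letters from digits, so the whole
    conversion reduces to a few gates per output bit -- the sort of shallow
    function that maps well into a small 4-LUT network.
    """
    c = ord(ch)
    is_letter = (c >> 6) & 1             # 1 for A-F / a-f, 0 for 0-9
    return ((c & 0x0F) + 9 * is_letter) & 0x0F
```

For example, `ascii_hex_to_nibble('A')` returns 10.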
The multicontext implementation requires only enough active LUTs to cover the largest evaluation stage, with one context per path stage. In contrast, a non-pipelined, single-context implementation would require all 21 LUTs to be active, for an area of $21 \times A_{active}$.
If we assume that we can pipeline the configuration read, the multicontext device can achieve comparable delay per LUT evaluation to the single context device. The total latency then is 21 ns, as before. The throughput at the 7 ns clock rate is 48 MHz. If we do not pipeline the configuration read, as was the case for the DPGA prototype [TEC+95], the configuration read adds another 2.5 ns to the LUT delay, making for a total latency of 28.5 ns and a throughput of 35 MHz.
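The latency and throughput figures follow directly from the cycle times quoted (7 ns per LUT evaluation over a depth-3 path, plus 2.5 ns when the configuration read is not pipelined):

```python
DEPTH = 3               # critical path in LUT delays
LUT_DELAY_NS = 7.0      # per-LUT evaluation time
CONFIG_READ_NS = 2.5    # added per stage when the configuration read is unpipelined

# Pipelined configuration read: 7 ns per stage.
lat_pipelined = DEPTH * LUT_DELAY_NS                        # 21 ns
thr_pipelined = 1e3 / lat_pipelined                         # ~48 MHz

# Unpipelined configuration read: 9.5 ns per stage.
lat_unpipelined = DEPTH * (LUT_DELAY_NS + CONFIG_READ_NS)   # 28.5 ns
thr_unpipelined = 1e3 / lat_unpipelined                     # ~35 MHz
```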
If the 35-48 MHz implementations are still faster than required by the application, we can reduce $N_a$ even further. Ideally, if we did not have to worry about retiming issues, then if we can afford $l$ LUT delays and can evaluate $N_a$ LUTs per cycle, we only need enough active LUTs such that $N_a \cdot l \geq 21$. E.g., if we required less than 4 MHz operation, we should need only one active LUT.
In practice, as long as we have to retime intermediates through LUTs as noted above, there will be a limit to our ability to reduce the maximum stage size by adding temporal pipeline stages. Even if we only had to arrange for the four LUT inputs needed for a single LUT evaluation in stage $i$, we would have to deploy at least 4 LUTs in stage $i-1$ to carry the signals needed by that single LUT in stage $i$. If the only place where intermediate signals can exist is on LUT outputs, as assumed so far, we eventually reach a lower limit on our ability to serialize active resource usage -- decreasing active resource requirements by reusing them to evaluate different LUTs in sequence. The limit arises from signal connectivity requirements. For the ASCII Hex$\rightarrow$binary mapping considered above, the 3-cycle case is the effective limit.
To achieve further reductions it is necessary to relax the restriction that all inputs to an evaluation stage be delivered from active LUT outputs of the immediately prior evaluation stage. One way to achieve this is to associate a pool of memory with each active LUT. This pool serves the same role as a register file in a traditional processor. VEGA [JL95] uses this approach to achieve very deep serialization, at the cost of a large per-LUT context memory.
A similar alternative is to move the flip-flop which normally lives at the output of a LUT to its inputs, and to replicate this for each described LUT rather than associating such flip-flops only with active LUTs. Figure contrasts this scheme with the scheme used in the original DPGA prototype and with traditional LUT organizations. The basic idea is to latch each LUT input on the cycle in which it is produced, even when the LUT will not be evaluated on the immediately subsequent cycle. Figure elaborates upon this configuration. Since the virtual LUTs in a group share each input line, it is not possible to load two different signals into the same input position of different LUTs on the same cycle. Also, since all the inputs to a virtual LUT are latched, the LUT output can be produced on multiple cycles if routing restrictions such as these require it to be generated multiple times.
Adding these matching units and input flip-flops will, of course, make the described contexts larger, increasing the per-context area over the estimates above.
In the most extreme serial case, $N_a = 1$ and all 21 LUT evaluations proceed in sequence on a single active LUT. Running at a 9.5 ns LUT cycle, this implementation has a 200 ns latency, achieving 5 MHz operation.
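The fully serial numbers check out with the same arithmetic (the 9.5 ns cycle is the unpipelined figure from the prototype discussion above):

```python
TOTAL_LUTS = 21
CYCLE_NS = 9.5                       # unpipelined LUT cycle (7 ns + 2.5 ns config read)

latency_ns = TOTAL_LUTS * CYCLE_NS   # 199.5 ns, i.e. ~200 ns as stated
throughput_mhz = 1e3 / latency_ns    # ~5 MHz
```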
Excessive serialization is expensive since we need to add resources to hold intermediate values. Another way to share resources while meeting a lower computational rate is to overlay multiple, independent computations into the same space. This allows each computation to have wide stages as in Section , while still taking advantage of low throughput requirements.
In our example above, we might run 12 active LUTs on the same 21-cycle round to achieve 5 MHz throughput for the ASCII conversion. The conversion itself only occupies these LUTs for 3 cycles, leaving 18 cycles during which the array can be used for other tasks. Array costs are now in line with Equation . The amortized cost for this implementation is then 3/21 of the full array cost.
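A sketch of the amortization arithmetic; the active LUT area is the figure from earlier in the text, while the per-context area here is an assumed value (the illustrative 200-bit description at 600$\lambda^2$ per bit), not the prototype's exact number:

```python
ACTIVE_LUT_AREA = 640_000   # lambda^2 per active 4-LUT (from the text)
CONTEXT_AREA = 120_000      # lambda^2: ASSUMED 200 bits x 600 lambda^2/bit
N_ACTIVE = 12               # active LUTs in the shared array
N_CYCLES = 21               # cycles in the shared round

array_area = N_ACTIVE * (ACTIVE_LUT_AREA + N_CYCLES * CONTEXT_AREA)
# The conversion occupies the array for only 3 of 21 cycles, so it is charged
# 3/21 = 1/7 of the array cost; the other 18 cycles amortize over other tasks.
amortized_area = array_area * 3 / N_CYCLES
```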
Table summarizes the area for the various implementation points described in this section. Also included is the effective area required for two different processor implementations of this task. The MIPS-X [HHC+87] assembly code is shown in Figure ; for the sake of comparison, we assume the 4 cycle path which occurs when converting an ASCII decimal digit. The DEC Alpha [GBB+96] implementation assumes that the conversion is performed by table lookup and the table is already loaded into the on-chip data cache; the conversion takes one add and one lookup, each of which takes up half a cycle since the Alpha can issue two instructions per cycle. The processor area comes from scaling the actual area to meet the cycle time -- this assumes that we either deploy multiple processors to achieve a higher rate of execution than the base processor cycle time allows, or that the task shares the processor with other tasks and area is amortized accordingly.
In this section, we have examined one task in depth. See [DeH96a] for a broader look at multicontext applications.
One thing which is clear from Table is that there is an optimal ratio between $N_d$ and $N_a$ which varies with the desired throughput requirements. Unfortunately, we cannot expect to have a distinct DPGA design for every possible number of contexts.
We can define a throughput ratio, $R$, as the ratio of the desired task throughput to the raw LUT throughput:
\[ R = \frac{T_{task}}{T_{LUT}} \]
Here, we have been assuming the raw LUT throughput of conventional devices (roughly 100-140 MHz for the 7-9.5 ns LUT cycles above). When the throughput ratio is much less than one, single contexts are very inefficient since far more active LUTs are provided than necessary. When $c$, the number of contexts supported, exceeds the reciprocal of the throughput ratio, area is unnecessarily consumed by context memory. Figure plots efficiency versus throughput ratio, showing the relative efficiency of running a task with a given throughput ratio on a device with $c$ contexts, for both the prototype DPGA context sizes and the input-latched DPGA context sizes.
One interesting design point to notice on both graphs is the point where the area dedicated to the fixed LUT is equal to the total area of the contexts; this balance falls at a different number of contexts for the DPGA and for DPGAs with input latches. At this point, the efficiency never drops below 50% for any value of the throughput ratio. This is interesting since the cross efficiencies at the extremes of the graphs shown drop down to 13%.
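The 50% property can be seen in a simple area-efficiency model consistent with this discussion (this is a sketch, not necessarily the paper's exact model; the areas `a_act` and `a_ctx` are illustrative, chosen so the balance point falls at $c = 10$):

```python
def efficiency(c: int, n: int, a_act: float = 640_000.0,
               a_ctx: float = 64_000.0) -> float:
    """Area efficiency of a c-context device on a task ideally wanting n contexts.

    Ideal area: one active LUT plus n descriptions. With too few contexts
    (c < n) we must deploy n/c times as many active LUTs; with too many
    (c > n) the surplus context memory is simply wasted.
    """
    ideal = a_act + n * a_ctx
    actual = max(n / c, 1.0) * (a_act + c * a_ctx)
    return ideal / actual

# At the balance point c = a_act / a_ctx = 10, where fixed LUT area equals
# total context area, efficiency stays above 50% for every task depth n,
# while a single-context device on a deeply serial task is far less efficient.
```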
In a quick review of the FPD'95 proceedings, we see FPGA task implementations running at frequencies between 6 MHz and 80 MHz, with most running between 10 MHz and 20 MHz. Thus, we see a decent number of FPGA tasks implemented with throughput ratios well below one. We might ask why these applications have such low throughput requirements relative to the raw LUT throughput. There are two main causes:
Multicontext FPGAs have the potential to pack typical computational tasks into less area, delivering more computation per unit area -- i.e. increased computational density. The key properties they exploit are:
Task dataflow dependencies require that a number of signals be made available to each stage of computation. The retiming requirements associated with satisfying these dependencies can pose a limit to the effective serializability of the task. In some cases we can avoid these limitations by overlaying loosely related or unrelated tasks on the same resources. Alternately, we can architect intermediate data storage to allow further serialization, at the cost of larger context resources. Architectural design of these multicontext structures, and synthesis for them, is a relatively young area of research. Interesting research possibilities exist in the areas of:
Jeremy Brown, Derrick Chen, Ian Eslick, Ethan Mirsky, and Edward Tau made the DPGA prototype possible. The input latch ideas come from the TSFPGA architecture co-developed with Derrick Chen. Many of the quantitative details about the DPGA are the results of the collective efforts of this team.