Transit Note #136
Dynamically Programmable Gate Arrays:
A Step Toward Increased Computational Density

Andre DeHon

Original Issue: April, 1996

Last Updated: Mon Jul 8 11:39:35 EDT 1996

Abstract:

Field-Programmable Gate Arrays are interesting, general-purpose computational devices because (1) they have high computational density and (2) they have fine-grained control of their computational resources since each gate is independently controlled. The earlier provides them with a potential 10 advantage in raw peak performance density versus modern microprocessors. The later can afford a 32 advantage on random bit-level computations. Nonetheless, typical FPGA usage seldom extracts this full density advantage. DPGAs are less computationally dense than FPGAs, but allow most applications to achieve greater, yielded computational density. The key to unraveling this potential paradox lies in distinguishing instruction density from active computing density. Since the storage space for a single instruction is inherently smaller than the computational element it controls, packing several instructions per computational unit increases the aggregate instruction capacity of the device without a significant reduction in computational density. The number of different instructions executed per computational task often limits the effective computational density. As a result, DPGAs can meet the throughput requirements of many computing tasks with 3-4 less area than conventional FPGAs.

Computational Area

``How big is a computation?''

The design goal for ``general-purpose'' computing devices is to develop a device which can:

implement desired computational tasks
perform the computation at the desired latency or throughput
realize the implementation at minimal cost -- usually silicon area

As device designers we are concerned with the area which a computational element occupies and its latency or throughput.

We know, for example, that a four input Lookup Table (4-LUT) occupies roughly 640K ( e.g. 0.16mm in a 1 CMOS processor ()) [CSA +91] [TEC +95]. Thus, we get a 4-LUT density of 1.6 4-LUTs per one million of area.

At the same time, we notice that the descriptive density of 4-LUT designs can be much greater than the 4-LUT density just observed. That is, the LUT configuration is small compared to the network area so that an idle LUT can occupy much less space than an active one.

For illustrative purposes, let us assume that it takes 200 bits to describe the configuration for one 4-LUT, which is typical of commercial FPGA devices. A 64Mb DRAM would hold 335K such configurations. Since a typical 64Mb DRAM is 6G, we can pack 56 4-LUT descriptions per one million of area -- or about 35 the density which we can pack 4-LUTs. In fact, there is good reason to believe that we can use much less than 200 bits to describe each 4-LUT computation [DeH96b], making even greater densities achievable in practice.

Returning to our original question, we see that there are two components which combine to define the requisite area for our general-purpose device:

-- the total number of 4-LUTs in the design -- the descriptive complexity
-- the total number of 4-LUTs which must be evaluated simultaneously in order to achieve the desired task time or computational throughput -- the parallelism required to achieve the temporal requirements

In an ideal packing, a computation requiring active compute elements and total 4-LUTs, can be implemented in area:

In practice, a perfect packing is difficult to achieve due to connectivity and dependency requirements such that configuration memories are required.

DPGAs

When , it is advantageous to associate multiple, ideally , LUT descriptions with each active LUT. Dynamically Programmable Gate Arrays (DPGAs), do just that, packing several configuration memories for each active LUT and switching element (Figure ).

From our own experience with the DPGA [TEC +95]: The base LUT area is generally consistent with other FPGA implementations. Our is based on a DRAM cell design which is 600 per memory bit, making the memory cells 2 smaller than an SRAM implementation. A static memory would be closer approximated as: An aggressive DRAM cell should realize a 300 memory cell, making:

In all these cases the configuration size is at least an order of magnitude smaller than the active LUT:

Single context devices running designs where the descriptive complexity, , is large compared to the number of requisite parallel operations, , are, consequently, much larger than they need to be since they require more active LUTs than are necessary to support the task at the desired computational speed.

ASCII HexBinary: Extended Example

Of course, there are many issues and details associated with multicontext computing devices. In this section, we will walk through a specific design example in order to make the previous discussion more concrete and in order to motivate additional issues. We will take an ASCII Hexbinary converter as our example.

Figure describes the basic conversion task. Assuming we care about the latency of this operation, a mapping which minimizes the critical path length using SIS [SSL +92] and Chortle [Fra92] has a path length of 3 and requires 21 4-LUTs. Figure shows the LUT mapping both in equations and circuit topology.

Traditional Pipelining for Throughput

If we cared only about achieving the highest throughput, we would fully pipeline this implementation such that it took in a new character on each cycle and output its encoding three cycles later. This pipelining would require an additional 7 LUTs to pipeline data which is needed more than one pipeline stage after being generated ( i.e. 4 to retime c<3:0> for presentation to the second stage and 3 to retime c<3>, c<1> and i1 for presentation to the final stage). Consequently, we effectively evaluate a design with

4-LUTs with

physical 4-LUTs. Typical LUT delay, including a moderate amount of local interconnect traversal, is 7 ns [TEC +95] [CSA +91]. Assuming this is the only limit to cycle time, the implementation could achieve 140 MHz operation. Notice that the only reason we had to have any more LUTs or LUT descriptions than strictly required by the task description was in order to perform signal retiming based on the dependency structure of the computation. Spatially, this implementation requires:

Multicontext Implementation -- Temporal Pipelining

If, instead, we cared about the latency, but did not need 140 MHz operation, we could use a multicontext device with 3 LUT descriptions per active element. To achieve the target latency of 3 LUT delays, we need to have enough active LUTs to implement the largest stage -- the middle one. If the inputs are arriving from some other circuit which is also operating in multicontext fashion, we must retime them as before. Consequently, we require 3 extra LUTs in the largest stage, making for a total

. Note that the 4 retiming LUTs added to stage 1 also bring its total LUT usage up to 12 LUTs. We end up implementing

, with

and

. If c<7:0> were inputs which did not change during these three cycles, we would only need one extra retiming LUT in stage 2 for i1, allowing us to use

The multicontext implementation requires: In contrast, a non-pipelined, single-context implementation would require LUTs, for an area of:

If we assume that we can pipeline the configuration read, the multicontext device can achieve comparable delay per LUT evaluation to the single context device. The total latency then is 21 ns, as before. The throughput at the 7 ns clock rate is 48 MHz. If we do not pipeline the configuration read, as was the case for the DPGA prototype [TEC +95], the configuration read adds another 2.5 ns to the LUT delay, making for a total latency of 28.5 ns and a throughput of 35 MHz.

Further Temporal pipelining

If the 35-48 MHz implementations are still faster than required by the application, we can reduce even further. Ideally, if we did not have to worry about retiming issues: That is, if we can afford LUT delays, and we can evaluate LUTs per cycle, we only need enough active LUTs such that . E.g. if we required less than 4 MHz operation, we should need only one active LUT.

In practice, as long as we have to retime intermediates through LUTs as noted above, there will be a limit to our ability to reduce the maximum stage size by adding temporal pipeline stages. Even if we only had to arrange for the four LUT inputs needed for a single LUT evaluation in stage , we would have to deploy at least 4 LUTs in stage to carry the signals needed by that single LUT in stage . If the only place where intermediate signals can exist is on LUT outputs, as assumed so far, we eventually reach a lower limit on our ability to serialize active resource usage -- decreasing active resource requirements by reusing them to evaluate different LUTs in sequence. The limit arises from signal connectivity requirements. For the ASCII Hexbinary mapping considered above, the 3 cycle, case is the effective limit.

To achieve further reductions it is necessary to relax the restriction that all inputs to an evaluation stage be delivered from active LUT outputs from the immediately prior evaluation stage. One way to achieve this is to associate a pool of memory with each active LUT. This pool serves the same role as a register file in a traditional processor. VEGA [JL95] uses this approach achieving: Where: when:

A similar alternative is to move the flip-flop which normally lives at the output of a LUT to its inputs and to replicate this for each described LUT rather than only associating such flip-flops with active LUTs. Figure contrasts this scheme with the scheme used in the original DPGA prototype and with traditional LUT organizations. The basic idea is to latch each LUT input on the cycle which it is produced even when the LUT will not be evaluated on immediately subsequent cycle. Figure elaborates upon this configuration. Since each virtual LUT in a group shares each input line, it is not possible to load two different signals into the same input location of different LUTs on the same cycle. Also, since all the inputs to a virtual LUT are latched, the LUT output can be produced on multiple cycles if routing restrictions such as these require it to be generated multiple times.

Adding these matching units and input flip-flops will, of course, make the described contexts larger. From our own experience, we estimate: Together, this gives us:

In the most extreme serial case, , . This implementation requires area: Running at a 9.5 ns LUT cycle, this implementation has a 200 ns latency achieving 5 MHz operation.

Interleaved Computation

Excessive serialization is expensive since we need to add resources to hold intermediate values. Another way to share resources while meeting a lower computational rate is to overlay multiple, independent computations into the same space. This allows each computation to have wide stages as in Section , while still taking advantage of low throughput requirements.

In our example above, we might run 12 LUTs on the same 21 cycle round to achieve 5 MHz throughput for the ASCII conversion. The conversion itself only occupies these 28 LUTs for 3 cycles, leaving 18 cycles during which the array can be used for other tasks. Array costs are now in line with Equation . The amortized cost for this implementation is then:

Area Comparison Summary

Table summarizes the area for various implementation points described in this section. Also included is the effective area required for two different processor implementations of this task. The MIPS-X [HHC +87] assembly code is shown in Figure ; for the sake of comparison, we assume the 4 cycle path which occurs when converting an ASCII decimal digit. The DEC Alpha [GBB +96] implementation assumes that the conversion is performed by table lookup and the table is already loaded into the on-chip data cache; the conversion takes one add and one lookup, each of which take up half a cycle since the Alpha can issue two instruction per cycle. The processor area comes from scaling the actual area to meet the cycle time -- this assumes that we either deploy multiple processors to achieve a higher rate of execution than the base processor cycle time or that the task shares the processor with other tasks and area is amortized accordingly.

In this section, we have examined one task in depth. See [DeH96a] for a broader look at multicontext applications.

Device Sizing

One thing which is clear from Table is that there is an optimal ratio between and which varies with the desired throughput requirements. Unfortunately, we cannot expect to have a distinct DPGA design for every possible number of contexts.

We can define a throughput ratio as the ratio of the desired task throughput to the raw LUT throughput:

Here, we have been assuming MHz for conventional devices. When , single contexts are very inefficient since far more active LUTs are provided than necessary. When , where is the number of contexts supported, area is unnecessarily consumed by context memory. Figure plots versus showing the relative efficiency of running a task with throughput ratio on a device with contexts for both the DPGA context sizes and DPGA sizes.

One interesting design point to notice on both graphs is that point where the area dedicated to the fixed LUT is equal to the area of the contexts. This occurs at for the DPGA and for DPGAs with input latches. At this point, the efficiency never drops below 50% for any values of . This is interesting since the cross efficiencies at the extremes in the graph shown drop down to 13%.

Typical Application

In a quick review of the FPD'95 proceedings, we see FPGA task implementations running at frequencies between 6MHz and 80MHz, with most running between 10MHz and 20MHz. Thus, we see a decent number of FPGA tasks implemented with . We might ask why these applications have such large throughput ratios, . There are two main causes:

Task throughput requirements are low -- an external timing requirement dictates that there is no point for the circuit to provide any higher throughput.
Critical path with cyclic constraints -- throughput is often limited by the critical path, especially when the result from one round of computation are required at the start of the next round of computation.

Regardless of the detailed reasons, in this range of operation, single context FPGAs are paying a raw overhead factor of 6-10

for DPGAs or 3

for DPGAs with input latches. Even if the packing efficiency on the standard DPGAs is only 50%, this still amounts to a 3-5

area overhead.

Conclusions

Multicontext FPGAs have the potential to pack typical computational tasks into less area, delivering more computation per unit area -- i.e. increased computational density. The key properties they exploit are:

LUT throughput is often much greater than task throughput.
A LUT instruction is smaller than the active LUT.

Sharing the active portion of the LUT among multiple instructions yields increased device usage efficiency when the throughput ratio is large (

Task dataflow dependencies require that a number of signals be made available to each stage of computation. The retiming requirements associated with satisfying these dependencies can pose a limit to the effective serializability of the task. In some cases we can avoid these limitations by overlaying loosely or unrelated tasks on the same resources. Alternately, we can architect intermediate data storage to allow serialization, at the cost of larger context resources. Architectural design of these multicontext structures and synthesis for them is a moderately young area of research. Interesting research possibilities exist in the areas of:

Temporally oriented synthesis to minimize retiming requirements
Tighter structures for data retiming
Hybrid structures with an intermediate amount of retiming resources

Acknowledgements

Jeremy Brown, Derrick Chen, Ian Eslick, Ethan Mirsky, and Edward Tau made the DPGA prototype possible. The input latch ideas come from the TSFPGA architecture co-developed with Derrick Chen. Many of the quantitative details about the DPGA are the results of the collected efforts of this team.

Transit Note #136
Dynamically Programmable Gate Arrays:
A Step Toward Increased Computational Density

Abstract:

Computational Area

DPGAs

ASCII HexBinary: Extended Example

Traditional Pipelining for Throughput

Multicontext Implementation -- Temporal Pipelining

Further Temporal pipelining

Interleaved Computation

Area Comparison Summary

Device Sizing

Typical Application

Conclusions

Acknowledgements

See Also...

References

Transit Note #136 Dynamically Programmable Gate Arrays: A Step Toward Increased Computational Density

Abstract:

Computational Area

DPGAs

ASCII HexBinary: Extended Example

Traditional Pipelining for Throughput

Multicontext Implementation -- Temporal Pipelining

Further Temporal pipelining

Interleaved Computation

Area Comparison Summary

Device Sizing

Typical Application

Conclusions

Acknowledgements

See Also...

References

Transit Note #136
Dynamically Programmable Gate Arrays:
A Step Toward Increased Computational Density