
Dynamically Programmable Gate Arrays with Input Registers

In Chapter we noticed that retiming requirements often prevented us from realizing as significant a reduction in active LUTs as should be possible. As a result of retiming, we often had to dedicate active LUTs simply to pass data through intermediate contexts. Retiming requirements also created a saturation level below which no further reduction in active LUTs was possible even if we were willing to take more time or add more context memories.

In this chapter we introduce input registers to the simple DPGA model used in the previous chapter. These input registers allow us to store, in memories, values which must traverse LUT evaluation levels, rather than having those values consume active resources during the time in which they are being retimed. This addition reduces the retiming limit we encountered in the previous chapter.

We introduce input registers to the base DPGA architecture (Section ) and expand our computing device model accordingly (Section ). Section provides a basic example of the benefits of adding input registers. We expand our experimental, multicontext mapping software from the previous chapter to handle input registers (Section ) and examine the aggregate results of mapping circuit benchmarks to these devices. In Section , we briefly relate the input register model used in this chapter to potential alternatives. At the end of this chapter (Section ) we review the key points about multicontext devices as developed over the last several chapters.

Input Registers

We established in Chapter that most of the active area in conventional FPGAs goes into interconnect. In the existing model, when a signal must cross multiple succeeding contexts between the producer and the final consumer, we must dedicate precious active routing resources to the signal for all intervening contexts. Note that this property is essentially true of single context FPGAs, as well. If a value is produced early in some critical path, but not consumed until several LUT delays later, the wires and switches between the producer and consumer are tied up holding the value for the entire time. Tying up switches and wires to transport a value in time is a poor use of a scarce resource.

The conventional model results from storing values in registers on the output of each computational element (See Figures and ). With this arrangement, we must hold the value on the output and tie up switches and wires between the producer and the consumer until such time as the final consumer has used the value. Since values are produced at different times, and several values from different sources must converge at a consuming LUT in order for it to produce its output value, this gives rise to the situation where switches and wires are forced to sit idle holding values for much longer than the time it takes for them to transport the values from their sources to their destinations.

The alternative is to move the value registers to the inputs of the computational elements (See Figure ). In the simplest case, this means having four flip-flops on the input of each 4-LUT rather than one flip-flop on the output. This modification allows us to move the data from the producer to consumer in the minimum transit time -- a time independent of when the consumer will actually use the data. We now tie up space in a register to perform the retiming function rather than tying up all the wires and interconnect required to route the value from producers to consumers. Since the register can be much smaller than the intervening interconnect, this results in a tighter implementation.

Conceptually, the key idea here is that signal transport and retiming are two different functions:

  1. Spatial Transport -- moves data in space -- route data from source to destination
  2. Temporal Transport (Retiming) -- moves data in time -- make data available at some later time when it is actually required
By segregating the mechanisms we use for these two functions, we can optimize them independently and achieve a tighter implementation.

We can view this multicontext progression as successively relaxing the strict interconnect requirements for this class of devices:

iDPGA Model

A DPGA with input registers (iDPGA) associates an -bit long shift register with each LUT input in addition to the instructions per active LUT. The LUT instruction tells the LUT which of the values on the shift register to actually select on each cycle. Each LUT input can thus retime a value by up to cycles. That is, values may arrive at the destination LUT up to clock cycles before they are consumed. Figure shows a possible iDPGA array element with 4 contexts and an input register with depth 3.
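The element just described can be sketched behaviorally. This is a hedged illustration, not the thesis implementation: the class name, the instruction encoding (a truth table plus one shift-register tap selection per input), and the clocking discipline are all assumptions made for the sake of a runnable model.

```python
class IDPGALutElement:
    """Behavioral sketch of one iDPGA array element: a physical 4-LUT
    with a depth-d shift register on each of its four inputs."""

    def __init__(self, depth, contexts):
        self.depth = depth
        # One shift register (list of bits, newest first) per LUT input.
        self.regs = [[0] * depth for _ in range(4)]
        # Per-context instruction: (16-bit truth table, tap per input).
        self.contexts = contexts

    def clock(self, context_id, inputs):
        """Advance one cycle: shift in the new input bits, then evaluate
        the LUT on the tap-selected (possibly retimed) values."""
        for reg, bit in zip(self.regs, inputs):
            reg.insert(0, bit)          # newest value enters position 0
            reg.pop()                   # oldest value falls off the end
        truth_table, taps = self.contexts[context_id]
        # Each input selects one of the last `depth` values it received.
        selected = [reg[tap] for reg, tap in zip(self.regs, taps)]
        index = sum(bit << i for i, bit in enumerate(selected))
        return (truth_table >> index) & 1
```

With tap 0 on every input this degenerates to an ordinary LUT; a nonzero tap reads a value that arrived that many cycles earlier, which is exactly the retiming function the registers provide.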

The input registers do place a restriction on the grouping of logical LUTs into physical LUTs which was not present in the original DPGA. Logical LUTs grouped onto the same physical LUT cannot have inputs arriving at the same input position on the same cycle. Fortunately, LUT input permutability often allows us to rearrange the inputs to avoid such potential conflicts. Nonetheless, the restriction does complicate LUT placement.

The additional resources required for this model are -additional register cells for each input and one multiplexor for each input. For a -LUT, the area then is:

Composing areas for a 4-LUT, we have:

Note here that we assume the total number of context description bits does not change. Rather, the bits that indicate which of the inputs to select are bits which have been shuffled from spatial routing to temporal routing. That is, this scheme reduces the spatial interconnect requirements by performing temporal retiming in these registers. We are assuming that the bits are shuffled from one task to another without any significant change in the overall number of bits required.

Example

Recall from Section that our ASCII hex→binary circuit could be mapped to three contexts, but could not viably be mapped to fewer contexts. By adding the -input register as suggested above, the active LUT requirements continue to decline with throughput reductions. Figure shows this same circuit mapped with varying input register depth. As the number of input registers increases from 1 to 4, the saturation point drops from 7 active LUTs to 4. Using our area model from the previous section, the , iDPGA is 5.5M, or over 3× smaller than the single context FPGA implementation at 18.4M and over 2× smaller than the smallest DPGA without input registers at 12.5M.
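The quoted savings can be checked directly from the three area figures given above (in the units of the chapter's area model):

```python
# Areas quoted in the text for the ASCII hex->binary example.
fpga_area = 18.4e6    # single-context FPGA implementation
dpga_area = 12.5e6    # smallest DPGA without input registers
idpga_area = 5.5e6    # iDPGA implementation

# The iDPGA is a bit over 3x smaller than the FPGA and a bit over
# 2x smaller than the register-less DPGA.
print(fpga_area / idpga_area)
print(dpga_area / idpga_area)
```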

Circuit Benchmarks: Input Depth

To examine the merits of input registers, we return to our throughput optimized circuit benchmarks as we originally visited in Section for DPGAs. We use the same MCNC circuit benchmark set and the same input netlists as synthesized and mapped by sis and Chortle. Again, since we are assuming here that the target criterion is throughput, both sis and Chortle netlists were synthesized in area mode. As before, no modifications to the mapping and netlist generation are made.

Mapping

As before, we divide the multi-context case into separate spatial pipeline stages such that the path length between pipeline registers is equal to the acceptable period between results. The LUTs within a phase are then evaluated in multicontext fashion using the available contexts. The main difference from Section is the cost metric for retiming. Since each LUT input can retime up to cycles, we only charge for retiming registers every temporal stages between the original source and the final destination.

When we do need to place retiming registers, they are placed in a stylized fashion. Starting from the final consumer, we walk back through the circuit toward the primary inputs, placing a retiming repeater LUT every th stage. In practice, we often have much more freedom in the placement of retiming registers, but this freedom was not exploited in our experimental mapping tools. During the annealing step, whenever the final consumer for a particular value is moved, the retiming chain is stripped out and replaced based on the consumer's new location.
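The stylized placement walk can be sketched as follows. This is a minimal illustration under assumed conventions: stages are integers, the walk starts at the final consumer, and `d` stands for the input register depth (the elided symbol in the text).

```python
def place_retiming_repeaters(source_stage, consumer_stage, d):
    """Walk back from the final consumer toward the source, placing a
    retiming repeater LUT every d-th temporal stage.  Returns the list
    of stages at which repeaters are placed.  (Illustrative sketch.)"""
    stages = []
    stage = consumer_stage - d
    while stage > source_stage:
        stages.append(stage)
        stage -= d
    return stages

# A value produced at stage 0 and consumed at stage 9, with depth-3
# input registers, needs repeaters at stages 6 and 3; a value consumed
# within d stages of its source needs none.
print(place_retiming_repeaters(0, 9, 3))
print(place_retiming_repeaters(0, 3, 3))
```

When the annealer moves the final consumer, this chain is simply recomputed from the consumer's new stage, matching the strip-and-replace behavior described above.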

After all levelization has been done, a grouping pass is performed. The grouping pass attempts to group together logical LUTs within a spatial partition to reside on one physical LUT. For a group of LUTs to be compatible, it must be possible to permute the LUTs' inputs such that no two LUTs require a different value to arrive on the same input on the same clock cycle. Rather than trying all permutations, we use a randomized, greedy placement scheme. We randomly pick which input in a LUT to place first, then greedily place it in a non-conflicting location. Other inputs within a LUT are placed sequentially after the initial random selection. The compatibility routine will make several attempts to find a satisfying assignment before declaring the grouping incompatible.
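The randomized, greedy compatibility routine might look like the following sketch. The data representation is an assumption: each logical LUT is described only by the arrival cycle of each of its inputs, and the routine, like the one described above, is heuristic and may report false negatives.

```python
import random

def compatible(luts, slots=4, attempts=8):
    """Heuristically test whether a group of logical LUTs can permute
    their inputs so that no two inputs need the same (input position,
    cycle) pair.  Each LUT is a list of input arrival cycles.
    (Sketch; a failure to find an assignment is not a proof.)"""
    for _ in range(attempts):
        used = set()                    # (slot, cycle) pairs claimed
        ok = True
        for arrivals in luts:
            taken = set()               # slots used within this LUT
            order = list(range(len(arrivals)))
            random.shuffle(order)       # random first pick, then greedy
            for i in order:
                free = [s for s in range(slots)
                        if s not in taken and (s, arrivals[i]) not in used]
                if not free:
                    ok = False
                    break
                taken.add(free[0])
                used.add((free[0], arrivals[i]))
            if not ok:
                break
        if ok:
            return True
    return False
```

Two LUTs whose inputs all arrive on the same cycle can never share a physical LUT (every input position is already claimed on that cycle), while LUTs whose inputs arrive on disjoint cycles always can.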

Grouping is performed independently on each spatial partition. The grouping routine starts by packing all the logical LUTs in a spatial partition into the minimum number of physical LUTs -- i.e. the number of physical LUTs required to implement the largest temporal stage. The attempt is made by first randomly assigning logical LUTs to physical LUTs, then randomly selecting logical LUTs to swap in order to reduce incompatibility conflicts. Swaps which do not increase the incompatibilities in the grouping are greedily accepted. Swapping continues until a satisfying set of groupings is found or the swapping runs longer than a predetermined time limit which is proportional to the number of logical LUTs which can be described in the spatial partition. When packing fails, we increment the number of target physical LUTs and retry packing.
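The greedy-swap packing loop can be sketched as below. This is a simplified stand-in for the tool described above: the conflict metric is supplied by the caller, moves relocate a single logical LUT rather than swapping pairs, and the time limit is a fixed move budget instead of one proportional to partition size.

```python
import math
import random

def pack(luts, conflicts, capacity, max_swaps=2000):
    """Pack logical LUTs into the fewest physical LUTs whose groups are
    all conflict-free.  `conflicts(group)` counts incompatibilities
    within one group; `capacity` is logical LUTs per physical LUT.
    On failure, the number of physical LUTs is grown and packing
    retried.  (Illustrative sketch of the greedy-swap pass.)"""
    target = max(1, math.ceil(len(luts) / capacity))
    while True:
        # Round-robin initial assignment (the real tool randomizes this).
        groups = [[] for _ in range(target)]
        for i, lut in enumerate(luts):
            groups[i % target].append(lut)
        cost = sum(conflicts(g) for g in groups)
        for _ in range(max_swaps):
            if cost == 0 or target < 2:
                break
            src, dst = random.sample(range(target), 2)
            if not groups[src]:
                continue
            lut = groups[src].pop(random.randrange(len(groups[src])))
            groups[dst].append(lut)
            new_cost = sum(conflicts(g) for g in groups)
            if new_cost <= cost:        # greedily accept non-worsening moves
                cost = new_cost
            else:                       # otherwise undo the move
                groups[dst].pop()
                groups[src].append(lut)
        if cost == 0:
            return groups
        target += 1                     # packing failed: add a physical LUT
```

As a toy usage, packing four logical LUTs described only by an arrival cycle, with duplicates within a group counted as conflicts, typically yields two physical LUTs each holding one cycle-0 and one cycle-1 LUT.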

In review, circuit mapping proceeds through the following steps:

  1. Technology Independent Optimization (sis)
  2. LUT Mapping (Chortle)
  3. Spatial and Temporal Levelization (simulated annealing)
  4. Physical LUT Grouping (greedy swapping with heuristic compatibility verification)

Detailed Example: alu2

Table shows the total LUTs required after retiming and packing for the alu2 benchmark mapped to provide a throughput of one result every four LUT delays. The table shows mappings for various values of and . We constrain in the current mapping software, so there are no configurations with . Up to , we see that each additional input register allows us to further reduce the total number of physical LUTs required in the implementation. Table uses the area model from Section to translate the LUT counts into areas, and Table shows the area savings versus a traditional FPGA implementation (). The , , iDPGA implementation is smallest at 43% of the area of the FPGA implementation.

Figure shows the area of the family of alu2 implementations as a function of context () and input () depth. Figure plots the areas as ratios versus the FPGA implementation. The first couple of input registers ( goes from and ) show significant gains for this benchmark. Gains diminish for greater input register depth. The best implementations are one-third the size of the FPGA implementation.

Average Characteristics

Figure shows the average area ratios across the entire benchmark set (See Table ) analogously to Figure . We see here that an input register depth of four provides almost all of the benefits of input registers, with most of the benefit realized by a depth of three, as we saw with the alu2 case in the previous section.

Figure plots area versus throughput for various context depths (), at a single value for input depth (). Here, was chosen to give the best results for low throughputs. For lower throughput values, the 5-8 context cases differ by only 10%. At the extreme of 20 clocks per result, the , case is 33.7% the size of the single context case, versus the , case which is 37.6%.

Tables through record implementation area ratios for all values of and . Each table reports implementation areas for a different fixed throughput target, analogously to Table . For the maximum throughput of one result per LUT delay, the traditional, single-context FPGA provides the best implementation. For all other cases, the multicontext implementations are always smaller than the single-context implementation. With a LUT-cycle delay in the 7-9.5 ns range, even today's "high" throughput implementations in the 30-50 MHz range are producing new results only once every 3-5 LUT delays. At these speeds 3-4 context devices are 40-50% smaller than the single context implementation. At lower throughputs, the multiple context implementations are almost one-third the size of the single-context implementation on average.

Area for Improvement

As noted previously (Sections and ), netlist mapping is oblivious of the final temporal implementations. The allocation of temporal and spatial pipeline stages is more rigid than strictly necessary. As we noted above (Section ), retiming LUTs are inserted in a stylized fashion which is not likely to be optimal. Compatibility testing is stochastic and may declare many compatible LUT groups incompatible. Consequently, tighter packing of LUTs is likely with more sophisticated mapping tools.

Other Input Retiming Models

Register File

LSM, YSE, and VEGA, like most processors, all use a register file to hold data values between production and consumption. These machines are also targeted at significantly more sequentialization (1K-8K contexts). Consequently, they manage to use only a single port into the register file. The register file organization has a more general access pattern, since any value can be written to any memory location and read from any location to any output. The generality avoids packing compatibility restrictions, allowing data to be packed more tightly into memories. However, the more general access is also significantly more expensive to support; LSM and YSE replicate the entire memory bank, storing four copies of every data value, in order to achieve four read ports. The restriction to a single write port is a simplification which these machines use in order to make the register file implementation viable.

Time Matching

Instead of shifting data through a continually advancing shift register, we can make each of the input registers take its value from the input line, loading it at a specified time. In this scheme, the input registers hold a finite number of values (), but are not limited to only the last values. Such a scheme requires a unit to match input times, making each input larger than in the iDPGA, but the increased range and packing density relaxes timing constraints on data arrival, which is useful for simplifying the task of physical mapping. This is the scheme used by TSFPGA, and it will be explored more fully in the following chapter.
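The contrast with the shift-register scheme can be made concrete with a small sketch. All names here are assumptions for illustration: each register in the bank is programmed with a load time and captures the input line only on the cycle that matches it, so the stored values need not be the most recent ones.

```python
class TimeMatchedInput:
    """Sketch of one time-matched input bank: each register loads from
    the input line on its own programmed cycle, rather than shifting
    every cycle as in the iDPGA."""

    def __init__(self, load_times):
        self.load_times = load_times        # programmed match times
        self.values = [0] * len(load_times)

    def clock(self, cycle, line_value):
        for i, t in enumerate(self.load_times):
            if t == cycle:                  # time-match comparison
                self.values[i] = line_value

    def read(self, slot):
        return self.values[slot]

# Registers programmed to capture cycles 2 and 5 hold those two line
# values indefinitely, regardless of what arrives in between.
reg = TimeMatchedInput([2, 5])
for cycle, v in enumerate([9, 8, 7, 6, 5, 4]):
    reg.clock(cycle, v)
print(reg.read(0), reg.read(1))
```

The per-register comparator is the extra cost relative to a shift register; the payoff is that arrival times no longer have to fall within the last few cycles before consumption.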

Summary

Typical tasks require two different kinds of data transport: spatial transport, to move data from the processing element that generated it to the ones which will consume it, and temporal transport, to carry data from the time when it is generated to the times when it is consumed. It is inefficient to tie up expensive spatial transport resources, such as wires and switches, to perform a temporal transport task. Tasks such as circuit evaluation have sufficient requirements for temporal transport that input retiming registers are clearly a worthwhile architectural feature to include in a multicontext device. Implementations with multiple retiming registers are more compact than implementations with no additional retiming resources.

As with multiple contexts, the extent to which we can save area with deep input registers depends on the area ratio between the active interconnect and the retiming registers. Here, we assumed the ratio between active area and instruction area was 10:1 (800K:78K), as in the previous chapter. We assumed the ratio between the active area and context area, including both instruction and retiming, was roughly 8:1 (800K:104K). At these ratios, 4-5 context iDPGA implementations were, on average, half to one-third the size of the single context alternative.

The best implementation varies with target throughput. At these size ratios, the , case is moderately good across throughput ranges. It is only worse than the single context implementation at the highest throughput, and is within 20% of the best implementation at the lowest throughput measured here.

Review

In the development since Chapter , we have seen that the area required to implement a general-purpose computational task is composed of four parts:

  1. Active interconnect area
  2. Active computational processing element area
  3. Task description (instruction storage) area
  4. Intermediate value storage for temporal retiming
While traditional FPGA architectures have a one-to-one mapping between these components, this resource ratio is neither necessary nor efficient. We further saw that active interconnect area is, by far, the largest single component of this area, while task description and value storage areas are small in comparison.

For a given computational task, we saw that the requirements for each of these four parts arise from different sources. The number of instructions required to describe the task and the number of intermediates held during computation arise from the basic computational task itself. The size of the active interconnect and processing is dictated by the task's target throughput. For the highest possible throughput, the conventional FPGA strategy of allocating a single instruction to each piece of active interconnect and processing is an efficient allocation of resources. However, as throughput requirements drop below this extreme, multicontext implementations compress the implementation into less space by sharing and reusing a smaller number of active resources. This sharing increases the ratio of instructions and intermediates to active resources. DPGAs are the practical implementation of such a sharing scheme, assigning multiple instructions and multiple intermediate values to each active resource.

Note that the amount of compressibility we achieve with DPGAs is critically dependent upon how small we can make the non-active residue. That is, when we remove active interconnect and processing elements, we are left with the instruction and the intermediate values. The amount of area savings we can realize depends on how much smaller the space to hold instructions and intermediates is than the space for the active area necessary to actually process the instruction and its data. It is this effect which motivates our interest in reducing the number of bits used to describe each instruction (Section ) and in reducing the area required to store those bits (e.g. DRAM context implementations in the DPGA prototype -- Section ).

It is also worthwhile to note that the style of compression used in the last two chapters (Chapters and ) makes instructions and data readily accessible and is largely independent of task structure. While densely encoded instructions need some decoding, each instruction is encoded separately so that it can be stored locally and used immediately upon being read. If we are willing to pay additional access latency and work with variable size encodings, block and structure-based encoding schemes can be used, making it possible to compress the instruction requirements further. Ultimately, the minimum task description area will depend on the descriptive complexity of the task (See Section ). Exploiting structure such as data widths, operation commonality, and task recurrence requires more general instruction distribution datapaths and more sequential decoding of task instructions. Nonetheless, variants on these techniques may be valuable in further compressing instruction and data residues and hence reducing task implementation size.


André DeHon <andre@mit.edu> Reinventing Computing MIT AI Lab