
Dynamically Programmable Gate Arrays

In Chapter we demonstrated that if we settle on a single word width, we can select a robust context depth such that the area required to implement any task on the architecture with that fixed context depth is at most twice the area of an architecture using the optimal context depth for the task. Further, for single-bit granularities, the model predicted a robust context depth well above one. In contrast, the primary, conventional, general-purpose devices with independent, bit-level control over each bit-processing unit are Field-Programmable Gate Arrays (FPGAs), which hold only a single instruction context per bit-processing element. Our analysis from Chapter suggests that we can often realize more compact designs with multicontext devices. Figure shows the yielded efficiency of a 16-context, single-bit granularity device for comparison with Figure , emphasizing the broader range of efficiency for these multicontext devices.

In this chapter, we introduce Dynamically Programmable Gate Arrays (DPGAs), fine-grained, multicontext devices which are often more area efficient than FPGAs. The chapter features:

DPGA Introduction

The DPGA is a multicontext, fine-grained computing device. Initially, we assume a single control stream. Each compute and interconnect resource has its own small memory for describing its behavior (See Figure ). These instruction memories are read in parallel whenever a context (instruction) switch is indicated.

The DPGA exploits two facts:

  1. The description of an operation is much smaller than the active area necessary to perform the operation.
  2. It is seldom necessary to evaluate every gate or bit computation in a design simultaneously in order to achieve the desired task latency or throughput.

To illustrate the issue, consider the task of converting an ASCII Hex digit into binary. Figure describes the basic computation required. Assuming we care about the latency of this operation, a mapping which minimizes the critical path length using SIS [SSL +92] and Chortle [Fra92] has a path length of 3 and requires 21 4-LUTs. Figure shows the LUT mapping both in equations and circuit topology.
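For reference, a minimal behavioral sketch of the conversion itself (not the SIS/Chortle LUT mapping):

    # Behavioral sketch of the ASCII Hex -> binary task; c<7:0> is the ASCII
    # code of the incoming character, as in the text.
    def ascii_hex_to_binary(char):
        c = ord(char)                      # c<7:0>
        if ord('0') <= c <= ord('9'):
            return c & 0x0F                # '0'-'9': low nibble is the value
        return (c & 0x0F) + 9              # 'A'-'F'/'a'-'f': 0x1..0x6 -> 10..15

    assert ascii_hex_to_binary('7') == 7
    assert ascii_hex_to_binary('C') == 12
    assert ascii_hex_to_binary('e') == 14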

Traditional Pipelining for Throughput

If we cared only about achieving the highest throughput, we would fully pipeline this implementation so that it took in a new character on each cycle and output its encoding three cycles later. This pipelining would require an additional 7 LUTs to retime data which is needed more than one pipeline stage after it is generated (i.e., 4 to retime c<3:0> for presentation to the second stage and 3 to retime c<3>, c<1>, and i1 for presentation to the final stage -- See Figure ). Consequently, we effectively evaluate a design with 21 4-LUTs using 28 physical 4-LUTs. Typical LUT delay, including a moderate amount of local interconnect traversal, is 7 ns (See Table ). Assuming this is the only limit to cycle time, the implementation could achieve 140 MHz operation. Notice that the only reason we needed any more LUTs or LUT descriptions than strictly required by the task description was to perform signal retiming based on the dependency structure of the computation. Using the FPGA area model from the previous chapter, an FPGA LUT in a large array occupies a fixed amount of area, so this implementation pays for 28 full LUT areas. The arithmetic behind these figures is sketched below.
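A quick check of the arithmetic above:

    # Back-of-the-envelope check of the fully pipelined numbers.
    design_luts   = 21           # from the SIS/Chortle mapping
    retiming_luts = 7            # 4 for c<3:0> plus 3 for c<3>, c<1>, i1
    physical_luts = design_luts + retiming_luts          # 28 4-LUTs
    t_lut_ns      = 7.0          # LUT + local interconnect delay (Table)
    throughput_mhz = 1000.0 / t_lut_ns                   # ~143 MHz, quoted as 140 MHz
    latency_ns     = 3 * t_lut_ns                        # 3-stage pipeline -> 21 ns
    print(physical_luts, round(throughput_mhz), latency_ns)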

Multicontext Implementation -- Temporal Pipelining

If, instead, we cared about the latency, but did not need 140 MHz operation, we could use a multicontext device with 3 LUT descriptions per active element. To achieve the target latency of 3 LUT delays, we need enough active LUTs to implement the largest stage -- the middle one. If the inputs are arriving from some other circuit which is also operating in multicontext fashion, we must retime them as before (Figure ). Consequently, we require 3 extra LUTs in the largest stage, making for a total of 12 active LUTs. Note that the 4 retiming LUTs added to stage 1 also bring its total LUT usage up to 12 LUTs. We end up implementing the design with 12 active LUTs evaluated over 3 contexts. If c<7:0> were inputs which did not change during these three cycles, we would only need one extra retiming LUT in stage 2 for i1, allowing us to use 10 active LUTs.

The multicontext LUT is slightly larger due to the extra contexts. Two additional contexts add 160K to the LUT area, making for a modestly larger per-LUT area. The multicontext implementation thus requires 12 of these slightly larger LUTs; in contrast, a non-pipelined, single-context implementation would require 21 ordinary LUTs and correspondingly more total area.

If we assume that we can pipeline the configuration read, the multicontext device can achieve comparable delay per LUT evaluation to the single context device. The total latency then is 21 ns, as before. The throughput at the 7 ns clock rate is 48 MHz. If we do not pipeline the configuration read, as was the case for the DPGA prototype (Section ), the configuration read adds another 2.5 ns to the LUT delay, making for a total latency of 28.5 ns and a throughput of 35 MHz.

General Observations

We were able to realize these area savings because the single-context device had to deploy active compute and interconnect area for each portion of the task even though the task only required a small number of active elements at any point in time. In general, two components combine to define the requisite area for a computational device: the active compute and interconnect area, and the memory area for storing instruction (configuration) descriptions.

In an ideal packing, a computation requiring a given number of active compute elements and a larger total number of 4-LUT descriptions can be implemented in an area equal to the active area for the compute elements plus the instruction memory for all of the descriptions.

This relation is a simplification of our area model (Equation ). A sketch using typical values of the kind suggested in the previous chapter follows.
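The elided expression presumably has this additive shape; the sketch below assumes placeholder per-LUT values (800K of fixed active area and 80K per stored context, the latter consistent with the 160K quoted above for two extra contexts), not the thesis's exact model numbers:

    # Ideal-packing area: n_active elements supply the active compute and
    # interconnect area, and every one of the n_total 4-LUT descriptions
    # needs instruction memory.
    def ideal_area(n_active, n_total, a_active=800e3, a_ctx=80e3):
        return n_active * a_active + n_total * a_ctx

    # e.g. the 21-LUT converter ideally packed onto 7 active LUTs, 3 contexts:
    print(round(ideal_area(7, 21) / ideal_area(21, 21), 2))   # ~0.39 of single-context area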

In practice, a perfect packing is difficult to achieve due to connectivity and dependency requirements, so more configuration memory is required than the ideal packing suggests. In the previous example, retiming and packing constraints forced us to use 12 active LUTs rather than the ideal 7 for the three-context case. In fact, with the model described so far, retiming requirements prevent us from implementing this task on any fewer than 12 active LUTs. Retiming requirements are one of the main obstacles to realizing the full, ideal benefits. We will see retiming effects more clearly when we look at circuit benchmarks in Section .

Related Architectures

Several hardware logic simulators have been built which share a similar execution model to the DPGA. These designs were generally motivated by the desire to reduce the area required to emulate complex designs and, consequently, took advantage of the fact that task descriptions are small compared to their physical realizations in order to increase logic density.

The Logic Simulation Machine [BLMR83] and, later, the Yorktown Simulation Engine (YSE) [Den82] were the earliest such hardware emulators. The YSE was built out of discrete TTL and MOS memories, requiring hundreds of components for each logic processor. Processors had an 8K-deep instruction memory, 128-bit instructions (once processor-to-processor interconnect is included), and produced two results per cycle. The YSE design supported arrays of up to 256 processors, with a single controller running the logic processors in lock step, and a full 256×256, 2-bit wide crossbar.

The Hydra processor, which Arkos Designs developed for their Pegasus hardware emulator, is a closer cousin to the DPGA [Mal94]. They integrate 32 bit processors, each with 16 contexts, on each Hydra chip. The logic function is an 8-input NAND with programmable input inversions.

VEGA uses 1K-2K context memories to achieve a 7× improvement in logic description density over single-context FPGAs. With such deep context memories, VEGA is optimized to be efficient for very large ratios of total task size to active compute elements and can be quite inefficient for regular, high-throughput tasks. With a separate controller per processor, Equation predicts that a VEGA processing element will be about 8.5× smaller than the 2048 single-context processing elements which it emulates -- so the area savings realized by VEGA is quite consistent with the area model developed in Chapter .

Hydra and VEGA were developed independently of and concurrently with the DPGA, which was first described in [BDK94].

Dharma [Bha93][BCK93] was designed to solve the FPGA routing problem. Logic is evaluated in strict levels, similar to the scheme used for circuit evaluation in Section , with one gate-delay evaluation per cycle. Dharma is based on a few monolithic crossbars which are reused at each level of logic. Once gates have been assigned to evaluation levels, the full crossbar makes placement and routing trivial. While this arrangement is quite beneficial for small arrays, the growth rate of the full crossbar makes this scheme less attractive for large arrays, as we saw in Section .

Realm of Application

DPGAs, as with any general-purpose computing device supporting the rapid selection among instructions, are beneficial in cases where only a limited amount of functionality is needed at any point in time, and where it is necessary to rapidly switch among the possible functions needed. That is, if we need all the throughput we can possibly get out of a single function, as in the fully pipelined ASCII Hex→binary converter in Section , then an FPGA, or other purely spatial reconfigurable architecture, will handle the task efficiently. However, when the throughput requirements for a function are limited or the function is needed only intermittently, a multicontext device can provide a more efficient implementation. In this section, we look at several commonly arising situations where multicontext devices are preferable to single-context devices, including:

We also briefly revisit instruction bandwidth to see why partial reconfiguration, alone, is not an adequate substitute for a multicontext device in many of these tasks.

Limited Throughput Requirements

Often the system environment places limits on the useful throughput for a subtask. As we saw in the introduction to this chapter, when the raw device supports a higher throughput than that required from the task, we can share the active resources in time among tasks or among different portions of the same task.

Relative Processing Speeds

Most designs are composed of several sub-components or sub-tasks, each performing a task necessary to complete the entire application (See Figure ). The overall performance of the design is limited by the processing throughput of the slowest device. If the performance of the slowest device is fixed, there is no need for the other devices in the system to process at substantially higher throughputs.

In these situations, reuse of the active silicon area on the non-bottleneck components can improve performance or lower costs. If we are getting sufficient performance out of the bottleneck resource, then we may be able to reduce cost by sharing the gates on the non-bottleneck resources between multiple ``components'' of the original design (See Figure ). If we are not getting sufficient performance on the bottleneck resource and its task is parallelizable, we may be able to employ underused resources on the non-bottleneck components to improve system performance without increasing system cost (See Figure ).

Fixed Functional Requirements

Many applications have fixed functional requirements. Input processing on sensor data, display processing, or video processing all have task defined processing rates which are fixed. In many applications, processing faster than the sample or display rate is not necessary or useful. Once we achieve the desired rate, the rest of the ``capacity'' of the device is not required for the function. With reuse of active silicon, the residual processing capacity can be employed on other computations.

I/O Latency and Bandwidth

Device I/O bandwidth often acts as a system bottleneck, limiting the rate at which data can be delivered to a part. This, in turn, limits the useful throughput we can extract from the internal logic. Even when the I/O pins are heavily reused ( e.g. [BTA93]), components often have less I/O throughput than they have computational throughput. Reviewing technology costs, we expect this bottleneck to only get worse over time.

When data throughput is limited by I/O bandwidth, we can reuse the internal resources to provide a larger effective internal gate capacity. This reuse decreases the total number of devices required in the system. It may also help lower the I/O bandwidth requirements by grouping larger sets of interacting functions on each IC.

Latency Limited Designs

Some designs are limited by latency not throughput. Here, high throughput may be unimportant. Often it is irrelevant how quickly we can begin processing the next datum if that time is shorter than the latency through the design. This is particularly true of applications which must be serialized for correctness ( e.g. atomic actions, database updates, resource allocation/deallocation, adaptive feedback control).

By reusing gates and wires, we can use device capacity to implement these latency-limited operations with fewer resources than would be required without reuse. This will allow us to use smaller devices to implement a function or to place more functionality onto each device.

Cyclic dependencies

Some computations have cyclic dependencies such that they cannot continue until the result of the previous computation is known. For example, we cannot reuse a multiplier when performing exponentiation until the previous multiply result is known. Finite state machines (FSMs) also have the requirement that they cannot begin to calculate their behavior in the next state, until that state is known. In a purely spatial implementation, each gate or wire performs its function during one gate delay time and sits idle the rest of the cycle. Active resource reuse is the most beneficial way to increase utilization in cases such as these.

Temporally Varying or Data Dependent Functional Requirements

Another characteristic of finite state machines is that the computational task varies over time and as a function of the input data. At any single point in time, only a small subset of the total computational graph is needed. In a spatial implementation, all of the functionality must be implemented simultaneously, even though only small subsets are ever used at once. This is a general property held by many computational tasks.

Many tasks may perform quite different computations based on the kind of data they receive. A network interface may handle packets differently based on packet type. A computational function may handle new data differently based on its range. Data objects of different types may require widely different handling. Rather than providing separate, active resources for each of these mutually exclusive cases, a multicontext device can use a minimum amount of active resources, selecting the proper operational behavior as needed.

Multicontext versus Monolithic and Partial Reconfiguration

Multicontext devices are specifically tailored to the cases where we need a limited amount of active functionality at any point in time, but we need to be able to select or change that functionality rapidly. This rapid switching is necessary to obtain reasonable performance for the kinds of applications described in this section. This requirement makes reconfigurations from a central memory pool, on or off chip, inadequate.

In this section, we draw out this point, reviewing the application domains identified in the previous section. We also look at cases where one can get away without multicontext devices. At the end of this section, we articulate a reconfiguration rate taxonomy which allows us to categorize both device architectures and applications.

Tasks with limited throughput requirements

As we discussed in Section , tasks with limited throughput requirements can be implemented in less area using multicontext devices. If we placed the configuration contexts off-chip, the context-switch rate would be paced by the limited bandwidth into configuration memory. Returning to our ASCII Hex→binary converter, in the three-context case we would have to reload 12 LUT instructions between contexts 1 and 2, 4 between contexts 2 and 3, and 12 between contexts 3 and 1. If we assume a 500MB/s RAMBUS I/O port [Ram93] operating at peak burst performance, we can load one byte every 2 ns. Assuming 64 configuration bits (8 bytes) per 4-LUT, as in Section and Chapter , the evaluation time is dominated by these reloads (see the sketch below). Such a solution is simultaneously: (1) over an order of magnitude slower than the multicontext implementation, which operated at 21-28 ns, and (2) over two orders of magnitude larger when you consider the 500-700M occupied by a 4Mb RAMBUS DRAM. Arguably, the DRAM could be smaller than 4Mb, but it is not economical to build, package, and sell such small memories. Further, notice that this is a tiny subtask with 12 active LUTs, while reasonably sized FPGAs contain hundreds to thousands of LUTs, making the reconfiguration time orders of magnitude slower still. As noted in Chapter , reconfiguration bandwidth limitations will dictate the rate of operation rather than the circuit path length.
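A sketch of the elided reload-time arithmetic under the stated assumptions:

    # Off-chip reload time for the three-context converter, assuming 64
    # configuration bits (8 bytes) per 4-LUT and a 2 ns/byte burst rate.
    bytes_per_lut = 8
    ns_per_byte   = 2.0
    luts_reloaded = 12 + 4 + 12            # contexts 1->2, 2->3, 3->1
    reload_ns     = luts_reloaded * bytes_per_lut * ns_per_byte   # 448 ns
    compute_ns    = 3 * 7.0                # the three LUT-delay evaluation itself
    print(reload_ns + compute_ns)          # ~469 ns, >10x the 21-28 ns multicontext case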

Latency limited tasks

The same effect described above occurs in latency limited designs. If we want to save real-estate by reusing active area, the time to load in the next instruction may pace operation. Off-chip memory, or an on-chip central memory pool, will suffer from the memory bandwidth bottleneck just noted.

Data varying logical functions

In finite-state machines, or other tasks which may change the function they perform at each point in time based on the data arriving, this reconfiguration latency also determines cycle time. Many tasks will exhibit the characteristics identified here -- in response to a new data item, hundreds of LUT instructions must be loaded before the actual task, which may take only a few LUT delays to evaluate, can be performed.

Infrequent temporal change

Of course, if the distinct pieces of functionality required change only infrequently, and the application can operationally tolerate long reload latencies, then off-chip reconfigurations may be acceptable and efficient. For example, the UCLA configurable computing system for automatic target recognition [VSCZ96] takes advantage of the fact that a loaded correlation configuration can be used against an entire image segment before a new correlation is required. With 128×128 pixel images, a complete filter match of a 16×16 correlation template across the full image requires roughly 16K correlations, amounting to 16K clock cycles on the correlator. Operating at a 60 ns clock rate, this full correlation takes roughly 1 ms. The conventional FPGA actually used for the UCLA implementation, a Xilinx XC4010, takes 10 ms to reload its configuration [Xil94b]. However, as we noted in Section , the sparse encoding used by conventional devices makes them excessively slow at reconfiguration. Assuming a RAMBUS reconfiguration port and 64 bits per 4-LUT, the 1600 4-LUTs on the XC4010 could be reloaded in roughly 26 μs. Here, the reload time is small compared to the loaded context operating time of roughly 1 ms, such that reload has a small effect on the rate of operation. In fact, as the UCLA paper notes, when the next context is predictable in advance and the reload time is smaller than the operating time, a two-context FPGA would be able to completely overlap the loading of the next configuration with the operation in the current configuration.
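A rough check of the reload estimate under the same assumptions:

    # XC4010 reload under a dense encoding: 64 bits = 8 bytes per 4-LUT at 2 ns/byte.
    luts       = 1600
    reload_us  = luts * 8 * 2e-3           # ns -> us: 25.6 us
    operate_us = 1000.0                    # ~1 ms of correlations per loaded context
    print(round(reload_us, 1), round(100 * reload_us / operate_us, 1))  # ~2.6% overhead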

Large-grain, data-dependent blocks

Similarly, when performing data dependent computations and the type of data changes slowly compared to the processing rate, long reconfiguration times might be acceptable. For example, a video display which can handle different video data formats ( e.g. PAL, NTSC, MPEG-1, MPEG-2, HDTV), will only have to process and display one kind of video stream at a time. For human consumption, it will typically display the same data stream for a long time and the 10's of milliseconds of latency it may take to load the configuration with the appropriate display engine would not be noticeable to the human observer.

Minor configuration edits

Sometimes configurations need only minor edits in order to evolve over time or to be properly configured for different data types. For example, an n-character text matching filter may only require the configurations of roughly n/4 4-LUTs to change in order to handle a different n-character search target. If these represent only a small portion of the entire configuration, the reconfiguration can be described as an edit on the existing configuration with less bandwidth than a full context reload. In cases like this where the edits are small, partial reconfiguration -- the ability to efficiently change small portions of the configuration while leaving the rest in place -- may be adequate to reduce context switch bandwidths sufficiently to keep reload latency low. We see partial reconfiguration support in modern devices from Plessey [Ple90], Atmel [Atm94], and Xilinx [Xil96] to support configuration edits such as this.

Reconfiguration Rate Taxonomy

From the above, we see three cases for configuration management based on the rate at which the task requires distinct pieces of functionality and the rate at which it is efficient to change the configuration applied to the active processing elements:

  1. Static -- the configuration does not change within an operational epoch
  2. Quasistatic -- the configuration changes slowly compared to the rate of operation upon data
  3. Dynamic -- configuration changes at the same rate as data, potentially on a cycle-by-cycle basis

We can further subdivide configuration management capabilities of architecture and application requirements based on whether they can take advantage of limited bandwidth configuration edits:

  1. Atomic -- the vector of instructions across the array must change all at once
  2. Non-atomic -- small subsets of the array instructions can be changed independently
Strictly speaking, the atomicity of configuration changes is orthogonal to the rate of reconfiguration. For statically configured applications, the atomicity of reload is irrelevant since the context does not change. The atomicity is most relevant for quasistatic configuration changes since those are the cases which benefit from reduced bandwidth requirements. Dynamic architectures can change their active instruction on a cycle-by-cycle basis so non-atomic changes do not allow an array-wide context switch to occur any faster. However, edits to the non-active contexts on dynamic architectures may still benefit from the bandwidth reduction enabled by non-atomic updates.

A Prototype DPGA

Jeremy Brown, Derrick Chen, Ian Eslick, and Edward Tau started the first-generation DPGA prototype while they were taking MIT's introductory VLSI course (6.371) during the Fall of 1994. The chip was completed during the Spring of 1995 with additional help from Ethan Mirsky. Andre DeHon helped the group hash out the microarchitecture and oversaw the project. The prototype was first presented publicly in [TEC +95]. A project report containing lower-level details is available as [BCE +94].

In this section, we describe this prototype DPGA implementation. The design represents a first generation effort and contains considerable room for optimization. Nonetheless, the design demonstrates the viability of DPGAs, underscores the costs and benefits of DPGAs as compared to traditional FPGAs, and highlights many of the important issues in the design of programmable arrays. The fabricated prototype did have one timing problem which prevented it from functioning fully, but our post mortem analysis suggests that the problem is easily avoidable.

Our DPGA prototype features:

We begin by detailing our basic DPGA architecture in Section . Section provides highlights from our implementation including key details on our prototype DPGA IC. In Section , we describe several aspects of the prototype's operation. Section extracts a DPGA area model based on the prototype implementation. Section closes out this section on the DPGA prototype by summarizing the major lessons from the effort.

Architecture

Figure depicts the basic architecture for this DPGA. Each array element is a conventional 4-input lookup table (4-LUT). Small collections of array elements, in this case 4×4 arrays, are grouped together into subarrays. These subarrays are then tiled to compose the entire array. Crossbars between subarrays serve to route inter-subarray connections. A single, 2-bit, global context identifier is distributed throughout the array to select the configuration for use. Additionally, programming lines are distributed to read and write the configuration memories.

DRAM Memory

The basic memory primitive is a 4×32-bit DRAM array which provides four context configurations for both the LUT and interconnection network (See Figure ). The memory cell is a standard three-transistor DRAM cell. Notably, the context memory cells are built entirely out of N-well devices, allowing the memory array to be packed densely, avoiding the large cost for N-well to P-well separation. The active context data is read onto a row of standard, complementary CMOS inverters which drive the LUT programming and selection logic.

Array Element

The array element is a 4-LUT which includes an optional flip-flop on its output (Figure ). Each array element contains a context memory array; for our prototype, this is the 4×32-bit memory described above. 16 bits provide the LUT programming, 12 configure the four 8-input multiplexors which select each input to the 4-LUT, and one selects the optional flip-flop. The remaining three memory bits are presently unused.
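For illustration, a hypothetical packing of one 32-bit array-element context word along the lines described above; the actual bit positions in the prototype may differ:

    # Illustrative 32-bit context word: 16 LUT bits, four 3-bit input selects,
    # one flip-flop select, three unused bits.  Field layout is made up.
    def pack_context(lut_bits, input_sel, use_ff):
        assert 0 <= lut_bits < (1 << 16) and len(input_sel) == 4
        word = lut_bits                       # bits [15:0]  LUT truth table
        for i, sel in enumerate(input_sel):   # bits [27:16] four 8:1 mux selects
            assert 0 <= sel < 8
            word |= sel << (16 + 3 * i)
        word |= (1 if use_ff else 0) << 28    # bit  [28]    optional output register
        return word                           # bits [31:29] unused in the prototype

    print(hex(pack_context(0xABCD, [3, 7, 0, 5], use_ff=True)))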

Subarrays

The subarray organizes the lowest level of the interconnect hierarchy. Each array element output is run vertically and horizontally across the entire span of the subarray (Figure ). Each array element can, in turn, select as an input the output of any array element in its subarray which shares the same row or column. This topology allows a reasonably high degree of local connectivity.

This leaf topology is limited to moderately small subarrays since it ultimately does not scale. The row and column widths remain fixed regardless of array size, so the horizontal and vertical interconnect would eventually saturate the row and column channel capacity if the topology were scaled up. Additionally, the delay on the local interconnect increases with each additional element in a row or column. For small subarrays, there is adequate channel capacity to route all outputs across a row and column without increasing array element size, so the topology is feasible and desirable. Further, the additional delay for the few elements in the row or column of a small subarray is moderately small compared to the fixed delays in the array element and routing network. In general, the subarray size should be carefully chosen with these properties in mind.

Non-Local Interconnect

In addition to the local outputs which run across each row and column, a number of non-local lines are also allocated to each row and column. The non-local lines are driven by the global interconnect (Figure ). Each LUT can then pick inputs from among the lines which cross its array element. In the prototype, each row and column supports four non-local lines. Each array element could thus pick its inputs from eight global lines, six row and column neighbor outputs, and its own output. Each input is configured with an 8:1 selector as noted above (Figure ). Of course, not all combinations of 15 inputs taken 4 at a time are available with this scheme. The inputs are arranged so any combination of local signals can be selected along with many subsets of global signals. Freedom available at the crossbar in assigning global lines to tracks reduces the impact of this restriction, but complicates placement.

Local Decode

Row select lines for the context memories are decoded and buffered locally from the 2-bit context identifier. A single decoder services each row of array elements in a subarray. One decoder also services the crossbar memories for four of the adjacent crossbars. In our prototype, this placed five decoders in each subarray, each servicing four array element or crossbar memory blocks for a total of 128 memory columns. Each local decoder also contains circuitry to refresh the DRAM memory on contexts which are not being actively read or written.

Global Interconnect

Between each pair of adjacent subarrays, a pair of crossbars routes the subarray outputs from one subarray into the non-local inputs of the other. Note that all array element outputs are available on all four sides of the subarray. In our prototype, this means that each crossbar is a 16×8 crossbar which routes 8 of the 16 outputs to the neighboring subarray's 8 inputs on that side (Figure ). Each 16×8 crossbar is backed by a 4×32 DRAM array to provide the 4 context configurations. Each crossbar output is configured by decoding 4 configuration bits to select among the 16 crossbar input signals.

While the nearest-neighbor interconnect is sufficient for the 3×3 array in the prototype, a larger array should include a richer interconnection scheme among subarrays. At present, we anticipate that a mesh with bypass structure with hierarchically distributed interconnect lines will be appropriate for larger arrays.

Programming

The programming port makes the entire array look like one large, 32-bit wide, synchronous memory. The programming interface was designed to support high-bandwidth data transfer from an attached processor and is suitable for applications where the array is integrated on the processor die. Any non-active context may be written during operation. Read back is provided in the prototype primarily for verification.

Implementation

The DPGA prototype is targeted for a 1 μm drawn, 0.85 μm effective gate length CMOS process with 3 metal layers and silicided polysilicon and diffusion. Table highlights the prototype's major characteristics. Figure shows the fabricated die, and Figure shows a closeup of the basic subarray tile containing a 4×4 array of LUTs and four inter-subarray crossbars. Table summarizes the areas for the constituent parts.

Table breaks down the chip area by consumer. In Table , configuration memory is divided between that supporting the LUT programming and that supporting interconnect. All together, the configuration memory accounts for 33% of the total die area, or 40% of the area used on the die. The network area, including local interconnect, wiring, switching, and network configuration area, accounts for 66% of the die area, or 80% of the area actually used on the die. Leaving out the configuration memory, the fixed portion of the interconnect area is 45% of the total area, or over half of the active die area.

Layout Inefficiencies

The prototype could be packed more tightly since it has large blank areas and large areas dedicated to wire routing. A more careful co-design of the interconnect and subarray resources would eliminate much or all of the unused space between functional elements. Most of the dedicated wiring channels are associated with the local interconnect within a subarray. With careful planning, it should be possible to route all of these wires over the subarray cells in metal 2 and 3. As a result, a careful design might be 40-50% smaller than our first generation prototype.

Memory

Area

From the start, we suspected that memory density would be a large determinant of array size. Table demonstrates this to be true. In order to reduce the size of the memory, we employed a 3-transistor DRAM cell design as shown in Figure . To keep the aspect ratio of the 4×32 memory small, we targeted a very narrow DRAM column (See Figure ).

Unfortunately, this emphasis on aspect ratio did not allow us to realize the most area-efficient DRAM implementation (See Table ). In particular, our DRAM cell was 7.6μm×19.2μm, or almost 600λ². A tight DRAM cell should have been 75-80μm², or about 300λ². Our tall and thin DRAM was via and wire limited and hence could not be packed as area-efficiently as a more square DRAM cell.

One key reason for targeting a low aspect ratio was to balance the number of interconnect channels available in each array element row and column. However, with 8 interconnect signals currently crossing each side of the array element, we are far from being limited by saturated interconnect area. Instead, array element cell size is largely limited by memory area. Further, we route programming lines vertically into each array element memory. This creates an asymmetric need for interconnect channel capacity since the vertical dimension needs to support 32 signals while the horizontal dimension need only support a dozen memory select and control lines.

For future array elements we should optimize memory cell area with less concern about aspect ratio. In fact, the array element memory can easily be split in half with 16 bits above the fixed logic in the array element and 16 below. This rearrangement will also allow us to distribute only 16 programming lines to each array element if we load the top and bottom 16 bits separately. This revision does not sacrifice total programming bandwidth if we load the top or bottom half of a pair of adjacent array elements simultaneously.

Table further decomposes memory area percentages by function. We have already noted that a tight DRAM cell would be half the area of the prototype DRAM cell and an SRAM cell would be twice as large. Using these breakdowns and assuming commensurate savings in proportion to memory cell area, the tight DRAM implementation would save about 7% total area over the current design. An SRAM implementation would be, at most, 15% larger. In practice, the SRAM implementation would probably be only 5-10% larger for a 4-context design since the refresh control circuitry would no longer be needed. Of course, as one goes to greater numbers of contexts, the relative area differences for the memory cells will provide a larger contribution to overall die size.

Memory Timing

The memory in the fabricated prototype suffered from a timing problem due to the skew between the read precharge enable and the internal write enable. As shown in Figure , the read bus is precharged directly on the high edge of the clock signal clk. The internal write enable, iwe, controls write-back during refresh. iwe and the write enable signals, we<4:0>, are generated by the local decoder and driven across an entire row of four array elements in a subarray, which makes for a 128-bit wide memory. Both iwe and we<4:0> are pipelined signals which transition on the rising edge of clk. On the rising edge of clk, we have a race between the turn-on of the precharge transistor and the turn-off of iwe and we<4:0>. Since clk directly controls the precharge transistor, precharge begins immediately. However, since iwe and we<4:0> are registered, it takes a clock-to-q delay before they can begin to change. Further, since there are 128 consumers spread across the 1100λ span of the row, the signal propagation time across the subarray is non-trivial. Consequently, it is possible for the precharge to race through write enables left on at the end of the previous cycle and overwrite memory. This problem is most acute for the memories which are farthest from the local decoders.

Empirically, we noticed that the memories farthest from the local decoder lost their values after short time periods. In the extreme cases of the input and output pads, which were often very far from their configuration memories, the programmed values were overwritten almost immediately. The memories closer to the local decoder were more stable. The array elements adjacent to the decoders were generally quite reliable. After identifying this potential failure mode, we simulated explicit skew between clk and the write enables in SPICE. In simulation, the circuit could tolerate about 1.5 ns of skew between clk and the write enables before the memory values began to degrade.

We were able to verify that refresh was basically operational. By continually writing to a single context, we can starve other contexts from ever refreshing. When we forced the chip into this mode, data disappeared from the non-refreshed memories very quickly. The time constant on this decay was significantly different from the time constants observed due to the timing skew, giving us confidence that the basic refresh scheme worked aside from the skew problem.

Obviously, the circuit should have been designed to tolerate this kind of skew. A simple and robust solution would have been to disable the refresh inverter or the writeback path directly with clk to avoid simultaneously enabling both the precharge and writeback transistors. Alternately, the precharge could have been gated and distributed consistently with the write enables.

Crossbar Implementation

To keep the configuration memory small, the crossbar enables were stored encoded in configuration memory and then decoded for crossbar control. The same 4×32 memory used for the array element was used to control each 16×8 crossbar. Note that the entire memory is 128K. The crossbar itself is 535K, making the pair 660K. Had we not encoded the crossbar controls, the crossbar memory alone would have occupied 512K before we consider the crossbar itself. This suggests that the encoding was marginally beneficial for our four-context case, and would be of even greater benefit for greater numbers of contexts. For fewer contexts, the encoding would not necessarily be beneficial.
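The bit-count arithmetic behind this trade-off; the area figures above track these configuration bit counts:

    # Configuration bits per 16x8 crossbar: encoded (4-bit select per output)
    # versus decoded (one-hot, 16 bits per output), over four contexts.
    outputs, contexts = 8, 4
    encoded_bits = outputs * 4  * contexts    # 128 bits -> fits the 4x32 memory
    decoded_bits = outputs * 16 * contexts    # 512 bits
    print(encoded_bits, decoded_bits)         # memory grows with contexts; the
                                              # decoder cost is paid only once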

Timing

Table summarizes the key timing estimates for the DPGA prototype at the slow-speed and nominal process points. As shown, context switches can occur on a cycle-by-cycle basis and contribute only a few nanoseconds to the operational cycle time. Equation relates the minimum achievable cycle time to the number of LUT delays and crossbar crossings in the critical path of a design; a sketch of its presumed form follows.
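The elided relation presumably takes roughly this additive form; the delay values below are rough figures quoted elsewhere in this chapter (2.5 ns context read, ~6 ns LUT plus local interconnect, a comparable crossbar traversal), and the function name is ours:

    # Presumed cycle-time relation: context read plus the LUT levels and
    # crossbar crossings traversed within one cycle.
    def min_cycle_ns(n_lut, n_xbar, t_context=2.5, t_lut=6.0, t_xbar=6.0):
        return t_context + n_lut * t_lut + n_xbar * t_xbar

    print(min_cycle_ns(1, 1))   # ~14.5 ns, within the quoted 60-100 MHz range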

These estimates suggest a heavily pipelined design which placed only one level of lookup table logic and one crossbar traversal in each pipeline stage could achieve 60-100MHz operation while allowing a context switch on every cycle. Our prototype, however, does not have a suitably aggressive clocking, packaging, or I/O design to actually sustain such a high clock rate. DRAM refresh requirements force a minimum operating frequency of 5MHz.

Pipelining

Two areas for pipelining are worth considering. Currently, the context memory read happens at the beginning of each cycle. In many applications, the next context is predictable, and the next context read can be pipelined in parallel with operation in the current context. This pipelining can hide the additional context-read latency. Also, notice that the inter-subarray crossbar delay is comparable to the LUT plus local interconnect delay. For aggressive implementations, allowing the non-local interconnect to be pipelined will facilitate small microcycles and very high throughput operation. Pipelining both the crossbar routing and the context reads could potentially allow a 3-4 ns operational cycle.

Component Operation

Inter-context Communication

The only method of inter-context communication for the prototype is through the array element output register. That is, when a succeeding context wishes to use a value produced by the immediately preceding context, we enable the register output on the associated array element in the succeeding context (See Figure ). When the clock edge occurs signaling the end of the preceding cycle, the signal value is latched into the output register and the new context programming is read. In the new context, the designated array element output now provides the value stored from the previous context rather than the value produced combinationally by the associated LUT. This, of course, makes the associated LUT a logical choice to use to produce values for the new context's succeeding context since it cannot be used combinationally in the new context, itself.

In the prototype, the array element output register is also the only means of state storage. Consequently, it is not possible to perform orthogonal operations in each context and preserve context-dependent state.

Note that a single context which acts as a shift register can be used to snapshot, offload, and reload the entire state of a context. In an input/output minimal case, all the array elements in the array can be linked into a single shift register. Changing to the shift register context will allow the shift register to read all the values produced by the preceding context. Clocking the device in this context will shift data to the output pin and shift data in from the input pin. Changing from this context to an operating context which registers the needed inputs will insert loaded values for operation. Such a scheme may be useful to take snapshots during debugging or to support context switches where it is important to save state. If only a subset of the array elements in the array produce meaningful state values, the shift register can be built out of only those elements. If more input/output signals can be assigned to data onload and offload, a parallel shift register can be built, limiting the depth and hence onload/offload time.

Context Switching

Context switches are signaled by a context strobe. If context strobe is asserted at a clock edge, a context read occurs. If context strobe is not asserted, the component remains in the same context.

DRAM Refresh

DRAM memory is refreshed under one of two conditions:
  1. Context Read -- Whenever a context is read, that context will be refreshed.
  2. Clocked in Same Context -- Whenever a clock cycle occurs but the context strobe is not asserted and there is no read or write to any of the memories serviced by a particular local decoder, the ``next'' context is refreshed. Each local decoder maintains a modulo-four counter which it increments each time it is able to perform a context refresh in this manner. If the array stays in the same context for more than four cycles, the active context value is refreshed through we<4> every fourth cycle (See Figure ).
This refresh scheme does place some restrictions on the context sequencing, but it allows the most common patterns (a toy model of the decoder's behavior appears after this paragraph). However, if one continually changes contexts on every clock cycle and only walks through a small subset of the entire set of contexts, the non-visited contexts will be starved of refresh. For example, switching continually between context zero and context one would prevent contexts two and three from ever getting a refresh.
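A toy software model of one local decoder's refresh behavior under the two conditions above (illustrative only, not the actual logic):

    # Contexts are refreshed either when read (context strobe) or round-robin
    # via a modulo-4 counter on idle, same-context cycles.
    class LocalDecoder:
        def __init__(self):
            self.counter = 0                      # modulo-4 refresh pointer
            self.last_refresh = [0, 0, 0, 0]      # cycle each context last refreshed

        def clock(self, cycle, context_strobe, active_context, mem_access=False):
            if context_strobe:                    # a context read refreshes that context
                self.last_refresh[active_context] = cycle
            elif not mem_access:                  # idle cycle: refresh the "next" context
                self.last_refresh[self.counter] = cycle
                self.counter = (self.counter + 1) % 4

    d = LocalDecoder()
    for cycle in range(8):                        # stay in context 0 the whole time
        d.clock(cycle, context_strobe=False, active_context=0)
    print(d.last_refresh)                         # every context touched within 4 cycles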

The context memory typically sees very stylized usage. For any single memory, writes are infrequent. Common usage patterns are to read through all the contexts in sequence or to remain in one context for a number of cycles. As such, the usage pattern is well matched to DRAM refresh requirements.

Background Load

Notice from Figure that the write path is completely separate from the read path. This allows background writes to occur orthogonally to normal operation. Data can be read through the refresh inverter and we<4> with iwe disabled to prevent refresh or writeback. At the same time, new data arriving on ewv may be loaded through ewe and written into memory using we<3:0>.

Prototype Context Area Model

Using the prototype areas, we can formulate a simplified model for the area of a DPGA array element as a function of its number of contexts: a fixed area for the LUT, interconnect, and control, plus a per-context area for each 32-bit configuration word. Based on this area model, our robust context point is roughly 45, 23, and 11 contexts for the tight DRAM, prototype DRAM, and SRAM memory implementations, respectively. A sketch of this model appears below.
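A minimal sketch of such a model, assuming the robust context point scales as the ratio of fixed area to per-context memory area; the per-bit cell areas follow the text above, while the fixed area is an assumed placeholder chosen to be consistent with the quoted 45/23/11 figures:

    # Per array element: fixed area plus 32 bits of context memory per context.
    A_FIXED = 440e3                                   # assumed lambda^2 per element
    A_CTX = {"tight DRAM":      32 * 300.0,           # lambda^2 per 32-bit context
             "prototype DRAM":  32 * 600.0,
             "SRAM":            32 * 1200.0}

    def element_area(contexts, memory):
        return A_FIXED + contexts * A_CTX[memory]

    # Robust context point taken as the fixed-to-context area ratio:
    for mem, a in A_CTX.items():
        print(mem, round(A_FIXED / a))                # ~46, ~23, ~11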

Prototype Conclusions

The prototype demonstrates that efficient, dynamically programmable gate arrays can be implemented which support a single cycle, array-wide context switch. As noted in Chapter and the introduction to this chapter, when the instruction description area is small compared to the active compute and network area, multiple context implementations are more efficient than single context implementations for a large range of application characteristics. The prototype bears out this relationship with the context memory for each array element occupying at least an order of magnitude less area than the fixed logic and interconnect area. The prototype further shows that the context memory read overhead can be small, only a couple of nanoseconds.

The prototype has room for improvement in many areas:

Additional architectural areas for improvement over the prototype are identified in the following sections and in the next chapter.

Circuit Evaluation

One large class of workloads for traditional FPGAs is conventional circuit evaluation. In this section, we look at circuit levelization where traditional circuits are automatically mapped into multicontext implementations. In latency limited designs (Section ), the DPGA can achieve comparable delays in less area. In applications requiring limited task throughput (Section ), DPGAs can often achieve the required throughput in less area.

Levelization

Levelized logic is a CAD technique for automatic temporal pipelining of existing circuit netlists. Bhat refers to this technique as temporal partitioning in the context of the Dharma architecture [Bha93]. The basic idea is to assign an evaluation context to each gate so that every gate is placed in a context no earlier than the contexts of the gates producing its inputs.

With latency constraints, we may further require that the levelized network not take any more steps than necessary. With this assignment, the series of contexts 1, 2, ... evaluates the logic netlist in sequence over successive microcycles. With a full levelization scheme, the number of contexts used to evaluate a netlist is equal to the length of the critical path in the netlist; a minimal leveling sketch follows.
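A minimal sketch of full levelization as earliest-possible (ASAP) level assignment; the connectivity of the example netlist is illustrative, chosen only to reproduce the level structure of the Figure subcircuit:

    # netlist maps each LUT to the LUTs (not primary inputs) whose outputs it uses.
    def levelize(netlist):
        levels = {}
        def level(gate):
            if gate not in levels:
                preds = netlist[gate]
                levels[gate] = 1 + max((level(p) for p in preds), default=0)
            return levels[gate]
        for g in netlist:
            level(g)
        return levels            # number of contexts = max level = critical path

    sub = {"i1": [], "i4": [], "i7": [], "i8": [],
           "i15": ["i4", "i8"],            # connectivity illustrative only
           "o<1>": ["i15", "i1"]}
    print(levelize(sub))                   # i1,i4,i7,i8 -> 1, i15 -> 2, o<1> -> 3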

For the sake of illustration, Figure shows a fraction of the ASCII Hex→binary circuit extracted from Figure . The critical paths (e.g., c<1>→i8→i15→o<1>) are three elements long. Spatially implemented, this netlist evaluates a 6-gate function in 3 cycles using 6 physical gates. In three cycles, these six gates could have provided 18 gate evaluations, so we underutilize all the gates in the circuit. The circuit can be fully levelized as shown in Figure (context 1 = {i1,i4,i7,i8}, context 2 = {i15}, context 3 = {o<1>}) or partially levelized by combining the final two stages (context 1 = {i1,i4,i7,i8}, context 2 = {i15,o<1>}). If the inputs are held constant during evaluation, we need only four LUTs to evaluate either case. If the inputs are not held constant, we will need two additional LUTs for retiming, so there is no benefit to multicontext evaluation of this fragment. However, as we saw in Section , for the whole circuit, even with retiming, the total number of LUTs needed for levelized evaluation is smaller than the number needed in a fully spatial implementation.

Recall also from Section that grouping is one of the limits to even levelization. When there is slack in the circuit network, the slack gives us some freedom in the context placement of components outside of the critical path. In general, this slack should be used to equalize context size, minimizing the number of active LUTs required to achieve the desired task latency or throughput. As we see in both this subcircuit and the full circuit, signal retiming requirements also serve to increase the number of active elements we need in each evaluation level.

Latency Limited Designs

As noted in Section , many tasks are latency limited. This could be due to data dependencies, such that the output from the previous evaluation must be available before subsequent evaluation may begin. Alternately, the subtask may be the latency-limiting portion of some larger computational task. Further, the task may be one where the repetition rate is not high but the response time is critical. In these cases, multicontext evaluation will allow implementations with fewer active LUTs and, consequently, less implementation area.

We use the MCNC circuit benchmark suite to characterize the benefits of multicontext evaluation. Each benchmark circuit is mapped to a netlist of 4-LUTs using sis [SSL +92] for technology independent optimization and Chortle [Fra92] [BFRV92] for LUT mapping. Since we are assuming latency is critical in this case, both sis and Chortle were run in delay mode. No modifications to the mapping and netlist generation were made for levelized computation.

LUTs are initially assigned to evaluation contexts randomly without violating the circuit dataflow requirements. A simple, annealing-based swapping schedule is then used to minimize total evaluation costs. Evaluation cost is taken as the number of 4-LUTs in the final mapping, including the LUTs added to perform retiming (see the retiming accounting sketched below). Table shows the circuits mapped to a two-context DPGA, and Table , a four-context DPGA. Table shows the full levelization case -- that is, circuits are mapped to a DPGA with as many contexts as there are LUT delays in the circuit's critical path. The tables break out the number of active LUTs in the multicontext implementations to show the effects of signal retiming requirements.
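One simple way to account for the retiming LUTs in such a cost function (the experiments' annealer may count them differently); primary inputs that are not held constant would need the same treatment:

    # A value produced in context p and consumed in context q > p must occupy a
    # LUT (as a register) in every intervening context; retimed copies can be
    # shared among consumers, so count up to each producer's latest consumer.
    def retiming_luts(levels, netlist):
        latest = {}
        for gate, preds in netlist.items():
            for p in preds:
                latest[p] = max(latest.get(p, 0), levels[gate])
        return sum(max(0, latest[p] - levels[p] - 1) for p in latest)

    levels  = {"i1": 1, "i8": 1, "i15": 2, "o<1>": 3}
    netlist = {"i1": [], "i8": [], "i15": ["i8"], "o<1>": ["i15", "i1"]}
    print(retiming_luts(levels, netlist))   # 1: i1 must be carried through context 2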

From the mapped results, we see a 30-40% overall area reduction using multicontext FPGAs, with some designs achieving almost 50%. For this collection of benchmark circuits, which has an average critical path length of 5-6 4-LUT delays, a four-context DPGA gives the best overall results. Note that retiming requirements for these circuits dictate that each context contain, on average, 50% of the LUTs in the original design. Without the retiming requirements, 60-70% area savings look possible without increasing evaluation path length.

In addition to task delay requirements, three effects work together to limit the number of contexts which are actually beneficial for these circuits:

  1. packing limitations
  2. retiming requirements
  3. non-trivial, finite instruction area
The annealing step explicitly minimized total LUT requirements, including retiming. Nonetheless, looking at the mapped results, we see that the number of active LUTs actually used for computation continues to decrease as the number of contexts increases, while the total number of LUTs tends to level out due to retiming saturation. Figure shows the area breakdown of these effects in terms of the number of LUTs and total area required as a function of the number of contexts used for the des benchmark. Figures and show similar data for C880 and alu2.

Timing

There are two potential sources of additional latency for the multicontext cases versus the single context cases.
  1. stage balancing time
  2. context-switch time

When the number of LUT delays in the path is not an even multiple of the number of contexts, it is not possible to allocate an even number of LUT delays to each context. For example, since the des circuit takes eight LUT delays to evaluate, a three context implementation will place three LUT delays in two of the three contexts and two LUT delays in the third. In a simple clocking scheme, each context would get the same amount of time. In the des case, that would be three LUT delays, making the total evaluation time nine LUT delays.

Changing contexts will add some latency overhead, at least for registering values during context switches. In the circuit evaluation case, the next context is always deterministic, so the context read could effectively be pipelined in parallel with evaluation of the previous context. From the DPGA prototype, we saw that the LUT-to-LUT delay was roughly 6 ns and the context read was 2.5 ns. Register clocking overhead is likely to be on the order of 1 ns. In the pipelined read case, we add 1 ns per context switch, a modest worst-case percentage of delay on the critical path; in the non-pipelined read case, we add 2.5 ns per context switch, a correspondingly larger percentage (see the sketch below).
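A sketch of the worst-case overhead arithmetic under these assumptions (one LUT delay per context; the thesis's elided percentages may use a slightly different baseline):

    # Worst case: every context holds a single 6 ns LUT delay, so the per-switch
    # overhead is charged against one LUT delay.
    t_lut = 6.0
    added_pipelined, added_nonpipelined = 1.0, 2.5
    print(round(100 * added_pipelined / t_lut),      # ~17% added delay
          round(100 * added_nonpipelined / t_lut))   # ~42% added delay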

The area for the multicontext implementation is smaller and the number of LUTs involved is smaller. As a result, the interconnect traversed in each context may be more physically and logically local, thus contributing less to the LUT-to-LUT delay.

Area

In the model used, we assume that the basic interconnect area per LUT is the same in the single and multiple context case. Since the total number of LUTs needed for the multicontext implementation is smaller, the multicontext implementation can use an array with fewer LUTs than the single context implementation. We saw in Section that interconnect area grows with array size, so the area going into interconnect will be less for the multicontext array assuming the Rent parameter remains the same.

Area for Improvement

The results presented in this section are based on:

  1. LUT area model numbers
  2. DPGA architecture resembling the DPGA prototype
  3. conventional circuit netlist mapping
It may be possible to achieve better results by improving each of these areas.
  1. component area -- The model assumes an instruction storage cost based on 64 instruction bits/LUT and conventional SRAM memory cells. Smaller context area can be achieved by tighter instruction encoding ( e.g. Section ) or smaller memory cells ( e.g. DRAM used in the prototype DPGA described in Section ).
  2. architecture -- The largest gap between the ideal case and practice is in retiming. The architecture can be modified to better handle retiming (See Chapter ).
  3. mapping -- LUT mapping which is sensitive to the retiming costs may be capable of generating netlists with lower retiming requirements.

Limited Task Throughput

In Section , we saw that system and application requirements often limit the throughput required out of each individual subtask or circuit. When throughput requirements are limited, we can often meet the throughput requirement with fewer active LUTs than design LUTs, realizing a smaller and more economical implementation.

To characterize this opportunity we again use the MCNC circuit benchmarks. sis and Chortle are used for mapping, as before. Since we are assuming here that the target criterion is throughput, both sis and Chortle are run in delay mode. As before, no modifications to the mapping and netlist generation are made.

For baseline comparison in the single-context FPGA case, we insert retiming registers in the mapped design to achieve the required throughput. That is, to produce a new result every given number of LUT delays, we add pipelining registers at that interval along the critical path. For example, if the critical path of a circuit is 8 LUT delays long and the desired throughput is one result every 2 LUT delays, we break the circuit into four pipeline stages, adding registers every 2 LUT delays in the original circuit. We use a simple annealing algorithm to assign non-critical-path LUTs to stages in a way that minimizes the number of retiming registers which must be added to the design.

Similarly, we divide the multicontext case into separate spatial pipeline stages such that the path length between pipeline registers is equal to the acceptable period between results. The LUTs within a stage are then evaluated in multicontext fashion using the available contexts. Again, if the critical path of a circuit is 8 LUT delays long and the desired throughput is one result every 2 LUT delays, we break the circuit into four spatial pipeline stages, adding registers every 2 LUT delays in the original circuit. Each spatial pipeline stage is further subdivided into two temporal pipeline stages which are evaluated using two contexts. This multicontext implementation switches contexts once per LUT delay. Similarly, if the desired throughput were only one result every 4 LUT delays, the design would be divided into 2 spatial pipeline stages and up to 4 temporal pipeline stages, depending on the number of contexts available on the target device. The same annealing algorithm is used to assign non-critical-path LUTs to spatial and temporal pipeline stages in a manner which minimizes the total number of design and retiming LUTs required in the levelized circuit.

As the throughput requirements diminish, we can generally achieve smaller implementations. Unfortunately, as noted in the previous section, retiming requirements prevent us from effectively using a large number of contexts to decrease implementation area. For the alu2 benchmark, Table shows how the LUT requirements vary with throughput, and Table translates the LUT requirements into areas based on the model parameters used in the previous section. Figure plots the areas from Table . Table recasts the areas from Table as ratios to the best implementation area at a given throughput. For this circuit, the four- or five-context implementation is 45% smaller than the single-context implementation for low throughput requirements.

Tables through highlight area ratios at three throughput points for the entire benchmark set. For reference, Table summarizes the number of mapped design LUTs and path lengths for the netlists used for these experiments. We see that the 2-4 context implementations are 20-30% smaller than the single context implementations for low throughput requirements.

Area for Improvement

The results shown here are moderately disappointing. Retiming requirements prevent us from collapsing the number of active LUTs substantially as we go to deeper multicontext implementations. As with the previous section, the results presented in this section are based on our area model, the prototype DPGA architecture, and conventional circuit netlist mapping. More than in the previous section, the results here also depend upon the experimental temporal partitioning CAD software. Groupings into temporal and spatial pipelining stages are more rigid than necessary, so better packing may be possible with more flexible stage assignment. The retiming limitations identified here also motivate architectural modifications which we will see in the next chapter.

Time-Sliced Interleaving

The retiming limitation we are encountering here arises largely from packing the circuit into a limited number of LUTs and serializing the communication of intermediate results. An alternate strategy would be to share a larger group of LUTs more coarsely among multiple subcircuits in a time-sliced fashion. That is, rather than trying to sequentialize the evaluation, we retain the full circuit, or a partially sequentialized version, and only invoke it periodically.

Considering again our alu2 example, for moderately low throughput tasks, one context may hold the 169 mapped design LUTs, while other contexts hold other, independent tasks. A two-context DPGA could alternate between evaluating the alu2 example and some other circuit or task. In this two-context case, the amortized area would be: Note that 81M is smaller than the 91M area achieved by the two-context, non-interleaved implementation and smaller than the 84-85M for the four- and five-context implementations (Table ). Further interleaving can yield even lower amortized costs. e.g.
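
To make the amortization arithmetic explicit, the following sketch uses per-LUT area values that are assumptions chosen to be consistent with the roughly 1:10 context-memory-to-active-area ratio used in this section (illustrative values, not the exact model parameters), and charges each interleaved task an equal share of the active area plus its own context memory:

# Illustrative arithmetic only: amortized area per task when several
# equal-sized, independent tasks are interleaved on one multicontext device.

A_ACTIVE = 800e3       # lambda^2 of active LUT + interconnect per LUT (assumed)
A_CONTEXT = 80e3       # lambda^2 of configuration memory per LUT per context (assumed)

def amortized_area(luts_per_task, contexts):
    """Area charged to each of `contexts` tasks sharing one device."""
    return luts_per_task * (A_ACTIVE + contexts * A_CONTEXT) / contexts

print(amortized_area(169, 2))    # about 81e6 lambda^2 with these assumed values
print(amortized_area(169, 4))    # deeper interleaving lowers the amortized cost further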

This coarse-grain interleaving achieves a more ideal reduction in area:

Figure plots the area ratio versus . Note that the ratio is ultimately bounded by the ratio of per-context memory area to active area, which is roughly 10% for the model parameters assumed throughout this section. On the negative side, each interleaved task receives only a fraction of the device's throughput, and the latency seen by any individual task grows with the number of tasks sharing the array.
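
The limiting ratio follows directly from the amortization above; writing A_active and A_ctx for the per-LUT active and per-context memory areas (the same assumed quantities as in the sketch above):

\[
\frac{A_{\mathrm{amortized}}}{A_{\mathrm{single}}}
  = \frac{\left(A_{\mathrm{active}} + C\,A_{\mathrm{ctx}}\right)/C}
         {A_{\mathrm{active}} + A_{\mathrm{ctx}}}
  \;\longrightarrow\;
  \frac{A_{\mathrm{ctx}}}{A_{\mathrm{active}} + A_{\mathrm{ctx}}}
  \approx 10\%
  \quad \text{as } C \to \infty .
\]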

Temporally Varying Logic -- Finite State Machines

As we noted in Sections and , the performance of a finite state machine is dictated by its latency rather than its throughput. Since the next-state calculation must complete and be fed back to the input of the FSM before the next state behavior can begin, there is no benefit to be gained from spatial pipelining within the FSM logic. Temporal pipelining can be used within the FSM logic to increase gate and wire utilization as seen in Section . Finite state machines, however, have additional structure beyond that of random logic which can be exploited. In particular, one never needs the full FSM logic at any point in time. During any cycle, only the logic for the current state is active. In a traditional FPGA, we have to implement all of this logic at once in order to get the full FSM functionality. With multiple contexts, each context need only contain a portion of the state graph. When the machine transitions to a state whose logic resides in another context, we can switch contexts, making a different portion of the FSM active. National Semiconductor, for example, exploits this feature in their multicontext programmable logic array (PLA), MAPL [Haw91].
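
A minimal sketch of this mechanism (the data structures are hypothetical stand-ins for mapped per-state logic, not the actual netlists): each context holds next-state and output logic for a subset of the states, and a transition into a state held elsewhere triggers a context switch.

def run_multicontext_fsm(contexts, state_to_context, start_state, input_stream):
    """contexts: ctx -> {state: fn(inputs) -> (next_state, outputs)};
    state_to_context: state -> context holding that state's logic."""
    state = start_state
    outputs_trace = []
    for inputs in input_stream:
        ctx = state_to_context[state]           # select the context holding `state`
        state, outputs = contexts[ctx][state](inputs)
        outputs_trace.append(outputs)           # next iteration may switch contexts
    return outputs_trace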

Example

Figure shows a simple, four-state FSM for illustrative purposes. The conventional, single-context implementation requires four 4-LUTs, one to implement each of Dout and NS1 and two to calculate NS0. Figure shows a two-context DPGA implementation of this same FSM. The design is partitioned into two separate circuits based on the original state variable S1. The two circuits are placed in separate contexts and NS1 is used to select the circuit to execute as appropriate. Each circuit only requires three 4-LUTs, making the overall design smaller than the flat, single context implementation.

Full Temporal Partitioning

In the most extreme case, each FSM state is assigned its own context. The next-state computation simply selects the appropriate next context in which to operate. Tables and show the reduction in area and path delay which results from state-per-context, multicontext implementations of the MCNC FSM benchmarks. FSMs were mapped using mustang [DMNSV88]. Logic minimization and LUT mapping were done with espresso, sis, and Chortle. For single-context FSM implementations, both one-hot and dense encodings were synthesized and the best mapping was selected. The multicontext FSM implementations use dense encodings so the state specification can directly serve as the context select. For multicontext implementations, delay and capacity are dictated by the logic required for the largest and slowest state. On average, the fully partitioned, multicontext implementation is 35-45% smaller than the single-context implementation. Many FSMs are 3-5 times smaller.

Timing

From Tables and , the multicontext FSM implementations generally have one or two fewer logic levels in their critical path than the single-context implementations when mapped for minimum latency. The multicontext implementations show an even greater reduction in path length when mapped for minimum area. The multicontext FSMs, however, require additional time to distribute the context select and perform the multicontext read. i.e.

Recall from the prototype the context read time and the LUT delay with a typical amount of switching. Properly engineered, context distribution should take only a few nanoseconds, which means the multicontext and single-context implementations run at comparable speeds when the multicontext implementation has one fewer LUT delay in its critical path than the single-context implementation.
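
A rough way to state the break-even point, treating the LUT-plus-interconnect delay t_LUT and the context distribution and read time t_ctx as comparable (a few nanoseconds each, per the assumption above), with the single-context path length denoted by l:

\[
T_{\mathrm{multi}} \approx (\ell - 1)\,t_{\mathrm{LUT}} + t_{\mathrm{ctx}}
  \;\approx\; \ell\,t_{\mathrm{LUT}} = T_{\mathrm{single}}
  \quad \text{when } t_{\mathrm{ctx}} \approx t_{\mathrm{LUT}} .
\]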

Partial Temporal Partitioning

The capacity utilization and delay are often dictated by a few of the more complex states. It is often possible to reduce the number of contexts required without increasing the required capacity or delay. Tables and show the cse benchmark partitioned into various numbers of contexts and optimized for area or path delay, respectively. These partitions were obtained by partitioning along mustang-assigned state bits, starting with a four-bit state encoding. Figures and plot the LUT count, area, and delay data from the tables versus the number of contexts employed.
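
Partitioning along state bits amounts to grouping states by the values of a chosen subset of their encoding bits; a small sketch (the state encodings and the chosen bit positions are assumed inputs):

from collections import defaultdict

def partition_by_state_bits(state_codes, split_bits):
    """state_codes: state -> dense encoding (int); split_bits: encoding bit
    positions whose values select the context."""
    groups = defaultdict(list)
    for state, code in state_codes.items():
        key = tuple((code >> b) & 1 for b in split_bits)   # context select value
        groups[key].append(state)
    return dict(groups)      # one context per distinct key, up to 2**len(split_bits)

# e.g. splitting a four-bit encoding along bits 3 and 2 yields up to four contexts:
# partition_by_state_bits({"s0": 0b0000, "s1": 0b0101, "s2": 0b1010}, [3, 2])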

One thing we note from both the introductory example (Figures and ) and the cse example is that the full state-per-context case is not always the most area-efficient mapping. In the introductory example, once we had partitioned to two contexts, no further LUT reduction could be realized by going to four contexts. Consequently, the four-context implementation would be larger than the two-context implementation owing to the deeper context memories. In the cse example, the reduction in LUTs associated with going from 8 to 11 or 11 to 16 contexts saved less area than the cost of the additional context memories.
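
The trade-off can be seen with a small area model using the same assumed per-LUT values as earlier and the LUT counts from the introductory example (four LUTs in one context, three once split across two or more contexts):

# Why more contexts are not always smaller: area grows with context depth even
# when the active LUT count stops shrinking.  Per-LUT areas are assumed values.

A_ACTIVE, A_CONTEXT = 800e3, 80e3    # lambda^2 per LUT (assumed model parameters)

def implementation_area(active_luts, contexts):
    """Active LUTs plus a `contexts`-deep configuration memory behind each one."""
    return active_luts * (A_ACTIVE + contexts * A_CONTEXT)

for contexts, luts in [(1, 4), (2, 3), (4, 3)]:
    print(contexts, implementation_area(luts, contexts))
# the 4-context version costs more than the 2-context one: same LUTs, deeper memories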

Tables through show the benchmark set mapped to various multicontext implementations for minimum area. All partitioning is performed along mustang state bits. For these results, we examined all possible state bits along which to split and chose the best set. On average across the benchmark set, the 8-context mapping saves over 40% in area versus the best single-context case. The best multicontext mapping is often 3-5 times smaller than the best single-context mapping.

Tables through show the benchmark set mapped to implementations with various numbers of contexts, with delay minimization as the target. All partitioning is performed along mustang state bits. For these results, we examined all possible state bits along which to split and chose the best set. On average across the benchmark set, the 8-context mappings are half the area of the best single-context case while achieving comparable delay.

Comparison with Memory-based FSM Implementations

A memory and a state register can also be used to implement finite-state machines. The data inputs and current state are packed together and used as addresses into the memory, and the memory outputs serve as machine outputs and next state outputs. Figure shows a memory implementation of our simple FSM example from Figures and .
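
A sketch of this memory organization (the transition function, field packing, and widths here are illustrative):

# Memory-based FSM: the current state and the inputs form the address; the
# stored word holds the machine outputs and the next state.

def build_fsm_memory(transition, n_inputs, n_state_bits):
    """transition: (state, inputs) -> (next_state, outputs); returns a flat table."""
    memory = []
    for state in range(1 << n_state_bits):
        for ins in range(1 << n_inputs):
            next_state, outputs = transition(state, ins)
            memory.append((outputs, next_state))    # packed into one memory word
    return memory

def step(memory, n_inputs, state, ins):
    """One FSM cycle: a single memory read indexed by {state, inputs}."""
    outputs, next_state = memory[(state << n_inputs) | ins]
    return next_state, outputs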

Used for finite-state machines, the DPGA is a hybrid between a purely gate-based (FPGA) implementation and a purely memory-based implementation. The DPGA takes advantage of the memory to realize smaller, state-specific logic than an FPGA, which must implement all logic simultaneously. The DPGA uses the restructurable interconnect in the array to implement next-state and output computations out of gates. As we noticed in Section , the gate implementation allows the DPGA to exploit regularities in the computational task. In this case, we avoid the necessarily exponential area increase associated with additional inputs in a memory implementation, the linear-log increase associated with additional states, and the linear increase associated with additional outputs.

Assuming we could build just the right-sized memory for a given FSM, the area would be: Table summarizes the areas of the best memory-based FSM implementations along with the areas for FPGA and 8-context DPGA implementations. The ``Min area'' column indicates the area assuming a memory of exactly the right size is used, while the ``Memory Area'' column shows the area for the smallest memory with an integral number of address bits, as shown in the ``organization'' column. When the total number of state bits and input bits is less than 11, the optimal memory implementations can be much smaller than the FPGA or DPGA implementations. Above 11 input and state bits, the DPGA implementation is smaller. Since the DPGA implementation size increases with task complexity rather than number of inputs, while the memory implementation is exponential in the number of inputs and state bits, the disparity grows substantially as the number of inputs and state bits increases.
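
The ``right sized'' memory area can be approximated with a generic memory-sizing estimate (an approximation for illustration, not the exact model used to produce the table), where n_in, n_state, and n_out count the input, state, and output bits and A_bit is the area of a single memory bit cell:

\[
A_{\mathrm{mem}} \approx 2^{\,n_{\mathrm{in}} + n_{\mathrm{state}}}
  \cdot \left(n_{\mathrm{out}} + n_{\mathrm{state}}\right) \cdot A_{\mathrm{bit}} .
\]

This makes the exponential dependence on inputs and state bits, and the comparatively benign dependence on outputs, explicit.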

Areas for Improvement

Timing

In this section, we assumed that the context read occurs in series with execution within the target context and state. It is possible to overlap context reads with execution by using a more sophisticated FSM mapping model. On a state transition, instead of reading a context with the target state's logic, we read a context with all the logic for any state which may follow the target state. This can be viewed as speculatively fetching just the set of logic which may be needed by the time it has been read a cycle later. Using this scheme, we can reduce the cycle time to the actual delay of the logic in the context rather than that delay plus the context read time. For heavily branching FSMs, however, each context will often have to include more state logic under this style of mapping than under the simple division described here. As we have seen, including more state logic increases the delay, so it is not immediately obvious which case will generally have superior performance.
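
A sketch of the capacity implication of this speculative-fetch mapping (the successor relation and per-state LUT counts are assumed inputs; the sum is a pessimistic bound that ignores logic shared among successor states):

def speculative_context_sizes(successors, state_luts):
    """successors: state -> iterable of states that may follow it;
    state_luts: state -> LUTs needed for that state's logic.
    Returns the LUTs each speculatively fetched context must hold."""
    sizes = {}
    for state, nexts in successors.items():
        # the context read while `state` executes must cover every possible successor
        sizes[state] = sum(state_luts[s] for s in nexts)
    return sizes             # device capacity is set by max(sizes.values())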

Partitioning

For the partial temporal partitioning above, we partitioned strictly along mustang state bits. This is likely to give less than optimal partitions since mustang's cost model is aimed at multi-level logic implementations. It assumes all logic must be available at once and is not trying to maximize the independence among state groups. A more sophisticated mapping would go back to the original state-transition graph and partition states explicitly to minimize the logic required in each partition. Informally, the goal would be to group states with similar logic together and separate states performing disparate logical functions.

General Technique

While demonstrated in the context of FSMs, the basic technique used here is fairly general. When we can predict which portions of a netlist, circuit, or computational task are needed at a given point in time, we can generate a more specialized design which only includes the required logic. The specialized design is often smaller and faster than the fully general design. With a multicontext component, we can use the contexts to hold many specialized variants of a design, selecting them as needed.

In the synthesis of computations, it is common to synthesize a controller along with datapaths or computational elements. The computations required are generally different from state to state. Traditional, spatial implementations must have hardware to handle all the computations or generalize a common datapath to the point where it can be used for all the computations. Multicontext devices can segregate different computational elements into different contexts and select them only as needed.

For example, in both dbC [GM93] and PRISM-II [AWG94], a general datapath capable of handling all of the computational subtasks in the computation is synthesized alongside the controller. At any point in time, however, only a small portion of the functionality contained in the datapath is actually needed and enabled by the controller. A multicontext implementation would be smaller, folding the disparate logic into memory and reusing the same active logic and interconnect to implement each piece as needed.

Additional Application Styles

Multifunction Components

With multiple, on-chip contexts, a device may be loaded with several different functions, any of which is immediately accessible with minimal overhead. A DPGA can thus act as a ``multifunction peripheral,'' performing distinct tasks without idling for long reconfiguration intervals. In a system such as the one shown in Figure , a single device may perform several tasks. When used as a reconfigurable accelerator for a processor ( e.g. [AS93] [DeH94] [RS94]) or to implement a dynamic processor ( e.g. [WH95]), the DPGA can support multiple loaded acceleration functions simultaneously. The DPGA is more efficient in these allocations than single-context FPGAs because it allows rapid reuse of resources without paying the large idle-time overheads associated with reconfiguration from off-chip memory.

In a data storage or transmission application, for instance, one may be limited by the network or disk bandwidth. A single device may be loaded with functions to perform:

The device would then be called upon to perform the required tasks as needed.

Within a CAD application, such as espresso [RSV87], one needs to perform several distinct operations at different times, each of which could be accelerated with reconfigurable logic. We could load the DPGA with assist functions, such as:

Since these tasks are needed at distinct times, they can easily be stacked in separate contexts. Contexts are selected as the program needs these functions. To the extent that function usage is interleaved, the on-chip context configurations reduce the reload idle time which would be required to share a conventional device among as diverse a set of functions.

Utility Functions

Some classes of functionality are needed occasionally but not continuously. In conventional systems, to get the functionality at all, we have to dedicate wire or gate capacity to such functions, even though they may be used very infrequently. A variety of data loading and unloading tasks fit into this ``infrequent use'' category, including:

In a multicontext DPGA, the resources to handle these infrequent cases can be relegated to a separate context, or contexts, from the ``normal'' case code. The wires and control required to shift data in (or out) and load it are allocated for use only when the respective utility context is selected. The operative circuitry, then, does not contend with the utility circuitry for wiring channels or switches, and the utility functions do not complicate the operative logic. In this manner, the utility functions can exist without increasing the critical path delay during operation.

A relaxation algorithm, for instance, might operate as follows:

Each of these operations may be separate contexts. The relaxation computation may even be spread over several contexts. This general operation style, where inputs and outputs are distinct and infrequent phases of operation, is common for many kinds of operations ( e.g. multi-round encryption, hashing, searching, and many optimization problems).
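
One plausible context schedule for such a computation (the device interface, phase names, and convergence test below are hypothetical, intended only to show input, compute, and output phases living in separate contexts):

# Purely illustrative: loading and unloading live in utility contexts that are
# selected only around the main relaxation phase.

LOAD_CTX, RELAX_CTX, UNLOAD_CTX = 0, 1, 2

def run_relaxation(device, input_stream, max_iters=100):
    device.select_context(LOAD_CTX)
    device.shift_in(input_stream)          # utility wiring active only now
    device.select_context(RELAX_CTX)
    for _ in range(max_iters):             # operative logic, free of utility overhead
        if device.step_converged():
            break
    device.select_context(UNLOAD_CTX)
    return device.shift_out()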

Temporally Systolic Computations

Figure shows a typical video coding pipeline ( e.g. [JOSV95]). In a conventional FPGA implementation, we would lay this pipeline out spatially, streaming data through the pipeline. If we needed the throughput capacity offered by the most heavily pipelined spatial implementation, that would be the design of choice. However, if we needed less throughput, the spatially pipelined version would require the same space while underutilizing the silicon. In this case, a DPGA implementation could stack the pipeline stages in time. The DPGA can execute a number of cycles on one pipeline function, then switch to another context and execute a few cycles on the next pipeline function (See Figure ). In this manner, the lower throughput requirement translates directly into lower device requirements.
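
A sketch of this time-sliced execution (the per-stage functions and the batching are hypothetical stand-ins for the mapped pipeline-stage subcircuits):

def temporally_systolic(stage_functions, input_batches):
    """Run each pipeline stage over a batch of data in its own context before
    switching to the context holding the next stage."""
    results = []
    for batch in input_batches:
        data = batch
        for stage in stage_functions:      # select the context holding this stage
            data = stage(data)             # a few cycles here, then switch contexts
        results.append(data)
    return results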

This is the same basic organizational scheme used for levelized logic evaluation (Section ). The primary difference is that evaluation levels are divided according to application subtasks.

This is a general schema with broad application. The pipeline design style is quite familiar and can be readily adapted for multicontext implementation. The amount of temporal pipelining can be varied as throughput requirements change or technology advances. As silicon feature sizes shrink, primitive device bandwidth increases. Operations with fixed bandwidth requirements can increasingly be compressed into more temporal and less spatial evaluation.

When temporally pipelining functions, the data flowing between functional blocks can be transmitted in time rather than space. This saves routing resources by bringing the functionality to the data rather than routing the data to the required functional block. By packing more functions onto a chip, temporal pipelining can also help avoid component I/O bandwidth bottlenecks between function blocks.

Control

In the prototype DPGA (Section ), we had a single, array-wide control thread, the context select line, which was driven from off-chip. In general, as we noted in Chapter , the array may be segregated into regions controlled by a number of distinct control threads. Further, in many applications it will be beneficial to control execution on-chip -- perhaps even from portions of the array itself.

Segregation

In the prototype, subarrays were used to organize local interconnect. The subarray level of array decomposition can also be used to segregate independent control domains. As shown in Figure , the context select lines were simply buffered and routed to each subarray. The actual decoding to control memories in the prototype occurred in the local decode block. We can control the subarrays independently by providing distinct sets of control lines for each subarray, or group of subarrays.

Distribution

Hardwired Control

In the simplest case, separate control streams can be physically assigned to each subarray or subarray group. For example, Figure shows a 3×3 subarray design with a separate control stream for each column.

Configurable Control

Alternately, multiple control streams can be physically routed to each subarray, with local configuration used to select the appropriate one for use. For example, Figure shows a 3×3 subarray design with three controllers where each subarray can be configured to select any of the three control streams. The configurable control may even be integrated with the configurable interconnection network.

Metaconfiguration

In scenarios where array control can be configured, it will often be necessary to have a separate level of configuration from the array itself. This meta-level configuration is used to define the sources for control data and perhaps the control distribution paths. It does not change from cycle to cycle as the regular array configuration data does. The MATRIX design described in Chapter deals explicitly with this kind of multi-level configuration scheme.

Source

Off-Chip

The control stream can be sourced from off-chip, as in the prototype, providing considerable flexibility to the application or system. Off-chip control, however, implies additional latency in the control path and the additional cost of a separate controller component. It also requires that precious I/O pins be dedicated to control rather than data. Tasks which benefit from rapid feedback between data in the computation stream and the control stream are hindered by the data-to-control path, which must first cross off-chip and then come back on-chip.

Local Dedicated Controller

A dedicated, programmable controller can be integrated on-chip to manage the control stream. The controller could come in the form of a simple counter, a programmable PLA, a basic microcontroller, or a core microprocessor. Integrated on-chip, it has low latency and high bandwidth to the array and avoids consuming I/O pins. In order to integrate such controllers on chip, we must decide how much space to dedicate to them, how many separate controllers to provide, and what form the controllers will take. Recall from Section that we would like to match the number of control streams to the needs of the application, but we cannot do that if the controllers must be allocated prior to fabrication.

Feedback Self Control

For mapped FSMs (Section ), we saw that it was beneficial to route some of the design outputs back into the control port ( e.g. Figure ). As noted above, this entails some integration of the reconfigurable network and the control distribution path.

Self Control

We can also build the controller out of FPGA/DPGA logic. The controller generally implements an FSM. It would be plausible, then, to allocate one or more subarrays to build a controller which is, in turn, used to control the other subarrays on the component. With this scheme, we can partition the array and build just as many controllers as the task at hand requires. Figure shows a case where two subarrays are used to build the controller which is then responsible for controlling the rest of the array.

Conclusions

Conventional FPGAs underutilize silicon in most situations. When throughput requirements are limited, when task latency is important, or when the required computation varies with time or with the data being processed, conventional designs leave many of the active LUTs and much of the interconnect idle most of the time. Since the area to hold a configuration is small compared to the active LUT and interconnect area, we can generally meet task throughput or latency requirements with less implementation area using a multicontext component.

In this chapter we introduced the DPGA, a bit-level computational array which stores several instructions or configurations along with each active computing element. We described a complete prototype of the architecture and some elementary design automation schemes for mapping traditional circuits and FSMs onto these multicontext architectures.

We showed how to automatically map conventional tasks onto these multicontext devices. For latency-limited and low-throughput tasks, a 4-context DPGA implementation is, on average, 20-40% smaller than an FPGA implementation. For FSMs, the 4-context DPGA implementation is generally 30-40% smaller than the FPGA implementation, while the 8-context DPGA implementation is 40-50% smaller. Signal retiming requirements are the primary limitation which prevents the DPGA from realizing greater savings on circuit benchmarks, so it is worthwhile to consider architectural changes to support retiming in a less expensive manner. We will look at one such modification in the next chapter. All of these results are based on a context-memory area to active compute and interconnect area ratio of 1:10. The smaller the context memories are relative to the fixed active logic and interconnect, the greater the reduction in implementation area we can achieve with multicontext devices.

For hand-mapped designs or coarse-grain interleaving, the potential area savings is much greater. With the 1:10 context memory to active ratio, the 4-context DPGA can be one-third the size of an FPGA supporting the same number of LUTs. An 8-context DPGA can be 20% of the size of the FPGA. Several of the automatically mapped FSM examples come close to achieving these area reductions.

