Transit Note #118
Notes on Coupling Processors with Reconfigurable Logic

Andre DeHon

Original Issue: March, 1995

Last Updated: Sat Apr 8 20:35:54 EDT 1995

Introduction

This is an informal note which discusses several options for incorporating reconfigurable logic into a microprocessor design. The goal of this note is to catalog and discuss the options. See (tn100) for a more motivational introduction to coupling reconfigurable arrays with microprocessors.

The note starts by looking at several general classes of reconfiguration which might be worthwhile to support. Section examines reconfigurable i/o, Section looks at reconfigurable or programmable functional units, Section describes reconfigurable control logic, Section explores reconfigurable instruction decoding, and Section looks at scenarios where the processor's basic behavior is reconfigurable. Section touches on reconfigurable logic technologies. Section looks at interface issues associated with reconfiguring the logic.

Flexible I/O

A large class of interesting applications arise if we insert the flexible logic into the processor's on/off chip datapath. In the extreme, the flexible logic could completely replace the off-chip i/o circuitry. Figure shows the basic organization of a vanilla microprocessor. The variants described in this section provide various flexible logic alternatives for the external I/O interface.

Architectural Options

Figure shows a scenario where the flexible logic can interpose itself in the i/o operation. Arranged appropriately, the latency impact on i/o operations which do not make use of the reconfigurable logic can be minimal -- just an extra multiplexor delay in each path. When the reconfigurable array processes data on its way on or off chip, the reconfigurable processing can be pipelined with processor and i/o operations. The reconfigurable operations will increase the off chip latency, but not diminish bandwidth. Of course, in the configurations of interest the additional latency in input or output processing will be small compared to the latency which would be incurred if the processing had to be done in software using the fixed portion of the processor, itself.

Figure shows a scenario where the off-chip i/o is completely subsumed by reconfigurable logic. Note that the relatively low bandwidth of off-chip communication, compared to on-chip communication, can partially compensate for the slower native performance of reconfigurable logic. The datapath between the fixed processing core and the reconfigurable logic can be wide, allowing the reconfigurable logic to use parallelism to achieve reasonable off-chip i/o bandwidth.

Further, the performance hit due to reconfigurable logic may often be lower than the performance hit taken when external logic components must be inserted into the datapath to adapt the processor's i/o to a particular system.

Similarly, one might worry that the reconfigurable structure will take more die area than the non-reconfigurable i/o. While the reconfigurable i/o may be larger, the increase in system cost which comes from having a larger die may well be less than the increase in system cost which comes from adding an external IC to adapt the conventional processor to fit into a particular system.

Of course, if one has a favorite bus to support, one could combine the previous two configurations (See Figure ). Placing multiplexors both on the i/o pins themselves and on the internal datapath allows the preferred bus to suffer very minimal speed degradation while allowing full reconfigurability of the i/o interface. This might be interesting, for example, in building a single IC to span a large range of systems. The fixed bus structure might be tuned to the highest end product line. The lower end models could employ the reconfigurable logic to adapt the core to their systems. This configuration would be especially attractive if the lower end systems were cheaper precisely because they ran the external busses at a lower speed than the high end models.

Application

The variants which allow control over the external interface can be employed to:

Adapt to the trendy standards -- Standard bus lifetimes are shortening and market windows are tight. Currently, if you are going to integrate native support for some standard bus on your processor (e.g. PCI, VESA-VL, MBUS), you have to predict the market during CPU design (12-18 months before product launch). If you guess wrong, you run the risk of losing out on important design-ins. Additionally, your processor's lifetime may end up being truncated by the next trendy standard. Reconfigurable i/o allows you to adapt to popular standards. It might also allow a system designer to build one card or motherboard which is soft configured for the bus of choice.
Adapt to non-standard system architectures -- In the present setting, one cannot get native bus support for busses which lack widespread industry popularity. With reconfigurable i/o, the processor can be adapted for direct connection to any bus system. This may be particularly useful in mixed processor environments and in legacy system and application environments.
Adapt to particular memory and i/o configurations -- A reconfigurable i/o interface can be tuned to the characteristics of a particular system configuration. Block and burst transfer size, block alignment, timing, and translation can be adapted to the system configuration rather than being generic. This may allow the system to naturally accommodate burst transfer modes available from memories or peripherals rather than being limited by the processor's preconceived and hardwired block transfer modes.
Handle low-level protocol processing -- In addition to bus protocols, the i/o system can be adapted to handle some of the low-level details of i/o stream processing, particularly those which are best done with efficient support for bit operations. Stripping off protocol headers, extracting fields, and responding rapidly to various signaling conditions are all tasks which can be handled well with flexible hardware.
Buffer input/output data -- Reconfigurable i/o logic can be arranged to strategically buffer data coming and going from the processor, perhaps according to application or peripheral requirements. The logic can provide FIFO buffering as may be useful when embedding a compute processor in a system. The buffering can be system specific, tuned to the devices with which the processor is communicating. It can also be application specific, providing buffering which is appropriate only for certain kinds of data.
For instance, the processor could pump out program locations for profiling operations. The reconfigurable logic could take the location data, timestamp it, then buffer it for a free bus cycle on which to write the data into memory, and supply the appropriate addresses to store the profiling data in FIFO style into an allocated region of main memory. Note that this allows the system to take a time-stamped program trace while only requiring the processor to execute a store instruction. Since the reconfigurable logic is doing buffering and translation on-chip, the store need take no more processor cycles than a store to the on-chip cache.
Provide direct network interfacing -- The reconfigurable logic can be used to provide direct interfacing to an attached long- or short-haul network. This would be particularly useful to adapt commodity processors for use in large-scale, multiprocessor computing systems. In the Cray T3D, for example, a separate ASIC is used to couple an Alpha microprocessor into the high-speed mesh network [Cra95]. With reconfigurable i/o, the network interface, or a good portion of it, can be built on chip. This close coupling can provide much higher network i/o performance, avoiding the overhead of going out to a remote device over a standard bus in order to communicate with the network. A network interface is, of course, one place where i/o data buffering may be necessary.
Support a variety of peripherals directly -- The reconfigurable i/o can be used to directly interface peripherals to the processor, without requiring off-chip glue logic. Chip selects and peripheral specific timing and control signals can be generated directly by the reconfigurable i/o. Eliminating glue logic will reduce part count and system cost.
Integrate special i/o signaling -- With reconfigurable i/o, the processor can be configured to handle external events smoothly and rapidly and provide special output signals. For example, in a polled device situation, busy or ready signals can be run straight into the processor's reconfigurable logic. The impact of polling on processor cycle time need be no more than an on-chip cache access - less, if the signal can be computed into the processor's branch and conditioning logic. Similarly, reconfigurable logic can provide special output signals, such as the context ID in the April/Sparcle microprocessor [ALKK90], the current priority level, or the process or thread ID of the current running process.
ECC or Parity Checking -- The reconfigurable i/o can be configured to perform ECC or parity computations on data coming and going from memory in systems where error checking and correction is warranted.
Reliable Systems Interfacing -- The reconfigurable i/o can be used to do combining, checking, and voting when building highly reliable systems. In this setting, the standard microprocessor with reconfigurable i/o can be adapted to work in such tandem configurations. For example, in a dual checking system, one can be configured to provide outputs, while the other can be configured to listen and compare its internal results to the results of the master. If they ever differ, the backup processor can signal the main to stop. Again, this reduces system cost by obviating the need for separate logic components to do the combining and checking. Additionally, it makes the single microprocessor design with reconfigurable i/o attractive for use when building reliable systems.
Synchronization Management -- System specific synchronization can be handled by the reconfigurable i/o logic. With hooks out to the actual i/o pins, the processor can generate and receive synchronization signals directly. The reconfigurable logic inside the processor can process the signaling accordingly. This could, for instance, be used to implement barrier synchronization in a multiprocessor. More interestingly, perhaps, it would allow one to implement application specific synchronization schemes. In some cases local synchronization between a few processors may be all that is required for correctness, and the reconfigurable processors can be adapted accordingly.
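As a rough illustration of the profiling example given under "Buffer input/output data" above, the following Python sketch models the behavior such buffering logic might implement: timestamp each program location, hold it in an on-chip FIFO, and drain it into an allocated memory region on free bus cycles. The class name TraceBuffer, its methods, and the parameters base, size, and depth are hypothetical names chosen only for this sketch. It is a behavioral model under those assumptions, not a hardware description.

```python
from collections import deque


class TraceBuffer:
    """Behavioral model of reconfigurable i/o logic used for profiling.

    All names and parameters are illustrative assumptions.
    """

    def __init__(self, base, size, depth):
        self.base = base        # start of the allocated region in main memory
        self.size = size        # number of trace slots in that region
        self.depth = depth      # capacity of the on-chip FIFO
        self.fifo = deque()
        self.next_slot = 0      # next slot to write; wraps around the region
        self.cycle = 0          # free-running counter used as the timestamp

    def tick(self):
        """Advance the cycle counter by one."""
        self.cycle += 1

    def store(self, pc):
        """Processor side: one store of a program location.

        Returns False if the FIFO is full and the entry was dropped.
        """
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((self.cycle, pc))
        return True

    def bus_free(self, memory):
        """Memory side: on a free bus cycle, write one buffered entry out."""
        if not self.fifo:
            return
        memory[self.base + self.next_slot] = self.fifo.popleft()
        self.next_slot = (self.next_slot + 1) % self.size


memory = {}
tb = TraceBuffer(base=0x1000, size=4, depth=2)
tb.store(0x40)
tb.tick()
tb.store(0x44)
tb.bus_free(memory)
tb.bus_free(memory)
print(memory)  # {4096: (0, 64), 4097: (1, 68)}
```

In this model the processor only ever calls store; address generation and timestamping happen entirely in the buffer logic, which is the point of the example in the text.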

With the reconfigurable logic optionally in the i/o datapath, the flexible logic can be used for:

Byte swapping -- When the native byte order for the processor differs from that of the data being handled, the reconfigurable logic can provide the byte reordering as needed. Examples of cases where selective byte swapping may be necessary include attached peripherals or coprocessors with different byte orders, system or network software which expects a particular byte order, or file formats which require a specific byte order.
Field extraction and insertion -- When formatting or processing messages, protocols, or packed data, the processor may need to selectively extract and rearrange fields for processing or storage.
Encoding and Decoding -- Often a data stream must be decoded before processing or encoded before exiting the processor. The reconfigurable logic can serve to translate data appropriately for internal use. Common examples include de/encryption of data for secure storage and transmission and (de)compression to minimize storage space or transmission time.
Scatter/gather addressing -- Often the processor's i/o system needs to remap addresses in some systematic way to address a data structure efficiently. For example, in the Cray T3D, an external address unit remaps addresses to block and distribute array elements [Cra95]. The reconfigurable i/o can be programmed to remap the addresses in an appropriate, data structure and system specific manner.
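The byte swapping and field extraction/insertion operations listed above are simple bit manipulations. A minimal Python sketch of the functions such logic would compute follows; the function names and the 32-bit word width are assumptions made for illustration only.

```python
def byte_swap32(word):
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "big"), "little")


def extract_field(word, lsb, width):
    """Return the width-bit field of word whose lowest bit is at position lsb."""
    return (word >> lsb) & ((1 << width) - 1)


def insert_field(word, lsb, width, value):
    """Return word with the width-bit field at position lsb replaced by value."""
    mask = ((1 << width) - 1) << lsb
    return (word & ~mask) | ((value << lsb) & mask)


print(hex(byte_swap32(0x11223344)))                 # 0x44332211
print(hex(extract_field(0xABCD1234, 8, 8)))         # 0x12
print(hex(insert_field(0xABCD1234, 8, 8, 0xFF)))    # 0xabcdff34
```

In software each of these costs several shift and mask instructions per word; in the i/o datapath they amount to wiring and a few gates.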

Advantage Summary

Generally, we can summarize a few common advantages for a reconfigurable i/o interface:

Performance -- Direct, on-chip interfacing saves the requisite chip-crossing latency associated with interfacing the processor to a peripheral or system using a separate ASIC.
Costs -- System costs are reduced by reducing the number and kind of components required to interface the processor to its peripherals and system environment.
Flexibility -- The reconfigurable i/o processor can adapt rapidly to changing and varied system architectures and standards and can serve to differentiate products.

Attached Logic or Function Unit

Another important application for processor-coupled reconfigurable logic is to serve as an application specific accelerator. Here, we use the reconfigurable logic to build logical functions and operations which are used heavily by a particular application. To achieve low latency and high bandwidth between the processor and the reconfigurable logic, we attach the reconfigurable logic directly to the processor's register file along with the fixed functional units (e.g. ALU, IU, FPU, LD/ST, MDU).

Architectural Options

Recall our basic microprocessor organization from Figure . When we focus in on the interface between the register-file and ALU, the typical organization looks like Figure . Here, a two read, one write port register file is coupled to a single ALU. Register-file addresses are generally derived from the decoded instruction stream and are not shown in Figure .

Figures and show two simple options for the addition of a single programmable function unit (PFU) to the traditional RF/ALU organization shown in Figure . In Figure , the RF ports are shared between the ALU and PFU allowing the processor to retire at most one result from each functional unit on each cycle and allowing at most two operands to be sent to the ALU/PFU combination each cycle. Figure has independent read and write ports allowing both to operate independently and fully in parallel. Of course, hybrids between these two extremes are also possible (e.g. Figure , which shares one of three read ports between the ALU and PFU). Reducing the number of read/write ports into the register file allows the register file implementation to be simpler and faster, while increasing the number of ports allows a larger range of operations to occur in parallel.

Today's high-end microprocessors generally have multiple, fixed functional units, exploiting parallelism to increase throughput. In these superscalar and VLIW configurations, the programmable function unit (PFU) would take its place alongside the fixed function units. Figure shows the general organization of the processing core of such a superscalar or VLIW processor. The expander and concentrator blocks abstract away the large range of datapath sharing which could go into an implementation. As with the simpler examples above (Figures , , and ), the number of register file ports can be conserved by sharing them among functional units at the expense of restricting the options for parallel operation.

Table summarizes the parameters included in the register file and fixed unit datapath shown in Figure . This assumes a single load/store unit taking in a single address-data pair and generating a single data result. Of course, multiple load store units with varying input/output interfaces are also possible. Note, as long as , read port sharing will be necessary in the expander. Similarly, as long as , write port sharing will be necessary in the concentrator.

It is also worth noting that it is generally better to share the logic among PFUs. Consequently, rather than designing the processor with independent PFUs, one would design one large PFU, perhaps times as large as a typical single PFU, and provide it with and inputs and outputs. This also gives the PFU set additional flexibility in utilizing its RF read and write bandwidth. Figure shows this configuration.

Similarly, in designs where flexible i/o is also desirable, as described in Section , it may be beneficial to merge the PFU reconfigurable logic with the input/output reconfigurable logic. Figure shows a case where the load/store function is subsumed by reconfigurable logic (compare Figure ). Figure shows the analog to Figure , where the fixed load/store and programmable logic exist in parallel.

Figures through show specific, small examples with a single ALU, a single PFU unit which can serve as reconfigurable i/o, and a single hardwired load/store unit. The primary difference among these examples is the number of RF read/write ports and hence the function of the expander and concentrators.

Timing Control

Assuming the processor runs at some fixed rate independent of the function implemented in the PFU, the logic coupling may have to deal with various timings which are possible in the PFU.

Single cycle latency, Single cycle throughput -- In the simplest case the PFU function may operate within the standard pipeline clock period of the processor.
Multiple cycle latency, No new Operations while Operation in Progress -- In some cases the latency of the PFU operation may be multiple clock cycles. In the simplest multiple cycle case, the processor will not be able to issue a new operation until the previous operation completes.
Multiple cycle latency, Single cycle throughput -- If the programmed function is pipelined, the processor may still be able to issue a new operation every cycle.
-cycle Latency, Launch every -cycles -- In the general case, the processor may be able to launch a new operation every cycles, while the operation completes cycles after launched.
Multiple Latency Function -- Sometimes a PFU may implement several functions with differing, but predictable latencies.
Variable Latency -- Some operations may have data dependent latencies.

We can handle most of these cases in the same way analogous cases are already handled in processors. The main difference is that a fixed functional unit falls into one of these categories at design time, whereas the category here depends on the function being implemented and hence is not known until the function is configured.

Predictable delay constraints can be scheduled in software. That is, the compiler can guarantee to only emit code which will launch a new operation every cycles and expects the result of an operation to only be available after cycles. The compiler will know the PFU function when generating the code to accompany it, so can arrange code appropriately to handle the specifics of a particular PFU.
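To make the software scheduling constraint concrete, here is a minimal Python sketch of the timing a compiler would have to respect for a PFU with a fixed launch interval and a fixed completion latency. The function name and the parameter names launch_interval and latency are labels introduced for this sketch only; they stand for the two per-configuration quantities discussed above.

```python
def schedule_pfu_ops(n_ops, launch_interval, latency, start=0):
    """Return (issue_cycle, result_cycle) pairs for n_ops back-to-back PFU operations.

    launch_interval: minimum cycles between successive launches (assumed name).
    latency: cycles from launch until the result may be read (assumed name).
    """
    if launch_interval < 1 or latency < 1:
        raise ValueError("launch_interval and latency must be at least 1")
    schedule = []
    for i in range(n_ops):
        issue = start + i * launch_interval
        schedule.append((issue, issue + latency))
    return schedule


# Three operations on a PFU that accepts a launch every 2 cycles
# and completes each operation 5 cycles after launch.
print(schedule_pfu_ops(3, launch_interval=2, latency=5))
# [(0, 5), (2, 7), (4, 9)]
```

A compiler following this model would fill the cycles between an issue and its result with independent instructions, or with no-ops when none are available.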

To support variable times, the control logic can accommodate ready and busy signals from the programmable logic. The PFU can, for instance, have a pair of extra signals, one to indicate when the result is done and one to indicate when the PFU is ready for the next operation. These control signals would be generated from the programmable logic and be customized to each PFU configuration. The controller can then stall the pipeline when the PFU is not ready for input. Similarly, it can use the result completion signal to determine when to writeback the result and when to stall dependent computation. The processor could, for example, use a standard register score-boarding strategy to force the processor to stall only when the instruction stream attempts to access the PFU result before it is generated.

Figure shows such an arrangement. ready_input is asserted whenever the PFU is ready to receive a new input. retire_result is asserted when the PFU completes each operation. The processor control will stall the pipeline if ready_input is not asserted when the next operation in the pipeline requires the PFU. The processor control uses retire_result to recognize when a result is done and make sure writeback occurs to the appropriate target register at that time. When the PFU instruction is issued, the result register is marked unavailable. If the processor control encounters any reads to unavailable registers, it can stall the dependent instruction awaiting the writeback which makes the value available.
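The stall behavior just described can be sketched as a small cycle-level model in Python. This is only an illustration under simplifying assumptions: in-order issue of one instruction per cycle, ALU results available to the next instruction, a fixed PFU latency and launch interval standing in for the retire_result and ready_input signals, and no modeling of write-after-write hazards. The function name, the program encoding, and the parameter names are hypothetical.

```python
def simulate(program, pfu_latency, pfu_launch_interval):
    """Return the cycle at which each instruction issues.

    program: list of (unit, dest, sources), with unit either "alu" or "pfu".
    """
    unavailable = {}    # register -> cycle at which its PFU writeback occurs
    pfu_ready_at = 0    # first cycle on which ready_input is asserted again
    cycle = 0
    issue_cycles = []
    for unit, dest, sources in program:
        while True:
            # Registers whose writeback has occurred become available again.
            for reg in [r for r, t in unavailable.items() if t <= cycle]:
                del unavailable[reg]
            ready_input = cycle >= pfu_ready_at
            operands_ok = all(s not in unavailable for s in sources)
            if operands_ok and (unit != "pfu" or ready_input):
                break
            cycle += 1  # stall the pipeline for one cycle
        issue_cycles.append(cycle)
        if unit == "pfu":
            unavailable[dest] = cycle + pfu_latency   # mark result register
            pfu_ready_at = cycle + pfu_launch_interval
        cycle += 1
    return issue_cycles


program = [
    ("pfu", "r3", ("r1", "r2")),
    ("alu", "r4", ("r1", "r2")),   # independent of r3, issues without stalling
    ("alu", "r5", ("r3", "r4")),   # reads r3, stalls until the writeback
    ("pfu", "r6", ("r1", "r5")),
]
print(simulate(program, pfu_latency=3, pfu_launch_interval=2))  # [0, 1, 3, 4]
```

Only the instruction that reads r3 stalls; the independent ALU instruction proceeds while the PFU operation is in flight.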

Of course, a particular processor could choose to restrict the kinds of variability allowed to simplify control. Implementations could restrict themselves to no variability or to variability only in launch rate or completion latency.

Diverting Control Flow

Other hooks into the processor's control flow may be merited. In particular, there are a number of applications where it would be beneficial to give the logic an option to divert program flow rather than simply stall it. Two general classes arise:

Exception/assumption detection -- The processor code could be compiled assuming certain cases do not occur. The PFU could then be programmed to watch values and detect when these exceptional cases occur, diverting the processor's control to handle the exceptional case accordingly. For example, compiled code could be written to assume a certain address range is never written, allowing the values to be cached in registers or even compiled into the code. The PFU then performs parallel checking to see that this assumption is met throughout the region of code. In a similar manner, the PFU might be programmed to watch for specific address writes to facilitate debugging.
PFU limitations -- Similarly, the PFU may implement a restricted version of some function -- perhaps one that only works for certain values. When unusual values, those not handled by the PFU, are detected the PFU could divert control flow to software which implements the function.

As described, this could simply be a line which signals a synchronous exception vectored into a handler set up to handle the specified exceptional event. Alternately, the line could set dirty bits in the processor state, thread state, result registers, or the like to indicate that the computed value was incorrect and should be ignored. Such a line might also inhibit memory writes or other side-effecting operations which might write incorrect results based on the violated assumptions.
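The address-range assumption check can be modeled in software as follows. This Python sketch is an analogy only: a Python exception stands in for the synchronous exception line, and the names AssumptionViolated, checked_store, and run_region are hypothetical. In the scheme described above the check would run in the PFU in parallel with the store, rather than as extra instructions.

```python
class AssumptionViolated(Exception):
    """Stands in for the synchronous exception signaled by the PFU."""


def checked_store(memory, addr, value, watch_lo, watch_hi):
    """Perform a store unless it falls in the watched range [watch_lo, watch_hi)."""
    if watch_lo <= addr < watch_hi:
        # The offending write is inhibited and control is diverted.
        raise AssumptionViolated(addr)
    memory[addr] = value


def run_region(stores, memory, watch_lo, watch_hi):
    """Run a code region compiled under the assumption that the range is never written."""
    try:
        for addr, value in stores:
            checked_store(memory, addr, value, watch_lo, watch_hi)
        return "fast path completed"
    except AssumptionViolated:
        return "diverted to handler"


mem = {}
print(run_region([(0x10, 1), (0x20, 2)], mem, 0x100, 0x200))   # fast path completed
print(run_region([(0x30, 3), (0x180, 4)], mem, 0x100, 0x200))  # diverted to handler
```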

Control Registers

In some cases it may be useful to place specialized control registers inside the PFU. For example, for a DPGA PFU it might be beneficial to have a dedicated context state register for the array inside the PFU. This would be particularly advantageous if the PFU performed multiple cycle functions in the same PFU, but the processor did not want to allocate register file or instruction bandwidth to feed the context identification into the PFU on every cycle. Some internal registers may be beneficial whenever the PFU operates on logical input data larger than its register file datapath. Internal registers can always be built out of the programmable logic, but where we can anticipate their common need, it is cheaper to go ahead and include fixed registers.

Control Inputs

So far, we have described scenarios where the PFU simply takes data from the register file datapath. We may want a control signal into the PFU indicating when new data is valid and destined for the PFU. Of course, if the PFU can always operate on the data it is provided and the processor only takes results from the PFU when it expects the PFU to generate results, such control is not strictly required. However, if the PFU is tied up for multiple cycles with each operation, as suggested in some usage scenarios above, the PFU needs to be told when it actually needs to start operating on data. Additional control signals might tell the PFU more about what to do. For example, if a PFU is set up to perform more than one operation, the PFU might get some instruction bits to specify the current operation. Similarly, control bits might inform the PFU about the kind of data it is now receiving via its register file inputs. This information would be particularly valuable if the PFU operated on more data than it got over the register file datapath in a single cycle and did not always get all of the data it operates on reloaded in a deterministic sequence.

Orchestrated DPGA/SIMD Logic

We can also view the processor sequencing and control as an orchestrator, coordinating DPGA or SIMD logical operations occurring within the PFU. This view is entirely consistent with the general scheme presented here. A processor designed specifically with this in mind is likely to include more PFU logic and fewer fixed ALUs. In fact, the fixed ALUs might exist primarily for addressing, control branching, exception handling, and configuration loading.

Application

Special Purpose Functional Units -- The primary application for the programmable functional unit model is, of course, as special purpose functional units adapted to a particular application. Operations which are executed frequently in the application, but poorly supported in the fixed processing units, can be implemented in the PFU to accelerate the application.
Matching, Searching, and Filtering -- One class of operations for the specialized functional unit is to support efficient data filtering. When the processor needs to process large amounts of data looking for certain characteristics, the PFU can be programmed to identify the desired characteristics allowing the processor to rapidly walk over large datasets. This kind of support is likely to be valuable in database and transaction processing applications.
Exception/assumption checking -- As noted above, the PFU can also be used for assumption checking and error detection in support of speculative and common-case execution on the fixed or programmable functional units.
Fine-grained Parallelism -- PFUs implemented with fine-grained logic (e.g. FPGAs or DPGAs) can very efficiently take advantage of fine-grained, application-specific parallelism. This kind of parallelism is handled particularly poorly with traditional, fixed, wide ALUs.
Special Purpose State Collection and Computation -- State can also be built up inside the PFU. The PFU can thus be used to maintain specialized state adapted to a particular application. Further, the PFU can implement logic to efficiently update that state as new data is collected or events occur. For example, the PFU could implement a pseudo-random number generator, maintaining the pattern state internally as well as computing functional updates. A wide range of statistics collection could be implemented this way. The processor could fire data values at the PFU, and the PFU would use the new data to update its statistical residues to be retrieved by the processor once the monitoring period ends.
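The two stateful examples in the last item, a pseudo-random number generator and statistics collection, can be sketched in Python as follows. The note does not say which generator is intended; the sketch assumes a 16-bit Fibonacci linear feedback shift register with taps at bits 16, 14, 13, and 11, a common choice for shift-register hardware. The class names and the particular statistics kept are likewise illustrative assumptions.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); state is held internally."""

    def __init__(self, seed=0xACE1):
        if seed & 0xFFFF == 0:
            raise ValueError("seed must be non-zero")
        self.state = seed & 0xFFFF

    def step(self):
        """Advance one step and return the new 16-bit state."""
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state


class RunningStats:
    """Accumulates count, sum, minimum and maximum of the values fired at it."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def update(self, value):
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


rng = Lfsr16()
stats = RunningStats()
for _ in range(1000):
    stats.update(rng.step())
print(stats.count, stats.minimum, stats.maximum)
```

In both classes the processor only supplies a trigger or a data value; the state update itself is the part that would live in the PFU.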

Advantage Summary

Performance -- The programmable functional unit arrangement is primarily aimed at increasing the performance of the processor by allowing the inclusion of application specific acceleration logic.
Functional Extension -- This coupling makes some operations feasible which are conventionally infeasible. Operation and value checking can support debugging. Lightweight data collection and maintenance can facilitate minimally intrusive profiling and monitoring.

Control Logic

An interesting class of reconfiguration becomes available when reconfigurable logic is interfaced with the basic control circuitry for the processor. In the previous section we began to introduce some special cases where allowing the reconfigurable logic direct access to consume and generate control signals will expand the range of adaptation possible. In this section, we focus more specifically on this class of reconfiguration, which is useful apart from its coupling to PFU logic.

Architecture

Every traditional microprocessor has logic which controls the flow of instructions and data. This logic usually accounts for a very small portion of the silicon area on the processor die, but plays a large role in establishing how the processor behaves and what it does efficiently. Direct hooks into this logic allow us to reconfigure the basic processor behavior. The hooks could range from allowing reconfigurable logic to drive into selective fixed-logic signals, as suggested for the stall in the previous section, to replacing the fixed control logic with a reprogrammable substrate. The latter offers more flexibility while the former allows faster and smaller implementation of standard control structures. Just like the flexible input logic, default, hardwired control logic can be wired in parallel with reconfigurable logic to give some elements of both schemes.

In general, reconfigurable logic might monitor:

various data and address lines on the chip -- The logic may use these inputs to detect unusual events or assumption violations.
cache miss, conflict, TLB miss lines, read/write control lines, processor stall, pipeline flush, branching, branch prediction, mispredicted branch -- Access to these signals will be particularly useful when the reconfigurable logic controls behavior in exceptional cases like cache or TLB misses.
i/o's -- As noted in Section , direct access to the i/o's is beneficial in adapting the processor to a system and the devices it is operating with. This kind of access on some i/o signals is useful for adapting control and signaling even if the primary i/o busses are not reconfigurable.

All of these lines can be monitored for profiling, debugging, and statistical purposes.

The reconfigurable logic might control:

processor stall, pipeline flush
next instruction
read/write control (internal writeback and external)
outputs

This kind of control was introduced in Sections and , and is also useful independent of a programmable functional unit or reconfigurable i/o.

When reconfigurable control logic is arranged in this manner, the processor's behavioral patterns can be revised. In some cases, this may allow the reconfigurable logic to control what happens on exceptional events like traps, cache misses, TLB misses, or context switches. It may also allow the instruction stream to make more or less stringent assumptions about the data and provide a means of checking these assumptions and handling the data accordingly.

Among other things, this may allow the processor to be adapted to match the semantics desired by a particular operating system or operating environment. In many modern systems, the OS ends up executing many instructions to switch contexts, take traps, or save/restore registers because the processor does not quite provide the right hooks to match the OS semantics (e.g. [ALBL91]). Reconfigurable control can provide an opportunity to make up for the semantic gap at the processor level, rather than incurring large software overheads to emulate the desired semantics.

In general, the control logic on the processor is the hardest part to get correct. The various exceptional and hazard cases, and their interactions, are difficult to handle well and correctly. Sometimes it is difficult to decide what the correct behavior should be. With highly reconfigurable control logic, we defer the binding time, allowing the logic to be fixed after the processor is fabricated and allowing the behavior to be revised without spinning a new design.

If one does combine reconfigurable control with a programmable functional unit (Section ) or reconfigurable i/o (Section ), it may make sense to combine the reconfigurable logic into one large block. This allows sharing and averaging. When the control is simpler, more space is available for i/o or programmable functions. Similarly, when the control is large, it can borrow space from the other reconfigurable units. Figure shows our basic processor datapath with a reconfigurable logic block serving as a PFU, reconfigurable i/o, and which includes hooks into the processor's control logic.

Application and Advantage Summary

Behavior Modification -- The processor's behavior can be modified, allowing it to be tuned for the characteristics of a particular application or adapted to provide particular semantics.
Profiling and Monitoring -- Direct hooks allowing reconfigurable hardware to monitor on-chip signals support a wide variety of low-overhead profiling which is currently impossible. Key on-chip lines and datapaths can be monitored, with event computation and handling processed directly in reconfigurable logic. Simple statistics gathering can update state in the reconfigurable logic without perturbing processor operation. Coupled with sufficient hooks into the i/o, the logic may even be able to store events off to memory without interrupting the processor's execution. More elaborate schemes can use the reconfigurable logic to detect events, then divert program flow to allow the processor to run code to further handle the event.
Debugging -- Combining control of processor flow with event monitoring, the reconfigurable logic can be used to provide rich debugging support. Breakpoints can be set based on internal processor state. Customized actions can even be defined to snapshot and preserve precious state when targeted events occur.
Exploit on-chip bandwidth -- On-chip access to internal signals is moderately inexpensive. However, limitations in off-chip bandwidth make it impractical to route internal signals off chip for observation or control. On-chip reconfigurable logic makes it possible to access these signals economically.
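As a rough software model of the profiling and debugging hooks listed above, the sketch below counts watched events on the side and requests a diversion only when a predicate over internal state matches. The signal names (`cache_miss`, `tlb_miss`, `pc`) and the `Monitor` class are hypothetical, chosen for illustration; this note does not define such an interface.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Monitor:
    """Toy model of reconfigurable logic watching on-chip signals."""
    # Predicate over the per-cycle signal snapshot; True means "divert".
    trigger: Callable[[Dict[str, int]], bool]
    counts: Counter = field(default_factory=Counter)
    snapshots: List[Dict[str, int]] = field(default_factory=list)

    def observe(self, signals: Dict[str, int]) -> bool:
        # Statistics are updated on the side, without stopping execution.
        for name in ("cache_miss", "tlb_miss"):
            if signals.get(name):
                self.counts[name] += 1
        # On a targeted event, preserve state and request a diversion.
        if self.trigger(signals):
            self.snapshots.append(dict(signals))
            return True
        return False


# Hypothetical trace: one snapshot of (assumed) internal signals per cycle.
trace = [
    {"pc": 0x100, "cache_miss": 0, "tlb_miss": 0},
    {"pc": 0x104, "cache_miss": 1, "tlb_miss": 0},
    {"pc": 0x108, "cache_miss": 1, "tlb_miss": 1},
    {"pc": 0x10C, "cache_miss": 0, "tlb_miss": 0},
]

# Break when a TLB miss and a cache miss occur in the same cycle.
mon = Monitor(trigger=lambda s: bool(s["cache_miss"] and s["tlb_miss"]))
diverted_at = [s["pc"] for s in trace if mon.observe(s)]
```

In this model the counters play the role of state held in the reconfigurable logic, and the returned flag stands in for the signal that would divert program flow to a handler.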

Instruction Interpretation

The processor reads the instruction from the cache and controls execution accordingly. In modern RISC microprocessors, the instruction is decoded by hardwired control logic to manipulate each stage of the processor's pipeline. Reconfigurable logic can be integrated into the processor's instruction stream decoding in a number of ways:

Straight -- the reconfigurable logic interprets the instruction stream, controlling the code, effectively taking the place of the fixed decoder and control (See Figure ).
Microcode-like -- the reconfigurable logic can turn the instruction into a sequence of primitive operations and control their flow through the processor. This is much like traditional microcode, except that the control is implemented in the reconfigurable logic rather than a microcode PROM.
Fixed/Flexible Hybrid -- A hybrid flexible/hardwired scheme might provide direct paths for common operands such as register file addresses, while the reconfigurable logic has complete control over operation interpretation and control signals (See Figure ).
Code Expansion -- the code can also be expanded from a denser, memory-efficient form into a larger, more workable form when stored in the I-cache (e.g. CRISP [DMB87]). The instructions returned from the I-cache are thus expanded instructions which directly control the on-chip resources. In some ways this is reminiscent of the decoding which can be applied to data on the processor's input datapath when using reconfigurable i/o (Section ). Of course, this configuration has a straight and a hybrid scheme, as well. Figure shows the straight version.
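The microcode-like option can be pictured as a table lookup that turns each incoming instruction into a sequence of primitive operations. The sketch below is only an illustration: the instruction names (`PUSH`, `ADDM`), the primitive operation names, and the scratch register `t0` are invented here and do not come from this note.

```python
from typing import Dict, List, Tuple

# A primitive operation: (operation name, operand template strings).
Primitive = Tuple[str, Tuple[str, ...]]

# Hypothetical expansion table; in hardware this would be the configured logic.
EXPANSION: Dict[str, List[Primitive]] = {
    # push rs      ->  sp = sp - 4 ; mem[sp] = rs
    "PUSH": [("sub_imm", ("sp", "sp", "4")), ("store", ("{0}", "sp"))],
    # addm rd, ra  ->  t0 = mem[ra] ; rd = rd + t0
    "ADDM": [("load", ("t0", "{1}")), ("add", ("{0}", "{0}", "t0"))],
}


def expand(instr: str) -> List[Primitive]:
    """Expand one source instruction into core primitive operations."""
    opcode, *operands = instr.replace(",", " ").split()
    if opcode not in EXPANSION:
        # Instructions the core already supports pass straight through.
        return [(opcode.lower(), tuple(operands))]
    return [
        (op, tuple(arg.format(*operands) for arg in args))
        for op, args in EXPANSION[opcode]
    ]


program = ["PUSH r3", "ADDM r1, r2", "ADD r4, r5, r6"]
micro_ops = [prim for instr in program for prim in expand(instr)]
```

Swapping the contents of `EXPANSION` corresponds to reconfiguring the decoder; the fixed core only ever sees the primitive operations.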

Application and Advantage Summary

Decode and emulate instructions -- Flexible instruction interpretation allows the processor to be adapted to efficiently decode and run instructions for some fixed processor. The flexible logic decodes the provided instruction stream and reformulates it into the control structures on this processor. In general, this may require expanding some instructions into multiple operations on the core processor.
Customize instruction stream to application -- The instructions can be customized to the application. This can be used to compress executables by adapting the instruction encoding to the application instruction usage. As a simple example, we could derive the optimal Huffman source coding for a binary, then revise the instruction encoding accordingly. Of course, more sophisticated recodings will provide greater benefits.
Ease incorporation of configurable structures -- Customized instruction decoding is complementary to the other reconfigurable structures introduced here, allowing us to adapt the instruction stream to directly accommodate new instructions and behaviors made possible by the reconfigurable logic.
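The Huffman recoding mentioned above can be sketched in a few lines. The opcode counts below are made up for illustration; a real flow would measure them from the target binary and would also have to deal with operand fields, which this sketch ignores.

```python
import heapq
from itertools import count
from typing import Dict


def huffman_code(freq: Dict[str, int]) -> Dict[str, str]:
    """Return a prefix-free bit string for each symbol in `freq`."""
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing a bit to each side.
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


# Hypothetical static opcode counts for one application binary.
opcode_freq = {"load": 40, "add": 30, "store": 15, "branch": 10, "mul": 5}
code = huffman_code(opcode_freq)

total = sum(opcode_freq.values())
avg_bits = sum(opcode_freq[op] * len(code[op]) for op in code) / total
fixed_bits = 3  # a fixed-width encoding needs 3 bits for 5 opcodes
```

For these assumed counts the average opcode length comes to 2.05 bits against 3 bits for a fixed-width field. The reconfigurable decoder would then be loaded with the inverse of `code`.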

Basic processor behavior

Architecture

A classic question in processor architecture is: ``Where should resources be deployed?'' Should the cache be larger/smaller relative to the TLB? Do we allocate space to prefetch or writeback buffers? How much memory should go into the data cache, instruction cache, scratchpad memory? Do we include a branch target buffer or victim cache?

A second question which comes along with this one is: ``How do we manage the limited resources?'' What's the prefetch/reload/eviction policy?

The traditional solution to both of these questions is to make a static decision at design time which does, in some sense, reasonably well across the benchmarks the designers consider representative. This inevitably leads to compromise for everything, and for many applications the magnitude of the compromise can be quite large.

With a reconfigurable processor we can, instead, leave some flexibility in the architecture so the machine can be configured to deploy the resources most effectively for the given application. The idea is to go ahead and build specialized pieces of hardwired logic with common utility (e.g. memories, ALUs, FPUs) but, rather than completely hardwiring their control and datapaths, to leave flexibility to reorganize their interconnect and hence their use.

Figure , for example, shows a revision of our generic VLIW processor architecture where blocks of configurable memory have been added to the collection of processing resources. Here, some outputs from the ALU/PFU/FPU/memory bank are routed back to the expander to allow cascaded operations. For example, a virtual memory address coming out of a register may be translated through a memory before being fed to the i/o. Similarly, a base address from one register may be added to an index from another before the result is fed in as an address to the cache. To facilitate this, we conceptually add additional outputs from the concentrator and inputs to the expander, in addition to the additional concentrator inputs and expander outputs entailed by the additional memory units they support.

Additionally, the memories can be arranged in standard sized chunks which can be composed, allowing the memory resources to be shuffled at a moderate granularity. For example, each basic memory could be a 2Kx8 memory chunk. 4 or 8 of these can be grouped together to build a 32- or 64-bit wide memory. Additionally, they can be cascaded to build deeper memories. So, 16 of these chunks (4 wide by 4 deep) could be cascaded to build an 8Kx32 memory.
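The composition arithmetic can be checked with a short sketch. Only the 2Kx8 chunk size and the target shapes come from the text; the `Shape` type and the `blocks_needed` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Shape:
    depth: int  # number of words
    width: int  # bits per word


def blocks_needed(target: Shape, block: Shape) -> Tuple[int, int, int]:
    """Return (blocks_wide, blocks_deep, total) to tile `target` from `block`."""
    wide = -(-target.width // block.width)  # ceiling division
    deep = -(-target.depth // block.depth)
    return wide, deep, wide * deep


BLOCK = Shape(depth=2 * 1024, width=8)  # the 2Kx8 chunk from the text

wide_32 = blocks_needed(Shape(2 * 1024, 32), BLOCK)   # 2Kx32
wide_64 = blocks_needed(Shape(2 * 1024, 64), BLOCK)   # 2Kx64
deep_32 = blocks_needed(Shape(8 * 1024, 32), BLOCK)   # 8Kx32
```

Grouping side by side widens the word and cascading deepens the memory, so by this arithmetic an 8Kx32 memory takes four chunks across and four chunks deep.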

With a little bit of additional control logic, these memories can be used as caches, cache-tags, TLBs, explicit scratchpad memories, FIFO buffers, or the like. These memories can completely subsume the separate data cache shown in our original processor model (Figure ). The additional control logic is likely to be supported largely in reconfigurable logic as suggested in Section .

Systems that do not use a TLB can reallocate memory blocks to the cache. Applications with more or less virtual memory locality can adjust the size of the TLB accordingly. Applications with known data access patterns can use explicit scratchpad memory, reallocating the memory blocks holding cache tags to data storage. Applications with specific spatial locality in data or instructions can build logic to prefetch data according to the application's needs.

Figure expands the datapath to show the processor control. In particular, this organization makes it clear that the instruction cache can be implemented out of the memory units, as well. Each application can now trade off memory between the i-cache and d-cache based on its needs.

We can also view the register file as another memory which can also be built out of the deployable memory units. Figure shows a configuration where there is no a priori designated register file. Rather the register file is built out of the memories. This may allow, for example, the reconfiguration of the register file width, depth, and number of simultaneous read ports. Further the register file can be broken into a series of smaller register files where appropriate. Here, the expander/concentrator is collapsed into a single reconfigurable interconnect.

Alternately, the register file may want to be a slightly specialized memory unit, but still be deployable for the reasons articulated here. As noted above, width, depth, and read cascading are moderately easily constructed by paralleling memory blocks just as in building basic memory structures. What is harder to build by composition is multiple write ports, and register files often depend heavily on a number of simultaneous write ports. For this reason, it may make sense to also include a different kind of memory block with multiple write ports to allow efficient construction of register files, as well as other structures requiring simultaneous write support.

The configuration shown in Figure shows hardwired processor control and a completely reconfigurable i/o unit. Of course, variations could implement all or much of the control in reconfigurable logic and/or include hardwired load/store units.

This finally leads to a revised model for a computing device in which basic, specialized functional units (e.g. memories, ALUs, FPUs, MDUs, LD/ST units, DMA logic, hardwired control units) are embedded in a reconfigurable interconnection scheme along with regions of reconfigurable logic which can be used for monitoring, control, i/o, decoding, and as PFUs. This device gains the performance and space advantages of hardwired logic units for commonly used operations. At the same time, it gains a performance advantage over a purely fixed microprocessor by adapting the processor organization much more tightly to the application.

Note that the reconfigurable interconnect used to interconnect functional units differs both from the fine-grained reconfigurable interconnect typically employed in FPGAs and the expander/concentrator interconnect used in a pure VLIW. Rather, it is a hybrid of the two. Unlike traditional FPGA interconnect, most of the operations with the interconnect are bus oriented. Therefore, busses are switched in groups. Most busses may be some nominal processor bus width (e.g. 16, 32, or 64 bits). Some will be able to compose or decompose these busses to other interesting sizes (e.g. 1-bit entities for fine-grained logic, 8-bit entities for reconfigurable memories). In a traditional VLIW, the decoding of the instruction specifies the configuration of busses. With this kind of a setup, the instruction bandwidth would be excessive if fully configured from the instruction stream. Similarly, the interconnect pattern would be too rigid if fully configured via FPGA style programming. Here, the configuration of busses will depend partially on the instruction executed and partially on the way the system is currently configured. In most scenarios the decoding between the instruction stream specification of interconnect and the full interconnect specification would be done in the reconfigurable logic. For efficiency, the reconfigurable logic serving this purpose might be tailored somewhat to this application.
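One way to picture this split between instruction-time and configuration-time control is a small lookup: the instruction carries only a short route selector, and the currently loaded configuration expands it into a full bus-level interconnect specification. Everything in the sketch (the unit names, the `Route` and `Configuration` types, and the count of 64 endpoints) is hypothetical.

```python
from typing import Dict, List, NamedTuple


class Route(NamedTuple):
    source: str  # functional unit output driving the bus
    dest: str    # functional unit input receiving the bus
    width: int   # bus width in bits; busses are switched as a group


# A configuration maps a short selector to a full set of bus routes.
Configuration = Dict[int, List[Route]]

# Hypothetical configuration providing address-generation datapaths.
addr_config: Configuration = {
    0: [Route("regfile.a", "alu0.x", 32), Route("regfile.b", "alu0.y", 32)],
    # Cascade: base + index in the ALU, result fed straight to the cache.
    1: [Route("regfile.a", "alu0.x", 32), Route("regfile.b", "alu0.y", 32),
        Route("alu0.out", "cache.addr", 32)],
}


def decode_routes(selector: int, config: Configuration) -> List[Route]:
    """Expand an instruction's selector field under the active configuration."""
    return config[selector]


def explicit_spec_bits(routes: List[Route], n_endpoints: int) -> int:
    """Bits needed to name every route's source and destination explicitly."""
    id_bits = max(1, (n_endpoints - 1).bit_length())
    return len(routes) * 2 * id_bits


routes = decode_routes(1, addr_config)
selector_bits = max(1, (len(addr_config) - 1).bit_length())
explicit_bits = explicit_spec_bits(routes, n_endpoints=64)
```

Under these assumptions the instruction spends 1 bit on the selector where a fully explicit specification of the same three routes would take 36 bits, which is the instruction bandwidth concern raised above.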

Application and Advantage Summary

Deploy resources where needed -- As noted, limited resources can be deployed where they most benefit the application or system using the processor rather than being statically deployed according to aggregate statistics across all applications and systems.
Arrange datapaths as needed -- Datapaths can be organized to match the application requirements. For example, ALUs and register ports can be cascaded to operate on wider data efficiently. Also, data can flow directly from one functional unit to another without an intervening store into and read from the register file.
Systolic operations -- Perhaps as a special, well understood, case of application-specific datapath and resource arrangement, systolic arrays or pipelines of functional units can be arranged allowing data to flow through standard sequences of operations and adapting the processor to exploit task-specific parallelism.
Structure parallelism to the problem -- In general, deployable functional resources allow the processor to structure the parallel operations in accordance with the application. Applications with high, static parallelism can schedule resources statically for parallel operation. Applications suitable to pipelined parallelism can be organized in that manner. Applications with heavy dynamic dependencies can allocate part of the resources to detecting conditions under which serialization is necessary.
Adapt management strategies to application characteristics -- With the addition of reconfigurable control, resource management characteristics can be tuned to the application.

Reconfigurable Logic

The reconfigurable logic can be realized as one of many different structures.

FPGA -- Traditional, fine-grained logic modules with programmable interconnect (e.g. Xilinx LCA [Xil93]) are one likely candidate. The fine-grained structure is particularly good for random logic and bit manipulation.
DPGA -- Multiple context programmable logic (e.g. DPGAs (tn95)) can provide density advantages over more traditional FPGAs, especially in situations where well defined subsets of the logic operate at different times. DPGAs also allow non-destructive background loading which can help reduce effective configuration overhead.
PAL -- PAL-like array structures are good for smaller blocks of logic or for fast decoding.
Memory -- In some settings a block of memory can serve as an efficient, programmable translation resource. Most programmable technologies are implemented out of smaller memories with programmable interconnect.
MATRIX or other reconfigurable datapath array architecture -- MATRIX is a more coarse-grained, datapath oriented architecture which has more specialized memory and ALU primitives and operates on multi-bit bussed data. Similarly, Wang and Gulak's reconfigurable datapath array architecture [WG94] provides more coarse-grained, datapath oriented, reconfigurable blocks suitable for building a microprocessor datapath. These architectures represent a configurable design point somewhere between a fixed processor and a bit-configurable FPGA. In many ways a MATRIX array is an extreme of the deployable fixed-unit organization described in Section . As such, MATRIX may be well suited for certain classes of PFU implementations.
MATRIX/DPGA Hybrid -- In practice, a mix of fine-grained control and more specialized functional units embedded in a common reconfigurable mesh may be a very attractive choice for the reconfigurable logic in PFU and i/o logic.

Much of the logic used in the i/o path and, to some extent, in the PFUs, is likely to be datapath oriented. Consequently, it will probably make sense to specialize a good portion of the array logic to datapath usage. This datapath specialization may include routing busses, slaving multiple programmable cells off of a single configuration (e.g. [CL94]), and including bussed register banks. The benefit of datapath orientation is greater density and performance on datapath applications than regular FPGA/DPGA structures. Fine-grained logic will still be desirable for control operations and bit-wise manipulations.

Configuration Reloading

Figure shows the generic, logical view for a reconfigurable array. The programmable i/o's are shown separate from the configuration facilities.

The configuration i/o's control the loading of the array's configuration or context. The configuration port can look very much like a memory port. Depending on the design requirements, the port can be anything from a 1-bit serial data port with no address control to a 64-bit wide (or larger) data port with full, random access address control to the internal configuration memories. Wider datapaths support more rapid context loading. Random access to the configured logic allows rapid, incremental changes in the array personality.
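To make the width and random-access tradeoff concrete, here is a small sketch that counts the cycles needed to stream configuration bits through ports of different widths. The context size and the size of the incremental change are assumed values for illustration, not figures from any particular array, and the sketch assumes one port-width word is written per cycle.

```python
def load_cycles(config_bits: int, port_width: int) -> int:
    """Cycles to stream `config_bits` through a port `port_width` bits wide."""
    return -(-config_bits // port_width)  # ceiling division


# Assumed sizes, for illustration only.
FULL_CONTEXT_BITS = 100_000  # one whole array context
CHANGED_BITS = 640           # an incremental change, packed into whole words

serial_full = load_cycles(FULL_CONTEXT_BITS, 1)    # 1-bit serial port
wide_full = load_cycles(FULL_CONTEXT_BITS, 64)     # 64-bit port
# With random access, only the changed words need to be rewritten.
wide_incremental = load_cycles(CHANGED_BITS, 64)
```

With these numbers a full serial load takes 100,000 cycles, a full 64-bit load takes 1,563, and an addressed incremental update takes 10, which is why both width and random access matter for rapid changes in array personality.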

The programmable i/o's are inputs to the logic implemented in the reconfigurable array and outputs generated by the array. There need be little direct correspondence between the number of i/o's and the size of the array. In some situations, it will be beneficial for all i/o's to be bidirectional i/o's -- e.g. if the array is being coupled to a common bus on the processor. More likely, in processor-coupled applications, it will be beneficial for all the i/o's to be dedicated inputs and outputs.

For multicontext (e.g. DPGA) designs, a context select will specify the active context. This may come from a special purpose register driving the context select, from hardwired logic, from decoded CPU signals, from a hardwired sequencer, or even from a programmable output from this or another reconfigurable logic array.

In a processor-coupled scenario, we could place the reconfiguration loading data and address path in any of several places:

Register File Port -- We could couple the reconfiguration data and address path into the register file datapath, perhaps even sharing the ports into the reconfigurable unit used as a PFU or reconfigurable i/o. This allows high reload bandwidth, but also requires a processor cycle for each configuration word transferred into the reconfigurable control unit.
Shared I/O Path -- Alternately, we could wire the configuration port into the i/o path. This would allow configurations to be loaded directly from memory without being first loaded into the register file. This could still require processor direction to control the loading of data. If a separate DMA unit is available on the processor, the processor could start a DMA transfer to load a new configuration from memory, then go about computing in parallel with the configuration load.
Separate I/O Path -- A separate i/o path could be provided for reconfiguration. This path might go to dedicated configuration memory. In this case, DMA i/o would be most sensible, since the separate path allows configuration to take place without interrupting processor load/store/fetch operations. This option is, of course, more costly and less flexible than those which share the processor's i/o path and main memory for configuration loading and storage.

Bandwidth Issues for Coupled Reconfigurable Logic

Recent papers begin to show more explicit evidence that the bandwidth between the conventional processor (and memory) and the reconfigurable compute engine limits the attainable performance improvement, typically by an order of magnitude.

For a Sobel edge detector, Luk notes that the hardware-assisted version is, in practice, only 39% faster than the software-only version. He then notes that the communication overhead accounts for 88% of the time taken. ``If this overhead is not included, the hardware-assisted design is approximately 13 times faster than the software version. Furthermore, if the input-output bottleneck can be eliminated so that the only speed limitation is the critical path delay, we estimate that a speedup of about 300 times can be achieved.'' [LWP94]
[GSH94] also presented evidence that performance is directly limited by bandwidth between the control processor and the reconfigurable system. In their talk, they showed that the reconfigurable system gave roughly a 10x speedup, but was limited by the low bandwidth interconnect. They suggested that another factor of ten in performance acceleration could be realized if the bus bandwidth were increased. The (preliminary?) paper alludes to the issue, but does not spell out the result as clearly as the talk.
For Electronic Design Automation (EDA) tasks, [HTA94] finds only marginal benefits (speedup factors between 1 and 8) for off-chip, FPGA co-processors. They find that the bus bandwidth limitation is largely responsible for this bound.
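A back-of-the-envelope reading of the Sobel figures shows how a small overall gain can hide an order-of-magnitude compute speedup. The sketch assumes that "39% faster" means an overall speedup of 1.39 and that the 88% overhead is a fraction of the hardware-assisted run time; both are our interpretations, not statements taken from [LWP94].

```python
def compute_only_speedup(overall_speedup: float, overhead_fraction: float) -> float:
    """Speedup if the communication share of the hardware run time were removed."""
    hw_time = 1.0 / overall_speedup  # software run time normalized to 1
    compute_time = hw_time * (1.0 - overhead_fraction)
    return 1.0 / compute_time


# Figures quoted above: 39% faster overall, 88% of the time in communication.
s = compute_only_speedup(overall_speedup=1.39, overhead_fraction=0.88)
```

Under these assumptions `s` comes out near 11.6, the same order of magnitude as the quoted factor of approximately 13. The gap may simply reflect rounding in the reported percentages, though we have not verified this against the paper.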

MIT Transit Project

