Transit Note #118

Notes on Coupling Processors with Reconfigurable Logic

Andre DeHon

Original Issue: March, 1995

Last Updated: Sat Apr 8 20:35:54 EDT 1995

Introduction

This is an informal note which discusses several options for incorporating reconfigurable logic into a microprocessor design. The goal of this note is to catalog and discuss the options. See (tn100) for a more motivational introduction to coupling reconfigurable arrays with microprocessors.

The note starts by looking at several general classes of reconfiguration which might be worthwhile to support. Section examines reconfigurable i/o, Section looks at reconfigurable or programmable functional units, Section describes reconfigurable control logic, Section explores reconfigurable instruction decoding, and Section looks at scenarios where the processor's basic behavior is reconfigurable. Section touches on reconfigurable logic technologies. Section looks at interface issues associated with reconfiguring the logic.

Flexible I/O

A large class of interesting applications arises if we insert the flexible logic into the processor's on/off chip datapath. In the extreme, the flexible logic could completely replace the off-chip i/o circuitry. Figure shows the basic organization of a vanilla microprocessor. The variants described in this section provide various flexible logic alternatives for the external I/O interface.

Architectural Options

Figure shows a scenario where the flexible logic can interpose itself in the i/o operation. Arranged appropriately, the latency impact on i/o operations which do not make use of the reconfigurable logic can be minimal -- just an extra multiplexor delay in each path. When the reconfigurable array processes data on its way on or off chip, the reconfigurable processing can be pipelined with processor and i/o operations. The reconfigurable operations will increase the off-chip latency, but not diminish bandwidth. Of course, in the configurations of interest the additional latency in input or output processing will be small compared to the latency which would be incurred if the processing had to be done in software using the fixed portion of the processor itself.

Figure shows a scenario where the off-chip i/o is completely subsumed by reconfigurable logic. Note that the relatively low bandwidth associated with off-chip communications, compared to on-chip communication, can partially compensate for the slower native performance of reconfigurable logic. The datapath between the fixed processing core and the reconfigurable logic can be wide, allowing the reconfigurable logic to use parallelism to achieve reasonable off-chip i/o bandwidth.
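As a rough illustration of the parallelism argument above, the following sketch estimates how wide the internal datapath would need to be for slower reconfigurable logic to keep pace with an off-chip interface. The function name, clock rates, and bus widths are illustrative assumptions, not figures from this note.

```python
import math

def min_internal_width(pin_width_bits, pin_clock_mhz, logic_clock_mhz):
    """Smallest internal datapath width (in bits) whose bandwidth, at the
    slower reconfigurable-logic clock, matches the off-chip pin bandwidth.

    All three parameters are hypothetical; the model ignores protocol
    overhead and assumes one transfer per clock on both sides.
    """
    pin_bandwidth = pin_width_bits * pin_clock_mhz  # Mbit/s at the pins
    return math.ceil(pin_bandwidth / logic_clock_mhz)

# Hypothetical case: a 32-bit external bus at 50 MHz, with reconfigurable
# logic that can only cycle at 25 MHz.
print(min_internal_width(32, 50, 25))  # 64
```

In this toy model, logic running at half the pin rate needs a datapath twice as wide as the external bus to sustain the same bandwidth.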

Further, the performance hit due to reconfigurable logic may often be lower than the performance hit taken when external logic components must be inserted into the datapath to adapt the processor's i/o to a particular system.

Similarly, one might worry that the reconfigurable structure will take more die area than the non-reconfigurable i/o. While the reconfigurable i/o may be larger, the increase in system cost which comes from having a larger die may well be less than the increase in system cost which comes from adding an external IC to adapt the conventional processor to fit into a particular system.

Of course, if one has a favorite bus to support, one could combine the previous two configurations (see Figure ). Placing multiplexors both on the i/o pins themselves and on the internal datapath allows the preferred bus to suffer minimal speed degradation while allowing full reconfigurability of the i/o interface. This might be interesting, for example, in building a single IC to span a large range of systems. The fixed bus structure might be tuned to the highest end product line. The lower end models could employ the reconfigurable logic to adapt the core to their systems. This configuration would be especially attractive if the lower end systems were cheaper precisely because they ran the external busses at a lower speed than the high end models.

Application

The variants which allow control over the external interface can be employed to:

With the reconfigurable logic optionally in the i/o datapath, the flexible logic can be used for:

Advantage Summary

Generally, we can summarize a few common advantages for a reconfigurable i/o interface:

Attached Logic or Function Unit

Another important application for processor-coupled reconfigurable logic is to serve as an application specific accelerator. Here, we use the reconfigurable logic to build logical functions and operations which are used heavily by a particular application. To achieve low latency and high bandwidth between the processor and the reconfigurable logic, we attach the reconfigurable logic directly to the processor's register file along with the fixed functional units (e.g. ALU, IU, FPU, LD/ST, MDU).

Architectural Options

Recall our basic microprocessor organization from Figure . When we focus in on the interface between the register-file and ALU, the typical organization looks like Figure . Here, a two read, one write port register file is coupled to a single ALU. Register-file addresses are generally derived from the decoded instruction stream and are not shown in Figure .

Figures and show two simple options for the addition of a single programmable function unit (PFU) to the traditional RF/ALU organization shown in Figure . In Figure , the RF ports are shared between the ALU and PFU, allowing the processor to retire at most one result from each functional unit on each cycle and allowing at most two operands to be sent to the ALU/PFU combination each cycle. Figure has independent read and write ports, allowing both units to operate independently and fully in parallel. Of course, hybrids between these two extremes are also possible (e.g. Figure , which shares one of three read ports between the ALU and PFU). Reducing the number of read/write ports into the register file allows the register file implementation to be simpler and faster, while increasing the number of ports allows a larger range of operations to occur in parallel.

Today's high-end microprocessors generally have multiple, fixed functional units, exploiting parallelism to increase throughput. In these superscalar and VLIW configurations, the programmable function unit (PFU) would take its place alongside the fixed function units. Figure shows the general organization of the processing core of such a superscalar or VLIW processor. The expander and concentrator blocks abstract away the large range of datapath sharing which could go into an implementation. As with the simpler examples above (Figures , , and ), the number of register file ports can be conserved by sharing them among functional units at the expense of restricting the options for parallel operation.

Table summarizes the parameters included in the register file and fixed unit datapath shown in Figure . This assumes a single load/store unit taking in a single address-data pair and generating a single data result. Of course, multiple load/store units with varying input/output interfaces are also possible. Note, as long as , read port sharing will be necessary in the expander. Similarly, as long as , write port sharing will be necessary in the concentrator.

It is also worth noting that it is generally better to share the logic among PFUs. Consequently, rather than designing the processor with independent PFUs, one would design one large PFU, perhaps times as large as a typical single PFU, and provide it with and inputs and outputs. This also gives the PFU set additional flexibility in utilizing its RF read and write bandwidth. Figure shows this configuration.

Similarly, in designs where flexible i/o is also desirable, as described in Section , it may be beneficial to merge the PFU reconfigurable logic with the input/output reconfigurable logic. Figure shows a case where the load/store function is subsumed by reconfigurable logic (compare Figure ). Figure shows the analog to Figure , where the fixed load/store and programmable logic exist in parallel.

Figures through show specific, small examples with a single ALU, a single PFU unit which can serve as reconfigurable i/o, and a single hardwired load/store unit. The primary difference among these examples is the number of RF read/write ports and hence the function of the expander and concentrators.

Timing Control

Assuming the processor runs at some fixed rate independent of the function implemented in the PFU, the logic coupling may have to deal with various timings which are possible in the PFU.

We can handle most of these cases in the same way analogous cases are already handled in processors. The main difference is that a fixed functional unit falls into one of these categories, and that category is known at design time, whereas the category here will depend on the function being implemented and hence will not be known until the function is configured.

Predictable delay constraints can be scheduled in software. That is, the compiler can guarantee to only emit code which will launch a new operation every cycles and expects the result of an operation to only be available after cycles. The compiler will know the PFU function when generating the code to accompany it, so it can arrange code appropriately to handle the specifics of a particular PFU.
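A minimal sketch of this kind of static scheduling is given below. It assumes the PFU is characterized by two fixed numbers, here called `launch_interval` and `latency`; both names and the example values are hypothetical and are not taken from this note.

```python
def schedule_pfu_ops(num_ops, launch_interval, latency):
    """Return (issue_cycle, result_cycle) pairs for back-to-back PFU operations.

    launch_interval: minimum cycles between successive PFU launches (assumed name)
    latency: cycles from launch until a consumer may issue (assumed name)
    """
    schedule = []
    for i in range(num_ops):
        issue = i * launch_interval
        schedule.append((issue, issue + latency))
    return schedule


def padding_before_use(latency, independent_instrs):
    """Empty slots the compiler must leave between a PFU launch and the first
    use of its result, given how many unrelated instructions it can place in
    the gap. A launch at cycle c allows a consumer at cycle c + latency, so
    there are latency - 1 slots in between."""
    return max(0, latency - 1 - independent_instrs)


# Hypothetical PFU: accepts a new operation every 2 cycles, result after 5.
print(schedule_pfu_ops(3, launch_interval=2, latency=5))  # [(0, 5), (2, 7), (4, 9)]
print(padding_before_use(latency=5, independent_instrs=3))  # 1
```

Because both numbers are known when the code is generated, no hardware interlock is needed in this case; the compiler simply never emits a dependent instruction too early.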

To support variable times, the control logic can accommodate ready and busy signals from the programmable logic. The PFU can, for instance, have a pair of extra signals, one to indicate when the result is done and one to indicate when the PFU is ready for the next operation. These control signals would be generated from the programmable logic and be customized to each PFU configuration. The controller can then stall the pipeline when the PFU is not ready for input. Similarly, it can use the result completion signal to determine when to write back the result and when to stall dependent computation. The processor could, for example, use a standard register score-boarding strategy to force the processor to stall only when the instruction stream attempts to access the PFU result before it is generated.

Figure shows such an arrangement. ready_input is asserted whenever the PFU is ready to receive a new input. retire_result is asserted when the PFU completes each operation. The processor control will stall the pipeline if ready_input is not asserted when the next operation in the pipeline requires the PFU. The processor control uses retire_result to recognize when a result is done and make sure writeback occurs to the appropriate target register at that time. When the PFU instruction is issued, the result register is marked unavailable. If the processor control encounters any reads to unavailable registers, it can stall the dependent instruction awaiting the writeback which makes the value available.
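The stall behavior described above can be sketched as a small cycle-level simulation. This is a toy model under stated assumptions: the PFU is given a fixed `occupancy` and `latency` purely for illustration (a real configuration could vary both per operation), the pipeline issues at most one instruction per cycle in order, and the class and function names are hypothetical.

```python
class PFUModel:
    """Toy PFU: `occupancy` cycles before ready_input is asserted again,
    `latency` cycles until retire_result. Both are assumed parameters."""

    def __init__(self, occupancy, latency):
        self.occupancy = occupancy
        self.latency = latency
        self.busy_until = 0   # first cycle at which ready_input is asserted
        self.in_flight = []   # (retire_cycle, destination_register)

    def ready_input(self, cycle):
        return cycle >= self.busy_until

    def launch(self, cycle, dest):
        self.busy_until = cycle + self.occupancy
        self.in_flight.append((cycle + self.latency, dest))

    def retire(self, cycle):
        """Registers whose retire_result is asserted in this cycle."""
        done = [d for (t, d) in self.in_flight if t == cycle]
        self.in_flight = [(t, d) for (t, d) in self.in_flight if t != cycle]
        return done


def run(program, pfu):
    """program: list of ("pfu", dest_reg) or ("use", src_reg).
    Returns the cycle at which each instruction issues."""
    unavailable = set()   # scoreboard: registers awaiting PFU writeback
    issue_cycles = []
    cycle = 0
    pc = 0
    while pc < len(program):
        for reg in pfu.retire(cycle):       # writeback clears the scoreboard
            unavailable.discard(reg)
        op, reg = program[pc]
        can_issue = reg not in unavailable
        if op == "pfu":
            can_issue = can_issue and pfu.ready_input(cycle)
        if can_issue:
            if op == "pfu":
                pfu.launch(cycle, reg)
                unavailable.add(reg)        # mark result register unavailable
            issue_cycles.append(cycle)
            pc += 1
        cycle += 1                          # otherwise the pipeline stalls
    return issue_cycles


program = [("pfu", "r1"), ("pfu", "r2"), ("use", "r1"), ("use", "r2")]
print(run(program, PFUModel(occupancy=2, latency=3)))  # [0, 2, 3, 5]
```

In the example, the second PFU operation stalls one cycle waiting for ready_input, and the read of r2 stalls one cycle waiting for its writeback.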

Of course, a particular processor could choose to restrict the kinds of variability allowed to simplify control. Implementations could restrict themselves to no variability or to variability only in launch rate or completion latency.

Diverting Control Flow

Other hooks into the processor's control flow may be merited. In particular, there are a number of applications where it would be beneficial to give the logic an option to divert program flow rather than simply stall it. Two general classes:

  1. Exception/assumption detection -- The processor code could be compiled assuming certain cases do not occur. The PFU could then be programmed to watch values and detect when these exceptional cases occur, diverting the processor's control to handle the exceptional case accordingly. For example, compiled code could be written to assume a certain address range is never written, allowing the values to be cached in registers or even compiled into the code. The PFU then performs parallel checking to see that this assumption is met throughout the region of code. In a similar manner, the PFU might be programmed to watch for specific address writes to facilitate debugging.
  2. PFU limitations -- Similarly, the PFU may implement a restricted version of some function -- perhaps one that only works for certain values. When unusual values, those not handled by the PFU, are detected the PFU could divert control flow to software which implements the function.

As described, this could simply be a line which signals a synchronous exception vectored into a handler set up to handle the specified exceptional event. Alternately, the line could set dirty bits in the processor state, thread state, result registers, or the like to indicate that the computed value was incorrect and should be ignored. Such a line might also inhibit memory writes or other side-effecting operations which might write incorrect results based on the violated assumptions.
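The address-range assumption check can be modelled in software roughly as follows. This is only a behavioral sketch: the exception class, the function names, and the half-open range convention are all assumptions made for illustration. In hardware the comparison would run in parallel with the store rather than in front of it.

```python
class AssumptionViolated(Exception):
    """Stands in for the synchronous exception raised by the PFU's watch line."""

    def __init__(self, addr):
        super().__init__(f"store to watched address {addr:#x}")
        self.addr = addr


def watched_store(memory, addr, value, watch_lo, watch_hi):
    """Perform a store while checking it against the range [watch_lo, watch_hi)
    that the compiled code assumed would never be written."""
    if watch_lo <= addr < watch_hi:
        # Inhibit the side effect and divert control flow.
        raise AssumptionViolated(addr)
    memory[addr] = value


def run_region(stores, watch_lo, watch_hi):
    """Run a sequence of (addr, value) stores under the assumption; fall back
    to a handler (here just a message) if the assumption is violated."""
    memory = {}
    try:
        for addr, value in stores:
            watched_store(memory, addr, value, watch_lo, watch_hi)
        return "fast path completed", memory
    except AssumptionViolated as exc:
        return f"diverted to handler at {exc.addr:#x}", memory


print(run_region([(0x100, 1), (0x204, 2), (0x300, 3)], 0x200, 0x280))
# ('diverted to handler at 0x204', {256: 1})
```

Note that the violating store and everything after it never reach memory, mirroring the option of inhibiting side-effecting operations once the assumption fails.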

Control Registers

In some cases it may be useful to place specialized control registers inside the PFU. For example, for a DPGA PFU it might be beneficial to have a dedicated context state register for the array inside the PFU. This would be particularly advantageous if the PFU performed multiple cycle functions in the same PFU, but the processor did not want to allocate register file or instruction bandwidth to feed the context identification into the PFU on every cycle. Some internal registers may be beneficial any time the PFU operates on logical input data larger than its register file datapath. Internal registers can always be built out of the programmable logic, but where we can anticipate their common need, it is cheaper to go ahead and include fixed registers.

Control Inputs

So far, we have described scenarios where the PFU simply takes data from the register file datapath. We may want a control signal into the PFU indicating when new data is valid and destined for the PFU. Of course, if the PFU can always operate on the data it is provided and the processor only takes results from the PFU when it expects the PFU to generate results, such control is not strictly required. However, if the PFU is tied up for multiple cycles with each operation, as suggested in some usage scenarios above, the PFU needs to be told when it actually needs to start operating on data. Additional control signals might tell the PFU more about what to do. For example, if a PFU is set up to perform more than one operation, the PFU might get some instruction bits to specify the current operation. Similarly, control bits might inform the PFU about the kind of data it is now receiving via its register file inputs. This information would be particularly valuable if the PFU operated on more data than it got over the register file datapath in a single cycle and did not always get all of the data it operates on reloaded in a deterministic sequence.

Orchestrated DPGA/SIMD Logic

We can also view the processor sequencing and control as an orchestrator, coordinating DPGA or SIMD logical operations occurring within the PFU. This view is entirely consistent with the general scheme presented here. A processor designed specifically with this in mind is likely to include more PFU logic and fewer fixed ALUs. In fact, the fixed ALUs might exist primarily for addressing, control branching, exception handling, and configuration loading.

Application

Advantage Summary

Control Logic

An interesting class of reconfiguration becomes available when reconfigurable logic is interfaced with the basic control circuitry for the processor. In the previous section we began to introduce some special cases where allowing the reconfigurable logic direct access to consume and generate control signals will expand the range of adaptation possible. In this section, we focus more specifically on this class of reconfiguration, which is useful apart from its coupling to PFU logic.

Architecture

Every traditional microprocessor has logic which controls the flow of instructions and data. This logic usually accounts for a very small portion of the silicon area on the processor die, but plays a large role in establishing how the processor behaves and what it does efficiently. Direct hooks into this logic allow us to reconfigure the basic processor behavior. The hooks could range from allowing reconfigurable logic to drive into selective fixed-logic signals, as suggested for the stall in the previous section, to replacing the fixed control logic with a reprogrammable substrate. The latter offers more flexibility while the former allows faster and smaller implementation of standard control structures. Just like the flexible input logic, default, hardwired control logic can be wired in parallel with reconfigurable logic to give some elements of both schemes.

In general, reconfigurable logic might monitor:

All of these lines can be monitored for profiling, debugging, and statistical purposes.

The reconfigurable logic might control:

This kind of control was introduced in Sections and , and is also useful independent of a programmable functional unit or reconfigurable i/o.

When reconfigurable control logic is arranged in this manner, the processor's behavioral patterns can be revised. In some cases, this may allow the reconfigurable logic to control what happens on exceptional events like traps, cache misses, TLB misses, or context switches. It may also allow the instruction stream to make more or less stringent assumptions about the data and provide a means of checking these assumptions and handling the data accordingly.

Among other things, this may allow the processor to be adapted to match the semantics desired by a particular operating system or operating environment. In many modern systems, the OS ends up executing many instructions to switch contexts, take traps, or save/restore registers because the processor does not quite provide the right hooks to match the OS semantics (e.g. [ALBL91]). Reconfigurable control can provide an opportunity to make up for the semantic gap at the processor level, rather than incurring large software overheads to emulate the desired semantics.

In general, the control logic on the processor is the hardest part to get correct. The various exceptional and hazard cases, and their interactions, are difficult to handle well and correctly. Sometimes it is difficult to decide what the correct behavior should be. With highly reconfigurable control logic, we defer the binding time, allowing the logic to be fixed after the processor is fabricated and allowing the behavior to be revised without spinning a new design.

If one does combine reconfigurable control with a programmable functional unit (Section ) or reconfigurable i/o (Section ), it may make sense to combine the reconfigurable logic into one large block. This allows sharing and averaging. When the control is simpler, more space is available for i/o or programmable functions. Similarly, when the control is large, it can borrow space from the other reconfigurable units. Figure shows our basic processor datapath with a reconfigurable logic block which serves as a PFU and as reconfigurable i/o, and which includes hooks into the processor's control logic.

Application and Advantage Summary

Instruction Interpretation

The processor reads the instruction from the cache and controls execution accordingly. In modern RISC microprocessors, the instruction is decoded by hardwired control logic to manipulate each stage of the processor's pipeline. Reconfigurable logic can be integrated into the processor's instruction stream decoding in a number of ways:

Application and Advantage Summary

Basic processor behavior

Architecture

A classic question in processor architecture is: ``where should resources be deployed?'' Should the cache be larger/smaller relative to the TLB? Do we allocate space to prefetch or writeback buffers? How much memory should go into the data cache, instruction cache, scratchpad memory? Do we include a branch target buffer or victim cache?

A second question which comes along with this one is: ``How do we manage the limited resources?'' What's the prefetch/reload/eviction policy?

The traditional solution to both of these questions is to make a static decision at design time which does, in some sense, reasonably well across the benchmarks the designers consider representative. This inevitably leads to a compromise for every application, and for many applications the magnitude of the compromise can be quite large.

With a reconfigurable processor we can, instead, leave some flexibility in the architecture so the machine can be configured to deploy the resources most effectively for the given application. The idea is to go ahead and build specialized pieces of hardwired logic with common utility (e.g. memories, ALUs, FPUs), but rather than completely hardwiring their control and datapaths, to leave flexibility to reorganize their interconnect and hence their use.

Figure , for example, shows a revision of our generic VLIW processor architecture where blocks of configurable memory have been added to the collection of processing resources. Here, some outputs from the ALU/PFU/FPU/memory bank are routed back to the expander to allow cascaded operations. For example, a virtual memory address coming out of a register may be translated through a memory before being fed to the i/o. Similarly, a base address from one register may be added to an index from another before the result is fed in as an address to the cache. To facilitate this, we conceptually add additional outputs from the concentrator and inputs to the expander, in addition to the additional concentrator inputs and expander outputs entailed by the additional memory units they support.

Additionally, the memories can be arranged in standard sized chunks which can be composed, allowing the memory resources to be shuffled at a moderate granularity. For example, each basic memory could be a 2Kx8 memory chunk. 4 or 8 of these can be grouped together to build a 32- or 64-bit wide memory. Additionally, they can be cascaded to build deeper memories. So, 16 of these chunks (4 wide by 4 deep) could be cascaded to build an 8Kx32 memory.
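The chunk arithmetic can be checked with a short calculation. The function below is an illustrative helper (its name and defaults are assumptions), using the 2Kx8 chunk size from the example above.

```python
def chunks_needed(depth_words, width_bits, chunk_depth=2048, chunk_width=8):
    """Number of chunk_depth x chunk_width memory blocks needed to build a
    depth_words x width_bits memory.

    Returns (chunks_wide, chunks_deep, total). Blocks are placed side by side
    to gain width and cascaded to gain depth; partial blocks round up.
    """
    wide = -(-width_bits // chunk_width)    # ceiling division
    deep = -(-depth_words // chunk_depth)
    return wide, deep, wide * deep


print(chunks_needed(2 * 1024, 32))  # (4, 1, 4)   32-bit wide, one chunk deep
print(chunks_needed(2 * 1024, 64))  # (8, 1, 8)   64-bit wide, one chunk deep
print(chunks_needed(8 * 1024, 32))  # (4, 4, 16)  8Kx32
```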

With a little bit of additional control logic, these memories can be used as caches, cache-tags, TLBs, explicit scratchpad memories, FIFO buffers, or the like. These memories can completely subsume the separate data cache shown in our original processor model (Figure ). The additional control logic is likely to be supported largely in reconfigurable logic as suggested in Section .

Systems that do not use a TLB can reallocate memory blocks to the cache. Applications with more or less virtual memory locality can adjust the size of the TLB accordingly. Applications with known data access patterns can use explicit scratchpad memory, reallocating the memory blocks holding cache tags to data storage. Applications with specific spatial locality in data or instructions can build logic to prefetch data according to the application's needs.

Figure expands the datapath to show the processor control. In particular, this organization makes it clear that the instruction cache can be implemented out of the memory units, as well. Each application can now trade-off memory between the i-cache and d-cache based on the needs of the application.

We can also view the register file as another memory which can also be built out of the deployable memory units. Figure shows a configuration where there is no a priori designated register file. Rather the register file is built out of the memories. This may allow, for example, the reconfiguration of the register file width, depth, and number of simultaneous read ports. Further the register file can be broken into a series of smaller register files where appropriate. Here, the expander/concentrator is collapsed into a single reconfigurable interconnect.

Alternately, the register file may want to be a slightly specialized memory unit, but still be deployable for the reasons articulated here. As noted above, width, depth, and read cascading are moderately easily constructed by paralleling memory blocks just as in building basic memory structures. What is harder to build by composition is multiple write ports, and register files often depend heavily on a number of simultaneous write ports. For this reason, it may make sense to also include a different kind of memory block with multiple write ports to allow efficient construction of register files, as well as other structures requiring simultaneous write support.

The configuration shown in Figure shows hardwired processor control and a completely reconfigurable i/o unit. Of course, variations could implement all or much of the control in reconfigurable logic and/or include hardwired load/store units.

This finally leads to a revised model for a computing device in which basic, specialized functional units (e.g. memories, ALUs, FPUs, MDUs, LD/ST units, DMA logic, hardwired control units) are embedded in a reconfigurable interconnection scheme along with regions of reconfigurable logic which can be used for monitoring, control, i/o, decoding, and as PFUs. This device gains the performance and space advantages of hardwired logic units for commonly used operations. At the same time, it gains a performance advantage over a purely fixed microprocessor by adapting the processor organization much more tightly to the application.

Note that the reconfigurable interconnect used to interconnect functional units differs both from the fine-grained reconfigurable interconnect typically employed in FPGAs and the expander/concentrator interconnect used in a pure VLIW. Rather, it is a hybrid of the two. Unlike traditional FPGA interconnect, most of the operations with the interconnect are bus oriented. Therefore, busses are switched in groups. Most busses may be some nominal processor bus width (e.g. 16, 32, 64). Some will be able to compose or decompose these busses to other interesting sizes (e.g. 1-bit entities for fine-grained logic, 8-bit entities for reconfigurable memories). In a traditional VLIW, the decoding of the instruction specifies the configuration of busses. With this kind of a setup, the instruction bandwidth would be excessive if fully configured from the instruction stream. Similarly, the interconnect pattern would be too rigid if fully configured via FPGA style programming. Here, the configuration of busses will depend partially on the instruction executed and partially on the way the system is currently configured. In most scenarios the decoding between the instruction stream specification of interconnect and the full interconnect specification would be done in the reconfigurable logic. For efficiency, the reconfigurable logic serving this purpose might be tailored somewhat to this application.

Application and Advantage Summary

Reconfigurable Logic

The reconfigurable logic can be realized as one of many different structures.

Much of the logic used in the i/o path and, to some extent, in the PFUs, is likely to be datapath oriented. Consequently, it will probably make sense to specialize a good portion of the array logic to datapath usage. This datapath specialization may include routing busses, slaving multiple programmable cells off of a single configuration (e.g. [CL94]), and including bussed register banks. The benefit of datapath orientation is greater density and performance on datapath applications than regular FPGA/DPGA structures. Fine-grained logic will still be desirable for control operations and bit-wise manipulations.

Configuration Reloading

Figure shows the generic, logical view for a reconfigurable array. The programmable i/o's are shown separate from the configuration facilities.

The configuration i/o's control the loading of the array's configuration or context. The configuration port can look very much like a memory port. Depending on the design requirements, the port can be anything from a 1-bit serial data port with no address control to a 64-bit wide (or larger) data port with full, random access address control to the internal configuration memories. Wider datapaths support more rapid context loading. Random access to the configured logic allows rapid, incremental changes in the array personality.
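To give a feel for the range between these port options, the sketch below counts the cycles needed to load a configuration over ports of different widths. The configuration sizes are made-up numbers for illustration, and the model ignores any address or handshake overhead.

```python
import math

def config_load_cycles(config_bits, port_width_bits):
    """Cycles to transfer config_bits through a configuration port that moves
    port_width_bits per cycle (address/handshake overhead ignored)."""
    return math.ceil(config_bits / port_width_bits)


FULL_CONTEXT_BITS = 100_000   # hypothetical size of one full array context
CHANGED_BITS = 640            # hypothetical size of an incremental change

print(config_load_cycles(FULL_CONTEXT_BITS, 1))   # 100000  (1-bit serial port)
print(config_load_cycles(FULL_CONTEXT_BITS, 64))  # 1563    (64-bit port)
print(config_load_cycles(CHANGED_BITS, 64))       # 10      (random-access update)
```

The last line corresponds to the random-access case: only the changed portion of the configuration needs to be written, which is where the rapid incremental change comes from.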

The programmable i/o's are inputs to the logic implemented in the reconfigurable array and outputs generated by the array. There need be little direct correspondence between the number of i/o's and the size of the array. In some situations, it will be beneficial for all i/o's to be bidirectional i/o's -- e.g. if the array is being coupled to a common bus on the processor. More likely, in processor-coupled applications, it will be beneficial for all the i/o's to be dedicated inputs and outputs.

For multicontext (e.g. DPGA) designs, a context select will specify the active context. This may come from a special purpose register driving the context select, from hardwired logic, from decoded CPU signals, from a hardwired sequencer, or even from a programmable output from this or another reconfigurable logic array.

In a processor-coupled scenario, we could place the reconfiguration loading data and address path in any of several places:

See Also...

References

ALBL91
Thomas Anderson, Henry Levy, Brian Bershad, and Edward Lazowska. The Interaction of Architectures and Operating System Design. In Fourth International Conference on Architectural Support for Programming Languages, pages 108-120. ACM, April 1991.

ALKK90
Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th International Symposium on Computer Architecture, pages 104-114. IEEE, May 1990.

BDK93
Michael Bolotski, Andre DeHon, and Thomas F. Knight Jr. Unifying FPGAs and SIMD Arrays. Transit Note 95, MIT Artificial Intelligence Laboratory, September 1993.

CL94
Don Cherepacha and David Lewis. A Datapath Oriented Architecture for FPGAs. In Second International ACM/SIGDA Workshop on Field-Programmable Gate Arrays. ACM, February 1994. Proceedings not available outside of the workshop; contact author lewis@eecg.toronto.edu.

Cra95
Cray Research, Inc. CRAY T3D System Architecture Overview Manual, 1995. URL http://www.cray.com/PUBLIC/product-info/mpp/T3D_Architecture_over/T3D.overview.html.

DeH94
Andre DeHon. DPGA-Coupled Microprocessors: Commodity ICs for the Early 21st Century. Transit Note 100, MIT Artificial Intelligence Laboratory, January 1994.

DMB87
David Ditzel, Hubert McLellan, and Alan Berenbaum. The Hardware Architecture of the CRISP Microprocessor. In 14th International Symposium on Computer Architecture, pages 309-319. ACM/IEEE, IEEE Computer Society Press, June 1987.

GSH94
Greg J. Gent, Scott R. Smith, and Regina L. Haviland. An FPGA-based Custom Coprocessor for Automatic Image Segmentation Applications. In Duncan A. Buell and Kenneth L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, Los Alamitos, California, April 1994. IEEE Computer Society, IEEE Computer Society Press.

HTA94
Neil Howard, Andrew Tyrrell, and Nigel Allinson. FPGA Acceleration of Electronic Design Automation Tasks. In Will Moore and Wayne Luk, editors, More FPGAs, pages 337-344. Abingdon EE&CS Books, 49 Five Mile Drive, Oxford OX2 8HR, UK, 1994.

LWP94
Wayne Luk, Teddy Wu, and Ian Page. Hardware-Software Codesign of Multidimensional Programs. In Duncan A. Buell and Kenneth L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, Los Alamitos, California, April 1994. IEEE Computer Society, IEEE Computer Society Press.

WG94
Qiang Wang and P. Glenn Gulak. An Array Architecture for Reconfigurable Datapaths. In Will Moore and Wayne Luk, editors, More FPGAs, pages 35-46. Abingdon EE&CS Books, 49 Five Mile Drive, Oxford OX2 8HR, UK, 1994.

Xil93
Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. The Programmable Logic Data Book, 1993.

Bandwidth Issues for Coupled Reconfigurable Logic

Recent papers begin to show more explicit evidence that the bandwidth between the conventional processor (and memory) and the reconfigurable compute engine limits the attainable performance improvement, typically by an order of magnitude.

  1. For a Sobel edge detector, Luk et al. note that the hardware-assisted version is, in practice, only 39% faster than the software-only version. They then note that the communication overhead accounts for 88% of the time taken. ``If this overhead is not included, the hardware-assisted design is approximately 13 times faster than the software version. Furthermore, if the input-output bottleneck can be eliminated so that the only speed limitation is the critical path delay, we estimate that a speedup of about 300 times can be achieved.'' [LWP94]

  2. [GSH94] also presented evidence that performance is directly limited by bandwidth between the control processor and the reconfigurable system. In their talk, they showed that the reconfigurable system gave roughly a 10x speedup, but was limited by the low bandwidth interconnect. They suggested that another factor of ten in performance acceleration could be realized if the bus bandwidth were increased. The (preliminary?) paper alludes to the issue, but does not spell out the result as clearly as the talk.

  3. For Electronic Design Automation (EDA) tasks, [HTA94] finds only marginal benefits (speedup factors between 1 and 8) for off-chip, FPGA co-processors. They find that the bus bandwidth limitation is largely responsible for this bound.
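The relationship between the overhead fraction and the attainable speedup in the first item can be worked through with a small calculation. This is a sketch under an explicit assumption: "39% faster" is read here as a speedup factor of 1.39 (software time divided by hardware-assisted time), which may not be exactly how [LWP94] defines it. The function name is hypothetical.

```python
def speedup_without_overhead(observed_speedup, overhead_fraction):
    """Speedup over software if the communication overhead were removed.

    observed_speedup:  software_time / hardware_assisted_time (assumed reading)
    overhead_fraction: share of the hardware-assisted time spent communicating
    """
    compute_fraction = 1.0 - overhead_fraction
    return observed_speedup / compute_fraction


# 1.39x observed, with 88% of the hardware-assisted time spent on communication.
print(round(speedup_without_overhead(1.39, 0.88), 1))  # 11.6
```

Under this reading the overhead-free speedup comes out near 11.6, the same order as the "approximately 13 times" quoted from the paper. Either way, roughly an order of magnitude of the potential gain is lost to communication.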

MIT Transit Project