General-purpose computers have served us well over the past couple of decades. Broad applicability has led to widespread use and volume commoditization. Flexibility allows a single machine to perform a multitude of functions and to be deployed in applications not conceived of when the device was designed or manufactured. The flexibility inherent in general-purpose machines was a key component of the computer revolution.
To date, processors have been the driving engine behind general-purpose computing. Their design, originally dictated by the premium on active real estate, focuses on heavy reuse of a single functional unit or a small number of them. With Very Large Scale Integration (VLSI), we can now integrate complete and powerful processors onto a single integrated circuit, and the technology continues to provide a growing amount of real estate.
As enabling as processors have been, our appetite and need for computing power have grown faster. Even though processor performance steadily increases, we often find it necessary to prop up these general-purpose devices with specialized processing assists, generally in the form of specialized co-processors or ASICs. Consequently, today's computers exhibit an increasing disparity between the general-purpose core and its specialized assistants. High-performance systems are built from a plethora of specialized ASICs. Even today's high-end workstations dedicate more active silicon to specialized processing than to general-purpose computation. The general-purpose processor will be only a small part of tomorrow's multimedia PC. As this trend continues, the term ``general-purpose computer'' will become a misnomer for modern computer systems. Relatively little of the computing power in tomorrow's computers will be deployable efficiently to an arbitrary problem.
The problem is not with the notion of general-purpose computing, but with the implementation technique. For the past several years, industry and academia have focused largely on building the highest performance processor rather than the highest performance general-purpose computing engine. When active area was extremely limited, this was a very sensible approach. However, as silicon real estate continues to grow far beyond the space required to implement a competent processor, it is time to re-evaluate general-purpose architectures in light of shifting resource availability and cost.
In particular, an interesting space has opened between the extremes of general-purpose processors and specialized ASICs. That space is the domain of reconfigurable computing, and it offers all the benefits of general-purpose computing with greater performance density than traditional processors. This space is most easily seen by looking at the binding time of device function. ASICs bind function to active silicon at fabrication time, making the silicon useful only for the designated function. Processors bind functions to active silicon only for the duration of a single cycle, a restrictive model which limits the amount the processor can accomplish in a single cycle while requiring considerable on-chip resources to hold and distribute instructions. Reconfigurable devices allow functions to be bound at a range of intervals within the final system, depending on the needs of the application. This flexibility in binding time allows reconfigurable devices to make better use of limited device resources, including instruction distribution.
Consequently, reconfigurable computing architectures offer:
The thesis provides:
The major contributions of this thesis include:
The remainder of this chapter provides a synopsis of the key results and relationships developed in the thesis. This introductory part of the thesis continues with Chapter which defines the terminology and metrics used throughout the thesis. Chapter reviews and highlights the existing evidence for the high performance potential of reconfigurable computing architectures.
Part sets the stage by examining the computational capabilities of existing general-purpose computing devices. This starts with a broad, empirical review of general-purpose architectures in Chapter . In Chapter , we compare hardwired and general-purpose multiplier implementations as a case study bridging general-purpose and application-dedicated architectures. In Chapter , we review processor architectures implemented on top of reconfigurable architectures to broaden the picture and to see one way in which conventional reconfigurable architectures deal with high operational diversity.
Part takes a more compositional view of reconfigurable computing architectures. Chapter looks at building blocks, sizes, and requirements for interconnect. Chapter looks at resource requirements for instruction distribution. Finally in Chapter , we bring the empirical data, interconnect, and instruction characteristics together, providing a first order model of RP-space, our high-level model for reconfigurable processing architectures.
Part includes three new architectures: DPGA (Chapters and ), TSFPGA (Chapter ), and MATRIX (Chapter ), which are highlighted below in Section . The final chapters in Part review the results and identify promising directions for the future.
Broadly considered, reconfigurable devices fill their silicon area with a large number of computing primitives, interconnected via a configurable network. Both the operation of each primitive and the interconnect pattern among primitives can be programmed. Computational tasks can be implemented spatially on the device, with intermediates flowing directly from the producing function to the receiving function. Since we can put thousands of reconfigurable units on a single die, significant data flow may occur without crossing chip boundaries. To first order, one can think of turning an entire task into a hardware dataflow graph and mapping it onto the reconfigurable substrate. Reconfigurable computing generally provides spatially-oriented processing rather than the temporally-oriented processing typical of programmable architectures such as microprocessors.
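As a toy illustration of this spatial/temporal distinction (a sketch of the idea only, not an example from the thesis), consider computing y = (a AND b) XOR (c OR d) both ways:

# Toy contrast between temporal (processor-style) and spatial
# (reconfigurable-style) evaluation of y = (a AND b) XOR (c OR d).

# Temporal: one ALU is reused, so each operation costs a cycle.
def temporal(a, b, c, d):
    cycles = 0
    t1 = a & b; cycles += 1    # cycle 1: AND on the shared ALU
    t2 = c | d; cycles += 1    # cycle 2: OR on the shared ALU
    y = t1 ^ t2; cycles += 1   # cycle 3: XOR on the shared ALU
    return y, cycles

# Spatial: each operation is bound to its own configured primitive and
# intermediates flow directly from producer to consumer, so a new result
# can be produced every cycle once the pipeline is full.
def spatial(a, b, c, d):
    and_gate = lambda x, y: x & y    # primitive configured as AND
    or_gate  = lambda x, y: x | y    # primitive configured as OR
    xor_gate = lambda x, y: x ^ y    # primitive configured as XOR
    y = xor_gate(and_gate(a, b), or_gate(c, d))
    return y, 1                      # one result per cycle at steady state

print(temporal(1, 1, 0, 1))  # (0, 3)
print(spatial(1, 1, 0, 1))   # (0, 1)

The temporal version reuses one functional unit across three cycles; the spatial version dedicates a configured primitive to each operation and, once pipelined, can accept new inputs every cycle.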
The key differences between reconfigurable machines and conventional processors are:
To establish an intuitive feel for the design point and role of configurable devices, we can take a high-level look at conventional devices. Ignoring, for the moment, multiplies, floating-point operations, and table-lookup computations, a modern processor has a peak performance on the order of 256 3-LUT gate-evaluations per clock cycle (e.g. two 64-bit ALUs). A modern FPGA has a peak performance on the order of 2,048 4-LUT gate-evaluations per clock cycle. The basic clock cycle times are comparable, giving the FPGA at least an order of magnitude larger raw capacity.
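As a rough back-of-the-envelope check (counting each ALU bit as roughly two 3-LUT gate-evaluations, an accounting assumption made only for this illustration):

\[
2 \;\mbox{ALUs} \times 64 \;\frac{\mbox{bits}}{\mbox{ALU}} \times 2 \;\frac{\mbox{gate-evaluations}}{\mbox{bit}} \approx 256
\qquad \mbox{versus} \qquad
2{,}048 \;\mbox{4-LUTs} \times 1 \;\frac{\mbox{gate-evaluation}}{\mbox{4-LUT}} = 2{,}048 ,
\]

a raw ratio of 2048/256 = 8 which, together with the greater logic capacity of a 4-LUT relative to a 3-LUT, is consistent with the order-of-magnitude claim.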
Note that both the processor ALUs and FPGA blocks are typically built with additional gates which serve to lower the latency of word operations without increasing raw throughput (e.g. fast carry chains which allow a full 64-bit-wide add to complete within one cycle). This latency reduction may be important for reducing the serial path length in tasks with limited parallelism, but it is not reflected in this raw capacity comparison.
The FPGA can sustain its peak performance level as long as the same 2K gate-evaluation functionality is desired from cycle to cycle. Wiring and pipelining limitations are the primary reasons the FPGA would achieve less than peak performance, and these are likely to account for, at most, a 20-50% reduction from peak. If more diverse functionality is desired than the 1-2K gate-evaluations a single FPGA provides, performance drops considerably due to function reload time.
The processor is likely to deliver much less than its peak performance, and the effect is much more application specific. Due to the bitwise-SIMD nature of traditional ALUs, work per cycle can be as low as a couple of gate-evaluations on compute operations. Since processors perform all ``interconnect'' using shifts, moves, loads, and stores, many cycles yield no gate-evaluations at all, only movement of data. The lower peak performance of processors comes from the fact that the processor ALU occupies only a small fraction of the die, with substantial area going to instruction flow control and to on-chip memory that supports large sequences of diverse operations without requiring off-chip instruction or data access.
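For instance, purely as an illustration of this accounting, an instruction that merely tests a condition bit or masks out a single flag represents only one or two gate-evaluations of useful logic yet occupies a full ALU cycle, a yield of at most

\[
\frac{2}{256} \approx 1\%
\]

of the peak capacity for that cycle, while a shift, move, load, or store yields no gate-evaluations at all.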
A comparably sized, dedicated piece of hardwired functionality, with no memory, could provide a capacity of 200,000-300,000 4-LUT gate-evaluations per clock cycle, at potentially higher clock rates. While the raw gate delay of the hardwired logic can be 10× smaller than on the FPGA, reasonable cycle times in equivalent logic processes are closer to 2× smaller, since it makes sense to pipeline the FPGA design at a shallower logic depth than the custom logic. Returning to the multiplier, for example, such a chip might provide 64K multiply bit operations per cycle (e.g. a 256×256 multiply pipelined at the byte level). The dedicated hardware provides 100-300 times the capacity of the FPGA on the one task it was designed to solve. To first order, the dedicated hardware can deliver very little capacity to significantly different applications. It is also worthwhile to note that the fixed granularity of hardwired devices often causes them to sacrifice much of their capacity advantage when used on small data items. For instance, performing an 8×8 multiply on a 64×64 hardwired multiplier makes use of only 1/64th of the multiplier's capacity, removing much of its 300× capacity advantage.
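To make the utilization penalty explicit:

\[
\frac{8 \times 8}{64 \times 64} = \frac{64}{4096} = \frac{1}{64},
\]

so only roughly 300/64, or about 5×, of the hardwired multiplier's capacity advantage survives on byte-wide operands.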
Combining these observations, we can categorize the circumstances under which the various structures are preferred.
In Chapter we define our evaluation and comparison metrics more carefully. Chapters and provide an empirical review of conventional general-purpose and specialized architectures, focusing on their performance density.
While reconfigurable devices have, potentially, 100× less performance per unit area than hardwired circuitry, they provide 10-100× the performance density of processors. As noted above, FPGAs offer a potential 10× advantage in raw, peak, general-purpose functional density over processors. This density advantage comes largely from dedicating significantly less instruction memory and distribution resources per active computing element. At the same time, this lower memory ratio allows reconfigurable devices to deploy active capacity at a finer granularity, allowing them to realize a higher yield of their raw capacity, sometimes as much as 10× higher, than conventional processors. It is these two effects taken together which give reconfigurable architectures their 10-100× performance density advantage over conventional processor architectures in many situations.
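Composing the two effects quoted above (a rough combination of figures already stated, not an additional result):

\[
\underbrace{\sim\!10\times}_{\mbox{raw density from reduced instruction overhead}}
\;\times\;
\underbrace{\mbox{up to}\;\sim\!10\times}_{\mbox{higher yield from finer-grained deployment}}
\;\approx\; 10\mbox{-}100\times .
\]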
From an empirical review of conventional reconfigurable devices, we see that 80-90% of the area is dedicated to the switches and wires making up the reconfigurable interconnect. Most of the remaining area goes into configuration memory for the network. The logic function itself accounts for only a few percent of the area in a reconfigurable device. This interconnect and configuration overhead is responsible for the 100× density disadvantage which reconfigurable devices suffer relative to hardwired logic.
To a first order approximation, this gives us:
It is this basic relationship (shown diagrammatically in Figure ) which characterizes the RP design space.
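Summarizing the densities quoted above in rough form (an illustrative ordering only, not the exact model depicted in the figure), with D denoting yielded performance density, i.e. gate-evaluations per unit area per cycle:

\[
D_{\mbox{hardwired}} \;\approx\; 100\, D_{\mbox{reconfigurable}},
\qquad
D_{\mbox{reconfigurable}} \;\approx\; (10\mbox{-}100)\, D_{\mbox{processor}} .
\]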
After reviewing implementations in Chapter , Chapters and examine interconnect and instruction delivery issues in depth. Chapter brings these together, yielding a slightly more sophisticated model than the one above to explain the primary tradeoffs in the design of reconfigurable computing architectures.
From the general relationships above, we see that conventional Field Programmable Gate Arrays (FPGAs) represent one extreme in our RP-space. The space is large, leaving considerable room for interesting architectures in the middle. Exploiting the relative area properties identified above and common device usage scenarios, we have developed three new general-purpose computing architectures. By judicious allocation of device resources, these architectures offer higher yielded capacity across a wide range of applications.
Ultimately, reconfiguration is a technique for compressing the resources dedicated to instruction stream distribution while maintaining a general-purpose architecture. As such, it is an important architectural tool for extracting the highest performance from our silicon real estate. Characteristics of an application which change slowly or do not change can be configured rather than broadcast. The savings in instruction control resources result in higher logic capacity per unit area.
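A simple way to quantify this compression (an illustrative accounting, not a formula from the thesis): if the operation assigned to a compute element remains fixed for N cycles, it can be loaded once as configuration rather than broadcast on every cycle, reducing the instruction-distribution bandwidth devoted to that element by roughly a factor of N:

\[
BW_{\mbox{configured}} \;\approx\; \frac{BW_{\mbox{broadcast}}}{N} .
\]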
With CMOS VLSI we have reached the point where we are no longer so limited by the aggregate capacity of a single IC die that the device must be optimized exclusively to maximize the number of distinct instructions resident on a chip. Beyond this point, spatial implementation of all or portions of general-purpose computations is both feasible and beneficial. From this point on we will see:
The thesis closes in Part by reviewing the key lessons from reconfigurable designs and their implications for future general-purpose architectures.