Previous: Introduction and Background Up: Introduction and Background Next: Basics and Terminology

Overview and Synopsis

Evolution of General-Purpose Computing with VLSI Technology

General-purpose computers have served us well over the past couple of decades. Broad applicability has led to wide spread use and volume commoditization. Flexibility allows a single machine to perform a multitude of functions and be deployed into applications unconceived at the time the device was designed or manufactured. The flexibility inherent in general-purpose machines was a key component of the computer revolution.

To date, processors have been the driving engine behind general-purpose computing. Originally dictated by the premium for active real estate, processors focus on the heavy reuse of a single or small number of functional units. With Very Large Scale Integration (VLSI), we can now integrate complete and powerful processors onto a single integrated circuit, and the technology continues to provide a growing amount of real estate.

As enabling as processors have been, our appetite and need for computing power has grown faster. Despite the fact that processor performance steadily increases, we often find it necessary to prop up these general-purpose devices with specialized processing assists, generally in the form of specialized co-processors or ASICs. Consequently, today's computers exhibit an increasing disparity between the general-purpose core and its specialized assistants. High performance systems are built from a plethora of specialized ASICs. Even today's high-end workstations dedicate more active silicon to specialized processing than to general-purpose compute. The general-purpose processor will be only a small part of tomorrow's multi-media PC. As this trend continues, the term ``general-purpose computer'' will become a misnomer for modern computer systems. Relatively little of the computing power in tomorrow's computers can be efficiently deployed to solve any problem.

The problem is not with the notion of general-purpose computing, but with the implementation technique. For the past several years, industry and academia have focussed largely on the task of building the highest performance processor, instead of trying to build the highest performance general-purpose computing engine. When active area was extremely limited, this was a very sensible approach. However, as silicon real estate continues to increase far beyond the space required to implement a competent processor, it is time to re-evaluate general-purpose architectures in light of shifting resource availability and cost.

In particular, an interesting space has opened between the extremes of general-purpose processors and specialized ASICs. That space is the domain of reconfigurable computing and offers all the benefits of general-purpose computing with greater performance density than traditional processors. This space is most easily seen by looking at the binding time for device function. ASICs bind function to active silicon at fabrication time making the silicon useful only for the designated function. Processors bind functions to active silicon only for the duration of a single cycle, a restrictive model which limits the amount the processor can accomplish in a single cycle while requiring considerable on-chip resources to hold and distribute instructions. Reconfigurable devices allow functions to be bound at a range of intervals within the final system depending on the needs of the application. This flexibility in binding time allows reconfigurable devices to make better use of the limited device resources including instruction distribution.

Consequently, reconfigurable computing architectures offer:

More application-specific adaptation than conventional processors
Greater computational density than conventional processors
More and broader reuse of silicon than ASICs
Better opportunities to ride hardware and algorithmic technology curves than ASICs
Better match to current technology costs than ASICs or processors

This Thesis

This thesis characterizes a class of reconfigurable computing architectures and relates them broadly to the more well understood conventional alternatives. Since technology costs dictate the architectural tradeoffs involved, this characterization is performed in the context of MOS VLSI implementations. The convergence of process technologies along with the large amount of silicon real-estate available on a single die these days allows us to perform broad comparisons based primarily on silicon area.

The thesis provides:

A high level characterization of a reconfigurable processing space which includes reconfigurable architectures such as FPGAs. This characterization helps us understand the key characteristics of reconfigurable devices, including when and what level of performance we can extract from various architectural points.
Empirical relations on the key building blocks in CMOS VLSI taken from existing designs in the literature and our own experimental designs, include:
- sizes ( e.g. How big is a 4-LUT?)
- performance density (operations per unit space-time)
- relative feature sizes ( e.g. interconnect versus configuration memory versus active computing)
- first order modeling of key area factors
Architecture designs and implementations which explore new points in the identified design space based on the empirical characterization.
- architectures which exploit the identified cost structure to provide greater functional density for reconfigurable devices
- architectures which allow diversity/density tradeoffs based on application characteristics
Lessons and observations for future device architects and systems designers

The major contributions of this thesis include:

RP-space model for reconfigurable processing architectures -- While many loose taxonomies exist for general-purpose computing, none are as systematic as the one presented here. By focusing on the RP-space domain, this model provides size estimates and facilitates pedagogical efficiency comparisons of architectures within the space.
DPGA -- A novel bit-level architecture with multiple, on-chip instructions per compute element, including the theory and concepts behind the architecture, an implementation, experimental CAD to support it, and validation of efficiency using standard circuit benchmarks. For typical logic circuits and finite-state machines, the DPGA implementation is one-third the size of the FPGA implementation.
TSFPGA -- A model, possible implementation, and CAD for fine-grained, time-switched interconnect with demonstrated fast physical mapping capabilities. TSFPGA exploits the observations that most of the area benefits in DPGAs come from sharing the interconnect and that most of the difficulty in mapping to traditional FPGAs is their limited interconnect. By sharing interconnect resources in time, TSFPGA extracts more interconnect functionality from less active switching resources. For typical applications, quick mapping can be done in seconds. The mapped design area is smaller than comparable FPGAs and slightly larger than comparable DPGAs.
MATRIX -- The first architecture to allow run-time binding of instruction resources. Focusing on a design point using an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network, MATRIX yields 10-20 the computational density of conventional processors on sample image processing tasks. With today's silicon, we can place hundreds of these 8-bit functional units on a large die operating at 100MHz, making it possible to deliver over 10 Gop/s (8-bit ops) per component.

The remainder of this chapter provides a synopsis of the key results and relationships developed in the thesis. This introductory part of the thesis continues with Chapter which defines the terminology and metrics used throughout the thesis. Chapter reviews and highlights the existing evidence for the high performance potential of reconfigurable computing architectures.

Part sets the stage by examining the computational capabilities of existing general-purpose computing devices. This starts with a broad, empirical, review of general-purpose architectures in Chapter . In Chapter , we compare hardwired and general-purpose multiplier implementations as a case study bridging general-purpose and application-dedicated architectures. In Chapter , we review processor architectures implemented on top of reconfigurable architectures to broaden the picture and to see one way in which conventional reconfigurable architectures deal with high operational diversity.

Part takes a more compositional view of reconfigurable computing architectures. Chapter looks at building blocks, sizes, and requirements for interconnect. Chapter looks at resource requirements for instruction distribution. Finally in Chapter , we bring the empirical data, interconnect, and instruction characteristics together, providing a first order model of RP-space, our high-level model for reconfigurable processing architectures.

Part includes three new architectures: DPGA (Chapters and ), TSFPGA (Chapter ), and MATRIX (Chapter ), which are highlighted below in Section . The final chapters in Part , review the results and identify promising directions for the future.

Reconfigurable Device Characteristics

Broadly considered, reconfigurable devices fill their silicon area with a large number of computing primitives, interconnected via a configurable network. The operation of each primitive can be programmed as well as the interconnect pattern. Computational tasks can be implemented spatially on the device with intermediates flowing directly from the producing function to the receiving function. Since we can put thousands of reconfigurable units on a single die, significant data flow may occur without crossing chip boundaries. To first order, one can think about turning an entire task into hardware dataflow and mapping it on the reconfigurable substrate. Reconfigurable computing generally provides spatially-oriented processing rather than the temporally-oriented processing typical of programmable architectures such as microprocessors.

The key differences between reconfigurable machines and conventional processors are:

Instruction Distribution -- Rather than broadcasting a new instruction to the functional units on every cycle, instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and effectively deliver more instructions into active silicon on each cycle.
Spatial routing of intermediates -- As space permits, intermediate values are routed in parallel from producing function to consuming function rather than forcing all communication to take place in time through a central resource bottleneck.
More, often finer-grained, separately programmable building blocks -- Reconfigurable devices provide a large number of separately programmable building blocks allowing a greater range of computations to occur per time step. This effect is largely enabled by the compressed instruction distribution.
Distributed deployable resources, eliminating bottlenecks -- Resources such as memory, interconnect, and functional units are distributed and deployable based on need rather than being centralized in large pools. Independent, local access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a central resource bottleneck.

Configurable, Programmable, and Fixed-Function Devices

To establish an intuitive feel for the design point and role of configurable devices, we can take a high-level look at conventional devices. Ignoring, for the moment multiplies, floating-point operations, and table lookup computations, the modern processor has a peak performance on the order of 256, 3-LUT gate-evaluations per clock cycle ( e.g. two 64-bit ALUs). A modern FPGA has a peak performance on the order of 2,048, 4-LUT gate-evaluations per clock cycle. The basic clock cycle time is comparable giving the FPGA at least an order of magnitude larger raw capacity.

Note that both the processor ALUs and FPGA blocks are typically built with additional gates which serve to lower the latency of word operations without increasing the raw throughput ( e.g. fast carry chains which allow a full 64-bit wide add to complete within one cycle time). This latency reduction may be important to reducing the serial path length in tasks with limited parallelism, but is not reflected in this raw capacity comparison.

The FPGA can sustain its peak performance level as long as the same 2K gate-evaluation functionality is desired from cycle to cycle. Wiring and pipelining limitations are the primary reason the FPGA would achieve lower than peak performance, and this is likely to account for, at most, a 20-50% reduction from peak performance. If more diverse functionality is desired from a single FPGA than the 1-2K gate-evaluations provided by the FPGA, performance drops considerably due to function reload time.

The processor is likely to provide a much lower peak performance and the effect is much more application specific. Due to the bitwise-SIMD nature of traditional ALUs, work per cycle can be as low as a couple of gate-evaluations on compute operations. Since processors perform all ``interconnect'' using shifts, moves, loads, and stores, many cycles yield no gate-evaluations, only movement of data. The lower peak performance of processors comes from the fact that the processor ALU occupies only a small fraction of the die, with substantial area going to instruction flow control and on-chip memory to support large sequences of diverse operations without requiring off-chip instruction or data access.

A comparably sized, dedicated piece of hardwired functionality, with no memory could provide a capacity of 200,000-300,000 4-LUT gate-evaluations per clock cycle, at potentially higher clock rates. While the raw gate delay on the hardwired logic can be 10 smaller than on the FPGA, reasonable cycle times in equivalent logic processes are closer to 2 since it makes sense to pipelined the FPGA design at a more shallow logic depth than the custom logic. Returning to the multiplier, for example, such a chip might provide 64K multiply bit operations per cycle ( e.g. a 256256 multiply pipelined at the byte level). The dedicated hardware provides 100-300 times the capacity of the FPGA on the one task it was designed to solve. To first order, the dedicated hardware can deliver very little capacity to significantly different applications. It is also worthwhile to note that the fixed granularity of hardwired devices often causes them to sacrifice much of their capacity advantage when used on small data items. For instance, performing an 88 multiply on a 6464 hardwired multiplier makes use of only 'th of the multiplier's capacity, removing much of its 300 capacity advantage.

Combining these observations, we can categorize the circumstances under which the various structures are prefered.

Fixed Function, Limited Operation Diversity, High Throughput -- When the function and data granularity to be computed are well-understood and fixed, and when the function can be economically implemented in space, dedicated hardware provides the most computational capacity per unit area to the application.
Variable Function, Low Diversity -- If the function required is unknown or varying, but the instruction or data diversity is low, the task can be mapped directly to a reconfigurable computing device and efficiently extract high computational density.
Space Limited, High Entropy -- If we are limited spatially and the function to be computed has a high operation and data diversity, we are forced to reuse limited active space heavily and accept limited instruction and data bandwidth. In this regime, conventional processor organization are most effective since they dedicate considerable space to on-chip instruction storage in order to minimize off-chip instruction traffic while executing descriptively complex tasks.

Reconfigurable devices have become increasingly interesting as aggregate IC capacity has grown large enough to adequately hold the computational diversity of many computing tasks or, at least, the key kernels of these tasks. As the area available for general-purpose computing devices increases, more tasks will fit conveniently on reconfigurable devices, increasing the range of applications where the reconfigurable solution yields higher performance per unit area.

In Chapter we define our evaluation and comparison metrics more carefully. Chapters and provide an empirical review of conventional general-purpose and specialized architectures, focusing on their performance density.

Key Relations

While reconfigurable devices have, potentially, 100 less performance per unit area than hardwired circuitry, they provide 10-100 the performance density of processors. As noted above, FPGAs offer a potential 10 advantage in raw, peak, general-purpose functional density over processors. This density advantage comes largely from dedicating significantly less instruction memory and distribution resources per active computing element. At the same time this lower memory ratio allows reconfigurable devices to deploy active capacity at a finer grained level, allowing them to realize a higher yield of their raw capacity, sometimes as much as 10, than conventional processors. It is these two effects taken together which give reconfigurable architectures their 10-100 performance density advantage over conventional processor architectures in many situations.

From an empirical review of conventional, reconfigurable devices, we see that 80-90% of the area is dedicated to the switches and wires making up the reconfigurable interconnect. Most of the remaining area goes into configuration memory for the network. The actually logic function only accounts for a few percent of the area in a reconfigurable device. This interconnect and configuration overhead is responsible for the 100 density disadvantage which reconfigurable devices suffer relative to hardwired logic.

To a first order approximation, this gives us:

It is this basic relationship (Shown diagrammatically in Figure ) which characterizes the RP design space.

Since , devices with a single on-chip configuration, such as most reconfigurable devices, can afford to exert fine-grained control over their operations -- any savings associated with sharing configuration bits would be small compared to the network area.
Since , to pack the most functional diversity into a part, one can allocate multiple configurations on chip. With the order-of-magnitude relative sizes given in Relation , up to a 10 increase in the functional diversity per unit area is attainable in this manner.
However, since is only , if the number of configurations is large, say 100 or more, the configuration memory area will become the dominant size factor. Processors are essentially optimized into this regime and that partially accounts for their 10 lower raw performance density compared to reconfigurable devices.
Once we go to a large number of contexts, such that the total configuration memory space begins to dominate interconnect area, fine-granularity becomes costly. In this regime wide-word operations allow us to amortize instruction area across multiple bit processing elements. This simultaneously allows machines with wide ( e.g. 32 bit) datapaths to hold 1000's of configurations on chip while making them only 10 less computationally dense than fine-grained, single context devices.

After reviewing implementations in Chapter , Chapters and examine interconnect and instruction delivery issues in depth. Chapter brings these together, yielding a slightly more sophisticated model than the one above to explain the primary tradeoffs in the design of reconfigurable computing architectures.

New General-Purpose Architectures

From the general relationships above, we see that conventional conventional Field Programmable Gate Arrays (FPGAs) represent one extreme in our RP-space. The space is large, leaving considerable space for interesting architectures in middle. Exploiting the relative area properties identified above and common device usage scenarios, we have developed three new general-purpose computing architectures. By judicious allocation of device resources, these architectures offer higher yielded capacity over a wide range of applications.

DPGA

The Dynamically Programmable Gate Array (DPGA) is a multicontext FPGA, formed by associating memory for several configurations with each active LUT and interconnect switch (See Figure

). From Relation

, we see that the area associated with context memory is small and can be replicated several times without substantially impacting active device capacity. The multicontext design allows the device to reuse its active capacity to provide additional functionality rather than additional throughput. For the operations required by an application which are not the throughput bottleneck, the multicontext device yields higher device capacity than conventional FPGA architectures. Chapter

describes our 4-context DPGA design and implementation, identifies several common usage scenarios, and details experimental mapping techniques for circuits and finite-state machines. Chapter

extends the basic DPGA model and circuit mapping tools to include input retiming registers. The resulting architecture achieves 3

the density of conventional FPGAs without sacrificing performance on typical applications.

TSFPGA

A careful review of the DPGA implementation and Relation

, reminds us that the active logic portion of a reconfigurable design comprises only a small fraction of the space while the programmable network is the key area consumer. The Time-Switched FPGA (TSFPGA) focuses on reuse of the critical switch and wire resources (See Figure

). By pipelining the switching operations TSFPGA allows us to extract higher capacity from the available switches and wires. At the same time, the switched interconnect allows each individual switching element to play a number of different roles. Consequently, TSFPGA compresses switching requirements, providing more effective switching capacity with less physical interconnect. The greater yielded switching capacity allows physical design mapping to occur rapidly -- in seconds rather than the hours typical of conventional FPGA architectures. Chapter

details the TSFPGA design, implementation, and experimental mapping software.

MATRIX

All prior general-purpose computing architectures, including processors, FPGAs, and the two previous architectures make a rigid distinction between instruction and control resources which manage computation and the computing resources which perform computations for an application. Consequently, one must make a fabrication time decision about the device's control structure and the deployment of resources for control. We see in Chapters

and

that this decision has a large impact on the distribution of dedicated instruction resources in the design and the range of applications where the device is efficiently employed. MATRIX is a novel, coarse-grained, computing architecture which uses a multilevel configuration scheme to defer this binding to the application (See Figure

). Our focus implementation uses an 8-bit primitive datapath element for the basic functional unit. Rather than separate the resources for instruction storage and distribution from the resources for data storage and computation and fix them at fabrication time, the MATRIX architecture unifies these resources. Once unified, traditional instruction and control resources are decomposed along with computing resources and can be deployed in an application-specific manner. Chip capacity can be deployed to support active computation or to control reuse of computational resources depending on the needs of the application and the available hardware resources. As a result, MATRIX can be efficiently employed across a broader range of computational characteristics than conventional architectures. Chapter

introduces the MATRIX architecture and shows how it obtains these unique characteristics.

Prognosis for the Future

Ultimately, reconfiguration is a technique for compressing the resources dedicated to instruction stream distribution while maintaining a general-purpose architecture. As such, it is an important architectural tool for extracting the highest performance from our silicon real estate. Characteristics of an application which change slowly or do not change can be configured rather than broadcast. The savings in instruction control resources result in higher logic capacity per unit area.

With CMOS VLSI we have reached to the point where we are no longer so limited by the aggregate capacity of a single IC die that the device must be optimized exclusively to maximize the number of distinct instructions resident on a chip. Beyond this point spatial implementation of all or portions of general-purpose computations is both feasible and beneficial. From this point on we will see:

More applications and kernels fit for spatial implementations on reconfigurable substrates
Reconfigurable techniques find their way into general-purpose and flexible computing devices, changing the way we design even ``nominally'' conventional architectures

Reconfigurable architectures and techniques should be added to the modern computer architect's repertoire of design techniques, alongside more venerable ones such as microprogramming, translation, and caching.

The thesis closes in Part by reviewing the key lessons from reconfigurable designs and their implications for future general-purpose architectures.

André DeHon <andre@mit.edu> Reinventing Computing MIT AI Lab