Reconfigurable Computing Background

This chapter briefly reviews reconfigurable computing including:

Modern successes
Intellectual lineage
Technology trends which determine the circumstances when reconfigurable architectures are viable and advantageous

Successes of Reconfigurable Computing

FPGAs first became available in the mid-1980s (e.g. [CDF+86]). In the late 1980s and early 1990s we began to see reconfigurable computing engines enabled by these new devices. In this section we highlight the early reconfigurable computing ``successes.''

Programmable Active Memories

DEC PRL's Programmable Active Memory (PAM) was one of the earliest platforms for reconfigurable computing. PAM is an array of Xilinx 3K components connected to a host workstation [BRV89]. The Perle-1 board contained 23 XC3090's -- roughly 15,000 4-LUTs. Using this board as an accelerator, DEC PRL was able to speed up many applications by an order of magnitude and, in some cases, provide performance in excess of conventional supercomputers or custom VLSI implementations. Highlights from [BRV92]:

Large number multiply 16× faster than Cray-II
600kbit/s, 512-bit RSA decoding -- fastest implementation in existence at time of development -- 10× best software implementation on DEC Alpha
String matching within a factor of two of custom implementation requiring 28 VLSI ICs
Convolution and 3-D geometry at 200-300 MIPs
Laplace equation at 25 GIPs
DCT at 15 GIPs

The total silicon in the Perle-1 board was comparable to the total silicon in the host workstation -- but the combination ran these applications and others 10× faster than the workstation alone. The difference is that almost all of the silicon on the Perle-1 board was general-purpose and capable of being deployed to the problem at hand.

Splash

SRC's Splash is a systolic array composed of 32 Xilinx XC3090's (20K 4-LUTs). On DNA sequence matching, Splash achieved over 300× the performance of a Cray-II or over 200× the performance of a 16K-processor CM-2 [GHK+91].

PRISM

Brown's PRISM architecture coupled a single Xilinx XC3090 (640 4-LUTs) with a Motorola 68010 node processor. The coupled FPGA could compute fine-grained, bitwise functions (e.g. Hamming distance, bit reversal, ECC, logic evaluations, find first one) 20× faster than the 68010 host microprocessor [AS93].
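
To make this class of operation concrete, the following is a minimal software sketch of three of the bitwise functions named above. It is illustrative only and is not taken from [AS93]; the function names, the explicit width parameter, and the choice to index the first one from the least-significant bit are all assumptions.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bit_reverse(x: int, width: int) -> int:
    """Reverse the low `width` bits of x (width is a hypothetical parameter)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (x & 1)  # shift in the current low bit of x
        x >>= 1
    return result


def find_first_one(x: int) -> int:
    """Index of the least-significant set bit, or -1 if x is zero.

    Assumption: "first" means lowest-order; the source does not say.
    """
    return (x & -x).bit_length() - 1


print(hamming_distance(0b1011, 0b0010))
print(bin(bit_reverse(0b1101, 4)))
print(find_first_one(0b10100))
```

On a sequential processor, bit reversal costs a loop iteration per bit as written here, whereas in configurable logic it reduces to a fixed permutation of wires. Functions of this shape are where a coupled FPGA has the most room to outperform its host.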

Logic Emulation

Perhaps the most commercially significant application of ``reconfigurable logic'' to date has been in the business of logic emulation. One of the earliest FPGA-based logic emulators was the Realizer [VBB93], which was a precursor to Quickturn Systems' Enterprise Emulation System. The Realizer, with 42 XC3090's (27K 4-LUTs) and 160 XC2018's serving exclusively for interconnect, was able to emulate 10K-gate designs at a rate of several million clock cycles per second.
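
The 4-LUT capacities quoted for these platforms are consistent with one another. As a quick check, the sketch below multiplies each platform's XC3090 count by the 640 4-LUTs per XC3090 quoted for PRISM. The dictionary and function names are hypothetical; the chip counts are the ones given in this section, and the Realizer's XC2018 interconnect parts are left out because the text counts only the XC3090's as logic.

```python
# 4-LUTs per XC3090, as quoted in the PRISM description in this section.
LUTS_PER_XC3090 = 640

# Number of XC3090 parts used for logic in each platform (from the text).
XC3090_COUNT = {
    "Perle-1": 23,
    "Splash": 32,
    "PRISM": 1,
    "Realizer": 42,
}


def total_4luts(num_xc3090: int) -> int:
    """Total 4-LUTs provided by the given number of XC3090 parts."""
    return num_xc3090 * LUTS_PER_XC3090


for name, chips in XC3090_COUNT.items():
    print(f"{name}: {chips} x XC3090 = {total_4luts(chips):,} 4-LUTs")
```

The products (14,720, 20,480, 640, and 26,880) round to the 15,000, 20K, 640, and 27K figures quoted above.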

Lineage

While reconfigurable architectures have only recently begun to show significant application viability, the basic ideas have been around almost as long as the idea of programmable general-purpose computing.

John von Neumann, who is generally credited with developing our conventional model for serial, programmable computing, also envisioned spatial computing automata -- a grid of simple, cellular, building blocks which could be configured to perform computational tasks [vN66].

As computing implementation technology improved from vacuum tubes to diodes and transistors to integrated circuits, research continued into cellular computation. In [Min67] Minnick reviewed the state of the art in microcellular computational arrays, suggesting a role for ``programmable arrays.'' Minnick's own cutpoint cellular array in 1964 housed 48 cells, each less powerful than a 2-LUT, in a 6×8 cellular array with only right and down nearest-neighbor connections, all in the space of a suitcase. In 1971, Minnick reported a programmable cellular array which used flip-flops to hold the configuration context which customized the array [Min71].

Jump and Fritsche detail the workings of a programmable cellular array [JF72] without describing a possible technology realization.

Schaffner developed one of the earliest ``general-purpose,'' ``programmable hardware'' machines in 1969 [Sch78][Sch71]. Schaffner's machine used ALUs with reconfigurable interconnect as its reconfigurable building blocks, and included facilities to swap in ``hardware'' pages. The machine was employed primarily for real-time signal processing for radar and weather.

The early eighties saw considerable interest in systolic computing architectures [Kun82]. While much of the research was concerned with deriving hardwired, application-specific arrays, this research also spawned the development of programmable systolic components (e.g. [FKM83] [HS84]). These components were some of the first ``reconfigurable computing'' devices built in VLSI. Owing to the application focus and the silicon real estate available at the time, the programmable systolic building blocks were more coarse-grained than the cellular arrays or FPGAs, placing a single 8-bit ALU per chip and relying predominantly on large, multichip or wafer-scale arrays to build up significant spatial computations.

The most direct descendant of the programmable cellular array research is the Configurable Array Logic (CAL) IC from Tom Kean and Algotronix [Alg90][GK89][Kea89]. CAL used a minimal 2-LUT for the basic cellular element and mostly nearest-neighbor connections for interconnect. This gives it a much finer grain than the contemporary FPGAs from Xilinx, which use 4-LUTs and richer interconnect.
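
The difference in grain between a 2-LUT and a 4-LUT can be sketched in software by modeling a k-input LUT as a truth table with 2^k configuration bits. This is an illustrative model only, not a description of the CAL or Xilinx cell circuitry; `make_lut`, `config_bits`, and `num_functions` are hypothetical names, and treating the first input as the most-significant index bit is an assumed convention.

```python
from typing import Callable, Sequence


def config_bits(k: int) -> int:
    """Configuration bits needed to specify one k-input LUT."""
    return 2 ** k


def num_functions(k: int) -> int:
    """Number of distinct k-input boolean functions a k-LUT can realize."""
    return 2 ** config_bits(k)


def make_lut(k: int, truth_table: Sequence[int]) -> Callable[..., int]:
    """Build a k-input LUT from its truth table (one entry per input pattern)."""
    if len(truth_table) != config_bits(k):
        raise ValueError(f"a {k}-LUT needs {config_bits(k)} table entries")

    def lut(*inputs: int) -> int:
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        index = 0
        for bit in inputs:  # first input is the most-significant index bit
            index = (index << 1) | (bit & 1)
        return truth_table[index]

    return lut


xor2 = make_lut(2, [0, 1, 1, 0])  # a 2-LUT configured as XOR
print(xor2(1, 0), xor2(1, 1))
print(config_bits(2), config_bits(4))
```

Under this model a 2-LUT takes 4 configuration bits and can realize any of 16 functions, while a 4-LUT takes 16 bits and can realize any of 65,536. A finer-grained cell therefore does less per element and leans more heavily on interconnect to compose larger functions.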

Technological Enablers

The basic idea of configurable array computation has been around as long as the idea of central-processor, stored-program execution. So why have programmable processors become the mainstream of general-purpose processing while ``reconfigurable computing'' is only now emerging as a competitive, general-purpose computing technology?

The answer lies with technology costs and application requirements. Active computing resources have been at a premium since the days of the vacuum tube. It took thousands of tubes to build a general-purpose computer -- making it infeasible to implement large, spatial computations. With the advent of core memory, memory became moderately dense compared to computing elements. To implement large, complex computational tasks, it was more efficient to store large programs densely in memory and reuse a small amount of fixed logic.

The beginning of the MOS VLSI era reinforced these costs. Dense memories could be implemented on silicon ICs. Because of high off-chip I/O costs, the critical unit became the amount of logic or computation which could be placed on a single IC. The driving force has been to localize computation to one or a small number of ICs to reduce costs and interchip communications. The microprocessor was made successful by minimizing the amount of compute logic to the point where it would fit onto a single IC. The critical turning point in processor development came when it became possible to put a competent processor on a single IC. The RISC structure became so successful because it enabled early integration of such capable processors. Once single-chip processors became possible, they rapidly rose to dominate multichip implementations. While silicon area was at a premium, exploiting the higher density of memories to store programs and reuse the limited space on the processor die was necessary. Today, we still see some premium to fitting the kernel task descriptions and their data into the limited memory available on the processor die.

The turning point for configurable hardware came when it was possible to place hundreds of programmable elements on a single IC. At that point it became possible to realize regular computations in space, dedicating each active computing element to a single task. Reconfigurable computing began to take off once we could put 500-1,000 such programmable elements on a single IC. Today we look at thousands of such elements per IC, and that number continues to increase with the silicon capacity. At thousands to tens of thousands of programmable elements, tight application kernels can be spatially configured on one or a few configurable ICs without the need to share active resources. This, in effect, caches the kernel not just in on-chip memory for use by a limited number of active processing elements, but right with the active processing elements such that a large number may operate simultaneously.

There will always be some premium for dense task representation to handle the most complicated tasks. However, as the available silicon real estate becomes larger, the premium for dense task packing subsides, making it more and more beneficial to increase the on-chip silicon available for active processing and remove the on-chip bottleneck between memory and processing elements. This transition moves us to reconfigurable architectures.

André DeHon <andre@mit.edu> Reinventing Computing MIT AI Lab