Here we review various general-purpose computing architectures by taking an empirical look at their implementations during the past decade. In this section we draw from the whole realm of general-purpose architectures -- not just those which fit directly into our RP-space. This makes a larger set of design points available for review, but also introduces considerably more variation in architectures than we will focus on in later chapters. We look primarily at general-purpose capacity in this section, generally ignoring the effects of specialized functional units. The following chapter will look at the effects of custom multipliers, the most common specialized functional unit added to nominally general-purpose computing devices. The focus here is on integrated, single-IC computational building blocks to keep the comparison as consistent as possible across such a wide variety of architectures. Additionally, we focus entirely on MOS VLSI implementations since most of these architectures have had multiple MOS VLSI implementations and the effects of MOS feature-size device scaling are moderately well understood.
We start by looking at a simple RISC-style processor.
The pedagogical processor model (Figure ) is composed of:
We extract a maximum of gate evaluations ( ALU bit operations) per cycle. Modern processors are achieving cycle times as low as 2-5ns. The fastest, single-ALU processors today thus offer a peak capacity around 84 gate-evaluations/ns. Table compares several processor implementations over the past decade. Results are summarized there in terms of ALU bit ops since that is the native, and hence most accurate, unit for processors. From Table , we see that conventional processors have provided a peak functional density of 3-9 ALU bit operations/ over the past decade. We see from Table and some simple weightings below that an ALU bit op is somewhere between one half and two 3-LUT gate evaluations.
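The peak-capacity arithmetic above can be sketched as a small calculation. The specific width, cycle time, and ALU count below are illustrative assumptions for the sketch, not figures taken from the tables:

```python
# Peak functional capacity of a simple processor, per the model above.
# Illustrative assumed parameters (not from the tables):
ALU_WIDTH = 64   # bits per ALU operation
CYCLE_NS = 2.0   # cycle time in ns (modern parts: 2-5ns)
N_ALUS = 1       # single-ALU processor

def peak_alu_bit_ops_per_ns(width, cycle_ns, n_alus):
    """Peak ALU bit operations per nanosecond: every ALU completes
    one full-width operation every cycle."""
    return n_alus * width / cycle_ns

print(peak_alu_bit_ops_per_ns(ALU_WIDTH, CYCLE_NS, N_ALUS))  # 32.0 bit ops/ns
```

Dividing such a peak rate by die area gives the functional densities tabulated in the text.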
It is interesting, and perhaps a bit unexpected, to note how consistent this capacity density has been over time. We might have expected:
This peak computational density assumes that every operation on each -bit ALU performs an -bit compute operation and the processor completes one instruction per ALU on every cycle. In practice, a significant number of processor cycles are not spent executing compute operations.
For example, while HaL's SPARC64 should be able to issue 4 instructions per cycle, in practice it only issues 1.2 instructions per cycle on common workloads [EG95]. Thus only 30% of the peak issue bandwidth is used, resulting in a 70% reduction from expected peak capacity.
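The yield figure for the SPARC64 example is a direct arithmetic check of the numbers quoted above:

```python
# Fraction of peak issue bandwidth actually used by HaL's SPARC64 [EG95].
peak_issue = 4.0    # instructions per cycle the processor can issue
actual_issue = 1.2  # instructions per cycle observed on common workloads

utilization = actual_issue / peak_issue  # 0.30 of peak issue bandwidth
reduction = 1.0 - utilization            # 0.70 -> a 70% reduction from peak
print(utilization, reduction)
```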
Assuming the integer DLX instruction mixes given in Appendix C of [HP90] are typical, we can calculate by weighting the instructions by their provided capacities from Table . In Table we see that one ALU bit op in these applications is roughly 0.6 gate-evaluations.
If this effect is jointly typical with the instructions per issue slot number above, then we would yield at most of the theoretical, functional density. For the HaL case, this reduces to .
There are, of course, several additional aspects which prevent almost any application from achieving even this expected peak capacity and which cause many applications to fall far short of it.
In 10 operations (See Figure ), the processor can perform the 32b XOR required for the parity calculation -- 32 2-input gate evaluations or 11 4-input gate evaluations. Again, assuming a MIPS-X like processor and 1 CPI, we yield:
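A word-wide parity calculation of the kind described can be sketched as a shift-and-XOR reduction. This is a generic illustration of the computation, not the MIPS-X instruction sequence from the figure:

```python
def parity32(x):
    """Parity of a 32-bit word via successive word-wide XOR folds.

    Each fold halves the number of meaningful bits, so five
    shift-and-XOR steps reduce 32 bits to a single parity bit."""
    x ^= x >> 16
    x ^= x >> 8
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return x & 1

print(parity32(0b1011))  # -> 1 (three set bits)
```

Note that although the processor issues full 32-bit XOR operations, only a shrinking fraction of each result is useful -- one source of the gap between peak and yielded capacity.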
Very Long Instruction Word (VLIW) machines are processors with multiple, parallel functional units which are exposed at the architectural level. A single, wide, instruction word controls the function of each functional unit on a cycle-by-cycle basis. Pedagogically, a VLIW processor looks like a processor with multiple, independent functional units. At this level, the VLIW processor does not look characteristically different from the modern superscalar processors included at the end of the processor table.
Table summarizes the characteristics of two VLIW processors. With only two datapoints it is not possible to assess general trends. These examples seem to have about 2× the peak capacity of processors. To the extent this may be characteristic of VLIW designs, it may arise from the fact that the separate functional units share instruction control and management circuitry more than in superscalar processors.
VLIW processors may fail to achieve their peak for the same reasons as processors. In addition, they may suffer from:
Digital Signal Processors (DSPs) are essentially specialized microprocessors which:
Table reviews several DSP implementations. On non-multiply operations, the peak performance is generally lower than the processors. For the kinds of operations typical of DSPs, these processors will generally yield much closer to their peak capacity than processors.
Most general-purpose devices use memories to store instructions and data. A memory can also be used directly to implement complex computational functions. For complicated functions, a memory lookup can often provide high, programmable computational capacity. For just a few examples see [Lev77][RS92][HT95][Has87].
We characterize a memory by:
We can use a counting argument to determine how complicated the functions can get. We start by observing that an n-input, one-output lookup table can implement 2^(2^n) different functions. We then consider how many gates it requires to implement any of these functions. Each gate can be any function of four inputs, so each gate can implement 2^(2^4) = 2^16 functions. A collection of N gates can thus implement at most 2^(16N) functions (less due to overcounting). In order to implement any of the functions provided by the table, we need at least: 2^(16N) >= 2^(2^n), i.e. N >= 2^n/16 = 2^(n-4) gates.
Conversely, by construction, we can show that any function computed by the n-input lookup table can be computed with 2^(n-3) - 1 gate evaluations. As suggested in Figure , we can use 2^(n-4) gates to select the correct functional value based on the low four bits of the address. We then build a binary mux reduction tree to select the final output based on the remaining n - 4 address bits. This tree requires 2^(n-4) - 1 muxes. Together, the 2^(n-4) + (2^(n-4) - 1) = 2^(n-3) - 1 gates compute any function computable by the n-input lookup table.
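The construction can be checked exhaustively for a small case. The sketch below is an illustrative verification (not from the text): it builds an arbitrary 5-input function from two 4-input lookup tables plus one 2:1 mux on the high address bit, and confirms it matches direct table lookup on all 32 inputs:

```python
def lut_eval(table, addr):
    """Evaluate a truth table (packed LSB-first into an int) at addr."""
    return (table >> addr) & 1

def five_input_via_4luts(truth32, a):
    """Compute an arbitrary 5-input function from its 32-bit truth table
    using two 4-LUTs selected by a 2:1 mux on address bit 4."""
    low_lut = truth32 & 0xFFFF           # entries where bit 4 of addr = 0
    high_lut = (truth32 >> 16) & 0xFFFF  # entries where bit 4 of addr = 1
    lo = lut_eval(low_lut, a & 0xF)
    hi = lut_eval(high_lut, a & 0xF)
    return hi if (a >> 4) & 1 else lo    # the mux stage

# Exhaustive check for a sample (arbitrary) truth table:
truth = 0x9E3779B9
assert all(five_input_via_4luts(truth, a) == lut_eval(truth, a)
           for a in range(32))
print("construction verified for all 32 inputs")
```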
An n-input by one-output table lookup can thus provide between 2^(n-4) and 2^(n-3) - 1 gate evaluations per cycle for the most complicated n-input functions. Since the bounds are essentially a factor of two apart, we can approximate the peak as 2^(n-3) gate evaluations per cycle. If the table is w bits wide, the table provides at most w times as many gate evaluations. Putting all this together, we get a peak of roughly w * 2^(n-3) gate evaluations per cycle.
Tables , and review memory implementations, showing the peak functional density for each memory array. For the most complex functions, memories provide the highest capacity of any general-purpose architecture. For less complex operations, however, memories are inefficient, yielding very little of their potential capacity.
For example, an 8-bit add operation with carry output requires 16 gate evaluations. Performed in a memory, such as a 9-bit version of the 64K×18 memory from [SMK +94], this provides only . The inefficiency of the memory-based adder increases with operand size since the number of gate evaluations in an n-bit add increases as O(n) whereas the memory area increases as O(2^(2n)).
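The scaling argument can be made concrete with a short calculation. The gate-evaluation count below uses the text's figure of roughly two gate evaluations per result bit (16 for the 8-bit add with carry):

```python
def add_gate_evals(n):
    """Gate evaluations for an n-bit add with carry out (~2 per bit)."""
    return 2 * n

def add_table_entries(n):
    """Entries needed to tabulate an n-bit add: two n-bit operands
    give 2n address bits, hence 2^(2n) entries."""
    return 2 ** (2 * n)

# 8-bit add: 16 gate evaluations vs a 64K-entry table.
print(add_gate_evals(8), add_table_entries(8))    # 16 65536
# 16-bit add: the gate count merely doubles; the table explodes.
print(add_gate_evals(16), add_table_entries(16))  # 32 4294967296
```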
For all the memories listed, the capacity is based on continuous cycles of random access. In particular, nibble, fast page, or synchronous access in DRAMs is not exploited. For example, [TNK +94] achieves 13,500 on random access. In sequential access mode, the part can output 18 bits every 8ns. For large sequential access, this means an effective cycle time of 8ns instead of the 48ns quoted -- a factor of six improvement in cycle time and capacity. Used in this mode, the peak performance is 81,000 .
It is also worth noting that, unlike processors, the capacity of memories has increased over time. This is likely due to:
Modern processors actually dedicate a significant portion of their area to memory. Table summarizes the peak capacity the processor can extract by using table lookups in its D-cache. The area used in calculating this capacity is the entire processor for the processors listed in Table . This peak capacity can be thought of as the peak capacity one could extract from each load operation when using the on-chip D-cache for table lookup operations.
Field-Programmable Gate Arrays (FPGAs) are composed of a collection of programmable gates embedded in a programmable interconnect. Programmable gates are often implemented using small lookup tables. The small lookup tables with programmable interconnect allow one to take advantage of the structure inherent in many computations to reduce the amount of memory and space required to implement a function versus the full memory arrays of the previous section. Ultimately, this allows FPGA space required for an application to scale with the complexity of the application rather than scaling exponentially in the manner of pure memories.
For pedagogical purposes, we consider an FPGA composed of:
Running at full capacity and minimum operating cycle, the FPGA provides gate evaluations per cycle. Modern FPGAs can hold on the order of 2000 4-LUTs and run at cycle times on the order of 5-10ns. Table computes the normalized capacity provided by a few representative FPGAs. From these numbers we see that FPGAs provide a peak capacity on the order of 200-300 .
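The FPGA peak rate follows directly from the LUT count and cycle time quoted above (normalizing by die area then gives the tabulated functional density):

```python
def fpga_peak_lut_evals_per_ns(n_luts, cycle_ns):
    """Peak 4-LUT gate evaluations per ns for a fully utilized FPGA:
    every LUT produces one evaluation per cycle."""
    return n_luts / cycle_ns

# ~2000 4-LUTs at 5-10ns cycle times:
print(fpga_peak_lut_evals_per_ns(2000, 5.0))   # 400.0 at the fast end
print(fpga_peak_lut_evals_per_ns(2000, 10.0))  # 200.0 at the slow end
```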
FPGA capacity has not changed dramatically over time, but the sample size is small. There is a slight upward trend which is probably representative of the relative youth of the architecture.
This peak, too, is not achievable for every application. Some effects which may prevent an application from achieving this peak include:
As one example of pipelining, i/o, and functionality limitations, DEC's Programmable Active Memories ran from 15-33MHz for several applications [BRV92]. At these rates, the peak functional density extracted from the XC3090s employed was 13-26 , only about 10-20% of the potential functional density.
Single-Instruction, Multiple-Data (SIMD) machines are composed of a number of processing elements performing identical operations on different data items. Vector processors perform identical operations on a linear ensemble of data items. At a pedagogical level vector processors are essentially SIMD processors, though in practice the two architectures have traditionally been optimized for different usage scenarios.
For pedagogical purposes, we consider a SIMD/Vector array composed of:
The SIMD/Vector array provides a peak of ALU bit operations per cycle or gate-evaluations per cycle. Abacus, a modern, fine-grained SIMD array, supports 1000 1-bit PEs and can operate at 125MHz. Abacus thus provides 660 . Table computes the normalized capacity provided by several SIMD arrays of varying granularity, and Table shows the composition of a modern vector microprocessor.
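The Abacus peak rate can be sketched the same way, from PE count, PE width, and clock rate (normalizing by die area, which is not quoted here, gives the tabulated functional density):

```python
def simd_peak_bit_ops_per_ns(n_pes, pe_width_bits, clock_mhz):
    """Peak ALU bit operations per ns for a SIMD array: every PE
    completes one full-width operation per cycle."""
    return n_pes * pe_width_bits * clock_mhz * 1e-3  # MHz -> ops per ns

# Abacus: ~1000 1-bit PEs at 125MHz.
print(simd_peak_bit_ops_per_ns(1000, 1, 125))  # 125.0 bit ops/ns
```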
SIMD/Vector arrays only achieve their peak capacity when every PE/VU is computing a useful logic operation on every cycle. Limitations to achieving this peak include:
Flynn [Fly72] summarizes some of the limitations associated with SIMD processing.
Multimedia processors are a recent hybrid of microprocessors, DSPs, and Vector/SIMD processors. Aimed at processing video, graphics, and sound, these processors support efficient operation on data of various grain sizes by segmenting their wide-word ALUs to provide SIMD parallel operation on the bytes within the word. This segmentation combats the increasing inefficiency associated with processing small data values on wide-word processors.
From Table , we see the CMOS multimedia processors have the same peak functional density as processors. The major difference is that the segmentation allows these processors to operate on 16-bit and byte-wide data without discarding a factor of 4-8 in performance. Of course, this is true only as long as these finer-grained operations can be performed efficiently in a SIMD manner.
The BiCMOS multimedia processor promised by MicroUnity would have a significantly higher performance density by exploiting a novel process. The comparison between their architecture in CMOS and BiCMOS makes it clear that this functional density advantage comes primarily from the process and not from the architecture.
Like FPGAs, multicontext FPGAs are composed of a collection of programmable gates embedded in a programmable interconnect. Unlike FPGAs, multicontext devices store several configurations for the logic and the interconnect on the chip. The additional area for the extra contexts decreases functional density, but it increases functional diversity by allowing each LUT element to perform several different functions.
Table summarizes the capacities of some experimental, multiple-context FPGAs. Like FPGAs, these devices may suffer from limited interconnect or application pipelining limits. The additional context memory makes them less susceptible to functionality limits than traditional components. Chapter details the usage of multicontext devices including their relative capacity yield compared to single context devices.
Contemporary MIMD processors have largely been built from collections of microprocessors. As such, the functional density of these multiprocessors is certainly no larger than that of the microprocessors employed for the compute nodes. Since these machines typically require additional components for routing between processors and to connect processors into the routing network, the average functional density is actually much lower.
Table samples a few processors which were designed explicitly for multiprocessor implementation. These processors integrate the basic network interface and, in some cases, a portion of the routing network, onto the device. While the sample size is too small to draw any strong conclusions, the highest capacity implementations show only about half the functional density of the microprocessors we reviewed in Section .
Reconfigurable ALUs are composed of a collection of coarse-grain ALUs embedded in a programmable interconnect. Their word orientation and limitation to ALU operations distinguishes them from FPGAs.
For pedagogical purposes, a reconfigurable ALU contains:
Running at full capacity and minimum operating cycle, the reconfigurable ALU provides ALU bit operations per cycle. Experimental reconfigurable ALUs achieve roughly 50 ALU bit operations/.
Like a processor D-cache, the memory on MATRIX can be used as a large lookup table. Using the MATRIX 256×8 memory for function lookup, MATRIX can achieve up to 440 4-LUT gate-evaluations/.
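Applying the memory-lookup bound from the memory section -- a peak of roughly w * 2^(n-3) gate evaluations per lookup, from the mux-tree argument -- to a 256×8 memory gives its per-access capacity:

```python
def memory_peak_gate_evals_per_lookup(n_addr_bits, width_bits):
    """Approximate peak gate evaluations per lookup for an n-input,
    w-bit-wide memory: w * 2^(n-3)."""
    return width_bits * 2 ** (n_addr_bits - 3)

# A 256x8 memory: 8 address bits, 8 bits wide.
print(memory_peak_gate_evals_per_lookup(8, 8))  # 256 gate evaluations/lookup
```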
Like processors, reconfigurable ALUs may suffer lower yield due to:
Table summarizes the observed computational densities for the general-purpose architecture classes reviewed in this section.
Memories provide the highest programmable capacity of any of the devices reviewed. However, they only yield this capacity on the most complex functions -- those whose complexity is, in fact, exponential in the number of input bits. The capacity they provide is not robust in the face of less complex tasks.
Reconfigurable devices provide the highest general-purpose capacity which can be deployed to application needs. Unlike memories, their capacity consumption scales along with problem complexity. Their peak performance is 10× that of all non-reconfigurable architectures, with the exception of large, well engineered SIMD arrays. Fine-grained devices, such as FPGAs, are robust to grain-size variation, as well. Reconfigurable architectures are not, however, robust to tasks with functional diversity larger than the aggregate device capacity. Multicontext devices, such as the DPGA, sacrifice a portion of the peak FPGA capacity density to partially mitigate this problem -- providing support for much higher on-chip functional diversity.
Large SIMD or vector arrays have high peak performance because they amortize a single stream of instruction control, bandwidth, and memory among a large number of active computing elements. They handle high diversity with the ability to issue a new instruction on each cycle. However, they require very large granularity operations in order to efficiently use the computational resources in the array.
Processors are robust to high functional diversity, but achieve this robustness at a large cost in available capacity -- 10× below reconfigurable devices. They also give up fine-grain control of operations, creating a potential for another 10× loss in performance when irregular, fine-grained operations are required. Vector and VLIW structures provide slightly higher capacity density for very stylized usage patterns, but are less robust to tasks which deviate from their stylized control paradigm.
Here we see distinctions in granularity, operation diversity, and yieldable capacity. The key issue we used to classify architectures was the way the devices store and distribute instructions to processing elements. Characterizing instruction and interconnect issues with a focus on RP-space is the goal of Part .